**Introduction**

Road accidents are never a happy topic to discuss. They not only have severe consequences for those involved, they also affect the lives of many others, such as friends and family. With more vehicles on the road than ever before, it’s important to understand accidents in greater detail, and possibly ‘predict’ their locations and consequences. Government agencies in the UK have been collecting data about reported accidents since 2005. The data includes generic and specific details about the vehicles, drivers, number of passengers and number of casualties.

With data available since 2005, one could develop a model to predict accidents. The database records only reported accidents, so we know for sure these accidents have ‘happened’. We use this data to predict the location of an accident, in terms of latitude and longitude, and also the expected number of casualties.

**Research goals**

The purpose of this research is to:

– Identify and quantify associations (if any) between the number of casualties and other variables in the data set.

– Explore whether it is possible to predict accident hot-spots based on the data.

**Assumptions, getting and cleaning data**

We download the data from the source and run a set of pre-processing operations to prepare it for exploratory analysis and predictive modelling. The data sets are quite large. The ‘Accidents0514’ file has over 1.6 million rows and 32 columns, and it is the smallest of the 3 files. Let’s look at a snapshot of the accidents in the UK from 2005 to 2014. Given the scale of the data, one might expect the plots to be ‘crowded’.

As expected, the plots are too crowded to show much detail. Still, we see clear spikes around the longitude of 0 and latitude of 51.5. Not surprisingly, these are the coordinates of London. We also see spikes around the Manchester and Birmingham areas. A slightly more subtle trend is that the number of accidents in the London area seems to have been on the increase since 2005, as shown below. It may also be noted that the number of accidents in other places has either remained the same or decreased.

As a simplification, we limit our analysis to the accidents of the year 2014 alone.

**Exploratory Analysis**

Summary stats of the number of casualties/accident reveal the following:

Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
0.053 | 0.5 | 0.5 | 0.7246 | 1 | 54 |

The summary shows that the 75th percentile is 1, meaning that for about three quarters of the accidents, the number of casualties per accident is 1 or less.
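
A summary like the one above can be sketched with the Python standard library alone. This is a minimal illustration on made-up values, not the real data; the function name `five_number_summary` and the sample list are hypothetical. The `inclusive` quantile method is used here because it interpolates the way R’s default `summary()` does, on the assumption that the table came from a tool of that kind.

```python
import statistics

def five_number_summary(values):
    """Return min, quartiles, mean and max of a list of numbers."""
    # method="inclusive" treats the data as the whole population and
    # interpolates linearly between neighbouring order statistics.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }

# Hypothetical toy values standing in for casualties/accident.
casualties_per_accident = [0.5, 0.5, 1.0, 0.25, 0.5, 1.0, 2.0, 0.5, 1.0]
summary = five_number_summary(casualties_per_accident)

# Share of records at or below the third quartile (at least 75% by construction).
share_at_most_q3 = sum(
    v <= summary["q3"] for v in casualties_per_accident
) / len(casualties_per_accident)
```

Reading the third quartile as "at most this value for about three quarters of the records" is the interpretation used in the text.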

Let’s look at the locations of these accidents to identify any areas with a high concentration of accidents.

This shows striking clusters of accident hot-spots. It is easily verified that these hot-spots correspond to major cities. The hottest accident zone seems to be the London area, followed by the Birmingham area, the Manchester area, the Sheffield and Leeds area, and the Newcastle area. It would be interesting to see whether the predictive models return these places as the predicted hot-spots. The density plots also reveal that the westernmost and northernmost parts of the UK don’t seem to have very many accidents compared to the rest of the UK, especially the south-east. Given the high dimensionality of the data, numerous interactions can be analyzed. These are described in the full-length GitHub post, the link to which is given in the final section.
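
One simple way to find hot-spots numerically, rather than by eye from a density plot, is to count accidents per grid cell. The sketch below is only an illustration: the cell size of 0.5 degrees, the function names and the sample coordinates are all assumptions, and the plots in this post were not necessarily produced this way.

```python
import math
from collections import Counter

def grid_counts(points, cell_deg=0.5):
    """Count (lat, lon) points per square grid cell of cell_deg degrees."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts

def top_hot_spots(points, cell_deg=0.5, k=3):
    """Return the k densest cells as (south-west lat, south-west lon, count)."""
    counts = grid_counts(points, cell_deg)
    return [(i * cell_deg, j * cell_deg, n) for (i, j), n in counts.most_common(k)]

# Hypothetical points: four near London, two near Manchester, one near Newcastle.
points = [
    (51.51, -0.13), (51.52, -0.10), (51.50, -0.12), (51.53, -0.09),
    (53.48, -2.24), (53.47, -2.23),
    (54.98, -1.61),
]
hot = top_hot_spots(points)
```

A smaller cell size gives sharper hot-spots but noisier counts, so the choice is a trade-off rather than a fixed rule.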

One of the two variables we will consider in this short post is the time of day at which the accident happened. We categorize the time into the customary morning, morning rush, noon, evening rush and so on. Let’s plot the effect of this on the accidents.

There are observably more accidents in the ‘day time’ categories of morning rush, noon and evening rush. So the time of day appears to be a useful predictor of accidents.
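
The categorization of clock times can be sketched as a simple lookup. The hour boundaries below are assumptions made for illustration, since the exact cut-offs are not given here, and the `HH:MM` input format is also assumed.

```python
from collections import Counter

# Assumed boundaries: (start hour inclusive, end hour exclusive, label).
TIME_BINS = [
    (0, 6, "night"),
    (6, 7, "morning"),
    (7, 10, "morning rush"),
    (10, 16, "noon"),
    (16, 19, "evening rush"),
    (19, 24, "evening"),
]

def time_of_day(hhmm):
    """Map an 'HH:MM' string to a coarse time-of-day label."""
    hour = int(hhmm.split(":")[0])
    for start, end, label in TIME_BINS:
        if start <= hour < end:
            return label
    raise ValueError(f"hour out of range: {hhmm!r}")

# Hypothetical accident times.
times = ["08:15", "12:30", "17:45", "02:10", "08:50", "13:05"]
counts = Counter(time_of_day(t) for t in times)
```

Counting accidents per label, as `counts` does, gives the kind of comparison across time categories described above.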

The other variable we consider for exploratory analysis is the sex of the driver. I, in no way, am endorsing the notion that the sex of the driver can be an influence on the accident. This is done to dispel such myths, if possible. It has to be kept in mind that we DO NOT know the actual sex of the driver; we use just the driver sex codes given in the original data.

It may be noted that one sex has been involved in more accidents than the other. The third category, ostensibly, is a code for sex of the driver unknown/not reported. But we cannot conclude from this that one sex is correlated with more accidents. This is made more obvious by the table below.

driverSex | mean number of casualties/accident |
---|---|
1 | 0.7276312 |
2 | 0.737633 |
3 | 0.6305622 |

It can be seen that both sexes have nearly the same casualties/accident. It remains to be seen whether the small difference is statistically significant.

As a wrap-up of the exploratory section, let’s visualize the correlations among all the relevant variables.
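
One way to check whether such a difference in group means is significant is a two-sample Welch test. The sketch below is not part of this study: the rows are made up, and the p-value uses a normal approximation, which is only reasonable for large groups such as those in the real data, not for the tiny toy sample shown.

```python
import math
import statistics
from collections import defaultdict

def group_values(rows):
    """Group (code, value) pairs into a dict of code -> list of values."""
    groups = defaultdict(list)
    for code, value in rows:
        groups[code].append(value)
    return groups

def welch_test(a, b):
    """Welch statistic for mean(a) - mean(b), with a normal-approximation p-value."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    t = (statistics.fmean(a) - statistics.fmean(b)) / se
    p = 2 * (1 - statistics.NormalDist().cdf(abs(t)))  # two-sided
    return t, p

# Hypothetical (driver sex code, casualties/accident) rows.
rows = [(1, 0.5), (1, 1.0), (1, 0.5), (1, 1.0), (1, 0.5),
        (2, 1.0), (2, 0.5), (2, 1.0), (2, 0.5), (2, 1.0)]
groups = group_values(rows)
means = {code: statistics.fmean(vals) for code, vals in groups.items()}
t, p = welch_test(groups[1], groups[2])
```

A large p-value would mean the data give no evidence that the two codes differ in casualties/accident. With well over a million rows, even a tiny difference may come out as significant, so the size of the difference matters as much as the p-value.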

This could be indicative of the yet-to-be-made predictions.

**Predictive Model and Diagnostics**

We split the data into training and test sets, with a 70/30 split. We use parallel processing to make use of multiple cores of the machine. We build a Stochastic Gradient Boosting model to fit both the casualties and accident location data. Bootstrapping was the choice of re-sampling method.

We are dealing with a regression problem, and we measure the model’s predictions using RMSE and R-squared values. Let’s look at the predictions and results in a bit more detail:
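
To make the idea concrete, here is a deliberately tiny, pure-Python sketch of a 70/30 split followed by stochastic gradient boosting for squared error. It is not the model used in this study: it uses a single made-up feature, one-split trees (stumps), and hypothetical function names and settings. The "stochastic" part is that each round fits its stump to the residuals of a random subsample drawn without replacement.

```python
import math
import random
import statistics

def train_test_split(rows, test_share=0.3, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_share))
    return shuffled[:cut], shuffled[cut:]

def fit_stump(xs, residuals):
    """Best single-threshold split by squared error; None if xs are all equal."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = statistics.fmean(left), statistics.fmean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return None if best is None else best[1:]

def fit_boosting(xs, ys, rounds=50, learning_rate=0.1, subsample=0.5, seed=0):
    """Start from the mean, then add shrunken stumps fitted to residuals."""
    rng = random.Random(seed)
    base = statistics.fmean(ys)
    fitted = [base] * len(ys)
    stumps = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(rounds):
        idx = rng.sample(range(len(xs)), k)  # subsample without replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] - fitted[i] for i in idx])
        if stump is None:
            continue
        threshold, left_mean, right_mean = stump
        stumps.append(stump)
        fitted = [f + learning_rate * (left_mean if x <= threshold else right_mean)
                  for x, f in zip(xs, fitted)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lm if x <= t else rm for t, lm, rm in stumps)

def rmse(actual, predicted):
    return math.sqrt(statistics.fmean([(a - p) ** 2 for a, p in zip(actual, predicted)]))

# Hypothetical toy data: casualties grow with the number of vehicles, plus noise.
rng = random.Random(42)
rows = []
for _ in range(200):
    vehicles = rng.randint(1, 5)
    rows.append((vehicles, 0.5 * vehicles + rng.gauss(0, 0.1)))

train, test = train_test_split(rows)
model = fit_boosting([x for x, _ in train], [y for _, y in train])
test_y = [y for _, y in test]
model_rmse = rmse(test_y, [predict(model, x) for x, _ in test])
baseline_rmse = rmse(test_y, [model[0]] * len(test))  # always predict the training mean
```

Comparing `model_rmse` with `baseline_rmse` on the held-out 30% shows how much the boosted model improves on simply predicting the mean. A real analysis would use a library implementation with many features and tuned settings.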

– The root mean square error of casualties per accident in the training set is 0.42. This means that, in the training set, the actual number of casualties is typically about 0.42 away from what the model has ‘learnt’ to be the casualty count given the conditions.

– RMSE for the test data is 0.44. While the interpretation stays the same as above, it’s very close to the RMSE of the training data. This suggests the model hasn’t overfit the data.

– The R-sq value for the casualty prediction is on the lower side, at about 0.22.

– The RMSEs for latitude and longitude are 0.18 and 0.28 respectively in the training data, and the same, 0.18 and 0.28, in the test data. This is quite satisfying, but it also raises the question of whether any of the variables in the data was a surrogate for latitude and longitude, especially given that the R-sq values for both are close to 1. This could be explored if more explanation were available for the variables and their definitions.
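
For reference, the two metrics quoted above can be computed as in the sketch below. The numbers are made up, and the R-squared shown is one common definition (1 minus residual over total sum of squares). Some tools instead report the squared correlation between observed and predicted values, so the definition used by a given library should be checked.

```python
import math
import statistics

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(statistics.fmean([(a - p) ** 2 for a, p in zip(actual, predicted)]))

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot: share of variance explained by the predictions."""
    mean = statistics.fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical observed and predicted casualties/accident.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
error = rmse(actual, predicted)
explained = r_squared(actual, predicted)
```

Computing both metrics on the training and the test set, and comparing them, is the overfitting check described in the list above.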

Let’s visualize the results and explore the errors in a qualitative manner. Let’s begin by looking at the most important factors contributing to the number of casualties/accident and their locations.

The predictions for casualties align well with our observations from the exploratory plots. The predictions for longitude also align well with our observation from the exploratory section that the police force and local district authority are correlated with longitude.

Let’s visualize the number of casualties/accident and the locations for the test set alongside the model predictions.

This gives a clear picture of what the model is capable of and what its limitations are. The model does very well in predicting the hot-spots, the prediction of which was one of our primary objectives, while it struggles to predict accidents in the not-so-accident-prone locations.

This again clearly shows that the model does well in predicting the expected number of casualties in a typical accident, but fails to predict higher casualties, which are rather infrequent.

**Results**

– The research goal was to analyze the UK accident data set and predict accident hot-spots and the number of casualties. Analyses were set up for the year 2014. The 3 data sets were cleaned and merged into a single data set. Exploratory analyses were performed and it was observed that:

– The number of casualties was correlated with Vehicle type, Manoeuvre, Sex of driver, Month, Day, Time, Urban or Rural area, Weather and light conditions.

– The data were cleaned for missing values and split into training and test sets. Boosting models were developed and have RMSEs of 0.44, 0.18 and 0.28 for casualties, latitude and longitude respectively. This implies that, on average, the algorithm predicts the location of the accident to within (0.18, 0.28) degrees of latitude and longitude. This translates to roughly 37km, so the algorithm is capable of predicting the location of an accident to within about 37km of its actual location. Hypothesis testing and confidence limit analysis may be performed on this statistic, but this is not covered in this study.

– The diagnostics suggest there was little overfitting.
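
The conversion from degree errors to a distance can be sketched as follows. This is a back-of-the-envelope check under stated assumptions, not the study’s own calculation: a degree of latitude is taken as roughly 111km, and the 53-degree latitude used for the adjusted figure is an assumed mid-UK value. Treating a degree of longitude as 111km as well reproduces the figure of about 37km, while scaling longitude by the cosine of the latitude gives a smaller distance.

```python
import math

KM_PER_DEG_LAT = 111.0  # rough length of one degree of latitude (assumption)

def error_km(lat_err_deg, lon_err_deg, at_lat_deg=None):
    """Combine latitude and longitude errors in degrees into a distance in km.

    If at_lat_deg is given, a degree of longitude is shortened by
    cos(latitude); otherwise it is treated as long as a degree of latitude.
    """
    scale = 1.0 if at_lat_deg is None else math.cos(math.radians(at_lat_deg))
    dy = lat_err_deg * KM_PER_DEG_LAT
    dx = lon_err_deg * KM_PER_DEG_LAT * scale
    return math.hypot(dx, dy)

flat = error_km(0.18, 0.28)             # no latitude adjustment
adjusted = error_km(0.18, 0.28, 53.0)   # assumed mid-UK latitude
```

Either way, the error is in the tens of kilometres: fine for telling cities apart, too coarse for locating individual roads.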

**TL;DR**

More accidents happen in big cities. More vehicles involved in an accident lead to higher casualties, as do the size/type of vehicle, a vehicle leaving the carriageway and the severity of the accident.

**Full Study**

A more detailed version of this research can be found here.