By: Srisai Sivakumar
F1 and Data:
F1 is a fast-paced, technologically advanced sport and industry. The cars are developed at breakneck speed, with every possible technological advancement in place, be it advanced aerodynamics and CFD capabilities (my other passion), carbon-composite bodywork or advanced control systems. With numerous Grands Prix each season (19 in 2015), there is an ever-increasing demand to improve the performance of the car through any and all possible means.
A recent addition to the F1 team's toolbox is data. Vast amounts of telemetry data are streamed in real time from the cars to the pits, and back to 'mission control', during practice, qualifying and, of course, the race itself. Such data play a key role in developing race strategies.
It would be interesting to develop a model to predict the winner of a particular race. That would require significant insight into the sport, the circuits and much more. As a newbie to F1, I now take my first steps towards such a goal with a simpler analysis.
The focus of this study will be a (retrospective) statistical and graphical analysis of the 2011 British Grand Prix, held at the Silverstone Circuit near the village of Silverstone in Northamptonshire, England.
The most important aspect of any data analysis study like this is the quality of the data: understanding the uncertainties and error margins in the data, and knowing the uncertainties in the measurements made during data collection.
Most of my prior work has used data from 'trusted' sources, like the UCI Machine Learning Repository, or data that I have been involved in collecting personally.
The data used in this study was obtained from the F1 Data Junkie website. The author credits his data to an external site that is a well-known source of F1 data. For the rest of the study, we therefore assume the data is of good quality and take it at face value.
Approach to the study
The first step is to convert or transform the data into a form convenient for subsequent statistical analysis and visualization. This process is called data munging or data wrangling. Of course, getting the data into the 'correct' format requires some intuition about the kind of analysis one expects to perform. Since I have done a reasonable number of such studies, I have some intuition about what's to be done.
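As an illustration, munging a raw per-lap timing export into a lap-by-driver table might look like the following pandas sketch. The column names and values here are hypothetical, not the actual schema of the source data.

```python
import io
import pandas as pd

# Hypothetical raw timing export: one row per (driver, lap).
# Column names and values are illustrative only, not the real schema.
raw_csv = """driver,lap,lap_time
ALONSO,1,101.2
ALONSO,2,99.8
VETTEL,1,100.5
VETTEL,2,100.1
"""

laps = pd.read_csv(io.StringIO(raw_csv))

# Pivot into a lap-by-driver matrix: convenient for plotting lap-time
# traces and for computing per-driver summaries later on.
times = laps.pivot(index="lap", columns="driver", values="lap_time")
print(times)
```

The long ("tidy") form is convenient for grouped summaries, while the pivoted matrix is convenient for plotting; most of the analysis below alternates between the two.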
Once we have the data in the correct format, we dive into the heart of the study. We start by exploring the practice runs, obtaining statistical and visual summaries of the sessions. We do the same for the qualifying round as well.
Once we reach the race, we go a bit further and explore it in more detail. For brevity, not all the findings are included in this discussion.
We begin by looking at the first practice round, with a plot of the lap times for each driver and how they changed over the course of successive laps.
In the next plot, we break down the previous plot, grouping the laps by stint.
We now look at a statistical summary of the lap times for each driver during practice 1, first as a box plot and then as a table.
Summary of average lap times for each driver during practice 1:

| Driver | Average lap time (s) |
| P. DI RESTA | 111.9267 |
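The five-number summary that a box plot encodes can be computed directly per driver; a minimal sketch with made-up lap times:

```python
import pandas as pd

# Toy practice-1 lap times in seconds (illustrative values,
# not the real session data).
laps = pd.DataFrame({
    "driver": ["ALONSO"] * 3 + ["VETTEL"] * 3,
    "lap_time": [101.2, 99.8, 100.4, 100.5, 100.1, 99.9],
})

# Per-driver summary statistics: the same quantities a box plot displays.
summary = laps.groupby("driver")["lap_time"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])

# The corresponding box plot (requires matplotlib):
# laps.boxplot(column="lap_time", by="driver")
```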
We perform a similar analysis on practices 2 and 3, but for the sake of brevity the results are presented in consolidated form for all three practice rounds rather than individually.
After the three practice rounds, it would be interesting to see how the average lap times have evolved for the various drivers. Let's begin by plotting this, and then look at some statistical summaries for the three practice rounds.
Table of average lap times after the three practice rounds:

| Driver | Practice 1 (s) | Practice 2 (s) | Practice 3 (s) |
| P. DI RESTA | 111.927 | 115.699 | 99.208 |
From the table and plots, we can clearly see that the mean and median lap times decreased by roughly 10% in practice 3 compared with practices 1 and 2. Interestingly, practice 2 has the highest average lap times for all drivers who participated in all three runs. It is also interesting that Alonso and Maldonado showed the greatest improvement going from practice 2 to 3.
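The session-to-session improvement can be quantified directly from the averages. Using only the di Resta row from the table above (the other rows are omitted here):

```python
import pandas as pd

# Average lap times per practice session, in seconds; the DI RESTA row
# is taken from the table above.
avg = pd.DataFrame(
    {"P1": [111.927], "P2": [115.699], "P3": [99.208]},
    index=["P. DI RESTA"],
)

# Percentage improvement from practice 2 to practice 3,
# relative to the practice-2 average.
improvement = 100 * (avg["P2"] - avg["P3"]) / avg["P2"]
print(improvement.round(1))
```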
Explaining the rules of the qualifying rounds is beyond the scope of this study; the reader can find resources elsewhere with a quick look-up. We look at the (now familiar) box plot of the lap times of all the drivers.
Let's also familiarize ourselves with a new term, the elapsed time for each driver, by looking at a plot. The elapsed time is simply the cumulative time taken over all laps completed so far.
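Since elapsed time is just a running total of the lap times, it is a one-liner to compute; a sketch with made-up values:

```python
import pandas as pd

# Lap times for one driver, in seconds (illustrative values).
lap_times = pd.Series([100.0, 99.5, 101.25])

# Elapsed time after each lap is the cumulative sum of the lap times.
elapsed = lap_times.cumsum()
print(elapsed.tolist())  # [100.0, 199.5, 300.75]
```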
On to the big one. There may be an overdose of plots in this section, so let's begin with an easy one.
Let's start by looking at when the drivers chose to make their pit stops.
The plot presents a mixed trend. While most drivers take their first stop around the 10th lap, the choice of subsequent stops has a much larger spread.
Let's look at the lap times of the drivers.
This gives an overall sense of each driver's and team's abilities. Clearly Alonso, Massa and Webber have more sub-100-second laps than the rest, but that doesn't necessarily mean they would take the podium.
This reveals a rough overall pattern. There is a relatively 'flat' initial stage of the race, around 10 laps, where the lap times stay fairly constant for all drivers. Then there is a period of decreasing lap times, between roughly the 10th and the 25th or 30th lap, beyond which the lap times remain fairly constant again. One may also observe a few points on each curve that lie well above the average times: these are the pit stops.
Let's now look at some more esoteric plots. Let's see which driver had the best lap time and how that changed over the course of the 52 laps.
The plot is too crowded to observe anything meaningful, so let's look only at the drivers who completed all 52 laps.
Let's see which drivers had the most best lap times.
| Driver | Best lap times |
| P. DI RESTA | 0 |
It can be seen that Alonso was clearly the best performer, setting the best lap time in as many as 18 of the 52 laps, followed by Vettel, who had the best lap time in 8 laps. This gives us the first clue as to what the outcome might be.
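Given a lap-by-driver lap-time matrix, tallying best-lap honours per driver is a two-step operation; a sketch with toy numbers:

```python
import pandas as pd

# Lap-time matrix: rows are laps, columns are drivers (toy values, seconds).
times = pd.DataFrame(
    {"ALONSO": [99.1, 98.7, 99.5], "VETTEL": [99.4, 98.5, 99.9]},
    index=[1, 2, 3],
)

# For each lap, find which driver set the lowest time, then tally per driver.
best_counts = times.idxmin(axis=1).value_counts()
print(best_counts)  # ALONSO: 2, VETTEL: 1
```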
Let's look at a similar plot, but with elapsed time.
This plot is a lot less noisy, and thus easier to read. It shows that Vettel dominated the first half of the race; at lap 27, he loses his lead to Alonso. Let's try to investigate the possible causes. A look at the race data shows that both of them pitted at the same time, but Alonso held the lead from then on. Let's take a look at the pit stop times.
Pit stop times for Vettel = [24.818, 31.558, 23.137]
Pit stop times for Alonso = [26.566, 23.974, 23.474]
The differences in pit stop times (Alonso minus Vettel) = [1.748, -7.584, 0.337]
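These differences can be verified with a couple of lines of Python (positive means Alonso's stop was slower):

```python
# Pit stop durations in seconds, as quoted above.
vettel = [24.818, 31.558, 23.137]
alonso = [26.566, 23.974, 23.474]

# Difference per stop, Alonso minus Vettel; negative means Vettel was slower.
diffs = [round(a - v, 3) for v, a in zip(vettel, alonso)]
print(diffs)  # [1.748, -7.584, 0.337]
```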
The pit stop times are fairly consistent, with one aberration: Vettel's second pit stop took much longer than his other two. It is possibly this that cost Vettel his lead in the race, from which he never recovered.
Let's see if we can reach the same conclusion by looking at the calculated time-to-leader metric for the drivers. To make the plot less crowded, let's consider only the top 10 drivers.
This reinforces our hypothesis that it was most likely the extra time taken during Vettel's second pit stop that cost him the race, given that he dominated the first half.
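The time-to-leader metric itself is simple to compute from the elapsed times; a sketch with hypothetical numbers:

```python
import pandas as pd

# Elapsed times at the end of each lap, per driver (hypothetical, seconds).
elapsed = pd.DataFrame(
    {"ALONSO": [100.0, 199.0, 298.0], "VETTEL": [99.5, 199.5, 299.5]},
    index=[1, 2, 3],
)

# Time to leader on each lap: a driver's elapsed time minus the smallest
# elapsed time on that lap (0.0 for the lap's leader).
gap = elapsed.sub(elapsed.min(axis=1), axis=0)
print(gap)
```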
Alonso wins the race despite not being in the lead for the first half, while Vettel, despite having led for more than half the laps, ends up second. This could well be due to the extra time taken during Vettel's second pit stop on lap 27.
The final result can be tabulated as below:
Data visualization offers a convenient medium for communicating complex, high-dimensional data in the form of pictures or graphs. Visual representations help us communicate and understand information more easily and quickly. In a fast-paced and data-heavy sport like Formula 1, visualization enables strategists and decision makers to:
– ‘see’ analytical results and find relevance among the millions of variables,
– communicate concepts and hypotheses to non-technical colleagues and upper management,
– even predict the outcomes of a future event to an extent.
With access to massive amounts of data from simulations, practice sessions and races, along with design data, we can build models to predict the outcomes of future events with reliability better than a random flip of an unbiased coin.