A delayed flight, with its last-minute schedule changes and extra hours spent in the airport, is every flier's worst nightmare. To help fliers get ahead of this inconvenience, our team developed a model that predicts whether a flight will be delayed. Both airports and fliers will be able to use this model to adjust their schedules accordingly.
During our first week of development, our team learned more about the problem through exploratory data analysis (EDA).
During our second week of development, we tackled the major problem at hand: predicting whether a flight would actually be delayed.
During our last week of development, our team created the very website you're browsing through!
Before predicting any delays, we needed to establish an understanding of the data we were working with: what correlations could we find between features like the airline or the length of the flight and whether the flight was delayed? And what do these correlations (or lack thereof) mean?
We established these connections through exploratory data analysis. As the name suggests, we explored our data through plots and visualizations to gain a deeper understanding of the meaning behind the numbers on the spreadsheet. We used the Airline Delay Prediction dataset from Kaggle.
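As a rough sketch of that starting point, loading and inspecting the data might look like the following (the file name and the Delay column are assumptions based on the Kaggle dataset; adjust them to your copy):

```python
import pandas as pd

# Load the Kaggle airline delay data (file name is an assumption --
# point this at wherever the CSV was saved).
df = pd.read_csv("Airlines.csv")

# First look: columns, data types, and how balanced the target is.
print(df.head())
df.info()
print(df["Delay"].value_counts(normalize=True))  # share of delayed vs. on-time flights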
The plots we created compared our independent variables (flight length, departure time, airline, etc.) to our target, so we could find the most relevant trends. The specifics for each plot are listed below, but the general trends are as follows (a sketch of one such plot follows the list):
Most features have little to no correlation with the occurrence of a delay
The arrival time for on-time flights is 100 minutes behind the arrival time for delayed flights
Wednesday is the busiest day of the week, followed closely by Tuesday and Thursday
The majority of airports (both origins and destinations) appear in very few rows of the dataset, meaning we would benefit from grouping those airports into an "Other" category
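As an illustration, here is a minimal sketch of the kind of comparison plot described above, assuming the DataFrame `df` from earlier with an `Airline` column and a binary `Delay` target (the exact plots we published may differ):

```python
import matplotlib.pyplot as plt

# Delay rate by airline: group the binary Delay target by carrier and
# plot the mean, i.e. the fraction of delayed flights per airline.
delay_rate = df.groupby("Airline")["Delay"].mean().sort_values()
delay_rate.plot(kind="bar")
plt.ylabel("Fraction of flights delayed")
plt.title("Delay rate by airline")
plt.tight_layout()
plt.show()
```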
After finding trends through EDA, we knew how we needed to clean our data. A few examples of the transformations we applied include:
Dropping the rows where the flight duration was zero (duration was not recorded)
Dropping unnecessary columns, in this case the id column
Encoding categorical variables (such as the different airport names) as numbers
Cutting our full dataset of ~500,000 rows down to random samples of 5,000 for data visualization and 20,000 for training and testing the model
Cutting down the number of categories in AirportTo and AirportFrom by grouping any airport with three or fewer samples into an "Other" category
These changes were made for the benefit of the model: most machine learning algorithms can't work with non-numeric categories directly, but handle numbers just fine. Similarly, our computers struggled with the full dataset, so cutting down the sample size greatly improved our processing speed without sacrificing too much information.
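The sketch below shows roughly how these cleaning steps look in pandas; column names follow the Kaggle dataset, and the exact thresholds and random seed are illustrative rather than our exact code:

```python
from sklearn.preprocessing import LabelEncoder

# Drop rows with an unrecorded (zero) flight duration and the unneeded id column.
df = df[df["Length"] > 0].drop(columns=["id"])

# Group rare airports (three or fewer flights) into an "Other" category.
for col in ["AirportFrom", "AirportTo"]:
    counts = df[col].value_counts()
    rare = counts[counts <= 3].index
    df[col] = df[col].where(~df[col].isin(rare), "Other")

# Encode categorical variables as numbers.
for col in ["Airline", "AirportFrom", "AirportTo"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Down-sample: ~500,000 rows was more than our machines could handle,
# so modeling uses a 20,000-row random sample.
model_df = df.sample(n=20_000, random_state=42)
```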
Now that we knew our data, we could start on the bulk of the project: machine learning. We started with a wide set of models, namely KNN, SVM, Gaussian Naive Bayes, Logistic Regression, Gradient Boosting, Random Forest, and Decision Tree. From there, we chose three models to develop further, based on their baseline performance as judged by our chosen performance metrics. The details of these models are discussed below.
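A minimal sketch of that baseline comparison might look like the following, continuing from the cleaned sample above (the train/test split and default settings are assumptions, not our exact configuration):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X = model_df.drop(columns=["Delay"])
y = model_df["Delay"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
}

# Fit each candidate with default settings and record baseline accuracy and precision.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.2f}, "
          f"precision={precision_score(y_test, preds):.2f}")
```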
Out of the many metrics available, we prioritized precision, which measures how many of the model's positive predictions are correct (true positives divided by true positives plus false positives). We wanted to minimize the chance of the model predicting a delay (a positive prediction) when there isn't one (a false positive), since a flier who trusts that prediction could end up late to an on-time flight. Of course, we also kept other performance metrics in mind, such as accuracy.
Predicts the probability that a given data point belongs to each class; the class with the highest probability is chosen. Naive Bayes differs from other algorithms in that it assumes the features are independent of one another. This model is efficient with small sample sizes and categorical features. However, the seeming benefit of treating each variable as independent can also be a curse, as many real-world features do influence each other.
An ensemble method that uses a collection of weak decision trees to build a stronger model. Each new tree uses information from the previous iteration to improve. Because each tree is built to correct the previous one's errors, Gradient Boosting is able to pick up complex patterns. However, it is also more susceptible to noise in the data.
An ensemble method that operates similarly to Gradient Boosting. However, instead of using information from previous trees to build the next, several trees are built at once, independently of each other. Random Forest benefits from this independence, as it is less susceptible to picking up noise in the data, but it falls short of Gradient Boosting when it comes to capturing complex trends.
Of the many models we tested, the three with the best baseline performances were Gradient Boosting, Random Forest, and Gaussian Naive Bayes. Respectively, they had accuracies of .67, .65, and .63, and precisions of .66, .62, and .59. A confusion matrix (a table comparing the predicted and actual values) for each model is shown below, along with a bar chart depicting the precision and accuracy for each model. As the accuracies for each model were similar, we could safely use precision as our sole qualifying metric for moving models onto hyperparameter tuning.
After we had chosen the models with the best baseline performance, we needed to improve their performance through hyperparameter tuning. Hyperparameter tuning is the process of choosing the optimal settings for each model. To automate the task, we used sklearn's Grid Search, which systematically tries combinations of hyperparameters for each model. The precision metrics are compared in the bar graph below, along with confusion matrices for each model.
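As a sketch, the Grid Search step for one of the models might look like this, reusing the training split from the baseline sketch (the parameter grid shown is an example, not the exact grid we searched; scoring="precision" matches the metric we prioritized):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# Example grid for Gradient Boosting -- the values are illustrative.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}

search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid,
    scoring="precision",  # optimize for the metric we care about most
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```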
From the first few iterations of our model, we can figure out what changes need to be made in the future. Firstly, we can perform feature engineering on our two most important variables (arrival time and airline). Feature engineering includes numerical manipulations such as adding or multiplying features together. We can also look to increase our processing power so that we are able to utilize the full dataset, or use an entirely different dataset that includes a greater variety of features.
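For example, a couple of simple engineered features might look like this (column names follow the Kaggle dataset; whether these particular combinations actually help is something we would still need to test):

```python
# Multiply the time-of-day column by the flight length -- one of the
# numerical manipulations mentioned above (an illustrative choice).
df["Time_x_Length"] = df["Time"] * df["Length"]

# Average delay rate per airline as a numeric feature. In practice this
# should be computed on the training split only, to avoid target leakage.
df["AirlineDelayRate"] = df.groupby("Airline")["Delay"].transform("mean")
```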
Despite the subpar performance of the model, we have built a strong foundation for further research. We have found trends in the data and developed a model that, if trained on data with more strongly correlated features, could become more accurate.
I am a programming and coffee aficionado. I love all things tech-related and enjoy learning about interesting things.
Currently a senior at Amity High School. I spend my free time studying the arts, coding, and just about everything in-between.
I enjoy coding and sports. I love playing baseball and messing around in Python.
I am a senior at Poolesville High School. I love coding, playing the flute, and tennis.