A delayed flight, with its last-minute schedule changes and extra hours spent in the airport, is every flier's worst nightmare. To help fliers get ahead of this inconvenience, our team developed a model that predicts whether a flight will be delayed. Both airports and fliers will be able to use this model to adjust their schedules accordingly.
During our first week of development, our team learned more about the problem through exploratory data analysis (EDA).
During our second week of development, we tackled the major problem at hand: predicting whether a flight would actually be delayed.
During our last week of development, our team created the very website you're browsing through!
Before predicting any delays, we needed to establish an understanding of the data we were working with: what correlations could we find between features like the airline or the length of the flight and whether the flight was delayed? And what do these correlations (or lack thereof) mean?
We established these connections through exploratory data analysis. As the name suggests, we explored our data through plots and visualizations to gain a deeper understanding of the meaning behind the numbers on the spreadsheet. We used the Airline Delay Prediction dataset from Kaggle.
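As a rough sketch of that starting point, loading and inspecting the data might look like the following (the file name and the Delay column are assumptions based on the Kaggle dataset; adjust them to your copy):

```python
import pandas as pd

# Load the Kaggle airline delay data (file name is an assumption --
# point this at wherever the CSV was saved).
df = pd.read_csv("Airlines.csv")

# First look: columns, data types, and how balanced the target is.
print(df.head())
df.info()
print(df["Delay"].value_counts(normalize=True))  # share of delayed vs. on-time flights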
The plots we created compared our independent variables (flight length, departure time, airline, etc.) to our target, so we could find the most relevant trends. The specifics for each plot are listed below, but the general trends are as follows (a sketch of one such plot follows the list):
Most features have little to no correlation with the occurrence of a delay
The arrival time for on-time flights is 100 minutes behind the arrival time for delayed flights
Wednesday is the busiest day of the week, followed closely by Tuesday and Thursday
The majority of airports (both origins and destinations) appear in very few rows of the dataset, meaning we would benefit from grouping those airports into an "Other" category
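As an illustration, here is a minimal sketch of the kind of comparison plot described above, assuming the DataFrame `df` from earlier with an `Airline` column and a binary `Delay` target (the exact plots we published may differ):

```python
import matplotlib.pyplot as plt

# Delay rate by airline: group the binary Delay target by carrier and
# plot the mean, i.e. the fraction of delayed flights per airline.
delay_rate = df.groupby("Airline")["Delay"].mean().sort_values()
delay_rate.plot(kind="bar")
plt.ylabel("Fraction of flights delayed")
plt.title("Delay rate by airline")
plt.tight_layout()
plt.show()
```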
After finding trends through EDA, we knew how we needed to clean our data. A few examples of the transformations we applied include:
Dropping the rows where the flight duration was zero (duration was not recorded)
Dropping unnecessary columns, in this case the id column
Encoding categorical variables (such as the different airport names) as numbers
Cutting our full dataset of ~500,000 rows down to random samples of 5,000 for data visualization and 20,000 for training and testing the model
Cutting down the number of categories in AirportTo and AirportFrom by grouping any airport with three or fewer samples into an "Other" category
These changes were made for the benefit of the model: most machine learning algorithms can't work with non-numeric categories directly, but handle numbers just fine. Similarly, our computers struggled with the full dataset, so cutting down the sample size greatly improved our processing speed without sacrificing too much information.
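The sketch below shows roughly how these cleaning steps look in pandas; column names follow the Kaggle dataset, and the exact thresholds and random seed are illustrative rather than our exact code:

```python
from sklearn.preprocessing import LabelEncoder

# Drop rows with an unrecorded (zero) flight duration and the unneeded id column.
df = df[df["Length"] > 0].drop(columns=["id"])

# Group rare airports (three or fewer flights) into an "Other" category.
for col in ["AirportFrom", "AirportTo"]:
    counts = df[col].value_counts()
    rare = counts[counts <= 3].index
    df[col] = df[col].where(~df[col].isin(rare), "Other")

# Encode categorical variables as numbers.
for col in ["Airline", "AirportFrom", "AirportTo"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Down-sample: ~500,000 rows was more than our machines could handle,
# so modeling uses a 20,000-row random sample.
model_df = df.sample(n=20_000, random_state=42)
```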
Now that we knew our data, we could start on the bulk of the project: machine learning. We started with a wide set of models, namely KNN, SVM, Gaussian Naive Bayes, Logistic Regression, Gradient Boosting, Random Forest, and Decision Tree. From there, we chose three models to develop further, based on their baseline performance as judged by our chosen performance metrics. The details of these models are discussed below.
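A minimal sketch of that baseline comparison might look like the following, continuing from the cleaned sample above (the train/test split and default settings are assumptions, not our exact configuration):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X = model_df.drop(columns=["Delay"])
y = model_df["Delay"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
}

# Fit each candidate with default settings and record baseline accuracy and precision.
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.2f}, "
          f"precision={precision_score(y_test, preds):.2f}")
```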
Out of the many metrics available, we prioritized precision, which measures how many of the model's positive predictions are correct (true positives divided by true positives plus false positives). We wanted to minimize the chance of the model predicting a delay (a positive prediction) when there isn't one (a false positive), since a flier who trusts that prediction could end up late to an on-time flight. Of course, we also kept other performance metrics in mind, such as accuracy.
Predicts the probability that a given data point belongs to each class; the class with the highest probability is chosen. Naive Bayes differs from other algorithms in that it assumes the features are independent of one another. This model is efficient with small sample sizes and categorical features. However, the seeming benefit of treating each variable as independent can also be a curse, as many real-world features do influence each other.
An ensemble method that uses a collection of weak decision trees to build a stronger model. Each new tree uses information from the previous iteration to improve. Because each tree is built to correct the previous one's errors, Gradient Boosting is able to pick up complex patterns. However, it is also more susceptible to noise in the data.
An ensemble method that operates similarly to Gradient Boosting. However, instead of using information from previous trees to build the next, several trees are built at once, independently of each other. Random Forest benefits from this independence, as it is less susceptible to picking up noise in the data, but it falls short of Gradient Boosting when it comes to capturing complex trends.
Of the many models we tested, the three with the best baseline performances were Gradient Boosting, Random Forest, and Gaussian Naive Bayes. Respectively, they had accuracies of .67, .65, and .63, and precisions of .66, .62, and .59. A confusion matrix (a table comparing the predicted and actual values) for each model is shown below, along with a bar chart depicting the precision and accuracy for each model. As the accuracies for each model were similar, we could safely use precision as our sole qualifying metric for moving models onto hyperparameter tuning.
After we had chosen the models with the best baseline performance, we needed to improve their performance through hyperparameter tuning. Hyperparameter tuning is the process of choosing the optimal settings for each model. To automate the task, we used sklearn's Grid Search, which systematically tries combinations of hyperparameters for each model. The precision metrics are compared in the bar graph below, along with confusion matrices for each model.
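As a sketch, the Grid Search step for one of the models might look like this, reusing the training split from the baseline sketch (the parameter grid shown is an example, not the exact grid we searched; scoring="precision" matches the metric we prioritized):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# Example grid for Gradient Boosting -- the values are illustrative.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}

search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid,
    scoring="precision",  # optimize for the metric we care about most
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```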
From the first few iterations of our model, we can figure out what changes need to be made in the future. Firstly, we can perform feature engineering on our two most important variables (arrival time and airline). Feature engineering includes numerical manipulations such as adding or multiplying features together. We can also look to increase our processing power so that we are able to utilize the full dataset, or use an entirely different dataset that includes a greater variety of features.
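For example, a couple of simple engineered features might look like this (column names follow the Kaggle dataset; whether these particular combinations actually help is something we would still need to test):

```python
# Multiply the time-of-day column by the flight length -- one of the
# numerical manipulations mentioned above (an illustrative choice).
df["Time_x_Length"] = df["Time"] * df["Length"]

# Average delay rate per airline as a numeric feature. In practice this
# should be computed on the training split only, to avoid target leakage.
df["AirlineDelayRate"] = df.groupby("Airline")["Delay"].transform("mean")
```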
Despite the subpar performance of the model, we have built a strong foundation for further research. We have found trends in the data and developed a model that, if trained on data with more strongly correlated features, could become more accurate.
I am a programming and coffee aficionado. I love all things tech-related and enjoy learning about interesting things.
Currently a senior at Amity High School. I spend my free time studying the arts, coding, and just about everything in-between.
I enjoy coding and sports. I love playing baseball and messing around in Python.
I am a senior at Poolesville High School. I love coding, playing the flute, and tennis.