Tools for data analysis in Python

#TOOLS FOR DATA ANALYSIS IN PYTHON CODE#

In this part, we will draw more advanced insights using data frame transformation techniques and window functions from the pandas library.

    > result.head()
       airline  nb_flights  perc_delayed  iata_code  airline_name
    0  AA       85747.0     0.1499        AA         American Airlines Inc.
    1  AS       16196.0     0.1208        AS         Alaska Airlines Inc.

    rank(method = 'first', ascending = False))  # Compute airline size and delay statistics


The names of the airlines associated with their IATA codes are then gathered using the merge() method with the airlines_df data frame. The top 10 airlines with the highest volume of flights are then kept. It is good to note that the ranking is done across all airlines.

The ranking of each airline by its number of flights is then computed with the rank(method = 'first') window function within the assign() method.
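The ranking step can be sketched as follows. This is a minimal illustration on made-up flight counts, not the original code: the stats_df, ranked_df and top_df names and all values are hypothetical, and the final filter uses plain boolean indexing as one possible way to keep the top rows.

```python
import pandas as pd

# Hypothetical per-airline flight counts (illustrative values only)
stats_df = pd.DataFrame({
    "airline": ["AA", "AS", "B6", "DL"],
    "nb_flights": [500, 120, 300, 500],
})

ranked_df = stats_df.assign(
    # Rank 1 goes to the airline with the most flights. With method="first",
    # tied values are ranked in order of appearance, so ranks stay unique.
    rank=lambda df: df["nb_flights"].rank(method="first", ascending=False)
)

# Keep the 2 largest airlines here (the text keeps the top 10)
top_df = ranked_df[ranked_df["rank"] <= 2]
print(top_df)
```

Because the rank is computed before any filtering, it reflects the position of each airline among all airlines.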

The percentage of delayed flights is first computed with the 2-step aggregation process using the groupby() and apply() functions.
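A minimal sketch of this 2-step aggregation is shown below, assuming a toy flights_df with an is_delayed column coded as 0 or 1. The summarize helper and the sample rows are hypothetical; the column names nb_flights and perc_delayed follow the output shown in the text.

```python
import pandas as pd

# Toy flights table (hypothetical rows); is_delayed is 1 for a delayed flight
flights_df = pd.DataFrame({
    "airline": ["AA", "AA", "AA", "AS", "AS"],
    "is_delayed": [1, 0, 0, 0, 1],
})

def summarize(group):
    """Aggregate one airline's rows into a single row of statistics."""
    return pd.Series({
        "nb_flights": len(group),                      # number of rows in the group
        "perc_delayed": group["is_delayed"].mean(),    # share of delayed flights
    })

stats_df = (
    flights_df
    .groupby("airline")[["is_delayed"]]   # step 1: group rows by airline
    .apply(summarize)                     # step 2: aggregate each group
    .reset_index()
)
print(stats_df)
```

Since is_delayed only takes the values 0 and 1, its mean within a group is the fraction of delayed flights for that airline.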

One example of an aggregation function is the len() function, which computes the number of rows within each group.

Among the biggest airlines, where we define the airline size as the number of yearly flights, we want to know which airlines have fewer delays compared to others.

The apply() method computes aggregations over the grouped columns specified in the previous step.

The groupby() method groups the data by a given set of columns.
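As an illustration of groupby() followed by apply() with len(), the sketch below counts departing flights per origin airport. The sample rows are made up, and the nb_flights variable name is an assumption.

```python
import pandas as pd

# Toy flights table (hypothetical rows)
flights_df = pd.DataFrame({
    "origin_airport": ["JFK", "JFK", "SFO", "JFK", "SFO"],
    "flight_number": ["1", "2", "3", "4", "5"],
})

nb_flights = (
    flights_df
    .groupby("origin_airport")["flight_number"]  # one group per origin airport
    .apply(len)                                  # number of rows in each group
)
print(nb_flights)
```

The result is a Series indexed by origin_airport, with one count per airport.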

We want to obtain the number of departing flights per airport across the year.

    merge(airports_df, left_on = 'origin_airport', right_on = 'iata_code')

    origin_airport  destination_airport  airline  flight_number  scheduled_departure

       departure_delay  is_delayed  iata_code  airport
    0  114.0            1           JFK        John F. Kennedy International Airport (New Yor.
    1  27.0             1           JFK        John F. Kennedy International Airport (New Yor.
    2  -2.0             0           JFK        John F. Kennedy International Airport (New Yor.
    3  6.0              0           JFK        John F. Kennedy International Airport (New Yor.

It is good to note that if those keys had the same name, it would have been possible to use the single argument on along with the name of that key.

The keys on which the data frames are joined are specified in the left_on and right_on arguments.

The flights_df data frame is joined with airports_df by using the merge() method.
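The join can be sketched on two small tables as follows. The rows are invented for illustration; only the data frame names, the key columns and the merge() arguments come from the text. The second call shows the on argument under the assumption that the key column has been renamed to match on both sides.

```python
import pandas as pd

# Toy tables (hypothetical rows)
flights_df = pd.DataFrame({
    "origin_airport": ["JFK", "SFO", "JFK"],
    "departure_delay": [114.0, 27.0, -2.0],
})
airports_df = pd.DataFrame({
    "iata_code": ["JFK", "SFO"],
    "airport": [
        "John F. Kennedy International Airport",
        "San Francisco International Airport",
    ],
})

# Keys have different names on each side: use left_on / right_on
merged_df = flights_df.merge(
    airports_df, left_on="origin_airport", right_on="iata_code"
)

# If both sides share the key name, a single on argument is enough
airports_renamed_df = airports_df.rename(columns={"iata_code": "origin_airport"})
merged_on_df = flights_df.merge(airports_renamed_df, on="origin_airport")
print(merged_df)
```

With left_on and right_on, both key columns are kept in the result; with on, the shared key appears only once.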

In this part, we will explore the data from different angles using basic data frame manipulation techniques with the pandas library.

We want to obtain the airport name corresponding to the airport code attached to flights.

    drop(, axis = 1)

From the summary output above, the dataset is now down to 585k observations. Also, we see that the missing value problem has been solved, and that the is_delayed statistic shows that roughly 20% of flights are delayed according to our definition of delay.

In this part, we fix existing columns and add new ones that will be useful later on:

We remove rows with missing values with the dropna() method.

We keep flights departing from the airports that we want to look at.

We choose to filter out flights that have more than 1 day of delay. Looking at the mean and median of departure_delay, we see that values are heavily right-skewed, and we have a maximum delay of 1988 minutes (~ 33 hours).

We also convert flight_number from integers to character values with the astype() method, noting that these are IDs and have no ordered meaning.

We add the following columns using the assign() method: time of flight in datetime format, obtained by combining existing columns.

    # Compute statistics of columns
    flights_df_raw.
    # All rows should not have any null value
    # Only flights from set of airports and with reasonable delay amount
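The cleaning steps above can be sketched as one chained expression. This is an illustration under several assumptions, not the original code: the sample rows, the year, month and day columns, the airports_of_interest list, the max_delay_minutes threshold name and the date column name are all hypothetical, and the filters use .loc with a callable as one possible approach.

```python
import pandas as pd

# Toy raw table (hypothetical rows and column names)
flights_df_raw = pd.DataFrame({
    "year": [2015, 2015, 2015, 2015],
    "month": [1, 1, 2, 3],
    "day": [1, 2, 3, 4],
    "origin_airport": ["JFK", "SFO", "JFK", "XYZ"],
    "flight_number": [11, 22, 33, 44],
    "departure_delay": [5.0, None, 2000.0, 30.0],  # assumed to be in minutes
})

airports_of_interest = ["JFK", "SFO"]  # hypothetical selection
max_delay_minutes = 24 * 60            # 1 day, assuming delays are in minutes

flights_df = (
    flights_df_raw
    # Remove rows with any missing value
    .dropna()
    # Only flights from the chosen airports and with a reasonable delay
    .loc[lambda df: df["origin_airport"].isin(airports_of_interest)]
    .loc[lambda df: df["departure_delay"] < max_delay_minutes]
    # Flight numbers are IDs, so store them as strings rather than integers
    .astype({"flight_number": str})
    # Combine existing columns into a single datetime column
    .assign(date=lambda df: pd.to_datetime(df[["year", "month", "day"]]))
)
print(flights_df)
```

Chaining the steps keeps the raw data frame untouched, so each filter can be checked or removed independently.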