Data Preparation

2016-12-13

Database Operations

New York City Flights 2013

library(nycflights13);
str(flights);

## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

str(planes);

## Classes 'tbl_df', 'tbl' and 'data.frame':    3322 obs. of  9 variables:
##  $ tailnum     : chr  "N10156" "N102UW" "N103US" "N104UW" ...
##  $ year        : int  2004 1998 1999 1999 2002 1999 1999 1999 1999 1999 ...
##  $ type        : chr  "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" ...
##  $ manufacturer: chr  "EMBRAER" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" ...
##  $ model       : chr  "EMB-145XR" "A320-214" "A320-214" "A320-214" ...
##  $ engines     : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ seats       : int  55 182 182 182 55 182 182 182 182 182 ...
##  $ speed       : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ engine      : chr  "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan" ...

str(weather);

## Classes 'tbl_df', 'tbl' and 'data.frame':    26130 obs. of  15 variables:
##  $ origin    : chr  "EWR" "EWR" "EWR" "EWR" ...
##  $ year      : num  2013 2013 2013 2013 2013 ...
##  $ month     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ day       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour      : int  0 1 2 3 4 6 7 8 9 10 ...
##  $ temp      : num  37 37 37.9 37.9 37.9 ...
##  $ dewp      : num  21.9 21.9 21.9 23 24.1 ...
##  $ humid     : num  54 54 52.1 54.5 57 ...
##  $ wind_dir  : num  230 230 230 230 240 270 250 240 250 260 ...
##  $ wind_speed: num  10.4 13.8 12.7 13.8 15 ...
##  $ wind_gust : num  11.9 15.9 14.6 15.9 17.2 ...
##  $ precip    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ pressure  : num  1014 1013 1013 1013 1013 ...
##  $ visib     : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ time_hour : POSIXct, format: "2012-12-31 19:00:00" "2012-12-31 20:00:00" ...

str(airlines);

## Classes 'tbl_df', 'tbl' and 'data.frame':    16 obs. of  2 variables:
##  $ carrier: chr  "9E" "AA" "AS" "B6" ...
##  $ name   : chr  "Endeavor Air Inc." "American Airlines Inc." "Alaska Airlines Inc." "JetBlue Airways" ...

str(airports);

## Classes 'tbl_df', 'tbl' and 'data.frame':    1396 obs. of  7 variables:
##  $ faa : chr  "04G" "06A" "06C" "06N" ...
##  $ name: chr  "Lansdowne Airport" "Moton Field Municipal Airport" "Schaumburg Regional" "Randall Airport" ...
##  $ lat : num  41.1 32.5 42 41.4 31.1 ...
##  $ lon : num  -80.6 -85.7 -88.1 -74.4 -81.4 ...
##  $ alt : int  1044 264 801 523 11 1593 730 492 1000 108 ...
##  $ tz  : num  -5 -5 -6 -5 -4 -4 -5 -5 -5 -8 ...
##  $ dst : chr  "A" "A" "A" "A" ...

`dplyr`

The dplyr package in R provides simple functions that correspond to the most common data manipulation operations (or verbs) and uses efficient storage approaches.

https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

verb          meaning
--------------------------------------------
select()      select variables (or columns)
filter()      subset observations (or rows)
mutate()      add new variables (or columns)
arrange()     re-order the observations
summarise()   reduce to a single row
group_by()    aggregate
left_join()   merge two data objects
distinct()    remove duplicate entries
collect()     force computation and bring data back into R

Filtering observations

filter() extracts a subset of rows of interest
Suppose we wanted to find out which airports certain codes belong to?

library(dplyr)
airports %>% filter(faa %in% c('ALB', 'BDL', 'BTV'))

## # A tibble: 3 × 7
##     faa            name      lat       lon   alt    tz   dst
##   <chr>           <chr>    <dbl>     <dbl> <int> <dbl> <chr>
## 1   ALB     Albany Intl 42.74827 -73.80169   285    -5     A
## 2   BDL    Bradley Intl 41.93889 -72.68322   173    -5     A
## 3   BTV Burlington Intl 44.47186 -73.15328   335    -5     A

Grouping

head(flights,2)

## # A tibble: 2 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

bycarrier <- group_by(flights, carrier)
head(bycarrier,2)

## Source: local data frame [2 x 19]
## Groups: carrier [1]
## 
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

Grouping and Summaries

flights %>% summarise(numflights = n()) #n() counts rows

## # A tibble: 1 × 1
##   numflights
##        <int>
## 1     336776

bycarrier %>% summarise(numflights = n()) #n() counts rows

## # A tibble: 16 × 2
##    carrier numflights
##      <chr>      <int>
## 1       9E      18460
## 2       AA      32729
## 3       AS        714
## 4       B6      54635
## 5       DL      48110
## 6       EV      54173
## 7       F9        685
## 8       FL       3260
## 9       HA        342
## 10      MQ      26397
## 11      OO         32
## 12      UA      58665
## 13      US      20536
## 14      VX       5162
## 15      WN      12275
## 16      YV        601

Aggregating observations

Aggregate the counts of flights at three airports at the monthly level.

airportcounts <- flights %>% 
   filter(dest %in% c('ALB', 'BDL', 'BTV')) %>%
   group_by(year, month, dest) %>%
   summarise(numflights = n())
head(airportcounts)

## Source: local data frame [6 x 4]
## Groups: year, month [2]
## 
##    year month  dest numflights
##   <int> <int> <chr>      <int>
## 1  2013     1   ALB         64
## 2  2013     1   BDL         37
## 3  2013     1   BTV        223
## 4  2013     2   ALB         58
## 5  2013     2   BDL         46
## 6  2013     2   BTV        189

Creating new derived variables

Add a new column by constructing a date variable (using mutate() and helper functions from the lubridate package), then generate time series plots.

library(lubridate)
library(lattice)
library(mosaic)
airportdailycounts <- flights %>% 
   filter(dest %in% c('ALB', 'BDL', 'BTV')) %>% group_by(year, month, day, dest) %>%
   summarise(numflights = n()) %>% mutate(date = ymd(paste(year, month, day, sep="-")))
head(airportdailycounts) # list only the first six observations

## Source: local data frame [6 x 6]
## Groups: year, month, day [2]
## 
##    year month   day  dest numflights       date
##   <int> <int> <int> <chr>      <int>     <date>
## 1  2013     1     1   ALB          3 2013-01-01
## 2  2013     1     1   BDL          2 2013-01-01
## 3  2013     1     1   BTV          7 2013-01-01
## 4  2013     1     2   ALB          3 2013-01-02
## 5  2013     1     2   BDL          2 2013-01-02
## 6  2013     1     2   BTV          9 2013-01-02

qplot(date, numflights, colour=dest, geom = "line",
xlab="Year", ylab="Number of flights per day", data=airportdailycounts)

Aggregate by month instead

airportcounts <- airportcounts %>%
  mutate(FirstOfMonth = ymd(paste(year, "-", month, "-01", sep="")))
head(airportcounts) # list only the first six observations

## Source: local data frame [6 x 5]
## Groups: year, month [2]
## 
##    year month  dest numflights FirstOfMonth
##   <int> <int> <chr>      <int>       <date>
## 1  2013     1   ALB         64   2013-01-01
## 2  2013     1   BDL         37   2013-01-01
## 3  2013     1   BTV        223   2013-01-01
## 4  2013     2   ALB         58   2013-02-01
## 5  2013     2   BDL         46   2013-02-01
## 6  2013     2   BTV        189   2013-02-01

qplot(FirstOfMonth, numflights, colour=dest, geom = "line",
xlab="Date", ylab="Number of flights per month", data=airportcounts)

Sorting and selecting

arrange() lets us display the months with the largest number of flights. Here we need to use ungroup(), since otherwise the data would remain aggregated by year, month, and destination.

airportcounts %>% ungroup() %>%
  arrange(desc(numflights)) %>% 
  select(numflights, year, month, dest) %>% head()

## # A tibble: 6 × 4
##   numflights  year month  dest
##        <int> <int> <int> <chr>
## 1        264  2013     6   BTV
## 2        257  2013     3   BTV
## 3        256  2013     5   BTV
## 4        238  2013    10   BTV
## 5        236  2013     7   BTV
## 6        223  2013     1   BTV

Comparing airlines

Which airline was most reliable flying from New York to Minneapolis/St. Paul (MSP) in January, 2013?

delays <- flights %>% 
  select(origin, dest, year, month, day, carrier, arr_delay) %>%
  filter(dest == 'MSP' & month == 1) %>%
  group_by(year, month, day, carrier) %>%
  summarise(meandelay = mean(arr_delay), count = n()) 
options(digits=3)
favstats(meandelay ~ carrier, data=delays)

##   carrier   min    Q1 median     Q3   max   mean   sd  n missing
## 1      9E -14.0 -5.50   5.00 23.833 133.3 14.261 30.4 31       0
## 2      DL -18.3 -8.63  -5.27  0.571  69.1 -0.727 17.6 26       5
## 3      EV -11.0  3.69  17.58 28.458  57.0 18.599 18.6 26       5
## 4      MQ -22.0 -3.50   8.75 22.500  69.2 11.563 21.7 29       2

qplot(carrier, meandelay, data = delays, geom = "boxplot")

Merging or "Joining"

Here, the full carrier names are merged (or joined, in database speak) using the left_join()

merged <- left_join(delays, airlines, by=c("carrier" = "carrier")) 
head(merged)

## Source: local data frame [6 x 7]
## Groups: year, month, day [2]
## 
##    year month   day carrier meandelay count                     name
##   <int> <int> <int>   <chr>     <dbl> <int>                    <chr>
## 1  2013     1     1      9E     25.00     2        Endeavor Air Inc.
## 2  2013     1     1      DL     -5.29     7     Delta Air Lines Inc.
## 3  2013     1     1      EV     29.00     1 ExpressJet Airlines Inc.
## 4  2013     1     1      MQ     69.25     4                Envoy Air
## 5  2013     1     2      9E      7.50     2        Endeavor Air Inc.
## 6  2013     1     2      DL     -3.33     9     Delta Air Lines Inc.

filter(merged, meandelay > 60) %>% arrange(desc(meandelay))

## Source: local data frame [4 x 7]
## Groups: year, month, day [4]
## 
##    year month   day carrier meandelay count                 name
##   <int> <int> <int>   <chr>     <dbl> <int>                <chr>
## 1  2013     1    16      9E     133.3     3    Endeavor Air Inc.
## 2  2013     1    30      9E      71.0     3    Endeavor Air Inc.
## 3  2013     1     1      MQ      69.2     4            Envoy Air
## 4  2013     1    23      DL      69.1     7 Delta Air Lines Inc.

Truth in Advertising

?flights gives description: "On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013."

flights %>% filter(dest == 'ORD') %>% summarize(count = n())

## # A tibble: 1 × 1
##   count
##   <int>
## 1 17283

flights %>% filter(dest == 'YYZ') %>% summarize(count = n())

## # A tibble: 1 × 1
##   count
##   <int>
## 1     0

Big Databases

nycflights13 is just a fraction of the available flight information. See http://www.amherst.edu/~nhorton/precursors for example code using SQLite.

Relational databases, first popularized in the 1970's, provide fast and efficient access to terabyte-sized files. These systems use a structured query language (SQL) to specify data operations.

Database systems have been highly optimized and tuned since they were first invented. Connections between general purpose statistics packages such as R and database systems can be facilitated through use of SQL.

Key operators in SQL

verb          meaning
--------------------------------------------
SELECT      create a new result set from a table
FROM        specify table
WHERE       subset observations
GROUP BY    aggregate
ORDER       re-order the observations
DISTINCT    remove duplicate values
JOIN        merge two data objects