2016-12-13
library(nycflights13); str(flights);
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 19 variables: ## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## $ month : int 1 1 1 1 1 1 1 1 1 1 ... ## $ day : int 1 1 1 1 1 1 1 1 1 1 ... ## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ... ## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ... ## $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ... ## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ... ## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ... ## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ... ## $ carrier : chr "UA" "UA" "AA" "B6" ... ## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ... ## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ... ## $ origin : chr "EWR" "LGA" "JFK" "JFK" ... ## $ dest : chr "IAH" "IAH" "MIA" "BQN" ... ## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ... ## $ distance : num 1400 1416 1089 1576 762 ... ## $ hour : num 5 5 5 5 6 5 6 6 6 6 ... ## $ minute : num 15 29 40 45 0 58 0 0 0 0 ... ## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
str(planes);
## Classes 'tbl_df', 'tbl' and 'data.frame': 3322 obs. of 9 variables: ## $ tailnum : chr "N10156" "N102UW" "N103US" "N104UW" ... ## $ year : int 2004 1998 1999 1999 2002 1999 1999 1999 1999 1999 ... ## $ type : chr "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" "Fixed wing multi engine" ... ## $ manufacturer: chr "EMBRAER" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" "AIRBUS INDUSTRIE" ... ## $ model : chr "EMB-145XR" "A320-214" "A320-214" "A320-214" ... ## $ engines : int 2 2 2 2 2 2 2 2 2 2 ... ## $ seats : int 55 182 182 182 55 182 182 182 182 182 ... ## $ speed : int NA NA NA NA NA NA NA NA NA NA ... ## $ engine : chr "Turbo-fan" "Turbo-fan" "Turbo-fan" "Turbo-fan" ...
str(weather);
## Classes 'tbl_df', 'tbl' and 'data.frame': 26130 obs. of 15 variables: ## $ origin : chr "EWR" "EWR" "EWR" "EWR" ... ## $ year : num 2013 2013 2013 2013 2013 ... ## $ month : num 1 1 1 1 1 1 1 1 1 1 ... ## $ day : int 1 1 1 1 1 1 1 1 1 1 ... ## $ hour : int 0 1 2 3 4 6 7 8 9 10 ... ## $ temp : num 37 37 37.9 37.9 37.9 ... ## $ dewp : num 21.9 21.9 21.9 23 24.1 ... ## $ humid : num 54 54 52.1 54.5 57 ... ## $ wind_dir : num 230 230 230 230 240 270 250 240 250 260 ... ## $ wind_speed: num 10.4 13.8 12.7 13.8 15 ... ## $ wind_gust : num 11.9 15.9 14.6 15.9 17.2 ... ## $ precip : num 0 0 0 0 0 0 0 0 0 0 ... ## $ pressure : num 1014 1013 1013 1013 1013 ... ## $ visib : num 10 10 10 10 10 10 10 10 10 10 ... ## $ time_hour : POSIXct, format: "2012-12-31 19:00:00" "2012-12-31 20:00:00" ...
str(airlines);
## Classes 'tbl_df', 'tbl' and 'data.frame': 16 obs. of 2 variables: ## $ carrier: chr "9E" "AA" "AS" "B6" ... ## $ name : chr "Endeavor Air Inc." "American Airlines Inc." "Alaska Airlines Inc." "JetBlue Airways" ...
str(airports);
## Classes 'tbl_df', 'tbl' and 'data.frame': 1396 obs. of 7 variables: ## $ faa : chr "04G" "06A" "06C" "06N" ... ## $ name: chr "Lansdowne Airport" "Moton Field Municipal Airport" "Schaumburg Regional" "Randall Airport" ... ## $ lat : num 41.1 32.5 42 41.4 31.1 ... ## $ lon : num -80.6 -85.7 -88.1 -74.4 -81.4 ... ## $ alt : int 1044 264 801 523 11 1593 730 492 1000 108 ... ## $ tz : num -5 -5 -6 -5 -4 -4 -5 -5 -5 -8 ... ## $ dst : chr "A" "A" "A" "A" ...
dplyr
The dplyr
package in R provides simple functions that correspond to the most common data manipulation operations (or verbs) and uses efficient storage approaches.
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
verb meaning -------------------------------------------- select() select variables (or columns) filter() subset observations (or rows) mutate() add new variables (or columns) arrange() re-order the observations summarise() reduce to a single row group_by() aggregate left_join() merge two data objects distinct() remove duplicate entries collect() force computation and bring data back into R
filter()
extracts a subset of rows of interestlibrary(dplyr) airports %>% filter(faa %in% c('ALB', 'BDL', 'BTV'))
## # A tibble: 3 × 7 ## faa name lat lon alt tz dst ## <chr> <chr> <dbl> <dbl> <int> <dbl> <chr> ## 1 ALB Albany Intl 42.74827 -73.80169 285 -5 A ## 2 BDL Bradley Intl 41.93889 -72.68322 173 -5 A ## 3 BTV Burlington Intl 44.47186 -73.15328 335 -5 A
head(flights,2)
## # A tibble: 2 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 533 529 4 850 ## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dttm>
bycarrier <- group_by(flights, carrier) head(bycarrier,2)
## Source: local data frame [2 x 19] ## Groups: carrier [1] ## ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 533 529 4 850 ## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>, ## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, ## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, ## # time_hour <dttm>
flights %>% summarise(numflights = n()) #n() counts rows
## # A tibble: 1 × 1 ## numflights ## <int> ## 1 336776
bycarrier %>% summarise(numflights = n()) #n() counts rows
## # A tibble: 16 × 2 ## carrier numflights ## <chr> <int> ## 1 9E 18460 ## 2 AA 32729 ## 3 AS 714 ## 4 B6 54635 ## 5 DL 48110 ## 6 EV 54173 ## 7 F9 685 ## 8 FL 3260 ## 9 HA 342 ## 10 MQ 26397 ## 11 OO 32 ## 12 UA 58665 ## 13 US 20536 ## 14 VX 5162 ## 15 WN 12275 ## 16 YV 601
Aggregate the counts of flights at three airports at the monthly level.
airportcounts <- flights %>% filter(dest %in% c('ALB', 'BDL', 'BTV')) %>% group_by(year, month, dest) %>% summarise(numflights = n()) head(airportcounts)
## Source: local data frame [6 x 4] ## Groups: year, month [2] ## ## year month dest numflights ## <int> <int> <chr> <int> ## 1 2013 1 ALB 64 ## 2 2013 1 BDL 37 ## 3 2013 1 BTV 223 ## 4 2013 2 ALB 58 ## 5 2013 2 BDL 46 ## 6 2013 2 BTV 189
Add a new column by constructing a date variable (using mutate()
and helper functions from the lubridate
package), then generate time series plots.
library(lubridate) library(lattice) library(mosaic) airportdailycounts <- flights %>% filter(dest %in% c('ALB', 'BDL', 'BTV')) %>% group_by(year, month, day, dest) %>% summarise(numflights = n()) %>% mutate(date = ymd(paste(year, month, day, sep="-"))) head(airportdailycounts) # list only the first six observations
## Source: local data frame [6 x 6] ## Groups: year, month, day [2] ## ## year month day dest numflights date ## <int> <int> <int> <chr> <int> <date> ## 1 2013 1 1 ALB 3 2013-01-01 ## 2 2013 1 1 BDL 2 2013-01-01 ## 3 2013 1 1 BTV 7 2013-01-01 ## 4 2013 1 2 ALB 3 2013-01-02 ## 5 2013 1 2 BDL 2 2013-01-02 ## 6 2013 1 2 BTV 9 2013-01-02
qplot(date, numflights, colour=dest, geom = "line", xlab="Year", ylab="Number of flights per day", data=airportdailycounts)
airportcounts <- airportcounts %>% mutate(FirstOfMonth = ymd(paste(year, "-", month, "-01", sep=""))) head(airportcounts) # list only the first six observations
## Source: local data frame [6 x 5] ## Groups: year, month [2] ## ## year month dest numflights FirstOfMonth ## <int> <int> <chr> <int> <date> ## 1 2013 1 ALB 64 2013-01-01 ## 2 2013 1 BDL 37 2013-01-01 ## 3 2013 1 BTV 223 2013-01-01 ## 4 2013 2 ALB 58 2013-02-01 ## 5 2013 2 BDL 46 2013-02-01 ## 6 2013 2 BTV 189 2013-02-01
qplot(FirstOfMonth, numflights, colour=dest, geom = "line", xlab="Date", ylab="Number of flights per month", data=airportcounts)
arrange()
lets us display the months with the largest number of flights. Here we need to use ungroup()
, since otherwise the data would remain aggregated by year, month, and destination.
airportcounts %>% ungroup() %>% arrange(desc(numflights)) %>% select(numflights, year, month, dest) %>% head()
## # A tibble: 6 × 4 ## numflights year month dest ## <int> <int> <int> <chr> ## 1 264 2013 6 BTV ## 2 257 2013 3 BTV ## 3 256 2013 5 BTV ## 4 238 2013 10 BTV ## 5 236 2013 7 BTV ## 6 223 2013 1 BTV
Which airline was most reliable flying from New York to Minneapolis/St. Paul (MSP) in January, 2013?
delays <- flights %>% select(origin, dest, year, month, day, carrier, arr_delay) %>% filter(dest == 'MSP' & month == 1) %>% group_by(year, month, day, carrier) %>% summarise(meandelay = mean(arr_delay), count = n()) options(digits=3) favstats(meandelay ~ carrier, data=delays)
## carrier min Q1 median Q3 max mean sd n missing ## 1 9E -14.0 -5.50 5.00 23.833 133.3 14.261 30.4 31 0 ## 2 DL -18.3 -8.63 -5.27 0.571 69.1 -0.727 17.6 26 5 ## 3 EV -11.0 3.69 17.58 28.458 57.0 18.599 18.6 26 5 ## 4 MQ -22.0 -3.50 8.75 22.500 69.2 11.563 21.7 29 2
qplot(carrier, meandelay, data = delays, geom = "boxplot")
Here, the full carrier names are merged (or joined, in database speak) using the left_join()
merged <- left_join(delays, airlines, by=c("carrier" = "carrier")) head(merged)
## Source: local data frame [6 x 7] ## Groups: year, month, day [2] ## ## year month day carrier meandelay count name ## <int> <int> <int> <chr> <dbl> <int> <chr> ## 1 2013 1 1 9E 25.00 2 Endeavor Air Inc. ## 2 2013 1 1 DL -5.29 7 Delta Air Lines Inc. ## 3 2013 1 1 EV 29.00 1 ExpressJet Airlines Inc. ## 4 2013 1 1 MQ 69.25 4 Envoy Air ## 5 2013 1 2 9E 7.50 2 Endeavor Air Inc. ## 6 2013 1 2 DL -3.33 9 Delta Air Lines Inc.
filter(merged, meandelay > 60) %>% arrange(desc(meandelay))
## Source: local data frame [4 x 7] ## Groups: year, month, day [4] ## ## year month day carrier meandelay count name ## <int> <int> <int> <chr> <dbl> <int> <chr> ## 1 2013 1 16 9E 133.3 3 Endeavor Air Inc. ## 2 2013 1 30 9E 71.0 3 Endeavor Air Inc. ## 3 2013 1 1 MQ 69.2 4 Envoy Air ## 4 2013 1 23 DL 69.1 7 Delta Air Lines Inc.
?flights
gives description: "On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013."
flights %>% filter(dest == 'ORD') %>% summarize(count = n())
## # A tibble: 1 × 1 ## count ## <int> ## 1 17283
flights %>% filter(dest == 'YYZ') %>% summarize(count = n())
## # A tibble: 1 × 1 ## count ## <int> ## 1 0
nycflights13
is just a fraction of the available flight information. See http://www.amherst.edu/~nhorton/precursors for example code using SQLite.
Relational databases, first popularized in the 1970's, provide fast and efficient access to terabyte-sized files. These systems use a structured query language (SQL) to specify data operations.
Database systems have been highly optimized and tuned since they were first invented. Connections between general purpose statistics packages such as R and database systems can be facilitated through use of SQL.
verb meaning -------------------------------------------- SELECT create a new result set from a table FROM specify table WHERE subset observations GROUP BY aggregate ORDER re-order the observations DISTINCT remove duplicate values JOIN merge two data objects