2018-09-18

Data Wrangling and Exploration in R

Reasons to use R

New York City Flights 2013

# IF you see library(blah) in my code, you will need to install.packages("blah") before running it.
library(nycflights13)
print(flights)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

print(planes)
## # A tibble: 3,322 x 9
##    tailnum  year type      manufacturer  model  engines seats speed engine
##    <chr>   <int> <chr>     <chr>         <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wi… EMBRAER       EMB-1…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wi… EMBRAER       EMB-1…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wi… AIRBUS INDUS… A320-…       2   182    NA Turbo…
## # ... with 3,312 more rows

print(weather)
## # A tibble: 26,115 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

print(airlines)
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

print(airports)
## # A tibble: 1,458 x 8
##    faa   name                   lat    lon   alt    tz dst   tzone        
##    <chr> <chr>                <dbl>  <dbl> <int> <dbl> <chr> <chr>        
##  1 04G   Lansdowne Airport     41.1  -80.6  1044    -5 A     America/New_…
##  2 06A   Moton Field Municip…  32.5  -85.7   264    -6 A     America/Chic…
##  3 06C   Schaumburg Regional   42.0  -88.1   801    -6 A     America/Chic…
##  4 06N   Randall Airport       41.4  -74.4   523    -5 A     America/New_…
##  5 09J   Jekyll Island Airpo…  31.1  -81.4    11    -5 A     America/New_…
##  6 0A9   Elizabethton Munici…  36.4  -82.2  1593    -5 A     America/New_…
##  7 0G6   Williams County Air…  41.5  -84.5   730    -5 A     America/New_…
##  8 0G7   Finger Lakes Region…  42.9  -76.8   492    -5 A     America/New_…
##  9 0P2   Shoestring Aviation…  39.8  -76.6  1000    -5 U     America/New_…
## 10 0S9   Jefferson County In…  48.1 -123.    108    -8 A     America/Los_…
## # ... with 1,448 more rows

dplyr

The dplyr package in R provides simple functions that correspond to the most common data manipulation operations (or verbs) and uses efficient storage approaches.

https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

verb          meaning
--------------------------------------------
select()      select variables (or columns)
filter()      subset observations (or rows)
mutate()      add new variables (or columns)
arrange()     re-order the observations
summarise()   reduce to a single row
group_by()    aggregate
left_join()   merge two data objects
distinct()    remove duplicate entries
collect()     force computation and bring data back into R

Designed for tidy data

  • dplyr was designed with the idea of tidy data in mind, but can be applied to all data
  • good for coercing data into a new format
  • convenient syntax:
newdata <- olddata %>% verb1(options) %>% verb2(options) %>% verb3(options)

Filtering observations

  • filter() extracts a subset of rows of interest
  • Suppose we wanted to find out which airports certain codes belong to?
library(dplyr)
airports %>% filter(faa %in% c('ALB', 'BDL', 'BTV'))   

Grouping

flights