2018-09-18
.csv
, .xlsx
There’s a package for everything. (The python of the statistics world.)
On the other hand, python is also great. You are not required to use R in this course.
Resource for tidying/wrangling: [https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf]
# IF you see library(blah) in my code, you will need to install.packages("blah") before running it. library(nycflights13) print(flights)
## # A tibble: 336,776 x 19 ## year month day dep_time sched_dep_time dep_delay arr_time ## <int> <int> <int> <int> <int> <dbl> <int> ## 1 2013 1 1 517 515 2 830 ## 2 2013 1 1 533 529 4 850 ## 3 2013 1 1 542 540 2 923 ## 4 2013 1 1 544 545 -1 1004 ## 5 2013 1 1 554 600 -6 812 ## 6 2013 1 1 554 558 -4 740 ## 7 2013 1 1 555 600 -5 913 ## 8 2013 1 1 557 600 -3 709 ## 9 2013 1 1 557 600 -3 838 ## 10 2013 1 1 558 600 -2 753 ## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>, ## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, ## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, ## # minute <dbl>, time_hour <dttm>
print(planes)
## # A tibble: 3,322 x 9 ## tailnum year type manufacturer model engines seats speed engine ## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr> ## 1 N10156 2004 Fixed wi… EMBRAER EMB-1… 2 55 NA Turbo… ## 2 N102UW 1998 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo… ## 3 N103US 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo… ## 4 N104UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo… ## 5 N10575 2002 Fixed wi… EMBRAER EMB-1… 2 55 NA Turbo… ## 6 N105UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo… ## 7 N107US 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo… ## 8 N108UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo… ## 9 N109UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo… ## 10 N110UW 1999 Fixed wi… AIRBUS INDUS… A320-… 2 182 NA Turbo… ## # ... with 3,312 more rows
print(weather)
## # A tibble: 26,115 x 15 ## origin year month day hour temp dewp humid wind_dir wind_speed ## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 ## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 ## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 ## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 ## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 ## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 ## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0 ## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4 ## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0 ## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8 ## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>, ## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
print(airlines)
## # A tibble: 16 x 2 ## carrier name ## <chr> <chr> ## 1 9E Endeavor Air Inc. ## 2 AA American Airlines Inc. ## 3 AS Alaska Airlines Inc. ## 4 B6 JetBlue Airways ## 5 DL Delta Air Lines Inc. ## 6 EV ExpressJet Airlines Inc. ## 7 F9 Frontier Airlines Inc. ## 8 FL AirTran Airways Corporation ## 9 HA Hawaiian Airlines Inc. ## 10 MQ Envoy Air ## 11 OO SkyWest Airlines Inc. ## 12 UA United Air Lines Inc. ## 13 US US Airways Inc. ## 14 VX Virgin America ## 15 WN Southwest Airlines Co. ## 16 YV Mesa Airlines Inc.
print(airports)
## # A tibble: 1,458 x 8 ## faa name lat lon alt tz dst tzone ## <chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr> ## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_… ## 2 06A Moton Field Municip… 32.5 -85.7 264 -6 A America/Chic… ## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chic… ## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/New_… ## 5 09J Jekyll Island Airpo… 31.1 -81.4 11 -5 A America/New_… ## 6 0A9 Elizabethton Munici… 36.4 -82.2 1593 -5 A America/New_… ## 7 0G6 Williams County Air… 41.5 -84.5 730 -5 A America/New_… ## 8 0G7 Finger Lakes Region… 42.9 -76.8 492 -5 A America/New_… ## 9 0P2 Shoestring Aviation… 39.8 -76.6 1000 -5 U America/New_… ## 10 0S9 Jefferson County In… 48.1 -123. 108 -8 A America/Los_… ## # ... with 1,448 more rows
dplyr
The dplyr
package in R provides simple functions that correspond to the most common data manipulation operations (or verbs) and uses efficient storage approaches.
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
verb meaning -------------------------------------------- select() select variables (or columns) filter() subset observations (or rows) mutate() add new variables (or columns) arrange() re-order the observations summarise() reduce to a single row group_by() aggregate left_join() merge two data objects distinct() remove duplicate entries collect() force computation and bring data back into R
dplyr
was designed with the idea of tidy data in mind, but can be applied to all datanewdata <- olddata %>% verb1(options) %>% verb2(options) %>% verb3(options)
filter()
extracts a subset of rows of interestlibrary(dplyr) airports %>% filter(faa %in% c('ALB', 'BDL', 'BTV'))
flights
bycarrier <- flights %>% group_by(carrier) bycarrier
flights %>% summarise(numflights = n()) #n() counts rows
bycarrier %>% summarise(numflights = n()) #n() counts rows
Aggregate the counts of flights at the monthly level.
monthlycounts <- flights %>% filter(dest %in% c('ALB', 'BDL', 'BTV')) %>% group_by(year, month) %>% summarise(numflights = n()) monthlycounts
Aggregate the counts of flights at three airports at the monthly level.
airportmonthlycounts <- flights %>% filter(dest %in% c('ALB', 'BDL', 'BTV')) %>% group_by(year, month, dest) %>% summarise(numflights = n()) airportmonthlycounts
Add a new column by constructing a date variable using mutate()
. R has a special “date” data type that is useful; dates can be constructed different ways, including using the ymd()
function.
library(lubridate) #To get the ymd() function airportdailycounts <- flights %>% filter(dest %in% c('ALB', 'BDL', 'BTV')) %>% group_by(year, month, day, dest) %>% summarise(numflights = n()) %>% mutate(date = ymd(paste(year, month, day, sep = "-"))) airportdailycounts
library(ggplot2) ggplot(data = airportdailycounts, aes(x = date, y = numflights, colour = dest)) + geom_point()
airportmonthlycounts <- airportmonthlycounts %>% mutate(FirstOfMonth = ymd(paste(year, "-", month, "-01", sep=""))) ggplot(data = airportmonthlycounts, aes(x = FirstOfMonth, y = numflights, colour = dest)) + geom_point()
arrange()
lets us display the months with the largest number of flights.
airportmonthlycounts %>% arrange(desc(numflights))
Which airline was most reliable flying from New York to Minneapolis/St. Paul (MSP) in January, 2013?
jandelays <- flights %>% select(origin, dest, year, month, day, carrier, arr_delay) %>% filter(dest == 'MSP' & month == 1) ggplot(data = jandelays, aes(x = carrier, y = arr_delay)) + geom_boxplot()
Here, the full carrier names are merged (or joined, in database speak) using the left_join()
merged <- left_join(jandelays, airlines, by = c("carrier" = "carrier")) merged
?flights
gives description: “On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013.”
flights %>% filter(dest == 'ORD') %>% summarize(count = n())
flights %>% filter(dest == 'YYZ') %>% summarize(count = n())
nycflights13
is just a fraction of the available flight information. See http://www.amherst.edu/~nhorton/precursors for example code using SQLite.
Relational databases, first popularized in the 1970’s, provide fast and efficient access to terabyte-sized files. These systems use a structured query language (SQL) to specify data operations.
Database systems have been highly optimized and tuned since they were first invented. Connections between general purpose statistics packages such as R and database systems can be facilitated through use of SQL.
verb meaning -------------------------------------------- SELECT create a new result set from a table FROM specify table WHERE subset observations GROUP BY aggregate ORDER re-order the observations DISTINCT remove duplicate values JOIN merge two data objects
Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
(Wikipedia)