2018-09-18


Carseats Data

A simulated dataset containing sales of child car seats at 400 different stores

Carseats

Simple questions: Summaries of one variable

Data summaries

A “statistic” is the result of applying a function (a summary) to the data: statistic <- function(data)

E.g. rank-based summaries (Min, Quantiles, Median, Max) and the Mean

library(ISLR)  # the Carseats data ships with the ISLR package
summary(Carseats$Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.390   7.490   7.496   9.320  16.270

Roughly, a quantile for a proportion \(p\) is a value \(x\) for which \(p\) of the data are less than or equal to \(x\). The first quartile, median, and third quartile are the quantiles for \(p=0.25\), \(p=0.5\), and \(p=0.75\), respectively.
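The rough definition above can be checked numerically. The sketch below uses a small made-up vector (`x` is hypothetical, not the Carseats data) and base R's `quantile()`. It assumes `type = 1`, which returns an observed data value (the inverse of the empirical CDF); R's default, `type = 7`, interpolates between observations instead.

```r
# Hypothetical sample, used only to illustrate the definition
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12, 15)

p <- c(0.25, 0.5, 0.75)

# type = 1: inverse of the empirical CDF, always returns an observed value
q <- quantile(x, probs = p, type = 1)

# Proportion of the data less than or equal to each quantile
prop_below <- sapply(q, function(v) mean(x <= v))

print(q)
print(prop_below)
```

Each proportion is at least \(p\), but with ties or small samples it need not equal \(p\) exactly, which is why the definition is only a rough one.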

Visual Summary 1: Box Plot

library(ggplot2)
summary(Carseats$Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.390   7.490   7.496   9.320  16.270
ggplot(Carseats, aes(x="All",y=Sales)) + labs(x=NULL) + geom_boxplot() + coord_flip()

Visual Summary 2: Jitter Plot

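library(ggplot2)
# Jitter plot, a relative of the box plot: points are spread at random along
# the dummy "All" axis only (height = 0 leaves the Sales values unchanged)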
ggplot(Carseats, aes(x="All",y=Sales)) + labs(x=NULL) + 
    geom_jitter(position=position_jitter(height=0,width=0.25)) + coord_flip()

Visual Summary 3: Histogram

## Histogram of car seat sales
## after_stat() requires ggplot2 >= 3.3.0; it replaces the older ..count.. notation
ggplot(Carseats, aes(x=Sales)) + labs(y="Count") + geom_histogram(aes(y = after_stat(count)), bins = 30)

Complex questions: Relationships

Relationships between variables

All of Supervised Learning

Proposal:

\[ Y = f(X) + \epsilon \]

  1. Here is some data
  2. Tell me what \(f\) is
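The two steps above can be sketched with simulated data, where the answer is known in advance. Everything in this example is an assumption made for illustration: the "true" function `f`, the noise level, and the sample size are hypothetical and do not come from the Carseats data.

```r
set.seed(1)  # make the simulated noise reproducible

# Hypothetical "true" f, chosen only for illustration
f <- function(x) 1 + 2 * x

n   <- 200
x   <- runif(n, min = -1, max = 1)   # inputs X
eps <- rnorm(n, mean = 0, sd = 0.5)  # noise term epsilon
y   <- f(x) + eps                    # observed responses Y

# Step 1: "here is some data"; step 2: estimate f from (x, y) alone
fit <- lm(y ~ x)
print(coef(fit))
```

Because `f` is known here, the estimated intercept and slope can be compared with the values used to generate the data. With real data no such check is available.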

Example: linear fit

csform <- Sales ~ Price
csmod <- lm(csform, data=Carseats)
print(csmod$coefficients)
## (Intercept)       Price 
## 13.64191518 -0.05307302
ggplot(Carseats, aes(x = Price, y = Sales)) + geom_point() + geom_smooth(method = lm)

Fitting by Minimizing Error
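One common choice of error is the residual sum of squares, \(\sum_i (y_i - f(x_i))^2\). As a minimal sketch, assuming a straight-line \(f\) and hypothetical simulated data, the code below minimizes that quantity numerically with `optim()` and compares the result with `lm()`, which solves the same least-squares problem directly.

```r
set.seed(2)

# Hypothetical data, for illustration only
x <- seq(-1, 1, length.out = 50)
y <- 0.5 - 1.5 * x + rnorm(50, sd = 0.3)

# Residual sum of squares for a line with intercept b[1] and slope b[2]
rss <- function(b) sum((y - (b[1] + b[2] * x))^2)

# Minimise the RSS numerically from an arbitrary starting point
opt <- optim(par = c(0, 0), fn = rss, method = "BFGS")

# lm() solves the same least-squares problem in closed form
fit <- lm(y ~ x)

print(opt$par)
print(coef(fit))
```

The two sets of coefficients should agree up to the optimizer's tolerance. Other error measures, such as absolute error, would lead to different fits.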

What kind of \(f\) are you looking for?

Data and linear fit

## (Intercept)           x 
##         1.1         1.6

Data and quadratic fit

## (Intercept)           x      I(x^2) 
##        0.74        1.75        0.69
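Coefficient names such as `I(x^2)` suggest that these fits come from `lm()` formulas with explicit power terms. The data behind the fits is not shown here, so the sketch below uses hypothetical simulated data and will not reproduce the coefficients above.

```r
set.seed(3)

# Hypothetical data; the data set used for the fits above is not shown
x <- runif(20, min = -1, max = 1)
y <- 1 + 1.5 * x + 0.7 * x^2 + rnorm(20, sd = 0.4)

# I() protects ^ so it means arithmetic power rather than formula crossing
fit1 <- lm(y ~ x)
fit2 <- lm(y ~ x + I(x^2))

print(round(coef(fit1), 2))
print(round(coef(fit2), 2))

# Training residual sum of squares for each fit
rss1 <- sum(resid(fit1)^2)
rss2 <- sum(resid(fit2)^2)
print(c(linear = rss1, quadratic = rss2))
```

The linear model is a special case of the quadratic one, so the quadratic fit's training error cannot be larger. A lower training error therefore does not settle whether the fit is better.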

Is this a better fit to the data?

Order-3 fit

## (Intercept)           x      I(x^2)      I(x^3) 
##        0.71        1.39        0.80        0.46

Is this a better fit to the data?

Order-4 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4) 
##       0.795       1.128      -0.039       0.905       0.898

Is this a better fit to the data?

Order-5 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5) 
##        0.47        0.62        4.86        6.75       -5.25       -6.72

Is this a better fit to the data?

Order-6 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6) 
##        0.13        3.13        8.99      -11.11      -23.83       12.52       18.38

Is this a better fit to the data?

Order-7 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7) 
##       0.096       3.207      10.193     -11.078     -30.742       8.263      25.527       5.483

Is this a better fit to the data?

Order-8 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7)      I(x^8) 
##         1.3        -5.9        -5.1        69.9        48.8      -172.0      -131.9       123.3       101.2

Is this a better fit to the data?

Order-9 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7)      I(x^8)      I(x^9) 
##        -1.1        34.8      -127.9      -379.9      1186.9      1604.8     -2475.4     -2627.6      1499.6      1448.1

Is this a better fit to the data?

Evaluating Performance

Which do you prefer and why?
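One possible way to approach this question is to hold out part of the data and compare training error with error on the held-out part. The sketch below assumes hypothetical simulated data, an even train/test split, and mean squared error as the measure. The helper `mse()` is defined here for illustration, and `poly()` stands in for the explicit `I(x^k)` terms.

```r
set.seed(4)

# Hypothetical data from a quadratic f, for illustration only
n <- 60
x <- runif(n, min = -1, max = 1)
y <- 1 + 1.5 * x + 0.7 * x^2 + rnorm(n, sd = 0.4)
dat <- data.frame(x = x, y = y)

# Hold out half of the rows as a test set
train_idx <- sample(n, size = n / 2)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Mean squared prediction error of a fit on a data frame d
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

orders <- 1:9
errors <- t(sapply(orders, function(k) {
  # poly() builds an order-k polynomial basis (orthogonal by default)
  fit <- lm(y ~ poly(x, k), data = train)
  c(order = k, train = mse(fit, train), test = mse(fit, test))
}))

print(round(errors, 3))
```

Training error can only go down as the order increases, because each model contains the previous one. Test error has no such guarantee, which is one reason to look at held-out error when choosing among fits.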

Recommended exercises