2018-09-18

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Carseats Data

A simulated dataset containing sales of child car seats at 400 different stores

Carseats

Simple questions: Summaries of one variable

Data summaries

A “statistic” is the result of applying a function (a summary) to the data: statistic <- function(data)

E.g. order statistics (Min, Quantiles, Median, Max) and the Mean
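To make the "statistic as a function of the data" idea concrete, here is a minimal sketch. The vector `x` and the helper `sample_range` are made up for illustration and are not part of the Carseats data.

```r
# Hypothetical toy data (not the Carseats sales)
x <- c(3, 7, 1, 9, 5)

# A statistic is just a function applied to the data
sample_range <- function(data) max(data) - min(data)

sample_range(x)  # 9 - 1 = 8
median(x)        # middle value of 1, 3, 5, 7, 9 is 5
mean(x)          # 25 / 5 = 5
```

Built-in summaries such as `median()` and `mean()` follow the same pattern: data in, a single number out.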

library(ISLR)  # assumption: Carseats comes from the ISLR package
summary(Carseats$Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.390   7.490   7.496   9.320  16.270

Roughly, the quantile for a proportion \(p\) is a value \(x\) such that a fraction \(p\) of the data are less than or equal to \(x\). The first quartile, median, and third quartile are the quantiles for \(p=0.25\), \(p=0.5\), and \(p=0.75\), respectively.
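The definition can be checked directly. The sketch below uses a made-up vector `x` rather than the Carseats data, and relies on the default interpolation rule of `quantile()`, so the match to \(p\) is only approximate ("roughly", as above).

```r
# Hypothetical toy data: the integers 1 to 101
x <- 1:101

# Quartiles: the quantiles for p = 0.25, 0.5, 0.75
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q

# Fraction of the data at or below the first quartile;
# this should be close to 0.25
mean(x <= q[["25%"]])
```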

Visual Summary 1: Box Plot

library(ggplot2)
summary(Carseats$Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.390   7.490   7.496   9.320  16.270
# horizontal box plot of Sales (single group labelled "All")
ggplot(Carseats, aes(x = "All", y = Sales)) + labs(x = NULL) +
    geom_boxplot() + coord_flip()

Visual Summary 2: Jitter Plot

library(ggplot2)
library(gridExtra)  # boxplot relatives
# jitter plot: spread the points vertically only, so Sales values stay exact
ggplot(Carseats, aes(x = "All", y = Sales)) + labs(x = NULL) +
    geom_jitter(position = position_jitter(height = 0, width = 0.25)) + coord_flip()

Visual Summary 3: Histogram

# Histogram of Sales (bins = 30 is the ggplot2 default, made explicit here)
ggplot(Carseats, aes(x = Sales)) + labs(y = "Count") +
    geom_histogram(aes(y = after_stat(count)), bins = 30)

Complex questions: Relationships

Relationships between variables

All of Supervised Learning

Proposal:

\[ Y = f(X) + \epsilon \]

  1. Here is some data
  2. Tell me what \(f\) is
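The two steps can be tried on simulated data, where the answer is known in advance. This is a sketch under stated assumptions: the "true" function `f`, the noise level, and the sample size are all invented for illustration, and a linear model is only one possible way to estimate \(f\).

```r
set.seed(1)                       # reproducible simulation

# Step 1: "here is some data", generated from Y = f(X) + epsilon
n <- 200
f <- function(x) 3 - 0.5 * x      # hypothetical true f (unknown in practice)
x <- runif(n, min = 0, max = 10)
y <- f(x) + rnorm(n, sd = 0.5)    # epsilon: random noise
sim <- data.frame(x = x, y = y)

# Step 2: "tell me what f is", here by fitting a straight line
fit <- lm(y ~ x, data = sim)
coef(fit)                         # should be near intercept 3, slope -0.5
```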

Example: linear fit

csform <- Sales ~ Price                   # model formula: Sales as a linear function of Price
csmod <- lm(csform, data = Carseats)      # least-squares fit
print(csmod$coefficients)
## (Intercept)       Price 
## 13.64191518 -0.05307302
ggplot(Carseats, aes(x = Price, y = Sales)) + geom_point() + geom_smooth(method = lm)
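The printed coefficients define the fitted line, so a prediction can be computed by hand. The sketch below copies the two reported coefficients into a small helper; `predict_sales` and the price of 100 are illustrative choices, not part of the original analysis.

```r
# Coefficients as reported by the linear fit above
coefs <- c("(Intercept)" = 13.64191518, Price = -0.05307302)

# Fitted line: Sales = intercept + slope * Price
predict_sales <- function(price) {
  unname(coefs["(Intercept)"] + coefs["Price"] * price)
}

predict_sales(100)                       # about 8.33
predict_sales(110) - predict_sales(100)  # slope * 10, about -0.53
```

With the fitted model object at hand, `predict(csmod, newdata = data.frame(Price = 100))` should give the same value up to rounding of the printed coefficients.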