2018-09-18


A simulated dataset containing sales of child car seats at 400 different stores

Carseats

## Data summaries

A “statistic” is the result of applying a function (a summary) to the data: statistic <- function(data)

E.g. ranks: Min, Quantiles, Median, Mean, Max
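To make the idea concrete, here is a minimal sketch in base R. It uses a small made-up vector rather than the Carseats data, and the names toy_sales and stat_range are hypothetical, introduced only for illustration.

```r
# Toy data (made up for illustration; not the Carseats values)
toy_sales <- c(4.2, 7.5, 9.1, 5.0, 11.3, 7.5, 0.0)

# A statistic is just a function of the data
stat_range <- function(data) max(data) - min(data)

# Built-in summaries follow the same pattern: data in, number out
stat_min    <- min(toy_sales)
stat_median <- median(toy_sales)
stat_mean   <- mean(toy_sales)
stat_max    <- max(toy_sales)
stat_spread <- stat_range(toy_sales)

print(c(min = stat_min, median = stat_median, mean = stat_mean,
        max = stat_max, range = stat_spread))
```

Any function that maps the data to a summary value fits this template, including user-defined ones such as stat_range.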

summary(Carseats$Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   5.390   7.490   7.496   9.320  16.270

Roughly, a quantile for a proportion $$p$$ is a value $$x$$ for which $$p$$ of the data are less than or equal to $$x$$. The first quartile, median, and third quartile are the quantiles for $$p=0.25$$, $$p=0.5$$, and $$p=0.75$$, respectively.

## Visual Summary 1: Box Plot, Jitter Plot

library(ggplot2); summary(Carseats$Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   5.390   7.490   7.496   9.320  16.270
ggplot(Carseats, aes(x="All",y=Sales)) + labs(x=NULL) + geom_boxplot() + coord_flip()
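The numbers behind a box plot can be computed directly from the quartiles. The sketch below uses a made-up vector (toy_sales is a hypothetical name, not the Carseats data) and assumes the 1.5 × IQR whisker rule that geom_boxplot() documents as its default. Note that quantile() interpolates between data points by default, which is why the definition of a quantile holds only "roughly".

```r
# Toy data (made up for illustration; not the Carseats values)
toy_sales <- c(0.0, 4.2, 5.0, 7.5, 7.5, 9.1, 11.3, 25.0)

# Quartiles: the quantiles for p = 0.25, 0.5, 0.75 (edges and middle of the box)
q <- quantile(toy_sales, probs = c(0.25, 0.5, 0.75), names = FALSE)

# Interquartile range and the 1.5 * IQR fences
iqr         <- q[3] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside   <- toy_sales[toy_sales >= lower_fence & toy_sales <= upper_fence]
whiskers <- range(inside)

# Points beyond the fences are drawn individually as outliers
outliers <- toy_sales[toy_sales < lower_fence | toy_sales > upper_fence]

# Check the quantile definition: at least half the data are <= the median
prop_below_median <- mean(toy_sales <= q[2])

print(list(quartiles = q, whiskers = whiskers, outliers = outliers))
```

In this toy example only the value 25 falls outside the fences, so it would be plotted as a separate point.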

## Visual Summary 2: Jitter Plot

library(ggplot2);library(gridExtra); #boxplot relatives
#jitter plot
ggplot(Carseats, aes(x="All",y=Sales)) + labs(x=NULL) +
geom_jitter(position=position_jitter(height=0,width=0.25)) + coord_flip()

## Visual Summary 3: Histogram

# histogram of sales (counts per bin)
ggplot(Carseats, aes(x=Sales)) + labs(y="Count") + geom_histogram(aes(y = after_stat(count)), bins = 30)

## All of Supervised Learning

Proposal:

$Y = f(X) + \epsilon$

1. Here is some data
2. Tell me what $$f$$ is
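The setup can be illustrated by simulation. In the sketch below, the function f, the noise level, and the predictor range are all arbitrary choices made for illustration; in a real problem f is unknown and only the pairs (x, y) are observed.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Hypothetical "true" f, chosen purely for illustration
f <- function(x) 12 - 0.04 * x

n   <- 200
x   <- runif(n, min = 20, max = 200)  # made-up predictor values
eps <- rnorm(n, mean = 0, sd = 2)     # noise term epsilon
y   <- f(x) + eps                     # Y = f(X) + epsilon

# The learner is handed only (x, y), never f or eps
toy <- data.frame(x = x, y = y)
print(head(toy))
```

Step 1 of the proposal corresponds to handing over toy; step 2 corresponds to recovering something close to f from it.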

## Example: linear fit

csform <- Sales ~ Price
csmod <- lm(csform, data=Carseats)
print(csmod$coefficients)
## (Intercept)       Price
## 13.64191518 -0.05307302
ggplot(Carseats, aes(x = Price, y = Sales)) + geom_point() + geom_smooth(method = lm)
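For a single predictor, the least squares coefficients that lm() reports have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below checks this on simulated data; the data and the variable names are made up for illustration and are not the Carseats values.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated data with an arbitrary linear trend plus noise
x <- runif(100, min = 20, max = 200)
y <- 12 - 0.04 * x + rnorm(100, sd = 2)

# Closed-form least squares estimates for simple linear regression
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)
by_hand   <- c(intercept, slope)

# The same fit via lm()
fit     <- lm(y ~ x)
from_lm <- unname(coef(fit))

print(rbind(by_hand, from_lm))
```

The two rows should agree up to floating point error, which shows that the fitted line is fully determined by a handful of summary statistics of the data.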