A simulated dataset containing sales of child car seats at 400 different stores
Carseats
2018-09-18
A “statistic” is the result of applying a function (a summary) to the data: statistic <- function(data)
E.g. rank-based summaries (Min, Quantiles, Median, Max) and the Mean
library(ISLR) # the Carseats data ships with the ISLR package
summary(Carseats$Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   5.390   7.490   7.496   9.320  16.270
Roughly, a quantile for a proportion \(p\) is a value \(x\) for which \(p\) of the data are less than or equal to \(x\). The first quartile, median, and third quartile are the quantiles for \(p=0.25\), \(p=0.5\), and \(p=0.75\), respectively.
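The definition above can be checked numerically. The following is a minimal sketch in base R on a simulated sample (the sample, the seed, and the names `x`, `p`, `q`, and `prop_below` are assumptions made for illustration, not part of the Carseats analysis). It computes the three quartiles and then the proportion of observations at or below each one.

```r
# Simulated sample (an assumption for illustration; not the Carseats data)
set.seed(42)
x <- rnorm(1000)

p <- c(0.25, 0.5, 0.75)
q <- quantile(x, probs = p)   # first quartile, median, third quartile

# Proportion of observations less than or equal to each quantile
prop_below <- sapply(q, function(v) mean(x <= v))
print(rbind(quantile = q, prop_below = prop_below))
```

The proportions should land close to 0.25, 0.5, and 0.75. The match is only approximate in general, because `quantile()` interpolates between neighbouring observations and ties can shift the counts.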
library(ggplot2); summary(Carseats$Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   5.390   7.490   7.496   9.320  16.270
ggplot(Carseats, aes(x="All",y=Sales)) + labs(x=NULL) + geom_boxplot() + coord_flip()
library(ggplot2); library(gridExtra)
# Boxplot relatives: jitter plot
ggplot(Carseats, aes(x = "All", y = Sales)) + labs(x = NULL) + geom_jitter(position = position_jitter(height = 0, width = 0.25)) + coord_flip()
# Histogram of Sales
ggplot(Carseats, aes(x = Sales)) + labs(y = "Count") + geom_histogram(aes(y = after_stat(count)))
Proposal:
\[ Y = f(X) + \epsilon \]
csform <- Sales ~ Price; csmod <- lm(csform, data=Carseats); print(csmod$coefficients)
## (Intercept)       Price
## 13.64191518 -0.05307302
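To make the fitted line concrete, the two coefficients can be plugged into the linear form of the model by hand. This is a small sketch: the coefficient values are copied from the printed output, while the price of 100 and the names `b0`, `b1`, `price`, and `predicted_sales` are hypothetical choices for illustration.

```r
# Coefficients copied from the printed output above
b0 <- 13.64191518   # (Intercept)
b1 <- -0.05307302   # Price

# Hypothetical price, chosen only for illustration
price <- 100
predicted_sales <- b0 + b1 * price
print(predicted_sales)
```

The negative slope means each one-unit increase in Price lowers the predicted Sales by about 0.053. With the fitted model object at hand, `predict(csmod, newdata = data.frame(Price = 100))` performs the same calculation.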
ggplot(Carseats, aes(x = Price, y = Sales)) + geom_point() + geom_smooth(method = lm)
## (Intercept)           x
##         1.1         1.6
## (Intercept)           x      I(x^2)
##        0.74        1.75        0.69
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)
##        0.71        1.39        0.80        0.46
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)
##       0.795       1.128      -0.039       0.905       0.898
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)
##        0.47        0.62        4.86        6.75       -5.25       -6.72
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)
##        0.13        3.13        8.99      -11.11      -23.83       12.52       18.38
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7)
##       0.096       3.207      10.193     -11.078     -30.742       8.263      25.527       5.483
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7)      I(x^8)
##         1.3        -5.9        -5.1        69.9        48.8      -172.0      -131.9       123.3       101.2
Is this a better fit to the data?
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7)      I(x^8)      I(x^9)
##        -1.1        34.8      -127.9      -379.9      1186.9      1604.8     -2475.4     -2627.6      1499.6      1448.1
Is this a better fit to the data?
Which do you prefer and why?
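One way to approach this question is to fit each polynomial on part of the data and check it on points it has not seen. The sketch below uses base R only and simulated data, since the data behind the coefficient output is not shown; the data-generating formula, the seed, the half-and-half split, and the names `dat`, `train`, `test`, `rmse`, and `results` are all assumptions made for illustration.

```r
# Simulated data (an assumption; not the data behind the slides)
set.seed(1)
n <- 40
x <- runif(n, -1, 1)
y <- 1 + 1.5 * x + 0.7 * x^2 + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Hold out half of the points to check each fit on data it has not seen
train <- dat[1:20, ]
test  <- dat[21:40, ]

# Root mean squared error of a fitted model on a given data frame
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

degrees <- 1:9
results <- t(sapply(degrees, function(d) {
  # raw = TRUE gives the same x, I(x^2), ... terms as in the output above
  fit <- lm(y ~ poly(x, d, raw = TRUE), data = train)
  c(degree = d, train_rmse = rmse(fit, train), test_rmse = rmse(fit, test))
}))
print(round(results, 3))
```

Because each model contains the previous one, the training error cannot go up as terms are added, so a lower training error is not by itself evidence of a better fit. The held-out column is the one to compare across degrees.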
(Or, follow along and see if you can do it in Python.)