2017-01-17

Statistics

The systematic collection and arrangement of numerical facts or data of any kind;
(also) the branch of science or mathematics concerned with the analysis and interpretation of numerical data and appropriate ways of gathering such data. [OED]

Why statistics?

• Can tell you if you should be surprised by your data
• Can help predict what future data will look like

Data

```
## We'll use data on the duration and spacing of eruptions
## of the Old Faithful geyser.
## Data are eruption duration and waiting time to next eruption (both in minutes).
str(faithful) # display the internal structure of an R object
```
```
## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...
```

Data summaries

A "statistic" is the result of applying a function (a summary) to the data: `statistic <- function(data)`

E.g. Min, Quantiles, Median, Mean, Max (all but the mean are based on ranks)

`summary(faithful$eruptions)`
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.600   2.163   4.000   3.488   4.454   5.100
```
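To make the `statistic <- function(data)` pattern concrete, here is a minimal sketch of a hand-written statistic. The name `spread` is made up for illustration and is not part of base R; it simply returns the distance between the largest and smallest observation.

```r
# Hypothetical statistic: a function that maps the data to a single number
spread <- function(data) {
  max(data) - min(data)
}

# Apply it to the eruption durations (minutes)
eruption_spread <- spread(faithful$eruptions)
print(eruption_spread)
```

Any function of the data qualifies, including the built-in `mean()`, `median()`, and `range()`.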

Roughly, a quantile for a proportion \(p\) is a value \(x\) for which a proportion \(p\) of the data are less than or equal to \(x\). The first quartile, median, and third quartile are the quantiles for \(p=0.25\), \(p=0.5\), and \(p=0.75\), respectively.
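The definition above can be checked directly with base R's `quantile()`. This is a rough sketch: `quantile()` interpolates between data points by default, so the proportion at or below each quantile is close to, but not always exactly, \(p\). The variable names are chosen here for illustration.

```r
x <- faithful$eruptions
p <- c(0.25, 0.5, 0.75)

# Quantiles for the three proportions (first quartile, median, third quartile)
q <- quantile(x, probs = p)
print(q)

# Proportion of the data less than or equal to each quantile
prop_below <- sapply(q, function(v) mean(x <= v))
print(prop_below)
```

The first and third values of `q` should match the `1st Qu.` and `3rd Qu.` columns that `summary()` reports, up to rounding.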

Visual Summary 1: Box Plot

`boxplot(faithful$eruptions, main="Eruption time", horizontal=TRUE)`

Visual Summary 1.5: Box Plot, Jitter Plot

```
library(ggplot2)
library(gridExtra)
# box plot and relatives
b1 <- ggplot(faithful, aes(x = "All", y = eruptions)) + labs(x = NULL) + geom_boxplot()
# jitter plot: spread points horizontally only, so the y values stay exact
b2 <- ggplot(faithful, aes(x = "All", y = eruptions)) + labs(x = NULL) +
  geom_jitter(position = position_jitter(height = 0, width = 0.25))
# show both plots side by side
grid.arrange(b1, b2, nrow = 1)
```

Visual Summary 2: Histogram

```
## Construct histogram of eruption times, plot data points on the x axis
hist(faithful$eruptions, main = "Eruption time", xlab = "Time (minutes)",
     ylab = "Count")
points(x = faithful$eruptions, y = rep(0, length(faithful$eruptions)), lwd = 4, col = "blue")
```

Visual Summary 2.5: Histogram

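```
## Construct different histogram of eruption times (proportions, not counts)
## after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
ggplot(faithful, aes(x = eruptions)) + labs(y = "Proportion") +
  geom_histogram(aes(y = after_stat(count / sum(count))))
```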
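The proportion on the y axis is each bin's count divided by the total count. A small base-R sketch of the same arithmetic, assuming `hist()`'s default bins rather than the ones `geom_histogram()` picks:

```r
# Bin the eruption times without drawing anything
h <- hist(faithful$eruptions, plot = FALSE)

# Proportion per bin = bin count / total count
bin_prop <- h$counts / sum(h$counts)
print(round(bin_prop, 3))

# The proportions add up to 1
print(sum(bin_prop))
```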

Visual Summary 3: Empirical Cumulative Distribution Function

```
## Construct ECDF of eruption times, plot data points on the x axis
plot(ecdf(faithful$eruptions), main = "Eruption time", xlab = "Time (minutes)",
     ylab = "Proportion")
points(x = faithful$eruptions, y = rep(0, length(faithful$eruptions)), lwd = 4, col = "blue")
```
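The ECDF at a value \(x\) is the proportion of the data less than or equal to \(x\), which ties it to the quantile definition given earlier. A minimal sketch using base R's `ecdf()`, which returns a function; the cutoff of 3 minutes is an arbitrary choice for illustration:

```r
x <- faithful$eruptions

# ecdf() returns a step function Fn; Fn(t) is the proportion of x <= t
Fn <- ecdf(x)
cutoff <- 3  # arbitrary example value, in minutes

by_ecdf <- Fn(cutoff)
by_hand <- mean(x <= cutoff)
print(c(by_ecdf, by_hand))
```

Both numbers should agree, since the ECDF is exactly this running proportion.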

Visual Summary 3.5: Empirical Cumulative Distribution Function

```
## Different picture of ECDF, with jitter plot
ggplot(faithful, aes(x = eruptions)) + labs(x = "Eruption Time", y = "Proportion") +
  stat_ecdf() + geom_jitter(aes(y = 0.125), position = position_jitter(width = 0, height = 0.1))
```