2017-01-17

Statistics

The systematic collection and arrangement of numerical facts or data of any kind;
(also) the branch of science or mathematics concerned with the analysis and interpretation of numerical data and appropriate ways of gathering such data. [OED]

Why statistics?

  • Can tell you if you should be surprised by your data
  • Can help predict what future data will look like

Data

## We'll use data on the duration and spacing of eruptions
## of the Old Faithful geyser
## Variables are eruption duration and waiting time to the next eruption
data ("faithful") # load data
str (faithful) # display the internal structure of an R object
## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...

Old Faithful (photo by Jon Sullivan, pdPhoto)

Data summaries

A "statistic" is the result of applying a function (a summary) to the data: statistic <- f(data)

E.g. rank-based summaries (Min, Quantiles, Median, Max) and the Mean

summary (faithful$eruptions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   2.163   4.000   3.488   4.454   5.100
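
The idea of a statistic as a function of the data can be sketched directly in R. This is only an illustration: the helper name my_range is hypothetical, not something the slides define.

```r
## A statistic is just a function applied to the data
x <- faithful$eruptions

## Hypothetical helper: spread between the largest and smallest value
my_range <- function(data) max(data) - min(data)

stat_range  <- my_range(x)   # our own statistic
stat_mean   <- mean(x)       # built-in statistics work the same way
stat_median <- median(x)

c(range = stat_range, mean = stat_mean, median = stat_median)
```

Any function that maps the data to a number (or a few numbers) counts as a statistic in this sense, including summary() itself.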

Roughly, a quantile for a proportion \(p\) is a value \(x\) for which \(p\) of the data are less than or equal to \(x\). The first quartile, median, and third quartile are the quantiles for \(p=0.25\), \(p=0.5\), and \(p=0.75\), respectively.
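
As a rough check of this definition, the sketch below computes the three quartiles with quantile() and then measures what proportion of the data falls at or below each one. It assumes R's default interpolation rule (type 7); other rules give slightly different values, and ties in the data mean the proportions only approximately match \(p\).

```r
## Quartiles of the eruption durations
x <- faithful$eruptions
p <- c(0.25, 0.5, 0.75)
q <- quantile(x, probs = p)   # default: type 7, linear interpolation
q

## Proportion of the data less than or equal to each quantile;
## each should be close to the corresponding p
prop_below <- sapply(q, function(v) mean(x <= v))
prop_below
```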

Visual Summary 1: Box Plot

boxplot (faithful$eruptions, main="Eruption time", horizontal=TRUE)

Visual Summary 1.5: Box Plot, Jitter Plot

library(ggplot2)    # plotting
library(gridExtra)  # grid.arrange() for side-by-side plots
## Box plot
b1<-ggplot(faithful, aes(x="All",y=eruptions)) + labs(x=NULL) + geom_boxplot()
## Jitter plot
b2<-ggplot(faithful, aes(x="All",y=eruptions)) + labs(x=NULL) + 
    geom_jitter(position=position_jitter(height=0,width=0.25))
grid.arrange(b1, b2, nrow=1)

Visual Summary 2: Histogram

## Construct histogram of eruption times, plot data points on the x axis
hist (faithful$eruptions, main="Eruption time", xlab="Time (minutes)",
      ylab="Count")
points (x=faithful$eruptions,y=rep(0,length(faithful$eruptions)), lwd=4, col='blue')

Visual Summary 2.5: Histogram

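## Construct a histogram of eruption times showing proportions instead of counts
## (after_stat() needs ggplot2 >= 3.3.0; older versions use ..count../sum(..count..))
ggplot(faithful, aes(x=eruptions)) + labs(y="Proportion") +
    geom_histogram(aes(y = after_stat(count / sum(count))), bins = 30)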

Visual Summary 3: Empirical Cumulative Distribution Function

## Construct ECDF of eruption times, plot data points on the x axis
plot(ecdf(faithful$eruptions), main="Eruption time", xlab="Time (minutes)",
      ylab="Proportion")
points (x=faithful$eruptions,y=rep(0,length(faithful$eruptions)), lwd=4, col='blue')

Visual Summary 3.5: Empirical Cumulative Distribution Function

## Different picture of ECDF, with jitter plot
ggplot(faithful, aes(x=eruptions)) + labs(x="Eruption Time",y="Proportion") + 
    stat_ecdf() + geom_jitter(aes(y=0.125),position=position_jitter(width=0,height=0.1))
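
To connect the ECDF plots with the data summaries, here is a minimal sketch of how the ECDF can be evaluated numerically. The name Fhat is an illustrative choice; the 3-minute cutoff is arbitrary.

```r
## ecdf() returns a function: Fhat(t) is the proportion of data <= t
x <- faithful$eruptions
Fhat <- ecdf(x)

at_three <- Fhat(3)         # proportion of eruptions lasting at most 3 minutes
by_hand  <- mean(x <= 3)    # the same proportion computed directly

c(ecdf = at_three, by_hand = by_hand)

## The ECDF is 0 below the smallest value and 1 at the largest
c(Fhat(min(x) - 0.1), Fhat(max(x)))
```

Reading the plot at height \(p\) and looking down to the x axis gives, roughly, the quantile for \(p\).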