The systematic collection and arrangement of numerical facts or data of any kind;
(also) the branch of science or mathematics concerned with the analysis and interpretation of numerical data and appropriate ways of gathering such data. [OED]

Why statistics?

  • Can tell you if you should be surprised by your data
  • Can help predict what future data will look like


## We'll use data on the duration and spacing of eruptions
## of the Old Faithful geyser.
## Variables: eruption duration and waiting time to the next eruption (both in minutes)
data ("faithful") # load data
str (faithful) # display the internal structure of an R object
## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...

Old Faithful. Photo by Jon Sullivan.

Data summaries

A "statistic" is the result of applying a function (a summary) to the data: statistic <- f(data)

E.g. rank-based summaries (Min, Quantiles, Median, Max) and the Mean

summary (faithful$eruptions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   2.163   4.000   3.488   4.454   5.100

Roughly, a quantile for a proportion \(p\) is a value \(x\) for which \(p\) of the data are less than or equal to \(x\). The first quartile, median, and third quartile are the quantiles for \(p=0.25\), \(p=0.5\), and \(p=0.75\), respectively.
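
The quartiles reported by summary() can be reproduced with quantile(). This is a minimal sketch: the variable names p, q and prop_below are illustrative, and R's default interpolation rule is only one of several conventions, so the values need not match the rough definition above exactly.

```r
# Quantiles of eruption duration for p = 0.25, 0.5, 0.75
data("faithful")
p <- c(0.25, 0.5, 0.75)
q <- quantile(faithful$eruptions, probs = p)
print(q)

# Check the rough definition: about a proportion p of the data
# should be less than or equal to the corresponding quantile
prop_below <- sapply(q, function(x) mean(faithful$eruptions <= x))
print(prop_below)
```

Ties in the data and the interpolation rule mean the proportions are only approximately equal to p.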

Visual Summary 1: Box Plot

boxplot (faithful$eruptions, main="Eruption time", horizontal=TRUE)

Visual Summary 1.5: Box Plot, Jitter Plot

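library(ggplot2); library(gridExtra)  # ggplot2 for plots, gridExtra to arrange them
# box plot
b1 <- ggplot(faithful, aes(x="All", y=eruptions)) + labs(x=NULL) + geom_boxplot()
# jitter plot (points are spread horizontally only, so durations are not distorted)
b2 <- ggplot(faithful, aes(x="All", y=eruptions)) + labs(x=NULL) +
  geom_jitter(width=0.2, height=0)
grid.arrange(b1, b2, nrow=1)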

Visual Summary 2: Histogram

## Construct histogram of eruption times, plot data points on the x axis
hist (faithful$eruptions, main="Eruption time", xlab="Time (minutes)")
points (x=faithful$eruptions,y=rep(0,length(faithful$eruptions)), lwd=4, col='blue')

Visual Summary 2.5: Histogram

## Construct different histogram of eruption times
## y axis shows the proportion of observations in each bin instead of the count
ggplot(faithful, aes(x=eruptions)) + labs(y="Proportion") + geom_histogram(aes(y = after_stat(count / sum(count))))


  • A common assumption is that the data consist of replicates that are "the same."
  • They come from "the same population"
  • They come from "the same process"
  • The goal of data analysis is to understand what the data tell us about the population.


We often assume that we can treat items as if they were distributed "randomly."

  • "That's so random!"
  • Result of a coin flip is random
  • Passengers were screened at random
  • "random" does not mean "uniform"
  • Mathematical formalism: events and probability

Sample Spaces and Events

  • Sample space \({\mathcal{S}}\) is the set of all possible outcomes we might observe. Depends on context.
    • Coin flips: \({\mathcal{S}}= \{ h, t \}\)
    • Eruption times: \({\mathcal{S}}= {\mathbb{R}}^{\ge 0}\)
    • (Eruption times, Eruption waits): \({\mathcal{S}}= {\mathbb{R}}^{\ge 0} \times {\mathbb{R}}^{\ge 0}\)
  • An event is a subset of the sample space.
    • Observe heads: \(\{ h \}\)
    • Observe eruption for 2 minutes: \(\{ 2.0 \}\)
    • Observe eruption with length between 1 and 2 minutes and wait between 50 and 70 minutes: \([1,2] \times [50,70]\).

Event Probabilities

Any event can be assigned a probability between \(0\) and \(1\) (inclusive).

  • \(\Pr(\{h\}) = 0.5\)
  • \(\Pr([1,2] \times [50,70]) = 0.10\)
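
The probability of an event is not observed directly, but the proportion of observations that fall inside the event gives a rough estimate. A sketch using the faithful data: the names in_event and p_hat are hypothetical, and the resulting proportion is an estimate from 272 observations, not necessarily the value quoted above.

```r
data("faithful")

# Indicator of the event [1,2] x [50,70]:
# eruption between 1 and 2 minutes AND wait between 50 and 70 minutes
in_event <- with(faithful,
                 eruptions >= 1 & eruptions <= 2 &
                 waiting >= 50 & waiting <= 70)

# Proportion of observations that fall in the event
p_hat <- mean(in_event)
print(p_hat)
```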

Interpreting probability:
Objectivist view

  • Suppose we observe \(n\) replications of an experiment.
  • Let \(n(A)\) be the number of times event \(A\) was observed
  • \(\lim_{n \to \infty} \frac{n(A)}{n} = \Pr(A)\)
  • This is (loosely) Borel's Law of Large Numbers

  • Subjective interpretation is possible as well. ("Bayesian" statistics is related to this idea – more later.)
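
The relative-frequency limit can be illustrated by simulation. This is a sketch that assumes a fair coin and independent flips; the seed and the number of flips are arbitrary choices.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 10000    # number of replications (arbitrary)

# Simulate n independent flips of a fair coin: 1 = heads, 0 = tails
flips <- rbinom(n, size = 1, prob = 0.5)

# Relative frequency n(A)/n of the event A = {h} after each flip
running_freq <- cumsum(flips) / seq_len(n)

# The running frequency settles near Pr(A) = 0.5 as n grows
plot(running_freq, type = "l", log = "x",
     xlab = "Number of flips n", ylab = "n(A) / n")
abline(h = 0.5, lty = 2)
```

A finite simulation can only suggest the limit; the running frequency still fluctuates around 0.5 for any finite n.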

Abstraction of data: Random Variable

  • We often reduce data to numbers.
    • "\(1\) means heads, \(0\) means tails."
  • A random variable is a mapping from the sample space to a number (or vector).

  • Usually rendered in uppercase italics

  • \(X\) is every statistician's favourite, followed closely by \(Y\) and \(Z\).

  • "Realizations" of \(X\) are written in lower case, e.g. \(x_1\), \(x_2\), …

  • We will write the set of possible realizations as: \(\mathcal{X}\) for \(X\),
    \(\mathcal{Y}\) for \(Y\), and so on.
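
The coin-flip encoding can be written out directly. In this sketch the function X and the vectors outcomes and x are illustrative names, and the seed and the number of flips are arbitrary.

```r
set.seed(2)   # arbitrary seed, for reproducibility only

# Ten observed outcomes from the sample space S = {h, t}
outcomes <- sample(c("h", "t"), size = 10, replace = TRUE)

# The random variable: a mapping from outcomes to numbers
# ("1 means heads, 0 means tails")
X <- function(s) ifelse(s == "h", 1, 0)

# Realizations x_1, ..., x_10
x <- X(outcomes)
print(rbind(outcomes, x))
```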

Distributions of random variables

  • Realizations are observed according to probabilities specified by the distribution of \(X\)

  • Can think of \(X\) as an "infinite supply of data"

  • Separate realizations of the same r.v. \(X\) are "independent and identically distributed" (i.i.d.)

  • Formal definition of a random variable requires measure theory, not covered here

Probabilities for random variables

Random variable \(X\), realization \(x\).

  • What is the probability we see \(x\)?
    • \(\Pr(X=x)\), (if lazy, \(\Pr(x)\), but don't do this)
  • Subsets of the set of possible values \(\mathcal{X}\) correspond to events.
    • \(\Pr(X > 0)\) is the probability that I see a realization that is positive.

Discrete Random Variables

  • Discrete random variables take values from a countable set
    • Coin flip \(X\)
      • \(\mathcal{X} = \{0,1\}\)
    • Number of snowflakes that fall in a day \(Y\)
      • \(\mathcal{Y} = \{0, 1, 2, ...\}\)

Probability Mass Function (PMF)

  • For a discrete \(X\), \(p_{X}(x)\) gives \(\Pr(X = x)\).
  • Requirement: \(\sum_{x \in \mathcal{X}} p_{X}(x) = 1\).
    • Note that the sum can have an infinite number of terms.
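
As a worked case of a sum with infinitely many terms, consider a hypothetical example not taken from the slides: let \(Y\) be the number of tails observed before the first head in independent flips of a fair coin, so \(\mathcal{Y} = \{0, 1, 2, ...\}\). Seeing \(y\) tails followed by a head has probability \((1/2)^{y} \cdot (1/2)\), and the geometric series confirms the requirement:

\[
p_{Y}(y) = \left(\tfrac{1}{2}\right)^{y+1},
\qquad
\sum_{y=0}^{\infty} p_{Y}(y)
= \tfrac{1}{2} \sum_{y=0}^{\infty} \left(\tfrac{1}{2}\right)^{y}
= \tfrac{1}{2} \cdot \frac{1}{1 - \tfrac{1}{2}}
= 1 .
\]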

Probability Mass Function (PMF) Example

\(X\) is number of "heads" in 20 flips of a fair coin
\(\mathcal{X} = \{0,1,...,20\}\)
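
This PMF can be computed and plotted in R. The sketch assumes the 20 flips are independent, in which case \(X\) follows a binomial distribution with size 20 and probability 0.5; the variable names are illustrative.

```r
# Possible values of X: 0, 1, ..., 20 heads
x <- 0:20

# PMF of a binomial(20, 0.5) random variable at each value
pmf <- dbinom(x, size = 20, prob = 0.5)

# Bar plot of the PMF
barplot(pmf, names.arg = x, xlab = "x (number of heads)", ylab = "Pr(X = x)")

# Requirement: the probabilities sum to 1 (up to floating-point error)
print(sum(pmf))
```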

Cumulative Distribution Function (CDF)

  • For a discrete \(X\), \(P_{X}(x)\) gives \(\Pr(X \le x)\).
  • Requirements:
    • \(P_{X}\) is nondecreasing
    • \(\sup_{x \in \mathcal{X}} P_{X}(x) = 1\)
  • Note:
    • \(P_X(b) = \sum_{x \le b} p_X(x)\)
    • \(\Pr(a < X \le b) = P_X(b) - P_X(a)\)

Cumulative Distribution Function (CDF) Example

\(X\) is number of "heads" in 20 flips of a fair coin
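
The CDF for this example can be sketched the same way, again assuming independent flips so that \(X\) is binomial with size 20 and probability 0.5. The interval endpoints 5 and 10 are arbitrary illustrative choices.

```r
x <- 0:20

# CDF values Pr(X <= x) for x = 0, ..., 20
cdf <- pbinom(x, size = 20, prob = 0.5)

# The CDF of a discrete variable is a step function
plot(stepfun(x, c(0, cdf)), verticals = FALSE, pch = 16,
     main = "CDF of X", xlab = "x (number of heads)", ylab = "Pr(X <= x)")

# Pr(a < X <= b) as a difference of CDF values, with a = 5 and b = 10
p_interval <- pbinom(10, size = 20, prob = 0.5) - pbinom(5, size = 20, prob = 0.5)

# The same probability obtained by summing the PMF over 6, ..., 10
p_sum <- sum(dbinom(6:10, size = 20, prob = 0.5))
print(c(p_interval, p_sum))
```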

Continuous random variables

  • Continuous random variables take values in intervals of \({\mathbb{R}}\)
  • Mass \(M\) of a star
    • \(\mathcal{M} = (0,\infty)\)
  • Oxygen saturation \(S\) of blood
    • \(\mathcal{S} = [0,1]\)

  • For a continuous r.v. \(X\), \(\Pr(X = x) = 0\) for all \(x\).
    There is no probability mass function.
  • However, \(\Pr(X \in (a,b)) \ne 0\) in general.

Probability Density Function (PDF)

  • For continuous \(X\), \(\Pr(X = x) = 0\) and PMF does not exist.
  • However, we define the Probability Density Function \(f_X\):
    • \(\Pr(a \le X \le b) = \int_{a}^{b} f_X(x) {\mathrm{\,d}}x\)
  • Requirement:
    • \(\forall x \;f_X(x) \ge 0\), \(\int_{-\infty}^\infty f_X(x) {\mathrm{\,d}}x = 1\)

Probability Density Function (PDF) Example
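
A minimal sketch, assuming \(X\) has a standard normal distribution. This is an illustrative choice, not one taken from the text above, and the interval \([-1, 1]\) is likewise arbitrary.

```r
# Plot the density f_X of a standard normal random variable
curve(dnorm(x), from = -4, to = 4, xlab = "x", ylab = "f_X(x)")

# Requirement: the density integrates to 1 over the real line
area_total <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Pr(-1 <= X <= 1) as the integral of the density over [-1, 1]
p_ab <- integrate(dnorm, lower = -1, upper = 1)$value

# The same probability from the normal CDF, for comparison
p_cdf <- pnorm(1) - pnorm(-1)
print(c(area_total, p_ab, p_cdf))
```

Numerical integration is approximate, so the two probabilities agree only up to the integration tolerance.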