2017-09-19

## Statistics

The systematic collection and arrangement of numerical facts or data of any kind;
(also) the branch of science or mathematics concerned with the analysis and interpretation of numerical data and appropriate ways of gathering such data. [OED]

## Why statistics?

• Can tell you if you should be surprised by your data
• Can help predict what future data will look like

## Data

## We'll use data on the duration and spacing of eruptions
## of the Old Faithful geyser
## Data are eruption duration and waiting time to next eruption
str(faithful) # display the internal structure of an R object
## 'data.frame':    272 obs. of  2 variables:
##  $ eruptions: num  3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num  79 54 74 62 85 55 88 85 51 85 ...

## Data summaries

A "statistic" is the result of applying a function (summary) to the data: statistic <- function(data)

E.g. rank-based: Min, Quantiles, Median, Max; also the Mean
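
As a minimal sketch of the idea that a statistic is a function applied to the data, the base R calls below compute each summary for the eruption durations. The variable name `x` is only a convenience introduced here.

```r
# Each summary below is a function applied to the data.
x <- faithful$eruptions      # eruption durations in minutes (built-in dataset)

min(x)                       # smallest observation
quantile(x, c(0.25, 0.75))   # lower and upper quartiles
median(x)                    # middle observation by rank
mean(x)                      # arithmetic average
max(x)                       # largest observation

# summary() bundles the six values above into one call
summary(x)
```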

## Visual Summary 1.5: Box Plot, Jitter Plot

library(ggplot2); library(gridExtra)
# box plot
b1 <- ggplot(faithful, aes(x="All", y=eruptions)) + labs(x=NULL) + geom_boxplot()
# jitter plot
b2 <- ggplot(faithful, aes(x="All", y=eruptions)) + labs(x=NULL) +
  geom_jitter(position=position_jitter(height=0, width=0.25))
# show both plots side by side
grid.arrange(b1, b2, nrow=1)

## Visual Summary 2: Histogram

## Construct histogram of eruption times, plot data points on the x axis
hist(faithful$eruptions, main="Eruption time", xlab="Time (minutes)", ylab="Count")
points(x=faithful$eruptions, y=rep(0, length(faithful$eruptions)), lwd=4, col='blue')

## Visual Summary 2.5: Histogram

## Construct different histogram of eruption times
ggplot(faithful, aes(x=eruptions)) + labs(y="Proportion") + geom_histogram(aes(y = after_stat(count / sum(count))))

## Replicates

• Common assumption is that data consists of replicates that are "the same."
• Come from "the same population"
• Come from "the same process"
• The goal of data analysis is to understand what the data tell us about the population.

## Randomness

We often assume that we can treat items as if they were distributed "randomly."

• "That's so random!"
• Result of a coin flip is random
• Passengers were screened at random
• "random" does not mean "uniform"
• Mathematical formalism: events and probability

## Sample Spaces and Events

• Sample space $${\mathcal{S}}$$ is the set of all possible outcomes we might observe. Depends on context.
• Coin flips: $${\mathcal{S}}= \{ h, t \}$$
• Eruption times: $${\mathcal{S}}= {\mathbb{R}}^{\ge 0}$$
• (Eruption times, Eruption waits): $${\mathcal{S}}= {\mathbb{R}}^{\ge 0} \times {\mathbb{R}}^{\ge 0}$$
• An event is a subset of the sample space.
• Observe heads: $$\{ h \}$$
• Observe eruption for 2 minutes: $$\{ 2.0 \}$$
• Observe eruption with length between 1 and 2 minutes and wait between 50 and 70 minutes: $$[1,2] \times [50,70]$$.

## Event Probabilities

Any event can be assigned a probability between $$0$$ and $$1$$ (inclusive).

• $$\Pr(\{h\}) = 0.5$$
• $$\Pr([1,2] \times [50,70]) = 0.10$$
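
One rough way to attach a number to such an event is the fraction of observed data points that fall inside it. The sketch below does this for the event $$[1,2] \times [50,70]$$ using the `faithful` data; the names `in_A`, `n_A`, and `n` are introduced here for illustration, and the observed fraction need not equal the illustrative value quoted above.

```r
# Event A: duration in [1, 2] minutes AND wait in [50, 70] minutes.
in_A <- with(faithful,
             eruptions >= 1 & eruptions <= 2 &
             waiting >= 50 & waiting <= 70)

n_A <- sum(in_A)          # number of observations that fall in A
n   <- nrow(faithful)     # total number of observations
n_A / n                   # observed fraction, a rough estimate of Pr(A)
```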

## Interpreting probability: Objectivist view

• Suppose we observe $$n$$ replications of an experiment.
• Let $$n(A)$$ be the number of times event $$A$$ was observed
• $$\lim_{n \to \infty} \frac{n(A)}{n} = \Pr(A)$$
• This is (loosely) Borel's Law of Large Numbers

• Subjective interpretation is possible as well. ("Bayesian" statistics is related to this idea – more later.)
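
The limiting-frequency statement can be illustrated with a small simulation. This is a sketch under assumptions: a fair coin, an arbitrary seed, and 10,000 simulated flips. A finite simulation can only suggest the limit, not prove it.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility
n_max <- 10000                                # number of simulated flips (assumption)
flips <- sample(c("h", "t"), n_max, replace = TRUE)  # fair coin

# Running relative frequency n(A)/n of the event A = {h}
running_freq <- cumsum(flips == "h") / seq_len(n_max)

running_freq[c(10, 100, 1000, 10000)]         # should drift toward 0.5 as n grows

plot(running_freq, type = "l", log = "x",
     xlab = "n (number of flips)", ylab = "n(A) / n")
abline(h = 0.5, lty = 2)                      # reference line at Pr({h}) = 0.5
```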

## Abstraction of data: Random Variable

• We often reduce data to numbers.
• "$$1$$ means heads, $$0$$ means tails."
• A random variable is a mapping from the sample space to a number (or vector).

• Usually rendered in uppercase italics

• $$X$$ is every statistician's favourite, followed closely by $$Y$$ and $$Z$$.

• "Realizations" of $$X$$ are written in lower case, e.g. $$x_1$$, $$x_2$$, …

• We will write the set of possible realizations as: $$\mathcal{X}$$ for $$X$$,
$$\mathcal{Y}$$ for $$Y$$, and so on.
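
A minimal sketch of the coin-flip random variable: `S` is the sample space, `X` is written as an ordinary R function from outcomes to numbers, and applying it to observed outcomes gives realizations. The function and variable names are assumptions made for this illustration.

```r
# Sample space for one coin flip
S <- c("h", "t")

# Random variable X: maps each outcome to a number (1 = heads, 0 = tails)
X <- function(outcome) ifelse(outcome == "h", 1, 0)

set.seed(2)                               # arbitrary seed
outcomes <- sample(S, 5, replace = TRUE)  # five observed outcomes
x <- X(outcomes)                          # realizations x_1, ..., x_5
x

# The set of possible realizations of X
sort(unique(X(S)))
```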

## Distributions of random variables

• Realizations are observed according to probabilities specified by the distribution of $$X$$

• Can think of $$X$$ as an "infinite supply of data"

• Separate realizations of the same r.v. $$X$$ are "independent and identically distributed" (i.i.d.)

• Formal definition of a random variable requires measure theory, not covered here

## Probabilities for random variables

Random variable $$X$$, realization $$x$$.

• What is the probability we see $$x$$?
• $$\Pr(X=x)$$, (if lazy, $$\Pr(x)$$, but don't do this)
• Subsets of the set of possible realizations $$\mathcal{X}$$ correspond to events.
• $$\Pr(X > 0)$$ is the probability that I see a realization that is positive.

## Discrete Random Variables

• Discrete random variables take values from a countable set
• Coin flip $$X$$
• $$\mathcal{X} = \{0,1\}$$
• Number of snowflakes that fall in a day $$Y$$
• $$\mathcal{Y} = \{0, 1, 2, ...\}$$

## Probability Mass Function (PMF)

• For a discrete $$X$$, $$p_{X}(x)$$ gives $$\Pr(X = x)$$.
• Requirement: $$\sum_{x \in \mathcal{X}} p_{X}(x) = 1$$.
• Note that the sum can have an infinite number of terms.
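
To see a sum with infinitely many terms approach 1, the sketch below uses a Poisson PMF on $$\{0, 1, 2, ...\}$$. The Poisson model and the rate `lambda = 3` are assumptions chosen only for illustration.

```r
lambda <- 3                                  # hypothetical rate, chosen for illustration
partial <- cumsum(dpois(0:30, lambda))       # partial sums over x = 0, 1, ..., 30
partial[c(1, 5, 10, 31)]                     # approaches 1 as more terms are added
```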

## Probability Mass Function (PMF) Example

$$X$$ is number of "heads" in 20 flips of a fair coin
$$\mathcal{X} = \{0,1,...,20\}$$
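
The PMF for this example can be computed with `dbinom`, since the number of heads in 20 fair flips follows a binomial distribution. The sketch below checks the sum-to-one requirement and draws the PMF as a bar plot; the variable names are introduced here.

```r
n_flips <- 20
support <- 0:n_flips                                # the set of possible values
pmf <- dbinom(support, size = n_flips, prob = 0.5)  # p_X(x) = Pr(X = x)

sum(pmf)                                            # the PMF must sum to 1
support[which.max(pmf)]                             # most probable number of heads

barplot(pmf, names.arg = support,
        xlab = "x (number of heads)", ylab = "Pr(X = x)")
```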

## Cumulative Distribution Function (CDF)

• For a discrete $$X$$, $$P_{X}(x)$$ gives $$\Pr(X \le x)$$.
• Requirements:
• $$P_X$$ is nondecreasing
• $$\sup_{x \in \mathcal{X}} P_{X}(x) = 1$$
• Note:
• $$P_X(b) = \sum_{x \le b} p_X(x)$$
• $$\Pr(a < X \le b) = P_X(b) - P_X(a)$$

## Cumulative Distribution Function (CDF) Example

$$X$$ is number of "heads" in 20 flips of a fair coin
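
A sketch of the CDF for the same 20-flip example: the running sum of the PMF should agree with the built-in `pbinom`, and a difference of two CDF values gives an interval probability. The endpoints `a = 8` and `b = 12` are arbitrary choices for illustration.

```r
n_flips <- 20
support <- 0:n_flips
pmf <- dbinom(support, size = n_flips, prob = 0.5)

cdf <- cumsum(pmf)                              # P_X(b) = sum of p_X(x) for x <= b
all.equal(cdf, pbinom(support, n_flips, 0.5))   # compare with the built-in CDF

# Pr(a < X <= b) = P_X(b) - P_X(a), with a = 8, b = 12 (arbitrary choice)
a <- 8; b <- 12
pbinom(b, n_flips, 0.5) - pbinom(a, n_flips, 0.5)
sum(dbinom((a + 1):b, n_flips, 0.5))            # same probability, summed from the PMF

# Step-function plot of the CDF
plot(stepfun(support, c(0, cdf)), verticals = FALSE,
     main = "", xlab = "x (number of heads)", ylab = "Pr(X <= x)")
```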

## Continuous random variables

• Continuous random variables take values in intervals of $${\mathbb{R}}$$
• Mass $$M$$ of a star
• $$\mathcal{M} = (0,\infty)$$
• Oxygen saturation $$S$$ of blood
• $$\mathcal{S} = [0,1]$$

• For a continuous r.v. $$X$$, $$\Pr(X = x) = 0$$ for all $$x$$.
There is no probability mass function.
• However, $$\Pr(X \in (a,b)) \ne 0$$ in general.

## Probability Density Function (PDF)

• For continuous $$X$$, $$\Pr(X = x) = 0$$ and PMF does not exist.
• However, we define the Probability Density Function $$f_X$$:
• $$\Pr(a \le X \le b) = \int_{a}^{b} f_X(x) {\mathrm{\,d}}x$$
• Requirement:
• $$\forall x \;f_X(x) \ge 0$$, $$\int_{-\infty}^\infty f_X(x) {\mathrm{\,d}}x = 1$$
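
The integral definition can be checked numerically. The sketch below takes the standard normal density `dnorm` as the example $$f_X$$ (an assumption made only for illustration) and uses base R's `integrate`.

```r
# Standard normal density as the example f_X (an assumption for illustration)
f <- dnorm

# Pr(a <= X <= b) is the area under f between a and b
a <- -1; b <- 1
area <- integrate(f, lower = a, upper = b)$value
area
pnorm(b) - pnorm(a)            # same probability from the built-in CDF

# The density integrates to 1 over the whole real line
total <- integrate(f, lower = -Inf, upper = Inf)$value
total

# A density value is not a probability: it can exceed 1
dnorm(0, mean = 0, sd = 0.1)
```

Note that a density can be larger than 1 at a point; only its integral over a set is a probability.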