The systematic collection and arrangement of numerical facts or data of any kind;
(also) the branch of science or mathematics concerned with the analysis and interpretation of numerical data and appropriate ways of gathering such data. [OED]
2017-01-17
The systematic collection and arrangement of numerical facts or data of any kind;
(also) the branch of science or mathematics concerned with the analysis and interpretation of numerical data and appropriate ways of gathering such data. [OED]
## We'll use data on the duration and spacing of eruptions ## of the old faithful geyser ## Data are eruption duration and waiting time to next eruption data ("faithful") # load data str (faithful) # display the internal structure of an R object
## 'data.frame': 272 obs. of 2 variables: ## $ eruptions: num 3.6 1.8 3.33 2.28 4.53 ... ## $ waiting : num 79 54 74 62 85 55 88 85 51 85 ...
A "statistic" is a the result of applying a function (summary) to the data: statistic <- function(data)
E.g. ranks: Min, Quantiles, Median, Mean, Max
summary (faithful$eruptions)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 1.600 2.163 4.000 3.488 4.454 5.100
Roughly, a quantile for a proportion \(p\) is a value \(x\) for which \(p\) of the data are less than or equal to \(x\). The first quartile, median, and third quartile are the quantiles for \(p=0.25\), \(p=0.5\), and \(p=0.75\), respectively.
boxplot (faithful$eruptions, main="Eruption time", horizontal=T)
library('ggplot2');library(gridExtra); #boxplot relatives b1<-ggplot(faithful, aes(x="All",y=eruptions)) + labs(x=NULL) + geom_boxplot() #jitter plot b2<-ggplot(faithful, aes(x="All",y=eruptions)) + labs(x=NULL) + geom_jitter(position=position_jitter(height=0,width=0.25)) grid.arrange(b1, b2, nrow=1)
## Construct histogram of eruption times, plot data points on the x axis hist (faithful$eruptions, main="Eruption time", xlab="Time (minutes)", ylab="Count") points (x=faithful$eruptions,y=rep(0,length(faithful$eruptions)), lwd=4, col='blue')
## Construct different histogram of eruption times ggplot(faithful, aes(x=eruptions)) + labs(y="Proportion") + geom_histogram(aes(y = ..count../sum(..count..)))
## Construct ECDF of eruption times, plot data points on the x axis plot(ecdf(faithful$eruptions), main="Eruption time", xlab="Time (minutes)", ylab="Proportion") points (x=faithful$eruptions,y=rep(0,length(faithful$eruptions)), lwd=4, col='blue')
## Different picture of ECDF, with jitter plot ggplot(faithful, aes(x=eruptions)) + labs(x="Eruption Time",y="Proportion") + stat_ecdf() + geom_jitter(aes(y=0.125),position=position_jitter(width=0,height=0.1))
We often assume that we can treat items as if they were distributed "randomly."
Any event can be assigned a probability between \(0\) and \(1\) (inclusive).
\(\Pr\) is a probability function over \(\mathcal{S}\) iff
(The more correct statment of this is coming up in a few slides.)
Subjective interpretation is possible as well. ("Bayesian" statistics is related to this idea – more later.)
A random variable is a mapping from the event space to a number (or vector.)
Usually rendered in uppercase italics
\(X\) is every statistician's favourite, followed closely by \(Y\) and \(Z\).
"Realizations" of \(X\) are written in lower case, e.g. \(x_1\), \(x_2\), …
We will write the set of possible realizations as: \(\mathcal{X}\) for \(X\),
\(\mathcal{Y}\) for \(Y\), and so on.
Realizations are observed according to probabilities specified by the distribution of \(X\)
Can think of \(X\) as an "infinite supply of data"
Separate realizations of the same r.v. \(X\) are "independent and identically distributed" (i.i.d.)
Formal definition of a random variable requires measure theory, not covered here
Random variable \(X\), realization \(x\).
\(X\) is number of "heads" in 20 flips of a fair coin
\(\mathcal{X} = \{0,1,...,20\}\)
\(X\) is number of "heads" in 20 flips of a fair coin
\[{\mathrm{E}}[X] = \sum_{x \in \mathcal{X}} x \cdot p_X(X = x)\]
\[{\mathrm{E}}[Y] = \int_{y \in \mathcal{Y}} y \cdot f_Y(Y = y) {\mathrm{\,d}}y\]
\[ \bar{x}_n = \frac{1}{n} \sum_i x_i \]
Given a dataset, \(\bar x_n\) is a fixed number. We use \(\bar X_n\) to denote the random variable corresponding to the sample mean computed from a randomly drawn dataset of size \(n\).
Datasets of size \(n = 15\), sample means plotted in red.
A statistic is any summary of a dataset. (E.g. \(\bar X_n\), sample median.) A statistic is the result of a function applied to a dataset.
A parameter is any summary of the distribution of a random variable. (E.g. \(\mu_X\), median.) A parameter is the result of a function applied to a distribution.
Given an estimate, how good is it?
The distribution of an estimator is called its sampling distribution.
\[ E[\bar{X}_n - \mu_X] \]
\[ E[(\bar{X}_n - E[\bar{X}_n])^2] \]
\[ E[(\bar{X}_n - \mu_X)^2] \]
\[ F_{\bar X_n}(\bar x) \approx \int_{-\infty}^{\bar x} \frac{1}{\sigma_n\sqrt{2\pi}} \mathrm{e}^{-\frac{(\bar x - \mu_X)^2}{2\sigma_n^2}}\]
where
\[ \sigma_n^2 = \frac{\sigma^2}{\sqrt{n}} \]
is called the standard error and \(\sigma^2\) is the variance of \(X\).
NOTE: More data means lower standard error.
\[ f_{X}(x) = \frac{1}{\sigma_X\sqrt{2\pi}} \mathrm{e}^{-\frac{(x - \mu_X)^2}{2\sigma_X^2}} \]
The normal distribution is special (among other reasons) because many estimators have approximately normal sampling distributions or have sampling distributions that are closely related to the normal.
Reminder, \(\sigma_X^2 = E[(X - \mu_X)^2]\). If \(X\) is normal and we let
\[Z = \frac{X - \mu_X}{\sigma_{X}}\]
we have
\[f_{Z}(z) = \frac{1}{\sqrt{2\pi}} \mathrm{e}^{-\frac{z^2}{2}}\]
Eruptions dataset has \(n = 272\) observations.
Our estimate of the mean of eruption times is \(\bar x_{272}\) = 3.4877831.
What is the probability of observing an \({\bar{x}}_{272}\) that is within 10 seconds of the true mean?
Let \(\sigma_{{\bar{X}}_{272}} = \sigma_X/\sqrt{272}\), let \(Z = \frac{{\bar{X}}_{272} - \mu_X}{\sigma_{{\bar{X}}_{272}}}\) be a new r.v. By the C.L.T.,
\[\Pr(-0.17 \le {\bar{X}}_{272} - \mu_X \le 0.17) = \Pr(\frac{-0.17}{\sigma_{{\bar{X}}_{272}}} \le Z \le \frac{0.17}{\sigma_{{\bar{X}}_{272}}})\] \[\approx \int_{z = \frac{-0.17}{\sigma_{{\bar{X}}_{272}}}}^{\frac{0.17}{\sigma_{{\bar{X}}_{272}}}} \frac{1}{\sqrt{2\pi}} \mathrm{e}^{-\frac{z^2}{2}} = \int_{z = -2.456}^{2.456} \frac{1}{\sqrt{2\pi}} \mathrm{e}^{-\frac{z^2}{2}} = 0.986\]
Note! I estimated \(\sigma_X\) here. (Look up "\(t\)-test.")
\[\approx \int_{z = -2.456}^{2.456} \frac{1}{\sqrt{2\pi}} \mathrm{e}^{-\frac{z^2}{2}} = 0.986\]
library(boot) bootstraps <- boot(faithful$eruptions,function(d,i){mean(d[i])},R=5000) bootdata = data.frame(xbars=bootstraps$t); limits = quantile(bootdata$xbars,c(0.025,0.975)) ggplot(bootdata, aes(x=xbars)) + labs(y="Prop.") + geom_histogram(aes(y = ..density..)) + geom_errorbarh(aes(xmin=limits[[1]], xmax=limits[[2]], y=c(0)),height=0.25,colour="red",size=2)
"My classifier is correct 20 times out of 30 on this test set!"
Have 50 observations of \(X\).
binom.test(20,30)
## ## Exact binomial test ## ## data: 20 and 30 ## number of successes = 20, number of trials = 30, p-value = 0.09874 ## alternative hypothesis: true probability of success is not equal to 0.5 ## 95 percent confidence interval: ## 0.4718800 0.8271258 ## sample estimates: ## probability of success ## 0.6666667
library(pwr) pwr.p.test(h = ES.h(p1 = 0.5, p2 = 0.66), n = NULL, power = 0.8, sig.level = 0.05)
## ## proportion power calculation for binomial distribution (arcsine transformation) ## ## h = 0.3257295 ## n = 73.97628 ## sig.level = 0.05 ## power = 0.8 ## alternative = two.sided
str()
shows the structure of a vector, matrix, table, data frame, etc.summary()
shows basic summary statisticsfoo$bar
extracts column bar
from data frame foo
length()
,nrow()
,ncol()
size information for vector, data frame, etc.min()
, max()
, median()
, IQR()
, quantile(data,prob)
do what you expectIQR
is Inter-Quartile Range: 3rd Quartile minus 1st Quartile.mean()
, var()
, sd()
rep(e,n)
creates vector by repeating element e
, n
timeswhich(b)
returns list of indices for which boolean expression b
is truehist()
computes and plots a histogramprobability=T
shows proportions instead of frequency)ecdf()
boxplot()
draws box plot. Whiskers extend at most \(1.5\cdot{}\mathrm{IQR}\) from the nearest quartile.density()
constructs a kernel density estimate using given dataplot()
creates a new scatterplot of given \(x,y\) coordinates.R
objects. Try it!points()
adds additional points to an existing plotmain
- Plot titlexlab
- x label for plotylab
- y label for plotpch
- "plotting character", what shape to use for pointscex
- "character expansion" - multiplicative factor to enlarge/shrink points