The systematic collection and arrangement of numerical facts or data of any kind;
(also) the branch of science or mathematics concerned with the analysis and interpretation of numerical data and appropriate ways of gathering such data. [OED]
2017-09-19
## We'll use data on the duration and spacing of eruptions
## of the Old Faithful geyser.
## Data are eruption duration and waiting time to next eruption.
data("faithful")  # load data
str(faithful)     # display the internal structure of an R object
## 'data.frame': 272 obs. of 2 variables:
##  $ eruptions: num 3.6 1.8 3.33 2.28 4.53 ...
##  $ waiting  : num 79 54 74 62 85 55 88 85 51 85 ...
A "statistic" is the result of applying a function (a summary) to the data: statistic <- f(data)
E.g. rank-based summaries (min, quantiles, median, max) and the mean:
summary(faithful$eruptions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.600   2.163   4.000   3.488   4.454   5.100
Roughly, a quantile for a proportion \(p\) is a value \(x\) for which \(p\) of the data are less than or equal to \(x\). The first quartile, median, and third quartile are the quantiles for \(p=0.25\), \(p=0.5\), and \(p=0.75\), respectively.
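As a small illustration of this definition, the sketch below computes the three quartiles of the eruption durations with base R's `quantile()` and then checks what fraction of the data lies at or below each one. The names `q` and `frac_below` are introduced here for illustration only. Because `quantile()` interpolates between observations by default, and because the data contain ties, the fractions are only approximately equal to \(p\).

```r
# Quartiles of eruption duration (faithful ships with base R)
x <- faithful$eruptions
p <- c(0.25, 0.5, 0.75)
q <- quantile(x, probs = p)   # default interpolation method
print(q)

# Fraction of observations less than or equal to each quantile
frac_below <- sapply(q, function(v) mean(x <= v))
print(frac_below)
```

The middle entry of `q` agrees with the median reported by `summary()`.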
boxplot(faithful$eruptions, main = "Eruption time", horizontal = TRUE)
library(ggplot2)
library(gridExtra)
# boxplot relatives
b1 <- ggplot(faithful, aes(x = "All", y = eruptions)) +
  labs(x = NULL) +
  geom_boxplot()
# jitter plot
b2 <- ggplot(faithful, aes(x = "All", y = eruptions)) +
  labs(x = NULL) +
  geom_jitter(position = position_jitter(height = 0, width = 0.25))
grid.arrange(b1, b2, nrow = 1)
## Construct histogram of eruption times, plot data points on the x axis
hist(faithful$eruptions, main = "Eruption time",
     xlab = "Time (minutes)", ylab = "Count")
points(x = faithful$eruptions, y = rep(0, length(faithful$eruptions)),
       lwd = 4, col = "blue")
## Construct different histogram of eruption times
ggplot(faithful, aes(x = eruptions)) +
  labs(y = "Proportion") +
  geom_histogram(aes(y = after_stat(count / sum(count))))
We often assume that we can treat items as if they were distributed "randomly."
Any event can be assigned a probability between \(0\) and \(1\) (inclusive).
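Under the frequency interpretation, the probability of an event is the limit of its relative frequency over many repeated independent trials. This is (loosely) Borel's Law of Large Numbers.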
Subjective interpretation is possible as well. ("Bayesian" statistics is related to this idea – more later.)
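The long-run-frequency idea behind Borel's Law of Large Numbers can be illustrated by simulation. The sketch below is an assumed example, not part of the original slides: it flips a simulated fair coin many times and tracks the relative frequency of heads, which should settle near the probability \(0.5\). The seed and the number of flips are arbitrary choices.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
n <- 10000
flips <- rbinom(n, size = 1, prob = 0.5)     # 1 = heads, 0 = tails
running_freq <- cumsum(flips) / seq_len(n)   # relative frequency of heads after each flip

# Relative frequency after 10, 100, 1000, and 10000 flips
print(running_freq[c(10, 100, 1000, 10000)])
```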
A random variable is a mapping from the sample space (the set of possible outcomes) to a number (or vector).
Usually rendered in uppercase italics
\(X\) is every statistician's favourite, followed closely by \(Y\) and \(Z\).
"Realizations" of \(X\) are written in lower case, e.g. \(x_1\), \(x_2\), …
We will write the set of possible realizations as: \(\mathcal{X}\) for \(X\),
\(\mathcal{Y}\) for \(Y\), and so on.
Realizations are observed according to probabilities specified by the distribution of \(X\)
Can think of \(X\) as an "infinite supply of data"
Separate realizations of the same r.v. \(X\) are "independent and identically distributed" (i.i.d.)
Formal definition of a random variable requires measure theory, not covered here
Random variable \(X\), realization \(x\).
\(X\) is number of "heads" in 20 flips of a fair coin
\(\mathcal{X} = \{0,1,...,20\}\)
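The coin-flip example can be made concrete in R, since the number of heads in 20 flips of a fair coin follows a binomial distribution with size 20 and probability 0.5. In this sketch, `rbinom()` draws realizations and `dbinom()` gives the probability of each value in \(\mathcal{X}\). The variable names and the seed are illustrative choices.

```r
set.seed(1)                                       # arbitrary seed
support <- 0:20                                   # the set of possible realizations
pmf <- dbinom(support, size = 20, prob = 0.5)     # Pr(X = x) for each x in the support
realizations <- rbinom(5, size = 20, prob = 0.5)  # five realizations x_1, ..., x_5

print(realizations)
print(round(pmf, 4))
```

The probabilities sum to one, and the most likely value is 10 heads.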
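The expected value of a discrete random variable \(X\) with probability mass function \(p_X\):
\[{\mathrm{E}}[X] = \sum_{x \in \mathcal{X}} x \cdot p_X(x)\]
The expected value of a continuous random variable \(Y\) with probability density function \(f_Y\):
\[{\mathrm{E}}[Y] = \int_{y \in \mathcal{Y}} y \cdot f_Y(y) {\mathrm{\,d}}y\]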
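Both expectations can be evaluated numerically. The sketch below applies the sum to the coin-flip variable (20 flips of a fair coin) and the integral to an exponential random variable with rate 2. The exponential is an assumption chosen purely for illustration; it does not appear in the slides.

```r
# Discrete case: X = number of heads in 20 flips of a fair coin
x <- 0:20
e_x <- sum(x * dbinom(x, size = 20, prob = 0.5))
print(e_x)

# Continuous case (assumed example): Y ~ Exponential(rate = 2), so E[Y] = 1/2
e_y <- integrate(function(y) y * dexp(y, rate = 2), lower = 0, upper = Inf)$value
print(e_y)
```

The first result matches the closed form \(20 \times 0.5 = 10\); the second is a numerical approximation of \(1/2\).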
\[ \bar{x}_n = \frac{1}{n} \sum_{i=1}^n x_i \]
Given a dataset, \(\bar x_n\) is a fixed number.
It is usually a good estimate of the expected value of a random variable \(X\) with an unknown distribution. (More on this later.)
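A brief sketch of both points: the sample mean of the eruption data is one fixed number, and the sample mean of many simulated coin-flip counts lands close to the known expected value of 10. The simulation settings (seed, 100000 draws) are arbitrary assumptions.

```r
# Sample mean of a given dataset: a fixed number
xbar_eruptions <- mean(faithful$eruptions)
print(xbar_eruptions)

# Sample mean as an estimate of E[X], where X = heads in 20 fair coin flips
set.seed(1)                                     # arbitrary seed
draws <- rbinom(100000, size = 20, prob = 0.5)  # realizations of X
xbar_n <- sum(draws) / length(draws)            # (1/n) * sum of x_i
print(xbar_n)
```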
\(X\) and \(Y\) have a joint distribution if their realizations come together as a pair. \((X,Y)\) is a random vector, and realizations may be written \((x_1,y_1), (x_2,y_2), ...\), or \(\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, ...\)
faithful
\[ \rho_{X,Y} = \frac{{\mathrm{E}}[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y} \]
\[ r_{X,Y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]
##           eruptions   waiting
## eruptions 1.0000000 0.9008112
## waiting   0.9008112 1.0000000
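The correlation matrix shown here is consistent with what `cor(faithful)` returns. As a check on the sample formula for \(r_{X,Y}\), the sketch below computes it term by term for the two columns and compares the result with R's built-in `cor()`. The name `r_manual` is introduced for illustration.

```r
x <- faithful$eruptions
y <- faithful$waiting

# Sample correlation computed directly from the formula
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

print(r_manual)
print(cor(x, y))   # built-in Pearson correlation, for comparison
```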
\[ F_{X,Y}(x,y) = F_X(x)F_Y(y) \]
\[ \Pr(X=x|Y=y) = \Pr(X=x) \]
\[ \Pr(Y=y|X=x) = \Pr(Y=y) \]
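The factorization of the joint distribution function can be checked empirically. In the sketch below, two standard normal samples are generated separately, so they are independent by construction. The distribution, the seed, and the evaluation point \((a, b)\) are all assumptions made for illustration. The empirical joint proportion should then be close to the product of the two marginal proportions, and the sample correlation should be close to zero.

```r
set.seed(1)             # arbitrary seed
n <- 10000
x <- rnorm(n)
y <- rnorm(n)           # generated separately from x, hence independent of it

a <- 0.5                # arbitrary evaluation point (a, b)
b <- -0.3
joint <- mean(x <= a & y <= b)             # estimate of F_{X,Y}(a, b)
product <- mean(x <= a) * mean(y <= b)     # estimate of F_X(a) * F_Y(b)

print(c(joint = joint, product = product))
print(cor(x, y))
```

The converse does not hold in general: a correlation near zero does not by itself establish independence.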
##           x         y
## x 1.0000000 0.6996526
## y 0.6996526 1.0000000

##              x            y
## x 1.0000000000 0.0003411402
## y 0.0003411402 1.0000000000

##             x           y
## x 1.000000000 0.002031327
## y 0.002031327 1.000000000

##              x            y
## x  1.000000000 -0.008200447
## y -0.008200447  1.000000000
## Mean: 70.90
## Mean: 55.60
## Mean: 81.33