Objectivist view

- Suppose we observe $n$ replications of an experiment.
- Let $n(A)$ be the number of times event $A$ was observed.
- $\lim_{n \to \infty} \frac{n(A)}{n} = \Pr(A)$
- This is (loosely) *Borel's Law of Large Numbers*.
- A subjective interpretation is possible as well. ("Bayesian" statistics is related to this idea -- more later.)

## Abstraction of data-generating process: Random Variable

- We often reduce data to numbers.
    - "$1$ means heads, $0$ means tails."
- A _random variable_ is a mapping from the sample space to a number (or vector).
- Usually rendered in uppercase *italics*
    - $X$ is every statistician's favourite, followed closely by $Y$ and $Z$.
- "Realizations" of $X$ are written in lower case, e.g. $x_1$, $x_2$, ...
- We will write the set of possible realizations as: $\mathcal{X}$ for $X$, $\mathcal{Y}$ for $Y$, and so on.

## Distributions of random variables

- Realizations are observed according to probabilities specified by the *distribution* of $X$.
- Can think of $X$ as an "infinite supply of data".
- Separate realizations of the same r.v. $X$ are "independent and identically distributed" (i.i.d.).
- The formal definition of a random variable requires measure theory, not covered here.

## Probabilities for random variables

Random variable $X$, realization $x$.

- What is the probability we see $x$?
    - $\Pr(X=x)$ (if lazy, $\Pr(x)$, but don't do this)
- Subsets of the domain of a random variable correspond to events.
    - $\Pr(X > 0)$: the probability that I see a realization that is positive.

## Discrete Random Variables

- Discrete random variables take values from a countable set.
    - Coin flip $X$
        - $\mathcal{X} = \{0,1\}$
    - Number of snowflakes that fall in a day $Y$
        - $\mathcal{Y} = \{0, 1, 2, ...\}$

## Probability Mass Function (PMF)

- For a discrete $X$, $p_{X}(x)$ gives $\Pr(X = x)$.
- Requirements: $p_X(x) \ge 0$ for all $x$, and $\sum_{x \in \mathcal{X}} p_{X}(x) = 1$.
- Note that the sum can have an infinite number of terms.
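The two PMF requirements, and the frequency interpretation above, can be checked numerically. A small R sketch (not from the slides), using the binomial PMF for the number of heads in 20 fair flips:

```r
# PMF of the number of heads in 20 flips of a fair coin
px <- dbinom(0:20, size = 20, prob = 0.5)
sum(px)   # a PMF must sum to 1 over its support

# Frequency interpretation: simulate many replications of the experiment
set.seed(1)
sims <- rbinom(100000, size = 20, prob = 0.5)
mean(sims == 10)   # relative frequency of the event "exactly 10 heads"
px[11]             # p_X(10); the two numbers should be close
```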
## Probability Mass Function (PMF) Example

$X$ is number of "heads" in 20 flips of a fair coin

$\mathcal{X} = \{0,1,...,20\}$

```{r echo=F}
pmfData <- data.frame(x = 0:20, px = dbinom(0:20, 20, .5))
eps <- 0.05
ggplot(pmfData, aes(x = x, y = px, ymin = 0, ymax = px)) +
  geom_point(size = 3) +
  geom_segment(x = pmfData$x + eps, xend = pmfData$x + 1 - eps, y = 0, yend = 0) +
  geom_point(y = 0, size = 3, shape = 1) +
  ylab(expression(p[X](x))) +
  scale_x_continuous(breaks = seq(0, 20, 5))
```

## Cumulative Distribution Function (CDF)

- For a discrete $X$, $P_{X}(x)$ gives $\Pr(X \le x)$.
- Requirements:
    - $P$ is nondecreasing
    - $\sup_{x \in \mathcal{X}} P_{X}(x) = 1$
- Note:
    - $P_X(b) = \sum_{x \le b} p_X(x)$
    - $\Pr(a < X \le b) = P_X(b) - P_X(a)$

## Cumulative Distribution Function (CDF) Example

$X$ is number of "heads" in 20 flips of a fair coin

```{r echo=F}
cdfData <- data.frame(x = 0:20, px = pbinom(0:20, 20, .5))
ggplot(cdfData, aes(x = x, y = px, ymin = 0, ymax = px)) +
  geom_segment(xend = cdfData$x + 1, yend = cdfData$px) +
  geom_point(size = 3) +
  geom_point(x = cdfData$x + 1, size = 3, shape = 1) +
  ylab(expression(P[X](x))) +
  scale_x_continuous(breaks = seq(0, 20, 5))
```

## Continuous random variables

- Continuous random variables take values in intervals of $\Re$
    - Mass $M$ of a star
        - $\mathcal{M} = (0,\infty)$
    - Oxygen saturation $S$ of blood
        - $\mathcal{S} = [0,1]$
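Before moving on: the two "Note" identities on the discrete CDF slide can be verified directly in R with `pbinom` and `dbinom` (a quick sketch, again for the 20-flip example):

```r
# P_X(b) is the cumulative sum of the PMF up to b
b <- 12
pbinom(b, size = 20, prob = 0.5)          # P_X(12)
sum(dbinom(0:b, size = 20, prob = 0.5))   # same value

# Pr(a < X <= b) = P_X(b) - P_X(a)
a <- 8
pbinom(b, 20, 0.5) - pbinom(a, 20, 0.5)   # probability of 9..12 heads
sum(dbinom((a + 1):b, 20, 0.5))           # same value
```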

- For a continuous r.v.\ $X$, $\Pr(X = x) = 0$ for all $x$. ***There is no probability mass function.***
- However, $\Pr(X \in (a,b)) \ne 0$ in general.

## Probability Density Function (PDF)

\newcommand{\diff}{\mathrm{\,d}}

- For continuous $X$, $\Pr(X = x) = 0$ and the PMF does not exist.
- However, we define the *Probability Density Function* $f_X$:
    - $\Pr(a \le X \le b) = \int_{a}^{b} f_X(x) \diff x$
- Requirements:
    - $\forall x \; f_X(x) \ge 0$
    - $\int_{-\infty}^\infty f_X(x) \diff x = 1$

## Probability Density Function (PDF) Example

```{r echo=F}
ggplot(data.frame(x = c(0, 20)), aes(x)) + ylab("Density") +
  stat_function(fun = function(x) { dchisq(x, 5) }, colour = "black")
```

## Cumulative Distribution Function (CDF)

- For a continuous $X$, $F_{X}(x)$ gives $\Pr(X \le x) = \Pr(X \in (-\infty,x])$.
- Requirements:
    - $F$ is nondecreasing
    - $\sup_{x \in \mathcal{X}} F_{X}(x) = 1$
- Note:
    - $F_X(x) = \int_{-\infty}^x f_X(t) \diff t$
    - $\Pr(x_1 < X \le x_2) = F_X(x_2) - F_X(x_1)$

## Cumulative Distribution Function (CDF) Example

```{r echo=F}
ggplot(data.frame(x = c(0, 20)), aes(x)) + ylab("Probability") +
  stat_function(fun = function(x) { pchisq(x, 5) }, colour = "black")
```

## Joint Distributions

Two random variables $X$ and $Y$ have a **joint distribution** if their realizations come together as a pair. $(X,Y)$ is a **random vector**, and realizations may be written $(x_1,y_1), (x_2,y_2), ...$, or $\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, ...$

- Joint Cumulative Distribution Function (CDF)
    - We define the *Joint Cumulative Distribution Function* $F_{X,Y}$:
    - $\Pr(X \le b, Y \le d) = F_{X,Y}(b,d)$
- Joint Probability Density Function (PDF)
    - We define the *Joint Probability Density Function* $f_{X,Y}$:
    - $\Pr(\langle X , Y \rangle \in \mathcal{R} \subseteq \mathcal{X} \times \mathcal{Y}) = \int_{\mathcal{R}} f_{X,Y}(x,y) \diff x \diff y$
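The joint CDF definition above can be sanity-checked by simulation. A minimal R sketch, assuming (purely for illustration) that $X$ and $Y$ are independent standard normals, in which case $F_{X,Y}(b,d) = F_X(b)\,F_Y(d)$:

```r
# Monte Carlo check of the joint CDF: F_{X,Y}(b,d) = Pr(X <= b, Y <= d)
set.seed(1)
n <- 100000
x <- rnorm(n)
y <- rnorm(n)
b <- 0.5; d <- -0.25
mean(x <= b & y <= d)   # empirical joint CDF at (b, d)
pnorm(b) * pnorm(d)     # exact value, using independence of X and Y
```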

## Joint Density

```{r echo=F}
fbase <- ggplot(data = faithful, aes(x = eruptions, y = waiting)) +
  xlim(1, 5.75) + ylim(38, 100) +
  scale_fill_distiller(palette = "Greys", direction = 1) +
  theme(legend.position = "none")
fbase + stat_density_2d(geom = "polygon", aes(fill = ..level..))
# + geom_point(alpha = I(1/4))
```

## Supervised Learning Framework

(JWHT 2, HTF 2)

Training set: a set of ***labeled examples*** of the form
\[\langle x_1,\,x_2,\,\dots,\,x_p,\,y\rangle,\]
where the $x_j$ are ***feature values*** and $y$ is the ***output***

* Task: ***Given a new $x_1,\,x_2,\,\dots,\,x_p$, predict $y$***

What to learn: A ***function*** $h:\mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_p \rightarrow \mathcal{Y}$, which maps the features into the output domain

* Goal: Make accurate ***future predictions*** (on unseen data)

## Common Assumptions

- Data are realizations of a random vector $(X_1, X_2, ..., X_p, Y)$.
- Future data will be realizations of the same random vector.
- We are given a **loss function** $\ell$ which measures how unhappy we are with our prediction $\hat{y}$ if the true observation is $(x,y)$.
    - $\ell(\hat{y}, y)$ is non-negative, and the worse the prediction, the larger it is.

## Great Expectations

\newcommand{\E}{\mathrm{E}}

- The _expected value_ of a discrete random variable $X$ is
$$\E[X] = \sum_{x \in \mathcal{X}} x \cdot p_X(x)$$
- The _expected value_ of a continuous random variable $Y$ is
$$\E[Y] = \int_{y \in \mathcal{Y}} y \cdot f_Y(y) \diff y$$
- $\E[X]$ is called the **mean** of $X$, often denoted $\mu$ or $\mu_X$.

## Sample Mean

- Given a ***dataset*** (collection of realizations) $x_1, x_2, ..., x_n$ of $X$, the **sample mean** is:
$$ \bar{x}_n = \frac{1}{n} \sum_i x_i $$

Given a dataset, $\bar x_n$ is a fixed number. It is usually a good estimate of the expected value of a random variable $X$ with an unknown distribution. (More on this later.)
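A small R sketch (not part of the slides) connecting the two definitions above: compute $\E[X]$ from the PMF for the 20-flip binomial (whose mean is $np = 10$), then watch the sample mean approach it as the dataset grows:

```r
# E[X] computed directly from the definition, sum over x of x * p_X(x)
xs <- 0:20
mu <- sum(xs * dbinom(xs, 20, 0.5))   # equals n * p = 10 for Binomial(20, 0.5)

# Sample means of datasets of increasing size n
set.seed(1)
sapply(c(10, 100, 10000),
       function(n) mean(rbinom(n, 20, 0.5)))  # drifts toward mu as n grows
```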
## Generalization Error

- Suppose we are given a function $h$ to use for predictions.
- If $X$ is a random variable, then so is $h(X)$.
- And so is $\ell(h(X),Y)$.

Under the assumption that future data are produced by a random variable $(X,Y)$, the ***expected loss*** of a given predictor is
$$ \E[\ell(h(X),Y)] $$

## Test Error

- Given a ***dataset*** (collection of realizations) $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$ of $(X,Y)$ ***that were not used to find $h$***, the **test error** is:
$$ \bar{\ell}_{h,n} = \frac{1}{n} \sum_i \ell(h(x_i),y_i) $$

Given a test dataset, $\bar \ell_{h,n}$ is a fixed number. It has all the properties of a sample mean, which we will discuss.

## Model Choice

Which one do you think has the best Generalization Error? Why?

```{r echo=FALSE,warning=FALSE,message=FALSE,fig.width=8.5}
library(gridExtra)
ex <- data.frame(x0 = rep(1, 10),
                 x = c(0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17),
                 y = c(2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43))
form <- y ~ x
mod <- lm(form, data = ex)
ex$pred <- predict(mod, ex)
fitplots_aes <- aes(x = x, y = y, xend = x, yend = pred)
pymin <- -1; pymax <- 3.3
f1 <- ggplot(ex, fitplots_aes) + geom_point(size = 4) +
  geom_smooth(method = "lm", formula = form, se = F, n = 200, na.rm = F) +
  coord_cartesian(ylim = c(pymin, pymax)) +
  geom_segment(color = "red")
form <- y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9)
mod <- lm(form, data = ex)
ex$pred <- predict(mod, ex)
f8 <- ggplot(ex, fitplots_aes) + geom_point(size = 4) +
  geom_smooth(method = "lm", formula = form, se = F, n = 200, na.rm = F) +
  coord_cartesian(ylim = c(pymin, pymax)) +
  geom_segment(color = "red")
grid.arrange(f1, f8, nrow = 1)
```
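The question above can be made concrete by actually measuring test error: fit both a simple and a very flexible model on a training set, then compute the average loss on held-out data. A sketch under assumed squared-error loss and simulated data (the data-generating process and sample sizes here are illustrative choices, not from the slides):

```r
# Simulate a train set and a separate test set from the same (X, Y)
set.seed(1)
n <- 30
train <- data.frame(x = runif(n, -1, 1))
train$y <- 2 * train$x + 1 + rnorm(n, sd = 0.5)
test <- data.frame(x = runif(n, -1, 1))
test$y <- 2 * test$x + 1 + rnorm(n, sd = 0.5)

m1 <- lm(y ~ x, data = train)            # simple model
m9 <- lm(y ~ poly(x, 9), data = train)   # flexible degree-9 model

# Test error: sample mean of the loss on data NOT used to find h
test_err <- function(m) mean((predict(m, test) - test$y)^2)
c(simple = test_err(m1), flexible = test_err(m9))
```

The flexible model always fits the *training* data at least as well (it nests the simple one), yet its *test* error is typically worse here, which is the point of the slide.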