- JWHT 2.3 Lab: Introduction to R
(Or, follow along and see if you can do it in python.)
2018-09-20
We often assume that we can treat items as if they were distributed “randomly.”
Any event can be assigned a probability between \(0\) and \(1\) (inclusive).
The probability that an observation falls somewhere in the sample space is 1.
This frequency interpretation of probability is (loosely) justified by Borel's law of large numbers.
Subjective interpretation is possible as well. (“Bayesian” statistics is related to this idea – more later.)
A random variable is a mapping from the event space to a number (or vector).
Usually rendered in uppercase italics
\(X\) is every statistician’s favourite, followed closely by \(Y\) and \(Z\).
“Realizations” of \(X\) are written in lower case, e.g. \(x_1\), \(x_2\), …
We will write the set of possible realizations as: \(\mathcal{X}\) for \(X\), \(\mathcal{Y}\) for \(Y\), and so on.
Realizations are observed according to probabilities specified by the distribution of \(X\)
Can think of \(X\) as an “infinite supply of data”
Separate realizations of the same r.v. \(X\) are “independent and identically distributed” (i.i.d.)
Formal definition of a random variable requires measure theory, not covered here
Random variable \(X\), realization \(x\).
\(X\) is number of “heads” in 20 flips of a fair coin
\(\mathcal{X} = \{0,1,...,20\}\)
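A minimal Python sketch of this example (the function name is illustrative, not from the notes): each call produces one realization \(x\) of \(X\), and repeated calls give i.i.d. realizations.

```python
# Simulate X = number of "heads" in 20 flips of a fair coin.
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def flip_count(n_flips=20):
    """One realization x of X: count heads in n_flips fair-coin flips."""
    return sum(random.random() < 0.5 for _ in range(n_flips))

# Three i.i.d. realizations x_1, x_2, x_3 of the same random variable X
xs = [flip_count() for _ in range(3)]
print(xs)  # each value lies in the sample space {0, 1, ..., 20}
```

Each realization is a fixed number once observed; the random variable \(X\) itself is the "infinite supply" of such draws.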
Two random variables \(X\) and \(Y\) have a joint distribution if their realizations come together as a pair. \((X,Y)\) is a random vector, and realizations may be written \((x_1,y_1), (x_2,y_2), ...\), or \(\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, ...\)
Training set: a set of labeled examples of the form
\[\langle x_1,\,x_2,\,\dots x_p,y\rangle,\]
where \(x_j\) are feature values and \(y\) is the output
What to learn: A function \(h:\mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_p \rightarrow \mathcal{Y}\), which maps the features into the output domain
Data are realizations of a random vector \((X_1, X_2, \dots, X_p, Y)\)
Future data will be realizations of the same random variable
We are given a loss function \(\ell\) which measures how unhappy we are with our prediction \(\hat{y}\) if the true observation is \((x,y)\).
\(\ell(\hat{y}, y)\) is non-negative, and the worse the prediction, the larger it is.
\[\mathrm{E}[X] = \sum_{x \in \mathcal{X}} x \cdot p_X(x)\]
\[\mathrm{E}[Y] = \int_{y \in \mathcal{Y}} y \cdot f_Y(y) \mathrm{\,d}y\]
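The discrete formula can be checked directly for the coin-flip example. This sketch assumes \(X \sim \mathrm{Binomial}(20, 0.5)\), whose pmf is known in closed form; `pmf` and `expected` are illustrative names.

```python
# E[X] = sum over x in {0,...,20} of x * p_X(x), for X ~ Binomial(20, 0.5)
from math import comb

n, p = 20, 0.5
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

expected = sum(x * pmf[x] for x in pmf)
print(expected)  # 10.0, matching the binomial mean n * p
```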
\[ \bar{x}_n = \frac{1}{n} \sum_i x_i \]
Given a dataset, \(\bar x_n\) is a fixed number.
It is usually a good estimate of the expected value of a random variable \(X\) with an unknown distribution. (More on this later.)
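Continuing the coin-flip example, a quick simulation (variable names are illustrative) shows the sample mean \(\bar{x}_n\) landing near \(\mathrm{E}[X] = 10\) without equalling it exactly:

```python
# Sample mean xbar as an estimate of E[X] for X = heads in 20 fair flips
import random

random.seed(1)
n = 1000
xs = [sum(random.random() < 0.5 for _ in range(20)) for _ in range(n)]

xbar = sum(xs) / n
print(xbar)  # close to E[X] = 10, but a different dataset gives a different xbar
```

Given this dataset, `xbar` is a fixed number; rerunning with a different seed would produce a different (but also nearby) value.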
Under the assumption that future data are produced by a random variable \((X,Y)\), the expected loss of a given classifier is
\[ \mathrm{E}[\ell(h(X),Y)] \]
\[ \bar{\ell}_{h,n} = \frac{1}{n} \sum_i \ell(h(x_i),y_i) \]
Given a test dataset, \(\bar \ell_{h,n}\) is a fixed number.
It has all the properties of a sample mean, which we will discuss.
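The average test loss is just a sample mean of per-example losses. A minimal sketch, assuming 0-1 loss and a toy threshold classifier (both `h` and the tiny test set are hypothetical, not from the notes):

```python
# Average test loss: (1/n) * sum of ell(h(x_i), y_i) over a test set
def h(x):
    """Toy classifier: predict label 1 when the feature exceeds 0.5."""
    return 1 if x > 0.5 else 0

def zero_one_loss(y_hat, y):
    """ell(y_hat, y): 0 for a correct prediction, 1 otherwise."""
    return 0 if y_hat == y else 1

test_set = [(0.9, 1), (0.2, 0), (0.7, 0), (0.1, 0)]  # (x_i, y_i) pairs
avg_loss = sum(zero_one_loss(h(x), y) for x, y in test_set) / len(test_set)
print(avg_loss)  # 0.25: exactly one of the four predictions is wrong
```

Because it is a sample mean of i.i.d. losses, the usual sample-mean properties apply: it estimates the expected loss \(\mathrm{E}[\ell(h(X),Y)]\) and fluctuates from test set to test set.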
Which classifier do you think has the lowest generalization error? Why?