- JWHT 2.3 Lab: Introduction to R

(Or, follow along and see if you can do it in Python.)

2018-09-20

- I have a secret… your project might not work.
- That is okay. Prove to me and to your classmates that:
- You thoroughly understand the substantive area and problem
- You thoroughly understand the data
- You know what methods are reasonable to try and why
- You tried several *and evaluated them rigorously*, but your predictions are just not that good.

- You can’t get blood from a turnip. (But demonstrate that as best you can.)

- One row is an observation. What does that mean?
- How are rows generated?

- Common assumption is that data consists of replicates that are “the same.”
- Come from “the same population”
- Come from “the same process”
- The goal of data analysis is to understand what the data tell us about the population.

We often assume that we can treat items as if they were distributed “*randomly*.”

- “*That’s so random!*”
- Result of a coin flip is “random”
- Passengers were screened “at random”

- “random” does not mean “uniform”

- Mathematical formalism: *events* and *probability*

*Sample space* \(\mathcal{S}\) is the set of all possible outcomes we might observe. Depends on context.

- Coin flips: \(\mathcal{S}= \{ h, t \}\)
- Eruption times: \(\mathcal{S}= \mathbb{R}^{\ge 0}\)
- (Eruption times, Eruption waits): \(\mathcal{S}= \mathbb{R}^{\ge 0} \times \mathbb{R}^{\ge 0}\)

- An *event* is a subset of the sample space.
- Observe heads: \(\{ h \}\)
- Observe eruption for 2 minutes: \(\{ 2.0 \}\)
- Observe eruption with length between 1 and 2 minutes and wait between 50 and 70 minutes: \([1,2] \times [50,70]\).

Any event can be assigned a *probability* between \(0\) and \(1\) (inclusive).

- \(\Pr(\{h\}) = 0.5\)
- \(\Pr([1,2] \times [50,70]) = 0.10\)

Probability of the observation falling *somewhere* in the sample space is 1.0.

- \(\Pr(\mathcal{S}) = 1\)

Objectivist view

- Suppose we observe \(n\) replications of an experiment.
- Let \(n(A)\) be the number of times event \(A\) was observed
- \(\lim_{n \to \infty} \frac{n(A)}{n} = \Pr(A)\)
This is (loosely) *Borel’s Law of Large Numbers*.

Subjective interpretation is possible as well. (“Bayesian” statistics is related to this idea – more later.)
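The limiting relative frequency \(n(A)/n\) can be illustrated with a quick simulation. This is only a sketch, assuming a fair coin and event \(A = \{h\}\); the function name `relative_frequency` and the seed are illustrative choices, not part of the lab.

```python
import random


def relative_frequency(n, p_heads=0.5, seed=0):
    """Return n(A)/n for A = {heads} over n simulated coin flips."""
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    heads = sum(1 for _ in range(n) if rng.random() < p_heads)
    return heads / n


# The relative frequency should settle near Pr(A) = 0.5 as n grows.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

For small \(n\) the fraction can be far from 0.5; the law only describes what happens as \(n\) grows.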

- We often reduce data to numbers.
- “\(1\) means heads, \(0\) means tails.”

A *random variable* is a mapping from the sample space to a number (or vector).

Usually rendered in uppercase *italics*. \(X\) is every statistician’s favourite, followed closely by \(Y\) and \(Z\).

“Realizations” of \(X\) are written in lower case, e.g. \(x_1\), \(x_2\), …

We will write the set of possible realizations as: \(\mathcal{X}\) for \(X\), \(\mathcal{Y}\) for \(Y\), and so on.

Realizations are observed according to probabilities specified by the *distribution* of \(X\).

Can think of \(X\) as an “infinite supply of data”.

Separate realizations of the same r.v. \(X\) are “independent and identically distributed” (i.i.d.)

Formal definition of a random variable requires measure theory, not covered here

Random variable \(X\), realization \(x\).

- What is the probability we see \(x\)?
- \(\Pr(X=x)\), (if lazy, \(\Pr(x)\), but don’t do this)

- Subsets of the domain of a random variable correspond to events.
- \(\Pr(X > 0)\): probability that I see a realization that is positive.

- Discrete random variables take values from a countable set
- Coin flip \(X\)
- \(\mathcal{X} = \{0,1\}\)

- Number of snowflakes that fall in a day \(Y\)
- \(\mathcal{Y} = \{0, 1, 2, ...\}\)

- For a discrete \(X\), the *probability mass function* (PMF) \(p_{X}(x)\) gives \(\Pr(X = x)\).
- Requirement: \(\sum_{x \in \mathcal{X}} p_{X}(x) = 1\).
- Note that the sum can have an infinite number of terms.
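To see how a sum with infinitely many terms can still equal 1, here is a small sketch using a Poisson PMF on \(\{0, 1, 2, ...\}\). The Poisson model and the rate of 3.0 are assumptions chosen only for illustration; nothing in the slides says the snowflake count follows this distribution.

```python
import math


def poisson_pmf(k, rate):
    """Poisson PMF: exp(-rate) * rate**k / k! for k = 0, 1, 2, ..."""
    return math.exp(-rate) * rate**k / math.factorial(k)


rate = 3.0  # hypothetical rate, for illustration only
cutoffs = (5, 10, 20, 50)

# Partial sums over {0, ..., m} approach 1 as the cutoff m increases.
partial_sums = [sum(poisson_pmf(k, rate) for k in range(m + 1)) for m in cutoffs]
for m, s in zip(cutoffs, partial_sums):
    print(m, s)
```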

\(X\) is number of “heads” in 20 flips of a fair coin

\(\mathcal{X} = \{0,1,...,20\}\)
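A minimal sketch of this PMF, assuming the 20 flips are independent so that the number of heads is binomial. The names `N_FLIPS` and `pmf` are illustrative.

```python
from math import comb

N_FLIPS = 20  # number of fair-coin flips


def pmf(x):
    """p_X(x) = C(20, x) * 0.5**20 for x in {0, ..., 20}, and 0 elsewhere."""
    if x < 0 or x > N_FLIPS:
        return 0.0
    return comb(N_FLIPS, x) * 0.5**N_FLIPS


# Print the whole PMF and check the requirement that it sums to 1.
for x in range(N_FLIPS + 1):
    print(x, pmf(x))

total = sum(pmf(x) for x in range(N_FLIPS + 1))
print("sum over X:", total)
```

The PMF is symmetric around 10 heads, which is also its most likely value.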

- For a discrete \(X\), the *cumulative distribution function* (CDF) \(P_{X}(x)\) gives \(\Pr(X \le x)\).
- Requirements:
- \(P_X\) is nondecreasing
- \(\sup_{x \in \mathcal{X}} P_{X}(x) = 1\)

- Note:
- \(P_X(b) = \sum_{x \le b} p_X(x)\)
- \(\Pr(a < X \le b) = P_X(b) - P_X(a)\)

\(X\) is number of “heads” in 20 flips of a fair coin
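The two notes above can be checked numerically for this example. The sketch below again assumes independent fair flips (a binomial count); the endpoints `a = 7` and `b = 12` are arbitrary illustrative values.

```python
from math import comb, floor

N_FLIPS = 20  # number of fair-coin flips


def pmf(x):
    """p_X(x) for the number of heads in 20 fair flips."""
    if x < 0 or x > N_FLIPS:
        return 0.0
    return comb(N_FLIPS, x) * 0.5**N_FLIPS


def cdf(b):
    """P_X(b) = sum of p_X(x) over all x <= b."""
    if b < 0:
        return 0.0
    upper = min(floor(b), N_FLIPS)
    return sum(pmf(x) for x in range(upper + 1))


a, b = 7, 12  # arbitrary endpoints for illustration
direct = sum(pmf(x) for x in range(a + 1, b + 1))  # Pr(a < X <= b), term by term
via_cdf = cdf(b) - cdf(a)                          # same probability from the CDF
print(direct, via_cdf)
```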

- Continuous random variables take values in intervals of \(\mathbb{R}\)
- Mass \(M\) of a star
- \(\mathcal{M} = (0,\infty)\)

- Oxygen saturation \(S\) of blood
- \(\mathcal{S} = [0,1]\)

- For a continuous r.v. \(X\), \(\Pr(X = x) = 0\) for all \(x\). *There is no probability mass function.*
- However, \(\Pr(X \in (a,b)) \ne 0\) in general.

- For continuous \(X\), \(\Pr(X = x) = 0\) and the PMF does not exist.
- However, we define the *Probability Density Function* \(f_X\):
- \(\Pr(a \le X \le b) = \int_{a}^{b} f_X(x) \mathrm{\,d}x\)

- Requirement:
- \(\forall x \; f_X(x) \ge 0\), \(\int_{-\infty}^\infty f_X(x) \mathrm{\,d}x = 1\)
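As a rough illustration of computing probabilities from a density, the sketch below integrates an exponential density with rate 1 using a midpoint rule. The exponential distribution, the helper names `exp_pdf` and `integrate`, and the interval \([1, 2]\) are all assumptions picked for the example, not something taken from the lab.

```python
import math


def exp_pdf(x, rate=1.0):
    """Exponential density: rate * exp(-rate * x) for x >= 0, and 0 for x < 0."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def integrate(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))


# Pr(1 <= X <= 2) by numerical integration, compared with the closed form.
prob = integrate(exp_pdf, 1.0, 2.0)
exact = math.exp(-1.0) - math.exp(-2.0)
print(prob, exact)

# The density should integrate to (approximately) 1; the mass beyond 50 is negligible.
total = integrate(exp_pdf, 0.0, 50.0)
print(total)
```

Note that the density itself is not a probability: it can exceed 1 at a point (for example, with a rate above 1) while still integrating to 1.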