2016-02-09

Project

  • I have a secret… your project might not work.
  • That is okay. Prove to me and to your classmates that:
    • You thoroughly understand the substantive area and problem
    • You thoroughly understand the data
    • You know what methods are reasonable to try and why
    • You tried several and evaluated them rigorously, but your predictions are just not that good.
  • You can’t get blood from a turnip. (But prove it.)

Relationships between variables

  • Random vectors or vector-valued random variables.
  • Variables that occur together in some meaningful sense.

Joint distribution


library(knitr);
kable(faithful)
eruptions waiting
3.600 79
1.800 54
3.333 74
2.283 62
4.533 85
2.883 55
4.700 88
3.600 85
1.950 51
4.350 85
… (262 more rows; faithful records 272 Old Faithful eruptions: eruption duration and waiting time to the next eruption, both in minutes)

Correlation (JWHT 2.3, 3.1.3)


Pearson Correlation





\[ \rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y} \]

Pearson Correlation: "Plugin" Estimate





\[ r_{X,Y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]
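
A direct implementation of this plugin estimate in R (a sketch, not from the slides):

# Plugin estimate of the Pearson correlation
r_xy <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) /
    (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
}
r_xy(faithful$eruptions, faithful$waiting)  # ~0.9008, matching cor() on the next slide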

Sample Correlation
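
This is presumably the output of R's cor() applied to the faithful data (the generating code is not shown on the slide):

cor(faithful)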

##           eruptions   waiting
## eruptions 1.0000000 0.9008112
## waiting   0.9008112 1.0000000

Correlation Gotchas


Joint distribution - Density


Marginal distributions - Densities


Marginal distributions - Rug Plot


Conditional distributions


Independence


  • Two random variables \(X\) and \(Y\) that are part of a random vector are independent iff:

\[ F_{X,Y}(x,y) = F_X(x)F_Y(y) \]

  • Consequences:

\[ \Pr(X=x|Y=y) = \Pr(X=x) \]

\[ \Pr(Y=y|X=x) = \Pr(Y=y) \]

  • Any conditional distribution of \(X\) is the same as the marginal distribution of \(X\)
  • Knowing about \(Y\) provides no information about \(X\).

Independence vs. Correlation

##          x        y
## x 1.000000 0.703538
## y 0.703538 1.000000

Independence vs. Correlation

##             x           y
## x  1.00000000 -0.01458965
## y -0.01458965  1.00000000

Independence vs. Correlation

##             x           y
## x 1.000000000 0.003657629
## y 0.003657629 1.000000000

Independence vs. Correlation

##             x           y
## x  1.00000000 -0.01608683
## y -0.01608683  1.00000000
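
The scatterplots for these slides are not reproduced here, but the phenomenon is easy to recreate; a sketch (my own construction, not the slides' data) of variables that are strongly dependent yet nearly uncorrelated:

set.seed(42)
x <- runif(1000, -1, 1)
y <- x^2 + rnorm(1000, sd = 0.1)  # y is a deterministic function of x plus noise...
cor(x, y)                         # ...yet the Pearson correlation is near zero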

Predicting Waiting Time


## Mean: 70.90

Conditional predictions

  • If I know eruption time, can I do better?

## Mean: 55.60

Conditional predictions

  • If I know eruption time, can I do better?

## Mean: 81.33

Conditional predictions?

  • If I know eruption time, can I do better?
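
These conditional means are easy to compute by splitting on eruption time (a sketch assuming a cutoff of 3 minutes; the slides' actual threshold is not shown):

mean(faithful$waiting)                           # marginal mean, ~70.90
mean(faithful$waiting[faithful$eruptions < 3])   # conditional mean, short eruptions
mean(faithful$waiting[faithful$eruptions >= 3])  # conditional mean, long eruptions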

Supervised Learning Framework (HTF 2, JWHT 2)

Training experience: a set of labeled examples of the form

\[\langle x_1,\,x_2,\,\dots x_p,y\rangle,\]

where \(x_j\) are feature values and \(y\) is the output

  • Task: Given a new \(x_1,\,x_2,\,\dots x_p\), predict \(y\)

What to learn: A function \(f:\mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_p \rightarrow \mathcal{Y}\), which maps the features into the output domain

  • Goal: Make accurate future predictions (on unseen data)
  • Plan: Learn to make accurate predictions on the training data

Wisconsin Breast Cancer Prognostic Data

Cell samples were taken from breast tumors before surgery and imaged; the tumors were then excised, and patients were followed to determine whether the cancer recurred and, if so, how long until recurrence (or, if not, how long they remained disease-free).

Wisconsin data (continued)

  • 198 instances, 32 features for prediction
  • Outcome (R=recurrence, N=non-recurrence)
  • Time (time until recurrence for R; time disease-free for N).
Radius.Mean Texture.Mean Perimeter.Mean Outcome Time
18.02 27.60 117.50 N 31
17.99 10.38 122.80 N 61
21.37 17.44 137.50 N 116
11.42 20.38 77.58 N 123
20.29 14.34 135.10 R 27
12.75 15.29 84.60 R 77

Terminology

Radius.Mean Texture.Mean Perimeter.Mean Outcome Time
18.02 27.60 117.50 N 31
17.99 10.38 122.80 N 61
21.37 17.44 137.50 N 116
11.42 20.38 77.58 N 123
20.29 14.34 135.10 R 27
12.75 15.29 84.60 R 77


  • Columns are called input variables or features or attributes
  • The outcome and time (which we are trying to predict) are called labels or output variables or targets
  • A row in the table is called a training example or instance
  • The whole table is called the (training) data set.

Prediction problems

Radius.Mean Texture.Mean Perimeter.Mean Outcome Time
18.02 27.60 117.50 N 31
17.99 10.38 122.80 N 61
21.37 17.44 137.50 N 116
11.42 20.38 77.58 N 123
20.29 14.34 135.10 R 27
12.75 15.29 84.60 R 77


  • The problem of predicting the recurrence is called (binary) classification
  • The problem of predicting the time is called regression

More formally

  • The \(i\)th training example has the form: \(\langle x_{1,i}, \dots x_{p,i}, y_i\rangle\) where \(p\) is the number of features (32 in our case).

  • Notation \({\bf x}_i\) denotes a column vector with elements \(x_{1,i},\dots x_{p,i}\).

  • The training set \(D\) consists of \(n\) training examples

  • We denote the \(n\times p\) matrix of features by \(X\) and the size-\(n\) column vector of outputs from the data set by \({\mathbf{y}}\).

  • In statistics, \(X\) is called the data matrix or the design matrix.

  • \({{\cal X}}\) denotes space of input values

  • \({{\cal Y}}\) denotes space of output values

Supervised learning problem

  • Given a data set \(D \in ({{\cal X}}\times {{\cal Y}})^n\), find a function: \[h : {{\cal X}}\rightarrow {{\cal Y}}\] such that \(h({\bf x})\) is a “good predictor” for the value of \(y\).

  • \(h\) is called a predictive model or hypothesis

  • Problems are categorized by the type of output domain

    • If \({{\cal Y}}=\mathbb{R}\), this problem is called regression

    • If \({{\cal Y}}\) is a finite discrete set, the problem is called classification

    • If \({{\cal Y}}\) has 2 elements, the problem is called binary classification

Steps to solving a supervised learning problem

  1. Decide what the input-output pairs are.

  2. Decide how to encode inputs and outputs.

    This defines the input space \({{\cal X}}\), and the output space \({{\cal Y}}\).

    (We will discuss this in detail later)

  3. Choose model space/hypothesis class \({{\cal H}}\).

Example: Choosing a model space

Linear hypothesis (HTF 3, JWHT 3)

  • Suppose \(y\) is a linear function of \({\bf x}\): \[h_{\bf w}({\bf x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots\]

  • \(w_i\) are called parameters or weights (often \(\beta_i\) in stats books)

  • Typically include an attribute \(x_0=1\) (also called bias term or intercept term) so that the number of weights is \(p+1\). We then write: \[h_{\bf w}({\bf x}) = \sum_{i=0}^p w_i x_i = {\mathbf{x}}^{\mathsf{T}}{\mathbf{w}}\] where \({\bf w}\) and \({\bf x}\) are column vectors of size \(p+1\).

  • The design matrix \(X\) is now \(n\) by \(p+1\).

Example: Design matrix with bias term

x0 x1 y
1 0.86 2.49
1 0.09 0.83
1 -0.85 -0.25
1 0.87 3.10
1 -0.44 0.87
1 -0.43 0.02
1 -1.10 -0.12
1 0.40 1.81
1 -0.96 -0.83
1 0.17 0.43
Models will be of the form \[ \begin{aligned} h_{\mathbf{w}}({\mathbf{x}}) & = x_0 w_0 + x_1 w_1\\ & = w_0 + x_1 w_1 \end{aligned} \] How should we pick \({\mathbf{w}}\)?

Error minimization

  • Intuitively, \(\bf w\) should make the predictions of \(h_{\bf w}\) close to the true values \(y_i\) on the training data

  • Define an error function or cost function to measure how much our prediction differs from the “true” answer on the training data

  • Pick \(\bf w\) such that the error function is minimized

  • Hopefully, new examples are somehow “similar” to the training examples, and will also have small error.

How should we choose the error function?

Least mean squares (LMS)

  • Main idea: try to make \(h_{\bf w}({\bf x})\) close to \(y\) on the examples in the training set

  • We define a sum-of-squares error function \[J({\bf w}) = \frac{1}{2}\sum_{i=1}^n (h_{\bf w}({\bf x}_i)-y_i)^2\] (the \(1/2\) is just for convenience)

  • We will choose \(\bf w\) such as to minimize \(J(\bf w)\)

  • One way to do it: compute \(\bf w\) such that: \[\frac{\partial}{\partial w_j}J({\bf w}) = 0,\,\, \forall j=0\dots p\]
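
As a quick sketch, \(J\) can be evaluated for any candidate \(\bf w\) (assuming the example data frame exb with columns x1 and y used in the lm() call later):

sse <- function(w0, w1, x, y) sum((w0 + w1 * x - y)^2)  # sum of squared errors
sse(0.9, -0.4, exb$x1, exb$y)  # the next slide reports SSE = 21.510 for this w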

Example: \(w_0=0.9,w_1=-0.4\)

## SSE: 21.510

OLS Fit to Example Data

mod <- lm(y ~ x1, data=exb); print(mod$coefficients)
## (Intercept)          x1 
##    1.058813    1.610168
## SSE: 2.240

Solving a supervised learning problem

  1. Decide what the input-output pairs are.

  2. Decide how to encode inputs and outputs.

    This defines the input space \({{\cal X}}\), and the output space \({{\cal Y}}\).

  3. Choose a class of models/hypotheses \({{\cal H}}\).

  4. Choose an error function (cost function) to define the best model in the class

  5. Choose an algorithm for searching efficiently through the space of models to find the best.

Recurrence Time from Tumor Radius

library(dplyr)  # provides %>% and filter()
mod <- lm(Time ~ Radius.Mean, data=bc %>% filter(Outcome == 'R')); print(mod$coefficients)
## (Intercept) Radius.Mean 
##   83.161238   -3.156896

Notation reminder

  • Consider a function \(J(u_1,u_2,\ldots,u_p):\mathbb{R}^p\rightarrow\mathbb{R}\) (for us, this will usually be an error function)

  • The gradient \(\nabla J(u_1,u_2,\ldots,u_p):\mathbb{R}^p\rightarrow\mathbb{R}^p\) is a function which outputs a vector containing the partial derivatives.
    That is: \[\nabla J = \left\langle{\frac{\partial}{\partial u_1}}J,{\frac{\partial}{\partial u_2}}J,\ldots,{\frac{\partial}{\partial u_p}}J\right\rangle\]

  • If \(J\) is differentiable and convex, we can find the global minimum of \(J\) by solving \(\nabla J = \mathbf{0}\).

  • The partial derivative is the derivative along the \(u_i\) axis, keeping all other variables fixed.

The Least Squares Solution (HTF 2.6, 3.2, JWHT 3.1)

  • Recalling some multivariate calculus (we minimize \((X{\mathbf{w}}-{\mathbf{y}})^{\mathsf{T}}(X{\mathbf{w}}-{\mathbf{y}})\), dropping the \(1/2\) since it does not change the minimizer): \[\begin{aligned} \nabla_{\mathbf{w}}J & = & \nabla_{\mathbf{w}}(X{\mathbf{w}}-{{\mathbf{y}}})^{{\mathsf{T}}}(X{\mathbf{w}}-{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{\mathsf{T}}X^{\mathsf{T}}-{{\mathbf{y}}^{\mathsf{T}}})(X{\mathbf{w}}-{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}X{\mathbf{w}}-{{\mathbf{y}}}^{{\mathsf{T}}}X{\mathbf{w}}-{\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}{{\mathbf{y}}}{}+{{\mathbf{y}}}{}^{{\mathsf{T}}}{{\mathbf{y}}}{}) \\ & = & \nabla_{\mathbf{w}}({\mathbf{w}}^{{\mathsf{T}}}X^{{\mathsf{T}}}X{\mathbf{w}}-2{{\mathbf{y}}}^{{\mathsf{T}}}X{\mathbf{w}}+{{\mathbf{y}}}{}^{{\mathsf{T}}}{{\mathbf{y}}}{}) \\ & = & 2 X^{{\mathsf{T}}}X {\mathbf{w}}- 2 X^{{\mathsf{T}}}{{\mathbf{y}}} \end{aligned}\]

  • Setting gradient equal to zero: \[\begin{aligned} 2 X^{{\mathsf{T}}}X {\mathbf{w}}- 2 X^{{\mathsf{T}}}{{\mathbf{y}}}{} & = & 0 \\ \Rightarrow X^{{\mathsf{T}}}X {\mathbf{w}}& = & X^{{\mathsf{T}}}{{\mathbf{y}}}{} \\ \Rightarrow {\mathbf{w}}= (X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{}\end{aligned}\]

  • The inverse exists if the columns of \(X\) are linearly independent.
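
As a sketch, the closed form can be evaluated directly in R, using the data from the example that follows:

# Data matrix with a column of ones, and target vector (from the next slide)
X <- cbind(1, c(0.86, 0.09, -0.85, 0.87, -0.44, -0.43, -1.10, 0.40, -0.96, 0.17))
y <- c(2.49, 0.83, -0.25, 3.10, 0.87, 0.02, -0.12, 1.81, -0.83, 0.43)
w <- solve(t(X) %*% X, t(X) %*% y)  # solve the normal equations (avoids an explicit inverse)
w                                   # ~ (1.06, 1.61), matching the worked example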

Example of linear regression

x0 x1 y
1 0.86 2.49
1 0.09 0.83
1 -0.85 -0.25
1 0.87 3.10
1 -0.44 0.87
1 -0.43 0.02
1 -1.10 -0.12
1 0.40 1.81
1 -0.96 -0.83
1 0.17 0.43

\(h_{\bf w} ({\bf x}) = 1.06 + 1.61 x_1\)

Data matrices

\[X=\left[\begin{array}{rr} 1 & 0.86 \\ 1 & 0.09 \\ 1 & -0.85 \\ 1 & 0.87 \\ 1 & -0.44 \\ 1 & -0.43 \\ 1 & -1.10 \\ 1 & 0.40 \\ 1 & -0.96 \\ 1 & 0.17 \end{array}\right]~~~~~{{\mathbf{y}}}{}=\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]\]

\(X^{{\mathsf{T}}}X\)

\[X^{{\mathsf{T}}}X =\] \[{\tiny \left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \end{array}\right]\times\left[\begin{array}{cc} 1 & 0.86 \\ 1 & 0.09 \\ 1 & -0.85 \\ 1 & 0.87 \\ 1 & -0.44 \\ 1 & -0.43 \\ 1 & -1.10 \\ 1 & 0.40 \\ 1 & -0.96 \\ 1 & 0.17 \end{array}\right]}\] \[=\left[\begin{array}{rr} 10 & -1.39 \\ -1.39 & 4.95 \end{array}\right]\]

\(X^{{\mathsf{T}}}{{\mathbf{y}}}{}\)

\[X^{{\mathsf{T}}}{{\mathbf{y}}}{}=\] \[{\tiny \left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \end{array}\right]\times\left[\begin{array}{c} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]}\] \[=\left[\begin{array}{r} 8.34 \\ 6.49 \end{array}\right]\]

Solving for \({\bf w}\)

\[{\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{} = \left[\begin{array}{rr} 10 & -1.39 \\ -1.39 & 4.95 \end{array}\right]^{-1}\left[\begin{array}{r} 8.34 \\ 6.49 \end{array}\right] = \left[\begin{array}{r} 1.06 \\ 1.61 \end{array}\right]\]

So the best fit line is \(y=1.06 + 1.61x\).

Linear regression summary

  • The optimal solution (minimizing sum-squared-error) can be computed in polynomial time in the size of the data set.

  • The solution is \({\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{}\), where \(X\) is the data matrix augmented with a column of ones, and \({{\mathbf{y}}}{}\) is the column vector of target outputs.

  • A very rare case in which an analytical, exact solution is possible

Is linear regression enough?

  • Linear regression should be the first thing you try for real-valued outputs!

  • …but it is sometimes not expressive enough.

  • Two possible solutions:

    1. Explicitly transform the data, i.e. create additional features

      • Add cross-terms, higher-order terms

      • More generally, apply a transformation of the inputs from \({\cal X}\) to some other space \({\cal X}'\), then do linear regression in the transformed space

    2. Use a different model space/hypothesis class

  • Ideas (1) and (2) are two views of the same strategy. Today we focus on the first approach.

Polynomial fits (HTF 2.6, JWHT 7.1)

  • Suppose we want to fit a higher-degree polynomial to the data.
    (E.g., \(y=w_0 + w_1x_1+w_2 x_1^2\).)

  • Suppose for now that there is a single input variable \(x_{1,i}\) per training sample.

  • How do we do it?

Answer: Polynomial regression

  • Given data: \((x_{1,1},y_1), (x_{1,2},y_2), \ldots, (x_{1,n},y_n)\).

  • Suppose we want a degree-\(d\) polynomial fit.

  • Let \({{\mathbf{y}}}{}\) be as before and let \[X=\left[\begin{array}{rrrrr} 1 & x_{1,1} & x_{1,1}^2 & \ldots & x_{1,1}^d \\ 1 & x_{1,2} & x_{1,2}^2 & \ldots & x_{1,2}^d \\ \vdots & & \vdots & \vdots & \vdots \\ 1 & x_{1,n} & x_{1,n}^2 & \ldots & x_{1,n}^d \\ \end{array}\right]\]

  • We are making up features to add to our design matrix

  • Solve the linear regression \(X{\bf w}\approx {{\mathbf{y}}}{}\).
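
In R there is no need to build \(X\) by hand; a sketch (assuming the example data frame exb with columns x1 and y from earlier):

mod2 <- lm(y ~ poly(x1, 2, raw = TRUE), data = exb)  # degree-2 polynomial fit
# equivalently: lm(y ~ x1 + I(x1^2), data = exb), as in the fits shown below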

Example of quadratic regression: Data matrices

\[X=\left[\begin{array}{rrr} 1 & 0.86 & 0.75 \\ 1 & 0.09 & 0.01 \\ 1 & -0.85 & 0.73 \\ 1 & 0.87 & 0.76 \\ 1 & -0.44 & 0.19 \\ 1 & -0.43 & 0.18 \\ 1 & -1.10 & 1.22 \\ 1 & 0.40 & 0.16 \\ 1 & -0.96 & 0.93 \\ 1 & 0.17 & 0.03 \end{array}\right]~~~~~{{\mathbf{y}}}{}=\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]\]

\(X^{{\mathsf{T}}}X\)

\[X^{{\mathsf{T}}}X =\] \[{\tiny \hspace{-0.3in}\left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \\ 0.75 & 0.01 & 0.73 & 0.76 & 0.19 & 0.18 & 1.22 & 0.16 & 0.93 & 0.03 \end{array}\right]\times\left[\begin{array}{rrr} 1 & 0.86 & 0.75 \\ 1 & 0.09 & 0.01 \\ 1 & -0.85 & 0.73 \\ 1 & 0.87 & 0.76 \\ 1 & -0.44 & 0.19 \\ 1 & -0.43 & 0.18 \\ 1 & -1.10 & 1.22 \\ 1 & 0.40 & 0.16 \\ 1 & -0.96 & 0.93 \\ 1 & 0.17 & 0.03 \end{array}\right]}\] \[=\left[\begin{array}{rrr} 10 & -1.39 & 4.95 \\ -1.39 & 4.95 & -1.64 \\ 4.95 & -1.64 & 4.11 \end{array}\right]\]

\(X^{{\mathsf{T}}}{{\mathbf{y}}}{}\)

\[X^{{\mathsf{T}}}{{\mathbf{y}}}{}=\] \[{\tiny \hspace{-0.3in}\left[\begin{array}{rrrrrrrrrr} 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ 0.86 & 0.09 & -0.85 & 0.87 & -0.44 & -0.43 & -1.10 & 0.40 & -0.96 & 0.17 \\ 0.75 & 0.01 & 0.73 & 0.76 & 0.19 & 0.18 & 1.22 & 0.16 & 0.93 & 0.03\\ \end{array}\right]\times\left[\begin{array}{r} 2.49 \\ 0.83 \\ -0.25 \\ 3.10 \\ 0.87 \\ 0.02 \\ -0.12 \\ 1.81 \\ -0.83 \\ 0.43 \end{array}\right]}\] \[=\left[\begin{array}{r} 8.34 \\ 6.49 \\ 3.60 \end{array}\right]\]

Solving for \({\bf w}\)

\[{\bf w}=(X^{{\mathsf{T}}}X)^{-1}X^{{\mathsf{T}}}{{\mathbf{y}}}{} = {\tiny \left[\begin{array}{rrr} 10 & -1.39 & 4.95 \\ -1.39 & 4.95 & -1.64 \\ 4.95 & -1.64 & 4.11 \\ \end{array}\right]^{-1}\left[\begin{array}{r} 8.34 \\ 6.49 \\ 3.60 \end{array}\right] = \left[\begin{array}{r} 0.74 \\ 1.75 \\ 0.69 \end{array}\right]}\]

So the best order-2 polynomial is \(y=0.74 + 1.75 x + 0.69x^2\).

Data and linear fit

## (Intercept)           x 
##         1.1         1.6

Data and quadratic fit

## (Intercept)           x      I(x^2) 
##        0.74        1.75        0.69

Is this a better fit to the data?

Order-3 fit

## (Intercept)           x      I(x^2)      I(x^3) 
##        0.71        1.39        0.80        0.46

Is this a better fit to the data?

Order-4 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4) 
##       0.795       1.128      -0.039       0.905       0.898

Is this a better fit to the data?

Order-5 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5) 
##        0.47        0.62        4.86        6.75       -5.25       -6.72

Is this a better fit to the data?

Order-6 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6) 
##        0.13        3.13        8.99      -11.11      -23.83       12.52       18.38

Is this a better fit to the data?

Order-7 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7) 
##       0.096       3.207      10.193     -11.078     -30.742       8.263      25.527       5.483

Is this a better fit to the data?

Order-8 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7)      I(x^8) 
##         1.3        -5.9        -5.1        69.9        48.8      -172.0      -131.9       123.3       101.2

Is this a better fit to the data?

Order-9 fit

## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5)      I(x^6)      I(x^7)      I(x^8)      I(x^9) 
##        -1.1        34.8      -127.9      -379.9      1186.9      1604.8     -2475.4     -2627.6      1499.6      1448.1

Is this a better fit to the data?

Evaluating Performance

Which do you prefer and why?

Performance of a Fixed Hypothesis

(HTF 7.1–7.4, JWHT 2.2, 5)
  • Assume data \(({{{\mathbf{x}}}},y)\) are drawn from some fixed distribution

  • Given a model \(h\) (which could have come from anywhere), its generalization error is: \[J^*_h = E[L(h({{\bf X}}),Y)]\] where \(L\) is a loss function (e.g., squared error)

  • Given a test set of data, we can compute the test error \[\hat{J}^*_h = \frac{1}{n}\sum_{i=1}^n L(h({{{\mathbf{x}}}}_i),y_i)\]

  • \(\hat{J}^*_h\) is an unbiased estimate of \(J^*_h\) so long as the test data do not influence the choice of \(h\).

  • Can use \(\hat{J}^*_h\) with CLT or bootstrap to get a confidence interval for \(J^*_h\).
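
For instance, a CLT-based interval from the per-example test losses (a sketch; the function and names are mine):

# 95% CI for generalization error from per-example losses L(h(x_i), y_i)
ci95 <- function(losses) {
  m  <- mean(losses)
  se <- sd(losses) / sqrt(length(losses))
  c(lower = m - 1.96 * se, upper = m + 1.96 * se)
}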

Test Error: The Gold Standard

\[\hat{J}^*_h = \frac{1}{n}\sum_{i=1}^n L(h({{{\mathbf{x}}}}_i),y_i)\]

  • \(\hat{J}^*_h\) is an unbiased estimate of \(J^*_h\) so long as the \(({{{\mathbf{x}}}}_i,y_i)\) do not influence \(h\). Can use \(\hat{J}^*_h\) to get a confidence interval for \(J^*_h\).

  • Gives a strong statistical guarantee about the true performance of our system, if we didn’t use the test data to choose \(h\).

  • Note we can write training error for model class \(\mathcal{H}\) as

\[\hat{J}_\mathcal{H} = \min_{h' \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n L(h'({{{\mathbf{x}}}}_i),y_i)\]

  • Obviously, for any data set, \(\hat{J}_\mathcal{H} \le \hat{J}^*_h\) (when both are computed on that same data set and \(h \in \mathcal{H}\)).

Model Selection and Performance

  1. We would like to estimate the generalization error of our resulting predictor.

  2. We would like to choose the best model space (e.g. linear, quadratic, …)

Problem 1: Estimating Generalization Error

Training error \(\hat{J}_\mathcal{H}\) systematically underestimates generalization error \(J^*_h\)

Problem 2: Model Selection

The more complex the model, the smaller the training error.

  • Training error of the degree-9 polynomial is 0.

  • Training error of the degree-9 polynomial on any set of 10 points (with distinct \(x\)-values) is 0.

Problem 2: Model Selection

Smaller training error does not mean smaller generalization error.

  • Suppose \(\mathcal{H}_1\) is the space of all linear functions, \(\mathcal{H}_2\) is the space of all quadratic functions. Note \(\mathcal{H}_1 \subset \mathcal{H}_2\).

  • Fix a training set.

  • Let \(h_1 = \arg\min_{h' \in \mathcal{H}_1} \hat{J}_{h'}\) and \(h_2 = \arg\min_{h' \in \mathcal{H}_2} \hat{J}_{h'}\), where \(\hat{J}_{h'}\) denotes training error on this set

  • We must have \(\hat{J}_{h_2} \le \hat{J}_{h_1}\), but we may have \(J^*_{h_2} > J^*_{h_1}\).

Problem 2: Model Selection

Training error is no good for choosing the model space.

Fixing Problem 1: Generalization Error

Training error \(\hat{J}_\mathcal{H}\) underestimates generalization error \(J^*_h\)

  • If you really want a good estimate of \(J^*_h\), you need a separate test set

  • (But new stat methods can produce a CI using training error)

  • Could report test error, then deploy whatever you train on the whole data. (Probably won’t be worse.)

Fixing Problem 2: Model Selection

Smaller training error does not mean smaller generalization error.

  • Small training error, large generalization error is known as overfitting

  • A separate validation set can be used for model selection.
    • Train on the training set using each proposed model space
    • Evaluate each on the validation set; choose the one with the lowest validation error

Training, Model Selection, and Error Estimation

  • A general procedure for estimating the true error of a specific learned model using model selection

  • The data is randomly partitioned into three disjoint subsets:

    • A training set used only to find the parameters \({\bf w}\)

    • A validation set used to find the right model space (e.g., the degree of the polynomial)

    • A test set used to report the prediction error of the algorithm

  • Can generate standard confidence intervals for the test error of the learned model

Problems with the Single-Partition Approach

  • Pros:

    • Measures what we want: Performance of the actual learned model.
  • Cons:

    • Smaller effective training sets make performance more variable.

    • Small validation sets can give poor model selection

    • Small test sets can give poor estimates of performance

    • For a test set of size 100, with 60 correct classifications, 95% C.I. for actual accuracy is \((0.497,0.698)\).
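
One way to compute such an interval (an exact binomial CI; the slide's exact method may differ slightly):

binom.test(60, 100)$conf.int  # ~ (0.497, 0.697)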

k-fold cross-validation
(HTF 7.10, JWHT 5.1)

  • Divide the instances into \(k\) disjoint partitions or folds of size \(n/k\)

  • Loop through the partitions \(i = 1 ... k\):

    • Partition \(i\) is for evaluation (i.e., estimating the performance of the algorithm after learning is done)

    • The rest are used for training (i.e., choosing the specific model within the space)

  • "Cross-Validation Error" is the average error on the evaluation partitions. Has lower variance than error on one partition.

  • This is the main idea of CV, though it is put to several different purposes.
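
A minimal sketch of the basic procedure (my own code, applied to the faithful data; not from the slides):

k <- 10
folds <- sample(rep(1:k, length.out = nrow(faithful)))  # random disjoint partitions
cv_err <- sapply(1:k, function(i) {
  fit <- lm(waiting ~ eruptions, data = faithful[folds != i, ])  # train on the rest
  mean((predict(fit, faithful[folds == i, ]) -
        faithful$waiting[folds == i])^2)                         # evaluate on fold i
})
mean(cv_err)  # cross-validation error: average over the k evaluation partitions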

Misuse of CV (HTF 7.10.2)

  • \(n=50\) examples, binary classification, balanced classes

  • \(p = 5000\) features, all statistically independent of \(y\)

  • Use model selection to find the best \(100\) features by correlation with the labels, on the entire dataset.

  • Use cross-validation with these \(p = 100\) features to estimate error.

  • The CV-based error rate was 3%, even though the true error rate of any classifier here is 50%: selecting features on the full dataset leaks label information into the held-out folds.

k-fold cross-validation model selection
(HTF 7.10, JWHT 5.1)

  • Divide the instances into \(k\) folds of size \(n/k\).

  • Loop over \(m\) model spaces \(1 ... m\)

    • Loop over the \(k\) folds \(i = 1 ... k\):

      • Fold \(i\) is for validation (i.e., estimating the performance of the algorithm after learning is done)

      • The rest are used for training (i.e., choosing the specific model within the space)

  • For each model space, report average error over folds, and standard error.

CV for Model Selection

  • Typically select "most parsimonious model whose error is no more than one standard error above the error of the best model." (HTF)

Estimating "which is best" vs. "performance of best"

Nested CV for Model Evaluation

  • Divide the instances into \(k\) "outer" folds of size \(n/k\).

  • Loop over the \(k\) outer folds \(i = 1 ... k\):
    • Fold \(i\) is for testing; all others for training.
    • Divide the training instances into \(k'\) "inner" folds of size \((n - n/k)/k'\).
    • Loop over \(m\) model spaces \(1 ... m\)

      • Loop over the \(k'\) inner folds \(j = 1 ... k'\):
        • Fold \(j\) is for validation
        • The rest are used for training
      • Use average error over folds and SE to choose model space.

      • Train on all inner folds.

    • Test the model on outer test fold
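
A structural sketch of the whole procedure (my own code under assumed names, with polynomial degree playing the role of the model space; selection here uses the minimum inner-CV error rather than the one-standard-error rule):

nested_cv <- function(df, degrees = 1:5, k = 5, k_inner = 5) {
  outer <- sample(rep(1:k, length.out = nrow(df)))
  outer_err <- sapply(1:k, function(i) {
    train <- df[outer != i, ]                        # outer fold i held out for testing
    inner <- sample(rep(1:k_inner, length.out = nrow(train)))
    val_err <- sapply(degrees, function(d)           # inner CV over model spaces
      mean(sapply(1:k_inner, function(j) {
        fit <- lm(y ~ poly(x, d, raw = TRUE), data = train[inner != j, ])
        mean((predict(fit, train[inner == j, ]) - train$y[inner == j])^2)
      })))
    best <- degrees[which.min(val_err)]              # choose model space
    fit <- lm(y ~ poly(x, best, raw = TRUE), data = train)  # retrain on all inner folds
    mean((predict(fit, df[outer == i, ]) - df$y[outer == i])^2)  # outer test fold
  })
  mean(outer_err)  # estimate of the error of the full train-and-select pipeline
}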

Nested CV for Model Evaluation

Generalization Error for degree 3 model

Minimum-CV Estimate: 128.48, Nested CV Estimate: 149.91

Bias-correction for the CV Procedure

Summary

  • The training error decreases with the degree of the polynomial \(M\), i.e. the complexity (size) of the model space

  • Generalization error decreases at first, then starts increasing

  • Setting aside a validation set helps us find a good model space

  • We can then report an unbiased error estimate, using a test set untouched during both parameter training and validation

  • Cross-validation is a lower-variance but possibly biased version of this approach. It is standard.

  • If you have lots of data, just use held-out validation and test sets.

How much data do I need?

  • For training, as much as possible; the more complex your model space, the more data you need. (More on this later.)

  • For testing using a held-out test set, we can do a sample size calculation.

  • Required: Estimate of the true performance you think you might achieve.

library(pwr)
# ES.h computes the "effect size" -- in this case, the difference between 50% and 60% accuracy
# power is the probability of detecting a true effect; sig.level controls Type I error
pwr.p.test(ES.h(0.6,0.5),power=0.80,sig.level=0.05,alternative="greater")
## 
##      proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.2
##               n = 152
##       sig.level = 0.05
##           power = 0.8
##     alternative = greater