Nonlinearly separable data

Separability by adding features

more flexible decision boundary \(\approx\) enriched feature space

Margin optimization in feature space

max \(\sum_{i=1}^n\alpha_i-\frac{1}{2}\sum_{i,j=1}^n{{y}}_i{{y}}_j\alpha_i\alpha_j({\phi({{{\mathbf{x}}_i}})\cdot\phi({{{\mathbf{x}}_j}})})\)
w.r.t. \(\alpha_i\)
s.t. \(0\leq\alpha_i\leq C\) and \(\sum_{i=1}^n\alpha_i{{y}}_i=0\)
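A minimal sketch of putting this dual to work (assuming scikit-learn is available; its SVC solves this kind of kernelized dual internally, with C bounding the \(\alpha_i\) as in the constraint above):

    # hedged sketch: SVC solves the kernelized dual; phi is never computed explicitly
    from sklearn.svm import SVC
    from sklearn.datasets import make_circles

    X, y = make_circles(n_samples=200, factor=0.3, noise=0.1)  # not linearly separable
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X, y)
    print(clf.n_support_)   # counts of support vectors (the points with alpha_i > 0)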

Kernel functions

Example: Quadratic kernel

Polynomial kernels

The “kernel trick”
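A quick numerical check of the trick (numpy assumed): for 2-D inputs the quadratic kernel \(k(\mathbf{x},\mathbf{z})=(\mathbf{x}\cdot\mathbf{z}+1)^2\) equals an ordinary dot product under the explicit feature map \(\phi(\mathbf{x})=(1,\sqrt{2}x_1,\sqrt{2}x_2,x_1^2,x_2^2,\sqrt{2}x_1x_2)\), but costs only \(O(d)\) to evaluate:

    import numpy as np

    def phi(x):
        # explicit quadratic feature map for 2-D inputs
        x1, x2 = x
        return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2,
                         x1**2, x2**2, np.sqrt(2)*x1*x2])

    x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    print((x @ z + 1) ** 2)   # kernel evaluation: 4.0
    print(phi(x) @ phi(z))    # explicit feature-space dot product: also 4.0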

Some other (fairly generic) kernel functions

Example: Gaussian kernel
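For reference, the Gaussian (RBF) kernel in one line of numpy:

    import numpy as np

    def gaussian_kernel(x, z, sigma=1.0):
        # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
        return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

    print(gaussian_kernel(np.array([0., 0.]), np.array([1., 1.])))  # exp(-1) ~ 0.368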

Kernels beyond SVMs

Example: String kernels

Second brush with “feature construction”

Getting SVMs to work in practice

Artificial Neural Networks (HTF Ch. 11)

Functional Form - Neuron

\(\color{red}{h^{(1)}_{1}}(\color{blue}{x_1}, \color{blue}{x_2}, \color{blue}{x_3}) = \phi^{(1)}_{1}(w^{(1)}_{1,1}\color{blue}{x_1} + w^{(1)}_{2,1}\color{blue}{x_2} + w^{(1)}_{3,1}\color{blue}{x_3} + w^{(1)}_{0,1})\)

Functional Form

\(\color{red}{h^{(1)}_{1}}(\color{blue}{x_1}, \color{blue}{x_2}, \color{blue}{x_3}) = \phi^{(1)}_{1}(w^{(1)}_{1,1}\color{blue}{x_1} + w^{(1)}_{2,1}\color{blue}{x_2} + w^{(1)}_{3,1}\color{blue}{x_3} + w^{(1)}_{0,1})\)
\(\color{red}{h^{(1)}_{2}}(\color{blue}{x_1}, \color{blue}{x_2}, \color{blue}{x_3}) = \phi^{(1)}_{2}(w^{(1)}_{1,2}\color{blue}{x_1} + w^{(1)}_{2,2}\color{blue}{x_2} + w^{(1)}_{3,2}\color{blue}{x_3} + w^{(1)}_{0,2})\)
\(\color{red}{h^{(1)}_{3}}(\color{blue}{x_1}, \color{blue}{x_2}, \color{blue}{x_3}) = \phi^{(1)}_{3}(w^{(1)}_{1,3}\color{blue}{x_1} + w^{(1)}_{2,3}\color{blue}{x_2} + w^{(1)}_{3,3}\color{blue}{x_3} + w^{(1)}_{0,3})\)

\(\color{purple}{h^{(2)}_{1}}(\color{blue}{x_1}, \color{blue}{x_2}, \color{blue}{x_3}) = \phi^{(2)}_{1}(w^{(2)}_{1,1}\color{red}{h^{(1)}_{1}} + w^{(2)}_{2,1}\color{red}{h^{(1)}_{2}} + w^{(2)}_{3,1}\color{red}{h^{(1)}_{3}} + w^{(2)}_{0,1})\)
\(\color{purple}{h^{(2)}_{2}}(\color{blue}{x_1}, \color{blue}{x_2}, \color{blue}{x_3}) = \phi^{(2)}_{2}(w^{(2)}_{1,2}\color{red}{h^{(1)}_{1}} + w^{(2)}_{2,2}\color{red}{h^{(1)}_{2}} + w^{(2)}_{3,2}\color{red}{h^{(1)}_{3}} + w^{(2)}_{0,2})\)
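A minimal numpy sketch of these two layers as matrix operations (the logistic function stands in for the \(\phi^{(\ell)}_j\) here; the weights are random placeholders):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    x = np.array([0.5, -1.0, 2.0])                      # inputs x1, x2, x3
    W1, b1 = np.random.randn(3, 3), np.random.randn(3)  # w^(1)_{i,j} and biases w^(1)_{0,j}
    W2, b2 = np.random.randn(3, 2), np.random.randn(2)  # w^(2)_{i,j} and biases w^(2)_{0,j}

    h1 = sigmoid(x @ W1 + b1)   # h^(1)_j = phi(sum_i w^(1)_{i,j} x_i + w^(1)_{0,j})
    h2 = sigmoid(h1 @ W2 + b2)  # h^(2)_j is built from the h^(1) outputs
    print(h2)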

“Activation” or “Transfer” Function \(\phi\)

Name Equation
Identity \(\phi(x)=x\)
Binary step \(\phi(x)=0\) for \(x<0\); \(1\) for \(x\geq 0\)
Logistic \(\phi(x)=\frac{1}{1+e^{-x}}\)
TanH \(\phi(x)=\tanh(x)\)
Rectified linear unit (ReLU) \(\phi(x)=\max(0,x)\)
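The same activations, vectorized with numpy for reference:

    import numpy as np

    identity    = lambda x: x
    binary_step = lambda x: (x >= 0).astype(float)
    logistic    = lambda x: 1.0 / (1.0 + np.exp(-x))
    tanh        = np.tanh
    relu        = lambda x: np.maximum(0.0, x)

    x = np.linspace(-3, 3, 7)   # [-3, -2, -1, 0, 1, 2, 3]
    print(relu(x))              # [0. 0. 0. 0. 1. 2. 3.]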

Hidden Units are Features


Aside: “One-Hot” Encoding

Error Functions

\[ \begin{eqnarray} J({\bf w}) & = & \frac{1}{2}\sum_{j} \sum_{i=1}^n (h^{(L)}_j({\bf x}_i)-y_i)^2\\ J({\bf w}) & = & -\sum_{j}\sum_{i=1}^n y_i \log h^{(L)}_j({\mathbf{x}}_i) + (1-y_i) \log (1-h^{(L)}_j({\mathbf{x}}_i)) \end{eqnarray} \]
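Both error functions in a few lines of numpy (the clipping guard against \(\log 0\) is an implementation detail, not part of the formulas):

    import numpy as np

    def squared_error(h, y):
        # J(w) = 1/2 sum_i (h(x_i) - y_i)^2  (regression)
        return 0.5 * np.sum((h - y) ** 2)

    def cross_entropy(h, y, eps=1e-12):
        # J(w) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]  (classification)
        h = np.clip(h, eps, 1 - eps)
        return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

    y = np.array([1, 0, 1])
    h = np.array([0.9, 0.2, 0.7])
    print(squared_error(h, y), cross_entropy(h, y))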


Training

Traditional ANN training is done by looping over examples (much like the perceptron) and taking a small step down the single-example gradient.

For a single output neuron: \[ \begin{eqnarray} J_j({\mathbf{w}}) & = & \frac{1}{2} (h^{(L)}_j({\bf x}_i)-y_i)^2\\ \nabla_{\mathbf{w}}J_j & = & (h^{(L)}_j({\bf x}_i)-y_i)\nabla_{\mathbf{w}}h^{(L)}_j({\bf x}_i)\\ & = & (\phi^{(L)}_j({\bf x}_i^{\mathsf{T}}{\bf w})-y_i)\cdot \phi'^{(L)}_j({\bf x}_i^{\mathsf{T}}{\bf w}) \cdot {\mathbf{x}_i} \end{eqnarray} \] Learning rule:

\[ \textbf{w}^{t+1} \leftarrow \textbf{w}^{t} - \alpha\cdot (\phi^{(L)}_j({\bf x}_i^{\mathsf{T}}{\bf w})-y_i)\cdot \phi'^{(L)}_j({\bf x}_i^{\mathsf{T}}{\bf w}) \cdot {\mathbf{x}_i}\]

where \(\alpha\) is a learning rate.
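A sketch of this rule with \(\phi\) taken to be the logistic function, so \(\phi'(a)=\phi(a)(1-\phi(a))\) (the training example and learning rate are made up for illustration):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sgd_step(w, x_i, y_i, alpha=0.1):
        # one step of the learning rule above, with phi = logistic
        h = sigmoid(x_i @ w)
        return w - alpha * (h - y_i) * h * (1 - h) * x_i

    w = np.zeros(3)
    x_i, y_i = np.array([1.0, 0.5, -0.5]), 1.0
    for _ in range(100):              # repeatedly step down this example's gradient
        w = sgd_step(w, x_i, y_i)
    print(sigmoid(x_i @ w))           # prediction moves toward y_i = 1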

Backpropagation

\[ \textbf{w}^{(\ell),t+1}_{\cdot,j} \leftarrow \textbf{w}^{(\ell),t}_{\cdot,j} - \alpha\cdot \delta^{(\ell)}_{j}\cdot \phi'^{(\ell)}_j({\bf h}^{(\ell-1)\mathsf{T}}{\bf w}^{(\ell),t}_{\cdot,j})\cdot {\mathbf{h}^{(\ell-1)}}\]

where \({\mathbf{h}}^{(\ell-1)}\) is the input to layer \(\ell\) (with \({\mathbf{h}}^{(0)}={\mathbf{x}}_i\)) and

\[\delta^{(\ell)}_j = \left\{ \begin{array}{ll} (\phi^{(L)}_j({\bf h}^{(L-1)\mathsf{T}}{\bf w}^{(L)}_{\cdot,j})-y_i) & \mbox{if $\ell$ is the output layer $L$} \\ \sum_k w^{(\ell+1)}_{j,k}\, \phi'^{(\ell+1)}_k({\bf h}^{(\ell)\mathsf{T}}{\bf w}^{(\ell+1)}_{\cdot,k})\, \delta^{(\ell+1)}_k & \mbox{otherwise} \end{array} \right.\]
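A self-contained numpy sketch of one backprop update for a toy 3-3-1 network with logistic units, following the delta recursion above (the shapes and inputs are made up for illustration):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 1))
    x, y, alpha = np.array([0.5, -1.0, 2.0]), 1.0, 0.1

    # forward pass (biases omitted for brevity)
    h1 = sigmoid(x @ W1)
    h2 = sigmoid(h1 @ W2)

    # backward pass: deltas per the recursion; for logistic units phi'(a) = h(1 - h)
    delta2 = h2 - y                          # output layer
    delta1 = W2 @ (delta2 * h2 * (1 - h2))   # sum_k w_{j,k} phi'_k delta_k

    # updates: w <- w - alpha * delta_j * phi'_j * (layer input)
    W2 -= alpha * np.outer(h1, delta2 * h2 * (1 - h2))
    W1 -= alpha * np.outer(x, delta1 * h1 * (1 - h1))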


Variants of Backpropagation

Neural Net Design Decisions

Derivatives of \(\phi\)

Name Derivative (with respect to \(x\))
Identity \(1\)
Binary step \(0\) for \(x\neq 0\) (undefined at \(x=0\))
Logistic \(\phi(x)(1-\phi(x))\)
TanH \(1-\phi(x)^{2}\)
Rectified linear unit (ReLU) \(0\) for \(x<0\); \(1\) for \(x>0\)

Backprop “Conventional Wisdom”

Deep Learning: “Unconventional Wisdom”

http://www.nature.com/nature/journal/v521/n7553/full/nature14539.html

What changed?

Deep Learning “Trick”: Pre-training


Deep Learning “Trick”: Dropout

Before presenting each training example and performing backprop, randomly ignore (zero out) roughly 50% of the hidden nodes in your network.

http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
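A minimal sketch of the idea on one layer's activations; this is the "inverted dropout" variant, which rescales at training time so nothing changes at test time:

    import numpy as np

    def dropout(h, p=0.5, rng=np.random.default_rng()):
        # zero each unit independently with probability p, then rescale
        mask = rng.random(h.shape) >= p
        return h * mask / (1.0 - p)

    h1 = np.array([0.2, 0.9, 0.5, 0.7])
    print(dropout(h1))   # roughly half the activations are zeroed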

Deep Learning “Trick”: ReLUs

Aside: Convolution


Convolutional Neural Network (HTF 11.7)


http://cs231n.github.io/convolutional-networks/

Want to give it a try?

https://www.tensorflow.org

(Other packages available.)

Instance-based learning, Decision Trees

Parametric supervised learning

Non-parametric (memory-based) learning methods

Simple idea: Connect the dots!

One-nearest neighbor

What does the approximator look like?

Each line segment is equidistant between two points of opposite classes.
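A minimal k-NN classifier under Euclidean distance (numpy assumed); with k=1 it produces exactly the piecewise-linear boundary described above:

    import numpy as np

    def knn_predict(X_train, y_train, x_query, k=1):
        # majority vote among the k nearest stored training points
        dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
        nearest = y_train[np.argsort(dists)[:k]]
        return np.bincount(nearest).argmax()

    X = np.array([[0., 0.], [1., 1.], [4., 4.]])
    y = np.array([0, 0, 1])
    print(knn_predict(X, y, np.array([3.5, 3.0])))   # -> 1 (closest point is [4, 4])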

What kind of distance metric?

Distance metric is really important!

Left: attributes weighted equally. Right: unequal weighting.

Distance metric tricks

\(k\)-nearest neighbor

k-NN classification, Majority, k=1

k-NN classification, Majority, k=2

k-NN classification, Majority, k=3

k-NN classification, Majority, k=5

k-NN classification, Majority, k=10

k-NN classification, Majority, k=15

k-NN classification, Mean (prob), k=1

k-NN classification, Mean (prob), k=2

k-NN classification, Mean (prob), k=3

k-NN classification, Mean (prob), k=5

k-NN classification, Mean (prob), k=10

k-NN classification, Mean (prob), k=15

k-NN classification, Mean (prob), k=20

k-NN classification, Mean (prob), k=25

k-NN regression, Mean, k=1

k-NN regression, Mean, k=2

k-NN regression, Mean, k=3

k-NN regression, Mean, k=5

k-NN regression, Mean, k=10

k-NN regression, Mean, k=15

k-NN regression, Mean, k=20

k-NN regression, Mean, k=25

Bias-variance trade-off

Locally-weighted regression

\[ J({\mathbf{w}}) = \sum_i w_i \cdot (h_{\mathbf{w}}({\mathbf{x}}_i) - y_i)^2 \]
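A closed-form sketch for a single query point, with Gaussian weights \(w_i\) that decay with distance from the query (the bandwidth \(\tau\) is a made-up choice):

    import numpy as np

    def lwr_predict(x, y, x_query, tau=0.5):
        # solve the weighted least-squares problem above for one query point
        Xb = np.column_stack([np.ones(len(x)), x])        # add intercept column
        w = np.exp(-((x - x_query) ** 2) / (2 * tau**2))  # instance weights w_i
        W = np.diag(w)
        beta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
        return beta[0] + beta[1] * x_query

    x = np.linspace(0, 10, 50)
    y = np.sin(x) + 0.1 * np.random.randn(50)
    print(lwr_predict(x, y, 5.0))   # local linear fit evaluated at x = 5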

LOESS Smoothing

LOESS Smoothing, alpha=0.200

LOESS Smoothing, alpha=0.300

LOESS Smoothing, alpha=0.400

LOESS Smoothing, alpha=0.500

LOESS Smoothing, alpha=0.600

LOESS Smoothing, alpha=0.700

LOESS Smoothing, alpha=0.750

LOESS Smoothing, alpha=0.800

LOESS Smoothing, alpha=0.900

LOESS Smoothing, alpha=1.000
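A from-scratch sketch matching the plots above: alpha is the fraction of the data used in each local fit, and each fit is a weighted linear regression under tricube weights:

    import numpy as np

    def loess(x, y, alpha=0.5):
        n = len(x)
        k = max(3, int(np.ceil(alpha * n)))               # span: nearest fraction alpha
        fitted = np.empty(n)
        for q in range(n):
            d = np.abs(x - x[q])
            idx = np.argsort(d)[:k]
            w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3   # tricube weights
            Xb = np.column_stack([np.ones(k), x[idx]])
            beta = np.linalg.solve(Xb.T @ np.diag(w) @ Xb,
                                   Xb.T @ np.diag(w) @ y[idx])
            fitted[q] = beta[0] + beta[1] * x[q]
        return fitted

    x = np.linspace(0, 10, 100)
    y = np.sin(x) + 0.2 * np.random.randn(100)
    smooth = loess(x, y, alpha=0.3)   # smaller alpha -> wigglier curve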

Generalized Additive Models

GAM Smoothed Example

Lazy and eager learning

Does it matter?

Pros and cons of lazy and eager learning

When to consider nonparametric methods

  • Advantages:

    • Training is very fast

    • Easy to learn complex functions over few variables

    • Can give back confidence intervals in addition to the prediction

    • Often wins if you have enough data

  • Disadvantages:

    • Slow at query time

    • Query answering complexity depends on the number of instances

    • Easily fooled by irrelevant attributes (for most distance metrics)

    • “Inference” is not possible

Decision Trees

Non-metric learning

Example: Decision tree for Wisconsin data

Using decision trees for classification

How do we classify a new instance, e.g.: radius=18, texture=12, …

Decision trees as logical representations

A decision tree can be converted into an equivalent set of if-then rules.

IF THEN most likely class is
radius \(>17.5\) AND texture \(>21.5\) R
radius \(>17.5\) AND texture \(\leq 21.5\) N
radius \(\leq17.5\) N

Decision trees as logical representations

A decision tree can be converted into an equivalent set of if-then rules.

IF THEN P(R) is
radius \(>17.5\) AND texture \(>21.5\) \(33/(33+5)\)
radius \(>17.5\) AND texture \(\leq 21.5\) \(12/(12+31)\)
radius \(\leq 17.5\) \(25/(25+64)\)

Decision trees, more formally

More on tests for real-valued features

Representational power and efficiency of decision trees

An artificial example

Example: Decision tree decision surface

How do we learn decision trees?

Top-down induction of decision trees

Which test is best?

Entropy

\[H(P)=\sum_{i=1}^k p_i \log_2\frac{1}{p_i}\]
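In code (numpy assumed), using the convention that terms with \(p_i=0\) contribute nothing:

    import numpy as np

    def entropy(p):
        # H(P) = sum_i p_i log2(1 / p_i), in bits
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return np.sum(p * np.log2(1.0 / p))

    print(entropy([0.5, 0.5]))   # 1.0 bit (fair coin)
    print(entropy([0.9, 0.1]))   # ~0.469 bits (biased coin: less uncertainty)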

Entropy applied to binary classification

Marginal Entropy

\(x=\)HasKids \(y=\)OwnsDoraVideo
Yes Yes
Yes Yes
Yes Yes
Yes Yes
No No
No No
Yes No
Yes No
  • From the table, we can estimate \(P(Y=\mathrm{Yes}) = 0.5 = P(Y=\mathrm{No})\).

  • Thus, we estimate \(H(Y) = 0.5 \log \frac{1}{0.5} + 0.5 \log \frac{1}{0.5} = 1\).

Specific conditional entropy

\(x=\)HasKids \(y=\)OwnsDoraVideo
Yes Yes
Yes Yes
Yes Yes
Yes Yes
No No
No No
Yes No
Yes No

Specific conditional entropy is the uncertainty in \(Y\) given a particular \(x\) value. E.g.,

  • \(P(Y=\mathrm{Yes}|X=\mathrm{Yes}) = \frac{2}{3}\), \(P(Y=\mathrm{No}|X=\mathrm{Yes})=\frac{1}{3}\)

  • \(H(Y|X=\mathrm{Yes}) = \frac{2}{3}\log \frac{1}{(\frac{2}{3})} + \frac{1}{3}\log \frac{1}{(\frac{1}{3})}\) \(\approx 0.9183\).

(Average) Conditional entropy

\(x=\)HasKids \(y=\)OwnsDoraVideo
Yes Yes
Yes Yes
Yes Yes
Yes Yes
No No
No No
Yes No
Yes No
  • The conditional entropy, \(H(Y|X)\), is the average specific conditional entropy of \(y\) given the values for \(x\): \[H(Y|X)=\sum_x P(X=x)H(Y|X=x)\]

  • \(H(Y|X=\mathrm{Yes}) = \frac{2}{3}\log \frac{1}{(\frac{2}{3})} + \frac{1}{3}\log \frac{1}{(\frac{1}{3})}\) \(\approx 0.9183\)

  • \(H(Y|X=\mathrm{No}) = 0 \log \frac{1}{0} + 1 \log \frac{1}{1} = 0\), using the convention \(0 \log \frac{1}{0} = 0\).

  • \(H(Y|X) = H(Y|X=\mathrm{Yes})P(X=\mathrm{Yes}) + H(Y|X=\mathrm{No})P(X=\mathrm{No})\) \(= 0.9183 \cdot \frac{3}{4} + 0 \cdot \frac{1}{4}\) \(\approx 0.6887\)

  • Interpretation: the expected number of bits needed to transmit \(y\) when both the emitter and the receiver know the value of \(x\), averaged over the possible values of \(x\).
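The whole calculation, reproduced in a few lines of numpy:

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return np.sum(p * np.log2(1.0 / p))

    x = np.array(list("YYYYNNYY"))   # HasKids, from the table
    y = np.array(list("YYYYNNNN"))   # OwnsDoraVideo

    h_y = entropy([np.mean(y == "Y"), np.mean(y == "N")])   # H(Y) = 1.0
    h_y_x = sum(np.mean(x == v) *
                entropy([np.mean(y[x == v] == "Y"),
                         np.mean(y[x == v] == "N")])
                for v in "YN")                              # H(Y|X) ~ 0.6887
    print(h_y, h_y_x, h_y - h_y_x)   # the difference is the information gain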

Information gain

Information gain to determine best test

Caveats on tests with multiple values

Alternative purity measures

Dealing with noise in the training data

Noise is inevitable!

What happens when adding a noisy example?

Overfitting in decision trees

Example from (Mitchell, 1997)

Avoiding overfitting

Example: Reduced-error pruning

  1. Split the “training data” into a training set and a validation set

  2. Grow a large tree (e.g. until each leaf is pure)

  3. For each node:

    1. Evaluate the validation set accuracy of pruning the subtree rooted at the node

    2. Greedily remove the node that most improves validation set accuracy, with its corresponding subtree

    3. Replace the removed node by a leaf with the majority class of the corresponding examples.

  4. Stop when pruning starts hurting the accuracy on the validation set.
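A self-contained sketch of this procedure on a toy tree; the Node class and helpers here are invented for illustration, not taken from a particular library:

    import numpy as np

    class Node:
        def __init__(self, feature=None, thresh=None, left=None, right=None,
                     majority=None):
            self.feature, self.thresh = feature, thresh
            self.left, self.right = left, right
            self.majority = majority      # majority class of this node's examples
            self.pruned = False

        def predict(self, x):
            if self.pruned or self.left is None:   # leaf, or pruned subtree
                return self.majority
            child = self.left if x[self.feature] <= self.thresh else self.right
            return child.predict(x)

    def accuracy(tree, X, y):
        return np.mean([tree.predict(xi) == yi for xi, yi in zip(X, y)])

    def internal_nodes(node):
        if node.left is not None:
            yield node
            yield from internal_nodes(node.left)
            yield from internal_nodes(node.right)

    def reduced_error_prune(tree, X_val, y_val):
        improved = True
        while improved:
            improved = False
            best, best_acc = None, accuracy(tree, X_val, y_val)
            for node in internal_nodes(tree):
                if node.pruned:
                    continue
                node.pruned = True                  # tentatively prune this subtree
                acc = accuracy(tree, X_val, y_val)
                node.pruned = False
                if acc > best_acc:                  # step 3b: greedy choice
                    best, best_acc = node, acc
            if best is not None:
                best.pruned = True                  # step 3c: becomes a leaf
                improved = True
        return tree

    # toy usage: a split that hurts validation accuracy gets pruned away
    root = Node(feature=0, thresh=0.5, left=Node(majority=0),
                right=Node(majority=1), majority=0)
    reduced_error_prune(root, np.array([[0.2], [0.8], [0.9]]), np.array([0, 0, 0]))
    print(root.pruned)   # True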

Example: Effect of reduced-error pruning

Example: Rule post-pruning in C4.5

  1. Convert the decision tree to rules

  2. Prune each rule independently of the others, by removing preconditions such that the accuracy is improved

  3. Sort final rules in order of estimated accuracy
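A hedged sketch of step 2 for one rule; rule_accuracy is a hypothetical scoring helper supplied by the caller, since rule representations vary:

    def prune_rule(rule, val_data, rule_accuracy):
        # greedily drop preconditions while validation accuracy does not decrease
        best = list(rule)
        improved = True
        while improved and best:
            improved = False
            for i in range(len(best)):
                candidate = best[:i] + best[i+1:]   # drop one precondition
                if rule_accuracy(candidate, val_data) >= rule_accuracy(best, val_data):
                    best, improved = candidate, True
                    break
        return best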

Advantages:

Random Forests

Missing values during classification

Decision Tree Summary (1)

Decision Tree Summary (2)

Extra Slides

What is information?

  • Imagine:

    1. You are about to observe the outcome of a dice roll

    2. You are about to observe the outcome of a coin flip

    3. You are about to observe the outcome of a biased coin flip

    4. Someone is about to tell you your own name

  • Intuitively, in each situation you have a different amount of uncertainty as to what outcome / message you will observe.

Information = Reduction in uncertainty

Let \(E\) be an event that occurs with probability \(P(E)\). If we are told that \(E\) has occurred with certainty, then we received \[I(E) = \log_2\frac{1}{P(E)}\] bits of information.

  • You can also think of information as the amount of “surprise” in the outcome (e.g., consider \(P(E)=1\), \(P(E)\approx 0\))

  • E.g., result of a fair coin flip provides \(\log_2 2 = 1\) bit of information

  • E.g., result of a fair dice roll provides \(\log_2 6 \approx 2.58\) bits of information.

  • E.g., result of being told your own name (or any other deterministic event) produces \(0\) bits of information
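These three examples in code:

    import numpy as np

    for event, p in [("fair coin flip", 1/2), ("fair dice roll", 1/6),
                     ("your own name", 1.0)]:
        print(event, np.log2(1.0 / p))   # 1.0, ~2.585, 0.0 bits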

Interpretations of entropy

\[H(P) = \sum_i p_i \log_2\frac{1}{p_i}\]

  • Average amount of information per symbol

  • Average amount of surprise when observing the symbol

  • Uncertainty the observer has before seeing the symbol

  • Average number of bits needed to communicate the symbol