2018-10-30

## “My data don’t look like the simple examples you showed in class…”

• “I don’t have nice vectors of features, each of the same length.”

• Fair enough. Today, two instances of the following strategy:

1. Identify the prediction you want to make.

2. Identify the information you need to make each prediction.

3. Summarize that information into a feature vector.

4. Pass the result to a supervised learning method.

## Documents: Spam or ham?

BNP PARIBAS 10 HAREWOOD AVENUE, LONDON NW1 6AA TEL: +44 (0) 207 595 2000

RE: NOTIFICATION OF PAYMENT OF ACCRUED INTEREST OF ONE HUNDRED AND FIFTY THOUSAND BRITISH POUNDS STERLING ONLY (£150,000.00).

This is inform you that your contract/inheritance fund, which was deposited with us since 2010-2013 has accumulated an interest of sum of One Hundred and Fifty Thousand British Pounds Sterling Only (£150,000.00).

In line with the joint agreement signed by the board of trustees of BNP Paribas and the depositor of the said fund, the management has mandated us to pay you the accrued interest pending when approvals would be granted on the principal amount.

In view of the above, you are hereby required to complete the attached questionnaire form, affix your passport photograph, signed in the specified columns and return back to us via email for the processing of the transfer of your fund.

Do not hesitate to contact your assigned transaction manager, Mr. Paul Edward on his direct telephone no. +44 792 4278526, if you need any clarification.

Yours faithfully,

Elizabeth Manning (Ms.) Chief Credit Officer, BNP Paribas, London.

## Documents: Spam or ham?

Dear Dan,

How are you? I hope everything is well.

Recently I have collaborated with some colleagues of mine from environmental engineering to research in multi-objective RL (if you are interested you can take a look at one of our papers: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6252759) and I found your approach very interesting.

I want to thank you for replying to my students’ email, we have really appreciated your willingness to share your code with us. I know that the code used to produce paper results is often “weak” since it is not produced for being distributed, and I understand your desire to set up and embellish the code (I do the same thing when other researchers ask for my code). I only want to reassure you that we have no hurry, so having your code in one or two weeks will be perfect.

Unfortunately I will not be in Rome for ICAPS. However, I hope to meet you soon, will you attend ICML?

paribas(0)=3.0
harewood(1)=1.0
avenue(2)=1.0
london(3)=2.0
nw(4)=1.0
aa(5)=1.0
tel(6)=1.0
attn(7)=1.0
sir(8)=1.0
re(10)=1.0
of(12)=11.0
payment(13)=1.0
accrued(14)=2.0
interest(15)=3.0
one(16)=2.0
hundred(17)=2.0
and(18)=4.0
fifty(19)=2.0
thousand(20)=2.0
british(21)=2.0
pounds(22)=4.0
sterling(23)=2.0

credit(114)=1.0
officer(115)=1.0
of(12)=2.0
one(16)=2.0
and(18)=3.0
only(24)=1.0
is(26)=3.0
you(28)=7.0
that(29)=2.0
your(30)=5.0
with(37)=2.0
us(38)=1.0
since(39)=1.0
in(44)=3.0
the(46)=3.0
to(58)=8.0
when(61)=1.0
be(64)=2.0
are(71)=2.0
email(86)=1.0
for(87)=4.0
do(90)=1.0

attend(202)=1.0
icml(203)=1.0

## Dictionary-based Representations

• Define “feature detectors” that map an instance (e.g. document) to one feature value.
• E.g. word features that are $$1$$ if word is in a document, $$0$$ otherwise
• Or number of times word occurs in document
• Fix a collection of instances, called a corpus.
• Your feature set consists of all of your feature detectors that “turn on” for some instance in the corpus.
• Often results in a very large number of features.
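A minimal sketch of a dictionary-based representation, assuming a plain lowercase-and-split tokenizer. The names `tokenize`, `build_vocabulary`, and `featurize` are illustrative only and do not come from any particular library. The vocabulary holds exactly the word features that "turn on" somewhere in the corpus, and each document becomes a sparse map from feature index to value.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


def build_vocabulary(corpus):
    # Assign an index to every word that occurs in at least one document.
    vocab = {}
    for doc in corpus:
        for word in tokenize(doc):
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def featurize(doc, vocab, binary=False):
    # Sparse feature vector {feature index: value}; zero entries are omitted.
    counts = Counter(w for w in tokenize(doc) if w in vocab)
    return {vocab[w]: (1.0 if binary else float(c)) for w, c in counts.items()}


corpus = ["pounds sterling pounds", "attend icml"]
vocab = build_vocabulary(corpus)
counts = featurize(corpus[0], vocab)                  # number of occurrences
presence = featurize(corpus[0], vocab, binary=True)   # 1 if present
```

Words that never appear in the corpus have no feature, so a new document made only of unseen words maps to an empty (all-zero) vector.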

## Bag-of-words

• Little example has vocabulary size 203. In a real corpus, more like 10000.

• Representation is sparse: Only non-zero entries recorded

• Nonetheless, still a fixed-size feature vector

• Typical tricks: omit punctuation, omit capitalization, omit stop words

• liblinear, libsvm, and mallet are three excellent tools
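The typical tricks can be sketched in a few lines. This is an assumption about what such preprocessing might look like, not the pipeline that produced the listing above; the stop-word list `STOP_WORDS` is a deliberately tiny, hypothetical one.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "of", "and", "to", "is", "a"}


def preprocess(text):
    # Omit capitalization, keep only runs of letters (this drops
    # punctuation and digits), then omit stop words.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def sparse_counts(text):
    # Only words that actually occur are recorded.
    return dict(Counter(preprocess(text)))


example = sparse_counts("The interest of the FUND, and the Fund: INTEREST!")
```

Here `example` contains just two entries even though the vocabulary of a real corpus could hold thousands of words.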

## Beyond Bag of Words

• Simple! Maybe too simple?

• All structure, i.e., relationships between words, is lost.

• “Michael ate the fish.”

• “The fish ate Michael.”
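A quick check of the point above, as a sketch with a hypothetical helper `bag_of_words`: the two sentences contain the same words, so their bags of words are identical.

```python
from collections import Counter


def bag_of_words(sentence):
    # Lowercase, drop the final full stop, and count words; order is discarded.
    return Counter(sentence.lower().rstrip(".").split())


a = bag_of_words("Michael ate the fish.")
b = bag_of_words("The fish ate Michael.")
same = (a == b)
```

Any classifier that sees only these counts cannot tell who ate whom.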

## Structural Features I: Bigrams

• Count all pairs of adjacent words, throw them in with our original bag of words.

paribas_harewood(208)=1.0
harewood_avenue(209)=1.0
avenue_london(210)=1.0
london_nw(211)=1.0

chief_credit(664)=1.0
credit_officer(665)=1.0

of_one(666)=1.0
one_and(667)=1.0
and_only(668)=1.0
only_is(669)=1.0
is_you(670)=1.0
you_that(671)=1.0

meet_attend(672)=1.0
attend_icml(673)=1.0

• Trigrams, 4-grams, …, N-grams…
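A minimal sketch of N-gram features, assuming words are joined with an underscore as in the listing above. The helper names `ngrams` and `unigram_and_bigram_counts` are illustrative.

```python
from collections import Counter


def ngrams(words, n):
    # Every run of n adjacent words, joined with "_".
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def unigram_and_bigram_counts(sentence):
    words = sentence.lower().rstrip(".").split()
    # Throw the bigrams in with the original bag of words.
    return Counter(words + ngrams(words, 2))


a = unigram_and_bigram_counts("Michael ate the fish.")
b = unigram_and_bigram_counts("The fish ate Michael.")
```

With bigrams included, the two sentences no longer share a representation: `michael_ate` appears only in the first and `fish_ate` only in the second.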

## Structural Features II: Parts of Speech

• “Go milk the cow.” milk is a Verb Phrase

• “Drink your milk.” milk is a Noun Phrase

• A part-of-speech tagger can tell the difference.

milk_VP=1.0
milk_NP=1.0

## Structural Features III: Parsing

    (S (NP Michael)
       (VP ate
           (NP the fish))
       .)

    (S(NP)(VP(NP)))=1.0
    (S(NP_Michael)(VP(NP)))=1.0


## Structural Features III: Parsing

“Do not tell them our products contain asbestos.”

“Tell them our products do not contain asbestos.”

    (S (VP Do not
           (VP tell
               (NP them)
               (SBAR (S (NP our products)
                        (VP contain
                            (NP asbestos))))))
       .)

    (S (VP Tell
           (NP them)
           (SBAR (S (NP our products)
                    (VP do not
                        (VP contain
                            (NP asbestos))))))
       .)

    (VP contain (NP asbestos))=1.0
    (VP do not (VP contain (NP asbestos)))=1.0


## Extreme Overfitting

• With say 100s of documents and 10000s of features, overfitting is easy.

• E.g., for binary classification with linear separator, if every document has a unique word, give $$+$$ weight to those for $$+$$ documents, $$-$$ weight to those of $$-$$ documents. Perfect fit!

• For $$n$$ documents, each containing a unique word, a linear classifier needs only $$n$$ nonzero weights out of the 10000 to fit the training set perfectly. A decision tree could do the same with $$n$$ nodes.
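The unique-word construction can be sketched on a toy corpus. The documents, words, and labels below are invented purely for illustration.

```python
# Hypothetical toy corpus: each document contains one word found nowhere else.
docs = [
    (["payment", "zebra"], +1),
    (["payment", "quokka"], -1),
    (["payment", "narwhal"], +1),
]

# Count how many documents each word appears in.
doc_freq = {}
for words, _ in docs:
    for w in set(words):
        doc_freq[w] = doc_freq.get(w, 0) + 1

# Give every unique word a weight equal to the label of its document.
weights = {}
for words, label in docs:
    for w in words:
        if doc_freq[w] == 1:
            weights[w] = float(label)


def predict(words):
    # Linear classifier: sign of the summed weights (ties go to -1).
    score = sum(weights.get(w, 0.0) for w in words)
    return 1 if score > 0 else -1


training_accuracy = sum(predict(w) == y for w, y in docs) / len(docs)
```

Training accuracy is perfect with exactly $$n$$ nonzero weights, yet the classifier has learned nothing useful: a new document without any of these words gets a score of zero.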

## Sparsity from SVMs

• Recall that SVMs try to minimize the norm of the weights, $$||\mathbf{w}||^2$$, while getting training points on the correct side of the margin.

• Suppose features coded $$\pm 1$$. In an SVM, weight for each word would have to be $$\pm 1$$ to satisfy $$y_i\mathbf{w}^\mathsf{T}\mathbf{x}_i \ge 1$$; then $$||\mathbf{w}||^2 = n$$.

• If one word can discriminate, can use weight vector with $$||\mathbf{w}||^2 = 1$$

• SVMs prefer sparse, simple solutions, and can avoid overfitting even when $$p \gg n$$.

## Sparsity from Other Models

• Recall linear regression and logistic regression both work by finding the $$\mathbf{w}$$ that minimizes training error.

$\mathbf{w}^* = \arg\min_\mathbf{w}J(\mathbf{w})$

• If we can exactly fit every point (linear regression) or data are linearly separable (logistic regression) then $$\mathbf{w}$$ will not be unique, and we are prone to overfitting. (Depending on software, you will see warnings/crashes.)

## Regularization [JWHT 6.2]

• Idea: Ask for a weight vector that has low training error but is also small:

$\mathbf{w}^* = \arg\min_\mathbf{w}J(\mathbf{w}) + \lambda ||\mathbf{w}||$

• This is called regularization.
• The $$\lambda ||\mathbf{w}||$$ is called a penalty; importance governed by $$\lambda$$
• If $$||\mathbf{w}||$$ defined as $$\sum_j |w^{(j)}|^2$$, this is called a ridge penalty
• If $$||\mathbf{w}||$$ defined as $$\sum_j |w^{(j)}|$$, this is an L1 or LASSO penalty
• Ridge shrinks all weights toward zero; many may be nonzero though.
• LASSO will shrink some weights to exactly zero, performing feature selection
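The difference between the two penalties is easiest to see in the simple special case discussed in [JWHT 6.2]: $$n = p$$, the design matrix is the identity, there is no intercept, and $$J(\mathbf{w}) = \sum_j (y_j - w^{(j)})^2$$. Under those assumptions both penalized problems have closed-form solutions, sketched below; the function names are illustrative.

```python
def ridge_solution(y, lam):
    # Minimizes sum_j (y_j - w_j)^2 + lam * sum_j w_j^2:
    # every weight is scaled by the same factor 1 / (1 + lam).
    return [yj / (1.0 + lam) for yj in y]


def lasso_solution(y, lam):
    # Minimizes sum_j (y_j - w_j)^2 + lam * sum_j |w_j|:
    # soft-thresholding at lam / 2, so small weights become exactly zero.
    out = []
    for yj in y:
        if yj > lam / 2:
            out.append(yj - lam / 2)
        elif yj < -lam / 2:
            out.append(yj + lam / 2)
        else:
            out.append(0.0)
    return out


y = [3.0, 0.4, -2.0]
ridge_w = ridge_solution(y, lam=1.0)   # all three weights shrink, none is zero
lasso_w = lasso_solution(y, lam=1.0)   # the middle weight is set to exactly zero
```

In general designs there is no such per-coordinate formula, but the qualitative behaviour (proportional shrinkage versus thresholding) is the same.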

## Feature Selection [JWHT Ch 6.1]

• Special case of model selection, which we saw previously (e.g. with polynomial degree).
• Add or remove features, check cross-validation error, pick favourite model
• Model selection criteria as alternatives to cross-validation are given in [JWHT Ch 6.1.3]. Can save computation.
• Mallows’ $$C_p$$, AIC, BIC all have the same form; training error plus a penalty for the number of parameters
• Try different subsets of features, choose the one that gives lowest criterion value
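A sketch of criterion-based subset selection for linear regression, assuming NumPy is available. The score used here, $$n \log(\mathrm{RSS}/n) + k \log n$$, is one common BIC-style form (training error plus a penalty on the number of parameters $$k$$); it is an assumption for illustration and is not the exact formula given in [JWHT Ch 6.1.3]. The data are synthetic.

```python
import itertools
import numpy as np


def criterion(X, y, subset):
    # Fit least squares on an intercept plus the chosen columns, then
    # return training error plus a penalty for the number of parameters.
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ coef) ** 2))
    return n * np.log(rss / n) + A.shape[1] * np.log(n)


def best_subset(X, y):
    # Exhaustive search: only feasible for a small number of features.
    p = X.shape[1]
    subsets = [s for k in range(p + 1)
               for s in itertools.combinations(range(p), k)]
    return min(subsets, key=lambda s: criterion(X, y, s))


# Synthetic data: y depends on features 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
chosen = best_subset(X, y)
```

Exhaustive search visits $$2^p$$ subsets, so for larger $$p$$ one would add or remove one feature at a time instead.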

## Feature Selection vs. Feature Construction

Let $$n$$ be number of examples, $$p$$ be number of features.

• If you have e.g. $$n > 10p$$, you may not need to bother with feature selection.
• Unless you want to discover that some features are redundant.
• If you have e.g. $$n > 1000p$$ and you are getting bad performance, you may want to construct features, or use a non-linear classifier, in case the problem is lack of fit.

## Feature Selection Scenario

Suppose you have features $$x_1, x_2, x_3, ..., x_p$$, and label $$y$$.

What properties of $$x_1$$ might lead you to remove it from the model?

## Feature Selection Scenario

Suppose you only look at the features $$x_1, x_2, x_3, ..., x_p$$.

Can you identify some that don’t help predict $$y$$?

## Unsupervised Learning

• Only features $$x_1, ..., x_p$$
• None is more important than the others
• Discover relationships among the instances and features

• Difficult to evaluate “performance” because task is ill-defined
• You might use the output of unsupervised learning for supervised learning

## Dimensionality Reduction

• Dimensionality reduction (or embedding) techniques:

• Take data that has $$p$$ features

• Re-encode as data that has $$q$$ features, $$q < p$$

• Don’t lose too much information

## Dimensionality Reduction Techniques

• Axis-aligned: Remove features that are well-predicted by other features

• Linear: Principal components analysis creates smaller set of new features that are weighted sums of existing features

• Non-linear: Create small set of new features that are non-linear functions of existing features

• Kernel PCA
• Independent components analysis
• Self-organizing maps
• Multi-dimensional scaling
• t-SNE: t-distributed Stochastic Neighbour Embedding

## Axis-aligned dimensionality reduction

Correlation matrix
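One simple way to use a correlation matrix for axis-aligned reduction is sketched below, assuming NumPy. The greedy rule and the threshold of 0.95 are illustrative choices, and pairwise correlation only detects a feature that is well predicted by a single other feature through a linear relationship.

```python
import numpy as np


def drop_correlated(X, threshold=0.95):
    # Keep a feature unless its absolute correlation with an
    # already-kept feature reaches the threshold.
    corr = np.corrcoef(X, rowvar=False)
    kept = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < threshold for k in kept):
            kept.append(j)
    return kept


rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
# Column 1 is an exact linear function of column 0, so it is redundant.
X = np.column_stack([a, 2.0 * a + 1.0, b])
kept = drop_correlated(X)
```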

## “True dimensionality” of this dataset?

• You may give me a model with $$\ll n$$ parameters ahead of time.

• How many additional numbers must you send to tell me approximately where a particular data point is?

## Remarks

• All dimensionality reduction techniques are based on an implicit assumption that the data lies on (near) some low-dimensional manifold

• This is the case for the first three examples, which (almost) lie along a 1-dimensional manifold despite being plotted in 2D

• In the last example, no dimensionality reduction is possible without losing a lot of information

## Principal Component Analysis (PCA) [JWHT 10.2]

• Given: $$n$$ instances, each being a length-$$p$$ real vector.

• Suppose we want a 1-dimensional representation of that data, instead of $$p$$-dimensional.

• Specifically, we will:

• Choose a line in $${\mathbb{R}}^{p}$$ that “best represents” the data.

• Assign each data object to a point along that line.

• Identifying a point on a line just requires a scalar: How far along the line is the point?

## Reconstruction error

• Let the line be represented as $${\bf b}+\alpha \mathbf{v}$$ for $${\bf b},\mathbf{v}\in{\mathbb{R}}^p$$, $$\alpha\in{\mathbb{R}}$$.
For convenience assume $$\|\mathbf{v}\|=1$$.

• Each instance $$\mathbf{x}_i$$ is associated with a point on the line $$\hat{\mathbf{x}_i}={\bf b}+\alpha_i\mathbf{v}$$.

• Instance $$\mathbf{x}_i$$ is encoded as a single $$\alpha_i$$

• This is the new (and only) feature for instance $$i$$

## Minimizing reconstruction error

• We want to choose $${\bf b}$$, $$\mathbf{v}$$, and the $$\alpha_i$$ to minimize the total reconstruction error over all data points, measured using Euclidean distance: $R=\sum_{i=1}^n\|\mathbf{x}_i-\hat{\mathbf{x}_i}\|^2$

• Difference from regression: Given the new feature $$\alpha_i$$, we reconstruct all dimensions of $$\mathbf{x}_i$$. All are equally important.

$\min_{{\bf b},\, \mathbf{v},\, \alpha_1,\dots,\alpha_n}\ \sum_{i=1}^n\|\mathbf{x}_i-({\bf b} + \alpha_i \mathbf{v})\|^2 \quad \text{s.t.}\quad \|\mathbf{v}\|^2=1$

• Can be computed by Singular Value Decomposition
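A sketch of the one-dimensional case with NumPy, using the standard solution: take $${\bf b}$$ to be the mean of the data, $$\mathbf{v}$$ the first right singular vector of the centred data, and $$\alpha_i = \mathbf{v}^\mathsf{T}(\mathbf{x}_i - {\bf b})$$. The function names and the toy data are illustrative.

```python
import numpy as np


def fit_line(X):
    b = X.mean(axis=0)                        # a point on the line
    _, _, Vt = np.linalg.svd(X - b, full_matrices=False)
    v = Vt[0]                                 # unit-length direction
    alpha = (X - b) @ v                       # one scalar per instance
    return b, v, alpha


def reconstruction_error(X, b, v, alpha):
    X_hat = b + np.outer(alpha, v)            # x_hat_i = b + alpha_i * v
    return float(np.sum((X - X_hat) ** 2))


# Toy data lying exactly on a line in 2D, so the error should be about zero.
t = np.linspace(-1.0, 1.0, 50)
X = np.column_stack([t, 2.0 * t + 0.5])
b, v, alpha = fit_line(X)
R = reconstruction_error(X, b, v, alpha)
```

The sign of $$\mathbf{v}$$ is arbitrary: flipping it flips every $$\alpha_i$$ and leaves the reconstruction unchanged.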

## Reduction to $$d$$ dimensions

• $${\bf b}$$, $$\mathbf{v}$$, and the $$\alpha_i$$ can be computed easily in polynomial time. The $$\alpha_i$$ give a 1D representation.

• More generally, we can create a $$d$$-dimensional representation of our data by projecting the instances onto a $$d$$-dimensional affine subspace (a hyperplane when $$d = p-1$$) $${\bf b}+\alpha^1\mathbf{v}_1+\ldots+\alpha^d\mathbf{v}_d$$.

## Singular Value Decomposition

• $${\bf b}$$, the eigenvalues $$\lambda$$, the $$\mathbf{v}_j$$, and the projections of the instances can all be computed in polynomial time, e.g. using (thin) Singular Value Decomposition.
$X_{n\times p} = {\color{red}U_{n\times p}} {\color{blue}D_{p \times p}} {\color{green}V_{p \times p}^\mathsf{T}}$

• Columns of $$U$$ are the left singular vectors, the diagonal of $$D$$ holds the singular values (square roots of the eigenvalues of $$X^\mathsf{T}X$$), and columns of $$V$$ are the right singular vectors

• Typically $$D$$ is sorted by magnitude.
First $$d$$ columns of $$U$$ are new representation of $$X$$.

• To encode new feature vector $$\mathbf{x}$$ as a vector $$\mathbf{u}$$:

$$\mathbf{u}= D^{-1} V^\mathsf{T}\mathbf{x}$$, take first $$d$$ elements.
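The encoding formula can be checked numerically with NumPy's thin SVD. This sketch assumes the data matrix has already been centred (the mean $${\bf b}$$ subtracted) and that all singular values are nonzero; the random data and the helper `encode` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
X = X - X.mean(axis=0)      # assumption: centre the data before the SVD

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)


def encode(x, d):
    # u = D^{-1} V^T x, keeping the first d elements.
    return ((Vt @ x) / s)[:d]


u0 = encode(X[0], d=2)      # should reproduce the first row of U[:, :2]
```

Encoding a training row recovers the corresponding row of $$U$$, which is what makes the formula usable for new feature vectors as well. A genuinely new vector would first need the same mean subtracted.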

## Eigenvalue Magnitudes [JWHT p.383]

• The magnitude of the $$j^{th}$$-largest eigenvalue, $$\lambda_j$$, tells how much variability in the data is captured by the $$j^{th}$$ principal component

• When the eigenvalues are sorted in decreasing order, the proportion of the variance captured by the first $$d$$ components is: $\frac{\lambda_1 + \dots + \lambda_d}{\lambda_1 + \dots + \lambda_d + \lambda_{d+1} + \dots + \lambda_p}$

• So if a “big” drop occurs in the eigenvalues at some point, that suggests a good dimension cutoff
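A sketch of the proportion-of-variance calculation with NumPy, using the fact that the eigenvalues are the squared singular values of the centred data (up to a constant factor that cancels in the ratio). The function name and toy data are illustrative.

```python
import numpy as np


def variance_captured(X):
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # sorted in decreasing order
    eigenvalues = s ** 2
    # Entry d-1 is the proportion captured by the first d components.
    return np.cumsum(eigenvalues) / np.sum(eigenvalues)


# Toy data: almost on a line in 2D, with a small wiggle off the line.
t = np.linspace(-1.0, 1.0, 50)
wiggle = 0.01 * np.sin(40.0 * t)
X = np.column_stack([t, 2.0 * t + wiggle])
proportions = variance_captured(X)
```

Here nearly all of the variance is captured by the first component, so the eigenvalues drop sharply after $$d = 1$$.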

## More remarks [JWHT 10.2.3]

• Outliers have a big effect on the covariance matrix, so they can affect the eigenvectors quite a bit