2017-02-26

## “My data don’t look like the simple examples you showed in class…”

• “I don’t have nice vectors of features, each of the same length.”

• Fair enough. Today, two instances of the following strategy:

1. Identify the prediction you want to make.

2. Identify the information you need to make each prediction.

3. Summarize that information into a feature vector.

4. Pass the result to a supervised learning method.

## Documents: Spam or ham?

BNP PARIBAS 10 HAREWOOD AVENUE, LONDON NW1 6AA TEL: +44 (0) 207 595 2000

RE: NOTIFICATION OF PAYMENT OF ACCRUED INTEREST OF ONE HUNDRED AND FIFTY THOUSAND BRITISH POUNDS STERLING ONLY (£150,000.00).

This is inform you that your contract/inheritance fund, which was deposited with us since 2010-2013 has accumulated an interest of sum of One Hundred and Fifty Thousand British Pounds Sterling Only (£150,000.00).

In line with the joint agreement signed by the board of trustees of BNP Paribas and the depositor of the said fund, the management has mandated us to pay you the accrued interest pending when approvals would be granted on the principal amount.

In view of the above, you are hereby required to complete the attached questionnaire form, affix your passport photograph, signed in the specified columns and return back to us via email for the processing of the transfer of your fund.

Do not hesitate to contact your assigned transaction manager, Mr. Paul Edward on his direct telephone no. +44 792 4278526, if you need any clarification.

Yours faithfully,

Elizabeth Manning (Ms.) Chief Credit Officer, BNP Paribas, London.

## Documents: Spam or ham?

Dear Dan,

How are you? I hope everything is well.

Recently I have collaborated with some colleagues of mine from environmental engineering to research in multi-objective RL (if you are interested you can take a look at one of our papers: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6252759) and I found your approach very interesting.

I want to thank you for replying to my students’ email, we have really appreciated your willingness to share your code with us. I know that the code used to produce paper results is often “weak” since it is not produced for being distributed, and I understand your desire to set up and embellish the code (I do the same thing when other researchers ask for my code). I only want to reassure you that we have no hurry, so having your code in one or two weeks will be perfect.

Unfortunately I will not be in Rome for ICAPS. However, I hope to meet you soon, will you attend ICML?

    paribas(0)=3.0
    harewood(1)=1.0
    avenue(2)=1.0
    london(3)=2.0
    nw(4)=1.0
    aa(5)=1.0
    tel(6)=1.0
    attn(7)=1.0
    sir(8)=1.0
    re(10)=1.0
    of(12)=11.0
    payment(13)=1.0
    accrued(14)=2.0
    interest(15)=3.0
    one(16)=2.0
    hundred(17)=2.0
    and(18)=4.0
    fifty(19)=2.0
    thousand(20)=2.0
    british(21)=2.0
    pounds(22)=4.0
    sterling(23)=2.0

    credit(114)=1.0
    officer(115)=1.0
    of(12)=2.0
    one(16)=2.0
    and(18)=3.0
    only(24)=1.0
    is(26)=3.0
    you(28)=7.0
    that(29)=2.0
    your(30)=5.0
    with(37)=2.0
    us(38)=1.0
    since(39)=1.0
    in(44)=3.0
    the(46)=3.0
    to(58)=8.0
    when(61)=1.0
    be(64)=2.0
    are(71)=2.0
    email(86)=1.0
    for(87)=4.0
    do(90)=1.0

    attend(202)=1.0
    icml(203)=1.0

## Dictionary-based Representations

• Define "feature detectors" that map an instance (e.g. document) to one feature value.
• E.g., word features that are $$1$$ if the word occurs in a document, $$0$$ otherwise
• Fix a collection of instances, called a corpus.
• Your feature set consists of all of your feature detectors that "turn on" for some instance in the corpus.
• Often results in a very large number of features.

## Bag-of-words

• Our little example has vocabulary size 203. In a real corpus, more like 10,000.

• Representation is sparse: Only non-zero entries recorded

• Nonetheless, still a fixed-size feature vector

• Typical tricks: omit punctuation, omit capitalization, omit stop words

• liblinear, libsvm, and mallet are three excellent tools
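
The recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular toolkit's API; the function names and the tiny stop-word list are my own:

```python
from collections import Counter
import string

# Tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = frozenset({"a", "an", "the"})

def bag_of_words(text):
    """Sparse bag of words: omit punctuation, capitalization, stop words."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(t for t in stripped.split() if t not in STOP_WORDS)

def build_vocabulary(corpus):
    """Feature set = every word detector that 'turns on' somewhere in the corpus."""
    vocab = {}
    for doc in corpus:
        for word in bag_of_words(doc):
            vocab.setdefault(word, len(vocab))  # assign the next free index
    return vocab

corpus = ["Go milk the cow.", "Drink your milk."]
vocab = build_vocabulary(corpus)
# Fixed-size vectors, but stored sparsely: only non-zero entries recorded.
sparse = [{vocab[w]: float(c) for w, c in bag_of_words(d).items()} for d in corpus]
```

The `word(index)=count` listings above are exactly this sparse representation, printed one entry per line.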

## Beyond Bag of Words

• Simple! Maybe too simple?

• All structure, i.e., relationships between words, is lost.

• “Michael ate the fish.”

• “The fish ate Michael.”
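
A quick check makes the loss concrete: with the usual tricks (omit punctuation and capitalization), the two sentences map to identical bags of words.

```python
from collections import Counter

def bow(sentence):
    # Same tricks as above: omit punctuation and capitalization.
    return Counter(sentence.lower().replace(".", "").split())

# Opposite meanings, yet the representations are identical:
same = bow("Michael ate the fish.") == bow("The fish ate Michael.")
```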

## Structural Features I: Bigrams

• Count all pairs of adjacent words, throw them in with our original bag of words.

    paribas_harewood(208)=1.0
    harewood_avenue(209)=1.0
    avenue_london(210)=1.0
    london_nw(211)=1.0

    chief_credit(664)=1.0
    credit_officer(665)=1.0

    of_one(666)=1.0
    one_and(667)=1.0
    and_only(668)=1.0
    only_is(669)=1.0
    is_you(670)=1.0
    you_that(671)=1.0

    meet_attend(672)=1.0
    attend_icml(673)=1.0

• Trigrams, 4-grams, …, N-grams…
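
One helper covers the whole family. A minimal sketch (the helper name and `_`-joining are illustrative, matching the feature names above):

```python
def ngrams(tokens, n):
    """All runs of n adjacent tokens, joined with '_' as in the features above."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "chief credit officer".split()
features = ngrams(tokens, 1) + ngrams(tokens, 2)  # bag of words + bigrams
```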

## Structural Features II: Parts of Speech

• “Go milk the cow.” milk is a Verb Phrase

• “Drink your milk.” milk is a Noun Phrase

• A part-of-speech tagger can tell the difference.

    milk_VP=1.0
    milk_NP=1.0

## Structural Features III: Parsing

    (S (NP Michael)
       (VP ate
           (NP the fish))
       .)

    (S(NP)(VP(NP)))=1.0
    (S(NP_Michael)(VP(NP)))=1.0


## Structural Features III: Parsing

“Do not tell them our products contain asbestos.”

“Tell them our products do not contain asbestos.”

    (S (VP Do not
           (VP tell
               (NP them)
               (SBAR (S (NP our products)
                        (VP contain
                            (NP asbestos))))))
       .)

    (S (VP Tell
           (NP them)
           (SBAR (S (NP our products)
                    (VP do not
                        (VP contain
                            (NP asbestos))))))
       .)

    (VP contain (NP asbestos))=1.0
    (VP do not (VP contain (NP asbestos)))=1.0
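
Features like these can be read off by enumerating every subtree of the parse. A minimal sketch, assuming the parse is already available as a nested Python list (a real system would get it from a parser):

```python
def subtrees(tree):
    """Yield every constituent (nested list) of a parse tree."""
    if isinstance(tree, list):
        yield tree
        for child in tree[1:]:          # tree[0] is the label, e.g. "VP"
            yield from subtrees(child)

def render(tree):
    """Print a tree in the bracketed notation used above."""
    if isinstance(tree, list):
        return "(" + " ".join([tree[0]] + [render(c) for c in tree[1:]]) + ")"
    return tree

# "Tell them our products do not contain asbestos."
parse = ["S", ["VP", "Tell", ["NP", "them"],
               ["SBAR", ["S", ["NP", "our", "products"],
                         ["VP", "do", "not",
                          ["VP", "contain", ["NP", "asbestos"]]]]]], "."]

features = {render(t): 1.0 for t in subtrees(parse)}
```

Only the first sentence's parse contains the bare `(VP contain (NP asbestos))` subtree without the negation wrapped around it, so the two documents now get different features.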


## Overfitting

• With, say, 100s of documents and 10,000s of features, overfitting is easy.

• E.g., for binary classification with a linear separator, if every document has a unique word, give $$+$$ weight to those words for $$+$$ documents and $$-$$ weight to those for $$-$$ documents. Perfect fit!

• For $$n$$ documents, linear classifier would need $$n$$ weights out of the 10000 to be nonzero.

• Suppose features coded $$\pm 1$$. In an SVM, each weight would have to be $$\pm 1$$ to satisfy $$y_i{\mathbf{w}}^T{\mathbf{x}}_i \ge 1$$; norm of $${\mathbf{w}}$$ is $$\sqrt{n}$$.

• If one word can discriminate, can use weight vector with $$||{\mathbf{w}}|| = 1$$

• SVMs prefer sparse, simple solutions, and can avoid overfitting even when $$p \gg n$$.
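
The perfect-fit construction in the bullets above can be written out in a few lines of NumPy; the data here are random placeholders, used only to show the geometry:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 20                            # few documents, many features
y = np.array([1, -1, 1, 1, -1])         # arbitrary labels

# Shared features are arbitrary 0/1 values; the last n columns play the role
# of each document's unique word (column p - n + i fires only for document i).
X = rng.integers(0, 2, size=(n, p)).astype(float)
X[:, p - n:] = np.eye(n)

# Perfect fit: weight y_i on document i's unique word, zero elsewhere.
w = np.zeros(p)
w[p - n:] = y

margins = y * (X @ w)                   # y_i * w^T x_i = 1 for every document
# n nonzero +/-1 weights  =>  ||w|| = sqrt(n), growing with the training set.
```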

## What is clustering?

• Clustering is grouping similar objects together.

• To establish prototypes, or detect outliers.

• To simplify data for further analysis/learning.

• To visualize data (in conjunction with dimensionality reduction)

• Clusterings are usually not “right” or “wrong” – different clusterings/clustering criteria can reveal different things about the data.

## Clustering Algorithms

• Clustering algorithms:

• Employ some notion of distance between objects

• Have an explicit or implicit criterion defining what a good cluster is

• Heuristically optimize that criterion to determine the clustering

• Some clustering criteria/algorithms have natural probabilistic interpretations

## $$K$$-means clustering

• One of the most commonly-used clustering algorithms, because it is easy to implement and quick to run.

• Assumes the objects (instances) to be clustered are $$p$$-dimensional vectors, $${\mathbf{x}}_i$$.

• Uses a distance measure between the instances (typically Euclidean distance)

• The goal is to partition the data into $$K$$ disjoint subsets

## $$K$$-means clustering

• Inputs:

• A set of $$p$$-dimensional real vectors $$\{{\mathbf{x}}_1, {\mathbf{x}}_2, \ldots, {\mathbf{x}}_n\}$$.

• $$K$$, the desired number of clusters.

• Output: A mapping of the vectors into $$K$$ clusters (disjoint subsets), $$C:\{1,\ldots,n\}\mapsto\{1,\ldots,K\}$$.

1. Initialize $$C$$ randomly.

2. Repeat until $$C$$ stops changing:

    1. Compute the centroid of each cluster (the mean of all the instances in the cluster).

    2. Reassign each instance to the cluster with the closest centroid.
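
The pseudocode above, sketched in NumPy (this is Lloyd's algorithm; the empty-cluster handling is one common choice, not the only one):

```python
import numpy as np

def k_means(X, K, rng=None):
    """K-means clustering, following the pseudocode above."""
    rng = rng or np.random.default_rng(0)
    C = rng.integers(0, K, size=len(X))            # 1. initialize C randomly
    while True:
        # 2a. centroid of each cluster (the mean of its instances);
        #     an empty cluster grabs a random point as its centroid.
        mu = np.array([X[C == k].mean(axis=0) if np.any(C == k)
                       else X[rng.integers(len(X))] for k in range(K)])
        # 2b. reassign each instance to the cluster with the closest centroid
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        C_new = dists.argmin(axis=1)
        if np.array_equal(C_new, C):               # until C stops changing
            return C, mu
        C = C_new

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
C, mu = k_means(X, K=2)
```

On this toy data the only stable clustering is the obvious two-pairs split, whatever the random initialization.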

## Assessing the quality of the clustering

• If used as a pre-processing step for supervised learning, measure the performance of the supervised learner

• Measure the "tightness" of the clusters: points in the same cluster should be close together, points in different clusters should be far apart

• Tightness can be measured by the minimum distance, maximum distance or average distance between points

• Silhouette criterion is sometimes used

• Problem: these measures usually favour large numbers of clusters, so some form of complexity penalty is necessary
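
The silhouette criterion mentioned above can be sketched directly from its standard definition, $$s(i) = (b_i - a_i)/\max(a_i, b_i)$$, averaged over points (Euclidean distance assumed):

```python
import numpy as np

def silhouette(X, C):
    """Mean silhouette: a_i = mean distance to points in i's own cluster,
    b_i = smallest mean distance to any other cluster."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        own = C == C[i]
        if own.sum() == 1:
            scores.append(0.0)                    # convention for singletons
            continue
        a = D[i, own].sum() / (own.sum() - 1)     # exclude the point itself
        b = min(D[i, C == k].mean() for k in set(C.tolist()) if k != C[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
tight = silhouette(X, np.array([0, 0, 1, 1]))     # respects the gap
loose = silhouette(X, np.array([0, 1, 0, 1]))     # straddles the gap
```

Scores near $$1$$ indicate tight, well-separated clusters; negative scores indicate points that sit closer to another cluster than to their own.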

## Typical applications of clustering

• Pre-processing step for supervised learning

• Data inspection/experimental data analysis

• Discretizing real-valued variables in non-uniform buckets.

• Data compression

## Questions

• Will $$K$$-means terminate?

• Will it always find the same answer?

• How should we choose the initial cluster centers?

• Can we automatically choose the number of centers?

## Does $$K$$-means clustering terminate?

• For given data $$\{{\mathbf{x}}_1,\ldots,{\mathbf{x}}_n\}$$ and a clustering $$C$$, consider the sum of squared Euclidean distances between each vector and the centroid of its cluster: $$J = \sum_{i=1}^n\|{\mathbf{x}}_i-\mu_{C(i)}\|^2,$$ where $$\mu_{C(i)}$$ denotes the centroid of the cluster containing $${\mathbf{x}}_i$$.

• There are finitely many possible clusterings: at most $$K^n$$.

• Each time we reassign a vector to a cluster with a strictly nearer centroid, $$J$$ decreases.

• Each time we recompute the centroid of a cluster, $$J$$ decreases or stays the same.

• So $$J$$ never increases, and it strictly decreases whenever $$C$$ changes; since $$J$$ cannot decrease through infinitely many of the finitely many clusterings, the algorithm must terminate.
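
The centroid step's monotonicity follows from a standard identity (not on the slide, but it is the one-line justification): for any candidate center $$\mathbf{c}$$ of cluster $$k$$ with mean $$\mu_k$$ and size $$n_k$$,

```latex
\sum_{i \,:\, C(i)=k} \|{\mathbf{x}}_i - {\mathbf{c}}\|^2
  \;=\; \sum_{i \,:\, C(i)=k} \|{\mathbf{x}}_i - \mu_k\|^2
        \;+\; n_k \,\|\mu_k - {\mathbf{c}}\|^2
  \;\ge\; \sum_{i \,:\, C(i)=k} \|{\mathbf{x}}_i - \mu_k\|^2 ,
```

since the cross term $$2(\mu_k - {\mathbf{c}})^T \sum_{i : C(i)=k} ({\mathbf{x}}_i - \mu_k)$$ vanishes. So replacing any center by the cluster mean can only lower $$J$$.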

## Does $$K$$-means always find the same answer?

• $$K$$-means is a version of coordinate descent, where the parameters are the cluster center coordinates, and the assignments of points to clusters.

• It minimizes the sum of squared Euclidean distances from vectors to their cluster centroid.

• This error function has many local minima!

• The solution found is locally optimal, but not globally optimal

• Because the solution depends on the initial assignment of instances to clusters, random restarts will give different solutions
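
Random restarts in code: a sketch that reruns $$K$$-means from several random initializations and keeps the solution with the lowest $$J$$ (the data and seeds are arbitrary placeholders):

```python
import numpy as np

def k_means_J(X, K, seed):
    """One K-means run from a random initialization; returns the final J."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # random data points
    C = np.full(len(X), -1)
    while True:
        C_new = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        if np.array_equal(C_new, C):
            break
        C = C_new
        # Empty clusters keep their previous centroid.
        mu = np.array([X[C == k].mean(axis=0) if np.any(C == k) else mu[k]
                       for k in range(K)])
    return float(((X - mu[C]) ** 2).sum())

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2)) + np.repeat([[0, 0], [5, 0], [0, 5]], 20, axis=0)
Js = [k_means_J(X, K=3, seed=s) for s in range(10)]
best = min(Js)      # keep the restart with the lowest J
```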

## Example - Same problem, different solutions

Two runs on the same data, from different random initializations, converge to different solutions: $$J=0.22870$$ vs. $$J=0.3088$$.

## Choosing the number of clusters

• A difficult problem.

• Delete clusters that cover too few points

• Split clusters that cover too many points

• Add extra clusters for "outliers"

• Add option to belong to “no cluster”

• Minimum description length: minimize loss + complexity of the clustering

• Use a hierarchical method first

## Why Euclidean distance?

Subjective reason: It produces nice, round clusters.


## Why not Euclidean distance?

1. It produces nice round clusters!