“My data don’t look like the simple examples you showed in class…”

• “I don’t have nice vectors of features, each of the same length.”

• Fair enough. Today, two instances of the following strategy:

1. Identify the prediction you want to make.

2. Identify the information you need to make each prediction.

3. Summarize that information into a feature vector.

4. Pass the result to a supervised learning method.

Documents: Spam or ham?

BNP PARIBAS 10 HAREWOOD AVENUE, LONDON NW1 6AA TEL: +44 (0) 207 595 2000

RE: NOTIFICATION OF PAYMENT OF ACCRUED INTEREST OF ONE HUNDRED AND FIFTY THOUSAND BRITISH POUNDS STERLING ONLY (£150,000.00).

This is inform you that your contract/inheritance fund, which was deposited with us since 2010-2013 has accumulated an interest of sum of One Hundred and Fifty Thousand British Pounds Sterling Only (£150,000.00).

In line with the joint agreement signed by the board of trustees of BNP Paribas and the depositor of the said fund, the management has mandated us to pay you the accrued interest pending when approvals would be granted on the principal amount.

In view of the above, you are hereby required to complete the attached questionnaire form, affix your passport photograph, signed in the specified columns and return back to us via email for the processing of the transfer of your fund.

Do not hesitate to contact your assigned transaction manager, Mr. Paul Edward on his direct telephone no. +44 792 4278526, if you need any clarification.

Yours faithfully,

Elizabeth Manning (Ms.)
Chief Credit Officer, BNP Paribas, London.

Documents: Spam or ham?

Dear Dan,

How are you? I hope everything is well.

Recently I have collaborated with some colleagues of mine from environmental engineering to research in multi-objective RL (if you are interested you can take a look at one of our papers: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6252759) and I found your approach very interesting.

I want to thank you for replying to my students’ email, we have really appreciated your willingness to share your code with us. I know that the code used to produce paper results is often “weak” since it is not produced for being distributed, and I understand your desire to set up and embellish the code (I do the same thing when other researchers ask for my code). I only want to reassure you that we have no hurry, so having your code in one or two weeks will be perfect.

Unfortunately I will not be in Rome for ICAPS. However, I hope to meet you soon, will you attend ICML?

paribas(0)=3.0
harewood(1)=1.0
avenue(2)=1.0
london(3)=2.0
nw(4)=1.0
aa(5)=1.0
tel(6)=1.0
attn(7)=1.0
sir(8)=1.0
re(10)=1.0
of(12)=11.0
payment(13)=1.0
accrued(14)=2.0
interest(15)=3.0
one(16)=2.0
hundred(17)=2.0
and(18)=4.0
fifty(19)=2.0
thousand(20)=2.0
british(21)=2.0
pounds(22)=4.0
sterling(23)=2.0

credit(114)=1.0
officer(115)=1.0
of(12)=2.0
one(16)=2.0
and(18)=3.0
only(24)=1.0
is(26)=3.0
you(28)=7.0
that(29)=2.0
your(30)=5.0
with(37)=2.0
us(38)=1.0
since(39)=1.0
in(44)=3.0
the(46)=3.0
to(58)=8.0
when(61)=1.0
be(64)=2.0
are(71)=2.0
email(86)=1.0
for(87)=4.0
do(90)=1.0

attend(202)=1.0
icml(203)=1.0

Dictionary-based Representations

• Define “feature detectors” that each map an instance (e.g., a document) to one feature value.
• E.g., a word feature that is $$1$$ if the word appears in the document and $$0$$ otherwise (or its count, as in the example above).
• Fix a collection of instances, called a corpus.
• Your feature set consists of all of your feature detectors that “turn on” for some instance in the corpus.
• Often results in a very large number of features.

Bag-of-words

• The little example has vocabulary size 203; a real corpus is more like 10,000.

• The representation is sparse: only non-zero entries are recorded.

• Nonetheless, it is still a fixed-size feature vector.

• Typical tricks: strip punctuation, lowercase everything, omit stop words.

• liblinear, libsvm, and mallet are three excellent tools
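The whole bag-of-words recipe fits in a few lines. Below is a minimal sketch in plain Python (the tokenizer, the tiny stop-word list, and the count representation are all stand-ins for what a real pipeline built on liblinear or mallet would use):

```python
import re
from collections import Counter

def bag_of_words(text, stop_words=frozenset({"the", "a", "an", "of", "to"})):
    """Map a document to a sparse feature vector {word: count}.
    Lowercases, strips punctuation, and drops stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# The vector is sparse: only words that occur are stored, yet every
# document lives in the same fixed-size vocabulary space.
bow = bag_of_words("The fish ate Michael. Michael ate the fish.")
```

Note that this representation already illustrates the weakness discussed next: the two sentences in the example produce identical feature vectors.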

Beyond Bag of Words

• Simple! Maybe too simple?

• All structure, i.e., relationships between words, is lost.

• “Michael ate the fish.”

• “The fish ate Michael.”

Structural Features I: Bigrams

• Count all pairs of adjacent words, throw them in with our original bag of words.

paribas_harewood(208)=1.0
harewood_avenue(209)=1.0
avenue_london(210)=1.0
london_nw(211)=1.0

chief_credit(664)=1.0
credit_officer(665)=1.0

of_one(666)=1.0
one_and(667)=1.0
and_only(668)=1.0
only_is(669)=1.0
is_you(670)=1.0
you_that(671)=1.0

meet_attend(672)=1.0
attend_icml(673)=1.0

• Trigrams, 4-grams, …, N-grams…
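Adjacent-word features generalize directly to any $$n$$. A small sketch, using the same `word1_word2` naming convention as the features above:

```python
from collections import Counter

def ngram_features(tokens, n=2):
    """Count all runs of n adjacent tokens, joined with '_' as in the
    bigram features above; n=1 recovers the plain bag of words."""
    return Counter("_".join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

tokens = "michael ate the fish".split()
features = ngram_features(tokens, n=1)   # unigrams: the original bag of words
features += ngram_features(tokens, n=2)  # throw the bigrams in with them
```

Now "Michael ate the fish" and "The fish ate Michael" get different feature vectors (`michael_ate` vs. `fish_ate`), at the cost of a much larger feature set.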

Structural Features II: Parts of Speech

• “Go milk the cow.” milk is a Verb Phrase

• “Drink your milk.” milk is a Noun Phrase

• A part-of-speech tagger can tell the difference.

milk_VP=1.0
milk_NP=1.0
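Given a tagger's output, building the combined word-and-tag features is trivial. A sketch, assuming the tagger has already been run (the tagged pairs below are hypothetical Penn Treebank-style output such as a tagger like `nltk.pos_tag` might emit; no tagger is actually invoked here):

```python
def pos_features(tagged_tokens):
    """Map a tagger's (word, tag) pairs to word_TAG indicator features,
    in the spirit of the milk_VP / milk_NP example above."""
    return {f"{word.lower()}_{tag}": 1.0 for word, tag in tagged_tokens}

# Hypothetical tagger output for "Go milk the cow.":
tagged = [("Go", "VB"), ("milk", "VB"), ("the", "DT"), ("cow", "NN")]
features = pos_features(tagged)  # includes 'milk_VB'
```

The same word gets distinct features under distinct tags, so the verb and noun uses of "milk" no longer collide.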

Structural Features III: Parsing

    (S (NP Michael)
       (VP ate
           (NP the fish))
       .)

    (S(NP)(VP(NP)))=1.0
    (S(NP_Michael)(VP(NP)))=1.0


Structural Features III: Parsing

“Do not tell them our products contain asbestos.”

“Tell them our products do not contain asbestos.”

    (S (VP Do not
           (VP tell
               (NP them)
               (SBAR (S (NP our products)
                        (VP contain
                            (NP asbestos))))))
       .)

    (S (VP Tell
           (NP them)
           (SBAR (S (NP our products)
                    (VP do not
                        (VP contain
                            (NP asbestos))))))
       .)

    (VP contain (NP asbestos))=1.0
    (VP do not (VP contain (NP asbestos)))=1.0
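Turning a parse into features amounts to enumerating its subtrees. The sketch below extracts one indicator feature per literal subtree; the slides' skeleton features (with some or all words abstracted away, like `(S(NP)(VP(NP)))`) would need one extra pruning step on top of this:

```python
import re

def parse_tree(s):
    """Read a bracketed parse like '(VP contain (NP asbestos))' into
    nested lists: ['VP', 'contain', ['NP', 'asbestos']]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)

    def read(i):
        node, i = [tokens[i + 1]], i + 2  # tokens[i] == "(", then the label
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = read(i)
                node.append(child)
            else:
                node.append(tokens[i])
                i += 1
        return node, i + 1

    tree, _ = read(0)
    return tree

def render(tree):
    """Render a nested-list tree back to bracketed form."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(render(c) for c in tree) + ")"

def subtree_features(tree, feats=None):
    """One indicator feature per subtree of the parse."""
    if feats is None:
        feats = {}
    if isinstance(tree, list):
        feats[render(tree)] = 1.0
        for child in tree:
            subtree_features(child, feats)
    return feats

feats = subtree_features(parse_tree("(VP do not (VP contain (NP asbestos)))"))
```

Here the asbestos sentences get distinct features: one contributes `(VP contain (NP asbestos))` under negation, the other the bare `(VP contain (NP asbestos))` plus `(VP do not (VP contain (NP asbestos)))`, so a classifier can tell them apart.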


Overfitting

• With say 100s of documents and 10000s of features, overfitting is easy.

• E.g., for binary classification with a linear separator: if every document contains a unique word, give $$+$$ weight to those words in $$+$$ documents and $$-$$ weight to those in $$-$$ documents. Perfect fit!

• For $$n$$ documents, this linear classifier needs $$n$$ of the 10,000 weights to be nonzero.

• Suppose the features are coded $$\pm 1$$. In an SVM, each such weight would have to be $$\pm 1$$ to satisfy $$y_i{\mathbf{w}}^T{\mathbf{x}}_i \ge 1$$, so the norm of $${\mathbf{w}}$$ is $$\sqrt{n}$$.

• If a single word can discriminate, a weight vector with $$||{\mathbf{w}}|| = 1$$ suffices.

• SVMs prefer such sparse, simple solutions, so they can avoid overfitting even when $$p \gg n$$.
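The counting argument above can be checked numerically. A small sketch: $$n$$ documents, each containing only its own unique word (so $${\mathbf{x}}_i$$ is the $$i$$-th standard basis vector), and the "memorizing" weight vector that puts $$y_i$$ on document $$i$$'s unique word:

```python
import math

# n documents with arbitrary +/- labels; document i contains only
# feature i, coded +1 when present.
n = 100
y = [1 if i % 2 == 0 else -1 for i in range(n)]

# Memorize the training set: weight y_i on document i's unique word.
w = y[:]

# Every margin constraint y_i * w^T x_i >= 1 is satisfied exactly...
assert all(y[i] * w[i] >= 1 for i in range(n))

# ...but the norm of w grows as sqrt(n):
norm = math.sqrt(sum(wi * wi for wi in w))
print(norm)  # 10.0 == sqrt(100)
```

A large-norm $${\mathbf{w}}$$ is exactly what the SVM objective penalizes, which is why it favors the single-discriminating-word solution with $$||{\mathbf{w}}|| = 1$$ when one exists.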

Clustering

What is clustering?

• Clustering is grouping similar objects together.

• To establish prototypes, or detect outliers.

• To simplify data for further analysis/learning.

• To visualize data (in conjunction with dimensionality reduction)

• Clusterings are usually not “right” or “wrong” – different clusterings/clustering criteria can reveal different things about the data.

Clustering Algorithms

• Clustering algorithms:

• Employ some notion of distance between objects

• Have an explicit or implicit criterion defining what a good cluster is

• Heuristically optimize that criterion to determine the clustering

• Some clustering criteria/algorithms have natural probabilistic interpretations

$$K$$-means clustering

• One of the most commonly-used clustering algorithms, because it is easy to implement and quick to run.

• Assumes the objects (instances) to be clustered are $$p$$-dimensional vectors, $${\mathbf{x}}_i$$.

• Uses a distance measure between the instances (typically Euclidean distance)

• The goal is to partition the data into $$K$$ disjoint subsets

$$K$$-means clustering

• Inputs:

• A set of $$p$$-dimensional real vectors $$\{{\mathbf{x}}_1, {\mathbf{x}}_2, \ldots, {\mathbf{x}}_n\}$$.

• $$K$$, the desired number of clusters.

• Output: A mapping of the vectors into $$K$$ clusters (disjoint subsets), $$C:\{1,\ldots,n\}\to\{1,\ldots,K\}$$.

1. Initialize $$C$$ randomly.

2. Repeat:

1. Compute the centroid of each cluster (the mean of all the instances in the cluster)

2. Reassign each instance to the cluster with closest centroid

until $$C$$ stops changing.
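The loop above translates almost line for line into code. A minimal sketch in plain Python with Euclidean distance (a real implementation would use vectorized arithmetic; the empty-cluster re-seeding rule is one common convention, not part of the slide's pseudocode):

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Lloyd's algorithm on p-dimensional points (lists of floats).
    Returns a cluster index in 0..k-1 for each point."""
    rng = random.Random(seed)
    # 1. Initialize C randomly.
    assign = [rng.randrange(k) for _ in points]
    for _ in range(max_iters):
        # 2a. Compute the centroid (mean) of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if not members:            # re-seed an empty cluster
                members = [rng.choice(points)]
            centroids.append([sum(d) / len(members) for d in zip(*members)])
        # 2b. Reassign each instance to the cluster with the closest centroid.
        new_assign = [
            min(range(k), key=lambda c: sum((pi - ci) ** 2
                                            for pi, ci in zip(p, centroids[c])))
            for p in points
        ]
        if new_assign == assign:       # until C stops changing
            break
        assign = new_assign
    return assign

points = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
clusters = kmeans(points, k=2)  # the two tight pairs end up together
```

Note that minimizing squared Euclidean distance to the nearest centroid makes the choice of distance measure explicit; swapping in a different distance changes which clusterings the algorithm finds.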