“I don’t have nice vectors of features, each of the same length.”

Fair enough. Today, two instances of the following strategy:

Identify the prediction you want to make.

Identify the information you need to make each prediction.

Summarize that information into a feature vector.

Pass the result to a supervised learning method.

BNP PARIBAS 10 HAREWOOD AVENUE, LONDON NW1 6AA TEL: +44 (0) 207 595 2000

Attn:Sir/Madam,

RE: NOTIFICATION OF PAYMENT OF ACCRUED INTEREST OF ONE HUNDRED AND FIFTY THOUSAND BRITISH POUNDS STERLING ONLY (£150,000.00).

This is inform you that your contract/inheritance fund, which was deposited with us since 2010-2013 has accumulated an interest of sum of One Hundred and Fifty Thousand British Pounds Sterling Only (£150,000.00).

In line with the joint agreement signed by the board of trustees of BNP Paribas and the depositor of the said fund, the management has mandated us to pay you the accrued interest pending when approvals would be granted on the principal amount.

In view of the above, you are hereby required to complete the attached questionnaire form, affix your passport photograph, signed in the specified columns and return back to us via email for the processing of the transfer of your fund.

Do not hesitate to contact your assigned transaction manager, Mr. Paul Edward on his direct telephone no. +44 792 4278526, if you need any clarification.

Yours faithfully,

Elizabeth Manning (Ms.), Chief Credit Officer, BNP Paribas, London.

Dear Dan,

How are you? I hope everything is well.

Recently I have collaborated with some colleagues of mine from environmental engineering to research in multi-objective RL (if you are interested you can take a look at one of our papers: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6252759) and I found your approach very interesting.

I want to thank you for replying to my students’ email, we have really appreciated your willingness to share your code with us. I know that the code used to produce paper results is often “weak” since it is not produced for being distributed, and I understand your desire to set up and embellish the code (I do the same thing when other researchers ask for my code). I only want to reassure you that we have no hurry, so having your code in one or two weeks will be perfect.

Unfortunately I will not be in Rome for ICAPS. However, I hope to meet you soon, will you attend ICML?

paribas(0)=3.0

harewood(1)=1.0

avenue(2)=1.0

london(3)=2.0

nw(4)=1.0

aa(5)=1.0

tel(6)=1.0

attn(7)=1.0

sir(8)=1.0

madam(9)=1.0

re(10)=1.0

notification(11)=1.0

of(12)=11.0

payment(13)=1.0

accrued(14)=2.0

interest(15)=3.0

one(16)=2.0

hundred(17)=2.0

and(18)=4.0

fifty(19)=2.0

thousand(20)=2.0

british(21)=2.0

pounds(22)=4.0

sterling(23)=2.0

…

credit(114)=1.0

officer(115)=1.0

of(12)=2.0

one(16)=2.0

and(18)=3.0

only(24)=1.0

is(26)=3.0

you(28)=7.0

that(29)=2.0

your(30)=5.0

with(37)=2.0

us(38)=1.0

since(39)=1.0

in(44)=3.0

the(46)=3.0

to(58)=8.0

when(61)=1.0

be(64)=2.0

are(71)=2.0

email(86)=1.0

for(87)=4.0

do(90)=1.0

…

attend(202)=1.0

icml(203)=1.0

- Define “feature detectors” that map an instance (e.g., a document) to one feature value.
- E.g., word features that are \(1\) if the word is in a document, \(0\) otherwise.
- Fix a collection of instances, called a *corpus*.
- Your feature set consists of all of your feature detectors that “turn on” for *some* instance in the corpus.
- Often results in a very large number of features.

This little example has a vocabulary size of 203. A real corpus is more like 10,000.

The representation is *sparse*: only non-zero entries are recorded. Nonetheless, it is still a fixed-size feature vector.

Typical tricks: omit punctuation, omit capitalization, omit stop words
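The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's API; the tokenizer (which lowercases and strips punctuation, two of the typical tricks) and the tiny two-document corpus are chosen for the example.

```python
import re
from collections import Counter

def tokenize(text):
    # Typical tricks: omit capitalization and punctuation.
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(corpus):
    # The feature set is every word that "turns on" for some instance.
    vocab = {}
    for doc in corpus:
        for word in tokenize(doc):
            if word not in vocab:
                vocab[word] = len(vocab)  # assign the next feature index
    return vocab

def to_sparse_counts(doc, vocab):
    # Sparse representation: record only the non-zero entries, but each
    # index always refers to the same slot of a fixed-size vector.
    counts = Counter(tokenize(doc))
    return {vocab[w]: float(c) for w, c in counts.items() if w in vocab}

corpus = ["Michael ate the fish.", "The fish ate Michael."]
vocab = build_vocabulary(corpus)
print(vocab)  # → {'michael': 0, 'ate': 1, 'the': 2, 'fish': 3}
print(to_sparse_counts(corpus[0], vocab))  # → {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}
```

Note that both documents here map to the *same* sparse vector, which previews the limitation discussed below.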

LIBLINEAR, LIBSVM, and MALLET are three excellent tools.

Simple! Maybe too simple?

All **structure**, i.e., relationships between words, is lost:

“Michael ate the fish.”

“The fish ate Michael.”

- Count all pairs of adjacent words, throw them in with our original bag of words.

…

paribas_harewood(208)=1.0

harewood_avenue(209)=1.0

avenue_london(210)=1.0

london_nw(211)=1.0

…

chief_credit(664)=1.0

credit_officer(665)=1.0

…

of_one(666)=1.0

one_and(667)=1.0

and_only(668)=1.0

only_is(669)=1.0

is_you(670)=1.0

you_that(671)=1.0

…

meet_attend(672)=1.0

attend_icml(673)=1.0

- Trigrams, 4-grams, …, N-grams…
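Extracting adjacent pairs (or triples, and so on) is a one-liner; a small sketch, with the underscore-joined feature names matching the bigram features listed above:

```python
def ngrams(tokens, n):
    # All runs of n adjacent tokens, joined the way the bigram
    # features above are (e.g. "credit_officer").
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "go milk the cow".split()
print(ngrams(tokens, 2))  # → ['go_milk', 'milk_the', 'the_cow']
print(ngrams(tokens, 3))  # → ['go_milk_the', 'milk_the_cow']

# Bigrams recover the structure that the bag of words threw away:
print(ngrams("michael ate the fish".split(), 2))  # → ['michael_ate', 'ate_the', 'the_fish']
print(ngrams("the fish ate michael".split(), 2))  # → ['the_fish', 'fish_ate', 'ate_michael']
```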

“Go milk the cow.” Here *milk* heads a Verb Phrase.

“Drink your milk.” Here *milk* heads a Noun Phrase.

A **part-of-speech tagger** can tell the difference.

milk_VP=1.0

milk_NP=1.0

(S (NP Michael) (VP ate (NP the fish)) .)

(S(NP)(VP(NP)))=1.0

(S(NP_Michael)(VP(NP)))=1.0

- Matt Post and Shane Bergsma. “Explicit and Implicit Syntactic Features for Text Classification” http://cs.jhu.edu/~post/papers/post-bergsma-acl13.pdf

*“Do not tell them our products contain asbestos.”*

*“Tell them our products do not contain asbestos.”*

(S (VP Do not (VP tell (NP them) (SBAR (S (NP our products) (VP contain (NP asbestos)))))) .)

(S (VP Tell (NP them) (SBAR (S (NP our products) (VP do not (VP contain (NP asbestos)))))) .)

(VP contain (NP asbestos))=1.0

(VP do not (VP contain (NP asbestos)))=1.0

With, say, 100s of documents and 10,000s of features, overfitting is easy.

E.g., for binary classification with a linear separator: if every document has a unique word, give \(+\) weight to those words for \(+\) documents and \(-\) weight to those for \(-\) documents. Perfect fit!

For \(n\) documents, the linear classifier would need \(n\) weights out of the 10,000 to be nonzero.

Suppose the features are coded \(\pm 1\). In an SVM, each such weight would have to be \(\pm 1\) to satisfy \(y_i{\mathbf{w}}^T{\mathbf{x}}_i \ge 1\); the norm of \({\mathbf{w}}\) is then \(\sqrt{n}\).
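A quick numeric check of the norm argument. The counts \(n = 200\) and \(p = 10{,}000\) are hypothetical, chosen to match the scale discussed above:

```python
import math

n = 200      # documents, each containing its own unique word
p = 10000    # total features

# Memorizing solution: one +-1 weight per document's unique word.
w_memorize = [1.0] * n + [0.0] * (p - n)
norm_memorize = math.sqrt(sum(wi * wi for wi in w_memorize))

# Generalizing solution: a single discriminating word suffices.
w_simple = [1.0] + [0.0] * (p - 1)
norm_simple = math.sqrt(sum(wi * wi for wi in w_simple))

print(norm_memorize)  # → sqrt(200) ≈ 14.14
print(norm_simple)    # → 1.0
```

The max-margin criterion prefers the second, far smaller-norm solution.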

If *one* word can discriminate, we can use a weight vector with \(\|{\mathbf{w}}\| = 1\).

SVMs prefer sparse, simple solutions, and can avoid overfitting even when \(p \gg n\).

Clustering is grouping similar objects together.

To establish prototypes, or detect outliers.

To simplify data for further analysis/learning.

To visualize data (in conjunction with dimensionality reduction).

Clusterings are usually not “right” or “wrong” – different clusterings/clustering criteria can reveal different things about the data.

Clustering algorithms:

Employ some notion of distance between objects

Have an explicit or implicit criterion defining what a good cluster is

Heuristically optimize that criterion to determine the clustering

Some clustering criteria/algorithms have natural probabilistic interpretations

*K*-means is one of the most commonly used clustering algorithms, because it is easy to implement and quick to run.

Assumes the objects (instances) to be clustered are \(p\)-dimensional vectors, \({\mathbf{x}}_i\).

Uses a distance measure between the instances (typically Euclidean distance)

The goal is to *partition* the data into \(K\) disjoint subsets.

Inputs:

A set of \(p\)-dimensional real vectors \(\{{\mathbf{x}}_1, {\mathbf{x}}_2, \ldots, {\mathbf{x}}_n\}\).

\(K\), the desired number of clusters.

Output: A mapping of the vectors into \(K\) clusters (disjoint subsets), \(C:\{1,\ldots,n\}\mapsto\{1,\ldots,K\}\).

Initialize \(C\) randomly.

Repeat:

Compute the *centroid* of each cluster (the mean of all the instances in the cluster).

Reassign each instance to the cluster with the closest centroid.

until \(C\) stops changing.
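The loop above can be sketched in pure Python. This is a minimal illustration of Lloyd's algorithm, not production code; the round-robin random initialization is one simple choice among many, and the handling of a cluster that happens to go empty is deliberately naive.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Alternate centroid computation and reassignment until C stops changing."""
    rng = random.Random(seed)
    # Initialize C randomly: deal out cluster labels round-robin, then shuffle,
    # so every cluster starts with at least one instance.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    dims = len(points[0])
    for _ in range(iters):
        # Compute the centroid of each cluster (mean of its instances).
        sums = [[0.0] * dims for _ in range(k)]
        counts = [0] * k
        for x, c in zip(points, labels):
            counts[c] += 1
            for d in range(dims):
                sums[c][d] += x[d]
        centroids = [[s / max(c, 1) for s in row]
                     for row, c in zip(sums, counts)]
        # Reassign each instance to the cluster with the closest centroid
        # (Euclidean distance).
        new_labels = [min(range(k), key=lambda c: math.dist(x, centroids[c]))
                      for x in points]
        if new_labels == labels:  # C stopped changing
            break
        labels = new_labels
    return labels, centroids

# Two well-separated blobs of two points each:
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
labels, cents = kmeans(pts, k=2)
print(labels)
```

On this toy input the first two points end up in one cluster and the last two in the other (which label each cluster gets depends on the random initialization).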