2016-02-11

“My data don’t look like the simple examples you showed in class…”

• “I don’t have nice vectors of features, each of the same length.”

• Fair enough. Today, two instances of the following strategy:

1. Identify the prediction you want to make.

2. Identify the information you need to make each prediction.

3. Summarize that information into a feature vector.

4. Pass the result to a supervised learning method.

Documents: Spam or ham?

BNP PARIBAS 10 HAREWOOD AVENUE, LONDON NW1 6AA TEL: +44 (0) 207 595 2000

RE: NOTIFICATION OF PAYMENT OF ACCRUED INTEREST OF ONE HUNDRED AND FIFTY THOUSAND BRITISH POUNDS STERLING ONLY (£150,000.00).

This is inform you that your contract/inheritance fund, which was deposited with us since 2010-2013 has accumulated an interest of sum of One Hundred and Fifty Thousand British Pounds Sterling Only (£150,000.00).

In line with the joint agreement signed by the board of trustees of BNP Paribas and the depositor of the said fund, the management has mandated us to pay you the accrued interest pending when approvals would be granted on the principal amount.

In view of the above, you are hereby required to complete the attached questionnaire form, affix your passport photograph, signed in the specified columns and return back to us via email for the processing of the transfer of your fund.

Do not hesitate to contact your assigned transaction manager, Mr. Paul Edward on his direct telephone no. +44 792 4278526, if you need any clarification.

Yours faithfully,

Elizabeth Manning (Ms.) Chief Credit Officer, BNP Paribas, London.

Documents: Spam or ham?

Dear Dan,

How are you? I hope everything is well.

Recently I have collaborated with some colleagues of mine from environmental engineering to research in multi-objective RL (if you are interested you can take a look at one of our papers: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6252759) and I found your approach very interesting.

I want to thank you for replying to my students’ email, we have really appreciated your willingness to share your code with us. I know that the code used to produce paper results is often “weak” since it is not produced for being distributed, and I understand your desire to set up and embellish the code (I do the same thing when other researchers ask for my code). I only want to reassure you that we have no hurry, so having your code in one or two weeks will be perfect.

Unfortunately I will not be in Rome for ICAPS. However, I hope to meet you soon, will you attend ICML?

paribas(0)=3.0
harewood(1)=1.0
avenue(2)=1.0
london(3)=2.0
nw(4)=1.0
aa(5)=1.0
tel(6)=1.0
attn(7)=1.0
sir(8)=1.0
re(10)=1.0
of(12)=11.0
payment(13)=1.0
accrued(14)=2.0
interest(15)=3.0
one(16)=2.0
hundred(17)=2.0
and(18)=4.0
fifty(19)=2.0
thousand(20)=2.0
british(21)=2.0
pounds(22)=4.0
sterling(23)=2.0

credit(114)=1.0
officer(115)=1.0
of(12)=2.0
one(16)=2.0
and(18)=3.0
only(24)=1.0
is(26)=3.0
you(28)=7.0
that(29)=2.0
your(30)=5.0
with(37)=2.0
us(38)=1.0
since(39)=1.0
in(44)=3.0
the(46)=3.0
to(58)=8.0
when(61)=1.0
be(64)=2.0
are(71)=2.0
email(86)=1.0
for(87)=4.0
do(90)=1.0

attend(202)=1.0
icml(203)=1.0

Bag-of-words

• Vocabulary size: 203 (in a real corpus, more like 10000)

• Representation is sparse: Only non-zero entries recorded

• Nonetheless, still a fixed-size feature vector

• Typical tricks: omit punctuation, omit capitalization, omit stop words

• liblinear, libsvm, and mallet are three excellent tools
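A minimal sketch of the bag-of-words mapping described above, in plain Python. The tokenizer (lowercase, letters only), the tiny stop-word list, and the names `tokenize`, `bag_of_words`, and `vocab` are assumptions made for illustration; they are not the code that produced the feature listings on the earlier slide.

```python
import re
from collections import Counter

STOP_WORDS = frozenset({"the", "of", "and", "to", "a", "is"})  # tiny illustrative list

def tokenize(text, stop_words=frozenset()):
    """Lowercase, drop punctuation and digits, optionally drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stop_words]

def bag_of_words(text, vocab, stop_words=frozenset()):
    """Return a sparse {index: count} vector; unseen words are added to vocab."""
    counts = Counter(tokenize(text, stop_words))
    vector = {}
    for word, n in counts.items():
        index = vocab.setdefault(word, len(vocab))
        vector[index] = float(n)
    return vector

vocab = {}  # word -> feature index, shared across all documents
spam = bag_of_words("BNP PARIBAS, London. Pounds, pounds, POUNDS!", vocab)
ham = bag_of_words("Will you attend ICML in London?", vocab)
print(spam)  # only non-zero entries are stored
print(ham)
```

Both documents map into the same index space, so "london" gets the same index in each. This sketch grows the vocabulary as it goes; a real pipeline would typically freeze it after training and ignore unseen words at prediction time.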

Beyond Bag of Words

• Simple! Maybe too simple?

• All structure, i.e., relationships between words, is lost.

• “Michael ate the fish.”

• “The fish ate Michael.”

Structural Features I: Bigrams

• Count all pairs of adjacent words and throw them in with our original bag of words.

paribas_harewood(208)=1.0
harewood_avenue(209)=1.0
avenue_london(210)=1.0
london_nw(211)=1.0

chief_credit(664)=1.0
credit_officer(665)=1.0

of_one(666)=1.0
one_and(667)=1.0
and_only(668)=1.0
only_is(669)=1.0
is_you(670)=1.0
you_that(671)=1.0

meet_attend(672)=1.0
attend_icml(673)=1.0

• Trigrams, 4-grams, …, N-grams…
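The bigram idea generalizes directly to N-grams. The sketch below is one way to count them; the function name `ngram_counts` and the `word_word` key format (borrowed from the listing above) are illustrative choices.

```python
from collections import Counter

def ngram_counts(tokens, n_max=2):
    """Count all n-grams of adjacent tokens for n = 1..n_max."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts["_".join(tokens[i:i + n])] += 1.0
    return counts

a = ngram_counts("michael ate the fish".split())
b = ngram_counts("the fish ate michael".split())

unigrams_a = {k: v for k, v in a.items() if "_" not in k}
unigrams_b = {k: v for k, v in b.items() if "_" not in k}
print(unigrams_a == unigrams_b)  # same bag of words
print(a == b)                    # but different bigrams
```

The two "Michael" sentences have identical unigram counts and differ only in their bigrams, which is exactly the structure that plain bag-of-words throws away.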

Structural Features II: Parts of Speech

• “Go milk the cow.” Here milk is a verb (it heads a Verb Phrase, VP).

• “Drink your milk.” Here milk is a noun (it heads a Noun Phrase, NP).

• A part-of-speech tagger can tell the difference.

milk_VP=1.0
milk_NP=1.0
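Once each word carries a tag, turning tags into features is just another counting step. The sketch below assumes the tags have already been produced by some external tagger; here they are written by hand, using the VP/NP labels from the slide, and the function name `tagged_word_counts` is hypothetical.

```python
from collections import Counter

def tagged_word_counts(tagged_tokens):
    """Count word_TAG features from (word, tag) pairs."""
    return Counter(f"{word.lower()}_{tag}" for word, tag in tagged_tokens)

# Tags are supplied by hand here; a real pipeline would get them from a tagger.
go_milk = tagged_word_counts([("Go", "VP"), ("milk", "VP"), ("the", "NP"), ("cow", "NP")])
drink_milk = tagged_word_counts([("Drink", "VP"), ("your", "NP"), ("milk", "NP")])

print(go_milk["milk_VP"], go_milk["milk_NP"])
print(drink_milk["milk_VP"], drink_milk["milk_NP"])
```

The same surface word now lands in two different feature slots depending on how it is used.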

Structural Features III: Parsing

    (S (NP Michael)
       (VP ate
           (NP the fish))
       .)

(S(NP)(VP(NP)))=1.0
(S(NP_Michael)(VP(NP)))=1.0
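The unlexicalized feature above can be read off a bracketed parse by deleting the words and keeping only the phrase labels. A minimal sketch, assuming the parse is already available as a bracketed string; the function name `skeleton` is hypothetical.

```python
import re

def skeleton(parse):
    """Strip the words from a bracketed parse, keeping only phrase labels."""
    tokens = re.findall(r"[()]|[^\s()]+", parse)
    kept = []
    for i, tok in enumerate(tokens):
        is_label = i > 0 and tokens[i - 1] == "("  # a label directly follows "("
        if tok in ("(", ")") or is_label:
            kept.append(tok)
    return "".join(kept)

first = skeleton("(S (NP Michael) (VP ate (NP the fish)) .)")
second = skeleton("(S (NP The fish) (VP ate (NP Michael)) .)")
print(first)
print(first == second)
```

Both "Michael" sentences share the same skeleton, which is why the slide also shows a lexicalized variant with a word attached to one of the nodes.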


Structural Features III: Parsing

“Do not tell them our products contain asbestos.”

“Tell them our products do not contain asbestos.”

    (S (VP Do not
           (VP tell
               (NP them)
               (SBAR (S (NP our products)
                        (VP contain
                            (NP asbestos))))))
       .)

    (S (VP Tell
           (NP them)
           (SBAR (S (NP our products)
                    (VP do not
                        (VP contain
                            (NP asbestos))))))
       .)

(VP contain (NP asbestos))=1.0
(VP do not (VP contain (NP asbestos)))=1.0
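One way to get features like the two above is to emit every subtree of the parse as its own feature. The sketch below assumes the parses are given as bracketed strings; the small parser and the names `parse_tree`, `show`, and `subtree_features` are illustrative, not a standard library API.

```python
import re

def parse_tree(text):
    """Parse a bracketed tree into nested lists: [label, child, ...]."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return items

    return node()

def show(tree):
    """Serialize a nested-list tree back to bracket notation."""
    return "(" + " ".join(t if isinstance(t, str) else show(t) for t in tree) + ")"

def subtree_features(tree):
    """Return the set of all bracketed subtrees, one feature per subtree."""
    feats = {show(tree)}
    for child in tree:
        if not isinstance(child, str):
            feats |= subtree_features(child)
    return feats

tree_a = "(S (VP Do not (VP tell (NP them) (SBAR (S (NP our products) (VP contain (NP asbestos)))))) .)"
tree_b = "(S (VP Tell (NP them) (SBAR (S (NP our products) (VP do not (VP contain (NP asbestos)))))) .)"
feats_a = subtree_features(parse_tree(tree_a))
feats_b = subtree_features(parse_tree(tree_b))
print(sorted(feats_b - feats_a))
```

Both sentences contain the subtree `(VP contain (NP asbestos))`, but only the second contains the negated one, so a learner can tell them apart even though their bags of words are nearly identical.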


Images: Cat or dog?

1. Identify the prediction you want to make.

2. Identify the information you need to make each prediction.

3. Summarize that information into a feature vector.

4. Pass the result to a supervised learner.

Image features

3. Summarize that information into a feature vector.

• Do I need to summarize? Can I just use pixels?

• The average lolcat has 250,000 pixels

• Pixels are affected by many non-cat-related issues, including:

• Color of cat

• Distance to cat

• Illumination

• Background

• It is unrealistic to expect a learner to discover, from raw pixels alone, that the cat-dog difference is the one that matters rather than some accidental difference.

What is an image feature?

• A function that, given an image, produces a (relatively) low-dimensional output that still retains some (relatively) interesting information

• “Global” image features:

• Mean (median, mode) pixel intensity - very low-dimensional, super boring, probably not useful

• RGB histogram(s) - $$(2^8 \cdot 3)$$-dimensional vector, no spatial information, might help find some objects (e.g. Canadian flag vs. American flag?)

• Image “gradient” - 2D vector pointing at direction of increasing brightness
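A rough sketch of the three global features listed above, in plain Python. It assumes an image is a list of rows of `(r, g, b)` tuples with 8-bit values and at least two rows and two columns. Reading the "gradient" as the average brightness change per pixel step in x and y is an interpretation, not something the slide specifies.

```python
def mean_intensity(image):
    """Average of all channel values over all pixels."""
    values = [v for row in image for pixel in row for v in pixel]
    return sum(values) / len(values)

def rgb_histogram(image):
    """Concatenate three 256-bin channel histograms into one 768-vector."""
    hist = [0] * (256 * 3)
    for row in image:
        for r, g, b in row:
            hist[r] += 1
            hist[256 + g] += 1
            hist[512 + b] += 1
    return hist

def global_gradient(image):
    """Average brightness change per pixel step in x and in y (a 2D vector)."""
    gray = [[sum(p) / 3 for p in row] for row in image]
    h, w = len(gray), len(gray[0])
    dx = sum(gray[y][x + 1] - gray[y][x] for y in range(h) for x in range(w - 1))
    dy = sum(gray[y + 1][x] - gray[y][x] for y in range(h - 1) for x in range(w))
    return (dx / (h * (w - 1)), dy / ((h - 1) * w))

black, white = (0, 0, 0), (255, 255, 255)
image = [[black, white],
         [black, white]]  # dark on the left, bright on the right
print(mean_intensity(image))
print(len(rgb_histogram(image)))
print(global_gradient(image))
```

For this tiny image the gradient points to the right, toward the bright column, and the histogram has the same 768 entries no matter how large the image is.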

What is an image feature?

• “Local” image features:

• Global features applied to little patches of the big image

• Dense if we pre-determine the patches (say using a grid)

• Sparse if we decide which ones to compute based on the image itself
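A minimal sketch of a dense local feature: apply one global feature (here, mean brightness) to each cell of a fixed grid. It assumes a grayscale image stored as a list of rows of numbers, with at least `grid` rows and columns; the name `dense_patch_means` is hypothetical.

```python
def dense_patch_means(gray, grid=2):
    """Split the image into grid x grid cells; return each cell's mean brightness."""
    h, w = len(gray), len(gray[0])
    features = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * h // grid, (gy + 1) * h // grid
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            cell = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell))
    return features

small = [[100, 100, 0, 0],
         [100, 100, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
wide = [[10] * 6 for _ in range(4)]  # a different image size

print(dense_patch_means(small))
print(dense_patch_means(wide))
```

Because the grid is chosen in advance, both images yield a vector of the same length, even though they differ in size.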

Dense vs. Sparse

• Dense seems good – fixed-length feature vector!

• What if the important information is between grid cells?

• Too fine a grid is impractical.

• Famous sparse local image features: the Scale-Invariant Feature Transform (SIFT).
Lowe, David G. “Distinctive Image Features from Scale-Invariant Keypoints”, International Journal of Computer Vision, Vol. 60, No. 2, 2004, pp. 91-110.

Identifying Interesting Points (“keypoints”)

• A point is interesting if it is much darker or brighter than its “neighbors”, as measured by a difference-of-Gaussians filter.

• “Interestingness” will depend on both the (x,y) location in the image, and on the chosen scale $$\sigma$$ of the filter.

• The “scale” of a point is the $$\sigma$$ that makes it most interesting.
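A simplified sketch of these ideas, evaluating a difference-of-Gaussians response at a single point and picking the best scale from a short list. The ratio `k=1.6`, the candidate sigmas, the 3-sigma kernel truncation, and the clamped image borders are all assumptions. Full SIFT instead searches for local extrema over space and scale in an image pyramid and applies further filtering, none of which is shown here.

```python
import math

def blur_at(gray, x, y, sigma):
    """Gaussian-weighted average of the image around (x, y); edges are clamped."""
    h, w = len(gray), len(gray[0])
    radius = int(math.ceil(3 * sigma))
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(d * d) / (2 * sigma * sigma)) for d in offsets]
    total = sum(weights)
    value = 0.0
    for j, wy in zip(offsets, weights):
        yy = min(max(y + j, 0), h - 1)
        for i, wx in zip(offsets, weights):
            xx = min(max(x + i, 0), w - 1)
            value += wy * wx * gray[yy][xx]
    return value / (total * total)

def dog_at(gray, x, y, sigma, k=1.6):
    """Difference-of-Gaussians response at (x, y) for scale sigma."""
    return blur_at(gray, x, y, k * sigma) - blur_at(gray, x, y, sigma)

def best_scale(gray, x, y, sigmas=(1.0, 2.0, 4.0)):
    """The sigma that maximizes |DoG|, i.e. makes the point most interesting."""
    return max(sigmas, key=lambda s: abs(dog_at(gray, x, y, s)))

def blob_image(size, half_width):
    """Black square image with a white block centred in it."""
    c = size // 2
    return [[1.0 if abs(x - c) <= half_width and abs(y - c) <= half_width else 0.0
             for x in range(size)] for y in range(size)]

dot = blob_image(41, 0)    # a single bright pixel
block = blob_image(41, 4)  # a 9 x 9 bright block
print(best_scale(dot, 20, 20), best_scale(block, 20, 20))
```

The single bright pixel is most interesting at the smallest sigma, while the centre of the larger block prefers a larger one, and a point in the flat black corner gets a response of zero at every scale.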