“My data don’t look like the simple examples you showed in class…”

Documents: Spam or ham?

BNP PARIBAS 10 HAREWOOD AVENUE, LONDON NW1 6AA TEL: +44 (0) 207 595 2000

Attn:Sir/Madam,

RE: NOTIFICATION OF PAYMENT OF ACCRUED INTEREST OF ONE HUNDRED AND FIFTY THOUSAND BRITISH POUNDS STERLING ONLY (£150,000.00).

This is inform you that your contract/inheritance fund, which was deposited with us since 2010-2013 has accumulated an interest of sum of One Hundred and Fifty Thousand British Pounds Sterling Only (£150,000.00).

In line with the joint agreement signed by the board of trustees of BNP Paribas and the depositor of the said fund, the management has mandated us to pay you the accrued interest pending when approvals would be granted on the principal amount.

In view of the above, you are hereby required to complete the attached questionnaire form, affix your passport photograph, signed in the specified columns and return back to us via email for the processing of the transfer of your fund.

Do not hesitate to contact your assigned transaction manager, Mr. Paul Edward on his direct telephone no. +44 792 4278526, if you need any clarification.

Yours faithfully,

Elizabeth Manning (Ms.) Chief Credit Officer, BNP Paribas, London.

Documents: Spam or ham?

Dear Dan,

How are you? I hope everything is well.

Recently I have collaborated with some colleagues of mine from environmental engineering to research in multi-objective RL (if you are interested you can take a look at one of our papers: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6252759) and I found your approach very interesting.

I want to thank you for replying to my students’ email, we have really appreciated your willingness to share your code with us. I know that the code used to produce paper results is often “weak” since it is not produced for being distributed, and I understand your desire to set up and embellish the code (I do the same thing when other researchers ask for my code). I only want to reassure you that we have no hurry, so having your code in one or two weeks will be perfect.

Unfortunately I will not be in Rome for ICAPS. However, I hope to meet you soon, will you attend ICML?

Documents: Spam or ham?

paribas(0)=3.0
harewood(1)=1.0
avenue(2)=1.0
london(3)=2.0
nw(4)=1.0
aa(5)=1.0
tel(6)=1.0
attn(7)=1.0
sir(8)=1.0
madam(9)=1.0
re(10)=1.0
notification(11)=1.0
of(12)=11.0
payment(13)=1.0
accrued(14)=2.0
interest(15)=3.0
one(16)=2.0
hundred(17)=2.0
and(18)=4.0
fifty(19)=2.0
thousand(20)=2.0
british(21)=2.0
pounds(22)=4.0
sterling(23)=2.0

credit(114)=1.0
officer(115)=1.0
of(12)=2.0
one(16)=2.0
and(18)=3.0
only(24)=1.0
is(26)=3.0
you(28)=7.0
that(29)=2.0
your(30)=5.0
with(37)=2.0
us(38)=1.0
since(39)=1.0
in(44)=3.0
the(46)=3.0
to(58)=8.0
when(61)=1.0
be(64)=2.0
are(71)=2.0
email(86)=1.0
for(87)=4.0
do(90)=1.0

attend(202)=1.0
icml(203)=1.0

Dictionary-based Representations

Bag-of-words
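
The `word(index)=count` listings shown earlier can be read as sparse bag-of-words vectors: each distinct word gets a dictionary index, and the feature value is the number of times the word occurs in the document. A minimal sketch follows. The names `tokenize` and `bag_of_words` are hypothetical, and the tokenization rule (lowercase runs of letters) is an assumption that seems consistent with entries such as `nw` and `aa`; it is not necessarily the rule used to produce the listings.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a token is a lowercase run of ASCII letters,
    # so "NW1" yields "nw" and digits/punctuation are dropped.
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text, vocab):
    """Return {word index: count} for one document.

    `vocab` maps word -> index and is extended in place, so the same
    dictionary must be shared across all documents.
    """
    counts = Counter(tokenize(text))
    features = {}
    for word, count in counts.items():
        if word not in vocab:
            vocab[word] = len(vocab)  # next unused index
        features[vocab[word]] = float(count)
    return features

vocab = {}
print(bag_of_words("Tel: London NW1, London", vocab))
# {0: 1.0, 1: 2.0, 2: 1.0}  (tel=0, london=1, nw=2)
```

Sharing one `vocab` across documents is what makes a word such as `of` map to the same index in both emails.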

Beyond Bag of Words

Structural Features I: Bigrams


paribas_harewood(208)=1.0
harewood_avenue(209)=1.0
avenue_london(210)=1.0
london_nw(211)=1.0

chief_credit(664)=1.0
credit_officer(665)=1.0

of_one(666)=1.0
one_and(667)=1.0
and_only(668)=1.0
only_is(669)=1.0
is_you(670)=1.0
you_that(671)=1.0

meet_attend(672)=1.0
attend_icml(673)=1.0
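
Bigram features of the form `word1_word2` can be sketched by pairing each token with its successor. This is an illustration only: `bigram_counts` is a hypothetical name, and the letters-only tokenization is the same assumption as for the unigram listing.

```python
import re
from collections import Counter

def bigram_counts(text):
    # Assumption: tokens are lowercase runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Pair every token with the one that follows it.
    pairs = zip(tokens, tokens[1:])
    return Counter(f"{a}_{b}" for a, b in pairs)

counts = bigram_counts("Paribas 10 Harewood Avenue, London NW1")
print(sorted(counts))
# ['avenue_london', 'harewood_avenue', 'london_nw', 'paribas_harewood']
```

Because the number of distinct bigrams grows much faster than the number of distinct words, the feature indices climb quickly, as the indices in the listing above suggest.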

Structural Features II: Parts of Speech

 
milk_VP=1.0
milk_NP=1.0
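
Features such as `milk_VP` and `milk_NP` separate the verb and noun uses of the same word. The sketch below assumes the tags have already been produced by some external tagger; here they are written by hand, using the slide's `VP`/`NP` labels, and `tagged_word_features` is a hypothetical helper.

```python
from collections import Counter

def tagged_word_features(tagged_tokens):
    """Build word_TAG count features from (word, tag) pairs."""
    counts = Counter(f"{word.lower()}_{tag}" for word, tag in tagged_tokens)
    return {name: float(n) for name, n in counts.items()}

# Hand-tagged example: "Milk the milk" (verb use, then noun use).
tagged = [("Milk", "VP"), ("the", "NP"), ("milk", "NP")]
print(tagged_word_features(tagged))
# {'milk_VP': 1.0, 'the_NP': 1.0, 'milk_NP': 1.0}
```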

Structural Features III: Parsing

    (S (NP Michael)
       (VP ate
           (NP the fish))
       .)
    (S(NP)(VP(NP)))=1.0
    (S(NP_Michael)(VP(NP)))=1.0
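
One way to obtain a purely structural feature like `(S(NP)(VP(NP)))` is to read the bracketed parse into a tree and then print only the phrase labels, dropping the words. The sketch below assumes the parse is already available as a bracketed string; `parse_tree` and `skeleton` are hypothetical names, not part of any parser library.

```python
def parse_tree(text):
    """Read a bracketed parse into (label, children) tuples.

    Children are either nested tuples (phrases) or plain strings (words).
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()

def skeleton(tree):
    """Keep phrase labels only; words are dropped."""
    label, children = tree
    inner = "".join(skeleton(c) for c in children if isinstance(c, tuple))
    return f"({label}{inner})"

tree = parse_tree("(S (NP Michael) (VP ate (NP the fish)) .)")
print(skeleton(tree))  # (S(NP)(VP(NP)))
```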

Structural Features III: Parsing

“Do not tell them our products contain asbestos.”

“Tell them our products do not contain asbestos.”

    (S (VP Do not
           (VP tell
               (NP them)
               (SBAR (S (NP our products)
                        (VP contain
                            (NP asbestos))))))
       .)
       
    (S (VP Tell
           (NP them)
           (SBAR (S (NP our products)
                    (VP do not
                        (VP contain
                            (NP asbestos))))))
       .)

    (VP contain (NP asbestos))=1.0
    (VP do not (VP contain (NP asbestos)))=1.0
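
The two sentences contain exactly the same words, so a bag-of-words representation cannot tell them apart, whereas subtree features can. As a rough sketch, every bracketed subtree of a parse can be collected by matching parentheses; `subtree_features` is a hypothetical helper and the parses are typed in by hand from the trees above.

```python
def subtree_features(parse):
    """Return the set of all bracketed subtrees in a parse string."""
    text = " ".join(parse.split())    # collapse newlines and indentation
    features = set()
    open_positions = []
    for i, ch in enumerate(text):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")":
            start = open_positions.pop()
            features.add(text[start:i + 1])
    return features

first = """(S (VP Do not (VP tell (NP them)
              (SBAR (S (NP our products) (VP contain (NP asbestos)))))) .)"""
second = """(S (VP Tell (NP them)
               (SBAR (S (NP our products)
                        (VP do not (VP contain (NP asbestos)))))) .)"""

negated = "(VP do not (VP contain (NP asbestos)))"
print(negated in subtree_features(first))   # False
print(negated in subtree_features(second))  # True
```

Both parses share the feature `(VP contain (NP asbestos))`; only the second has the negated subtree.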

Overfitting

Clustering

What is clustering?

  • Clustering is grouping similar objects together.

    • To establish prototypes, or detect outliers.

    • To simplify data for further analysis/learning.

    • To visualize data (in conjunction with dimensionality reduction)

  • Clusterings are usually not “right” or “wrong”; different clusterings and clustering criteria can reveal different things about the data.

Clustering Algorithms

  • Clustering algorithms:

    • Employ some notion of distance between objects

    • Have an explicit or implicit criterion defining what a good cluster is

    • Heuristically optimize that criterion to determine the clustering

  • Some clustering criteria/algorithms have natural probabilistic interpretations

\(K\)-means clustering

  • One of the most commonly used clustering algorithms, because it is easy to implement and quick to run.

  • Assumes the objects (instances) to be clustered are \(p\)-dimensional vectors, \({\mathbf{x}}_i\).

  • Uses a distance measure between the instances (typically Euclidean distance)

  • The goal is to partition the data into \(K\) disjoint subsets
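
The slides do not write out the criterion that \(K\)-means optimizes. With Euclidean distance it is usually taken to be the within-cluster sum of squared distances, given here as the standard formulation rather than something stated in the source. Writing \(C(i)\) for the cluster of instance \(i\) and \({\boldsymbol{\mu}}_k\) for the centroid of cluster \(k\) (both symbols are introduced here for illustration):

\[
J(C) = \sum_{k=1}^{K} \sum_{i:\,C(i)=k} \left\| {\mathbf{x}}_i - {\boldsymbol{\mu}}_k \right\|^2,
\qquad
{\boldsymbol{\mu}}_k = \frac{1}{|\{i : C(i)=k\}|} \sum_{i:\,C(i)=k} {\mathbf{x}}_i .
\]

Smaller \(J\) means the instances sit closer to the centroids of their own clusters.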

\(K\)-means clustering

  • Inputs:

    • A set of \(p\)-dimensional real vectors \(\{{\mathbf{x}}_1, {\mathbf{x}}_2, \ldots, {\mathbf{x}}_n\}\).

    • \(K\), the desired number of clusters.

  • Output: A mapping of the vectors into \(K\) clusters (disjoint subsets), \(C:\{1,\ldots,n\}\to\{1,\ldots,K\}\).

  1. Initialize \(C\) randomly.

  2. Repeat:

    1. Compute the centroid of each cluster (the mean of all the instances in the cluster)

    2. Reassign each instance to the cluster with closest centroid

    until \(C\) stops changing.
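
The loop above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation. Three details are assumptions not specified on the slide: ties go to the lowest-numbered cluster, an empty cluster is given a randomly chosen instance as its centroid, and a `max_iter` cap guards against unexpectedly long runs. The names `kmeans` and `sq_dist` are hypothetical.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: initialize the assignment C randomly.
    assign = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # Step 2.1: centroid of each cluster = mean of its members.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Assumption: reseed an empty cluster at a random instance.
                centroids.append(tuple(rng.choice(points)))
        # Step 2.2: reassign each instance to the closest centroid.
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:      # C stopped changing
            break
        assign = new_assign
    return assign, centroids

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
assign, centroids = kmeans(points, k=2)
print(assign, centroids)
```

Different seeds can give different cluster numberings and, on less clearly separated data, different clusterings, since the procedure only finds a local optimum of its criterion.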

Example: initial data

Example: assign into 3 clusters randomly

Example: compute centroids

Example: reassign clusters

Example: recompute centroids

Example: reassign clusters

Example: recompute centroids – done!

What is the right number of clusters?

Example: assign into 4 clusters randomly

Example: compute centroids

Example: reassign clusters

Example: recompute centroids

Example: reassign clusters

Example: recompute centroids – done!