CS 2120: Class #14

Machine Learning

  • We’re about to jump about threeish years ahead in your CS education.
  • There is a very rich, very old (by CS standards) field of computer science called Artificial Intelligence
  • One small corner of this vast field is an area called Machine Learning
  • Normally, you’d learn a whole bunch of basic CS, both theoretical and applied.
  • Then you’d take a couple of general AI courses.
  • Then you’d take a specialized course in machine learning.

If we wanted to do this right, we’d need to learn about:

  • AI (of course)
  • The theory of computation
  • Complexity theory
  • Advanced algorithms & Data structures
  • Linear Algebra
  • Multivariable calculus
  • Multivariate statistics (lots of stats, actually)
  • Even more stats
  • Think you’ve got enough stats? NO! MOAR STATS!
  • Signal Processing
  • Information Theory
  • ...

But that’d take too long, so...

  • We’re going to skip straight to the last step.

SRSLY?

  • Yes. Machine learning is now too important for me not to show it to you.
  • It would be absolutely negligent to allow you to leave this course without seeing some ML techniques.

What you can expect:

  • A very superficial introduction to ML
  • You’ll have some ideas about how to apply specific ML techniques and what they can tell you about data.
  • You should feel comfortable beginning to explore scikit-learn after working through this class.
  • Everything is pretty much going to be tiny wizards and magic.
  • Hopefully you get excited enough about what these techniques can do to take the time to learn the details properly.
  • In order to avoid getting bogged down in detail, I’m going to play fast and loose with some definitions and concepts. Sorry (or not, depending on your perspective).
  • You’ll be able to turn your science up to an 11!

scikit-learn

  • Lucky for us, Python has a whole whack of ML libraries (including many specialized for particular fields).
  • We’re going to use scikit-learn as it is relatively full-featured and easy to use.

Requires Supervision

  • Very broadly speaking, there are two types of ML:
    • Supervised learning – you have a bunch of labelled training data (like the vampire data in assignment 3!) and you want to build a program that will learn to generalize the training data so that it can classify new inputs (e.g., classifying new subjects as vampires or not vampires).
    • Unsupervised learning – you have a bunch of unlabelled data and you want to answer the question: “Does any of this stuff look like any of the other stuff?” You want a program that will divide your dataset into clusters where all of the data items in the same cluster are similar to each other in some way.
  • There are very many algorithms for both types of learning, and new ones are being described every day. We’re just going to barely scratch the surface here.

Activity

With your neighbours, come up with some situations in which you think you’d use supervised learning and some more in which you’d use unsupervised learning.

Let’s get some data

  • To speed things up, we’re going to work with a dataset built in to scikit-learn.

  • If you want to use your own data, you just load it into a 2D array.
    • Each row is a data point

    • Each column is a feature
      • In ML terminology, a single observation of a property (like petal length) is called a feature
  • This data set records 4 features (sepal and petal length and width) for 150 Irises of three different types (Setosa, Versicolour, and Virginica).

    >>> from sklearn import datasets
    >>> iris = datasets.load_iris()
    >>> data = iris.data
    >>> data.shape
    (150, 4)
    
  • The dataset we loaded came with labels already classifying the Irises:

    >>> import numpy
    >>> labels = iris.target
    >>> numpy.unique(labels)
    array([0, 1, 2])
    
  • So data now contains feature vectors for 150 irises and labels contains the known truth about what type each iris is. Just like the vampire dataset we used in Assignment 3.

  • What we want to build is something like the is_vampire() function. is_type_of_iris()?
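  • If you’d rather bring your own data, the row/column layout described above might look something like this. It’s only a sketch: the measurements, the labels, and the names my_data and my_labels are made up for illustration.

```python
import numpy as np

# Hypothetical measurements, invented for this sketch.
# Each row is one data point; each column is one feature.
my_data = np.array([
    [5.0, 3.4, 1.5, 0.2],
    [6.1, 2.9, 4.6, 1.4],
    [6.7, 3.1, 5.6, 2.4],
])

# One (made-up) label per row, in the same order as the rows
my_labels = np.array([0, 1, 2])

print(my_data.shape)     # (number of data points, number of features)
print(my_labels.shape)   # (number of data points,)
```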

Activity

Given the iris data at hand... if I told you to write an is_type_of_iris() function for this data... how would you do it? Discuss with your classmates. No need to code this up, just come up with an English description.

  • ... now we remember... building is_vampire() by hand was a lot of work!
  • We want some automated way of building such a function for any set of data.
  • That’s where ML comes in.

Supervised: k-Nearest Neighbours

  • Imagine we do this:
    • For each row in our training set data, plot the 4 features (lengths) in a 4D space.
    • When we get a new iris, we also plot it in the 4D space.
    • Find the k closest points to the new point we just plotted.
    • Whatever iris type the majority of those points came from... that’s our guess for the new iris.
  • Let’s go through it on the board, with a 2D feature space.
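  • If you’d like to see the steps above as code before we hand things over to a library, here is a rough by-hand sketch. The function name knn_predict, the choice of k=5, and the use of plain Euclidean distance are all assumptions made for illustration; this is not how scikit-learn does it internally.

```python
from collections import Counter

import numpy as np
from sklearn import datasets


def knn_predict(train_data, train_labels, new_point, k=5):
    """Guess the label of new_point by majority vote of its k nearest neighbours."""
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((train_data - new_point) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbours is our guess
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]


iris = datasets.load_iris()
guess = knn_predict(iris.data, iris.target, iris.data[0], k=5)
print(guess, iris.target[0])   # our guess next to the known label
```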

  • Now let’s automate this with scikit, where we aren’t limited to 2D (and by our own growing boredom at plotting points).

First, we’ll import the kNN classifier:

    >>> from sklearn.neighbors import KNeighborsClassifier

Now we create a classifier:

    >>> knn = KNeighborsClassifier()

Now we train it on our data for which we already have labels:

    >>> knn.fit(data, labels)
  • That’s it. That’s how easy scikit-learn makes ML for you. knn is now a k-nearest neighbours classifier for irises.

  • Let’s try it. When we get a new iris for which we want to predict the class, we use:

    >>> knn.predict(new_iris_vector)
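  • Here is one way the whole thing might look end to end. The measurements in new_iris_vector are made up for illustration. Note that recent versions of scikit-learn expect predict to be given a 2D array (one row per iris), even when you only have one iris.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier()
knn.fit(iris.data, iris.target)

# Made-up measurements for one iris:
# sepal length, sepal width, petal length, petal width (cm)
new_iris_vector = np.array([[5.0, 3.4, 1.5, 0.2]])   # shape (1, 4): one row, four features

pred = knn.predict(new_iris_vector)
print(pred, iris.target_names[pred[0]])   # predicted label and its iris type name
```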
    

Activity

Pick some random irises from your data set and attempt to classify them. Check the answer using your known labels in labels. For example:

    >>> # recent versions of scikit-learn expect a 2D array, so slice out a single row
    >>> pred = knn.predict(data[50:51])
    >>> pred
    array([1])

Are they the same?

    >>> pred[0] == labels[50]

Activity+

Actually try to quantify how good your classifier is. Test the predictions for all 150 irises in data and keep track of how many it gets right. What is the percentage accuracy?

Activity

Well, hey, that’s pretty good! Or maybe not.

What atrocity have we committed in our analysis of the classifier?

Activity+

Redo the analysis. This time split your data set into a ‘training set’ and a ‘testing set’.

  • Rebuild your knn classifier using only the training data. Keep the test data sacred and hidden away. AVOID TEMPTATION.
  • Now use the test set to test the classifier (just as you did in the earlier activity, but using only the test set instead of all of the data).

HINT: There might be a super easy ‘built in’ way of doing this.

  • Even though it’s obvious that “double dipping” is pretty sketchy, sometimes it’s less obvious than it was here. Think about what you’re doing. Know your tools.

  • Sometimes people just don’t know any better.

  • This type of fundamental logical error has been a major source of paper retraction. If you have to retract a paper because you “double dipped” you are loudly announcing to your research community: “I’M AN EXCEPTIONALLY LAZY RESEARCHER. I CAN’T BE BOTHERED TO LEARN HOW TO USE THE TOOLS I’M DOING RESEARCH WITH. LOL.”

  • It’s also usually a good idea to shuffle your data. Some algorithms can become biased based on how the data was fed to them.

  • Although simple, kNN is a pretty decent estimator... for datasets with small feature vectors. In general, as the size of your feature vector grows linearly, the size of the training set required to make a good estimator grows exponentially.
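  • For reference, here is a sketch of one possible way to do a shuffled train/test split with a scikit-learn helper. It assumes a recent scikit-learn (where the helper lives in sklearn.model_selection); the 30% test size and the random_state value are arbitrary choices for this example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Shuffle, then hold out 30% of the irises as a test set.
# test_size and random_state are arbitrary choices for this sketch.
train_data, test_data, train_labels, test_labels = train_test_split(
    iris.data, iris.target, test_size=0.3, shuffle=True, random_state=0)

knn = KNeighborsClassifier()
knn.fit(train_data, train_labels)               # train on the training set only

accuracy = knn.score(test_data, test_labels)    # fraction correct on unseen irises
print(accuracy)
```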

Supervised: Support Vector Machines (SVM)

  • Let’s go back and look at a simple plotting of our data (reduced to 2D for convenience).

  • Maybe I could do this:
    • draw lines that separate regions of the plane that all contain the same type of iris.
    • treat those lines as absolute partitions of the plane.
    • when I get a new iris, plot it on the plane, and label it according to whatever partition it falls in.
  • Let’s try on the board again.

  • (In general, of course, our feature vectors will be higher-dimensional... in which case just substitute the word ‘line’ with ‘hyperplane’. The idea is exactly the same: partition the space.)

  • This idea leads to the Linear Support Vector Machine.

  • This is a bit more complex than the kNN classifier but, fortunately for us, it’s just as easy to use:

    >>> from sklearn import svm
    >>> svc = svm.SVC(kernel='linear')
    >>> svc.fit(data,labels)
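  • If you’re curious where the “lines” went, a fitted linear SVC keeps them around. The sketch below just peeks at their shapes. SVC separates each pair of classes, so three iris types give three hyperplanes, each described by 4 coefficients (one per feature) plus an intercept.

```python
from sklearn import datasets, svm

iris = datasets.load_iris()
svc = svm.SVC(kernel='linear')
svc.fit(iris.data, iris.target)

# For a linear kernel, each row of coef_ (with the matching entry of
# intercept_) describes one separating hyperplane in the 4D feature space.
print(svc.coef_.shape)        # (number of hyperplanes, number of features)
print(svc.intercept_.shape)   # (number of hyperplanes,)
```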
    

Activity+

Figure out how to use the SVM to predict the label of new irises.

Now quantify how good your classifier is. Remember what you’ve learned!

You’ll have to split your data set into training and testing sets!

Did we do better, or worse, than kNN?

Unsupervised: K-means clustering

  • What if we just had data and no labels for the iris dataset?

  • We obviously can’t make a classifier...

  • ... but we can still look for structure in our data.

  • Let’s try this:
    • Plot all of our datapoints on the plane.

    • Guess the number of clusters we’re looking for. Let’s use the fact that we know there are 3 types of iris and pick 3 clusters.

    • Randomly place 3 “means” on the plane.

    • Repeat the following until convergence:
      • Associate each data point to the nearest “mean”.
      • Compute the centroid of all of the points attached to each “mean”.
      • Move the position of the “mean” to this centroid.
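  • The loop above can be sketched by hand in a few lines of NumPy. Everything here is illustrative: the function name k_means_sketch, the seed, the iteration cap, and the decision to start the “means” on randomly chosen data points (rather than anywhere on the plane) are assumptions of this sketch.

```python
import numpy as np
from sklearn import datasets


def k_means_sketch(points, k=3, max_iters=100, seed=0):
    """Plain k-means. Returns (means, assignments)."""
    rng = np.random.default_rng(seed)
    # Assumption: start the "means" on k randomly chosen data points
    means = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(max_iters):
        # Distance from every point to every mean, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        # Associate each data point with its nearest mean
        assignments = distances.argmin(axis=1)
        # Move each mean to the centroid of its points (leave it alone if it has none)
        new_means = np.array([
            points[assignments == j].mean(axis=0) if np.any(assignments == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):   # converged: nothing moved
            break
        means = new_means
    return means, assignments


iris = datasets.load_iris()
means, assignments = k_means_sketch(iris.data, k=3)
print(means.shape, np.bincount(assignments, minlength=3))   # cluster sizes
```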
  • Let’s try it (note we ignore labels!):

    >>> from sklearn import cluster
    >>> k_means = cluster.KMeans(3)
    >>> k_means.fit(data)
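  • Once it’s fit, the clusterer holds on to what it found. A small sketch of where to look (n_init and random_state are set explicitly here only to make the run repeatable; the values are arbitrary):

```python
from sklearn import cluster, datasets

iris = datasets.load_iris()
k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(iris.data)

print(k_means.labels_[:10])             # cluster id assigned to the first 10 irises
print(k_means.cluster_centers_.shape)   # one centre per cluster, one column per feature
```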
    

Activity+

Pretending you don’t have access to labels, what, if anything, does this result tell you?

Try visualizing your results.

Activity+

Quantify how good a job k-means clustering did at grouping together irises of the same type. To do this, you’ll need to bring in your “ground truth” labels.

Do we have to worry about “double dipping” here?

What else do we have to worry about?

Feature Selection/Reduction & Principal Component Analysis (PCA)

  • Let’s say we want to classify people into tall people and short people based on their age, weight, height, name, SIN, eye colour, and current mood.
    • Are all these features helpful?
  • PCA isn’t machine learning. However, it tends to come up a lot when doing machine learning stuff.

  • So what is it?
  • Basically, sometimes we have data that has n dimensions, but if we do some fancy math on them, we can actually reduce the number of dimensions.
  • Let’s Try:

    from sklearn import decomposition
    from sklearn import datasets
    
    iris = datasets.load_iris()
    maDataz = iris.data
    maTargetz = iris.target
    
    print(maDataz.shape)                      # shape before PCA
    pca = decomposition.PCA(n_components=3)   # ask for 3 components
    pca.fit(maDataz)
    maDataz = pca.transform(maDataz)
    print(maDataz.shape)                      # shape after PCA
    
  • Who thinks they know what happens?

  • How do we know how many dimensions to reduce to?
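  • One common starting point (a rule of thumb, not the only answer) is to look at how much of the variance in the data each component accounts for. A minimal sketch, fitting PCA with all components kept:

```python
from sklearn import datasets, decomposition

iris = datasets.load_iris()
pca = decomposition.PCA()     # no n_components given: keep all 4 components
pca.fit(iris.data)

# Fraction of the total variance captured by each component, largest first
ratios = pca.explained_variance_ratio_
print(ratios)
print(ratios.cumsum())        # running total as components are added
```

  • If the running total is already close to 1 after the first couple of components, the remaining ones aren’t contributing much.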

Activity

Try some of the above machine learning algorithms on the dimension-reduced data.
  • Was it better or worse?

Cross-Validation

  • One of the things you learned above was the importance of proper cross-validation of machine learning results.
  • Because this is so important, scikit-learn has several built in cross-validation generators that will slice your data into test and training sets for you... and then do the testing and training.
    • KFold (n, k): split into K folds, train on K-1, test on the left-out fold
    • StratifiedKFold (y, k): make sure that all classes are even across the folds
    • LeaveOneOut (n): leave one observation out
    • LeaveOneLabelOut (labels): takes a label array to group observations
  • More generally, there is a whole set of tools to help with Model Selection.
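  • A sketch of k-fold cross-validation in action. This assumes a recent scikit-learn, where these tools live in sklearn.model_selection and KFold takes n_splits rather than the arguments listed above. The number of folds and the random_state are arbitrary choices.

```python
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier()

# 5 folds, shuffled first; both choices are arbitrary for this sketch
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Trains on 4 folds and tests on the left-out one, once per fold
scores = cross_val_score(knn, iris.data, iris.target, cv=folds)

print(scores)          # one accuracy per fold
print(scores.mean())   # average accuracy across folds
```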

The Zoo

  • This has been a (very) meagre taste of ML.

  • There is a whole zoo of Supervised and Unsupervised learning methods, with new ones being published every day.

  • Although the techniques we just looked at are ‘simple’, they are by no means insignificant!

  • scikit-learn has a pretty decent collection of the major algorithms, and a unified interface that makes it easy to try different options with minimum effort.

  • (And, like any good Python package, it has a nice gallery.)

  • It is, however, by no means complete.

  • ML is a very powerful tool, especially in an age where we produce more data than is possible to analyze by hand.

  • Like any powerful tool, it’s also really easy to misuse.

  • If you want to use ML in your research, you owe it to yourself to learn more. A couple of pointers to start you off:

    • Andrew Ng offers an ML course on Coursera. It’s awesome. If you want to use ML, take this course and do all the assignments.
    • If you really want to learn ML, get Chris Bishop’s Book. It starts from basic probability theory and goes from there. It is comprehensive, it is rigorous... it is not easy to read.

Activity

Break into small groups. Identify a problem that you think could be solved well with machine learning. Specifically, you should be able to answer:

  1. What is the data source?
  2. What do you hope to learn from the data?
  3. What ML approach(es) will allow you to do so?
  4. How would you gather your data? Store it? Implement the ML step?
  5. What approach would you take to analyzing your results?
  6. What impact would your results have?