Assignment #3: One thing about living in Santa Carla I never could stomach... all the damn vampires

  • Worth: 10%
  • DUE: November 16th; submitted on OWL by 11:55pm.
  • Files needed: CVC_data.csv .
_images/abe.jpg

The world is at war. With Vampires.

You have been recruited by the United Nations to join the international Centres for Vampire Control and Prevention (CVC) in Bethesda, Maryland. The CVC has gathered a large corpus of data on the vampire plague and, noting your superior informatics skills, has assigned you to the Bloodsucking Undead Forensic Feature Identification (BUFFI) unit.

Your first task will be to write some code to “data mine” a dataset of studies conducted on known vampires and non-vampires. Using the code you write, you will look for properties that differ between vampires and normal humans in the hopes of identifying measurable differences between vampires and normal humans.

Following the work above, you’ll write a function which takes measurements from an experimental subject who’s vampire status is unknown. Your function will return the probability of that subject being a vampire, based on the measurements you’re given. You’ll also use some cool statistics to figure out how good your function is.

I don’t need to tell you, recruit: a cheap, effective, ‘vampire test’ could save the world. Get to it.

How to do this assignment

  • Read the whole thing. Yes, all of it.
  • Do nothing until you’ve read the whole thing.
  • Do Part I, #1-3.
  • Do Part II, #1-4.
  • Use the functions you built in Part II to figure out what makes vampires different from non-vampires.
  • Do Part III, #1 quickly. Don’t agonize over it – just write a quick is_vampire() function.
  • Do Part IV, #1. Now you have a way to test is_vampire() functions.
  • Go back to Part III and fix up your is_vampire() function until you’re happy with it.
  • Write up what you did and submit.
    • Yes, you have to do a writeup explaining your function, why you picked it, and maybe showing some basic stats showing why you think it’s good!
  • If you wrote the most effective is_vampire() function in the class, collect your super-awesome prize.

Part I: File I/O

First things first... we have to get the data into Python so we can work with it:

  1. Download the data file .
  2. Write a function loaddata(filename) that will open a CSV file with the name filename and read the contents of the file into a list of lists. The function will then return this list of lists. (Remember: each line of the CSV file gets read into a list if you use csv.reader(), so a CSV file with multiple lines needs to be a list of lists).

Hint

Look at the data loading function from assignment 1!

  1. Write a function dat2arr(datalist) that takes a list of lists (like the one generated by loaddata()!) and turns part of it into a NumPy array. Here we’re going to assume that each item (each line of the CSV file) has the following format:

    [name, height(cm), weight(kg), stake aversion, garlic aversion, reflectance, shiny, IS_VAMPIRE?]
    

    for example:

    ['Patient 13',153,81,0.28,1.05,0.19,-0.07,1]
    

    When we convert to an array, we’ll drop the subject name, and just convert the remaining entries. You should return a 2D NumPy array where each row corresponds to one subject, and each column one measure.

  2. Write a function save_array(arr,fname) that saves a 2D NumPy array arr to a MATLAB file fname with the Python array arr stored as the MATLAB array named vampire_array. Remember that in Python, MATLAB files are represented as dictionaries.

Hint

First create a dictionary. Then add arr to the dictionary with the key vampire_array. Then use scipy.io.savemat to save the dictionary to a file.

_images/lost_boys_ver2.jpeg

Part II: Analysis and visualization

Now that we’ve got the data into Python, our first task is to analyze it to see if we can find measures that are repeatably (across the whole population) different for vampires and non-vampires. Are vampires taller than non-vampires? More averse to garlic? We don’t know right now... but you’ve got exactly the data you need to find out.

  1. Write a function column_stats(arr,col) that will return a list of summary statistics (mean, min and max) for a particular column (col) in the array arr. Summary stats for vampires and non-vampires should be returned separately (so you can compare them). Specifically you should return a list that looks like this:

    [ [vampire mean, vampire min, vampire max], [normal mean, normal min, normal max] ]
    

    Are there particular measures (columns) for which the values really look different for vampires and non-vampires?

  2. Sometimes numbers don’t tell the whole story. Write a function hist_compare(arr,col) that uses pylab.hist() to plot the histogram of values for a particular column. Plot separate histograms for vampires and non-vampires so you can easily compare them. You may notice that when the histograms overlap, it can be hard to read. If this bothers you, try using pylab.boxplot() instead and see how that works (you don’t have to do this though).

  3. It might be useful to see if some measures seem to follow the same trends, across the whole dataset. If two measures always move up, or down, together, as you step through the dataset, that might suggest that those two measures are related. For this function, and the next, we’re not going to split the data into “vampires” and “non-vampires”, we’re just going to visualize the whole data set all together. Write a function corr_columns(arr,col1,col2) that will compute the Pearson Correlation between columns col1 and col2 in arr. You might want to use the function scipy.stats.pearsonr() to do this (don’t forget to import scipy.stats, though!). Try correlateing different columns. You should pay special attention to columns that correlate highly with column 6 (the IS_VAMPIRE? column). Why? The r-value you get from the correlation function is easy to interpret: the closer it is to 0 the less related the two columns are. The closer it is to 1, the more related.

  4. Correlation values are nice (because you get a single number telling you how similar two vectors are) but, again, sometimes you want to see the difference. Write a function scatter_columns(arr,col1,col2) that will produce a scatterplot of col1 of arr against col2. Use the function pylab.scatter(). Is there a relationship between corr_columns(arr,col1,col2) and scatter_columns(arr,col1,col2)?

_images/novampire.jpeg

Part III: Model Building

Good job, recruit! We’ve now got a handle on things which we can measure easily that are different between vampires and non-vampires. Up to this point you’ve been examining a dataset where we knew whether or not each subject was a vampire. What we want you to do now is to design a test that will tell us, to the best of your ability, if an unknown subject is a vampire.

Your colleagues in the field-scanning division track down potential vampires and perform a battery of tests on them. From the tests we get the measures you’ve seen before: height, weight, aversion to wooden stakes, aversion to garlic, how well they reflect in a mirror and how “sparkly” their skin is. Your job is to write a program that will take in all of these values and return the probability that the person with these measurements is a vampire.

There are no clear-cut right or wrong answers here. This is a real-world modelling problem. You need to use what you’ve learned above about vampires to write a function that will take in an array of measures and tell me how likely it is that the person with those measures is a vampire.

  1. Write a function is_vampire(row) that takes a 1-D array (row) of 6 measurements and returns a probability that the person with these measurements is a vampire. The probability should never be zero. It can be very, very, small... but never 0.0 . Likewise the probability should never be 1.0 . Vampires are tricky. You can never be certain.

How should you approach writing the is_vampire() function? Using your brain and intuition, that’s how. I can’t give you a recipe, because there isn’t one. Model building is an art which you learn by doing, so just start doing. To get you started, I’ve listed two (bad) functions below. As for what goes in your function – it can be whatever you want. As long as you can explain the logic to the TA grading the assignment.

Here’s the worst possible function, it completely ignores row and just returns a random number:

def is_vampire(row):
   return numpy.clip(numpy.random.rand(),0.01,0.99)

Here’s something slightly better. It decides on the probability that someone is a vampire based on only their aversion to garlic (row[4]):

def is_vampire(row):
   return numpy.clip(1-row[4],0.01,0.99)

If you really believed that garlic aversion is the only measure separating vampires from non-vampires, then this might actually be a good function. If, however, your data analysis above tells you that there are other measures you need to consider... then your function might be more complex.

Again, there is no “right” or “wrong”. Experiment with different functions until you find one you like. As long as you can explain it in a couple of points to a TA, it’s worth full marks. One twist: if you write the best is_vampire() function in the whole class, you win a prize. Not just bonus marks. A real prize.

  • Best is a loose term to be defined by me when I’m in a good mood.

  • You must explain why you created this function clearly in your function header.
    • Why did you make it do what it did, why is it good, etc.
_images/buffy.jpeg

Part IV: Model Testing

How do we know how good our is_vampire() function is? Is there some way to quantify its “goodness”?

Well, fortunately for us, we have that nice big CSV file full of measurements for known vampires and non-vampires. A reasonable idea would be to plug values from that CSV file into our is_vampire() function and check the output of our function against the truth. Consider a line like this from the CSV file:

Subject 3,183,70,0.36,0.66,-0.04,0.32,1

The last value is a 1, so we know the subject is a vampire. If we call:

>>> is_vampire(numpy.array([3,183,70,0.36,0.66,-0.04,0.32])

and it returns a high probability (like, say, 0.83), we’d be pretty happy. If it returned a low probability... maybe our function needs more work. Or maybe this is just a weird vampire. We don’t want to get too down on our function after testing only a single subject!

So, maybe a better way to do this is to test all of the subjects in the CSV file. How about this: I feed in the measurements for every single subject in my CSV value and keep track of how close my is_vampire() function is to the known the truth. A perfect is_vampire() function would return 0.999999 whenever the subject is a vampire and 0.000001 when it isn’t a vampire.

With real data, we’ll never get a perfect function, but we can use that as a guide. What if, for each subject (each row of the CSV), I record the probability that my function is_vampire() assigns to the correct answer? Like this:

  • If the subject is a vampire, record is_vampire(row)
  • If the subject is not a vampire, record 1-is_vampire(row)

(Think about that second line for a second... is_vampire() gives me the probability that a subject is a vampire... so, the probabilistic complement 1-is_vampire() gives me the probability that the subject isn’t a vampire. If there’s an 80% chance that I’m a vampire... then there is a 100-80=20% chance that I’m not a vampire).

Now I do this for every line of my CSV file and multiply the results together. The product of all of these probabilities from your model, tested on real data, is called the likelihood of your model (in this case, your model is the function is_vampire() ). Now we have a way to compare two is_vampire() functions. We compute the likelihood for each function, using the data from the CSV file. Whichever function has a higher likelihood is better.

That, in a nutshell, is the method of Maximum Likelihood. If you plan to do any statistical modelling, ever, at any point in your life... you’ll need to know this. I’ve skipped a lot of details and technicalities because I just want you to get the intuition, you can pick up the details in a boring stats class some other time. The Wikipedia article is pretty technical, so if you read it, don’t lose sight of how simple this actually is: If you need to compare two models... the best model is the one which is most consistent with the observations in your dataset. More concretely, you’re asking: “what is the probability of observing this set of real data I have here, if my model is correct?”. The higher that probability, the more you like that model. That’s all you’re doing with maximum likelihood.

There’s just one small technical detail left: Probabilities are always small numbers (between one and zero). Multiplying small numbers together makes even smaller numbers. Remember cookie monster from the notes on floating point arithmetic in computers? Computers don’t do so well with really small numbers. So, to get around this we cheat a bit: instead of computing the likelihood... we compute the log of the likelihood... which we call, very creatively: the log likelihood. For each line in the CSV we compute numpy.log( is_vampire(row)) (or numpy.log(1-is_vampire(row))). We now add the log likelihoods (because this is equivalent to multiplying the raw probabilities (because math)) and our numbers stay reasonably large.

1. So then, you’ll need a log likelihood function of your own to compare the is_vampire() functions you come up with. I’ll give you a sketch that you’ll have to turn into Python code:

def log_likelihood(arr,vampire_function):

   #initialize the log likelihood to zero

   #loop over each row in arr
      result = vampire_function(row)

              # if row[6] > 0.5 (i.e., the subject IS a vampire)
              #    add numpy.log(result) to the log likelihood
              # otherwise
              #    add numpy.log(1-result) to the log likelihood

#return the log likelihood

Note that vampire_function should be your is_vampire() function! I know it’s weird thinking about passing functions as variables... but, as we’ve discussed, Python allows this... and in this case it sure is handy. You might have a whole bunch of different functions is_vampire1(), is_vampire2(), is_vampire3(), etc. and you can test each just by passing them to your log_likelihood() function, like this:

>>> log_likelihood(arr,is_vampire)
>>> log_likelihood(arr,is_vampire2)
>>> log_likelihood(arr,is_vampire3)

Note that when you pass a function as a variable you don’t include the ()s.

_images/Marceline_Bass.png

FAQ:

  • I don’t know how to do X.
  • My thing keeps telling me ERROR: File `u’SOMETHING’` not found.
    • Then the file isn’t where python is looking.
  • Do I have enough comments?
    • I don’t know, maybe? If you’re looking at code and have to ask if you should comment it... just comment it. That said, don’t write me a book.
  • You didn’t tell me exactly what to put in the writeup!
    • I know. Make it whatever you think it should be to get full marks.
  • Is my writeup good enough?
    • I don’t know. Is it something that you feel is worth full marks?
  • My plot only has like 6 points on it. Is this OK?
    • I bet you accidentally plotted the rows instead of the columns
  • Seriously, WTF is log liklihood? This seems useless.
    • It’s not useless, but if you’re having a hard time thinking about what it means, perhaps write yourself a function to measure accuracy (just count how many you classified correctly and divide by the number of data points)! How does this value change with the log liklihood?

What to submit on OWL

  • Your version of asn3.py
    • Make sure your NAME and STUDENT NUMBER appear in a comment at the top of the program.
    • List anyone you worked with in the comments, too
    • Make sure it’s commented and has function headers!!
  • A text file describing how you came up with your is_vampire() function and justifying why it’s a good one.
    • Like, actually write something becoming of a university student!!!!
    • A LOT of marks will be for a proper analysis!
_images/zv.jpeg