# Assignment #3: One thing about living in Santa Carla I never could stomach... all the damn vampires¶

**Worth**: 10%**DUE**: November something-th; submitted on OWL by midnight.- Files needed: CVC_data.csv .

The world is at war. With Vampires.

You have been recruited by the United Nations to join the international
Centres for Vampire Control and Prevention (**CVC**) in Bethesda, Maryland. The CVC has
gathered a large corpus of data on the vampire plague and, noting your superior
informatics skills, has assigned you to the Bloodsucking Undead Forensic Feature
Identification (**BUFFI**) unit.

Your first task will be to write some code to “data mine” a dataset of studies conducted on known vampires and non-vampires. Using the code you write, you will look for properties that differ between vampires and normal humans in the hopes of identifying measurable differences between vampires and normal humans.

Following the work above, you’ll write a function which takes measurements from an experimental subject who’s vampire status is unknown. Your function will return the probability of that subject being a vampire, based on the measurements you’re given. You’ll also use some cool statistics to figure out how good your function is.

I don’t need to tell you, recruit: a cheap, effective, ‘vampire test’ could *save the world*. Get to it.

## How to do this assignment¶

- Read the whole thing. Yes, all of it.
- Do nothing until you’ve
*read the whole thing*. - Do Part I, #1-3.
- Do Part II, #1-4.
- Use the functions you built in Part II to figure out what makes vampires different from non-vampires.
- Do Part III, #1 quickly. Don’t agonize over it – just write a quick
`is_vampire()`

function. - Do Part IV, #1. Now you have a way to
*test*`is_vampire()`

functions. - Go back to Part III and fix up your
`is_vampire()`

function until you’re happy with it. **Write up what you did and submit.**- Yes, you have to do a writeup explaining your function, why you picked it, and maybe showing some basic stats showing why you think it’s good!

- If you wrote the most effective
`is_vampire()`

function in the class, collect your super-awesome prize.

## Part I: File I/O¶

First things first... we have to get the data into Python so we can work with it:

- Download the data file .
- Write a function
`loaddata(filename)`

that will open a CSV file with the name`filename`

and read the contents of the file into a list of lists. The function will then`return`

this list of lists. (Remember: each line of the CSV file gets read into a list if you use`csv.reader()`

, so a CSV file with multiple lines needs to be a list of lists).

Hint

Look at the data loading function from assignment 1!

Write a function

`dat2arr(datalist)`

that takes a list of lists (like the one generated by`loaddata()`

!) and turns part of it into a NumPy array. Here we’re going to assume that each item (each line of the CSV file) has the following format:[name, height(cm), weight(kg), stake aversion, garlic aversion, reflectance, shiny, IS_VAMPIRE?]

for example:

['Patient 13',153,81,0.28,1.05,0.19,-0.07,1]

When we convert to an array, we’ll drop the subject name, and just convert the remaining entries. You should return a 2D NumPy array where each row corresponds to one subject, and each column one measure.

Write a function

`save_array(arr,fname)`

that saves a 2D NumPy array`arr`

to a MATLAB file`fname`

with the Python array`arr`

stored as the MATLAB array named`vampire_array`

. Remember that in Python, MATLAB files are represented as*dictionaries*.

Hint

First create a dictionary. Then add `arr`

to the dictionary with the key `vampire_array`

. Then
use `scipy.io.savemat`

to save the dictionary to a file.

## Part II: Analysis and visualization¶

Now that we’ve got the data into Python, our first task is to analyze it to see if we can find measures that are repeatably (across the whole population) different for vampires and non-vampires. Are vampires taller than non-vampires? More averse to garlic? We don’t know right now... but you’ve got exactly the data you need to find out.

Write a function

`column_stats(arr,col)`

that will return a list of summary statistics (mean, min and max) for a particular column (`col`

) in the array`arr`

. Summary stats for vampires and non-vampires should be returned*separately*(so you can compare them). Specifically you should return a list that looks like this:[ [vampire mean, vampire min, vampire max], [normal mean, normal min, normal max] ]

Are there particular measures (columns) for which the values really look different for vampires and non-vampires?

Sometimes numbers don’t tell the whole story. Write a function

`hist_compare(arr,col)`

that uses`pylab.hist()`

to plot the histogram of values for a particular column. Plot separate histograms for vampires and non-vampires so you can easily compare them. You may notice that when the histograms overlap, it can be hard to read. If this bothers you, try using`pylab.boxplot()`

instead and see how that works (you don’t have to do this though).It might be useful to see if some measures seem to follow the same trends, across the whole dataset. If two measures always move up, or down, together, as you step through the dataset, that might suggest that those two measures are related. For this function, and the next, we’re not going to split the data into “vampires” and “non-vampires”, we’re just going to visualize the whole data set all together. Write a function

`corr_columns(arr,col1,col2)`

that will compute the Pearson Correlation between columns`col1`

and`col2`

in`arr`

. You might want to use the function`scipy.stats.pearsonr()`

to do this (don’t forget to`import scipy.stats`

, though!). Try correlateing different columns. You should pay special attention to columns that correlate highly with column 6 (the`IS_VAMPIRE?`

column). Why? The r-value you get from the correlation function is easy to interpret: the closer it is to 0 the less related the two columns are. The closer it is to 1, the more related.Correlation values are nice (because you get a single number telling you how similar two vectors are) but, again, sometimes you want to

*see*the difference. Write a function`scatter_columns(arr,col1,col2)`

that will produce a scatterplot of`col1`

of`arr`

against`col2`

. Use the function`pylab.scatter()`

. Is there a relationship between`corr_columns(arr,col1,col2)`

and`scatter_columns(arr,col1,col2)`

?

## Part III: Model Building¶

Good job, recruit! We’ve now got a handle on things which we can measure easily that are different between
vampires and non-vampires. Up to this point you’ve been examining a dataset where we *knew* whether or
not each subject was a vampire. What we want you to do now is to design a *test* that will tell us, to
the best of your ability, if an *unknown* subject is a vampire.

Your colleagues in the field-scanning division track down potential vampires and perform a battery of tests on them. From the tests we get the measures you’ve seen before: height, weight, aversion to wooden stakes, aversion to garlic, how well they reflect in a mirror and how “sparkly” their skin is. Your job is to write a program that will take in all of these values and return the probability that the person with these measurements is a vampire.

There are no clear-cut right or wrong answers here. This is a real-world modelling problem. You need to use what you’ve learned above about vampires to write a function that will take in an array of measures and tell me how likely it is that the person with those measures is a vampire.

- Write a function
`is_vampire(row)`

that takes a 1-D array (`row`

) of 6 measurements and returns a probability that the person with these measurements is a vampire. The probability should never be zero. It can be very, very, small... but never`0.0`

. Likewise the probability should never be`1.0`

. Vampires are tricky. You can never be*certain*.

How should you approach writing the `is_vampire()`

function? *Using your brain and intuition*, that’s how.
I can’t give you a recipe, because there isn’t one. Model building is an art which you learn by doing, so
just start doing. To get you started, I’ve listed two (bad) functions below. As for what goes in *your*
function – it can be whatever you want. As long as you can explain the logic to the TA grading the assignment.

Here’s the worst possible function, it completely ignores `row`

and just returns a random number:

```
def is_vampire(row):
return numpy.clip(numpy.random.rand(),0.01,0.99)
```

Here’s something slightly better. It decides on the probability that someone is a vampire based on
only their aversion to garlic (`row[4]`

):

```
def is_vampire(row):
return numpy.clip(1-row[4],0.01,0.99)
```

If you really believed that garlic aversion is the only measure separating vampires from non-vampires, then this might actually be a good function. If, however, your data analysis above tells you that there are other measures you need to consider... then your function might be more complex.

Again, there is no “right” or “wrong”. Experiment with different functions until you find one you like.
As long as you can explain it in a couple of points to a TA, it’s worth full marks. One twist: if you write
the *best* `is_vampire()`

function in the whole class, you win a prize. Not just bonus marks. A real prize.

Bestis a loose term to be defined by me when I’m in a good mood.

- You
mustexplain why you created this function clearly in your function header.

- Why did you make it do what it did, why is it good, etc.

## Part IV: Model Testing¶

How do we know how good our `is_vampire()`

function is? Is there some way to quantify its “goodness”?

Well, fortunately for us, we have that nice big CSV file full of measurements for *known* vampires and non-vampires.
A reasonable idea would be to plug values from that CSV file into our `is_vampire()`

function and check the
output of our function against the truth. Consider a line like this from the CSV file:

```
Subject 3,183,70,0.36,0.66,-0.04,0.32,1
```

The last value is a `1`

, so we know the subject is a vampire. If we call:

```
>>> is_vampire(numpy.array([3,183,70,0.36,0.66,-0.04,0.32])
```

and it returns a high probability (like, say, `0.83`

), we’d be pretty happy. If it returned a low
probability... maybe our function needs more work. Or maybe this is just a weird vampire. We don’t
want to get too down on our function after testing only a *single* subject!

So, maybe a better way to do this is to test *all* of the subjects in the CSV file. How about this:
I feed in the measurements for every single subject in my CSV value and keep track of how close my
`is_vampire()`

function is to the known the truth. A perfect `is_vampire()`

function would return
`0.999999`

whenever the subject is a vampire and `0.000001`

when it isn’t a vampire.

With real data, we’ll *never* get a perfect function, but we can use that as a guide. What if, for each
subject (each row of the CSV), I record the probability that my function `is_vampire()`

assigns to the
correct answer? Like this:

- If the subject is a vampire, record
`is_vampire(row)`

- If the subject is
nota vampire, record`1-is_vampire(row)`

(Think about that second line for a second... `is_vampire()`

gives me the probability that a subject *is*
a vampire... so, the probabilistic complement `1-is_vampire()`

gives me the probability that the subject
*isn’t* a vampire. If there’s an 80% chance that I’m a vampire... then there is a 100-80=20% chance that I’m not
a vampire).

Now I do this for every line of my CSV file and multiply the results together. The product of all of these
probabilities from your model, tested on real data, is called the *likelihood* of your model (in this case, your model is the function
`is_vampire()`

). *Now* we have a way to compare two `is_vampire()`

functions. We compute the *likelihood* for
each function, using the data from the CSV file. Whichever function has a higher likelihood is better.

That, in a nutshell, is the method of Maximum Likelihood. If you plan to do any statistical modelling, ever, at any point in your life... you’ll need to know this. I’ve skipped a lot of details and technicalities because I just want you to get the intuition, you can pick up the details in a boring stats class some other time. The Wikipedia article is pretty technical, so if you read it, don’t lose sight of how simple this actually is: If you need to compare two models... the best model is the one which is most consistent with the observations in your dataset. More concretely, you’re asking: “what is the probability of observing this set of real data I have here, if my model is correct?”. The higher that probability, the more you like that model. That’s all you’re doing with maximum likelihood.

There’s just one small technical detail left: Probabilities are always small numbers (between one and zero).
Multiplying small numbers together makes even smaller numbers. Remember cookie monster from the notes on
floating point arithmetic in computers? Computers don’t do so well with really small numbers. So, to get around
this we cheat a bit: instead of computing the likelihood... we compute the *log* of the likelihood... which
we call, very creatively: the *log likelihood*. For each line in the CSV we compute `numpy.log( is_vampire(row))`

(or `numpy.log(1-is_vampire(row))`

). We now *add* the log likelihoods (because this is equivalent to multiplying
the raw probabilities (because math)) and our numbers stay reasonably large.

1. So then, you’ll need a log likelihood function of your own to compare the `is_vampire()`

functions you come up
with. I’ll give you a sketch that you’ll have to turn into Python code:

```
def log_likelihood(arr,vampire_function):
#initialize the log likelihood to zero
#loop over each row in arr
result = vampire_function(row)
# if row[6] > 0.5 (i.e., the subject IS a vampire)
# add numpy.log(result) to the log likelihood
# otherwise
# add numpy.log(1-result) to the log likelihood
#return the log likelihood
```

Note that `vampire_function`

should be your `is_vampire()`

function! I know it’s weird thinking
about passing functions as variables... but, as we’ve discussed, Python allows this... and in this
case it sure is handy. You might have a whole bunch of different functions `is_vampire1()`

,
`is_vampire2()`

, `is_vampire3()`

, etc. and you can test each just by passing them to your
`log_likelihood()`

function, like this:

```
>>> log_likelihood(arr,is_vampire)
>>> log_likelihood(arr,is_vampire2)
>>> log_likelihood(arr,is_vampire3)
```

Note that when you pass a function as a variable you *don’t* include the ()s.

## FAQ:¶

- I don’t know how to do
*X*. - OK, go to google.ca and type in
*X*.

- OK, go to google.ca and type in

- I don’t know how to do
- My thing keeps telling me
*ERROR: File `u’SOMETHING’` not found*. - Then the file isn’t where python is looking.

- My thing keeps telling me
- Do I have enough comments?
- I don’t know, maybe? If you’re looking at code and have to ask if you should comment it... just comment it. That said, don’t write me a book.

- You didn’t tell me exactly what to put in the writeup!
- I know. Make it whatever you think it should be to get full marks.

- Is my writeup good enough?
- I don’t know. Is it something that you feel is worth full marks?

## What to submit on OWL¶

- Your version of
`asn3.py`

- Make sure your
**NAME**and**STUDENT NUMBER**appear in a comment at the top of the program. - List anyone you worked with in the comments, too
- Make sure it’s
*commented*and has*function headers*!!

- Make sure your
- A text file describing how you came up with your
`is_vampire()`

function and justifying why it’s a good one.**Like, actually write something becoming of a university student!!!!**- A LOT of marks will be for a proper analysis!