Assignment #4: Here you go again (on your own)

  • Worth: 10%
  • DUE: December 7th; submitted on OWL by 11:55PM.
_images/sort.jpeg

Part I: Get yourself sorted

Pick a sorting algorithm that we haven’t discussed in class. It can be iterative or recursive. Efficient, or perversely awful. Any sorting algorithm at all will do, except for (THIS MEANS DON’T DO THESE ONES):

  • Selection Sort
  • Insertion Sort
  • Bubble Sort
  • Bogosort
  • Quick Sort
  • Merge Sort

Once you’ve picked a sorting algorithm, you have to do two three things:

  • Implement it in Python
  • Test your implementation to make sure it works
  • Write a brief description of how good you think the sorting algorithm is (how many loops does it have to do for a list with n elements?)

This is a big step from previous assignments, where you’ve had to do implementations based on my descriptions. Now you have to implement sorting based on someone else’s description, which is probably more terse. If you want to use programming in your own work, this is a critical skill. You need to be able to read a paper (or textbook) and, with some time and effort, turn an algorithm described in that document into a Python program that you can use.

Suggested places to look for sorting algorithms:

You MUST have a test harness. It will be something you can call to easily run this part of the assignment. Make it look something like this:

def part1():
        Generate List
        print list
        call sorting algo
        print sorted list

Note

I’m well aware of the fact that it’s completely trivial to ‘cheat’ on this portion of the assignment and just cut and paste Python code from the internet, instead of reading a pseudocode description of the algorithm and turning it into Python yourself. If you do this, believe me, it’ll be obvious. Maybe not to you, but it will be to the TAs. More importantly, it will be obvious to this kitten. And every time someone hands in cut-and-paste sorting code from the internet, this kitten gets sadder. If you can live with that, so be it. You monster.

This was also featured on reddit and Buzzfeed

_images/sad.jpeg

If you’re stuck or overwhelmed by all the choices of sorting algorithm, come and talk to me (James), Mike, or Santo. We’ll help you pick one to implement.

_images/bigdata.jpeg

Part II: Data Analytics

You have enough skill as programmers now to be let loose upon the world. Instead of me dictating an assignment and dataset to you, this time I want you to find your own dataset. It could be a stock dataset from a website, or it could be data from your own research (assuming you get permission from whomever actually owns the data – you have my word that the TAs and I will keep any results you present completely confidential). The important thing is that the data interests you, personally, in some way.

I’d prefer that you find a dataset that really interests you, but if you’re stuck for ideas, here are some places to start:

Warning

Pick reasonable data! You will have to submit it with your code, and if your data is too large you will have trouble doing this! In other words, make sure you have a reasonably sized dataset.

Here’s what you’re going to do:

  • Pick a data set
  • Figure out how to load it into Python.
  • Do two (different!) visualizations of the data, and interpret them.
  • Use two different machine learning approaches to analyze, and interpret your data.

I can’t tell you what visualizations to do, because this choice totally depends on your dataset and what you want to do with it. Part of the assignment is to demonstrate some understanding of the tools you’ve learned by using them appropriately. I suggest starting with the links here and finding a plot that looks like what you want. Grab that code and modify it to suit your data. If you’re really stuck (and not sure what/how to plot) come and talk to me (James), Stephen. We’re happy to make suggestions.

Note

Do your own work. There is an effectively infinite amount of data out there. The probability of you picking exactly the same datasets, visualizations and algorithms as someone in last year’s class is just about 0%. We have those assignments, and we’ll notice if you do. The opportunity to allow you freedom to apply your new skills in ways that are meaningful to you is worth the small extra effort of making sure everyone is honest. Besides, you don’t want to cause more heartache to that poor kitten, do you?

Once you’ve got your visualizations, write up a few sentences analyzing them. Do they make any sense? Is there a story in your data? What is it?

Likewise with the machine learning component. I can’t tell you what you should do, because I don’t know what data you’ve chosen to work with. A shot in the dark might be to try one supervised approach (where you try to classify your data) and one unsupervised approach (where you try to cluster your data). For example, suppose I were working with the Vampire data from assignment 3. For supervised learning, I might try to train a Support Vector Machine (SVM) to classify individuals as Vampire/Not-Vampire. For unsupervised learning, I might just run the whole data set through a k-means clustering algorithm to see if there are any obvious clusters in the data.

Once you’ve done the machine learning, write a few more sentences describing your results (figures?!). If you did k-means clustering on the Vampire data and got two very clean clusters... what would that tell you? Try to draw inferences from the results you got, rather than just restating the result itself.

Your final Python program should have the following functions:
  • load_data(...) – loads your data set
  • viz1(data,..,) – vizualizes your data
  • viz2(data,...) – vizualizes your data
  • learn1(data,...) – first machine learning function
  • learn2(data,...) – first machine learning function

The ... in the function calls is there to make it clear that you can add as many (or as few) parameters to these functions as you need.

Similar to part 1, you MUST have a test harness called part2() which calls everything needed to demonstrate part 2.

Warning

Falsifying results, even on an assignment, is a very serious academic offense and can result in a grade of 0, failure of the course, or even expulsion from the university. In other words, don’t lie in your results; don’t make up stuff.

What to submit

  • An implementation of a sorting function in a Python file mysort.py.

  • A text file which:
    • describes the algorithm you implemented and where you found it (doesn’t have to be formal, a link to a website is fine).
    • shows you testing your sorting algorithm on some lists of integers and describing how good (efficient) you think this sorting algorithm is (and why). Don’t just say ‘it goes through the main loop n times’ – explain to me why you think that happens. You don’t need a mathematical proof, just an argument in English. Like we did in class.
  • A Python file containing functions to load your data, run 2 visualizations and 2 machine learning algorithms on it.

  • A text file describing the data set you chose, your motivation for choosing it, the results of your visualization and machine learning and your interpretation of the results.

  • Your data! (Failure to do this will result in a mark of 0!)

FAQ:

  • I don’t know how to do X.
  • My thing keeps telling me ERROR: File `u’SOMETHING’` not found.
    • Then the file isn’t where python is looking.
  • Do I have enough comments?
    • I don’t know, maybe? If you’re looking at code and have to ask if you should comment it... just comment it. That said, don’t write me a book.
  • You didn’t tell me exactly what to put in the writeup!
    • I know. Make it whatever you think it should be to get full marks.
  • Is my writeup good enough?
    • I don’t know. Is it something that you feel is worth full marks? I do have higher expectations for this analysis compared to previous assignments.
  • Do I need to add figures to my analysis?
    • You don’t need to, but I’d imagine you’ll get a bad mark.
  • Will I really get 0 if I don’t upload my data?
    • Yes.
  • What if I send you the data after, because I totally forgot?
    • Lol, no, still 0
  • What happens if I did all my work only to find out later that my data was too large to be submitted?
    • That sucks. Start over with different data.