CS 2120: Class #10

  • What data structures have we learned about so far?
  • We’re going to learn about two more Python data structures before moving to some much more practical issues.

Tuples

  • A tuple looks a lot like a list, but with () instead of []:

    >>> tup = (5,3)
    >>> print tup
    (5, 3)
    
  • Handy for storing something like an (x,y) co-ordinate pair.

  • If a tuple is exactly like a list, why would I use it? It must be different somehow.

Activity

Figure out how tuples differ from lists (other than using different types of brackets!).

Some questions you might ask: Are tuples mutable? Do tuples have “built in functions”? (e.g., something like tup.max()?)

  • So why would you ever use a tuple instead of a list?

  • Well, you don’t have to. Anything you can do with tuples, you can do with lists. But, because they are immutable:
    • Tuples are typically a little faster to create and take a little less memory than lists
    • Tuples prevent you from overwriting something that shouldn’t be overwritten
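
  • A minimal sketch of the immutability point. The variable names are made up for illustration, and print is written with parentheses so the snippet also runs under Python 3:

```python
# A list can be changed in place; a tuple cannot.
point_list = [5, 3]
point_list[0] = 10            # fine: lists are mutable

point_tuple = (5, 3)
try:
    point_tuple[0] = 10       # tuples do not support item assignment
    changed = True
except TypeError:
    changed = False

print(point_list)
print(point_tuple)
print(changed)
```

  • The failed assignment is the "prevents overwriting" benefit in action: the mistake is reported immediately instead of silently changing the data.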

Dictionaries

  • Python dictionaries are a more complex data structure than what we’ve seen so far.

  • But... they are very useful.

  • Imagine a list which you can index with strings instead of numbers.

  • That’s a dictionary.

  • Let’s create an empty dictionary:

    >>> mydict = {}
    
  • Looks like an empty list, but using {} instead of [].

  • How do I add something?

    >>> mydict['James']=50
    >>> print mydict
    {'James': 50}
    
  • The dictionary has associated the key James with the value 50.

  • Maybe this is a dictionary of grades? I need to work harder.

  • Let’s add more:

    >>> mydict['Suzy'] = 95
    >>> mydict['Johnny'] = 85
    >>> print mydict
    {'Suzy': 95, 'Johnny': 85, 'James': 50}
    
  • Dictionaries always associate a key with a value.
    • dict[key] = value

Activity

Build the dictionary mydict above.

Figure out how to access the value associated with a particular key, without printing out the whole dictionary (e.g., how would I print just Suzy’s grade?).

Hint: it’s a lot like indexing a list...

What happens if I try to index the dictionary with a key that doesn’t exist?

  • Dictionaries are a generalization of lists:
    • A list associates fixed indices from 0 up to n with values.
    • A dictionary associates arbitrary keys (strings, in our examples) with values.
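
  • A short sketch of indexing by key, using the grades from above. The key 'Bob' is a made-up name that is deliberately missing from the dictionary:

```python
grades = {'James': 50, 'Suzy': 95, 'Johnny': 85}

# Index with a key, much like indexing a list with a number.
suzy_grade = grades['Suzy']
print(suzy_grade)

# Indexing with a key that is not in the dictionary raises KeyError.
try:
    grades['Bob']
    found = True
except KeyError:
    found = False
print(found)

# get() returns a default value instead of raising an error.
print(grades.get('Bob', 'no grade recorded'))
```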

Activity

Now type mydict. and hit the [Tab] key. Play around with the built-in functions for dictionaries.

Take special care to look at:

  • mydict.keys()
  • mydict.values()
  • mydict.has_key()

  • Indexing by name is really useful for humans, because it’s much easier for us to assign names to things than to try to remember arbitrary numberings.
  • Many programming languages have nothing like dictionaries. In some others you’ll see them called “associative arrays” or “associative memories”.
  • We’ve just scratched the surface of what you can do with dictionaries here, but it’s enough for our purposes right now.
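
  • A small sketch of the methods from the activity above. It is written to run under Python 3 as well, where has_key() no longer exists and the `in` operator does the same job (`in` works in Python 2 too). Sorting is only there to give a predictable order:

```python
mydict = {'James': 50, 'Suzy': 95, 'Johnny': 85}

# keys() and values() give the two "halves" of the dictionary.
names = sorted(mydict.keys())
grades = sorted(mydict.values())
print(names)
print(grades)

# Membership test: the portable replacement for mydict.has_key('Suzy').
has_suzy = 'Suzy' in mydict
has_bob = 'Bob' in mydict      # 'Bob' is a made-up key that was never added
print(has_suzy)
print(has_bob)

# Walk over every key and look up its value.
for name in names:
    print('%s: %d' % (name, mydict[name]))
```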

Getting data into Python

  • Probably the most important thing in Python.
    • If you’re going to pay attention only once this term... now’s the time.
  • You now know a lot about how to manipulate data.

  • But as a working researcher, you want to manipulate specific data. Your data. Not toy examples.

  • We need to learn about File I/O.

  • I won’t lie to you: file I/O is boring, painful, detail-oriented work.

  • Fortunately, Python makes it less painful than just about any other language.

  • It’s worth the pain because once you combine these skills with your existing skills...

[Image: real.jpeg]

Loading a CSV file

  • CSV stands for “Comma Separated Values”.

  • The file is stored in plain text (you can read it with a text editor)

  • Delivers what it promises.

  • Each line of the file is one item.

  • Within the line each value associated with that item is in a comma-delimited field.
    • At least it should... Some annoying people use tabs, spaces, or other dumb characters.
      • PLZ USE COMMAS
        • wtf is the point of a ‘standard’ if no one follows it?
  • For example, suppose I have recorded height, weight and IQ for 3 subjects:

    name, height, weight, IQ
    Subject 1, 170, 68, 100
    Subject 2, 182, 80, 110
    Subject 3, 155, 54, 105
    
  • The first line is a header, explaining the values in each field.

  • Headers are not mandatory. Some CSVs have ‘em, some don’t.

  • Good news: Python has a built-in library to read CSV files for you!

  • In fact, we’ve seen this before:

    def load_asn1_data():
            """
            This function loads the file `starbucks.csv` and returns a LIST of
            latitudes and longitudes for North American Starbucks'.
            We'll talk about lists formally in class in a few lectures, but maybe
            you can start guessing how they work based on what you see here...
            """
    
            import csv
    
            reader = csv.reader(open('starbucks.csv', 'r'))
            locations = []
    
            for r in reader:
                    locations.append( (r[0],r[1]))
    
            return locations
    
  • What does open do? What does 'r' mean?
    • HEY! My thing is saying: something, something, something, newline!
      • Change ‘r’ to ‘rU’
  • How does the csv.reader work?
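
  • One way to see what csv.reader does is to feed it the small subject table from above. This is only a sketch, written for Python 3: io.StringIO stands in for a real file on disk, and skipinitialspace=True is used because that table has a space after each comma.

```python
import csv
import io

# Stand-in for a file on disk: the same text as the subject table above.
text = """name, height, weight, IQ
Subject 1, 170, 68, 100
Subject 2, 182, 80, 110
Subject 3, 155, 54, 105
"""

# skipinitialspace=True drops the space that follows each comma.
reader = csv.reader(io.StringIO(text), skipinitialspace=True)

header = next(reader)      # the first row is the header
rows = []
for r in reader:
    # Each row arrives as a list of strings, so convert the numeric fields.
    rows.append((r[0], int(r[1]), int(r[2]), int(r[3])))

print(header)
print(rows)
```

  • Note that the reader hands back every field as a string. Converting to numbers is your job.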

Activity+++

Have a look at the function load_asn1_data() from Assignment 1. That function loads a CSV file.

Figure out how it works. Download this CSV file to your computer.

Now write a function called load_airports() that loads this CSV file into a list.

Play with this list a bit and get a feel for how the data is organized.

Activity+++

Now write a function get_name_from_code(airportCode,airportlist) that will return a string containing the full name of the airport with the code airportCode.

The parameter airportlist should be the list you loaded using load_airports().

  • Why such hard activities??

  • Because learning to load in data is critical to your ability to apply what you’ve learned to real world situations.

  • Programming is pretty boring if you can’t ever apply it to real problems and data.

  • Suppose you have some tabular data in Python that you want to save back into a CSV:

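    >>> import csv
    >>> outfile = open('yourfilename', 'w')
    >>> csvout = csv.writer(outfile)
    >>> csvout.writerow(['First cell','Second cell', 'Third cell'])
    >>> # write as many rows as you need to... maybe in a loop?
    >>> outfile.close()   # close the file itself; csv.writer objects have no close() method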
    
  • CSV files are popular because they’re simple.

  • You can, e.g., export any Excel spreadsheet as a CSV.

  • If you have tabular data, this is a decent choice of format.

  • If you don’t have tabular data... this is an awful choice.
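
  • A sketch of a full round trip: write a small table out in a loop, then read it back. It assumes Python 3 (the newline='' argument is what the Python 3 csv documentation recommends). The file name subjects.csv and the temporary directory are made up for the example.

```python
import csv
import os
import tempfile

rows = [
    ['name', 'height', 'weight', 'IQ'],
    ['Subject 1', 170, 68, 100],
    ['Subject 2', 182, 80, 110],
]

# Hypothetical output location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'subjects.csv')

# The with block closes the file for us, even if something goes wrong.
with open(path, 'w', newline='') as outfile:
    csvout = csv.writer(outfile)
    for row in rows:
        csvout.writerow(row)

# Read the file back to check what was actually stored.
with open(path, 'r', newline='') as infile:
    loaded = list(csv.reader(infile))

print(loaded)
```

  • The numbers come back as strings: a CSV file stores text only, so type information is lost on the way out.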

MATLAB

  • Some of your less enlightened colleagues might use MATLAB instead of Python.

  • You need to be able to trade files back and forth with MATLAB users.

  • MATLAB’s default file format is .mat and it’s reasonably complex.

  • We’ll look here at a simple example.

  • Download and save this .mat file which describes connectivity between different parts of the macaque brain.

  • We’re too cool for MATLAB, so how do we open it?

  • Fortunately for us, SciPy has tools to load, and save, MATLAB files built in.

  • So... let’s open it:

    >>> import scipy.io
    >>> scipy.io.loadmat('macaque47.mat')
    
  • Whoa... what was all that stuff???

  • Ah! A dictionary!!! Remember: think of them as a special list that can be indexed with strings instead of numbers.

  • We should probably store that in a variable:

    >>> a = scipy.io.loadmat('macaque47.mat')
    
  • Important difference with CSV files:
    • The CSV file stored exactly one table. That’s it.
    • The .mat file stores a whole MATLAB workspace which might have many (MATLAB) variables in it!
  • So how do we tell Python which of the MATLAB variables we’re interested in? By accessing the dictionary with the name of the variable (which will be the same as whatever the MATLAB user called it).

  • So if I want the values of the MATLAB variable CIJ:

    >>> a['CIJ']
    ...
    
  • What if I want a list of the different MATLAB variables available in this dictionary?

    >>> a.keys()
    ['CIJ', '__version__', '__header__', 'Names', '__globals__']
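
  • The keys wrapped in double underscores are bookkeeping entries added by loadmat, not MATLAB variables. A sketch of filtering them out follows. The dictionary here is only a stand-in with placeholder values, so the snippet runs without SciPy or the .mat file:

```python
# Stand-in for the dictionary returned by scipy.io.loadmat.
# The values are placeholders, not the real contents of macaque47.mat.
a = {
    'CIJ': 'connectivity matrix would be here',
    'Names': 'region names would be here',
    '__version__': 'placeholder',
    '__header__': 'placeholder',
    '__globals__': [],
}

# Keep only the keys that are real MATLAB variable names.
matlab_vars = sorted(k for k in a.keys() if not k.startswith('__'))
print(matlab_vars)
```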
    
  • Let’s extract the brain connectivity matrix from the .mat file and store it in its own variable.

    >>> brain = a['CIJ']
    

Activity

What type is brain?

Has Python converted it to a Python/NumPy type for you?

How would you visualize this matrix? (Hint: matshow(...) ).

Go back and extract the MATLAB variable named Names from the dictionary a.

  • If you need to share your results with someone who uses MATLAB, you can use scipy.io.savemat().

  • Remember, .mat files can store a whole MATLAB workspace, not just one array.

  • We simulate that workspace with a dictionary.
    • keys are variable names
    • values are the values of the variable
  • For example:

    >>> import numpy
    >>> import scipy.io
    >>> myarr = numpy.random.rand(10,10)
    >>> out_dict = {}
    >>> out_dict['myarr']=myarr
    >>> scipy.io.savemat('filename.mat',out_dict)
    

Rolling your own text file I/O

  • What if you want to create your own file format?
  • DON’T.
  • Seriously.
  • Don’t.
[Image: card.png]
  • Use an existing, standardized, format.

  • If some jackass doesn’t heed this advice, you may be stuck trying to load their personal file format.

  • We’ll look at how to load text files “from scratch”. If the file is binary, the process is much more complex and may involve you having to reverse engineer the file structure (which may not even be possible).

  • First, you have to open the file:

    >>> infile = open('filename.txt','r')
    
  • filename.txt should be self-explanatory

  • 'r' means “open file for reading”. (Guess what 'w' means?)

  • infile is now a special type of variable that references a file on a disk.

  • We can load the file in, one line at a time, using a for loop:

    for line in infile:
       print line
    
  • Of course, you probably want to do something more interesting than just printing the line.

  • (Probably, you want to store the parts of the line into a data structure).

Activity

Make a text file with at least 5 lines of text, and several words per line.

Write a function to load the text file and print each line.

Now change the body of the for loop to print line.strip(). What happened?

How about print line.split()?

Finally: print line.strip().split().

  • When you’re done with the file, you should close it:

    >>> infile.close()
    
  • No really, if you don’t close files then sometimes the files can get messed up.
    • Typically a problem when writing to a file
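
  • One way to make forgetting close() impossible is a with block, which closes the file automatically when the block ends. The sketch below also stores the words of each line in a list instead of just printing them. The sample text and the temporary file are made up for the example.

```python
import os
import tempfile

# Hypothetical sample file, created here so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'filename.txt')
with open(path, 'w') as outfile:
    outfile.write('the quick brown fox\n')
    outfile.write('jumps over the lazy dog\n')

# Read it back, storing the words of each line in a list of lists.
words_per_line = []
with open(path, 'r') as infile:
    for line in infile:
        # strip() removes the trailing newline, split() breaks on whitespace
        words_per_line.append(line.strip().split())

print(words_per_line)
print(infile.closed)   # the with block has already closed the file
```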

Internet

  • Python also makes it easy to grab files directly from the internet.

  • In an earlier example, you had to download airports.csv to your local computer before you accessed it using Python (disk) File I/O.

  • Let’s skip that first step:

    >>> import urllib2
    >>> response = urllib2.urlopen('http://www.csd.uwo.ca/Courses/CS2120a/cs2120/data/airports.csv')
    >>> for line in response:
    ...       print line.strip()
    
  • So if the data you want to process is already available somewhere on the internet, you can access it directly.

  • Especially useful with data that frequently changes or is being collected on an ongoing basis.
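
  • A sketch of wrapping the steps above in a reusable function. The name fetch_lines is made up, and the code assumes the remote file is UTF-8 text. The import falls back to urllib2 so the same code runs under Python 2 and Python 3 (where urllib2 became urllib.request).

```python
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2, as in the notes above


def fetch_lines(url):
    """Return the lines of the resource at `url` as a list of stripped strings."""
    response = urlopen(url)
    try:
        lines = []
        for raw in response:
            # urlopen yields raw bytes; decode them, assuming UTF-8 text
            lines.append(raw.decode('utf-8').strip())
        return lines
    finally:
        # Always release the connection, even if decoding fails.
        response.close()
```

  • Calling fetch_lines with the airports.csv address used above should return the same lines that the loop prints, provided that URL is still being served.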

Activity

Using the urllib2 method above, load a file/website/whatever (anything on the internet) and print it out to the Python console.

If you loaded something more complex than a plain text file, what did you get? How might you make sense of it?

  • This is just the very tip of the iceberg. Any possible way you might want to interact with data on the internet... has almost certainly already been coded into Python or one of its libraries.
  • You know enough now to be able to read the docs and make it happen.

Activity

Figure out how you’d print the last 20 Tweets from the public Twitter timeline. Don’t worry about actually getting the code to run (you’d have to install an additional package), but figure out, roughly, what you’d need to do.