CS 2120: Class #10
==================

* What data structures have we learned about so far?
* We're going to learn about two more Python data structures before moving to some much more practical issues.

Tuples
^^^^^^^

* A tuple looks a lot like a list, but with ``()`` instead of ``[]``:

    >>> tup = (5,3)
    >>> print tup
    (5, 3)

* Handy for storing something like an (x,y) co-ordinate pair.
* If a tuple is exactly like a list, why would I use it? It must be different *somehow*.

.. admonition:: Activity

   Figure out how tuples differ from lists (other than using different types of brackets!).
   Some questions you might ask: Are tuples *mutable*? Do tuples have "built in functions"? (e.g., something like ``tup.max()``?)

* So why would you ever use a tuple instead of a list?
* Well, you don't have to. Almost anything you can do with tuples, you can do with lists. But, because they are immutable:

  * Tuples can be a little *faster*
  * Tuples prevent you from overwriting something that shouldn't be overwritten

Dictionaries
^^^^^^^^^^^^^

* Python dictionaries are a more complex data structure than what we've seen so far.
* But... they are *very useful*.
* Imagine a list which you can index with *strings* instead of *numbers*.
* That's a dictionary.
* Let's create an empty dictionary:

    >>> mydict = {}

* Looks like an empty list, but using ``{}`` instead of ``[]``.
* How do I add something?

    >>> mydict['James']=50
    >>> print mydict
    {'James': 50}

* The dictionary has associated the *key* ``James`` with the *value* ``50``.
* Maybe this is a dictionary of grades? I need to work harder.
* Let's add more:

    >>> mydict['Suzy'] = 95
    >>> mydict['Johnny'] = 85
    >>> print mydict
    {'Suzy': 95, 'Johnny': 85, 'James': 50}

* Dictionaries always associate a *key* with a *value*.
* ``dict[key] = value``

.. admonition:: Activity

   Build the dictionary ``mydict`` above. Figure out how to access the value associated with a particular key, without printing out the whole dictionary (e.g., how would I print just Suzy's grade?).
   *Hint*: it's a lot like indexing a list... What happens if I try to index the dictionary with a key that doesn't exist?

* Dictionaries are a *generalization* of lists:

  * A list of length ``n`` associates *fixed indices* from 0 up to ``n-1`` with values.
  * A dictionary associates *arbitrary strings* with values.

.. admonition:: Activity

   Now type ``mydict.`` and hit the [Tab] key. Play around with the built-in functions for dictionaries. Take special care to look at:

   * ``mydict.keys()``
   * ``mydict.values()``
   * ``mydict.has_key()``

* This is *really useful* for humans, because it's much easier for us to assign names to things than to try to remember arbitrary numberings.
* Many programming languages have nothing like dictionaries. In some others you'll see them called "associative arrays" or "associative memories".
* We've just scratched the surface of what you can do with dictionaries here, but it's enough for our purposes right now.

Getting data into Python
^^^^^^^^^^^^^^^^^^^^^^^^^^

* Probably the most important thing in Python.
* If you're going to pay attention only once this term... now's the time.
* You now know a lot about how to *manipulate* data.
* But as a working researcher, you want to manipulate *specific* data. *Your* data. Not toy examples.
* We need to learn about **File I/O**.
* I won't lie to you: file I/O is boring, painful, detail-oriented work.
* Fortunately, Python makes it less painful than just about any other language.
* It's worth the pain because once you combine these skills with your existing skills...

.. image:: ../img/real.jpeg

Loading a CSV file
^^^^^^^^^^^^^^^^^^^

* CSV stands for "Comma Separated Values".
* The file is stored in plain text (you can read it with a text editor).
* Delivers what it promises.
* Each line of the file is one item.
* Within the line each value associated with that item is in a comma-delimited field.
* At least it should be... Some annoying people use tabs, spaces, or other dumb characters.
* PLZ USE COMMAS
* wtf is the point of a *'standard'* if no one follows it?
* For example, suppose I have recorded height, weight and IQ for 3 subjects::

    name, height, weight, IQ
    Subject 1, 170, 68, 100
    Subject 2, 182, 80, 110
    Subject 3, 155, 54, 105

* The first line is a *header*, explaining the values in each field.
* Headers are *not* mandatory. Some CSVs have 'em, some don't.
* Good news: Python has a built-in library to read CSV files for you!
* In fact, we've seen this before::

    def load_asn1_data():
        """
        This function loads the file `starbucks.csv` and returns a LIST of
        latitudes and longitudes for North American Starbucks'.
        We'll talk about lists formally in class in a few lectures, but maybe you
        can start guessing how they work based on what you see here...
        """

        import csv

        reader = csv.reader(open('starbucks.csv', 'r'))
        locations = []

        for r in reader:
            locations.append( (r[0],r[1]))

        return locations

* What does ``open`` do? What does ``'r'`` mean?
* HEY! My thing is saying: *something, something, something, newline*!

  * Change ``'r'`` to ``'rU'``

* How does the ``csv.reader`` work?

.. admonition:: Activity+++

   Have a look at the function ``load_asn1_data()`` from Assignment 1. That function loads a CSV file. Figure out how it works.
   Download the ``airports.csv`` file to your computer. Now write a function called ``load_airports()`` that loads this CSV file into a list. Play with this list a bit and get a feel for how the data is organized.

.. admonition:: Activity+++

   Now write a function ``get_name_from_code(airportCode,airportlist)`` that will return a string containing the full name of the airport with the code ``airportCode``. The parameter ``airportlist`` should be the list you loaded using ``load_airports()``.

* Why such hard activities??
* Because learning to load in data is *critical* to your ability to apply what you've learned to real world situations.
* Programming is pretty boring if you can't ever apply it to real problems and data.
* Suppose you have some tabular data in Python that you want to save back in to a CSV:

    >>> import csv
    >>> outfile = open('yourfilename', 'w')
    >>> csvout = csv.writer(outfile)
    >>> csvout.writerow(['First cell','Second cell', 'Third cell'])
    >>> # write as many rows as you need to... maybe in a loop?
    >>> outfile.close()

* Note that you close the *file*, not the writer: ``csv.writer`` objects have no ``close()`` method.
* CSV files are popular because they're simple.
* You can, e.g., export any Excel spreadsheet as a CSV.
* If you have tabular data, this is a decent choice of format.
* If you don't have tabular data... this is an awful choice.

MATLAB
^^^^^^^

* Some of your less *enlightened* colleagues might use MATLAB instead of Python.
* You need to be able to trade files back and forth with MATLAB users.
* MATLAB's default file format is ``.mat`` and it's reasonably complex.
* We'll look here at a simple example.
* Download and save the ``macaque47.mat`` file, which describes connectivity between different parts of the macaque brain.
* We're *too cool* for MATLAB, so how do we open it?
* Fortunately for us, SciPy has tools to load, and save, MATLAB files built in.
* So... let's open it:

    >>> import scipy.io
    >>> scipy.io.loadmat('macaque47.mat')

* Whoa... what was all that stuff???
* Ah! A dictionary!!! Remember: think of them as a special list that can be indexed with *strings* instead of numbers.
* We should probably store that in a variable:

    >>> a = scipy.io.loadmat('macaque47.mat')

* Important difference with CSV files:

  * The CSV file stored *exactly one* table. That's it.
  * The ``.mat`` file stores a *whole MATLAB workspace* which might have *many* (MATLAB) variables in it!

* So how do we tell Python which of the MATLAB variables we're interested in? By accessing the dictionary with the name of the variable (which will be the same as whatever the MATLAB user called it).
* So if I want the values of the MATLAB variable ``CIJ``::

    >>> a['CIJ']
    ...
* What if I want a list of the different MATLAB variables available in this dictionary?

    >>> a.keys()
    ['CIJ', '__version__', '__header__', 'Names', '__globals__']

* Let's extract the brain connectivity matrix from the ``.mat`` file and store it in its own variable.

    >>> brain = a['CIJ']

.. admonition:: Activity

   What *type* is ``brain``? Has Python converted it to a Python/NumPy type for you?
   How would you *visualize* this matrix? (*Hint*: ``matshow(...)``).
   Go back and extract the MATLAB variable named ``Names`` from the dictionary ``a``.

* If you need to share your results with someone who uses MATLAB, you can use ``scipy.io.savemat()``.
* Remember, ``.mat`` files can store a *whole MATLAB workspace*, not just one array.
* We simulate that workspace with a dictionary.

  * keys are variable names
  * values are the values of the variable

* for example...

    >>> import numpy
    >>> import scipy.io
    >>> myarr = numpy.random.rand(10,10)
    >>> out_dict = {}
    >>> out_dict['myarr']=myarr
    >>> scipy.io.savemat('filename.mat',out_dict)

Rolling your own text file I/O
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* What if you want to create your own file format?
* **DON'T**.
* Seriously.
* Don't.

.. image:: ../img/card.png

* Use an existing, standardized, format.
* If some jackass doesn't heed this advice, you may be stuck trying to load their personal file format.
* We'll look at how to load text files "from scratch". If the file is binary, the process is much more complex and may involve you having to reverse engineer the file structure (which may not even be possible).
* First, you have to open the file:

    >>> infile = open('filename.txt','r')

* ``filename.txt`` should be self-explanatory
* ``'r'`` means "open the file for reading". (Guess what ``'w'`` means?)
* ``infile`` is now a special type of variable that references a file on a disk.
* We can load the file in, one line at a time, using a ``for`` loop::

    for line in infile:
        print line

* Of course, you probably want to do something more interesting than just printing the line.
* (Probably, you want to store the parts of the line into a data structure).

.. admonition:: Activity

   Make a text file with at least 5 lines of text, and several words per line. Write a function to load the text file and print each line.
   Now change the body of the ``for`` loop to ``print line.strip()``. What happened? How about ``print line.split()``? Finally: ``print line.strip().split()``.

* When you're done with the file, you should close it:

    >>> infile.close()

* No really, if you don't close files then sometimes the files can get messed up: data you thought you wrote may never actually reach the disk.

  * Typically a problem when writing to a file

Internet
^^^^^^^^^

* Python also makes it easy to grab files directly from the internet.
* In an earlier example, you had to download ``airports.csv`` to your local computer before you accessed it using Python (disk) File I/O.
* Let's skip that first step:

    >>> import urllib2
    >>> response = urllib2.urlopen('http://www.csd.uwo.ca/Courses/CS2120a/cs2120/data/airports.csv')
    >>> for line in response:
    ...     print line.strip()

* So if the data you want to process is already available somewhere on the internet, you can access it directly.
* Especially useful with data that frequently changes or is being collected on an ongoing basis.

.. admonition:: Activity

   Using the ``urllib2`` method above, load a file/website/whatever (anything on the internet) and print it out to the Python console.
   If you loaded something more complex than a plain text file, what did you get? How might you make sense of it?

* This is just the very tip of the iceberg. Any possible way you might want to interact with data on the internet... is almost certainly already coded in to Python.
* You know enough now to be able to *read the docs* and *make it happen*.

.. admonition:: Activity

   Figure out how you'd print the last 20 Tweets from the public Twitter timeline. Don't worry about actually getting the code to run (you'd have to install an additional package), but figure out, roughly, what you'd need to do.
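
To tie the pieces of this class together, here is a minimal sketch (not code from the course) that reads the small height/weight/IQ table from the CSV section into a dictionary keyed by subject name, storing each subject's numbers as a tuple. It is written for Python 3, whereas the examples above use Python 2 ``print`` statements. The function name ``load_subjects``, the ``CSV_TEXT`` string, and the use of ``io.StringIO`` in place of a real file are all assumptions made so the example runs on its own.

```python
import csv
import io

# The sample table from the CSV section, held in a string so the sketch
# runs without any file on disk (an assumption made for illustration).
CSV_TEXT = """name, height, weight, IQ
Subject 1, 170, 68, 100
Subject 2, 182, 80, 110
Subject 3, 155, 54, 105
"""


def load_subjects(csvfile):
    """Return a dict mapping each subject name to a (height, weight, IQ) tuple."""
    # skipinitialspace drops the blank that follows each comma in this table.
    reader = csv.reader(csvfile, skipinitialspace=True)
    next(reader)  # skip the header row
    subjects = {}
    for row in reader:
        name = row[0]
        # csv hands back strings, so convert the numeric fields explicitly.
        subjects[name] = tuple(int(field) for field in row[1:])
    return subjects


subjects = load_subjects(io.StringIO(CSV_TEXT))
print(subjects['Subject 2'])  # prints (182, 80, 110)
```

To use it on a real file, you would pass an open file object instead of the ``io.StringIO`` wrapper. Storing each row as a tuple is a design choice, not a requirement: it signals that a subject's measurements are not meant to be changed once loaded.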