CS 2120: Class #11

Plotting with Python

  • Now that you know how to load and manipulate data, we’re going to spend some time learning how to visualize our data (and the results of processing it).

  • If you’re running Python from the command line, try this:

    % ipython --pylab
    
  • otherwise, make sure you do this before trying anything below:

    >>> from pylab import *
    
  • Anyone remember what that’s doing?

  • If you’re like me and like the namespace, do this:

    >>> import matplotlib.pylab as plt
    
  • If you do this, be sure to put plt. in front of every plotting call.
    • Ex. plt.plot(x) or plt.clf() or plt.title('something')
  • Before we start: If you need to clear your plot at any time, just type:

    >>> clf()
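Here's a minimal sketch of the namespaced style in a plain script, assuming matplotlib is installed. It imports matplotlib.pyplot, which provides the same plot(), title() and clf() functions; the values list is made up, and the Agg backend line is only there so the sketch runs without a display (drop it in an interactive session).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit when working interactively
import matplotlib.pyplot as plt

values = [1, 4, 9, 16]             # made-up data, just to have something to draw

plt.plot(values)                   # every plotting call carries the plt. prefix
plt.title('something')
n_lines = len(plt.gca().lines)     # one line has been drawn on the current axes

plt.clf()                          # clear the figure
n_axes = len(plt.gcf().axes)       # the cleared figure has no axes left
print(n_lines, n_axes)
```

With from pylab import * the same calls work without the plt. prefix, at the cost of dumping a lot of names into your namespace.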
    

Let’s get some data

Activity

Download this Google Trends CSV.

  • Each row is a week, starting in 2004 and running up to 2012-ish.
  • The columns are the search terms ‘vampire’, ‘zombie’, ‘flu’, ‘ice cream’
  • The numbers are “search volume index” (normalized to ‘flu’).

Now get this data into Python!

  • Open a csv.reader (look at last class’ notes)
  • Read each row into a list

You should now have a ‘list of lists’. Convert it to a NumPy array of floats, and switch rows for columns. e.g., if it’s in the variable ‘data’, do this:

>>> import numpy
>>> data = numpy.array(data).astype(float).transpose()

Make sure you understand how that line works!
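The steps above can be sketched end to end as follows. The numbers are made up, and an in-memory io.StringIO stands in for the downloaded file (with the real thing you would pass an opened file to csv.reader instead). This also assumes the file has no header row; if yours does, skip that row before converting.

```python
import csv
import io

import numpy

# Stand-in for the downloaded CSV: 3 weeks (rows) x 4 search terms (columns).
# These values are invented for illustration.
fake_csv = io.StringIO("10,5,20,3\n12,6,18,4\n11,7,25,5\n")

data = []
for row in csv.reader(fake_csv):
    data.append(row)               # each row is a list of strings

# list of lists -> array of strings -> array of floats -> rows swapped with columns
data = numpy.array(data).astype(float).transpose()

print(data.shape)                  # one row per search term now
print(data[0])                     # the first column of the file, as a 1D array
```

After the transpose, data[0] holds every week's value for the first search term, which is what makes plot(data[0]) convenient later on.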

Simple plots

  • We can do a simple line plot of 1D data with the plot() command.

  • Try this:

    >>> plot(data[0])
    
  • Or, if you’re like me and use import matplotlib.pylab as plt, try this:

    >>> plt.plot(data[0])
    

Activity

What did we just plot? How could you do a similar plot for the popularity of the search term ‘zombie’?

Can you plot both the search volumes for ‘vampire’ and ‘zombie’ on the same graph?

Activity

Experiment with the following commands. What do they do to your plot?

  • grid()
  • xlabel('This is a label!')
  • ylabel('Another label!')
  • title('My title')
  • axvline(100)

Save your plot to disk as an image.

  • There are a crazy number of options that you can pass to plot(). Like these:

    >>> plot(data[0],':')
    >>> plot(data[1],'--')
    >>> plot(data[2],'r--')
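As a rough sketch of how format strings, labels and saving fit together, here is a script that draws two made-up series on one set of axes. The series, the label text and the output file name are all hypothetical; savefig() is the matplotlib call that writes the current figure to disk, and the Agg backend line is only there so this runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit when working interactively
import matplotlib.pyplot as plt
import numpy

weeks = numpy.arange(52)
series_a = 50 + 10 * numpy.sin(weeks / 8.0)   # made-up wavy series
series_b = 30 + 0.5 * weeks                   # made-up rising series

plt.clf()
plt.plot(series_a, ':', label='series a')     # dotted line, default colour
plt.plot(series_b, 'r--', label='series b')   # red dashed line
plt.xlabel('week')
plt.ylabel('search volume index')
plt.grid()
plt.legend()                                  # uses the label= text given above

# Hypothetical file name; written to the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), 'two_series.png')
plt.savefig(out_path)
```

Calling plot() twice without a clf() in between is what puts both lines on the same graph.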
    

Activity

Plot search volume for ‘flu’ (data[2]) against ‘ice cream’ (data[3]).
  • Don’t forget about clf()

Use different line types for the two plots. Use the ‘zoom tool’ to magnify the portion of the graph below y==20.

See any trends worth noting? Visual inspection is a power tool for data analysis.

  • I wonder if any of our keywords have search volumes that are linearly related to each other?

  • Pearson Correlation is a good way to check this.

  • We could compute r-values, for each pair, like this:

    >>> import scipy.stats
    >>> scipy.stats.pearsonr(data[1],data[0])
    (0.7604487911797595, 1.0173257365818087e-87)
    ...
    
  • Or we could be lazy, and compute the full correlation matrix with one command:

    >>> cor = numpy.corrcoef(data)
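To see how the two approaches relate, here is a small sketch on made-up data (three synthetic series, not the Google Trends file). Entry [i, j] of the matrix from numpy.corrcoef should match the r-value that scipy.stats.pearsonr gives for rows i and j.

```python
import numpy
import scipy.stats

rng = numpy.random.default_rng(0)       # seeded so the sketch is repeatable
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(size=200)      # built from a, so strongly related to it
c = rng.normal(size=200)                # independent of both
rows = numpy.array([a, b, c])           # one series per row, like data

cor = numpy.corrcoef(rows)              # 3x3 matrix of pairwise r-values
r_ab, p_ab = scipy.stats.pearsonr(a, b)

print(cor.shape)
print(numpy.isclose(cor[0, 1], r_ab))   # same r-value, computed two ways
```

The matrix is symmetric with ones on the diagonal, since every series is perfectly correlated with itself. Note that corrcoef gives only the r-values; pearsonr also returns a p-value.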
    

Activity

Build the correlation matrix for data. Look at it. What does it tell you?

2D Plots

  • Let’s look at our correlation matrix visually.

    >>> matshow(cor)
    
  • Each square is one entry in the 2D array. Pretty intuitive.

  • We can change colour schemes, too. E.g.:

    >>> gray()
    >>> hot()
    
  • And, if the axis labels are annoying us, or we need a colour scale:

    >>> axis('off')
    >>> colorbar()
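Put together in a script, those commands might look like the sketch below. The matrix comes from made-up random rows rather than the class data, and the colour scheme is chosen with the cmap keyword instead of calling hot() afterwards; both choices are just for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit when working interactively
import matplotlib.pyplot as plt
import numpy

rng = numpy.random.default_rng(1)
rows = rng.normal(size=(4, 100))        # four made-up series
cor = numpy.corrcoef(rows)              # 4x4 correlation matrix

plt.matshow(cor, cmap='hot')            # one coloured square per matrix entry
plt.axis('off')                         # hide ticks and labels on the matrix axes
plt.colorbar()                          # add a colour scale next to the matrix
fig = plt.gcf()
```

The colour bar gets its own axes, so the figure ends up with two: one for the matrix and one for the scale.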
    

Activity

Start with a bigger array: r = numpy.random.rand(50,50). Plot this array, using matshow with a colour bar and no axis labels. What happens if you use imshow instead of matshow? (Try zooming WAAAY in).

Histograms and boxplots

  • Sometimes you want to see the distribution of the values in your data, rather than the values themselves.

  • Consider these data:

    >>> u = numpy.random.rand(1000)
    >>> g = numpy.random.normal(size=1000)
    
  • If I just plot them, what intuitions do I get? (Assume I don’t know where they came from!)

    >>> plot(u)
    >>> plot(g)
    
  • What about if I plot the distributions of values in u and g?

    >>> hist(u)
    >>> hist(g)
    
  • As usual, hist() has a lot of options.
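Two of those options, bins and range, are sketched below on made-up uniform data. The sketch also shows that hist() hands back the bin counts and bin edges, which is handy when you want the numbers and not just the picture. The Agg backend line is only there so this runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit when working interactively
import matplotlib.pyplot as plt
import numpy

rng = numpy.random.default_rng(2)
u = rng.random(1000)                    # 1000 made-up values in [0, 1)

plt.clf()
# Ask for 5 equal-width bins spanning 0 to 1.
counts, edges, patches = plt.hist(u, bins=5, range=(0.0, 1.0))

print(counts)                           # how many values fell in each bin
print(edges)                            # 6 edges delimit 5 bins
```

bins also accepts an explicit list of bin edges if you want to choose the boundaries yourself.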

Activity

Plot a histogram of the data in g, with bins from -2 to -1, -1 to 0, 0 to 1 and 1 to 2.

Plot a cumulative histogram of the data in g (with the default automatically chosen bins) and u. How do they differ?

  • Let’s create 3 fake sets of experimental data:

    >>> d1 = numpy.random.normal(0,10,size=1000)
    >>> d2 = numpy.random.normal(5,10,size=1000)
    >>> d3 = numpy.random.poisson(size=1000)
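One possible way to look at three samples like these side by side is a boxplot. This is only a sketch: it regenerates similar fake data with a seeded generator (so the values differ from yours), and it assumes boxplot() is the kind of summary you want.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit when working interactively
import matplotlib.pyplot as plt
import numpy

rng = numpy.random.default_rng(3)
d1 = rng.normal(0, 10, size=1000)       # mean 0, standard deviation 10
d2 = rng.normal(5, 10, size=1000)       # mean 5, standard deviation 10
d3 = rng.poisson(size=1000)             # Poisson counts with the default rate of 1

plt.clf()
# One box per sample; boxplot() returns a dict of the artists it drew.
result = plt.boxplot([d1, d2, d3])
print(len(result['boxes']))
```

Each box summarizes one sample (median, quartiles, outliers), so differences in centre and spread show up at a glance.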
    

Activity

Compare the histograms of d1, d2 and d3.

Scatter plots

  • Earlier, we used Pearson correlation to investigate relationships in time series data.

  • A more visual way to investigate this is with a scatter plot:

    >>> scatter(d1,d2)
    
  • For every pair of datapoints (one value from d1, the matching value from d2), we just plot them as if they were the (x,y) coordinates of a point.

  • Let’s fake some correlated data:

    >>> d4 = d2 + 1.0 + numpy.random.normal(1,2,size=1000)
    
    • d4 = d2 + a constant offset + some noise
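To get a feel for how the amount of noise changes what a scatter plot looks like, here is a sketch with two made-up noise levels. The variable names and the noise sizes are hypothetical; numpy.corrcoef is used to pull out the r-value for each pair.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit when working interactively
import matplotlib.pyplot as plt
import numpy

rng = numpy.random.default_rng(4)
x = rng.normal(5, 10, size=1000)

y_low = x + 1.0 + rng.normal(0, 2, size=1000)     # offset plus a little noise
y_high = x + 1.0 + rng.normal(0, 30, size=1000)   # offset plus a lot of noise

plt.clf()
plt.scatter(x, y_high, s=5, label='lots of noise')
plt.scatter(x, y_low, s=5, label='a little noise')
plt.legend()

r_low = numpy.corrcoef(x, y_low)[0, 1]    # r-value for the tight cloud
r_high = numpy.corrcoef(x, y_high)[0, 1]  # r-value for the diffuse cloud
print(r_low, r_high)
```

The low-noise points hug a straight line and give an r-value near 1; the high-noise points form a much rounder cloud with a smaller r-value.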

Activity

Scatterplot d2 against d1.

Now scatterplot d2 against d4.

What conclusions can you draw? Back up your conclusions with scipy.stats.pearsonr() on both pairs.

Onward

  • We’ve barely even scratched the surface of the surface of what’s available with Python.

  • The types of plots that are of interest to you will depend heavily on what your needs are.

  • You’ve now got the fundamentals to go forth and steal examples wholesale from the internet.

  • Yes, I’m advocating this methodology for practical visualization:
    • Find an existing visualization in Python that looks close to what you want
    • Get the code
    • Spend some time figuring out how it works
    • Modify it to suit your purposes
    • PROFIT!!!
  • This kleptoprogramming approach is enabled nicely by the Python community’s strong tradition of publishing source.

  • Good places to steal ideas (and code) from:

Activity

Pick an attractive looking plot from one of the galleries above.

Get the code for the plot working on your machine (100% cut and paste).

Now modify the code to visualize one of the variables we worked with in class today.