CS 2120: Class #11
==================

Plotting with Python
^^^^^^^^^^^^^^^^^^^^^

* Now that you know how to load and manipulate data, we're going to spend some time learning how to *visualize* our data (and the results of processing it).
* If you're running Python from the command line, try this::

    % ipython --pylab

* Otherwise, make sure you do this before trying anything below:

    >>> from pylab import *

* Anyone remember what that's doing?
* If you're like me and like the namespace, do this:

    >>> import matplotlib.pylab as plt

* Just, if you do this, be sure to put ``plt.`` in front of every plotting call.

    * Ex. ``plt.plot(x)`` or ``plt.clf()`` or ``plt.title('something')``

* Before we start: If you need to clear your plot at any time, just type:

    >>> clf()

Let's get some data
^^^^^^^^^^^^^^^^^^^^

.. admonition:: Activity

    Download this `Google Trends CSV `_ .

    * Each row is a week, starting in 2004 and running up to 2012-ish
    * The columns are the search terms 'vampire', 'zombie', 'flu', 'ice cream'
    * The numbers are "search volume index" (normalized to 'flu').

    Now get this data into Python!

    * Open a ``csv.reader`` (look at last class' notes)
    * Read each row into a list

    You should now have a 'list of lists'. Convert it to a NumPy array of floats, and switch rows for columns.
    e.g., if it's in the variable 'data', do this:

    >>> import numpy
    >>> data = numpy.array(data).astype(float).transpose()

    Make sure you understand how that line works!

Simple plots
^^^^^^^^^^^^^

* We can do a simple line plot of 1D data with the ``plot()`` command.
* Try this:

    >>> plot(data[0])

* OR, try this if you're like me and ``import matplotlib.pylab as plt`` :

    >>> plt.plot(data[0])

.. admonition:: Activity

    What did we just plot?

    How could you do a similar plot for the popularity of the search term 'zombie'?

    Can you plot both the search volumes for 'vampire' and 'zombie' on the same graph?

.. admonition:: Activity

    Experiment with the following commands. What do they do to your plot?

        * ``grid()``
        * ``xlabel('This is a label!')``
        * ``ylabel('Another label!')``
        * ``title('My title')``
        * ``axvline(100)``

    Save your plot to disk as an image.

* There are a `crazy number of options `_ that you can pass to ``plot()``. Like these:

    >>> plot(data[0],':')
    >>> plot(data[1],'--')
    >>> plot(data[2],'r--')

.. admonition:: Activity

    Plot search volume for 'flu' ( ``data[2]`` ) against 'ice cream' ( ``data[3]`` ).

        * Don't forget about ``clf()``

    Use different line types for the two plots.

    Use the 'zoom tool' to magnify the portion of the graph below ``y==20``. See any trends worth noting?

    Visual inspection is a powerful tool for data analysis.

* I wonder if any of our keywords have search volumes that are linearly related to each other?
* `Pearson Correlation `_ is a good way to check this.
* We could compute r-values, for each pair, like this:

    >>> import scipy.stats
    >>> scipy.stats.pearsonr(data[1],data[0])
    (0.7604487911797595, 1.0173257365818087e-87)
    ...

* Or we could be lazy, and compute the full correlation matrix with one command:

    >>> cor = numpy.corrcoef(data)

.. admonition:: Activity

    Build the correlation matrix for ``data``. Look at it. What does it tell you?

2D Plots
^^^^^^^^^

* Let's look at our correlation matrix visually.

    >>> matshow(cor)

* Each square is one entry in the 2D array. Pretty intuitive.
* We can change colour schemes, too. E.g.:

    >>> gray()
    >>> hot()

* And, if the axis labels are annoying us, or we need a colour scale:

    >>> axis('off')
    >>> colorbar()

.. admonition:: Activity

    Start with a bigger array: ``r = numpy.random.rand(50,50)``.

    Plot this array, using ``matshow`` with a colour bar and no axis labels.

    What happens if you use ``imshow`` instead of ``matshow``? (Try zooming WAAAY in).

.. NOTE FROM JAMES --- DON'T DO THIS. PEOPLE WITH WINDOWS WILL BE ANGRY
.. It actually messed up python on a bunch of PCs so i'd avoid this... maybe briefly mention it?

.. 3D plots
.. ^^^^^^^^^
.. * The `Mayavi `_ package makes 3D visualization in Python a snap.
.. * we actually need to install this package. Type "conda install mayavi" into the terminal
.. * Try this:
..     >>> from mayavi import mlab
..     >>> mlab.figure()
..     >>> mlab.surf(r)
..     >>> mlab.axes()
..     >>> mlab.clf()
..     >>> mlab.barchart(data)
.. * The viewer is interactive. You can rotate and interact with the 3D visualization.
.. * Mayavi is incredibly powerful. If you need to do 3D visualizations in Python, it's worth learning.
.. * Have a look at the `gallery `_ to get a sense of what it can do.
.. * Don't despair about the complexity... note that all the gallery examples *come with the code that generated them*!

Histograms and boxplots
^^^^^^^^^^^^^^^^^^^^^^^^

* Sometimes you want to see the *distribution* of the values in your data, rather than the values themselves.
* Consider these data:

    >>> u = numpy.random.rand(1000)
    >>> g = numpy.random.normal(size=1000)

* If I just plot them, what intuitions do I get? (Assume I don't know where they came from!)

    >>> plot(u)
    >>> plot(g)

* What about if I plot the *distributions* of values in ``u`` and ``g``?

    >>> hist(u)
    >>> hist(g)

* As usual, ``hist()`` has `a lot of options `_ .

.. admonition:: Activity

    Plot a histogram of the data in ``g``, with bins from -2 to -1, -1 to 0, 0 to 1 and 1 to 2.

    Plot *cumulative* histograms of the data in ``g`` and ``u`` (with the default, automatically chosen bins). How do they differ?

* Let's create 3 fake sets of experimental data:

    >>> d1 = numpy.random.normal(0,10,size=1000)
    >>> d2 = numpy.random.normal(5,10,size=1000)
    >>> d3 = numpy.random.poisson(size=1000)

.. admonition:: Activity

    Compare the histograms of d1, d2 and d3.

.. * That works, but maybe boxplots would make side-by-side comparison easier?
..     >>> boxplot([d1,d2,d3])

Scatter plots
^^^^^^^^^^^^^^

* Earlier, we used Pearson correlation to investigate relationships in time series data.
* A more visual way to investigate this is with a *scatter plot*:

    >>> scatter(d1,d2)

* For every pair of datapoints (d1,d2)... we just plot them as if they were the (x,y) co-ordinates of a point.
* Let's fake some correlated data:

    >>> d4 = d2 + 1.0 + numpy.random.normal(1,2,size=1000)

* d4 = d2 + a constant offset + some noise

.. admonition:: Activity

    Scatterplot ``d2`` against ``d1``.

    Now scatterplot ``d2`` against ``d4``.

    What conclusions can you draw?

    Back up your conclusions with ``scipy.stats.pearsonr()`` on both pairs.

.. Linear regression / Curve fitting

Onward
^^^^^^^

* We've barely even scratched the surface of the surface of what's available with Python.
* The types of plots that are of interest to you will depend heavily on what your needs are.
* You've now got the fundamentals to go forth and *steal examples wholesale from the internet*.
* Yes, I'm advocating this methodology for practical visualization:

    * Find an existing visualization in Python that looks close to what you want
    * Get the code
    * Spend some time figuring out how it works
    * Modify it to suit your purposes
    * PROFIT!!!

* This kleptoprogramming approach is enabled nicely by the Python community's strong tradition of publishing source.
* Good places to steal ideas (and code) from:

    * `Matplotlib gallery `_ (click the picture to get the code!)
    * `Matplotlib cookbook `_
    * `Mayavi gallery `_
    * `Scipy cookbook `_ (look under "Graphics")

.. admonition:: Activity

    Pick an attractive looking plot from one of the galleries above.

    Get the code for the plot working on your machine (100% cut and paste).

    Now modify the code to visualize one of the variables we worked with in class today.
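
If you want a starting point that isn't stolen from a gallery, here is a minimal sketch of the scatter-plot workflow from this class, written as a standalone script. It makes a few assumptions that are not in the notes above: it imports ``matplotlib.pyplot`` as ``plt`` (rather than ``pylab``), picks the non-interactive ``Agg`` backend so it runs without a display, uses a fixed random seed, gets the r-value from ``numpy.corrcoef`` instead of ``scipy.stats.pearsonr``, and saves to a made-up file name, ``scatter_d2_d4.png``.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no plot window is needed
import matplotlib.pyplot as plt
import numpy as np

# Assumption: a fixed seed so the fake data is the same on every run.
rng = np.random.default_rng(0)

# Fake "experimental" data, mirroring d2 and d4 from the notes:
# d4 = d2 + a constant offset + some noise
d2 = rng.normal(5, 10, size=1000)
d4 = d2 + 1.0 + rng.normal(1, 2, size=1000)

# corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is the Pearson r-value between d2 and d4.
r = np.corrcoef(d2, d4)[0, 1]

# Scatter plot with labels, a title and a grid, using explicit names
# instead of the bare pylab commands.
fig, ax = plt.subplots()
ax.scatter(d2, d4, s=5)
ax.set_xlabel("d2")
ax.set_ylabel("d4")
ax.set_title(f"d2 vs d4 (Pearson r = {r:.2f})")
ax.grid(True)

# Hypothetical output file name; change it to whatever you like.
fig.savefig("scatter_d2_d4.png")
```

To try the same thing on the Google Trends series, swap ``d2`` and ``d4`` for two rows of ``data`` (say ``data[0]`` and ``data[1]``) and update the labels.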