- **High-performance computing**
- **Parallel computing**
- **Stream processing**
- **Single Instruction, Multiple Data (SIMD or vector instructions)**
- **Instruction level parallelism**
- **Distributed computing**
- **Multi-core**
- **Multi-threading**
- **Hyper-threading**
- **Cycles per instruction**
- **CPU Cache**
- **Cache coherency**
- **Cache-oblivious algorithm**
- **Instruction pipeline**
- **Speculative execution**
- **GPU**
- **GP-GPU**
- **FPGA**
- **Silicon Valley**

- **MIT OpenCourseWare**
- **MIT 6.895 Fall 2003 by Charles E. Leiserson**
- MIT 6.172 Fall 2009 by Saman P. Amarasinghe and Charles E. Leiserson
- **MIT 18.337 Applied Parallel Computing Spring 2009 by Alan Edelman**
- **MIT 6.852 Distributed Algorithms by Nancy A. Lynch**
- **CMU 18-645 How to Write Fast Code (Spring 2008) by Markus Pueschel**
- **Drexel University CS 540 High Performance Computing by Jeremy Johnson**
- **Stony Brook University CSE 690 - General Purpose Computing on Graphics Hardware by Klaus Mueller**
- **U.C. Berkeley CS267/EngC233 Home Page by James Demmel**
- **CS 267, Applications of Parallel Computers by Jim Demmel (Univ. Berkeley)**
- **A LINPACK, LAPACK course by Jim Demmel (Univ. Berkeley)**
- **COMP 422 Parallel Computing, Spring 2008 at Rice University**
- **CS 402/CS 535: Parallel and Distributed Computing at UWO**
- **U. Illinois ECE 498 Programming Massively Parallel Processors by Wen-mei W. Hwu.**
- **U. Illinois Heterogeneous Parallel Programming by Wen-mei W. Hwu.**
- **Parallel Algorithms (WISM 459, 2005/2006) by Rob Bisseling**
- **Topics in Parallel Computing (CS 838, Univ. of Wisconsin-Madison, Spring 1999) by Pavel Tvrdik**
- **Parallel Algorithms (CS 662, San Diego Univ., Spring 1996) by Roger Whitney**
- **Computer Systems: A Programmer's Perspective (CS:APP) by Randal E. Bryant and David R. O'Hallaron**
- **Introduction to Parallel Programming (Univ. of Waterloo)**
- **CS Honours course 7933: Distributed and High-Performance Computing at the Univ. of Adelaide (Australia)**
- **Principles of Distributed Computing (FS 2010) (ETH, Zurich)**
- **CS262: Introduction to Distributed Computing at Harvard, Spring 2008**
- **An Introduction to Parallel and Distributed Computing at Middle East Technical University**
- **Introduction to Parallel and Distributed Computing (RISC, Austria)**
- **Parallel and Distributed Computing (Calcul parallèle et distribué) at École Polytechnique.**

- **The Wikipedia page of Cilk**
- **The Cilk Project at MIT**
- **The CilkPlus web site**
- **GCC, the GNU Compiler Collection**
- **gcc-4.7 (GNU compiler collection) packages for Debian Linux**
- **The Cilk-P runtime system for pipeline parallelism with CilkPlus**
- **Documentation of the Intel Cilk++ Software Development Kit**
- **Download Intel Cilk++ SDK**
- **Download the Cilk++ compiler from SourceForge**
- **Using CilkPlus in the MC10 lab at UWO.**
- **CUDA Developer Zone.**
- **CUDA Occupancy calculator.**
- **Kernel for Adaptative, Asynchronous Parallel and Interactive programming (KAAPI)**
- **The OpenMP Web site**
- **OpenMP Tutorial**
- **Getting Started with OpenMP.**
- **Pthreads page from Wikipedia**
- **PThread Tutorial**
- **MPICH: A Portable Implementation of MPI**
- **MPI Send modes.**
- **MPI Sendrecv.**
- **Development Tools for MPICH**

These software packages make use of auto-tuning techniques.

- **GDB tutorial**
- **Another GDB tutorial**
- **And another GDB tutorial**
- **Intel Cilkscreen Race Detector for Cilk++ programs**
- **Visualizing Parallel Speedup with Cilkview**
- **PIN: a tool for the dynamic instrumentation of programs**
- **Performance Application Programming Interface (PAPI)**
- **The Performance Counter Library: PCL**
- **Intel VTune Performance Analyzer**
- **Performance Engineering with Profiling Tools**
- **Linux profiling with Perf** with **simple examples.**

- **How To Write Fast Numerical Code: A Small Introduction by Srinivas Chellappa, Franz Franchetti, and Markus Pueschel.**
- **What Every Programmer Should Know About Memory by Ulrich Drepper**
- **Models of computation (book web site) by John E. Savage.**
- **Data Prefetch Mechanisms by Steven P. VanderWiel and David J. Lilja**
- **Challenges and Opportunities in Many-Core Computing by John L. Manferdelli, Naga K. Govindaraju, and Chris Crall.**

- **The Implementation of the Cilk-5 Multithreaded Language by Matteo Frigo, Charles E. Leiserson and Keith H. Randall.**
- **Thread Scheduling for Multiprogrammed Multiprocessors by Nimar S. Arora, Robert D. Blumofe and C. Greg Plaxton**
- **KAAPI: A thread scheduling runtime system for data flow computations on cluster of multi-processors by Thierry Gautier, Xavier Besseron, Laurent Pigeon.**
- **The Cilkview Scalability Analyzer by Yuxiong He, Charles E. Leiserson and William M. Leiserson.**
- **Identifying Performance Bottlenecks in Work-Stealing Computations by Nathan R. Tallent and John M. Mellor-Crummey.**
- **The Design of OpenMP Tasks by Eduard Ayguade, Nawal Copty, Alejandro Duran, Jay Hoeflinger, Yuan Lin, Federico Massaioli, Xavier Teruel, Priya Unnikrishnan, and Guansong Zhang.**
- **Scheduling Multithreaded Computations by Work Stealing by Charles E. Leiserson and Robert D. Blumofe.**

- **A Memory Model for Scientific Algorithms on Graphics Processors by Naga K. Govindaraju, Scott Larsen, Jim Gray and Dinesh Manocha.**
- **The FFT on a GPU by Kenneth Moreland and Edward Angel.**
- **Fitting FFT onto the G80 Architecture by Vasily Volkov and Brian Kazian.**
- **Cache and Bandwidth Aware Matrix Multiplication on the GPU by Jesse D. Hall, Nathan A. Carr and John C. Hart.**
- **High Performance Discrete Fourier Transforms on Graphics Processors by Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli.**
- **Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication by K. Fatahalian, J. Sugerman, and P. Hanrahan.**
- **Reducing Branch Divergence in GPU Programs by Tianyi David Han and Tarek S. Abdelrahman**
- **Designing efficient sorting algorithms for manycore GPUs by Nadathur Satish, Mark Harris and Michael Garland.**
- **Linear algebra operators for GPU implementation of numerical algorithms by Jens Krüger and Rüdiger Westermann.**
- **Simple Memory Machine Models for GPUs by Koji Nakano.**
- **Algorithmic Strategies for Optimizing the Parallel Reduction Primitive in CUDA by Pedro J. Martín, Luis F. Ayuso, Roberto Torres, Antonio Gavilanes.**

- **Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop and Sridhar Ramachandran**
- **Cache-Oblivious Algorithms and Data Structures by Erik D. Demaine.**
- **Reducers and Other Cilk++ Hyperobjects by Matteo Frigo, Pablo Halpern, Charles E. Leiserson and Stephen Lewin-Berlin.**
- **The memory behavior of cache oblivious stencil computations by Matteo Frigo and Volker Strumpen.**
- **The Cache Complexity of Multithreaded Cache Oblivious Algorithms by Matteo Frigo and Volker Strumpen.**
- **Cache-oblivious comparison-based algorithms on multisets by Arash Farzan, Paolo Ferragina, Gianni Franceschini and J. Ian Munro.**
- **A Consistency Architecture for Hierarchical Shared Caches by Edya Ladan-Mozes and Charles E. Leiserson.**
- **Analysing Cache Effects in Distribution Sorting by Naila Rahman and Rajeev Raman.**
- **Communication-optimal parallel algorithm for Strassen's matrix multiplication by Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz and Oded Schwartz.**

- **Processor-oblivious parallel stream computations by Julien Bernard, Jean-Louis Roch and Daouda Traore**
- **Systolic Arrays for Polynomial GCD Computations by Richard P. Brent and H. T. Kung**
- **I/O Complexity: The Red-Blue Pebble Game by Jia-Wei Hong and H. T. Kung.**
- **On Time versus Space by John Hopcroft, Wolfgang Paul and Leslie Valiant.**
- **A More Practical PRAM Model by Phillip B. Gibbons.**
- **Extending the Hong-Kung Model to Memory Hierarchies by John E. Savage.**
- **A Unified Model for Multicore Architectures by John E. Savage and Mohammad Zubair.**
- **A Bridging Model for Multi-Core Computing by Leslie G. Valiant.**

- **Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures by Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf and Katherine Yelick.**
- **SPIRAL: Code Generation for DSP Transforms by Markus Püschel, José M. F. Moura, Jeremy R. Johnson, David Padua, Manuela M. Veloso, Bryan W. Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen Voronenko, Kang Chen, Robert W. Johnson, and Nicholas Rizzolo.**
- **The Memory Behavior of Cache Oblivious Stencil Computations by Matteo Frigo and Volker Strumpen.**
- **Implementing FFTs in Practice by Steven G. Johnson and Matteo Frigo.**
- **A Work-Efficient Parallel Breadth-First Search Algorithm (or How to Cope with the Nondeterminism of Reducers) by Charles E. Leiserson and Tao B. Schardl.**
- **Output-sensitive decoding for redundant residue systems by Majid Khonji, Clément Pernet, Jean-Louis Roch, Thomas Roche and Thomas Stalinski.**
- **Some Linear-Time Algorithms for Systolic Arrays by Richard P. Brent, H. T. Kung and Franklin T. Luk.**
- **Automated Empirical Optimization of Software and the ATLAS project.**
- **The Design and Implementation of FFTW3 by Matteo Frigo and Steven G. Johnson.**

- **Understanding cache memories by Ulrich Drepper**
- **Automatic Performance Tuning by Jeremy Johnson**
- **Ten Ways to Waste a Parallel Computer by Kathy Yelick.**
- **Processor oblivious algorithms by Jean-Louis Roch.**
- **Designing ultra-fast algorithms by Jean-Louis Roch.**
- **Adaptive redundant residue system for cloud computing: a processor-oblivious and fault-oblivious technology by Jean-Louis Roch.**
- **Cilk++, Parallel Performance, and the Cilk Runtime System by John Mellor-Crummey.**
- **MPI intro lecture by Jon Eyolfson.**
- **MPI: Beyond the Basics by David McCaughan.**

- **The BSP page**
- **Basics of SIMD Programming.**
- **PC Assembly Language by Paul A. Carter.**
- **Notes on the Master Theorem**
- **Master Theorem: practice exercises with solutions.**
- **A Master Theorem for Recurrences.**
- **The Akra-Bazzi method.**
- **Notes on Better Master Theorems for Divide-and-Conquer Recurrences by Tom Leighton.**
- **Modern Convex Optimization by Arkadi Nemirovski**
- **An Idiot's Guide to C++ Templates - Part 1**
- **Basic C Programs : C Programs A-Z**
- **Optimizing C and C++ Code.**
- **Writing efficient C and C code optimization.**
- **C++ Programming Style Guidelines.**
- **A Cache Primer by Paul Genua.**
- **HPC Challenge Benchmark.**
- **Introduction to the Graph 500 List.**
- **Graph pebbling.**
- **Graph Pebbling Page.**
- **The Galet (Software) Page.**
- **Algorithm Design (course web site) by Kleinberg.**
- **Algorithm Design (book web site) by Kleinberg.**

- **42nd International Conference on Parallel Processing (ICPP 2013)**
- **41st International Conference on Parallel Processing (ICPP 2012)**
- **40th International Conference on Parallel Processing (ICPP 2011)**
- **39th International Conference on Parallel Processing (ICPP 2010)**
- **38th International Conference on Parallel Processing (ICPP 2009)**
- **37th International Conference on Parallel Processing (ICPP 2008)**
- **25th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2013)**
- **24th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2012)**
- **23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2011)**
- **22nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2010)**
- **21st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2009)**
- **20th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2008)**
- **Super Computing (SC 2010)**
- **IEEE International Parallel & Distributed Processing Symposium (IPDPS 2013)**
- **IEEE International Parallel & Distributed Processing Symposium (IPDPS 2012)**
- **IEEE International Parallel & Distributed Processing Symposium (IPDPS 2011)**
- **IEEE International Parallel & Distributed Processing Symposium (IPDPS 2010)**
- **Parallel and Distributed Computing, Applications and Technologies (PDCAT 2010)**
- **Parallel Computing 2013 (ParCo2013)**
- **SIAM Conference on Parallel Processing and Scientific Computing (PP10)**
- **High Performance Computing Symposium (HPCS 2013)**
- **High Performance Computing Symposium (HPCS 2012)**
- **High Performance Computing Symposium (HPCS 2011)**
- **High Performance Computing Symposium (HPCS 2010)**
- **High Performance Computing Symposium (HPCS 2009)**
- **ACM Workshop on Parallel Symbolic Computation (PASCO 2010)**
- **ACM Workshop on Parallel Symbolic Computation (PASCO 2007)**
- **Some Theoretical Computer Science Conferences**
- **MADALGO: Cache complexity summer school.**