This 5-day mini-course attempts to cover a variety of techniques for improving performance of multithreaded code on hardware accelerators, in particular Graphics Processing Units (GPUs) and multi-core architectures.
Depending on the attendee backgrouund, the first day will be either an introduction or a review of the notion of performance of computer programs, with a focus on memory traffic issues. In the same spirit, the second day will be either an introduction or a review of multithreaded programming on multi-core architectures, with an emphasis on the relations between data locality and parallelism. The third day will be dedicated to multithreaded programming on many-core GPUs. The fourth day is a lab practice where the attendees will have the opportunity to implement the techniques presented in the lectures. The fifth day, several advanced topics including models of computation and distributed computing will be presented. Corrections to the lab exercices will be distributed.
The audience is expected to have a basic knowledge of algorithms for scientific computing (linear algebra, sorting, Fast Fourier Transforms, etc.) and C programming. Some experience with multithreaded programming will be a plus but is not required. Each of the two lectures (Day 1, Day 2, Day 3 and Day 5) is meant to be independent of the other parts of the mini-course. However, the lab practice will rely on lectures of Day 1, Day 2 and Day 3.
The materials presented during the lectures and the lab practice will be available on the web at the time of the mini-course. This mini-course is based on tutorials:
Parallel hardware architectures (multicores, graphics processing units, etc.) and computer memory hierarchies (from processor registers to hard disks via successive cache memories) impose an evolution in the design of scientific software. Algorithms for scientific computation have often been designed with algebraic complexity as the main complexity measure and with serial running time as the main performance counter. On modern computers minimizing the number of arithmetic operations is no longer the primary factor affecting performance. Effective utilization of the underlying architecture (reducing pipeline stalls, increasing the amount of instruction level parallelism, better utilizing the memory hierarchy) can be much more important.
This tutorial is an introduction to the performance issues related to data locality and parallelism targeting multicore processors and Graphics Processing Units (GPUs) with a focus on the latter type of hardware accelerators. A first part will be dedicated to data locality independently of parallelism thus using serial algorithms and programs as illustration. In a (short) second part analyzing and optimizing multithreaded programs on multicores will be discussed. In the third part we will switch to GPU specific issues and then compare the implementation techniques of both architecture platforms.
The mini-course materials for DAYS 1 to 3:
We shall see through various examples, (considering multicore and GPU implementation) how performance bottlenecks can be identified (by complexity analysis or performance measurement) and resolved, when possible. In the multicore case, we will take advantage of comparative implementation thanks to the compilation framework of the MetaFork project. In the GPU case, we will take advantage of the MCM to optimize programs depending on parameters.
When a parallel program does not reach the expected performances, several causes can be involved: insufficient parallelism at the algorithm level, limitations to parallel execution at the architecture level, parallelization overheads at the level of the targeted concurrency platform. Parallelization overheads are inevitable and understanding these costs can help programmers design more scalable applications.
We present a model of multithreaded computation (MCM), combining fork-join and single-instruction-multiple-data parallelisms, with an emphasis on estimating parallelism overheads of programs written for modern many-core architectures. We establish a Graham-Brent theorem for this model so as to estimate execution time of programs running on a given number of streaming multiprocessors. We evaluate the benefits of our model with four fundamental algorithms from scientific computing. In each case, our model is used to minimize parallelism overheads by determining an appropriate value range for a given program parameter; moreover experimentation confirms the model's prediction.
The mini-course materials for DAY 5: