Mini-course on optimizing multithreaded algorithms and code targeting hardware accelerators

Presenter

Marc Moreno Maza (University of Western Ontario)

Aim and Scope

This 5-day mini-course attempts to cover a variety of techniques for improving performance of multithreaded code on hardware accelerators, in particular Graphics Processing Units (GPUs) and multi-core architectures.

Depending on the attendee backgrouund, the first day will be either an introduction or a review of the notion of performance of computer programs, with a focus on memory traffic issues. In the same spirit, the second day will be either an introduction or a review of multithreaded programming on multi-core architectures, with an emphasis on the relations between data locality and parallelism. The third day will be dedicated to multithreaded programming on many-core GPUs. The fourth day is a lab practice where the attendees will have the opportunity to implement the techniques presented in the lectures. The fifth day, several advanced topics including models of computation and distributed computing will be presented. Corrections to the lab exercices will be distributed.

The audience is expected to have a basic knowledge of algorithms for scientific computing (linear algebra, sorting, Fast Fourier Transforms, etc.) and C programming. Some experience with multithreaded programming will be a plus but is not required. Each of the two lectures (Day 1, Day 2, Day 3 and Day 5) is meant to be independent of the other parts of the mini-course. However, the lab practice will rely on lectures of Day 1, Day 2 and Day 3.

The materials presented during the lectures and the lab practice will be available on the web at the time of the mini-course. This mini-course is based on tutorials:

as well as courses taught by the presenters at the University of Western Ontario:

Days 1 to 3: Optimizing algorithms and code for data locality and parallelism

Parallel hardware architectures (multicores, graphics processing units, etc.) and computer memory hierarchies (from processor registers to hard disks via successive cache memories) impose an evolution in the design of scientific software. Algorithms for scientific computation have often been designed with algebraic complexity as the main complexity measure and with serial running time as the main performance counter. On modern computers minimizing the number of arithmetic operations is no longer the primary factor affecting performance. Effective utilization of the underlying architecture (reducing pipeline stalls, increasing the amount of instruction level parallelism, better utilizing the memory hierarchy) can be much more important.

This tutorial is an introduction to the performance issues related to data locality and parallelism targeting multicore processors and Graphics Processing Units (GPUs) with a focus on the latter type of hardware accelerators. A first part will be dedicated to data locality independently of parallelism thus using serial algorithms and programs as illustration. In a (short) second part analyzing and optimizing multithreaded programs on multicores will be discussed. In the third part we will switch to GPU specific issues and then compare the implementation techniques of both architecture platforms.

The mini-course materials for DAYS 1 to 3:

Day 4: Optimizing multithreaded algorithms and code. Hands on!

We shall see through various examples, (considering multicore and GPU implementation) how performance bottlenecks can be identified (by complexity analysis or performance measurement) and resolved, when possible. In the multicore case, we will take advantage of comparative implementation thanks to the compilation framework of the MetaFork project. In the GPU case, we will take advantage of the MCM to optimize programs depending on parameters.

Exercices

Exercises 0 to 1:
See the README in the tarball of C programs.
Exercises 2 to 7:
See the README in the tarball of Cilk programs.
Exercises 8 to 10:
See the README in the tarball of CUDA programs.
Exercises 11 to 14:
See the README in the tarball of Julia programs.

Day 5: A Many-core Machine Model for Designing Algorithms with Minimum Parallelism Overheads

When a parallel program does not reach the expected performances, several causes can be involved: insufficient parallelism at the algorithm level, limitations to parallel execution at the architecture level, parallelization overheads at the level of the targeted concurrency platform. Parallelization overheads are inevitable and understanding these costs can help programmers design more scalable applications.

We present a model of multithreaded computation (MCM), combining fork-join and single-instruction-multiple-data parallelisms, with an emphasis on estimating parallelism overheads of programs written for modern many-core architectures. We establish a Graham-Brent theorem for this model so as to estimate execution time of programs running on a given number of streaming multiprocessors. We evaluate the benefits of our model with four fundamental algorithms from scientific computing. In each case, our model is used to minimize parallelism overheads by determining an appropriate value range for a given program parameter; moreover experimentation confirms the model's prediction.

The mini-course materials for DAY 5: