Mini-course on optimizing multithreaded algorithms and code targeting hardware accelerators

Presenters

Yuzhen Xie (University of Western Ontario)
Marc Moreno Maza (University of Western Ontario)

Aim and Scope

This 3-day mini-course attempts to cover a variety of techniques for improving performance of multithreaded code on hardware accelerators, in particular Graphics Processing Units (GPUs).

The first day can be both an introduction or a review of multithreaded programming with an emphasis on data locality and parallelism. In the second day, we present a model of computation aiming at minimizing parallelism overheads on many-core GPUs. The third day is a lab practice where the attendees will have the opportunity to implement the techniques presented in the lectures.

The audience is expected to have the basic knowledge of algorithms for scientific computing and some experience with multithreaded programming. Each of the two lectures (Day 1 and Day 2) is meant to be independent of the other parts of the mini-course. However, the lab practice will rely on the two lectures.

The materials presented during the lectures and the lab practice will be available on the web at the time of the mini-course. This mini-course is based on tutorials:

as well as courses taught by the presenters at the University of Western Ontario:

Day 1: Optimizing algorithms and code for data locality and parallelism

Parallel hardware architectures (multicores, graphics processing units, etc.) and computer memory hierarchies (from processor registers to hard disks via successive cache memories) impose an evolution in the design of scientific software. Algorithms for scientific computation have often been designed with algebraic complexity as the main complexity measure and with serial running time as the main performance counter. On modern computers minimizing the number of arithmetic operations is no longer the primary factor affecting performance. Effective utilization of the underlying architecture (reducing pipeline stalls, increasing the amount of instruction level parallelism, better utilizing the memory hierarchy) can be much more important.

This tutorial is an introduction to the performance issues related to data locality and parallelism targeting multicore processors and Graphics Processing Units (GPUs) with a focus on the latter type of hardware accelerators. A first part will be dedicated to data locality independently of parallelism thus using serial algorithms and programs as illustration. In a (short) second part analyzing and optimizing multithreaded programs on multicores will be discussed. In the third part we will switch to GPU specific issues and then compare the implementation techniques of both architecture platforms.

The mini-course materials for DAY 1:

Day 2: A Many-core Machine Model for Designing Algorithms with Minimum Parallelism Overheads

When a parallel program does not reach the expected performances, several causes can be involved: insufficient parallelism at the algorithm level, limitations to parallel execution at the architecture level, parallelization overheads at the level of the targeted concurrency platform. Parallelization overheads are inevitable and understanding these costs can help programmers design more scalable applications.

We present a model of multithreaded computation (MCM), combining fork-join and single-instruction-multiple-data parallelisms, with an emphasis on estimating parallelism overheads of programs written for modern many-core architectures. We establish a Graham-Brent theorem for this model so as to estimate execution time of programs running on a given number of streaming multiprocessors. We evaluate the benefits of our model with four fundamental algorithms from scientific computing. In each case, our model is used to minimize parallelism overheads by determining an appropriate value range for a given program parameter; moreover experimentation confirms the model's prediction.

The mini-course materials for DAY 2:

Day 3: Optimizing multithreaded algorithms and code. Hands on!

We shall see through various examples, (considering multicore and GPU implementation) how performance bottlenecks can be identified (by complexity analysis or performance measurement) and resolved, when possible. In the multicore case, we will take advantage of comparative implementation thanks to the compilation framework of the MetaFork project. In the GPU case, we will take advantage of the MCM to optimize programs depending on parameters.

Exercices

Exercise 1:
See the README in the tarball of C programs.
Exercises 2 to 7:
See the README in the tarball of Cilk programs.
Exercises 8 to 10:
See the README in the tarball of C programs.