The Challenges of Writing Portable, Correct and High Performance Libraries for GPUs or How to Avoid the Heroics of GPU Programming
Presenter
January 10, 2011
Keywords:
- GPU programming, CUDA, correctness, Biomedical Imaging, Floating Point, GPU Libraries
MSC:
- 65D18
Abstract
We live in the age of heroic programming for scientific applications on Graphics Processing Units (GPUs). Typically a scientist chooses an application to accelerate and a target platform, and through great effort maps their application to that platform. If they are a true hero, they achieve two or three orders of magnitude speedup for that application and target hardware pair. The effort required includes a deep understanding of the application, its implementation and the target architecture. When a new, higher performance architecture becomes available additional heroic acts are required.
There is another group of scientists who prefer to spend their time focused on the application level rather than lower levels. These scientists would like to use GPUs for their applications, but would prefer to have parameterized library components available that deliver high performance without requiring heroic efforts on their part.
The library components should be easy to use and should support a wide range of user input parameters. They should exhibit good performance on a range of different GPU platforms, including future architectures. Our research focuses on creating such libraries.
We have been investigating parameterized library components for use with Matlab/Simulink and with the SCIRun Biomedical Problem Solving Environment from the University of Utah. In this talk I will discuss our library development efforts and challenges to achieving high performance across a range of both application and architectural parameters.
I will also focus on issues that arise in achieving correct behavior of GPU kernels. One issue is correct behavior with respect to thread synchronization. Another is knowing whether or not your scientific application that uses floating point is correct when the results differ depending on the target architecture and order of computation.