Videos

The basis and perspectives of an exascale algorithm: our ExaFMM project

Presenter
January 12, 2011
Keywords:
  • Scaling Algorithms
MSC:
  • 65F35
Abstract
Linearly scaling algorithms will be crucial for the problem sizes that will be tackled in capability exascale systems. It is interesting to note that many of the most successful algorithms are hierarchical in nature, such as multi-grid methods and fast multipole methods (FMM). We have been leading development efforts for open-source FMM software for some time, and recently produced GPU implementations of the various computational kernels involved in the FMM algorithm. Most recently, we have produced a multi-GPU code, and performed scalability studies showing high parallel efficiency in strong scaling. These results have pointed to several features of the FMM that make it a particularly favorable algorithm for the emerging heterogeneous, many-core architectural landscape. We propose that the FMM algorithm offers exceptional opportunities to enable exascale applications. Among its exascale-suitable features are: (i) it has intrinsic geometric locality, and access patterns are made local via particle indexing techniques; (ii) we can achieve temporal locality via an efficient queuing of GPU tasks before execution, and at a fine level by means of memory coalescing based on the natural index-sorting techniques; (iii) global data communication and synchronization, often a significant impediment to scalability, is a soft barrier for the FMM, where the most time-consuming kernels are, respectively, purely local (particle-to-particle interactions) and "hierarchically synchronized" (multipole-to-local interactions, which happen simultaneously at every level of the tree). In addition, we suggest a strategy for achieving the best algorithmic performance, based on two key ideas: (i) hybridize the FMM with treecode by choosing on-the-fly between particle-particle, particle-box, and box-box interactions, according to a work estimate; (ii) apply a dynamic error-control technique, effected on the treecode by means of a variable "box-opening angle" and on the FMM by means of a variable order of the multipole expansion. We have carried out preliminary implementation of these ideas/techniques, achieving a 14x speed-up with respect to our current published version of the FMM. Considering that this effort was only exploratory, we are certain to possess the potential for unprecedented performance with these algorithms.