The basis and perspectives of an exascale algorithm: our ExaFMM project
Presenter
January 12, 2011
Keywords:
- Scaling Algorithms
MSC:
- 65F35
Abstract
Linearly scaling algorithms will be crucial for the problem sizes that
will be tackled in capability exascale systems. It is interesting to
note that many of the most successful algorithms are hierarchical in
nature, such as multi-grid methods and fast multipole methods (FMM). We
have been leading development efforts for open-source FMM software for
some time, and recently produced GPU implementations of the various
computational kernels involved in the FMM algorithm. Most recently, we
have produced a multi-GPU code, and performed scalability studies
showing high parallel efficiency in strong scaling. These results have
pointed to several features of the FMM that make it a particularly
favorable algorithm for the emerging heterogeneous, many-core
architectural landscape. We propose that the FMM algorithm offers
exceptional opportunities to enable exascale applications. Among its
exascale-suitable features are: (i) it has intrinsic geometric
locality, and access patterns are made local via particle indexing
techniques; (ii) we can achieve temporal locality via an efficient
queuing of GPU tasks before execution, and at a fine level by means of
memory coalescing based on the natural index-sorting techniques; (iii)
global data communication and synchronization, often a significant
impediment to scalability, is a soft barrier for the FMM, where the
most time-consuming kernels are, respectively, purely local
(particle-to-particle interactions) and "hierarchically synchronized"
(multipole-to-local interactions, which happen simultaneously at every
level of the tree). In addition, we suggest a strategy for achieving
the best algorithmic performance, based on two key ideas: (i)
hybridize the FMM with treecode by choosing on-the-fly between
particle-particle, particle-box, and box-box interactions, according to
a work estimate; (ii) apply a dynamic error-control technique, effected
on the treecode by means of a variable "box-opening angle" and on the
FMM by means of a variable order of the multipole expansion. We have
carried out preliminary implementation of these ideas/techniques,
achieving a 14x speed-up with respect to our current published version
of the FMM. Considering that this effort was only exploratory, we are
certain to possess the potential for unprecedented performance with
these algorithms.