MathInstitutes.org

Giang Tran - Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences

Presenter

Giang Tran

February 28, 2024

IPAM

Event: EnCORE Workshop on Computational vs Statistical Gaps in Learning and Optimization

Abstract

Recorded 28 February 2024. Giang Tran of the University of Waterloo presents "Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences" at IPAM's EnCORE Workshop on Computational vs Statistical Gaps in Learning and Optimization. Abstract: Transformer-based models have achieved state-of-the-art performance in many areas. However, the quadratic complexity of self-attention with respect to the input length hinders the applicability of Transformer-based models to long sequences. To address this, we present Fast Multipole Attention (FMA), a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length n from O(n^2) to O(n log n) or O(n), while retaining a global receptive field. The hierarchical approach groups queries, keys, and values into O(log n) levels of resolution, where groups at greater distances are increasingly larger in size and the weights to compute group quantities are learned. As such, the interaction between tokens far from each other is considered in lower resolution in an efficient hierarchical manner. This multi-level divide-and-conquer strategy is inspired by fast summation methods from n-body physics and the Fast Multipole Method. We perform evaluation on autoregressive and bidirectional language modeling tasks and compare our FMA model with other efficient attention variants on medium-size datasets. We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers in terms of memory size and accuracy. The FMA mechanism has the potential to empower large language models with much greater sequence lengths, taking the full context into account in an efficient, naturally hierarchical manner during training and when generating long sequences. This is joint work with Hans De Sterck and Yanming Kang.
Learn more online at: https://www.ipam.ucla.edu/programs/workshops/encore-workshop-on-computational-vs-statistical-gaps-in-learning-and-optimization/?tab=overview
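
The grouping idea in the abstract (nearby tokens at full resolution, distant tokens in groups that grow with distance) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `distance_groups`, `mean_vector`, and `grouped_attention` and the `window` parameter are hypothetical. The learned group weights mentioned in the abstract are replaced here by plain averaging, and each group's softmax weight is scaled by its size, an assumption made so that the sketch matches full attention whenever the keys inside a group coincide.

```python
import math


def distance_groups(i, n, window=2):
    """Partition positions 0..n-1 into half-open intervals around query i.

    Positions within `window` of i stay as single tokens; farther away,
    the group size doubles with each step (2, 4, 8, ...). This doubling
    rule is an illustrative assumption, not the scheme from the talk.
    """
    groups = [(j, j + 1) for j in range(max(0, i - window), min(n, i + window + 1))]
    # Walk to the right of the fine window with doubling group sizes.
    size, start = 2, i + window + 1
    while start < n:
        end = min(n, start + size)
        groups.append((start, end))
        start, size = end, size * 2
    # Walk to the left of the fine window with doubling group sizes.
    size, end = 2, i - window
    while end > 0:
        start = max(0, end - size)
        groups.append((start, end))
        end, size = start, size * 2
    return sorted(groups)


def mean_vector(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def grouped_attention(query, keys, values, i, window=2):
    """Attention output for one query, using one averaged key/value per group."""
    scale = math.sqrt(len(query))
    logits, sizes, summaries = [], [], []
    for start, end in distance_groups(i, len(keys), window):
        k_avg = mean_vector(keys[start:end])
        logits.append(sum(a * b for a, b in zip(query, k_avg)) / scale)
        sizes.append(end - start)
        summaries.append(mean_vector(values[start:end]))
    top = max(logits)  # subtract the maximum logit for numerical stability
    weights = [s * math.exp(l - top) for s, l in zip(sizes, logits)]
    total = sum(weights)
    return [
        sum(w * v[c] for w, v in zip(weights, summaries)) / total
        for c in range(len(values[0]))
    ]
```

In this sketch the number of groups per query grows roughly logarithmically with the sequence length, so computing all n queries this way costs on the order of n log n score evaluations instead of n^2.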
