Towards multilevel training algorithms: Applying scientific computing perspectives to neural networks
Presenter
April 16, 2026
Abstract
"The recent explosion of large language models (LLMs) in the commercial space has created unprecedented energy demands driven by the growth of data centers. One aspect of this is the need to train large-scale neural network models on massive amounts of data. Recent work has demonstrated that pre-training a LLM on the Frontier supercomputer would require two years with ideal parallelism [1]. Yet despite advances in optimizers, the training algorithms remain largely the same even as the neural networks scale to trillions of parameters and suffer from quadratic scaling in the number parameters. This motivates our aspirational hypothesis that multilevel (or hierarchical) methodologies can dramatically accelerate training algorithms.
This talk proceeds in three parts. In the first part, we discuss an adaptive basis perspective that has proved fruitful in Scientific Machine Learning (SciML) [2]. With this perspective we develop efficient "operator-split" training algorithms and new initialization strategies motivated by stability concerns. The second part considers the impact of quasi-second-order methods on the "grokking" phenomenon [3]. Through the lens of spectral bias, we show how Levenberg-Marquardt reduces the generalization gap in our experiments and allows training to proceed quickly through the lazy learning regime toward the rich one. The third part presents a view of Kolmogorov-Arnold networks (KANs) in which the activation function is reformulated as a spline that can be adapted naturally [4]. We explore how this approach relates to a multi-channel ReLU network and present a multilevel training algorithm based on the relaxation properties of gradient descent applied to KANs.
Taken together, the goal of this talk is to show how we apply techniques developed for scientific computing to understand and improve neural network training.
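To make the connection in the third part concrete, the sketch below (illustrative only, not the implementation presented in the talk; the function names, the fixed knot grid, and the coefficient choice are assumptions) evaluates a KAN-style piecewise-linear spline activation in two equivalent ways: as a hat-function (linear B-spline) interpolant and as a sum of shifted ReLUs. This equivalence is what allows the spline view to be read as a multi-channel ReLU network.

# Minimal illustrative sketch; assumed names and shapes, not the talk's code.
import numpy as np

def spline_activation(x, knots, values):
    """Piecewise-linear interpolant of (knots, values), evaluated at x."""
    return np.interp(x, knots, values)

def relu_form(x, knots, values):
    """Same function written as shifted ReLUs (valid inside [knots[0], knots[-1]])."""
    slopes = np.diff(values) / np.diff(knots)          # slope on each interval
    out = values[0] + slopes[0] * (x - knots[0])       # first linear piece
    for k, (m_prev, m_next) in zip(knots[1:-1], zip(slopes[:-1], slopes[1:])):
        out += (m_next - m_prev) * np.maximum(x - k, 0.0)  # slope change at each interior knot
    return out

knots = np.linspace(-2.0, 2.0, 9)          # fixed grid; an assumed choice of knots
values = np.tanh(knots) + 0.1 * knots**2   # stand-in "learned" coefficients, one per knot
x = np.linspace(-2.0, 2.0, 101)
print(np.allclose(spline_activation(x, knots, values), relu_form(x, knots, values)))  # True

One natural handle such a representation offers a multilevel scheme is the knot grid itself, e.g., coarsening or refining the knots between training cycles, though the specific algorithm used in the talk is described in [4].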
1. Dash et al., Optimizing distributed training on Frontier for large language models. In ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 2024.
2. Cyr et al., Robust training and initialization of deep neural networks: An adaptive basis viewpoint. In Mathematical and Scientific Machine Learning, 2020.
3. Jiang et al., On the convergence behavior of preconditioned gradient descent toward the rich learning regime. ICLR, 2026.
4. Southworth et al., Multilevel training for Kolmogorov-Arnold networks. arXiv:2603.04827, 2026 (submitted to SISC).