Why Deep Learning Works: Heavy-Tailed Random Matrix Theory as an Example of Physics Informed Machine Learning
Presenter
Michael Mahoney - University of California, Berkeley (UC Berkeley)
October 14, 2019
Abstract
Physics has a long history of developing and using useful mathematics. As such, it should not be surprising that many methods currently in use in machine learning (regularization methods, kernel methods, and neural network methods, to name just a few) have their roots in physics. We'll provide an overview of some of these topics as well as a "deep dive" into how they can be used to gain a better understanding of why certain machine learning methods work as they do (and what it even means to answer the "why" question). For the former, we'll discuss physics-constrained learning (roughly, using physical insight to constrain a machine learning model) versus physics-informed learning (roughly, using physics ideas or methodologies to improve learning more generally). For the latter, we'll describe recent work (with Charles Martin of Calculation Consulting) on how physics-informed learning can be used to obtain a qualitatively improved understanding of why deep learning works.
To understand why deep learning works, Random Matrix Theory (RMT) can be applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production-quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a phenomenological theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a "size scale" separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered but strongly correlated systems. We will describe validating predictions of the theory; how this can explain the so-called generalization gap phenomenon; and how one can use Heavy-Tailed Universality to develop novel metrics that predict trends in generalization accuracies for very large, pre-trained, production-scale deep neural networks (not just for small toy models).
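To make the kind of analysis described above concrete, the following is a minimal, illustrative Python sketch, not the speakers' actual pipeline: it extracts the weight matrices of a pre-trained torchvision model, computes each layer's empirical spectral density (ESD), and estimates a heavy-tailed power-law exponent with a crude Hill estimator. The model choice (resnet18), the estimator, and the tail cutoff k are assumptions made purely for illustration.

```python
import numpy as np
import torch
from torchvision import models

def layer_esd(weight):
    """Eigenvalues of a layer's correlation matrix, scaled by the larger dimension N."""
    W = weight.detach().cpu().numpy()
    W = W.reshape(W.shape[0], -1)            # flatten conv kernels to a 2-D matrix
    N = max(W.shape)                         # larger dimension, used for RMT scaling
    X = W @ W.T if W.shape[0] <= W.shape[1] else W.T @ W
    return np.linalg.eigvalsh(X / N)         # empirical spectral density (ESD)

def hill_alpha(eigs, k=50):
    """Crude Hill estimator of the power-law tail exponent of the ESD."""
    tail = np.sort(eigs)[-k:]                # k largest eigenvalues
    return 1.0 + k / np.sum(np.log(tail / tail[0]))

# Any pre-trained production-scale model will do; resnet18 is an illustrative choice.
model = models.resnet18(weights="IMAGENET1K_V1")
for name, module in model.named_modules():
    if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
        eigs = layer_esd(module.weight)
        if eigs.size > 100:                  # skip layers too small to fit a tail
            print(f"{name}: alpha ~ {hill_alpha(eigs):.2f}")
```

The metrics discussed in the talk involve more careful power-law fitting and layer-by-layer aggregation than this sketch; the open-source weightwatcher package from Calculation Consulting implements a version of that fuller analysis.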