Abstract
Modern machine learning often optimizes a nonconvex objective using simple algorithms such as gradient descent. One way of explaining the success of such simple algorithms is to analyze the optimization landscape and show that all local minima are also globally optimal. However, even for two-layer neural networks, the optimization landscape is hard to analyze and has many different regimes, depending on the size of the student network relative to the teacher network and on the size of the training set. We will discuss some recent results in the mildly overparametrized setting.
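
For concreteness, the sketch below illustrates the kind of teacher-student setup referred to above: a two-layer ReLU student network trained by plain gradient descent on data labeled by a fixed, smaller teacher network. The widths, step size, and other choices are illustrative assumptions, not the precise setting of the results discussed here.

```python
# Minimal sketch (illustrative only): gradient descent on a two-layer
# "student" ReLU network fitting labels produced by a fixed "teacher".
# All hyperparameters below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

d, k, m = 10, 4, 6      # input dim, teacher width, student width (m >= k: mildly overparametrized)
n = 200                 # training set size
lr, steps = 0.05, 2000  # step size and number of gradient descent steps

relu = lambda z: np.maximum(z, 0.0)

# Teacher network: f*(x) = sum_j a*_j * relu(<w*_j, x>)
W_teacher = rng.standard_normal((k, d))
a_teacher = rng.standard_normal(k)

# Training data labeled by the teacher
X = rng.standard_normal((n, d))
y = relu(X @ W_teacher.T) @ a_teacher

# Student network of the same form, width m
W = 0.5 * rng.standard_normal((m, d))
a = 0.5 * rng.standard_normal(m)

for t in range(steps):
    H = relu(X @ W.T)        # hidden activations, shape (n, m)
    err = H @ a - y          # residuals of the student's predictions

    # Gradients of the (nonconvex) average squared loss w.r.t. student parameters
    grad_a = H.T @ err / n
    grad_W = ((err[:, None] * (X @ W.T > 0)) * a).T @ X / n

    a -= lr * grad_a
    W -= lr * grad_W

    if t % 500 == 0:
        print(f"step {t}: loss {0.5 * np.mean(err ** 2):.6f}")

print("final loss:", 0.5 * np.mean((relu(X @ W.T) @ a - y) ** 2))
```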