Robust and Scalable Learning of Gaussian Mixture Models

December 17, 2021
The Gaussian mixture model (GMM) is one of the most widely used tools in machine learning and statistics for probabilistic clustering and density estimation. The model is usually fitted with the expectation-maximization (EM) algorithm or one of its variants. When the sample size is large, however, EM can become impractical because its computational cost grows rapidly with the amount of data. In this talk, I present a divide-and-conquer approach with minimal communication that addresses this problem by exploiting a Hilbertian structure on GMMs induced by the kernel embedding of Gaussian measures. Multiple models are estimated on independent subsets of the data and then aggregated into a single GMM by taking their geometric median in the Hilbert space, which guarantees robustness of the estimate under mild conditions. The resulting estimate, however, may contain redundant components, so that the induced clustering is not meaningful and individual components are hard to interpret. Motivated by this observation, I propose two postprocessing strategies, one for model reduction and one for characterizing the clustering.
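
The aggregation step can be illustrated with a minimal sketch. It is not the speaker's implementation and rests on several assumptions: the kernel is taken to be the Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), for which the inner product of two embedded Gaussians has a closed form; the geometric median is approximated with Weiszfeld-type iterations expressed through the Gram matrix of the local models; and the median is returned as a convex combination of the local GMMs. The function names, the (weights, means, covs) tuple layout, and the bandwidth `sigma` are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_inner(m1, S1, m2, S2, sigma=1.0):
    """Inner product of the kernel embeddings of N(m1, S1) and N(m2, S2)
    under the assumed RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    d = m1.shape[0]
    S = S1 + S2
    diff = m1 - m2
    det_term = np.linalg.det(np.eye(d) + S / sigma**2) ** -0.5
    quad = diff @ np.linalg.solve(S + sigma**2 * np.eye(d), diff)
    return det_term * np.exp(-0.5 * quad)

def gmm_inner(gmm_a, gmm_b, sigma=1.0):
    """Inner product of two embedded GMMs, each given as (weights, means, covs)."""
    wa, ma, Sa = gmm_a
    wb, mb, Sb = gmm_b
    total = 0.0
    for i in range(len(wa)):
        for j in range(len(wb)):
            total += wa[i] * wb[j] * gaussian_inner(ma[i], Sa[i], mb[j], Sb[j], sigma)
    return total

def gram_matrix(gmms, sigma=1.0):
    """Gram matrix G[i, j] = <gmm_i, gmm_j> in the Hilbert space."""
    m = len(gmms)
    G = np.empty((m, m))
    for i in range(m):
        for j in range(i, m):
            G[i, j] = G[j, i] = gmm_inner(gmms[i], gmms[j], sigma)
    return G

def geometric_median_weights(G, n_iter=200, tol=1e-10, eps=1e-12):
    """Weiszfeld-type iterations; returns convex weights w such that
    sum_j w[j] * gmm_j approximates the geometric median of the local models."""
    m = G.shape[0]
    w = np.full(m, 1.0 / m)
    diag = np.diag(G)
    for _ in range(n_iter):
        Gw = G @ w
        # Squared distance from each local model to the current median estimate.
        sq = np.maximum(diag - 2.0 * Gw + w @ Gw, 0.0)
        inv = 1.0 / np.sqrt(sq + eps)  # eps guards against division by zero
        w_new = inv / inv.sum()
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    return w

def aggregate(gmms, w):
    """Pool all components, rescaling each local model's weights by w[j]."""
    weights = np.concatenate([wj * g[0] for wj, g in zip(w, gmms)])
    means = np.concatenate([g[1] for g in gmms])
    covs = np.concatenate([g[2] for g in gmms])
    return weights, means, covs
```

Under these assumptions, a local model that disagrees with the majority receives a small median weight, which is the source of robustness. The pooled mixture returned by `aggregate` keeps every component of every local model, which is one way the redundancy mentioned above can arise before any model reduction is applied.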