Abstract
Statistical inference for large-scale factorization and latent
variable model problems is challenging. It requires the ability to
partition the state space, to synchronize copies, and to perform
distributed updates. Such problems arise in very large-scale topic
models over 500 million documents, and in graph factorization
problems with 200 million vertices.
This talk describes basic tools from systems research for distributing
data and computation over hundreds of computers and for synchronizing
updates efficiently. We argue in favor of asynchronous
updates, both from a systems design and from an experimental point of
view. In particular, we show how a distributed approximate Gibbs
sampler can be implemented for time-dependent latent variable models
and how the method of multipliers can be adapted for large-scale graph
factorization.