Abstract
Machine Learning is invaluable for extracting insights from large volumes of data. A key assumption enabling many methods, however, is access to training data comprising independent observations from the entire distribution of relevant data. In practice, data is commonly missing due to measurement limitations, legal restrictions, or data collection and sharing practices. Moreover, observations are often collected on a network, or over a spatial or temporal domain, and may be intricately dependent. Training on censored or dependent data is known to produce biased Machine Learning models.
In this talk, we give an overview of recent work on learning from censored and dependent data. We propose a broadly applicable learning framework and instantiate it to obtain computationally and statistically efficient methods for linear and logistic regression, in high dimensions, from censored or dependent samples. Our findings are enabled by connections to Statistical Physics, Concentration and Anti-concentration of measure, and properties of Stochastic Gradient Descent, and they make progress on some classical challenges in Statistics and Econometrics. (The talk is based on joint works with Dagan, Dikkala, Gouleakis, Ilyas, Jayanti, Kontonis, Panageas, Rohatgi, Tzamos, and Zampetakis.)
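To illustrate the kind of phenomenon the talk addresses, the sketch below (a minimal, illustrative example under assumed settings, not the algorithm from these works) fits a linear model to samples truncated at zero: ordinary least squares on the observed data is biased, while stochastic gradient ascent on the truncated Gaussian log-likelihood corrects for the censoring. All variable names and hyperparameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Ground-truth linear model: y = <w*, x> + noise, noise ~ N(0, 1).
d, n = 5, 50_000
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + rng.normal(size=n)

# Censoring/truncation: only samples with positive label are observed.
keep = y > 0.0
X_obs, y_obs = X[keep], y[keep]

# Naive least squares on the truncated sample is biased.
w_ols = np.linalg.lstsq(X_obs, y_obs, rcond=None)[0]

# Truncated log-likelihood of an observed sample (x, y), y > 0:
#   log p(y | x, w) = log phi(y - <w, x>) - log Phi(<w, x>),
# with gradient ((y - <w, x>) - mills(<w, x>)) * x,
# where mills(z) = phi(z) / Phi(z) is the inverse Mills ratio.
def mills(z):
    return np.exp(norm.logpdf(z) - norm.logcdf(z))

# Stochastic gradient ascent on the truncated log-likelihood.
w = np.zeros(d)
lr = 2e-3
for epoch in range(10):
    for i in rng.permutation(len(y_obs)):
        x_i, y_i = X_obs[i], y_obs[i]
        pred = x_i @ w
        w += lr * ((y_i - pred) - mills(pred)) * x_i

print("||w_ols - w*|| =", np.linalg.norm(w_ols - w_star))
print("||w_sgd - w*|| =", np.linalg.norm(w - w_star))
```

The correction relative to plain least squares is the inverse Mills ratio term, which accounts for the fact that only samples with positive labels survive truncation.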