Reduced label complexity for tight linear regression

Presenter
June 29, 2023
Abstract
The success of modern supervised machine learning is predicated on the existence of large labeled data sets, but in many domains it is cost-prohibitive to generate a large number of high-quality labels. It is more feasible to first collect a large unlabeled data set and then invest in labeling a subset of it. This motivates the notion of label complexity: for a given supervised learning problem and unlabeled data set, what is the minimal number of labels required so that a model fitted on the labeled subset has almost as much predictive power as a model fitted on the entire data set would have if all the labels were available? We present algorithmic results on the label complexity of linear regression: given n data points, how many must be labeled to obtain a model with near-optimal in-sample prediction error? Existing approaches to reducing the label complexity of linear regression, including techniques from the randomized numerical linear algebra community such as core-sets and iterative algorithms that touch one data point per iteration, are applicable when a constant-factor approximation is acceptable. New approaches are needed to enter the regime where the approximation factor decreases with the size of the data set. In this setting, we provide a polynomial-time algorithm that reduces the label complexity by an additive O(sqrt(n)). The algorithm is based on a tight analysis of the regression error incurred by forming a core-set using backward selection.
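
To make the setup concrete, the following minimal sketch illustrates a backward-selection core-set procedure for linear regression: starting from all n rows, points are removed one at a time using only the unlabeled design matrix, and labels are then requested only for the surviving subset. The leverage-score removal rule, the label budget k, and the helper backward_select are illustrative assumptions; the abstract does not specify the exact criterion or guarantees analyzed in the talk.

```python
import numpy as np

def backward_select(X, k):
    """Greedily keep k rows of X (n x d) by backward selection.

    At each step, drop the row with the smallest leverage score with
    respect to the rows still retained, so the surviving subset keeps the
    design matrix well conditioned. Labels are never used here; this is
    an illustrative criterion, not necessarily the one from the talk.
    """
    keep = list(range(X.shape[0]))
    while len(keep) > k:
        Xs = X[keep]
        # Leverage scores: diag(Xs (Xs^T Xs)^{-1} Xs^T)
        G_inv = np.linalg.pinv(Xs.T @ Xs)
        lev = np.einsum("ij,jk,ik->i", Xs, G_inv, Xs)
        keep.pop(int(np.argmin(lev)))  # drop the least influential row
    return np.array(keep)

# Toy usage: label only the selected subset, then compare in-sample error
# against the fit that uses all n labels.
rng = np.random.default_rng(0)
n, d, k = 500, 10, 60
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)  # labels, revealed on demand

S = backward_select(X, k)
w_sub = np.linalg.lstsq(X[S], y[S], rcond=None)[0]  # fit from k labels
w_full = np.linalg.lstsq(X, y, rcond=None)[0]       # fit from all n labels
err_sub = np.mean((X @ w_sub - y) ** 2)
err_full = np.mean((X @ w_full - y) ** 2)
print(f"in-sample MSE: subset={err_sub:.4f}, full={err_full:.4f}")
```

In this framing, the label complexity question asks how small k can be made while the subset fit's in-sample prediction error stays close to that of the full fit; the talk's contribution is an algorithm and analysis that shave an additive O(sqrt(n)) off the number of labels needed in the regime where the allowed approximation factor shrinks with n.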