### Squeezing data with the power of algorithms

*Two computer scientists determine the smallest amount of data needed to estimate the mean*

When two fields of study collide, advances that were once all but unimaginable can be realized. At least that’s what two computer scientists showed when they applied the algorithmic tools of their field to solve the statistics problem of how to accurately guess the average value of a phenomenon from as few measurements of it as possible.

Specifically, they determined how accurately one can estimate the mean of an unknown distribution of numbers given a certain amount of data and a given confidence level, and how to compute such an estimate. “Traditionally with statistics, one way to get more accuracy is to gather more data,” says Paul Valiant, a von Neumann Fellow in the School of Mathematics at the Institute for Advanced Study (IAS) in Princeton, New Jersey. “But sometimes you can’t get more of it. So a different way to go about it is to be more clever with the data you already have.”

This clever approach is at the heart of the mean estimator that Valiant posted to Arxiv in late 2020 with Jasper C.H. Lee, then a doctoral student at Brown University. Their algorithm culminates a three-year pursuit to find what is the smallest amount of data needed to estimate the mean accurately—or looked at another way, Valiant notes, to find what is the most accurate estimate that can be obtained given some fixed amount of data.

The problem itself is perhaps the most basic one can ask in statistics, says Avi Wigderson, the Herbert H. Maass Professor in the School of Mathematics at the IAS. “One would imagine that such a fundamental problem had been solved decades ago,” says Wigderson. “But it was open, until Lee and Valiant just settled it, with an ingenious estimator which uses the best possible number of samples for every distribution.”

The value of data efficiency

The estimator not only solves a mathematical challenge but it also enables data efficiency—something with practical value to our big data world. Data-driven decision-making underlies virtually every facet of our lives. It determines what we can buy in stores, when we can shop, and how quickly we receive service. It sets airplane and train schedules, prices, and seating availability. It guides college acceptance rates, hospital staffing decisions, and new construction development. It shapes our online advertising experience, chain coffee shop options, and package delivery times.

Yet data can be difficult, costly, and time-consuming to collect, and so it is often a limiting factor. Consider how the release of Covid-19 vaccines hinged on how long it took pharmaceutical companies to collect enough data to prove their candidates worked, safely, in a representative sample.

This new mean estimator potentially allows data researchers in a broad range of industries to do more with less data than previously thought possible. “With computer science, I’m used to this mentality of taking a limited resource, like computation time, and trying to squeeze every last drop out of it to figure out how to get a fast algorithm,” says Valiant. “Now I’m using this approach for data instead of time.”

Computer science enables algorithms that are as complicated as they need to be: once proven, peer-reviewed, and neatly packaged as code, they are automated to run at the push of a button on a computer or other device. “Because of this kind of background, I’m looking for algorithms that are much more complicated than what the field of statistics has usually looked at. The internal complexity doesn’t really matter,” says Valiant.

He notes that computer scientists solve problems with a two-pronged approach: They ask what they can achieve algorithmically, and they ask what’s impossible. Then they work to improve the algorithm and to improve their understanding of the impossibility results, making progress from both directions until, ideally, there’s a meet-up in the middle that reveals the truth.

Future directions in mean estimation

Valiant uses this problem-solving process in his ongoing research in mean estimation. He is working on generalizing his mean estimator to higher dimensions. That is, he wants to find a way to use it to analyze multiple types of data, such as height and weight, at the same time rather than individually. “This approach can give you performance improvements and leverage correlations that could presumably help more accurately estimate properties of the joint distribution compared to if you examined each one separately,” he says.

This mean estimation research is just one of the many problems in algorithms and machine learning that Valiant has studied during his fellowship at IAS. Covid-19 restrictions canceled public gatherings on the campus and limited interactions with colleagues. But the restrictions did have had the upside of giving him more time to focus on solving hard problems—ones that would not only prove esoteric mathematical theorems, but that also hold potential for practical, real-world applications.

Reflecting on the year of the pandemic, which also contained a presidential election, he mentions how recent elections have revealed that “pollsters don’t have a good way to get an unbiased sample of the population.” He wonders if he could come up with a mathematical model that, while not explicitly tailored to polling, employs statistical tools in a way that enables more accurate polling.

It’s just one of the new questions motivating his research as he joins the faculty at Purdue University later this year.

*—Jennifer Fisher Wilson*