Videos

Clustering with an Oracle

Presenter
October 28, 2016
Abstract
Given a set V of n elements, consider the simple task of clustering them into k clusters, where k is unknown. We are allowed to make pairwise queries. Given elements u and v in V, a query asks whether u,v belong to the same cluster and returns a binary answer assuming a true underlying clustering. The goal is to minimize the number of such queries to correctly reconstruct the clusters. When the answer to each query is correct, a simple lower and upper bound of Theta(nk) on query complexity is easy to derive. Our major contribution is to show how only a mild side information in the form of a similarity matrix leads to a great reduction in query complexity to O(k^2). This remains true even when the answer of each query can be erroneous with certain probability and `resampling' is not allowed. Note that this bound can be significantly sublinear in n depending on the value of k. We also develop parallel versions of our algorithms which give near-optimal bounds on the number of adaptive rounds required to match the query complexity. To show our lower bounds we introduce new general information theoretic methods; as well as use, in completely novel way, information theoretic inequalities to design efficient algorithms for clustering with near-optimal complexity. We believe our techniques both for the lower and upper bounds are of general interest, and will find many applications in theoretical computer science and machine learning. This talk is based on a joint work with Barna Saha. Arya Mazumdar is an assistant professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst. Prior to this, Arya was an assistant professor at University of Minnesota-Twin Cities, and form Aug 2011 to Dec 2012, he was a postdoctoral scholar at Massachusetts Institute of Technology. Arya received his Ph.D. from University of Maryland, College Park, in 2011. Arya is a recipient of the 2014-15 NSF CAREER award and the 2010 IEEE ISIT Jack K. Wolf Student Paper Award. He is also a recipient of the Distinguished Dissertation Fellowship Award, 2011, at the University of Maryland. He spent the summers of 2008 and 2010 at the Hewlett-Packard Laboratories, Palo Alto, CA, and IBM Almaden Research Center, San Jose, CA, respectively. Arya's research interests include information theory, coding theory, and their applications to CS theory/learning.