Random walk distances in data clustering and applications
Presenter
November 28, 2011
Abstract
Clustering data into groups of similarity is well recognized as an important step in many diverse applications, including biomedical imaging, data mining and bioinformatics. Well known clustering methods, dating to the 70's and 80's, include the K-means algorithm and its generalization, the Fuzzy C-means (FCM) scheme, and hierarchical tree decompositions of various sorts. More recently, spectral techniques have been employed to much success. However, with the inundation of many types of data sets into virtually every arena of science, it makes sense to introduce new clustering techniques which emphasize geometric aspects of the data, the lack of which has been somewhat of a drawback in most previous algorithms. In this talk, we focus on a slate of "random-walk" distances arising in the context of several weighted graphs formed from the data set, in a comprehensive generalized FCM framework, which allow to assign "fuzzy" variables to data points which respect in many ways their geometry. The method we present groups together data which are in a sense "well-connected", as in spectral clustering, but also assigns to them membership values as in FCM. We demonstrate the effectiveness and robustness of our method on several standard synthetic benchmarks and other standard data sets such as the IRIS and the YALE face data sets. This is joint work with Sijia Liu and Sunder Sethuraman.