Cornell University, USA
Title: Overlapping Variable Clustering with Statistical Guarantees
Department of Mathematics & Department of Statistical Science, Cornell University
Abstract: We propose a new clustering method (LOVE) to recover overlapping sub-groups of variables of a p-dimensional vector X from a sample of size n on X, where p is allowed to be larger than n.
Using a latent model, we formulate a cluster as the set of variables that are associated with the same latent factor.
For identifiability, some variables are associated with exactly one latent factor.
These variables are called pure variables and they anchor the clusters.
Our method estimates the set of pure variables and the number of clusters first.
The clusters are then populated with the variables that have multiple associations, favoring sparsity when there are many clusters.
We show that under minimal signal strength conditions, LOVE recovers the population level overlapping clusters consistently.
The practical relevance of LOVE is illustrated by determining the functional annotation of genes with unknown function in an RNA-seq data set.
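To make the cluster definition concrete, the following minimal sketch assumes a linear latent factor model X = AZ + E with a p x K loading matrix A; the abstract does not spell out this form, so the model, the matrix `A`, and the helper `clusters_from_loadings` are illustrative assumptions. It reads the population-level clusters and the pure variables off the support of a known A. It does not implement the LOVE estimation procedure, which must recover these quantities from data.

```python
import numpy as np

def clusters_from_loadings(A, tol=1e-12):
    """Read overlapping clusters and pure variables off a p x K loading matrix.

    Hypothetical helper: cluster k is the set of variables with a nonzero
    loading on factor k; a pure variable loads on exactly one factor.
    """
    support = np.abs(A) > tol
    clusters = [np.flatnonzero(support[:, k]).tolist() for k in range(A.shape[1])]
    pure = np.flatnonzero(support.sum(axis=1) == 1).tolist()
    return clusters, pure

# Illustrative loading matrix (assumed, not from the paper):
# 5 variables, 2 latent factors. Variables 0-3 are pure; variable 4 overlaps.
A = np.array([
    [1.0,  0.0],
    [1.0,  0.0],
    [0.0,  1.0],
    [0.0, -1.0],
    [0.5,  0.5],
])

clusters, pure = clusters_from_loadings(A)
print(clusters)  # variable 4 appears in both clusters
print(pure)

# Simulate a sample of size n from the assumed model X = A Z + E.
rng = np.random.default_rng(0)
n = 200
Z = rng.standard_normal((n, A.shape[1]))          # latent factors
E = 0.1 * rng.standard_normal((n, A.shape[0]))    # noise
X = Z @ A.T + E                                   # n x p data matrix
```

In this toy example the two clusters share variable 4, while variables 0 to 3 each belong to a single cluster and so play the anchoring role described above.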