Chapter 14: In this chapter we address unsupervised learning, or “learning without a teacher”…we present those unsupervised learning techniques that are among the most commonly used in practice, and additionally, a few others that are favored by the authors.
We take an axiomatic approach to defining ‘good’ objective functions for both similarity and dissimilarity-based hierarchical clustering. We characterize a set of admissible objective functions … that have the property that when the input admits a ‘natural’ ground-truth hierarchical clustering, the ground-truth clustering has an optimal value.
A recent proof of NP-hardness of Euclidean sum-of-squares clustering, due to Drineas et al. (Mach. Learn. 56:9–33, 2004), is not valid. An alternate short proof is provided.
…the k-means problem is NP-hard when the dimension m is part of the input even for k = 2. However, to the best of our knowledge, there is no known NP-hardness result when the dimension m is fixed and k, the number of clusters, is part of the input…In this paper, we establish the NP-hardness of the k-means problem in the plane.
t-distributed Stochastic Neighborhood Embedding (t-SNE), a clustering and visualization method proposed by van der Maaten & Hinton in 2008, has rapidly become a standard tool in a number of natural sciences. Despite its overwhelming success, there is a distinct lack of mathematical foundations and the inner workings of the algorithm are not well understood. The purpose of this paper is to prove that t-SNE is able to recover well-separated clusters; more precisely, we prove that t-SNE in the `early exaggeration’ phase, an optimization technique proposed by van der Maaten & Hinton (2008) and van der Maaten (2014), can be rigorously analyzed.
This work gives a formal framework for the problem of data visualization - finding a 2-dimensional embedding of clusterable data that correctly separates individual clusters to make them visually identifiable. We then give a rigorous analysis of the performance of t-SNE under a natural, deterministic condition on the “ground-truth” clusters (similar to conditions assumed in earlier analyses of clustering) in the underlying data.
The notion of defining a cluster as a component in a mixture model was put forth by Tiedeman in 1955; since then, the use of mixture models for clustering has grown into an important subfield of classification. Considering the volume of work within this field over the past decade, which seems equal to all of that which went before, a review of work to date is timely. First, the definition of a cluster is discussed and some historical context for model-based clustering is provided. Then, starting with Gaussian mixtures, the evolution of model-based clustering is traced, from the famous paper by Wolfe in 1965 to work that is currently available only in preprint form. This review ends with a look ahead to the next decade or so.
McLachlan and Basford. 1988. Mixture Models: inference and applications to clustering. (Book)
McLachlan, et. al. 2018. Finite Mixture Models.
The aim of this article is to provide an up-to-date account of the theory and methodological developments underlying the applications of finite mixture models. Because of their flexibility, mixture models are being increasingly exploited as a convenient, semiparametric way in which to model unknown distributional shapes. This is in addition to their obvious applications where there is group-structure in the data or where the aim is to explore the data for such structure, as in a cluster analysis.
Kernel k-means and spectral clustering have both been used to identify clusters that are non-linearly separable in input space. Despite significant research, these methods have remained only loosely related. In this paper, we give an explicit theoretical connection between them.
Here we suggest a formal perspective on the on the difficulty in finding [a unified framework for clustering] in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-medians.
We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods.
The stochastic block model (SBM) with two communities, or equivalently the planted bisection model, is a popular model of random graph exhibiting a cluster behaviour. In the symmetric case, the graph has two equally sized clusters and vertices connect with probability $p$ within clusters and $q$ across clusters. In the past two decades, a large body of literature in statistics and computer science has focused on providing lower-bounds on the scaling of $|p−q|$ to ensure exact recovery. In this paper, we identify a sharp threshold phenomenon for exact recovery … that improves the existing bounds. (Thanks to Zhongyang Li for pointing out this paper; the Stochastic Block Model seems very relevant).
In this paper, we will discuss how geometry and topology can be applied to make useful contributions to the analysis of various kinds of data….Although clustering is clearly a very important part of data analysis, the ways in which it is formulated and implemented are fraught with ambiguities. In particular, the arbitrariness of various threshhold choices and lack of robustness are difficulties one confronts. Much of current research efforts are focused in this direction (see, e.g., [43] and [39]), and functoriality provides the right general mathematical framework for addressing them.
We study the vertex classification problem on a graph in which the vertices are in k(k≥2) different groups, or communities, and edges are only allowed between vertices in distinct groups. The observation is the weighted adjacency matrix, perturbed by a Gaussian Orthogonal Ensemble (GOE), or Gaussian Unitary Ensemble (GUE) matrix. Different from the standard symmetric Z/2-synchronization in which there are 2 communities with equal number of vertices, we do not assume that the number of vertices in each group is the same. For the exact recovery of the maximum likelihood estimation (MLE) with various weighted adjacency matrix, we prove a sharp phase transition result with respect to the intensity σ of the Gaussian perturbation. These weighted adjacency matrices may be considered as natural models for the electric network. Surprisingly, the threshold, or critical value, of σ does not depend on whether or not the sample space for MLE is restricted to the classifications such that the number of vertices in each group is the same as the true value. In contrast to the well-studied Z/2-sychronization, a new complex version of the semi-definite programming (SDP) is designed to efficiently implement the community detection problem when the number of communities k is greater than 2, and a common region for σ such that SDP exactly recovers the true classification is obtained, which is independent of k.
Across many scientific domains, there is a common need to automatically extract a simplified view or coarse-graining of how a complex system’s components interact. This general task is called community detection in networks and is analogous to searching for clusters in independent vector data. It is common to evaluate the performance of community detection algorithms by their ability to find so-called ground truth communities. This works well in synthetic networks with planted communities because these networks’ links are formed explicitly based on those known communities. However, there are no planted communities in real-world networks. Instead, it is standard practice to treat some observed discrete-valued node attributes, or metadata, as ground truth. We show that metadata are not the same as ground truth and that treating them as such induces severe theoretical and practical problems. We prove that no algorithm can uniquely solve community detection, and we prove a general No Free Lunch theorem for community detection, which implies that there can be no algorithm that is optimal for all possible community detection tasks. However, community detection remains a powerful tool and node metadata still have value, so a careful exploration of their relationship with network structure can yield insights of genuine worth. We illustrate this point by introducing two statistical techniques that can quantify the relationship between metadata and community structure for a broad class of models. We demonstrate these techniques using both synthetic and real-world networks, and for multiple types of metadata and community structures.
Clinical outcome prediction normally employs static, one-size-fits-all models that perform well for the average patient but are sub-optimal for individual patients with unique characteristics. In the era of digital healthcare, it is feasible to dynamically personalize decision support by identifying and analyzing similar past patients, in a way that is analogous to personalized product recommendation in e-commerce. Our objectives were: 1) to prove that analyzing only similar patients leads to better outcome prediction performance than analyzing all available patients, and 2) to characterize the trade-off between training data size and the degree of similarity between the training data and the index patient for whom prediction is to be made.
Lack of reproducibility in medical studies is a barrier to the generation of a robust knowledge base to support clinical decision-making. In this paper we outline the Medical Information Mart for Intensive Care (MIMIC) Code Repository, a centralized code base for generating reproducible studies on an openly available critical care dataset.