1 Introduction
Consider the following mixture model: For each , let
be an unknown subgaussian probability distribution over
, with first moment
and second moment matrix with largest eigenvalue
. For each , an unknown number of random points is drawn independently from . Given the points along with the model order , the goal is to approximate the centers . How large must be relative to , and how large must be relative to , in order to have sufficient signal for successful approximation? For the most popular instance of this problem, where the subgaussian distributions are Gaussians, theoretical guarantees date back to the work of Dasgupta [Das99]. Dasgupta introduced an algorithm based on random projections and showed that this algorithm well-approximates centers of Gaussians in that are separated by
. Since Dasgupta’s seminal work, performance guarantees for several algorithmic alternatives have emerged, including expectation maximization
[DS07], spectral methods [VW04, KK10, AS12], projections (random and deterministic) [MV10, AK05], and the method of moments [MV10]. Every existing performance guarantee has one of two forms:
(a) the algorithm correctly clusters all points according to Gaussian mixture component, or

(b) the algorithm well-approximates the center of each Gaussian (à la Dasgupta [Das99]).
Results of type (a), which include [VW04, KK10, AS12, AM07], require the minimum separation between the Gaussian centers to have a multiplicative factor of , where is the total number of points. This stems from a requirement that each point be closer to its own Gaussian’s center (in some sense) than to the other centers, so that the problem of cluster recovery is well-posed. We note that in the case of spherical Gaussians, such highly separated Gaussian components may be truncated so as to match a different data model known as the stochastic ball model, where the semidefinite program we use in this paper is already known to be tight with high probability [ABC15, IMPV15].
Results of type (b) tend to be specifically tailored to exploit unique properties of the Gaussian distribution, and are thus not easily generalizable to other data models. For instance, if
has distribution , then , and concentration of measure implies that in high dimensions, most of the points will reside in a thin shell with center and radius about . This sort of behavior can be exploited to cluster even concentric Gaussians as long as the covariances are sufficiently different. However, algorithms that perform well even with no separation between centers require a sample complexity which is exponential in [MV10]. In this paper, we provide a performance guarantee of type (b), but our approach is model-free. In particular, we consider the k-means clustering objective:
minimize  $\sum_{t=1}^{k}\sum_{i\in A_t}\Big\|x_i-\frac{1}{|A_t|}\sum_{j\in A_t}x_j\Big\|^2$  (1)
subject to  $A_1\sqcup\cdots\sqcup A_k=\{1,\ldots,N\}$
Letting $D$ denote the matrix defined entrywise by $D_{ij}=\|x_i-x_j\|^2$, a straightforward calculation gives the following “lifted” expression for the k-means objective:
$\tfrac{1}{2}\operatorname{tr}(DX)$, where $X=\sum_{t=1}^{k}\tfrac{1}{|A_t|}\mathbf{1}_{A_t}\mathbf{1}_{A_t}^\top$  (2)
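The lifted identity can be checked numerically. The following sketch builds both sides for an arbitrary partition of random points; the variable names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, k = 30, 5, 3
points = rng.standard_normal((N, d))
labels = np.arange(N) % k                 # an arbitrary partition into k clusters
# squared-distance matrix D_ij = ||x_i - x_j||^2
D = np.sum((points[:, None, :] - points[None, :, :])**2, axis=2)

X = np.zeros((N, N))
kmeans_val = 0.0
for t in range(k):
    idx = np.where(labels == t)[0]
    X[np.ix_(idx, idx)] = 1.0 / len(idx)  # lifted matrix of the form (2)
    centroid = points[idx].mean(axis=0)
    kmeans_val += np.sum((points[idx] - centroid)**2)

lifted_val = 0.5 * np.trace(D @ X)        # agrees with the k-means value
```

The agreement follows from the standard identity that the within-cluster sum of squared deviations equals half the average of pairwise squared distances within the cluster.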
The matrix necessarily satisfies various convex constraints, and relaxing to such constraints leads to the following semidefinite relaxation of (1), first introduced by Peng and Wei in [PW07]:
minimize  $\operatorname{tr}(DX)$  (3)
subject to  $\operatorname{tr}(X)=k,\quad X\mathbf{1}=\mathbf{1},\quad X\geq 0,\quad X\succeq 0$
Here, $X\geq 0$ means that $X$ is entrywise nonnegative, whereas $X\succeq 0$ means that $X$ is symmetric and positive semidefinite.
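As a sanity check, every lifted partition matrix of the form (2) satisfies the convex constraints just described, so the SDP is indeed a relaxation of (1). A minimal numpy sketch (the balanced partition is an arbitrary illustrative choice):

```python
import numpy as np

N, k = 12, 3
labels = np.arange(N) % k                 # an arbitrary balanced partition
X = np.zeros((N, N))
for t in range(k):
    idx = np.where(labels == t)[0]
    X[np.ix_(idx, idx)] = 1.0 / len(idx)  # lifted matrix of the form (2)

trace_ok = np.isclose(np.trace(X), k)                  # trace equals k
rowsum_ok = np.allclose(X @ np.ones(N), 1.0)           # rows sum to 1
nonneg_ok = bool(np.all(X >= 0))                       # entrywise nonnegative
psd_ok = bool(np.min(np.linalg.eigvalsh(X)) > -1e-9)   # positive semidefinite
```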
As mentioned earlier, this semidefinite relaxation is known to be tight for a particular data model called the stochastic ball model [NW15, ABC15, IMPV15]
. In this paper, we study its performance under subgaussian mixture models, which include the stochastic ball model and the Gaussian mixture model as special cases. The SDP is not typically tight under this general model, but the optimizer can be interpreted as a denoised version of the data and can be rounded in order to produce a good estimate for the centers (and therefore produce a good clustering).
To see this, let denote the matrix whose columns are the points . Notice that whenever has the form (2), then for each , has columns equal to the centroid of points assigned to . In particular, if is k-means-optimal, then reports the k-means-optimal centroids (with appropriate multiplicities). Next, we note that every SDP-feasible matrix satisfies , and so
is a stochastic matrix, meaning each column of
is still a weighted average of columns from . Intuitively, if the SDP relaxation (3) were close to being tight, then the SDP-optimal would make the columns of close to the k-means-optimal centroids. Empirically, this appears to be the case (see Figure 1 for an illustration). Overall, we may interpret as a denoised version of the original data , and we leverage this strengthened signal to identify good estimates for the k-means-optimal centroids. What follows is a summary of our relax-and-round procedure for (approximately) solving the k-means problem (1):
Relax-and-round k-means clustering procedure. Given and data matrix , do: (i) Compute the squared-distance matrix defined by . (ii) Solve the k-means semidefinite program (3), resulting in optimizer . (iii) Cluster the columns of the denoised data matrix .
For step (iii), we find there tend to be vectors that appear as columns in with particularly high frequency, and so we are inclined to use these as estimators for the k-means-optimal centroids (see Figure 1, for example). Running Lloyd’s algorithm for step (iii) also works well in practice. To obtain theoretical guarantees, we instead find the columns of for which balls of a certain radius centered at these points in contain the most columns of (see Theorem 15 for more details). An implementation of our procedure is available on GitHub [Vil16].
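The ball-counting rounding of step (iii) can be sketched as follows. Running an actual SDP solver is outside the scope of this sketch, so we substitute the lifted ground-truth partition matrix for the SDP optimizer; the radius and the greedy selection are illustrative choices, not the exact procedure of Theorem 15:

```python
import numpy as np

def round_centers(Y, k, r):
    """Greedily pick k columns of the denoised matrix Y (d x N):
    each pick is the column whose radius-r ball contains the most
    remaining columns; that ball is then removed."""
    cols = Y.T.copy()
    centers = []
    for _ in range(k):
        d2 = np.sum((cols[:, None, :] - cols[None, :, :])**2, axis=2)
        i = int(np.argmax(np.sum(d2 <= r**2, axis=1)))
        centers.append(cols[i])
        cols = cols[d2[i] > r**2]
    return np.array(centers)

rng = np.random.default_rng(2)
mu = np.array([[0.0, 0.0], [6.0, 0.0]])         # well-separated planted centers
pts = np.vstack([m + rng.standard_normal((50, 2)) for m in mu])
P = pts.T                                       # d x N, points as columns
X = np.zeros((100, 100))                        # stand-in for the SDP optimizer:
X[:50, :50] = 1.0 / 50                          # the lifted ground-truth
X[50:, 50:] = 1.0 / 50                          # partition matrix
Y = P @ X                                       # denoised data: columns collapse to centroids
est = round_centers(Y, k=2, r=1.0)
est = est[np.argsort(est[:, 0])]                # match the ordering of mu
```

Here the denoised columns collapse exactly onto the two empirical centroids, so the rounding recovers them; with an actual SDP optimizer the columns merely concentrate near the centroids, which is the situation the radius parameter is meant to handle.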
Our contribution. We study performance guarantees for the k-means semidefinite relaxation (3) when the point cloud is drawn from a subgaussian mixture model. Adapting ideas from Guédon and Vershynin [GV14], we obtain approximation guarantees comparable with the state of the art for learning mixtures of Gaussians despite the fact that our algorithm is a generic k-means solver and uses no model assumptions. To the best of our knowledge, no convex relaxation has been used before to provide theoretical guarantees for clustering mixtures of Gaussians. We also provide conditional lower bounds on how well a k-means solver can approximate the centers of Gaussians.
Organization of this paper. In Section 2, we present a summary of our results and give a high-level explanation of our proof techniques. We also illustrate the performance of our relax-and-round algorithm on the MNIST dataset of handwritten digits. Our theoretical results consist of an approximation theorem for the SDP (proved in Section 3), a denoising consequence of the approximation (explained in Section 4), and a rounding step (presented in Section 5). We also study the fundamental limits for estimating Gaussian centers by k-means clustering (see Section 2.2).
2 Summary of results
This paper has two main results. First, we present a relax-and-round algorithm for k-means clustering that well-approximates the centers of sufficiently separated subgaussians. Second, we provide a conditional result on the minimum separation necessary for Gaussian center approximation by k-means clustering. The first result establishes that the k-means SDP (3) performs well with noisy data (despite not being tight), and the second result helps to illustrate how sharp our analysis is. This section discusses these results, and then applies our algorithm to the MNIST dataset [LC10].
2.1 Performance guarantee for the k-means SDP
Our relax-and-round performance guarantee consists of three steps.
Step 1: Approximation. We adapt an approach used by Guédon and Vershynin to provide approximation guarantees for a certain semidefinite program under the stochastic block model for graph clustering [GV14].
Given the points drawn independently from , consider the squared-distance matrix and the corresponding minimizer of the SDP (3). We first construct a “reference” matrix such that the SDP (3) is tight when with optimizer . To this end, take , let denote the minimizer of (3), and let denote the minimizer of (3) when is replaced by the reference matrix defined as
(4) 
where , and is a parameter to be chosen later. Admittedly, this choice of reference is quite technical, an artifact of the entries of being statistically dependent. Despite its lack of beauty, our choice of reference enjoys the following important property:
Lemma 1.
Let denote the indicator function for the indices corresponding to points drawn from the th subgaussian. If whenever , then .
Proof.
Let be feasible for the SDP (3). Replacing with in (3), we may use the SDP constraints and to obtain the bound
Furthermore, since whenever , and since , we have that equality occurs precisely for the such that equals zero whenever . The other constraints on then force to have the claimed form (i.e., is the unique minimizer). ∎
Now that we know that is the solution we want, it remains to demonstrate regularity in the sense that provided the subgaussian centers are sufficiently separated. For this, we use the following scheme:

If then is small (Lemma 8).

If the centers are separated by , then .
What follows is the result of this analysis:
Theorem 2.
Fix . There exist universal constants such that if
then with probability provided
where is the minimal cluster center separation.
See Section 3 for the proof. Note that if we remove the assumption , we obtain the result .
Step 2: Denoising. Suppose we solve the SDP (3) for an instance of the subgaussian mixture model where is sufficiently large. Then Theorem 2 ensures that the solution is close to the ground truth. At this point, it remains to convert into an estimate for the centers . Let denote the matrix whose th column is . Then is an matrix whose th column is , the centroid of the th cluster, which converges to as , (and does not change when varies, for a fixed ), and so one might expect to have its columns be close to the ’s. In fact, we can interpret the columns of as a denoised version of the points (see Figure 1).
To illustrate this idea, assume the points come from in for each . Then we have
(5) 
Letting denote the th column of (i.e., the th estimate of ), in Section 4 we obtain the following denoising result:
Corollary 3.
If , then
with high probability as .
Note that Corollary 3 guarantees denoising in the regime . This is a corollary of a more technical result (Theorem 11), which guarantees denoising for certain configurations of subgaussians (e.g., when the ’s are vertices of a regular simplex) in the regime .
At this point, we comment that one might expect this level of denoising from principal component analysis (PCA) when the mixture of subgaussians is sufficiently nice. To see this, suppose we have spherical Gaussians of equal entrywise variance
centered at vertices of a regular simplex. Then in the large-sample limit, we expect PCA to approach the dimensional affine subspace that contains the centers. Projecting onto this affine subspace will not change the variance of any Gaussian in any of the principal components, and so one expects the mean squared deviation of the projected points from their respective Gaussian centers to be . By contrast, we find that in practice, the SDP denoises substantially more than PCA does. For example, Figures 1 and 3 illustrate cases in which PCA would not change the data, since the data already lies in dimensional space, and yet the SDP considerably enhances the signal. In fact, we observe empirically that the matrix has low rank and that has repeated columns. This doesn’t come as a complete surprise, considering SDP optimizers are known to exhibit low rank [Sha82, Bar95, Pat98]. Still, we observe that the optimizer tends to have rank when clustering points from the mixture model. This is not predicted by existing bounds, and we did not leverage this feature in our analysis, though it certainly warrants further investigation.
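The PCA heuristic above can be checked numerically: for unit-variance spherical Gaussians at (scaled) simplex vertices, projecting onto the top k−1 principal directions leaves about (k−1)σ² of mean squared deviation per point, rather than removing the within-subspace noise. A sketch under these assumptions (all parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma, n_per = 20, 3, 1.0, 5000
centers = 10.0 * np.eye(d)[:k]            # 3 pairwise-equidistant (simplex) centers
labels = np.repeat(np.arange(k), n_per)
X = centers[labels] + sigma * rng.standard_normal((k * n_per, d))

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Vt[:k - 1].T @ Vt[:k - 1]          # projector onto top (k-1) principal directions
dev = (X - centers[labels]) @ proj        # deviations from true centers, projected
mse = float(np.mean(np.sum(dev**2, axis=1)))   # roughly (k-1) * sigma**2
```

Each projected deviation retains one unit of variance per retained principal direction, which is exactly the point of the paragraph: PCA trims the ambient dimension but not the per-direction noise.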
Step 3: Rounding. In Section 5, we present a rounding scheme that provides a clustering of the original data from the denoised results of the SDP (Theorem 15). In general, the cost of rounding is a factor of in the average squared deviation of our estimates. Under the same hypothesis as Corollary 3, we have that there exists a permutation on such that
(6) 
where is what our algorithm chooses as the th center estimate. Much like the denoising portion, we also have a more technical result that allows one to replace the right-hand side of (6) with for sufficiently nice configurations of subgaussians. As such, we can estimate Gaussian centers with mean squared error provided the centers have pairwise distance . This contrasts with the seminal work of Dasgupta [Das99], which gives mean squared error when the pairwise distances are . As such, our guarantee replaces dimensionality dependence with model order–dependence, which is an improvement to the state of the art in the regime . In the next section, we indicate that model order–dependence cannot be completely removed when using k-means to estimate the centers.
Before concluding this section, we want to clarify the nature of our approximation guarantee (6). Since centroids correspond to a partition of Euclidean space, our guarantee says something about how “close” our k-means partition is to the “true” partition. By contrast, the usual approximation guarantees for relax-and-round algorithms compare values of the objective function (e.g., the k-means value of the algorithm’s output is within a factor of 2 of the minimum). Also, the latter sort of optimal value–based approximation guarantee cannot be used to produce the sort of optimizer-based guarantee we want. To illustrate this, imagine a relax-and-round algorithm for k-means that produces a near-optimal partition with for data coming from a single spherical Gaussian. We expect every subspace of codimension 1 to separate the data into a near-optimal partition, but the partitions are very different from each other when the dimension , and so a guarantee of the form (6) will not hold.
2.2 Fundamental limits of k-means clustering
In Section 5, we provide a rounding scheme that, when applied to the output of the k-means SDP, produces estimates of the subgaussian centers. But how good is our guarantee? Observe the following two issues: (i) The amount of tolerable noise and our bound on the error both depend on . (ii) Our bound on the error does not vanish with .
In this section, we give a conditional result showing that these issues are actually artifacts of k-means; that is, both of these would arise if one were to estimate the Gaussian centers with the k-means-optimal centroids (though these centroids might be computationally difficult to obtain). The main trick in our argument is that, in some cases, so-called “Voronoi means” appear to serve as a good proxy for the k-means-optimal centroids. This trick is useful because the Voronoi means are much easier to analyze. We start by providing some motivation for the Voronoi means.
Given , let denote any minimizer of the k-means objective
and define the k-means-optimal centroids by
(Note that the k-means minimizer is unique for generic .) Given , for each , consider the Voronoi cell
Given a probability distribution over , define the Voronoi means by
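The Voronoi means are straightforward to estimate by Monte Carlo: draw samples from the distribution, bin them by nearest candidate center, and average each bin. A minimal sketch for a two-component Gaussian mixture on the line, with candidate centers placed at the true means (all parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n = 1.0, 200_000
gammas = np.array([-1.0, 1.0])            # candidate centers = true component means
# equally weighted mixture of N(-1, sigma^2) and N(+1, sigma^2)
samples = rng.choice(gammas, size=n) + sigma * rng.standard_normal(n)
# Voronoi cell of each sample: index of its nearest candidate center
cells = np.argmin(np.abs(samples[:, None] - gammas[None, :]), axis=1)
voronoi_means = np.array([samples[cells == i].mean() for i in range(2)])
```

With these parameters the estimates land near ±1.17, outside the true means: truncating the dominant component to its half-line pushes its conditional mean outward by more than the other component's captured tail pulls it inward.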
Finally, we say is a stable isogon if

,

the symmetry group of acts transitively on , and

for each , the stabilizer has the property that
(Examples of stable isogons include regular and quasi-regular polyhedra, as well as highly symmetric frames [BW13].) With this background, we formulate the following conjecture:
Conjecture 4 (Voronoi Means Conjecture).
Draw points independently from a mixture of equally weighted spherical Gaussians of equal variance centered at the points in a stable isogon . Then
converges to zero in probability as , i.e., the k-means-optimal centroids converge to the Voronoi means.
Our conditional result will only require the Voronoi Means Conjecture in the special case where the ’s form an orthoplex (see Lemma 6). Figure 2 provides some numerical evidence in favor of this conjecture (specifically in the orthoplex case). We note that the hypothesis that be a stable isogon appears somewhat necessary. For example, simulations suggest that the conjecture does not hold when is three equally spaced points on a line (in this case, the k-means-optimal centroids appear to be more separated than the Voronoi means). The following theorem provides some insight as to why the stable isogon hypothesis is reasonable:
Theorem 5.
Let be a mixture of equally weighted spherical Gaussians of equal variance centered at the points of a stable isogon . Then there exists such that for each .
See Section 6 for the proof. To interpret Theorem 5, consider k-means optimization over the distribution instead of a large sample drawn from . This optimization amounts to finding points in that minimize
(7) 
Intuitively, the optimal is a good proxy for the k-means-optimal centroids when is large (and one might make this rigorous using the plug-in principle with the Glivenko–Cantelli theorem). What Theorem 5 provides is that, when is a stable isogon, the Voronoi means have the same Voronoi cells as do . As such, if one were to initialize Lloyd’s algorithm at the Gaussian centers to solve (7), the algorithm converges to the Voronoi means in one step. Overall, one should interpret Theorem 5 as a statement about how the Voronoi means locally minimize (7), whereas the Voronoi Means Conjecture is a statement about global minimization.
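The one-step behavior described above can be probed empirically. In the sketch below we take the four vertices of a square (a regular polygon, hence a stable isogon) as Gaussian centers, run one Lloyd step starting from those centers to get empirical Voronoi means, and observe that a second step barely moves them, since both center sets induce the same quadrant cells; all parameters are illustrative:

```python
import numpy as np

def lloyd_step(pts, centers):
    """One step of Lloyd's algorithm: assign each point to its nearest
    center, then replace each center by the mean of its cell."""
    d2 = np.sum((pts[:, None, :] - centers[None, :, :])**2, axis=2)
    cells = np.argmin(d2, axis=1)
    return np.array([pts[cells == i].mean(axis=0) for i in range(len(centers))])

rng = np.random.default_rng(5)
centers = np.array([[2.0, 2.0], [-2.0, 2.0], [-2.0, -2.0], [2.0, -2.0]])
n_per = 25_000
pts = np.vstack([c + rng.standard_normal((n_per, 2)) for c in centers])

step1 = lloyd_step(pts, centers)   # empirical Voronoi means of the centers
step2 = lloyd_step(pts, step1)     # nearly unchanged: same Voronoi cells
```

The near fixed point illustrates the local-minimization reading of Theorem 5; it says nothing about the global claim of the Voronoi Means Conjecture.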
As indicated earlier, we will use the Voronoi Means Conjecture in the special case where is an orthoplex:
Lemma 6.
The standard orthoplex of dimension , given by the first columns of and of , is a stable isogon.
Proof.
First, we have that , implying (si1). Next, every can be reached from any with the appropriate signed transposition, which permutes , and is therefore in the symmetry group ; as such, we conclude (si2). For (si3), pick and let denote its nonzero entry. Consider the matrix , where denotes the th identity basis element. Then
, and the eigenspace of
with eigenvalue is , and so we are done. ∎
What follows is the main result of this subsection:
Theorem 7.
Let be even, and let denote the standard orthoplex of dimension . Then for every , either
where denotes the mixture of equally weighted spherical Gaussians of entrywise variance centered at the members of .
2.3 Numerical example: Clustering the MNIST dataset
In this section, we apply our clustering algorithm to the MNIST handwritten digits dataset [LC10]. This dataset consists of 70,000 different grayscale images, reshaped as vectors; 55,000 of them are considered the training set, 10,000 the test set, and the remaining 5,000 the validation set.
Clustering the raw data gives poor results (due to 4’s and 9’s being similar, for example), so we first learn meaningful features, and then cluster the data in feature space. To simplify feature extraction, we used the first example from the TensorFlow tutorial
[AAB15]. This consists of a one-layer neural network
, where is the softmax function, is a matrix to learn, and is a vector to learn. As the tutorial shows, the neural network is trained for 1,000 iterations, each iteration using batches of 100 random points from the training set. Training the neural network amounts to finding and that fit the training set well. After selecting these parameters, we run the trained neural network on the first 1,000 elements of the test set, obtaining , where each is a vector representing the probabilities of being each digit. Since is a probability vector, its entries sum to , and so the feature space is actually dimensional.
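The feature map just described is the composition of one affine layer with a softmax. A minimal numpy sketch of the forward pass (the weights here are random placeholders, not trained parameters, and the 784-dimensional input stands for a flattened 28×28 MNIST image):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: exponentiate after shifting by the max."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)
W = 0.01 * rng.standard_normal((10, 784))  # weight matrix (would be learned)
b = np.zeros(10)                           # bias vector (would be learned)
x = rng.random(784)                        # a flattened 28x28 grayscale image
y = softmax(W @ x + b)                     # ten digit "probabilities"
```

Since the output sums to 1, the ten coordinates carry only nine degrees of freedom, which is why the feature space above is one dimension lower than the number of digits.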
For this experiment, we cluster with two different algorithms: (i) MATLAB’s built-in implementation of k-means++, and (ii) our relax-and-round algorithm based on the k-means semidefinite relaxation (3). (The results of the latter alternative are illustrated in Figure 3.)
Since each run of k-means++ uses a random initialization that impacts the partition, we ran this algorithm 100 times. In fact, the k-means value of the output varied quite a bit: the all-time low was , the all-time high was , and the median was ; the all-time low was reached in out of the trials. Since our relax-and-round alternative has no randomness, the outcome is deterministic, and its k-means value was , i.e., identical to the all-time low from k-means++. By comparison, the k-means value of the planted solution (i.e., clustering according to the hidden digit label) was 103.5430, and the value of the SDP (which serves as a lower bound on the optimal k-means value) was . As such, not only did our relax-and-round alternative produce the best clustering that k-means++ could find, it also provided a certificate that no clustering has a k-means value that is even 1.5% better.
Recalling the nature of our approximation guarantees, we also want to know how well the relax-and-round algorithm’s clustering captures the ground truth. To evaluate this, we determined a labeling of the clusters for which the resulting classification exhibited a minimal misclassification rate. (This amounts to minimizing a linear objective over all permutation matrices, which can be relaxed to a generically tight linear program over doubly stochastic matrices.) For k-means++, the all-time low misclassification rate was (again, accomplished by of the trials), the all-time high was , and the median was . As one might expect, the relax-and-round output had a misclassification rate of .

3 Proof of Theorem 2
By the following lemma, it suffices to bound :
Lemma 8.
.
Proof.
First, by Lemma 1, we have . We also claim that . To see this, first note that and , and so the th entry of can be interpreted as a convex combination of the entries of . Let
be an eigenvector of
with eigenvalue , and let index the largest entry of (this entry is positive without loss of generality). Then , implying that . Since the eigenvalues of lie in , we may conclude that . As such, (8)
We will bound (8) in two different ways, and a convex combination of these bounds will give the result. For both bounds, we let denote the indices in the diagonal blocks, and the indices in the off-diagonal blocks, and denote the indices in the diagonal block for the cluster . In particular, denotes the matrix that equals on the diagonal blocks and is zero on the off-diagonal blocks. For the first bound, we use that , and that has nonnegative entries (since both and have nonnegative entries, and ). Recalling that , we have
(9) 
For the second bound, we first write . Since , we then have
where the last and second-to-last steps use that . Next, and , and so we may continue:
(10) 
where again, the last step uses the fact that . At this point, we have bounds of the form with and with (explicitly, (9) and (10)), and we seek a bound of the form . As such, we take the convex combination for such that and
Taking and and combining with (8) then gives
choosing sufficiently small and simplifying yields the result. ∎
We will bound in terms of the following: For each real symmetric matrix , let denote the value of the following program:
maximum  (11)  
subject to 
Lemma 9.
Put and . Then .
Proof.
Now it suffices to bound . For an matrix , consider the matrix norm
The following lemma will be useful:
Lemma 10.
where .
Proof.
The first bound follows from the classical version of Hölder’s inequality (recalling that and by design):
The second bound is a consequence of Von Neumann’s trace inequality: if the singular values of
and are respectively and then. Since is feasible in (11) we have and . Using that we get
Proof of Theorem 2.
Write . Then
Furthermore,
where is the matrix whose th column is , and is the column vector whose th entry is . Recall that
Then where
Considering Lemma 10 and that we will bound
(13) 
For the first term:
Note that if the rows , of come from a distribution with second moment matrix , then has the same distribution as , where