1 Introduction
Our study is broadly motivated by questions in high-dimensional learning. As is well known, learning in high dimensions is feasible only if the data distribution satisfies suitable prior assumptions. One such assumption is that the data distribution lies on, or is close to, a low-dimensional set embedded in a high-dimensional space, for instance a low-dimensional manifold. This latter assumption has proved to be useful in practice, as well as amenable to theoretical analysis, and it has led to a significant amount of recent work. Starting from [29, 40, 7], this set of ideas, broadly referred to as manifold learning, has been applied to a variety of problems, from supervised learning [42, 8] to clustering [45] and dimensionality reduction [7], to name a few.

Interestingly, the problem of learning the manifold itself has received less attention: given samples from a $d$-manifold $\mathcal{M}$ embedded in some ambient space $\mathcal{H}$, the problem is to learn a set that approximates $\mathcal{M}$ in a suitable sense. This problem has been considered in computational geometry, but in a setting in which the manifold is typically a hypersurface in a low-dimensional space (e.g. $\mathbb{R}^3$), and the data are typically not sampled probabilistically; see for instance [32, 30]. The problem of learning a manifold is also related to that of estimating the support of a distribution (see [17, 18] for recent surveys). In this context, some of the distances considered to measure approximation quality are the Hausdorff distance and the so-called excess-mass distance.
The reconstruction framework that we consider is related to the work of [1, 38], as well as to the framework proposed in [37], in which a manifold is approximated by a set, with performance measured by an expected distance to this set. This setting is similar to the problem of dictionary learning (see for instance [36], and extensive references therein), in which a dictionary is found by minimizing a similar reconstruction error, perhaps with additional constraints on an associated encoding of the data. Crucially, while the dictionary is learned on the empirical data, the quantity of interest is the expected reconstruction error, which is the focus of this work.
We analyze this problem by focusing on two important and widely-used algorithms, namely k-means and k-flats. The k-means algorithm can be seen to define a piecewise-constant approximation of $\mathcal{M}$. Indeed, it induces a Voronoi decomposition of $\mathcal{M}$, in which each Voronoi region is effectively approximated by a fixed mean. Given this, a natural extension is to consider higher-order approximations, such as those induced by discrete collections of $d$-dimensional affine spaces (k-flats), with possibly better resulting performance. Since $\mathcal{M}$ is a manifold, the k-flats approximation naturally resembles the way in which a manifold is locally approximated by its tangent bundle.
Our analysis extends previous results for k-means to the case in which the data-generating distribution is supported on a manifold, and provides analogous results for k-flats. We note that the k-means algorithm has been widely studied, and thus much of our analysis in this case involves combining known facts to obtain novel results. The analysis of k-flats, however, requires developing substantially new mathematical tools.
The rest of the paper is organized as follows. In Section 2, we describe the formal setting and the algorithms that we study. We begin our analysis by discussing the reconstruction properties of k-means in Section 3. In Section 4, we present and discuss our main results, whose proofs are postponed to the appendices.
2 Learning Manifolds
Let $\mathcal{H}$ be a Hilbert space with inner product $\langle \cdot, \cdot \rangle$, endowed with a Borel probability measure $\rho$ supported on a compact, smooth manifold $\mathcal{M} \subset \mathcal{H}$. We assume the data to be given by a training set of $n$ samples $X_n = (x_1, \dots, x_n)$, drawn identically and independently with respect to $\rho$.

Our goal is to learn a set $S$ that approximates the manifold well. The approximation (learning) error is measured by the expected reconstruction error
$$\mathcal{E}(S) = \int_{\mathcal{M}} d(x, S)^2 \, d\rho(x), \qquad (1)$$
where the distance between a point $x \in \mathcal{H}$ and a set $S \subseteq \mathcal{H}$ is $d(x, S) = \inf_{x' \in S} \|x - x'\|$, with $\|x\|^2 = \langle x, x \rangle$. This is the same reconstruction measure that has been the recent focus of [37, 5, 38].
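For concreteness, the empirical counterpart of this error measure, for a finite candidate set $S$, can be sketched as follows (a minimal sketch of ours; the function names are not from the paper):

```python
import numpy as np

def dist2_to_set(x, S):
    """Squared distance d(x, S)^2 from a point x to a finite set S (rows of S)."""
    return np.min(np.sum((S - x) ** 2, axis=1))

def empirical_error(X, S):
    """Empirical reconstruction error: average of d(x_i, S)^2 over the sample."""
    return float(np.mean([dist2_to_set(x, S) for x in X]))
```

For instance, `empirical_error(X, X)` is zero, since every sample lies in the candidate set.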
It is easy to see that any set $S$ such that $\mathcal{M} \subseteq S$ will have zero risk, with $\mathcal{M}$ being the “smallest” such set (with respect to set containment). In other words, the above error measure does not introduce an explicit penalty on the “size” of $S$: enlarging any given $S$ can never increase the learning error.
With this observation in mind, we study specific learning algorithms that, given the data, produce a set belonging to some restricted hypothesis space (e.g. sets of size $k$ for k-means), which effectively introduces a constraint on the size of the sets.
Finally, note that the risk of Equation 1 is non-negative and, if the hypothesis space is sufficiently rich, the risk of an unsupervised algorithm may converge to zero under suitable conditions.
2.1 Using K-Means and K-Flats for Piecewise Manifold Approximation
In this work, we focus on two specific algorithms, namely k-means [34, 33] and k-flats [12]. Although typically discussed in the Euclidean case, their definition can be easily extended to a Hilbert-space setting. The study of manifolds embedded in a Hilbert space is of special interest when considering nonlinear (kernel) versions of the algorithms [20]. More generally, this setting can be seen as a limit case when dealing with high-dimensional data. Naturally, the more classical setting of an absolutely continuous distribution over $d$-dimensional Euclidean space is simply a particular case, in which $\mathcal{H} = \mathbb{R}^d$, and $\mathcal{M}$ is a domain with positive Lebesgue measure.

K-Means. Let $\mathcal{S}_k$ be the class of sets of size $k$ in $\mathcal{H}$. Given a training set $X_n$ and a choice of $k$, k-means is defined by the minimization over $S \in \mathcal{S}_k$ of the empirical reconstruction error
$$\mathcal{E}_n(S) = \frac{1}{n} \sum_{i=1}^{n} d(x_i, S)^2, \qquad (2)$$
where, for any fixed set $S$, $\mathcal{E}_n(S)$ is an unbiased empirical estimate of $\mathcal{E}(S)$, so that k-means can be seen to be performing a kind of empirical risk minimization [13, 9, 37, 10].
A minimizer of Equation 2 over $\mathcal{S}_k$ is a discrete set of $k$ means $S_{k,n}$, which induces a Dirichlet-Voronoi tiling of $\mathcal{H}$: a collection of regions, each closest to a common mean [4] (in our notation, the subscript $n$ denotes the dependence of $S_{k,n}$ on the sample, while $k$ refers to its size). By virtue of being a minimizing set, each mean must occupy the center of mass of the samples in its Voronoi region. These two facts imply that it is possible to compute a local minimum of the empirical risk by using a greedy coordinate-descent relaxation, namely Lloyd's algorithm [33]. Furthermore, given a finite sample $X_n$, the number of locally-minimizing sets is also finite, since (by the center-of-mass condition) it cannot exceed the number of possible partitions of $X_n$ into $k$ groups, and therefore the global minimum must be attainable. Even though Lloyd's algorithm provides no guarantees of closeness to the global minimizer, in practice it is possible to use a randomized approximation algorithm, such as k-means++ [3], which provides guarantees of approximation to the global minimum in expectation with respect to the randomization.
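Lloyd's relaxation can be sketched in this notation as follows (our own illustration, with naive random seeding; k-means++ would replace the seeding step):

```python
import numpy as np

def lloyd(X, k, iters=50, seed=0):
    """Greedy coordinate descent on the empirical k-means risk (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assignment step: index of the closest mean (Voronoi region) per sample.
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each mean moves to the center of mass of its Voronoi region.
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    return means, labels
```

Each step can only decrease the empirical risk (2), which is why the iteration converges to a local minimum.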
K-Flats. Let $\mathcal{F}_k$ be the class of collections of $k$ flats (affine spaces) of dimension $d$. For any value of $k$, k-flats, analogously to k-means, aims at finding the set $F_{k,n} \in \mathcal{F}_k$ that minimizes the empirical reconstruction (2) over $\mathcal{F}_k$. By an argument similar to the one used for k-means, a global minimizer must be attainable, and a Lloyd-type relaxation converges to a local minimum. Note that, in this case, given a Voronoi partition of $\mathcal{H}$ into regions closest to each flat, new optimizing flats for that partition can be computed by a truncated PCA solution on the samples falling in each region.
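One such Lloyd-type iteration, with the truncated-PCA refit, can be sketched as follows (a sketch of ours; the function names are assumptions, not the paper's):

```python
import numpy as np

def fit_flat(X, d):
    """Best-fit d-flat for the rows of X: mean point plus top-d principal directions."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:d]                    # rows of Vt[:d] are an orthonormal basis

def dist2_to_flat(X, mu, B):
    """Squared distance from each row of X to the flat through mu spanned by rows of B."""
    Y = X - mu
    proj = Y @ B.T                       # coordinates in the flat's basis
    return (Y ** 2).sum(axis=1) - (proj ** 2).sum(axis=1)

def kflats_step(X, flats):
    """One Lloyd-type step: Voronoi assignment to the nearest flat, then PCA refit."""
    D = np.stack([dist2_to_flat(X, mu, B) for mu, B in flats], axis=1)
    labels = D.argmin(axis=1)
    d = flats[0][1].shape[0]
    new_flats = [fit_flat(X[labels == j], d) if np.any(labels == j) else flats[j]
                 for j in range(len(flats))]
    return new_flats, labels
```

As with Lloyd's algorithm for k-means, both the assignment and the refit steps can only decrease the empirical risk.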
2.2 Learning a Manifold with K-Means and K-Flats
In practice, k-means is often interpreted as a clustering algorithm, with clusters defined by the Voronoi diagram of the set of means $S_{k,n}$. In this interpretation, Equation 2 is simply rewritten by summing over the Voronoi regions, and adding all pairwise distances between samples in each region (the intra-cluster distances). For instance, this point of view is considered in [14], where k-means is studied from an information-theoretic perspective. K-means can also be interpreted as performing vector quantization, where the goal is to minimize the encoding error associated to a nearest-neighbor quantizer [23]. Interestingly, in the limit of increasing sample size, this problem coincides, in a precise sense [39], with the problem of optimal quantization of probability distributions (see for instance the excellent monograph [24]).

When the data-generating distribution is supported on a manifold $\mathcal{M}$, k-means can be seen to be approximating points on the manifold by a discrete set of means. Analogously to the Euclidean setting, this induces a Voronoi decomposition of $\mathcal{M}$, in which each Voronoi region is effectively approximated by a fixed mean (in this sense k-means produces a piecewise-constant approximation of $\mathcal{M}$). As in the Euclidean setting, the limit of this problem with increasing sample size is precisely the problem of optimal quantization of distributions on manifolds, which is the subject of significant recent work in the field of optimal quantization [26, 27].
In this paper, we take the above view of k-means as defining a (piecewise-constant) approximation of the manifold supporting the data distribution. In particular, we are interested in the behavior of the expected reconstruction error $\mathcal{E}(S_{k,n})$, for varying $k$ and $n$. This perspective has an interesting relation to dictionary learning, in which one is interested in finding a dictionary, and an associated representation, that allow one to approximately reconstruct a finite set of data points/signals. In this interpretation, the set of $k$ means can be seen as a dictionary of size $k$ that produces a maximally sparse representation (the k-means encoding); see for example [36] and references therein. Crucially, while the dictionary is learned on the available empirical data, the quantity of interest is the expected reconstruction error, and the question of characterizing the performance with respect to this latter quantity naturally arises.
Since k-means produces a piecewise-constant approximation of the data, a natural idea is to consider higher orders of approximation, such as approximation by discrete collections of $d$-dimensional affine spaces (k-flats), with possibly better performance. Since $\mathcal{M}$ is a manifold, the approximation induced by k-flats may more naturally resemble the way in which a manifold is locally approximated by its tangent bundle. We provide in Section 4.2 a partial answer to this question.
3 Reconstruction Properties of K-Means
Since we are interested in the behavior of the expected reconstruction error (1) of k-means and k-flats for varying $k$ and $n$, before analyzing this behavior we consider what is currently known about this problem, based on previous work. While k-flats is a relatively new algorithm whose behavior is not yet well understood, several properties of k-means are currently known.
Recall that k-means finds a discrete set $S_{k,n}$ of size $k$ that best approximates the samples in the sense of (2). Clearly, as $k$ increases, the empirical reconstruction error cannot increase, and typically decreases. However, we are ultimately interested in the expected reconstruction error, and therefore would like to understand the behavior of $\mathcal{E}(S_{k,n})$ with varying $k$.
In the context of optimal quantization, the behavior of the expected reconstruction error has been considered for an approximating set $S_k^*$ obtained by minimizing the expected reconstruction error itself over the hypothesis space $\mathcal{S}_k$: the set $S_k^*$ can thus be interpreted as the output of a population, or infinite-sample, version of k-means. In this case, it is possible to show that $\mathcal{E}(S_k^*)$ is a non-increasing function of $k$ and, in fact, to derive explicit rates. For example, in the case $\mathcal{H} = \mathbb{R}^d$, and under fairly general technical assumptions, it is possible to show that $\mathcal{E}(S_k^*) = \Theta(k^{-2/d})$, where the constants depend on $d$ and $\rho$ [24].
In machine learning, the properties of k-means have been studied, for fixed $k$, by considering the excess reconstruction error $\mathcal{E}(S_{k,n}) - \mathcal{E}(S_k^*)$. In particular, this quantity has been studied for $\mathcal{H} = \mathbb{R}^d$, and shown to be, with high probability, of order $\sqrt{kd/n}$, up to logarithmic factors [37]. The case where $\mathcal{H}$ is a Hilbert space has been considered in [37, 10], where an upper bound of order $k/\sqrt{n}$ is proven to hold with high probability. The more general setting where $\mathcal{H}$ is a metric space has been studied in [9].

When analyzing the behavior of $\mathcal{E}(S_{k,n})$, and in the particular case $\mathcal{H} = \mathbb{R}^d$, the above results can be combined to obtain, with high probability, a bound of the form
$$\mathcal{E}(S_{k,n}) \le C \left( \frac{k}{\sqrt{n}} + k^{-\frac{2}{d}} \right) \qquad (3)$$
up to logarithmic factors, where the constant $C$ does not depend on $k$ or $n$ (a complete derivation is given in the Appendix). The above inequality suggests a somewhat surprising effect: the expected reconstruction properties of k-means may be described by a trade-off between a statistical error (of order $k/\sqrt{n}$) and a geometric approximation error (of order $k^{-2/d}$).
The existence of such a trade-off between the approximation and statistical errors may itself not be entirely obvious; see the discussion in [5]. For instance, in the k-means problem, it is intuitive that, as more means are inserted, the expected distance from a random sample to the means should decrease, and one might expect a similar behavior for the expected reconstruction error. This observation naturally begs the question of whether and when this trade-off really exists, or whether it is simply a result of looseness in the bounds. In particular, one may ask how tight the bound (3) is.
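For intuition, assuming a bound of the form $C(k/\sqrt{n} + k^{-2/d})$ (with an arbitrary constant), the trade-off can be probed numerically: the statistical term grows with $k$ while the approximation term shrinks, and the brute-force minimizer matches the stationary point of the bound (a sketch under this assumed form, not the paper's derivation):

```python
import numpy as np

def bound(k, n, d):
    """Assumed bound: statistical term k/sqrt(n) plus approximation term k^(-2/d)."""
    return k / np.sqrt(n) + k ** (-2.0 / d)

n, d = 10**6, 4
ks = np.arange(1, 2001)
k_numeric = ks[np.argmin(bound(ks, n, d))]                 # brute-force minimizer
k_formula = ((2.0 / d) * np.sqrt(n)) ** (d / (d + 2.0))    # zero of the derivative
```

For $n = 10^6$ and $d = 4$, both give $k \approx 63$; the minimizing $k$ grows with $n$, which is the origin of the learning rates in Section 4.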
While the bound on the approximation error is known to be tight for sufficiently large $k$ [24], the remaining terms (which are dominated by the statistical error) are derived by controlling the supremum of an empirical process
$$\sup_{S \in \mathcal{S}_k} \left| \mathcal{E}(S) - \mathcal{E}_n(S) \right|, \qquad (4)$$
and it is unknown whether available bounds for it are tight [37]. Indeed, it is not clear how close the distortion redundancy $\mathcal{E}(S_{k,n}) - \mathcal{E}(S_k^*)$ is to its known lower bound, of order $\sqrt{k^{1-4/d}/n}$ in expectation [5]. More importantly, we are not aware of a lower bound for $\mathcal{E}(S_{k,n})$ itself. Indeed, as pointed out in [5], “The exact dependence of the minimax distortion redundancy on k and d is still a challenging open problem.”
Finally, we note that, whenever a trade-off can be shown to hold, it may be used to justify a heuristic for choosing $k$ empirically as the value that minimizes the reconstruction error on a hold-out set. In Figure 1 we perform some simple numerical simulations showing that the trade-off indeed occurs in certain regimes. The following example provides a situation in which a trade-off can easily be shown to occur.
Example 1.
Consider a setup in which $n = 2$ samples are drawn from a uniform distribution on the unit $d$-sphere, for large $d$ (the argument holds for any $n$ much smaller than $d$). Because $n \ll d$, with high probability the samples are nearly orthogonal, $\langle x_1, x_2 \rangle \simeq 0$, while a third sample $x$ drawn uniformly on the sphere will also very likely be nearly orthogonal to both [31]. The k-means solution with $k = 2$ on this dataset is clearly $S = \{x_1, x_2\}$ (Fig 2(a)). Indeed, since $\langle x, x_i \rangle \simeq 0$ (Fig 2(b)), it is $d(x, S)^2 \simeq 2$ with very high probability. In this case, it is better to place a single mean closer to the origin (for the midpoint of the samples, $d(x, S)^2 \simeq 1.5$), than to place two means at the sample locations. This example is sufficiently simple that the exact k-means solution is known, but the effect can be observed in more complex settings.

4 Main Results
Contributions. Our work extends previous results in two different directions:

We provide an analysis of k-means for the case in which the data-generating distribution is supported on a manifold embedded in a Hilbert space. In particular, in this setting: 1) we derive new results on the approximation error, and 2) we derive new sample-complexity results (learning rates) arising from the choice of $k$ by optimizing the resulting bound. We also analyze the case in which a solution is obtained from an approximation algorithm, such as k-means++ [3], in order to include this computational error in the bounds.

We generalize the above results from k-means to k-flats, deriving learning rates obtained from new bounds on both the statistical and the approximation errors. To the best of our knowledge, these results provide the first theoretical analysis of k-flats in either sense.
We note that the k-means algorithm has been widely studied in the past, and much of our analysis in this case involves combining known facts to obtain novel results. In the case of k-flats, however, there is currently no known analysis, and we provide novel results, as well as new bounds for each of the error terms involved.
Throughout this section we make the following technical assumption:
Assumption 1.
$\mathcal{M}$ is a smooth $d$-manifold with metric of class $\mathcal{C}^1$, contained in the unit ball of $\mathcal{H}$, and with volume measure denoted by $\mu_{\mathcal{M}}$. The probability measure $\rho$ is absolutely continuous with respect to $\mu_{\mathcal{M}}$, with density $p$.
4.1 Learning Rates for K-Means
The first result considers the idealized case in which we have access to an exact solution of the k-means problem.
Theorem 1.
Under Assumption 1, if $S_{k,n}$ is a solution of k-means, then there are constants $C_1$ and $C_2$, dependent only on $d$ and the density $p$, and a sufficiently large $n_0$ such that, by setting
$$k_n = C_1 \, n^{\frac{d}{2(d+2)}} \qquad (5)$$
and $k = k_n$, it is
$$\mathbb{E}_{X_n} \, \mathcal{E}(S_{k_n,n}) \le C_2 \, n^{-\frac{1}{d+2}} \qquad (6)$$
for all $n \ge n_0$, where the expectation is with respect to the sample, and the constant $C_2$ grows sublinearly with $d$.
Remark 1.
Note that the distinction between distributions having a density with respect to the volume measure and singular distributions is important. The bound of Equation (6) holds only when the absolutely continuous part of $\rho$ over $\mathcal{M}$ is non-vanishing. The case in which the distribution is singular over $\mathcal{M}$ requires a different analysis, and may result in faster convergence rates.
The following result considers the case where the k-means++ algorithm is used to compute the estimator.
Theorem 2.
Under Assumption 1, if $S_{k,n}$ is the solution of k-means++, then there are constants $C_1$ and $C_2$, dependent only on $d$ and the density $p$, and a sufficiently large $n_0$ such that, by setting
$$k_n = C_1 \, n^{\frac{d}{2(d+2)}} \qquad (7)$$
and $k = k_n$, it is
$$\mathbb{E} \, \mathcal{E}(S_{k_n,n}) \le C_2 \, (\log k_n) \, n^{-\frac{1}{d+2}} \qquad (8)$$
for all $n \ge n_0$, where the expectation is with respect to both the sample and the random choices in the algorithm, and the constant $C_2$ grows sublinearly with $d$.
Remark 2.
Remark 3.
Note that, according to the above theorems, choosing $k$ requires knowledge of properties of the distribution underlying the data, such as the intrinsic dimension $d$ of the support. In fact, following the ideas in [43, Section 6.35], it is easy to prove that choosing $k$ to minimize the reconstruction error on a hold-out set achieves the same learning rates (up to a logarithmic factor) adaptively, in the sense that knowledge of properties of $\rho$ is not needed.
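The hold-out heuristic can be sketched as follows (a minimal sketch; `solver` stands for any routine returning $k$ means, e.g. Lloyd's algorithm or k-means++, and is an assumption of ours):

```python
import numpy as np

def reconstruction_error(means, X):
    """Mean squared distance from each row of X to its nearest mean."""
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def choose_k_holdout(X_train, X_val, ks, solver):
    """Pick the k in ks whose solution minimizes the hold-out reconstruction error."""
    scores = {k: reconstruction_error(solver(X_train, k), X_val) for k in ks}
    return min(scores, key=scores.get), scores
```

The point of Remark 3 is that this data-driven choice of $k$ needs no knowledge of $d$ or $\rho$.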
4.2 Learning Rates for K-Flats
To study k-flats, we need to slightly strengthen Assumption 1 by adding the following:
Assumption 2.
Assume the manifold $\mathcal{M}$ to have metric of class $\mathcal{C}^3$, and finite second fundamental form II [22].
One reason for the higher-smoothness assumption is that k-flats uses a higher-order approximation, whose analysis requires a higher order of differentiability.
We begin by providing a result for k-flats on hypersurfaces (codimension one), and next extend it to manifolds in more general spaces.
Theorem 3.
Let $\mathcal{H} = \mathbb{R}^{d+1}$. Under Assumptions 1 and 2, if $F_{k,n}$ is a solution of k-flats, then there are a constant $C$, dependent only on $d$, and a sufficiently large $n_0$ such that, by setting
$$k_n = C \, n^{\frac{d}{2(d+4)}} \qquad (10)$$
and $k = k_n$, then for all $n \ge n_0$ it is
$$\mathbb{E}_{X_n} \, \mathcal{E}(F_{k_n,n}) \le C \, \kappa_{\mathcal{M}} \, n^{-\frac{2}{d+4}} \qquad (11)$$
where $\kappa_{\mathcal{M}}$ is the total root curvature of $\mathcal{M}$, $\mu_{II}$ is the measure associated with the (positive) second fundamental form, and $\kappa$ is the Gaussian curvature on $\mathcal{M}$.
In the more general case of a manifold $\mathcal{M}$ (with metric of class $\mathcal{C}^3$) embedded in a separable Hilbert space $\mathcal{H}$, we cannot make any assumption on the codimension of $\mathcal{M}$ (the dimension of the orthogonal complement to the tangent space at each point). In particular, the second fundamental form II, which is an extrinsic quantity describing how the tangent spaces bend locally, is, at every $p \in \mathcal{M}$, a map (in this case of class $\mathcal{C}^1$ by Assumption 2) from the tangent space $T_p\mathcal{M}$ to its orthogonal complement ($B$ in the notation of [22, p. 128]). Crucially, in this case, we may no longer assume the dimension of the orthogonal complement to be finite. Denote by $\|II_p\|$ the operator norm of the second fundamental form at $p$. We have:
Theorem 4.
Note that the better k-flats bounds stem from the higher approximation power of flats over points. Although this greatly complicates the setup and proofs, as well as the analysis of the constants, the resulting approximation-error bounds are of order $k^{-4/d}$, compared with the slower order $k^{-2/d}$ of k-means.
4.3 Discussion
In all the results, the final performance does not depend on the dimensionality of the embedding space (which may in fact be infinite), but only on the intrinsic dimension $d$ of the space on which the data-generating distribution is defined. The key to these results is an approximation construction in which the Voronoi regions on the manifold (the points closest to a given mean or flat) are guaranteed to have vanishing diameter in the limit of $k$ going to infinity. Under our construction, a hypersurface is approximated efficiently by tracking the variation of its tangent spaces through the second fundamental form. Where this form vanishes, the Voronoi regions of an approximation are not guaranteed to have vanishing diameter as $k$ goes to infinity, unless certain care is taken in the analysis.
An important point of interest is that the approximations are controlled by averaged quantities, such as the total root curvature (k-flats for hypersurfaces of codimension one), the total curvature (k-flats in arbitrary codimension), and the norm of the probability density (k-means), which are integrated over the domain where the distribution is defined. Note that these types of quantities have been linked to provably tight approximations in certain cases, such as for convex manifolds [25, 16], in contrast with worst-case methods that place a constraint on the maximum curvature, or on the minimum injectivity radius (for instance [1, 38]). Intuitively, a constraint on an average quantity may be arbitrarily less restrictive than one on its maximum: a small, difficult region (e.g. of very high curvature) may cause the bounds of the latter to degrade substantially, while the results presented here are not adversely affected so long as the region is small.
Additionally, care has been taken throughout to analyze the behavior of the constants. In particular, there are no constants in the analysis that grow exponentially with the dimension, and in fact, many have polynomial, or slower growth. We believe this to be an important point, since this ensures that the asymptotic bounds do not hide an additional exponential dependence on the dimension.
References
 [1] William K. Allard, Guangliang Chen, and Mauro Maggioni. Multiscale geometric methods for data sets II: Geometric multi-resolution analysis. Applied and Computational Harmonic Analysis, 1:1–38, 2011.
 [2] Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn., 75:245–248, May 2009.
 [3] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA, 2007. SIAM.
 [4] Franz Aurenhammer. Voronoi diagrams: A survey of a fundamental geometric data structure. ACM Comput. Surv., 23:345–405, September 1991.
 [5] Peter L. Bartlett, Tamas Linder, and Gabor Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44:1802–1813, 1998.
 [6] Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
 [7] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, 2003.
 [8] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res., 7:2399–2434, 2006.
 [9] Shai Ben-David. A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Mach. Learn., 66(2-3):243–257, March 2007.
 [10] Gérard Biau, Luc Devroye, and Gábor Lugosi. On the performance of clustering in hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790, 2008.

 [11] Gilles Blanchard, Olivier Bousquet, and Laurent Zwald. Statistical properties of kernel principal component analysis. Mach. Learn., 66:259–294, March 2007.
 [12] P. S. Bradley and O. L. Mangasarian. k-plane clustering. J. of Global Optimization, 16:23–32, January 2000.

 [13] Joachim M. Buhmann. Empirical risk approximation: An induction principle for unsupervised learning. Technical report, University of Bonn, 1998.
 [14] Joachim M. Buhmann. Information theoretic model validation for clustering. In International Symposium on Information Theory, Austin, Texas. IEEE, 2010. (in press).
 [15] E. V. Chernaya. On the optimization of weighted cubature formulae on certain classes of continuous functions. East J. Approx., 1995.

 [16] Kenneth L. Clarkson. Building triangulations using ε-nets. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, STOC '06, pages 326–335, New York, NY, USA, 2006. ACM.
 [17] A. Cuevas and R. Fraiman. Set estimation. In New perspectives in stochastic geometry, pages 374–397. Oxford Univ. Press, Oxford, 2010.
 [18] A. Cuevas and A. RodríguezCasal. Set estimation: an overview and some recent developments. In Recent advances and trends in nonparametric statistics, pages 251–264. Elsevier B. V., Amsterdam, 2003.
 [19] Sanjoy Dasgupta and Yoav Freund. Random projection trees for vector quantization. IEEE Trans. Inf. Theor., 55:3229–3242, July 2009.

 [20] Inderjit S. Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '04, pages 551–556, New York, NY, USA, 2004. ACM.
 [21] J. Dieudonné. Foundations of Modern Analysis. Pure and Applied Mathematics. Hesperides Press, 2008.
 [22] M. P. do Carmo. Riemannian Geometry. Theory and Applications Series. Birkhäuser, 1992.
 [23] Allen Gersho and Robert M. Gray. Vector quantization and signal compression. Kluwer Academic Publishers, Norwell, MA, USA, 1991.
 [24] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. SpringerVerlag New York, Inc., Secaucus, NJ, USA, 2000.
 [25] P. M. Gruber. Asymptotic estimates for best and stepwise approximation of convex bodies i. Forum Mathematicum, 15:281–297, 1993.
 [26] Peter M. Gruber. Optimum quantization and its applications. Adv. Math., 186, 2004.
 [27] P.M. Gruber. Convex and discrete geometry. Grundlehren der mathematischen Wissenschaften. Springer, 2007.
 [28] Matthias Hein and Jean-Yves Audibert. Intrinsic dimensionality estimation of submanifolds in R^d. In ICML '05: Proceedings of the 22nd international conference on Machine learning, pages 289–296, 2005.
 [29] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
 [30] Ravikrishna Kolluri, Jonathan Richard Shewchuk, and James F. O’Brien. Spectral surface reconstruction from noisy point clouds. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, SGP ’04, pages 11–21, New York, NY, USA, 2004. ACM.
 [31] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, 2001.

 [32] David Levin. Mesh-independent surface interpolation. In Brunnett, Hamann, and Mueller, editors, Geometric Modeling for Scientific Visualization, pages 37–49. Springer-Verlag, 2003.
 [33] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, 1982.
 [34] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.
 [35] Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is NP-hard. In Proceedings of the 3rd International Workshop on Algorithms and Computation, WALCOM '09, pages 274–285, Berlin, Heidelberg, 2009. Springer-Verlag.
 [36] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 689–696, 2009.
 [37] A. Maurer and M. Pontil. K–dimensional coding schemes in hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839 –5846, nov. 2010.

 [38] Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems 23, pages 1786–1794. MIT Press, 2010.
 [39] David Pollard. Strong consistency of k-means clustering. Annals of Statistics, 9(1):135–140, 1981.
 [40] ST Roweis and LK Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
 [41] David Slepian. The onesided barrier problem for Gaussian noise. Bell System Tech. J., 41:463–501, 1962.
 [42] Florian Steinke, Matthias Hein, and Bernhard Schölkopf. Nonparametric regression between general Riemannian manifolds. SIAM J. Imaging Sci., 3(3):527–563, 2010.
 [43] I. Steinwart and A. Christmann. Support vector machines. Information Science and Statistics. Springer, New York, 2008.
 [44] G. Fejes Tóth. Sur la représentation d'une population infinie par un nombre fini d'éléments. Acta Math. Acad. Sci. Hungaricae, 1959.
 [45] Ulrike von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17(4):395–416, 2007.
 [46] Paul L. Zador. Asymptotic quantization error of continuous signals and the quantization dimension. IEEE Transactions on Information Theory, 28(2):139–148, 1982.
Appendix A Methodology and Derivation of Results
Although both k-means and k-flats minimize the same empirical risk, the performance measure we are interested in is that of Equation 1. We may bound it from above as follows:
$$\mathcal{E}(S_{k,n}) \le \mathcal{E}_n(S_{k,n}) + \sup_{S \in \mathcal{S}_k} \left| \mathcal{E}(S) - \mathcal{E}_n(S) \right| \qquad (14)$$
$$\le \mathcal{E}(S_k^*) + 2 \sup_{S \in \mathcal{S}_k} \left| \mathcal{E}(S) - \mathcal{E}_n(S) \right|, \qquad (15)$$
where $\mathcal{E}(S_k^*)$ is the best attainable performance over $\mathcal{S}_k$, and $S_k^* \in \mathcal{S}_k$ is a set for which this best performance is attained. Note that $\mathcal{E}_n(S_{k,n}) \le \mathcal{E}_n(S_k^*)$ by the definition of $S_{k,n}$. The same error decomposition can be considered for k-flats, by replacing $\mathcal{S}_k$ with $\mathcal{F}_k$ and $S_{k,n}$ with $F_{k,n}$.
Equation 14 decomposes the total learning error into two terms: a uniform (over all sets in the class $\mathcal{S}_k$) bound on the difference between the empirical and true error measures, and an approximation-error term. The uniform statistical bound depends on the samples, and thus may hold only with a certain probability.
In this setting, the approximation error will typically tend to zero as the class becomes larger (as $k$ increases). Note that this is true, for instance, if the class is the set $\mathcal{S}_k$ of discrete sets of size $k$, as in the k-means problem.
The left-hand side of Equation 14 is, through its dependence on the samples, a random variable. We will thus set out to find probabilistic bounds on it, as a function of the number $n$ of samples and the size $k$ of the approximation. By choosing the approximation-size parameter $k$ to minimize these bounds, we obtain performance bounds as a function of the sample size.

Appendix B K-Means
We use the above decomposition to derive sample-complexity bounds for the performance of the k-means algorithm.
To derive explicit bounds on the different error terms, we combine in a novel way some previous results with some new observations.
Approximation error. The error $\mathcal{E}(S_k^*)$ is related to the problem of optimal quantization. The classical optimal quantization problem is quite well understood, going back to the fundamental work of [46, 44] on optimal quantization for data transmission, and more recently the work of [24, 27, 26, 15]. In particular, it is known that, for distributions over $\mathbb{R}^d$ with finite moment of order $2 + \delta$, for some $\delta > 0$, it is [24]

$$\lim_{k \to \infty} k^{2/d} \, \mathcal{E}(S_k^*) = C_d \left( \int_{\mathbb{R}^d} p(x)^{\frac{d}{d+2}} \, d\lambda(x) \right)^{\frac{d+2}{d}}, \qquad (16)$$
where $\lambda$ is the Lebesgue measure, $p$ is the density of the absolutely continuous part of the distribution (according to its Lebesgue decomposition), and $C_d$ is a constant that depends only on the dimension. Therefore, the approximation error decays at least as fast as $k^{-2/d}$.
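For intuition, in the simplest case $d = 1$ with $\rho$ uniform on $[0, 1]$, the optimal $k$-point quantizer places its means at the midpoints of $k$ equal subintervals, and its error $1/(12k^2)$ matches the predicted $k^{-2/d} = k^{-2}$ decay. A quick Monte Carlo check (our own sketch):

```python
import numpy as np

def uniform_quantizer_error(k, n=200_000, seed=0):
    """Monte Carlo estimate of E d(x, S_k)^2 for x ~ U[0, 1], where S_k consists
    of the midpoints of k equal subintervals (the optimal k-point quantizer)."""
    rng = np.random.default_rng(seed)
    x = rng.random(n)
    means = (np.arange(k) + 0.5) / k
    nearest = means[np.abs(x[:, None] - means[None, :]).argmin(axis=1)]
    return np.mean((x - nearest) ** 2)
```

Since the analytic value is $1/(12k^2)$, quadrupling $k$ divides the error by sixteen.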
We note that, by setting $\rho$ to be the uniform distribution over the unit cube $[0, 1]^d$, the integral in Equation 16 clearly equals one, and thus, by making use of Zador's asymptotic formula [46], combined with a result of Böröczky (see [27], p. 491), we observe that $C_d = \Theta(d)$ for the second-order ($r = 2$) quantization problem. In particular, this shows that the constant $C_d$ depends only on the dimension and, in our case ($r = 2$), has only linear growth in $d$, a fact that will be used in the sequel.
The approximation error of $k$-means is related to the problem of optimal quantization on manifolds, for which some results are known [26]. Calling $\mathcal{E}_{\mathcal{M}}(S_k^*)$ the approximation error only among sets of $k$ means contained in $\mathcal{M}$, Theorem 5 in Appendix C implies in this case (letting $r = 2$ and $w = \rho$) that

(17) $\lim_{k \to \infty}\; k^{2/d}\, \mathcal{E}_{\mathcal{M}}(S_k^*) \;=\; C_d \left[ \int_{\mathcal{M}} \rho(x)^{\frac{d}{d+2}}\, d\mu_I(x) \right]^{\frac{d+2}{d}}$

where $\rho$ is absolutely continuous over $\mathcal{M}$ and, by replacing $\mathcal{M}$ with a $d$-dimensional domain in $\mathbb{R}^d$, it is clear that the constant $C_d$ is the same as above.
Since restricting the means to lie on $\mathcal{M}$ cannot decrease the approximation error, it is $\mathcal{E}(S_k^*) \le \mathcal{E}_{\mathcal{M}}(S_k^*)$, and therefore the right-hand side of Equation 17 provides an (asymptotic) upper bound to $\mathcal{E}(S_k^*)$.
For the statistical error we use available bounds.
Statistical error.
The statistical error of Equation 14, which uniformly bounds the difference between the empirical and expected error, has been widely studied in recent years [37, 38, 5].
In particular, it has been shown that, for a distribution supported on the unit ball of a Hilbert space, it is

(18) $\sup_{S \in \mathcal{S}_k} \left| \mathcal{E}(S) - \mathcal{E}_n(S) \right| \;\le\; C\, \frac{k}{\sqrt{n}} \;+\; \sqrt{\frac{2 \ln(1/\delta)}{n}}$

with probability $1 - \delta$, where $C$ is a universal constant [37].
Clearly, this implies convergence almost surely as $n \to \infty$, although this latter result was proven earlier in [39], under the less restrictive condition that the distribution have finite second moment.
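For a single fixed set $S$, the deviation between empirical and expected error is a sample-mean fluctuation, and so scales as $n^{-1/2}$. The Monte Carlo sketch below is our own illustration of this scaling, with $S = \{0\}$ and $X$ uniform on $[0,1]$, so that the empirical error is the sample mean of $X^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

def deviation_std(n, reps=2000):
    """Standard deviation, over many draws of n samples, of the empirical
    error of S = {0} for X uniform on [0, 1] (the sample mean of X^2)."""
    X = rng.random((reps, n))
    return (X ** 2).mean(axis=1).std()

s100, s400 = deviation_std(100), deviation_std(400)
print(s100 / s400)  # about 2: quadrupling n halves the deviation
```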
By bringing together the above results, we obtain the bound of Theorem 1 on the performance of $k$-means, whose proof is postponed to Appendix A.
Further, we can consider the error incurred by the actual optimization algorithm used to compute the $k$-means solution.
Computational error.
In practice, the $k$-means problem is NP-hard [2, 19, 35], with the original Lloyd relaxation algorithm providing no guarantees of closeness to the global minimum of Equation 2.
However, practical approximations, such as the $k$-means++ algorithm [3], exist.
When using $k$-means++, means are inserted one by one at samples selected with probability proportional to their squared distance to the set of previously inserted means.
This randomized seeding has been shown by [3] to output a set of means whose cost is, in expectation, within a factor $8(\ln k + 2)$ of the optimal.
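The seeding step described above admits a compact implementation. A minimal sketch, ours rather than the reference implementation of [3]:

```python
import numpy as np

def kmeanspp_seed(X, k, rng):
    """k-means++ seeding: insert means one at a time, choosing each new mean
    among the samples with probability proportional to the squared distance
    to the set of previously inserted means."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first mean: a uniformly random sample
    for _ in range(k - 1):
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])  # D^2 weighting
    return np.array(centers)

rng = np.random.default_rng(0)
X = rng.random((500, 2))
S = kmeanspp_seed(X, 5, rng)
print(S.shape)  # (5, 2): five seed means in the plane
```

The seeds are then typically refined by Lloyd iterations; the approximation guarantee of [3] holds already for the seeding itself.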
Once again, by combining these results, we obtain Theorem 2,
whose proof is also in Appendix A.
Proof.
Letting $k = k_n$, to be specified, with probability $1 - \delta$ it is

(19) $\begin{aligned} \mathcal{E}(S_{k,n}) \;&\le\; 2 \sup_{S \in \mathcal{S}_k} \left| \mathcal{E}(S) - \mathcal{E}_n(S) \right| + \mathcal{E}(S_k^*) \\ &\le\; \frac{2Ck}{\sqrt{n}} + 2\sqrt{\frac{2\ln(1/\delta)}{n}} + \mathcal{E}(S_k^*) \\ &\le\; \frac{2Ck}{\sqrt{n}} + C_d\, k^{-2/d}\,(1 + o(1)) + 2\sqrt{\frac{2\ln(1/\delta)}{n}} \\ &\le\; \left( 2C + C_d \right) n^{-\frac{1}{d+2}}\,(1 + o(1)) + 2\sqrt{\frac{2\ln(1/\delta)}{n}} \end{aligned}$

where the parameter

(20) $k_n = \left\lceil n^{\frac{d}{2(d+2)}} \right\rceil$

has been chosen to balance the summands in the third line of Equation 19. ∎
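The exponent arithmetic behind this balancing step can be checked mechanically. The sketch below is our own verification, under the assumption that the statistical term is of order $k\, n^{-1/2}$ and the approximation term of order $k^{-2/d}$: writing $k = n^a$ and equating exponents gives $a = d/(2(d+2))$ and a resulting rate of $n^{-1/(d+2)}$.

```python
from fractions import Fraction

def balance(d):
    """Balance k / sqrt(n) against k^(-2/d): with k = n^a,
    a - 1/2 = -(2/d) a, hence a = d / (2 (d + 2))."""
    a = Fraction(d, 2 * (d + 2))  # exponent of n in the choice k_n = n^a
    rate = Fraction(-2, d) * a    # exponent of n in the resulting bound
    return a, rate

for d in (1, 2, 10):
    print(d, balance(d))  # the rate simplifies to -1/(d+2) in every case
```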
The proof of Theorem 2 follows a similar argument.
Proof.
In the case of Theorem 2, the additional multiplicative term $8(\ln k + 2)$, corresponding to the computational error incurred by the $k$-means++ algorithm, does not affect the choice of the parameter $k_n$, since both summands in the third line of Equation 19 are multiplied by it in this case. Therefore, we may simply use the same choice of $k_n$ as in Equation 20 in this case to obtain

(21) $\mathbb{E}\left[ \mathcal{E}(S_{k_n,n}) \right] \;\le\; 8 \left( \ln k_n + 2 \right) \left\{ \left( 2C + C_d \right) n^{-\frac{1}{d+2}}\,(1 + o(1)) + 2\sqrt{\frac{2\ln(1/\delta)}{n}} \right\}$

with probability $1 - \delta$, where the expectation is with respect to the random choice of means in the algorithm. From this the bound of Theorem 2 follows. ∎
Appendix C K-Flats
Here we state a series of lemmas that we prove in the next section. For the $k$-flats problem, we begin by introducing a uniform bound on the difference between the empirical (Equation 2) and expected risk (Equation 1).
Lemma 1.
If $\mathcal{F}_k$ is the class of collections of $k$ $d$-dimensional affine spaces then, with probability $1 - \delta$ on the sampling of $x_1, \dots, x_n$, it is

$\sup_{F \in \mathcal{F}_k} \left| \mathcal{E}(F) - \mathcal{E}_n(F) \right| \;\le\; 4\sqrt{\pi}\, k\, \sqrt{\frac{d}{n}} \;+\; \sqrt{\frac{2\ln(1/\delta)}{n}}.$
By combining the above result with approximation error bounds, we may produce performance bounds on the expected risk for the $k$-flats problem, with an appropriate choice of the parameter $k$. We distinguish between the codimension-one (hypersurface) case, and the more general case of a smooth manifold embedded in a Hilbert space. We begin with an approximation error bound for hypersurfaces in Euclidean space.
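The expected and empirical risks above involve distances to affine flats. A minimal sketch of that computation, ours, representing a $d$-dimensional flat by a point $p$ on it and a matrix $B$ of orthonormal basis columns:

```python
import numpy as np

def dist2_to_flat(x, p, B):
    """Squared distance from x to the affine flat through p spanned by the
    orthonormal columns of B: the squared residual norm after orthogonally
    projecting (x - p) onto span(B)."""
    r = x - p
    c = B.T @ r  # coordinates of the projection in the basis B
    return float(r @ r - c @ c)

# The xy-plane in R^3, viewed as a 2-flat through the origin:
p = np.zeros(3)
B = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(dist2_to_flat(np.array([1.0, 1.0, 1.0]), p, B))  # 1.0
```

The $k$-flats risk then replaces the minimum over $k$ points with the minimum of this quantity over $k$ flats.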
Lemma 2.
Assume given a smooth compact hypersurface $\mathcal{M} \subset \mathbb{R}^{d+1}$ with metric of class $C^3$. If $\mathcal{F}_k$ is the class of collections of $k$ $d$-dimensional affine spaces, and $F_k^*$ is the minimizer of Equation 1 over $\mathcal{F}_k$, then there is a constant $C_d$ that depends on $d$ only, such that

$\limsup_{k \to \infty}\; k^{4/d}\, \mathcal{E}(F_k^*) \;\le\; C_d\, \kappa_{\mathcal{M}}$

where $\kappa_{\mathcal{M}}$ is the total root curvature of $\mathcal{M}$, and $\mu_{|II|}$ is the measure associated with the (positive) second fundamental form. The constant $C_d$ grows as $\left(\frac{d}{2\pi e}\right)^{2}$ with $d \to \infty$.
For the more general problem of the approximation of a smooth manifold embedded in a separable Hilbert space, we recall from Section 4 the definitions of the second fundamental form II and of its operator norm $\left\| II_p \right\|$ at a point $p \in \mathcal{M}$. We then have:
Lemma 3.
Assume given a $d$-dimensional manifold $\mathcal{M}$ with metric of class $C^3$, embedded in a separable Hilbert space $\mathcal{H}$. If $\mathcal{F}_k$ is the class of collections of $k$ $d$-dimensional affine spaces, and $F_k^*$ is the minimizer of Equation 1 over $\mathcal{F}_k$, then there is a constant $C_d$ that depends on $d$ only, such that

$\limsup_{k \to \infty}\; k^{4/d}\, \mathcal{E}(F_k^*) \;\le\; C_d\, \kappa^2\, \mu_I(\mathcal{M})^{\frac{d+4}{d}}$

where $\kappa = \sup_{p \in \mathcal{M}} \left\| II_p \right\|$ and $\mu_I$ is the volume measure over $\mathcal{M}$. The constant $C_d$ grows as $\left(\frac{d}{2\pi e}\right)^{2}$ with $d \to \infty$.
c.1 Proofs
We begin by proving the bound on the statistical error given in Lemma 1.
Proof.
We begin by finding uniform upper bounds on the difference between Equations 1 and 2 for the class $\mathcal{F}_k$ of collections of $k$ $d$-dimensional affine spaces. To do this, we will first bound the Rademacher complexity of $\mathcal{F}_k$.
Let $X_F$ and $Y_F$ be Gaussian processes indexed by $F \in \mathcal{F}_k$, and defined by

(22) $X_F = \sum_{i=1}^n \gamma_i\, d(x_i, F)^2, \qquad Y_F = \sqrt{2}\, \sum_{i=1}^n \sum_{j=1}^k \gamma_{ij}\, \left\langle P_j,\; x_i \otimes x_i \right\rangle_{HS}$

where $F \in \mathcal{F}_k$ is the union of $k$ subspaces: $F = \bigcup_{j=1}^k F_j$, where each $P_j$ is an orthogonal projection onto $F_j$, and $\{\gamma_i\}_{i=1}^n$, $\{\gamma_{ij}\}_{i,j=1}^{n,k}$ are independent Gaussian sequences of zero mean and unit variance.
Noticing that $d(x, F_j)^2 = \|x\|^2 - \left\langle P_j,\, x \otimes x \right\rangle_{HS}$ for any orthogonal projection $P_j$ (see for instance [11], Sec. 2.1), where $\langle \cdot, \cdot \rangle_{HS}$ is the Hilbert–Schmidt inner product, we may verify that:

(23) $\mathbb{E} \left( X_F - X_{F'} \right)^2 \;\le\; \mathbb{E} \left( Y_F - Y_{F'} \right)^2, \qquad F, F' \in \mathcal{F}_k.$

Since it is, for $\|x_i\| \le 1$,

(24) $\mathbb{E} \sup_{F \in \mathcal{F}_k} Y_F \;\le\; \sqrt{2}\, k\, \sqrt{d}\, \sqrt{n},$

we may bound the Gaussian complexity as follows:

(25) $\Gamma_n(\mathcal{F}_k) \;=\; \frac{1}{n}\, \mathbb{E} \sup_{F \in \mathcal{F}_k} X_F \;\le\; \frac{2}{n}\, \mathbb{E} \sup_{F \in \mathcal{F}_k} Y_F \;\le\; 2\sqrt{2}\, k\, \sqrt{\frac{d}{n}},$

where the first inequality follows from Equation 23 and Slepian's Lemma [41], and the second from Equation 24.
Therefore the Rademacher complexity is bounded by

(26) $R_n(\mathcal{F}_k) \;\le\; \sqrt{\frac{\pi}{2}}\; \Gamma_n(\mathcal{F}_k) \;\le\; 2\sqrt{\pi}\, k\, \sqrt{\frac{d}{n}}.$
c.2 Approximation Error
In order to prove approximation bounds for the $k$-flats problem, we will begin by first considering the simpler setting of a smooth hypersurface in $\mathbb{R}^{d+1}$ (codimension one), and later we will extend the analysis to the general case.
Approximation Error: Codimension One
Assume that $\mathbb{R}^{d+1}$ is endowed with the natural metric, and that $\mathcal{M} \subset \mathbb{R}^{d+1}$ is a compact, smooth $d$-manifold with metric of class $C^3$. Since $\mathcal{M}$ is of codimension one, the second fundamental form at each point is a quadratic form on the tangent space with real eigenvalues. Assume given $\rho$ and $\alpha > 0$. At every point $p \in \mathcal{M}$, define the metric $g_\alpha = I + \alpha\, |II|$, where

a) I and II are, respectively, the first and second fundamental forms on $\mathcal{M}$ [22];

b) $|II|$ is the convexified second fundamental form, whose eigenvalues are those of II but in absolute value. If the second fundamental form II is written in coordinates (with respect to an orthonormal basis of the tangent space) as $II = U \Lambda U^T$, with $U$ orthonormal, and $\Lambda$ diagonal, then $|II|$ is $U\, |\Lambda|\, U^T$ in coordinates. Because $|II|$ is continuous and positive semi-definite, it has an associated measure $\mu_{|II|}$ (with respect to the volume measure $\mu_I$);

c) $\alpha$ is chosen such that $\mu_{g_\alpha}(\mathcal{M}) = 2\, \mu_I(\mathcal{M})$. Note that such an $\alpha$ always exists since:

- $\alpha = 0$ implies $\mu_{g_\alpha}(\mathcal{M}) = \mu_I(\mathcal{M})$, and

- $\mu_{g_\alpha}(\mathcal{M})$ can be made arbitrarily large by increasing $\alpha$,

and therefore there is some intermediate value of $\alpha$ that satisfies the constraint.

In particular, from condition c), it is clear that $g_\alpha$ is everywhere positive definite.
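The convexification of the second fundamental form is a plain eigenvalue operation. A small numeric sketch, ours, with a $2 \times 2$ symmetric matrix standing in for II written in coordinates:

```python
import numpy as np

def convexify(M):
    """Return U |L| U^T for a symmetric M = U L U^T: same eigenvectors,
    eigenvalues replaced by their absolute values (positive semi-definite)."""
    lam, U = np.linalg.eigh(M)
    return (U * np.abs(lam)) @ U.T

# A saddle-like form with eigenvalues +1 and -1:
II = np.array([[0.0, 1.0], [1.0, 0.0]])
print(convexify(II))  # the identity matrix: both eigenvalues become 1
```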
Let $\mu_I$ and $\mu_{g_\alpha}$ be the measures over $\mathcal{M}$ associated with I and $g_\alpha$. Since, by its definition, $g_\alpha$ is absolutely continuous with respect to I, then so must $\mu_{g_\alpha}$ be with respect to $\mu_I$. Therefore, we may define $w := \frac{d\mu_{g_\alpha}}{d\mu_I}$ to be the density of $\mu_{g_\alpha}$ with respect to $\mu_I$.
Consider the discrete set $S_{k,\alpha} \subset \mathcal{M}$ of size $k$ that minimizes the quantity

(28) $V_k(S) \;=\; \int_{\mathcal{M}} \min_{s \in S}\, d_{g_\alpha}(x, s)^4\; w(x)\, d\mu_{g_\alpha}(x)$

among all sets of $k$ points on $\mathcal{M}$. $V_k$ is the (fourth-order) quantization error over $\mathcal{M}$, with metric $g_\alpha$, and with respect to the weight function $w$. Note that, in the definition of $V_k$, it is crucial that the measure ($\mu_{g_\alpha}$), and distance ($d_{g_\alpha}$) match, in the sense that $d_{g_\alpha}$ is the geodesic distance with respect to the metric $g_\alpha$, whose associated measure is $\mu_{g_\alpha}$.
The following theorem, adapted from [26], characterizes the asymptotic behavior of the quantization error on a Riemannian manifold.
Theorem 5.
[[26]] Given a smooth compact Riemannian manifold $(\mathcal{M}, g)$ of dimension $d$ with metric of class $C^3$, and a continuous function $w : \mathcal{M} \to \mathbb{R}_+$, then

(29) $\inf_{S :\, |S| = k}\; \int_{\mathcal{M}} \min_{s \in S}\, d_g(x, s)^r\; w(x)\, d\mu_g(x) \;=\; C_{r,d}\, \left[ \int_{\mathcal{M}} w(x)^{\frac{d}{d+r}}\, d\mu_g(x) \right]^{\frac{d+r}{d}} k^{-r/d}\, (1 + o(1))$

as $k \to \infty$.