
A study of the classification of low-dimensional data with supervised manifold learning

Elif Vural (velif@metu.edu.tr)
Middle East Technical University, Ankara, Turkey

Christine Guillemot (christine.guillemot@inria.fr)
Centre de recherche INRIA Bretagne Atlantique, Rennes, France

Most of this work was performed while the first author was at INRIA.
Abstract

Supervised manifold learning methods learn data representations by preserving the geometric structure of data while enhancing the separation between data samples from different classes. In this work, we propose a theoretical study of supervised manifold learning for classification. We consider nonlinear dimensionality reduction algorithms that yield linearly separable embeddings of training data and present generalization bounds for this type of algorithms. A necessary condition for satisfactory generalization performance is that the embedding allow the construction of a sufficiently regular interpolation function in relation to the separation margin of the embedding. We show that for supervised embeddings satisfying this condition, the classification error decays at an exponential rate with the number of training samples. Finally, we examine the separability of supervised nonlinear embeddings that aim to preserve the low-dimensional geometric structure of data based on graph representations. The proposed analysis is supported by experiments on several real data sets.

Keywords: Manifold learning, dimensionality reduction, classification, out-of-sample extensions, RBF interpolation

1 Introduction

In many data analysis problems, data samples have an intrinsically low-dimensional structure although they reside in a high-dimensional ambient space. The learning of low-dimensional structures in collections of data has been a well-studied topic over the last two decades (Tenenbaum et al., 2000), (Roweis and Saul, 2000), (Belkin and Niyogi, 2003), (He and Niyogi, 2004), (Donoho and Grimes, 2003), (Zhang and Zha, 2005). Following these works, many classification methods have been proposed in recent years that apply such manifold learning techniques to learn classifiers adapted to the geometric structure of low-dimensional data (Hua et al., 2012), (Yang et al., 2011), (Zhang et al., 2012), (Sugiyama, 2007), (Raducanu and Dornaika, 2012). The common approach in such works is to learn a data representation that enhances the between-class separation while preserving the intrinsic low-dimensional structure of the data. While many efforts have focused on the practical aspects of learning such supervised embeddings for training data, the generalization performance of these methods as supervised classification algorithms has not yet been investigated much. In this work, we aim to study nonlinear supervised dimensionality reduction methods and present performance bounds based on the properties of the embedding and of the interpolation function used for generalizing the embedding.

Several supervised manifold learning methods extend the Laplacian eigenmaps algorithm (Belkin and Niyogi, 2003), or its linear variant LPP (He and Niyogi, 2004), to the classification problem. The algorithms proposed by Hua et al. (2012), Yang et al. (2011), and Zhang et al. (2012) provide a supervised extension of the LPP algorithm and learn a linear projection that preserves the proximity of neighboring samples from the same class, while increasing the distance between nearby samples from different classes. The method by Sugiyama (2007) proposes an adaptation of the Fisher metric for linear manifold learning, which is in fact shown by Yang et al. (2011) and Zhang et al. (2012) to be equivalent to the above methods. In (Li et al., 2013), (Cui and Fan, 2012), (Wang and Chen, 2009), other similar Fisher-based linear manifold learning methods are proposed. In (Raducanu and Dornaika, 2012), a method relying on a formulation similar to those in (Hua et al., 2012), (Yang et al., 2011), (Zhang et al., 2012) is presented, which, however, learns a nonlinear embedding. The main advantage of linear dimensionality reduction methods over nonlinear ones is that the generalization of the learnt embedding to novel (initially unavailable) samples is straightforward. However, nonlinear manifold learning algorithms are more flexible, as the data representations they can learn belong to a wider family of functions; e.g., one can always find a nonlinear embedding that makes training samples from different classes linearly separable. On the other hand, when a nonlinear embedding is used, one must also determine a suitable interpolation function to generalize the embedding to new samples, and the choice of the interpolator is critical for the classification performance.

The common effort in all of these supervised dimensionality reduction methods is to learn an embedding that increases the separation between different classes while preserving the geometric structure of the data. It is interesting to note that supervised manifold learning methods achieve separability by reducing the dimension of the data, while kernel methods in traditional classifiers achieve this by increasing the dimension of the data. Meanwhile, making the training data linearly separable in supervised manifold learning does not mean much by itself. Assuming that the data are sampled from a continuous distribution (hence two samples coincide with probability 0), it is almost always possible to separate a discrete set of samples from different classes with a nonlinear embedding, e.g., even with an embedding as simple as the one mapping each sample to a vector encoding its class label. What actually matters is how the embedding generalizes to test data, i.e., where the test samples are mapped in the low-dimensional domain of embedding and how good the resulting classification performance is. The generalization to test data is straightforward for kernel methods, as it is determined by the underlying main algorithm. However, in nonlinear supervised manifold learning, this question has rather been overlooked so far. In this work, we aim to fill this gap and look into the generalization capabilities of supervised manifold learning algorithms. We study the conditions that must be satisfied by the embedding of the training samples and by the interpolation function for satisfactory generalization of the classifier. We then examine the rates of convergence of supervised manifold learning algorithms that satisfy these conditions.

In Section 2, we consider arbitrary supervised manifold learning algorithms that compute a linearly separable embedding of the training samples. We study the generalization capability of such algorithms for two types of out-of-sample interpolation functions. We first consider arbitrary interpolation functions that are Lipschitz-continuous on the support of each class, and then focus on out-of-sample extensions with radial basis function (RBF) kernels, which form a popular family of interpolation functions. For both types of interpolators, we derive conditions that must be satisfied by the embedding of the training samples and by the regularity of the interpolation function that generalizes the embedding to test samples, when a nearest neighbor or linear classifier is used in the low-dimensional domain of embedding. These conditions require the Lipschitz constant of the interpolator to be sufficiently small in comparison with the separation margin between training samples from different classes in the low-dimensional domain of embedding. The practical value of these results resides in their implications about what must really be taken into account when designing a supervised dimensionality reduction algorithm: achieving a good separation margin does not suffice by itself; the geometric structure must also be preserved so as to ensure that a sufficiently regular interpolator can be found to generalize the embedding to the whole ambient space. We then particularly consider Gaussian RBF kernels and show the existence of an optimal value for the kernel scale by studying the condition in our main result that links the separation with the Lipschitz constant of the kernel.

Our results in Section 2 also provide bounds on the rate of convergence of the classification error of supervised embeddings. We show that the misclassification probability decays at an exponential rate with the number of samples, provided that the interpolation function is sufficiently regular with respect to the separation margin of the embedding. These convergence rates are faster than those reported in previous results on RBF networks (Niyogi and Girosi, 1996), (Lin et al., 2014), (Hernández-Aguirre et al., 2002), and regularized least-squares regression algorithms (Caponnetto and De Vito, 2007), (Steinwart et al., 2009). The essential difference between our results and such previous works is that those assume a general setting and do not focus on a particular data model, whereas our results are rather relevant to settings where the support of each class admits a certain structure, so as to allow the existence of an interpolator that is sufficiently regular on the support of each class. Moreover, in contrast with these previous works, our bounds are independent of the ambient space dimension and vary only with the intrinsic dimensions of the class supports, as they characterize the error in terms of the covering numbers of the supports.

The results in Section 2 assume an embedding that makes training samples from different classes linearly separable. Even though most nonlinear dimensionality reduction methods are observed to yield separable embeddings in practice, we aim to verify this theoretically in Section 3. In particular, we focus on the nonlinear version of the supervised Laplacian eigenmaps embeddings (Raducanu and Dornaika, 2012), (Hua et al., 2012), (Yang et al., 2011), (Zhang et al., 2012). Supervised Laplacian eigenmaps methods embed the data with the eigenvectors of a linear combination of two graph Laplacian matrices that encode the links between neighboring samples from the same class and from different classes. In such a data representation, the coordinates of neighboring data samples change slowly within the same class and rapidly across different classes. We study the conditions for the linear separability of these embeddings and characterize their separation margin in terms of some graph and algorithm parameters.

In Section 4, we evaluate our results with experiments on several object and face data sets. We study the implications of the condition derived in Section 2 on the tradeoff between the separation margin and the interpolator regularity. The experimental comparison of several supervised dimensionality reduction algorithms shows that this compromise between the separation and the interpolator regularity can indeed be related to the practical classification performance of a supervised manifold learning algorithm. This suggests that one can possibly improve the accuracy of supervised dimensionality reduction algorithms by considering more carefully the generalization capability of the embedding during the learning. We then study the variation of the classification performance with parameters such as the sample size, the RBF kernel scale, and the dimension of the embedding, in view of the generalization bounds presented in Section 2. Finally, we conclude in Section 5.

2 Performance bounds for supervised manifold learning methods

2.1 Notation and Problem Formulation

Consider a setting with $M$ data classes where the samples of each class $m\in\{1,\dots,M\}$ are drawn from a probability measure $\nu_{m}$ in a Hilbert space $H$ such that $\nu_{m}$ has a bounded support $\mathcal{M}_{m}\subset H$. Let $\mathcal{X}=\{x_{i}\}_{i=1}^{N}\subset H$ be a set of $N$ training samples such that each $x_{i}$ is drawn from one of the probability measures $\nu_{m}$, and the samples drawn from each $\nu_{m}$ are independent and identically distributed. We denote the class label of $x_{i}$ by $C_{i}\in\{1,2,\dots,M\}$.

Let $Y=\{y_{i}\}_{i=1}^{N}\subset\mathbb{R}^{d}$ be a $d$-dimensional embedding of $\mathcal{X}$, where each $y_{i}$ corresponds to $x_{i}$. We consider supervised embeddings such that $Y$ is linearly separable. Linear separability is defined as follows:

Definition 1

The data representation $Y$ is linearly separable with a margin of $\gamma>0$ if, for any two classes $k,l\in\{1,2,\dots,M\}$, there exists a separating hyperplane defined by $\omega_{kl}\in\mathbb{R}^{d}$, $\|\omega_{kl}\|=1$, and $b_{kl}\in\mathbb{R}$ such that

$$\begin{split}\omega_{kl}^{T}\,y_{i}+b_{kl}&\geq\gamma/2\quad\ \ \,\text{ if }C_{i}=k\\ \omega_{kl}^{T}\,y_{i}+b_{kl}&\leq-\gamma/2\quad\text{ if }C_{i}=l.\end{split}$$ (1)

Figure 1: Illustration of a linearly separable embedding. Data in $X$ are sampled from two different classes with supports $\mathcal{M}_{1}$, $\mathcal{M}_{2}$. The samples $X$ are mapped to the coordinates $Y$ with a low-dimensional embedding, where the two classes become linearly separable with margin $\gamma$ with the hyperplane given by $\omega$, $b$.

The above definition of separability implies the following. For any given class $m$, there exists a set of hyperplanes $\{\omega_{mk}\}_{k\neq m}\subset\mathbb{R}^{d}$, $\|\omega_{mk}\|=1$, and a set of real numbers $\{b_{mk}\}_{k\neq m}\subset\mathbb{R}$ that separate class $m$ from the other classes, such that for all $y_{i}$ of class $C_{i}=m$

$$\omega_{mk}^{T}\,y_{i}+b_{mk}>\gamma/2,\quad\forall k\neq m$$ (2)

and for all $y_{i}$ of class $C_{i}\neq m$, there exists a $k$ such that

$$\omega_{mk}^{T}\,y_{i}+b_{mk}<-\gamma/2.$$ (3)

These hyperplanes are obtained by setting $\omega_{km}=-\omega_{mk}$ and $b_{km}=-b_{mk}$.
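As a concrete illustration, the pairwise margins of Definition 1 can be estimated numerically from an embedding $Y$ and its labels. The following sketch is our own illustration (not code from the paper): it approximates the maximum-margin hyperplane of each class pair with an approximately hard-margin linear SVM and reports $\gamma_{kl}=2/\|w\|$; numpy and scikit-learn are assumed to be available, and the result is only meaningful for class pairs that are indeed linearly separable.

```python
# Sketch: estimate the pairwise separation margins of Definition 1 from an
# embedding Y (N x d) and labels C (N,). Illustration only; assumes sklearn.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_margins(Y, C):
    """Return {(k, l): gamma_kl}, where gamma_kl = 2/||w|| for an
    (approximately) hard-margin hyperplane separating classes k and l."""
    margins = {}
    for k, l in combinations(np.unique(C), 2):
        mask = (C == k) | (C == l)
        svm = LinearSVC(C=1e6, max_iter=100000)        # large C ~ hard margin
        svm.fit(Y[mask], (C[mask] == k).astype(int))
        w = svm.coef_.ravel()
        margins[(k, l)] = 2.0 / np.linalg.norm(w)      # margin after normalizing ||w|| to 1
    return margins
```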

Figure 1 shows an illustration of a linearly separable embedding of data samples from two classes. Manifold learning methods typically compute a low-dimensional embedding $Y$ of the training data $\mathcal{X}$ in a pointwise manner, i.e., the coordinates $y_{i}$ are computed only for the initially available training samples $x_{i}$. However, in a classification problem, in order to estimate the class label of a new data sample $x$ of unknown class, $x$ needs to be mapped to the low-dimensional domain of embedding as well. The construction of a function $f:H\rightarrow\mathbb{R}^{d}$ that generalizes the learnt embedding to the whole space is known as the out-of-sample generalization problem. Smooth functions are commonly used for out-of-sample interpolation, e.g., as in (Qiao et al., 2013), (Peherstorfer et al., 2011).

Now let $x$ be a test sample drawn from the probability measure $\nu_{m}$; hence, the true class label of $x$ is $m$. In our study, we consider two basic classification schemes in the domain of embedding:

Linear classifier. The embeddings of the training samples are used to compute the separating hyperplanes, i.e., the classifier parameters $\{\omega_{mk}\}$ and $\{b_{mk}\}$. Then, mapping $x$ to the low-dimensional domain as $f(x)\in\mathbb{R}^{d}$, the class label of $x$ is estimated as $\hat{C}(x)=l$ if there exists $l\in\{1,\dots,M\}$ such that

$$\omega_{lk}^{T}\,f(x)+b_{lk}>0,\quad\forall k\in\{1,\dots,M\}\setminus\{l\}.$$ (4)

Note that the existence of such an $l$ is not guaranteed in general for any $x$, but for a given $x$ there cannot be more than one $l$ satisfying the above condition. Then $x$ is classified correctly if the estimated class label agrees with the true class label, i.e., $\hat{C}(x)=l=m$.

Nearest neighbor classification. The test sample $x$ is assigned the class label of the closest training point in the domain of embedding, i.e., $\hat{C}(x)=C_{i^{\prime}}$, where

$$i^{\prime}=\arg\min_{i=1,\dots,N}\|y_{i}-f(x)\|.$$
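For concreteness, the two decision rules above can be written as a short sketch (our illustration, not the paper's code); `W[l][k]` and `b[l][k]` are hypothetical containers for the hyperplanes $\omega_{lk}$, $b_{lk}$ of (2)-(3), stored with the antisymmetric convention $\omega_{kl}=-\omega_{lk}$, $b_{kl}=-b_{lk}$.

```python
# Sketch of the two classification rules used in the embedded domain R^d.
# fx = f(x) is the embedded test sample; Y (N x d) and C (N,) hold the
# embedded training samples and their labels. Assumed data layout.
import numpy as np

def linear_rule(fx, W, b, M):
    # rule (4): return l if f(x) lies on the positive side of every
    # hyperplane separating class l from the other classes
    for l in range(M):
        if all(W[l][k] @ fx + b[l][k] > 0 for k in range(M) if k != l):
            return l                      # at most one such l can exist
    return None                           # rule (4) may have no solution

def nearest_neighbor_rule(fx, Y, C):
    i_star = np.argmin(np.linalg.norm(Y - fx, axis=1))
    return C[i_star]
```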

In the rest of this section, we study the generalization performance of supervised dimensionality reduction methods. We first consider in Section 2.2 interpolation functions that vary regularly on each class support, and we search for a lower bound on the probability of correctly classifying a new data sample in terms of the regularity of $f$, the separation of the embedding, and the sampling density. Then, in Section 2.3, we study the classification performance for a particular type of interpolation function, namely RBF interpolators, which are among the most popular (Peherstorfer et al., 2011), (Chin and Suter, 2008). We focus particularly on Gaussian RBF interpolators in Section 2.4 and derive some results regarding the existence of an optimal kernel scale parameter. Lastly, we discuss our results in comparison with the previous literature in Section 2.5.

In the results in Sections 2.2-2.4, we keep a generic formulation and simply treat the supports $\{\mathcal{M}_{m}\}$ as arbitrary bounded subsets of $H$, each of which represents a different data class. Nevertheless, from the perspective of manifold learning, our results are of interest especially when the data are assumed to have an underlying low-dimensional structure. In Section 2.5, we study the implications of our results for the setting where the $\mathcal{M}_{m}$ are low-dimensional manifolds. We then examine how the proposed bounds vary in relation to the intrinsic dimensions of $\{\mathcal{M}_{m}\}$.

2.2 Out-of-sample interpolation with regular functions

Let $f:H\rightarrow\mathbb{R}^{d}$ be an out-of-sample interpolation function such that $f(x_{i})=y_{i}$ for each training sample $x_{i}$, $i=1,\dots,N$. Assume that $f$ is Lipschitz continuous with constant $L>0$ when restricted to any one of the supports $\mathcal{M}_{m}$; i.e., for any $m\in\{1,\dots,M\}$ and any $u,v\in\mathcal{M}_{m}$

$$\|f(u)-f(v)\|\leq L\,\|u-v\|$$

where $\|\cdot\|$ denotes the $\ell_{2}$-norm if the argument is in $\mathbb{R}^{d}$, and the norm induced by the inner product in $H$ if the argument is in $H$.

We will find a relation between the classification accuracy and the number of training samples via the covering numbers of the supports $\mathcal{M}_{m}$. Let $B_{\epsilon}(x)\subset H$ denote an open ball of radius $\epsilon$ around $x$

$$B_{\epsilon}(x)=\{u\in H:\|x-u\|<\epsilon\}.$$

The covering number $\mathcal{N}(\epsilon,A)$ of a set $A\subset H$ is defined as the smallest number of open balls $B_{\epsilon}$ of radius $\epsilon$ whose union contains $A$ (Kulkarni and Posner, 1995)

$$\mathcal{N}(\epsilon,A)=\inf\{k:\exists\,u_{1},\dots,u_{k}\in H\ \text{s.t.}\ A\subset\bigcup_{i=1}^{k}B_{\epsilon}(u_{i})\}.$$

We assume that the supports $\mathcal{M}_{m}$ are totally bounded, i.e., $\mathcal{M}_{m}$ has a finite covering number $\mathcal{N}(\epsilon,\mathcal{M}_{m})$ for any $\epsilon>0$.
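Although the covering number of an unknown support cannot be computed exactly, a crude empirical surrogate can be obtained from a finite sample of $\mathcal{M}_{m}$ with a greedy cover. The sketch below is our illustration under that assumption; it only counts the $\epsilon$-balls needed to cover the sampled points, not the whole support.

```python
# Sketch: greedy estimate of the number of eps-balls needed to cover a
# finite sample of a class support (a rough proxy for N(eps, M_m)).
import numpy as np

def greedy_cover_size(samples, eps):
    remaining = list(range(len(samples)))
    n_centers = 0
    while remaining:
        c = remaining[0]                   # pick any uncovered point as a center
        n_centers += 1
        d = np.linalg.norm(samples[remaining] - samples[c], axis=1)
        remaining = [i for i, di in zip(remaining, d) if di >= eps]
    return n_centers
```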

We state below a lower bound on the probability of correctly classifying a sample $x$ drawn from $\nu_{m}$, in terms of the number of training samples drawn from $\nu_{m}$, the separation of the embedding, and the regularity of $f$.

Theorem 2

For some $\epsilon$ with $0<\epsilon\leq\gamma/(2L)$, let the training set $\mathcal{X}$ contain at least $N_{m}$ samples drawn i.i.d. according to a probability measure $\nu_{m}$ such that

$$N_{m}\geq\mathcal{N}(\epsilon/2,\mathcal{M}_{m}).$$

Let $Y$ be an embedding of the training samples $\mathcal{X}$ that is linearly separable with margin larger than $\gamma$, and let $f$ be an interpolation function that is Lipschitz continuous with constant $L$ on the support $\mathcal{M}_{m}$. Then the probability of correctly classifying a test sample $x$ drawn from $\nu_{m}$ independently of the training samples with the linear classifier (4) is lower bounded as

$$P\left(\hat{C}(x)=m\right)\geq 1-\frac{\mathcal{N}(\epsilon/2,\mathcal{M}_{m})}{2N_{m}}.$$

The proof of the theorem is given in Appendix A.1. Theorem 2 establishes a link between the classification performance and the separation of the embedding of the training samples. In particular, due to the condition $\epsilon\leq\gamma/(2L)$, an increase in the separation $\gamma$ allows a larger value for $\epsilon$, provided that the interpolator regularity is not affected much. This in turn reduces the covering number $\mathcal{N}(\epsilon/2,\mathcal{M}_{m})$ and increases the probability of correct classification. Similarly, from the condition $\epsilon\leq\gamma/(2L)$, one can also observe that at a given separation $\gamma$, a smaller Lipschitz constant $L$ for the interpolation function allows the parameter $\epsilon$ to take a larger value. This reduces the covering number $\mathcal{N}(\epsilon/2,\mathcal{M}_{m})$ and therefore increases the correct classification probability. Thus, choosing a more regular interpolator at a given separation helps improve the classification performance. If the parameter $\epsilon$ is fixed, the Lipschitz constant of the interpolator is allowed to increase only proportionally to the separation margin. The condition that the interpolator must be sufficiently regular in comparison with the separation suggests that increasing the separation too much at the cost of impairing the interpolator regularity may degrade the classifier performance. In the case that the supports $\mathcal{M}_{m}$ are low-dimensional manifolds, the covering number $\mathcal{N}(\epsilon/2,\mathcal{M}_{m})$ increases at a geometric rate with the intrinsic dimension $D$ of the manifold, since a $D$-dimensional manifold is locally homeomorphic to $\mathbb{R}^{D}$. Therefore, from the condition on the number of samples, $N_{m}$ should increase at a geometric rate with $D$.

In Theorem 2 the probability of misclassification decreases with the number $N_{m}$ of training samples at a rate of $O(N_{m}^{-1})$. In the rest of this section, we show that it is in fact possible to obtain an exponential convergence rate with linear and nearest-neighbor (NN) classifiers under certain assumptions. We first present the following lemma.

Lemma 3

Let $\mathcal{X}=\{x_{i}\}_{i=1}^{N}\subset H$ be a set of training samples such that each $x_{i}$ is drawn i.i.d. from one of the probability measures $\{\nu_{m}\}_{m=1}^{M}$. Let $x$ be a test sample randomly drawn according to the probability measure $\nu_{m}$ of class $m$. Let

$$A=\{x_{i}\in\mathcal{X}:x_{i}\in B_{\delta}(x),\,x_{i}\sim\nu_{m}\}$$ (5)

be the set of samples in $\mathcal{X}$ that are in a $\delta$-neighborhood of $x$ and also drawn from the measure $\nu_{m}$. Assume that $A$ contains $|A|=Q$ samples. Then

$$P\left(\Big\|f(x)-\frac{1}{Q}\sum_{x_{j}\in A}f(x_{j})\Big\|\leq L\delta+\sqrt{d}\epsilon\right)\geq 1-2d\exp\left(-\frac{Q\,\epsilon^{2}}{2L^{2}\delta^{2}}\right).$$ (6)

Lemma 3 is proved in Appendix A.2. The inequality in (6) shows that as the number $Q$ of training samples falling in a neighborhood of a test point $x$ increases, the probability of deviation of $f(x)$ from its average within the neighborhood decreases. The parameter $\epsilon$ captures the relation between the amount and the probability of deviation.

When studying the classification accuracy in the main result below, we will use the following generalized definition of linear separability.

Definition 4

Let $Y$ be a linearly separable embedding with margin $\gamma$ such that each pair $(k,l)$ of classes is separated by the hyperplane given by $\omega_{kl}$, $b_{kl}$ as defined in Definition 1. We say that the linear classifier given by $\{\omega_{kl}\}$, $\{b_{kl}\}$ has a $Q$-mean separability margin of $\gamma_{Q}>0$ if any choice of $Q$ samples $\{y_{k,i}\}_{i=1}^{Q}\subset Y$ from class $k$ and $Q$ samples $\{y_{l,i}\}_{i=1}^{Q}\subset Y$ from class $l$, $l\neq k$, satisfies

$$\begin{split}\omega_{kl}^{T}\left(\frac{1}{Q}\sum_{i=1}^{Q}y_{k,i}\right)+b_{kl}&\geq\gamma_{Q}/2\\ \omega_{kl}^{T}\left(\frac{1}{Q}\sum_{i=1}^{Q}y_{l,i}\right)+b_{kl}&\leq-\gamma_{Q}/2.\end{split}$$ (7)

The above definition of separability is more flexible than the one in Definition 1. Clearly, an embedding that is linearly separable with margin $\gamma$ has a $Q$-mean separability margin of $\gamma_{Q}\geq\gamma$ for any $Q$. As in the previous section, we consider that the test sample $x$ is classified with the linear classifier (4) in the low-dimensional domain, defined with respect to the set of hyperplanes given by $\{\omega_{mk}\}$ and $\{b_{mk}\}$ as in (2) and (3).
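For a fixed class pair and a fixed hyperplane $(\omega_{kl}, b_{kl})$, the $Q$-mean margin of Definition 4 can be computed exactly: the worst-case $Q$-sample mean on each side is the mean of the $Q$ least-separated samples. A small sketch of this computation (ours, not the paper's) is given below.

```python
# Sketch: Q-mean separability margin of Definition 4 for one class pair,
# given the embedded samples of class k (Yk), class l (Yl), a unit-norm w, and b.
import numpy as np

def q_mean_margin(Yk, Yl, w, b, Q):
    sk = np.sort(Yk @ w + b)[:Q].mean()    # mean over the Q least-separated samples of class k
    sl = np.sort(Yl @ w + b)[-Q:].mean()   # mean over the Q least-separated samples of class l
    return 2.0 * min(sk, -sl)              # largest gamma_Q for which (7) holds
```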

In the following result, we show that an exponential convergence rate can be obtained with linear classifiers in supervised manifold learning. We first define a parameter depending on $\delta$, which gives the smallest possible measure of the $\delta$-neighborhood $B_{\delta}(x)$ of a point $x$ in the support $\mathcal{M}_{m}$:

$$\eta_{m,\delta}:=\inf_{x\in\mathcal{M}_{m}}\nu_{m}(B_{\delta}(x)).$$
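When a finite sample of $\mathcal{M}_{m}$ is available, a crude empirical surrogate for $\eta_{m,\delta}$ is the smallest fraction of sample points falling within distance $\delta$ of any sample point; the sketch below (our illustration, not part of the paper) computes this quantity.

```python
# Sketch: empirical surrogate for eta_{m,delta} from a finite sample of the
# class support M_m (fraction of samples within delta of the worst point).
import numpy as np
from scipy.spatial.distance import cdist

def eta_hat(samples, delta):
    D = cdist(samples, samples)            # pairwise distances
    return (D < delta).mean(axis=1).min()  # smallest empirical ball measure
```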
Theorem 5

Let $\mathcal{X}=\{x_{i}\}_{i=1}^{N}\subset H$ be a set of training samples such that each $x_{i}$ is drawn i.i.d. from one of the probability measures $\{\nu_{m}\}_{m=1}^{M}$. Let $Y$ be an embedding of $\mathcal{X}$ in $\mathbb{R}^{d}$ that is linearly separable with a $Q$-mean separability margin larger than $\gamma_{Q}$. For given $\epsilon>0$ and $\delta>0$, let $f$ be a Lipschitz-continuous interpolator such that

$$L\delta+\sqrt{d}\epsilon\leq\frac{\gamma_{Q}}{2}.$$ (8)

Consider a test sample $x$ randomly drawn according to the probability measure $\nu_{m}$ of class $m$. If $\mathcal{X}$ contains at least $N_{m}$ training samples drawn i.i.d. from $\nu_{m}$ such that

$$N_{m}>\frac{Q}{\eta_{m,\delta}}$$

then the probability of correctly classifying $x$ with the linear classifier given in (4) is lower bounded as

$$P\left(\hat{C}(x)=m\right)\geq 1-\exp\left(-\frac{2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2d\exp\left(-\frac{Q\,\epsilon^{2}}{2L^{2}\delta^{2}}\right).$$ (9)

Theorem 5 is proved in Appendix A.3. The theorem shows how the classification accuracy is influenced by the separation of the classes in the embedding, the smoothness of the out-of-sample interpolant, and the number of training samples drawn from the density of each class. The condition in (8) points to the tradeoff between the separation and the regularity of the interpolation function. As the Lipschitz constant $L$ of the interpolation function $f$ increases, $f$ becomes less “regular”, and a higher separation $\gamma_{Q}$ is needed to meet the condition. This is coherent with the expectation that, when $f$ becomes irregular, the classifier becomes more sensitive to perturbations of the data, e.g., due to noise. The requirement of a higher separation then ensures a larger margin in the linear classifier, which compensates for the irregularity of $f$. From (8), it is also observed that the separation should increase with the dimension $d$, and also with $\epsilon$, whose increase improves the confidence of the bound (9). Note that the condition in (8) also implies the following: when computing an embedding, it is not advisable to increase the separation of the training data unconditionally. In particular, increasing the separation too much may violate the preservation of the geometry and yield an irregular interpolator. Hence, when designing a supervised dimensionality reduction algorithm, one must pay attention to the regularity of the resulting interpolator as much as to the enhancement of the separation margin.

Next, we discuss the roles of the parameters $Q$ and $\delta$. The term $\exp(-Q\,\epsilon^{2}/(2L^{2}\delta^{2}))$ in the correct classification probability bound (9) shows that, for fixed $\delta$, the confidence increases with the value of $Q$. Meanwhile, due to the numerator of the term $\exp(-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}/N_{m})$, the number of samples $N_{m}$ should also be sufficiently large with respect to $Q$ to obtain a high overall confidence. Similarly, at fixed $Q$, $\delta$ should be made smaller to increase the confidence due to the term $\exp(-Q\,\epsilon^{2}/(2L^{2}\delta^{2}))$, which then reduces the parameter $\eta_{m,\delta}$ and eventually requires the number of samples $N_{m}$ to take a sufficiently large value in order to make the term $\exp(-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}/N_{m})$ small and obtain a high confidence. Therefore, these two parameters $Q$ and $\delta$ behave in a similar way and determine the relation between the number of samples and the correct classification probability, i.e., they indicate how large $N_{m}$ should be in order to have a certain confidence of correct classification.

Theorem 5 studies the setting where the class labels are estimated with a linear classifier in the domain of embedding. We also provide another result below that analyzes the performance when a nearest-neighbor classifier is used in the domain of embedding.

Theorem 6

Let $\mathcal{X}=\{x_{i}\}_{i=1}^{N}\subset H$ be a set of training samples such that each $x_{i}$ is drawn i.i.d. from one of the probability measures $\{\nu_{m}\}_{m=1}^{M}$. Let $Y$ be an embedding of $\mathcal{X}$ in $\mathbb{R}^{d}$ such that

$$\begin{split}\|y_{i}-y_{j}\|&<D_{\delta},\ \text{ if }\|x_{i}-x_{j}\|\leq\delta\text{ and }C_{i}=C_{j}\\ \|y_{i}-y_{j}\|&>\gamma,\ \text{ if }C_{i}\neq C_{j},\end{split}$$

hence, nearby samples from the same class are mapped to nearby points, and samples from different classes are separated by a distance of at least $\gamma$ in the embedding.

For given $\epsilon>0$ and $\delta>0$, let $f$ be a Lipschitz-continuous interpolation function such that

$$L\delta+\sqrt{d}\epsilon+D_{2\delta}\leq\frac{\gamma}{2}.$$ (10)

Consider a test sample $x$ randomly drawn according to the probability measure $\nu_{m}$ of class $m$. If $\mathcal{X}$ contains at least $N_{m}$ training samples drawn i.i.d. from $\nu_{m}$ such that

$$N_{m}>\frac{Q}{\eta_{m,\delta}}$$

then the probability of correctly classifying $x$ with nearest-neighbor classification in $\mathbb{R}^{d}$ is lower bounded as

$$P\left(\hat{C}(x)=m\right)\geq 1-\exp\left(-\frac{2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2d\exp\left(-\frac{Q\,\epsilon^{2}}{2L^{2}\delta^{2}}\right).$$ (11)

Theorem 6 is proved in Appendix A.4. Theorem 6 is quite similar to Theorem 5 and can be interpreted similarly. Unlike the previous result, the separability condition on the embedding is based here on the pairwise distances between samples from different classes. The condition (10) suggests that the result is useful when the parameter $D_{2\delta}$ is sufficiently small, which requires the embedding to map nearby samples of the same class in the ambient space to nearby points.

In this section, we have characterized the regularity of the interpolation functions via their rates of variation when restricted to the supports $\mathcal{M}_{m}$. While the results of this section are generic in the sense that they are valid for any interpolation function with the described regularity properties, we have not examined the construction of such functions. In a practical classification problem where one uses a particular type of interpolation function, one would also be interested in the adaptation of these results to obtain performance guarantees for the particular type of function used. Hence, in the following section we focus on a popular family of smooth functions, namely radial basis function (RBF) interpolators, and study the classification performance of this particular type of interpolator.

2.3 Out-of-sample interpolation with RBF interpolators

Here we consider an RBF interpolation function $f:H\rightarrow\mathbb{R}^{d}$ of the form

$$f(x)=[f^{1}(x)\ f^{2}(x)\,\dots\,f^{d}(x)]$$

such that each component $f^{k}$ of $f$ is given by

$$f^{k}(x)=\sum_{i=1}^{N}c_{i}^{k}\,\phi(\|x-x_{i}\|)$$

where $\phi:\mathbb{R}\rightarrow\mathbb{R}^{+}$ is a kernel function, $c_{i}^{k}\in\mathbb{R}$ are coefficients, and $x_{i}$ are kernel centers. In interpolation with RBF functions, it is common to choose the set of kernel centers as the set of available data samples. Hence, we assume that the set of kernel centers $\{x_{i}\}_{i=1}^{N}$ is selected to be the same as the set of training samples $\mathcal{X}$. We consider a setting where the coefficients $c_{i}^{k}$ are set such that $f(x_{i})=y_{i}$, i.e., $f$ maps each training point in $\mathcal{X}$ to its embedding previously computed with supervised manifold learning.
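This fitting step amounts to solving one linear system per embedding dimension (cf. the system (18) in Section 2.4). The following sketch, which we add for illustration with a Gaussian kernel and numpy/scipy (assumptions on our part, not the paper's code), computes the coefficients $c_i^k$ and evaluates $f$ at new samples.

```python
# Sketch: RBF out-of-sample interpolator of Sec. 2.3 with a Gaussian kernel.
# X (N x n) are training samples, Y (N x d) their embedding; enforces f(x_i) = y_i.
import numpy as np
from scipy.spatial.distance import cdist

def fit_rbf(X, Y, sigma):
    Phi = np.exp(-cdist(X, X) ** 2 / sigma ** 2)   # Phi_ij = phi(||x_i - x_j||)
    C = np.linalg.solve(Phi, Y)                    # column k holds the coefficients c^k
    return C

def apply_rbf(X_new, X, C, sigma):
    Phi_new = np.exp(-cdist(X_new, X) ** 2 / sigma ** 2)
    return Phi_new @ C                             # rows are f(x) for the new samples
```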

We consider the RBF kernel $\phi$ to be a Lipschitz continuous function with constant $L_{\phi}>0$; hence, for any $u,v\in\mathbb{R}$

$$|\phi(u)-\phi(v)|\leq L_{\phi}\,|u-v|.$$

Also, let $\mathcal{C}$ be an upper bound on the coefficient magnitudes such that for all $k=1,\dots,d$

$$\sum_{i=1}^{N}|c_{i}^{k}|\leq\mathcal{C}.$$

In the following, we analyze the classification accuracy and extend the results in Section 2.2 to the case of RBF interpolators. We first give the following result, which probabilistically bounds how much the value of the interpolator $f$ at a point $x$ randomly drawn from $\nu_{m}$ may deviate from the average interpolator value over the training points of the same class within a neighborhood of $x$.

Lemma 7

Let $\mathcal{X}=\{x_{i}\}_{i=1}^{N}\subset H$ be a set of training samples such that each $x_{i}$ is drawn i.i.d. from one of the probability measures $\{\nu_{m}\}_{m=1}^{M}$. Let $x$ be a test sample randomly drawn according to the probability measure $\nu_{m}$ of class $m$. Let

$$A=\{x_{i}\in\mathcal{X}:x_{i}\in B_{\delta}(x),\,x_{i}\sim\nu_{m}\}$$ (12)

be the set of samples in $\mathcal{X}$ that are in a $\delta$-neighborhood of $x$ and also drawn from the measure $\nu_{m}$. Assume that $A$ contains $|A|=Q$ samples. Then

$$P\left(\Big\|f(x)-\frac{1}{Q}\sum_{x_{j}\in A}f(x_{j})\Big\|\leq\sqrt{d}\,\mathcal{C}(L_{\phi}\delta+\epsilon)\right)\geq 1-2N\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right).$$ (13)

The proof of Lemma 7 is given in Appendix A.5. The lemma states a result similar to that of Lemma 3; however, it is specialized to the case where $f$ is an RBF interpolator.

We are now ready to present the following main result.

Theorem 8

Let $\mathcal{X}=\{x_{i}\}_{i=1}^{N}\subset H$ be a set of training samples such that each $x_{i}$ is drawn i.i.d. from one of the probability measures $\{\nu_{m}\}_{m=1}^{M}$. Let $Y$ be an embedding of $\mathcal{X}$ in $\mathbb{R}^{d}$ that is linearly separable with a $Q$-mean separability margin larger than $\gamma_{Q}$. For given $\epsilon>0$ and $\delta>0$, let $f$ be an RBF interpolator such that

$$\sqrt{d}\,\mathcal{C}\,(L_{\phi}\delta+\epsilon)\leq\frac{\gamma_{Q}}{2}.$$ (14)

Consider a test sample $x$ randomly drawn according to the probability measure $\nu_{m}$ of class $m$. If $\mathcal{X}$ contains at least $N_{m}$ training samples drawn i.i.d. from $\nu_{m}$ such that

$$N_{m}>\frac{Q}{\eta_{m,\delta}}$$

then the probability of correctly classifying $x$ with the linear classifier given in (4) is lower bounded as

$$P\left(\hat{C}(x)=m\right)\geq 1-\exp\left(-\frac{2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2N\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right).$$ (15)

The theorem is proved in Appendix A.6. The theorem bounds the classification accuracy in terms of the smoothness of the RBF interpolation function and the number of samples. The condition in (14) characterizes the compromise between the separation and the regularity of the interpolator, which depends on the Lipschitz constant of the RBF kernel and the coefficient magnitudes. As the Lipschitz constant $L_{\phi}$ and the coefficient magnitude parameter $\mathcal{C}$ increase (i.e., $f$ becomes less “regular”), a higher separation $\gamma_{Q}$ is required to provide a performance guarantee. When the separation margin of the embedding and the interpolator satisfy the condition in (14), the misclassification probability decays exponentially as the number of training samples increases, similarly to the results in Section 2.2.

Theorem 8 studies the misclassification probability when the class labels in the low-dimensional domain are estimated with a linear classifier. We also present below a bound on the misclassification probability when the nearest-neighbor classifier is used in the low-dimensional domain.

Theorem 9

Let $\mathcal{X}=\{x_{i}\}_{i=1}^{N}\subset H$ be a set of training samples such that each $x_{i}$ is drawn i.i.d. from one of the probability measures $\{\nu_{m}\}_{m=1}^{M}$. Let $Y$ be an embedding of $\mathcal{X}$ in $\mathbb{R}^{d}$ such that

$$\begin{split}\|y_{i}-y_{j}\|&<D_{\delta},\ \text{ if }\|x_{i}-x_{j}\|\leq\delta\text{ and }C_{i}=C_{j}\\ \|y_{i}-y_{j}\|&>\gamma,\ \text{ if }C_{i}\neq C_{j}.\end{split}$$

For given $\epsilon>0$ and $\delta>0$, let $f$ be an RBF interpolator such that

$$\sqrt{d}\,\mathcal{C}\,(L_{\phi}\delta+\epsilon)+D_{2\delta}\leq\frac{\gamma}{2}.$$ (16)

Consider a test sample $x$ randomly drawn according to the probability measure $\nu_{m}$ of class $m$. If $\mathcal{X}$ contains at least $N_{m}$ training samples drawn i.i.d. from $\nu_{m}$ such that

$$N_{m}>\frac{Q}{\eta_{m,\delta}}$$

then the probability of correctly classifying $x$ with nearest-neighbor classification in $\mathbb{R}^{d}$ is lower bounded as

$$P\left(\hat{C}(x)=m\right)\geq 1-\exp\left(-\frac{2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2N\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right).$$ (17)

Theorem 9 is proved in Appendix A.7. While it provides the same convergence rate as Theorem 8, the necessary condition in (16) also includes the parameter $D_{2\delta}$. Hence, the misclassification probability can be upper bounded if the embedding maps nearby samples of the same class to nearby points and a compromise is achieved between the separation and the interpolator regularity.

2.4 Optimizing the scale of Gaussian RBF kernels

In data interpolation with RBFs, it is known that the accuracy of interpolation is quite sensitive to the choice of the shape parameter for several kernels including the Gaussian kernel (Baxter, 1992). The relation between the shape parameter and the performance of interpolation has been an important problem of interest (Piret, 2007). In this section, we focus on the Gaussian RBF kernel, which is a popular choice for RBF interpolation due to its smoothness and good spatial localization properties. We study the choice of the scale parameter of the kernel within the context of classification.

We consider the RBF kernel given by

$$\phi(r)=e^{-\frac{r^{2}}{\sigma^{2}}}$$

where $\sigma$ is the scale parameter of the Gaussian function. We focus on the condition (14) in Theorem 8

$$\sqrt{d}\,\mathcal{C}\,(L_{\phi}\delta+\epsilon)\leq\gamma_{Q}/2,$$

(or equivalently the condition (16) if the nearest neighbor classifier is used), which relates the properties of the interpolation function to the separation. In particular, for a given separation margin, this condition is satisfied more easily when the term on the left hand side of the inequality is smaller. Thus, in the following, we derive an expression for the left hand side of the above inequality by expressing the Lipschitz constant $L_{\phi}$ and the coefficient bound $\mathcal{C}$ in terms of the scale parameter $\sigma$ of the Gaussian kernel. We then study the scale parameter that minimizes $\sqrt{d}\,\mathcal{C}\,(L_{\phi}\delta+\epsilon)$.

Writing the condition $f(x_{i})=y_{i}$ in matrix form for each dimension $k=1,\dots,d$, we have

$$\Phi c^{k}=y^{k}$$ (18)

where $\Phi\in\mathbb{R}^{N\times N}$ is a matrix whose $(i,j)$-th entry is given by $\Phi_{ij}=\phi(\|x_{i}-x_{j}\|)$, $c^{k}\in\mathbb{R}^{N\times 1}$ is the coefficient vector whose $i$-th entry is $c_{i}^{k}$, and $y^{k}\in\mathbb{R}^{N\times 1}$ is the data coordinate vector giving the $k$-th dimension of the embeddings of all samples, i.e., $y_{i}^{k}=Y_{ik}$. Assuming that the embedding is computed with the usual scale constraint $Y^{T}Y=I$, we have $\|y^{k}\|=1$. The norm of the coefficient vector can then be bounded as

$$\|c^{k}\|\leq\|\Phi^{-1}\|\,\|y^{k}\|=\|\Phi^{-1}\|.$$ (19)

In the rest of this section, we assume that the data $\mathcal{X}$ are sampled from a Euclidean space, i.e., $H=\mathbb{R}^{n}$. We first use a result by Narcowich et al. (1994) in order to bound the norm $\|\Phi^{-1}\|$ of the inverse matrix. From (Narcowich et al., 1994, Theorem 4.1) we get

$$\|\Phi^{-1}\|\leq\beta\,\sigma^{-n}e^{\alpha\sigma^{2}}$$ (20)

where $\alpha>0$ and $\beta>0$ are constants depending on the dimension $n$ and the minimum distance between the training points in $\mathcal{X}$ (the separation radius) (Narcowich et al., 1994). (The result stated in (Narcowich et al., 1994, Theorem 4.1) is adapted to our study by taking the measure as $\beta(\rho)=\delta(\rho-\rho_{0})$, so that the RBF kernel defined in (Narcowich et al., 1994, (1.1)) corresponds to a Gaussian function $F(r)=\exp(-\rho_{0}\,r^{2})$; the scale of the Gaussian kernel is then given by $\sigma={\rho_{0}}^{-1/2}$.) As the $\ell_{1}$-norm of the coefficient vector can be bounded as $\|c^{k}\|_{1}\leq\sqrt{N}\|c^{k}\|$, from (19) one can set the parameter $\mathcal{C}$ that upper bounds the coefficient magnitudes as

$$\mathcal{C}=a\sigma^{-n}e^{\alpha\sigma^{2}}$$

where $a=\beta\sqrt{N}$.

Next, we derive a Lipschitz constant for the Gaussian kernel $\phi(r)$ in terms of $\sigma$. Setting the second derivative of $\phi$ to zero,

$$\frac{d^{2}\phi}{dr^{2}}=e^{-\frac{r^{2}}{\sigma^{2}}}\left(\frac{4r^{2}}{\sigma^{4}}-\frac{2}{\sigma^{2}}\right)=0$$

we get that the maximum value of $|d\phi/dr|$ is attained at $r=\sigma/\sqrt{2}$. Evaluating $|d\phi/dr|$ at this value, we obtain

$$L_{\phi}=\sqrt{2}\,e^{-\frac{1}{2}}\,\sigma^{-1}.$$

Now rewriting the condition (14) of the theorem, we have

$$\sqrt{d}\,\mathcal{C}\,(L_{\phi}\delta+\epsilon)=a_{1}\sigma^{-n-1}e^{\alpha\sigma^{2}}+a_{2}\sigma^{-n}e^{\alpha\sigma^{2}}\leq\gamma_{Q}/2$$

where $a_{1}=\sqrt{2d}\,a\,e^{-1/2}\delta$ and $a_{2}=\sqrt{d}\,a\,\epsilon$. We thus determine the Gaussian scale parameter $\sigma$ that minimizes

$$F(\sigma)=a_{1}\sigma^{-n-1}e^{\alpha\sigma^{2}}+a_{2}\sigma^{-n}e^{\alpha\sigma^{2}}.$$

First, notice that as $\sigma\rightarrow 0$ and as $\sigma\rightarrow\infty$, the function $F(\sigma)\rightarrow\infty$. Therefore, it has at least one minimum. Setting

$$\frac{dF}{d\sigma}=e^{\alpha\sigma^{2}}\sigma^{-n-2}\big(2\alpha a_{2}\sigma^{3}+2\alpha a_{1}\sigma^{2}-a_{2}n\sigma-a_{1}(n+1)\big)=0$$

we need to solve

$$2\alpha a_{2}\sigma^{3}+2\alpha a_{1}\sigma^{2}-a_{2}n\sigma-a_{1}(n+1)=0.$$ (21)

In the above cubic polynomial, the leading and the second-degree coefficients are positive, while the first-degree and the constant coefficients are negative. Then, the sum of the roots is negative and the product of the roots is positive. Therefore, there is one and only one positive root $\sigma_{opt}$, which is the unique minimizer of $F(\sigma)$.
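Numerically, $\sigma_{opt}$ is simply the unique positive root of the cubic (21); a minimal sketch (ours, with $\alpha$, $a_1$, $a_2$, $n$ assumed to be available from the bounds above) is given below.

```python
# Sketch: the optimal Gaussian scale of Sec. 2.4 as the unique positive
# root of the cubic polynomial (21).
import numpy as np

def optimal_sigma(alpha, a1, a2, n):
    # 2*alpha*a2*s^3 + 2*alpha*a1*s^2 - a2*n*s - a1*(n+1) = 0
    coeffs = [2 * alpha * a2, 2 * alpha * a1, -a2 * n, -a1 * (n + 1)]
    roots = np.roots(coeffs)
    positive = [r.real for r in roots if abs(r.imag) < 1e-8 and r.real > 0]
    return max(positive)   # exactly one positive real root exists
```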

The existence of an optimal scale parameter $0<\sigma_{opt}<\infty$ for the RBF kernel can be intuitively explained as follows. When $\sigma$ takes too small a value, the support of the RBF function concentrated around the training points does not sufficiently cover the whole class supports $\mathcal{M}_{m}$. This manifests itself in (14) through the increase in the term $L_{\phi}$, which indicates that the interpolation function is not sufficiently regular. This weakens the guarantee that a test sample will be interpolated sufficiently close to its neighboring training samples from the same class and mapped to the correct side of the hyperplane in the linear classifier. On the other hand, when $\sigma$ increases too much, the stability of the linear system (18) is impaired and the coefficients $c$ grow too large. This results in an overfitting of the interpolator and, therefore, decreases the classification performance. Hence, the analysis in this section provides a theoretical justification of the common knowledge that $\sigma$ should be set to a sufficiently large value while avoiding overfitting.

Remark: It is also interesting to observe how the optimal scale parameter changes with the number of samples $N$. In the study (Narcowich et al., 1994), the constants $\alpha$ and $\beta$ in (20) are shown to vary with the separation radius $q$ at rates $\alpha=O(q^{-2})$ and $\beta=O(q^{n})$, where the separation radius $q$ is proportional to the smallest distance between two distinct training samples. A reasonable assumption is then that the separation radius $q$ typically decreases at rate $O(N^{-1/n})$ as $N$ increases. Using this relation, we get that $\alpha$ and $\beta$ vary at rates $\alpha=O(N^{2/n})$ and $\beta=O(N^{-1})$ with $N$. It follows that $a=\beta\sqrt{N}=O(N^{-1/2})$, and the parameters $a_{1}$, $a_{2}$ of the cubic polynomial in (21) also vary with $N$ at rates $a_{1}=O(N^{-1/2})$, $a_{2}=O(N^{-1/2})$. The equation (21) in $\sigma$ can then be rearranged as

$$b_{3}\sigma^{3}+b_{2}\sigma^{2}-b_{1}\sigma-b_{0}=0$$

such that the constants vary with $N$ at rates $b_{3}=O(N^{2/n})$, $b_{2}=O(N^{2/n})$, $b_{1}=O(1)$, $b_{0}=O(1)$. We can then inspect how the roots of this equation change as $N$ increases. Since $b_{3}$ and $b_{2}$ dominate the other coefficients for large $N$, three real roots exist if $N$ is sufficiently large, two of which are negative and one is positive. The sum of the pairwise products of the roots is negative and decays with $N$ at rate $O(N^{-2/n})$, and the product of the roots also decays with $N$. Then at least two of the roots must decay with $N$. Meanwhile, the sum of the three roots is $O(1)$ and negative, which shows that one of the negative roots is $O(1)$, i.e., does not decay with $N$. From the product of the three roots, we then observe that the product of the two decaying roots is $O(N^{-2/n})$. However, their sum also decays at the same rate (from the sum of the pairwise products), which is possible only if their dominant terms have the same rate and cancel each other. We conclude that both of the decaying roots vary at rate $O(N^{-1/n})$, one of which is the positive root and hence the optimal value $\sigma_{opt}$ of the scale parameter. This analysis shows that the scale parameter of the Gaussian kernel should be adapted to the number of training samples, and a smaller kernel scale must be preferred for a larger number of training samples. In fact, the relation $\sigma_{opt}=O(N^{-1/n})$ is quite intuitive, as the average or typical distance between two samples also decreases at rate $O(N^{-1/n})$ as the number of samples $N$ increases in an $n$-dimensional space. The above result then simply suggests that the kernel scale should be chosen proportionally to the typical distance between the training samples.

2.5 Discussion of the results in relation with previous results

In Theorems 8 and 9, we have presented results that characterize the performance of classification with RBF interpolation functions. In particular, we have considered a setting where an RBF interpolator is fitted to each dimension of a low-dimensional embedding in which the different classes are separable. Our study has several links with RBF networks and least-squares regression algorithms. In this section, we interpret our findings in relation to previously established results.

Several previous works study the performance of learning by considering a probability measure $\rho$ defined on $X\times Y$, where $X$ and $Y$ are two sets. The “label” set $Y$ is often taken as an interval $[-L,L]$. Given a set of data pairs $\{(x_{j},y_{j})\}_{j=1}^{N}$ sampled from the distribution $\rho$, the RBF network estimates a function $\hat{f}$ of the form

$$\hat{f}(x)=\sum_{i=1}^{R}c_{i}\,\phi\left(\frac{\|x-t_{i}\|}{\sigma_{i}}\right).$$ (22)

The number of RBF terms $R$ may in general be different from the number of samples $N$. The function $\hat{f}$ minimizes the empirical error

$$\hat{f}=\arg\min_{f}\sum_{j=1}^{N}\left(f(x_{j})-y_{j}\right)^{2}.$$
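As an illustration of this setting (our sketch, not taken from the cited works), the network (22) with $R$ centers drawn from the data and a common Gaussian kernel of scale $\sigma$ (both assumptions) can be fitted by ordinary least squares:

```python
# Sketch: least-squares fit of the RBF network in (22) with R centers drawn
# from the data; illustration of the reviewed setting, assumes numpy/scipy.
import numpy as np
from scipy.spatial.distance import cdist

def fit_rbf_network(X, y, R, sigma, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=R, replace=False)]
    G = np.exp(-cdist(X, centers) ** 2 / sigma ** 2)   # N x R design matrix
    c, *_ = np.linalg.lstsq(G, y, rcond=None)          # minimizes the empirical error
    return centers, c
```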

The function $\hat{f}$ estimated from a finite collection of data samples is often compared to the regression function (Cucker and Smale, 2002)

$$f_{o}(x)=\int_{Y}y\,d\rho(y|x)$$

where $d\rho(y|x)$ is the conditional probability measure on $Y$. The regression function $f_{o}$ minimizes the expected risk:

$$f_{o}=\arg\min_{f}\int_{X\times Y}\big(f(x)-y\big)^{2}d\rho.$$

As the probability measure $\rho$ is not known in practice, the estimate $\hat{f}$ of $f_{o}$ is obtained from data samples. Several previous works have characterized the performance of learning by studying the approximation error (Niyogi and Girosi, 1996), (Lin et al., 2014)

$$\mathbb{E}[(f_{o}-\hat{f})^{2}]=\int_{X}(f_{o}(x)-\hat{f}(x))^{2}d\rho_{X}(x)$$ (23)

where $\rho_{X}$ is the marginal probability measure on $X$. This definition of the approximation error can be adapted to our setting as follows. In our problem, the distribution of each class is assumed to have a bounded support, which is a special case of modeling the data with an overall probability distribution $\rho$. If the supports $\mathcal{M}_{m}$ are assumed to be nonintersecting, the regression function $f_{o}$ is given by

$$f_{o}(x)=\sum_{m=1}^{M}m\,I_{m}(x)$$

which corresponds to the class labels $m=1,\dots,M$, where $I_{m}$ is the indicator function of the support $\mathcal{M}_{m}$. It is then easy to show that the approximation error $\mathbb{E}[(f_{o}-\hat{f})^{2}]$ can be bounded by a constant times the probability of misclassification $P(\hat{C}(x)\neq m)$. Hence, we can compare our misclassification probability bounds in Section 2.3 with the approximation errors reported in other works.

The study in (Niyogi and Girosi, 1996) assumes that the regression function is an element of a Bessel potential space of sufficiently high order and that the sum of the coefficient magnitudes $|c_{i}|$ is bounded. It is then shown that for data sampled from $\mathbb{R}^{n}$, with probability greater than $1-\delta$ the approximation error in (23) can be bounded as

$$\mathbb{E}[(f_{o}-\hat{f})^{2}]\leq O\left(\frac{1}{R}\right)+O\left(\sqrt{\frac{Rn\log(RN)-\log(\delta)}{N}}\right)$$ (24)

where $R$ is the number of RBF terms.

The analysis by Lin et al. (2014) considers families of RBF kernels that include the Gaussian function. Supposing that the regression function $f_{o}$ is of Sobolev class $W_{2}^{r}$, and that the number of RBF terms is given by $R=N^{\frac{n}{n+2r}}$ in terms of the number of samples $N$, the approximation error is bounded as

$$\mathbb{E}[(f_{o}-\hat{f})^{2}]\leq O\left(N^{-\frac{2r}{n+2r}}\log^{2}(N)\right).$$ (25)

Next, we overview the study by Hernández-Aguirre et al. (2002), which examines the performance of RBFs in a Probably Approximately Correct (PAC) learning framework. For $X\subset\mathbb{R}^{n}$, a family $\mathcal{F}$ of measurable functions from $X$ to $[0,1]$ is considered, and the problem of approximating a target function $f_{0}$, known only through examples, with a function $\hat{f}\in\mathcal{F}$ is studied. The authors use a previous result from (Vidyasagar, 1997) that relates the accuracy of empirical risk minimization to the covering number of $\mathcal{F}$ and the number of samples. Combining this result with bounds on covering number estimates of Lipschitz continuous functions (Kolmogorov and Tihomirov, 1961), the following result is obtained for PAC function learning with RBF neural networks with the Gaussian kernel. Let the coefficients be bounded as $|c_{i}|\leq A$, let a common scale parameter be chosen as $\sigma_{i}=\sigma$, and let $\mathbb{E}[|f_{0}-\hat{f}|]$ be computed under a uniform probability measure $\rho$. Then, if the number of samples satisfies

$$N\geq\frac{8}{\varepsilon^{2}}\log\bigg(\frac{\sqrt{2}RnA}{e^{-1/2}\sigma\zeta}\bigg)$$ (26)

an approximation of the target function is obtained with accuracy parameter $\varepsilon$ and confidence parameter $\zeta$:

$$P(\mathbb{E}[|f_{0}-\hat{f}|]>\varepsilon)\leq\zeta.$$ (27)

In the above expression, the expectation is over the test samples, whereas the probability is over the training samples; i.e., over all possible distributions of training samples, the probability of having an average approximation error larger than $\varepsilon$ is bounded. Note that our results in Theorems 8 and 9, when translated into the above PAC-learning framework, correspond to a confidence parameter of $\zeta=0$. This is because the misclassification probability bound for a test sample is valid for any choice of the training samples, provided that the condition (14) (or the condition (16)) holds. Thus, in our results the probability running over the training samples in (27) has no counterpart. When we take $\zeta=0$, the above result does not provide a useful bound, since $N\rightarrow\infty$ as $\zeta\rightarrow 0$. By contrast, our result is valid only if the condition (14) or (16) on the interpolation function holds. It is easy to show that, assuming nonintersecting class supports $\mathcal{M}_{m}$, the expression $\mathbb{E}[|f_{0}-\hat{f}|]$ is given by a constant times the probability of misclassification. The accuracy parameter $\varepsilon$ can then be seen as the counterpart of the misclassification probability upper bound given on the right hand sides of (15) and (17) (the expression subtracted from 1). At fixed $N$, the dependence of the accuracy on the kernel scale parameter is monotonic in the bound (26): $\varepsilon$ decreases as $\sigma$ increases. Therefore, this bound does not guide the selection of the scale parameter of the RBF kernel, whereas the discussion in Section 2.4 (confirmed by the experimental results in Section 4.2) suggests the existence of an optimal scale.

Finally, we mention some results on the learning performance of regularized least-squares regression algorithms. In (Caponnetto and De Vito, 2007), optimal rates are derived for the regularized least-squares method in a reproducing kernel Hilbert space (RKHS) in the minimax sense. It is shown that, under some hypotheses concerning the data probability measure and the complexity of the family of learnt functions, the maximum error (yielded by the worst distribution) obtained with the regularized least-squares method converges at a rate of $O(1/N)$. Next, the work in (Steinwart et al., 2009) shows that, in regularized least-squares regression over an RKHS, if the eigenvalues of the kernel integral operator decay sufficiently fast and the $\ell_{\infty}$-norms of the regression functions can be bounded, the error of the classifier converges at a rate of up to $O(1/N)$ with high probability. Steinwart et al. also examine the learning performance in relation to the exponent of the function norm in the regularization term and show that the learning rate is not affected by the choice of this exponent.

We now overview the three bounds given in (24), (25), and (26) in terms of the dependence of the error on the number of samples. The results in (24) and (25) provide a useful bound only in the case where the number of samples NN is larger than the number of RBF terms RR, contrary to our study where we treat the case R=NR=N. If it is assumed that NN is sufficiently larger than RR, the result in (24) predicts a rate of decay of only O(log(N)/N)O(\sqrt{\log(N)/N}) in the misclassification probability. The bound in (25) improves with the Sobolev regularity of the regression function; however, the dependence of the error on the number of samples is of a similar nature to the one in (24). Considering ε\varepsilon as a misclassification error parameter in the bound in (26), the error decreases at a rate of O(N1/2)O(N^{-1/2}) as the number of samples increases. The analysis in (Caponnetto and De Vito, 2007) and (Steinwart et al., 2009) also provide the similar rates of convergence of O(N1)O(N^{-1}). Meanwhile, our results in Theorems 8 and 9 predict an exponential decay in the misclassification probability as the number of samples NN increases (under the reasonable assumption that Nm=O(N)N_{m}=O(N) for each class mm). The reason why we arrive at a more optimistic bound is the specialization of the analysis to the considered particular setting, where the support of each class is assumed to be restricted to a totally bounded region in the ambient space, as well as the assumed relations between the separation margin of the embedding and the regularity of the interpolation function.

Another difference between these previous results and ours is the dependence on the dimension. The results in (24), (25), and (26) predict an increase in the error at the respective rates of O(n)O(\sqrt{n}), O(e1/n)O(e^{-1/n}), and O(logn)O(\sqrt{\log n}) with the ambient space dimension nn. While these results assume that the data 𝒳n\mathcal{X}\subset\mathbb{R}^{n} is in a Euclidean space of dimension nn, our study assumes the data 𝒳\mathcal{X} to be in a generic Hilbert space HH. The results in Theorems 5-8 involve the dimension dd of the low-dimensional space of embedding and do not explicitly depend on the dimension of the ambient Hilbert space HH (which could be infinite-dimensional). However, especially in the context of manifold learning, it is interesting to analyze the dependence of our bound on the intrinsic dimension of the class supports m\mathcal{M}_{m}.

In order to put the expressions (15), (17) in a more convenient form, let us reduce one parameter by setting Q=Nmηm,δ/2Q=N_{m}\eta_{m,\delta}/2. Then the misclassification probability is of

O(exp(Nmηm,δ2)+Nexp(Nmηm,δϵ2Lϕ2δ2)).O\left(\exp(-N_{m}\eta_{m,\delta}^{2})+N\exp\left(-\frac{N_{m}\,\eta_{m,\delta}\,\epsilon^{2}}{L_{\phi}^{2}\,\delta^{2}}\right)\right).

We can characterize the dependence of this expression on the intrinsic dimension as follows. Since the supports m\mathcal{M}_{m} are assumed to be totally bounded, one can define a parameter Θ\Theta that represents the “diameter” of m\mathcal{M}_{m}, i.e., the largest distance between any two points on m\mathcal{M}_{m}. Then the minimum measure ηm,δ\eta_{m,\delta} of a ball of radius δ\delta in m\mathcal{M}_{m} is of O((δ/Θ)D)O((\delta/\Theta)^{D}), where DD is the intrinsic dimension of m\mathcal{M}_{m}. Substituting this into the above expression gives the probability of misclassification as

O(exp(Nmδ2DΘ2D)+Nexp(NmδD2ϵ2Lϕ2ΘD)).O\left(\exp\left(-\frac{N_{m}\,\delta^{2D}}{\Theta^{2D}}\right)+N\exp\left(-\frac{N_{m}\,\delta^{D-2}\,\epsilon^{2}}{L_{\phi}^{2}\,\Theta^{D}}\right)\right).

This shows that in order to retain the correct classification guarantee, as the intrinsic dimension DD grows, the number of samples NmN_{m} should increase at a geometric rate with DD. In supervised manifold learning problems, data sets usually have a low intrinsic dimension; therefore, this geometric rate of increase can often be tolerated. Meanwhile, the dimension of the ambient space is typically high, so that performance bounds independent of the ambient space dimension are of particular interest. Note that generalization bounds in terms of the intrinsic dimension have been proposed in some previous works as well (Bickel and Li, 2007), (Kpotufe, 2011), for the local linear regression and the K-NN regression problems.
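To make the role of the intrinsic dimension concrete, the following minimal sketch evaluates the two exponential terms of the above bound for some illustrative parameter values; the function name, the chosen constants, and the omission of the multiplicative factors are our own simplifications and not part of the analysis.

    import numpy as np

    def misclassification_bound(N_m, N, D, delta, Theta, eps, L_phi):
        # Orders of the two exponential terms in the bound above;
        # multiplicative constants are ignored.
        term1 = np.exp(-N_m * delta**(2 * D) / Theta**(2 * D))
        term2 = N * np.exp(-N_m * delta**(D - 2) * eps**2 / (L_phi**2 * Theta**D))
        return term1 + term2

    # Keeping delta, Theta, eps, and L_phi fixed, the number of samples N_m must grow
    # geometrically with the intrinsic dimension D to keep the bound at a comparable level.
    for D in (2, 4, 8):
        print(D, misclassification_bound(N_m=10**D, N=2 * 10**D, D=D,
                                         delta=0.3, Theta=1.0, eps=0.1, L_phi=1.0))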

3 Separability of supervised nonlinear embeddings

In the results in Section 2, we have presented generalization bounds for classifiers based on linearly separable embeddings. One may wonder if the separability assumption is easy to satisfy when computing structure-preserving nonlinear embeddings of data. In this section, we try to answer this question by focusing on a particular family of supervised dimensionality reduction algorithms, namely supervised Laplacian eigenmaps embeddings, and analyze the conditions of separability. We first discuss the supervised Laplacian eigenmaps embeddings in Section 3.1 and then present results in Section 3.2 about the linear separability of these embeddings.

3.1 Supervised Laplacian eigenmaps embeddings

Let 𝒳={xi}i=1NH\mathcal{X}=\{x_{i}\}_{i=1}^{N}\subset H be a set of training samples, where each xix_{i} belongs to one of MM classes. Most manifold learning algorithms rely on a graph representation of data. In some works this graph is complete, so that an edge exists between each pair of samples; in other algorithms, in order to better capture the intrinsic geometric structure of data, each data sample is connected only to its nearest neighbors, so that an edge exists only between neighboring data samples.

In our analysis, we consider a weighted data graph GG each vertex of which represents a point xix_{i}. We write xixjx_{i}\sim x_{j}, or simply iji\sim j if the graph contains an edge between the data samples xix_{i}, xjx_{j}. We denote the edge weight as wij>0w_{ij}>0. The weights wijw_{ij} are usually determined as a positive and monotonically decreasing function of the distance between xix_{i} and xjx_{j} in HH, where the Gaussian function is a common choice. Nevertheless, we maintain a generic formulation here without making any assumption on the neighborhood or weight selection strategies.

Now let GwG_{w} and GbG_{b} represent two subgraphs of GG, which contain the edges of GG that are respectively within the same class and between different classes. Hence, GwG_{w} contains an edge iwji\sim_{w}j between samples xix_{i} and xjx_{j}, if iji\sim j and Ci=CjC_{i}=C_{j}. Similarly, GbG_{b} contains an edge ibji\sim_{b}j if iji\sim j and CiCjC_{i}\neq C_{j}. We assume that all vertices of GG are contained in both GwG_{w} and GbG_{b}; and that GwG_{w} has exactly MM connected components such that the training samples in each class form a connected component. (The straightforward application of common graph construction strategies, like connecting each training sample to its K-nearest neighbors or to its neighbors within a given distance, may result in several disconnected components in a single class in the graph if there is much diversity in that class. However, this difficulty can be easily overcome by introducing extra edges to bridge between graph components that are originally disconnected.) We also assume that GwG_{w} and GbG_{b} do not contain any isolated vertices; i.e., each data sample xix_{i} has at least one neighbor in both graphs.

The N×NN\times N weight matrices WwW_{w} and WbW_{b} of GwG_{w} and GbG_{b} have entries as follows.

Ww(i,j)={wij if ij and Ci=Cj0 otherwiseW_{w}(i,j)=\bigg{\{}\begin{array}[]{l}w_{ij}\,\,\text{ if }i\sim j\text{ and }C_{i}=C_{j}\\ 0\,\,\text{ otherwise}\end{array}
Wb(i,j)={wij if ij and CiCj0 otherwiseW_{b}(i,j)=\bigg{\{}\begin{array}[]{l}w_{ij}\,\,\text{ if }i\sim j\text{ and }C_{i}\neq C_{j}\\ 0\,\,\text{ otherwise}\end{array}

Let dw(i)d_{w}(i) and db(i)d_{b}(i) denote the degrees of xix_{i} in GwG_{w} and GbG_{b}

dw(i)=jwiwij,db(i)=jbiwij\displaystyle d_{w}(i)=\sum_{j\sim_{w}i}w_{ij},\qquad\qquad d_{b}(i)=\sum_{j\sim_{b}i}w_{ij}

and DwD_{w}, DbD_{b} denote the N×NN\times N diagonal degree matrices given by Dw(i,i)=dw(i)D_{w}(i,i)=d_{w}(i), Db(i,i)=db(i)D_{b}(i,i)=d_{b}(i). The normalized graph Laplacian matrices LwL_{w} and LbL_{b} of GwG_{w} and GbG_{b} are then defined as

Lw:=Dw1/2(DwWw)Dw1/2,Lb:=Db1/2(DbWb)Db1/2.L_{w}:=D_{w}^{-1/2}(D_{w}-W_{w})D_{w}^{-1/2},\qquad\qquad L_{b}:=D_{b}^{-1/2}(D_{b}-W_{b})D_{b}^{-1/2}.
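As an illustration, the within-class and between-class Laplacians above can be computed as in the following sketch, which assumes that the data graph is given as a symmetric weight matrix with a zero diagonal and no isolated vertices in either subgraph; the function and variable names are ours.

    import numpy as np

    def supervised_laplacians(W, labels):
        # W: symmetric N x N weight matrix of the data graph G (zero where there is no edge,
        #    zero diagonal). labels: length-N array with the class label C_i of each sample.
        same = labels[:, None] == labels[None, :]
        W_w = np.where(same, W, 0.0)      # within-class subgraph G_w
        W_b = np.where(~same, W, 0.0)     # between-class subgraph G_b
        d_w = W_w.sum(axis=1)             # degrees in G_w (assumed positive)
        d_b = W_b.sum(axis=1)             # degrees in G_b (assumed positive)
        Dw_inv_sqrt = np.diag(1.0 / np.sqrt(d_w))
        Db_inv_sqrt = np.diag(1.0 / np.sqrt(d_b))
        L_w = Dw_inv_sqrt @ (np.diag(d_w) - W_w) @ Dw_inv_sqrt   # normalized Laplacian of G_w
        L_b = Db_inv_sqrt @ (np.diag(d_b) - W_b) @ Db_inv_sqrt   # normalized Laplacian of G_b
        return L_w, L_b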

Supervised extensions of the Laplacian eigenmaps and LPP algorithms seek a dd-dimensional embedding of the data set 𝒳\mathcal{X}, such that each xix_{i} is represented by a vector yid×1y_{i}\in\mathbb{R}^{d\times 1}. Denoting the new data matrix as Y=[y1y2yN]TN×dY=[y_{1}\,y_{2}\,\dots\,y_{N}]^{T}\in\mathbb{R}^{N\times d}, the coordinates of data samples are computed by solving the problem

“Minimize tr(YTLwY) while maximizing tr(YTLbY).”\text{``Minimize }\mathrm{tr}(Y^{T}L_{w}Y)\text{ while maximizing }\mathrm{tr}(Y^{T}L_{b}Y)\text{.''} (28)

The reason behind this formulation can be explained as follows. For a graph Laplacian matrix L=D1/2(DW)D1/2L=D^{-1/2}(D-W)D^{-1/2}, where DD and WW are respectively the degree and the weight matrices, defining the coordinates Z=D1/2YZ=D^{-1/2}Y normalized with the vertex degrees, we have

tr(YTLY)=tr(ZT(DW)Z)=ijzizj2wij\mathrm{tr}(Y^{T}L\,Y)=\mathrm{tr}(Z^{T}(D-W)Z)=\sum_{i\sim j}\|z_{i}-z_{j}\|^{2}w_{ij} (29)

where ziz_{i} is the ii-th row of ZZ giving the normalized coordinates of the embedding of the data sample xix_{i}. Hence, the problem in (28) seeks a representation YY that maps nearby samples in the same class to nearby points, while mapping nearby samples from different classes to distant points. In fact, when the samples xix_{i} are assumed to come from a manifold \mathcal{M}, the term yTLyy^{T}Ly is the discrete equivalent of

f(x)2𝑑x\int_{\mathcal{M}}\|\nabla f(x)\|^{2}dx

where f:f:\mathcal{M}\rightarrow\mathbb{R} is a continuous function on the manifold that extends the one-dimensional coordinates yy to the whole manifold. Hence, the term tr(YTLY)\mathrm{tr}(Y^{T}LY) captures the rate of change of the learnt coordinate vectors YY over the underlying manifold. Then, in a setting where the samples of different classes come from MM different manifolds {m}m=1M\{\mathcal{M}_{m}\}_{m=1}^{M}, the formulation in (28) looks for a function that has a slow variation on each manifold m\mathcal{M}_{m}, while having a fast variation “between” different manifolds.
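The identity (29) can also be checked numerically, as in the short sketch below for a complete graph with random weights and random coordinates; this is only a sanity check of the algebra, not part of the algorithm.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 6, 2
    W = rng.random((N, N)); W = np.triu(W, 1); W = W + W.T      # symmetric weights, zero diagonal
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = D_inv_sqrt @ (np.diag(deg) - W) @ D_inv_sqrt            # normalized graph Laplacian
    Y = rng.standard_normal((N, d))
    Z = D_inv_sqrt @ Y                                          # normalized coordinates Z = D^{-1/2} Y

    lhs = np.trace(Y.T @ L @ Y)
    rhs = sum(W[i, j] * np.sum((Z[i] - Z[j])**2)
              for i in range(N) for j in range(i + 1, N))       # each edge counted once
    assert np.isclose(lhs, rhs)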

The supervised learning problem in (28) has so far been studied by several authors with slight variations in their problem formulations. Raducanu and Dornaika (2012) minimize a weighted difference of the within-class and between-class similarity terms in (28) in order to learn a nonlinear embedding. Meanwhile, linear dimensionality reduction methods pose the manifold learning problem as the learning of a linear projection matrix Pd×nP\in\mathbb{R}^{d\times n}; therefore, solve the problem in (28) under the constraint yi=Pxiy_{i}=P\,x_{i}, where xin×1x_{i}\in\mathbb{R}^{n\times 1} and d<nd<n. Hua et al. (2012) formulate the problem as the minimization of the difference of the within-class and the between-class similarity terms in (28) as well. Thus, their algorithm can be seen as the linear version of the method by Raducanu and Dornaika (2012). Sugiyama (2007) proposes an adaptation of the Fisher discriminant analysis algorithm to preserve the local structures of data. Data sample pairs are weighted with respect to their affinities in the construction of the within-class and the between-class scatter matrices in Fisher discriminant analysis. Then the trace of the ratio of the between-class and the within-class scatter matrices is maximized to learn a linear embedding. Meanwhile, the within-class and the between-class local scatter matrices are closely related to the two terms in (28) as shown by Yang et al. (2011). The terms YTLwYY^{T}L_{w}Y and YTLbYY^{T}L_{b}Y, when evaluated under the constraint yi=Pxiy_{i}=P\,x_{i}, become equal to the locally weighted within-class and between-class scatter matrices of the projected data. Cui and Fan (2012) and Wang and Chen (2009) propose to maximize the ratio of the between-class and the within-class local scatters in the learning. Yang et al. (2011) optimize the same objective function, while they construct the between-class graph only on the centers of mass of the classes. Zhang et al. (2012) similarly optimize a Fisher metric to maximize the ratio of the between- and within-class scatters; however, the total scatter is also taken into account in the objective function in order to preserve the overall manifold structure.

All of the above methods use similar formulations of the supervised manifold learning problem and give comparable results. In our study, we base our analysis on the following formal problem definition

minYtr(YTLwY)μtr(YTLbY) subject to YTY=I\min_{Y}\mathrm{tr}(Y^{T}L_{w}Y)-\mu\,\mathrm{tr}(Y^{T}L_{b}Y)\text{ subject to }Y^{T}Y=I (30)

which minimizes the difference of the within-class and the between-class similarity terms as in works such as (Raducanu and Dornaika, 2012) and (Hua et al., 2012). Here II is the d×dd\times d identity matrix and μ>0\mu>0 is a parameter adjusting the weights of the two terms. The condition YTY=IY^{T}Y=I is a commonly used constraint to remove the scale ambiguity of the coordinates. The solution of the problem (30) is given by the first dd eigenvectors of the matrix

LwμLbL_{w}-\mu L_{b}

corresponding to its smallest eigenvalues.
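A minimal sketch of this computation is given below; it assumes that the Laplacians L_w and L_b have already been built (e.g., as in the earlier sketch) and simply returns the eigenvectors of L_w − μ L_b associated with its d smallest eigenvalues, which satisfy Y^T Y = I.

    import numpy as np

    def supervised_laplacian_eigenmaps(L_w, L_b, mu, d):
        # Solve (30): the columns of Y are the eigenvectors of L_w - mu * L_b
        # corresponding to its d smallest eigenvalues (so that Y^T Y = I).
        M = L_w - mu * L_b
        eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2.0)   # symmetrize for numerical stability
        return eigvecs[:, :d]                                # N x d embedding Y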

Our purpose in this section is then to theoretically study the linear separability of the learnt coordinates of training data, with respect to the definition of linear separability given in (1). In the following, we determine some conditions on the graph properties and the weight parameter μ\mu that ensure linear separability. We derive lower bounds on the margin γ\gamma and study its dependence on the model parameters. We first give the following definitions about the graphs GwG_{w} and GbG_{b}.

Definition 10

The volume of the subgraph of GwG_{w} that corresponds to the connected component containing samples from class kk is

Vk:=i:Ci=kdw(i).V_{k}:=\sum_{i:\,C_{i}=k}d_{w}(i).

We define the maximal within-class volume as

Vmax:=maxk=1,,MVk.V_{max}:=\max_{k=1,\dots,M}V_{k}.

The volume of the component of GbG_{b} containing the edges between the samples of classes kk and ll is (in order to keep the analogy with the definition of VkV_{k}, a factor of 2 is introduced in this expression as each edge is counted only once in the sum)

Vklb:=ibjCi=k,Cj=l2wij.V^{b}_{kl}:=\sum_{\begin{subarray}{c}i\sim_{b}j\\ C_{i}=k,C_{j}=l\end{subarray}}2\,w_{ij}.

We then define the maximal pairwise between-class volume as

Vmaxb:=maxklVklb.V^{b}_{max}:=\max_{k\neq l}V^{b}_{kl}.

In a connected graph, the distance between two vertices xix_{i} and xjx_{j} is the number of edges in a shortest path joining xix_{i} and xjx_{j}. The diameter of the graph is then given by the maximum distance between any two vertices in the graph (Chung, 1996). We define the diameter of the connected component of GwG_{w} corresponding to class kk as follows.

Definition 11

For any two vertices xix_{i} and xjx_{j} such that Ci=Cj=kC_{i}=C_{j}=k, consider a within-class shortest path joining xix_{i} and xjx_{j}, which contains samples only from class kk. Then the diameter DkD_{k} of the connected component of GwG_{w} corresponding to class kk is the maximum number of edges in the within-class shortest path joining any two vertices xix_{i} and xjx_{j} from class kk.

Definition 12

The minimum edge weight within class kk is defined as

wmin,k:=miniwjCi=Cj=kwij.w_{min,k}:=\min_{\begin{subarray}{c}i\sim_{w}j\\ C_{i}=C_{j}=k\end{subarray}}w_{ij}.
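The quantities introduced in Definitions 10-12 can be computed directly from the weight matrices of GwG_{w} and GbG_{b}, as in the following sketch; it assumes integer class labels and that each within-class subgraph is connected (as required above), uses scipy only for the hop-count diameters, and the function and variable names are ours.

    import numpy as np
    from scipy.sparse.csgraph import shortest_path

    def graph_quantities(W_w, W_b, labels):
        d_w = W_w.sum(axis=1)
        classes = np.unique(labels)
        # Within-class volumes V_k and the maximal volume V_max (Definition 10).
        V = {k: d_w[labels == k].sum() for k in classes}
        V_max = max(V.values())
        # Pairwise between-class volumes V^b_{kl} (each edge counted once, hence the factor 2).
        Vb = {(k, l): 2 * W_b[np.ix_(labels == k, labels == l)].sum()
              for k in classes for l in classes if k < l}
        Vb_max = max(Vb.values())
        # Diameters D_k (Definition 11) and minimum within-class weights (Definition 12).
        diam, w_min = {}, {}
        for k in classes:
            Wk = W_w[np.ix_(labels == k, labels == k)]
            hops = shortest_path((Wk > 0).astype(float), unweighted=True, directed=False)
            diam[k] = int(hops.max())          # hop-count diameter of the class-k component
            w_min[k] = Wk[Wk > 0].min()
        return V_max, Vb_max, diam, w_min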

3.2 Separability bounds for two classes

We now present a lower bound for the linear separability of the embedding obtained by solving (30) in a setting with two classes Ci{1,2}C_{i}\in\{1,2\}. We first show that an embedding of dimension d=1d=1 is sufficient to achieve linear separability for the case of two classes. We then derive a lower bound on the separation in terms of the graph parameters and the algorithm parameter μ\mu.

Consider a one-dimensional embedding Y=y=[y1y2yN]TN×1Y=y=[y_{1}\,y_{2}\,\dots\,y_{N}]^{T}\in\mathbb{R}^{N\times 1}, where yiy_{i}\in\mathbb{R} is the coordinate of the data sample xix_{i} in the one-dimensional space. The coordinate vector yy is given by the eigenvector of LwμLbL_{w}-\mu L_{b} corresponding to its smallest eigenvalue. We begin by presenting the following result, which states that the samples from the two classes are always mapped to different halves (nonnegative or nonpositive) of the real line.

Lemma 13

The learnt embedding yy of dimension d=1d=1 satisfies

yi0if Ci=1 (or respectively Ci=2)yi0if Ci=2 (or respectively Ci=1)\begin{split}y_{i}&\leq 0\qquad\text{if }\,C_{i}=1\text{ (or respectively $C_{i}$=2)}\\ y_{i}&\geq 0\qquad\text{if }\,C_{i}=2\text{ (or respectively $C_{i}$=1)}\end{split}

for any μ>0\mu>0 and for any choice of the graph parameters.

Lemma 13 is proved in Appendix B.1. The lemma states that in one-dimensional embeddings of two classes, samples from different classes always have coordinates with different signs. Therefore, the hyperplane given by ω=1\omega=1, b=0b=0 separates the data as ωTyi0\omega^{T}y_{i}\leq 0 for Ci=1C_{i}=1 and ωTyi0\omega^{T}y_{i}\geq 0 for Ci=2C_{i}=2 (since the embedding is one dimensional, the vector ω\omega is a scalar in this case). However, this does not guarantee that the data is separable with a positive margin γ>0\gamma>0. In the following result, we show that a positive margin exists and give a lower bound on it. In the rest of this section, we assume without loss of generality that classes 11 and 22 are respectively mapped to the negative and positive halves of the real axis.

Theorem 14

Defining the normalized data coordinates z=Dw1/2yz=D_{w}^{-1/2}y, let

z1,max:=maxi:Ci=1ziz2,min:=mini:Ci=2ziz_{1,max}:=\max_{i:\,C_{i}=1}z_{i}\qquad z_{2,min}:=\min_{i:\,C_{i}=2}z_{i}

denote the maximum and minimum coordinates that classes 11 and 22 are respectively mapped to with a one-dimensional embedding learnt with supervised Laplacian eigenmaps. We also define the parameters

w¯min=mink{1,2}wmin,kDk,βi=dw(i)db(i),βmax=maxiβi,\overline{w}_{min}=\min_{k\in\{1,2\}}\frac{w_{min,k}}{D_{k}}\,,\qquad\qquad\beta_{i}=\frac{d_{w}(i)}{d_{b}(i)}\,,\qquad\qquad\beta_{max}=\max_{i}\beta_{i}\,,

where DkD_{k} is the diameter of the graph corresponding to class kk as defined in Definition 11. Then, if the weight parameter is chosen such that 0<μ<w¯min/(βmaxVmaxb)0<\mu<\overline{w}_{min}/(\beta_{max}V^{b}_{max}), any supervised Laplacian embedding of dimension d1d\geq 1 is linearly separable with a positive margin lower bounded as below:

z2,minz1,max1Vmax(1μβmaxVmaxbw¯min).z_{2,min}-z_{1,max}\geq\frac{1}{\sqrt{V_{max}}}\left(1-\sqrt{\frac{\mu\beta_{max}V^{b}_{max}}{\overline{w}_{min}}}\right). (31)

The proof of Theorem 14 is given in Appendix B.2. The proof is based on a variational characterization of the eigenvector of LwμLbL_{w}-\mu L_{b} corresponding to its smallest eigenvalue, whose elements are then bounded in terms of the parameters of the graph such as the diameters and volumes of its connected components.

Theorem 14 states that an embedding learnt with the supervised Laplacian eigenmaps method makes two classes linearly separable if the weight parameter μ\mu is chosen sufficiently small. In particular, the theorem shows that, for any 0<δ<Vmax1/20<\delta<{V_{max}}^{-1/2}, a choice of the weight parameter μ\mu satisfying

0<μw¯minβmaxVmaxb(1Vmaxδ)20<\mu\leq\frac{\overline{w}_{min}}{\beta_{max}\,V^{b}_{max}}\left(1-\sqrt{V_{max}}\,\delta\right)^{2}

guarantees a separation of z2,minz1,maxδz_{2,min}-z_{1,max}\geq\delta between classes 11 and 22 at d=1d=1. Here, we use the symbol δ\delta to denote the separation in the normalized coordinates zz. In practice, either one of the normalized eigenvectors zz or the original eigenvectors yy can be used for embedding the data. If the original eigenvectors yy are used, due to the relation y=Dw1/2zy=D_{w}^{1/2}z, we can lower bound the separation as y2,miny1,maxdw,min(z2,minz1,max)y_{2,min}-y_{1,max}\geq\sqrt{d_{w,min}}(z_{2,min}-z_{1,max}) where dw,min=minidw(i)d_{w,min}=\min_{i}d_{w}(i). Thus, for any embedding of dimension d1d\geq 1, there exists a hyperplane that results in a linear separation with a margin γ\gamma of at least

γdw,minVmax(1μβmaxVmaxbw¯min).\gamma\geq\sqrt{\frac{d_{w,min}}{V_{max}}}\left(1-\sqrt{\frac{\mu\beta_{max}V^{b}_{max}}{\overline{w}_{min}}}\right).
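The margin guarantee above is easy to evaluate once the graph quantities are known; the sketch below simply implements the bound of Theorem 14 in the original coordinates y, under the stated condition on μ, with all parameters passed in by the caller.

    import numpy as np

    def margin_lower_bound(mu, w_bar_min, beta_max, Vb_max, V_max, d_w_min):
        # Lower bound on the margin gamma for the embedding y = D_w^{1/2} z,
        # valid for 0 < mu < w_bar_min / (beta_max * Vb_max).
        assert 0 < mu < w_bar_min / (beta_max * Vb_max)
        sep_z = (1.0 - np.sqrt(mu * beta_max * Vb_max / w_bar_min)) / np.sqrt(V_max)  # bound (31)
        return np.sqrt(d_w_min) * sep_z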

Next, we comment on the dependence of the separation on μ\mu. The inequality in (31) shows that the lower bound on the separation z2,minz1,maxz_{2,min}-z_{1,max} has a variation of O(1μ)O(1-\sqrt{\mu}) with the weight parameter μ\mu. The fact that the separation decreases as μ\mu increases may seem counterintuitive at first, since this parameter weights the between-class dissimilarity term in the objective function. This can be explained as follows. When μ\mu is high, the algorithm tries to increase the distance between neighboring samples from different classes as much as possible by moving them away from the origin (remember that different classes are mapped to the positive and the negative sides of the real line). However, since the normalized coordinate vector zz has to respect the equality zTDwz=1z^{T}D_{w}z=1, the total squared norm of the coordinates cannot be arbitrarily large. Due to this constraint, setting μ\mu to a high value causes the algorithm to map non-neighboring samples from different classes to nearby coordinates close to the origin. This occurs because the increase in μ\mu reduces the impact of the first term yTLwyy^{T}L_{w}y in the overall objective and results in an embedding with a weaker link between the samples of the same class. This causes a polarization of the data and eventually reduces the separation. Hence, the μ\mu parameter should be carefully chosen and should not take too large values.

Theorem 14 characterizes the separation at d=1d=1 in terms of the distance between the supports of the two classes. Meanwhile, it is also interesting to determine the individual distances of the supports of the two classes to the origin. In the following corollary, we present a lower bound on the distance between the coordinates of any sample and the origin.

Corollary 15

The distance between the supports of the first and the second classes and the origin in a one-dimensional embedding is lower bounded in terms of the separation between the two classes as

min{|z1,max|,|z2,min|}12βminβmax(z2,minz1,max)\min\{|z_{1,max}|,\,\,|z_{2,min}|\}\geq\frac{1}{2}\frac{\beta_{min}}{\beta_{max}}(z_{2,min}-z_{1,max})

where

βmin=miniβi,βmax=maxiβi.\begin{split}\beta_{min}&=\min_{i}\beta_{i},\qquad\beta_{max}=\max_{i}\beta_{i}.\end{split}

Corollary 15 is proved in Appendix B.3. The proof is based on a Lagrangian formulation of the embedding as a constrained optimization problem, which then allows us to establish a link between the separation and the individual distances of class supports to the origin. The corollary states a lower bound on the portion of the overall separation lying in the negative or the positive sides of the real line. In particular, if the vertex degrees are equal for all samples in GwG_{w} and GbG_{b} (which is the case, for instance, if all vertices have the same number of neighbors and a constant weight of wij=1w_{ij}=1 is assigned to the edges), since βmin=βmax\beta_{min}=\beta_{max}, the portions of the overall separation in the positive and negative sides of the real line will be equal. Although the statement of Theorem 14 is sufficient to show the existence of separating hyperplanes with positive margins for the embeddings of two classes, we will see in Section 3.3 that the separability with a hyperplane passing through the origin as in Corollary 15 is a desirable property for the extension of these results to a multi-class setting.

3.3 Separability bounds for multiple classes

In this section, we study the separability of the embeddings of multiple classes with the supervised Laplacian eigenmaps algorithm. In particular, we focus on a setting with multiple classes that can be grouped into several categories. The classes in each category are assumed to bear a relatively high resemblance within themselves, whereas the resemblance between classes from different categories is weaker. This is a scenario that is likely to be encountered in several practical data classification problems.

In the following, we study the embeddings of multiple categorizable classes. The objective matrix LwμLbL_{w}-\mu L_{b} defining the embedding is close to a block-diagonal matrix if the between-category similarities are relatively low. Building on this observation, we present a result that links the separability of the overall embedding to the separability of the embeddings of each individual category with the same algorithm. Especially in a setting with many classes, this simplifies the problem for multiple classes and makes it possible to deduce information for the overall separation by studying the separation of the individual categories, which is easier to analyze.

We consider data samples 𝒳={xi}i=1N\mathcal{X}=\{x_{i}\}_{i=1}^{N} belonging to MM different classes that can be categorized into QQ groups. For the purpose of our theoretical analysis, let us focus for a moment on the individual categories and consider the embedding of the samples in each category qq that would be obtained with the supervised Laplacian eigenmaps algorithm if the data graph were constructed only within category qq. Let YqY^{q} be the dqd^{q}-dimensional embedding of category qq. Assume that YqY^{q} is separable with margin γc\gamma^{c}. Then for any two classes k,lk,l in category qq, there exists a hyperplane ωkl\omega_{kl} such that

ωklTyiqγc/2 if Ci=kωklTyiqγc/2 if Ci=l\begin{split}\omega_{kl}^{T}\,y^{q}_{i}&\geq\gamma^{c}/2\quad\,\,\,\,\text{ if }C_{i}=k\\ \omega_{kl}^{T}\,y^{q}_{i}&\leq-\gamma^{c}/2\quad\text{ if }C_{i}=l\end{split} (32)

where yiqy^{q}_{i} is the ii-th row of YqY^{q} defining the coordinates of the ii-th data sample in category qq. Note that an offset of bkl=0b_{kl}=0 is assumed here, i.e., the classes in each category are assumed to be separable with hyperplanes passing through the origin. While this is mainly for simplifying the analysis, the studied supervised Laplacian eigenmaps algorithm in fact computes embeddings having this property in practice (the theoretical guarantee for the two-class setting being provided in Corollary 15).

Now let L=LwμLbL=L_{w}-\mu L_{b} denote the N×NN\times N objective matrix defining the embedding of 𝒳\mathcal{X} with supervised Laplacian eigenmaps. Also, let Lc=LwcμLbcL^{c}=L_{w}^{c}-\mu L_{b}^{c} denote the block-diagonal objective matrix where the within-class and the between-class Laplacians LwcL_{w}^{c} and LbcL_{b}^{c} are obtained by restricting the graph edges to the ones within the categories. In other words, LcL^{c} is obtained by removing the edges between all pairs of data samples belonging to different categories.

Let Lnc=LLcL^{nc}=L-L^{c} denote the component of LL arising from the between-category data connections. In our analysis, we will treat this component LncL^{nc} as a perturbation on the block-diagonal matrix LcL^{c} and analyze the eigenvectors of LL accordingly in order to study the separability of the embedding obtained with LL.

We will need a condition on the separation of the eigenvalues of LcL^{c}. Let η\eta denote the minimal separation (the smallest difference) between the eigenvalues of LcL^{c}

η:=minij|λiλj|\eta:=\min_{i\neq j}|\lambda_{i}-\lambda_{j}| (33)

where λi\lambda_{i} are the eigenvalues of LcL^{c} for i=1,,Ni=1,\dots,N. For μ>0\mu>0 and a random sampling of data, the eigenvalues of LcL^{c} are expected to be distinct (note that the within-class and the between-class Laplacians LwcL_{w}^{c} and LbcL_{b}^{c} are normalized Laplacians; therefore, the constant vector is not an eigenvector and 0 is not a repeating eigenvalue); therefore, one can reasonably assume the minimal eigenvalue separation to be positive. The characterization of the behavior of the minimal separation of the eigenvalues depending on the graph properties is not within the scope of this study and remains as future work.

We state below our main result about the separability of the embeddings of multiple categorizable classes.

Theorem 16

Let L=LwμLbN×NL=L_{w}-\mu L_{b}\in\mathbb{R}^{N\times N} be the matrix representing the objective function of the supervised Laplacian eigenmaps algorithm with MM classes categorizable into QQ groups. Assume that LL is close to a block-diagonal objective matrix LcL^{c} containing only within-category edges such that the perturbation Lnc=LLcL^{nc}=L-L^{c} is bounded as

Lnc<η2\|L^{nc}\|<\frac{\eta}{2}

in terms of the minimal eigenvalue separation η\eta of the matrix LcL^{c} defined in (33). Let each category qq have a dqd^{q}-dimensional embedding YqY^{q} separable with margin γc\gamma^{c} as in (32). We define the parameters

ξ=(14Lnc2η2)1/2\xi=\left(1-\frac{4\|L^{nc}\|^{2}}{\eta^{2}}\right)^{1/2}

and

ζ=(22ξ+2N(1ξ2))1/2.\zeta=\left(2-2\xi+2\sqrt{N(1-\xi^{2})}\right)^{1/2}.

Then there exists an embedding YY of dimension d=q=1Qdqd=\sum_{q=1}^{Q}d_{q} consisting of the eigenvectors of the overall objective matrix LL that is separable with a margin of at least

γ=γc/22ζ\gamma=\gamma^{c}/\sqrt{2}-2\zeta

provided that ζ<γc/(22)\zeta<\gamma^{c}/(2\sqrt{2}).

The proof of Theorem 16 is given in Appendix B.4. The proof is based on first analyzing the separation of the embedding corresponding to the block-diagonal component LcL^{c} of the objective matrix, and then lower bounding the separation of the original embedding in terms of the perturbation and the separation of the eigenvalues. In brief, the theorem says that if the classes are categorizable with sufficiently low between-category edge weights, and if the individual embedding of each category makes all classes in that category linearly separable, then in the embedding computed for the overall data graph with the supervised Laplacian eigenmaps algorithm, all pairs of classes (from same and different categories) are also linearly separable. This extends the linear separability of individual categories to the separability of all classes. The margin of the overall separation decreases at a rate of O(1Lnc)O(\sqrt{1-\|L^{nc}\|}) as the magnitude of the non-block-diagonal component Lnc\|L^{nc}\| of the objective matrix increases. (From the definition of the parameters ξ\xi, ζ\zeta, and γ\gamma in Theorem 16, we have ξ=O(1Lnc2)\xi=O(\sqrt{1-\|L^{nc}\|^{2}}), ζ=O(1ξ)\zeta=O(\sqrt{1-\xi}), γ=O(1ζ)\gamma=O(1-\zeta). It follows that γ=O(1(11Lnc2)1/2)O(1Lnc)\gamma=O(1-(1-\sqrt{1-\|L^{nc}\|^{2}}\,)^{1/2})\approx O(\sqrt{1-\|L^{nc}\|}).)
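As a numerical illustration of Theorem 16, the sketch below computes the minimal eigenvalue separation η of L^c, the perturbation magnitude ‖L^nc‖ (taken here as the spectral norm, which is an assumption on our part), and the resulting margin bound; it returns None when the hypotheses of the theorem are not met.

    import numpy as np

    def multiclass_margin_bound(L, L_c, gamma_c):
        # Minimal separation eta between the eigenvalues of the block-diagonal part L^c, as in (33).
        eigvals = np.sort(np.linalg.eigvalsh((L_c + L_c.T) / 2.0))
        eta = np.min(np.diff(eigvals))
        L_nc = np.linalg.norm(L - L_c, 2)              # spectral norm of the perturbation L^{nc}
        if L_nc >= eta / 2.0:
            return None                                # Theorem 16 does not apply
        N = L.shape[0]
        xi = np.sqrt(1.0 - 4.0 * L_nc**2 / eta**2)
        zeta = np.sqrt(2.0 - 2.0 * xi + 2.0 * np.sqrt(N * (1.0 - xi**2)))
        if zeta >= gamma_c / (2.0 * np.sqrt(2.0)):
            return None                                # the condition zeta < gamma^c/(2 sqrt(2)) fails
        return gamma_c / np.sqrt(2.0) - 2.0 * zeta     # lower bound on the margin gamma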

The dimension of the separable embedding is given by the sum of the dimensions of the individual embeddings of the categories that ensure the linear separability within each category; hence, the dimension required for linear separability grows linearly with the number of categories. In order to compute the exact value of the number of dimensions required for linear separability, one needs the knowledge of the number of dimensions that ensures the separability within each category. Nevertheless, the provided result is still interesting as the theoretical or numerical analysis of individual categories is often easier than the analysis of the whole data set, since the number of classes in a particular category is more limited. (For instance, we have theoretically shown in Section 3.2 that one dimension is sufficient for obtaining a linearly separable embedding of two classes. While we do not provide a theoretical analysis for more than two classes, we have experimentally observed that data becomes linearly separable at two dimensions when the number of classes is three or four.) Note that one can also interpret the theorem by considering each class as a different category. However, in this case the edges between samples of different classes must have sufficiently low weights for the applicability of the theorem, i.e., the non-block-diagonal component LncL^{nc} of the Laplacian must be sufficiently small. The examination of the general problem of embedding data with multiple non-categorizable classes and no assumptions on the edge weights between different classes seems to be more challenging and remains as a future direction to study.

4 Experimental Results

In this section, we present results on synthetic and real data sets. We compare several supervised manifold learning methods and study their performances in relation with our theoretical results.

Figure 2: Data sampled from two-dimensional synthetic surfaces ((a) quadratic surfaces, (b) Swiss rolls, (c) spheres). Red and blue colors represent two different classes.

Figure 3: Supervised Laplacian embeddings of data sampled from quadratic surfaces ((a) 1-D embedding, (b) 2-D embedding, (c) 3-D embedding).

4.1 Separability of embeddings with supervised manifold learning

We first present results on synthetic data in order to study the embeddings obtained with supervised dimensionality reduction. We test the supervised Laplacian eigenmaps algorithm in a setting with two classes. We generate samples from two nonintersecting and linearly nonseparable surfaces in 3\mathbb{R}^{3} that represent two different classes. We experiment on three different types of surfaces, namely quadratic surfaces, Swiss rolls, and spheres. The data sampled from these surfaces are shown in Figure 2. We choose N=200N=200 samples from each class. We construct the graph GwG_{w} by connecting each sample to its KK-nearest neighbors from the same class, where KK is chosen between 2020 and 3030. The graph GbG_{b} is constructed similarly, where each sample is connected to its K/5K/5 nearest neighbors from the other class. The graph weights are determined as a Gaussian function of the distance between the samples. The embeddings are then computed by minimizing the objective function in (30). The one-dimensional, two-dimensional, and three-dimensional embeddings obtained for the quadratic surface are shown in Figure 3, where the weight parameter is taken as μ=0.57\mu=0.57 (to have a visually clear embedding for the purpose of illustration). Similar results are obtained on the Swiss roll and the spherical surface. One can observe that the data samples that were initially linearly nonseparable become linearly separable when embedded with the supervised Laplacian eigenmaps algorithm. The two classes are mapped to different (positive or negative) sides of the real line in Figure 3(a) as predicted by Lemma 13. The separation in the 2-D and 3-D embeddings in Figure 3 is close to the separation obtained with the 1-D embedding.
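For concreteness, the graph construction used in this experiment can be sketched as follows; the neighborhood sizes K and K/5 and the Gaussian weights follow the description above, while the default parameter values and function names are only illustrative.

    import numpy as np

    def build_graphs(X, labels, K=25, sigma=1.0):
        # Connect each sample to its K nearest neighbors from the same class (G_w)
        # and to its K//5 nearest neighbors from the other class (G_b), with Gaussian weights.
        N = X.shape[0]
        dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
        W_w, W_b = np.zeros((N, N)), np.zeros((N, N))
        for i in range(N):
            same = np.where(labels == labels[i])[0]
            other = np.where(labels != labels[i])[0]
            nn_w = same[np.argsort(dist[i, same])[1:K + 1]]    # skip the sample itself
            nn_b = other[np.argsort(dist[i, other])[:K // 5]]
            W_w[i, nn_w] = np.exp(-dist[i, nn_w]**2 / sigma**2)
            W_b[i, nn_b] = np.exp(-dist[i, nn_b]**2 / sigma**2)
        # Symmetrize the weight matrices.
        return np.maximum(W_w, W_w.T), np.maximum(W_b, W_b.T)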

Figure 4: Variation of the separation γ\gamma between the two classes with the parameter μ\mu for the synthetic data sets ((a) experimental value of the separation γ\gamma, (b) theoretical upper bound for μ\mu that guarantees a separation of at least γ\gamma).

We then compute and plot the separation obtained at different values of μ\mu. Figure 4(a) shows the experimental value of the separation γ=z2,minz1,max\gamma=z_{2,min}-z_{1,max} obtained with the 1-D embedding for the three types of surfaces. Figure 4(b) shows the theoretical upper bound for μ\mu in Theorem 14 that guarantees a separation of at least γ\gamma. Both the experimental value and the theoretical bound for the separation γ\gamma decrease with the increase in the parameter μ\mu. This is in agreement with (31), which predicts a decrease of O(1μ)O(1-\sqrt{\mu}) in the separation with respect to μ\mu. The theoretical bound for the separation is seen to decrease at a faster rate with μ\mu for the Swiss roll data set. This is due to the particular structure of this data set, which has a nonuniform sampling density where the sampling is sparser away from the spiral center. The parameter w¯min\overline{w}_{min} then takes a small value, which consequently leads to a fast rate of decrease for the separation according to (31). Comparing Figures 4(a) and 4(b), one observes that the theoretical bounds for the separation are numerically more pessimistic than their experimental values, which is a result of the fact that our results are obtained with a worst-case analysis. Nevertheless, the theoretical bounds capture well the actual variation of the separation margin with μ\mu.

4.2 Classification performance of supervised manifold learning algorithms

Figure 5: Comparison of the performance of several supervised classification methods on (a) the COIL-20 object data set, (b) the ETH-80 object data set, (c) the Yale face data set, (d) the reduced Yale face data set, and (e) the MNIST data set.

We now study the overall performance of classification obtained in a setting with supervised manifold learning, where the out-of-sample generalization is achieved with smooth RBF interpolators. We evaluate the theoretical results of Section 2 on several real data sets: the COIL-20 object database (Nene et al., 1996), the Yale face database (Georghiades et al., 2001), the ETH-80 object database (Leibe and Schiele, 2003), and the MNIST handwritten digit database (LeCun et al., 1998). The COIL-20, Yale face, ETH-80, and MNIST databases contain a total of 1420, 2204, 3280, and 70046 images from 20, 38, 8, and 10 image classes respectively. The images in the COIL-20, Yale and ETH-80 data sets are converted to greyscale, normalized, and downsampled to a resolution of respectively 32×3232\times 32, 20×1720\times 17, and 20×2020\times 20 pixels.

4.2.1 Comparison of supervised manifold learning to baseline classifiers

We first compare the performance of supervised manifold learning with some reference classification methods. The performances of SVM, K-NN, kernel regression, and the supervised Laplacian eigenmaps methods are evaluated and compared. Figure 5 reports the results obtained on the COIL-20 data set, the ETH-80 data set, the Yale data set, a subset of the Yale data set consisting of its first 10 classes (reduced Yale data set), and the MNIST data set. The SVM, K-NN, and kernel regression algorithms are applied in the original domain and their hyperparameters are optimized with cross-validation. In the supervised Laplacian eigenmaps method, the embedding of the training images into a low-dimensional space is computed. Then, an out-of-sample interpolator with Gaussian RBFs is constructed that maps the training samples to their embedded coordinates as described in Section 2.3. Test samples are mapped to the low-dimensional domain via the RBF interpolator and the class labels of test samples are estimated via nearest-neighbor classification in the low-dimensional domain. The supervised Laplacian eigenmaps and the SVM methods are also tested over an alternative representation of the image data sets based on deep learning. The images are provided as input to the pretrained AlexNet convolutional neural network proposed in (Krizhevsky et al., 2012), and the activation values at the second fully connected layer are used as the feature representations of the images. The feature representations of training and test images are then provided to the supervised Laplacian eigenmaps and the SVM methods. The plots in Figure 5 show the variation of the misclassification rate of test samples (in percentage) with the ratio of the number of training samples to the total number of samples in the data set. The results are the average of 5 repetitions of the experiment with different random choices for the training and test samples.
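The classification pipeline described above (embedding of the training samples, out-of-sample extension with a Gaussian RBF interpolator, and nearest-neighbor classification in the low-dimensional domain) can be sketched as follows; the exact interpolation with one RBF term per training sample and the kernel form exp(−r²/σ²) are assumptions consistent with Section 2, and a small regularization term may be added in practice if the kernel matrix is ill-conditioned.

    import numpy as np

    def rbf_classify(X_train, labels, X_test, Y_train, sigma):
        # Fit a Gaussian RBF interpolator mapping each training sample x_i to its
        # embedded coordinates y_i, extend it to the test samples, then assign the
        # label of the nearest training sample in the low-dimensional domain.
        D_tr = np.linalg.norm(X_train[:, None] - X_train[None, :], axis=2)
        coeffs = np.linalg.solve(np.exp(-D_tr**2 / sigma**2), Y_train)   # interpolation coefficients
        D_te = np.linalg.norm(X_test[:, None] - X_train[None, :], axis=2)
        Y_test = np.exp(-D_te**2 / sigma**2) @ coeffs                    # out-of-sample embedding
        nn = np.argmin(np.linalg.norm(Y_test[:, None] - Y_train[None, :], axis=2), axis=1)
        return np.asarray(labels)[nn]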

The results in Figure 5 show that the best results are obtained with the supervised Laplacian eigenmaps algorithm in general. The performances of the algorithms improve with the number of training images as expected. In the COIL-20 and ETH-80 object data sets, the supervised Laplacian eigenmaps and the SVM algorithms yield significantly smaller error when applied to the feature representations of the images obtained with deep learning. Meanwhile, in the Yale face data set these two methods perform better on raw image intensity maps. This can be explained by the fact that the AlexNet model may be more successful in extracting useful features for object images rather than face images as it is trained on many common object and animal classes. It is interesting to compare Figures 5(c) and 5(d). While the performances of the supervised Laplacian eigenmaps and the SVM methods are closer in the reduced version of the Yale database with 10 classes, the performance gap between the supervised Laplacian eigenmaps method and the other methods is larger for the full data set with 38 classes. This can be explained by the fact that the linear separability of different classes degrades as the number of classes increases, thus causing a degradation in the performance of the classifiers in comparison. Meanwhile, the performance of the supervised Laplacian eigenmaps method is not much affected by the increase in the number of classes. The K-NN and kernel regression classifiers are seen to give almost the same performance in the plots in Figure 5. The number of neighbors is set as K=1K=1 for the K-NN algorithm in these experiments, where it has been observed to attain its best performance; and the scale parameter of the kernel regression algorithm is optimized to get the best accuracy, which has turned out to take relatively small values. Hence the performances of these two classifiers practically correspond to that of the nearest-neighbor classifier in the original domain.

4.2.2 Variation of the error with algorithm parameters and sample size

Figure 6: Variation of the misclassification rate with the number of training samples for (a) the COIL-20 object data set, (b) the ETH-80 object data set, and (c) the Yale face data set.

Figure 7: Variation of the misclassification rate with the dimension of the embedding for (a) the COIL-20 object data set, (b) the ETH-80 object data set, and (c) the Yale face data set.

We first study the evolution of the classification error with the number of training samples. Figures 6(a)- 6(c) show the variation of the misclassification rate of test samples with respect to the total number of training samples NN for the COIL-20, ETH-80 and Yale data sets. Each curve in the figure shows the errors obtained at a different value of the dimension dd of the embedding. The decrease in the misclassification rate with the number of training samples is in agreement with the results in Section 2 as expected.

The results of Figure 6 are replotted in Figure 7, where the variation of the misclassification rate is shown with respect to the dimension dd of the embedding at different NN values. It is observed that there may exist an optimal value of the dimension that minimizes the misclassification rate. This can be interpreted in light of the conditions (14) and (16) in Theorems 8 and 9, which impose a lower bound on the separability margin γQ\gamma_{Q} in terms of the dimension dd of the embedding. In the supervised Laplacian eigenmaps algorithm, the first few dimensions are critical and effective for separating different classes. The decrease in the error with the increase in the dimension for small values of dd can be explained by the fact that the separation increases with dd at small dd, thereby satisfying the conditions (14), (16). Meanwhile, the error may stagnate or increase if the dimension dd increases beyond a certain value, as the separation does not necessarily increase at the same rate.

We then examine the variation of the misclassification rate with the separation. We obtain embeddings at different separation values γ\gamma by changing the parameter μ\mu of the supervised Laplacian eigenmaps algorithm. Figure 8 shows the variation of the misclassification rate with the separation γ\gamma. Each curve is obtained at a different value of the scale parameter σ\sigma of the RBF kernels. It is seen that the misclassification rate decreases in general with the separation for small γ\gamma values. This is in agreement with our results, as the conditions (14), (16) require the separation to be higher than a threshold. On the other hand, the possible increase in the error at relatively large values of the separation is due to the following: these parts of the plots are obtained at very small μ\mu values, which typically result in a deformed embedding with a degenerate geometry. The deformation of structure at too small values of μ\mu may cause the interpolation function to be irregular and hence result in an increase in the error. The tradeoff between the separation and the interpolation function regularity is further studied in Section 4.2.3.

Finally, Figure 9 shows the relation between the misclassification error and the scale parameter σ\sigma of the Gaussian RBF kernels. Each curve is obtained at a different value of the μ\mu parameter. An optimal value of the scale parameter minimizing the misclassification error can be observed in most experiments. These results confirm the findings of Section 2.4, suggesting that there exists a unique value of σ\sigma that minimizes the left hand side of the conditions (14), (16), which probabilistically guarantee the correct classification of data.

Figure 8: Variation of the misclassification rate with the separation for (a) the COIL-20 object data set, (b) the ETH-80 object data set, and (c) the Yale face data set.

Figure 9: Variation of the misclassification rate with the scale parameter for (a) the COIL-20 object data set, (b) the ETH-80 object data set, and (c) the Yale face data set.

4.2.3 Performance analysis of several supervised manifold learning algorithms

Next, we compare several supervised manifold learning methods. We aim to interpret the performance differences of different types of embeddings in light of our theoretical results in Section 2.3. First, remember from Theorem 8 that the condition

d𝒞(Lϕδ+ϵ)γ/2\sqrt{d}\,\mathcal{C}\,(L_{\phi}\delta+\epsilon)\leq\gamma/2 (34)

needs to be satisfied (or, equivalently the condition (16) from Theorem 9) in order for the generalization bounds to hold. This preliminary condition basically states that a compromise must be achieved between the regularity of the interpolation function, captured via the terms 𝒞\mathcal{C} and LϕL_{\phi}, and the separation γ\gamma of the embedding of training samples, in order to bound the misclassification error. In other words, increasing the separation too much in the embedding of training samples does not necessarily lead to good classification performance if the interpolation function has poor regularity.

Hence, when comparing different embeddings in the experiments of this section, we define a condition parameter given by

d𝒞Lϕγ\frac{\sqrt{d}\mathcal{C}L_{\phi}}{\gamma}

which represents the ratio of the left and right hand sides of (34) (by fixing the probability parameters δ\delta and ϵ\epsilon). Setting the Lipschitz constant of the Gaussian RBF kernel as Lϕ=2e12σ1L_{\phi}=\sqrt{2}e^{-\frac{1}{2}}\sigma^{-1} (see Section 2.4 for details), we can equivalently define the condition parameter as

κ=d𝒞σγ\kappa=\frac{\sqrt{d}\mathcal{C}}{\sigma\gamma} (35)

and study this condition parameter for the supervised dimensionality reduction methods in comparison. Note that a smaller condition parameter means that the necessary conditions of Theorems 8 and 9 are more likely to be satisfied, hence hinting at the expectation of a better classification accuracy.
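In the experiments, the condition parameter can be computed as in the sketch below; since the exact definition of the coefficient bound 𝒞\mathcal{C} is given in Section 2.3, the choice made here (the largest l1-norm over the columns of the RBF coefficient matrix) is only an illustrative proxy and an assumption on our part.

    import numpy as np

    def condition_parameter(rbf_coeffs, d, sigma, gamma):
        # rbf_coeffs: N x d matrix of RBF interpolation coefficients (one column per dimension).
        # The coefficient bound C is approximated by the largest column-wise l1-norm;
        # see Section 2.3 for the exact quantity used in the analysis.
        C_bound = np.abs(rbf_coeffs).sum(axis=0).max()
        return np.sqrt(d) * C_bound / (sigma * gamma)        # kappa in (35)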

We compare the following supervised embeddings:

  • Supervised Laplacian eigenmaps embedding obtained by solving (30):

    minYtr(YTLwY)μtr(YTLbY) subject to YTY=I\min_{Y}\mathrm{tr}(Y^{T}L_{w}Y)-\mu\,\mathrm{tr}(Y^{T}L_{b}Y)\text{ subject to }Y^{T}Y=I
  • Fisher embedding (we use a nonlinear version of the formulation in (Wang and Chen, 2009) by removing the constraint that the embedding be given by a linear projection of the data), obtained by solving

    maxYtr(YTLbY)tr(YTLwY).\max_{Y}\frac{\mathrm{tr}(Y^{T}L_{b}Y)}{\mathrm{tr}(Y^{T}L_{w}Y)}. (36)
  • Label encoding, which maps each data sample to its label vector of the form [0 010][0\ 0\dots 1\dots 0], where the only nonzero entry corresponds to its class.

The label encoding method, which can also be regarded as a degenerate supervised manifold learning algorithm that provides maximal separation between data samples from different classes, is included in the experiments to provide a reference. In all of the above methods the training samples are embedded into the low-dimensional domain, and test samples are mapped via Gaussian RBF interpolators and assigned labels via nearest neighbor classification in the low-dimensional domain. The scale parameter σ\sigma of the RBF kernel is set to a reference value in each data set within the typical range [0.5,1][0.5,1] where the best accuracy is attained. We have fixed the weight parameter as μ=0.01\mu=0.01 in all setups, and set the dimension of the embedding equal to the number of classes. In order to study the properties of the interpolation function in relation with the condition parameter in (35), we also test the supervised Laplacian eigenmaps and the label encoding methods under RBF interpolators with high scale parameters, which are chosen as a few times the reference σ\sigma value giving the best results. Finally, we also include in the comparisons a regularized version of the supervised Laplacian eigenmaps embedding by controlling the magnitude of the interpolation function.

The results obtained on the COIL-20, ETH-80, Yale and reduced Yale data sets are reported respectively in Figures 10-13. In each figure, panel (a) shows the misclassification rates of the embeddings and panel (b) shows the condition parameters of the embeddings at different total numbers of training samples (NN). The logarithm of the condition parameter is plotted for ease of visualization.

Figure 10: Misclassification rates and the condition parameters of the embeddings for the COIL-20 object data set

Figure 11: Misclassification rates and the condition parameters of the embeddings for the ETH-80 object data set

Figure 12: Misclassification rates and the condition parameters of the embeddings for the Yale face data set

Figure 13: Misclassification rates and the condition parameters of the embeddings for the reduced Yale face data set

The plots in Figures 10-13 show that the label encoding, supervised Laplacian eigenmaps, and the regularized supervised Laplacian eigenmaps embeddings yield better classification accuracy than the other three methods (the Fisher embedding and the two embeddings with high scale parameters) in all experiments, with the only exception of the cases N=60N=60 and N=100N=100 for the reduced Yale data set. Meanwhile, examining the condition parameters of the embeddings, we observe that the label encoding, supervised Laplacian eigenmaps, and the regularized supervised Laplacian eigenmaps embeddings always have a smaller condition parameter than the other three methods. This observation confirms the intuition provided by the necessary conditions of Theorems 8 and 9: A compromise between the separation and the interpolator regularity is required for good classification accuracy. The increase in the condition parameter as NN increases is due to the fact that the coefficient bound 𝒞\mathcal{C} involves a summation over all training samples. The reason why the embeddings with high σ\sigma parameters yield better classification accuracy than the other ones in the cases N=60N=60 and N=100N=100 for the reduced Yale data set is that a larger RBF scale helps better cover the ambient space when the number of training samples is particularly low.

In the COIL-20 and the reduced Yale data sets, the best classification accuracy is obtained with the regularized supervised Laplacian eigenmaps method, while this is also the method having the smallest condition parameter, except for the smallest two values of NN in the reduced Yale data set. In the ETH-80 and full Yale data sets, the classification accuracy of label encoding attains that of the supervised Laplacian eigenmaps method. The condition parameter of the label encoding embedding is relatively small in these two data sets; in fact, in ETH-80 the label encoding embedding has the smallest condition parameter among all methods. This may be useful for explaining why this simple classification method has quite favorable performance in this data set. Likewise, if we leave aside the versions of the methods with high-scale interpolators, the Fisher embedding has the highest misclassification rate compared to label encoding, the supervised Laplacian, and the regularized supervised Laplacian embeddings, while it also has the highest condition parameter among these methods. (The formulation in (36) has been observed to give highly polarized embeddings in (Vural and Guillemot, 2016), where the samples of only a few classes stretch out along each dimension and all the other classes are mapped close to zero.)

To conclude, the results in this section suggest that the experimental findings are in agreement with the main results in Section 2.3, justifying the pertinence of the conditions (14) and (16) to classification accuracy, hence suggesting that a balance must be sought between the separability margin of the embedding and the regularity of the interpolation function in supervised manifold learning.

5 Conclusions

Most of the current supervised manifold learning algorithms focus on learning representations of training data, while the generalization properties of these representations have not been understood well yet. In this work, we have proposed a theoretical analysis of the performance of supervised manifold learning methods. We have presented generalization bounds for nonlinear supervised manifold learning algorithms and explored how the classification accuracy relates to several setup parameters such as the linear separation margin of the embedding, the regularity of the interpolation function, the number of training samples, and the intrinsic dimensions of the class supports (manifolds). Our results suggest that embeddings of training data with good generalization capacities must allow the construction of sufficiently regular interpolation functions that extend the mapping to new data. We have then examined whether the assumption of linear separability is easy to satisfy for structure-preserving supervised embedding algorithms. We have taken supervised Laplacian eigenmaps algorithms as a reference and shown that these methods can yield linearly separable embeddings. Providing insight about the generalization capabilities of supervised dimensionality reduction algorithms, our findings can be helpful in the classification of low-dimensional data sets.


Acknowledgments

We would like to thank Pascal Frossard and Alhussein Fawzi for the helpful discussions that contributed to this study.

A Proof of the results in Section 2

A.1 Proof of Theorem 2

Proof  Given xx, let xi𝒳x_{i}\in\mathcal{X} be the nearest neighbor of xx in 𝒳\mathcal{X} that is sampled from νm\nu_{m}

i=argminjxxj s.t. xjνm.i=\arg\min_{j}\|x-x_{j}\|\,\text{ s.t. }\,x_{j}\sim\nu_{m}.

Due to the separation hypothesis,

ωmkTyi+bmk>γ/2,k=1,,M1.\omega_{mk}^{T}\,y_{i}+b_{mk}>\gamma/2,\quad\forall k=1,\dots,M-1.

We have

ωmkTf(x)+bmk=ωmkTf(xi)+bmk+ωmkT(f(x)f(xi))ωmkTyi+bmk|ωmkT(f(x)f(xi))|>γ/2f(x)f(xi)γ/2Lxxi.\begin{split}\omega_{mk}^{T}\,f(x)+b_{mk}&=\omega_{mk}^{T}\,f(x_{i})+b_{mk}+\omega_{mk}^{T}\,(f(x)-f(x_{i}))\\ &\geq\omega_{mk}^{T}\,y_{i}+b_{mk}-\left|\omega_{mk}^{T}\,(f(x)-f(x_{i}))\right|\\ &>\gamma/2-\|f(x)-f(x_{i})\|\,\,\geq\,\,\gamma/2-L\|x-x_{i}\|.\end{split}

Then if the condition Lxxiγ/2L\|x-x_{i}\|\leq\gamma/2 is satisfied, from the above inequality we have ωmkTf(x)+bmk>0\omega_{mk}^{T}\,f(x)+b_{mk}>0 for all k=1,,M1k=1,\dots,M-1. This gives C^(x)=m\hat{C}(x)=m and thus ensures that xx is classified correctly.

In the sequel, we lower bound the probability that the distance xxi\|x-x_{i}\| between xx and its nearest neighbor from the same class is smaller than γ/2\gamma/2. We employ the following result by Kulkarni and Posner (1995). It is demonstrated in the proof of Theorem 1 in (Kulkarni and Posner, 1995) that, if 𝒳\mathcal{X} contains at least NmN_{m} samples drawn i.i.d. from νm\nu_{m} such that Nm𝒩(ϵ/2,m)N_{m}\geq\mathcal{N}(\epsilon/2,\mathcal{M}_{m}) for some ϵ>0\epsilon>0, then the probability of xxi\|x-x_{i}\| being larger than ϵ\epsilon can be upper bounded in terms of the covering number of m\mathcal{M}_{m} as

P(xxi>ϵ)𝒩(ϵ/2,m)2Nm.P(\|x-x_{i}\|>\epsilon)\leq\frac{\mathcal{N}(\epsilon/2,\mathcal{M}_{m})}{2N_{m}}.

Therefore, for any ϵ\epsilon such that ϵγ/(2L)\epsilon\leq\gamma/(2L) and Nm𝒩(ϵ/2,m)N_{m}\geq\mathcal{N}(\epsilon/2,\mathcal{M}_{m}), with probability at least 1𝒩(ϵ/2,m)/(2Nm)1-\mathcal{N}(\epsilon/2,\mathcal{M}_{m})/(2N_{m}), we have

xxiϵγ/(2L)\|x-x_{i}\|\leq\epsilon\leq\gamma/(2L)

thus, the class label of xx is correctly estimated as C^(x)=m\hat{C}(x)=m due to the above discussion.  

A.2 Proof of Lemma 3

Proof  We first bound the deviation of f(x)f(x) from the sample average of ff in the neighborhood of xx as

f(x)1QxjAf(xj)f(x)mf+1QxjAf(xj)mf\left\|f(x)-\frac{1}{Q}\sum_{x_{j}\in A}f(x_{j})\right\|\leq\left\|f(x)-m_{f}\right\|+\left\|\frac{1}{Q}\sum_{x_{j}\in A}f(x_{j})-m_{f}\right\| (37)

where mfm_{f} is the conditional expectation of f(u)f(u), given uBδ(x)u\in B_{\delta}(x)

mf=𝔼u[f(u)|uBδ(x)]=1νm(Bδ(x))Bδ(x)f(u)𝑑νm(u).m_{f}=\mathbb{E}_{u}\big{[}f(u)\,|\,u\in B_{\delta}(x)\big{]}=\frac{1}{\nu_{m}\big{(}B_{\delta}(x)\big{)}}\int_{B_{\delta}(x)}f(u)\,\,d\nu_{m}(u).

The first term in (37) can be bounded as

f(x)mf=1νm(Bδ(x))Bδ(x)(f(x)f(u))𝑑νm(u)1νm(Bδ(x))Bδ(x)f(x)f(u)𝑑νm(u)1νm(Bδ(x))Bδ(x)Lxu𝑑νm(u)1νm(Bδ(x))Bδ(x)Lδ𝑑νm(u)=Lδ\begin{split}\|f(x)-m_{f}\|&=\left\|\frac{1}{\nu_{m}\big{(}B_{\delta}(x)\big{)}}\int_{B_{\delta}(x)}\big{(}f(x)-f(u)\big{)}\,d\nu_{m}(u)\right\|\\ &\leq\frac{1}{\nu_{m}\big{(}B_{\delta}(x)\big{)}}\int_{B_{\delta}(x)}\|f(x)-f(u)\|\,d\nu_{m}(u)\leq\frac{1}{\nu_{m}\big{(}B_{\delta}(x)\big{)}}\int_{B_{\delta}(x)}L\|x-u\|\,d\nu_{m}(u)\\ &\leq\frac{1}{\nu_{m}\big{(}B_{\delta}(x)\big{)}}\int_{B_{\delta}(x)}L\delta\,d\nu_{m}(u)=L\delta\end{split} (38)

where the second inequality follows from the fact that ff is Lipschitz continuous on the support m\mathcal{M}_{m}, where the measure νm\nu_{m} is nonzero.

The second term in (37) is given by

1QxjAf(xj)mf=(k=1d|1QxjAfk(xj)mfk|2)1/2\left\|\frac{1}{Q}\sum_{x_{j}\in A}f(x_{j})-m_{f}\right\|=\left(\sum_{k=1}^{d}\bigg{|}\frac{1}{Q}\sum_{x_{j}\in A}f^{k}(x_{j})-m_{f}^{k}\bigg{|}^{2}\right)^{1/2} (39)

where mfkm_{f}^{k} denotes the kk-th component of mfm_{f}, for k=1,,dk=1,\dots,d. Consider the random variables fk(xj)f^{k}(x_{j}). Defining

fmink=infuBδ(x)fk(u),fmaxk=supuBδ(x)fk(u),f_{\min}^{k}=\inf_{u\in B_{\delta}(x)}f^{k}(u),\qquad\quad f_{\max}^{k}=\sup_{u\in B_{\delta}(x)}f^{k}(u),

it follows that fmaxkfmink2Lδf_{\max}^{k}-f_{\min}^{k}\leq 2L\delta due to the Lipschitz continuity of ff. Then from Hoeffding’s inequality, we have

P(|1QxjAfk(xj)mfk|ϵ)2exp(2Qϵ2(fmaxkfmink)2)2exp(Qϵ22L2δ2).P\left(\bigg{|}\frac{1}{Q}\sum_{x_{j}\in A}f^{k}(x_{j})-m_{f}^{k}\bigg{|}\geq\epsilon\right)\leq 2\exp\left(-\frac{2Q\epsilon^{2}}{(f_{\max}^{k}-f_{\min}^{k})^{2}}\right)\leq 2\exp\left(-\frac{Q\epsilon^{2}}{2L^{2}\delta^{2}}\right).

From the union bound, we get that with probability at least 12dexp(Qϵ22L2δ2)1-2d\exp\left(-\frac{Q\epsilon^{2}}{2L^{2}\delta^{2}}\right), for all kk

|1QxjAfk(xj)mfk|ϵ,\bigg{|}\frac{1}{Q}\sum_{x_{j}\in A}f^{k}(x_{j})-m_{f}^{k}\bigg{|}\leq\epsilon,

which yields from (39)

1QxjAf(xj)mfdϵ.\left\|\frac{1}{Q}\sum_{x_{j}\in A}f(x_{j})-m_{f}\right\|\leq\sqrt{d}\epsilon.

Combining this result with the bound in (38), we conclude that with probability at least 12dexp(Qϵ22L2δ2)1-2d\exp\left(-\frac{Q\epsilon^{2}}{2L^{2}\delta^{2}}\right)

f(x)1QxjAf(xj)Lδ+dϵ.\left\|f(x)-\frac{1}{Q}\sum_{x_{j}\in A}f(x_{j})\right\|\leq L\delta+\sqrt{d}\epsilon.
 
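
The concentration bound of Lemma 3 can be checked with a similar Monte Carlo sketch. Here, as a simplified stand-in for the actual setting, x is the origin in \mathbb{R}^{2}, the conditional distribution of \nu_{m} on B_{\delta}(x) is taken to be uniform on that ball, and f is a scaled rotation (hence exactly L-Lipschitz); all constants are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustration parameters (arbitrary choices).
delta, Q, eps, d = 0.2, 50, 0.15, 2
L = 1.5
trials = 20000

# An L-Lipschitz map f: R^2 -> R^d (a scaled rotation, so its Lipschitz constant is exactly L).
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
f = lambda u: L * (u @ R.T)

x = np.zeros(2)                  # test point; f(x) = 0 for this linear map

def sample_ball(n, radius):
    """Uniform samples from the disk of the given radius around the origin."""
    ang = rng.uniform(0, 2 * np.pi, n)
    rad = radius * np.sqrt(rng.uniform(0, 1, n))
    return np.stack([rad * np.cos(ang), rad * np.sin(ang)], axis=1)

threshold = L * delta + np.sqrt(d) * eps
fail_bound = 2 * d * np.exp(-Q * eps**2 / (2 * L**2 * delta**2))

violations = 0
for _ in range(trials):
    A = sample_ball(Q, delta)                        # Q i.i.d. samples in B_delta(x)
    dev = np.linalg.norm(f(x) - f(A).mean(axis=0))   # ||f(x) - sample average||
    violations += dev > threshold

print("empirical P(deviation > L*delta + sqrt(d)*eps)   :", violations / trials)
print("bound 2d*exp(-Q*eps^2 / (2*L^2*delta^2))          :", fail_bound)
```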

A.3 Proof of Theorem 5

Proof  Given the test sample x and a training sample x_{i} drawn i.i.d. with respect to \nu_{m}, the probability that x_{i} lies within a \delta-neighborhood of x is given by

P(x_{i}\in B_{\delta}(x))=\nu_{m}(B_{\delta}(x))\geq\eta_{m,\delta}.

Then, among the N_{m} samples drawn with respect to \nu_{m}, the probability that B_{\delta}(x) contains at least Q samples is given by

\begin{split}P(|A|\geq Q)&=\sum_{q=Q}^{N_{m}}\binom{N_{m}}{q}\bigg{(}\nu_{m}(B_{\delta}(x))\bigg{)}^{q}\bigg{(}1-\nu_{m}(B_{\delta}(x))\bigg{)}^{N_{m}-q}\\ &\geq\sum_{q=Q}^{N_{m}}\binom{N_{m}}{q}(\eta_{m,\delta})^{q}\,(1-\eta_{m,\delta})^{N_{m}-q}\end{split}

where the set A is defined as in (5). The last expression above is the probability of having at least Q successes out of N_{m} realizations of a Bernoulli random variable with probability parameter \eta_{m,\delta}. This probability can be lower bounded using a tail bound for binomial distributions. We thus have

P(|A|\geq Q)\geq 1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)

which simply follows from interpreting |A| as the sum of N_{m} i.i.d. observations of a Bernoulli distributed random variable and then applying Hoeffding’s inequality as shown by Herbrich (1999), under the hypothesis that N_{m}>Q/\eta_{m,\delta}.

Assuming that Bδ(x)B_{\delta}(x) contains at least QQ samples, Lemma 3 states that with probability at least

12dexp(|A|ϵ22L2δ2)12dexp(Qϵ22L2δ2)1-2d\exp\left(-\frac{|A|\epsilon^{2}}{2L^{2}\delta^{2}}\right)\geq 1-2d\exp\left(-\frac{Q\epsilon^{2}}{2L^{2}\delta^{2}}\right)

the deviation between f(x)f(x) and the sample average of its neighbors is bounded as

f(x)1|A|xjAf(xj)Lδ+dϵ.\left\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\right\|\leq L\delta+\sqrt{d}\epsilon.

Hence, with probability at least

(1exp(2(Nmηm,δQ)2Nm))(12dexp(Qϵ22L2δ2))1exp(2(Nmηm,δQ)2Nm)2dexp(Qϵ22L2δ2)\begin{split}\left(1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)\right)\left(1-2d\exp\left(-\frac{Q\epsilon^{2}}{2L^{2}\delta^{2}}\right)\right)\\ \geq 1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2d\exp\left(-\frac{Q\epsilon^{2}}{2L^{2}\delta^{2}}\right)\end{split}

we have

f(x)1|A|xjAf(xj)Lδ+dϵ.\left\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\right\|\leq L\delta+\sqrt{d}\epsilon. (40)

The class label of a test sample xx drawn from νm\nu_{m} is correctly estimated with respect to the classifier (4) if

ωmkTf(x)+bmk>0,k=1,,M1,km.\omega_{mk}^{T}\,f(x)+b_{mk}>0,\quad\forall k=1,\dots,M-1,\,k\neq m.

If the condition in (40) is satisfied, for all kmk\neq m, we have

ωmkTf(x)+bmk=ωmkT1|A|xjAf(xj)+bmk+ωmkT(f(x)1|A|xjAf(xj))ωmkT1|A|xjAf(xj)+bmkf(x)1|A|xjAf(xj)>γQ/2f(x)1|A|xjAf(xj)γQ/2Lδdϵ0.\begin{split}\omega_{mk}^{T}\,f(x)+b_{mk}&=\omega_{mk}^{T}\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})+b_{mk}+\omega_{mk}^{T}\,\left(f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\right)\\ &\geq\omega_{mk}^{T}\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})+b_{mk}-\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\|\\ &>\gamma_{Q}/2-\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\|\geq\gamma_{Q}/2-L\delta-\sqrt{d}\epsilon\geq 0.\end{split}

Here, we obtain the second inequality from the hypothesis that the embedding is QQ-mean separable with margin larger than γQ\gamma_{Q}, which implies that the embedding is also RR-mean separable with margin larger than γQ\gamma_{Q}, for R>QR>Q. Then the last inequality is due to the condition (8) on the interpolation function in the theorem. We thus get that with probability at least

1exp(2(Nmηm,δQ)2Nm)2dexp(Qϵ22L2δ2)1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2d\exp\left(-\frac{Q\epsilon^{2}}{2L^{2}\delta^{2}}\right)

ωmkTf(x)+bmk>0\omega_{mk}^{T}\,f(x)+b_{mk}>0 for all kmk\neq m, hence, the sample xx is correctly classified. This concludes the proof of the theorem.  
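
The binomial tail bound used in the proof can be verified exactly for given parameter values, since the left-hand side has the closed binomial form written above. A minimal sketch, with arbitrary parameter values satisfying N_{m}>Q/\eta_{m,\delta}:

```python
from math import comb, exp

def binom_tail(N, Q, p):
    """Exact P(|A| >= Q) for |A| ~ Binomial(N, p), as in the proof of Theorem 5."""
    return sum(comb(N, q) * p**q * (1 - p)**(N - q) for q in range(Q, N + 1))

N_m, Q, eta = 500, 30, 0.1       # arbitrary illustration values with N_m > Q/eta

exact = binom_tail(N_m, Q, eta)
hoeffding = 1.0 - exp(-2.0 * (N_m * eta - Q)**2 / N_m)

print("exact P(|A| >= Q)    :", exact)
print("Hoeffding lower bound:", hoeffding)
assert exact >= hoeffding
```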

A.4 Proof of Theorem 6

Proof  Remember from the proof of Theorem 5 that with probability at least

1exp(2(Nmηm,δQ)2Nm)2dexp(Qϵ22L2δ2)1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2d\exp\left(-\frac{Q\epsilon^{2}}{2L^{2}\delta^{2}}\right)

the δ\delta-neighborhood Bδ(x)B_{\delta}(x) of a test sample xx from class mm contains at least QQ samples from class mm, and

f(x)1|A|xjAf(xj)Lδ+dϵ\left\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\right\|\leq L\delta+\sqrt{d}\epsilon (41)

where AA is the set of training samples in Bδ(x)B_{\delta}(x) from class mm.

Let xi,xjAx_{i},x_{j}\in A be two training samples from class mm in Bδ(x)B_{\delta}(x). As xixj2δ\|x_{i}-x_{j}\|\leq 2\delta, by the hypothesis on the embedding, we have yiyj=f(xi)f(xj)D2δ\|y_{i}-y_{j}\|=\|f(x_{i})-f(x_{j})\|\leq D_{2\delta}, which gives

f(xi)1|A|xjAf(xj)=1|A|xjA(f(xi)f(xj))1|A|xjAf(xi)f(xj)D2δ.\|f(x_{i})-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\|=\left\|\frac{1}{|A|}\sum_{x_{j}\in A}\big{(}f(x_{i})-f(x_{j})\big{)}\right\|\leq\frac{1}{|A|}\sum_{x_{j}\in A}\|f(x_{i})-f(x_{j})\|\leq D_{2\delta}.

Then, for any xiBδ(x)x_{i}\in B_{\delta}(x),

f(x)f(xi)=f(x)1|A|xjAf(xj)+1|A|xjAf(xj)f(xi)f(x)1|A|xjAf(xj)+D2δ.\begin{split}\|f(x)-f(x_{i})\|=\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})+\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})-f(x_{i})\|\\ \leq\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\|+D_{2\delta}.\end{split}

Combining this with (41), we get that with probability at least

1exp(2(Nmηm,δQ)2Nm)2dexp(Qϵ22L2δ2)1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2d\exp\left(-\frac{Q\epsilon^{2}}{2L^{2}\delta^{2}}\right)

Bδ(x)B_{\delta}(x) will contain at least QQ samples xix_{i} from class mm such that

f(x)f(xi)Lδ+dϵ+D2δ.\|f(x)-f(x_{i})\|\leq L\delta+\sqrt{d}\epsilon+D_{2\delta}. (42)

Now, assuming (42), let xix_{i}^{\prime} be a training sample from another class (other than mm). We have

f(x)f(xi)f(xi)f(xi)f(x)f(xi)>γ(Lδ+dϵ+D2δ)\|f(x)-f(x_{i}^{\prime})\|\geq\|f(x_{i})-f(x_{i}^{\prime})\|-\|f(x)-f(x_{i})\|>\gamma-(L\delta+\sqrt{d}\epsilon+D_{2\delta})

which follows from (42) and the hypothesis on the embedding that f(xi)f(xi)>γ\|f(x_{i})-f(x_{i}^{\prime})\|>\gamma.

It follows from the condition (10) that γ2Lδ+2dϵ+2D2δ\gamma\geq 2L\delta+2\sqrt{d}\epsilon+2D_{2\delta}. Using this in the above equation, we get

f(x)f(xi)>Lδ+dϵ+D2δ.\|f(x)-f(x_{i}^{\prime})\|>L\delta+\sqrt{d}\epsilon+D_{2\delta}.

This means that the distance of f(x)f(x) to the embedding of any other sample from another class is more than Lδ+dϵ+D2δL\delta+\sqrt{d}\epsilon+D_{2\delta}, while there are samples from its own class within a distance of Lδ+dϵ+D2δL\delta+\sqrt{d}\epsilon+D_{2\delta} to f(x)f(x). Therefore, xx is classified correctly with nearest-neighbor classification in the low-dimensional domain of embedding.

 

A.5 Proof of Lemma 7

Proof  The deviation of each component fk(x)f^{k}(x) of the interpolator from the sample average in the neighborhood of xx is given by

|fk(x)1QxjAfk(xj)|=|i=1Ncik(ϕ(xxi)1QxjAϕ(xjxi))|.\begin{split}\left|f^{k}(x)-\frac{1}{Q}\sum_{x_{j}\in A}f^{k}(x_{j})\right|=\left|\sum_{i=1}^{N}c_{i}^{k}\left(\phi(\|x-x_{i}\|)-\frac{1}{Q}\sum_{x_{j}\in A}\phi(\|x_{j}-x_{i}\|)\right)\right|.\end{split} (43)

We thus proceed by studying the term

ϕ(xxi)1QxjAϕ(xjxi)\phi(\|x-x_{i}\|)-\frac{1}{Q}\sum_{x_{j}\in A}\phi(\|x_{j}-x_{i}\|) (44)

which will then be used in the above expression to arrive at the stated result.

Now let xi𝒳x_{i}\in\mathcal{X} be any training sample. In order to study the term in (44), we first look at

|ϕ(xxi)𝔼u[ϕ(uxi)|uBδ(x)]|\bigg{|}\phi(\|x-x_{i}\|)-\mathbb{E}_{u}\big{[}\phi(\|u-x_{i}\|)\,|\,u\in B_{\delta}(x)\big{]}\bigg{|}

where 𝔼u[ϕ(uxi)|uBδ(x)]\mathbb{E}_{u}\big{[}\phi(\|u-x_{i}\|)\,|\,u\in B_{\delta}(x)\big{]} denotes the conditional expectation of ϕ(uxi)\phi(\|u-x_{i}\|) over uu, given uBδ(x)u\in B_{\delta}(x). The conditional expectation is given by

𝔼u[ϕ(uxi)|uBδ(x)]=1νm(Bδ(x))Bδ(x)ϕ(uxi)𝑑νm(u).\begin{split}\mathbb{E}_{u}\big{[}\phi(\|u-x_{i}\|)\,|\,u\in B_{\delta}(x)\big{]}&=\frac{1}{\nu_{m}\big{(}B_{\delta}(x)\big{)}}\int_{B_{\delta}(x)}\phi(\|u-x_{i}\|)\,\,d\nu_{m}(u).\end{split}

We have

|ϕ(xxi)𝔼u[ϕ(uxi)|uBδ(x)]|=1νm(Bδ(x))|Bδ(x)(ϕ(xxi)ϕ(uxi))𝑑νm(u)|1νm(Bδ(x))Bδ(x)|ϕ(xxi)ϕ(uxi)|𝑑νm(u).\begin{split}\bigg{|}\phi(\|x-x_{i}\|)&-\mathbb{E}_{u}\big{[}\phi(\|u-x_{i}\|)\,|\,u\in B_{\delta}(x)\big{]}\bigg{|}\\ &=\frac{1}{\nu_{m}\big{(}B_{\delta}(x)\big{)}}\ \bigg{|}\int_{B_{\delta}(x)}\big{(}\phi(\|x-x_{i}\|)-\phi(\|u-x_{i}\|)\big{)}\,\,d\nu_{m}(u)\bigg{|}\\ &\leq\frac{1}{\nu_{m}\big{(}B_{\delta}(x)\big{)}}\int_{B_{\delta}(x)}\big{|}\phi(\|x-x_{i}\|)-\phi(\|u-x_{i}\|)\big{|}\,\,d\nu_{m}(u).\end{split}

The term in the integral is bounded as

|ϕ(xxi)ϕ(uxi)|Lϕ|xxiuxi|Lϕxu.\big{|}\phi(\|x-x_{i}\|)-\phi(\|u-x_{i}\|)\big{|}\leq L_{\phi}\,\big{|}\|x-x_{i}\|-\|u-x_{i}\|\big{|}\leq L_{\phi}\,\|x-u\|.

Using this in the above term, we get

|ϕ(xxi)𝔼u[ϕ(uxi)|uBδ(x)]|Lϕνm(Bδ(x))Bδ(x)xu𝑑νm(u)=Lϕ𝔼u[ux|uBδ(x)]Lϕδ.\begin{split}\bigg{|}\phi(\|x-x_{i}\|)&-\mathbb{E}_{u}\big{[}\phi(\|u-x_{i}\|)\,|\,u\in B_{\delta}(x)\big{]}\bigg{|}\\ &\leq\frac{L_{\phi}}{\nu_{m}\big{(}B_{\delta}(x)\big{)}}\int_{B_{\delta}(x)}\|x-u\|\,d\nu_{m}(u)=L_{\phi}\,\mathbb{E}_{u}\big{[}\|u-x\|\,|\,u\in B_{\delta}(x)\big{]}\\ &\leq L_{\phi}\,\delta.\end{split} (45)

We now analyze the term in (44) for a given xix_{i} for two different cases, i.e., for xiBδ(x)x_{i}\notin B_{\delta}(x) and xiBδ(x)x_{i}\in B_{\delta}(x). We first look at the case xiBδ(x)x_{i}\notin B_{\delta}(x). For xjBδ(x)x_{j}\in B_{\delta}(x), let

ζj:=ϕ(xjxi).\zeta_{j}:=\phi(\|x_{j}-x_{i}\|).

The observations ζj\zeta_{j} are i.i.d. (since xjx_{j} are i.i.d.) with mean mζ=𝔼u[ϕ(uxi)|uBδ(x)]m_{\zeta}=\mathbb{E}_{u}\big{[}\phi(\|u-x_{i}\|)\,|\,u\in B_{\delta}(x)\big{]} and take values in the interval ζminζjζmax\zeta_{\min}\leq\zeta_{j}\leq\zeta_{\max}, where

ζmin:=infuBδ(x)ϕ(uxi),ζmax:=supuBδ(x)ϕ(uxi).\zeta_{\min}:=\inf_{u\in B_{\delta}(x)}\phi(\|u-x_{i}\|),\qquad\qquad\zeta_{\max}:=\sup_{u\in B_{\delta}(x)}\phi(\|u-x_{i}\|).

Since for any u1,u2Bδ(x)u_{1},u_{2}\in B_{\delta}(x), u1u22δ\|u_{1}-u_{2}\|\leq 2\delta, it follows from the Lipschitz continuity of ϕ\phi that ζmaxζmin2Lϕδ\zeta_{\max}-\zeta_{\min}\leq 2L_{\phi}\delta. Using this together with the Hoeffding’s inequality, we get

P(|1QxjAζjmζ|ϵ)2exp(2Qϵ2(ζmaxζmin)2)2exp(Qϵ22Lϕ2δ2).P\bigg{(}\bigg{|}\frac{1}{Q}\sum_{x_{j}\in A}\zeta_{j}-m_{\zeta}\bigg{|}\geq\epsilon\bigg{)}\leq 2\exp\left(-\frac{2\,Q\,\epsilon^{2}}{(\zeta_{\max}-\zeta_{\min})^{2}}\right)\leq 2\exp\left(-\frac{Q\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right). (46)

We have

|ϕ(xxi)1QxjAϕ(xjxi)||ϕ(xxi)mζ|+|mζ1QxjAϕ(xjxi)|.\begin{split}\bigg{|}\phi(\|x-x_{i}\|)-\frac{1}{Q}\sum_{x_{j}\in A}\phi(\|x_{j}-x_{i}\|)\bigg{|}\leq\big{|}\phi(\|x-x_{i}\|)-m_{\zeta}\big{|}+\bigg{|}m_{\zeta}-\frac{1}{Q}\sum_{x_{j}\in A}\phi(\|x_{j}-x_{i}\|)\bigg{|}.\end{split}

Using (45) and (46) in the above equation, it holds with probability at least

12exp(Qϵ22Lϕ2δ2)1-2\exp\left(-\frac{Q\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right)

that

|ϕ(xxi)1QxjAϕ(xjxi)|Lϕδ+ϵ.\bigg{|}\phi(\|x-x_{i}\|)-\frac{1}{Q}\sum_{x_{j}\in A}\phi(\|x_{j}-x_{i}\|)\bigg{|}\leq L_{\phi}\delta+\epsilon.

Next, we study the case xiBδ(x)x_{i}\in B_{\delta}(x). For any fixed xiBδ(x)x_{i}\in B_{\delta}(x), hence xiAx_{i}\in A, we have:

|ϕ(xxi)1QxjAϕ(xjxi)|=|1Qϕ(xxi)+Q1Qϕ(xxi)1Qϕ(xixi)1QxjA{xi}ϕ(xjxi)|1Q|ϕ(xxi)ϕ(xixi)|+Q1Q|ϕ(xxi)1Q1xjA{xi}ϕ(xjxi)|\begin{split}&\bigg{|}\phi(\|x-x_{i}\|)-\frac{1}{Q}\sum_{x_{j}\in A}\phi(\|x_{j}-x_{i}\|)\bigg{|}\\ &=\bigg{|}\frac{1}{Q}\phi(\|x-x_{i}\|)+\frac{Q-1}{Q}\phi(\|x-x_{i}\|)-\frac{1}{Q}\phi(\|x_{i}-x_{i}\|)-\frac{1}{Q}\sum_{x_{j}\in A\setminus\{x_{i}\}}\phi(\|x_{j}-x_{i}\|)\bigg{|}\\ &\leq\frac{1}{Q}\bigg{|}\phi(\|x-x_{i}\|)-\phi(\|x_{i}-x_{i}\|)\bigg{|}+\frac{Q-1}{Q}\bigg{|}\phi(\|x-x_{i}\|)-\frac{1}{Q-1}\sum_{x_{j}\in A\setminus\{x_{i}\}}\phi(\|x_{j}-x_{i}\|)\bigg{|}\end{split}

The first term above is bounded as

1Q|ϕ(xxi)ϕ(xixi)|LϕδQ.\frac{1}{Q}\bigg{|}\phi(\|x-x_{i}\|)-\phi(\|x_{i}-x_{i}\|)\bigg{|}\leq\frac{L_{\phi}\delta}{Q}.

Next, similarly to the analysis of the case x_{i}\notin B_{\delta}(x), we get that for x_{i}\in B_{\delta}(x), with probability at least

12exp((Q1)ϵ22Lϕ2δ2)1-2\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right)

it holds that

|ϕ(xxi)1Q1xjA{xi}ϕ(xjxi)|Lϕδ+ϵ,\bigg{|}\phi(\|x-x_{i}\|)-\frac{1}{Q-1}\sum_{x_{j}\in A\setminus\{x_{i}\}}\phi(\|x_{j}-x_{i}\|)\bigg{|}\leq L_{\phi}\delta+\epsilon,

hence

|ϕ(xxi)1QxjAϕ(xjxi)|LϕδQ+Q1Q(Lϕδ+ϵ)Lϕδ+ϵ.\begin{split}\bigg{|}\phi(\|x-x_{i}\|)-\frac{1}{Q}\sum_{x_{j}\in A}\phi(\|x_{j}-x_{i}\|)\bigg{|}\leq\frac{L_{\phi}\delta}{Q}+\frac{Q-1}{Q}(L_{\phi}\delta+\epsilon)\leq L_{\phi}\delta+\epsilon.\end{split}

Combining the analyses of the cases x_{i}\notin B_{\delta}(x) and x_{i}\in B_{\delta}(x), we conclude that for any given x_{i}\in\mathcal{X},

P(|ϕ(xxi)1QxjAϕ(xjxi)|Lϕδ+ϵ)12exp((Q1)ϵ22Lϕ2δ2).P\left(\bigg{|}\phi(\|x-x_{i}\|)-\frac{1}{Q}\sum_{x_{j}\in A}\phi(\|x_{j}-x_{i}\|)\bigg{|}\leq L_{\phi}\delta+\epsilon\right)\geq 1-2\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right).

Therefore, applying the union bound on all NN samples xix_{i} in 𝒳\mathcal{X}, we get that with probability at least

12Nexp((Q1)ϵ22Lϕ2δ2)1-2N\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right)

it holds that

|ϕ(xxi)1QxjAϕ(xjxi)|Lϕδ+ϵ\bigg{|}\phi(\|x-x_{i}\|)-\frac{1}{Q}\sum_{x_{j}\in A}\phi(\|x_{j}-x_{i}\|)\bigg{|}\leq L_{\phi}\delta+\epsilon (47)

for all xi𝒳x_{i}\in\mathcal{X}.

We can now use this in (43) to bound the deviation of fk(x)f^{k}(x) from the empirical mean of fkf^{k} in the neighbourhood of xx. Assuming that the condition (47) holds for all xi𝒳x_{i}\in\mathcal{X}, we obtain

|fk(x)1QxjAfk(xj)|=|i=1Ncik(ϕ(xxi)1QxjAϕ(xjxi))|(Lϕδ+ϵ)i=1N|cik|𝒞(Lϕδ+ϵ),\begin{split}\left|f^{k}(x)-\frac{1}{Q}\sum_{x_{j}\in A}f^{k}(x_{j})\right|&=\left|\sum_{i=1}^{N}c_{i}^{k}\left(\phi(\|x-x_{i}\|)-\frac{1}{Q}\sum_{x_{j}\in A}\phi(\|x_{j}-x_{i}\|)\right)\right|\\ &\leq(L_{\phi}\delta+\epsilon)\sum_{i=1}^{N}|c_{i}^{k}|\leq\mathcal{C}(L_{\phi}\delta+\epsilon),\end{split}

which gives

f(x)1QxjAf(xj)=(k=1d(fk(x)1QxjAfk(xj))2)1/2d𝒞(Lϕδ+ϵ).\begin{split}\|f(x)-\frac{1}{Q}\sum_{x_{j}\in A}f(x_{j})\|=\left(\sum_{k=1}^{d}\bigg{(}f^{k}(x)-\frac{1}{Q}\sum_{x_{j}\in A}f^{k}(x_{j})\bigg{)}^{2}\right)^{1/2}\leq\sqrt{d}\mathcal{C}(L_{\phi}\delta+\epsilon).\end{split}

We thus get

P(f(x)1QxjAf(xj)d𝒞(Lϕδ+ϵ))12Nexp((Q1)ϵ22Lϕ2δ2)P\left(\|f(x)-\frac{1}{Q}\sum_{x_{j}\in A}f(x_{j})\|\leq\sqrt{d}\mathcal{C}(L_{\phi}\delta+\epsilon)\right)\geq 1-2N\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right)

which completes the proof.  
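
The quantities \mathcal{C} and L_{\phi} appearing in Lemma 7 can be made concrete with a short numerical sketch. Below, the kernel \phi is a Gaussian, the coefficients c_{i}^{k} are arbitrary fixed numbers (the lemma only uses the RBF form of f and the bound \mathcal{C}\geq\sum_{i}|c_{i}^{k}|), L_{\phi} is estimated numerically on a grid, and the samples in A are simulated as fresh uniform draws from B_{\delta}(x), which is a simplified stand-in for the actual training configuration. All values are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary illustration values.
N, d = 40, 1                 # number of RBF centers, embedding dimension
sigma, delta, Q, eps = 0.5, 0.1, 100, 0.1
trials = 5000

phi = lambda r: np.exp(-r**2 / sigma**2)      # Gaussian RBF kernel

# Lipschitz constant of phi, estimated numerically on a fine grid.
r_grid = np.linspace(0.0, 3.0, 30001)
L_phi = np.max(np.abs(np.diff(phi(r_grid)) / np.diff(r_grid)))

centers = rng.uniform(0.0, 1.0, size=(N, 2))  # RBF centers x_i
c = rng.uniform(-1.0, 1.0, size=N)            # arbitrary coefficients c_i^k
C = np.sum(np.abs(c))                         # C >= sum_i |c_i^k|

def f(points):
    """RBF map f^k(x) = sum_i c_i^k phi(||x - x_i||) evaluated at the given points."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return phi(dists) @ c

x = np.array([[0.5, 0.5]])                    # test point
threshold = np.sqrt(d) * C * (L_phi * delta + eps)
fail_bound = 2 * N * np.exp(-(Q - 1) * eps**2 / (2 * L_phi**2 * delta**2))

violations = 0
for _ in range(trials):
    ang = rng.uniform(0, 2 * np.pi, Q)
    rad = delta * np.sqrt(rng.uniform(0, 1, Q))
    A = x + np.stack([rad * np.cos(ang), rad * np.sin(ang)], axis=1)  # Q points in B_delta(x)
    dev = np.abs(f(x)[0] - f(A).mean())
    violations += dev > threshold

print("empirical P(deviation > sqrt(d)*C*(L_phi*delta + eps)):", violations / trials)
print("bound 2N*exp(-(Q-1)*eps^2 / (2*L_phi^2*delta^2))       :", fail_bound)
```

The deviation bound is typically quite loose in this setting since \mathcal{C} can be large, which is consistent with the role played by \mathcal{C} in the separation conditions of Theorems 8 and 9.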

A.6 Proof of Theorem 8

Proof  Remember from the proof of Theorem 5 that

P(|A|Q)1exp(2(Nmηm,δQ)2Nm).P(|A|\geq Q)\geq 1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right).

Lemma 7 states that, if Bδ(x)B_{\delta}(x) contains at least QQ samples from class mm, i.e., |A|Q|A|\geq Q, then

P(f(x)1|A|xjAf(xj)d𝒞(Lϕδ+ϵ))12Nexp((|A|1)ϵ22Lϕ2δ2)12Nexp((Q1)ϵ22Lϕ2δ2).\begin{split}P\left(\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\|\leq\sqrt{d}\mathcal{C}(L_{\phi}\delta+\epsilon)\right)&\geq 1-2N\exp\left(-\frac{(|A|-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right)\\ &\geq 1-2N\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right).\end{split}

Hence, combining these two results (multiplying both probabilities), we get that with probability at least

(1exp(2(Nmηm,δQ)2Nm))(12Nexp((Q1)ϵ22Lϕ2δ2))1exp(2(Nmηm,δQ)2Nm)2Nexp((Q1)ϵ22Lϕ2δ2)\begin{split}\left(1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)\right)\left(1-2N\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right)\right)\\ \geq 1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2N\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right)\end{split}

it holds that

f(x)1|A|xjAf(xj)d𝒞(Lϕδ+ϵ).\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\|\leq\sqrt{d}\mathcal{C}(L_{\phi}\delta+\epsilon). (48)

A test sample xx drawn from νm\nu_{m} is classified correctly with the linear classifier if

ωmkTf(x)+bmk>0,k=1,,M1,km.\omega_{mk}^{T}\,f(x)+b_{mk}>0,\quad\forall k=1,\dots,M-1,\,k\neq m.

If the condition in (48) is satisfied, for all kmk\neq m, we have

ωmkTf(x)+bmk=ωmkT1|A|xjAf(xj)+bmk+ωmkT(f(x)1|A|xjAf(xj))ωmkT1|A|xjAf(xj)+bmkf(x)1|A|xjAf(xj)>γQ/2f(x)1|A|xjAf(xj)γQ/2d𝒞(Lϕδ+ϵ)0.\begin{split}\omega_{mk}^{T}\,f(x)+b_{mk}&=\omega_{mk}^{T}\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})+b_{mk}+\omega_{mk}^{T}\,\left(f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\right)\\ &\geq\omega_{mk}^{T}\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})+b_{mk}-\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\|\\ &>\gamma_{Q}/2-\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\|\geq\gamma_{Q}/2-\sqrt{d}\mathcal{C}(L_{\phi}\delta+\epsilon)\geq 0.\end{split}

We thus conclude that with probability at least

1exp(2(Nmηm,δQ)2Nm)2Nexp((Q1)ϵ22Lϕ2δ2)1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2N\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right)

ωmkTf(x)+bmk>0\omega_{mk}^{T}\,f(x)+b_{mk}>0 for all kmk\neq m, hence, the class label of xx is estimated correctly.  

A.7 Proof of Theorem 9

Proof  First, recall from the proof of Theorem 8 that, with probability at least

1exp(2(Nmηm,δQ)2Nm)2Nexp((Q1)ϵ22Lϕ2δ2)1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2N\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right)

the δ\delta-neighborhood Bδ(x)B_{\delta}(x) of a test sample xx from class mm contains at least QQ samples from class mm, and

f(x)1|A|xjAf(xj)d𝒞(Lϕδ+ϵ)\|f(x)-\frac{1}{|A|}\sum_{x_{j}\in A}f(x_{j})\|\leq\sqrt{d}\mathcal{C}(L_{\phi}\delta+\epsilon) (49)

where AA is the set of training samples in Bδ(x)B_{\delta}(x) from class mm.

Then it is easy to show that (as in the proof of Theorem 6), with probability at least

1exp(2(Nmηm,δQ)2Nm)2Nexp((Q1)ϵ22Lϕ2δ2)1-\exp\left(\frac{-2\,(N_{m}\,\eta_{m,\delta}-Q)^{2}}{N_{m}}\right)-2N\exp\left(-\frac{(Q-1)\,\epsilon^{2}}{2L_{\phi}^{2}\delta^{2}}\right)

Bδ(x)B_{\delta}(x) will contain at least QQ samples xix_{i} from class mm such that

f(x)f(xi)d𝒞(Lϕδ+ϵ)+D2δ.\|f(x)-f(x_{i})\|\leq\sqrt{d}\mathcal{C}(L_{\phi}\delta+\epsilon)+D_{2\delta}. (50)

Hence, for a training sample xix_{i}^{\prime} from another class (other than mm), we have

f(x)f(xi)f(xi)f(xi)f(x)f(xi)>γ(d𝒞(Lϕδ+ϵ)+D2δ)\|f(x)-f(x_{i}^{\prime})\|\geq\|f(x_{i})-f(x_{i}^{\prime})\|-\|f(x)-f(x_{i})\|>\gamma-(\sqrt{d}\mathcal{C}(L_{\phi}\delta+\epsilon)+D_{2\delta})

which follows from (50) and the hypothesis on the embedding that f(xi)f(xi)>γ\|f(x_{i})-f(x_{i}^{\prime})\|>\gamma.

Due to the condition (16), we have \gamma\geq 2\sqrt{d}\,\mathcal{C}\,(L_{\phi}\delta+\epsilon)+2D_{2\delta}. Using this in the above equation, we obtain

f(x)f(xi)>d𝒞(Lϕδ+ϵ)+D2δ.\|f(x)-f(x_{i}^{\prime})\|>\sqrt{d}\mathcal{C}(L_{\phi}\delta+\epsilon)+D_{2\delta}.

Therefore, the distance of f(x)f(x) to the embedding of the samples from other classes is more than d𝒞(Lϕδ+ϵ)+D2δ\sqrt{d}\mathcal{C}(L_{\phi}\delta+\epsilon)+D_{2\delta}, while there are samples from its own class within a distance of d𝒞(Lϕδ+ϵ)+D2δ\sqrt{d}\mathcal{C}(L_{\phi}\delta+\epsilon)+D_{2\delta} to f(x)f(x). We thus conclude that the class label of xx is estimated correctly with nearest-neighbor classification in the low-dimensional domain of embedding.  

B Proof of the results in Section 3

B.1 Proof of Lemma 13

Proof  The coordinate vector y is the eigenvector of the matrix L_{w}-\mu L_{b} corresponding to its minimum eigenvalue. Hence,

y=\arg\min_{\begin{subarray}{c}\xi\\ \|\xi\|=1\end{subarray}}\xi^{T}(L_{w}-\mu L_{b})\xi.

Equivalently, defining the degree-normalized coordinates z=Dw1/2yz=D_{w}^{-1/2}y, and thus replacing the above ξ\xi by Dw1/2ξD_{w}^{1/2}\xi, we have

z=argminξξTDwξ=1N(ξ)N(ξ)=ξTDw1/2(LwμLb)Dw1/2ξ=ξT(DwWw)ξμξT(DwDb1)1/2(DbWb)(Db1Dw)1/2ξ.\begin{split}z&=\arg\min_{\begin{subarray}{c}\xi\\ \xi^{T}D_{w}\xi=1\end{subarray}}N(\xi)\\ N(\xi)&=\xi^{T}D_{w}^{1/2}(L_{w}-\mu L_{b})D_{w}^{1/2}\xi\\ &=\xi^{T}(D_{w}-W_{w})\xi-\mu\,\xi^{T}(D_{w}D_{b}^{-1})^{1/2}\,(D_{b}-W_{b})\,(D_{b}^{-1}D_{w})^{1/2}\xi.\end{split} (51)

Then, denoting βi=dw(i)/db(i)\beta_{i}=d_{w}(i)/d_{b}(i), the term N(ξ)N(\xi) can be rearranged as

N(ξ)=iξi(dw(i)ξijwiξjwij)μiξi(dw(i)ξijbiξjwijβiβj)=iξijwi(ξiξj)wijμiξijbi(βiξiβiβjξj)wij=ijwi(ξi2ξiξj)wijμijbi(βiξi2βiβjξiξj)wij\begin{split}N(\xi)&=\sum_{i}\xi_{i}\bigg{(}d_{w}(i)\,\xi_{i}-\sum_{j\sim_{w}i}\xi_{j}\,w_{ij}\bigg{)}-\mu\sum_{i}\xi_{i}\bigg{(}d_{w}(i)\,\xi_{i}-\sum_{j\sim_{b}i}\xi_{j}\,w_{ij}\,\sqrt{\beta_{i}\beta_{j}}\bigg{)}\\ &=\sum_{i}\xi_{i}\sum_{j\sim_{w}i}(\xi_{i}-\xi_{j})w_{ij}-\mu\sum_{i}\xi_{i}\sum_{j\sim_{b}i}(\beta_{i}\xi_{i}-\sqrt{\beta_{i}\beta_{j}}\,\xi_{j})w_{ij}\\ &=\sum_{i}\sum_{j\sim_{w}i}(\xi_{i}^{2}-\xi_{i}\xi_{j})w_{ij}-\mu\sum_{i}\sum_{j\sim_{b}i}(\beta_{i}\xi_{i}^{2}-\sqrt{\beta_{i}\beta_{j}}\xi_{i}\xi_{j})w_{ij}\end{split}

which gives (in our notation, the terms i\sim_{w}j and i\sim_{b}j in the summation indices as in (52) refer to edges rather than neighboring (i,j)-pairs; i.e., each pair is counted only once in the summation)

N(\xi)=\sum_{i\sim_{w}j}(\xi_{i}-\xi_{j})^{2}\,w_{ij}-\mu\sum_{i\sim_{b}j}\left(\sqrt{\beta_{i}}\xi_{i}-\sqrt{\beta_{j}}\xi_{j}\right)^{2}\,w_{ij} (52)

by grouping the neighboring (i,j) pairs in the inner sums. Now, for any \xi\in\mathbb{R}^{N\times 1} such that \xi^{T}D_{w}\xi=1, we define \xi^{*} as follows.

\xi_{i}^{*}=\bigg{\{}\begin{array}[]{l}-|\xi_{i}|\,\,\text{ if }C_{i}=1\\ \,\,\,\,\,|\xi_{i}|\,\,\text{ if }C_{i}=2\end{array} (53)

Clearly, \xi^{*} also satisfies (\xi^{*})^{T}D_{w}\xi^{*}=1. From (52), it can easily be checked that N(\xi^{*})\leq N(\xi) for any \xi. Then, a minimizer z of the problem (51) has to be of the separable form defined in (53); otherwise z^{*} would yield a smaller value for the function N, which would contradict the fact that z is a minimizer. Note that the equality N(z^{*})=N(z) holds only if z=z^{*} or z=-z^{*}, thus when z is separable. Therefore, the embedding z satisfies the condition

z_{i}\leq 0\,\,\text{ if }C_{i}=1,\qquad\qquad z_{i}\geq 0\,\,\text{ if }C_{i}=2

or the equivalent condition

z_{i}\leq 0\,\,\text{ if }C_{i}=2,\qquad\qquad z_{i}\geq 0\,\,\text{ if }C_{i}=1.

Finally, since y_{i}=\sqrt{d_{w}(i)}\,z_{i}, the same property also holds for the embedding y.

 
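
The sign structure established in Lemma 13 is easy to observe numerically. The sketch below builds a toy two-class graph with Gaussian affinities, forms the degree-normalized within-class and between-class Laplacians appearing in (51), and checks that the eigenvector of L_{w}-\mu L_{b} with minimum eigenvalue assigns nonpositive coordinates to one class and nonnegative coordinates to the other (up to a global sign and a numerical tolerance); since y_{i}=\sqrt{d_{w}(i)}\,z_{i}, checking the signs of y is equivalent to checking those of z. The data and parameters are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two toy classes in R^2 (arbitrary illustration data).
n1, n2, mu = 20, 25, 0.05
X = np.vstack([rng.normal(0.0, 0.5, size=(n1, 2)),
               rng.normal(2.0, 0.5, size=(n2, 2))])
labels = np.array([1] * n1 + [2] * n2)

# Gaussian affinities, split into within-class (W_w) and between-class (W_b) parts.
D2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
W = np.exp(-D2)
np.fill_diagonal(W, 0.0)
same = labels[:, None] == labels[None, :]
W_w, W_b = W * same, W * (~same)

def normalized_laplacian(Wmat):
    deg = Wmat.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return np.eye(len(Wmat)) - Dinv_sqrt @ Wmat @ Dinv_sqrt

L_w = normalized_laplacian(W_w)
L_b = normalized_laplacian(W_b)

# Eigenvector of L_w - mu * L_b with the minimum eigenvalue (the 1-D embedding y).
eigvals, eigvecs = np.linalg.eigh(L_w - mu * L_b)
y = eigvecs[:, 0]

tol = 1e-10
c1, c2 = y[labels == 1], y[labels == 2]
separable = ((np.all(c1 <= tol) and np.all(c2 >= -tol)) or
             (np.all(c1 >= -tol) and np.all(c2 <= tol)))
print("minimum eigenvalue:", eigvals[0])
print("class-wise sign separation as in Lemma 13:", separable)
```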

B.2 Proof of Theorem 14

Proof  From (51) and (52), we have

z=argminξξTDwξ=1iwj(ξiξj)2wijμibj(βiξiβjξj)2wij.z=\arg\min_{\begin{subarray}{c}\xi\\ \xi^{T}D_{w}\xi=1\end{subarray}}\sum_{i\sim_{w}j}(\xi_{i}-\xi_{j})^{2}\,w_{ij}-\mu\sum_{i\sim_{b}j}\left(\sqrt{\beta_{i}}\xi_{i}-\sqrt{\beta_{j}}\xi_{j}\right)^{2}\,w_{ij}. (54)

Thus, at the optimal solution zz the objective function takes the value

N(z)=iwj(zizj)2wijμibj(βiziβjzj)2wij.N(z)=\sum_{i\sim_{w}j}(z_{i}-z_{j})^{2}\,w_{ij}-\mu\sum_{i\sim_{b}j}\left(\sqrt{\beta_{i}}z_{i}-\sqrt{\beta_{j}}z_{j}\right)^{2}\,w_{ij}. (55)

In the following, we derive a lower bound for the first sum and an upper bound for the second sum in (55). We begin with the first sum. Let i1,mini_{1,min}, i1,maxi_{1,max}, i2,mini_{2,min} and i2,maxi_{2,max} denote the indices of the data samples in class 1 and class 2 that are respectively mapped to the extremal coordinates z1,minz_{1,min}, z1,maxz_{1,max}, z2,minz_{2,min}, z2,maxz_{2,max}, where

zk,min=mini:Ci=kzi,zk,max=maxi:Ci=kzi.z_{k,min}=\min_{i:\,C_{i}=k}z_{i}\,,\qquad\qquad z_{k,max}=\max_{i:\,C_{i}=k}z_{i}\,.

Let P1={(xki1,xki)}i=1L1P_{1}=\{(x_{k_{i-1}},x_{k_{i}})\}_{i=1}^{L_{1}} be a shortest path of length L1L_{1} joining xi1,minx_{i_{1,min}} and xi1,maxx_{i_{1,max}} and P2={(xni1,xni)}i=1L2P_{2}=\{(x_{n_{i-1}},x_{n_{i}})\}_{i=1}^{L_{2}} be a shortest path of length L2L_{2} joining xi2,minx_{i_{2,min}} and xi2,maxx_{i_{2,max}}. We have

iwj(zizj)2wiji=1L1(zkizki1)2wki1ki+i=1L2(znizni1)2wni1niwmin,1i=1L1(zkizki1)2+wmin,2i=1L2(znizni1)2\begin{split}\sum_{i\sim_{w}j}(z_{i}-z_{j})^{2}\,w_{ij}\geq\sum_{i=1}^{L_{1}}(z_{k_{i}}-z_{k_{i-1}})^{2}\,w_{k_{i-1}k_{i}}+\sum_{i=1}^{L_{2}}(z_{n_{i}}-z_{n_{i-1}})^{2}\,w_{n_{i-1}n_{i}}\\ \geq w_{min,1}\sum_{i=1}^{L_{1}}(z_{k_{i}}-z_{k_{i-1}})^{2}\,+w_{min,2}\sum_{i=1}^{L_{2}}(z_{n_{i}}-z_{n_{i-1}})^{2}\,\end{split} (56)

where the first inequality simply follows from the fact that the set of edges making up P1P2P_{1}\cup P_{2} are contained in the set of all edges in GwG_{w}. For a sequence {ai}i=0L\{a_{i}\}_{i=0}^{L}, the following inequality holds.

(aLa0)2=i=1L(aiai1)2+i,j=1ijL(aiai1)(ajaj1)i=1L(aiai1)2+12i,j=1ijL((aiai1)2+(ajaj1)2)=Li=1L(aiai1)2\begin{split}(a_{L}-a_{0})^{2}&=\sum_{i=1}^{L}(a_{i}-a_{i-1})^{2}+\sum_{\begin{subarray}{c}i,j=1\\ i\neq j\end{subarray}}^{L}(a_{i}-a_{i-1})(a_{j}-a_{j-1})\\ &\leq\sum_{i=1}^{L}(a_{i}-a_{i-1})^{2}+\frac{1}{2}\sum_{\begin{subarray}{c}i,j=1\\ i\neq j\end{subarray}}^{L}\big{(}(a_{i}-a_{i-1})^{2}+(a_{j}-a_{j-1})^{2}\big{)}=L\sum_{i=1}^{L}(a_{i}-a_{i-1})^{2}\end{split}

Hence,

i=1L(aiai1)21L(aLa0)2.\sum_{i=1}^{L}(a_{i}-a_{i-1})^{2}\geq\frac{1}{L}(a_{L}-a_{0})^{2}.

Using this inequality in (56), we get

iwj(zizj)2wijwmin,1L1(z1,maxz1,min)2+wmin,2L2(z2,maxz2,min)2.\begin{split}\sum_{i\sim_{w}j}(z_{i}-z_{j})^{2}\,w_{ij}\geq\frac{w_{min,1}}{L_{1}}(z_{1,max}-z_{1,min})^{2}\,+\frac{w_{min,2}}{L_{2}}(z_{2,max}-z_{2,min})^{2}.\end{split}

Since the path lengths L1L_{1} and L2L_{2} are upper bounded by the diameters D1D_{1} and D2D_{2}, we finally obtain the lower bound

iwj(zizj)2wijwmin,1D1(z1,maxz1,min)2+wmin,2D2(z2,maxz2,min)2.\sum_{i\sim_{w}j}(z_{i}-z_{j})^{2}\,w_{ij}\geq\frac{w_{min,1}}{D_{1}}(z_{1,max}-z_{1,min})^{2}\,+\frac{w_{min,2}}{D_{2}}(z_{2,max}-z_{2,min})^{2}. (57)

Next, we find an upper bound for the second sum in (55). Using Lemma 13, we obtain the following inequality.

ibj(βiziβjzj)2wijibj(z2,maxz1,min)2βmaxwij=12(z2,maxz1,min)2βmaxVmaxb\begin{split}\sum_{i\sim_{b}j}\left(\sqrt{\beta_{i}}z_{i}-\sqrt{\beta_{j}}z_{j}\right)^{2}\,w_{ij}&\leq\sum_{i\sim_{b}j}(z_{2,max}-z_{1,min})^{2}\,\beta_{max}\,w_{ij}\\ &=\frac{1}{2}(z_{2,max}-z_{1,min})^{2}\,\beta_{max}V^{b}_{max}\end{split} (58)

Now, since the solution zz in (54) minimizes the objective function N(ξ)N(\xi), we have

N(z)=λmin(LwμLb)N(z)=\lambda_{\min}(L_{w}-\mu L_{b})

where λmin()\lambda_{\min}(\cdot) and λmax()\lambda_{\max}(\cdot) respectively denote the minimum and the maximum eigenvalues of a matrix. For two Hermitian matrices AA and BB, the inequality λmin(A+B)λmin(A)+λmax(B)\lambda_{\min}(A+B)\leq\lambda_{\min}(A)+\lambda_{\max}(B) holds. As LwL_{w} and LbL_{b} are graph Laplacian matrices, we have λmin(Lw)=λmin(Lb)=0\lambda_{\min}(L_{w})=\lambda_{\min}(L_{b})=0 and thus

N(z)=λmin(LwμLb)λmin(Lw)+λmax(μLb)=λmin(Lw)μλmin(Lb)=0.N(z)=\lambda_{\min}(L_{w}-\mu L_{b})\leq\lambda_{\min}(L_{w})+\lambda_{\max}(-\mu L_{b})=\lambda_{\min}(L_{w})-\mu\lambda_{\min}(L_{b})=0.

Using in (55) the above inequality and the lower and upper bounds in (57) and (58), we obtain

0N(z)=iwj(zizj)2wijμibj(βiziβjzj)2wijwmin,1D1(z1,maxz1,min)2+wmin,2D2(z2,maxz2,min)212μ(z2,maxz1,min)2βmaxVmaxb.\begin{split}0\geq N(z)&=\sum_{i\sim_{w}j}(z_{i}-z_{j})^{2}\,w_{ij}-\mu\sum_{i\sim_{b}j}\left(\sqrt{\beta_{i}}z_{i}-\sqrt{\beta_{j}}z_{j}\right)^{2}\,w_{ij}\\ &\geq\frac{w_{min,1}}{D_{1}}(z_{1,max}-z_{1,min})^{2}\,+\frac{w_{min,2}}{D_{2}}(z_{2,max}-z_{2,min})^{2}\\ &\quad-\frac{1}{2}\mu(z_{2,max}-z_{1,min})^{2}\,\beta_{max}V^{b}_{max}.\end{split}

Hence

wmin,1D1(z1,maxz1,min)2+wmin,2D2(z2,maxz2,min)212μ(z2,maxz1,min)2βmaxVmaxb.\frac{w_{min,1}}{D_{1}}(z_{1,max}-z_{1,min})^{2}\,+\frac{w_{min,2}}{D_{2}}(z_{2,max}-z_{2,min})^{2}\leq\frac{1}{2}\mu(z_{2,max}-z_{1,min})^{2}\,\beta_{max}V^{b}_{max}. (59)

The RHS of the above inequality is related to the overall support z_{2,max}-z_{1,min} of the data, whereas the terms on the LHS are related to the individual supports z_{1,max}-z_{1,min} and z_{2,max}-z_{2,min} of the two classes in the learnt embedding. Meanwhile, the separation z_{2,min}-z_{1,max} between the two classes is given by the gap between the overall support and the sum of the individual supports. In order to use the above inequality in view of this observation, we first derive a lower bound on the RHS term. Since z^{T}D_{w}z=1, we have

1=izi2dw(i)=i:Ci=1zi2dw(i)+i:Ci=2zi2dw(i)z1,min2i:Ci=1dw(i)+z2,max2i:Ci=2dw(i)=z1,min2V1+z2,max2V2.\begin{split}1=\sum_{i}z_{i}^{2}\,d_{w}(i)&=\sum_{i:\,C_{i}=1}z_{i}^{2}\,d_{w}(i)+\sum_{i:\,C_{i}=2}z_{i}^{2}\,d_{w}(i)\\ &\leq z_{1,min}^{2}\sum_{i:\,C_{i}=1}d_{w}(i)+z_{2,max}^{2}\sum_{i:\,C_{i}=2}d_{w}(i)=z_{1,min}^{2}V_{1}+z_{2,max}^{2}V_{2}.\end{split}

This gives

z1,min2+z2,max21Vmax.z_{1,min}^{2}+z_{2,max}^{2}\geq\frac{1}{V_{max}}.

Hence, we obtain the following lower bound on the overall support

(z2,maxz1,min)2z2,max2+z1,min21Vmax.(z_{2,max}-z_{1,min})^{2}\geq z_{2,max}^{2}+z_{1,min}^{2}\geq\frac{1}{V_{max}}. (60)

Denoting the supports of class 1 and class 2 and the overall support as

S1=z1,maxz1,min,S2=z2,maxz2,min,S=z2,maxz1,min,S_{1}=z_{1,max}-z_{1,min},\qquad\qquad S_{2}=z_{2,max}-z_{2,min},\qquad\qquad S=z_{2,max}-z_{1,min},

we have from (59)

w¯min(S12+S22)12μS2βmaxVmaxb\overline{w}_{min}(S_{1}^{2}+S_{2}^{2})\leq\frac{1}{2}\,\mu\,S^{2}\,\beta_{max}V^{b}_{max}

which yields the following upper bound on the total support of the two classes

S1+S22(S12+S22)SμβmaxVmaxbw¯min.S_{1}+S_{2}\leq\sqrt{2(S_{1}^{2}+S_{2}^{2})}\leq S\sqrt{\frac{\mu\beta_{max}V^{b}_{max}}{\overline{w}_{min}}}.

We can thus lower bound the separation z2,minz1,maxz_{2,min}-z_{1,max} as

z2,minz1,max=S(S1+S2)S(1μβmaxVmaxbw¯min)z_{2,min}-z_{1,max}=S-(S_{1}+S_{2})\geq S\left(1-\sqrt{\frac{\mu\beta_{max}V^{b}_{max}}{\overline{w}_{min}}}\right)

provided that μ<w¯min/(βmaxVmaxb)\mu<\overline{w}_{min}/(\beta_{max}V^{b}_{max}). From the lower bound on the overall support in (60), we lower bound the separation as follows

z2,minz1,max1Vmax(1μβmaxVmaxbw¯min).z_{2,min}-z_{1,max}\geq\frac{1}{\sqrt{V_{max}}}\left(1-\sqrt{\frac{\mu\beta_{max}V^{b}_{max}}{\overline{w}_{min}}}\right).

Finally, since the separation of any embedding with dimension d1d\geq 1 is at least as much as the separation z2,minz1,maxz_{2,min}-z_{1,max} of the embedding of dimension d=1d=1, the above lower bound holds for any d1d\geq 1 as well.  

B.3 Proof of Corollary 15

Proof  The one-dimensional embedding zz is given as the solution of the constrained optimization problem

z=argminN(ξ) s.t. D(ξ)=1z=\arg\min N(\xi)\text{ s.t. }D(\xi)=1

where

N(ξ)=ξTDw1/2(LwμLb)Dw1/2ξ,D(ξ)=ξTDwξ.N(\xi)=\xi^{T}D_{w}^{1/2}(L_{w}-\mu L_{b})D_{w}^{1/2}\xi,\qquad D(\xi)=\xi^{T}D_{w}\xi.

Defining the Lagrangian function

Λ(ξ,λ)=N(ξ)+λ(D(ξ)1)\Lambda(\xi,\lambda)=N(\xi)+\lambda(D(\xi)-1)

at the optimal solution zz, we have

ξΛ=λΛ=0\nabla_{\xi}\Lambda=\nabla_{\lambda}\Lambda=0

where ξ\nabla_{\xi} and λ\nabla_{\lambda} respectively denote the derivatives of Λ\Lambda with respect to ξ\xi and λ\lambda. Thus, at ξ=z\xi=z,

Λξi=N(ξ)ξi+λD(ξ)ξi=0\frac{\partial\Lambda}{\partial\xi_{i}}=\frac{\partial N(\xi)}{\partial\xi_{i}}+\lambda\frac{\partial D(\xi)}{\partial\xi_{i}}=0

for all i=1,,Ni=1,\dots,N. From (52), the derivatives of N(ξ)N(\xi) and D(ξ)D(\xi) at zz are given by

N(ξ)ξi|ξ=z=jwi2(zizj)wijμjbi2(βiziβjzj)βiwijD(ξ)ξi|ξ=z=2dw(i)zi\begin{split}\frac{\partial N(\xi)}{\partial\xi_{i}}\bigg{|}_{\xi=z}&=\sum_{j\sim_{w}i}2(z_{i}-z_{j})w_{ij}-\mu\sum_{j\sim_{b}i}2\left(\sqrt{\beta_{i}}z_{i}-\sqrt{\beta_{j}}z_{j}\right)\,\sqrt{\beta_{i}}\,w_{ij}\\ \frac{\partial D(\xi)}{\partial\xi_{i}}\bigg{|}_{\xi=z}&=2\,d_{w}(i)\,z_{i}\end{split}

which yields

jwi(zizj)wijμjbi(βiziβjzj)βiwij+λdw(i)zi=0\sum_{j\sim_{w}i}(z_{i}-z_{j})w_{ij}-\mu\sum_{j\sim_{b}i}\left(\sqrt{\beta_{i}}z_{i}-\sqrt{\beta_{j}}z_{j}\right)\,\sqrt{\beta_{i}}\,w_{ij}+\lambda\,d_{w}(i)\,z_{i}=0 (61)

for all ii. At i=i1,maxi=i_{1,max}, as zz attains its maximal value z1,maxz_{1,max} for class 1, we have

λdw(i1,max)z1,max=jwi1,max(zjz1,max)wi1,maxj+μjbi1,max(βi1,maxz1,maxβjzj)βi1,maxwi1,maxjμβmin(z2,minz1,max)db(i1,max).\begin{split}\lambda\,d_{w}(i_{1,max})\,z_{1,max}&=\sum_{j\sim_{w}i_{1,max}}(z_{j}-z_{1,max})w_{i_{1,max}j}\\ &\qquad+\mu\sum_{j\sim_{b}i_{1,max}}\left(\sqrt{\beta_{i_{1,max}}}z_{1,max}-\sqrt{\beta_{j}}z_{j}\right)\sqrt{\beta_{i_{1,max}}}\,w_{i_{1,max}j}\\ &\leq-\mu\,\beta_{min}\,(z_{2,min}-z_{1,max})\,d_{b}(i_{1,max}).\end{split}

Hence

|z1,max|=z1,maxμβmin(z2,minz1,max)db(i1,max)λdw(i1,max)μβmin(z2,minz1,max)λβmax.|z_{1,max}|=-z_{1,max}\geq\frac{\mu\,\beta_{min}\,(z_{2,min}-z_{1,max})d_{b}(i_{1,max})}{\lambda\,d_{w}(i_{1,max})}\geq\frac{\mu\,\beta_{min}\,(z_{2,min}-z_{1,max})}{\lambda\,\beta_{max}}. (62)

We proceed by deriving an upper bound for λ\lambda. The gradients of N(ξ)N(\xi) and D(ξ)D(\xi) are given by

ξN=2Dw1/2(LwμLb)Dw1/2ξ,ξD=2Dwξ.\nabla_{\xi}N=2D_{w}^{1/2}(L_{w}-\mu L_{b})D_{w}^{1/2}\xi,\qquad\nabla_{\xi}D=2D_{w}\xi.

From the condition ξΛ=0\nabla_{\xi}\Lambda=0 at ξ=z\xi=z, we have

Dw1/2(LwμLb)Dw1/2z+λDwz=0(LwμLb)y+λy=0.\begin{split}D_{w}^{1/2}(L_{w}-\mu L_{b})D_{w}^{1/2}z+\lambda D_{w}z&=0\\ (L_{w}-\mu L_{b})y+\lambda y&=0.\end{split}

Since y=Dw1/2zy=D_{w}^{1/2}z is the unit-norm eigenvector of LwμLbL_{w}-\mu L_{b} corresponding to its smallest eigenvalue, the Lagrangian multiplier λ\lambda is given by

λ=λmin(LwμLb).\lambda=-\lambda_{\min}(L_{w}-\mu L_{b}).

We can lower bound the minimum eigenvalue as

λmin(LwμLb)λmin(Lw)+λmin(μLb)=0μλmax(Lb)2μ\lambda_{\min}(L_{w}-\mu L_{b})\geq\lambda_{\min}(L_{w})+\lambda_{\min}(-\mu L_{b})=0-\mu\lambda_{\max}(L_{b})\geq-2\mu

since the eigenvalues of a graph Laplacian are upper bounded by 22. This gives λ2μ\lambda\leq 2\mu. Using this upper bound on λ\lambda in (62), we obtain

|z1,max|12βminβmax(z2,minz1,max).|z_{1,max}|\geq\frac{1}{2}\frac{\beta_{min}}{\beta_{max}}\,(z_{2,min}-z_{1,max}).

Repeating the same steps for i=i2,mini=i_{2,min} following (61), one can similarly show that

z2,min12βminβmax(z2,minz1,max).z_{2,min}\geq\frac{1}{2}\frac{\beta_{min}}{\beta_{max}}\,(z_{2,min}-z_{1,max}).
 

B.4 Proof of Theorem 16

We first present two lemmas that will be useful for proving Theorem 16.

Lemma 17

Let A\in\mathbb{R}^{N\times N} be a symmetric matrix with eigenvalue decomposition A=U\Lambda U^{T}, where U is an orthogonal matrix and \Lambda is a diagonal matrix consisting of the eigenvalues \lambda_{1},\dots,\lambda_{N}. Consider a symmetric perturbation \Delta A on A. Let the perturbed matrix \tilde{A}=A+\Delta A have the eigenvalue decomposition \tilde{A}=\tilde{U}\tilde{\Lambda}\tilde{U}^{T}.

Assume that the eigenvalues \lambda_{i} have a separation of at least \eta, i.e., for all distinct i,j, one has |\lambda_{i}-\lambda_{j}|\geq\eta. Then the inner products of the corresponding eigenvectors of A and \tilde{A} are lower bounded as

|\tilde{u}_{j}^{T}u_{j}|\geq\sqrt{1-\frac{4\,\|\Delta A\|^{2}}{\eta^{2}}}

for all j=1,\dots,N, where u_{j} denotes the j-th column of U.

Proof  Defining R=U~TUR=\tilde{U}^{T}U, we look for a lower bound on the diagonal entries of RR. It will be helpful to examine the term

ΛRRΛ=Λ~R(ΔΛ)RRΛΛ~RRΛ+ΔΛΛ~RRΛ+ΔA\|\Lambda R-R\Lambda\|=\|\tilde{\Lambda}R-(\Delta\Lambda)R-R\Lambda\|\leq\|\tilde{\Lambda}R-R\Lambda\|+\|\Delta\Lambda\|\leq\|\tilde{\Lambda}R-R\Lambda\|+\|\Delta A\| (63)

where ΔΛ=Λ~Λ\Delta\Lambda=\tilde{\Lambda}-\Lambda and the last inequality follows from the fact that the variation in the eigenvalues is upper bounded by the norm of the perturbation matrix.

We proceed by bounding the term Λ~RRΛ\|\tilde{\Lambda}R-R\Lambda\|. First observe that

(ΔA)U=(A~A)U=(U~Λ~U~TUΛUT)U=U~Λ~RUΛ(\Delta A)U=(\tilde{A}-A)U=(\tilde{U}\tilde{\Lambda}\tilde{U}^{T}-U\Lambda U^{T})U=\tilde{U}\tilde{\Lambda}R-U\Lambda

which gives

ΔA=U~Λ~RUΛ=U~T(U~Λ~RUΛ)=Λ~RRΛ.\|\Delta A\|=\|\tilde{U}\tilde{\Lambda}R-U\Lambda\|=\|\tilde{U}^{T}(\tilde{U}\tilde{\Lambda}R-U\Lambda)\|=\|\tilde{\Lambda}R-R\Lambda\|.

Using this in (63), we get

ΛRRΛ2ΔA.\|\Lambda R-R\Lambda\|\leq 2\|\Delta A\|. (64)

Since each column of RR is given by the projection of a unit norm vector on an orthogonal basis, it is unit-norm. Denoting the (i,j)(i,j)-th entry of RR by RijR_{ij}, we have

|Rjj|=(1ijRij2)1/2.|R_{jj}|=(1-\sum_{i\neq j}R_{ij}^{2})^{1/2}. (65)

We proceed by bounding the sum of the entries Rij2R_{ij}^{2}. Notice that the (i,j)(i,j)-th entry of ΛRRΛ\Lambda R-R\Lambda is given by (λiλj)Rij(\lambda_{i}-\lambda_{j})R_{ij}. For each jj,

ij(λiλj)2Rij2ΛRRΛ24ΔA2\sum_{i\neq j}(\lambda_{i}-\lambda_{j})^{2}R_{ij}^{2}\leq\|\Lambda R-R\Lambda\|^{2}\leq 4\|\Delta A\|^{2}

where the first inequality follows from the fact that the norm of the jj-th column of a matrix can be upper bounded by its operator norm, and the second inequality is due to (64). Due to the eigenvalue separation hypothesis, the first term above can be lower bounded as

ij(λiλj)2Rij2η2ijRij2\sum_{i\neq j}(\lambda_{i}-\lambda_{j})^{2}R_{ij}^{2}\geq\eta^{2}\sum_{i\neq j}R_{ij}^{2}

which gives

ijRij24ΔA2η2.\sum_{i\neq j}R_{ij}^{2}\leq\frac{4\|\Delta A\|^{2}}{\eta^{2}}.

From (65), we arrive at the stated result, i.e., for each jj

|u~jTuj|=|Rjj|=(1ijRij2)1/2(14ΔA2η2)1/2.|\tilde{u}_{j}^{T}u_{j}|=|R_{jj}|=(1-\sum_{i\neq j}R_{ij}^{2})^{1/2}\geq\left(1-\frac{4\|\Delta A\|^{2}}{\eta^{2}}\right)^{1/2}.
 
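
Lemma 17 can be tested numerically as follows. The sketch constructs a symmetric matrix with unit eigenvalue gaps, applies a small symmetric perturbation, and compares the inner products of corresponding eigenvectors with the stated lower bound; the dimension and the perturbation size are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(4)

N = 8
# Symmetric A with eigenvalues 0, 1, ..., N-1, so the eigenvalue separation is eta = 1.
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))
A = Q @ np.diag(np.arange(N, dtype=float)) @ Q.T
eta = 1.0

# Small symmetric perturbation with spectral norm 0.1 (so that 4*||dA||^2 / eta^2 < 1).
dA = rng.normal(size=(N, N))
dA = (dA + dA.T) / 2.0
dA *= 0.1 / np.linalg.norm(dA, 2)

_, U = np.linalg.eigh(A)             # eigenvectors of A (columns, ascending eigenvalues)
_, U_t = np.linalg.eigh(A + dA)      # eigenvectors of the perturbed matrix

inner = np.abs(np.sum(U * U_t, axis=0))            # |u_j~^T u_j| for each j
bound = np.sqrt(1.0 - 4.0 * 0.1**2 / eta**2)

print("min_j |u_j~^T u_j|:", inner.min())
print("Lemma 17 bound    :", bound)
assert np.all(inner >= bound)
```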
Lemma 18

Let U,\tilde{U}\in\mathbb{R}^{N\times N} be two orthogonal matrices such that the differences between the corresponding columns u_{i}, \tilde{u}_{i} of U and \tilde{U} are upper bounded as \|u_{i}-\tilde{u}_{i}\|^{2}\leq\delta for some \delta<2. Let V=U^{T} and \tilde{V}=\tilde{U}^{T}. Then the differences between the corresponding columns v_{i}, \tilde{v}_{i} of V and \tilde{V} are upper bounded as

\|v_{i}-\tilde{v}_{i}\|^{2}\leq\delta+2\sqrt{N}\sqrt{1-\left(1-\frac{\delta}{2}\right)^{2}}.

Proof  Let R=U~TUR=\tilde{U}^{T}U. Since uiu_{i} and u~i\tilde{u}_{i} are unit-norm vectors, we have

uiu~i2=22u~iTuiδ\|u_{i}-\tilde{u}_{i}\|^{2}=2-2\tilde{u}_{i}^{T}u_{i}\leq\delta

therefore

Rii=u~iTui1δ2>0R_{ii}=\tilde{u}_{i}^{T}u_{i}\geq 1-\frac{\delta}{2}>0 (66)

where RiiR_{ii} denotes the ii-th diagonal entry of RR. From v~i=Rvi\tilde{v}_{i}=Rv_{i} it follows that

viTv~i=viTRvi=viTRdvi+viTRndviviTRdvi|viTRndvi|\begin{split}v_{i}^{T}\tilde{v}_{i}=v_{i}^{T}Rv_{i}=v_{i}^{T}R^{d}v_{i}+v_{i}^{T}R^{nd}v_{i}\geq v_{i}^{T}R^{d}v_{i}-|v_{i}^{T}R^{nd}v_{i}|\end{split} (67)

where RdR^{d} and RndR^{nd} denote the components of RR consisting respectively of the diagonal and the nondiagonal terms. From the condition (66) on the diagonal entries of RR, it follows that the first term is lower bounded as

viTRdvi1δ/2.v_{i}^{T}R^{d}v_{i}\geq 1-\delta/2.

Also, from (66), the 2\ell_{2}-norm of each row and each column of RndR^{nd} is upper bounded by 1(1δ/2)2\sqrt{1-(1-\delta/2)^{2}}. Bounding the operator norm of RndR^{nd} in terms of the maximal 1\ell_{1}-norms of the rows and columns, we get

|viTRndvi|RndN1(1δ/2)2.|v_{i}^{T}R^{nd}v_{i}|\leq\|R^{nd}\|\leq\sqrt{N}\sqrt{1-(1-\delta/2)^{2}}.

Using this together with the inequality (66) in (67), we get

viTv~i1δ2N1(1δ/2)2v_{i}^{T}\tilde{v}_{i}\geq 1-\frac{\delta}{2}-\sqrt{N}\sqrt{1-(1-\delta/2)^{2}}

which gives the stated result

viv~i2=22viTv~iδ+2N1(1δ/2)2.\|v_{i}-\tilde{v}_{i}\|^{2}=2-2v_{i}^{T}\tilde{v}_{i}\leq\delta+2\sqrt{N}\sqrt{1-(1-\delta/2)^{2}}.
 
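
A similar numerical check for Lemma 18 is sketched below: a random orthogonal U is paired with a nearby orthogonal \tilde{U}, \delta is measured as \max_{i}\|u_{i}-\tilde{u}_{i}\|^{2}, and the bound on the rows (the columns of V=U^{T} and \tilde{V}=\tilde{U}^{T}) is verified. The construction of \tilde{U} and the matrix size are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(5)

N = 10
U, _ = np.linalg.qr(rng.normal(size=(N, N)))        # random orthogonal matrix

# A nearby orthogonal matrix: orthogonalize a small perturbation of U and fix the
# column signs so that each column stays close to the corresponding column of U.
U_t, _ = np.linalg.qr(U + 0.02 * rng.normal(size=(N, N)))
U_t *= np.sign(np.sum(U * U_t, axis=0))

delta = np.max(np.sum((U - U_t)**2, axis=0))        # max_i ||u_i - u_i~||^2
assert delta < 2.0

V, V_t = U.T, U_t.T                                  # columns of V are the rows of U
row_diff = np.max(np.sum((V - V_t)**2, axis=0))      # max_i ||v_i - v_i~||^2
bound = delta + 2.0 * np.sqrt(N) * np.sqrt(1.0 - (1.0 - delta / 2.0)**2)

print("max_i ||v_i - v_i~||^2:", row_diff)
print("Lemma 18 bound        :", bound)
assert row_diff <= bound
```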

We are now ready to prove Theorem 16.

Proof  We first look at the separation of the embedding obtained with LcL^{c} for the reduced data graph with all between-category edges removed. The data graph corresponding to LcL^{c} has QQ connected components; therefore, LcL^{c} is a block diagonal matrix consisting of QQ blocks. Each qq-th block is given by the objective matrix Lc,q=LwqμLbqL^{c,q}=L_{w}^{q}-\mu L_{b}^{q} where LwqL_{w}^{q} and LbqL_{b}^{q} are the within-class and the between-class Laplacian matrices of the data graph restricted to only the category qq. As LcL^{c} is a block-diagonal matrix, its eigenvalues and eigenvectors are given by the union of the eigenvalues and the eigenvectors of the block components Lc,qL^{c,q} (i.e., their inclusions in N\mathbb{R}^{N} by zero-padding).

Let Yq=[y1qyNqq]TY^{q}=[y_{1}^{q}\dots y_{N_{q}}^{q}]^{T} be the dqd^{q}-dimensional embedding of the NqN_{q} samples in category qq, whose columns are the eigenvectors of Lc,qL^{c,q}. The embedding YqY^{q} is assumed to be separable with a margin of γc\gamma^{c} by the theorem hypothesis. Consider the embeddings YqY^{q}, YrY^{r} of two different categories qq and rr, and two classes kk, ll respectively from these two categories. By the separation hypothesis (32) within each category, there exist hyperplanes ωkq\omega_{k}^{q} and ωlr\omega_{l}^{r} with ωkq=1\|\omega_{k}^{q}\|=1, ωlr=1\|\omega_{l}^{r}\|=1, such that for the embedding of any sample yiqdqy_{i}^{q}\in\mathbb{R}^{d^{q}} from class kk and any sample yjrdry_{j}^{r}\in\mathbb{R}^{d^{r}} from class ll it holds that

(ωkq)Tyiqγc/2(ωlr)Tyjrγc/2.\begin{split}(\omega_{k}^{q})^{T}y_{i}^{q}&\geq\gamma^{c}/2\\ (\omega_{l}^{r})^{T}y_{j}^{r}&\leq-\gamma^{c}/2.\end{split} (68)

Now considering an ordering of all QQ categories, we can define the inclusion y¯iqd\overline{y}_{i}^{q}\in\mathbb{R}^{d} of each sample yiqdq{y}_{i}^{q}\in\mathbb{R}^{d^{q}} from each category qq, where d=qdqd=\sum_{q}d^{q} and the nonzero entries of y¯iq=[00(yiq)T 00]T\overline{y}_{i}^{q}=[0\dots 0\ ({y}_{i}^{q})^{T}\ 0\dots 0]^{T} are located at the support of category qq. Note that each (y¯iq)T(\overline{y}_{i}^{q})^{T} corresponds to a row of the coordinate matrix YcY^{c}, whose columns are the eigenvectors of LcL^{c}.

Consider the hyperplane

ωk,lq,r=12[00(ωkq)T 00(ωlr)T]Td\omega_{k,l}^{q,r}=\frac{1}{\sqrt{2}}[0\dots 0\ (\omega_{k}^{q})^{T}\ 0\dots 0\ (\omega_{l}^{r})^{T}]^{T}\in\mathbb{R}^{d}

with ωk,lq,r=1\|\omega_{k,l}^{q,r}\|=1, formed by the inclusion of (ωkq)T(\omega_{k}^{q})^{T} and (ωlr)T(\omega_{l}^{r})^{T} in d\mathbb{R}^{d} over the entries corresponding respectively to the categories qq and rr. From (68), we get that the hyperplane ωk,lq,r\omega_{k,l}^{q,r} separates these two classes as

(ωk,lq,r)Ty¯iqγc22(ωk,lq,r)Ty¯jrγc22.\begin{split}(\omega_{k,l}^{q,r})^{T}\ \overline{y}_{i}^{q}&\geq\frac{\gamma^{c}}{2\sqrt{2}}\\ (\omega_{k,l}^{q,r})^{T}\ \overline{y}_{j}^{r}&\leq-\frac{\gamma^{c}}{2\sqrt{2}}.\end{split} (69)

This shows that there exists a dd-dimensional embedding given by the eigenvectors of LcL^{c} that separates any pair of classes with a margin of at least γc/2\gamma^{c}/\sqrt{2}.

Now observe from Lemma 17 that the correlation between the i-th eigenvector u^{c}_{i} of L^{c} and the corresponding eigenvector u_{i} of L is lower bounded as

|u_{i}^{T}u^{c}_{i}|\geq\xi=\sqrt{1-\frac{4\|L^{nc}\|^{2}}{\eta^{2}}}.

This implies either of the conditions uiTuicξu_{i}^{T}u^{c}_{i}\geq\xi or uiT(uic)ξu_{i}^{T}(-u^{c}_{i})\geq\xi. Therefore, the eigenvector uiu_{i} of the perturbed objective matrix LL has a correlation of at least ξ\xi with either uicu^{c}_{i} or its opposite uic-u^{c}_{i}. Meanwhile, the separability of an embedding is invariant to the negation of one of the eigenvectors. This corresponds simply to changing the sign of one of the coordinates of all data samples (i.e., taking the symmetric of the embedding with respect to one axis); therefore, the linear separability remains the same. For this reason, it suffices to treat the case uiTuicξu_{i}^{T}u^{c}_{i}\geq\xi for analyzing the separability without loss of generality.

The condition uiTuicξu_{i}^{T}u^{c}_{i}\geq\xi implies

uiuic2=22uiTuic22ξ.\|u_{i}-u^{c}_{i}\|^{2}=2-2u_{i}^{T}u^{c}_{i}\leq 2-2\xi. (70)

While this upper bounds the difference between the corresponding eigenvectors of L and L^{c}, we need to upper bound the variation between the rows of the coordinate matrices Y and Y^{c}, as we are interested in the separation obtained with the embedded data coordinates given by the rows of Y. Denoting the i-th rows of Y and Y^{c} respectively as y_{i}^{T} and \overline{y}_{i}^{T}, from the condition in (70) and Lemma 18, the difference between the corresponding rows of these matrices can be bounded as

\|y_{i}^{T}-\overline{y}_{i}^{T}\|^{2}\leq 2-2\xi+2\sqrt{N(1-\xi^{2})}. (71)

As the separability condition in (69) is general and valid for any two categories, we can reformulate it as follows. For any pair of classes k,l{1,,M}k,l\in\{1,\dots,M\}, there exists a hyperplane ωk,l\omega_{k,l} such that

ωk,lTy¯iγc22 if Ci=kωk,lTy¯iγc22 if Ci=l.\begin{split}\omega_{k,l}^{T}\ \overline{y}_{i}&\geq\frac{\gamma^{c}}{2\sqrt{2}}\quad\,\,\,\,\text{ if }C_{i}=k\\ \omega_{k,l}^{T}\ \overline{y}_{i}&\leq-\frac{\gamma^{c}}{2\sqrt{2}}\quad\,\,\,\,\text{ if }C_{i}=l.\end{split} (72)

Then, from (71) and (72) we have

ωk,lTyi=ωk,lTyi¯+ωk,lT(yiyi¯)ωk,lTyi¯yiyi¯γc22(22ξ+2N(1ξ2))1/2\omega_{k,l}^{T}y_{i}=\omega_{k,l}^{T}\overline{y_{i}}+\omega_{k,l}^{T}(y_{i}-\overline{y_{i}})\geq\omega_{k,l}^{T}\overline{y_{i}}-\|y_{i}-\overline{y_{i}}\|\geq\frac{\gamma^{c}}{2\sqrt{2}}-\left(2-2\xi+2\sqrt{N(1-\xi^{2})}\right)^{1/2}

if Ci=kC_{i}=k; and

ωk,lTyi=ωk,lTyi¯+ωk,lT(yiyi¯)ωk,lTyi¯+yiyi¯γc22+(22ξ+2N(1ξ2))1/2\omega_{k,l}^{T}y_{i}=\omega_{k,l}^{T}\overline{y_{i}}+\omega_{k,l}^{T}(y_{i}-\overline{y_{i}})\leq\omega_{k,l}^{T}\overline{y_{i}}+\|y_{i}-\overline{y_{i}}\|\leq-\frac{\gamma^{c}}{2\sqrt{2}}+\left(2-2\xi+2\sqrt{N(1-\xi^{2})}\right)^{1/2}

if Ci=lC_{i}=l. Hence, the embedding YY given by the eigenvectors of the overall objective matrix LL is linearly separable with a margin of

γ=γc22(22ξ+2N(1ξ2))1/2\gamma=\frac{\gamma^{c}}{\sqrt{2}}-2\left(2-2\xi+2\sqrt{N(1-\xi^{2})}\right)^{1/2}

if

γc>4(1ξ+N(1ξ2))1/2.\gamma^{c}>4\left(1-\xi+\sqrt{N(1-\xi^{2})}\right)^{1/2}.

We thus arrive at the result stated in the theorem.

 

References

  • Baxter (1992) B. J. C. Baxter. The interpolation theory of radial basis functions. PhD thesis, Cambridge University, Trinity College, 1992.
  • Belkin and Niyogi (2003) M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003.
  • Bickel and Li (2007) P. J. Bickel and B. Li. Local polynomial regression on unknown manifolds. Lecture Notes-Monograph Series, 54:177–186, 2007.
  • Caponnetto and De Vito (2007) A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
  • Chin and Suter (2008) T. J. Chin and D. Suter. Out-of-sample extrapolation of learned manifolds. IEEE Trans. Pattern Anal. Mach. Intell., 30(9):1547–1556, 2008.
  • Chung (1996) F. R. K. Chung. Spectral Graph Theory (CBMS Regional Conference Series in Mathematics, No. 92). American Mathematical Society, December 1996.
  • Cucker and Smale (2002) F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1–49, 2002.
  • Cui and Fan (2012) Y. Cui and L. Fan. A novel supervised dimensionality reduction algorithm: Graph-based fisher analysis. Pattern Recognition, 45(4):1471–1481, 2012.
  • Donoho and Grimes (2003) D. L. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences of the United States of America, 100(10):5591–5596, May 2003.
  • Georghiades et al. (2001) A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.
  • He and Niyogi (2004) X. He and P. Niyogi. Locality Preserving Projections. In Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004. MIT Press.
  • Herbrich (1999) R. Herbrich. Exact tail bounds for binomial distributed variables. Online: Available at http://research.microsoft.com/apps/pubs/default.aspx?id=66854, 1999.
  • Hernández-Aguirre et al. (2002) A. Hernández-Aguirre, C. Koutsougeras, and B. P. Buckles. Sample complexity for function learning tasks through linear neural networks. International Journal on Artificial Intelligence Tools, 11(4):499–511, 2002.
  • Hua et al. (2012) Q. Hua, L. Bai, X. Wang, and Y. Liu. Local similarity and diversity preserving discriminant projection for face and handwriting digits recognition. Neurocomputing, 86:150–157, 2012.
  • Kolmogorov and Tihomirov (1961) A. N. Kolmogorov and V. M. Tihomirov. ε\varepsilon-entropy and ε\varepsilon-capacity of sets in functional spaces. Amer. Math. Soc. Transl., 2(17):277–364, 1961.
  • Kpotufe (2011) S. Kpotufe. k-NN regression adapts to local intrinsic dimension. In Proc. Advances in Neural Information Processing Systems 24, pages 729–737, 2011.
  • Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.
  • Kulkarni and Posner (1995) S. R. Kulkarni and S. E. Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41(4):1028–1039, 1995.
  • LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pages 2278–2324, 1998.
  • Leibe and Schiele (2003) B. Leibe and B. Schiele. Analyzing appearance and contour based methods for object categorization. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), pages 409–415, 2003.
  • Li et al. (2013) B. Li, J. Liu, Z. Zhao, and W. Zhang. Locally linear representation fisher criterion. In The 2013 International Joint Conference on Neural Networks, pages 1–7, 2013.
  • Lin et al. (2014) S. Lin, X. Liu, Y. Rong, and Z. Xu. Almost optimal estimates for approximation and learning by radial basis function networks. Machine Learning, 95(2):147–164, 2014.
  • Narcowich et al. (1994) F. J. Narcowich, N. Sivakumar, and J. D. Ward. On condition numbers associated with radial-function interpolation. Journal of Mathematical Analysis and Applications, 186(2):457 – 485, 1994.
  • Nene et al. (1996) S. A. Nene, S. K. Nayar, and H. Murase. Columbia Object Image Library (COIL-20). Technical report, Feb 1996.
  • Niyogi and Girosi (1996) P. Niyogi and F. Girosi. On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Computation, 8:819–842, 1996.
  • Peherstorfer et al. (2011) B. Peherstorfer, D. Pflüger, and H. J. Bungartz. A sparse-grid-based out-of-sample extension for dimensionality reduction and clustering with laplacian eigenmaps. In AI 2011: Proc. Advances in Artificial Intelligence - 24th Australasian Joint Conference, pages 112–121, 2011.
  • Piret (2007) C. Piret. Analytical and Numerical Advances in radial basis functions. PhD thesis, University of Colorado, 2007.
  • Qiao et al. (2013) H. Qiao, P. Zhang, D. Wang, and B. Zhang. An explicit nonlinear mapping for manifold learning. IEEE T. Cybernetics, 43(1):51–63, 2013.
  • Raducanu and Dornaika (2012) B. Raducanu and F. Dornaika. A supervised non-linear dimensionality reduction approach for manifold learning. Pattern Recognition, 45(6):2432–2444, 2012.
  • Roweis and Saul (2000) S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
  • Steinwart et al. (2009) I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, 2009.
  • Sugiyama (2007) M. Sugiyama. Dimensionality reduction of multimodal labeled data by local fisher discriminant analysis. Journal of Machine Learning Research, 8:1027–1061, 2007.
  • Tenenbaum et al. (2000) J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, December 2000.
  • Vidyasagar (1997) M. Vidyasagar. A Theory of Learning and Generalization. Springer-Verlag, Secaucus, NJ, USA, 2nd edition, 1997.
  • Vural and Guillemot (2016) E. Vural and C. Guillemot. Out-of-sample generalizations for supervised manifold learning for classification. IEEE Transactions on Image Processing, 25(3):1410–1424, March 2016.
  • Wang and Chen (2009) R. Wang and X. Chen. Manifold discriminant analysis. In CVPR, pages 429–436, 2009.
  • Yang et al. (2011) W. Yang, C. Sun, and L. Zhang. A multi-manifold discriminant analysis method for image feature extraction. Pattern Recognition, 44(8):1649–1657, 2011.
  • Zhang and Zha (2005) Z. Zhang and H. Zha. Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM Journal of Scientific Computing, 26:313–338, 2005.
  • Zhang et al. (2012) Z. Zhang, M. Zhao, and T. Chow. Marginal semi-supervised sub-manifold projections with informative constraints for dimensionality reduction and recognition. Neural Networks, 36:97–111, 2012.