
The Geometry of Nonlinear Embeddings
in Kernel Discriminant Analysis

Jiae Kim Jiae Kim is Graduate Student, Department of Statistics, The Ohio State University, Columbus, OH 43210 (Email: kim.3887@osu.edu).    Yoonkyung Lee Yoonkyung Lee is Professor, Department of Statistics, The Ohio State University, Columbus, OH 43210 (Email: yklee@stat.osu.edu).    Zhiyu Liang Zhiyu Liang is Senior Staff Data Scientist, Kohl’s, Milpitas, CA 95035 (Email: zhiyu.liang@kohls.com).
Abstract

Fisher’s linear discriminant analysis is a classical method for classification, yet it is limited to capturing linear features only. Kernel discriminant analysis, as an extension, is known to alleviate this limitation through a nonlinear feature mapping. We study the geometry of nonlinear embeddings in discriminant analysis with polynomial kernels and the Gaussian kernel by identifying the population-level discriminant function, which depends on the data distribution and the kernel. To obtain the discriminant function, we solve a generalized eigenvalue problem with between-class and within-class covariance operators. The polynomial discriminants are shown to capture the class difference explicitly through the population moments. To approximate the Gaussian discriminant, we use a particular representation of the Gaussian kernel that utilizes the exponential generating function for Hermite polynomials. We also show that the Gaussian discriminant can be approximated using randomized projections of the data. Our results illuminate how the data distribution and the kernel interact in determining the nonlinear embedding for discrimination, and they provide guidance for the choice of the kernel and its parameters.

Keywords: Discriminant analysis, Feature map, Gaussian kernel, Polynomial kernel, Rayleigh quotient, Spectral analysis

1 Introduction

Kernel methods have been widely used in statistics and machine learning for pattern recognition and analysis (Shawe-Taylor and Cristianini 2004, Schölkopf and Smola 2002, Hofmann et al. 2008). They can be described in a unified framework with a special class of functions called kernels encoding pairwise similarities between data points. Such kernels enable nonlinear extensions of linear methods seamlessly and allow us to deal with general types of data such as vectors, text documents, graphs, and images. Combined with problem-specific evaluation criteria typically in the form of a loss function or a spectral norm of a kernel matrix, this kernel-based framework can produce a variety of learning algorithms for regression, classification, ranking, clustering, and dimension reduction. Popular kernel methods include smoothing splines (Wahba 1990), support vector machines (Vapnik 1995), kernel Fisher discriminant analysis (Mika et al. 1999, Baudat and Anouar 2000), ranking SVM (Joachims 2002), spectral clustering (Scott and Longuet-Higgins 1990, von Luxburg 2007), and kernel principal component analysis (Schölkopf et al. 1998).

This paper concerns the geometry of kernel discriminant analysis (KDA). KDA is a nonlinear generalization of Fisher’s linear discriminant analysis (LDA), a standard multivariate technique for classification. Intrinsically a dimension reduction method, KDA looks for discriminants that embed multivariate data into the real line so that decisions can be made easily in a low-dimensional space. For simplicity of exposition, we focus on the case of two classes. Fisher’s linear discriminant projects data along the direction that maximizes the separation between classes. Extending this geometric idea, kernel discriminant analysis finds a data embedding that maximizes the ratio of the between-class variation to the within-class variation measured in the feature space specified by a kernel. To determine the embedding as a discriminant, we solve a generalized eigenvalue problem involving kernel-dependent covariance matrices.

We examine the kernel discriminant at the population level to illuminate the interplay between the kernel and the probability distribution for data. Of particular interest is how the kernel discriminant captures the difference between two classes geometrically, and how the choice of a kernel and associated kernel parameters affect the discriminant in connection with salient features of the underlying distribution. As a continuous analogue of the kernel-dependent covariance matrices, we define the between-class and within-class covariance operators first and state the population version of the eigenvalue problem using those operators which depend on both the data distribution and the kernel. For some kernels, we can obtain explicit solutions and determine the corresponding population kernel discriminants.

Similar population-level analyses have been done for kernel PCA and spectral clustering (Zhu et al. 1998, Shi et al. 2009, Liang and Lee 2013) to gain insights into the interplay between the kernel and distributional features on low dimensional embeddings for data visualization and clustering. The population analyses of kernel PCA, spectral clustering, and KDA require a spectral analysis of kernel operators of different forms depending on the method. They help us examine the dependence of eigenfunctions and eigenvalues of the kernel operators on the data distribution, which can guide applications of those methods in practice.

The population discriminants with polynomial kernels admit a closed-form expression due to their finite-dimensional feature maps. Analogous to the geometric interpretation of Fisher’s linear discriminant as a projection along the mean-difference direction after whitening the within-class covariance, the polynomial discriminants are characterized by the differences in the population moments between classes. By contrast, the Gaussian kernel does not allow a simple closed-form expression for the discriminant because its feature map and associated function space are infinite-dimensional. We instead provide approximations to the Gaussian discriminant using two representations of the kernel. These approximations shed light on the workings of KDA with the Gaussian kernel. Using a deterministic representation of the Gaussian kernel based on the Hermite polynomial generating function, we approximate the population Gaussian discriminant with polynomial discriminants of degree as high as desired for the accuracy of approximation. This implies that the Gaussian discriminant captures the difference between classes through the entire collection of moments. Alternatively, using a stochastic representation of the Gaussian kernel through Fourier features of random projections (Rahimi and Recht 2008a), we can also view the Gaussian discriminant as an embedding that combines the expected differences in sinusoidal features of randomly projected data from the two classes.

How are the forms of these population kernel discriminants related to the task of minimizing classification error? To attain the least possible error rate, the optimal decision rule assigns a data point $\mathbf{x}\in\mathbb{R}^{p}$ to the most probable class by comparing the likelihood of one class, say $p_{1}(\mathbf{x})$, versus the other, $p_{2}(\mathbf{x})$, given $\mathbf{x}$. In other words, the ideal data embedding for discrimination of two classes should be based on the likelihood ratio $p_{1}(\mathbf{x})/p_{2}(\mathbf{x})$ or $\log[p_{1}(\mathbf{x})/p_{2}(\mathbf{x})]$. As a simple example, when the population distribution for each class is multivariate normal with a common covariance matrix, $\log[p_{1}(\mathbf{x})/p_{2}(\mathbf{x})]$ is linear in $\mathbf{x}$, and it coincides with the population version of Fisher’s linear discriminant. Difference in the covariance brings additional quadratic terms to the log likelihood ratio, requiring a quadratic discriminant for the lowest error. As the distributions further deviate from elliptical scatter patterns exemplified by normal distributions, the ideal data embedding according to $\log[p_{1}(\mathbf{x})/p_{2}(\mathbf{x})]$ will involve nonlinear terms beyond quadratic. The basic fact that each distribution can be identified with its moment-generating function or characteristic function, i.e., its Fourier transform, implies that any difference between two distributions can be described in terms of the moments or expected Fourier features in general. Our population analysis of kernel discriminants indicates that the Gaussian kernel treats the distributional difference as a whole, including both global and local (or low and high frequency) characteristics, while the polynomial kernels focus on differences in more global characteristics represented by low-order moments. The ideal choice of a kernel in KDA will inevitably depend on the mode of class difference mathematically expressed through the log likelihood ratio, $\log[p_{1}(\mathbf{x})/p_{2}(\mathbf{x})]$.

The rest of the paper is organized as follows. Section 2 provides a brief review of kernel discriminant analysis and describes its population version by introducing two kernel covariance operators for measuring the between-class and within-class variation in the feature space. Section 3 presents a population-level discriminant analysis using two types of polynomial kernels and the Gaussian kernel and provides explicit forms of the population kernel discriminants. Numerical examples are given in Section 4 to illustrate the geometry of kernel discriminants in relation to the data distribution. Section 5 concludes the paper with a discussion.

2 Preliminaries

This section provides a technical background for kernel discriminant analysis. After reviewing kernel functions, corresponding function spaces, and feature mappings in Section 2.1, we briefly describe Fisher’s linear discriminant analysis and its extension using kernels in Section 2.2 and further extend the sample-dependent description of kernel discriminant analysis to its population version in Section 2.3.

2.1 Kernel

Let the input domain for data be denoted by $\mathcal{X}$. A kernel $K(\cdot,\cdot)$ is defined as a positive semi-definite function from $\mathcal{X}\times\mathcal{X}$ to $\mathbb{R}$. As a positive semi-definite function, $K$ is symmetric: $K(\mathbf{x},\mathbf{u})=K(\mathbf{u},\mathbf{x})$ for all $\mathbf{x},\mathbf{u}\in\mathcal{X}$, and for each $n\in\mathbb{N}$ and all choices of $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\in\mathcal{X}$, the $n\times n$ matrix $K_{n}=[K(\mathbf{x}_{i},\mathbf{x}_{j})]$ is positive semi-definite.

Given $K$, there is a unique function space $\mathcal{H}_{K}$ with inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}_{K}}$ corresponding to the kernel such that for every $\mathbf{x}\in\mathcal{X}$ and $f\in\mathcal{H}_{K}$, (i) $K(\mathbf{x},\cdot)\in\mathcal{H}_{K}$, and (ii) $f(\mathbf{x})=\langle f,K(\mathbf{x},\cdot)\rangle_{\mathcal{H}_{K}}$. The second property is called the reproducing property of $K$, and it entails the following identity: $K(\mathbf{x},\mathbf{u})=\langle K(\mathbf{x},\cdot),K(\mathbf{u},\cdot)\rangle_{\mathcal{H}_{K}}$. Such a function space with a reproducing kernel is called a reproducing kernel Hilbert space (RKHS). See Aronszajn (1950), Wahba (1990) and Gu (2002) for reference.

Alternatively, kernels can be characterized as those functions that arise as the dot product of feature vectors. This is a common viewpoint in machine learning in the use of kernels for nonlinear generalization of linear methods. To capture nonlinear features often desired for data analysis, consider a mapping $\boldsymbol{\phi}$ from the input space $\mathcal{X}$ to a feature space $\mathcal{F}=\mathbb{R}^{D}$, $\boldsymbol{\phi}:\mathcal{X}\rightarrow\mathcal{F}$, which is called a feature map. The feature vector $\boldsymbol{\phi}(\mathbf{x})=(\phi_{1}(\mathbf{x}),\ldots,\phi_{D}(\mathbf{x}))^{T}$ consists of $D$ features, and for expressiveness of the features, the dimension of the feature space is often much higher than the input dimension, and possibly infinite. Through the dot product of feature vectors, we can define a valid kernel $K$ on $\mathcal{X}\times\mathcal{X}$ as $K(\mathbf{x},\mathbf{u})=\boldsymbol{\phi}(\mathbf{x})^{T}\boldsymbol{\phi}(\mathbf{u})$. When $D=\infty$, the dot product is to be interpreted as the $\ell_{2}$ inner product. A more general treatment of kernels with a general inner product for the feature space is feasible, but for brevity, we confine our description to the dot product only. Using a feature map, we can generalize a linear method by applying it in the feature space, which amounts to replacing the dot product of the original features, $\mathbf{x}^{T}\mathbf{u}$, in the linear method with a kernel, $K(\mathbf{x},\mathbf{u})$. This substitution is called the “kernel trick” in machine learning. For a general description of kernel methods, an explicit form of a feature map is not needed, nor is the feature map for a given kernel unique. See Schölkopf and Smola (2002) for general properties of kernels.

In this paper, we focus on the following kernels that are commonly used in practice with $\mathcal{X}=\mathbb{R}^{p}$:

  • Homogeneous polynomial kernel of degree $d$: $K_{d}(\mathbf{x},\mathbf{u})=(\mathbf{x}^{T}\mathbf{u})^{d}$

  • Inhomogeneous polynomial kernel of degree $d$: $\tilde{K}_{d}(\mathbf{x},\mathbf{u})=(1+\mathbf{x}^{T}\mathbf{u})^{d}$

  • Gaussian kernel with bandwidth parameter $\omega$: $K_{\omega}(\mathbf{x},\mathbf{u})=\exp\left(-\frac{\|\mathbf{x}-\mathbf{u}\|^{2}}{2\omega^{2}}\right)$.

Consideration of their explicit feature maps will be useful for the analyses presented in Section 3. For instance, the homogeneous polynomial kernel of degree 2 on $\mathcal{X}=\mathbb{R}^{2}$, $K_{2}(\mathbf{x},\mathbf{u})=(x_{1}u_{1}+x_{2}u_{2})^{2}$, can be described with the feature map $\boldsymbol{\phi}(\mathbf{x})=\left(x_{1}^{2},\ \sqrt{2}x_{1}x_{2},\ x_{2}^{2}\right)^{T}\in\mathbb{R}^{3}$. The Gaussian kernel on $\mathbb{R}$ with bandwidth parameter 1 admits $\mathcal{F}=\mathbb{R}^{\infty}$ with the $\ell_{2}$ inner product as a feature space and a feature map of

\[
\boldsymbol{\phi}(x)=e^{-\frac{x^{2}}{2}}\left(1,\ x,\ \frac{x^{2}}{\sqrt{2!}},\ \frac{x^{3}}{\sqrt{3!}},\ \ldots\right)^{T}.
\]

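As a quick numerical check of the feature maps above (not from the paper), the following Python sketch confirms that the degree-2 polynomial feature map reproduces $K_{2}$ exactly and that a truncation of the infinite Gaussian feature map closely approximates $K_{\omega}$ with $\omega=1$; the test points and the truncation length are arbitrary choices.

```python
import math
import numpy as np

# Minimal check (illustrative only) of the two explicit feature maps above.

def poly2_feature(x):
    # Feature map for the degree-2 homogeneous polynomial kernel on R^2.
    return np.array([x[0]**2, math.sqrt(2.0) * x[0] * x[1], x[1]**2])

def gauss_feature_1d(x, n_terms=30):
    # Truncation of the infinite Gaussian feature map on R with bandwidth 1.
    return np.array([math.exp(-x**2 / 2.0) * x**j / math.sqrt(math.factorial(j))
                     for j in range(n_terms)])

x = np.array([0.8, -0.5])
u = np.array([1.2, 0.4])
print(poly2_feature(x) @ poly2_feature(u), (x @ u)**2)   # identical values

a, b = 0.7, -1.2
print(gauss_feature_1d(a) @ gauss_feature_1d(b),
      math.exp(-(a - b)**2 / 2.0))                       # nearly identical values
```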
2.2 Kernel Discriminant Analysis

Kernel discriminant analysis (KDA) is a nonlinear extension of Fisher’s linear discriminant analysis using kernels. For a description of KDA, we start with a classification problem. Suppose we have data from two classes labeled 1 and 2: $\{(\mathbf{x}_{i},y_{i})\,:\,\mathbf{x}_{i}\in\mathcal{X}\text{ and }y_{i}\in\{1,2\}\text{ for }i=1,\ldots,n\}$. For simplicity, assume that the data points are ordered so that the first $n_{1}$ observations are from class 1 and the rest ($n_{2}=n-n_{1}$) are from class 2.

2.2.1 Fisher’s Linear Discriminant Analysis

As a classical approach to classification, Fisher’s linear discriminant analysis (LDA) looks for linear combinations of the original variables called linear discriminants that can separate observations from different classes effectively. It can be viewed as a dimension reduction technique for classification.

When $\mathcal{X}=\mathbb{R}^{p}$, a linear discriminant is of the form $f(\mathbf{x})=\mathbf{v}^{T}\mathbf{x}$ with a coefficient vector $\mathbf{v}\in\mathbb{R}^{p}$. For the discriminant $\mathbf{v}^{T}\mathbf{x}$ as a univariate measurement, we define the between-class variation as

\[
(\mathbf{v}^{T}\overline{\mathbf{x}}_{1}-\mathbf{v}^{T}\overline{\mathbf{x}}_{2})^{2}=\mathbf{v}^{T}(\overline{\mathbf{x}}_{1}-\overline{\mathbf{x}}_{2})(\overline{\mathbf{x}}_{1}-\overline{\mathbf{x}}_{2})^{T}\mathbf{v}
\]

and the within-class variation as its pooled sample variance:

\[
\frac{n_{1}}{n}\mathbf{v}^{T}S_{1}\mathbf{v}+\frac{n_{2}}{n}\mathbf{v}^{T}S_{2}\mathbf{v}=\mathbf{v}^{T}\left(\frac{n_{1}}{n}S_{1}+\frac{n_{2}}{n}S_{2}\right)\mathbf{v},
\]

where $\overline{\mathbf{x}}_{j}$ and $S_{j}$ are the sample mean vector and sample covariance matrix of $\mathbf{x}$ for class $j$. Letting $S_{B}=(\overline{\mathbf{x}}_{1}-\overline{\mathbf{x}}_{2})(\overline{\mathbf{x}}_{1}-\overline{\mathbf{x}}_{2})^{T}$ and $S_{W}=\frac{n_{1}}{n}S_{1}+\frac{n_{2}}{n}S_{2}$ (the pooled covariance matrix), we can express the two variations succinctly as the quadratic forms $\mathbf{v}^{T}S_{B}\mathbf{v}$ and $\mathbf{v}^{T}S_{W}\mathbf{v}$, respectively. Note that both forms are shift-invariant.

To find the best direction in LDA, namely the one that gives the maximum separation between the two classes measured relative to the within-class variance, we maximize the ratio of the between-class variation to the within-class variation with respect to $\mathbf{v}$:

\[
\underset{\mathbf{v}\in\mathbb{R}^{p}}{\text{maximize}}\ \frac{\mathbf{v}^{T}S_{B}\mathbf{v}}{\mathbf{v}^{T}S_{W}\mathbf{v}}.
\]

This ratio is also known as the Rayleigh quotient and is taken as a measure of the signal-to-noise ratio in classification along the direction $\mathbf{v}$. This maximization problem leads to the following generalized eigenvalue problem:

\[
S_{B}\mathbf{v}=\lambda S_{W}\mathbf{v},
\]

and the solution is given by the leading eigenvector. More explicitly, $\hat{\mathbf{v}}=S_{W}^{-1}(\overline{\mathbf{x}}_{1}-\overline{\mathbf{x}}_{2})$, defined only up to a normalization constant, and $\hat{\lambda}=(\overline{\mathbf{x}}_{1}-\overline{\mathbf{x}}_{2})^{T}S_{W}^{-1}(\overline{\mathbf{x}}_{1}-\overline{\mathbf{x}}_{2})$ is the corresponding eigenvalue. Since $S_{B}$ has rank 1, $\hat{\lambda}$ is the only positive eigenvalue. The resulting linear discriminant, $\hat{f}(\mathbf{x})=\hat{\mathbf{v}}^{T}\mathbf{x}$, together with an appropriately chosen threshold $c$ yields a classification boundary of the form $\{\mathbf{x}\in\mathbb{R}^{p}\ |\ \hat{\mathbf{v}}^{T}\mathbf{x}=c\}$, which is linear in the input space. When $S_{W}\approx I_{p}$, $\hat{\mathbf{v}}\approx\overline{\mathbf{x}}_{1}-\overline{\mathbf{x}}_{2}$ (the mean difference) provides the best direction for projection. Re-expression of the linear discriminant as $\hat{f}(\mathbf{x})=\hat{\mathbf{v}}^{T}\mathbf{x}=\big[S_{W}^{-\frac{1}{2}}(\overline{\mathbf{x}}_{1}-\overline{\mathbf{x}}_{2})\big]^{T}S_{W}^{-\frac{1}{2}}\mathbf{x}$ further reveals that LDA projects data onto the mean-difference direction after whitening the variables via $S_{W}^{-\frac{1}{2}}$. This interpretation also implies the invariance of $\hat{f}(\cdot)$ under variable scaling.
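To make the closed form above concrete, here is a minimal sketch in Python (NumPy/SciPy) on simulated two-class data; the class means, the shared covariance, and the sample sizes are assumptions for illustration. It solves the generalized eigenvalue problem numerically and checks that the leading eigenvector agrees, up to scale, with $S_{W}^{-1}(\overline{\mathbf{x}}_{1}-\overline{\mathbf{x}}_{2})$.

```python
import numpy as np
from scipy.linalg import eigh

# Sample LDA on simulated two-class data (all distributional choices are
# illustrative assumptions, not taken from the paper).
rng = np.random.default_rng(1)
cov = [[1.0, 0.3], [0.3, 1.0]]
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([1.5, 1.0], cov, size=300)
n1, n2 = len(X1), len(X2)
n = n1 + n2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_B = np.outer(m1 - m2, m1 - m2)
S_W = (n1 / n) * np.cov(X1, rowvar=False) + (n2 / n) * np.cov(X2, rowvar=False)

# Leading generalized eigenvector of S_B v = lambda S_W v ...
evals, evecs = eigh(S_B, S_W)                 # eigenvalues in ascending order
v_eig = evecs[:, -1]

# ... agrees up to scale with the closed form S_W^{-1} (m1 - m2).
v_closed = np.linalg.solve(S_W, m1 - m2)
cosine = abs(v_eig @ v_closed) / (np.linalg.norm(v_eig) * np.linalg.norm(v_closed))
print(evals[-1], (m1 - m2) @ v_closed)        # largest eigenvalue vs Delta^T S_W^{-1} Delta
print(cosine)                                 # close to 1
```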

2.2.2 Nonlinear Generalization

Using the aforementioned kernel trick, Mika et al. (1999) proposed a nonlinear extension of linear discriminant analysis, which can be useful when the optimal classification boundary is not linear. Conceptually, by mapping the data into a feature space using a kernel, kernel discriminant analysis finds the best direction for discrimination and corresponding linear discriminant in the feature space, which then defines a nonlinear discriminant function in the input space.

Given a kernel $K$, let $\boldsymbol{\phi}:\mathcal{X}\rightarrow\mathcal{F}$ be a feature map. Then, using the feature vector $\boldsymbol{\phi}(\mathbf{x})$, we can define the sample means and the between-class and within-class covariance matrices in the feature space analogously. These matrices are denoted by $S_{B}^{\boldsymbol{\phi}}$ and $S_{W}^{\boldsymbol{\phi}}$. KDA aims to find $\mathbf{v}$ in the feature space that maximizes

\[
\frac{\mathbf{v}^{T}S_{B}^{\boldsymbol{\phi}}\mathbf{v}}{\mathbf{v}^{T}S_{W}^{\boldsymbol{\phi}}\mathbf{v}}. \tag{1}
\]

When $\mathbf{v}$ is in the span of all training feature vectors $\boldsymbol{\phi}(\mathbf{x}_{i})$, it can be expressed as $\mathbf{v}=\sum_{i=1}^{n}\alpha_{i}\boldsymbol{\phi}(\mathbf{x}_{i})$ for some $\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{n})^{T}\in\mathbb{R}^{n}$. When we plug $\mathbf{v}$ of this form into the numerator and denominator of the ratio in (1) and expand both in terms of $\alpha_{i}$ using the kernel identity $K(\mathbf{x},\mathbf{u})=\boldsymbol{\phi}(\mathbf{x})^{T}\boldsymbol{\phi}(\mathbf{u})$, we have

\[
\mathbf{v}^{T}S_{B}^{\boldsymbol{\phi}}\mathbf{v}=\boldsymbol{\alpha}^{T}B_{n}\boldsymbol{\alpha}\quad\mbox{and}\quad\mathbf{v}^{T}S_{W}^{\boldsymbol{\phi}}\mathbf{v}=\boldsymbol{\alpha}^{T}W_{n}\boldsymbol{\alpha},
\]

where $B_{n}$ and $W_{n}$ are the $n\times n$ matrices defined through the kernel that reflect the between-class variation and within-class variation, respectively. To describe $B_{n}$ and $W_{n}$ precisely, start with the kernel matrix $K_{n}=[K(\mathbf{x}_{i},\mathbf{x}_{j})]$. According to the class labels $y_{i}$, it can be partitioned as $[K_{1}\ K_{2}]$, where $K_{1}$ is the $n\times n_{1}$ submatrix for class 1 and $K_{2}$ is the $n\times n_{2}$ submatrix for class 2. Using this partition of $K_{n}$, we can show that $B_{n}=(\bar{K}_{1}-\bar{K}_{2})(\bar{K}_{1}-\bar{K}_{2})^{T}$ with $\bar{K}_{j}=\frac{1}{n_{j}}K_{j}1_{n_{j}}$ and

\[
W_{n}=\frac{n_{1}}{n}K_{1}\left(\frac{1}{n_{1}}I_{n_{1}}-\frac{1}{n_{1}^{2}}J_{n_{1}}\right)K_{1}^{T}+\frac{n_{2}}{n}K_{2}\left(\frac{1}{n_{2}}I_{n_{2}}-\frac{1}{n_{2}^{2}}J_{n_{2}}\right)K_{2}^{T},
\]

where $1_{n_{j}}$ is the $n_{j}$-vector of ones, and $J_{n_{j}}=1_{n_{j}}1_{n_{j}}^{T}$ is the $n_{j}\times n_{j}$ matrix of ones.

In order to find the best discriminant direction $\mathbf{v}=\sum_{i=1}^{n}\alpha_{i}\boldsymbol{\phi}(\mathbf{x}_{i})$, we maximize $\displaystyle\frac{\boldsymbol{\alpha}^{T}B_{n}\boldsymbol{\alpha}}{\boldsymbol{\alpha}^{T}W_{n}\boldsymbol{\alpha}}$ with respect to $\boldsymbol{\alpha}\in\mathbb{R}^{n}$ instead. The solution is again given by the leading eigenvector of the generalized eigenvalue problem:

\[
B_{n}\boldsymbol{\alpha}=\lambda W_{n}\boldsymbol{\alpha}. \tag{2}
\]

Further, the estimated direction $\hat{\mathbf{v}}=\sum_{i=1}^{n}\hat{\alpha}_{i}\boldsymbol{\phi}(\mathbf{x}_{i})$ results in the discriminant function of the form

\[
\hat{f}(\mathbf{x})=\hat{\mathbf{v}}^{T}\boldsymbol{\phi}(\mathbf{x})=\sum_{i=1}^{n}\hat{\alpha}_{i}\boldsymbol{\phi}(\mathbf{x}_{i})^{T}\boldsymbol{\phi}(\mathbf{x})=\sum_{i=1}^{n}\hat{\alpha}_{i}K(\mathbf{x}_{i},\mathbf{x}). \tag{3}
\]

Obviously $\hat{f}(\cdot)$ is in the span of $K(\mathbf{x}_{i},\cdot)$, $i=1,\ldots,n$, and belongs to the reproducing kernel Hilbert space $\mathcal{H}_{K}$. As with Fisher’s linear discriminant, the kernel discriminant function is determined only up to a normalization constant. To specify a decision rule completely, we need to choose an appropriate threshold for the discriminant function.
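The sample procedure just described can be sketched compactly as follows (Python with NumPy/SciPy); the simulated data, the Gaussian kernel with bandwidth 1, and the small ridge added to $W_{n}$ for numerical stability are assumptions for illustration rather than part of the original formulation.

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative sketch of sample KDA with a Gaussian kernel.
rng = np.random.default_rng(2)
X1 = rng.normal(0.0, 1.0, size=(100, 2))
X2 = rng.normal(0.0, 2.0, size=(100, 2))          # the classes differ in scale
X = np.vstack([X1, X2])
n1, n2 = len(X1), len(X2)
n = n1 + n2

def gauss_kernel(A, B, omega=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * omega**2))

Kn = gauss_kernel(X, X)
K1, K2 = Kn[:, :n1], Kn[:, n1:]                   # columns split by class label

Kbar1, Kbar2 = K1.mean(axis=1), K2.mean(axis=1)   # (1/n_j) K_j 1_{n_j}
Bn = np.outer(Kbar1 - Kbar2, Kbar1 - Kbar2)
C1 = np.eye(n1) / n1 - np.ones((n1, n1)) / n1**2  # (1/n_1) I - (1/n_1^2) J
C2 = np.eye(n2) / n2 - np.ones((n2, n2)) / n2**2
Wn = (n1 / n) * K1 @ C1 @ K1.T + (n2 / n) * K2 @ C2 @ K2.T

# Leading eigenvector of B_n alpha = lambda W_n alpha, as in (2).
evals, evecs = eigh(Bn, Wn + 1e-8 * np.eye(n))
alpha = evecs[:, -1]

def f_hat(x):
    # Kernel discriminant at a new point x, as in (3).
    return alpha @ gauss_kernel(X, x[None, :])[:, 0]

print(f_hat(np.array([0.0, 0.0])), f_hat(np.array([3.0, 3.0])))
```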

2.3 Population Version of Kernel Discriminant Analysis

To understand how the data distribution, in particular the geometric difference between the two classes, and the kernel affect the resulting discriminant function, we consider a population version of KDA. For a proper description of the population version, we first assume that $\{\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\}$ in the dataset is a random sample of $\mathbf{X}$ from a mixture of two distributions $\mathbb{P}_{1}$ and $\mathbb{P}_{2}$ with population proportions $\pi_{1}$ and $\pi_{2}\,(=1-\pi_{1})$ for the two classes, that is, $\mathbb{P}=\pi_{1}\mathbb{P}_{1}+\pi_{2}\mathbb{P}_{2}$.

To illustrate how the sample version of KDA extends to the population version under this assumption, we begin with the eigenvalue problem in (2). Suppose $\lambda_{n}$ and $\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{n})^{T}$ are a pair of eigenvalue and eigenvector satisfying (2). After scaling both sides of (2) by the sample size $n$, we have

\[
\frac{1}{n}\sum_{j=1}^{n}B_{n}(i,j)\alpha_{j}=\frac{\lambda_{n}}{n}\sum_{j=1}^{n}W_{n}(i,j)\alpha_{j}\qquad\mbox{for}\quad i=1,\ldots,n. \tag{4}
\]

As a continuous population analogue of $B_{n}$ and $W_{n}$, we can define the following bivariate functions on $\mathcal{X}\times\mathcal{X}$:

\begin{align}
B_{K}(\mathbf{x},\mathbf{u}) &= \bigl\{\mathbb{E}_{1}[K(\mathbf{x},\mathbf{X})]-\mathbb{E}_{2}[K(\mathbf{x},\mathbf{X})]\bigr\}\bigl\{\mathbb{E}_{1}[K(\mathbf{u},\mathbf{X})]-\mathbb{E}_{2}[K(\mathbf{u},\mathbf{X})]\bigr\} \tag{5}\\
W_{K}(\mathbf{x},\mathbf{u}) &= \pi_{1}\text{Cov}_{1}[K(\mathbf{x},\mathbf{X}),K(\mathbf{u},\mathbf{X})]+\pi_{2}\text{Cov}_{2}[K(\mathbf{x},\mathbf{X}),K(\mathbf{u},\mathbf{X})], \tag{6}
\end{align}

where $\mathbb{E}_{j}$ and $\text{Cov}_{j}$ indicate that the expectation and covariance are taken with respect to $\mathbb{P}_{j}$. The matrices $B_{n}$ and $W_{n}$ can be viewed as a sample version of $B_{K}(\cdot,\cdot)$ and $W_{K}(\cdot,\cdot)$ evaluated at all pairs of the data points $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}$.

Further treating $\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{n})^{T}$ as a discrete version of a function $\alpha(\cdot)$ at the data points, i.e., $\boldsymbol{\alpha}=(\alpha(\mathbf{x}_{1}),\ldots,\alpha(\mathbf{x}_{n}))^{T}$, and taking the sample class proportion, $n_{j}/n$, as an estimate of the population proportion, $\pi_{j}$, and $\lambda_{n}$ as a sample version of the population eigenvalue $\lambda$, we arrive at the following integral counterpart of (4):

\[
\int_{\mathcal{X}}B_{K}(\mathbf{x},\mathbf{u})\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})=\lambda\int_{\mathcal{X}}W_{K}(\mathbf{x},\mathbf{u})\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})\quad\mbox{ for every }\mathbf{x}\in\mathcal{X}. \tag{7}
\]

This eigenvalue problem involves two integral operators: (i) the between-class covariance operator defined as

\[
{\cal B}[\alpha(\mathbf{x})]=\int_{\cal X}B_{K}(\mathbf{x},\mathbf{u})\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u}),
\]

and (ii) the within-class covariance operator defined as

\[
{\cal W}[\alpha(\mathbf{x})]=\int_{\cal X}W_{K}(\mathbf{x},\mathbf{u})\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u}).
\]

The form of the sample discriminant function in (3), with a scaling of $1/n$, suggests that using the solution to equation (7), $\alpha(\cdot)$, we define the population discriminant function as

\[
f(\mathbf{x})=\int_{\mathcal{X}}K(\mathbf{x},\mathbf{u})\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u}). \tag{8}
\]

Clearly, the eigenfunction $\alpha(\cdot)$ depends on the kernel $K$ and the probability distribution $\mathbb{P}$, and so does the kernel discriminant function with $\alpha(\cdot)$ as a coefficient function. Hence, identification of the solution to the generalized eigenvalue problem in (7) will give us a better understanding of kernel discriminants in relation to the data distribution and the choice of the kernel. The correspondence between the pattern of class difference and the nature of the resulting discriminant is of particular interest.

3 Kernel Discriminant Analysis with Covariance Operators

In this section, we carry out a population-level discriminant analysis with two types of polynomial kernels and the Gaussian kernel and derive explicit forms of the population discriminant functions. Section 3.1 covers the case of polynomial kernels in $\mathbb{R}^{p}$. Section 3.2 extends the analysis to the Gaussian kernel using two types of kernel representations.

3.1 Polynomial Discriminant

Starting with $\mathcal{X}=\mathbb{R}^{2}$, we lay out the steps necessary for a population version of discriminant analysis with the homogeneous polynomial kernel and derive the population kernel discriminant function in Section 3.1.1. We then extend the results to a multi-dimensional setting with the homogeneous polynomial kernel in Section 3.1.2 and the inhomogeneous polynomial kernel in Section 3.1.3.

3.1.1 Homogeneous Polynomial Kernel in Two-Dimensional Setting

The homogeneous polynomial kernel of degree $d$ in $\mathbb{R}^{2}$ is

\[
K_{d}(\mathbf{x},\mathbf{u}) = \left(x_{1}u_{1}+x_{2}u_{2}\right)^{d}=\sum_{i=0}^{d}{d\choose i}(x_{1}u_{1})^{d-i}(x_{2}u_{2})^{i}=\sum_{i=0}^{d}{d\choose i}\left(x_{1}^{d-i}x_{2}^{i}\right)\left(u_{1}^{d-i}u_{2}^{i}\right). \tag{9}
\]

The simple form of $K_{d}$ allows us to obtain the between-class variation function $B_{K}(\mathbf{x},\mathbf{u})$ in (5) and the within-class variation function $W_{K}(\mathbf{x},\mathbf{u})$ in (6) explicitly in terms of the population parameters.

For $B_{K}(\mathbf{x},\mathbf{u})$, we begin with

\begin{align*}
\mathbb{E}_{1}[K_{d}(\mathbf{x},\mathbf{X})]-\mathbb{E}_{2}[K_{d}(\mathbf{x},\mathbf{X})]
&= \mathbb{E}_{1}\left[\sum_{i=0}^{d}{d\choose i}\left(x_{1}^{d-i}x_{2}^{i}\right)\left(X_{1}^{d-i}X_{2}^{i}\right)\right]-\mathbb{E}_{2}\left[\sum_{i=0}^{d}{d\choose i}\left(x_{1}^{d-i}x_{2}^{i}\right)\left(X_{1}^{d-i}X_{2}^{i}\right)\right]\\
&= \sum_{i=0}^{d}{d\choose i}\left(x_{1}^{d-i}x_{2}^{i}\right)\left(\mathbb{E}_{1}[X_{1}^{d-i}X_{2}^{i}]-\mathbb{E}_{2}[X_{1}^{d-i}X_{2}^{i}]\right),
\end{align*}

which depends on the difference in the moments of total degree $d$ between the two classes. Letting $\Delta_{i}=\mathbb{E}_{1}[X_{1}^{d-i}X_{2}^{i}]-\mathbb{E}_{2}[X_{1}^{d-i}X_{2}^{i}]$, the difference in moments, we can express $B_{K}(\mathbf{x},\mathbf{u})$ as

\begin{align*}
B_{K}(\mathbf{x},\mathbf{u}) &= \bigl\{\mathbb{E}_{1}[K_{d}(\mathbf{x},\mathbf{X})]-\mathbb{E}_{2}[K_{d}(\mathbf{x},\mathbf{X})]\bigr\}\bigl\{\mathbb{E}_{1}[K_{d}(\mathbf{u},\mathbf{X})]-\mathbb{E}_{2}[K_{d}(\mathbf{u},\mathbf{X})]\bigr\}\\
&= \left\{\sum_{i=0}^{d}{d\choose i}\left(x_{1}^{d-i}x_{2}^{i}\right)\Delta_{i}\right\}\left\{\sum_{j=0}^{d}{d\choose j}\left(u_{1}^{d-j}u_{2}^{j}\right)\Delta_{j}\right\}\\
&= \sum_{i=0}^{d}\sum_{j=0}^{d}{d\choose i}{d\choose j}\Delta_{i}\Delta_{j}\left(x_{1}^{d-i}x_{2}^{i}\right)\left(u_{1}^{d-j}u_{2}^{j}\right).
\end{align*}

Similarly, for $W_{K}(\mathbf{x},\mathbf{u})$, using the form of $K_{d}$, we first derive the covariance for each class ($l=1,2$):

\begin{align*}
\text{Cov}_{l}[K_{d}(\mathbf{x},\mathbf{X}),K_{d}(\mathbf{u},\mathbf{X})] &= \text{Cov}_{l}\left[\sum_{i=0}^{d}{d\choose i}\left(x_{1}^{d-i}x_{2}^{i}\right)\left(X_{1}^{d-i}X_{2}^{i}\right),\sum_{j=0}^{d}{d\choose j}\left(u_{1}^{d-j}u_{2}^{j}\right)\left(X_{1}^{d-j}X_{2}^{j}\right)\right]\\
&= \sum_{i=0}^{d}\sum_{j=0}^{d}{d\choose i}{d\choose j}\left(x_{1}^{d-i}x_{2}^{i}\right)\left(u_{1}^{d-j}u_{2}^{j}\right)\text{Cov}_{l}[X_{1}^{d-i}X_{2}^{i},X_{1}^{d-j}X_{2}^{j}].
\end{align*}

Letting $W_{i,j}=\pi_{1}\text{Cov}_{1}[X_{1}^{d-i}X_{2}^{i},X_{1}^{d-j}X_{2}^{j}]+\pi_{2}\text{Cov}_{2}[X_{1}^{d-i}X_{2}^{i},X_{1}^{d-j}X_{2}^{j}]$, the within-class covariance of a pair of polynomial features of degree $d$, we can express the within-class variation function as

\begin{align*}
W_{K}(\mathbf{x},\mathbf{u}) &= \pi_{1}\text{Cov}_{1}[K_{d}(\mathbf{x},\mathbf{X}),K_{d}(\mathbf{u},\mathbf{X})]+\pi_{2}\text{Cov}_{2}[K_{d}(\mathbf{x},\mathbf{X}),K_{d}(\mathbf{u},\mathbf{X})]\\
&= \sum_{i=0}^{d}\sum_{j=0}^{d}{d\choose i}{d\choose j}W_{i,j}\left(x_{1}^{d-i}x_{2}^{i}\right)\left(u_{1}^{d-j}u_{2}^{j}\right).
\end{align*}

Using these two functions for $K_{d}$, we obtain the between-class covariance operator as

\begin{align*}
\int_{\mathbb{R}^{2}} B_{K}(\mathbf{x},\mathbf{u})\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})
&= \int_{\mathbb{R}^{2}}\sum_{i=0}^{d}\sum_{j=0}^{d}{d\choose i}{d\choose j}\Delta_{i}\Delta_{j}\left(x_{1}^{d-i}x_{2}^{i}\right)\left(u_{1}^{d-j}u_{2}^{j}\right)\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})\\
&= \sum_{i=0}^{d}\sum_{j=0}^{d}{d\choose i}{d\choose j}\Delta_{i}\Delta_{j}\left(x_{1}^{d-i}x_{2}^{i}\right)\int_{\mathbb{R}^{2}}\left(u_{1}^{d-j}u_{2}^{j}\right)\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})
\end{align*}

and the within-class covariance operator as

\begin{align*}
\int_{\mathbb{R}^{2}} W_{K}(\mathbf{x},\mathbf{u})\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})
&= \int_{\mathbb{R}^{2}}\sum_{i=0}^{d}\sum_{j=0}^{d}{d\choose i}{d\choose j}W_{i,j}\left(x_{1}^{d-i}x_{2}^{i}\right)\left(u_{1}^{d-j}u_{2}^{j}\right)\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})\\
&= \sum_{i=0}^{d}\sum_{j=0}^{d}{d\choose i}{d\choose j}W_{i,j}\left(x_{1}^{d-i}x_{2}^{i}\right)\int_{\mathbb{R}^{2}}\left(u_{1}^{d-j}u_{2}^{j}\right)\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u}).
\end{align*}

Given $\alpha(\mathbf{u})$, $\int_{\mathbb{R}^{2}}\left(u_{1}^{d-j}u_{2}^{j}\right)\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})$ is a constant. Thus, letting $\nu_{j}={d\choose j}\int_{\mathbb{R}^{2}}\left(u_{1}^{d-j}u_{2}^{j}\right)\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})$, we arrive at the following eigenvalue problem from (7) for the identification of $\alpha(\cdot)$:

\[
\sum_{i=0}^{d}\sum_{j=0}^{d}{d\choose i}\Delta_{i}\Delta_{j}\nu_{j}\left(x_{1}^{d-i}x_{2}^{i}\right)=\lambda\sum_{i=0}^{d}\sum_{j=0}^{d}{d\choose i}W_{i,j}\nu_{j}\left(x_{1}^{d-i}x_{2}^{i}\right),
\]

which should hold for all $\mathbf{x}=(x_{1},x_{2})^{T}\in\mathbb{R}^{2}$. Rearranging the terms in the polynomial equation, we have

\[
\sum_{i=0}^{d}\left\{{d\choose i}\Delta_{i}\sum_{j=0}^{d}\Delta_{j}\nu_{j}\right\}\left(x_{1}^{d-i}x_{2}^{i}\right)=\lambda\sum_{i=0}^{d}\left\{{d\choose i}\sum_{j=0}^{d}W_{i,j}\nu_{j}\right\}\left(x_{1}^{d-i}x_{2}^{i}\right).
\]

Matching the coefficients of $x_{1}^{d-i}x_{2}^{i}$ on both sides of the equation leads to the following system of linear equations for $\boldsymbol{\nu}=(\nu_{0},\ldots,\nu_{d})^{T}$:

\[
\boldsymbol{\Delta}\boldsymbol{\Delta}^{T}\boldsymbol{\nu}=\lambda W\boldsymbol{\nu}, \tag{10}
\]

where $\boldsymbol{\Delta}=(\Delta_{0},\Delta_{1},\ldots,\Delta_{d})^{T}$ is the vector of the mean differences of $X_{1}^{d-i}X_{2}^{i}$ for $i=0,\ldots,d$, and $W=[W_{i,j}]$ is a weighted average of their covariance matrices.

When $d=1$, $K_{d}$ becomes the linear kernel, and the features are just $X_{1}$ and $X_{2}$. Thus, $\boldsymbol{\Delta}=\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}$ (the population mean difference) and $W=\pi_{1}\Sigma_{1}+\pi_{2}\Sigma_{2}$ (the pooled population covariance matrix). Clearly, the eigenvalue problem in (10) reduces to that for the population version of Fisher’s LDA when $d=1$.

Assuming that $W^{-1}$ exists, we can show that the largest eigenvalue satisfying equation (10) is $\lambda^{*}=\boldsymbol{\Delta}^{T}W^{-1}\boldsymbol{\Delta}$ with eigenvector $\boldsymbol{\nu}^{*}=W^{-1}\boldsymbol{\Delta}$. Given the best direction $\boldsymbol{\nu}^{*}=(\nu_{0},\ldots,\nu_{d})^{T}$, the population discriminant function $f(\cdot)$ in (8) with the homogeneous polynomial kernel of degree $d$ is

\begin{align*}
f(\mathbf{x}) &= \int_{\mathbb{R}^{2}}K_{d}(\mathbf{x},\mathbf{u})\,\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})=\int_{\mathbb{R}^{2}}\sum_{j=0}^{d}{d\choose j}x_{1}^{d-j}x_{2}^{j}u_{1}^{d-j}u_{2}^{j}\,\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})\\
&= \sum_{j=0}^{d}x_{1}^{d-j}x_{2}^{j}\underbrace{{d\choose j}\int_{\mathbb{R}^{2}}u_{1}^{d-j}u_{2}^{j}\,\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})}_{\nu_{j}}=\sum_{j=0}^{d}\nu_{j}x_{1}^{d-j}x_{2}^{j}.
\end{align*}

We see that this polynomial discriminant is a linear combination of the corresponding polynomial features, with coefficients determined by the between-class differences in the means of those features and their pooled within-class covariances.
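As a concrete illustration of this construction, the following Monte Carlo sketch (Python with NumPy) estimates $\boldsymbol{\Delta}$ and $W$ from the degree-$d$ monomial features, forms $\boldsymbol{\nu}^{*}=W^{-1}\boldsymbol{\Delta}$, and evaluates the resulting discriminant; the class distributions, mixing proportions, and sample size are assumptions for illustration.

```python
import numpy as np

# Monte Carlo sketch of the population degree-d polynomial discriminant in R^2.
rng = np.random.default_rng(3)
d = 2
pi1, pi2 = 0.5, 0.5
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=200_000)
X2 = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 0.5]], size=200_000)

def degree_d_features(X, d):
    # Columns are x1^{d-i} x2^i for i = 0, ..., d.
    return np.column_stack([X[:, 0]**(d - i) * X[:, 1]**i for i in range(d + 1)])

F1, F2 = degree_d_features(X1, d), degree_d_features(X2, d)
Delta = F1.mean(axis=0) - F2.mean(axis=0)                 # moment differences
W = pi1 * np.cov(F1, rowvar=False) + pi2 * np.cov(F2, rowvar=False)

nu = np.linalg.solve(W, Delta)                            # nu* = W^{-1} Delta
lam = Delta @ nu                                          # lambda* = Delta^T W^{-1} Delta

def f(x):
    # Population discriminant f(x) = sum_j nu_j x1^{d-j} x2^j (up to scale).
    return sum(nu[i] * x[0]**(d - i) * x[1]**i for i in range(d + 1))

print(lam, f(np.array([1.0, -1.0])))
```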

3.1.2 Homogeneous Polynomial Kernel in Multi-Dimensional Setting

We extend the result in $\mathcal{X}=\mathbb{R}^{2}$ to general $\mathbb{R}^{p}$. The homogeneous polynomial kernel of degree $d$ in $\mathbb{R}^{p}$ is given as

\[
K_{d}(\mathbf{x},\mathbf{u})=(\mathbf{x}^{T}\mathbf{u})^{d}=\left(\sum_{i=1}^{p}x_{i}u_{i}\right)^{d}=\sum_{j_{1}+\cdots+j_{p}=d}{d\choose j_{1},\ldots,j_{p}}\prod_{k=1}^{p}(x_{k}u_{k})^{j_{k}}.
\]

As a function of $\mathbf{x}$, it involves polynomials in $p$ variables of total degree $d$. To facilitate derivations similar to those in $\mathbb{R}^{2}$, we will use a multi-index notation for the polynomial features.

Let $\mathbf{j}_{d}$ denote a $p$-tuple multi-index with non-negative integer entries that sum to $d$. That is, $\mathbf{j}_{d}\in\mathbf{S}_{d}:=\{(j_{1},\ldots,j_{p})\ |\ j_{i}\in\mathbb{N}\cup\{0\},\ \sum_{i=1}^{p}j_{i}=d\}$, a set with cardinality ${d+p-1\choose d}$. We will omit the subscript $d$ from $\mathbf{j}_{d}$ for brevity whenever it is clear from the context. For $\mathbf{j}=(j_{1},\ldots,j_{p})\in\mathbf{S}_{d}$, we abbreviate the multinomial coefficient ${d\choose j_{1},\ldots,j_{p}}$ to ${d\choose\mathbf{j}}$, and let $|\mathbf{j}|=j_{1}+\cdots+j_{p}$ and $\mathbf{j}!=\prod_{k=1}^{p}j_{k}!$. For $\mathbf{x}\in\mathbb{R}^{p}$ and $\mathbf{j}\in\mathbf{S}_{d}$, let $\mathbf{x}^{\mathbf{j}}=x_{1}^{j_{1}}\cdots x_{p}^{j_{p}}$, and for $a\in\mathbb{R}$, $a^{\mathbf{j}}$ means $a^{j_{1}}\cdots a^{j_{p}}=a^{|\mathbf{j}|}$. For convenience, we will use $\mathbf{j}\in\mathbf{S}_{d}$ and $|\mathbf{j}|=d$ interchangeably.

Using this multi-index notation, we rewrite the homogeneous polynomial kernel in $\mathbb{R}^{p}$ simply as

\[
K_{d}(\mathbf{x},\mathbf{u})=\sum_{|\mathbf{j}|=d}{d\choose\mathbf{j}}\mathbf{x}^{\mathbf{j}}\mathbf{u}^{\mathbf{j}}, \tag{11}
\]

which can be viewed as a multi-dimensional extension of the expression in (9). Further, we can derive the between-class and within-class variation functions similarly:

\begin{align*}
B_{K}(\mathbf{x},\mathbf{u}) &= \sum_{|\mathbf{i}|=d}\sum_{|\mathbf{j}|=d}{d\choose\mathbf{i}}{d\choose\mathbf{j}}\Delta_{\mathbf{i}}\Delta_{\mathbf{j}}\mathbf{x}^{\mathbf{i}}\mathbf{u}^{\mathbf{j}}\\
W_{K}(\mathbf{x},\mathbf{u}) &= \sum_{|\mathbf{i}|=d}\sum_{|\mathbf{j}|=d}{d\choose\mathbf{i}}{d\choose\mathbf{j}}W_{\mathbf{i},\mathbf{j}}\mathbf{x}^{\mathbf{i}}\mathbf{u}^{\mathbf{j}}
\end{align*}

with $\Delta_{\mathbf{i}}=\mathbb{E}_{1}[\mathbf{X}^{\mathbf{i}}]-\mathbb{E}_{2}[\mathbf{X}^{\mathbf{i}}]$ and $W_{\mathbf{i},\mathbf{j}}=\pi_{1}\text{Cov}_{1}[\mathbf{X}^{\mathbf{i}},\mathbf{X}^{\mathbf{j}}]+\pi_{2}\text{Cov}_{2}[\mathbf{X}^{\mathbf{i}},\mathbf{X}^{\mathbf{j}}]$ for $\mathbf{i},\mathbf{j}\in\mathbf{S}_{d}$. As an example, when the degree $d$ is 2 in $\mathbb{R}^{3}$, $\mathbf{S}_{2}=\{(2,0,0),(1,1,0),(1,0,1),(0,2,0),(0,1,1),(0,0,2)\}$. For $\mathbf{i}=(1,1,0)$ and $\mathbf{j}=(0,1,1)$, $\mathbf{X}^{\mathbf{i}}=X_{1}X_{2}$ and $\mathbf{X}^{\mathbf{j}}=X_{2}X_{3}$, and thus we have $\Delta_{\mathbf{i}}=\mathbb{E}_{1}[X_{1}X_{2}]-\mathbb{E}_{2}[X_{1}X_{2}]$ and $W_{\mathbf{i},\mathbf{j}}=\pi_{1}\text{Cov}_{1}[X_{1}X_{2},X_{2}X_{3}]+\pi_{2}\text{Cov}_{2}[X_{1}X_{2},X_{2}X_{3}]$. Due to the same structure, we can easily extend the between-class and within-class covariance operators.

To identify the population discriminant function in this setting, we define $\boldsymbol{\Delta}=(\Delta_{\mathbf{i}})_{\mathbf{i}\in\mathbf{S}_{d}}^{T}$ and $\mathbf{W}=[W_{\mathbf{i},\mathbf{j}}]_{\mathbf{i},\,\mathbf{j}\in\mathbf{S}_{d}}$ analogously. Letting $\nu_{\mathbf{j}}={d\choose\mathbf{j}}\int_{\mathbb{R}^{p}}\mathbf{u}^{\mathbf{j}}\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})$ given a kernel coefficient function $\alpha(\cdot)$, we solve the generalized eigenvalue problem in (10) for $\boldsymbol{\nu}=(\nu_{\mathbf{j}})^{T}_{\mathbf{j}\in\mathbf{S}_{d}}$, and determine the population-level discriminant function as

\[
f(\mathbf{x})=\sum_{|\mathbf{j}|=d}\nu_{\mathbf{j}}\mathbf{x}^{\mathbf{j}}.
\]

Note that the size of $\boldsymbol{\Delta}$ and $\mathbf{W}$ is $|\mathbf{S}_{d}|={d+p-1\choose d}$, and while the ordering of the indices in $\mathbf{S}_{d}$ does not matter, the elements in $\boldsymbol{\Delta}$ and $\mathbf{W}$ should be indexed consistently in the specification of the eigenvalue problem. The following theorem summarizes the results so far.

Theorem 3.1.

Suppose that for each class, the distribution of $\mathbf{X}\in\mathbb{R}^{p}$ has finite moments, $\mathbb{E}_{l}[\mathbf{X}^{\mathbf{i}}]$ and $\text{Cov}_{l}[\mathbf{X}^{\mathbf{i}},\mathbf{X}^{\mathbf{j}}]$ for all $\mathbf{i},\mathbf{j}\in\mathbf{S}_{d}$. For the homogeneous polynomial kernel of degree $d$, $K_{d}(\mathbf{x},\mathbf{u})=(\mathbf{x}^{T}\mathbf{u})^{d}$,

  • (i)

    The kernel discriminant function maximizing the ratio of the between-class variation relative to the within-class variation is of the form

    \[
    f_{d}(\mathbf{x})=\sum_{|\mathbf{j}|=d}\nu_{\mathbf{j}}\mathbf{x}^{\mathbf{j}}. \tag{12}
    \]

  • (ii)

    The coefficients, $\boldsymbol{\nu}=(\nu_{\mathbf{i}})^{T}_{\mathbf{i}\in\mathbf{S}_{d}}$, for the discriminant function satisfy the eigen-equation with $\lambda>0$:

    \[
    \boldsymbol{\Delta}\boldsymbol{\Delta}^{T}\boldsymbol{\nu}=\lambda\mathbf{W}\boldsymbol{\nu}, \tag{13}
    \]

    where $\boldsymbol{\Delta}=(\Delta_{\mathbf{i}})^{T}_{\mathbf{i}\in\mathbf{S}_{d}}$ is the vector of moment differences, $\Delta_{\mathbf{i}}=\mathbb{E}_{1}[\mathbf{X}^{\mathbf{i}}]-\mathbb{E}_{2}[\mathbf{X}^{\mathbf{i}}]$, and $\mathbf{W}=[W_{\mathbf{i},\mathbf{j}}]_{\mathbf{i},\,\mathbf{j}\in\mathbf{S}_{d}}$ is the matrix of pooled covariances, $W_{\mathbf{i},\mathbf{j}}=\pi_{1}\text{Cov}_{1}[\mathbf{X}^{\mathbf{i}},\mathbf{X}^{\mathbf{j}}]+\pi_{2}\text{Cov}_{2}[\mathbf{X}^{\mathbf{i}},\mathbf{X}^{\mathbf{j}}]$.


Alternatively, the discriminant function can be derived using an explicit feature map for the kernel. The expression of $K_{d}$ in (11) suggests $\boldsymbol{\phi}(\mathbf{x})=\left({d\choose\mathbf{j}}^{\frac{1}{2}}\mathbf{x}^{\mathbf{j}}\right)^{T}_{\mathbf{j}\in\mathbf{S}_{d}}$ as a feature vector, and it can be shown that a direct application of LDA to the between-class and within-class variance matrices of $\boldsymbol{\phi}(\mathbf{X})$ leads to the same kernel discriminant. This result indicates that employing homogeneous polynomial kernels in discriminant analysis has the same effect as using the polynomial features of the given degree in LDA.
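A quick check of this feature-map view is sketched below (Python with NumPy; the dimension, degree, and test points are arbitrary choices): enumerate $\mathbf{S}_{d}$, form the feature vector with entries ${d\choose\mathbf{j}}^{1/2}\mathbf{x}^{\mathbf{j}}$, and verify that the dot product of feature vectors reproduces $(\mathbf{x}^{T}\mathbf{u})^{d}$.

```python
import numpy as np
from math import factorial
from itertools import product

# Illustrative check of the multi-index feature map for the homogeneous kernel.
def multi_indices(p, d):
    # All p-tuples of non-negative integers summing to d, i.e., the set S_d.
    return [j for j in product(range(d + 1), repeat=p) if sum(j) == d]

def multinom(d, j):
    c = factorial(d)
    for jk in j:
        c //= factorial(jk)
    return c

def phi(x, d):
    S = multi_indices(len(x), d)
    return np.array([np.sqrt(multinom(d, j)) * np.prod(x**np.array(j)) for j in S])

rng = np.random.default_rng(4)
p, d = 3, 2
x, u = rng.normal(size=p), rng.normal(size=p)
print(len(multi_indices(p, d)))            # |S_d| = C(d + p - 1, d) = 6 here
print(phi(x, d) @ phi(u, d), (x @ u)**d)   # the two values agree
```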

3.1.3 Inhomogeneous Polynomial Kernel

The inhomogeneous polynomial kernel of degree $d$ in $\mathbb{R}^{p}$ can be expanded as

\[
\tilde{K}_{d}(\mathbf{x},\mathbf{u})=(1+\mathbf{x}^{T}\mathbf{u})^{d}=\sum_{m=0}^{d}{d\choose m}(\mathbf{x}^{T}\mathbf{u})^{m}=\sum_{m=0}^{d}{d\choose m}\sum_{|\mathbf{j}|=m}{m\choose\mathbf{j}}\mathbf{x}^{\mathbf{j}}\mathbf{u}^{\mathbf{j}},
\]

which is a sum of all homogeneous polynomial kernels of degree up to $d$. Since ${d\choose\mathbf{j}}={d\choose m}{m\choose\mathbf{j}}$ for $\mathbf{j}\in\mathbf{S}_{m}$, $m=0,\ldots,d$, and the term with $m=0$ is 1, we can rewrite the kernel as

\[
\tilde{K}_{d}(\mathbf{x},\mathbf{u})=1+\sum_{m=1}^{d}\sum_{|\mathbf{j}|=m}{d\choose\mathbf{j}}\mathbf{x}^{\mathbf{j}}\mathbf{u}^{\mathbf{j}}=1+\sum_{|\mathbf{j}|=1}^{d}{d\choose\mathbf{j}}\mathbf{x}^{\mathbf{j}}\mathbf{u}^{\mathbf{j}}.
\]

Note that $\sum_{m=1}^{d}\sum_{|\mathbf{j}|=m}$ is abbreviated as $\sum_{|\mathbf{j}|=1}^{d}$. This kernel has the same structure as the homogeneous polynomial kernel. Using this relation, we can find the population kernel discriminant function similarly. Recognizing that $\tilde{K}_{d}$ involves the expanded polynomial features in $p$ variables of total degree 0 to $d$, namely $1,\mathbf{x},(\mathbf{x}^{\mathbf{j}})_{|\mathbf{j}|=2},\ldots,(\mathbf{x}^{\mathbf{j}})_{|\mathbf{j}|=d}$, we define a vector of the mean differences of those features (excluding the constant 1) and a block matrix of their pooled covariances as follows:

\[
\tilde{\boldsymbol{\Delta}}=\left(\begin{array}{c}{\boldsymbol{\Delta}}_{1}\\ \vdots\\ {\boldsymbol{\Delta}}_{d}\end{array}\right),\quad\mbox{ and }\quad\tilde{\mathbf{W}}=\left[\begin{array}{ccc}{\mathbf{W}}_{1,1}&\ldots&{\mathbf{W}}_{1,d}\\ \vdots&\ddots&\vdots\\ {\mathbf{W}}_{d,1}&\ldots&{\mathbf{W}}_{d,d}\end{array}\right],
\]

where ${\boldsymbol{\Delta}}_{m}=(\Delta_{\mathbf{i}})^{T}_{\mathbf{i}\in\mathbf{S}_{m}}$ and ${\mathbf{W}}_{m,l}=[W_{\mathbf{i},\mathbf{j}}]_{\mathbf{i}\in\mathbf{S}_{m},\mathbf{j}\in\mathbf{S}_{l}}$ for all $m,l=1,\ldots,d$. That is, $\tilde{\boldsymbol{\Delta}}$ contains all the differences of the moments of degree 1 to $d$, and $\tilde{\mathbf{W}}$ contains the covariances between all the monomials of degree 1 to $d$. Thus, the size of the eigenvalue problem to solve becomes $\sum_{m=1}^{d}{m+p-1\choose m}={p+d\choose d}-1$. The following theorem states analogous results for the discriminant function with the inhomogeneous polynomial kernel; a numerical sketch follows the theorem.

Theorem 3.2.

Suppose that for each class, the distribution of $\mathbf{X}\in\mathbb{R}^{p}$ has finite moments, $\mathbb{E}_{l}\left[\mathbf{X}^{\mathbf{i}}\right]$ and $\text{Cov}_{l}\left[\mathbf{X}^{\mathbf{i}},\mathbf{X}^{\mathbf{j}}\right]$ for all $\mathbf{i}\in\mathbf{S}_{m}$, $\mathbf{j}\in\mathbf{S}_{l}$ and $m,l=1,\ldots,d$. For the inhomogeneous polynomial kernel of degree $d$, $\tilde{K}_{d}(\mathbf{x},\mathbf{u})=(1+\mathbf{x}^{T}\mathbf{u})^{d}$,

  • (i)

    The kernel discriminant function maximizing the ratio of the between-class variation relative to the within-class variation is of the form

    \[
    \tilde{f}_{d}(\mathbf{x})=\sum_{|\mathbf{j}|=1}^{d}\tilde{\nu}_{\mathbf{j}}\mathbf{x}^{\mathbf{j}}. \tag{14}
    \]

  • (ii)

    The coefficients, $\tilde{\boldsymbol{\nu}}=(\tilde{\nu}_{\mathbf{j}})_{1\leq|\mathbf{j}|\leq d}^{T}$, for the discriminant function satisfy the eigen-equation with $\lambda>0$:

    \[
    \tilde{\boldsymbol{\Delta}}\tilde{\boldsymbol{\Delta}}^{T}\tilde{\boldsymbol{\nu}}=\lambda\tilde{\mathbf{W}}\tilde{\boldsymbol{\nu}}. \tag{15}
    \]

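The numerical sketch referred to above follows; it mirrors Theorem 3.2 under assumed class distributions, mixing proportions, and a Monte Carlo sample size. It stacks all monomial features of total degree 1 to $d$, estimates $\tilde{\boldsymbol{\Delta}}$ and $\tilde{\mathbf{W}}$, and uses the closed form $\tilde{\boldsymbol{\nu}}\propto\tilde{\mathbf{W}}^{-1}\tilde{\boldsymbol{\Delta}}$ implied by the rank-one structure of (15).

```python
import numpy as np
from itertools import product

# Monte Carlo sketch of the inhomogeneous polynomial discriminant (Theorem 3.2).
rng = np.random.default_rng(5)
p, d = 2, 3
pi1 = pi2 = 0.5
X1 = rng.multivariate_normal([0.0, 0.0], np.eye(p), size=100_000)
X2 = 0.5 * rng.multivariate_normal([0.0, 0.0], np.eye(p), size=100_000) + np.array([0.5, 0.0])

def monomials_up_to(X, d):
    # Columns are X^j for all multi-indices j with 1 <= |j| <= d.
    idx = [j for j in product(range(d + 1), repeat=X.shape[1]) if 1 <= sum(j) <= d]
    return np.column_stack([np.prod(X**np.array(j), axis=1) for j in idx])

F1, F2 = monomials_up_to(X1, d), monomials_up_to(X2, d)
Delta = F1.mean(axis=0) - F2.mean(axis=0)     # the stacked moment differences
W = pi1 * np.cov(F1, rowvar=False) + pi2 * np.cov(F2, rowvar=False)
nu = np.linalg.solve(W, Delta)                # coefficients of f~_d in (14), up to scale
print(Delta @ nu)                             # the largest eigenvalue lambda
```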
3.2 Gaussian Discriminant

We extend the discriminant analysis with polynomial kernels in the previous section to the Gaussian kernel. For the extension, we use two representations for the Gaussian kernel: a deterministic representation in Section 3.2.1 and a randomized feature representation in Section 3.2.2.

3.2.1 Deterministic Representation of Gaussian Kernel

We have seen so far that the derivation of the population discriminant function with polynomial kernels is aided by their expansion, or equivalently, by their explicit feature maps. Taking a similar approach to the Gaussian kernel, we could use the Maclaurin series of $e^{x}$ to express it as

\[
K_{\omega}(\mathbf{x},\mathbf{u}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{u}\|^{2}}{2\omega^{2}}\right)=\sum_{|\mathbf{j}|=0}^{\infty}\phi_{\mathbf{j}}(\mathbf{x})\phi_{\mathbf{j}}(\mathbf{u}),
\]

with $\displaystyle\phi_{\mathbf{j}}(\mathbf{x})=\exp\left(-\frac{\|\mathbf{x}\|^{2}}{2\omega^{2}}\right)\frac{1}{\sqrt{\mathbf{j}!}}\frac{\mathbf{x}^{\mathbf{j}}}{\omega^{\mathbf{j}}}$. While the structure of $K_{\omega}$ in this representation would permit derivations similar to those for the polynomial discriminants, the result would depend on the expectations and covariances of $\phi_{\mathbf{j}}(\mathbf{X})$, which may not be easy to obtain analytically in general.

Alternatively, we consider a representation of the kernel in a form that allows a direct use of polynomial features in much the same way as the polynomial kernels. We start with the one-dimensional case and then extend it to the multi-dimensional case. The one-dimensional Gaussian kernel with bandwidth $\omega$ can be written as

\[
K_{\omega}(x,u)=\exp\left(-\frac{(x-u)^{2}}{2\omega^{2}}\right)=\exp\left(-\frac{x^{2}}{2\omega^{2}}\right)\sum_{m=0}^{\infty}He_{m}\left(\frac{x}{\omega}\right)\frac{u^{m}}{m!\ \omega^{m}}. \tag{16}
\]

Here $He_{m}(x)$ are referred to as the probabilist’s Hermite polynomials and are defined as

\[
He_{m}(x) = (-1)^{m}(\phi(x))^{-1}\frac{d^{m}}{dx^{m}}\phi(x),
\]

where $\phi$ is the density function of the standard normal distribution. The representation of $K_{\omega}$ in (16) comes from the Hermite polynomial generating function:

\[
\exp\left(xu-\frac{1}{2}u^{2}\right)=\sum_{m=0}^{\infty}He_{m}(x)\frac{u^{m}}{m!}. \tag{17}
\]
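For readers who wish to verify (17) numerically, the short sketch below (Python; the evaluation points and truncation order are arbitrary) computes $He_{m}$ through the standard three-term recurrence and compares the truncated series with the left-hand side.

```python
import math
import numpy as np

def hermite_He(m_max, x):
    # Probabilist's Hermite polynomials He_0(x), ..., He_{m_max}(x) via the
    # recurrence He_0 = 1, He_1 = x, He_{m+1}(x) = x He_m(x) - m He_{m-1}(x).
    H = np.zeros(m_max + 1)
    H[0] = 1.0
    if m_max >= 1:
        H[1] = x
    for m in range(1, m_max):
        H[m + 1] = x * H[m] - m * H[m - 1]
    return H

x, u = 1.3, 0.7
m_max = 40
H = hermite_He(m_max, x)
series = sum(H[m] * u**m / math.factorial(m) for m in range(m_max + 1))
print(series, math.exp(x * u - 0.5 * u**2))   # both sides of (17) agree closely
```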

It can be extended to a multivariate case using the vector-valued Hermite polynomials introduced in Holmquist (1996).

For $\mathbf{x}\in\mathbb{R}^{p}$ and $m\in\mathbb{N}$, the $p$-variate vector-valued Hermite polynomial of order $m$ is defined as

\[
\mathbf{H}_{m}(\mathbf{x})=(-1)^{m}(\Phi(\mathbf{x}))^{-1}\partial_{\mathbf{x}}^{\langle m\rangle}\Phi(\mathbf{x}),
\]

where $\partial_{\mathbf{x}}^{\langle m\rangle}=\partial_{\mathbf{x}}\otimes\partial_{\mathbf{x}}\otimes\cdots\otimes\partial_{\mathbf{x}}$ ($m$ times) is the $m$-fold Kronecker product of the differential operator $\partial_{\mathbf{x}}=(\frac{\partial}{\partial x_{1}},\ldots,\frac{\partial}{\partial x_{p}})^{T}$ and $\Phi$ is the product of $p$ univariate standard normal densities. Thus the components of $\mathbf{H}_{m}(\mathbf{x})$ are products of univariate Hermite polynomials whose total degree is $m$: $\mathbf{H}_{m}(\mathbf{x})=(H_{\mathbf{j}}(\mathbf{x}))^{T}_{\mathbf{j}\in\mathbf{S}_{m}}$, where $H_{\mathbf{j}}(\mathbf{x})=He_{j_{1}}(x_{1})\cdots He_{j_{p}}(x_{p})$ for each $\mathbf{j}\in\mathbf{S}_{m}$.

Using this notation, a multivariate version of the generating function (17) can be written as

\[
\exp\left(\mathbf{x}^{T}\mathbf{u}-\frac{1}{2}\mathbf{u}^{T}\mathbf{u}\right)=\sum_{m=0}^{\infty}\frac{1}{m!}\langle\mathbf{H}_{m}(\mathbf{x}),\mathbf{u}^{\langle m\rangle}\rangle,
\]

where $\mathbf{u}^{\langle m\rangle}=\mathbf{u}\otimes\mathbf{u}\otimes\cdots\otimes\mathbf{u}$ ($m$ times) and

\[
\langle\mathbf{H}_{m}(\mathbf{x}),\mathbf{u}^{\langle m\rangle}\rangle=\sum_{\mathbf{j}\in\mathbf{S}_{m}}{m\choose\mathbf{j}}He_{j_{1}}(x_{1})\cdots He_{j_{p}}(x_{p})u_{1}^{j_{1}}\cdots u_{p}^{j_{p}}=\sum_{\mathbf{j}\in\mathbf{S}_{m}}{m\choose\mathbf{j}}H_{\mathbf{j}}(\mathbf{x})\mathbf{u}^{\mathbf{j}}.
\]

Using the generating function for $\mathbf{H}_{m}$ and letting $\mathbf{x}_{\omega}=(1/\omega)\mathbf{x}$ for bandwidth $\omega$, we get the following expansion of the multivariate Gaussian kernel:

\begin{align*}
K_{\omega}(\mathbf{x},\mathbf{u}) &= \exp\left(-\frac{\lVert\mathbf{x}-\mathbf{u}\rVert^{2}}{2\omega^{2}}\right)=\exp\left(-\frac{\lVert\mathbf{x}_{\omega}\rVert^{2}}{2}\right)\exp\left(\mathbf{x}_{\omega}^{T}\mathbf{u}_{\omega}-\frac{1}{2}\mathbf{u}^{T}_{\omega}\mathbf{u}_{\omega}\right)\\
&= \exp\left(-\frac{\lVert\mathbf{x}_{\omega}\rVert^{2}}{2}\right)\sum_{|\mathbf{j}|=0}^{\infty}\frac{1}{\mathbf{j}!}H_{\mathbf{j}}(\mathbf{x}_{\omega})\mathbf{u}_{\omega}^{\mathbf{j}}.
\end{align*}

Further, with the definition $\displaystyle\tilde{H}_{\mathbf{j}}(\mathbf{x}_{\omega})=\frac{1}{\mathbf{j}!\,\omega^{\mathbf{j}}}\exp\left(-\frac{\lVert\mathbf{x}_{\omega}\rVert^{2}}{2}\right)H_{\mathbf{j}}(\mathbf{x}_{\omega})$, the kernel is represented as

\[
K_{\omega}(\mathbf{x},\mathbf{u})=\sum_{|\mathbf{j}|=0}^{\infty}\tilde{H}_{\mathbf{j}}(\mathbf{x}_{\omega})\mathbf{u}^{\mathbf{j}}. \tag{18}
\]

Although this representation is asymmetric in $\mathbf{x}$ and $\mathbf{u}$, it facilitates derivations of the generalized eigenvalue problem and the population kernel discriminant similar to those for the polynomial kernels, but now using the full collection of polynomial features.

With this representation, it is easy to show that

\begin{align*}
\mathbb{E}_{1}\left[K_{\omega}(\mathbf{x},\mathbf{X})\right]-\mathbb{E}_{2}[K_{\omega}(\mathbf{x},\mathbf{X})] &= \sum_{|\mathbf{j}|=0}^{\infty}\tilde{H}_{\mathbf{j}}(\mathbf{x}_{\omega})\left\{\mathbb{E}_{1}[\mathbf{X}^{\mathbf{j}}]-\mathbb{E}_{2}[\mathbf{X}^{\mathbf{j}}]\right\}\\
&= \sum_{|\mathbf{j}|=0}^{\infty}\tilde{H}_{\mathbf{j}}(\mathbf{x}_{\omega})\Delta_{\mathbf{j}}=\sum_{|\mathbf{j}|=1}^{\infty}\tilde{H}_{\mathbf{j}}(\mathbf{x}_{\omega})\Delta_{\mathbf{j}},
\end{align*}

which involves the moments of the distribution rather than the expectations of $\tilde{H}_{\mathbf{j}}(\mathbf{X}_{\omega})$. Note that the last equality is due to $\Delta_{\mathbf{0}}=0$ for $\mathbf{X}^{\mathbf{0}}=1$. Thus the between-class variation function is given as

\[
B_{K}(\mathbf{x},\mathbf{u}) = \sum_{|\mathbf{i}|=1}^{\infty}\sum_{|\mathbf{j}|=1}^{\infty}\Delta_{\mathbf{i}}\Delta_{\mathbf{j}}\tilde{H}_{\mathbf{i}}(\mathbf{x}_{\omega})\tilde{H}_{\mathbf{j}}(\mathbf{u}_{\omega}).
\]

Similarly, the within-class variation function is given as

\[
W_{K}(\mathbf{x},\mathbf{u}) = \sum_{|\mathbf{i}|=1}^{\infty}\sum_{|\mathbf{j}|=1}^{\infty}W_{\mathbf{i},\mathbf{j}}\tilde{H}_{\mathbf{i}}(\mathbf{x}_{\omega})\tilde{H}_{\mathbf{j}}(\mathbf{u}_{\omega}).
\]

Therefore, the eigenvalue problem in (7) with the Gaussian kernel is given by

|𝐢|=1|𝐣|=1Δ𝐢Δ𝐣ν𝐣H~𝐢(𝐱ω)=λ|𝐢|=1|𝐣|=1𝐖𝐢,𝐣ν𝐣H~𝐢(𝐱ω),\sum_{|\mathbf{i}|=1}^{\infty}\sum_{|\mathbf{j}|=1}^{\infty}\Delta_{\mathbf{i}}\Delta_{\mathbf{j}}\nu_{\mathbf{j}}\tilde{H}_{\mathbf{i}}(\mathbf{x}_{\omega})=\lambda\sum_{|\mathbf{i}|=1}^{\infty}\sum_{|\mathbf{j}|=1}^{\infty}\mathbf{W}_{\mathbf{i},\mathbf{j}}\nu_{\mathbf{j}}\tilde{H}_{\mathbf{i}}(\mathbf{x}_{\omega}), (19)

where $\nu_{\mathbf{j}}=\int_{\mathcal{X}}\tilde{H}_{\mathbf{j}}(\mathbf{u}_{\omega})\,\alpha(\mathbf{u})\,d\mathbb{P}(\mathbf{u})$.

To find $\nu_{\mathbf{j}}$ satisfying (19) for every $\mathbf{x}_{\omega}$, the coefficients of $\tilde{H}_{\mathbf{i}}(\mathbf{x}_{\omega})$ on both sides must be equal for all $\mathbf{i}\in\mathbf{S}_{m}$, $m\in\mathbb{N}$. This entails the following infinite system of linear equations for $\nu_{\mathbf{j}}$:

\Delta_{\mathbf{i}}\sum_{|\mathbf{j}|=1}^{\infty}\Delta_{\mathbf{j}}\nu_{\mathbf{j}} = \lambda\sum_{|\mathbf{j}|=1}^{\infty}W_{\mathbf{i},\mathbf{j}}\nu_{\mathbf{j}},\quad \mathbf{i}\in\mathbf{S}_{m},\ m\in\mathbb{N},   (20)

and the resulting discriminant function is of the form $f(\mathbf{x})=\sum_{|\mathbf{j}|=1}^{\infty}\nu_{\mathbf{j}}\mathbf{x}^{\mathbf{j}}$.

For a finite-dimensional approximation of the population discriminant function, we may consider truncation of the kernel representation in (18) at $|\mathbf{j}|=N$:

K_{N}(\mathbf{x},\mathbf{u}) = \sum_{|\mathbf{j}|=0}^{N}\tilde{H}_{\mathbf{j}}(\mathbf{x}_{\omega})\,\mathbf{u}^{\mathbf{j}}.

This truncation carries over to the system of linear equations for the generalized eigenvalue problem in (20). As a result, the eigenvalue equation coincides with that for the inhomogeneous polynomial kernel of degree $N$ in Theorem 3.2, and so does the truncated discriminant function. As more polynomial features are added, that is, as $N$ increases, the largest eigenvalue satisfying equation (15) increases. Adding the subscript $N$ to $\lambda$, $\tilde{\boldsymbol{\Delta}}$ and $\tilde{\mathbf{W}}$ to indicate the degree clearly, let $\lambda_{N}=\max_{\boldsymbol{\nu}}\frac{\boldsymbol{\nu}^{T}\tilde{\boldsymbol{\Delta}}_{N}\tilde{\boldsymbol{\Delta}}_{N}^{T}\boldsymbol{\nu}}{\boldsymbol{\nu}^{T}\tilde{\mathbf{W}}_{N}\boldsymbol{\nu}}$. The moment difference vector $\tilde{\boldsymbol{\Delta}}_{N}$ and the within-class covariance matrix $\tilde{\mathbf{W}}_{N}$ expand with $N$, including all the elements up to degree $N$. This nesting structure produces an increasing sequence of $\lambda_{N}$ because maximizing the ratio for degree $N$ amounts to maximizing it for degree $N+1$ over a restricted space of $\boldsymbol{\nu}$. In Section 4.1, we will study the relation between polynomial and Gaussian discriminants numerically under various scenarios and discuss the effect of $N$ on the quality of the discriminant function.
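
The monotonicity of $\lambda_{N}$ can be illustrated with a small sample-based sketch. Below, the moment differences and within-class covariances are estimated from simulated data (two normal classes with different means and a common identity covariance, chosen by us purely for illustration), and the fact that $\tilde{\boldsymbol{\Delta}}_{N}\tilde{\boldsymbol{\Delta}}_{N}^{T}$ has rank one is used to compute the largest generalized eigenvalue as $\tilde{\boldsymbol{\Delta}}_{N}^{T}\tilde{\mathbf{W}}_{N}^{-1}\tilde{\boldsymbol{\Delta}}_{N}$.

```python
import numpy as np
from itertools import combinations_with_replacement

def monomials(X, N):
    """All monomial features x^j with 1 <= |j| <= N for the rows of X."""
    cols = []
    for deg in range(1, N + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), deg):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

rng = np.random.default_rng(0)
n = 5000                                                   # samples per class, equal priors
X1 = rng.multivariate_normal([0.6, 0.9], np.eye(2), n)
X2 = rng.multivariate_normal([-1.0, -1.2], np.eye(2), n)

for N in range(1, 6):
    F1, F2 = monomials(X1, N), monomials(X2, N)
    delta = F1.mean(axis=0) - F2.mean(axis=0)              # estimated moment differences
    W = 0.5 * np.cov(F1, rowvar=False) + 0.5 * np.cov(F2, rowvar=False)
    lam = delta @ np.linalg.solve(W, delta)                # largest generalized eigenvalue
    print(N, round(lam, 3))                                # nondecreasing in N (up to noise)
```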

3.2.2 Fourier Feature Representation of the Gaussian Kernel

In addition to the polynomial approximation presented in the previous section, a stochastic approximation to the Gaussian kernel can be used for population analysis. Rahimi and Recht (2008a) examined approximation of shift-invariant kernels in general using random Fourier features for fast large-scale optimization with kernels. They proposed the following representation for the Gaussian kernel using random features of the form $z_{\mathbf{w}}(\mathbf{x})=(\cos(\mathbf{w}^{T}\mathbf{x}),\sin(\mathbf{w}^{T}\mathbf{x}))^{T}$:

K_{\omega}(\mathbf{x},\mathbf{u})=\mathbb{E}_{\mathbf{w}}\left[z_{\mathbf{w}}(\mathbf{x})^{T}z_{\mathbf{w}}(\mathbf{u})\right],   (21)

where $\mathbf{w}$ is a random vector from a multivariate normal distribution with mean zero and covariance matrix $\frac{1}{\omega^{2}}I_{p}$. This representation comes from Bochner's theorem (Rudin 2017), which describes the correspondence between a positive definite shift-invariant kernel and the Fourier transform of a nonnegative measure. The feature map $z_{\mathbf{w}}(\cdot)$ first projects $\mathbf{x}$ onto a random direction $\mathbf{w}$ and then takes sinusoidal transforms whose frequency depends on the norm of $\mathbf{w}$. A large bandwidth $\omega$ for the Gaussian kernel implies realizations of $\mathbf{w}$ with small norms on average, which generally entails low-frequency sinusoids.

The representation in (21) suggests a Monte Carlo approximation of the kernel. Suppose that $\mathbf{w}_{i}$, $i=1,\ldots,D$, are randomly generated from $N_{p}(\mathbf{0},\frac{1}{\omega^{2}}I_{p})$. Defining random Fourier features $z_{\mathbf{w}}(\mathbf{x})$ with $\mathbf{w}=\mathbf{w}_{i}$, we can approximate the Gaussian kernel using a sample average as follows:

K_{\omega}(\mathbf{x},\mathbf{u})=\exp\left(-\frac{\|\mathbf{x}-\mathbf{u}\|^{2}}{2\omega^{2}}\right)\approx\frac{1}{D}\sum_{i=1}^{D}z_{\mathbf{w}_{i}}(\mathbf{x})^{T}z_{\mathbf{w}_{i}}(\mathbf{u}).

This average can be taken as an unbiased estimate of the kernel, and its precision is controlled by $D$. Concatenating these $D$ random components $z_{\mathbf{w}_{i}}(\mathbf{x})$, we can also see that the stochastic approximation above amounts to defining

\underaccent{\tilde}{\mathbf{Z}}_{D}(\mathbf{x})=\frac{1}{\sqrt{D}}\left(z_{\mathbf{w}_{1}}(\mathbf{x})^{T},\ldots,z_{\mathbf{w}_{D}}(\mathbf{x})^{T}\right)^{T}

as a randomized feature map for the kernel.
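
A minimal sketch of this Monte Carlo approximation (all names are ours) draws $D$ random directions and compares the averaged feature inner product with the exact kernel value.

```python
import numpy as np

rng = np.random.default_rng(1)
p, D, omega = 2, 2000, 1.0
Wmat = rng.normal(scale=1.0 / omega, size=(D, p))   # rows w_i ~ N_p(0, I_p / omega^2)

def Z(X):
    """Randomized feature map: stacked (cos, sin) features scaled by 1/sqrt(D)."""
    proj = X @ Wmat.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

x = np.array([[0.3, -0.5]])
u = np.array([[1.0, 0.2]])
exact = np.exp(-np.sum((x - u) ** 2) / (2 * omega**2))
approx = (Z(x) @ Z(u).T).item()
print(exact, approx)                                # the error shrinks at rate O(1/sqrt(D))
```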

Using the random Fourier features, we approximate the between-class variation function $B_{K}(\mathbf{x},\mathbf{u})$ and the within-class variation function $W_{K}(\mathbf{x},\mathbf{u})$ as follows:

B_{K}(\mathbf{x},\mathbf{u}) \approx \frac{1}{D^{2}}\sum_{i=1}^{D}\sum_{j=1}^{D}z_{\mathbf{w}_{i}}(\mathbf{x})^{T}\Delta_{\mathbf{w}_{i}}\Delta_{\mathbf{w}_{j}}^{T}z_{\mathbf{w}_{j}}(\mathbf{u})
W_{K}(\mathbf{x},\mathbf{u}) \approx \frac{1}{D^{2}}\sum_{i=1}^{D}\sum_{j=1}^{D}z_{\mathbf{w}_{i}}(\mathbf{x})^{T}W_{\mathbf{w}_{i},\mathbf{w}_{j}}z_{\mathbf{w}_{j}}(\mathbf{u}),

where $\Delta_{\mathbf{w}_{i}}=\mathbb{E}_{1}\left[z_{\mathbf{w}_{i}}(\mathbf{X})\right]-\mathbb{E}_{2}\left[z_{\mathbf{w}_{i}}(\mathbf{X})\right]$ and $W_{\mathbf{w}_{i},\mathbf{w}_{j}}=\pi_{1}\text{Cov}_{1}\left[z_{\mathbf{w}_{i}}(\mathbf{X}),z_{\mathbf{w}_{j}}(\mathbf{X})\right]+\pi_{2}\text{Cov}_{2}\left[z_{\mathbf{w}_{i}}(\mathbf{X}),z_{\mathbf{w}_{j}}(\mathbf{X})\right]$. Then we can define a randomized version of the eigenvalue problem in (7) with these approximations. Let $\hat{\alpha}(\cdot)$ denote the solution to the problem with $\lambda>0$ and define $\boldsymbol{\nu}_{i}=\int z_{\mathbf{w}_{i}}(\mathbf{u})\hat{\alpha}(\mathbf{u})\,d\mathbb{P}(\mathbf{u})$. Arguments similar to those for the deterministic representation lead to the following generalized eigenvalue problem for $\boldsymbol{\nu}=(\boldsymbol{\nu}_{i}^{T})^{T}$:

\hat{\boldsymbol{\Delta}}\hat{\boldsymbol{\Delta}}^{T}\boldsymbol{\nu}=\lambda\hat{\mathbf{W}}\boldsymbol{\nu},

where $\hat{\boldsymbol{\Delta}}=(\Delta_{\mathbf{w}_{i}}^{T})^{T}$ and $\hat{\mathbf{W}}=[W_{\mathbf{w}_{i},\mathbf{w}_{j}}]$ for $i,j=1,\ldots,D$. Given $\boldsymbol{\nu}$, the approximate Gaussian discriminant obtained via random Fourier features is

f_{D}(\mathbf{x})=\frac{1}{D}\sum_{i=1}^{D}\boldsymbol{\nu}_{i}^{T}z_{\mathbf{w}_{i}}(\mathbf{x}).   (22)
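
When $\Delta_{\mathbf{w}_{i}}$ and $W_{\mathbf{w}_{i},\mathbf{w}_{j}}$ are replaced by sample estimates, this becomes an ordinary generalized eigenvalue problem of size $2D$. The sketch below (our own toy setting with two normal classes having equal means and different diagonal covariances; a small ridge term is added for numerical stability) solves it with scipy.linalg.eigh and forms the discriminant in (22).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
p, D, omega, n = 2, 20, 1.0, 5000
Wdir = rng.normal(scale=1.0 / omega, size=(D, p))       # random directions w_i

def z(X):
    proj = X @ Wdir.T
    return np.hstack([np.cos(proj), np.sin(proj)])      # 2D-dimensional feature vector

# toy two-class data: equal means, different diagonal covariances
X1 = rng.multivariate_normal([0, 0], np.diag([2.0, 0.2]), n)
X2 = rng.multivariate_normal([0, 0], np.diag([0.2, 2.0]), n)
Z1, Z2 = z(X1), z(X2)

delta_hat = Z1.mean(axis=0) - Z2.mean(axis=0)
W_hat = 0.5 * np.cov(Z1, rowvar=False) + 0.5 * np.cov(Z2, rowvar=False)

# generalized eigenvalue problem: (delta delta^T) nu = lambda W nu
lam, vecs = eigh(np.outer(delta_hat, delta_hat), W_hat + 1e-8 * np.eye(2 * D))
nu = vecs[:, -1]                                        # eigenvector of the largest eigenvalue

def f_D(X):
    """Approximate Gaussian discriminant built from the random Fourier features."""
    return z(X) @ nu / D

print(lam[-1], f_D(X1).mean() - f_D(X2).mean())         # class-mean gap (sign is arbitrary)
```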

Rather than sine and cosine pairs, we could also use phase-shifted cosine features only to approximate the Gaussian kernel as suggested in Rahimi and Recht (2008a) and Rahimi and Recht (2008b). Let $z_{\mathbf{w},b}(\mathbf{x})=\sqrt{2}\cos(\mathbf{w}^{T}\mathbf{x}+b)$ with an additional phase parameter $b$ which is independent of $\mathbf{w}$ and distributed uniformly on $(0,2\pi)$. Then using a trigonometric identity, we can verify that

K_{\omega}(\mathbf{x},\mathbf{u})=\mathbb{E}_{\mathbf{w},b}\left[z_{\mathbf{w},b}(\mathbf{x})z_{\mathbf{w},b}(\mathbf{u})\right]=\mathbb{E}_{\mathbf{w},b}\left[2\cos(\mathbf{w}^{T}\mathbf{x}+b)\cos(\mathbf{w}^{T}\mathbf{u}+b)\right].

Given $\mathbf{w}$ and $b$, if $\mathbf{X}$ is distributed as $N_{p}(\boldsymbol{\mu},\Sigma)$, we can show that

\mathbb{E}_{\mathbf{X}}\left[\cos(\mathbf{w}^{T}\mathbf{X}+b)\right]=\exp\left(-\frac{1}{2}\mathbf{w}^{T}\Sigma\mathbf{w}\right)\cos(\mathbf{w}^{T}\boldsymbol{\mu}+b).

Thus in the classical LDA setting of $\mathbb{P}_{j}=N(\boldsymbol{\mu}_{j},\Sigma)$ for $j=1,2$, this Fourier feature lets us focus on the difference in $\cos(\mathbf{w}^{T}\boldsymbol{\mu}_{j}+b)$ rather than $\boldsymbol{\mu}_{j}$.
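
A quick Monte Carlo check of the expectation identity above, with illustrative values of $\boldsymbol{\mu}$, $\Sigma$, $\mathbf{w}$, and $b$ chosen by us:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([0.5, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
w, b = np.array([0.7, -0.2]), 1.1

X = rng.multivariate_normal(mu, Sigma, 500000)
monte_carlo = np.cos(X @ w + b).mean()
closed_form = np.exp(-0.5 * w @ Sigma @ w) * np.cos(w @ mu + b)
print(monte_carlo, closed_form)          # the two values agree up to Monte Carlo error
```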

4 Numerical Studies

This section illustrates the relation between the data distribution and kernel discriminants discussed so far through simulation studies and an application to real data.

4.1 Simulation Study

We numerically study the population discriminant functions in (12), (14), and (22) with both polynomial and Gaussian kernels, and examine their relationship with the underlying data distributions for two classes. For illustration, we consider two scenarios where each class follows a bivariate normal distribution. In Scenario 1, the two classes have different means ($\boldsymbol{\mu}_{1}=(0.6,0.9)^{T}$ and $\boldsymbol{\mu}_{2}=(-1.0,-1.2)^{T}$) but the same covariance ($\Sigma_{1}=\Sigma_{2}=I_{2}$), and in Scenario 2, they have the same mean ($\boldsymbol{\mu}_{1}=\boldsymbol{\mu}_{2}=\mathbf{0}$) but different covariances ($\Sigma_{1}=\mbox{diag}(2,0.2)$ and $\Sigma_{2}=\mbox{diag}(0.2,2)$). Figure 1 shows scatter plots of samples generated from each scenario with 400 data points in each class (red: class 1 and blue: class 2) under the assumption that the two classes are equally likely.

Figure 1: Scatterplots of the samples simulated from a mixture of two normal distributions with contours of the probability densities for each class overlaid in two settings: (a) Scenario 1 and (b) Scenario 2.

4.1.1 Polynomial Kernel

Under each scenario, we find the population discriminant functions in (12) and (14) with polynomial kernels of degree 1 to 4 and examine the effect of the degree on the discriminants. To determine $f_{d}(\mathbf{x})$, we first obtain the population moment differences $\boldsymbol{\Delta}$ and covariances $\mathbf{W}$ explicitly and solve the eigenvalue problem in (13). Similarly, we determine $\tilde{f}_{d}(\mathbf{x})$ with $\tilde{\boldsymbol{\Delta}}$ and $\tilde{\mathbf{W}}$. Tables 1 and 2 present the coefficients for the polynomial discriminants $f_{d}(\mathbf{x})$ and $\tilde{f}_{d}(\mathbf{x})$ in each scenario, which are the solution $\boldsymbol{\nu}$ or $\tilde{\boldsymbol{\nu}}$ (eigenvector) normalized to unit length.

Table 1: Coefficients for the population polynomial discriminants under Scenario 1.
                 Homogeneous polynomial                    Inhomogeneous polynomial
Term         f_1(x)    f_2(x)    f_3(x)    f_4(x)    f~_1(x)   f~_2(x)   f~_3(x)   f~_4(x)
x_1          0.6060    -         -         -         0.6060    0.6060    0.6033    0.6033
x_2          0.7954    -         -         -         0.7954    0.7954    0.7919    0.7919
x_1^2        -         -0.4461   -         -         -         0.0000    -0.0141   -0.0141
x_1 x_2      -         -0.8376   -         -         -         0.0000    -0.0369   -0.0369
x_2^2        -         -0.3154   -         -         -         0.0000    -0.0242   -0.0242
x_1^3        -         -         0.6412    -         -         -         -0.0118   -0.0118
x_1^2 x_2    -         -         0.3105    -         -         -         -0.0465   -0.0465
x_1 x_2^2    -         -         -0.2277   -         -         -         -0.0610   -0.0610
x_2^3        -         -         0.6637    -         -         -         -0.0267   -0.0267
x_1^4        -         -         -         -0.2575   -         -         -         0.0000
x_1^3 x_2    -         -         -         -0.6186   -         -         -         0.0000
x_1^2 x_2^2  -         -         -         0.3860    -         -         -         0.0000
x_1 x_2^3    -         -         -         -0.6146   -         -         -         0.0000
x_2^4        -         -         -         -0.1563   -         -         -         0.0000

Scenario 1: Fisher's linear discriminant analysis is optimal in this scenario. Since the common covariance matrix is $I_{2}$, the linear discriminant is simply determined by the direction of the mean difference, $\boldsymbol{\mu}_{1}-\boldsymbol{\mu}_{2}=(1.6,2.1)^{T}$. This gives $f^{*}(\mathbf{x})=1.6x_{1}+2.1x_{2}$ as an optimal linear discriminant defined up to a multiplicative constant. From Table 1, we first notice that the coefficient vector for the population linear discriminant $f_{1}(\mathbf{x})$, $\boldsymbol{\nu}=(0.6060,0.7954)^{T}$, is the normalized mean difference. Further, we observe that the coefficients for the discriminants with inhomogeneous polynomial kernels, $\tilde{f}_{1}(\mathbf{x})$ and $\tilde{f}_{2}(\mathbf{x})$, are also proportional to the mean difference.
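
The normalization can be checked in one line:

```python
import numpy as np

d = np.array([1.6, 2.1])          # mean difference mu_1 - mu_2
print(d / np.linalg.norm(d))      # [0.6060..., 0.7954...], matching Table 1
```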

Figures 2 and 3 display the polynomial discriminants identified in Table 1. The first row of Figure 2 shows contours of the population discriminants with homogeneous polynomial kernels. High to low discriminant scores correspond to red to blue contours. The black dashed line is $1.6x_{1}+2.1x_{2}=0.635$, which is the classification boundary from Fisher's linear discriminant analysis. The second row of Figure 2 presents the corresponding sample embeddings obtained by applying kernel discriminant analysis to the given samples. Figure 3 shows contours of both versions with inhomogeneous polynomial kernels of degree 2 to 4, omitting degree 1 as they are identical to those with the linear kernel in Figure 2.

Figure 2: Contours of the population discriminant functions with homogeneous polynomial kernels of degree 1 to 4 (upper panels from left to right) and their corresponding sample counterparts (lower panels) under Scenario 1. The black dashed lines are the optimal classification boundary.
Figure 3: Contours of the population discriminant functions with inhomogeneous polynomial kernels of degree 2 to 4 (upper panels from left to right) and their corresponding sample counterparts (lower panels) under Scenario 1. The black dashed lines are the optimal classification boundary.

The population discriminants and their sample versions are similar in terms of shape and direction of change in contours. With odd-degree homogeneous polynomial kernels, we observe that the contours change in the direction of the mean difference, indicating that odd degrees are effective in this setting. The even-degree discriminants, however, have a hyperbolic paraboloid shape, varying in a way that masks the class difference completely. By contrast, the degree does not affect the major direction of change in the population discriminants with inhomogeneous polynomial kernels; their variation seems to occur only in the direction of the mean difference. Table 1 confirms that the resulting discriminants $\tilde{f}_{d}(\mathbf{x})$ are identical for degrees $d=2k-1$ and $2k$, $k=1,2$.

Table 2: Coefficients for the population polynomial discriminants under Scenario 2.
                 Homogeneous polynomial                    Inhomogeneous polynomial
Term         f_1(x)    f_2(x)    f_3(x)    f_4(x)    f~_1(x)   f~_2(x)   f~_3(x)   f~_4(x)
x_1          0.00      -         -         -         0.00      0.0000    0.0000    0.0000
x_2          0.00      -         -         -         0.00      0.0000    0.0000    0.0000
x_1^2        -         0.7071    -         -         -         0.7071    0.7071    0.7063
x_1 x_2      -         0.0000    -         -         -         0.0000    0.0000    0.0000
x_2^2        -         -0.7071   -         -         -         -0.7071   -0.7071   -0.7063
x_1^3        -         -         0.0000    -         -         -         0.0000    0.0000
x_1^2 x_2    -         -         0.0000    -         -         -         0.0000    0.0000
x_1 x_2^2    -         -         0.0000    -         -         -         0.0000    0.0000
x_2^3        -         -         0.0000    -         -         -         0.0000    0.0000
x_1^4        -         -         -         0.7071    -         -         -         -0.0335
x_1^3 x_2    -         -         -         0.0000    -         -         -         0.0000
x_1^2 x_2^2  -         -         -         0.0000    -         -         -         0.0000
x_1 x_2^3    -         -         -         0.0000    -         -         -         0.0000
x_2^4        -         -         -         -0.7071   -         -         -         0.0335

Scenario 2: In this scenario, the two covariance matrices have equal determinants, so with equal priors the log ratio of the class densities reduces to $\frac{1}{2}\mathbf{x}^{T}(\Sigma_{2}^{-1}-\Sigma_{1}^{-1})\mathbf{x}=2.25(x_{1}^{2}-x_{2}^{2})$; hence the optimal decision boundary is $(x_{1}+x_{2})(x_{1}-x_{2})=0$, and the optimal discriminant function is $f^{*}(\mathbf{x})=x_{1}^{2}-x_{2}^{2}$, which is a homogeneous polynomial of degree 2. In contrast with Scenario 1, even-degree features are discriminative in this setting. Note that the coefficients of $f_{2}(\mathbf{x})$, $\tilde{f}_{2}(\mathbf{x})$ and $\tilde{f}_{3}(\mathbf{x})$ in Table 2 are proportional to those of $f^{*}(\mathbf{x})$. Odd-degree homogeneous polynomials produce a degenerate discriminant in this setting. The quadratic discriminant, $f_{2}(\mathbf{x})=0.7071x_{1}^{2}-0.7071x_{2}^{2}$, is a normalized version of $f^{*}(\mathbf{x})$. With the degree-4 homogeneous polynomial kernel, we have $f_{4}(\mathbf{x})=0.7071x_{1}^{4}-0.7071x_{2}^{4}$, which has the optimal discriminant as a factor. Contours of these polynomial discriminants are displayed in the first row of Figure 4. The black dashed lines are the optimal decision boundaries. The second row of Figure 4 presents the corresponding nonlinear kernel embeddings of degree 1 to 4 induced by the samples. Figure 5 shows contours of both versions (theoretical in the first row and empirical in the second row) with inhomogeneous polynomial kernels of degree 2 to 4, omitting the degenerate linear case in Table 2.

As in Scenario 1, the population discriminant functions and their sample counterparts in Figures 4 and 5 are similar in terms of shape and direction of change in contours. The contours of the population quadratic and quartic discriminants in Figure 4 show symmetry along each variable axis. Quadratic features contain all information necessary for discrimination in this scenario. Even-degree features successfully discriminate the two classes while odd-degree features completely fail, as shown in Figure 4. Nonlinear inhomogeneous polynomial kernels with even-degree features enable proper classification, as illustrated in Figure 5. Inhomogeneous polynomial kernels of degree $2k+1$ and $2k$ produce identical discriminants in this setting.

Figure 4: Contours of the population discriminant functions with homogeneous polynomial kernels of degree 1 to 4 (upper panels) and their sample counterparts (lower panels) under Scenario 2. The black dashed lines are the optimal classification boundaries.
Figure 5: Contours of the population discriminant functions with inhomogeneous polynomial kernels of degree 2 to 4 (upper panels) and their sample counterparts (lower panels) under Scenario 2. The black dashed lines are the optimal classification boundaries.

4.1.2 Gaussian Kernel

We examine Gaussian discriminant functions under each scenario using two types of approximation to the Gaussian kernel discussed earlier.

Deterministic representation: Truncation of the deterministic representation of the Gaussian kernel at a certain degree leads to the population polynomial discriminant using the inhomogeneous polynomial kernel of the same degree. Thus, to approximate the population Gaussian discriminant, we need to choose an appropriate degree for truncation. As the truncation degree $N$ increases, the largest (and only nonzero) eigenvalue $\lambda_{N}$ as a measure of class separation naturally increases. We may stop at $N$ where the increment in $\lambda_{N}$ is negligible.

Figure 6: The ratio of between-class variation to within-class variation ($\lambda_{N}$) as a function of the truncation degree $N$ under (a) Scenario 1 and (b) Scenario 2.

Figure 6 shows how this eigenvalue $\lambda_{N}$ changes with the degree $N$ for each scenario. In Scenario 1, since a linear component is essential, there is a sharp increase in $\lambda_{N}$ at degree 1 followed by a gradual increase as odd-degree features are added. By contrast, in Scenario 2, $\lambda_{N}$ increases steadily as even-degree features are added. Overall, the magnitude of the maximum ratio of between-class variation to within-class variation ($\lambda_{N}$) indicates that Scenario 1 presents an inherently easier problem than Scenario 2. Figure 7 displays contours of the approximate Gaussian discriminants for each scenario using $N=14$, which suggest that the Gaussian kernel can capture the difference between classes effectively in both scenarios.

Figure 7: Contours of the population Gaussian discriminants approximated by polynomials truncated at degree 14 under (a) Scenario 1 and (b) Scenario 2. The black dashed lines are the optimal classification boundaries.

Random Fourier feature representation: While polynomial features in the deterministic representation are naturally ordered by degree, there is no natural order in random Fourier features. As with the degree $N$ for deterministic features, however, the Rayleigh quotient as a measure of class separation, or the corresponding eigenvalue, increases as we add more random features. We numerically examine the effect of the number of random features $D$ on the eigenvalue $\lambda_{D}$ and monitor the increment in $\lambda_{D}$.

For both scenarios, we randomly generated 40 vectors $\mathbf{w}_{i}$ from $N_{2}(\boldsymbol{0},I_{2})$ and $b_{i}$ from Uniform$(0,2\pi)$, and defined phase-shifted cosine features $z_{\mathbf{w}_{i},b_{i}}(\mathbf{x})=\sqrt{2}\cos(\mathbf{w}_{i}^{T}\mathbf{x}+b_{i})$. Figure 8 shows how $\lambda_{D}$ changes with $D$ for each scenario. Figure 9 shows how the approximate Gaussian discriminant in (22) changes as the number of random features increases from 2 to 40 under Scenario 1, and Figure 10 shows the corresponding change under Scenario 2. The snapshots in Figures 9 and 10 are chosen by monitoring the increment in the eigenvalue as more features are added; the numbers of features used are marked by the red vertical lines in Figure 8 for reference. As $D$ increases, the approximate Gaussian discriminants tend to better approximate the optimal classification boundaries. Compared to the polynomial approximation, the eigenvalues level off quickly with the number of random features $D$, and the maximum values are far less than their counterparts with polynomial features in both scenarios, in part due to the randomness in the choice of $\mathbf{w}_{i}$ and $b_{i}$ and the fact that the class difference is not harmonic in nature. In summary, Fourier features are not as effective as polynomial features in these two settings.
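
The monitoring of $\lambda_{D}$ can be sketched as follows (our own sample-based implementation for a Scenario 1 data set, again using the rank-one identity for the largest generalized eigenvalue and a small ridge term for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(4)
n, Dmax = 5000, 40
X1 = rng.multivariate_normal([0.6, 0.9], np.eye(2), n)     # Scenario 1, class 1
X2 = rng.multivariate_normal([-1.0, -1.2], np.eye(2), n)   # Scenario 1, class 2
Wdir = rng.normal(size=(Dmax, 2))                           # w_i ~ N_2(0, I_2), i.e. omega = 1
b = rng.uniform(0, 2 * np.pi, Dmax)

def z(X, D):
    """First D phase-shifted cosine features sqrt(2) * cos(w_i^T x + b_i)."""
    return np.sqrt(2) * np.cos(X @ Wdir[:D].T + b[:D])

for D in (2, 5, 10, 20, 40):
    Z1, Z2 = z(X1, D), z(X2, D)
    delta = Z1.mean(axis=0) - Z2.mean(axis=0)
    WD = 0.5 * np.cov(Z1, rowvar=False) + 0.5 * np.cov(Z2, rowvar=False)
    lam_D = delta @ np.linalg.solve(WD + 1e-8 * np.eye(D), delta)
    print(D, round(lam_D, 3))                               # levels off as D grows
```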

Figure 8: The ratio of between-class variation to within-class variation ($\lambda_{D}$) as a function of the number of random Fourier features $D$ under (a) Scenario 1 and (b) Scenario 2. The red vertical lines indicate the number of random features used in Figures 9 and 10.
Figure 9: Contours of the approximate discriminant functions using random Fourier features under Scenario 1. The value of DD in each panel indicates the number of random Fourier features.
Figure 10: Contours of the approximate discriminant functions using random Fourier features under Scenario 2. The value of DD in each panel indicates the number of random Fourier features.

4.2 Real Data Example

In this section, we carry out a kernel discriminant analysis on the spam email data set from the UCI Machine Learning Repository (Dua and Graff 2017). We examine the geometry of sample kernel discriminants with various kernels as in the simulation study, and test the performance of the induced classifiers to see the impact of the kernel choice and kernel parameters.

The data set contains information from 4601 email messages, of which 60.6% are regular email and 39.4% spam. The task is to detect whether a given email is regular or spam using the 57 available predictors in order to filter out spam. Forty-eight predictors are the percentage of words in the email that match a given word (e.g., credit, you, free), six predictors are the percentage of punctuation marks in the email that match a given punctuation mark (e.g., !, $), and the remaining three predictors are the longest, average, and total length of strings of capital letters in the message.

For ease of illustration, we start with a low-dimensional representation of the data using principal components and construct kernel discriminants with those components rather than the individual predictors. We observed that the predictors measuring relative frequencies of words exhibit strong skewness in distribution. To alleviate the skewness, we considered a logit transformation before defining principal components. We also observed a large number of zeros in many predictors, as some words do not appear in every email message. To handle this issue, we replaced zeros with half of the smallest nonzero value in each predictor before taking the logit transformation and carried out a principal component analysis on the transformed data using their correlation matrix. We then split the principal component scores into training and test sets of about 60% and 40% each and evaluated the performance of trained classifiers over the test set.
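
The preprocessing described above can be sketched roughly as follows. This is our own reconstruction under several assumptions: the UCI spambase file is available locally as spambase.data, the logit is applied to the 54 percentage-type predictors treated as proportions, and a fixed random split stands in for the actual 60/40 split used here.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# assumed local copy of the UCI spambase data (57 predictors + spam indicator)
df = pd.read_csv("spambase.data", header=None)
X = df.iloc[:, :57].to_numpy(dtype=float)
y = df.iloc[:, 57].to_numpy(dtype=int)                  # 1 = spam, 0 = regular

# logit-transform the 54 percentage-type predictors, replacing zeros by
# half of the smallest nonzero value in each column
Xt = X.copy()
for j in range(54):
    col = Xt[:, j] / 100.0
    nz = col[col > 0]
    col[col == 0] = 0.5 * nz.min()
    col = np.minimum(col, 1 - 1e-6)                     # guard against exactly 100%
    Xt[:, j] = np.log(col / (1 - col))

# PCA on the correlation matrix = PCA after standardizing each column
Xs = (Xt - Xt.mean(axis=0)) / Xt.std(axis=0)
scores = PCA(n_components=2).fit_transform(Xs)

S_tr, S_te, y_tr, y_te = train_test_split(scores, y, test_size=0.4, random_state=0)
```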

Figure 11 shows the scores on the first two principal components for the training data. These two principal components explain 26% of the variation in the original data. The score distributions for the two types of email are skewed and overlap substantially with very different covariances, suggesting that a nonlinear boundary is needed for classification.

Figure 11: A scatterplot of the first two principal components scores on the email messages in the training data (blue: regular and red: spam).

We performed a kernel discriminant analysis on the training data using the inhomogeneous polynomial kernels of degree 1 to 6, and obtained the corresponding polynomial discriminants. For computational efficiency, we estimated the moment difference $\tilde{\boldsymbol{\Delta}}$ and covariance matrix $\tilde{\mathbf{W}}$ directly from the training data and solved a sample version of (15) instead of (2). Figure 12 uses a color map to show the estimated coefficients of the discriminants, normalized to unit length. High-order terms, especially those beyond the cubic terms, have negligible coefficients. To make a decision for spam filtering, we need a threshold for the discriminant scores; we chose the threshold value by minimizing the training error. Figure 13 displays the decision boundaries of the final discriminant functions using the chosen thresholds. All nonlinear polynomial discriminants in the figure seem to have similar boundaries, at least in the region where the data density is high. Table 3 presents their test error rates for comparison, along with the rates for misclassifying spam as regular and vice versa. The fifth- and sixth-order polynomial discriminants have the lowest overall error rate in this case. However, the reduction in the test error rate is marginal after the third order, which we may expect from the result in Figure 12 and the diminishing returns in the between-class to within-class variation ratio with degree shown in Table 3. We may well consider the cubic discriminant sufficient for this application: it provides a good compromise between the two kinds of errors while maintaining simplicity.
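
Continuing the preprocessing sketch above, the sample polynomial discriminants and their test errors can be computed along the following lines (again our own reconstruction: the rank-one between-class matrix gives the leading eigenvector direction as $\tilde{\mathbf{W}}^{-1}\tilde{\boldsymbol{\Delta}}$, and the threshold is chosen by a grid search over the training scores).

```python
import numpy as np
from itertools import combinations_with_replacement

def monomials(S, d):
    """Monomials of the two principal component scores with total degree 1..d."""
    cols = []
    for deg in range(1, d + 1):
        for idx in combinations_with_replacement(range(S.shape[1]), deg):
            cols.append(np.prod(S[:, idx], axis=1))
    return np.column_stack(cols)

def kda_poly_test_error(S_tr, y_tr, S_te, y_te, d):
    """Degree-d polynomial discriminant on the PC scores; returns the test error rate."""
    F_tr, F_te = monomials(S_tr, d), monomials(S_te, d)
    F1, F0 = F_tr[y_tr == 1], F_tr[y_tr == 0]
    pi1, pi0 = len(F1) / len(F_tr), len(F0) / len(F_tr)
    delta = F1.mean(axis=0) - F0.mean(axis=0)
    W = pi1 * np.cov(F1, rowvar=False) + pi0 * np.cov(F0, rowvar=False)
    nu = np.linalg.solve(W + 1e-8 * np.eye(W.shape[0]), delta)     # leading direction
    s_tr, s_te = F_tr @ nu, F_te @ nu
    # threshold minimizing the training error (predict spam when score > c)
    cand = np.unique(s_tr)
    c = cand[np.argmin([np.mean((s_tr > v) != (y_tr == 1)) for v in cand])]
    return np.mean((s_te > c) != (y_te == 1))

# S_tr, y_tr, S_te, y_te come from the preprocessing sketch earlier in this section
for d in range(1, 7):
    print(d, round(kda_poly_test_error(S_tr, y_tr, S_te, y_te, d), 4))
```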

Figure 12: A color map of the estimated coefficients for the polynomial discriminants of degree 1 to 6 using two principal components from the spam email data displayed in the lower triangular array. The column label in the gray band (e.g., x1=PC1x_{1}=PC_{1} and x2=PC2x_{2}=PC_{2}) indicates the term corresponding to each coefficient.
Figure 13: Decision boundaries of the polynomial discriminants with the inhomogeneous polynomial kernels of degree 1 to 6 obtained from the spam email data. The black dashed lines are the boundaries with minimum training error for each kernel.
Table 3: Test error rates of kernel discriminant analysis on the spam email data set with the inhomogeneous polynomial kernels of varying degrees. The training error rates and between-class to within-class variation ratio are provided for comparison.
Degree   Ratio    Training error   Test error (misclassified spam)   Test error (misclassified regular)   Test error (overall)
1 4.3154 0.1381 0.2218 0.0744 0.1325
2 6.4226 0.1163 0.1887 0.0645 0.1135
3 7.2928 0.1116 0.1736 0.0645 0.1075
4 7.7043 0.1095 0.1377 0.0959 0.1124
5 8.0941 0.1058 0.1612 0.0672 0.1042
6 8.3192 0.1033 0.1543 0.0717 0.1042

5 Discussion

We have examined the population version of kernel discriminant analysis and the generalized eigenvalue problem with between-class and within-class kernel covariance operators to shed light on the relation between the data distribution and resulting kernel discriminant. Our analysis shows that polynomial discriminants capture the difference between two distributions through their moments of a certain order specified by the polynomial kernel. Depending on the representation of the Gaussian kernel, on the other hand, Gaussian discriminants encode the class difference using all polynomial features or Fourier features of random projections.

Whenever we have some discriminative predictors in the data by design, as is typically the case, kernels of a simple form aligned with those predictors will work well. For instance, if we use polynomial kernels in such a setting, we expect the Rayleigh quotient as a measure of class separation to saturate quickly with the degree and low-order polynomial features to prevail. The geometric perspective of kernel discriminant analysis presented in this paper suggests that the ideal kernel for discrimination retains only those features necessary for describing the difference between the two distributions. This promotes a compositional view of kernels (e.g., $\tilde{K}_{d}(\mathbf{x},\mathbf{u})=\sum_{m=0}^{d}{d\choose m}K_{m}(\mathbf{x},\mathbf{u})$) and further points to the potential benefits of selecting kernel components relevant to discrimination, similar to the way feature selection is incorporated into linear discriminant analysis using sparsity-inducing penalties (Clemmensen et al. 2011, Cai and Liu 2011). For instance, Kim et al. (2006) formulated a convex optimization problem for kernel selection in KDA. It is also of interest to compare this kernel selection approach with other approaches for numerical approximation of kernel matrices themselves through Nyström approximation (Williams and Seeger 2001, Drineas and Mahoney 2005) or random projections (Ye et al. 2017).

As a related issue, it has not been formally examined how the Rayleigh quotient maximized in kernel discriminant analysis relates to the error rate of the induced classifier, except in a few special cases. It is of particular interest how this relation changes with the form of the kernel and its associated features, given the difference between the two distributions.

While our analysis has focused on the case of two classes, it can be generalized to the case of multiple classes, where more than one kernel discriminant needs to be considered and the discriminants must be properly combined to make a decision. We leave this extension as future research.

Acknowledgements

This research was supported in part by the National Science Foundation under grant DMS-15-13566. We thank Professor Mikyoung Lim at KAIST for helpful conversations on linear operators.

References

  • Aronszajn (1950) Aronszajn, N. (1950). Theory of reproducing kernels, Transactions of the American Mathematical Society 68: 337–404.
  • Baudat and Anouar (2000) Baudat, G. and Anouar, F. (2000). Generalized discriminant analysis using a kernel approach, Neural Computation 12(10): 2385–2404.
  • Cai and Liu (2011) Cai, T. and Liu, W. (2011). A direct estimation approach to sparse linear discriminant analysis, Journal of the American Statistical Association 106(496): 1566–1577.
  • Clemmensen et al. (2011) Clemmensen, L., Hastie, T., Witten, D. and Ersboll, B. (2011). Sparse discriminant analysis, Technometrics 53(4): 406–413.
  • Drineas and Mahoney (2005) Drineas, P. and Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning, J. Mach. Learn. Res. 6: 2153–2175.
  • Dua and Graff (2017) Dua, D. and Graff, C. (2017). UCI machine learning repository.
    http://archive.ics.uci.edu/ml
  • Gu (2002) Gu, C. (2002). Smoothing Spline ANOVA Models, New York: Springer.
  • Hofmann et al. (2008) Hofmann, T., Schölkopf, B. and Smola, A. J. (2008). Kernel methods in machine learning, The Annals of Statistics 36(3): 1171–1220.
  • Holmquist (1996) Holmquist, B. (1996). The d-variate vector Hermite polynomial of order k, Linear Algebra and its Applications 237: 155–190.
  • Joachims (2002) Joachims, T. (2002). Optimizing search engines using clickthrough data, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, Association for Computing Machinery, New York, NY, USA, p. 133–142.
  • Kim et al. (2006) Kim, S., Magnani, A. and Boyd, S. (2006). Optimal kernel selection in kernel Fisher discriminant analysis, Proceedings of the 23rd international conference on Machine learning, ACM, pp. 465–472.
  • Liang and Lee (2013) Liang, Z. and Lee, Y. (2013). Eigen-analysis of nonlinear PCA with polynomial kernels, Statistical Analysis and Data Mining 6: 529–544.
  • Mika et al. (1999) Mika, S., Ratsch, G., Weston, J., Schölkopf, B. and Müllers, K. (1999). Fisher discriminant analysis with kernels, Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society workshop, IEEE, pp. 41–48.
  • Rahimi and Recht (2008a) Rahimi, A. and Recht, B. (2008a). Random features for large-scale kernel machines, Advances in neural information processing systems, pp. 1177–1184.
  • Rahimi and Recht (2008b) Rahimi, A. and Recht, B. (2008b). Uniform approximation of functions with random bases, 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 555–561.
  • Rudin (2017) Rudin, W. (2017). Fourier analysis on groups, Courier Dover Publications.
  • Schölkopf and Smola (2002) Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels, MIT Press, Cambridge, MA.
  • Schölkopf et al. (1998) Schölkopf, B., Smola, A. J. and Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10: 1299–1319.
  • Scott and Longuet-Higgins (1990) Scott, G. L. and Longuet-Higgins, H. C. (1990). Feature grouping by relocalisation of eigenvectors of proximity matrix, Proceedings of British Machine Vision Conference, pp. 103–108.
  • Shawe-Taylor and Cristianini (2004) Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis, Cambridge University Press, USA.
  • Shi et al. (2009) Shi, T., Belkin, M. and Yu, B. (2009). Data spectroscopy: eigenspaces of convolution operators and clustering, The Annals of Statistics 37(6B): 3960–3984.
  • Vapnik (1995) Vapnik, V. N. (1995). The Nature of Statistical Learning Theory, Springer, New York.
  • von Luxburg (2007) von Luxburg, U. (2007). A tutorial on spectral clustering, Statistics and Computing 17(4): 395–416.
  • Wahba (1990) Wahba, G. (1990). Spline models for observational data, Philadelphia, PA: Society for Industrial and Applied Mathematics.
  • Williams and Seeger (2001) Williams, C. K. I. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines, in T. K. Leen, T. G. Dietterich and V. Tresp (eds), Advances in Neural Information Processing Systems 13, MIT Press, pp. 682–688.
  • Ye et al. (2017) Ye, H., Li, Y., Chen, C. and Zhang, Z. (2017). Fast Fisher discriminant analysis with randomized algorithms, Pattern Recognition 72: 82–92.
  • Zhu et al. (1998) Zhu, H., Williams, C. K., Rhower, R. and Morciniec, M. (1998). Gaussian regression and optimal finite dimensional linear models, in C. Bishop (ed.), Neural Networks and Machine Learning, Springer, Berlin, pp. 167–184.