
Sharp Convergence Rate and Support Consistency of
Multiple Kernel Learning with Sparse and Dense Regularization

Taiji Suzuki, Ryota Tomioka
Department of Mathematical Informatics,
The University of Tokyo,
7-3-1 Hongo, Bunkyo-ku, Tokyo
t-suzuki@mist.i.u-tokyo.ac.jp,
tomioka@mist.i.u-tokyo.ac.jp
Masashi Sugiyama
Department of Computer Science,
Tokyo Institute of Technology,
2-12-1 O-okayama, Meguro-ku, Tokyo
sugi@cs.titech.ac.jp
Abstract

We theoretically investigate the convergence rate and support consistency (i.e., correctly identifying the subset of non-zero coefficients in the large sample limit) of multiple kernel learning (MKL). We focus on MKL with block-$\ell_1$ regularization (inducing sparse kernel combination), block-$\ell_2$ regularization (inducing uniform kernel combination), and elastic-net regularization (including both block-$\ell_1$ and block-$\ell_2$ regularization). For the case where the true kernel combination is sparse, we show a sharper convergence rate of the block-$\ell_1$ and elastic-net MKL methods than the existing rate for block-$\ell_1$ MKL. We further show that elastic-net MKL requires a milder condition for being consistent than block-$\ell_1$ MKL. For the case where the optimal kernel combination is not exactly sparse, we prove that elastic-net MKL can achieve a faster convergence rate than the block-$\ell_1$ and block-$\ell_2$ MKL methods by carefully controlling the balance between the block-$\ell_1$ and block-$\ell_2$ regularizers. Thus, our theoretical results overall suggest the use of elastic-net regularization in MKL.

1 Introduction

The choice of kernel functions is a key issue for kernel methods such as support vector machines to work well (Vapnik, 1998). A traditional but very powerful approach to optimizing the kernel function is the use of cross-validation (CV) (Stone, 1974). Although the CV-based kernel choice often leads to better generalization, it is computationally expensive when the kernel contains multiple tuning parameters.

To overcome this limitation, the framework of multiple kernel learning (MKL) has been introduced, which tries to learn the optimal linear combination of prefixed base kernels by convex optimization (Lanckriet et al., 2004, Micchelli and Pontil, 2005, Lin and Zhang, 2006, Sonnenburg et al., 2006, Rakotomamonjy et al., 2008, Suzuki and Tomioka, 2009). The seminal paper by Bach et al. (2004) showed that this MKL formulation can be interpreted as block-$\ell_1$ regularization (i.e., $\ell_1$ regularization across the kernels and $\ell_2$ regularization within the same kernel). We refer to this MKL formulation as 'block-$\ell_1$ MKL'. Based on this interpretation, block-$\ell_1$ MKL was proved to be support consistent (i.e., correctly identifying the subset of non-zero coefficients with probability one in the large sample limit) when the true kernel combination is sparse (Bach, 2008). Furthermore, the convergence rate of block-$\ell_1$ MKL has also been elucidated in Koltchinskii and Yuan (2008), which can be regarded as an extension of the theoretical analysis for ordinary (non-block) $\ell_1$ regularization (Bickel et al., 2009, Zhang, 2009).

However, in many practical applications, the true kernel combination may not be exactly sparse. In such a non-sparse situation, block-$\ell_1$ MKL was shown to perform rather poorly; just the uniform combination of base kernels obtained by block-$\ell_2$ regularization (Micchelli and Pontil, 2005) (which we call 'block-$\ell_2$ MKL') often works better in practice (Cortes, 2009). Furthermore, recent works showed that some 'intermediate' regularization between block-$\ell_1$ and block-$\ell_2$ regularization is more promising, e.g., block-$\ell_p$ regularization with $1\leq p\leq 2$ (Cortes et al., 2009, Kloft et al., 2009), and elastic-net regularization (Zou and Hastie, 2005), which includes both block-$\ell_1$ and block-$\ell_2$ regularization (Tomioka and Suzuki, 2010) (we call this method 'elastic-net MKL'). Theoretically, the support consistency and the convergence rate of parametric elastic-nets have been elucidated in Yuan and Lin (2007) and Zou and Zhang (2009), respectively, and the non-parametric case has been investigated in Meier et al. (2009), focusing on the Sobolev space.

In this paper, we theoretically analyze the support consistency and convergence rate of MKL, and provide three new results.

  • For the case where the true kernel combination is sparse, we show that elastic-net MKL achieves a faster convergence rate than the one shown for block-$\ell_1$ MKL (Koltchinskii and Yuan, 2008). More specifically, we show that the $L_2$ convergence error is given by $\mathcal{O}_p\big(\min\{dn^{-\frac{2}{2+s}}+d\log(M)/n,\; d^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}+d\log(M)/n\}\big)$, where $d$ is the number of active components of the target function, $s$ is the complexity of the RKHSs, $M$ is the number of candidate kernels, and $n$ is the number of samples.

  • For the case where the optimal kernel combination is not exactly sparse, we prove that elastic-net MKL achieves a faster convergence rate than the block-$\ell_1$ and block-$\ell_2$ MKL methods by carefully controlling the balance between block-$\ell_1$ and block-$\ell_2$ regularization. Our theoretical result agrees well with the experimental results reported in Tomioka and Suzuki (2010).

  • For the case where the true kernel combination is sparse, we prove that the necessary and sufficient conditions for the support consistency of elastic-net MKL are milder than the conditions required for block-$\ell_1$ MKL (Bach, 2008).

Overall, our theoretical results suggest the use of elastic-net regularization in MKL.

2 Preliminaries

In this section, we formulate the elastic-net MKL approach and summarize mathematical tools that are needed for the theoretical analysis.

2.1 Formulation

Suppose we are given $n$ samples $(x_i,y_i)_{i=1}^n$, where $x_i$ belongs to an input space $\mathcal{X}$ and $y_i\in\mathbb{R}$. The samples $(x_i,y_i)_{i=1}^n$ are independent and identically distributed from a probability measure $P$. We denote the marginal distribution of $X$ by $\Pi$. We consider an MKL regression problem in which the unknown target function is represented in the form $f(x)=\sum_{m=1}^M f_m(x)$, where each $f_m$ belongs to a different RKHS $\mathcal{H}_m$ $(m=1,\dots,M)$ corresponding to $M$ different base kernels $k_m$ over $\mathcal{X}\times\mathcal{X}$.

Elastic-net MKL learns a decision function $\hat{f}$ as follows. (For simplicity, we focus on the squared-loss function here. However, we note that it is straightforward to extend our convergence analysis and support consistency results given in Sections 3 and 4 to general loss functions that are strongly convex and Lipschitz continuous, by following the line of Koltchinskii and Yuan (2008).)

\[
\hat{f} = \mathop{\arg\min}_{f_m\in\mathcal{H}_m\,(m=1,\dots,M)} \frac{1}{n}\sum_{i=1}^n\Big(y_i-\sum_{m=1}^M f_m(x_i)\Big)^2 + \lambda_1^{(n)}\sum_{m=1}^M\|f_m\|_{\mathcal{H}_m} + \lambda_2^{(n)}\sum_{m=1}^M\|f_m\|_{\mathcal{H}_m}^2, \tag{1}
\]

where the first term is the squared loss of function fitting, and the second and third terms are the block-$\ell_1$ and block-$\ell_2$ regularizers, respectively. It can be seen from (1) that elastic-net MKL reduces to block-$\ell_1$ MKL if $\lambda_2^{(n)}=0$, which tends to induce a sparse kernel combination (Lanckriet et al., 2004, Bach et al., 2004). On the other hand, it reduces to block-$\ell_2$ MKL if $\lambda_1^{(n)}=0$, which results in a uniform kernel combination (Micchelli and Pontil, 2005). It is worth noting that elastic-net MKL allows us to obtain various levels of sparsity by controlling the ratio between $\lambda_1^{(n)}$ and $\lambda_2^{(n)}$.
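To make the estimator (1) concrete, here is a minimal numerical sketch of elastic-net MKL, assuming Gaussian base kernels of several widths and a slightly smoothed block norm so that an off-the-shelf quasi-Newton solver can be used. By the representer theorem, each $f_m$ can be parameterized as $f_m(\cdot)=\sum_{i=1}^n\alpha_{m,i}k_m(\cdot,x_i)$ with $\|f_m\|_{\mathcal{H}_m}^2=\alpha_m^\top K_m\alpha_m$, where $K_m$ is the Gram matrix. The data, kernel widths, and regularization values below are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: the target depends on a single smooth component.
rng = np.random.default_rng(0)
n, M = 50, 3
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

# Gaussian base kernels with different (hypothetical) widths.
widths = [0.1, 0.5, 2.0]
K = [np.exp(-(X - X.T) ** 2 / (2 * w ** 2)) for w in widths]

lam1, lam2, eps = 0.05, 0.05, 1e-8  # block-l1 / block-l2 weights

def objective(alpha_flat):
    # alpha_flat stacks one coefficient vector alpha_m per kernel;
    # f_m(x_i) = (K_m alpha_m)_i and ||f_m||_{H_m}^2 = alpha_m' K_m alpha_m.
    A = alpha_flat.reshape(M, n)
    residual = y - sum(K[m] @ A[m] for m in range(M))
    norms2 = np.array([A[m] @ K[m] @ A[m] for m in range(M)])
    norms = np.sqrt(norms2 + eps)  # eps-smoothed block norm
    return residual @ residual / n + lam1 * norms.sum() + lam2 * norms2.sum()

res = minimize(objective, np.zeros(M * n), method="L-BFGS-B")
A = res.x.reshape(M, n)
print("block norms:", [round(float(np.sqrt(A[m] @ K[m] @ A[m])), 3)
                       for m in range(M)])
```

Increasing `lam1` relative to `lam2` drives the block norms of the irrelevant kernels toward zero, mirroring the sparsity discussion above.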

2.2 Notations and Assumptions

Here, we prepare technical tools needed in the following sections.

By Mercer's theorem, there exist an orthonormal system $\{\phi_{k,m}\}_{k,m}$ in $L_2(\Pi)$ and a spectrum $\{\mu_{k,m}\}_{k,m}$ such that $k_m$ has the following spectral representation:

\[
k_m(x,x') = \sum_{k=1}^\infty \mu_{k,m}\phi_{k,m}(x)\phi_{k,m}(x'). \tag{2}
\]

By this spectral representation, the inner product of the RKHS can be expressed as $\langle f_m,g_m\rangle_{\mathcal{H}_m}=\sum_{k=1}^\infty \mu_{k,m}^{-1}\langle f_m,\phi_{k,m}\rangle_{L_2(\Pi)}\langle \phi_{k,m},g_m\rangle_{L_2(\Pi)}$.
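Numerically, the spectral representation (2) can be approximated from data: the eigenvalues of the Gram matrix divided by $n$ approximate the spectrum $\{\mu_{k,m}\}_k$ of the integral operator, and the eigenvectors scaled by $\sqrt{n}$ approximate the orthonormal system $\{\phi_{k,m}\}_k$ at the sample points. A small sketch of ours, assuming $\Pi$ is uniform on $[-1,1]$ and a Gaussian base kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-1, 1, n)                    # samples from Pi (assumed uniform)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)  # one Gaussian base kernel k_m

# Eigenvalues of K/n approximate the spectrum {mu_{k,m}}_k of the integral
# operator in (2); eigenvectors scaled by sqrt(n) approximate {phi_{k,m}}_k
# at the sample points (orthonormal in the empirical L2(Pi) inner product).
mu, U = np.linalg.eigh(K / n)
mu, U = mu[::-1], U[:, ::-1] * np.sqrt(n)
print("leading eigenvalues:", np.round(mu[:5], 4))  # rapid spectral decay
print("empirical <phi_1, phi_2>:", round(float(U[:, 0] @ U[:, 1] / n), 6))
```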

Let $\mathcal{H}=\mathcal{H}_1\oplus\dots\oplus\mathcal{H}_M$. For $f=(f_1,\dots,f_M)\in\mathcal{H}$ and a subset of indices $I\subseteq\{1,\dots,M\}$, we denote by $f_I$ the restriction of $f$ to the index set $I$, i.e., $f_I=(f_m)_{m\in I}$.

We denote by $I_0$ the indices of truly active kernels, i.e.,

\[
I_0 = \{m \mid \|f^*_m\|_{\mathcal{H}_m} > 0\},
\]

and define the complement of $I_0$ as $J_0 = I_0^c$.

Throughout the paper, we assume the following technical conditions (see also Bach (2008)).

Assumption 1

(Basic Assumptions)

  1. $\mathrm{(A1)}$

    There exists $f^*=(f^*_1,\dots,f^*_M)\in\mathcal{H}$ such that $\mathrm{E}[Y|X]=\sum_{m=1}^M f^*_m(X)$, and the noise $\epsilon:=Y-f^*(X)$ has a strictly positive variance; that is, there exists $\sigma>0$ such that $\mathrm{E}[\epsilon^2|X]>\sigma^2$ for all $X\in\mathcal{X}$. We also assume that $\epsilon$ is bounded as $|\epsilon|\leq L$.

  2. $\mathrm{(A2)}$

    For each $m=1,\dots,M$, $\mathcal{H}_m$ is separable and $\sup_{X\in\mathcal{X}}|k_m(X,X)|<1$.

  3. $\mathrm{(A3)}$

    There exists $g^*_m\in\mathcal{H}_m$ such that

    \[
    f^*_m(x) = \int_{\mathcal{X}} k_m^{(1/2)}(x,x')\,g^*_m(x')\,\mathrm{d}\Pi(x') \qquad (\forall m=1,\dots,M), \tag{3}
    \]

    where $k_m^{(1/2)}(x,x')=\sum_{k=1}^\infty \mu_{k,m}^{1/2}\phi_{k,m}(x)\phi_{k,m}(x')$ is the operator square root of $k_m$.

The first assumption in (A1) ensures that the model $\mathcal{H}$ is correctly specified, and the technical assumption $|\epsilon|\leq L$ allows $\epsilon f$ to be Lipschitz continuous with respect to $f$.

It is known that assumption (A2) gives the following relation:

\[
\|f_m\|_\infty \leq \sup_x\,\langle k_m(x,\cdot),f_m\rangle_{\mathcal{H}_m} \leq \sup_x\,\|k_m(x,\cdot)\|_{\mathcal{H}_m}\|f_m\|_{\mathcal{H}_m} \leq \sup_x\sqrt{k_m(x,x)}\,\|f_m\|_{\mathcal{H}_m} \leq \|f_m\|_{\mathcal{H}_m}.
\]
Table 1: Summary of the constants we use in this article.

$M$: the number of candidate kernels.
$d$: the number of active kernels of the truth, i.e., $d=|I_0|$.
$R$: the upper bound of $\sum_{m=1}^M(\|f^*_m\|_{\mathcal{H}_m}+\|f^*_m\|_{\mathcal{H}_m}^2)$; see (A4).
$s$: the spectral decay coefficient; see (A5).
$\beta$: the approximate sparsity coefficient; see (A7).
$b$: the parameter that tunes the correlation between kernels; see (A8).

Assumption (A3) was used in Caponnetto and de Vito (2007) and also in Bach (2008). It ensures the consistency of the least-squares estimates in terms of the RKHS norm. Using the spectral representation (2), the condition $g^*_m\in\mathcal{H}_m$ is expressed as

\[
\|g^*_m\|_{\mathcal{H}_m}^2 = \sum_{k=1}^\infty \mu_{k,m}^{-2}\langle f^*_m,\phi_{k,m}\rangle_{L_2(\Pi)}^2 < \infty. \tag{4}
\]

This condition was also assumed in Koltchinskii and Yuan (2008). Proposition 9 of Bach (2008) gave a sufficient condition for (3) to hold for translation-invariant kernels $k_m(x,x')=h_m(x-x')$.

Constants we use later are summarized in Table 1.

3 Convergence Rate of Elastic-net MKL

In this section, we derive the convergence rate of elastic-net MKL in two situations:

  1. (i)

    A sparse situation, where the truth $f^*$ is sparse (Section 3.1).

  2. (ii)

    A near-sparse situation, where the truth is not exactly sparse, but $\|f^*_m\|_{\mathcal{H}_m}$ decays polynomially as $m$ increases (Section 3.2).

For case (i), we show that elastic-net MKL (and block-$\ell_1$ MKL) achieves a faster convergence rate than the rate shown for block-$\ell_1$ MKL by Koltchinskii and Yuan (2008). Furthermore, for case (ii), we show that elastic-net MKL can outperform block-$\ell_1$ MKL and block-$\ell_2$ MKL depending on the sparsity of the truth and the conditioning of the problem. Throughout this section, we assume the following conditions.

Assumption 2

(Boundedness Assumption) There exist constants $C_1$ and $R$ such that

\[
\mathrm{(A4)}\qquad \max_{m\in I_0}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} \leq C_1, \qquad \sum_{m=1}^M\big(\|f^*_m\|_{\mathcal{H}_m}+\|f^*_m\|_{\mathcal{H}_m}^2\big) \leq R.
\]
Assumption 3

(Spectral Assumption) There exist $0<s<1$ and $C_2$ such that

\[
\mathrm{(A5)}\qquad \mu_{k,m} \leq C_2 k^{-\frac{1}{s}} \qquad (\forall k\geq 1,\ 1\leq \forall m\leq M),
\]

where $\{\mu_{k,m}\}_k$ is the spectrum of the kernel $k_m$ (see Eq. (2)).

The first assumption in (A4) appeared in Theorem 2 of Koltchinskii and Yuan (2008). The second assumption in (A4) bounds the amplitude of $f^*$. It was shown that the spectral assumption (A5) is equivalent to the classical covering-number assumption (Steinwart et al., 2009). Recall that the $\epsilon$-covering number $\mathcal{N}(\epsilon,\mathcal{B}_{\mathcal{H}_m},L_2(\Pi))$ with respect to $L_2(\Pi)$ is the minimal number of balls with radius $\epsilon$ needed to cover the unit ball $\mathcal{B}_{\mathcal{H}_m}$ of $\mathcal{H}_m$ (van der Vaart and Wellner, 1996). If the spectral assumption (A5) holds, there exists a constant $c$ depending only on $s$ such that

\[
\mathcal{N}(\varepsilon,\mathcal{B}_{\mathcal{H}_m},L_2(\Pi)) \leq c\,\varepsilon^{-2s}, \tag{5}
\]

and the converse is also true (see Theorem 15 of Steinwart et al. (2009) and Steinwart (2008) for details). Therefore, if $s$ is large, at least one RKHS is "complex", and if $s$ is small, the RKHSs are regarded as "simple".
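As a concrete instance (our illustration, not from the original text): if every kernel's spectrum decays as $\mu_{k,m}\leq Ck^{-2}$, then (A5) holds with $s=1/2$, and the bound (5) becomes
\[
\mathcal{N}(\varepsilon,\mathcal{B}_{\mathcal{H}_m},L_2(\Pi)) \leq c\,\varepsilon^{-1}.
\]
Faster spectral decay (smaller $s$) thus translates directly into a smaller covering number, i.e., a "simpler" class of functions.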

For a given set of indices $I\subseteq\{1,\dots,M\}$, let $\kappa(I)$ be defined as follows:

\[
\kappa(I) := \sup\left\{\kappa\geq 0 \;\middle|\; \kappa \leq \frac{\|\sum_{m\in I}f_m\|_{L_2(\Pi)}^2}{\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2},\ \forall f_m\in\mathcal{H}_m\ (m\in I)\right\}.
\]

$\kappa(I)$ represents the correlation of the RKHSs inside the index set $I$. Similarly, we define the correlation of the RKHSs between $I$ and $I^c$ as follows:

\[
\rho(I) := \sup\left\{\frac{\langle f_I,g_{I^c}\rangle_{L_2(\Pi)}}{\|f_I\|_{L_2(\Pi)}\|g_{I^c}\|_{L_2(\Pi)}} \;\middle|\; f_I\in\mathcal{H}_I,\ g_{I^c}\in\mathcal{H}_{I^c},\ f_I\neq 0,\ g_{I^c}\neq 0\right\}.
\]
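To build intuition for these quantities (an illustration of ours): if the RKHSs $\mathcal{H}_1,\dots,\mathcal{H}_M$ are mutually orthogonal in $L_2(\Pi)$, then $\|\sum_{m\in I}f_m\|_{L_2(\Pi)}^2=\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2$ and $\langle f_I,g_{I^c}\rangle_{L_2(\Pi)}=0$, so $\kappa(I)=1$ and $\rho(I)=0$; the incoherence-type conditions (A6) and (A8) below then hold with the best possible constants. Strongly correlated kernels push $\kappa(I)$ toward $0$ and $\rho(I)$ toward $1$, degrading these constants.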

In Subsections 3.1 and 3.2, we will assume that the kernels have no perfect canonical dependence, i.e., that the kernels are not degenerately similar to each other (see (A6) and (A8) below).

Throughout this paper, we assume $\frac{\log(Mn)}{n}\leq 1$ and that $\log(M)$ grows more slowly than any polynomial order in the number of samples $n$: $\log(M)=o(n^\epsilon)$ for all $\epsilon>0$. With some abuse of notation, we use $C$ to denote constants that are independent of $d$ and $n$; its value may differ from line to line.

3.1 Sparse Situation

Here we derive the convergence rate of the estimator $\hat{f}$ when the truth $f^*$ is sparse. Let $d=|I_0|$ and suppose that the number of kernels $M$ and the number of active kernels $d$ are increasing with respect to the number of samples $n$. We further assume the following condition in this subsection.

Assumption 4

(Incoherence Assumption) There exists a constant $C_3>0$ such that

\[
\mathrm{(A6)}\qquad 0 < C_3^{-1} < \kappa(I_0)\big(1-\rho^2(I_0)\big). \tag{6}
\]

This condition is known as the incoherence condition (Koltchinskii and Yuan, 2008, Meier et al., 2009), i.e., kernels are not too dependent on each other and the problem is well conditioned. Then we have the following convergence rate.

Theorem 1

Under assumptions (A1-A6), there exist constants $C$, $F$, and $K$ depending only on $\kappa(I_0)$, $\rho(I_0)$, $s$, $C_1$, $C_2$, $L$, and $R$ such that the $L_2(\Pi)$-norm of the residual $\hat{f}-f^*$ can be bounded as follows. When $d^{3+s}n^{-1}\leq 1$, for $\lambda_1^{(n)}=\lambda_2^{(n)}=\max\{Kn^{-\frac{1}{2+s}}+\tilde{K}_2\sqrt{\frac{t}{n}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$,

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(dn^{-\frac{2}{2+s}}+\frac{dt}{n}\Big), \tag{7}
\]

and, when $d^{3+s}n^{-1}>1$, for $\lambda_1^{(n)}=\max\{K(1+\sqrt{t})n^{-\frac{1}{2}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}\leq\lambda_1^{(n)}$,

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(d^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}+\frac{d(\log(Mn)+t)}{n}\Big), \tag{8}
\]

where each inequality holds with probability at least $1-e^{-t}-n^{-1}$ for all $t\geq\log\log(R\sqrt{n})+\log M$.

The above theorem indicates that the learning rate depends on the complexity of the RKHSs (the simpler, the faster) and on the number of active kernels rather than the total number of kernels $M$ (the influence of $M$ is at most $\frac{d\log(M)}{n}$). It is worth noting that the convergence rate in (7) and (8) is faster than or equal to the rate for block-$\ell_1$ MKL shown by Koltchinskii and Yuan (2008), who established the learning rate $O_p\big(d^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}+\frac{d\log(M)}{n}\big)$ under the same conditions as ours. (In our second bound (8), there is an additional $\frac{d\log(n)}{n}$ term. However, this can be eliminated by replacing the probability $1-e^{-t}-n^{-1}$ with $1-e^{-t}-M^{-A}$ as in Koltchinskii and Yuan (2008). Moreover, if $\sqrt{n}\log(n)^{-\frac{1+s}{2s}}\geq d$, then the term $\frac{d\log(n)}{n}$ is dominated by the first term $d^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}$.)
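As a quick check on how the two bounds compare (our own arithmetic, ignoring the $\log M$ and $t$ terms): the exponents in $n$ satisfy
\[
\frac{2}{2+s}-\frac{1}{1+s}=\frac{s}{(2+s)(1+s)}>0 \qquad (0<s<1),
\]
so for fixed $d$ the bound (7) always decays faster in $n$. The $d$-factors go the other way ($d$ versus $d^{\frac{1-s}{1+s}}$ with $\frac{1-s}{1+s}<1$). Balancing the two, (7) is the smaller bound exactly when $d^{\frac{2s}{1+s}}\leq n^{\frac{s}{(2+s)(1+s)}}$, i.e., $d\leq n^{\frac{1}{2(2+s)}}$; this is why the minimum of the two expressions appears in the rate stated in the introduction.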

3.2 Near-Sparse Situation

In this subsection, we analyze the convergence rate in a situation where $f^*$ is not sparse but near sparse. We showed a faster learning rate than existing bounds in the previous subsection. However, the assumptions we used might be too restrictive to capture the situations where MKL is used in practice. In fact, it was pointed out by Zou and Hastie (2005), in the context of (non-block) $\ell_1$ regularization, that $\ell_1$ regularization can fail in the following situations:

  • When the truth $f^*$ is not sparse, $\ell_1$ regularization shrinks many small but non-zero components to zero.

  • When there exist strong correlations between different kernels, the solution of block-$\ell_1$ MKL becomes unstable.

  • When the number of kernels $M$ is not large, there is no need to force the estimator to be sparse.

In order to analyze these situations in the MKL setting, we introduce three parameters $\beta$, $b$, and $\tau$: $\beta$ controls the level of sparsity (see (A7)), $b$ controls the correlation between candidate kernels (see (A8)), and $\tau$ controls the growth of the number of kernels with the number of samples (see (A9)).

We show that, naturally, block-$\ell_2$ MKL is preferable when there are only a few candidate kernels or the truth is dense. Importantly, if the candidate kernels are correlated, the convergence of block-$\ell_1$ MKL can be slow even when the truth is sparse. Our analysis shows that elastic-net MKL is most valuable in such intermediate situations.

By permuting indices, we can assume without loss of generality that $\|f^*_m\|_{\mathcal{H}_m}$ is decreasing with respect to $m$, i.e., $\|f^*_1\|_{\mathcal{H}_1}\geq\|f^*_2\|_{\mathcal{H}_2}\geq\|f^*_3\|_{\mathcal{H}_3}\geq\cdots$. We further assume the following conditions in this subsection.

Assumption 5

(Approximate Sparsity) The truth is approximately sparse, i.e., $\|f^*_m\|_{\mathcal{H}_m}>0$ for all $m$ and thus $I_0=\{1,\dots,M\}$. However, $\|f^*_m\|_{\mathcal{H}_m}$ decays polynomially with respect to $m$ as follows:

\[
\mathrm{(A7)}\qquad \|f^*_m\|_{\mathcal{H}_m} \leq C_3 m^{-\beta}.
\]

We call $\beta\,(>1)$ the approximate sparsity coefficient.

Assumption 6

(Generalized Incoherence) There exist $b>0$ and $C_4$ such that for all $I\subseteq\{1,\dots,M\}$,

\[
\mathrm{(A8)}\qquad \big(1-\rho^2(I)\big)\kappa(I) \geq C_4|I|^{-b}.
\]
Assumption 7

(Kernel-Set Growth) The number of kernels $M$ increases polynomially with respect to the number of samples $n$, i.e., there exists $\tau>0$ such that

\[
\mathrm{(A9)}\qquad M = \lceil n^\tau\rceil.
\]

For notational convenience, let $\tau_1=\frac{1}{(2\beta+b)(2+s)-1-s}$, $\tau_2=\frac{(s-1)(2\beta-1)+bs}{(2\beta+b)(2+s)-1-s}$, $\tau_3=\frac{s\{2(b+\beta)-1\}}{2(2+s)(b+\beta)-s}$, $\tau_4=\frac{s}{2+s}$, $\tau_5=\frac{b+1}{(\beta+b)\{b(2+s)+2\}}$, and $\tau_6=\frac{1}{(1-s)(1+b)}$. In addition, we denote by $K$ some sufficiently large constant.
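To see that these thresholds can indeed be ordered as required below (a numerical illustration of ours): take $s=1/2$, $b=5$, $\beta=1$, which satisfies the condition $2\beta(1-s)=1<s(b-1)=2$ of Theorem 2. Then
\[
\tau_1=\tfrac{1}{16}, \qquad \tau_2=\tfrac{1}{8}, \qquad \tau_3=\tfrac{11}{59}\approx 0.186, \qquad \tau_4=\tfrac{1}{5},
\]
so $\tau_1<\tau_2<\tau_3<\tau_4$ and each of the three regimes in Theorem 2 is non-empty.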

Theorem 2

Suppose assumptions (A1-A5) and (A7-A9), $2\beta(1-s)<s(b-1)$, and $\tau_1<\tau<\tau_4$ are satisfied. Then the estimator of elastic-net MKL possesses the following convergence rates, each of which holds with probability at least $1-e^{-t}-n^{-1}$ for all $t\geq\log\log(R\sqrt{n})+\log M$.

1. When $\tau_1<\tau<\tau_2$,

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big\{n^{-\gamma_1}+\Big(n^{-\frac{(2\beta+b)(2+s)-3-s+2\beta}{2\{(2\beta+b)(2+s)-1-s\}}}+{\lambda_2^{(n)}}^2\Big)(\sqrt{t}+t)\Big\}, \quad\text{where}\quad \gamma_1=\frac{4\beta+b-2}{(2+s)(2\beta+b)-1-s}, \tag{9}
\]

with $\lambda_1^{(n)}=\max\{Kn^{-\frac{3\beta+b-1}{(2\beta+b)(2+s)-1-s}}+\tilde{K}_2\sqrt{\frac{t}{n}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}=Kn^{-\frac{2\beta+b-1}{(2\beta+b)(2+s)-1-s}}$.

2. When $\tau_2\leq\tau<\tau_3$,

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big\{n^{\tau\frac{(2+s)b+2}{2\{(2+s)(b+\beta)-s\}}-\gamma_2}+\Big(n^{\frac{\tau(2+s)(1-\beta)-(4\beta+2b+sb-2)}{2\{(\beta+b)(2+s)-s\}}}+{\lambda_2^{(n)}}^2\Big)(\sqrt{t}+t)\Big\}, \quad\text{where}\quad \gamma_2=\frac{4\beta+b(2+s)-2}{2\{(2+s)(b+\beta)-s\}}, \tag{10}
\]

with $\lambda_1^{(n)}=\max\{K\sqrt{\frac{M}{n}}+\tilde{K}_2\sqrt{\frac{t}{n}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}=Kn^{\frac{\tau-\{2(b+\beta)-1\}}{2\{(2+s)(b+\beta)-s\}}}$.

3. When $\tau_3\leq\tau<\tau_4$,

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(n^{\tau\gamma_3-\gamma_3}+\Big(n^{\frac{\tau(\beta-1)+1-2\beta-b}{2(b+\beta)}}+{\lambda_2^{(n)}}^2\Big)(\sqrt{t}+t)\Big), \quad\text{where}\quad \gamma_3=\frac{b+2\beta-1}{2(b+\beta)}, \tag{11}
\]

with $\lambda_1^{(n)}=\max\{K\sqrt{\frac{M}{n}}+\tilde{K}_2\sqrt{\frac{t}{n}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}=K(M/n)^{\frac{2(b+\beta)-1}{4(b+\beta)}}$.

Theorem 3

Under assumptions (A1-A5) and (A7-A9), if $\tau_5<\tau$, the estimator $\hat{f}_{\ell_1}$ of block-$\ell_1$ MKL has the following convergence rate with probability at least $1-e^{-t}-n^{-1}$ for all $t\geq\log\log(R\sqrt{n})+\log M$:

\[
(\text{block-}\ell_1\text{ MKL})\qquad \|\hat{f}_{\ell_1}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(n^{-\gamma_4}+n^{-\frac{4\beta+2b-2+s(b+\beta)}{2(2+s)(b+\beta)}}(\sqrt{t}+t)\Big), \quad\text{where}\quad \gamma_4=\frac{2\beta+b-1}{(\beta+b)(2+s)}, \tag{12}
\]

with $\lambda_1^{(n)}=\max\{Kn^{-\frac{1}{2+s}}+\tilde{K}_2\sqrt{\frac{t}{n}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}=0$. Moreover, if $\tau<\tau_6$, the estimator $\hat{f}_{\ell_2}$ of block-$\ell_2$ MKL has the following convergence rate with probability at least $1-e^{-t}-n^{-1}$ for all $t\geq\log\log(R\sqrt{n})+\log M$:

\[
(\text{block-}\ell_2\text{ MKL})\qquad \|\hat{f}_{\ell_2}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(n^{\tau(b+\frac{2}{2+s})-\gamma_5}+\Big({\lambda_2^{(n)}}^2+\frac{M^{1+b}}{n}\Big)t\Big), \quad\text{where}\quad \gamma_5=\frac{2}{2+s}, \tag{13}
\]

with $\lambda_2^{(n)}=\max\{K(\frac{M}{n})^{\frac{1}{2+s}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_1^{(n)}=0$.

In all the convergence rates presented in Theorems 2 and 3, the leading terms are those that do not contain $t$. The convergence order of the terms containing $t$ is faster than that of the leading terms, and they are thus negligible.

By simple calculation, we can confirm that elastic-net MKL always converges faster than block-$\ell_1$ MKL and block-$\ell_2$ MKL if $\beta$ and $M$ satisfy the conditions of Theorem 2. The convergence rate of elastic-net MKL becomes identical to that of block-$\ell_2$ MKL and block-$\ell_1$ MKL at the two extreme points $\tau=\tau_1$ and $\tau=\tau_4$ of the interval, respectively. Outside this region, block-$\ell_1$ MKL or block-$\ell_2$ MKL has a faster convergence rate than elastic-net MKL. Moreover, at $\tau=\tau_2$, the convergence rates (9) and (10) of elastic-net MKL are identical, and at $\tau=\tau_3$, the convergence rates (10) and (11) are identical. The relation between the most preferred method and the growth rate $\tau$ of the number of kernels is illustrated in Figure 1.

The condition $\tau_1<\tau<\tau_4$ in Theorem 2 indicates that when the number of kernels is neither too small nor too large, the 'intermediate' effect of elastic-net MKL becomes advantageous. Roughly speaking, if $M$ is large, sparsity is needed to ensure convergence, and thus block-$\ell_1$ MKL performs best. On the other hand, if $M$ is small, there is no need to make the solution sparse, and thus block-$\ell_2$ MKL becomes the best. For intermediate $M$, elastic-net MKL is the best.

The condition $2\beta(1-s)<s(b-1)$ in Theorem 2 ensures the existence of an $M$ that satisfies the condition of the theorem, i.e., $\tau_1<\tau_2<\tau_3<\tau_4$. It can be seen that as $b$ becomes large (i.e., as the conditioning of the problem becomes worse), the range of $\beta$ and $M$ in which elastic-net MKL performs better than block-$\ell_1$ MKL and block-$\ell_2$ MKL becomes larger. This indicates that the worse the conditioning of the problem, the more important it is to control the balance of $\lambda_1^{(n)}$ and $\lambda_2^{(n)}$ appropriately.

[Figure omitted in this conversion.]
Figure 1: Relation between the convergence rate and the number of kernels. If the truth is intermediately sparse (the growth rate $\tau$ of the number of kernels is between $\tau_1$ and $\tau_5$), then elastic-net MKL performs best. At the edges of the interval, the convergence rate of elastic-net MKL coincides with that of block-$\ell_1$ MKL or block-$\ell_2$ MKL.

4 Support Consistency of Elastic-net MKL

In this section, we derive necessary and sufficient conditions for the statistical support consistency of the estimated sparsity pattern, i.e., for the probability of $\{m\mid\|\hat{f}_m\|_{\mathcal{H}_m}\neq 0\}=I_0$ to go to 1 as the number of samples $n$ tends to infinity. Due to the additional squared regularization term, the necessary condition for the support consistency of elastic-net MKL is shown to be weaker than that for block-$\ell_1$ MKL (Bach, 2008). In this section, we assume $M$ and $d=|I_0|$ are fixed with respect to the number of samples $n$.

Let $\mathcal{H}_I$ be the restriction of $\mathcal{H}_1\oplus\dots\oplus\mathcal{H}_M$ to the index set $I$. Since $\mathrm{E}_X[k_m(X,X)]<\infty$ for all $m$ (from assumption (A2)), we define the (non-centered) cross-covariance operator $\Sigma_{I,J}:\mathcal{H}_J\to\mathcal{H}_I$ as the bounded linear operator such that

\[
\langle f_I,\Sigma_{I,J}g_J\rangle_{\mathcal{H}_I} = \sum_{m\in I}\sum_{m'\in J}\langle f_m,\Sigma_{m,m'}g_{m'}\rangle_{\mathcal{H}_m} = \sum_{m\in I}\sum_{m'\in J}\mathrm{E}_X[f_m(X)g_{m'}(X)], \tag{14}
\]

for all $f_I=(f_m)_{m\in I}\in\mathcal{H}_I$ and $g_J=(g_{m'})_{m'\in J}\in\mathcal{H}_J$. (If one fits a function with a constant offset ($f(x)+b$ instead of $f(x)$) as in Bach (2008), then the centered version of the cross-covariance operator is required instead of the non-centered version, i.e., $\langle f_m,\Sigma_{m,m'}g_{m'}\rangle_{\mathcal{H}_m}=\mathrm{E}_X[(f_m(X)-\mathrm{E}_X[f_m])(g_{m'}(X)-\mathrm{E}_X[g_{m'}])]$. However, this difference is not essential because, without loss of generality, one can consider a situation where $\mathrm{E}_Y[Y]=0$ and $\mathrm{E}_X[f_m(X)]=0$ for all $f_m\in\mathcal{H}_m$ by centering all the functions.) See Baker (1973) for the details of the cross-covariance operator $(f,g)\mapsto\mathrm{cov}(f(X),g(X))$.

Moreover, we define the bounded (non-centered) cross-correlation operators $V_{l,m}$ by $\Sigma_{l,l}^{1/2}V_{l,m}\Sigma_{m,m}^{1/2}=\Sigma_{l,m}$ (such a bounded operator always exists (Baker, 1973)). The joint cross-correlation operator $V_{I,J}:\mathcal{H}_J\to\mathcal{H}_I$ is defined analogously to $\Sigma_{I,J}$.

In this section, we assume, in addition to the basic assumptions (A1-A3), that

  1. $\mathrm{(A10)}$

    all $V_{l,m}$ are compact and the joint correlation operator $V$ is invertible.

Let $\hat{I}$ be the set of indices of active kernels in the estimator $\hat{f}\in\mathcal{H}$ of elastic-net MKL: $\hat{I}:=\{m\mid\|\hat{f}_m\|_{\mathcal{H}_m}>0\}$. Let $D:=\mathrm{Diag}(\|f^*_m\|_{\mathcal{H}_m}^{-1})=\mathrm{Diag}((\|f^*_m\|_{\mathcal{H}_m}^{-1})_{m\in I_0})$, where $\mathrm{Diag}$ denotes the $|I_0|\times|I_0|$ block-diagonal operator with the operators $\|f^*_m\|_{\mathcal{H}_m}^{-1}\mathbf{I}_{\mathcal{H}_m}$ $(m\in I_0)$ on the diagonal blocks. In this section, we assume that the true sparsity pattern $I_0$ and the number of kernels $M$ are fixed independently of the number of samples $n$.

The norm of $f\in\mathcal{H}$ is defined by $\|f\|_{\mathcal{H}}:=\sqrt{\sum_{m=1}^M\|f_m\|_{\mathcal{H}_m}^2}$, and similarly that of $f_I\in\mathcal{H}_I$ is defined by $\|f_I\|_{\mathcal{H}_I}:=\sqrt{\sum_{m\in I}\|f_m\|_{\mathcal{H}_m}^2}$. The following theorem gives a sufficient condition for the support consistency of sparsity patterns.

Theorem 4

Suppose $\lambda_2^{(n)}>0$, $\lambda_1^{(n)}\to 0$, $\lambda_2^{(n)}\to 0$, $\lambda_1^{(n)}\sqrt{n}\to\infty$, and

\[
\limsup_n\left\|\Sigma_{m,I_0}\big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}\big)^{-1}\Big(D+2\frac{\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}\right\|_{\mathcal{H}_m} < 1 \qquad (\forall m\in J=I_0^c). \tag{15}
\]

Then, under assumptions (A1-A3, A10), $\|\hat{f}-f^*\|_{\mathcal{H}}\stackrel{p}{\to}0$ and $\hat{I}\stackrel{p}{\to}I_0$. (For random variables $x_n$ and $y$, $x_n\stackrel{p}{\to}y$ denotes convergence in probability, i.e., the probability of $|x_n-y|>\epsilon$ goes to 0 for every $\epsilon>0$ as the number of samples $n$ tends to infinity.)

The condition $\lambda_2^{(n)}>0$ is imposed just for technical simplicity, to make $\Sigma_{I_0,I_0}+\lambda_2^{(n)}$ invertible. The condition $\lambda_1^{(n)}\sqrt{n}\to\infty$ means that $\lambda_1^{(n)}$ does not decrease too quickly. Condition (15) corresponds to an infinite-dimensional extension of the elastic-net 'irrepresentable' condition. In the paper of Zhao and Yu (2006), the irrepresentable condition was derived as a necessary and sufficient condition for the sign consistency of $\ell_1$ regularization when the number of parameters is finite. Its elastic-net version was derived in Yuan and Lin (2007), and it was extended to a situation where the number of parameters diverges as $n$ increases (Jia and Yu, 2010).
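To illustrate condition (15) in the simplest possible setting, the following sketch of ours evaluates its finite-dimensional analogue, in which each 'kernel' is a single coordinate, $\Sigma$ is an ordinary covariance matrix, and $D=\mathrm{Diag}(1/|\beta^*_j|)$. The covariance matrix, coefficients, and regularization values are all hypothetical choices for illustration. With $\lambda_2\to 0$ the quantity reduces to a lasso-type irrepresentable value, which exceeds $1$ here, while keeping $\lambda_2$ bounded away from zero (and $\lambda_2/\lambda_1$ small) pulls it below $1$:

```python
import numpy as np

# Finite-dimensional analogue of condition (15): one scalar coefficient per
# "kernel", Sigma a covariance matrix, D = Diag(1/|beta*_j|). All numbers
# below are hypothetical and chosen only to illustrate the comparison.
Sigma = np.array([[1.0, 0.5, 0.8],
                  [0.5, 1.0, 0.8],
                  [0.8, 0.8, 1.0]])
beta_star = np.array([1.0, 1.0, 0.0])   # true support I0 = {0, 1}
I0, m = [0, 1], 2                       # m ranges over J = I0^c

def lhs15(lam1, lam2):
    S00 = Sigma[np.ix_(I0, I0)] + lam2 * np.eye(len(I0))
    D = np.diag(1.0 / np.abs(beta_star[I0]))
    v = (D + (2 * lam2 / lam1) * np.eye(len(I0))) @ beta_star[I0]
    return abs(Sigma[m, I0] @ np.linalg.solve(S00, v))

print(lhs15(0.1, 1e-9))   # ~1.067 > 1: lasso-type condition fails
print(lhs15(50.0, 0.5))   # ~0.816 < 1: ridge term in the inverse eases it
```

This matches the discussion above: the ridge term inside the inverse eases the singularity of $\Sigma_{I_0,I_0}$, whereas the $2\lambda_2/\lambda_1$ term works against it, so the balance between the two regularization parameters matters.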

We also have a necessary condition for consistency.

Theorem 5

If $\|\hat{f}-f^*\|_{\mathcal{H}}\stackrel{p}{\to}0$ and $\hat{I}\stackrel{p}{\to}I_0$, then under assumptions (A1-A3, A10), there exist sequences $\lambda_1^{(n)},\lambda_2^{(n)}\to 0$ such that

\[
\limsup_n\left\|\Sigma_{m,I_0}\big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}\big)^{-1}\Big(D+2\frac{\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}\right\|_{\mathcal{H}_m} \leq 1 \qquad (\forall m\in J=I_0^c). \tag{16}
\]

Moreover, such $\lambda_1^{(n)}$ satisfies $\lambda_1^{(n)}\sqrt{n}\to\infty$.

The sufficient condition (15) contains the strict inequality ('$<$'), while similar conditions for ordinary (non-block) $\ell_1$ regularization or ordinary (non-block) elastic-net regularization contain the weak inequality ('$\leq$'). The strict inequality appears because each block contains multiple variables in group lasso and MKL (Bach, 2008).

The condition $\lambda_1^{(n)}\sqrt{n}\to\infty$ is necessary to ensure the RKHS-norm convergence $\|\hat{f}-f^*\|_{\mathcal{H}}\stackrel{p}{\to}0$. Roughly speaking, this means that the block-$\ell_1$ regularization term should be stronger than the noise level to suppress fluctuations caused by the noise.

It is worth noting that conditions (15) and (16) are weaker than the conditions for block-$\ell_1$ MKL presented in Bach (2008); the block-$\ell_1$ MKL irrepresentable conditions are

\[
\begin{cases}
\text{(Sufficient condition)} & \left\|\Sigma_{m,m}^{1/2}V_{m,I_0}V_{I_0,I_0}^{-1}Dg^*_{I_0}\right\|_{\mathcal{H}_m}<1 & (\forall m\in J),\\
\text{(Necessary condition)} & \left\|\Sigma_{m,m}^{1/2}V_{m,I_0}V_{I_0,I_0}^{-1}Dg^*_{I_0}\right\|_{\mathcal{H}_m}\leq 1 & (\forall m\in J).
\end{cases} \tag{17}
\]

(Note that in the original paper by Bach (2008), the RHS of (17) is $\sum_{m\in I_0}\|f^*_m\|_{\mathcal{H}_m}$ because the squared group-$\ell_1$ regularizer $(\sum_m\|f_m\|_{\mathcal{H}_m})^2$ was used. We can show that the squared formulation is actually equivalent to the non-squared formulation in the sense that there exists a one-to-one correspondence between the two formulations.)

This is because the group-$\ell_2$ regularization term eases the singularity of the problem. Examples in which elastic-nets successfully estimate the true sparsity pattern while $\ell_1$ regularization fails in parametric situations can be found in Jia and Yu (2010).

5 Conclusions

We provided three novel theoretical results on the support consistency and convergence rate of elastic-net MKL.

  1. (i)

    Elastic-net MKL was shown to be support consistent under a milder condition than block-$\ell_1$ MKL.

  2. (ii)

    A tighter convergence rate than existing bounds was derived for the situation where the truth is sparse.

  3. (iii)

    The convergence rates of block-$\ell_1$ MKL, elastic-net MKL, and block-$\ell_2$ MKL when the truth is near sparse were elucidated, and elastic-net MKL was shown to perform better when the decay rate $\beta$ is not large or the conditioning of the problem is bad.

Based on our theoretical findings, we conclude that the use of elastic-net regularization is recommended for MKL.

Elastic-net MKL can be regarded as 'intermediate' between block-$\ell_1$ MKL and block-$\ell_2$ MKL. Another popular intermediate variant is block-$\ell_p$ MKL for $1\leq p\leq 2$ (Kloft et al., 2009, Cortes et al., 2009). Elastic-net MKL and block-$\ell_p$ MKL are conceptually similar, but they have a notable difference: elastic-net MKL with $\lambda_1^{(n)}>0$ tends to produce sparse solutions, while block-$\ell_p$ MKL with $1<p\leq 2$ always produces dense solutions (i.e., all combination coefficients of the kernels are non-zero). The sparsity of elastic-net MKL would be advantageous when the true kernel combination is sparse, as we proved in this paper. However, when the true kernel combination is non-sparse, the difference/relation between elastic-net MKL and block-$\ell_p$ MKL is not yet clear. This needs to be investigated further in future work.

Appendix A Proofs of the theorems

For a function $f$ on $\mathcal{X}\times\mathbb{R}$, we define $P_nf:=\frac{1}{n}\sum_{i=1}^nf(x_i,y_i)$ and $Pf:=\mathrm{E}_{X,Y}[f(X,Y)]$. For a function $f_I\in\mathcal{H}_I$, we define $\|f_I\|_{\ell_1}:=\sum_{m\in I}\|f_m\|_{\mathcal{H}_m}$, and for $f\in\mathcal{H}$ we write $\|f\|_{\ell_1}:=\sum_{m=1}^M\|f_m\|_{\mathcal{H}_m}$. Similarly, we define $\|f_I\|_{\ell_2}^2:=\sum_{m\in I}\|f_m\|_{\mathcal{H}_m}^2$ for $f_I\in\mathcal{H}_I$, and for $f\in\mathcal{H}$ we write $\|f\|_{\ell_2}^2:=\sum_{m=1}^M\|f_m\|_{\mathcal{H}_m}^2$. We write $\max\{a,b\}$ as $a\vee b$.

Lemma 6

For all $I\subseteq\{1,\dots,M\}$, we have

\[
\|f\|_{L_2(\Pi)}^2 \geq (1-\rho(I)^2)\kappa(I)\Big(\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2\Big). \tag{18}
\]

Proof: For $J=I^c$, we have

\[
Pf^2 = \|f_I\|_{L_2(\Pi)}^2 + 2\langle f_I,f_J\rangle_{L_2(\Pi)} + \|f_J\|_{L_2(\Pi)}^2 \geq \|f_I\|_{L_2(\Pi)}^2 - 2\rho(I)\|f_I\|_{L_2(\Pi)}\|f_J\|_{L_2(\Pi)} + \|f_J\|_{L_2(\Pi)}^2
\]
\[
\geq (1-\rho(I)^2)\|f_I\|_{L_2(\Pi)}^2 \geq (1-\rho(I)^2)\kappa(I)\Big(\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2\Big), \tag{19}
\]

where we used Schwarz's inequality in the last line. $\Box$

The following lemma gives an upper bound on $\sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m}$ that holds with high probability. This is an extension of Theorem 1 of Koltchinskii and Yuan (2008). The proof is given in Appendix B.

Lemma 7

There exists a constant $F$ depending only on $L$ in (A1) such that, if $\lambda_1^{(n)}\geq F\sqrt{\frac{\log(Mn)}{n}}$, then for $r=\frac{\lambda_1^{(n)}}{\lambda_1^{(n)}\vee\lambda_2^{(n)}}$, with probability $1-n^{-1}$,

\[
\sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m} \leq M^{\frac{1-r}{2-r}}\Big(3\sum_{m=1}^M\|f^*_m\|_{\mathcal{H}_m}+3\sum_{m=1}^M\|f^*_m\|_{\mathcal{H}_m}^2\Big)^{\frac{1}{2-r}}.
\]

Moreover, if $\lambda_2^{(n)}\geq F\sqrt{\frac{\log(Mn)}{n}}$ and $\lambda_2^{(n)}\geq\lambda_1^{(n)}$, we have

\[
\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m} \leq M\Big(3/2+2\max_m\|f^*_m\|_{\mathcal{H}_m}\Big).
\]

The following lemma gives a basic inequality that is the starting point of the subsequent analyses. The proof is given in Appendix B.

Lemma 8

Suppose $\lambda_1^{(n)}\vee\lambda_2^{(n)}\geq F\sqrt{\frac{\log(Mn)}{n}}$, where $F$ is the constant appearing in Lemma 7. Then there exist constants $\tilde{K}_1$ and $\tilde{K}_2$ depending only on $L$ in (A1), $R$ in (A4), and $s$ and $C_2$ in $\mathrm{(A5)}$, such that for all $I\subseteq\{1,\dots,M\}$ and all $t\geq\log\log(R\sqrt{n})+\log M$, with probability at least $1-e^{-t}-n^{-1}$,

\[
\frac{1}{2}\|\hat{f}-f^*\|_{L_2(\Pi)}^2 + \lambda_2^{(n)}\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^2 + \lambda_2^{(n)}\sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m}^2 + \Big(\lambda_1^{(n)}-\hat{\gamma}_n-\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m}
\]
\[
\leq \tilde{K}_1(1+\|\hat{f}-f^*\|_{\ell_1})\Big(\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}\vee\frac{\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}}{n}\Big)
\]
\[
+\sum_{m\in I}\Big(\lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}+2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m}+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}
\]
\[
+\lambda_2^{(n)}\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m}^2+\Big(\lambda_1^{(n)}+\hat{\gamma}_n+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m}, \tag{20}
\]

where $J=I^c$, $\gamma_n:=\frac{\tilde{K}_1}{\sqrt{n}}$, and $\hat{\gamma}_n:=\gamma_n(1+\|\hat{f}-f^*\|_\infty)$.

The above lemma is derived by the peeling device, or localization method. Details of these techniques can be found in, for example, Bartlett et al. (2005), Koltchinskii (2006), Mendelson (2002), and van de Geer (2000).

Proof: (Theorem 1) Since $\lambda_1^{(n)}\geq F\sqrt{\frac{\log(Mn)}{n}}$, we can assume that inequality (20) is satisfied with $I=I_0$. For notational simplicity, $I$ denotes $I_0$ in this proof. In addition, since $\lambda_1^{(n)}\geq\lambda_2^{(n)}$, Lemma 7 gives $\|\hat{f}\|_\infty\leq\sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m}\leq 3R$ (with probability $1-n^{-1}$). Note that $\|f^*_m\|_{\mathcal{H}_m}=0$ for all $m\in J=I^c=I_0^c$, and that $\hat{\gamma}_n+\tilde{K}_2\sqrt{\frac{t}{n}}\leq\max\{Kn^{-\frac{1}{2+s}}+\tilde{K}_2\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\}=\lambda_1^{(n)}$ by taking $K$ sufficiently large. Therefore, by inequality (20), we have

\[
\frac{1}{2}\|\hat{f}-f^*\|_{L_2(\Pi)}^2+\lambda_2^{(n)}\|\hat{f}_I-f^*_I\|_{\ell_2}^2 \leq K_1\Big(\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}+\frac{t}{n}\Big)
\]
\[
+\sum_{m\in I}\Big(\lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}+2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m}+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}, \tag{21}
\]

where $K_1$ stands for $\tilde{K}_1(1+3R)$. (Here we omitted the term $\sum_{m\in I}n^{-\frac{1}{1+s}}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}$ for simplicity; one can show that this term is negligible.)

By Hölder's inequality, the first term on the RHS of the above inequality can be bounded as

\[
K_1\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}} \leq K_1\frac{(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)})^{1-s}(\|\hat{f}_I-f^*_I\|_{\ell_1})^{s}}{\sqrt{n}}
\]
\[
\leq \sqrt{d}\,K_1\frac{(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2)^{\frac{1-s}{2}}(\|\hat{f}_I-f^*_I\|_{\ell_2}^2)^{\frac{s}{2}}}{\sqrt{n}}.
\]

Applying Young's inequality, the last expression can be bounded by

\[
\frac{K_1(\lambda_2^{(n)}/2)^{-\frac{s}{2}}\sqrt{d}}{\sqrt{n}}\Big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2\Big)^{\frac{1-s}{2}}\times(\lambda_2^{(n)}/2)^{\frac{s}{2}}(\|\hat{f}_I-f^*_I\|_{\ell_2}^2)^{\frac{s}{2}}
\]
\[
\leq C(n^{-\frac{1}{2}}\sqrt{d}\,{\lambda_2^{(n)}}^{-\frac{s}{2}})^{\frac{2}{2-s}}\Big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2\Big)^{\frac{1-s}{2-s}}+\frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|_{\ell_2}^2
\]
\[
\leq C[(1-\rho^2(I))\kappa(I)]^{-1}n^{-1}d\,{\lambda_2^{(n)}}^{-s}+\frac{(1-\rho^2(I))\kappa(I)}{8}\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2+\frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|_{\ell_2}^2
\]
\[
\leq Cn^{-1}d\,{\lambda_2^{(n)}}^{-s}+\frac{1}{8}\|\hat{f}-f^*\|_{L_2(\Pi)}^2+\frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|_{\ell_2}^2, \tag{22}
\]

where $C$ denotes a constant that is independent of $d$ and $n$ and may change from line to line, and we used Lemma 6 in the last line. Similarly, by the inequality of arithmetic and geometric means, we obtain

\[
\sum_{m\in I}2\Big(\lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}+\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m}+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}
\]
\[
\leq C[(1-\rho^2(I))\kappa(I)]^{-1}\sum_{m\in I}\left\{\Big(\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\Big)^2{\lambda_1^{(n)}}^2+\|g^*_m\|_{\mathcal{H}_m}^2{\lambda_2^{(n)}}^2+\frac{t}{n}\right\}+\frac{(1-\rho^2(I))\kappa(I)}{8}\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2
\]
\[
\leq C(d{\lambda_1^{(n)}}^2+{\lambda_2^{(n)}}^2+dt/n)+\frac{1}{8}\|\hat{f}-f^*\|_{L_2(\Pi)}^2, \tag{23}
\]

where we used Lemma 6 in the last line. Substituting (22) and (23) into (21), we have

\[
\frac{1}{4}\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(dn^{-1}{\lambda_2^{(n)}}^{-s}+d{\lambda_1^{(n)}}^2+{\lambda_2^{(n)}}^2+\frac{(d+1)t}{n}\Big). \tag{24}
\]

The minimum of the RHS with respect to $\lambda_1^{(n)},\lambda_2^{(n)}$ under the constraint $\lambda_1^{(n)}\geq\lambda_2^{(n)}$ is achieved, up to constants, by $\lambda_1^{(n)}=\max\{Kn^{-\frac{1}{2+s}}+\tilde{K}_2\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}=Kn^{-\frac{1}{2+s}}$. Thus we have the first assertion (7).
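As a sanity check on this choice (our own arithmetic): with $\lambda_1^{(n)}=\lambda_2^{(n)}=\lambda$ and the $t$-terms ignored, the RHS of (24) behaves like $dn^{-1}\lambda^{-s}+d\lambda^2$; balancing the two terms gives $\lambda^{2+s}\asymp n^{-1}$, i.e., $\lambda\asymp n^{-\frac{1}{2+s}}$, and substituting back yields $dn^{-\frac{2}{2+s}}$, exactly the leading term of (7).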

Next we show the second assertion (8). By Hölder's inequality and Young's inequality, we have

\[
K_1\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}} \leq K_1\frac{(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)})^{1-s}(\|\hat{f}_I-f^*_I\|_{\ell_1})^{s}}{\sqrt{n}}
\]
\[
\leq C\tilde{\lambda}^{-\frac{s}{1-s}}n^{-\frac{1}{2(1-s)}}\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}+\frac{\tilde{\lambda}}{2}\|\hat{f}_I-f^*_I\|_{\ell_1}
\]
\[
\leq Cd\tilde{\lambda}^{-\frac{2s}{1-s}}n^{-\frac{1}{1-s}}+\frac{1}{8}\|\hat{f}-f^*\|_{L_2(\Pi)}^2+\frac{\tilde{\lambda}}{2}(\|\hat{f}_I\|_{\ell_1}+\|f^*_I\|_{\ell_1}), \tag{25}
\]

where $\tilde{\lambda}>0$ is an arbitrary positive real. By substituting (25) and (23) into (21), we have

\[
\frac{1}{4}\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(d\tilde{\lambda}^{-\frac{2s}{1-s}}n^{-\frac{1}{1-s}}+\tilde{\lambda}+d{\lambda_1^{(n)}}^2+{\lambda_2^{(n)}}^2+\frac{(d+1)t}{n}\Big).
\]

This is minimized by $\tilde{\lambda}=Cd^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}$, $\lambda_1^{(n)}=\big(\frac{2\tilde{K}_1(1+3R)}{\sqrt{n}}+\tilde{K}_2\sqrt{\frac{t}{n}}\big)\vee F\sqrt{\frac{\log(Mn)}{n}}\geq\big(2\hat{\gamma}_n+\tilde{K}_2\sqrt{\frac{t}{n}}\big)\vee F\sqrt{\frac{\log(Mn)}{n}}$, and $\lambda_2^{(n)}\leq\lambda_1^{(n)}$. Thus we obtain the assertion. $\Box$

Proof: (Theorem 2) Let $I_d:=\{1,\dots,d\}$ and $J_d=I_d^c=\{d+1,\dots,M\}$. By assumption (A7), we have $\sum_{m\in J_d}\|f^*_m\|_{\mathcal{H}_m}^2\leq\frac{C_3}{2\beta-1}d^{1-2\beta}$ and $\sum_{m\in J_d}\|f^*_m\|_{\mathcal{H}_m}\leq\frac{C_3}{\beta-1}d^{1-\beta}$. Therefore, Lemma 8 gives

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2+\lambda_2^{(n)}\|\hat{f}_{I_d}-f^*_{I_d}\|_{\ell_2}^2+\lambda_2^{(n)}\|\hat{f}_{J_d}\|_{\ell_2}^2
\]
\[
\leq K_1\Big(\sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}}{n}\Big)
\]
\[
+K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Big(\sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}}{n}\Big)
\]
\[
+\sum_{m\in I_d}\Big(\lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}+2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m}+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}
\]
\[
+C\Big(\lambda_2^{(n)}d^{1-2\beta}+\Big(\lambda_1^{(n)}+\hat{\gamma}_n+\sqrt{\tfrac{t}{n}}\Big)d^{1-\beta}\Big), \tag{26}
\]

provided that $\lambda_1^{(n)}>\hat{\gamma}_n+\tilde{K}_2\sqrt{\frac{t}{n}}$ and $\lambda_1^{(n)}\geq F\sqrt{\frac{\log(Mn)}{n}}$. The second term can be upper bounded as

\[
K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Big(\sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}}{n}\Big)
\]
\[
\stackrel{\text{H\"older}}{\leq} K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Bigg\{\frac{(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)})^{1-s}(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m})^{s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}}{n}\Bigg\}
\]
\[
= K_1\frac{(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)})^{1-s}\big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\big)(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m})^{s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}^2}{n}
\]
\[
\stackrel{\text{Jensen}}{\leq} K_1\frac{d^{\frac{1-s}{2}}(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2)^{\frac{1-s}{2}}M^{\frac{1}{2}}\big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^2\big)^{\frac{1}{2}}d^{\frac{s}{2}}(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^2)^{\frac{s}{2}}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}^2}{n}
\]
\[
\stackrel{\text{Lemma 6}}{\leq} K_1\frac{\{(1-\rho(I_d)^2)\kappa(I_d)\}^{-\frac{1-s}{2}}(\|\hat{f}-f^*\|_{L_2(\Pi)}^2)^{\frac{1-s}{2}}d^{\frac{1}{2}}M^{\frac{1}{2}}\|\hat{f}-f^*\|_{\ell_2}^{1+s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}^2}{n}
\]
\[
\stackrel{\text{Young}}{\leq} \frac{\|\hat{f}-f^*\|_{L_2(\Pi)}^2}{2}+C\frac{\{(1-\rho(I_d)^2)\kappa(I_d)\}^{-\frac{1-s}{1+s}}d^{\frac{1}{1+s}}M^{\frac{1}{1+s}}\|\hat{f}-f^*\|_{\ell_2}^2}{n^{\frac{1}{1+s}}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}^2}{n}
\]
\[
\stackrel{\text{(A8)}}{\leq} \frac{\|\hat{f}-f^*\|_{L_2(\Pi)}^2}{2}+C\frac{d^{\frac{b(1-s)+1}{1+s}}M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\|\hat{f}-f^*\|_{\ell_2}^2+\frac{t\|\hat{f}-f^*\|_{\ell_1}^2}{n}.
\]

We will see that we may assume $C\frac{d^{\frac{b(1-s)+1}{1+s}}M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\leq\frac{\lambda_2^{(n)}}{4}$. Thus the second term on the RHS of the above inequality can be upper bounded as

\[
C\frac{d^{\frac{b(1-s)+1}{1+s}}M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\|\hat{f}-f^*\|_{\ell_2}^2 \leq \frac{\lambda_2^{(n)}}{4}\|\hat{f}-f^*\|_{\ell_2}^2 \leq \frac{\lambda_2^{(n)}}{4}\Big(\|\hat{f}_{I_d}-f^*_{I_d}\|_{\ell_2}^2+2\|\hat{f}_{J_d}\|_{\ell_2}^2+2\|f^*_{J_d}\|_{\ell_2}^2\Big)
\]
\[
\leq \frac{\lambda_2^{(n)}}{2}\Big(\|\hat{f}_{I_d}-f^*_{I_d}\|_{\ell_2}^2+\|\hat{f}_{J_d}\|_{\ell_2}^2+\|f^*_{J_d}\|_{\ell_2}^2\Big). \tag{27}
\]

Moreover, Lemma 7 gives $\frac{\|\hat{f}-f^*\|_{\ell_1}}{n}\leq\frac{C\sqrt{RM}}{n}\leq C{\lambda_2^{(n)}}^2$ and $\frac{\|\hat{f}-f^*\|_{\ell_1}^2}{n}\leq\frac{CRM}{n}\leq CR{\lambda_2^{(n)}}^2$. Therefore, (26) becomes

\[
\frac{1}{2}\|\hat{f}-f^*\|_{L_2(\Pi)}^2+\frac{\lambda_2^{(n)}}{2}\|\hat{f}_{I_d}-f^*_{I_d}\|_{\ell_2}^2+\frac{\lambda_2^{(n)}}{2}\|\hat{f}_{J_d}\|_{\ell_2}^2
\]
\[
\leq C\Big(\sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}+t{\lambda_2^{(n)}}^2\Big)
+\sum_{m\in I_d}\Big(C_1\lambda_1^{(n)}+2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m}+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}
\]
\[
+C\Big(\lambda_2^{(n)}d^{1-2\beta}+\Big(\lambda_1^{(n)}+\hat{\gamma}_n+\sqrt{\tfrac{t}{n}}\Big)d^{1-\beta}\Big).
\]

As in the proof of Theorem 1 (using the relations (23) and (22)), we have

12f^fL2(Π)2\displaystyle\frac{1}{2}\|\hat{f}-f^{*}\|_{L_{2}(\Pi)}^{2}
\displaystyle\leq C{[(1ρ2(Id))κ(Id)]1[dn1λ2(n)s+dλ1(n)2+λ2(n)2+tn]\displaystyle C\Bigg{\{}[(1-\rho^{2}(I_{d}))\kappa(I_{d})]^{-1}\left[dn^{-1}{\lambda_{2}^{(n)}}^{-s}+d{\lambda_{1}^{(n)}}^{2}+{\lambda_{2}^{(n)}}^{2}+\frac{t}{n}\right]
+λ2(n)d12β+(λ1(n)+γ^n+(t/n)12)d1β+tλ2(n)2}.\displaystyle+{\lambda_{2}^{(n)}}d^{1-2\beta}+({\lambda_{1}^{(n)}}+\hat{\gamma}_{n}+(t/n)^{\frac{1}{2}})d^{1-\beta}+t{\lambda_{2}^{(n)}}^{2}\Bigg{\}}.

Now using the assumption (1ρ2(Id))κ(Id)C4db(1-\rho^{2}(I_{d}))\kappa(I_{d})\geq C_{4}d^{-b}, we have

f^IdfIdL2(Π)2\displaystyle\|\hat{f}_{I_{d}}-f^{*}_{I_{d}}\|_{L_{2}(\Pi)}^{2} C[d1+bn1λ2(n)s+d1+bλ1(n)2+dbλ2(n)2+λ2(n)d12β+(λ1(n)+γ^n)d1β+tλ2(n)2\displaystyle\leq C\Bigg{[}d^{1+b}n^{-1}{\lambda_{2}^{(n)}}^{-s}\!\!+d^{1+b}{\lambda_{1}^{(n)}}^{2}\!\!+d^{b}{\lambda_{2}^{(n)}}^{2}\!\!+{\lambda_{2}^{(n)}}d^{1-2\beta}\!+({\lambda_{1}^{(n)}}+\hat{\gamma}_{n})d^{1-\beta}+t{\lambda_{2}^{(n)}}^{2}
+d1βtn+d1+btn].\displaystyle~~~~~~~~~~~~~~~~~~+d^{1-\beta}\sqrt{\frac{t}{n}}+\frac{d^{1+b}t}{n}\Bigg{]}. (28)

Recall that γ^n=K1~(1+f^f)/n\hat{\gamma}_{n}=\tilde{K_{1}}(1+\|\hat{f}-f^{*}\|_{\infty})/\sqrt{n}. Since λ1(n)Flog(Mn)n{\lambda_{1}^{(n)}}\geq F\sqrt{\frac{\log(Mn)}{n}}, Lemma 7 gives f^fM3R+RcM\|\hat{f}-f^{*}\|_{\infty}\leq\sqrt{M3R}+R\leq c\sqrt{M} with probability 1n11-n^{-1} for some constant c>0c>0. Therefore γ^ncM/n\hat{\gamma}_{n}\leq c\sqrt{M/n}. The values of λ1(n){\lambda_{1}^{(n)}} and λ2(n){\lambda_{2}^{(n)}} presented in the statement are obtained by minimizing the RHS of Eq. (28) under the constraint λ1(n)cM/n+K~2tnγ^n+K~2tn{\lambda_{1}^{(n)}}\geq c\sqrt{M/n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}\geq\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}} and Cdb(1s)+11+sM11+sn11+sλ2(n)4C\frac{d^{\frac{b(1-s)+1}{1+s}}M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\leq\frac{{\lambda_{2}^{(n)}}}{4}.

i) Suppose nb+3β1(2β+b)(2+s)1s>cM/nn^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}}>c\sqrt{M/n}, i.e., ττ2\tau\leq\tau_{2}. Then the RHS of the above inequality can be minimized by d=n1(2β+b)(2+s)1sd=n^{\frac{1}{(2\beta+b)(2+s)-1-s}}, λ2(n)=Kn2β+b1(2β+b)(2+s)1s{\lambda_{2}^{(n)}}=Kn^{-\frac{2\beta+b-1}{(2\beta+b)(2+s)-1-s}}, and λ1(n)=max{Knb+3β1(2β+b)(2+s)1s+K~2tn,Flog(Mn)n}{\lambda_{1}^{(n)}}=\max\{Kn^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}}+\tilde{K}_{2}\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\} up to constants independent of nn, where the leading terms are d1+bn1λ2(n)s+dbλ2(n)2+λ2(n)d12β+λ1(n)d1βd^{1+b}n^{-1}{\lambda_{2}^{(n)}}^{-s}\!\!+d^{b}{\lambda_{2}^{(n)}}^{2}\!\!+{\lambda_{2}^{(n)}}d^{1-2\beta}+{\lambda_{1}^{(n)}}d^{1-\beta}. It should be noted that λ1(n){\lambda_{1}^{(n)}} is greater than γ^n+K~2tn\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}} because nb+3β1(2β+b)(2+s)1s>cM/nγ^nn^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}}>c\sqrt{M/n}\geq\hat{\gamma}_{n}, therefore (26) is valid. Using ττ2\tau\leq\tau_{2}, we can show that Cdb(1s)+11+s(M/n)11+sλ2(n)/4Cd^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}}\leq{\lambda_{2}^{(n)}}/4 by setting the constant KK sufficiently large, hence (27) is valid. Moreover, since M>n1(2β+b)(2+s)1s=nτ1M>n^{\frac{1}{(2\beta+b)(2+s)-1-s}}=n^{\tau_{1}}, we can take dd as d=n1(2β+b)(2+s)1sMd=n^{\frac{1}{(2\beta+b)(2+s)-1-s}}\leq M.

ii) Suppose τ2ττ3\tau_{2}\leq\tau\leq\tau_{3}. Then the RHS of the above inequality can be minimized by d=(M2+sn2s)12{(2+s)(b+β)s}d=(M^{2+s}n^{2-s})^{\frac{1}{2\{(2+s)(b+\beta)-s\}}}, λ2(n)=K(Mn{2(b+β)1})12{(2+s)(b+β1)+2}{\lambda_{2}^{(n)}}=K(Mn^{-\{2(b+\beta)-1\}})^{\frac{1}{2\{(2+s)(b+\beta-1)+2\}}}, and λ1(n)=max{cM/n+K~2tn,Flog(Mn)n}γ^n+K~2tn{\lambda_{1}^{(n)}}=\max\left\{c\sqrt{M/n}+\tilde{K}_{2}\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\right\}\geq\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}} up to constants independent of nn, where the leading terms are d1+bn1λ2(n)s+dbλ2(n)2+λ1(n)d1βd^{1+b}n^{-1}{\lambda_{2}^{(n)}}^{-s}\!\!+d^{b}{\lambda_{2}^{(n)}}^{2}\!\!+{\lambda_{1}^{(n)}}d^{1-\beta}. Since λ1(n)γ^n+K~2tn{\lambda_{1}^{(n)}}\geq\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}, (26) is valid. Using ττ3\tau\leq\tau_{3}, we can show that Cdb(1s)+11+s(M/n)11+sλ2(n)/4Cd^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}}\leq{\lambda_{2}^{(n)}}/4 by setting the constant KK sufficiently large, hence (27) is valid. Moreover, since βs(b1)2(1s)\beta\leq\frac{s(b-1)}{2(1-s)} and τ2τ\tau_{2}\leq\tau, we can show that dMd\leq M.

iii) Suppose τ3ττ4\tau_{3}\leq\tau\leq\tau_{4}. We take λ1(n)=max{cM/n+K~2tn,Flog(Mn)n}{\lambda_{1}^{(n)}}=\max\left\{c\sqrt{M/n}+\tilde{K}_{2}\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\right\}. Then the RHS of the inequality (28) is minimized by λ2(n)=Kdλ1(n)dM/n{\lambda_{2}^{(n)}}=K\sqrt{d}{\lambda_{1}^{(n)}}\sim\sqrt{dM/n} and d=(nM)12(b+β)d=(\frac{n}{M})^{\frac{1}{2(b+\beta)}} up to constants, where the leading terms are dbλ2(n)2+d1+bλ1(n)2+λ1(n)d1βd^{b}{\lambda_{2}^{(n)}}^{2}+d^{1+b}{\lambda_{1}^{(n)}}^{2}\!\!+{\lambda_{1}^{(n)}}d^{1-\beta}. Note that since λ1(n)γ^n+K~2tn{\lambda_{1}^{(n)}}\geq\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}, (26) is valid. Using ττ4\tau\leq\tau_{4}, we can show that Cdb(1s)+11+s(M/n)11+sλ2(n)/4Cd^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}}\leq{\lambda_{2}^{(n)}}/4 by setting the constant KK sufficiently large, hence (27) is valid. Moreover, since βs(b1)2(1s)\beta\leq\frac{s(b-1)}{2(1-s)} and nτ3Mn^{\tau_{3}}\leq M, we have d=(nM)12(b+β)Md=(\frac{n}{M})^{\frac{1}{2(b+\beta)}}\leq M.

In all settings i) to iii), we can show that d1βnd1+bn\frac{d^{1-\beta}}{\sqrt{n}}\gtrsim\frac{d^{1+b}}{n}. Thus the terms involving tt are upper bounded as d1βtn+d1+btn+tλ2(n)2(d1βn+λ2(n)2)(t+t)d^{1-\beta}\sqrt{\frac{t}{n}}+\frac{d^{1+b}t}{n}+t{\lambda_{2}^{(n)}}^{2}\lesssim(\frac{d^{1-\beta}}{\sqrt{n}}+{\lambda_{2}^{(n)}}^{2})(\sqrt{t}+t). Through a simple calculation, d1βn\frac{d^{1-\beta}}{\sqrt{n}} is evaluated as i) d1βnn(2β+b)(2+s)3s+2β2{(2β+b)(2+s)1s}\frac{d^{1-\beta}}{\sqrt{n}}\simeq n^{-\frac{(2\beta+b)(2+s)-3-s+2\beta}{2\{(2\beta+b)(2+s)-1-s\}}}, ii) d1βn(M(2+s)(1β)n(4β+2b+sb2))12{(β+b)(2+s)s}\frac{d^{1-\beta}}{\sqrt{n}}\simeq(M^{(2+s)(1-\beta)}n^{-(4\beta+2b+sb-2)})^{\frac{1}{2\{(\beta+b)(2+s)-s\}}}, and iii) d1βn(Mβ1n12βb)12(β+b)\frac{d^{1-\beta}}{\sqrt{n}}\simeq(M^{\beta-1}n^{1-2\beta-b})^{\frac{1}{2(\beta+b)}} respectively. Thus we obtain the assertion.  
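
The regime analysis above rests on equalizing the leading terms of the bound (28). As an elementary sanity check, the following snippet (an illustration only; all constants are set to one and the parameter triples (s, b, beta) are arbitrary admissible values, not taken from the paper) verifies numerically that the choices of d, lambda_1^(n), and lambda_2^(n) in case i) give the same n-exponent to all four leading terms d^{1+b} n^{-1} lambda_2^{-s}, d^b lambda_2^2, lambda_2 d^{1-2 beta}, and lambda_1 d^{1-beta}:

def leading_exponents(s, b, beta):
    # Case i): d = n^{ed}, lam2 = n^{e2}, lam1 = n^{e1}; constants dropped.
    denom = (2*beta + b)*(2 + s) - 1 - s
    ed = 1.0/denom
    e2 = -(2*beta + b - 1)/denom
    e1 = -(b + 3*beta - 1)/denom
    return [(1 + b)*ed - 1 - s*e2,    # exponent of d^{1+b} n^{-1} lam2^{-s}
            b*ed + 2*e2,              # exponent of d^{b} lam2^{2}
            e2 + (1 - 2*beta)*ed,     # exponent of lam2 d^{1-2 beta}
            e1 + (1 - beta)*ed]       # exponent of lam1 d^{1-beta}

for (s, b, beta) in [(0.5, 1.0, 1.0), (0.8, 2.0, 1.5), (0.2, 1.5, 0.75)]:
    exps = leading_exponents(s, b, beta)
    assert max(exps) - min(exps) < 1e-12   # all four terms are balanced
    print((s, b, beta), "common n-exponent:", round(exps[0], 6))

The printed common exponent is, up to constants and the t-dependent terms, the order of the RHS of (28) under these choices.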

Proof: (Theorem 3)

(Convergence rate of block-1\ell_{1} MKL)

Note that since λ1(n)>λ2(n)=0{\lambda_{1}^{(n)}}>{\lambda_{2}^{(n)}}=0, we have λ1(n)λ1(n)λ2(n)=1\frac{{\lambda_{1}^{(n)}}}{{\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}}}=1. Therefore Lemma 7 gives m=1Mf^mm3R\sum_{m=1}^{M}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}\leq 3R with probability 1n11-n^{-1}. Thus γ^n=γn(1+f^f)γn(1+m=1Mf^mm+m=1Mfmm)γn(1+4R)\hat{\gamma}_{n}=\gamma_{n}(1+\|\hat{f}-f^{*}\|_{\infty})\leq\gamma_{n}(1+\sum_{m=1}^{M}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}+\sum_{m=1}^{M}\|f^{*}_{m}\|_{\mathcal{H}_{m}})\leq\gamma_{n}(1+4R).

When λ2(n)=0{\lambda_{2}^{(n)}}=0 and λ1(n)>(1+4R)γn+K~2tn{\lambda_{1}^{(n)}}>(1+4R)\gamma_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}, as in Lemma 8 we have with probability at least 1etn11-e^{-t}-n^{-1}

f^fL2(Π)2+λ1(n)mIf^mm\displaystyle\|\hat{f}\!-\!f^{*}\|_{L_{2}(\Pi)}^{2}\!+\!{\lambda_{1}^{(n)}}\!\!\sum_{m\in I}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}
\displaystyle\leq K1(mIf^mfmL2(Π)1sf^mfmmsn+tn)+λ1(n)mIfmm+2λ1(n)mJfmm\displaystyle K_{1}\Big{(}\sum_{m\in I}\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{1-s}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}+\frac{t}{n}\Big{)}+{\lambda_{1}^{(n)}}\sum_{m\in I}\|f^{*}_{m}\|_{\mathcal{H}_{m}}+2{\lambda_{1}^{(n)}}\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}
+K~2mItnfmf^mL2(Π),\displaystyle+\tilde{K}_{2}\sum_{m\in I}\sqrt{\frac{t}{n}}\|f^{*}_{m}-\hat{f}_{m}\|_{L_{2}(\Pi)}, (29)

for all tloglog(Rn)+logMt\geq\log\log(R\sqrt{n})+\log M.

We lower bound the term λ1(n)mI(f^mmfmm){\lambda_{1}^{(n)}}\!\!\sum_{m\in I}(\|\hat{f}_{m}\|_{\mathcal{H}_{m}}-\|f^{*}_{m}\|_{\mathcal{H}_{m}}) arising from the above inequality (29). There exists c1>0c_{1}>0 depending only on RR such that

fmm\displaystyle\|f_{m}\|_{\mathcal{H}_{m}} =fmfmm22fmfm,fmm+fmm2\displaystyle=\sqrt{\|f_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\langle f_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}+\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}}
c1fmfmm22fmm1|fmfm,fmm|+fmm\displaystyle\geq c_{1}\|f_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{-1}|\langle f_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}|+\|f^{*}_{m}\|_{\mathcal{H}_{m}} (30)

for all fmmf_{m}\in\mathcal{H}_{m} such that fmm3R\|f_{m}\|_{\mathcal{H}_{m}}\leq 3R and mI0m\in I_{0}. Recall that fm=Tm1/2gmf^{*}_{m}=T_{m}^{1/2}g^{*}_{m}; then we have fmmc1fmfmm22gmmfmmfmfmL2(Π)+fmm\|f_{m}\|_{\mathcal{H}_{m}}\geq c_{1}\|f_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\|f_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}+\|f^{*}_{m}\|_{\mathcal{H}_{m}}. Since maxmf^mm3R\max_{m}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}\leq 3R holds with probability 1n11-n^{-1},

f^mmc1f^mfmm22gmmfmmf^mfmL2(Π)+fmm,\|\hat{f}_{m}\|_{\mathcal{H}_{m}}\geq c_{1}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}+\|f^{*}_{m}\|_{\mathcal{H}_{m}},

with probability 1n11-n^{-1}.

Therefore by the inequality (29), we have with probability at least 1etn11-e^{-t}-n^{-1}

f^fL2(Π)2+λ1(n)mI(c1f^mfmm22gmmfmmf^mfmL2(Π)+fmm)\displaystyle\|\hat{f}\!-\!f^{*}\|_{L_{2}(\Pi)}^{2}\!+\!{\lambda_{1}^{(n)}}\!\!\sum_{m\in I}\!(c_{1}\|\hat{f}_{m}\!-\!f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}\!\!-\!\!2\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\|\hat{f}_{m}\!-\!f^{*}_{m}\|_{L_{2}(\Pi)}\!\!+\!\!\|f^{*}_{m}\|_{\mathcal{H}_{m}})
\displaystyle\leq K1(mIf^mfmL2(Π)1sf^mfmmsn+tn)+λ1(n)mIfmm+2λ1(n)mJfmm\displaystyle K_{1}\Big{(}\sum_{m\in I}\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{1-s}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}+\frac{t}{n}\Big{)}+{\lambda_{1}^{(n)}}\sum_{m\in I}\|f^{*}_{m}\|_{\mathcal{H}_{m}}+2{\lambda_{1}^{(n)}}\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}
+K~2mItnfmf^mL2(Π),\displaystyle+\tilde{K}_{2}\sum_{m\in I}\sqrt{\frac{t}{n}}\|f^{*}_{m}-\hat{f}_{m}\|_{L_{2}(\Pi)}, (31)

for all tloglog(Rn)+logMt\geq\log\log(R\sqrt{n})+\log M. Thus, using Young’s inequality, we obtain

f^fL2(Π)2\displaystyle\|\hat{f}-f^{*}\|_{L_{2}(\Pi)}^{2}\leq C[d1+bn1λ1(n)s+d1+bλ1(n)2+2λ1(n)d1β+t(1+d1+b)n].\displaystyle C\left[d^{1+b}n^{-1}{\lambda_{1}^{(n)}}^{-s}+d^{1+b}{\lambda_{1}^{(n)}}^{2}+2{\lambda_{1}^{(n)}}d^{1-\beta}+\frac{t(1+d^{1+b})}{n}\right].

The RHS is minimized by d=n1(2+s)(β+b)d=n^{\frac{1}{(2+s)(\beta+b)}} and λ1(n)=max{Kn12+s+K~2tn,Flog(Mn)n}{\lambda_{1}^{(n)}}=\max\left\{Kn^{-\frac{1}{2+s}}+\tilde{K}_{2}\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\right\} (up to constants independent of nn). Note that since the optimal λ1(n){\lambda_{1}^{(n)}} obtained above satisfies λ1(n)>(1+4R)γn+K~2tn{\lambda_{1}^{(n)}}>(1+4R)\gamma_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}} by taking KK sufficiently large, the inequality (31) is valid. Moreover the condition M>nτ5=nb+1(β+b){b(2+s)+2}M>n^{\tau_{5}}=n^{\frac{b+1}{(\beta+b)\{b(2+s)+2\}}} in the statement ensures d<Md<M. Finally we evaluate the terms including tt, that is, tnd1+b+tnd1β\frac{t}{n}d^{1+b}+\sqrt{\frac{t}{n}}d^{1-\beta}. We can check that 1nd1+b1nd1β\frac{1}{n}d^{1+b}\lesssim\sqrt{\frac{1}{n}}d^{1-\beta}. Therefore those terms are upper bounded as tnd1+b+tnd1β1nd1β(t+t)n4β+2b2+s(b+β)2(2+s)(b+β)(t+t)\frac{t}{n}d^{1+b}+\sqrt{\frac{t}{n}}d^{1-\beta}\lesssim\sqrt{\frac{1}{n}}d^{1-\beta}(\sqrt{t}+t)\simeq n^{-\frac{4\beta+2b-2+s(b+\beta)}{2(2+s)(b+\beta)}}(\sqrt{t}+t). Thus we obtain the assertion.
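
For the reader’s convenience, the balancing behind this choice can be written out explicitly (a routine calculation that is implicit above):

d^{1+b}n^{-1}{\lambda_{1}^{(n)}}^{-s}\asymp d^{1+b}{\lambda_{1}^{(n)}}^{2}\iff{\lambda_{1}^{(n)}}^{2+s}\asymp n^{-1}\iff{\lambda_{1}^{(n)}}\asymp n^{-\frac{1}{2+s}},
d^{1+b}{\lambda_{1}^{(n)}}^{2}\asymp{\lambda_{1}^{(n)}}d^{1-\beta}\iff d^{b+\beta}\asymp{\lambda_{1}^{(n)}}^{-1}\asymp n^{\frac{1}{2+s}}\iff d\asymp n^{\frac{1}{(2+s)(\beta+b)}},

and with these choices each leading term is of order d^{1+b}{\lambda_{1}^{(n)}}^{2}\asymp n^{-\frac{2\beta+b-1}{(2+s)(\beta+b)}}.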

(Convergence rate for block-2\ell_{2} MKL)

When λ1(n)=0{\lambda_{1}^{(n)}}=0, substituting IMI_{M} for II in Lemma 8, and using Young’s inequality, as in the proof of Theorem 2, the convergence rate of block-2\ell_{2} MKL can be evaluated as

f^IdfIdL2(Π)2C[M1+bn1λ2(n)s+Mbλ2(n)2+tλ2(n)2+tnM1+b],\displaystyle\|\hat{f}_{I_{d}}-f^{*}_{I_{d}}\|_{L_{2}(\Pi)}^{2}\leq C\left[M^{1+b}n^{-1}{\lambda_{2}^{(n)}}^{-s}+M^{b}{\lambda_{2}^{(n)}}^{2}+t{\lambda_{2}^{(n)}}^{2}+\frac{t}{n}M^{1+b}\right], (32)

with probability 1etn11-e^{-t}-n^{-1} (note that since I={1,,M}I=\{1,\dots,M\} (Ic=I^{c}=\emptyset), we do not need the condition λ1(n)γ^n+K~2tn{\lambda_{1}^{(n)}}\geq\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}). The choice λ2(n)=K(Mn)12+sFlog(Mn)n{\lambda_{2}^{(n)}}=K(\frac{M}{n})^{\frac{1}{2+s}}\vee F\sqrt{\frac{\log(Mn)}{n}} minimizes the RHS with respect to λ2(n){\lambda_{2}^{(n)}} up to constants. Using ττ6\tau\leq\tau_{6}, we can show that Mb(1s)+11+s(M/n)11+s=Mb(1s)+21+sn11+sλ2(n)M^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}}=M^{\frac{b(1-s)+2}{1+s}}n^{-\frac{1}{1+s}}\lesssim{\lambda_{2}^{(n)}} by setting the constant KK sufficiently large; hence (27) is valid.
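
The stated choice of lambda_2^(n) likewise comes from balancing the first two terms of (32) (a one-line calculation):

M^{1+b}n^{-1}{\lambda_{2}^{(n)}}^{-s}\asymp M^{b}{\lambda_{2}^{(n)}}^{2}\iff{\lambda_{2}^{(n)}}^{2+s}\asymp\frac{M}{n}\iff{\lambda_{2}^{(n)}}\asymp\Big(\frac{M}{n}\Big)^{\frac{1}{2+s}},

so that the leading term of (32) is of order M^{b}(M/n)^{\frac{2}{2+s}}.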

 

Appendix B Proof of Lemmas 7 and 8

Proof: (Lemma 7) Since f^\hat{f} minimizes the empirical risk (1), we have

1ni=1n(m=1M(f^m(xi)fm(xi)))2+λ1(n)f^1+λ2(n)f^22\displaystyle\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{m=1}^{M}(\hat{f}_{m}(x_{i})-f^{*}_{m}(x_{i}))\right)^{2}+{\lambda_{1}^{(n)}}\|\hat{f}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}\|_{\ell_{2}}^{2}
\displaystyle\leq 2nm=1Mi=1nϵi(f^m(xi)fm(xi))+λ1(n)f1+λ2(n)f22.\displaystyle\frac{2}{n}\sum_{m=1}^{M}\sum_{i=1}^{n}\epsilon_{i}(\hat{f}_{m}(x_{i})-f^{*}_{m}(x_{i}))+{\lambda_{1}^{(n)}}\|f^{*}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|f^{*}\|_{\ell_{2}}^{2}. (33)

By Proposition 1 (Bernstein’s inequality in Hilbert spaces, see also Theorem 6.14 of Steinwart (2008) for example), there exists a universal constant CC such that we have

1ni=1nϵi(f^m(xi)fm(xi))|1ni=1nϵikm(xi,)|f^mfmm\displaystyle\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(\hat{f}_{m}(x_{i})-f^{*}_{m}(x_{i}))\leq\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}k_{m}(x_{i},\cdot)\right|\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}
\displaystyle\leq CLlog(Mn)nf^mfmmCLlog(Mn)n(f^mm+fmm)\displaystyle CL\sqrt{\frac{\log(Mn)}{n}}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}\leq CL\sqrt{\frac{\log(Mn)}{n}}(\|\hat{f}_{m}\|_{\mathcal{H}_{m}}+\|f^{*}_{m}\|_{\mathcal{H}_{m}}) (34)

for all mm with probability at least 1n11-n^{-1}, where we used the assumption log(Mn)n1\frac{\log(Mn)}{n}\leq 1. If λ1(n)4CLlog(Mn)n{\lambda_{1}^{(n)}}\geq 4CL\sqrt{\frac{\log(Mn)}{n}}, then we have

λ1(n)f^1+λ2(n)f^223(λ1(n)λ2(n))(f1+f22),\displaystyle{\lambda_{1}^{(n)}}\|\hat{f}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}\|_{\ell_{2}}^{2}\leq 3({\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}})(\|f^{*}\|_{\ell_{1}}+\|f^{*}\|_{\ell_{2}}^{2}), (35)

with probability at least 1n11-n^{-1}. Set r=λ1(n)λ1(n)λ2(n)r=\frac{{\lambda_{1}^{(n)}}}{{\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}}}, then by Young’s inequality and Jensen’s inequality, the LHS of the above inequality (33) is lower bounded by

λ1(n)f^1+λ2(n)f^22\displaystyle{\lambda_{1}^{(n)}}\|\hat{f}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}\|_{\ell_{2}}^{2} (λ1(n)λ2(n))(m=1Mf^mm2r)\displaystyle\geq({\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}})(\sum_{m=1}^{M}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}^{2-r})
M(λ1(n)λ2(n))(1Mm=1Mf^mm2r)\displaystyle\geq M({\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}})\left(\frac{1}{M}\sum_{m=1}^{M}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}^{2-r}\right)
Mr1(λ1(n)λ2(n))f^12r.\displaystyle\geq M^{r-1}({\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}})\|\hat{f}\|_{\ell_{1}}^{2-r}. (36)

Therefore we have the first assertion by setting F=4CLF=4CL.
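
Both inequalities used in (36) can also be checked numerically. The following snippet (an illustration with random block norms standing in for the RKHS norms of the blocks; not part of the proof) verifies the pointwise Young step and the Jensen step:

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    lam1, lam2 = rng.uniform(0.01, 1.0, size=2)
    a = rng.uniform(0.0, 2.0, size=20)            # block norms ||f_m||, M = 20
    top = max(lam1, lam2)
    r = lam1/top
    lhs = lam1*a.sum() + lam2*(a**2).sum()        # lam1 ||f||_1 + lam2 ||f||_2^2
    mid = top*(a**(2 - r)).sum()                  # Young: a^{2-r} <= r a + (1-r) a^2
    low = len(a)**(r - 1)*top*a.sum()**(2 - r)    # Jensen with the convex map t -> t^{2-r}
    assert lhs + 1e-9 >= mid >= low - 1e-9
print("the chain (36) holds on all random draws")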

The second assertion can be shown as follows: by the inequality (33) we have

M1λ2(n)(f^f1)2λ2(n)f^f22\displaystyle M^{-1}{\lambda_{2}^{(n)}}\left(\|\hat{f}-f^{*}\|_{\ell_{1}}\right)^{2}\leq{\lambda_{2}^{(n)}}\|\hat{f}-f^{*}\|_{\ell_{2}}^{2}
\displaystyle\leq 2nm=1Mi=1nϵi(f^m(xi)fm(xi))+λ1(n)f^f1+2λ2(n)m=1Mfm,fmf^mm\displaystyle\frac{2}{n}\sum_{m=1}^{M}\sum_{i=1}^{n}\epsilon_{i}(\hat{f}_{m}(x_{i})-f^{*}_{m}(x_{i}))+{\lambda_{1}^{(n)}}\|\hat{f}-f^{*}\|_{\ell_{1}}+2{\lambda_{2}^{(n)}}\sum_{m=1}^{M}\langle f^{*}_{m},f^{*}_{m}-\hat{f}_{m}\rangle_{\mathcal{H}_{m}}
\displaystyle\leq λ2(n)(32+2maxmfmm)f^f1\displaystyle{\lambda_{2}^{(n)}}\left(\frac{3}{2}+2\max_{m}\|f^{*}_{m}\|_{\mathcal{H}_{m}}\right)\|\hat{f}-f^{*}\|_{\ell_{1}} (37)

with probability at least 1n11-n^{-1}, where we used (34), λ2(n)4CLlog(Mn)n{\lambda_{2}^{(n)}}\geq 4CL\sqrt{\frac{\log(Mn)}{n}} and λ2(n)λ1(n){\lambda_{2}^{(n)}}\geq{\lambda_{1}^{(n)}} in the last inequality.  

Proof: (Lemma 8) In what follows, we assume f^f1R¯\|\hat{f}-f^{*}\|_{\ell_{1}}\leq\bar{R} where R¯=4MR\bar{R}=4MR (the probability of this event is greater than 1n11-n^{-1} by Lemma 7). Since f^\hat{f} minimizes the empirical risk, we have

Pn(f^Y)2+λ1(n)f^1+λ2(n)f^22Pn(fY)2+λ1(n)f1+λ2(n)f22\displaystyle P_{n}(\hat{f}-Y)^{2}+{\lambda_{1}^{(n)}}\|\hat{f}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}\|_{\ell_{2}}^{2}\leq P_{n}(f^{*}-Y)^{2}+{\lambda_{1}^{(n)}}\|f^{*}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|f^{*}\|_{\ell_{2}}^{2}
\displaystyle\Rightarrow~ P(f^f)2+λ1(n)f^J1+λ2(n)f^J22(PPn)((ff^)2+2(f^f)ϵ)+\displaystyle P(\hat{f}-f^{*})^{2}+{\lambda_{1}^{(n)}}\|\hat{f}_{J}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}_{J}\|_{\ell_{2}}^{2}\leq(P-P_{n})((f^{*}-\hat{f})^{2}+2(\hat{f}-f^{*})\epsilon)+
+λ1(n)(fI1f^I1)+λ2(n)(fI22f^I22)+λ1(n)fJ1+λ2(n)fJ22.\displaystyle~~~~~~~~~~~~~~~~~~~~~~+{\lambda_{1}^{(n)}}(\|f^{*}_{I}\|_{\ell_{1}}-\|\hat{f}_{I}\|_{\ell_{1}})+{\lambda_{2}^{(n)}}(\|f^{*}_{I}\|_{\ell_{2}}^{2}-\|\hat{f}_{I}\|_{\ell_{2}}^{2})+{\lambda_{1}^{(n)}}\|f^{*}_{J}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|f^{*}_{J}\|_{\ell_{2}}^{2}. (38)

The second term in the RHS of the above inequality (38) can be bounded from above as

(fI1f^I1)\displaystyle(\|f^{*}_{I}\|_{\ell_{1}}-\|\hat{f}_{I}\|_{\ell_{1}}) mIfmm,f^mfmm\displaystyle\leq\sum_{m\in I}\langle\nabla\|f^{*}_{m}\|_{\mathcal{H}_{m}},\hat{f}_{m}-f^{*}_{m}\rangle_{\mathcal{H}_{m}}
=mIgm,Tm1/2(f^mfm)mfmmmIgmmfmmf^mfmL2(Π),\displaystyle=\sum_{m\in I}\frac{\langle g^{*}_{m},T_{m}^{1/2}(\hat{f}_{m}-f^{*}_{m})\rangle_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\leq\sum_{m\in I}\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}, (39)

where we used fm=Tm1/2gmf^{*}_{m}=T_{m}^{1/2}g^{*}_{m} for mII0m\in I\subseteq I_{0}. We also have

λ2(n)(fI22f^I22)\displaystyle{\lambda_{2}^{(n)}}(\|f^{*}_{I}\|_{\ell_{2}}^{2}-\|\hat{f}_{I}\|_{\ell_{2}}^{2}) =λ2(n)(mI2fm,fmf^mmf^IfI22)\displaystyle={\lambda_{2}^{(n)}}(\sum_{m\in I}2\langle f^{*}_{m},f^{*}_{m}-\hat{f}_{m}\rangle_{\mathcal{H}_{m}}-\|\hat{f}_{I}-f^{*}_{I}\|_{\ell_{2}}^{2})
λ2(n)(mI2gmmf^mfmL2(Π)f^IfI22).\displaystyle\leq{\lambda_{2}^{(n)}}(\sum_{m\in I}2\|g^{*}_{m}\|_{\mathcal{H}_{m}}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}-\|\hat{f}_{I}-f^{*}_{I}\|_{\ell_{2}}^{2}). (40)

Substituting (39) and (40) into (38), we obtain

f^fL2(Π)2+λ2(n)f^IfI22+λ1(n)f^J1+λ2(n)f^J22\displaystyle\|\hat{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+{\lambda_{2}^{(n)}}\|\hat{f}_{I}-f^{*}_{I}\|_{\ell_{2}}^{2}+{\lambda_{1}^{(n)}}\|\hat{f}_{J}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}_{J}\|_{\ell_{2}}^{2}
\displaystyle\leq (PPn)((ff^)2+2(f^f)ϵ)+mI(λ1(n)gmmfmm+2λ2(n)gmm)f^mfmL2(Π)\displaystyle(P-P_{n})((f^{*}-\hat{f})^{2}+2(\hat{f}-f^{*})\epsilon)+\sum_{m\in I}({\lambda_{1}^{(n)}}\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}+2{\lambda_{2}^{(n)}}\|g^{*}_{m}\|_{\mathcal{H}_{m}})\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}
+λ1(n)fJ1+λ2(n)fJ22.\displaystyle+{\lambda_{1}^{(n)}}\|f^{*}_{J}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|f^{*}_{J}\|_{\ell_{2}}^{2}. (41)

Finally we evaluate the first term (PPn)((ff^)2+2(f^f)ϵ)(P-P_{n})((f^{*}-\hat{f})^{2}+2(\hat{f}-f^{*})\epsilon) in the RHS of the above inequality (41) by applying Talagrand’s concentration inequality (Talagrand, 1996a, b, Bousquet, 2002). First we decompose (PPn)((ff^)2+2(f^f)ϵ)(P-P_{n})((f^{*}-\hat{f})^{2}+2(\hat{f}-f^{*})\epsilon) as

(PPn)((ff^)2+2(f^f)ϵ)=m=1M(PPn)((ff^)(fmf^m)+2(f^mfm)ϵ),\displaystyle(P-P_{n})((f^{*}-\hat{f})^{2}+2(\hat{f}-f^{*})\epsilon)=\sum_{m=1}^{M}(P-P_{n})((f^{*}-\hat{f})(f^{*}_{m}-\hat{f}_{m})+2(\hat{f}_{m}-f^{*}_{m})\epsilon),

and bound each term (PPn)((ff^)(fmf^m)+2(f^mfm)ϵ)(P-P_{n})((f^{*}-\hat{f})(f^{*}_{m}-\hat{f}_{m})+2(\hat{f}_{m}-f^{*}_{m})\epsilon) in the summation. Here suppose ff\in\mathcal{H} satisfies ff1R^\|f\|_{\infty}\leq\|f\|_{\ell_{1}}\leq\hat{R} for a constant R^(R¯)\hat{R}~(\leq\bar{R}). Since |ϵ|L|\epsilon|\leq L, we have

|ffm+2fmϵ|2(L+R^)|f|2(L+R^)fmm,\displaystyle|ff_{m}+2f_{m}\epsilon|\leq 2(L+\hat{R})|f|\leq 2(L+\hat{R})\|f_{m}\|_{\mathcal{H}_{m}}, (42a)
P(ffm+2fmϵ)2=P(f2fm2)+4P(fm2ϵ2)fL2(Π)2fmL2(Π)2+4L2fmL2(Π)2\displaystyle\sqrt{P(ff_{m}+2f_{m}\epsilon)^{2}}=\sqrt{P(f^{2}f_{m}^{2})+4P(f_{m}^{2}\epsilon^{2})}\leq\sqrt{\|f\|_{L_{2}(\Pi)}^{2}\|f_{m}\|_{L_{2}(\Pi)}^{2}+4L^{2}\|f_{m}\|_{L_{2}(\Pi)}^{2}}
fL2(Π)fmL2(Π)+2LfmL2(Π),\displaystyle\leq\|f\|_{L_{2}(\Pi)}\|f_{m}\|_{L_{2}(\Pi)}+2L\|f_{m}\|_{L_{2}(\Pi)}, (42b)

for all ff\in\mathcal{H}. Let Qnf:=1ni=1nεif(xi,yi)Q_{n}f:=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}f(x_{i},y_{i}), where {εi}i=1n{±1}n\{\varepsilon_{i}\}_{i=1}^{n}\in\{\pm 1\}^{n} are i.i.d. Rademacher random variables, and let Ψm(ξm,σm)\Psi_{m}(\xi_{m},\sigma_{m}) be

Ψm(ξm,σm):=E[sup{Qn(|fm|)fmm,fmmξm,fmL2(Π)σm}].\Psi_{m}(\xi_{m},\sigma_{m}):=\mathrm{E}[\sup\{Q_{n}(|f_{m}|)\mid f_{m}\in\mathcal{H}_{m},\|f_{m}\|_{\mathcal{H}_{m}}\leq\xi_{m},\|f_{m}\|_{L_{2}(\Pi)}\leq\sigma_{m}\}].

Then, by the spectral assumption (A5) (equivalently, the covering number condition), one can show that

Ψm(ξm,σm)Ks(σm1sξmsnn11+sξm)\Psi_{m}(\xi_{m},\sigma_{m})\leq K_{s}\left(\frac{\sigma_{m}^{1-s}\xi_{m}^{s}}{\sqrt{n}}\vee n^{-\frac{1}{1+s}}\xi_{m}\right)

where KsK_{s} is a constant that depends on ss and C2C_{2} (Mendelson, 2002). Let Ξm(ξm,σm):={fmmfmmξm,fmL2(Π)σm}\Xi_{m}(\xi_{m},\sigma_{m}):=\{f_{m}\in\mathcal{H}_{m}\mid\|f_{m}\|_{\mathcal{H}_{m}}\leq\xi_{m},\|f_{m}\|_{L_{2}(\Pi)}\leq\sigma_{m}\}. Now, by the Rademacher contraction inequality (Ledoux and Talagrand, 1991, Theorem 4.12), for given {ξm,σm}mI\{\xi_{m},\sigma_{m}\}_{m\in I} and R^\hat{R}, we have

E[sup{Qn(ffm+2fmϵ)f such that fmΞm(ξm,σm),f1R^}]\displaystyle\mathrm{E}[\sup\{Q_{n}(ff_{m}+2f_{m}\epsilon)\mid f\in\mathcal{H}\text{~such that~}f_{m}\in\Xi_{m}(\xi_{m},\sigma_{m}),~\|f\|_{\ell_{1}}\leq\hat{R}\}]
\displaystyle\leq 2(L+R^)Ψm(ξm,σm)2Ks(L+R^)(σm1sξmsnn11+sξm).\displaystyle 2(L+\hat{R})\Psi_{m}(\xi_{m},\sigma_{m})\leq 2K_{s}(L+\hat{R})\left(\frac{\sigma_{m}^{1-s}\xi_{m}^{s}}{\sqrt{n}}\vee n^{-\frac{1}{1+s}}\xi_{m}\right). (43)

Therefore by the symmetrization argument (van der Vaart and Wellner, 1996), we have

E[sup{(PnP)(ffm+2fmϵ)f such that fmΞm(ξm,σm),f1R^}]\displaystyle\mathrm{E}[\sup\{(P_{n}-P)(ff_{m}+2f_{m}\epsilon)\mid f\in\mathcal{H}\text{~such that~}f_{m}\in\Xi_{m}(\xi_{m},\sigma_{m}),~\|f\|_{\ell_{1}}\leq\hat{R}\}]
\displaystyle\leq 4Ks(L+R^)(σm1sξmsnn11+sξm).\displaystyle 4K_{s}(L+\hat{R})\left(\frac{\sigma_{m}^{1-s}\xi_{m}^{s}}{\sqrt{n}}\vee n^{-\frac{1}{1+s}}\xi_{m}\right). (44)

By Talagrand’s concentration inequality with (42) and (44), for given R^,σ¯,ξm,σm\hat{R},\bar{\sigma},\xi_{m},\sigma_{m} with probability at least 1et1-e^{-t} (t>0)(t>0), we have

supf:fL2(Π)σ¯,fR^,fmΞm(ξm,σm)(PnP)(ffm+2fmϵ)\displaystyle\sup_{f\in\mathcal{H}:\atop\|f\|_{L_{2}(\Pi)}\leq\bar{\sigma},\|f\|_{\infty}\leq\hat{R},f_{m}\in\Xi_{m}(\xi_{m},\sigma_{m})}(P_{n}-P)(ff_{m}+2f_{m}\epsilon)\leq
2(4Ks(L+R^)(σm1sξmsnξmn11+s)+tn(σ¯σm+2Lσm)+2(L+R^)ξmtn).\displaystyle~~~~~~\textstyle\sqrt{2}\left(4K_{s}(L+\hat{R})\left(\frac{\sigma_{m}^{1-s}\xi_{m}^{s}}{\sqrt{n}}\vee\frac{\xi_{m}}{n^{\frac{1}{1+s}}}\right)+\sqrt{\frac{t}{n}}(\bar{\sigma}\sigma_{m}+2L\sigma_{m})+2(L+\hat{R})\xi_{m}\frac{t}{n}\right). (45)

where we used the relation (42). Our next goal is to derive a uniform version of the above inequality over

1nR^R¯,1nσ¯R¯,1nMξmR¯and1nMσmR¯.\frac{1}{\sqrt{n}}\leq\hat{R}\leq\bar{R},~~\frac{1}{\sqrt{n}}\leq\bar{\sigma}\leq\bar{R},~~~\frac{1}{\sqrt{n}M}\leq\xi_{m}\leq\bar{R}~~~\text{and}~~~\frac{1}{\sqrt{n}M}\leq\sigma_{m}\leq\bar{R}.

By considering a grid {R^(k1),σ¯(k2),ξm(k3),σm(k4)}ki=0(i=1,,4)log2(MR¯n)\{\hat{R}^{(k_{1})},\bar{\sigma}^{(k_{2})},\xi_{m}^{(k_{3})},\sigma_{m}^{(k_{4})}\}_{k_{i}=0(i=1,\dots,4)}^{\log_{2}(M\bar{R}\sqrt{n})} such that R^(k):=R¯2k\hat{R}^{(k)}:=\bar{R}2^{-k}, σ¯(k):=R¯2k\bar{\sigma}^{(k)}:=\bar{R}2^{-k}, ξm(k):=R¯2k\xi_{m}^{(k)}:=\bar{R}2^{-k} and σm(k):=R¯2k\sigma_{m}^{(k)}:=\bar{R}2^{-k}, we have with probability at least 1(log(MR¯n))4et1(log(4RM2n))4et1-(\log(M\bar{R}\sqrt{n}))^{4}e^{-t}\geq 1-(\log(4RM^{2}\sqrt{n}))^{4}e^{-t}

(PnP)(ffm+2fmϵ)\displaystyle(P_{n}-P)(ff_{m}+2f_{m}\epsilon)\leq K(1+f1)(fmL2(Π)1sfmmsnfmmn11+s+tfmmn)\displaystyle K(1+\|f\|_{\ell_{1}})\left(\frac{\|f_{m}\|_{L_{2}(\Pi)}^{1-s}\|f_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}\vee\frac{\|f_{m}\|_{\mathcal{H}_{m}}}{n^{\frac{1}{1+s}}}+\frac{t\|f_{m}\|_{\mathcal{H}_{m}}}{n}\right)
+2tn(fL2(Π)fmL2(Π)+2LfmL2(Π)),\displaystyle~~~~~~~~~~~~~~~\!\!+\!\!\sqrt{\frac{2t}{n}}(\|f\|_{L_{2}(\Pi)}\|f_{m}\|_{L_{2}(\Pi)}+2L\|f_{m}\|_{L_{2}(\Pi)}),

for all ff\in\mathcal{H} such that fmmR¯\|f_{m}\|_{\mathcal{H}_{m}}\leq\bar{R} and f1R¯\|f\|_{\ell_{1}}\leq\bar{R}, and for all t>1t>1, where K=4(4KsL4Ks2L2)K=4(4K_{s}L\vee 4K_{s}\vee 2L\vee 2). Summing up this bound over m=1,,Mm=1,\dots,M, we obtain

(PnP)(f2+2fϵ)\displaystyle(P_{n}-P)(f^{2}+2f\epsilon)\leq K(1+f1)(m=1MfmL2(Π)1sfmmsnfmmn11+s+tf1n)\displaystyle K(1+\|f\|_{\ell_{1}})\left(\sum_{m=1}^{M}\frac{\|f_{m}\|_{L_{2}(\Pi)}^{1-s}\|f_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}\vee\frac{\|f_{m}\|_{\mathcal{H}_{m}}}{n^{\frac{1}{1+s}}}+\frac{t\|f\|_{\ell_{1}}}{n}\right)
+2tn(fL2(Π)m=1MfmL2(Π)+2Lm=1MfmL2(Π)),\displaystyle~~~~~~~~~~~~~~~\!\!+\!\!\sqrt{\frac{2t}{n}}\left(\|f\|_{L_{2}(\Pi)}\sum_{m=1}^{M}\|f_{m}\|_{L_{2}(\Pi)}+2L\sum_{m=1}^{M}\|f_{m}\|_{L_{2}(\Pi)}\right),

uniformly for all ff\in\mathcal{H} such that fmmR¯\|f_{m}\|_{\mathcal{H}_{m}}\leq\bar{R} (m\forall m) and f1R¯\|f\|_{\ell_{1}}\leq\bar{R} with probability at least 1M(log(4RM2n))4et1-M(\log(4RM^{2}\sqrt{n}))^{4}e^{-t}. Here set γn=Kn\gamma_{n}=\frac{K}{\sqrt{n}} and note that 2tnfL2(Π)m=1MfmL2(Π)12fL2(Π)2+tn(m=1MfmL2(Π))212fL2(Π)2+tn(f1)2\sqrt{\frac{2t}{n}}\|f\|_{L_{2}(\Pi)}\sum_{m=1}^{M}\|f_{m}\|_{L_{2}(\Pi)}\leq\frac{1}{2}\|f\|_{L_{2}(\Pi)}^{2}+\frac{t}{n}(\sum_{m=1}^{M}\|f_{m}\|_{L_{2}(\Pi)})^{2}\leq\frac{1}{2}\|f\|_{L_{2}(\Pi)}^{2}+\frac{t}{n}(\|f\|_{\ell_{1}})^{2} then we have

(PnP)(f2+2fϵ)\displaystyle(P_{n}-P)(f^{2}+2f\epsilon)\leq K(1+f1)[mI(fmL2(Π)1sfmmsnfmmn11+s)+2tf1n]\displaystyle K(1+\|f\|_{\ell_{1}})\Bigg{[}\sum_{m\in I}\left(\frac{\|f_{m}\|_{L_{2}(\Pi)}^{1-s}\|f_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}\vee\frac{\|f_{m}\|_{\mathcal{H}_{m}}}{n^{\frac{1}{1+s}}}\right)+\frac{2t\|f\|_{\ell_{1}}}{n}\Bigg{]}
+γn(1+f1)fJ1+12fL2(Π)2+22Ltnm=1MfmL2(Π).\displaystyle+\gamma_{n}(1+\|f\|_{\ell_{1}})\|f_{J}\|_{\ell_{1}}+\frac{1}{2}\|f\|_{L_{2}(\Pi)}^{2}+2\sqrt{2}L\sqrt{\frac{t}{n}}\sum_{m=1}^{M}\|f_{m}\|_{L_{2}(\Pi)}. (46)

for all ff\in\mathcal{H} such that fmmR¯\|f_{m}\|_{\mathcal{H}_{m}}\leq\bar{R} (m\forall m) and f1R¯\|f\|_{\ell_{1}}\leq\bar{R} with probability at least 1M(log(4RM2n))4et1-M(\log(4RM^{2}\sqrt{n}))^{4}e^{-t}. We now replace tt with t+5logM+4loglog(Rn)t+5\log M+4\log\log(R\sqrt{n}); then the probability 1M(log(4RnM2))4et1-M(\log(4R\sqrt{n}M^{2}))^{4}e^{-t} can be replaced with 1et1-e^{-t}, and t+5logM+4loglog(Rn)6tt+5\log M+4\log\log(R\sqrt{n})\leq 6t holds for all tlogM+loglog(Rn)t\geq\log M+\log\log(R\sqrt{n}). On the event where f^f1R¯\|\hat{f}-f^{*}\|_{\ell_{1}}\leq\bar{R} holds, substituting f^f\hat{f}-f^{*} for ff in (46) and replacing KK appropriately, (41) yields

12f^fL2(Π)2+λ2(n)mIf^IfIm2+λ2(n)mJf^mm2+(λ1(n)γ^n)mJf^mm\displaystyle\frac{1}{2}\|\hat{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I}\|\hat{f}_{I}-f^{*}_{I}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in J}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}^{2}+({\lambda_{1}^{(n)}}-\hat{\gamma}_{n})\sum_{m\in J}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}
\displaystyle\leq K~1(1+f^f1)(mIf^mfmL2(Π)1sf^mfmmsnf^mfmmn11+s+tf^f1n)\displaystyle\tilde{K}_{1}(1+\|\hat{f}-f^{*}\|_{\ell_{1}})\Big{(}\sum_{m\in I}\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{1-s}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}\vee\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}}{n^{\frac{1}{1+s}}}+\frac{t\|\hat{f}-f^{*}\|_{\ell_{1}}}{n}\Big{)}
+mI(λ1(n)gmmfmm+2λ2(n)gmm)f^mfmL2(Π)+λ2(n)mJfmm2+(λ1(n)+γ^n)mJfmm\displaystyle\!+\!\!\sum_{m\in I}\left(\!{\lambda_{1}^{(n)}}\!\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\!+\!2{\lambda_{2}^{(n)}}\|g^{*}_{m}\|_{\mathcal{H}_{m}}\!\!\right)\!\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}\!\!+\!{\lambda_{2}^{(n)}}\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}\!\!+\!({\lambda_{1}^{(n)}}\!\!+\!\hat{\gamma}_{n})\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}
+K~2tnm=1Mf^mfmL2(Π),\displaystyle+\tilde{K}_{2}\sqrt{\frac{t}{n}}\sum_{m=1}^{M}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}, (47)

where K~1\tilde{K}_{1} and K~2\tilde{K}_{2} are constants and γ^n=γn(1+f^f1)\hat{\gamma}_{n}=\gamma_{n}(1+\|\hat{f}-f^{*}\|_{\ell_{1}}). Finally, since K~2tnm=1Mf^mfmL2(Π)=K~2tn(mIf^mfmL2(Π)+mJf^mL2(Π)+mJfmL2(Π))K~2tn(mIf^mfmL2(Π)+mJf^mm+mJfmm)\tilde{K}_{2}\sqrt{\frac{t}{n}}\sum_{m=1}^{M}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}=\tilde{K}_{2}\sqrt{\frac{t}{n}}(\sum_{m\in I}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}+\sum_{m\in J}\|\hat{f}_{m}\|_{L_{2}(\Pi)}+\sum_{m\in J}\|f^{*}_{m}\|_{L_{2}(\Pi)})\leq\tilde{K}_{2}\sqrt{\frac{t}{n}}(\sum_{m\in I}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}+\sum_{m\in J}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}+\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}), (47) becomes

12f^fL2(Π)2+λ2(n)mIf^IfIm2+λ2(n)mJf^mm2+(λ1(n)γ^nK~2tn)mJf^mm\displaystyle\frac{1}{2}\|\hat{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I}\|\hat{f}_{I}-f^{*}_{I}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in J}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}^{2}+\left({\lambda_{1}^{(n)}}-\hat{\gamma}_{n}-\tilde{K}_{2}\sqrt{\frac{t}{n}}\right)\sum_{m\in J}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}
\displaystyle\leq K~1(1+f^f1)(mIf^mfmL2(Π)1sf^mfmmsnf^mfmmn11+s+tf^f1n)\displaystyle\tilde{K}_{1}(1+\|\hat{f}-f^{*}\|_{\ell_{1}})\Big{(}\sum_{m\in I}\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{1-s}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}\vee\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}}{n^{\frac{1}{1+s}}}+\frac{t\|\hat{f}-f^{*}\|_{\ell_{1}}}{n}\Big{)}
+mI(λ1(n)gmmfmm+2λ2(n)gmm+K~2tn)f^mfmL2(Π)\displaystyle\!+\!\!\sum_{m\in I}\left(\!{\lambda_{1}^{(n)}}\!\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\!+\!2{\lambda_{2}^{(n)}}\|g^{*}_{m}\|_{\mathcal{H}_{m}}\!\!+\tilde{K}_{2}\sqrt{\frac{t}{n}}\right)\!\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}
+λ2(n)mJfmm2+(λ1(n)+γ^n+K~2tn)mJfmm,\displaystyle\!\!+\!{\lambda_{2}^{(n)}}\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}\!\!+\!\left({\lambda_{1}^{(n)}}\!\!+\!\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}\right)\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}, (48)

which yields the assertion.

 
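
As a side remark, the local complexity bound on Psi_m(xi_m, sigma_m) used in the proof of Lemma 8 can be made concrete. Under the spectral assumption behind (A5), the eigenvalues of the operator T_m decay as mu_j ~ j^{-1/s}, and the expected supremum is controlled by sqrt(sum_j min(sigma^2, xi^2 mu_j))/sqrt(n), which scales as sigma^{1-s} xi^s / sqrt(n). The snippet below (a toy numerical check under this assumed decay; not part of the proofs) illustrates that the ratio of the two quantities stays essentially constant as sigma varies:

import numpy as np

s, n = 0.5, 10**4
mu = np.arange(1, 10**6 + 1, dtype=float)**(-1.0/s)   # assumed eigenvalue decay j^{-1/s}
xi = 1.0                                              # RKHS-norm radius
for sigma in [0.3, 0.1, 0.03, 0.01]:                  # L2(Pi)-norm radii
    complexity = np.sqrt(np.sum(np.minimum(sigma**2, xi**2*mu)))/np.sqrt(n)
    prediction = sigma**(1 - s)*xi**s/np.sqrt(n)
    print(f"sigma={sigma}: ratio = {complexity/prediction:.3f}")  # approximately constant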

Appendix C Proof of Theorems 4 and 5

We write the operator norm of SI,J:JIS_{I,J}:\mathcal{H}_{J}\to\mathcal{H}_{I} as SI,JI,J:=supgJJ,gJ0SI,JgJIgJJ\|S_{I,J}\|_{\mathcal{H}_{I},\mathcal{H}_{J}}:=\sup\limits_{g_{J}\in\mathcal{H}_{J},g_{J}\neq 0}\frac{\|S_{I,J}g_{J}\|_{\mathcal{H}_{I}}}{\|g_{J}\|_{\mathcal{H}_{J}}}.

Definition 9

For all 1m,mM1\leq m,m^{\prime}\leq M, we define the empirical (non-centered) cross-covariance operator Σ^m,m\hat{\Sigma}_{m,m^{\prime}} as follows:

fm,Σ^m,mgmm:=1ni=1nfm(xi)gm(xi),\langle f_{m},\hat{\Sigma}_{m,m^{\prime}}g_{m^{\prime}}\rangle_{\mathcal{H}_{m}}:=\frac{1}{n}\sum_{i=1}^{n}f_{m}(x_{i})g_{m^{\prime}}(x_{i}), (49)

where fmm,gmmf_{m}\in\mathcal{H}_{m},g_{m^{\prime}}\in\mathcal{H}_{m^{\prime}}. Analogous to the joint covariance operator Σ\Sigma, we define the joint empirical cross covariance operator Σ^:\hat{\Sigma}:\mathcal{H}\to\mathcal{H} as (Σ^h)m=l=1MΣ^m,lhl(\hat{\Sigma}h)_{m}=\sum_{l=1}^{M}\hat{\Sigma}_{m,l}h_{l}. We denote by Σ^m,ϵ\hat{\Sigma}_{m,\epsilon} the element of m\mathcal{H}_{m} such that

fm,Σ^m,ϵm:=1ni=1nϵifm(xi).\langle f_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}:=\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f_{m}(x_{i}).

Let R¯\bar{R} be a constant such that 8m=1Mfmm R¯8\sum_{m=1}^{M}\|f^{*}_{m}\|_{\mathcal{H}_{m}}\leq\bar{R}. We denote by FnF_{n} the objective function of elastic-net MKL

Fn(f):=1ni=1n(f(xi)yi)2+λ1(n)m=1Mfmm+λ2(n)m=1Mfmm2.F_{n}(f):=\frac{1}{n}\sum_{i=1}^{n}(f(x_{i})-y_{i})^{2}+{\lambda_{1}^{(n)}}\sum_{m=1}^{M}\|f_{m}\|_{\mathcal{H}_{m}}+{\lambda_{2}^{(n)}}\sum_{m=1}^{M}\|f_{m}\|_{\mathcal{H}_{m}}^{2}.
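
Although the argument below is purely theoretical, it may help to see F_n concretely. The following is a minimal finite-sample sketch (illustration only: the Gaussian base kernels, their widths, and the data-generating model are assumptions for the example, not taken from the paper), representing each f_m as f_m = K_m alpha_m so that ||f_m||_{H_m}^2 = alpha_m' K_m alpha_m:

import numpy as np

def gauss_kernel(x, width):
    # Gram matrix of a Gaussian kernel on 1-D inputs (assumed base kernel).
    d2 = (x[:, None] - x[None, :])**2
    return np.exp(-d2/(2.0*width**2))

def Fn(alphas, Ks, y, lam1, lam2):
    # Elastic-net MKL objective: empirical squared loss plus block-l1 and
    # block-l2 penalties on the RKHS norms of the components.
    n = len(y)
    pred = sum(K @ a for K, a in zip(Ks, alphas))
    h = [np.sqrt(max(a @ K @ a, 0.0)) for K, a in zip(Ks, alphas)]  # ||f_m||_{H_m}
    return ((pred - y) @ (pred - y))/n + lam1*sum(h) + lam2*sum(v*v for v in h)

rng = np.random.default_rng(0)
n, widths = 50, [0.1, 0.5, 2.0]          # M = 3 base kernels (assumed widths)
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(np.pi*x) + 0.1*rng.standard_normal(n)
Ks = [gauss_kernel(x, w) for w in widths]
alphas = [np.zeros(n) for _ in Ks]
print(Fn(alphas, Ks, y, lam1=0.1, lam2=0.1))  # objective at f = 0: ~ ||y||^2/n

Minimizing F_n over the alpha_m is a convex problem (e.g., by proximal gradient steps on the block norms); only the objective itself is referenced in the proofs here.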

Proof: (Theorem 4) Let f~mI0m\tilde{f}\in\oplus_{m\in I_{0}}\mathcal{H}_{m} be the minimizer of F~n\tilde{F}_{n}:

f~:=argminfI0F~n(f),\displaystyle\tilde{f}:=\mathop{\arg\min}_{f\in\mathcal{H}_{I_{0}}}\tilde{F}_{n}(f),
where F~n(f):=1ni=1n(f(xi)yi)2+λ1(n)mI0fmm+λ2(n)mI0fmm2.\displaystyle\tilde{F}_{n}(f):=\frac{1}{n}\sum_{i=1}^{n}(f(x_{i})-y_{i})^{2}+{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|f_{m}\|_{\mathcal{H}_{m}}+{\lambda_{2}^{(n)}}\sum_{m\in I_{0}}\|f_{m}\|_{\mathcal{H}_{m}}^{2}.

(Step 1) We first show that f~pf\tilde{f}\stackrel{{\scriptstyle p}}{{\to}}f^{*} with respect to the RKHS norm. Since λ1(n)n{\lambda_{1}^{(n)}}\sqrt{n}\to\infty, as in the proof of Lemma 7, the probability of m=1Mf~mfmmMR¯\sum_{m=1}^{M}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}\leq\sqrt{M}\bar{R} goes to 1 (this can be checked as follows: replacing log(Mn)n\sqrt{\frac{\log(Mn)}{n}} in Eq. (34) with log(M)λ1(n)\log(M){\lambda_{1}^{(n)}}, we see that Eq. (34) holds with probability 1exp(λ1(n)2n)1-\exp(-{\lambda_{1}^{(n)}}^{2}n)). There exists c1c_{1} depending only on MR¯\sqrt{M}\bar{R} such that

fmm=fmfmm22fmfm,fmm+fmm2\displaystyle\|f_{m}\|_{\mathcal{H}_{m}}=\sqrt{\|f_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\langle f_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}+\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}}
\displaystyle\geq c1fmfmm22fmm1|fmfm,fmm|+fmm\displaystyle c_{1}\|f_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{-1}|\langle f_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}|+\|f^{*}_{m}\|_{\mathcal{H}_{m}} (50)

for all mI0m\in I_{0} and all fmmf_{m}\in\mathcal{H}_{m} such that fmmMR¯\|f_{m}\|_{\mathcal{H}_{m}}\leq\sqrt{M}\bar{R}.

Since f~\tilde{f} minimizes F~n\tilde{F}_{n}, if m=1Mf~mfmmMR¯\sum_{m=1}^{M}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}\leq\sqrt{M}\bar{R} (the probability of which event goes to 1) we have

f~I0fI0,Σ^I0,I0(f~I0fI0)I0+c1λ1(n)mI0f~mfmm2+λ2(n)mI0f~mfmm2\displaystyle\langle\tilde{f}_{I_{0}}-f^{*}_{I_{0}},\hat{\Sigma}_{I_{0},I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})\rangle_{\mathcal{H}_{I_{0}}}+c_{1}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}
\displaystyle\leq 2Σ^I0,ϵ,f~fI0+2mI0(1fmmλ1(n)+λ2(n))|f~mfm,fmm|,\displaystyle 2\langle\hat{\Sigma}_{I_{0},\epsilon},\tilde{f}-f^{*}\rangle_{\mathcal{H}_{I_{0}}}+2\sum_{m\in I_{0}}\left(\frac{1}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}{\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}}\right)|\langle\tilde{f}_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}|, (51)

where we used the relation (50). By the assumption fm=Σm,m1/2gmf^{*}_{m}=\Sigma_{m,m}^{1/2}g^{*}_{m}, we have |f~mfm,fmm|gmmf~mfmL2(Π)|\langle\tilde{f}_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}|\leq\|g^{*}_{m}\|_{\mathcal{H}_{m}}\|\tilde{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}. By Lemma 10 and Lemma 11, we have

Σm,mΣ^m,mm,m=Op(1/n),Σ^I0,ϵI0=Op(1/n).\|\Sigma_{m,m^{\prime}}-\hat{\Sigma}_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}=O_{p}(1/\sqrt{n}),~~~\|\hat{\Sigma}_{I_{0},\epsilon}\|_{\mathcal{H}_{I_{0}}}=O_{p}(1/\sqrt{n}).

Substituting these inequalities into (51), we have

f~fL2(Π)2+c1λ1(n)mI0f~mfmm2+λ2(n)mI0f~mfmm2\displaystyle\|\tilde{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+c_{1}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}
\displaystyle\leq Op(mI0f~mfmmn+(λ1(n)+λ2(n))mI0f~mfmL2(Π)).\displaystyle O_{p}\left(\frac{\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}}{\sqrt{n}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}\right). (52)

Recall that the (non-centered) cross-correlation operator is invertible. Thus there exists a constant cc such that

f~fL2(Π)2=f~I0fI0,ΣI0,I0(f~I0fI0)=f~I0fI0,Diag(Σm,m1/2)VI0,I0Diag(Σm,m1/2)(f~I0fI0)I0\displaystyle\|\tilde{f}-f^{*}\|_{L_{2}(\Pi)}^{2}=\langle\tilde{f}_{I_{0}}-f^{*}_{I_{0}},\Sigma_{I_{0},I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})\rangle_{\mathcal{H}}=\langle\tilde{f}_{I_{0}}-f^{*}_{I_{0}},\mathrm{Diag}(\Sigma_{m,m}^{1/2})V_{I_{0},I_{0}}\mathrm{Diag}(\Sigma_{m,m}^{1/2})(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})\rangle_{\mathcal{H}_{I_{0}}}
\displaystyle\geq cmI0f~mfm,Σm,m(f~mfm)m=cmI0f~mfmL2(Π)2.\displaystyle c\sum_{m\in I_{0}}\langle\tilde{f}_{m}-f^{*}_{m},\Sigma_{m,m}(\tilde{f}_{m}-f^{*}_{m})\rangle_{\mathcal{H}_{m}}=c\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{2}.

Combining this with Eq. (52) and using ab(a2+b2)/2ab\leq(a^{2}+b^{2})/2, we obtain

f~fL2(Π)2+c1λ1(n)mI0f~mfmm2+λ2(n)mI0f~mfmm2\displaystyle\|\tilde{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+c_{1}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}
Op(mI0f~mfmmn+(λ1(n)+λ2(n))mI0f~mfmL2(Π))\displaystyle\leq O_{p}\left(\frac{\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}}{\sqrt{n}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}\right)
Op(1nλ1(n)+(λ1(n)+λ2(n))2)+c12λ1(n)mI0f~mfmm2+c2mI0f~mfmL2(Π)2\displaystyle\leq O_{p}\left(\frac{1}{n{\lambda_{1}^{(n)}}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{2}\right)+\frac{c_{1}}{2}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+\frac{c}{2}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{2}
Op(1nλ1(n)+(λ1(n)+λ2(n))2)+c12λ1(n)mI0f~mfmm2+12f~fL2(Π)2.\displaystyle\leq O_{p}\left(\frac{1}{n{\lambda_{1}^{(n)}}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{2}\right)+\frac{c_{1}}{2}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+\frac{1}{2}\|\tilde{f}-f^{*}\|_{L_{2}(\Pi)}^{2}.

Therefore we have

12f~fL2(Π)2+c12λ1(n)mI0f~mfmm2+λ2(n)mI0f~mfmm2Op(1nλ1(n)+(λ1(n)+λ2(n))2)\displaystyle\frac{1}{2}\|\tilde{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+\frac{c_{1}}{2}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}\leq O_{p}\left(\frac{1}{n{\lambda_{1}^{(n)}}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{2}\right)
\displaystyle\Rightarrow mI0f~mfmm2Op(1(c1λ1(n)+λ2(n))nλ1(n)+(λ1(n)+λ2(n))2c1λ1(n)+λ2(n))=Op(1nλ1(n)2+(λ1(n)+λ2(n))).\displaystyle\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}\leq O_{p}\left(\frac{1}{(c_{1}{\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})n{\lambda_{1}^{(n)}}}+\frac{({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{2}}{c_{1}{\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}}}\right)=O_{p}\left(\frac{1}{n{\lambda_{1}^{(n)}}^{2}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})\right).

This and λ1(n)n{\lambda_{1}^{(n)}}\sqrt{n}\to\infty give f~fI0I00\|\tilde{f}-f^{*}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}\to 0 in probability.

(Step 2) Next we show that the probability of f~=f^\tilde{f}=\hat{f} goes to 1. Since f~fI0I00\|\tilde{f}-f^{*}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}\to 0, we can assume that f~mm>0(mI0)\|\tilde{f}_{m}\|_{\mathcal{H}_{m}}>0~(m\in I_{0}) without loss of generality. We identify f~\tilde{f} as an element of \mathcal{H} by setting f~m=0\tilde{f}_{m}=0 for mJ0m\in J_{0}. Now we show that f~\tilde{f} is also the minimizer of FnF_{n}, that is, f~=f^\tilde{f}=\hat{f}, with high probability; hence I^=I0\hat{I}=I_{0} with high probability. By the KKT condition, the necessary and sufficient condition that f~\tilde{f} also minimizes FnF_{n} is

2Σ^m,I0(f~I0fI0)2Σ^m,ϵmλ1(n)(mJ0),\displaystyle\|2\hat{\Sigma}_{m,I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}\leq{\lambda_{1}^{(n)}}~~~(\forall m\in J_{0}), (53)
(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)(f~I0fI0)+λ1(n)DnfI0+2λ2(n)fI02Σ^I0,ϵ=0,\displaystyle(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})+{\lambda_{1}^{(n)}}D_{n}f^{*}_{I_{0}}+2{\lambda_{2}^{(n)}}f^{*}_{I_{0}}-2\hat{\Sigma}_{I_{0},\epsilon}=0, (54)

where Dn=Diag(f~mm1)D_{n}=\mathrm{Diag}(\|\tilde{f}_{m}\|_{\mathcal{H}_{m}}^{-1}). Note that (54) is satisfied (with high probability) because f~\tilde{f} is the minimizer of F~n\tilde{F}_{n} and f~mm>0\|\tilde{f}_{m}\|_{\mathcal{H}_{m}}>0 for all mI0m\in I_{0} (with high probability). Therefore, if the condition (53) holds w.h.p., then f~=f^\tilde{f}=\hat{f} w.h.p.

We will now show that the condition (53) holds w.h.p. Due to (54), we have

f~I0fI0=(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1[(λ1(n)Dn+2λ2(n))fI02Σ^I0,ϵ].\tilde{f}_{I_{0}}-f^{*}_{I_{0}}=-(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}[({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}-2\hat{\Sigma}_{I_{0},\epsilon}].

Therefore the LHS of (53), 2Σ^m,I0(f~I0fI0)2Σ^m,ϵm\|2\hat{\Sigma}_{m,I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}, can be evaluated as

2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1[(λ1(n)Dn+2λ2(n))fI02Σ^I0,ϵ]2Σ^m,ϵm\displaystyle\|-2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}[({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}-2\hat{\Sigma}_{I_{0},\epsilon}]-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}
=\displaystyle= 2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0\displaystyle\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}
2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)12Σ^I0,ϵ+2Σ^m,ϵm\displaystyle-2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}2\hat{\Sigma}_{I_{0},\epsilon}+2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}
\displaystyle\leq 2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0m\displaystyle\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}\|_{\mathcal{H}_{m}}
+2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)12Σ^I0,ϵ2Σ^m,ϵm.\displaystyle+\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}2\hat{\Sigma}_{I_{0},\epsilon}-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}. (55)

We evaluate the probabilistic orders of the last two terms.

(i) (Bounding Bn,m:=2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)12Σ^I0,ϵ2Σ^m,ϵmB_{n,m}:=\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}2\hat{\Sigma}_{I_{0},\epsilon}-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}) We show that

Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1Σ^I0,ϵ=Op(1n).\displaystyle\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}\hat{\Sigma}_{I_{0},\epsilon}=O_{p}\left(\frac{1}{\sqrt{n}}\right).

Since O(Σ^I0,I0Σ^I0,mΣ^m,I0Σ^m,m),O\preceq\begin{pmatrix}\hat{\Sigma}_{I_{0},I_{0}}&\hat{\Sigma}_{I_{0},m}\\ \hat{\Sigma}_{m,I_{0}}&\hat{\Sigma}_{m,m}\end{pmatrix}, we have

O(Σ^I0,I0+λ2(n)+λ1(n)Dn/2Σ^I0,mΣ^m,I0Σ^m,m+λ2(n))(2Σ^I0,I0+2λ2(n)+λ1(n)Dn002Σ^m,m+2λ2(n)).\displaystyle O\preceq\begin{pmatrix}\hat{\Sigma}_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n}/2&\hat{\Sigma}_{I_{0},m}\\ \hat{\Sigma}_{m,I_{0}}&\hat{\Sigma}_{m,m}+{\lambda_{2}^{(n)}}\end{pmatrix}\preceq\begin{pmatrix}2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n}&0\\ 0&2\hat{\Sigma}_{m,m}+2{\lambda_{2}^{(n)}}\end{pmatrix}.

The second inequality is due to the fact that for all (fI0,fm)I0m(f_{I_{0}},f_{m})\in\mathcal{H}_{I_{0}\cup m} we have

(fI0fm),(Σ^I0,I0+λ2(n)+λ1(n)Dn/2Σ^I0,mΣ^m,I0Σ^m,m+λ2(n))(fI0fm)I0m0\displaystyle\left\langle\begin{pmatrix}f_{I_{0}}\\ -f_{m}\end{pmatrix},\begin{pmatrix}\hat{\Sigma}_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n}/2&-\hat{\Sigma}_{I_{0},m}\\ -\hat{\Sigma}_{m,I_{0}}&\hat{\Sigma}_{m,m}+{\lambda_{2}^{(n)}}\end{pmatrix}\begin{pmatrix}f_{I_{0}}\\ -f_{m}\end{pmatrix}\right\rangle_{\mathcal{H}_{I_{0}\cup m}}\geq 0

because of O(Σ^I0,I0Σ^I0,mΣ^m,I0Σ^m,m).O\preceq\begin{pmatrix}\hat{\Sigma}_{I_{0},I_{0}}&\hat{\Sigma}_{I_{0},m}\\ \hat{\Sigma}_{m,I_{0}}&\hat{\Sigma}_{m,m}\end{pmatrix}.

Thus we have

(Σ^I0,I0+λ2(n)+λ1(n)Dn2Σ^I0,mΣ^m,I0Σ^m,m+λ2(n))(2Σ^I0,I0+2λ2(n)+λ1(n)Dn002Σ^m,m+2λ2(n))1(Σ^I0,ϵΣ^m,ϵ)I0m\displaystyle\left\|\begin{pmatrix}\hat{\Sigma}_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D_{n}}{2}&\hat{\Sigma}_{I_{0},m}\\ \hat{\Sigma}_{m,I_{0}}&\hat{\Sigma}_{m,m}+{\lambda_{2}^{(n)}}\end{pmatrix}\begin{pmatrix}2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n}&0\\ 0&2\hat{\Sigma}_{m,m}+2{\lambda_{2}^{(n)}}\end{pmatrix}^{-1}\begin{pmatrix}\hat{\Sigma}_{I_{0},\epsilon}\\ \hat{\Sigma}_{m,\epsilon}\end{pmatrix}\right\|_{\mathcal{H}_{I_{0}\cup m}}
(Σ^I0,ϵΣ^m,ϵ)I0mOp(1/n).\displaystyle\leq\left\|\begin{pmatrix}\hat{\Sigma}_{I_{0},\epsilon}\\ \hat{\Sigma}_{m,\epsilon}\end{pmatrix}\right\|_{\mathcal{H}_{I_{0}\cup m}}\leq O_{p}(1/\sqrt{n}). (56)

Here the LHS of the above inequality equals

(Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1Σ^I0,ϵ+(Σ^m,m+λ2(n))(2Σ^m,m+2λ2(n))1Σ^m,ϵ)I0m.\displaystyle\left\|\begin{pmatrix}*\\ \hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}\hat{\Sigma}_{I_{0},\epsilon}+(\hat{\Sigma}_{m,m}+{\lambda_{2}^{(n)}})(2\hat{\Sigma}_{m,m}+2{\lambda_{2}^{(n)}})^{-1}\hat{\Sigma}_{m,\epsilon}\end{pmatrix}\right\|_{\mathcal{H}_{I_{0}\cup m}}.

Therefore we observe

Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1Σ^I0,ϵ+12Σ^m,ϵm=Op(1/n).\displaystyle\left\|\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}\hat{\Sigma}_{I_{0},\epsilon}+\frac{1}{2}\hat{\Sigma}_{m,\epsilon}\right\|_{\mathcal{H}_{m}}=O_{p}(1/\sqrt{n}).

Since Σ^m,ϵm=Op(1/n)\|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}=O_{p}(1/\sqrt{n}), we also have

Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1Σ^I0,ϵm=Op(1/n).\|\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}\hat{\Sigma}_{I_{0},\epsilon}\|_{\mathcal{H}_{m}}=O_{p}(1/\sqrt{n}).

This and Σ^m,ϵm=Op(1/n)\|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}=O_{p}(1/\sqrt{n}) yield

Bn,m=Op(1/n).\displaystyle B_{n,m}=O_{p}(1/\sqrt{n}). (57)

(ii) (Bounding En,m:=2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0mE_{n,m}:=\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}\|_{\mathcal{H}_{m}}) Note that, due to f~fp0\|\tilde{f}-f^{*}\|_{\mathcal{H}}\stackrel{{\scriptstyle p}}{{\to}}0, we have DnpDD_{n}\stackrel{{\scriptstyle p}}{{\to}}D, and we know that maxm,mΣ^m,mΣm,mm,m=Op(log(M)/n)=Op(1n)\max_{m,m^{\prime}}\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}=O_{p}(\sqrt{\log(M)/n})=O_{p}(\frac{1}{\sqrt{n}}) by Lemma 10. Thus Sn:=(2ΣI0,I02Σ^I0,I0)/λ1(n)+DDnS_{n}:=(2\Sigma_{I_{0},I_{0}}-2\hat{\Sigma}_{I_{0},I_{0}})/{\lambda_{1}^{(n)}}+D-D_{n} satisfies Sn=op(1)S_{n}=o_{p}(1), so that DSnD/2D-S_{n}\succeq D/2 with high probability. Hence

2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0\displaystyle 2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}
=\displaystyle= 2Σm,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0+Op(1n)\displaystyle 2\Sigma_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}+O_{p}\left(\frac{1}{\sqrt{n}}\right)
=\displaystyle= 2Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1(λ1(n)Dn+2λ2(n))fI0+\displaystyle 2\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}+
2Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1λ1(n)Sn(2ΣI0,I0+2λ2(n)+λ1(n)(DSn))1(λ1(n)Dn+2λ2(n))fI0\displaystyle 2\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}{\lambda_{1}^{(n)}}S_{n}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}(D-S_{n}))^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}
+Op(1n).\displaystyle+O_{p}\left(\frac{1}{\sqrt{n}}\right). (58)

Here we obtain

Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)12m,I02\displaystyle\|\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-\frac{1}{2}}\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}^{2}
=\displaystyle= Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1ΣI0,mm,m\displaystyle\|\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}\Sigma_{I_{0},m}\|_{\mathcal{H}_{m},\mathcal{H}_{m}}
\displaystyle\leq Σm,m12Vm,I0(2VI0,I0)1VI0,mΣm,m12m,m=Op(1),\displaystyle\|\Sigma_{m,m}^{\frac{1}{2}}V_{m,I_{0}}(2V_{I_{0},I_{0}})^{-1}V_{I_{0},m}\Sigma_{m,m}^{\frac{1}{2}}\|_{\mathcal{H}_{m},\mathcal{H}_{m}}=O_{p}(1), (59)

and, due to the fact that DSnD/2D-S_{n}\succeq D/2 with high probability, we have

(ΣI0,I0+λ2(n)+λ1(n)(DSn))12(λ1(n)Dn+2λ2(n))fI0I0\displaystyle\|(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}(D-S_{n}))^{-\frac{1}{2}}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}
=\displaystyle= (ΣI0,I0+λ2(n)+λ1(n)(DSn))12Diag(Σm,m12)(λ1(n)Dn+2λ2(n))gI0I0\displaystyle\|(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}(D-S_{n}))^{-\frac{1}{2}}\mathrm{Diag}(\Sigma_{m,m}^{\frac{1}{2}})({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})g^{*}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}
\displaystyle\leq Op(VI0,I01I0,I012(λ1(n)+λ2(n)))=Op(λ1(n)+λ2(n)).\displaystyle O_{p}(\|V_{I_{0},I_{0}}^{-1}\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}^{-\frac{1}{2}}({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}}))=O_{p}({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}}).

Therefore the second term in the RHS of Eq. (58) is evaluated as

Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1λ1(n)Sn(2ΣI0,I0+2λ2(n)+λ1(n)(DSn))1(λ1(n)Dn+2λ2(n))fI0m\displaystyle\|\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}{\lambda_{1}^{(n)}}S_{n}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}\!(D\!-\!S_{n}))^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}\|_{\mathcal{H}_{m}}
\displaystyle\leq Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)12m,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)12I0,I0λ1(n)SnI0,I0×\displaystyle\|\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-\frac{1}{2}}\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}\|(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-\frac{1}{2}}\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}{\lambda_{1}^{(n)}}\|S_{n}\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}\times
(2ΣI0,I0+2λ2(n)+λ1(n)(DSn))12I0,I0(ΣI0,I0+λ2(n)+λ1(n)(DSn))12(λ1(n)Dn+2λ2(n))fI0I0\displaystyle\|(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}\!(D\!-\!S_{n}))^{-\frac{1}{2}}\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}\|(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}\!(D\!-\!S_{n}))^{-\frac{1}{2}}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}
\displaystyle\leq Op(1(λ1(n)+λ2(n))12λ1(n)op(1)(λ1(n)+λ2(n))12(λ1(n)+λ2(n)))\displaystyle O_{p}(1\cdot({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{-\frac{1}{2}}\cdot{\lambda_{1}^{(n)}}o_{p}(1)\cdot({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{-\frac{1}{2}}\cdot({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}}))
=\displaystyle= op(λ1(n)).\displaystyle o_{p}({\lambda_{1}^{(n)}}).

Therefore this and Eq. (58) give

2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0\displaystyle 2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}
=\displaystyle= 2Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1(λ1(n)Dn+2λ2(n))fI0+op(λ1(n))+Op(1n)\displaystyle 2\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}+o_{p}({\lambda_{1}^{(n)}})+O_{p}\left(\frac{1}{\sqrt{n}}\right)
=\displaystyle= 2Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1(λ1(n)Dn+2λ2(n))fI0+op(λ1(n)).\displaystyle 2\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}+o_{p}({\lambda_{1}^{(n)}}).

Define

An:=Σm,I0(ΣI0,I0+λ2(n)+λ1(n)D2)1(Dn+2λ2(n)λ1(n))fI0,\displaystyle A_{n}:=\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-1}\left(D_{n}+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)f^{*}_{I_{0}},
A:=Σm,I0(ΣI0,I0+λ2(n))1(D+2λ2(n)λ1(n))fI0.\displaystyle A:=\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}})^{-1}\left(D+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)f^{*}_{I_{0}}.

We show that AnAm=op(1)\|A_{n}-A\|_{\mathcal{H}_{m}}=o_{p}(1). By definition, we have

AAn=\displaystyle A-A_{n}= Σm,I0(ΣI0,I0+λ2(n))1λ1(n)D2(ΣI0,I0+λ2(n)+λ1(n)D2)1(D+2λ2(n)λ1(n))fI0\displaystyle\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}})^{-1}\frac{{\lambda_{1}^{(n)}}D}{2}\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-1}\left(D+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)f^{*}_{I_{0}}
+Σm,I0(ΣI0,I0+λ2(n)+λ1(n)D2)1(DDn)fI0.\displaystyle+\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-1}\left(D-D_{n}\right)f^{*}_{I_{0}}. (60)

On the other hand, as in Eq. (56), we observe that

2\displaystyle 2\geq (ΣI0,I0ΣI0,mΣm,I0Σm,m)((ΣI0,I0+λ2(n))1000)I0m,I0m\displaystyle\left\|\begin{pmatrix}\Sigma_{I_{0},I_{0}}&\Sigma_{I_{0},m}\\ \Sigma_{m,I_{0}}&\Sigma_{m,m}\end{pmatrix}\begin{pmatrix}(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}})^{-1}&0\\ 0&0\end{pmatrix}\right\|_{\mathcal{H}_{I_{0}\cup m},\mathcal{H}_{I_{0}\cup m}}
=\displaystyle= (Σm,I0(ΣI0,I0+λ2(n))10)I0m,I0mΣm,I0(ΣI0,I0+λ2(n))1m,I0.\displaystyle\left\|\begin{pmatrix}*&*\\ \Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}})^{-1}&0\end{pmatrix}\right\|_{\mathcal{H}_{I_{0}\cup m},\mathcal{H}_{I_{0}\cup m}}\geq\|\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}})^{-1}\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}. (61)

Moreover, since fm=Σm,m12gmf^{*}_{m}=\Sigma_{m,m}^{\frac{1}{2}}g^{*}_{m} (m\forall m), we have

(ΣI0,I0+λ2(n)+λ1(n)D2)1(D+2λ2(n)λ1(n))fI0I0\displaystyle\left\|\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-1}\left(D+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}
=\displaystyle= (ΣI0,I0+λ2(n)+λ1(n)D2)1Diag(Σm,m12)(D+2λ2(n)λ1(n))gI0I0\displaystyle\left\|\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-1}\mathrm{Diag}(\Sigma_{m,m}^{\frac{1}{2}})\left(D+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)g^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}
\displaystyle\leq (ΣI0,I0+λ2(n)+λ1(n)D2)12I0,I0(ΣI0,I0+λ2(n)+λ1(n)D2)12Diag(Σm,m12)I0,I0\displaystyle\left\|\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-\frac{1}{2}}\right\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}\left\|\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-\frac{1}{2}}\mathrm{Diag}(\Sigma_{m,m}^{\frac{1}{2}})\right\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}
×(D+2λ2(n)λ1(n))gI0I0\displaystyle\times\left\|\left(D+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)g^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}
\displaystyle\leq Op((λ1(n)+λ2(n))12VI0,I012I0,I0)Op(λ1(n)12).\displaystyle O_{p}(({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{-\frac{1}{2}}\left\|V_{I_{0},I_{0}}^{-\frac{1}{2}}\right\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}})\leq O_{p}({\lambda_{1}^{(n)}}^{-\frac{1}{2}}). (62)

We can also bound the second term of (60) as

\displaystyle \left\|\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)}+\frac{\lambda_{1}^{(n)}D}{2}\right)^{-1}\left(D-D_{n}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}
\leq\left\|\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)}+\frac{\lambda_{1}^{(n)}D}{2}\right)^{-1}\right\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}\left\|\left(D-D_{n}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}
\leq\left\|\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)}\right)^{-1}\right\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}\left\|\left(D-D_{n}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}
\leq 2\left\|\left(D-D_{n}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}\quad(\because\text{Eq.~(61)})
=o_{p}(1).

Therefore, applying the inequalities (61) and (62) to Eq. (60), we have

\displaystyle \|A_{n}-A\|_{\mathcal{H}_{m}}=O_{p}({\lambda_{1}^{(n)}}^{\frac{1}{2}})+o_{p}(1)=o_{p}(1). (63)

Hence we have $E_{n,m}=\lambda_{1}^{(n)}\|A\|_{\mathcal{H}_{m}}+o_{p}(\lambda_{1}^{(n)})$.

(iii) (Combining (i) and (ii)) Combining the evaluations in (i) and (ii), we have

\displaystyle \max_{m\in J_{0}}\left\|2\hat{\Sigma}_{m,I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon}\right\|_{\mathcal{H}_{m}}
=\max_{m\in J_{0}}\lambda_{1}^{(n)}\left\|\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-1}\left(D+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}+o_{p}(\lambda_{1}^{(n)})<\lambda_{1}^{(n)}(1-\eta)+o_{p}(\lambda_{1}^{(n)}).

This yields

P\left(\exists m\in J_{0}:\ \|2\hat{\Sigma}_{m,I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}\geq\lambda_{1}^{(n)}\right)\to 0.

Thus the probability that the condition (53) holds goes to 1. ∎
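The quantity controlling the zero blocks above is easy to evaluate numerically once the covariance operators are replaced by matrices. The following is a minimal finite-dimensional sketch under illustrative assumptions (each $\mathcal{H}_{m}$ replaced by $\mathbb{R}^{2}$, a random joint covariance, hand-picked $\lambda_{1}^{(n)},\lambda_{2}^{(n)}$); it is not the paper's operator setting, only the same condition value computed in $\mathbb{R}^{6}$.

```python
import numpy as np

# Finite-dimensional sketch of the condition value on an inactive block m:
#   || Sigma_{m,I0} (Sigma_{I0,I0} + lam2 I)^{-1} (D + 2 lam2/lam1 I) f*_{I0} ||.
# All sizes, distributions, and (lam1, lam2) are illustrative assumptions.
rng = np.random.default_rng(0)

d = 2
G = rng.standard_normal((3 * d, 3 * d))
Sigma = G @ G.T / (3 * d)                     # blocks 0, 1 active; block 2 inactive
I0, m = slice(0, 2 * d), slice(2 * d, 3 * d)

f_I0 = rng.standard_normal(2 * d)
block_norms = [np.linalg.norm(f_I0[b * d:(b + 1) * d]) for b in range(2)]
D = np.diag(np.repeat([1.0 / s for s in block_norms], d))   # D = Diag(1/||f*_b||)

def condition_value(lam1, lam2):
    rhs = (D + (2.0 * lam2 / lam1) * np.eye(2 * d)) @ f_I0
    w = np.linalg.solve(Sigma[I0, I0] + lam2 * np.eye(2 * d), rhs)
    return np.linalg.norm(Sigma[m, I0] @ w)

for lam2 in [0.0, 0.01, 0.1]:
    print(f"lam2 = {lam2:5.2f}:  condition value = {condition_value(0.05, lam2):.3f}")
```

Whether the printed value stays below one depends on the interplay between the cross-correlation $\Sigma_{m,I_{0}}$ and the ratio $\lambda_{2}^{(n)}/\lambda_{1}^{(n)}$, which is exactly the trade-off the elastic-net condition (16) quantifies.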

Proof: (Theorem 5) First we prove that $\lambda_{1}^{(n)}\sqrt{n}\to\infty$ is a necessary condition for $\hat{I}\stackrel{p}{\to}I_{0}$. Assume that $\liminf\lambda_{1}^{(n)}\sqrt{n}<\infty$. Then there is a sub-sequence along which $\lambda_{1}^{(n)}\sqrt{n}$ converges to a finite value; hence, by passing to this sub-sequence if necessary, we may assume $\lambda_{1}^{(n)}\sqrt{n}\to\mu_{1}$ for some finite $\mu_{1}$ without loss of generality. We will derive a contradiction under the conditions $\|\hat{f}-f^{*}\|_{\mathcal{H}}\stackrel{p}{\to}0$ and $\hat{I}\stackrel{p}{\to}I_{0}$. Suppose $\hat{I}=I_{0}$.

By the KKT condition,

\displaystyle 0=2(\hat{\Sigma}_{I_{0},I_{0}}\hat{f}_{I_{0}}-\hat{\Sigma}_{I_{0},\epsilon}-\hat{\Sigma}_{I_{0},I_{0}}f^{*}_{I_{0}})+\lambda_{1}^{(n)}D_{n}\hat{f}_{I_{0}}+2\lambda_{2}^{(n)}\hat{f}_{I_{0}}
\Rightarrow~~2(\hat{\Sigma}_{I_{0},I_{0}}+\lambda_{2}^{(n)})(f^{*}_{I_{0}}-\hat{f}_{I_{0}})=\lambda_{1}^{(n)}D_{n}f^{*}_{I_{0}}+2\lambda_{2}^{(n)}f^{*}_{I_{0}}-2\hat{\Sigma}_{I_{0},\epsilon} (64)
\Rightarrow~~2\sqrt{n}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})(f^{*}_{I_{0}}-\hat{f}_{I_{0}})=\sqrt{n}\lambda_{1}^{(n)}Df^{*}_{I_{0}}+2\sqrt{n}\lambda_{2}^{(n)}f^{*}_{I_{0}}-2\sqrt{n}\hat{\Sigma}_{I_{0},\epsilon}
\qquad+\left(2\sqrt{n}(\Sigma_{I_{0},I_{0}}-\hat{\Sigma}_{I_{0},I_{0}})(f^{*}_{I_{0}}-\hat{f}_{I_{0}})+\sqrt{n}\lambda_{1}^{(n)}(D_{n}-D)f^{*}_{I_{0}}\right)
\Rightarrow~~2\sqrt{n}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})(f^{*}_{I_{0}}-\hat{f}_{I_{0}})=\mu_{1}Df^{*}_{I_{0}}+2\sqrt{n}\lambda_{2}^{(n)}f^{*}_{I_{0}}-2\sqrt{n}\hat{\Sigma}_{I_{0},\epsilon}+o_{p}(1), (65)

where the last line follows from $\sqrt{n}\lambda_{1}^{(n)}\to\mu_{1}$, $\|D_{n}-D\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}=o_{p}(1)$, $\|\hat{f}-f^{*}\|_{\mathcal{H}}=o_{p}(1)$, and $\|\Sigma_{I_{0},I_{0}}-\hat{\Sigma}_{I_{0},I_{0}}\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}=o_{p}(1)$. Moreover, since the second line (64) indicates that $o_{p}(1)+o_{p}(\lambda_{2}^{(n)})=\lambda_{1}^{(n)}Df^{*}_{I_{0}}+2\lambda_{2}^{(n)}f^{*}_{I_{0}}+o_{p}(1)$, we have $\lambda_{1}^{(n)}=o_{p}(1)$ and $\lambda_{2}^{(n)}=o_{p}(1)$.
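As a side check, a stationarity condition of the form (64) can be verified numerically in a finite-dimensional surrogate: a group elastic-net least-squares problem solved by proximal gradient descent. The sketch below uses illustrative assumptions (random Gaussian design, three blocks of dimension two, hand-picked regularization parameters and solver); it is not the paper's estimator, only the same optimality system in $\mathbb{R}^{6}$.

```python
import numpy as np

# Finite-dimensional surrogate for the stationarity condition (64): a group
# elastic-net least-squares problem solved by proximal gradient descent.
# Design, block sizes, noise level, and (lam1, lam2) are illustrative choices.
rng = np.random.default_rng(6)
n, d, blocks = 200, 2, 3
X = rng.standard_normal((n, blocks * d))
f_star = np.concatenate([np.ones(d), -np.ones(d), np.zeros(d)])  # block 2 inactive
y = X @ f_star + 0.3 * rng.standard_normal(n)
lam1, lam2 = 0.1, 0.05

def prox(v, t):
    """Blockwise group soft-thresholding: the prox of t * lam1 * sum_b ||v_b||."""
    out = np.zeros_like(v)
    for b in range(blocks):
        vb = v[b * d:(b + 1) * d]
        nb = np.linalg.norm(vb)
        if nb > t * lam1:
            out[b * d:(b + 1) * d] = (1.0 - t * lam1 / nb) * vb
    return out

f = np.zeros(blocks * d)
step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2 / n + 2.0 * lam2)
for _ in range(20000):
    grad = 2.0 * X.T @ (X @ f - y) / n + 2.0 * lam2 * f   # smooth part gradient
    f = prox(f - step * grad, step)

grad = 2.0 * X.T @ (X @ f - y) / n + 2.0 * lam2 * f
for b in range(blocks):
    fb = f[b * d:(b + 1) * d]
    if np.linalg.norm(fb) > 1e-8:   # active block: grad_b + lam1 * f_b/||f_b|| = 0
        res = np.linalg.norm(grad[b * d:(b + 1) * d] + lam1 * fb / np.linalg.norm(fb))
        print(f"block {b}: active,   KKT residual = {res:.2e}")
    else:                           # inactive block: ||grad_b|| <= lam1, cf. (69)
        gb = np.linalg.norm(grad[b * d:(b + 1) * d])
        print(f"block {b}: inactive, ||grad_b|| = {gb:.3f} <= lam1 = {lam1}")
```

On active blocks the printed KKT residual is numerically zero, matching the form of (64); on the inactive block the gradient norm stays below $\lambda_{1}$, matching the form of (69).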

We now show that the KKT condition for $\hat{f}$ with $\hat{I}=I_{0}$ to be optimal with respect to $F_{n}$ is violated with strictly positive probability:

\displaystyle \liminf P\left(\exists m\in J_{0},~\|2(\hat{\Sigma}_{m,I_{0}}\hat{f}_{I_{0}}-\hat{\Sigma}_{m,I_{0}}f^{*}_{I_{0}}-\hat{\Sigma}_{m,\epsilon})\|_{\mathcal{H}_{m}}>\lambda_{1}^{(n)}\right)>0. (66)

Obviously this indicates that the probability of $\hat{I}=I_{0}$ does not converge to 1, which is a contradiction.

For all $v_{m}\in\mathcal{H}_{m}$ $(m\in J_{0})$, there exists $w_{I_{0}}\in\mathcal{H}_{I_{0}}$ such that

\Sigma_{I_{0},m}v_{m}=(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})w_{I_{0}}. (67)

Note that $w_{I_{0}}$ is uniformly bounded over all $\lambda_{2}^{(n)}\geq 0$. Indeed, the range of $\Sigma_{I_{0},m}$ is included in the range of $\Sigma_{I_{0},I_{0}}$ (Baker, 1973), so there exists $\tilde{w}_{I_{0}}$ (independent of $\lambda_{2}^{(n)}$) such that $\Sigma_{I_{0},m}v_{m}=\Sigma_{I_{0},I_{0}}\tilde{w}_{I_{0}}$; hence $\Sigma_{I_{0},I_{0}}\tilde{w}_{I_{0}}=(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})w_{I_{0}}$, and

\|w_{I_{0}}\|_{\mathcal{H}_{I_{0}}}\leq\sqrt{\langle\tilde{w}_{I_{0}},\Sigma_{I_{0},I_{0}}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-2}\Sigma_{I_{0},I_{0}}\tilde{w}_{I_{0}}\rangle_{\mathcal{H}_{I_{0}}}}\leq\|\tilde{w}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}

for $\lambda_{2}^{(n)}>0$, and $\|w_{I_{0}}\|_{\mathcal{H}_{I_{0}}}=\|\tilde{w}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}$ for $\lambda_{2}^{(n)}=0$. Let $v_{m}\in\mathcal{H}_{m}$ be any non-zero element such that $\Sigma_{m,m}^{1/2}v_{m}\neq 0$, and let $w_{I_{0}}$ satisfy the equality (67); then

\displaystyle \sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}+\hat{\Sigma}_{m,I_{0}}f^{*}_{I_{0}}-\hat{\Sigma}_{m,I_{0}}\hat{f}_{I_{0}}\rangle_{\mathcal{H}_{m}}
=\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}+\langle v_{m},\hat{\Sigma}_{m,I_{0}}\sqrt{n}(f^{*}_{I_{0}}-\hat{f}_{I_{0}})\rangle_{\mathcal{H}_{m}}
=\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}+\langle v_{m},\Sigma_{m,I_{0}}\sqrt{n}(f^{*}_{I_{0}}-\hat{f}_{I_{0}})\rangle_{\mathcal{H}_{m}}+o_{p}(1)
=\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}+\langle w_{I_{0}},(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})\sqrt{n}(f^{*}_{I_{0}}-\hat{f}_{I_{0}})\rangle_{\mathcal{H}_{I_{0}}}+o_{p}(1)
=\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}-\sqrt{n}\langle w_{I_{0}},\hat{\Sigma}_{I_{0},\epsilon}\rangle_{\mathcal{H}_{I_{0}}}+\left\langle w_{I_{0}},\left(\frac{\mu_{1}}{2}D+\sqrt{n}\lambda_{2}^{(n)}\right)f^{*}_{I_{0}}\right\rangle_{\mathcal{H}_{I_{0}}}+o_{p}(1),

where we used $\|\hat{\Sigma}_{m,I_{0}}-\Sigma_{m,I_{0}}\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}=O_{p}(1/\sqrt{n})$ and $\|f^{*}-\hat{f}\|_{\mathcal{H}}\stackrel{p}{\to}0$ in the second equality, and the relation (65) in the last equality. We can show that $Z_{n}:=\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}-\sqrt{n}\langle w_{I_{0}},\hat{\Sigma}_{I_{0},\epsilon}\rangle_{\mathcal{H}_{I_{0}}}$ has a strictly positive variance as follows (see also Bach (2008)):

\displaystyle \mathrm{E}[Z_{n}]=0,
\displaystyle \mathrm{E}[Z_{n}^{2}]\geq\sigma^{2}\left(\langle v_{m},\Sigma_{m,m}v_{m}\rangle-2\langle v_{m},\Sigma_{m,I_{0}}w_{I_{0}}\rangle+\langle w_{I_{0}},\Sigma_{I_{0},I_{0}}w_{I_{0}}\rangle\right)
=\sigma^{2}\left(\langle v_{m},\Sigma_{m,m}v_{m}\rangle-\langle v_{m},\Sigma_{m,I_{0}}w_{I_{0}}\rangle+o_{p}(1)\right)\quad(\because\lambda_{2}^{(n)}=o_{p}(1))
=\sigma^{2}\langle\Sigma_{m,m}^{1/2}v_{m},(I_{\mathcal{H}_{m}}-V_{m,I_{0}}\tilde{V}^{-1}_{I_{0},I_{0}}V_{I_{0},m})\Sigma_{m,m}^{1/2}v_{m}\rangle+o_{p}(1),

where $\tilde{V}^{-1}_{I_{0},I_{0}}=\mathrm{Diag}(\Sigma_{m,m}^{1/2})(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-1}\mathrm{Diag}(\Sigma_{m,m}^{1/2})$ (note that $\tilde{V}_{I_{0},I_{0}}$ is invertible because $V_{I_{0},I_{0}}\preceq\tilde{V}_{I_{0},I_{0}}$ and $V_{I_{0},I_{0}}$ is invertible). Now since $V_{I_{0},I_{0}}\preceq\tilde{V}_{I_{0},I_{0}}$ and $I_{\mathcal{H}_{m}}-V_{m,I_{0}}V^{-1}_{I_{0},I_{0}}V_{I_{0},m}\succ O$ (this is because $V_{I_{0}\cup m,I_{0}\cup m}=\begin{pmatrix}V_{I_{0},I_{0}}&V_{I_{0},m}\\ V_{m,I_{0}}&I_{\mathcal{H}_{m}}\end{pmatrix}$ is invertible), we have $I_{\mathcal{H}_{m}}-V_{m,I_{0}}\tilde{V}^{-1}_{I_{0},I_{0}}V_{I_{0},m}\succ O$. Therefore, by the central limit theorem, $Z_{n}$ converges in distribution to a Gaussian random variable with strictly positive variance. Thus the probability of

2|\langle v_{m},\hat{\Sigma}_{m,\epsilon}+\hat{\Sigma}_{m,I_{0}}f^{*}_{I_{0}}-\hat{\Sigma}_{m,I_{0}}\hat{f}_{I_{0}}\rangle_{\mathcal{H}_{m}}|>\lambda_{1}^{(n)}\|v_{m}\|_{\mathcal{H}_{m}}

is asymptotically strictly positive because $\lambda_{1}^{(n)}\sqrt{n}\to\mu_{1}$ (note that this is true whether $\sqrt{n}\lambda_{2}^{(n)}$ converges to a finite value or not). This yields (66); i.e., $\hat{f}$ fails to satisfy $\hat{I}=I_{0}$ with asymptotically strictly positive probability.
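The role of the nondegenerate Gaussian limit of $Z_{n}$ can be illustrated with a small simulation of the (simplified, one-term) score $\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}$. The Gaussian kernel, the uniform design, and the choice $v_{m}=k_{m}(\cdot,0)$ in the sketch below are illustrative assumptions.

```python
import numpy as np

# Illustration of why Condition A (lam1 * sqrt(n) -> infinity) is needed:
# sqrt(n) <v_m, Sigma^_{m,eps}> has a nondegenerate Gaussian limit, so any
# threshold lam1 = O(1/sqrt(n)) is exceeded with positive probability.
rng = np.random.default_rng(1)

def score(n, sigma=1.0):
    x = rng.uniform(-1.0, 1.0, n)
    eps = sigma * rng.standard_normal(n)
    # By the reproducing property, <v_m, Sigma^_{m,eps}> = (1/n) sum_i v_m(x_i) eps_i
    v = np.exp(-x ** 2)          # v_m = k_m(., 0) for the Gaussian kernel
    return np.sqrt(n) * np.mean(v * eps)

draws = np.array([score(2000) for _ in range(2000)])
print(f"mean = {draws.mean():+.3f} (~ 0), std = {draws.std():.3f} (> 0)")
```

The empirical standard deviation stabilizes at a strictly positive value as $n$ grows, which is the nondegeneracy used in the contradiction above.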

We refer to the following condition as Condition A:

\text{Condition A}:~~~~\lambda_{1}^{(n)}\sqrt{n}\to\infty.

Now that we have proven that Condition A is necessary, we are ready to prove the assertion concerning the condition (16). Suppose the condition (16) is not satisfied for any sequences $\lambda_{1}^{(n)},\lambda_{2}^{(n)}\to 0$; that is, there exists a constant $\xi>0$ such that

\displaystyle \limsup_{n\to\infty}\left\|\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-1}\left(D+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)g^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}>(1+\xi)\quad(\exists m\in J_{0}), (68)

for any sequences $\lambda_{1}^{(n)},\lambda_{2}^{(n)}\to 0$ satisfying Condition A ($\lambda_{1}^{(n)}\sqrt{n}\to\infty$). Fix arbitrary sequences $\lambda_{1}^{(n)},\lambda_{2}^{(n)}\to 0$ satisfying Condition A. If $\hat{I}=I_{0}$, the KKT condition

\displaystyle \|2\hat{\Sigma}_{m,I_{0}}(\hat{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}\leq\lambda_{1}^{(n)}\quad(\forall m\in J_{0}), (69)
\displaystyle (2\hat{\Sigma}_{I_{0},I_{0}}+2\lambda_{2}^{(n)}+\lambda_{1}^{(n)}D_{n})(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})+\lambda_{1}^{(n)}D_{n}f^{*}_{I_{0}}+2\lambda_{2}^{(n)}f^{*}_{I_{0}}-2\hat{\Sigma}_{I_{0},\epsilon}=0, (70)

should be satisfied (see (53) and (54)). We prove that the first inequality (69) of the KKT condition is violated with strictly positive probability under the assumptions and the condition (70). We have shown that (see (55))

\displaystyle {\lambda_{1}^{(n)}}^{-1}(2\hat{\Sigma}_{m,I_{0}}(\hat{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon})
=2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2\lambda_{2}^{(n)}+\lambda_{1}^{(n)}D_{n})^{-1}\left(D_{n}+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)f^{*}_{I_{0}}
\quad-\frac{2}{\lambda_{1}^{(n)}}\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2\lambda_{2}^{(n)}+\lambda_{1}^{(n)}D_{n})^{-1}2\hat{\Sigma}_{I_{0},\epsilon}+\frac{2}{\lambda_{1}^{(n)}}\hat{\Sigma}_{m,\epsilon}. (71)

As shown in the proof of Theorem 1, the first term can be approximated by $\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)}\right)^{-1}\left(D+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)f^{*}_{I_{0}}$; more precisely, Eq. (63) gives

\displaystyle \left\|\hat{\Sigma}_{m,I_{0}}\left(\hat{\Sigma}_{I_{0},I_{0}}+\lambda_{2}^{(n)}+\frac{\lambda_{1}^{(n)}D_{n}}{2}\right)^{-1}\left(D_{n}+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)f^{*}_{I_{0}}-\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-1}\left(D+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)g^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}
\stackrel{p}{\to}0.

Since $\liminf_{n}\left\|\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-1}\left(D+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)g^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}>(1+\xi)$ by the assumption, we observe that

\displaystyle P\left(\left\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2\lambda_{2}^{(n)}+\lambda_{1}^{(n)}D_{n})^{-1}\left(D_{n}+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}>(1+\xi)\right)\not\to 0. (72)

Moreover, since $\lambda_{1}^{(n)}\sqrt{n}\to\infty$, we have already shown in the proof of Theorem 1 (Eq. (57)) that

\displaystyle \left\|-\frac{2}{\lambda_{1}^{(n)}}\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2\lambda_{2}^{(n)}+\lambda_{1}^{(n)}D_{n})^{-1}2\hat{\Sigma}_{I_{0},\epsilon}+\frac{2}{\lambda_{1}^{(n)}}\hat{\Sigma}_{m,\epsilon}\right\|_{\mathcal{H}_{m}}=O_{p}(1/(\lambda_{1}^{(n)}\sqrt{n}))=o_{p}(1). (73)

Therefore, combining (71), (72), and (73), we conclude that the KKT condition (53) is violated with strictly positive probability if the condition (68) is satisfied. This shows that the irrepresentable condition (16) is necessary for the support consistency of elastic-net MKL. ∎

Lemma 10

If $\sup_{X}k_{m}(X,X)\leq 1$ and $\sup_{X}k_{m^{\prime}}(X,X)\leq 1$, then

\displaystyle P\left(\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}\geq\mathrm{E}[\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}]+\varepsilon\right)\leq\exp(-n\varepsilon^{2}/2). (74)

In particular,

P\left(\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}\geq\frac{1}{\sqrt{n}}+\varepsilon\right)\leq\exp(-n\varepsilon^{2}/2). (75)

Proof: We use McDiarmid’s inequality (Devroye et al., 1996). By definition

\langle g,\hat{\Sigma}_{m,m^{\prime}}f\rangle_{\mathcal{H}_{m}}=\frac{1}{n}\sum_{i=1}^{n}\left\langle g,k_{m}(\cdot,x_{i})\right\rangle_{\mathcal{H}_{m}}\left\langle f,k_{m^{\prime}}(\cdot,x_{i})\right\rangle_{\mathcal{H}_{m^{\prime}}}.

We denote by $\tilde{\Sigma}_{m,m^{\prime}}$ the empirical cross-covariance operator computed from the $n$ samples $(x_{1},\dots,x_{j-1},\tilde{x}_{j},x_{j+1},\dots,x_{n})$, in which the $j$-th sample $x_{j}$ is replaced by $\tilde{x}_{j}$, an independent copy drawn from the same distribution as $x_{j}$.

By the triangle inequality, we have

\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}-\|\tilde{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}\leq\|\hat{\Sigma}_{m,m^{\prime}}-\tilde{\Sigma}_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}.

Now the RHS can be evaluated as follows:

\displaystyle \|\hat{\Sigma}_{m,m^{\prime}}-\tilde{\Sigma}_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}
=\left\|\frac{1}{n}(k_{m}(\cdot,x_{j})k_{m^{\prime}}(x_{j},\cdot)-k_{m}(\cdot,\tilde{x}_{j})k_{m^{\prime}}(\tilde{x}_{j},\cdot))\right\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}. (76)

The RHS of (76) can be further evaluated as

\displaystyle \left\|\frac{1}{n}(k_{m}(\cdot,x_{j})k_{m^{\prime}}(x_{j},\cdot)-k_{m}(\cdot,\tilde{x}_{j})k_{m^{\prime}}(\tilde{x}_{j},\cdot))\right\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}
\leq\frac{1}{n}\left(\|k_{m}(\cdot,x_{j})k_{m^{\prime}}(x_{j},\cdot)\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}+\|k_{m}(\cdot,\tilde{x}_{j})k_{m^{\prime}}(\tilde{x}_{j},\cdot)\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}\right)
\leq\frac{1}{n}\left(\|k_{m}(\cdot,x_{j})\|_{\mathcal{H}_{m}}\|k_{m^{\prime}}(x_{j},\cdot)\|_{\mathcal{H}_{m^{\prime}}}+\|k_{m}(\cdot,\tilde{x}_{j})\|_{\mathcal{H}_{m}}\|k_{m^{\prime}}(\tilde{x}_{j},\cdot)\|_{\mathcal{H}_{m^{\prime}}}\right)
\leq\frac{1}{n}\left(\sqrt{k_{m}(x_{j},x_{j})k_{m^{\prime}}(x_{j},x_{j})}+\sqrt{k_{m}(\tilde{x}_{j},\tilde{x}_{j})k_{m^{\prime}}(\tilde{x}_{j},\tilde{x}_{j})}\right)
\leq\frac{2}{n}, (77)

where we used $\|k_{m}(\cdot,x_{j})\|_{\mathcal{H}_{m}}=\sqrt{\langle k_{m}(\cdot,x_{j}),k_{m}(\cdot,x_{j})\rangle_{\mathcal{H}_{m}}}=\sqrt{k_{m}(x_{j},x_{j})}$. Bounding the norm in (76) by (77), we have

\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}-\|\tilde{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}\leq\frac{2}{n}.

By symmetry, interchanging $\hat{\Sigma}$ and $\tilde{\Sigma}$ gives

|\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}-\|\tilde{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}|\leq\frac{2}{n}.

Therefore by McDiarmid’s inequality we obtain

\displaystyle P\left(\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}-\mathrm{E}[\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}]\geq\varepsilon\right)
\leq\exp\left(-\frac{2\varepsilon^{2}}{n(2/n)^{2}}\right)=\exp\left(-\frac{n\varepsilon^{2}}{2}\right).

This gives the first assertion Eq. (74).

To show the second assertion (Eq. (75)), first we note that

\displaystyle \mathrm{E}[\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}]\leq\sqrt{\mathrm{E}[\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}^{2}]}
=\sqrt{\mathrm{E}[\|(\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}})(\hat{\Sigma}_{m^{\prime},m}-\Sigma_{m^{\prime},m})\|_{\mathcal{H}_{m},\mathcal{H}_{m}}]}
\leq\sqrt{\mathrm{E}[\|(\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}})(\hat{\Sigma}_{m^{\prime},m}-\Sigma_{m^{\prime},m})\|_{\mathrm{tr}}]}, (78)

where $\|\cdot\|_{\mathrm{tr}}$ is the trace norm and the last inequality holds because the operator norm is bounded by the trace norm. As in Lemma 1 of Gretton et al. (2005), we see that

\displaystyle \|(\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}})(\hat{\Sigma}_{m^{\prime},m}-\Sigma_{m^{\prime},m})\|_{\mathrm{tr}}
=\frac{1}{n^{2}}\sum_{i,j=1}^{n}\mathrm{Tr}[k_{m}(\cdot,x_{i})k_{m^{\prime}}(x_{i},x_{j})k_{m}(x_{j},\cdot)]
\quad-\frac{2}{n}\sum_{i=1}^{n}\mathrm{E}_{X}[\mathrm{Tr}[k_{m}(\cdot,x_{i})k_{m^{\prime}}(x_{i},X)k_{m}(X,\cdot)]]+\mathrm{E}_{X,X^{\prime}}[\mathrm{Tr}[k_{m}(\cdot,X)k_{m^{\prime}}(X,X^{\prime})k_{m}(X^{\prime},\cdot)]]
=\frac{1}{n^{2}}\sum_{i,j=1}^{n}k_{m}(x_{j},x_{i})k_{m^{\prime}}(x_{i},x_{j})-\frac{2}{n}\sum_{i=1}^{n}\mathrm{E}_{X}[k_{m}(X,x_{i})k_{m^{\prime}}(x_{i},X)]+\mathrm{E}_{X,X^{\prime}}[k_{m}(X^{\prime},X)k_{m^{\prime}}(X,X^{\prime})],

where $X$ and $X^{\prime}$ are independent random variables distributed according to $\Pi$. Thus

\displaystyle \mathrm{E}[\|(\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}})(\hat{\Sigma}_{m^{\prime},m}-\Sigma_{m^{\prime},m})\|_{\mathrm{tr}}]
=\frac{n}{n^{2}}\mathrm{E}_{X}[k_{m}(X,X)k_{m^{\prime}}(X,X)]+\frac{n(n-1)}{n^{2}}\mathrm{E}_{X,X^{\prime}}[k_{m}(X^{\prime},X)k_{m^{\prime}}(X,X^{\prime})]
\quad-2\mathrm{E}_{X,X^{\prime}}[k_{m}(X^{\prime},X)k_{m^{\prime}}(X,X^{\prime})]+\mathrm{E}_{X,X^{\prime}}[k_{m}(X^{\prime},X)k_{m^{\prime}}(X,X^{\prime})]
=\frac{1}{n}\mathrm{E}_{X}[k_{m}(X,X)k_{m^{\prime}}(X,X)]-\frac{1}{n}\mathrm{E}_{X,X^{\prime}}[k_{m}(X^{\prime},X)k_{m^{\prime}}(X,X^{\prime})]\leq\frac{1}{n}.

This, together with Eq. (78) and the first assertion (Eq. (74)), gives the second assertion. ∎
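The $1/\sqrt{n}$ rate of Lemma 10 can be checked by Monte Carlo using the trace identity just derived, since every term reduces to kernel evaluations. In the sketch below, the Gaussian and Cauchy kernels and the standard normal design are illustrative assumptions, and the population expectations are approximated by a large independent sample.

```python
import numpy as np

# Monte Carlo check of Lemma 10: sqrt(n) * ||Sigma^ - Sigma||_HS should stay
# bounded as n grows, where the squared Hilbert-Schmidt norm is the trace
# Tr[(Sigma^-Sigma)(Sigma^-Sigma)*] expanded above into kernel evaluations.
rng = np.random.default_rng(2)
km = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2)        # Gaussian kernel
kmp = lambda a, b: 1.0 / (1.0 + (a[:, None] - b[None, :]) ** 2)  # Cauchy kernel

for n in [100, 400, 1600]:
    x = rng.standard_normal(n)
    emp = (km(x, x) * kmp(x, x)).sum() / n ** 2       # (1/n^2) sum_ij k_m k_m'
    y, z = rng.standard_normal(2000), rng.standard_normal(2000)
    cross = np.mean(km(x, y) * kmp(x, y))             # ~ (1/n) sum_i E_X[k_m k_m']
    pop = np.mean(km(y, z) * kmp(y, z))               # ~ E_{X,X'}[k_m k_m']
    dev2 = max(emp - 2 * cross + pop, 0.0)            # ~ ||Sigma^ - Sigma||_HS^2
    print(f"n = {n:4d}:  sqrt(n) * ||Sigma^-Sigma||_HS ~ {np.sqrt(n * dev2):.3f}")
```

The printed values hover around a constant rather than growing, consistent with Eq. (75).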

Lemma 11

If $\mathrm{E}[\epsilon^{2}|X]\leq\sigma^{2}$ almost surely and $\sup_{X}k_{m}(X,X)\leq 1$, then we have

\displaystyle \|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}=O_{p}(\sigma/\sqrt{n}). (79)

Proof: By definition, we have

\displaystyle \mathrm{E}[\|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}]\leq\sqrt{\mathrm{E}[\|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}^{2}]}
=\sqrt{\mathrm{E}\left[\frac{1}{n^{2}}\sum_{i,j=1}^{n}k_{m}(x_{i},x_{j})\epsilon_{i}\epsilon_{j}\right]}
\leq\sqrt{\frac{\sigma^{2}}{n}}.

Applying Markov’s inequality, we obtain the assertion. ∎
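The bound of Lemma 11 is also easy to check numerically, because $\|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}^{2}=\frac{1}{n^{2}}\sum_{i,j}k_{m}(x_{i},x_{j})\epsilon_{i}\epsilon_{j}$ is a finite quadratic form in the noise. A minimal sketch, with a Gaussian kernel and uniform design as illustrative assumptions:

```python
import numpy as np

# Numerical check of Lemma 11: ||Sigma^_{m,eps}||^2 = (1/n^2) eps^T K eps with
# Gram matrix K, so sqrt(n)/sigma * ||Sigma^_{m,eps}|| should stay O(1).
rng = np.random.default_rng(3)
sigma = 0.5

for n in [200, 800, 3200]:
    x = rng.uniform(-1.0, 1.0, n)
    eps = sigma * rng.standard_normal(n)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2)     # Gaussian kernel Gram matrix
    norm = np.sqrt(eps @ K @ eps) / n               # ||Sigma^_{m,eps}||_{H_m}
    print(f"n = {n:4d}:  sqrt(n)/sigma * norm = {np.sqrt(n) * norm / sigma:.3f}")
```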

Proposition 1 (Bernstein’s inequality in Hilbert spaces)

Let $(\Omega,\mathcal{A},P)$ be a probability space, $\mathcal{H}$ a separable Hilbert space, $B>0$, and $\sigma>0$. Furthermore, let $\xi_{1},\dots,\xi_{n}:\Omega\to\mathcal{H}$ be independent random variables satisfying $\mathrm{E}[\xi_{i}]=0$, $\|\xi_{i}\|_{\mathcal{H}}\leq B$, and $\mathrm{E}[\|\xi_{i}\|_{\mathcal{H}}^{2}]\leq\sigma^{2}$ for all $i=1,\dots,n$. Then we have

\displaystyle P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\right\|_{\mathcal{H}}\geq\sqrt{\frac{2\sigma^{2}\tau}{n}}+\sqrt{\frac{\sigma^{2}}{n}}+\frac{2B\tau}{3n}\right)\leq e^{-\tau}\quad(\tau>0).

Proof: See Theorem 6.14 of Steinwart (2008). ∎
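For completeness, the tail bound of Proposition 1 can be sanity-checked by simulation in $\mathcal{H}=\mathbb{R}^{d}$; the uniform distribution below is an illustrative assumption satisfying the moment conditions with $B=\sqrt{d}$ and $\sigma^{2}=d/3$.

```python
import numpy as np

# Monte Carlo sanity check of the Hilbert-space Bernstein inequality
# (Proposition 1) for i.i.d. mean-zero vectors xi_i ~ Uniform([-1,1]^d).
rng = np.random.default_rng(4)
n, d, trials, tau = 500, 5, 2000, 2.0

xi = rng.uniform(-1.0, 1.0, (trials, n, d))   # E[xi] = 0, ||xi|| <= sqrt(d)
B, sigma2 = np.sqrt(d), d / 3.0               # E||xi||^2 = d * Var(U[-1,1]) = d/3
thresh = np.sqrt(2 * sigma2 * tau / n) + np.sqrt(sigma2 / n) + 2 * B * tau / (3 * n)

norms = np.linalg.norm(xi.mean(axis=1), axis=1)   # ||(1/n) sum_i xi_i|| per trial
print(f"empirical tail = {(norms >= thresh).mean():.4f}  <=  e^-tau = {np.exp(-tau):.4f}")
```

The empirical tail probability falls well below $e^{-\tau}$, as the proposition guarantees.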

References

  • Bach et al. (2004) F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In the 21st International Conference on Machine Learning, pages 41–48, 2004.
  • Bach (2008) F. R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
  • Baker (1973) C. R. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.
  • Bartlett et al. (2005) P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487–1537, 2005.
  • Bickel et al. (2009) P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
  • Bousquet (2002) O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical process. C. R. Acad. Sci. Paris Ser. I Math., 334:495–500, 2002.
  • Caponnetto and de Vito (2007) A. Caponnetto and E. de Vito. Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
  • Cortes (2009) C. Cortes. Can learning kernels help performance? Invited talk at the International Conference on Machine Learning (ICML 2009), Montréal, Canada, 2009.
  • Cortes et al. (2009) C. Cortes, M. Mohri, and A. Rostamizadeh. $L_{2}$ regularization for learning kernels. In the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), Montréal, Canada, 2009.
  • Devroye et al. (1996) L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
  • Gretton et al. (2005) A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence, pages 63–77, Berlin, 2005. Springer-Verlag.
  • Jia and Yu (2010) J. Jia and B. Yu. On model selection consistency of the elastic net when p \gg n. Statistica Sinica, 20(2):to appear, 2010.
  • Kloft et al. (2009) M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien. Efficient and accurate $\ell_{p}$-norm multiple kernel learning. In Advances in Neural Information Processing Systems 22, pages 997–1005, Cambridge, MA, 2009. MIT Press.
  • Koltchinskii (2006) V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34:2593–2656, 2006.
  • Koltchinskii and Yuan (2008) V. Koltchinskii and M. Yuan. Sparse recovery in large ensembles of kernel machines. In Proceedings of the Annual Conference on Learning Theory, pages 229–238, 2008.
  • Lanckriet et al. (2004) G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5:27–72, 2004.
  • Ledoux and Talagrand (1991) M. Ledoux and M. Talagrand. Probability in Banach Spaces. Isoperimetry and Processes. Springer, New York, 1991. MR1102015.
  • Lin and Zhang (2006) Y. Lin and H. H. Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272–2297, 2006.
  • Meier et al. (2009) L. Meier, S. van de Geer, and P. Bühlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779–3821, 2009.
  • Mendelson (2002) S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48:1977–1991, 2002.
  • Micchelli and Pontil (2005) C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.
  • Rakotomamonjy et al. (2008) A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
  • Sonnenburg et al. (2006) S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
  • Steinwart (2008) I. Steinwart. Support Vector Machines. Springer, 2008.
  • Steinwart et al. (2009) I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the Annual Conference on Learning Theory, pages 79–93, 2009.
  • Stone (1974) M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111–147, 1974.
  • Suzuki and Tomioka (2009) T. Suzuki and R. Tomioka. SpicyMKL, 2009. arXiv:0909.5026.
  • Talagrand (1996a) M. Talagrand. A new look at independence. The Annals of Statistics, 24:1–34, 1996a.
  • Talagrand (1996b) M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505–563, 1996b.
  • Tomioka and Suzuki (2010) R. Tomioka and T. Suzuki. Sparsity-accuracy trade-off in MKL, 2010. arXiv:1001.2615.
  • van de Geer (2000) S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
  • van der Vaart and Wellner (1996) A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.
  • Vapnik (1998) V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
  • Yuan and Lin (2007) M. Yuan and Y. Lin. On the nonnegative garrote estimator. Journal of the Royal Statistical Society B, 69(2):143–161, 2007.
  • Zhang (2009) T. Zhang. Some sharp performance bounds for least squares regression with $l_{1}$ regularization. The Annals of Statistics, 37(5):2109–2144, 2009.
  • Zhao and Yu (2006) P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
  • Zou and Hastie (2005) H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.
  • Zou and Zhang (2009) H. Zou and H. H. Zhang. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4):1733–1751, 2009.