
Sharp Convergence Rate and Support Consistency of
Multiple Kernel Learning with Sparse and Dense Regularization

Taiji Suzuki, Ryota Tomioka
Department of Mathematical Informatics,
The University of Tokyo,
7-3-1 Hongo, Bunkyo-ku, Tokyo
t-suzuki@mist.i.u-tokyo.ac.jp,
tomioka@mist.i.u-tokyo.ac.jp
Masashi Sugiyama
Department of Computer Science,
Tokyo Institute of Technology,
2-12-1 O-okayama, Meguro-ku, Tokyo
sugi@cs.titech.ac.jp
Abstract

We theoretically investigate the convergence rate and support consistency (i.e., correctly identifying the subset of non-zero coefficients in the large sample limit) of multiple kernel learning (MKL). We focus on MKL with block-$\ell_1$ regularization (inducing sparse kernel combination), block-$\ell_2$ regularization (inducing uniform kernel combination), and elastic-net regularization (including both block-$\ell_1$ and block-$\ell_2$ regularization). For the case where the true kernel combination is sparse, we show a sharper convergence rate of the block-$\ell_1$ and elastic-net MKL methods than the existing rate for block-$\ell_1$ MKL. We further show that elastic-net MKL requires a milder condition for being consistent than block-$\ell_1$ MKL. For the case where the optimal kernel combination is not exactly sparse, we prove that elastic-net MKL can achieve a faster convergence rate than the block-$\ell_1$ and block-$\ell_2$ MKL methods by carefully controlling the balance between the block-$\ell_1$ and block-$\ell_2$ regularizers. Thus, our theoretical results overall suggest the use of elastic-net regularization in MKL.

1 Introduction

The choice of kernel functions is a key issue for kernel methods such as support vector machines to work well (Vapnik, 1998). A traditional but very powerful approach to optimizing the kernel function is the use of cross-validation (CV) (Stone, 1974). Although the CV-based kernel choice often leads to better generalization, it is computationally expensive when the kernel contains multiple tuning parameters.

To overcome this limitation, the framework of multiple kernel learning (MKL) has been introduced, which tries to learn the optimal linear combination of prefixed base kernels by convex optimization (Lanckriet et al., 2004, Micchelli and Pontil, 2005, Lin and Zhang, 2006, Sonnenburg et al., 2006, Rakotomamonjy et al., 2008, Suzuki and Tomioka, 2009). The seminal paper by Bach et al. (2004) showed that this MKL formulation can be interpreted as block-$\ell_1$ regularization (i.e., $\ell_1$ regularization across the kernels and $\ell_2$ regularization within the same kernel). We refer to this MKL formulation as 'block-$\ell_1$ MKL'. Based on this interpretation, block-$\ell_1$ MKL was proved to be support consistent (i.e., correctly identifying the subset of non-zero coefficients with probability one in the large sample limit) when the true kernel combination is sparse (Bach, 2008). Furthermore, the convergence rate of block-$\ell_1$ MKL has also been elucidated in Koltchinskii and Yuan (2008), which can be regarded as an extension of the theoretical analysis for ordinary (non-block) $\ell_1$ regularization (Bickel et al., 2009, Zhang, 2009).

However, in many practical applications, the true kernel combination may not be exactly sparse. In such a non-sparse situation, block-$\ell_1$ MKL was shown to perform rather poorly; just the uniform combination of base kernels obtained by block-$\ell_2$ regularization (Micchelli and Pontil, 2005) (which we call 'block-$\ell_2$ MKL') often works better in practice (Cortes, 2009). Furthermore, recent works showed that some 'intermediate' regularization between block-$\ell_1$ and block-$\ell_2$ regularization is more promising, e.g., block-$\ell_p$ regularization with $1\leq p\leq 2$ (Cortes et al., 2009, Kloft et al., 2009), and elastic-net regularization (Zou and Hastie, 2005), which includes both block-$\ell_1$ and block-$\ell_2$ regularization (Tomioka and Suzuki, 2010) (we call this method 'elastic-net MKL'). Theoretically, the support consistency and the convergence rate of parametric elastic-nets have been elucidated in Yuan and Lin (2007) and Zou and Zhang (2009), respectively, and the non-parametric case has been investigated in Meier et al. (2009), focusing on the Sobolev space.

In this paper, we theoretically analyze the support consistency and convergence rate of MKL, and provide three new results.

  • For the case where the true kernel combination is sparse, we show that elastic-net MKL achieves a faster convergence rate than the one shown for block-$\ell_1$ MKL (Koltchinskii and Yuan, 2008). More specifically, we show that the $L_2$ convergence error is given by $\mathcal{O}_p\big(\min\{dn^{-\frac{2}{2+s}}+d\log(M)/n,\; d^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}+d\log(M)/n\}\big)$, where $d$ is the number of active components of the target function, $s$ is the complexity of the RKHSs, $M$ is the number of candidate kernels, and $n$ is the number of samples.

  • For the case where the optimal kernel combination is not exactly sparse, we prove that elastic-net MKL achieves a faster convergence rate than the block-$\ell_1$ and block-$\ell_2$ MKL methods by carefully controlling the balance between block-$\ell_1$ and block-$\ell_2$ regularization. Our theoretical result agrees well with the experimental results reported in Tomioka and Suzuki (2010).

  • For the case where the true kernel combination is sparse, we prove that the necessary and sufficient conditions for the support consistency of elastic-net MKL are milder than the conditions required for block-$\ell_1$ MKL (Bach, 2008).

Overall, our theoretical results suggest the use of elastic-net regularization in MKL.

2 Preliminaries

In this section, we formulate the elastic-net MKL approach and summarize mathematical tools that are needed for the theoretical analysis.

2.1 Formulation

Suppose we are given $n$ samples $(x_i,y_i)_{i=1}^n$, where $x_i$ belongs to an input space $\mathcal{X}$ and $y_i\in\mathbb{R}$. The samples $(x_i,y_i)_{i=1}^n$ are independent and identically distributed from a probability measure $P$. We denote the marginal distribution of $X$ by $\Pi$. We consider an MKL regression problem in which the unknown target function is represented in the form $f(x)=\sum_{m=1}^M f_m(x)$, where each $f_m$ belongs to a different RKHS $\mathcal{H}_m$ $(m=1,\dots,M)$ corresponding to $M$ different base kernels $k_m$ over $\mathcal{X}\times\mathcal{X}$.

Elastic-net MKL learns a decision function $\hat{f}$ as follows. (For simplicity, we focus on the squared-loss function here. However, we note that it is straightforward to extend our convergence analysis and support consistency results given in Sections 3 and 4 to general loss functions that are strongly convex and Lipschitz continuous, by following the line of Koltchinskii and Yuan (2008).)

\[
\hat{f} = \mathop{\arg\min}_{f_m\in\mathcal{H}_m\,(m=1,\dots,M)} \frac{1}{n}\sum_{i=1}^n\Big(y_i-\sum_{m=1}^M f_m(x_i)\Big)^2 + \lambda_1^{(n)}\sum_{m=1}^M\|f_m\|_{\mathcal{H}_m} + \lambda_2^{(n)}\sum_{m=1}^M\|f_m\|_{\mathcal{H}_m}^2, \tag{1}
\]

where the first term is the squared loss of function fitting, and the second and third terms are the block-$\ell_1$ and block-$\ell_2$ regularizers, respectively. It can be seen from (1) that elastic-net MKL reduces to block-$\ell_1$ MKL if $\lambda_2^{(n)}=0$, which tends to induce a sparse kernel combination (Lanckriet et al., 2004, Bach et al., 2004). On the other hand, it reduces to block-$\ell_2$ MKL if $\lambda_1^{(n)}=0$, which results in a uniform kernel combination (Micchelli and Pontil, 2005). It is worth noting that elastic-net MKL allows us to obtain various levels of sparsity by controlling the ratio between $\lambda_1^{(n)}$ and $\lambda_2^{(n)}$.
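To make the estimator (1) concrete, here is a minimal numerical sketch of elastic-net MKL, assuming Gaussian base kernels of several widths and a slightly smoothed block norm so that an off-the-shelf quasi-Newton solver can be used. By the representer theorem, each $f_m$ can be parameterized as $f_m(\cdot)=\sum_{i=1}^n\alpha_{m,i}k_m(\cdot,x_i)$ with $\|f_m\|_{\mathcal{H}_m}^2=\alpha_m^\top K_m\alpha_m$, where $K_m$ is the Gram matrix. The data, kernel widths, and regularization values below are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: the target depends on a single smooth component.
rng = np.random.default_rng(0)
n, M = 50, 3
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(n)

# Gaussian base kernels with different (hypothetical) widths.
widths = [0.1, 0.5, 2.0]
K = [np.exp(-(X - X.T) ** 2 / (2 * w ** 2)) for w in widths]

lam1, lam2, eps = 0.05, 0.05, 1e-8  # block-l1 / block-l2 weights

def objective(alpha_flat):
    # alpha_flat stacks one coefficient vector alpha_m per kernel;
    # f_m(x_i) = (K_m alpha_m)_i and ||f_m||_{H_m}^2 = alpha_m' K_m alpha_m.
    A = alpha_flat.reshape(M, n)
    residual = y - sum(K[m] @ A[m] for m in range(M))
    norms2 = np.array([A[m] @ K[m] @ A[m] for m in range(M)])
    norms = np.sqrt(norms2 + eps)  # eps-smoothed block norm
    return residual @ residual / n + lam1 * norms.sum() + lam2 * norms2.sum()

res = minimize(objective, np.zeros(M * n), method="L-BFGS-B")
A = res.x.reshape(M, n)
print("block norms:", [round(float(np.sqrt(A[m] @ K[m] @ A[m])), 3)
                       for m in range(M)])
```

Increasing `lam1` relative to `lam2` drives the block norms of the irrelevant kernels toward zero, mirroring the sparsity discussion above.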

2.2 Notations and Assumptions

Here, we prepare technical tools needed in the following sections.

By Mercer's theorem, there exist an orthonormal system $\{\phi_{k,m}\}_{k,m}$ in $L_2(\Pi)$ and a spectrum $\{\mu_{k,m}\}_{k,m}$ such that $k_m$ has the following spectral representation:

\[
k_m(x,x') = \sum_{k=1}^\infty \mu_{k,m}\phi_{k,m}(x)\phi_{k,m}(x'). \tag{2}
\]

By this spectral representation, the inner product of the RKHS can be expressed as $\langle f_m,g_m\rangle_{\mathcal{H}_m}=\sum_{k=1}^\infty \mu_{k,m}^{-1}\langle f_m,\phi_{k,m}\rangle_{L_2(\Pi)}\langle \phi_{k,m},g_m\rangle_{L_2(\Pi)}$.
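Numerically, the spectral representation (2) can be approximated from data: the eigenvalues of the Gram matrix divided by $n$ approximate the spectrum $\{\mu_{k,m}\}_k$ of the integral operator, and the eigenvectors scaled by $\sqrt{n}$ approximate the orthonormal system $\{\phi_{k,m}\}_k$ at the sample points. A small sketch of ours, assuming $\Pi$ is uniform on $[-1,1]$ and a Gaussian base kernel:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-1, 1, n)                    # samples from Pi (assumed uniform)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)  # one Gaussian base kernel k_m

# Eigenvalues of K/n approximate the spectrum {mu_{k,m}}_k of the integral
# operator in (2); eigenvectors scaled by sqrt(n) approximate {phi_{k,m}}_k
# at the sample points (orthonormal in the empirical L2(Pi) inner product).
mu, U = np.linalg.eigh(K / n)
mu, U = mu[::-1], U[:, ::-1] * np.sqrt(n)
print("leading eigenvalues:", np.round(mu[:5], 4))  # rapid spectral decay
print("empirical <phi_1, phi_2>:", round(float(U[:, 0] @ U[:, 1] / n), 6))
```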

Let $\mathcal{H}=\mathcal{H}_1\oplus\dots\oplus\mathcal{H}_M$. For $f=(f_1,\dots,f_M)\in\mathcal{H}$ and a subset of indices $I\subseteq\{1,\dots,M\}$, we denote by $f_I$ the restriction of $f$ to the index set $I$, i.e., $f_I=(f_m)_{m\in I}$.

We denote by $I_0$ the indices of truly active kernels, i.e.,

\[
I_0 = \{m \mid \|f^*_m\|_{\mathcal{H}_m} > 0\},
\]

and define the complement of $I_0$ as $J_0 = I_0^c$.

Throughout the paper, we assume the following technical conditions (see also Bach (2008)).

Assumption 1

(Basic Assumptions)

  1. $\mathrm{(A1)}$

    There exists $f^*=(f^*_1,\dots,f^*_M)\in\mathcal{H}$ such that $\mathrm{E}[Y|X]=\sum_{m=1}^M f^*_m(X)$, and the noise $\epsilon:=Y-f^*(X)$ has a strictly positive variance; that is, there exists $\sigma>0$ such that $\mathrm{E}[\epsilon^2|X]>\sigma^2$ for all $X\in\mathcal{X}$. We also assume that $\epsilon$ is bounded as $|\epsilon|\leq L$.

  2. $\mathrm{(A2)}$

    For each $m=1,\dots,M$, $\mathcal{H}_m$ is separable and $\sup_{X\in\mathcal{X}}|k_m(X,X)|<1$.

  3. $\mathrm{(A3)}$

    There exists $g^*_m\in\mathcal{H}_m$ such that

    \[
    f^*_m(x) = \int_{\mathcal{X}} k_m^{(1/2)}(x,x')\,g^*_m(x')\,\mathrm{d}\Pi(x') \qquad (\forall m=1,\dots,M), \tag{3}
    \]

    where $k_m^{(1/2)}(x,x')=\sum_{k=1}^\infty \mu_{k,m}^{1/2}\phi_{k,m}(x)\phi_{k,m}(x')$ is the operator square root of $k_m$.

The first assumption in (A1) ensures that the model $\mathcal{H}$ is correctly specified, and the technical assumption $|\epsilon|\leq L$ allows $\epsilon f$ to be Lipschitz continuous with respect to $f$.

It is known that assumption (A2) gives the following relation:

\[
\|f_m\|_\infty \leq \sup_x\,\langle k_m(x,\cdot),f_m\rangle_{\mathcal{H}_m} \leq \sup_x\,\|k_m(x,\cdot)\|_{\mathcal{H}_m}\|f_m\|_{\mathcal{H}_m} \leq \sup_x\sqrt{k_m(x,x)}\,\|f_m\|_{\mathcal{H}_m} \leq \|f_m\|_{\mathcal{H}_m}.
\]
Table 1: Summary of the constants we use in this article.

$M$: the number of candidate kernels.
$d$: the number of active kernels of the truth, i.e., $d=|I_0|$.
$R$: the upper bound of $\sum_{m=1}^M(\|f^*_m\|_{\mathcal{H}_m}+\|f^*_m\|_{\mathcal{H}_m}^2)$; see (A4).
$s$: the spectral decay coefficient; see (A5).
$\beta$: the approximate sparsity coefficient; see (A7).
$b$: the parameter that tunes the correlation between kernels; see (A8).

Assumption (A3) was used in Caponnetto and de Vito (2007) and also in Bach (2008). It ensures the consistency of the least-squares estimates in terms of the RKHS norm. Using the spectral representation (2), the condition $g^*_m\in\mathcal{H}_m$ is expressed as

\[
\|g^*_m\|_{\mathcal{H}_m}^2 = \sum_{k=1}^\infty \mu_{k,m}^{-2}\langle f^*_m,\phi_{k,m}\rangle_{L_2(\Pi)}^2 < \infty. \tag{4}
\]

This condition was also assumed in Koltchinskii and Yuan (2008). Proposition 9 of Bach (2008) gave a sufficient condition for (3) to hold for translation-invariant kernels $k_m(x,x')=h_m(x-x')$.

Constants we use later are summarized in Table 1.

3 Convergence Rate of Elastic-net MKL

In this section, we derive the convergence rate of elastic-net MKL in two situations:

  1. (i)

    A sparse situation, where the truth $f^*$ is sparse (Section 3.1).

  2. (ii)

    A near-sparse situation, where the truth is not exactly sparse, but $\|f^*_m\|_{\mathcal{H}_m}$ decays polynomially as $m$ increases (Section 3.2).

For case (i), we show that elastic-net MKL (and block-$\ell_1$ MKL) achieves a faster convergence rate than the rate shown for block-$\ell_1$ MKL by Koltchinskii and Yuan (2008). Furthermore, for case (ii), we show that elastic-net MKL can outperform block-$\ell_1$ MKL and block-$\ell_2$ MKL depending on the sparsity of the truth and the conditioning of the problem. Throughout this section, we assume the following conditions.

Assumption 2

(Boundedness Assumption) There exist constants $C_1$ and $R$ such that

\[
\mathrm{(A4)}\qquad \max_{m\in I_0}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}} \leq C_1, \qquad \sum_{m=1}^M\big(\|f^*_m\|_{\mathcal{H}_m}+\|f^*_m\|_{\mathcal{H}_m}^2\big) \leq R.
\]
Assumption 3

(Spectral Assumption) There exist $0<s<1$ and $C_2$ such that

\[
\mathrm{(A5)}\qquad \mu_{k,m} \leq C_2 k^{-\frac{1}{s}} \qquad (\forall k\geq 1,\ 1\leq \forall m\leq M),
\]

where $\{\mu_{k,m}\}_k$ is the spectrum of the kernel $k_m$ (see Eq. (2)).

The first assumption in (A4) appeared in Theorem 2 of Koltchinskii and Yuan (2008). The second assumption in (A4) bounds the amplitude of $f^*$. It was shown that the spectral assumption (A5) is equivalent to the classical covering-number assumption (Steinwart et al., 2009). Recall that the $\epsilon$-covering number $\mathcal{N}(\epsilon,\mathcal{B}_{\mathcal{H}_m},L_2(\Pi))$ with respect to $L_2(\Pi)$ is the minimal number of balls with radius $\epsilon$ needed to cover the unit ball $\mathcal{B}_{\mathcal{H}_m}$ of $\mathcal{H}_m$ (van der Vaart and Wellner, 1996). If the spectral assumption (A5) holds, there exists a constant $c$ depending only on $s$ such that

\[
\mathcal{N}(\varepsilon,\mathcal{B}_{\mathcal{H}_m},L_2(\Pi)) \leq c\,\varepsilon^{-2s}, \tag{5}
\]

and the converse is also true (see Theorem 15 of Steinwart et al. (2009) and Steinwart (2008) for details). Therefore, if $s$ is large, at least one RKHS is "complex", and if $s$ is small, the RKHSs are regarded as "simple".
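As a concrete instance (our illustration, not from the original text): if every kernel's spectrum decays as $\mu_{k,m}\leq Ck^{-2}$, then (A5) holds with $s=1/2$, and the bound (5) becomes
\[
\mathcal{N}(\varepsilon,\mathcal{B}_{\mathcal{H}_m},L_2(\Pi)) \leq c\,\varepsilon^{-1}.
\]
Faster spectral decay (smaller $s$) thus translates directly into a smaller covering number, i.e., a "simpler" class of functions.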

For a given set of indices $I\subseteq\{1,\dots,M\}$, let $\kappa(I)$ be defined as follows:

\[
\kappa(I) := \sup\left\{\kappa\geq 0 \;\middle|\; \kappa \leq \frac{\|\sum_{m\in I}f_m\|_{L_2(\Pi)}^2}{\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2},\ \forall f_m\in\mathcal{H}_m\ (m\in I)\right\}.
\]

$\kappa(I)$ represents the correlation of the RKHSs inside the index set $I$. Similarly, we define the correlation of the RKHSs between $I$ and $I^c$ as follows:

\[
\rho(I) := \sup\left\{\frac{\langle f_I,g_{I^c}\rangle_{L_2(\Pi)}}{\|f_I\|_{L_2(\Pi)}\|g_{I^c}\|_{L_2(\Pi)}} \;\middle|\; f_I\in\mathcal{H}_I,\ g_{I^c}\in\mathcal{H}_{I^c},\ f_I\neq 0,\ g_{I^c}\neq 0\right\}.
\]
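To build intuition for these quantities (an illustration of ours): if the RKHSs $\mathcal{H}_1,\dots,\mathcal{H}_M$ are mutually orthogonal in $L_2(\Pi)$, then $\|\sum_{m\in I}f_m\|_{L_2(\Pi)}^2=\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2$ and $\langle f_I,g_{I^c}\rangle_{L_2(\Pi)}=0$, so $\kappa(I)=1$ and $\rho(I)=0$; the incoherence-type conditions (A6) and (A8) below then hold with the best possible constants. Strongly correlated kernels push $\kappa(I)$ toward $0$ and $\rho(I)$ toward $1$, degrading these constants.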

In Subsections 3.1 and 3.2, we will assume that the kernels have no perfect canonical dependence, i.e., that the kernels are not degenerately similar to each other (see (A6) and (A8) below).

Throughout this paper, we assume $\frac{\log(Mn)}{n}\leq 1$ and that $\log(M)$ grows more slowly than any polynomial order in the number of samples $n$: $\log(M)=o(n^\epsilon)$ for all $\epsilon>0$. With some abuse of notation, we use $C$ to denote constants that are independent of $d$ and $n$; its value may differ from line to line.

3.1 Sparse Situation

Here we derive the convergence rate of the estimator $\hat{f}$ when the truth $f^*$ is sparse. Let $d=|I_0|$ and suppose that the number of kernels $M$ and the number of active kernels $d$ are increasing with respect to the number of samples $n$. We further assume the following condition in this subsection.

Assumption 4

(Incoherence Assumption) There exists a constant $C_3>0$ such that

\[
\mathrm{(A6)}\qquad 0 < C_3^{-1} < \kappa(I_0)\big(1-\rho^2(I_0)\big). \tag{6}
\]

This condition is known as the incoherence condition (Koltchinskii and Yuan, 2008, Meier et al., 2009), i.e., kernels are not too dependent on each other and the problem is well conditioned. Then we have the following convergence rate.

Theorem 1

Under assumptions (A1-A6), there exist constants $C$, $F$, and $K$ depending only on $\kappa(I_0)$, $\rho(I_0)$, $s$, $C_1$, $C_2$, $L$, and $R$ such that the $L_2(\Pi)$-norm of the residual $\hat{f}-f^*$ can be bounded as follows. When $d^{3+s}n^{-1}\leq 1$, for $\lambda_1^{(n)}=\lambda_2^{(n)}=\max\{Kn^{-\frac{1}{2+s}}+\tilde{K}_2\sqrt{\frac{t}{n}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$,

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(dn^{-\frac{2}{2+s}}+\frac{dt}{n}\Big), \tag{7}
\]

and, when $d^{3+s}n^{-1}>1$, for $\lambda_1^{(n)}=\max\{K(1+\sqrt{t})n^{-\frac{1}{2}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}\leq\lambda_1^{(n)}$,

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(d^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}+\frac{d(\log(Mn)+t)}{n}\Big), \tag{8}
\]

where each inequality holds with probability at least $1-e^{-t}-n^{-1}$ for all $t\geq\log\log(R\sqrt{n})+\log M$.

The above theorem indicates that the learning rate depends on the complexity of the RKHSs (the simpler, the faster) and on the number of active kernels rather than the total number of kernels $M$ (the influence of $M$ is at most $\frac{d\log(M)}{n}$). It is worth noting that the convergence rate in (7) and (8) is faster than or equal to the rate for block-$\ell_1$ MKL shown by Koltchinskii and Yuan (2008), who established the learning rate $O_p\big(d^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}+\frac{d\log(M)}{n}\big)$ under the same conditions as ours. (In our second bound (8), there is an additional $\frac{d\log(n)}{n}$ term. However, this can be eliminated by replacing the probability $1-e^{-t}-n^{-1}$ with $1-e^{-t}-M^{-A}$ as in Koltchinskii and Yuan (2008). Moreover, if $\sqrt{n}\log(n)^{-\frac{1+s}{2s}}\geq d$, then the term $\frac{d\log(n)}{n}$ is dominated by the first term $d^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}$.)
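As a quick check on how the two bounds compare (our own arithmetic, ignoring the $\log M$ and $t$ terms): the exponents in $n$ satisfy
\[
\frac{2}{2+s}-\frac{1}{1+s}=\frac{s}{(2+s)(1+s)}>0 \qquad (0<s<1),
\]
so for fixed $d$ the bound (7) always decays faster in $n$. The $d$-factors go the other way ($d$ versus $d^{\frac{1-s}{1+s}}$ with $\frac{1-s}{1+s}<1$). Balancing the two, (7) is the smaller bound exactly when $d^{\frac{2s}{1+s}}\leq n^{\frac{s}{(2+s)(1+s)}}$, i.e., $d\leq n^{\frac{1}{2(2+s)}}$; this is why the minimum of the two expressions appears in the rate stated in the introduction.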

3.2 Near-Sparse Situation

In this subsection, we analyze the convergence rate in a situation where $f^*$ is not sparse but near sparse. We showed a faster learning rate than existing bounds in the previous subsection. However, the assumptions we used might be too restrictive to capture the situations where MKL is used in practice. In fact, it was pointed out by Zou and Hastie (2005), in the context of (non-block) $\ell_1$ regularization, that $\ell_1$ regularization can fail in the following situations:

  • When the truth $f^*$ is not sparse, $\ell_1$ regularization shrinks many small but non-zero components to zero.

  • When there exist strong correlations between different kernels, the solution of block-$\ell_1$ MKL becomes unstable.

  • When the number of kernels $M$ is not large, there is no need to force the estimator to be sparse.

In order to analyze these situations in the MKL setting, we introduce three parameters $\beta$, $b$, and $\tau$: $\beta$ controls the level of sparsity (see (A7)), $b$ controls the correlation between candidate kernels (see (A8)), and $\tau$ controls the growth of the number of kernels with the number of samples (see (A9)).

We show that, naturally, block-$\ell_2$ MKL is preferable when there are only a few candidate kernels or the truth is dense. Importantly, if the candidate kernels are correlated, the convergence of block-$\ell_1$ MKL can be slow even when the truth is sparse. Our analysis shows that elastic-net MKL is most valuable in such intermediate situations.

By permuting indices, we can assume without loss of generality that $\|f^*_m\|_{\mathcal{H}_m}$ is decreasing with respect to $m$, i.e., $\|f^*_1\|_{\mathcal{H}_1}\geq\|f^*_2\|_{\mathcal{H}_2}\geq\|f^*_3\|_{\mathcal{H}_3}\geq\cdots$. We further assume the following conditions in this subsection.

Assumption 5

(Approximate Sparsity) The truth is approximately sparse, i.e., $\|f^*_m\|_{\mathcal{H}_m}>0$ for all $m$ and thus $I_0=\{1,\dots,M\}$. However, $\|f^*_m\|_{\mathcal{H}_m}$ decays polynomially with respect to $m$ as follows:

\[
\mathrm{(A7)}\qquad \|f^*_m\|_{\mathcal{H}_m} \leq C_3 m^{-\beta}.
\]

We call $\beta\,(>1)$ the approximate sparsity coefficient.

Assumption 6

(Generalized Incoherence) There exist $b>0$ and $C_4$ such that for all $I\subseteq\{1,\dots,M\}$,

\[
\mathrm{(A8)}\qquad \big(1-\rho^2(I)\big)\kappa(I) \geq C_4|I|^{-b}.
\]
Assumption 7

(Kernel-Set Growth) The number of kernels $M$ increases polynomially with respect to the number of samples $n$, i.e., there exists $\tau>0$ such that

\[
\mathrm{(A9)}\qquad M = \lceil n^\tau\rceil.
\]

For notational convenience, let $\tau_1=\frac{1}{(2\beta+b)(2+s)-1-s}$, $\tau_2=\frac{(s-1)(2\beta-1)+bs}{(2\beta+b)(2+s)-1-s}$, $\tau_3=\frac{s\{2(b+\beta)-1\}}{2(2+s)(b+\beta)-s}$, $\tau_4=\frac{s}{2+s}$, $\tau_5=\frac{b+1}{(\beta+b)\{b(2+s)+2\}}$, and $\tau_6=\frac{1}{(1-s)(1+b)}$. In addition, we denote by $K$ some sufficiently large constant.
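To see that these thresholds can indeed be ordered as required below (a numerical illustration of ours): take $s=1/2$, $b=5$, $\beta=1$, which satisfies the condition $2\beta(1-s)=1<s(b-1)=2$ of Theorem 2. Then
\[
\tau_1=\tfrac{1}{16}, \qquad \tau_2=\tfrac{1}{8}, \qquad \tau_3=\tfrac{11}{59}\approx 0.186, \qquad \tau_4=\tfrac{1}{5},
\]
so $\tau_1<\tau_2<\tau_3<\tau_4$ and each of the three regimes in Theorem 2 is non-empty.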

Theorem 2

Suppose assumptions (A1-A5) and (A7-A9), $2\beta(1-s)<s(b-1)$, and $\tau_1<\tau<\tau_4$ are satisfied. Then the estimator of elastic-net MKL possesses the following convergence rates, each of which holds with probability at least $1-e^{-t}-n^{-1}$ for all $t\geq\log\log(R\sqrt{n})+\log M$.

1. When $\tau_1<\tau<\tau_2$,

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big\{n^{-\gamma_1}+\Big(n^{-\frac{(2\beta+b)(2+s)-3-s+2\beta}{2\{(2\beta+b)(2+s)-1-s\}}}+{\lambda_2^{(n)}}^2\Big)(\sqrt{t}+t)\Big\}, \quad\text{where}\quad \gamma_1=\frac{4\beta+b-2}{(2+s)(2\beta+b)-1-s}, \tag{9}
\]

with $\lambda_1^{(n)}=\max\{Kn^{-\frac{3\beta+b-1}{(2\beta+b)(2+s)-1-s}}+\tilde{K}_2\sqrt{\frac{t}{n}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}=Kn^{-\frac{2\beta+b-1}{(2\beta+b)(2+s)-1-s}}$.

2. When $\tau_2\leq\tau<\tau_3$,

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big\{n^{\tau\frac{(2+s)b+2}{2\{(2+s)(b+\beta)-s\}}-\gamma_2}+\Big(n^{\frac{\tau(2+s)(1-\beta)-(4\beta+2b+sb-2)}{2\{(\beta+b)(2+s)-s\}}}+{\lambda_2^{(n)}}^2\Big)(\sqrt{t}+t)\Big\}, \quad\text{where}\quad \gamma_2=\frac{4\beta+b(2+s)-2}{2\{(2+s)(b+\beta)-s\}}, \tag{10}
\]

with $\lambda_1^{(n)}=\max\{K\sqrt{\frac{M}{n}}+\tilde{K}_2\sqrt{\frac{t}{n}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}=Kn^{\frac{\tau-\{2(b+\beta)-1\}}{2\{(2+s)(b+\beta)-s\}}}$.

3. When $\tau_3\leq\tau<\tau_4$,

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(n^{\tau\gamma_3-\gamma_3}+\Big(n^{\frac{\tau(\beta-1)+1-2\beta-b}{2(b+\beta)}}+{\lambda_2^{(n)}}^2\Big)(\sqrt{t}+t)\Big), \quad\text{where}\quad \gamma_3=\frac{b+2\beta-1}{2(b+\beta)}, \tag{11}
\]

with $\lambda_1^{(n)}=\max\{K\sqrt{\frac{M}{n}}+\tilde{K}_2\sqrt{\frac{t}{n}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}=K(M/n)^{\frac{2(b+\beta)-1}{4(b+\beta)}}$.

Theorem 3

Under assumptions (A1-A5) and (A7-A9), if $\tau_5<\tau$, the estimator $\hat{f}_{\ell_1}$ of block-$\ell_1$ MKL has the following convergence rate with probability at least $1-e^{-t}-n^{-1}$ for all $t\geq\log\log(R\sqrt{n})+\log M$:

\[
(\text{block-}\ell_1\text{ MKL})\qquad \|\hat{f}_{\ell_1}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(n^{-\gamma_4}+n^{-\frac{4\beta+2b-2+s(b+\beta)}{2(2+s)(b+\beta)}}(\sqrt{t}+t)\Big), \quad\text{where}\quad \gamma_4=\frac{2\beta+b-1}{(\beta+b)(2+s)}, \tag{12}
\]

with $\lambda_1^{(n)}=\max\{Kn^{-\frac{1}{2+s}}+\tilde{K}_2\sqrt{\frac{t}{n}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}=0$. Moreover, if $\tau<\tau_6$, the estimator $\hat{f}_{\ell_2}$ of block-$\ell_2$ MKL has the following convergence rate with probability at least $1-e^{-t}-n^{-1}$ for all $t\geq\log\log(R\sqrt{n})+\log M$:

\[
(\text{block-}\ell_2\text{ MKL})\qquad \|\hat{f}_{\ell_2}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(n^{\tau(b+\frac{2}{2+s})-\gamma_5}+\Big({\lambda_2^{(n)}}^2+\frac{M^{1+b}}{n}\Big)t\Big), \quad\text{where}\quad \gamma_5=\frac{2}{2+s}, \tag{13}
\]

with $\lambda_2^{(n)}=\max\{K(\frac{M}{n})^{\frac{1}{2+s}},\,F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_1^{(n)}=0$.

In all the convergence rates presented in Theorems 2 and 3, the leading terms are those that do not contain $t$. The convergence order of the terms containing $t$ is faster than that of the leading terms, and they are thus negligible.

By simple calculation, we can confirm that elastic-net MKL always converges faster than block-$\ell_1$ MKL and block-$\ell_2$ MKL if $\beta$ and $M$ satisfy the conditions of Theorem 2. The convergence rate of elastic-net MKL becomes identical to that of block-$\ell_2$ MKL and block-$\ell_1$ MKL at the two extreme points $\tau=\tau_1$ and $\tau=\tau_4$ of the interval, respectively. Outside this region, block-$\ell_1$ MKL or block-$\ell_2$ MKL has a faster convergence rate than elastic-net MKL. Moreover, at $\tau=\tau_2$, the convergence rates (9) and (10) of elastic-net MKL are identical, and at $\tau=\tau_3$, the convergence rates (10) and (11) are identical. The relation between the most preferred method and the growth rate $\tau$ of the number of kernels is illustrated in Figure 1.

The condition $\tau_1<\tau<\tau_4$ in Theorem 2 indicates that when the number of kernels is neither too small nor too large, the 'intermediate' effect of elastic-net MKL becomes advantageous. Roughly speaking, if $M$ is large, sparsity is needed to ensure convergence, and thus block-$\ell_1$ MKL performs best. On the other hand, if $M$ is small, there is no need to make the solution sparse, and thus block-$\ell_2$ MKL becomes the best. For intermediate $M$, elastic-net MKL is the best.

The condition $2\beta(1-s)<s(b-1)$ in Theorem 2 ensures the existence of an $M$ that satisfies the condition of the theorem, i.e., $\tau_1<\tau_2<\tau_3<\tau_4$. It can be seen that as $b$ becomes large (i.e., as the conditioning of the problem becomes worse), the range of $\beta$ and $M$ in which elastic-net MKL performs better than block-$\ell_1$ MKL and block-$\ell_2$ MKL becomes larger. This indicates that the worse the conditioning of the problem, the more important it is to control the balance of $\lambda_1^{(n)}$ and $\lambda_2^{(n)}$ appropriately.

[Figure omitted in this conversion.]
Figure 1: Relation between the convergence rate and the number of kernels. If the truth is intermediately sparse (the growth rate $\tau$ of the number of kernels is between $\tau_1$ and $\tau_5$), then elastic-net MKL performs best. At the edges of the interval, the convergence rate of elastic-net MKL coincides with that of block-$\ell_1$ MKL or block-$\ell_2$ MKL.

4 Support Consistency of Elastic-net MKL

In this section, we derive necessary and sufficient conditions for the statistical support consistency of the estimated sparsity pattern, i.e., for the probability of $\{m\mid\|\hat{f}_m\|_{\mathcal{H}_m}\neq 0\}=I_0$ to go to 1 as the number of samples $n$ tends to infinity. Due to the additional squared regularization term, the necessary condition for the support consistency of elastic-net MKL is shown to be weaker than that for block-$\ell_1$ MKL (Bach, 2008). In this section, we assume $M$ and $d=|I_0|$ are fixed with respect to the number of samples $n$.

Let $\mathcal{H}_I$ be the restriction of $\mathcal{H}_1\oplus\dots\oplus\mathcal{H}_M$ to the index set $I$. Since $\mathrm{E}_X[k_m(X,X)]<\infty$ for all $m$ (from assumption (A2)), we define the (non-centered) cross-covariance operator $\Sigma_{I,J}:\mathcal{H}_J\to\mathcal{H}_I$ as the bounded linear operator such that

\[
\langle f_I,\Sigma_{I,J}g_J\rangle_{\mathcal{H}_I} = \sum_{m\in I}\sum_{m'\in J}\langle f_m,\Sigma_{m,m'}g_{m'}\rangle_{\mathcal{H}_m} = \sum_{m\in I}\sum_{m'\in J}\mathrm{E}_X[f_m(X)g_{m'}(X)], \tag{14}
\]

for all $f_I=(f_m)_{m\in I}\in\mathcal{H}_I$ and $g_J=(g_{m'})_{m'\in J}\in\mathcal{H}_J$. (If one fits a function with a constant offset ($f(x)+b$ instead of $f(x)$) as in Bach (2008), then the centered version of the cross-covariance operator is required instead of the non-centered version, i.e., $\langle f_m,\Sigma_{m,m'}g_{m'}\rangle_{\mathcal{H}_m}=\mathrm{E}_X[(f_m(X)-\mathrm{E}_X[f_m])(g_{m'}(X)-\mathrm{E}_X[g_{m'}])]$. However, this difference is not essential because, without loss of generality, one can consider a situation where $\mathrm{E}_Y[Y]=0$ and $\mathrm{E}_X[f_m(X)]=0$ for all $f_m\in\mathcal{H}_m$ by centering all the functions.) See Baker (1973) for the details of the cross-covariance operator $(f,g)\mapsto\mathrm{cov}(f(X),g(X))$.

Moreover, we define the bounded (non-centered) cross-correlation operators $V_{l,m}$ by $\Sigma_{l,l}^{1/2}V_{l,m}\Sigma_{m,m}^{1/2}=\Sigma_{l,m}$ (such a bounded operator always exists (Baker, 1973)). The joint cross-correlation operator $V_{I,J}:\mathcal{H}_J\to\mathcal{H}_I$ is defined analogously to $\Sigma_{I,J}$.

In this section, we assume, in addition to the basic assumptions (A1-A3), that

  1. $\mathrm{(A10)}$

    all $V_{l,m}$ are compact and the joint correlation operator $V$ is invertible.

Let $\hat{I}$ be the set of indices of active kernels in the estimator $\hat{f}\in\mathcal{H}$ of elastic-net MKL: $\hat{I}:=\{m\mid\|\hat{f}_m\|_{\mathcal{H}_m}>0\}$. Let $D:=\mathrm{Diag}(\|f^*_m\|_{\mathcal{H}_m}^{-1})=\mathrm{Diag}((\|f^*_m\|_{\mathcal{H}_m}^{-1})_{m\in I_0})$, where $\mathrm{Diag}$ denotes the $|I_0|\times|I_0|$ block-diagonal operator with the operators $\|f^*_m\|_{\mathcal{H}_m}^{-1}\mathbf{I}_{\mathcal{H}_m}$ $(m\in I_0)$ on the diagonal blocks. In this section, we assume that the true sparsity pattern $I_0$ and the number of kernels $M$ are fixed independently of the number of samples $n$.

The norm of $f\in\mathcal{H}$ is defined by $\|f\|_{\mathcal{H}}:=\sqrt{\sum_{m=1}^M\|f_m\|_{\mathcal{H}_m}^2}$, and similarly that of $f_I\in\mathcal{H}_I$ is defined by $\|f_I\|_{\mathcal{H}_I}:=\sqrt{\sum_{m\in I}\|f_m\|_{\mathcal{H}_m}^2}$. The following theorem gives a sufficient condition for the support consistency of sparsity patterns.

Theorem 4

Suppose $\lambda_2^{(n)}>0$, $\lambda_1^{(n)}\to 0$, $\lambda_2^{(n)}\to 0$, $\lambda_1^{(n)}\sqrt{n}\to\infty$, and

\[
\limsup_n\left\|\Sigma_{m,I_0}\big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}\big)^{-1}\Big(D+2\frac{\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}\right\|_{\mathcal{H}_m} < 1 \qquad (\forall m\in J=I_0^c). \tag{15}
\]

Then, under assumptions (A1-A3, A10), $\|\hat{f}-f^*\|_{\mathcal{H}}\stackrel{p}{\to}0$ and $\hat{I}\stackrel{p}{\to}I_0$. (For random variables $x_n$ and $y$, $x_n\stackrel{p}{\to}y$ denotes convergence in probability, i.e., the probability of $|x_n-y|>\epsilon$ goes to 0 for every $\epsilon>0$ as the number of samples $n$ tends to infinity.)

The condition $\lambda_2^{(n)}>0$ is imposed just for technical simplicity, to make $\Sigma_{I_0,I_0}+\lambda_2^{(n)}$ invertible. The condition $\lambda_1^{(n)}\sqrt{n}\to\infty$ means that $\lambda_1^{(n)}$ does not decrease too quickly. Condition (15) corresponds to an infinite-dimensional extension of the elastic-net 'irrepresentable' condition. In the paper of Zhao and Yu (2006), the irrepresentable condition was derived as a necessary and sufficient condition for the sign consistency of $\ell_1$ regularization when the number of parameters is finite. Its elastic-net version was derived in Yuan and Lin (2007), and it was extended to a situation where the number of parameters diverges as $n$ increases (Jia and Yu, 2010).
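To illustrate condition (15) in the simplest possible setting, the following sketch of ours evaluates its finite-dimensional analogue, in which each 'kernel' is a single coordinate, $\Sigma$ is an ordinary covariance matrix, and $D=\mathrm{Diag}(1/|\beta^*_j|)$. The covariance matrix, coefficients, and regularization values are all hypothetical choices for illustration. With $\lambda_2\to 0$ the quantity reduces to a lasso-type irrepresentable value, which exceeds $1$ here, while keeping $\lambda_2$ bounded away from zero (and $\lambda_2/\lambda_1$ small) pulls it below $1$:

```python
import numpy as np

# Finite-dimensional analogue of condition (15): one scalar coefficient per
# "kernel", Sigma a covariance matrix, D = Diag(1/|beta*_j|). All numbers
# below are hypothetical and chosen only to illustrate the comparison.
Sigma = np.array([[1.0, 0.5, 0.8],
                  [0.5, 1.0, 0.8],
                  [0.8, 0.8, 1.0]])
beta_star = np.array([1.0, 1.0, 0.0])   # true support I0 = {0, 1}
I0, m = [0, 1], 2                       # m ranges over J = I0^c

def lhs15(lam1, lam2):
    S00 = Sigma[np.ix_(I0, I0)] + lam2 * np.eye(len(I0))
    D = np.diag(1.0 / np.abs(beta_star[I0]))
    v = (D + (2 * lam2 / lam1) * np.eye(len(I0))) @ beta_star[I0]
    return abs(Sigma[m, I0] @ np.linalg.solve(S00, v))

print(lhs15(0.1, 1e-9))   # ~1.067 > 1: lasso-type condition fails
print(lhs15(50.0, 0.5))   # ~0.816 < 1: ridge term in the inverse eases it
```

This matches the discussion above: the ridge term inside the inverse eases the singularity of $\Sigma_{I_0,I_0}$, whereas the $2\lambda_2/\lambda_1$ term works against it, so the balance between the two regularization parameters matters.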

We also have a necessary condition for consistency.

Theorem 5

If $\|\hat{f}-f^*\|_{\mathcal{H}}\stackrel{p}{\to}0$ and $\hat{I}\stackrel{p}{\to}I_0$, then under assumptions (A1-A3, A10), there exist sequences $\lambda_1^{(n)},\lambda_2^{(n)}\to 0$ such that

\[
\limsup_n\left\|\Sigma_{m,I_0}\big(\Sigma_{I_0,I_0}+\lambda_2^{(n)}\big)^{-1}\Big(D+2\frac{\lambda_2^{(n)}}{\lambda_1^{(n)}}\Big)f^*_{I_0}\right\|_{\mathcal{H}_m} \leq 1 \qquad (\forall m\in J=I_0^c). \tag{16}
\]

Moreover, such $\lambda_1^{(n)}$ satisfies $\lambda_1^{(n)}\sqrt{n}\to\infty$.

The sufficient condition (15) contains the strict inequality ('$<$'), while similar conditions for ordinary (non-block) $\ell_1$ regularization or ordinary (non-block) elastic-net regularization contain the weak inequality ('$\leq$'). The strict inequality appears because each block contains multiple variables in group lasso and MKL (Bach, 2008).

The condition $\lambda_1^{(n)}\sqrt{n}\to\infty$ is necessary to ensure the RKHS-norm convergence $\|\hat{f}-f^*\|_{\mathcal{H}}\stackrel{p}{\to}0$. Roughly speaking, this means that the block-$\ell_1$ regularization term should be stronger than the noise level to suppress fluctuations caused by the noise.

It is worth noting that conditions (15) and (16) are weaker than the conditions for block-$\ell_1$ MKL presented in Bach (2008); the block-$\ell_1$ MKL irrepresentable conditions are

\[
\begin{cases}
\text{(Sufficient condition)} & \left\|\Sigma_{m,m}^{1/2}V_{m,I_0}V_{I_0,I_0}^{-1}Dg^*_{I_0}\right\|_{\mathcal{H}_m}<1 & (\forall m\in J),\\
\text{(Necessary condition)} & \left\|\Sigma_{m,m}^{1/2}V_{m,I_0}V_{I_0,I_0}^{-1}Dg^*_{I_0}\right\|_{\mathcal{H}_m}\leq 1 & (\forall m\in J).
\end{cases} \tag{17}
\]

(Note that in the original paper by Bach (2008), the RHS of (17) is $\sum_{m\in I_0}\|f^*_m\|_{\mathcal{H}_m}$ because the squared group-$\ell_1$ regularizer $(\sum_m\|f_m\|_{\mathcal{H}_m})^2$ was used. We can show that the squared formulation is actually equivalent to the non-squared formulation in the sense that there exists a one-to-one correspondence between the two formulations.)

This is because the group-$\ell_2$ regularization term eases the singularity of the problem. Examples in which elastic-nets successfully estimate the true sparsity pattern while $\ell_1$ regularization fails in parametric situations can be found in Jia and Yu (2010).

5 Conclusions

We provided three novel theoretical results on the support consistency and convergence rate of elastic-net MKL.

  1. (i)

    Elastic-net MKL was shown to be support consistent under a milder condition than block-$\ell_1$ MKL.

  2. (ii)

    A tighter convergence rate than existing bounds was derived for the situation where the truth is sparse.

  3. (iii)

    The convergence rates of block-$\ell_1$ MKL, elastic-net MKL, and block-$\ell_2$ MKL when the truth is near sparse were elucidated, and elastic-net MKL was shown to perform better when the decay rate $\beta$ is not large or the conditioning of the problem is bad.

Based on our theoretical findings, we conclude that the use of elastic-net regularization is recommended for MKL.

Elastic-net MKL can be regarded as 'intermediate' between block-$\ell_1$ MKL and block-$\ell_2$ MKL. Another popular intermediate variant is block-$\ell_p$ MKL for $1\leq p\leq 2$ (Kloft et al., 2009, Cortes et al., 2009). Elastic-net MKL and block-$\ell_p$ MKL are conceptually similar, but they have a notable difference: elastic-net MKL with $\lambda_1^{(n)}>0$ tends to produce sparse solutions, while block-$\ell_p$ MKL with $1<p\leq 2$ always produces dense solutions (i.e., all combination coefficients of the kernels are non-zero). The sparsity of elastic-net MKL would be advantageous when the true kernel combination is sparse, as we proved in this paper. However, when the true kernel combination is non-sparse, the difference/relation between elastic-net MKL and block-$\ell_p$ MKL is not yet clear. This needs to be investigated further in future work.

Appendix A Proofs of the theorems

For a function $f$ on $\mathcal{X}\times\mathbb{R}$, we define $P_nf:=\frac{1}{n}\sum_{i=1}^nf(x_i,y_i)$ and $Pf:=\mathrm{E}_{X,Y}[f(X,Y)]$. For a function $f_I\in\mathcal{H}_I$, we define $\|f_I\|_{\ell_1}:=\sum_{m\in I}\|f_m\|_{\mathcal{H}_m}$, and for $f\in\mathcal{H}$ we write $\|f\|_{\ell_1}:=\sum_{m=1}^M\|f_m\|_{\mathcal{H}_m}$. Similarly, we define $\|f_I\|_{\ell_2}^2:=\sum_{m\in I}\|f_m\|_{\mathcal{H}_m}^2$ for $f_I\in\mathcal{H}_I$, and for $f\in\mathcal{H}$ we write $\|f\|_{\ell_2}^2:=\sum_{m=1}^M\|f_m\|_{\mathcal{H}_m}^2$. We write $\max\{a,b\}$ as $a\vee b$.

Lemma 6

For all $I\subseteq\{1,\dots,M\}$, we have

\[
\|f\|_{L_2(\Pi)}^2 \geq (1-\rho(I)^2)\kappa(I)\Big(\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2\Big). \tag{18}
\]

Proof: For $J=I^c$, we have

\[
Pf^2 = \|f_I\|_{L_2(\Pi)}^2 + 2\langle f_I,f_J\rangle_{L_2(\Pi)} + \|f_J\|_{L_2(\Pi)}^2 \geq \|f_I\|_{L_2(\Pi)}^2 - 2\rho(I)\|f_I\|_{L_2(\Pi)}\|f_J\|_{L_2(\Pi)} + \|f_J\|_{L_2(\Pi)}^2
\]
\[
\geq (1-\rho(I)^2)\|f_I\|_{L_2(\Pi)}^2 \geq (1-\rho(I)^2)\kappa(I)\Big(\sum_{m\in I}\|f_m\|_{L_2(\Pi)}^2\Big), \tag{19}
\]

where we used Schwarz's inequality in the last line. $\Box$

The following lemma gives an upper bound on $\sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m}$ that holds with high probability. This is an extension of Theorem 1 of Koltchinskii and Yuan (2008). The proof is given in Appendix B.

Lemma 7

There exists a constant $F$ depending only on $L$ in (A1) such that, if $\lambda_1^{(n)}\geq F\sqrt{\frac{\log(Mn)}{n}}$, then for $r=\frac{\lambda_1^{(n)}}{\lambda_1^{(n)}\vee\lambda_2^{(n)}}$, with probability $1-n^{-1}$,

\[
\sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m} \leq M^{\frac{1-r}{2-r}}\Big(3\sum_{m=1}^M\|f^*_m\|_{\mathcal{H}_m}+3\sum_{m=1}^M\|f^*_m\|_{\mathcal{H}_m}^2\Big)^{\frac{1}{2-r}}.
\]

Moreover, if $\lambda_2^{(n)}\geq F\sqrt{\frac{\log(Mn)}{n}}$ and $\lambda_2^{(n)}\geq\lambda_1^{(n)}$, we have

\[
\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m} \leq M\Big(3/2+2\max_m\|f^*_m\|_{\mathcal{H}_m}\Big).
\]

The following lemma gives a basic inequality that is the starting point of the subsequent analyses. The proof is given in Appendix B.

Lemma 8

Suppose $\lambda_1^{(n)}\vee\lambda_2^{(n)}\geq F\sqrt{\frac{\log(Mn)}{n}}$, where $F$ is the constant appearing in Lemma 7. Then there exist constants $\tilde{K}_1$ and $\tilde{K}_2$ depending only on $L$ in (A1), $R$ in (A4), and $s$ and $C_2$ in $\mathrm{(A5)}$, such that for all $I\subseteq\{1,\dots,M\}$ and all $t\geq\log\log(R\sqrt{n})+\log M$, with probability at least $1-e^{-t}-n^{-1}$,

\[
\frac{1}{2}\|\hat{f}-f^*\|_{L_2(\Pi)}^2 + \lambda_2^{(n)}\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^2 + \lambda_2^{(n)}\sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m}^2 + \Big(\lambda_1^{(n)}-\hat{\gamma}_n-\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\sum_{m\in J}\|\hat{f}_m\|_{\mathcal{H}_m}
\]
\[
\leq \tilde{K}_1(1+\|\hat{f}-f^*\|_{\ell_1})\Big(\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}\vee\frac{\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}}{n^{\frac{1}{1+s}}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}}{n}\Big)
\]
\[
+\sum_{m\in I}\Big(\lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}+2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m}+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}
\]
\[
+\lambda_2^{(n)}\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m}^2+\Big(\lambda_1^{(n)}+\hat{\gamma}_n+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\sum_{m\in J}\|f^*_m\|_{\mathcal{H}_m}, \tag{20}
\]

where $J=I^c$, $\gamma_n:=\frac{\tilde{K}_1}{\sqrt{n}}$, and $\hat{\gamma}_n:=\gamma_n(1+\|\hat{f}-f^*\|_\infty)$.

The above lemma is derived by the peeling device, or localization method. Details of these techniques can be found in, for example, Bartlett et al. (2005), Koltchinskii (2006), Mendelson (2002), and van de Geer (2000).

Proof: (Theorem 1) Since $\lambda_1^{(n)}\geq F\sqrt{\frac{\log(Mn)}{n}}$, we can assume that inequality (20) is satisfied with $I=I_0$. For notational simplicity, $I$ denotes $I_0$ in this proof. In addition, since $\lambda_1^{(n)}\geq\lambda_2^{(n)}$, Lemma 7 gives $\|\hat{f}\|_\infty\leq\sum_{m=1}^M\|\hat{f}_m\|_{\mathcal{H}_m}\leq 3R$ (with probability $1-n^{-1}$). Note that $\|f^*_m\|_{\mathcal{H}_m}=0$ for all $m\in J=I^c=I_0^c$, and that $\hat{\gamma}_n+\tilde{K}_2\sqrt{\frac{t}{n}}\leq\max\{Kn^{-\frac{1}{2+s}}+\tilde{K}_2\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\}=\lambda_1^{(n)}$ by taking $K$ sufficiently large. Therefore, by inequality (20), we have

\[
\frac{1}{2}\|\hat{f}-f^*\|_{L_2(\Pi)}^2+\lambda_2^{(n)}\|\hat{f}_I-f^*_I\|_{\ell_2}^2 \leq K_1\Big(\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}+\frac{t}{n}\Big)
\]
\[
+\sum_{m\in I}\Big(\lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}+2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m}+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}, \tag{21}
\]

where $K_1$ stands for $\tilde{K}_1(1+3R)$. (Here we omitted the term $\sum_{m\in I}n^{-\frac{1}{1+s}}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}$ for simplicity; one can show that this term is negligible.)

By Hölder's inequality, the first term on the RHS of the above inequality can be bounded as

\[
K_1\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}} \leq K_1\frac{(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)})^{1-s}(\|\hat{f}_I-f^*_I\|_{\ell_1})^{s}}{\sqrt{n}}
\]
\[
\leq \sqrt{d}\,K_1\frac{(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2)^{\frac{1-s}{2}}(\|\hat{f}_I-f^*_I\|_{\ell_2}^2)^{\frac{s}{2}}}{\sqrt{n}}.
\]

Applying Young's inequality, the last expression can be bounded by

\[
\frac{K_1(\lambda_2^{(n)}/2)^{-\frac{s}{2}}\sqrt{d}}{\sqrt{n}}\Big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2\Big)^{\frac{1-s}{2}}\times(\lambda_2^{(n)}/2)^{\frac{s}{2}}(\|\hat{f}_I-f^*_I\|_{\ell_2}^2)^{\frac{s}{2}}
\]
\[
\leq C(n^{-\frac{1}{2}}\sqrt{d}\,{\lambda_2^{(n)}}^{-\frac{s}{2}})^{\frac{2}{2-s}}\Big(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2\Big)^{\frac{1-s}{2-s}}+\frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|_{\ell_2}^2
\]
\[
\leq C[(1-\rho^2(I))\kappa(I)]^{-1}n^{-1}d\,{\lambda_2^{(n)}}^{-s}+\frac{(1-\rho^2(I))\kappa(I)}{8}\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2+\frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|_{\ell_2}^2
\]
\[
\leq Cn^{-1}d\,{\lambda_2^{(n)}}^{-s}+\frac{1}{8}\|\hat{f}-f^*\|_{L_2(\Pi)}^2+\frac{\lambda_2^{(n)}}{2}\|\hat{f}_I-f^*_I\|_{\ell_2}^2, \tag{22}
\]

where $C$ denotes a constant that is independent of $d$ and $n$ and may change from line to line, and we used Lemma 6 in the last line. Similarly, by the inequality of arithmetic and geometric means, we obtain

\[
\sum_{m\in I}2\Big(\lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}+\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m}+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}
\]
\[
\leq C[(1-\rho^2(I))\kappa(I)]^{-1}\sum_{m\in I}\left\{\Big(\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}\Big)^2{\lambda_1^{(n)}}^2+\|g^*_m\|_{\mathcal{H}_m}^2{\lambda_2^{(n)}}^2+\frac{t}{n}\right\}+\frac{(1-\rho^2(I))\kappa(I)}{8}\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2
\]
\[
\leq C(d{\lambda_1^{(n)}}^2+{\lambda_2^{(n)}}^2+dt/n)+\frac{1}{8}\|\hat{f}-f^*\|_{L_2(\Pi)}^2, \tag{23}
\]

where we used Lemma 6 in the last line. Substituting (22) and (23) into (21), we have

\[
\frac{1}{4}\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(dn^{-1}{\lambda_2^{(n)}}^{-s}+d{\lambda_1^{(n)}}^2+{\lambda_2^{(n)}}^2+\frac{(d+1)t}{n}\Big). \tag{24}
\]

The minimum of the RHS with respect to $\lambda_1^{(n)},\lambda_2^{(n)}$ under the constraint $\lambda_1^{(n)}\geq\lambda_2^{(n)}$ is achieved, up to constants, by $\lambda_1^{(n)}=\max\{Kn^{-\frac{1}{2+s}}+\tilde{K}_2\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\}$ and $\lambda_2^{(n)}=Kn^{-\frac{1}{2+s}}$. Thus we have the first assertion (7).
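As a sanity check on this choice (our own arithmetic): with $\lambda_1^{(n)}=\lambda_2^{(n)}=\lambda$ and the $t$-terms ignored, the RHS of (24) behaves like $dn^{-1}\lambda^{-s}+d\lambda^2$; balancing the two terms gives $\lambda^{2+s}\asymp n^{-1}$, i.e., $\lambda\asymp n^{-\frac{1}{2+s}}$, and substituting back yields $dn^{-\frac{2}{2+s}}$, exactly the leading term of (7).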

Next we show the second assertion (8). By Hölder's inequality and Young's inequality, we have

\[
K_1\sum_{m\in I}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}} \leq K_1\frac{(\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)})^{1-s}(\|\hat{f}_I-f^*_I\|_{\ell_1})^{s}}{\sqrt{n}}
\]
\[
\leq C\tilde{\lambda}^{-\frac{s}{1-s}}n^{-\frac{1}{2(1-s)}}\sum_{m\in I}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}+\frac{\tilde{\lambda}}{2}\|\hat{f}_I-f^*_I\|_{\ell_1}
\]
\[
\leq Cd\tilde{\lambda}^{-\frac{2s}{1-s}}n^{-\frac{1}{1-s}}+\frac{1}{8}\|\hat{f}-f^*\|_{L_2(\Pi)}^2+\frac{\tilde{\lambda}}{2}(\|\hat{f}_I\|_{\ell_1}+\|f^*_I\|_{\ell_1}), \tag{25}
\]

where $\tilde{\lambda}>0$ is an arbitrary positive real. By substituting (25) and (23) into (21), we have

\[
\frac{1}{4}\|\hat{f}-f^*\|_{L_2(\Pi)}^2 \leq C\Big(d\tilde{\lambda}^{-\frac{2s}{1-s}}n^{-\frac{1}{1-s}}+\tilde{\lambda}+d{\lambda_1^{(n)}}^2+{\lambda_2^{(n)}}^2+\frac{(d+1)t}{n}\Big).
\]

This is minimized by $\tilde{\lambda}=Cd^{\frac{1-s}{1+s}}n^{-\frac{1}{1+s}}$, $\lambda_1^{(n)}=\big(\frac{2\tilde{K}_1(1+3R)}{\sqrt{n}}+\tilde{K}_2\sqrt{\frac{t}{n}}\big)\vee F\sqrt{\frac{\log(Mn)}{n}}\geq\big(2\hat{\gamma}_n+\tilde{K}_2\sqrt{\frac{t}{n}}\big)\vee F\sqrt{\frac{\log(Mn)}{n}}$, and $\lambda_2^{(n)}\leq\lambda_1^{(n)}$. Thus we obtain the assertion. $\Box$

Proof: (Theorem 2) Let $I_d:=\{1,\dots,d\}$ and $J_d=I_d^c=\{d+1,\dots,M\}$. By assumption (A7), we have $\sum_{m\in J_d}\|f^*_m\|_{\mathcal{H}_m}^2\leq\frac{C_3}{2\beta-1}d^{1-2\beta}$ and $\sum_{m\in J_d}\|f^*_m\|_{\mathcal{H}_m}\leq\frac{C_3}{\beta-1}d^{1-\beta}$. Therefore, Lemma 8 gives

\[
\|\hat{f}-f^*\|_{L_2(\Pi)}^2+\lambda_2^{(n)}\|\hat{f}_{I_d}-f^*_{I_d}\|_{\ell_2}^2+\lambda_2^{(n)}\|\hat{f}_{J_d}\|_{\ell_2}^2
\]
\[
\leq K_1\Big(\sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}}{n}\Big)
\]
\[
+K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Big(\sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}}{n}\Big)
\]
\[
+\sum_{m\in I_d}\Big(\lambda_1^{(n)}\frac{\|g^*_m\|_{\mathcal{H}_m}}{\|f^*_m\|_{\mathcal{H}_m}}+2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m}+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}
\]
\[
+C\Big(\lambda_2^{(n)}d^{1-2\beta}+\Big(\lambda_1^{(n)}+\hat{\gamma}_n+\sqrt{\tfrac{t}{n}}\Big)d^{1-\beta}\Big), \tag{26}
\]

provided that $\lambda_1^{(n)}>\hat{\gamma}_n+\tilde{K}_2\sqrt{\frac{t}{n}}$ and $\lambda_1^{(n)}\geq F\sqrt{\frac{\log(Mn)}{n}}$. The second term can be upper bounded as

\[
K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Big(\sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}}{n}\Big)
\]
\[
\stackrel{\text{H\"older}}{\leq} K_1\Big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\Big)\Bigg\{\frac{(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)})^{1-s}(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m})^{s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}}{n}\Bigg\}
\]
\[
= K_1\frac{(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)})^{1-s}\big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}\big)(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m})^{s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}^2}{n}
\]
\[
\stackrel{\text{Jensen}}{\leq} K_1\frac{d^{\frac{1-s}{2}}(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^2)^{\frac{1-s}{2}}M^{\frac{1}{2}}\big(\sum_{m=1}^M\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^2\big)^{\frac{1}{2}}d^{\frac{s}{2}}(\sum_{m\in I_d}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^2)^{\frac{s}{2}}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}^2}{n}
\]
\[
\stackrel{\text{Lemma 6}}{\leq} K_1\frac{\{(1-\rho(I_d)^2)\kappa(I_d)\}^{-\frac{1-s}{2}}(\|\hat{f}-f^*\|_{L_2(\Pi)}^2)^{\frac{1-s}{2}}d^{\frac{1}{2}}M^{\frac{1}{2}}\|\hat{f}-f^*\|_{\ell_2}^{1+s}}{\sqrt{n}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}^2}{n}
\]
\[
\stackrel{\text{Young}}{\leq} \frac{\|\hat{f}-f^*\|_{L_2(\Pi)}^2}{2}+C\frac{\{(1-\rho(I_d)^2)\kappa(I_d)\}^{-\frac{1-s}{1+s}}d^{\frac{1}{1+s}}M^{\frac{1}{1+s}}\|\hat{f}-f^*\|_{\ell_2}^2}{n^{\frac{1}{1+s}}}+\frac{t\|\hat{f}-f^*\|_{\ell_1}^2}{n}
\]
\[
\stackrel{\text{(A8)}}{\leq} \frac{\|\hat{f}-f^*\|_{L_2(\Pi)}^2}{2}+C\frac{d^{\frac{b(1-s)+1}{1+s}}M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\|\hat{f}-f^*\|_{\ell_2}^2+\frac{t\|\hat{f}-f^*\|_{\ell_1}^2}{n}.
\]

We will see that we may assume $C\frac{d^{\frac{b(1-s)+1}{1+s}}M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\leq\frac{\lambda_2^{(n)}}{4}$. Thus the second term on the RHS of the above inequality can be upper bounded as

\[
C\frac{d^{\frac{b(1-s)+1}{1+s}}M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\|\hat{f}-f^*\|_{\ell_2}^2 \leq \frac{\lambda_2^{(n)}}{4}\|\hat{f}-f^*\|_{\ell_2}^2 \leq \frac{\lambda_2^{(n)}}{4}\Big(\|\hat{f}_{I_d}-f^*_{I_d}\|_{\ell_2}^2+2\|\hat{f}_{J_d}\|_{\ell_2}^2+2\|f^*_{J_d}\|_{\ell_2}^2\Big)
\]
\[
\leq \frac{\lambda_2^{(n)}}{2}\Big(\|\hat{f}_{I_d}-f^*_{I_d}\|_{\ell_2}^2+\|\hat{f}_{J_d}\|_{\ell_2}^2+\|f^*_{J_d}\|_{\ell_2}^2\Big). \tag{27}
\]

Moreover, Lemma 7 gives $\frac{\|\hat{f}-f^*\|_{\ell_1}}{n}\leq\frac{C\sqrt{RM}}{n}\leq C{\lambda_2^{(n)}}^2$ and $\frac{\|\hat{f}-f^*\|_{\ell_1}^2}{n}\leq\frac{CRM}{n}\leq CR{\lambda_2^{(n)}}^2$. Therefore, (26) becomes

\[
\frac{1}{2}\|\hat{f}-f^*\|_{L_2(\Pi)}^2+\frac{\lambda_2^{(n)}}{2}\|\hat{f}_{I_d}-f^*_{I_d}\|_{\ell_2}^2+\frac{\lambda_2^{(n)}}{2}\|\hat{f}_{J_d}\|_{\ell_2}^2
\]
\[
\leq C\Big(\sum_{m\in I_d}\frac{\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}^{1-s}\|\hat{f}_m-f^*_m\|_{\mathcal{H}_m}^{s}}{\sqrt{n}}+t{\lambda_2^{(n)}}^2\Big)
+\sum_{m\in I_d}\Big(C_1\lambda_1^{(n)}+2\lambda_2^{(n)}\|g^*_m\|_{\mathcal{H}_m}+\tilde{K}_2\sqrt{\tfrac{t}{n}}\Big)\|\hat{f}_m-f^*_m\|_{L_2(\Pi)}
\]
\[
+C\Big(\lambda_2^{(n)}d^{1-2\beta}+\Big(\lambda_1^{(n)}+\hat{\gamma}_n+\sqrt{\tfrac{t}{n}}\Big)d^{1-\beta}\Big).
\]

As in the proof of Theorem 1 (using the relations (23) and (22)), we have

12f^fL2(Π)2\displaystyle\frac{1}{2}\|\hat{f}-f^{*}\|_{L_{2}(\Pi)}^{2}
\displaystyle\leq C{[(1ρ2(Id))κ(Id)]1[dn1λ2(n)s+dλ1(n)2+λ2(n)2+tn]\displaystyle C\Bigg{\{}[(1-\rho^{2}(I_{d}))\kappa(I_{d})]^{-1}\left[dn^{-1}{\lambda_{2}^{(n)}}^{-s}+d{\lambda_{1}^{(n)}}^{2}+{\lambda_{2}^{(n)}}^{2}+\frac{t}{n}\right]
+λ2(n)d12β+(λ1(n)+γ^n+(t/n)12)d1β+tλ2(n)2}.\displaystyle+{\lambda_{2}^{(n)}}d^{1-2\beta}+({\lambda_{1}^{(n)}}+\hat{\gamma}_{n}+(t/n)^{\frac{1}{2}})d^{1-\beta}+t{\lambda_{2}^{(n)}}^{2}\Bigg{\}}.

Now using the assumption (1ρ2(Id))κ(Id)C4db(1-\rho^{2}(I_{d}))\kappa(I_{d})\geq C_{4}d^{-b}, we have

f^IdfIdL2(Π)2\displaystyle\|\hat{f}_{I_{d}}-f^{*}_{I_{d}}\|_{L_{2}(\Pi)}^{2} C[d1+bn1λ2(n)s+d1+bλ1(n)2+dbλ2(n)2+λ2(n)d12β+(λ1(n)+γ^n)d1β+tλ2(n)2\displaystyle\leq C\Bigg{[}d^{1+b}n^{-1}{\lambda_{2}^{(n)}}^{-s}\!\!+d^{1+b}{\lambda_{1}^{(n)}}^{2}\!\!+d^{b}{\lambda_{2}^{(n)}}^{2}\!\!+{\lambda_{2}^{(n)}}d^{1-2\beta}\!+({\lambda_{1}^{(n)}}+\hat{\gamma}_{n})d^{1-\beta}+t{\lambda_{2}^{(n)}}^{2}
+d1βtn+d1+btn].\displaystyle~~~~~~~~~~~~~~~~~~+d^{1-\beta}\sqrt{\frac{t}{n}}+\frac{d^{1+b}t}{n}\Bigg{]}. (28)

Recall that γ^n=K1~(1+f^f)/n\hat{\gamma}_{n}=\tilde{K_{1}}(1+\|\hat{f}-f^{*}\|_{\infty})/\sqrt{n}. Since λ1(n)Flog(Mn)n{\lambda_{1}^{(n)}}\geq F\sqrt{\frac{\log(Mn)}{n}}, Lemma 7 gives f^fM3R+RcM\|\hat{f}-f^{*}\|_{\infty}\leq\sqrt{M3R}+R\leq c\sqrt{M} with probability 1n11-n^{-1} for some constant c>0c>0. Therefore γ^ncM/n\hat{\gamma}_{n}\leq c\sqrt{M/n}. The values of λ1(n){\lambda_{1}^{(n)}} and λ2(n){\lambda_{2}^{(n)}} presented in the statement are obtained by minimizing the RHS of Eq. (28) under the constraint λ1(n)cM/n+K~2tnγ^n+K~2tn{\lambda_{1}^{(n)}}\geq c\sqrt{M/n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}\geq\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}} and Cdb(1s)+11+sM11+sn11+sλ2(n)4C\frac{d^{\frac{b(1-s)+1}{1+s}}M^{\frac{1}{1+s}}}{n^{\frac{1}{1+s}}}\leq\frac{{\lambda_{2}^{(n)}}}{4}.

i) Suppose nb+3β1(2β+b)(2+s)1s>cM/nn^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}}>c\sqrt{M/n}, i.e., ττ2\tau\leq\tau_{2}. Then the RHS of the above inequality can be minimized by d=n1(2β+b)(2+s)1sd=n^{\frac{1}{(2\beta+b)(2+s)-1-s}}, λ2(n)=Kn2β+b1(2β+b)(2+s)1s{\lambda_{2}^{(n)}}=Kn^{-\frac{2\beta+b-1}{(2\beta+b)(2+s)-1-s}}, and λ1(n)=max{Knb+3β1(2β+b)(2+s)1s+K~2tn,Flog(Mn)n}{\lambda_{1}^{(n)}}=\max\{Kn^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}}+\tilde{K}_{2}\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\} up to constants independent of nn, where the leading terms are d1+bn1λ2(n)s+dbλ2(n)2+λ2(n)d12β+λ1(n)d1βd^{1+b}n^{-1}{\lambda_{2}^{(n)}}^{-s}\!\!+d^{b}{\lambda_{2}^{(n)}}^{2}\!\!+{\lambda_{2}^{(n)}}d^{1-2\beta}+{\lambda_{1}^{(n)}}d^{1-\beta}. It should be noted that λ1(n){\lambda_{1}^{(n)}} is greater than γ^n+K~2tn\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}} because nb+3β1(2β+b)(2+s)1s>cM/nγ^nn^{-\frac{b+3\beta-1}{(2\beta+b)(2+s)-1-s}}>c\sqrt{M/n}\geq\hat{\gamma}_{n}, therefore (26) is valid. Using ττ2\tau\leq\tau_{2}, we can show that Cdb(1s)+11+s(M/n)11+sλ2(n)/4Cd^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}}\leq{\lambda_{2}^{(n)}}/4 by setting the constant KK sufficiently large, hence (27) is valid. Moreover, since M>n1(2β+b)(2+s)1s=nτ1M>n^{\frac{1}{(2\beta+b)(2+s)-1-s}}=n^{\tau_{1}}, we can take dd as d=n1(2β+b)(2+s)1sMd=n^{\frac{1}{(2\beta+b)(2+s)-1-s}}\leq M.

ii) Suppose τ2ττ3\tau_{2}\leq\tau\leq\tau_{3}. Then the RHS of the above inequality can be minimized by d=(M2+sn2s)12{(2+s)(b+β)s}d=(M^{2+s}n^{2-s})^{\frac{1}{2\{(2+s)(b+\beta)-s\}}}, λ2(n)=K(Mn{2(b+β)1})12{(2+s)(b+β1)+2}{\lambda_{2}^{(n)}}=K(Mn^{-\{2(b+\beta)-1\}})^{\frac{1}{2\{(2+s)(b+\beta-1)+2\}}}, and λ1(n)=max{cM/n+K~2tn,Flog(Mn)n}γ^n+K~2tn{\lambda_{1}^{(n)}}=\max\left\{c\sqrt{M/n}+\tilde{K}_{2}\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\right\}\geq\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}} up to constants independent of nn, where the leading terms are d1+bn1λ2(n)s+dbλ2(n)2+λ1(n)d1βd^{1+b}n^{-1}{\lambda_{2}^{(n)}}^{-s}\!\!+d^{b}{\lambda_{2}^{(n)}}^{2}\!\!+{\lambda_{1}^{(n)}}d^{1-\beta}. Since λ1(n)γ^n+K~2tn{\lambda_{1}^{(n)}}\geq\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}, (26) is valid. Using ττ3\tau\leq\tau_{3}, we can show that Cdb(1s)+11+s(M/n)11+sλ2(n)/4Cd^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}}\leq{\lambda_{2}^{(n)}}/4 by setting the constant KK sufficiently large, hence (27) is valid. Moreover, since βs(b1)2(1s)\beta\leq\frac{s(b-1)}{2(1-s)} and τ2τ\tau_{2}\leq\tau, we can show that dMd\leq M.

iii) Suppose τ3ττ4\tau_{3}\leq\tau\leq\tau_{4}. We take λ1(n)=max{cM/n+K~2tn,Flog(Mn)n}{\lambda_{1}^{(n)}}=\max\left\{c\sqrt{M/n}+\tilde{K}_{2}\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\right\}. Then the RHS of the inequality (28) is minimized by λ2(n)=Kdλ1(n)dM/n{\lambda_{2}^{(n)}}=K\sqrt{d}{\lambda_{1}^{(n)}}\sim\sqrt{dM/n} and d=(nM)12(b+β)d=(\frac{n}{M})^{\frac{1}{2(b+\beta)}} up to constants, where the leading terms are dbλ2(n)2+d1+bλ1(n)2+λ1(n)d1βd^{b}{\lambda_{2}^{(n)}}^{2}+d^{1+b}{\lambda_{1}^{(n)}}^{2}\!\!+{\lambda_{1}^{(n)}}d^{1-\beta}. Note that since λ1(n)γ^n+K~2tn{\lambda_{1}^{(n)}}\geq\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}, (26) is valid. Using ττ4\tau\leq\tau_{4}, we can show that Cdb(1s)+11+s(M/n)11+sλ2(n)/4Cd^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}}\leq{\lambda_{2}^{(n)}}/4 by setting the constant KK sufficiently large, hence (27) is valid. Moreover, since βs(b1)2(1s)\beta\leq\frac{s(b-1)}{2(1-s)} and nτ3Mn^{\tau_{3}}\leq M, we have d=(nM)12(b+β)Md=(\frac{n}{M})^{\frac{1}{2(b+\beta)}}\leq M.

In all settings i) to iii), we can show that d1βnd1+bn\frac{d^{1-\beta}}{\sqrt{n}}\gtrsim\frac{d^{1+b}}{n}. Thus the terms involving tt are upper bounded as d1βtn+d1+btn+tλ2(n)2(d1βn+λ2(n)2)(t+t)d^{1-\beta}\sqrt{\frac{t}{n}}+\frac{d^{1+b}t}{n}+t{\lambda_{2}^{(n)}}^{2}\lesssim(\frac{d^{1-\beta}}{\sqrt{n}}+{\lambda_{2}^{(n)}}^{2})(\sqrt{t}+t). Through a simple calculation, d1βn\frac{d^{1-\beta}}{\sqrt{n}} is evaluated as i) d1βnn(2β+b)(2+s)3s+2β2{(2β+b)(2+s)1s}\frac{d^{1-\beta}}{\sqrt{n}}\simeq n^{-\frac{(2\beta+b)(2+s)-3-s+2\beta}{2\{(2\beta+b)(2+s)-1-s\}}}, ii) d1βn(M(2+s)(1β)n(4β+2b+sb2))12{(β+b)(2+s)s}\frac{d^{1-\beta}}{\sqrt{n}}\simeq(M^{(2+s)(1-\beta)}n^{-(4\beta+2b+sb-2)})^{\frac{1}{2\{(\beta+b)(2+s)-s\}}}, and iii) d1βn(Mβ1n12βb)12(β+b)\frac{d^{1-\beta}}{\sqrt{n}}\simeq(M^{\beta-1}n^{1-2\beta-b})^{\frac{1}{2(\beta+b)}} respectively. Thus we obtain the assertion.  
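
The regime analysis above rests on equalizing the leading terms of the bound (28). As an elementary sanity check, the following snippet (an illustration only; all constants are set to one and the parameter triples (s, b, beta) are arbitrary admissible values, not taken from the paper) verifies numerically that the choices of d, lambda_1^(n), and lambda_2^(n) in case i) give the same n-exponent to all four leading terms d^{1+b} n^{-1} lambda_2^{-s}, d^b lambda_2^2, lambda_2 d^{1-2 beta}, and lambda_1 d^{1-beta}:

def leading_exponents(s, b, beta):
    # Case i): d = n^{ed}, lam2 = n^{e2}, lam1 = n^{e1}; constants dropped.
    denom = (2*beta + b)*(2 + s) - 1 - s
    ed = 1.0/denom
    e2 = -(2*beta + b - 1)/denom
    e1 = -(b + 3*beta - 1)/denom
    return [(1 + b)*ed - 1 - s*e2,    # exponent of d^{1+b} n^{-1} lam2^{-s}
            b*ed + 2*e2,              # exponent of d^{b} lam2^{2}
            e2 + (1 - 2*beta)*ed,     # exponent of lam2 d^{1-2 beta}
            e1 + (1 - beta)*ed]       # exponent of lam1 d^{1-beta}

for (s, b, beta) in [(0.5, 1.0, 1.0), (0.8, 2.0, 1.5), (0.2, 1.5, 0.75)]:
    exps = leading_exponents(s, b, beta)
    assert max(exps) - min(exps) < 1e-12   # all four terms are balanced
    print((s, b, beta), "common n-exponent:", round(exps[0], 6))

The printed common exponent is, up to constants and the t-dependent terms, the order of the RHS of (28) under these choices.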

Proof: (Theorem 3)

(Convergence rate of block-1\ell_{1} MKL)

Note that since λ1(n)>λ2(n)=0{\lambda_{1}^{(n)}}>{\lambda_{2}^{(n)}}=0, we have λ1(n)λ1(n)λ2(n)=1\frac{{\lambda_{1}^{(n)}}}{{\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}}}=1. Therefore Lemma 7 gives m=1Mf^mm3R\sum_{m=1}^{M}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}\leq 3R with probability 1n11-n^{-1}. Thus γ^n=γn(1+f^f)γn(1+m=1Mf^mm+m=1Mfmm)γn(1+4R)\hat{\gamma}_{n}=\gamma_{n}(1+\|\hat{f}-f^{*}\|_{\infty})\leq\gamma_{n}(1+\sum_{m=1}^{M}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}+\sum_{m=1}^{M}\|f^{*}_{m}\|_{\mathcal{H}_{m}})\leq\gamma_{n}(1+4R).

When λ2(n)=0{\lambda_{2}^{(n)}}=0 and λ1(n)>(1+4R)γn+K~2tn{\lambda_{1}^{(n)}}>(1+4R)\gamma_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}, as in Lemma 8 we have with probability at least 1etn11-e^{-t}-n^{-1}

f^fL2(Π)2+λ1(n)mIf^mm\displaystyle\|\hat{f}\!-\!f^{*}\|_{L_{2}(\Pi)}^{2}\!+\!{\lambda_{1}^{(n)}}\!\!\sum_{m\in I}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}
\displaystyle\leq K1(mIf^mfmL2(Π)1sf^mfmmsn+tn)+λ1(n)mIfmm+2λ1(n)mJfmm\displaystyle K_{1}\Big{(}\sum_{m\in I}\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{1-s}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}+\frac{t}{n}\Big{)}+{\lambda_{1}^{(n)}}\sum_{m\in I}\|f^{*}_{m}\|_{\mathcal{H}_{m}}+2{\lambda_{1}^{(n)}}\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}
+K~2mItnfmf^mL2(Π),\displaystyle+\tilde{K}_{2}\sum_{m\in I}\sqrt{\frac{t}{n}}\|f^{*}_{m}-\hat{f}_{m}\|_{L_{2}(\Pi)}, (29)

for all tloglog(Rn)+logMt\geq\log\log(R\sqrt{n})+\log M.

We lower bound the term λ1(n)mI(f^mmfmm){\lambda_{1}^{(n)}}\!\!\sum_{m\in I}(\|\hat{f}_{m}\|_{\mathcal{H}_{m}}-\|f^{*}_{m}\|_{\mathcal{H}_{m}}) arising from the above inequality (29). There exists c1>0c_{1}>0 depending only on RR such that

fmm\displaystyle\|f_{m}\|_{\mathcal{H}_{m}} =fmfmm22fmfm,fmm+fmm2\displaystyle=\sqrt{\|f_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\langle f_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}+\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}}
c1fmfmm22fmm1|fmfm,fmm|+fmm\displaystyle\geq c_{1}\|f_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{-1}|\langle f_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}|+\|f^{*}_{m}\|_{\mathcal{H}_{m}} (30)

for all fmmf_{m}\in\mathcal{H}_{m} such that fmm3R\|f_{m}\|_{\mathcal{H}_{m}}\leq 3R and mI0m\in I_{0}. Recall that fm=Tm1/2gmf^{*}_{m}=T_{m}^{1/2}g^{*}_{m}; then we have fmmc1fmfmm22gmmfmmfmfmL2(Π)+fmm\|f_{m}\|_{\mathcal{H}_{m}}\geq c_{1}\|f_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\|f_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}+\|f^{*}_{m}\|_{\mathcal{H}_{m}}. Since maxmf^mm3R\max_{m}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}\leq 3R holds with probability 1n11-n^{-1},

f^mmc1f^mfmm22gmmfmmf^mfmL2(Π)+fmm,\|\hat{f}_{m}\|_{\mathcal{H}_{m}}\geq c_{1}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}+\|f^{*}_{m}\|_{\mathcal{H}_{m}},

with probability 1n11-n^{-1}.

Therefore by the inequality (29), we have with probability at least 1etn11-e^{-t}-n^{-1}

f^fL2(Π)2+λ1(n)mI(c1f^mfmm22gmmfmmf^mfmL2(Π)+fmm)\displaystyle\|\hat{f}\!-\!f^{*}\|_{L_{2}(\Pi)}^{2}\!+\!{\lambda_{1}^{(n)}}\!\!\sum_{m\in I}\!(c_{1}\|\hat{f}_{m}\!-\!f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}\!\!-\!\!2\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\|\hat{f}_{m}\!-\!f^{*}_{m}\|_{L_{2}(\Pi)}\!\!+\!\!\|f^{*}_{m}\|_{\mathcal{H}_{m}})
\displaystyle\leq K1(mIf^mfmL2(Π)1sf^mfmmsn+tn)+λ1(n)mIfmm+2λ1(n)mJfmm\displaystyle K_{1}\Big{(}\sum_{m\in I}\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{1-s}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}+\frac{t}{n}\Big{)}+{\lambda_{1}^{(n)}}\sum_{m\in I}\|f^{*}_{m}\|_{\mathcal{H}_{m}}+2{\lambda_{1}^{(n)}}\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}
+K~2mItnfmf^mL2(Π),\displaystyle+\tilde{K}_{2}\sum_{m\in I}\sqrt{\frac{t}{n}}\|f^{*}_{m}-\hat{f}_{m}\|_{L_{2}(\Pi)}, (31)

for all tloglog(Rn)+logMt\geq\log\log(R\sqrt{n})+\log M. Thus, using Young’s inequality, we obtain

f^fL2(Π)2\displaystyle\|\hat{f}-f^{*}\|_{L_{2}(\Pi)}^{2}\leq C[d1+bn1λ1(n)s+d1+bλ1(n)2+2λ1(n)d1β+t(1+d1+b)n].\displaystyle C\left[d^{1+b}n^{-1}{\lambda_{1}^{(n)}}^{-s}+d^{1+b}{\lambda_{1}^{(n)}}^{2}+2{\lambda_{1}^{(n)}}d^{1-\beta}+\frac{t(1+d^{1+b})}{n}\right].

The RHS is minimized by d=n1(2+s)(β+b)d=n^{\frac{1}{(2+s)(\beta+b)}} and λ1(n)=max{Kn12+s+K~2tn,Flog(Mn)n}{\lambda_{1}^{(n)}}=\max\left\{Kn^{-\frac{1}{2+s}}+\tilde{K}_{2}\sqrt{\frac{t}{n}},F\sqrt{\frac{\log(Mn)}{n}}\right\} (up to constants independent of nn). Note that since the optimal λ1(n){\lambda_{1}^{(n)}} obtained above satisfies λ1(n)>(1+4R)γn+K~2tn{\lambda_{1}^{(n)}}>(1+4R)\gamma_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}} by taking KK sufficiently large, the inequality (31) is valid. Moreover the condition M>nτ5=nb+1(β+b){b(2+s)+2}M>n^{\tau_{5}}=n^{\frac{b+1}{(\beta+b)\{b(2+s)+2\}}} in the statement ensures d<Md<M. Finally we evaluate the terms including tt, that is, tnd1+b+tnd1β\frac{t}{n}d^{1+b}+\sqrt{\frac{t}{n}}d^{1-\beta}. We can check that 1nd1+b1nd1β\frac{1}{n}d^{1+b}\lesssim\sqrt{\frac{1}{n}}d^{1-\beta}. Therefore those terms are upper bounded as tnd1+b+tnd1β1nd1β(t+t)n4β+2b2+s(b+β)2(2+s)(b+β)(t+t)\frac{t}{n}d^{1+b}+\sqrt{\frac{t}{n}}d^{1-\beta}\lesssim\sqrt{\frac{1}{n}}d^{1-\beta}(\sqrt{t}+t)\simeq n^{-\frac{4\beta+2b-2+s(b+\beta)}{2(2+s)(b+\beta)}}(\sqrt{t}+t). Thus we obtain the assertion.
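
For the reader’s convenience, the balancing behind this choice can be written out explicitly (a routine calculation that is implicit above):

d^{1+b}n^{-1}{\lambda_{1}^{(n)}}^{-s}\asymp d^{1+b}{\lambda_{1}^{(n)}}^{2}\iff{\lambda_{1}^{(n)}}^{2+s}\asymp n^{-1}\iff{\lambda_{1}^{(n)}}\asymp n^{-\frac{1}{2+s}},
d^{1+b}{\lambda_{1}^{(n)}}^{2}\asymp{\lambda_{1}^{(n)}}d^{1-\beta}\iff d^{b+\beta}\asymp{\lambda_{1}^{(n)}}^{-1}\asymp n^{\frac{1}{2+s}}\iff d\asymp n^{\frac{1}{(2+s)(\beta+b)}},

and with these choices each leading term is of order d^{1+b}{\lambda_{1}^{(n)}}^{2}\asymp n^{-\frac{2\beta+b-1}{(2+s)(\beta+b)}}.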

(Convergence rate for block-2\ell_{2} MKL)

When λ1(n)=0{\lambda_{1}^{(n)}}=0, substituting IMI_{M} for II in Lemma 8, and using Young’s inequality, as in the proof of Theorem 2, the convergence rate of block-2\ell_{2} MKL can be evaluated as

f^IdfIdL2(Π)2C[M1+bn1λ2(n)s+Mbλ2(n)2+tλ2(n)2+tnM1+b],\displaystyle\|\hat{f}_{I_{d}}-f^{*}_{I_{d}}\|_{L_{2}(\Pi)}^{2}\leq C\left[M^{1+b}n^{-1}{\lambda_{2}^{(n)}}^{-s}+M^{b}{\lambda_{2}^{(n)}}^{2}+t{\lambda_{2}^{(n)}}^{2}+\frac{t}{n}M^{1+b}\right], (32)

with probability 1etn11-e^{-t}-n^{-1} (note that since I={1,,M}I=\{1,\dots,M\} (Ic=I^{c}=\emptyset), we do not need the condition λ1(n)γ^n+K~2tn{\lambda_{1}^{(n)}}\geq\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}). The choice λ2(n)=K(Mn)12+sFlog(Mn)n{\lambda_{2}^{(n)}}=K(\frac{M}{n})^{\frac{1}{2+s}}\vee F\sqrt{\frac{\log(Mn)}{n}} minimizes the RHS with respect to λ2(n){\lambda_{2}^{(n)}} up to constants. Using ττ6\tau\leq\tau_{6}, we can show that Mb(1s)+11+s(M/n)11+s=Mb(1s)+21+sn11+sλ2(n)M^{\frac{b(1-s)+1}{1+s}}(M/n)^{\frac{1}{1+s}}=M^{\frac{b(1-s)+2}{1+s}}n^{-\frac{1}{1+s}}\lesssim{\lambda_{2}^{(n)}} by setting the constant KK sufficiently large; hence (27) is valid.
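
The stated choice of lambda_2^(n) likewise comes from balancing the first two terms of (32) (a one-line calculation):

M^{1+b}n^{-1}{\lambda_{2}^{(n)}}^{-s}\asymp M^{b}{\lambda_{2}^{(n)}}^{2}\iff{\lambda_{2}^{(n)}}^{2+s}\asymp\frac{M}{n}\iff{\lambda_{2}^{(n)}}\asymp\Big(\frac{M}{n}\Big)^{\frac{1}{2+s}},

so that the leading term of (32) is of order M^{b}(M/n)^{\frac{2}{2+s}}.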

 

Appendix B Proof of Lemmas 7 and 8

Proof: (Lemma 7) Since f^\hat{f} minimizes the empirical risk (1), we have

1ni=1n(m=1M(f^m(xi)fm(xi)))2+λ1(n)f^1+λ2(n)f^22\displaystyle\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{m=1}^{M}(\hat{f}_{m}(x_{i})-f^{*}_{m}(x_{i}))\right)^{2}+{\lambda_{1}^{(n)}}\|\hat{f}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}\|_{\ell_{2}}^{2}
\displaystyle\leq 2nm=1Mi=1nϵi(f^m(xi)fm(xi))+λ1(n)f1+λ2(n)f22.\displaystyle\frac{2}{n}\sum_{m=1}^{M}\sum_{i=1}^{n}\epsilon_{i}(\hat{f}_{m}(x_{i})-f^{*}_{m}(x_{i}))+{\lambda_{1}^{(n)}}\|f^{*}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|f^{*}\|_{\ell_{2}}^{2}. (33)

By Proposition 1 (Bernstein’s inequality in Hilbert spaces, see also Theorem 6.14 of Steinwart (2008) for example), there exists a universal constant CC such that we have

1ni=1nϵi(f^m(xi)fm(xi))|1ni=1nϵikm(xi,)|f^mfmm\displaystyle\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(\hat{f}_{m}(x_{i})-f^{*}_{m}(x_{i}))\leq\left|\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}k_{m}(x_{i},\cdot)\right|\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}
\displaystyle\leq CLlog(Mn)nf^mfmmCLlog(Mn)n(f^mm+fmm)\displaystyle CL\sqrt{\frac{\log(Mn)}{n}}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}\leq CL\sqrt{\frac{\log(Mn)}{n}}(\|\hat{f}_{m}\|_{\mathcal{H}_{m}}+\|f^{*}_{m}\|_{\mathcal{H}_{m}}) (34)

for all mm with probability at least 1n11-n^{-1}, where we used the assumption log(Mn)n1\frac{\log(Mn)}{n}\leq 1. If λ1(n)4CLlog(Mn)n{\lambda_{1}^{(n)}}\geq 4CL\sqrt{\frac{\log(Mn)}{n}}, then we have

λ1(n)f^1+λ2(n)f^223(λ1(n)λ2(n))(f1+f22),\displaystyle{\lambda_{1}^{(n)}}\|\hat{f}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}\|_{\ell_{2}}^{2}\leq 3({\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}})(\|f^{*}\|_{\ell_{1}}+\|f^{*}\|_{\ell_{2}}^{2}), (35)

with probability at least 1n11-n^{-1}. Set r=λ1(n)λ1(n)λ2(n)r=\frac{{\lambda_{1}^{(n)}}}{{\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}}}, then by Young’s inequality and Jensen’s inequality, the LHS of the above inequality (33) is lower bounded by

λ1(n)f^1+λ2(n)f^22\displaystyle{\lambda_{1}^{(n)}}\|\hat{f}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}\|_{\ell_{2}}^{2} (λ1(n)λ2(n))(m=1Mf^mm2r)\displaystyle\geq({\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}})(\sum_{m=1}^{M}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}^{2-r})
M(λ1(n)λ2(n))(1Mm=1Mf^mm2r)\displaystyle\geq M({\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}})\left(\frac{1}{M}\sum_{m=1}^{M}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}^{2-r}\right)
Mr1(λ1(n)λ2(n))f^12r.\displaystyle\geq M^{r-1}({\lambda_{1}^{(n)}}\vee{\lambda_{2}^{(n)}})\|\hat{f}\|_{\ell_{1}}^{2-r}. (36)

Therefore we have the first assertion by setting F=4CLF=4CL.
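
Both inequalities used in (36) can also be checked numerically. The following snippet (an illustration with random block norms standing in for the RKHS norms of the blocks; not part of the proof) verifies the pointwise Young step and the Jensen step:

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    lam1, lam2 = rng.uniform(0.01, 1.0, size=2)
    a = rng.uniform(0.0, 2.0, size=20)            # block norms ||f_m||, M = 20
    top = max(lam1, lam2)
    r = lam1/top
    lhs = lam1*a.sum() + lam2*(a**2).sum()        # lam1 ||f||_1 + lam2 ||f||_2^2
    mid = top*(a**(2 - r)).sum()                  # Young: a^{2-r} <= r a + (1-r) a^2
    low = len(a)**(r - 1)*top*a.sum()**(2 - r)    # Jensen with the convex map t -> t^{2-r}
    assert lhs + 1e-9 >= mid >= low - 1e-9
print("the chain (36) holds on all random draws")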

The second assertion can be shown as follows: by the inequality (33) we have

M1λ2(n)(f^f1)2λ2(n)f^f22\displaystyle M^{-1}{\lambda_{2}^{(n)}}\left(\|\hat{f}-f^{*}\|_{\ell_{1}}\right)^{2}\leq{\lambda_{2}^{(n)}}\|\hat{f}-f^{*}\|_{\ell_{2}}^{2}
\displaystyle\leq 2nm=1Mi=1nϵi(f^m(xi)fm(xi))+λ1(n)f^f1+2λ2(n)m=1Mfm,fmf^mm\displaystyle\frac{2}{n}\sum_{m=1}^{M}\sum_{i=1}^{n}\epsilon_{i}(\hat{f}_{m}(x_{i})-f^{*}_{m}(x_{i}))+{\lambda_{1}^{(n)}}\|\hat{f}-f^{*}\|_{\ell_{1}}+2{\lambda_{2}^{(n)}}\sum_{m=1}^{M}\langle f^{*}_{m},f^{*}_{m}-\hat{f}_{m}\rangle_{\mathcal{H}_{m}}
\displaystyle\leq λ2(n)(32+2maxmfmm)f^f1\displaystyle{\lambda_{2}^{(n)}}\left(\frac{3}{2}+2\max_{m}\|f^{*}_{m}\|_{\mathcal{H}_{m}}\right)\|\hat{f}-f^{*}\|_{\ell_{1}} (37)

with probability at least 1n11-n^{-1}, where we used (34), λ2(n)4CLlog(Mn)n{\lambda_{2}^{(n)}}\geq 4CL\sqrt{\frac{\log(Mn)}{n}} and λ2(n)λ1(n){\lambda_{2}^{(n)}}\geq{\lambda_{1}^{(n)}} in the last inequality.  

Proof: (Lemma 8) In what follows, we assume f^f1R¯\|\hat{f}-f^{*}\|_{\ell_{1}}\leq\bar{R} where R¯=4MR\bar{R}=4MR (the probability of this event is greater than 1n11-n^{-1} by Lemma 7). Since f^\hat{f} minimizes the empirical risk, we have

Pn(f^Y)2+λ1(n)f^1+λ2(n)f^22Pn(fY)2+λ1(n)f1+λ2(n)f22\displaystyle P_{n}(\hat{f}-Y)^{2}+{\lambda_{1}^{(n)}}\|\hat{f}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}\|_{\ell_{2}}^{2}\leq P_{n}(f^{*}-Y)^{2}+{\lambda_{1}^{(n)}}\|f^{*}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|f^{*}\|_{\ell_{2}}^{2}
\displaystyle\Rightarrow~ P(f^f)2+λ1(n)f^J1+λ2(n)f^J22(PPn)((ff^)2+2(f^f)ϵ)+\displaystyle P(\hat{f}-f^{*})^{2}+{\lambda_{1}^{(n)}}\|\hat{f}_{J}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}_{J}\|_{\ell_{2}}^{2}\leq(P-P_{n})((f^{*}-\hat{f})^{2}+2(\hat{f}-f^{*})\epsilon)+
+λ1(n)(fI1f^I1)+λ2(n)(fI22f^I22)+λ1(n)fJ1+λ2(n)fJ22.\displaystyle~~~~~~~~~~~~~~~~~~~~~~+{\lambda_{1}^{(n)}}(\|f^{*}_{I}\|_{\ell_{1}}-\|\hat{f}_{I}\|_{\ell_{1}})+{\lambda_{2}^{(n)}}(\|f^{*}_{I}\|_{\ell_{2}}^{2}-\|\hat{f}_{I}\|_{\ell_{2}}^{2})+{\lambda_{1}^{(n)}}\|f^{*}_{J}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|f^{*}_{J}\|_{\ell_{2}}^{2}. (38)

The second term in the RHS of the above inequality (38) can be bounded from above as

(fI1f^I1)\displaystyle(\|f^{*}_{I}\|_{\ell_{1}}-\|\hat{f}_{I}\|_{\ell_{1}}) mIfmm,f^mfmm\displaystyle\leq\sum_{m\in I}\langle\nabla\|f^{*}_{m}\|_{\mathcal{H}_{m}},\hat{f}_{m}-f^{*}_{m}\rangle_{\mathcal{H}_{m}}
=mIgm,Tm1/2(f^mfm)mfmmmIgmmfmmf^mfmL2(Π),\displaystyle=\sum_{m\in I}\frac{\langle g^{*}_{m},T_{m}^{1/2}(\hat{f}_{m}-f^{*}_{m})\rangle_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\leq\sum_{m\in I}\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}, (39)

where we used fm=Tm1/2gmf^{*}_{m}=T_{m}^{1/2}g^{*}_{m} for mII0m\in I\subseteq I_{0}. We also have

λ2(n)(fI22f^I22)\displaystyle{\lambda_{2}^{(n)}}(\|f^{*}_{I}\|_{\ell_{2}}^{2}-\|\hat{f}_{I}\|_{\ell_{2}}^{2}) =λ2(n)(mI2fm,fmf^mmf^IfI22)\displaystyle={\lambda_{2}^{(n)}}(\sum_{m\in I}2\langle f^{*}_{m},f^{*}_{m}-\hat{f}_{m}\rangle_{\mathcal{H}_{m}}-\|\hat{f}_{I}-f^{*}_{I}\|_{\ell_{2}}^{2})
λ2(n)(mI2gmmf^mfmL2(Π)f^IfI22).\displaystyle\leq{\lambda_{2}^{(n)}}(\sum_{m\in I}2\|g^{*}_{m}\|_{\mathcal{H}_{m}}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}-\|\hat{f}_{I}-f^{*}_{I}\|_{\ell_{2}}^{2}). (40)

Substituting (39) and (40) into (38), we obtain

f^fL2(Π)2+λ2(n)f^IfI22+λ1(n)f^J1+λ2(n)f^J22\displaystyle\|\hat{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+{\lambda_{2}^{(n)}}\|\hat{f}_{I}-f^{*}_{I}\|_{\ell_{2}}^{2}+{\lambda_{1}^{(n)}}\|\hat{f}_{J}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|\hat{f}_{J}\|_{\ell_{2}}^{2}
\displaystyle\leq (PPn)((ff^)2+2(f^f)ϵ)+mI(λ1(n)gmmfmm+2λ2(n)gmm)f^mfmL2(Π)\displaystyle(P-P_{n})((f^{*}-\hat{f})^{2}+2(\hat{f}-f^{*})\epsilon)+\sum_{m\in I}({\lambda_{1}^{(n)}}\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}+2{\lambda_{2}^{(n)}}\|g^{*}_{m}\|_{\mathcal{H}_{m}})\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}
+λ1(n)fJ1+λ2(n)fJ22.\displaystyle+{\lambda_{1}^{(n)}}\|f^{*}_{J}\|_{\ell_{1}}+{\lambda_{2}^{(n)}}\|f^{*}_{J}\|_{\ell_{2}}^{2}. (41)

Finally we evaluate the first term (PPn)((ff^)2+2(f^f)ϵ)(P-P_{n})((f^{*}-\hat{f})^{2}+2(\hat{f}-f^{*})\epsilon) in the RHS of the above inequality (41) by applying Talagrand’s concentration inequality (Talagrand, 1996a, b, Bousquet, 2002). First we decompose (PPn)((ff^)2+2(f^f)ϵ)(P-P_{n})((f^{*}-\hat{f})^{2}+2(\hat{f}-f^{*})\epsilon) as

(PPn)((ff^)2+2(f^f)ϵ)=m=1M(PPn)((ff^)(fmf^m)+2(f^mfm)ϵ),\displaystyle(P-P_{n})((f^{*}-\hat{f})^{2}+2(\hat{f}-f^{*})\epsilon)=\sum_{m=1}^{M}(P-P_{n})((f^{*}-\hat{f})(f^{*}_{m}-\hat{f}_{m})+2(\hat{f}_{m}-f^{*}_{m})\epsilon),

and bound each term (PPn)((ff^)(fmf^m)+2(f^mfm)ϵ)(P-P_{n})((f^{*}-\hat{f})(f^{*}_{m}-\hat{f}_{m})+2(\hat{f}_{m}-f^{*}_{m})\epsilon) in the summation. Here suppose ff\in\mathcal{H} satisfies ff1R^\|f\|_{\infty}\leq\|f\|_{\ell_{1}}\leq\hat{R} for a constant R^(R¯)\hat{R}~(\leq\bar{R}). Since |ϵ|L|\epsilon|\leq L, we have

|ffm+2fmϵ|2(L+R^)|f|2(L+R^)fmm,\displaystyle|ff_{m}+2f_{m}\epsilon|\leq 2(L+\hat{R})|f|\leq 2(L+\hat{R})\|f_{m}\|_{\mathcal{H}_{m}}, (42a)
P(ffm+2fmϵ)2=P(f2fm2)+4P(fm2ϵ2)fL2(Π)2fmL2(Π)2+4L2fmL2(Π)2\displaystyle\sqrt{P(ff_{m}+2f_{m}\epsilon)^{2}}=\sqrt{P(f^{2}f_{m}^{2})+4P(f_{m}^{2}\epsilon^{2})}\leq\sqrt{\|f\|_{L_{2}(\Pi)}^{2}\|f_{m}\|_{L_{2}(\Pi)}^{2}+4L^{2}\|f_{m}\|_{L_{2}(\Pi)}^{2}}
fL2(Π)fmL2(Π)+2LfmL2(Π),\displaystyle\leq\|f\|_{L_{2}(\Pi)}\|f_{m}\|_{L_{2}(\Pi)}+2L\|f_{m}\|_{L_{2}(\Pi)}, (42b)

for all ff\in\mathcal{H}. Let Qnf:=1ni=1nεif(xi,yi)Q_{n}f:=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_{i}f(x_{i},y_{i}), where {εi}i=1n{±1}n\{\varepsilon_{i}\}_{i=1}^{n}\in\{\pm 1\}^{n} are i.i.d. Rademacher random variables, and let Ψm(ξm,σm)\Psi_{m}(\xi_{m},\sigma_{m}) be

Ψm(ξm,σm):=E[sup{Qn(|fm|)fmm,fmmξm,fmL2(Π)σm}].\Psi_{m}(\xi_{m},\sigma_{m}):=\mathrm{E}[\sup\{Q_{n}(|f_{m}|)\mid f_{m}\in\mathcal{H}_{m},\|f_{m}\|_{\mathcal{H}_{m}}\leq\xi_{m},\|f_{m}\|_{L_{2}(\Pi)}\leq\sigma_{m}\}].

Then, by the spectral assumption (A5) (equivalently, the covering number condition), one can show that

Ψm(ξm,σm)Ks(σm1sξmsnn11+sξm)\Psi_{m}(\xi_{m},\sigma_{m})\leq K_{s}\left(\frac{\sigma_{m}^{1-s}\xi_{m}^{s}}{\sqrt{n}}\vee n^{-\frac{1}{1+s}}\xi_{m}\right)

where KsK_{s} is a constant that depends on ss and C2C_{2} (Mendelson, 2002). Let Ξm(ξm,σm):={fmmfmmξm,fmL2(Π)σm}\Xi_{m}(\xi_{m},\sigma_{m}):=\{f_{m}\in\mathcal{H}_{m}\mid\|f_{m}\|_{\mathcal{H}_{m}}\leq\xi_{m},\|f_{m}\|_{L_{2}(\Pi)}\leq\sigma_{m}\}. Now, by the Rademacher contraction inequality (Ledoux and Talagrand, 1991, Theorem 4.12), for given {ξm,σm}mI\{\xi_{m},\sigma_{m}\}_{m\in I} and R^\hat{R}, we have

E[sup{Qn(ffm+2fmϵ)f such that fmΞm(ξm,σm),f1R^}]\displaystyle\mathrm{E}[\sup\{Q_{n}(ff_{m}+2f_{m}\epsilon)\mid f\in\mathcal{H}\text{~such that~}f_{m}\in\Xi_{m}(\xi_{m},\sigma_{m}),~\|f\|_{\ell_{1}}\leq\hat{R}\}]
\displaystyle\leq 2(L+R^)Ψm(ξm,σm)2Ks(L+R^)(σm1sξmsnn11+sξm).\displaystyle 2(L+\hat{R})\Psi_{m}(\xi_{m},\sigma_{m})\leq 2K_{s}(L+\hat{R})\left(\frac{\sigma_{m}^{1-s}\xi_{m}^{s}}{\sqrt{n}}\vee n^{-\frac{1}{1+s}}\xi_{m}\right). (43)

Therefore by the symmetrization argument (van der Vaart and Wellner, 1996), we have

E[sup{(PnP)(ffm+2fmϵ)f such that fmΞm(ξm,σm),f1R^}]\displaystyle\mathrm{E}[\sup\{(P_{n}-P)(ff_{m}+2f_{m}\epsilon)\mid f\in\mathcal{H}\text{~such that~}f_{m}\in\Xi_{m}(\xi_{m},\sigma_{m}),~\|f\|_{\ell_{1}}\leq\hat{R}\}]
\displaystyle\leq 4Ks(L+R^)(σm1sξmsnn11+sξm).\displaystyle 4K_{s}(L+\hat{R})\left(\frac{\sigma_{m}^{1-s}\xi_{m}^{s}}{\sqrt{n}}\vee n^{-\frac{1}{1+s}}\xi_{m}\right). (44)

By Talagrand’s concentration inequality with (42) and (44), for given R^,σ¯,ξm,σm\hat{R},\bar{\sigma},\xi_{m},\sigma_{m} with probability at least 1et1-e^{-t} (t>0)(t>0), we have

supf:fL2(Π)σ¯,fR^,fmΞm(ξm,σm)(PnP)(ffm+2fmϵ)\displaystyle\sup_{f\in\mathcal{H}:\atop\|f\|_{L_{2}(\Pi)}\leq\bar{\sigma},\|f\|_{\infty}\leq\hat{R},f_{m}\in\Xi_{m}(\xi_{m},\sigma_{m})}(P_{n}-P)(ff_{m}+2f_{m}\epsilon)\leq
2(4Ks(L+R^)(σm1sξmsnξmn11+s)+tn(σ¯σm+2Lσm)+2(L+R^)ξmtn).\displaystyle~~~~~~\textstyle\sqrt{2}\left(4K_{s}(L+\hat{R})\left(\frac{\sigma_{m}^{1-s}\xi_{m}^{s}}{\sqrt{n}}\vee\frac{\xi_{m}}{n^{\frac{1}{1+s}}}\right)+\sqrt{\frac{t}{n}}(\bar{\sigma}\sigma_{m}+2L\sigma_{m})+2(L+\hat{R})\xi_{m}\frac{t}{n}\right). (45)

where we used the relation (42). Our next goal is to derive a uniform version of the above inequality over

1nR^R¯,1nσ¯R¯,1nMξmR¯and1nMσmR¯.\frac{1}{\sqrt{n}}\leq\hat{R}\leq\bar{R},~~\frac{1}{\sqrt{n}}\leq\bar{\sigma}\leq\bar{R},~~~\frac{1}{\sqrt{n}M}\leq\xi_{m}\leq\bar{R}~~~\text{and}~~~\frac{1}{\sqrt{n}M}\leq\sigma_{m}\leq\bar{R}.

By considering a grid {R^(k1),σ¯(k2),ξm(k3),σm(k4)}ki=0(i=1,,4)log2(MR¯n)\{\hat{R}^{(k_{1})},\bar{\sigma}^{(k_{2})},\xi_{m}^{(k_{3})},\sigma_{m}^{(k_{4})}\}_{k_{i}=0(i=1,\dots,4)}^{\log_{2}(M\bar{R}\sqrt{n})} such that R^(k):=R¯2k\hat{R}^{(k)}:=\bar{R}2^{-k}, σ¯(k):=R¯2k\bar{\sigma}^{(k)}:=\bar{R}2^{-k}, ξm(k):=R¯2k\xi_{m}^{(k)}:=\bar{R}2^{-k} and σm(k):=R¯2k\sigma_{m}^{(k)}:=\bar{R}2^{-k}, we have with probability at least 1(log(MR¯n))4et1(log(4RM2n))4et1-(\log(M\bar{R}\sqrt{n}))^{4}e^{-t}\geq 1-(\log(4RM^{2}\sqrt{n}))^{4}e^{-t}

(PnP)(ffm+2fmϵ)\displaystyle(P_{n}-P)(ff_{m}+2f_{m}\epsilon)\leq K(1+f1)(fmL2(Π)1sfmmsnfmmn11+s+tfmmn)\displaystyle K(1+\|f\|_{\ell_{1}})\left(\frac{\|f_{m}\|_{L_{2}(\Pi)}^{1-s}\|f_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}\vee\frac{\|f_{m}\|_{\mathcal{H}_{m}}}{n^{\frac{1}{1+s}}}+\frac{t\|f_{m}\|_{\mathcal{H}_{m}}}{n}\right)
+2tn(fL2(Π)fmL2(Π)+2LfmL2(Π)),\displaystyle~~~~~~~~~~~~~~~\!\!+\!\!\sqrt{\frac{2t}{n}}(\|f\|_{L_{2}(\Pi)}\|f_{m}\|_{L_{2}(\Pi)}+2L\|f_{m}\|_{L_{2}(\Pi)}),

for all ff\in\mathcal{H} such that fmmR¯\|f_{m}\|_{\mathcal{H}_{m}}\leq\bar{R} and f1R¯\|f\|_{\ell_{1}}\leq\bar{R}, and for all t>1t>1, where K=4(4KsL4Ks2L2)K=4(4K_{s}L\vee 4K_{s}\vee 2L\vee 2). Summing up this bound over m=1,,Mm=1,\dots,M, we obtain

(PnP)(f2+2fϵ)\displaystyle(P_{n}-P)(f^{2}+2f\epsilon)\leq K(1+f1)(m=1MfmL2(Π)1sfmmsnfmmn11+s+tf1n)\displaystyle K(1+\|f\|_{\ell_{1}})\left(\sum_{m=1}^{M}\frac{\|f_{m}\|_{L_{2}(\Pi)}^{1-s}\|f_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}\vee\frac{\|f_{m}\|_{\mathcal{H}_{m}}}{n^{\frac{1}{1+s}}}+\frac{t\|f\|_{\ell_{1}}}{n}\right)
+2tn(fL2(Π)m=1MfmL2(Π)+2Lm=1MfmL2(Π)),\displaystyle~~~~~~~~~~~~~~~\!\!+\!\!\sqrt{\frac{2t}{n}}\left(\|f\|_{L_{2}(\Pi)}\sum_{m=1}^{M}\|f_{m}\|_{L_{2}(\Pi)}+2L\sum_{m=1}^{M}\|f_{m}\|_{L_{2}(\Pi)}\right),

uniformly for all ff\in\mathcal{H} such that fmmR¯\|f_{m}\|_{\mathcal{H}_{m}}\leq\bar{R} (m\forall m) and f1R¯\|f\|_{\ell_{1}}\leq\bar{R} with probability at least 1M(log(4RM2n))4et1-M(\log(4RM^{2}\sqrt{n}))^{4}e^{-t}. Here set γn=Kn\gamma_{n}=\frac{K}{\sqrt{n}} and note that 2tnfL2(Π)m=1MfmL2(Π)12fL2(Π)2+tn(m=1MfmL2(Π))212fL2(Π)2+tn(f1)2\sqrt{\frac{2t}{n}}\|f\|_{L_{2}(\Pi)}\sum_{m=1}^{M}\|f_{m}\|_{L_{2}(\Pi)}\leq\frac{1}{2}\|f\|_{L_{2}(\Pi)}^{2}+\frac{t}{n}(\sum_{m=1}^{M}\|f_{m}\|_{L_{2}(\Pi)})^{2}\leq\frac{1}{2}\|f\|_{L_{2}(\Pi)}^{2}+\frac{t}{n}(\|f\|_{\ell_{1}})^{2} then we have

(PnP)(f2+2fϵ)\displaystyle(P_{n}-P)(f^{2}+2f\epsilon)\leq K(1+f1)[mI(fmL2(Π)1sfmmsnfmmn11+s)+2tf1n]\displaystyle K(1+\|f\|_{\ell_{1}})\Bigg{[}\sum_{m\in I}\left(\frac{\|f_{m}\|_{L_{2}(\Pi)}^{1-s}\|f_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}\vee\frac{\|f_{m}\|_{\mathcal{H}_{m}}}{n^{\frac{1}{1+s}}}\right)+\frac{2t\|f\|_{\ell_{1}}}{n}\Bigg{]}
+γn(1+f1)fJ1+12fL2(Π)2+22Ltnm=1MfmL2(Π).\displaystyle+\gamma_{n}(1+\|f\|_{\ell_{1}})\|f_{J}\|_{\ell_{1}}+\frac{1}{2}\|f\|_{L_{2}(\Pi)}^{2}+2\sqrt{2}L\sqrt{\frac{t}{n}}\sum_{m=1}^{M}\|f_{m}\|_{L_{2}(\Pi)}. (46)

for all ff\in\mathcal{H} such that fmmR¯\|f_{m}\|_{\mathcal{H}_{m}}\leq\bar{R} (m\forall m) and f1R¯\|f\|_{\ell_{1}}\leq\bar{R} with probability at least 1M(log(4RM2n))4et1-M(\log(4RM^{2}\sqrt{n}))^{4}e^{-t}. We now replace tt with t+5logM+4loglog(Rn)t+5\log M+4\log\log(R\sqrt{n}); then the probability 1M(log(4RnM2))4et1-M(\log(4R\sqrt{n}M^{2}))^{4}e^{-t} can be replaced with 1et1-e^{-t}, and t+5logM+4loglog(Rn)6tt+5\log M+4\log\log(R\sqrt{n})\leq 6t holds for all tlogM+loglog(Rn)t\geq\log M+\log\log(R\sqrt{n}). On the event where f^f1R¯\|\hat{f}-f^{*}\|_{\ell_{1}}\leq\bar{R} holds, substituting f^f\hat{f}-f^{*} for ff in (46) and replacing KK appropriately, (41) yields

12f^fL2(Π)2+λ2(n)mIf^IfIm2+λ2(n)mJf^mm2+(λ1(n)γ^n)mJf^mm\displaystyle\frac{1}{2}\|\hat{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I}\|\hat{f}_{I}-f^{*}_{I}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in J}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}^{2}+({\lambda_{1}^{(n)}}-\hat{\gamma}_{n})\sum_{m\in J}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}
\displaystyle\leq K~1(1+f^f1)(mIf^mfmL2(Π)1sf^mfmmsnf^mfmmn11+s+tf^f1n)\displaystyle\tilde{K}_{1}(1+\|\hat{f}-f^{*}\|_{\ell_{1}})\Big{(}\sum_{m\in I}\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{1-s}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}\vee\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}}{n^{\frac{1}{1+s}}}+\frac{t\|\hat{f}-f^{*}\|_{\ell_{1}}}{n}\Big{)}
+mI(λ1(n)gmmfmm+2λ2(n)gmm)f^mfmL2(Π)+λ2(n)mJfmm2+(λ1(n)+γ^n)mJfmm\displaystyle\!+\!\!\sum_{m\in I}\left(\!{\lambda_{1}^{(n)}}\!\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\!+\!2{\lambda_{2}^{(n)}}\|g^{*}_{m}\|_{\mathcal{H}_{m}}\!\!\right)\!\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}\!\!+\!{\lambda_{2}^{(n)}}\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}\!\!+\!({\lambda_{1}^{(n)}}\!\!+\!\hat{\gamma}_{n})\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}
+K~2tnm=1Mf^mfmL2(Π),\displaystyle+\tilde{K}_{2}\sqrt{\frac{t}{n}}\sum_{m=1}^{M}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}, (47)

where K~1\tilde{K}_{1} and K~2\tilde{K}_{2} are constants and γ^n=γn(1+f^f1)\hat{\gamma}_{n}=\gamma_{n}(1+\|\hat{f}-f^{*}\|_{\ell_{1}}). Finally, since K~2tnm=1Mf^mfmL2(Π)=K~2tn(mIf^mfmL2(Π)+mJf^mL2(Π)+mJfmL2(Π))K~2tn(mIf^mfmL2(Π)+mJf^mm+mJfmm)\tilde{K}_{2}\sqrt{\frac{t}{n}}\sum_{m=1}^{M}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}=\tilde{K}_{2}\sqrt{\frac{t}{n}}(\sum_{m\in I}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}+\sum_{m\in J}\|\hat{f}_{m}\|_{L_{2}(\Pi)}+\sum_{m\in J}\|f^{*}_{m}\|_{L_{2}(\Pi)})\leq\tilde{K}_{2}\sqrt{\frac{t}{n}}(\sum_{m\in I}\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}+\sum_{m\in J}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}+\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}), (47) becomes

12f^fL2(Π)2+λ2(n)mIf^IfIm2+λ2(n)mJf^mm2+(λ1(n)γ^nK~2tn)mJf^mm\displaystyle\frac{1}{2}\|\hat{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I}\|\hat{f}_{I}-f^{*}_{I}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in J}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}^{2}+\left({\lambda_{1}^{(n)}}-\hat{\gamma}_{n}-\tilde{K}_{2}\sqrt{\frac{t}{n}}\right)\sum_{m\in J}\|\hat{f}_{m}\|_{\mathcal{H}_{m}}
\displaystyle\leq K~1(1+f^f1)(mIf^mfmL2(Π)1sf^mfmmsnf^mfmmn11+s+tf^f1n)\displaystyle\tilde{K}_{1}(1+\|\hat{f}-f^{*}\|_{\ell_{1}})\Big{(}\sum_{m\in I}\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{1-s}\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{s}}{\sqrt{n}}\vee\frac{\|\hat{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}}{n^{\frac{1}{1+s}}}+\frac{t\|\hat{f}-f^{*}\|_{\ell_{1}}}{n}\Big{)}
+mI(λ1(n)gmmfmm+2λ2(n)gmm+K~2tn)f^mfmL2(Π)\displaystyle\!+\!\!\sum_{m\in I}\left(\!{\lambda_{1}^{(n)}}\!\frac{\|g^{*}_{m}\|_{\mathcal{H}_{m}}}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}\!+\!2{\lambda_{2}^{(n)}}\|g^{*}_{m}\|_{\mathcal{H}_{m}}\!\!+\tilde{K}_{2}\sqrt{\frac{t}{n}}\right)\!\|\hat{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}
+λ2(n)mJfmm2+(λ1(n)+γ^n+K~2tn)mJfmm,\displaystyle\!\!+\!{\lambda_{2}^{(n)}}\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}\!\!+\!\left({\lambda_{1}^{(n)}}\!\!+\!\hat{\gamma}_{n}+\tilde{K}_{2}\sqrt{\frac{t}{n}}\right)\sum_{m\in J}\|f^{*}_{m}\|_{\mathcal{H}_{m}}, (48)

which yields the assertion.

 
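
As a side remark, the local complexity bound on Psi_m(xi_m, sigma_m) used in the proof of Lemma 8 can be made concrete. Under the spectral assumption behind (A5), the eigenvalues of the operator T_m decay as mu_j ~ j^{-1/s}, and the expected supremum is controlled by sqrt(sum_j min(sigma^2, xi^2 mu_j))/sqrt(n), which scales as sigma^{1-s} xi^s / sqrt(n). The snippet below (a toy numerical check under this assumed decay; not part of the proofs) illustrates that the ratio of the two quantities stays essentially constant as sigma varies:

import numpy as np

s, n = 0.5, 10**4
mu = np.arange(1, 10**6 + 1, dtype=float)**(-1.0/s)   # assumed eigenvalue decay j^{-1/s}
xi = 1.0                                              # RKHS-norm radius
for sigma in [0.3, 0.1, 0.03, 0.01]:                  # L2(Pi)-norm radii
    complexity = np.sqrt(np.sum(np.minimum(sigma**2, xi**2*mu)))/np.sqrt(n)
    prediction = sigma**(1 - s)*xi**s/np.sqrt(n)
    print(f"sigma={sigma}: ratio = {complexity/prediction:.3f}")  # approximately constant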

Appendix C Proof of Theorems 4 and 5

We write the operator norm of SI,J:JIS_{I,J}:\mathcal{H}_{J}\to\mathcal{H}_{I} as SI,JI,J:=supgJJ,gJ0SI,JgJIgJJ\|S_{I,J}\|_{\mathcal{H}_{I},\mathcal{H}_{J}}:=\sup\limits_{g_{J}\in\mathcal{H}_{J},g_{J}\neq 0}\frac{\|S_{I,J}g_{J}\|_{\mathcal{H}_{I}}}{\|g_{J}\|_{\mathcal{H}_{J}}}.

Definition 9

For all 1m,mM1\leq m,m^{\prime}\leq M, we define the empirical (non-centered) cross-covariance operator Σ^m,m\hat{\Sigma}_{m,m^{\prime}} as follows:

fm,Σ^m,mgmm:=1ni=1nfm(xi)gm(xi),\langle f_{m},\hat{\Sigma}_{m,m^{\prime}}g_{m^{\prime}}\rangle_{\mathcal{H}_{m}}:=\frac{1}{n}\sum_{i=1}^{n}f_{m}(x_{i})g_{m^{\prime}}(x_{i}), (49)

where fmm,gmmf_{m}\in\mathcal{H}_{m},g_{m^{\prime}}\in\mathcal{H}_{m^{\prime}}. Analogous to the joint covariance operator Σ\Sigma, we define the joint empirical cross covariance operator Σ^:\hat{\Sigma}:\mathcal{H}\to\mathcal{H} as (Σ^h)m=l=1MΣ^m,lhl(\hat{\Sigma}h)_{m}=\sum_{l=1}^{M}\hat{\Sigma}_{m,l}h_{l}. We denote by Σ^m,ϵ\hat{\Sigma}_{m,\epsilon} the element of m\mathcal{H}_{m} such that

fm,Σ^m,ϵm:=1ni=1nϵifm(xi).\langle f_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}:=\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f_{m}(x_{i}).

Let R¯\bar{R} be a constant such that 8m=1Mfmm R¯8\sum_{m=1}^{M}\|f^{*}_{m}\|_{\mathcal{H}_{m}}\leq\bar{R}. We denote by FnF_{n} the objective function of elastic-net MKL

Fn(f):=1ni=1n(f(xi)yi)2+λ1(n)m=1Mfmm+λ2(n)m=1Mfmm2.F_{n}(f):=\frac{1}{n}\sum_{i=1}^{n}(f(x_{i})-y_{i})^{2}+{\lambda_{1}^{(n)}}\sum_{m=1}^{M}\|f_{m}\|_{\mathcal{H}_{m}}+{\lambda_{2}^{(n)}}\sum_{m=1}^{M}\|f_{m}\|_{\mathcal{H}_{m}}^{2}.
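
Although the argument below is purely theoretical, it may help to see F_n concretely. The following is a minimal finite-sample sketch (illustration only: the Gaussian base kernels, their widths, and the data-generating model are assumptions for the example, not taken from the paper), representing each f_m as f_m = K_m alpha_m so that ||f_m||_{H_m}^2 = alpha_m' K_m alpha_m:

import numpy as np

def gauss_kernel(x, width):
    # Gram matrix of a Gaussian kernel on 1-D inputs (assumed base kernel).
    d2 = (x[:, None] - x[None, :])**2
    return np.exp(-d2/(2.0*width**2))

def Fn(alphas, Ks, y, lam1, lam2):
    # Elastic-net MKL objective: empirical squared loss plus block-l1 and
    # block-l2 penalties on the RKHS norms of the components.
    n = len(y)
    pred = sum(K @ a for K, a in zip(Ks, alphas))
    h = [np.sqrt(max(a @ K @ a, 0.0)) for K, a in zip(Ks, alphas)]  # ||f_m||_{H_m}
    return ((pred - y) @ (pred - y))/n + lam1*sum(h) + lam2*sum(v*v for v in h)

rng = np.random.default_rng(0)
n, widths = 50, [0.1, 0.5, 2.0]          # M = 3 base kernels (assumed widths)
x = rng.uniform(-1.0, 1.0, n)
y = np.sin(np.pi*x) + 0.1*rng.standard_normal(n)
Ks = [gauss_kernel(x, w) for w in widths]
alphas = [np.zeros(n) for _ in Ks]
print(Fn(alphas, Ks, y, lam1=0.1, lam2=0.1))  # objective at f = 0: ~ ||y||^2/n

Minimizing F_n over the alpha_m is a convex problem (e.g., by proximal gradient steps on the block norms); only the objective itself is referenced in the proofs here.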

Proof: (Theorem 4) Let f~mI0m\tilde{f}\in\oplus_{m\in I_{0}}\mathcal{H}_{m} be the minimizer of F~n\tilde{F}_{n}:

f~:=argminfI0F~n(f),\displaystyle\tilde{f}:=\mathop{\arg\min}_{f\in\mathcal{H}_{I_{0}}}\tilde{F}_{n}(f),
where F~n(f):=1ni=1n(f(xi)yi)2+λ1(n)mI0fmm+λ2(n)mI0fmm2.\displaystyle\tilde{F}_{n}(f):=\frac{1}{n}\sum_{i=1}^{n}(f(x_{i})-y_{i})^{2}+{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|f_{m}\|_{\mathcal{H}_{m}}+{\lambda_{2}^{(n)}}\sum_{m\in I_{0}}\|f_{m}\|_{\mathcal{H}_{m}}^{2}.

(Step 1) We first show that f~pf\tilde{f}\stackrel{{\scriptstyle p}}{{\to}}f^{*} with respect to the RKHS norm. Since λ1(n)n{\lambda_{1}^{(n)}}\sqrt{n}\to\infty, as in the proof of Lemma 7, the probability of m=1Mf~mfmmMR¯\sum_{m=1}^{M}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}\leq\sqrt{M}\bar{R} goes to 1 (this can be checked as follows: replacing log(Mn)n\sqrt{\frac{\log(Mn)}{n}} in Eq. (34) with log(M)λ1(n)\log(M){\lambda_{1}^{(n)}}, we see that Eq. (34) holds with probability 1exp(λ1(n)2n)1-\exp(-{\lambda_{1}^{(n)}}^{2}n)). There exists c1c_{1} depending only on MR¯\sqrt{M}\bar{R} such that

fmm=fmfmm22fmfm,fmm+fmm2\displaystyle\|f_{m}\|_{\mathcal{H}_{m}}=\sqrt{\|f_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\langle f_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}+\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}}
\displaystyle\geq c1fmfmm22fmm1|fmfm,fmm|+fmm\displaystyle c_{1}\|f_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}-2\|f^{*}_{m}\|_{\mathcal{H}_{m}}^{-1}|\langle f_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}|+\|f^{*}_{m}\|_{\mathcal{H}_{m}} (50)

for all mI0m\in I_{0} and all fmmf_{m}\in\mathcal{H}_{m} such that fmmMR¯\|f_{m}\|_{\mathcal{H}_{m}}\leq\sqrt{M}\bar{R}.

Since f~\tilde{f} minimizes F~n\tilde{F}_{n}, if m=1Mf~mfmmMR¯\sum_{m=1}^{M}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}\leq\sqrt{M}\bar{R} (the probability of which event goes to 1) we have

f~I0fI0,Σ^I0,I0(f~I0fI0)I0+c1λ1(n)mI0f~mfmm2+λ2(n)mI0f~mfmm2\displaystyle\langle\tilde{f}_{I_{0}}-f^{*}_{I_{0}},\hat{\Sigma}_{I_{0},I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})\rangle_{\mathcal{H}_{I_{0}}}+c_{1}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}
\displaystyle\leq 2Σ^I0,ϵ,f~fI0+2mI0(1fmmλ1(n)+λ2(n))|f~mfm,fmm|,\displaystyle 2\langle\hat{\Sigma}_{I_{0},\epsilon},\tilde{f}-f^{*}\rangle_{\mathcal{H}_{I_{0}}}+2\sum_{m\in I_{0}}\left(\frac{1}{\|f^{*}_{m}\|_{\mathcal{H}_{m}}}{\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}}\right)|\langle\tilde{f}_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}|, (51)

where we used the relation (50). By the assumption fm=Σm,m1/2gmf^{*}_{m}=\Sigma_{m,m}^{1/2}g^{*}_{m}, we have |f~mfm,fmm|gmmf~mfmL2(Π)|\langle\tilde{f}_{m}-f^{*}_{m},f^{*}_{m}\rangle_{\mathcal{H}_{m}}|\leq\|g^{*}_{m}\|_{\mathcal{H}_{m}}\|\tilde{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}. By Lemma 10 and Lemma 11, we have

Σm,mΣ^m,mm,m=Op(1/n),Σ^I0,ϵI0=Op(1/n).\|\Sigma_{m,m^{\prime}}-\hat{\Sigma}_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}=O_{p}(1/\sqrt{n}),~~~\|\hat{\Sigma}_{I_{0},\epsilon}\|_{\mathcal{H}_{I_{0}}}=O_{p}(1/\sqrt{n}).

Substituting these inequalities into (51), we have

f~fL2(Π)2+c1λ1(n)mI0f~mfmm2+λ2(n)mI0f~mfmm2\displaystyle\|\tilde{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+c_{1}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}
\displaystyle\leq Op(mI0f~mfmmn+(λ1(n)+λ2(n))mI0f~mfmL2(Π)).\displaystyle O_{p}\left(\frac{\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}}{\sqrt{n}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}\right). (52)

Recall that the (non-centered) cross-correlation operator is invertible. Thus there exists a constant cc such that

f~fL2(Π)2=f~I0fI0,ΣI0,I0(f~I0fI0)=f~I0fI0,Diag(Σm,m1/2)VI0,I0Diag(Σm,m1/2)(f~I0fI0)I0\displaystyle\|\tilde{f}-f^{*}\|_{L_{2}(\Pi)}^{2}=\langle\tilde{f}_{I_{0}}-f^{*}_{I_{0}},\Sigma_{I_{0},I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})\rangle_{\mathcal{H}}=\langle\tilde{f}_{I_{0}}-f^{*}_{I_{0}},\mathrm{Diag}(\Sigma_{m,m}^{1/2})V_{I_{0},I_{0}}\mathrm{Diag}(\Sigma_{m,m}^{1/2})(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})\rangle_{\mathcal{H}_{I_{0}}}
\displaystyle\geq cmI0f~mfm,Σm,m(f~mfm)m=cmI0f~mfmL2(Π)2.\displaystyle c\sum_{m\in I_{0}}\langle\tilde{f}_{m}-f^{*}_{m},\Sigma_{m,m}(\tilde{f}_{m}-f^{*}_{m})\rangle_{\mathcal{H}_{m}}=c\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{2}.

Combining this with Eq. (52) and using ab(a2+b2)/2ab\leq(a^{2}+b^{2})/2, we obtain

f~fL2(Π)2+c1λ1(n)mI0f~mfmm2+λ2(n)mI0f~mfmm2\displaystyle\|\tilde{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+c_{1}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}
Op(mI0f~mfmmn+(λ1(n)+λ2(n))mI0f~mfmL2(Π))\displaystyle\leq O_{p}\left(\frac{\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}}{\sqrt{n}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}\right)
Op(1nλ1(n)+(λ1(n)+λ2(n))2)+c12λ1(n)mI0f~mfmm2+c2mI0f~mfmL2(Π)2\displaystyle\leq O_{p}\left(\frac{1}{n{\lambda_{1}^{(n)}}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{2}\right)+\frac{c_{1}}{2}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+\frac{c}{2}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{L_{2}(\Pi)}^{2}
Op(1nλ1(n)+(λ1(n)+λ2(n))2)+c12λ1(n)mI0f~mfmm2+12f~fL2(Π)2.\displaystyle\leq O_{p}\left(\frac{1}{n{\lambda_{1}^{(n)}}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{2}\right)+\frac{c_{1}}{2}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+\frac{1}{2}\|\tilde{f}-f^{*}\|_{L_{2}(\Pi)}^{2}.

Therefore we have

12f~fL2(Π)2+c12λ1(n)mI0f~mfmm2+λ2(n)mI0f~mfmm2Op(1nλ1(n)+(λ1(n)+λ2(n))2)\displaystyle\frac{1}{2}\|\tilde{f}-f^{*}\|_{L_{2}(\Pi)}^{2}+\frac{c_{1}}{2}{\lambda_{1}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}+{\lambda_{2}^{(n)}}\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}\leq O_{p}\left(\frac{1}{n{\lambda_{1}^{(n)}}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{2}\right)
\displaystyle\Rightarrow mI0f~mfmm2Op(1(c1λ1(n)+λ2(n))nλ1(n)+(λ1(n)+λ2(n))2c1λ1(n)+λ2(n))=Op(1nλ1(n)2+(λ1(n)+λ2(n))).\displaystyle\sum_{m\in I_{0}}\|\tilde{f}_{m}-f^{*}_{m}\|_{\mathcal{H}_{m}}^{2}\leq O_{p}\left(\frac{1}{(c_{1}{\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})n{\lambda_{1}^{(n)}}}+\frac{({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{2}}{c_{1}{\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}}}\right)=O_{p}\left(\frac{1}{n{\lambda_{1}^{(n)}}^{2}}+({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})\right).

This and λ1(n)n{\lambda_{1}^{(n)}}\sqrt{n}\to\infty give f~fI0I00\|\tilde{f}-f^{*}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}\to 0 in probability.

(Step 2) Next we show that the probability of f~=f^\tilde{f}=\hat{f} goes to 1. Since f~fI0I00\|\tilde{f}-f^{*}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}\to 0, we can assume that f~mm>0(mI0)\|\tilde{f}_{m}\|_{\mathcal{H}_{m}}>0~(m\in I_{0}) without loss of generality. We identify f~\tilde{f} as an element of \mathcal{H} by setting f~m=0\tilde{f}_{m}=0 for mJ0m\in J_{0}. Now we show that f~\tilde{f} is also the minimizer of FnF_{n}, that is, f~=f^\tilde{f}=\hat{f}, with high probability; hence I^=I0\hat{I}=I_{0} with high probability. By the KKT condition, the necessary and sufficient condition that f~\tilde{f} also minimizes FnF_{n} is

2Σ^m,I0(f~I0fI0)2Σ^m,ϵmλ1(n)(mJ0),\displaystyle\|2\hat{\Sigma}_{m,I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}\leq{\lambda_{1}^{(n)}}~~~(\forall m\in J_{0}), (53)
(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)(f~I0fI0)+λ1(n)DnfI0+2λ2(n)fI02Σ^I0,ϵ=0,\displaystyle(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})+{\lambda_{1}^{(n)}}D_{n}f^{*}_{I_{0}}+2{\lambda_{2}^{(n)}}f^{*}_{I_{0}}-2\hat{\Sigma}_{I_{0},\epsilon}=0, (54)

where Dn=Diag(f~mm1)D_{n}=\mathrm{Diag}(\|\tilde{f}_{m}\|_{\mathcal{H}_{m}}^{-1}). Note that (54) is satisfied (with high probability) because f~\tilde{f} is the minimizer of F~n\tilde{F}_{n} and f~mm>0\|\tilde{f}_{m}\|_{\mathcal{H}_{m}}>0 for all mI0m\in I_{0} (with high probability). Therefore, if the condition (53) holds w.h.p., then f~=f^\tilde{f}=\hat{f} w.h.p.

We will now show that the condition (53) holds w.h.p. Due to (54), we have

f~I0fI0=(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1[(λ1(n)Dn+2λ2(n))fI02Σ^I0,ϵ].\tilde{f}_{I_{0}}-f^{*}_{I_{0}}=-(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}[({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}-2\hat{\Sigma}_{I_{0},\epsilon}].

Therefore the LHS of (53), 2Σ^m,I0(f~I0fI0)2Σ^m,ϵm\|2\hat{\Sigma}_{m,I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}, can be evaluated as

2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1[(λ1(n)Dn+2λ2(n))fI02Σ^I0,ϵ]2Σ^m,ϵm\displaystyle\|-2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}[({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}-2\hat{\Sigma}_{I_{0},\epsilon}]-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}
=\displaystyle= 2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0\displaystyle\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}
2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)12Σ^I0,ϵ+2Σ^m,ϵm\displaystyle-2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}2\hat{\Sigma}_{I_{0},\epsilon}+2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}
\displaystyle\leq 2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0m\displaystyle\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}\|_{\mathcal{H}_{m}}
+2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)12Σ^I0,ϵ2Σ^m,ϵm.\displaystyle+\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}2\hat{\Sigma}_{I_{0},\epsilon}-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}. (55)

We evaluate the probabilistic orders of the last two terms.

(i) (Bounding Bn,m:=2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)12Σ^I0,ϵ2Σ^m,ϵmB_{n,m}:=\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}2\hat{\Sigma}_{I_{0},\epsilon}-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}) We show that

Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1Σ^I0,ϵ=Op(1n).\displaystyle\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}\hat{\Sigma}_{I_{0},\epsilon}=O_{p}\left(\frac{1}{\sqrt{n}}\right).

Since O(Σ^I0,I0Σ^I0,mΣ^m,I0Σ^m,m),O\preceq\begin{pmatrix}\hat{\Sigma}_{I_{0},I_{0}}&\hat{\Sigma}_{I_{0},m}\\ \hat{\Sigma}_{m,I_{0}}&\hat{\Sigma}_{m,m}\end{pmatrix}, we have

O(Σ^I0,I0+λ2(n)+λ1(n)Dn/2Σ^I0,mΣ^m,I0Σ^m,m+λ2(n))(2Σ^I0,I0+2λ2(n)+λ1(n)Dn002Σ^m,m+2λ2(n)).\displaystyle O\preceq\begin{pmatrix}\hat{\Sigma}_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n}/2&\hat{\Sigma}_{I_{0},m}\\ \hat{\Sigma}_{m,I_{0}}&\hat{\Sigma}_{m,m}+{\lambda_{2}^{(n)}}\end{pmatrix}\preceq\begin{pmatrix}2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n}&0\\ 0&2\hat{\Sigma}_{m,m}+2{\lambda_{2}^{(n)}}\end{pmatrix}.

The second inequality is due to the fact that for all (fI0,fm)I0m(f_{I_{0}},f_{m})\in\mathcal{H}_{I_{0}\cup m} we have

(fI0fm),(Σ^I0,I0+λ2(n)+λ1(n)Dn/2Σ^I0,mΣ^m,I0Σ^m,m+λ2(n))(fI0fm)I0m0\displaystyle\left\langle\begin{pmatrix}f_{I_{0}}\\ -f_{m}\end{pmatrix},\begin{pmatrix}\hat{\Sigma}_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n}/2&-\hat{\Sigma}_{I_{0},m}\\ -\hat{\Sigma}_{m,I_{0}}&\hat{\Sigma}_{m,m}+{\lambda_{2}^{(n)}}\end{pmatrix}\begin{pmatrix}f_{I_{0}}\\ -f_{m}\end{pmatrix}\right\rangle_{\mathcal{H}_{I_{0}\cup m}}\geq 0

because of O(Σ^I0,I0Σ^I0,mΣ^m,I0Σ^m,m).O\preceq\begin{pmatrix}\hat{\Sigma}_{I_{0},I_{0}}&\hat{\Sigma}_{I_{0},m}\\ \hat{\Sigma}_{m,I_{0}}&\hat{\Sigma}_{m,m}\end{pmatrix}.

Thus we have

(Σ^I0,I0+λ2(n)+λ1(n)Dn2Σ^I0,mΣ^m,I0Σ^m,m+λ2(n))(2Σ^I0,I0+2λ2(n)+λ1(n)Dn002Σ^m,m+2λ2(n))1(Σ^I0,ϵΣ^m,ϵ)I0m\displaystyle\left\|\begin{pmatrix}\hat{\Sigma}_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D_{n}}{2}&\hat{\Sigma}_{I_{0},m}\\ \hat{\Sigma}_{m,I_{0}}&\hat{\Sigma}_{m,m}+{\lambda_{2}^{(n)}}\end{pmatrix}\begin{pmatrix}2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n}&0\\ 0&2\hat{\Sigma}_{m,m}+2{\lambda_{2}^{(n)}}\end{pmatrix}^{-1}\begin{pmatrix}\hat{\Sigma}_{I_{0},\epsilon}\\ \hat{\Sigma}_{m,\epsilon}\end{pmatrix}\right\|_{\mathcal{H}_{I_{0}\cup m}}
(Σ^I0,ϵΣ^m,ϵ)I0mOp(1/n).\displaystyle\leq\left\|\begin{pmatrix}\hat{\Sigma}_{I_{0},\epsilon}\\ \hat{\Sigma}_{m,\epsilon}\end{pmatrix}\right\|_{\mathcal{H}_{I_{0}\cup m}}\leq O_{p}(1/\sqrt{n}). (56)

Here the LHS of the above inequality equals

(Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1Σ^I0,ϵ+(Σ^m,m+λ2(n))(2Σ^m,m+2λ2(n))1Σ^m,ϵ)I0m.\displaystyle\left\|\begin{pmatrix}*\\ \hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}\hat{\Sigma}_{I_{0},\epsilon}+(\hat{\Sigma}_{m,m}+{\lambda_{2}^{(n)}})(2\hat{\Sigma}_{m,m}+2{\lambda_{2}^{(n)}})^{-1}\hat{\Sigma}_{m,\epsilon}\end{pmatrix}\right\|_{\mathcal{H}_{I_{0}\cup m}}.

Therefore we observe

Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1Σ^I0,ϵ+12Σ^m,ϵm=Op(1/n).\displaystyle\left\|\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}\hat{\Sigma}_{I_{0},\epsilon}+\frac{1}{2}\hat{\Sigma}_{m,\epsilon}\right\|_{\mathcal{H}_{m}}=O_{p}(1/\sqrt{n}).

Since Σ^m,ϵm=Op(1/n)\|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}=O_{p}(1/\sqrt{n}), we also have

Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1Σ^I0,ϵm=Op(1/n).\|\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}\hat{\Sigma}_{I_{0},\epsilon}\|_{\mathcal{H}_{m}}=O_{p}(1/\sqrt{n}).

This and Σ^m,ϵm=Op(1/n)\|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}=O_{p}(1/\sqrt{n}) yield

Bn,m=Op(1/n).\displaystyle B_{n,m}=O_{p}(1/\sqrt{n}). (57)

(ii) (Bounding En,m:=2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0mE_{n,m}:=\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}\|_{\mathcal{H}_{m}}) Note that, due to f~fp0\|\tilde{f}-f^{*}\|_{\mathcal{H}}\stackrel{{\scriptstyle p}}{{\to}}0, we have DnpDD_{n}\stackrel{{\scriptstyle p}}{{\to}}D, and we know that maxm,mΣ^m,mΣm,mm,m=Op(log(M)/n)=Op(1n)\max_{m,m^{\prime}}\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}=O_{p}(\sqrt{\log(M)/n})=O_{p}(\frac{1}{\sqrt{n}}) by Lemma 10. Thus Sn:=(2ΣI0,I02Σ^I0,I0)/λ1(n)+DDnS_{n}:=(2\Sigma_{I_{0},I_{0}}-2\hat{\Sigma}_{I_{0},I_{0}})/{\lambda_{1}^{(n)}}+D-D_{n} satisfies Sn=op(1)S_{n}=o_{p}(1), so that DSnD/2D-S_{n}\succeq D/2 with high probability. Hence

2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0\displaystyle 2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}
=\displaystyle= 2Σm,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0+Op(1n)\displaystyle 2\Sigma_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}+O_{p}\left(\frac{1}{\sqrt{n}}\right)
=\displaystyle= 2Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1(λ1(n)Dn+2λ2(n))fI0+\displaystyle 2\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}+
2Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1λ1(n)Sn(2ΣI0,I0+2λ2(n)+λ1(n)(DSn))1(λ1(n)Dn+2λ2(n))fI0\displaystyle 2\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}{\lambda_{1}^{(n)}}S_{n}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}(D-S_{n}))^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}
+Op(1n).\displaystyle+O_{p}\left(\frac{1}{\sqrt{n}}\right). (58)

Here we obtain

Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)12m,I02\displaystyle\|\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-\frac{1}{2}}\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}^{2}
=\displaystyle= Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1ΣI0,mm,m\displaystyle\|\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}\Sigma_{I_{0},m}\|_{\mathcal{H}_{m},\mathcal{H}_{m}}
\displaystyle\leq Σm,m12Vm,I0(2VI0,I0)1VI0,mΣm,m12m,m=Op(1),\displaystyle\|\Sigma_{m,m}^{\frac{1}{2}}V_{m,I_{0}}(2V_{I_{0},I_{0}})^{-1}V_{I_{0},m}\Sigma_{m,m}^{\frac{1}{2}}\|_{\mathcal{H}_{m},\mathcal{H}_{m}}=O_{p}(1), (59)

and, due to the fact that DSnD/2D-S_{n}\succeq D/2 with high probability, we have

(ΣI0,I0+λ2(n)+λ1(n)(DSn))12(λ1(n)Dn+2λ2(n))fI0I0\displaystyle\|(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}(D-S_{n}))^{-\frac{1}{2}}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}
=\displaystyle= (ΣI0,I0+λ2(n)+λ1(n)(DSn))12Diag(Σm,m12)(λ1(n)Dn+2λ2(n))gI0I0\displaystyle\|(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}(D-S_{n}))^{-\frac{1}{2}}\mathrm{Diag}(\Sigma_{m,m}^{\frac{1}{2}})({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})g^{*}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}
\displaystyle\leq Op(VI0,I01I0,I012(λ1(n)+λ2(n)))=Op(λ1(n)+λ2(n)).\displaystyle O_{p}(\|V_{I_{0},I_{0}}^{-1}\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}^{-\frac{1}{2}}({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}}))=O_{p}({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}}).

Therefore the second term in the RHS of Eq. (58) is evaluated as

Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1λ1(n)Sn(2ΣI0,I0+2λ2(n)+λ1(n)(DSn))1(λ1(n)Dn+2λ2(n))fI0m\displaystyle\|\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}{\lambda_{1}^{(n)}}S_{n}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}\!(D\!-\!S_{n}))^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}\|_{\mathcal{H}_{m}}
\displaystyle\leq Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)12m,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)12I0,I0λ1(n)SnI0,I0×\displaystyle\|\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-\frac{1}{2}}\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}\|(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-\frac{1}{2}}\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}{\lambda_{1}^{(n)}}\|S_{n}\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}\times
(2ΣI0,I0+2λ2(n)+λ1(n)(DSn))12I0,I0(ΣI0,I0+λ2(n)+λ1(n)(DSn))12(λ1(n)Dn+2λ2(n))fI0I0\displaystyle\|(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}\!(D\!-\!S_{n}))^{-\frac{1}{2}}\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}\|(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}\!(D\!-\!S_{n}))^{-\frac{1}{2}}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}
\displaystyle\leq Op(1(λ1(n)+λ2(n))12λ1(n)op(1)(λ1(n)+λ2(n))12(λ1(n)+λ2(n)))\displaystyle O_{p}(1\cdot({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{-\frac{1}{2}}\cdot{\lambda_{1}^{(n)}}o_{p}(1)\cdot({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{-\frac{1}{2}}\cdot({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}}))
=\displaystyle= op(λ1(n)).\displaystyle o_{p}({\lambda_{1}^{(n)}}).

Therefore this and Eq. (58) give

2Σ^m,I0(2Σ^I0,I0+2λ2(n)+λ1(n)Dn)1(λ1(n)Dn+2λ2(n))fI0\displaystyle 2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D_{n})^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}
=\displaystyle= 2Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1(λ1(n)Dn+2λ2(n))fI0+op(λ1(n))+Op(1n)\displaystyle 2\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}+o_{p}({\lambda_{1}^{(n)}})+O_{p}\left(\frac{1}{\sqrt{n}}\right)
=\displaystyle= 2Σm,I0(2ΣI0,I0+2λ2(n)+λ1(n)D)1(λ1(n)Dn+2λ2(n))fI0+op(λ1(n)).\displaystyle 2\Sigma_{m,I_{0}}(2\Sigma_{I_{0},I_{0}}+2{\lambda_{2}^{(n)}}+{\lambda_{1}^{(n)}}D)^{-1}({\lambda_{1}^{(n)}}D_{n}+2{\lambda_{2}^{(n)}})f^{*}_{I_{0}}+o_{p}({\lambda_{1}^{(n)}}).

Define

An:=Σm,I0(ΣI0,I0+λ2(n)+λ1(n)D2)1(Dn+2λ2(n)λ1(n))fI0,\displaystyle A_{n}:=\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-1}\left(D_{n}+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)f^{*}_{I_{0}},
A:=Σm,I0(ΣI0,I0+λ2(n))1(D+2λ2(n)λ1(n))fI0.\displaystyle A:=\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}})^{-1}\left(D+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)f^{*}_{I_{0}}.

We show that AnAm=op(1)\|A_{n}-A\|_{\mathcal{H}_{m}}=o_{p}(1). By definition, we have

AAn=\displaystyle A-A_{n}= Σm,I0(ΣI0,I0+λ2(n))1λ1(n)D2(ΣI0,I0+λ2(n)+λ1(n)D2)1(D+2λ2(n)λ1(n))fI0\displaystyle\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}})^{-1}\frac{{\lambda_{1}^{(n)}}D}{2}\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-1}\left(D+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)f^{*}_{I_{0}}
+Σm,I0(ΣI0,I0+λ2(n)+λ1(n)D2)1(DDn)fI0.\displaystyle+\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-1}\left(D-D_{n}\right)f^{*}_{I_{0}}. (60)

On the other hand, as in Eq. (56), we observe that

2\displaystyle 2\geq (ΣI0,I0ΣI0,mΣm,I0Σm,m)((ΣI0,I0+λ2(n))1000)I0m,I0m\displaystyle\left\|\begin{pmatrix}\Sigma_{I_{0},I_{0}}&\Sigma_{I_{0},m}\\ \Sigma_{m,I_{0}}&\Sigma_{m,m}\end{pmatrix}\begin{pmatrix}(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}})^{-1}&0\\ 0&0\end{pmatrix}\right\|_{\mathcal{H}_{I_{0}\cup m},\mathcal{H}_{I_{0}\cup m}}
=\displaystyle= (Σm,I0(ΣI0,I0+λ2(n))10)I0m,I0mΣm,I0(ΣI0,I0+λ2(n))1m,I0.\displaystyle\left\|\begin{pmatrix}*&*\\ \Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}})^{-1}&0\end{pmatrix}\right\|_{\mathcal{H}_{I_{0}\cup m},\mathcal{H}_{I_{0}\cup m}}\geq\|\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}})^{-1}\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}. (61)

Moreover, since fm=Σm,m12gmf^{*}_{m}=\Sigma_{m,m}^{\frac{1}{2}}g^{*}_{m} (m\forall m), we have

(ΣI0,I0+λ2(n)+λ1(n)D2)1(D+2λ2(n)λ1(n))fI0I0\displaystyle\left\|\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-1}\left(D+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}
=\displaystyle= (ΣI0,I0+λ2(n)+λ1(n)D2)1Diag(Σm,m12)(D+2λ2(n)λ1(n))gI0I0\displaystyle\left\|\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-1}\mathrm{Diag}(\Sigma_{m,m}^{\frac{1}{2}})\left(D+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)g^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}
\displaystyle\leq (ΣI0,I0+λ2(n)+λ1(n)D2)12I0,I0(ΣI0,I0+λ2(n)+λ1(n)D2)12Diag(Σm,m12)I0,I0\displaystyle\left\|\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-\frac{1}{2}}\right\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}\left\|\left(\Sigma_{I_{0},I_{0}}+{\lambda_{2}^{(n)}}+\frac{{\lambda_{1}^{(n)}}D}{2}\right)^{-\frac{1}{2}}\mathrm{Diag}(\Sigma_{m,m}^{\frac{1}{2}})\right\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}
×(D+2λ2(n)λ1(n))gI0I0\displaystyle\times\left\|\left(D+2\frac{{\lambda_{2}^{(n)}}}{{\lambda_{1}^{(n)}}}\right)g^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}
\displaystyle\leq Op((λ1(n)+λ2(n))12VI0,I012I0,I0)Op(λ1(n)12).\displaystyle O_{p}(({\lambda_{1}^{(n)}}+{\lambda_{2}^{(n)}})^{-\frac{1}{2}}\left\|V_{I_{0},I_{0}}^{-\frac{1}{2}}\right\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}})\leq O_{p}({\lambda_{1}^{(n)}}^{-\frac{1}{2}}). (62)

We can also bound the second term of (60) as

\displaystyle \left\|\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)}+\frac{\lambda_{1}^{(n)}D}{2}\right)^{-1}\left(D-D_{n}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}
\leq\left\|\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)}+\frac{\lambda_{1}^{(n)}D}{2}\right)^{-1}\right\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}\left\|\left(D-D_{n}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}
\leq\left\|\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)}\right)^{-1}\right\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}\left\|\left(D-D_{n}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}
\leq 2\left\|\left(D-D_{n}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{I_{0}}}\quad(\because\text{Eq.~(61)})
=o_{p}(1).

Therefore, applying the inequalities (61) and (62) to Eq. (60), we have

\displaystyle \|A_{n}-A\|_{\mathcal{H}_{m}}=O_{p}({\lambda_{1}^{(n)}}^{\frac{1}{2}})+o_{p}(1)=o_{p}(1). (63)

Hence we have $E_{n,m}=\lambda_{1}^{(n)}\|A\|_{\mathcal{H}_{m}}+o_{p}(\lambda_{1}^{(n)})$.

(iii) (Combining (i) and (ii)) Combining the evaluations in (i) and (ii), we have

\displaystyle \max_{m\in J_{0}}\left\|2\hat{\Sigma}_{m,I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon}\right\|_{\mathcal{H}_{m}}
=\max_{m\in J_{0}}\lambda_{1}^{(n)}\left\|\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-1}\left(D+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}+o_{p}(\lambda_{1}^{(n)})<\lambda_{1}^{(n)}(1-\eta)+o_{p}(\lambda_{1}^{(n)}).

This yields

P\left(\exists m\in J_{0}:\ \|2\hat{\Sigma}_{m,I_{0}}(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}\geq\lambda_{1}^{(n)}\right)\to 0.

Thus the probability that the condition (53) holds goes to 1. ∎
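The quantity controlling the zero blocks above is easy to evaluate numerically once the covariance operators are replaced by matrices. The following is a minimal finite-dimensional sketch under illustrative assumptions (each $\mathcal{H}_{m}$ replaced by $\mathbb{R}^{2}$, a random joint covariance, hand-picked $\lambda_{1}^{(n)},\lambda_{2}^{(n)}$); it is not the paper's operator setting, only the same condition value computed in $\mathbb{R}^{6}$.

```python
import numpy as np

# Finite-dimensional sketch of the condition value on an inactive block m:
#   || Sigma_{m,I0} (Sigma_{I0,I0} + lam2 I)^{-1} (D + 2 lam2/lam1 I) f*_{I0} ||.
# All sizes, distributions, and (lam1, lam2) are illustrative assumptions.
rng = np.random.default_rng(0)

d = 2
G = rng.standard_normal((3 * d, 3 * d))
Sigma = G @ G.T / (3 * d)                     # blocks 0, 1 active; block 2 inactive
I0, m = slice(0, 2 * d), slice(2 * d, 3 * d)

f_I0 = rng.standard_normal(2 * d)
block_norms = [np.linalg.norm(f_I0[b * d:(b + 1) * d]) for b in range(2)]
D = np.diag(np.repeat([1.0 / s for s in block_norms], d))   # D = Diag(1/||f*_b||)

def condition_value(lam1, lam2):
    rhs = (D + (2.0 * lam2 / lam1) * np.eye(2 * d)) @ f_I0
    w = np.linalg.solve(Sigma[I0, I0] + lam2 * np.eye(2 * d), rhs)
    return np.linalg.norm(Sigma[m, I0] @ w)

for lam2 in [0.0, 0.01, 0.1]:
    print(f"lam2 = {lam2:5.2f}:  condition value = {condition_value(0.05, lam2):.3f}")
```

Whether the printed value stays below one depends on the interplay between the cross-correlation $\Sigma_{m,I_{0}}$ and the ratio $\lambda_{2}^{(n)}/\lambda_{1}^{(n)}$, which is exactly the trade-off the elastic-net condition (16) quantifies.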

Proof: (Theorem 5) First we prove that $\lambda_{1}^{(n)}\sqrt{n}\to\infty$ is a necessary condition for $\hat{I}\stackrel{p}{\to}I_{0}$. Assume that $\liminf\lambda_{1}^{(n)}\sqrt{n}<\infty$. Then there is a sub-sequence along which $\lambda_{1}^{(n)}\sqrt{n}$ converges to a finite value; hence, by passing to this sub-sequence if necessary, we may assume $\lambda_{1}^{(n)}\sqrt{n}\to\mu_{1}$ for some finite $\mu_{1}$ without loss of generality. We will derive a contradiction under the conditions $\|\hat{f}-f^{*}\|_{\mathcal{H}}\stackrel{p}{\to}0$ and $\hat{I}\stackrel{p}{\to}I_{0}$. Suppose $\hat{I}=I_{0}$.

By the KKT condition,

\displaystyle 0=2(\hat{\Sigma}_{I_{0},I_{0}}\hat{f}_{I_{0}}-\hat{\Sigma}_{I_{0},\epsilon}-\hat{\Sigma}_{I_{0},I_{0}}f^{*}_{I_{0}})+\lambda_{1}^{(n)}D_{n}\hat{f}_{I_{0}}+2\lambda_{2}^{(n)}\hat{f}_{I_{0}}
\Rightarrow~~2(\hat{\Sigma}_{I_{0},I_{0}}+\lambda_{2}^{(n)})(f^{*}_{I_{0}}-\hat{f}_{I_{0}})=\lambda_{1}^{(n)}D_{n}f^{*}_{I_{0}}+2\lambda_{2}^{(n)}f^{*}_{I_{0}}-2\hat{\Sigma}_{I_{0},\epsilon} (64)
\Rightarrow~~2\sqrt{n}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})(f^{*}_{I_{0}}-\hat{f}_{I_{0}})=\sqrt{n}\lambda_{1}^{(n)}Df^{*}_{I_{0}}+2\sqrt{n}\lambda_{2}^{(n)}f^{*}_{I_{0}}-2\sqrt{n}\hat{\Sigma}_{I_{0},\epsilon}
\qquad+\left(2\sqrt{n}(\Sigma_{I_{0},I_{0}}-\hat{\Sigma}_{I_{0},I_{0}})(f^{*}_{I_{0}}-\hat{f}_{I_{0}})+\sqrt{n}\lambda_{1}^{(n)}(D_{n}-D)f^{*}_{I_{0}}\right)
\Rightarrow~~2\sqrt{n}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})(f^{*}_{I_{0}}-\hat{f}_{I_{0}})=\mu_{1}Df^{*}_{I_{0}}+2\sqrt{n}\lambda_{2}^{(n)}f^{*}_{I_{0}}-2\sqrt{n}\hat{\Sigma}_{I_{0},\epsilon}+o_{p}(1), (65)

where the last line follows from $\sqrt{n}\lambda_{1}^{(n)}\to\mu_{1}$, $\|D_{n}-D\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}=o_{p}(1)$, $\|\hat{f}-f^{*}\|_{\mathcal{H}}=o_{p}(1)$, and $\|\Sigma_{I_{0},I_{0}}-\hat{\Sigma}_{I_{0},I_{0}}\|_{\mathcal{H}_{I_{0}},\mathcal{H}_{I_{0}}}=o_{p}(1)$. Moreover, since the second line (64) indicates that $o_{p}(1)+o_{p}(\lambda_{2}^{(n)})=\lambda_{1}^{(n)}Df^{*}_{I_{0}}+2\lambda_{2}^{(n)}f^{*}_{I_{0}}+o_{p}(1)$, we have $\lambda_{1}^{(n)}=o_{p}(1)$ and $\lambda_{2}^{(n)}=o_{p}(1)$.
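As a side check, a stationarity condition of the form (64) can be verified numerically in a finite-dimensional surrogate: a group elastic-net least-squares problem solved by proximal gradient descent. The sketch below uses illustrative assumptions (random Gaussian design, three blocks of dimension two, hand-picked regularization parameters and solver); it is not the paper's estimator, only the same optimality system in $\mathbb{R}^{6}$.

```python
import numpy as np

# Finite-dimensional surrogate for the stationarity condition (64): a group
# elastic-net least-squares problem solved by proximal gradient descent.
# Design, block sizes, noise level, and (lam1, lam2) are illustrative choices.
rng = np.random.default_rng(6)
n, d, blocks = 200, 2, 3
X = rng.standard_normal((n, blocks * d))
f_star = np.concatenate([np.ones(d), -np.ones(d), np.zeros(d)])  # block 2 inactive
y = X @ f_star + 0.3 * rng.standard_normal(n)
lam1, lam2 = 0.1, 0.05

def prox(v, t):
    """Blockwise group soft-thresholding: the prox of t * lam1 * sum_b ||v_b||."""
    out = np.zeros_like(v)
    for b in range(blocks):
        vb = v[b * d:(b + 1) * d]
        nb = np.linalg.norm(vb)
        if nb > t * lam1:
            out[b * d:(b + 1) * d] = (1.0 - t * lam1 / nb) * vb
    return out

f = np.zeros(blocks * d)
step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2 / n + 2.0 * lam2)
for _ in range(20000):
    grad = 2.0 * X.T @ (X @ f - y) / n + 2.0 * lam2 * f   # smooth part gradient
    f = prox(f - step * grad, step)

grad = 2.0 * X.T @ (X @ f - y) / n + 2.0 * lam2 * f
for b in range(blocks):
    fb = f[b * d:(b + 1) * d]
    if np.linalg.norm(fb) > 1e-8:   # active block: grad_b + lam1 * f_b/||f_b|| = 0
        res = np.linalg.norm(grad[b * d:(b + 1) * d] + lam1 * fb / np.linalg.norm(fb))
        print(f"block {b}: active,   KKT residual = {res:.2e}")
    else:                           # inactive block: ||grad_b|| <= lam1, cf. (69)
        gb = np.linalg.norm(grad[b * d:(b + 1) * d])
        print(f"block {b}: inactive, ||grad_b|| = {gb:.3f} <= lam1 = {lam1}")
```

On active blocks the printed KKT residual is numerically zero, matching the form of (64); on the inactive block the gradient norm stays below $\lambda_{1}$, matching the form of (69).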

We now show that the KKT condition for $\hat{f}$ with $\hat{I}=I_{0}$ to be optimal with respect to $F_{n}$ is violated with strictly positive probability:

\displaystyle \liminf P\left(\exists m\in J_{0},~\|2(\hat{\Sigma}_{m,I_{0}}\hat{f}_{I_{0}}-\hat{\Sigma}_{m,I_{0}}f^{*}_{I_{0}}-\hat{\Sigma}_{m,\epsilon})\|_{\mathcal{H}_{m}}>\lambda_{1}^{(n)}\right)>0. (66)

Obviously this indicates that the probability of $\hat{I}=I_{0}$ does not converge to 1, which is a contradiction.

For all $v_{m}\in\mathcal{H}_{m}$ $(m\in J_{0})$, there exists $w_{I_{0}}\in\mathcal{H}_{I_{0}}$ such that

\Sigma_{I_{0},m}v_{m}=(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})w_{I_{0}}. (67)

Note that $w_{I_{0}}$ is uniformly bounded over all $\lambda_{2}^{(n)}\geq 0$. Indeed, the range of $\Sigma_{I_{0},m}$ is included in the range of $\Sigma_{I_{0},I_{0}}$ (Baker, 1973), so there exists $\tilde{w}_{I_{0}}$ (independent of $\lambda_{2}^{(n)}$) such that $\Sigma_{I_{0},m}v_{m}=\Sigma_{I_{0},I_{0}}\tilde{w}_{I_{0}}$; hence $\Sigma_{I_{0},I_{0}}\tilde{w}_{I_{0}}=(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})w_{I_{0}}$, and

\|w_{I_{0}}\|_{\mathcal{H}_{I_{0}}}\leq\sqrt{\langle\tilde{w}_{I_{0}},\Sigma_{I_{0},I_{0}}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-2}\Sigma_{I_{0},I_{0}}\tilde{w}_{I_{0}}\rangle_{\mathcal{H}_{I_{0}}}}\leq\|\tilde{w}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}

for $\lambda_{2}^{(n)}>0$, and $\|w_{I_{0}}\|_{\mathcal{H}_{I_{0}}}=\|\tilde{w}_{I_{0}}\|_{\mathcal{H}_{I_{0}}}$ for $\lambda_{2}^{(n)}=0$. Let $v_{m}\in\mathcal{H}_{m}$ be any non-zero element such that $\Sigma_{m,m}^{1/2}v_{m}\neq 0$, and let $w_{I_{0}}$ satisfy the equality (67); then

\displaystyle \sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}+\hat{\Sigma}_{m,I_{0}}f^{*}_{I_{0}}-\hat{\Sigma}_{m,I_{0}}\hat{f}_{I_{0}}\rangle_{\mathcal{H}_{m}}
=\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}+\langle v_{m},\hat{\Sigma}_{m,I_{0}}\sqrt{n}(f^{*}_{I_{0}}-\hat{f}_{I_{0}})\rangle_{\mathcal{H}_{m}}
=\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}+\langle v_{m},\Sigma_{m,I_{0}}\sqrt{n}(f^{*}_{I_{0}}-\hat{f}_{I_{0}})\rangle_{\mathcal{H}_{m}}+o_{p}(1)
=\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}+\langle w_{I_{0}},(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})\sqrt{n}(f^{*}_{I_{0}}-\hat{f}_{I_{0}})\rangle_{\mathcal{H}_{I_{0}}}+o_{p}(1)
=\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}-\sqrt{n}\langle w_{I_{0}},\hat{\Sigma}_{I_{0},\epsilon}\rangle_{\mathcal{H}_{I_{0}}}+\left\langle w_{I_{0}},\left(\frac{\mu_{1}}{2}D+\sqrt{n}\lambda_{2}^{(n)}\right)f^{*}_{I_{0}}\right\rangle_{\mathcal{H}_{I_{0}}}+o_{p}(1),

where we used $\|\hat{\Sigma}_{m,I_{0}}-\Sigma_{m,I_{0}}\|_{\mathcal{H}_{m},\mathcal{H}_{I_{0}}}=O_{p}(1/\sqrt{n})$ and $\|f^{*}-\hat{f}\|_{\mathcal{H}}\stackrel{p}{\to}0$ in the second equality, and the relation (65) in the last equality. We can show that $Z_{n}:=\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}-\sqrt{n}\langle w_{I_{0}},\hat{\Sigma}_{I_{0},\epsilon}\rangle_{\mathcal{H}_{I_{0}}}$ has a strictly positive variance as follows (see also Bach (2008)):

\displaystyle \mathrm{E}[Z_{n}]=0,
\displaystyle \mathrm{E}[Z_{n}^{2}]\geq\sigma^{2}\left(\langle v_{m},\Sigma_{m,m}v_{m}\rangle-2\langle v_{m},\Sigma_{m,I_{0}}w_{I_{0}}\rangle+\langle w_{I_{0}},\Sigma_{I_{0},I_{0}}w_{I_{0}}\rangle\right)
=\sigma^{2}\left(\langle v_{m},\Sigma_{m,m}v_{m}\rangle-\langle v_{m},\Sigma_{m,I_{0}}w_{I_{0}}\rangle+o_{p}(1)\right)\quad(\because\lambda_{2}^{(n)}=o_{p}(1))
=\sigma^{2}\langle\Sigma_{m,m}^{1/2}v_{m},(I_{\mathcal{H}_{m}}-V_{m,I_{0}}\tilde{V}^{-1}_{I_{0},I_{0}}V_{I_{0},m})\Sigma_{m,m}^{1/2}v_{m}\rangle+o_{p}(1),

where $\tilde{V}^{-1}_{I_{0},I_{0}}=\mathrm{Diag}(\Sigma_{m,m}^{1/2})(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-1}\mathrm{Diag}(\Sigma_{m,m}^{1/2})$ (note that $\tilde{V}_{I_{0},I_{0}}$ is invertible because $V_{I_{0},I_{0}}\preceq\tilde{V}_{I_{0},I_{0}}$ and $V_{I_{0},I_{0}}$ is invertible). Now since $V_{I_{0},I_{0}}\preceq\tilde{V}_{I_{0},I_{0}}$ and $I_{\mathcal{H}_{m}}-V_{m,I_{0}}V^{-1}_{I_{0},I_{0}}V_{I_{0},m}\succ O$ (this is because $V_{I_{0}\cup m,I_{0}\cup m}=\begin{pmatrix}V_{I_{0},I_{0}}&V_{I_{0},m}\\ V_{m,I_{0}}&I_{\mathcal{H}_{m}}\end{pmatrix}$ is invertible), we have $I_{\mathcal{H}_{m}}-V_{m,I_{0}}\tilde{V}^{-1}_{I_{0},I_{0}}V_{I_{0},m}\succ O$. Therefore, by the central limit theorem, $Z_{n}$ converges in distribution to a Gaussian random variable with strictly positive variance. Thus the probability of

2|\langle v_{m},\hat{\Sigma}_{m,\epsilon}+\hat{\Sigma}_{m,I_{0}}f^{*}_{I_{0}}-\hat{\Sigma}_{m,I_{0}}\hat{f}_{I_{0}}\rangle_{\mathcal{H}_{m}}|>\lambda_{1}^{(n)}\|v_{m}\|_{\mathcal{H}_{m}}

is asymptotically strictly positive because $\lambda_{1}^{(n)}\sqrt{n}\to\mu_{1}$ (note that this is true whether $\sqrt{n}\lambda_{2}^{(n)}$ converges to a finite value or not). This yields (66); i.e., $\hat{f}$ fails to satisfy $\hat{I}=I_{0}$ with asymptotically strictly positive probability.
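The role of the nondegenerate Gaussian limit of $Z_{n}$ can be illustrated with a small simulation of the (simplified, one-term) score $\sqrt{n}\langle v_{m},\hat{\Sigma}_{m,\epsilon}\rangle_{\mathcal{H}_{m}}$. The Gaussian kernel, the uniform design, and the choice $v_{m}=k_{m}(\cdot,0)$ in the sketch below are illustrative assumptions.

```python
import numpy as np

# Illustration of why Condition A (lam1 * sqrt(n) -> infinity) is needed:
# sqrt(n) <v_m, Sigma^_{m,eps}> has a nondegenerate Gaussian limit, so any
# threshold lam1 = O(1/sqrt(n)) is exceeded with positive probability.
rng = np.random.default_rng(1)

def score(n, sigma=1.0):
    x = rng.uniform(-1.0, 1.0, n)
    eps = sigma * rng.standard_normal(n)
    # By the reproducing property, <v_m, Sigma^_{m,eps}> = (1/n) sum_i v_m(x_i) eps_i
    v = np.exp(-x ** 2)          # v_m = k_m(., 0) for the Gaussian kernel
    return np.sqrt(n) * np.mean(v * eps)

draws = np.array([score(2000) for _ in range(2000)])
print(f"mean = {draws.mean():+.3f} (~ 0), std = {draws.std():.3f} (> 0)")
```

The empirical standard deviation stabilizes at a strictly positive value as $n$ grows, which is the nondegeneracy used in the contradiction above.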

We refer to the following condition as Condition A:

\text{Condition A}:~~~~\lambda_{1}^{(n)}\sqrt{n}\to\infty.

Now that we have proven that Condition A is necessary, we are ready to prove the assertion concerning the condition (16). Suppose the condition (16) is not satisfied for any sequences $\lambda_{1}^{(n)},\lambda_{2}^{(n)}\to 0$; that is, there exists a constant $\xi>0$ such that

\displaystyle \limsup_{n\to\infty}\left\|\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-1}\left(D+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)g^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}>(1+\xi)\quad(\exists m\in J_{0}), (68)

for any sequences $\lambda_{1}^{(n)},\lambda_{2}^{(n)}\to 0$ satisfying Condition A ($\lambda_{1}^{(n)}\sqrt{n}\to\infty$). Fix arbitrary sequences $\lambda_{1}^{(n)},\lambda_{2}^{(n)}\to 0$ satisfying Condition A. If $\hat{I}=I_{0}$, the KKT condition

\displaystyle \|2\hat{\Sigma}_{m,I_{0}}(\hat{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}\leq\lambda_{1}^{(n)}\quad(\forall m\in J_{0}), (69)
\displaystyle (2\hat{\Sigma}_{I_{0},I_{0}}+2\lambda_{2}^{(n)}+\lambda_{1}^{(n)}D_{n})(\tilde{f}_{I_{0}}-f^{*}_{I_{0}})+\lambda_{1}^{(n)}D_{n}f^{*}_{I_{0}}+2\lambda_{2}^{(n)}f^{*}_{I_{0}}-2\hat{\Sigma}_{I_{0},\epsilon}=0, (70)

should be satisfied (see (53) and (54)). We prove that the first inequality (69) of the KKT condition is violated with strictly positive probability under the assumptions and the condition (70). We have shown that (see (55))

\displaystyle {\lambda_{1}^{(n)}}^{-1}(2\hat{\Sigma}_{m,I_{0}}(\hat{f}_{I_{0}}-f^{*}_{I_{0}})-2\hat{\Sigma}_{m,\epsilon})
=2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2\lambda_{2}^{(n)}+\lambda_{1}^{(n)}D_{n})^{-1}\left(D_{n}+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)f^{*}_{I_{0}}
\quad-\frac{2}{\lambda_{1}^{(n)}}\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2\lambda_{2}^{(n)}+\lambda_{1}^{(n)}D_{n})^{-1}2\hat{\Sigma}_{I_{0},\epsilon}+\frac{2}{\lambda_{1}^{(n)}}\hat{\Sigma}_{m,\epsilon}. (71)

As shown in the proof of Theorem 1, the first term can be approximated by $\Sigma_{m,I_{0}}\left(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)}\right)^{-1}\left(D+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)f^{*}_{I_{0}}$; more precisely, Eq. (63) gives

\displaystyle \left\|\hat{\Sigma}_{m,I_{0}}\left(\hat{\Sigma}_{I_{0},I_{0}}+\lambda_{2}^{(n)}+\frac{\lambda_{1}^{(n)}D_{n}}{2}\right)^{-1}\left(D_{n}+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)f^{*}_{I_{0}}-\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-1}\left(D+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)g^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}
\stackrel{p}{\to}0.

Since $\liminf_{n}\left\|\Sigma_{m,I_{0}}(\Sigma_{I_{0},I_{0}}+\lambda_{2}^{(n)})^{-1}\left(D+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)g^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}>(1+\xi)$ by the assumption, we observe that

\displaystyle P\left(\left\|2\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2\lambda_{2}^{(n)}+\lambda_{1}^{(n)}D_{n})^{-1}\left(D_{n}+2\frac{\lambda_{2}^{(n)}}{\lambda_{1}^{(n)}}\right)f^{*}_{I_{0}}\right\|_{\mathcal{H}_{m}}>(1+\xi)\right)\not\to 0. (72)

Moreover, since $\lambda_{1}^{(n)}\sqrt{n}\to\infty$, we have already shown in the proof of Theorem 1 (Eq. (57)) that

\displaystyle \left\|-\frac{2}{\lambda_{1}^{(n)}}\hat{\Sigma}_{m,I_{0}}(2\hat{\Sigma}_{I_{0},I_{0}}+2\lambda_{2}^{(n)}+\lambda_{1}^{(n)}D_{n})^{-1}2\hat{\Sigma}_{I_{0},\epsilon}+\frac{2}{\lambda_{1}^{(n)}}\hat{\Sigma}_{m,\epsilon}\right\|_{\mathcal{H}_{m}}=O_{p}(1/(\lambda_{1}^{(n)}\sqrt{n}))=o_{p}(1). (73)

Therefore, combining (71), (72), and (73), we conclude that the KKT condition (53) is violated with strictly positive probability if the condition (68) is satisfied. This shows that the irrepresentable condition (16) is necessary for the support consistency of elastic-net MKL. ∎

Lemma 10

If $\sup_{X}k_{m}(X,X)\leq 1$ and $\sup_{X}k_{m^{\prime}}(X,X)\leq 1$, then

\displaystyle P\left(\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}\geq\mathrm{E}[\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}]+\varepsilon\right)\leq\exp(-n\varepsilon^{2}/2). (74)

In particular,

P\left(\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}\geq\frac{1}{\sqrt{n}}+\varepsilon\right)\leq\exp(-n\varepsilon^{2}/2). (75)

Proof: We use McDiarmid’s inequality (Devroye et al., 1996). By definition

\langle g,\hat{\Sigma}_{m,m^{\prime}}f\rangle_{\mathcal{H}_{m}}=\frac{1}{n}\sum_{i=1}^{n}\left\langle g,k_{m}(\cdot,x_{i})\right\rangle_{\mathcal{H}_{m}}\left\langle f,k_{m^{\prime}}(\cdot,x_{i})\right\rangle_{\mathcal{H}_{m^{\prime}}}.

We denote by $\tilde{\Sigma}_{m,m^{\prime}}$ the empirical cross-covariance operator computed from the $n$ samples $(x_{1},\dots,x_{j-1},\tilde{x}_{j},x_{j+1},\dots,x_{n})$, in which the $j$-th sample $x_{j}$ is replaced by $\tilde{x}_{j}$, an independent copy drawn from the same distribution as $x_{j}$.

By the triangle inequality, we have

\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}-\|\tilde{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}\leq\|\hat{\Sigma}_{m,m^{\prime}}-\tilde{\Sigma}_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}.

Now the RHS can be evaluated as follows:

\displaystyle \|\hat{\Sigma}_{m,m^{\prime}}-\tilde{\Sigma}_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}
=\left\|\frac{1}{n}(k_{m}(\cdot,x_{j})k_{m^{\prime}}(x_{j},\cdot)-k_{m}(\cdot,\tilde{x}_{j})k_{m^{\prime}}(\tilde{x}_{j},\cdot))\right\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}. (76)

The RHS of (76) can be further evaluated as

\displaystyle \left\|\frac{1}{n}(k_{m}(\cdot,x_{j})k_{m^{\prime}}(x_{j},\cdot)-k_{m}(\cdot,\tilde{x}_{j})k_{m^{\prime}}(\tilde{x}_{j},\cdot))\right\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}
\leq\frac{1}{n}\left(\|k_{m}(\cdot,x_{j})k_{m^{\prime}}(x_{j},\cdot)\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}+\|k_{m}(\cdot,\tilde{x}_{j})k_{m^{\prime}}(\tilde{x}_{j},\cdot)\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}\right)
\leq\frac{1}{n}\left(\|k_{m}(\cdot,x_{j})\|_{\mathcal{H}_{m}}\|k_{m^{\prime}}(x_{j},\cdot)\|_{\mathcal{H}_{m^{\prime}}}+\|k_{m}(\cdot,\tilde{x}_{j})\|_{\mathcal{H}_{m}}\|k_{m^{\prime}}(\tilde{x}_{j},\cdot)\|_{\mathcal{H}_{m^{\prime}}}\right)
\leq\frac{1}{n}\left(\sqrt{k_{m}(x_{j},x_{j})k_{m^{\prime}}(x_{j},x_{j})}+\sqrt{k_{m}(\tilde{x}_{j},\tilde{x}_{j})k_{m^{\prime}}(\tilde{x}_{j},\tilde{x}_{j})}\right)
\leq\frac{2}{n}, (77)

where we used $\|k_{m}(\cdot,x_{j})\|_{\mathcal{H}_{m}}=\sqrt{\langle k_{m}(\cdot,x_{j}),k_{m}(\cdot,x_{j})\rangle_{\mathcal{H}_{m}}}=\sqrt{k_{m}(x_{j},x_{j})}$. Bounding the norm in (76) by (77), we have

\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}-\|\tilde{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}\leq\frac{2}{n}.

By symmetry, interchanging $\hat{\Sigma}$ and $\tilde{\Sigma}$ gives

|\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}-\|\tilde{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}|\leq\frac{2}{n}.

Therefore by McDiarmid’s inequality we obtain

\displaystyle P\left(\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}-\mathrm{E}[\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}]\geq\varepsilon\right)
\leq\exp\left(-\frac{2\varepsilon^{2}}{n(2/n)^{2}}\right)=\exp\left(-\frac{n\varepsilon^{2}}{2}\right).

This gives the first assertion Eq. (74).

To show the second assertion (Eq. (75)), first we note that

\displaystyle \mathrm{E}[\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}]\leq\sqrt{\mathrm{E}[\|\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}}\|_{\mathcal{H}_{m},\mathcal{H}_{m^{\prime}}}^{2}]}
=\sqrt{\mathrm{E}[\|(\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}})(\hat{\Sigma}_{m^{\prime},m}-\Sigma_{m^{\prime},m})\|_{\mathcal{H}_{m},\mathcal{H}_{m}}]}
\leq\sqrt{\mathrm{E}[\|(\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}})(\hat{\Sigma}_{m^{\prime},m}-\Sigma_{m^{\prime},m})\|_{\mathrm{tr}}]}, (78)

where $\|\cdot\|_{\mathrm{tr}}$ is the trace norm and the last inequality holds because the operator norm is bounded by the trace norm. As in Lemma 1 of Gretton et al. (2005), we see that

\displaystyle \|(\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}})(\hat{\Sigma}_{m^{\prime},m}-\Sigma_{m^{\prime},m})\|_{\mathrm{tr}}
=\frac{1}{n^{2}}\sum_{i,j=1}^{n}\mathrm{Tr}[k_{m}(\cdot,x_{i})k_{m^{\prime}}(x_{i},x_{j})k_{m}(x_{j},\cdot)]
\quad-\frac{2}{n}\sum_{i=1}^{n}\mathrm{E}_{X}[\mathrm{Tr}[k_{m}(\cdot,x_{i})k_{m^{\prime}}(x_{i},X)k_{m}(X,\cdot)]]+\mathrm{E}_{X,X^{\prime}}[\mathrm{Tr}[k_{m}(\cdot,X)k_{m^{\prime}}(X,X^{\prime})k_{m}(X^{\prime},\cdot)]]
=\frac{1}{n^{2}}\sum_{i,j=1}^{n}k_{m}(x_{j},x_{i})k_{m^{\prime}}(x_{i},x_{j})-\frac{2}{n}\sum_{i=1}^{n}\mathrm{E}_{X}[k_{m}(X,x_{i})k_{m^{\prime}}(x_{i},X)]+\mathrm{E}_{X,X^{\prime}}[k_{m}(X^{\prime},X)k_{m^{\prime}}(X,X^{\prime})],

where $X$ and $X^{\prime}$ are independent random variables distributed according to $\Pi$. Thus

\displaystyle \mathrm{E}[\|(\hat{\Sigma}_{m,m^{\prime}}-\Sigma_{m,m^{\prime}})(\hat{\Sigma}_{m^{\prime},m}-\Sigma_{m^{\prime},m})\|_{\mathrm{tr}}]
=\frac{n}{n^{2}}\mathrm{E}_{X}[k_{m}(X,X)k_{m^{\prime}}(X,X)]+\frac{n(n-1)}{n^{2}}\mathrm{E}_{X,X^{\prime}}[k_{m}(X^{\prime},X)k_{m^{\prime}}(X,X^{\prime})]
\quad-2\mathrm{E}_{X,X^{\prime}}[k_{m}(X^{\prime},X)k_{m^{\prime}}(X,X^{\prime})]+\mathrm{E}_{X,X^{\prime}}[k_{m}(X^{\prime},X)k_{m^{\prime}}(X,X^{\prime})]
=\frac{1}{n}\mathrm{E}_{X}[k_{m}(X,X)k_{m^{\prime}}(X,X)]-\frac{1}{n}\mathrm{E}_{X,X^{\prime}}[k_{m}(X^{\prime},X)k_{m^{\prime}}(X,X^{\prime})]\leq\frac{1}{n}.

This, together with Eq. (78) and the first assertion (Eq. (74)), gives the second assertion. ∎
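The $1/\sqrt{n}$ rate of Lemma 10 can be checked by Monte Carlo using the trace identity just derived, since every term reduces to kernel evaluations. In the sketch below, the Gaussian and Cauchy kernels and the standard normal design are illustrative assumptions, and the population expectations are approximated by a large independent sample.

```python
import numpy as np

# Monte Carlo check of Lemma 10: sqrt(n) * ||Sigma^ - Sigma||_HS should stay
# bounded as n grows, where the squared Hilbert-Schmidt norm is the trace
# Tr[(Sigma^-Sigma)(Sigma^-Sigma)*] expanded above into kernel evaluations.
rng = np.random.default_rng(2)
km = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2)        # Gaussian kernel
kmp = lambda a, b: 1.0 / (1.0 + (a[:, None] - b[None, :]) ** 2)  # Cauchy kernel

for n in [100, 400, 1600]:
    x = rng.standard_normal(n)
    emp = (km(x, x) * kmp(x, x)).sum() / n ** 2       # (1/n^2) sum_ij k_m k_m'
    y, z = rng.standard_normal(2000), rng.standard_normal(2000)
    cross = np.mean(km(x, y) * kmp(x, y))             # ~ (1/n) sum_i E_X[k_m k_m']
    pop = np.mean(km(y, z) * kmp(y, z))               # ~ E_{X,X'}[k_m k_m']
    dev2 = max(emp - 2 * cross + pop, 0.0)            # ~ ||Sigma^ - Sigma||_HS^2
    print(f"n = {n:4d}:  sqrt(n) * ||Sigma^-Sigma||_HS ~ {np.sqrt(n * dev2):.3f}")
```

The printed values hover around a constant rather than growing, consistent with Eq. (75).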

Lemma 11

If $\mathrm{E}[\epsilon^{2}|X]\leq\sigma^{2}$ almost surely and $\sup_{X}k_{m}(X,X)\leq 1$, then we have

\displaystyle \|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}=O_{p}(\sigma/\sqrt{n}). (79)

Proof: By definition, we have

\displaystyle \mathrm{E}[\|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}]\leq\sqrt{\mathrm{E}[\|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}^{2}]}
=\sqrt{\mathrm{E}\left[\frac{1}{n^{2}}\sum_{i,j=1}^{n}k_{m}(x_{i},x_{j})\epsilon_{i}\epsilon_{j}\right]}
\leq\sqrt{\frac{\sigma^{2}}{n}}.

Applying Markov’s inequality, we obtain the assertion. ∎
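The bound of Lemma 11 is also easy to check numerically, because $\|\hat{\Sigma}_{m,\epsilon}\|_{\mathcal{H}_{m}}^{2}=\frac{1}{n^{2}}\sum_{i,j}k_{m}(x_{i},x_{j})\epsilon_{i}\epsilon_{j}$ is a finite quadratic form in the noise. A minimal sketch, with a Gaussian kernel and uniform design as illustrative assumptions:

```python
import numpy as np

# Numerical check of Lemma 11: ||Sigma^_{m,eps}||^2 = (1/n^2) eps^T K eps with
# Gram matrix K, so sqrt(n)/sigma * ||Sigma^_{m,eps}|| should stay O(1).
rng = np.random.default_rng(3)
sigma = 0.5

for n in [200, 800, 3200]:
    x = rng.uniform(-1.0, 1.0, n)
    eps = sigma * rng.standard_normal(n)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2)     # Gaussian kernel Gram matrix
    norm = np.sqrt(eps @ K @ eps) / n               # ||Sigma^_{m,eps}||_{H_m}
    print(f"n = {n:4d}:  sqrt(n)/sigma * norm = {np.sqrt(n) * norm / sigma:.3f}")
```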

Proposition 1 (Bernstein’s inequality in Hilbert spaces)

Let $(\Omega,\mathcal{A},P)$ be a probability space, $\mathcal{H}$ a separable Hilbert space, $B>0$, and $\sigma>0$. Furthermore, let $\xi_{1},\dots,\xi_{n}:\Omega\to\mathcal{H}$ be independent random variables satisfying $\mathrm{E}[\xi_{i}]=0$, $\|\xi_{i}\|_{\mathcal{H}}\leq B$, and $\mathrm{E}[\|\xi_{i}\|_{\mathcal{H}}^{2}]\leq\sigma^{2}$ for all $i=1,\dots,n$. Then we have

\displaystyle P\left(\left\|\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\right\|_{\mathcal{H}}\geq\sqrt{\frac{2\sigma^{2}\tau}{n}}+\sqrt{\frac{\sigma^{2}}{n}}+\frac{2B\tau}{3n}\right)\leq e^{-\tau}\quad(\tau>0).

Proof: See Theorem 6.14 of Steinwart (2008). ∎
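For completeness, the tail bound of Proposition 1 can be sanity-checked by simulation in $\mathcal{H}=\mathbb{R}^{d}$; the uniform distribution below is an illustrative assumption satisfying the moment conditions with $B=\sqrt{d}$ and $\sigma^{2}=d/3$.

```python
import numpy as np

# Monte Carlo sanity check of the Hilbert-space Bernstein inequality
# (Proposition 1) for i.i.d. mean-zero vectors xi_i ~ Uniform([-1,1]^d).
rng = np.random.default_rng(4)
n, d, trials, tau = 500, 5, 2000, 2.0

xi = rng.uniform(-1.0, 1.0, (trials, n, d))   # E[xi] = 0, ||xi|| <= sqrt(d)
B, sigma2 = np.sqrt(d), d / 3.0               # E||xi||^2 = d * Var(U[-1,1]) = d/3
thresh = np.sqrt(2 * sigma2 * tau / n) + np.sqrt(sigma2 / n) + 2 * B * tau / (3 * n)

norms = np.linalg.norm(xi.mean(axis=1), axis=1)   # ||(1/n) sum_i xi_i|| per trial
print(f"empirical tail = {(norms >= thresh).mean():.4f}  <=  e^-tau = {np.exp(-tau):.4f}")
```

The empirical tail probability falls well below $e^{-\tau}$, as the proposition guarantees.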

References

  • Bach et al. (2004) F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In the 21st International Conference on Machine Learning, pages 41–48, 2004.
  • Bach (2008) F. R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
  • Baker (1973) C. R. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.
  • Bartlett et al. (2005) P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33:1487–1537, 2005.
  • Bickel et al. (2009) P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
  • Bousquet (2002) O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical process. C. R. Acad. Sci. Paris Ser. I Math., 334:495–500, 2002.
  • Caponnetto and de Vito (2007) A. Caponnetto and E. de Vito. Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
  • Cortes (2009) C. Cortes. Can learning kernels help performance? Invited talk at the International Conference on Machine Learning (ICML 2009), Montréal, Canada, 2009.
  • Cortes et al. (2009) C. Cortes, M. Mohri, and A. Rostamizadeh. $L_{2}$ regularization for learning kernels. In the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), Montréal, Canada, 2009.
  • Devroye et al. (1996) L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
  • Gretton et al. (2005) A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Algorithmic Learning Theory, Lecture Notes in Artificial Intelligence, pages 63–77, Berlin, 2005. Springer-Verlag.
  • Jia and Yu (2010) J. Jia and B. Yu. On model selection consistency of the elastic net when p \gg n. Statistica Sinica, 20(2):to appear, 2010.
  • Kloft et al. (2009) M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. Müller, and A. Zien. Efficient and accurate $\ell_{p}$-norm multiple kernel learning. In Advances in Neural Information Processing Systems 22, pages 997–1005, Cambridge, MA, 2009. MIT Press.
  • Koltchinskii (2006) V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34:2593–2656, 2006.
  • Koltchinskii and Yuan (2008) V. Koltchinskii and M. Yuan. Sparse recovery in large ensembles of kernel machines. In Proceedings of the Annual Conference on Learning Theory, pages 229–238, 2008.
  • Lanckriet et al. (2004) G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5:27–72, 2004.
  • Ledoux and Talagrand (1991) M. Ledoux and M. Talagrand. Probability in Banach Spaces. Isoperimetry and Processes. Springer, New York, 1991. MR1102015.
  • Lin and Zhang (2006) Y. Lin and H. H. Zhang. Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34(5):2272–2297, 2006.
  • Meier et al. (2009) L. Meier, S. van de Geer, and P. Bühlmann. High-dimensional additive modeling. The Annals of Statistics, 37(6B):3779–3821, 2009.
  • Mendelson (2002) S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48:1977–1991, 2002.
  • Micchelli and Pontil (2005) C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099–1125, 2005.
  • Rakotomamonjy et al. (2008) A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.
  • Sonnenburg et al. (2006) S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
  • Steinwart (2008) I. Steinwart. Support Vector Machines. Springer, 2008.
  • Steinwart et al. (2009) I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In Proceedings of the Annual Conference on Learning Theory, pages 79–93, 2009.
  • Stone (1974) M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111–147, 1974.
  • Suzuki and Tomioka (2009) T. Suzuki and R. Tomioka. SpicyMKL, 2009. arXiv:0909.5026.
  • Talagrand (1996a) M. Talagrand. A new look at independence. The Annals of Statistics, 24:1–34, 1996a.
  • Talagrand (1996b) M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505–563, 1996b.
  • Tomioka and Suzuki (2010) R. Tomioka and T. Suzuki. Sparsity-accuracy trade-off in MKL, 2010. arXiv:1001.2615.
  • van de Geer (2000) S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
  • van der Vaart and Wellner (1996) A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.
  • Vapnik (1998) V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
  • Yuan and Lin (2007) M. Yuan and Y. Lin. On the nonnegative garrote estimator. Journal of the Royal Statistical Society B, 69(2):143–161, 2007.
  • Zhang (2009) T. Zhang. Some sharp performance bounds for least squares regression with $l_{1}$ regularization. The Annals of Statistics, 37(5):2109–2144, 2009.
  • Zhao and Yu (2006) P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
  • Zou and Hastie (2005) H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.
  • Zou and Zhang (2009) H. Zou and H. H. Zhang. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4):1733–1751, 2009.