Appendix A Proofs of the theorems
For a function on , we define
and .
For a function ,
we define as
and for we write .
Similarly
we define as
for
and for we write .
We write as .
Lemma 6
For all , we have
[display equation (18)]
Proof:
For , we have
[display equation (19)]
where we used Schwarz’s inequality in the last line.
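For reference, the Schwarz (Cauchy–Schwarz) inequality used in the last line, written with generic nonnegative coefficients $a_m, b_m$ (our notation, standing in for the relevant norms), reads
\[
\sum_{m=1}^{M} a_m b_m \;\le\; \Big( \sum_{m=1}^{M} a_m^2 \Big)^{1/2} \Big( \sum_{m=1}^{M} b_m^2 \Big)^{1/2} .
\]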
The following lemma gives an upper bound of
that holds with high probability. This is an extension of Theorem 1 of Koltchinskii and Yuan (2008).
The proof is given in Appendix B.
Lemma 7
There exists a constant depending only on in (A1)
such that,
if ,
we have, for , with probability ,
[display equation]
Moreover, if and ,
we have
[display equation]
The following lemma gives a basic inequality that
serves as the starting point for the analyses that follow.
The proof is given in Appendix B.
Lemma 8
Suppose
where is the constant appearing in Lemma 7.
Then there exist constants and depending only on in (A1), in (A4), in , in such that
for all , and for all ,
with probability at least ,
[display equation (20)]
where
,
and
The above lemma is derived via the peeling device, also known as the localization method.
Details of these techniques can be found in, for example, Bartlett et al. (2005), Koltchinskii (2006), Mendelson (2002), and van de Geer (2000).
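As a generic illustration of the peeling device (the symbols $Z$, $\mathcal{F}$, $r$, and $\psi$ below are ours, not the paper's): to obtain a bound that holds uniformly over function norms, one slices the class into dyadic shells, applies a fixed-radius concentration bound on each shell, and takes a union bound. For a nondecreasing $\psi$,
\[
P\Big( \exists f \in \mathcal{F},\ \|f\| \ge r :\; Z(f) \ge \psi(\|f\|) \Big)
\;\le\; \sum_{j=0}^{\infty} P\Big( \sup_{f \in \mathcal{F}:\, 2^{j} r \le \|f\| < 2^{j+1} r} Z(f) \;\ge\; \psi(2^{j} r) \Big) .
\]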
Proof:
(Theorem 1)
Since ,
we can assume that the inequality (20) is satisfied
with .
For notational simplicity, we suppose denotes in this proof.
In addition, since , we have (with probability ) by Lemma 7.
Note that for all ,
and
by taking sufficiently large. Therefore by the inequality (20), we have
[display equation (21)]
where is (here we omitted the term for simplicity;
one can show that this term is negligible).
By Hölder’s inequality, the first term in the RHS of the above inequality can be bounded as
[display equation]
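The generic form of Hölder's inequality applied here, for conjugate exponents $p, q \ge 1$ with $1/p + 1/q = 1$ and nonnegative $a_m, b_m$ (our placeholder symbols), is
\[
\sum_{m=1}^{M} a_m b_m \;\le\; \Big( \sum_{m=1}^{M} a_m^{p} \Big)^{1/p} \Big( \sum_{m=1}^{M} b_m^{q} \Big)^{1/q} .
\]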
Applying Young’s inequality, the last term
can be bounded by
[display equation (22)]
where denotes a constant that is independent of and and whose value may change from line to line, and
we used Lemma 6 in the last line.
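Young's inequality, in the form used above: for $a, b \ge 0$ and conjugate exponents $p, q > 1$ with $1/p + 1/q = 1$,
\[
ab \;\le\; \frac{a^p}{p} + \frac{b^q}{q},
\qquad\text{and, for any } \epsilon > 0,\qquad
ab \;\le\; \frac{(\epsilon a)^p}{p} + \frac{(b/\epsilon)^q}{q},
\]
the second version being how a free trade-off parameter, such as the arbitrary positive constant appearing in (25) below, is introduced.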
Similarly, by the inequality of arithmetic and geometric means, we obtain a bound as
[display equation (23)]
where we used Lemma 6 in the last line.
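The inequality of arithmetic and geometric means invoked here is, in its two-term form (with $a, b \ge 0$),
\[
\sqrt{ab} \;\le\; \frac{a + b}{2},
\qquad\text{equivalently}\qquad
2ab \;\le\; a^2 + b^2 \quad (\text{replace } a, b \text{ by } a^2, b^2),
\]
which splits a cross term into two squared terms.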
By substituting
(22) and (23) into (21),
we have
[display equation (24)]
The minimum of the RHS with respect to under the constraint is achieved
by up to constants. Thus we have the first assertion (7).
Next we show the second assertion (8). By Hölder’s inequality and Young’s inequality, we have
[display equation (25)]
where is an arbitrary positive real.
By substituting
(25) and (23) into (21),
we have
[display equation]
This is minimized by ,
, and . Thus we obtain the assertion.
Proof:
(Theorem 2)
Let and .
By the assumption (A7), we have
Therefore Lemma 8 gives
[display equation (26)]
if and .
The second term can be upper bounded as
[display equation]
We will see that we may assume . Thus
the second term in the RHS of the above inequality can be upper bounded as
[display equation (27)]
Moreover Lemma 7 gives
and
Therefore (26) becomes
[display equation]
As in the proof of Theorem 1 (using the relations (23) and (22)),
we have
[display equation]
Now using the assumption ,
we have
[display equation (28)]
Recall that .
Since ,
Lemma 7 gives with probability for some constant .
Therefore .
The values of presented in the statement are obtained by minimizing
the RHS of Eq. (28) under the constraints
and .
i) Suppose , i.e., .
Then the RHS of the above inequality
can be minimized by
,
,
and up to constants independent of ,
where the leading terms are
.
It should be noted that is greater than
because ;
therefore (26) is valid.
Using , we can show that by setting the constant sufficiently large;
hence (27) is valid.
Moreover, since , we can take as
.
ii) Suppose . Then the RHS of the above inequality
can be minimized by
, , and up to constants independent of ,
where the leading terms are
.
Since , (26) is valid.
Using , we can show that by setting the constant sufficiently large;
hence (27) is valid.
Moreover, since and ,
we can show that .
iii) Suppose .
We take .
Then the RHS of the inequality (28) is minimized by
and up to constants, where the leading terms are
.
Note that since , (26) is valid.
Using , we can show that by setting the constant sufficiently large;
hence (27) is valid.
Moreover, since and ,
we have .
In all settings i) to iii), we can show that .
Thus the terms involving are upper bounded as
.
Through a simple calculation, is evaluated as
i) ,
ii) , and
iii) ,
respectively.
Thus we obtain the assertion.
Proof:
(Convergence rate of block- MKL)
Note that since , we have .
Therefore Lemma 7 gives
with probability .
Thus .
When and ,
as in Lemma 8 we have
with probability at least
[display equation (29)]
for all .
We lower bound the term in the LHS of the above inequality (29).
There exists , depending only on , such that
[display equation (30)]
for all such that and .
Recall that ; then we have
.
Since hold with probability ,
[display equation]
with probability .
Therefore by the inequality (29),
we have with probability at least
[display equation (31)]
for all .
Thus, using Young's inequality,
[display equation]
The RHS is minimized by
and (up to constants independent of ).
Note that since the optimal obtained above satisfies by taking sufficiently large,
the inequality (31) is valid.
Moreover the condition in the statement
ensures .
Finally we evaluate the terms including , that is, .
We can check that .
Therefore those terms are upper bounded as
.
Thus we obtain the assertion.
Proof:
(Convergence rate for block- MKL)
When ,
substituting to in Lemma 8,
and using Young’s inequality,
as in the proof of Theorem 2, the convergence rate of block- MKL can be evaluated as
[display equation (32)]
with probability (note that since , we do not need the condition ).
gives the minimum of the RHS with respect to up to constants.
Using , we can show that by setting the constant
sufficiently large;
hence (27) is valid.
Appendix B Proof of Lemmas 7 and 8
Proof:
(Lemma 7)
Since minimizes the empirical risk (1),
we have
[display equation (33)]
By Proposition 1 (Bernstein's inequality in Hilbert spaces; see also Theorem 6.14 of Steinwart (2008)),
there exists a universal constant such that we have
[display equation (34)]
for all with probability at least , where we used the assumption .
If , then
we have
[display equation (35)]
with probability at least .
Set ; then, by Young's inequality
and Jensen's inequality,
the LHS of the above inequality (33) is lower bounded by
[display equation (36)]
Therefore we have the first assertion by setting .
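Jensen's inequality, used in the lower bound (36) above, states that for a convex function $\varphi$ and an integrable random variable $X$ (generic symbols),
\[
\varphi\big( \mathbb{E}[X] \big) \;\le\; \mathbb{E}\big[ \varphi(X) \big],
\]
with the inequality reversed for concave $\varphi$.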
The second assertion can be shown as follows:
by the inequality (33) we have
[display equation (37)]
with probability at least ,
where we used (34),
and in the last inequality.
Proof:
(Lemma 8)
In what follows, we assume
where
(the probability of this event is greater than by Lemma 7).
Since minimizes the empirical risk, we have
[display equation (38)]
The second term in the RHS of the above inequality (38) can be bounded from above
as
[display equation (39)]
where we used for .
We also have
[display equation (40)]
Substituting
(39)
and (40)
into (38),
we obtain
[display equation (41)]
Finally we evaluate the first term
in the RHS of the above inequality (41)
by applying Talagrand’s concentration inequality (Talagrand, 1996a, b, Bousquet, 2002).
First we decompose as
[display equation]
and bound each term in the summation.
Here suppose satisfies for a constant .
Since ,
we have
[display equation (42a)]
[display equation (42b)]
for all .
Let where is
a Rademacher random variable, and
be
[display equation]
Then, by the spectral assumption (A5) (equivalently, the covering number condition), one can show that
[display equation]
where is a constant that depends on and (Mendelson, 2002).
Let .
Now, by the Rademacher contraction inequality (Ledoux and Talagrand, 1991, Theorem 4.12),
for given and we have
[display equation (43)]
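A commonly quoted form of the contraction inequality cited here (Ledoux and Talagrand, 1991, Theorem 4.12) is the following, stated with generic symbols: if $\phi_i \colon \mathbb{R} \to \mathbb{R}$ are $1$-Lipschitz functions with $\phi_i(0) = 0$, $\sigma_1, \dots, \sigma_n$ are i.i.d. Rademacher variables, and $T \subset \mathbb{R}^n$, then
\[
\mathbb{E}\Big[ \sup_{t \in T} \Big| \sum_{i=1}^{n} \sigma_i \phi_i(t_i) \Big| \Big]
\;\le\; 2\, \mathbb{E}\Big[ \sup_{t \in T} \Big| \sum_{i=1}^{n} \sigma_i t_i \Big| \Big] .
\]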
Therefore by the symmetrization argument (van der Vaart and Wellner, 1996), we have
[display equation (44)]
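The symmetrization step (van der Vaart and Wellner, 1996) bounds the expected supremum of the empirical process by its Rademacher-symmetrized counterpart: for i.i.d. observations $x_1, \dots, x_n$ and i.i.d. Rademacher variables $\sigma_i$ independent of the data (generic notation),
\[
\mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \big( f(x_i) - \mathbb{E}[f(x_1)] \big) \Big| \Big]
\;\le\; 2\, \mathbb{E}\Big[ \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \Big| \Big] .
\]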
By Talagrand’s concentration inequality with
(42) and (44),
for given
with probability at least , we have
[display equation (45)]
where we used the relation (42).
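For reference, Bousquet's (2002) version of Talagrand's concentration inequality, stated here with our generic symbols and with the constants of that reference (the paper's normalization may differ): let $Z = \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \big( \mathbb{E}[f(x_1)] - f(x_i) \big)$, where every $f \in \mathcal{F}$ satisfies $\|f\|_\infty \le b$ and $\mathrm{Var}[f(x_1)] \le \sigma^2$. Then, for every $\tau > 0$, with probability at least $1 - e^{-\tau}$,
\[
Z \;\le\; \mathbb{E}[Z] + \sqrt{ \frac{2 \tau \big( \sigma^2 + 2 b\, \mathbb{E}[Z] \big)}{n} } + \frac{b \tau}{3 n} .
\]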
Our next goal is to derive a uniform version of the above inequality over
[display equation]
By considering a grid
such that ,
, and ,
we have
with probability at least
[display equation]
for all such that and , and for all , where
.
Summing up this bound over , we obtain
[display equation]
uniformly for all such that () and
with probability at least .
Here set
and note that ;
then we have
[display equation (46)]
for all such that () and with probability at least .
We now replace with ;
then the probability can be replaced with
and we have for all .
On the event where holds, substituting for in (46)
and adjusting appropriately,
(41) yields
[display equation (47)]
where and are constants and .
Finally since
,
(47) becomes
[display equation (48)]
which yields the assertion.
Appendix C Proof of Theorems 4 and 5
We write the operator norm of as .
Definition 9
For all , we define the empirical (non-centered) cross covariance operator as follows:
[display equation (49)]
where .
Analogous to the joint covariance operator , we define
the joint empirical cross covariance operator as
.
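For concreteness, a standard way to define a non-centered empirical cross covariance operator in the kernel setting (see, e.g., Gretton et al., 2005) is the following; this is a common convention, and the paper's exact normalization may differ: $\widehat{\Sigma}_{k,\ell} \colon \mathcal{H}_\ell \to \mathcal{H}_k$ is the operator satisfying
\[
\big\langle f, \widehat{\Sigma}_{k,\ell}\, g \big\rangle_{\mathcal{H}_k}
\;=\; \frac{1}{n} \sum_{i=1}^{n} f(x_i)\, g(x_i)
\qquad (\forall f \in \mathcal{H}_k,\ \forall g \in \mathcal{H}_\ell),
\]
with the population operator obtained by replacing the empirical average with the expectation.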
We denote by the element of such that
[display equation]
Let be a constant such that
.
We denote by the objective function of elastic-net MKL:
[display equation]
Proof: (Theorem 4)
Let be the minimizer of :
[display equation]
(Step 1) We first show that with respect to the RKHS norm.
Since ,
as in the proof of Lemma 7, the probability of
goes to 1
(this can be checked as follows: replacing in Eq. (34) with ,
we see that Eq. (34) holds with probability ).
There exists , depending only on , such that
[display equation (50)]
for all and all such that .
Since minimizes ,
if (an event whose probability goes to 1),
we have
[display equation (51)]
where we used the relation (50).
By the assumption ,
we have .
By Lemma 10 and Lemma 11, we have
[display equation]
Substituting these inequalities into (51),
we have
[display equation (52)]
Recall that the (non-centered) cross correlation operator is invertible. Thus
there exists a constant such that
[display equation]
Combining this with Eq. (52) and using , we obtain
[display equation]
Therefore we have
[display equation]
This and give in probability.
(Step 2) Next we show that the probability of goes to 1.
Since , we can assume that without loss of generality.
We identify as an element of by setting for .
Now we show that is also the minimizer of , that is,
, with high probability, and hence
with high probability.
By the KKT condition,
the necessary and sufficient condition for to also minimize is
[display equation (53)]
[display equation (54)]
where .
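As a hedged sketch of the KKT system (53)–(54): assuming an elastic-net MKL penalty of the generic form $\lambda_1 \sum_{m} \|f_m\|_{\mathcal{H}_m} + \lambda_2 \sum_{m} \|f_m\|_{\mathcal{H}_m}^2$ (our assumption; the paper's exact normalization may differ), and writing $\nabla_m \widehat{L}(\hat{f})$ for the gradient of the empirical loss with respect to the $m$-th block, the first-order optimality conditions read
\[
\nabla_m \widehat{L}(\hat{f}) + \lambda_1 \frac{\hat{f}_m}{\|\hat{f}_m\|_{\mathcal{H}_m}} + 2 \lambda_2 \hat{f}_m = 0
\quad \text{if } \hat{f}_m \neq 0,
\qquad
\big\| \nabla_m \widehat{L}(\hat{f}) \big\|_{\mathcal{H}_m} \le \lambda_1
\quad \text{if } \hat{f}_m = 0 .
\]
Under this reading, the stationarity equations on the active blocks correspond to (54), and the norm bound on the inactive blocks corresponds to (53).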
Note that (54) is satisfied (with high probability)
because is the minimizer of and for all (with high probability).
Therefore, if the condition (53) holds w.h.p., then w.h.p.
We now show that the condition (53) holds w.h.p.
Due to (54), we have
[display equation]
Therefore the LHS of (53),
, can be evaluated as
[display equation (55)]
We evaluate the probabilistic orders of the last two terms.
(i) (Bounding )
We show that
[display equation]
Since
we have
[display equation]
The second inequality is due to the fact that for all we have
[display equation]
because of .
Thus we have
[display equation (56)]
Here the LHS of the above inequality is equivalent to
[display equation]
Therefore we observe
[display equation]
Since , we also have
[display equation]
This and yield
[display equation (57)]
(ii) (Bounding )
Note that, due to , we have ,
and we know that by Lemma 10.
Thus satisfies , and hence with high probability.
Hence
[display equation (58)]
Here we obtain
[display equation (59)]
and, due to the fact that with high probability, we have
[display equation]
Therefore the second term in the RHS of Eq. (58) is evaluated as
[display equation]
Therefore
this and Eq. (58) give
[display equation]
Define
[display equation]
We show .
By the definition, we have
[display equation (60)]
On the other hand, as in Eq. (56), we observe that
[display equation (61)]
Moreover, since (), we have
[display equation (62)]
We can also bound the second term of (60) as
[display equation]
Therefore,
applying the inequalities (61) and (62) to (60),
we have
[display equation (63)]
Hence we have .
(iii) (Combining (i) and (ii))
Due to the above evaluations ((i) and (ii)), we have
[display equation]
This yields
[display equation]
Thus the probability of the condition (53) goes to 1.
Proof: (Theorem 5) First we prove that is a necessary condition
for .
Assume that .
Then we can take a subsequence that converges to a finite value;
therefore, by passing to this subsequence if necessary, we can assume
without loss of generality.
We will derive a contradiction
under the conditions of
and .
Suppose .
By the KKT condition,
[display equation (64)]
[display equation (65)]
where the last inequality is due to
and .
Moreover, since the second equality (64) indicates that
,
we have and .
We now show that the KKT condition under which satisfying is optimal with respect to
is violated with strictly positive probability:
[display equation (66)]
Obviously this indicates that the probability does not converge to 1, which is a contradiction.
For all , there exists
such that
[display equation (67)]
Note that is uniformly bounded for all because
the range of is included in the range of (Baker, 1973)
and
there exists such that ( is independent of ),
hence ,
and
[display equation]
for and for .
Let be any non-zero element such that
and be satisfying the above equality (67), then
[display equation]
where we used and
in the second equality,
and the relation (65) in the last equality.
We can show that
has a positive variance as follows (see also Bach (2008)):
[display equation]
where
(note that is invertible because and is invertible). Now since and
(this is because is invertible),
we have
.
Therefore, by the central limit theorem,
converges in distribution to a Gaussian random variable with strictly positive variance.
Thus
the probability of
[display equation]
is asymptotically strictly positive because
(note that this is true whether converges to a finite value or not).
This yields (66), i.e.,
does not satisfy with asymptotically strictly positive probability.
We refer to the following as Condition A:
[display equation]
Now that we have proven , we are ready to prove the assertion (16).
Suppose the condition (16) is not satisfied
for any sequences , that is,
there exists a constant such that
[display equation (68)]
for any sequences
satisfying Condition A (). Fix arbitrary sequences satisfying Condition A. If , the KKT condition
[display equation (69)]
[display equation (70)]
should be satisfied (see (53) and (54)).
We prove that the first inequality (69) of the KKT condition is violated with strictly positive probability under the assumptions and
the condition (70).
We have shown that
(see (55))
[display equation (71)]
As shown in the proof of Theorem 4,
the first term can be approximated by ;
more precisely, Eq. (63) gives
[display equation]
Since
by the assumption,
we observe that
[display equation (72)]
Now since ,
we have proven that
[display equation (73)]
in the proof of Theorem 4 (Eq. (57)).
Therefore, combining (71), (72), and (73),
we observe that the KKT condition (53)
is violated with strictly positive probability if the condition (68)
is satisfied.
This shows that the irrepresentable condition (16) is a necessary condition for the consistency of elastic-net MKL.
Lemma 10
If and , then
[display equation (74)]
In particular,
[display equation (75)]
Proof:
We use McDiarmid’s inequality (Devroye et al., 1996).
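McDiarmid's bounded differences inequality, as it will be applied below (generic notation): if $W_1, \dots, W_n$ are independent random variables and $F$ satisfies the bounded differences condition $\sup_{w_1, \dots, w_n, w_i'} |F(\dots, w_i, \dots) - F(\dots, w_i', \dots)| \le c_i$ for each $i$, then for all $t > 0$,
\[
P\Big( F(W_1, \dots, W_n) - \mathbb{E}[F(W_1, \dots, W_n)] \ge t \Big)
\;\le\; \exp\Big( - \frac{2 t^2}{\sum_{i=1}^{n} c_i^2} \Big) .
\]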
By definition
[display equation]
We denote by the empirical cross covariance operator computed from the
samples
in which the -th sample is replaced by an independent copy
drawn from the same distribution as the ’s.
By the triangle inequality,
we have
[display equation]
Now the RHS can be evaluated as follows:
[display equation (76)]
The RHS of (76) can be further evaluated as
[display equation (77)]
where we used .
Bounding the norm of (76) by
(77),
we have
[display equation]
By symmetry, exchanging and gives
[display equation]
Therefore
by McDiarmid’s inequality we obtain
[display equation]
This gives the first assertion Eq. (74).
To show the second assertion (Eq. (75)), first we note that
[display equation (78)]
where is the trace norm and the last inequality follows from .
As in Lemma 1 of Gretton et al. (2005), we see that
[display equation]
where and are independent random variables distributed according to .
Thus
[display equation]
This and Eq. (78), together with the first assertion (Eq. (74)), give the second assertion.
Lemma 11
If almost surely
and , then we have
[display equation (79)]
Proof:
By definition, we have
[display equation]
Applying Markov's inequality, we obtain the assertion.
Proposition 1 (Bernstein’s inequality in Hilbert spaces)
Let $(\Omega, \mathcal{A}, P)$ be a probability space, $H$ be a separable Hilbert space, $B > 0$, and $\sigma > 0$.
Furthermore, let $\xi_1, \dots, \xi_n \colon \Omega \to H$ be independent random variables satisfying $\mathbb{E}[\xi_i] = 0$, $\|\xi_i\|_\infty \le B$, and $\mathbb{E}\|\xi_i\|_H^2 \le \sigma^2$
for all $i = 1, \dots, n$.
Then we have
[display equation]
Proof:
See Theorem 6.14 of Steinwart (2008).
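For reference, a commonly quoted form of this inequality (our transcription of Theorem 6.14 of Steinwart (2008); the exact constants should be checked against that reference): under the assumptions of Proposition 1, for every $\tau > 0$,
\[
P\bigg( \Big\| \frac{1}{n} \sum_{i=1}^{n} \xi_i \Big\|_{H} \;\ge\; \sqrt{\frac{2 \sigma^2 \tau}{n}} + \sqrt{\frac{\sigma^2}{n}} + \frac{4 B \tau}{3 n} \bigg) \;\le\; e^{-\tau} .
\]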