
The cost-free nature of optimally tuning Tikhonov regularizers and other ordered smoothers

Pierre C. Bellec  and  Dana Yang
Abstract.

We consider the problem of selecting the best estimator among a family of Tikhonov regularized estimators, or, alternatively, of selecting a linear combination of these regularizers that is as good as the best regularizer in the family. Our theory reveals that if the Tikhonov regularizers share the same penalty matrix with different tuning parameters, a convex procedure based on $Q$-aggregation achieves the mean square error of the best estimator, up to a small error term no larger than $C\sigma^{2}$, where $\sigma^{2}$ is the noise level and $C>0$ is an absolute constant. Remarkably, the error term does not depend on the penalty matrix or the number of estimators as long as they share the same penalty matrix, i.e., it applies to any grid of tuning parameters, no matter how large the cardinality of the grid is. This reveals the surprising "cost-free" nature of optimally tuning Tikhonov regularizers, in striking contrast with the existing literature on aggregation of estimators where one typically has to pay a cost of $\sigma^{2}\log(M)$ where $M$ is the number of estimators in the family. The result holds, more generally, for any family of ordered linear smoothers. This encompasses Ridge regression as well as Principal Component Regression. The result is extended to the problem of tuning Tikhonov regularizers with different penalty matrices.

: Department of Statistics, Busch Campus, Rutgers University, Piscataway, NJ 08854, USA.
: Research partially supported by the NSF Grant DMS-1811976.
: Department of Statistics & Data Science, Yale University, New Haven, CT 06511, USA.

1. Introduction

Consider a learning problem where one is given an observation vector $y\in\mathbb{R}^{n}$ and a design matrix $X\in\mathbb{R}^{n\times p}$. Given a positive definite matrix $K\in\mathbb{R}^{p\times p}$ and a regularization parameter $\lambda>0$, the Tikhonov regularized estimator $\hat{w}(K,\lambda)$ is defined as the solution of the quadratic program

(1.1) $\hat{w}(K,\lambda)=\mathop{\rm arg\,min}_{w\in\mathbb{R}^{p}}\left(\|Xw-y\|^{2}+\lambda w^{T}Kw\right),$

where $\|\cdot\|$ is the Euclidean norm. Since we assume that the penalty matrix $K$ is positive definite, the above optimization problem is strongly convex and the solution is unique. In the special case $K=I_{p\times p}$, the above estimator reduces to Ridge regression. It is well known that the above optimization problem can be explicitly solved and that

$\hat{w}(K,\lambda)=(X^{T}X+\lambda K)^{-1}X^{T}y=K^{-1/2}(K^{-1/2}X^{T}XK^{-1/2}+\lambda I_{p\times p})^{-1}K^{-1/2}X^{T}y.$
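As an illustration, here is a minimal numpy sketch of the closed form above; the data X, y and the penalty K, lam are placeholders.

    import numpy as np

    def tikhonov(X, y, K, lam):
        # Solve min_w ||Xw - y||^2 + lam * w^T K w via the normal equations
        # (X^T X + lam*K) w = X^T y; K positive definite makes the system well posed.
        return np.linalg.solve(X.T @ X + lam * K, X.T @ y)

    # Ridge regression is recovered with K = I:
    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((50, 10)), rng.standard_normal(50)
    w_ridge = tikhonov(X, y, np.eye(10), 1.0)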

Problem statement.

Consider the Gaussian mean model

(1.2) $y=\mu+\varepsilon\qquad\text{ with }\qquad\varepsilon\sim N(0,\sigma^{2}I_{n\times n})$

where $\mu\in\mathbb{R}^{n}$ is an unknown mean, and consider a deterministic design matrix $X\in\mathbb{R}^{n\times p}$. We are given a grid of tuning parameters $\lambda_{1},...,\lambda_{M}\geq 0$ and a penalty matrix $K$ as above. Our goal is to construct an estimator $\tilde{w}$ such that the regret or excess risk

(1.3) $\mathbb{E}[\|X\tilde{w}-\mu\|^{2}]-\min_{j=1,...,M}\mathbb{E}[\|X\hat{w}(K,\lambda_{j})-\mu\|^{2}]$

is small. Beyond the construction of an estimator $\tilde{w}$ that has small regret, we aim to answer the following questions:

  • How does the worst-case regret scale with $M$, the number of tuning parameters on the grid?

  • How does the worst-case regret scale with $R^{*}=\min_{j=1,...,M}\mathbb{E}[\|X\hat{w}(K,\lambda_{j})-\mu\|^{2}]$, the minimal mean squared error among the tuning parameters $\lambda_{1},...,\lambda_{M}$?

Ordered linear smoothers.

If $A_{j}=X(X^{T}X+\lambda_{j}K)^{-1}X^{T}$ is the matrix such that $A_{j}y=X\hat{w}(K,\lambda_{j})$, the family of estimators $\{A_{j},j=1,...,M\}$ is an example of ordered linear smoothers, introduced in [23].

Definition 1.

The family of $n\times n$ matrices $\{A_{1},...,A_{M}\}$ is referred to as ordered linear smoothers if (i) $A_{j}$ is symmetric and $0\leq w^{T}A_{j}w\leq\|w\|^{2}$ for all $w\in\mathbb{R}^{n}$ and all $j=1,...,M$, (ii) the matrices commute: $A_{j}A_{k}=A_{k}A_{j}$ for all $j,k=1,...,M$, and (iii) either $A_{j}\preceq A_{k}$ or $A_{k}\preceq A_{j}$ holds for all $j,k=1,...,M$, where $\preceq$ denotes the partial order on symmetric matrices, i.e., $A\preceq B$ if and only if $B-A$ is positive semi-definite.

Condition (i) is mild: if the matrix $A$ is not symmetric then it is not admissible and there exists a symmetric matrix $A^{\prime}$ such that $\mathbb{E}[\|A^{\prime}y-\mu\|^{2}]\leq\mathbb{E}[\|Ay-\mu\|^{2}]$ with a strict inequality for at least one $\mu\in\mathbb{R}^{n}$ [11], so we may as well replace $A$ with the symmetric matrix $A^{\prime}$. Similarly, if $A$ is symmetric with some eigenvalues outside of $[0,1]$, then $A$ is not admissible and there exists another symmetric matrix $A^{\prime}$ with eigenvalues in $[0,1]$ and smaller prediction error for all $\mu\in\mathbb{R}^{n}$, and strictly smaller prediction error for at least one $\mu\in\mathbb{R}^{n}$ if $n\geq 3$ [11].

Conditions (ii) and (iii) are more stringent: they require that the matrices can be diagonalized in the same orthogonal basis $(u_{1},...,u_{n})$ of $\mathbb{R}^{n}$, and that the matrices are ordered in the sense that there exist $n$ functions $\alpha_{1},...,\alpha_{n}:\mathbb{R}\to[0,1]$, either all non-increasing or all non-decreasing, such that

(1.4) $\{A_{1},...,A_{M}\}\subset\{\alpha_{1}(\lambda)u_{1}u_{1}^{T}+...+\alpha_{n}(\lambda)u_{n}u_{n}^{T},\ \lambda\in\mathbb{R}\},$

see [23] for a rigorous proof of this fact. A special case of particular interest is the above family of Tikhonov regularized estimators, which satisfies conditions (i)-(ii)-(iii). In this case, the matrix $A_{j}=X(X^{T}X+\lambda_{j}K)^{-1}X^{T}$ is such that $A_{j}y=X\hat{w}(K,\lambda_{j})$. To see that for any grid of tuning parameters $\lambda_{1},...,\lambda_{M}$ the Tikhonov regularizers form a family of ordered linear smoothers, the matrix $A_{j}$ can be rewritten as $A_{j}=B(B^{T}B+\lambda_{j}I_{p\times p})^{-1}B^{T}$ where $B$ is the matrix $XK^{-1/2}$. From this expression of $A_{j}$, it is clear that $A_{j}$ is symmetric, that $A_{j}$ can be diagonalized in the orthogonal basis made of the left singular vectors of $B$, and that the eigenvalues of $A_{j}$ are decreasing functions of the tuning parameter. Namely, the $i$-th eigenvalue of $A_{j}$ equals $\alpha_{i}(\lambda_{j})=\mu_{i}(B)^{2}/(\mu_{i}(B)^{2}+\lambda_{j})$ where $\mu_{i}(B)$ is the $i$-th singular value of $B$.
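The following sketch checks this numerically; the helper name smoother_matrix and the synthetic data are ours and only serve as an illustration. It builds $A_{j}$ for a few values of $\lambda_{j}$ and verifies that the nonzero eigenvalues equal $\mu_{i}(B)^{2}/(\mu_{i}(B)^{2}+\lambda_{j})$, so that they decrease as $\lambda_{j}$ grows.

    import numpy as np

    def smoother_matrix(X, K, lam):
        # Hat matrix A = X (X^T X + lam*K)^{-1} X^T of the Tikhonov regularizer.
        return X @ np.linalg.solve(X.T @ X + lam * K, X.T)

    rng = np.random.default_rng(0)
    n, p = 30, 8
    X = rng.standard_normal((n, p))
    M = rng.standard_normal((p, p))
    K = M @ M.T + 0.1 * np.eye(p)           # an arbitrary positive definite penalty
    L = np.linalg.cholesky(K)               # K = L L^T
    B = X @ np.linalg.inv(L).T              # B = X L^{-T}: same singular values as X K^{-1/2}
    s = np.linalg.svd(B, compute_uv=False)  # singular values mu_i(B), sorted decreasingly

    for lam in [0.1, 1.0, 10.0]:
        A = smoother_matrix(X, K, lam)
        eig = np.sort(np.linalg.eigvalsh(A))[::-1][:p]
        # nonzero eigenvalues of A_j are mu_i(B)^2 / (mu_i(B)^2 + lam), decreasing in lam
        assert np.allclose(eig, s**2 / (s**2 + lam))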

Overview of the literature.

There is a substantial amount of literature related to this problem, starting with [23] where ordered linear smoothers were introduced and their properties were first studied. Kneip [23] proves that if $A_{1},...,A_{M}$ are ordered linear smoothers, then selecting the estimate with the smallest $C_{p}$ criterion [26], i.e.,

(1.5) $\hat{k}=\mathop{\rm arg\,min}_{j=1,...,M}C_{p}(A_{j}),\quad\text{where}\quad C_{p}(A)=\|Ay-y\|^{2}+2\sigma^{2}\operatorname{trace}(A),$

leads to the regret bound (sometimes referred to as an oracle inequality)

(1.6) $\mathbb{E}[\|A_{\hat{k}}y-\mu\|^{2}]-R^{*}\leq C\sigma\sqrt{R^{*}}+C\sigma^{2},\quad\text{ where }\quad R^{*}=\min_{j=1,...,M}\mathbb{E}[\|A_{j}y-\mu\|^{2}]$

for some absolute constant $C>0$. This result was later improved in [19, Theorem 3] and [10] using an estimate based on exponential weighting, showing that the regret is bounded from above by $\sigma^{2}\log(2+R^{*}/\sigma^{2})$.
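For reference, the selection rule (1.5) is a one-liner; here is a sketch with placeholder names A_list, y and sigma2.

    import numpy as np

    def mallows_cp_select(A_list, y, sigma2):
        # Return the index minimizing C_p(A) = ||Ay - y||^2 + 2*sigma2*trace(A).
        cp = [np.sum((A @ y - y) ** 2) + 2 * sigma2 * np.trace(A) for A in A_list]
        return int(np.argmin(cp))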

Another line of research has obtained regret bounds that scale with the cardinality $M$ of the given family of linear estimators. Using an exponential weight estimate with a well chosen temperature parameter, [24, 15] showed that if $A_{1},...,A_{M}$ are square matrices of size $n$ that are either orthogonal projections, or that satisfy some commutativity property, then a data-driven convex combination $\hat{A}_{EW}$ of the matrices $A_{1},...,A_{M}$ satisfies

(1.7) $\mathbb{E}[\|\hat{A}_{EW}y-\mu\|^{2}]-R^{*}\leq C\sigma^{2}\log M,$

where $C>0$ is an absolute constant. This was later improved in [5] using an estimate from the $Q$-aggregation procedure of [13, 14]. Namely, Theorem 2.1 in [5] states that if $A_{1},...,A_{M}$ are square matrices with operator norm at most 1, then

(1.8) $\mathbb{P}\Big(\|\hat{A}_{Q}y-\mu\|^{2}-\min_{j=1,...,M}\|A_{j}y-\mu\|^{2}\leq C\sigma^{2}\log(M/\delta)\Big)\geq 1-\delta$

for any $\delta\in(0,1)$, where $\hat{A}_{Q}$ is a data-driven convex combination of the matrices $A_{1},...,A_{M}$. A result similar to (1.7) can then be deduced from the above high probability bound by integration. It should be noted that the linear estimators in (1.7) and (1.8) need not be ordered smoothers (the only assumption in (1.8) is that the operator norm of $A_{j}$ is at most one), unlike (1.6) where the ordered smoothers assumption is key.

Another popular approach to selecting a good estimate among a family of linear estimators is the Generalized Cross-Validation (GCV) criterion of [12, 18]. If we are given $M$ linear estimators defined by square matrices $A_{1},...,A_{M}$, Generalized Cross-Validation selects the estimator

$\hat{k}=\mathop{\rm arg\,min}_{j=1,...,M}\left(\|A_{j}y-y\|^{2}/(\operatorname{trace}[I_{n\times n}-A_{j}])^{2}\right).$

We could not pinpoint in the literature an oracle inequality satisfied by GCV comparable to (1.6)-(1.7)-(1.8), though we mention that [25] exhibits asymptotic frameworks where GCV is suboptimal while, in the same asymptotic frameworks, Mallows $C_{p}$ is optimal.
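For completeness, a sketch of the GCV selection rule displayed above (placeholder names again; note that, unlike $C_{p}$, it does not require knowledge of $\sigma^{2}$):

    import numpy as np

    def gcv_select(A_list, y):
        # Return the index minimizing ||Ay - y||^2 / trace(I - A)^2.
        n = len(y)
        scores = [np.sum((A @ y - y) ** 2) / (n - np.trace(A)) ** 2 for A in A_list]
        return int(np.argmin(scores))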

The problem of optimally tuning Tikhonov regularizers, Ridge regressors or smoothing splines has received considerable attention in the last four decades (for instance, the GCV paper [18] is cited more than four thousand times) and the authors of the present paper are guilty of numerous omissions of important related works. We refer the reader to the recent surveys [3, 2] and the references therein for the problem of tuning linear estimators, and to [34] for a survey of aggregation results.

Coming back to our initial problem of optimally tuning a family of Tikhonov regularizers $\hat{w}(K,\lambda_{1}),...,\hat{w}(K,\lambda_{M})$, the results (1.6), (1.7) and (1.8) above suggest that one must pay a price that depends either on the cardinality $M$ of the grid of tuning parameters, or on $R^{*}=\min_{j=1,...,M}\mathbb{E}[\|X\hat{w}(K,\lambda_{j})-\mu\|^{2}]$, the minimal mean squared error on this grid.

Optimally tuning ordered linear smoothers incurs no statistical cost.

Surprisingly, our theoretical results of the next sections reveal that if $A_{1},...,A_{M}$ are ordered linear smoothers, for example Tikhonov regularizers sharing the same penalty matrix $K$, then it is possible to construct a data-driven convex combination $\hat{A}$ of $A_{1},...,A_{M}$ such that the regret satisfies

$\mathbb{E}[\|\hat{A}y-\mu\|^{2}]-\min_{j=1,...,M}\mathbb{E}[\|A_{j}y-\mu\|^{2}]\leq C_{1}\sigma^{2}$

for some absolute constant $C_{1}>0$. Hence the regret in (1.3) is bounded by $C_{1}\sigma^{2}$, an upper bound that is (a) independent of the cardinality $M$ of the grid of tuning parameters and (b) independent of the minimal risk $R^{*}=\min_{j=1,...,M}\mathbb{E}[\|A_{j}y-\mu\|^{2}]$. No matter how coarse the grid of tuning parameters is, no matter the number of tuning parameters to choose from, no matter how large the minimal risk $R^{*}$ is, the regret of the procedure constructed in the next section is always bounded by $C_{1}\sigma^{2}$.

Notation.

Throughout the paper, $C_{1},C_{2},C_{3},...$ denote absolute positive constants. The norm $\|\cdot\|$ is the Euclidean norm of vectors. Let $\|\cdot\|_{op}$ and $\|\cdot\|_{F}$ be the operator and Frobenius norms of matrices.

2. Construction of the estimator

Assume that we are given $M$ matrices $A_{1},...,A_{M}$, each matrix corresponding to the linear estimator $A_{j}y$. Mallows' [26] $C_{p}$ criterion is given by

(2.1) $C_{p}(A)\triangleq\|Ay-y\|^{2}+2\sigma^{2}\operatorname{trace}(A)$

for any square matrix $A$ of size $n\times n$. Following several works on aggregation of estimators [28, 35, 24, 30, 15, 13, 5] we parametrize the convex hull of the matrices $A_{1},...,A_{M}$ as follows:

(2.2) $A_{\theta}\triangleq\sum_{j=1}^{M}\theta_{j}A_{j}\quad\text{for each }\theta\in\Lambda_{M},\qquad\text{ where }\quad\Lambda_{M}=\Big\{\theta\in\mathbb{R}^{M}:\theta_{j}\geq 0,\ \sum_{j=1}^{M}\theta_{j}=1\Big\}.$

Above, $\Lambda_{M}$ is the simplex in $\mathbb{R}^{M}$ and the convex hull of the matrices $A_{1},...,A_{M}$ is exactly the set $\{A_{\theta},\theta\in\Lambda_{M}\}$. Finally, define the weights $\hat{\theta}\in\Lambda_{M}$ by

(2.3) $\hat{\theta}=\mathop{\rm arg\,min}_{\theta\in\Lambda_{M}}\Big(C_{p}(A_{\theta})+\frac{1}{2}\sum_{j=1}^{M}\theta_{j}\|(A_{\theta}-A_{j})y\|^{2}\Big).$

The first term of the objective function is Mallows' $C_{p}$ from (2.1), while the second term is a penalty derived from the $Q$-aggregation procedure of [31, 13]. The penalty is minimized at the vertices of the simplex and thus penalizes the interior of $\Lambda_{M}$. Although convexity of the above optimization problem is unclear at first sight because the penalty is non-convex, the objective function can be rewritten, thanks to a bias-variance decomposition, as

(2.4) $\frac{1}{2}\|A_{\theta}y-y\|^{2}+2\sigma^{2}\operatorname{trace}(A_{\theta})+\frac{1}{2}\sum_{j=1}^{M}\theta_{j}\|A_{j}y-y\|^{2}.$
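Indeed, since $\sum_{j=1}^{M}\theta_{j}=1$, expanding the squares gives

$\sum_{j=1}^{M}\theta_{j}\|(A_{\theta}-A_{j})y\|^{2}=\sum_{j=1}^{M}\theta_{j}\|(A_{j}y-y)-(A_{\theta}y-y)\|^{2}=\sum_{j=1}^{M}\theta_{j}\|A_{j}y-y\|^{2}-\|A_{\theta}y-y\|^{2},$

so that adding one half of this quantity to $C_{p}(A_{\theta})$ yields (2.4).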

The first term is a convex quadratic form in $\theta$, while both the second term $(2\sigma^{2}\operatorname{trace}(A_{\theta}))$ and the last term are linear in $\theta$. It is now clear that the objective function is convex and (2.3) is a convex quadratic program (QP) with $M$ variables and $M+1$ linear constraints. The computational complexity of such a convex QP is polynomial and well studied, e.g., [37, page 304]. The final estimator is

(2.5) $\hat{y}\triangleq A_{\hat{\theta}}y=\sum_{j=1}^{M}\hat{\theta}_{j}A_{j}y,$

a weighted sum of the values predicted by the linear estimators $A_{1},...,A_{M}$. The performance of this procedure is studied in [14, 5]; [5] derived the oracle inequality (1.8), which is optimal for certain collections $\{A_{1},...,A_{M}\}$. However, we are not aware of a previous analysis of this procedure in the context of ordered linear smoothers.
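A possible implementation of (2.3), through the formulation (2.4), is sketched below; the helper name q_aggregate and the choice of scipy's SLSQP solver are ours, and any off-the-shelf QP solver could be used instead.

    import numpy as np
    from scipy.optimize import minimize

    def q_aggregate(A_list, y, sigma2):
        # Minimize (2.4) over the simplex:
        # 0.5*||A_theta y - y||^2 + 2*sigma2*trace(A_theta) + 0.5*sum_j theta_j*||A_j y - y||^2
        M = len(A_list)
        F = np.column_stack([A @ y for A in A_list])      # column j is A_j y
        lin = np.array([2 * sigma2 * np.trace(A) + 0.5 * np.sum((A @ y - y) ** 2)
                        for A in A_list])                 # part of the objective linear in theta

        def obj(theta):
            r = F @ theta - y                             # A_theta y - y
            return 0.5 * r @ r + lin @ theta

        def grad(theta):
            return F.T @ (F @ theta - y) + lin

        theta0 = np.full(M, 1.0 / M)
        cons = [{"type": "eq", "fun": lambda t: t.sum() - 1.0}]
        bnds = [(0.0, 1.0)] * M
        res = minimize(obj, theta0, jac=grad, bounds=bnds, constraints=cons, method="SLSQP")
        theta = np.clip(res.x, 0, None)
        theta /= theta.sum()                              # project back onto the simplex
        return theta, F @ theta                           # weights theta_hat and fitted vector A_theta y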

3. Constant regret for ordered linear smoothers

Theorem 3.1.

The following holds for absolute constants $C_{1},C_{2},C_{3}>0$. Consider the Gaussian mean model (1.2). Let $\{A_{1},...,A_{M}\}$ be a family of ordered linear smoothers as in Definition 1. Let $\hat{\theta}$ be the solution to the optimization problem (2.3). Then $\hat{y}=A_{\hat{\theta}}y$ enjoys the regret bound

(3.1) $\mathbb{E}[\|A_{\hat{\theta}}y-\mu\|^{2}]-\min_{j=1,...,M}\mathbb{E}[\|A_{j}y-\mu\|^{2}]\leq C_{1}\sigma^{2}.$

Furthermore, if $j_{*}=\mathop{\rm arg\,min}_{j=1,...,M}\mathbb{E}[\|A_{j}y-\mu\|^{2}]$ has minimal risk then for any $x\geq 1$,

(3.2) $\mathbb{P}\left\{\|A_{\hat{\theta}}y-\mu\|^{2}-\|A_{j_{*}}y-\mu\|^{2}\leq C_{2}\sigma^{2}x\right\}\geq 1-C_{3}e^{-x}.$

Let us explain the “cost-free” nature of the above result. In the simplest, one-dimensional regression problem where the design matrix $X$ has only one column and $\mu=X\beta^{*}$ for some unknown scalar $\beta^{*}$, the prediction error of the Ordinary Least Squares estimator is $\mathbb{E}[\|X(\hat{\beta}^{ols}-\beta^{*})\|^{2}]=\sigma^{2}$ because the random variable $\|X(\hat{\beta}^{ols}-\beta^{*})\|^{2}/\sigma^{2}$ has a chi-square distribution with one degree of freedom. Hence the right hand side of the regret bound in (3.1) is no larger than a constant times the prediction error in a one-dimensional linear model. The right hand side of (3.1) is independent of the minimal risk $R^{*}$, independent of the cardinality $M$ of the family of estimators, and, if the estimators were constructed from a linear model with $p$ covariates, also independent of the dimension $p$.

Since the most common ordered linear smoothers are Tikhonov regularizers (which encompass Ridge regression and smoothing splines), we provide the following corollary for convenience.

Corollary 3.2 (Application to Tikhonov regularizers).

Let $K$ be a positive definite matrix of size $p\times p$ and let $\lambda_{1},...,\lambda_{M}\geq 0$ be distinct tuning parameters. Define $\hat{\theta}$ as the minimizer of

(3.3) $\hat{\theta}=\mathop{\rm arg\,min}_{\theta\in\Lambda_{M}}\Big(\frac{1}{2}\Big\|\sum_{j=1}^{M}\theta_{j}X\hat{w}(K,\lambda_{j})-y\Big\|^{2}+2\sigma^{2}\sum_{j=1}^{M}\theta_{j}{\mathop{\rm df}}_{j}+\frac{1}{2}\sum_{j=1}^{M}\theta_{j}\|X\hat{w}(K,\lambda_{j})-y\|^{2}\Big),$

where ${\mathop{\rm df}}_{j}=\operatorname{trace}[X(X^{T}X+\lambda_{j}K)^{-1}X^{T}]$. Then the weight vector $\tilde{w}=\sum_{j=1}^{M}\hat{\theta}_{j}\hat{w}(K,\lambda_{j})$ in $\mathbb{R}^{p}$ is such that the regret (1.3) is bounded from above by $C_{1}\sigma^{2}$ for some absolute constant $C_{1}>0$.

This corollary is a direct consequence of Theorem 3.1 with $A_{j}=X(X^{T}X+\lambda_{j}K)^{-1}X^{T}$. The fact that this forms a family of ordered linear smoothers is explained after (1.4). The objective function (3.3) corresponds to the formulation (2.4) of the objective function in (2.3); we have chosen this formulation so that (3.3) can be easily implemented as a convex quadratic program with linear constraints, the first term of the objective function being quadratic in $\theta$ while the second and third terms are linear in $\theta$.
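A usage sketch of this corollary on synthetic data, assuming that the helpers tikhonov, smoother_matrix and q_aggregate from the earlier sketches are in scope and that $\sigma^{2}$ is known:

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, sigma2 = 100, 20, 1.0
    X = rng.standard_normal((n, p))
    mu = X @ (0.5 * rng.standard_normal(p))
    y = mu + np.sqrt(sigma2) * rng.standard_normal(n)

    lambdas = np.logspace(-3, 3, 50)        # the grid can be taken as fine as desired
    K = np.eye(p)                           # ridge penalty as a special case
    A_list = [smoother_matrix(X, K, lam) for lam in lambdas]
    theta, y_hat = q_aggregate(A_list, y, sigma2)
    # w_tilde = sum_j theta_j * w_hat(K, lambda_j) as in Corollary 3.2
    w_tilde = sum(t * tikhonov(X, y, K, lam) for t, lam in zip(theta, lambdas))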

The procedure above requires knowledge of $\sigma^{2}$, which needs to be estimated beforehand in practice. Estimators of $\sigma^{2}$ are available depending on the underlying context, e.g., difference-based estimates for observations on a grid [16, 20, 27, 9], or pivotal estimators of $\sigma$ in sparse linear regression, e.g., [6, 33, 29]. Finally, [21, Section 7.5] recommends estimating $\sigma^{2}$ by the squared residuals on a low-bias model. We also note that the procedure (2.3) is robust to a misspecified $\sigma$ if each $A_{j}$ is an orthogonal projection [5, Section 6.2].

4. Multiple families of ordered smoothers or Tikhonov penalty matrices

Theorem 4.1.

The following holds for absolute constants $C_{1},C_{2},C_{3}>0$. Consider the Gaussian mean model (1.2). Let $\{A_{1},...,A_{M}\}$ be a set of linear estimators such that

$\{A_{1},...,A_{M}\}\subset F_{1}\cup...\cup F_{q},$

where $F_{k}$ is a family of ordered linear smoothers as in Definition 1 for each $k=1,...,q$. Let $\hat{\theta}$ be the solution to the optimization problem (2.3). Then $\hat{y}=A_{\hat{\theta}}y$ enjoys the regret bound

(4.1) $\mathbb{E}[\|A_{\hat{\theta}}y-\mu\|^{2}]-\min_{j=1,...,M}\mathbb{E}[\|A_{j}y-\mu\|^{2}]\leq C_{1}\sigma^{2}+C_{2}\sigma^{2}\log q.$

Furthermore, if $j_{*}=\mathop{\rm arg\,min}_{j=1,...,M}\mathbb{E}[\|A_{j}y-\mu\|^{2}]$ has minimal risk then for any $x\geq 1$,

(4.2) $\mathbb{P}\left\{\|A_{\hat{\theta}}y-\mu\|^{2}-\|A_{j_{*}}y-\mu\|^{2}\leq C_{2}\sigma^{2}(x+\log q)\right\}\geq 1-C_{3}e^{-x}.$

We now allow not only one family of ordered linear smoothers, but several. Above, $q$ denotes the number of families. This setting was considered in [23], although with a regret bound of the form $\sqrt{R^{*}}\sigma\log(q)^{2}+\sigma^{2}\log(q)^{4}$ where $R^{*}=\min_{j=1,...,M}\mathbb{E}[\|A_{j}y-\mu\|^{2}]$; Theorem 4.1 improves both the dependence in $R^{*}$ and in $q$. Let us also note that the dependence in $q$ in the above bound (4.2) is optimal [5, Proposition 2.1].

The above result is typically useful in situations where several Tikhonov penalty matrices $K_{1},...,K_{q}$ are candidates. For each $m=1,...,q$, the penalty matrix is $K_{m}$ and the practitioner chooses a grid of $b_{m}\geq 1$ tuning parameters, say, $\{\lambda^{(m)}_{a},a=1,...,b_{m}\}$. If the matrices $A_{1},...,A_{M}$ are such that

$\{A_{1},...,A_{M}\}=\cup_{m=1}^{q}\{X(X^{T}X+\lambda_{a}^{(m)}K_{m})^{-1}X^{T},\ a=1,...,b_{m}\},$

so that $M=\sum_{m=1}^{q}b_{m}$, the procedure (2.3) enjoys the regret bound

$\mathbb{E}[\|A_{\hat{\theta}}y-\mu\|^{2}]-\min_{m=1,...,q}\min_{a=1,...,b_{m}}\mathbb{E}[\|X\hat{w}(K_{m},\lambda_{a}^{(m)})-\mu\|^{2}]\leq C\sigma^{2}(1+\log q)$

and a similar bound in probability. That is, the procedure of Section 2 automatically adapts to both the best penalty matrix and the best tuning parameter. The error term $\sigma^{2}(1+\log q)$ only depends on the number of regularization matrices used, not on the cardinality of the grids of tuning parameters.
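Concretely, pooling the grids of several penalty matrices before aggregating could look as follows (a sketch reusing the helpers smoother_matrix and q_aggregate from the earlier sketches; the two penalty matrices and the grid sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, sigma2 = 80, 10, 1.0
    X = rng.standard_normal((n, p))
    y = X @ rng.standard_normal(p) + rng.standard_normal(n)

    penalty_matrices = [np.eye(p), np.diag(np.arange(1.0, p + 1.0))]   # q = 2 candidate matrices
    grids = [np.logspace(-2, 2, 25), np.logspace(-2, 2, 40)]           # grid sizes b_1, b_2
    A_list = [smoother_matrix(X, K_m, lam)
              for K_m, grid in zip(penalty_matrices, grids) for lam in grid]
    theta, y_hat = q_aggregate(A_list, y, sigma2)   # adapts to penalty matrix and tuning parameter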

5. Proofs

We start the proof with the following deterministic result.

Lemma 5.1 (Deterministic inequality).

Let $A_{1},...,A_{M}$ be square matrices of size $n\times n$ and consider the procedure (2.3) in the unknown mean model (1.2). Then for any $\bar{A}\in\{A_{1},...,A_{M}\}$,

$\|A_{\hat{\theta}}y-\mu\|^{2}-\|\bar{A}y-\mu\|^{2}\leq\max_{j=1,...,M}\Big(2\varepsilon^{T}(A_{j}-\bar{A})y-2\sigma^{2}\operatorname{trace}(A_{j}-\bar{A})-\tfrac{1}{2}\|(A_{j}-\bar{A})y\|^{2}\Big).$
Proof.

The above is proved in [5, Proposition 3.2]. We reproduce the short proof here for completeness. If $H:\Lambda_{M}\to\mathbb{R}$ is the convex objective of (2.3) and $\bar{A}=A_{k}$ for some $k=1,...,M$, the optimality condition of (2.3) states that $\nabla H(\hat{\theta})^{T}(e_{k}-\hat{\theta})\geq 0$ holds (cf. [8, (4.21)]). This inequality can be equivalently rewritten as

$\|A_{\hat{\theta}}y-\mu\|^{2}-\|\bar{A}y-\mu\|^{2}\leq\sum_{j=1}^{M}\hat{\theta}_{j}\Big(2\varepsilon^{T}(A_{j}-\bar{A})y-2\sigma^{2}\operatorname{trace}(A_{j}-\bar{A})-\tfrac{1}{2}\|(A_{j}-\bar{A})y\|^{2}\Big).$

The proof is completed by noting that the average $\sum_{j=1}^{M}\hat{\theta}_{j}a_{j}$ with weights $\hat{\theta}=(\hat{\theta}_{1},...,\hat{\theta}_{M})\in\Lambda_{M}$ is smaller than the maximum $\max_{j=1,...,M}a_{j}$ for every reals $a_{1},...,a_{M}$. ∎

Throughout the proof, $\bar{A}$ is a fixed deterministic matrix with $\|\bar{A}\|_{op}\leq 1$. Our goal is to bound from above the right hand side of Lemma 5.1 with high probability. To this end, define the process $(Z_{B})_{B}$, indexed by symmetric matrices $B$ of size $n\times n$, by

$Z_{B}=2\varepsilon^{T}(B-\bar{A})y-2\sigma^{2}\operatorname{trace}(B-\bar{A})-\tfrac{1}{2}\big(\|(B-\bar{A})y\|^{2}-d(B,\bar{A})^{2}\big)$

where $d$ is the metric

(5.1) $d(B,A)^{2}\triangleq\mathbb{E}[\|(B-A)y\|^{2}]=\sigma^{2}\|B-A\|_{F}^{2}+\|(B-A)\mu\|^{2},\qquad A,B\in\mathbb{R}^{n\times n}.$

With this definition, the quantity inside the parentheses in the right hand side of Lemma 5.1 is exactly $Z_{A_{j}}-\tfrac{1}{2}d(A_{j},\bar{A})^{2}$. We split the process $Z_{B}$ into a Gaussian part and a quadratic part. Define the processes $(G_{B})_{B}$ and $(W_{B})_{B}$ by

(5.2) $G_{B}=\varepsilon^{T}[2I_{n\times n}-(B-\bar{A})](B-\bar{A})\mu,$
(5.3) $W_{B}=2\varepsilon^{T}(B-\bar{A})\varepsilon-2\sigma^{2}\operatorname{trace}(B-\bar{A})-\tfrac{1}{2}\varepsilon^{T}(B-\bar{A})^{2}\varepsilon+\tfrac{\sigma^{2}}{2}\|B-\bar{A}\|_{F}^{2},$
so that $Z_{B}=G_{B}+W_{B}$.

Before bounding the suprema of the above processes, we need the following metric property of ordered linear smoothers. If $T$ is a subset of the space of symmetric matrices of size $n\times n$ and if $d$ is a metric on $T$, the diameter $\Delta(T,d)$ of $T$ and the Talagrand generic chaining functionals for $\alpha=1,2$ are defined by

(5.4) $\Delta(T,d)=\sup_{A,B\in T}d(A,B),\qquad\qquad\gamma_{\alpha}(T,d)=\inf_{(T_{k})_{k\geq 0}}\sup_{t\in T}\sum_{k=1}^{+\infty}2^{k/\alpha}d(t,T_{k})$

where the infimum is over all sequences $(T_{k})_{k\geq 0}$ of subsets of $T$ such that $|T_{0}|=1$ and $|T_{k}|\leq 2^{2^{k}}$.

Lemma 5.2.

Let $a\geq 0$ and let $\mu\in\mathbb{R}^{n}$. Let $F\subset\mathbb{R}^{n\times n}$ be a set of ordered linear smoothers (cf. Definition 1) and let $d$ be any semi-metric of the form $d(A,B)^{2}=a\|A-B\|_{F}^{2}+\|(A-B)\mu\|^{2}$. Then $\gamma_{2}(F,d)+\gamma_{1}(F,d)\leq C_{4}\Delta(F,d)$ where $C_{4}$ is an absolute constant.

Proof.

We have to specify a sequence $(T_{k})_{k\geq 0}$ of subsets of $F$ with $|T_{k}|\leq 2^{2^{k}}$. Since $F$ satisfies Definition 1, there exist a basis of eigenvectors $u_{1},...,u_{n}$, functions $\alpha_{1},...,\alpha_{n}:\mathbb{R}\to[0,1]$, all nondecreasing without loss of generality, and a set $\Lambda\subset\mathbb{R}$ such that $F=\{B_{\lambda},\lambda\in\Lambda\}$ where $B_{\lambda}=\sum_{i=1}^{n}\alpha_{i}(\lambda)u_{i}u_{i}^{T}$, cf. (1.4). Hence for any $\lambda_{0},\lambda,\nu\in\Lambda$,

$d(B_{\lambda},B_{\nu})^{2}=\sum_{i=1}^{n}w_{i}(\alpha_{i}(\lambda)-\alpha_{i}(\nu))^{2}\quad\text{ for the weights }\quad w_{i}=a+(u_{i}^{T}\mu)^{2}\geq 0,$
$d(B_{\lambda_{0}},B_{\lambda})^{2}+d(B_{\lambda},B_{\nu})^{2}=d(B_{\lambda_{0}},B_{\nu})^{2}+2\sum_{i=1}^{n}w_{i}(\alpha_{i}(\lambda)-\alpha_{i}(\lambda_{0}))(\alpha_{i}(\lambda)-\alpha_{i}(\nu)).$

If $\lambda_{0}\leq\lambda\leq\nu$, since each $\alpha_{i}(\cdot)$ is nondecreasing, the sum in the right hand side of the previous display is non-positive and $d(B_{\lambda},B_{\nu})^{2}\leq d(B_{\nu},B_{\lambda_{0}})^{2}-d(B_{\lambda},B_{\lambda_{0}})^{2}$ holds. Let $N=2^{2^{k}}$ and $\delta=\Delta(F,d)/\sqrt{N}$. We construct a $\delta$-covering of $F$ by considering the bins $\textsc{Bin}_{j}=\{B\in F:\delta^{2}j\leq d(B,B_{\lambda_{0}})^{2}<\delta^{2}(j+1)\}$ for $j=0,...,N-1$, where $\lambda_{0}=\inf\Lambda$; these bins cover $F$ since $N\delta^{2}=\Delta(F,d)^{2}$. If $\textsc{Bin}_{j}$ is non-empty, any single one of its elements is a $\delta$-covering of $\textsc{Bin}_{j}$ thanks to

$d(B_{\lambda},B_{\nu})^{2}\leq d(B_{\nu},B_{\lambda_{0}})^{2}-d(B_{\lambda},B_{\lambda_{0}})^{2}\leq(j+1)\delta^{2}-j\delta^{2}=\delta^{2}$

for $B_{\nu},B_{\lambda}\in\textsc{Bin}_{j}$ with $\lambda\leq\nu$. This constructs a $\delta$-covering of $F$ with at most $N=2^{2^{k}}$ elements. Hence $\gamma_{2}(F,d)\leq\Delta(F,d)\sum_{k=1}^{\infty}2^{k/2}/2^{2^{k-1}}=C\,\Delta(F,d)$ and the same holds for $\gamma_{1}(F,d)$ with a different absolute constant. ∎

Lemma 5.3 (The Gaussian process $G_{B}$).

Let $T^{*}$ be a family of ordered smoothers (cf. Definition 1) such that $\sup_{B\in T^{*}}d(\bar{A},B)\leq\delta^{*}$ for the metric (5.1). Then for all $x>0$,

$\mathbb{P}\big(\sup_{B\in T^{*}}G_{B}\leq\sigma(C+3\sqrt{2x})\delta^{*}\big)\geq 1-e^{-x}.$
Proof.

By the Gaussian concentration theorem [7, Theorem 5.8], with probability at least $1-e^{-x}$ we have

(5.5) $\sup_{B\in T^{*}}G_{B}\leq\mathbb{E}\sup_{B\in T^{*}}G_{B}+\sigma\sqrt{2x}\sup_{B\in T^{*}}\|[2I_{n\times n}-(B-\bar{A})](B-\bar{A})\mu\|$
(5.6) $\phantom{\sup_{B\in T^{*}}G_{B}}\leq C\gamma_{2}(T^{*},d_{G})+\sigma\sqrt{2x}\sup_{B\in T^{*}}3\|(B-\bar{A})\mu\|$

where for the second inequality we used Talagrand's majorizing measure theorem (cf., e.g., [38, Section 8.6]) and the fact that $B$ has eigenvalues in $[0,1]$ while $\|\bar{A}\|_{op}\leq 1$, and where $d_{G}$ is the canonical metric of the Gaussian process,

$d_{G}(A,B)^{2}=\mathbb{E}[(G_{A}-G_{B})^{2}].$

If $D=B-A$ is the difference and $P$ is any matrix that commutes with $A$ and $B$ (a specific choice of $P$ is made below), then

$G_{B}-G_{A}=\varepsilon^{T}\left[2D\mu-\tfrac{1}{2}(A+B-2\bar{A})D\mu-\tfrac{1}{2}D(A+B-2P)\mu\right]+\varepsilon^{T}D(\bar{A}-P)\mu.$

By the triangle inequality and using that $A,B,P,\bar{A}$ have operator norm at most one, $d_{G}(A,B)\leq 6\sigma\|D\mu\|+\sigma\|D(\bar{A}-P)\mu\|$. This shows that

$\gamma_{2}(T^{*},d_{G})\leq 6\sigma\gamma_{2}(T^{*},d_{1})+\sigma\gamma_{2}(T^{*},d_{2})$

where $d_{1}(A,B)=\|(B-A)\mu\|$ and $d_{2}(A,B)=\|(A-B)(\bar{A}-P)\mu\|$. By Lemma 5.2, $\gamma_{2}(T^{*},d_{1})\leq C\Delta(T^{*},d_{1})$ and similarly for $d_{2}$ (note that $d_{2}$ is of the same form as $d_{1}$ with $\mu$ replaced by $\mu^{\prime}=(P-\bar{A})\mu$).

If $\sup_{B\in T^{*}}d(B,\bar{A})\leq\delta^{*}$ for the metric $d$ in (5.1), then $\sup_{B\in T^{*}}\|(B-\bar{A})\mu\|\leq\delta^{*}$ and $\Delta(T^{*},d_{1})\leq 2\delta^{*}$. Furthermore, if $P$ is the convex projection of $\bar{A}$ onto the convex hull of $T^{*}$ with respect to the Hilbert metric $d$ in (5.1), then

$\Delta(T^{*},d_{2})=\sup_{B,B^{\prime}\in T^{*}}d_{2}(B,B^{\prime})\leq 2\|(P-\bar{A})\mu\|\leq 2d(P,\bar{A})\leq 2d(B_{0},\bar{A})\leq 2\delta^{*}$

for any $B_{0}\in T^{*}$, where we used that by definition of the convex projection, $d(P,\bar{A})\leq d(B_{0},\bar{A})$. ∎

The following inequality, known as the Hanson-Wright inequality, will be useful for the next lemma. If $\varepsilon\sim N(0,\sigma^{2}I_{n\times n})$ is a Gaussian vector, then

(5.7) $\mathbb{P}\Big[|\varepsilon^{T}Q\varepsilon-\sigma^{2}\operatorname{trace}Q|>2\sigma^{2}(\|Q\|_{F}\sqrt{x}+\|Q\|_{op}x)\Big]\leq 2e^{-x},$

for any square matrix $Q\in\mathbb{R}^{n\times n}$. We refer to [7, Example 2.12] for a proof for normally distributed $\varepsilon$ and to [32, 22, 4, 1] for proofs of (5.7) in the sub-gaussian case.

Lemma 5.4 (The quadratic process $W_{B}$).

Let $T^{*}$ be a family of ordered smoothers (cf. Definition 1) such that $\sigma\|B-\bar{A}\|_{F}\leq\delta^{*}$ for all $B\in T^{*}$. Then for all $x>0$,

$\mathbb{P}\Big(\sup_{B\in T^{*}}W_{B}\leq C\sigma\delta^{*}+C\sigma\sqrt{x}\,\delta^{*}+C\sigma^{2}x\Big)\geq 1-2e^{-x}.$
Proof.

We apply Theorem 2.4 in [1], which implies that if $W_{B}=\varepsilon^{T}Q_{B}\varepsilon-\sigma^{2}\operatorname{trace}[Q_{B}]$ where $\varepsilon\sim N(0,\sigma^{2}I_{n\times n})$ and $Q_{B}$ is a symmetric matrix of size $n\times n$ for every $B$, then

$\mathbb{P}\Big(\sup_{B\in T^{*}}W_{B}\leq\mathbb{E}\sup_{B\in T^{*}}W_{B}+C\sigma\sqrt{x}\sup_{B\in T^{*}}\mathbb{E}\|Q_{B}\varepsilon\|+Cx\sigma^{2}\sup_{B\in T^{*}}\|Q_{B}\|_{op}\Big)\geq 1-2e^{-x}.$

For the third term, $Q_{B}=2(B-\bar{A})-(B-\bar{A})^{2}/2$, hence $\|Q_{B}\|_{op}\leq 6$ because $B,\bar{A}$ both have operator norm at most one. For the second term, since $T^{*}$ is a family of ordered linear smoothers, there exist extremal matrices $B_{0},B_{1}\in T^{*}$ such that $B_{0}\preceq B\preceq B_{1}$ for all $B\in T^{*}$; we then have $B-B_{0}\preceq B_{1}-B_{0}$ and

$\|Q_{B}\varepsilon\|\leq 3\|(B-\bar{A})\varepsilon\|\leq 3\|(B_{1}-B_{0})\varepsilon\|+3\|(B_{0}-\bar{A})\varepsilon\|\leq 3\|(B_{1}-\bar{A})\varepsilon\|+6\|(B_{0}-\bar{A})\varepsilon\|.$

Hence $\mathbb{E}\|Q_{B}\varepsilon\|\leq\mathbb{E}[\|Q_{B}\varepsilon\|^{2}]^{1/2}\leq 3\sigma\|B_{1}-\bar{A}\|_{F}+6\sigma\|B_{0}-\bar{A}\|_{F}\leq 9\delta^{*}$.

We finally apply a generic chaining upper bound to bound $\mathbb{E}\sup_{B\in T^{*}}W_{B}$. For any fixed $B_{0}\in T^{*}$ we have $\mathbb{E}[W_{B_{0}}]=0$, hence $\mathbb{E}\sup_{B\in T^{*}}W_{B}=\mathbb{E}\sup_{B\in T^{*}}(W_{B}-W_{B_{0}})$. For two matrices $A,B\in T^{*}$ we have $W_{B}-W_{A}=\varepsilon^{T}(Q_{B}-Q_{A})\varepsilon-\sigma^{2}\operatorname{trace}[Q_{B}-Q_{A}]$, and

$\varepsilon^{T}(Q_{B}-Q_{A})\varepsilon=\varepsilon^{T}[(B-A)(2I_{n\times n}-\tfrac{1}{2}(A+B-2\bar{A}))]\varepsilon,$

hence by the Hanson-Wright inequality (5.7), with probability at least $1-2e^{-x}$,

$|W_{B}-W_{A}|\leq 2\sigma^{2}\|(B-A)(2I_{n\times n}-\tfrac{1}{2}(A+B-2\bar{A}))\|_{F}(\sqrt{x}+x)\leq 8\sigma^{2}\|A-B\|_{F}(x+\sqrt{x}).$

Hence by the generic chaining bound given in Theorem 3.5 in [17], we get that

$\mathbb{E}\sup_{B\in T^{*}}|W_{B}-W_{B_{0}}|\leq C\sigma^{2}\left[\gamma_{1}(T^{*},\|\cdot\|_{F})+\gamma_{2}(T^{*},\|\cdot\|_{F})+\Delta(T^{*},\|\cdot\|_{F})\right].$

For each $\alpha=1,2$ we have $\gamma_{\alpha}(T^{*},\|\cdot\|_{F})\leq C\Delta(T^{*},\|\cdot\|_{F})$ by Lemma 5.2. Since $\sigma\|B-\bar{A}\|_{F}\leq\delta^{*}$ for any $B\in T^{*}$, we obtain $\Delta(T^{*},\|\cdot\|_{F})\leq 2\delta^{*}/\sigma$. ∎

Lemma 5.5.

Suppose $F$ is a family of $n\times n$ ordered linear smoothers (cf. Definition 1), and $\bar{A}$ is a fixed matrix with $\|\bar{A}\|_{op}\leq 1$ which may not belong to $F$. Let $d$ be the metric (5.1). Then for any real $u\geq 1$ and any $\delta^{*}>\delta_{*}\geq 0$, we have with probability at least $1-3e^{-u}$,

$\sup_{B\in F:\;\delta_{*}\leq d(B,\bar{A})<\delta^{*}}\left(Z_{B}-\tfrac{1}{2}d(B,\bar{A})^{2}\right)\leq C_{5}\left[\sigma^{2}u+\delta^{*}\sigma\sqrt{u}\right]-\tfrac{1}{2}\delta_{*}^{2}\leq C_{6}\sigma^{2}u+\tfrac{1}{16}(\delta^{*})^{2}-\tfrac{1}{2}\delta_{*}^{2}.$
Proof.

First note that $-d(B,\bar{A})^{2}\leq-\delta_{*}^{2}$ for any $B$ as in the supremum.

Now $Z_{B}=G_{B}+W_{B}$ where $G_{B}$ and $W_{B}$ are the processes studied in Lemmas 5.3 and 5.4. These lemmas applied to $T^{*}=\{B\in F:d(B,\bar{A})\leq\delta^{*}\}$ yield that on an event of probability at least $1-3e^{-u}$ we have

$\sup_{B\in T^{*}}Z_{B}\leq\sup_{B\in T^{*}}(G_{B}+W_{B})\leq C(\sigma\delta^{*}(1+\sqrt{u})+\sigma^{2}u).$

Since $u\geq 1$, we have established the first inequality by adjusting the absolute constant. For the second inequality, we use that $C_{5}\delta^{*}\sigma\sqrt{u}\leq 4C_{5}^{2}\sigma^{2}u+\tfrac{1}{16}(\delta^{*})^{2}$ and set $C_{6}=C_{5}+4C_{5}^{2}$. ∎

Lemma 5.6 (Slicing).

Suppose $F$ is a family of $n\times n$ ordered linear smoothers (cf. Definition 1), and $\bar{A}$ is a fixed matrix with $\|\bar{A}\|_{op}\leq 1$ which may not belong to $F$. Let $d$ be the metric (5.1). Then for any $x\geq 1$, we have with probability at least $1-C_{3}e^{-x}$,

$\sup_{B\in F}\left(Z_{B}-\tfrac{1}{2}d(B,\bar{A})^{2}\right)\leq C_{2}\sigma^{2}x.$
Proof.

We use here a method known as slicing; we refer the reader to Section 5.4 in [36] for an introduction. Write $F$ as the union

$F=\cup_{k=1}^{\infty}T_{k}\quad\text{ where }T_{k}\text{ is the slice }\quad T_{k}=\{B\in F:\delta_{k-1}\leq d(B,\bar{A})\leq\delta_{k}\},$

with $\delta_{0}=0$ and $\delta_{k}=2^{k}\sigma$ for $k\geq 1$. By definition of the geometric sequence $(\delta_{k})_{k\geq 0}$, the inequality $\frac{1}{16}\delta_{k}^{2}-\frac{1}{2}\delta_{k-1}^{2}\leq\frac{1}{2}\sigma^{2}-\frac{1}{16}\delta_{k}^{2}\leq\frac{1}{2}\sigma^{2}x-\frac{1}{16}\delta_{k}^{2}$ holds for all $k\geq 1$. With $\delta_{*}=\delta_{k-1}$ and $\delta^{*}=\delta_{k}$, Lemma 5.5 yields that for all $k\geq 1$,

$\mathbb{P}\Big(\sup_{B\in T_{k}}(Z_{B}-\tfrac{1}{2}d(B,\bar{A})^{2})\leq C_{6}\sigma^{2}u_{k}-\tfrac{1}{16}\delta_{k}^{2}+\tfrac{\sigma^{2}x}{2}\Big)\geq 1-3e^{-u_{k}}$

for any $u_{k}\geq 1$. The above holds simultaneously over all slices $(T_{k})_{k\geq 1}$ with probability at least $1-3\sum_{k=1}^{\infty}e^{-u_{k}}$ by the union bound. It remains to specify a sequence $(u_{k})_{k\geq 1}$ of reals greater than 1. We choose $u_{k}=x+\delta_{k}^{2}/(16C_{6}\sigma^{2})$, which is greater than 1 since $x\geq 1$. Then by construction, $C_{6}\sigma^{2}u_{k}-\tfrac{1}{16}\delta_{k}^{2}+\tfrac{\sigma^{2}x}{2}=(C_{6}+1/2)\sigma^{2}x$ and we set $C_{2}=C_{6}+1/2$. Furthermore, $\sum_{k=1}^{\infty}e^{-u_{k}}=e^{-x}\sum_{k=1}^{\infty}e^{-2^{2k}/(16C_{6})}$. The sum $3\sum_{k=1}^{\infty}e^{-2^{2k}/(16C_{6})}$ is equal to a finite absolute constant named $C_{3}$ in the statement of the lemma. ∎

Proof of Theorem 3.1.

Let $F=\{A_{1},...,A_{M}\}$ and $\bar{A}=A_{j_{*}}$ where $j_{*}$ is defined in the statement of Theorem 3.1. The conclusion of Lemma 5.1 can be rewritten as

$\|A_{\hat{\theta}}y-\mu\|^{2}-\|\bar{A}y-\mu\|^{2}\leq\sup_{B\in F}(Z_{B}-\tfrac{1}{2}d(B,\bar{A})^{2})$

where $F=\{A_{1},...,A_{M}\}$ is a family of ordered linear smoothers. Lemma 5.6 completes the proof of (3.2). Then (3.1) is obtained by integration of (3.2), using $\mathbb{E}[Z]\leq\int_{0}^{\infty}\mathbb{P}(Z>t)dt$ for any $Z\geq 0$. ∎

Proof of Theorem 4.1.

As in the proof of Theorem 3.1, we use Lemma 5.1 to deduce that almost surely,

$\|A_{\hat{\theta}}y-\mu\|^{2}-\|\bar{A}y-\mu\|^{2}\leq\max_{j=1,...,M}(Z_{A_{j}}-\tfrac{1}{2}d(A_{j},\bar{A})^{2})\leq\max_{k=1,...,q}\sup_{B\in F_{k}}(Z_{B}-\tfrac{1}{2}d(B,\bar{A})^{2}).$

Since each $F_{k}$ is a family of ordered linear smoothers, by Lemma 5.6 we have

$\mathbb{P}\big(\sup_{B\in F_{k}}(Z_{B}-\tfrac{1}{2}d(B,\bar{A})^{2})>C_{2}\sigma^{2}x\big)\leq C_{3}e^{-x}\qquad\text{ for each }k=1,\ldots,q.$

The union bound yields (4.2) and we use $\mathbb{E}[Z]\leq\int_{0}^{\infty}\mathbb{P}(Z>t)dt$ for $Z\geq 0$ to deduce (4.1). ∎

References

  • Adamczak [2015] Radoslaw Adamczak. A note on the hanson-wright inequality for random vectors with dependencies. Electronic Communications in Probability, 20, 2015.
  • Arlot and Bach [2009] Sylvain Arlot and Francis R Bach. Data-driven calibration of linear estimators with minimal penalties. In Advances in Neural Information Processing Systems, pages 46–54, 2009.
  • Arlot and Celisse [2010] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics surveys, 4:40–79, 2010.
  • Bellec [2014] Pierre C. Bellec. Concentration of quadratic forms under a bernstein moment assumption. Technical report. Arxiv:1901.08726, 2014. URL https://arxiv.org/pdf/1901.08736.
  • Bellec [2018] Pierre C. Bellec. Optimal bounds for aggregation of affine estimators. Ann. Statist., 46(1):30–59, 02 2018. doi: 10.1214/17-AOS1540. URL https://arxiv.org/pdf/1410.0346.pdf.
  • Belloni et al. [2014] Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Pivotal estimation via square-root lasso in nonparametric regression. Ann. Statist., 42(2):757–788, 04 2014. URL http://dx.doi.org/10.1214/14-AOS1204.
  • Boucheron et al. [2013] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press, 2013.
  • Boyd and Vandenberghe [2009] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2009.
  • Brown et al. [2007] Lawrence D Brown, Michael Levine, et al. Variance estimation in nonparametric regression via the difference sequence method. The Annals of Statistics, 35(5):2219–2232, 2007.
  • Chernousova et al. [2013] Elena Chernousova, Yuri Golubev, Ekaterina Krymova, et al. Ordered smoothers with exponential weighting. Electronic journal of statistics, 7:2395–2419, 2013.
  • Cohen [1966] Arthur Cohen. All admissible linear estimates of the mean vector. The Annals of Mathematical Statistics, pages 458–463, 1966.
  • Craven and Wahba [1978] Peter Craven and Grace Wahba. Smoothing noisy data with spline functions. Numerische mathematik, 31(4):377–403, 1978.
  • Dai et al. [2012] D. Dai, P. Rigollet, and T. Zhang. Deviation optimal learning using greedy Q-aggregation. The Annals of Statistics, 40(3):1878–1905, 2012.
  • Dai et al. [2014] D. Dai, P. Rigollet, Xia L., and Zhang T. Aggregation of affine estimators. Electon. J. Stat., 8:302–327, 2014.
  • Dalalyan and Salmon [2012] Arnak S. Dalalyan and Joseph Salmon. Sharp oracle inequalities for aggregation of affine estimators. The Annals of Statistics, 40(4):2327–2355, 2012.
  • Dette et al. [1998] Holger Dette, Axel Munk, and Thorsten Wagner. Estimating the variance in nonparametric regression—what is a reasonable choice? Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(4):751–764, 1998.
  • Dirksen [2015] Sjoerd Dirksen. Tail bounds via generic chaining. Electronic Journal of Probability, 20, 2015.
  • Golub et al. [1979] Gene H Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
  • Golubev et al. [2010] Yuri Golubev et al. On universal oracle inequalities related to high-dimensional linear models. The Annals of Statistics, 38(5):2751–2780, 2010.
  • Hall et al. [1990] Peter Hall, J. W. Kay, and D. M. Titterington. Asymptotically optimal difference-based estimation of variance in nonparametric regression. Biometrika, 77(3):521–528, 1990.
  • Hastie et al. [2001] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, 2001.
  • Hsu et al. [2012] Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab., 17:no. 52, 1–6, 2012. doi: 10.1214/ECP.v17-2079. URL http://ecp.ejpecp.org/article/view/2079.
  • Kneip [1994] Alois Kneip. Ordered linear smoothers. The Annals of Statistics, 22(2):835–866, 1994.
  • Leung and Barron [2006] Gilbert Leung and Andrew R. Barron. Information theory and mixing least-squares regressions. Information Theory, IEEE Transactions on, 52(8):3396–3410, 2006.
  • Li [1986] Ker-Chau Li. Asymptotic optimality of $C_L$ and generalized cross-validation in ridge regression with application to spline smoothing. The Annals of Statistics, 14(3):1101–1112, 1986.
  • Mallows [1973] Colin L. Mallows. Some comments on $C_p$. Technometrics, 15(4):661–675, 1973.
  • Munk et al. [2005] Axel Munk, Nicolai Bissantz, Thorsten Wagner, and Gudrun Freitag. On difference-based variance estimation in nonparametric regression when the covariate is high dimensional. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):19–41, 2005.
  • Nemirovski [2000] Arkadi Nemirovski. Topics in non-parametric statistics. In Lectures on probability theory and statistics (Saint-Flour, 1998), volume 1738 of Lecture Notes in Mathematics. Springer, Berlin, 2000.
  • Owen [2007] Art B Owen. A robust hybrid of lasso and ridge regression. Technical report, Stanford University, 2007.
  • Rigollet and Tsybakov [2007] Ph. Rigollet and A. B. Tsybakov. Linear and convex aggregation of density estimators. Math. Methods Statist., 16(3):260–280, 2007. doi: 10.3103/S1066530707030052. URL http://dx.doi.org/10.3103/S1066530707030052.
  • Rigollet [2012] Philippe Rigollet. Kullback–Leibler aggregation and misspecified generalized linear models. Ann. Statist., 40(2):639–665, 2012. doi: 10.1214/11-AOS961.
  • Rudelson and Vershynin [2013] Mark Rudelson and Roman Vershynin. Hanson-wright inequality and sub-gaussian concentration. Electron. Commun. Probab., 18:no. 82, 1–9, 2013. URL http://ecp.ejpecp.org/article/view/2865.
  • Sun and Zhang [2012] Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, 2012.
  • Tsybakov [2014] A.B. Tsybakov. Aggregation and minimax optimality in high dimensional estimation. Proceedings of International Congress of Mathematicians (Seoul, 2014), 3:225–246, 2014.
  • Tsybakov [2003] Alexandre B. Tsybakov. Optimal rates of aggregation. In Learning Theory and Kernel Machines, pages 303–313. Springer, 2003.
  • van Handel [2014] Ramon van Handel. Probability in high dimension. Technical report, PRINCETON UNIV NJ, 2014.
  • Vavasis [2001] Stephen A. Vavasis. Complexity Theory: Quadratic Programming, pages 304–307. Springer US, Boston, MA, 2001. ISBN 978-0-306-48332-5. doi: 10.1007/0-306-48332-7_65. URL https://doi.org/10.1007/0-306-48332-7_65.
  • Vershynin [2018] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.