
Kernel-based estimation for partially functional linear model: Minimax rates and randomized sketches

\nameShaogao Lv \emaillvsg716@swufe.edu.cn
\addrCollege of Statistics and Mathematics
Nanjing Audit University
Nanjing, China \AND\nameXin He \emailhe.xin17@mail.shufe.edu.cn
\addrSchool of Statistics and Management
Shanghai University of Finance and Economics
Shanghai, China \AND\nameJunhui Wang \emailj.h.wang@cityu.edu.hk
\addrSchool of Data Science
City University of Hong Kong
Kowloon Tong, Kowloon, Hong Kong
Abstract

This paper considers the partially functional linear model (PFLM), in which the predictive features consist of a functional covariate and a high-dimensional scalar vector. Over an infinite-dimensional reproducing kernel Hilbert space, the proposed estimator for PFLM is a least squares approach with two mixed regularizers: the square of a functional norm and an \ell_{1}-norm. Our main task in this paper is to establish the minimax rates for PFLM under the high-dimensional setting; the optimal minimax rates of estimation are established by using various techniques from empirical process theory for analyzing kernel classes. In addition, we propose an efficient numerical algorithm based on randomized sketches of the kernel matrix. Several numerical experiments are implemented to support our method and optimization strategy.

Keywords: Functional linear models, minimax rates, sparsity, randomized sketches, reproducing kernel Hilbert space.

1 Introduction

In the problem of functional linear regression, a single functional feature X()X(\cdot) is assumed to be square-integrable over an interval 𝒯\mathcal{T}, and the classical functional linear regression between the response YY and XX is given as

Y=X,f2+ε,\displaystyle Y=\langle X,f^{*}\rangle_{\mathcal{L}_{2}}+\varepsilon, (1.1)

where the inner product ⟨·,·⟩_{L_2} is defined as ⟨f,g⟩_{L_2} := ∫_𝒯 f(t)g(t)dt for any f,g ∈ L_2(𝒯). Here f^* is some slope function within L_2(𝒯) and ε denotes a zero-mean error term. Let {(Y_i,X_i): i=1,...,n} denote independent and identically distributed (i.i.d.) realizations from the population (Y,X). There is an extensive literature on estimating the slope function f^*, or the value of ⟨X,f^*⟩_{L_2}.

In practice, it is often the case that a response is affected by both a high-dimensional scalar vector and some random functional variables as predictive features. These scenarios partially motivate us to study PFLM under the high-dimensional setting. To simplify notation, this paper assumes that Y and X(·) are centered. To be more precise, we are concerned with partially functional linear regression with the functional feature X and scalar predictors Z = (Z_1,...,Z_p)^T ∈ R^p, and a linear model links the response Y and the predictive features U = (X, Z) via

Y=X,f2+𝐙T𝜸+ε,Y=\langle X,f^{*}\rangle_{\mathcal{L}_{2}}+{\bf Z}^{T}\boldsymbol{\gamma}^{*}+\varepsilon, (1.2)

where γ^* = (γ^*_1,...,γ^*_p)^T denotes the regression coefficients of the scalar covariates, and ε is a standard normal variable independent of X and Z. Under the sparse high-dimensional setting, a standard assumption is that the cardinality of the active set S_0 := {j: γ^*_j ≠ 0, j=1,...,p} is far smaller than p, while p and p_0 := |S_0| are allowed to diverge as the sample size n increases. In fact, estimation and variable selection for partially functional linear models have been investigated via FPCA-based methods by Shin and Lee (2012), Lu et al. (2014) and Kong et al. (2016).

In this paper, we focus on a least squares regularized estimation of the slope function and the regression coefficients in (1.2) under a kernel-based framework and a high-dimensional setting. The estimators are obtained by combining the least squares loss with an ℓ_1-type penalty and the square of a functional norm, where the former penalty acts on the regression coefficients and the latter controls the kernel complexity. The optimal minimax rates of estimation are established by using various techniques from empirical process theory for analyzing kernel classes, and an efficient numerical algorithm based on randomized sketches of the kernel matrix is implemented to verify our theoretical findings.

1.1 Our Contributions

This paper makes three main contributions to this functional modeling literature.

Our first contribution is Theorem 1, which states that with high probability, under mild regularity conditions, the prediction error of our procedure under the squared L_2-norm is bounded by (p_0 log p / n + n^{-2r/(2r+1)}), where the quantity r > 1/2 characterizes the kernel complexity of the composite kernel K^{1/2}CK^{1/2}. The proof of this upper bound involves two different penalties for analyzing the obtained estimator in high dimensions, and we emphasize that it is very hard to establish the constraint cone set that is often used to define a critical condition (the restricted eigenvalue constant) for high-dimensional problems (Bickel, Ritov, and Tsybakov, 2009; Verzelen, 2012). To handle this technical difficulty, we combine the methods used in Müller and Van de Geer (2015) for high-dimensional partial linear models with various techniques from empirical process theory for analyzing kernel classes (Aronszajn, 1950; Cai and Yuan, 2012; Yuan and Cai, 2010; Zhu et al., 2014).

Our second contribution is to establish algorithm-independent minimax lower bounds under the squared L_2-norm. These minimax lower bounds, stated in Theorem 3, are determined by the metric entropy of the composite kernel K^{1/2}CK^{1/2} and the sparsity structure of the high-dimensional scalar coefficients. For commonly used kernels, including the Sobolev classes, these lower bounds match our achievable rates, showing the optimality of our estimator for PFLM. It is worth noting that the lower bound for the parametric part does not depend on the nonparametric smoothness indices, coinciding with the classical sparse estimation rate in high-dimensional linear models (Verzelen, 2012). By contrast, the lower bound for estimating f^* turns out to be affected by the regression coefficients γ^*. The proof of Theorem 3 is based on characterizing the packing entropies of the class of nonparametric kernel models and the interaction between the composite kernel and the high-dimensional scalar vector, combined with classical information-theoretic techniques involving Fano's inequality and its variants (Yang and Barron, 1999; Van de Geer, 2000; Tsybakov, 2009).

Our third contribution is to consider randomized sketches of our original estimator, with sketch dimension tied to the statistical dimension. Despite the attractive statistical properties stated above, the computational complexity of computing the original estimator prevents it from being routinely used in large-scale problems. In fact, a standard implementation of any kernel estimator has time complexity O(n^3) and space complexity O(n^2). To this end, we employ the random projection and sketching techniques developed in Yang et al. (2017) and Mahoney (2011), where the n-dimensional kernel matrix is approximated by projecting its row and column subspaces onto a randomly chosen m-dimensional subspace with m ≪ n. We choose the sketch dimension m proportional to the statistical dimension, under which the resulting estimator has comparable numerical performance.

1.2 Related Work

A class of conventional estimation procedures for functional linear regression in the statistical literature is based on functional principal component analysis (FPCA) or spline functions; see (Ramsay and Silverman, 2005; Ferraty and Vieu, 2006; Kong, Xue, Yao, and Zhang, 2016) and (Cardot, Ferraty, and Sarda, 2003) for details. These truncation approaches to handling an infinite-dimensional function depend only on information from the feature X. In particular, the commonly used FPCA methods form a basis for the slope function f^* that is determined solely by the empirical covariance of the observed feature X, and this basis may not be an efficient representation for approximating f^*, since the slope function f^* and the leading functional principal components are essentially unrelated. Similar problems also arise when spline-based finite representations are used.

To avoid an inappropriate representation of the slope function, reproducing kernel methods are known to be a family of powerful tools for directly estimating infinite-dimensional functions. When the slope function is assumed to reside in a reproducing kernel Hilbert space (RKHS), denoted by H_K, several existing works (Yuan and Cai, 2010; Cai and Yuan, 2012; Zhu, Yao, and Zhang, 2014) on functional linear or additive regression have proved that the minimax rate of convergence depends on both the kernel K and the covariance function C of the functional feature X. In particular, the alignment of K and C can significantly affect the optimal rate of convergence. However, it is well known that kernel-based methods suffer from heavy storage costs and computational burden. Specifically, kernel-based methods need to store an n × n matrix before running any algorithm and are therefore limited to small-scale problems.

1.3 Paper organization

The rest of this paper is organized as follows. Section 2 introduces some notation and background on kernel methods, and formulates the proposed kernel-based regularized estimation method. Section 3 is devoted to establishing the minimax rates of the prediction problem for PFLM and provides a detailed discussion of the obtained results, including the desired convergence rate of the upper bounds and a matching set of minimax lower bounds. In Section 4, a general sketching-based strategy is provided, and an approximate algorithm for solving (2.2) is employed. Several numerical experiments are implemented in Section 5 to support the proposed approach and the employed optimization strategy. A brief summary of this paper is provided in Section 6. Appendix A contains the core proofs of the main results, including the technical proofs of Theorems 1-3. Some useful lemmas and more technical details are provided in Appendix B.

2 Problem Statement and Proposed Method

2.1 Notation

Let u, v be two general random variables, denote the joint distribution of (u,v) by Q, and denote the marginal distribution of u (v) by Q_u (Q_v). For a measurable function f: u × v → ℝ, we define the squared L_2-norm by ‖f‖^2 := E_Q f^2(u,v), and the squared empirical norm is given by ‖f‖_n^2 := (1/n)Σ_{i=1}^n f^2(u_i,v_i), where {(u_i,v_i)}_{i=1}^n are i.i.d. copies of (u,v). Note that Q may differ from line to line. For a vector γ ∈ R^p, the ℓ_1-norm and ℓ_2-norm are given by ‖γ‖_1 := Σ_{j=1}^p |γ_j| and ‖γ‖_2 := (Σ_{j=1}^p γ_j^2)^{1/2}, respectively. With a slight abuse of notation, we write ‖f‖_{L_2}^2 := ⟨f,f⟩_{L_2} with ⟨f,g⟩_{L_2} = ∫_𝒯 f(t)g(t)dt. For two sequences {a_k: k ≥ 1} and {b_k: k ≥ 1}, a_k ≲ b_k (or a_k = O(b_k)) means that there exists some constant c such that a_k ≤ c b_k for all k ≥ 1. Also, we write a_k ≳ b_k if there is some positive constant c such that a_k ≥ c b_k for all k ≥ 1. Accordingly, we write a_k ≍ b_k if both a_k ≲ b_k and a_k ≳ b_k hold.

2.2 Kernel Method

Kernel methods are among the most powerful learning schemes in machine learning, and often take the form of regularization schemes in a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel (Aronszajn, 1950). A major advantage of employing kernel methods is that the corresponding optimization task over an infinite-dimensional RKHS is equivalent to an n-dimensional optimization problem, thanks to the so-called reproducing property.

Recall that a kernel K(·,·): 𝒯 × 𝒯 → R is a continuous, symmetric, and positive semi-definite function. Let H_K be the closure of the linear span of the functions {K_t(·) := K(t,·), t ∈ 𝒯} endowed with the inner product ⟨Σ_{i=1}^n α_i K_{t_i}, Σ_{j=1}^n β_j K_{t_j}⟩_K := Σ_{i,j=1}^n α_i β_j K(t_i,t_j), for any {t_i}_{i=1}^n ∈ 𝒯^n and n ∈ ℕ^+. An important property of H_K is the reproducing property, which states that

f(t)=f,KtK,for anyfK.f(t)=\langle f,{K}_{t}\rangle_{{K}},\,\,\,\hbox{for any}\,f\in\mathcal{H}_{{K}}.

This property ensures that an RKHS inherits many nice properties from standard finite-dimensional Euclidean spaces. Throughout this paper, we assume that the slope function f^* resides in a specified RKHS, still denoted by H_K. In addition, another RKHS can be naturally induced by the stochastic process X(·). Without loss of generality, we assume that X(·) is square integrable over 𝒯 with zero mean, and thus the covariance function of X, defined as

C(s,t)=𝔼[X(s)X(t)],t,s𝒯,C(s,t)=\mathbb{E}[X(s)X(t)],\quad\forall\,t,\,s\in\mathcal{T},

is also a real, positive semi-definite kernel.

Note that the kernel complexity is characterized explicitly by a kernel-induced integral operator. Precisely, for any kernel K(,):𝒯×𝒯{K(\cdot,\cdot)}:\mathcal{T}\times\mathcal{T}\rightarrow{\cal R}, we define the integral operator LK:2(𝒯)2(𝒯)L_{{K}}:\mathcal{L}_{2}(\mathcal{T})\rightarrow\mathcal{L}_{2}(\mathcal{T}) by

LK(f)()=𝒯K(s,)f(s)𝑑s.L_{{K}}(f)(\cdot)=\int_{\mathcal{T}}{K}(s,\cdot)f(s)ds.

By the reproducing property, LKL_{{K}} can be equivalently defined as

f,LK(g)K=f,g2,fK,g2(𝒯).\langle f,L_{{K}}(g)\rangle_{K}=\langle f,g\rangle_{\mathcal{L}_{2}},\quad\forall\,f\in\mathcal{H}_{{K}},\,g\in\mathcal{L}_{2}(\mathcal{T}).

Since the operator L_K is linear, bounded and self-adjoint on L_2(𝒯), the spectral theorem implies that there exist a family of orthonormal eigenfunctions {φ_ℓ^K: ℓ ≥ 1} and a sequence of eigenvalues θ_1^K ≥ θ_2^K ≥ ... > 0 such that

K(s,t)=1θKϕK(s)ϕK(t),s,t𝒯,{K}(s,t)=\sum_{\ell\geq 1}\theta_{\ell}^{{K}}\phi^{{K}}_{\ell}(s)\phi^{{K}}_{\ell}(t),\quad s,\,t\in\mathcal{T},

and thus by definition, it holds

LK(ϕK)=θKϕK,=1,2,L_{{K}}(\phi^{{K}}_{\ell})=\theta_{\ell}^{{K}}\phi^{{K}}_{\ell},\quad\ell=1,2,...

Based on the semi-definiteness of LKL_{{K}}, we can always decompose it into the following form

LK=LK1/2LK1/2,L_{{K}}=L_{{K}}^{1/2}\circ L_{{K}}^{1/2},

where LK1/2L_{{K}^{1/2}} is also a kernel-induced integral operator associated with a fractional kernel K1/2{K}^{1/2} that

K1/2(s,t):=1θKϕK(s)ϕK(t),s,t𝒯.{K}^{1/2}(s,t):=\sum_{\ell\geq 1}\sqrt{\theta_{\ell}^{{K}}}\phi^{{K}}_{\ell}(s)\phi^{{K}}_{\ell}(t),\quad s,\,t\in\mathcal{T}.

Also, it holds

LK1/2(ϕK):=θKϕK.L_{{K}^{1/2}}(\phi^{K}_{\ell}):=\sqrt{\theta_{\ell}^{{K}}}\phi^{{K}}_{\ell}.

Given two kernels K1,K2K_{1},K_{2}, we define

(K1K2)(s,t):=𝒯K1(s,u)K2(t,u)𝑑u,(K_{1}K_{2})(s,t):=\int_{\mathcal{T}}K_{1}(s,u)K_{2}(t,u)du,

and then it holds LK1K2=LK1LK2L_{K_{1}K_{2}}=L_{K_{1}}\circ L_{K_{2}}. Note that K1K2K_{1}K_{2} is not necessarily a symmetric kernel.

In the rest of this paper, we focus on the RKHS H_K in which the slope function f^* in (1.2) resides. Given the kernel K and the covariance function C, and using the above notation, we define the linear operator L_{K^{1/2}CK^{1/2}} by

LK1/2CK1/2:=LK1/2LCLK1/2.L_{K^{1/2}CK^{1/2}}:=L_{K^{1/2}}\circ L_{C}\circ L_{K^{1/2}}.

Since both operators L_{K^{1/2}} and L_C are linear, bounded and self-adjoint, so is L_{K^{1/2}CK^{1/2}}. By the spectral theorem, there exist a sequence of positive eigenvalues s_1 ≥ s_2 ≥ ... > 0 and a set of orthonormal eigenfunctions {φ_ℓ: ℓ ≥ 1} such that

K1/2CK1/2(s,t)=1sφ(s)φ(t),s,t𝒯,K^{1/2}CK^{1/2}(s,t)=\sum_{\ell\geq 1}s_{\ell}\varphi_{\ell}(s)\varphi_{\ell}(t),\quad\forall\,s,t\in\mathcal{T},

and particularly

LK1/2CK1/2(φ)=sφ,=1,2,L_{K^{1/2}CK^{1/2}}(\varphi_{\ell})=s_{\ell}\varphi_{\ell},\quad\ell=1,2,...

It is worthwhile to note that the eigenvalues {s:1}\{s_{\ell}:\ell\geq 1\} of the linear operator LK1/2CK1/2L_{K^{1/2}CK^{1/2}} depend on the eigenvalues of both the reproducing kernel KK and the covariance function CC. We shall show in Section 3 that the minimax rate of convergence of the excess prediction risk is determined by the decay rate of the eigenvalues {s:1}\{s_{\ell}:\ell\geq 1\}.
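The decay of the eigenvalues s_ℓ can be inspected numerically for a given pair (K, C). The following Python sketch is for illustration only: kernel_K and cov_C are hypothetical vectorized functions, and the quadrature is a crude Riemann sum. It discretizes L_{K^{1/2}}, L_C and their composition on a grid of 𝒯 = [0,1] and returns approximate values of s_ℓ.

```python
import numpy as np

def composite_eigenvalues(kernel_K, cov_C, grid_size=500):
    """Numerically approximate the eigenvalues s_l of K^{1/2} C K^{1/2} on [0,1]
    by discretizing the integral operators L_K and L_C on a uniform grid (a rough sketch)."""
    t = np.linspace(0.0, 1.0, grid_size)
    w = 1.0 / grid_size                          # quadrature weight of each grid point
    LK = kernel_K(t[:, None], t[None, :]) * w    # discretized L_K
    LC = cov_C(t[:, None], t[None, :]) * w       # discretized L_C
    # symmetric square root of the discretized L_K approximates L_{K^{1/2}}
    evals, evecs = np.linalg.eigh((LK + LK.T) / 2.0)
    LK_half = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    M = LK_half @ LC @ LK_half                   # discretized L_{K^{1/2} C K^{1/2}}
    return np.sort(np.linalg.eigvalsh(M))[::-1]  # approximations of s_1 >= s_2 >= ...
```

Plotting the returned values on a log-log scale gives a rough check of the polynomial decay postulated in Condition D below.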

2.3 Regularized Estimation and Randomized Sketches

Given the sample {Yi,(Xi,𝐙i)}i=1n\{Y_{i},(X_{i},{\bf Z}_{i})\}_{i=1}^{n} which are independently drawn from (1.2), the proposed estimation procedure for PFLM (1.2) is formulated in a least square regularization scheme by solving

(f^,𝜸^)=argminfK,𝜸p{1ni=1n(YiXi,f2𝐙iT𝜸)2+μ2fK2+λ𝜸1},(\widehat{f},\widehat{\boldsymbol{\gamma}})=\mathop{\rm argmin}_{f\in\mathcal{H}_{K},\boldsymbol{\gamma}\in{\cal R}^{p}}\Big{\{}\frac{1}{n}\sum_{i=1}^{n}\big{(}Y_{i}-\langle X_{i},f\rangle_{\mathcal{L}_{2}}-{\bf Z}^{T}_{i}\boldsymbol{\gamma}\big{)}^{2}+\mu^{2}\|f\|^{2}_{K}+\lambda\|\boldsymbol{\gamma}\|_{1}\Big{\}}, (2.1)

where the parameter μ^2 > 0 controls the smoothness of the nonparametric component, and λ > 0, associated with the ℓ_1-type penalty, is used to induce sparsity in the scalar covariates.

Note that although the proposed estimation procedure (2.1) is formulated within an infinite-dimensional Hilbert space, the following lemma shows that this optimization task is equivalent to a finite-dimensional minimization problem.

Lemma 1

The proposed estimation procedure (2.1) defined on H_K × R^p is equivalent to a finite-dimensional parametric convex optimization problem. That is, f̂(t) = Σ_{k=1}^n α_k B_k(t) with unknown coefficients α = (α_1,...,α_n)^T, for any t ∈ 𝒯. Here each basis function B_k(t) = ⟨X_k, K(t,·)⟩_{L_2(𝒯)} ∈ H_K, k = 1,...,n.

To rewrite the minimization problem (2.1) in matrix form, we define an n × n positive semi-definite matrix 𝕂^c = (K^c_{ik})_{i,k=1}^n with K^c_{ik} := ⟨X_i, B_k⟩_{L_2} = ∬ X_k(u)X_i(t)K(t,u)du dt, and by the reproducing property of K, we also have ⟨B_i, B_k⟩_K = K^c_{ik}, i,k = 1,...,n. Thus, by Lemma 1, the matrix form of (2.1) is given as

min𝜶n,𝜸p1n𝐲𝕂c𝜶𝜸22+μ2𝜶T𝕂c𝜶+λ𝜸1,\displaystyle\min_{\boldsymbol{\alpha}\in{\cal R}^{n},\boldsymbol{\gamma}\in{\cal R}^{p}}\frac{1}{n}\big{\|}\mathbf{y}-\mathbb{K}^{c}\boldsymbol{\alpha}-\mathbb{Z}\boldsymbol{\gamma}\big{\|}_{2}^{2}+\mu^{2}\boldsymbol{\alpha}^{T}\mathbb{K}^{c}\boldsymbol{\alpha}+\lambda\|\boldsymbol{\gamma}\|_{1}, (2.2)

where ℤ ∈ R^{n×p} denotes the design matrix of Z. Since the unconstrained problem (2.2) is convex in both α and γ, standard alternating optimization (Boyd and Vandenberghe, 2004) can be applied directly to approximate a global minimizer of (2.2). Yet, because 𝕂^c is an n × n matrix, both the computational cost and the storage burden of a standard implementation are heavy, of orders O(n^3) and O(n^2), respectively. To alleviate this computational issue, we propose an approximate numerical optimization in place of (2.2) in Section 4. Precisely, a class of general random projections is adopted to compress the original kernel matrix 𝕂^c and improve the computational efficiency.
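To make the finite-dimensional problem (2.2) concrete, the entries K^c_{ik} = ∬ X_i(t)K(t,u)X_k(u) dt du can be approximated by Riemann sums when the trajectories X_i are observed on a common uniform grid, as in the numerical experiments of Section 5. The following minimal sketch illustrates this; the function and argument names are ours, not part of the proposed method.

```python
import numpy as np

def build_Kc(X, t_grid, kernel):
    """Approximate the n x n matrix K^c with entries <X_i, B_k>_{L_2} by a Riemann sum.
    X has shape (n, T) with X[i, j] = X_i(t_grid[j]); kernel(s, t) must be vectorized."""
    w = (t_grid[-1] - t_grid[0]) / (len(t_grid) - 1)   # uniform grid spacing
    Kmat = kernel(t_grid[:, None], t_grid[None, :])    # K(t_a, t_b), shape (T, T)
    XW = X * w                                         # absorb the quadrature weights
    return XW @ Kmat @ XW.T                            # entries approximate K^c_{ik}
```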

3 Main Results: Minimax Rates

In this section, we present the main theoretical results for the proposed estimator in the minimax sense. Specifically, we derive the minimax rates in terms of prediction error for the estimators in (2.1) under the high-dimensional and kernel-based frameworks. The first two theorems establish the convergence of the obtained estimators, while the last one provides an algorithm-independent lower bound on the prediction error.

3.1 Upper Bounds

We introduce the short-hand notation

𝒢:={g=X,f2+𝐙T𝜸,fK,𝜸p},\mathcal{G}:=\big{\{}g=\langle X,f\rangle_{\mathcal{L}_{2}}+{\bf Z}^{T}\boldsymbol{\gamma},\,\,f\in\mathcal{H}_{K},\,\boldsymbol{\gamma}\in{\cal R}^{p}\big{\}},

and the functional g^*(U) := ⟨X,f^*⟩_{L_2} + Z^T γ^* for U = (X, Z). With a slight abuse of notation, we sometimes also write 𝒢 := {g = (f,γ), f ∈ H_K, γ ∈ R^p}. To separate the scalar components and the functional component in our analysis, we define the projection of Z_j onto H_K as Π(Z_j|H_K) = argmin_{f∈H_K} ‖Z_j − ⟨X,f⟩_{L_2}‖^2. Let Π(Z_j|X) = ⟨X, Π(Z_j|H_K)⟩_{L_2} and Π_{Z|X} = (Π(Z_1|X),...,Π(Z_p|X))^T, and then denote Z̃ := Z − Π_{Z|X}, a random vector in R^p. For any g_1(U) := ⟨X,f_1⟩_{L_2} + Z^T γ_1 ∈ 𝒢 and g_2(U) := ⟨X,f_2⟩_{L_2} + Z^T γ_2 ∈ 𝒢, we have the following orthogonal decomposition:

g1(𝐔)g2(𝐔)\displaystyle g_{1}({\bf U})-g_{2}({\bf U}) =𝐙T(𝜸1𝜸2)+X,f2f12\displaystyle={\bf Z}^{T}(\boldsymbol{\gamma}_{1}-\boldsymbol{\gamma}_{2})+\langle X,f_{2}-f_{1}\rangle_{\mathcal{L}_{2}}
=𝐙~T(𝜸1𝜸2)+Π𝐙T|XT(𝜸1𝜸2)+X,f2f12,\displaystyle=\widetilde{\bf Z}^{T}(\boldsymbol{\gamma}_{1}-\boldsymbol{\gamma}_{2})+\Pi_{{\bf Z}^{T}|X}^{T}(\boldsymbol{\gamma}_{1}-\boldsymbol{\gamma}_{2})+\langle X,f_{2}-f_{1}\rangle_{\mathcal{L}_{2}},

and by the definition of projection, it holds

g1g22=𝐙~T(𝜸1𝜸2)2+Π𝐙|XT(𝜸1𝜸2)+X,f2f122.\displaystyle\|g_{1}-g_{2}\|^{2}=\|\widetilde{\bf Z}^{T}(\boldsymbol{\gamma}_{1}-\boldsymbol{\gamma}_{2})\|^{2}+\|\Pi_{{\bf Z}|X}^{T}(\boldsymbol{\gamma}_{1}-\boldsymbol{\gamma}_{2})+\langle X,f_{2}-f_{1}\rangle_{\mathcal{L}_{2}}\|^{2}. (3.1)

To establish the refined upper bounds of the prediction and estimation errors, we summarize and discuss the main conditions needed in the theoretical analysis below.

Condition A(Eigenvalues condition). The smallest eigenvalue Λmin2\Lambda^{2}_{min} of 𝔼[𝐙~𝐙~T]\mathbb{E}[\widetilde{\bf Z}\widetilde{\bf Z}^{T}] is positive, and the largest eigenvalue Λmax2\Lambda^{2}_{max} of 𝔼[Π𝐙|XΠ𝐙|XT]\mathbb{E}[\Pi_{{\bf Z}|X}\Pi_{{\bf Z}|X}^{T}] is finite.

Condition B(Design condition). For some positive constants Cz,Cπ,ChC_{z},C_{\pi},C_{h}, there holds:

|Zj|Cz,Π(Zj|X)Cπ,andΠ(Zj|K)KCh,for anyj=1,,p.|Z_{j}|\leq C_{z},\,\,\|\Pi(Z_{j}|X)\|_{\infty}\leq C_{\pi},\,\hbox{and}\,\,\|\Pi(Z_{j}|\mathcal{H}_{K})\|_{K}\leq C_{h},\quad\mbox{for any}\,j=1,...,p.

Condition C(Light tail condition). There exist two constants c1,c2c_{1},\,c_{2} such that

{LK1/2X2t}c1exp(c2t2),for anyt>0.\mathbb{P}\{\|L_{K^{1/2}}X\|_{\mathcal{L}_{2}}\geq t\}\leq c_{1}\exp(-c_{2}t^{2}),\quad\mbox{for any}\ t>0.

Condition D(Entropy condition). For some constant 1/2<r<1/2<r<\infty, the sequence of eigenvalues ss_{\ell} satisfy that

s2r,+.s_{\ell}\asymp\ell^{-2r},\quad\ell\in\mathbb{N}^{+}.

Condition A is commonly used in the literature on semiparametric modelling; see (Müller and Van de Geer, 2015) for reference. This condition ensures that there is enough information in the data to identify the parameters in the scalar part. Condition B imposes some boundedness assumptions, which are not essential and are used only to simplify the technical proofs. Condition C implies that the random process L_{K^{1/2}}X has an exponential decay rate; the same condition is also considered in Cai and Yuan (2012). In particular, it is naturally satisfied if X is a Gaussian process. In Condition D, the parameters s_ℓ are related to the alignment between K and C, which plays an important role in determining the minimax optimal rates. Moreover, the decay of s_ℓ characterizes the kernel complexity and is closely related to various covering numbers and Rademacher complexities. In particular, the polynomial decay assumed in Condition D is satisfied by the classical Sobolev and Besov classes.

The following theorem states that with an appropriately chosen (μ,λ)(\mu,\lambda), the predictor g^:=X,f^2+𝐙T𝜸^\widehat{g}:=\langle X,\widehat{f}\rangle_{\mathcal{L}_{2}}+{\bf Z}^{T}\widehat{\boldsymbol{\gamma}} attains a sharp convergence rate under L2L_{2}-norm.

Theorem 1

Suppose that Conditions A-D hold, and that the tuning parameters (μ,λ) are chosen such that

μnr2r+1+log(2p)/n,λlog(2p)/n.\mu\asymp n^{-\frac{r}{2r+1}}+\sqrt{\log(2p)/n},\,\,\lambda\asymp\sqrt{\log(2p)/n}.

Then with probability at least 12exp[n(δ1′′)2μ2]1-2\exp[-n(\delta_{1}^{\prime\prime})^{2}\mu^{2}], the proposed estimation for PFLM satisfies

g^g2(n2r2r+1+p0log(2p)n),\|\widehat{g}-g^{*}\|^{2}\lesssim\Big{(}n^{-\frac{2r}{2r+1}}+\frac{p_{0}\log(2p)}{n}\Big{)},

where δ_1'' is an appropriately small positive constant.

Theorem 1 shows that the proposed estimator (2.1) achieves a fast convergence rate in terms of prediction error. Note that the derived rate depends on the kernel complexity of K^{1/2}CK^{1/2} and the sparsity of the scalar components. It is interesting to note that even when there exists some underlying correlation structure between the functional feature and the scalar covariates, the choice of the hyper-parameter μ depends on the structural information of all the features, while the sparsity hyper-parameter λ depends only on the scalar component.

Theorem 2

Suppose that all the conditions in Theorem 1 are satisfied. Then with probability at least 14exp[n(δ1′′)2μ2]52p1-4\exp[-n(\delta_{1}^{\prime\prime})^{2}\mu^{2}]-\frac{5}{2p}, there holds

𝐙~T(𝜸^𝜸)2+λ8𝜸^𝜸1(p0Λmin2log(2p)n),\displaystyle\|\widetilde{\bf Z}^{T}(\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})\|^{2}+\frac{\lambda}{8}\|\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*}\|_{1}\lesssim\Big{(}\frac{p_{0}}{\Lambda^{2}_{min}}\frac{\log(2p)}{n}\Big{)}, (3.2)

and in addition, we have

X,g^g22(n2r2r+1+p0log(2p)n).\displaystyle\|\langle X,\widehat{g}-g^{*}\rangle_{\mathcal{L}_{2}}\|^{2}\lesssim\Big{(}n^{-\frac{2r}{2r+1}}+\frac{p_{0}\log(2p)}{n}\Big{)}. (3.3)

It is worth pointing out that the estimation error of the parametric estimator γ̂ achieves the optimal convergence rate for high-dimensional linear models (Verzelen, 2012), even in the presence of the nonparametric component. This result in the functional literature is similar in spirit to those for classical high-dimensional partial linear models (Müller and Van de Geer, 2015; Yu, Levine, and Cheng, 2019).

3.2 Lower Bounds

In this part, we establish the lower bounds on the minimax risk of estimating 𝜸\boldsymbol{\gamma}^{*} and X,f2\langle X,f^{*}\rangle_{\mathcal{L}_{2}} separately. Let B[p0,p]B[p_{0},p] be a set of pp-dimensional vectors with at most p0p_{0} non-zero coordinates, and K{\cal B}_{K} be the unit ball of K\mathcal{H}_{K}. Moreover, we define the risk of estimating 𝜸\boldsymbol{\gamma}^{*} as

R𝜸(p0,p,K):=inf𝜸^sup𝜸B[p0,p],fK𝔼[𝜸^𝜸22],R_{\boldsymbol{\gamma}^{*}}(p_{0},p,{\cal B}_{K}):=\inf_{\widehat{\boldsymbol{\gamma}}}\sup_{\boldsymbol{\gamma}^{*}\in B[p_{0},p],f^{*}\in{\cal B}_{K}}\mathbb{E}[\|\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*}\|_{2}^{2}],

where inf{\inf} is taken over all possible estimators for 𝜸\boldsymbol{\gamma}^{*} in model (1.2). Similarly, we define the risk of estimating X,f2\langle X,f^{*}\rangle_{\mathcal{L}_{2}} as

R_{f^{*}}(p_{0},p,{\cal B}_{K}):=\inf_{\widehat{f}}\sup_{\boldsymbol{\gamma}^{*}\in B[p_{0},p],f^{*}\in{\cal B}_{K}}\mathbb{E}[\langle X,\widehat{f}-f^{*}\rangle_{\mathcal{L}_{2}}^{2}]=\inf_{\widehat{f}}\sup_{\boldsymbol{\gamma}^{*}\in B[p_{0},p],f^{*}\in{\cal B}_{K}}\|L_{C^{1/2}}(\widehat{f}-f^{*})\|_{\mathcal{L}_{2}}^{2}.

The following theorem provides the lower bounds of the minimax optimal estimation error for 𝜸\boldsymbol{\gamma}^{*} and the predictor error for ff^{*}, respectively.

Theorem 3

Suppose we are given n i.i.d. samples from (1.2) and that the entropy condition (Condition D) holds. When p diverges as n increases and p_0 ≪ p, the minimax risk for estimating γ^* can be bounded from below as

R𝜸(p0,p,K)p0log(p/p0)n;R_{\boldsymbol{\gamma}^{*}}(p_{0},p,{\cal B}_{K})\gtrsim\frac{p_{0}\log(p/p_{0})}{n};

the minimax risk for estimating X,f2\langle X,f^{*}\rangle_{\mathcal{L}_{2}} can be bounded from below as

Rf(p0,p,K)max{p0log(p/p0)n,n2r2r+1}.R_{f^{*}}(p_{0},p,{\cal B}_{K})\gtrsim\max\Big{\{}\frac{p_{0}\log(p/p_{0})}{n},n^{-\frac{2r}{2r+1}}\Big{\}}.

The proof of Theorem 3 is provided in Appendix A. As mentioned previously, these results indicate that the best possible estimation of γ^* is not affected by the presence of the nonparametric component, while the minimax risk for estimating the (nonparametric) slope function depends not only on its smoothness, but also on the dimensionality and sparsity of the scalar covariates. From the lower bound on R_{f^*}(p_0,p,B_K), we observe a rate-switching phenomenon between a sparse regime and a smooth regime. In particular, when p_0 log(p/p_0)/n dominates n^{-2r/(2r+1)}, corresponding to the sparse regime, the lower bound becomes the classical high-dimensional parametric rate p_0 log(p/p_0)/n. Otherwise, we are in the smooth regime, and the behavior is similar to that of classical nonparametric models. We also note that the minimax lower bound obtained for the prediction error generalizes the previous results for the pure functional linear model (Cai and Yuan, 2012).

4 Randomized Sketches and Optimization

This section considers an approximate algorithm for (2.2), based on constraining the original parameter α ∈ R^n to an m-dimensional subspace of R^n, where m ≪ n is the projection dimension. We define this approximation via a sketch matrix 𝕊 ∈ R^{m×n}, so that the m-dimensional subspace is generated by the row span of 𝕊. More precisely, the sketched kernel partially functional estimator is obtained by first solving

(\widehat{\boldsymbol{\alpha}}_{s},\widehat{\boldsymbol{\gamma}}_{s}):=\mathop{\rm argmin}_{\boldsymbol{\alpha}\in{\cal R}^{m},\boldsymbol{\gamma}\in{\cal R}^{p}}\frac{1}{n}\boldsymbol{\alpha}^{T}(\mathbb{S}\mathbb{K}^{c})(\mathbb{S}\mathbb{K}^{c})^{T}\boldsymbol{\alpha}-\frac{2}{n}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})+\frac{1}{n}\|\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma}\|_{2}^{2}
+\mu^{2}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}\mathbb{S}^{T}\boldsymbol{\alpha}+\lambda\|\boldsymbol{\gamma}\|_{1}. (4.1)

Then the resulting predictor for the slope function ff^{*} is given as

f^s(t):=k=1n(𝕊T𝜶^s)kBk(t)=𝜶^sT𝕊𝐁(t),t𝒯.\widehat{f}_{s}(t):=\sum_{k=1}^{n}(\mathbb{S}^{T}\widehat{\boldsymbol{\alpha}}_{s})_{k}B_{k}(t)=\widehat{\boldsymbol{\alpha}}_{s}^{T}\mathbb{S}{\bf B}(t),\quad\forall\,t\in\mathcal{T}.

where B(t) = (B_1(t),...,B_n(t))^T ∈ R^n, with B_k(t) defined in Lemma 1. With randomized sketches, an approximate kernel estimate α̂_s can be obtained by solving an m-dimensional quadratic program when γ̂_s is fixed, which involves time and space complexities of O(m^3) and O(m^2), respectively. Computing the approximate kernel matrix is a preprocessing step with time complexity O(n^2 log(m)) for properly chosen projections.
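For completeness, once α̂_s is available, the sketched slope estimate f̂_s can be evaluated on any set of time points as in the following sketch, again assuming the curves X_k are stored on a common grid; the names are illustrative.

```python
import numpy as np

def predict_slope(alpha_s, S, X, t_grid, kernel, t_eval):
    """Evaluate f_hat_s(t) = alpha_s^T S B(t), with B_k(t) = int X_k(u) K(t, u) du
    approximated by a Riemann sum on t_grid; returns f_hat_s at the points t_eval."""
    w = (t_grid[-1] - t_grid[0]) / (len(t_grid) - 1)
    # B[k, j] approximates B_k(t_eval[j]) = int X_k(u) K(t_eval[j], u) du
    B = (X * w) @ kernel(t_grid[:, None], t_eval[None, :])
    return alpha_s @ (S @ B)
```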

4.1 Alternating Optimization

This section provides the computational details of the proposed approach. Precisely, we aim to solve the following optimization task:

(𝜶^s,𝜸^s):=argmin𝜶m,𝜸p1n𝜶T(𝕊𝕂c)(𝕊𝕂c)T𝜶2n𝜶T𝕊𝕂c(𝐲𝜸)+\displaystyle(\widehat{\boldsymbol{\alpha}}_{s},\widehat{\boldsymbol{\gamma}}_{s}):=\mathop{\rm argmin}_{\boldsymbol{\alpha}\in{\cal R}^{m},\boldsymbol{\gamma}\in{\cal R}^{p}}\frac{1}{n}\boldsymbol{\alpha}^{T}(\mathbb{S}\mathbb{K}^{c})(\mathbb{S}\mathbb{K}^{c})^{T}\boldsymbol{\alpha}{-\frac{2}{n}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})}+
1n(𝐲𝜸)T(𝐲𝜸)+μ2𝜶T𝕊𝕂c𝕊T𝜶+λ𝜸1.\displaystyle\hskip 113.81102pt{\frac{1}{n}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})^{T}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})}+\mu^{2}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}\mathbb{S}^{T}\boldsymbol{\alpha}+\lambda\|\boldsymbol{\gamma}\|_{1}. (4.2)

To solve (4.2), a splitting algorithm with a proximal operator is applied, which updates the representer coefficients α and the linear coefficients γ sequentially. Specifically, at the t-th iteration with current solution (α^t, γ^t), the following two optimization tasks are solved sequentially to obtain the solution at the (t+1)-th iteration:

𝜶t+1=argmin𝜶m{1n𝜶T(𝕊𝕂c)(𝕊𝕂c)T𝜶2n𝜶T𝕊𝕂c(𝐲𝜸t)+μ2𝜶T𝕊𝕂c𝕊T𝜶},\displaystyle\boldsymbol{\alpha}^{t+1}=\mathop{\rm argmin}_{\boldsymbol{\alpha}\in{\cal R}^{m}}\Big{\{}\frac{1}{n}\boldsymbol{\alpha}^{T}(\mathbb{S}\mathbb{K}^{c})(\mathbb{S}\mathbb{K}^{c})^{T}\boldsymbol{\alpha}-\frac{2}{n}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma}^{t})+\mu^{2}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}\mathbb{S}^{T}\boldsymbol{\alpha}\Big{\}}, (4.3)
𝜸t+1=argmin𝜸p{Rn(𝜶t+1,𝜸)+λ𝜸1},\displaystyle\mbox{\boldmath$\gamma$}^{t+1}=\mathop{\rm argmin}_{\mbox{\boldmath$\gamma$}\in{\cal R}^{p}}\Big{\{}R_{n}(\boldsymbol{\alpha}^{t+1},\mbox{\boldmath$\gamma$})+\lambda\|\boldsymbol{\gamma}\|_{1}\Big{\}}, (4.4)

where Rn(𝜶t+1,𝜸):=2n(𝜶t+1)T𝕊𝕂c𝜸+1n(𝐲𝜸)T(𝐲𝜸)R_{n}(\boldsymbol{\alpha}^{t+1},\mbox{\boldmath$\gamma$}):=\frac{2}{n}({\boldsymbol{\alpha}}^{t+1})^{T}\mathbb{S}\mathbb{K}^{c}\mathbb{Z}\boldsymbol{\gamma}+{\frac{1}{n}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})^{T}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})}.

To update 𝜶\boldsymbol{\alpha}, it is clear that the optimization task (4.3) has an analytic solution that

\boldsymbol{\alpha}^{t+1}=\big{(}(\mathbb{S}\mathbb{K}^{c})(\mathbb{S}\mathbb{K}^{c})^{T}+n\mu^{2}\mathbb{S}\mathbb{K}^{c}\mathbb{S}^{T}\big{)}^{-1}\mathbb{S}\mathbb{K}^{c}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma}^{t}).

To update 𝜸\boldsymbol{\gamma}, we first introduce the proximal operator (Moreau, 1962), which is defined as

\mbox{Prox}_{\lambda\|\cdot\|_{1}}({\bf u}):=\mathop{\rm argmin}_{{\bf v}}\Big{\{}\frac{1}{2}\|{\bf v}-{\bf u}\|_{2}^{2}+\lambda\|{\bf v}\|_{1}\Big{\}}. (4.5)

Note that the solution of the optimization task (4.5) is given by the well-known soft-thresholding operator:

(Proxλ1(𝐮))i=sign(ui)(|ui|λ)+.{\big{(}\mbox{Prox}_{{\lambda}\|\cdot\|_{1}}(\mathop{\bf u})\big{)}_{i}=\mathop{\rm sign}(u_{i})(|u_{i}|-{\lambda})_{+}}.

Then, for the optimization task (4.4), we have

𝜸t+1=ProxλD1(𝜸t1D𝜸Rn(𝜶t+1,𝜸t)),\mbox{\boldmath$\gamma$}^{t+1}=\mbox{Prox}_{\frac{\lambda}{D}\|\cdot\|_{1}}\Big{(}\mbox{\boldmath$\gamma$}^{t}-\frac{1}{D}\nabla_{\mbox{\boldmath$\gamma$}}R_{n}(\boldsymbol{\alpha}^{t+1},\mbox{\boldmath$\gamma$}^{t})\Big{)},

where D denotes an upper bound on the Lipschitz constant of ∇_γ R_n(α^{t+1}, ·), and the gradient is ∇_γ R_n(α^{t+1}, γ^t) = (2/n)ℤ^T(𝕊𝕂^c)^T α^{t+1} + (2/n)ℤ^T ℤ γ^t − (2/n)ℤ^T y. We repeat the above iteration steps until (α^{t+1}, γ^{t+1}) converges.

It should be pointed out that the exact value of D is often difficult to determine in large-scale problems. A common remedy is to use a backtracking scheme (Boyd and Vandenberghe, 2004) as a more efficient alternative that approximately computes such an upper bound.
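Putting the two updates together, one possible implementation of the iteration (4.3)-(4.4) is sketched below. It assumes 𝕊𝕂^c and 𝕊𝕂^c𝕊^T have been precomputed, and it uses the exact bound D = 2‖ℤ^Tℤ‖_2/n instead of backtracking; all names are illustrative rather than part of the proposed method.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sketched_alternating(SK, SKS, Z, y, mu2, lam, n_iter=200):
    """Alternating updates (4.3)-(4.4). SK = S @ Kc (m x n), SKS = S @ Kc @ S.T (m x m)."""
    n, p = Z.shape
    m = SK.shape[0]
    alpha, gamma = np.zeros(m), np.zeros(p)
    A = SK @ SK.T / n + mu2 * SKS                 # fixed matrix of the alpha-system
    D = 2.0 / n * np.linalg.norm(Z, 2) ** 2       # Lipschitz bound for the gamma-gradient
    for _ in range(n_iter):
        # alpha update: closed-form solution of (4.3)
        alpha = np.linalg.solve(A, SK @ (y - Z @ gamma) / n)
        # gamma update: one proximal gradient step on (4.4)
        grad = 2.0 / n * (Z.T @ (SK.T @ alpha) + Z.T @ (Z @ gamma) - Z.T @ y)
        gamma = soft_threshold(gamma - grad / D, lam / D)
    return alpha, gamma
```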

4.2 Choice of Random Sketch Matrix

In this paper, we consider three random sketch methods: the sub-Gaussian random sketch (GRS), the randomized orthogonal system sketch (ROS) and the sub-sampling random sketch (SUB). Precisely, we denote the i-th row of the random matrix 𝕊 by s_i and consider three different constructions of s_i as follows.

Sub-Gaussian sketch (GRS): The row 𝐬i{\mathop{\bf s}}_{i} of 𝕊\mathbb{S} is zero-mean 11-sub-Gaussian if for any 𝐮n\mathop{\bf u}\in{\cal R}^{n}, we have

P(𝐬i,𝐮t𝐮2)et2/2,t0.\text{P}\big{(}\langle{\mathop{\bf s}}_{i},\mathop{\bf u}\rangle\geq t\|\mathop{\bf u}\|_{2}\big{)}\leq e^{-t^{2}/2},\ \ \forall\,t\geq 0.

Note that a row s_i with independent and identically distributed N(0,1) entries is 1-sub-Gaussian. For simplicity, we further rescale the sub-Gaussian sketch matrix 𝕊 so that the rows s_i have covariance matrix (1/√m)𝕀_n, where 𝕀_n denotes the n-dimensional identity matrix.

Randomized orthogonal system sketch (ROS): The row 𝐬i\mathop{\bf s}_{i} of the random matrix 𝕊\mathbb{S} is formed with i.i.d rows of the form

𝐬i=nmT𝕀(i), for i=1,,m,{\mathop{\bf s}}_{i}=\sqrt{\frac{n}{m}}\mathbb{R}\mathbb{H}^{T}\mathbb{I}_{(i)},~{}\text{ for }~{}i=1,...,m,

where ℝ ∈ R^{n×n} is a random diagonal matrix whose entries are i.i.d. Rademacher variables taking values in {−1,1} with equal probability, ℍ = {H_{ij}}_{i,j=1}^n ∈ R^{n×n} is an orthonormal matrix with bounded entries H_{ij} ∈ [−1/√n, 1/√n], and the n-dimensional vectors 𝕀_{(1)},...,𝕀_{(m)} are drawn uniformly at random without replacement from the columns of the n-dimensional identity matrix 𝕀_n.

Sub-sampling sketches (SUB): The rows s_i of the random matrix 𝕊 have the form

𝐬i=nm𝕀(i),{\mathop{\bf s}}_{i}=\sqrt{\frac{n}{m}}\mathbb{I}_{(i)},

where the n-dimensional vectors 𝕀_{(1)},...,𝕀_{(m)} are drawn uniformly at random without replacement from the columns of the n-dimensional identity matrix. Note that the sub-sampling sketch can be regarded as a special case of the ROS sketch, obtained by replacing the matrix ℝℍ^T with the n-dimensional identity matrix 𝕀_n.
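For illustration, the three sketch matrices can be generated as follows. This is only a sketch: the ROS construction uses a normalized Hadamard matrix as one admissible choice of ℍ (and therefore assumes n is a power of two), and scipy is assumed to be available.

```python
import numpy as np
from scipy.linalg import hadamard

def gaussian_sketch(m, n, rng):
    """GRS: i.i.d. standard normal entries, rescaled by 1/sqrt(m)."""
    return rng.standard_normal((m, n)) / np.sqrt(m)

def ros_sketch(m, n, rng):
    """ROS: rows sqrt(n/m) * R H^T e_j, with R a Rademacher diagonal and
    H = Hadamard(n)/sqrt(n) (entries +-1/sqrt(n)); assumes n is a power of two."""
    signs = rng.choice([-1.0, 1.0], size=n)            # diagonal of R
    H = hadamard(n) / np.sqrt(n)                       # orthonormal with bounded entries
    rows = rng.choice(n, size=m, replace=False)        # sample m rows without replacement
    return np.sqrt(n / m) * signs[None, :] * H[rows, :]

def sub_sketch(m, n, rng):
    """SUB: rows sqrt(n/m) * e_j, i.e. rescaled sub-sampling without replacement."""
    S = np.zeros((m, n))
    S[np.arange(m), rng.choice(n, size=m, replace=False)] = np.sqrt(n / m)
    return S
```

For instance, S = ros_sketch(m, n, np.random.default_rng(0)) produces one ROS sketch matrix.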

4.3 Choice of the Sketch Dimension

In practice, we are interested in m × n sketch matrices with m ≪ n to enhance computational efficiency. Note that the n × n kernel matrix in Lemma 1 is only a sufficient construction for the equivalent optimization. It has been shown theoretically for kernel regression (Yang et al., 2017) that the kernel matrix can be compressed to one of much smaller size, based on a suitable notion of intrinsic dimension. Despite the model difference from Yang et al. (2017), our kernel matrix 𝕂^c does not depend on the scalar covariates Z, and thus the results derived for kernel regression are still applicable to our case.

Consider the eigen-decomposition 𝕂^c = 𝕌𝔻𝕌^T of the kernel matrix, where 𝕌 ∈ R^{n×n} is an orthonormal matrix of eigenvectors and 𝔻 = diag{μ̂_1,...,μ̂_n} is a diagonal matrix of eigenvalues with μ̂_1 ≥ μ̂_2 ≥ ... ≥ μ̂_n ≥ 0. We define the kernel complexity function as

^(δ)=1nj=1nmin{δ,μ^j}.\widehat{\mathcal{R}}(\delta)=\sqrt{\frac{1}{n}\sum_{j=1}^{n}\min\{\delta,\hat{\mu}_{j}\}}.

The critical radius δ_n is defined to be the smallest positive solution of the inequality

^(δ)δ2/σ.\widehat{\mathcal{R}}(\delta)\leq\delta^{2}/\sigma.

Note that the existence and uniqueness of this critical radius is guaranteed for any kernel class. Based on it, we define the statistical dimension of the kernel as

dn:=min{j[n]:μ^jδn2}.d_{n}:=\min\{j\in[n]:\hat{\mu}_{j}\leq\delta^{2}_{n}\}.

Recall that Theorem 2 of Yang et al. (2017) shows that various forms of randomized sketches can achieve the minimax rate using a sketch dimension proportional to the statistical dimension d_n. In particular, for Gaussian and ROS sketches, the sketch dimension m is required to satisfy a lower bound of the form

m{cdnfor Gaussian sketches,cdnlog4(n)for ROS sketches.m\geq\begin{cases}cd_{n}&\hbox{for Gaussian sketches},\\ cd_{n}\log^{4}(n)&\hbox{for ROS sketches}.\end{cases}

Here c is some universal constant. In this paper, we adopt this specification of the sketch dimension m in our experiments.
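The critical radius and statistical dimension can be computed directly from the eigenvalues of 𝕂^c, for instance as in the following sketch; the bisection search and the noise scale σ are our own implementation choices, not prescribed by Yang et al. (2017).

```python
import numpy as np

def statistical_dimension(Kc, sigma=1.0):
    """Return the critical radius delta_n and statistical dimension d_n of a kernel matrix."""
    mu_hat = np.clip(np.sort(np.linalg.eigvalsh(Kc))[::-1], 0.0, None)  # mu_1 >= ... >= mu_n
    n = len(mu_hat)

    def R_hat(delta):                     # empirical kernel complexity function
        return np.sqrt(np.mean(np.minimum(delta, mu_hat)))

    lo, hi = 1e-12, max(mu_hat[0], 1.0)
    while R_hat(hi) > hi ** 2 / sigma:    # make sure the inequality holds at hi
        hi *= 2.0
    for _ in range(100):                  # bisection for the smallest feasible delta
        mid = 0.5 * (lo + hi)
        if R_hat(mid) <= mid ** 2 / sigma:
            hi = mid
        else:
            lo = mid
    delta_n = hi
    d_n = int(np.sum(mu_hat > delta_n ** 2)) + 1   # first j with mu_j <= delta_n^2, capped at n
    return delta_n, min(d_n, n)
```

With d_n in hand, the sketch dimension can then be set to c·d_n (or c·d_n·log^4(n) for ROS sketches), mirroring the prescription above.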

5 Numerical Experiments

In this section, we illustrate the numerical performance of the proposed method with random sketches in two numerical examples. Specifically, we assume that the true generating model is

Yi=𝒯f(t)Xi(t)𝑑t+𝐙iT𝜸+εi,\displaystyle Y_{i}=\int_{\cal T}f^{*}(t)X_{i}(t)dt+{\bf Z}_{i}^{T}\mbox{\boldmath$\gamma$}^{*}+\varepsilon_{i}, (5.1)

where ε_i ∼ N(0,σ^2) with σ = 1, and 𝒯 is set to [0,1]. Note that the generating scheme is the same as that in Hall and Horowitz (2007) and Yuan and Cai (2010). In practice, the integrals in the calculation of 𝔹 and 𝕂^c are approximated by summations; thus we generate 1000 equally spaced points in 𝒯 = [0,1] and evaluate the integrals on this grid. As a proper choice of the tuning parameters plays a crucial role in achieving the desired performance of the proposed method, we adopt 5-fold cross-validation to select the optimal values of μ and λ.

In all the simulated cases, we consider an RKHS H_K induced by a reproducing kernel function on 𝒯 × 𝒯 given by

K(s,t)\displaystyle K(s,t) =k12(kπ)4cos(kπs)cos(kπt)\displaystyle=\sum_{k\geq 1}\frac{2}{(k\pi)^{4}}\cos(k\pi s)\cos(k\pi t)
=k11(kπ)4cos(kπ(st))+k11(kπ)4cos(kπ(s+t))\displaystyle=\sum_{k\geq 1}\frac{1}{(k\pi)^{4}}\cos(k\pi(s-t))+\sum_{k\geq 1}\frac{1}{(k\pi)^{4}}\cos(k\pi(s+t))
=13B4(|st|2)13B4(s+t2),\displaystyle=-\frac{1}{3}B_{4}\big{(}\frac{|s-t|}{2}\big{)}-\frac{1}{3}B_{4}\big{(}\frac{s+t}{2}\big{)},

where B2m()B_{2m}(\cdot) denotes the 2m2m-th Bernoulli polynomial that

B2m(s)=(1)m12(2m)!k1cos(2πks)(2πk)2m,for anys𝒯.B_{2m}(s)=(-1)^{m-1}2(2m)!\sum_{k\geq 1}\frac{\cos(2\pi ks)}{(2\pi k)^{2m}},~{}\text{for any}~{}s\in{\cal T}.

Note that the RKHS H_K induced by K(s,t) contains functions in the linear span of the cosine basis, of the form

f(s)=2k1gkcos(kπs),for anys𝒯.f(s)=\sqrt{2}\sum_{k\geq 1}g_{k}\cos(k\pi s),~{}\text{for any}~{}s\in{\cal T}.

with Σ_{k≥1} k^4 g_k^2 < ∞. The endowed norm is

fK2=𝒯(2k1(kπ)2gkcos(kπt))2𝑑t=k1(kπ)4gk2.\|f\|^{2}_{K}=\int_{{\cal T}}\big{(}\sqrt{2}\sum_{k\geq 1}(k\pi)^{2}g_{k}\cos(k\pi t)\big{)}^{2}dt=\sum_{k\geq 1}(k\pi)^{4}g_{k}^{2}.
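For reference, the closed form above is easy to evaluate: the fourth Bernoulli polynomial is B_4(x) = x^4 − 2x^3 + x^2 − 1/30, and the following sketch checks the closed form against a truncation of the cosine series.

```python
import numpy as np

def bernoulli4(x):
    """Fourth Bernoulli polynomial B_4(x) = x^4 - 2x^3 + x^2 - 1/30."""
    return x ** 4 - 2 * x ** 3 + x ** 2 - 1.0 / 30.0

def sobolev_kernel(s, t):
    """Closed form K(s,t) = -(1/3) B_4(|s-t|/2) - (1/3) B_4((s+t)/2) on [0,1]^2."""
    return -(bernoulli4(np.abs(s - t) / 2.0) + bernoulli4((s + t) / 2.0)) / 3.0

# sanity check against the truncated cosine expansion of K(s, t)
s, t = 0.3, 0.7
k = np.arange(1, 5001)
series = np.sum(2.0 / (k * np.pi) ** 4 * np.cos(k * np.pi * s) * np.cos(k * np.pi * t))
assert abs(series - sobolev_kernel(s, t)) < 1e-8
```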

The performance of the proposed method is evaluated under the following two numerical examples.

Example 1. The true slope function f^* and the random function X are set as

f(t)=k=1504(1)k+1k22cos(kπt),f^{*}(t)=\sum_{k=1}^{50}4(-1)^{k+1}k^{-2}\sqrt{2}\cos(k\pi t),

and

X(t)=ξ1U1+k=250ξkUk2cos(kπt),X(t)=\xi_{1}U_{1}+\sum_{k=2}^{50}\xi_{k}U_{k}\sqrt{2}\cos(k\pi t),

where U_k ∼ U(−√3,√3) and ξ_k = (−1)^{k+1}k^{−v/2}. For the linear part, the true regression coefficients are set as γ^0 = (2,−2,0,...,0)^T, and the rows of the sample ℤ = (Z_1,...,Z_n)^T ∈ R^{n×p}, with Z_i = (z_{i1},...,z_{ip})^T, are generated with i.i.d. entries z_{ij} ∼ U(0,1).

Example 2. The generating scheme is the same as Example 1, except that

ξk={1,k=1,0.2(1)k+1(10.0001k),2k4,0.2(1)k+1[(5k/5)υ/20.0001(kmod5)],k5.\xi_{k}=\begin{cases}1,&\quad k=1,\\ 0.2(-1)^{k+1}(1-0.0001k),&\quad 2\leq k\leq 4,\\ 0.2(-1)^{k+1}\big{[}(5\lfloor k/5\rfloor)^{-\upsilon/2}-0.0001(k~{}\text{mod}~{}5)\big{]},&\quad k\geq 5.\end{cases}

Clearly, the ξ_k^2's are the eigenvalues of the covariance function C, and we choose v = 1.1, 2 and 4 to evaluate the effect of the smoothness of ξ_k in both examples. Note that in Example 1 these eigenvalues are well spaced and the covariance function C and the reproducing kernel K share the same eigenfunctions, while in Example 2 the eigenvalues are closely spaced, so that the alignment between K and C comes into play.
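As a concrete illustration, data from Example 1 under model (5.1) can be generated as follows. This is only a sketch: the truncation at 50 terms and the grid approximation follow the description above, and all function and variable names are illustrative.

```python
import numpy as np

def generate_example1(n, p, v, grid_size=1000, rng=None):
    """Simulate (t_grid, X, Z, Y, f_star, gamma0) from model (5.1) under the Example 1 design."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.linspace(0.0, 1.0, grid_size)
    k = np.arange(1, 51)
    cos_basis = np.sqrt(2.0) * np.cos(np.pi * np.outer(k, t))       # (50, grid_size)
    # slope function f*(t) = sum_k 4 (-1)^{k+1} k^{-2} sqrt(2) cos(k pi t)
    f_star = (4.0 * (-1.0) ** (k + 1) * k ** (-2.0)) @ cos_basis
    # random functions X_i(t): first term is xi_1 U_1 (no cosine), then xi_k U_k sqrt(2) cos(k pi t)
    xi = (-1.0) ** (k + 1) * k ** (-v / 2.0)
    X_basis = cos_basis.copy()
    X_basis[0, :] = 1.0
    U = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, 50))
    X = (U * xi) @ X_basis                                          # (n, grid_size)
    # scalar covariates and responses
    gamma0 = np.zeros(p)
    gamma0[:2] = [2.0, -2.0]
    Z = rng.uniform(0.0, 1.0, size=(n, p))
    w = 1.0 / (grid_size - 1)                                       # Riemann weights on [0,1]
    Y = X @ (f_star * w) + Z @ gamma0 + rng.standard_normal(n)
    return t, X, Z, Y, f_star, gamma0
```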

To examine the effect of the sample size, we consider the same settings as in Yang et al. (2017), namely n = 256, 512, 1024, 2048, 4096, 8192 and 16384, and conservatively take m = ⌊n^{1/3}⌋ for the three random sketch methods introduced in Section 4.2. Note that with this choice of m, the time and storage complexities reduce to O(n) and O(n^{2/3}), respectively. Each scenario is replicated 50 times, and the performance of the proposed method is evaluated by various measures, including the estimation accuracy of the linear coefficients, the integrated prediction error of the slope function, and the prediction error of the response. Specifically, the estimation accuracy of the linear coefficients is evaluated by ‖γ̂ − γ^0‖_2^2 = Σ_{l=1}^p (γ̂_l − γ_l^0)^2, and Figure 1 shows the estimation accuracy of the coefficients for different choices of v.

Figure 1: Estimation accuracy of the coefficients in Example 1 under various scenarios.

It is clear that the estimation error of the coefficients decreases steadily as the sample size n increases and becomes stable when n is sufficiently large, and the three employed sketch methods have similar performance. It is also interesting to notice that the convergence patterns under different choices of v are almost the same, which concurs with our theoretical findings in Theorems 1 and 3 that the estimation of γ^* is not affected by the presence of the nonparametric component.

Let (Y′, X′(·), Z′) denote an independent copy of (Y, X(·), Z); the integrated prediction error for the slope function is reported as

\widehat{\mathbb{E}}_{X^{\prime}}\|\widehat{f}-f^{*}\|^{2}=\widehat{\mathbb{E}}_{X^{\prime}}\Big{(}\int_{\cal T}(\widehat{f}(t)-f^{*}(t))X^{\prime}(t)dt\Big{)}^{2}.

The empirical expectation Ê is evaluated over a testing sample of size 10000 with Ŷ′_i = ∫_𝒯 f̂(t)X′_i(t)dt + (Z′_i)^T γ̂, and the numerical performance is summarized in Figure 2.
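Concretely, this Monte-Carlo quantity can be computed from gridded curves as in the following sketch; the function and argument names are illustrative.

```python
import numpy as np

def integrated_prediction_error(f_hat_vals, f_star_vals, X_test, t_grid):
    """Monte-Carlo estimate of E_X' ( int (f_hat - f*)(t) X'(t) dt )^2 on a grid."""
    w = (t_grid[-1] - t_grid[0]) / (len(t_grid) - 1)
    diffs = X_test @ ((f_hat_vals - f_star_vals) * w)   # one integral per test curve
    return np.mean(diffs ** 2)
```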

Figure 2: Prediction error of the slope function in Example 1 under various scenarios.

Figure 2 suggests that the prediction error of the slope function converges at a polynomial rate as the sample size n increases, which agrees with our theoretical results in Section 3, and the three employed sketch methods yield similar numerical performance. Moreover, it can be seen that the prediction error decreases as the value of v increases, which also concurs with our theoretical findings in Theorems 2 and 3 that the faster the eigenvalues decay, the smaller the prediction error.

We also report the prediction error of the response by calculating

𝔼^Y,XY^Y22.\widehat{\mathbb{E}}_{Y^{\prime},X^{\prime}}\|\widehat{Y}^{\prime}-Y^{\prime}\|^{2}_{2}.

The empirical expectation Ê is also evaluated over a testing sample of size 10000, and the numerical performance is summarized in Figure 3.

Figure 3: Prediction error of the response in Example 1 under various scenarios.

Clearly, the prediction error of the response converges at a polynomial rate as the sample size n increases and becomes smaller as v increases, which agrees with our theoretical results in Theorem 2. It is also interesting to point out that the three employed sketch methods yield similar numerical performance and that the prediction errors tend to converge to 1, the variance of ε in the true model. This verifies the efficiency of the proposed estimation and the proper choice of m.

The numerical results in Example 2, where the eigenvalues are closely spaced, are similar to those under the well-spaced eigenvalues of Example 1. Figure 4 shows the numerical performance under the closely spaced eigenvalue setting of Example 2.

Figure 4: Numerical performance of the proposed method in Example 2 under various scenarios.

6 Conclusion

This paper establishes the optimal minimax rates for estimation in the partially functional linear model (PFLM) under a kernel-based framework and a high-dimensional setting. The optimal minimax rates of estimation are established by using various techniques from empirical process theory for analyzing kernel classes, and an efficient numerical algorithm based on randomized sketches of the kernel matrix is implemented to verify our theoretical findings.


Acknowledgments

Shaogao Lv’s research was partially supported by NSFC-11871277. Xin He’s research was supported in part by NSFC-11901375 and Shanghai Pujiang Program 2019PJC051. Junhui Wang’s research was supported in part by HK RGC Grants GRF-11303918 and GRF-11300919.

References

  • Aronszajn (1950) N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
  • Bickel et al. (2009) P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of lasso and dantzig selector. Annals of Statistics, 37:1705–1732, 2009.
  • Bousquet (2002) O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique Academie des Sciences Paris, 334:495–550, 2002.
  • Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
  • Bühlmann and Van. de. Geer (2011) P. Bühlmann and S. Van. de. Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg, 2011.
  • Cai and Yuan (2012) T. Cai and M. Yuan. Minimax and adaptive prediction for functional linear regression. Journal of the American Statistical Association, 107:1201–1216, 2012.
  • Cardot et al. (2003) H. Cardot, F. Ferraty, and P. Sarda. Spline estimators for the functional linear model. Statistica Sinica, 13:571–591, 2003.
  • Ferraty and Vieu (2006) F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York, 2006.
  • Hall and Horowitz (2007) P. Hall and J. Horowitz. Methodology and convergence rates for functional linear regression. Annals of Statistics, 35:70–91, 2007.
  • Kong et al. (2016) D. Kong, K. Xue, F. Yao, and H. Zhang. Partially functional linear regression in high dimensions. Biometrika, 103:1–13, 2016.
  • Ledoux (1997) M. Ledoux. On talagrand’s deviation inequalities for product measures. Probability and Statistics, 1:63–87, 1997.
  • Ledoux (2001) M. Ledoux. The Concentration of Measure Phenomenon (Mathematical Surveys and Monographs). American Mathematical Society, Providence, RI, 2001.
  • Lu et al. (2014) Y. Lu, J. Du, and Z. Sun. Functional partially linear quantile regression model. Metrika, 77:17–32, 2014.
  • Mahoney (2011) M. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3:1–54, 2011.
  • Müller and Van de Geer (2015) P. Müller and S. Van de Geer. The partial linear model in high dimensions. Scandinavian Journal of Statistics, 42:580–608, 2015.
  • Ramsay and Silverman (2005) J. Ramsay and B. Silverman. Functional Data Analysis. Springer, New York, 2005.
  • Shin and Lee (2012) H. Shin and M. Lee. On prediction rate in partial functional linear regression. Journal of Multivariate Analysis, 103:93–106, 2012.
  • Tsybakov (2009) A. Tsybakov. Introduction to Nonparametric Estimation. Springer, New York, 2009.
  • Van. de. Geer (2000) S. Van. de. Geer. Empirical Processes in M-Estimation. Cambridge University Press, New York, 2000.
  • Verzelen (2012) N. Verzelen. Minimax risks for sparse regressions: Ultra-high dimensional phenomenons. Electronic Journal of Statistics, 6:38–90, 2012.
  • Yang and Barron (1999) Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27:1564–1599, 1999.
  • Yang et al. (2017) Y. Yang, M. Pilanci, and M. Wainwright. Randomized sketches for kernels: fast and optimal non-parametric regression. Annals of Statistics, 45:991–1023, 2017.
  • Yu et al. (2019) Z. Yu, M. Levine, and G. Cheng. Minimax optimal estimation in partially linear additive models under high dimension. Bernoulli, 25:1289–1325, 2019.
  • Yuan and Cai (2010) M. Yuan and T. Cai. A reproducing kernel Hilbert space approach to functional linear regression. Annals of Statistics, 38:3412–3444, 2010.
  • Zhu et al. (2014) H. Zhu, F. Yao, and H. Zhang. Structured functional additive regression in reproducing kernel Hilbert spaces. Journal of the Royal Statistical Society, Series B, 76:581–603, 2014.