
Kernel-based estimation for partially functional linear model: Minimax rates and randomized sketches

\nameShaogao Lv \emaillvsg716@swufe.edu.cn
\addrCollege of Statistics and Mathematics
Nanjing Audit University
Nanjing, China \AND\nameXin He \emailhe.xin17@mail.shufe.edu.cn
\addrSchool of Statistics and Management
Shanghai University of Finance and Economics
Shanghai, China \AND\nameJunhui Wang \emailj.h.wang@cityu.edu.hk
\addrSchool of Data Science
City University of Hong Kong
Kowloon Tong, Kowloon, Hong Kong
Abstract

This paper considers the partially functional linear model (PFLM), in which the predictive features consist of a functional covariate and a high-dimensional scalar vector. Over an infinite-dimensional reproducing kernel Hilbert space, the proposed estimator for PFLM is a least squares approach with two mixed regularizers: the square of a functional norm and an \ell_{1}-norm. Our main task in this paper is to establish the minimax rates for PFLM under the high-dimensional setting; the optimal minimax rates of estimation are established by using various techniques from empirical process theory for analyzing kernel classes. In addition, we propose an efficient numerical algorithm based on randomized sketches of the kernel matrix. Several numerical experiments are implemented to support our method and optimization strategy.

Keywords: Functional linear models, minimax rates, sparsity, randomized sketches, reproducing kernel Hilbert space.

1 Introduction

In the problem of functional linear regression, a single functional feature X()X(\cdot) is assumed to be square-integrable over an interval 𝒯\mathcal{T}, and the classical functional linear regression between the response YY and XX is given as

Y=X,f2+ε,\displaystyle Y=\langle X,f^{*}\rangle_{\mathcal{L}_{2}}+\varepsilon, (1.1)

where the inner product ⟨·,·⟩_{L_2} is defined as ⟨f,g⟩_{L_2} := ∫_𝒯 f(t)g(t)dt for any f,g ∈ L_2(𝒯). Here f^* is some slope function within L_2(𝒯) and ε denotes a zero-mean error term. Let {(Y_i,X_i): i=1,...,n} denote independent and identically distributed (i.i.d.) realizations from the population (Y,X). There is an extensive literature on estimating the slope function f^*, or the value of ⟨X,f^*⟩_{L_2}.

In practice, it is often the case that a response is affected by both a high-dimensional scalar vector and some random functional variables as predictive features. These scenarios partially motivate us to study PFLM under the high-dimensional setting. To simplify notation, this paper assumes that Y and X(·) are centered. To be more precise, we are concerned with partially functional linear regression with the functional feature X and scalar predictors Z = (Z_1,...,Z_p)^T ∈ R^p, and a linear model links the response Y and the predictive features U = (X, Z) via

Y=X,f2+𝐙T𝜸+ε,Y=\langle X,f^{*}\rangle_{\mathcal{L}_{2}}+{\bf Z}^{T}\boldsymbol{\gamma}^{*}+\varepsilon, (1.2)

where γ^* = (γ^*_1,...,γ^*_p)^T denotes the regression coefficients of the scalar covariates, and ε is a standard normal variable independent of X and Z. Under the sparse high-dimensional setting, a standard assumption is that the cardinality of the active set S_0 := {j: γ^*_j ≠ 0, j=1,...,p} is far smaller than p, while p and p_0 := |S_0| are allowed to diverge as the sample size n increases. In fact, estimation and variable selection for partially functional linear models have been investigated via FPCA-based methods by Shin and Lee (2012), Lu et al. (2014) and Kong et al. (2016).

In this paper, we focus on a least squares regularized estimation of the slope function and the regression coefficients in (1.2) under a kernel-based framework and a high-dimensional setting. The estimators are obtained by combining the least squares loss with an ℓ_1-type penalty and the square of a functional norm, where the former penalty acts on the regression coefficients and the latter controls the kernel complexity. The optimal minimax rates of estimation are established by using various techniques from empirical process theory for analyzing kernel classes, and an efficient numerical algorithm based on randomized sketches of the kernel matrix is implemented to verify our theoretical findings.

1.1 Our Contributions

This paper makes three main contributions to this functional modeling literature.

Our first contribution is Theorem 1, which states that with high probability, under mild regularity conditions, the prediction error of our procedure under the squared L_2-norm is bounded by (p_0 log p / n + n^{-2r/(2r+1)}), where the quantity r > 1/2 characterizes the kernel complexity of the composite kernel K^{1/2}CK^{1/2}. The proof of this upper bound involves two different penalties for analyzing the obtained estimator in high dimensions, and we emphasize that it is very hard to establish the constraint cone set that is often used to define a critical condition (the restricted eigenvalue constant) for high-dimensional problems (Bickel, Ritov, and Tsybakov, 2009; Verzelen, 2012). To handle this technical difficulty, we combine the methods used in Müller and Van de Geer (2015) for high-dimensional partial linear models with various techniques from empirical process theory for analyzing kernel classes (Aronszajn, 1950; Cai and Yuan, 2012; Yuan and Cai, 2010; Zhu et al., 2014).

Our second contribution is to establish algorithm-independent minimax lower bounds under the squared L_2-norm. These minimax lower bounds, stated in Theorem 3, are determined by the metric entropy of the composite kernel K^{1/2}CK^{1/2} and the sparsity structure of the high-dimensional scalar coefficients. For commonly used kernels, including the Sobolev classes, these lower bounds match our achievable rates, showing the optimality of our estimator for PFLM. It is worth noting that the lower bound for the parametric part does not depend on the nonparametric smoothness indices, coinciding with the classical sparse estimation rate in high-dimensional linear models (Verzelen, 2012). By contrast, the lower bound for estimating f^* turns out to be affected by the regression coefficients γ^*. The proof of Theorem 3 is based on characterizing the packing entropies of the class of nonparametric kernel models and the interaction between the composite kernel and the high-dimensional scalar vector, combined with classical information-theoretic techniques involving Fano's inequality and its variants (Yang and Barron, 1999; Van de Geer, 2000; Tsybakov, 2009).

Our third contribution is to consider randomized sketches of our original estimator, with sketch dimension tied to the statistical dimension. Despite the attractive statistical properties stated above, the computational complexity of computing the original estimator prevents it from being routinely used in large-scale problems. In fact, a standard implementation of any kernel estimator has time complexity O(n^3) and space complexity O(n^2). To this end, we employ the random projection and sketching techniques developed in Yang et al. (2017) and Mahoney (2011), where the n-dimensional kernel matrix is approximated by projecting its row and column subspaces onto a randomly chosen m-dimensional subspace with m ≪ n. We choose the sketch dimension m proportional to the statistical dimension, under which the resulting estimator has comparable numerical performance.

1.2 Related Work

A class of conventional estimation procedures for functional linear regression in the statistical literature is based on functional principal component analysis (FPCA) or spline functions; see (Ramsay and Silverman, 2005; Ferraty and Vieu, 2006; Kong, Xue, Yao, and Zhang, 2016) and (Cardot, Ferraty, and Sarda, 2003) for details. These truncation approaches to handling an infinite-dimensional function depend only on information from the feature X. In particular, the commonly used FPCA methods form a basis for the slope function f^* that is determined solely by the empirical covariance of the observed feature X, and this basis may not be an efficient representation for approximating f^*, since the slope function f^* and the leading functional principal components are essentially unrelated. Similar problems also arise when spline-based finite representations are used.

To avoid an inappropriate representation of the slope function, reproducing kernel methods are known to be a family of powerful tools for directly estimating infinite-dimensional functions. When the slope function is assumed to reside in a reproducing kernel Hilbert space (RKHS), denoted by H_K, several existing works (Yuan and Cai, 2010; Cai and Yuan, 2012; Zhu, Yao, and Zhang, 2014) on functional linear or additive regression have proved that the minimax rate of convergence depends on both the kernel K and the covariance function C of the functional feature X. In particular, the alignment of K and C can significantly affect the optimal rate of convergence. However, it is well known that kernel-based methods suffer from heavy storage costs and computational burden. Specifically, kernel-based methods need to store an n × n matrix before running any algorithm and are therefore limited to small-scale problems.

1.3 Paper organization

The rest of this paper is organized as follows. Section 2 introduces some notation and background on kernel methods, and formulates the proposed kernel-based regularized estimation method. Section 3 is devoted to establishing the minimax rates of the prediction problem for PFLM and provides a detailed discussion of the obtained results, including the desired convergence rate of the upper bounds and a matching set of minimax lower bounds. In Section 4, a general sketching-based strategy is provided, and an approximate algorithm for solving (2.2) is employed. Several numerical experiments are implemented in Section 5 to support the proposed approach and the employed optimization strategy. A brief summary of this paper is provided in Section 6. Appendix A contains the core proofs of the main results, including the technical proofs of Theorems 1-3. Some useful lemmas and more technical details are provided in Appendix B.

2 Problem Statement and Proposed Method

2.1 Notation

Let u, v be two general random variables, denote the joint distribution of (u,v) by Q, and denote the marginal distribution of u (v) by Q_u (Q_v). For a measurable function f: u × v → ℝ, we define the squared L_2-norm by ‖f‖^2 := E_Q f^2(u,v), and the squared empirical norm is given by ‖f‖_n^2 := (1/n)Σ_{i=1}^n f^2(u_i,v_i), where {(u_i,v_i)}_{i=1}^n are i.i.d. copies of (u,v). Note that Q may differ from line to line. For a vector γ ∈ R^p, the ℓ_1-norm and ℓ_2-norm are given by ‖γ‖_1 := Σ_{j=1}^p |γ_j| and ‖γ‖_2 := (Σ_{j=1}^p γ_j^2)^{1/2}, respectively. With a slight abuse of notation, we write ‖f‖_{L_2}^2 := ⟨f,f⟩_{L_2} with ⟨f,g⟩_{L_2} = ∫_𝒯 f(t)g(t)dt. For two sequences {a_k: k ≥ 1} and {b_k: k ≥ 1}, a_k ≲ b_k (or a_k = O(b_k)) means that there exists some constant c such that a_k ≤ c b_k for all k ≥ 1. Also, we write a_k ≳ b_k if there is some positive constant c such that a_k ≥ c b_k for all k ≥ 1. Accordingly, we write a_k ≍ b_k if both a_k ≲ b_k and a_k ≳ b_k hold.

2.2 Kernel Method

Kernel methods are among the most powerful learning schemes in machine learning, and often take the form of regularization schemes in a reproducing kernel Hilbert space (RKHS) associated with a Mercer kernel (Aronszajn, 1950). A major advantage of employing kernel methods is that the corresponding optimization task over an infinite-dimensional RKHS is equivalent to an n-dimensional optimization problem, thanks to the so-called reproducing property.

Recall that a kernel K(·,·): 𝒯 × 𝒯 → R is a continuous, symmetric, and positive semi-definite function. Let H_K be the closure of the linear span of the functions {K_t(·) := K(t,·), t ∈ 𝒯} endowed with the inner product ⟨Σ_{i=1}^n α_i K_{t_i}, Σ_{j=1}^n β_j K_{t_j}⟩_K := Σ_{i,j=1}^n α_i β_j K(t_i,t_j), for any {t_i}_{i=1}^n ∈ 𝒯^n and n ∈ ℕ^+. An important property of H_K is the reproducing property, which states that

f(t)=f,KtK,for anyfK.f(t)=\langle f,{K}_{t}\rangle_{{K}},\,\,\,\hbox{for any}\,f\in\mathcal{H}_{{K}}.

This property ensures that an RKHS inherits many nice properties from standard finite-dimensional Euclidean spaces. Throughout this paper, we assume that the slope function f^* resides in a specified RKHS, still denoted by H_K. In addition, another RKHS can be naturally induced by the stochastic process X(·). Without loss of generality, we assume that X(·) is square integrable over 𝒯 with zero mean, and thus the covariance function of X, defined as

C(s,t)=𝔼[X(s)X(t)],t,s𝒯,C(s,t)=\mathbb{E}[X(s)X(t)],\quad\forall\,t,\,s\in\mathcal{T},

is also a real, positive semi-definite kernel.

Note that the kernel complexity is characterized explicitly by a kernel-induced integral operator. Precisely, for any kernel K(,):𝒯×𝒯{K(\cdot,\cdot)}:\mathcal{T}\times\mathcal{T}\rightarrow{\cal R}, we define the integral operator LK:2(𝒯)2(𝒯)L_{{K}}:\mathcal{L}_{2}(\mathcal{T})\rightarrow\mathcal{L}_{2}(\mathcal{T}) by

LK(f)()=𝒯K(s,)f(s)𝑑s.L_{{K}}(f)(\cdot)=\int_{\mathcal{T}}{K}(s,\cdot)f(s)ds.

By the reproducing property, LKL_{{K}} can be equivalently defined as

f,LK(g)K=f,g2,fK,g2(𝒯).\langle f,L_{{K}}(g)\rangle_{K}=\langle f,g\rangle_{\mathcal{L}_{2}},\quad\forall\,f\in\mathcal{H}_{{K}},\,g\in\mathcal{L}_{2}(\mathcal{T}).

Since the operator L_K is linear, bounded and self-adjoint on L_2(𝒯), the spectral theorem implies that there exist a family of orthonormal eigenfunctions {φ_ℓ^K: ℓ ≥ 1} and a sequence of eigenvalues θ_1^K ≥ θ_2^K ≥ ... > 0 such that

K(s,t)=1θKϕK(s)ϕK(t),s,t𝒯,{K}(s,t)=\sum_{\ell\geq 1}\theta_{\ell}^{{K}}\phi^{{K}}_{\ell}(s)\phi^{{K}}_{\ell}(t),\quad s,\,t\in\mathcal{T},

and thus by definition, it holds

LK(ϕK)=θKϕK,=1,2,L_{{K}}(\phi^{{K}}_{\ell})=\theta_{\ell}^{{K}}\phi^{{K}}_{\ell},\quad\ell=1,2,...

Based on the semi-definiteness of LKL_{{K}}, we can always decompose it into the following form

LK=LK1/2LK1/2,L_{{K}}=L_{{K}}^{1/2}\circ L_{{K}}^{1/2},

where LK1/2L_{{K}^{1/2}} is also a kernel-induced integral operator associated with a fractional kernel K1/2{K}^{1/2} that

K1/2(s,t):=1θKϕK(s)ϕK(t),s,t𝒯.{K}^{1/2}(s,t):=\sum_{\ell\geq 1}\sqrt{\theta_{\ell}^{{K}}}\phi^{{K}}_{\ell}(s)\phi^{{K}}_{\ell}(t),\quad s,\,t\in\mathcal{T}.

Also, it holds

LK1/2(ϕK):=θKϕK.L_{{K}^{1/2}}(\phi^{K}_{\ell}):=\sqrt{\theta_{\ell}^{{K}}}\phi^{{K}}_{\ell}.

Given two kernels K1,K2K_{1},K_{2}, we define

(K1K2)(s,t):=𝒯K1(s,u)K2(t,u)𝑑u,(K_{1}K_{2})(s,t):=\int_{\mathcal{T}}K_{1}(s,u)K_{2}(t,u)du,

and then it holds LK1K2=LK1LK2L_{K_{1}K_{2}}=L_{K_{1}}\circ L_{K_{2}}. Note that K1K2K_{1}K_{2} is not necessarily a symmetric kernel.

In the rest of this paper, we focus on the RKHS H_K in which the slope function f^* in (1.2) resides. Given the kernel K and the covariance function C, and using the above notation, we define the linear operator L_{K^{1/2}CK^{1/2}} by

LK1/2CK1/2:=LK1/2LCLK1/2.L_{K^{1/2}CK^{1/2}}:=L_{K^{1/2}}\circ L_{C}\circ L_{K^{1/2}}.

Since both operators L_{K^{1/2}} and L_C are linear, bounded and self-adjoint, so is L_{K^{1/2}CK^{1/2}}. By the spectral theorem, there exist a sequence of positive eigenvalues s_1 ≥ s_2 ≥ ... > 0 and a set of orthonormal eigenfunctions {φ_ℓ: ℓ ≥ 1} such that

K1/2CK1/2(s,t)=1sφ(s)φ(t),s,t𝒯,K^{1/2}CK^{1/2}(s,t)=\sum_{\ell\geq 1}s_{\ell}\varphi_{\ell}(s)\varphi_{\ell}(t),\quad\forall\,s,t\in\mathcal{T},

and particularly

LK1/2CK1/2(φ)=sφ,=1,2,L_{K^{1/2}CK^{1/2}}(\varphi_{\ell})=s_{\ell}\varphi_{\ell},\quad\ell=1,2,...

It is worthwhile to note that the eigenvalues {s:1}\{s_{\ell}:\ell\geq 1\} of the linear operator LK1/2CK1/2L_{K^{1/2}CK^{1/2}} depend on the eigenvalues of both the reproducing kernel KK and the covariance function CC. We shall show in Section 3 that the minimax rate of convergence of the excess prediction risk is determined by the decay rate of the eigenvalues {s:1}\{s_{\ell}:\ell\geq 1\}.
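The decay of the eigenvalues s_ℓ can be inspected numerically for a given pair (K, C). The following Python sketch is for illustration only: kernel_K and cov_C are hypothetical vectorized functions, and the quadrature is a crude Riemann sum. It discretizes L_{K^{1/2}}, L_C and their composition on a grid of 𝒯 = [0,1] and returns approximate values of s_ℓ.

```python
import numpy as np

def composite_eigenvalues(kernel_K, cov_C, grid_size=500):
    """Numerically approximate the eigenvalues s_l of K^{1/2} C K^{1/2} on [0,1]
    by discretizing the integral operators L_K and L_C on a uniform grid (a rough sketch)."""
    t = np.linspace(0.0, 1.0, grid_size)
    w = 1.0 / grid_size                          # quadrature weight of each grid point
    LK = kernel_K(t[:, None], t[None, :]) * w    # discretized L_K
    LC = cov_C(t[:, None], t[None, :]) * w       # discretized L_C
    # symmetric square root of the discretized L_K approximates L_{K^{1/2}}
    evals, evecs = np.linalg.eigh((LK + LK.T) / 2.0)
    LK_half = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    M = LK_half @ LC @ LK_half                   # discretized L_{K^{1/2} C K^{1/2}}
    return np.sort(np.linalg.eigvalsh(M))[::-1]  # approximations of s_1 >= s_2 >= ...
```

Plotting the returned values on a log-log scale gives a rough check of the polynomial decay postulated in Condition D below.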

2.3 Regularized Estimation and Randomized Sketches

Given the sample {Yi,(Xi,𝐙i)}i=1n\{Y_{i},(X_{i},{\bf Z}_{i})\}_{i=1}^{n} which are independently drawn from (1.2), the proposed estimation procedure for PFLM (1.2) is formulated in a least square regularization scheme by solving

(f^,𝜸^)=argminfK,𝜸p{1ni=1n(YiXi,f2𝐙iT𝜸)2+μ2fK2+λ𝜸1},(\widehat{f},\widehat{\boldsymbol{\gamma}})=\mathop{\rm argmin}_{f\in\mathcal{H}_{K},\boldsymbol{\gamma}\in{\cal R}^{p}}\Big{\{}\frac{1}{n}\sum_{i=1}^{n}\big{(}Y_{i}-\langle X_{i},f\rangle_{\mathcal{L}_{2}}-{\bf Z}^{T}_{i}\boldsymbol{\gamma}\big{)}^{2}+\mu^{2}\|f\|^{2}_{K}+\lambda\|\boldsymbol{\gamma}\|_{1}\Big{\}}, (2.1)

where the parameter μ^2 > 0 controls the smoothness of the nonparametric component, and λ > 0, associated with the ℓ_1-type penalty, is used to induce sparsity in the scalar covariates.

Note that although the proposed estimation procedure (2.1) is formulated within an infinite-dimensional Hilbert space, the following lemma shows that this optimization task is equivalent to a finite-dimensional minimization problem.

Lemma 1

The proposed estimation procedure (2.1) defined on H_K × R^p is equivalent to a finite-dimensional parametric convex optimization problem. That is, f̂(t) = Σ_{k=1}^n α_k B_k(t) with unknown coefficients α = (α_1,...,α_n)^T, for any t ∈ 𝒯. Here each basis function B_k(t) = ⟨X_k, K(t,·)⟩_{L_2(𝒯)} ∈ H_K, k = 1,...,n.

To rewrite the minimization problem (2.1) in matrix form, we define an n × n positive semi-definite matrix 𝕂^c = (K^c_{ik})_{i,k=1}^n with K^c_{ik} := ⟨X_i, B_k⟩_{L_2} = ∬ X_k(u)X_i(t)K(t,u)du dt, and by the reproducing property of K, we also have ⟨B_i, B_k⟩_K = K^c_{ik}, i,k = 1,...,n. Thus, by Lemma 1, the matrix form of (2.1) is given as

min𝜶n,𝜸p1n𝐲𝕂c𝜶𝜸22+μ2𝜶T𝕂c𝜶+λ𝜸1,\displaystyle\min_{\boldsymbol{\alpha}\in{\cal R}^{n},\boldsymbol{\gamma}\in{\cal R}^{p}}\frac{1}{n}\big{\|}\mathbf{y}-\mathbb{K}^{c}\boldsymbol{\alpha}-\mathbb{Z}\boldsymbol{\gamma}\big{\|}_{2}^{2}+\mu^{2}\boldsymbol{\alpha}^{T}\mathbb{K}^{c}\boldsymbol{\alpha}+\lambda\|\boldsymbol{\gamma}\|_{1}, (2.2)

where ℤ ∈ R^{n×p} denotes the design matrix of Z. Since the unconstrained problem (2.2) is convex in both α and γ, standard alternating optimization (Boyd and Vandenberghe, 2004) can be applied directly to approximate a global minimizer of (2.2). Yet, because 𝕂^c is an n × n matrix, both the computational cost and the storage burden of a standard implementation are heavy, of orders O(n^3) and O(n^2), respectively. To alleviate this computational issue, we propose an approximate numerical optimization in place of (2.2) in Section 4. Precisely, a class of general random projections is adopted to compress the original kernel matrix 𝕂^c and improve the computational efficiency.
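To make the finite-dimensional problem (2.2) concrete, the entries K^c_{ik} = ∬ X_i(t)K(t,u)X_k(u) dt du can be approximated by Riemann sums when the trajectories X_i are observed on a common uniform grid, as in the numerical experiments of Section 5. The following minimal sketch illustrates this; the function and argument names are ours, not part of the proposed method.

```python
import numpy as np

def build_Kc(X, t_grid, kernel):
    """Approximate the n x n matrix K^c with entries <X_i, B_k>_{L_2} by a Riemann sum.
    X has shape (n, T) with X[i, j] = X_i(t_grid[j]); kernel(s, t) must be vectorized."""
    w = (t_grid[-1] - t_grid[0]) / (len(t_grid) - 1)   # uniform grid spacing
    Kmat = kernel(t_grid[:, None], t_grid[None, :])    # K(t_a, t_b), shape (T, T)
    XW = X * w                                         # absorb the quadrature weights
    return XW @ Kmat @ XW.T                            # entries approximate K^c_{ik}
```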

3 Main Results: Minimax Rates

In this section, we present the main theoretical results for the proposed estimator in the minimax sense. Specifically, we derive the minimax rates in terms of prediction error for the estimators in (2.1) under the high-dimensional and kernel-based frameworks. The first two theorems establish the convergence of the obtained estimators, while the last one provides an algorithm-independent lower bound on the prediction error.

3.1 Upper Bounds

We introduce the short-hand notation

𝒢:={g=X,f2+𝐙T𝜸,fK,𝜸p},\mathcal{G}:=\big{\{}g=\langle X,f\rangle_{\mathcal{L}_{2}}+{\bf Z}^{T}\boldsymbol{\gamma},\,\,f\in\mathcal{H}_{K},\,\boldsymbol{\gamma}\in{\cal R}^{p}\big{\}},

and the functional g^*(U) := ⟨X,f^*⟩_{L_2} + Z^T γ^* for U = (X, Z). With a slight abuse of notation, we sometimes also write 𝒢 := {g = (f,γ), f ∈ H_K, γ ∈ R^p}. To separate the scalar components and the functional component in our analysis, we define the projection of Z_j onto H_K as Π(Z_j|H_K) = argmin_{f∈H_K} ‖Z_j − ⟨X,f⟩_{L_2}‖^2. Let Π(Z_j|X) = ⟨X, Π(Z_j|H_K)⟩_{L_2} and Π_{Z|X} = (Π(Z_1|X),...,Π(Z_p|X))^T, and then denote Z̃ := Z − Π_{Z|X}, a random vector in R^p. For any g_1(U) := ⟨X,f_1⟩_{L_2} + Z^T γ_1 ∈ 𝒢 and g_2(U) := ⟨X,f_2⟩_{L_2} + Z^T γ_2 ∈ 𝒢, we have the following orthogonal decomposition:

g1(𝐔)g2(𝐔)\displaystyle g_{1}({\bf U})-g_{2}({\bf U}) =𝐙T(𝜸1𝜸2)+X,f2f12\displaystyle={\bf Z}^{T}(\boldsymbol{\gamma}_{1}-\boldsymbol{\gamma}_{2})+\langle X,f_{2}-f_{1}\rangle_{\mathcal{L}_{2}}
=𝐙~T(𝜸1𝜸2)+Π𝐙T|XT(𝜸1𝜸2)+X,f2f12,\displaystyle=\widetilde{\bf Z}^{T}(\boldsymbol{\gamma}_{1}-\boldsymbol{\gamma}_{2})+\Pi_{{\bf Z}^{T}|X}^{T}(\boldsymbol{\gamma}_{1}-\boldsymbol{\gamma}_{2})+\langle X,f_{2}-f_{1}\rangle_{\mathcal{L}_{2}},

and by the definition of projection, it holds

g1g22=𝐙~T(𝜸1𝜸2)2+Π𝐙|XT(𝜸1𝜸2)+X,f2f122.\displaystyle\|g_{1}-g_{2}\|^{2}=\|\widetilde{\bf Z}^{T}(\boldsymbol{\gamma}_{1}-\boldsymbol{\gamma}_{2})\|^{2}+\|\Pi_{{\bf Z}|X}^{T}(\boldsymbol{\gamma}_{1}-\boldsymbol{\gamma}_{2})+\langle X,f_{2}-f_{1}\rangle_{\mathcal{L}_{2}}\|^{2}. (3.1)

To establish the refined upper bounds of the prediction and estimation errors, we summarize and discuss the main conditions needed in the theoretical analysis below.

Condition A(Eigenvalues condition). The smallest eigenvalue Λmin2\Lambda^{2}_{min} of 𝔼[𝐙~𝐙~T]\mathbb{E}[\widetilde{\bf Z}\widetilde{\bf Z}^{T}] is positive, and the largest eigenvalue Λmax2\Lambda^{2}_{max} of 𝔼[Π𝐙|XΠ𝐙|XT]\mathbb{E}[\Pi_{{\bf Z}|X}\Pi_{{\bf Z}|X}^{T}] is finite.

Condition B(Design condition). For some positive constants Cz,Cπ,ChC_{z},C_{\pi},C_{h}, there holds:

|Zj|Cz,Π(Zj|X)Cπ,andΠ(Zj|K)KCh,for anyj=1,,p.|Z_{j}|\leq C_{z},\,\,\|\Pi(Z_{j}|X)\|_{\infty}\leq C_{\pi},\,\hbox{and}\,\,\|\Pi(Z_{j}|\mathcal{H}_{K})\|_{K}\leq C_{h},\quad\mbox{for any}\,j=1,...,p.

Condition C(Light tail condition). There exist two constants c1,c2c_{1},\,c_{2} such that

{LK1/2X2t}c1exp(c2t2),for anyt>0.\mathbb{P}\{\|L_{K^{1/2}}X\|_{\mathcal{L}_{2}}\geq t\}\leq c_{1}\exp(-c_{2}t^{2}),\quad\mbox{for any}\ t>0.

Condition D(Entropy condition). For some constant 1/2<r<1/2<r<\infty, the sequence of eigenvalues ss_{\ell} satisfy that

s2r,+.s_{\ell}\asymp\ell^{-2r},\quad\ell\in\mathbb{N}^{+}.

Condition A is commonly used in the literature on semiparametric modelling; see (Müller and Van de Geer, 2015) for reference. This condition ensures that there is enough information in the data to identify the parameters in the scalar part. Condition B imposes some boundedness assumptions, which are not essential and are used only to simplify the technical proofs. Condition C implies that the random process L_{K^{1/2}}X has an exponential decay rate; the same condition is also considered in Cai and Yuan (2012). In particular, it is naturally satisfied if X is a Gaussian process. In Condition D, the parameters s_ℓ are related to the alignment between K and C, which plays an important role in determining the minimax optimal rates. Moreover, the decay of s_ℓ characterizes the kernel complexity and is closely related to various covering numbers and Rademacher complexities. In particular, the polynomial decay assumed in Condition D is satisfied by the classical Sobolev and Besov classes.

The following theorem states that with an appropriately chosen (μ,λ)(\mu,\lambda), the predictor g^:=X,f^2+𝐙T𝜸^\widehat{g}:=\langle X,\widehat{f}\rangle_{\mathcal{L}_{2}}+{\bf Z}^{T}\widehat{\boldsymbol{\gamma}} attains a sharp convergence rate under L2L_{2}-norm.

Theorem 1

Suppose that Conditions A-D hold, and that the tuning parameters (μ,λ) are chosen such that

μnr2r+1+log(2p)/n,λlog(2p)/n.\mu\asymp n^{-\frac{r}{2r+1}}+\sqrt{\log(2p)/n},\,\,\lambda\asymp\sqrt{\log(2p)/n}.

Then with probability at least 12exp[n(δ1′′)2μ2]1-2\exp[-n(\delta_{1}^{\prime\prime})^{2}\mu^{2}], the proposed estimation for PFLM satisfies

g^g2(n2r2r+1+p0log(2p)n),\|\widehat{g}-g^{*}\|^{2}\lesssim\Big{(}n^{-\frac{2r}{2r+1}}+\frac{p_{0}\log(2p)}{n}\Big{)},

where δ_1'' is an appropriately small positive constant.

Theorem 1 shows that the proposed estimator (2.1) achieves a fast convergence rate in terms of prediction error. Note that the derived rate depends on the kernel complexity of K^{1/2}CK^{1/2} and the sparsity of the scalar components. It is interesting to note that even when there exists some underlying correlation structure between the functional feature and the scalar covariates, the choice of the hyper-parameter μ depends on the structural information of all the features, while the sparsity hyper-parameter λ depends only on the scalar component.

Theorem 2

Suppose that all the conditions in Theorem 1 are satisfied. Then with probability at least 14exp[n(δ1′′)2μ2]52p1-4\exp[-n(\delta_{1}^{\prime\prime})^{2}\mu^{2}]-\frac{5}{2p}, there holds

𝐙~T(𝜸^𝜸)2+λ8𝜸^𝜸1(p0Λmin2log(2p)n),\displaystyle\|\widetilde{\bf Z}^{T}(\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*})\|^{2}+\frac{\lambda}{8}\|\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*}\|_{1}\lesssim\Big{(}\frac{p_{0}}{\Lambda^{2}_{min}}\frac{\log(2p)}{n}\Big{)}, (3.2)

and in addition, we have

X,g^g22(n2r2r+1+p0log(2p)n).\displaystyle\|\langle X,\widehat{g}-g^{*}\rangle_{\mathcal{L}_{2}}\|^{2}\lesssim\Big{(}n^{-\frac{2r}{2r+1}}+\frac{p_{0}\log(2p)}{n}\Big{)}. (3.3)

It is worth pointing out that the estimation error of the parametric estimator γ̂ achieves the optimal convergence rate for high-dimensional linear models (Verzelen, 2012), even in the presence of the nonparametric component. This result in the functional literature is similar in spirit to those for classical high-dimensional partial linear models (Müller and Van de Geer, 2015; Yu, Levine, and Cheng, 2019).

3.2 Lower Bounds

In this part, we establish the lower bounds on the minimax risk of estimating 𝜸\boldsymbol{\gamma}^{*} and X,f2\langle X,f^{*}\rangle_{\mathcal{L}_{2}} separately. Let B[p0,p]B[p_{0},p] be a set of pp-dimensional vectors with at most p0p_{0} non-zero coordinates, and K{\cal B}_{K} be the unit ball of K\mathcal{H}_{K}. Moreover, we define the risk of estimating 𝜸\boldsymbol{\gamma}^{*} as

R𝜸(p0,p,K):=inf𝜸^sup𝜸B[p0,p],fK𝔼[𝜸^𝜸22],R_{\boldsymbol{\gamma}^{*}}(p_{0},p,{\cal B}_{K}):=\inf_{\widehat{\boldsymbol{\gamma}}}\sup_{\boldsymbol{\gamma}^{*}\in B[p_{0},p],f^{*}\in{\cal B}_{K}}\mathbb{E}[\|\widehat{\boldsymbol{\gamma}}-\boldsymbol{\gamma}^{*}\|_{2}^{2}],

where inf{\inf} is taken over all possible estimators for 𝜸\boldsymbol{\gamma}^{*} in model (1.2). Similarly, we define the risk of estimating X,f2\langle X,f^{*}\rangle_{\mathcal{L}_{2}} as

R_{f^{*}}(p_{0},p,{\cal B}_{K}):=\inf_{\widehat{f}}\sup_{\boldsymbol{\gamma}^{*}\in B[p_{0},p],f^{*}\in{\cal B}_{K}}\mathbb{E}[\langle X,\widehat{f}-f^{*}\rangle_{\mathcal{L}_{2}}^{2}]=\inf_{\widehat{f}}\sup_{\boldsymbol{\gamma}^{*}\in B[p_{0},p],f^{*}\in{\cal B}_{K}}\|L_{C^{1/2}}(\widehat{f}-f^{*})\|_{\mathcal{L}_{2}}^{2}.

The following theorem provides the lower bounds of the minimax optimal estimation error for 𝜸\boldsymbol{\gamma}^{*} and the predictor error for ff^{*}, respectively.

Theorem 3

Suppose we are given n i.i.d. samples from (1.2) and that the entropy condition (Condition D) holds. When p diverges as n increases and p_0 ≪ p, the minimax risk for estimating γ^* can be bounded from below as

R𝜸(p0,p,K)p0log(p/p0)n;R_{\boldsymbol{\gamma}^{*}}(p_{0},p,{\cal B}_{K})\gtrsim\frac{p_{0}\log(p/p_{0})}{n};

the minimax risk for estimating X,f2\langle X,f^{*}\rangle_{\mathcal{L}_{2}} can be bounded from below as

Rf(p0,p,K)max{p0log(p/p0)n,n2r2r+1}.R_{f^{*}}(p_{0},p,{\cal B}_{K})\gtrsim\max\Big{\{}\frac{p_{0}\log(p/p_{0})}{n},n^{-\frac{2r}{2r+1}}\Big{\}}.

The proof of Theorem 3 is provided in Appendix A. As mentioned previously, these results indicate that the best possible estimation of γ^* is not affected by the presence of the nonparametric component, while the minimax risk for estimating the (nonparametric) slope function depends not only on its smoothness, but also on the dimensionality and sparsity of the scalar covariates. From the lower bound on R_{f^*}(p_0,p,B_K), we observe a rate-switching phenomenon between a sparse regime and a smooth regime. In particular, when p_0 log(p/p_0)/n dominates n^{-2r/(2r+1)}, corresponding to the sparse regime, the lower bound becomes the classical high-dimensional parametric rate p_0 log(p/p_0)/n. Otherwise, we are in the smooth regime, and the behavior is similar to that of classical nonparametric models. We also note that the minimax lower bound obtained for the prediction error generalizes the previous results for the pure functional linear model (Cai and Yuan, 2012).

4 Randomized Sketches and Optimization

This section considers an approximate algorithm for (2.2), based on constraining the original parameter α ∈ R^n to an m-dimensional subspace of R^n, where m ≪ n is the projection dimension. We define this approximation via a sketch matrix 𝕊 ∈ R^{m×n}, so that the m-dimensional subspace is generated by the row span of 𝕊. More precisely, the sketched kernel partially functional estimator is obtained by first solving

(\widehat{\boldsymbol{\alpha}}_{s},\widehat{\boldsymbol{\gamma}}_{s}):=\mathop{\rm argmin}_{\boldsymbol{\alpha}\in{\cal R}^{m},\boldsymbol{\gamma}\in{\cal R}^{p}}\frac{1}{n}\boldsymbol{\alpha}^{T}(\mathbb{S}\mathbb{K}^{c})(\mathbb{S}\mathbb{K}^{c})^{T}\boldsymbol{\alpha}-\frac{2}{n}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})+\frac{1}{n}\|\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma}\|_{2}^{2}
+\mu^{2}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}\mathbb{S}^{T}\boldsymbol{\alpha}+\lambda\|\boldsymbol{\gamma}\|_{1}. (4.1)

Then the resulting predictor for the slope function ff^{*} is given as

f^s(t):=k=1n(𝕊T𝜶^s)kBk(t)=𝜶^sT𝕊𝐁(t),t𝒯.\widehat{f}_{s}(t):=\sum_{k=1}^{n}(\mathbb{S}^{T}\widehat{\boldsymbol{\alpha}}_{s})_{k}B_{k}(t)=\widehat{\boldsymbol{\alpha}}_{s}^{T}\mathbb{S}{\bf B}(t),\quad\forall\,t\in\mathcal{T}.

where B(t) = (B_1(t),...,B_n(t))^T ∈ R^n, with B_k(t) defined in Lemma 1. With randomized sketches, an approximate kernel estimate α̂_s can be obtained by solving an m-dimensional quadratic program when γ̂_s is fixed, which involves time and space complexities of O(m^3) and O(m^2), respectively. Computing the approximate kernel matrix is a preprocessing step with time complexity O(n^2 log(m)) for properly chosen projections.
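For completeness, once α̂_s is available, the sketched slope estimate f̂_s can be evaluated on any set of time points as in the following sketch, again assuming the curves X_k are stored on a common grid; the names are illustrative.

```python
import numpy as np

def predict_slope(alpha_s, S, X, t_grid, kernel, t_eval):
    """Evaluate f_hat_s(t) = alpha_s^T S B(t), with B_k(t) = int X_k(u) K(t, u) du
    approximated by a Riemann sum on t_grid; returns f_hat_s at the points t_eval."""
    w = (t_grid[-1] - t_grid[0]) / (len(t_grid) - 1)
    # B[k, j] approximates B_k(t_eval[j]) = int X_k(u) K(t_eval[j], u) du
    B = (X * w) @ kernel(t_grid[:, None], t_eval[None, :])
    return alpha_s @ (S @ B)
```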

4.1 Alternating Optimization

This section provides the computational details of the proposed approach. Precisely, we aim to solve the following optimization task:

(𝜶^s,𝜸^s):=argmin𝜶m,𝜸p1n𝜶T(𝕊𝕂c)(𝕊𝕂c)T𝜶2n𝜶T𝕊𝕂c(𝐲𝜸)+\displaystyle(\widehat{\boldsymbol{\alpha}}_{s},\widehat{\boldsymbol{\gamma}}_{s}):=\mathop{\rm argmin}_{\boldsymbol{\alpha}\in{\cal R}^{m},\boldsymbol{\gamma}\in{\cal R}^{p}}\frac{1}{n}\boldsymbol{\alpha}^{T}(\mathbb{S}\mathbb{K}^{c})(\mathbb{S}\mathbb{K}^{c})^{T}\boldsymbol{\alpha}{-\frac{2}{n}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})}+
1n(𝐲𝜸)T(𝐲𝜸)+μ2𝜶T𝕊𝕂c𝕊T𝜶+λ𝜸1.\displaystyle\hskip 113.81102pt{\frac{1}{n}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})^{T}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})}+\mu^{2}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}\mathbb{S}^{T}\boldsymbol{\alpha}+\lambda\|\boldsymbol{\gamma}\|_{1}. (4.2)

To solve (4.2), a splitting algorithm with a proximal operator is applied, which updates the representer coefficients α and the linear coefficients γ sequentially. Specifically, at the t-th iteration with current solution (α^t, γ^t), the following two optimization tasks are solved sequentially to obtain the solution at the (t+1)-th iteration:

𝜶t+1=argmin𝜶m{1n𝜶T(𝕊𝕂c)(𝕊𝕂c)T𝜶2n𝜶T𝕊𝕂c(𝐲𝜸t)+μ2𝜶T𝕊𝕂c𝕊T𝜶},\displaystyle\boldsymbol{\alpha}^{t+1}=\mathop{\rm argmin}_{\boldsymbol{\alpha}\in{\cal R}^{m}}\Big{\{}\frac{1}{n}\boldsymbol{\alpha}^{T}(\mathbb{S}\mathbb{K}^{c})(\mathbb{S}\mathbb{K}^{c})^{T}\boldsymbol{\alpha}-\frac{2}{n}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma}^{t})+\mu^{2}\boldsymbol{\alpha}^{T}\mathbb{S}\mathbb{K}^{c}\mathbb{S}^{T}\boldsymbol{\alpha}\Big{\}}, (4.3)
𝜸t+1=argmin𝜸p{Rn(𝜶t+1,𝜸)+λ𝜸1},\displaystyle\mbox{\boldmath$\gamma$}^{t+1}=\mathop{\rm argmin}_{\mbox{\boldmath$\gamma$}\in{\cal R}^{p}}\Big{\{}R_{n}(\boldsymbol{\alpha}^{t+1},\mbox{\boldmath$\gamma$})+\lambda\|\boldsymbol{\gamma}\|_{1}\Big{\}}, (4.4)

where Rn(𝜶t+1,𝜸):=2n(𝜶t+1)T𝕊𝕂c𝜸+1n(𝐲𝜸)T(𝐲𝜸)R_{n}(\boldsymbol{\alpha}^{t+1},\mbox{\boldmath$\gamma$}):=\frac{2}{n}({\boldsymbol{\alpha}}^{t+1})^{T}\mathbb{S}\mathbb{K}^{c}\mathbb{Z}\boldsymbol{\gamma}+{\frac{1}{n}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})^{T}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma})}.

To update 𝜶\boldsymbol{\alpha}, it is clear that the optimization task (4.3) has an analytic solution that

\boldsymbol{\alpha}^{t+1}=\big{(}(\mathbb{S}\mathbb{K}^{c})(\mathbb{S}\mathbb{K}^{c})^{T}+n\mu^{2}\mathbb{S}\mathbb{K}^{c}\mathbb{S}^{T}\big{)}^{-1}\mathbb{S}\mathbb{K}^{c}(\mathbf{y}-\mathbb{Z}\boldsymbol{\gamma}^{t}).

To update 𝜸\boldsymbol{\gamma}, we first introduce the proximal operator (Moreau, 1962), which is defined as

\mbox{Prox}_{\lambda\|\cdot\|_{1}}({\bf u}):=\mathop{\rm argmin}_{{\bf v}}\Big{\{}\frac{1}{2}\|{\bf v}-{\bf u}\|_{2}^{2}+\lambda\|{\bf v}\|_{1}\Big{\}}. (4.5)

Note that the solution of the optimization task (4.5) is given by the well-known soft-thresholding operator:

(Proxλ1(𝐮))i=sign(ui)(|ui|λ)+.{\big{(}\mbox{Prox}_{{\lambda}\|\cdot\|_{1}}(\mathop{\bf u})\big{)}_{i}=\mathop{\rm sign}(u_{i})(|u_{i}|-{\lambda})_{+}}.

Then, for the optimization task (4.4), we have

𝜸t+1=ProxλD1(𝜸t1D𝜸Rn(𝜶t+1,𝜸t)),\mbox{\boldmath$\gamma$}^{t+1}=\mbox{Prox}_{\frac{\lambda}{D}\|\cdot\|_{1}}\Big{(}\mbox{\boldmath$\gamma$}^{t}-\frac{1}{D}\nabla_{\mbox{\boldmath$\gamma$}}R_{n}(\boldsymbol{\alpha}^{t+1},\mbox{\boldmath$\gamma$}^{t})\Big{)},

where D denotes an upper bound on the Lipschitz constant of ∇_γ R_n(α^{t+1}, ·), and the gradient is ∇_γ R_n(α^{t+1}, γ^t) = (2/n)ℤ^T(𝕊𝕂^c)^T α^{t+1} + (2/n)ℤ^T ℤ γ^t − (2/n)ℤ^T y. We repeat the above iteration steps until (α^{t+1}, γ^{t+1}) converges.

It should be pointed out that the exact value of D is often difficult to determine in large-scale problems. A common remedy is to use a backtracking scheme (Boyd and Vandenberghe, 2004) as a more efficient alternative that approximately computes such an upper bound.
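Putting the two updates together, one possible implementation of the iteration (4.3)-(4.4) is sketched below. It assumes 𝕊𝕂^c and 𝕊𝕂^c𝕊^T have been precomputed, and it uses the exact bound D = 2‖ℤ^Tℤ‖_2/n instead of backtracking; all names are illustrative rather than part of the proposed method.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sketched_alternating(SK, SKS, Z, y, mu2, lam, n_iter=200):
    """Alternating updates (4.3)-(4.4). SK = S @ Kc (m x n), SKS = S @ Kc @ S.T (m x m)."""
    n, p = Z.shape
    m = SK.shape[0]
    alpha, gamma = np.zeros(m), np.zeros(p)
    A = SK @ SK.T / n + mu2 * SKS                 # fixed matrix of the alpha-system
    D = 2.0 / n * np.linalg.norm(Z, 2) ** 2       # Lipschitz bound for the gamma-gradient
    for _ in range(n_iter):
        # alpha update: closed-form solution of (4.3)
        alpha = np.linalg.solve(A, SK @ (y - Z @ gamma) / n)
        # gamma update: one proximal gradient step on (4.4)
        grad = 2.0 / n * (Z.T @ (SK.T @ alpha) + Z.T @ (Z @ gamma) - Z.T @ y)
        gamma = soft_threshold(gamma - grad / D, lam / D)
    return alpha, gamma
```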

4.2 Choice of Random Sketch Matrix

In this paper, we consider three random sketch methods: the sub-Gaussian random sketch (GRS), the randomized orthogonal system sketch (ROS) and the sub-sampling random sketch (SUB). Precisely, we denote the i-th row of the random matrix 𝕊 by s_i and consider three different constructions of s_i as follows.

Sub-Gaussian sketch (GRS): The row 𝐬i{\mathop{\bf s}}_{i} of 𝕊\mathbb{S} is zero-mean 11-sub-Gaussian if for any 𝐮n\mathop{\bf u}\in{\cal R}^{n}, we have

P(𝐬i,𝐮t𝐮2)et2/2,t0.\text{P}\big{(}\langle{\mathop{\bf s}}_{i},\mathop{\bf u}\rangle\geq t\|\mathop{\bf u}\|_{2}\big{)}\leq e^{-t^{2}/2},\ \ \forall\,t\geq 0.

Note that a row s_i with independent and identically distributed N(0,1) entries is 1-sub-Gaussian. For simplicity, we further rescale the sub-Gaussian sketch matrix 𝕊 so that the rows s_i have covariance matrix (1/√m)𝕀_n, where 𝕀_n denotes the n-dimensional identity matrix.

Randomized orthogonal system sketch (ROS): The row 𝐬i\mathop{\bf s}_{i} of the random matrix 𝕊\mathbb{S} is formed with i.i.d rows of the form

𝐬i=nmT𝕀(i), for i=1,,m,{\mathop{\bf s}}_{i}=\sqrt{\frac{n}{m}}\mathbb{R}\mathbb{H}^{T}\mathbb{I}_{(i)},~{}\text{ for }~{}i=1,...,m,

where ℝ ∈ R^{n×n} is a random diagonal matrix whose entries are i.i.d. Rademacher variables taking values in {−1,1} with equal probability, ℍ = {H_{ij}}_{i,j=1}^n ∈ R^{n×n} is an orthonormal matrix with bounded entries H_{ij} ∈ [−1/√n, 1/√n], and the n-dimensional vectors 𝕀_{(1)},...,𝕀_{(m)} are drawn uniformly at random without replacement from the columns of the n-dimensional identity matrix 𝕀_n.

Sub-sampling sketches (SUB): The rows s_i of the random matrix 𝕊 have the form

𝐬i=nm𝕀(i),{\mathop{\bf s}}_{i}=\sqrt{\frac{n}{m}}\mathbb{I}_{(i)},

where the n-dimensional vectors 𝕀_{(1)},...,𝕀_{(m)} are drawn uniformly at random without replacement from the columns of the n-dimensional identity matrix. Note that the sub-sampling sketch can be regarded as a special case of the ROS sketch, obtained by replacing the matrix ℝℍ^T with the n-dimensional identity matrix 𝕀_n.
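For illustration, the three sketch matrices can be generated as follows. This is only a sketch: the ROS construction uses a normalized Hadamard matrix as one admissible choice of ℍ (and therefore assumes n is a power of two), and scipy is assumed to be available.

```python
import numpy as np
from scipy.linalg import hadamard

def gaussian_sketch(m, n, rng):
    """GRS: i.i.d. standard normal entries, rescaled by 1/sqrt(m)."""
    return rng.standard_normal((m, n)) / np.sqrt(m)

def ros_sketch(m, n, rng):
    """ROS: rows sqrt(n/m) * R H^T e_j, with R a Rademacher diagonal and
    H = Hadamard(n)/sqrt(n) (entries +-1/sqrt(n)); assumes n is a power of two."""
    signs = rng.choice([-1.0, 1.0], size=n)            # diagonal of R
    H = hadamard(n) / np.sqrt(n)                       # orthonormal with bounded entries
    rows = rng.choice(n, size=m, replace=False)        # sample m rows without replacement
    return np.sqrt(n / m) * signs[None, :] * H[rows, :]

def sub_sketch(m, n, rng):
    """SUB: rows sqrt(n/m) * e_j, i.e. rescaled sub-sampling without replacement."""
    S = np.zeros((m, n))
    S[np.arange(m), rng.choice(n, size=m, replace=False)] = np.sqrt(n / m)
    return S
```

For instance, S = ros_sketch(m, n, np.random.default_rng(0)) produces one ROS sketch matrix.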

4.3 Choice of the Sketch Dimension

In practice, we are interested in m × n sketch matrices with m ≪ n to enhance computational efficiency. Note that the n × n kernel matrix in Lemma 1 is only a sufficient construction for the equivalent optimization. It has been shown theoretically for kernel regression (Yang et al., 2017) that the kernel matrix can be compressed to one of much smaller size, based on a suitable notion of intrinsic dimension. Despite the model difference from Yang et al. (2017), our kernel matrix 𝕂^c does not depend on the scalar covariates Z, and thus the results derived for kernel regression are still applicable to our case.

Consider the eigen-decomposition 𝕂^c = 𝕌𝔻𝕌^T of the kernel matrix, where 𝕌 ∈ R^{n×n} is an orthonormal matrix of eigenvectors and 𝔻 = diag{μ̂_1,...,μ̂_n} is a diagonal matrix of eigenvalues with μ̂_1 ≥ μ̂_2 ≥ ... ≥ μ̂_n ≥ 0. We define the kernel complexity function as

^(δ)=1nj=1nmin{δ,μ^j}.\widehat{\mathcal{R}}(\delta)=\sqrt{\frac{1}{n}\sum_{j=1}^{n}\min\{\delta,\hat{\mu}_{j}\}}.

The critical radius δ_n is defined to be the smallest positive solution of the inequality

^(δ)δ2/σ.\widehat{\mathcal{R}}(\delta)\leq\delta^{2}/\sigma.

Note that the existence and uniqueness of this critical radius is guaranteed for any kernel class. Based on it, we define the statistical dimension of the kernel as

dn:=min{j[n]:μ^jδn2}.d_{n}:=\min\{j\in[n]:\hat{\mu}_{j}\leq\delta^{2}_{n}\}.

Recall that Theorem 2 of Yang et al. (2017) shows that various forms of randomized sketches can achieve the minimax rate using a sketch dimension proportional to the statistical dimension d_n. In particular, for Gaussian and ROS sketches, the sketch dimension m is required to satisfy a lower bound of the form

m{cdnfor Gaussian sketches,cdnlog4(n)for ROS sketches.m\geq\begin{cases}cd_{n}&\hbox{for Gaussian sketches},\\ cd_{n}\log^{4}(n)&\hbox{for ROS sketches}.\end{cases}

Here c is some universal constant. In this paper, we adopt this specification of the sketch dimension m in our experiments.
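The critical radius and statistical dimension can be computed directly from the eigenvalues of 𝕂^c, for instance as in the following sketch; the bisection search and the noise scale σ are our own implementation choices, not prescribed by Yang et al. (2017).

```python
import numpy as np

def statistical_dimension(Kc, sigma=1.0):
    """Return the critical radius delta_n and statistical dimension d_n of a kernel matrix."""
    mu_hat = np.clip(np.sort(np.linalg.eigvalsh(Kc))[::-1], 0.0, None)  # mu_1 >= ... >= mu_n
    n = len(mu_hat)

    def R_hat(delta):                     # empirical kernel complexity function
        return np.sqrt(np.mean(np.minimum(delta, mu_hat)))

    lo, hi = 1e-12, max(mu_hat[0], 1.0)
    while R_hat(hi) > hi ** 2 / sigma:    # make sure the inequality holds at hi
        hi *= 2.0
    for _ in range(100):                  # bisection for the smallest feasible delta
        mid = 0.5 * (lo + hi)
        if R_hat(mid) <= mid ** 2 / sigma:
            hi = mid
        else:
            lo = mid
    delta_n = hi
    d_n = int(np.sum(mu_hat > delta_n ** 2)) + 1   # first j with mu_j <= delta_n^2, capped at n
    return delta_n, min(d_n, n)
```

With d_n in hand, the sketch dimension can then be set to c·d_n (or c·d_n·log^4(n) for ROS sketches), mirroring the prescription above.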

5 Numerical Experiments

In this section, we illustrate the numerical performance of the proposed method with random sketches in two numerical examples. Specifically, we assume that the true generating model is

Yi=𝒯f(t)Xi(t)𝑑t+𝐙iT𝜸+εi,\displaystyle Y_{i}=\int_{\cal T}f^{*}(t)X_{i}(t)dt+{\bf Z}_{i}^{T}\mbox{\boldmath$\gamma$}^{*}+\varepsilon_{i}, (5.1)

where ε_i ∼ N(0,σ^2) with σ = 1, and 𝒯 is set to [0,1]. Note that the generating scheme is the same as that in Hall and Horowitz (2007) and Yuan and Cai (2010). In practice, the integrals in the calculation of 𝔹 and 𝕂^c are approximated by summations; thus we generate 1000 equally spaced points in 𝒯 = [0,1] and evaluate the integrals on this grid. As a proper choice of the tuning parameters plays a crucial role in achieving the desired performance of the proposed method, we adopt 5-fold cross-validation to select the optimal values of μ and λ.

In all the simulated cases, we consider an RKHS H_K induced by a reproducing kernel function on 𝒯 × 𝒯 given by

K(s,t)\displaystyle K(s,t) =k12(kπ)4cos(kπs)cos(kπt)\displaystyle=\sum_{k\geq 1}\frac{2}{(k\pi)^{4}}\cos(k\pi s)\cos(k\pi t)
=k11(kπ)4cos(kπ(st))+k11(kπ)4cos(kπ(s+t))\displaystyle=\sum_{k\geq 1}\frac{1}{(k\pi)^{4}}\cos(k\pi(s-t))+\sum_{k\geq 1}\frac{1}{(k\pi)^{4}}\cos(k\pi(s+t))
=13B4(|st|2)13B4(s+t2),\displaystyle=-\frac{1}{3}B_{4}\big{(}\frac{|s-t|}{2}\big{)}-\frac{1}{3}B_{4}\big{(}\frac{s+t}{2}\big{)},

where B2m()B_{2m}(\cdot) denotes the 2m2m-th Bernoulli polynomial that

B2m(s)=(1)m12(2m)!k1cos(2πks)(2πk)2m,for anys𝒯.B_{2m}(s)=(-1)^{m-1}2(2m)!\sum_{k\geq 1}\frac{\cos(2\pi ks)}{(2\pi k)^{2m}},~{}\text{for any}~{}s\in{\cal T}.

Note that the RKHS H_K induced by K(s,t) contains functions in the linear span of the cosine basis, of the form

f(s)=2k1gkcos(kπs),for anys𝒯.f(s)=\sqrt{2}\sum_{k\geq 1}g_{k}\cos(k\pi s),~{}\text{for any}~{}s\in{\cal T}.

with Σ_{k≥1} k^4 g_k^2 < ∞. The endowed norm is

fK2=𝒯(2k1(kπ)2gkcos(kπt))2𝑑t=k1(kπ)4gk2.\|f\|^{2}_{K}=\int_{{\cal T}}\big{(}\sqrt{2}\sum_{k\geq 1}(k\pi)^{2}g_{k}\cos(k\pi t)\big{)}^{2}dt=\sum_{k\geq 1}(k\pi)^{4}g_{k}^{2}.
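For reference, the closed form above is easy to evaluate: the fourth Bernoulli polynomial is B_4(x) = x^4 − 2x^3 + x^2 − 1/30, and the following sketch checks the closed form against a truncation of the cosine series.

```python
import numpy as np

def bernoulli4(x):
    """Fourth Bernoulli polynomial B_4(x) = x^4 - 2x^3 + x^2 - 1/30."""
    return x ** 4 - 2 * x ** 3 + x ** 2 - 1.0 / 30.0

def sobolev_kernel(s, t):
    """Closed form K(s,t) = -(1/3) B_4(|s-t|/2) - (1/3) B_4((s+t)/2) on [0,1]^2."""
    return -(bernoulli4(np.abs(s - t) / 2.0) + bernoulli4((s + t) / 2.0)) / 3.0

# sanity check against the truncated cosine expansion of K(s, t)
s, t = 0.3, 0.7
k = np.arange(1, 5001)
series = np.sum(2.0 / (k * np.pi) ** 4 * np.cos(k * np.pi * s) * np.cos(k * np.pi * t))
assert abs(series - sobolev_kernel(s, t)) < 1e-8
```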

The performance of the proposed method is evaluated under the following two numerical examples.

Example 1. The true slope function f^* and the random function X are set as

f(t)=k=1504(1)k+1k22cos(kπt),f^{*}(t)=\sum_{k=1}^{50}4(-1)^{k+1}k^{-2}\sqrt{2}\cos(k\pi t),

and

X(t)=ξ1U1+k=250ξkUk2cos(kπt),X(t)=\xi_{1}U_{1}+\sum_{k=2}^{50}\xi_{k}U_{k}\sqrt{2}\cos(k\pi t),

where U_k ∼ U(−√3,√3) and ξ_k = (−1)^{k+1}k^{−v/2}. For the linear part, the true regression coefficients are set as γ^0 = (2,−2,0,...,0)^T, and the rows of the sample ℤ = (Z_1,...,Z_n)^T ∈ R^{n×p}, with Z_i = (z_{i1},...,z_{ip})^T, are generated with i.i.d. entries z_{ij} ∼ U(0,1).

Example 2. The generating scheme is the same as Example 1, except that

ξk={1,k=1,0.2(1)k+1(10.0001k),2k4,0.2(1)k+1[(5k/5)υ/20.0001(kmod5)],k5.\xi_{k}=\begin{cases}1,&\quad k=1,\\ 0.2(-1)^{k+1}(1-0.0001k),&\quad 2\leq k\leq 4,\\ 0.2(-1)^{k+1}\big{[}(5\lfloor k/5\rfloor)^{-\upsilon/2}-0.0001(k~{}\text{mod}~{}5)\big{]},&\quad k\geq 5.\end{cases}

Clearly, the ξ_k^2's are the eigenvalues of the covariance function C, and we choose v = 1.1, 2 and 4 to evaluate the effect of the smoothness of ξ_k in both examples. Note that in Example 1 these eigenvalues are well spaced and the covariance function C and the reproducing kernel K share the same eigenfunctions, while in Example 2 the eigenvalues are closely spaced, so that the alignment between K and C comes into play.
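As a concrete illustration, data from Example 1 under model (5.1) can be generated as follows. This is only a sketch: the truncation at 50 terms and the grid approximation follow the description above, and all function and variable names are illustrative.

```python
import numpy as np

def generate_example1(n, p, v, grid_size=1000, rng=None):
    """Simulate (t_grid, X, Z, Y, f_star, gamma0) from model (5.1) under the Example 1 design."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.linspace(0.0, 1.0, grid_size)
    k = np.arange(1, 51)
    cos_basis = np.sqrt(2.0) * np.cos(np.pi * np.outer(k, t))       # (50, grid_size)
    # slope function f*(t) = sum_k 4 (-1)^{k+1} k^{-2} sqrt(2) cos(k pi t)
    f_star = (4.0 * (-1.0) ** (k + 1) * k ** (-2.0)) @ cos_basis
    # random functions X_i(t): first term is xi_1 U_1 (no cosine), then xi_k U_k sqrt(2) cos(k pi t)
    xi = (-1.0) ** (k + 1) * k ** (-v / 2.0)
    X_basis = cos_basis.copy()
    X_basis[0, :] = 1.0
    U = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, 50))
    X = (U * xi) @ X_basis                                          # (n, grid_size)
    # scalar covariates and responses
    gamma0 = np.zeros(p)
    gamma0[:2] = [2.0, -2.0]
    Z = rng.uniform(0.0, 1.0, size=(n, p))
    w = 1.0 / (grid_size - 1)                                       # Riemann weights on [0,1]
    Y = X @ (f_star * w) + Z @ gamma0 + rng.standard_normal(n)
    return t, X, Z, Y, f_star, gamma0
```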

To examine the effect of the sample size, we consider the same settings as in Yang et al. (2017), namely n = 256, 512, 1024, 2048, 4096, 8192 and 16384, and conservatively take m = ⌊n^{1/3}⌋ for the three random sketch methods introduced in Section 4.2. Note that with this choice of m, the time and storage complexities reduce to O(n) and O(n^{2/3}), respectively. Each scenario is replicated 50 times, and the performance of the proposed method is evaluated by various measures, including the estimation accuracy of the linear coefficients, the integrated prediction error of the slope function, and the prediction error of the response. Specifically, the estimation accuracy of the linear coefficients is evaluated by ‖γ̂ − γ^0‖_2^2 = Σ_{l=1}^p (γ̂_l − γ_l^0)^2, and Figure 1 shows the estimation accuracy of the coefficients for different choices of v.

Figure 1: Estimation accuracy of the coefficients in Example 1 under various scenarios.

It is clear that the estimation error of the coefficients decreases steadily as the sample size n increases and becomes stable when n is sufficiently large, and the three employed sketch methods have similar performance. It is also interesting to notice that the convergence patterns under different choices of v are almost the same, which concurs with our theoretical findings in Theorems 1 and 3 that the estimation of γ^* is not affected by the presence of the nonparametric component.

Let (Y′, X′(·), Z′) denote an independent copy of (Y, X(·), Z); the integrated prediction error for the slope function is reported as

\widehat{\mathbb{E}}_{X^{\prime}}\|\widehat{f}-f^{*}\|^{2}=\widehat{\mathbb{E}}_{X^{\prime}}\Big{(}\int_{\cal T}(\widehat{f}(t)-f^{*}(t))X^{\prime}(t)dt\Big{)}^{2}.

The empirical expectation Ê is evaluated over a testing sample of size 10000 with Ŷ′_i = ∫_𝒯 f̂(t)X′_i(t)dt + (Z′_i)^T γ̂, and the numerical performance is summarized in Figure 2.
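Concretely, this Monte-Carlo quantity can be computed from gridded curves as in the following sketch; the function and argument names are illustrative.

```python
import numpy as np

def integrated_prediction_error(f_hat_vals, f_star_vals, X_test, t_grid):
    """Monte-Carlo estimate of E_X' ( int (f_hat - f*)(t) X'(t) dt )^2 on a grid."""
    w = (t_grid[-1] - t_grid[0]) / (len(t_grid) - 1)
    diffs = X_test @ ((f_hat_vals - f_star_vals) * w)   # one integral per test curve
    return np.mean(diffs ** 2)
```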

Figure 2: Prediction error of the slope function in Example 1 under various scenarios.

Figure 2 suggests that the prediction error of the slope function converges at a polynomial rate as the sample size n increases, which agrees with our theoretical results in Section 3, and the three employed sketch methods yield similar numerical performance. Moreover, it can be seen that the prediction error decreases as the value of v increases, which also concurs with our theoretical findings in Theorems 2 and 3 that the faster the eigenvalues decay, the smaller the prediction error.

We also report the prediction error of the response by calculating

𝔼^Y,XY^Y22.\widehat{\mathbb{E}}_{Y^{\prime},X^{\prime}}\|\widehat{Y}^{\prime}-Y^{\prime}\|^{2}_{2}.

The empirical expectation Ê is also evaluated over a testing sample of size 10000, and the numerical performance is summarized in Figure 3.

Figure 3: Prediction error of the response in Example 1 under various scenarios.

Clearly, the prediction error of the response converges at a polynomial rate as the sample size n increases and becomes smaller as v increases, which agrees with our theoretical results in Theorem 2. It is also interesting to point out that the three employed sketch methods yield similar numerical performance and that the prediction errors tend to converge to 1, the variance of ε in the true model. This verifies the efficiency of the proposed estimation and the proper choice of m.

The numerical results in Example 2, where the eigenvalues are closely spaced, are similar to those under the well-spaced eigenvalues of Example 1. Figure 4 shows the numerical performance under the closely spaced eigenvalue setting of Example 2.

Figure 4: Numerical performance of the proposed method in Example 2 under various scenarios.

6 Conclusion

This paper establishes the optimal minimax rates for estimation in the partially functional linear model (PFLM) under a kernel-based framework and a high-dimensional setting. The optimal minimax rates of estimation are established by using various techniques from empirical process theory for analyzing kernel classes, and an efficient numerical algorithm based on randomized sketches of the kernel matrix is implemented to verify our theoretical findings.


Acknowledgments

Shaogao Lv’s research was partially supported by NSFC-11871277. Xin He’s research was supported in part by NSFC-11901375 and Shanghai Pujiang Program 2019PJC051. Junhui Wang’s research was supported in part by HK RGC Grants GRF-11303918 and GRF-11300919.

References

  • Aronszajn (1950) N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
  • Bickel et al. (2009) P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of lasso and dantzig selector. Annals of Statistics, 37:1705–1732, 2009.
  • Bousquet (2002) O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. Comptes Rendus Mathematique Academie des Sciences Paris, 334:495–550, 2002.
  • Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
  • Bühlmann and Van. de. Geer (2011) P. Bühlmann and S. Van. de. Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg, 2011.
  • Cai and Yuan (2012) T. Cai and M. Yuan. Minimax and adaptive prediction for functional linear regression. Journal of the American Statistical Association, 107:1201–1216, 2012.
  • Cardot et al. (2003) H. Cardot, F. Ferraty, and P. Sarda. Spline estimators for the functional linear model. Statistica Sinica, 13:571–591, 2003.
  • Ferraty and Vieu (2006) F. Ferraty and P. Vieu. Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York, 2006.
  • Hall and Horowitz (2007) P. Hall and J. Horowitz. Methodology and convergence rates for functional linear regression. Annals of Statistics, 35:70–91, 2007.
  • Kong et al. (2016) D. Kong, K. Xue, F. Yao, and H. Zhang. Partially functional linear regression in high dimensions. Biometrika, 103:1–13, 2016.
  • Ledoux (1997) M. Ledoux. On talagrand’s deviation inequalities for product measures. Probability and Statistics, 1:63–87, 1997.
  • Ledoux (2001) M. Ledoux. The Concentration of Measure Phenomenon (Mathematical Surveys and Monographs). American Mathematical Society, Providence, RI, 2001.
  • Lu et al. (2014) Y. Lu, J. Du, and Z. Sun. Functional partially linear quantile regression model. Metrika, 77:17–32, 2014.
  • Mahoney (2011) M. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3:1–54, 2011.
  • Müller and Van de Geer (2015) P. Müller and S. Van de Geer. The partial linear model in high dimensions. Scandinavian Journal of Statistics, 42:580–608, 2015.
  • Ramsay and Silverman (2005) J. Ramsay and B. Silverman. Functional Data Analysis. Springer, New York, 2005.
  • Shin and Lee (2012) H. Shin and M. Lee. On prediction rate in partial functional linear regression. Journal of Multivariate Analysis, 103:93–106, 2012.
  • Tsybakov (2009) A. Tsybakov. Introduction to Nonparametric Estimation. Springer, New York, 2009.
  • Van. de. Geer (2000) S. Van. de. Geer. Empirical Processes in M-Estimation. Cambridge University Press, New York, 2000.
  • Verzelen (2012) N. Verzelen. Minimax risks for sparse regressions: Ultra-high dimensional phenomenons. Electronic Journal of Statistics, 6:38–90, 2012.
  • Yang and Barron (1999) Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27:1564–1599, 1999.
  • Yang et al. (2017) Y. Yang, M. Pilanci, and M. Wainwright. Randomized sketches for kernels: fast and optimal non-parametric regression. Annals of Statistics, 45:991–1023, 2017.
  • Yu et al. (2019) Z. Yu, M. Levine, and G. Cheng. Minimax optimal estimation in partially linear additive models under high dimension. Bernoulli, 25:1289–1325, 2019.
  • Yuan and Cai (2010) M. Yuan and T. Cai. A reproducing kernel Hilbert space approach to functional linear regression. Annals of Statistics, 38:3412–3444, 2010.
  • Zhu et al. (2014) H. Zhu, F. Yao, and H. Zhang. Structured functional additive regression in reproducing kernel Hilbert spaces. Journal of the Royal Statistical Society, Series B, 76:581–603, 2014.