
High-dimensional single-index models:
link estimation and marginal inference

Kazuma Sawaya1, Yoshimasa Uematsu2, Masaaki Imaizumi1,3 1The University of Tokyo, Bunkyo, Tokyo, Japan.
2Hitotsubashi University, Kunitachi, Tokyo, Japan.
3RIKEN Center for Advanced Intelligence Project, Chuo, Tokyo, Japan
Abstract.

This study proposes a novel method for estimation and hypothesis testing in high-dimensional single-index models. We address a common scenario where the sample size and the dimension of regression coefficients are large and comparable. Unlike traditional approaches, which often overlook the estimation of the unknown link function, we introduce a new method for link function estimation. Leveraging the information from the estimated link function, we propose more efficient estimators that are better aligned with the underlying model. Furthermore, we rigorously establish the asymptotic normality of each coordinate of the estimator. This provides a valid construction of confidence intervals and pp-values for any finite collection of coordinates. Numerical experiments validate our theoretical results.

KS is supported by Grant-in-Aid for JSPS Fellows (24KJ0841). YU is supported by JSPS KAKENHI (19K13665). MI was supported by JSPS KAKENHI (21K11780), JST CREST (JPMJCR21D2), and JST FOREST (JPMJFR216I).

1. Introduction

We consider nn i.i.d. observations {(𝑿i,yi)}i=1n\{(\bm{X}_{i},y_{i})\}_{i=1}^{n} with a pp-dimensional Gaussian feature vector 𝑿i𝒩p(𝟎,𝚺),𝚺p×p\bm{X}_{i}\sim\mathcal{N}_{p}(\bm{0},\bm{\Sigma}),\bm{\Sigma}\in\mathbb{R}^{p\times p}, and each scalar response yiy_{i} belongs to a set 𝒴\mathcal{Y} (e.g., \mathbb{R}, \mathbb{R}_{+}, \{0,1\}, \mathbb{N}\cup\{0\}), following the single-index model:

𝔼[yi𝑿i=𝒙]=g(𝜷𝒙),\displaystyle\operatorname{\mathbb{E}}[y_{i}\mid\bm{X}_{i}=\bm{x}]=g(\bm{\beta}^{\top}\bm{x}), (1)

where 𝜷=(β1,,βp)p\bm{\beta}=(\beta_{1},\dots,\beta_{p})^{\top}\in\mathbb{R}^{p} is an unknown deterministic coefficient vector, and g()g(\cdot) is an unknown deterministic function, referred to as the link function, with 𝜷𝒙\bm{\beta}^{\top}\bm{x} being the index. To identify the scale of 𝜷\bm{\beta}, we assume Var(𝜷𝑿i)=𝜷𝚺𝜷=1\mathrm{Var}(\bm{\beta}^{\top}\bm{X}_{i})=\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}=1. The model includes common scenarios such as:

  • Linear regression: yi𝑿i𝒩(𝜷𝑿i,σε2)y_{i}\mid\bm{X}_{i}\sim\mathcal{N}(\bm{\beta}^{\top}\bm{X}_{i},\sigma_{\varepsilon}^{2}) with σε>0\sigma_{\varepsilon}>0 by setting g(t)=tg(t)=t.

  • Poisson regression: yi𝑿iPois(exp(𝜷𝑿i))y_{i}\mid\bm{X}_{i}\sim\mathrm{Pois}(\exp(\bm{\beta}^{\top}\bm{X}_{i})) by setting g(t)=exp(t)g(t)=\exp(t).

  • Binary choice models: yi𝑿iBern(g(𝜷𝑿i))y_{i}\mid\bm{X}_{i}\sim\mathrm{Bern}(g(\bm{\beta}^{\top}\bm{X}_{i})) with g:[0,1]g:\mathbb{R}\to[0,1]. This includes logistic regression for g(t)=1/(1+exp(t))g(t)=1/(1+\exp(-t)) and the probit model by setting g()g(\cdot) as the cumulative distribution function of the standard Gaussian distribution.

We are interested in a high-dimensional setting, where both the sample size nn and the coefficient dimension p:=p(n)p:=p(n) are large and comparable. Specifically, this study examines the proportionally high-dimensional regime defined by:

n,p(n)andp(n)/n=:κκ¯,\displaystyle n,p(n)\to\infty\quad\text{and}\quad p(n)/n=:\kappa\to\bar{\kappa}, (2)

where κ¯\bar{\kappa} is a positive constant.

The single-index model (1) possesses several practically important properties. First, it mitigates concerns about model misspecification, as it eliminates the need to specify g()g(\cdot). Second, this model bypasses the curse of dimensionality associated with function estimation since the input index 𝜷𝑿i\bm{\beta}^{\top}\bm{X}_{i} is a scalar. This advantage is particularly notable in comparison with nonparametric regression models, such as yi=gˇ(𝑿i)+εiy_{i}=\check{g}(\bm{X}_{i})+\varepsilon_{i}, where gˇ:p\check{g}:\mathbb{R}^{p}\to\mathbb{R} remains unspecified. Third, the model facilitates the analysis of the contribution of each covariate, XijX_{ij} for j=1,,pj=1,\ldots,p, to the response yiy_{i} by testing βj=0\beta_{j}=0 against βj0\beta_{j}\neq 0. Owing to these advantages, single-index models have been actively researched for decades (Powell et al., 1989; Härdle and Stoker, 1989; Li and Duan, 1989; Ichimura, 1993; Klein and Spady, 1993; Hristache et al., 2001; Nishiyama and Robinson, 2005; Dalalyan et al., 2008; Alquier and Biau, 2013; Eftekhari et al., 2021; Bietti et al., 2022; Fan et al., 2023), attracting interest across a broad spectrum of fields, particularly in econometrics (Horowitz, 2009; Li and Racine, 2023).

In the proportionally high-dimensional regime as defined in (2), the single-index model and its variants have been extensively studied. For logistic regression, which is a particular instance of the single-index model, Sur et al. (2019); Salehi et al. (2019) have investigated the estimation and classification errors of the regression coefficient estimators 𝜷\bm{\beta}. Furthermore, Sur and Candès (2019); Zhao et al. (2022); Yadlowsky et al. (2021) have developed methods for asymptotically valid statistical inference. In the case of generalized linear models with a known link function g()g(\cdot), Rangan (2011); Barbier et al. (2019) have characterized the asymptotic behavior of the coefficient estimator, while Sawaya et al. (2023) have derived the coordinate-wise marginal asymptotic normality of an adjusted estimator of βj\beta_{j}. For the single-index model with an unknown link function g()g(\cdot), the seminal work by Bellec (2022) establishes the (non-marginal) asymptotic normality of estimators, even when there is link misspecification. However, the construction of an estimator for the link function g()g(\cdot) and the marginal asymptotic normality of the coefficient estimator are issues that have not yet been fully resolved.

Inspired by these seminal works, the following questions naturally arise:

  (i) Can we consistently estimate the unknown link function g()g(\cdot)?

  (ii) Can we rigorously establish marginal statistical inference for each coordinate of 𝜷\bm{\beta}?

  (iii) Can we improve the estimation efficiency by utilizing the estimated link function?

This paper aims to provide affirmative answers to these questions. Specifically, we propose a novel estimation methodology comprising three steps. First, we construct an estimator for the index 𝜷𝑿i\bm{\beta}^{\top}\bm{X}_{i}, based on the distributional characteristics of a pilot estimator for 𝜷\bm{\beta}. Second, we develop an estimator for the link function g()g(\cdot) by utilizing the estimated index in a nonparametric regression problem that involves errors-in-variables. Third, we design a new convex loss function that leverages the estimated link function to estimate 𝜷\bm{\beta}. To conduct statistical inference, we investigate the estimation problem of inferential parameters necessary for establishing coordinate-wise asymptotic normality in high-dimensional settings.

Our contributions are summarized as follows:

  • Link estimation: We propose a consistent estimator for the link function g()g(\cdot), which is of practical significance in its own right, alongside coefficient estimation. This aids in interpreting the model via the link function and mitigates the negative impact of link misspecification on coefficient estimation.

  • Marginal inference: We establish the asymptotic normality of any finite subset of the coordinates of our estimator, facilitating coordinate-wise inference of 𝜷\bm{\beta}. This approach allows us not only to test each variable’s contribution to the response but also to conduct variable selection based on importance statistics for each feature.

  • Efficiency improvement: By utilizing the consistently estimated link function, we anticipate that our estimator of 𝜷\bm{\beta} will be more efficient than previous estimators that rely on potentially misspecified link functions. We predominantly validate this efficiency through numerical simulations.

From a technical perspective, we leverage the proof strategy in Zhao et al. (2022) to demonstrate the marginal asymptotic normality of our estimator for 𝜷\bm{\beta}. Specifically, we extend the arguments to a broader class of unregularized M-estimators, whereas Zhao et al. (2022) originally considered the maximum likelihood estimator (MLE) for logistic regression.

1.1. Marginal Inference in High Dimensions

We review key technical aspects of statistical inference for each coordinate βj\beta_{j} in the proportionally high-dimensional regime (2). We maintain 𝜷𝑿i\bm{\beta}^{\top}\bm{X}_{i} of constant order by considering the setting 𝜷𝚺𝜷=Θ(1)\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}=\Theta(1). We define 𝚯=𝚺1\bm{\Theta}=\bm{\Sigma}^{-1} as the precision matrix for the distribution of 𝑿i\bm{X}_{i} and set τj2:=Θjj1>0\tau_{j}^{2}:=\Theta_{jj}^{-1}>0. An unbiased estimator of τj\tau_{j} can be constructed using nodewise regression (cf. Section 5.1 in Zhao et al. (2022)). For simplicity, we assume τj\tau_{j} is known, following prior studies.

In the high-dimensional regime (2), statistical inference must address two components: the asymptotic distribution and the inferential parameters of an estimator. We review the asymptotic distribution of the MLE 𝜷^m\hat{\bm{\beta}}^{\mathrm{m}} for logistic regression. According to Zhao et al. (2022), for all j{1,,p}j\in\{1,\ldots,p\} such that pτjβj=O(1)\sqrt{p}\tau_{j}\beta_{j}=O(1) as nn\to\infty, the estimator achieves the following asymptotic normality:

p(β^jmμβj)σ/τjd𝒩(0,1).\displaystyle\frac{\sqrt{p}(\hat{\beta}_{j}^{\mathrm{m}}-\mu_{*}\beta_{j})}{\sigma_{*}/\tau_{j}}\overset{\mathrm{d}}{\to}\mathcal{N}(0,1). (3)

Here, we define μ\mu_{*}\in\mathbb{R} and σ>0\sigma_{*}>0 as the asymptotic bias and variance, respectively, ensuring the convergence (3). It is crucial to note that both the estimator β^jm\hat{\beta}_{j}^{\mathrm{m}} and the target βj\beta_{j} scale as Op(1/p)O_{p}(1/\sqrt{p}) here.

To perform statistical inference based on (3), it is necessary to estimate the inferential parameters μ\mu_{*} and σ\sigma_{*}. Several studies including El Karoui et al. (2013); Thrampoulidis et al. (2018); Sur and Candès (2019); Loureiro et al. (2021) theoretically characterize these parameters as solutions to a system of nonlinear equations that depend on the data-generating process and the loss function. Additionally, various approaches have been developed to practically solve the equations by determining their hyperparameter 𝜷𝚺𝜷\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta} under different conditions. Specifically, Sur and Candès (2019) introduces ProbeFrontier for estimating 𝜷𝚺𝜷\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta} based on the asymptotic existence/non-existence boundary of the maximum likelihood estimator (MLE) in logistic regression. SLOE, proposed by Yadlowsky et al. (2021), enhances this estimation using a leave-one-out technique. Moreover, Sawaya et al. (2023) takes a different approach to estimate 𝜷𝚺𝜷\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta} for generalized linear models.

For single-index models, Bellec (2022) introduces observable adjustments that estimate the inferential parameters directly under the identification condition 𝜷𝚺𝜷=1\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}=1 irrespective of link misspecification, bypassing the system of equations. In our study, we develop an estimator for the single-index model satisfying the asymptotic normality (3), with corresponding estimators for the inferential parameters using observable adjustments.

1.2. Related Works

Research into the asymptotic behavior of statistical models in high-dimensional settings, where both nn and pp diverge proportionally, has gained momentum in recent years. Notable areas of exploration include (regularized) linear regression models (Donoho et al., 2009; Bayati and Montanari, 2011b; Krzakala et al., 2012; Bayati et al., 2013; Thrampoulidis et al., 2018; Mousavi et al., 2018; Takahashi and Kabashima, 2018; Miolane and Montanari, 2021; Guo and Cheng, 2022; Hastie et al., 2022; Li and Sur, 2023), robust estimation (El Karoui et al., 2013; Donoho and Montanari, 2016), generalized linear models (Rangan, 2011; Sur et al., 2019; Sur and Candès, 2019; Salehi et al., 2019; Barbier et al., 2019; Zhao et al., 2022; Tan and Bellec, 2023; Sawaya et al., 2023), low-rank matrix estimation (Deshpande et al., 2017; Macris et al., 2020; Montanari and Venkataramanan, 2021), and various other models (Montanari et al., 2019; Loureiro et al., 2021; Yang and Hu, 2021; Mei and Montanari, 2022). These investigations focus primarily on the convergence limits of estimation and prediction errors. Theoretical analyses have shown that classical statistical estimation often fails to accurately estimate standard errors and may lack key desirable properties such as asymptotic unbiasedness and asymptotic normality.

In such analyses, the following theoretical tools have been employed: (i) the replica method (Mézard et al., 1987; Charbonneau et al., 2023), (ii) approximate message passing algorithms (Donoho et al., 2009; Bolthausen, 2014; Bayati and Montanari, 2011a; Feng et al., 2022), (iii) the leave-one-out technique (El Karoui et al., 2013; El Karoui, 2018), (iv) the convex Gaussian min-max theorem (Thrampoulidis et al., 2018), (v) second-order Poincaré inequalities (Chatterjee, 2009; Lei et al., 2018), and (vi) second-order Stein’s formulae (Bellec and Zhang, 2021; Bellec and Shen, 2022). Although these tools were initially proposed for analyzing linear models with Gaussian design, they have been extensively adapted to a diverse range of models. In this study, we apply observable adjustments based on second-order Stein’s formulae (Bellec, 2022) to directly estimate the asymptotic bias and variance of coefficient estimators. Furthermore, we provide a comprehensive proof of marginal asymptotic normality, extending the work of Zhao et al. (2022) to a wider array of estimators.

1.3. Notation

Define [z]={1,,z}[z]=\{1,\ldots,z\} for zz\in\mathbb{N}. For a vector 𝒃=(b1,,bp)p\bm{b}=(b_{1},\dots,b_{p})\in\mathbb{R}^{p}, we write 𝒃:=(j=1pbj2)1/2\|\bm{b}\|:=(\sum_{j=1}^{p}b_{j}^{2})^{1/2}. For a collection of indices 𝒮[p]\mathcal{S}\subset[p], we define the sub-vector 𝒃𝒮:=(bj)j𝒮\bm{b}_{\mathcal{S}}:=(b_{j})_{j\in\mathcal{S}} as a slice of 𝒃\bm{b}. For a matrix 𝑨p×p\bm{A}\in\mathbb{R}^{p\times p}, we define its minimum and maximum eigenvalues by λmin(𝑨)\lambda_{\rm min}(\bm{A}) and λmax(𝑨)\lambda_{\rm max}(\bm{A}), respectively. For a function F:F:\mathbb{R}\to\mathbb{R}, we write FF^{\prime} for the derivative of FF and F(m)F^{(m)} for the mmth-order derivative. For a function f:f:\mathbb{R}\to\mathbb{R} and a vector 𝒃p\bm{b}\in\mathbb{R}^{p}, f(𝒃)=(f(b1),f(b2),,f(bp))pf(\bm{b})=(f(b_{1}),f(b_{2}),\dots,f(b_{p}))^{\top}\in\mathbb{R}^{p} denotes the vector obtained by applying f()f(\cdot) elementwise.

1.4. Organization

We organize the remainder of the paper as follows: Section 2 presents our estimation procedure. Section 3 describes the asymptotic properties of the proposed estimator and develops a statistical inference method. Section 4 provides several experiments to validate our estimation theory. Section 5 outlines the proofs of our theoretical results. Section 6 discusses alternative designs for estimators. Finally, Section 7 concludes with a discussion of our findings. The Appendix contains additional discussions and the complete proofs.

2. Statistical Estimation Procedure

In this section, we introduce a novel statistical estimation method for single-index models as defined in (1). Our estimator 𝜷^\hat{\bm{\beta}} is constructed through the following steps:

  (i) Construct an index estimator WiW_{i} for 𝜷𝑿i\bm{\beta}^{\top}\bm{X}_{i} using the ridge regression estimator 𝜷~\tilde{\bm{\beta}}, referred to as a pilot estimator. This pilot estimator is valid regardless of link misspecification.

  (ii) Develop a function estimator g^()\hat{g}(\cdot) for the link function g()g(\cdot), based on the distributional characteristics of the index estimator WiW_{i}.

  (iii) Construct our estimator 𝜷^\hat{\bm{\beta}} for the coefficients 𝜷\bm{\beta}, using the estimated link function g^()\hat{g}(\cdot).

Statistical inference involves an additional fourth step:

  (iv) Estimate the inferential parameters μ\mu_{*} and σ\sigma_{*}, conditional on the estimated link function g^()\hat{g}(\cdot).

In our estimation procedure, we divide the dataset (𝑿i,yi)i=1n{(\bm{X}_{i},y_{i})}_{i=1}^{n} into two disjoint subsets (𝑿i,yi)iI1{(\bm{X}_{i},y_{i})}_{i\in I_{1}} and (𝑿i,yi)iI2{(\bm{X}_{i},y_{i})}_{i\in I_{2}}, where I1,I2[n]I_{1},I_{2}\subset[n] are index sets such that I1I2=I_{1}\cap I_{2}=\emptyset and I1I2=[n]I_{1}\cup I_{2}=[n]. Additionally, for k=1,2k=1,2, let 𝑿(k)nk×p\bm{X}^{(k)}\in\mathbb{R}^{n_{k}\times p} and 𝒚(k)nk\bm{y}^{(k)}\in\mathbb{R}^{n_{k}} denote the design matrix and response vector of subset IkI_{k}, respectively. We utilize the first subset to estimate the link function (Steps (i) and (ii)), and the second subset to estimate the regression coefficients (Step (iii)) and inference parameters (Step (iv)). From a theoretical perspective, this division helps to manage the complicated dependency structure caused by data reuse. Nonetheless, for practical applications, we recommend employing all observations in each step to maximize the utilization of the data’s inherent signal strength. Here, n1n_{1} and n2n_{2} are the sample sizes for each partition, satisfying n=n1+n2n=n_{1}+n_{2}. We define κ1=p/n1\kappa_{1}=p/n_{1} and κ2=p/n2\kappa_{2}=p/n_{2}.

2.1. Index Estimation

In this step, we use the first subset (𝑿(1),𝒚(1))(\bm{X}^{(1)},\bm{y}^{(1)}). We define the pilot estimator as the ridge estimator, 𝜷~=((𝑿(1))𝑿(1)+n1λ𝑰p)1(𝑿(1))𝒚(1)\tilde{\bm{\beta}}=(({\bm{X}^{(1)}})^{\top}\bm{X}^{(1)}+n_{1}\lambda\bm{I}_{p})^{-1}({\bm{X}^{(1)}})^{\top}\bm{y}^{(1)} where λ>0\lambda>0 is the regularization parameter. Further, we consider inferential parameters (μ1,σ1)(\mu_{1},\sigma_{1}) of 𝜷~\tilde{\bm{\beta}}, which satisfy pτj(β~jμ1βj)/σ1d𝒩(0,1)\sqrt{p}\tau_{j}(\tilde{\beta}_{j}-\mu_{1}\beta_{j})/\sigma_{1}\overset{\mathrm{d}}{\to}\mathcal{N}(0,1) for j[p]j\in[p] such that pτjβj=O(1)\sqrt{p}\tau_{j}\beta_{j}=O(1). Using these parameters, we develop an estimator WiW_{i} for the index 𝜷𝑿i(1)\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}} as follows:

Wi:=μ~1𝜷~𝑿i(1)μ~1γ~(yi(1)𝜷~𝑿i(1))\displaystyle W_{i}:=\tilde{\mu}^{-1}\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}-\tilde{\mu}^{-1}\tilde{\gamma}\left(y_{i}^{(1)}-\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}\right) (4)

for each i[n1]i\in[n_{1}]. Here, μ~\tilde{\mu} and σ~2\tilde{\sigma}^{2} are estimators of μ1\mu_{1} and σ1\sigma_{1}, defined as

μ~=|𝜷~2σ~2|1/2andσ~2=n11𝒚(1)𝑿(1)𝜷~2(v~+λ)2/κ1,\displaystyle\tilde{\mu}=\left\lvert\|\tilde{\bm{\beta}}\|^{2}-\tilde{\sigma}^{2}\right\rvert^{1/2}\quad\mbox{and}\quad\tilde{\sigma}^{2}=\frac{n_{1}^{-1}\|\bm{y}^{(1)}-\bm{X}^{(1)}\tilde{\bm{\beta}}\|^{2}}{(\tilde{v}+\lambda)^{2}/\kappa_{1}}, (5)

where γ~:=κ1/(v~+λ)\tilde{\gamma}:=\kappa_{1}/(\tilde{v}+\lambda) and v~=n11tr(𝑰n1𝑿(1)((𝑿(1))𝑿(1)+n1λ𝑰p)1(𝑿(1)))\tilde{v}=n_{1}^{-1}\mathrm{tr}(\bm{I}_{n_{1}}-\bm{X}^{(1)}({(\bm{X}^{(1)}})^{\top}\bm{X}^{(1)}+n_{1}\lambda\bm{I}_{p})^{-1}{(\bm{X}^{(1)}})^{\top}). These estimators are obtained by the observable adjustment technique described in Bellec (2022).

This index estimator WiW_{i} is approximately unbiased for the index 𝜷𝑿i(1)\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}}, yielding the following asymptotic result:

Wi𝜷𝑿i(1)+𝒩(0,σ~2/μ~2).\displaystyle W_{i}\approx\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}}+\mathcal{N}(0,\tilde{\sigma}^{2}/\tilde{\mu}^{2}). (6)

We will provide its rigorous statement in Proposition 1 in Section 5.1.
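To make this step concrete, below is a minimal sketch (in NumPy) of the ridge pilot estimator, the observable adjustments in (5), and the index estimator in (4). The function and variable names are ours for illustration, not the authors' implementation; the sketch assumes the first data split (X1, y1) is given.

```python
# Sketch of Step (i): ridge pilot estimator and index estimator W_i, following (4)-(5).
import numpy as np

def index_estimator(X1, y1, lam):
    """Return the index estimates W with the adjustments (mu_tilde, sigma2_tilde)."""
    n1, p = X1.shape
    kappa1 = p / n1
    A = X1.T @ X1 + n1 * lam * np.eye(p)
    beta_tilde = np.linalg.solve(A, X1.T @ y1)              # ridge pilot estimator
    # v_tilde = n1^{-1} tr(I_{n1} - X1 (X1'X1 + n1*lam*I_p)^{-1} X1')
    v_tilde = (n1 - np.trace(X1 @ np.linalg.solve(A, X1.T))) / n1
    gamma_tilde = kappa1 / (v_tilde + lam)
    resid = y1 - X1 @ beta_tilde
    # Observable adjustments (5)
    sigma2_tilde = (np.sum(resid ** 2) / n1) / ((v_tilde + lam) ** 2 / kappa1)
    mu_tilde = np.sqrt(np.abs(beta_tilde @ beta_tilde - sigma2_tilde))
    # Index estimator (4): W_i is approximately beta'X_i + N(0, sigma2_tilde / mu_tilde^2)
    W = (X1 @ beta_tilde - gamma_tilde * resid) / mu_tilde
    return W, beta_tilde, mu_tilde, sigma2_tilde
```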

There are other options for the pilot estimator besides the ridge estimator 𝜷~\tilde{\bm{\beta}}. If κ11\kappa_{1}\leq 1 holds, the least squares estimator can be an alternative. If yiy_{i} is binary or a non-negative integer, the MLE of logistic or Poisson regression, respectively, is a natural candidate, although the ridge estimator 𝜷~\tilde{\bm{\beta}} is valid regardless of the form that yiy_{i} takes. In each case, the estimated inferential parameters (γ~,μ~,σ~2)(\tilde{\gamma},\tilde{\mu},\tilde{\sigma}^{2}) should be updated accordingly. Details are presented in Section 6.

2.2. Link Estimation

We develop an estimator of the link function g()g(\cdot) using WiW_{i} in (4). If we could observe the true index 𝜷𝑿i(1)\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}} with the unknown coefficient 𝜷\bm{\beta}, it would be possible to estimate g(x)=𝔼[y1𝜷𝑿1=x]g(x)=\operatorname{\mathbb{E}}[y_{1}\mid\bm{\beta}^{\top}\bm{X}_{1}=x] by applying standard nonparametric methods to the pairs of responses and true indices (𝒚(1),𝑿(1)𝜷)(\bm{y}^{(1)},\bm{X}^{(1)}\bm{\beta}). However, as the true index is unobservable, we must estimate g()g(\cdot) using given pairs of responses and contaminated indices (yi(1),Wi)i=1n1{(y_{i}^{(1)},W_{i})}_{i=1}^{n_{1}}, where Wi𝜷𝑿i(1)+𝒩(0,σ~2/μ~2)W_{i}\approx\bm{\beta}^{\top}\bm{X}_{i}^{(1)}+\mathcal{N}(0,\tilde{\sigma}^{2}/\tilde{\mu}^{2}). The error 𝒩(0,σ~2/μ~2)\mathcal{N}(0,\tilde{\sigma}^{2}/\tilde{\mu}^{2}) in the regressor leads to attenuation bias in the estimation of g()g(\cdot), known as the errors-in-variables problem. To address this issue, we utilize a deconvolution technique (Stefanski and Carroll, 1990) to asymptotically remove the bias stemming from errors-in-variables. Further details of the deconvolution are provided in Section C.

We define an estimator of g()g(\cdot). In preparation, we specify a kernel function K:K:\mathbb{R}\rightarrow\mathbb{R}, and define a deconvolution kernel Kn:K_{n}:\mathbb{R}\to\mathbb{R} as follows:

Kn(x)=12πexp(itx)ϕK(t)ϕς~(t/hn)𝑑t,\displaystyle K_{n}(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty}\exp(-\mathrm{i}tx)\frac{\phi_{K}(t)}{\phi_{\tilde{\varsigma}}(t/h_{n})}dt, (7)

where hn>0h_{n}>0 is a bandwidth, i=1\mathrm{i}=\sqrt{-1} is the imaginary unit, ς~2=σ~2/μ~2\tilde{\varsigma}^{2}=\tilde{\sigma}^{2}/\tilde{\mu}^{2} is the ratio of the inferential parameters, and ϕK:\phi_{K}:\mathbb{R}\rightarrow\mathbb{R} and ϕς~:\phi_{\tilde{\varsigma}}:\mathbb{R}\to\mathbb{R} are the Fourier transform of K()K(\cdot) and the characteristic function of 𝒩(0,ς~2)\mathcal{N}(0,\tilde{\varsigma}^{2}), respectively.

g^(x):=[g˘](x)withg˘(x)=i=1n1yi(1)Kn((xWi)/hn)i=1n1Kn((xWi)/hn),\displaystyle\hat{g}(x):=\mathcal{R}[\breve{g}](x)\quad\mbox{with}\quad\breve{g}(x)=\frac{\sum_{i=1}^{n_{1}}y_{i}^{(1)}K_{n}\left((x-W_{i})/h_{n}\right)}{\sum_{i=1}^{n_{1}}K_{n}\left((x-W_{i})/h_{n}\right)}, (8)

where []\mathcal{R}[\cdot] is a monotonization operator, specified later, which maps any measurable function to a monotonic function, and g˘()\breve{g}(\cdot) is a Nadaraya-Watson estimator obtained by the deconvolution kernel. We will prove the consistency of this estimator in Theorem 1 in Section 3.

The monotonization operation []\mathcal{R}[\cdot] on g˘()\breve{g}(\cdot) is justifiable because the true link function g()g(\cdot) is assumed to be monotonic. One simple choice for []\mathcal{R}[\cdot], applicable to any measurable function f:f:\mathbb{R}\to\mathbb{R}, is

naive[f](x)=supxxf(x),x.\displaystyle\mathcal{R}_{\mathrm{naive}}[f](x)=\sup_{x^{\prime}\leq x}f(x^{\prime}),\quad x\in\mathbb{R}. (9)

This definition holds for all xx\in\mathbb{R}. Another effective alternative is the rearrangement operator (Chernozhukov et al., 2009). This operator monotonizes a measurable function f:f:\mathbb{R}\to\mathbb{R} within a compact interval [x¯,x¯][\underline{x},\overline{x}]\subset\mathbb{R}:

r[f](x)=inf{t:[0,1]1{f(ux¯x¯x¯)t}𝑑uxx¯x¯x¯},x[x¯,x¯].\displaystyle\mathcal{R}_{\mathrm{r}}[f](x)=\inf\left\{t\in\mathbb{R}:\int_{[0,1]}1\left\{f\left(\frac{u-\underline{x}}{\overline{x}-\underline{x}}\right)\leq t\right\}du\geq\frac{x-\underline{x}}{\overline{x}-\underline{x}}\right\},~{}x\in[\underline{x},\overline{x}]. (10)

This operator, which sorts the values of f()f(\cdot) in increasing order, is robust against local fluctuations such as function bumps. Thus, it effectively addresses bumps in g˘()\breve{g}(\cdot) arising from kernel-based methods.
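As an illustration of this step, the following sketch computes the deconvolution kernel (7) by numerical integration, the deconvolution Nadaraya-Watson estimator in (8), and the naive monotonization (9). It assumes a kernel K(·) whose Fourier transform is ϕ_K(t) = (1 − t²)³ on [−1, 1], a standard choice in the deconvolution literature; this particular kernel, the grid-based evaluation, and all names are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of Step (ii): deconvolution kernel (7), deconvolution Nadaraya-Watson
# estimator (8), and naive monotonization (9).
import numpy as np

def deconv_kernel(u, h, varsigma2, n_grid=400):
    """K_n(u) = (2*pi)^{-1} int exp(-itu) * phi_K(t) / phi_err(t/h) dt, where
    phi_err(s) = exp(-varsigma2 * s^2 / 2) is the N(0, varsigma2) characteristic function."""
    u = np.asarray(u, dtype=float)
    t = np.linspace(-1.0, 1.0, n_grid)
    phi_K = (1.0 - t ** 2) ** 3                    # Fourier transform of K, supported on [-1, 1]
    weight = phi_K * np.exp(varsigma2 * t ** 2 / (2.0 * h ** 2))
    integrand = np.cos(u[..., None] * t) * weight  # the odd (sine) part integrates to zero
    return integrand.sum(axis=-1) * (t[1] - t[0]) / (2.0 * np.pi)

def link_estimator(x_grid, W, y, h, varsigma2):
    """g_hat on a sorted grid: ratio form of (8), then sup-based monotonization (9)."""
    Ku = deconv_kernel((x_grid[:, None] - W[None, :]) / h, h, varsigma2)
    g_breve = (Ku @ y) / Ku.sum(axis=1)            # may be unstable where the denominator is small
    return np.maximum.accumulate(g_breve)          # assumes x_grid is increasing
```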

2.3. Coefficient Estimation

We next propose our estimator of 𝜷\bm{\beta} using g^()\hat{g}(\cdot) obtained in (8). In this step, we consider the link estimator g^()\hat{g}(\cdot) from 𝑿(1)\bm{X}^{(1)} as given, and estimate 𝜷\bm{\beta} using 𝑿(2)\bm{X}^{(2)}. To facilitate this, we introduce the surrogate loss function for 𝒃p\bm{b}\in\mathbb{R}^{p}, with 𝒙p\bm{x}\in\mathbb{R}^{p}, yy\in\mathbb{R}, and any measurable function g¯:\bar{g}:\mathbb{R}\to\mathbb{R}:

(𝒃;𝒙,y,g¯):=G¯(𝒙𝒃)y𝒙𝒃,\displaystyle\ell(\bm{b};\bm{x},y,\bar{g}):=\bar{G}(\bm{x}^{\top}\bm{b})-y\bm{x}^{\top}\bm{b}, (11)

where G¯:\bar{G}:\mathbb{R}\to\mathbb{R} is a function such that G¯(t)=g¯(t)\bar{G}^{\prime}(t)=\bar{g}(t). This function can be viewed as a natural extension of the matching loss (Auer et al., 1995) used in generalized linear models. If g¯()\bar{g}(\cdot) is strictly increasing, then the loss is strictly convex in 𝒃\bm{b}. Moreover, the surrogate loss is justified by the characteristics of the true parameter as follows (Agarwal et al., 2014):

𝜷=argmin𝒃p𝔼[(𝒃;𝑿1,y1,g)|𝑿1],\displaystyle\bm{\beta}=\operatornamewithlimits{argmin}_{\bm{b}\in\mathbb{R}^{p}}\operatorname{\mathbb{E}}\left[\ell(\bm{b};{\bm{X}_{1}},{y_{1}},{g})|\bm{X}_{1}\right], (12)

provided that G()G(\cdot) is integrable. The surrogate loss aligns with the negative log-likelihood when g()g(\cdot) is known and serves as a canonical link function, thereby making the surrogate loss minimizer a generalization of the MLEs in generalized linear models.

Using the second dataset 𝑿(2)\bm{X}^{(2)} with any given function g¯()\bar{g}(\cdot), we define our estimator of 𝜷\bm{\beta} as

𝜷^(g¯)=argmin𝒃pi=1n2(𝒃;𝑿i(2),yi(2),g¯)+J(𝒃),\displaystyle\hat{\bm{\beta}}(\bar{g})=\operatornamewithlimits{argmin}_{\bm{b}\in\mathbb{R}^{p}}\sum_{i=1}^{n_{2}}\ell(\bm{b};\bm{X}_{i}^{(2)},y_{i}^{(2)},\bar{g})+J(\bm{b}), (13)

where J:pJ:\mathbb{R}^{p}\to\mathbb{R} is a convex regularization function. Finally, we substitute the link estimator g^()\hat{g}(\cdot) into (13) to obtain our estimator 𝜷^(g^)\hat{\bm{\beta}}(\hat{g}). The use of a nonzero regularization term, J()J(\cdot), is beneficial in cases where the minimizer in (13) is not unique or does not exist; see, for example, Candès and Sur (2020) for the logistic regression case.
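As an illustration, the sketch below minimizes the objective in (13) by plain gradient descent; only g¯(·) (not its antiderivative G¯) is needed, since the gradient of (11) is x(g¯(x⊤b) − y). The ridge scaling passed as `reg`, the step-size rule, and the names are our assumptions; in practice a line search or Newton-type solver would be preferable.

```python
# Sketch of Step (iii): minimizing the surrogate-loss objective (13) with an
# optional ridge term, using gradient descent.
import numpy as np

def fit_coefficients(X2, y2, g_bar, reg=0.0, n_iter=5000, step=None):
    """Minimize sum_i [G_bar(x_i'b) - y_i * x_i'b] + reg * ||b||^2 / 2 over b."""
    n2, p = X2.shape
    b = np.zeros(p)
    if step is None:
        # crude step size; assumes g_bar' is bounded by roughly one
        step = 1.0 / (np.linalg.norm(X2, 2) ** 2 + reg)
    for _ in range(n_iter):
        grad = X2.T @ (g_bar(X2 @ b) - y2) + reg * b
        b -= step * grad
    return b

# Example usage (hypothetical names): plug in the estimated link on a grid
# g_bar = lambda t: np.interp(t, x_grid, g_hat_values)
# beta_hat = fit_coefficients(X2, y2, g_bar, reg=n2 * lam)
```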

2.4. Inferential Parameter Estimation

We finally study estimators for the inferential parameters of our estimator 𝜷^(g^)\hat{\bm{\beta}}(\hat{g}), which are essential for statistical inference as discussed in Section 1.1. As established in (3), it is necessary to estimate the asymptotic bias μ(g^)\mu_{*}(\hat{g}) and variance σ2(g^)\sigma_{*}^{2}(\hat{g}) that satisfy the following relationship:

p(β^j(g^)μ(g^)βj)σ(g^)d𝒩(0,1),j[p],\displaystyle\frac{\sqrt{p}(\hat{\beta}_{j}(\hat{g})-\mu_{*}(\hat{g})\beta_{j})}{\sigma_{*}(\hat{g})}\overset{\mathrm{d}}{\to}\mathcal{N}(0,1),\quad j\in[p], (14)

conditional on (𝑿(1),𝒚(1))(\bm{X}^{(1)},\bm{y}^{(1)}) and consequently on g^()\hat{g}(\cdot).

We develop estimators for these inferential parameters using observable adjustments as suggested by Bellec (2022), in accordance with the estimator (13). For any measurable function g¯:\bar{g}:\mathbb{R}\to\mathbb{R}, we define 𝑫=diag(g¯(𝑿(2)𝜷^(g¯)))\bm{D}=\mathrm{diag}(\bar{g}^{\prime}(\bm{X}^{(2)}\hat{\bm{\beta}}(\bar{g}))) and v^λ=n21tr(𝑫𝑫𝑿(2)((𝑿(2))𝑫𝑿(2)+n2λ𝑰p)1(𝑿(2))𝑫)\hat{v}_{\lambda}=n_{2}^{-1}\mathrm{tr}(\bm{D}-\bm{D}\bm{X}^{(2)}(({\bm{X}^{(2)}})^{\top}\bm{D}\bm{X}^{(2)}+n_{2}\lambda\bm{I}_{p})^{-1}({\bm{X}^{(2)}})^{\top}\bm{D}) for λ0\lambda\geq 0. When incorporating J(𝒃)=λ𝒃2/2J(\bm{b})=\lambda\|\bm{b}\|^{2}/2 into (13) with λ>0\lambda>0, we propose the following estimators:

μ^(g¯)\displaystyle\hat{\mu}(\bar{g}) =|𝜷^(g¯)2σ^2(g¯)|1/2 and σ^2(g¯)=𝒚(2)g¯(𝑿(2)𝜷^(g¯))2n2(v^λ+λ)2/κ2.\displaystyle=\left\lvert\|\hat{\bm{\beta}}(\bar{g})\|^{2}-\hat{\sigma}^{2}(\bar{g})\right\rvert^{1/2}\mbox{~{}and~{}}\hat{\sigma}^{2}(\bar{g})=\frac{\|\bm{y}^{(2)}-\bar{g}(\bm{X}^{(2)}\hat{\bm{\beta}}(\bar{g}))\|^{2}}{n_{2}(\hat{v}_{\lambda}+\lambda)^{2}/\kappa_{2}}. (15)

In the case where J()𝟎J(\cdot)\equiv\bm{0} holds, we define

μ^0(g¯)=|𝑿(2)𝜷^(g¯)2/n2(1κ2)σ^02(g¯)|1/2 and σ^02(g¯)=𝒚(2)g¯(𝑿(2)𝜷^(g¯))2n2v^02/κ2.\displaystyle\hat{\mu}_{0}(\bar{g})=\left\lvert{\|\bm{X}^{(2)}\hat{\bm{\beta}}(\bar{g})\|^{2}}/n_{2}-(1-\kappa_{2})\hat{\sigma}_{0}^{2}(\bar{g})\right\rvert^{1/2}\mbox{~{}~{}and~{}~{}}\hat{\sigma}_{0}^{2}(\bar{g})=\frac{\|\bm{y}^{(2)}-\bar{g}(\bm{X}^{(2)}\hat{\bm{\beta}}(\bar{g}))\|^{2}}{n_{2}\hat{v}_{0}^{2}/\kappa_{2}}. (16)

A theoretical justification for the asymptotic normality with these estimators and their application in inference is provided in Section 3.3.
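A minimal sketch of the observable-adjustment estimators (15) and (16) follows. The derivative g¯′ is approximated by central differences, and the ridge scaling is assumed to match the n2λ appearing in v̂_λ above; names are illustrative rather than the authors' implementation.

```python
# Sketch of Step (iv): estimators (15) (ridge, lam > 0) and (16) (unregularized, lam = 0)
# of the inferential parameters (mu, sigma).
import numpy as np

def inferential_parameters(X2, y2, beta_hat, g_bar, lam=0.0, eps=1e-4):
    n2, p = X2.shape
    kappa2 = p / n2
    z = X2 @ beta_hat
    d = (g_bar(z + eps) - g_bar(z - eps)) / (2.0 * eps)      # g_bar'(X^(2) beta_hat)
    DX = X2 * d[:, None]                                     # D X^(2)
    H = X2.T @ DX + n2 * lam * np.eye(p)                     # X'DX + n2*lam*I_p
    v_hat = (np.sum(d) - np.trace(DX @ np.linalg.solve(H, DX.T))) / n2
    rss = np.sum((y2 - g_bar(z)) ** 2)
    if lam > 0:                                              # Eq. (15)
        sigma2_hat = rss / (n2 * (v_hat + lam) ** 2 / kappa2)
        mu_hat = np.sqrt(np.abs(beta_hat @ beta_hat - sigma2_hat))
    else:                                                    # Eq. (16)
        sigma2_hat = rss / (n2 * v_hat ** 2 / kappa2)
        mu_hat = np.sqrt(np.abs(z @ z / n2 - (1.0 - kappa2) * sigma2_hat))
    return mu_hat, np.sqrt(sigma2_hat)
```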

3. Main Theoretical Results of Proposed Estimators

This section presents theoretical results for our estimation framework. Specifically, we prove the consistency of the estimator g^()\hat{g}(\cdot) for the link function g()g(\cdot), and the asymptotic normality of the estimator 𝜷^(g^)\hat{\bm{\beta}}(\hat{g}) for the coefficient vector 𝜷\bm{\beta}. Outlines of the proofs for each assertion will be provided in Section 5.

3.1. Assumptions

As a preparation, we present some assumptions.

Assumption 1 (Data-splitting in high dimensions).

There exist constants c1,c2>0c_{1},c_{2}>0 with c1c2c_{1}\leq c_{2} independent of nn such that κ1,κ2[c1,c2]\kappa_{1},\kappa_{2}\in[c_{1},c_{2}] holds.

Assumption 1 requires that the split subsamples, as described at the beginning of Section 2, have the same order of sample size.

Assumption 2 (Gaussian covariates and identification).

Each row of the matrix 𝐗\bm{X} independently follows 𝒩p(𝟎,𝚺)\mathcal{N}_{p}(\bm{0},\bm{\Sigma}) with 𝚺\bm{\Sigma} obeying 𝛃𝚺𝛃=1\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}=1 and 0<cΣ1λmin(𝚺)λmax(𝚺)cΣ<0<c_{\Sigma}^{-1}\leq\lambda_{\mathrm{min}}(\bm{\Sigma})\leq\lambda_{\mathrm{max}}(\bm{\Sigma})\leq c_{\Sigma}<\infty with a constant cΣc_{\Sigma}.

It is common to assume Gaussianity of covariates in the proportionally high-dimensional regime, as mentioned in Section 1.2. The condition 𝜷𝚺𝜷=1\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}=1 is necessary to identify the scale of 𝜷\bm{\beta}, which ensures the uniqueness of the estimator in the single-index model with an unknown link function g()g(\cdot). For example, without this condition, it would be impossible to distinguish between g(𝑿1𝒃)g(\bm{X}_{1}^{\top}\bm{b}) and f(2𝑿1𝒃)f(2\bm{X}_{1}^{\top}\bm{b}), where f(t)=g(t/2)f(t)=g(t/2), for any 𝒃p\bm{b}\in\mathbb{R}^{p}. Furthermore, the assumption of upper and lower bounds on the eigenvalues of 𝚺\bm{\Sigma} implies that 𝜷=Θ(1)\|\bm{\beta}\|=\Theta(1).

Assumption 3 (Monotone and smooth link function).

There exists a constant cg>0c_{g}>0 such that 0<cg1g(x)0<c_{g}^{-1}\leq g^{\prime}(x) holds for any xx\in\mathbb{R}. Also, there exist constants B>0B>0 and a,ba,b\in\mathbb{R} such that a<ba<b, g()(x)g^{(\ell)}(x) exists for x(a,b)x\in(a,b), and supaxb|g()(x)|B\sup_{a\leq x\leq b}|g^{(\ell)}(x)|\leq B holds for every =0,1,m\ell=0,1\ldots,m for some mm\in\mathbb{N}.

Assumption 3 restricts the class of link functions to those that are monotonic. This class has been extensively reviewed in the literature, with Balabdaoui et al. (2019) providing a comprehensive discussion. It encompasses a wide range of applications, including utility functions, growth curves, and dose-response models (Matzkin, 1991; Wan et al., 2017; Foster et al., 2013). Furthermore, under a monotonically increasing link function, the sign of 𝜷\bm{\beta} is identified, so that we can identify 𝜷\bm{\beta} only by the scale condition 𝜷𝚺𝜷=1\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}=1.

Assumption 4 (Moment conditions of 𝒚\bm{y}).

𝔼[y12]<\operatorname{\mathbb{E}}[y_{1}^{2}]<\infty holds. Further, m2(x):=𝔼[y12𝐗1𝛃=x]m_{2}(x):=\operatorname{\mathbb{E}}[y_{1}^{2}\mid\bm{X}_{1}^{\top}\bm{\beta}=x] is continuous in x[a,b]x\in[a,b] for a,ba,b\in\mathbb{R} defined in Assumption 3.

The continuity of m2(x)m_{2}(x) is maintained in many commonly used models, particularly when g()g(\cdot) is continuous. For instance, the Poisson regression model defines m2(x)=exp(x)(1+exp(x))m_{2}(x)=\exp(x)(1+\exp(x)), and binary choice models specify m2(x)=g(x)m_{2}(x)=g(x).

3.2. Consistency of Link Estimation

We demonstrate the uniform consistency of the link estimator g^()\hat{g}(\cdot) in (8) over closed intervals. We consider the mthm\mathrm{th}-order kernel K()K(\cdot) that satisfies

K(t)𝑑t=1,tmK(t)𝑑t0,andtK(t)𝑑t=0,\displaystyle\int_{-\infty}^{\infty}K(t)dt=1,~{}~{}~{}\int_{-\infty}^{\infty}t^{m}K(t)dt\neq 0,~{}\mbox{and}~{}\int_{-\infty}^{\infty}t^{\ell}K(t)dt=0, (17)

for [m1]\ell\in[m-1]. We then obtain the following:

Theorem 1.

Suppose that Assumptions 1-4 hold, the Fourier transform ϕK(t)\phi_{K}(t) of the kernel K()K(\cdot) has bounded support in [M0,M0][-M_{0},M_{0}] with some M0>0M_{0}>0, and the bandwidth hn=(chlogn1)1/2h_{n}=(c_{h}\log n_{1})^{-1/2} satisfies 2M02(σ1/μ1)2ch<12M_{0}^{2}(\sigma_{1}/\mu_{1})^{2}c_{h}<1. Then, we have the following as n1n_{1}\to\infty:

supaxb|g^(x)g(x)|=Op(1(logn1)m/2).\displaystyle\sup_{a\leq x\leq b}|\hat{g}(x)-g(x)|=O_{\mathrm{p}}\left(\frac{1}{(\log n_{1})^{m/2}}\right). (18)

Here, the logarithmic rate Op(1/(logn1)m/2)O_{\mathrm{p}}(1/(\log n_{1})^{m/2}) matches the lower bound for errors-in-variables regression established by Fan and Truong (1993), indicating that this rate cannot be improved.

3.3. Marginal Asymptotic Normality of Coefficient Estimators

This section demonstrates the marginal asymptotic normality of our estimator 𝜷^(g^)\hat{\bm{\beta}}(\hat{g}) for 𝜷\bm{\beta}, facilitated by the estimators of the inferential parameters, μ^(g^)\hat{\mu}(\hat{g}) and σ^(g^)\hat{\sigma}(\hat{g}). These results are directly applicable to hypothesis testing and the construction of confidence intervals for any finite subset of the βj\beta_{j}’s.

3.3.1. Unit Covariance (𝚺=𝑰p\bm{\Sigma}=\bm{I}_{p}) and p>np>n Case

As previously noted, the inferential parameters vary depending on the estimator considered. In this section, we focus on the ridge regularized estimator with unit covariance 𝚺=𝑰p\bm{\Sigma}=\bm{I}_{p}. We will also present additional results for generalized covariance matrices and the ridgeless scenario later.

Theorem 2.

We consider the coefficient estimator 𝛃^(g^)\hat{\bm{\beta}}(\hat{g}) with J(𝐛)=λ𝐛2/2J(\bm{b})=\lambda\|\bm{b}\|^{2}/2, and the inferential estimators (μ^(g^),σ^(g^))(\hat{\mu}(\hat{g}),\hat{\sigma}(\hat{g})), associated with the link estimator g^()\hat{g}(\cdot). Suppose that 𝚺=𝐈p\bm{\Sigma}=\bm{I}_{p} and Assumptions 1-3 hold. Then, a conditional distribution of (𝛃^(g^),μ^(g^),σ^(g^))(\hat{\bm{\beta}}(\hat{g}),\hat{\mu}(\hat{g}),\hat{\sigma}(\hat{g})) with a fixed event on g^()\hat{g}(\cdot) satisfies the following: for any coordinate j[p]j\in[p] satisfying pβj=O(1)\sqrt{p}\beta_{j}=O(1), we have

Tj:=p(β^j(g^)μ^(g^)βj)σ^(g^)d𝒩(0,1)\displaystyle T_{j}:=\frac{\sqrt{p}({\hat{\beta}_{j}(\hat{g})-\hat{\mu}(\hat{g})\beta_{j}})}{\hat{\sigma}(\hat{g})}\overset{\mathrm{d}}{\to}\mathcal{N}(0,1) (19)

as n,pn,p\to\infty with the regime (2). Moreover, for any finite set of coordinates 𝒮[p]\mathcal{S}\subset[p] satisfying p𝛃𝒮=O(1)\sqrt{p}\|\bm{\beta}_{\mathcal{S}}\|=O(1), we have, as n,pn,p\to\infty,

p(𝜷^𝒮(g^)μ^(g^)𝜷𝒮)σ^(g^)d𝒩(𝟎,𝑰|𝒮|).\displaystyle\frac{\sqrt{p}({\hat{\bm{\beta}}_{\mathcal{S}}(\hat{g})-\hat{\mu}(\hat{g})\bm{\beta}_{\mathcal{S}})}}{\hat{\sigma}(\hat{g})}\overset{\mathrm{d}}{\to}\mathcal{N}(\bm{0},\bm{I}_{|\mathcal{S}|}). (20)

This result also implies that β^j(g^)/μ^(g^)\hat{\beta}_{j}(\hat{g})/\hat{\mu}(\hat{g}) is an asymptotically unbiased estimator of βj\beta_{j}. Note that the convergence of the conditional distribution is ensured by the non-degeneracy property of the conditional event, as defined by (𝑿(1),𝒚(1))({\bm{X}}^{(1)},{\bm{y}}^{(1)}); see Goggin (1994) for details.

We highlight two key contributions of Theorem 2. First, it remains valid even when the ratio κ=p/n\kappa=p/n exceeds one, a notable distinction compared to a similar result by Bellec (2022), which holds only when κ\kappa is less than one. Second, the statistic TjT_{j} is independent of any unknown parameters, contrasting with, for example, the marginal asymptotic normality in logistic regression by Zhao et al. (2022), which relies on unknown inferential parameters.

Application to Statistical Inference: From Theorem 2, we construct a confidence interval CI1αj\mathrm{CI}_{{1-\alpha}}^{j} for each βj\beta_{j} with a confidence level (1α)(1-\alpha) as follows:

CI1αj:=1μ^(g^)[β^j(g^)z(1α/2)σ^(g^)p,β^j(g^)+z(1α/2)σ^(g^)p],\displaystyle\mathrm{CI}_{{1-\alpha}}^{j}:=\frac{1}{\hat{\mu}(\hat{g})}\left[\hat{\beta}_{j}(\hat{g})-z_{(1-\alpha/2)}\frac{\hat{\sigma}(\hat{g})}{\sqrt{p}},~{}\hat{\beta}_{j}(\hat{g})+z_{(1-\alpha/2)}\frac{\hat{\sigma}(\hat{g})}{\sqrt{p}}\right], (21)

where j[p]j\in[p] and z(1α/2)z_{(1-\alpha/2)} is the (1α/2)(1-\alpha/2)-quantile of the standard normal distribution. This construction ensures the coverage probability adheres to the specified confidence level asymptotically.

Corollary 1.

Under the settings of Theorem 2, for any α(0,1)\alpha\in(0,1), we have the following as n,pn,p\to\infty with the regime (2):

sup1jp|(βjCI1αj)(1α)|0.\displaystyle\sup_{1\leq j\leq p}\left\lvert\mathbb{P}\left(\beta_{j}\in\mathrm{CI}_{1-\alpha}^{j}\right)-(1-\alpha)\right\rvert\to 0. (22)

Hence, for testing the hypothesis H0j:βj=0H_{0}^{j}:\beta_{j}=0 against H1j:βj0H_{1}^{j}:\beta_{j}\neq 0 at level α(0,1)\alpha\in(0,1), we can use the corrected tt-statistics in (19). The test that rejects the null hypothesis H0jH_{0}^{j} if

σ^(g^)z(1α/2)pτj|β^j(g^)|\displaystyle\frac{\hat{\sigma}(\hat{g})z_{(1-\alpha/2)}}{\sqrt{p}\tau_{j}}\leq|\hat{\beta}_{j}(\hat{g})| (23)

controls the asymptotic size of the test at the level α\alpha. Additionally, it is feasible to develop a variable selection procedure that identifies variables related to the response. Specifically, the pp-value associated with H0jH_{0}^{j} and the statistic pβ^j(g^)/σ^(g^)\sqrt{p}\hat{\beta}_{j}(\hat{g})/\hat{\sigma}(\hat{g}) can serve as importance statistics for the jjth covariate. This approach facilitates variable selection procedures that control the false discovery rate, as detailed in sources such as Benjamini and Hochberg (1995); Candes et al. (2018); Xing et al. (2023); Dai et al. (2023).
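To illustrate how (21) and (23) are used, here is a short sketch constructing confidence intervals, two-sided p-values, and level-α rejections in the Σ = I_p case (so τ_j = 1); the names are ours and SciPy is assumed.

```python
# Sketch of marginal inference from Theorem 2: intervals (21), tests (23), p-values.
import numpy as np
from scipy.stats import norm

def marginal_inference(beta_hat, mu_hat, sigma_hat, alpha=0.05):
    p = beta_hat.shape[0]
    z = norm.ppf(1.0 - alpha / 2.0)
    half = z * sigma_hat / np.sqrt(p)
    ci = np.column_stack([beta_hat - half, beta_hat + half]) / mu_hat   # Eq. (21)
    t_stat = np.sqrt(p) * beta_hat / sigma_hat        # T_j under H_0^j: beta_j = 0
    p_values = 2.0 * norm.sf(np.abs(t_stat))
    reject = p_values < alpha                         # equivalent to the rejection rule (23)
    return ci, p_values, reject
```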

3.3.2. General Covariance 𝚺\bm{\Sigma} and p<np<n Case

We extend Theorem 2 to scenarios with a general covariance matrix 𝚺\bm{\Sigma} in unregularized settings. To this end, we utilize the estimators (μ^0(g^),σ^0(g^))(\hat{\mu}_{0}(\hat{g}),\hat{\sigma}_{0}(\hat{g})), which are defined for inferential parameters in Section 2.4. Recall that the precision matrix 𝚯\bm{\Theta} is defined as 𝚺1\bm{\Sigma}^{-1}.

Theorem 3.

We consider the coefficient estimator 𝛃^(g^)\hat{\bm{\beta}}(\hat{g}) with J(𝐛)0J(\bm{b})\equiv 0, and the inferential estimators (μ^0(g^),σ^0(g^))(\hat{\mu}_{0}(\hat{g}),\hat{\sigma}_{0}(\hat{g})), associated with the link estimator g^()\hat{g}(\cdot). Suppose that Assumptions 1-3 hold. Then, a conditional distribution of (𝛃^(g^),μ^0(g^),σ^0(g^))(\hat{\bm{\beta}}(\hat{g}),\hat{\mu}_{0}(\hat{g}),\hat{\sigma}_{0}(\hat{g})) with a fixed event on g^()\hat{g}(\cdot) satisfies the following: for any coordinate j[p]j\in[p] satisfying pτjβj=O(1)\sqrt{p}\tau_{j}\beta_{j}=O(1), we have

p(β^j(g^)μ^0(g^)βj)σ^0(g^)/τjd𝒩(0,1)\displaystyle\frac{\sqrt{p}({\hat{\beta}_{j}(\hat{g})-\hat{\mu}_{0}(\hat{g})\beta_{j}})}{\hat{\sigma}_{0}(\hat{g})/\tau_{j}}\overset{\mathrm{d}}{\to}\mathcal{N}(0,1) (24)

as n,pn,p\to\infty with the regime (2). Moreover, for a finite set of coordinates 𝒮{1,,p}\mathcal{S}\subset\{1,\ldots,p\}, we have

p𝚯𝒮1/2(𝜷^𝒮(g^)μ^0(g^)𝜷𝒮)σ^0(g^)d𝒩(𝟎,𝑰|𝒮|),\displaystyle\frac{\sqrt{p}\bm{\Theta}_{\mathcal{S}}^{-1/2}({\hat{\bm{\beta}}_{\mathcal{S}}(\hat{g})-\hat{\mu}_{0}(\hat{g})\bm{\beta}_{\mathcal{S}})}}{\hat{\sigma}_{0}(\hat{g})}\overset{\mathrm{d}}{\to}\mathcal{N}(\bm{0},\bm{I}_{|\mathcal{S}|}), (25)

where the submatrix 𝚯𝒮\bm{\Theta}_{\mathcal{S}} consists of Θij\Theta_{ij} for i,j𝒮i,j\in\mathcal{S}.

4. Experiments

This section provides numerical validations of our estimation framework as outlined in Section 2. The efficiency of our proposed estimator is subsequently compared with that of other estimators.

We examine two high-dimensional scenarios: n<pn<p and n>pn>p. For the scenario where n>pn>p, we assume the true coefficient vector follows a uniform distribution on the sphere: 𝜷Unif(𝕊p1)\bm{\beta}\sim\mathrm{Unif}(\mathbb{S}^{p-1}). In the case of n<pn<p, we set β1==β100=p/100\beta_{1}=\cdots=\beta_{100}=\sqrt{p/100} and β101==βp=0\beta_{101}=\cdots=\beta_{p}=0. We generate response variables yiy_{i} for Gaussian predictors 𝑿i\bm{X}_{i} with an identity covariance matrix 𝚺=𝑰p\bm{\Sigma}=\bm{I}_{p}, under the following four scenarios (a data-generation sketch in code follows the list):

  (i) Cloglog: yi𝑿iBern(g(i)(𝜷𝑿i))y_{i}\mid\bm{X}_{i}\sim\mathrm{Bern}(g_{\mathrm{(i)}}(\bm{\beta}^{\top}\bm{X}_{i})) with g(i)(t)=1exp(exp(t))g_{\mathrm{(i)}}(t)=1-\exp(-\exp(t));

  (ii) xSqrt: yi𝑿iPois(g(ii)(𝜷𝑿i))y_{i}\mid\bm{X}_{i}\sim\mathrm{Pois}(g_{\mathrm{(ii)}}(\bm{\beta}^{\top}\bm{X}_{i})) with g(ii)(t)=t+t2+1g_{\mathrm{(ii)}}(t)=t+\sqrt{t^{2}+1};

  (iii) Cubic: cubic regression yi=g(iii)(𝜷𝑿i)+εiy_{i}=g_{\mathrm{(iii)}}(\bm{\beta}^{\top}\bm{X}_{i})+\varepsilon_{i} with εi𝒩(0,1/2)\varepsilon_{i}\sim\mathcal{N}(0,1/2) and g(iii)(t)=t3/3g_{\mathrm{(iii)}}(t)=t^{3}/3;

  (iv) Piecewise: piecewise regression yi=g(iv)(𝜷𝑿i)+εiy_{i}=g_{\mathrm{(iv)}}(\bm{\beta}^{\top}\bm{X}_{i})+\varepsilon_{i} with εi𝒩(0,1/5)\varepsilon_{i}\sim\mathcal{N}(0,1/5) and g(iv)(t)=(0.2t2.3)1(,1]+2.5t1(1,1)+(0.2t+2.3)1[1,)g_{\mathrm{(iv)}}(t)=(0.2t-2.3)1_{(-\infty,-1]}+2.5t1_{(-1,1)}+(0.2t+2.3)1_{[1,\infty)}.
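The data-generating processes above can be simulated directly; the sketch below does so with Σ = I_p, interpreting 𝒩(0, v) as a normal distribution with variance v. All names are illustrative.

```python
# Sketch of the data-generating processes (i)-(iv) with Sigma = I_p.
import numpy as np

rng = np.random.default_rng(0)

def generate_data(n, p, beta, model):
    X = rng.standard_normal((n, p))                  # rows ~ N_p(0, I_p)
    idx = X @ beta                                   # the index beta'X_i
    if model == "cloglog":
        y = rng.binomial(1, 1.0 - np.exp(-np.exp(idx)))
    elif model == "xsqrt":
        y = rng.poisson(idx + np.sqrt(idx ** 2 + 1.0))
    elif model == "cubic":
        y = idx ** 3 / 3.0 + rng.normal(0.0, np.sqrt(0.5), size=n)
    elif model == "piecewise":
        g = np.where(idx <= -1.0, 0.2 * idx - 2.3,
                     np.where(idx < 1.0, 2.5 * idx, 0.2 * idx + 2.3))
        y = g + rng.normal(0.0, np.sqrt(0.2), size=n)
    else:
        raise ValueError(model)
    return X, y

# Example: n > p setting with beta uniform on the unit sphere
n, p = 500, 200
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)
X, y = generate_data(n, p, beta, "cubic")
```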

4.1. Index Estimator

We validate the normal approximation of the index estimator WiW_{i} as shown in (6). For cases where n>pn>p, we set (n,p)=(500,50)(n,p)=(500,50) for the Cloglog model and (n,p)=(500,200)(n,p)=(500,200) for the other models. For cases where n<pn<p, we set (n,p)=(250,500)(n,p)=(250,500) and apply the ridge regularized estimator to all models. We use the maximum likelihood estimator (MLE) of logistic regression as the pilot estimator for (i) Cloglog, the MLE of Poisson regression for (ii) xSqrt, and the least squares estimator for both (iii) Cubic and (iv) Piecewise models. We calculate μ~(𝑾𝑿𝜷)/σ~\tilde{\mu}(\bm{W}-\bm{X}\bm{\beta})/\tilde{\sigma} using 1,0001,000 replications for each setup.

Figure 1 displays the results. In all settings, the difference between the index estimator and the index follows a Gaussian distribution, as expected.

Figure 1. Histograms of the first coordinate of the statistics μ~(𝑾𝑿𝜷)/σ~\tilde{\mu}(\bm{W}-\bm{X}\bm{\beta})/\tilde{\sigma} over 1,0001,000 replications. According to Proposition 1, these histograms are expected to resemble the 𝒩(0,1)\mathcal{N}(0,1) density. The columns correspond to each model, ranging from (i) Cloglog to (iv) Piecewise, while the rows represent unregularized and ridge-regularized estimations for cases where n>pn>p and n<pn<p, respectively.

4.2. Link Function Estimator

Next, we evaluate the numerical performance of the link estimator g^()\hat{g}(\cdot), constructed from (W1,,Wn)(W_{1},\ldots,W_{n}), using a fixed bandwidth for each nn. Figure 2 (left panel) shows that the estimation error of g^()\hat{g}(\cdot) for (iv) Piecewise uniformly approaches zero as the sample size increases. The right four panels in Figure 2 display the squared losses of g^()\hat{g}(\cdot) evaluated over the interval [3,3][-3,3], which all decrease as nn increases.

Figure 2. Estimated link functions g^()\hat{g}(\cdot) for (iv) Piecewise were obtained with a fixed ratio p/n=0.6p/n=0.6 and n=32,64,,1024n=32,64,\ldots,1024, averaged over 1,0001,000 replications (left). The squared loss for g^()\hat{g}(\cdot), evaluated over the interval [3,3][-3,3], for (i) Cloglog to (iv) Piecewise, as defined in the previous section, with a fixed ratio p/n=0.4p/n=0.4 averaged over 1,0001,000 replications (right).

4.3. Our Coefficient Estimator

We examine the asymptotic normality of each coordinate of the estimator 𝜷^(g^)\hat{\bm{\beta}}(\hat{g}) for the true coefficients. As in Section 4.2, we construct the estimator using a fixed bandwidth and apply the rearrangement operator r[]\mathcal{R}_{\mathrm{r}}[\cdot] as defined in (10) over the interval [3,3][-3,3] to obtain g^()\hat{g}(\cdot). We then compute 𝜷^(g^)\hat{\bm{\beta}}(\hat{g}) according to (13) using J()𝟎J(\cdot)\equiv\bm{0} when n>pn>p and J(𝒃)=𝒃2J(\bm{b})=\|\bm{b}\|^{2} when npn\leq p. Figure 3 shows the marginal normal approximation of the estimators under these conditions. All histograms closely resemble the standard normal density, corroborating the asymptotic normality stated in Theorems 2 and 3.

Figure 3. Histograms of the first coordinate of the statistics p(β^1(g^)μ^(g^)β1)/σ^(g^)\sqrt{p}(\hat{\beta}_{1}(\hat{g})-\hat{\mu}(\hat{g})\beta_{1})/\hat{\sigma}(\hat{g}) made from 1,0001,000 replications. The columns correspond to each model from (i) Cloglog to (iv) Piecewise, and the rows correspond to unregularized and ridge-regularized estimation under n>pn>p and n<pn<p, respectively. They are expected to approach 𝒩(0,1)\mathcal{N}(0,1) density by Theorems 2 and 3.

4.4. Efficiency Comparison

Finally, we compare the estimation efficiency of the proposed estimator with several pilot estimators. We use the effective asymptotic variance σ2/μ2\sigma^{2}_{*}/\mu_{*}^{2} as an efficiency measure, which is the inverse of the effective signal-to-noise ratio as described in Feng et al. (2022). We estimate this variance using the statistic 𝜷^𝜷^/(𝜷^𝜷)1{\hat{\bm{\beta}}^{\top}\hat{\bm{\beta}}}/({\hat{\bm{\beta}}^{\top}\bm{\beta}})-1 for an estimator 𝜷^\hat{\bm{\beta}}. This statistic is a reasonable approximation of the asymptotic variance of the debiased version of 𝜷^\hat{\bm{\beta}} and converges almost surely to the effective asymptotic variance under certain conditions (see Section 5 for details).
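For reference, the efficiency statistic used here is a one-line computation (a sketch; names are ours):

```python
# Efficiency statistic of Section 4.4: (b_hat'b_hat)/(b_hat'beta) - 1, which
# approximates the effective asymptotic variance sigma_*^2 / mu_*^2 when beta is known.
import numpy as np

def effective_variance_stat(b_hat, beta):
    return (b_hat @ b_hat) / (b_hat @ beta) - 1.0
```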

From a practical perspective, we analyze the scatter plot of (Wi,yi)(W_{i},y_{i}) and manually specify a functional form for g()g(\cdot) to conduct parametric regression. We estimate parameters a,b,ca,b,c\in\mathbb{R} in different forms: gˇ(t)=1/(1+exp(at+b))\check{g}(t)=1/(1+\exp(-at+b)) for case (i), gˇ(t)=aexp(t)+b\check{g}(t)=a\exp(t)+b for case (ii), gˇ(t)=at3+bt2+ct\check{g}(t)=at^{3}+bt^{2}+ct for case (iii), and gˇ(t)=a/(1+exp(bt+c))a/2\check{g}(t)=a/(1+\exp(-bt+c))-a/2 for case (iv). We then use these estimates to construct the link function. Additionally, we introduce new data-generating processes: Logit, where yi𝑿iBern(1/(1+exp(𝜷𝑿i)))y_{i}\mid\bm{X}_{i}\sim\mathrm{Bern}(1/(1+\exp(\bm{\beta}^{\top}\bm{X}_{i}))); Poisson, where yi𝑿iPois(exp(𝜷𝑿i))y_{i}\mid\bm{X}_{i}\sim\mathrm{Pois}(\exp(\bm{\beta}^{\top}\bm{X}_{i})); Cubic+, where yi=g(iii)(𝜷𝑿i)+εiy_{i}=g_{\mathrm{(iii)}}(\bm{\beta}^{\top}\bm{X}_{i})+\varepsilon_{i}, with εi𝒩(5,1/2)\varepsilon_{i}\sim\mathcal{N}(5,1/2); and Piecewise+, where yi=g(iv)(𝜷𝑿i)+εiy_{i}=g_{\mathrm{(iv)}}(\bm{\beta}^{\top}\bm{X}_{i})+\varepsilon_{i}, with εi𝒩(5,1/5)\varepsilon_{i}\sim\mathcal{N}(5,1/5).

Table 1 displays the efficiency measures for our proposed estimator and others across 100 replications. We find that our proposed estimator is generally more efficient in most settings, except when the estimators are specifically tailored to the models. This highlights the broad applicability of our estimator.

Model | LeastSquares | LogitMLE | PoisMLE | Proposed
Logit | 3.77 ± .967 | .525 ± .157 | – | .527 ± .157
Cloglog | 3.13 ± .599 | .294 ± .080 | – | .271 ± .068
Poisson | 3.77 ± .967 | – | .630 ± .124 | .630 ± .124
xSqrt | 2.32 ± .692 | – | 1.12 ± .290 | 1.12 ± .290
Cubic | 1.15 ± .258 | – | – | 1.74 ± .440
Cubic+ | 33.9 ± 50.7 | – | – | 1.74 ± .439
Piecewise | .541 ± .031 | – | – | .391 ± .157
Piecewise+ | 6.32 ± 3.26 | – | – | .330 ± .184
Table 1. Efficiency measure for each pair of model and estimator. We report the average ±\pm standard deviation.

4.5. Real Data Applications

We utilize two datasets from the UCI Machine Learning Repository (Dua and Graff, 2017) to illustrate the performance of the proposed estimator. The DARWIN dataset (Cilia et al., 2018) comprises handwriting data from 174 participants, including both Alzheimer’s disease patients and healthy individuals. The second dataset (Sakar and Sakar, 2018) features 753 attributes derived from the sustained phonation of vowel sounds by patients with and without Parkinson’s disease. We employ a leave-one-out strategy for splitting each dataset. For each n1n-1 subset, we compute the regularized MLE of logistic regression alongside the proposed estimate derived from it. We then estimate the effective asymptotic variance, σ2/μ2\sigma^{2}_{*}/\mu_{*}^{2}, for each estimator. The results, presented in Tables 2-3, indicate that the proposed estimator consistently provides a more accurate estimation of the true coefficient vector compared to conventional logistic regression.

λ=1\lambda=1 λ=5\lambda=5 λ=10\lambda=10
Logit 1.87±\pm0.06 0.46±\pm0.01 0.30±\pm0.00
proposed 0.61±\pm0.01 0.25±\pm0.00 0.18±\pm0.00
Table 2. Estimated effective asymptotic variance of the MLE of logistic regression and the proposed estimator for DARWIN data. We provide the average ±\pm standard deviation by using leave-one-out split datasets.
λ=1\lambda=1 λ=5\lambda=5 λ=10\lambda=10
Logit 2.22±\pm0.03 0.30±\pm0.00 0.16±\pm0.00
proposed 0.15±\pm0.00 0.06±\pm0.00 0.05±\pm0.00
Table 3. Estimated effective asymptotic variance of the MLE of logistic regression and the proposed estimator for speech data. We provide the average ±\pm standard deviation by using leave-one-out split datasets.

5. Proof Outline

We outline the proofs for each theorem in Section 3.

5.1. Consistency of Link Estimation (Theorem 1)

We provide an overview of the proof of Theorem 1, which comprises two primary steps: (i) establishing the asymptotic characterization of the index estimator WiW_{i} discussed in Section 2.1, and (ii) demonstrating the consistency of the estimator g^()\hat{g}(\cdot) in Section 2.2, which is built on WiW_{i}.

5.1.1. Error of Index Estimator

We consider the distributional approximation (6) for the index estimator WiW_{i}, established through observable adjustments by Bellec (2022). Theorems 4.3 and 4.4 in Bellec (2022) support the following proposition:

Proposition 1.

Under Assumptions 1-2 and 𝔼[y12]<\operatorname{\mathbb{E}}[y_{1}^{2}]<\infty, there exists Zi𝒩(0,1)Z_{i}\sim\mathcal{N}(0,1) such that, for each i[n1]i\in[n_{1}], as n1n_{1}\to\infty,

|𝜷~𝑿i(1)γ~(yi(1)𝜷~𝑿i(1))μ~𝜷𝑿i(1)σ~Zi|p0.\displaystyle\left\lvert\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}-\tilde{\gamma}\left(y_{i}^{(1)}-\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}\right)-\tilde{\mu}\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}}-\tilde{\sigma}Z_{i}\right\rvert\overset{\mathrm{p}}{\to}0. (26)

The proposition asserts that each 𝜷~𝑿i(1)\tilde{\bm{\beta}}^{\top}{{\bm{X}}_{i}^{(1)}} for i[n1]i\in[n_{1}] is approximately equal to the sum of the biased true index μ~𝜷𝑿i(1)\tilde{\mu}\bm{\beta}^{\top}{{\bm{X}}_{i}^{(1)}}, a Gaussian error, and an additive bias term. Since WiW_{i} has the form μ~1(𝜷~𝑿i(1)γ~(yi(1)𝜷~𝑿i(1)))\tilde{\mu}^{-1}(\tilde{\bm{\beta}}^{\top}{{\bm{X}}_{i}^{(1)}}-\tilde{\gamma}(y_{i}^{(1)}-\tilde{\bm{\beta}}^{\top}{\bm{X}}_{i}^{(1)})), its approximation error is asymptotically represented by the Gaussian term as shown in Equation (6).

5.1.2. Error of Link Estimator

Next, we prove the consistency of the link estimator g^()\hat{g}(\cdot) using the index estimator WiW_{i}. Suppose WiW_{i} were exactly equivalent to 𝜷𝑿i(1)+𝒩(0,σ12/μ12)=:W~i\bm{\beta}^{\top}\bm{X}_{i}^{(1)}+\mathcal{N}(0,\sigma_{1}^{2}/\mu_{1}^{2})=:\tilde{W}_{i}. In this case, we could apply the classical result of nonparametric error-in-variables regression (Fan and Truong, 1993) to demonstrate the uniform consistency of g^()\hat{g}(\cdot). However, this equivalence is only asymptotic as shown in (26). Therefore, we establish that the error due to this asymptotic equivalence is negligibly small in the estimation of g^()\hat{g}(\cdot) to complete the proof.

Specifically, we take the following steps. First, we decompose the error of g^()\hat{g}(\cdot) into two terms. In preparation, we define g~()\tilde{g}(\cdot) as a deconvolution estimator based on the deconvolution kernel K~n(x)=(2π)1exp(itx)ϕK(t)/ϕς(t/hn)𝑑t\tilde{K}_{n}(x)=({2\pi})^{-1}\int_{-\infty}^{\infty}\exp(-\mathrm{i}tx){\phi_{K}(t)}/{\phi_{\varsigma}(t/h_{n})}dt, which uses the true inferential parameters through ς=σ/μ\varsigma=\sigma_{*}/\mu_{*} (its precise definition is given in Section C). This estimator corresponds to the estimator for the errors-in-variables setup developed by Fan and Truong (1993). Then, since the monotonization operator does not increase the uniform error against the monotone g()g(\cdot), we obtain the following decomposition:

supaxb|g^(x)g(x)|supaxb|g˘(x)g(x)|supaxb|g˘(x)g~(x)|+supaxb|g~(x)g(x)|.\displaystyle\sup_{a\leq x\leq b}\left\lvert\hat{g}(x)-g(x)\right\rvert\leq\sup_{a\leq x\leq b}\left\lvert\breve{g}(x)-{g}(x)\right\rvert\leq\sup_{a\leq x\leq b}\left\lvert\breve{g}(x)-\tilde{g}(x)\right\rvert+\sup_{a\leq x\leq b}\left\lvert\tilde{g}(x)-{g}(x)\right\rvert. (27)

The second term supaxb|g~(x)g(x)|\sup_{a\leq x\leq b}\left\lvert\tilde{g}(x)-{g}(x)\right\rvert in (27) is the estimation error by the deconvolution estimator g~()\tilde{g}(\cdot), which is proven to be op(1)o_{\mathrm{p}}(1) according to the result of Fan and Truong (1993).

On the other hand, the first term supaxb|g˘(x)g~(x)|\sup_{a\leq x\leq b}\left\lvert\breve{g}(x)-\tilde{g}(x)\right\rvert in (27) represents how our pre-monotonized estimator g˘()\breve{g}(\cdot) in (8) approximates the estimator g~()\tilde{g}(\cdot). Rigorously, we obtain

|g˘(x)g~(x)|\displaystyle\left\lvert\breve{g}(x)-\tilde{g}(x)\right\rvert (28)
1n1hn|i=1n1Kn(W~ixhn)Kn(Wixhn)|=:T1+1n1hn|i=1n1Kn(Wixhn)K~n(Wixhn)|=:T2,\displaystyle\lesssim\underbrace{\frac{1}{n_{1}h_{n}}\left\lvert\sum_{i=1}^{n_{1}}{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right)-{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)\right\rvert}_{=:T_{1}}+\underbrace{\frac{1}{n_{1}h_{n}}\left\lvert\sum_{i=1}^{n_{1}}{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)-\tilde{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)\right\rvert}_{=:T_{2}}, (29)

where \lesssim denotes an inequality up to a universal constant. The first term T1T_{1} describes the error due to the index estimator Wi{W}_{i}. We develop an upper bound on T1T_{1} by using the result of Proposition 1. The second term T2T_{2} represents the discrepancy between the deconvolution kernels Kn()K_{n}(\cdot) and K~n()\tilde{K}_{n}(\cdot). Note that Kn()K_{n}(\cdot) depends on the estimator ς~2=σ~2/μ~2\tilde{\varsigma}^{2}=\tilde{\sigma}^{2}/\tilde{\mu}^{2} of the inferential parameter, and K~n()\tilde{K}_{n}(\cdot) depends on the true value of the inferential parameter ς=σ/μ\varsigma=\sigma_{*}/\mu_{*}. We derive its upper bound by evaluating the estimation error of ς~2\tilde{\varsigma}^{2} relative to ς2\varsigma^{2}.

By integrating these results into (27), we prove that the estimation error of g^()\hat{g}(\cdot) is op(1)o_{\mathrm{p}}(1).

5.2. Marginal Asymptotic Normality (Theorem 3)

This section provides a proof sketch of Theorem 3. We specifically present a general theorem that characterizes the asymptotic normality of each coordinate of the unregularized estimator in high-dimensional settings. This discussion extends the proof provided by Zhao et al. (2022) for logistic regression.

Consider the single-index model given by (1) and an arbitrary loss function ¯:×𝒴\bar{\ell}:\mathbb{R}\times\mathcal{Y}\to\mathbb{R}. We define an M-estimator 𝜷¯\bar{\bm{\beta}}, based on the loss function ¯()\bar{\ell}(\cdot), as follows:

𝜷¯argmin𝒃pi=1n¯(𝒃𝑿i;yi).\displaystyle\bar{\bm{\beta}}\in\operatornamewithlimits{argmin}_{\bm{b}\in\mathbb{R}^{p}}\sum_{i=1}^{n}\bar{\ell}(\bm{b}^{\top}\bm{X}_{i};y_{i}). (30)

With this general setup, we establish the following statement:

Theorem 4.

Suppose that Assumptions 1 and 2 hold. Also, suppose that the M-estimator 𝛃¯p\bar{\bm{\beta}}\in\mathbb{R}^{p} in (30) is uniquely determined and there exists a constant C>0C>0 such that (𝛃¯<C)1o(1)\mathbb{P}(\|\bar{\bm{\beta}}\|<C)\geq 1-o(1) holds. With the true parameter 𝛃p\bm{\beta}\in\mathbb{R}^{p}, define

μ𝜷¯=𝜷¯𝚺𝜷𝜷𝚺𝜷,andσ𝜷¯2=𝑷𝚺1/2𝜷𝜷¯2=𝜷¯μ𝜷¯𝜷2,\displaystyle\mu_{\bar{\bm{\beta}}}=\frac{\bar{\bm{\beta}}^{\top}\bm{\Sigma}\bm{\beta}}{\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}},\quad\mathrm{and}\quad\sigma_{\bar{\bm{\beta}}}^{2}=\|\bm{P}_{\bm{\Sigma}^{1/2}\bm{\beta}}^{\perp}\bar{\bm{\beta}}\|^{2}=\|\bar{\bm{\beta}}-\mu_{\bar{\bm{\beta}}}\bm{\beta}\|^{2}, (31)

where 𝐏𝚺1/2𝛃=𝐈p𝚺1/2𝛃𝛃𝚺1/2/𝛃𝚺𝛃\bm{P}_{\bm{\Sigma}^{1/2}\bm{\beta}}^{\perp}=\bm{I}_{p}-\bm{\Sigma}^{1/2}\bm{\beta}\bm{\beta}^{\top}\bm{\Sigma}^{1/2}/\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}. Then, for any coordinates j[p]j\in[p] with pτjβj=O(1)\sqrt{p}\tau_{j}\beta_{j}=O(1), we obtain

Tj:=p(β¯jμ𝜷¯βj)σ𝜷¯/τjd𝒩(0,1).\displaystyle T_{j}:=\frac{\sqrt{p}(\bar{\beta}_{j}-\mu_{\bar{\bm{\beta}}}\beta_{j})}{\sigma_{\bar{\bm{\beta}}}/\tau_{j}}\overset{\mathrm{d}}{\to}\mathcal{N}(0,1). (32)

as n,pn,p\to\infty with p/nκ¯<1p/n\to\bar{\kappa}<1.

This theorem establishes the marginal asymptotic normality for a broad class of estimators defined by the minimization of convex loss functions. Additionally, it demonstrates that the limiting distributional behavior of 𝜷¯\bar{\bm{\beta}} is characterized by μ𝜷¯\mu_{\bar{\bm{\beta}}} and σ𝜷¯2\sigma_{\bar{\bm{\beta}}}^{2} in the high-dimensional setting (2). Intuitively, μ𝜷¯\mu_{\bar{\bm{\beta}}} is a scaled inner product of 𝜷¯\bar{\bm{\beta}} and 𝜷\bm{\beta}, and σ𝜷¯2\sigma_{\bar{\bm{\beta}}}^{2} denotes the magnitude of the orthogonal component of 𝜷¯\bar{\bm{\beta}} to 𝜷\bm{\beta}.

The rigorous proof in Section D.1 is conducted in the following steps:

  1. (i)

    Since we have 𝜷𝑿i=(𝚺1/2𝑿i)(𝚺1/2𝜷)\bm{\beta}^{\top}\bm{X}_{i}=(\bm{\Sigma}^{-1/2}\bm{X}_{i})^{\top}(\bm{\Sigma}^{1/2}\bm{\beta}), we may replace 𝑿i\bm{X}_{i} with 𝚺1/2𝑿i𝒩(𝟎,𝑰p)\bm{\Sigma}^{-1/2}\bm{X}_{i}\sim\mathcal{N}(\bm{0},\bm{I}_{p}), 𝜷\bm{\beta} with 𝜼=𝚺1/2𝜷\bm{\eta}=\bm{\Sigma}^{1/2}\bm{\beta}, and 𝜷¯\bar{\bm{\beta}} with 𝜼^=𝚺1/2𝜷¯\hat{\bm{\eta}}=\bm{\Sigma}^{1/2}\bar{\bm{\beta}}. From the Cholesky factorization of 𝚺\bm{\Sigma}, we have

    Tj=p(β¯jμ𝜷¯βj)σ𝜷¯/τj=p(η^jμ𝜷¯ηj)σ𝜷¯.\displaystyle T_{j}=\frac{\sqrt{p}(\bar{\beta}_{j}-\mu_{\bar{\bm{\beta}}}\beta_{j})}{\sigma_{\bar{\bm{\beta}}}/\tau_{j}}=\frac{\sqrt{p}(\hat{\eta}_{j}-\mu_{\bar{\bm{\beta}}}\eta_{j})}{\sigma_{\bar{\bm{\beta}}}}. (33)
  2. (ii)

    Considering the rotation 𝑼\bm{U} around 𝜼\bm{\eta} (i.e., 𝑼𝜼=𝜼\bm{U}\bm{\eta}=\bm{\eta} and 𝑼𝑼=𝑰p\bm{U}\bm{U}^{\top}=\bm{I}_{p}), several calculations give, for 𝑻:=(T1,,Tp)/p\bm{T}:=(T_{1},\ldots,T_{p})^{\top}/\sqrt{p},

    𝑻=𝑷𝜼𝜼^𝑷𝜼𝜼^=d𝐔𝐏𝜼𝜼^𝐏𝜼𝜼^.\displaystyle\bm{T}=\frac{\bm{P}_{\bm{\eta}}^{\perp}\hat{\bm{\eta}}}{\|\bm{P}_{\bm{\eta}}^{\perp}\hat{\bm{\eta}}\|}\overset{\rm d}{=}\frac{\bm{U}\bm{P}_{\bm{\eta}}^{\perp}\hat{\bm{\eta}}}{\|\bm{P}_{\bm{\eta}}^{\perp}\hat{\bm{\eta}}\|}. (34)

    This means that 𝑻\bm{T} is uniformly distributed on the unit sphere in 𝜼\bm{\eta}^{\perp} (See Figure 4).

  3. (iii)

    Drawing on the analogy to the asymptotic equivalence between the pp-dimensional standard normal distribution and Unif(p𝕊p1)\mathrm{Unif}(\sqrt{p}\mathbb{S}^{p-1}), we obtain the asymptotic normality of TjT_{j}.

Figure 4. Illustration of the proof technique of Theorem 4. μ𝜷¯\mu_{\bar{\bm{\beta}}} is the inner product of (𝜼,𝜼^\bm{\eta},\hat{\bm{\eta}}). The radius of the set depicted by the red circle corresponds to σ𝜷¯\sigma_{\bar{\bm{\beta}}}.

We apply this general theorem to obtain Theorem 3. A similar argument implies Theorem 2 for the regularized estimator.
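To make the statement concrete, the following minimal Monte Carlo sketch (illustrative code, not part of the paper's experiments; the link g(t)=\tanh(t), the identity covariance, the noise level, and the sample sizes are assumptions made only for this example) checks that the statistic T_j in (32) is approximately standard normal when the M-estimator in (30) is the unregularized least squares fit in the proportional regime.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, j = 400, 100, 0                  # proportional regime p/n = 0.25; coordinate under test
beta = np.ones(p) / np.sqrt(p)         # beta' Sigma beta = 1 when Sigma = I_p
tau_j = 1.0                            # tau_j = (Sigma^{-1})_{jj}^{-1/2} = 1 for Sigma = I_p

T_vals = []
for _ in range(500):
    X = rng.standard_normal((n, p))                        # X_i ~ N(0, I_p)
    y = np.tanh(X @ beta) + 0.3 * rng.standard_normal(n)   # a single-index model with g = tanh
    beta_bar = np.linalg.lstsq(X, y, rcond=None)[0]        # M-estimator (30) with squared loss
    mu = beta_bar @ beta                                   # mu_{beta_bar} in (31) for Sigma = I_p
    sigma = np.linalg.norm(beta_bar - mu * beta)           # sigma_{beta_bar} in (31)
    T_vals.append(np.sqrt(p) * (beta_bar[j] - mu * beta[j]) / (sigma / tau_j))

print(np.mean(T_vals), np.var(T_vals))  # should be close to 0 and 1, respectively
```

Since (31) defines the inferential parameters from the realized estimator, T_j can be evaluated exactly in such a simulation, and its empirical distribution can be compared with the standard normal distribution.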

6. Other Design of Pilot Estimator

We can consider alternative choices of the pilot estimator discussed in Section 2.1. Depending on the context, choosing an appropriate pilot estimator can enhance the asymptotic efficiency of the overall estimation process. Below, we list several alternative pilot estimators together with the quantities needed to estimate their inferential parameters.

6.1. Least Squares Estimators

In the case of κ1<1\kappa_{1}<1, we can use the least squares estimator

𝜷~LS=((𝑿(1))𝑿(1))1(𝑿(1))𝒚.\displaystyle\tilde{\bm{\beta}}_{\mathrm{LS}}=({(\bm{X}^{(1)}})^{\top}\bm{X}^{(1)})^{-1}{(\bm{X}^{(1)}})^{\top}\bm{y}. (35)

In this case, the inferential parameters corresponding to 𝜷~LS\tilde{\bm{\beta}}_{\mathrm{LS}} are well defined.

We obtain the following marginal asymptotic normality of the least-squares estimator. We recall the definition of the inferential parameters in (31) and consider the corresponding parameters μ𝜷~LS\mu_{\tilde{\bm{\beta}}_{\mathrm{LS}}} and σ𝜷~LS\sigma_{\tilde{\bm{\beta}}_{\mathrm{LS}}} obtained by substituting 𝜷~LS\tilde{\bm{\beta}}_{\mathrm{LS}}. Then, we obtain the following result by a straightforward application of Theorem 4.

Corollary 2.

Under Assumptions 1-2, for any coordinates j=1,,pj=1,\ldots,p obeying pτjβj=O(1)\sqrt{p}\tau_{j}\beta_{j}=O(1), we have the following as n,pn,p\to\infty with the regime (2):

p(β~LS,jμ𝜷~LSβj)σ𝜷~LS/τjd𝒩(0,1),\displaystyle\frac{\sqrt{p}(\tilde{\beta}_{\mathrm{LS},j}-\mu_{{\tilde{\bm{\beta}}_{\mathrm{LS}}}}\beta_{j})}{{\sigma_{\tilde{\bm{\beta}}_{\mathrm{LS}}}}/\tau_{j}}\overset{\mathrm{d}}{\to}\mathcal{N}(0,1), (36)

We also define the following values (γ~LS,μ~LS,σ~LS2)(\tilde{\gamma}_{\mathrm{LS}},\tilde{\mu}_{\mathrm{LS}},\tilde{\sigma}^{2}_{\mathrm{LS}}) to estimate the inferential parameters μ𝜷~LS\mu_{\tilde{\bm{\beta}}_{\mathrm{LS}}} and σ𝜷~LS\sigma_{\tilde{\bm{\beta}}_{\mathrm{LS}}}. Namely, we define γ~LS=κ1/(1κ1)\tilde{\gamma}_{\mathrm{LS}}={\kappa_{1}}/({1-\kappa_{1}}), and

μ~LS=|𝑿(1)𝜷~LS2n1(1κ1)σ~LS2|1/2,σ~LS2=γ~LSn1(1κ1)𝒚(1)𝑿(1)𝜷~LS2.\displaystyle\tilde{\mu}_{\mathrm{LS}}=\left\lvert\frac{\|\bm{X}^{(1)}\tilde{\bm{\beta}}_{\mathrm{LS}}\|^{2}}{n_{1}}-(1-\kappa_{1})\tilde{\sigma}_{\mathrm{LS}}^{2}\right\rvert^{1/2},\tilde{\sigma}_{\mathrm{LS}}^{2}=\frac{\tilde{\gamma}_{\mathrm{LS}}}{n_{1}(1-\kappa_{1})}\|\bm{y}^{(1)}-\bm{X}^{(1)}\tilde{\bm{\beta}}_{\mathrm{LS}}\|^{2}. (37)

If we employ the least squares estimator 𝜷~LS\tilde{\bm{\beta}}_{\mathrm{LS}} as the pilot estimator in Section 2.1, we replace (μ~,σ~2)(\tilde{\mu},\tilde{\sigma}^{2}) in the index estimator WiW_{i} in (4) with (μ~LS,σ~LS2)(\tilde{\mu}_{\mathrm{LS}},\tilde{\sigma}^{2}_{\mathrm{LS}}).
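As a rough illustration, the following sketch (our own illustrative code, not the paper's implementation; X1 and y1 stand for the first data split) assembles the least squares pilot (35) together with the values (γ~_LS, μ~_LS, σ~²_LS) in (37).

```python
import numpy as np

def ls_pilot_adjustments(X1, y1):
    """Least squares pilot (35) and its observable adjustments (37); requires p < n1."""
    n1, p = X1.shape
    kappa1 = p / n1
    beta_ls = np.linalg.lstsq(X1, y1, rcond=None)[0]                      # (35)
    gamma_ls = kappa1 / (1.0 - kappa1)
    resid = y1 - X1 @ beta_ls
    sigma2_ls = gamma_ls * np.sum(resid ** 2) / (n1 * (1.0 - kappa1))     # (37), second display
    mu_ls = np.sqrt(abs(np.sum((X1 @ beta_ls) ** 2) / n1
                        - (1.0 - kappa1) * sigma2_ls))                    # (37), first display
    return beta_ls, gamma_ls, mu_ls, sigma2_ls
```

The returned pair (mu_ls, sigma2_ls) would then be used in place of (μ~, σ~²) when forming the index estimator, as described above.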

6.2. Maximum Likelihood Estimators

When yiy_{i} takes discrete values, a more appropriate pilot estimator can be proposed. For binary outcomes such as in classification problems where yi{0,1}y_{i}\in\{0,1\}, we can employ MLE for logistic regression:

𝜷~mleargmin𝒃pi=1n1log(1+exp(𝒃𝑿i(1)))yi𝒃𝑿i(1).\displaystyle\tilde{\bm{\beta}}_{\mathrm{mle}}\in\operatornamewithlimits{argmin}_{\bm{b}\in\mathbb{R}^{p}}\sum_{i=1}^{n_{1}}\log(1+\exp(\bm{b}^{\top}{\bm{X}_{i}^{(1)}}))-y_{i}\bm{b}^{\top}{\bm{X}_{i}^{(1)}}. (38)

In the case with yi{0}y_{i}\in\mathbb{N}\cup\{0\}, we can consider the MLE for the Poisson regression

𝜷~mleargmin𝒃pi=1n1exp(𝒃𝑿i(1))yi𝒃𝑿i(1).\displaystyle\tilde{\bm{\beta}}_{\mathrm{mle}}\in\operatornamewithlimits{argmin}_{\bm{b}\in\mathbb{R}^{p}}\sum_{i=1}^{n_{1}}\exp(\bm{b}^{\top}{\bm{X}_{i}^{(1)}})-y_{i}\bm{b}^{\top}{\bm{X}_{i}^{(1)}}. (39)

Its asymptotic normality is obtained by applying Theorem 4.

Corollary 3.

Under Assumptions 1-2, for any coordinates j=1,,pj=1,\ldots,p obeying pτjβj=O(1)\sqrt{p}\tau_{j}\beta_{j}=O(1), we have the following as n,pn,p\to\infty with the regime (2):

p(β~mle,jμ𝜷~mleβj)σ𝜷~mle/τjd𝒩(0,1),\displaystyle\frac{\sqrt{p}(\tilde{\beta}_{\mathrm{mle},j}-\mu_{{\tilde{\bm{\beta}}_{\mathrm{mle}}}}\beta_{j})}{{\sigma_{\tilde{\bm{\beta}}_{\mathrm{mle}}}}/\tau_{j}}\overset{\mathrm{d}}{\to}\mathcal{N}(0,1), (40)

In these cases, we can define values (γ~mle,μ~mle,σ~mle)(\tilde{\gamma}_{\mathrm{mle}},\tilde{\mu}_{\mathrm{mle}},\tilde{\sigma}_{\mathrm{mle}}) for estimating their inferential parameters. Define g0(x)=1/(1+exp(x))g_{0}(x)=1/(1+\exp(-x)) for logistic regression and g0(x)=exp(x)g_{0}(x)=\exp(x) for Poisson regression. Then, we define the values as γ~mle=κ1v~mle1\tilde{\gamma}_{\mathrm{mle}}=\kappa_{1}\tilde{v}_{\mathrm{mle}}^{-1} and

μ~mle=|𝑿(1)𝜷~mle2n1(1κ1)σ~mle2|1/2,σ~mle2=𝒚(1)g0(𝑿(1)𝜷~mle)2n1v~mle/κ1,\displaystyle\tilde{\mu}_{\mathrm{mle}}=\left\lvert\frac{\|\bm{X}^{(1)}\tilde{\bm{\beta}}_{\mathrm{mle}}\|^{2}}{n_{1}}-(1-\kappa_{1})\tilde{\sigma}_{\mathrm{mle}}^{2}\right\rvert^{1/2},~{}~{}~{}\tilde{\sigma}_{\mathrm{mle}}^{2}=\frac{\|\bm{y}^{(1)}-g_{0}(\bm{X}^{(1)}\tilde{\bm{\beta}}_{\mathrm{mle}})\|^{2}}{n_{1}\tilde{v}_{\mathrm{mle}}/\kappa_{1}}, (41)

with v~mle=n11tr(𝑫~𝑫~𝑿(1)((𝑿(1))𝑫~𝑿(1))1(𝑿(1))𝑫~)\tilde{v}_{\mathrm{mle}}=n_{1}^{-1}\mathrm{tr}(\tilde{\bm{D}}-\tilde{\bm{D}}\bm{X}^{(1)}((\bm{X}^{(1)})^{\top}\tilde{\bm{D}}\bm{X}^{(1)})^{-1}(\bm{X}^{(1)})^{\top}\tilde{\bm{D}}) and 𝑫~=diag(g0(𝑿(1)𝜷~mle))\tilde{\bm{D}}=\mathrm{diag}(g_{0}^{\prime}(\bm{X}^{(1)}\tilde{\bm{\beta}}_{\mathrm{mle}})). Based on this definition, we can develop a corresponding index estimator by replacing (μ~,σ~)(\tilde{\mu},\tilde{\sigma}) in (6) by μ~mle\tilde{\mu}_{\mathrm{mle}} and σ~mle\tilde{\sigma}_{\mathrm{mle}}.
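For the logistic case, a minimal sketch of the MLE pilot (38) and the plug-in values v~_mle and (41) might look as follows (illustrative only; it assumes the MLE exists in the given regime and that the plain Newton iterations below converge).

```python
import numpy as np

def logistic_mle_adjustments(X1, y1, n_iter=50):
    """Logistic MLE pilot (38) and the values gamma~_mle, mu~_mle, sigma~^2_mle in (41)."""
    n1, p = X1.shape
    kappa1 = p / n1
    g0 = lambda t: 1.0 / (1.0 + np.exp(-t))          # logistic link g_0
    beta = np.zeros(p)
    for _ in range(n_iter):                          # Newton-Raphson for the logistic loss
        eta = X1 @ beta
        w = g0(eta) * (1.0 - g0(eta))                # g_0'(eta)
        grad = X1.T @ (g0(eta) - y1)
        hess = X1.T @ (X1 * w[:, None])
        beta -= np.linalg.solve(hess, grad)
    eta = X1 @ beta
    D = np.diag(g0(eta) * (1.0 - g0(eta)))           # D~ = diag(g_0'(X^(1) beta~_mle))
    H = D @ X1 @ np.linalg.solve(X1.T @ D @ X1, X1.T @ D)
    v_mle = (np.trace(D) - np.trace(H)) / n1         # v~_mle
    gamma_mle = kappa1 / v_mle
    sigma2_mle = np.sum((y1 - g0(eta)) ** 2) / (n1 * v_mle / kappa1)            # (41), right
    mu_mle = np.sqrt(abs(np.sum(eta ** 2) / n1 - (1.0 - kappa1) * sigma2_mle))  # (41), left
    return beta, gamma_mle, mu_mle, sigma2_mle
```

The Poisson case is analogous, with g0(t)=exp(t) and g0'(t)=exp(t) in the Newton weights and in D~.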

7. Conclusion and Discussion

This study establishes a novel statistical inference procedure for high-dimensional single-index models. Specifically, we develop a consistent estimation method for the link function. Furthermore, using the estimated link function, we formulate an efficient estimator and confirm its marginal asymptotic normality. This verification allows for the accurate construction of confidence intervals and pp-values for any finite collection of coordinates.

We identify several avenues for future research: (a) extending these results to cases where the covariate distribution is non-Gaussian, (b) generalizing our findings to multi-index models, and (c) confirming the marginal asymptotic normality of our proposed estimators under any form of regularization and covariance. These prospects offer intriguing possibilities for further exploration.

Appendix A Effect of Link Estimation on Inferential Parameters

The following theorem reveals that the estimation error of the link function is asymptotically negligible with respect to the observable adjustments.

Specifically, we consider a slightly modified version of the inferential estimator. In preparation, we define a censoring operator ι:\iota:\mathbb{R}\to\mathbb{R} on an interval [a,b][a,b]\subset\mathbb{R} as ι(z)=max(a,min(b,z))\iota(z)=\max(a,\min(b,z)). Then, for any g¯:\bar{g}:\mathbb{R}\to\mathbb{R}, we define a truncated version of 𝑫\bm{D} as 𝑫c=diag(g¯(ι(𝑿(2)𝜷^(g¯))))\bm{D}_{c}=\mathrm{diag}(\bar{g}^{\prime}(\iota(\bm{X}^{(2)}\hat{\bm{\beta}}(\bar{g})))), and v^0c=n21tr(𝑫c𝑫c𝑿(2)((𝑿(2))𝑫c𝑿(2))1(𝑿(2))𝑫c)\hat{v}_{0c}=n_{2}^{-1}\mathrm{tr}(\bm{D}_{c}-\bm{D}_{c}\bm{X}^{(2)}(({\bm{X}^{(2)}})^{\top}\bm{D}_{c}\bm{X}^{(2)})^{-1}({\bm{X}^{(2)}})^{\top}\bm{D}_{c}). Further, in the case of J()𝟎J(\cdot)\equiv\bm{0}, we define the modified estimator as

μ^0c(g¯)\displaystyle\hat{\mu}_{0c}(\bar{g}) =|ι(𝑿(2)𝜷^(g¯))2/n2(1κ2)σ^0c2(g¯)|1/2, and\displaystyle=\left\lvert{\|\iota(\bm{X}^{(2)}\hat{\bm{\beta}}(\bar{g}))\|^{2}}/n_{2}-(1-\kappa_{2})\hat{\sigma}_{0c}^{2}(\bar{g})\right\rvert^{1/2},\mbox{~{}~{}and} (42)
σ^0c2(g¯)\displaystyle\hat{\sigma}_{0c}^{2}(\bar{g}) =𝒚(2)g¯(ι(𝑿(2)𝜷^(g¯)))2n2v^0c2/κ2.\displaystyle=\frac{\|\bm{y}^{(2)}-\bar{g}(\iota(\bm{X}^{(2)}\hat{\bm{\beta}}(\bar{g})))\|^{2}}{n_{2}\hat{v}_{0c}^{2}/\kappa_{2}}. (43)

Using the modified definition, we obtain the following result.

Theorem 5.

Suppose that J()𝟎J(\cdot)\equiv\bm{0} holds and the estimator (13) exists. Further, suppose that Assumptions 1-4 hold. Then, we have the following as n1n_{1}\to\infty:

|μ^0c(g^)μ^0c(g)|p0,and|σ^0c2(g^)σ^0c2(g)|p0.\displaystyle\left\lvert\hat{\mu}_{0c}(\hat{g})-\hat{\mu}_{0c}(g)\right\rvert\overset{\mathrm{p}}{\to}0,\quad\mathrm{and}\quad\left\lvert\hat{\sigma}_{0c}^{2}(\hat{g})-\hat{\sigma}_{0c}^{2}(g)\right\rvert\overset{\mathrm{p}}{\to}0. (44)

This result indicates that, since the link estimator g^()\hat{g}(\cdot) is consistent, we can estimate the inferential parameters under the true link g()g(\cdot).

The difficulty in this proof arises from the dependence between the elements of the estimator, which cannot be handled by the triangle inequality or Hölder’s inequality. To overcome this difficulty, we utilize the Azuma-Hoeffding inequality for martingale difference sequences.

Appendix B Theoretical Efficiency Comparison

We compare the efficiency of our estimator 𝜷^(g^)\hat{\bm{\beta}}(\hat{g}) with that of the ridge estimator 𝜷~\tilde{\bm{\beta}} as the pilot. As shown in Bellec (2022), the ridge estimator is a valid estimator for the single-index model in the high-dimensional scheme (2) even without estimating the link function g()g(\cdot).

To this end, we define the effective asymptotic variance based on the inferential parameters, namely the ratio of the asymptotic variance to the squared bias factor. That is, our estimator 𝜷^(g^)\hat{\bm{\beta}}(\hat{g}) has effective asymptotic variance σ^2(g^)/μ^2(g^)\hat{\sigma}^{2}(\hat{g})/\hat{\mu}^{2}(\hat{g}), and the ridge estimator 𝜷~\tilde{\bm{\beta}} has σ~2/μ~2\tilde{\sigma}^{2}/\tilde{\mu}^{2}. The effective asymptotic variance corresponds to the asymptotic variance of each coordinate of the estimators after bias correction.

We give the following result for necessary and sufficient conditions for the proposed estimator to be more efficient than the least squares estimator and the ridge estimator.

Proposition 2.

We consider the coefficient estimator 𝛃^(g^)\hat{\bm{\beta}}(\hat{g}) with J(𝐛)λ𝐛2J(\bm{b})\equiv\lambda\|\bm{b}\|^{2} and the setup n1=n2n_{1}=n_{2}. We use the regularization parameter λ1>0\lambda_{1}>0 for the pilot estimator 𝛃~\tilde{\bm{\beta}}. Suppose that Assumptions 1-3 are fulfilled. Then, σ^2(g^)/μ^2(g^)<σ~2/μ~2\hat{\sigma}^{2}(\hat{g})/\hat{\mu}^{2}(\hat{g})<\tilde{\sigma}^{2}/\tilde{\mu}^{2} holds if and only if we have

𝜷^(g^)𝜷~|v^λ+λ||v~+λ1|𝒚(1)𝑿(1)𝜷~𝒚(2)g^(𝑿(2)𝜷^(g^))>1.\displaystyle\frac{\|\hat{\bm{\beta}}(\hat{g})\|}{\|\tilde{\bm{\beta}}\|}\cdot\frac{\left\lvert\hat{v}_{\lambda}+\lambda\right\rvert}{\left\lvert\tilde{v}+\lambda_{1}\right\rvert}\cdot\frac{\|\bm{y}^{(1)}-\bm{X}^{(1)}\tilde{\bm{\beta}}\|}{\|\bm{y}^{(2)}-\hat{g}(\bm{X}^{(2)}\hat{\bm{\beta}}(\hat{g}))\|}>1. (45)

This necessary and sufficient condition suggests that our estimator may have an advantage by exploiting the nonlinearity of the link function g()g(\cdot). The first reason is that, when 𝒚\bm{y} depends nonlinearly on 𝑿𝜷\bm{X}\bm{\beta}, the residual 𝒚g^(𝑿𝜷^)2\|\bm{y}-\hat{g}(\bm{X}\hat{\bm{\beta}})\|^{2} of the proposed method is expected to be asymptotically smaller than 𝒚𝑿𝜷~2\|\bm{y}-\bm{X}\tilde{\bm{\beta}}\|^{2}. The second reason is that v^λ\hat{v}_{\lambda} approximates the gradient mean n1i=1ng^(𝑿i𝜷^(g^))n^{-1}\sum_{i=1}^{n}\hat{g}^{\prime}(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g})), so this term increases when g()g(\cdot) has a large gradient. By exploiting these facts, the proposed method incorporates the nonlinearity of g()g(\cdot) and thereby helps improve efficiency.
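In practice, the criterion (45) can be checked directly from the two fits. The fragment below is an illustrative sketch with hypothetical variable names for the fitted quantities; it simply evaluates whether the product on the left-hand side of (45) exceeds one.

```python
import numpy as np

def refined_more_efficient(beta_hat, beta_tilde, v_hat_lam, lam, v_tilde, lam1,
                           resid_ridge, resid_refined):
    """Return True iff the left-hand side of (45) exceeds 1.

    resid_ridge   : y^(1) - X^(1) beta~        (ridge pilot residual)
    resid_refined : y^(2) - g^(X^(2) beta^(g^)) (refined estimator residual)
    """
    lhs = (np.linalg.norm(beta_hat) / np.linalg.norm(beta_tilde)
           * abs(v_hat_lam + lam) / abs(v_tilde + lam1)
           * np.linalg.norm(resid_ridge) / np.linalg.norm(resid_refined))
    return lhs > 1.0
```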

Proposition 3.

If J()𝟎J(\cdot)\equiv\bm{0} and n1=n2n_{1}=n_{2} are fulfilled, then σ^02(g^)/μ^02(g^)<σ~LS2/μ~LS2\hat{\sigma}_{0}^{2}(\hat{g})/\hat{\mu}_{0}^{2}(\hat{g})<\tilde{\sigma}_{\mathrm{LS}}^{2}/\tilde{\mu}_{\mathrm{LS}}^{2} if and only if

𝑿(2)𝜷^(g^)𝑿(1)𝜷~LS|v^0|1κ1𝒚(1)𝑿(1)𝜷~LS𝒚(2)g^(𝑿(2)𝜷^(g^))>1.\displaystyle{\frac{\|\bm{X}^{(2)}\hat{\bm{\beta}}(\hat{g})\|}{\|\bm{X}^{(1)}\tilde{\bm{\beta}}_{\mathrm{LS}}\|}\cdot\frac{\left\lvert\hat{v}_{0}\right\rvert}{1-\kappa_{1}}\cdot\frac{\|\bm{y}^{(1)}-\bm{X}^{(1)}\tilde{\bm{\beta}}_{\mathrm{LS}}\|}{\|\bm{y}^{(2)}-\hat{g}(\bm{X}^{(2)}\hat{\bm{\beta}}(\hat{g}))\|}}>1. (46)

Appendix C Nonparametric Regression with Deconvolution

In this section, we review the concept of nonparametric regression with deconvolution to address the errors-in-variables problem. To begin with, we redefine the notation only for this section. For a triple of random variables (X,Y,Z)(X,Y,Z), suppose that the model is

𝔼[ZX=x]=m(x),\displaystyle\operatorname{\mathbb{E}}[Z\mid X=x]=m(x), (47)

and that we can only observe nn iid realizations of Y=X+εY=X+\varepsilon and ZZ. Here, ε\varepsilon is a random variable called measurement error or error in variables. For the identification, we assume that the distribution of ε\varepsilon is known. Let the joint distribution of (X,Z)(X,Z) be f(x,z)f(x,z). By the definition of the conditional expectations, m(x)=r(x)/f(x)m(x)={r(x)}/{f(x)} with

r(x)=zf(x,z)𝑑z,f(x)=f(x,z)𝑑z,\displaystyle r(x)=\int_{-\infty}^{\infty}zf(x,z)dz,\quad f(x)=\int_{-\infty}^{\infty}f(x,z)dz, (48)

for the continuous random variables. The goal of the problem is to estimate the function m()m(\cdot).

If we could observe XX, a popular estimator of m(x)m(x) is the Nadaraya-Watson estimator r~(x)/f~(x){\tilde{r}(x)}/{\tilde{f}(x)} with

r~(x)=1nhni=1nZiK(xXihn),f~(x)=1nhni=1nK(xXihn),\displaystyle\tilde{r}(x)=\frac{1}{nh_{n}}\sum_{i=1}^{n}Z_{i}K\left(\frac{x-X_{i}}{h_{n}}\right),\quad\tilde{f}(x)=\frac{1}{nh_{n}}\sum_{i=1}^{n}K\left(\frac{x-X_{i}}{h_{n}}\right), (49)

where K()K(\cdot) is a kernel function and hnh_{n} is the bandwidth. Since XX is unobservable, we alternatively construct the deconvolution estimator (Stefanski and Carroll, 1990). Let the characteristic function of XX, YY and ε\varepsilon be ϕX()\phi_{X}(\cdot), ϕY()\phi_{Y}(\cdot) and ϕε()\phi_{\varepsilon}(\cdot), respectively. Since the density of YY is the convolution of that of XX and ε\varepsilon, and the convolution in the frequency domain is just a multiplication, we have ϕX(t)=ϕY(t)/ϕε(t)\phi_{X}(t)=\phi_{Y}(t)/\phi_{\varepsilon}(t). Thus, the inverse Fourier transform of ϕY(t)/ϕε(t)\phi_{Y}(t)/\phi_{\varepsilon}(t) gives the density of XX. Since we know the distribution of ε\varepsilon and we can approximate ϕY(t)\phi_{Y}(t) by the characteristic function of the kernel density estimator of YY, we can construct an estimator of f(x)f(x) as

f^(x)=12πexp(itx)ϕK(thn)ϕ^Y(t)ϕε(t)𝑑t,\displaystyle\hat{f}(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty}\exp(-\mathrm{i}tx)\phi_{K}(th_{n})\frac{\hat{\phi}_{Y}(t)}{\phi_{\varepsilon}(t)}dt, (50)

where we use the fact that the Fourier transform of f~Y(y)=(nhn)1i=1nK((yYi)/hn)\tilde{f}_{Y}(y)=(nh_{n})^{-1}\sum_{i=1}^{n}K((y-Y_{i})/h_{n}) is ϕK(thn)ϕ^Y(t)\phi_{K}(th_{n})\hat{\phi}_{Y}(t), which approximates ϕY()\phi_{Y}(\cdot). Here, ϕ^Y(t)\hat{\phi}_{Y}(t) is the empirical characteristic function:

ϕ^Y(t)=1ni=1nexp(itYi).\displaystyle\hat{\phi}_{Y}(t)=\frac{1}{n}\sum_{i=1}^{n}\exp(\mathrm{i}tY_{i}). (51)

We can rewrite (50) in a kernel form

f^(x)=1nhni=1nKn(xYihn),\displaystyle\hat{f}(x)=\frac{1}{nh_{n}}\sum_{i=1}^{n}K_{n}\left(\frac{x-Y_{i}}{h_{n}}\right), (52)

with

Kn(x)=12πexp(itx)ϕK(t)ϕε(t/hn)𝑑t.\displaystyle K_{n}(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty}\exp(-\mathrm{i}tx)\frac{\phi_{K}(t)}{\phi_{\varepsilon}(t/h_{n})}dt. (53)

Using this, Fan and Truong (1993) proposes a kernel regression estimator m^(x)=r^(x)/f^(x)\hat{m}(x)=\hat{r}(x)/\hat{f}(x) involving errors in variables with

r^(x)=1nhni=1nZiKn(xYihn).\displaystyle\hat{r}(x)=\frac{1}{nh_{n}}\sum_{i=1}^{n}Z_{i}K_{n}\left(\frac{x-Y_{i}}{h_{n}}\right). (54)
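For concreteness, a minimal numerical sketch of the estimator m^(x)=r^(x)/f^(x) with the kernel form (52)-(54) and Gaussian measurement error is given below. This is our own illustrative code, not the construction used in the main text: the kernel with characteristic function phi_K(t)=(1-t^2)^3 supported on [-1,1] (so M_0=1), the bandwidth, and the toy data-generating process are assumptions made only for the example.

```python
import numpy as np

def deconv_regression(Y, Z, x_grid, sigma_eps, h, n_t=801):
    """Deconvolution kernel regression m_hat(x) = r_hat(x) / f_hat(x), cf. (52)-(54)."""
    t = np.linspace(-1.0, 1.0, n_t)                    # support of phi_K, i.e. M_0 = 1
    dt = t[1] - t[0]
    phi_K = (1.0 - t ** 2) ** 3                        # an illustrative compactly supported char. function
    phi_eps = np.exp(-0.5 * (sigma_eps * t / h) ** 2)  # Gaussian error: phi_eps(t/h)
    weight = phi_K / phi_eps
    n = len(Y)
    m_hat = np.empty(len(x_grid))
    for k, x in enumerate(x_grid):
        u = (x - Y) / h                                # (x - Y_i)/h_n
        # K_n(u) = (1/2pi) int exp(-i t u) phi_K(t)/phi_eps(t/h) dt, real by symmetry
        Kn = (np.cos(np.outer(u, t)) * weight).sum(axis=1) * dt / (2.0 * np.pi)
        f_hat = Kn.sum() / (n * h)                     # (52)
        r_hat = (Z * Kn).sum() / (n * h)               # (54)
        m_hat[k] = r_hat / f_hat
    return m_hat

# toy usage: unobserved X ~ N(0,1), response Z = sin(2X) + noise, observed Y = X + eps
rng = np.random.default_rng(1)
X = rng.standard_normal(2000)
Y = X + 0.3 * rng.standard_normal(2000)
Z = np.sin(2.0 * X) + 0.1 * rng.standard_normal(2000)
m_hat = deconv_regression(Y, Z, np.linspace(-1.5, 1.5, 31), sigma_eps=0.3, h=0.4)
```

Setting sigma_eps to zero makes the weight collapse to phi_K, and the estimator then reduces essentially to the Nadaraya-Watson construction (49) applied to the observed Y.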

To establish the theoretical guarantee, we impose the following assumptions:

  1. (N1)

    (Super-smoothness of the distribution of ε\varepsilon) There exist constants d0,d1,β,γ>0d_{0},d_{1},\beta,\gamma>0 and β0,β1\beta_{0},\beta_{1}\in\mathbb{R} satisfying, as tt\to\infty,

    d0|t|β0exp(|t|β/γ)|ϕε(t)|d1|t|β1exp(|t|β/γ).\displaystyle d_{0}\left\lvert t\right\rvert^{\beta_{0}}\exp(-\left\lvert t\right\rvert^{\beta}/\gamma)\leq\left\lvert\phi_{\varepsilon}(t)\right\rvert\leq d_{1}\left\lvert t\right\rvert^{\beta_{1}}\exp(-\left\lvert t\right\rvert^{\beta}/\gamma). (55)
  2. (N2)

    The characteristic function of the error distribution ϕε()\phi_{\varepsilon}(\cdot) does not vanish.

  3. (N3)

    Let a<ba<b. The marginal density fX()f_{X}(\cdot) of the unobserved XX is bounded away from zero on the interval [a,b][a,b], and has a bounded kk-th derivative.

  4. (N4)

    The true regression function m()m(\cdot) has a continuous kk-th derivative on [a,b][a,b].

  5. (N5)

    The conditional second moment 𝔼[Z2X=x]\operatorname{\mathbb{E}}[Z^{2}\mid X=x] is continuous on [a,b][a,b], and 𝔼[Z2]<\operatorname{\mathbb{E}}[Z^{2}]<\infty.

  6. (N6)

    The kernel K()K(\cdot) is a kk-th order kernel. Namely,

    K(t)𝑑t=1,tkK(t)𝑑t0,tjK(t)𝑑t=0forj=1,,k1.\displaystyle\int_{-\infty}^{\infty}K(t)dt=1,\quad\int_{-\infty}^{\infty}t^{k}K(t)dt\neq 0,\quad\int_{-\infty}^{\infty}t^{j}K(t)dt=0\quad\mathrm{for}\quad j=1,\ldots,k-1. (56)

(N1) includes Gaussian distributions for β=2\beta=2 and Cauchy distributions for β=1\beta=1. For a positive constant BB, define the set of functions

={f(x,z):|fX(k)()|B,minaxbfX(x)B1,supaxb|m(j)(x)|B,j=0,1,,k}.\displaystyle\mathcal{F}=\left\{f(x,z):\left\lvert f_{X}^{(k)}(\cdot)\right\rvert\leq B,\min_{a\leq x\leq b}f_{X}(x)\geq B^{-1},\sup_{a\leq x\leq b}\left\lvert m^{(j)}(x)\right\rvert\leq B,j=0,1,\ldots,k\right\}. (57)

In this setting, we have the uniform consistency of m^()\hat{m}(\cdot) and its rate of convergence.

Lemma 1 (Theorem 2 in Fan and Truong (1993)).

Assume (N1)-(N6) and that ϕK(t)\phi_{K}(t) has a bounded support on |t|<M0\left\lvert t\right\rvert<M_{0}. Then, for bandwidth hn=c(logn)1/βh_{n}=c(\log n)^{-1/\beta} with c>M0(2/γ)1/βc>M_{0}(2/\gamma)^{1/\beta},

limdlim supn(supaxb|m^(x)m(x)|d(logn)k/β)=0,\displaystyle\lim_{d\to\infty}\limsup_{n\to\infty}\mathbb{P}\left(\sup_{a\leq x\leq b}|\hat{m}(x)-m(x)|\geq d(\log n)^{-k/\beta}\right)=0, (58)

holds for any ff\in\mathcal{F}.

Furthermore, we can show the uniform convergence of the derivative of m^()\hat{m}(\cdot).

Lemma 2.

Under the condition of Lemma 1, we have, for any ff\in\mathcal{F},

supaxb|m^(x)m(x)|p0.\displaystyle\sup_{a\leq x\leq b}|\hat{m}^{\prime}(x)-m^{\prime}(x)|\overset{\mathrm{p}}{\to}0. (59)

To prove this, we use the following two lemmas.

Lemma 3.

We have, for any tt\in\mathbb{R},

𝔼[|ϕ^Y(t)ϕY(t)|2]n1,\displaystyle\operatorname{\mathbb{E}}\left[\left\lvert\hat{\phi}_{Y}(t)-\phi_{Y}(t)\right\rvert^{2}\right]\leq n^{-1}, (60)

and

𝔼[|1ni=1nZiexp(itYi)𝔼[Zexp(itY)]|2]n1𝔼[Z2].\displaystyle\operatorname{\mathbb{E}}\left[\left\lvert\frac{1}{n}\sum_{i=1}^{n}Z_{i}\exp(\mathrm{i}tY_{i})-\operatorname{\mathbb{E}}[Z\exp(\mathrm{i}tY)]\right\rvert^{2}\right]\leq n^{-1}\operatorname{\mathbb{E}}[Z^{2}]. (61)
Proof of Lemma 3.

We decompose the term on the left-hand side in the first statement by Euler’s formula as

𝔼[|ϕ^Y(t)ϕY(t)|2]\displaystyle\operatorname{\mathbb{E}}\left[\left\lvert\hat{\phi}_{Y}(t)-\phi_{Y}(t)\right\rvert^{2}\right] =𝔼[|1ni=1neitYi𝔼eitY|2]\displaystyle=\operatorname{\mathbb{E}}\left[\left\lvert\frac{1}{n}\sum_{i=1}^{n}e^{\mathrm{i}tY_{i}}-\operatorname{\mathbb{E}}e^{\mathrm{i}tY}\right\rvert^{2}\right] (62)
=𝔼[|1ni=1n{cos(tYi)𝔼cos(tY)}ini=1n{sin(tYi)𝔼sin(tY)}|2]\displaystyle=\operatorname{\mathbb{E}}\left[\left\lvert\frac{1}{n}\sum_{i=1}^{n}\left\{\cos(tY_{i})-\operatorname{\mathbb{E}}\cos(tY)\right\}-\frac{\mathrm{i}}{n}\sum_{i=1}^{n}\left\{\sin(tY_{i})-\operatorname{\mathbb{E}}\sin(tY)\right\}\right\rvert^{2}\right] (63)
=𝔼[{1ni=1ncos(tYi)𝔼cos(tY)}2+{1ni=1nsin(tYi)𝔼sin(tY)}2]\displaystyle=\operatorname{\mathbb{E}}\left[\left\{\frac{1}{n}\sum_{i=1}^{n}\cos(tY_{i})-\operatorname{\mathbb{E}}\cos(tY)\right\}^{2}+\left\{\frac{1}{n}\sum_{i=1}^{n}\sin(tY_{i})-\operatorname{\mathbb{E}}\sin(tY)\right\}^{2}\right] (64)
Var(n1i=1ncos(tYi))+Var(n1i=1nsin(tYi))\displaystyle\leq\operatorname{Var}\left(n^{-1}\sum_{i=1}^{n}\cos(tY_{i})\right)+\operatorname{Var}\left(n^{-1}\sum_{i=1}^{n}\sin(tY_{i})\right) (65)
n1𝔼[cos(tY1)2+sin(tY1)2]=n1.\displaystyle\leq n^{-1}{\operatorname{\mathbb{E}}\left[\cos(tY_{1})^{2}+\sin(tY_{1})^{2}\right]}=n^{-1}. (66)

Similarly, we obtain

𝔼[|1ni=1nZiexp(itYi)𝔼[Zexp(itY)]|2]\displaystyle\operatorname{\mathbb{E}}\left[\left\lvert\frac{1}{n}\sum_{i=1}^{n}Z_{i}\exp(\mathrm{i}tY_{i})-\operatorname{\mathbb{E}}[Z\exp(\mathrm{i}tY)]\right\rvert^{2}\right] =1nVar(Z1cos(tY1))+1nVar(Z1sin(tY1))\displaystyle=\frac{1}{n}\operatorname{Var}(Z_{1}\cos(tY_{1}))+\frac{1}{n}\operatorname{Var}(Z_{1}\sin(tY_{1})) (67)
n1𝔼[Z2].\displaystyle\leq n^{-1}\operatorname{\mathbb{E}}[Z^{2}]. (68)

This completes the proof. ∎

Lemma 4.

Under the setting of Lemma 1, for bandwidth hn=c(logn)1/βh_{n}=c(\log n)^{-1/\beta} with c>M0(2/γ)1/βc>M_{0}(2/\gamma)^{1/\beta}, we have

n1supx|Kn(x)|2=o(1).\displaystyle n^{-1}\sup_{x}\left\lvert K_{n}(x)\right\rvert^{2}=o(1). (69)
Proof of Lemma 4.

At first, (N1) implies that there exists a constant MM such that

|ϕε(t)|>d02|t|β0exp(|t|β/γ),\displaystyle\left\lvert\phi_{\varepsilon}(t)\right\rvert>\frac{d_{0}}{2}|t|^{\beta_{0}}\exp(-|t|^{\beta}/\gamma), (70)

for |t|>M|t|>M. By the fact that |exp(itx)|=1\left\lvert\exp(-\mathrm{i}tx)\right\rvert=1 and that the support of ϕK()\phi_{K}(\cdot) is bounded by M0M_{0}, we have

supx|Kn(x)|\displaystyle\sup_{x}\left\lvert K_{n}(x)\right\rvert |ϕK(t)||ϕε(t/hn)|𝑑t\displaystyle\leq\int_{-\infty}^{\infty}\frac{\left\lvert\phi_{K}(t)\right\rvert}{\left\lvert\phi_{\varepsilon}(t/h_{n})\right\rvert}dt (71)
20Mhn|ϕK(t)||ϕε(t/hn)|𝑑t+4d0MhnM0|ϕK(t)||thn|β0exp(|t/hn|βγ)𝑑t\displaystyle\leq 2\int_{0}^{Mh_{n}}\frac{\left\lvert\phi_{K}(t)\right\rvert}{\left\lvert\phi_{\varepsilon}(t/h_{n})\right\rvert}dt+\frac{4}{d_{0}}\int_{Mh_{n}}^{M_{0}}|\phi_{K}(t)|\left\lvert\frac{t}{h_{n}}\right\rvert^{-\beta_{0}}\exp\left(\frac{|t/h_{n}|^{\beta}}{\gamma}\right)dt (72)
2hn0M1|ϕε(u)|𝑑u+4d0(M0Mhn)hnβ0M0β0exp(|M0/hn|βγ)\displaystyle\leq 2h_{n}\int_{0}^{M}\frac{1}{|\phi_{\varepsilon}(u)|}du+\frac{4}{d_{0}}(M_{0}-Mh_{n})h_{n}^{\beta_{0}}M_{0}^{-\beta_{0}}\exp\left(\frac{|M_{0}/h_{n}|^{\beta}}{\gamma}\right) (73)
=O(hn)+O(hnβ0exp(|M0/hn|β/γ)).\displaystyle=O(h_{n})+O(h_{n}^{\beta_{0}}\exp(|M_{0}/h_{n}|^{\beta}/\gamma)). (74)

Here, we use the fact that |ϕK(t)||eitx||K(x)|𝑑x<|\phi_{K}(t)|\leq{\int|e^{-itx}||K(x)|dx}<\infty. Since we choose hn=c(logn)1/βh_{n}=c(\log n)^{-1/\beta} with c>M0(2/γ)1/βc>M_{0}(2/\gamma)^{1/\beta}, we obtain the conclusion. ∎

Proof of Lemma 2.

Let axba\leq x\leq b. To begin with, by the triangle inequality, we have

supaxb|m^(x)m(x)|\displaystyle\sup_{a\leq x\leq b}\left\lvert\hat{m}^{\prime}(x)-m^{\prime}(x)\right\rvert =supaxb|r^(x)f^(x)r^(x)f^(x)f^(x)2r(x)f(x)r(x)f(x)f(x)2|\displaystyle=\sup_{a\leq x\leq b}\left\lvert\frac{\hat{r}^{\prime}(x)\hat{f}(x)-\hat{r}(x)\hat{f}^{\prime}(x)}{\hat{f}(x)^{2}}-\frac{{r}^{\prime}(x){f}(x)-{r}(x){f}^{\prime}(x)}{{f}(x)^{2}}\right\rvert (75)
supaxb|r^(x)f^(x)r^(x)f^(x)r(x)f(x)+r(x)f(x)f(x)2|\displaystyle\leq\sup_{a\leq x\leq b}\left\lvert\frac{\hat{r}^{\prime}(x)\hat{f}(x)-\hat{r}(x)\hat{f}^{\prime}(x)-{r}^{\prime}(x){f}(x)+{r}(x){f}^{\prime}(x)}{f(x)^{2}}\right\rvert (76)
+supaxb|r^(x)f^(x)r^(x)f^(x)f(x)2(f(x)2f^(x)21)|\displaystyle\quad+\sup_{a\leq x\leq b}\left\lvert\frac{\hat{r}^{\prime}(x)\hat{f}(x)-\hat{r}(x)\hat{f}^{\prime}(x)}{f(x)^{2}}\left(\frac{f(x)^{2}}{\hat{f}(x)^{2}}-1\right)\right\rvert (77)
B2supaxb|r^(x)f^(x)r^(x)f^(x)r(x)f(x)+r(x)f(x)|\displaystyle\leq B^{2}\sup_{a\leq x\leq b}\left\lvert\hat{r}^{\prime}(x)\hat{f}(x)-\hat{r}(x)\hat{f}^{\prime}(x)-{r}^{\prime}(x){f}(x)+{r}(x){f}^{\prime}(x)\right\rvert (78)
+B2supaxb|r^(x)f^(x)r^(x)f^(x)||f(x)2f^(x)2f^(x)2|,\displaystyle\quad+B^{2}\sup_{a\leq x\leq b}\left\lvert\hat{r}^{\prime}(x)\hat{f}(x)-\hat{r}(x)\hat{f}^{\prime}(x)\right\rvert\left\lvert\frac{f(x)^{2}-\hat{f}(x)^{2}}{\hat{f}(x)^{2}}\right\rvert, (79)

where the last inequality uses the assumption minaxb|f(x)|B1\min_{a\leq x\leq b}|f(x)|\geq B^{-1}. We establish the convergence in probability by showing L1L^{1} convergence. Using the triangle inequality and the Cauchy-Schwarz inequality, we have

𝔼supaxb|r^(x)f^(x)r^(x)f^(x)r(x)f(x)+r(x)f(x)|\displaystyle\operatorname{\mathbb{E}}\sup_{a\leq x\leq b}\left\lvert\hat{r}^{\prime}(x)\hat{f}(x)-\hat{r}(x)\hat{f}^{\prime}(x)-{r}^{\prime}(x){f}(x)+{r}(x){f}^{\prime}(x)\right\rvert (80)
𝔼supaxb|f^(x)(r^(x)r(x))|+𝔼supaxb|r(x)(f^(x)f(x))|\displaystyle\leq\operatorname{\mathbb{E}}\sup_{a\leq x\leq b}\left\lvert\hat{f}(x)\left(\hat{r}^{\prime}(x)-r^{\prime}(x)\right)\right\rvert+\operatorname{\mathbb{E}}\sup_{a\leq x\leq b}\left\lvert r^{\prime}(x)\left(\hat{f}(x)-f(x)\right)\right\rvert (81)
+𝔼supaxb|f(x)(r(x)r^(x))|+𝔼supaxb|r^(x)(f(x)f^(x))|\displaystyle\quad+\operatorname{\mathbb{E}}\sup_{a\leq x\leq b}\left\lvert f^{\prime}(x)\left(r(x)-\hat{r}(x)\right)\right\rvert+\operatorname{\mathbb{E}}\sup_{a\leq x\leq b}\left\lvert\hat{r}(x)\left(f^{\prime}(x)-\hat{f}^{\prime}(x)\right)\right\rvert (82)
𝔼[supaxb|f^(x)|2]𝔼[supaxb|r^(x)r(x)|2]+supaxb|r(x)|𝔼[supaxb|f^(x)f(x)|2]\displaystyle\leq\sqrt{\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}\left\lvert\hat{f}(x)\right\rvert^{2}\right]}\sqrt{\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}\left\lvert\hat{r}^{\prime}(x)-r^{\prime}(x)\right\rvert^{2}\right]}+\sup_{a\leq x\leq b}|r^{\prime}(x)|\sqrt{\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}\left\lvert\hat{f}(x)-f(x)\right\rvert^{2}\right]} (83)
+supaxb|f(x)|𝔼[supaxb|r^(x)r(x)|2]+𝔼[supaxb|r^(x)|2]𝔼[supaxb|f^(x)f(x)|2].\displaystyle\quad+\sup_{a\leq x\leq b}|f^{\prime}(x)|\sqrt{\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}\left\lvert\hat{r}(x)-r(x)\right\rvert^{2}\right]}+\sqrt{\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}\left\lvert\hat{r}(x)\right\rvert^{2}\right]}\sqrt{\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}\left\lvert\hat{f}^{\prime}(x)-f^{\prime}(x)\right\rvert^{2}\right]}. (84)

Thus, to bound the right-hand side of (79), we need to show that 𝔼[supx|f^(x)|2]\operatorname{\mathbb{E}}[{\sup_{x}|{\hat{f}(x)}|^{2}}] and 𝔼[supx|r^(x)|2]\operatorname{\mathbb{E}}[{\sup_{x}\left\lvert\hat{r}(x)\right\rvert^{2}}] are bounded by constants and that 𝔼[supx|f^(x)f(x)|2]\operatorname{\mathbb{E}}[{\sup_{x}|{\hat{f}(x)-f(x)}|^{2}}], 𝔼[supx|r^(x)r(x)|2]\operatorname{\mathbb{E}}[{\sup_{x}\left\lvert\hat{r}(x)-r(x)\right\rvert^{2}}], 𝔼[supx|f^(x)f(x)|2],\operatorname{\mathbb{E}}[{\sup_{x}|{\hat{f}^{\prime}(x)-f^{\prime}(x)}|^{2}}], and 𝔼[supx|r^(x)r(x)|2]\operatorname{\mathbb{E}}[{\sup_{x}\left\lvert\hat{r}^{\prime}(x)-r^{\prime}(x)\right\rvert^{2}}] converge to zero.

\bullet Bound for 𝔼[supaxb|f^(x)f(x)|2]\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{f}(x)-f(x)}|^{2}\right]. By triangle inequality and the fact that (x+y)22x2+2y2(x+y)^{2}\leq 2x^{2}+2y^{2} for x,yx,y\in\mathbb{R}, we have

𝔼[supaxb|f^(x)f(x)|2]2𝔼[supaxb|f^(x)𝔼f^(x)|2]+2supaxb|𝔼f^(x)f(x)|2.\displaystyle\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{f}(x)-f(x)}|^{2}\right]\leq 2\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{f}(x)-\operatorname{\mathbb{E}}\hat{f}(x)}|^{2}\right]+2\sup_{a\leq x\leq b}|{\operatorname{\mathbb{E}}\hat{f}(x)-f(x)}|^{2}. (85)

For the first term on the right-hand side of (85), the Cauchy-Schwarz inequality gives

𝔼[supaxb|f^(x)𝔼f^(x)|2]\displaystyle\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{f}(x)-\operatorname{\mathbb{E}}\hat{f}(x)}|^{2}\right] 1(2π)2𝔼[{|ϕK(thn)||ϕε(t)||ϕ^Y(t)ϕY(t)|𝑑t}2]\displaystyle\leq\frac{1}{(2\pi)^{2}}\operatorname{\mathbb{E}}\left[\left\{\int_{-\infty}^{\infty}\frac{|\phi_{K}(th_{n})|}{|\phi_{\varepsilon}(t)|}\left\lvert\hat{\phi}_{Y}(t)-\phi_{Y}(t)\right\rvert dt\right\}^{2}\right] (86)
1(2π)2{|ϕK(thn)||ϕε(t)|𝑑t}{𝔼[|ϕ^Y(t)ϕY(t)|2]|ϕK(thn)||ϕε(t)|𝑑t}.\displaystyle\leq\frac{1}{(2\pi)^{2}}\left\{\int_{-\infty}^{\infty}\frac{|\phi_{K}(th_{n})|}{|\phi_{\varepsilon}(t)|}dt\right\}\left\{\int_{-\infty}^{\infty}\operatorname{\mathbb{E}}\left[\left\lvert\hat{\phi}_{Y}(t)-\phi_{Y}(t)\right\rvert^{2}\right]\frac{|\phi_{K}(th_{n})|}{|\phi_{\varepsilon}(t)|}dt\right\}. (87)

Lemma 3 and the proof of Lemma 4 imply that this converges to zero as nn\to\infty. Next, we consider the second term in (85). We obtain

𝔼[f^(x)]\displaystyle\operatorname{\mathbb{E}}\left[\hat{f}(x)\right] =12πexp(itx)ϕK(thn)𝔼Y[exp(itY)]ϕε(t)𝑑t\displaystyle=\frac{1}{2\pi}\int_{-\infty}^{\infty}\exp(-\mathrm{i}tx)\phi_{K}(th_{n})\frac{\operatorname{\mathbb{E}}_{Y}[\exp(\mathrm{i}tY)]}{\phi_{\varepsilon}(t)}dt (88)
=12π𝔼X[exp(itx)ϕK(thn)exp(itX)𝑑t]\displaystyle=\frac{1}{2\pi}\operatorname{\mathbb{E}}_{X}\left[\int_{-\infty}^{\infty}\exp(\mathrm{i}tx)\phi_{K}(th_{n})\exp(-\mathrm{i}tX)dt\right] (89)
=𝔼X[12πhnexp(itxXh)ϕK(t)𝑑t]\displaystyle=\operatorname{\mathbb{E}}_{X}\left[\frac{1}{2\pi h_{n}}\int_{-\infty}^{\infty}\exp\left(\mathrm{i}t\frac{x-X}{h}\right)\phi_{K}(t)dt\right] (90)
=1hn𝔼X[K(xXhn)].\displaystyle=\frac{1}{h_{n}}\operatorname{\mathbb{E}}_{X}\left[K\left(\frac{x-X}{h_{n}}\right)\right]. (91)

Thus, a classical result for kernel density estimation gives supx|𝔼[f^(x)]f(x)|0\sup_{x}|\operatorname{\mathbb{E}}[\hat{f}(x)]-f(x)|\to 0 as nn\to\infty.

\bullet Bound for 𝔼[supaxb|r^(x)r(x)|2]\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{r}(x)-r(x)}|^{2}\right]. By triangle inequality and the fact that (x+y)22x2+2y2(x+y)^{2}\leq 2x^{2}+2y^{2},

𝔼[supaxb|r^(x)r(x)|2]2𝔼[supaxb|r^(x)𝔼r^(x)|2]+2supaxb|𝔼r^(x)r(x)|2.\displaystyle\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{r}(x)-r(x)}|^{2}\right]\leq 2\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{r}(x)-\operatorname{\mathbb{E}}\hat{r}(x)}|^{2}\right]+2\sup_{a\leq x\leq b}|{\operatorname{\mathbb{E}}\hat{r}(x)-r(x)}|^{2}. (92)

For the first term on the right-hand side of (92), the Cauchy-Schwarz inequality gives

𝔼[supaxb|r^(x)𝔼r^(x)|2]\displaystyle\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{r}(x)-\operatorname{\mathbb{E}}\hat{r}(x)}|^{2}\right] 1(2π)2𝔼[{|ϕK(thn)||ϕε(t)||1ni=1nZiexp(itYj)𝔼[Zexp(itY)]|𝑑t}2]\displaystyle\leq\frac{1}{(2\pi)^{2}}\operatorname{\mathbb{E}}\left[\left\{\int_{-\infty}^{\infty}\frac{|\phi_{K}(th_{n})|}{|\phi_{\varepsilon}(t)|}\left\lvert\frac{1}{n}\sum_{i=1}^{n}Z_{i}\exp(\mathrm{i}tY_{j})-\operatorname{\mathbb{E}}[Z\exp(\mathrm{i}tY)]\right\rvert dt\right\}^{2}\right] (93)
1(2π)2{|ϕK(thn)||ϕε(t)|𝑑t}{1n|ϕK(thn)||ϕε(t)|𝑑t},\displaystyle\leq\frac{1}{(2\pi)^{2}}\left\{\int_{-\infty}^{\infty}\frac{|\phi_{K}(th_{n})|}{|\phi_{\varepsilon}(t)|}dt\right\}\left\{\frac{1}{n}\int_{-\infty}^{\infty}\frac{|\phi_{K}(th_{n})|}{|\phi_{\varepsilon}(t)|}dt\right\}, (94)

where we use the proof of Lemma 4 for the last inequality. Lemma 3 implies that this term converges to zero as nn\to\infty. Next, we consider the second term in (92). We have

𝔼[r^(x)]=1hn𝔼X,Z[ZK(xXhn)].\displaystyle\operatorname{\mathbb{E}}\left[\hat{r}(x)\right]=\frac{1}{h_{n}}\operatorname{\mathbb{E}}_{X,Z}\left[ZK\left(\frac{x-X}{h_{n}}\right)\right]. (95)

Thus we have supaxb|𝔼[r^(x)]r(x)|0\sup_{a\leq x\leq b}|\operatorname{\mathbb{E}}[\hat{r}(x)]-r(x)|\to 0.

\bullet Bound for 𝔼[supaxb|f^(x)f(x)|2]\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{f}^{\prime}(x)-f^{\prime}(x)}|^{2}\right]. By triangle inequality and the fact that (x+y)22x2+2y2(x+y)^{2}\leq 2x^{2}+2y^{2} for x,yx,y\in\mathbb{R}, we have

𝔼[supaxb|f^(x)f(x)|2]2𝔼[supaxb|f^(x)𝔼f^(x)|2]+2supaxb|𝔼f^(x)f(x)|2.\displaystyle\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{f}^{\prime}(x)-f^{\prime}(x)}|^{2}\right]\leq 2\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{f}^{\prime}(x)-\operatorname{\mathbb{E}}\hat{f}^{\prime}(x)}|^{2}\right]+2\sup_{a\leq x\leq b}|{\operatorname{\mathbb{E}}\hat{f}^{\prime}(x)-f^{\prime}(x)}|^{2}. (96)

For the first term on the right-hand side of (96), since exp(itx)/(x)=itexp(itx)\partial\exp(-\mathrm{i}tx)/(\partial x)=-\mathrm{i}t\exp(-\mathrm{i}tx) and |i|=|exp(itx)|=1|\mathrm{i}|=|\exp(-\mathrm{i}tx)|=1,

𝔼[supaxb|f^(x)𝔼f^(x)|2]\displaystyle\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{f}^{\prime}(x)-\operatorname{\mathbb{E}}\hat{f}^{\prime}(x)}|^{2}\right] =𝔼[supaxb|12πitexp(itx)ϕK(thn)ϕε(t){ϕ^Y(t)ϕY(t)}dt|2]\displaystyle=\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}\left\lvert\frac{1}{2\pi}\int_{-\infty}^{\infty}-\mathrm{i}t\exp(-\mathrm{i}tx)\frac{\phi_{K}(th_{n})}{\phi_{\varepsilon}(t)}\left\{\hat{\phi}_{Y}(t)-\phi_{Y}(t)\right\}dt\right\rvert^{2}\right] (97)
1(2π)2𝔼[{|tϕK(thn)||ϕε(t)||ϕ^Y(t)ϕY(t)|𝑑t}2].\displaystyle\leq\frac{1}{(2\pi)^{2}}\operatorname{\mathbb{E}}\left[\left\{\int_{-\infty}^{\infty}\frac{|t\phi_{K}(th_{n})|}{|\phi_{\varepsilon}(t)|}\left\lvert\hat{\phi}_{Y}(t)-\phi_{Y}(t)\right\rvert dt\right\}^{2}\right]. (98)

Thus, this converges to zero in the same way as (85). For the second term in (96), by integration by parts,

𝔼[f^(x)]\displaystyle\operatorname{\mathbb{E}}\left[\hat{f}^{\prime}(x)\right] =1hn2K(xyhn)f(y)𝑑y\displaystyle=\frac{1}{h_{n}^{2}}\int_{-\infty}^{\infty}K^{\prime}\left(\frac{x-y}{h_{n}}\right)f(y)dy (99)
=1hnK(xyhn)f(y)dy1hn[K(xyhn)f(y)].\displaystyle=\frac{1}{h_{n}}\int_{-\infty}^{\infty}K\left(\frac{x-y}{h_{n}}\right)f^{\prime}(y)dy-\frac{1}{h_{n}}\left[K\left(\frac{x-y}{h_{n}}\right)f(y)\right]_{-\infty}^{\infty}. (100)

Here, the second term is zero and the first term converges to f(x)f^{\prime}(x) uniformly.

\bullet Bound for 𝔼[supaxb|r^(x)r(x)|2]\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{r}^{\prime}(x)-r^{\prime}(x)}|^{2}\right]. By triangle inequality and the fact that (x+y)22x2+2y2(x+y)^{2}\leq 2x^{2}+2y^{2} for x,yx,y\in\mathbb{R}, we have

𝔼[supaxb|r^(x)r(x)|2]2𝔼[supaxb|r^(x)𝔼r^(x)|2]+2supaxb|𝔼r^(x)r(x)|2.\displaystyle\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{r}^{\prime}(x)-r^{\prime}(x)}|^{2}\right]\leq 2\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{r}^{\prime}(x)-\operatorname{\mathbb{E}}\hat{r}^{\prime}(x)}|^{2}\right]+2\sup_{a\leq x\leq b}|{\operatorname{\mathbb{E}}\hat{r}^{\prime}(x)-r^{\prime}(x)}|^{2}. (101)

For the first term on the right-hand side of (101), since |i|=|exp(itx)|=1|\mathrm{i}|=|\exp(-\mathrm{i}tx)|=1, we have

𝔼[supx|r^(x)𝔼r^(x)|2]\displaystyle\operatorname{\mathbb{E}}\left[\sup_{x}|{\hat{r}^{\prime}(x)-\operatorname{\mathbb{E}}\hat{r}^{\prime}(x)}|^{2}\right] (102)
=𝔼[supx|12πitexp(itx)ϕK(thn)ϕε(t){1ni=1nZiexp(itYi)𝔼[Zexp(itY)]}dt|2]\displaystyle=\operatorname{\mathbb{E}}\left[\sup_{x}\left\lvert\frac{1}{2\pi}\int_{-\infty}^{\infty}-\mathrm{i}t\exp(-\mathrm{i}tx)\frac{\phi_{K}(th_{n})}{\phi_{\varepsilon}(t)}\left\{\frac{1}{n}\sum_{i=1}^{n}Z_{i}\exp(\mathrm{i}tY_{i})-\operatorname{\mathbb{E}}[Z\exp(\mathrm{i}tY)]\right\}dt\right\rvert^{2}\right] (103)
1(2π)2𝔼[{|tϕK(thn)||ϕε(t)||1ni=1nZiexp(itYi)𝔼[Zexp(itY)]|dt}2].\leq\frac{1}{(2\pi)^{2}}\operatorname{\mathbb{E}}\left[\left\{\int_{-\infty}^{\infty}\frac{|t\phi_{K}(th_{n})|}{|\phi_{\varepsilon}(t)|}\left\lvert\frac{1}{n}\sum_{i=1}^{n}Z_{i}\exp(\mathrm{i}tY_{i})-\operatorname{\mathbb{E}}[Z\exp(\mathrm{i}tY)]\right\rvert dt\right\}^{2}\right]. (104)

Thus, this converges to zero in the same way as (92). For the second term in (101), by integration by parts,

𝔼[r^(x)]\displaystyle\operatorname{\mathbb{E}}\left[\hat{r}^{\prime}(x)\right] =1hn2zK(xyhn)f(y,z)𝑑y𝑑z\displaystyle=\frac{1}{h_{n}^{2}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}zK^{\prime}\left(\frac{x-y}{h_{n}}\right)f(y,z)dydz (105)
=1hnzK(xyhn)yf(y,z)dydz1hn[zK(xyhn)f(y,z)]dz.\displaystyle=\frac{1}{h_{n}}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}zK\left(\frac{x-y}{h_{n}}\right)\frac{\partial}{\partial y}f(y,z)dydz-\frac{1}{h_{n}}\int_{-\infty}^{\infty}\left[zK\left(\frac{x-y}{h_{n}}\right)f(y,z)\right]_{-\infty}^{\infty}dz. (106)

Here, the second term is zero and the first term converges to r(x)=(/x)zf(x,z)𝑑zr^{\prime}(x)=(\partial/\partial x)\int zf(x,z)dz uniformly.

\bullet Bound for 𝔼[supaxb|f^(x)|2]\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{f}(x)}|^{2}\right] and 𝔼[supaxb|r^(x)|2]\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{r}(x)}|^{2}\right]. By triangle inequality and the fact that (x+y)22x2+2y2(x+y)^{2}\leq 2x^{2}+2y^{2},

𝔼[supaxb|f^(x)|2]\displaystyle\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{f}(x)}|^{2}\right] 2supaxb|f(x)|2+2𝔼[supaxb|f^(x)f(x)|2].\displaystyle\leq 2\sup_{a\leq x\leq b}|f(x)|^{2}+2\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|\hat{f}(x)-f(x)|^{2}\right]. (107)

Since we have already shown 𝔼[supaxb|f^(x)f(x)|2]=o(1)\operatorname{\mathbb{E}}[{\sup_{a\leq x\leq b}|\hat{f}(x)-f(x)|^{2}]=o(1)}, 𝔼[supaxb|f^(x)|2]\operatorname{\mathbb{E}}[{\sup_{a\leq x\leq b}|{\hat{f}(x)}|^{2}}] is asymptotically bounded by a constant. Similarly, we can show that 𝔼[supaxb|r^(x)|2]\operatorname{\mathbb{E}}\left[\sup_{a\leq x\leq b}|{\hat{r}(x)}|^{2}\right] is asymptotically bounded by a constant. Combining these results, we conclude that the first term of (79) is o(1)o(1).

Next, we consider the second term of (79). Since f^(x)\hat{f}(x) is asymptotically bounded uniformly on [a,b][a,b] by the results above, we have only to show that supaxb|f^(x)2f(x)2|=o(1)\sup_{a\leq x\leq b}|\hat{f}(x)^{2}-f(x)^{2}|=o(1). This holds since

supaxb|f^(x)2f(x)2|supaxb|f^(x)+f(x)|supaxb|f^(x)f(x)|=op(1).\displaystyle\sup_{a\leq x\leq b}|\hat{f}(x)^{2}-f(x)^{2}|\leq\sup_{a\leq x\leq b}|\hat{f}(x)+f(x)|\sup_{a\leq x\leq b}|\hat{f}(x)-f(x)|=o_{\mathrm{p}}(1). (108)

This concludes that supaxb|m^(x)m(x)|p0\sup_{a\leq x\leq b}|\hat{m}^{\prime}(x)-m^{\prime}(x)|\overset{\mathrm{p}}{\to}0 as nn\to\infty. ∎

Appendix D Proofs of the Results

For a convex function f:f:\mathbb{R}\to\mathbb{R} and a constant γ>0\gamma>0, define the proximal operator proxγf:\mathrm{prox}_{\gamma f}:\mathbb{R}\to\mathbb{R} as

proxγf(x)=argminz{γf(z)+12(xz)2}.\displaystyle\mathrm{prox}_{\gamma f}(x)=\operatornamewithlimits{argmin}_{z\in\mathbb{R}}\left\{\gamma f(z)+\frac{1}{2}(x-z)^{2}\right\}. (109)
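For instance, for the quadratic function f(t)=t^{2}/2, the first-order condition \gamma f^{\prime}(z)-(x-z)=0 of (109) gives the closed form \mathrm{prox}_{\gamma f}(x)=x/(1+\gamma); more generally, the same condition reads \mathrm{prox}_{\gamma f}(x)=x-\gamma f^{\prime}(\mathrm{prox}_{\gamma f}(x)), which is the identity invoked in the proof of Proposition 1 below.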

D.1. Proof of Master Theorem

First, we define the notation used in the proof. We consider an invertible matrix 𝑳p×p\bm{L}\in\mathbb{R}^{p\times p} that satisfies 𝚺=𝑳𝑳\bm{\Sigma}=\bm{L}\bm{L}^{\top}. Define, for each i{1,,n}i\in\{1,\ldots,n\},

𝑿~i=𝑳1𝑿i,𝜼=𝑳𝜷,𝜼^=𝑳𝜷¯.\displaystyle\tilde{\bm{X}}_{i}=\bm{L}^{-1}\bm{X}_{i},\quad\bm{\eta}=\bm{L}^{\top}\bm{\beta},\quad\hat{\bm{\eta}}=\bm{L}^{\top}\bar{\bm{\beta}}. (110)
Proof of Theorem 4.

We consider the following three steps.

Step 1: Reduction to standard Gaussian features. Note that the single-index model yi=g(𝑿i𝜷)+εiy_{i}=g(\bm{X}_{i}^{\top}\bm{\beta})+\varepsilon_{i} is equivalent to yi=g(𝑿~i𝜼)+εiy_{i}=g(\tilde{\bm{X}}_{i}^{\top}\bm{\eta})+\varepsilon_{i}. Since 𝑿i𝒃=𝑿~i(𝑳𝒃)\bm{X}_{i}^{\top}\bm{b}=\tilde{\bm{X}}_{i}^{\top}(\bm{L}^{\top}\bm{b}), we have ¯(𝑿i𝒃;yi)=¯(𝑿~i𝑳𝒃;yi)\bar{\ell}(\bm{X}_{i}^{\top}\bm{b};y_{i})=\bar{\ell}(\tilde{\bm{X}}_{i}^{\top}\bm{L}^{\top}\bm{b};y_{i}). Hence, 𝜼^argmin𝒃~pi=1n¯(𝑿~i𝒃~;yi)\hat{\bm{\eta}}\in\operatornamewithlimits{argmin}_{\tilde{\bm{b}}\in\mathbb{R}^{p}}\sum_{i=1}^{n}\bar{\ell}(\tilde{\bm{X}}_{i}^{\top}\tilde{\bm{b}};y_{i}) is the estimator corresponding to the true parameter 𝜼p\bm{\eta}\in\mathbb{R}^{p} and features 𝑿~i𝒩p(𝟎,𝑰p)\tilde{\bm{X}}_{i}\sim\mathcal{N}_{p}(\bm{0},\bm{I}_{p}).

We can choose 𝚺=𝑳𝑳\bm{\Sigma}=\bm{L}\bm{L}^{\top} to be a Cholesky factorization so that ηp=τpβp\eta_{p}=\tau_{p}\beta_{p} and η^p=τpβ¯p\hat{\eta}_{p}=\tau_{p}\bar{\beta}_{p} with τp=(𝚺1)pp1/2\tau_{p}=(\bm{\Sigma}^{-1})_{pp}^{-1/2} by (110). This follows from the fact that Lpp=τpL_{pp}=\tau_{p} since τp2=Var(Xip𝑿ip)=Var(Xip𝑿~ip)\tau_{p}^{2}=\mathrm{Var}(X_{ip}\mid\bm{X}_{i\setminus p})=\mathrm{Var}(X_{ip}\mid\tilde{\bm{X}}_{i\setminus p}), where 𝑿ipp1\bm{X}_{i\setminus p}\in\mathbb{R}^{p-1} denotes the vector 𝑿i\bm{X}_{i} without ppth coordinate. Since we can generalize this to any coordinate by permutation, we obtain

τjβ¯jμ𝜷¯βjσ𝜷¯=η^jμ𝜷¯ηjσ𝜷¯,\displaystyle\tau_{j}\frac{\bar{\beta}_{j}-\mu_{\bar{\bm{\beta}}}\beta_{j}}{\sigma_{\bar{\bm{\beta}}}}=\frac{\hat{\eta}_{j}-\mu_{\bar{\bm{\beta}}}\eta_{j}}{\sigma_{\bar{\bm{\beta}}}}, (111)

for each j{1,,p}j\in\{1,\ldots,p\} and any pair (μ𝜷¯,σ𝜷¯)(\mu_{\bar{\bm{\beta}}},\sigma_{\bar{\bm{\beta}}}).
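The identity L_{pp}=\tau_{p} for the Cholesky factor can also be checked numerically; the following small snippet (illustrative only, not part of the paper) verifies it for a generic positive-definite covariance matrix.

```python
import numpy as np

# For the Cholesky factorization Sigma = L L^T with lower-triangular L,
# the last diagonal entry satisfies L_pp = (Sigma^{-1})_{pp}^{-1/2}.
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5.0 * np.eye(5)            # a generic positive-definite covariance
L = np.linalg.cholesky(Sigma)                # lower triangular, Sigma = L @ L.T
tau_p = 1.0 / np.sqrt(np.linalg.inv(Sigma)[-1, -1])
print(np.isclose(L[-1, -1], tau_p))          # expected: True
```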

Step 2: Reduction to uniform distribution on sphere. Define an orthogonal projection matrix 𝑷𝜼=𝜼𝜼/𝜼2\bm{P}_{\bm{\eta}}=\bm{\eta}\bm{\eta}^{\top}/\left\lVert\bm{\eta}\right\rVert^{2} onto 𝜼\bm{\eta}, and an orthogonal projection matrix 𝑷𝜼=𝑰p𝑷𝜼\bm{P}_{\bm{\eta}}^{\perp}=\bm{I}_{p}-\bm{P}_{\bm{\eta}} onto the orthogonal complement of 𝜼\bm{\eta}. Let 𝑼p×p\bm{U}\in\mathbb{R}^{p\times p} be any orthogonal matrix obeying 𝑼𝜼=𝜼\bm{U}\bm{\eta}=\bm{\eta}, namely, any rotation operator about 𝜼\bm{\eta}. Then, since 𝜼^=𝑷𝜼𝜼^+𝑷𝜼𝜼^\hat{\bm{\eta}}=\bm{P}_{\bm{\eta}}\hat{\bm{\eta}}+\bm{P}_{\bm{\eta}}^{\perp}\hat{\bm{\eta}}, we have

𝑼𝜼^=𝑼𝑷𝜼𝜼^+𝑼𝑷𝜼𝜼^=𝑷𝜼𝜼^+𝑼𝑷𝜼𝜼^.\displaystyle\bm{U}\hat{\bm{\eta}}=\bm{U}\bm{P}_{\bm{\eta}}\hat{\bm{\eta}}+\bm{U}\bm{P}_{\bm{\eta}}^{\perp}\hat{\bm{\eta}}=\bm{P}_{\bm{\eta}}\hat{\bm{\eta}}+\bm{U}\bm{P}_{\bm{\eta}}^{\perp}\hat{\bm{\eta}}. (112)

Using this, we obtain

𝑼𝑷𝜼𝜼^𝑷𝜼𝜼^=d𝐏𝜼𝜼^𝐏𝜼𝜼^=𝜼^μ𝜷¯𝜼σ𝜷¯,\displaystyle\frac{\bm{U}\bm{P}_{\bm{\eta}}^{\perp}\hat{\bm{\eta}}}{\|\bm{P}_{\bm{\eta}}^{\perp}\hat{\bm{\eta}}\|}\overset{\rm d}{=}\frac{\bm{P}_{\bm{\eta}}^{\perp}\hat{\bm{\eta}}}{\|\bm{P}_{\bm{\eta}}^{\perp}\hat{\bm{\eta}}\|}=\frac{\hat{\bm{\eta}}-\mu_{\bar{\bm{\beta}}}\bm{\eta}}{\sigma_{\bar{\bm{\beta}}}}, (113)

where the first identity follows from the fact that 𝑼𝜼^=d𝜼^\bm{U}\hat{\bm{\eta}}\overset{\rm d}{=}\hat{\bm{\eta}} since 𝑼𝜼^\bm{U}\hat{\bm{\eta}} is the estimator with a true coefficient 𝑼𝜼=𝜼\bm{U}\bm{\eta}=\bm{\eta} and features drawn iid from 𝒩p(𝟎,𝑰p)\mathcal{N}_{p}(\bm{0},\bm{I}_{p}), by ¯(𝑿~i𝒃~;yi)=¯((𝑼𝑿~i)𝑼𝒃~;yi)\bar{\ell}(\tilde{\bm{X}}_{i}^{\top}\tilde{\bm{b}};y_{i})=\bar{\ell}((\bm{U}^{\top}\tilde{\bm{X}}_{i})^{\top}\bm{U}\tilde{\bm{b}};y_{i}) and 𝑼𝑿~i=d𝐗~i\bm{U}^{\top}\tilde{\bm{X}}_{i}\overset{\rm d}{=}\tilde{\bm{X}}_{i}. (113) reveals that (𝜼^μ𝜷¯𝜼)/σ𝜷¯(\hat{\bm{\eta}}-\mu_{\bar{\bm{\beta}}}\bm{\eta})/\sigma_{\bar{\bm{\beta}}} is rotationally invariant about 𝜼\bm{\eta}, lies in 𝜼\bm{\eta}^{\perp}, and has a unit norm. This means (𝜼^μ𝜷¯𝜼)/σ𝜷¯(\hat{\bm{\eta}}-\mu_{\bar{\bm{\beta}}}\bm{\eta})/\sigma_{\bar{\bm{\beta}}} is uniformly distributed on the unit sphere lying in 𝜼\bm{\eta}^{\perp}.

Step 3: Deriving asymptotic normality. The result of the previous step gives us

𝜼^μ𝜷¯𝜼σ𝜷¯=d𝐏𝜼𝐙𝐏𝜼𝐙,\displaystyle\frac{\hat{\bm{\eta}}-\mu_{\bar{\bm{\beta}}}\bm{\eta}}{\sigma_{\bar{\bm{\beta}}}}\overset{\rm d}{=}\frac{\bm{P}_{\bm{\eta}}^{\perp}\bm{Z}}{\|\bm{P}_{\bm{\eta}}^{\perp}\bm{Z}\|}, (114)

where 𝒁𝒩p(𝟎,𝑰p)\bm{Z}\sim\mathcal{N}_{p}(\bm{0},\bm{I}_{p}). Triangle inequalities yield that

𝒁p|𝜼𝒁|p𝜼𝑷𝜼𝒁p𝒁p+|𝜼𝒁|p𝜼.\displaystyle\frac{\left\lVert\bm{Z}\right\rVert}{\sqrt{p}}-\frac{|\bm{\eta}^{\top}\bm{Z}|}{\sqrt{p}\left\lVert\bm{\eta}\right\rVert}\leq\frac{\|\bm{P}_{\bm{\eta}}^{\perp}\bm{Z}\|}{\sqrt{p}}\leq\frac{\left\lVert\bm{Z}\right\rVert}{\sqrt{p}}+\frac{|\bm{\eta}^{\top}\bm{Z}|}{\sqrt{p}\left\lVert\bm{\eta}\right\rVert}. (115)

Since |𝜼𝒁|/(p𝜼)a.s.0|\bm{\eta}^{\top}\bm{Z}|/(\sqrt{p}\left\lVert\bm{\eta}\right\rVert)\overset{\mathrm{a.s.}}{\longrightarrow}0 and 𝒁/pa.s.1\left\lVert\bm{Z}\right\rVert/\sqrt{p}\overset{\mathrm{a.s.}}{\longrightarrow}1, we obtain 𝑷𝜼𝒁/pa.s.1\|\bm{P}_{\bm{\eta}}^{\perp}\bm{Z}\|/\sqrt{p}\overset{\mathrm{a.s.}}{\longrightarrow}1. Therefore, this fact and (114) imply that

pη^jμ𝜷¯ηjσ𝜷¯=dσˇjQ+op(1),σˇj2=1ηj2𝜼2,\displaystyle\sqrt{p}\frac{\hat{\eta}_{j}-\mu_{\bar{\bm{\beta}}}\eta_{j}}{\sigma_{\bar{\bm{\beta}}}}\overset{\rm d}{=}\check{\sigma}_{j}Q+o_{\mathrm{p}}(1),\quad\check{\sigma}_{j}^{2}=1-\frac{\eta_{j}^{2}}{\left\lVert\bm{\eta}\right\rVert^{2}}, (116)

where Q𝒩(0,1)Q\sim\mathcal{N}(0,1). Here we use the fact that the covariance matrix of 𝑷𝜼𝒁\bm{P}_{\bm{\eta}}^{\perp}\bm{Z} is 𝑷𝜼𝑷𝜼=𝑰p𝜼𝜼/𝜼2\bm{P}_{\bm{\eta}}^{\perp}\bm{P}_{\bm{\eta}}^{\perp}=\bm{I}_{p}-\bm{\eta}\bm{\eta}^{\top}/\|\bm{\eta}\|^{2}. Assumptions ηj=o(1)\eta_{j}=o(1) and 𝜼=1\|\bm{\eta}\|=1 complete the proof. ∎

D.2. Proof of Theorem 1

Let ς2\varsigma^{2} be the ratio σ12/μ12\sigma_{1}^{2}/\mu_{1}^{2}, where μ12\mu_{1}^{2} and σ12\sigma_{1}^{2} are true inferential parameters of the pilot estimator 𝜷~\tilde{\bm{\beta}}.

Proof of Proposition 1.

First, define γ1=tr(𝚺(𝑿(1)𝑿(1))1)\gamma_{1}=\mathrm{tr}(\bm{\Sigma}({\bm{X}^{(1)}}^{\top}{\bm{X}^{(1)}})^{-1}). We also define

μ1=𝜷𝚺𝜷~,σ12=𝜷~𝚺𝜷~μ12.\displaystyle\mu_{1}=\bm{\beta}^{\top}\bm{\Sigma}\tilde{\bm{\beta}},\quad\sigma_{1}^{2}=\tilde{\bm{\beta}}^{\top}\bm{\Sigma}\tilde{\bm{\beta}}-\mu_{1}^{2}. (117)

Let r~2\tilde{r}^{2} be the mean squared error n11𝒚(1)𝑿(1)𝜷~2n_{1}^{-1}\|{\bm{y}^{(1)}-\bm{X}^{(1)}\tilde{\bm{\beta}}\|}^{2}. Since 𝜷~\tilde{\bm{\beta}} is a ridge estimator, Theorem 4.3 in Bellec (2022) implies,

max1in1𝔼[r~2|𝜷~𝑿i(1)proxγ1f(μ1𝜷𝑿i(1)+σ1Zi)|2]Cn1,\displaystyle\max_{1\leq i\leq n_{1}}\operatorname{\mathbb{E}}\left[\tilde{r}^{-2}\left\lvert\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}-\mathrm{prox}_{\gamma_{1}f}\left(\mu_{1}\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}}+\sigma_{1}Z_{i}\right)\right\rvert^{2}\right]\leq\frac{C}{n_{1}}, (118)

with f(t)=t2/2f(t)=t^{2}/2. Since ridge regression satisfies 𝜷~Cλ\|\tilde{\bm{\beta}}\|\leq C_{\lambda} with a constant Cλ>0C_{\lambda}>0 depending on the regularization parameter λ>0\lambda>0 by the KKT condition, we have r~2=Op(1)\tilde{r}^{2}=O_{\mathrm{p}}(1). Hence,

|𝜷~𝑿i(1)proxγ1f(μ1𝜷𝑿i(1)+σ1Zi)|p0,\displaystyle\left\lvert\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}-\mathrm{prox}_{\gamma_{1}f}\left(\mu_{1}\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}}+\sigma_{1}Z_{i}\right)\right\rvert\overset{\mathrm{p}}{\to}0, (119)

as n1n_{1}\to\infty for each i[n1]i\in[n_{1}]. By using the fact that proxγf(a)=aγf(proxγf(a))\mathrm{prox}_{\gamma f}(a)=a-\gamma f^{\prime}(\mathrm{prox}_{\gamma f}(a)) for aa\in\mathbb{R}, γ>0\gamma>0, and f:f:\mathbb{R}\to\mathbb{R} by the definition of the proximal operator, we obtain

|𝜷~𝑿i(1)+γ1(yi(1)𝜷~𝑿i(1))μ1𝜷𝑿i(1)σ1Zi|p0,\displaystyle\left\lvert\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}+\gamma_{1}\left({y_{i}^{(1)}}-\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}\right)-\mu_{1}\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}}-\sigma_{1}Z_{i}\right\rvert\overset{\mathrm{p}}{\to}0, (120)

as n1n_{1}\to\infty. Next, we consider replacing (μ1,σ1,γ1)(\mu_{1},\sigma_{1},\gamma_{1}) with the observable adjustments (μ~,σ~,γ~)(\tilde{\mu},\tilde{\sigma},\tilde{\gamma}). Theorem 4.4 in Bellec (2022) gives their consistency:

𝔼[v~|γ~γ1|]C1n1/2,\displaystyle\operatorname{\mathbb{E}}\left[\tilde{v}\left\lvert\tilde{\gamma}-\gamma_{1}\right\rvert\right]\leq C_{1}n^{-1/2}, (121)
𝔼[v~2t~2r~2(|μ~2μ12|+|σ~2σ12|)]C2n1/2,\displaystyle\operatorname{\mathbb{E}}\left[\tilde{v}^{2}\tilde{t}^{2}\tilde{r}^{-2}\left(\left\lvert\tilde{\mu}^{2}-\mu_{1}^{2}\right\rvert+\left\lvert\tilde{\sigma}^{2}-\sigma_{1}^{2}\right\rvert\right)\right]\leq C_{2}n^{-1/2}, (122)

where we define t~2=(λ𝚺1/2+v~𝚺1/2)𝜷~2κ1r~2\tilde{t}^{2}=\|(\lambda\bm{\Sigma}^{-1/2}+\tilde{v}\bm{\Sigma}^{1/2})\tilde{\bm{\beta}}\|^{2}-\kappa_{1}\tilde{r}^{2} and C1,C2C_{1},C_{2} are positive constants. Proposition 3.1 in Bellec (2022) implies that v~1/(1+c¯)4c¯/n1\tilde{v}\geq 1/(1+\bar{c})-4\bar{c}/n_{1} for a constant c¯>0\bar{c}>0. Also, Theorem 4.4 in Bellec (2022) implies that t~2p(𝜷(v~𝚺+λ)𝜷~)2\tilde{t}^{2}\overset{\mathrm{p}}{\to}(\bm{\beta}^{\top}(\tilde{v}\bm{\Sigma}+\lambda)\tilde{\bm{\beta}})^{2}. By using these results, we have γ~pγ1\tilde{\gamma}\overset{\mathrm{p}}{\to}\gamma_{1}, μ~pμ1\tilde{\mu}\overset{\mathrm{p}}{\to}\mu_{1}, and σ~2pσ12\tilde{\sigma}^{2}\overset{\mathrm{p}}{\to}\sigma_{1}^{2} as n1n_{1}\to\infty since the sign of μ1\mu_{1} is specified by an assumption. Then, triangle inequality implies

|𝜷~𝑿i(1)+γ~(yi(1)𝜷~𝑿i(1))μ~𝜷𝑿i(1)σ~Zi|\displaystyle\left\lvert\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}+\tilde{\gamma}\left(y_{i}^{(1)}-\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}\right)-\tilde{\mu}\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}}-\tilde{\sigma}Z_{i}\right\rvert (123)
|𝜷~𝑿i(1)+γ1(yi(1)𝜷~𝑿i(1))μ1𝜷𝑿i(1)σ1Zi|\displaystyle\leq\left\lvert\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}+\gamma_{1}\left(y_{i}^{(1)}-\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}\right)-\mu_{1}\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}}-\sigma_{1}Z_{i}\right\rvert (124)
+|(γ1γ~)(yi(1)𝜷~𝑿i(1))|+|(μ1μ~)𝜷~𝑿i(1)|+|(σ1σ~)Zi|,\displaystyle+\left\lvert(\gamma_{1}-\tilde{\gamma})\left(y_{i}^{(1)}-\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}\right)\right\rvert+\left\lvert(\mu_{1}-\tilde{\mu})\tilde{\bm{\beta}}^{\top}{\bm{X}_{i}^{(1)}}\right\rvert+\left\lvert(\sigma_{1}-\tilde{\sigma})Z_{i}\right\rvert, (125)

which converges in probability to zero. ∎

To prove Theorem 1, we first introduce an approximation g~()\tilde{g}(\cdot) for the link estimator g^()\hat{g}(\cdot) defined in (8).

Lemma 5.

For i=1,,n1i=1,\ldots,n_{1}, define Wi~=𝛃𝐗i(1)+ςZi\tilde{W_{i}}=\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}}+\varsigma Z_{i} with Ziind𝒩(0,1)Z_{i}\overset{\rm ind}{\sim}\mathcal{N}(0,1) and ς=σ1/|μ1|\varsigma=\sigma_{1}/|\mu_{1}|. We also define

g~(x):=i=1n1yi(1)exp(t2ς2/(2hn2)it(xWi~)/hn)ϕK(t)𝑑ti=1n1exp(t2ς2/(2hn2)it(xWi~)/hn)ϕK(t)𝑑t.\displaystyle\tilde{g}(x):=\frac{\displaystyle\sum_{i=1}^{n_{1}}y_{i}^{(1)}\int_{-\infty}^{\infty}\exp\left(t^{2}\varsigma^{2}/(2h_{n}^{2})-\mathrm{i}t(x-\tilde{W_{i}})/h_{n}\right)\phi_{K}(t)dt}{\displaystyle\sum_{i=1}^{n_{1}}\int_{-\infty}^{\infty}\exp\left(t^{2}\varsigma^{2}/(2h_{n}^{2})-\mathrm{i}t(x-\tilde{W_{i}})/h_{n}\right)\phi_{K}(t)dt}. (126)

Then, under the setting of Theorem 1, we have, as n1n_{1}\to\infty,

supaxb|g˘(x)g~(x)|=Op(1(logn1)m/2).\displaystyle\sup_{a\leq x\leq b}\left\lvert\breve{g}(x)-\tilde{g}(x)\right\rvert=O_{\mathrm{p}}\left(\frac{1}{(\log n_{1})^{m/2}}\right). (127)
Proof of Lemma 5.

We rewrite the kernel function for deconvolution in (8) as

Kn(x)=12πexp(itx)ϕK(t)exp(t2ς~2/(2hn2))dt,\displaystyle K_{n}(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty}\exp(-\mathrm{i}tx)\frac{\phi_{K}(t)}{\exp(-t^{2}\tilde{\varsigma}^{2}/(2h_{n}^{2}))}dt, (128)

and also introduce an approximated version of the kernel function as

K~n(x)=12πexp(itx)ϕK(t)exp(t2ς2/(2hn2))𝑑t.\displaystyle\tilde{K}_{n}(x)=\frac{1}{2\pi}\int_{-\infty}^{\infty}\exp(-\mathrm{i}tx)\frac{\phi_{K}(t)}{\exp(-t^{2}{\varsigma}^{2}/(2h_{n}^{2}))}dt. (129)

The difference here is that the estimated parameter ς~\tilde{\varsigma} is replaced by the true value ς\varsigma. We also define ϕς(t)=exp(t2ς2/2)\phi_{\varsigma}(t)=\exp(-t^{2}\varsigma^{2}/2).

At first, we have

|g˘(x)g~(x)|\displaystyle\left\lvert\breve{g}(x)-\tilde{g}(x)\right\rvert =|i=1n1yi(1)Kn(Wixhn)i=1n1Kn(Wixhn)i=1n1yi(1)K~n(W~ixhn)i=1n1K~n(W~ixhn)|\displaystyle=\left\lvert\frac{\sum_{i=1}^{n_{1}}y_{i}^{(1)}K_{n}\left(\frac{W_{i}-x}{h_{n}}\right)}{\sum_{i=1}^{n_{1}}K_{n}\left(\frac{W_{i}-x}{h_{n}}\right)}-\frac{\sum_{i=1}^{n_{1}}y_{i}^{(1)}\tilde{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right)}{\sum_{i=1}^{n_{1}}\tilde{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right)}\right\rvert (130)
C1,nC2,n|1n1hni=1n1yi(1){Kn(Wixhn)K~n(W~ixhn)}|\displaystyle\leq C_{1,n}C_{2,n}\left\lvert\frac{1}{n_{1}h_{n}}\sum_{i=1}^{n_{1}}y_{i}^{(1)}\left\{K_{n}\left(\frac{W_{i}-x}{h_{n}}\right)-\tilde{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right)\right\}\right\rvert (131)
+C1,nC3,n1n1hn|i=1n1K~n(W~ixhn)Kn(Wixhn)|\displaystyle\quad+C_{1,n}C_{3,n}\frac{1}{n_{1}h_{n}}\left\lvert\sum_{i=1}^{n_{1}}\tilde{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right)-{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)\right\rvert (132)
C1,nC2,n|1n1hni=1n1yi(1){K~n(W~ixhn)K~n(Wixhn)}|\displaystyle\leq C_{1,n}C_{2,n}\left\lvert\frac{1}{n_{1}h_{n}}\sum_{i=1}^{n_{1}}y_{i}^{(1)}\left\{\tilde{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right)-\tilde{K}_{n}\left(\frac{W_{i}-x}{h_{n}}\right)\right\}\right\rvert (133)
+C1,nC2,n|1n1hni=1n1yi(1){K~n(Wixhn)Kn(Wixhn)}|\displaystyle\quad+C_{1,n}C_{2,n}\left\lvert\frac{1}{n_{1}h_{n}}\sum_{i=1}^{n_{1}}y_{i}^{(1)}\left\{\tilde{K}_{n}\left(\frac{W_{i}-x}{h_{n}}\right)-{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)\right\}\right\rvert (134)
+C1,nC3,n1n1hn|i=1n1K~n(W~ixhn)K~n(Wixhn)|\displaystyle\quad+C_{1,n}C_{3,n}\frac{1}{n_{1}h_{n}}\left\lvert\sum_{i=1}^{n_{1}}\tilde{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right)-\tilde{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)\right\rvert (135)
+C1,nC3,n1n1hn|i=1n1K~n(Wixhn)Kn(Wixhn)|,\displaystyle\quad+C_{1,n}C_{3,n}\frac{1}{n_{1}h_{n}}\left\lvert\sum_{i=1}^{n_{1}}\tilde{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)-{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)\right\rvert, (136)

where C1,n=|n2hn2(i=1n1K~n(W~ixhn))1(i=1n1Kn(Wixhn))1|C_{1,n}=\left\lvert n^{2}h_{n}^{2}(\sum_{i=1}^{n_{1}}\tilde{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right))^{-1}(\sum_{i=1}^{n_{1}}K_{n}\left(\frac{W_{i}-x}{h_{n}}\right))^{-1}\right\rvert, C2,n=|1n1hni=1n1K~n(W~ixhn)|C_{2,n}=\left\lvert\frac{1}{n_{1}h_{n}}\sum_{i=1}^{n_{1}}\tilde{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right)\right\rvert, and C3,n=|1n1hni=1n1yi(1)K~n(W~ixhn)|C_{3,n}=\left\lvert\frac{1}{n_{1}h_{n}}\sum_{i=1}^{n_{1}}y_{i}^{(1)}\tilde{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right)\right\rvert. Here, C1,n,C2,n,C_{1,n},C_{2,n}, and C3,nC_{3,n} converge to positive constants by the consistency of the deconvoluted kernel density estimator. We proceed to bound each term on the right-hand side. First, we bound (136). (|t|e1/2)(|t|e^{-1}/\sqrt{2})-Lipschitz continuity of ϕς(t)\phi_{\varsigma}(t) with respect to ς\varsigma yields

1n1hn|i=1n1K~n(Wixhn)Kn(Wixhn)|\displaystyle\frac{1}{n_{1}h_{n}}\left\lvert\sum_{i=1}^{n_{1}}\tilde{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)-{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)\right\rvert (137)
=|12πn1hni=1n1M0M0exp(itWixhn)ϕK(t)ϕς(t/hn)ϕς~(t/hn){ϕς(t/hn)ϕς~(t/hn)}𝑑t|\displaystyle=\left\lvert\frac{1}{2\pi n_{1}h_{n}}\sum_{i=1}^{n_{1}}\int_{-M_{0}}^{M_{0}}\exp\left(-\mathrm{i}t\frac{W_{i}-x}{h_{n}}\right)\frac{\phi_{K}(t)}{\phi_{\varsigma}(t/h_{n})\phi_{\tilde{\varsigma}}(t/h_{n})}\left\{\phi_{\varsigma}(t/h_{n})-\phi_{\tilde{\varsigma}}(t/h_{n})\right\}dt\right\rvert (138)
12eπhn2|ςς~|0M0|tϕK(t)|exp(t2(ς2+ς~2)2hn2)𝑑t.\displaystyle\leq\frac{1}{\sqrt{2}e\pi h_{n}^{2}}\left\lvert\varsigma-\tilde{\varsigma}\right\rvert\int_{0}^{M_{0}}|t\phi_{K}(t)|\exp\left(\frac{t^{2}(\varsigma^{2}+\tilde{\varsigma}^{2})}{2h_{n}^{2}}\right)dt. (139)

Theorem 4.4 in Bellec (2022) implies that |ςς~|=Op(n11/2)|\varsigma-\tilde{\varsigma}|=O_{\mathrm{p}}(n_{1}^{-1/2}) since

|ςς~|(ς+ς~)=|ς2ς~2|1μ~2μ12{μ12|σ~2σ12|+σ12|μ12μ~2|}.\displaystyle\left\lvert\varsigma-\tilde{\varsigma}\right\rvert(\varsigma+\tilde{\varsigma})=\left\lvert\varsigma^{2}-\tilde{\varsigma}^{2}\right\rvert\leq\frac{1}{\tilde{\mu}^{2}\mu_{1}^{2}}\left\{\mu_{1}^{2}\left\lvert\tilde{\sigma}^{2}-\sigma_{1}^{2}\right\rvert+\sigma_{1}^{2}\left\lvert\mu_{1}^{2}-\tilde{\mu}^{2}\right\rvert\right\}. (140)

Hence, as we choose hn=(chlogn1)1/2h_{n}=(c_{h}\log n_{1})^{-1/2} such that M02(ς2+ς~2)ch/2+c1/2M_{0}^{2}(\varsigma^{2}+\tilde{\varsigma}^{2})c_{h}/2+c\leq 1/2 for some c>0c>0, we obtain

1n1hn|i=1n1K~n(Wixhn)Kn(Wixhn)|=Op((logn1)n1c).\displaystyle\frac{1}{n_{1}h_{n}}\left\lvert\sum_{i=1}^{n_{1}}\tilde{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)-{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)\right\rvert=O_{\mathrm{p}}\left((\log n_{1})n_{1}^{-c}\right). (141)

Next, we bound (135). For any x,xx,x^{\prime}\in\mathbb{R}, we have

|eitxeitx|=({cos(tx)cos(tx)}2+{sin(tx)sin(tx)}2)1/22t|xx|,\displaystyle|e^{-\mathrm{i}tx}-e^{-\mathrm{i}tx^{\prime}}|=\left(\left\{\cos(-tx)-\cos(-tx^{\prime})\right\}^{2}+\left\{\sin(-tx)-\sin(-tx^{\prime})\right\}^{2}\right)^{1/2}\leq\sqrt{2}t|x-x^{\prime}|, (142)

where the last inequality follows from 1-Lipschitz continuity of cos()\cos(\cdot) and sin()\sin(\cdot). Since ϕK()\phi_{K}(\cdot) is supported on [M0,M0][-M_{0},M_{0}], we have

1n1hn|i=1n1K~n(W~ixhn)K~n(Wixhn)|\displaystyle\frac{1}{n_{1}h_{n}}\left\lvert\sum_{i=1}^{n_{1}}\tilde{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right)-\tilde{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)\right\rvert (143)
=|12πn1hni=1n1M0M0ϕK(t)ϕς(t/hn){exp(it(W~ix)hn)exp(it(Wix)hn)}𝑑t|\displaystyle=\left\lvert\frac{1}{2\pi n_{1}h_{n}}\sum_{i=1}^{n_{1}}\int_{-M_{0}}^{M_{0}}\frac{\phi_{K}(t)}{\phi_{\varsigma}(t/h_{n})}\left\{\exp\left(-\mathrm{i}t\frac{(\tilde{W}_{i}-x)}{h_{n}}\right)-\exp\left(-\mathrm{i}t\frac{(W_{i}-x)}{h_{n}}\right)\right\}dt\right\rvert (144)
2πn1hn2i=1n1|W~iWi|0M0|tϕK(t)|exp(t2ς22hn2)𝑑t.\displaystyle\leq\frac{\sqrt{2}}{\pi n_{1}h_{n}^{2}}\sum_{i=1}^{n_{1}}\left\lvert\tilde{W}_{i}-W_{i}\right\rvert\int_{0}^{M_{0}}|t\phi_{K}(t)|\exp\left(\frac{t^{2}\varsigma^{2}}{2h_{n}^{2}}\right)dt. (145)

Here, we can use the fact that, by the triangle inequality,

|W~iWi||ςς~|Zi+|Wi𝜷𝑿i(1)ς~Zi|=Op(n11/2),\displaystyle|\tilde{W}_{i}-W_{i}|\leq|\varsigma-\tilde{\varsigma}|Z_{i}+|W_{i}-\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}}-\tilde{\varsigma}Z_{i}|=O_{\mathrm{p}}(n_{1}^{-1/2}), (146)

where the equality follows from (140) and (118). Thus, since we choose h_{n}=(c_{h}\log n_{1})^{-1/2} such that M_{0}^{2}\varsigma^{2}c_{h}/2+c\leq 1/2 for some c>0, we have

1n1hn|i=1n1K~n(W~ixhn)K~n(Wixhn)|\displaystyle\frac{1}{n_{1}h_{n}}\left\lvert\sum_{i=1}^{n_{1}}\tilde{K}_{n}\left(\frac{\tilde{W}_{i}-x}{h_{n}}\right)-\tilde{K}_{n}\left(\frac{{W}_{i}-x}{h_{n}}\right)\right\rvert =Op(1n11/2hn2exp(M02ς22chlogn1))\displaystyle=O_{\mathrm{p}}\left(\frac{1}{{n_{1}^{1/2}}h_{n}^{2}}\exp\left(\frac{M_{0}^{2}\varsigma^{2}}{2}c_{h}\log n_{1}\right)\right) (147)
=Op((logn1)n1c).\displaystyle=O_{\mathrm{p}}((\log n_{1})n_{1}^{-c}). (148)

This establishes the convergence of (135). Repeating the arguments above for (133)-(134) completes the proof. ∎

Proof of Theorem 1.

Since 𝜷𝑿i(1)𝒩(0,1)\bm{\beta}^{\top}{\bm{X}_{i}^{(1)}}\sim\mathcal{N}(0,1) by Assumption 2, Lemma 1 implies that, for g~()\tilde{g}(\cdot) defined in (126),

supaxb|g~(x)g(x)|=Op(1(logn1)m/2).\displaystyle\sup_{a\leq x\leq b}\left\lvert\tilde{g}(x)-g(x)\right\rvert=O_{\mathrm{p}}\left(\frac{1}{(\log n_{1})^{m/2}}\right). (149)

Thus, we obtain

supaxb|g^(x)g(x)|\displaystyle\sup_{a\leq x\leq b}\left\lvert\hat{g}(x)-g(x)\right\rvert supaxb|g˘(x)g~(x)|+supaxb|g~(x)g(x)|=Op(1(logn1)m/2).\displaystyle\leq\sup_{a\leq x\leq b}\left\lvert\breve{g}(x)-\tilde{g}(x)\right\rvert+\sup_{a\leq x\leq b}\left\lvert\tilde{g}(x)-g(x)\right\rvert=O_{\mathrm{p}}\left(\frac{1}{(\log n_{1})^{m/2}}\right). (150)

The last equality follows from Lemma 5 and (149). Also, the first inequality follows from the triangle inequality and a property of each choice of the monotonization operator \mathcal{R}[\cdot]. If we select the naive operator \mathcal{R}_{\mathrm{naive}}[\cdot], we obtain the following for x\in[a,b]:

|g^(x)g(x)|=|supx[a,x]g˘(x)g(x)|=|supx[a,x]g˘(x)supx[a,x]g(x)|supx[a,x]|g˘(x)g(x)|,\displaystyle|\hat{g}(x)-g(x)|=\left|\sup_{x^{\prime}\in[a,x]}\breve{g}(x^{\prime})-g(x)\right|=\left|\sup_{x^{\prime}\in[a,x]}\breve{g}(x^{\prime})-\sup_{x^{\prime}\in[a,x]}g(x^{\prime})\right|\leq\sup_{x^{\prime}\in[a,x]}|\breve{g}(x^{\prime})-g(x^{\prime})|, (151)

by the monotonicity of g(\cdot). If we select the rearrangement operator \mathcal{R}^{a}[\cdot], Proposition 1 in Chernozhukov et al. (2009) yields the same result for x\in[a,b]. Thus, whichever monotonization operator is chosen, we obtain the statement. ∎
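
Both monotonization operators used above admit a one-line implementation on a grid. The Python sketch below (a toy illustration under the increasing-link convention, not the paper's code) applies the running-supremum operator \mathcal{R}_{\mathrm{naive}}[\cdot] and an increasing rearrangement in the spirit of Chernozhukov et al. (2009) to a noisy pilot estimate evaluated on an equispaced grid of [a,b].

import numpy as np

# Pilot (possibly non-monotone) estimate of an increasing link on a grid of [a, b].
a, b = -2.0, 2.0
x = np.linspace(a, b, 401)
rng = np.random.default_rng(0)
g_breve = 1.0 / (1.0 + np.exp(-x)) + 0.05 * rng.standard_normal(x.size)  # illustrative

# Naive monotonization: R_naive[g](x) = sup over x' in [a, x] of g(x'), a running maximum.
g_naive = np.maximum.accumulate(g_breve)

# Increasing rearrangement on an equispaced grid: sort the grid values.
g_rearranged = np.sort(g_breve)

# Both outputs are nondecreasing; by (151) and Chernozhukov et al. (2009, Proposition 1),
# neither is farther from a monotone truth than the pilot estimate in the sup norm.
assert np.all(np.diff(g_naive) >= 0) and np.all(np.diff(g_rearranged) >= 0)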

D.3. Proof of Theorem 2

Lemma 6.

Let Assumptions 1-3 hold. Define

μn=𝜷𝜷^(g^),σn2=𝑷𝜷𝜷^(g^)2,\displaystyle\mu_{n}={\bm{\beta}^{\top}\hat{\bm{\beta}}(\hat{g})},\quad\sigma_{n}^{2}=\|{\bm{P}_{\bm{\beta}}^{\perp}\hat{\bm{\beta}}(\hat{g})\|}^{2}, (152)

where 𝐏𝛃=𝐈p𝛃𝛃\bm{P}_{\bm{\beta}}^{\perp}=\bm{I}_{p}-\bm{\beta}\bm{\beta}^{\top}. Then, we have

|μ^(g^)μn|p0,and|σ^2(g^)σn2|p0.\displaystyle\left\lvert\hat{\mu}({\hat{g}})-\mu_{n}\right\rvert\overset{\mathrm{p}}{\to}0,\quad\mathrm{and}\quad\left\lvert\hat{\sigma}^{2}(\hat{{g}})-\sigma_{n}^{2}\right\rvert\overset{\mathrm{p}}{\to}0. (153)
Proof of Lemma 6.

Theorem 4.4 in Bellec (2022) implies that as n2n_{2}\to\infty, we have

v^λ2t^2r˙4|μ^2(g^)μn2|p0,v^λ2t^2r˙4|σ^2(g^)σn2|p0,\displaystyle{\hat{v}_{\lambda}^{2}\hat{t}^{2}}{\dot{r}^{-4}}\left\lvert\hat{\mu}^{2}(\hat{g})-\mu_{n}^{2}\right\rvert\overset{\mathrm{p}}{\to}0,\quad{\hat{v}_{\lambda}^{2}\hat{t}^{2}}{\dot{r}^{-4}}\left\lvert\hat{\sigma}^{2}(\hat{g})-\sigma_{n}^{2}\right\rvert\overset{\mathrm{p}}{\to}0, (154)

with \hat{t}^{2}=(\hat{v}_{\lambda}+\lambda)^{2}\|\hat{\bm{\beta}}(\hat{g})\|^{2}-\kappa_{2}\dot{r}^{2} and \dot{r}^{2}=n_{2}^{-1}\|{\bm{y}^{(2)}-\hat{g}({\bm{X}^{(2)}}\hat{\bm{\beta}}(\hat{g}))\|}^{2}. Recall that \hat{v}_{\lambda} and \bm{D} are defined in Section 2.4. Thus, it is sufficient to show that \hat{v}_{\lambda}^{2},\hat{t}^{2}, and \dot{r}^{-4} are asymptotically lower bounded away from zero. First, the fact that \mathrm{tr}(\bm{D})\geq n_{2}c_{g}^{-1}>0 holds by Assumption 3, together with Proposition 3.1 in Bellec (2022), implies that there exists a constant \hat{c}>0 such that \hat{v}_{\lambda}\geq c_{g}^{-1}/(1+\hat{c})-4\hat{c}/n_{2} holds. Next, since ridge-penalized regression estimators satisfy \|\hat{\bm{\beta}}(\hat{g})\|\leq C_{\lambda}^{\prime} with a constant C_{\lambda}^{\prime}>0 depending on the regularization parameter \lambda>0, we have \dot{r}^{2}=O_{\mathrm{p}}(1). Also, Theorem 4.4 in Bellec (2022) implies that \hat{t}^{2}\overset{\mathrm{p}}{\to}((\hat{v}_{\lambda}+\lambda)\bm{\beta}^{\top}\hat{\bm{\beta}}(\hat{g}))^{2}. Thus, we have \left\lvert\hat{\mu}({\hat{g}})-\mu_{n}\right\rvert\overset{\mathrm{p}}{\to}0 and \left\lvert\hat{\sigma}^{2}(\hat{{g}})-\sigma_{n}^{2}\right\rvert\overset{\mathrm{p}}{\to}0 as n_{2}\to\infty since the sign of \mu_{n} is specified by an assumption. ∎

Proof of Theorem 2.

We use the notation defined in (152). First, we can apply Theorem 4 and obtain

p(𝜷^j(g^)μn𝜷j)σnd𝒩(0,1).\displaystyle\frac{\sqrt{p}(\hat{\bm{\beta}}_{j}(\hat{g})-\mu_{n}\bm{\beta}_{j})}{\sigma_{n}}\overset{\mathrm{d}}{\to}\mathcal{N}(0,1). (155)

This is because we can skip Step 1 in the proof of Theorem 4 since \bm{\Sigma}=\bm{I}_{p}, and we can repeat Steps 2–3 since J(\tilde{\bm{U}}\bm{b})=J(\bm{b}) for any orthogonal matrix \tilde{\bm{U}}\in\mathbb{R}^{p\times p}. Hence, we have

pβ^j(g^)μ^(g^)βjσ^(g^)=pβ^j(g^)μnβjσnσnσ^(g^)+p(μnμ^(g^))βjσ^(g^)d𝒩(0,1),\displaystyle\sqrt{p}\frac{\hat{\beta}_{j}(\hat{g})-\hat{\mu}(\hat{g})\beta_{j}}{\hat{\sigma}(\hat{g})}=\sqrt{p}\frac{\hat{\beta}_{j}(\hat{g})-\mu_{n}\beta_{j}}{\sigma_{n}}\frac{\sigma_{n}}{\hat{\sigma}(\hat{g})}+\sqrt{p}\frac{(\mu_{n}-\hat{\mu}(\hat{g}))\beta_{j}}{\hat{\sigma}(\hat{g})}\overset{\mathrm{d}}{\to}\mathcal{N}(0,1), (156)

where the convergence follows from the facts that μ^(g^)pμn\hat{\mu}(\hat{g})\overset{\mathrm{p}}{\to}\mu_{n} and σ^2(g^)pσn2\hat{\sigma}^{2}(\hat{g})\overset{\mathrm{p}}{\to}\sigma_{n}^{2} by Lemma 6. This concludes the proof of (19).
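
For completeness, the pivot in (156) can be turned into p-values and confidence intervals coordinate by coordinate. The sketch below assumes that \hat{\beta}_{j}(\hat{g}), \hat{\mu}(\hat{g})>0, \hat{\sigma}(\hat{g}), and p are already available from the estimation stage; the numerical inputs are placeholders only.

from math import sqrt
from statistics import NormalDist

def coordinate_inference(beta_hat_j, mu_hat, sigma_hat, p, alpha=0.05):
    # Pivot from (156): sqrt(p) * (beta_hat_j - mu_hat * beta_j) / sigma_hat -> N(0, 1).
    std_normal = NormalDist()
    stat = sqrt(p) * beta_hat_j / sigma_hat                 # statistic under H0: beta_j = 0
    pval = 2.0 * (1.0 - std_normal.cdf(abs(stat)))
    z = std_normal.inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sigma_hat / (sqrt(p) * mu_hat)         # assumes mu_hat > 0
    ci = (beta_hat_j / mu_hat - half_width, beta_hat_j / mu_hat + half_width)
    return stat, pval, ci

# Placeholder inputs; in practice these come from the two-stage estimator.
print(coordinate_inference(beta_hat_j=0.08, mu_hat=0.9, sigma_hat=1.1, p=400))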

Next, we consider an orthogonal matrix \bm{U}\in\mathbb{R}^{p\times p} with the first row \bm{U}_{1}=\bm{v}_{n}^{\top}. Since \bm{U}\hat{\bm{\beta}}(\hat{g}) is the estimator given by (13) with covariates \bm{U}\bm{X}_{i}^{(2)} and the true coefficient vector \bm{U}\bm{\beta}, applying (24) to this with j=1 yields that, for any sequence of non-random vectors \bm{v}_{n} such that \|\bm{v}_{n}\|=1 and \sqrt{p}\tau(\bm{v}_{n})\bm{v}_{n}^{\top}\bm{\beta}=O(1),

p𝒗n(𝜷^(g^)μ^(g^)𝜷)σ^(g^)/τ(𝒗n)d𝒩(0,1),\displaystyle\frac{\sqrt{p}\bm{v}_{n}^{\top}(\hat{\bm{\beta}}(\hat{g})-\hat{\mu}(\hat{g})\bm{\beta})}{\hat{\sigma}(\hat{g})/\tau(\bm{v}_{n})}\overset{\mathrm{d}}{\to}\mathcal{N}(0,1), (157)

where τ2(𝒗n)=(𝒗n𝚯𝒗n)1\tau^{2}(\bm{v}_{n})=(\bm{v}_{n}^{\top}\bm{\Theta}\bm{v}_{n})^{-1}. Finally, (25) follows from (157) and the Cramér-Wold device. ∎

D.4. Proof of Theorem 3

First, we define the notation used in the proof. We consider an invertible matrix \bm{L}\in\mathbb{R}^{p\times p} satisfying \bm{\Sigma}=\bm{L}\bm{L}^{\top}. Define, for each i\in\{1,\ldots,n\},

𝑿~i=𝑳1𝑿i(2),𝜽=𝑳𝜷,𝜽^:=𝜽^(g^)=𝑳𝜷^(g^).\displaystyle\tilde{\bm{X}}_{i}=\bm{L}^{-1}\bm{X}_{i}^{(2)},\quad\bm{\theta}=\bm{L}^{\top}\bm{\beta},\quad\hat{\bm{\theta}}:=\hat{\bm{\theta}}(\hat{g})=\bm{L}^{\top}\hat{\bm{\beta}}(\hat{g}). (158)
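
One admissible choice of \bm{L} in (158) is a Cholesky factor of \bm{\Sigma}. The following sketch (with an arbitrary illustrative covariance) checks numerically that \tilde{\bm{X}}_{i}=\bm{L}^{-1}\bm{X}_{i}^{(2)} is whitened and that \|\bm{\theta}\|^{2}=\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}=1 under the normalization of the model.

import numpy as np

rng = np.random.default_rng(1)
p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)            # illustrative positive-definite covariance
L = np.linalg.cholesky(Sigma)              # one admissible L with Sigma = L L^T

beta = rng.standard_normal(p)
beta /= np.sqrt(beta @ Sigma @ beta)       # enforce beta^T Sigma beta = 1
theta = L.T @ beta                         # transformed coefficient; ||theta|| = 1

X = rng.multivariate_normal(np.zeros(p), Sigma, size=20000)
X_tilde = X @ np.linalg.inv(L).T           # rows are L^{-1} X_i

print(np.linalg.norm(theta))                          # approximately 1
print(np.round(np.cov(X_tilde, rowvar=False), 2))     # approximately the identity
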
Lemma 7.

Let Assumptions 1-3(1) hold. Using the notation in (158), define

μ0=𝜽𝜽^,σ02=𝑷𝜽𝜽^2,\displaystyle\mu_{0}={\bm{\theta}^{\top}\hat{\bm{\theta}}},\quad\sigma_{0}^{2}=\|{\bm{P}_{\bm{\theta}}^{\perp}\hat{\bm{\theta}}\|}^{2}, (159)

where 𝐏𝛉=𝐈p𝛉𝛉\bm{P}_{\bm{\theta}}^{\perp}=\bm{I}_{p}-\bm{\theta}\bm{\theta}^{\top}. Then, we have

|μ^0(g^)μ0|p0,and|σ^02(g^)σ02|p0.\displaystyle\left\lvert\hat{\mu}_{0}({\hat{g}})-\mu_{0}\right\rvert\overset{\mathrm{p}}{\to}0,\quad\mathrm{and}\quad\left\lvert\hat{\sigma}_{0}^{2}(\hat{{g}})-\sigma_{0}^{2}\right\rvert\overset{\mathrm{p}}{\to}0. (160)
Proof of Lemma 7.

Theorem 4.4 in Bellec (2022) implies that as n2n_{2}\to\infty, we have

v^02t^02r˙04|μ^0(g^)μ0|p0,v^02t^02r˙04|σ^02(g^)σ02|p0,\displaystyle{\hat{v}_{0}^{2}\hat{t}_{0}^{2}}{\dot{r}_{0}^{-4}}\left\lvert\hat{\mu}_{0}(\hat{g})-\mu_{0}\right\rvert\overset{\mathrm{p}}{\to}0,\quad{\hat{v}_{0}^{2}\hat{t}_{0}^{2}}{\dot{r}_{0}^{-4}}\left\lvert\hat{\sigma}_{0}^{2}(\hat{g})-\sigma_{0}^{2}\right\rvert\overset{\mathrm{p}}{\to}0, (161)

with \hat{t}_{0}^{2}={n_{2}^{-1}\|{{\bm{X}^{(2)}}\hat{\bm{\beta}}(\hat{g})\|}^{2}}\hat{v}_{0}^{2}-\kappa_{2}(1-\kappa_{2}){\dot{r}_{0}^{2}} and \dot{r}_{0}^{2}=n_{2}^{-1}\|{\bm{y}^{(2)}-\hat{g}({\bm{X}^{(2)}}\hat{\bm{\beta}}(\hat{g}))\|}^{2}. Recall that \hat{v}_{0} is obtained from the definition of \hat{v}_{\lambda} in Section 2.4 by setting \lambda=0. Thus, it is sufficient to show that \hat{v}_{0}^{2},\hat{t}_{0}^{2}, and \dot{r}_{0}^{-4} are asymptotically lower bounded away from zero. First, the fact that \mathrm{tr}(\bm{D})\geq n_{2}c_{g}^{-1}>0 holds by Assumption 3, together with Proposition 3.1 in Bellec (2022), implies that there exists a constant \hat{c}^{\prime}>0 such that \hat{v}_{0}\geq c_{g}^{-1}/(1+\hat{c}^{\prime})-4\hat{c}^{\prime}/n_{2}. Next, since we assume that \|\hat{\bm{\beta}}(\hat{g})\|\leq C with probability approaching one, we have \dot{r}_{0}^{2}=O_{\mathrm{p}}(1). Also, Theorem 4.4 in Bellec (2022) implies that \hat{t}_{0}^{2}\overset{\mathrm{p}}{\to}\hat{v}_{0}\mu_{0}. Thus, we have \left\lvert\hat{\mu}_{0}({\hat{g}})-\mu_{0}\right\rvert\overset{\mathrm{p}}{\to}0 and \left\lvert\hat{\sigma}_{0}^{2}(\hat{{g}})-\sigma_{0}^{2}\right\rvert\overset{\mathrm{p}}{\to}0 as n_{2}\to\infty since the sign of \mu_{0} is specified by an assumption. ∎

Proof of Theorem 3.

First, the first step of the proof of Theorem 4 implies that, for any coordinate j=1,\ldots,p,

τjβ^jμ^(g^)βjσ^(g^)=θ^jμ^(g^)θjσ^(g^),\displaystyle\tau_{j}\frac{\hat{\beta}_{j}-\hat{\mu}(\hat{g})\beta_{j}}{\hat{\sigma}(\hat{g})}=\frac{\hat{\theta}_{j}-\hat{\mu}(\hat{g})\theta_{j}}{\hat{\sigma}(\hat{g})}, (162)

where τj2=(𝚺1)jj\tau_{j}^{-2}=(\bm{\Sigma}^{-1})_{jj}. Here, 𝜽\bm{\theta} and 𝜽^\hat{\bm{\theta}} are defined in (158). Thus, we consider 𝜽^\hat{\bm{\theta}} instead of 𝜷^(g^)\hat{\bm{\beta}}(\hat{g}). We have

\displaystyle\sqrt{p}\frac{\hat{\theta}_{j}-\hat{\mu}(\hat{g})\theta_{j}}{\hat{\sigma}(\hat{g})}=\sqrt{p}\frac{\hat{\theta}_{j}-\mu_{0}\theta_{j}}{\sigma_{0}}\frac{\sigma_{0}}{\hat{\sigma}(\hat{g})}+\sqrt{p}\frac{(\mu_{0}-\hat{\mu}(\hat{g}))\theta_{j}}{\hat{\sigma}(\hat{g})}. (163)

Thus, the facts that μ^0(g^)pμ0\hat{\mu}_{0}(\hat{g})\overset{\mathrm{p}}{\to}\mu_{0} and σ^02(g^)pσ02\hat{\sigma}_{0}^{2}(\hat{g})\overset{\mathrm{p}}{\to}\sigma_{0}^{2} by Lemma 7 conclude the proof of (24). The rest of the proof follows from repeating the arguments in the proof of Theorem 2. ∎
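
The rescaling in (162) only requires the j-th diagonal entry of \bm{\Sigma}^{-1}. A minimal sketch of forming the studentized statistic for a general covariance, assuming the fitted quantities \hat{\beta}_{j}(\hat{g}), \hat{\mu}(\hat{g}), and \hat{\sigma}(\hat{g}) are available (all inputs below are placeholders):

import numpy as np

def standardized_coordinate(beta_hat, mu_hat, sigma_hat, Sigma, beta_null, j):
    # tau_j^{-2} = (Sigma^{-1})_{jj}; statistic tau_j * sqrt(p) * (beta_hat_j - mu_hat * beta_j) / sigma_hat.
    p = Sigma.shape[0]
    tau_j = 1.0 / np.sqrt(np.linalg.inv(Sigma)[j, j])
    return tau_j * np.sqrt(p) * (beta_hat[j] - mu_hat * beta_null[j]) / sigma_hat

# Placeholder inputs illustrating the call only (equicorrelated covariance).
p = 4
Sigma = 0.7 * np.eye(p) + 0.3 * np.ones((p, p))
beta_hat = np.array([0.10, -0.02, 0.05, 0.00])
print(standardized_coordinate(beta_hat, mu_hat=0.9, sigma_hat=1.1,
                              Sigma=Sigma, beta_null=np.zeros(p), j=0))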

D.5. Proof of Theorem 5

Lemma 8.

Let c_{g}^{-1}\leq g^{\prime}(\cdot) hold, and suppose that \hat{\bm{\beta}}(\hat{g})^{\top}{\bm{X}_{i}^{(2)}} and \hat{\bm{\beta}}({g})^{\top}{\bm{X}_{i}^{(2)}} are censored to [a,b] for all i\in[n_{2}]. Under the setting of Lemma 1 with k=3, we have

maxi=1,,n2|𝜷^(g)𝑿i(2)𝜷^(g^)𝑿i(2)|p0.\displaystyle\max_{i=1,\ldots,n_{2}}\left\lvert\hat{\bm{\beta}}(g)^{\top}\bm{X}_{i}^{(2)}-\hat{\bm{\beta}}(\hat{g})^{\top}\bm{X}_{i}^{(2)}\right\rvert\overset{\mathrm{p}}{\to}0. (164)
Proof of Lemma 8.

We can assume \bm{X}_{i}^{(2)}\sim\mathcal{N}(\bm{0},\bm{I}_{p}) for each i=1,\ldots,n_{2} without loss of generality by the first step of the proof of Theorem 4. In this proof, we omit the superscript (2) on \bm{X}^{(2)} for simplicity of notation. To begin with, we write the KKT condition of the estimator:

f(𝜷^(g),g)=𝟎,\displaystyle f(\hat{\bm{\beta}}(g),g)=\bm{0}, (165)

where we define f(𝒃,g)=n21/2𝑿(g(𝑿𝒃)𝒚)f(\bm{b},g)=n_{2}^{-1/2}\bm{X}^{\top}(g(\bm{X}\bm{b})-\bm{y}). We write fj(𝒃,g)=n21/2𝑿j(g(𝑿𝒃)𝒚)f_{j}(\bm{b},g)=n_{2}^{-1/2}\bm{X}_{\cdot j}^{\top}(g(\bm{X}\bm{b})-\bm{y}) for j=1,,pj=1,\ldots,p. Since /(bj)fj(𝒃,g)=n21/2𝑿j𝑫(𝒃)𝑿j\partial/(\partial b_{j})f_{j}(\bm{b},g)=n_{2}^{-1/2}\bm{X}_{\cdot j}^{\top}\bm{D}(\bm{b})\bm{X}_{\cdot j} with 𝑫(𝒃)=diag(g(𝑿𝒃))\bm{D}(\bm{b})=\mathrm{diag}(g^{\prime}(\bm{X}\bm{b})), by the mean value theorem, there exists a constant c[0,1]c\in[0,1] such that 𝒃¯=c𝜷^(g)+(1c)𝜷^(g^)\bar{\bm{b}}=c\hat{\bm{\beta}}({g})+(1-c)\hat{\bm{\beta}}(\hat{g}) satisfies

fj(𝜷^(g),g)fj(𝜷^(g^),g)β^j(g)β^j(g^)=(1n2𝑿j𝑫(𝒃¯)𝑿j)>0.\displaystyle\frac{f_{j}(\hat{\bm{\beta}}(g),g)-f_{j}(\hat{\bm{\beta}}(\hat{g}),g)}{\hat{\beta}_{j}(g)-\hat{\beta}_{j}(\hat{g})}=\left(\frac{1}{\sqrt{n_{2}}}\bm{X}_{\cdot j}^{\top}\bm{D}(\bar{\bm{b}})\bm{X}_{\cdot j}\right)>0. (166)

Define Rk:=g^(𝑿k𝜷^(g^))g(𝑿k𝜷^(g^))R_{k}:=\hat{g}(\bm{X}_{k}^{\top}\hat{\bm{\beta}}(\hat{g}))-g(\bm{X}_{k}^{\top}\hat{\bm{\beta}}(\hat{g})). We have

n2(β^j(g)β^j(g^))\displaystyle\sqrt{n_{2}}\left(\hat{\beta}_{j}(g)-\hat{\beta}_{j}(\hat{g})\right) (167)
\displaystyle=\left(n_{2}^{-1}\bm{X}_{\cdot j}^{\top}\bm{D}(\bar{\bm{b}})\bm{X}_{\cdot j}\right)^{-1}\left(f_{j}(\hat{\bm{\beta}}(g),g)-f_{j}(\hat{\bm{\beta}}(\hat{g}),g)\right) (168)
\displaystyle=\left(n_{2}^{-1}\bm{X}_{\cdot j}^{\top}\bm{D}(\bar{\bm{b}})\bm{X}_{\cdot j}\right)^{-1}\left\{{f_{j}(\hat{\bm{\beta}}(g),g)}+\left(f_{j}(\hat{\bm{\beta}}(\hat{g}),\hat{g})-f_{j}(\hat{\bm{\beta}}(\hat{g}),g)\right)-{f_{j}(\hat{\bm{\beta}}(\hat{g}),\hat{g})}\right\} (169)
=(n21𝑿j𝑫(𝒃¯)𝑿j)1(1n2k=1n2XkjRk),\displaystyle=\left(n_{2}^{-1}\bm{X}_{\cdot j}^{\top}\bm{D}(\bar{\bm{b}})\bm{X}_{\cdot j}\right)^{-1}\left(\frac{1}{\sqrt{n_{2}}}\sum_{k=1}^{n_{2}}X_{kj}R_{k}\right), (170)

where the last equality follows from the first-order conditions. In the sequel, for simplicity, we consider the leave-one-out estimators \hat{\bm{\beta}}_{-i} and \hat{g}_{-i} constructed from the observations excluding the i-th sample. Define

R~k:=(logn2)(g^1(𝑿k𝜷^1(g^1))g(𝑿k𝜷^1(g^1))),Tj\displaystyle\tilde{R}_{k}:=(\log{n_{2}})\left(\hat{g}_{-1}(\bm{X}_{k}^{\top}\hat{\bm{\beta}}_{-1}(\hat{g}_{-1}))-g(\bm{X}_{k}^{\top}\hat{\bm{\beta}}_{-1}(\hat{g}_{-1}))\right),\quad T_{j} :=(n21𝑿1,j𝑫1(𝒃¯)𝑿1,j)1,\displaystyle:=\left(n_{2}^{-1}\bm{X}_{-1,j}^{\top}\bm{D}_{-1}(\bar{\bm{b}})\bm{X}_{-1,j}\right)^{-1}, (171)

where 𝑿1,j:=(X2j,,Xn2j)n21\bm{X}_{-1,j}:=(X_{2j},\ldots,X_{{n_{2}}j})^{\top}\in\mathbb{R}^{{n_{2}}-1}, and 𝑫1(𝒃¯):=diag(g1(𝑿2𝒃¯),,g1(𝑿n2𝒃¯))(n21)×(n21)\bm{D}_{-1}(\bar{\bm{b}}):=\mathrm{diag}({g}_{-1}^{\prime}(\bm{X}_{2}^{\top}\bar{\bm{b}}),\ldots,{g}_{-1}^{\prime}(\bm{X}_{n_{2}}^{\top}\bar{\bm{b}}))\in\mathbb{R}^{({n_{2}}-1)\times({n_{2}}-1)}. We obtain

|𝑿1𝜷^1(g)𝑿1𝜷^1(g^1)|\displaystyle\left\lvert\bm{X}_{1}^{\top}\hat{\bm{\beta}}_{-1}(g)-\bm{X}_{1}^{\top}\hat{\bm{\beta}}_{-1}(\hat{g}_{-1})\right\rvert =|j=1pX1j(β^1,j(g)β^1,j(g^1))|\displaystyle=\left\lvert\sum_{j=1}^{p}X_{1j}\left(\hat{\beta}_{-1,j}(g)-\hat{\beta}_{-1,j}(\hat{g}_{-1})\right)\right\rvert (172)
|1n2logn2j=1pX1jTjk=2n2XkjR~k|.\displaystyle\leq\left\lvert\frac{1}{{{n_{2}}\log{n_{2}}}}\sum_{j=1}^{p}X_{1j}T_{j}{\sum_{k=2}^{n_{2}}X_{kj}\tilde{R}_{k}}\right\rvert. (173)

Here, define a filtration k=σ({g^1,𝜷^1(g^1),T1,,Tp,𝑿2,,𝑿k+1})\mathcal{F}_{k}=\sigma(\{\hat{g}_{-1},\hat{\bm{\beta}}_{-1}(\hat{g}_{-1}),T_{1},\ldots,T_{p},\bm{X}_{2},\ldots,\bm{X}_{k+1}\}) with an initialization 0=σ({g^1,𝜷^1(g^1),T1,,Tp})\mathcal{F}_{0}=\sigma(\{\hat{g}_{-1},\hat{\bm{\beta}}_{-1}(\hat{g}_{-1}),T_{1},\ldots,T_{p}\}). Define a random variable S~k=n21j=1pTjX1jXkj\tilde{S}_{k}={n_{2}}^{-1}\sum_{j=1}^{p}T_{j}X_{1j}X_{kj}. Then, R~kS~k\tilde{R}_{k}\tilde{S}_{k} is a martingale difference sequence since 𝔼[R~kS~kk1]=0\operatorname{\mathbb{E}}[\tilde{R}_{k}\tilde{S}_{k}\mid\mathcal{F}_{k-1}]=0 and 𝔼|R~kS~k|𝔼[R~k2]1/2𝔼[S~k2]1/2<.\operatorname{\mathbb{E}}|{\tilde{R}_{k}\tilde{S}_{k}}|\leq\operatorname{\mathbb{E}}[\tilde{R}_{k}^{2}]^{1/2}\operatorname{\mathbb{E}}[\tilde{S}_{k}^{2}]^{1/2}<\infty. This follows from the fact that

𝔼[S~k2]\displaystyle\operatorname{\mathbb{E}}\left[\tilde{S}_{k}^{2}\right] =𝔼[(n21j=1pTjX1jXkj)2]\displaystyle=\operatorname{\mathbb{E}}\left[\left(n_{2}^{-1}\sum_{j=1}^{p}T_{j}X_{1j}X_{kj}\right)^{2}\right] (174)
\displaystyle=n_{2}^{-2}\sum_{j=1}^{p}\operatorname{\mathbb{E}}\left[T_{j}^{2}X_{1j}^{2}X_{kj}^{2}\right]+2n_{2}^{-2}\sum_{j<j^{\prime}}\operatorname{\mathbb{E}}\left[T_{j}X_{1j}X_{kj}T_{j^{\prime}}X_{1j^{\prime}}X_{kj^{\prime}}\right] (175)
=n22j=1p𝔼[Tj2Xkj2]n22j=1p𝔼[Tj4]1/2𝔼[Xkj4]1/2.\displaystyle=n_{2}^{-2}\sum_{j=1}^{p}\operatorname{\mathbb{E}}[T_{j}^{2}X_{kj}^{2}]\leq n_{2}^{-2}\sum_{j=1}^{p}\operatorname{\mathbb{E}}[T_{j}^{4}]^{1/2}\operatorname{\mathbb{E}}[X_{kj}^{4}]^{1/2}. (176)

The last inequality follows from the Cauchy-Schwarz inequality. Since X_{kj} is standard Gaussian, \operatorname{\mathbb{E}}[X_{kj}^{4}]=3 holds. Also, we have 0<T_{j}\leq c_{g}({{n_{2}}}^{-1}\sum_{l=2}^{n_{2}}X_{lj}^{2})^{-1}, where ({({n_{2}}-1)^{-1}\sum_{l=2}^{n_{2}}X_{lj}^{2}})^{-1} follows the inverse gamma distribution with shape and scale parameters both equal to ({n_{2}}-1)/2, whose fourth moment (({n_{2}}-1)/2)^{4}\Gamma(({n_{2}}-1)/2-4)/\Gamma(({n_{2}}-1)/2) is finite for n_{2}>9.
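
The inverse-gamma moment used for T_{j} can be checked by simulation. The sketch below (assuming only the standard Gaussian design of this proof and an illustrative n_{2}) compares the empirical fourth moment of ((n_{2}-1)^{-1}\sum_{l=2}^{n_{2}}X_{lj}^{2})^{-1} with the closed-form value quoted above.

import numpy as np
from math import lgamma, log, exp

rng = np.random.default_rng(0)
n2 = 50                                     # illustrative; the fourth moment needs n2 > 9
reps = 200000
scaled_chi2 = rng.chisquare(df=n2 - 1, size=reps) / (n2 - 1)   # (n2-1)^{-1} * sum of squares
empirical = np.mean(scaled_chi2 ** (-4.0))

# Inverse gamma with shape and scale both (n2 - 1) / 2: E[Y^4] = scale^4 * Gamma(shape - 4) / Gamma(shape).
alpha = (n2 - 1) / 2.0
theoretical = exp(4.0 * log(alpha) + lgamma(alpha - 4.0) - lgamma(alpha))
print(empirical, theoretical)               # the two values agree up to simulation error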

Define \tilde{R} through (\log{n_{2}})^{-m/2}\tilde{R}:=\sup_{a\leq x\leq b}\left\lvert\hat{g}_{-1}(x)-g(x)\right\rvert. Note that, since \tilde{R}=O_{\mathrm{p}}(1) by Lemma 1, for any \epsilon_{1}>0 there exists \bar{c}>0 such that \mathbb{P}\left(|\tilde{R}|>\bar{c}\right)\leq\epsilon_{1}. Also, note that censoring does not affect this fact since \hat{g}(\cdot) is constructed independently of \bm{X}. Hence, we obtain, for this \bar{c} and any t_{n}>0 satisfying t_{n}=o(\sqrt{n_{2}}),

(1n2max2kn2|R~kj=1pTjX1jXkj|>tn||R~|c¯)\displaystyle\mathbb{P}\left(\left.\frac{1}{\sqrt{n_{2}}}\max_{2\leq k\leq{n_{2}}}\left\lvert\tilde{R}_{k}\sum_{j=1}^{p}T_{j}X_{1j}X_{kj}\right\rvert>t_{n}\ \right|\ |\tilde{R}|\leq\bar{c}\right) (177)
(c¯n2max2kn2|j=1pTjX1jXkj|>tn)\displaystyle\leq\mathbb{P}\left(\frac{\bar{c}}{\sqrt{n_{2}}}\max_{2\leq k\leq{n_{2}}}\left\lvert\sum_{j=1}^{p}T_{j}X_{1j}X_{kj}\right\rvert>t_{n}\right) (178)
(c¯n2max2kn2|j=1pX1jXkjTj|>tn|max1jp|Tj|u)+(max1jp|Tj|>u)\displaystyle\leq\mathbb{P}\left(\left.\frac{\bar{c}}{\sqrt{{n_{2}}}}\max_{2\leq k\leq{n_{2}}}\left\lvert\sum_{j=1}^{p}X_{1j}X_{kj}T_{j}\right\rvert>t_{n}\ \right|\ \max_{1\leq j\leq p}|T_{j}|\leq u\right)+\mathbb{P}\left(\max_{1\leq j\leq p}|T_{j}|>u\right) (179)
2n2exp(ctn2c¯2K2)+(max1jp|Tj|>u).\displaystyle\leq 2{n_{2}}\exp\left(-\frac{ct_{n}^{2}}{\bar{c}^{2}K^{2}}\right)+\mathbb{P}\left(\max_{1\leq j\leq p}|T_{j}|>u\right). (180)

with some c>0c>0 depending on uu, where the last inequality follows from the union bound and Bernstein’s inequality. Here, KK is the sub-exponential norm of uX11X21uX_{11}X_{21}. Since we have Tjcg(n21l=2n2Xlj2)T_{j}\leq c_{g}({n_{2}}^{-1}\sum_{l=2}^{n_{2}}X_{lj}^{2}), it holds that

(max1jp|Tj|>u)(max1jp1n2l=2n2Xlj21>u)pexp(c(u2K2uK)n2),\displaystyle\mathbb{P}\left(\max_{1\leq j\leq p}|T_{j}|>u\right)\leq\mathbb{P}\left(\max_{1\leq j\leq p}{\frac{1}{{n_{2}}}\sum_{l=2}^{n_{2}}X_{lj}^{2}-1}>u_{*}\right)\leq p\exp\left(-c\left(\frac{u_{*}^{2}}{K^{2}}\wedge\frac{u_{*}}{K}\right){n_{2}}\right), (181)

where u_{*}=c_{g}/u-1. Using these bounds, the Azuma-Hoeffding inequality yields, for any x>0 and t_{n}>0 satisfying t_{n}=o(\sqrt{{n_{2}}}),

\displaystyle\mathbb{P}\left(\frac{1}{(\log{n_{2}})^{m/2}}\max_{1\leq i\leq{n_{2}}}\left\lvert\frac{1}{{n_{2}}}\sum_{k\neq i}^{n_{2}}\tilde{R}_{k}\sum_{j=1}^{p}X_{ij}X_{kj}\right\rvert>x\right) (182)
\displaystyle\leq\mathbb{P}\left(\left.\frac{1}{(\log{n_{2}})^{m/2}}\max_{1\leq i\leq{n_{2}}}\left\lvert\frac{1}{{n_{2}}}\sum_{k\neq i}^{n_{2}}\tilde{R}_{k}\sum_{j=1}^{p}X_{ij}X_{kj}\right\rvert>x\ \right|\ |{\tilde{R}}|\leq\bar{c}\right)+\epsilon_{1} (183)
\displaystyle\leq{n_{2}}\mathbb{P}\left(\left.\frac{1}{(\log{n_{2}})^{m/2}}\left\lvert\frac{1}{{n_{2}}}\sum_{k=2}^{n_{2}}\tilde{R}_{k}\sum_{j=1}^{p}X_{1j}X_{kj}\right\rvert>x\ \right|\ |{\tilde{R}}|\leq\bar{c}\right)+\epsilon_{1} (184)
\displaystyle\leq{n_{2}}\mathbb{P}\left(\left.\frac{1}{(\log{n_{2}})^{m/2}}\left\lvert\frac{1}{{n_{2}}}\sum_{k=2}^{n_{2}}\tilde{R}_{k}\sum_{j=1}^{p}X_{1j}X_{kj}\right\rvert>x\ \right|\ \frac{1}{n_{2}}\max_{2\leq k\leq{n_{2}}}\left\lvert\tilde{R}_{k}\sum_{j=1}^{p}X_{1j}X_{kj}\right\rvert\leq\frac{t_{n}}{\sqrt{{n_{2}}}},|\tilde{R}|\leq\bar{c}\right) (185)
+n2(1n2max2kn2|R~kj=1pX1jXkj|>tnn2||R~|c¯)+ϵ1\displaystyle\quad+{n_{2}}\mathbb{P}\left(\left.\frac{1}{{n_{2}}}\max_{2\leq k\leq{n_{2}}}\left\lvert\tilde{R}_{k}\sum_{j=1}^{p}X_{1j}X_{kj}\right\rvert>\frac{t_{n}}{\sqrt{{n_{2}}}}\ \right|\ |\tilde{R}|\leq\bar{c}\right)+\epsilon_{1} (186)
2n2exp(x2(logn2)m2tn2)+2n22exp(ctn2c¯2K2)+n22exp(c(u2K2uK)n2)+ϵ1,\displaystyle\leq 2{n_{2}}\exp\left(-\frac{x^{2}(\log{n_{2}})^{m}}{2t_{n}^{2}}\right)+2n_{2}^{2}\exp\left(-\frac{ct_{n}^{2}}{\bar{c}^{2}K^{2}}\right)+n_{2}^{2}\exp\left(-c\left(\frac{u_{*}^{2}}{K^{2}}\wedge\frac{u_{*}}{K}\right){n_{2}}\right)+\epsilon_{1}, (187)

where the last inequality follows from (180) and (181). Thus, one can choose, for instance, m=3m=3, tn=(logn2)3/5t_{n}=(\log{n_{2}})^{3/5}, and c¯=loglogn2\bar{c}=\log\log{n_{2}} so that we have 1(logn2)m/2max1in2|1n2kin2R~kj=1n2XijXkj|=op(1)\frac{1}{(\log{n_{2}})^{m/2}}\max_{1\leq i\leq{n_{2}}}|{\frac{1}{{n_{2}}}\sum_{k\neq i}^{n_{2}}\tilde{R}_{k}\sum_{j=1}^{n_{2}}X_{ij}X_{kj}}|=o_{\mathrm{p}}(1) and ϵ10\epsilon_{1}\to 0. ∎
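
For concreteness, the first-order condition (165), \bm{X}^{\top}(g(\bm{X}\bm{b})-\bm{y})=\bm{0}, can be solved by a standard Newton (iteratively reweighted least squares) iteration when the link is smooth and increasing. The sketch below uses a logistic link purely as an illustration; it is not the two-stage estimator analyzed in the paper, which plugs in the estimated link \hat{g}.

import numpy as np

def solve_score_equation(X, y, g, g_prime, n_iter=50):
    # Newton iterations for the score equation X^T (g(X b) - y) = 0.
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ b
        grad = X.T @ (g(eta) - y)
        hess = X.T @ (g_prime(eta)[:, None] * X)   # X^T diag(g'(X b)) X
        b -= np.linalg.solve(hess, grad)
    return b

rng = np.random.default_rng(0)
n, p = 2000, 50
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)
g = lambda t: 1.0 / (1.0 + np.exp(-t))             # illustrative logistic link
y = rng.binomial(1, g(X @ beta)).astype(float)
beta_check = solve_score_equation(X, y, g, lambda t: g(t) * (1.0 - g(t)))
print(np.round(beta_check[:5], 3))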

Proof of Theorem 5.

In this proof, we omit the superscript (2) on \bm{X}^{(2)} and \bm{y}^{(2)} for simplicity of notation. We first rewrite the inferential parameters defined in Section A as

\displaystyle\hat{\mu}_{0c}^{2}(g)=\frac{\|\iota(\bm{X}\hat{\bm{\beta}}(g))\|^{2}}{n_{2}}-\kappa_{2}(1-\kappa_{2})\hat{\sigma}^{2}(g),\quad\hat{\sigma}^{2}_{0c}(g)=\frac{n_{2}^{-1}\|\bm{y}-g(\iota(\bm{X}\hat{\bm{\beta}}(g)))\|^{2}}{\left(n_{2}^{-1}\mathrm{tr}(\bm{V}_{c}(g))\right)^{2}}, (188)

where 𝑽c(g)=𝑫c(g)𝑫c(g)𝑿(𝑿𝑫c(g)𝑿)1𝑿𝑫c(g)\bm{V}_{c}(g)=\bm{D}_{c}(g)-\bm{D}_{c}(g)\bm{X}(\bm{X}^{\top}\bm{D}_{c}(g)\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(g). Since we have

|σ^0c2(g^)σ^0c2(g)|\displaystyle\left\lvert\hat{\sigma}_{0c}^{2}(\hat{g})-\hat{\sigma}_{0c}^{2}(g)\right\rvert (189)
1(n21tr(𝑽c(g^)))2(n21tr(𝑽c(g)))2{𝒚g(ι(𝑿𝜷^(g))2n2|(tr(𝑽c(g^))n2)2(tr(𝑽c(g))n2)2|\displaystyle\leq\frac{1}{\left(n_{2}^{-1}\mathrm{tr}(\bm{V}_{c}(\hat{g}))\right)^{2}\left(n_{2}^{-1}\mathrm{tr}(\bm{V}_{c}(g))\right)^{2}}\left\{\frac{\|\bm{y}-g(\iota(\bm{X}\hat{\bm{\beta}}(g))\|^{2}}{n_{2}}\left\lvert\left(\frac{\mathrm{tr}(\bm{V}_{c}(\hat{g}))}{n_{2}}\right)^{2}-\left(\frac{\mathrm{tr}(\bm{V}_{c}(g))}{n_{2}}\right)^{2}\right\rvert\right. (190)
+(tr(𝑽c(g))n2)2|𝒚g(ι(𝑿𝜷^(g))2n2𝒚g(𝑿𝜷^(g^))2n2|},\displaystyle\quad\left.+\left(\frac{\mathrm{tr}(\bm{V}_{c}(g))}{n_{2}}\right)^{2}\left\lvert\frac{\|\bm{y}-g(\iota(\bm{X}\hat{\bm{\beta}}(g))\|^{2}}{n_{2}}-\frac{\|\bm{y}-g(\bm{X}\hat{\bm{\beta}}(\hat{g}))\|^{2}}{n_{2}}\right\rvert\right\}, (191)

it suffices to show the following properties:

n21|ι(𝑿𝜷^(g^))2ι(𝑿𝜷^(g))2|=op(1),\displaystyle n_{2}^{-1}\left\lvert{\|\iota(\bm{X}\hat{\bm{\beta}}(\hat{g}))\|^{2}}-{\|\iota(\bm{X}\hat{\bm{\beta}}({g}))\|^{2}}\right\rvert=o_{\mathrm{p}}(1), (192)
n21|𝒚g(ι(𝑿𝜷^(g^)))2𝒚g(ι(𝑿𝜷^(g)))2|=op(1),\displaystyle n_{2}^{-1}\left\lvert{\|\bm{y}-g(\iota(\bm{X}\hat{\bm{\beta}}(\hat{g})))\|^{2}}-{\|\bm{y}-g(\iota(\bm{X}\hat{\bm{\beta}}({g})))\|^{2}}\right\rvert=o_{\mathrm{p}}(1), (193)
n21|tr(𝑽c(g^))tr(𝑽c(g))|=op(1).\displaystyle n_{2}^{-1}\left\lvert{\mathrm{tr}(\bm{V}_{c}(\hat{g}))}-{\mathrm{tr}(\bm{V}_{c}(g))}\right\rvert=o_{\mathrm{p}}(1). (194)

For (192), we immediately have

n21|ι(𝑿𝜷^(g^))2ι(𝑿𝜷^(g))2|\displaystyle n_{2}^{-1}\left\lvert{\|\iota(\bm{X}\hat{\bm{\beta}}(\hat{g}))\|^{2}}-{\|\iota(\bm{X}\hat{\bm{\beta}}({g}))\|^{2}}\right\rvert (195)
=|n21i=1n2(ι(𝑿i𝜷^(g^))ι(𝑿i𝜷^(g)))(ι(𝑿i𝜷^(g^))+ι(𝑿i𝜷^(g)))|.\displaystyle=\left\lvert n_{2}^{-1}\sum_{i=1}^{n_{2}}\left(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g}))-\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(g))\right)\left(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g}))+\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(g))\right)\right\rvert. (196)

Since \max_{i=1,\ldots,{n_{2}}}|{\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(g))-\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g}))}|\overset{\mathrm{p}}{\to}0 as {n_{2}}\to\infty by Lemma 8, and since the censored indices \iota(\cdot) are bounded by \max(|a|,|b|), this term converges in probability to zero.

Next, for (193), since we have

n21|𝒚g(ι(𝑿𝜷^(g^)))2𝒚g(ι(𝑿𝜷^(g)))2|\displaystyle n_{2}^{-1}\left\lvert{\|\bm{y}-g(\iota(\bm{X}\hat{\bm{\beta}}(\hat{g})))\|^{2}}-{\|\bm{y}-g(\iota(\bm{X}\hat{\bm{\beta}}({g})))\|^{2}}\right\rvert (197)
=|n21i=1n2(g^(ι(𝑿i𝜷(g^)))g(ι(𝑿i𝜷(g))))(2yig^(ι(𝑿i𝜷(g^)))g(ι(𝑿i𝜷(g))))|,\displaystyle=\left\lvert n_{2}^{-1}\sum_{i=1}^{n_{2}}\left(\hat{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(\hat{g})))-{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(g)))\right)\left(2y_{i}-\hat{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(\hat{g})))-{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(g)))\right)\right\rvert, (198)

we should bound g^(ι(𝑿i𝜷(g^)))g(ι(𝑿i𝜷(g)))\hat{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(\hat{g})))-{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(g))). Indeed, using the triangle inequality reveals

|g^(ι(𝑿i𝜷(g^)))g(ι(𝑿i𝜷(g)))|\displaystyle\left\lvert\hat{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(\hat{g})))-{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(g)))\right\rvert (199)
|g(ι(𝑿i𝜷(g^)))g(ι(𝑿i𝜷(g)))|+|g^(ι(𝑿i𝜷(g^)))g(ι(𝑿i𝜷(g^)))|.\displaystyle\leq\left\lvert{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(\hat{g})))-{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(g)))\right\rvert+\left\lvert\hat{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(\hat{g})))-{g}(\iota(\bm{X}_{i}^{\top}\bm{\beta}(\hat{g})))\right\rvert. (200)

The first term on the right-hand side is op(1)o_{\mathrm{p}}(1) by the Lipschitz continuity of g()g(\cdot) and Lemma 8. Also, the second term is upper bounded by supx|g^(x)g(x)|\sup_{x}|\hat{g}(x)-g(x)|, and is op(1)o_{\mathrm{p}}(1) by Lemma 1.

To show (194), we first have

n21|tr(𝑽c(g^))tr(𝑽c(g))|\displaystyle n_{2}^{-1}\left\lvert{\mathrm{tr}(\bm{V}_{c}(\hat{g}))}-{\mathrm{tr}(\bm{V}_{c}(g))}\right\rvert (201)
n21|tr(𝑫c(g^)𝑫c(g))|\displaystyle\leq n_{2}^{-1}\left\lvert\mathrm{tr}\left(\bm{D}_{c}(\hat{g})-\bm{D}_{c}(g)\right)\right\rvert (202)
+n21|tr(𝑫c(g^)𝑿(𝑿𝑫c(g^)𝑿)1𝑿𝑫c(g^)𝑫c(g)𝑿(𝑿𝑫c(g)𝑿)1𝑿𝑫c(g))|.\displaystyle\quad+n_{2}^{-1}\left\lvert\mathrm{tr}\left(\bm{D}_{c}(\hat{g})\bm{X}(\bm{X}^{\top}\bm{D}_{c}(\hat{g})\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(\hat{g})-\bm{D}_{c}(g)\bm{X}(\bm{X}^{\top}\bm{D}_{c}(g)\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(g)\right)\right\rvert. (203)

For the first term, we have

n21|tr(𝑫c(g^)𝑫c(g))|\displaystyle n_{2}^{-1}\left\lvert\mathrm{tr}\left(\bm{D}_{c}(\hat{g})-\bm{D}_{c}(g)\right)\right\rvert (204)
\displaystyle\leq n_{2}^{-1}\sum_{i=1}^{n_{2}}\left\lvert\hat{g}^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g})))-g^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(g)))\right\rvert (205)
supaxb|g^(x)g(x)|+Bn21i=1n2|ι(𝑿i𝜷^(g^))ι(𝑿i𝜷^(g))|=op(1),\displaystyle\leq\sup_{a\leq x\leq b}|\hat{g}^{\prime}(x)-g^{\prime}(x)|+Bn_{2}^{-1}\sum_{i=1}^{n_{2}}\left\lvert\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g}))-\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}({g}))\right\rvert=o_{\mathrm{p}}(1), (206)

by Lemma 2 and Lemma 8. For the second term, the triangle inequality yields

n21|tr(𝑫c(g^)𝑿(𝑿𝑫c(g^)𝑿)1𝑿𝑫c(g^)𝑫c(g)𝑿(𝑿𝑫c(g)𝑿)1𝑿𝑫c(g))|\displaystyle n_{2}^{-1}\left\lvert\mathrm{tr}\left(\bm{D}_{c}(\hat{g})\bm{X}(\bm{X}^{\top}\bm{D}_{c}(\hat{g})\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(\hat{g})-\bm{D}_{c}(g)\bm{X}(\bm{X}^{\top}\bm{D}_{c}(g)\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(g)\right)\right\rvert (207)
n21|tr({𝑫c(g)𝑫c(g^)}𝑿(𝑿𝑫c(g)𝑿)1𝑿𝑫c(g))|\displaystyle\leq n_{2}^{-1}\left\lvert\mathrm{tr}\left(\left\{\bm{D}_{c}({g})-\bm{D}_{c}(\hat{g})\right\}\bm{X}(\bm{X}^{\top}\bm{D}_{c}(g)\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(g)\right)\right\rvert (208)
+n21|tr(𝑫c(g^)𝑿{(𝑿𝑫c(g)𝑿)1(𝑿𝑫c(g^)𝑿)1}𝑿𝑫c(g))|\displaystyle\quad+n_{2}^{-1}\left\lvert\mathrm{tr}\left(\bm{D}_{c}(\hat{g})\bm{X}\left\{(\bm{X}^{\top}\bm{D}_{c}(g)\bm{X})^{-1}-(\bm{X}^{\top}\bm{D}_{c}(\hat{g})\bm{X})^{-1}\right\}\bm{X}^{\top}\bm{D}_{c}(g)\right)\right\rvert (209)
+n21|tr(𝑫c(g^)𝑿(𝑿𝑫c(g^)𝑿)1𝑿{𝑫c(g)𝑫c(g^)})|.\displaystyle\quad+n_{2}^{-1}\left\lvert\mathrm{tr}\left(\bm{D}_{c}(\hat{g})\bm{X}(\bm{X}^{\top}\bm{D}_{c}(\hat{g})\bm{X})^{-1}\bm{X}^{\top}\left\{\bm{D}_{c}(g)-\bm{D}_{c}(\hat{g})\right\}\right)\right\rvert. (210)

Using the Cauchy-Schwarz inequality, (208) is bounded by

n21𝑫c(g)𝑫c(g^)F𝑿(𝑿𝑫c(g)𝑿)1𝑿𝑫c(g)F.\displaystyle n_{2}^{-1}\left\lVert\bm{D}_{c}(g)-\bm{D}_{c}(\hat{g})\right\rVert_{F}\left\lVert\bm{X}(\bm{X}^{\top}\bm{D}_{c}(g)\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(g)\right\rVert_{F}. (211)

Here, we have

n21/2𝑫c(g)𝑫c(g^)F\displaystyle n_{2}^{-1/2}\left\lVert\bm{D}_{c}(g)-\bm{D}_{c}(\hat{g})\right\rVert_{F} (212)
(1n2i=1n2{g(ι(𝑿i𝜷^(g)))g^(𝑿i𝜷^(g^))}2)1/2\displaystyle\leq\left(\frac{1}{{n_{2}}}\sum_{i=1}^{n_{2}}\left\{g^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(g)))-\hat{g}^{\prime}(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g}))\right\}^{2}\right)^{1/2} (213)
(1n2i=1n2[2{g(ι(𝑿i𝜷^(g)))g(𝑿i𝜷^(g^))}2+2{g(ι(𝑿i𝜷^(g^)))g^(𝑿i𝜷^(g^))}2])1/2\displaystyle\leq\left(\frac{1}{{n_{2}}}\sum_{i=1}^{n_{2}}\left[2\left\{g^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(g)))-{g}^{\prime}(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g}))\right\}^{2}+2\left\{{g}^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g})))-\hat{g}^{\prime}(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g}))\right\}^{2}\right]\right)^{1/2} (214)
(2n2i=1n2{g(ι(𝑿i𝜷^(g)))g(𝑿i𝜷^(g^))}2)1/2+(2n2i=1n2{g(ι(𝑿i𝜷^(g^)))g^(𝑿i𝜷^(g^))}2)1/2\displaystyle\leq\left(\frac{2}{{n_{2}}}\sum_{i=1}^{n_{2}}\left\{g^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(g)))-{g}^{\prime}(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g}))\right\}^{2}\right)^{1/2}+\left(\frac{2}{{n_{2}}}\sum_{i=1}^{n_{2}}\left\{{g}^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g})))-\hat{g}^{\prime}(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g}))\right\}^{2}\right)^{1/2} (215)
2Bmaxi=1,,n2|ι(𝑿i𝜷^(g))ι(𝑿i𝜷^(g^))|+2supaxb|g^(x)g(x)|=op(1).\displaystyle\leq 2B\max_{i=1,\ldots,{n_{2}}}\left\lvert\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(g))-\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g}))\right\rvert+2\sup_{a\leq x\leq b}|\hat{g}^{\prime}(x)-g^{\prime}(x)|=o_{\mathrm{p}}(1). (216)

by Lemma 2 and Lemma 8. Also, we have

n21/2𝑿(𝑿𝑫c(g)𝑿)1𝑿𝑫c(g)F\displaystyle n_{2}^{-1/2}\left\lVert\bm{X}(\bm{X}^{\top}\bm{D}_{c}(g)\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(g)\right\rVert_{F} (217)
=n21/2𝑫c(g)1/2𝑫c(g)1/2𝑿(𝑿𝑫c(g)𝑿)1𝑿𝑫c(g)1/2𝑫c(g)1/2F\displaystyle=n_{2}^{-1/2}\left\lVert\bm{D}_{c}(g)^{-1/2}\bm{D}_{c}(g)^{1/2}\bm{X}(\bm{X}^{\top}\bm{D}_{c}(g)\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(g)^{1/2}\bm{D}_{c}(g)^{1/2}\right\rVert_{F} (218)
n21/2𝑫c(g)1/2op𝑫c(g)1/2op𝑫c(g)1/2𝑿(𝑿𝑫c(g)𝑿)1𝑿𝑫c(g)1/2F\displaystyle\leq n_{2}^{-1/2}\left\lVert\bm{D}_{c}(g)^{-1/2}\right\rVert_{\rm op}\left\lVert\bm{D}_{c}(g)^{1/2}\right\rVert_{\rm op}\left\lVert\bm{D}_{c}(g)^{1/2}\bm{X}(\bm{X}^{\top}\bm{D}_{c}(g)\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(g)^{1/2}\right\rVert_{F} (219)
=n21/2𝑫c(g)1/2op𝑫c(g)1/2optr(𝑰n2)\displaystyle=n_{2}^{-1/2}\left\lVert\bm{D}_{c}(g)^{-1/2}\right\rVert_{\rm op}\left\lVert\bm{D}_{c}(g)^{1/2}\right\rVert_{\rm op}\sqrt{\mathrm{tr}(\bm{I}_{n_{2}})} (220)
=𝑫c(g)1/2op𝑫c(g)1/2op.\displaystyle=\left\lVert\bm{D}_{c}(g)^{-1/2}\right\rVert_{\rm op}\left\lVert\bm{D}_{c}(g)^{1/2}\right\rVert_{\rm op}. (221)

Since 𝑫c(g)1/2opsupxg(x)1/2\left\lVert\bm{D}_{c}(g)^{1/2}\right\rVert_{\rm op}\leq\sup_{x}g^{\prime}(x)^{1/2} and 𝑫c(g)1/2op(infxg(x))1/2\left\lVert\bm{D}_{c}(g)^{-1/2}\right\rVert_{\rm op}\leq(\inf_{x}g^{\prime}(x))^{-1/2} are constants by an assumption of g()g(\cdot), we conclude that (208) is op(1)o_{\mathrm{p}}(1). (210) is also shown to be op(1)o_{\mathrm{p}}(1) in a similar manner. Since 𝑨1𝑩1=𝑨1(𝑨𝑩)𝑩1\bm{A}^{-1}-\bm{B}^{-1}=-\bm{A}^{-1}(\bm{A}-\bm{B})\bm{B}^{-1} for two invertible matrices 𝑨\bm{A} and 𝑩\bm{B}, (209) can be rewritten as

n21|tr(𝑫c(g^)𝑿(𝑿𝑫c(g^)𝑿)1𝑿{𝑫c(g)𝑫c(g^)}𝑿(𝑿𝑫c(g)𝑿)1𝑿𝑫c(g))|,\displaystyle n_{2}^{-1}\left\lvert\mathrm{tr}\left(\bm{D}_{c}(\hat{g})\bm{X}(\bm{X}^{\top}\bm{D}_{c}(\hat{g})\bm{X})^{-1}\bm{X}^{\top}\left\{\bm{D}_{c}(g)-\bm{D}_{c}(\hat{g})\right\}\bm{X}(\bm{X}^{\top}\bm{D}_{c}(g)\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(g)\right)\right\rvert, (222)

and a technique similar to that used above provides the upper bound,

n21𝑫c(g^)op1/2𝑫c(g^)1op1/2𝑫c(g)op1/2𝑫c(g)1op1/2𝑫c(g)𝑫c(g^)op\displaystyle n_{2}^{-1}\left\lVert\bm{D}_{c}(\hat{g})\right\rVert_{\rm op}^{1/2}\left\lVert\bm{D}_{c}(\hat{g})^{-1}\right\rVert_{\rm op}^{1/2}\left\lVert\bm{D}_{c}({g})\right\rVert_{\rm op}^{1/2}\left\lVert\bm{D}_{c}({g})^{-1}\right\rVert_{\rm op}^{1/2}\left\lVert\bm{D}_{c}(g)-\bm{D}_{c}(\hat{g})\right\rVert_{\rm op} (223)
×𝑫c(g^)1/2𝑿(𝑿𝑫c(g^)𝑿)1𝑿𝑫c(g^)1/2F𝑫c(g)1/2𝑿(𝑿𝑫c(g)𝑿)1𝑿𝑫c(g)1/2F\displaystyle\quad\times\left\lVert\bm{D}_{c}(\hat{g})^{1/2}\bm{X}(\bm{X}^{\top}\bm{D}_{c}(\hat{g})\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}(\hat{g})^{1/2}\right\rVert_{F}\left\lVert\bm{D}_{c}({g})^{1/2}\bm{X}(\bm{X}^{\top}\bm{D}_{c}({g})\bm{X})^{-1}\bm{X}^{\top}\bm{D}_{c}({g})^{1/2}\right\rVert_{F} (224)
=𝑫c(g^)op1/2𝑫c(g^)1op1/2𝑫c(g)op1/2𝑫c(g)1op1/2𝑫c(g)𝑫c(g^)op.\displaystyle=\left\lVert\bm{D}_{c}(\hat{g})\right\rVert_{\rm op}^{1/2}\left\lVert\bm{D}_{c}(\hat{g})^{-1}\right\rVert_{\rm op}^{1/2}\left\lVert\bm{D}_{c}({g})\right\rVert_{\rm op}^{1/2}\left\lVert\bm{D}_{c}({g})^{-1}\right\rVert_{\rm op}^{1/2}\left\lVert\bm{D}_{c}(g)-\bm{D}_{c}(\hat{g})\right\rVert_{\rm op}. (225)

Here, \left\lVert\bm{D}_{c}({g})\right\rVert_{\rm op}^{1/2}\left\lVert\bm{D}_{c}({g})^{-1}\right\rVert_{\rm op}^{1/2} is a constant by assumption, and \left\lVert\bm{D}_{c}(\hat{g})\right\rVert_{\rm op}^{1/2}\left\lVert\bm{D}_{c}(\hat{g})^{-1}\right\rVert_{\rm op}^{1/2} is asymptotically bounded owing to the uniform consistency of \hat{g}^{\prime} for g^{\prime} established in Lemma 2. Finally, we have

𝑫c(g)𝑫c(g^)op\displaystyle\left\lVert\bm{D}_{c}(g)-\bm{D}_{c}(\hat{g})\right\rVert_{\rm op} (226)
=maxi=1,,n2|g(ι(𝑿i𝜷^(g)))g^(ι(𝑿i𝜷^(g^)))|\displaystyle=\max_{i=1,\ldots,{n_{2}}}\left\lvert g^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(g)))-\hat{g}^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g})))\right\rvert (227)
maxi=1,,n2|g(ι(𝑿i𝜷^(g)))g(ι(𝑿i𝜷^(g^)))|+maxi=1,,n2|g(ι(𝑿i𝜷^(g^)))g^(ι(𝑿i𝜷^(g^)))|\displaystyle\leq\max_{i=1,\ldots,{n_{2}}}\left\lvert g^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(g)))-g^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g})))\right\rvert+\max_{i=1,\ldots,{n_{2}}}\left\lvert g^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g})))-\hat{g}^{\prime}(\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g})))\right\rvert (228)
Bmaxi=1,,n2|ι(𝑿i𝜷^(g))ι(𝑿i𝜷^(g^))|+supaxb|g(x)g^(x)|=op(1),\displaystyle\leq B\max_{i=1,\ldots,{n_{2}}}\left\lvert\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(g))-\iota(\bm{X}_{i}^{\top}\hat{\bm{\beta}}(\hat{g}))\right\rvert+\sup_{a\leq x\leq b}|g^{\prime}(x)-\hat{g}^{\prime}(x)|=o_{\mathrm{p}}(1), (229)

by Lemma 2 and Lemma 8. Thus, (209) is op(1)o_{\mathrm{p}}(1). Combining these results concludes the proof. ∎

D.6. Proof of Proposition 2

Proof of Proposition 2.

Let 𝜷^:=𝜷^(g^)\hat{\bm{\beta}}:=\hat{\bm{\beta}}(\hat{g}) for simplicity of the notation. Recall that when J()𝟎J(\cdot)\equiv\bm{0},

μ~LS2\displaystyle\tilde{\mu}_{\mathrm{LS}}^{2} =n11𝑿𝜷~LS2(1κ1)σ~LS2,σ~LS2=κ1n1(1κ1)2𝒚𝑿𝜷~LS2,\displaystyle=n_{1}^{-1}\|{\bm{X}\tilde{\bm{\beta}}_{\mathrm{LS}}\|}^{2}-(1-\kappa_{1})\tilde{\sigma}_{\mathrm{LS}}^{2},\quad\tilde{\sigma}_{\mathrm{LS}}^{2}=\frac{\kappa_{1}}{n_{1}(1-\kappa_{1})^{2}}\|{\bm{y}-\bm{X}\tilde{\bm{\beta}}_{\mathrm{LS}}\|}^{2}, (230)
μ^2(g^)\displaystyle\hat{\mu}^{2}(\hat{g}) =n21𝑿𝜷^(g^)2(1κ2)σ^2(g^),σ^2(g^)=κ2n2v^λ2𝒚g^(𝑿𝜷^(g^))2.\displaystyle=n_{2}^{-1}\|\bm{X}\hat{\bm{\beta}}(\hat{g})\|^{2}-(1-\kappa_{2})\hat{\sigma}^{2}(\hat{g}),\quad\hat{\sigma}^{2}(\hat{g})=\frac{\kappa_{2}}{n_{2}\hat{v}_{\lambda}^{2}}{\|\bm{y}-\hat{g}(\bm{X}\hat{\bm{\beta}}(\hat{g}))\|^{2}}. (231)

Since we have

μ~LS2σ~LS2=𝑿𝜷~LS2κ1(1κ1)2𝒚𝑿𝜷~LS2(1κ1),μ^2(g^)σ^2(g^)=𝑿𝜷^(g^)2κ2v^λ2𝒚g^(𝑿𝜷^(g^))2(1κ2),\displaystyle\frac{\tilde{\mu}_{\mathrm{LS}}^{2}}{\tilde{\sigma}_{\mathrm{LS}}^{2}}=\frac{\|\bm{X}\tilde{\bm{\beta}}_{\mathrm{LS}}\|^{2}}{\frac{\kappa_{1}}{(1-\kappa_{1})^{2}}\|\bm{y}-\bm{X}\tilde{\bm{\beta}}_{\mathrm{LS}}\|^{2}}-(1-\kappa_{1}),\quad\frac{\hat{\mu}^{2}(\hat{g})}{\hat{\sigma}^{2}(\hat{g})}=\frac{\|\bm{X}\hat{\bm{\beta}}(\hat{g})\|^{2}}{\frac{\kappa_{2}}{\hat{v}_{\lambda}^{2}}\|\bm{y}-\hat{g}(\bm{X}\hat{\bm{\beta}}(\hat{g}))\|^{2}}-(1-\kappa_{2}), (232)

σ^2(g^)/μ^2(g^)<σ~LS2/μ~LS2\hat{\sigma}^{2}(\hat{g})/\hat{\mu}^{2}(\hat{g})<\tilde{\sigma}_{\mathrm{LS}}^{2}/\tilde{\mu}_{\mathrm{LS}}^{2} is equivalent to

𝑿𝜷^(g^)𝑿𝜷~LS|v^λ|1κ1𝒚𝑿𝜷~LS𝒚g^(𝑿𝜷^(g^))>1.\displaystyle{\frac{\|\bm{X}\hat{\bm{\beta}}(\hat{g})\|}{\|\bm{X}\tilde{\bm{\beta}}_{\mathrm{LS}}\|}\cdot\frac{\left\lvert\hat{v}_{\lambda}\right\rvert}{1-\kappa_{1}}\cdot\frac{\|\bm{y}-\bm{X}\tilde{\bm{\beta}}_{\mathrm{LS}}\|}{\|\bm{y}-\hat{g}(\bm{X}\hat{\bm{\beta}}(\hat{g}))\|}}>1. (233)
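
In applications, the displayed criterion can be evaluated directly from fitted quantities to decide whether the link-aware estimator attains a smaller normalized variance than the least-squares baseline. A minimal sketch, taking the required norms and scalars as precomputed inputs (all numbers below are placeholders):

def comparison_statistic(norm_X_beta_hat, norm_X_beta_ls, v_lambda, kappa1,
                         norm_resid_ls, norm_resid_link):
    # Left-hand side of (233); a value greater than one indicates
    # sigma_hat^2 / mu_hat^2 < sigma_LS^2 / mu_LS^2.
    return (norm_X_beta_hat / norm_X_beta_ls) * (abs(v_lambda) / (1.0 - kappa1)) \
        * (norm_resid_ls / norm_resid_link)

# Placeholder values; in practice the norms come from the fitted models.
stat = comparison_statistic(31.0, 30.0, 0.8, 0.25, 22.0, 18.0)
print(stat, stat > 1.0)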

Next, when J(𝒃)=λ𝒃2J(\bm{b})=\lambda\|\bm{b}\|^{2}, recall that

μ~2\displaystyle\tilde{\mu}^{2} =𝜷~2σ~2,σ~2=κ1n11𝒚𝑿𝜷~2(v~2+λ1)2\displaystyle=\|\tilde{\bm{\beta}}\|^{2}-\tilde{\sigma}^{2},\quad\tilde{\sigma}^{2}=\kappa_{1}n_{1}^{-1}\|\bm{y}-\bm{X}\tilde{\bm{\beta}}\|^{2}(\tilde{v}^{2}+\lambda_{1})^{-2} (234)
\displaystyle\hat{\mu}^{2}(\hat{g}) \displaystyle=\|\hat{\bm{\beta}}(\hat{g})\|^{2}-\hat{\sigma}^{2}(\hat{g}),\quad\hat{\sigma}^{2}(\hat{g})=\kappa_{2}n_{2}^{-1}\|\bm{y}-\bm{X}\hat{\bm{\beta}}(\hat{g})\|^{2}(\hat{v}_{\lambda}^{2}+\lambda)^{-2}. (235)

Thus, arguing in the same way as in the case J(\cdot)\equiv\bm{0}, we conclude the proof. ∎

References

  • Agarwal et al. (2014) Agarwal, A., Kakade, S., Karampatziakis, N., Song, L. and Valiant, G. (2014) Least squares revisited: Scalable approaches for multi-class prediction, in International Conference on Machine Learning, PMLR, pp. 541–549.
  • Alquier and Biau (2013) Alquier, P. and Biau, G. (2013) Sparse single-index model, Journal of Machine Learning Research, 14, 243–280.
  • Auer et al. (1995) Auer, P., Herbster, M. and Warmuth, M. K. (1995) Exponentially many local minima for single neurons, Advances in neural information processing systems, 8.
  • Balabdaoui et al. (2019) Balabdaoui, F., Durot, C. and Jankowski, H. (2019) Least squares estimation in the monotone single index model, Bernoulli, 25, 3276–3310.
  • Barbier et al. (2019) Barbier, J., Krzakala, F., Macris, N., Miolane, L. and Zdeborová, L. (2019) Optimal errors and phase transitions in high-dimensional generalized linear models, Proceedings of the National Academy of Sciences, 116, 5451–5460.
  • Bayati et al. (2013) Bayati, M., Erdogdu, M. A. and Montanari, A. (2013) Estimating lasso risk and noise level, Advances in Neural Information Processing Systems, 26.
  • Bayati and Montanari (2011a) Bayati, M. and Montanari, A. (2011a) The dynamics of message passing on dense graphs, with applications to compressed sensing, IEEE Transactions on Information Theory, 57, 764–785.
  • Bayati and Montanari (2011b) Bayati, M. and Montanari, A. (2011b) The lasso risk for gaussian matrices, IEEE Transactions on Information Theory, 58, 1997–2017.
  • Bellec (2022) Bellec, P. C. (2022) Observable adjustments in single-index models for regularized m-estimators, arXiv preprint arXiv:2204.06990.
  • Bellec and Shen (2022) Bellec, P. C. and Shen, Y. (2022) Derivatives and residual distribution of regularized m-estimators with application to adaptive tuning, in Conference on Learning Theory, PMLR, pp. 1912–1947.
  • Bellec and Zhang (2021) Bellec, P. C. and Zhang, C.-H. (2021) Second-order stein: Sure for sure and other applications in high-dimensional inference, The Annals of Statistics, 49, 1864–1903.
  • Benjamini and Hochberg (1995) Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal statistical society: series B (Methodological), 57, 289–300.
  • Bietti et al. (2022) Bietti, A., Bruna, J., Sanford, C. and Song, M. J. (2022) Learning single-index models with shallow neural networks, Advances in Neural Information Processing Systems, 35, 9768–9783.
  • Bolthausen (2014) Bolthausen, E. (2014) An iterative construction of solutions of the tap equations for the sherrington–kirkpatrick model, Communications in Mathematical Physics, 325, 333–366.
  • Candes et al. (2018) Candes, E., Fan, Y., Janson, L. and Lv, J. (2018) Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection, Journal of the Royal Statistical Society Series B: Statistical Methodology, 80, 551–577.
  • Candès and Sur (2020) Candès, E. J. and Sur, P. (2020) The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression, The Annals of Statistics, 48, 27–42.
  • Charbonneau et al. (2023) Charbonneau, P., Marinari, E., Parisi, G., Ricci-tersenghi, F., Sicuro, G., Zamponi, F. and Mezard, M. (2023) Spin Glass Theory and Far Beyond: Replica Symmetry Breaking after 40 Years, World Scientific.
  • Chatterjee (2009) Chatterjee, S. (2009) Fluctuations of eigenvalues and second order poincaré inequalities, Probability Theory and Related Fields, 143, 1–40.
  • Chernozhukov et al. (2009) Chernozhukov, V., Fernandez-Val, I. and Galichon, A. (2009) Improving point and interval estimators of monotone functions by rearrangement, Biometrika, 96, 559–575.
  • Cilia et al. (2018) Cilia, N. D., De Stefano, C., Fontanella, F. and Di Freca, A. S. (2018) An experimental protocol to support cognitive impairment diagnosis by using handwriting analysis, Procedia Computer Science, 141, 466–471.
  • Dai et al. (2023) Dai, C., Lin, B., Xing, X. and Liu, J. S. (2023) False discovery rate control via data splitting, Journal of the American Statistical Association, 118, 2503–2520.
  • Dalalyan et al. (2008) Dalalyan, A. S., Juditsky, A. and Spokoiny, V. (2008) A new algorithm for estimating the effective dimension-reduction subspace, The Journal of Machine Learning Research, 9, 1647–1678.
  • Deshpande et al. (2017) Deshpande, Y., Abbe, E. and Montanari, A. (2017) Asymptotic mutual information for the balanced binary stochastic block model, Information and Inference: A Journal of the IMA, 6, 125–170.
  • Donoho and Montanari (2016) Donoho, D. and Montanari, A. (2016) High dimensional robust m-estimation: Asymptotic variance via approximate message passing, Probability Theory and Related Fields, 166, 935–969.
  • Donoho et al. (2009) Donoho, D. L., Maleki, A. and Montanari, A. (2009) Message-passing algorithms for compressed sensing, Proceedings of the National Academy of Sciences, 106, 18914–18919.
  • Dua and Graff (2017) Dua, D. and Graff, C. (2017) UCI machine learning repository.
  • Eftekhari et al. (2021) Eftekhari, H., Banerjee, M. and Ritov, Y. (2021) Inference in high-dimensional single-index models under symmetric designs, The Journal of Machine Learning Research, 22, 1247–1309.
  • El Karoui (2018) El Karoui, N. (2018) On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators, Probability Theory and Related Fields, 170, 95–175.
  • El Karoui et al. (2013) El Karoui, N., Bean, D., Bickel, P. J., Lim, C. and Yu, B. (2013) On robust regression with high-dimensional predictors, Proceedings of the National Academy of Sciences, 110, 14557–14562.
  • Fan and Truong (1993) Fan, J. and Truong, Y. K. (1993) Nonparametric regression with errors in variables, The Annals of Statistics, pp. 1900–1925.
  • Fan et al. (2023) Fan, J., Yang, Z. and Yu, M. (2023) Understanding implicit regularization in over-parameterized single index model, Journal of the American Statistical Association, 118, 2315–2328.
  • Feng et al. (2022) Feng, O. Y., Venkataramanan, R., Rush, C. and Samworth, R. J. (2022) A unifying tutorial on approximate message passing, Foundations and Trends® in Machine Learning, 15, 335–536.
  • Foster et al. (2013) Foster, J. C., Taylor, J. M. and Nan, B. (2013) Variable selection in monotone single-index models via the adaptive lasso, Statistics in medicine, 32, 3944–3954.
  • Goggin (1994) Goggin, E. M. (1994) Convergence in distribution of conditional expectations, The Annals of Probability, pp. 1097–1114.
  • Guo and Cheng (2022) Guo, X. and Cheng, G. (2022) Moderate-dimensional inferences on quadratic functionals in ordinary least squares, Journal of the American Statistical Association, 117, 1931–1950.
  • Härdle and Stoker (1989) Härdle, W. and Stoker, T. M. (1989) Investigating smooth multiple regression by the method of average derivatives, Journal of the American statistical Association, 84, 986–995.
  • Hastie et al. (2022) Hastie, T., Montanari, A., Rosset, S. and Tibshirani, R. J. (2022) Surprises in high-dimensional ridgeless least squares interpolation, The Annals of Statistics, 50, 949–986.
  • Horowitz (2009) Horowitz, J. L. (2009) Semiparametric and nonparametric methods in econometrics, vol. 12, Springer.
  • Hristache et al. (2001) Hristache, M., Juditsky, A. and Spokoiny, V. (2001) Direct estimation of the index coefficient in a single-index model, Annals of Statistics, pp. 595–623.
  • Ichimura (1993) Ichimura, H. (1993) Semiparametric least squares (sls) and weighted sls estimation of single-index models, Journal of econometrics, 58, 71–120.
  • Klein and Spady (1993) Klein, R. W. and Spady, R. H. (1993) An efficient semiparametric estimator for binary response models, Econometrica: Journal of the Econometric Society, pp. 387–421.
  • Krzakala et al. (2012) Krzakala, F., Mézard, M., Sausset, F., Sun, Y. and Zdeborová, L. (2012) Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices, Journal of Statistical Mechanics: Theory and Experiment, 2012, P08009.
  • Lei et al. (2018) Lei, L., Bickel, P. J. and El Karoui, N. (2018) Asymptotics for high dimensional regression m-estimates: fixed design results, Probability Theory and Related Fields, 172, 983–1079.
  • Li and Duan (1989) Li, K.-C. and Duan, N. (1989) Regression analysis under link violation, The Annals of Statistics, 17, 1009–1052.
  • Li and Racine (2023) Li, Q. and Racine, J. S. (2023) Nonparametric econometrics: theory and practice, Princeton University Press.
  • Li and Sur (2023) Li, Y. and Sur, P. (2023) Spectrum-aware adjustment: A new debiasing framework with applications to principal components regression, arXiv preprint arXiv:2309.07810.
  • Loureiro et al. (2021) Loureiro, B., Gerbelot, C., Cui, H., Goldt, S., Krzakala, F., Mezard, M. and Zdeborová, L. (2021) Learning curves of generic features maps for realistic datasets with a teacher-student model, Advances in Neural Information Processing Systems, 34, 18137–18151.
  • Macris et al. (2020) Macris, N., Rush, C. et al. (2020) All-or-nothing statistical and computational phase transitions in sparse spiked matrix estimation, Advances in Neural Information Processing Systems, 33, 14915–14926.
  • Matzkin (1991) Matzkin, R. L. (1991) Semiparametric estimation of monotone and concave utility functions for polychotomous choice models, Econometrica: Journal of the Econometric Society, pp. 1315–1327.
  • Mei and Montanari (2022) Mei, S. and Montanari, A. (2022) The generalization error of random features regression: Precise asymptotics and the double descent curve, Communications on Pure and Applied Mathematics, 75, 667–766.
  • Mézard et al. (1987) Mézard, M., Parisi, G. and Virasoro, M. A. (1987) Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications, vol. 9, World Scientific Publishing Company.
  • Miolane and Montanari (2021) Miolane, L. and Montanari, A. (2021) The distribution of the lasso: Uniform control over sparse balls and adaptive parameter tuning, The Annals of Statistics, 49, 2313–2335.
  • Montanari et al. (2019) Montanari, A., Ruan, F., Sohn, Y. and Yan, J. (2019) The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime, arXiv preprint arXiv:1911.01544.
  • Montanari and Venkataramanan (2021) Montanari, A. and Venkataramanan, R. (2021) Estimation of low-rank matrices via approximate message passing, The Annals of Statistics, 49, 321–345.
  • Mousavi et al. (2018) Mousavi, A., Maleki, A. and Baraniuk, R. G. (2018) Consistent parameter estimation for LASSO and approximate message passing, The Annals of Statistics, 46, 119 – 148.
  • Nishiyama and Robinson (2005) Nishiyama, Y. and Robinson, P. M. (2005) The bootstrap and the edgeworth correction for semiparametric averaged derivatives, Econometrica, 73, 903–948.
  • Powell et al. (1989) Powell, J. L., Stock, J. H. and Stoker, T. M. (1989) Semiparametric estimation of index coefficients, Econometrica: Journal of the Econometric Society, pp. 1403–1430.
  • Rangan (2011) Rangan, S. (2011) Generalized approximate message passing for estimation with random linear mixing, in 2011 IEEE International Symposium on Information Theory Proceedings, IEEE, pp. 2168–2172.
  • Sakar and Sakar (2018) Sakar, C. and Sakar, B. (2018) Parkinson’s Disease Classification, UCI Machine Learning Repository.
  • Salehi et al. (2019) Salehi, F., Abbasi, E. and Hassibi, B. (2019) The impact of regularization on high-dimensional logistic regression, Advances in Neural Information Processing Systems, 32.
  • Sawaya et al. (2023) Sawaya, K., Uematsu, Y. and Imaizumi, M. (2023) Feasible adjustments of statistical inference in high-dimensional generalized linear models, arXiv preprint arXiv:2305.17731.
  • Stefanski and Carroll (1990) Stefanski, L. A. and Carroll, R. J. (1990) Deconvolving kernel density estimators, Statistics, 21, 169–184.
  • Sur and Candès (2019) Sur, P. and Candès, E. J. (2019) A modern maximum-likelihood theory for high-dimensional logistic regression, Proceedings of the National Academy of Sciences, 116, 14516–14525.
  • Sur et al. (2019) Sur, P., Chen, Y. and Candès, E. J. (2019) The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square, Probability theory and related fields, 175, 487–558.
  • Takahashi and Kabashima (2018) Takahashi, T. and Kabashima, Y. (2018) A statistical mechanics approach to de-biasing and uncertainty estimation in lasso for random measurements, Journal of Statistical Mechanics: Theory and Experiment, 2018, 073405.
  • Tan and Bellec (2023) Tan, K. and Bellec, P. C. (2023) Multinomial logistic regression: Asymptotic normality on null covariates in high-dimensions, arXiv preprint arXiv:2305.17825.
  • Thrampoulidis et al. (2018) Thrampoulidis, C., Abbasi, E. and Hassibi, B. (2018) Precise error analysis of regularized m-estimators in high dimensions, IEEE Transactions on Information Theory, 64, 5592–5628.
  • Wan et al. (2017) Wan, Y., Datta, S., Lee, J. J. and Kong, M. (2017) Monotonic single-index models to assess drug interactions, Statistics in medicine, 36, 655–670.
  • Xing et al. (2023) Xing, X., Zhao, Z. and Liu, J. S. (2023) Controlling false discovery rate using gaussian mirrors, Journal of the American Statistical Association, 118, 222–241.
  • Yadlowsky et al. (2021) Yadlowsky, S., Yun, T., McLean, C. Y. and D’Amour, A. (2021) Sloe: A faster method for statistical inference in high-dimensional logistic regression, Advances in Neural Information Processing Systems, 34, 29517–29528.
  • Yang and Hu (2021) Yang, G. and Hu, E. J. (2021) Tensor programs iv: Feature learning in infinite-width neural networks, in International Conference on Machine Learning, PMLR, pp. 11727–11737.
  • Zhao et al. (2022) Zhao, Q., Sur, P. and Candes, E. J. (2022) The asymptotic distribution of the mle in high-dimensional logistic models: Arbitrary covariance, Bernoulli, 28, 1835–1861.