
Multiple–index Nonstationary Time Series Models:
Robust Estimation Theory and Practice

Chaohua Dong$^{a}$, Jiti Gao$^{b}$, Bin Peng$^{b}$ and Yundong Tu$^{c,*}$
$^{a}$Zhongnan University of Economics and Law, China
$^{b}$Monash University, Australia
$^{c}$Peking University, China
$^{*}$Corresponding author. Address: Guanghua School of Management and Center for Statistical Science, Peking University, Beijing, 100871, China. E-mail: yundong.tu@gsm.pku.edu.cn.
(July 30, 2025)
Abstract

This paper proposes a class of parametric multiple-index time series models that involve linear combinations of time trends, stationary variables and unit root processes as regressors. The inclusion of the three different types of time series, along with the use of a multiple-index structure for these variables to circumvent the curse of dimensionality, is due to both theoretical and practical considerations. M-type estimators (including the OLS, LAD, Huber, quantile and expectile estimators, among others) of the index vectors are proposed, and their asymptotic properties are established with the aid of the generalized function approach, which accommodates a wide class of loss functions that may not necessarily be differentiable at every point. The proposed multiple-index model is then applied to study stock return predictability, revealing strong nonlinear predictability under various loss measures. Monte Carlo simulations are also included to evaluate the finite-sample performance of the proposed estimators.

Key words: Additive index model; robust estimation; stationary process; time trend; unit root process

JEL classification: C13, C22

1 Introduction

Robust model building and the determination of good models for estimation and prediction are an important part of much empirical research in economics and finance. Since real-world data often display characteristics such as nonlinearity, nonstationarity and spatiality, an important aspect of model building and determination is how to take such characteristics into account. Meanwhile, the increasing availability of economic and financial data in recent years has been accompanied by increasing interest in both theoretical research and empirical data analysis in diverse fields of science, as well as more broadly within the social and medical sciences.

One challenge is how to select the most relevant variables when many variables are available in the data. An initial step is to explore a simple method associated with principal component analysis. Different from high-dimensional data problems in, for example, the medical sciences, we need to pay attention to many econometric issues, such as multicollinearity, when aggregating real datasets for model building. Meanwhile, an additional complexity is that many original data series appear to be highly dependent and even nonstationary. Simply using a transformed version of the original data may completely ignore some key features of the original data, such as trending behaviours, which can be of key interest when modelling time series data in climatology, energy, epidemiology, health and macroeconomics. More importantly, transformed versions of two different datasets may have similar features, while the original datasets can have very different structures. Another challenge is the development of new models and estimation methods that are not only robust in theory, but also capable of offering user-friendly tools for both the computational and implementational procedures of empirical data analysis in practice.

There are several levels of robustness involved in the discussion of this paper. First, the proposed model building and estimation methods are robust to data with outliers, spatial and temporal dependence, and different types of nonstationarity. Second, the class of models we propose is applicable to several types of data, such as stationary time series data with deterministic trends, and highly dependent and nonstationary time series data. Third, the proposed robust estimation procedures are iterative and computationally tractable. Fourth, the proposed models and robust estimation methods, as well as the associated computational algorithms, are user-friendly and easily implementable by empirical researchers and practitioners. While the relevant literature accounts for one or more of the model building and estimation issues we address in this paper, the novelty of this paper is to address them all simultaneously, producing new models and feasible estimation methods with computational tractability as well as empirical relevance and applicability.

We now propose a cointegrating relationship among the time series $\{y_t, x_t, z_t, \tau_t,\ 1\leq t\leq n\}$, where $y_t$ is a response variable, and we first specify the respective structures of $x_t$ and $z_t$ by

$$x_t = x_{t-1} + w_t \quad\mbox{and}\quad z_t = h(\tau_t, v_t), \tag{1.1}$$

where $\tau_t = t/n$, $w_t$ is a linear vector process (defined precisely in Assumption A below), and $h(\cdot)$ is a known vector of functions driven by the time trend $\tau_t$ and the stationary $v_t$.

In the formulation of (1.1), $x_t$ is an $I(1)$ vector, and it is well known that $x_t$ may capture 'large' trending behaviours. Meanwhile, $z_t$ is a function of $(\tau_t, v_t)$ that can capture 'small' fluctuations. The general form of $z_t$ accommodates possible nonlinearity, which is quite common in nonstationary time series settings. Such a general structure covers some important special cases, such as (1) $z_t = h(\tau_t) + v_t$, a stationary vector superimposed on a vector of time trends; (2) $z_t = \sigma(\tau_t)^{\top} v_t$, a combination of $v_t$ with varying coefficients ($\sigma(\cdot)$ may be a matrix); and (3) $z_t = h(\tau_t) + \sigma(\tau_t)^{\top} v_t$, which accommodates both deterministic and stochastic trending behaviours in the mean and volatility components, respectively.

We therefore propose a parametric multiple-index model of the form:

$$y_t = g_1(x_t^{\top}\theta_0) + g_2(z_t^{\top}\beta_0) + e_t, \tag{1.2}$$

for $t=1,2,\cdots,n$, where $\theta_0\in\mathbb{R}^{d_1}$ and $\beta_0\in\mathbb{R}^{d_2}$ are unknown parameter vectors, both $g_1(\cdot)$ and $g_2(\cdot)$ are parametrically known functions, and $e_t$ is an error term. Different from existing settings, model (1.2) takes into account not only the 'large' stochastic trending component driven by $x_t^{\top}\theta_0$, but also the 'small' slowly varying component $z_t^{\top}\beta_0$ mainly represented by the deterministic time trend $\tau_t$. In doing so, one may regard $g_2(z_t^{\top}\beta_0)$ as the 'conditional mean' part of a decomposition of the 'original' error term $\xi_t = g_2(z_t^{\top}\beta_0) + e_t$, introduced to eliminate any potential endogeneity involved in $e_t$. As a consequence, one may assume that $e_t$ is uncorrelated with $(x_t^{\top}\theta_0, z_t^{\top}\beta_0)$, although $\xi_t$ may be correlated with $x_t^{\top}\theta_0$. The focus of this paper is the case where $x_t^{\top}\theta_0$ is nonstationary (rather than cointegrated), and the primary goal is to estimate the unknown parameters by M-estimation to take advantage of its robustness for modelling nonlinear functional forms of nonstationary time series data.

To the best of our knowledge, model (1.2) has not been proposed and studied in the relevant literature. For the univariate regressor setting, Chang et al. (2001) investigate a parametric additive model that accommodates a time trend together with univariate stationary and unit root regressors, and estimate the unknown parameters by a nonlinear least-squares method. Chang and Park (2003) study a parametric model that specifies the regression function as a mixture of a linear function and a nonlinear function of a unit root process that appears in a single-index structure. Meanwhile, Dong et al. (2016) consider the estimation of a partially linear single-index model of an $I(1)$ process vector. In addition, there are a couple of other models that may be considered nonparametric counterparts of model (1.2): Dong and Linton (2018) and Dong et al. (2021) extend Chang et al. (2001) to additive nonparametric and fully nonparametric models, respectively, involving a time trend as well as stationary and univariate nonstationary regressors.

With respect to M-estimation, there is an extensive literature. Starting from the seminal work of Huber (1964), M-estimation methods have gradually been extended to general loss functions that circumvent some drawbacks caused by outliers and the like. The resulting estimators are known as $M$- and $Z$-estimators (see, for example, Chapter 3 of van der Vaart (1996)).

Using M-estimation, the unknown parameters $(\theta_0^{\top}, \beta_0^{\top})^{\top}$ are estimated by minimizing

$$L_n(\theta,\beta) = \sum_{t=1}^{n}\rho\big(y_t - g(x_t^{\top}\theta, z_t^{\top}\beta)\big), \tag{1.3}$$

where $g(u,v) = g_1(u) + g_2(v)$ and $\rho(\cdot)$ is a loss function. Possible choices of $\rho(\cdot)$ include, but are not limited to, the following (implemented in the sketch after this list):

(1) Squared error loss: $\rho(u) = u^2$.

(2) $L_p$ norm: $\rho(u) = |u|^p$ with $p > 1$.

(3) Huber's loss (Huber, 1964): for fixed $c > 0$,
$$\rho_c(u) = \begin{cases} \frac{1}{2}u^2, & |u|\leq c,\\ c|u| - \frac{1}{2}c^2, & |u| > c. \end{cases}$$

(4) Least absolute deviation (LAD): $\rho(u) = |u|$.

(5) Quantile loss (Koenker and Bassett, 1978): $\rho_\tau(u) = u(\tau - I(u<0))$ for $\tau\in(0,1)$.

(6) Expectile loss (Newey and Powell, 1986): $\rho_\tau(u) = |\tau - I(u<0)|\,u^2$ for $\tau\in(0,1)$.
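For concreteness, the following Python sketch implements the six loss functions above in vectorized form; the function names are ours, and the default tuning constants ($p$, $c$, $\tau$) are illustrative only.

```python
import numpy as np

def squared_loss(u):
    return np.asarray(u) ** 2                                   # (1)

def lp_loss(u, p=1.5):
    return np.abs(u) ** p                                       # (2), requires p > 1

def huber_loss(u, c=1.25):                                      # (3); c = 1.25 echoes Sections 4-5
    u = np.asarray(u)
    return np.where(np.abs(u) <= c, 0.5 * u**2, c * np.abs(u) - 0.5 * c**2)

def lad_loss(u):
    return np.abs(u)                                            # (4)

def quantile_loss(u, tau=0.5):                                  # (5); tau = 0.5 gives |u|/2
    u = np.asarray(u)
    return u * (tau - np.where(u < 0, 1.0, 0.0))

def expectile_loss(u, tau=0.5):                                 # (6)
    u = np.asarray(u)
    return np.abs(tau - np.where(u < 0, 1.0, 0.0)) * u**2
```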

The loss function $\rho(\cdot)$ here may not be differentiable at every point on $\mathbb{R}$. Often in the literature a subgradient is introduced to tackle these potentially erratic points (Chen et al., 2010). Alternatively, some papers work with a population criterion function such as $\mathbb{E}[\ell(Z,\theta)]$ instead of $\ell(Z,\theta)$ itself when $Z$ is a stationary variable. The rationale why the population criterion works is that, although $\ell(Z,\theta)$ may not be differentiable at some points, $\mathbb{E}[\ell(Z,\theta)]$ may be smooth at these points (see, for example, Chen et al. (2014, p. 640)).

However, the model in this paper has nonstationary regressors that make the population criterion approach fail to work; moreover, the so-called "subgradient approach" depends on the particular form of the loss function under consideration. We therefore propose a generalized function approach.

The generalized function approach has the following advantages: (1) generalized functions include all ordinary functions that are locally integrable; and (2) more importantly, in the generalized function sense, generalized functions always have derivatives of all orders. This makes it possible to use Taylor expansions of generalized functions for these loss functions even though they may not be smooth at some points (see Stanković, 1996).

Phillips (1991, 1995) is among the first in econometrics to develop a generalized function approach for studying the asymptotic properties of LAD and M-estimators in the linear model setting. In particular, these two papers establish Taylor expansions for generalized functions, as well as (weak and strong) laws of large numbers and a central limit theorem in the generalized function context. In addition, Zinde-Walsh (2008) uses a generalized function approach in kernel estimation to estimate a density that may not be differentiable at some points. Notice also that generalized function (concretely, delta-convergent sequence) approaches have been used in earlier studies in statistics, such as Walter and Blum (1979) and Susarla and Walter (1981), where the authors use the property of a delta sequence converging to the Dirac delta function $\delta(x)$ to estimate a density function. Moreover, the rationale behind the conventional kernel estimation method is also a delta-convergent sequence, $K_h(x)\equiv h^{-1}K(x/h)\to\delta(x)$ as $h\to 0$ in the distributional sense, where $K(\cdot)$ is a density (see the theorem in Kanwal, 1983, p. 62). Thus, if $g(\cdot)$ is continuous at $x_0$, then $\int K_h(x-x_0)g(x)dx\to g(x_0)$ as $h\to 0$, and kernel estimation in econometrics and statistics is merely a sample version of this integral.
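As a quick numerical illustration of this last point (a sketch; the Gaussian kernel, the test function $g(x)=\cos(x)$ and the point $x_0$ are arbitrary choices of ours):

```python
import numpy as np
from scipy.integrate import quad

K = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # a density used as kernel
g = lambda x: np.cos(x)                                 # any function continuous at x0
x0 = 0.3

for h in (1.0, 0.1, 0.01):
    # integral of K_h(x - x0) g(x): approaches g(x0) = cos(0.3) ~ 0.9553 as h -> 0
    val, _ = quad(lambda x: K((x - x0) / h) / h * g(x), x0 - 12 * h, x0 + 12 * h)
    print(f"h = {h:5.2f}:  {val:.4f}")
```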

In summary, this paper makes the following contributions to the relevant literature: (i) model (1.2) accommodates three different types of time series vectors in a multiple-index form; (ii) M-estimation is developed for the unknown parameters and covers a number of loss functions as special cases, including the OLS, LAD, quantile and expectile estimators; (iii) non-differentiable loss functions are handled by a generalized function approach that provides a unified framework for M-estimation; (iv) the empirical findings show that the proposed model outperforms its natural competitors in terms of understanding and evaluating stock return predictability, from the points of view of both the pseudo out-of-sample $R^2$ and distributional quantiles; and (v) the finite-sample simulation results show that the proposed model and estimation theory work well numerically.

The rest of the paper is organized as follows. Section 2 introduces the estimation method and the necessary assumptions, before an asymptotic theory is established in Section 3. An empirical analysis of stock return predictability is given in Section 4. Simulation results are then given in Section 5. Section 6 concludes. Useful definitions and properties of generalized functions are given in Appendix A, and Appendices B and C contain all the necessary mathematical technicalities.

2 Assumptions for M-estimators

Let $\rho(\cdot)$ be a convex nonnegative function defined on $\mathbb{R}$ that plays the role of a loss function. Given the objective function $L_n(\theta,\beta)$ in equation (1.3), we define the M-estimators $(\widehat\theta, \widehat\beta)$ of $(\theta_0,\beta_0)$ as any values that satisfy

$$L_n(\widehat\theta, \widehat\beta) \leq \min_{\theta,\beta}\, L_n(\theta,\beta) + o_P(\epsilon_n^2), \tag{2.1}$$

where $\epsilon_n\to 0$ as $n\to\infty$.
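As a computational sketch (not the paper's algorithm; the helper name m_estimate and the choice of Nelder-Mead are ours), one may obtain $(\widehat\theta, \widehat\beta)$ by direct numerical minimization of (1.3); a derivative-free method sidesteps the possible non-differentiability of $\rho$, and any optimizer whose tolerance is $o_P(\epsilon_n^2)$ satisfies (2.1).

```python
import numpy as np
from scipy.optimize import minimize

def m_estimate(y, x, z, g1, g2, rho, theta0, beta0):
    """Minimize L_n(theta, beta) = sum_t rho(y_t - g1(x_t' theta) - g2(z_t' beta)).

    y: (n,), x: (n, d1), z: (n, d2); theta0, beta0 are starting values.
    """
    d1 = x.shape[1]

    def objective(param):
        theta, beta = param[:d1], param[d1:]
        resid = y - g1(x @ theta) - g2(z @ beta)
        return np.sum(rho(resid))

    start = np.concatenate([theta0, beta0])
    # Nelder-Mead avoids derivatives of rho, which may fail to exist pointwise
    res = minimize(objective, start, method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 20000})
    return res.x[:d1], res.x[d1:]
```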

Our theory relies on the assumption that $\rho(\cdot)$ is twice differentiable. When $\rho'(\cdot)$ and/or $\rho''(\cdot)$ do not exist at some points, we invoke the derivatives in the generalized function sense, as discussed in Appendix A below. Indeed, our theory depends heavily on the theory of generalized functions.

We now give some examples of derivatives in the generalized function sense that will be used in the sequel.

Examples.  The function given by

$$H(x) = \begin{cases} 1, & x > 0,\\ 0, & x < 0, \end{cases}$$

is called the Heaviside function. Its value at zero is usually defined as $1/2$, while sometimes it is taken as $c$ for $0 < c < 1$, written as $H_c(x)$. Simply by the definition of the derivative of a generalized function, $H'(x) = \delta(x)$, the so-called Dirac delta function.

(a) Let $\rho(u) = |u|$ (LAD loss). Again, by the definition of the derivative of a generalized function, $\rho'(u) = -H(-u) + H(u)$, and thus $\rho''(u) = \delta(-u) + \delta(u) = 2\delta(u)$, where we use $\delta(-u) = \delta(u)$ as given in Kanwal (1983, p. 54).

(b) Let $\rho(u) = u(\tau - I(u<0))$ for some $0 < \tau < 1$ (quantile loss). Similarly, we have $\rho'(u) = \tau - I(u<0) = \tau H(u) + (\tau - 1)H(-u)$. Hence, $\rho''(u) = \tau\delta(u) - (\tau-1)\delta(-u) = \delta(u)$.

(c) Huber's loss function $\rho_c(u)$ has the following derivative:

$$\rho'(u) = \begin{cases} u, & |u|\leq c,\\ c\,\mathrm{sgn}(u), & |u| > c. \end{cases}$$

Moreover, $\rho''(u) = I(|u|\leq c)$ in the generalized function sense.

(d) The expectile loss function $\rho(u) = |\tau - I(u<0)|\,u^2$ is a smooth function in the ordinary sense with $\rho'(u) = 2|\tau - I(u<0)|\,u$, whereas $\rho'(u)$ is not smooth at $u = 0$. By definition, $\rho''(u) = 2|\tau - I(u<0)|$.

(e) For the $L_p$ norm loss $\rho(u) = |u|^p$ with $p > 1$, we have $\rho'(u) = p|u|^{p-1}\mathrm{sgn}(u)$. However, if $1 < p < 2$, $\rho''(u)$ does not exist at $u = 0$ as an ordinary function. In the generalized function sense, $\rho''(u) = p(p-1)|u|^{p-2}$, where the meaning of the generalized function $|u|^{p-2}$ is discussed in Gel'fand and Shilov (1964) and Kanwal (1983), for example.
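For computation, the Dirac delta appearing in the second derivatives above can be replaced by a member $K_h$ of a delta-convergent (regular) sequence; a sketch with a Gaussian kernel and illustrative bandwidths of our choosing:

```python
import numpy as np

def K_h(u, h):
    # regular sequence converging to the Dirac delta as h -> 0
    return np.exp(-0.5 * (np.asarray(u) / h) ** 2) / (h * np.sqrt(2 * np.pi))

# generalized second derivatives rho''(u), with delta(u) replaced by K_h(u, h)
def ddrho_lad(u, h=0.1):
    return 2.0 * K_h(u, h)                            # (a): rho(u) = |u|

def ddrho_quantile(u, h=0.1):
    return K_h(u, h)                                  # (b): check loss

def ddrho_huber(u, c=1.25):
    return (np.abs(u) <= c).astype(float)             # (c): ordinary a.e. derivative

def ddrho_expectile(u, tau=0.5):
    return 2.0 * np.abs(tau - (np.asarray(u) < 0))    # (d)
```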

Before stating the assumptions needed for our theoretical development, we introduce some notation. In what follows, $\|\cdot\|$ denotes the Euclidean norm of a vector or the element-wise norm of a matrix, and $\int$ denotes an integral (possibly a multiple integral) over the entire real line $\mathbb{R}$ (or $\mathbb{R}^p$).

Assumption A

(A.1) Let $\{\eta_j, j\in\mathbb{Z}\}$ be a sequence of $d_1$-dimensional independent and identically distributed (iid) continuous random vectors satisfying $E\eta_0 = 0$, $E\eta_0\eta_0^{\top} = I_{d_1}$ and $E\|\eta_0\|^4 < \infty$. Let $\varphi(u)$ be the characteristic function of $\eta_0$, satisfying $\int \|u\|\,|\varphi(u)|\,du < \infty$.

(A.2) Let $\{w_t\}$ be a linear process defined by $w_t = \sum_{j=0}^{\infty} A_j \eta_{t-j}$, where $\{A_j\}$ is a sequence of square matrices, $A_0 = I_{d_1}$, and $\sum_{j=0}^{\infty} j\|A_j\| < \infty$. Moreover, suppose that $A := \sum_{j=0}^{\infty} A_j$ has full rank.

(A.3) Let $x_t = x_{t-1} + w_t$ for $t\geq 1$ with $x_0 = O_P(1)$.

The structure of the unit root process $x_t$ is commonly imposed in the literature, since $w_t$ has a considerably general form that covers many special cases; see Park and Phillips (1999, 2001) and Dong et al. (2016). It is straightforward to calculate that $d_t^2 := \mathbb{E}(x_t x_t^{\top}) = AA^{\top} t(1 + o(1))$, and it is well known that, for $r\in[0,1]$, $n^{-1/2} x_{[nr]} \to_D B(r)$ as $n\to\infty$, where $B(r)$ is a Brownian motion on $[0,1]$ with covariance matrix $AA^{\top}$.
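A small simulation sketch of Assumption A with a finite-order linear process (the MA matrices $A_j$ below are arbitrary choices of ours satisfying (A.2)); it checks that the covariance of $n^{-1/2}x_n$ is close to $AA^{\top}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d1 = 400, 2
A_list = [np.eye(d1), 0.5 * np.eye(d1), 0.25 * np.eye(d1)]  # A_0 = I_{d1}, summable
J = len(A_list)

def simulate_x(n):
    eta = rng.standard_normal((n + J, d1))                   # iid innovations eta_t
    # w_t = sum_j A_j eta_{t-j}, truncated at lag J - 1
    w = sum(eta[J - j:J - j + n] @ Aj.T for j, Aj in enumerate(A_list))
    return np.cumsum(w, axis=0)                              # x_t = x_{t-1} + w_t

xs = np.stack([simulate_x(n)[-1] for _ in range(2000)])      # Monte Carlo draws of x_n
A = sum(A_list)                                              # here A = 1.75 * I_2
print(np.cov(xs.T) / n)                                      # close to A @ A.T = 3.0625 * I_2
```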

Assumption B

(B.1) Let $h(r,v)$ be a $d_2$-dimensional vector of functions continuous in $r$ on $[0,1]$.

(B.2) Suppose that $v_t = q(\eta_t,\cdots,\eta_{t-d_0+1}; \tilde v_t)$ for some $d_0\geq 1$, where $\{\tilde v_t, 1\leq t\leq n\}$ is a vector sequence of strictly stationary and $\alpha$-mixing time series, independent of the $\eta_j$'s introduced in Assumption A, and the vector function $q(\cdots)$ is measurable such that
$$\sup_{r\in[0,1]} \mathbb{E}\big\|\dot g_2^2(\beta_0^{\top} h(r,v_1))\, h(r,v_1) h(r,v_1)^{\top}\big\|^4 < \infty.$$
Moreover, the $\alpha$-mixing coefficient $\alpha(k)$ of $\{\tilde v_t, 1\leq t\leq n\}$ satisfies $\sum_{k=1}^{\infty} \alpha^{1/2}(k) < \infty$.

Basically, $\{v_t\}$ is a vector sequence of strictly stationary variables, and $v_t$ can be correlated with $x_t$ through a finite number of the innovations $\eta$. The existence of the fourth moment and the condition on the mixing coefficients guarantee the convergence in probability of the average of $\dot g_2^2(\beta_0^{\top} z_t)\, z_t z_t^{\top}$, by virtue of Davydov's inequality, as shown in Appendix B. All of these conditions are quite common, but in this paper they rely on the known function $h(\cdot)$, so we impose in (B.1) the continuity of $h(\cdot,v)$ to ensure that $h(r,v)$ is bounded in $r\in[0,1]$ for each $v$. Let $\xi_{1t} = \rho'(e_t)$, $\xi_{2t} = \rho'(e_t) z_t$ and $\xi_{3t} = \rho''(e_t)$.

Assumption C

(C.1) Suppose that $\rho(\cdot)$ is nonnegative and convex, has its only minimum at zero, and is locally integrable on the real line.

(C.2) Suppose that $(\xi_{1t}, \mathcal{F}_t)$ is a martingale difference sequence, where $\mathcal{F}_t$ is a $\sigma$-field such that $x_t$ and $v_t$ are adapted to $\mathcal{F}_{t-1}$. Moreover, $\mathbb{E}[\xi_{1t}^2|\mathcal{F}_{t-1}] = a_1$ and $\mathbb{E}[\xi_{3t}|\mathcal{F}_{t-1}] = a_2$ almost surely for some $0 < a_1, a_2 < \infty$. In addition, $\max_{1\leq t\leq n} \mathbb{E}[\xi_{1t}^4|\mathcal{F}_{t-1}] < \infty$ uniformly in $n\geq 1$.

(C.3) Suppose that, as $n\to\infty$, for $r\in[0,1]$,
$$\left(\frac{1}{\sqrt n}\sum_{t=1}^{[nr]}\xi_{1t},\ \frac{1}{\sqrt n}\sum_{t=1}^{[nr]}\xi_{2t},\ \frac{1}{\sqrt n}\sum_{t=1}^{[nr]}\eta_t\right) \Rightarrow (U_\rho(r), V_\rho(r), W(r)),$$
which is a Brownian motion of dimension $(d+1)$ with $d = d_1 + d_2$, where $U_\rho(r)$, $V_\rho(r)$ and $W(r)$ have variance (covariance) matrices $a_1$, $a_1\int \mathbb{E}(h(r,v_1)h(r,v_1)^{\top})\,dr$ and $I_{d_1}$, respectively.

Condition (C.1) is mild and ensures the eligibility of $\rho(\cdot)$ as a generalized function, apart from being a loss function. In Assumption (C.2), $\rho'(\cdot)$ and $\rho''(\cdot)$ are the first two derivatives of $\rho(\cdot)$, possibly in the generalized function sense; they coincide with the ordinary derivatives whenever the latter exist. The filtration in (C.2) can be taken as $\mathcal{F}_t = \sigma(e_1,\cdots,e_t; x_s, z_s, s\leq t+1)$, which is more general than serial independence between $e_t$ and $(x_t, z_t)$. When $\rho'(\cdot)$ exists in the ordinary sense in (C.2), the martingale difference condition yields $\mathbb{E}[\xi_{1t}|\mathcal{F}_{t-1}] = 0$, which is easily fulfilled for the OLS, LAD, $L_p$-norm ($p>1$) and Huber loss functions when the conditional distribution of $e_t$ given $\mathcal{F}_{t-1}$ is symmetric, because these loss functions have odd derivatives. For the quantile loss it is equivalent to $F_e(0|\mathcal{F}_{t-1}) = \tau$ (see, for example, Assumption 2 in Chen et al., 2019). Some authors obtain this condition by adjusting the intercept, which we also have to do when it is violated, as in He and Shao (2000, p. 124) and Xiao (2009, p. 251). In addition, the positivity $\mathbb{E}[\xi_{3t}|\mathcal{F}_{t-1}] = a_2 > 0$ is due to the convexity of the loss function under consideration. In particular, $a_2 = f_e(0) > 0$ for the check (quantile) loss and $a_2 = 2f_e(0) > 0$ for the LAD loss, while for Huber's loss $a_2 = P(|e_t|\leq c) > 0$ in the exogenous situation.

When $\rho'(\cdot)$ and $\rho''(\cdot)$ in (C.2) are genuinely generalized functions, the conditional expectations can be defined as

$$\mathbb{E}[\xi_{1t}|\mathcal{F}_{t-1}] = -\int \rho(u)\, f'_{t,e}(u)\,du \quad\mbox{and}\quad \mathbb{E}[\xi_{3t}|\mathcal{F}_{t-1}] = \int \rho(u)\, f''_{t,e}(u)\,du$$

under some conditions on the conditional density $f_{t,e}(u)$ of $e_t$ given $\mathcal{F}_{t-1}$. Though $\rho(\cdot)$ may not be differentiable at some isolated points, $\mathbb{E}[\xi_{1t}|\mathcal{F}_{t-1}]$ and $\mathbb{E}[\xi_{3t}|\mathcal{F}_{t-1}]$ are well defined provided that the conditional density is smooth and belongs to $S$ (the space of rapidly decaying test functions; see Appendix A); this is due to the use of generalized functions, which extend the ordinary derivatives. Condition D.2 in He and Shao (2000, p. 125) defines $\mathbb{E}[\xi_{3t}]$ in the same fashion for non-differentiable loss functions.

Note that (C.3) is quite commonly imposed in the cointegrating regression literature when there is no additional stationary regressor involved (see Park and Phillips, 2001; Xiao, 2009; Wang et al., 2018). Nevertheless, here the subscript $\rho$ indicates that the joint convergence relies on the loss function; for example, equation (3) of Phillips (1995) gives the functional invariance principle for the derivative of the LAD loss, i.e. $\mathrm{sign}(\cdot)$.

For simplicity of notation we mostly suppress this subscript, i.e., we denote $U_\rho(r)$ by $U(r)$ and $V_\rho(r)$ by $V(r)$, when no confusion arises.

Note that (C.3), in particular, ensures that, as $n\to\infty$,

$$\left(\frac{1}{\sqrt n}\sum_{t=1}^{[nr]}\xi_{1t},\ \frac{1}{\sqrt n}\sum_{t=1}^{[nr]}\xi_{2t},\ \frac{1}{\sqrt n}\, x_{[nr]}\right) \Rightarrow (U(r), V(r), B(r)), \tag{2.2}$$

where $B(r)$ has a different covariance function from that of $W(r)$.

Because of the involvement of the unit root process $x_t$, we need a classification of function classes similar to the $H$-regular and $I$-regular classes of Park and Phillips (1999, 2001). However, in this paper we only consider a simpler version of the $H$-regular class, though a more general form could be adopted straightforwardly.

Definition 2.1.

A function $f(\cdot)$ defined on $\mathbb{R}$ is called $H$-regular with homogeneity order $\nu(\cdot)$ if $f(\lambda x) = \nu(\lambda) f(x)$ for all $\lambda > 0$ and any $x$. A function $f(\cdot)$ defined on $\mathbb{R}$ is called $I$-regular if $\int |f| < \infty$.

The class of $H$-regular functions mainly contains power functions, while the class of $I$-regular functions has all integrable functions on the entire real line as members, such as probability density functions. More detailed discussion of these definitions can be found in Park and Phillips (1999, 2001).

Assumption D    Let both $g_1(u)$ and $g_2(v)$ be continuously differentiable up to the second order. Suppose further that

(D.1) $g_1(u)$, $\dot g_1(u)$ and $\ddot g_1(u)$ are $H$-regular with homogeneity orders $\nu(\cdot)$, $\dot\nu(\cdot)$ and $\ddot\nu(\cdot)$, respectively, such that $\lim_{\lambda\to+\infty}\dot\nu(\lambda)/\nu(\lambda) = 0$ and $\lim_{\lambda\to+\infty}\ddot\nu(\lambda)/\dot\nu(\lambda) = 0$.

(D.2) $g_1(u)$ and $g_1^2(u)$ are $I$-regular. Moreover, $\dot g_1(u)$, $\dot g_1^2(u)$, $u\dot g_1(u)$, $[u\dot g_1(u)]^2$, $\ddot g_1(u)$ and $u\ddot g_1(u)$ are all $I$-regular.

It is possible to relax these restrictions on the functional forms, but they serve as a benchmark. For example, mixtures of $H$-regular and $I$-regular functions in the argument $u$ are feasible for the development below, but we do not pursue this since it would make the notation much more complicated.

3 Asymptotic theory

In establishing our asymptotic theory, we treat in the proofs each of the generalized derivatives of $\rho(\cdot)$ as a limit of a regular sequence in the setting of generalized functions (see, for example, Phillips, 1991, 1995).

Theorem 3.1.

Under Assumptions A-C and D(1), as $n\to\infty$,

$$\begin{pmatrix} \dot\nu(\sqrt n)\, n\,(\widehat\theta - \theta_0)\\ \sqrt n\,(\widehat\beta - \beta_0) \end{pmatrix} \to_D \begin{pmatrix} [A_1 - A_3^{\top} A_2 A_3]^{-1}\,(b_1 - A_3^{\top} A_2^{-1} b_2)\\ A_2^{-1}\big(b_2 - A_3 [A_1 - A_3^{\top} A_2 A_3]^{-1} (b_1 - A_3^{\top} A_2^{-1} b_2)\big) \end{pmatrix},$$

where

$$b_1 = \int_0^1 \dot g_1(B(r)^{\top}\theta_0)\, B(r)\,dU(r),$$
$$b_2 \sim N(0, a_1\Sigma), \quad\mbox{with}\quad \Sigma = \int_0^1 \mathbb{E}\big[\dot g_2^2(\beta_0^{\top} h(r,v_1))\, h(r,v_1) h(r,v_1)^{\top}\big]\,dr,$$
$$A_1 = a_2\int_0^1 \dot g_1(B(r)^{\top}\theta_0)\, B(r) B(r)^{\top}\,dr, \qquad A_2 = a_2\Sigma,$$
$$A_3 = a_2\int_0^1 \dot g_1(B(r)^{\top}\theta_0)\,\big[\mathbb{E}\,\dot g_2(h(r,v_1)^{\top}\beta_0)\, h(r,v_1)\big]\, B(r)^{\top}\,dr,$$

and the positive numbers $a_1$ and $a_2$ are given in Assumption C.

Remark.  It can be seen from the theorem that the imposition of $H$-regularity on $g_1(u)$ and its derivatives delivers a fast rate of convergence for $\widehat\theta$, due to the divergence of the unit root process $x_t$. This is comparable with the results in Park and Phillips (2001). On the other hand, the estimator $\widehat\beta$ possesses the usual square-root-$n$ rate, although the regressor $z_t$ has a deterministic trending component. Notice that the positive definiteness of $\Sigma$ is easily satisfied as long as $v_1$ is a continuous variable. More importantly, it is readily seen that the different loss functions enter the asymptotic limits through $a_1$ and $a_2$, which may affect the covariance but not the rates of convergence.

To make a comparison with the literature, suppose for simplicity that the model is exogenous. (1) When $\rho(u) = |u|$, $\rho'(u) = -H(-u) + H(u)$ with $H(\cdot)$ the Heaviside function, and $\rho''(u) = 2\delta(u)$. We have $a_1 = \mathbb{E}[\rho'(e_t)]^2 = 1$ and $a_2 = \mathbb{E}[\rho''(e_t)] = 2f_e(0) > 0$, where $f_e(\cdot)$ is the density of $e_t$. Moreover, if $g_1(u) = u$ and $g_2(v)\equiv 0$, the model reduces to a linear cointegrating model. Our result implies that

$$n(\widehat\theta - \theta_0) \to_D [2f_e(0)]^{-1}\left[\int_0^1 B(r) B(r)^{\top}\,dr\right]^{-1}\int_0^1 B(r)\,dU(r),$$

which is the result of Phillips (1995) without correlation between $x_t$ and $e_t$. If $g_2(v) = v$, $g_1(u)\equiv 0$ and $h(r,v)\equiv v$, the model reduces to a linear model with stationary regressors, and our result naturally covers $\sqrt n(\widehat\beta - \beta_0) \to_D [2f_e(0)]^{-1}[\mathbb{E}(v_1 v_1^{\top})]^{-1/2} N(0, I_{d_2})$, as shown in Pollard (1991).

(2) When $\rho(u)$ is the quantile loss, $\rho'(u) = \tau H(u) + (\tau - 1)H(-u)$ and $\rho''(u) = \delta(u)$. We have $a_1 = \mathbb{E}[\rho'(e_t)]^2 = \tau(1-\tau)$ and $a_2 = \mathbb{E}[\rho''(e_t)] = f_e(0) > 0$, where $f_e(\cdot)$ is the density of $e_t$. Further, if $g_1(u) = u$ and $g_2(v)\equiv 0$, the model reduces to a linear cointegrating model, and our result implies that $n(\widehat\theta - \theta_0) \to_D [f_e(0)]^{-1}\big[\int_0^1 B(r)B(r)^{\top}dr\big]^{-1}\int_0^1 B(r)\,dU(r)$, which is the result of Xiao (2009) without correlation between $x_t$ and $e_t$. If $g_2(v) = v$, $g_1(u)\equiv 0$ and $h(r,v)\equiv v$, the model reduces to a linear model with stationary regressors, and our result implies that $\sqrt n(\widehat\beta - \beta_0) \to_D \sqrt{\tau(1-\tau)}\,[f_e(0)]^{-1}[\mathbb{E}(v_1 v_1^{\top})]^{-1/2} N(0, I_{d_2})$, which is the result of quantile regression for the linear model; see Koenker and Bassett (1978). $\Box$

While $\Sigma$ can easily be estimated consistently using the function $h$ and the observations of $v_t$, we need to construct consistent estimators for both $a_1$ and $a_2$ in order to use the asymptotic limits for statistical inference. These estimators are given by $\widehat a_1 = \frac{1}{n}\sum_{t=1}^n [\rho'(\widehat e_t)]^2$ and $\widehat a_2 = \frac{1}{n}\sum_{t=1}^n \rho''(\widehat e_t)$, respectively, where $\widehat e_t = y_t - g(\widehat\theta^{\,\top} x_t, \widehat\beta^{\,\top} z_t)$ for $1\leq t\leq n$. We then have the following corollary.

Corollary 3.1.

Let the conditions of Theorem 3.1 hold. If, in addition, $\{[\rho'(e_t)]^2\}$ and $\{\rho''(e_t)\}$ are mean ergodic, then $\widehat a_1 \to_P a_1$ and $\widehat a_2 \to_P a_2$ as $n\to\infty$.

The ergodicity of the two sequences is quite weak and can be fulfilled under several sufficient conditions, such as requiring $e_t$ to be a martingale difference sequence or a strictly stationary and $\alpha$-mixing sequence. When the derivatives of the loss function $\rho(u)$ do not exist in the ordinary sense, a member of its regular sequence may be used in their place (see Appendix A and Phillips, 1995, p. 918).
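For instance, under the quantile loss the plug-in estimators take the following form, with $\rho'' = \delta$ replaced by a member $K_h$ of its regular sequence (a sketch; the Gaussian kernel and rule-of-thumb bandwidth are illustrative choices of ours):

```python
import numpy as np

def K_h(u, h):
    # regular sequence converging to the Dirac delta as h -> 0
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

def a_hats_quantile(residuals, tau, h=None):
    """hat a_1 = n^{-1} sum [rho'(e_t)]^2 and hat a_2 = n^{-1} sum rho''(e_t)
    for the quantile loss, where rho'(u) = tau - I(u < 0) and rho''(u) = delta(u)."""
    e = np.asarray(residuals)
    n = e.size
    if h is None:
        h = 1.06 * e.std() * n ** (-0.2)          # rule-of-thumb bandwidth
    a1_hat = np.mean((tau - (e < 0)) ** 2)        # converges to tau * (1 - tau)
    a2_hat = np.mean(K_h(e, h))                   # converges to f_e(0)
    return a1_hat, a2_hat
```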

We now consider the regression function $g(u,v) = g_1(u) + g_2(v)$ where both $g_1(u)$ and $g_2(v)$ are smooth but $g_1(u)$ is integrable on the entire real line, as stipulated by Assumption D(2). This assumption changes the asymptotic theory drastically. The limit theory depends on the local time of a scalar Brownian motion $W(r)$, defined as

$$L_W(p,s) = \lim_{\epsilon\to 0}\frac{1}{2\epsilon}\int_0^p I(|W(r) - s| < \epsilon)\,dr,$$

which measures the sojourn time of $W(r)$ around $s$ during the time period $(0,p)$. In our context, the relevant scalar Brownian motion is $B_1(r) := \theta_0^{\top} B(r)$, where $B(r)$ is defined in (2.2).
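The occupation-time definition suggests a direct numerical approximation of $L_W(1,0)$ on a simulated path (a sketch; $\epsilon$ and the grid size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# Brownian path on [0, 1]: cumulative sum of N(0, 1/n) increments
W = np.cumsum(rng.standard_normal(n)) / np.sqrt(n)

eps = 0.01
# (2 eps)^{-1} * int_0^1 I(|W(r)| < eps) dr, the sojourn time near s = 0
L_hat = np.mean(np.abs(W) < eps) / (2 * eps)
print(L_hat)   # approximates L_W(1, 0); its mean is sqrt(2/pi) ~ 0.80
```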

Another feature of the limit theory for this regression function is that we first derive it in a new coordinate system and then recover the limit in the original coordinates. To do so, using $\theta_0$, which is nonzero but not necessarily a unit vector, we construct an orthogonal matrix $P = (\theta_0/\|\theta_0\|, P_1)_{d_1\times d_1}$ to rotate all the vectors of interest, including $\theta_0$, $x_t$ and $B(r)$. The relevant rotated vectors are $x_{1t} \equiv \theta_0^{\top} x_t$ and $x_{2t} \equiv P_1^{\top} x_t$, which after normalization converge jointly to $B_1(r) \equiv \theta_0^{\top} B(r)$ and $B_2(r) \equiv P_1^{\top} B(r)$; we denote by $L_1(p,s)$ the local time of $B_1(r)$.

Theorem 3.2.

Under Assumptions A-C and D(2), as $n\to\infty$ we have

$$\begin{pmatrix}\widehat{\alpha}\\ \widehat{\gamma}\end{pmatrix} = \begin{pmatrix} n^{1/4}\,\|\theta_0\|^{-2}\,\theta_0^{\top}(\widehat\theta - \theta_0)\\ n^{3/4}\, P_1^{\top}(\widehat\theta - \theta_0)\\ \sqrt n\,(\widehat\beta - \beta_0)\end{pmatrix} \to_D \frac{\sqrt{a_1}}{a_2}\, M^{-1/2}\, N(0, I_d),$$

where $P \equiv (\theta_0/\|\theta_0\|, P_1)$ is an orthogonal matrix, $a_1$ and $a_2$ are the positive constants given in Assumption C, and $M = \mathrm{diag}(M_1, \Sigma)$ is a symmetric positive definite $d\times d$ matrix in which $M_1 = (m_{ij})$ is given by

$$m_{11} = \int [u\dot g_1(u)]^2\,du\; L_1(1,0), \qquad m_{12} = \int_0^1 B_2^{\top}(r)\,dL_1(r,0)\int u\dot g_1^2(u)\,du,$$
$$m_{21} = m_{12}^{\top}, \qquad m_{22} = \int_0^1 B_2(r) B_2(r)^{\top}\,dL_1(r,0)\int \dot g_1^2(u)\,du,$$

and $\Sigma$ is the $d_2\times d_2$ matrix defined in Theorem 3.1.

Noting that MM is a block diagonal matrix, the result in Theorem 3.2 implies

$$D_n J P^{\top}(\widehat\theta - \theta_0) = \begin{pmatrix} n^{1/4}\,\|\theta_0\|^{-2}\,\theta_0^{\top}\\ n^{3/4}\, P_1^{\top}\end{pmatrix}(\widehat\theta - \theta_0) \to_D \frac{\sqrt{a_1}}{a_2}\, N(0, M_1^{-1}), \tag{3.1}$$
and
$$\sqrt n\,(\widehat\beta - \beta_0) \to_D \frac{\sqrt{a_1}}{a_2}\, N(0, \Sigma^{-1}), \tag{3.2}$$
as $n\to\infty$, where $J = \mathrm{diag}(1/\|\theta_0\|, I_{d_1-1})$.

Notice also that under condition (D.2), $g_2(v)$ is the same as under (D.1). However, the convergence in (3.2) is not affected by the first additive component $g_1(u)$, because the convergence is similar to that in the existing literature, except that our setting for the regressor is more complicated (see Pollard, 1991, and Koenker and Bassett, 1978). By sharp contrast, the asymptotic limit of $\widehat\beta$ in Theorem 3.1 is affected by the limit of the regressor in the first component function. This is due to the drastically different behaviour of $H$-regular and $I$-regular functions of unit root processes; see, for example, Park and Phillips (2001) and Dong et al. (2016).

The convergence in (3.1) is similar to Theorem 3.1 in Dong et al. (2016), where $J = I_{d_1}$ due to the identification condition $\|\theta_0\| = 1$ in the semiparametric single-index model, and $a_1 = 4\sigma_e^2$ and $a_2 = 2$ because $\rho(u) = u^2$ in that paper. It is also similar to Theorem 5.1 in Park and Phillips (2001), where the regressor is univariate but the parameter is multivariate. In addition, when the loss function reduces to the least squares loss, the convergence in (3.1) is the same as Theorem 2 in Park and Phillips (2000).

If one is concerned about estimating the matrix $P_1$, note that $P_1$ is an orthonormal basis of $\theta_0^{\bot}$ (the orthogonal complement space), so that once $\widehat\theta$ is available, we may define $\widehat P_1 = \widehat\theta^{\,\bot}$. In practice, one does not normally need to rotate the coordinate system; it is introduced only to facilitate our theory.
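In code, $\widehat P_1$ can be taken as any orthonormal basis of the orthogonal complement of $\widehat\theta$; a sketch using scipy (the numerical value of theta_hat is a placeholder):

```python
import numpy as np
from scipy.linalg import null_space

theta_hat = np.array([0.8, -0.6, 0.1])              # placeholder estimate, d1 = 3
P1_hat = null_space(theta_hat.reshape(1, -1))       # d1 x (d1 - 1), orthonormal columns
P = np.column_stack([theta_hat / np.linalg.norm(theta_hat), P1_hat])
print(np.allclose(P.T @ P, np.eye(3)))              # True: P is orthogonal
```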

Using the block representation $M_1 = (m_{ij})_{2\times 2}$, we have $M_1^{-1} = (\tau_{ij})_{2\times 2}$ with

$$\tau_{11} = (m_{11} - m_{12} m_{22}^{-1} m_{21})^{-1} \quad\mbox{and}\quad \tau_{22} = (m_{22} - m_{21} m_{11}^{-1} m_{12})^{-1}.$$

The following corollary recovers the asymptotic distribution of θ^\widehat{\theta} from the limit of its rotation.

Corollary 3.2.

Under the conditions of Theorem 3.2, as $n\to\infty$ we have

$$n^{1/4}(\widehat\theta - \theta_0) \to_D N(0, \tau_{11}\theta_0\theta_0^{\top}).$$

Although (3.1) indicates that the coordinates of $\widehat\theta$ along $\theta_0$ and $P_1$ have rates $n^{1/4}$ and $n^{3/4}$, respectively, the estimator $\widehat\theta$ eventually has the slower rate $n^{1/4}$. This is the same as in Dong et al. (2016), where a more detailed explanation may be found. Interestingly, since the model in Dong et al. (2016) is nonparametric, $\theta_0$ is identified only up to its direction, so the condition $\|\theta_0\| = 1$ is imposed there. Thus, the authors normalize the estimator $\widehat\theta$ to be a unit vector and find that the normalized estimator has a much faster rate than $n^{1/4}$. By contrast, no identification condition is needed in this paper, as the model is parametrically nonlinear; consequently, we cannot achieve rates faster than those obtained. However, if one knew the length of $\theta_0$, say $\|\theta_0\| = q$, the estimator $q\,\widehat\theta/\|\widehat\theta\|$ would converge to $\theta_0$ at a rate of order $n^{3/4}$, basically because the normalization projects $\widehat\theta$ directly onto the circle of radius $q$ (see, for example, Theorem 3.2 in Dong et al., 2016). Moreover, one may estimate $a_1$ and $a_2$ similarly to Corollary 3.1, and the local time in the limit similarly to Dong et al. (2016); the details are omitted due to the similarity.

Before concluding this section, we briefly discuss another important issue related to model (1.2). If the model has conditional heteroscedasticity, say $e_t = \sigma(\vartheta_0^{\top} x_t, \pi_0^{\top} z_t)\varepsilon_t$, where $\varepsilon_t$ is independent of $(z_t, x_t)$, and if $\sigma(u,v) = \sigma_1(u)\sigma_2(v)$ is known and separable, then we have

$$\log(e_t^2) = \log(\sigma_1^2(\vartheta_0^{\top} x_t)) + \log(\sigma_2^2(\pi_0^{\top} z_t)) + \log(\varepsilon_t^2) = \mu_0 + \log(\sigma_1^2(\vartheta_0^{\top} x_t)) + \log(\sigma_2^2(\pi_0^{\top} z_t)) + \zeta_t,$$

where $\zeta_t$ is the centred version of $\log(\varepsilon_t^2)$ and $\mu_0 = \mathbb{E}[\log(\varepsilon_t^2)]$.

In this case, the proposed estimation method is readily applicable to derive the corresponding M-estimators of $\vartheta_0$ and $\pi_0$ after replacing $e_t$ by $\widehat e_t = y_t - g_1(\widehat\theta^{\,\top} x_t) - g_2(\widehat\beta^{\,\top} z_t)$. Under assumptions on $\log(\sigma_1^2(u))$ and $\log(\sigma_2^2(v))$ analogous to Assumption D, we may establish asymptotic properties corresponding to the above theorems and corollaries for the estimators of $\vartheta_0$ and $\pi_0$.
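A sketch of the transformation used for this second-stage estimation (volatility_targets is a hypothetical helper name of ours, and the flooring constant guards against taking the log of a numerically zero residual):

```python
import numpy as np

def volatility_targets(y, x, z, g1, g2, theta_hat, beta_hat, floor=1e-12):
    """Build the response log(e_hat_t^2) for the second-stage M-estimation of
    (vartheta_0, pi_0) in log sigma_1^2(.) + log sigma_2^2(.) plus mu_0."""
    e_hat = y - g1(x @ theta_hat) - g2(z @ beta_hat)   # first-stage residuals
    return np.log(np.maximum(e_hat ** 2, floor))
```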

4 Stock return predictability

To demonstrate the practical relevance and advantages of our proposed model and estimation method over some natural competitors, we investigate its applicability to stock return predictability. It is common to use a linear predictive mean regression in the literature, which has led to considerable disagreement in the empirical findings as to whether stock returns are predictable (Campbell and Thompson, 2008; Welch and Goyal, 2008). Recently, Kasparis et al. (2015) use a nonparametric mean regression, Lee (2016) and Fan and Lee (2019) consider a linear quantile regression, and Tu et al. (2021) adopt a nonparametric quantile framework to re-investigate this important issue. We demonstrate the nonlinearity in return predictability through the proposed multiple-index model with both stationary and nonstationary predictors.

The data used for return prediction are obtained from Welch and Goyal (2008). They have been used extensively in the predictive literature, including the recent balanced predictive mean regression of Ren et al. (2019), the linear quantile predictive regression of Lee (2016) and Fan and Lee (2019), the linear prediction with cointegrated variables of Koo et al. (2020), the LASSO predictive regression of Lee et al. (2021), and the nonparametric quantile predictive regression of Tu et al. (2021), among others. The monthly data span from January 1927 to December 2005. The dependent variable, the excess stock return, is defined as the difference between the S&P 500 index return (including dividends) and the one-month Treasury bill rate. The nonstationary (persistent) predictors include the dividend-price ratio ($dp$), dividend-payout ratio ($de$), long term yield ($lty$), book-to-market ratio ($bm$), T-bill rate ($tbl$), default yield spread ($dfy$), net equity expansion ($ntis$), earnings-price ratio ($ep$) and term spread ($tms$), while the stationary variables include the default return spread ($dfr$), long term rate of return ($ltr$), stock variance ($svar$) and inflation ($infl$). The first-order serial correlations of these variables and their time series plots are available from, for example, Lee et al. (2021). For a detailed description of each series and the data construction, please refer to Welch and Goyal (2008).

To measure the performance of the predictive regression models under discussion, we use the pseudo out-of-sample $R^2$, defined as

$$PR^2 = 1 - \frac{\sum_{t=T_1+1}^{T_2}\rho(\widehat e_t)}{\sum_{t=T_1+1}^{T_2}\rho(\overline e_t)},$$

where $\widehat e_t$ and $\overline e_t$ are the respective out-of-sample prediction errors at time $t$ for a given model and for the base model with only a constant predictor, over the forecasting sample spanning from $T_1$ to $T_2$. Four loss functions are entertained for $\rho(\cdot)$: (L1) squared error loss $\rho(e) = e^2$; (L2) absolute error loss $\rho(e) = |e|$; (L3) Huber's loss $\rho_\delta(e) = \frac{1}{2}e^2\cdot 1\{|e|\leq\delta\} + \delta(|e| - \frac{1}{2}\delta)\cdot 1\{|e|>\delta\}$ with $\delta = 1.25$; and (L4) quantile loss $\rho_\tau(e) = e(\tau - I(e<0))$. A positive $PR^2$ indicates that the given model outperforms the base model, and the larger the value, the better the performance.
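The $PR^2$ computation itself is a few lines (a sketch; the helper name is ours, and the Huber and quantile examples mirror (L3)-(L4)):

```python
import numpy as np

def pseudo_r2(e_model, e_base, rho):
    """PR^2 = 1 - sum rho(model errors) / sum rho(benchmark errors),
    over the forecast sample; positive values favour the model."""
    return 1.0 - np.sum(rho(np.asarray(e_model))) / np.sum(rho(np.asarray(e_base)))

huber = lambda e, d=1.25: np.where(np.abs(e) <= d, 0.5 * e**2, d * (np.abs(e) - 0.5 * d))
quantile = lambda e, tau=0.5: e * (tau - np.where(e < 0, 1.0, 0.0))
```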

Due to space limitations, we follow Kasparis et al. (2015), Lee (2016) and Tu et al. (2021) and consider two subsamples: (i) the period from January 1927 to December 2005, over which significant predictability has been discovered, and (ii) the tranquil period from January 1952 to December 2005, for which the evidence on predictability is mixed. We consider a rolling-window scheme for the out-of-sample return prediction (sketched below) and let the rolling-window in-sample size (RWS) be 120 (10 years), 240 (20 years) and 360 (30 years), respectively. For example, with RWS equal to 120 for the first subsample, the first in-sample window begins in January 1927 and the forecasts end in December 2005.
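The rolling-window scheme can be sketched as follows (fit and predict are placeholders for the chosen specification, e.g. the M-estimation sketch of Section 2):

```python
import numpy as np

def rolling_forecast_errors(y, x, z, fit, predict, rws):
    """Rolling-window scheme: refit on [t - rws, t) and forecast y_t for t = rws..n-1.

    fit(y_win, x_win, z_win) returns parameter estimates; predict(params, x_t, z_t)
    returns the one-step-ahead forecast.
    """
    errors = []
    for t in range(rws, len(y)):
        params = fit(y[t - rws:t], x[t - rws:t], z[t - rws:t])
        errors.append(y[t] - predict(params, x[t], z[t]))
    return np.array(errors)
```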

In contrast to Tu et al. (2021), our proposed model allows the use of multivariate nonstationary predictors together with stationary ones to capture nonlinearity in return predictability. There are numerous possible combinations of the nonstationary and stationary predictors. For demonstration purposes, we use the stationary predictors $z = \{infl, svar\}$ and choose the nonstationary predictors as either $x = \{bm, lty\}$ or $x = \{dp, tms\}$. When $x = \{bm, lty\}$, we consider either a linear link $g_1(u) = u$ or an exponential link $g_1(u) = u\cdot e^{-0.4u^2}$ for the nonstationary index, and likewise either $g_2(v) = v$ or $g_2(v) = v\cdot e^{-0.4v^2}$ for the stationary index, which amounts to four possible combinations. When $x = \{dp, tms\}$, we alternatively consider the normal cumulative distribution function $\Phi$ to capture possible nonlinearity. The logistic distribution function has also been experimented with in this case; it makes little difference to the subsequent main findings, so those results are not reported below. An extensive analysis with the comprehensive set of predictors noted above and all possible specifications of the nonlinear link functions remains an interesting but challenging task, which is left for future study.

Tables 1 and 2 collect the pseudo out-of-sample $R^2$ for excess return prediction with the above two sets of selected predictors and link function specifications, respectively, for the loss functions (L1)-(L3). For each loss function and each RWS, the link combination that produces the largest $PR^2$ is shown in bold. There are several interesting findings. First, noticeable nonlinearity exists in return prediction, for both choices of the nonstationary predictors and for both subsamples considered. In particular, for the squared error loss and Huber's loss, the linear specification is always worse than the nonlinear alternatives. When $x = \{bm, lty\}$, nonlinearity is discovered for both the stationary and nonstationary indices. However, when $x = \{dp, tms\}$, only the nonstationary index exhibits nonlinearity in return prediction. Second, nonlinear predictability is found in forecasting returns where linear predictability is absent. The linear predictive regression (nearly) outperforms the constant model without any predictor when $x = \{bm, lty\}$, but underperforms it when $x = \{dp, tms\}$. Although the linear model does not reveal predictability in the latter case, the nonlinear link specification uniformly improves over the constant model for both subsamples. Third, the conventional "tranquil" period 1952-2005 reveals strong nonlinear predictability, quite comparable to the 1927-2005 subsample. This finding contrasts with the declining linear predictability discovered by Campbell and Yogo (2006) and the weak nonlinear predictability with only a single nonstationary predictor in Kasparis et al. (2015). To summarize, significant return predictability is revealed by the nonlinear predictive regression with both stationary and nonstationary indices introduced here.

As the recent literature has witnessed growing interest in return prediction at quantiles (Lee, 2016; Fan and Lee, 2019; Tu et al., 2021), we next study the performance of our proposed model in predicting return quantiles. We consider the quantile levels $\tau = 0.05, 0.1, 0.2, \ldots, 0.9, 0.95$. The analysis focuses only on the nonlinear effect of the nonstationary index, as discovered from Tables 1 and 2. Therefore, we present results only for $x = \{dp, tms\}$ and $z = \{infl, svar\}$ in Table 3, at all the quantile levels specified above and for the two subsamples mentioned earlier. The nonlinear specification for the nonstationary component includes not only the normal distribution function but also, as a robustness check, the popular logistic distribution function (Logit). From Table 3, linear quantile predictability seems to exist only at the lower quantile levels, such as $\tau = 0.05, 0.1$ and $0.2$. However, nonlinear quantile predictability prevails at all quantile levels. Even at the lower quantiles, the proposed nonlinear index model outperforms the linear model except in a very few cases. Comparing the normal with the logistic distribution, the latter seems to enjoy better performance in both sample periods. In addition, the nonlinear predictability appears to become stronger as the quantile level moves towards the two ends, especially for the first subsample.

Table 1: Pseudo out-of-sample $R^2$ for excess return prediction with $x=\{bm,lty\}$ and $z=\{infl,svar\}$.
Panel A: January 1927-December 2005
                                $g_2(v)=v$                     $g_2(v)=v\cdot e^{-0.4v^2}$
$g_1(u)=$                RWS    L1       L2       L3           L1       L2       L3
$u$                      120    0.0306   0.0319   0.0258       0.0302   0.0276   0.0312
                         240    0.0658   0.0326   0.0763       0.0601   0.0321   0.0747
                         360    0.0657   0.0076   0.0784       0.0672   0.0088   0.0797
$u\cdot e^{-0.4u^2}$     120    0.0346   0.0243   0.0341       0.0364   0.0266   0.0367
                         240    0.0788   0.0328   0.0773       0.0848   0.0322   0.0723
                         360    0.0325   0.0217   0.0791       0.0311   0.0217   0.0780
Panel B: January 1952-December 2005
$u$                      120   -0.0274   0.0014  -0.0213      -0.0255  -0.0063  -0.0144
                         240    0.0162   0.0063   0.0394       0.0112   0.0063   0.0370
                         360    0.0177  -0.0331   0.0343       0.0193  -0.0315   0.0363
$u\cdot e^{-0.4u^2}$     120   -0.0136  -0.0063  -0.0129      -0.0134  -0.0031  -0.0077
                         240    0.0370   0.0093   0.0395       0.0442   0.0086   0.0318
                         360   -0.0337  -0.0138   0.0427      -0.0355  -0.0139   0.0417
Table 2: Pseudo out-of-sample $R^2$ for excess return prediction with $x=\{dp,tms\}$ and $z=\{infl,svar\}$.
Panel A: January 1927-December 2005
                                $g_2(v)=v$                     $g_2(v)=\Phi(v)$
$g_1(u)=$                RWS    L1       L2       L3           L1       L2       L3
$u$                      120   -0.0473  -0.0360  -0.0535      -1.0341  -1.8261  -1.4626
                         240   -0.0758  -0.0451  -0.0774      -0.8457  -1.3144  -1.6810
                         360   -0.0665  -0.0378  -0.0716      -0.8198  -0.8714  -0.6289
$\Phi(u)$                120    0.0701   0.0490   0.0669      -1.0341  -1.8261  -1.4626
                         240    0.0970   0.0517   0.1162      -0.8457  -1.3144  -1.6810
                         360    0.1275   0.0530   0.1210      -0.8198  -0.8714  -0.6289
Panel B: January 1952-December 2005
$u$                      120   -0.0306  -0.0263  -0.0335      -0.2523  -0.1541  -0.2666
                         240   -0.0580  -0.0351  -0.0579      -0.7872  -0.4355  -0.7848
                         360   -0.0748  -0.0379  -0.0742      -1.2610  -0.6199  -1.2700
$\Phi(u)$                120    0.0466   0.0395   0.0269      -0.8947  -1.0879  -1.2178
                         240    0.0664   0.0286   0.0844      -0.4725  -0.6958  -0.6943
                         360    0.1026   0.0319   0.0950      -0.7625  -0.5595  -0.4954
Table 3: Pseudo out-of-sample $R^2$ for quantile excess return prediction with $x=\{dp,tms\}$ and $z=\{infl,svar\}$.
Panel A: January 1927-December 2005
                                $g_2(v)=v$
$g_1(u)=$          RWS  $\tau=$0.05  0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9     0.95
$u$                120  -0.0187  0.0032  -0.0109 -0.0099 -0.0241 -0.0360 -0.0408 -0.0451 -0.0605 -0.0886 -0.0796
                   240   0.0389  0.0133   0.0011 -0.0108 -0.0266 -0.0451 -0.0469 -0.0677 -0.0921 -0.1123 -0.1581
                   360   0.0030  0.0305   0.0093 -0.0100 -0.0187 -0.0382 -0.0455 -0.0714 -0.0895 -0.1372 -0.1828
$\Phi(u)$          120  -0.0504  0.0165   0.0305  0.0439  0.0432  0.0490  0.0514  0.0493  0.0497  0.0540  0.0338
                   240   0.0569  0.0319   0.0237  0.0442  0.0649  0.0517  0.0562  0.0461  0.0437  0.0471  0.0328
                   360  -0.0056  0.0039   0.0127  0.0542  0.0528  0.0530  0.0534  0.0381  0.0257  0.0268  0.0333
$\mathrm{Logit}(u)$ 120 -0.0017  0.0251   0.0318  0.0393  0.0441  0.0516  0.0432  0.0543  0.0583  0.0606  0.0732
                   240   0.0667  0.0387   0.0363  0.0536  0.0568  0.0564  0.0538  0.0633  0.0534  0.0629  0.0764
                   360   0.0058  0.0131   0.0391  0.0560  0.0540  0.0620  0.0613  0.0648  0.0439  0.0419  0.0714
Panel B: January 1952-December 2005
$u$                120  -0.0169 -0.0011  -0.0080 -0.0091 -0.0105 -0.0262 -0.0251 -0.0316 -0.0426 -0.0552 -0.0629
                   240   0.0459  0.0145   0.0064 -0.0093 -0.0149 -0.0350 -0.0346 -0.0518 -0.0727 -0.1068 -0.1245
                   360   0.0406  0.0320   0.0125 -0.0147 -0.0297 -0.0380 -0.0454 -0.0492 -0.0709 -0.1351 -0.1811
$\Phi(u)$          120   0.0418 -0.0034   0.0124  0.0392  0.0296  0.0395  0.0454  0.0273  0.0228  0.0379  0.0199
                   240   0.0357  0.0141   0.0130  0.0377  0.0481  0.0286  0.0293  0.0029 -0.0127  0.0033  0.0001
                   360   0.0027  0.0141   0.0092  0.0707  0.0456  0.0319  0.0146 -0.0268 -0.0933 -0.1060 -0.0830
$\mathrm{Logit}(u)$ 120  0.0445  0.0303   0.0323  0.0326  0.0336  0.0385  0.0335  0.0394  0.0273  0.0378  0.0415
                   240   0.0452  0.0171   0.0204  0.0479  0.0377  0.0358  0.0306  0.0317  0.0029  0.0194  0.0529
                   360   0.0140  0.0173   0.0395  0.0596  0.0474  0.0415  0.0270  0.0143 -0.0612 -0.0746 -0.0116

5 Monte Carlo simulations

We consider the following data generating process:

$$y_t = g(x_t^{\top}\theta, z_t^{\top}\beta) + e_t, \tag{5.1}$$

where $x_t = \rho_{n1} x_{t-1} + \sigma_1 w_t$, $z_t = h(t/n) + v_t$ with $h(\tau) = \tau$, and $v_t = \rho_{n2} v_{t-1} + \sigma_2\epsilon_t$; both $x_t$ and $v_t$ are bivariate autoregressive vector processes with $x_0 = 0$, $v_0 = 0$, $\rho_{n1} = I_2$, $\rho_{n2} = 0.5\times I_2$, $\sigma_1 = \mathrm{diag}\{0.2, 0.5\}$ and $\sigma_2 = I_2$; and $(w_t^{\top}, \epsilon_t^{\top})$ is a sequence of independent four-dimensional normal random vectors with zero mean and identity covariance matrix. The true parameters are $\theta = (1,0)^{\top}$ and $\beta = (2,1)^{\top}$. The regression error is $e_t = 0.1\cdot u_t$, with $u_t$ independently generated according to four distributions: (D1) standard normal; (D2) mixed normal $0.9\cdot N(0,1) + 0.1\cdot N(0,4)$; (D3) $t$ distribution with 2 degrees of freedom; and (D4) standard Cauchy.
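A sketch of the data generating process (5.1) (the helper name generate_dgp is ours, and we read $h(t/n)=t/n$ as added to each coordinate of $z_t$):

```python
import numpy as np

def generate_dgp(n, g, dist="D1", rng=None):
    """Simulate (5.1): y_t = g(x_t' theta, z_t' beta) + 0.1 u_t."""
    if rng is None:
        rng = np.random.default_rng()
    theta, beta = np.array([1.0, 0.0]), np.array([2.0, 1.0])
    sigma1 = np.diag([0.2, 0.5])
    w = rng.standard_normal((n, 2))
    eps = rng.standard_normal((n, 2))
    x = np.cumsum(w @ sigma1.T, axis=0)                   # rho_{n1} = I_2 (unit root)
    v = np.zeros((n, 2))
    for t in range(1, n):
        v[t] = 0.5 * v[t - 1] + eps[t]                    # rho_{n2} = 0.5 * I_2
    tau = np.arange(1, n + 1) / n
    z = tau[:, None] + v                                  # z_t = h(t/n) + v_t
    if dist == "D1":
        u = rng.standard_normal(n)
    elif dist == "D2":                                    # 0.9 N(0,1) + 0.1 N(0,4)
        scale = np.where(rng.random(n) < 0.9, 1.0, 2.0)
        u = scale * rng.standard_normal(n)
    elif dist == "D3":
        u = rng.standard_t(2, n)                          # t with 2 df
    else:
        u = rng.standard_cauchy(n)                        # D4
    y = g(x @ theta, z @ beta) + 0.1 * u
    return y, x, z
```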

For the bivariate nonlinear function g(,)g(\cdot,\cdot), we consider the following designs:

(M1) $g(u,v) = u + v$;

(M2) $g(u,v) = \phi(u) + v$;

(M3) $g(u,v) = \Phi(u) + v$;

(M4) $g(u,v) = u + \phi(v)$;

where $\phi(\cdot)$ is the standard normal density function and $\Phi(\cdot)$ is the standard normal distribution function. Three loss functions are entertained: (L1) squared error loss $\rho(e) = e^2$; (L2) absolute error loss $\rho(e) = |e|$; and (L3) Huber's loss $\rho_\delta(e) = \frac{1}{2}e^2\cdot 1\{|e|\leq\delta\} + \delta(|e| - \frac{1}{2}\delta)\cdot 1\{|e|>\delta\}$ with $\delta = 1.25$. The simulations are conducted for $n = 100, 200, 400$ with 5,000 replications.
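A sketch of the replication loop under design M1, assuming the hypothetical generate_dgp and m_estimate helpers sketched earlier are in scope (the replication count is reduced here for speed; the paper uses 5,000):

```python
import numpy as np

g_m1 = lambda u, v: u + v                          # design M1
losses = {"L1": lambda e: e**2,                    # squared error
          "L2": np.abs,                            # absolute error
          "L3": lambda e, d=1.25: np.where(np.abs(e) <= d, 0.5 * e**2,
                                           d * (np.abs(e) - 0.5 * d))}  # Huber

theta0, beta0 = np.array([1.0, 0.0]), np.array([2.0, 1.0])
n, reps = 200, 200
est = {name: [] for name in losses}
for r in range(reps):
    y, x, z = generate_dgp(n, g_m1, dist="D1")
    for name, rho in losses.items():
        th, be = m_estimate(y, x, z, lambda u: u, lambda v: v, rho, theta0, beta0)
        est[name].append(np.concatenate([th, be]))

truth = np.concatenate([theta0, beta0])
for name, draws in est.items():
    mse = np.mean((np.array(draws) - truth) ** 2, axis=0)
    print(name, 1e3 * mse)                         # reported as x 10^3 in Tables 4-7
```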

To measure the estimation accuracy of the M-estimates of the index parameters, we compute the bias, estimated standard deviation and mean squared error for each element of the indices. Due to space limitations, we only report the mean squared errors, in Tables 4-7 for M1-M4, respectively. Each table contains the estimation results under the three loss functions L1-L3 and the four error distributions D1-D4. The main findings are summarized as follows. First, the mean squared errors decrease as the sample size $n$ increases, except when the errors are generated from the Cauchy distribution and the least squares loss is used; under Cauchy errors, the least squares estimator is known to be inconsistent, whereas the median (LAD) estimator and Huber's estimator remain consistent. As a result, we exclude this scenario when commenting further on estimator performance below. Second, the index parameter estimates for the nonstationary variables enjoy a faster-than-$\sqrt n$ rate of convergence in all cases, as compared with the standard $\sqrt n$ rate for the stationary index estimator. Finally, the three estimators considered are quite competitive across the different error distributions. As expected, the least squares estimator is the most efficient when the errors are drawn from normal distributions. Huber's estimator is, in general, the most efficient for the mixed normal and $t(2)$ errors. The LAD estimator is the most efficient when the errors are Cauchy.

Table 4: Mean Squared Error of M-estimator under M1 (×103\times 10^{3}). Numbers greater than 10310^{3} are marked as “*”.
D1: Normal D2: Mixed normal D3: t(2) D4: Cauchy
L1 L2 L3 L1 L2 L3 L1 L2 L3 L1 L2 L3
n=100n=100
θ1\theta_{1} 0.273 0.341 0.287 0.357 0.392 0.334 3.349 0.478 0.499 * 0.596 0.806
θ2\theta_{2} 0.042 0.026 0.045 0.052 0.029 0.051 0.456 0.042 0.085 * 0.046 0.132
β1\beta_{1} 0.084 0.127 0.089 0.101 0.134 0.095 1.024 0.172 0.155 * 0.230 0.245
β2\beta_{2} 0.082 0.125 0.087 0.105 0.136 0.100 1.024 0.172 0.157 * 0.224 0.242
n=200n=200
θ1\theta_{1} 0.065 0.097 0.069 0.085 0.107 0.080 1.009 0.120 0.120 * 0.166 0.198
θ2\theta_{2} 0.010 0.011 0.010 0.013 0.012 0.012 0.291 0.015 0.019 * 0.020 0.031
β1\beta_{1} 0.039 0.059 0.042 0.050 0.068 0.047 1.510 0.080 0.072 * 0.100 0.110
β2\beta_{2} 0.040 0.061 0.043 0.051 0.066 0.049 0.869 0.079 0.071 * 0.101 0.110
n=400n=400
θ1\theta_{1} 0.017 0.025 0.017 0.021 0.028 0.021 0.409 0.031 0.028 * 0.040 0.045
θ2\theta_{2} 0.003 0.004 0.003 0.003 0.004 0.003 0.032 0.005 0.005 * 0.006 0.007
β1\beta_{1} 0.018 0.029 0.019 0.025 0.032 0.023 0.244 0.039 0.034 * 0.047 0.053
β2\beta_{2} 0.019 0.029 0.020 0.025 0.034 0.024 0.308 0.037 0.035 * 0.049 0.053
Table 5: Mean Squared Error of M-estimator under M2 (×103\times 10^{3}). Numbers greater than 10310^{3} are marked as “*”.
D1: Normal D2: Mixed normal D3: t(2) D4: Cauchy
L1 L2 L3 L1 L2 L3 L1 L2 L3 L1 L2 L3
n=100n=100
θ1\theta_{1} 1.545 1.075 1.598 1.854 1.238 1.755 * 1.511 2.899 * 2.199 5.955
θ2\theta_{2} 0.352 0.026 0.349 0.353 0.029 0.337 * 0.029 0.614 * 0.050 1.290
β1\beta_{1} 0.008 0.012 0.008 0.010 0.013 0.010 0.089 0.016 0.015 * 0.022 0.024
β2\beta_{2} 0.008 0.012 0.008 0.010 0.013 0.009 0.100 0.016 0.014 * 0.021 0.024
n=200n=200
θ1\theta_{1} 0.650 0.502 0.675 0.794 0.592 0.769 * 0.770 1.366 * 0.913 2.204
θ2\theta_{2} 0.144 0.018 0.159 0.203 0.027 0.206 * 0.032 0.262 * 0.041 0.530
β1\beta_{1} 0.004 0.006 0.004 0.005 0.006 0.004 0.035 0.007 0.007 * 0.010 0.011
β2\beta_{2} 0.004 0.006 0.004 0.005 0.006 0.005 0.039 0.007 0.007 * 0.010 0.011
n=400n=400
θ1\theta_{1} 0.342 0.356 0.367 0.512 0.407 0.486 * 0.437 0.751 * 0.567 1.257
θ2\theta_{2} 0.071 0.013 0.079 0.103 0.018 0.095 * 0.019 0.159 * 0.021 0.333
β1\beta_{1} 0.002 0.003 0.002 0.002 0.003 0.002 0.020 0.003 0.003 * 0.004 0.005
β2\beta_{2} 0.002 0.003 0.002 0.002 0.003 0.002 0.023 0.003 0.003 * 0.004 0.005
Table 6: Mean Squared Error of M-estimator under M3 (×103\times 10^{3}). Numbers greater than 10310^{3} are marked as “*”.
D1: Normal D2: Mixed normal D3: t(2) D4: Cauchy
L1 L2 L3 L1 L2 L3 L1 L2 L3 L1 L2 L3
n=100n=100
θ11\theta_{11} 9.219 9.533 10.15 13.19 10.32 11.93 * 13.01 24.69 * 19.59 84.01
θ12\theta_{12} 2.058 0.202 1.881 2.279 0.176 2.198 * 0.244 5.514 * 0.254 16.23
θ21\theta_{21} 0.077 0.116 0.082 0.099 0.131 0.094 0.866 0.161 0.149 * 0.216 0.239
θ22\theta_{22} 0.078 0.116 0.082 0.100 0.130 0.096 1.094 0.163 0.150 * 0.218 0.242
n=200n=200
θ11\theta_{11} 5.331 5.673 5.777 8.520 6.780 7.481 * 8.102 14.89 * 11.69 56.70
θ12\theta_{12} 0.859 0.260 0.896 1.237 0.174 1.085 * 0.193 3.578 * 0.260 10.81
θ21\theta_{21} 0.035 0.054 0.037 0.046 0.062 0.044 0.432 0.072 0.065 * 0.092 0.102
θ22\theta_{22} 0.036 0.056 0.038 0.046 0.061 0.043 0.516 0.069 0.065 * 0.093 0.102
n=400n=400
θ11\theta_{11} 3.896 4.364 4.307 5.005 4.275 4.512 * 5.642 7.141 * 7.081 25.72
θ12\theta_{12} 0.641 0.181 0.666 0.783 0.161 0.737 * 0.204 1.093 * 0.236 5.358
θ21\theta_{21} 0.016 0.025 0.018 0.021 0.029 0.020 0.229 0.035 0.031 * 0.042 0.045
θ22\theta_{22} 0.017 0.027 0.018 0.022 0.030 0.021 0.246 0.034 0.031 * 0.045 0.049
Table 7: Mean Squared Error of M-estimator under M4 (×103\times 10^{3}). Numbers greater than 10310^{3} are marked as “*”.
D1: Normal D2: Mixed normal D3: t(2) D4: Cauchy
L1 L2 L3 L1 L2 L3 L1 L2 L3 L1 L2 L3
n=100n=100
θ11\theta_{11} 0.274 0.369 0.288 0.353 0.440 0.331 2.860 0.531 0.537 * 0.698 0.819
θ12\theta_{12} 0.041 0.044 0.044 0.058 0.053 0.053 0.519 0.063 0.084 * 0.080 0.130
θ21\theta_{21} 31.78 49.76 34.24 44.38 57.81 40.89 * 66.72 64.26 * 110.5 136.7
θ22\theta_{22} 11.27 17.31 12.19 15.95 21.04 14.99 * 23.92 23.12 * 42.04 50.73
n=200n=200
θ11\theta_{11} 0.067 0.104 0.071 0.087 0.109 0.081 0.814 0.130 0.120 * 0.171 0.200
θ12\theta_{12} 0.011 0.016 0.012 0.014 0.017 0.013 0.113 0.019 0.019 * 0.025 0.031
θ21\theta_{21} 14.12 22.44 15.06 18.92 24.85 17.73 * 31.88 28.39 * 40.95 45.49
θ22\theta_{22} 5.102 8.171 5.454 6.771 8.833 6.281 * 10.91 9.927 * 14.83 16.20
n=400n=400
θ11\theta_{11} 0.016 0.026 0.017 0.022 0.029 0.021 1.091 0.033 0.030 * 0.044 0.048
θ12\theta_{12} 0.003 0.004 0.003 0.003 0.004 0.003 0.361 0.005 0.005 * 0.006 0.007
θ21\theta_{21} 6.725 10.680 7.168 9.091 12.34 8.616 * 14.36 13.23 * 19.15 20.54
θ22\theta_{22} 2.381 3.798 2.524 3.287 4.421 3.086 * 5.247 4.680 * 6.672 7.182

6 Conclusion

To cater for practical relevance, this paper proposes a class of parametric multiple-index time series models that accommodate time trends as well as both stationary and nonstationary regressors. An M-estimation approach is adopted in which the loss function can be, but is not limited to, one of six popular choices: the squared loss, the LAD loss, Huber's loss, the quantile loss, the L_{p} loss and the expectile loss. Meanwhile, two categories of link functions are investigated, owing to the different behaviour of functions of nonstationary vector variables, namely the I-regular and H-regular classes in the related literature. Accordingly, our asymptotic theory covers two categories of estimators, and the rates of convergence are discussed under the different classes of loss functions. Moreover, the proposed models are applied to the analysis of stock return predictability, where we find them competitive with existing models in the literature. Finally, Monte Carlo simulations reveal the satisfactory finite-sample performance of the proposed estimators.

Acknowledgements

Dong acknowledges financial support from the National Natural Science Foundation of China (Grant 72073143); Gao acknowledges financial support from the Australian Research Council Discovery Grants Program under Grant Number DP170104421; Peng acknowledges the Australian Research Council Discovery Grants Program for its financial support under Grant Number DP210100476; and Tu acknowledges support from the National Natural Science Foundation of China (Grants 72073002, 12026607, 92046021), the Center for Statistical Science at Peking University, and the Key Laboratory of Mathematical Economics and Quantitative Finance (Peking University), Ministry of Education.

References

  • Babu, (1989) Babu, G. J. (1989). Strong representations for LAD estimators in linear models. Probability Theory and Related Fields, 83:547–558.
  • Bickel, (1974) Bickel, P. J. (1974). Edgeworth expansions in nonparametric statistics. Annals of Statistics, 2:1–20.
  • Bosq, (1996) Bosq, D. (1996). Nonparametric Statistics for Stochastic Processes: Estimation and Prediction. Springer, New York.
  • Campbell and Thompson, (2008) Campbell, J. Y. and Thompson, S. B. (2008). Predicting excess stock returns out of sample: Can anything beat the historical average? Review of Financial Studies, 21(4):1509–1531.
  • Campbell and Yogo, (2006) Campbell, J. Y. and Yogo, M. (2006). Efficient tests of stock return predictability. Journal of Financial Economics, 81(1):27–60.
  • Chang and Park, (2003) Chang, Y. and Park, J. Y. (2003). Index models with integrated time series. Journal of Econometrics, 114:73–106.
  • Chang et al., (2001) Chang, Y., Park, J. Y., and Phillips, P. C. B. (2001). Nonlinear econometric models with cointegrated and deterministically trending regressors. Econometrics Journal, 4:1–36.
  • Chen et al., (2010) Chen, J., Li, D., and Zhang, L. (2010). Robust estimation in a nonlinear cointegration. Journal of Multivariate Analysis, 101:706–717.
  • Chen et al., (2019) Chen, X., Li, D., Li, Q., and Li, Z. (2019). Nonparametric estimation of conditional quantile functions in the presence of irrelevant covariates. Journal of Econometrics, 212:433–450.
  • Chen et al., (2014) Chen, X., Liao, Z., and Sun, Y. (2014). Sieve inference on possibly misspecified semi-nonparametric time series models. Journal of Econometrics, 178:639–658.
  • Davis et al., (1992) Davis, R. A., Knight, K., and Liu, J. (1992). M-estimation for autoregressions with infinite variance. Stochastic Processes and their Applications, 40:145–180.
  • Dong and Gao, (2018) Dong, C. and Gao, J. (2018). Specification testing driven by orthogonal series for nonlinear cointegration with endogeneity. Econometric Theory, 34:754–789.
  • Dong and Gao, (2019) Dong, C. and Gao, J. (2019). Expansion and estimation of Lévy process functionals in nonlinear and nonstationary time series regression. Econometric Reviews, 38:125–150.
  • Dong et al., (2016) Dong, C., Gao, J., and Tjøstheim, D. (2016). Estimation for single-index and partially linear single-index integrated models. Annals of Statistics, 44:425–453.
  • Dong et al., (2017) Dong, C., Gao, J., Tjøstheim, D., and Yin, J. (2017). Specification testing for nonlinear multivariate cointegrating regressions. Journal of Econometrics, 200:104–117.
  • Dong and Linton, (2018) Dong, C. and Linton, O. (2018). Additive nonparametric models with time variable and both stationary and nonstationary regressors. Journal of Econometrics, 207:212–236.
  • Dong et al., (2021) Dong, C., Linton, O., and Peng, B. (2021). A weighted sieve estimator for nonparametric time series models with nonstationary variables. Journal of Econometrics, 222:909–932.
  • Estrada and Kanwal, (1993) Estrada, R. and Kanwal, R. P. (1993). Taylor expansion of distributions. Mathematical Methods in the Applied Sciences, 16:297–304.
  • Fan and Lee, (2019) Fan, R. and Lee, J. H. (2019). Predictive quantile regressions under persistence and conditional heteroskedasticity. Journal of Econometrics, 213(1):261–280.
  • Gel’fand and Shilov, (1964) Gel’fand, I. M. and Shilov, G. E. (1964). Generalized Functions. Academic Press, New York.
  • Hall and Heyde, (1980) Hall, P. and Heyde, C. C. (1980). Martingale Limit Theory and Its Application. Academic Press, New York.
  • He and Shao, (2000) He, X. and Shao, Q. (2000). On parameters of increasing dimensions. Journal of Multivariate Analysis, 73:120–135.
  • Huber, (1964) Huber, P. J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:73–101.
  • Kanwal, (1983) Kanwal, R. P. (1983). Generalized Functions: Theory and Technique. Academic Press, New York.
  • Kasparis et al., (2015) Kasparis, I., Andreou, E., and Phillips, P. (2015). Nonparametric predictive regression. Journal of Econometrics, 185:468–494.
  • Knight, (1989) Knight, K. (1989). Limit theory for autoregressive-parameter estimates in an infinite-variance random walk. Canadian Journal of Statistics, 17:261–278.
  • Koenker and Bassett, (1978) Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46:33–50.
  • Koo et al., (2020) Koo, B., Anderson, H. M., Seo, M. H., and Yao, W. (2020). High dimensional predictive regression in the presence of cointegration. Journal of Econometrics, 219:456–477.
  • Lee et al., (2021) Lee, J., Shi, Z., and Gao, Z. (2021). On lasso for predictive regression. Journal of Econometrics, forthcoming.
  • Lee, (2016) Lee, J. H. (2016). Predictive quantile regression with persistent covariates: IVX-QR approach. Journal of Econometrics, 192:105–118.
  • Newey and Powell, (1986) Newey, W. K. and Powell, J. (1986). Asymmetric least squares estimation and testing. Econometrica, 55:819–847.
  • Park and Phillips, (1999) Park, J. Y. and Phillips, P. C. B. (1999). Asymptotics for nonlinear transformations of integrated time series. Econometric Theory, 15:269–298.
  • Park and Phillips, (2000) Park, J. Y. and Phillips, P. C. B. (2000). Nonstationary binary choice. Econometrica, 68:1249–1280.
  • Park and Phillips, (2001) Park, J. Y. and Phillips, P. C. B. (2001). Nonlinear regression with integrated time series. Econometrica, 69(1):117–161.
  • Phillips, (1991) Phillips, P. C. B. (1991). A shortcut to LAD estimator asymptotics. Econometric Theory, 7:450–463.
  • Phillips, (1995) Phillips, P. C. B. (1995). Robust nonstationary regression. Econometric Theory, 11:912–951.
  • Phillips and Solo, (1992) Phillips, P. C. B. and Solo, V. (1992). Asymptotics for linear processes. Annals of Statistics, 20(2):971–1001.
  • Pollard, (1991) Pollard, D. (1991). Asymptotics for the least absolute deviation regression estimators. Econometric Theory, 7:186–199.
  • Stanković, (1996) Stanković, B. (1996). Taylor expansion for generalized functions. Journal of Mathematical Analysis and Applications, 203:31–37.
  • Susarla and Walter, (1981) Susarla, V. and Walter, G. (1981). Estimation of a multivariate density function using delta sequences. Annals of Statistics, 9:347–355.
  • Tu et al., (2021) Tu, Y., Liang, H. Y., and Wang, Q. (2021). Nonparametric inference for quantile cointegrations with stationary covariates. Journal of Econometrics, forthcoming.
  • van der Vaart, (1996) van der Vaart, A. W. (1996). Weak Convergence and Empirical Processes with Applications to Statistics. Springer Series in Statistics. Springer, New York.
  • Walter and Blum, (1979) Walter, G. and Blum, J. (1979). Probability density estimation using delta sequences. Annals of Statistics, 7:328–340.
  • Wang et al., (2018) Wang, Q., Wu, D., and Zhu, K. (2018). Model checks for nonlinear cointegrating regression. Journal of Econometrics, 207:261–284.
  • Welch and Goyal, (2008) Welch, I. and Goyal, A. (2008). A comprehensive look at the empirical performance of equity premium prediction. Review of Financial Studies, 21:455–508.
  • Xiao, (2009) Xiao, Z. (2009). Quantile cointegrating regression. Journal of Econometrics, 150:248–260.
  • Zinde-Walsh, (2008) Zinde-Walsh, V. (2008). Kernel estimation when density may not exist. Econometric Theory, 24:696–725.

Appendix A: Preliminaries on generalized functions

Since we shall adopt a generalized function approach in the proofs, this section collects some preliminaries on generalized functions. In the mathematical literature, generalized functions are called distributions or tempered distributions, depending on the space of test functions.

Definition (Space D)  The space D consists of all functions \phi(x) defined on the real line such that

  • \phi(x) is infinitely differentiable at every point of \mathbb{R}; that is, \phi^{(k)}(x) exists for every positive integer k;

  • there is a constant A>0 such that \phi(x)\equiv 0 for |x|>A; equivalently, \phi has compact support.

Note that D is a linear space over the real field. Any function \phi(x)\in D is of class C^{\infty} and is known as a test function. There are many different types of functions in D, and it is noteworthy that for any continuous function f(x) with compact support and any given \varepsilon>0, there is a function \phi(x)\in D such that |f(x)-\phi(x)|<\varepsilon for all x. This implies the denseness of D in any C[a,b]. See Gel'fand and Shilov, (1964, p. 3).

Definition (Convergence in D)  A sequence \{\phi_{m}\} in D is said to converge to a function \phi_{0} if the following conditions are satisfied:

  • all \phi_{m}, as well as \phi_{0}, vanish outside a common region;

  • \phi_{m}^{(k)}\to\phi_{0}^{(k)} uniformly over \mathbb{R} as m\to\infty for all k\geq 0.

It can be shown that \phi_{0}\in D, and therefore D is closed under this definition.

Definition (Distribution)  A continuous linear functional on the space D is called a distribution. The space of all distributions on D is denoted by D^{\prime}.

Distribution is another name for generalized function. For the definitions of continuity and linearity of a functional, see Kanwal, (1983, p. 25). The space D^{\prime}, called the dual space of D, is itself a linear space.

The most useful distributions are those generated by locally integrable functions. Indeed, if f(x) is locally integrable on \mathbb{R}, it generates a distribution, still denoted by f, through

\langle f,\phi\rangle=\int f(x)\phi(x)dx,\ \ \ \forall\phi\in D.

A distribution so defined is called a regular distribution. Remarkably, it is proved in Kanwal, (1983, p. 27) that two continuous functions that generate the same regular distribution are identical. Moreover, if two locally integrable functions produce the same regular distribution, they are identical almost everywhere. These facts enable one to identify functions with the distributions they generate.

All distributions other than regular ones are called singular. Thus D^{\prime} is strictly larger than D: every \phi\in D generates a regular distribution, so that D\subset D^{\prime}, and singular distributions do exist. In particular, the Dirac delta \delta(\cdot): \phi\mapsto\phi(x_{0}), \forall\,\phi\in D, with a fixed x_{0}\in\mathbb{R}, is a singular distribution, as shown in Gel'fand and Shilov, (1964, p. 4).

By definition, generalized functions cannot be assigned values at isolated points, while statements about a generalized function on a neighbourhood of a point can be made in a well-defined way. This means that generalized functions are determined by their local properties; detailed discussions can be found in Gel'fand and Shilov, (1964, p. 5). We point out that it is precisely this feature that allows us to circumvent the difficulty caused by the non-smoothness of the loss functions in the M-estimation.

As is well known, not all ordinary functions are differentiable. By contrast, generalized functions always have derivatives that are themselves generalized functions, and consequently they have derivatives of any order. See Gel'fand and Shilov, (1964, p. 18).

Definition (Derivative of distributions)  Let f\in D^{\prime}. The functional g defined on D by

\langle g,\phi\rangle=-\langle f,\phi^{\prime}\rangle,\ \ \ \forall\,\phi\in D

is called the derivative of f, denoted by f^{\prime} or df/dx.

The definition is motivated by integration by parts. Note that f^{\prime} so defined is linear and continuous, and hence an eligible member of D^{\prime}. Accordingly, all generalized functions have derivatives of all orders. Moreover, the generalized derivative of a regular distribution agrees with the conventional one whenever the latter exists.
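A canonical illustration (a standard textbook example, included here for completeness) is the Heaviside function H(x)=1\{x\geq 0\}, which is locally integrable but not differentiable at the origin. For any \phi\in D,

\langle H^{\prime},\phi\rangle=-\langle H,\phi^{\prime}\rangle=-\int_{0}^{\infty}\phi^{\prime}(x)dx=\phi(0)=\langle\delta,\phi\rangle,

since \phi vanishes outside a compact set; hence H^{\prime}=\delta in D^{\prime}.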

Definition (Convergence in D^{\prime})  A sequence of distributions f_{m}\in D^{\prime}, m=1,2,\cdots, is said to converge to a distribution f\in D^{\prime} if

\lim_{m\to\infty}\langle f_{m},\phi\rangle=\langle f,\phi\rangle,\ \ \ \forall\,\phi\in D.

A set of distributions \{f_{\epsilon}\} indexed by real \epsilon is said to converge to f as \epsilon\to\epsilon_{0} if, for all \phi\in D, \lim_{\epsilon\to\epsilon_{0}}\langle f_{\epsilon},\phi\rangle=\langle f,\phi\rangle.

A series of distributions \sum_{m=1}^{\infty}f_{m} converges to a distribution f\in D^{\prime} if the sequence of partial sums s_{M}=\sum_{m=1}^{M}f_{m} converges to f as M\to\infty.

These definitions contain the convergence of ordinary functions as a special case. Indeed, suppose that all members of the distribution sequence \{f_{m}\} are regular and that f_{m}(x) converges to f(x) uniformly on every compact interval; then

\lim_{m\to\infty}\langle f_{m},\phi\rangle=\lim_{m\to\infty}\int f_{m}(x)\phi(x)dx=\int f(x)\phi(x)dx=\langle f,\phi\rangle,\ \ \forall\,\phi\in D,

by the uniform convergence theorem.

A consequence of the definition is that if f_{m}\to f, then f_{m}^{(k)}\to f^{(k)} for any k>0; if \sum_{m=1}^{\infty}f_{m} converges to f, then the series can be differentiated term by term as many times as required. See Kanwal, (1983, p. 59).

The most important sequences of distributions are the so-called delta-convergent sequences: sequences of regular distributions \{f_{m}\} that converge to the \delta distribution. These may be regarded as a bridge between regular and singular distributions.

Lemma A.1.

Let f(x) be a nonnegative function defined on \mathbb{R} such that \int f(x)dx=1. Put f_{\epsilon}(x)=\epsilon^{-1}f(x/\epsilon), \epsilon>0. Then \lim_{\epsilon\to 0}f_{\epsilon}(x)=\delta(x).

This is exactly the theorem in Kanwal, (1983, p. 62) for univariate functions, and it enables us to construct delta sequences. Observe that \epsilon can be replaced by 1/m to obtain a sequence f_{m}(x) that converges to \delta(x) as m\to\infty.

Example (Delta-convergent sequence)

(1) f(x)=I(|x|\leq 1/2) and f_{m}(x)=mI(|mx|\leq 1/2). Then \lim_{m\to\infty}f_{m}(x)=\delta(x).

(2) f(x)=\frac{1}{\pi}\frac{1}{x^{2}+1} and f_{\epsilon}(x)=\frac{1}{\pi}\frac{\epsilon}{x^{2}+\epsilon^{2}}. Then \lim_{\epsilon\to 0}f_{\epsilon}(x)=\delta(x).

(3) Define f_{\epsilon}(x)=\frac{1}{\sqrt{2\pi}\epsilon}\exp(-x^{2}/2\epsilon^{2}) with \epsilon>0. Then \lim_{\epsilon\to 0}f_{\epsilon}(x)=\delta(x), and hence for any g(x) that is continuous at x=x_{0}, we have

\lim_{\epsilon\to 0}\int f_{\epsilon}(x-x_{0})g(x)dx=\int\delta(x-x_{0})g(x)dx=g(x_{0}).

This is in fact the rationale behind kernel estimation.
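A quick numerical check of this rationale (our own sketch, not part of the paper's procedure) confirms that the Gaussian sequence in (3) reproduces g(x_{0}) in the limit:

import numpy as np

def f_eps(x, eps):
    # Gaussian delta-convergent sequence from Example (3)
    return np.exp(-x ** 2 / (2.0 * eps ** 2)) / (np.sqrt(2.0 * np.pi) * eps)

g, x0 = np.cos, 0.3                                 # any g continuous at x0
x = np.linspace(-10.0, 10.0, 200001)
for eps in (1.0, 0.1, 0.01):
    print(eps, np.trapz(f_eps(x - x0, eps) * g(x), x))   # tends to cos(0.3), about 0.9553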

(4) Consider

f(x)=\begin{cases}C\exp\left(-\frac{1}{1-x^{2}}\right),&|x|<1,\\ 0,&|x|\geq 1,\end{cases}

where C is such that \int f(x)dx=1. Then, we define

f_{\epsilon}(x)=\begin{cases}C\epsilon^{-1}\exp\left(-\frac{\epsilon^{2}}{\epsilon^{2}-x^{2}}\right),&|x|<\epsilon,\\ 0,&|x|\geq\epsilon,\end{cases}

and \lim_{\epsilon\to 0}f_{\epsilon}(x)=\delta(x).
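As a numerical aside (our own sketch), the normalizing constant C in this example can be evaluated by quadrature:

import numpy as np

# the integrand exp(-1/(1 - x^2)) extends continuously by 0 at |x| = 1,
# so the endpoints can simply be excluded from the grid
x = np.linspace(-1.0, 1.0, 200001)[1:-1]
C = 1.0 / np.trapz(np.exp(-1.0 / (1.0 - x ** 2)), x)
print(C)                                            # approximately 2.2523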

In some cases we need a test function space larger than D, in order to relax the compact support requirement on the test functions to the entire real line.

Definition (Space S)  The space S of test functions of rapid decay contains all functions \phi defined on \mathbb{R} such that

  • \phi(x) is infinitely differentiable, i.e. \phi(x)\in C^{\infty};

  • \phi(x), as well as its derivatives of all orders, vanishes at infinity faster than any power of 1/|x|; i.e., for any p,k\geq 0, |x^{p}\phi^{(k)}(x)|\leq C_{pk}, where the constant C_{pk} depends only on p, k and \phi.

The space S is linear and clearly D\subset S. Accordingly, S^{\prime}\subset D^{\prime}, because a continuous linear functional on S is also a continuous linear functional on D. Similarly to D and D^{\prime}, one may define the convergence of sequences and the derivative of distributions in S and S^{\prime}; see Gel'fand and Shilov, (1964, p. 17) and Kanwal, (1983, p. 138).

Appendix B: Lemmas

Under Assumption A and, without loss of generality, letting x_{0}=0 a.s., similarly to (A.5) and (A.6) of Dong et al., (2017), we have

x_{t}=\sum_{i=1}^{t}w_{i}=\sum_{j=-\infty}^{t}B_{t,j}\eta_{j},\ \ \text{where}\ B_{t,j}=\sum_{i=\max(1,j)}^{t}A_{i-j}, (B.1)

and for t>s>0,

x_{t}=\sum_{i=1}^{t}w_{i}=\sum_{i=s+1}^{t}w_{i}+x_{s}=x_{ts}+x_{ts}^{*}, (B.2)

where x_{ts}=\sum_{j=s+1}^{t}B_{t,j}\eta_{j}, x_{ts}^{*}=x_{s}+\bar{x}_{ts}, and \bar{x}_{ts}=\sum_{j=-\infty}^{s}\left(\sum_{i=s+1}^{t}A_{i-j}\right)\eta_{j}. As a result, x_{ts} is independent of x_{ts}^{*}. Denote d_{ts}^{2}:=\mathbb{E}(x_{ts}x_{ts}^{\top}); from the B-N decomposition (Phillips and Solo, 1992) we have d_{ts}^{2}\sim t-s when t-s is large. The representations (B.1) and (B.2), along with the following lemma, will be used in the asymptotic analysis.

Lemma B.1.

Let Assumption A hold. Let x_{0,t}=\theta_{0}^{\top}x_{t} be a unit root process and define x_{0,ts}=\theta_{0}^{\top}x_{ts} for t>s, where x_{ts} is given by (B.2); define also d_{0,t}^{2}=\mathbb{E}(x_{0,t}^{2}) and d_{0,ts}^{2}=\mathbb{E}(x_{0,ts}^{2}).

  1. (1)

For large t, d_{0,t}^{-1}x_{0,t} have densities f_{t}(u) which are uniformly bounded over u\in\mathbb{R} and t. Meanwhile, the derivatives of f_{t}(u) are uniformly bounded as well. Consequently, the f_{t}(u) satisfy a uniform Lipschitz condition

\sup_{u\in\mathbb{R}}|f_{t}(u+\triangle u)-f_{t}(u)|\leq C\,|\triangle u| (B.3)

for some C>0 and any \triangle u\in\mathbb{R}.

  2. (2)

For large t-s, d_{0,ts}^{-1}x_{0,ts} have uniformly bounded densities f_{ts}(u) over all u\in\mathbb{R} and (t,s). Additionally, the f_{ts}(u) have bounded derivatives and satisfy a Lipschitz condition similar to (B.3) as well.

This is Lemma A.2 in Dong and Gao, (2018).

Let \theta be a unit vector in \mathbb{R}^{d_{1}} such that the sequence \theta^{\top}x_{t} is a unit root process, and let R=(\theta,R_{2}) be an orthogonal matrix of dimension d_{1}\times d_{1}. Denote, if no ambiguity arises, R^{\top}x_{t}=(x_{1t},x_{2t}^{\top})^{\top} conformably with the block representation of R. It is clear that the covariance matrix of R^{\top}x_{t} has the asymptotic form R^{\top}AA^{\top}R\,t(1+o(1)). We suppress the dependence of x_{1t} and x_{2t} on \theta for simplicity.

Lemma B.2.

Let Assumption A hold.

  1. (1)

    For large tt, t1(x1t,x2t)\sqrt{t}^{\;-1}(x_{1t},x_{2t}^{\top})^{\top} have densities ψt(x,w)\psi_{t}(x,w) which are uniformly bounded over x,wd11x\in\mathbb{R},w\in\mathbb{R}^{d_{1}-1} and tt. Meanwhile, the derivatives of ψt(x,w)\psi_{t}(x,w) are uniformly bounded as well. Consequently, ψt(x,w)\psi_{t}(x,w) satisfy a uniform Lipschitz condition

    supx,w|ψt(x+x,w+w)ψt(x,w)|C(|x|+w)\sup_{x,w}|\psi_{t}(x+\triangle x,w+\triangle w)-\psi_{t}(x,w)|\leq C(|\triangle x|+\|\triangle w\|) (B.4)

    for some C>0C>0 and any x\triangle x and w\triangle w.

    Meanwhile, for large tt, ψt(x,w)=f1t(x)ϕ(w)(1+o(1))\psi_{t}(x,w)=f_{1t}(x)\phi(w)(1+o(1)) where f1t(x)f_{1t}(x) is the marginal density of t1x1t\sqrt{t}^{\;-1}x_{1t} and ϕ(w)\phi(w) is the standard normal density.

  2. (2)

For large t-s, d_{\theta,ts}^{-1}x_{\theta,ts} have uniformly bounded densities f_{\theta,ts}(x) over all x\in\mathbb{R} and t>s, where x_{\theta,ts}=\theta^{\top}x_{ts} and d_{\theta,ts}^{2}=\mathbb{E}x_{\theta,ts}^{2}; additionally, the f_{\theta,ts}(x) have bounded derivatives and satisfy a Lipschitz condition similar to (B.4) as well.

Lemma B.3.

The following assertions hold:

  1. (1)

    1t(x1t,x2t)\frac{1}{\sqrt{t}}(x_{1t},x_{2t}^{\top}) has a joint probability density ψt(x,w)\psi_{t}(x,w^{\top}); and given s\mathcal{F}_{s} (defined in Assumption A), 1ts(x1tx1s,x2tx2s)\frac{1}{\sqrt{t-s}}(x_{1t}-x_{1s},x_{2t}^{\top}-x_{2s}^{\top}) has a joint density ψts(x,w)\psi_{ts}(x,w^{\top}) where t>s+1t>s+1. Meanwhile, these functions are bounded uniformly in (x,w)(x,w) as well as tt and (t,s)(t,s), respectively.

  2. (2)

For large t and t-s, we have \psi_{t}(x,w^{\top})=\phi(w)f_{t}(x)(1+o(1)) and \psi_{ts}(x,w^{\top})=\phi(w)f_{ts}(x)(1+o(1)), where \phi(w) is the density of a (d_{1}-1)-dimensional multivariate normal distribution, f_{t}(x) is the marginal density of \frac{1}{\sqrt{t}}x_{1t} and f_{ts}(x) is the marginal density of \frac{1}{\sqrt{t-s}}(x_{1t}-x_{1s}).

Lemma B.4.

Suppose that ρ(u)\rho^{\prime}(u) and ρ′′(u)\rho^{\prime\prime}(u) exist in the ordinary sense. Under Assumptions A-C and D(1)-(2), denoting xnt=n1/2xtx_{nt}=n^{-1/2}x_{t}, as nn\to\infty,

\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\rho^{\prime}(e_{t})\begin{pmatrix}\dot{g}_{1}(\theta_{0}^{\top}x_{nt})x_{nt}\\ \dot{g}_{2}(\beta_{0}^{\top}z_{t})z_{t}\end{pmatrix}\to_{D}\begin{pmatrix}\int_{0}^{1}\dot{g}_{1}(\theta_{0}^{\top}B(r))B(r)dU(r)\\ N(0,a_{1}\Sigma)\end{pmatrix}, (B.5)

where \Sigma=\int_{0}^{1}[\mathbb{E}\dot{g}_{2}^{2}(\beta_{0}^{\top}h(r,v_{1}))\,h(r,v_{1})h(r,v_{1})^{\top}]dr and a_{1}>0 is given in Assumption C.

Moreover,

\frac{1}{n}\sum_{t=1}^{n}\rho^{\prime\prime}(e_{t})\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}\to_{P}a_{2}\Sigma, (B.6)
and
\frac{1}{n}\sum_{t=1}^{n}\rho^{\prime\prime}(e_{t})\begin{pmatrix}\dot{g}_{1}^{2}(\theta_{0}^{\top}x_{nt})x_{nt}x_{nt}^{\top}\\ \dot{g}_{1}(\theta_{0}^{\top}x_{nt})\dot{g}_{2}(\beta_{0}^{\top}z_{t})z_{t}x_{nt}^{\top}\end{pmatrix}\to_{D}\begin{pmatrix}a_{2}\int_{0}^{1}\dot{g}_{1}^{2}(\theta_{0}^{\top}B(r))B(r)B(r)^{\top}dr\\ a_{2}\int_{0}^{1}\dot{g}_{1}(\theta_{0}^{\top}B(r))[\mathbb{E}\dot{g}_{2}(\beta_{0}^{\top}h(r,v_{1}))h(r,v_{1})]B(r)^{\top}dr\end{pmatrix}. (B.7)

where a2=𝔼[ρ′′(et)|t1]>0a_{2}=\mathbb{E}[\rho^{\prime\prime}(e_{t})|{\cal F}_{t-1}]>0 a.s. stipulated in Assumption C.

Proof of Lemma B.4.

Recall that z_{t}=h(\tau_{t},v_{t}), where \tau_{t}=t/n and v_{t}=q(\eta_{t},\cdots,\eta_{t-d_{0}+1})+\tilde{v}_{t}, with \tilde{v}_{t} independent of \{\eta_{j},j\in\mathbb{Z}\}. Thus, z_{t} and x_{t} are correlated through these \eta's. In view of (B.1) we may write x_{t}=\sum_{i=t-d_{0}+1}^{t}w_{i}+\sum_{i=1}^{t-d_{0}}w_{i}\equiv w_{t,d_{0}}+x_{t-d_{0}}. Then, because n^{-1/2}w_{t,d_{0}}=o_{P}(1), x_{t} and x_{t-d_{0}} have the same asymptotic distribution. Meanwhile, x_{t-d_{0}} and z_{t} are mutually independent.

By the continuity of \ddot{g}_{1}(\cdot), \dot{g}_{1}(\theta_{0}^{\top}x_{nt})-\dot{g}_{1}(\theta_{0}^{\top}x_{n,t-d_{0}})=O_{P}(n^{-1/2}). We may therefore replace \dot{g}_{1}(\theta_{0}^{\top}x_{nt}) by \dot{g}_{1}(\theta_{0}^{\top}x_{n,t-d_{0}}) in the derivations below; to keep the notation simple, however, we avoid doing so and treat \dot{g}_{1}(\theta_{0}^{\top}x_{nt}) and z_{t} as independent.

It follows from Park and Phillips, (2001) that

1nt=1nρ(et)g˙1(θ0xn,t)xn,tD01g˙1(θ0B(r))B(r)𝑑U(r),\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\rho^{\prime}(e_{t})\dot{g}_{1}(\theta_{0}^{\top}x_{n,t})x_{n,t}\to_{D}\int_{0}^{1}\dot{g}_{1}(\theta_{0}^{\top}B(r))B(r)dU(r),

while, since z_{t} is strictly stationary and \alpha-mixing, and due to the martingale difference structure imposed in Assumption B, it is evident that \frac{1}{\sqrt{n}}\sum_{t=1}^{n}\rho^{\prime}(e_{t})\dot{g}_{2}(\beta_{0}^{\top}z_{t})z_{t}\to_{D}N(0,a_{1}\Sigma), where the convergence in probability of the conditional covariance matrix to \Sigma is shown below. The joint convergence then follows from the independence between x_{n,t-d_{0}} and z_{t}. This establishes (B.5).

Now, consider the convergence in probability in (B.6). Because of the martingale difference structure of e_{t}, the result will hold if we can show \frac{1}{n}\sum_{t=1}^{n}\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}\to_{P}\Sigma as n\to\infty. To this end, notice that z_{t} is \alpha-mixing, so that by Assumption B, \frac{1}{n}\sum_{t=1}^{n}[\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}-\mathbb{E}\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}]=o_{P}(1). Indeed,

𝔼1nt=1n[g˙22(β0zt)ztzt𝔼g˙22(β0zt)ztzt]2=1n2𝔼t=1ng˙22(β0zt)ztzt𝔼[g˙22(β0zt)ztzt]2\displaystyle\mathbb{E}\left\|\frac{1}{n}\sum_{t=1}^{n}[\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}-\mathbb{E}\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}]\right\|^{2}=\frac{1}{n^{2}}\mathbb{E}\sum_{t=1}^{n}\left\|\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}-\mathbb{E}[\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}]\right\|^{2}
+2n2𝔼t=2ns=1t1tr([g˙22(β0zt)ztzt𝔼g˙22(β0zt)ztzt][g˙22(β0zs)zszs𝔼g˙22(β0zs)zszs])\displaystyle+\frac{2}{n^{2}}\mathbb{E}\sum_{t=2}^{n}\sum_{s=1}^{t-1}\text{tr}\left([\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}-\mathbb{E}\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}][\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{s})z_{s}z_{s}^{\top}-\mathbb{E}\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{s})z_{s}z_{s}^{\top}]^{\top}\right)
\displaystyle\leq 1n2t=1n𝔼[g˙24(β0zt)zt4]+2n2t=2ns=1t1α1/2(ts)[𝔼(g˙22(β0zt)ztzt4)]1/4[𝔼(g˙22(β0zs)zszs4)]1/4\displaystyle\frac{1}{n^{2}}\sum_{t=1}^{n}\mathbb{E}[\dot{g}_{2}^{4}(\beta_{0}^{\top}z_{t})\|z_{t}\|^{4}]+\frac{2}{n^{2}}\sum_{t=2}^{n}\sum_{s=1}^{t-1}\alpha^{1/2}(t-s)[\mathbb{E}(\|\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})\;z_{t}z_{t}^{\top}\|^{4})]^{1/4}[\mathbb{E}(\|\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{s})\;z_{s}z_{s}^{\top}\|^{4})]^{1/4}
=\displaystyle= O(n1)=o(1)\displaystyle O(n^{-1})=o(1)

where we use Davydov's inequality for \alpha-mixing processes with p=2 and q=r=4 (Bosq, 1996, p. 19) and the condition in Assumption B(2).

It then suffices to show that

\frac{1}{n}\sum_{t=1}^{n}\mathbb{E}\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}=\frac{1}{n}\sum_{t=1}^{n}\mathbb{E}\dot{g}_{2}^{2}(\beta_{0}^{\top}h(\tau_{t},v_{t}))h(\tau_{t},v_{t})h(\tau_{t},v_{t})^{\top}
=\sum_{t=1}^{n-1}\int_{\tau_{t}}^{\tau_{t+1}}\mathbb{E}\dot{g}_{2}^{2}(\beta_{0}^{\top}h(r,v_{t}))h(r,v_{t})h(r,v_{t})^{\top}dr+o(1)
=\int_{0}^{1}\mathbb{E}\dot{g}_{2}^{2}(\beta_{0}^{\top}(h(r)+v_{1}))(h(r)+v_{1})(h(r)+v_{1})^{\top}dr+o(1)\to\Sigma.

Finally we consider (B.7). As argued above, it suffices to consider the convergence of

\frac{1}{n}\begin{pmatrix}\sum_{t=1}^{n}\dot{g}_{1}^{2}(\theta_{0}^{\top}x_{nt})x_{nt}x_{nt}^{\top}\\ \sum_{t=1}^{n}\dot{g}_{1}(\theta_{0}^{\top}x_{nt})(\mathbb{E}[\dot{g}_{2}(\beta_{0}^{\top}z_{t})z_{t}])x_{nt}^{\top}\end{pmatrix}
=\frac{1}{n}\begin{pmatrix}\sum_{t=1}^{n}\dot{g}_{1}^{2}(\theta_{0}^{\top}x_{nt})x_{nt}x_{nt}^{\top}\\ \sum_{t=1}^{n}\dot{g}_{1}(\theta_{0}^{\top}x_{nt})(\mathbb{E}[\dot{g}_{2}(\beta_{0}^{\top}h(\tau_{t},v_{1}))h(\tau_{t},v_{1})])x_{nt}^{\top}\end{pmatrix}.

The result then follows from Theorem 3.2 of Dong and Gao, (2019, p. 131). ∎

Consider the case where \rho^{\prime}(\cdot) or \rho^{\prime\prime}(\cdot) is a generalized function. Let \rho^{\prime}_{m}(\cdot) and \rho^{\prime\prime}_{m}(\cdot) be regular sequences for \rho^{\prime}(\cdot) and \rho^{\prime\prime}(\cdot), respectively; see Phillips, (1995, p. 918). For the Dirac delta function, for example, the regular sequence is any delta-convergent sequence. Define, for r\in[0,1],

Umn(r)1nt=1[nr][ρm(et)𝔼ρm(et)].U_{mn}(r)\equiv\frac{1}{\sqrt{n}}\sum_{t=1}^{[nr]}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})].

Then for each m, by the functional invariance principle, U_{mn}(r)\Rightarrow U_{m}(r) as n\to\infty, where U_{m}(r) is a Brownian motion with variance \text{Var}(\rho^{\prime}_{m}(e_{1})).

Lemma B.5.

Under Assumptions A-C and D(1)-(2), for each mm, as nn\to\infty,

\frac{1}{\sqrt{n}}\sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})]\begin{pmatrix}\dot{g}_{1}(\theta_{0}^{\top}x_{nt})x_{nt}\\ \dot{g}_{2}(\beta_{0}^{\top}z_{t})z_{t}\end{pmatrix}\to_{D}\begin{pmatrix}\int_{0}^{1}\dot{g}_{1}(\theta_{0}^{\top}B(r))B(r)dU_{m}(r)\\ N(0,a_{m1}\Sigma)\end{pmatrix}, (B.8)

where Σ\Sigma is the same as in Lemma B.4, am1=Var(ρm(e1))a_{m1}=\text{Var}(\rho^{\prime}_{m}(e_{1})).

In addition, as nn\to\infty,

\frac{1}{n}\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})\dot{g}_{2}^{2}(\beta_{0}^{\top}z_{t})z_{t}z_{t}^{\top}\to_{P}a_{m2}\Sigma, (B.9)
and
\frac{1}{n}\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})\begin{pmatrix}\dot{g}_{1}^{2}(\theta_{0}^{\top}x_{nt})x_{nt}x_{nt}^{\top}\\ \dot{g}_{1}(\theta_{0}^{\top}x_{nt})\dot{g}_{2}(\beta_{0}^{\top}z_{t})z_{t}x_{nt}^{\top}\end{pmatrix}\to_{D}\begin{pmatrix}a_{m2}\int_{0}^{1}\dot{g}_{1}^{2}(\theta_{0}^{\top}B(r))B(r)B(r)^{\top}dr\\ a_{m2}\int_{0}^{1}\dot{g}_{1}(\theta_{0}^{\top}B(r))[\mathbb{E}\dot{g}_{2}(\beta_{0}^{\top}h(r,v_{1}))h(r,v_{1})]B(r)^{\top}dr\end{pmatrix}, (B.10)

where am2=𝔼[ρm′′(et)]>0a_{m2}=\mathbb{E}[\rho^{\prime\prime}_{m}(e_{t})]>0.

Proof of Lemma B.5.

The proof is the same as that of Lemma B.4. ∎

Denote

ξnt:=(1n4g˙1(x1t)x1t1n43g˙1(x1t)x2t1ng˙2(ztβ0)zt)andMn,t:=ξntξnt.\displaystyle\xi_{nt}:=\begin{pmatrix}\frac{1}{\sqrt[4]{n}}\dot{g}_{1}(x_{1t})x_{1t}\\ \frac{1}{\sqrt[4]{n}^{3}}\dot{g}_{1}(x_{1t})x_{2t}\\ \frac{1}{\sqrt{n}}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}\end{pmatrix}\ \ \ \text{and}\ \ \ M_{n,t}:=\xi_{nt}\xi_{nt}^{\top}.
Lemma B.6.

Suppose that \rho^{\prime}(u) and \rho^{\prime\prime}(u) exist in the ordinary sense. Under Assumptions A-C and D(2)-(3), as n\to\infty we have, jointly,

t=1nρ′′(et)Mn,tDa2M,\displaystyle\sum_{t=1}^{n}\rho^{\prime\prime}(e_{t})M_{n,t}\to_{D}a_{2}\,M, (B.11)

where a2>0a_{2}>0 is given in Assumption C and MM is a d×dd\times d symmetric matrix

M=([ug˙1(u)]2𝑑uL1(1,0)01B2(r)𝑑L1(r,0)ug˙12(u)𝑑u001B2(r)B2(r)𝑑L1(r,0)g˙12(u)𝑑u0Σ),\displaystyle M=\begin{pmatrix}\int[u\dot{g}_{1}(u)]^{2}duL_{1}(1,0)&\int_{0}^{1}B_{2}^{\top}(r)dL_{1}(r,0)\int u\dot{g}_{1}^{2}(u)du&0\\ &\int_{0}^{1}B_{2}(r)B_{2}(r)^{\top}dL_{1}(r,0)\int\dot{g}_{1}^{2}(u)du&0\\ &&\Sigma\end{pmatrix}, (B.12)

where Σ\Sigma is defined in Lemma B.4.

Meanwhile,

t=1nρ(et)ξntDM1/2ξwithξN(0,a1Id),\displaystyle\sum_{t=1}^{n}\rho^{\prime}(e_{t})\xi_{nt}\to_{D}M^{1/2}\xi\ \ \ \mbox{with}\ \ \ \xi\sim N(0,a_{1}I_{d}), (B.13)

where a1>0a_{1}>0 is defined in Assumption C.

Proof of Lemma B.6.

We first show (B.11). The joint convergence is based on the assumption of joint convergence for the underlying process in Assumption C.

Notice that \sum_{t=1}^{n}\rho^{\prime\prime}(e_{t})M_{n,t}=a_{2}\sum_{t=1}^{n}M_{n,t}+\sum_{t=1}^{n}[\rho^{\prime\prime}(e_{t})-a_{2}]M_{n,t}=a_{2}\sum_{t=1}^{n}M_{n,t}+o_{P}(1) by the martingale difference sequence structure in Assumption C. Hence, what remains for (B.11) is to establish the limit of \sum_{t=1}^{n}M_{n,t}.

Observe that as nn\to\infty,

1nt=1n[g˙1(x1t)x1t]2\displaystyle\frac{1}{\sqrt{n}}\sum_{t=1}^{n}[\dot{g}_{1}(x_{1t})x_{1t}]^{2} D[ug˙1(u)]2𝑑uL1(1,0),\displaystyle\to_{D}\int[u\dot{g}_{1}(u)]^{2}du\,L_{1}(1,0),
1nt=1ng˙1(x1t)2x1tx2t\displaystyle\frac{1}{n}\sum_{t=1}^{n}\dot{g}_{1}(x_{1t})^{2}x_{1t}x_{2t} D01B2(r)𝑑L1(r,0)ug˙12(u)𝑑u,\displaystyle\to_{D}\int_{0}^{1}B_{2}(r)dL_{1}(r,0)\int u\dot{g}_{1}^{2}(u)du,
1nnt=1ng˙12(x1t)x2tx2t\displaystyle\frac{1}{n\sqrt{n}}\sum_{t=1}^{n}\dot{g}_{1}^{2}(x_{1t})x_{2t}x_{2t}^{\top} D01B2(r)B2(r)𝑑L1(r,0)g˙12(u)𝑑u\displaystyle\to_{D}\int_{0}^{1}B_{2}(r)B_{2}(r)^{\top}dL_{1}(r,0)\int\dot{g}_{1}^{2}(u)du

were shown in Park and Phillips, (2000), whereas \frac{1}{n}\sum_{t=1}^{n}\dot{g}_{2}^{2}(z_{t}^{\top}\beta_{0})z_{t}z_{t}^{\top}\to_{P}\Sigma was shown in Lemma B.4. We now turn to the convergence of

(1)1n43t=1ng˙1(x1t)x1tg˙2(ztβ0)zt=oP(1),and(2)1nn4t=1ng˙12(x1t)x2tg˙2(ztβ0)zt=oP(1).(1)\ \ \frac{1}{\sqrt[4]{n}^{3}}\sum_{t=1}^{n}\dot{g}_{1}(x_{1t})x_{1t}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}=o_{P}(1),\ \ \text{and}\ \ (2)\ \ \frac{1}{n\sqrt[4]{n}}\sum_{t=1}^{n}\dot{g}_{1}^{2}(x_{1t})x_{2t}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}=o_{P}(1).

(1) Write

1n43t=1ng˙1(x1t)x1tg˙2(ztβ0)zt\displaystyle\frac{1}{\sqrt[4]{n}^{3}}\sum_{t=1}^{n}\dot{g}_{1}(x_{1t})x_{1t}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}
=\displaystyle= 1n43t=1ng˙1(x1t)x1t𝔼g˙2(ztβ0)zt+1n43t=1ng˙1(x1t)x1t[g˙2(ztβ0)zt𝔼g˙2(ztβ0)zt]\displaystyle\frac{1}{\sqrt[4]{n}^{3}}\sum_{t=1}^{n}\dot{g}_{1}(x_{1t})x_{1t}\mathbb{E}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}+\frac{1}{\sqrt[4]{n}^{3}}\sum_{t=1}^{n}\dot{g}_{1}(x_{1t})x_{1t}\left[\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}-\mathbb{E}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}\right]
=\displaystyle= A1n+A2n,say.\displaystyle A_{1n}+A_{2n},\ \ \ \text{say}.

First, A2n=oP(1)A_{2n}=o_{P}(1). Indeed, as argued in the proof of Lemma B.4, xtx_{t} and ztz_{t} can be regarded as mutually independent since they only share a finite number of η\eta’s in the assumption, and we thus have

𝔼A2n2=1n3𝔼t=1ng˙12(x1t)x1t2g˙2(ztβ0)zt𝔼g˙2(ztβ0)zt2\displaystyle\mathbb{E}\|A_{2n}\|^{2}=\frac{1}{\sqrt{n}^{3}}\mathbb{E}\sum_{t=1}^{n}\dot{g}_{1}^{2}(x_{1t})x_{1t}^{2}\left\|\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}-\mathbb{E}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}\right\|^{2}
+2n3𝔼t=2ns=1t1g˙1(x1t)x1tg˙1(x1s)x1s\displaystyle+\frac{2}{\sqrt{n}^{3}}\mathbb{E}\sum_{t=2}^{n}\sum_{s=1}^{t-1}\dot{g}_{1}(x_{1t})x_{1t}\dot{g}_{1}(x_{1s})x_{1s}
×[g˙2(ztβ0)zt𝔼g˙2(ztβ0)zt][g˙2(zsβ0)zs𝔼g˙2(zsβ0)zs]\displaystyle\times\left[\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}-\mathbb{E}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}\right]^{\prime}\left[\dot{g}_{2}(z_{s}^{\top}\beta_{0})z_{s}-\mathbb{E}\dot{g}_{2}(z_{s}^{\top}\beta_{0})z_{s}\right]
\displaystyle\leq 1n3t=1n𝔼[g˙12(x1t)x1t2]𝔼g˙2(ztβ0)zt𝔼g˙2(ztβ0)zt2\displaystyle\frac{1}{\sqrt{n}^{3}}\sum_{t=1}^{n}\mathbb{E}[\dot{g}_{1}^{2}(x_{1t})x_{1t}^{2}]\mathbb{E}\left\|\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}-\mathbb{E}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}\right\|^{2}
+2n3t=2ns=1t1𝔼|g˙1(x1t)x1tg˙1(x1s)x1s|α1/2(ts)\displaystyle+\frac{2}{\sqrt{n}^{3}}\sum_{t=2}^{n}\sum_{s=1}^{t-1}\mathbb{E}|\dot{g}_{1}(x_{1t})x_{1t}\dot{g}_{1}(x_{1s})x_{1s}|\alpha^{1/2}(t-s)
×𝔼g˙2(ztβ0)zt𝔼g˙2(ztβ0)zt4𝔼g˙2(zsβ0)zs𝔼g˙2(zsβ0)zs4=O(n1/2),\displaystyle\times\mathbb{E}\left\|\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}-\mathbb{E}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}\right\|^{4}\mathbb{E}\left\|\dot{g}_{2}(z_{s}^{\top}\beta_{0})z_{s}-\mathbb{E}\dot{g}_{2}(z_{s}^{\top}\beta_{0})z_{s}\right\|^{4}=O(n^{-1/2}),

since \sum_{k}\alpha^{1/2}(k)<\infty, \mathbb{E}\left\|\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}-\mathbb{E}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}\right\|^{4} is bounded, x_{1t} is a unit root process, \dot{g}_{1}^{2}(u)u^{2} is integrable, and (x_{1t},x_{1s}) for large t-s has a joint density given {\cal F}_{s} that satisfies Lemma B.1. For the detailed calculation, see Lemma A.3 in Dong and Gao, (2018).

Second, A1n=oP(1)A_{1n}=o_{P}(1). This is shown by invoking the density of d1t1x1td_{1t}^{-1}x_{1t} in Lemma B.1 and the integrability of g˙12(u)u2\dot{g}_{1}^{2}(u)u^{2} as well as the boundedness of 𝔼g˙2(ztβ0)zt𝔼g˙2(ztβ0)zt2\mathbb{E}\left\|\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}-\mathbb{E}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}\right\|^{2}. This finishes the proof of (1).

The assertion (2) can be shown similarly, but one has to invoke Lemma B.2, where the joint densities of (x_{1t},x_{2t}) and their properties have been established. We omit the details due to the similarity.

Now we consider (B.13). The sum is a martingale, and we thus invoke the martingale CLT, Corollary 3.1 of Hall and Heyde, (1980, p. 58). Under Assumption C, we may show

Mn1/2t=1nρ(et)ξntDξ,whereξnt:=(1n4g˙1(x1t)x1t1n43g˙1(x1t)x2t1ng˙2(ztβ0)zt)andMn:=t=1nMn,t.M_{n}^{-1/2}\sum_{t=1}^{n}\rho^{\prime}(e_{t})\xi_{nt}\to_{D}\xi,\ \ \ \text{where}\ \xi_{nt}:=\begin{pmatrix}\frac{1}{\sqrt[4]{n}}\dot{g}_{1}(x_{1t})x_{1t}\\ \frac{1}{\sqrt[4]{n}^{3}}\dot{g}_{1}(x_{1t})x_{2t}\\ \frac{1}{\sqrt{n}}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}\end{pmatrix}\ \ \mbox{and}\ \ M_{n}:=\sum_{t=1}^{n}M_{n,t}.

In fact, since the conditional variance is exactly a1Ida_{1}I_{d}, what we need to show is the conditional Lindeberg condition: for any nonzero vector λd\lambda\in\mathbb{R}^{d},

t=1n𝔼[ρ(et)2(λMn1/2ξnt)2I(|ρ(et)λMn1/2ξnt|>ε)|t1]P0,\sum_{t=1}^{n}\mathbb{E}[\rho^{\prime}(e_{t})^{2}(\lambda^{\top}M_{n}^{-1/2}\xi_{nt})^{2}I(|\rho^{\prime}(e_{t})\lambda^{\top}M_{n}^{-1/2}\xi_{nt}|>\varepsilon)|{\cal F}_{t-1}]\to_{P}0,

for any ε>0\varepsilon>0. This is true because maxt𝔼[ρ(et)4|t1]<\max_{t}\mathbb{E}[\rho^{\prime}(e_{t})^{4}|{\cal F}_{t-1}]<\infty, and MnM_{n} converges to a positive definite matrix so that

t=1n(λMn1/2ξnt)4Ct=1nλ4ξnt4\displaystyle\sum_{t=1}^{n}(\lambda^{\top}M_{n}^{-1/2}\xi_{nt})^{4}\leq C\sum_{t=1}^{n}\|\lambda\|^{4}\|\xi_{nt}\|^{4}
\displaystyle\leq C1nt=1ng˙14(x1t)x1t4+C1n3t=1ng˙1(x1t)4x2t4+C1n2t=1ng˙24(ztβ0)zt4𝑃 0\displaystyle C\frac{1}{n}\sum_{t=1}^{n}\dot{g}_{1}^{4}(x_{1t})x_{1t}^{4}+C\frac{1}{n^{3}}\sum_{t=1}^{n}\dot{g}_{1}(x_{1t})^{4}\|x_{2t}\|^{4}+C\frac{1}{n^{2}}\sum_{t=1}^{n}\dot{g}_{2}^{4}(z_{t}^{\top}\beta_{0})\|z_{t}\|^{4}\overset{P}{\to}\,0

due to Park and Phillips, (2000) and the integrability of u4g˙14(u)u^{4}\dot{g}^{4}_{1}(u) for the first two terms and the LLN for the last one. ∎

Lemma B.7.

Suppose that \rho^{\prime}(u) and/or \rho^{\prime\prime}(u) exist in the generalized sense. Under Assumptions A-C and D(2)-(3), as n\to\infty we have, jointly for each large m,

t=1nρm′′(et)Mn,tDa2mM,\displaystyle\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})M_{n,t}\to_{D}a_{2m}\,M, (B.14)

where a2m=𝔼ρm′′(et)>0a_{2m}=\mathbb{E}\rho^{\prime\prime}_{m}(e_{t})>0 and MM is the same as in Lemma B.6.

Meanwhile,

t=1n[ρm(et)𝔼ρm(et)]ξntDM1/2ξwithξN(0,a1mId),\displaystyle\sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})]\xi_{nt}\to_{D}M^{1/2}\xi\ \ \mbox{with}\ \ \xi\sim N(0,a_{1m}I_{d}), (B.15)

where a1m=Var[ρm(et)]a_{1m}=\text{Var}[\rho^{\prime}_{m}(e_{t})].

Proof of Lemma B.7.

The proof is the same as that of Lemma B.6, so is omitted. ∎

Appendix C: Proofs of the main results

Proof of Theorem 3.1  Letting pn,qnp_{n},q_{n}\to\infty with nn (determined later) and α=pn(θθ0)\alpha=p_{n}(\theta-\theta_{0}), γ=qn(ββ0)\gamma=q_{n}(\beta-\beta_{0}), we write

Ln(θ,β)=\displaystyle L_{n}(\theta,\beta)= t=1n[ρ(ytg(xtθ,ztβ))ρ(ytg(xtθ0,ztβ0))]\displaystyle\sum_{t=1}^{n}[\rho(y_{t}-g(x_{t}^{\top}\theta,z_{t}^{\top}\beta))-\rho(y_{t}-g(x_{t}^{\top}\theta_{0},z_{t}^{\top}\beta_{0}))]
=\displaystyle= t=1n[ρ(et+g(xtθ0,ztβ0)g(xtθ,ztβ))ρ(et)]\displaystyle\sum_{t=1}^{n}[\rho(e_{t}+g(x_{t}^{\top}\theta_{0},z_{t}^{\top}\beta_{0})-g(x_{t}^{\top}\theta,z_{t}^{\top}\beta))-\rho(e_{t})]
=\displaystyle= t=1n[ρ(et+g(xtθ0,ztβ0)g(xtθ0+xtpn1α,ztβ0+ztqn1γ))ρ(et)]\displaystyle\sum_{t=1}^{n}[\rho(e_{t}+g(x_{t}^{\top}\theta_{0},z_{t}^{\top}\beta_{0})-g(x_{t}^{\top}\theta_{0}+x_{t}^{\top}p_{n}^{-1}\alpha,z_{t}^{\top}\beta_{0}+z_{t}^{\top}q_{n}^{-1}\gamma))-\rho(e_{t})]
\displaystyle\equiv Qn(α,γ).\displaystyle Q_{n}(\alpha,\gamma).

Hence, if (\widehat{\alpha},\widehat{\gamma}) minimizes Q_{n}(\alpha,\gamma), then (\widehat{\theta},\widehat{\beta}) minimizes L_{n}(\theta,\beta), where \widehat{\alpha}=p_{n}(\widehat{\theta}-\theta_{0}) and \widehat{\gamma}=q_{n}(\widehat{\beta}-\beta_{0}). This reparameterization is often adopted in the literature; see, for example, Bickel, (1974), Babu, (1989), Davis et al., (1992) and Phillips, (1995).

For simplicity we denote \Delta_{t}(\alpha,\gamma)=g(x_{t}^{\top}\theta_{0},z_{t}^{\top}\beta_{0})-g(x_{t}^{\top}\theta_{0}+x_{t}^{\top}p_{n}^{-1}\alpha,z_{t}^{\top}\beta_{0}+z_{t}^{\top}q_{n}^{-1}\gamma), so that Q_{n}(\alpha,\gamma)=\sum_{t=1}^{n}[\rho(e_{t}+\Delta_{t}(\alpha,\gamma))-\rho(e_{t})]. Of course, Q_{n}(\alpha,\gamma) serves only for the theoretical analysis.

Using Taylor expansion, we have

Qn(α,γ)=\displaystyle Q_{n}(\alpha,\gamma)= t=1n[ρ(et+Δt(α,γ))ρ(et)]=t=1n[ρ(et)Δt(α,γ)+12ρ′′(et)Δt(α,γ)2],\displaystyle\sum_{t=1}^{n}[\rho(e_{t}+\Delta_{t}(\alpha,\gamma))-\rho(e_{t})]=\sum_{t=1}^{n}\left[\rho^{\prime}(e_{t})\Delta_{t}(\alpha,\gamma)+\frac{1}{2}\rho^{\prime\prime}(e_{t}^{*})\Delta_{t}(\alpha,\gamma)^{2}\right],

where e_{t}^{*} is between e_{t} and e_{t}+\Delta_{t}(\alpha,\gamma). This raises an issue: for some loss functions, \rho^{\prime}(\cdot) and \rho^{\prime\prime}(\cdot) may not exist everywhere. Although it is possible to have a Taylor expansion for nonsmooth \rho by regarding it as a generalized function (see Estrada and Kanwal, (1993)), in order to facilitate the development of the asymptotic theory in the sequel, we make a detour and invoke the regular sequence

ρm(u)=\displaystyle\rho_{m}(u)= ρ(x)ϕm(ux)𝑑x,m=1,2,,\displaystyle\int\rho(x)\phi_{m}(u-x)dx,\ \ m=1,2,\cdots,

where \phi_{m}(x)\in S is a delta-convergent sequence, which ensures \rho_{m}(u)\to\rho(u) as m\to\infty. There are many choices of \phi_{m}(x), such as \phi_{m}(x)=\sqrt{m/\pi}\exp(-mx^{2}).
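To illustrate the device (our own numerical sketch, using the Gaussian choice above), smoothing the absolute loss \rho(x)=|x| yields \rho^{\prime}_{m}(u)\to\text{sign}(u) for u\neq 0 as m grows:

import numpy as np

def rho_m_prime(u, m, xgrid):
    # derivative of rho_m(u) = int |x| phi_m(u - x) dx, phi_m(x) = sqrt(m/pi) exp(-m x^2)
    phi = np.sqrt(m / np.pi) * np.exp(-m * (u - xgrid) ** 2)
    dphi_du = -2.0 * m * (u - xgrid) * phi          # d/du of phi_m(u - x)
    return np.trapz(np.abs(xgrid) * dphi_du, xgrid)

xgrid = np.linspace(-20.0, 20.0, 400001)
for m in (1, 10, 100):
    print(m, rho_m_prime(0.5, m, xgrid))            # approaches sign(0.5) = 1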

It follows that

ρm(u)=\displaystyle\rho_{m}^{\prime}(u)= ρ(x)ϕm(xu)𝑑x,\displaystyle-\int\rho(x)\phi_{m}^{\prime}(x-u)dx, ρm′′(u)=\displaystyle\rho_{m}^{\prime\prime}(u)= ρ(x)ϕm′′(xu)𝑑x,m=1,2,,\displaystyle\int\rho(x)\phi_{m}^{\prime\prime}(x-u)dx,\ \ m=1,2,\cdots,

and \rho^{\prime}_{m}(u) and \rho^{\prime\prime}_{m}(u) converge to \rho^{\prime}(u) and \rho^{\prime\prime}(u), respectively, in the generalized function sense. Accordingly, \mathbb{E}[\rho^{\prime}_{m}(e_{1})]\to\mathbb{E}[\rho^{\prime}(e_{1})]=0 as m\to\infty.


Hence, whenever \rho^{\prime}(\cdot) and/or \rho^{\prime\prime}(\cdot) may not exist pointwise, instead of Q_{n}(\alpha,\gamma) we consider

Qmn(α,γ)=t=1n[ρm(et)𝔼ρm(et)]Δt(α,γ)+12ρm′′(et)Δt(α,γ)2,Q_{mn}(\alpha,\gamma)=\sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})]\Delta_{t}(\alpha,\gamma)+\frac{1}{2}\rho^{\prime\prime}_{m}(e_{t}^{*})\Delta_{t}(\alpha,\gamma)^{2},

where the term \rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t}) not only converges to \rho^{\prime}(e_{t}) as m\to\infty, but also retains the zero-mean property of \rho^{\prime}(e_{t}).

By Assumption D(1), g(u,v)=g1(u)+g2(v)g(u,v)=g_{1}(u)+g_{2}(v) where g1(u)g_{1}(u), g˙1(u)\dot{g}_{1}(u) and g¨1(u)\ddot{g}_{1}(u) are HH-regular with homogeneity order ν()\nu(\cdot), ν˙()\dot{\nu}(\cdot) and ν¨()\ddot{\nu}(\cdot), respectively, such that limλ+ν˙(λ)/ν(λ)=0\lim_{\lambda\to+\infty}\dot{\nu}(\lambda)/\nu(\lambda)=0, and limλ+ν¨(λ)/ν˙(λ)=0\lim_{\lambda\to+\infty}\ddot{\nu}(\lambda)/\dot{\nu}(\lambda)=0. In this case, we take pn=ν˙(n)np_{n}=\dot{\nu}(\sqrt{n})\,n and qn=nq_{n}=\sqrt{n}.

Further, for any (α,γ)d1+d2(\alpha,\gamma)\in\mathbb{R}^{d_{1}+d_{2}},

Δt(α,γ)=\displaystyle\Delta_{t}(\alpha,\gamma)= g(xtθ0,ztβ0)g(xtθ0+xtpn1α,ztβ0+ztqn1γ)\displaystyle g(x_{t}^{\top}\theta_{0},z_{t}^{\top}\beta_{0})-g(x_{t}^{\top}\theta_{0}+x_{t}^{\top}p_{n}^{-1}\alpha,z_{t}^{\top}\beta_{0}+z_{t}^{\top}q_{n}^{-1}\gamma)
=\displaystyle= g˙1(xtθ0)xtpn1αg˙2(ztβ0)ztqn1γ\displaystyle-\dot{g}_{1}(x_{t}^{\top}\theta_{0})x_{t}^{\top}p_{n}^{-1}\alpha-\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma
12g¨1(xtθ)(xtpn1α)212g¨2(ztβ)(ztqn1γ)2\displaystyle-\frac{1}{2}\ddot{g}_{1}(x_{t}^{\top}\theta^{*})(x_{t}^{\top}p_{n}^{-1}\alpha)^{2}-\frac{1}{2}\ddot{g}_{2}(z_{t}^{\top}\beta^{*})(z_{t}^{\top}q_{n}^{-1}\gamma)^{2}
=-\dot{g}_{1}(x_{t}^{\top}\theta_{0})x_{t}^{\top}p_{n}^{-1}\alpha-\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma+\text{remainder term},

where the remainder term is of smaller order than the first two terms. Indeed, observe that

g¨1(xtθ)(xtpn1α)2=g¨1(xntθ)(xntα)2ν¨(n)[ν˙(n)]21n=OP(ν¨(n)[ν˙(n)]21n),\displaystyle\ddot{g}_{1}(x_{t}^{\top}\theta^{*})(x_{t}^{\top}p_{n}^{-1}\alpha)^{2}=\ddot{g}_{1}(x_{nt}^{\top}\theta^{*})(x_{nt}^{\top}\alpha)^{2}\frac{\ddot{\nu}(\sqrt{n})}{[\dot{\nu}(\sqrt{n})]^{2}}\frac{1}{n}=O_{P}\left(\frac{\ddot{\nu}(\sqrt{n})}{[\dot{\nu}(\sqrt{n})]^{2}}\frac{1}{n}\right),
g¨2(ztβ)(ztqn1γ)2=g¨2(ztβ)(ztγ)21n=OP(n1),\displaystyle\ddot{g}_{2}(z_{t}^{\top}\beta^{*})(z_{t}^{\top}q_{n}^{-1}\gamma)^{2}=\ddot{g}_{2}(z_{t}^{\top}\beta^{*})(z_{t}^{\top}\gamma)^{2}\frac{1}{n}=O_{P}(n^{-1}),

which by Assumption D(1) implies the negligibility of the remainder.

We then write \Delta_{t}(\alpha,\gamma)=-\dot{g}_{1}(x_{t}^{\top}\theta_{0})x_{t}^{\top}p_{n}^{-1}\alpha-\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma, ignoring the higher-order terms.

Moreover, we shall also show

Qmn(α,γ)=\displaystyle Q_{mn}(\alpha,\gamma)= t=1n[ρm(et)𝔼ρm(et)]Δt(α,γ)+12t=1nρm′′(et)Δt(α,γ)2\displaystyle\sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})]\Delta_{t}(\alpha,\gamma)+\frac{1}{2}\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t}^{*})\Delta_{t}(\alpha,\gamma)^{2}
=\displaystyle= t=1n[ρm(et)𝔼ρm(et)]Δt(α,γ)+12t=1nρm′′(et)Δn(α,γ)2+rmn,\displaystyle\sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})]\Delta_{t}(\alpha,\gamma)+\frac{1}{2}\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})\Delta_{n}(\alpha,\gamma)^{2}+r_{mn},

where we define and will show rmn=12t=1n[ρm′′(et)ρm′′(et)]Δn(α,γ)2=oP(1)r_{mn}=\frac{1}{2}\sum_{t=1}^{n}[\rho^{\prime\prime}_{m}(e_{t})-\rho^{\prime\prime}_{m}(e_{t}^{*})]\Delta_{n}(\alpha,\gamma)^{2}=o_{P}(1).

Notice that for any integer m, \rho^{\prime\prime}_{m}(\cdot) satisfies a Lipschitz condition with a Lipschitz constant K_{m} that may depend on m, because \phi_{m}\in S. Hence,

|rmn|\displaystyle|r_{mn}|\leq t=1n|ρm′′(et)ρm′′(et)|Δn(α,γ)2\displaystyle\sum_{t=1}^{n}|\rho^{\prime\prime}_{m}(e_{t})-\rho^{\prime\prime}_{m}(e_{t}^{*})|\Delta_{n}(\alpha,\gamma)^{2}
\displaystyle\leq t=1nKm|etet|Δn(α,γ)2Kmt=1n|Δn(α,γ)|3.\displaystyle\sum_{t=1}^{n}K_{m}|e_{t}-e_{t}^{*}|\Delta_{n}(\alpha,\gamma)^{2}\leq K_{m}\sum_{t=1}^{n}|\Delta_{n}(\alpha,\gamma)|^{3}.

Further, recalling pnp_{n}, qnq_{n} and xntx_{nt} we have

|rmn|\displaystyle|r_{mn}|\leq Kmt=1n|Δn(α,γ)|3\displaystyle K_{m}\sum_{t=1}^{n}|\Delta_{n}(\alpha,\gamma)|^{3}
=\displaystyle= Kmt=1n|g˙1(xtθ0)xtpn1α+g˙2(ztβ0)ztqn1γ|3\displaystyle K_{m}\sum_{t=1}^{n}|\dot{g}_{1}(x_{t}^{\top}\theta_{0})x_{t}^{\top}p_{n}^{-1}\alpha+\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma|^{3}
=\displaystyle= Kmn3/2t=1n|g˙1(xntθ0)xntα+g˙2(ztβ0)ztγ|3=KmOP(n1/2)P0,asn,\displaystyle K_{m}n^{-3/2}\sum_{t=1}^{n}|\dot{g}_{1}(x_{nt}^{\top}\theta_{0})x_{nt}^{\top}\alpha+\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}\gamma|^{3}=K_{m}O_{P}(n^{-1/2})\to_{P}0,\ \ \ \ \text{as}\;n\to\infty,

for any mm and uniformly on any compact set of (α,γ)(\alpha,\gamma).

Rewrite the leading term of Qmn(α,γ)Q_{mn}(\alpha,\gamma),

\begin{align*}
Q_{mn}(\alpha,\gamma)=&\ \sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})]\Delta_{t}(\alpha,\gamma)+\frac{1}{2}\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})\Delta_{t}(\alpha,\gamma)^{2}\\
=&\ -\sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})]\dot{g}_{1}(x_{t}^{\top}\theta_{0})x_{t}^{\top}p_{n}^{-1}\alpha\\
&\ -\sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})]\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma\\
&\ +\frac{1}{2}\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})[\dot{g}_{1}(x_{t}^{\top}\theta_{0})x_{t}^{\top}p_{n}^{-1}\alpha]^{2}+\frac{1}{2}\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})[\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma]^{2}\\
&\ +\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})[\dot{g}_{1}(x_{t}^{\top}\theta_{0})x_{t}^{\top}p_{n}^{-1}\alpha][\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma]\\
=&\ -\frac{1}{\sqrt{n}}\sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})]\dot{g}_{1}(x_{nt}^{\top}\theta_{0})x_{nt}^{\top}\alpha\\
&\ -\frac{1}{\sqrt{n}}\sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})]\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}\gamma\\
&\ +\frac{1}{2n}\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})[\dot{g}_{1}(x_{nt}^{\top}\theta_{0})x_{nt}^{\top}\alpha]^{2}+\frac{1}{2n}\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})[\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}\gamma]^{2}\\
&\ +\frac{1}{n}\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})[\dot{g}_{1}(x_{nt}^{\top}\theta_{0})x_{nt}^{\top}\alpha][\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}\gamma],
\end{align*}

which is quadratic and hence convex in $(\alpha,\gamma)$, noting the non-negativity of $\rho^{\prime\prime}_{m}(\cdot)$.

Using the joint convergence in Lemma B.5, for each $m$, as $n\to\infty$, $Q_{mn}$ converges in distribution to

\begin{align*}
&-\int_{0}^{1}\dot{g}_{1}(B(r)^{\top}\theta_{0})B(r)^{\top}dU_{m}(r)\,\alpha-\xi_{m}^{\top}\gamma\\
&+\frac{1}{2}\alpha^{\top}\left(a_{m2}\int_{0}^{1}\dot{g}_{1}(B(r)^{\top}\theta_{0})^{2}B(r)B(r)^{\top}dr\right)\alpha+\frac{1}{2}\gamma^{\top}\left(a_{m2}\Sigma\right)\gamma\\
&+\gamma^{\top}\left(a_{m2}\int_{0}^{1}\dot{g}_{1}(B(r)^{\top}\theta_{0})[\mathbb{E}\dot{g}_{2}(h(r,v_{1})^{\top}\beta_{0})h(r,v_{1})]B(r)^{\top}dr\right)\alpha,
\end{align*}

where $\xi_{m}$ is a normal random vector with covariance matrix $a_{m1}\Sigma$ given in Lemma B.5. Notice further that $a_{m1}\to a_{1}=\mathbb{E}[\rho^{\prime}(e_{t})^{2}|{\cal F}_{t-1}]$ and $a_{m2}\to a_{2}=\mathbb{E}[\rho^{\prime\prime}(e_{t})|{\cal F}_{t-1}]$, which implies $\xi_{m}\to_{P}b_{2}\sim N(0,a_{1}\Sigma)$ and $\sup_{r\in[0,1]}\mathbb{E}[U_{m}(r)-U(r)]^{2}\to 0$ as $m\to\infty$; then $Q_{mn}$ has the sequential limit in distribution

\[
Q(\alpha,\gamma)=-b_{1}^{\top}\alpha-b_{2}^{\top}\gamma+\frac{1}{2}\alpha^{\top}A_{1}\alpha+\frac{1}{2}\gamma^{\top}A_{2}\gamma+\gamma^{\top}A_{3}\alpha \tag{B.16}
\]

as $n\to\infty$ first and $m\to\infty$ second, where

\begin{align*}
b_{1}=&\ \int_{0}^{1}\dot{g}_{1}(B(r)^{\top}\theta_{0})B(r)\,dU(r), & b_{2}\sim&\ N(0,a_{1}\Sigma),\\
A_{1}=&\ a_{2}\int_{0}^{1}\dot{g}_{1}(B(r)^{\top}\theta_{0})^{2}B(r)B(r)^{\top}dr, & A_{2}=&\ a_{2}\Sigma,\\
A_{3}=&\ a_{2}\int_{0}^{1}\dot{g}_{1}(B(r)^{\top}\theta_{0})[\mathbb{E}\dot{g}_{2}(h(r,v_{1})^{\top}\beta_{0})h(r,v_{1})]B(r)^{\top}dr.
\end{align*}

The convergence holds uniformly over any compact set of $(\alpha,\gamma)$.

By virtue of the convexity of $Q_{mn}(\alpha,\gamma)$ and $Q(\alpha,\gamma)$ in $(\alpha,\gamma)$, the estimator $(\widehat{\alpha},\widehat{\gamma})$ converges in distribution to the unique minimizer of the limit $Q(\alpha,\gamma)$ in (B.16) (see Lemma A of Knight (1989)); solving the first-order conditions $A_{1}\alpha+A_{3}^{\top}\gamma=b_{1}$ and $A_{3}\alpha+A_{2}\gamma=b_{2}$ yields

\[
\begin{pmatrix}\dot{\nu}(\sqrt{n})\,n\,(\widehat{\theta}-\theta_{0})\\ \sqrt{n}(\widehat{\beta}-\beta_{0})\end{pmatrix}\to_{D}\begin{pmatrix}[A_{1}-A_{3}^{\top}A_{2}^{-1}A_{3}]^{-1}(b_{1}-A_{3}^{\top}A_{2}^{-1}b_{2})\\ A_{2}^{-1}\big(b_{2}-A_{3}[A_{1}-A_{3}^{\top}A_{2}^{-1}A_{3}]^{-1}(b_{1}-A_{3}^{\top}A_{2}^{-1}b_{2})\big)\end{pmatrix}.
\]

This finishes the proof.   \Box
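As a numerical sanity check on the last display (not part of the formal proof), the following minimal Python sketch, with synthetic placeholder values for $b_{1}$, $b_{2}$, $A_{1}$, $A_{2}$ and $A_{3}$, verifies that the Schur-complement expressions above solve the first-order conditions of the quadratic limit (B.16):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 3, 2

# Illustrative positive definite A1, A2, an arbitrary A3, and normal
# draws standing in for b1, b2; all values are synthetic placeholders.
R1 = rng.standard_normal((d1, d1))
A1 = R1 @ R1.T + d1 * np.eye(d1)
R2 = rng.standard_normal((d2, d2))
A2 = R2 @ R2.T + d2 * np.eye(d2)
A3 = rng.standard_normal((d2, d1))   # d2 x d1, matching gamma' A3 alpha
b1 = rng.standard_normal(d1)
b2 = rng.standard_normal(d2)

# First-order conditions of Q: A1 a + A3' g = b1 and A3 a + A2 g = b2,
# solved via the Schur complement A1 - A3' A2^{-1} A3.
S = A1 - A3.T @ np.linalg.solve(A2, A3)
alpha = np.linalg.solve(S, b1 - A3.T @ np.linalg.solve(A2, b2))
gamma = np.linalg.solve(A2, b2 - A3 @ alpha)

# Cross-check against a direct solve of the stacked linear system.
K = np.block([[A1, A3.T], [A3, A2]])
direct = np.linalg.solve(K, np.concatenate([b1, b2]))
assert np.allclose(np.concatenate([alpha, gamma]), direct)
\end{verbatim}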

Proof of Corollary 3.1  When $\rho^{\prime}$ or $\rho^{\prime\prime}$ does not exist in the ordinary sense, the regular sequences $\rho^{\prime}_{m}$ or $\rho^{\prime\prime}_{m}$ may be used in their place; as can be seen from the proof of the theorem, this replacement does not affect the argument. Thus, in what follows we only consider the case where $\rho^{\prime}$ and $\rho^{\prime\prime}$ exist and are continuous.
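As an illustration of such a regular sequence (the Gaussian mollifier below is one convenient smooth choice of $\phi_{m}\in S$, not the paper's specific construction), the following Python sketch smooths the LAD score $\rho^{\prime}(u)=\mathrm{sign}(u)$ by convolution with an $N(0,1/m^{2})$ density, so that $\rho^{\prime}_{m}$ and $\rho^{\prime\prime}_{m}$ exist everywhere and $\rho^{\prime}_{m}\to\rho^{\prime}$ pointwise away from the kink at zero:

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def rho1_m(u, m):
    # Smoothed LAD score: sign(u) convolved with an N(0, 1/m^2)
    # density, which has the closed form 2*Phi(m*u) - 1.
    return 2.0 * norm.cdf(m * u) - 1.0

def rho2_m(u, m):
    # Its derivative, a smooth non-negative approximation of
    # the generalized derivative 2*delta(u).
    return 2.0 * m * norm.pdf(m * u)

u = np.linspace(-1.0, 1.0, 5)
for m in (1, 10, 100):
    print(m, rho1_m(u, m))  # approaches sign(u) pointwise as m grows
\end{verbatim}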

By the definition,

\[
\widehat{e}_{t}=y_{t}-g(\widehat{\theta}^{\,\top}x_{t},\widehat{\beta}^{\,\top}z_{t})=e_{t}+g(\theta_{0}^{\top}x_{t},\beta_{0}^{\top}z_{t})-g(\widehat{\theta}^{\,\top}x_{t},\widehat{\beta}^{\,\top}z_{t})=e_{t}+\Delta_{t}(\widehat{\theta},\widehat{\beta}),
\]

where $\Delta_{t}(\theta,\beta)=g(\theta_{0}^{\top}x_{t},\beta_{0}^{\top}z_{t})-g(\theta^{\top}x_{t},\beta^{\top}z_{t})$, and it is shown in the proof of the theorem that $\Delta_{t}(\widehat{\theta},\widehat{\beta})=-\dot{g}_{1}(\theta_{0}^{\top}x_{t})x_{t}^{\top}(\widehat{\theta}-\theta_{0})-\dot{g}_{2}(\beta_{0}^{\top}z_{t})z_{t}^{\top}(\widehat{\beta}-\beta_{0})+$ a higher-order term.

Thus, we simply take $\Delta_{t}(\widehat{\theta},\widehat{\beta})=-\dot{g}_{1}(\theta_{0}^{\top}x_{t})x_{t}^{\top}(\widehat{\theta}-\theta_{0})-\dot{g}_{2}(\beta_{0}^{\top}z_{t})z_{t}^{\top}(\widehat{\beta}-\beta_{0})$, and by virtue of Theorem 3.1, $\Delta_{t}(\widehat{\theta},\widehat{\beta})=O_{P}(n^{-1/2})$ uniformly in $t$. It then follows from the mean value theorem that

\[
[\rho^{\prime}(\widehat{e}_{t})]^{2}-[\rho^{\prime}(e_{t})]^{2}=[\rho^{\prime}(e_{t}+\Delta_{t}(\widehat{\theta},\widehat{\beta}))]^{2}-[\rho^{\prime}(e_{t})]^{2}=2\rho^{\prime}(e_{t}^{*})\rho^{\prime\prime}(e_{t}^{*})\Delta_{t}(\widehat{\theta},\widehat{\beta}),
\]

where $e_{t}^{*}$ lies on the segment joining $e_{t}$ and $\widehat{e}_{t}$. Hence,

\begin{align*}
\widehat{a}_{1}-a_{1}=&\ \frac{1}{n}\sum_{t=1}^{n}\{[\rho^{\prime}(\widehat{e}_{t})]^{2}-\mathbb{E}[\rho^{\prime}(e_{t})]^{2}\}\\
=&\ \frac{2}{n}\sum_{t=1}^{n}\rho^{\prime}(e_{t}^{*})\rho^{\prime\prime}(e_{t}^{*})\Delta_{t}(\widehat{\theta},\widehat{\beta})+\frac{1}{n}\sum_{t=1}^{n}\{[\rho^{\prime}(e_{t})]^{2}-\mathbb{E}[\rho^{\prime}(e_{t})]^{2}\}.
\end{align*}

The first term is $o_{P}(1)$ by the continuity of the two derivatives and the fact that $|e_{t}^{*}-e_{t}|\leq|e_{t}-\widehat{e}_{t}|=|\Delta_{t}(\widehat{\theta},\widehat{\beta})|$, which is uniformly $o_{P}(1)$. The last term is $o_{P}(1)$ by the mean ergodicity of the sequence $\{\rho^{\prime}(e_{t})^{2}\}_{1}^{n}$ and the fact that $a_{1}=\mathbb{E}([\rho^{\prime}(e_{t})]^{2}|{\cal F}_{t-1})<\infty$ by Assumption C. The first assertion then follows.

Notice further that

\begin{align*}
\widehat{a}_{2}-a_{2}=&\ \frac{1}{n}\sum_{t=1}^{n}\{\rho^{\prime\prime}(\widehat{e}_{t})-\mathbb{E}[\rho^{\prime\prime}(e_{t})]\}\\
=&\ \frac{1}{n}\sum_{t=1}^{n}[\rho^{\prime\prime}(\widehat{e}_{t})-\rho^{\prime\prime}(e_{t})]+\frac{1}{n}\sum_{t=1}^{n}\{\rho^{\prime\prime}(e_{t})-\mathbb{E}[\rho^{\prime\prime}(e_{t})]\}.
\end{align*}

Here, the first term is $o_{P}(1)$ by the continuity of $\rho^{\prime\prime}$ and the uniform smallness of $|e_{t}-\widehat{e}_{t}|$; the second term is $o_{P}(1)$ because $\{\rho^{\prime\prime}(e_{t})\}_{1}^{n}$ is ergodic.  $\Box$
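To illustrate the plug-in estimators $\widehat{a}_{1}=n^{-1}\sum_{t=1}^{n}[\rho^{\prime}(\widehat{e}_{t})]^{2}$ and $\widehat{a}_{2}=n^{-1}\sum_{t=1}^{n}\rho^{\prime\prime}(\widehat{e}_{t})$ in a concrete case, the following minimal Python sketch uses Huber's loss with threshold $c$, for which $\rho^{\prime}(u)=\max(-c,\min(u,c))$ and $\rho^{\prime\prime}(u)=1\{|u|\leq c\}$ almost everywhere; the residuals below are synthetic placeholders:

\begin{verbatim}
import numpy as np

def huber_a1_a2(ehat, c=1.345):
    # Plug-in estimates of a1 = E[rho'(e)^2] and a2 = E[rho''(e)]
    # for Huber's loss with threshold c, from fitted residuals ehat.
    ehat = np.asarray(ehat, dtype=float)
    psi = np.clip(ehat, -c, c)            # rho'(u)
    a1_hat = np.mean(psi ** 2)            # (1/n) sum rho'(e_t)^2
    a2_hat = np.mean(np.abs(ehat) <= c)   # (1/n) sum 1{|e_t| <= c}
    return a1_hat, a2_hat

rng = np.random.default_rng(1)
ehat = rng.standard_t(df=3, size=1000)   # illustrative heavy-tailed residuals
print(huber_a1_a2(ehat))
\end{verbatim}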

Proof of Theorem 3.2  Note that $\|\theta_{0}\|\neq 0$, and we allow $\theta_{0}$ not to be a unit vector, so the treatment in what follows differs from that in the existing literature on semiparametric single-index models. We take $P=(\theta_{0}/\|\theta_{0}\|,P_{2})_{d_{1}\times d_{1}}$ to be an orthogonal matrix that will be used to rotate the regressor $x_{t}$. Define $D_{n}=\text{diag}(n^{1/4},n^{3/4}I_{d_{1}-1})$ and $q_{n}=\sqrt{n}$ for later use.
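For implementation, one convenient (though by no means unique) way to construct the orthogonal matrix $P=(\theta_{0}/\|\theta_{0}\|,P_{2})$ is via a complete QR factorization of $\theta_{0}$; the sketch below also forms $D_{n}$ and the rotated regressor $\widetilde{x}_{t}=(x_{t}^{\top}\theta_{0},x_{t}^{\top}P_{2})^{\top}$ used later in this proof (all numerical values are illustrative):

\begin{verbatim}
import numpy as np

def rotation_and_scaling(theta0, n):
    # Orthogonal P = (theta0/||theta0||, P2) via a complete QR
    # factorization, plus D_n = diag(n^{1/4}, n^{3/4} I_{d1-1}).
    theta0 = np.asarray(theta0, dtype=float)
    d1 = theta0.size
    Q, _ = np.linalg.qr(theta0.reshape(-1, 1), mode="complete")
    if Q[:, 0] @ theta0 < 0:   # QR pins the first column down only up to sign
        Q[:, 0] *= -1.0
    Dn = np.diag(np.r_[n ** 0.25, n ** 0.75 * np.ones(d1 - 1)])
    return Q, Dn

theta0 = np.array([2.0, -1.0, 0.5])
P, Dn = rotation_and_scaling(theta0, n=400)
x_t = np.array([0.3, 1.2, -0.7])
x_tilde = np.r_[x_t @ theta0, x_t @ P[:, 1:]]   # (x_t' theta0, x_t' P2)
\end{verbatim}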

Denoting $\Delta_{t}(\theta,\beta)=g(x_{t}^{\top}\theta_{0},z_{t}^{\top}\beta_{0})-g(x_{t}^{\top}\theta,z_{t}^{\top}\beta)$, by Taylor's theorem we have

\begin{align*}
L_{n}(\theta,\beta)=&\ \sum_{t=1}^{n}[\rho(y_{t}-g(x_{t}^{\top}\theta,z_{t}^{\top}\beta))-\rho(y_{t}-g(x_{t}^{\top}\theta_{0},z_{t}^{\top}\beta_{0}))]\\
=&\ \sum_{t=1}^{n}[\rho(e_{t}+\Delta_{t}(\theta,\beta))-\rho(e_{t})]=\sum_{t=1}^{n}\Big[\rho^{\prime}(e_{t})\Delta_{t}(\theta,\beta)+\frac{1}{2}\rho^{\prime\prime}(e_{t}^{*})\Delta_{t}(\theta,\beta)^{2}\Big],
\end{align*}

where $e_{t}^{*}$ lies between $e_{t}$ and $e_{t}+\Delta_{t}(\theta,\beta)$; for the moment we consider the case where the first and second derivatives of $\rho(\cdot)$ exist, and the remaining cases will be dealt with at a later stage. Moreover, using a similar idea to that in the proof of Theorem 2.1, $\rho^{\prime\prime}(e_{t}^{*})$ in the second term above can be approximated by $\rho^{\prime\prime}(e_{t})$, because, as shown for Theorem 2.1,

\[
\sum_{t=1}^{n}|\rho^{\prime\prime}(e_{t}^{*})-\rho^{\prime\prime}(e_{t})|\Delta_{t}(\theta,\beta)^{2}\leq C\sum_{t=1}^{n}|\Delta_{t}(\theta,\beta)|^{3},
\]

and, using a first-order Taylor expansion and Assumption D(3),

\begin{align*}
\Delta_{t}(\theta,\beta)=&\ g(x_{t}^{\top}\theta_{0},z_{t}^{\top}\beta_{0})-g(x_{t}^{\top}\theta,z_{t}^{\top}\beta)\\
=&\ -\dot{g}_{1}(x_{t}^{\top}\theta^{*})x_{t}^{\top}(\theta-\theta_{0})-\dot{g}_{2}(z_{t}^{\top}\beta^{*})z_{t}^{\top}(\beta-\beta_{0}),
\end{align*}

where $(\theta^{*},\beta^{*})$ is some point on the segment joining $(\theta,\beta)$ and $(\theta_{0},\beta_{0})$. Here we use $\theta^{*}$ to construct a matrix $P^{*}=(\theta^{*},P_{2}^{*})$ such that $\theta^{*}$ is orthogonal to each column of $P_{2}^{*}$, similarly to the construction of $P$. Then $x_{t}^{\top}(\theta-\theta_{0})=x_{t}^{\top}P^{*}D_{n}^{-1}D_{n}P^{*\top}(\theta-\theta_{0})=(x_{1t}^{*},(x_{2t}^{*})^{\top})D_{n}^{-1}\widetilde{\alpha}$, where $x_{1t}^{*}=x_{t}^{\top}\theta^{*}$, $x_{2t}^{*}=(P_{2}^{*})^{\top}x_{t}$ and $\widetilde{\alpha}=D_{n}P^{*\top}(\theta-\theta_{0})$, which is a rescaled parameter vector. Also, rewrite $z_{t}^{\top}(\beta-\beta_{0})=z_{t}^{\top}q_{n}^{-1}\gamma$ with $\gamma=q_{n}(\beta-\beta_{0})$, a rescaled vector too. We may then regard the rescaled parameters $\widetilde{\alpha}$ and $\gamma$ as generic parameters in the minimization considered later.

Moreover, we bound $\Delta_{t}(\theta,\beta)$ by three additive components, ensuring that

\begin{align*}
\sum_{t=1}^{n}|\Delta_{t}(\theta,\beta)|^{3}\leq&\ C_{1}\frac{1}{n^{3/4}}\sum_{t=1}^{n}|\dot{g}_{1}(x_{1t}^{*})x_{1t}^{*}|^{3}+C_{2}\frac{1}{n^{9/4}}\sum_{t=1}^{n}|\dot{g}_{1}(x_{1t}^{*})|^{3}\|x_{2t}^{*}\|^{3}\\
&\ +C_{3}\frac{1}{n^{3/2}}\sum_{t=1}^{n}\|\dot{g}_{2}(z_{t}^{\top}\beta^{*})z_{t}\|^{3}=O_{P}(n^{-1/2}),
\end{align*}

where the first two terms are evaluated using Lemma B.2 and the last using the LLN. Now we can write

\[
L_{n}(\theta,\beta)=\sum_{t=1}^{n}\Big[\rho^{\prime}(e_{t})\Delta_{t}(\theta,\beta)+\frac{1}{2}\rho^{\prime\prime}(e_{t})\Delta_{t}(\theta,\beta)^{2}\Big]+O_{P}(n^{-1/2}).
\]

We further consider the approximation of $\Delta_{t}(\theta,\beta)$. By Assumption D(3), expanding $g(u,v)$ by Taylor's theorem up to the second order, and using the orthogonal matrix $P$ together with a new matrix $P^{*}$ constructed as before (though most likely not the same as the previous one), we obtain

\begin{align*}
\Delta_{t}(\theta,\beta)=&\ -\dot{g}_{1}(x_{t}^{\top}\theta_{0})x_{t}^{\top}PD_{n}^{-1}D_{n}P^{\top}(\theta-\theta_{0})-\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}q_{n}(\beta-\beta_{0})\\
&\ -\frac{1}{2}\ddot{g}_{1}(x_{t}^{\top}\theta^{*})(x_{t}^{\top}P^{*}D_{n}^{-1}D_{n}P^{*\top}(\theta-\theta_{0}))^{2}-\frac{1}{2}\ddot{g}_{2}(z_{t}^{\top}\beta^{*})(z_{t}^{\top}q_{n}^{-1}q_{n}(\beta-\beta_{0}))^{2}\\
=&\ -\dot{g}_{1}(x_{t}^{\top}\theta_{0})x_{t}^{\top}PD_{n}^{-1}\widetilde{\alpha}-\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma\\
&\ -\frac{1}{2}\ddot{g}_{1}(x_{t}^{\top}\theta^{*})(x_{t}^{\top}P^{*}D_{n}^{-1}\widetilde{\alpha}^{*})^{2}-\frac{1}{2}\ddot{g}_{2}(z_{t}^{\top}\beta^{*})(z_{t}^{\top}q_{n}^{-1}\gamma)^{2},
\end{align*}

where $\widetilde{\alpha}=D_{n}P^{\top}(\theta-\theta_{0})$, $\gamma=q_{n}(\beta-\beta_{0})$ and $\widetilde{\alpha}^{*}=D_{n}P^{*\top}(\theta-\theta_{0})$. We may then follow the same derivation as above, along with the martingale difference structure of $(\rho^{\prime}(e_{t}),{\cal F}_{t})$, to show that the second-order terms in the expansion are negligible, that is,

\begin{align*}
&\sum_{t=1}^{n}\rho^{\prime}(e_{t})\ddot{g}_{1}(x_{t}^{\top}\theta^{*})(x_{t}^{\top}P^{*}D_{n}^{-1}\widetilde{\alpha}^{*})^{2}=o_{P}(1),\quad\sum_{t=1}^{n}\rho^{\prime}(e_{t})\ddot{g}_{2}(z_{t}^{\top}\beta^{*})(z_{t}^{\top}q_{n}^{-1}\gamma)^{2}=o_{P}(1),\\
&\sum_{t=1}^{n}\rho^{\prime\prime}(e_{t})\ddot{g}_{1}(x_{t}^{\top}\theta^{*})(x_{t}^{\top}P^{*}D_{n}^{-1}\widetilde{\alpha}^{*})^{4}=o_{P}(1),\quad\sum_{t=1}^{n}\rho^{\prime\prime}(e_{t})\ddot{g}_{2}(z_{t}^{\top}\beta^{*})(z_{t}^{\top}q_{n}^{-1}\gamma)^{4}=o_{P}(1).
\end{align*}

Thus, these higher-order terms are omitted in the sequel.

Notice further that, denoting $\widetilde{\alpha}=(\widetilde{\alpha}_{1},\widetilde{\alpha}_{2}^{\top})^{\top}$ conformably with $x_{1t}=x_{t}^{\top}\theta_{0}$ and $x_{2t}^{\top}=x_{t}^{\top}P_{2}$ (hereafter $\widetilde{x}_{t}=(x_{1t},x_{2t}^{\top})^{\top}$),

\[
x_{t}^{\top}PD_{n}^{-1}\widetilde{\alpha}=(x_{t}^{\top}\theta_{0}/\|\theta_{0}\|,\,x_{t}^{\top}P_{2})D_{n}^{-1}\widetilde{\alpha}=\widetilde{x}_{t}^{\top}D_{n}^{-1}(\widetilde{\alpha}_{1}/\|\theta_{0}\|,\widetilde{\alpha}_{2}^{\top})^{\top}\equiv\widetilde{x}_{t}^{\top}D_{n}^{-1}\alpha,
\]

where we denote

\[
\alpha=D_{n}\begin{pmatrix}\frac{1}{\|\theta_{0}\|^{2}}\theta_{0}^{\top}\\ P_{2}^{\top}\end{pmatrix}(\theta-\theta_{0}). \tag{B.17}
\]

The transformation from $x_{t}$ to $\widetilde{x}_{t}$ is necessary since $g_{1}(\cdot)$ is $I$-regular; see Dong et al. (2016) and the references therein for more details.

Accordingly, $L_{n}(\theta,\beta)$ is reparameterized as $Q_{n}(\alpha,\gamma)$:

\begin{align*}
Q_{n}(\alpha,\gamma)=&\ \sum_{t=1}^{n}\Big\{\rho^{\prime}(e_{t})[-\dot{g}_{1}(x_{t}^{\top}\theta_{0})\widetilde{x}_{t}^{\top}D_{n}^{-1}\alpha-\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma]\\
&\ +\frac{1}{2}\rho^{\prime\prime}(e_{t})[\dot{g}_{1}(x_{t}^{\top}\theta_{0})\widetilde{x}_{t}^{\top}D_{n}^{-1}\alpha+\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma]^{2}\Big\}.
\end{align*}

Note that, when $(\widehat{\alpha},\widehat{\gamma})$ minimizes $Q_{n}(\alpha,\gamma)$, $(\widehat{\theta},\widehat{\beta})$ minimizes $L_{n}(\theta,\beta)$, where $\widehat{\gamma}=q_{n}(\widehat{\beta}-\beta_{0})$ and $\widehat{\theta}$ is given through the relationship (B.17) with $\alpha$ and $\theta$ replaced by $\widehat{\alpha}$ and $\widehat{\theta}$, respectively.

When $\rho^{\prime}(\cdot)$ and/or $\rho^{\prime\prime}(\cdot)$ may not exist at some points, similarly to Theorem 2.1 we consider

\begin{align*}
Q_{mn}(\alpha,\gamma)=&\ \sum_{t=1}^{n}\Big\{[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})][-\dot{g}_{1}(x_{t}^{\top}\theta_{0})\widetilde{x}_{t}^{\top}D_{n}^{-1}\alpha-\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma]\\
&\ +\frac{1}{2}\rho^{\prime\prime}_{m}(e_{t})[\dot{g}_{1}(x_{t}^{\top}\theta_{0})\widetilde{x}_{t}^{\top}D_{n}^{-1}\alpha+\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma]^{2}\Big\},
\end{align*}

where $\rho^{\prime}_{m}(\cdot)$ and/or $\rho^{\prime\prime}_{m}(\cdot)$ are regular sequences for $\rho^{\prime}(\cdot)$ and $\rho^{\prime\prime}(\cdot)$, respectively.

Hence,

\begin{align*}
Q_{mn}(\alpha,\gamma)=&\ -\sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})][\dot{g}_{1}(x_{t}^{\top}\theta_{0})\widetilde{x}_{t}^{\top}D_{n}^{-1}\alpha+\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma]\\
&\ +\frac{1}{2}\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})[\dot{g}_{1}(x_{t}^{\top}\theta_{0})\widetilde{x}_{t}^{\top}D_{n}^{-1}\alpha+\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}q_{n}^{-1}\gamma]^{2}\\
=&\ -(\alpha_{1},\alpha_{2}^{\top},\gamma^{\top})\sum_{t=1}^{n}[\rho^{\prime}_{m}(e_{t})-\mathbb{E}\rho^{\prime}_{m}(e_{t})]\xi_{nt}+\frac{1}{2}(\alpha_{1},\alpha_{2}^{\top},\gamma^{\top})\sum_{t=1}^{n}\rho^{\prime\prime}_{m}(e_{t})M_{nt}\begin{pmatrix}\alpha_{1}\\ \alpha_{2}\\ \gamma\end{pmatrix},
\end{align*}

where we define $\xi_{nt}:=\big(n^{-1/4}\dot{g}_{1}(x_{1t})x_{1t},\; n^{-3/4}\dot{g}_{1}(x_{1t})x_{2t}^{\top},\; n^{-1/2}\dot{g}_{2}(z_{t}^{\top}\beta_{0})z_{t}^{\top}\big)^{\top}$ and $M_{nt}:=\xi_{nt}\xi_{nt}^{\top}$.

It follows from the joint convergence in Lemmas B.6 and B.7 that, as $n\to\infty$ and then $m\to\infty$,

\[
Q_{mn}(\alpha,\gamma)\to_{D}Q(\alpha,\gamma)=-(\alpha_{1},\alpha_{2}^{\top},\gamma^{\top})\xi+\frac{a_{2}}{2}(\alpha_{1},\alpha_{2}^{\top},\gamma^{\top})M(\alpha_{1},\alpha_{2}^{\top},\gamma^{\top})^{\top}
\]

uniformly on any compact subset of $\mathbb{R}^{d}$, where $\xi\sim N(0,a_{1}M)$.

By virtue of the convexity of $Q_{mn}(\alpha,\gamma)$ and $Q(\alpha,\gamma)$ in $(\alpha,\gamma)$, the estimator $(\widehat{\alpha},\widehat{\gamma})$ converges in distribution to the unique minimizer $(a_{2}M)^{-1}\xi$ of the limit $Q(\alpha,\gamma)$ above (see Lemma A of Knight (1989)), that is,

\[
\begin{pmatrix}\widehat{\alpha}\\ \widehat{\gamma}\end{pmatrix}=\begin{pmatrix}n^{1/4}\frac{1}{\|\theta_{0}\|^{2}}\theta_{0}^{\top}(\widehat{\theta}-\theta_{0})\\ n^{3/4}P_{2}^{\top}(\widehat{\theta}-\theta_{0})\\ q_{n}(\widehat{\beta}-\beta_{0})\end{pmatrix}\to_{D}(a_{2}M)^{-1}\xi=_{D}\frac{\sqrt{a_{1}}}{a_{2}}M^{-1/2}N(0,I_{d}),
\]

where the last equality in distribution holds because $(a_{2}M)^{-1}\xi$ has covariance matrix $(a_{1}/a_{2}^{2})M^{-1}$.

The proof is finished.   \Box

Proof of Corollary 3.2  We now consider the convergence of $\widehat{\theta}-\theta_{0}$. Denote $J=\text{diag}(1/\|\theta_{0}\|,I_{d_{1}-1})$ for convenience. Observe that

\begin{align*}
n^{1/4}(\widehat{\theta}-\theta_{0})=&\ n^{1/4}(D_{n}JP^{\top})^{-1}(D_{n}JP^{\top})(\widehat{\theta}-\theta_{0})\\
=&\ n^{1/4}(D_{n}JP^{\top})^{-1}\begin{pmatrix}n^{1/4}\frac{1}{\|\theta_{0}\|^{2}}\theta_{0}^{\top}(\widehat{\theta}-\theta_{0})\\ n^{3/4}P_{2}^{\top}(\widehat{\theta}-\theta_{0})\end{pmatrix}\\
\to_{D}&\ PJ^{-1}\text{diag}(1,{\bf 0}_{d_{1}-1})N(0,M^{-1}_{1})=_{D}N(0,\tau_{11}\theta_{0}\theta_{0}^{\top}).\quad\Box
\end{align*}
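For completeness, the last equality in distribution follows because the first column of $P$ is $\theta_{0}/\|\theta_{0}\|$ and $J^{-1}=\text{diag}(\|\theta_{0}\|,I_{d_{1}-1})$, so that

\[
PJ^{-1}\text{diag}(1,{\bf 0}_{d_{1}-1})=(\theta_{0}/\|\theta_{0}\|,\,P_{2})\,\text{diag}(\|\theta_{0}\|,I_{d_{1}-1})\,\text{diag}(1,{\bf 0}_{d_{1}-1})=(\theta_{0},{\bf 0},\ldots,{\bf 0});
\]

hence only the first coordinate of $N(0,M_{1}^{-1})$ survives in the limit, and its variance, the $(1,1)$ element $\tau_{11}$ of $M_{1}^{-1}$, gives the covariance matrix $\tau_{11}\theta_{0}\theta_{0}^{\top}$.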