
Inference in Nonparametric Series Estimation with Specification Searches for the Number of Series Terms

Byunghoon Kang
Department of Economics, Lancaster University
I thank the editor Peter Phillips, the co-editor Iván Fernández-Val, and the two anonymous referees for thoughtful comments that significantly improved this paper. I am also grateful to Bruce Hansen, Jack Porter, Xiaoxia Shi and Joachim Freyberger for useful comments and discussions, and thanks to Michal Kolesár, Denis Chetverikov, Yixiao Sun, Andres Santos, Patrik Guggenberger, Federico Bugni, Joris Pinkse, Liangjun Su, Myung Hwan Seo, and Áureo de Paula for helpful conversations and criticism. This paper is a revised version of the first chapter in my Ph.D. thesis at UW-Madison and previously titled “Inference in Nonparametric Series Estimation with Data-Dependent Undersmoothing”. I acknowledge support by the Kwanjeong Educational Foundation Graduate Research Fellowship and Leon Mears Dissertation Fellowship from UW-Madison. All errors are my own. Email: b.kang1@lancaster.ac.uk, Homepage: https://sites.google.com/site/davidbhkang
(September 15, 2025)
Abstract

Nonparametric series regression often involves specification search over the tuning parameter, i.e., evaluating estimates and confidence intervals with different numbers of series terms. This paper develops pointwise and uniform inference for conditional mean functions in nonparametric series estimation that is uniform in the number of series terms. As a result, this paper constructs confidence intervals and confidence bands with possibly data-dependent series terms that have valid asymptotic coverage probabilities. This paper also considers a partially linear model setup and develops inference methods for the parametric part that are uniform in the number of series terms. The finite sample performance of the proposed methods is investigated in various simulation setups as well as in an illustrative example, i.e., the nonparametric estimation of the wage elasticity of the expected labor supply from Blomquist and Newey (2002).

Keywords: Nonparametric series regression, Pointwise confidence interval, Smoothing parameter choice, Specification search, Undersmoothing, Uniform confidence bands.

JEL classification: C12, C14.

1 Introduction

We consider the following nonparametric regression model

y_i=g_0(x_i)+\varepsilon_i,\qquad E(\varepsilon_i|x_i)=0 \qquad (1.1)

where $\{y_i,x_i\}_{i=1}^n$ is i.i.d., $y_i$ is a scalar response variable, $x_i\in\mathcal{X}\subset\mathbb{R}^{d_x}$ is a vector of covariates, and $g_0(x)=E(y_i|x_i=x)$ is the conditional mean function. The theory of estimation and inference is well developed for nonparametric series (sieve) methods in a large body of the econometrics and statistics literature. Series estimators have also received attention in applied economics because they have many appealing features, e.g., they can easily impose shape restrictions such as additive separability and monotonicity. Once the basis function is chosen (e.g., polynomial or regression spline series of fixed order), implementation requires a choice of the number of series terms $K=K_n$, where $K$ denotes the order of the polynomials or the number of knots in the splines. However, this often involves some ad hoc specification searches over $K\in\mathcal{K}_n$. For example, when $x_i\in\mathbb{R}^{d_x}$ is vector valued, researchers often evaluate different numbers of terms in each dimension separately and construct a set of bases with different powers and cross-products of covariates. Although specification search seems necessary in some cases, it may lead to misleading inference if the first-step specification search or series term selection is not taken into account.\footnote{As a referee noted, the bias and MSE of the series estimator depend not only on $K$ but also on the specific bases or sieve spaces, e.g., the order of the splines. In this paper, we fix the basis function and do not allow searching over the specific bases or sieve spaces.}

Existing theory for the asymptotic normality of t-statistics and valid inference imposes a so-called undersmoothing (i.e., overfitting) condition, that is, a faster rate of $K$ than the mean-squared error (MSE) optimal convergence rate, and many papers in the literature suggest rules of thumb intended to deliver the desired level of undersmoothing. Among many others, Newey (2013) suggested increasing $K$ until the standard errors are large relative to small changes in the objects of interest. Newey, Powell, and Vella (1999) suggested using more terms than the number chosen by cross-validation. Horowitz and Lee (2012) suggested increasing $K$ until the integrated variance suddenly increases and then adding additional terms.

In this paper, we formally justify these rule-of-thumb or “plug-in” methods with undersmoothed $\widehat{K}$ for valid inference in nonparametric series regression. Specifically, we provide pointwise inference for $g_0(x)$ with possibly data-dependent (undersmoothed) $\widehat{K}\in\mathcal{K}_n$, i.e., we construct a $100(1-\alpha)\%$ confidence interval (CI) satisfying

\liminf_{n\rightarrow\infty}P\big(g_0(x)\in[\widehat{g}_n(\widehat{K},x)\pm\widehat{c}_{1-\alpha}(x)\sqrt{\widehat{V}_n(\widehat{K},x)/n}\,]\big)\geq 1-\alpha, \qquad (1.2)

with an estimator $\widehat{g}_n(K,x)$, a variance estimator $\widehat{V}_n(K,x)$ using $K$ series terms, and critical values $\widehat{c}_{1-\alpha}(x)$ from the supremum of the t-statistics. For this result, we first develop a uniform distributional approximation theory for the supremum of the absolute t-statistics over different numbers of series terms, which yields asymptotically valid confidence intervals that are uniform in $K\in\mathcal{K}_n$,

P\big(g_0(x)\in[\widehat{g}_n(K,x)\pm\widehat{c}_{1-\alpha}(x)\sqrt{\widehat{V}_n(K,x)/n}\,],\quad K\in\mathcal{K}_n\big)=1-\alpha+o(1). \qquad (1.3)

The critical values $\widehat{c}_{1-\alpha}(x)$ can be easily implemented using simple simulation or weighted bootstrap methods.

Furthermore, this paper develops the construction of confidence bands for $g_0(x)$ with asymptotically uniform (in $K\in\mathcal{K}_n$) coverage, with critical values $\widehat{c}_{1-\alpha}$ chosen to satisfy

P\big(g_0(x)\in[\widehat{g}_n(K,x)\pm\widehat{c}_{1-\alpha}\sqrt{\widehat{V}_n(K,x)/n}\,],\quad K\in\mathcal{K}_n,\ x\in\mathcal{X}\big)=1-\alpha+o(1). \qquad (1.4)

Analogous to the pointwise inference in (1.2), we can show the validity of confidence bands with data-dependent $\widehat{K}$. Even for pointwise inference, deriving a uniform asymptotic distribution theory for all sequences of t-statistics over $K\in\mathcal{K}_n$ may not be possible unless $p=|\mathcal{K}_n|$ is finite. Allowing $p\rightarrow\infty$ as $n\rightarrow\infty$, the results in this paper build on coupling inequalities for the supremum of an empirical process developed by Chernozhukov, Chetverikov, and Kato (2014a, 2016) combined with the anti-concentration inequality in Chernozhukov, Chetverikov, and Kato (2014b).

We also provide inference methods in a partially linear model setup, focusing on the common parametric part. Unlike nonparametric objects of interest that converge more slowly than $n^{1/2}$ (e.g., the regression function or regression derivative), the t-statistics for the parametric object of interest are asymptotically equivalent for all sequences of $K$ under the standard rate condition $K/n\rightarrow 0$ as $n\rightarrow\infty$. To capture the dependence of the t-statistics across different sequences of $K$ in this setup, we consider a faster rate of $K$ that grows as fast as the sample size $n$, as in Cattaneo, Jansson, and Newey (2018a, 2018b), and develop an asymptotic distribution of the t-statistics over $K\in\mathcal{K}_n$. We then discuss methods to construct confidence intervals similar to those in the nonparametric regression setup and provide uniform (in $K\in\mathcal{K}_n$) coverage properties.

We investigate finite sample coverage and length properties of the proposed CIs and uniform confidence bands in various simulation setups. As an illustrative example, we revisit the nonparametric estimation of the labor supply function using the entire individual piecewise-linear budget set, as in Blomquist and Newey (2002). Imposing additive separability, which is derived from economic theory, Blomquist and Newey (2002) estimate the conditional mean of the labor supply function using series estimation and report the wage elasticity of the expected labor supply, as well as other welfare measures, for various specifications with different numbers of series terms.

Several important papers have investigated the asymptotic properties of series (and sieve) estimators, including Andrews (1991a); Eastwood and Gallant (1991); Newey (1997); Chen and Shen (1998); Huang (2003); Chen (2007); Chen and Liao (2014); Chen, Liao, and Sun (2014); Belloni, Chernozhukov, Chetverikov, and Kato (2015); and Chen and Christensen (2015), among many others. This paper extends inference based on the t-statistic under a single sequence of $K$ to sequences of $K$ over a set $\mathcal{K}_n$ and focuses on both pointwise and uniform inference on $g_0(x)$, which is an irregular (i.e., slower than $n^{1/2}$ rate) linear functional, under an i.i.d. setup.

The supremum t-statistics have been used as a correction for multiple-testing problems and to construct simultaneous confidence bands, and the importance of multiple-testing problems (data mining or data snooping) has long been noted in various other contexts (see Leamer (1983), White (2000), Romano and Wolf (2005), Hansen (2005)).

There is also a growing literature on data-dependent series term selection and its impact on estimation and inference in econometrics and statistics. Asymptotic optimality results for cross-validation have been developed, e.g., by Li (1987), Andrews (1991b), and Hansen (2015). Horowitz (2014) develops data-driven methods for choosing the sieve dimension in nonparametric instrumental variables (NPIV) estimation such that the resulting NPIV estimators attain the optimal sup-norm or $L^2$ norm rates adaptive to the unknown smoothness of $g_0(x)$. Although we do not pursue adaptive inference in this paper, there is also a large statistical literature on adaptive inference. For example, Giné and Nickl (2010) and Chernozhukov, Chetverikov, and Kato (2014b) construct adaptive confidence bands in the density estimation problem (see Giné and Nickl (2015, Section 8) for a comprehensive list of references). However, once a data-driven choice is obtained for adaptive estimation (e.g., Lepski (1990)-type procedures), one still requires an undersmoothing condition for inference to eliminate asymptotic bias terms (see Theorem 1 of Giné and Nickl (2010)), and this may result in similar specification search issues when choosing a sufficiently “large” $K$ in practice.

We can, in principle, consider kernel-based estimation, for which several data-dependent bandwidth selections and explicit bias corrections have been proposed.\footnote{See Härdle and Linton (1994) and Li and Racine (2007) for references. See also Hall and Horowitz (2013), Calonico, Cattaneo, and Farrell (2018), Schennach (2015), and references therein for various recent works on related bias issues and inference for kernel estimators.} However, there are many applications that estimate $g_0(x)$ using (global) series estimation, easily imposing shape constraints (such as additive separability to reduce dimensionality), in which both pointwise and uniform inference are of interest. Given the issues of specification search, our paper is closely related to a recent paper by Armstrong and Kolesár (2018), which considers a bandwidth snooping adjustment for kernel-based inference.

Unlike kernel-based methods, little is known about the statistical properties of data-dependent selection rules and explicit bias formulas for general series estimation; Zhou, Shen, and Wolfe (1998) and Huang (2003) are two of the few exceptions. A recent paper, Cattaneo, Farrell, and Feng (2019), develops novel explicit asymptotic bias/integrated mean squared error (IMSE) formulas and asymptotic theory for bias-correction methods for general partitioning-based series estimators. The results in Cattaneo, Farrell, and Feng (2019) can be used as an alternative to the undersmoothing approach to avoid specification search issues.

The remainder of the paper is organized as follows. Section 2 introduces the basic nonparametric series regression setup and the candidate set $\mathcal{K}_n$. Section 3 provides the pointwise inference, and Section 4 provides uniform inference in $x\in\mathcal{X}$. Section 5 extends our inference methods to the partially linear model setup. Section 6 summarizes Monte Carlo experiments in various setups, and Section 7 illustrates an empirical example as in Blomquist and Newey (2002). Then, Section 8 concludes the paper. Appendix A includes the main proofs, and Appendix B includes figures and tables. Additional supporting lemmas and simulation results are provided in the Online Supplementary Material available at Cambridge Journals Online (journals.cambridge.org/ect).

1.1 Notation

$\|A\|$ denotes the spectral norm, which equals the largest singular value of a matrix $A$, and $\lambda_{\min}(A),\lambda_{\max}(A)$ denote the minimum and maximum eigenvalues of a symmetric matrix $A$, respectively. $o_p(\cdot)$ and $O_p(\cdot)$ denote the usual stochastic order symbols, $\overset{d}{\longrightarrow}$ denotes convergence in distribution, and $\Rightarrow$ denotes weak convergence. Let $a\wedge b=\min\{a,b\}$, $a\vee b=\max\{a,b\}$, and let $\lfloor a\rfloor$ denote the largest integer no greater than the real number $a$. For two sequences of positive real numbers $a_n$ and $b_n$, $a_n\lesssim b_n$ denotes $a_n\leq cb_n$ for all $n$ sufficiently large with some constant $c>0$ that is independent of $n$, and $a_n\asymp b_n$ denotes $a_n\lesssim b_n$ and $b_n\lesssim a_n$. Furthermore, $a_n\lesssim_P b_n$ denotes $a_n=O_p(b_n)$. For a given random variable $X_i$ and $1\leq p<\infty$, $L^p(X)$ is the space of all $L^p$-norm bounded functions with $\|f\|_{L^p}=[E\|f(X_i)\|^p]^{1/p}$, $\ell^\infty(X)$ denotes the space of all bounded functions under the sup-norm, and $\|f\|_\infty=\sup_{x\in\mathcal{X}}|f(x)|$ for bounded real-valued functions $f$ on the support $\mathcal{X}$.

2 Setup

We introduce the nonparametric series regression setup in the model (1.1). Given a random sample $\{y_i,x_i\}_{i=1}^n$, we are interested in inference on the conditional mean $g_0(x)=E(y_i|x_i=x)$ at a particular point $x\in\mathcal{X}\subset\mathbb{R}^{d_x}$ or uniformly in $x\in\mathcal{X}$.

Let $\widehat{g}_n(K,x)$ be an estimator of $g_0(x)$ using $K=K_n\geq 1$ series terms $P(K,x)=(p_1(x),\cdots,p_K(x))'$, a vector of basis functions that can change with $n$. Standard examples of basis functions are power series, Fourier series, orthogonal polynomials, splines, and wavelets. The series estimator is then obtained by least squares (LS) estimation of $y_i$ on the regressors $P(K,x_i)$:

\widehat{g}_n(K,x)=P(K,x)'\widehat{\beta}_K,\qquad\widehat{\beta}_K=(P^{K\prime}P^K)^{-1}P^{K\prime}Y \qquad (2.1)

where $P^K=[P_{K1},\cdots,P_{Kn}]'$, $P_{Ki}\equiv P(K,x_i)=(p_1(x_i),p_2(x_i),\cdots,p_K(x_i))'$, and $Y=(y_1,\cdots,y_n)'$. Define the least squares residuals $\widehat{\varepsilon}_{Ki}=y_i-P_{Ki}'\widehat{\beta}_K$, let

\widehat{V}_n(K,x)=P(K,x)'\widehat{Q}_K^{-1}\widehat{\Omega}_K\widehat{Q}_K^{-1}P(K,x),\qquad\widehat{Q}_K=\frac{1}{n}\sum_{i=1}^n P_{Ki}P_{Ki}',\quad\widehat{\Omega}_K=\frac{1}{n}\sum_{i=1}^n P_{Ki}P_{Ki}'\widehat{\varepsilon}_{Ki}^2, \qquad (2.2)

and consider the t-statistic

\widehat{T}_n(K,x)\equiv\frac{\sqrt{n}(\widehat{g}_n(K,x)-g_0(x))}{\widehat{V}_n(K,x)^{1/2}}. \qquad (2.3)
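
To fix ideas, the following Python sketch computes the series estimator, the sandwich variance in (2.2), and the t-statistic (2.3) for a single $K$. The quadratic truncated-power spline basis and all helper names here are illustrative assumptions rather than the paper's prescription; any fixed basis $P(K,\cdot)$ can be substituted.

```python
import numpy as np

def spline_basis(x, K):
    """Quadratic spline basis with K evenly spaced interior knots on [0, 1].
    One possible choice of P(K, .); here the basis dimension is K + 3."""
    knots = np.linspace(0, 1, K + 2)[1:-1]               # interior knots
    powers = np.column_stack([np.ones_like(x), x, x**2])
    truncated = np.maximum(x[:, None] - knots[None, :], 0.0) ** 2
    return np.column_stack([powers, truncated])

def series_fit(y, x, K, x0):
    """LS series estimate g_hat(K, x0) and variance V_hat(K, x0), following
    (2.1)-(2.2) with LS residuals in the sandwich form."""
    n = len(y)
    P = spline_basis(x, K)
    beta = np.linalg.lstsq(P, y, rcond=None)[0]          # beta_hat_K
    eps = y - P @ beta                                   # LS residuals
    Q = P.T @ P / n                                      # Q_hat_K
    Omega = (P * eps[:, None] ** 2).T @ P / n            # Omega_hat_K
    p0 = spline_basis(np.atleast_1d(x0), K)[0]           # P(K, x0)
    a = np.linalg.solve(Q, p0)                           # Q_hat_K^{-1} P(K, x0)
    return p0 @ beta, a @ Omega @ a                      # g_hat, V_hat

# t-statistic (2.3) at x0 for a hypothesized value g0:
# g_hat, V_hat = series_fit(y, x, K, x0)
# t = np.sqrt(len(y)) * (g_hat - g0) / np.sqrt(V_hat)
```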

Under standard regularity conditions (discussed in the next section), the t-statistic can be decomposed as follows:

\widehat{T}_n(K,x)=\frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{P(K,x)'Q_K^{-1}P_{Ki}\varepsilon_i}{\widehat{V}_n(K,x)^{1/2}}-\frac{r_n(K,x)}{\sqrt{\widehat{V}_n(K,x)/n}}+o_p(1) \qquad (2.4)

where $Q_K=E(P_{Ki}P_{Ki}')$, $r_n(K,x)=g_0(x)-P(K,x)'\beta_K$, and $\beta_K\equiv(E[P_{Ki}P_{Ki}'])^{-1}E[P_{Ki}y_i]$ is the best linear $L^2$ projection coefficient. The first term in the decomposition (2.4) converges to a standard normal distribution for a deterministic sequence $K\rightarrow\infty$ as $n\rightarrow\infty$, while the second term does not necessarily converge to 0 because of the approximation error $r_n(K,x)$. The second term can be ignored under an undersmoothing assumption, and the asymptotic distribution of the t-statistic, $\widehat{T}_n(K,x)\overset{d}{\longrightarrow}N(0,1)$, is well known in the literature (see, for example, Andrews (1991a), Newey (1997), Belloni et al. (2015), and Chen and Christensen (2015), among many others). The $100(1-\alpha)\%$ confidence interval for $g_0(x)$ can then be easily constructed using the normal critical value $z_{1-\alpha/2}$:

\Big[\widehat{g}_n(K,x)\pm z_{1-\alpha/2}\sqrt{\widehat{V}_n(K,x)/n}\Big]. \qquad (2.5)

However, it is not clear whether the conventional CI (2.5) using normal critical values has correct coverage probability with a possibly data-dependent $\widehat{K}$, such as a cross-validated or IMSE-optimal selection. First, $\widehat{T}_n(\widehat{K},x)\overset{d}{\rightarrow}N(0,1)$ may not hold for a random sequence $\widehat{K}$, even if we assume the asymptotic bias is negligible. Second, it is well known that some data-dependent rules $\widehat{K}$ do not satisfy the undersmoothing rate conditions, which can lead to a large asymptotic bias and coverage distortion of the standard CI. For example, suppose that the researcher uses $\widehat{K}=\widehat{K}_{\texttt{cv}}$ selected by cross-validation; then $\widehat{K}_{\texttt{cv}}$ is typically too “small” and violates the undersmoothing assumption needed to ensure asymptotic normality without bias terms and valid inference.

As discussed in the introduction, the undersmoothing assumption involves possibly ad hoc methods of choosing the series terms $K$ over a candidate set $\mathcal{K}_n$ for valid inference, and cross-validation methods naturally involve specification search over a set of different numbers of series terms.

The following assumption on $\mathcal{K}_n$ is constructed to allow a broad range of $K$: $\mathcal{K}_n$ can contain the (unknown) MSE-optimal rate of $K$ as well as undersmoothing rates that increase faster than the MSE-optimal rate.

Assumption 2.1.

(Set of number of series terms) Assume the candidate set is $\mathcal{K}_n=\{K_j:1\leq j\leq p\}$, where $\underline{K}=K_1\rightarrow\infty$ and $\overline{K}=K_p\rightarrow\infty$ as $n\rightarrow\infty$.

Here, we consider a possibly growing set of numbers of series terms; similar assumptions are used in the literature, for example, in Newey (1994a, 1994b). Suppose $g_0(x)$ belongs to the Hölder space of smoothness $s>0$, $\Sigma(s,\mathcal{X})$; then we obtain the optimal $L^2$ convergence rate $O_p(n^{-s/(2s+d_x)})$ with $K\asymp n^{d_x/(d_x+2s)}$. Assumption 2.1 allows $\mathcal{K}_n$ to contain the $L^2$-optimal rates of $K$ for a large class of functions. By setting $\mathcal{K}_n=[\underline{K},\overline{K}]\cap\mathbb{N}$, $\overline{K}\asymp n^{\overline{\phi}}$, and $\underline{K}\asymp n^{\underline{\phi}}$ with $\overline{\phi}=d_x/(d_x+2\underline{s})$ and $\underline{\phi}=d_x/(d_x+2\overline{s})$, Assumption 2.1 contains the numbers of series terms that attain the optimal $L^2$ rate of convergence for $g_0(x)\in\bigcup_{s\in S}\Sigma(s,\mathcal{X})$, $S=[\underline{s},\overline{s}]$. A similar assumption is used in the literature on adaptive inference, although we do not pursue this direction in the current paper.

Assumption 2.1 gives flexible choices of $K$, as we only assume the rates of $K$, for example, $\overline{K}=Cn^{\overline{\phi}}$ and $\underline{K}=cn^{\underline{\phi}}$, where $c$ and $C$ can be set arbitrarily small or large. We only require rate restrictions uniformly over $K\in\mathcal{K}_n$ to guarantee the linearization of the t-statistic in (2.4), together with restrictions on the rate of the cardinality $p=|\mathcal{K}_n|$. Since $K\in\mathcal{K}_n$ is a positive integer and $p\leq\overline{K}$, $p$ grows at a rate much slower than $n$ under the rate restrictions in Section 3.

Remark 2.1 ($\mathcal{K}_n$ and the largest $K$).

As a referee noted, specification search is often performed over a simple pre-defined set in practice. For example, a researcher may only use quadratic, cubic, or quartic terms in polynomial regression or try only a few different numbers of knots in regression splines to observe how the estimate and standard error change. In the nonparametric estimation of the Mincer equation (Heckman, Lochner, and Todd (2006)), researchers may consider a regression of log wages on experience with polynomials of order $\underline{K}=1$ (linear) to $\overline{K}=4$ (quartic).\footnote{All of our results continue to hold with fixed $p$; however, it may be preferable to use larger sets $\mathcal{K}_n$ with $p\rightarrow\infty$ to give greater flexibility to the candidate models as the sample size $n$ increases.}

However, it may not be clear how to define $\mathcal{K}_n$ a priori in practice. One must first consider a set of pre-selected models over which to search. As discussed earlier and suggested by many papers in the literature, formal data-dependent methods for obtaining optimal $L^2$ norm or sup-norm rates, such as cross-validation, can provide a useful guideline for $\mathcal{K}_n$. For example, one can consider a reasonable set $\widetilde{\mathcal{K}}_n$ first, choose $\widehat{K}_{\texttt{cv}}\in\widetilde{\mathcal{K}}_n$ by cross-validation, and then consider $\mathcal{K}_n=[\widehat{K}_{\texttt{cv}},c_1\widehat{K}_{\texttt{cv}}]$ or $[\widehat{K}_{\texttt{cv}},\widehat{K}_{\texttt{cv}}n^{c_2}]$ for some constants $c_1,c_2>0$; a minimal sketch of this construction appears below. One can also search for $\underline{K}$ and $\overline{K}$ sequentially by calculating changes in the cross-validation criterion or in standard errors from the initial candidate set. Extending the results developed in this paper to data-dependent $\mathcal{K}_n$ is beyond the scope of the paper.
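
As a minimal sketch of this construction (the helper name and the constants are illustrative, not part of the paper's procedure):

```python
import numpy as np

def candidate_set(K_cv, n, c1=2.0, c2=None):
    """Build K_n = [K_cv, c1*K_cv] (or [K_cv, K_cv*n**c2]) from a
    cross-validated K_cv; the constants c1, c2 are the researcher's choice."""
    K_top = K_cv * n ** c2 if c2 is not None else c1 * K_cv
    return np.arange(K_cv, int(np.ceil(K_top)) + 1)

# e.g., candidate_set(6, 200) returns array([ 6,  7,  8,  9, 10, 11, 12])
```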

3 Pointwise Inference

In this section, we focus on pointwise inference for $g_0(x)$. The goal of this section is to provide a uniform distributional approximation theory for $\widehat{T}_n(K,x)$ over the set $\mathcal{K}_n$ and to provide uniform (in $K\in\mathcal{K}_n$) coverage properties of the confidence intervals for $g_0(x)$ in (1.2) and (1.3), together with the construction of critical values.

From the decomposition of the t-statistic in (2.4), we first consider the (infeasible) test statistic

\max_{K\in\mathcal{K}_n}|t_n(K,x)|=\max_{1\leq j\leq p}|t_n(K_j,x)| \qquad (3.1)

where $t_n(K,x)=n^{-1/2}\sum_{i=1}^n P(K,x)'Q_K^{-1}P_{Ki}\varepsilon_i/V_n(K,x)^{1/2}$ with the series variance $V_n(K,x)=P(K,x)'Q_K^{-1}\Omega_K Q_K^{-1}P(K,x)$ and $\Omega_K=E(P_{Ki}P_{Ki}'\varepsilon_i^2)$. In general, $t_n(K,x)$, $K\in\mathcal{K}_n$, does not have a limiting distribution because it is not asymptotically tight under Assumption 2.1, unless $|\mathcal{K}_n|$ is finite or restrictive assumptions are imposed on $\mathcal{K}_n$.\footnote{In an earlier version of the paper, we provide the weak convergence of a series process under the same rates of $K\in\mathcal{K}_n$ and high-level assumptions. This can be viewed as a result analogous to those in the kernel estimation literature (see Section 2 of Armstrong and Kolesár (2018) and other references therein).} However, we show below that there exists a sequence of random variables $\max_{1\leq j\leq p}|\sum_{i=1}^n Z_{ij}|$ such that $\big|\max_{K\in\mathcal{K}_n}|t_n(K,x)|-\max_{1\leq j\leq p}|\sum_{i=1}^n Z_{ij}|\big|=O_p(a_n)$ for a sequence of constants $a_n\rightarrow 0$, where $Z_i=(Z_{i1},\ldots,Z_{ip})'$ is a Gaussian random vector in $\mathbb{R}^p$ such that $Z_i\sim N(0,\frac{1}{n}\Sigma_n)$ with $(j,l)$ elements of the variance-covariance matrix

\Sigma_n(j,l)=E[t_n(K_j,x)t_n(K_l,x)]=\frac{P(K_j,x)'Q_{K_j}^{-1}\Omega_{K_j,K_l}Q_{K_l}^{-1}P(K_l,x)}{V_n(K_j,x)^{1/2}V_n(K_l,x)^{1/2}}, \qquad (3.2)

where $\Omega_{K_j,K_l}=E(P_{K_j i}P_{K_l i}'\varepsilon_i^2)$.

By replacing the unknown $\Sigma_n$ and $V_n(K,x)$ with consistent estimators $\widehat{\Sigma}_n$ and $\widehat{V}_n(K,x)$, we show below that we can approximate $\max_{K\in\mathcal{K}_n}|\widehat{T}_n(K,x)|$ by $\max_{1\leq j\leq p}|\sum_{i=1}^n Z_{ij}|$ and then obtain critical values by a simulation-based method to provide the valid coverage properties in (1.2) and (1.3). We define $\widehat{c}_{1-\alpha}(x)$ as follows:

\widehat{c}_{1-\alpha}(x)\equiv(1-\alpha)\text{ quantile of }\max_{1\leq j\leq p}\Big|\sum_{i=1}^n\widehat{Z}_{ij}\Big|,\text{ where }\widehat{Z}_i=(\widehat{Z}_{i1},\ldots,\widehat{Z}_{ip})'\sim N(0,\tfrac{1}{n}\widehat{\Sigma}_n),
\widehat{\Sigma}_n(j,j)=1,\qquad\widehat{\Sigma}_n(j,l)=\frac{\widehat{V}_n(K_j,K_l,x)}{\widehat{V}_n(K_j,x)^{1/2}\widehat{V}_n(K_l,x)^{1/2}},
\widehat{V}_n(K_j,K_l,x)=P(K_j,x)'\widehat{Q}_{K_j}^{-1}\widehat{\Omega}_{K_j,K_l}\widehat{Q}_{K_l}^{-1}P(K_l,x),\qquad\widehat{\Omega}_{K_j,K_l}=\frac{1}{n}\sum_{i=1}^n P_{K_j i}P_{K_l i}'\widehat{\varepsilon}_{K_j i}\widehat{\varepsilon}_{K_l i}, \qquad (3.3)

where $\widehat{\Sigma}_n$ is a consistent estimator of the variance-covariance matrix $\Sigma_n$ defined in (3.2), $\widehat{V}_n(K,x)$ is the simple plug-in estimator of $V_n(K,x)$ as in (2.2), and $\widehat{\varepsilon}_{Ki}=y_i-P_{Ki}'\widehat{\beta}_K$ for all $K\in\mathcal{K}_n$. One can compute $\widehat{c}_{1-\alpha}(x)$ by simulating $B$ (typically $B=1000$ or $5000$) i.i.d. random vectors $\widehat{Z}_i^b\sim N(0,\frac{1}{n}\widehat{\Sigma}_n)$ and taking the $(1-\alpha)$ sample quantile of $\{\max_{1\leq j\leq p}|\sum_{i=1}^n\widehat{Z}_{ij}^b|:b=1,\cdots,B\}$. Alternatively, we can use weighted bootstrap methods; see Section 4 for the implementation and the validity of bootstrap procedures in the construction of confidence bands.
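
A minimal simulation sketch of this step, assuming $\widehat{\Sigma}_n$ has already been assembled from (3.3); the shortcut of drawing $\sum_i\widehat{Z}_{ij}$ directly (it is exactly $N(0,\widehat{\Sigma}_n)$) is ours:

```python
import numpy as np

def pointwise_critical_value(Sigma_hat, alpha=0.05, B=1000, seed=0):
    """Simulated c_hat_{1-alpha}(x) from (3.3): since sum_i Z_hat_ij with
    Z_hat_i ~ N(0, Sigma_hat/n) is exactly N(0, Sigma_hat), draw one
    p-vector per replication and take the (1-alpha) quantile of max_j |.|."""
    rng = np.random.default_rng(seed)
    p = Sigma_hat.shape[0]
    S = rng.multivariate_normal(np.zeros(p), Sigma_hat, size=B)  # B x p draws
    return np.quantile(np.abs(S).max(axis=1), 1 - alpha)
```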

To establish our main results, we impose mild regularity conditions uniform in $K\in\mathcal{K}_n$. For each $K\in\mathcal{K}_n$, define $\zeta_K\equiv\sup_{x\in\mathcal{X}}\|P(K,x)\|$ as the largest normalized length of the regressor vector and $\lambda_K\equiv(\lambda_{\min}(Q_K))^{-1/2}$ for the $K\times K$ design matrix $Q_K=E(P_{Ki}P_{Ki}')$.

Assumption 3.1.

(Regularity conditions - model)

(i) $\{y_i,x_i\}_{i=1}^n$ are i.i.d. random variables satisfying the model (1.1).

(ii) $\max_{K\in\mathcal{K}_n}\lambda_K\lesssim 1$, and for each $K\in\mathcal{K}_n$, as $K\rightarrow\infty$, there exist $c_K,\ell_K$ such that

\sup_{x\in\mathcal{X}}|r_n(K,x)|\leq\ell_K c_K,\qquad E[r_n(K,x)^2]^{1/2}\leq c_K,

where $r_n(K,x)=g_0(x)-P(K,x)'\beta_K$ and $\beta_K=(E[P_{Ki}P_{Ki}'])^{-1}E[P_{Ki}y_i]$.

Assumption 3.2.

(Regularity conditions - pointwise inference)

(i) $\max_{K\in\mathcal{K}_n}\sqrt{\zeta_K^2\log K\log^2 p/n}\,(1+\sqrt{K}\ell_K c_K)+\ell_K c_K\log p\rightarrow 0$ as $n\rightarrow\infty$.

(ii) $\sup_{x\in\mathcal{X}}E(|\varepsilon_i|^3|x_i=x)<\infty$, $\inf_{x\in\mathcal{X}}E(\varepsilon_i^2|x_i=x)>0$, and either of the following conditions holds: (a) $\sup_{x\in\mathcal{X}}E[|\varepsilon_i|^q|x_i=x]<\infty$ for $q\geq 4$ or (b) there exists a constant $C>0$ such that $\sup_{x\in\mathcal{X}}E[\exp(|\varepsilon_i|/C)|x_i=x]\leq 2$.

(iii) $\max_{K\in\mathcal{K}_n}|\frac{V_n(K,x)}{\widehat{V}_n(K,x)}-1|=o_p(1/\log p)$ and $\max_{1\leq j,l\leq p}|\widehat{\Sigma}_n(j,l)-\Sigma_n(j,l)|=o_p(1/\log^2 p)$.

Assumptions 3.1(ii) and 3.2(i) are similar to those imposed in Belloni et al. (2015) and Chen and Christensen (2015), and all the discussions made there also apply here, except that we impose rate conditions on $K$ uniformly over $\mathcal{K}_n$. The rate conditions can be replaced by specific bounds on $\zeta_K,c_K,\ell_K$ for various sieve bases. For example, when $\mathcal{X}=[0,1]^{d_x}$, the probability density of $x_i$ is uniformly bounded above and away from zero, and $g_0(x)\in\Sigma(s,\mathcal{X})$, the Hölder space of smoothness $s>0$, then $\lambda_K\lesssim 1$, $\zeta_K\lesssim\sqrt{K}$, and $\ell_K c_K\lesssim K^{-(s\wedge s_0)/d_x}$ for regression spline series of order $s_0$, and Assumption 3.2(i) is satisfied when $\sqrt{\overline{K}(\log^3\overline{K})/n}(1+\overline{K}^{1/2}\underline{K}^{-(s\wedge s_0)/d_x})+\underline{K}^{-(s\wedge s_0)/d_x}\log\overline{K}\rightarrow 0$. Other standard regularity conditions in the literature (e.g., Newey (1997) and Chen (2007)) can also be used here, and the rate condition can be improved with different pointwise linearization and approximation bounds in Huang (2003) for splines and in Cattaneo et al. (2019) for partitioning-based estimators.

Assumption 3.2(ii) imposes either bounded polynomial moments or sub-exponential moments on the regression errors. Assumption 3.2(iii) imposes consistency of the variance estimator $\widehat{V}_n(K,x)$ uniformly in $K\in\mathcal{K}_n$, which holds under mild regularity conditions (see Lemma 5.1 of Belloni et al. (2015) and Lemmas 3.1-3.2 of Chen and Christensen (2015)).

Theorem 3.1.

Suppose that Assumptions 2.1, 3.1, and 3.2 hold and that the following rate condition holds under case (a) or (b) of Assumption 3.2(ii), respectively: (a) $(\max_K\zeta_K)^2\log^5 n\log^3 p/n\vee\max_K\zeta_K\log^{3/4}n\log p/n^{1/2-1/q}\rightarrow 0$ or (b) $(\max_K\zeta_K)^2\log^5 n\log^3 p/n\rightarrow 0$. If, in addition, we assume that $\max_{K\in\mathcal{K}_n}|\frac{\sqrt{n}r_n(K,x)}{V_n(K,x)^{1/2}}|=o(1/\sqrt{\log p})$, then

\sup_{u\in\mathbb{R}}\Big|P\big(\max_{K\in\mathcal{K}_n}|\widehat{T}_n(K,x)|\leq u\big)-P\big(\max_{1\leq j\leq p}\Big|\sum_{i=1}^n\widehat{Z}_{ij}\Big|\leq u\big)\Big|=o(1), \qquad (3.4)

and the following coverage property holds

P\big(g_0(x)\in[\widehat{g}_n(K,x)\pm\widehat{c}_{1-\alpha}(x)\sqrt{\widehat{V}_n(K,x)/n}\,],\quad K\in\mathcal{K}_n\big)=1-\alpha+o(1) \qquad (3.5)

with the critical value $\widehat{c}_{1-\alpha}(x)$ defined in (3.3). Alternatively, if we assume $|\frac{\sqrt{n}r_n(\widehat{K},x)}{V_n(\widehat{K},x)^{1/2}}|=o(1/\sqrt{\log p})$ with $\widehat{K}\in\mathcal{K}_n$, then the following holds:

\liminf_{n\rightarrow\infty}P\big(g_0(x)\in[\widehat{g}_n(\widehat{K},x)\pm\widehat{c}_{1-\alpha}(x)\sqrt{\widehat{V}_n(\widehat{K},x)/n}\,]\big)\geq 1-\alpha. \qquad (3.6)

Theorem 3.1 provides a uniform coverage property of the confidence intervals over $K\in\mathcal{K}_n$ for the regression function $g_0(x)$. Equation (3.6) guarantees the asymptotic coverage of the CI for data-dependent $\widehat{K}\in\mathcal{K}_n$ with undersmoothing. Note that standard inference methods in the nonparametric regression setup typically consider a singleton set $\mathcal{K}_n=\{K\}$ with $K\rightarrow\infty$ as $n\rightarrow\infty$. The rate restriction is mild because it only requires $\overline{K}/n^{1-2/q}\rightarrow 0$, up to $\log n$ terms, in case (a) and $\overline{K}/n\rightarrow 0$, up to $\log n$ terms, in case (b) when $\zeta_K\lesssim\sqrt{K}$, as for splines and wavelet series. Theorem 3.1 builds upon a coupling inequality for maxima of sums of random vectors in Chernozhukov, Chetverikov, and Kato (2014a) combined with the anti-concentration inequality in Chernozhukov, Chetverikov, and Kato (2014b).

Remark 3.1 (Undersmoothing assumption).

Note that (3.5) requires an undersmoothing assumption uniformly over $K\in\mathcal{K}_n$. Without $\max_{K\in\mathcal{K}_n}|\frac{\sqrt{n}r_n(K,x)}{V_n(K,x)^{1/2}}|=o(1)$, the coverage in (3.5) can be understood as uniform confidence intervals for the pseudo-true value $g(K,x)=P(K,x)'\beta_K$, i.e.,

P\big(g(K,x)\in[\widehat{g}_n(K,x)\pm\widehat{c}_{1-\alpha}(x)\sqrt{\widehat{V}_n(K,x)/n}\,],\quad K\in\mathcal{K}_n\big)=1-\alpha+o(1). \qquad (3.7)

However, a uniform undersmoothing condition is not assumed in (3.6); it only requires that the chosen $\widehat{K}\in\mathcal{K}_n$ satisfy the undersmoothing condition so that the asymptotic bias is negligible. This allows broader ranges of $K$ in $\mathcal{K}_n$, including the unknown MSE-optimal rate. We thus formally justify the rule-of-thumb methods for valid inference suggested in the literature, such as including an additional number of series terms, blowing up the number chosen by cross-validation, or “plug-in” methods for choosing $\widehat{K}$ as in Newey, Powell, and Vella (1999) and Newey (2013). Here, uniform (in $K\in\mathcal{K}_n$) inference accounts for the uncertainty from specification search by using critical values $\widehat{c}_{1-\alpha}(x)$ larger than the normal critical value $z_{1-\alpha/2}$.

Remark 3.2 (Other functionals).

Here, we focus on the leading example of $g_0(x)$ at some fixed point $x\in\mathcal{X}$; however, we can consider other linear functionals $a(g_0(\cdot))$, such as the regression derivative $a(g_0(x))=\frac{d}{dx}g_0(x)$. All the results in this paper can be applied to irregular (slower than $n^{1/2}$ rate) linear functionals using the estimators $a(\widehat{g}_n(K,x))=a_K(x)'\widehat{\beta}_K$ and an appropriate transformation of the basis, $a_K(x)=(a(p_1(x)),\cdots,a(p_K(x)))'$, with proper smoothness conditions on the functional and continuity conditions on the derivative as in Newey (1997). Although the verification of the previous results for regular ($n^{1/2}$ rate) functionals, such as integrals and weighted average derivatives, is beyond the scope of this paper, we examine similar results for the partially linear model setup in Section 5.

4 Uniform Inference

This section provides the construction of uniform confidence bands for $g_0(x)$ (uniform in $K\in\mathcal{K}_n$) given in (1.4). We define the following empirical process

\widehat{T}_n(K,x)\equiv\frac{\sqrt{n}(\widehat{g}_n(K,x)-g_0(x))}{\widehat{V}_n(K,x)^{1/2}} \qquad (4.1)

over $\mathcal{K}_n\times\mathcal{X}$, and we show below that the supremum of the empirical process, $\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|\widehat{T}_n(K,x)|$, can be approximated by a sequence of random variables $\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|Z_n(K,x)|$, where $Z_n(K,x)$ is a tight Gaussian random process in $\ell^\infty(\mathcal{K}_n\times\mathcal{X})$ with zero mean and covariance function

E[Z_n(K,x)Z_n(K',x')]=\frac{P(K,x)'Q_K^{-1}\Omega_{K,K'}Q_{K'}^{-1}P(K',x')}{V_n(K,x)^{1/2}V_n(K',x')^{1/2}}. \qquad (4.2)

Although the Gaussian approximation is an important first step, the covariance function (4.2) is generally difficult to construct for the purpose of uniform inference. Thus, we employ weighted bootstrap methods similar to Belloni et al. (2015) and show the validity of the bootstrap procedure for uniform confidence bands.

Let $e_1,\ldots,e_n$ be a sequence of i.i.d. standard exponential random variables that are independent of $X^n=\{x_1,\ldots,x_n\}$. For $(K,x)\in\mathcal{K}_n\times\mathcal{X}$, we define a (centered) weighted bootstrap process

\widehat{T}_n^e(K,x)=\frac{\sqrt{n}(\widehat{g}_n^e(K,x)-\widehat{g}_n(K,x))}{\widehat{V}_n(K,x)^{1/2}} \qquad (4.3)

where $\widehat{g}_n^e(K,x)=P(K,x)'\widehat{\beta}_K^e$, and $\widehat{\beta}_K^e$ is obtained by the following weighted least squares regression:

\widehat{\beta}_K^e=\operatorname*{arg\,min}_{\beta\in\mathbb{R}^K}\sum_{i=1}^n e_i(y_i-P(K,x_i)'\beta)^2. \qquad (4.4)

Define the critical value

\widehat{c}_{1-\alpha}\equiv(1-\alpha)\text{ conditional quantile of }\sup_{K\in\mathcal{K}_n,x\in\mathcal{X}}|\widehat{T}_n^e(K,x)|\text{ given the data }X^n, \qquad (4.5)

and we consider confidence bands of the form

[\widehat{g}_n(K,x)\pm\widehat{c}_{1-\alpha}\sqrt{\widehat{V}_n(K,x)/n}\,],\quad K\in\mathcal{K}_n,\ x\in\mathcal{X}. \qquad (4.6)

To establish the validity of the bootstrap critical values and the confidence bands in (4.6), we show below that the conditional distribution of $\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|\widehat{T}_n^e(K,x)|$ is “close” to the distribution of $\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|Z_n(K,x)|$ and to that of $\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|\widehat{T}_n(K,x)|$, using coupling inequalities for the supremum of the empirical process and the bootstrap process as in Chernozhukov et al. (2016). Then, similar to Theorem 3.1, this gives bounds on the Kolmogorov distance between the distribution functions $P(\sup_{K\in\mathcal{K}_n,x\in\mathcal{X}}|\widehat{T}_n(K,x)|\leq u)$ and $P(\sup_{K\in\mathcal{K}_n,x\in\mathcal{X}}|\widehat{T}_n^e(K,x)|\leq u|X^n)$.
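
A sketch of the weighted bootstrap critical value in (4.5), reusing the hypothetical spline_basis helper from the Section 2 sketch and evaluating the sup over a finite grid of $x$ values (both are our simplifying assumptions):

```python
import numpy as np

def grid_fit(y, x, K, x_grid):
    """g_hat(K, x) and se(K, x) = sqrt(V_hat(K, x)/n) on a grid, using the
    sandwich variance from (2.2)."""
    n = len(y)
    P = spline_basis(x, K)
    beta = np.linalg.lstsq(P, y, rcond=None)[0]
    eps = y - P @ beta
    Q, Omega = P.T @ P / n, (P * eps[:, None] ** 2).T @ P / n
    Pg = spline_basis(x_grid, K)
    A = np.linalg.solve(Q, Pg.T)               # Q_hat^{-1} P(K, x) per grid x
    V = np.einsum('kg,kg->g', A, Omega @ A)    # diagonal of A' Omega A
    return Pg @ beta, np.sqrt(V / n)

def band_critical_value(y, x, K_set, x_grid, alpha=0.05, B=1000, seed=0):
    """Weighted-bootstrap c_hat_{1-alpha} in (4.5): i.i.d. Exp(1) weights e_i,
    a weighted LS refit (4.4) for each K, and the sup over (K, x) of the
    centered process |T_hat^e(K, x)| in (4.3)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    base = {K: grid_fit(y, x, K, x_grid) for K in K_set}
    sups = np.empty(B)
    for b in range(B):
        w = np.sqrt(rng.standard_exponential(n))   # sqrt of bootstrap weights
        sup_b = 0.0
        for K in K_set:
            P = spline_basis(x, K)
            beta_e = np.linalg.lstsq(w[:, None] * P, w * y, rcond=None)[0]
            g_hat, se = base[K]
            g_e = spline_basis(x_grid, K) @ beta_e
            sup_b = max(sup_b, np.max(np.abs(g_e - g_hat) / se))
        sups[b] = sup_b
    return np.quantile(sups, 1 - alpha)
```

The band (4.6) is then $\widehat{g}_n(K,x)$ plus or minus the returned quantile times $\sqrt{\widehat{V}_n(K,x)/n}$ at each grid point.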

The following assumptions are used to establish the coverage probability of the confidence bands uniformly over $K\in\mathcal{K}_n$. Define $\alpha(K,x)\equiv Q_K^{-1/2}P(K,x)/V_n(K,x)^{1/2}$ and

\zeta^{L_1}=\max_{K\in\mathcal{K}_n}\sup_{x,x'\in\mathcal{X},x\neq x'}\frac{\|\alpha(K,x)-\alpha(K,x')\|}{\|x-x'\|},\qquad\zeta^{L_2}=\sup_{x\in\mathcal{X}}\max_{K,K'\in\mathcal{K}_n:K\neq K'}\frac{\|\alpha(K,x)-\alpha(K',x)\|}{|K-K'|}.
Assumption 4.1.

(Regularity conditions - uniform inference)

(i) $\sup_{x\in\mathcal{X}}E[|\varepsilon_i|^q|x_i=x]<\infty$ for $q\geq 4$ and $\inf_{x\in\mathcal{X}}E(\varepsilon_i^2|x_i=x)>0$.

(ii) $\max_{K\in\mathcal{K}_n}\sqrt{\frac{\lambda_K^2\zeta_K^2\log K\log^4 n}{n}}(n^{1/q}+\ell_K c_K\sqrt{K})+(\ell_K c_K)\log n\rightarrow 0$ as $n\rightarrow\infty$.

(iii) $\log(\zeta^{L_1}\vee\zeta^{L_2})\lesssim\log n$, $\max_K\zeta_K^{2q/(q-2)}\log^3 n/n\lesssim 1$, and $\max_K\zeta_K\lesssim\log n$.

(iv) $\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|\frac{V_n(K,x)}{\widehat{V}_n(K,x)}-1|=o_p(1/\log^2 n)$.

For uniform inference, we require conditions similar to, but slightly stronger than, Assumption 3.2. We also impose mild rate restrictions on $\zeta^{L_1}$, $\zeta^{L_2}$, and $\max_{K\in\mathcal{K}_n}\zeta_K$ similar to Chernozhukov et al. (2014a) and Belloni et al. (2015).

Theorem 4.1.

Suppose that Assumptions 2.1, 3.1, and 4.1 hold, and $(\max_K\zeta_K)\log^{2+1/(2q)}n/n^{1/2-1/q}\rightarrow 0$, $(\max_K\zeta_K)^2\log^7 n/n\rightarrow 0$ as $n\rightarrow\infty$. If, in addition, we assume that $\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}}|\frac{\sqrt{n}r_n(K,x)}{V_n(K,x)^{1/2}}|=o(1/\sqrt{\log n})$, then

P\big(g_0(x)\in[\widehat{g}_n(K,x)\pm\widehat{c}_{1-\alpha}\sqrt{\widehat{V}_n(K,x)/n}\,],\quad K\in\mathcal{K}_n,\ x\in\mathcal{X}\big)=1-\alpha+o(1) \qquad (4.7)

with the critical value $\widehat{c}_{1-\alpha}$ in (4.5).

Alternatively, if we assume $\sup_{x\in\mathcal{X}}|\frac{\sqrt{n}r_n(\widehat{K},x)}{V_n(\widehat{K},x)^{1/2}}|=o(1/\sqrt{\log n})$ with $\widehat{K}\in\mathcal{K}_n$, then the following coverage property holds:

\liminf_{n\rightarrow\infty}P\big(g_0(x)\in[\widehat{g}_n(\widehat{K},x)\pm\widehat{c}_{1-\alpha}\sqrt{\widehat{V}_n(\widehat{K},x)/n}\,],\quad x\in\mathcal{X}\big)\geq 1-\alpha. \qquad (4.8)

Theorem 4.1 shows the asymptotic coverage property of the confidence bands defined in (4.6), uniformly over $K\in\mathcal{K}_n$. Furthermore, it shows that a confidence band with possibly data-dependent $\widehat{K}\in\mathcal{K}_n$ has asymptotic coverage of at least $1-\alpha$. The confidence band constructed in (4.8) requires a substantially weaker undersmoothing assumption, similar to Theorem 3.1.

5 Extension: Partially Linear Model

In this section, we provide inference methods for the partially linear model (PLM) setup. For notational simplicity, we use notation similar to that of the nonparametric regression setup. Suppose we observe a random sample $\{y_i,w_i,x_i\}_{i=1}^n$, where $y_i$ is the scalar response variable, $w_i\in\mathcal{W}\subset\mathbb{R}$ is the treatment/policy variable of interest, and $x_i\in\mathcal{X}\subset\mathbb{R}^{d_x}$ is a set of explanatory variables. For simplicity, we assume that $w_i$ is a scalar. We consider the model

y_i=\theta_0 w_i+g_0(x_i)+\varepsilon_i,\qquad E(\varepsilon_i|w_i,x_i)=0. \qquad (5.1)

We are interested in inference on $\theta_0$ after approximating the unknown function $g_0(x)$ by series terms/regressors $p(x_i)$ among a set of potential control variables. Specification searches can be performed over the number of approximating terms or over the number of covariates used in estimating the nonparametric part.

The series estimator $\widehat{\theta}_n(K)$ of $\theta_0$ using the first $K=K_n$ terms is obtained by standard LS estimation of $y_i$ on $w_i$ and $P_{Ki}=P(K,x_i)$ and has the usual “partialling out” formula

\widehat{\theta}_n(K)=(W'M_K W)^{-1}W'M_K Y \qquad (5.2)

where $W=(w_1,\cdots,w_n)'$, $M_K=I_n-P^K(P^{K\prime}P^K)^{-1}P^{K\prime}$, $P^K=[P_{K1},\cdots,P_{Kn}]'$, and $Y=(y_1,\cdots,y_n)'$. The asymptotic normality of $\widehat{\theta}_n(K)$ and valid inference have been developed in the literature.\footnote{See also Robinson (1988), Linton (1995), and references therein for results on kernel estimators.} Donald and Newey (1994) derived the asymptotic normality of $\widehat{\theta}_n(K)$ under the standard rate condition $K/n\rightarrow 0$. Belloni, Chernozhukov, and Hansen (2014) analyzed asymptotic normality and uniformly valid inference for the post-double-selection estimator even when $K$ is much larger than $n$ (see also Kozbur (2018)). Recent papers by Cattaneo, Jansson, and Newey (2018a, 2018b) provided a valid approximation theory for $\widehat{\theta}_n(K)$ when $K$ grows at the same rate as $n$.
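
A minimal partialling-out sketch of (5.2), again assuming the spline_basis helper from the Section 2 sketch for the nonparametric part:

```python
import numpy as np

def plm_theta(y, w, x, K):
    """Partialling-out form of (5.2): residualize y and w on P(K, x) by LS,
    then regress the residualized y on the residualized w."""
    P = spline_basis(x, K)
    w_res = w - P @ np.linalg.lstsq(P, w, rcond=None)[0]   # M_K W
    y_res = y - P @ np.linalg.lstsq(P, y, rcond=None)[0]   # M_K Y
    return (w_res @ y_res) / (w_res @ w_res)               # theta_hat_n(K)
```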

A different approximation theory using a faster rate of $K$ ($K/n\rightarrow c$, $0<c<1$) than the standard rate condition ($K/n\rightarrow 0$) is particularly useful for our purpose of establishing the asymptotic distribution of the t-statistics over $K\in\mathcal{K}_n$. From the results in Cattaneo, Jansson, and Newey (2018a), we have the following decomposition:

\sqrt{n}(\widehat{\theta}_n(K)-\theta_0)=\Big(\frac{1}{n}W'M_K W\Big)^{-1}\frac{1}{\sqrt{n}}W'M_K(Y-W\theta_0)
=\widehat{\Gamma}_n(K)^{-1}\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^n v_i M_{K,ii}\varepsilon_i+\frac{1}{\sqrt{n}}\sum_{i=1}^n\sum_{j=1,j\neq i}^n v_i M_{K,ij}\varepsilon_j\Big)+o_p(1) \qquad (5.3)

where $v_i\equiv w_i-g_{w0}(x_i)$, $g_{w0}(x_i)\equiv E[w_i|x_i]$, and $\widehat{\Gamma}_n(K)=W'M_K W/n$. For any deterministic sequence $K\rightarrow\infty$ satisfying the standard rate condition $K/n\rightarrow 0$, $\sqrt{n}(\widehat{\theta}_n(K)-\theta_0)$ is asymptotically normal with variance $V=\Gamma^{-1}\Omega\Gamma^{-1}$, $\Gamma=E[v_iv_i']$, $\Omega=E[v_iv_i'\varepsilon_i^2]$. Unlike the nonparametric object of interest in the fully nonparametric model, where the variance term increases with $K$, $\widehat{\theta}_n(K)$ has a parametric ($n^{1/2}$) convergence rate, and the estimators $\widehat{\theta}_n(K)$ for all different sequences of $K$ are asymptotically equivalent under $K/n\rightarrow 0$.\footnote{This is also related to the well-known results on two-step semiparametric estimation; the asymptotic variance of two-step semiparametric estimators does not depend on the type of the first-step estimator or the smoothing parameter sequence under certain conditions (see Newey (1994b)).} However, under the faster rate condition $K/n\rightarrow c$ for $0<c<1$, the second term in (5.3) is not negligible and converges to a bounded random variable. Cattaneo, Jansson, and Newey (2018a) apply a central limit theorem for degenerate U-statistics to the second term, similar to the many-instrument asymptotics analyzed in Chao, Swanson, Hausman, Newey, and Woutersen (2012). The limiting normal distribution then has a larger variance than the standard first-order asymptotic variance, and the adjusted variances generally depend on the number of terms $K$, so that we can provide an asymptotic distribution of the t-statistics across different sequences of $K$ over $\mathcal{K}_n$.

The following assumption on $\mathcal{K}_n$ is considered, and we impose the regularity conditions used in Cattaneo, Jansson, and Newey (2018a, Assumption PLM) uniformly over $K\in\mathcal{K}_n$.

Assumption 5.1.

(Set of finite number of series terms)

Assume $\mathcal{K}_n=\{\underline{K}\equiv K_1,\cdots,K_m,\cdots,\overline{K}\equiv K_p\}$, where $K_m\rightarrow\infty$ and $K_m/n\rightarrow c_m$ as $n\rightarrow\infty$ for all $m=1,\ldots,p$, with constants $c_m$ such that $0<c_1<c_2<\cdots<c_p<1$, and with $p$ fixed.

Assumption 5.2.

(Regularity conditions - partially linear model)

(i) $\{y_i,w_i,x_i\}_{i=1}^n$ are i.i.d. random variables satisfying the model (5.1).

(ii) There exist constants $0<c\leq C<\infty$ such that $E[\varepsilon_i^2|w_i,x_i]\geq c$, $E[v_i^2|x_i]\geq c$, $E[\varepsilon_i^4|w_i,x_i]\leq C$, and $E[v_i^4|x_i]\leq C$.

(iii) $\mathrm{rank}(P^K)=K$ (a.s.) and $M_{K,ii}\geq C$ for some $C>0$ for all $K\in\mathcal{K}_n$.

(iv) For each $K\in\mathcal{K}_n$, there exist some $\gamma_g,\gamma_{g_w}$ such that

\min_{\eta_g}E[(g_0(x_i)-\eta_g'P_{Ki})^2]=O(K^{-2\gamma_g}),\qquad\min_{\eta_{g_w}}E[(g_{w0}(x_i)-\eta_{g_w}'P_{Ki})^2]=O(K^{-2\gamma_{g_w}}).

Assumption 5.2 does not require $K/n\rightarrow 0$, which is imposed to obtain asymptotic normality in the literature (e.g., Donald and Newey (1994)). Similar to Assumption 3.1(ii) in the nonparametric setup, Assumption 5.2(iv) holds for polynomial and spline bases. For example, 5.2(iv) holds with $\gamma_g=s_g/d_x$ and $\gamma_{g_w}=s_w/d_x$ when $\mathcal{X}$ is compact and the unknown functions $g_0(x)$ and $g_{w0}(x)$ have $s_g$ and $s_w$ continuous derivatives, respectively.

Under Assumptions 5.1 and 5.2 and the undersmoothing condition $n\overline{K}^{-2(\gamma_g+\gamma_{g_w})}\rightarrow 0$, we have a joint asymptotic distribution of the t-statistics $T_n(K,\theta_0)=\sqrt{n}V_n(K)^{-1/2}(\widehat{\theta}_n(K)-\theta_0)$ over $K\in\mathcal{K}_n$:

(T_n(K_1,\theta_0),\cdots,T_n(K_p,\theta_0))'\overset{d}{\longrightarrow}Z_\Sigma=(Z_1,\cdots,Z_p)'\sim N(0,\Sigma)

where

V_n(K)=\Gamma_n(K)^{-1}\Omega_n(K)\Gamma_n(K)^{-1},\qquad\Gamma_n(K)=\frac{1}{n}\sum_{i=1}^n M_{K,ii}E[v_i^2|x_i],\qquad\Omega_n(K)=\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n M_{K,ij}^2 E[v_i^2\varepsilon_j^2|x_i,x_j],

and the variance-covariance matrix $\Sigma$ has $(l,l')$ element

\Sigma(l,l')\equiv\lim_{n\rightarrow\infty}\frac{V_n(K_l,K_{l'})}{V_n(K_l)^{1/2}V_n(K_{l'})^{1/2}},\qquad V_n(K_l,K_{l'})=\Gamma_n(K_l)^{-1}\Omega_n(K_l,K_{l'})\Gamma_n(K_{l'})^{-1},
\Omega_n(K_l,K_{l'})=\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n M_{K_l,ij}M_{K_{l'},ij}E[v_i^2\varepsilon_j^2|x_i,x_j], \qquad (5.4)

for $l,l'=1,\ldots,p$. Then, we can similarly define critical values as in (3.3) to construct confidence intervals for $\theta_0$ uniform in $K\in\mathcal{K}_n$, analogous to the nonparametric setup. Let

\widehat{c}_{1-\alpha}\equiv(1-\alpha)\text{ quantile of }\max_{m=1,\ldots,p}|\widehat{Z}_m|,\qquad\widehat{Z}_\Sigma=(\widehat{Z}_1,\ldots,\widehat{Z}_p)'\sim N(0,\widehat{\Sigma}_n) \qquad (5.5)

where $\widehat{\Sigma}_n$ is a consistent estimator of the unknown $\Sigma$ defined in (5.4).

Theorem 5.1 is the main result for the partially linear model setup and provides asymptotic coverage results for the CIs uniform in $K\in\mathcal{K}_n$, analogous to the nonparametric setup in Section 3.

Theorem 5.1.

Suppose that Assumptions 5.1 and 5.2 hold. In addition, assume that $n\overline{K}^{-2(\gamma_g+\gamma_{g_w})}\rightarrow 0$ and $\max_{K,K'\in\mathcal{K}_n}|\frac{\widehat{V}_n(K,K')}{V_n(K,K')}-1|=o_p(1)$ as $n,K\rightarrow\infty$. Then,

\lim_{n\rightarrow\infty}P\big(\theta_0\in[\widehat{\theta}_n(K)\pm\widehat{c}_{1-\alpha}\sqrt{\widehat{V}_n(K)/n}\,],\quad\forall K\in\mathcal{K}_n\big)=1-\alpha, \qquad (5.6)
\liminf_{n\rightarrow\infty}P\big(\theta_0\in[\widehat{\theta}_n(\widehat{K})\pm\widehat{c}_{1-\alpha}\sqrt{\widehat{V}_n(\widehat{K})/n}\,]\big)\geq 1-\alpha,\quad\widehat{K}\in\mathcal{K}_n, \qquad (5.7)

where the critical value $\widehat{c}_{1-\alpha}$ is defined in (5.5).

Remark 5.1.

Note that the construction of the CIs requires consistent estimation of the variance $\Omega_n(K)$. As discussed in Cattaneo, Jansson, and Newey (2018a, 2018b), constructing a heteroskedasticity-robust estimator of $\Omega_n(K)$ under $K/n\rightarrow c>0$ is challenging, and the Eicker-Huber-White-type variance estimator generally requires $K/n\rightarrow 0$ for consistency. Cattaneo, Jansson, and Newey (2018b) consider the following standard error formula:

\widehat{\Omega}_n(K,\kappa_n)=\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n\kappa_{ij}\hat{v}_{K,i}^2\hat{\varepsilon}_{K,j}^2 \qquad (5.8)

where $\hat{v}_K=M_K W$, $\hat{\varepsilon}_K=M_K(Y-W\widehat{\theta}_n(K))$, and $\kappa_n$ is a symmetric matrix with $(i,j)$ element $\kappa_{ij}$. Cattaneo, Jansson, and Newey (2018b) show that $\widehat{\Omega}_n(K,\kappa_n)$ is consistent even under heteroskedasticity and $K/n\rightarrow c>0$ for a certain choice of $\kappa_n$ and provide a sufficient condition for consistency. See Theorems 3 and 4 of Cattaneo, Jansson, and Newey (2018b) for further discussion.

6 Simulations

This section investigates the small sample performance of the proposed inference methods. We report the empirical coverage and the average length of the confidence intervals/confidence bands considered in Sections 3 and 4 with various simulation setups.

We consider the following data generating process:

y_i=g(x_i)+\varepsilon_i,\qquad x_i=\Phi(x_i^*),\qquad\begin{pmatrix}x_i^*\\ \varepsilon_i\end{pmatrix}\sim N\bigg(\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}1&0\\ 0&\sigma^2(x_i^*)\end{pmatrix}\bigg)

where $\Phi(\cdot)$ is the standard normal cumulative distribution function, used to ensure compact support, and $\sigma^2(x_i^*)=((1+2x_i^*)/2)^2$ (heteroskedastic). We investigate the following three functions for $g(x)$: $g_1(x)=\ln(|6x-3|+1)\mathrm{sgn}(x-1/2)$, $g_2(x)=\frac{\sin(7\pi x/2)}{1+2x^2(\mathrm{sgn}(x)+1)}$, and $g_3(x)=x-1/2+5\phi(10(x-1/2))$, where $\phi(\cdot)$ is the standard normal probability density function and $\mathrm{sgn}(\cdot)$ is the sign function. $g_1(x)$ is used in Newey and Powell (2003) as well as in Chen and Christensen (2018). $g_2(x)$ and $g_3(x)$ are rescaled versions of functions used in Hall and Horowitz (2013). See Figure 1 for the shapes of all three functions on $[0,1]$. For all simulation results below, we generate 2000 simulation replications for each design with sample size $n=200$.
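
For replication purposes, a sketch of this data generating process (SciPy is used only for $\Phi$ and $\phi$; the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def simulate(n, model=1, heteroskedastic=True, seed=0):
    """One draw from the Section 6 design: x = Phi(x*), x* ~ N(0, 1), and
    errors with conditional standard deviation |1 + 2x*|/2."""
    rng = np.random.default_rng(seed)
    x_star = rng.standard_normal(n)
    x = norm.cdf(x_star)                          # compact support on [0, 1]
    sd = np.abs(1 + 2 * x_star) / 2 if heteroskedastic else np.ones(n)
    eps = sd * rng.standard_normal(n)
    if model == 1:
        g = np.log(np.abs(6 * x - 3) + 1) * np.sign(x - 0.5)
    elif model == 2:
        g = np.sin(7 * np.pi * x / 2) / (1 + 2 * x**2 * (np.sign(x) + 1))
    else:
        g = x - 0.5 + 5 * norm.pdf(10 * (x - 0.5))
    return x, g + eps
```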

We report results for quadratic splines with evenly placed knots, where the number of knots K is selected from \mathcal{K}_n = \{6,7,\ldots,12\}, obtained by setting \underline{K} = 2n^{1/5} and \overline{K} = 2n^{1/3} and rounding up to the nearest integer. We then calculate the pointwise coverage rate (COV) and the average length (AL) of various 95% nominal CIs, as well as analogous uniform CBs, over grid points of x on the support \mathcal{X} = [0.05, 0.95]. To calculate critical values, 1000 additional Monte Carlo or bootstrap replications are performed in each simulation iteration. In addition, we investigate results for homoskedastic errors (\sigma^2(x_i^*) = 1), different sample sizes n = \{100, 500\}, polynomial regressions, and different specifications as in Cattaneo and Farrell (2013) with multivariate and non-normal regressors; the results show qualitatively similar patterns and hence are not reported here for brevity. Additional simulation results are reported in the Online Supplementary Material.
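The design matrix and the set \mathcal{K}_n can be built as follows (a sketch; the truncated-power form is one of several equivalent parametrizations of a quadratic spline with evenly placed knots, and the helper name is an illustrative choice):

```python
import numpy as np

def spline_basis(x, num_knots, degree=2):
    """Quadratic spline with `num_knots` evenly placed interior knots on [0,1],
    in truncated-power form: 1, x, x^2, (x - t_1)_+^2, ..., (x - t_K)_+^2."""
    knots = np.linspace(0, 1, num_knots + 2)[1:-1]           # interior knots
    powers = np.vander(x, degree + 1, increasing=True)       # 1, x, x^2
    trunc = np.clip(x[:, None] - knots[None, :], 0, None) ** degree
    return np.hstack([powers, trunc])                        # n x (degree + 1 + num_knots)

n = 200
K_lo = int(np.ceil(2 * n ** (1 / 5)))                        # = 6 for n = 200
K_hi = int(np.ceil(2 * n ** (1 / 3)))                        # = 12 for n = 200
K_grid = range(K_lo, K_hi + 1)                               # K_n = {6, 7, ..., 12}
```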

Table 1 reports the coverage and average length of the following nominal 95% pointwise CIs at x = 0.2, 0.5, 0.8, 0.9: (1) the standard CI in (2.5) with \widehat{K}_{\texttt{cv}} \in \mathcal{K}_n selected to minimize the leave-one-out cross-validation criterion; (2) the robust CI in (3.6) with \widehat{K}_{\texttt{cv}}, using the critical value \widehat{c}_{1-\alpha}(x); (3) the robust CI using \widehat{K}_{\texttt{cv+}} = \widehat{K}_{\texttt{cv}} + 2. Analogous uniform inference results for CBs are also reported. The critical values \widehat{c}_{1-\alpha}(x) and \widehat{c}_{1-\alpha} are constructed using the Monte Carlo method and the weighted bootstrap method, respectively.
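Both steps can be sketched in a few lines (reusing the spline_basis helper above; the plug-in covariance estimator used to simulate the joint limit of the t-statistics is an assumption in the spirit of the construction in Section 3, not the paper's exact formula):

```python
import numpy as np

def loo_cv_select(x, y, K_grid, basis):
    """K_hat_cv: minimize leave-one-out CV via the hat-matrix shortcut
    CV(K) = n^{-1} sum_i [(y_i - g_hat_K(x_i)) / (1 - h_ii(K))]^2."""
    scores = {}
    for K in K_grid:
        P = basis(x, K)
        H = P @ np.linalg.solve(P.T @ P, P.T)
        resid = y - H @ y
        scores[K] = np.mean((resid / (1 - np.diag(H))) ** 2)
    return min(scores, key=scores.get)

def mc_critical_value(x0, x, eps_hat, K_grid, basis, alpha=0.05, S=1000, seed=0):
    """c_hat_{1-alpha}(x0): (1-alpha) quantile of max_K |Z_K|, where Z has the
    estimated correlation of g_hat_K(x0) across K (plug-in scores a_{K,i} * eps_i)."""
    rng = np.random.default_rng(seed)
    scores = []
    for K in K_grid:
        P = basis(x, K)
        aK = np.linalg.solve(P.T @ P, basis(np.array([x0]), K).ravel()) @ P.T
        scores.append(aK * eps_hat)              # influence of each obs. on g_hat_K(x0)
    A = np.vstack(scores)
    Sigma = A @ A.T                              # estimated covariance across K
    d = np.sqrt(np.diag(Sigma))
    Z = rng.multivariate_normal(np.zeros(len(d)), Sigma / np.outer(d, d), size=S)
    return np.quantile(np.abs(Z).max(axis=1), 1 - alpha)
```

In practice, eps_hat would be residuals from, say, the largest model \overline{K}; this is an implementation choice, not part of the theory.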

Overall, we find that the coverage of the standard CI with \widehat{K}_{\texttt{cv}} is far below 95% over the support, although it has the shortest length. In contrast, the coverage of the robust CIs based on \widehat{K}_{\texttt{cv}} or \widehat{K}_{\texttt{cv+}} with \widehat{c}_{1-\alpha}(x) is close to or above 95% and performs well across the different simulation designs, consistent with the theoretical results in Theorem 3.1. Using the undersmoothed \widehat{K}_{\texttt{cv+}} (more terms than cross-validation selects) works quite well at most points and for highly nonlinear designs with relatively large bias, e.g., Model 3 (g_3(x)) at x = 0.5.777The possibly poor coverage of standard kernel-based CIs for g_3(x) at the single peak (x = 0.5) was also described in Hall and Horowitz (2013, Figure 3). Uniform coverage rates of confidence bands with selected K appear conservative; this is due to the large critical values from the weighted bootstrap, which must be uniform in both K \in \mathcal{K}_n and x \in \mathcal{X}, including boundary points.
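A weighted bootstrap sketch for the uniform critical value \widehat{c}_{1-\alpha} (again reusing spline_basis; the i.i.d. standard exponential multipliers follow the description of \widehat{\beta}_K^e in Section 4, while the plug-in standard errors are an illustrative assumption) is:

```python
import numpy as np

def bootstrap_band_cv(x, y, K_grid, grid_x, basis, B=1000, alpha=0.05, seed=0):
    """(1-alpha) quantile of sup_{K,x} |g_hat_e(K,x) - g_hat(K,x)| / se(K,x)
    over exponential-multiplier draws e_1, ..., e_n."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Ps, fits, ses = {}, {}, {}
    for K in K_grid:
        P = basis(x, K)
        b = np.linalg.solve(P.T @ P, P.T @ y)
        e_hat = y - P @ b
        A = np.linalg.solve(P.T @ P, basis(grid_x, K).T).T @ P.T   # grid x n influence
        Ps[K], fits[K] = P, basis(grid_x, K) @ b
        ses[K] = np.sqrt(((A * e_hat) ** 2).sum(axis=1))           # robust plug-in SEs
    sups = np.empty(B)
    for r in range(B):
        e = rng.exponential(size=n)                                # E[e_i] = Var(e_i) = 1
        stats = []
        for K in K_grid:
            P = Ps[K]
            be = np.linalg.solve(P.T @ (P * e[:, None]), P.T @ (e * y))  # weighted LS
            stats.append(np.abs(basis(grid_x, K) @ be - fits[K]) / ses[K])
        sups[r] = np.max(stats)
    return np.quantile(sups, 1 - alpha)
```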

7 Empirical application

In this section, we illustrate the proposed inference procedures by revisiting Blomquist and Newey (2002). Understanding how tax policy affects individual labor supply has been a central issue in labor economics (see Hausman (1985) and Blundell and MaCurdy (1999), among many others). Blomquist and Newey (2002) estimate the conditional mean of hours of work given individual nonlinear budget sets using nonparametric series estimation. They also estimate the wage elasticity of the expected labor supply and find evidence of possible misspecification of the usual parametric models, such as those estimated by maximum likelihood (MLE).

Specifically, Blomquist and Newey (2002) consider the following model by exploiting an additive structure from the utility maximization with piecewise linear budget sets:

hi\displaystyle h_{i} =g(xi)+εi,E(εi|xi)=0,\displaystyle=g(x_{i})+\varepsilon_{i},\quad E(\varepsilon_{i}|x_{i})=0, (7.1)
g(xi)\displaystyle g(x_{i}) =g1(yJ,wJ)+j=1J1[g2(yj,wj,j)g2(yj+1,wj+1,j)],\displaystyle=g_{1}(y_{J},w_{J})+\sum_{j=1}^{J-1}[g_{2}(y_{j},w_{j},\ell_{j})-g_{2}(y_{j+1},w_{j+1},\ell_{j})], (7.2)

where hih_{i} is the hours worked of the iith individual and xi=(y1,,yJ,w1,,wJ,1,,J,J)x_{i}=(y_{1},\cdots,y_{J},w_{1},\cdots,w_{J},\ell_{1},\cdots,\ell_{J},J) is the budget set, which can be represented by the intercept yjy_{j} (non-labor income), slope wjw_{j} (marginal wage rates) and the end point j\ell_{j} of the jjth segment in a piecewise linear budget with JJ segments. Equation (7.2) for the conditional mean function follows from Theorem 2.1 of Blomquist and Newey (2002), and this additive structure substantially reduces the dimensionality issues. To approximate g(x)g(x), they consider the power series, pk(x)=(yJp1(k)wJq1(k),j=1J1jm(k)(yjp2(k)wjq2(k)yj+1p2(k)wj+1q2(k))),p_{k}(x)=(y_{J}^{p_{1}(k)}w_{J}^{q_{1}(k)},\sum_{j=1}^{J-1}\ell_{j}^{m(k)}(y_{j}^{p_{2}(k)}w_{j}^{q_{2}(k)}-y_{j+1}^{p_{2}(k)}w_{j+1}^{q_{2}(k)})), p2(k)+q2(k)1p_{2}(k)+q_{2}(k)\geq 1.

From the Swedish “Level of Living” survey in 1973, 1980 and 1990, they pool the data from the three waves and use the observations for married or cohabiting men aged 20-60. Changes in the tax system over the three periods generate large variation in the budget sets. The sample size is n = 2321. See Section 5 of Blomquist and Newey (2002) for more detailed descriptions. They estimate the wage elasticity of the expected labor supply

E_w = (\bar{w}/\bar{h})\Big[\frac{\partial g(w,\cdots,w,\bar{y},\cdots,\bar{y})}{\partial w}\Big]\Big|_{w=\bar{w}}, (7.3)

which is the regression derivative of g(x)g(x) evaluated at the mean of the net wage rates w¯\bar{w}, virtual income y¯\bar{y} and level of hours h¯\bar{h}.
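Given estimated series coefficients, (7.3) is the derivative of the fitted conditional mean in the common wage argument, scaled by \bar{w}/\bar{h}; a central-difference sketch (basis_fn is a hypothetical helper that evaluates the series terms at a budget set with constant slope w and virtual income \bar{y}) is:

```python
def wage_elasticity(beta_hat, basis_fn, w_bar, y_bar, h_bar, J, dw=1e-5):
    """Numerical version of (7.3): E_w = (w_bar / h_bar) * dg/dw at w = w_bar."""
    g = lambda w: basis_fn(w, y_bar, J) @ beta_hat   # fitted g(w, ..., w, y_bar, ..., y_bar)
    dg_dw = (g(w_bar + dw) - g(w_bar - dw)) / (2 * dw)
    return (w_bar / h_bar) * dg_dw
```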

Table 2 reproduces Table 1 of Blomquist and Newey (2002). They report estimates \widehat{E}_w and standard errors SE_{\widehat{E}_w} for specifications that add series terms sequentially. For example, the estimates in the second row use the terms in the first row, (1, y_J, w_J), together with the additional terms (\Delta y, \Delta w). Here, \ell^m \Delta y^p w^q denotes approximating the term \sum_j \ell_j^m (y_j^p w_j^q - y_{j+1}^p w_{j+1}^q). Blomquist and Newey (2002) also report a cross-validation criterion, CV, for each specification; series terms are chosen to maximize CV, which minimizes the asymptotic MSE. In addition to their original table, we add the standard 95% CI for each specification, i.e., CI(K) = \widehat{E}_w(K) \pm 1.96 SE_{\widehat{E}_w}(K). From Table 2, it is unclear which of the larger models (K) should be used for inference, and we do not have compelling data-dependent methods for selecting one of the large K for the reported confidence interval. Here we want to construct CIs that are robust to such specification searches.

Figure 2 displays 95% CIs that are uniform over K_m \in \{K_1, K_2, \cdots, K_{11}\}, where K_m corresponds to each specification in Table 2 with an increasing number of series terms, along with the point estimates and the standard 95% confidence intervals.888It is straightforward to construct \widehat{c}_{1-\alpha}(x) using the covariance structure under homoskedastic errors; it only requires the estimated variances for different K \in \mathcal{K}_n, which are already reported in the table of Blomquist and Newey (2002). Based on 100,000 simulation repetitions, we obtain \widehat{c}_{1-\alpha}(x) = 2.503. From Figure 2, we reject a zero wage elasticity of labor supply for almost all models except \overline{K}. Table 2 also reports the robust confidence intervals CI_{\widehat{E}_w}^{\texttt{sup}}(K) = \widehat{E}_w(K) \pm \widehat{c}_{1-\alpha}(x) SE_{\widehat{E}_w}(K) with possibly data-dependent \widehat{K}, justified by Theorem 3.1 (eq. (3.6)). Note that cross-validation chooses \widehat{K}_{\texttt{cv}} = K_5; the standard CI with \widehat{K}_{\texttt{cv}} is [0.0247, 0.0839] and the robust CI is [0.0165, 0.0921]. Using \widehat{K}_{\texttt{cv+}} = K_6 or \widehat{K}_{\texttt{cv++}} = K_7 widens the standard CI, and the robust CIs are CI_{\widehat{E}_w}^{\texttt{sup}}(\widehat{K}_{\texttt{cv+}}) = [0.0166, 0.1152] and CI_{\widehat{E}_w}^{\texttt{sup}}(\widehat{K}_{\texttt{cv++}}) = [0.0070, 0.1186].
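The computation behind footnote 8 can be sketched as follows. Under homoskedastic errors and nested series spaces, Cov(\widehat{E}_w(K), \widehat{E}_w(K')) = Var(\widehat{E}_w(K \wedge K')) (an assumption consistent with the footnote's claim that only the estimated variances across K are needed), so the correlation matrix over the eleven specifications is built from the reported standard errors alone. The numbers below are illustrative placeholders, not the entries of Table 2.

```python
import numpy as np

rng = np.random.default_rng(0)
se = np.linspace(0.015, 0.060, 11)                # placeholders for SE(K_1) <= ... <= SE(K_11)
# Corr(K_j, K_k) = SE(K_min)^2 / (SE(K_j) * SE(K_k)) under the nested structure
R = np.minimum.outer(se, se) ** 2 / np.outer(se, se)
Z = rng.multivariate_normal(np.zeros(len(se)), R, size=100_000)
c_hat = np.quantile(np.abs(Z).max(axis=1), 0.95)  # plays the role of c_hat_{1-alpha}(x) = 2.503
```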

8 Conclusion

This paper considers nonparametric inference methods given specification searches over different numbers of series terms in the nonparametric series regression model. We provide methods for constructing uniform CIs and confidence bands by replacing the conventional normal critical value with a critical value based on the supremum of the t-statistics. The critical values can be constructed using simple Monte Carlo simulation or weighted bootstrap methods. We then extend the proposed CIs to the partially linear model setup. Finally, we investigate the finite sample properties of the proposed methods and illustrate the uniform CIs in an empirical example from Blomquist and Newey (2002).

While beyond the scope of this paper, there are some potential directions to extend the results established here. First, investigating the coverage property of CIs with data-dependent K^\widehat{K} using bias-corrected methods is of interest. In particular, it would be of interest to analyze the bias-corrected CI and confidence bands using cross-validation methods combined with the recent results established in Cattaneo, Farrell, and Feng (2019). Second, an extension of the current theory for quantile regression (e.g., Belloni, Chernozhukov, Chetverikov, and Fernández-Val (2019)) or the nonparametric IV setup would be desirable. In the NPIV setup, for example, one can consider pointwise CIs (or uniform confidence bands) that are uniform in pairs of (Kn,Jn)𝒦n×𝒥n(K_{n},J_{n})\in\mathcal{K}_{n}\times\mathcal{J}_{n} with an additional dimension of the instrument sieve and the number of instruments J=JnJ=J_{n}. This is a difficult problem, and it would require a distinct theory to address the ill-posed inverse problem as well as two-dimensional choices. We leave these topics for future research.

References

Andrews, D. W. K. (1991a): “Asymptotic Normality of Series Estimators for Nonparametric and Semiparametric Regression Models,” Econometrica, 59, 307-345.

Andrews, D. W. K. (1991b): “Asymptotic Optimality of Generalized C_L, Cross-Validation, and Generalized Cross-Validation in Regression with Heteroskedastic Errors,” Journal of Econometrics, 47, 359-377.

Armstrong, T. B. and M. Kolesár (2018): “A Simple Adjustment for Bandwidth Snooping,” Review of Economic Studies, 85, 732-765.

Belloni, A., V. Chernozhukov, D. Chetverikov, and I. Fernández-Val (2019): “Conditional quantile processes based on series or many regressors,” Journal of Econometrics, 213, 4-29.

Belloni, A., V. Chernozhukov, D. Chetverikov, and K. Kato (2015): “Some New Asymptotic Theory for Least Squares Series: Pointwise and Uniform Results,” Journal of Econometrics, 186, 345-366.

Belloni, A., V. Chernozhukov, and C. Hansen (2014): “Inference on Treatment Effects after Selection among High-Dimensional Controls,” Review of Economic Studies, 81, 608-650.

Blomquist, S. and W. K. Newey (2002): “Nonparametric Estimation with Nonlinear Budget Sets,” Econometrica, 70, 2455-2480.

Blundell, R. and T. E. MaCurdy (1999): “Labor Supply: A Review of Alternative Approaches,” Handbook of Labor Economics, In: O. Ashenfelter, D. Card (Eds.), vol. 3., Elsevier, Chapter 27.

Calonico, S., M. D. Cattaneo, and M. H. Farrell (2018): “On the Effect of Bias Estimation on Coverage Accuracy in Nonparametric Inference,” Journal of the American Statistical Association, 113, 767-779.

Cattaneo, M. D. and M. H. Farrell (2013): “Optimal Convergence Rates, Bahadur Representation, and Asymptotic Normality of Partitioning Estimators,” Journal of Econometrics, 174, 127-143.

Cattaneo, M. D., M. H. Farrell, and Y. Feng (2019): “Large Sample Properties of Partitioning-Based Series Estimators,” Annals of Statistics, forthcoming.

Cattaneo, M. D., M. Jansson, and W. K. Newey (2018a): “Alternative Asymptotics and the Partially Linear Model with Many Regressors,” Econometric Theory, 34, 277-301.

Cattaneo, M. D., M. Jansson, and W. K. Newey (2018b): “Inference in Linear Regression Models with Many Covariates and Heteroscedasticity,” Journal of the American Statistical Association, 113, 1350-1361.

Chao, J. C., N. R. Swanson, J. A. Hausman, W. K. Newey, and T. Woutersen (2012): “Asymptotic Distribution of JIVE in a Heteroskedastic IV Regression with Many Instruments,” Econometric Theory, 28, 42-86.

Chatterjee, S. (2005): “An error bound in the Sudakov-Fernique inequality,” arXiv:math/0510424.

Chen, X. (2007): “Large Sample Sieve Estimation of Semi-nonparametric Models,” Handbook of Econometrics, In: J.J. Heckman, E. Leamer (Eds.), vol. 6B., Elsevier, Chapter 76.

Chen, X. and T. Christensen (2015): “Optimal Uniform Convergence Rates and Asymptotic Normality for Series Estimators Under Weak Dependence and Weak Conditions,” Journal of Econometrics, 188, 447-465.

Chen, X. and T. Christensen (2018): “Optimal Sup-norm Rates and Uniform Inference on Nonlinear Functionals of Nonparametric IV Regression,” Quantitative Economics, 9(1), 39-85.

Chen, X. and Z. Liao (2014): “Sieve M inference on irregular parameters,” Journal of Econometrics, 182, 70-86.

Chen, X., Z. Liao, and Y. Sun (2014): “Sieve inference on possibly misspecified semi-nonparametric time series models,” Journal of Econometrics, 178, 639-658.

Chen, X. and X. Shen (1998): “Sieve extremum estimates for weakly dependent data,” Econometrica, 66 (2), 289-314.

Chernozhukov, V., D. Chetverikov, and K. Kato (2014a): “Gaussian approximation of suprema of empirical processes,” The Annals of Statistics, 42(4), 1564-1597.

Chernozhukov, V., D. Chetverikov, and K. Kato (2014b): “Anti-Concentration and Honest, Adaptive Confidence Bands,” The Annals of Statistics, 42(5), 1787-1818.

Chernozhukov, V., D. Chetverikov, and K. Kato (2016): “Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings,” Stochastic Processes and their Applications, 126(12), 3632-3651.

Donald, S. G. and W. K. Newey (1994): “Series Estimation of Semilinear Models,” Journal of Multivariate Analysis, 50, 30-40.

Eastwood, B. J. and A. R. Gallant (1991): “Adaptive Rules for Seminonparametric Estimators That Achieve Asymptotic Normality,” Econometric Theory, 7, 307-340.

Giné, E. and R. Nickl (2010): “Confidence bands in density estimation,” The Annals of Statistics, 38, 1122-1170.

Giné, E. and R. Nickl (2015): Mathematical Foundations of Infinite-Dimensional Statistical Models, Cambridge University Press.

Hall, P. and J. Horowitz (2013): “A Simple Bootstrap Method for Constructing Nonparametric Confidence Bands for Functions,” The Annals of Statistics, 41, 1892-1921.

Hansen, B. E. (2015): “The Integrated Mean Squared Error of Series Regression and a Rosenthal Hilbert-Space Inequality,” Econometric Theory, 31, 337-361.

Hansen, P. R. (2005): “A Test for Superior Predictive Ability,” Journal of Business and Economic Statistics, 23, 365-380.

Härdle, W. and O. Linton (1994): “Applied Nonparametric Methods,” Handbook of Econometrics, In: R. F. Engle, D. F. McFadden (Eds.), vol. 4., Elsevier, Chapter 38.

Hausman, J. A. (1985): “The Econometrics of Nonlinear Budget Sets,” Econometrica, 53, 1255-1282.

Heckman, J. J., L. J. Lochner, and P. E. Todd (2006): “Earnings Functions, Rates of Return and Treatment Effects: The Mincer Equation and Beyond,” Handbook of the Economics of Education, In: E. A. Hanushek, and F. Welch (Eds.), Vol. 1, Elsevier, Chapter 7.

Horowitz, J. L. (2014): “Adaptive Nonparametric Instrumental Variables Estimation: Empirical Choice of the Regularization Parameter,” Journal of Econometrics, 180, 158-173.

Horowitz, J. L. and S. Lee (2012): “Uniform Confidence Bands for Functions Estimated Nonparametrically with Instrumental Variables,” Journal of Econometrics, 168, 175-188.

Huang, J. Z. (2003): “Local Asymptotics for Polynomial Spline Regression,” The Annals of Statistics, 31, 1600-1635.

Kozbur, D. (2018): “Inference in Additively Separable Models With a High-Dimensional Set of Conditioning Variables,” Working Paper, arXiv:1503.05436.

Leamer, E. E. (1983): “Let’s Take the Con Out of Econometrics,” The American Economic Review, 73, 31-43.

Lepski, O. V. (1990): “On a problem of adaptive estimation in Gaussian white noise,” Theory of Probability and its Applications, 35, 454-466.

Li, K. C. (1987): “Asymptotic Optimality for CpC_{p}, CLC_{L}, Cross-Validation and Generalized Cross-Validation: Discrete Index Set,” The Annals of Statistics, 15, 958-975.

Li, Q. and J. S. Racine (2007): Nonparametric Econometrics: Theory and Practice, Princeton University Press.

Linton, O. (1995): “Second order approximation in the partially linear regression model,” Econometrica, 63(5), 1079-1112.

Newey, W. K. (1994a): “Series Estimation of Regression Functionals,” Econometric Theory, 10, 1-28.

Newey, W. K. (1994b): “The Asymptotic Variance of Semiparametric Estimators,” Econometrica, 62, 1349-1382.

Newey, W. K. (1997): “Convergence Rates and Asymptotic Normality for Series Estimators,” Journal of Econometrics, 79, 147-168.

Newey, W. K. (2013): “Nonparametric Instrumental Variables Estimation,” American Economic Review: Papers & Proceedings, 103, 550-556.

Newey, W. K. and J. L. Powell (2003): “Instrumental Variable Estimation of Nonparametric Models,” Econometrica, 71, 1565-1578.

Newey, W. K., J. L. Powell, and F. Vella (1999): “Nonparametric Estimation of Triangular Simultaneous Equations Models,” Econometrica, 67, 565-603.

Robinson, P. M. (1988): “Root-N-Consistent Semiparametric Regression,” Econometrica, 56(4), 931-954.

Romano, J. P. and M. Wolf (2005): “Stepwise Multiple Testing as Formalized Data Snooping,” Econometrica, 73, 1237-1282.

Schennach, S. M. (2015): “A bias bound approach to nonparametric inference,” CEMMAP working paper CWP71/15.

van der Vaart, A. W. and J. A. Wellner (1996): Weak Convergence and Empirical Processes, Springer.

White, H. (2000): “A Reality Check for Data Snooping,” Econometrica, 68, 1097-1126.

Zhou, S., X. Shen, and D.A. Wolfe (1998): “Local Asymptotics for Regression Splines and Confidence Regions,” The Annals of Statistics, 26, 1760-1782.

Appendix A Proofs

A.1 Preliminaries and Useful Lemmas

We define additional notation for the empirical process theory used in the proof of Theorem 4.1. Given a measurable space (S, \mathcal{S}), let \mathcal{F} be a class of measurable functions f: S \rightarrow \mathbb{R}. For any probability measure Q on (S, \mathcal{S}), we define the covering number N(\epsilon, \mathcal{F}, L_2(Q)) as the minimal number of L_2(Q)-balls of radius \epsilon needed to cover \mathcal{F}, where the L_2(Q) norm is ||f||_{Q,2} = (\int |f|^2 dQ)^{1/2}. The uniform entropy numbers relative to the L_2(Q) norms are defined as \sup_Q \log N(\epsilon ||F||_{Q,2}, \mathcal{F}, L_2(Q)), where the supremum is over all discrete probability measures and F is an envelope function. We say that \mathcal{F} is of VC type with envelope F if there are constants A, v > 0 such that \sup_Q N(\epsilon ||F||_{Q,2}, \mathcal{F}, L_2(Q)) \leq (A/\epsilon)^v for all 0 < \epsilon \leq 1.

Let the data z_i = (\varepsilon_i, x_i) be i.i.d. random vectors defined on the probability space (\mathcal{Z} = \mathcal{E} \times \mathcal{X}, \mathcal{A}, P) with common probability distribution P \equiv P_{\varepsilon,x}. We think of (\varepsilon_1, x_1), \cdots, (\varepsilon_n, x_n) as the coordinates of the infinite product probability space. We avoid discussing nonmeasurability issues and outer expectations (for the related issues, see van der Vaart and Wellner (1996)). Throughout the proofs, we denote by c, C > 0 universal constants that do not depend on n.

For any sequence {K=Kn:n1}n=1𝒦n\{K=K_{n}:n\geq 1\}\in\prod_{n=1}^{\infty}\mathcal{K}_{n} under Assumption 2.1, we first define the orthonormalized vector of basis functions

P~(K,x)QK1/2P(K,x)=E[PKiPKi]1/2P(K,x),P~Ki=P~(K,xi),P~K=[P~K1,,P~Kn].\tilde{P}(K,x)\equiv Q_{K}^{-1/2}P(K,x)=E[P_{Ki}P_{Ki}^{\prime}]^{-1/2}P(K,x),\ \tilde{P}_{Ki}=\tilde{P}(K,x_{i}),\ \tilde{P}^{K}=[\tilde{P}_{K1},\cdots,\tilde{P}_{Kn}]^{\prime}.

We observe that

g^n(K,x)=P~(K,x)(P~KP~K)1P~KY,Vn(K,x)=P~(K,x)Ω~KP~(K,x),Ω~K=E(P~KiP~Kiεi2).\widehat{g}_{n}(K,x)=\tilde{P}(K,x)^{\prime}(\tilde{P}^{K^{\prime}}\tilde{P}^{K})^{-1}\tilde{P}^{K^{\prime}}Y,\ \ V_{n}(K,x)=\tilde{P}(K,x)^{\prime}\tilde{\Omega}_{K}\tilde{P}(K,x),\ \ \tilde{\Omega}_{K}=E(\tilde{P}_{Ki}\tilde{P}_{Ki}^{\prime}\varepsilon_{i}^{2}).

Without loss of generality, we may impose the normalization Q_{\overline{K}} = I_{\overline{K}}, or Q_K = E(P_{Ki} P_{Ki}') = I_K uniformly over K \in \mathcal{K}_n, since \widehat{g}_n(K,x) is invariant to nonsingular linear transformations of P(K,x). However, we shall treat Q_K as unknown and deal with the non-orthonormalized series terms. Next, with an abuse of notation, we redefine the pseudo-true value \beta_K using the orthonormalized series terms \tilde{P}_{Ki}. That is, y_i = \tilde{P}_{Ki}' \beta_K + \varepsilon_{Ki}, E[\tilde{P}_{Ki} \varepsilon_{Ki}] = 0, where \varepsilon_{Ki} = r_{Ki} + \varepsilon_i, r_n(K,x) = g_0(x) - \tilde{P}(K,x)' \beta_K, r_{Ki} = r_n(K, x_i), and r_K \equiv (r_{K1}, \cdots, r_{Kn})'. We also define \widehat{Q}_K \equiv \frac{1}{n} \tilde{P}^{K\prime} \tilde{P}^K, \underline{\sigma}^2 \equiv \inf_x E[\varepsilon_i^2 | x_i = x], and \bar{\sigma}^2 \equiv \sup_x E[\varepsilon_i^2 | x_i = x].
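In computations, the orthonormalization above is implemented with the sample second-moment matrix in place of the unknown Q_K; a minimal sketch using the symmetric inverse square root is:

```python
import numpy as np

def orthonormalize(P):
    """P_tilde = Q^{-1/2} P with Q replaced by its sample analogue P'P/n."""
    Q_hat = P.T @ P / P.shape[0]
    lam, U = np.linalg.eigh(Q_hat)                 # Q_hat = U diag(lam) U'
    return P @ (U @ np.diag(lam ** -0.5) @ U.T)    # rows are P_tilde(K, x_i)'
```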

We first provide useful lemmas that will be used in the proofs of Theorems 3.1 and 4.1. Versions of Lemmas 1 and 2 with \mathcal{K}_n = \{K\} are available in the literature, such as Belloni et al. (2015) and Chen and Christensen (2015), among many others. Maximal inequalities are used in the proofs of Lemmas 1 and 2 to bound the remainder terms in the linearization of the t-statistics. Also note that different rate conditions on K, such as those in Newey (1997), can be used here but lead to different bounds. We provide the proofs of Lemmas 1 and 2 in the Online Supplementary Material (Section B).

Lemma 1.

Suppose that Assumptions 2.1, 3.1, and 3.2 hold. Then ||\widehat{Q}_K - I_K|| = O_p(\sqrt{\lambda_K^2 \zeta_K^2 \log K / n}) for any K \in \mathcal{K}_n, and the following holds:

maxK𝒦n|R1(K,x)|=Op(maxK𝒦nλK2ζK2logKlogpn(1+KcKK)),\displaystyle\max_{K\in\mathcal{K}_{n}}|R_{1}(K,x)|=O_{p}(\max_{K\in\mathcal{K}_{n}}\sqrt{\frac{\lambda_{K}^{2}\zeta_{K}^{2}\log K\log p}{n}}(1+\ell_{K}c_{K}\sqrt{K})), (A.1)
maxK𝒦n|R2(K,x)|=Op(maxK𝒦n(KcK)logp),\displaystyle\max_{K\in\mathcal{K}_{n}}|R_{2}(K,x)|=O_{p}(\max_{K\in\mathcal{K}_{n}}(\ell_{K}c_{K})\sqrt{\log p}), (A.2)

where R1(K,x)1nVn(K,x)P~(K,x)(Q^K1IK)P~K(ε+rK),R2(K,x)1nVn(K,x)P~(K,x)P~KrKR_{1}(K,x)\equiv\sqrt{\frac{1}{nV_{n}(K,x)}}\tilde{P}(K,x)^{\prime}(\widehat{Q}_{K}^{-1}-I_{K})\tilde{P}^{K^{\prime}}(\varepsilon+r_{K}),R_{2}(K,x)\equiv\sqrt{\frac{1}{nV_{n}(K,x)}}\tilde{P}(K,x)^{\prime}\tilde{P}^{K^{\prime}}r_{K}.

Lemma 2.

Suppose that Assumptions 2.1, 3.1, and 4.1 hold. Then the following holds:

supK𝒦n,x𝒳|R1(K,x)|=Op(maxK𝒦nλK2ζK2logKlognn(n1/q+KcKK)),\displaystyle\sup_{K\in\mathcal{K}_{n},x\in\mathcal{X}}|R_{1}(K,x)|=O_{p}(\max_{K\in\mathcal{K}_{n}}\sqrt{\frac{\lambda_{K}^{2}\zeta_{K}^{2}\log K\log n}{n}}(n^{1/q}+\ell_{K}c_{K}\sqrt{K})), (A.3)
supK𝒦n,x𝒳|R2(K,x)|=Op(maxK𝒦n(KcK)logn),\displaystyle\sup_{K\in\mathcal{K}_{n},x\in\mathcal{X}}|R_{2}(K,x)|=O_{p}(\max_{K\in\mathcal{K}_{n}}(\ell_{K}c_{K})\sqrt{\log n}), (A.4)

where R1(K,x),R2(K,x)R_{1}(K,x),R_{2}(K,x) are defined in Lemma 1.

A.2 Proofs of the Main Results

A.2.1 Proof of Theorem 3.1

Proof.

For any K𝒦nK\in\mathcal{K}_{n}, we first consider the decomposition of the t-statistic in (2.4) with the known variance Vn(K,x)V_{n}(K,x),

Tn(K,x)\displaystyle T_{n}(K,x) =nVn(K,x)P~(K,x)(β^KβK)nVn(K,x)rn(K,x)\displaystyle=\sqrt{\frac{n}{V_{n}(K,x)}}\tilde{P}(K,x)^{\prime}(\widehat{\beta}_{K}-\beta_{K})-\sqrt{\frac{n}{V_{n}(K,x)}}r_{n}(K,x)
=tn(K,x)+R1(K,x)+R2(K,x)+νn(K,x)\displaystyle=t_{n}(K,x)+R_{1}(K,x)+R_{2}(K,x)+\nu_{n}(K,x)

where tn(K,x)=n1/2i=1nP~(K,x)P~KiεiVn(K,x)1/2t_{n}(K,x)=n^{-1/2}\sum_{i=1}^{n}\frac{\tilde{P}(K,x)^{\prime}\tilde{P}_{Ki}\varepsilon_{i}}{V_{n}(K,x)^{1/2}}, R1(K,x),R2(K,x)R_{1}(K,x),R_{2}(K,x) are defined in Lemma 1, and νn(K,x)=nVn(K,x)1/2rn(K,x)\nu_{n}(K,x)=-\sqrt{n}V_{n}(K,x)^{-1/2}r_{n}(K,x). Define

tn(tn(K1,x),,tn(Kp,x))=1ni=1nξit_{n}\equiv(t_{n}(K_{1},x),\cdots,t_{n}(K_{p},x))^{\prime}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\xi_{i}

where ξi=(ξi1,ξi2,,ξip)p\xi_{i}=(\xi_{i1},\xi_{i2},\cdots,\xi_{ip})^{\prime}\in\mathbb{R}^{p} with ξij=P~(Kj,x)P~KjiεiVn(Kj,x)1/2\xi_{ij}=\frac{\tilde{P}(K_{j},x)^{\prime}\tilde{P}_{K_{j}i}\varepsilon_{i}}{V_{n}(K_{j},x)^{1/2}} and p=|𝒦n|p=|\mathcal{K}_{n}|. Note that E[ξij]=0E[\xi_{ij}]=0 and E[|ξij|3]E[|P~(Kj,x)P~Kji/Vn(Kj,x)1/2|3]supxE[|εi|3|xi=x]maxKζKE[|\xi_{ij}|^{3}]\lesssim E[|\tilde{P}(K_{j},x)^{\prime}\tilde{P}_{K_{j}i}/V_{n}(K_{j},x)^{1/2}|^{3}]\sup_{x}E[|\varepsilon_{i}|^{3}|x_{i}=x]\lesssim\max_{K}\zeta_{K} for all 1in,1jp1\leq i\leq n,1\leq j\leq p. By Lemma A.2 in the Online Supplementary Material, for any δ>0\delta>0, there exists a random variable max1jpi=1nZij\max_{1\leq j\leq p}\sum_{i=1}^{n}Z_{ij} with independent random vectors {Zi}i=1np\{Z_{i}\}_{i=1}^{n}\in\mathbb{R}^{p}, ZiN(0,1nE[ξiξi]),1inZ_{i}\sim N(0,\frac{1}{n}E[\xi_{i}\xi_{i}^{\prime}]),1\leq i\leq n, such that

P(|max1jp|tn(Kj,x)|max1jpi=1n|Zij||>16δ)log(pn)δ2D1+log2(pn)δ3n3/2(D2+D3)+lognnP(|\max_{1\leq j\leq p}|t_{n}(K_{j},x)|-\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}||>16\delta)\lesssim\frac{\log(p\vee n)}{\delta^{2}}D_{1}+\frac{\log^{2}(p\vee n)}{\delta^{3}n^{3/2}}(D_{2}+D_{3})+\frac{\log n}{n}

where D1=E[max1j,lp|1ni=1n(ξijξilE[ξijξil])|],D2=E[max1jpi=1n|ξij|3]D_{1}=E\big{[}\max_{1\leq j,l\leq p}|\frac{1}{n}\sum_{i=1}^{n}(\xi_{ij}\xi_{il}-E[\xi_{ij}\xi_{il}])|\big{]},D_{2}=E\big{[}\max_{1\leq j\leq p}\sum_{i=1}^{n}|\xi_{ij}|^{3}\big{]}, and D3=i=1nE[max1jp|ξij|31(max1jp|ξij|>δn/log(pn))].D_{3}=\sum_{i=1}^{n}E\big{[}\max_{1\leq j\leq p}|\xi_{ij}|^{3}1\big{(}\max_{1\leq j\leq p}|\xi_{ij}|>\delta\sqrt{n}/\log(p\vee n)\big{)}\big{]}.

First consider the case (a) in Assumption 3.2(ii). Combining bounds for D1,D2,D3D_{1},D_{2},D_{3} in Lemma B.1 in the Online Supplementary Material gives, for any δ>0\delta>0,

P(|max1jp|tn(Kj,x)|max1jpi=1n|Zij||>16δ)\displaystyle\hskip-14.22636ptP(|\max_{1\leq j\leq p}|t_{n}(K_{j},x)|-\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}||>16\delta)
log(pn)δ2[((maxKζK)2logpn)1/2+(maxKζK)2logpn12/q]\displaystyle\lesssim\frac{\log(p\vee n)}{\delta^{2}}\big{[}(\frac{(\max_{K}\zeta_{K})^{2}\log p}{n})^{1/2}+\frac{(\max_{K}\zeta_{K})^{2}\log p}{n^{1-2/q}}\big{]}
+log2(pn)δ3[((maxKζK)2n)1/2+(maxKζK)3logpn3/23/q]+logq1(pn)δq(maxKζK)qnq/21+lognn.\displaystyle+\frac{\log^{2}(p\vee n)}{\delta^{3}}\big{[}(\frac{(\max_{K}\zeta_{K})^{2}}{n})^{1/2}+\frac{(\max_{K}\zeta_{K})^{3}\log p}{n^{3/2-3/q}}\big{]}+\frac{\log^{q-1}(p\vee n)}{\delta^{q}}\frac{(\max_{K}\zeta_{K})^{q}}{n^{q/2-1}}+\frac{\log n}{n}.

For γ>0\gamma>0, by setting

δ\displaystyle\delta =\displaystyle= γ1/3((maxKζK)2log4(pn)n)1/6+γ1/2((maxKζK)2log(pn)logpn12/q)1/2\displaystyle\gamma^{-1/3}\big{(}\frac{(\max_{K}\zeta_{K})^{2}\log^{4}(p\vee n)}{n}\big{)}^{1/6}+\gamma^{-1/2}\big{(}\frac{(\max_{K}\zeta_{K})^{2}\log(p\vee n)\log p}{n^{1-2/q}}\big{)}^{1/2}
+γ1/3((maxKζK)3log2(pn)logpn3/23/q)1/3,\displaystyle+\gamma^{-1/3}\big{(}\frac{(\max_{K}\zeta_{K})^{3}\log^{2}(p\vee n)\log p}{n^{3/2-3/q}}\big{)}^{1/3},

we have

P(|max1jp|tn(Kj,x)|max1jpi=1n|Zij||>C1δ)C2(γ+lognn)P(|\max_{1\leq j\leq p}|t_{n}(K_{j},x)|-\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}||>C_{1}\delta)\leq C_{2}(\gamma+\frac{\log n}{n})

where C_1, C_2 are positive constants that depend only on q. If we take \gamma = \gamma_n \rightarrow 0 sufficiently slowly, e.g., \gamma = \log(p \vee n)^{-1/2}, then the above implies that there exists \max_{1\leq j\leq p}\sum_{i=1}^{n} Z_{ij} such that

|max1jp|tn(Kj,x)|max1jpi=1n|Zij||=op(((maxKζK)2log5(pn)n)1/6+(maxKζK)log3/4(pn)log1/2pn1/21/q).|\max_{1\leq j\leq p}|t_{n}(K_{j},x)|-\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}||=o_{p}(\big{(}\frac{(\max_{K}\zeta_{K})^{2}\log^{5}(p\vee n)}{n}\big{)}^{1/6}+\frac{(\max_{K}\zeta_{K})\log^{3/4}(p\vee n)\log^{1/2}p}{n^{1/2-1/q}}).

Next, consider the case (b) in Assumption 3.2(ii). For any δ>0\delta>0,

P(|max1jp|tn(Kj,x)|max1jpi=1n|Zij||>16δ)\displaystyle\hskip-14.22636ptP(|\max_{1\leq j\leq p}|t_{n}(K_{j},x)|-\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}||>16\delta)
log(pn)δ2[((maxKζK)2logpn)1/2+(maxKζK)2log2(pn)logpn]\displaystyle\lesssim\frac{\log(p\vee n)}{\delta^{2}}\big{[}(\frac{(\max_{K}\zeta_{K})^{2}\log p}{n})^{1/2}+\frac{(\max_{K}\zeta_{K})^{2}\log^{2}(pn)\log p}{n}\big{]}
+log2(pn)δ3[((maxKζK)2n)1/2+(maxKζK)3log3(pn)logpn3/2]\displaystyle+\frac{\log^{2}(p\vee n)}{\delta^{3}}\big{[}(\frac{(\max_{K}\zeta_{K})^{2}}{n})^{1/2}+\frac{(\max_{K}\zeta_{K})^{3}\log^{3}(pn)\log p}{n^{3/2}}\big{]}
+log2(pn)δ3[1n1/2(δ3n3/2log3(pn)+(maxKζK)3log3p)exp(δnCmaxKζKlogplog(pn))]+lognn\displaystyle+\frac{\log^{2}(p\vee n)}{\delta^{3}}\big{[}\frac{1}{n^{1/2}}(\frac{\delta^{3}n^{3/2}}{\log^{3}(p\vee n)}+(\max_{K}\zeta_{K})^{3}\log^{3}p)\exp(-\frac{\delta\sqrt{n}}{C\max_{K}\zeta_{K}\log p\log(p\vee n)})\big{]}+\frac{\log n}{n}

by Lemma B.1 in the Online Supplementary Material. Similarly, by setting

\delta = \max\{\gamma^{-1/3}((\max_K \zeta_K)^2 \log^4(p \vee n)/n)^{1/6},\ 2C((\max_K \zeta_K)^2 \log^4(p \vee n) \log^2 p / n)^{1/2}\}

we have, for γ>0\gamma>0,

P(|max1jp|tn(Kj,x)|max1jpi=1n|Zij||>C1δ)C2(γ+lognn)P(|\max_{1\leq j\leq p}|t_{n}(K_{j},x)|-\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}||>C_{1}\delta)\leq C_{2}(\gamma+\frac{\log n}{n})

where C1,C2C_{1},C_{2} are universal constants which do not depend on nn. Here we use δnCmaxKζKlogplog(pn)2log(pn)\frac{\delta\sqrt{n}}{C\max_{K}\zeta_{K}\log p\log(p\vee n)}\geq 2\log(p\vee n). By taking γ=log(pn)1/2\gamma=\log(p\vee n)^{-1/2}, there exists max1jpi=1nZij\max_{1\leq j\leq p}\sum_{i=1}^{n}Z_{ij} such that

|\max_{1\leq j\leq p}|t_n(K_j,x)| - \max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}|| = o_p\Big(\big(\frac{(\max_K \zeta_K)^2 \log^5(p\vee n)}{n}\big)^{1/6} + \big(\frac{(\max_K \zeta_K)^2 \log^4(p\vee n) \log^2 p}{n}\big)^{1/2}\Big).

In either case (a) or (b), the above coupling inequality shows that there exists a sequence of random variables max1jpi=1n|Zij|\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}| such that |maxK𝒦n|tn(K,x)|max1jpi=1n|Zij||=op(an)\big{|}\max_{K\in\mathcal{K}_{n}}|t_{n}(K,x)|-\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}|\big{|}=o_{p}(a_{n}), an=1/(logp)1/2a_{n}=1/(\log p)^{1/2} under the rate conditions imposed in Theorem 3.1. Furthermore,

|max1jp|Tn(Kj,x)|max1jp|tn(Kj,x)||\displaystyle\big{|}\max_{1\leq j\leq p}|T_{n}(K_{j},x)|-\max_{1\leq j\leq p}|t_{n}(K_{j},x)|\big{|} max1jp|Tn(Kj,x)tn(Kj,x)|max1jp|R1(Kj,x)|\displaystyle\leq\max_{1\leq j\leq p}|T_{n}(K_{j},x)-t_{n}(K_{j},x)|\leq\max_{1\leq j\leq p}|R_{1}(K_{j},x)|
+max1jp|R2(Kj,x)|+max1jp|νn(Kj,x)|=op(an)\displaystyle\quad+\max_{1\leq j\leq p}|R_{2}(K_{j},x)|+\max_{1\leq j\leq p}|\nu_{n}(K_{j},x)|=o_{p}(a_{n}) (A.5)

with an=1/(logp)1/2a_{n}=1/(\log p)^{1/2} by Lemma 1 and the assumption imposed in Theorem 3.1. We also have

|\displaystyle\big{|} max1jp|Tn(Kj,x)|max1jp|T^n(Kj,x)||max1jp|Tn(Kj,x)T^n(Kj,x)|\displaystyle\max_{1\leq j\leq p}|T_{n}(K_{j},x)|-\max_{1\leq j\leq p}|\widehat{T}_{n}(K_{j},x)|\big{|}\leq\max_{1\leq j\leq p}|T_{n}(K_{j},x)-\widehat{T}_{n}(K_{j},x)|
\displaystyle\leq max1jp|Tn(Kj,x)|max1jp|1Vn(Kj,x)1/2V^n(Kj,x)1/2|=op(an)\displaystyle\max_{1\leq j\leq p}|T_{n}(K_{j},x)|\max_{1\leq j\leq p}|1-\frac{V_{n}(K_{j},x)^{1/2}}{\widehat{V}_{n}(K_{j},x)^{1/2}}|=o_{p}(a_{n}) (A.6)

where we use Lemma 1, the bound \max_{1\leq j\leq p}|t_n(K_j,x)| \lesssim_P \sqrt{\log p} from the maximal inequality (e.g., Lemma A.4 in the Online Supplementary Material), and Assumption 3.2(iii) with a_n = 1/(\log p)^{1/2}. Combining (A.5) and (A.6) gives |\max_{1\leq j\leq p}|\widehat{T}_n(K_j,x)| - \max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}|| = o_p(a_n) with a_n = 1/(\log p)^{1/2}. Then, there exists a sequence of positive constants \delta_n such that \delta_n = o(1) and P(|\max_{1\leq j\leq p}|\widehat{T}_n(K_j,x)| - \max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}|| > a_n \delta_n) = o(1).

For any uu\in\mathbb{R}, we have

P(max1jp|T^n(Kj,x)|u)\displaystyle P(\max_{1\leq j\leq p}|\widehat{T}_{n}(K_{j},x)|\leq u)
P({max1jp|T^n(Kj,x)|u}{|max1jp|T^n(Kj,x)|max1jpi=1n|Zij||anδn})\displaystyle\leq P(\{\max_{1\leq j\leq p}|\widehat{T}_{n}(K_{j},x)|\leq u\}\cap\{\big{|}\max_{1\leq j\leq p}|\widehat{T}_{n}(K_{j},x)|-\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}|\big{|}\leq a_{n}\delta_{n}\})
+P(|max1jp|T^n(Kj,x)|max1jpi=1n|Zij||>anδn)\displaystyle\quad+P(\big{|}\max_{1\leq j\leq p}|\widehat{T}_{n}(K_{j},x)|-\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}|\big{|}>a_{n}\delta_{n})
P(max1jpi=1n|Zij|u+anδn)+o(1)P(max1jpi=1n|Zij|u)+anδnE[max1jpi=1n|Zij|]+o(1)\displaystyle\leq P(\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}|\leq u+a_{n}\delta_{n})+o(1)\leq P(\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}|\leq u)+a_{n}\delta_{n}E[\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}|]+o(1)

where the last inequality uses the anti-concentration inequality (Lemma A.8 in the Online Supplementary Material). The reverse inequality holds by a similar argument, and thus

\sup_{u\in\mathbb{R}} \big| P(\max_{1\leq j\leq p}|\widehat{T}_n(K,x)| \leq u) - P(\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}| \leq u) \big| \leq a_n \delta_n E[\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}|] + o(1) = o(1)

where we use E[\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}|] \lesssim \sqrt{\log p} by the Gaussian maximal inequality and a_n = (\log p)^{-1/2}. By the same arguments as above, since |\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}| - \max_{1\leq j\leq p}\sum_{i=1}^{n}|\widehat{Z}_{ij}|| = o_p(a_n) by a Sudakov-Fernique type bound (e.g., Chatterjee (2005)) and Assumption 3.2(iii), we have \sup_{u\in\mathbb{R}} \big| P(\max_{1\leq j\leq p}\sum_{i=1}^{n}|\widehat{Z}_{ij}| \leq u) - P(\max_{1\leq j\leq p}\sum_{i=1}^{n}|Z_{ij}| \leq u) \big| = o(1). Therefore, the following holds by the triangle inequality:

supu|P(max1jp|T^n(K,x)|u)P(max1jpi=1n|Z^ij|u)|=o(1),\sup_{u\in\mathbb{R}}\big{|}P(\max_{1\leq j\leq p}|\widehat{T}_{n}(K,x)|\leq u)-P(\max_{1\leq j\leq p}\sum_{i=1}^{n}|\widehat{Z}_{ij}|\leq u)\big{|}=o(1),

and then we conclude

P(maxK𝒦n|T^n(K,x)|c^1α(x))=1α+o(1),P(\max_{K\in\mathcal{K}_{n}}|\widehat{T}_{n}(K,x)|\leq\widehat{c}_{1-\alpha}(x))=1-\alpha+o(1),

with a critical value c^1α(x)\widehat{c}_{1-\alpha}(x) given in (3.3), and the coverage result (3.5) follows.

Finally, we will show (3.6). For K^𝒦n\widehat{K}\in\mathcal{K}_{n}, observe that

|T^n(K^,x)|(|tn(K^,x)|+|R1(K^,x)|+|R2(K^,x)|+|νn(K^,x)|)|Vn(K^,x)1/2V^n(K^,x)1/2||\widehat{T}_{n}(\widehat{K},x)|\leq(|t_{n}(\widehat{K},x)|+|R_{1}(\widehat{K},x)|+|R_{2}(\widehat{K},x)|+|\nu_{n}(\widehat{K},x)|)|\frac{V_{n}(\widehat{K},x)^{1/2}}{\widehat{V}_{n}(\widehat{K},x)^{1/2}}| (A.7)

by the triangle inequality. Then,

P(g0(x)[g^n(K^,x)±c^1α(x)V^n(K^,x)/n])\displaystyle P(g_{0}(x)\in[\widehat{g}_{n}(\widehat{K},x)\pm\widehat{c}_{1-\alpha}(x)\sqrt{\widehat{V}_{n}(\widehat{K},x)/n}])
P(|tn(K^,x)|+|R1(K^,x)|+|R2(K^,x)|+|νn(K^,x)|c^1α(x)|V^n(K^,x)1/2Vn(K^,x)1/2|)\displaystyle\geq P(|t_{n}(\widehat{K},x)|+|R_{1}(\widehat{K},x)|+|R_{2}(\widehat{K},x)|+|\nu_{n}(\widehat{K},x)|\leq\widehat{c}_{1-\alpha}(x)|\frac{\widehat{V}_{n}(\widehat{K},x)^{1/2}}{V_{n}(\widehat{K},x)^{1/2}}|)
P(|tn(K^,x)|+|R1(K^,x)|+|R2(K^,x)|+|νn(K^,x)|c^1α(x)(1an2δ1n))ϵ1n\displaystyle\geq P(|t_{n}(\widehat{K},x)|+|R_{1}(\widehat{K},x)|+|R_{2}(\widehat{K},x)|+|\nu_{n}(\widehat{K},x)|\leq\widehat{c}_{1-\alpha}(x)(1-a_{n}^{2}\delta_{1n}))-\epsilon_{1n} (A.8)
P(|tn(K^,x)|c^1α(x)(1an2δ1n)anδ2nanδ3n)ϵ1nϵ2nϵ3n\displaystyle\geq P(|t_{n}(\widehat{K},x)|\leq\widehat{c}_{1-\alpha}(x)(1-a_{n}^{2}\delta_{1n})-a_{n}\delta_{2n}-a_{n}\delta_{3n})-\epsilon_{1n}-\epsilon_{2n}-\epsilon_{3n} (A.9)
P(maxK𝒦n|tn(K,x)|c^1α(x)(1an2δ1n)anδ2nanδ3n)ϵ1nϵ2nϵ3n\displaystyle\geq P(\max_{K\in\mathcal{K}_{n}}|t_{n}(K,x)|\leq\widehat{c}_{1-\alpha}(x)(1-a_{n}^{2}\delta_{1n})-a_{n}\delta_{2n}-a_{n}\delta_{3n})-\epsilon_{1n}-\epsilon_{2n}-\epsilon_{3n} (A.10)
P(max1jpi=1n|Z^ij|c^1α(x)δ~n)ϵ~n\displaystyle\geq P(\max_{1\leq j\leq p}\sum_{i=1}^{n}|\widehat{Z}_{ij}|\leq\widehat{c}_{1-\alpha}(x)-\tilde{\delta}_{n})-\tilde{\epsilon}_{n} (A.11)
1αsupuP(|max1jpi=1n|Z^ij|u|δ~n)ϵ~n1αo(1).\displaystyle\geq 1-\alpha-\sup_{u}P(|\max_{1\leq j\leq p}\sum_{i=1}^{n}|\widehat{Z}_{ij}|-u|\leq\tilde{\delta}_{n})-\tilde{\epsilon}_{n}\ \geq 1-\alpha-o(1). (A.12)

The first inequality follows from (A.7); (A.8) holds by Assumption 3.2(iii) with sequences of positive constants \delta_{1n} = o(1), \epsilon_{1n} = o(1); and (A.9) follows from |R_1(\widehat{K},x)| + |R_2(\widehat{K},x)| = o_p(a_n) by Lemma 1 and the assumption |\frac{\sqrt{n} r_n(\widehat{K},x)}{V_n(\widehat{K},x)^{1/2}}| = o(a_n) with a_n = 1/(\log p)^{1/2}, for some sequences of constants \delta_{2n} = o(1), \epsilon_{2n} = o(1), \delta_{3n} = o(1), \epsilon_{3n} = o(1). (A.10) follows from |t_n(\widehat{K},x)| \leq \max_{K\in\mathcal{K}_n}|t_n(K,x)|, and (A.11) holds by |\max_{K\in\mathcal{K}_n}|t_n(K,x)| - \max_{1\leq j\leq p}\sum_{i=1}^{n}|\widehat{Z}_{ij}|| = o_p(a_n) with some sequences \delta_{4n} = o(1), \epsilon_{4n} = o(1), defining \tilde{\delta}_n = \widehat{c}_{1-\alpha}(x) a_n^2 \delta_{1n} + a_n\delta_{2n} + a_n\delta_{3n} + a_n\delta_{4n} and \tilde{\epsilon}_n = \epsilon_{1n} + \epsilon_{2n} + \epsilon_{3n} + \epsilon_{4n}. Finally, (A.12) holds by Lemma A.8, E[\max_{1\leq j\leq p}\sum_{i=1}^{n}|\widehat{Z}_{ij}|] \lesssim \sqrt{\log p}, and \tilde{\delta}_n \sqrt{\log p} = o(1), since \widehat{c}_{1-\alpha}(x) \lesssim \sqrt{\log p} by Lemma A.15. This completes the proof. ∎

A.2.2 Proof of Theorem 4.1

Proof.

Similar to the proof of Theorem 3.1, we have the following linearization of the t-statistics uniformly in (K,x)𝒦n×𝒳(K,x)\in\mathcal{K}_{n}\times\mathcal{X},

Tn(K,x)=tn(K,x)+νn(K,x)+Rn(K,x),T_{n}(K,x)=t_{n}(K,x)+\nu_{n}(K,x)+R_{n}(K,x),

where tn(K,x)=n1/2i=1nP~(K,x)P~Kiεi/Vn(K,x)1/2t_{n}(K,x)=n^{-1/2}\sum_{i=1}^{n}\tilde{P}(K,x)^{\prime}\tilde{P}_{Ki}\varepsilon_{i}/V_{n}(K,x)^{1/2} and Rn(K,x)=R1(K,x)+R2(K,x)R_{n}(K,x)=R_{1}(K,x)+R_{2}(K,x). Define fn,K,x:(×𝒳)f_{n,K,x}:(\mathcal{E}\times\mathcal{X})\mapsto\mathbb{R} for given n1n\geq 1, K𝒦n,x𝒳K\in\mathcal{K}_{n},x\in\mathcal{X},

fn,K,x(ε,t)=P~(K,x)P~(K,t)εVn(K,x)1/2,(ε,t)×𝒳.f_{n,K,x}(\varepsilon,t)=\frac{\tilde{P}(K,x)^{\prime}\tilde{P}(K,t)\varepsilon}{V_{n}(K,x)^{1/2}},(\varepsilon,t)\in\mathcal{E}\times\mathcal{X}. (A.13)

and consider the class of measurable functions n={fn,K,x:(K,x)𝒦n×𝒳}\mathcal{F}_{n}=\{f_{n,K,x}:(K,x)\in\mathcal{K}_{n}\times\mathcal{X}\}. Then, we consider the following empirical process:

{tn(K,x):(K,x)𝒦n×𝒳}={n1/2i=1nfn,K,x(εi,xi):(K,x)𝒦n×𝒳}\Big{\{}t_{n}(K,x):(K,x)\in\mathcal{K}_{n}\times\mathcal{X}\Big{\}}=\Big{\{}n^{-1/2}\sum_{i=1}^{n}f_{n,K,x}(\varepsilon_{i},x_{i}):(K,x)\in\mathcal{K}_{n}\times\mathcal{X}\Big{\}}

which is indexed by classes of functions n\mathcal{F}_{n}. Define α(K,x)P~(K,x)/Vn(K,x)1/2=P~(K,x)/ΩK1/2P~(K,x)\alpha(K,x)\equiv\tilde{P}(K,x)/V_{n}(K,x)^{1/2}=\tilde{P}(K,x)/||\Omega^{1/2}_{K}\tilde{P}(K,x)||. Note that |fn,K,x(ε,t)|=|α(K,x)P~(K,t)ε|C|ε|maxKζK|f_{n,K,x}(\varepsilon,t)|=|\alpha(K,x)^{\prime}\tilde{P}(K,t)\varepsilon|\leq C|\varepsilon|\max_{K}\zeta_{K} for any (K,x)𝒦n×𝒳(K,x)\in\mathcal{K}_{n}\times\mathcal{X}. We define the envelope function Fn(ε,t)C|ε|maxKζK1F_{n}(\varepsilon,t)\equiv C|\varepsilon|\max_{K}\zeta_{K}\vee 1. By Assumption 4.1, we have

|f_{n,K,x} - f_{n,K',x'}| = |\varepsilon| |\alpha(K,x)'\tilde{P}(K,t) - \alpha(K',x')'\tilde{P}(K',t)|
\leq |\varepsilon| \big[ |\alpha(K,x)'\tilde{P}(K,t) - \alpha(K,x)'\tilde{P}(K',t)| + |\alpha(K,x)'\tilde{P}(K',t) - \alpha(K',x)'\tilde{P}(K',t)|
+ |\alpha(K',x)'\tilde{P}(K',t) - \alpha(K',x')'\tilde{P}(K',t)| \big] \leq |\varepsilon| A \max_K \zeta_K L_n(||x - x'|| + |K - K'|)

for all x, x' \in \mathcal{X} and K, K' \in \mathcal{K}_n, where L_n = \zeta^{L_1} \vee \zeta^{L_2}. Therefore, the class of functions \mathcal{F}_n = \{f_{n,K,x} : (K,x) \in \mathcal{K}_n \times \mathcal{X}\} is of VC type, and there are constants A, V > 0 such that

supQN(ϵFnL2(Q),n,L2(Q))(ALn/ϵ)V,0<ϵ1\sup_{Q}N(\epsilon||F_{n}||_{L^{2}(Q)},\mathcal{F}_{n},L^{2}(Q))\leq(AL_{n}/\epsilon)^{V},0<\forall\epsilon\leq 1

for each n. Then, using Theorem 2.1 of Chernozhukov et al. (2016) (Lemma A.9 in the Online Supplementary Material) with B(f) = 0, there exist a tight Gaussian process G_n(f) in \ell^\infty(\mathcal{F}_n) and Z_n(K,x) = G_n(f_{n,K,x}) in \ell^\infty(\mathcal{K}_n \times \mathcal{X}) with zero mean and covariance function (4.2), E[G_n(f) G_n(f')] = Cov(f_{n,K,x}(\varepsilon_i, x_i), f_{n,K',x'}(\varepsilon_i, x_i)), and a sequence of random variables \widetilde{Z} \equiv \sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}} |Z_n(K,x)| such that, for every \gamma \in (0,1),

P(|sup(K,x)𝒦n×𝒳|tn(K,x)|Z~|>C1δ1n)C2(γ+n1)P(|\sup_{(K,x)\in\mathcal{K}_{n}\times\mathcal{X}}|t_{n}(K,x)|-\widetilde{Z}|>C_{1}\delta_{1n})\leq C_{2}(\gamma+n^{-1}) (A.14)

where C1,C2C_{1},C_{2} are positive constants that depend only on qq, and

δ1n=γ1/qn1/2+1/qmaxKζKlogn+γ1/3n1/6(maxKζK)1/3log2/3n\delta_{1n}=\gamma^{-1/q}n^{-1/2+1/q}\max_{K}\zeta_{K}\log n+\gamma^{-1/3}n^{-1/6}(\max_{K}\zeta_{K})^{1/3}\log^{2/3}n

by Assumption 4.1(iii) and assuming log3nn\log^{3}n\leq n. By taking γ=(logn)1/2\gamma=(\log n)^{-1/2}, we have

|supK,x|tn(K,x)|Z~|=op(n1/2+1/qmaxKζKlog1+1/2qn+n1/6(maxKζK)1/3log5/6n).|\sup_{K,x}|t_{n}(K,x)|-\widetilde{Z}|=o_{p}(n^{-1/2+1/q}\max_{K}\zeta_{K}\log^{1+1/2q}n+n^{-1/6}(\max_{K}\zeta_{K})^{1/3}\log^{5/6}n).

Furthermore, |R1(K,x)|=op(an),|R2(K,x)|=op(an),|νn(K,x)|=op(an)|R_{1}(K,x)|=o_{p}(a_{n}),|R_{2}(K,x)|=o_{p}(a_{n}),|\nu_{n}(K,x)|=o_{p}(a_{n}) uniformly in (K,x)𝒦n×𝒳(K,x)\in\mathcal{K}_{n}\times\mathcal{X} with an=1/(logn)1/2a_{n}=1/(\log n)^{1/2} by Lemma 2 and the rate conditions. Again, consider the class of functions n={fn,K,x:(K,x)𝒦n×𝒳}\mathcal{F}_{n}=\{f_{n,K,x}:(K,x)\in\mathcal{K}_{n}\times\mathcal{X}\} and then

E[supK,x|tn(K,x)|]logn+(maxKζK)q/(q2)logn/nlognE\big{[}\sup_{K,x}|t_{n}(K,x)|\big{]}\lesssim\sqrt{\log n}+(\max_{K}\zeta_{K})^{q/(q-2)}\log n/\sqrt{n}\lesssim\sqrt{\log n}

by Lemma A.13 and Assumption 4.1(iii), so that \sup_{K,x}|t_n(K,x)| \lesssim_P \sqrt{\log n}. Furthermore, \sup_{K,x}|Z_n(K,x)| \lesssim_P \sqrt{\log n} by Dudley's inequality (Corollary 2.2.8 in van der Vaart and Wellner (1996)). Using the same arguments as in Theorem 3.1, we have \big|\sup_{K,x}|\widehat{T}_n(K,x)| - \widetilde{Z}\big| = o_p(a_n) with a_n = 1/(\log n)^{1/2} and

\sup_{u\in\mathbb{R}} \big| P(\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}} |\widehat{T}_n(K,x)| \leq u) - P(\widetilde{Z} \leq u) \big| = o(1). (A.15)

Next, we consider the following (infeasible) bootstrap process:

Tne(K,x)=n(g^ne(K,x)g^n(K,x))Vn(K,x)1/2,(K,x)𝒦n×𝒳T_{n}^{e}(K,x)=\frac{\sqrt{n}(\widehat{g}_{n}^{e}(K,x)-\widehat{g}_{n}(K,x))}{V_{n}(K,x)^{1/2}},\quad(K,x)\in\mathcal{K}_{n}\times\mathcal{X}

where \widehat{g}_n^e(K,x) = \tilde{P}(K,x)'\widehat{\beta}_K^e, \widehat{\beta}_K^e is defined in (4.4) with \tilde{P}(K,x_i), and the e_i are i.i.d. standard exponential random variables independent of X^n = \{x_1, \ldots, x_n\}. Then, we have

Tne(K,x)\displaystyle T_{n}^{e}(K,x) =n(g^ne(K,x)g0(x))Vn(K,x)1/2n(g^n(K,x)g0(x))Vn(K,x)1/2\displaystyle=\frac{\sqrt{n}(\widehat{g}_{n}^{e}(K,x)-g_{0}(x))}{V_{n}(K,x)^{1/2}}-\frac{\sqrt{n}(\widehat{g}_{n}(K,x)-g_{0}(x))}{V_{n}(K,x)^{1/2}}
=tne(K,x)+Rne(K,x)Rn(K,x)\displaystyle=t_{n}^{e}(K,x)+R_{n}^{e}(K,x)-R_{n}(K,x)

where t_n^e(K,x) = n^{-1/2}\sum_{i=1}^{n}(e_i - 1) f_{n,K,x}(\varepsilon_i, x_i), R_n^e(K,x) = R_1^e(K,x) + R_2^e(K,x), and R_1^e(K,x), R_2^e(K,x) are defined as in Lemma 1 with the rescaled data \{(\sqrt{e_i}\tilde{P}(K,x_i), \sqrt{e_i}\varepsilon_i)\}_{i=1}^{n}. Note that \widehat{\beta}_K^e is the weighted least squares estimator for the original data, and we can extend the uniform linearization results in Lemma 2 by replacing \zeta_K with \zeta_K^e = \zeta_K \log^{1/2} n and noting that E[e_i] = 1, Var(e_i) = 1, and \max_{1\leq i\leq n} e_i = O_p(\log n).

By applying Theorem 2.1 in Chernozhukov et al. (2016) to the weighted bootstrap process tne(K,x)t_{n}^{e}(K,x), there exists a random variable Z~e=d|Xnsup(K,x)𝒦n×𝒳|Zn(K,x)|\widetilde{Z}^{e}\overset{d|X^{n}}{=}\sup_{(K,x)\in\mathcal{K}_{n}\times\mathcal{X}}|Z_{n}(K,x)| such that, for every γ(0,1)\gamma\in(0,1),

P(|sup(K,x)𝒦n×𝒳|tne(K,x)|Z~e|>C3δ2n)C4(γ+n1)P(|\sup_{(K,x)\in\mathcal{K}_{n}\times\mathcal{X}}|t_{n}^{e}(K,x)|-\widetilde{Z}^{e}|>C_{3}\delta_{2n})\leq C_{4}(\gamma+n^{-1}) (A.16)

where C3,C4C_{3},C_{4} are positive constants that depend only on qq,

δ2n=γ1/qn1/2+1/qmaxKζKlog2n+γ1/3n1/6(maxKζK)1/3logn,\delta_{2n}=\gamma^{-1/q}n^{-1/2+1/q}\max_{K}\zeta_{K}\log^{2}n+\gamma^{-1/3}n^{-1/6}(\max_{K}\zeta_{K})^{1/3}\log n,

and =d|Xn\overset{d|X^{n}}{=} denotes that the two random variables have the same conditional distribution given XnX^{n}.

Further,

|supK,x|T^ne(K,x)|supK,x|tne(K,x)||supK,x|T^ne(K,x)Tne(K,x)|+supK,x|Tne(K,x)tne(K,x)|=op(an)\big{|}\sup_{K,x}|\widehat{T}_{n}^{e}(K,x)|-\sup_{K,x}|t_{n}^{e}(K,x)|\big{|}\leq\sup_{K,x}|\widehat{T}_{n}^{e}(K,x)-T_{n}^{e}(K,x)|+\sup_{K,x}|T_{n}^{e}(K,x)-t_{n}^{e}(K,x)|=o_{p}(a_{n})

by using E\big[\sup_{K,x}|t_n^e(K,x)|\big] \leq \max_{1\leq i\leq n}|e_i| E\big[\sup_{K,x}|t_n(K,x)|\big] \lesssim_P \log^{3/2} n, Assumption 4.1(iv), and |R_n^e(K,x)| = o_p(a_n), |R_n(K,x)| = o_p(a_n) uniformly in (K,x) \in \mathcal{K}_n \times \mathcal{X} under the rate conditions in Assumption 4.1(ii) with a_n = 1/(\log n)^{1/2}. Then, there exist sequences of positive constants \delta_{3n} = o(1) and \delta_{4n} = o(1) such that

P(|supK,x|T^ne(K,x)|supK,x|tne(K,x)||>anδ3n)δ4n.P(\big{|}\sup_{K,x}|\widehat{T}_{n}^{e}(K,x)|-\sup_{K,x}|t_{n}^{e}(K,x)|\big{|}>a_{n}\delta_{3n})\leq\delta_{4n}. (A.17)

Combining (A.16) and (A.17) gives

P(|\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}} |\widehat{T}_n^e(K,x)| - \widetilde{Z}^e| > a_n\delta_{3n} + C_3\delta_{2n}) \leq C_4(\gamma + n^{-1}) + \delta_{4n}. (A.18)

By Markov's inequality, the following is deduced from (A.18): for every \nu \in (0,1),

P(|sup(K,x)𝒦n×𝒳|T^ne(K,x)|Z~e|>anδ3n+C3δ2n|Xn)ν1(C4(γ+n1)+δ4n)P(|\sup_{(K,x)\in\mathcal{K}_{n}\times\mathcal{X}}|\widehat{T}_{n}^{e}(K,x)|-\widetilde{Z}^{e}|>a_{n}\delta_{3n}+C_{3}\delta_{2n}|X^{n})\leq\nu^{-1}(C_{4}(\gamma+n^{-1})+\delta_{4n}) (A.19)

with probability at least 1 - \nu. A derivation similar to that in Theorem 3.1, using Lemma A.14, gives

\sup_{u\in\mathbb{R}} \big| P(\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}} |\widehat{T}_n^e(K,x)| \leq u | X^n) - P(\widetilde{Z} \leq u) \big| \leq (a_n\delta_{3n} + C_3\delta_{2n})\sqrt{\log n} + \nu^{-1}(C_4(\gamma + n^{-1}) + \delta_{4n}) (A.20)

with probability at least 1 - \nu, where we use \widetilde{Z}^e \overset{d|X^n}{=} \widetilde{Z} and E[\sup_{K,x}|Z_n(K,x)|] \lesssim \sqrt{\log n}. Taking \gamma = (\log n)^{-1/2} and \nu = \nu_n \rightarrow 0 sufficiently more slowly than (\log n)^{-1/2} \vee \delta_{4n}, and using \delta_{2n} = o(a_n) under the rate conditions imposed in the theorem, (A.20) is o_p(1). Combining this with (A.15),

\sup_{u\in\mathbb{R}} \big| P(\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}} |\widehat{T}_n^e(K,x)| \leq u | X^n) - P(\sup_{(K,x)\in\mathcal{K}_n\times\mathcal{X}} |\widehat{T}_n(K,x)| \leq u) \big| = o_p(1). (A.21)

Then, the coverage result (4.7) follows. The second part of the theorem, (4.8), can be derived similarly to the proof of Theorem 3.1, and this completes the proof. ∎

A.2.3 Proof of Theorem 5.1

Proof.

Conditional on X=[x1,,xn]X=[x_{1},\cdots,x_{n}]^{\prime}, the following decomposition holds for any sequence K𝒦nK\in\mathcal{K}_{n}:

n(θ^n(K)θ0)=Γ^n(K)1Sn(K),Γ^n(K)=1n(WMKW),Sn(K)=1nWMK(g+ε)\displaystyle\sqrt{n}(\widehat{\theta}_{n}(K)-\theta_{0})=\widehat{\Gamma}_{n}(K)^{-1}S_{n}(K),\quad\widehat{\Gamma}_{n}(K)=\frac{1}{n}(W^{\prime}M_{K}W),\quad S_{n}(K)=\frac{1}{\sqrt{n}}W^{\prime}M_{K}(g+\varepsilon)

where g = [g_1,\cdots,g_n]', g_i = g_0(x_i), g_w = [g_{w1},\cdots,g_{wn}]', g_{wi} = g_{w0}(x_i) = E[w_i|x_i], and v = [v_1,\cdots,v_n]'. All remaining arguments involve conditional expectations (conditioning on X) and hold almost surely (a.s.). Under Assumption 5.2,

Γ^n(K)=Γn(K)+op(1),Γn(K)=1ni=1nMK,iiE[vi2|xi]\displaystyle\widehat{\Gamma}_{n}(K)=\Gamma_{n}(K)+o_{p}(1),\quad\Gamma_{n}(K)=\frac{1}{n}\sum_{i=1}^{n}M_{K,ii}E[v_{i}^{2}|x_{i}]

by Lemma 1 of Cattaneo, Jansson, and Newey (2018a). Moreover,

Sn(K)=1ni=1nMK,iiviεi1ni=1nj=1,j<inPK,ij(viεj+vjεi)+op(1)\displaystyle S_{n}(K)=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}M_{K,ii}v_{i}\varepsilon_{i}-\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\sum_{j=1,j<i}^{n}P_{K,ij}(v_{i}\varepsilon_{j}+v_{j}\varepsilon_{i})+o_{p}(1)

since MK,ij=PK,ijM_{K,ij}=-P_{K,ij} for j<ij<i, 1ngwMKg=Op(nK¯γgγgw)=op(1)\frac{1}{\sqrt{n}}g_{w}^{\prime}M_{K}g=O_{p}(\sqrt{n}\overline{K}^{-\gamma_{g}-\gamma_{g_{w}}})=o_{p}(1), 1n(vMKg+gwMKε)=Op(K¯γg+K¯γgw)=op(1)\frac{1}{\sqrt{n}}(v^{\prime}M_{K}g+g_{w}^{\prime}M_{K}\varepsilon)=O_{p}(\overline{K}^{-\gamma_{g}}+\overline{K}^{-\gamma_{g_{w}}})=o_{p}(1) by Lemma 2 of Cattaneo, Jansson and Newey (2018a) under Assumption 5.2. Then, the following holds:

Tn(K,θ0)=nVn(K)1/2(θ^n(K)θ0)=Vn(K)1/2Γn(K)11nvMKε+op(1)𝑑N(0,1)T_{n}(K,\theta_{0})=\sqrt{n}V_{n}({K})^{-1/2}(\widehat{\theta}_{n}(K)-\theta_{0})=V_{n}(K)^{-1/2}\Gamma_{n}(K)^{-1}\frac{1}{\sqrt{n}}v^{\prime}M_{K}\varepsilon+o_{p}(1)\overset{d}{\longrightarrow}N(0,1)

by Theorem 1 of Cattaneo, Jansson and Newey (2018a).

For simplicity, here we only show the joint convergence of bivariate t-statistics, but the proof can be easily extended to the multivariate case. For any K1<K2K_{1}<K_{2} in 𝒦n\mathcal{K}_{n}, we show

Yn=Ξ1/2(δ1Tn(K1,θ0)+δ2Tn(K2,θ0))𝑑N(0,1),(δ1,δ2)2\displaystyle Y_{n}=\Xi^{-1/2}(\delta_{1}T_{n}(K_{1},\theta_{0})+\delta_{2}T_{n}(K_{2},\theta_{0}))\overset{d}{\longrightarrow}N(0,1),\quad\forall(\delta_{1},\delta_{2})\in\mathbb{R}^{2} (A.22)

where Ξ=δ12+δ22+2δ1δ2v12,v12=limnVn(K1)1/2Γn(K1)1Ωn(K1,K2)Γn(K2)1Vn(K2)1/2\Xi=\delta_{1}^{2}+\delta_{2}^{2}+2\delta_{1}\delta_{2}v_{12},v_{12}=\lim_{n\rightarrow\infty}V_{n}({K_{1}})^{-1/2}\Gamma_{n}(K_{1})^{-1}\Omega_{n}(K_{1},K_{2})\Gamma_{n}(K_{2})^{-1}V_{n}({K_{2}})^{-1/2}.

Define Y_n = Y_{1,n} + Y_{2,n}, where Y_{1,n} and Y_{2,n} are given by

Y1,n=ω1,1n+i=2ny1,in,y1,in=ω1,in+y¯1,in,Y2,n=ω2,1n+i=2ny2,in,y2,in=ω2,in+y¯2,in,\displaystyle Y_{1,n}=\omega_{1,1n}+\sum_{i=2}^{n}y_{1,in},y_{1,in}=\omega_{1,in}+\bar{y}_{1,in},\quad Y_{2,n}=\omega_{2,1n}+\sum_{i=2}^{n}y_{2,in},y_{2,in}=\omega_{2,in}+\bar{y}_{2,in},

where ω1,in=b1,nW1,in\omega_{1,in}=b_{1,n}W_{1,in}, b1,n=δ1Ξ1/2Vn(K1)1/2Γn(K1)1b_{1,n}=\delta_{1}\Xi^{-1/2}V_{n}({K_{1}})^{-1/2}\Gamma_{n}(K_{1})^{-1}, W1,in=viMK1,iiεi/nW_{1,in}=v_{i}M_{K_{1},ii}\varepsilon_{i}/\sqrt{n}, y¯1,in=j<i(u1,jPK1,ijεi+u1,iPK1,ijεj)/K1,u1,i=c1,nvi,c1,n=δ1Ξ1/2Vn(K1)1/2Γn(K1)1K1/n\bar{y}_{1,in}=\sum_{j<i}(u_{1,j}P_{K_{1},ij}\varepsilon_{i}+u_{1,i}P_{K_{1},ij}\varepsilon_{j})/\sqrt{K_{1}},u_{1,i}=c_{1,n}v_{i},c_{1,n}=-\delta_{1}\Xi^{-1/2}V_{n}({K_{1}})^{-1/2}\Gamma_{n}(K_{1})^{-1}\sqrt{K_{1}/n} and ω2,in=b2,nW2,in,y¯2,in\omega_{2,in}=b_{2,n}W_{2,in},\bar{y}_{2,in} are similarly defined with PK2,Vn(K2),Γn(K2)P_{K_{2}},V_{n}({K_{2}}),\Gamma_{n}(K_{2}) and K2K_{2}. Note that Vn(K1)1C||V_{n}(K_{1})^{-1}||\leq C and Γn(K1)1C||\Gamma_{n}(K_{1})^{-1}||\leq C a.s. for nn large enough by Assumption 5.2, and it follows that b1,nC||b_{1,n}||\leq C. Also, E[ω1,1n4|X]Ci=1nE[W1,in4|X]0E[||\omega_{1,1n}||^{4}|X]\leq C\sum_{i=1}^{n}E[||W_{1,in}||^{4}|X]\rightarrow 0 a.s. by Assumption 5.2(ii). Using the same arguments in the proof of Lemma A2 in Chao et al. (2012), we have ω1,1n=op(1)\omega_{1,1n}=o_{p}(1) and ω2,1n=op(1)\omega_{2,1n}=o_{p}(1) unconditionally, thus Yn=i=2nyin+op(1),yin=y1,in+y2,inY_{n}=\sum_{i=2}^{n}y_{in}+o_{p}(1),y_{in}=y_{1,in}+y_{2,in}.

Let 𝒳i=(W1,in,W2,in,vi,εi)\mathcal{X}_{i}=(W_{1,in},W_{2,in},v_{i},\varepsilon_{i})^{\prime} and define the σ\sigma-fields Fi,n=σ(𝒳1,,𝒳i)F_{i,n}=\sigma(\mathcal{X}_{1},...,\mathcal{X}_{i}) for i=1,,n.i=1,...,n. Then, conditional on XX, {yin,Fi,n,1in,n2}\{y_{in},F_{i,n},1\leq i\leq n,n\geq 2\} is a martingale difference array with Fi1,nFi,nF_{i-1,n}\subseteq F_{i,n}. We apply the martingale central limit theorem to show, conditional on XX, i=2nyin𝑑N(0,1)\sum_{i=2}^{n}y_{in}\overset{d}{\longrightarrow}N(0,1) a.s. Note that E[ω1,iny¯1,jn|X]=0,E[ω1,iny¯2,jn|X]=0,E[ω2,iny¯1,jn|X]=0,E[ω2,iny¯2,jn|X]=0E[\omega_{1,in}\bar{y}_{1,jn}|X]=0,E[\omega_{1,in}\bar{y}_{2,jn}|X]=0,E[\omega_{2,in}\bar{y}_{1,jn}|X]=0,E[\omega_{2,in}\bar{y}_{2,jn}|X]=0 for all i,ji,j. Then similar to the proof of Lemma A2 in Chao et al. (2012),

sn2(X)\displaystyle s_{n}^{2}(X) =E[(i=2nyin)2|X]=i=2n(E[ω1,in2|X]+E[y¯1,in2|X])+i=2n(E[ω2,in2|X]+E[y¯2,in2|X])\displaystyle=E[(\sum_{i=2}^{n}y_{in})^{2}|X]=\sum_{i=2}^{n}(E[\omega_{1,in}^{2}|X]+E[\bar{y}_{1,in}^{2}|X])+\sum_{i=2}^{n}(E[\omega_{2,in}^{2}|X]+E[\bar{y}_{2,in}^{2}|X])
+2i=2n(E[ω1,inω2,in|X]+E[y¯1,iny¯2,in|X])\displaystyle+2\sum_{i=2}^{n}(E[\omega_{1,in}\omega_{2,in}|X]+E[\bar{y}_{1,in}\bar{y}_{2,in}|X])
=δ12Ξ1+δ22Ξ1E[ω1,1n2|X]E[ω2,1n2|X]2E[ω1,1nω2,1n|X]\displaystyle=\delta_{1}^{2}\Xi^{-1}+\delta_{2}^{2}\Xi^{-1}-E[\omega_{1,1n}^{2}|X]-E[\omega_{2,1n}^{2}|X]-2E[\omega_{1,1n}\omega_{2,1n}|X]
+2δ1δ2Ξ1Vn(K1)1/2Γn(K1)1Ωn(K1,K2)Γn(K2)1Vn(K2)1/21a.s.\displaystyle+2\delta_{1}\delta_{2}\Xi^{-1}V_{n}({K_{1}})^{-1/2}\Gamma_{n}(K_{1})^{-1}\Omega_{n}(K_{1},K_{2})\Gamma_{n}(K_{2})^{-1}V_{n}({K_{2}})^{-1/2}\rightarrow 1\quad a.s.

Moreover, we have i=2nE[yin4|X]i=2nE[y1,in4|X]+i=2nE[y2,in4|X]a.s.0\sum_{i=2}^{n}E[y_{in}^{4}|X]\lesssim\sum_{i=2}^{n}E[y_{1,in}^{4}|X]+\sum_{i=2}^{n}E[y_{2,in}^{4}|X]\overset{a.s.}{\rightarrow}0 as in the proof of Lemma A2 of Chao et al. (2012).

It remains to prove that, for any $\delta>0$, $P\big(\big|\sum_{i=2}^{n}E[y_{in}^{2}|\mathcal{X}_{1},\ldots,\mathcal{X}_{i-1},X]-s_{n}^{2}(X)\big|\geq\delta\,\big|\,X\big)\rightarrow 0$. Note that

\begin{align}
&\sum_{i=2}^{n}E[y_{in}^{2}|\mathcal{X}_{1},\ldots,\mathcal{X}_{i-1},X]-s_{n}^{2}(X)\nonumber\\
&=\sum_{i=2}^{n}E[y_{1,in}^{2}|\mathcal{X}_{1},\ldots,\mathcal{X}_{i-1},X]-\sum_{i=2}^{n}\big(E[\omega_{1,in}^{2}|X]+E[\bar{y}_{1,in}^{2}|X]\big)\tag{A.23}\\
&\quad+\sum_{i=2}^{n}E[y_{2,in}^{2}|\mathcal{X}_{1},\ldots,\mathcal{X}_{i-1},X]-\sum_{i=2}^{n}\big(E[\omega_{2,in}^{2}|X]+E[\bar{y}_{2,in}^{2}|X]\big)\tag{A.24}\\
&\quad+2\Big(\sum_{i=2}^{n}\big(E[\omega_{1,in}\omega_{2,in}|\mathcal{X}_{1},\ldots,\mathcal{X}_{i-1},X]-E[\omega_{1,in}\omega_{2,in}|X]\big)\tag{A.25}\\
&\quad+\sum_{i=2}^{n}E[\omega_{1,in}\bar{y}_{2,in}+\omega_{2,in}\bar{y}_{1,in}|\mathcal{X}_{1},\ldots,\mathcal{X}_{i-1},X]+\sum_{i=2}^{n}\big(E[\bar{y}_{1,in}\bar{y}_{2,in}|\mathcal{X}_{1},\ldots,\mathcal{X}_{i-1},X]-E[\bar{y}_{1,in}\bar{y}_{2,in}|X]\big)\Big).\tag{A.26}
\end{align}

Terms (A.23) and (A.24) converge to 0 a.s. by the proof of Lemma A2 in Chao et al. (2012). Moreover, it is straightforward to verify that (A.25) and (A.26) converge to 0 a.s., using $P_{K_{1},ij}P_{K_{2},ij}\leq P_{K_{1},ij}^{2}\vee P_{K_{2},ij}^{2}$ and $K_{1}\asymp K_{2}$ and closely following the proof of Lemma A2 in Chao et al. (2012). We can therefore apply the martingale central limit theorem and deduce $Y_{n}\overset{d}{\longrightarrow}N(0,1)$ by arguments similar to those in the proof of Lemma A2 in Chao et al. (2012). The coverage results (5.6) and (5.7) then follow from the joint convergence of $\widehat{T}_{n}(K,\theta_{0})$ together with $\max_{K\in\mathcal{K}_{n}}|\widehat{V}_{n}(K)/V_{n}(K)-1|=o_{p}(1)$ and $\|\widehat{\Sigma}_{n}-\Sigma_{n}\|=o_{p}(1)$ as $n,K\rightarrow\infty$, under the assumptions imposed in Theorem 5.1, and the Slutsky theorem. This completes the proof. ∎
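For implementation, the uniform-in-$K$ critical value $\widehat{c}_{1-\alpha}$ can be approximated by simulating from the estimated joint law of the $t$-statistics across $K\in\mathcal{K}_{n}$. The following is a minimal sketch of that step, not the paper's exact algorithm: it assumes an estimated correlation matrix `Sigma_hat` of the $t$-statistics over the candidate set is available, and the function and argument names are illustrative.

```python
import numpy as np

def sup_t_critical_value(Sigma_hat, alpha=0.05, n_draws=100_000, seed=0):
    """Approximate the (1 - alpha) quantile of max_K |Z_K|, where
    Z ~ N(0, Sigma_hat) collects the correlated t-statistics across
    the candidate numbers of series terms K (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    dim = Sigma_hat.shape[0]
    # Draw from the estimated joint law of the t-statistics.
    Z = rng.multivariate_normal(np.zeros(dim), Sigma_hat, size=n_draws)
    # Sup-t statistic: the largest absolute coordinate in each draw.
    sup_t = np.abs(Z).max(axis=1)
    return np.quantile(sup_t, 1 - alpha)
```

A confidence interval robust to the choice of $K$ then takes the form estimate $\pm$ critical value $\times$ standard error, evaluated at a possibly data-dependent $K$, as in the empirical illustration below.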

Appendix B Figures and Tables

Figure 1: Different functions $g(x)$ used in the simulations (Section 6).
Solid lines (black): $g_{1}(x)=\ln(|6x-3|+1)\,\mathrm{sgn}(x-1/2)$; dashed lines (green): $g_{2}(x)=\sin(7\pi x/2)/[1+2x^{2}(\mathrm{sgn}(x)+1)]$; dotted lines (blue): $g_{3}(x)=x-1/2+5\phi(10(x-1/2))$, where $\phi(\cdot)$ is the standard normal pdf.
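For concreteness, the three regression functions in Figure 1 can be transcribed directly into code; the sketch below assumes $x\in[0,1]$ as in the simulation designs and uses `scipy` only for the standard normal pdf.

```python
import numpy as np
from scipy.stats import norm

def g1(x):
    # g1(x) = ln(|6x - 3| + 1) * sgn(x - 1/2)
    return np.log(np.abs(6 * x - 3) + 1) * np.sign(x - 0.5)

def g2(x):
    # g2(x) = sin(7*pi*x/2) / [1 + 2x^2(sgn(x) + 1)]
    return np.sin(7 * np.pi * x / 2) / (1 + 2 * x**2 * (np.sign(x) + 1))

def g3(x):
    # g3(x) = x - 1/2 + 5*phi(10(x - 1/2)), with phi the standard normal pdf
    return x - 0.5 + 5 * norm.pdf(10 * (x - 0.5))
```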
Table 1: Coverage and Length of Nominal 95% CIs and CBs - Splines
                                      Pointwise                                    Uniform
                          x=0.2         x=0.5         x=0.8         x=0.9
                          COV   AL      COV   AL      COV   AL      COV   AL       COV   AL

Model 1: $g_{1}(x)=\ln(|6x-3|+1)\,\mathrm{sgn}(x-1/2)$
Standard                  0.93  0.27    0.93  0.36    0.91  0.92    0.92  1.49     0.42  0.69
Robust ($\widehat{K}_{\texttt{cv}}$)    0.98  0.37    0.98  0.46    0.96  1.14    0.95  1.76     0.97  1.33
Robust ($\widehat{K}_{\texttt{cv+}}$)   0.98  0.51    0.98  0.49    0.98  1.51    0.97  2.08     0.98  1.42

Model 2: $g_{2}(x)=\sin(7\pi x/2)/[1+2x^{2}(\mathrm{sgn}(x)+1)]$
Standard                  0.80  0.28    0.93  0.36    0.91  0.92    0.92  1.49     0.27  0.69
Robust ($\widehat{K}_{\texttt{cv}}$)    0.93  0.37    0.97  0.46    0.96  1.14    0.95  1.76     0.96  1.33
Robust ($\widehat{K}_{\texttt{cv+}}$)   0.98  0.51    0.98  0.49    0.98  1.51    0.97  2.08     0.98  1.42

Model 3: $g_{3}(x)=x-1/2+5\phi(10(x-1/2))$
Standard                  0.77  0.29    0.65  0.40    0.89  1.00    0.91  1.57     0.16  0.70
Robust ($\widehat{K}_{\texttt{cv}}$)    0.88  0.39    0.74  0.50    0.96  1.23    0.95  1.85     0.75  1.35
Robust ($\widehat{K}_{\texttt{cv+}}$)   0.98  0.52    0.92  0.53    0.98  1.52    0.97  2.06     0.97  1.44
  • Notes: “Pointwise” reports coverage (COV) and average length (AL) of (1) the standard 95% CI with $\widehat{K}_{\texttt{cv}}\in\mathcal{K}_{n}$; (2) the robust CI with $\widehat{K}_{\texttt{cv}}$; and (3) the robust CI with $\widehat{K}_{\texttt{cv+}}$. “Uniform” reports analogous uniform inference results for confidence bands. $\widehat{K}_{\texttt{cv}}$ minimizes the leave-one-out cross-validation criterion and $\widehat{K}_{\texttt{cv+}}=\widehat{K}_{\texttt{cv}}+2$. Results are based on quadratic spline regressions with evenly spaced knots; an illustrative cross-validation sketch follows below.
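As a rough illustration of the cross-validation step in the notes, the sketch below selects the number of evenly spaced knots for a quadratic spline by leave-one-out cross-validation, using the hat-matrix shortcut available for linear smoothers. The truncated-power basis and the function names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def quad_spline_basis(x, n_knots):
    """Quadratic truncated-power spline basis with evenly spaced
    interior knots on [0, 1] (illustrative basis choice)."""
    knots = np.linspace(0, 1, n_knots + 2)[1:-1]  # interior knots only
    cols = [np.ones_like(x), x, x**2]
    cols += [np.clip(x - t, 0, None)**2 for t in knots]
    return np.column_stack(cols)

def loocv_select_knots(x, y, knot_grid):
    """Pick the number of knots minimizing leave-one-out CV, using
    CV = mean((resid_i / (1 - h_ii))^2) for linear smoothers."""
    best_m, best_cv = None, np.inf
    for m in knot_grid:
        P = quad_spline_basis(x, m)
        H = P @ np.linalg.pinv(P.T @ P) @ P.T  # hat (projection) matrix
        resid = y - H @ y
        cv = np.mean((resid / (1.0 - np.diag(H)))**2)
        if cv < best_cv:
            best_m, best_cv = m, cv
    return best_m
```

For example, `loocv_select_knots(x, y, range(1, 11))` returns the knot count with the smallest leave-one-out criterion over the candidate grid.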

Table 2: Nonparametric Wage Elasticity of Hours of Work Estimates in Blomquist and Newey (Table 1, 2002). The wage elasticity is evaluated at the mean net wage rates, virtual income, and level of hours.
Additional Terms$^{1}$                              CV$^{2}$    $\widehat{E}_{w}$   $SE_{\widehat{E}_{w}}$   $CI_{\widehat{E}_{w}}(K)$
$1,y_{J},w_{J}$                                     0.00472     0.0372              0.0104                   [0.0168, 0.0576]
$\Delta y\Delta w$                                  0.0313      0.0761              0.0128                   [0.0510, 0.1012]
$\ell\Delta y$                                      0.0305      0.0760              0.0127                   [0.0511, 0.1009]
$y_{J}^{2},w_{J}^{2}$                               0.0323      0.0763              0.0129                   [0.0510, 0.1016]
$\Delta y^{2},\Delta w^{2}$                         0.0369      0.0543              0.0151                   [0.0247, 0.0839]
$y_{J}w_{J}$                                        0.0364      0.0659              0.0197                   [0.0273, 0.1045]
$\Delta yw$                                         0.0350      0.0628              0.0223                   [0.0191, 0.1065]
$\ell^{2}\Delta y$                                  0.0364      0.0636              0.0223                   [0.0199, 0.1073]
$y_{J}^{3},w_{J}^{3}$                               0.0331      0.0845              0.0275                   [0.0306, 0.1384]
$\ell\Delta y^{2},\ell\Delta w^{2},\ell\Delta yw$   0.0263      0.0775              0.0286                   [0.0214, 0.1336]
$y_{J}^{2}w_{J},y_{J}w_{J}^{2}$                     0.0252      0.0714              0.0289                   [0.0148, 0.1280]
MLE estimates                                                   0.123               0.0137
critical value: $\widehat{c}_{1-\alpha}(x)=2.503$;  $CI_{\widehat{E}_{w}}^{\texttt{sup}}(\widehat{K}_{\texttt{cv}})=[0.0165,0.0921]$$^{3}$,
$CI_{\widehat{E}_{w}}^{\texttt{sup}}(\widehat{K}_{\texttt{cv+}})=[0.0166,0.1152]$, $CI_{\widehat{E}_{w}}^{\texttt{sup}}(\widehat{K}_{\texttt{cv++}})=[0.0070,0.1186]$
  • 1. $y$: non-labor income; $w$: marginal wage rates; $\ell$: the end point of the segment in a piecewise linear budget set. $\ell^{m}\Delta y^{p}w^{q}$ denotes $\sum_{j}\ell_{j}^{m}(y_{j}^{p}w_{j}^{q}-y_{j+1}^{p}w_{j+1}^{q})$.
  • 2. $CV$ denotes the cross-validation criterion defined in Blomquist and Newey (2002, p. 2464). $\widehat{K}_{\texttt{cv}}=K_{5}$, the 5th smallest model, is chosen by cross-validation; we let $\widehat{K}_{\texttt{cv+}}=K_{6}$ and $\widehat{K}_{\texttt{cv++}}=K_{7}$.
  • 3. $CI_{\widehat{E}_{w}}^{\texttt{sup}}(K)=\widehat{E}_{w}(K)\pm\widehat{c}_{1-\alpha}(x)SE_{\widehat{E}_{w}}(K)$ and $CI_{\widehat{E}_{w}}(K)=\widehat{E}_{w}(K)\pm z_{1-\alpha/2}SE_{\widehat{E}_{w}}(K)$.
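As a worked check of footnote 3 against the table entries: with $\widehat{K}_{\texttt{cv}}=K_{5}$ (the $\Delta y^{2},\Delta w^{2}$ row), $\widehat{E}_{w}=0.0543$, $SE_{\widehat{E}_{w}}=0.0151$, and $\widehat{c}_{1-\alpha}(x)=2.503$,

\[
CI_{\widehat{E}_{w}}^{\texttt{sup}}(\widehat{K}_{\texttt{cv}})=0.0543\pm 2.503\times 0.0151=0.0543\pm 0.0378=[0.0165,\,0.0921],
\]

matching the reported interval and wider than the pointwise $CI_{\widehat{E}_{w}}(K_{5})=0.0543\pm 1.96\times 0.0151=[0.0247,\,0.0839]$.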

Figure 2: Nonparametric Wage Elasticity of Hours of Work Estimates in Blomquist and Newey (Table 1, 2002).

Figure 2 plots the same wage elasticity estimates of the expected labor supply as in Table 2, together with the standard pointwise 95% CIs as well as the CIs that are uniform in $K\in\mathcal{K}_{n}$, constructed with the critical value $\widehat{c}_{1-\alpha}(x)$.