A Sieve Quasi-likelihood Ratio Test for Neural Networks with Applications to Genetic Association Studies
Abstract
Neural networks (NN) play a central role in modern artificial intelligence (AI) technology and have been successfully used in areas such as natural language processing and image recognition. While the majority of NN applications focus on prediction and classification, there is increasing interest in the statistical inference of neural networks. Studying NN statistical inference can enhance our understanding of NN statistical properties. Moreover, it can facilitate NN-based hypothesis testing that can be applied to hypothesis-driven clinical and biomedical research. In this paper, we propose a sieve quasi-likelihood ratio test based on an NN with one hidden layer for testing complex associations. The test statistic has an asymptotic chi-squared distribution, and therefore it is computationally efficient and easy to implement in real data analysis. The validity of the asymptotic distribution is investigated via simulations. Finally, we demonstrate the use of the proposed test by performing a genetic association analysis of the sequencing data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI).
Keywords: Sieve quasi-likelihood ratio test; nonparametric least squares; influence functions.
1 Introduction
With the advance of science and technology, we are now in the era of the fourth industrial revolution. One of the key drivers of the fourth industrial revolution is artificial intelligence (AI). Deep neural networks play a critical role in AI and have achieved great success in many fields such as natural language processing and image recognition. While great attention has been given to applications of neural networks (NN), limited studies have focused on their theoretical properties and statistical inference, which hinders their application to hypothesis-driven clinical and biomedical research. The study of NN statistical inference can improve our understanding of NN properties and facilitate hypothesis testing using NN. Nevertheless, it is challenging to study NN statistical inference. For instance, it has been pointed out in Fukumizu (1996) and Fukumizu et al. (2003) that the parameters in a neural network model are unidentifiable, so classical tests (e.g., the Wald test and the likelihood ratio test) cannot be used because the unidentifiability of parameters leads to inconsistency of the nonlinear least squares estimators (Wu, 1981).
Much of the existing literature on NN, such as Shen et al. (2021), Shen et al. (2019), Horel and Giesecke (2020), Schmidt-Hieber et al. (2020), and Chen and White (1999), is based on the framework of nonparametric regression. It has been shown in Chen and White (1999) that, for a sufficiently smooth target function, neural network estimators achieve a rate of convergence that does not deteriorate rapidly with the input dimension; thus, one advantage of neural networks over commonly used nonparametric regression methods (e.g., the Nadaraya-Watson estimator and spline regression) is that neural network estimators can avoid the curse of dimensionality in terms of the rate of convergence.
There is increasing interest in studying hypothesis testing based on neural networks. Recently, Shen et al. (2019) established asymptotic theories for neural networks, which can be used to perform nonparametric hypothesis tests on the true function. Horel and Giesecke (2020) used a Lindeberg-Feller type central limit theorem for random processes and a second-order functional delta method to construct a test statistic for significance tests on input features. However, the asymptotic distribution of the test statistic is complex, making it difficult to obtain the critical value. Shen et al. (2021) proposed a goodness-of-fit type test based on neural networks. The test statistic is based on comparing the mean squared error values of two neural networks built under the null and the alternative hypotheses. The test statistic has an asymptotic normal distribution, and hence it can be easily used in practice. However, constructing the test statistic requires a random split of the data, which can lead to a potential power loss. In this paper, we propose a sieve quasi-likelihood ratio (SQLR) test based on neural networks. Like the test in Shen et al. (2021), the test statistic has a simple asymptotic distribution, in our case chi-squared, which facilitates its use in practice. Compared with the goodness-of-fit test in Shen et al. (2021), the SQLR test does not require data splitting, but it requires continuous random input features.
The rest of the paper is organized as follows. Section 2 provides the general results of the sieve quasi-likelihood ratio test under the setup of nonparametric regression. In section 3, we apply the general theory to neural networks so that significance tests based on neural networks can be performed. We investigate the validity of the theory via simple simulations in section 4, followed by a real data application to genetic association analysis of the sequencing data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) in section 5. The proofs of the main results are given in the supplementary materials.
2 Sieve Quasi-Likelihood Ratio Test
Consider the classical setting of a nonparametric regression model under the random design,
where the covariates are assumed to be i.i.d. from a distribution , and are i.i.d. random errors with . are the responses, which are continuous random variables. The true function is assumed to be in , where is a compact subset. For simplicity, we take . The norm considered on is the -norm . We further assume that for some . Such an assumption is also considered in Han and Wellner (2019) and is necessary to obtain the desired convergence rate.
The approximate sieve extremum estimator based on is defined as
where is the classical sample squared error loss function
We assume that is uniformly dense in , that is, for each , there exists such that as . For simplicity, we assume that the sieve space is countable to avoid additional technical issues regarding measurability.
The null hypothesis of the sieve quasi-likelihood ratio test is , which is the same as the one proposed in Shen and Shi (2005). We define the sieve quasi-likelihood ratio statistic as
where is the null sieve space given by
Similar to the definition of , we denote the approximate sieve extremum estimator under by , which satisfies
According to Shen (1997) and Shen and Shi (2005), we assume that the functional has the following smoothness property: for any ,
(1) |
where is defined as with being a path in connecting and such that and . is the degree of smoothness of at , is linear in , and
Then is a bounded linear functional on , which is the completion of . From the Riesz representation theorem, there exists
Let be the rate of convergence for , that is, . Let be a sequence converging to 0 with . For , we define
where . The main result relies on the following conditions:
- (C1)
(Sieve Space) Suppose that is uniformly bounded. Moreover, assume that there exists a non-increasing continuous function of such that
and
- (C2)
(Rate of Convergence) The rate of convergence satisfies and
- (C3)
(Approximation Error)
where the supremum is taken over all probability measures on and
Remark 1.
- (i)
- (ii)
Condition (C2) is on the rate of convergence of sieve estimators. To obtain the desired result, the convergence rate cannot be too slow. Together with (C1), we can derive a uniform law of large numbers for the empirical norm, as given in van de Geer (2000).
- (iii)
The conditions on the approximation errors in the setting of nonparametric regression are given in condition (C3). These two requirements are special cases of the ones given in Shen (1997).
Theorem 1.
Suppose , and under (C1)-(C3),
Remark 2.
In view of Lemma 22 given in the supplementary materials, the empirical inner product can be replaced by its population version .
We now state the main theorem for sieve quasi-likelihood ratio statistics. The proof of the theorem follows the same steps as those in Shen and Shi (2005) and is given in the supplementary materials.
Theorem 2.
Under and (C1)-(C3), if , , and , then we have
where .
In practice, is rarely known a priori. A simple application of Slutsky’s theorem yields the following corollary, which shows that we can replace with any consistent estimator of . A straightforward consistent estimator for is given by .
Corollary 3.
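To illustrate how the corollary would be used in practice, the following sketch computes the test statistic from fitted values obtained under the null and alternative models, replacing the error variance by the residual-variance estimate mentioned above. The exact scaling of the statistic is an assumption here (n times the difference in average squared error, divided by the plug-in variance estimate), and all function names are illustrative.

```python
import numpy as np
from scipy import stats

def sqlr_statistic(y, fitted_null, fitted_alt):
    """Sieve quasi-likelihood ratio statistic (illustrative form).

    Assumes the statistic is n * [L_n(null fit) - L_n(alternative fit)] / sigma2_hat,
    where L_n is the average squared error and sigma2_hat is the residual
    variance estimated under the alternative fit (the plug-in estimator).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    loss_null = np.mean((y - fitted_null) ** 2)   # L_n under H0
    loss_alt = np.mean((y - fitted_alt) ** 2)     # L_n under H1
    sigma2_hat = loss_alt                         # consistent plug-in variance estimate
    return n * (loss_null - loss_alt) / sigma2_hat

def sqlr_pvalue(y, fitted_null, fitted_alt, df=1):
    """P-value from the asymptotic chi-squared reference distribution."""
    return stats.chi2.sf(sqlr_statistic(y, fitted_null, fitted_alt), df)
```

In practice, `fitted_null` and `fitted_alt` would be the fitted values of the models trained over the null sieve space and the full sieve space, respectively.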
3 An Application to Neural Networks
We first introduce the notation to be used in this section. where 1 appears at the th position. We use to denote the set of non-negative integers and use to denote a multi-index. Moreover, we set and for any . For a differentiable function on , we set
One of our goals in this paper is to establish a sieve quasi-likelihood ratio test for neural network estimators. Specifically, for a given , let and the null hypothesis of interest be
Different from linear regression, in which the hypothesis can be easily translated into testing whether the corresponding regression coefficients are zero, testing the significance of an association in nonparametric regression is more complicated. From Chen and White (1999) and Horel and Giesecke (2020), testing in the nonparametric setting is equivalent to testing whether the corresponding partial derivatives are identically zero.
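For concreteness, writing the regression function as $h_0$ and the feature of interest as the $j$th coordinate $x_j$ (notation introduced here for illustration), the null hypothesis of no association can be written as

$$H_0:\ \frac{\partial h_0(x)}{\partial x_j} = 0 \quad \text{for all } x \in \mathcal{X},$$

that is, $h_0$ does not vary with its $j$th argument.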
Hence, we assume that the true function is a smooth function. Specifically, we consider the Barron class for some integer and some fixed constant , as considered in Siegel and Xu (2020). Here
and is the Fourier transform of . As shown in Siegel and Xu (2020), and . In what follows, we will take with .
The functional from the general result in section 2 is given by
(2) |
The directional derivative evaluated at “direction” can be calculated straightforwardly. For the sieve space, we use the class of neural networks with one hidden layer and sigmoid activation function .
(3) |
where as . In view of Barron (1993), is -dense in .
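To make the sieve space concrete, the sketch below implements a one-hidden-layer sigmoid network together with a simple rescaling step that keeps the hidden-to-output weights within a norm bound. The specific form of the constraint (an $\ell_1$ bound $V_n$ on the output weights, in the spirit of Barron (1993)) is an assumption, and all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class OneLayerSieveNet:
    """One-hidden-layer sigmoid network; a member of the sieve space (sketch)."""

    def __init__(self, d, r_n, V_n, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(r_n, d))    # input-to-hidden weights
        self.b = rng.normal(size=r_n)         # hidden-unit biases
        self.a = 0.1 * rng.normal(size=r_n)   # hidden-to-output weights
        self.a0 = 0.0                         # output bias
        self.V_n = V_n                        # assumed l1 bound on the output weights
        self.project()

    def predict(self, X):
        """f(x) = a0 + sum_j a_j * sigmoid(w_j' x + b_j)."""
        return self.a0 + sigmoid(X @ self.W.T + self.b) @ self.a

    def project(self):
        """Rescale the output weights so that ||a||_1 <= V_n (stay inside the sieve)."""
        norm = np.abs(self.a).sum()
        if norm > self.V_n:
            self.a *= self.V_n / norm
```

Both the number of hidden units `r_n` and the bound `V_n` would be allowed to grow with the sample size, reflecting that the sieve spaces expand as the sample size increases.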
Based on the general results in the previous section, the functional needs to be sufficiently smooth so that the sieve quasi-likelihood ratio statistic has an asymptotic chi-squared distribution. The following propositions guarantee that the conditions on in the general theory are satisfied.
Proposition 4.
Let be the same function given in (2), then, for any ,
Moreover, is a bounded linear functional on .
Proof.
By definition,
For the second claim, linearity follows directly from the definition of . Boundedness follows from Hölder’s inequality by noting that
∎
We now impose the following condition on the distribution .
- (C4)
Suppose that , where is the Lebesgue measure on . Let
Moreover, we assume that on , , and .
Proposition 5.
Under (C4), the Riesz representor for the bounded linear functional is given by
Proof.
Define and as
then . Let be the unit outward normal to . Given the integration by parts formula and the fact that on , we have
Based on the given assumptions, we know that . Therefore,
∎
Before we bound the remainder error of the first order functional Taylor expansion, we provide a bound for higher order derivatives of a neural network.
Proposition 6.
Let be a non-negative integer. For any and any multi-index with ,
Proof.
Lemma 7 (Rate of Convergence of Neural Network Sieve Estimators).
- (i)
The sieve space satisfies (C1).
- (ii)
Suppose that , then the rate of convergence of neural network sieve estimators is
and satisfies (C2).
Proof.
- (i)
From Theorem 14.5 in Anthony and Bartlett (2009), we have
which implies that
(4) |
Hence, (C1) is satisfied with by noting that
- (ii)
Note that, for ,
Let . Clearly, is decreasing on for . Note that
(5) |
It follows from Makovoz (1996) that . By taking in (5), we obtain the following governing inequality:
Given that , we have
To show that satisfies condition (C2), we note that as long as as . The governing inequality is certainly satisfied based on the previous arguments. We also note that
On the other hand, we have
where the last inequality follows from the assumption .
∎
Remark 3.
The rate of convergence we obtained has an additional term in the denominator compared with the results in Chen and Shen (1998), but this has little effect on the main result.
Proposition 8.
Under (C4) and the assumption of
(6) |
for any , we have
Proof.
Note that
where the last inequality follows from the elementary inequality and the triangle inequality. For the second term, it follows from Corollary 1 in Siegel and Xu (2020) that
For the first term, we use the Gagliardo-Nirenberg interpolation inequality (Theorem 12.87 in Leoni (2017)). For and , there exists a constant , which is independent of , such that
It then follows from Proposition 6 that
As we have shown in Lemma 7, under (6), , and then
By taking , we obtain that
Therefore, we obtain that for ,
∎
Now we state and prove the asymptotic distribution of the sieve quasi-likelihood ratio statistic.
Theorem 9.
Proof.
While conditions (C1) and (C2) have been verified in Lemma 7, condition (C3) remains to be verified. According to Theorem 2.1 in Mhaskar (1996), we can find vectors and such that for any , there exist coefficients satisfying
(7) |
In addition, the functionals are continuous linear functionals on .
Based on the results from Goulaouic (1971) or Baouendi and Goulaouic (1974) and Lemma 3.2 in Mhaskar (1996) we can show that for an analytic function defined on a compact set , there exist and coefficients such that
(8) |
where and are the same as those given in (7). Since is analytic, for every with , there exists a neural network with
such that for some . For , there exists a neural network with
such that . By considering
it is clear that and
Therefore, by choosing and noting that
we see that the first requirement in (C3) is satisfied. For the second requirement, note that if is large enough,
Hence (C3) is satisfied, and the desired claim follows from Corollary 3. ∎
4 A Simulation Study
We conducted a simulation study to investigate the type I error and power performance of our proposed test. The model for generating the simulation data is given as follows:
where , and . Since is not included in the true model, we use it to investigate whether the SQLR test has good control of the type I error, while the other 5 covariates are used to evaluate the power of the proposed test.
A subgradient method, discussed in section 7 of Boyd and Mutapcic (2008), was applied to obtain a neural network estimate because of the constraints on the sieve space . The step size for the th iteration was chosen to be to fit a neural network under the null hypothesis , while a step size of was used under the alternative hypothesis . Such choices of step sizes ensure the convergence of the subgradient method. In terms of the structure of the neural networks, we set and for both neural networks fitted under and . When fitting the neural network under , the initial values for the weights were randomly assigned. We then used the fitted values of the weights from the neural network under as the initial values and set all the extra weights to zero when fitting the neural network under .
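A minimal sketch of this fitting procedure is given below, assuming a projected subgradient update with a step size proportional to 1/k and an $\ell_1$-type bound on the hidden-to-output weights; the function and parameter names are illustrative rather than the exact implementation used here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_sieve_net(X, y, r_n, V_n, n_iter=1000, step0=1.0, init=None, seed=0):
    """Projected-subgradient fit of a one-hidden-layer sigmoid network (sketch).

    The step size at iteration k is step0 / k, and after each update the
    hidden-to-output weights are rescaled so that their l1 norm stays below
    V_n (an assumed form of the sieve constraint).  `init` allows a warm
    start, e.g. from the weights fitted under the null hypothesis.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    if init is None:
        W = rng.normal(size=(r_n, d))
        b = rng.normal(size=r_n)
        a = 0.1 * rng.normal(size=r_n)
        a0 = float(np.mean(y))
    else:
        W, b, a, a0 = init
        W, b, a, a0 = W.copy(), b.copy(), a.copy(), float(a0)

    for k in range(1, n_iter + 1):
        Z = X @ W.T + b                    # (n, r_n) pre-activations
        H = sigmoid(Z)                     # hidden-unit outputs
        resid = y - (a0 + H @ a)           # residuals under the current fit
        dH = H * (1.0 - H)                 # sigmoid derivative
        # (sub)gradients of the average squared error
        g_a0 = -2.0 * resid.mean()
        g_a = -2.0 * (H * resid[:, None]).mean(axis=0)
        g_b = -2.0 * ((resid[:, None] * dH) * a).mean(axis=0)
        g_W = -2.0 * ((resid[:, None] * dH * a).T @ X) / n
        step = step0 / k
        a0 -= step * g_a0
        a -= step * g_a
        b -= step * g_b
        W -= step * g_W
        norm = np.abs(a).sum()             # keep the iterate inside the sieve
        if norm > V_n:
            a *= V_n / norm
    return W, b, a, a0
```

To mimic the warm start described above, one would first call `fit_sieve_net` on the inputs allowed under the null hypothesis and then pass the fitted weights (padded with zeros for the extra input weights) as `init` when fitting under the alternative.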
Table 1 summarizes the empirical type I error and the empirical power under various sample sizes for the proposed neural-network-based SQLR test and the linear-regression-based F-test after conducting 500 Monte Carlo iterations. Results from the table show that both testing procedures control the empirical type I error well. In terms of empirical power, the F-test can only detect the linear component of the simulated model, while SQLR can detect all the components of the model. Therefore, when nonlinear patterns exist in the underlying function, the SQLR test is anticipated to be more powerful than the F-test. Even for linear terms, the two methods have comparable performance.
 | SQLR | | | | | F-test | | | |
---|---|---|---|---|---|---|---|---|---|---
Sample Size | 100 | 500 | 1000 | 3000 | 5000 | 100 | 500 | 1000 | 3000 | 5000
 | 0.072 | 0.072 | 0.080 | 0.326 | 0.818 | 0.054 | 0.068 | 0.046 | 0.060 | 0.042
 | 0.058 | 0.088 | 0.152 | 0.504 | 0.932 | 0.066 | 0.062 | 0.070 | 0.054 | 0.058
 | 0.052 | 0.062 | 0.104 | 0.308 | 0.812 | 0.050 | 0.048 | 0.058 | 0.078 | 0.060
 | 0.064 | 0.072 | 0.132 | 0.486 | 0.920 | 0.054 | 0.066 | 0.048 | 0.056 | 0.064
 | 0.074 | 0.202 | 0.406 | 0.904 | 0.978 | 0.070 | 0.222 | 0.414 | 0.826 | 0.956
(Type I Error) | 0.054 | 0.058 | 0.046 | 0.042 | 0.060 | 0.046 | 0.054 | 0.038 | 0.032 | 0.054
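For reference, the empirical type I error and power entries in Table 1 are simply rejection proportions over the 500 Monte Carlo replicates; a minimal sketch (names illustrative):

```python
import numpy as np

def rejection_rate(pvalues, alpha=0.05):
    """Proportion of Monte Carlo replicates in which H0 is rejected at level alpha."""
    return float(np.mean(np.asarray(pvalues) < alpha))
```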
5 Real Data Applications
We conducted two genetic association analyses by applying the proposed sieve quasi-likelihood ratio test based on neural networks to the gene expression data and the sequencing data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Studies have shown that the hippocampus region in the brain plays a vital part in memory and learning (Mu and Gage, 2011), and that changes in the volume of the hippocampus have a great impact on Alzheimer’s disease (Schuff et al., 2009). For both analyses, we first regressed the logarithm of the hippocampus volume on important covariates (i.e., age, gender, and education status) and then used the residuals obtained as the response variable to fit neural networks. A total of 464 subjects and 15,837 gene expressions were obtained after quality control.
Under the null hypothesis, the gene is not associated with the response. Therefore, we can use the sample average of the response variable as the null estimator. When we fitted neural networks under the alternative hypothesis, we set the number of hidden units as and the upper bound for the -norm of the hidden-to-output weights as . In total, 30,000 iterations were performed. At the th iteration, the learning rate was chosen to be . Table 2 summarizes the top 10 significant genes detected by SQLR and the F-test. Based on the results, the top 10 genes with the smallest p-values detected by the F-test and SQLR are similar.
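The gene-wise analysis described above can be organized as in the following sketch, where the null fit is the sample mean of the covariate-adjusted response and the alternative fit is a neural network on a single gene’s expression values. The form of the statistic and the chi-squared degrees of freedom are assumptions, and the fitting routine is passed in as a generic callable (for example, a wrapper around the subgradient fit sketched in section 4).

```python
import numpy as np
from scipy import stats

def sqlr_gene_scan(expr, resid_y, fit_and_predict, df=1):
    """Screen genes with the SQLR test (sketch).

    expr:            (n_subjects, n_genes) expression matrix
    resid_y:         covariate-adjusted response (residuals), length n_subjects
    fit_and_predict: callable (x, y) -> fitted values of the neural network
                     under the alternative hypothesis
    The null fit is the sample mean of resid_y; the form of the statistic and
    the chi-squared degrees of freedom (df=1) are assumptions.
    """
    n, n_genes = expr.shape
    loss_null = np.mean((resid_y - resid_y.mean()) ** 2)   # null fit: sample mean
    pvals = np.empty(n_genes)
    for g in range(n_genes):
        fitted_alt = fit_and_predict(expr[:, [g]], resid_y)
        loss_alt = np.mean((resid_y - fitted_alt) ** 2)
        lam = n * (loss_null - loss_alt) / loss_alt         # assumed statistic form
        pvals[g] = stats.chi2.sf(lam, df)
    return pvals
```

The top genes in Table 2 would then correspond to the smallest entries of the returned p-values.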
F-test | | SQLR |
---|---|---|---
Gene | p-value | Gene | p-value
SNRNP40 | 5.48E-05 | PPIH | 5.84E-05 |
PPIH | 1.01E-04 | SNRNP40 | 6.91E-05 |
GPR85 | 1.65E-04 | NOD2 | 1.22E-04 |
DNAJB1 | 1.87E-04 | DNAJB1 | 1.66E-04 |
WDR70 | 1.91E-04 | CTBP1-AS2 | 1.94E-04 |
CYP4F2 | 2.64E-04 | GPR85 | 2.21E-04 |
NOD2 | 2.84E-04 | WDR70 | 2.31E-04 |
MEGF9 | 2.85E-04 | KAZALD1 | 2.59E-04 |
CTBP1-AS2 | 3.35E-04 | CYP4F2 | 2.95E-04 |
HNRNPAB | 3.58E-04 | HNRNPAB | 3.72E-04 |
To explore the performance of the proposed SQLR test for categorical predictors, we conducted a genetic association analysis by applying SQLR to the ADNI genotype data in the APOE gene. The APOE gene on chromosome 19 is a well-known AD gene (Strittmatter et al., 1993). For this analysis, we considered each available single-nucleotide polymorphism (SNP) in the APOE gene as the input feature and conducted single-locus association tests across all SNPs in the gene. We used the same response variable as the one used in the previous gene expression study. A total of 780 subjects and 169 SNPs were obtained after quality control.
As in the gene expression study, we used the sample average of the response variable as the null estimator. The tuning parameters used to fit the neural networks were the same as those mentioned before. Table 3 summarizes the top 10 significant SNPs in the APOE gene detected by the SQLR method for neural networks and by the F-test in linear regression, along with their p-values.
As we can see from the results, the majority of the significant SNPs detected by the F-test and the SQLR test overlap. Whether these significant SNPs are biologically meaningful needs further investigation. This shows that the SQLR test based on neural networks has the potential for wider applications; at least in this study, it performs as well as the F-test.
F-test | | SQLR-neural net |
---|---|---|---
SNP | p-value | SNP | p-value
rs10414043 | 1.10E-05 | rs10414043 | 1.18E-05 |
rs7256200 | 1.10E-05 | rs7256200 | 1.18E-05 |
rs769449 | 1.88E-05 | rs769449 | 2.00E-05 |
rs438811 | 1.94E-05 | rs438811 | 2.28E-05 |
rs10119 | 2.42E-05 | rs10119 | 2.59E-05 |
rs483082 | 2.50E-05 | rs483082 | 2.91E-05 |
rs75627662 | 5.32E-04 | rs75627662 | 5.44E-04 |
rs_x139 | 1.76E-03 | rs1038025 | 3.42E-03 |
rs59325138 | 3.01E-03 | rs59325138 | 3.67E-03 |
rs1038025 | 3.15E-03 | rs1038026 | 4.34E-03 |
6 Discussion
Hypothesis-driven studies are quite common in biomedical and public health research. For instance, investigators are typically interested in detecting complex relationships (e.g., non-linear relationships) between genetic variants and diseases in genetic studies. Therefore, significance tests based on a flexible and powerful model are crucial in real-world applications. Although neural networks have achieved great success in pattern recognition, their black-box nature makes it difficult to conduct statistical inference based on them. To fill this gap, we proposed a sieve quasi-likelihood ratio test based on neural networks to test complex associations. The asymptotic chi-squared distribution of the test statistic was established and validated via simulation studies. We also evaluated SQLR by applying it to the gene expression and sequencing data from ADNI.
There are some limitations of the proposed method. First, the underlying function is required to be sufficiently smooth, which may not hold in some applications. Such a requirement is not needed in the goodness-of-fit test proposed in Shen et al. (2021). However, the construction of the goodness-of-fit test requires data splitting, which could potentially reduce its power. Our empirical studies also found that a suitable choice of the step size is crucial for decent performance of the proposed method. Further studies will be conducted on how to choose a suitable step size so that the results can serve as guidance for real data applications.
In section 2, we developed general theories for the SQLR test under the framework of nonparametric regression. The conditions (C1)-(C3) are easy to verify compared with the original ones in Shen and Shi (2005). Such results can be extended to deep neural networks and other models used in artificial intelligence, such as convolutional neural networks or long short-term memory recurrent neural networks, as long as one can obtain a good bound on the metric entropy for the class of functions.
Acknowledgements
This work is supported by the National Institute on Drug Abuse (Award No. R01DA043501) and the National Library of Medicine (Award No. R01LM012848).
References
- Anthony and Bartlett (2009) Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
- Baouendi and Goulaouic (1974) MS Baouendi and C Goulaouic. Approximation of analytic functions on compact sets and Bernstein’s inequality. Transactions of the American Mathematical Society, 189:251–261, 1974.
- Barron (1993) Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
- Boyd and Mutapcic (2008) Stephen Boyd and Almir Mutapcic. Subgradient methods (notes for EE364B Winter 2006-07, Stanford University), 2008.
- Chen and Shen (1998) Xiaohong Chen and Xiaotong Shen. Sieve extremum estimates for weakly dependent data. Econometrica, pages 289–314, 1998.
- Chen and White (1999) Xiaohong Chen and Halbert White. Improved rates and asymptotic normality for nonparametric neural network estimators. IEEE Transactions on Information Theory, 45(2):682–691, 1999.
- Fukumizu (1996) Kenji Fukumizu. A regularity condition of the information matrix of a multilayer perceptron network. Neural networks, 9(5):871–879, 1996.
- Fukumizu et al. (2003) Kenji Fukumizu et al. Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31(3):833–851, 2003.
- Giné et al. (2000) Evarist Giné, Rafał Latała, and Joel Zinn. Exponential and moment inequalities for U-statistics. In High Dimensional Probability II, pages 13–38. Springer, 2000.
- Goulaouic (1971) Charles Goulaouic. Approximation polynômiale de fonctions et analytiques. Ann. Inst. Fourier Grenoble, 21:149–173, 1971.
- Han and Wellner (2019) Qiyang Han and Jon A Wellner. Convergence rates of least squares regression estimators with heavy-tailed errors. Annals of Statistics, 47(4):2286–2319, 2019.
- Horel and Giesecke (2020) Enguerrand Horel and Kay Giesecke. Significance tests for neural networks. Journal of Machine Learning Research, 21(227):1–29, 2020.
- Leoni (2017) Giovanni Leoni. A first course in Sobolev spaces. American Mathematical Soc., 2017.
- Makovoz (1996) Yuly Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85(1):98–109, 1996.
- McDiarmid (1989) Colin McDiarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.
- Mhaskar (1996) Hrushikesh N Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1):164–177, 1996.
- Minai and Williams (1993) Ali A Minai and Ronald D Williams. On the derivatives of the sigmoid. Neural Networks, 6(6):845–853, 1993.
- Mu and Gage (2011) Yangling Mu and Fred H Gage. Adult hippocampal neurogenesis and its role in Alzheimer’s disease. Molecular Neurodegeneration, 6(1):1–9, 2011.
- Schmidt-Hieber et al. (2020) Johannes Schmidt-Hieber et al. Nonparametric regression using deep neural networks with ReLU activation function. Annals of Statistics, 48(4):1875–1897, 2020.
- Schuff et al. (2009) N Schuff, N Woerner, L Boreta, T Kornfield, LM Shaw, JQ Trojanowski, PM Thompson, CR Jack Jr, MW Weiner, and the Alzheimer’s Disease Neuroimaging Initiative. MRI of hippocampal volume loss in early Alzheimer’s disease in relation to APOE genotype and biomarkers. Brain, 132(4):1067–1077, 2009.
- Shen (1997) Xiaotong Shen. On methods of sieves and penalization. The Annals of Statistics, pages 2555–2591, 1997.
- Shen and Shi (2005) Xiaotong Shen and Jian Shi. Sieve likelihood ratio inference on general parameter space. Science in China Series A: Mathematics, 48(1):67–78, 2005.
- Shen et al. (2019) Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, and Qing Lu. Asymptotic properties of neural network sieve estimators. arXiv preprint arXiv:1906.00875, 2019.
- Shen et al. (2021) Xiaoxi Shen, Chang Jiang, Lyudmila Sakhanenko, and Qing Lu. A goodness-of-fit test based on neural network sieve estimators. Statistics & Probability Letters, page 109100, 2021.
- Siegel and Xu (2020) Jonathan W Siegel and Jinchao Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 2020.
- Strittmatter et al. (1993) Warren J Strittmatter, Ann M Saunders, Donald Schmechel, Margaret Pericak-Vance, Jan Enghild, Guy S Salvesen, and Allen D Roses. Apolipoprotein E: high-avidity binding to beta-amyloid and increased frequency of type 4 allele in late-onset familial Alzheimer disease. Proceedings of the National Academy of Sciences, 90(5):1977–1981, 1993.
- van de Geer (1987) Sara van de Geer. A new approach to least-squares estimation, with applications. The Annals of Statistics, pages 587–602, 1987.
- van de Geer (1988) Sara van de Geer. Regression analysis and empirical processes. CWI, 1988.
- van de Geer (1990) Sara van de Geer. Estimating a regression function. The Annals of Statistics, pages 907–924, 1990.
- van de Geer (2000) Sara van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000.
- van der Vaart and Wellner (1996) Aad W van der Vaart and Jon A Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.
- Wu (1981) Chien-Fu Wu. Asymptotic theory of nonlinear least squares estimation. The Annals of Statistics, pages 501–513, 1981.
Supplementary Materials
Proof of Theorem 1
In this section, we take the sequence . The proof of the theorem relies on the following lemmas.
Lemma 10.
Under (C1)-(C3), for a sufficiently large ,
Proof.
We first note that
For a sufficiently large , the strong law of large numbers implies that a.s., and hence
Moreover,
On the other hand, under (C1) and (C2), it follows from Lemma 5.4 in van de Geer (2000) that
(9) |
which implies that and then
Under (C2) and (C3), we have
and for a large enough ,
and
Therefore, we obtain
∎
Lemma 11.
Under (C1) - (C3),
Proof.
Note that
(10) |
Now, we define
and
Let be i.i.d. Rademacher random variables independent of and . It then follows from Corollary 2.2.8 in van der Vaart and Wellner (1996) that
Moreover, based on (C2), we obtain
where the last inequality follows from the mean value theorem for integrals. Hence, and
By Proposition 20, we know that
which implies that
(11) |
The desired claim then follows by combining (C3) with equations (11) and (10). ∎
We now prove Theorem 1.
Proof.
Note that for , we have
With probability tending to 1, . Since
we have
By subtracting these two equations, we have
It then follows from the definition of that
From Lemma 10, we know that
From Lemma 11, we have
Therefore, putting all the pieces together, we have
which implies that
By replacing with , we get
and the desired result follows immediately. ∎
Proof of Theorem 2
In what follows, we consider and . Under (C1)-(C3), it follows from Theorem 1 and Lemma 22 that if
The proof of Theorem 2 relies on the following lemmas.
Lemma 12 (Convergence Rate for ).
Under (C1) and (C2),
Proof.
As , it suffices to show that . Note that for any ,
Under (C2), we know that . It then follows from the definition of that, for any , there exists such that,
and hence
Note that
Let . By a standard peeling argument, we have
Let . Then on
which implies that
Therefore
where
As , by Markov’s inequality and triangle inequality,
For , it follows from Chebyshev’s inequality, symmetrization inequality, contraction principle and moment inequality (see Proposition 3.1 of Giné et al. (2000)) that
where are i.i.d. Rademacher random variables independent of . Similarly, we have
Then
Under (C2), we have , which implies that . Moreover, for , we have . Therefore, we obtain
which implies that, for a sufficiently large ,
Therefore,
which implies . ∎
Lemma 13 (Local Approximation).
Suppose that . Then, under (C1)-(C3) and ,
and
Proof.
First, note that for ,
Under (C3), it is clear that
On the other hand, since , we have
where the last equality follows from Theorem 1. For a sufficiently large , we have
which implies that
(12) |
By replacing with and considering , we have
For any , under , we have
which implies that . Moreover, noting that , we have
where the last equality follows from Lemma 22 and the weak law of large numbers. Now, since ,
and then
Therefore,
(13) |
which proves the desired results. ∎
The problem here is that may not be in , so we need to construct an approximate minimizer having similar properties. Set
Note that, for any and a sufficiently large ,
Under and , we have
(14) |
By using the -inequality, there exists a certain constant such that
Let , then
Therefore, by (14),
Furthermore, by continuity of and the mean value theorem, there exists some such that and . This implies that . Clearly, for a large .
Lemma 14.
Under (C1)-(C3), we have
Proof.
Note that
Moreover, since
it follows from Cauchy-Schwarz inequality that
On the other hand, note that
which implies that
From (C3), we have
which proves the desired result. ∎
We are now ready to finish the proof of Theorem 2.
Rate of Convergence of Approximate Sieve Extremum Estimators
We start with a general result on the rate of convergence of sieve estimators under the setup of nonparametric regression. The notations in this section are inherited from section 2 in the main text.
Lemma 15.
For every and every , we have
Proof.
First, note that
The triangle inequality gives
Therefore, we have
so that for every satisfying , i.e. , we have
(15) |
which implies that . By squaring both sides of (15), we obtain
Hence, for , we have
∎
Lemma 16.
For every sufficiently large and , under (C1)
Proof.
Note that
we obtain
We start by bounding . As , it follows from Hoeffding’s inequality that
and hence
which implies that for a sufficiently large ,
On the other hand, it follows from symmetrization inequality that
where are i.i.d. Rademacher random variables independent of . On the other hand, since is uniformly bounded, we know that
According to the contraction principle and Corollary 2.2.8 in van der Vaart and Wellner (1996),
where . Similar to the arguments used in bounding , we have
which implies that for a sufficiently large . Therefore, for a sufficiently large ,
Finally, for every sufficiently large , as for some , it follows from the multiplier inequality (Lemma 2.9.1 in van der Vaart and Wellner (1996)) that
where the last inequality follows as the upper bound for the local Rademacher complexity does not depend on . Combining all the pieces together, we obtain the desired result. ∎
Based on Lemma 15 and Lemma 16, the rate of convergence for the approximate sieve extremum estimators can be easily obtained via an application of Theorem 3.4.1 in van der Vaart and Wellner (1996).
Theorem 17.
Suppose that for some function and for every sufficiently large and . Suppose that is decreasing on for some . Let satisfy
Then if and , we get
and
Rate of Convergence of Multiplier Processes
Proposition 18 (Proposition 5 in Han and Wellner (2019)).
Suppose that are i.i.d. mean zero random variables independent of i.i.d. random variables . Then, for any function class ,
where are i.i.d. Rademacher random variables independent of and , and are the reversed order statistics for , with being an independent copy of .
As a consequence of Proposition 18, we can obtain the following result.
Proposition 19.
Under the same assumption in Proposition 18 and
- (i)
If for some and the sequence is non-decreasing, then
- (ii)
If for some and the sequence is non-decreasing, then
Proof.
Remark 4.
If , then is non-decreasing when . is non-decreasing when . This shows that to obtain the desired result, the “rate of convergence” should not converge to 0 too fast.
Proposition 19 can be used to obtain the rate of convergence of the multiplier process, which is just a straightforward application of Markov’s inequality.
Proposition 20.
Under the same assumption in Proposition 18 and
- (i)
If for some and the sequence is non-decreasing, then
- (ii)
If for some and the sequence is non-decreasing, then
Auxiliary Results
Proposition 21.
For all non-negative integer ,
Proof.
We prove this result by induction. For , the identity holds trivially according to the definition. Now, suppose that the result holds for , then
where in equation (i), we let , and equation (ii) follows from the induction hypothesis. Hence the desired result follows. ∎
Lemma 22.
Suppose that , then
In particular,
Proof.
Consider the function class
Let be a minimal -cover of with respect to the -norm so that . Define
Note that for any , there exists such that . For such a function , we can find so that . Moreover, since
we know that forms an -cover for , and hence
which implies that under (C1). On the other hand, since is uniformly bounded, we know that for any . Hence, for any ,
which implies that is uniformly bounded. It then follows from a theorem in van de Geer (2000) that is a Donsker class. Thus, from Lemma 2.3.11 in van der Vaart and Wellner (1996), for any sequence ,
where . For , set and . Note that
we obtain
Moreover, let and note that
by McDiarmid’s inequality (McDiarmid, 1989), for all ,
which implies that and hence the desired claim follows. ∎