
Statistical Inference and Large-scale Multiple Testing for High-dimensional Regression Models

T. Tony Cai (tcai@wharton.upenn.edu), Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, USA

Zijian Guo (zijguo@stat.rutgers.edu), Department of Statistics, Rutgers University, Piscataway, USA

Yin Xia (xiayin@fudan.edu.cn), Department of Statistics and Data Science, Fudan University, Shanghai, China
Abstract

This paper presents a selective survey of recent developments in statistical inference and multiple testing for high-dimensional regression models, including linear and logistic regression. We examine the construction of confidence intervals and hypothesis tests for various low-dimensional objectives such as regression coefficients and linear and quadratic functionals. The key technique is to generate debiased and desparsified estimators for the targeted low-dimensional objectives and to quantify their uncertainty. In addition to covering the motivations for and intuitions behind these statistical methods, we discuss their optimality and adaptivity in the context of high-dimensional inference. We also review recent developments in statistical inference based on multiple regression models and advances in large-scale multiple testing for high-dimensional regression. The R package SIHR implements some of the high-dimensional inference methods discussed in this paper.

keywords:
confidence interval, debiasing, false discovery rate, hypothesis testing, linear functionals, quadratic functionals, simultaneous inference

1 Introduction

High-dimensional data analysis has become a vital part of scientific research in many fields. While the abundance of high-dimensional data offers numerous opportunities for statistical analysis, it also presents significant technical challenges, as the number of variables can be much larger than the number of observations. In high-dimensional settings, most classical inferential procedures, such as maximum likelihood estimation, are no longer valid. In recent years, there has been significant progress in developing new theories and methods for parameter estimation, hypothesis testing, confidence interval construction, and large-scale simultaneous inference in the context of high-dimensional data analysis.

In this paper, we provide a selective survey of recent advances in statistical inference and multiple testing for high-dimensional regression models, commonly used in modern data analysis across various fields such as genetics, metabolomics, finance, health, and economics. Much progress has been made in estimation and prediction under the high-dimensional linear and generalized linear models (GLMs); see, for example, tibshirani1996regression ; candes2007dantzig ; zou2005regularization ; zou2006adaptive ; bickel2009simultaneous ; buhlmann2011statistics ; negahban2009unified ; van2009conditions ; huang2012estimation ; fan2001variable ; zhang2010nearly ; belloni2011square ; sun2012scaled ; bunea2008honest ; bach2010self ; meier2008group ; bellec2018slope ; friedman2010regularization ; efron2004least ; greenshtein2004persistence . Theoretical properties have been established in different settings, including the minimax estimation rate and the rate of convergence for the estimation and prediction errors of several penalized procedures.

Uncertainty quantification is at the heart of many critical scientific applications and is more challenging than point estimation. For high-dimensional regression models, although the Lasso and other penalized estimators have been shown to achieve the optimal rates of convergence for estimation, these estimators suffer from non-negligible bias that makes them unsuitable to be directly used for statistical inference, as noted in several studies (van2014asymptotically, ; javanmard2014confidence, ; zhang2014confidence, ). To overcome this issue, debiased inference methods have been developed in (van2014asymptotically, ; javanmard2014confidence, ; zhang2014confidence, ) that correct the bias of penalized estimators and allow for statistical inference based on the debiased estimators. The development of debiased inference methods has led to an increase in research on statistical inference for a wide range of low-dimensional objectives in different high-dimensional models; see, for example, guo2021inference ; guo2019optimal ; guo2021group ; cai2020semisupervised ; ma2022statistical ; cai2017confidence ; athey2018approximate ; cai2021statistical ; ning2017general ; ren2015asymptotic ; yu2018confidence ; fang2020test ; zhou2020estimation ; zhao2014general ; javanmard2020flexible ; chen2019inference ; zhu2018linear ; guo2018testing ; cai2018high ; eftekhari2021inference ; deshpande2018accurate ; fang2017testing ; neykov2018unified ; dezeure2015high . In particular, cai2017confidence studied the minimaxity and adaptivity of confidence intervals for general linear functionals of a high-dimensional regression vector and found significant differences between the cases of sparse and dense loading vectors. Another important inference tool, known as Neyman’s orthogonalization or double machine learning, has been proposed in econometrics to enable inference for low-dimensional objectives with high-dimensional nuisance parameters; see, for example, (belloni2014inference, ; chernozhukov2015valid, ; farrell2015robust, ; chernozhukov2018double, ; belloni2017program, ).

In the single regression model (one-sample) setting, we observe data $\{Y_{k},X_{k,\cdot}\}_{1\leq k\leq n}$, where $Y_{k}\in\mathbb{R}$ and $X_{k,\cdot}\in\mathbb{R}^{p+1}$ denote the outcome and the high-dimensional covariates, respectively, generated independently from the high-dimensional GLM,

Y_{k}=h(X_{k,\cdot}^{\intercal}\beta)+\epsilon_{k},\quad\text{for}\quad 1\leq k\leq n \qquad (1)

with ${\mathbb{E}}(\epsilon_{k}|X_{k,\cdot})=0$ and the high-dimensional regression vector $\beta\in\mathbb{R}^{p+1}$. Throughout the paper, we use $\beta_{1}$ to denote the intercept and set $X_{k,1}=1$. We assume that the high-dimensional covariates $X_{k,-1}\in\mathbb{R}^{p}$ are centered and sub-gaussian and that the matrix $\Sigma\equiv{\mathbb{E}}X_{k,\cdot}X_{k,\cdot}^{\intercal}\in\mathbb{R}^{(p+1)\times(p+1)}$ is well-conditioned. We focus on the linear and logistic regression models with the link functions $h(z)=z$ and $h(z)=\exp(z)/[1+\exp(z)]$, respectively. The regression vector $\beta$ is assumed to be sparse, and its sparsity level is denoted by $\|\beta\|_{0}$. The high-dimensional covariates $X_{k,\cdot}$ might come from a large number of measured covariates or from basis transformations of the baseline covariates. For the linear model, we further assume the error $\epsilon_{k}$ is sub-gaussian with homoscedastic regression error $\sigma_{\epsilon}^{2}={\mathbb{E}}(\epsilon_{k}^{2}|X_{k,\cdot})$.

In addition to the one-sample setting, we examine statistical inference methods for two-sample high-dimensional regression models. For $d=1,2$, we assume that the data $\{Y^{\scriptscriptstyle(d)}_{k},X^{\scriptscriptstyle(d)}_{k,\cdot}\}_{1\leq k\leq n_{d}}$ are i.i.d. generated, following

Y^{\scriptscriptstyle(d)}_{k}=h([X^{\scriptscriptstyle(d)}_{k,\cdot}]^{\intercal}\beta^{\scriptscriptstyle(d)})+\epsilon^{\scriptscriptstyle(d)}_{k},\quad\text{for}\quad 1\leq k\leq n_{d}, \qquad (2)

where ${\mathbb{E}}(\epsilon^{\scriptscriptstyle(d)}_{k}|X^{\scriptscriptstyle(d)}_{k,\cdot})=0$ and $h(\cdot)$ is the pre-specified link function.

Based on the models above, this paper first addresses the challenges of making statistical inferences for low-dimensional objectives (e.g., linear and quadratic functionals) in high-dimensional regression, both in one- and two-sample settings. Specifically, the following quantities are of particular interest.

1. Linear functional $x_{\rm new}^{\intercal}\beta$ with $x_{\rm new}\in\mathbb{R}^{p+1}$ in the one-sample setting. The linear functional $x_{\rm new}^{\intercal}\beta$ includes as special cases the single regression coefficient $\beta_{j}$ (van2014asymptotically; javanmard2014confidence; zhang2014confidence; cai2017confidence) when $x_{\rm new}$ is the $j$-th canonical unit vector, and the conditional mean of the outcome under (1) when $x_{\rm new}$ is a future observation's covariates. When $x_{\rm new}$ denotes the average of the covariate observations for a group, $x_{\rm new}^{\intercal}\beta$ is closely related to the average treatment effect athey2018approximate . In logistic regression, the linear functional $x_{\rm new}^{\intercal}\beta$ is closely related to the case probability guo2021inference .

2. Quadratic functionals $\beta_{G}^{\intercal}A\beta_{G}$ and $\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}$ with $G\subset\{1,2,\cdots,p+1\}$ and $A\in\mathbb{R}^{|G|\times|G|}$ in the one-sample setting. For a subset $G$, these quadratic functionals measure the total effect of the variables in $G$. Statistical inference for quadratic functionals can be motivated by the group significance test and the (local) genetic heritability estimation guo2019optimal ; guo2021group . The inference method can be generalized to handle heterogeneous effect tests, hierarchical testing, prediction loss evaluation, and confidence ball construction guo2021group ; cai2020semisupervised .

3. Difference between linear functionals $h(x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(2)})-h(x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(1)})$ with $x_{\rm new}\in\mathbb{R}^{p+1}$ in the two-sample setting. This difference measures the discrepancy between the conditional means, which is closely related to individual treatment selection for the new observation $x_{\rm new}\in\mathbb{R}^{p+1}$ cai2021optimal .

4. Inner products of regression vectors $[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)}$ and $[\beta^{\scriptscriptstyle(1)}]^{\intercal}A\beta^{\scriptscriptstyle(2)}$ with the weighting matrix $A\in\mathbb{R}^{(p+1)\times(p+1)}$ in the two-sample setting. The inner product of regression vectors, or its weighted version, measures the similarity between the two regression vectors. In genetic studies, such inner products can be used as a genetic relatedness measure when the covariates are genetic variants and the outcome variables are different phenotypes guo2019optimal ; ma2022statistical .

We examine statistical inference procedures for linear and quadratic functionals from both methodological and theoretical perspectives. We also discuss the optimality results for the corresponding estimation and inference problems. A user-friendly R package SIHR rakshit2021sihr has been developed to implement the statistical inference methods for the low-dimensional objectives mentioned above. This package provides a convenient way to apply the discussed statistical inference methods.

Beyond the aforementioned inference for a single coordinate of the regression vector or other one-dimensional functionals, we also discuss the simultaneous inference of high-dimensional regression models. This includes using global methods with maximum-type statistics to test the entire regression coefficient vector (dezeure2017high, ; zhang2017simultaneous, ; ma2021global, ), as well as component-wise simultaneous inference methods that control the false discovery rate (FDR). Specifically, we examine the one-sample testing of high-dimensional linear regression coefficients liuluo2014 , the comparison of two high-dimensional linear regression models xia2018two , and the joint testing of regression coefficients across multiple responses xia2018joint . We also extend our discussion to logistic regression models. We discuss these large-scale multiple testing problems focusing on controlling the asymptotic FDR.

While error rate control is important for simultaneous inference, statistical power is also crucial. However, many existing testing methods for high-dimensional linear models do not consider auxiliary information, such as model sparsity and heteroscedasticity, that could improve statistical power. While there has been a significant amount of research on methods to enhance power in multiple testing in general (BenHoc97; storey2002direct; genovese2006false; Roeder2009; Ignatiadis2016; LeiFithian2018; Li2019; LAWS; fithian2022conditional, among many others), recent efforts have also focused on simultaneous inference methods that incorporate auxiliary information to improve the power of high-dimensional regression analysis. For example, xia2018joint achieved power gains by leveraging similarities across multivariate responses; xia2020gap explored the sparsity information hidden in the data structures and improved the power through $p$-value weighting mechanisms; liu2021integrative obtained power enhancement by integrating heterogeneous linear models. In the current paper, we primarily focus on power enhancement in a two-sample inference setting where the high-dimensional objects of interest are individually sparse. We will discuss methods for controlling the FDR with and without power enhancement, along with the related theoretical aspects.

The rest of the paper is organized as follows. We finish this section by introducing the notation. Section 2 discusses the debiased inference idea for the regression coefficients, and Section 3 presents the debiased methods for linear and quadratic functionals in one- and two-sample settings. Section 4 focuses on simultaneous inference for high-dimensional regression models. We conclude the paper by discussing other related works in Section 5.

Notation. For an event $E$, denote by $\mathbb{I}\{E\}$ its indicator function. For an index set $J\subset\{1,2,\cdots,p\}$ and a vector $x\in\mathbb{R}^{p}$, $x_{J}$ is the sub-vector of $x$ with indices in $J$ and $x_{-J}$ is the sub-vector with indices in $J^{c}$. For a set $S$, $|S|$ denotes the cardinality of $S$. For a vector $x\in\mathbb{R}^{p}$, the $\ell_{q}$ norm of $x$ is defined as $\|x\|_{q}=\left(\sum_{l=1}^{p}|x_{l}|^{q}\right)^{\frac{1}{q}}$ for $q\geq 0$, with $\|x\|_{0}=|\{1\leq l\leq p:x_{l}\neq 0\}|$ and $\|x\|_{\infty}=\max_{1\leq l\leq p}|x_{l}|$. Denote by $e_{j}$ the $j$-th canonical unit vector and by ${\rm I}_{p}\in\mathbb{R}^{p\times p}$ the identity matrix. For a symmetric matrix $A$, denote by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ its maximum and minimum eigenvalues, respectively. For a matrix $A\in\mathbb{R}^{n\times p}$, $A_{\cdot,j}$ and $A_{i,\cdot}$ respectively denote the $j$-th column and $i$-th row of $A$, $A_{i,j}$ denotes the $(i,j)$-th entry of $A$, $A_{i,-j}$ denotes the $i$-th row of $A$ with its $j$-th entry removed, $A_{-i,j}$ denotes the $j$-th column of $A$ with its $i$-th entry removed, $A_{i,-\{j_{1},j_{2}\}}$ denotes the $i$-th row of $A$ with its $j_{1}$-th and $j_{2}$-th entries both removed, and $A_{-i,-j}\in\mathbb{R}^{(n-1)\times(p-1)}$ denotes the submatrix of $A$ with its $i$-th row and $j$-th column removed. We use $c$ and $C$ to denote generic positive constants that may vary from place to place. For two positive sequences $a_{n}$ and $b_{n}$, $a_{n}\lesssim b_{n}$ means there exists a constant $C>0$ such that $a_{n}\leq Cb_{n}$ for all $n$; $a_{n}\asymp b_{n}$ if $a_{n}\lesssim b_{n}$ and $b_{n}\lesssim a_{n}$; and $a_{n}\ll b_{n}$ if $\mathop{\overline{\rm lim}}_{n\rightarrow\infty}{a_{n}}/{b_{n}}=0$. For a sequence of random variables $Z_{n}$, we write $Z_{n}=o_{{\mathbb{P}}}\{a_{n}\}$ if $Z_{n}/a_{n}$ converges to zero in probability and $Z_{n}=O_{{\mathbb{P}}}\{a_{n}\}$ if $Z_{n}/a_{n}$ is bounded in probability as $n\rightarrow\infty$.

2 Inference for Regression Coefficients

In this section, we begin with a discussion in Section 2.1 of several commonly used penalized estimators for high-dimensional GLMs. We review in Section 2.2 the debiased methods for linear models introduced in zhang2014confidence ; van2014asymptotically ; javanmard2014confidence and discuss their extensions to high-dimensional logistic regression in Section 2.3. We also present the optimality of confidence interval construction in both linear and logistic high-dimensional regression models.

2.1 Estimation in high-dimensional regression

For the high-dimensional linear model (1), a commonly used estimator of the regression vector $\beta$ is the Lasso estimator tibshirani1996regression , defined as

\widehat{\beta}=\mathop{\rm arg\min}_{\beta\in\mathbb{R}^{p+1}}\frac{\|Y-X\beta\|_{2}^{2}}{2n}+\lambda_{0}\sum_{j=2}^{p+1}\frac{\|X_{\cdot,j}\|_{2}}{\sqrt{n}}|\beta_{j}|, \qquad (3)

with the tuning parameter $\lambda_{0}=A\sigma_{\epsilon}\sqrt{\log p/n}$ for some positive constant $A>2$. In the penalized regression (3), we do not penalize the intercept $\beta_{1}$. The tuning parameter $\lambda_{0}$ is typically chosen by cross-validation, as implemented in the R package glmnet friedman2010regularization . With the Lasso estimator $\widehat{\beta}$, the variance $\sigma_{\epsilon}^{2}$ can be estimated by $\widehat{\sigma}_{\epsilon}^{2}=\tfrac{1}{n}\|Y-X\widehat{\beta}\|^{2}_{2}$.
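To make the estimation step concrete, the following is a minimal R sketch (not the SIHR implementation) of fitting the penalized regression (3) with a cross-validated tuning parameter via glmnet, which leaves the intercept unpenalized by default, together with the residual-based estimate of $\sigma_{\epsilon}^{2}$; the simulated data and variable names are purely illustrative.

```r
# Illustrative sketch: Lasso fit for the linear model (3) with cross-validated
# tuning, followed by the residual-based noise-level estimate.
library(glmnet)

set.seed(1)
n <- 200; p <- 500
X <- matrix(rnorm(n * p), n, p)                 # covariates without the intercept column
beta <- c(rep(1, 5), rep(0, p - 5))             # sparse regression vector
y <- as.numeric(1 + X %*% beta + rnorm(n))      # intercept beta_1 = 1, sigma_eps = 1

cv_fit   <- cv.glmnet(X, y)                                  # lambda_0 chosen by cross-validation
beta_hat <- as.numeric(coef(cv_fit, s = "lambda.min"))       # (p+1)-vector, intercept first

resid     <- y - cbind(1, X) %*% beta_hat
sigma_hat2 <- sum(resid^2) / n                  # estimate of sigma_epsilon^2
```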

The tuning parameter $\lambda_{0}$ in (3) depends on the noise level $\sigma_{\epsilon}$. Alternative estimators have been proposed such that the tuning parameter does not depend on the unknown noise level (e.g., sun2012scaled ; belloni2011square ). Particularly, sun2012scaled proposed the scaled Lasso estimator

\{\widehat{\beta},\widehat{\sigma}_{\epsilon}\}=\arg\min_{\beta\in\mathbb{R}^{p+1},\sigma_{\epsilon}\in\mathbb{R}^{+}}\frac{\|y-X\beta\|_{2}^{2}}{2n\sigma_{\epsilon}}+\frac{\sigma_{\epsilon}}{2}+\lambda_{0}\sum_{j=2}^{p+1}\frac{\|X_{\cdot,j}\|_{2}}{\sqrt{n}}|\beta_{j}| \qquad (4)

with the tuning parameter $\lambda_{0}=A\sqrt{\log p/n}$ for some positive constant $A>2$. Beyond the estimators mentioned above, a wide collection of estimators of high-dimensional regression vectors has been proposed (e.g., zou2005regularization ; zou2006adaptive ; buhlmann2011statistics ; negahban2009unified ; huang2012estimation ; fan2001variable ; zhang2010nearly ; belloni2011square ).

For the high-dimensional logistic model (1), penalized methods have also been well developed to estimate $\beta\in\mathbb{R}^{p+1}$ (e.g., bunea2008honest ; bach2010self ; buhlmann2011statistics ; meier2008group ; negahban2009unified ; huang2012estimation ). In this paper, we use the penalized log-likelihood estimator $\widehat{\beta}$ in buhlmann2011statistics , defined as

\widehat{\beta}=\mathop{\rm arg\min}_{\beta}\frac{1}{n}\sum_{k=1}^{n}\{\log[1+\exp(X_{k,\cdot}^{\intercal}\beta)]-Y_{k}(X_{k,\cdot}^{\intercal}\beta)\}+\lambda_{0}\sum_{j=2}^{p+1}\frac{\|X_{\cdot,j}\|_{2}}{\sqrt{n}}|\beta_{j}|, \qquad (5)

with a positive tuning parameter $\lambda_{0}\asymp\sqrt{\log p/n}$.
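Analogously, a minimal R sketch of the penalized logistic estimator (5) can be obtained from glmnet with the binomial family; the data generation below is illustrative, and the cross-validated tuning parameter stands in for $\lambda_{0}\asymp\sqrt{\log p/n}$.

```r
# Illustrative sketch: penalized logistic regression as in (5) via glmnet.
library(glmnet)

set.seed(1)
n <- 300; p <- 400
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(0.5, 5), rep(0, p - 5))
prob <- 1 / (1 + exp(-(0.5 + drop(X %*% beta))))   # logistic link h(z)
y <- rbinom(n, 1, prob)

cv_fit   <- cv.glmnet(X, y, family = "binomial")
beta_hat <- as.numeric(coef(cv_fit, s = "lambda.min"))   # intercept first
```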

2.2 Debiased or desparsified estimators in linear models

The penalized estimators introduced in Section 2.1 have been shown to achieve the optimal convergence rate and satisfy desirable variable selection properties meinshausen2006high ; bickel2009simultaneous ; zhao2006model ; wainwright2009sharp . However, (van2014asymptotically, ; javanmard2014confidence, ; zhang2014confidence, ) highlighted that the Lasso and other penalized estimators are not ready for statistical inference due to the non-negligible estimation bias. They further proposed correcting the penalized estimators’ bias and then making inferences based on the bias-corrected estimators.

In the following, we present the main idea of the bias correction method. To illustrate the main idea, we fix the parameter index $2\leq j\leq p+1$ and focus on the confidence interval construction for $\beta_{j}$ in the model (1). With $\widehat{\beta}$ denoting the Lasso estimator in (3), the main idea of the method proposed in zhang2014confidence ; javanmard2014confidence is to estimate the error of the plug-in estimator, $\widehat{\beta}_{j}-\beta_{j}$. The approximation of the error $\widehat{\beta}_{j}-\beta_{j}$ can be motivated by the following decomposition: for any vector $u\in\mathbb{R}^{p+1}$,

u^{\intercal}\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}(Y_{k}-X_{k,\cdot}^{\intercal}\widehat{\beta})-(\beta_{j}-\widehat{\beta}_{j})=u^{\intercal}\frac{1}{n}X^{\intercal}\epsilon+\left(\widehat{\Sigma}u-e_{j}\right)^{\intercal}(\beta-\widehat{\beta}), \qquad (6)

with $\widehat{\Sigma}=\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}X_{k,\cdot}^{\intercal}$. We explain how to construct the vector $u$ by balancing the two terms on the right-hand side of the decomposition in (6). The first term on the right-hand side of (6) has conditional variance $\sigma_{\epsilon}^{2}/n\cdot u^{\intercal}\widehat{\Sigma}u$, while the second term can be further upper-bounded by

\left|\left(\widehat{\Sigma}u-e_{j}\right)^{\intercal}(\beta-\widehat{\beta})\right|\leq\|\widehat{\Sigma}u-e_{j}\|_{\infty}\|\beta-\widehat{\beta}\|_{1}. \qquad (7)

An algorithmic method of constructing $u$ is to constrain the bias and minimize the variance. Particularly, zhang2014confidence ; javanmard2014confidence proposed the following construction of the projection direction,

\widehat{u}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\widehat{\Sigma}u\quad\text{subject to}\quad\|\widehat{\Sigma}u-e_{j}\|_{\infty}\leq\lambda \qquad (8)

with $\lambda\asymp\sqrt{{\log p}/{n}}$ denoting a positive tuning parameter. The construction in (8) is designed to minimize the conditional variance $u^{\intercal}\widehat{\Sigma}u$ of the "asymp normal" term and to constrain $\|\widehat{\Sigma}u-e_{j}\|_{\infty}$, which further controls the "remaining bias" term as in (7).

With the projection direction $\widehat{u}$ in (8), zhang2014confidence ; javanmard2014confidence introduced the debiased estimator,

\widehat{\beta}^{\rm Deb}_{j}=\widehat{\beta}_{j}+\widehat{u}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}(Y_{k}-X_{k,\cdot}^{\intercal}\widehat{\beta}),\quad\text{for}\quad 2\leq j\leq p+1. \qquad (9)

Following from (6), we obtain the following error decomposition of $\widehat{\beta}^{\rm Deb}_{j}$,

\widehat{\beta}^{\rm Deb}_{j}-\beta_{j}=\widehat{u}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}\epsilon_{k}+(\widehat{\Sigma}\widehat{u}-e_{j})^{\intercal}(\beta-\widehat{\beta}). \qquad (10)

The first term on the right-hand side of (10) is asymptotically normal with asymptotic variance $(\sigma_{\epsilon}^{2}/n)\cdot\widehat{u}^{\intercal}\widehat{\Sigma}\widehat{u}$. As implied by (7), the projection direction $\widehat{u}$ in (8) controls the second term on the right-hand side of (10) by the upper bound $\lambda\|\beta-\widehat{\beta}\|_{1}$, with the tuning parameter $\lambda$ specified in (8). This upper bound can be shown to be of rate $\|\beta\|_{0}\log p/n$, so the rate of convergence of the estimation error $\widehat{\beta}^{\rm Deb}_{j}-\beta_{j}$ is $\frac{1}{\sqrt{n}}+\|\beta\|_{0}\frac{\log p}{n}$, which is the minimax optimal rate for estimating the regression coefficient $\beta_{j}$ ren2015asymptotic ; cai2021statistical . More importantly, zhang2014confidence ; javanmard2014confidence have shown that the debiased estimator $\widehat{\beta}^{\rm Deb}_{j}$ in (9) is approximately unbiased and asymptotically normal when $\|\beta\|_{0}\ll\sqrt{n}/\log p$. Based on the asymptotic normality, zhang2014confidence ; javanmard2014confidence constructed the following confidence interval

\mathrm{CI}=\left(\widehat{\beta}^{\rm Deb}_{j}-z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}},\quad\widehat{\beta}^{\rm Deb}_{j}+z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right)\quad\text{with}\quad\widehat{\mathrm{V}}=\frac{\widehat{\sigma}_{\epsilon}^{2}}{n}\widehat{u}^{\intercal}\widehat{\Sigma}\widehat{u}, \qquad (11)

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution.
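The following R sketch illustrates the full pipeline for a single coefficient: the projection direction is obtained by solving (8) with the generic convex-optimization package CVXR (an illustrative reimplementation, not the code of the cited papers or of SIHR), and the debiased estimator (9) and interval (11) are then formed; the dimensions, tuning choices, and variable names are assumptions made for the example.

```r
# Illustrative sketch: projection direction (8), debiased estimator (9), CI (11).
library(glmnet)
library(CVXR)

set.seed(1)
n <- 200; p <- 120                         # moderate p so the quadratic program solves quickly
Xd <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 3), rep(0, p - 3))
y <- as.numeric(1 + Xd %*% beta + rnorm(n))

X <- cbind(1, Xd)                          # include the intercept column X_{k,1} = 1
beta_hat <- as.numeric(coef(cv.glmnet(Xd, y), s = "lambda.min"))
sigma_hat2 <- sum((y - X %*% beta_hat)^2) / n

j <- 2                                     # coefficient of interest (first covariate)
Sigma_hat <- crossprod(X) / n
e_j <- rep(0, p + 1); e_j[j] <- 1
lambda <- sqrt(log(p) / n)                 # tuning parameter of order sqrt(log p / n)

u <- Variable(p + 1)
prob <- Problem(Minimize(quad_form(u, Sigma_hat)),
                list(max_entries(abs(Sigma_hat %*% u - e_j)) <= lambda))  # infinity-norm constraint
u_hat <- as.numeric(solve(prob)$getValue(u))

beta_deb <- beta_hat[j] + mean((X %*% u_hat) * (y - X %*% beta_hat))      # debiased estimator (9)
V_hat <- sigma_hat2 * as.numeric(t(u_hat) %*% Sigma_hat %*% u_hat) / n
ci <- beta_deb + c(-1, 1) * qnorm(0.975) * sqrt(V_hat)                    # 95% interval (11)
```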

Remark 1.

In the low-dimensional setting, we may set $\lambda=0$ in (8) and obtain $\widehat{u}=\widehat{\Sigma}^{-1}e_{j}$. This choice of $\widehat{u}$ reduces the debiased estimator in (9) to the OLS estimator. The debiased estimator is also referred to as the "desparsified" estimator zhang2014confidence ; van2014asymptotically since $\widehat{\beta}^{\rm Deb}_{j}$ is generally not zero even if the true $\beta_{j}$ is zero. Hence, even if $\beta$ is a sparse vector, the vector $\widehat{\beta}^{\rm Deb}=(\widehat{\beta}^{\rm Deb}_{1},\widehat{\beta}^{\rm Deb}_{2},\cdots,\widehat{\beta}^{\rm Deb}_{p+1})^{\intercal}$ is dense. Consequently, the vector $\widehat{\beta}^{\rm Deb}$ does not estimate $\beta$ well in general, even though $\widehat{\beta}^{\rm Deb}_{j}$ is an optimal estimator of $\beta_{j}$ for every $2\leq j\leq p+1$.

2.2.1 Optimality of statistical inference

For the high-dimensional linear model, cai2021statistical established the minimax expected length of confidence intervals over the parameter space

\Theta(s)=\left\{\theta=(\beta,\Sigma,\sigma_{\epsilon}):\|\beta\|_{0}\leq s,\;c_{0}\leq\lambda_{\min}(\Sigma)\leq\lambda_{\max}(\Sigma)\leq C_{0},\;0<\sigma_{\epsilon}\leq C_{1}\right\}, \qquad (12)

where $C_{0}\geq c_{0}>0$ and $C_{1}>0$ are positive constants. The space $\Theta(s)$ contains all regression vectors with at most $s$ non-zero elements. As established in cai2021statistical , the minimax expected length of confidence intervals over $\Theta(s)$ in the regime $s\lesssim n/\log p$ is $\frac{1}{\sqrt{n}}+s\frac{\log p}{n}$. When $s\lesssim\frac{\sqrt{n}}{\log p}$, the optimal length $1/\sqrt{n}$ can be achieved by the confidence interval in (11). Over the regime $\frac{\sqrt{n}}{\log p}\ll s\lesssim n/\log p$, cai2021statistical proposed a confidence interval attaining the optimal rate $s\log p/n$, where the construction requires prior knowledge of the sparsity level $s$. We illustrate the minimax expected length in Figure 1.

Figure 1: An illustration of the minimax optimality and adaptivity of the confidence intervals with respect to the sparsity $s$ of $\beta$ for the unknown design setting. The top of the figure reports the minimax expected lengths of the confidence intervals; the bottom of the figure indicates whether adaptation to the sparsity $s$ is possible.

More importantly, cai2017confidence studied the possibility of constructing a rate-optimal adaptive confidence interval. Here, an adaptive confidence interval means that, even without knowledge of the sparsity level $s$, the length of the constructed confidence interval automatically adapts to $s$. Since the sparsity of the regression vector $\beta$ is generally unknown, it is desirable to construct an adaptive confidence interval. The work cai2017confidence established the following adaptivity results: if the design covariance matrix $\Sigma$ is unknown, it is possible to construct an adaptive confidence interval only in the ultra-sparse regime $s\lesssim\sqrt{n}/\log p$. That is, if $\sqrt{n}/\log p\ll s\lesssim n/\log p$, it is impossible to construct a rate-optimal confidence interval that is adaptive to the sparsity level. This phase transition in the possibility of constructing adaptive confidence intervals is presented in Figure 1.

The information of $\Sigma$ is critical for constructing optimal confidence intervals. If $\Sigma$ is known, the minimax expected length is $1/\sqrt{n}$ over the entire sparse regime $s\lesssim n/\log p$; see javanmard2018debiasing ; cai2017confidence for details. This contrasts sharply with the optimality results in Figure 1.

2.2.2 Another viewpoint: debiasing with decorrelation

In this subsection, we take a short detour to introduce a slightly different view of the debiased Lasso estimator proposed in zhang2014confidence ; van2014asymptotically , which takes the form

\widetilde{\beta}^{\rm Deb}_{j}=\frac{Z_{\cdot,j}^{\intercal}(Y-X_{\cdot,-j}\widehat{\beta}_{-j})}{Z_{\cdot,j}^{\intercal}X_{\cdot,j}}, \qquad (13)

where $Z_{\cdot,j}\in\mathbb{R}^{n}$ is a decorrelation vector to be specified. zhang2014confidence ; van2014asymptotically proposed to construct the vector $Z_{\cdot,j}\in\mathbb{R}^{n}$ as the residual $Z_{\cdot,j}=X_{\cdot,j}-X_{\cdot,-j}\widehat{\gamma}$, with the Lasso estimator $\widehat{\gamma}=\mathop{\rm arg\min}_{\gamma}\frac{1}{2n}\|X_{\cdot,j}-X_{\cdot,-j}\gamma\|_{2}^{2}+\lambda_{\gamma}\sum_{l\neq j}\frac{\|X_{\cdot,l}\|_{2}}{\sqrt{n}}|\gamma_{l}|$, where $\lambda_{\gamma}>0$ is a positive tuning parameter. The KKT condition ensures that the residual $Z_{\cdot,j}=X_{\cdot,j}-X_{\cdot,-j}\widehat{\gamma}$ is nearly orthogonal to all columns of $X_{\cdot,-j}$, via

\frac{1}{n}\|Z_{\cdot,j}^{\intercal}X_{\cdot,-j}\|_{\infty}\leq\lambda_{\gamma}\cdot\max_{l\neq j}\frac{\|X_{\cdot,l}\|_{2}}{\sqrt{n}}.

To see the effectiveness of the estimator in (13), we examine its estimation error

\widetilde{\beta}^{\rm Deb}_{j}-\beta_{j}=\frac{Z_{\cdot,j}^{\intercal}\epsilon}{Z_{\cdot,j}^{\intercal}X_{\cdot,j}}+\frac{Z_{\cdot,j}^{\intercal}X_{\cdot,-j}(\beta_{-j}-\widehat{\beta}_{-j})}{Z_{\cdot,j}^{\intercal}X_{\cdot,j}}. \qquad (14)

In the above expression, the first term on the right-hand side can be shown to be asymptotically normal under regularity conditions, while the second term is constrained as

\left|\frac{Z_{\cdot,j}^{\intercal}X_{\cdot,-j}(\beta_{-j}-\widehat{\beta}_{-j})}{Z_{\cdot,j}^{\intercal}X_{\cdot,j}}\right|\leq\frac{1}{n}\|Z_{\cdot,j}^{\intercal}X_{\cdot,-j}\|_{\infty}\cdot\frac{\|\beta_{-j}-\widehat{\beta}_{-j}\|_{1}}{\frac{1}{n}|Z_{\cdot,j}^{\intercal}X_{\cdot,j}|}\leq C\lambda_{\gamma}\|\widehat{\beta}-\beta\|_{1},

where the last inequality holds with high probability for some positive constant $C>0$. With the above argument, zhang2014confidence ; van2014asymptotically have shown that the first term in the decomposition (14) is the dominating term in the ultra-sparse regime $\|\beta\|_{0}\ll\sqrt{n}/\log p$. This leads to the asymptotic normality of the estimator $\widetilde{\beta}^{\rm Deb}_{j}$. zhang2014confidence ; van2014asymptotically further constructed confidence intervals based on this asymptotic normality. In the current paper, we focus on methods generalizing the debiased estimator in (9), instead of the decorrelation form in (13). However, the decorrelation idea has been extended to handle other statistical inference problems, including the high-dimensional generalized linear model ning2017general ; van2014asymptotically , the Gaussian graphical model ning2017general , the high-dimensional confounding model guo2020doubly ; sun2022decorrelating , and the high-dimensional additive model guo2019local .
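As an illustration of the decorrelation viewpoint, the following R sketch forms $Z_{\cdot,j}$ as the residual of a nodewise Lasso regression of $X_{\cdot,j}$ on the remaining columns and then computes (13); the data generation and tuning choices are illustrative assumptions, and the intercept is omitted for simplicity.

```r
# Illustrative sketch of the decorrelation-based estimator (13).
library(glmnet)

set.seed(1)
n <- 200; p <- 300
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 3), rep(0, p - 3))
y <- as.numeric(X %*% beta + rnorm(n))

j <- 1
cv0 <- cv.glmnet(X, y, intercept = FALSE)                       # initial Lasso fit beta_hat
beta_hat <- as.numeric(coef(cv0, s = "lambda.min"))[-1]         # drop the (zero) intercept entry

node_fit <- cv.glmnet(X[, -j], X[, j])                          # nodewise Lasso of X_j on X_{-j}
Z_j <- X[, j] - as.numeric(predict(node_fit, X[, -j], s = "lambda.min"))  # decorrelation residual

beta_tilde <- sum(Z_j * (y - X[, -j] %*% beta_hat[-j])) / sum(Z_j * X[, j])  # estimator (13)
```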

2.3 Debiasing in binary outcome models

We generalize the debiased estimator in (9) to the high-dimensional GLM with a binary outcome. Similarly to the Lasso estimator, the penalized logistic regression estimator $\widehat{\beta}$ in (5) suffers from bias due to the $\ell_{1}$ penalty. The bias-corrected estimator is proposed in the following generic form,

\widehat{\beta}^{\rm Deb}_{j}=\widehat{\beta}_{j}+\widehat{u}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}W_{k}X_{k,\cdot}(Y_{k}-h(X_{k,\cdot}^{\intercal}\widehat{\beta})), \qquad (15)

where $W_{k}\in\mathbb{R}$, for $1\leq k\leq n$, are weights to be specified and $\widehat{u}\in\mathbb{R}^{p+1}$ is the projection direction to be specified. We apply a Taylor expansion of the function $h$ and obtain

Y_{k}-h(X_{k,\cdot}^{\intercal}\widehat{\beta})=h(X_{k,\cdot}^{\intercal}\beta)-h(X_{k,\cdot}^{\intercal}\widehat{\beta})+\epsilon_{k}=h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}^{\intercal}(\beta-\widehat{\beta})+R_{k}+\epsilon_{k} \qquad (16)

with the approximation error $R_{k}=\int_{0}^{1}(1-t)h^{\prime\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta}+tX^{\intercal}_{k\cdot}(\beta-\widehat{\beta}))dt\cdot(X^{\intercal}_{k\cdot}(\widehat{\beta}-\beta))^{2}$. We plug the Taylor expansion (16) into the weighted bias-correction estimator in (15), leading to the error decomposition of $\widehat{\beta}^{\rm Deb}_{j}-\beta_{j}$ as

\underbrace{\frac{1}{n}\sum_{k=1}^{n}\widehat{u}^{\intercal}X_{k,\cdot}W_{k}\epsilon_{k}}_{\text{asymp normal}}+\underbrace{\frac{1}{n}\sum_{k=1}^{n}\left(W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\widehat{u}-e_{j}\right)^{\intercal}(\beta-\widehat{\beta})}_{\text{remaining bias}}+\underbrace{\frac{1}{n}\sum_{k=1}^{n}\widehat{u}^{\intercal}X_{k,\cdot}W_{k}R_{k}}_{\text{nonlinearity bias}}. \qquad (17)

In the following, we describe two ways of specifying the weights $W_{k}$.

1. Linearization weighting. guo2021inference proposed to construct the weight $W_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})$. Then the "remaining bias" term in (17) reduces to $\frac{1}{n}\sum_{k=1}^{n}\left(X_{k,\cdot}X_{k,\cdot}^{\intercal}\widehat{u}-e_{j}\right)^{\intercal}(\beta-\widehat{\beta})$, which is the same as the corresponding term in linear regression. This enables us to directly adopt the projection direction $\widehat{u}$ constructed in (8). This connection reveals the advantage of the weight $W_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})$: the bias correction developed under the linear regression model can be directly extended to logistic regression.

2. Link-specific weighting. For a general link function $h(\cdot)$, cai2021statistical constructed the weight $W_{k}=h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})/[h(X_{k,\cdot}^{\intercal}\widehat{\beta})(1-h(X_{k,\cdot}^{\intercal}\widehat{\beta}))]$. If $h$ is the logistic link, we have $h^{\prime}(\cdot)=h(\cdot)(1-h(\cdot))$ and obtain the constant weight $W_{k}=1$. Such link-specific weighting can be generalized to other binary outcome models (e.g., the probit model) with various link functions $h(\cdot)$; see details in cai2021statistical .

After specifying the weights, the projection direction can be constructed as

\widehat{u}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u \qquad (18)
\text{subject to}\quad\left\|\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u-e_{j}\right\|_{\infty}\leq\lambda,\quad\|Xu\|_{\infty}\leq\tau,

with positive tuning parameters $\lambda\asymp\sqrt{{\log p}/{n}}$ and $\tau\asymp\sqrt{\log n}$. The construction in (18) can be motivated from a similar viewpoint as (8). The constraint $\left\|\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u-e_{j}\right\|_{\infty}\leq\lambda$ is imposed to constrain the "remaining bias" term in (17), and $\|Xu\|_{\infty}\leq\tau$ is imposed to constrain the "nonlinearity bias" in (17). For the linearization weighting, the conditional variance of $\frac{1}{n}\sum_{k=1}^{n}{u}^{\intercal}X_{k,\cdot}W_{k}\epsilon_{k}$ is $u^{\intercal}\left(\frac{1}{n}\sum_{k=1}^{n}W^{2}_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u$, which is of the same order as the objective function in (18) for bounded $W_{k}$. So, instead of minimizing the exact variance, we minimize a scaled conditional variance in (18), which has the advantage of leading to almost the same optimization as in (8) for the linear model.

Theoretical properties of the debiased estimator (15) have been established for the logistic outcome model. With the weights $W_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})$ for the linearization weighting, guo2021inference established the asymptotic normality of $\widehat{\beta}^{\rm Deb}_{j}$ in (15). cai2021statistical established a similar theoretical property for $\widehat{\beta}^{\rm Deb}_{j}$ in (15) with the weights $W_{k}=1$ for the link-specific weighting. The asymptotic normality results in both works require the ultra-sparse condition $\|\beta\|_{0}\ll\sqrt{n}/[\log p\log n]$. The use of the weights $W_{k}=1$ in cai2021statistical leads to a smaller standard error than the weights $W_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})$ proposed in guo2021inference . The theoretical justification in cai2021statistical requires sample splitting, so that the initial estimator $\widehat{\beta}$ is constructed from an independent sample. Such sample splitting is not required in the analysis of guo2021inference , which is part of the benefit of linearization weighting.

Based on the asymptotic normality, we construct the confidence interval

\mathrm{CI}=\left(\widehat{\beta}^{\rm Deb}_{j}-z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}},\quad\widehat{\beta}^{\rm Deb}_{j}+z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right)\quad\text{with}\quad\widehat{\mathrm{V}}=\frac{1}{n}\widehat{u}^{\intercal}\widehat{\Sigma}^{G}\widehat{u}, \qquad (19)

where $\widehat{\Sigma}^{G}=\frac{1}{n}\sum_{k=1}^{n}W_{k}^{2}h(X_{k,\cdot}^{\intercal}\widehat{\beta})(1-h(X_{k,\cdot}^{\intercal}\widehat{\beta}))X_{k,\cdot}X_{k,\cdot}^{\intercal}$.
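To make the weighting schemes concrete, the following R sketch computes the two sets of weights $W_{k}$ for a fitted penalized logistic model, the weighted score used in (15), and the matrix $\widehat{\Sigma}^{G}$ appearing in (19); it is an illustrative sketch with assumed names, and the projection direction itself would be obtained from (18) as in the linear case.

```r
# Illustrative sketch: weights W_k for the logistic model and the matrix in (19).
library(glmnet)

set.seed(1)
n <- 300; p <- 150
Xd <- matrix(rnorm(n * p), n, p)
beta <- c(rep(0.5, 3), rep(0, p - 3))
y <- rbinom(n, 1, 1 / (1 + exp(-drop(Xd %*% beta))))

X <- cbind(1, Xd)
beta_hat <- as.numeric(coef(cv.glmnet(Xd, y, family = "binomial"), s = "lambda.min"))

h  <- function(z) 1 / (1 + exp(-z))            # logistic link
hp <- function(z) h(z) * (1 - h(z))            # its derivative h'(z)
lp <- drop(X %*% beta_hat)                     # linear predictor X_k' beta_hat

W_lin  <- 1 / hp(lp)                           # linearization weighting
W_link <- hp(lp) / (h(lp) * (1 - h(lp)))       # link-specific weighting; equals 1 for the logistic link

score   <- colMeans(W_lin * X * (y - h(lp)))   # (1/n) sum_k W_k X_k (Y_k - h(X_k' beta_hat)) in (15)
Sigma_G <- crossprod(sqrt(W_lin^2 * h(lp) * (1 - h(lp))) * X) / n   # Sigma_hat^G in (19)
```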

The optimality of confidence interval construction in high-dimensional logistic regression was studied in cai2021statistical . The minimax expected length and the regime in which adaptive confidence intervals can be constructed are similar to those in Figure 1, up to a polynomial order of $\log n$; see the results in cai2021statistical .

3 Linear and Quadratic Functionals Inference

We consider in this section statistical inference for linear and quadratic transformations of the regression vectors under high-dimensional linear and logistic regressions. We investigate both one- and two-sample regression models. In Section 3.7, we discuss the R package SIHR rakshit2021sihr that implements these methods.

3.1 Linear functionals for linear regression

For a given vector $x_{\rm new}\in\mathbb{R}^{p+1}$, we present the construction of the point estimator and confidence interval for $x_{\rm new}^{\intercal}\beta$ under the high-dimensional linear model (1). Similar to inference for $\beta_{j}$, the plug-in estimator $x_{\rm new}^{\intercal}\widehat{\beta}$, obtained by directly plugging in the Lasso estimator $\widehat{\beta}$ in (3), suffers from estimation bias. The work cai2021optimal proposed the following bias-corrected estimator,

\widehat{x_{\rm new}^{\intercal}\beta}=x_{\rm new}^{\intercal}\widehat{\beta}+\widehat{u}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}(Y_{k}-X_{k,\cdot}^{\intercal}\widehat{\beta}), \qquad (20)

with the projection direction $\widehat{u}$ defined as

\widehat{u}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\widehat{\Sigma}u\quad\text{subject to}\quad\left\|\widehat{\Sigma}u-x_{\rm new}\right\|_{\infty}\leq\|x_{\rm new}\|_{2}\lambda \qquad (21)
\left|x_{\rm new}^{\intercal}\widehat{\Sigma}u-\|x_{\rm new}\|_{2}^{2}\right|\leq\|x_{\rm new}\|_{2}^{2}\lambda, \qquad (22)

where $\widehat{\Sigma}=\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}X_{k,\cdot}^{\intercal}$ and $\lambda\asymp\sqrt{{\log p}/{n}}$ is a positive tuning parameter.

The debiased estimator in (20) satisfies the following error decomposition,

\widehat{x_{\rm new}^{\intercal}\beta}-x_{\rm new}^{\intercal}\beta=\underbrace{\widehat{u}^{\intercal}\frac{1}{n}X^{\intercal}\epsilon}_{\text{asymp normal}}+\underbrace{\left(\widehat{\Sigma}\widehat{u}-x_{\rm new}\right)^{\intercal}(\beta-\widehat{\beta})}_{\text{remaining bias}}. \qquad (23)

The construction in (21), without the additional constraint (22), can be viewed as a direct generalization of (8) obtained by replacing $e_{j}$ with the general loading $x_{\rm new}$. Specifically, (21) minimizes the conditional variance of the "asymp normal" term in (23) and controls the "remaining bias" term through the inequality $|(\widehat{\Sigma}u-x_{\rm new})^{\intercal}(\beta-\widehat{\beta})|\leq\|\widehat{\Sigma}u-x_{\rm new}\|_{\infty}\|\beta-\widehat{\beta}\|_{1}$. Such a construction of the projection direction for the linear functional has been proposed in cai2017confidence ; athey2018approximate . However, this direct generalization is not universally effective for all loadings. As established in Proposition 2 of cai2021optimal , the projection direction constructed without the additional constraint (22) does not correct the bias for a wide class of dense loading vectors.

We shall emphasize that the new constraint in (22) is crucial to ensuring the asymptotic normality of $\widehat{x_{\rm new}^{\intercal}\beta}-x_{\rm new}^{\intercal}\beta$ for any loading vector $x_{\rm new}$. This constraint is imposed so that the variance of the "asymp normal" term in (23) always dominates the "remaining bias" term in (23). With this additional constraint, the projection direction $\widehat{u}$ constructed in (21) and (22) enables effective bias correction for any loading vector $x_{\rm new}$, whether it is sparse or dense. The work cai2021optimal established the asymptotic normality of the estimator $\widehat{x_{\rm new}^{\intercal}\beta}$ in (20) for any loading vector $x_{\rm new}\in\mathbb{R}^{p+1}$. Based on the asymptotic normality, we construct a confidence interval for $x_{\rm new}^{\intercal}\beta$ as

\mathrm{CI}=\left(\widehat{x_{\rm new}^{\intercal}\beta}-z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}},\quad\widehat{x_{\rm new}^{\intercal}\beta}+z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right)\quad\text{with}\quad\widehat{\mathrm{V}}=\frac{\widehat{\sigma}_{\epsilon}^{2}}{n}\widehat{u}^{\intercal}\widehat{\Sigma}\widehat{u}. \qquad (24)
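A minimal R sketch of the linear-functional procedure is given below: the projection direction solves (21)-(22) via CVXR, after which the estimator (20) and the interval (24) are computed; this is an illustrative reimplementation with assumed dimensions and tuning choices, not the SIHR code.

```r
# Illustrative sketch: projection direction (21)-(22), estimator (20), CI (24).
library(glmnet)
library(CVXR)

set.seed(1)
n <- 200; p <- 120
Xd <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 3), rep(0, p - 3))
y <- as.numeric(1 + Xd %*% beta + rnorm(n))

X <- cbind(1, Xd)
beta_hat <- as.numeric(coef(cv.glmnet(Xd, y), s = "lambda.min"))
sigma_hat2 <- sum((y - X %*% beta_hat)^2) / n

x_new <- c(1, rnorm(p))                     # loading vector (a future observation's covariates)
Sigma_hat <- crossprod(X) / n
lambda <- sqrt(log(p) / n)
xSig <- as.numeric(crossprod(x_new, Sigma_hat))   # constant row vector x_new' Sigma_hat

u <- Variable(p + 1)
constr <- list(max_entries(abs(Sigma_hat %*% u - x_new)) <= sqrt(sum(x_new^2)) * lambda,  # (21)
               abs(t(xSig) %*% u - sum(x_new^2)) <= sum(x_new^2) * lambda)                # (22)
u_hat <- as.numeric(solve(Problem(Minimize(quad_form(u, Sigma_hat)), constr))$getValue(u))

lf_hat <- sum(x_new * beta_hat) + mean((X %*% u_hat) * (y - X %*% beta_hat))   # estimator (20)
V_hat  <- sigma_hat2 * as.numeric(t(u_hat) %*% Sigma_hat %*% u_hat) / n
ci     <- lf_hat + c(-1, 1) * qnorm(0.975) * sqrt(V_hat)                       # interval (24)
```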

3.2 Linear functionals for logistic regression

We now consider the high-dimensional logistic model and present the inference procedures for $x_{\rm new}^{\intercal}\beta$ and $h(x_{\rm new}^{\intercal}\beta)$ proposed in guo2021inference . In particular, guo2021inference proposed the following debiased estimator,

\widehat{x_{\rm new}^{\intercal}{\beta}}=x_{\rm new}^{\intercal}\widehat{\beta}+\widehat{u}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}W_{k}X_{k,\cdot}\left(Y_{k}-h(X_{k,\cdot}^{\intercal}\widehat{\beta})\right), \qquad (25)

where $\widehat{\beta}$ is defined in (5), $W_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})$ for $1\leq k\leq n$, and the projection direction $\widehat{u}\in\mathbb{R}^{p+1}$ is defined as

\widehat{u}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u \qquad (26)
\text{subject to}\quad\left\|\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u-x_{\rm new}\right\|_{\infty}\leq\|x_{\rm new}\|_{2}\lambda,\quad\|Xu\|_{\infty}\leq\tau,
\left|x_{\rm new}^{\intercal}\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u-\|x_{\rm new}\|_{2}^{2}\right|\leq\|x_{\rm new}\|_{2}^{2}\lambda.

The bias-corrected estimator in (25) can be viewed as a generalization of (20), with the constraints in (26) paralleling those in (21) and (22) and the weighted bias correction incorporated from (15).

It has been established in guo2021inference that $\widehat{x_{\rm new}^{\intercal}{\beta}}$ in (25) is asymptotically unbiased and normal. Based on the asymptotic normality, we construct the following confidence interval

\mathrm{CI}=\left(\widehat{x_{\rm new}^{\intercal}\beta}-z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}},\quad\widehat{x_{\rm new}^{\intercal}\beta}+z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right)\quad\text{with}\quad\widehat{\mathrm{V}}=\frac{1}{n}\widehat{u}^{\intercal}\widehat{\Sigma}^{G}\widehat{u}, \qquad (27)

where $\widehat{\Sigma}^{G}=\frac{1}{n}\sum_{k=1}^{n}W_{k}^{2}h(X_{k,\cdot}^{\intercal}\widehat{\beta})(1-h(X_{k,\cdot}^{\intercal}\widehat{\beta}))X_{k,\cdot}X_{k,\cdot}^{\intercal}$. We estimate the case probability ${\mathbb{P}}(Y_{k}=1|X_{k,\cdot}=x_{\rm new})$ by $h(\widehat{x_{\rm new}^{\intercal}\beta})$ and construct the confidence interval for $h(x_{\rm new}^{\intercal}\beta)$ as

{\rm CI}=\left[h\left(\widehat{x_{\rm new}^{\intercal}\beta}-z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right),\;h\left(\widehat{x_{\rm new}^{\intercal}\beta}+z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right)\right]. \qquad (28)

3.3 Conditional average treatment effects

The inference methods proposed in Sections 3.1 and 3.2 can be generalized to make inferences for conditional average treatment effects, which can be expressed as the difference between two linear functionals. For $1\leq k\leq n$, let $A_{k}\in\{1,2\}$ denote the treatment assignment for the $k$-th observation, where $A_{k}=1$ and $A_{k}=2$ represent the subject receiving the control or the treatment assignment, respectively. Moreover, in the context of comparing treatment effectiveness, $A_{k}=1$ and $A_{k}=2$ may stand for the subject receiving the first or the second treatment assignment, respectively. As a special case of (2), we consider the following conditional outcome models ${\mathbb{E}}(Y_{k}|X_{k,\cdot},A_{k}=1)=X_{k,\cdot}^{\intercal}\beta^{\scriptscriptstyle(1)}$ and ${\mathbb{E}}(Y_{k}|X_{k,\cdot},A_{k}=2)=X_{k,\cdot}^{\intercal}\beta^{\scriptscriptstyle(2)}$. For an individual with $X_{k,\cdot}=x_{\rm new}$, we define $\Delta(x_{\rm new})={\mathbb{E}}(Y_{k}|X_{k,\cdot}=x_{\rm new},A_{k}=2)-{\mathbb{E}}(Y_{k}|X_{k,\cdot}=x_{\rm new},A_{k}=1)=x_{\rm new}^{\intercal}(\beta^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(1)})$, which measures the change of the conditional mean from being untreated to treated for individuals with covariates $x_{\rm new}$.

By generalizing (20), we construct the debiased estimators $\widehat{x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(1)}}$ and $\widehat{x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(2)}}$, together with their corresponding variance estimators $\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(1)}}$ and $\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(2)}}$. The paper cai2021optimal proposed to estimate $\Delta(x_{\rm new})$ by $\widehat{\Delta}(x_{\rm new})=\widehat{x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(2)}}-\widehat{x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(1)}}$ and to construct the confidence interval for $\Delta(x_{\rm new})$ as

\mathrm{CI}=\left(\widehat{\Delta}(x_{\rm new})-z_{\frac{\alpha}{2}}\sqrt{\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(1)}}+\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(2)}}},\quad\widehat{\Delta}(x_{\rm new})+z_{\frac{\alpha}{2}}\sqrt{\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(1)}}+\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(2)}}}\right). \qquad (29)

Regarding the hypothesis testing problem $H_{0}:x_{\rm new}^{\intercal}(\beta^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(1)})\leq 0$ versus $H_{1}:x_{\rm new}^{\intercal}(\beta^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(1)})>0$, the paper cai2021optimal proposed the following test procedure,

\phi_{\alpha}(x_{\rm new})=\mathbb{I}\left\{\widehat{\Delta}(x_{\rm new})-z_{\alpha}\sqrt{\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(1)}}+\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(2)}}}\geq 0\right\}. \qquad (30)
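The following R sketch wraps the linear-functional construction of Section 3.1 into a small helper and applies it to two independent samples to form $\widehat{\Delta}(x_{\rm new})$, the interval (29), and the one-sided test (30); the helper function, data generation, and tuning choices are illustrative assumptions.

```r
# Illustrative sketch: two-sample contrast Delta(x_new), interval (29), test (30).
library(glmnet)
library(CVXR)

# Helper: debiased linear functional with projection direction from (21)-(22).
debias_lf <- function(X, y, x_new, lambda = sqrt(log(ncol(X)) / nrow(X))) {
  n <- nrow(X)
  beta_hat <- as.numeric(coef(cv.glmnet(X[, -1], y), s = "lambda.min"))
  sigma2 <- sum((y - X %*% beta_hat)^2) / n
  Sig <- crossprod(X) / n
  xSig <- as.numeric(crossprod(x_new, Sig))
  u <- Variable(ncol(X))
  constr <- list(max_entries(abs(Sig %*% u - x_new)) <= sqrt(sum(x_new^2)) * lambda,
                 abs(t(xSig) %*% u - sum(x_new^2)) <= sum(x_new^2) * lambda)
  u_hat <- as.numeric(solve(Problem(Minimize(quad_form(u, Sig)), constr))$getValue(u))
  est <- sum(x_new * beta_hat) + mean((X %*% u_hat) * (y - X %*% beta_hat))
  list(est = est, var = sigma2 * as.numeric(t(u_hat) %*% Sig %*% u_hat) / n)
}

set.seed(1)
n <- 200; p <- 100
make_sample <- function(beta) {
  Xd <- matrix(rnorm(n * p), n, p); X <- cbind(1, Xd)
  list(X = X, y = as.numeric(X %*% beta + rnorm(n)))
}
beta1 <- c(1, rep(0.5, 3), rep(0, p - 3)); beta2 <- c(1, rep(1, 3), rep(0, p - 3))
d1 <- make_sample(beta1); d2 <- make_sample(beta2)
x_new <- c(1, rnorm(p))

lf1 <- debias_lf(d1$X, d1$y, x_new); lf2 <- debias_lf(d2$X, d2$y, x_new)
Delta_hat <- lf2$est - lf1$est
ci <- Delta_hat + c(-1, 1) * qnorm(0.975) * sqrt(lf1$var + lf2$var)      # interval (29)
reject <- (Delta_hat - qnorm(0.95) * sqrt(lf1$var + lf2$var)) >= 0       # test (30) at alpha = 0.05
```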

As a direct generalization, we may consider the logistic version, ${\mathbb{E}}(Y_{k}|X_{k,\cdot},A_{k}=1)=h(X_{k,\cdot}^{\intercal}\beta^{\scriptscriptstyle(1)})$ and ${\mathbb{E}}(Y_{k}|X_{k,\cdot},A_{k}=2)=h(X_{k,\cdot}^{\intercal}\beta^{\scriptscriptstyle(2)})$. For an individual with $X_{k,\cdot}=x_{\rm new}$, the inference target becomes $\Delta(x_{\rm new})=h(x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(2)})-h(x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(1)})$. The methods in Section 3.2 can be applied here to make inferences for $\Delta(x_{\rm new})$.

3.4 Quadratic functionals

We now focus on inference for the quadratic functionals ${\rm Q}_{A}=\beta^{\intercal}_{G}A\beta_{G}$ and ${\rm Q}_{\Sigma}=\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}$, where $G\subset\{1,\cdots,p+1\}$ and $A\in\mathbb{R}^{|G|\times|G|}$ denotes a pre-specified matrix. Without loss of generality, we set $G=\{2,\cdots,|G|+1\}$. In the following, we discuss the main idea under high-dimensional linear regression, which can be generalized to high-dimensional logistic regression. Let $\widehat{\beta}$ denote the Lasso estimator in (3).

We start with the error decomposition of the plug-in estimator $\widehat{\beta}_{G}^{\intercal}A\widehat{\beta}_{G}$,

\widehat{\beta}_{G}^{\intercal}A\widehat{\beta}_{G}-\beta_{G}^{\intercal}A{\beta_{G}}=2\widehat{\beta}_{G}^{\intercal}A(\widehat{\beta}_{G}-\beta_{G})-(\widehat{\beta}_{G}-\beta_{G})^{\intercal}A(\widehat{\beta}_{G}-\beta_{G}). \qquad (31)

For high-dimensional linear models, guo2021group ; guo2019optimal proposed to construct the bias-corrected estimator by estimating the error component $2\widehat{\beta}_{G}^{\intercal}A(\widehat{\beta}_{G}-\beta_{G})$ on the right-hand side of (31). Since $\widehat{\beta}_{G}^{\intercal}A(\widehat{\beta}_{G}-\beta_{G})$ can be expressed as $x_{\rm new}^{\intercal}(\widehat{\beta}-\beta)$ with $x_{\rm new}=\begin{pmatrix}0&\widehat{\beta}_{G}^{\intercal}A&{\bf 0}^{\intercal}\end{pmatrix}^{\intercal}$, the techniques for estimating the error component of the linear functional can be directly applied to approximate $\widehat{\beta}_{G}^{\intercal}A(\widehat{\beta}_{G}-\beta_{G})$. Particularly, guo2021group ; guo2019optimal proposed the following estimator of ${\rm Q}_{A}$,

\widehat{{\rm Q}}_{A}=\widehat{\beta}_{G}^{\intercal}A\widehat{\beta}_{G}+2\widehat{u}_{A}^{\intercal}X^{\intercal}(Y-X\widehat{\beta})/n, \qquad (32)

where $\widehat{u}_{A}$ denotes the solution of (21) and (22) with $x_{\rm new}=\begin{pmatrix}0&\widehat{\beta}_{G}^{\intercal}A&{\bf 0}^{\intercal}\end{pmatrix}^{\intercal}$. No bias correction is required for the last term on the right-hand side of (31) since it is a higher-order error term under regularity conditions.

We now turn to the estimation of ${\rm Q}_{\Sigma}=\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}$, where the matrix $\Sigma_{G,G}$ has to be estimated from the data. With $\widehat{\Sigma}=\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}X_{k,\cdot}^{\intercal}$, we decompose the estimation error of the plug-in estimator $\widehat{\beta}^{\intercal}_{G}\widehat{\Sigma}_{G,G}\widehat{\beta}_{G}$ as

\widehat{\beta}^{\intercal}_{G}\widehat{\Sigma}_{G,G}\widehat{\beta}_{G}-\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}=2\widehat{\beta}_{G}^{\intercal}\widehat{\Sigma}_{G,G}(\widehat{\beta}_{G}-\beta_{G})+\beta_{G}^{\intercal}(\widehat{\Sigma}_{G,G}-\Sigma_{G,G})\beta_{G}-(\widehat{\beta}_{G}-\beta_{G})^{\intercal}\widehat{\Sigma}_{G,G}(\widehat{\beta}_{G}-\beta_{G}). \qquad (33)

By a similar approach as for (32), guo2021group proposed the following estimator of ${\rm Q}_{\Sigma}$,

\widehat{{\rm Q}}_{\Sigma}=\widehat{\beta}^{\intercal}_{G}\widehat{\Sigma}_{G,G}\widehat{\beta}_{G}+2\widehat{u}_{\Sigma}^{\intercal}X^{\intercal}(Y-X\widehat{\beta})/n, \qquad (34)

where $\widehat{u}_{\Sigma}$ denotes the solution of (21) and (22) with $x_{\rm new}=\begin{pmatrix}0&\widehat{\beta}_{G}^{\intercal}\widehat{\Sigma}_{G,G}&{\bf 0}^{\intercal}\end{pmatrix}^{\intercal}$. As a special case, cai2020semisupervised considered $G=\{1,\cdots,p+1\}$ and constructed $\widehat{u}_{\Sigma}=\widehat{\beta}$.

The works guo2021group ; guo2019optimal established the asymptotic properties for a sample-splitting version of Q^A\widehat{{\rm Q}}_{A} and Q^Σ\widehat{{\rm Q}}_{\Sigma}, where the initial estimator β^\widehat{\beta} is constructed from a subsample independent from the samples {Xk,,Yk}1kn\{X_{k,\cdot},Y_{k}\}_{1\leq k\leq n} used in (32) and (34). Based on the asymptotic properties, guo2021group estimated the variance of Q^A\widehat{{\rm Q}}_{A} by V^A(τ)=4σ^ϵ2u^AΣ^u^A/n+τ/n\widehat{{\rm V}}_{A}(\tau)={4\widehat{\sigma}_{\epsilon}^{2}}\widehat{u}_{A}^{\intercal}\widehat{\Sigma}\widehat{u}_{A}/n+{\tau}/{n} for some positive constant τ>0\tau>0, where the term τ/n\tau/n is introduced as an upper bound for the term (β^GβG)A(β^GβG)(\widehat{\beta}_{G}-\beta_{G})^{\intercal}A(\widehat{\beta}_{G}-\beta_{G}) in (31). guo2021group estimated the variance of Q^Σ\widehat{{\rm Q}}_{\Sigma} by

V^Σ(τ)=4σ^ϵ2nu^ΣΣ^u^Σ+1n2k=1n(β^GXk,GXk,Gβ^Gβ^GΣ^G,Gβ^G)2+τn,\small\widehat{{\rm V}}_{\Sigma}(\tau)=\frac{4\widehat{\sigma}_{\epsilon}^{2}}{n}\widehat{u}_{\Sigma}^{\intercal}\widehat{\Sigma}\widehat{u}_{\Sigma}+\frac{1}{n^{2}}\sum_{k=1}^{n}\left(\widehat{\beta}_{G}^{\intercal}X_{k,G}X_{k,G}^{\intercal}\widehat{\beta}_{G}-\widehat{\beta}_{G}^{\intercal}\widehat{\Sigma}_{G,G}\widehat{\beta}_{G}\right)^{2}+\frac{\tau}{n},

and further constructed the following confidence intervals for QA{\rm Q}_{A} and QΣ{\rm Q}_{\Sigma},

CIA(τ)\displaystyle{\rm CI}_{A}(\tau) =(Q^Azα/2V^A(τ),Q^A+zα/2V^A(τ)),\displaystyle=\left(\widehat{{\rm Q}}_{A}-{z_{\alpha/2}}\cdot\sqrt{\widehat{{\rm V}}_{A}(\tau)},\,\,\widehat{{\rm Q}}_{A}+{z_{\alpha/2}}\cdot\sqrt{\widehat{{\rm V}}_{A}(\tau)}\right), (35)
CIΣ(τ)\displaystyle{\rm CI}_{\Sigma}(\tau) =(Q^Σzα/2V^Σ(τ),Q^Σ+zα/2V^Σ(τ)).\displaystyle=\left(\widehat{{\rm Q}}_{\Sigma}-{z_{\alpha/2}}\cdot\sqrt{\widehat{{\rm V}}_{\Sigma}(\tau)},\,\,\widehat{{\rm Q}}_{\Sigma}+{z_{\alpha/2}}\cdot\sqrt{\widehat{{\rm V}}_{\Sigma}(\tau)}\right).

For a positive definite matrix AA, the significance test H0:βG=0H_{0}:\beta_{G}=0 can be recast as H0,A:βGAβG=0H_{0,A}:\beta_{G}^{\intercal}A\beta_{G}=0 or H0,Σ:βGΣG,GβG=0.H_{0,\Sigma}:\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}=0. The following α\alpha-level significance tests of H0,AH_{0,A} and H0,ΣH_{0,\Sigma} were proposed in guo2021group ,

ϕA(τ)=𝕀{Q^AzαV^A(τ)},ϕΣ(τ)=𝕀{Q^ΣzαV^Σ(τ)}.\small\phi_{A}(\tau)=\mathbb{I}\left\{\widehat{{\rm Q}}_{A}\geq z_{\alpha}\cdot\sqrt{{\widehat{{\rm V}}_{A}(\tau)}}\right\},\quad\phi_{\Sigma}(\tau)=\mathbb{I}\left\{\widehat{{\rm Q}}_{\Sigma}\geq{z_{\alpha}}\cdot\sqrt{\widehat{{\rm V}}_{\Sigma}(\tau)}\right\}. (36)
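To make the constructions concrete, the following R sketch assembles the estimator (34), the variance estimate displayed above, the confidence interval CI_Σ(τ) in (35), and the test φ_Σ(τ) in (36). It assumes that the initial Lasso estimator beta_hat, the projection direction u_hat solving (21) and (22), and the noise-level estimate sigma_eps_hat have already been computed; all object names are illustrative, and this is a minimal sketch rather than the implementation used in guo2021group.

# Minimal sketch: confidence interval (35) and test (36) for Q_Sigma.
# Assumes beta_hat (initial Lasso estimator), u_hat (projection direction
# solving (21)-(22)), sigma_eps_hat (noise level) and the index set G are given.
ci_and_test_QSigma <- function(X, Y, G, beta_hat, u_hat, sigma_eps_hat,
                               tau = 1, alpha = 0.05) {
  n <- nrow(X)
  Sigma_hat <- crossprod(X) / n                                   # sample covariance
  resid <- as.vector(Y - X %*% beta_hat)
  plug_in <- as.numeric(t(beta_hat[G]) %*% Sigma_hat[G, G] %*% beta_hat[G])
  Q_hat <- plug_in + 2 * sum(u_hat * as.vector(t(X) %*% resid)) / n   # estimator (34)
  quad_dev <- as.vector(X[, G, drop = FALSE] %*% beta_hat[G])^2 - plug_in
  V_hat <- 4 * sigma_eps_hat^2 *
             as.numeric(t(u_hat) %*% Sigma_hat %*% u_hat) / n +
           sum(quad_dev^2) / n^2 + tau / n                        # variance estimate V_Sigma(tau)
  z2 <- qnorm(1 - alpha / 2)
  list(CI = c(Q_hat - z2 * sqrt(V_hat), Q_hat + z2 * sqrt(V_hat)),   # interval (35)
       reject = (Q_hat >= qnorm(1 - alpha) * sqrt(V_hat)))           # test (36)
}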

The inference for QA{\rm Q}_{A} and QΣ{\rm Q}_{\Sigma} can be generalized to high-dimensional logistic regression by combining it with the weighted bias correction detailed in (26). The paper ma2022statistical studied the construction of debiased estimators of QΣ{\rm Q}_{\Sigma} under high-dimensional logistic regression.

3.5 Semi-supervised inference

We summarize the optimal estimation of β22\|\beta\|_{2}^{2} and βΣβ\beta^{\intercal}\Sigma\beta in a general semi-supervised setting, with the labelled data (X1,,Y1),,(Xn,,Yn)(X_{1,\cdot},Y_{1}),\cdots,(X_{n,\cdot},Y_{n}) and the unlabelled data Xn+1,,,Xn+N,X_{n+1,\cdot},\cdots,X_{n+N,\cdot}, where the covariates X1,,,Xn+N,X_{1,\cdot},\cdots,X_{n+N,\cdot} are assumed to be identically distributed.

The work cai2020semisupervised proposed the following semi-supervised estimator of βΣβ\beta^{\intercal}\Sigma\beta,

Q^(β^,Σ^S)=β^Σ^Sβ^+2β^1nk=1nXk,(YkXk,β^),Σ^S=1n+Nk=1n+NXk,Xk,.\small{\widehat{{\rm Q}}}(\widehat{\beta},\widehat{\Sigma}^{S})=\widehat{\beta}^{\intercal}\widehat{\Sigma}^{S}\widehat{\beta}+2\widehat{\beta}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}(Y_{k}-X_{k,\cdot}^{\intercal}\widehat{\beta}),\widehat{\Sigma}^{S}=\frac{1}{n+N}\sum_{k=1}^{n+N}X_{k,\cdot}X_{k,\cdot}^{\intercal}. (37)

The matrix estimator Σ^S\widehat{\Sigma}^{S} in (37) utilizes both the labelled and unlabelled data. The work cai2020semisupervised constructed the following confidence interval for βΣβ\beta^{\intercal}\Sigma\beta in the semi-supervised setting,

\small{\rm CI}(Z)=\left(\left({\widehat{{\rm Q}}}(\widehat{\beta},\widehat{\Sigma}^{S})-z_{\alpha/2}\sqrt{\widehat{\rm V}}\right)_{+},\;{\widehat{{\rm Q}}}(\widehat{\beta},\widehat{\Sigma}^{S})+z_{\alpha/2}\sqrt{\widehat{\rm V}}\right),

where V^=1n(4σ^ϵ2β^Σ^Sβ^+ρ^n+Nk=1n+N(β^XkXkβ^β^Σ^Sβ^)2+τ)\small\widehat{\rm V}=\frac{1}{n}\left(4{\widehat{\sigma}_{\epsilon}^{2}\widehat{\beta}^{\intercal}\widehat{\Sigma}^{S}\widehat{\beta}}+{\frac{\widehat{\rho}}{n+N}\sum_{k=1}^{n+N}\left(\widehat{\beta}^{\intercal}X_{k\cdot}X_{k\cdot}^{\intercal}\widehat{\beta}-\widehat{\beta}^{\intercal}\widehat{\Sigma}^{S}\widehat{\beta}\right)^{2}}+\tau\right) with ρ^=n/(N+n)\widehat{\rho}=n/(N+n) and τ>0\tau>0 being a user-specified tuning parameter that accounts for the higher-order estimation error. The above confidence interval construction demonstrates the benefit of integrating the unlabelled data: a larger amount of unlabelled data reduces the ratio ρ^\widehat{\rho} and hence the interval length.
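As a concrete illustration, the following R sketch computes the estimator (37) and the confidence interval above from labelled data (X_lab, Y_lab) and unlabelled covariates X_unlab. The initial estimator beta_hat and the noise-level estimate sigma_eps_hat are assumed to be supplied (for instance from a scaled Lasso fit on the labelled data); all object names are illustrative.

# Minimal sketch of the semi-supervised estimator (37) and its confidence interval.
semisup_QSigma <- function(X_lab, Y_lab, X_unlab, beta_hat, sigma_eps_hat,
                           tau = 1, alpha = 0.05) {
  n <- nrow(X_lab); N <- nrow(X_unlab)
  X_all   <- rbind(X_lab, X_unlab)
  Sigma_S <- crossprod(X_all) / (n + N)              # uses labelled and unlabelled covariates
  plug_in <- as.numeric(t(beta_hat) %*% Sigma_S %*% beta_hat)
  resid   <- as.vector(Y_lab - X_lab %*% beta_hat)
  Q_hat   <- plug_in + 2 * sum(beta_hat * as.vector(t(X_lab) %*% resid)) / n   # estimator (37)
  rho_hat <- n / (n + N)
  quad_all <- as.vector(X_all %*% beta_hat)^2 - plug_in
  V_hat <- (4 * sigma_eps_hat^2 * plug_in +
            rho_hat * sum(quad_all^2) / (n + N) + tau) / n        # variance estimate above
  z <- qnorm(1 - alpha / 2)
  c(lower = max(Q_hat - z * sqrt(V_hat), 0), upper = Q_hat + z * sqrt(V_hat))
}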

Define Θ(s,M)={(β,Σ,σϵ)Θ(s):M2β2M},\Theta\left(s,M\right)=\left\{\left(\beta,\Sigma,\sigma_{\epsilon}\right)\in\Theta(s):{\frac{M}{2}\leq\|\beta\|_{2}\leq M}\right\}, as a subspace of Θ(s)\Theta(s) in (12) with the constraint on β2.\|\beta\|_{2}. In the semi-supervised setting, it was established in cai2020semisupervised that the minimax rate for estimating βΣβ\beta^{\intercal}{\Sigma}\beta over Θ(s,M)\Theta\left(s,M\right) is

M2N+n+min{Mn+slogpn,M2}.\small{M^{2}\over\sqrt{N+n}}+\min\left\{{M\over\sqrt{n}}+{s\log p\over n},M^{2}\right\}. (38)

The estimator Q^(β^,Σ^S){\widehat{{\rm Q}}}(\widehat{\beta},\widehat{\Sigma}^{S}) in (37) is shown to achieve the optimal rate in (38) for MCslogp/nM\geq C\sqrt{s\log p/n}. When MCslogp/nM\leq C\sqrt{s\log p/n}, estimating Q{\rm Q} by zero achieves the optimal rate in (38). The optimal convergence rate in (38) characterizes its dependence on the amount of unlabelled data: with a larger size NN of the unlabelled data, the term M2N+n{M^{2}\over\sqrt{N+n}} decreases. In the semi-supervised setting, the unlabelled data can also be used to construct a more accurate estimator of β22\|\beta\|_{2}^{2}; see Section 4 of cai2020semisupervised for details.

In the following, we consider the two extremes under semi-supervised learning: the supervised learning setting with N=0N=0 and the other extreme setting of knowing Σ=I\Sigma={\rm I}, which can be viewed as a special case of the semi-supervised setting with N.N\rightarrow\infty. In Table 1, we summarize the optimal convergence rate of estimating β22\|\beta\|_{2}^{2} and βΣβ\beta^{\intercal}\Sigma\beta over both Θ(s,M)\Theta(s,M) and Θ0(s,M)\Theta_{0}(s,M), where Θ0(s,M)={(β,Σ,σϵ)Θ(s,M):Σ=I}.\Theta_{0}(s,M)=\left\{\left(\beta,\Sigma,\sigma_{\epsilon}\right)\in\Theta(s,M):\Sigma={\rm I}\right\}. Table 1 shows that, by leveraging the information on Σ\Sigma, the optimal rates of estimating β22\|\beta\|_{2}^{2} and βΣβ\beta^{\intercal}\Sigma\beta are reduced by Mslogpn{M\frac{s\log p}{n}} and M21n,{M^{2}\frac{1}{\sqrt{n}}}, respectively. The improvement can be substantial when the parameter MM, characterizing the 2\ell_{2} norm β2\|\beta\|_{2}, is relatively large. The optimal rate of estimating β22\|\beta\|_{2}^{2} was established in collier2017minimax under the sequence model Yj=βj+1nϵjfor 1jp.Y_{j}=\beta_{j}+\frac{1}{\sqrt{n}}\epsilon_{j}\;\text{for}\;1\leq j\leq p. When we know Σ=I\Sigma={\rm I} and focus on the regime scmin{n/logp,pν}s\leq c\min\{{n/\log p},p^{\nu}\} for 0ν<1/20\leq\nu<1/2, the optimal rate of estimating β22\|\beta\|_{2}^{2} in the high-dimensional linear model matches that in collier2017minimax . However, when Σ\Sigma is unknown, estimating β22\|\beta\|_{2}^{2} is much harder in the high-dimensional linear model than in the sequence model.

Target | Optimal Rate over Θ(s,M)\Theta(s,M) | Optimal Rate over Θ0(s,M)\Theta_{0}(s,M)
β22\|\beta\|_{2}^{2} | min{M1n+slogpn+Mslogpn,M2}\min\left\{M\frac{1}{\sqrt{n}}+\frac{s\log p}{n}+{M\frac{s\log p}{n}},M^{2}\right\} | min{M1n+slogpn,M2}\min\left\{M\frac{1}{\sqrt{n}}+\frac{s\log p}{n},M^{2}\right\}
βΣβ\beta^{\intercal}\Sigma\beta | min{M1n+slogpn+M21n,M2}\min\left\{M\frac{1}{\sqrt{n}}+\frac{s\log p}{n}+{M^{2}\frac{1}{\sqrt{n}}},M^{2}\right\} | same as the β22\|\beta\|_{2}^{2} entry, since Σ=I\Sigma={\rm I} implies βΣβ=β22\beta^{\intercal}\Sigma\beta=\|\beta\|_{2}^{2}
Table 1: The minimax optimal rate of estimating β22\|\beta\|_{2}^{2} over Θ(s,M)\Theta(s,M) was established in guo2019optimal ; the remaining optimal rates are implied by (38), which was established in cai2020semisupervised . The focus is on the regime scmin{n/logp,pν}s\leq c\min\{{n/\log p},p^{\nu}\} for 0ν<1/2.0\leq\nu<1/2.

3.6 Inner products of regression vectors

We next consider the two-sample regression models in (2). When the high-dimensional covariates are genetic variants and the outcome variables measure different phenotypes, then the inner product [β(1)]β(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)} can be interpreted as the genetic relatedness guo2019optimal , measuring the similarity between the two association vectors β(1)\beta^{\scriptscriptstyle(1)} and β(2)\beta^{\scriptscriptstyle(2)}. We present the debiased estimator of [β(1)]β(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)} proposed in guo2019optimal . For d=1,2,d=1,2, let β^(d)\widehat{\beta}^{\scriptscriptstyle(d)} be the Lasso estimator of β(d)\beta^{\scriptscriptstyle(d)}. Similar to the decomposition in (31), we decompose the error of the plug-in estimator [β^(1)]β^(2),[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)},

\displaystyle[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)}-[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)}=\displaystyle[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}(\widehat{\beta}^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(2)})+[\widehat{\beta}^{\scriptscriptstyle(2)}]^{\intercal}(\widehat{\beta}^{\scriptscriptstyle(1)}-\beta^{\scriptscriptstyle(1)}) (39)
\displaystyle-(\widehat{\beta}^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(2)})^{\intercal}(\widehat{\beta}^{\scriptscriptstyle(1)}-\beta^{\scriptscriptstyle(1)}).

The key is to estimate [β^(1)](β^(2)β(2))[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}(\widehat{\beta}^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(2)}) and [β^(2)](β^(1)β(1))[\widehat{\beta}^{\scriptscriptstyle(2)}]^{\intercal}(\widehat{\beta}^{\scriptscriptstyle(1)}-\beta^{\scriptscriptstyle(1)}) separately in the above decomposition; these terms can be viewed as projections of the estimation errors β^(2)β(2)\widehat{\beta}^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(2)} and β^(1)β(1)\widehat{\beta}^{\scriptscriptstyle(1)}-\beta^{\scriptscriptstyle(1)} onto the directions β^(1)\widehat{\beta}^{\scriptscriptstyle(1)} and β^(2),\widehat{\beta}^{\scriptscriptstyle(2)}, respectively.

The work guo2019optimal proposed the following bias-corrected estimator of [β(1)]β(2),[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)},

[β(1)]β(2)^=\displaystyle\widehat{[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)}}= [β^(1)]β^(2)+u^11n2k=1n2[Xk,(2)](Yk(2)[Xk,(2)]β^(2))\displaystyle[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)}+\widehat{u}_{1}^{\intercal}\frac{1}{n_{2}}\sum_{k=1}^{n_{2}}[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(2)}_{k}-[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)}) (40)
+u^21n1k=1n1[Xk,(1)](Yk(1)[Xk,(1)]β^(1))\displaystyle+\widehat{u}_{2}^{\intercal}\frac{1}{n_{1}}\sum_{k=1}^{n_{1}}[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(1)}_{k}-[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(1)})

where the projection direction vectors are constructed as

u^1=argminup+1uΣ^(2)usubject to\displaystyle\widehat{u}_{1}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\widehat{\Sigma}^{\scriptscriptstyle(2)}u\quad\text{subject to} Σ^(2)uβ^(1)β^(1)2λ2\displaystyle\left\|\widehat{\Sigma}^{\scriptscriptstyle(2)}u-\widehat{\beta}^{\scriptscriptstyle(1)}\right\|_{\infty}\leq\|\widehat{\beta}^{\scriptscriptstyle(1)}\|_{2}\lambda_{2}
u^2=argminup+1uΣ^(1)usubject to\displaystyle\widehat{u}_{2}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\widehat{\Sigma}^{\scriptscriptstyle(1)}u\quad\text{subject to} Σ^(1)uβ^(2)β^(2)2λ1\displaystyle\left\|\widehat{\Sigma}^{\scriptscriptstyle(1)}u-\widehat{\beta}^{\scriptscriptstyle(2)}\right\|_{\infty}\leq\|\widehat{\beta}^{\scriptscriptstyle(2)}\|_{2}\lambda_{1}

with Σ^(d)=1ndk=1ndXk,(d)[Xk,(d)]\widehat{\Sigma}^{\scriptscriptstyle(d)}=\frac{1}{n_{d}}\sum_{k=1}^{n_{d}}X^{\scriptscriptstyle(d)}_{k,\cdot}[X^{\scriptscriptstyle(d)}_{k,\cdot}]^{\intercal} and λdlogp/nd\lambda_{d}\asymp\sqrt{{\log p}/{n_{d}}} for d=1,2.d=1,2.

For the estimator (40), the bias-correction terms u^11n2k=1n2[Xk,(2)](Yk(2)[Xk,(2)]β^(2))\widehat{u}_{1}^{\intercal}\frac{1}{n_{2}}\sum_{k=1}^{n_{2}}[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(2)}_{k}-[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)}) and u^21n1k=1n1[Xk,(1)](Yk(1)[Xk,(1)]β^(1))\widehat{u}_{2}^{\intercal}\frac{1}{n_{1}}\sum_{k=1}^{n_{1}}[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(1)}_{k}-[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(1)}) are constructed to estimate [β^(1)](β(2)β^(2))[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}({\beta}^{\scriptscriptstyle(2)}-\widehat{\beta}^{\scriptscriptstyle(2)}) and [β^(2)](β(1)β^(1)),[\widehat{\beta}^{\scriptscriptstyle(2)}]^{\intercal}({\beta}^{\scriptscriptstyle(1)}-\widehat{\beta}^{\scriptscriptstyle(1)}), respectively. The construction of the projection directions u^1\widehat{u}_{1} and u^2\widehat{u}_{2} can be viewed as extensions of that in (21) with xnew=β^(1)x_{\rm new}=\widehat{\beta}^{\scriptscriptstyle(1)} and xnew=β^(2),x_{\rm new}=\widehat{\beta}^{\scriptscriptstyle(2)}, respectively. An additional constraint as in (22) is not needed here since both xnew=β^(1)x_{\rm new}=\widehat{\beta}^{\scriptscriptstyle(1)} and xnew=β^(2)x_{\rm new}=\widehat{\beta}^{\scriptscriptstyle(2)} are sufficiently sparse. The paper guo2019optimal has established the convergence rate of the debiased estimator proposed in (40). The analysis can be extended to establish the asymptotic normal distribution of [β(1)]β(2)^\widehat{[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)}} under suitable conditions.
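To illustrate (40), the R sketch below assembles the bias-corrected estimator from the two plug-in Lasso estimators and the two projection directions. The directions u1_hat and u2_hat are assumed to have been obtained by solving the constrained quadratic programs displayed above (for instance with a generic convex solver); all object names are illustrative.

# Minimal sketch of the bias-corrected inner-product estimator (40), assuming
# Lasso estimators beta1_hat, beta2_hat and projection directions u1_hat, u2_hat
# (solutions of the two constrained optimization problems above) are supplied.
debiased_inner_product <- function(X1, Y1, X2, Y2,
                                   beta1_hat, beta2_hat, u1_hat, u2_hat) {
  n1 <- nrow(X1); n2 <- nrow(X2)
  resid1 <- as.vector(Y1 - X1 %*% beta1_hat)
  resid2 <- as.vector(Y2 - X2 %*% beta2_hat)
  plug_in <- sum(beta1_hat * beta2_hat)
  # the two correction terms estimate the projected errors in (39)
  corr1 <- sum(u1_hat * as.vector(t(X2) %*% resid2)) / n2
  corr2 <- sum(u2_hat * as.vector(t(X1) %*% resid1)) / n1
  plug_in + corr1 + corr2
}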

The quantity [β(1)]Σβ(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}\Sigma\beta^{\scriptscriptstyle(2)} is another genetic relatedness measure if Σ(1)=Σ(2)=Σ\Sigma^{\scriptscriptstyle(1)}=\Sigma^{\scriptscriptstyle(2)}=\Sigma with Σ(d)=𝔼Xk,(d)[Xk,(d)]\Sigma^{\scriptscriptstyle(d)}={\mathbb{E}}X^{\scriptscriptstyle(d)}_{k,\cdot}[X^{\scriptscriptstyle(d)}_{k,\cdot}]^{\intercal} for d=1,2.d=1,2. A debiased estimator [β(1)]Σβ(2)^\widehat{[\beta^{\scriptscriptstyle(1)}]^{\intercal}\Sigma\beta^{\scriptscriptstyle(2)}} can be constructed as [\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}\widehat{\Sigma}\widehat{\beta}^{\scriptscriptstyle(2)}+[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}\frac{1}{n_{2}}\sum_{k=1}^{n_{2}}[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(2)}_{k}-[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)})+[\widehat{\beta}^{\scriptscriptstyle(2)}]^{\intercal}\frac{1}{n_{1}}\sum_{k=1}^{n_{1}}[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(1)}_{k}-[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(1)}). The above results have been extended to logistic regression models, where the quantity [β(1)]Σβ(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}\Sigma\beta^{\scriptscriptstyle(2)} retains its interpretation as a measure of genetic relatedness. The paper ma2022statistical investigated the confidence interval construction for [β(1)]Σβ(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}\Sigma\beta^{\scriptscriptstyle(2)} under several high-dimensional logistic regression models. More generally, one may need to estimate and make inference for [β(1)]Aβ(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}A\beta^{\scriptscriptstyle(2)} with AA denoting a general matrix; the idea in (40) can be extended to this setting. The work guo2020inference applied such a generalized debiased estimator of [β(1)]Aβ(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}A\beta^{\scriptscriptstyle(2)} in an intermediate step to determine the optimal aggregation weights across multiple regression models.

3.7 R package SIHR

The methods reviewed in Sections 2 and 3 have been implemented in the R package SIHR rakshit2021sihr , which is available from CRAN. The SIHR package contains three main functions: LF, QF, and CATE. The LF function implements the confidence interval for xnewβx_{\rm new}^{\intercal}\beta in (24) under high-dimensional linear regression by specifying model="linear", and the confidence intervals for h(xnewβ)h(x_{\rm new}^{\intercal}\beta) in (28) under high-dimensional logistic regression by specifying model="logistic" or model="logisticalter", corresponding to the linearization weighting and the link-specific weighting introduced in Section 2.3, respectively.

The QF function implements the confidence interval construction in (35) and hypothesis testing in (36) with different choices of the weighting matrix AA. The CATE function implements the confidence interval in (29) and hypothesis testing in (30). Both QF and CATE functions can be applied to the logistic regression setting by specifying model="logistic" or model="logisticalter". The detailed usage of the SIHR package can be found in the paper rakshit2021sihr .
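Purely as an illustration of this interface, the snippet below sketches typical calls in R. The function names and the model argument follow the description above; the remaining argument names (X, y, loading, G, A) are assumptions made for illustration and should be checked against the package documentation.

# Illustrative SIHR calls; a sketch only, with argument names to be verified
# against the package documentation.
library(SIHR)

out_lf <- LF(X = X, y = y, loading = x_new, model = "linear")          # CI for x_new^T beta, as in (24)
out_lf_logit <- LF(X = X, y = y, loading = x_new, model = "logistic")  # CI for h(x_new^T beta), as in (28)
out_qf <- QF(X = X, y = y, G = G, A = A, model = "linear")             # CI (35) and test (36) for beta_G^T A beta_G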

4 Multiple Testing

In the previous sections, we have examined statistical inference for individual regression coefficients and related one-dimensional functionals. However, in many applications, such as genomics, it is necessary to perform simultaneous inference for multiple regression coefficients while controlling the false discovery rate (FDR) and the false discovery proportion (FDP). In this section, we explore several large-scale multiple testing procedures for high-dimensional regression models. We start with linear models in Section 4.1 and extend the discussion to logistic models in Section 4.2. The testing procedures are unified in Section 4.3, and power enhancement methods are discussed in Section 4.4.

4.1 Simultaneous inference for linear regression

One-sample simultaneous inference for high-dimensional linear regression coefficients is closely related to the problem of variable selection. Common approaches for variable selection include regularization methods, such as Lasso tibshirani1996regression , SCAD fan2001variable , Adaptive Lasso zou2006adaptive and MCP zhang2010nearly , which simultaneously estimate parameters and select features, and stepwise feature selection techniques like LARS efron2004least and FoBa zhang2011adaptive , which prioritize variable selection. See the discussions in liuluo2014 and references therein. However, both of these approaches aim to find the model that is closest to the truth, which may not be achievable in practice. Alternatively, liuluo2014 approached the problem from a multiple testing perspective and focused on controlling false discoveries rather than achieving perfect selection results. Specifically, for the high-dimensional regression model (1) with link function h(z)=zh(z)=z, i.e., Y=Xβ+ϵ,Y=X\beta+\epsilon, where β=(β1,,βp+1)p+1\beta=(\beta_{1},\dots,\beta_{p+1})^{\intercal}\in\mathbb{R}^{p+1}, X=(X1,,,Xn,)X=(X_{1,\cdot}^{\intercal},\ldots,X_{n,\cdot}^{\intercal})^{\intercal}, Y=(Y1,,Yn)Y=(Y_{1},\ldots,Y_{n})^{\intercal}, and ϵ=(ϵ1,,ϵn)\epsilon=(\epsilon_{1},\dots,\epsilon_{n})^{\intercal}, with {ϵk}\{\epsilon_{k}\} being independent and identically distributed (i.i.d) random variables with mean zero and variance σϵ2\sigma_{\epsilon}^{2} and independent of Xk,X_{k,\cdot}, k=1,,nk=1,\dots,n, liuluo2014 considered the following multiple testing problem,

H0,i:βi=0 versus H1,i:βi0,i=2,,p+1,\displaystyle H_{0,i}:\beta_{i}=0\mbox{ versus }H_{1,i}:\beta_{i}\neq 0,\quad i=2,\dots,p+1, (41)

with the control of FDR and FDP.

In some fields, one-sample inference may not be sufficient, particularly for detecting interactions. For example, as demonstrated in hunter2005gene , many complex diseases are the result of interactions between genes and the environment. Therefore, it is important to thoroughly examine the effects of the environment and its interactions with genetic predispositions on disease phenotypes. When the environmental factor is a binary variable, such as smoking status or gender, interaction detection can be approached under a two-sample high-dimensional regression framework. Specifically, interactions can be identified by comparing two high-dimensional regression models as introduced in (2) with identity link function, i.e., Y(d)=X(d)β(d)+ϵ(d),Y^{\scriptscriptstyle(d)}=X^{\scriptscriptstyle(d)}\beta^{\scriptscriptstyle(d)}+\epsilon^{\scriptscriptstyle(d)}, for d=1,2d=1,2, and recovering the nonzero components of β1(1)β1(2)\beta^{\scriptscriptstyle(1)}_{-1}-\beta^{\scriptscriptstyle(2)}_{-1}, where β(d)=(β1(d),,βp+1(d))p+1\beta^{\scriptscriptstyle(d)}=(\beta_{1}^{\scriptscriptstyle(d)},\dots,\beta_{p+1}^{\scriptscriptstyle(d)})^{\intercal}\in\mathbb{R}^{p+1}, X(d)=(X1,(d),,Xnd,(d))X^{\scriptscriptstyle(d)}=(X_{1,\cdot}^{{\scriptscriptstyle(d)}\intercal},\ldots,X_{n_{d},\cdot}^{{\scriptscriptstyle(d)}\intercal})^{\intercal}, Y(d)=(Y1(d),,Ynd(d))Y^{\scriptscriptstyle(d)}=(Y_{1}^{\scriptscriptstyle(d)},\ldots,Y_{n_{d}}^{\scriptscriptstyle(d)})^{\intercal}, and ϵ(d)=(ϵ1(d),,ϵnd(d))\epsilon^{\scriptscriptstyle(d)}=(\epsilon_{1}^{\scriptscriptstyle(d)},\dots,\epsilon_{n_{d}}^{\scriptscriptstyle(d)})^{\intercal}, with {ϵk(d)}\{\epsilon_{k}^{\scriptscriptstyle(d)}\} being i.i.d random variables with mean zero and variance σϵ(d)2\sigma_{\epsilon^{\scalebox{0.4}{(d)}}}^{2} and independent of Xk,(d)X_{k,\cdot}^{\scriptscriptstyle(d)}, k=1,,ndk=1,\dots,n_{d}. Assume that n1n2n_{1}\asymp n_{2} and let n=max{n1,n2}n=\max\{n_{1},n_{2}\}. Then xia2018two investigated simultaneous testing of the hypotheses

H0,i:βi(1)=βi(2) versus H1,i:βi(1)βi(2),i=2,,p+1,\displaystyle H_{0,i}:\beta_{i}^{\scriptscriptstyle(1)}=\beta_{i}^{\scriptscriptstyle(2)}\;\mbox{ versus }\;H_{1,i}:\beta_{i}^{\scriptscriptstyle(1)}\neq\beta_{i}^{\scriptscriptstyle(2)},\quad i=2,\dots,p+1, (42)

with FDR and FDP control.

In genetic association studies, it is common to measure multiple correlated phenotypes on the same individuals. To detect associations between high-dimensional genetic variants and these phenotypes, one can individually assess the relationship between each response and each covariate, as in (41), and then adjust for multiplicity in the comparisons. However, as noted by Zhou2015 and Schifano2013 , jointly analyzing these phenotypic measurements may increase the power to detect causal genetic variants. Therefore, motivated by the potential to enhance power by leveraging the similarity across multivariate responses, xia2018joint used high-dimensional multivariate regression models to address applications in which DD correlated responses are measured on nn independent individuals:

Yn×D\displaystyle Y_{n\times D} =\displaystyle= Xn×(p+1)B(p+1)×D+Υn×D,\displaystyle X_{n\times(p+1)}B_{(p+1)\times D}+\Upsilon_{n\times D}, (43)

where Y=(Y,1,,Y,D)n×DY=(Y_{\cdot,1},\ldots,Y_{\cdot,D})\in\mathbb{R}^{n\times D}, with Y,d=(Y1,d,,Yn,d)Y_{\cdot,d}=(Y_{1,d},\ldots,Y_{n,d})^{\intercal}, denotes DD responses with DD fixed, and X=(X1,,,Xn,)n×(p+1)X=(X_{1,\cdot}^{\intercal},\ldots,X_{n,\cdot}^{\intercal})^{\intercal}\in\mathbb{R}^{n\times(p+1)} is the covariate matrix. In (43), B=(B,1,,B,D)(p+1)×DB=(B_{\cdot,1},\ldots,B_{\cdot,D})\in\mathbb{R}^{(p+1)\times D}, with B,d=(B1,d,,Bp+1,d)p+1B_{\cdot,d}=(B_{1,d},\dots,B_{p+1,d})^{\intercal}\in\mathbb{R}^{p+1}, represents the regression coefficient matrix, where Bi,B_{i,\cdot} represents the regression coefficients of the ithi^{\scalebox{0.8}{th}} covariate; Υ=(ϵ,1,,ϵ,D)n×D\Upsilon=(\epsilon_{\cdot,1},\ldots,\epsilon_{\cdot,D})\in\mathbb{R}^{n\times D}, where ϵ,d=(ϵ1,d,,ϵn,d)\epsilon_{\cdot,d}=(\epsilon_{1,d},\dots,\epsilon_{n,d})^{\intercal}, and {ϵk,d}\{\epsilon_{k,d}\} are i.i.d random variables with mean zero and variance σϵ2\sigma_{\epsilon}^{2} and independent of XX. To examine whether the ithi^{\scalebox{0.8}{th}} covariate is associated with any of the DD responses, xia2018joint simultaneously tested

H0,i:Bi,=0 versus H1,i:Bi,0,i=2,,p+1,\displaystyle H_{0,i}:B_{i,\cdot}=0\;\mbox{ versus }\;H_{1,i}:B_{i,\cdot}\neq 0,\quad i=2,\dots,p+1, (44)

while controlling FDR and FDP. Because the effects of the ithi^{\scalebox{0.8}{th}} variable on the DD responses may share strong similarities (namely, if Bi,d0B_{i,d}\neq 0, then the remaining entries in this row are more likely to be nonzero), a row-wise testing method that uses this group-wise information is preferable to testing the significance of the matrix BB column by column as in testing problem (41).

4.1.1 Bias corrections via inverse regression

In the multiple testing problems discussed in Section 4.1, our goal is to simultaneously infer the regression coefficients while controlling error rates. Therefore, it is crucial to begin with an asymptotically unbiased estimator for each regression component. The debiasing techniques outlined in Section 2.2 can be used to attain nearly unbiased estimates; however, as noted in liuluo2014 , the constraints or tuning parameters utilized in debiasing can significantly affect test accuracy. Additionally, the asymptotic distribution of these debiased estimators is conditional, making it challenging to characterize the dependence structure among the test statistics, which is vital for error rate control in simultaneous inference. As an alternative, we will explore an inverse regression approach in this section, which establishes the unconditional asymptotic distribution of bias-corrected statistics and allows for explicit characterization of the correlation structure. Alternatively, the debiasing method (13) in zhang2014confidence ; van2014asymptotically can be used as long as the decorrelation vectors Z,jZ_{\cdot,j}’s introduced in Section 2.2.2 are close enough to their population counterparts. This approach will be illustrated in Section 4.2.

Recall that Xk,1(d)=1X^{\scriptscriptstyle(d)}_{k,1}=1. To achieve bias correction, taking testing problem (42) as an example, we consider the inverse regression models obtained by regressing Xk,i(d)X_{k,i}^{\scriptscriptstyle(d)} on (Yk(d),X~k,i(d))(Y_{k}^{\scriptscriptstyle(d)},\widetilde{X}_{k,-i}^{\scriptscriptstyle(d)}), where X~k,i(d)=Xk,{1,i}(d)\widetilde{X}_{k,-i}^{\scriptscriptstyle(d)}=X_{k,-\{1,i\}}^{\scriptscriptstyle(d)}, for i=2,,p+1i=2,\dots,p+1:

Xk,i(1)\displaystyle X_{k,i}^{\scriptscriptstyle(1)} =\displaystyle= αi(1)+(Yk(1),X~k,i(1))γi(1)+ηk,i(1),(k=1,,n1)\displaystyle\alpha_{i}^{\scriptscriptstyle(1)}+(Y_{k}^{\scriptscriptstyle(1)},\widetilde{X}_{k,-i}^{\scriptscriptstyle(1)})\gamma_{i}^{\scriptscriptstyle(1)}+\eta_{k,i}^{\scriptscriptstyle(1)},\quad(k=1,\dots,n_{1})
Xk,i(2)\displaystyle X_{k,i}^{\scriptscriptstyle(2)} =\displaystyle= αi(2)+(Yk(2),X~k,i(2))γi(2)+ηk,i(2),(k=1,,n2)\displaystyle\alpha_{i}^{\scriptscriptstyle(2)}+(Y_{k}^{\scriptscriptstyle(2)},\widetilde{X}_{k,-i}^{\scriptscriptstyle(2)})\gamma_{i}^{\scriptscriptstyle(2)}+\eta_{k,i}^{\scriptscriptstyle(2)},\quad(k=1,\dots,n_{2})

where for d=1,2d=1,2, ηk,i(d)\eta_{k,i}^{\scriptscriptstyle(d)} has mean zero and variance σi,d2\sigma_{i,d}^{2} and is uncorrelated with (Yk(d),X~k,i(d))(Y_{k}^{\scriptscriptstyle(d)},\widetilde{X}_{k,-i}^{\scriptscriptstyle(d)}), and the first component of γi(d)=(γi,1(d),,γi,p(d))\gamma_{i}^{\scriptscriptstyle(d)}=(\gamma_{i,1}^{\scriptscriptstyle(d)},\dots,\gamma_{i,p}^{\scriptscriptstyle(d)})^{\intercal} satisfies

γi,1(d)=σi,d2βi(d)/σϵ(d)2,i=2,,p+1,\displaystyle\gamma_{i,1}^{\scriptscriptstyle(d)}=\sigma_{i,d}^{2}\beta_{i}^{\scriptscriptstyle(d)}/\sigma_{\epsilon^{\scalebox{0.4}{(d)}}}^{2},\quad i=2,\dots,p+1, (45)

where σi,d2=[{βi(d)}2/σϵ(d)2+ωi1,i1(d)]1\sigma_{i,d}^{2}=[\{\beta_{i}^{\scriptscriptstyle(d)}\}^{2}/\sigma_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}+\omega_{i-1,i-1}^{\scriptscriptstyle(d)}]^{-1} with {Cov(Xk,1(d))}1=Ωd=(ωi,j(d))\{\textsf{Cov}(X_{k,-1}^{\scriptscriptstyle(d)})\}^{-1}=\Omega_{d}=(\omega_{i,j}^{\scriptscriptstyle(d)}).

Note that, ri(d)=Cov(ϵk(d),ηk,i(d))r_{i}^{\scriptscriptstyle(d)}=\textsf{Cov}(\epsilon_{k}^{\scriptscriptstyle(d)},\eta_{k,i}^{\scriptscriptstyle(d)}) can be expressed as γi,1(d)Cov(ϵk(d),Yk(d))=γi,1(d)σϵ(d)2=σi,d2βi(d)-\gamma_{i,1}^{\scriptscriptstyle(d)}\textsf{Cov}(\epsilon_{k}^{\scriptscriptstyle(d)},Y_{k}^{\scriptscriptstyle(d)})=-\gamma_{i,1}^{\scriptscriptstyle(d)}\sigma_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}=-\sigma_{i,d}^{2}\beta_{i}^{\scriptscriptstyle(d)}, hence we can approach the debiasing of βi(d)\beta_{i}^{\scriptscriptstyle(d)} through the debiasing of ri(d)r_{i}^{\scriptscriptstyle(d)}, and equivalently formulate the testing problem (42) as

H0,i:ri(1)/σi,12=ri(2)/σi,22,i=2,,p+1.\displaystyle H_{0,i}:r_{i}^{\scriptscriptstyle(1)}/\sigma_{i,1}^{2}=r_{i}^{\scriptscriptstyle(2)}/\sigma_{i,2}^{2},\quad i=2,\dots,p+1.

The most straightforward way to estimate ri(d)r_{i}^{\scriptscriptstyle(d)} is to use the sample covariance between the error terms, nd1k=1ndϵk(d)ηk,i(d)n_{d}^{-1}\sum_{k=1}^{n_{d}}\epsilon_{k}^{\scriptscriptstyle(d)}\eta_{k,i}^{\scriptscriptstyle(d)}. However, the error terms are unknown, so we first estimate them by

ϵ^k(d)=Yk(d)Xk,(d)β^(d),η^k,i(d)=Xk,i(d)(Yk(d)Y¯(d),X~k,i(d))γ^i(d),\displaystyle\widehat{\epsilon}_{k}^{\scriptscriptstyle(d)}=Y_{k}^{\scriptscriptstyle(d)}-X_{k,\cdot}^{\scriptscriptstyle(d)}\widehat{\beta}^{\scriptscriptstyle(d)},\quad\widehat{\eta}_{k,i}^{\scriptscriptstyle(d)}=X_{k,i}^{\scriptscriptstyle(d)}-(Y_{k}^{\scriptscriptstyle(d)}-\bar{Y}^{\scriptscriptstyle(d)},\widetilde{X}_{k,-i}^{\scriptscriptstyle(d)})\widehat{\gamma}_{i}^{\scriptscriptstyle(d)},

where β^(d)=(β^1(d),,β^p+1(d))\widehat{\beta}^{\scriptscriptstyle(d)}=(\widehat{\beta}_{1}^{\scriptscriptstyle(d)},\dots,\widehat{\beta}_{p+1}^{\scriptscriptstyle(d)}) and γ^i(d)=(γ^i,1(d),,γ^i,p(d))\widehat{\gamma}_{i}^{\scriptscriptstyle(d)}=(\widehat{\gamma}_{i,1}^{\scriptscriptstyle(d)},\dots,\widehat{\gamma}_{i,p}^{\scriptscriptstyle(d)}) are respectively the estimators of β(d)\beta^{\scriptscriptstyle(d)} and γi(d)\gamma_{i}^{\scriptscriptstyle(d)} that satisfy

max{|β^1(d)β1(d)|1,maxi=1,,p|γ^i(d)γi(d)|1}\displaystyle\max\{|\widehat{\beta}_{-1}^{\scriptscriptstyle(d)}-\beta_{-1}^{\scriptscriptstyle(d)}|_{1},\max_{i=1,\ldots,p}|\widehat{\gamma}_{i}^{\scriptscriptstyle(d)}-\gamma_{i}^{\scriptscriptstyle(d)}|_{1}\} =\displaystyle= O(an1),\displaystyle O_{{\mathbb{P}}}(a_{n1}), (46)
max{|β^1(d)β1(d)|2,maxi=1,,p|γ^i(d)γi(d)|2}\displaystyle\max\{|\widehat{\beta}_{-1}^{\scriptscriptstyle(d)}-\beta_{-1}^{\scriptscriptstyle(d)}|_{2},\max_{i=1,\ldots,p}|\widehat{\gamma}_{i}^{\scriptscriptstyle(d)}-\gamma_{i}^{\scriptscriptstyle(d)}|_{2}\} =\displaystyle= O(an2),\displaystyle O_{{\mathbb{P}}}(a_{n2}), (47)

for some an1a_{n1} and an2a_{n2} such that

max{an1an2,an22}=o{(nlogp)1/2}, and an1=o(1/logp).\max\{a_{n1}a_{n2},a_{n2}^{2}\}=o\{(n\log p)^{-1/2}\},\mbox{ and }a_{n1}=o(1/\log p). (48)

As noted by (liuluo2014, ; xia2018two, ), estimators β^(d)\widehat{\beta}^{\scriptscriptstyle(d)} and γ^i(d)\widehat{\gamma}_{i}^{\scriptscriptstyle(d)} that satisfy (46), (47), and (48) can be obtained easily via standard methods such as the Lasso and the Dantzig selector. A natural estimator of ri(d)r_{i}^{\scriptscriptstyle(d)} can then be constructed as r~i(d)=nd1k=1ndϵ^k(d)η^k,i(d).\widetilde{r}_{i}^{\scriptscriptstyle(d)}=n_{d}^{-1}\sum_{k=1}^{n_{d}}\widehat{\epsilon}_{k}^{\scriptscriptstyle(d)}\widehat{\eta}_{k,i}^{\scriptscriptstyle(d)}. However, the bias of r~i(d)\widetilde{r}_{i}^{\scriptscriptstyle(d)} exceeds the desired rate (ndlogp)1/2(n_{d}\log p)^{-1/2} for the subsequent analysis. Hence, the difference between r~i(d)\widetilde{r}_{i}^{\scriptscriptstyle(d)} and nd1k=1ndϵk(d)ηk,i(d)n_{d}^{-1}\sum_{k=1}^{n_{d}}\epsilon_{k}^{\scriptscriptstyle(d)}\eta_{k,i}^{\scriptscriptstyle(d)} is calculated; under regularity conditions it equals σ^ϵ(d)2γ^i,1(d)+σ^i,d2β^i(d)\widehat{\sigma}_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}\widehat{\gamma}_{i,1}^{\scriptscriptstyle(d)}+\widehat{\sigma}_{i,d}^{2}\widehat{\beta}_{i}^{\scriptscriptstyle(d)} up to order (ndlogp)1/2(n_{d}\log p)^{-1/2}, where σ^ϵ(d)2=nd1k=1nd{ϵ^k(d)}2\widehat{\sigma}_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}=n_{d}^{-1}\sum_{k=1}^{n_{d}}\{\widehat{\epsilon}_{k}^{\scriptscriptstyle(d)}\}^{2} and σ^i,d2=nd1k=1nd{η^k,i(d)}2\widehat{\sigma}_{i,d}^{2}=n_{d}^{-1}\sum_{k=1}^{n_{d}}\{\widehat{\eta}_{k,i}^{\scriptscriptstyle(d)}\}^{2} are the sample variances. Accordingly, a bias-corrected estimator for ri(d)r_{i}^{\scriptscriptstyle(d)} is defined as

r^i(d)=r~i(d)+σ^ϵ(d)2γ^i,1(d)+σ^i,d2β^i(d).\widehat{r}_{i}^{\scriptscriptstyle(d)}=\widetilde{r}_{i}^{\scriptscriptstyle(d)}+\widehat{\sigma}_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}\widehat{\gamma}_{i,1}^{\scriptscriptstyle(d)}+\widehat{\sigma}_{i,d}^{2}\widehat{\beta}_{i}^{\scriptscriptstyle(d)}. (49)

For the other two testing problems, the bias corrections can be performed in almost exactly the same way via the inverse regression technique above, which translates the debiasing of regression coefficients into the debiasing of residual covariances. Note that, through such an inverse regression approach, one can explicitly handle the dependence among the component-wise debiased statistics, which is important for the subsequent multiplicity adjustment and FDR control.

4.1.2 Construction of test statistics

We next construct test statistics for each of the three problems discussed in Section 4.1, using the bias-corrected statistics as a starting point.

For problem (41), because testing whether βi=0\beta_{i}=0 is equivalent to testing whether the residual covariance is equal to zero, the test statistics can be constructed directly based on the bias-corrected r^i\widehat{r}_{i} (obtained exactly as r^i(d)\widehat{r}_{i}^{\scriptscriptstyle(d)} in Section 4.1.1, with the superscript (d)(d) dropped since there is only one sample). The standardized test statistics are then given by

Wi=r^i(σ^ϵ2σ^i2/n)1/2,i=2,,p+1,W_{i}=\frac{\widehat{r}_{i}}{(\widehat{\sigma}^{2}_{\epsilon}\widehat{\sigma}^{2}_{i}/n)^{1/2}},\quad i=2,\dots,p+1, (50)

where σ^ϵ2\widehat{\sigma}^{2}_{\epsilon} and σ^i2\widehat{\sigma}^{2}_{i} are the corresponding sample variances in the one-sample case, obtained by dropping the superscript (d)(d) and the subscript dd. As shown in liuluo2014 , the statistics WiW_{i} are asymptotically normal under the null.
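To illustrate this one-sample construction, the following R sketch computes W_i in (50) for a single index i, using cv.glmnet for the initial forward and inverse Lasso regressions. The object names are illustrative, and the cross-validated tuning is a pragmatic simplification rather than the exact choice analyzed in liuluo2014.

# Minimal sketch of the one-sample bias-corrected statistic W_i in (50).
# X has a leading column of ones (the intercept) and i indexes a non-intercept
# column, 2 <= i <= p+1.
library(glmnet)

test_stat_onesample <- function(X, Y, i) {
  n <- nrow(X)
  # forward regression: Y on the non-intercept columns of X
  fit_beta <- cv.glmnet(X[, -1], Y)
  beta_i   <- as.vector(coef(fit_beta, s = "lambda.min"))[i]    # coefficient of X[, i]
  eps_hat  <- as.vector(Y - predict(fit_beta, newx = X[, -1], s = "lambda.min"))
  # inverse regression: X[, i] on (Y, remaining covariates)
  Zmat      <- cbind(Y, X[, -c(1, i)])
  fit_gamma <- cv.glmnet(Zmat, X[, i])
  gamma_i1  <- as.vector(coef(fit_gamma, s = "lambda.min"))[2]  # coefficient of Y
  eta_hat   <- as.vector(X[, i] - predict(fit_gamma, newx = Zmat, s = "lambda.min"))
  # bias-corrected residual covariance (49) and its standardization (50)
  sigma_eps2 <- mean(eps_hat^2)
  sigma_i2   <- mean(eta_hat^2)
  r_hat <- mean(eps_hat * eta_hat) + sigma_eps2 * gamma_i1 + sigma_i2 * beta_i
  r_hat / sqrt(sigma_eps2 * sigma_i2 / n)
}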

The above construction cannot be directly applied to the problem (42), because βi(d)\beta_{i}^{\scriptscriptstyle(d)} is not necessarily equal to 0 under the two-sample null and ri(1)/σi,12=ri(2)/σi,22r_{i}^{\scriptscriptstyle(1)}/\sigma_{i,1}^{2}=r_{i}^{\scriptscriptstyle(2)}/\sigma_{i,2}^{2} is not equivalent to ri(1)=ri(2)r_{i}^{\scriptscriptstyle(1)}=r_{i}^{\scriptscriptstyle(2)}. Thus, it is necessary to construct testing procedures based directly on estimators of ri(1)/σi,12ri(2)/σi,22r_{i}^{\scriptscriptstyle(1)}/\sigma_{i,1}^{2}-r_{i}^{\scriptscriptstyle(2)}/\sigma_{i,2}^{2}. By the bias correction in Section 4.1.1, xia2018two proposed an estimator of ri(d)/σi,d2r_{i}^{\scriptscriptstyle(d)}/\sigma_{i,d}^{2}:

\displaystyle T_{i}^{\scriptscriptstyle(d)}=\widehat{r}_{i}^{\scriptscriptstyle(d)}/\widehat{\sigma}_{i,d}^{2},\quad i=2,\dots,p+1;\;d=1,2,

and tested (42) via the estimators {Ti(1)Ti(2):i=2,,p+1}\{T_{i}^{\scriptscriptstyle(1)}-T_{i}^{\scriptscriptstyle(2)}:i=2,\dots,p+1\}. Due to the heteroscedasticity, xia2018two considered a standardized version of Ti(1)Ti(2)T_{i}^{\scriptscriptstyle(1)}-T_{i}^{\scriptscriptstyle(2)}. Specifically, let

U~i(d)=(βi(d)+Ui(d))/σi,d2, with Ui(d)=nd1k=1nd{ϵk(d)ηk,i(d)E(ϵk(d)ηk,i(d))}.{\small\widetilde{U}_{i}^{\scriptscriptstyle(d)}=(\beta_{i}^{\scriptscriptstyle(d)}+U_{i}^{\scriptscriptstyle(d)})/\sigma_{i,d}^{2},\mbox{ with }U_{i}^{\scriptscriptstyle(d)}=n_{d}^{-1}\sum_{k=1}^{n_{d}}\{\epsilon_{k}^{\scriptscriptstyle(d)}\eta_{k,i}^{\scriptscriptstyle(d)}-\textsf{E}(\epsilon_{k}^{\scriptscriptstyle(d)}\eta_{k,i}^{\scriptscriptstyle(d)})\}.}

It was shown in xia2018two that Ti(d)T_{i}^{\scriptscriptstyle(d)} is close to U~i(d)\widetilde{U}_{i}^{\scriptscriptstyle(d)} asymptotically under regularity conditions. Because θi(d)=Var(U~i(d))=Var(ϵk(d)ηk,i(d)/σi,d2)/nd=(σϵ(d)2/σi,d2+{βi(d)}2)/nd,\theta_{i}^{\scriptscriptstyle(d)}=\textsf{Var}(\widetilde{U}_{i}^{\scriptscriptstyle(d)})=\textsf{Var}(\epsilon_{k}^{\scriptscriptstyle(d)}\eta_{k,i}^{\scriptscriptstyle(d)}/\sigma_{i,d}^{2})/n_{d}=(\sigma_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}/\sigma_{i,d}^{2}+\{\beta_{i}^{\scriptscriptstyle(d)}\}^{2})/n_{d}, it can be estimated by θ^i(d)=(σ^ϵ(d)2/σ^i,d2+{β^i(d)}2)/nd,\widehat{\theta}_{i}^{\scriptscriptstyle(d)}=(\widehat{\sigma}_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}/\widehat{\sigma}_{i,d}^{2}+\{\widehat{\beta}_{i}^{\scriptscriptstyle(d)}\}^{2})/n_{d}, and the standardized statistics are defined by

Wi=Ti(1)Ti(2)(θ^i(1)+θ^i(2))1/2,i=2,,p+1,W_{i}=\frac{T_{i}^{\scriptscriptstyle(1)}-T_{i}^{\scriptscriptstyle(2)}}{(\widehat{\theta}_{i}^{\scriptscriptstyle(1)}+\widehat{\theta}_{i}^{\scriptscriptstyle(2)})^{1/2}},\quad i=2,\dots,p+1, (51)

which are asymptotically normal under the null as studied in xia2018two .

For the multivariate testing problem (44), by taking advantage of the similar effect of the ithi^{\scalebox{0.8}{th}} variable on each of the responses, a group lasso penalty (yuan2006model, ) can be imposed to obtain an estimator of BB, such that

max1dD|B^1,dB1,d|1=O(an1),max1dD|B^1,dB1,d|2=O(an2),\displaystyle{\small\max_{1\leq d\leq D}|\widehat{B}_{-1,d}-B_{-1,d}|_{1}=O_{{\mathbb{P}}}(a_{n1}),\max_{1\leq d\leq D}|\widehat{B}_{-1,d}-B_{-1,d}|_{2}=O_{{\mathbb{P}}}(a_{n2}),}

for some an1a_{n1} and an2a_{n2} satisfying (48). By lounici2011oracle , the above rates can be fulfilled if the row sparsity of BB satisfies s(p)=o(n1/3/logp)s(p)=o(n^{1/3}/{\log p}). Following the same bias-correction strategy as described in Section 4.1.1, a debiased estimator for ri(d)r_{i}^{\scriptscriptstyle(d)} can be obtained via r^i(d)=r~i(d)+σ^ϵ2γ^i,1(d)+σ^i,d2B^i,d.\widehat{r}_{i}^{\scriptscriptstyle(d)}=\widetilde{r}_{i}^{\scriptscriptstyle(d)}+\widehat{\sigma}_{\epsilon}^{2}\widehat{\gamma}_{i,1}^{\scriptscriptstyle(d)}+\widehat{\sigma}_{i,d}^{2}\widehat{B}_{i,d}. Then, similarly to the two problems above, the standardized statistics are constructed as

\displaystyle\widetilde{T}_{i}^{\scriptscriptstyle(d)}={T_{i}^{\scriptscriptstyle(d)}}/{\{\widehat{\theta}_{i}^{\scriptscriptstyle(d)}\}^{1/2}},\quad i=2,\dots,p+1;\;d=1,\ldots,D,

where Ti(d)=r^i(d)/σ^i,d2T_{i}^{\scriptscriptstyle(d)}=\widehat{r}_{i}^{\scriptscriptstyle(d)}/\widehat{\sigma}_{i,d}^{2} and θ^i(d)=(σ^ϵ2/σ^i,d2+B^i,d2)/n.\widehat{\theta}_{i}^{\scriptscriptstyle(d)}=(\widehat{\sigma}_{\epsilon}^{2}/\widehat{\sigma}_{i,d}^{2}+\widehat{B}_{i,d}^{2})/n. Finally, a sum-of-squares-type test statistic for testing the ithi^{\scalebox{0.8}{th}} row of BB is proposed:

Wi=d=1D{T~i(d)}2,i=2,,p+1,W_{i}=\sum_{d=1}^{D}\{\widetilde{T}_{i}^{\scriptscriptstyle(d)}\}^{2},\quad i=2,\dots,p+1, (52)

which is asymptotically χD2\chi^{2}_{D} distributed under the null as studied in xia2018joint .
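For instance, given the DD standardized statistics for the ii-th covariate, the statistic (52) and its asymptotic null p-value can be computed as follows (an R sketch with illustrative names).

# Sketch: row-wise statistic (52) and its chi-square p-value for D responses,
# given the vector T_tilde of the D standardized statistics for covariate i.
row_test <- function(T_tilde) {
  W_i <- sum(T_tilde^2)                                    # statistic (52)
  c(W = W_i, pval = pchisq(W_i, df = length(T_tilde), lower.tail = FALSE))
}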

4.2 Simultaneous inference for logistic regression

The principles of simultaneous inference for high-dimensional linear regression can also be applied to high-dimensional logistic regression models. In particular, ma2021global considered the regression model (1), i.e., Yk=h(Xk,β)+ϵkY_{k}=h(X_{k,\cdot}^{\intercal}\beta)+\epsilon_{k}, k=1,,nk=1,\ldots,n, with the link function h(z)=exp(z)/[1+exp(z)]h(z)=\exp(z)/[1+\exp(z)], and studied the simultaneous testing problem (41) as described in Section 4.1, namely testing

H0,i:βi=0 versus H1,i:βi0,i=2,,p+1,\displaystyle H_{0,i}:\beta_{i}=0\mbox{ versus }H_{1,i}:\beta_{i}\neq 0,\quad i=2,\dots,p+1,

with FDR control.

Based on the regularized estimator β^\widehat{\beta} given in (5), ma2021global corrected the bias of β^\widehat{\beta} via the Taylor expansion of h(uk)h(u_{k}) at u^k\widehat{u}_{k} for uk=Xk,βu_{k}=X_{k,\cdot}^{\intercal}\beta and u^k=Xk,β^\widehat{u}_{k}=X_{k,\cdot}^{\intercal}\widehat{\beta}, and obtained that

Ykh(u^k)+h(u^k)Xk,β^=h(u^k)Xk,β+(Rk+ϵk),Y_{k}-h(\widehat{u}_{k})+h^{\prime}(\widehat{u}_{k})X_{k,\cdot}^{\intercal}\widehat{\beta}=h^{\prime}(\widehat{u}_{k})X_{k,\cdot}^{\intercal}\beta+(R_{k}+\epsilon_{k}), (53)

where RkR_{k} is the remainder term as specified in (16). Next, Ykh(u^k)+h(u^k)Xk,β^Y_{k}-h(\widehat{u}_{k})+h^{\prime}(\widehat{u}_{k})X_{k,\cdot}^{\intercal}\widehat{\beta} can be treated as the new response, h(u^k)Xk,h^{\prime}(\widehat{u}_{k})X_{k,\cdot} as the new covariates, and Rk+ϵkR_{k}+\epsilon_{k} as the new noise. Under such a formulation, testing H0,i:βi=0H_{0,i}:\beta_{i}=0 can be translated into the simultaneous inference for the regression coefficients of an approximate linear model. ma2021global applied the decorrelation method (13) and constructed the following debiased estimator β~iDeb\widetilde{\beta}_{i}^{\rm Deb},

β~iDeb=β^i+k=1nZk,i(Ykh(Xk,β^))k=1nZk,ih(Xk,β^)Xk,i,i=2,,p+1,\widetilde{\beta}_{i}^{\rm Deb}=\widehat{\beta}_{i}+\frac{\sum_{k=1}^{n}Z_{k,i}(Y_{k}-h(X_{k,\cdot}^{\intercal}\widehat{\beta}))}{\sum_{k=1}^{n}Z_{k,i}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,i}},\quad i=2,\dots,p+1, (54)

where Z,iZ_{\cdot,i} is the scaled residual from regressing X,iX_{\cdot,i} on X,iX_{\cdot,-i} through the linearization weighting W^k=1/h(Xk,β^)\widehat{W}_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta}) introduced in Section 2.3; the same strategy was also employed in (ren2016asymptotic, ; cai2019differential, ). As noted in Section 4.1.1, for the subsequent FDR control analysis, the decorrelation vectors Z,iZ_{\cdot,i} were shown to be close to the true regression errors. Alternatively, the inverse regression technique in Section 4.1.1 can be applied to the approximate linear model (53) for the bias correction.

Based on (54), ma2021global proposed the following standardized test statistic:

Wi=β~iDeb{k=1nh(u^k)Zk,i2}1/2/{k=1nh(u^k)Zk,iXk,i},i=2,,p+1,W_{i}=\frac{\widetilde{\beta}_{i}^{\rm Deb}}{\{\sum_{k=1}^{n}h^{\prime}(\widehat{u}_{k})Z_{k,i}^{2}\}^{1/2}/\{\sum_{k=1}^{n}h^{\prime}(\widehat{u}_{k})Z_{k,i}X_{k,i}\}},\quad i=2,\dots,p+1, (55)

which is asymptotically normal under the null. Additionally, ma2021global extended the idea to the two-sample multiple testing problem (42). Similarly, multiple testing of (44) can also be approached in the logistic setting.
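A minimal R sketch of (54) and (55) is given below. It uses cv.glmnet for the initial penalized logistic fit and, as a simplification, takes Z,iZ_{\cdot,i} to be the residual from a weighted Lasso of X,iX_{\cdot,i} on X,iX_{\cdot,-i} with weights h(u^k)h^{\prime}(\widehat{u}_{k}); the exact decorrelation construction and tuning in ma2021global may differ, and all object names are illustrative.

# Minimal sketch of the debiased logistic statistics (54)-(55).
# X has a leading intercept column and i indexes a non-intercept column.
library(glmnet)

logistic_test_stat <- function(X, Y, i) {
  # initial penalized logistic fit
  fit      <- cv.glmnet(X[, -1], Y, family = "binomial")
  beta_hat <- as.vector(coef(fit, s = "lambda.min"))
  p_hat    <- as.vector(predict(fit, newx = X[, -1], s = "lambda.min", type = "response"))
  hprime   <- p_hat * (1 - p_hat)                          # h'(u_hat_k)
  # simplified decorrelation vector: weighted nodewise Lasso residual
  fit_node <- cv.glmnet(X[, -c(1, i)], X[, i], weights = hprime)
  Z_i <- as.vector(X[, i] - predict(fit_node, newx = X[, -c(1, i)], s = "lambda.min"))
  # debiased estimator (54) and standardized statistic (55)
  denom   <- sum(Z_i * hprime * X[, i])
  beta_db <- beta_hat[i] + sum(Z_i * (Y - p_hat)) / denom
  beta_db / (sqrt(sum(hprime * Z_i^2)) / denom)
}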

4.3 Multiple testing procedure

Using the test statistics {Wi:i=2,,p+1}\{W_{i}:i=2,\dots,p+1\} in (50), (51), (52) and (55) as a foundation, we next examine a unified multiple testing procedure that guarantees error rate control.

Let ={2,,p+1}\mathcal{H}=\{2,\dots,p+1\}, 0\mathcal{H}_{0} be the set of true null indices and 1=0\mathcal{H}_{1}=\mathcal{H}\setminus\mathcal{H}_{0} be the set of true alternatives. We are interested in cases where most of the tests are nulls, that is, |1||\mathcal{H}_{1}| is relatively small compared to |||\mathcal{H}|. Let Ψ()\Psi(\cdot) be the asymptotic cumulative distribution function (cdf) of WiW_{i} under the null and let Φ()\Phi(\cdot) be the standard normal cdf. We then apply the normal quantile transformation Ni=Φ1{1(1Ψ(Wi))/2}N_{i}=\Phi^{-1}\left\{1-(1-\Psi(W_{i}))/2\right\}, which approximately has the same distribution as the absolute value of a standard normal random variable under the null H0,iH_{0,i}. Let tt be the threshold level such that H0,iH_{0,i} is rejected if NitN_{i}\geq t. For any given tt, denote by R0(t)=i0𝕀{Nit}R_{0}(t)=\sum_{i\in\mathcal{H}_{0}}\mathbb{I}\{N_{i}\geq t\} and R(t)=i𝕀{Nit}R(t)=\sum_{i\in\mathcal{H}}\mathbb{I}\{N_{i}\geq t\} the total number of false positives and the total number of rejections, respectively. Then the FDP and FDR are defined as

FDP(t)=R0(t)max{R(t),1},FDR(t)=𝔼{FDP(t)}.\text{FDP}(t)=\frac{R_{0}(t)}{\max\{R(t),1\}},\quad\text{FDR}(t)=\mathbb{E}\{\text{FDP}(t)\}.

An ideal choice of tt rejects as many true positives as possible while controlling the FDP at the pre-specified level α\alpha, i.e.,

t0=inf{0t(2logp)1/2:FDP(t)α}.t_{0}=\inf\left\{0\leq t\leq(2\log p)^{1/2}:\;\text{FDP}(t)\leq\alpha\right\}.

Since R0(t)R_{0}(t) can be estimated by 2{1Φ(t)}|0|2\{1-\Phi(t)\}|\mathcal{H}_{0}| and |0||\mathcal{H}_{0}| is upper bounded by pp, we conservatively estimate it by 2p{1Φ(t)}2p\{1-\Phi(t)\}. Therefore, the following multiple testing algorithm is proposed in (e.g., liuluo2014, ; xia2018two, ; xia2018joint, ; ma2021global, ).

Algorithm 1 The multiple testing procedure.

Step 1. Obtain the transformed statistics Ni=Φ1{1(1Ψ(Wi))/2}N_{i}=\Phi^{-1}\left\{1-(1-\Psi(W_{i}))/2\right\} from the test statistics WiW_{i}, i=2,,p+1i=2,\dots,p+1.

Step 2. For a given 0α10\leq\alpha\leq 1, calculate

t^=inf[0t(2logp2loglogp)1/2:2p{1Φ(t)}max{R(t),1}α].\small\widehat{t}=\inf\left[0\leq t\leq(2\log p-2\log\log p)^{1/2}:\;\frac{2p\{1-\Phi(t)\}}{\max\{R(t),1\}}\leq\alpha\right]. (56)

If the infimum in (56) does not exist, set t^=(2logp)1/2\widehat{t}=(2\log p)^{1/2}.

Step 3. For ii\in\mathcal{H}, reject H0,iH_{0,i} if Nit^N_{i}\geq\widehat{t}.

As noted in xia2018multiple , the constraint 0t(2logp2loglogp)1/20\leq t\leq(2\log p-2\log\log p)^{1/2} in (56) is critical. When tt exceeds the upper bound, 2p{1Φ(t)}02p\{1-\Phi(t)\}\rightarrow 0 is not even a consistent estimate of R0(t)R_{0}(t). In contrast, the Benjamini-Hochberg (B-H) procedure (BenHoc95, ) uses 2p{1Φ(t)}2p\{1-\Phi(t)\} as an estimate of R0(t)R_{0}(t) for all t0t\geq 0 and hence may not be able to control the FDP. On the other hand, if the infimum in (56) is not attained within the range, it is important to threshold NiN_{i} at (2logp)1/2(2\log p)^{1/2}, because thresholding NiN_{i} at (2logp2loglogp)1/2(2\log p-2\log\log p)^{1/2} may cause too many false rejections. As a result, by applying the above multiple testing algorithm to each of the problems in Sections 4.1 and 4.2, under some regularity conditions as specified in (liuluo2014, ; xia2018two, ; xia2018joint, ; ma2021global, ), we achieve asymptotic FDP and FDR control at the pre-specified level α\alpha, i.e., lim(n,p){FDPAlg1α+ϵ}=1\lim_{(n,p)\rightarrow\infty}{\mathbb{P}}\{\text{FDP}_{\text{Alg1}}\leq\alpha+\epsilon\}=1 for any ϵ>0\epsilon>0 and lim¯(n,p)FDRAlg1α\mathop{\overline{\rm lim}}_{(n,p)\rightarrow\infty}\text{FDR}_{\text{Alg1}}\leq\alpha.
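The data-driven threshold (56), including the fallback to (2logp)1/2(2\log p)^{1/2}, can be computed directly from the transformed statistics. The following R sketch (with illustrative object names) returns the indices of the rejected hypotheses for a given level α.

# Minimal sketch of Algorithm 1: given the transformed statistics N (one per
# hypothesis) and a level alpha, compute the threshold (56) and reject.
algorithm1_reject <- function(N, alpha) {
  p     <- length(N)                                      # number of hypotheses
  t_max <- sqrt(2 * log(p) - 2 * log(log(p)))
  t_grid <- sort(c(0, N[N >= 0 & N <= t_max]))            # candidate thresholds
  fdp_hat <- sapply(t_grid, function(t) {
    2 * p * (1 - pnorm(t)) / max(sum(N >= t), 1)          # conservative FDP estimate
  })
  if (any(fdp_hat <= alpha)) {
    t_hat <- min(t_grid[fdp_hat <= alpha])                # the infimum in (56)
  } else {
    t_hat <- sqrt(2 * log(p))                             # fallback threshold
  }
  which(N >= t_hat)                                       # indices of rejected hypotheses
}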

4.4 Power enhancement

In addition to controlling error rates, we consider strategies to improve the power of multiple testing procedures, focusing on enhancing the power for two-sample inference when the high-dimensional objects of interest are individually sparse. This is explored in xia2020gap with an extension to a more general framework in liang2022locally . It is worth noting that xia2020gap improved the performance of Algorithm 1 by utilizing unknown sparsity in two-sample multiple testing. Additionally, power enhancement for the simultaneous inference of GLMs can be achieved through data integration (e.g., cai2021individual, ; liu2021integrative, ) as well as other power boosting methods designed for general multiple testing problems as introduced in Section 1.

Recall that, in the two-sample problem studied in Section 4.1, we aim to make inference for δi=𝕀{βi(1)βi(2)}\delta_{i}=\mathbb{I}\{\beta_{i}^{\scriptscriptstyle(1)}\neq\beta_{i}^{\scriptscriptstyle(2)}\}, i=2,,p+1i=2,\dots,p+1. Following Algorithm 1, we first summarize the data by a single vector of test statistics {N2,,Np+1}\{N_{2},\cdots,N_{p+1}\} and then choose a significance threshold to control the multiplicity. However, such an approach ignores the important feature that both objects β(1)\beta^{\scriptscriptstyle(1)} and β(2)\beta^{\scriptscriptstyle(2)} are individually sparse. Let d={i=2,,p+1:βi(d)0}\mathcal{I}_{d}=\{i=2,\dots,p+1:\beta_{i}^{\scriptscriptstyle(d)}\neq 0\} denote the support of β(d)\beta^{\scriptscriptstyle(d)}, d=1,2d=1,2, and =12\mathcal{I}=\mathcal{I}_{1}\cup\mathcal{I}_{2} the union support. Because the small cardinality of \mathcal{I} implies that both β(1)\beta^{\scriptscriptstyle(1)} and β(2)\beta^{\scriptscriptstyle(2)} are sparse, the information on \mathcal{I} can potentially be utilized to narrow down the alternatives via the logical relationship that ii\notin\mathcal{I} implies that δi=0\delta_{i}=0.

The goal of xia2020gap is to incorporate the sparsity information to improve the testing efficiency, and it is accomplished via the construction of an additional covariate sequence {Si:i=2,,p+1}\{S_{i}:i=2,\dots,p+1\} to capture the information on \mathcal{I}. Note that SiS_{i} and NiN_{i} have different roles: NiN_{i} is the primary statistic to evaluate the significance of the test, while SiS_{i} is the auxiliary one that captures the sparsity information to assist the inference. It is also important to note that SiS_{i} should be asymptotically independent of NiN_{i} so that the null distribution of NiN_{i} would not be distorted by the incorporation of SiS_{i}. For the two-sample problem in Section 4.1, such auxiliary statistics can be constructed by

Si=r^i(1)/σ^i,12+(θ^i(1)/θ^i(2))(r^i(2)/σ^i,22){θ^i(1)(1+θ^i(1)/θ^i(2))}1/2,i=2,,p+1.\small S_{i}=\frac{{\widehat{r}_{i}^{\scriptscriptstyle(1)}}/{\widehat{\sigma}_{{i,1}}^{2}}+(\widehat{\theta}_{i}^{\scriptscriptstyle(1)}/\widehat{\theta}_{i}^{\scriptscriptstyle(2)})({\widehat{r}_{i}^{\scriptscriptstyle(2)}}/{\widehat{\sigma}_{{i,2}}^{2}})}{\{{\widehat{\theta}_{i}^{\scriptscriptstyle(1)}(1+\widehat{\theta}_{i}^{\scriptscriptstyle(1)}/\widehat{\theta}_{i}^{\scriptscriptstyle(2)})}\}^{1/2}},\quad i=2,\dots,p+1. (57)

Then based on the pairs of statistics {(Ni,Si):i=2,,p+1}\{(N_{i},S_{i}):i=2,\dots,p+1\}, the proposal in xia2020gap operates in three steps: grouping, adjusting and pooling (GAP). The first step divides all tests into KK groups based on SiS_{i}, which leads to heterogeneous groups with varied sparsity levels. The second step adjusts the pp-values to incorporate the sparsity information. The final step combines the adjusted pp-values and chooses a threshold to control the global FDR. Based on the pp-values obtained by the test statistics NiN_{i}’s, i.e., pi=2{1Φ(Ni)}p_{i}=2\{1-\Phi(N_{i})\}, the algorithm is summarized in Algorithm 2, and we refer to xia2020gap for implementation details such as the choice of the number of groups and the grid sets.

Algorithm 2 Multiple testing via grouping, adjusting and pooling (GAP).

Step 1 (Grouping). Divide hypotheses into KK groups: 𝒢l={i=2,,p+1:λl1<Siλl}\mathcal{G}_{l}=\{i=2,\dots,p+1:\lambda_{l-1}<S_{i}\leq\lambda_{l}\}, for 1lK1\leq l\leq K. The optimal choice of grouping will be determined in Step 2.

Step 2 (Adjusting). Define ml=|𝒢l|m_{l}=|\mathcal{G}_{l}|. Calculate adjusted pp-values piw=min{pi/wlo,1}p_{i}^{w}=\min\{p_{i}/w_{l}^{o},1\} if i𝒢li\in\mathcal{G}_{l}, 1lK1\leq l\leq K, where pi=2{1Φ(Ni)}p_{i}=2\{1-\Phi(N_{i})\} and wlow_{l}^{o} is computed as follows.

  • Initial adjusting. For a given grouping {𝒢l:1lK}\{\mathcal{G}_{l}:1\leq l\leq K\}, let π^l\widehat{\pi}_{l} be the estimated proportion of non-nulls in 𝒢l\mathcal{G}_{l}. Compute the group-wise weights

    wl={l=1Kmlπ^l1π^l}1pπ^l(1π^l), 1lK.\small w_{l}=\left\{\sum_{l=1}^{K}\frac{m_{l}\widehat{\pi}_{l}}{1-\widehat{\pi}_{l}}\right\}^{-1}\frac{p\widehat{\pi}_{l}}{(1-\widehat{\pi}_{l})},\;1\leq l\leq K. (58)

    Define piw=min{pi/wl,1}p_{i}^{w}=\min\{p_{i}/w_{l},1\} for i𝒢li\in\mathcal{G}_{l}.

  • Further refining. For each Λ={λl:1lK1}\Lambda=\{\lambda_{l}:1\leq l\leq K-1\} (allowed to be empty), let

    k=max{j:p(j)wjα/p},\small k=\max\{j:p^{w}_{(j)}\leq{j}\alpha/p\}, (59)

    and reject the hypotheses corresponding to {p(1)w,,p(k)w}\{p^{w}_{(1)},\ldots,p^{w}_{(k)}\}, where p(1)wp(p)wp^{w}_{(1)}\leq\cdots\leq p^{w}_{(p)} are the re-ordered pp-values. The weights wlow_{l}^{o} are computed using (58) based on the optimal grouping that yields the most rejections.

Step 3 (Pooling). Combine the piwp_{i}^{w}’s computed from Step 2 based on the optimal grouping, apply (59) again, and output the rejections.
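To make the adjusting and pooling steps concrete, the R sketch below carries out the weighting (58) and the step-up rule (59) for one fixed grouping. The search over candidate groupings and the estimation of the non-null proportions π̂_l, which xia2020gap describes in detail, are assumed to have been carried out already; all object names are illustrative.

# Sketch of the weighted adjust-and-pool steps of GAP for one fixed grouping.
# p_vals: unadjusted p-values p_i = 2{1 - Phi(N_i)}; group: labels in 1..K;
# pi_hat: estimated non-null proportions per group (assumed strictly in (0,1)).
gap_adjust_pool <- function(p_vals, group, pi_hat, alpha) {
  p <- length(p_vals)
  m <- tabulate(group)                                    # group sizes m_l
  w <- (p * pi_hat / (1 - pi_hat)) /
       sum(m * pi_hat / (1 - pi_hat))                     # group-wise weights (58)
  p_w <- pmin(p_vals / w[group], 1)                       # adjusted p-values
  ord <- order(p_w)
  k   <- max(c(0, which(p_w[ord] <= (1:p) * alpha / p)))  # step-up rule (59)
  if (k == 0) integer(0) else ord[1:k]                    # indices of rejections
}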

In addition, xia2020gap provided some insights on why the GAP algorithm works. First, it adaptively chooses the group-wise FDR levels via adjusted pp-values and effectively incorporates group-wise information. Intuitively, Algorithm 2 increases the overall power by assigning higher FDR levels to groups where signals are more common. It does not assume known groups and searches for the optimal grouping to maximize the power. Moreover, the construction of the weights in Algorithm 2 ensures that after all groups are combined, the weights are always “proper” in the sense of genovese2006false . As a result, even inaccurate estimates of the non-null proportions would not affect the validity of overall FDR control.

As with Algorithm 1, xia2020gap established asymptotic error rate control for Algorithm 2, namely, lim(n,p){FDPAlg2α+ϵ}=1\lim_{(n,p)\rightarrow\infty}{\mathbb{P}}\{\text{FDP}_{\text{Alg2}}\leq\alpha+\epsilon\}=1 for any ϵ>0\epsilon>0 and lim¯(n,p)FDRAlg2α\mathop{\overline{\rm lim}}_{(n,p)\rightarrow\infty}\text{FDR}_{\text{Alg2}}\leq\alpha. Moreover, due to the informative weights (58) that effectively incorporate the sparsity information through {Si:i=2,,p+1}\{S_{i}:i=2,\dots,p+1\}, Algorithm 2 dominates Algorithm 1 in power asymptotically. Specifically, xia2020gap provided a rigorous theoretical comparison showing that ΨAlg2ΨAlg1+o(1)\Psi_{\rm Alg2}\geq\Psi_{\rm Alg1}+o(1) as pp\rightarrow\infty, where ΨAlg1\Psi_{\rm Alg1} and ΨAlg2\Psi_{\rm Alg2} represent the expectations of the proportions of correct rejections among all alternative hypotheses for Algorithms 1 and 2, respectively.

5 Discussion

In this expository paper, we provided a review of methods and theoretical results on statistical inference and multiple testing for high-dimensional regression models, including linear and logistic regression. Due to limited space, we were unable to discuss a number of related inference problems. In this section, we briefly mention a few of them.

Accuracy assessment. Accuracy assessment is a crucial part of high-dimensional uncertainty quantification. Its goal is to evaluate the estimation accuracy of an estimator β^\widehat{\beta}. For linear regression, cai2018accuracy considered a collection of estimators β^\widehat{\beta} and established the minimaxity and adaptivity of the point estimation and confidence interval for β^βqq\|\widehat{\beta}-\beta\|_{q}^{q} with 1q2.1\leq q\leq 2. Suppose that β^\widehat{\beta} is independent of the data {Xk,,Yk}1kn\{X_{k,\cdot},Y_{k}\}_{1\leq k\leq n} (which can be achieved by sample splitting, for example). Define the residual Rek=YkXk,β^=Xk,(ββ^)+ϵkRe_{k}=Y_{k}-X_{k,\cdot}^{\intercal}\widehat{\beta}=X_{k,\cdot}^{\intercal}(\beta-\widehat{\beta})+\epsilon_{k}. The quadratic functional inference approach introduced in cai2020semisupervised can be applied to the data {Rek,Xk,}1kn\{Re_{k},X_{k,\cdot}\}_{1\leq k\leq n} to make inference for both (β^β)Σ(β^β)(\widehat{\beta}-\beta)^{\intercal}\Sigma(\widehat{\beta}-\beta) and β^β22\|\widehat{\beta}-\beta\|_{2}^{2}; see more details in Section 5.2 of cai2020semisupervised . nickl2013confidence used an estimator of β^β22\|\widehat{\beta}-\beta\|_{2}^{2} to construct a confidence set for β\beta with β^\widehat{\beta} denoting an accurate high-dimensional estimator. In the context of approximate message passing, donoho2011noise ; bayati2011lasso considered a different framework with n/(p+1)δ(0,1)n/(p+1)\rightarrow\delta\in(0,1) and independent Gaussian design and established the asymptotic limit of β^β22/(p+1)\|\widehat{\beta}-\beta\|_{2}^{2}/(p+1) for the Lasso estimator β^.\widehat{\beta}.

Semi-supervised Inference. Semi-supervised inference is well motivated by a wide range of modern applications, such as Electronic Health Record data analysis. In addition to the labelled data, additional covariate observations exist in the semi-supervised setting, and it is critical to leverage the information in the unlabelled data to improve inference efficiency. As reviewed in Section 3.5, the additional unlabelled data improves the accuracy of various high-dimensional inference procedures by facilitating the estimation of Σ\Sigma or Σ1\Sigma^{-1} javanmard2018debiasing ; cai2020semisupervised . Moreover, in a different context where the linear outcome model might be misspecified, one active research direction in semi-supervised learning is to construct a more complicated imputation model (e.g., by applying classical non-parametric or machine learning methods) and conduct a follow-up bias correction after outcome imputation (e.g., zhang2019semi, ; chakrabortty2018efficient, ; zhang2021double, ; deng2020optimal, ; hou2021surrogate, ).

Applications of quadratic form inference. The statistical inference methods for quadratic functionals in Section 3.4 and inner products in Section 3.6 are useful in a wide range of statistical applications. Firstly, the group significance test of βG22=0\|\beta_{G}\|_{2}^{2}=0 or βGΣG,GβG=0\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}=0 forms an important basis for designing computationally efficient hierarchical testing approaches mandozzi2016hierarchical ; guo2021group . Secondly, consider the high-dimensional interaction model Yk=Akη+Xk,τ+AkXk,γ+ϵkY_{k}=A_{k}\eta+X_{k,\cdot}^{\intercal}\tau+A_{k}\cdot X_{k,\cdot}^{\intercal}\gamma+\epsilon_{k} with AkA_{k} denoting the variable of interest (e.g., the treatment) and Xk,X_{k,\cdot} denoting a large number of other variables. In this model, testing the existence of the interaction term can be reduced to inference for γ22.\|\gamma\|_{2}^{2}. Lastly, in the two-sample model (2), the methods proposed in Sections 3.4 and 3.6 have been applied in guo2023robust to calculate the difference β(1)β(2)22\|\beta^{\scriptscriptstyle(1)}-\beta^{\scriptscriptstyle(2)}\|_{2}^{2}, which can be written as β(1)22+β(2)222[β(1)]β(2).\|\beta^{\scriptscriptstyle(1)}\|_{2}^{2}+\|\beta^{\scriptscriptstyle(2)}\|_{2}^{2}-2[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)}.

Multiple heterogeneous regression models. It is essential to perform efficient integrative inference in various applications that combine multiple regression models, including transfer learning, distributed learning, federated learning, and distributionally robust learning. Transfer learning provides a powerful tool for incorporating data from related studies to improve estimation and inference accuracy in a target study of direct interest. li2021transfer studied transfer learning for high-dimensional linear regression, established minimax optimal convergence rates, and proposed data-driven adaptive algorithms. Li2021Transfer-GLM ; tian2022transfer explored transfer learning in high-dimensional GLMs, and battey2018distributed ; lee2017communication considered distributed learning for high-dimensional regression. Additionally, liu2021integrative ; cai2021individual proposed integrative estimation and multiple testing procedures for high-dimensional regression models across sites that simultaneously accommodate between-study heterogeneity and protect individual-level data. In a separate direction, meinshausen2015maximin proposed the maximin effect as a robust prediction model when the target distribution is generated as a mixture of multiple source populations. guo2020inference further established that the maximin effect is a group distributionally robust model and studied statistical inference for the maximin effect in high dimensions. Many questions about multiple heterogeneous regression models remain open and warrant future research.

Other simultaneous inference problems. The principles of the multiple testing procedures reviewed in Section 4 are also widely applicable to a range of simultaneous inference problems, including graph learning (e.g., liu2013ggm ; xia2017hypothesis ; xia2018multiple ; cai2019differential ), differential network recovery (e.g., xia2015testing ; xia2019matrix ; kim2021two ), as well as various structured regression analyses (e.g., zhang2020estimation ; li2021inference ; sun2022decorrelating ). For example, using regression-based bias-correction techniques similar to those reviewed in Section 4.1.1, liu2013ggm studied the estimation of Gaussian graphical models with FDR control; xia2018multiple focused on the recovery of sub-networks in Gaussian graphs; and cai2019differential identified significant communities for compositional data. Additionally, xia2015testing ; xia2019matrix proposed multiple testing procedures for differential network detection of vector-valued and matrix-valued observations, respectively. Multiple testing of structured regression models, such as mixed-effects models and confounded linear models, has also been extensively studied in the literature (li2021inference ; sun2022decorrelating ).

Alternative multiple testing methods for regression models. Besides the bias-correction based multiple testing approaches reviewed in this article, there are a few alternative classes of methods that aim at finite-sample FDR control for linear models, including knockoff-based methods, mirror-statistics approaches, and $e$-value proposals. In particular, barber2015controlling constructed a set of knockoff variables and selected those predictors that have considerably higher importance scores than their knockoff counterparts; candes2018panning extended this work to the model-X knockoff framework, which allows the conditional distribution of the response to be unknown. There are several generalizations along this line of research; see barber2020robust ; ren2021derandomizing and many references therein. Inspired by the knockoff idea, du2021false proposed a symmetrized data aggregation approach to build mirror statistics that incorporate the data dependence structure; a general mirror-statistics framework for GLMs was studied in dai2023scale and the references therein. The $e$-value based proposal (vovk2021values ) is another useful tool for multiple testing in general. wang2020false proposed the e-BH procedure, which achieves FDR control under arbitrary dependence among the $e$-values, and the equivalence between the knockoffs and e-BH was studied in ren2022derandomized .
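For concreteness, the sketch below implements the e-BH rule as it is commonly stated: with $e$-values $e_{1},\ldots,e_{m}$ ordered as $e_{(1)}\geq\cdots\geq e_{(m)}$, reject the hypotheses with the $k^{*}$ largest $e$-values, where $k^{*}=\max\{k:\ e_{(k)}\geq m/(\alpha k)\}$. It is a minimal illustration under this stated rule rather than the implementation of wang2020false, and the function name ebh is hypothetical.

import numpy as np

# Minimal sketch of the e-BH rule: reject the k* hypotheses with the largest
# e-values, where k* = max{ k : e_(k) >= m / (alpha * k) }.
def ebh(evalues, alpha=0.1):
    e = np.asarray(evalues, dtype=float)
    m = e.size
    order = np.argsort(-e)                        # indices sorted by decreasing e-value
    thresholds = m / (alpha * np.arange(1, m + 1))
    passed = np.nonzero(e[order] >= thresholds)[0]
    if passed.size == 0:
        return np.array([], dtype=int)            # no rejections
    k_star = passed.max() + 1
    return np.sort(order[:k_star])                # indices of rejected hypotheses

rejections = ebh([25.0, 0.8, 60.0, 3.0, 1.2], alpha=0.1)   # rejects hypotheses 0 and 2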

References

  • (1) Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Statist. Soc. B 58(1), 267–288 (1996)
  • (2) Candes, E., Tao, T.: The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35(6), 2313–2351 (2007)
  • (3) Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Statist. Soc. B 67(2), 301–320 (2005)
  • (4) Zou, H.: The adaptive Lasso and its oracle properties. J. Am. Statist. Assoc. 101(476), 1418–1429 (2006)
  • (5) Bickel, P.J., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37(4), 1705–1732 (2009)
  • (6) Bühlmann, P., van de Geer, S.: Statistics for High-dimensional Data: Methods, Theory and Applications. Springer, New York (2011)
  • (7) Negahban, S., Yu, B., Wainwright, M.J., Ravikumar, P.K.: A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. In: Adv. Neural Inf. Process. Syst., pp. 1348–1356 (2009)
  • (8) van de Geer, S.A., Bühlmann, P.: On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3, 1360–1392 (2009)
  • (9) Huang, J., Zhang, C.-H.: Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications. J. Mach. Learn. Res. 13(Jun), 1839–1864 (2012)
  • (10) Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc. 96(456), 1348–1360 (2001)
  • (11) Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38(2), 894–942 (2010)
  • (12) Belloni, A., Chernozhukov, V., Wang, L.: Square-root Lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98(4), 791–806 (2011)
  • (13) Sun, T., Zhang, C.-H.: Scaled sparse linear regression. Biometrika 101(2), 269–284 (2012)
  • (14) Bunea, F.: Honest variable selection in linear and logistic regression models via $\ell_{1}$ and $\ell_{1}+\ell_{2}$ penalization. Electron. J. Stat. 2, 1153–1194 (2008)
  • (15) Bach, F.: Self-concordant analysis for logistic regression. Electron. J. Stat. 4, 384–414 (2010)
  • (16) Meier, L., van de Geer, S., Bühlmann, P.: The group Lasso for logistic regression. J. R. Statist. Soc. B 70(1), 53–71 (2008)
  • (17) Bellec, P.C., Lecué, G., Tsybakov, A.B.: Slope meets Lasso: improved oracle bounds and optimality. Ann. Statist. 46(6B), 3603–3642 (2018)
  • (18) Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
  • (19) Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Statist. 32(2), 407–499 (2004)
  • (20) Greenshtein, E., Ritov, Y.: Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10(6), 971–988 (2004)
  • (21) van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R.: On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42(3), 1166–1202 (2014)
  • (22) Javanmard, A., Montanari, A.: Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15(1), 2869–2909 (2014)
  • (23) Zhang, C.-H., Zhang, S.S.: Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Statist. Soc. B 76(1), 217–242 (2014)
  • (24) Guo, Z., Rakshit, P., Herman, D.S., Chen, J.: Inference for the case probability in high-dimensional logistic regression. J. Mach. Learn. Res. 22(1), 11480–11533 (2021)
  • (25) Guo, Z., Wang, W., Cai, T.T., Li, H.: Optimal estimation of genetic relatedness in high-dimensional linear models. J. Am. Statist. Assoc. 114(525), 358–369 (2019)
  • (26) Guo, Z., Renaux, C., Bühlmann, P., Cai, T.T.: Group inference in high dimensions with applications to hierarchical testing. Electron. J. Stat. 15(2), 6633–6676 (2021)
  • (27) Cai, T.T., Guo, Z.: Semisupervised inference for explained variance in high dimensional linear regression and its applications. J. R. Statist. Soc. B 82(2), 391–419 (2020)
  • (28) Ma, R., Guo, Z., Cai, T.T., Li, H.: Statistical inference for genetic relatedness based on high-dimensional logistic regression. arXiv preprint arXiv:2202.10007 (2022)
  • (29) Cai, T.T., Guo, Z.: Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. Ann. Statist. 45(2), 615–646 (2017)
  • (30) Athey, S., Imbens, G.W., Wager, S.: Approximate residual balancing: debiased inference of average treatment effects in high dimensions. J. R. Statist. Soc. B 80(4), 597–623 (2018)
  • (31) Cai, T.T., Guo, Z., Ma, R.: Statistical inference for high-dimensional generalized linear models with binary outcomes. J. Am. Statist. Assoc. 116, 1–14 (2021)
  • (32) Ning, Y., Liu, H.: A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Statist. 45(1), 158–195 (2017)
  • (33) Ren, Z., Sun, T., Zhang, C.-H., Zhou, H.H.: Asymptotic normality and optimalities in estimation of large Gaussian graphical models. Ann. Statist. 43(3), 991–1026 (2015)
  • (34) Yu, Y., Bradic, J., Samworth, R.J.: Confidence intervals for high-dimensional Cox models. arXiv preprint arXiv:1803.01150 (2018)
  • (35) Fang, E.X., Ning, Y., Li, R.: Test of significance for high-dimensional longitudinal data. Ann. Statist. 48(5), 2622 (2020)
  • (36) Zhou, R.R., Wang, L., Zhao, S.D.: Estimation and inference for the indirect effect in high-dimensional linear mediation models. Biometrika 107(3), 573–589 (2020)
  • (37) Zhao, T., Kolar, M., Liu, H.: A general framework for robust testing and confidence regions in high-dimensional quantile regression. arXiv preprint arXiv:1412.8724 (2014)
  • (38) Javanmard, A., Lee, J.D.: A flexible framework for hypothesis testing in high dimensions. J. R. Statist. Soc. B 82(3), 685–718 (2020)
  • (39) Chen, Y., Fan, J., Ma, C., Yan, Y.: Inference and uncertainty quantification for noisy matrix completion. Proc. Natl. Acad. Sci. 116(46), 22931–22937 (2019)
  • (40) Zhu, Y., Bradic, J.: Linear hypothesis testing in dense high-dimensional linear models. J. Am. Statist. Assoc. 113(524), 1583–1600 (2018)
  • (41) Guo, Z., Kang, H., Cai, T.T., Small, D.S.: Testing endogeneity with high dimensional covariates. J. Econom. 207(1), 175–187 (2018)
  • (42) Cai, T.T., Zhang, L.: High-dimensional Gaussian copula regression: Adaptive estimation and statistical inference. Stat. Sin., 963–993 (2018)
  • (43) Eftekhari, H., Banerjee, M., Ritov, Y.: Inference in high-dimensional single-index models under symmetric designs. J. Mach. Learn. Res. 22, 27–1 (2021)
  • (44) Deshpande, Y., Mackey, L., Syrgkanis, V., Taddy, M.: Accurate inference for adaptive linear models. In: International Conference on Machine Learning, pp. 1194–1203 (2018). PMLR
  • (45) Fang, E.X., Ning, Y., Liu, H.: Testing and confidence intervals for high dimensional proportional hazards models. J. R. Statist. Soc. B 79(5), 1415–1437 (2017)
  • (46) Neykov, M., Ning, Y., Liu, J.S., Liu, H.: A unified theory of confidence regions and testing for high-dimensional estimating equations. Stat. Sci. 33(3), 427–443 (2018)
  • (47) Dezeure, R., Bühlmann, P., Meier, L., Meinshausen, N.: High-dimensional inference: confidence intervals, $p$-values and R-software hdi. Stat. Sci., 533–558 (2015)
  • (48) Belloni, A., Chernozhukov, V., Hansen, C.: Inference on treatment effects after selection among high-dimensional controls. Rev. Econ. Stud. 81(2), 608–650 (2014)
  • (49) Chernozhukov, V., Hansen, C., Spindler, M.: Valid post-selection and post-regularization inference: An elementary, general approach. Annu. Rev. Econom. 7(1), 649–688 (2015)
  • (50) Farrell, M.H.: Robust inference on average treatment effects with possibly more covariates than observations. J. Econom. 189(1), 1–23 (2015)
  • (51) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., Robins, J.: Double/debiased machine learning for treatment and structural parameters: Double/debiased machine learning. Econom. J. 21(1), 1–68 (2018)
  • (52) Belloni, A., Chernozhukov, V., Fernández-Val, I., Hansen, C.: Program evaluation and causal inference with high-dimensional data. Econometrica 85(1), 233–298 (2017)
  • (53) Cai, T., Cai, T.T., Guo, Z.: Optimal statistical inference for individualized treatment effects in high-dimensional models. J. R. Statist. Soc. B 83(4), 669–719 (2021)
  • (54) Rakshit, P., Cai, T.T., Guo, Z.: SIHR: An R package for statistical inference in high-dimensional linear and logistic regression models. arXiv preprint arXiv:2109.03365 (2021)
  • (55) Dezeure, R., Bühlmann, P., Zhang, C.-H.: High-dimensional simultaneous inference with the bootstrap. Test 26(4), 685–719 (2017)
  • (56) Zhang, X., Cheng, G.: Simultaneous inference for high-dimensional linear models. J. Am. Statist. Assoc. 112(518), 757–768 (2017)
  • (57) Ma, R., Cai, T.T., Li, H.: Global and simultaneous hypothesis testing for high-dimensional logistic regression models. J. Am. Statist. Assoc. 116(534), 984–998 (2021)
  • (58) Liu, W., Luo, S.: Hypothesis testing for high-dimensional regression models. Technical report (2014)
  • (59) Xia, Y., Cai, T., Cai, T.T.: Two-sample tests for high-dimensional linear regression with an application to detecting interactions. Stat. Sin. 28, 63–92 (2018)
  • (60) Xia, Y., Cai, T.T., Li, H.: Joint testing and false discovery rate control in high-dimensional multivariate regression. Biometrika 105(2), 249–269 (2018)
  • (61) Benjamini, Y., Hochberg, Y.: Multiple hypotheses testing with weights. Scand. J. Stat. 24(3), 407–418 (1997)
  • (62) Storey, J.D.: A direct approach to false discovery rates. J. Roy. Statist. Soc. B 64(3), 479–498 (2002)
  • (63) Genovese, C.R., Roeder, K., Wasserman, L.: False discovery control with $p$-value weighting. Biometrika 93(3), 509–524 (2006)
  • (64) Roeder, K., Wasserman, L.: Genome-wide significance levels and weighted hypothesis testing. Stat. Sci. 24(4), 398 (2009)
  • (65) Ignatiadis, N., Klaus, B., Zaugg, J.B., Huber, W.: Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat. Methods 13(7), 577–580 (2016)
  • (66) Lei, L., Fithian, W.: Adapt: an interactive procedure for multiple testing with side information. J. R. Statist. Soc. B 80(4), 649–679 (2018)
  • (67) Li, A., Barber, R.F.: Multiple testing with the structure-adaptive Benjamini-Hochberg algorithm. J. R. Statist. Soc. B 81(1), 45–74 (2019)
  • (68) Cai, T.T., Sun, W., Xia, Y.: LAWS: A locally adaptive weighting and screening approach to spatial multiple testing. J. Am. Statist. Assoc. 117, 1370–1383 (2022)
  • (69) Fithian, W., Lei, L.: Conditional calibration for false discovery rate control under dependence. Ann. Statist. 50(6), 3091–3118 (2022)
  • (70) Xia, Y., Cai, T.T., Sun, W.: GAP: A General Framework for Information Pooling in Two-Sample Sparse Inference. J. Am. Statist. Assoc. 115(531), 1236–1250 (2020)
  • (71) Liu, M., Xia, Y., Cho, K., Cai, T.: Integrative high dimensional multiple testing with heterogeneity under data sharing constraints. J. Mach. Learn. Res. 22, 126–1 (2021)
  • (72) Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34(3), 1436–1462 (2006)
  • (73) Zhao, P., Yu, B.: On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006)
  • (74) Wainwright, M.J.: Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$-constrained quadratic programming (Lasso). IEEE Trans. Inf. Theory 55(5), 2183–2202 (2009)
  • (75) Javanmard, A., Montanari, A.: Debiasing the Lasso: Optimal sample size for gaussian designs. Ann. Statist. 46(6A), 2593–2622 (2018)
  • (76) Guo, Z., Ćevid, D., Bühlmann, P.: Doubly debiased Lasso: High-dimensional inference under hidden confounding. Ann. Statist. 50(3), 1320–1347 (2022)
  • (77) Sun, Y., Ma, L., Xia, Y.: A decorrelating and debiasing approach to simultaneous inference for high-dimensional confounded models. arXiv preprint arXiv:2208.08754 (2022)
  • (78) Guo, Z., Yuan, W., Zhang, C.-H.: Decorrelated local linear estimator: Inference for non-linear effects in high-dimensional additive models. arXiv preprint arXiv:1907.12732 (2019)
  • (79) Collier, O., Comminges, L., Tsybakov, A.B.: Minimax estimation of linear and quadratic functionals on sparsity classes. Ann. Statist. 45(3), 923–958 (2017)
  • (80) Guo, Z.: Inference for high-dimensional maximin effects in heterogeneous regression models using a sampling approach. arXiv preprint arXiv:2011.07568 (2020)
  • (81) Zhang, T.: Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Trans. Inf. Theory 57(7), 4689–4708 (2011)
  • (82) Hunter, D.J.: Gene–environment interactions in human diseases. Nat. Rev. Genet. 6(4), 287–298 (2005)
  • (83) Zhou, J.J., Cho, M.H., Lange, C., Lutz, S., Silverman, E.K., Laird, N.M.: Integrating multiple correlated phenotypes for genetic association analysis by maximizing heritability. Hum. Hered. 79, 93–104 (2015)
  • (84) Schifano, E.D., Li, L., Christiani, D.C., Lin, X.: Genome-wide association analysis for multiple continuous secondary phenotypes. Am. J. Hum. Genet., 744–759 (2013)
  • (85) Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B 68(1), 49–67 (2006)
  • (86) Lounici, K., Pontil, M., van de Geer, S., Tsybakov, A.B., et al.: Oracle inequalities and optimal inference under group sparsity. Ann. Statist. 39(4), 2164–2204 (2011)
  • (87) Ren, Z., Zhang, C.-H., Zhou, H.: Asymptotic normality in estimation of large Ising graphical model. Unpublished Manuscript (2016)
  • (88) Cai, T.T., Li, H., Ma, J., Xia, Y.: Differential markov random field analysis with an application to detecting differential microbial community networks. Biometrika 106(2), 401–416 (2019)
  • (89) Xia, Y., Cai, T., Cai, T.T.: Multiple testing of submatrices of a precision matrix with applications to identification of between pathway interactions. J. Am. Statist. Assoc. 113(521), 328–339 (2018)
  • (90) Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B 57, 289–300 (1995)
  • (91) Liang, Z., Cai, T.T., Sun, W., Xia, Y.: Locally adaptive transfer learning algorithms for large-scale multiple testing. arXiv preprint arXiv:2203.11461 (2022)
  • (92) Cai, T., Liu, M., Xia, Y.: Individual data protected integrative regression analysis of high-dimensional heterogeneous data. J. Am. Statist. Assoc. 116, 1–15 (2021)
  • (93) Cai, T.T., Guo, Z.: Accuracy assessment for high-dimensional linear regression. Ann. Statist. 46(4), 1807–1836 (2018)
  • (94) Nickl, R., van de Geer, S.: Confidence sets in sparse regression. Ann. Statist. 41(6), 2852–2876 (2013)
  • (95) Donoho, D.L., Maleki, A., Montanari, A.: The noise-sensitivity phase transition in compressed sensing. IEEE Trans. Inf. Theory 57(10), 6920–6941 (2011)
  • (96) Bayati, M., Montanari, A.: The Lasso risk for Gaussian matrices. IEEE Trans. Inf. Theory 58(4), 1997–2017 (2011)
  • (97) Zhang, A., Brown, L.D., Cai, T.T.: Semi-supervised inference: General theory and estimation of means. Ann. Statist. 47(5), 2538–2566 (2019)
  • (98) Chakrabortty, A., Cai, T.: Efficient and adaptive linear regression in semi-supervised settings. Ann. Statist. 46(4), 1541–1572 (2018)
  • (99) Zhang, Y., Chakrabortty, A., Bradic, J.: Double robust semi-supervised inference for the mean: Selection bias under MAR labeling with decaying overlap. arXiv preprint arXiv:2104.06667 (2021)
  • (100) Deng, S., Ning, Y., Zhao, J., Zhang, H.: Optimal semi-supervised estimation and inference for high-dimensional linear regression. arXiv preprint arXiv:2011.14185 (2020)
  • (101) Hou, J., Guo, Z., Cai, T.: Surrogate assisted semi-supervised inference for high dimensional risk prediction. arXiv preprint arXiv:2105.01264 (2021)
  • (102) Mandozzi, J., Bühlmann, P.: Hierarchical testing in the high-dimensional setting with correlated variables. J. Am. Statist. Assoc. 111(513), 331–343 (2016)
  • (103) Guo, Z., Li, X., Han, L., Cai, T.: Robust inference for federated meta-learning. arXiv preprint arXiv:2301.00718 (2023)
  • (104) Li, S., Cai, T.T., Li, H.: Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. J. R. Statist. Soc. B 84(1), 149–173 (2021)
  • (105) Li, S., Zhang, L., Cai, T.T., Li, H.: Estimation and inference for high-dimensional generalized linear models with knowledge transfer. Technical Report (2021)
  • (106) Tian, Y., Feng, Y.: Transfer learning under high-dimensional generalized linear models. J. Am. Statist. Assoc. 117(just-accepted), 1–30 (2022)
  • (107) Battey, H., Fan, J., Liu, H., Lu, J., Zhu, Z.: Distributed testing and estimation under sparse high dimensional models. Ann. Statist. 46(3), 1352 (2018)
  • (108) Lee, J.D., Liu, Q., Sun, Y., Taylor, J.E.: Communication-efficient sparse regression. J. Mach. Learn. Res. 18(1), 115–144 (2017)
  • (109) Meinshausen, N., Bühlmann, P.: Maximin effects in inhomogeneous large-scale data. Ann. Statist. 43(4), 1801–1830 (2015)
  • (110) Liu, W.: Gaussian graphical model estimation with false discovery rate control. Ann. Statist. 41(6), 2948–2978 (2013)
  • (111) Xia, Y., Li, L.: Hypothesis testing of matrix graph model with application to brain connectivity analysis. Biometrics 73(3), 780–791 (2017)
  • (112) Xia, Y., Cai, T., Cai, T.T.: Testing differential networks with applications to the detection of gene-gene interactions. Biometrika 102(2), 247–266 (2015)
  • (113) Xia, Y., Li, L.: Matrix graph hypothesis testing and application in brain connectivity alternation detection. Stat. Sin. 29(1), 303–328 (2019)
  • (114) Kim, B., Liu, S., Kolar, M.: Two-sample inference for high-dimensional Markov networks. J. R. Statist. Soc. B (2021)
  • (115) Zhang, L., Ma, R., Cai, T.T., Li, H.: Estimation, confidence intervals, and large-scale hypotheses testing for high-dimensional mixed linear regression. arXiv preprint arXiv:2011.03598 (2020)
  • (116) Li, S., Cai, T.T., Li, H.: Inference for high-dimensional linear mixed-effects models: A quasi-likelihood approach. J. Am. Statist. Assoc. 116, 1–12 (2021)
  • (117) Barber, R.F., Candès, E.J.: Controlling the false discovery rate via knockoffs. Ann. Statist. 43(5), 2055–2085 (2015)
  • (118) Candes, E., Fan, Y., Janson, L., Lv, J.: Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J. R. Statist. Soc. B 80(3), 551–577 (2018)
  • (119) Barber, R.F., Candès, E.J., Samworth, R.J.: Robust inference with knockoffs. Ann. Statist. 48(3), 1409–1431 (2020)
  • (120) Ren, Z., Wei, Y., Candès, E.: Derandomizing knockoffs. J. Am. Statist. Assoc. 116, 1–11 (2021)
  • (121) Du, L., Guo, X., Sun, W., Zou, C.: False discovery rate control under general dependence by symmetrized data aggregation. J. Am. Statist. Assoc. 116, 1–15 (2021)
  • (122) Dai, C., Lin, B., Xing, X., Liu, J.S.: A scale-free approach for false discovery rate control in generalized linear models. J. Am. Statist. Assoc. (just-accepted), 1–31 (2023)
  • (123) Vovk, V., Wang, R.: E-values: Calibration, combination and applications. Ann. Statist. 49(3), 1736–1754 (2021)
  • (124) Wang, R., Ramdas, A.: False discovery rate control with e-values. arXiv preprint arXiv:2009.02824 (2020)
  • (125) Ren, Z., Barber, R.F.: Derandomized knockoffs: leveraging e-values for false discovery rate control. arXiv preprint arXiv:2205.15461 (2022)