
Statistical Inference and Large-scale Multiple Testing for High-dimensional Regression Models

T. Tony Cai (tcai@wharton.upenn.edu), Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, USA

Zijian Guo (zijguo@stat.rutgers.edu), Department of Statistics, Rutgers University, Piscataway, USA

Yin Xia (xiayin@fudan.edu.cn), Department of Statistics and Data Science, Fudan University, Shanghai, China
Abstract

This paper presents a selective survey of recent developments in statistical inference and multiple testing for high-dimensional regression models, including linear and logistic regression. We examine the construction of confidence intervals and hypothesis tests for various low-dimensional objectives such as regression coefficients and linear and quadratic functionals. The key technique is to generate debiased and desparsified estimators for the targeted low-dimensional objectives and to quantify their uncertainty. In addition to covering the motivations for and intuitions behind these statistical methods, we discuss their optimality and adaptivity in the context of high-dimensional inference. We also review recent developments in statistical inference based on multiple regression models and advances in large-scale multiple testing for high-dimensional regression. The R package SIHR implements some of the high-dimensional inference methods discussed in this paper.

keywords:
confidence interval, debiasing, false discovery rate, hypothesis testing, linear functionals, quadratic functionals, simultaneous inference

1 Introduction

High-dimensional data analysis has become a vital part of scientific research in many fields. While the abundance of high-dimensional data offers numerous opportunities for statistical analysis, it also presents significant technical challenges, as the number of variables can be much larger than the number of observations. In high-dimensional settings, most classical inferential procedures, such as maximum likelihood estimation, are no longer valid. In recent years, there has been significant progress in developing new theories and methods for parameter estimation, hypothesis testing, confidence interval construction, and large-scale simultaneous inference in the context of high-dimensional data analysis.

In this paper, we provide a selective survey of recent advances in statistical inference and multiple testing for high-dimensional regression models, commonly used in modern data analysis across various fields such as genetics, metabolomics, finance, health, and economics. Much progress has been made in estimation and prediction under the high-dimensional linear and generalized linear models (GLMs); see, for example, tibshirani1996regression ; candes2007dantzig ; zou2005regularization ; zou2006adaptive ; bickel2009simultaneous ; buhlmann2011statistics ; negahban2009unified ; van2009conditions ; huang2012estimation ; fan2001variable ; zhang2010nearly ; belloni2011square ; sun2012scaled ; bunea2008honest ; bach2010self ; meier2008group ; bellec2018slope ; friedman2010regularization ; efron2004least ; greenshtein2004persistence . Theoretical properties have been established in different settings, including the minimax estimation rate and the rate of convergence for the estimation and prediction errors of several penalized procedures.

Uncertainty quantification is at the heart of many critical scientific applications and is more challenging than point estimation. For high-dimensional regression models, although the Lasso and other penalized estimators have been shown to achieve the optimal rates of convergence for estimation, these estimators suffer from non-negligible bias that makes them unsuitable to be directly used for statistical inference, as noted in several studies (van2014asymptotically, ; javanmard2014confidence, ; zhang2014confidence, ). To overcome this issue, debiased inference methods have been developed in (van2014asymptotically, ; javanmard2014confidence, ; zhang2014confidence, ) that correct the bias of penalized estimators and allow for statistical inference based on the debiased estimators. The development of debiased inference methods has led to an increase in research on statistical inference for a wide range of low-dimensional objectives in different high-dimensional models; see, for example, guo2021inference ; guo2019optimal ; guo2021group ; cai2020semisupervised ; ma2022statistical ; cai2017confidence ; athey2018approximate ; cai2021statistical ; ning2017general ; ren2015asymptotic ; yu2018confidence ; fang2020test ; zhou2020estimation ; zhao2014general ; javanmard2020flexible ; chen2019inference ; zhu2018linear ; guo2018testing ; cai2018high ; eftekhari2021inference ; deshpande2018accurate ; fang2017testing ; neykov2018unified ; dezeure2015high . In particular, cai2017confidence studied the minimaxity and adaptivity of confidence intervals for general linear functionals of a high-dimensional regression vector and found significant differences between the cases of sparse and dense loading vectors. Another important inference tool, known as Neyman’s orthogonalization or double machine learning, has been proposed in econometrics to enable inference for low-dimensional objectives with high-dimensional nuisance parameters; see, for example, (belloni2014inference, ; chernozhukov2015valid, ; farrell2015robust, ; chernozhukov2018double, ; belloni2017program, ).

In the single regression model (one-sample) setting, we observe data $\{Y_{k},X_{k,\cdot}\}_{1\leq k\leq n}$, where $Y_{k}\in\mathbb{R}$ and $X_{k,\cdot}\in\mathbb{R}^{p+1}$ denote the outcome and the high-dimensional covariates, respectively, generated independently from the high-dimensional GLM,

Y_{k}=h(X_{k,\cdot}^{\intercal}\beta)+\epsilon_{k},\quad\text{for}\quad 1\leq k\leq n \qquad (1)

with ${\mathbb{E}}(\epsilon_{k}|X_{k,\cdot})=0$ and the high-dimensional regression vector $\beta\in\mathbb{R}^{p+1}$. Throughout the paper, we use $\beta_{1}$ to denote the intercept and set $X_{k,1}=1$. We assume that the high-dimensional covariates $X_{k,-1}\in\mathbb{R}^{p}$ are centered and sub-gaussian and that the matrix $\Sigma\equiv{\mathbb{E}}X_{k,\cdot}X_{k,\cdot}^{\intercal}\in\mathbb{R}^{(p+1)\times(p+1)}$ is well-conditioned. We focus on the linear and logistic regression models with the link functions $h(z)=z$ and $h(z)=\exp(z)/[1+\exp(z)]$, respectively. The regression vector $\beta$ is assumed to be sparse, and its sparsity level is denoted by $\|\beta\|_{0}$. The high-dimensional covariates $X_{k,\cdot}$ might come from a large number of measured covariates or from basis transformations of the baseline covariates. For the linear model, we further assume the error $\epsilon_{k}$ is sub-gaussian with homoscedastic regression error $\sigma_{\epsilon}^{2}={\mathbb{E}}(\epsilon_{k}^{2}|X_{k,\cdot})$.

In addition to the one-sample setting, we examine statistical inference methods for two-sample high-dimensional regression models. For $d=1,2$, we assume that the data $\{Y^{\scriptscriptstyle(d)}_{k},X^{\scriptscriptstyle(d)}_{k,\cdot}\}_{1\leq k\leq n_{d}}$ are i.i.d. generated, following

Y^{\scriptscriptstyle(d)}_{k}=h([X^{\scriptscriptstyle(d)}_{k,\cdot}]^{\intercal}\beta^{\scriptscriptstyle(d)})+\epsilon^{\scriptscriptstyle(d)}_{k},\quad\text{for}\quad 1\leq k\leq n_{d}, \qquad (2)

where ${\mathbb{E}}(\epsilon^{\scriptscriptstyle(d)}_{k}|X^{\scriptscriptstyle(d)}_{k,\cdot})=0$ and $h(\cdot)$ is the pre-specified link function.

Based on the models above, this paper first addresses the challenges of making statistical inferences for low-dimensional objectives (e.g., linear and quadratic functionals) in high-dimensional regression, both in one- and two-sample settings. Specifically, the following quantities are of particular interest.

1. Linear functional $x_{\rm new}^{\intercal}\beta$ with $x_{\rm new}\in\mathbb{R}^{p+1}$ in the one-sample setting. The linear functional $x_{\rm new}^{\intercal}\beta$ includes as special cases the single regression coefficient $\beta_{j}$ (van2014asymptotically; javanmard2014confidence; zhang2014confidence; cai2017confidence) when $x_{\rm new}$ is the $j$-th canonical unit vector, and the conditional mean of the outcome under (1) when $x_{\rm new}$ is a future observation's covariates. When $x_{\rm new}$ denotes the average of the covariate observations for a group, $x_{\rm new}^{\intercal}\beta$ is closely related to the average treatment effect athey2018approximate . In logistic regression, the linear functional $x_{\rm new}^{\intercal}\beta$ is closely related to the case probability guo2021inference .

2. Quadratic functionals $\beta_{G}^{\intercal}A\beta_{G}$ and $\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}$ with $G\subset\{1,2,\cdots,p+1\}$ and $A\in\mathbb{R}^{|G|\times|G|}$ in the one-sample setting. For a subset $G$, these quadratic functionals measure the total effect of the variables in $G$. Statistical inference for quadratic functionals can be motivated by the group significance test and the (local) genetic heritability estimation guo2019optimal ; guo2021group . The inference method can be generalized to handle heterogeneous effect tests, hierarchical testing, prediction loss evaluation, and confidence ball construction guo2021group ; cai2020semisupervised .

3. Difference between linear functionals $h(x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(2)})-h(x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(1)})$ with $x_{\rm new}\in\mathbb{R}^{p+1}$ in the two-sample setting. This difference measures the discrepancy between the conditional means, which is closely related to individual treatment selection for the new observation $x_{\rm new}\in\mathbb{R}^{p+1}$ cai2021optimal .

4. Inner products of regression vectors $[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)}$ and $[\beta^{\scriptscriptstyle(1)}]^{\intercal}A\beta^{\scriptscriptstyle(2)}$ with the weighting matrix $A\in\mathbb{R}^{(p+1)\times(p+1)}$ in the two-sample setting. The inner product of regression vectors, or its weighted version, measures the similarity between the two regression vectors. In genetic studies, such inner products can be used as a genetic relatedness measure when the covariates are genetic variants and the outcome variables are different phenotypes guo2019optimal ; ma2022statistical .

We examine statistical inference procedures for linear and quadratic functionals from both methodological and theoretical perspectives. We also discuss the optimality results for the corresponding estimation and inference problems. A user-friendly R package SIHR rakshit2021sihr has been developed to implement the statistical inference methods for the low-dimensional objectives mentioned above. This package provides a convenient way to apply the discussed statistical inference methods.

Beyond the aforementioned inference for a single coordinate of the regression vector or other one-dimensional functionals, we also discuss the simultaneous inference of high-dimensional regression models. This includes using global methods with maximum-type statistics to test the entire regression coefficient vector (dezeure2017high, ; zhang2017simultaneous, ; ma2021global, ), as well as component-wise simultaneous inference methods that control the false discovery rate (FDR). Specifically, we examine the one-sample testing of high-dimensional linear regression coefficients liuluo2014 , the comparison of two high-dimensional linear regression models xia2018two , and the joint testing of regression coefficients across multiple responses xia2018joint . We also extend our discussion to logistic regression models. We discuss these large-scale multiple testing problems focusing on controlling the asymptotic FDR.

While error rate control is important for simultaneous inference, statistical power is also crucial. However, many existing testing methods for high-dimensional linear models do not consider auxiliary information, such as model sparsity and heteroscedasticity, that could improve statistical power. While there has been a significant amount of research on methods to enhance power in multiple testing in general (BenHoc97; storey2002direct; genovese2006false; Roeder2009; Ignatiadis2016; LeiFithian2018; Li2019; LAWS; fithian2022conditional, among many others), recent efforts have also focused on simultaneous inference methods that incorporate auxiliary information to improve the power of high-dimensional regression analysis. For example, xia2018joint achieved power gains by leveraging similarities across multivariate responses; xia2020gap explored the sparsity information hidden in the data structures and improved the power through $p$-value weighting mechanisms; liu2021integrative obtained power enhancement by integrating heterogeneous linear models. In the current paper, we primarily focus on power enhancement in a two-sample inference setting where the high-dimensional objects of interest are individually sparse. We will discuss methods for controlling the FDR with and without power enhancement, along with the related theoretical aspects.

The rest of the paper is organized as follows. We finish this section by introducing the notation. Section 2 discusses the debiased inference idea for the regression coefficients, and Section 3 presents the debiased methods for linear and quadratic functionals in one- and two-sample settings. Section 4 focuses on simultaneous inference for high-dimensional regression models. We conclude the paper by discussing other related works in Section 5.

Notation. For an event $E$, denote by $\mathbb{I}\{E\}$ its indicator function. For an index set $J\subset\{1,2,\cdots,p\}$ and a vector $x\in\mathbb{R}^{p}$, $x_{J}$ is the sub-vector of $x$ with indices in $J$ and $x_{-J}$ is the sub-vector with indices in $J^{c}$. For a set $S$, $|S|$ denotes the cardinality of $S$. For a vector $x\in\mathbb{R}^{p}$, the $\ell_{q}$ norm of $x$ is defined as $\|x\|_{q}=\left(\sum_{l=1}^{p}|x_{l}|^{q}\right)^{\frac{1}{q}}$ for $q\geq 0$, with $\|x\|_{0}=|\{1\leq l\leq p:x_{l}\neq 0\}|$ and $\|x\|_{\infty}=\max_{1\leq l\leq p}|x_{l}|$. Denote by $e_{j}$ the $j$-th canonical unit vector and by ${\rm I}_{p}\in\mathbb{R}^{p\times p}$ the identity matrix. For a symmetric matrix $A$, denote by $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ its maximum and minimum eigenvalues, respectively. For a matrix $A\in\mathbb{R}^{n\times p}$, $A_{\cdot,j}$ and $A_{i,\cdot}$ respectively denote the $j$-th column and $i$-th row of $A$, $A_{i,j}$ denotes the $(i,j)$-th entry of $A$, $A_{i,-j}$ denotes the $i$-th row of $A$ with its $j$-th entry removed, $A_{-i,j}$ denotes the $j$-th column of $A$ with its $i$-th entry removed, $A_{i,-\{j_{1},j_{2}\}}$ denotes the $i$-th row of $A$ with its $j_{1}$-th and $j_{2}$-th entries both removed, and $A_{-i,-j}\in\mathbb{R}^{(n-1)\times(p-1)}$ denotes the submatrix of $A$ with its $i$-th row and $j$-th column removed. We use $c$ and $C$ to denote generic positive constants that may vary from place to place. For two positive sequences $a_{n}$ and $b_{n}$, $a_{n}\lesssim b_{n}$ means there exists a constant $C>0$ such that $a_{n}\leq Cb_{n}$ for all $n$; $a_{n}\asymp b_{n}$ if $a_{n}\lesssim b_{n}$ and $b_{n}\lesssim a_{n}$; and $a_{n}\ll b_{n}$ if $\mathop{\overline{\rm lim}}_{n\rightarrow\infty}{a_{n}}/{b_{n}}=0$. For a sequence of random variables $Z_{n}$, we write $Z_{n}=o_{{\mathbb{P}}}\{a_{n}\}$ if $Z_{n}/a_{n}$ converges to zero in probability and $Z_{n}=O_{{\mathbb{P}}}\{a_{n}\}$ if $Z_{n}/a_{n}$ is bounded in probability as $n\rightarrow\infty$.

2 Inference for Regression Coefficients

In this section, we begin with a discussion in Section 2.1 of several commonly used penalized estimators for high-dimensional GLMs. We review in Section 2.2 the debiased methods for linear models introduced in zhang2014confidence ; van2014asymptotically ; javanmard2014confidence and discuss their extensions to high-dimensional logistic regression in Section 2.3. We also present the optimality of confidence interval construction in both linear and logistic high-dimensional regression models.

2.1 Estimation in high-dimensional regression

For the high-dimensional linear model (1), a commonly used estimator of the regression vector $\beta$ is the Lasso estimator tibshirani1996regression , defined as

\widehat{\beta}=\mathop{\rm arg\min}_{\beta\in\mathbb{R}^{p+1}}\frac{\|Y-X\beta\|_{2}^{2}}{2n}+\lambda_{0}\sum_{j=2}^{p+1}\frac{\|X_{\cdot,j}\|_{2}}{\sqrt{n}}|\beta_{j}|, \qquad (3)

with the tuning parameter $\lambda_{0}=A\sigma_{\epsilon}\sqrt{\log p/n}$ for some positive constant $A>2$. In the penalized regression (3), we do not penalize the intercept $\beta_{1}$. The tuning parameter $\lambda_{0}$ is typically chosen by cross-validation, as implemented in the R package glmnet friedman2010regularization . With the Lasso estimator $\widehat{\beta}$, the variance $\sigma_{\epsilon}^{2}$ can be estimated by $\widehat{\sigma}_{\epsilon}^{2}=\tfrac{1}{n}\|Y-X\widehat{\beta}\|^{2}_{2}$.
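To make the estimation step concrete, the following is a minimal R sketch (not the SIHR implementation) of fitting the penalized regression (3) with a cross-validated tuning parameter via glmnet, which leaves the intercept unpenalized by default, together with the residual-based estimate of $\sigma_{\epsilon}^{2}$; the simulated data and variable names are purely illustrative.

```r
# Illustrative sketch: Lasso fit for the linear model (3) with cross-validated
# tuning, followed by the residual-based noise-level estimate.
library(glmnet)

set.seed(1)
n <- 200; p <- 500
X <- matrix(rnorm(n * p), n, p)                 # covariates without the intercept column
beta <- c(rep(1, 5), rep(0, p - 5))             # sparse regression vector
y <- as.numeric(1 + X %*% beta + rnorm(n))      # intercept beta_1 = 1, sigma_eps = 1

cv_fit   <- cv.glmnet(X, y)                                  # lambda_0 chosen by cross-validation
beta_hat <- as.numeric(coef(cv_fit, s = "lambda.min"))       # (p+1)-vector, intercept first

resid     <- y - cbind(1, X) %*% beta_hat
sigma_hat2 <- sum(resid^2) / n                  # estimate of sigma_epsilon^2
```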

The tuning parameter $\lambda_{0}$ in (3) depends on the noise level $\sigma_{\epsilon}$. Alternative estimators have been proposed such that the tuning parameter does not depend on the unknown noise level (e.g., sun2012scaled ; belloni2011square ). Particularly, sun2012scaled proposed the scaled Lasso estimator

\{\widehat{\beta},\widehat{\sigma}_{\epsilon}\}=\arg\min_{\beta\in\mathbb{R}^{p+1},\sigma_{\epsilon}\in\mathbb{R}^{+}}\frac{\|y-X\beta\|_{2}^{2}}{2n\sigma_{\epsilon}}+\frac{\sigma_{\epsilon}}{2}+\lambda_{0}\sum_{j=2}^{p+1}\frac{\|X_{\cdot,j}\|_{2}}{\sqrt{n}}|\beta_{j}| \qquad (4)

with the tuning parameter $\lambda_{0}=A\sqrt{\log p/n}$ for some positive constant $A>2$. Beyond the estimators mentioned above, a wide collection of estimators of high-dimensional regression vectors has been proposed (e.g., zou2005regularization ; zou2006adaptive ; buhlmann2011statistics ; negahban2009unified ; huang2012estimation ; fan2001variable ; zhang2010nearly ; belloni2011square ).

For the high-dimensional logistic model (1), penalized methods have also been well developed to estimate $\beta\in\mathbb{R}^{p+1}$ (e.g., bunea2008honest ; bach2010self ; buhlmann2011statistics ; meier2008group ; negahban2009unified ; huang2012estimation ). In this paper, we use the penalized log-likelihood estimator $\widehat{\beta}$ in buhlmann2011statistics , defined as

\widehat{\beta}=\mathop{\rm arg\min}_{\beta}\frac{1}{n}\sum_{k=1}^{n}\{\log[1+\exp(X_{k,\cdot}^{\intercal}\beta)]-Y_{k}(X_{k,\cdot}^{\intercal}\beta)\}+\lambda_{0}\sum_{j=2}^{p+1}\frac{\|X_{\cdot,j}\|_{2}}{\sqrt{n}}|\beta_{j}|, \qquad (5)

with a positive tuning parameter $\lambda_{0}\asymp\sqrt{\log p/n}$.
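Analogously, a minimal R sketch of the penalized logistic estimator (5) can be obtained from glmnet with the binomial family; the data generation below is illustrative, and the cross-validated tuning parameter stands in for $\lambda_{0}\asymp\sqrt{\log p/n}$.

```r
# Illustrative sketch: penalized logistic regression as in (5) via glmnet.
library(glmnet)

set.seed(1)
n <- 300; p <- 400
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(0.5, 5), rep(0, p - 5))
prob <- 1 / (1 + exp(-(0.5 + drop(X %*% beta))))   # logistic link h(z)
y <- rbinom(n, 1, prob)

cv_fit   <- cv.glmnet(X, y, family = "binomial")
beta_hat <- as.numeric(coef(cv_fit, s = "lambda.min"))   # intercept first
```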

2.2 Debiased or desparsified estimators in linear models

The penalized estimators introduced in Section 2.1 have been shown to achieve the optimal convergence rate and satisfy desirable variable selection properties meinshausen2006high ; bickel2009simultaneous ; zhao2006model ; wainwright2009sharp . However, (van2014asymptotically, ; javanmard2014confidence, ; zhang2014confidence, ) highlighted that the Lasso and other penalized estimators are not ready for statistical inference due to the non-negligible estimation bias. They further proposed correcting the penalized estimators’ bias and then making inferences based on the bias-corrected estimators.

In the following, we present the main idea of the bias correction method. To illustrate the main idea, we fix the parameter index $2\leq j\leq p+1$ and focus on the confidence interval construction for $\beta_{j}$ in the model (1). With $\widehat{\beta}$ denoting the Lasso estimator in (3), the main idea of the method proposed in zhang2014confidence ; javanmard2014confidence is to estimate the error of the plug-in estimator, $\widehat{\beta}_{j}-\beta_{j}$. The approximation of the error $\widehat{\beta}_{j}-\beta_{j}$ can be motivated by the following decomposition: for any vector $u\in\mathbb{R}^{p+1}$,

u^{\intercal}\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}(Y_{k}-X_{k,\cdot}^{\intercal}\widehat{\beta})-(\beta_{j}-\widehat{\beta}_{j})=u^{\intercal}\frac{1}{n}X^{\intercal}\epsilon+\left(\widehat{\Sigma}u-e_{j}\right)^{\intercal}(\beta-\widehat{\beta}), \qquad (6)

with $\widehat{\Sigma}=\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}X_{k,\cdot}^{\intercal}$. We explain how to construct the vector $u$ by balancing the two terms on the right-hand side of the decomposition in (6). The first term on the right-hand side of (6) has conditional variance $\sigma_{\epsilon}^{2}/n\cdot u^{\intercal}\widehat{\Sigma}u$, while the second term can be further upper-bounded by

\left|\left(\widehat{\Sigma}u-e_{j}\right)^{\intercal}(\beta-\widehat{\beta})\right|\leq\|\widehat{\Sigma}u-e_{j}\|_{\infty}\|\beta-\widehat{\beta}\|_{1}. \qquad (7)

An algorithmic method of constructing $u$ is to constrain the bias and minimize the variance. Particularly, zhang2014confidence ; javanmard2014confidence proposed the following construction of the projection direction,

\widehat{u}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\widehat{\Sigma}u\quad\text{subject to}\quad\|\widehat{\Sigma}u-e_{j}\|_{\infty}\leq\lambda \qquad (8)

with $\lambda\asymp\sqrt{{\log p}/{n}}$ denoting a positive tuning parameter. The construction in (8) is designed to minimize the conditional variance $u^{\intercal}\widehat{\Sigma}u$ of the "asymp normal" term and to constrain $\|\widehat{\Sigma}u-e_{j}\|_{\infty}$, which further controls the "remaining bias" term as in (7).

With the projection direction $\widehat{u}$ in (8), zhang2014confidence ; javanmard2014confidence introduced the debiased estimator,

\widehat{\beta}^{\rm Deb}_{j}=\widehat{\beta}_{j}+\widehat{u}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}(Y_{k}-X_{k,\cdot}^{\intercal}\widehat{\beta}),\quad\text{for}\quad 2\leq j\leq p+1. \qquad (9)

Following from (6), we obtain the following error decomposition of $\widehat{\beta}^{\rm Deb}_{j}$,

\widehat{\beta}^{\rm Deb}_{j}-\beta_{j}=\widehat{u}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}\epsilon_{k}+(\widehat{\Sigma}\widehat{u}-e_{j})^{\intercal}(\beta-\widehat{\beta}). \qquad (10)

The first term on the right-hand side of (10) is asymptotically normal with asymptotic variance $(\sigma_{\epsilon}^{2}/n)\cdot\widehat{u}^{\intercal}\widehat{\Sigma}\widehat{u}$. As implied by (7), the projection direction $\widehat{u}$ in (8) controls the second term on the right-hand side of (10) by the upper bound $\lambda\|\beta-\widehat{\beta}\|_{1}$, with the tuning parameter $\lambda$ specified in (8). This upper bound can be shown to be of rate $\|\beta\|_{0}\log p/n$, so the rate of convergence of the estimation error $\widehat{\beta}^{\rm Deb}_{j}-\beta_{j}$ is $\frac{1}{\sqrt{n}}+\|\beta\|_{0}\frac{\log p}{n}$, which is the minimax optimal rate for estimating the regression coefficient $\beta_{j}$ ren2015asymptotic ; cai2021statistical . More importantly, zhang2014confidence ; javanmard2014confidence have shown that the debiased estimator $\widehat{\beta}^{\rm Deb}_{j}$ in (9) is approximately unbiased and asymptotically normal when $\|\beta\|_{0}\ll\sqrt{n}/\log p$. Based on the asymptotic normality, zhang2014confidence ; javanmard2014confidence constructed the following confidence interval

\mathrm{CI}=\left(\widehat{\beta}^{\rm Deb}_{j}-z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}},\quad\widehat{\beta}^{\rm Deb}_{j}+z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right)\quad\text{with}\quad\widehat{\mathrm{V}}=\frac{\widehat{\sigma}_{\epsilon}^{2}}{n}\widehat{u}^{\intercal}\widehat{\Sigma}\widehat{u}, \qquad (11)

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution.
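The following R sketch illustrates the full pipeline for a single coefficient: the projection direction is obtained by solving (8) with the generic convex-optimization package CVXR (an illustrative reimplementation, not the code of the cited papers or of SIHR), and the debiased estimator (9) and interval (11) are then formed; the dimensions, tuning choices, and variable names are assumptions made for the example.

```r
# Illustrative sketch: projection direction (8), debiased estimator (9), CI (11).
library(glmnet)
library(CVXR)

set.seed(1)
n <- 200; p <- 120                         # moderate p so the quadratic program solves quickly
Xd <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 3), rep(0, p - 3))
y <- as.numeric(1 + Xd %*% beta + rnorm(n))

X <- cbind(1, Xd)                          # include the intercept column X_{k,1} = 1
beta_hat <- as.numeric(coef(cv.glmnet(Xd, y), s = "lambda.min"))
sigma_hat2 <- sum((y - X %*% beta_hat)^2) / n

j <- 2                                     # coefficient of interest (first covariate)
Sigma_hat <- crossprod(X) / n
e_j <- rep(0, p + 1); e_j[j] <- 1
lambda <- sqrt(log(p) / n)                 # tuning parameter of order sqrt(log p / n)

u <- Variable(p + 1)
prob <- Problem(Minimize(quad_form(u, Sigma_hat)),
                list(max_entries(abs(Sigma_hat %*% u - e_j)) <= lambda))  # infinity-norm constraint
u_hat <- as.numeric(solve(prob)$getValue(u))

beta_deb <- beta_hat[j] + mean((X %*% u_hat) * (y - X %*% beta_hat))      # debiased estimator (9)
V_hat <- sigma_hat2 * as.numeric(t(u_hat) %*% Sigma_hat %*% u_hat) / n
ci <- beta_deb + c(-1, 1) * qnorm(0.975) * sqrt(V_hat)                    # 95% interval (11)
```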

Remark 1.

In the low-dimensional setting, we may set $\lambda=0$ in (8) and obtain $\widehat{u}=\widehat{\Sigma}^{-1}e_{j}$. This choice of $\widehat{u}$ reduces the debiased estimator in (9) to the OLS estimator. The debiased estimator is also referred to as the "desparsified" estimator zhang2014confidence ; van2014asymptotically since $\widehat{\beta}^{\rm Deb}_{j}$ is generally not zero even if the true $\beta_{j}$ is zero. Hence, even if $\beta$ is a sparse vector, the vector $\widehat{\beta}^{\rm Deb}=(\widehat{\beta}^{\rm Deb}_{1},\widehat{\beta}^{\rm Deb}_{2},\cdots,\widehat{\beta}^{\rm Deb}_{p+1})^{\intercal}$ is dense. Consequently, the vector $\widehat{\beta}^{\rm Deb}$ does not estimate $\beta$ well in general, even though $\widehat{\beta}^{\rm Deb}_{j}$ is an optimal estimator of $\beta_{j}$ for every $2\leq j\leq p+1$.

2.2.1 Optimality of statistical inference

For the high-dimensional linear model, cai2021statistical established the minimax expected length of confidence intervals over the parameter space

\Theta(s)=\left\{\theta=(\beta,\Sigma,\sigma_{\epsilon}):\|\beta\|_{0}\leq s,\;c_{0}\leq\lambda_{\min}(\Sigma)\leq\lambda_{\max}(\Sigma)\leq C_{0},\;0<\sigma_{\epsilon}\leq C_{1}\right\}, \qquad (12)

where $C_{0}\geq c_{0}>0$ and $C_{1}>0$ are positive constants. The space $\Theta(s)$ contains all regression vectors with at most $s$ non-zero elements. As established in cai2021statistical , the minimax expected length of confidence intervals over $\Theta(s)$ in the regime $s\lesssim n/\log p$ is $\frac{1}{\sqrt{n}}+s\frac{\log p}{n}$. When $s\lesssim\frac{\sqrt{n}}{\log p}$, the optimal length $1/\sqrt{n}$ can be achieved by the confidence interval in (11). Over the regime $\frac{\sqrt{n}}{\log p}\ll s\lesssim n/\log p$, cai2021statistical proposed a confidence interval attaining the optimal rate $s\log p/n$, where the construction requires prior knowledge of the sparsity level $s$. We illustrate the minimax expected length in Figure 1.

Figure 1: An illustration of the minimax optimality and adaptivity of the confidence intervals with respect to the sparsity $s$ of $\beta$ for the unknown design setting. The top of the figure reports the minimax expected lengths of the confidence intervals; the bottom of the figure indicates whether adaptation to the sparsity $s$ is possible.

More importantly, cai2017confidence studied the possibility of constructing a rate-optimal adaptive confidence interval. Here, an adaptive confidence interval means that, even without knowledge of the sparsity level $s$, the length of the constructed confidence interval automatically adapts to $s$. Since the sparsity of the regression vector $\beta$ is generally unknown, it is desirable to construct an adaptive confidence interval. The work cai2017confidence established the following adaptivity results: if the design covariance matrix $\Sigma$ is unknown, it is possible to construct an adaptive confidence interval only in the ultra-sparse regime $s\lesssim\sqrt{n}/\log p$. That is, if $\sqrt{n}/\log p\ll s\lesssim n/\log p$, it is impossible to construct a rate-optimal confidence interval that is adaptive to the sparsity level. This phase transition in the possibility of constructing adaptive confidence intervals is presented in Figure 1.

The information of $\Sigma$ is critical for constructing optimal confidence intervals. If $\Sigma$ is known, the minimax expected length is $1/\sqrt{n}$ over the entire sparse regime $s\lesssim n/\log p$; see javanmard2018debiasing ; cai2017confidence for details. This contrasts sharply with the optimality results in Figure 1.

2.2.2 Another viewpoint: debiasing with decorrelation

In this subsection, we take a short detour to introduce a slightly different view of the debiased Lasso estimator proposed in zhang2014confidence ; van2014asymptotically , which takes the form

\widetilde{\beta}^{\rm Deb}_{j}=\frac{Z_{\cdot,j}^{\intercal}(Y-X_{\cdot,-j}\widehat{\beta}_{-j})}{Z_{\cdot,j}^{\intercal}X_{\cdot,j}}, \qquad (13)

where $Z_{\cdot,j}\in\mathbb{R}^{n}$ is a decorrelation vector to be specified. zhang2014confidence ; van2014asymptotically proposed to construct the vector $Z_{\cdot,j}\in\mathbb{R}^{n}$ as the residual $Z_{\cdot,j}=X_{\cdot,j}-X_{\cdot,-j}\widehat{\gamma}$, with the Lasso estimator $\widehat{\gamma}=\mathop{\rm arg\min}_{\gamma}\frac{1}{2n}\|X_{\cdot,j}-X_{\cdot,-j}\gamma\|_{2}^{2}+\lambda_{\gamma}\sum_{l\neq j}\frac{\|X_{\cdot,l}\|_{2}}{\sqrt{n}}|\gamma_{l}|$, where $\lambda_{\gamma}>0$ is a positive tuning parameter. The KKT condition ensures that the residual $Z_{\cdot,j}=X_{\cdot,j}-X_{\cdot,-j}\widehat{\gamma}$ is nearly orthogonal to all columns of $X_{\cdot,-j}$, via

\frac{1}{n}\|Z_{\cdot,j}^{\intercal}X_{\cdot,-j}\|_{\infty}\leq\lambda_{\gamma}\cdot\max_{l\neq j}\frac{\|X_{\cdot,l}\|_{2}}{\sqrt{n}}.

To see the effectiveness of the estimator in (13), we examine its estimation error

\widetilde{\beta}^{\rm Deb}_{j}-\beta_{j}=\frac{Z_{\cdot,j}^{\intercal}\epsilon}{Z_{\cdot,j}^{\intercal}X_{\cdot,j}}+\frac{Z_{\cdot,j}^{\intercal}X_{\cdot,-j}(\beta_{-j}-\widehat{\beta}_{-j})}{Z_{\cdot,j}^{\intercal}X_{\cdot,j}}. \qquad (14)

In the above expression, the first term on the right-hand side can be shown to be asymptotically normal under regularity conditions, while the second term is constrained as

\left|\frac{Z_{\cdot,j}^{\intercal}X_{\cdot,-j}(\beta_{-j}-\widehat{\beta}_{-j})}{Z_{\cdot,j}^{\intercal}X_{\cdot,j}}\right|\leq\frac{1}{n}\|Z_{\cdot,j}^{\intercal}X_{\cdot,-j}\|_{\infty}\cdot\frac{\|\beta_{-j}-\widehat{\beta}_{-j}\|_{1}}{\frac{1}{n}|Z_{\cdot,j}^{\intercal}X_{\cdot,j}|}\leq C\lambda_{\gamma}\|\widehat{\beta}-\beta\|_{1},

where the last inequality holds with high probability for some positive constant $C>0$. With the above argument, zhang2014confidence ; van2014asymptotically have shown that the first term in the decomposition (14) is the dominating term in the ultra-sparse regime $\|\beta\|_{0}\ll\sqrt{n}/\log p$. This leads to the asymptotic normality of the estimator $\widetilde{\beta}^{\rm Deb}_{j}$. zhang2014confidence ; van2014asymptotically further constructed confidence intervals based on this asymptotic normality. In the current paper, we focus on methods generalizing the debiased estimator in (9), instead of the decorrelation form in (13). However, the decorrelation idea has been extended to handle other statistical inference problems, including the high-dimensional generalized linear model ning2017general ; van2014asymptotically , the Gaussian graphical model ning2017general , the high-dimensional confounding model guo2020doubly ; sun2022decorrelating , and the high-dimensional additive model guo2019local .
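As an illustration of the decorrelation viewpoint, the following R sketch forms $Z_{\cdot,j}$ as the residual of a nodewise Lasso regression of $X_{\cdot,j}$ on the remaining columns and then computes (13); the data generation and tuning choices are illustrative assumptions, and the intercept is omitted for simplicity.

```r
# Illustrative sketch of the decorrelation-based estimator (13).
library(glmnet)

set.seed(1)
n <- 200; p <- 300
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 3), rep(0, p - 3))
y <- as.numeric(X %*% beta + rnorm(n))

j <- 1
cv0 <- cv.glmnet(X, y, intercept = FALSE)                       # initial Lasso fit beta_hat
beta_hat <- as.numeric(coef(cv0, s = "lambda.min"))[-1]         # drop the (zero) intercept entry

node_fit <- cv.glmnet(X[, -j], X[, j])                          # nodewise Lasso of X_j on X_{-j}
Z_j <- X[, j] - as.numeric(predict(node_fit, X[, -j], s = "lambda.min"))  # decorrelation residual

beta_tilde <- sum(Z_j * (y - X[, -j] %*% beta_hat[-j])) / sum(Z_j * X[, j])  # estimator (13)
```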

2.3 Debiasing in binary outcome models

We generalize the debiased estimator in (9) to the high-dimensional GLM with a binary outcome. Similarly to the Lasso estimator, the penalized logistic regression estimator $\widehat{\beta}$ in (5) suffers from bias due to the $\ell_{1}$ penalty. The bias-corrected estimator is proposed in the following generic form,

\widehat{\beta}^{\rm Deb}_{j}=\widehat{\beta}_{j}+\widehat{u}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}W_{k}X_{k,\cdot}(Y_{k}-h(X_{k,\cdot}^{\intercal}\widehat{\beta})), \qquad (15)

where $W_{k}\in\mathbb{R}$, for $1\leq k\leq n$, are weights to be specified and $\widehat{u}\in\mathbb{R}^{p+1}$ is the projection direction to be specified. We apply a Taylor expansion of the function $h$ and obtain

Y_{k}-h(X_{k,\cdot}^{\intercal}\widehat{\beta})=h(X_{k,\cdot}^{\intercal}\beta)-h(X_{k,\cdot}^{\intercal}\widehat{\beta})+\epsilon_{k}=h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}^{\intercal}(\beta-\widehat{\beta})+R_{k}+\epsilon_{k} \qquad (16)

with the approximation error $R_{k}=\int_{0}^{1}(1-t)h^{\prime\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta}+tX^{\intercal}_{k\cdot}(\beta-\widehat{\beta}))dt\cdot(X^{\intercal}_{k\cdot}(\widehat{\beta}-\beta))^{2}$. We plug the Taylor expansion (16) into the weighted bias-correction estimator in (15), leading to the error decomposition of $\widehat{\beta}^{\rm Deb}_{j}-\beta_{j}$ as

\underbrace{\frac{1}{n}\sum_{k=1}^{n}\widehat{u}^{\intercal}X_{k,\cdot}W_{k}\epsilon_{k}}_{\text{asymp normal}}+\underbrace{\frac{1}{n}\sum_{k=1}^{n}\left(W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\widehat{u}-e_{j}\right)^{\intercal}(\beta-\widehat{\beta})}_{\text{remaining bias}}+\underbrace{\frac{1}{n}\sum_{k=1}^{n}\widehat{u}^{\intercal}X_{k,\cdot}W_{k}R_{k}}_{\text{nonlinearity bias}}. \qquad (17)

In the following, we describe two ways of specifying the weights $W_{k}$.

1. Linearization weighting. guo2021inference proposed to construct the weight $W_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})$. Then the "remaining bias" term in (17) reduces to $\frac{1}{n}\sum_{k=1}^{n}\left(X_{k,\cdot}X_{k,\cdot}^{\intercal}\widehat{u}-e_{j}\right)^{\intercal}(\beta-\widehat{\beta})$, which is the same as the corresponding term in linear regression. This enables us to directly adopt the projection direction $\widehat{u}$ constructed in (8). This connection reveals the advantage of the weight $W_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})$: the bias correction developed under the linear regression model can be directly extended to logistic regression.

2. Link-specific weighting. For a general link function $h(\cdot)$, cai2021statistical constructed the weight $W_{k}=h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})/[h(X_{k,\cdot}^{\intercal}\widehat{\beta})(1-h(X_{k,\cdot}^{\intercal}\widehat{\beta}))]$. If $h$ is the logistic link, we have $h^{\prime}(\cdot)=h(\cdot)(1-h(\cdot))$ and obtain the constant weight $W_{k}=1$. Such link-specific weighting can be generalized to other binary outcome models (e.g., the probit model) with various link functions $h(\cdot)$; see details in cai2021statistical .

After specifying the weights, the projection direction can be constructed as

\widehat{u}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u \qquad (18)
\text{subject to}\quad\left\|\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u-e_{j}\right\|_{\infty}\leq\lambda,\quad\|Xu\|_{\infty}\leq\tau,

with positive tuning parameters $\lambda\asymp\sqrt{{\log p}/{n}}$ and $\tau\asymp\sqrt{\log n}$. The construction in (18) can be motivated from a similar viewpoint as (8). The constraint $\left\|\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u-e_{j}\right\|_{\infty}\leq\lambda$ is imposed to constrain the "remaining bias" term in (17), and $\|Xu\|_{\infty}\leq\tau$ is imposed to constrain the "nonlinearity bias" in (17). For the linearization weighting, the conditional variance of $\frac{1}{n}\sum_{k=1}^{n}{u}^{\intercal}X_{k,\cdot}W_{k}\epsilon_{k}$ is $u^{\intercal}\left(\frac{1}{n}\sum_{k=1}^{n}W^{2}_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u$, which is of the same order as the objective function in (18) for bounded $W_{k}$. So, instead of minimizing the exact variance, we minimize a scaled conditional variance in (18), which has the advantage of leading to almost the same optimization as in (8) for the linear model.

Theoretical properties of the debiased estimator (15) have been established for the logistic outcome model. With the weights $W_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})$ for the linearization weighting, guo2021inference established the asymptotic normality of $\widehat{\beta}^{\rm Deb}_{j}$ in (15). cai2021statistical established a similar theoretical property for $\widehat{\beta}^{\rm Deb}_{j}$ in (15) with the weights $W_{k}=1$ for the link-specific weighting. The asymptotic normality results in both works require the ultra-sparse condition $\|\beta\|_{0}\ll\sqrt{n}/[\log p\log n]$. The use of the weights $W_{k}=1$ in cai2021statistical leads to a smaller standard error than the weights $W_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})$ proposed in guo2021inference . The theoretical justification in cai2021statistical requires sample splitting, so that the initial estimator $\widehat{\beta}$ is constructed from an independent sample. Such sample splitting is not required in the analysis of guo2021inference , which is part of the benefit of linearization weighting.

Based on the asymptotic normality, we construct the confidence interval

\mathrm{CI}=\left(\widehat{\beta}^{\rm Deb}_{j}-z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}},\quad\widehat{\beta}^{\rm Deb}_{j}+z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right)\quad\text{with}\quad\widehat{\mathrm{V}}=\frac{1}{n}\widehat{u}^{\intercal}\widehat{\Sigma}^{G}\widehat{u}, \qquad (19)

where $\widehat{\Sigma}^{G}=\frac{1}{n}\sum_{k=1}^{n}W_{k}^{2}h(X_{k,\cdot}^{\intercal}\widehat{\beta})(1-h(X_{k,\cdot}^{\intercal}\widehat{\beta}))X_{k,\cdot}X_{k,\cdot}^{\intercal}$.
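To make the weighting schemes concrete, the following R sketch computes the two sets of weights $W_{k}$ for a fitted penalized logistic model, the weighted score used in (15), and the matrix $\widehat{\Sigma}^{G}$ appearing in (19); it is an illustrative sketch with assumed names, and the projection direction itself would be obtained from (18) as in the linear case.

```r
# Illustrative sketch: weights W_k for the logistic model and the matrix in (19).
library(glmnet)

set.seed(1)
n <- 300; p <- 150
Xd <- matrix(rnorm(n * p), n, p)
beta <- c(rep(0.5, 3), rep(0, p - 3))
y <- rbinom(n, 1, 1 / (1 + exp(-drop(Xd %*% beta))))

X <- cbind(1, Xd)
beta_hat <- as.numeric(coef(cv.glmnet(Xd, y, family = "binomial"), s = "lambda.min"))

h  <- function(z) 1 / (1 + exp(-z))            # logistic link
hp <- function(z) h(z) * (1 - h(z))            # its derivative h'(z)
lp <- drop(X %*% beta_hat)                     # linear predictor X_k' beta_hat

W_lin  <- 1 / hp(lp)                           # linearization weighting
W_link <- hp(lp) / (h(lp) * (1 - h(lp)))       # link-specific weighting; equals 1 for the logistic link

score   <- colMeans(W_lin * X * (y - h(lp)))   # (1/n) sum_k W_k X_k (Y_k - h(X_k' beta_hat)) in (15)
Sigma_G <- crossprod(sqrt(W_lin^2 * h(lp) * (1 - h(lp))) * X) / n   # Sigma_hat^G in (19)
```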

The optimality of confidence interval construction in high-dimensional logistic regression was studied in cai2021statistical . The minimax expected length and the regime in which adaptive confidence intervals can be constructed are similar to those in Figure 1, up to a polynomial order of $\log n$; see the results in cai2021statistical .

3 Linear and Quadratic Functionals Inference

We consider in this section statistical inference for linear and quadratic transformations of the regression vectors under high-dimensional linear and logistic regressions. We investigate both one- and two-sample regression models. In Section 3.7, we discuss the R package SIHR rakshit2021sihr that implements these methods.

3.1 Linear functionals for linear regression

For a given vector $x_{\rm new}\in\mathbb{R}^{p+1}$, we present the construction of the point estimator and confidence interval for $x_{\rm new}^{\intercal}\beta$ under the high-dimensional linear model (1). Similar to inference for $\beta_{j}$, the plug-in estimator $x_{\rm new}^{\intercal}\widehat{\beta}$, obtained by directly plugging in the Lasso estimator $\widehat{\beta}$ in (3), suffers from estimation bias. The work cai2021optimal proposed the following bias-corrected estimator,

\widehat{x_{\rm new}^{\intercal}\beta}=x_{\rm new}^{\intercal}\widehat{\beta}+\widehat{u}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}(Y_{k}-X_{k,\cdot}^{\intercal}\widehat{\beta}), \qquad (20)

with the projection direction $\widehat{u}$ defined as

\widehat{u}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\widehat{\Sigma}u\quad\text{subject to}\quad\left\|\widehat{\Sigma}u-x_{\rm new}\right\|_{\infty}\leq\|x_{\rm new}\|_{2}\lambda \qquad (21)
\left|x_{\rm new}^{\intercal}\widehat{\Sigma}u-\|x_{\rm new}\|_{2}^{2}\right|\leq\|x_{\rm new}\|_{2}^{2}\lambda, \qquad (22)

where $\widehat{\Sigma}=\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}X_{k,\cdot}^{\intercal}$ and $\lambda\asymp\sqrt{{\log p}/{n}}$ is a positive tuning parameter.

The debiased estimator in (20) satisfies the following error decomposition,

\widehat{x_{\rm new}^{\intercal}\beta}-x_{\rm new}^{\intercal}\beta=\underbrace{\widehat{u}^{\intercal}\frac{1}{n}X^{\intercal}\epsilon}_{\text{asymp normal}}+\underbrace{\left(\widehat{\Sigma}\widehat{u}-x_{\rm new}\right)^{\intercal}(\beta-\widehat{\beta})}_{\text{remaining bias}}. \qquad (23)

The construction in (21), without the additional constraint (22), can be viewed as a direct generalization of (8) obtained by replacing $e_{j}$ with the general loading $x_{\rm new}$. Specifically, (21) minimizes the conditional variance of the "asymp normal" term in (23) and controls the "remaining bias" term through the inequality $|(\widehat{\Sigma}u-x_{\rm new})^{\intercal}(\beta-\widehat{\beta})|\leq\|\widehat{\Sigma}u-x_{\rm new}\|_{\infty}\|\beta-\widehat{\beta}\|_{1}$. Such a construction of the projection direction for the linear functional has been proposed in cai2017confidence ; athey2018approximate . However, this direct generalization is not universally effective for all loadings. As established in Proposition 2 of cai2021optimal , the projection direction constructed without the additional constraint (22) does not correct the bias for a wide class of dense loading vectors.

We shall emphasize that the new constraint in (22) is crucial to ensuring the asymptotic normality of $\widehat{x_{\rm new}^{\intercal}\beta}-x_{\rm new}^{\intercal}\beta$ for any loading vector $x_{\rm new}$. This constraint is imposed so that the variance of the "asymp normal" term in (23) always dominates the "remaining bias" term in (23). With this additional constraint, the projection direction $\widehat{u}$ constructed in (21) and (22) enables effective bias correction for any loading vector $x_{\rm new}$, whether it is sparse or dense. The work cai2021optimal established the asymptotic normality of the estimator $\widehat{x_{\rm new}^{\intercal}\beta}$ in (20) for any loading vector $x_{\rm new}\in\mathbb{R}^{p+1}$. Based on the asymptotic normality, we construct a confidence interval for $x_{\rm new}^{\intercal}\beta$ as

\mathrm{CI}=\left(\widehat{x_{\rm new}^{\intercal}\beta}-z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}},\quad\widehat{x_{\rm new}^{\intercal}\beta}+z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right)\quad\text{with}\quad\widehat{\mathrm{V}}=\frac{\widehat{\sigma}_{\epsilon}^{2}}{n}\widehat{u}^{\intercal}\widehat{\Sigma}\widehat{u}. \qquad (24)
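A minimal R sketch of the linear-functional procedure is given below: the projection direction solves (21)-(22) via CVXR, after which the estimator (20) and the interval (24) are computed; this is an illustrative reimplementation with assumed dimensions and tuning choices, not the SIHR code.

```r
# Illustrative sketch: projection direction (21)-(22), estimator (20), CI (24).
library(glmnet)
library(CVXR)

set.seed(1)
n <- 200; p <- 120
Xd <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 3), rep(0, p - 3))
y <- as.numeric(1 + Xd %*% beta + rnorm(n))

X <- cbind(1, Xd)
beta_hat <- as.numeric(coef(cv.glmnet(Xd, y), s = "lambda.min"))
sigma_hat2 <- sum((y - X %*% beta_hat)^2) / n

x_new <- c(1, rnorm(p))                     # loading vector (a future observation's covariates)
Sigma_hat <- crossprod(X) / n
lambda <- sqrt(log(p) / n)
xSig <- as.numeric(crossprod(x_new, Sigma_hat))   # constant row vector x_new' Sigma_hat

u <- Variable(p + 1)
constr <- list(max_entries(abs(Sigma_hat %*% u - x_new)) <= sqrt(sum(x_new^2)) * lambda,  # (21)
               abs(t(xSig) %*% u - sum(x_new^2)) <= sum(x_new^2) * lambda)                # (22)
u_hat <- as.numeric(solve(Problem(Minimize(quad_form(u, Sigma_hat)), constr))$getValue(u))

lf_hat <- sum(x_new * beta_hat) + mean((X %*% u_hat) * (y - X %*% beta_hat))   # estimator (20)
V_hat  <- sigma_hat2 * as.numeric(t(u_hat) %*% Sigma_hat %*% u_hat) / n
ci     <- lf_hat + c(-1, 1) * qnorm(0.975) * sqrt(V_hat)                       # interval (24)
```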

3.2 Linear functionals for logistic regression

We now consider the high-dimensional logistic model and present the inference procedures for $x_{\rm new}^{\intercal}\beta$ and $h(x_{\rm new}^{\intercal}\beta)$ proposed in guo2021inference . In particular, guo2021inference proposed the following debiased estimator,

\widehat{x_{\rm new}^{\intercal}{\beta}}=x_{\rm new}^{\intercal}\widehat{\beta}+\widehat{u}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}W_{k}X_{k,\cdot}\left(Y_{k}-h(X_{k,\cdot}^{\intercal}\widehat{\beta})\right), \qquad (25)

where $\widehat{\beta}$ is defined in (5), $W_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})$ for $1\leq k\leq n$, and the projection direction $\widehat{u}\in\mathbb{R}^{p+1}$ is defined as

\widehat{u}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u \qquad (26)
\text{subject to}\quad\left\|\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u-x_{\rm new}\right\|_{\infty}\leq\|x_{\rm new}\|_{2}\lambda,\quad\|Xu\|_{\infty}\leq\tau,
\left|x_{\rm new}^{\intercal}\left(\frac{1}{n}\sum_{k=1}^{n}W_{k}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,\cdot}X_{k,\cdot}^{\intercal}\right)u-\|x_{\rm new}\|_{2}^{2}\right|\leq\|x_{\rm new}\|_{2}^{2}\lambda.

The bias-corrected estimator in (25) can be viewed as a generalization of (20), with the constraints in (26) paralleling those in (21) and (22) and the weighted bias correction incorporated from (15).

It has been established in guo2021inference that $\widehat{x_{\rm new}^{\intercal}{\beta}}$ in (25) is asymptotically unbiased and normal. Based on the asymptotic normality, we construct the following confidence interval

\mathrm{CI}=\left(\widehat{x_{\rm new}^{\intercal}\beta}-z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}},\quad\widehat{x_{\rm new}^{\intercal}\beta}+z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right)\quad\text{with}\quad\widehat{\mathrm{V}}=\frac{1}{n}\widehat{u}^{\intercal}\widehat{\Sigma}^{G}\widehat{u}, \qquad (27)

where $\widehat{\Sigma}^{G}=\frac{1}{n}\sum_{k=1}^{n}W_{k}^{2}h(X_{k,\cdot}^{\intercal}\widehat{\beta})(1-h(X_{k,\cdot}^{\intercal}\widehat{\beta}))X_{k,\cdot}X_{k,\cdot}^{\intercal}$. We estimate the case probability ${\mathbb{P}}(Y_{k}=1|X_{k,\cdot}=x_{\rm new})$ by $h(\widehat{x_{\rm new}^{\intercal}\beta})$ and construct the confidence interval for $h(x_{\rm new}^{\intercal}\beta)$ as

{\rm CI}=\left[h\left(\widehat{x_{\rm new}^{\intercal}\beta}-z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right),\;h\left(\widehat{x_{\rm new}^{\intercal}\beta}+z_{\alpha/2}\sqrt{\widehat{\mathrm{V}}}\right)\right]. \qquad (28)

3.3 Conditional average treatment effects

The inference methods proposed in Sections 3.1 and 3.2 can be generalized to make inferences for conditional average treatment effects, which can be expressed as the difference between two linear functionals. For $1\leq k\leq n$, let $A_{k}\in\{1,2\}$ denote the treatment assignment for the $k$-th observation, where $A_{k}=1$ and $A_{k}=2$ represent the subject receiving the control or the treatment assignment, respectively. Moreover, in the context of comparing treatment effectiveness, $A_{k}=1$ and $A_{k}=2$ may stand for the subject receiving the first or the second treatment assignment, respectively. As a special case of (2), we consider the following conditional outcome models ${\mathbb{E}}(Y_{k}|X_{k,\cdot},A_{k}=1)=X_{k,\cdot}^{\intercal}\beta^{\scriptscriptstyle(1)}$ and ${\mathbb{E}}(Y_{k}|X_{k,\cdot},A_{k}=2)=X_{k,\cdot}^{\intercal}\beta^{\scriptscriptstyle(2)}$. For an individual with $X_{k,\cdot}=x_{\rm new}$, we define $\Delta(x_{\rm new})={\mathbb{E}}(Y_{k}|X_{k,\cdot}=x_{\rm new},A_{k}=2)-{\mathbb{E}}(Y_{k}|X_{k,\cdot}=x_{\rm new},A_{k}=1)=x_{\rm new}^{\intercal}(\beta^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(1)})$, which measures the change of the conditional mean from being untreated to treated for individuals with covariates $x_{\rm new}$.

By generalizing (20), we construct the debiased estimators $\widehat{x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(1)}}$ and $\widehat{x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(2)}}$, together with their corresponding variance estimators $\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(1)}}$ and $\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(2)}}$. The paper cai2021optimal proposed to estimate $\Delta(x_{\rm new})$ by $\widehat{\Delta}(x_{\rm new})=\widehat{x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(2)}}-\widehat{x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(1)}}$ and to construct the confidence interval for $\Delta(x_{\rm new})$ as

\mathrm{CI}=\left(\widehat{\Delta}(x_{\rm new})-z_{\frac{\alpha}{2}}\sqrt{\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(1)}}+\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(2)}}},\quad\widehat{\Delta}(x_{\rm new})+z_{\frac{\alpha}{2}}\sqrt{\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(1)}}+\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(2)}}}\right). \qquad (29)

Regarding the hypothesis testing problem $H_{0}:x_{\rm new}^{\intercal}(\beta^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(1)})\leq 0$ versus $H_{1}:x_{\rm new}^{\intercal}(\beta^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(1)})>0$, the paper cai2021optimal proposed the following test procedure,

\phi_{\alpha}(x_{\rm new})=\mathbb{I}\left\{\widehat{\Delta}(x_{\rm new})-z_{\alpha}\sqrt{\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(1)}}+\widehat{\mathrm{V}}_{\beta^{\scriptscriptstyle(2)}}}\geq 0\right\}. \qquad (30)
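The following R sketch wraps the linear-functional construction of Section 3.1 into a small helper and applies it to two independent samples to form $\widehat{\Delta}(x_{\rm new})$, the interval (29), and the one-sided test (30); the helper function, data generation, and tuning choices are illustrative assumptions.

```r
# Illustrative sketch: two-sample contrast Delta(x_new), interval (29), test (30).
library(glmnet)
library(CVXR)

# Helper: debiased linear functional with projection direction from (21)-(22).
debias_lf <- function(X, y, x_new, lambda = sqrt(log(ncol(X)) / nrow(X))) {
  n <- nrow(X)
  beta_hat <- as.numeric(coef(cv.glmnet(X[, -1], y), s = "lambda.min"))
  sigma2 <- sum((y - X %*% beta_hat)^2) / n
  Sig <- crossprod(X) / n
  xSig <- as.numeric(crossprod(x_new, Sig))
  u <- Variable(ncol(X))
  constr <- list(max_entries(abs(Sig %*% u - x_new)) <= sqrt(sum(x_new^2)) * lambda,
                 abs(t(xSig) %*% u - sum(x_new^2)) <= sum(x_new^2) * lambda)
  u_hat <- as.numeric(solve(Problem(Minimize(quad_form(u, Sig)), constr))$getValue(u))
  est <- sum(x_new * beta_hat) + mean((X %*% u_hat) * (y - X %*% beta_hat))
  list(est = est, var = sigma2 * as.numeric(t(u_hat) %*% Sig %*% u_hat) / n)
}

set.seed(1)
n <- 200; p <- 100
make_sample <- function(beta) {
  Xd <- matrix(rnorm(n * p), n, p); X <- cbind(1, Xd)
  list(X = X, y = as.numeric(X %*% beta + rnorm(n)))
}
beta1 <- c(1, rep(0.5, 3), rep(0, p - 3)); beta2 <- c(1, rep(1, 3), rep(0, p - 3))
d1 <- make_sample(beta1); d2 <- make_sample(beta2)
x_new <- c(1, rnorm(p))

lf1 <- debias_lf(d1$X, d1$y, x_new); lf2 <- debias_lf(d2$X, d2$y, x_new)
Delta_hat <- lf2$est - lf1$est
ci <- Delta_hat + c(-1, 1) * qnorm(0.975) * sqrt(lf1$var + lf2$var)      # interval (29)
reject <- (Delta_hat - qnorm(0.95) * sqrt(lf1$var + lf2$var)) >= 0       # test (30) at alpha = 0.05
```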

As a direct generalization, we may consider the logistic version, ${\mathbb{E}}(Y_{k}|X_{k,\cdot},A_{k}=1)=h(X_{k,\cdot}^{\intercal}\beta^{\scriptscriptstyle(1)})$ and ${\mathbb{E}}(Y_{k}|X_{k,\cdot},A_{k}=2)=h(X_{k,\cdot}^{\intercal}\beta^{\scriptscriptstyle(2)})$. For an individual with $X_{k,\cdot}=x_{\rm new}$, the inference target becomes $\Delta(x_{\rm new})=h(x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(2)})-h(x_{\rm new}^{\intercal}\beta^{\scriptscriptstyle(1)})$. The methods in Section 3.2 can be applied here to make inferences for $\Delta(x_{\rm new})$.

3.4 Quadratic functionals

We now focus on inference for the quadratic functionals ${\rm Q}_{A}=\beta^{\intercal}_{G}A\beta_{G}$ and ${\rm Q}_{\Sigma}=\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}$, where $G\subset\{1,\cdots,p+1\}$ and $A\in\mathbb{R}^{|G|\times|G|}$ denotes a pre-specified matrix. Without loss of generality, we set $G=\{2,\cdots,|G|+1\}$. In the following, we discuss the main idea under high-dimensional linear regression, which can be generalized to high-dimensional logistic regression. Let $\widehat{\beta}$ denote the Lasso estimator in (3).

We start with the error decomposition of the plug-in estimator $\widehat{\beta}_{G}^{\intercal}A\widehat{\beta}_{G}$,

\widehat{\beta}_{G}^{\intercal}A\widehat{\beta}_{G}-\beta_{G}^{\intercal}A{\beta_{G}}=2\widehat{\beta}_{G}^{\intercal}A(\widehat{\beta}_{G}-\beta_{G})-(\widehat{\beta}_{G}-\beta_{G})^{\intercal}A(\widehat{\beta}_{G}-\beta_{G}). \qquad (31)

For high-dimensional linear models, guo2021group ; guo2019optimal proposed to construct the bias-corrected estimator by estimating the error component $2\widehat{\beta}_{G}^{\intercal}A(\widehat{\beta}_{G}-\beta_{G})$ on the right-hand side of (31). Since $\widehat{\beta}_{G}^{\intercal}A(\widehat{\beta}_{G}-\beta_{G})$ can be expressed as $x_{\rm new}^{\intercal}(\widehat{\beta}-\beta)$ with $x_{\rm new}=\begin{pmatrix}0&\widehat{\beta}_{G}^{\intercal}A&{\bf 0}^{\intercal}\end{pmatrix}^{\intercal}$, the techniques for estimating the error component of the linear functional can be directly applied to approximate $\widehat{\beta}_{G}^{\intercal}A(\widehat{\beta}_{G}-\beta_{G})$. Particularly, guo2021group ; guo2019optimal proposed the following estimator of ${\rm Q}_{A}$,

\widehat{{\rm Q}}_{A}=\widehat{\beta}_{G}^{\intercal}A\widehat{\beta}_{G}+2\widehat{u}_{A}^{\intercal}X^{\intercal}(Y-X\widehat{\beta})/n, \qquad (32)

where $\widehat{u}_{A}$ denotes the solution of (21) and (22) with $x_{\rm new}=\begin{pmatrix}0&\widehat{\beta}_{G}^{\intercal}A&{\bf 0}^{\intercal}\end{pmatrix}^{\intercal}$. No bias correction is required for the last term on the right-hand side of (31) since it is a higher-order error term under regularity conditions.

We now turn to the estimation of ${\rm Q}_{\Sigma}=\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}$, where the matrix $\Sigma_{G,G}$ has to be estimated from the data. With $\widehat{\Sigma}=\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}X_{k,\cdot}^{\intercal}$, we decompose the estimation error of the plug-in estimator $\widehat{\beta}^{\intercal}_{G}\widehat{\Sigma}_{G,G}\widehat{\beta}_{G}$ as

\widehat{\beta}^{\intercal}_{G}\widehat{\Sigma}_{G,G}\widehat{\beta}_{G}-\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}=2\widehat{\beta}_{G}^{\intercal}\widehat{\Sigma}_{G,G}(\widehat{\beta}_{G}-\beta_{G})+\beta_{G}^{\intercal}(\widehat{\Sigma}_{G,G}-\Sigma_{G,G})\beta_{G}-(\widehat{\beta}_{G}-\beta_{G})^{\intercal}\widehat{\Sigma}_{G,G}(\widehat{\beta}_{G}-\beta_{G}). \qquad (33)

By a similar approach as for (32), guo2021group proposed the following estimator of ${\rm Q}_{\Sigma}$,

\widehat{{\rm Q}}_{\Sigma}=\widehat{\beta}^{\intercal}_{G}\widehat{\Sigma}_{G,G}\widehat{\beta}_{G}+2\widehat{u}_{\Sigma}^{\intercal}X^{\intercal}(Y-X\widehat{\beta})/n, \qquad (34)

where $\widehat{u}_{\Sigma}$ denotes the solution of (21) and (22) with $x_{\rm new}=\begin{pmatrix}0&\widehat{\beta}_{G}^{\intercal}\widehat{\Sigma}_{G,G}&{\bf 0}^{\intercal}\end{pmatrix}^{\intercal}$. As a special case, cai2020semisupervised considered $G=\{1,\cdots,p+1\}$ and constructed $\widehat{u}_{\Sigma}=\widehat{\beta}$.

The works guo2021group ; guo2019optimal established the asymptotic properties for a sample-splitting version of Q^A\widehat{{\rm Q}}_{A} and Q^Σ\widehat{{\rm Q}}_{\Sigma}, where the initial estimator β^\widehat{\beta} is constructed from a subsample independent from the samples {Xk,,Yk}1kn\{X_{k,\cdot},Y_{k}\}_{1\leq k\leq n} used in (32) and (34). Based on the asymptotic properties, guo2021group estimated the variance of Q^A\widehat{{\rm Q}}_{A} by V^A(τ)=4σ^ϵ2u^AΣ^u^A/n+τ/n\widehat{{\rm V}}_{A}(\tau)={4\widehat{\sigma}_{\epsilon}^{2}}\widehat{u}_{A}^{\intercal}\widehat{\Sigma}\widehat{u}_{A}/n+{\tau}/{n} for some positive constant τ>0\tau>0, where the term τ/n\tau/n is introduced as an upper bound for the term (β^GβG)A(β^GβG)(\widehat{\beta}_{G}-\beta_{G})^{\intercal}A(\widehat{\beta}_{G}-\beta_{G}) in (31). guo2021group estimated the variance of Q^Σ\widehat{{\rm Q}}_{\Sigma} by

V^Σ(τ)=4σ^ϵ2nu^ΣΣ^u^Σ+1n2k=1n(β^GXk,GXk,Gβ^Gβ^GΣ^G,Gβ^G)2+τn,\small\widehat{{\rm V}}_{\Sigma}(\tau)=\frac{4\widehat{\sigma}_{\epsilon}^{2}}{n}\widehat{u}_{\Sigma}^{\intercal}\widehat{\Sigma}\widehat{u}_{\Sigma}+\frac{1}{n^{2}}\sum_{k=1}^{n}\left(\widehat{\beta}_{G}^{\intercal}X_{k,G}X_{k,G}^{\intercal}\widehat{\beta}_{G}-\widehat{\beta}_{G}^{\intercal}\widehat{\Sigma}_{G,G}\widehat{\beta}_{G}\right)^{2}+\frac{\tau}{n},

and further constructed the following confidence intervals for QA{\rm Q}_{A} and QΣ{\rm Q}_{\Sigma},

CIA(τ)\displaystyle{\rm CI}_{A}(\tau) =(Q^Azα/2V^A(τ),Q^A+zα/2V^A(τ)),\displaystyle=\left(\widehat{{\rm Q}}_{A}-{z_{\alpha/2}}\cdot\sqrt{\widehat{{\rm V}}_{A}(\tau)},\,\,\widehat{{\rm Q}}_{A}+{z_{\alpha/2}}\cdot\sqrt{\widehat{{\rm V}}_{A}(\tau)}\right), (35)
CIΣ(τ)\displaystyle{\rm CI}_{\Sigma}(\tau) =(Q^Σzα/2V^Σ(τ),Q^Σ+zα/2V^Σ(τ)).\displaystyle=\left(\widehat{{\rm Q}}_{\Sigma}-{z_{\alpha/2}}\cdot\sqrt{\widehat{{\rm V}}_{\Sigma}(\tau)},\,\,\widehat{{\rm Q}}_{\Sigma}+{z_{\alpha/2}}\cdot\sqrt{\widehat{{\rm V}}_{\Sigma}(\tau)}\right).

For a positive definite matrix AA, the significance test H0:βG=0H_{0}:\beta_{G}=0 can be recast as H0,A:βGAβG=0H_{0,A}:\beta_{G}^{\intercal}A\beta_{G}=0 or H0,Σ:βGΣG,GβG=0.H_{0,\Sigma}:\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}=0. The following α\alpha-level significance tests of H0,AH_{0,A} and H0,ΣH_{0,\Sigma} were proposed in guo2021group ,

ϕA(τ)=𝕀{Q^AzαV^A(τ)},ϕΣ(τ)=𝕀{Q^ΣzαV^Σ(τ)}.\small\phi_{A}(\tau)=\mathbb{I}\left\{\widehat{{\rm Q}}_{A}\geq z_{\alpha}\cdot\sqrt{{\widehat{{\rm V}}_{A}(\tau)}}\right\},\quad\phi_{\Sigma}(\tau)=\mathbb{I}\left\{\widehat{{\rm Q}}_{\Sigma}\geq{z_{\alpha}}\cdot\sqrt{\widehat{{\rm V}}_{\Sigma}(\tau)}\right\}. (36)
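To make the constructions concrete, the following R sketch assembles the estimator (34), the variance estimate displayed above, the confidence interval CI_Σ(τ) in (35), and the test φ_Σ(τ) in (36). It assumes that the initial Lasso estimator beta_hat, the projection direction u_hat solving (21) and (22), and the noise-level estimate sigma_eps_hat have already been computed; all object names are illustrative, and this is a minimal sketch rather than the implementation used in guo2021group.

# Minimal sketch: confidence interval (35) and test (36) for Q_Sigma.
# Assumes beta_hat (initial Lasso estimator), u_hat (projection direction
# solving (21)-(22)), sigma_eps_hat (noise level) and the index set G are given.
ci_and_test_QSigma <- function(X, Y, G, beta_hat, u_hat, sigma_eps_hat,
                               tau = 1, alpha = 0.05) {
  n <- nrow(X)
  Sigma_hat <- crossprod(X) / n                                   # sample covariance
  resid <- as.vector(Y - X %*% beta_hat)
  plug_in <- as.numeric(t(beta_hat[G]) %*% Sigma_hat[G, G] %*% beta_hat[G])
  Q_hat <- plug_in + 2 * sum(u_hat * as.vector(t(X) %*% resid)) / n   # estimator (34)
  quad_dev <- as.vector(X[, G, drop = FALSE] %*% beta_hat[G])^2 - plug_in
  V_hat <- 4 * sigma_eps_hat^2 *
             as.numeric(t(u_hat) %*% Sigma_hat %*% u_hat) / n +
           sum(quad_dev^2) / n^2 + tau / n                        # variance estimate V_Sigma(tau)
  z2 <- qnorm(1 - alpha / 2)
  list(CI = c(Q_hat - z2 * sqrt(V_hat), Q_hat + z2 * sqrt(V_hat)),   # interval (35)
       reject = (Q_hat >= qnorm(1 - alpha) * sqrt(V_hat)))           # test (36)
}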

The inference for QA{\rm Q}_{A} and QΣ{\rm Q}_{\Sigma} can be generalized to high-dimensional logistic regression by combining it with the weighted bias correction detailed in (26). The paper ma2022statistical studied the construction of debiased estimators of QΣ{\rm Q}_{\Sigma} under high-dimensional logistic regression.

3.5 Semi-supervised inference

We summarize the optimal estimation of β22\|\beta\|_{2}^{2} and βΣβ\beta^{\intercal}\Sigma\beta in a general semi-supervised setting, with the labelled data (X1,,Y1),,(Xn,,Yn)(X_{1,\cdot},Y_{1}),\cdots,(X_{n,\cdot},Y_{n}) and the unlabelled data Xn+1,,,Xn+N,X_{n+1,\cdot},\cdots,X_{n+N,\cdot}, where the covariates X1,,,Xn+N,X_{1,\cdot},\cdots,X_{n+N,\cdot} are assumed to be identically distributed.

The work cai2020semisupervised proposed the following semi-supervised estimator of βΣβ\beta^{\intercal}\Sigma\beta,

Q^(β^,Σ^S)=β^Σ^Sβ^+2β^1nk=1nXk,(YkXk,β^),Σ^S=1n+Nk=1n+NXk,Xk,.\small{\widehat{{\rm Q}}}(\widehat{\beta},\widehat{\Sigma}^{S})=\widehat{\beta}^{\intercal}\widehat{\Sigma}^{S}\widehat{\beta}+2\widehat{\beta}^{\intercal}\frac{1}{n}\sum_{k=1}^{n}X_{k,\cdot}(Y_{k}-X_{k,\cdot}^{\intercal}\widehat{\beta}),\widehat{\Sigma}^{S}=\frac{1}{n+N}\sum_{k=1}^{n+N}X_{k,\cdot}X_{k,\cdot}^{\intercal}. (37)

The matrix estimator Σ^S\widehat{\Sigma}^{S} in (37) utilizes both the labelled and unlabelled data. The work cai2020semisupervised constructed the following confidence interval for βΣβ\beta^{\intercal}\Sigma\beta in the semi-supervised setting,

\small{\rm CI}(Z)=\left(\left({\widehat{{\rm Q}}}(\widehat{\beta},\widehat{\Sigma}^{S})-z_{\alpha/2}\sqrt{\widehat{\rm V}}\right)_{+},\;{\widehat{{\rm Q}}}(\widehat{\beta},\widehat{\Sigma}^{S})+z_{\alpha/2}\sqrt{\widehat{\rm V}}\right),

where V^=1n(4σ^ϵ2β^Σ^Sβ^+ρ^n+Nk=1n+N(β^XkXkβ^β^Σ^Sβ^)2+τ)\small\widehat{\rm V}=\frac{1}{n}\left(4{\widehat{\sigma}_{\epsilon}^{2}\widehat{\beta}^{\intercal}\widehat{\Sigma}^{S}\widehat{\beta}}+{\frac{\widehat{\rho}}{n+N}\sum_{k=1}^{n+N}\left(\widehat{\beta}^{\intercal}X_{k\cdot}X_{k\cdot}^{\intercal}\widehat{\beta}-\widehat{\beta}^{\intercal}\widehat{\Sigma}^{S}\widehat{\beta}\right)^{2}}+\tau\right) with ρ^=n/(N+n)\widehat{\rho}=n/(N+n) and τ>0\tau>0 being a user-specified tuning parameter that accounts for the higher-order estimation error. The above confidence interval construction demonstrates the benefit of integrating the unlabelled data: a larger amount of unlabelled data reduces the ratio ρ^\widehat{\rho} and hence the interval length.
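As a concrete illustration, the following R sketch computes the estimator (37) and the confidence interval above from labelled data (X_lab, Y_lab) and unlabelled covariates X_unlab. The initial estimator beta_hat and the noise-level estimate sigma_eps_hat are assumed to be supplied (for instance from a scaled Lasso fit on the labelled data); all object names are illustrative.

# Minimal sketch of the semi-supervised estimator (37) and its confidence interval.
semisup_QSigma <- function(X_lab, Y_lab, X_unlab, beta_hat, sigma_eps_hat,
                           tau = 1, alpha = 0.05) {
  n <- nrow(X_lab); N <- nrow(X_unlab)
  X_all   <- rbind(X_lab, X_unlab)
  Sigma_S <- crossprod(X_all) / (n + N)              # uses labelled and unlabelled covariates
  plug_in <- as.numeric(t(beta_hat) %*% Sigma_S %*% beta_hat)
  resid   <- as.vector(Y_lab - X_lab %*% beta_hat)
  Q_hat   <- plug_in + 2 * sum(beta_hat * as.vector(t(X_lab) %*% resid)) / n   # estimator (37)
  rho_hat <- n / (n + N)
  quad_all <- as.vector(X_all %*% beta_hat)^2 - plug_in
  V_hat <- (4 * sigma_eps_hat^2 * plug_in +
            rho_hat * sum(quad_all^2) / (n + N) + tau) / n        # variance estimate above
  z <- qnorm(1 - alpha / 2)
  c(lower = max(Q_hat - z * sqrt(V_hat), 0), upper = Q_hat + z * sqrt(V_hat))
}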

Define Θ(s,M)={(β,Σ,σϵ)Θ(s):M2β2M},\Theta\left(s,M\right)=\left\{\left(\beta,\Sigma,\sigma_{\epsilon}\right)\in\Theta(s):{\frac{M}{2}\leq\|\beta\|_{2}\leq M}\right\}, as a subspace of Θ(s)\Theta(s) in (12) with the constraint on β2.\|\beta\|_{2}. In the semi-supervised setting, it was established in cai2020semisupervised that the minimax rate for estimating βΣβ\beta^{\intercal}{\Sigma}\beta over Θ(s,M)\Theta\left(s,M\right) is

M2N+n+min{Mn+slogpn,M2}.\small{M^{2}\over\sqrt{N+n}}+\min\left\{{M\over\sqrt{n}}+{s\log p\over n},M^{2}\right\}. (38)

The estimator Q^(β^,Σ^S){\widehat{{\rm Q}}}(\widehat{\beta},\widehat{\Sigma}^{S}) in (37) is shown to achieve the optimal rate in (38) for MCslogp/nM\geq C\sqrt{s\log p/n}. When MCslogp/nM\leq C\sqrt{s\log p/n}, estimating Q{\rm Q} by zero achieves the optimal rate in (38). The optimal convergence rate in (38) characterizes its dependence on the amount of unlabelled data: with a larger size NN of the unlabelled data, the term M2N+n{M^{2}\over\sqrt{N+n}} decreases. In the semi-supervised setting, the unlabelled data can also be used to construct a more accurate estimator of β22\|\beta\|_{2}^{2}; see Section 4 of cai2020semisupervised for details.

In the following, we consider the two extremes under semi-supervised learning: the supervised learning setting with N=0N=0 and the other extreme setting of knowing Σ=I\Sigma={\rm I}, which can be viewed as a special case of the semi-supervised setting with N.N\rightarrow\infty. In Table 1, we summarize the optimal convergence rate of estimating β22\|\beta\|_{2}^{2} and βΣβ\beta^{\intercal}\Sigma\beta over both Θ(s,M)\Theta(s,M) and Θ0(s,M)\Theta_{0}(s,M), where Θ0(s,M)={(β,Σ,σϵ)Θ(s,M):Σ=I}.\Theta_{0}(s,M)=\left\{\left(\beta,\Sigma,\sigma_{\epsilon}\right)\in\Theta(s,M):\Sigma={\rm I}\right\}. Table 1 shows that, by leveraging the information on Σ\Sigma, the optimal rates of estimating β22\|\beta\|_{2}^{2} and βΣβ\beta^{\intercal}\Sigma\beta are reduced by Mslogpn{M\frac{s\log p}{n}} and M21n,{M^{2}\frac{1}{\sqrt{n}}}, respectively. The improvement can be substantial when the parameter MM, characterizing the 2\ell_{2} norm β2\|\beta\|_{2}, is relatively large. The optimal rate of estimating β22\|\beta\|_{2}^{2} was established in collier2017minimax under the sequence model Yj=βj+1nϵjfor 1jp.Y_{j}=\beta_{j}+\frac{1}{\sqrt{n}}\epsilon_{j}\;\text{for}\;1\leq j\leq p. When we know Σ=I\Sigma={\rm I} and focus on the regime scmin{n/logp,pν}s\leq c\min\{{n/\log p},p^{\nu}\} for 0ν<1/20\leq\nu<1/2, the optimal rate of estimating β22\|\beta\|_{2}^{2} in the high-dimensional linear model matches that in collier2017minimax . However, when Σ\Sigma is unknown, estimating β22\|\beta\|_{2}^{2} is much harder in the high-dimensional linear model than in the sequence model.

Target | Optimal Rate over Θ(s,M)\Theta(s,M) | Optimal Rate over Θ0(s,M)\Theta_{0}(s,M)
β22\|\beta\|_{2}^{2} | min{M1n+slogpn+Mslogpn,M2}\min\left\{M\frac{1}{\sqrt{n}}+\frac{s\log p}{n}+{M\frac{s\log p}{n}},M^{2}\right\} | min{M1n+slogpn,M2}\min\left\{M\frac{1}{\sqrt{n}}+\frac{s\log p}{n},M^{2}\right\}
βΣβ\beta^{\intercal}\Sigma\beta | min{M1n+slogpn+M21n,M2}\min\left\{M\frac{1}{\sqrt{n}}+\frac{s\log p}{n}+{M^{2}\frac{1}{\sqrt{n}}},M^{2}\right\} | same as the β22\|\beta\|_{2}^{2} entry, since Σ=I\Sigma={\rm I} implies βΣβ=β22\beta^{\intercal}\Sigma\beta=\|\beta\|_{2}^{2}
Table 1: The minimax optimal rate of estimating β22\|\beta\|_{2}^{2} over Θ(s,M)\Theta(s,M) was established in guo2019optimal ; the remaining optimal rates are implied by (38), which was established in cai2020semisupervised . The focus is on the regime scmin{n/logp,pν}s\leq c\min\{{n/\log p},p^{\nu}\} for 0ν<1/2.0\leq\nu<1/2.

3.6 Inner products of regression vectors

We next consider the two-sample regression models in (2). When the high-dimensional covariates are genetic variants and the outcome variables measure different phenotypes, then the inner product [β(1)]β(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)} can be interpreted as the genetic relatedness guo2019optimal , measuring the similarity between the two association vectors β(1)\beta^{\scriptscriptstyle(1)} and β(2)\beta^{\scriptscriptstyle(2)}. We present the debiased estimator of [β(1)]β(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)} proposed in guo2019optimal . For d=1,2,d=1,2, let β^(d)\widehat{\beta}^{\scriptscriptstyle(d)} be the Lasso estimator of β(d)\beta^{\scriptscriptstyle(d)}. Similar to the decomposition in (31), we decompose the error of the plug-in estimator [β^(1)]β^(2),[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)},

\displaystyle[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)}-[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)}=\displaystyle[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}(\widehat{\beta}^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(2)})+[\widehat{\beta}^{\scriptscriptstyle(2)}]^{\intercal}(\widehat{\beta}^{\scriptscriptstyle(1)}-\beta^{\scriptscriptstyle(1)}) (39)
\displaystyle-(\widehat{\beta}^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(2)})^{\intercal}(\widehat{\beta}^{\scriptscriptstyle(1)}-\beta^{\scriptscriptstyle(1)}).

The key is to estimate [β^(1)](β^(2)β(2))[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}(\widehat{\beta}^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(2)}) and [β^(2)](β^(1)β(1))[\widehat{\beta}^{\scriptscriptstyle(2)}]^{\intercal}(\widehat{\beta}^{\scriptscriptstyle(1)}-\beta^{\scriptscriptstyle(1)}) separately in the above decomposition; these terms can be viewed as projections of the estimation errors β^(2)β(2)\widehat{\beta}^{\scriptscriptstyle(2)}-\beta^{\scriptscriptstyle(2)} and β^(1)β(1)\widehat{\beta}^{\scriptscriptstyle(1)}-\beta^{\scriptscriptstyle(1)} onto the directions β^(1)\widehat{\beta}^{\scriptscriptstyle(1)} and β^(2),\widehat{\beta}^{\scriptscriptstyle(2)}, respectively.

The work guo2019optimal proposed the following bias-corrected estimator of [β(1)]β(2),[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)},

[β(1)]β(2)^=\displaystyle\widehat{[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)}}= [β^(1)]β^(2)+u^11n2k=1n2[Xk,(2)](Yk(2)[Xk,(2)]β^(2))\displaystyle[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)}+\widehat{u}_{1}^{\intercal}\frac{1}{n_{2}}\sum_{k=1}^{n_{2}}[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(2)}_{k}-[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)}) (40)
+u^21n1k=1n1[Xk,(1)](Yk(1)[Xk,(1)]β^(1))\displaystyle+\widehat{u}_{2}^{\intercal}\frac{1}{n_{1}}\sum_{k=1}^{n_{1}}[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(1)}_{k}-[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(1)})

where the projection direction vectors are constructed as

u^1=argminup+1uΣ^(2)usubject to\displaystyle\widehat{u}_{1}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\widehat{\Sigma}^{\scriptscriptstyle(2)}u\quad\text{subject to} Σ^(2)uβ^(1)β^(1)2λ2\displaystyle\left\|\widehat{\Sigma}^{\scriptscriptstyle(2)}u-\widehat{\beta}^{\scriptscriptstyle(1)}\right\|_{\infty}\leq\|\widehat{\beta}^{\scriptscriptstyle(1)}\|_{2}\lambda_{2}
u^2=argminup+1uΣ^(1)usubject to\displaystyle\widehat{u}_{2}=\;\mathop{\rm arg\min}_{u\in\mathbb{R}^{p+1}}u^{\intercal}\widehat{\Sigma}^{\scriptscriptstyle(1)}u\quad\text{subject to} Σ^(1)uβ^(2)β^(2)2λ1\displaystyle\left\|\widehat{\Sigma}^{\scriptscriptstyle(1)}u-\widehat{\beta}^{\scriptscriptstyle(2)}\right\|_{\infty}\leq\|\widehat{\beta}^{\scriptscriptstyle(2)}\|_{2}\lambda_{1}

with Σ^(d)=1ndk=1ndXk,(d)[Xk,(d)]\widehat{\Sigma}^{\scriptscriptstyle(d)}=\frac{1}{n_{d}}\sum_{k=1}^{n_{d}}X^{\scriptscriptstyle(d)}_{k,\cdot}[X^{\scriptscriptstyle(d)}_{k,\cdot}]^{\intercal} and λdlogp/nd\lambda_{d}\asymp\sqrt{{\log p}/{n_{d}}} for d=1,2.d=1,2.

For the estimator (40), the bias-correction terms u^11n2k=1n2[Xk,(2)](Yk(2)[Xk,(2)]β^(2))\widehat{u}_{1}^{\intercal}\frac{1}{n_{2}}\sum_{k=1}^{n_{2}}[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(2)}_{k}-[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)}) and u^21n1k=1n1[Xk,(1)](Yk(1)[Xk,(1)]β^(1))\widehat{u}_{2}^{\intercal}\frac{1}{n_{1}}\sum_{k=1}^{n_{1}}[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(1)}_{k}-[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(1)}) are constructed to estimate [β^(1)](β(2)β^(2))[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}({\beta}^{\scriptscriptstyle(2)}-\widehat{\beta}^{\scriptscriptstyle(2)}) and [β^(2)](β(1)β^(1)),[\widehat{\beta}^{\scriptscriptstyle(2)}]^{\intercal}({\beta}^{\scriptscriptstyle(1)}-\widehat{\beta}^{\scriptscriptstyle(1)}), respectively. The construction of the projection directions u^1\widehat{u}_{1} and u^2\widehat{u}_{2} can be viewed as extensions of that in (21) with xnew=β^(1)x_{\rm new}=\widehat{\beta}^{\scriptscriptstyle(1)} and xnew=β^(2),x_{\rm new}=\widehat{\beta}^{\scriptscriptstyle(2)}, respectively. An additional constraint as in (22) is not needed here since both xnew=β^(1)x_{\rm new}=\widehat{\beta}^{\scriptscriptstyle(1)} and xnew=β^(2)x_{\rm new}=\widehat{\beta}^{\scriptscriptstyle(2)} are sufficiently sparse. The paper guo2019optimal has established the convergence rate of the debiased estimator proposed in (40). The analysis can be extended to establish the asymptotic normal distribution of [β(1)]β(2)^\widehat{[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)}} under suitable conditions.
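To illustrate (40), the R sketch below assembles the bias-corrected estimator from the two plug-in Lasso estimators and the two projection directions. The directions u1_hat and u2_hat are assumed to have been obtained by solving the constrained quadratic programs displayed above (for instance with a generic convex solver); all object names are illustrative.

# Minimal sketch of the bias-corrected inner-product estimator (40), assuming
# Lasso estimators beta1_hat, beta2_hat and projection directions u1_hat, u2_hat
# (solutions of the two constrained optimization problems above) are supplied.
debiased_inner_product <- function(X1, Y1, X2, Y2,
                                   beta1_hat, beta2_hat, u1_hat, u2_hat) {
  n1 <- nrow(X1); n2 <- nrow(X2)
  resid1 <- as.vector(Y1 - X1 %*% beta1_hat)
  resid2 <- as.vector(Y2 - X2 %*% beta2_hat)
  plug_in <- sum(beta1_hat * beta2_hat)
  # the two correction terms estimate the projected errors in (39)
  corr1 <- sum(u1_hat * as.vector(t(X2) %*% resid2)) / n2
  corr2 <- sum(u2_hat * as.vector(t(X1) %*% resid1)) / n1
  plug_in + corr1 + corr2
}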

The quantity [β(1)]Σβ(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}\Sigma\beta^{\scriptscriptstyle(2)} is another genetic relatedness measure if Σ(1)=Σ(2)=Σ\Sigma^{\scriptscriptstyle(1)}=\Sigma^{\scriptscriptstyle(2)}=\Sigma with Σ(d)=𝔼Xk,(d)[Xk,(d)]\Sigma^{\scriptscriptstyle(d)}={\mathbb{E}}X^{\scriptscriptstyle(d)}_{k,\cdot}[X^{\scriptscriptstyle(d)}_{k,\cdot}]^{\intercal} for d=1,2.d=1,2. A debiased estimator [β(1)]Σβ(2)^\widehat{[\beta^{\scriptscriptstyle(1)}]^{\intercal}\Sigma\beta^{\scriptscriptstyle(2)}} can be constructed as [\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}\widehat{\Sigma}\widehat{\beta}^{\scriptscriptstyle(2)}+[\widehat{\beta}^{\scriptscriptstyle(1)}]^{\intercal}\frac{1}{n_{2}}\sum_{k=1}^{n_{2}}[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(2)}_{k}-[X^{\scriptscriptstyle(2)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(2)})+[\widehat{\beta}^{\scriptscriptstyle(2)}]^{\intercal}\frac{1}{n_{1}}\sum_{k=1}^{n_{1}}[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}(Y^{\scriptscriptstyle(1)}_{k}-[X^{\scriptscriptstyle(1)}_{k,\cdot}]^{\intercal}\widehat{\beta}^{\scriptscriptstyle(1)}). The above results have been extended to logistic regression models, where the quantity [β(1)]Σβ(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}\Sigma\beta^{\scriptscriptstyle(2)} retains its interpretation as a measure of genetic relatedness. The paper ma2022statistical investigated the confidence interval construction for [β(1)]Σβ(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}\Sigma\beta^{\scriptscriptstyle(2)} under several high-dimensional logistic regression models. More generally, one may need to estimate and make inference for [β(1)]Aβ(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}A\beta^{\scriptscriptstyle(2)} with AA denoting a general matrix; the idea in (40) can be extended to this setting. The work guo2020inference applied such a generalized debiased estimator of [β(1)]Aβ(2)[\beta^{\scriptscriptstyle(1)}]^{\intercal}A\beta^{\scriptscriptstyle(2)} in an intermediate step to determine the optimal aggregation weights across multiple regression models.

3.7 R package SIHR

The methods reviewed in Sections 2 and 3 have been implemented in the R package SIHR rakshit2021sihr , which is available from CRAN. The SIHR package contains three main functions: LF, QF, and CATE. The LF function implements the confidence interval for xnewβx_{\rm new}^{\intercal}\beta in (24) under high-dimensional linear regression by specifying model="linear", and the confidence intervals for h(xnewβ)h(x_{\rm new}^{\intercal}\beta) in (28) under high-dimensional logistic regression by specifying model="logistic" or model="logisticalter", corresponding to the linearization weighting and the link-specific weighting introduced in Section 2.3, respectively.

The QF function implements the confidence interval construction in (35) and hypothesis testing in (36) with different choices of the weighting matrix AA. The CATE function implements the confidence interval in (29) and hypothesis testing in (30). Both QF and CATE functions can be applied to the logistic regression setting by specifying model="logistic" or model="logisticalter". The detailed usage of the SIHR package can be found in the paper rakshit2021sihr .
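Purely as an illustration of this interface, the snippet below sketches typical calls in R. The function names and the model argument follow the description above; the remaining argument names (X, y, loading, G, A) are assumptions made for illustration and should be checked against the package documentation.

# Illustrative SIHR calls; a sketch only, with argument names to be verified
# against the package documentation.
library(SIHR)

out_lf <- LF(X = X, y = y, loading = x_new, model = "linear")          # CI for x_new^T beta, as in (24)
out_lf_logit <- LF(X = X, y = y, loading = x_new, model = "logistic")  # CI for h(x_new^T beta), as in (28)
out_qf <- QF(X = X, y = y, G = G, A = A, model = "linear")             # CI (35) and test (36) for beta_G^T A beta_G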

4 Multiple Testing

In the previous sections, we have examined statistical inference for individual regression coefficients and related one-dimensional functionals. However, in many applications, such as genomics, it is necessary to perform simultaneous inference for multiple regression coefficients while controlling the false discovery rate (FDR) and the false discovery proportion (FDP). In this section, we explore several large-scale multiple testing procedures for high-dimensional regression models. We start with linear models in Section 4.1 and extend the discussion to logistic models in Section 4.2. The testing procedures are unified in Section 4.3, and power enhancement methods are discussed in Section 4.4.

4.1 Simultaneous inference for linear regression

One-sample simultaneous inference for high-dimensional linear regression coefficients is closely related to the problem of variable selection. Common approaches for variable selection include regularization methods, such as Lasso tibshirani1996regression , SCAD fan2001variable , Adaptive Lasso zou2006adaptive and MCP zhang2010nearly , which simultaneously estimate parameters and select features, and stepwise feature selection techniques like LARS efron2004least and FoBa zhang2011adaptive , which prioritize variable selection. See the discussions in liuluo2014 and references therein. However, both of these approaches aim to find the model that is closest to the truth, which may not be achievable in practice. Alternatively, liuluo2014 approached the problem from a multiple testing perspective and focused on controlling false discoveries rather than achieving perfect selection results. Specifically, for the high-dimensional regression model (1) with link function h(z)=zh(z)=z, i.e., Y=Xβ+ϵ,Y=X\beta+\epsilon, where β=(β1,,βp+1)p+1\beta=(\beta_{1},\dots,\beta_{p+1})^{\intercal}\in\mathbb{R}^{p+1}, X=(X1,,,Xn,)X=(X_{1,\cdot}^{\intercal},\ldots,X_{n,\cdot}^{\intercal})^{\intercal}, Y=(Y1,,Yn)Y=(Y_{1},\ldots,Y_{n})^{\intercal}, and ϵ=(ϵ1,,ϵn)\epsilon=(\epsilon_{1},\dots,\epsilon_{n})^{\intercal}, with {ϵk}\{\epsilon_{k}\} being independent and identically distributed (i.i.d) random variables with mean zero and variance σϵ2\sigma_{\epsilon}^{2} and independent of Xk,X_{k,\cdot}, k=1,,nk=1,\dots,n, liuluo2014 considered the following multiple testing problem,

H0,i:βi=0 versus H1,i:βi0,i=2,,p+1,\displaystyle H_{0,i}:\beta_{i}=0\mbox{ versus }H_{1,i}:\beta_{i}\neq 0,\quad i=2,\dots,p+1, (41)

with the control of FDR and FDP.

In some fields, one-sample inference may not be sufficient, particularly for detecting interactions. For example, as demonstrated in hunter2005gene , many complex diseases are the result of interactions between genes and the environment. Therefore, it is important to thoroughly examine the effects of the environment and its interactions with genetic predispositions on disease phenotypes. When the environmental factor is a binary variable, such as smoking status or gender, interaction detection can be approached under a two-sample high-dimensional regression framework. Specifically, interactions can be identified by comparing two high-dimensional regression models as introduced in (2) with identity link function, i.e., Y(d)=X(d)β(d)+ϵ(d),Y^{\scriptscriptstyle(d)}=X^{\scriptscriptstyle(d)}\beta^{\scriptscriptstyle(d)}+\epsilon^{\scriptscriptstyle(d)}, for d=1,2d=1,2, and recovering the nonzero components of β1(1)β1(2)\beta^{\scriptscriptstyle(1)}_{-1}-\beta^{\scriptscriptstyle(2)}_{-1}, where β(d)=(β1(d),,βp+1(d))p+1\beta^{\scriptscriptstyle(d)}=(\beta_{1}^{\scriptscriptstyle(d)},\dots,\beta_{p+1}^{\scriptscriptstyle(d)})^{\intercal}\in\mathbb{R}^{p+1}, X(d)=(X1,(d),,Xnd,(d))X^{\scriptscriptstyle(d)}=(X_{1,\cdot}^{{\scriptscriptstyle(d)}\intercal},\ldots,X_{n_{d},\cdot}^{{\scriptscriptstyle(d)}\intercal})^{\intercal}, Y(d)=(Y1(d),,Ynd(d))Y^{\scriptscriptstyle(d)}=(Y_{1}^{\scriptscriptstyle(d)},\ldots,Y_{n_{d}}^{\scriptscriptstyle(d)})^{\intercal}, and ϵ(d)=(ϵ1(d),,ϵnd(d))\epsilon^{\scriptscriptstyle(d)}=(\epsilon_{1}^{\scriptscriptstyle(d)},\dots,\epsilon_{n_{d}}^{\scriptscriptstyle(d)})^{\intercal}, with {ϵk(d)}\{\epsilon_{k}^{\scriptscriptstyle(d)}\} being i.i.d random variables with mean zero and variance σϵ(d)2\sigma_{\epsilon^{\scalebox{0.4}{(d)}}}^{2} and independent of Xk,(d)X_{k,\cdot}^{\scriptscriptstyle(d)}, k=1,,ndk=1,\dots,n_{d}. Assume that n1n2n_{1}\asymp n_{2} and let n=max{n1,n2}n=\max\{n_{1},n_{2}\}. Then xia2018two investigated simultaneous testing of the hypotheses

H0,i:βi(1)=βi(2) versus H1,i:βi(1)βi(2),i=2,,p+1,\displaystyle H_{0,i}:\beta_{i}^{\scriptscriptstyle(1)}=\beta_{i}^{\scriptscriptstyle(2)}\;\mbox{ versus }\;H_{1,i}:\beta_{i}^{\scriptscriptstyle(1)}\neq\beta_{i}^{\scriptscriptstyle(2)},\quad i=2,\dots,p+1, (42)

with FDR and FDP control.

In genetic association studies, it is common to measure multiple correlated phenotypes on the same individuals. To detect associations between high-dimensional genetic variants and these phenotypes, one can individually assess the relationship between each response and each covariate, as in (41), and then adjust for multiplicity in the comparisons. However, as noted by Zhou2015 and Schifano2013 , jointly analyzing these phenotypic measurements may increase the power to detect causal genetic variants. Therefore, motivated by the potential to enhance power by leveraging the similarity across multivariate responses, xia2018joint used high-dimensional multivariate regression models to address applications in which DD correlated responses are measured on nn independent individuals:

Yn×D\displaystyle Y_{n\times D} =\displaystyle= Xn×(p+1)B(p+1)×D+Υn×D,\displaystyle X_{n\times(p+1)}B_{(p+1)\times D}+\Upsilon_{n\times D}, (43)

where Y=(Y,1,,Y,D)n×DY=(Y_{\cdot,1},\ldots,Y_{\cdot,D})\in\mathbb{R}^{n\times D}, with Y,d=(Y1,d,,Yn,d)Y_{\cdot,d}=(Y_{1,d},\ldots,Y_{n,d})^{\intercal}, denotes DD responses with DD fixed, and X=(X1,,,Xn,)n×(p+1)X=(X_{1,\cdot}^{\intercal},\ldots,X_{n,\cdot}^{\intercal})^{\intercal}\in\mathbb{R}^{n\times(p+1)} is the covariate matrix. In (43), B=(B,1,,B,D)(p+1)×DB=(B_{\cdot,1},\ldots,B_{\cdot,D})\in\mathbb{R}^{(p+1)\times D}, with B,d=(B1,d,,Bp+1,d)p+1B_{\cdot,d}=(B_{1,d},\dots,B_{p+1,d})^{\intercal}\in\mathbb{R}^{p+1}, represents the regression coefficient matrix, where Bi,B_{i,\cdot} represents the regression coefficients of the ithi^{\scalebox{0.8}{th}} covariate; Υ=(ϵ,1,,ϵ,D)n×D\Upsilon=(\epsilon_{\cdot,1},\ldots,\epsilon_{\cdot,D})\in\mathbb{R}^{n\times D}, where ϵ,d=(ϵ1,d,,ϵn,d)\epsilon_{\cdot,d}=(\epsilon_{1,d},\dots,\epsilon_{n,d})^{\intercal}, and {ϵk,d}\{\epsilon_{k,d}\} are i.i.d random variables with mean zero and variance σϵ2\sigma_{\epsilon}^{2} and independent of XX. To examine whether the ithi^{\scalebox{0.8}{th}} covariate is associated with any of the DD responses, xia2018joint simultaneously tested

H0,i:Bi,=0 versus H1,i:Bi,0,i=2,,p+1,\displaystyle H_{0,i}:B_{i,\cdot}=0\;\mbox{ versus }\;H_{1,i}:B_{i,\cdot}\neq 0,\quad i=2,\dots,p+1, (44)

while controlling FDR and FDP. Because the effects of the ithi^{\scalebox{0.8}{th}} variable on the DD responses may share strong similarities (namely, if Bi,d0B_{i,d}\neq 0, then the remaining entries in this row are more likely to be nonzero), a row-wise testing method that uses this group-wise information is preferable to testing the significance of the matrix BB column by column as in testing problem (41).

4.1.1 Bias corrections via inverse regression

In the multiple testing problems discussed in Section 4.1, our goal is to simultaneously infer the regression coefficients while controlling error rates. Therefore, it is crucial to begin with an asymptotically unbiased estimator for each regression component. The debiasing techniques outlined in Section 2.2 can be used to attain nearly unbiased estimates; however, as noted in liuluo2014 , the constraints or tuning parameters utilized in debiasing can significantly affect test accuracy. Additionally, the asymptotic distribution of these debiased estimators is conditional, making it challenging to characterize the dependence structure among the test statistics, which is vital for error rate control in simultaneous inference. As an alternative, we will explore an inverse regression approach in this section, which establishes the unconditional asymptotic distribution of bias-corrected statistics and allows for explicit characterization of the correlation structure. Alternatively, the debiasing method (13) in zhang2014confidence ; van2014asymptotically can be used as long as the decorrelation vectors Z,jZ_{\cdot,j}’s introduced in Section 2.2.2 are close enough to their population counterparts. This approach will be illustrated in Section 4.2.

Recall that Xk,1(d)=1X^{\scriptscriptstyle(d)}_{k,1}=1. To achieve bias correction, taking testing problem (42) as an example, we consider the inverse regression models obtained by regressing Xk,i(d)X_{k,i}^{\scriptscriptstyle(d)} on (Yk(d),X~k,i(d))(Y_{k}^{\scriptscriptstyle(d)},\widetilde{X}_{k,-i}^{\scriptscriptstyle(d)}), where X~k,i(d)=Xk,{1,i}(d)\widetilde{X}_{k,-i}^{\scriptscriptstyle(d)}=X_{k,-\{1,i\}}^{\scriptscriptstyle(d)}, for i=2,,p+1i=2,\dots,p+1:

Xk,i(1)\displaystyle X_{k,i}^{\scriptscriptstyle(1)} =\displaystyle= αi(1)+(Yk(1),X~k,i(1))γi(1)+ηk,i(1),(k=1,,n1)\displaystyle\alpha_{i}^{\scriptscriptstyle(1)}+(Y_{k}^{\scriptscriptstyle(1)},\widetilde{X}_{k,-i}^{\scriptscriptstyle(1)})\gamma_{i}^{\scriptscriptstyle(1)}+\eta_{k,i}^{\scriptscriptstyle(1)},\quad(k=1,\dots,n_{1})
Xk,i(2)\displaystyle X_{k,i}^{\scriptscriptstyle(2)} =\displaystyle= αi(2)+(Yk(2),X~k,i(2))γi(2)+ηk,i(2),(k=1,,n2)\displaystyle\alpha_{i}^{\scriptscriptstyle(2)}+(Y_{k}^{\scriptscriptstyle(2)},\widetilde{X}_{k,-i}^{\scriptscriptstyle(2)})\gamma_{i}^{\scriptscriptstyle(2)}+\eta_{k,i}^{\scriptscriptstyle(2)},\quad(k=1,\dots,n_{2})

where for d=1,2d=1,2, ηk,i(d)\eta_{k,i}^{\scriptscriptstyle(d)} has mean zero and variance σi,d2\sigma_{i,d}^{2} and is uncorrelated with (Yk(d),X~k,i(d))(Y_{k}^{\scriptscriptstyle(d)},\widetilde{X}_{k,-i}^{\scriptscriptstyle(d)}), and the first component of γi(d)=(γi,1(d),,γi,p(d))\gamma_{i}^{\scriptscriptstyle(d)}=(\gamma_{i,1}^{\scriptscriptstyle(d)},\dots,\gamma_{i,p}^{\scriptscriptstyle(d)})^{\intercal} satisfies

γi,1(d)=σi,d2βi(d)/σϵ(d)2,i=2,,p+1,\displaystyle\gamma_{i,1}^{\scriptscriptstyle(d)}=\sigma_{i,d}^{2}\beta_{i}^{\scriptscriptstyle(d)}/\sigma_{\epsilon^{\scalebox{0.4}{(d)}}}^{2},\quad i=2,\dots,p+1, (45)

where σi,d2=[{βi(d)}2/σϵ(d)2+ωi1,i1(d)]1\sigma_{i,d}^{2}=[\{\beta_{i}^{\scriptscriptstyle(d)}\}^{2}/\sigma_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}+\omega_{i-1,i-1}^{\scriptscriptstyle(d)}]^{-1} with {Cov(Xk,1(d))}1=Ωd=(ωi,j(d))\{\textsf{Cov}(X_{k,-1}^{\scriptscriptstyle(d)})\}^{-1}=\Omega_{d}=(\omega_{i,j}^{\scriptscriptstyle(d)}).

Note that, ri(d)=Cov(ϵk(d),ηk,i(d))r_{i}^{\scriptscriptstyle(d)}=\textsf{Cov}(\epsilon_{k}^{\scriptscriptstyle(d)},\eta_{k,i}^{\scriptscriptstyle(d)}) can be expressed as γi,1(d)Cov(ϵk(d),Yk(d))=γi,1(d)σϵ(d)2=σi,d2βi(d)-\gamma_{i,1}^{\scriptscriptstyle(d)}\textsf{Cov}(\epsilon_{k}^{\scriptscriptstyle(d)},Y_{k}^{\scriptscriptstyle(d)})=-\gamma_{i,1}^{\scriptscriptstyle(d)}\sigma_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}=-\sigma_{i,d}^{2}\beta_{i}^{\scriptscriptstyle(d)}, hence we can approach the debiasing of βi(d)\beta_{i}^{\scriptscriptstyle(d)} through the debiasing of ri(d)r_{i}^{\scriptscriptstyle(d)}, and equivalently formulate the testing problem (42) as

H0,i:ri(1)/σi,12=ri(2)/σi,22,i=2,,p+1.\displaystyle H_{0,i}:r_{i}^{\scriptscriptstyle(1)}/\sigma_{i,1}^{2}=r_{i}^{\scriptscriptstyle(2)}/\sigma_{i,2}^{2},\quad i=2,\dots,p+1.

The most straightforward way to estimate ri(d)r_{i}^{\scriptscriptstyle(d)} is to use the sample covariance between the error terms, nd1k=1ndϵk(d)ηk,i(d)n_{d}^{-1}\sum_{k=1}^{n_{d}}\epsilon_{k}^{\scriptscriptstyle(d)}\eta_{k,i}^{\scriptscriptstyle(d)}. However, the error terms are unknown, so we first estimate them by

ϵ^k(d)=Yk(d)Xk,(d)β^(d),η^k,i(d)=Xk,i(d)(Yk(d)Y¯(d),X~k,i(d))γ^i(d),\displaystyle\widehat{\epsilon}_{k}^{\scriptscriptstyle(d)}=Y_{k}^{\scriptscriptstyle(d)}-X_{k,\cdot}^{\scriptscriptstyle(d)}\widehat{\beta}^{\scriptscriptstyle(d)},\quad\widehat{\eta}_{k,i}^{\scriptscriptstyle(d)}=X_{k,i}^{\scriptscriptstyle(d)}-(Y_{k}^{\scriptscriptstyle(d)}-\bar{Y}^{\scriptscriptstyle(d)},\widetilde{X}_{k,-i}^{\scriptscriptstyle(d)})\widehat{\gamma}_{i}^{\scriptscriptstyle(d)},

where β^(d)=(β^1(d),,β^p+1(d))\widehat{\beta}^{\scriptscriptstyle(d)}=(\widehat{\beta}_{1}^{\scriptscriptstyle(d)},\dots,\widehat{\beta}_{p+1}^{\scriptscriptstyle(d)}) and γ^i(d)=(γ^i,1(d),,γ^i,p(d))\widehat{\gamma}_{i}^{\scriptscriptstyle(d)}=(\widehat{\gamma}_{i,1}^{\scriptscriptstyle(d)},\dots,\widehat{\gamma}_{i,p}^{\scriptscriptstyle(d)}) are respectively the estimators of β(d)\beta^{\scriptscriptstyle(d)} and γi(d)\gamma_{i}^{\scriptscriptstyle(d)} that satisfy

max{|β^1(d)β1(d)|1,maxi=1,,p|γ^i(d)γi(d)|1}\displaystyle\max\{|\widehat{\beta}_{-1}^{\scriptscriptstyle(d)}-\beta_{-1}^{\scriptscriptstyle(d)}|_{1},\max_{i=1,\ldots,p}|\widehat{\gamma}_{i}^{\scriptscriptstyle(d)}-\gamma_{i}^{\scriptscriptstyle(d)}|_{1}\} =\displaystyle= O(an1),\displaystyle O_{{\mathbb{P}}}(a_{n1}), (46)
max{|β^1(d)β1(d)|2,maxi=1,,p|γ^i(d)γi(d)|2}\displaystyle\max\{|\widehat{\beta}_{-1}^{\scriptscriptstyle(d)}-\beta_{-1}^{\scriptscriptstyle(d)}|_{2},\max_{i=1,\ldots,p}|\widehat{\gamma}_{i}^{\scriptscriptstyle(d)}-\gamma_{i}^{\scriptscriptstyle(d)}|_{2}\} =\displaystyle= O(an2),\displaystyle O_{{\mathbb{P}}}(a_{n2}), (47)

for some an1a_{n1} and an2a_{n2} such that

max{an1an2,an22}=o{(nlogp)1/2}, and an1=o(1/logp).\max\{a_{n1}a_{n2},a_{n2}^{2}\}=o\{(n\log p)^{-1/2}\},\mbox{ and }a_{n1}=o(1/\log p). (48)

As noted by (liuluo2014, ; xia2018two, ), estimators β^(d)\widehat{\beta}^{\scriptscriptstyle(d)} and γ^i(d)\widehat{\gamma}_{i}^{\scriptscriptstyle(d)} that satisfy (46), (47), and (48) can be obtained easily via standard methods such as the Lasso and the Dantzig selector. A natural estimator of ri(d)r_{i}^{\scriptscriptstyle(d)} can then be constructed as r~i(d)=nd1k=1ndϵ^k(d)η^k,i(d).\widetilde{r}_{i}^{\scriptscriptstyle(d)}=n_{d}^{-1}\sum_{k=1}^{n_{d}}\widehat{\epsilon}_{k}^{\scriptscriptstyle(d)}\widehat{\eta}_{k,i}^{\scriptscriptstyle(d)}. However, the bias of r~i(d)\widetilde{r}_{i}^{\scriptscriptstyle(d)} exceeds the desired rate (ndlogp)1/2(n_{d}\log p)^{-1/2} for the subsequent analysis. Hence, the difference between r~i(d)\widetilde{r}_{i}^{\scriptscriptstyle(d)} and nd1k=1ndϵk(d)ηk,i(d)n_{d}^{-1}\sum_{k=1}^{n_{d}}\epsilon_{k}^{\scriptscriptstyle(d)}\eta_{k,i}^{\scriptscriptstyle(d)} is calculated; under regularity conditions it equals σ^ϵ(d)2γ^i,1(d)+σ^i,d2β^i(d)\widehat{\sigma}_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}\widehat{\gamma}_{i,1}^{\scriptscriptstyle(d)}+\widehat{\sigma}_{i,d}^{2}\widehat{\beta}_{i}^{\scriptscriptstyle(d)} up to order (ndlogp)1/2(n_{d}\log p)^{-1/2}, where σ^ϵ(d)2=nd1k=1nd{ϵ^k(d)}2\widehat{\sigma}_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}=n_{d}^{-1}\sum_{k=1}^{n_{d}}\{\widehat{\epsilon}_{k}^{\scriptscriptstyle(d)}\}^{2} and σ^i,d2=nd1k=1nd{η^k,i(d)}2\widehat{\sigma}_{i,d}^{2}=n_{d}^{-1}\sum_{k=1}^{n_{d}}\{\widehat{\eta}_{k,i}^{\scriptscriptstyle(d)}\}^{2} are the sample variances. Accordingly, a bias-corrected estimator for ri(d)r_{i}^{\scriptscriptstyle(d)} is defined as

r^i(d)=r~i(d)+σ^ϵ(d)2γ^i,1(d)+σ^i,d2β^i(d).\widehat{r}_{i}^{\scriptscriptstyle(d)}=\widetilde{r}_{i}^{\scriptscriptstyle(d)}+\widehat{\sigma}_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}\widehat{\gamma}_{i,1}^{\scriptscriptstyle(d)}+\widehat{\sigma}_{i,d}^{2}\widehat{\beta}_{i}^{\scriptscriptstyle(d)}. (49)

For the other two testing problems, the bias corrections can be performed in almost exactly the same way via the inverse regression technique above, which translates the debiasing of regression coefficients into the debiasing of residual covariances. Note that, through such an inverse regression approach, one can explicitly handle the dependence among the component-wise debiased statistics, which is important for the subsequent multiplicity adjustment and FDR control.

4.1.2 Construction of test statistics

We next construct test statistics for each of the three problems discussed in Section 4.1, using the bias-corrected statistics as a starting point.

For problem (41), because testing whether βi=0\beta_{i}=0 is equivalent to testing whether the residual covariance is equal to zero, the test statistics can be constructed directly based on the bias-corrected r^i\widehat{r}_{i} (obtained exactly as r^i(d)\widehat{r}_{i}^{\scriptscriptstyle(d)} in Section 4.1.1, with the superscript (d)(d) dropped since there is only one sample). The standardized test statistics are then given by

Wi=r^i(σ^ϵ2σ^i2/n)1/2,i=2,,p+1,W_{i}=\frac{\widehat{r}_{i}}{(\widehat{\sigma}^{2}_{\epsilon}\widehat{\sigma}^{2}_{i}/n)^{1/2}},\quad i=2,\dots,p+1, (50)

where σ^ϵ2\widehat{\sigma}^{2}_{\epsilon} and σ^i2\widehat{\sigma}^{2}_{i} are the corresponding sample variances in the one-sample case, obtained by dropping the superscript (d)(d) and the subscript dd. As shown in liuluo2014 , the statistics WiW_{i} are asymptotically normal under the null.
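To illustrate this one-sample construction, the following R sketch computes W_i in (50) for a single index i, using cv.glmnet for the initial forward and inverse Lasso regressions. The object names are illustrative, and the cross-validated tuning is a pragmatic simplification rather than the exact choice analyzed in liuluo2014.

# Minimal sketch of the one-sample bias-corrected statistic W_i in (50).
# X has a leading column of ones (the intercept) and i indexes a non-intercept
# column, 2 <= i <= p+1.
library(glmnet)

test_stat_onesample <- function(X, Y, i) {
  n <- nrow(X)
  # forward regression: Y on the non-intercept columns of X
  fit_beta <- cv.glmnet(X[, -1], Y)
  beta_i   <- as.vector(coef(fit_beta, s = "lambda.min"))[i]    # coefficient of X[, i]
  eps_hat  <- as.vector(Y - predict(fit_beta, newx = X[, -1], s = "lambda.min"))
  # inverse regression: X[, i] on (Y, remaining covariates)
  Zmat      <- cbind(Y, X[, -c(1, i)])
  fit_gamma <- cv.glmnet(Zmat, X[, i])
  gamma_i1  <- as.vector(coef(fit_gamma, s = "lambda.min"))[2]  # coefficient of Y
  eta_hat   <- as.vector(X[, i] - predict(fit_gamma, newx = Zmat, s = "lambda.min"))
  # bias-corrected residual covariance (49) and its standardization (50)
  sigma_eps2 <- mean(eps_hat^2)
  sigma_i2   <- mean(eta_hat^2)
  r_hat <- mean(eps_hat * eta_hat) + sigma_eps2 * gamma_i1 + sigma_i2 * beta_i
  r_hat / sqrt(sigma_eps2 * sigma_i2 / n)
}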

The above construction cannot be directly applied to the problem (42), because βi(d)\beta_{i}^{\scriptscriptstyle(d)} is not necessarily equal to 0 under the two-sample null and ri(1)/σi,12=ri(2)/σi,22r_{i}^{\scriptscriptstyle(1)}/\sigma_{i,1}^{2}=r_{i}^{\scriptscriptstyle(2)}/\sigma_{i,2}^{2} is not equivalent to ri(1)=ri(2)r_{i}^{\scriptscriptstyle(1)}=r_{i}^{\scriptscriptstyle(2)}. Thus, it is necessary to construct testing procedures based directly on estimators of ri(1)/σi,12ri(2)/σi,22r_{i}^{\scriptscriptstyle(1)}/\sigma_{i,1}^{2}-r_{i}^{\scriptscriptstyle(2)}/\sigma_{i,2}^{2}. By the bias correction in Section 4.1.1, xia2018two proposed an estimator of ri(d)/σi,d2r_{i}^{\scriptscriptstyle(d)}/\sigma_{i,d}^{2}:

\displaystyle T_{i}^{\scriptscriptstyle(d)}=\widehat{r}_{i}^{\scriptscriptstyle(d)}/\widehat{\sigma}_{i,d}^{2},\quad i=2,\dots,p+1;\;d=1,2,

and tested (42) via the estimators {Ti(1)Ti(2):i=2,,p+1}\{T_{i}^{\scriptscriptstyle(1)}-T_{i}^{\scriptscriptstyle(2)}:i=2,\dots,p+1\}. Due to the heteroscedasticity, xia2018two considered a standardized version of Ti(1)Ti(2)T_{i}^{\scriptscriptstyle(1)}-T_{i}^{\scriptscriptstyle(2)}. Specifically, let

U~i(d)=(βi(d)+Ui(d))/σi,d2, with Ui(d)=nd1k=1nd{ϵk(d)ηk,i(d)E(ϵk(d)ηk,i(d))}.{\small\widetilde{U}_{i}^{\scriptscriptstyle(d)}=(\beta_{i}^{\scriptscriptstyle(d)}+U_{i}^{\scriptscriptstyle(d)})/\sigma_{i,d}^{2},\mbox{ with }U_{i}^{\scriptscriptstyle(d)}=n_{d}^{-1}\sum_{k=1}^{n_{d}}\{\epsilon_{k}^{\scriptscriptstyle(d)}\eta_{k,i}^{\scriptscriptstyle(d)}-\textsf{E}(\epsilon_{k}^{\scriptscriptstyle(d)}\eta_{k,i}^{\scriptscriptstyle(d)})\}.}

It was shown in xia2018two that Ti(d)T_{i}^{\scriptscriptstyle(d)} is close to U~i(d)\widetilde{U}_{i}^{\scriptscriptstyle(d)} asymptotically under regularity conditions. Because θi(d)=Var(U~i(d))=Var(ϵk(d)ηk,i(d)/σi,d2)/nd=(σϵ(d)2/σi,d2+{βi(d)}2)/nd,\theta_{i}^{\scriptscriptstyle(d)}=\textsf{Var}(\widetilde{U}_{i}^{\scriptscriptstyle(d)})=\textsf{Var}(\epsilon_{k}^{\scriptscriptstyle(d)}\eta_{k,i}^{\scriptscriptstyle(d)}/\sigma_{i,d}^{2})/n_{d}=(\sigma_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}/\sigma_{i,d}^{2}+\{\beta_{i}^{\scriptscriptstyle(d)}\}^{2})/n_{d}, it can be estimated by θ^i(d)=(σ^ϵ(d)2/σ^i,d2+{β^i(d)}2)/nd,\widehat{\theta}_{i}^{\scriptscriptstyle(d)}=(\widehat{\sigma}_{\epsilon^{\scalebox{0.4}{(d)}}}^{2}/\widehat{\sigma}_{i,d}^{2}+\{\widehat{\beta}_{i}^{\scriptscriptstyle(d)}\}^{2})/n_{d}, and the standardized statistics are defined by

Wi=Ti(1)Ti(2)(θ^i(1)+θ^i(2))1/2,i=2,,p+1,W_{i}=\frac{T_{i}^{\scriptscriptstyle(1)}-T_{i}^{\scriptscriptstyle(2)}}{(\widehat{\theta}_{i}^{\scriptscriptstyle(1)}+\widehat{\theta}_{i}^{\scriptscriptstyle(2)})^{1/2}},\quad i=2,\dots,p+1, (51)

which are asymptotically normal under the null as studied in xia2018two .

For the multivariate testing problem (44), by taking advantage of the similar effect of the ithi^{\scalebox{0.8}{th}} variable on each of the responses, a group lasso penalty (yuan2006model, ) can be imposed to obtain an estimator of BB, such that

max1dD|B^1,dB1,d|1=O(an1),max1dD|B^1,dB1,d|2=O(an2),\displaystyle{\small\max_{1\leq d\leq D}|\widehat{B}_{-1,d}-B_{-1,d}|_{1}=O_{{\mathbb{P}}}(a_{n1}),\max_{1\leq d\leq D}|\widehat{B}_{-1,d}-B_{-1,d}|_{2}=O_{{\mathbb{P}}}(a_{n2}),}

for some an1a_{n1} and an2a_{n2} satisfying (48). By lounici2011oracle , the above rates can be fulfilled if the row sparsity of BB satisfies s(p)=o(n1/3/logp)s(p)=o(n^{1/3}/{\log p}). Following the same bias-correction strategy as described in Section 4.1.1, a debiased estimator for ri(d)r_{i}^{\scriptscriptstyle(d)} can be obtained via r^i(d)=r~i(d)+σ^ϵ2γ^i,1(d)+σ^i,d2B^i,d.\widehat{r}_{i}^{\scriptscriptstyle(d)}=\widetilde{r}_{i}^{\scriptscriptstyle(d)}+\widehat{\sigma}_{\epsilon}^{2}\widehat{\gamma}_{i,1}^{\scriptscriptstyle(d)}+\widehat{\sigma}_{i,d}^{2}\widehat{B}_{i,d}. Then, similarly to the two problems above, the standardized statistics are constructed as

\displaystyle\widetilde{T}_{i}^{\scriptscriptstyle(d)}={T_{i}^{\scriptscriptstyle(d)}}/{\{\widehat{\theta}_{i}^{\scriptscriptstyle(d)}\}^{1/2}},\quad i=2,\dots,p+1;\;d=1,\ldots,D,

where Ti(d)=r^i(d)/σ^i,d2T_{i}^{\scriptscriptstyle(d)}=\widehat{r}_{i}^{\scriptscriptstyle(d)}/\widehat{\sigma}_{i,d}^{2} and θ^i(d)=(σ^ϵ2/σ^i,d2+B^i,d2)/n.\widehat{\theta}_{i}^{\scriptscriptstyle(d)}=(\widehat{\sigma}_{\epsilon}^{2}/\widehat{\sigma}_{i,d}^{2}+\widehat{B}_{i,d}^{2})/n. Finally, a sum-of-squares-type test statistic for testing the ithi^{\scalebox{0.8}{th}} row of BB is proposed:

Wi=d=1D{T~i(d)}2,i=2,,p+1,W_{i}=\sum_{d=1}^{D}\{\widetilde{T}_{i}^{\scriptscriptstyle(d)}\}^{2},\quad i=2,\dots,p+1, (52)

which is asymptotically χD2\chi^{2}_{D} distributed under the null as studied in xia2018joint .
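For instance, given the DD standardized statistics for the ii-th covariate, the statistic (52) and its asymptotic null p-value can be computed as follows (an R sketch with illustrative names).

# Sketch: row-wise statistic (52) and its chi-square p-value for D responses,
# given the vector T_tilde of the D standardized statistics for covariate i.
row_test <- function(T_tilde) {
  W_i <- sum(T_tilde^2)                                    # statistic (52)
  c(W = W_i, pval = pchisq(W_i, df = length(T_tilde), lower.tail = FALSE))
}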

4.2 Simultaneous inference for logistic regression

The principles of simultaneous inference for high-dimensional linear regression can also be applied to high-dimensional logistic regression models. In particular, ma2021global considered the regression model (1), i.e., Yk=h(Xk,β)+ϵkY_{k}=h(X_{k,\cdot}^{\intercal}\beta)+\epsilon_{k}, k=1,,nk=1,\ldots,n, with the link function h(z)=exp(z)/[1+exp(z)]h(z)=\exp(z)/[1+\exp(z)], and studied the simultaneous testing problem (41) as described in Section 4.1, namely testing

H0,i:βi=0 versus H1,i:βi0,i=2,,p+1,\displaystyle H_{0,i}:\beta_{i}=0\mbox{ versus }H_{1,i}:\beta_{i}\neq 0,\quad i=2,\dots,p+1,

with FDR control.

Based on the regularized estimator β^\widehat{\beta} given in (5), ma2021global corrected the bias of β^\widehat{\beta} via the Taylor expansion of h(uk)h(u_{k}) at u^k\widehat{u}_{k} for uk=Xk,βu_{k}=X_{k,\cdot}^{\intercal}\beta and u^k=Xk,β^\widehat{u}_{k}=X_{k,\cdot}^{\intercal}\widehat{\beta}, and obtained that

Ykh(u^k)+h(u^k)Xk,β^=h(u^k)Xk,β+(Rk+ϵk),Y_{k}-h(\widehat{u}_{k})+h^{\prime}(\widehat{u}_{k})X_{k,\cdot}^{\intercal}\widehat{\beta}=h^{\prime}(\widehat{u}_{k})X_{k,\cdot}^{\intercal}\beta+(R_{k}+\epsilon_{k}), (53)

where RkR_{k} is the remainder term as specified in (16). Next, Ykh(u^k)+h(u^k)Xk,β^Y_{k}-h(\widehat{u}_{k})+h^{\prime}(\widehat{u}_{k})X_{k,\cdot}^{\intercal}\widehat{\beta} can be treated as the new response, h(u^k)Xk,h^{\prime}(\widehat{u}_{k})X_{k,\cdot} as the new covariates, and Rk+ϵkR_{k}+\epsilon_{k} as the new noise. Under such a formulation, testing H0,i:βi=0H_{0,i}:\beta_{i}=0 can be translated into the simultaneous inference for the regression coefficients of an approximate linear model. ma2021global applied the decorrelation method (13) and constructed the following debiased estimator β~iDeb\widetilde{\beta}_{i}^{\rm Deb},

β~iDeb=β^i+k=1nZk,i(Ykh(Xk,β^))k=1nZk,ih(Xk,β^)Xk,i,i=2,,p+1,\widetilde{\beta}_{i}^{\rm Deb}=\widehat{\beta}_{i}+\frac{\sum_{k=1}^{n}Z_{k,i}(Y_{k}-h(X_{k,\cdot}^{\intercal}\widehat{\beta}))}{\sum_{k=1}^{n}Z_{k,i}h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta})X_{k,i}},\quad i=2,\dots,p+1, (54)

where Z,iZ_{\cdot,i} is the scaled residual from regressing X,iX_{\cdot,i} on X,iX_{\cdot,-i} through the linearization weighting W^k=1/h(Xk,β^)\widehat{W}_{k}=1/h^{\prime}(X_{k,\cdot}^{\intercal}\widehat{\beta}) introduced in Section 2.3; the same strategy was also employed in (ren2016asymptotic, ; cai2019differential, ). As noted in Section 4.1.1, for the subsequent FDR control analysis, the decorrelation vectors Z,iZ_{\cdot,i} were shown to be close to the true regression errors. Alternatively, the inverse regression technique in Section 4.1.1 can be applied to the approximate linear model (53) for the bias correction.

Based on (54), ma2021global proposed the following standardized test statistic:

Wi=β~iDeb{k=1nh(u^k)Zk,i2}1/2/{k=1nh(u^k)Zk,iXk,i},i=2,,p+1,W_{i}=\frac{\widetilde{\beta}_{i}^{\rm Deb}}{\{\sum_{k=1}^{n}h^{\prime}(\widehat{u}_{k})Z_{k,i}^{2}\}^{1/2}/\{\sum_{k=1}^{n}h^{\prime}(\widehat{u}_{k})Z_{k,i}X_{k,i}\}},\quad i=2,\dots,p+1, (55)

which is asymptotically normal under the null. Additionally, ma2021global extended the idea to the two-sample multiple testing problem (42). Similarly, multiple testing of (44) can also be approached in the logistic setting.
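A minimal R sketch of (54) and (55) is given below. It uses cv.glmnet for the initial penalized logistic fit and, as a simplification, takes Z,iZ_{\cdot,i} to be the residual from a weighted Lasso of X,iX_{\cdot,i} on X,iX_{\cdot,-i} with weights h(u^k)h^{\prime}(\widehat{u}_{k}); the exact decorrelation construction and tuning in ma2021global may differ, and all object names are illustrative.

# Minimal sketch of the debiased logistic statistics (54)-(55).
# X has a leading intercept column and i indexes a non-intercept column.
library(glmnet)

logistic_test_stat <- function(X, Y, i) {
  # initial penalized logistic fit
  fit      <- cv.glmnet(X[, -1], Y, family = "binomial")
  beta_hat <- as.vector(coef(fit, s = "lambda.min"))
  p_hat    <- as.vector(predict(fit, newx = X[, -1], s = "lambda.min", type = "response"))
  hprime   <- p_hat * (1 - p_hat)                          # h'(u_hat_k)
  # simplified decorrelation vector: weighted nodewise Lasso residual
  fit_node <- cv.glmnet(X[, -c(1, i)], X[, i], weights = hprime)
  Z_i <- as.vector(X[, i] - predict(fit_node, newx = X[, -c(1, i)], s = "lambda.min"))
  # debiased estimator (54) and standardized statistic (55)
  denom   <- sum(Z_i * hprime * X[, i])
  beta_db <- beta_hat[i] + sum(Z_i * (Y - p_hat)) / denom
  beta_db / (sqrt(sum(hprime * Z_i^2)) / denom)
}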

4.3 Multiple testing procedure

Using the test statistics {Wi:i=2,,p+1}\{W_{i}:i=2,\dots,p+1\} in (50), (51), (52) and (55) as a foundation, we next examine a unified multiple testing procedure that guarantees error rate control.

Let ={2,,p+1}\mathcal{H}=\{2,\dots,p+1\}, 0\mathcal{H}_{0} be the set of true null indices and 1=0\mathcal{H}_{1}=\mathcal{H}\setminus\mathcal{H}_{0} be the set of true alternatives. We are interested in cases where most of the tests are nulls, that is, |1||\mathcal{H}_{1}| is relatively small compared to |||\mathcal{H}|. Let Ψ()\Psi(\cdot) be the asymptotic cumulative distribution function (cdf) of WiW_{i} under the null and let Φ()\Phi(\cdot) be the standard normal cdf. We then apply the normal quantile transformation Ni=Φ1{1(1Ψ(Wi))/2}N_{i}=\Phi^{-1}\left\{1-(1-\Psi(W_{i}))/2\right\}, which approximately has the same distribution as the absolute value of a standard normal random variable under the null H0,iH_{0,i}. Let tt be the threshold level such that H0,iH_{0,i} is rejected if NitN_{i}\geq t. For any given tt, denote by R0(t)=i0𝕀{Nit}R_{0}(t)=\sum_{i\in\mathcal{H}_{0}}\mathbb{I}\{N_{i}\geq t\} and R(t)=i𝕀{Nit}R(t)=\sum_{i\in\mathcal{H}}\mathbb{I}\{N_{i}\geq t\} the total number of false positives and the total number of rejections, respectively. Then the FDP and FDR are defined as

FDP(t)=R0(t)max{R(t),1},FDR(t)=𝔼{FDP(t)}.\text{FDP}(t)=\frac{R_{0}(t)}{\max\{R(t),1\}},\quad\text{FDR}(t)=\mathbb{E}\{\text{FDP}(t)\}.

An ideal choice of tt rejects as many true positives as possible while controlling the FDP at the pre-specified level α\alpha, i.e.,

t0=inf{0t(2logp)1/2:FDP(t)α}.t_{0}=\inf\left\{0\leq t\leq(2\log p)^{1/2}:\;\text{FDP}(t)\leq\alpha\right\}.

Since R0(t)R_{0}(t) can be estimated by 2{1Φ(t)}|0|2\{1-\Phi(t)\}|\mathcal{H}_{0}| and |0||\mathcal{H}_{0}| is upper bounded by pp, we conservatively estimate it by 2p{1Φ(t)}2p\{1-\Phi(t)\}. Therefore, the following multiple testing algorithm is proposed in (e.g., liuluo2014, ; xia2018two, ; xia2018joint, ; ma2021global, ).

Algorithm 1 The multiple testing procedure.

Step 1. Obtain the transformed statistics Ni=Φ1{1(1Ψ(Wi))/2}N_{i}=\Phi^{-1}\left\{1-(1-\Psi(W_{i}))/2\right\} from the test statistics WiW_{i}, i=2,,p+1i=2,\dots,p+1.

Step 2. For a given 0α10\leq\alpha\leq 1, calculate

t^=inf[0t(2logp2loglogp)1/2:2p{1Φ(t)}max{R(t),1}α].\small\widehat{t}=\inf\left[0\leq t\leq(2\log p-2\log\log p)^{1/2}:\;\frac{2p\{1-\Phi(t)\}}{\max\{R(t),1\}}\leq\alpha\right]. (56)

If the infimum in (56) does not exist, set t^=(2logp)1/2\widehat{t}=(2\log p)^{1/2}.

Step 3. For ii\in\mathcal{H}, reject H0,iH_{0,i} if Nit^N_{i}\geq\widehat{t}.

As noted in xia2018multiple , the constraint 0t(2logp2loglogp)1/20\leq t\leq(2\log p-2\log\log p)^{1/2} in (56) is critical. When tt exceeds the upper bound, 2p{1Φ(t)}02p\{1-\Phi(t)\}\rightarrow 0 is not even a consistent estimate of R0(t)R_{0}(t). In contrast, the Benjamini-Hochberg (B-H) procedure (BenHoc95, ) uses 2p{1Φ(t)}2p\{1-\Phi(t)\} as an estimate of R0(t)R_{0}(t) for all t0t\geq 0 and hence may not be able to control the FDP. On the other hand, if the infimum in (56) is not attained within the range, it is important to threshold NiN_{i} at (2logp)1/2(2\log p)^{1/2}, because thresholding NiN_{i} at (2logp2loglogp)1/2(2\log p-2\log\log p)^{1/2} may cause too many false rejections. As a result, by applying the above multiple testing algorithm to each of the problems in Sections 4.1 and 4.2, under some regularity conditions as specified in (liuluo2014, ; xia2018two, ; xia2018joint, ; ma2021global, ), we achieve asymptotic FDP and FDR control at the pre-specified level α\alpha, i.e., lim(n,p){FDPAlg1α+ϵ}=1\lim_{(n,p)\rightarrow\infty}{\mathbb{P}}\{\text{FDP}_{\text{Alg1}}\leq\alpha+\epsilon\}=1 for any ϵ>0\epsilon>0 and lim¯(n,p)FDRAlg1α\mathop{\overline{\rm lim}}_{(n,p)\rightarrow\infty}\text{FDR}_{\text{Alg1}}\leq\alpha.
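The data-driven threshold (56), including the fallback to (2logp)1/2(2\log p)^{1/2}, can be computed directly from the transformed statistics. The following R sketch (with illustrative object names) returns the indices of the rejected hypotheses for a given level α.

# Minimal sketch of Algorithm 1: given the transformed statistics N (one per
# hypothesis) and a level alpha, compute the threshold (56) and reject.
algorithm1_reject <- function(N, alpha) {
  p     <- length(N)                                      # number of hypotheses
  t_max <- sqrt(2 * log(p) - 2 * log(log(p)))
  t_grid <- sort(c(0, N[N >= 0 & N <= t_max]))            # candidate thresholds
  fdp_hat <- sapply(t_grid, function(t) {
    2 * p * (1 - pnorm(t)) / max(sum(N >= t), 1)          # conservative FDP estimate
  })
  if (any(fdp_hat <= alpha)) {
    t_hat <- min(t_grid[fdp_hat <= alpha])                # the infimum in (56)
  } else {
    t_hat <- sqrt(2 * log(p))                             # fallback threshold
  }
  which(N >= t_hat)                                       # indices of rejected hypotheses
}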

4.4 Power enhancement

In addition to controlling error rates, we consider strategies to improve the power of multiple testing procedures, focusing on enhancing the power for two-sample inference when the high-dimensional objects of interest are individually sparse. This is explored in xia2020gap with an extension to a more general framework in liang2022locally . It is worth noting that xia2020gap improved the performance of Algorithm 1 by utilizing unknown sparsity in two-sample multiple testing. Additionally, power enhancement for the simultaneous inference of GLMs can be achieved through data integration (e.g., cai2021individual, ; liu2021integrative, ) as well as other power boosting methods designed for general multiple testing problems as introduced in Section 1.

Recall that, in the two-sample problem studied in Section 4.1, we aim to make inference for δi=𝕀{βi(1)βi(2)}\delta_{i}=\mathbb{I}\{\beta_{i}^{\scriptscriptstyle(1)}\neq\beta_{i}^{\scriptscriptstyle(2)}\}, i=2,,p+1i=2,\dots,p+1. Following Algorithm 1, we first summarize the data by a single vector of test statistics {N2,,Np+1}\{N_{2},\cdots,N_{p+1}\} and then choose a significance threshold to control the multiplicity. However, such an approach ignores the important feature that both objects β(1)\beta^{\scriptscriptstyle(1)} and β(2)\beta^{\scriptscriptstyle(2)} are individually sparse. Let d={i=2,,p+1:βi(d)0}\mathcal{I}_{d}=\{i=2,\dots,p+1:\beta_{i}^{\scriptscriptstyle(d)}\neq 0\} denote the support of β(d)\beta^{\scriptscriptstyle(d)}, d=1,2d=1,2, and =12\mathcal{I}=\mathcal{I}_{1}\cup\mathcal{I}_{2} the union support. Because the small cardinality of \mathcal{I} implies that both β(1)\beta^{\scriptscriptstyle(1)} and β(2)\beta^{\scriptscriptstyle(2)} are sparse, the information on \mathcal{I} can potentially be utilized to narrow down the alternatives via the logical relationship that ii\notin\mathcal{I} implies that δi=0\delta_{i}=0.

The goal of xia2020gap is to incorporate the sparsity information to improve the testing efficiency, and it is accomplished via the construction of an additional covariate sequence {Si:i=2,,p+1}\{S_{i}:i=2,\dots,p+1\} to capture the information on \mathcal{I}. Note that SiS_{i} and NiN_{i} have different roles: NiN_{i} is the primary statistic to evaluate the significance of the test, while SiS_{i} is the auxiliary one that captures the sparsity information to assist the inference. It is also important to note that SiS_{i} should be asymptotically independent of NiN_{i} so that the null distribution of NiN_{i} would not be distorted by the incorporation of SiS_{i}. For the two-sample problem in Section 4.1, such auxiliary statistics can be constructed by

Si=r^i(1)/σ^i,12+(θ^i(1)/θ^i(2))(r^i(2)/σ^i,22){θ^i(1)(1+θ^i(1)/θ^i(2))}1/2,i=2,,p+1.\small S_{i}=\frac{{\widehat{r}_{i}^{\scriptscriptstyle(1)}}/{\widehat{\sigma}_{{i,1}}^{2}}+(\widehat{\theta}_{i}^{\scriptscriptstyle(1)}/\widehat{\theta}_{i}^{\scriptscriptstyle(2)})({\widehat{r}_{i}^{\scriptscriptstyle(2)}}/{\widehat{\sigma}_{{i,2}}^{2}})}{\{{\widehat{\theta}_{i}^{\scriptscriptstyle(1)}(1+\widehat{\theta}_{i}^{\scriptscriptstyle(1)}/\widehat{\theta}_{i}^{\scriptscriptstyle(2)})}\}^{1/2}},\quad i=2,\dots,p+1. (57)

Then based on the pairs of statistics {(Ni,Si):i=2,,p+1}\{(N_{i},S_{i}):i=2,\dots,p+1\}, the proposal in xia2020gap operates in three steps: grouping, adjusting and pooling (GAP). The first step divides all tests into KK groups based on SiS_{i}, which leads to heterogeneous groups with varied sparsity levels. The second step adjusts the pp-values to incorporate the sparsity information. The final step combines the adjusted pp-values and chooses a threshold to control the global FDR. Based on the pp-values obtained by the test statistics NiN_{i}’s, i.e., pi=2{1Φ(Ni)}p_{i}=2\{1-\Phi(N_{i})\}, the algorithm is summarized in Algorithm 2, and we refer to xia2020gap for implementation details such as the choice of the number of groups and the grid sets.

Algorithm 2 Multiple testing via grouping, adjusting and pooling (GAP).

Step 1 (Grouping). Divide hypotheses into KK groups: 𝒢l={i=2,,p+1:λl1<Siλl}\mathcal{G}_{l}=\{i=2,\dots,p+1:\lambda_{l-1}<S_{i}\leq\lambda_{l}\}, for 1lK1\leq l\leq K. The optimal choice of grouping will be determined in Step 2.

Step 2 (Adjusting). Define ml=|𝒢l|m_{l}=|\mathcal{G}_{l}|. Calculate adjusted pp-values piw=min{pi/wlo,1}p_{i}^{w}=\min\{p_{i}/w_{l}^{o},1\} if i𝒢li\in\mathcal{G}_{l}, 1lK1\leq l\leq K, where pi=2{1Φ(Ni)}p_{i}=2\{1-\Phi(N_{i})\} and wlow_{l}^{o} is computed as follows.

  • Initial adjusting. For a given grouping {𝒢l:1lK}\{\mathcal{G}_{l}:1\leq l\leq K\}, let π^l\widehat{\pi}_{l} be the estimated proportion of non-nulls in 𝒢l\mathcal{G}_{l}. Compute the group-wise weights

    wl={l=1Kmlπ^l1π^l}1pπ^l(1π^l), 1lK.\small w_{l}=\left\{\sum_{l=1}^{K}\frac{m_{l}\widehat{\pi}_{l}}{1-\widehat{\pi}_{l}}\right\}^{-1}\frac{p\widehat{\pi}_{l}}{(1-\widehat{\pi}_{l})},\;1\leq l\leq K. (58)

    Define piw=min{pi/wl,1}p_{i}^{w}=\min\{p_{i}/w_{l},1\} for i𝒢li\in\mathcal{G}_{l}.

  • Further refining. For each Λ={λl:1lK1}\Lambda=\{\lambda_{l}:1\leq l\leq K-1\} (allowed to be empty), let

    k=max{j:p(j)wjα/p},\small k=\max\{j:p^{w}_{(j)}\leq{j}\alpha/p\}, (59)

    and reject the hypotheses corresponding to {p(1)w,,p(k)w}\{p^{w}_{(1)},\ldots,p^{w}_{(k)}\}, where p(1)wp(p)wp^{w}_{(1)}\leq\cdots\leq p^{w}_{(p)} are the re-ordered pp-values. The weights wlow_{l}^{o} are computed using (58) based on the optimal grouping that yields the most rejections.

Step 3 (Pooling). Combine the piwp_{i}^{w}’s computed from Step 2 based on the optimal grouping, apply (59) again, and output the rejections.
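To make the adjusting and pooling steps concrete, the R sketch below carries out the weighting (58) and the step-up rule (59) for one fixed grouping. The search over candidate groupings and the estimation of the non-null proportions π̂_l, which xia2020gap describes in detail, are assumed to have been carried out already; all object names are illustrative.

# Sketch of the weighted adjust-and-pool steps of GAP for one fixed grouping.
# p_vals: unadjusted p-values p_i = 2{1 - Phi(N_i)}; group: labels in 1..K;
# pi_hat: estimated non-null proportions per group (assumed strictly in (0,1)).
gap_adjust_pool <- function(p_vals, group, pi_hat, alpha) {
  p <- length(p_vals)
  m <- tabulate(group)                                    # group sizes m_l
  w <- (p * pi_hat / (1 - pi_hat)) /
       sum(m * pi_hat / (1 - pi_hat))                     # group-wise weights (58)
  p_w <- pmin(p_vals / w[group], 1)                       # adjusted p-values
  ord <- order(p_w)
  k   <- max(c(0, which(p_w[ord] <= (1:p) * alpha / p)))  # step-up rule (59)
  if (k == 0) integer(0) else ord[1:k]                    # indices of rejections
}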

In addition, xia2020gap provided some insights on why the GAP algorithm works. First, it adaptively chooses the group-wise FDR levels via adjusted pp-values and effectively incorporates group-wise information. Intuitively, Algorithm 2 increases the overall power by assigning higher FDR levels to groups where signals are more common. It does not assume known groups and searches for the optimal grouping to maximize the power. Moreover, the construction of the weights in Algorithm 2 ensures that after all groups are combined, the weights are always “proper” in the sense of genovese2006false . As a result, even inaccurate estimates of the non-null proportions would not affect the validity of overall FDR control.

As with Algorithm 1, xia2020gap established asymptotic error rate control for Algorithm 2, namely, lim(n,p){FDPAlg2α+ϵ}=1\lim_{(n,p)\rightarrow\infty}{\mathbb{P}}\{\text{FDP}_{\text{Alg2}}\leq\alpha+\epsilon\}=1 for any ϵ>0\epsilon>0 and lim¯(n,p)FDRAlg2α\mathop{\overline{\rm lim}}_{(n,p)\rightarrow\infty}\text{FDR}_{\text{Alg2}}\leq\alpha. Moreover, due to the informative weights (58) that effectively incorporate the sparsity information through {Si:i=2,,p+1}\{S_{i}:i=2,\dots,p+1\}, Algorithm 2 dominates Algorithm 1 in power asymptotically. Specifically, xia2020gap provided a rigorous theoretical comparison showing that ΨAlg2ΨAlg1+o(1)\Psi_{\rm Alg2}\geq\Psi_{\rm Alg1}+o(1) as pp\rightarrow\infty, where ΨAlg1\Psi_{\rm Alg1} and ΨAlg2\Psi_{\rm Alg2} represent the expectations of the proportions of correct rejections among all alternative hypotheses for Algorithms 1 and 2, respectively.

5 Discussion

In this expository paper, we provided a review of methods and theoretical results on statistical inference and multiple testing for high-dimensional regression models, including linear and logistic regression. Due to limited space, we were unable to discuss a number of related inference problems. In this section, we briefly mention a few of them.

Accuracy assessment. Accuracy assessment is a crucial part of high-dimensional uncertainty quantification. Its goal is to evaluate the estimation accuracy of an estimator β^\widehat{\beta}. For linear regression, cai2018accuracy considered a collection of estimators β^\widehat{\beta} and established the minimaxity and adaptivity of the point estimation and confidence interval for β^βqq\|\widehat{\beta}-\beta\|_{q}^{q} with 1q2.1\leq q\leq 2. Suppose that β^\widehat{\beta} is independent of the data {Xk,,Yk}1kn\{X_{k,\cdot},Y_{k}\}_{1\leq k\leq n} (which can be achieved by sample splitting, for example). Define the residual Rek=YkXk,β^=Xk,(ββ^)+ϵkRe_{k}=Y_{k}-X_{k,\cdot}^{\intercal}\widehat{\beta}=X_{k,\cdot}^{\intercal}(\beta-\widehat{\beta})+\epsilon_{k}. The quadratic functional inference approach introduced in cai2020semisupervised can be applied to the data {Rek,Xk,}1kn\{Re_{k},X_{k,\cdot}\}_{1\leq k\leq n} to make inference for both (β^β)Σ(β^β)(\widehat{\beta}-\beta)^{\intercal}\Sigma(\widehat{\beta}-\beta) and β^β22\|\widehat{\beta}-\beta\|_{2}^{2}; see more details in Section 5.2 of cai2020semisupervised . nickl2013confidence used an estimator of β^β22\|\widehat{\beta}-\beta\|_{2}^{2} to construct a confidence set for β\beta with β^\widehat{\beta} denoting an accurate high-dimensional estimator. In the context of approximate message passing, donoho2011noise ; bayati2011lasso considered a different framework with n/(p+1)δ(0,1)n/(p+1)\rightarrow\delta\in(0,1) and independent Gaussian design and established the asymptotic limit of β^β22/(p+1)\|\widehat{\beta}-\beta\|_{2}^{2}/(p+1) for the Lasso estimator β^.\widehat{\beta}.

Semi-supervised Inference. Semi-supervised inference is well motivated by a wide range of modern applications, such as Electronic Health Record data analysis. In addition to the labelled data, additional covariate observations exist in the semi-supervised setting, and it is critical to leverage the information in the unlabelled data to improve inference efficiency. As reviewed in Section 3.5, the additional unlabelled data improves the accuracy of various high-dimensional inference procedures by facilitating the estimation of Σ\Sigma or Σ1\Sigma^{-1} javanmard2018debiasing ; cai2020semisupervised . Moreover, in a different context where the linear outcome model might be misspecified, one active research direction in semi-supervised learning is to construct a more complicated imputation model (e.g., by applying classical non-parametric or machine learning methods) and conduct a follow-up bias correction after outcome imputation (e.g., zhang2019semi, ; chakrabortty2018efficient, ; zhang2021double, ; deng2020optimal, ; hou2021surrogate, ).

Applications of quadratic form inference. The statistical inference methods for quadratic functionals in Section 3.4 and inner products in Section 3.6 are useful in a wide range of statistical applications. Firstly, the group significance test of βG22=0\|\beta_{G}\|_{2}^{2}=0 or βGΣG,GβG=0\beta_{G}^{\intercal}\Sigma_{G,G}\beta_{G}=0 forms an important basis for designing computationally efficient hierarchical testing approaches mandozzi2016hierarchical ; guo2021group . Secondly, consider the high-dimensional interaction model Yk=Akη+Xk,τ+AkXk,γ+ϵkY_{k}=A_{k}\eta+X_{k,\cdot}^{\intercal}\tau+A_{k}\cdot X_{k,\cdot}^{\intercal}\gamma+\epsilon_{k} with AkA_{k} denoting the variable of interest (e.g., the treatment) and Xk,X_{k,\cdot} denoting a large number of other variables. In this model, testing the existence of the interaction term can be reduced to inference for γ22.\|\gamma\|_{2}^{2}. Lastly, in the two-sample model (2), the methods proposed in Sections 3.4 and 3.6 have been applied in guo2023robust to calculate the difference β(1)β(2)22\|\beta^{\scriptscriptstyle(1)}-\beta^{\scriptscriptstyle(2)}\|_{2}^{2}, which can be written as β(1)22+β(2)222[β(1)]β(2).\|\beta^{\scriptscriptstyle(1)}\|_{2}^{2}+\|\beta^{\scriptscriptstyle(2)}\|_{2}^{2}-2[\beta^{\scriptscriptstyle(1)}]^{\intercal}\beta^{\scriptscriptstyle(2)}.

Multiple heterogeneous regression models. It is essential to perform efficient integrative inference in various applications that combine multiple regression models, including transfer learning, distributed learning, federated learning, and distributionally robust learning. Transfer learning provides a powerful tool for incorporating data from related studies to improve estimation and inference accuracy in a target study of direct interest. li2021transfer studied transfer learning for high-dimensional linear regression, established minimax optimal convergence rates, and proposed data-driven adaptive algorithms. Li2021Transfer-GLM ; tian2022transfer explored transfer learning in high-dimensional GLMs, and battey2018distributed ; lee2017communication considered distributed learning for high-dimensional regression. Additionally, liu2021integrative ; cai2021individual proposed integrative estimation and multiple testing procedures for high-dimensional regression models across sites that simultaneously accommodate between-study heterogeneity and protect individual-level data. In a separate direction, meinshausen2015maximin proposed the maximin effect as a robust prediction model when the target distribution is generated as a mixture of multiple source populations. guo2020inference further established that the maximin effect is a group distributionally robust model and studied statistical inference for the maximin effect in high dimensions. Many questions about multiple heterogeneous regression models remain open and warrant future research.

Other simultaneous inference problems. The principles of the multiple testing procedures reviewed in Section 4 are also widely applicable to a range of simultaneous inference problems, including graph learning (e.g., liu2013ggm ; xia2017hypothesis ; xia2018multiple ; cai2019differential ), differential network recovery (e.g., xia2015testing ; xia2019matrix ; kim2021two ), as well as various structured regression analyses (e.g., zhang2020estimation ; li2021inference ; sun2022decorrelating ). For example, using regression-based bias-correction techniques similar to those reviewed in Section 4.1.1, liu2013ggm studied the estimation of Gaussian graphical models with FDR control; xia2018multiple focused on the recovery of sub-networks in Gaussian graphs; and cai2019differential identified significant communities for compositional data. Additionally, xia2015testing ; xia2019matrix proposed multiple testing procedures for differential network detection of vector-valued and matrix-valued observations, respectively. Multiple testing of structured regression models, such as mixed-effects models and confounded linear models, has also been extensively studied in the literature (li2021inference ; sun2022decorrelating ).

Alternative multiple testing methods for regression models. Besides the bias-correction based multiple testing approaches reviewed in this article, there are a few alternative classes of methods that aim at finite-sample FDR control for linear models, including knockoff-based methods, mirror-statistics approaches, and $e$-value proposals. In particular, barber2015controlling constructed a set of knockoff variables and selected those predictors that have considerably higher importance scores than their knockoff counterparts; candes2018panning extended this work to the model-X knockoff framework, which allows the conditional distribution of the response to be unknown. There are several generalizations along this line of research; see barber2020robust ; ren2021derandomizing and many references therein. Inspired by the knockoff idea, du2021false proposed a symmetrized data aggregation approach to build mirror statistics that incorporate the data dependence structure; a general mirror-statistics framework for GLMs was studied in dai2023scale and the references therein. The $e$-value based proposal (vovk2021values ) is another useful tool for multiple testing in general. wang2020false proposed the e-BH procedure, which achieves FDR control under arbitrary dependence among the $e$-values, and the equivalence between the knockoffs and e-BH was studied in ren2022derandomized .
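For concreteness, the sketch below implements the e-BH rule as it is commonly stated: with $e$-values $e_{1},\ldots,e_{m}$ ordered as $e_{(1)}\geq\cdots\geq e_{(m)}$, reject the hypotheses with the $k^{*}$ largest $e$-values, where $k^{*}=\max\{k:\ e_{(k)}\geq m/(\alpha k)\}$. It is a minimal illustration under this stated rule rather than the implementation of wang2020false, and the function name ebh is hypothetical.

import numpy as np

# Minimal sketch of the e-BH rule: reject the k* hypotheses with the largest
# e-values, where k* = max{ k : e_(k) >= m / (alpha * k) }.
def ebh(evalues, alpha=0.1):
    e = np.asarray(evalues, dtype=float)
    m = e.size
    order = np.argsort(-e)                        # indices sorted by decreasing e-value
    thresholds = m / (alpha * np.arange(1, m + 1))
    passed = np.nonzero(e[order] >= thresholds)[0]
    if passed.size == 0:
        return np.array([], dtype=int)            # no rejections
    k_star = passed.max() + 1
    return np.sort(order[:k_star])                # indices of rejected hypotheses

rejections = ebh([25.0, 0.8, 60.0, 3.0, 1.2], alpha=0.1)   # rejects hypotheses 0 and 2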

References

  • (1) Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Statist. Soc. B 58(1), 267–288 (1996)
  • (2) Candes, E., Tao, T.: The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. Ann. Statist. 35(6), 2313–2351 (2007)
  • (3) Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Statist. Soc. B 67(2), 301–320 (2005)
  • (4) Zou, H.: The adaptive Lasso and its oracle properties. J. Am. Statist. Assoc. 101(476), 1418–1429 (2006)
  • (5) Bickel, P.J., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37(4), 1705–1732 (2009)
  • (6) Bühlmann, P., van de Geer, S.: Statistics for High-dimensional Data: Methods, Theory and Applications. Springer, New York (2011)
  • (7) Negahban, S., Yu, B., Wainwright, M.J., Ravikumar, P.K.: A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers. In: Adv. Neural Inf. Process. Syst., pp. 1348–1356 (2009)
  • (8) van de Geer, S.A., Bühlmann, P.: On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3, 1360–1392 (2009)
  • (9) Huang, J., Zhang, C.-H.: Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications. J. Mach. Learn. Res. 13(Jun), 1839–1864 (2012)
  • (10) Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc. 96(456), 1348–1360 (2001)
  • (11) Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38(2), 894–942 (2010)
  • (12) Belloni, A., Chernozhukov, V., Wang, L.: Square-root Lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98(4), 791–806 (2011)
  • (13) Sun, T., Zhang, C.-H.: Scaled sparse linear regression. Biometrika 101(2), 269–284 (2012)
  • (14) Bunea, F.: Honest variable selection in linear and logistic regression models via $\ell_{1}$ and $\ell_{1}+\ell_{2}$ penalization. Electron. J. Stat. 2, 1153–1194 (2008)
  • (15) Bach, F.: Self-concordant analysis for logistic regression. Electron. J. Stat. 4, 384–414 (2010)
  • (16) Meier, L., van de Geer, S., Bühlmann, P.: The group Lasso for logistic regression. J. R. Statist. Soc. B 70(1), 53–71 (2008)
  • (17) Bellec, P.C., Lecué, G., Tsybakov, A.B.: Slope meets Lasso: improved oracle bounds and optimality. Ann. Statist. 46(6B), 3603–3642 (2018)
  • (18) Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1 (2010)
  • (19) Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Statist. 32(2), 407–499 (2004)
  • (20) Greenshtein, E., Ritov, Y.: Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10(6), 971–988 (2004)
  • (21) van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R.: On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 42(3), 1166–1202 (2014)
  • (22) Javanmard, A., Montanari, A.: Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res. 15(1), 2869–2909 (2014)
  • (23) Zhang, C.-H., Zhang, S.S.: Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Statist. Soc. B 76(1), 217–242 (2014)
  • (24) Guo, Z., Rakshit, P., Herman, D.S., Chen, J.: Inference for the case probability in high-dimensional logistic regression. J. Mach. Learn. Res. 22(1), 11480–11533 (2021)
  • (25) Guo, Z., Wang, W., Cai, T.T., Li, H.: Optimal estimation of genetic relatedness in high-dimensional linear models. J. Am. Statist. Assoc. 114(525), 358–369 (2019)
  • (26) Guo, Z., Renaux, C., Bühlmann, P., Cai, T.T.: Group inference in high dimensions with applications to hierarchical testing. Electron. J. Stat. 15(2), 6633–6676 (2021)
  • (27) Cai, T.T., Guo, Z.: Semisupervised inference for explained variance in high dimensional linear regression and its applications. J. R. Statist. Soc. B 82(2), 391–419 (2020)
  • (28) Ma, R., Guo, Z., Cai, T.T., Li, H.: Statistical inference for genetic relatedness based on high-dimensional logistic regression. arXiv preprint arXiv:2202.10007 (2022)
  • (29) Cai, T.T., Guo, Z.: Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity. Ann. Statist. 45(2), 615–646 (2017)
  • (30) Athey, S., Imbens, G.W., Wager, S.: Approximate residual balancing: debiased inference of average treatment effects in high dimensions. J. R. Statist. Soc. B 80(4), 597–623 (2018)
  • (31) Cai, T.T., Guo, Z., Ma, R.: Statistical inference for high-dimensional generalized linear models with binary outcomes. J. Am. Statist. Assoc. 116, 1–14 (2021)
  • (32) Ning, Y., Liu, H.: A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Statist. 45(1), 158–195 (2017)
  • (33) Ren, Z., Sun, T., Zhang, C.-H., Zhou, H.H.: Asymptotic normality and optimalities in estimation of large Gaussian graphical models. Ann. Statist. 43(3), 991–1026 (2015)
  • (34) Yu, Y., Bradic, J., Samworth, R.J.: Confidence intervals for high-dimensional Cox models. arXiv preprint arXiv:1803.01150 (2018)
  • (35) Fang, E.X., Ning, Y., Li, R.: Test of significance for high-dimensional longitudinal data. Ann. Statist. 48(5), 2622 (2020)
  • (36) Zhou, R.R., Wang, L., Zhao, S.D.: Estimation and inference for the indirect effect in high-dimensional linear mediation models. Biometrika 107(3), 573–589 (2020)
  • (37) Zhao, T., Kolar, M., Liu, H.: A general framework for robust testing and confidence regions in high-dimensional quantile regression. arXiv preprint arXiv:1412.8724 (2014)
  • (38) Javanmard, A., Lee, J.D.: A flexible framework for hypothesis testing in high dimensions. J. R. Statist. Soc. B 82(3), 685–718 (2020)
  • (39) Chen, Y., Fan, J., Ma, C., Yan, Y.: Inference and uncertainty quantification for noisy matrix completion. Proc. Natl. Acad. Sci. 116(46), 22931–22937 (2019)
  • (40) Zhu, Y., Bradic, J.: Linear hypothesis testing in dense high-dimensional linear models. J. Am. Statist. Assoc. 113(524), 1583–1600 (2018)
  • (41) Guo, Z., Kang, H., Cai, T.T., Small, D.S.: Testing endogeneity with high dimensional covariates. J. Econom. 207(1), 175–187 (2018)
  • (42) Cai, T.T., Zhang, L.: High-dimensional Gaussian copula regression: Adaptive estimation and statistical inference. Stat. Sin., 963–993 (2018)
  • (43) Eftekhari, H., Banerjee, M., Ritov, Y.: Inference in high-dimensional single-index models under symmetric designs. J. Mach. Learn. Res. 22, 27–1 (2021)
  • (44) Deshpande, Y., Mackey, L., Syrgkanis, V., Taddy, M.: Accurate inference for adaptive linear models. In: International Conference on Machine Learning, pp. 1194–1203 (2018). PMLR
  • (45) Fang, E.X., Ning, Y., Liu, H.: Testing and confidence intervals for high dimensional proportional hazards models. J. R. Statist. Soc. B 79(5), 1415–1437 (2017)
  • (46) Neykov, M., Ning, Y., Liu, J.S., Liu, H.: A unified theory of confidence regions and testing for high-dimensional estimating equations. Stat. Sci. 33(3), 427–443 (2018)
  • (47) Dezeure, R., Bühlmann, P., Meier, L., Meinshausen, N.: High-dimensional inference: confidence intervals, $p$-values and R-software hdi. Stat. Sci., 533–558 (2015)
  • (48) Belloni, A., Chernozhukov, V., Hansen, C.: Inference on treatment effects after selection among high-dimensional controls. Rev. Econ. Stud. 81(2), 608–650 (2014)
  • (49) Chernozhukov, V., Hansen, C., Spindler, M.: Valid post-selection and post-regularization inference: An elementary, general approach. Annu. Rev. Econom. 7(1), 649–688 (2015)
  • (50) Farrell, M.H.: Robust inference on average treatment effects with possibly more covariates than observations. J. Econom. 189(1), 1–23 (2015)
  • (51) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., Robins, J.: Double/debiased machine learning for treatment and structural parameters: Double/debiased machine learning. Econom. J. 21(1), 1–68 (2018)
  • (52) Belloni, A., Chernozhukov, V., Fernández-Val, I., Hansen, C.: Program evaluation and causal inference with high-dimensional data. Econometrica 85(1), 233–298 (2017)
  • (53) Cai, T., Cai, T.T., Guo, Z.: Optimal statistical inference for individualized treatment effects in high-dimensional models. J. R. Statist. Soc. B 83(4), 669–719 (2021)
  • (54) Rakshit, P., Cai, T.T., Guo, Z.: SIHR: An R package for statistical inference in high-dimensional linear and logistic regression models. arXiv preprint arXiv:2109.03365 (2021)
  • (55) Dezeure, R., Bühlmann, P., Zhang, C.-H.: High-dimensional simultaneous inference with the bootstrap. Test 26(4), 685–719 (2017)
  • (56) Zhang, X., Cheng, G.: Simultaneous inference for high-dimensional linear models. J. Am. Statist. Assoc. 112(518), 757–768 (2017)
  • (57) Ma, R., Cai, T.T., Li, H.: Global and simultaneous hypothesis testing for high-dimensional logistic regression models. J. Am. Statist. Assoc. 116(534), 984–998 (2021)
  • (58) Liu, W., Luo, S.: Hypothesis testing for high-dimensional regression models. Technical report (2014)
  • (59) Xia, Y., Cai, T., Cai, T.T.: Two-sample tests for high-dimensional linear regression with an application to detecting interactions. Stat. Sin. 28, 63–92 (2018)
  • (60) Xia, Y., Cai, T.T., Li, H.: Joint testing and false discovery rate control in high-dimensional multivariate regression. Biometrika 105(2), 249–269 (2018)
  • (61) Benjamini, Y., Hochberg, Y.: Multiple hypotheses testing with weights. Scand. J. Stat. 24(3), 407–418 (1997)
  • (62) Storey, J.D.: A direct approach to false discovery rates. J. Roy. Statist. Soc. B 64(3), 479–498 (2002)
  • (63) Genovese, C.R., Roeder, K., Wasserman, L.: False discovery control with $p$-value weighting. Biometrika 93(3), 509–524 (2006)
  • (64) Roeder, K., Wasserman, L.: Genome-wide significance levels and weighted hypothesis testing. Stat. Sci. 24(4), 398 (2009)
  • (65) Ignatiadis, N., Klaus, B., Zaugg, J.B., Huber, W.: Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat. Methods 13(7), 577–580 (2016)
  • (66) Lei, L., Fithian, W.: Adapt: an interactive procedure for multiple testing with side information. J. R. Statist. Soc. B 80(4), 649–679 (2018)
  • (67) Li, A., Barber, R.F.: Multiple testing with the structure-adaptive Benjamini-Hochberg algorithm. J. R. Statist. Soc. B 81(1), 45–74 (2019)
  • (68) Cai, T.T., Sun, W., Xia, Y.: LAWS: A locally adaptive weighting and screening approach to spatial multiple testing. J. Am. Statist. Assoc. 117, 1370–1383 (2022)
  • (69) Fithian, W., Lei, L.: Conditional calibration for false discovery rate control under dependence. Ann. Statist. 50(6), 3091–3118 (2022)
  • (70) Xia, Y., Cai, T.T., Sun, W.: GAP: A General Framework for Information Pooling in Two-Sample Sparse Inference. J. Am. Statist. Assoc. 115(531), 1236–1250 (2020)
  • (71) Liu, M., Xia, Y., Cho, K., Cai, T.: Integrative high dimensional multiple testing with heterogeneity under data sharing constraints. J. Mach. Learn. Res. 22, 126–1 (2021)
  • (72) Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34(3), 1436–1462 (2006)
  • (73) Zhao, P., Yu, B.: On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006)
  • (74) Wainwright, M.J.: Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$-constrained quadratic programming (Lasso). IEEE Trans. Inf. Theory 55(5), 2183–2202 (2009)
  • (75) Javanmard, A., Montanari, A.: Debiasing the Lasso: Optimal sample size for gaussian designs. Ann. Statist. 46(6A), 2593–2622 (2018)
  • (76) Guo, Z., Ćevid, D., Bühlmann, P.: Doubly debiased Lasso: High-dimensional inference under hidden confounding. Ann. Statist. 50(3), 1320–1347 (2022)
  • (77) Sun, Y., Ma, L., Xia, Y.: A decorrelating and debiasing approach to simultaneous inference for high-dimensional confounded models. arXiv preprint arXiv:2208.08754 (2022)
  • (78) Guo, Z., Yuan, W., Zhang, C.-H.: Decorrelated local linear estimator: Inference for non-linear effects in high-dimensional additive models. arXiv preprint arXiv:1907.12732 (2019)
  • (79) Collier, O., Comminges, L., Tsybakov, A.B.: Minimax estimation of linear and quadratic functionals on sparsity classes. Ann. Statist. 45(3), 923–958 (2017)
  • (80) Guo, Z.: Inference for high-dimensional maximin effects in heterogeneous regression models using a sampling approach. arXiv preprint arXiv:2011.07568 (2020)
  • (81) Zhang, T.: Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Trans. Inf. Theory 57(7), 4689–4708 (2011)
  • (82) Hunter, D.J.: Gene–environment interactions in human diseases. Nat. Rev. Genet. 6(4), 287–298 (2005)
  • (83) Zhou, J.J., Cho, M.H., Lange, C., Lutz, S., Silverman, E.K., Laird, N.M.: Integrating multiple correlated phenotypes for genetic association analysis by maximizing heritability. Hum. Hered. 79, 93–104 (2015)
  • (84) Schifano, E.D., Li, L., Christiani, D.C., Lin, X.: Genome-wide association analysis for multiple continuous secondary phenotypes. Am. J. Hum. Genet., 744–759 (2013)
  • (85) Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B 68(1), 49–67 (2006)
  • (86) Lounici, K., Pontil, M., van de Geer, S., Tsybakov, A.B., et al.: Oracle inequalities and optimal inference under group sparsity. Ann. Statist. 39(4), 2164–2204 (2011)
  • (87) Ren, Z., Zhang, C.-H., Zhou, H.: Asymptotic normality in estimation of large Ising graphical model. Unpublished Manuscript (2016)
  • (88) Cai, T.T., Li, H., Ma, J., Xia, Y.: Differential markov random field analysis with an application to detecting differential microbial community networks. Biometrika 106(2), 401–416 (2019)
  • (89) Xia, Y., Cai, T., Cai, T.T.: Multiple testing of submatrices of a precision matrix with applications to identification of between pathway interactions. J. Am. Statist. Assoc. 113(521), 328–339 (2018)
  • (90) Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B 57, 289–300 (1995)
  • (91) Liang, Z., Cai, T.T., Sun, W., Xia, Y.: Locally adaptive transfer learning algorithms for large-scale multiple testing. arXiv preprint arXiv:2203.11461 (2022)
  • (92) Cai, T., Liu, M., Xia, Y.: Individual data protected integrative regression analysis of high-dimensional heterogeneous data. J. Am. Statist. Assoc. 116, 1–15 (2021)
  • (93) Cai, T.T., Guo, Z.: Accuracy assessment for high-dimensional linear regression. Ann. Statist. 46(4), 1807–1836 (2018)
  • (94) Nickl, R., van de Geer, S.: Confidence sets in sparse regression. Ann. Statist. 41(6), 2852–2876 (2013)
  • (95) Donoho, D.L., Maleki, A., Montanari, A.: The noise-sensitivity phase transition in compressed sensing. IEEE Trans. Inf. Theory 57(10), 6920–6941 (2011)
  • (96) Bayati, M., Montanari, A.: The Lasso risk for Gaussian matrices. IEEE Trans. Inf. Theory 58(4), 1997–2017 (2011)
  • (97) Zhang, A., Brown, L.D., Cai, T.T.: Semi-supervised inference: General theory and estimation of means. Ann. Statist. 47(5), 2538–2566 (2019)
  • (98) Chakrabortty, A., Cai, T.: Efficient and adaptive linear regression in semi-supervised settings. Ann. Statist. 46(4), 1541–1572 (2018)
  • (99) Zhang, Y., Chakrabortty, A., Bradic, J.: Double robust semi-supervised inference for the mean: Selection bias under MAR labeling with decaying overlap. arXiv preprint arXiv:2104.06667 (2021)
  • (100) Deng, S., Ning, Y., Zhao, J., Zhang, H.: Optimal semi-supervised estimation and inference for high-dimensional linear regression. arXiv preprint arXiv:2011.14185 (2020)
  • (101) Hou, J., Guo, Z., Cai, T.: Surrogate assisted semi-supervised inference for high dimensional risk prediction. arXiv preprint arXiv:2105.01264 (2021)
  • (102) Mandozzi, J., Bühlmann, P.: Hierarchical testing in the high-dimensional setting with correlated variables. J. Am. Statist. Assoc. 111(513), 331–343 (2016)
  • (103) Guo, Z., Li, X., Han, L., Cai, T.: Robust inference for federated meta-learning. arXiv preprint arXiv:2301.00718 (2023)
  • (104) Li, S., Cai, T.T., Li, H.: Transfer learning for high-dimensional linear regression: Prediction, estimation and minimax optimality. J. R. Statist. Soc. B 84(1), 149–173 (2021)
  • (105) Li, S., Zhang, L., Cai, T.T., Li, H.: Estimation and inference for high-dimensional generalized linear models with knowledge transfer. Technical Report (2021)
  • (106) Tian, Y., Feng, Y.: Transfer learning under high-dimensional generalized linear models. J. Am. Statist. Assoc. 117(just-accepted), 1–30 (2022)
  • (107) Battey, H., Fan, J., Liu, H., Lu, J., Zhu, Z.: Distributed testing and estimation under sparse high dimensional models. Ann. Statist. 46(3), 1352 (2018)
  • (108) Lee, J.D., Liu, Q., Sun, Y., Taylor, J.E.: Communication-efficient sparse regression. J. Mach. Learn. Res. 18(1), 115–144 (2017)
  • (109) Meinshausen, N., Bühlmann, P.: Maximin effects in inhomogeneous large-scale data. Ann. Statist. 43(4), 1801–1830 (2015)
  • (110) Liu, W.: Gaussian graphical model estimation with false discovery rate control. Ann. Statist. 41(6), 2948–2978 (2013)
  • (111) Xia, Y., Li, L.: Hypothesis testing of matrix graph model with application to brain connectivity analysis. Biometrics 73(3), 780–791 (2017)
  • (112) Xia, Y., Cai, T., Cai, T.T.: Testing differential networks with applications to the detection of gene-gene interactions. Biometrika 102(2), 247–266 (2015)
  • (113) Xia, Y., Li, L.: Matrix graph hypothesis testing and application in brain connectivity alternation detection. Stat. Sin. 29(1), 303–328 (2019)
  • (114) Kim, B., Liu, S., Kolar, M.: Two-sample inference for high-dimensional Markov networks. J. R. Statist. Soc. B (2021)
  • (115) Zhang, L., Ma, R., Cai, T.T., Li, H.: Estimation, confidence intervals, and large-scale hypotheses testing for high-dimensional mixed linear regression. arXiv preprint arXiv:2011.03598 (2020)
  • (116) Li, S., Cai, T.T., Li, H.: Inference for high-dimensional linear mixed-effects models: A quasi-likelihood approach. J. Am. Statist. Assoc. 116, 1–12 (2021)
  • (117) Barber, R.F., Candès, E.J.: Controlling the false discovery rate via knockoffs. Ann. Statist. 43(5), 2055–2085 (2015)
  • (118) Candes, E., Fan, Y., Janson, L., Lv, J.: Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J. R. Statist. Soc. B 80(3), 551–577 (2018)
  • (119) Barber, R.F., Candès, E.J., Samworth, R.J.: Robust inference with knockoffs. Ann. Statist. 48(3), 1409–1431 (2020)
  • (120) Ren, Z., Wei, Y., Candès, E.: Derandomizing knockoffs. J. Am. Statist. Assoc. 116, 1–11 (2021)
  • (121) Du, L., Guo, X., Sun, W., Zou, C.: False discovery rate control under general dependence by symmetrized data aggregation. J. Am. Statist. Assoc. 116, 1–15 (2021)
  • (122) Dai, C., Lin, B., Xing, X., Liu, J.S.: A scale-free approach for false discovery rate control in generalized linear models. J. Am. Statist. Assoc. (just-accepted), 1–31 (2023)
  • (123) Vovk, V., Wang, R.: E-values: Calibration, combination and applications. Ann. Statist. 49(3), 1736–1754 (2021)
  • (124) Wang, R., Ramdas, A.: False discovery rate control with e-values. arXiv preprint arXiv:2009.02824 (2020)
  • (125) Ren, Z., Barber, R.F.: Derandomized knockoffs: leveraging e-values for false discovery rate control. arXiv preprint arXiv:2205.15461 (2022)