A New Non-parametric Test for Multivariate Paired Data and Pair Matching

Jingru Zhang
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
Hao Chen
Department of Statistics, University of California, Davis
Xiao-Hua Zhou
Beijing International Center for Mathematical Research and School of Public Health,
Peking University

Abstract

In paired design studies, it is common to have multiple measurements taken for the same set of subjects under different conditions. In observational studies, it is many times of interest to conduct pair matching on multiple covariates between a treatment group and a control group, and to test the treatment effect represented by multiple response variables on well pair-matched data. However, there is a lack of an effective test on multivariate paired data. The multivariate paired Hotelling’s $T^{2}$ test can sometimes be used, but its power decreases fast as the dimension increases. Existing methods for assessing the balance of multiple covariates in matched observational studies usually ignore the paired structure and thus they do not perform well under some settings. In this work, we propose a new non-parametric test for paired data, which exhibits a substantial power improvement over existing methods under a wide range of situations. We also derive the asymptotic distribution of the new test and the approximate $p$ -value is reasonably accurate under finite samples through simulation studies even when the dimension is larger than the sample size, making the new test an easy-off-the-shelf tool for real applications. The proposed test is illustrated through an analysis of a real data set on the Alzheimer’s disease research.

Keywords: Graph-based test; Nonparametric test; Paired-comparison permutation null distribution; Pair matching.

1 Introduction

Paired data are prevalent in paired design studies, where each subject is measured twice at two different time points or under two different situations. In many applications, multiple variables are measured and it is of interest to test whether they are significantly different under the two situations. For example, in studying the Alzheimer’s disease, neuropsychologic evaluation plays a central role in the clinical characterization of the disease (McKhann et al., 1984; Weintraub et al., 2009). It provides confirmatory evidence of the diagnosis of dementia. In a typical study, cognitive performance of participants is evaluated at approximately annual visits. Variables, such as total number of animals named in 60 seconds, orientation subscale score, total number of story units recalled from the current test administration, and so on, are recorded to reflect the performance, and it is of scientific interest to find out whether there is a significant difference in their neuropsychologic measures after several years. In addition, pair-matched data are also frequently encountered in causal inference. For instance, sex differences in the Alzheimer’s disease have aroused more and more attention (Carter et al., 2012; Mazure and Swendsen, 2016). In particular, an important study is to compare neuropsychologic performances between well-matched female participants and male participants. In this case, the two groups are matched well on a few covariates such as age and BMI, and then the test is done on their neuropsychology measurement variables.

For two-sample comparison on paired data, the conventional way is to take the difference between the corresponding values from the two data sets and convert the problem into a one-sample problem. If only one variable is measured, there are many choices, including the paired $t$ -test, sign test, signed-rank test, paired-comparison permutation test, and general scoring systems (Wilcoxon, 1992; Conover, 1998). When more than one variable is measured, the multivariate paired Hotelling’s $T^{2}$ test has been widely applied. However, unless the dimension of the data is very low, this test depends on whether the differences follow a multivariate Gaussian distribution. Also, the power of the test decreases quickly as the dimension increases. This is a common issue for parametric testing procedures as the number of parameters one has to estimate increases quickly as the dimension increases, such as the covariance matrix, unless strong assumptions are imposed. For example, in a real application on the Alzheimer’s disease research, 22 neuropsychology measurement variables are collected (details of the data are provided in Section 4). When the multivariate paired Hotelling’s $T^{2}$ test is applied to test whether there is any difference in neuropsychology measures between the initial visit and the visit after five years for the mild dementia group, its $p$ -value is 0.2121. So the test cannot reject the null hypothesis and concludes no significant difference. However, if applying the paired $t$ -test to each variable, we see that 11 out of 22 $p$ -values of the paired $t$ -test are less than 0.05 and the smallest one is 0.001 (Table 1). So the paired $t$ -test would reject the null hypothesis at 0.05 significance level even by the Bonferroni correction. Hence, the multivariate paired Hotelling’s $T^{2}$ test does not meet the needs for many applications with multiple variables.

Table 1: The

p

-values of applying the paired

t

-test to each of the variables separately for the mild dementia group, the initial visit versus the visit after five years.

Variable	$p$ -value
MMSEORDA	0.068
MMSEORLO	0.173
PENTAGON	0.534
NACCMMSE	0.022
LOGIMEM	0.303
MEMUNITS	0.315
MEMTIME	0.740
DIGIF	0.024
DIGIFLEN	0.014
DIGIB	0.027
DIGIBLEN	0.031
ANIMALS	0.090
VEG	0.039
TRAILA	0.041
TRAILARR	0.071
TRAILALI	0.418
TRAILB	0.005
TRAILBRR	0.227
TRAILBLI	0.013
WAIS	0.001
BOSTON	0.005
COGSTAT	0.643

Paired settings also arise naturally from matched observational studies, where a key question is that whether the covariates of a treated group and its matched control group are well balanced. Some approaches have been proposed to compare the distributions of the observed covariates, such as the method of combined differences (Hansen and Bowers, 2008) and the crossmatch test (Rosenbaum, 2005). More recently, Chen and Small (2020) proposed new tests (CrossNN and CrossMST) based on the similarity information of subjects. However, these existing tests do not fully utilize the paired structure, which makes them less effective under some circumstances as illustrated in our simulations (Section 3.2) and real application (Section 4.2).

In this paper, we propose a new non-parametric framework for paired data that can be applied to multivariate data with the dimension comparable with or possibly higher than the sample size (Section 2). This new non-parametric framework for paired data relies on a similarity graph constructed on the observations. We also derive the asymptotic distribution of the statistic under the paired-comparison permutation null distribution, facilitating its application to large data sets. The performance of the proposed test is examined through extensive simulation studies in Section 3. The new test is illustrated through a real data application on the Alzheimer’s disease research in Section 4. We discuss two other related statistics in Section 5 and conclude in Section 6.

2 A new non-parametric test for paired data based on a similarity graph

In two-sample tests, with $m$ observations in sample 1 and $n$ observations in sample 2, the two-sample permutation framework is to randomly assign $m$ out of the total $m+n$ pooled observations to sample 1 and the rest to sample 2 due to exchangeability of the observations under the null hypothesis of equal distribution. However, for paired data, the observations from the two samples are no longer exchangeable. On the other hand, the corresponding observations (paired observations) are exchangeable under the null hypothesis. Let $(X_{i},Y_{i})$ , $i=1,\ldots,n$ , be the paired data. The paired-comparison permutation null distribution places probability $2^{-n}$ on each of $2^{n}$ choices of “assigning $X_{i}$ to sample 1 and $Y_{i}$ to sample 2” or “assigning $Y_{i}$ to sample 1 and $X_{i}$ to sample 2” for each $i$ . In the following, when there is no further specification, we use pr, $E$ , var and cov to denote probability, expectation, variance and covariance, respectively, under this paired-comparison permutation null distribution.

We first construct a similarity graph on the pool observations. The idea of utilizing a similarity graph has proven to be successful under the two-sample hypothesis testing problems for high-dimensional data (Friedman and Rafsky, 1979; Chen and Friedman, 2017). We use $G$ to denote the similarity graph, such as the minimum spanning tree (MST), which is a spanning tree that connects all observations with the sum of the distances of the edges in the tree minimized. Since the graph-based statistic is usually more powerful under a slightly denser graph (Friedman and Rafsky, 1979), we choose 5-MST as the similarity graph in our simulation studies and real application as recommended by Chen and Friedman (2017). Here, a $k$ -MST is the union of the 1st, $\ldots,k$ th MSTs, where the 1st MST is the MST and the jth ( $j>1$ ) MST is a spanning tree that connects all observations such that the sum of the edges in the tree is minimized under the constraint that it does not contain any edge in the 1st, $\ldots,(j-1)$ th MSTs.

Since the graph is constructed on the pooled observations, it is unchanged under the paired-comparison permutation null distribution. For notation simplicity, we do not include $G$ as a subscript for the quantities depending on $G$ . Let $R_{1}$ be the number of edges in the similarity graph connecting observations from sample 1, and $R_{2}$ be the number of edges connecting observations from sample 2. Here, $R_{1}$ and $R_{2}$ are random variables under the paired-comparison permutation null distribution. The new test statistic is defined as

D=\begin{pmatrix}R_{1}-E(R_{1})\\ R_{2}-E(R_{2})\end{pmatrix}^{T}\mathbf{\Sigma}^{-1}_{R}\begin{pmatrix}R_{1}-E(R_{1})\\ R_{2}-E(R_{2})\end{pmatrix},

(1)

where $\mathbf{\Sigma}_{R}=\text{var}((R_{1},R_{2})^{T})$ . The test is rejected at level $\alpha$ if $D>C(\alpha)$ .

In the following, we derive exact analytic expressions for $E(R_{1})$ , $E(R_{2})$ , $\text{var}(R_{1})$ , $\text{var}(R_{2})$ and $\text{cov}(R_{1},R_{2})$ , so that the proposed test statistic can be computed quickly (Section 2.1), and then discuss how to determine the critical value $C(\alpha)$ in an analytic way (Section 2.2).

2.1 Analytic expressions for the new paired test statistic

Let $N=2n$ be the total number of observations and let $Z_{i}=I(i\leq n)X_{i}+I(i>n)Y_{i-n}$ , $i=1,\ldots,N$ , where $I(\cdot)$ is the indicator function. Let $g_{i}$ be the indicator function that equals 1 when $Z_{i}$ is assigned to sample 1 under the paired-comparison permutation null distribution, and 0 if $Z_{i}$ is assigned to sample 2. It is easy to see that $\text{pr}(g_{i}=1)=0.5$ . We use $a\wedge b$ to denote the minimum of $a$ and $b$ , and use $a\vee b$ to denote the maximum of $a$ and $b$ . Let $i$ and $i^{*}$ be the two indices belonging to pair $\mathbf{i}=(i\wedge i^{*},i\vee i^{*})$ , i.e., $i^{*}=i+n$ if $i\leq n$ , and $i^{*}=i-n$ if $i>n$ . Then, we always have $g_{i}+g_{i^{*}}=1$ . Let $j$ and $j^{*}$ be the two indices belonging to pair $\mathbf{j}=(j\wedge j^{*},j\vee j^{*})$ . Since assigning $Z_{i}$ to sample 1 is independent of assigning $Z_{j}$ $(\mathbf{j}\neq\mathbf{i})$ to sample 1, $g_{i}$ and $g_{j}$ are independent.

For an edge in $G$ , we denote it by the indices of the nodes connected by the edge. Then by definition, we have

	$\displaystyle R_{1}=\sum_{(i,j)\in G}I(g_{i}=g_{j}=1),$
	$\displaystyle R_{2}=\sum_{(i,j)\in G}I(g_{i}=g_{j}=0).$

Here, we do not distinguish edge $(i,j)$ from edge $(j,i)$ . For graph $G$ , let $|G|$ be the number of edges in the graph. Let $G_{1}$ be the subgraph in $G$ that connects observations from different pairs, i.e., $G_{1}$ contains edges $\{(i,j)\in G:j\neq i^{*}\}$ . Let $G_{1,i}$ be the subgraph in $G_{1}$ that connects to node $i$ . Then $|G_{1,i}|$ is the degree of node $i$ in $G_{1}$ . Let $C_{1}$ be the number of pairs of edges $(i,j),~{}(i^{*},j^{*})\in G_{1}$ , and $C_{2}$ be the number of pairs of edges $(i,j),~{}(i,j^{*})\in G_{1}$ . The analytic expressions are provided in the following theorem.

Theorem 1.

The analytic expressions of the expectations and variances under the paired-comparison permutation null are as follows:

	$\displaystyle E(R_{1})=E(R_{2})=\frac{1}{4}\|G_{1}\|,$
	$\displaystyle\text{var}(R_{1})=\text{var}(R_{2})=\frac{1}{16}(\|G_{1}\|+2C_{1}-2C_{2})+\frac{1}{16}\sum_{i=1}^{n}(\|G_{1,i}\|-\|G_{1,i^{*}}\|)^{2},$
	$\displaystyle\text{cov}(R_{1},R_{2})=\frac{1}{16}(\|G_{1}\|+2C_{1}-2C_{2})-\frac{1}{16}\sum_{i=1}^{n}(\|G_{1,i}\|-\|G_{1,i^{*}}\|)^{2}.$

Remark 1.

The expectation and variance of $(R_{1},R_{2})$ are very different from those under the permutation null for the two-sample test setting. Also, they only depend on $G_{1}$ , the set of edges that connect observations from different pairs. This is because the two end nodes of an edge connecting observations from a pair would always be in different samples under the pair-comparison permutation null distribution. Hence, one could construct a similarity graph $G$ that does not include any edge connecting within a pair. It is reasonable to not use the information of edge connecting within pairs as they only reflect that the subjects are distinct enough rather than they are not changed under a different circumstance.

To ensure that the proposed test statistic $D$ is well defined, $\Sigma_{R}$ needs to be invertible.

Theorem 2.

The proposed test statistic $D$ is well defined except the following two scenarios:

1.

For each pair $\mathbf{i}$ , the two nodes have the same degree in $G_{1}$ , i.e., $|G_{1,i}|-|G_{1,i^{*}}|=0$ , for all $i$ ’s; or
2.

$|G_{1}|+2C_{1}-2C_{2}=0$ .

Remark 2.

This theorem follows straightforwardly from the analytic expression of $\Sigma_{R}$ derived in Theorem 1. After some simplifications, we have that the determinant of $\Sigma_{R}$ is

|\Sigma_{R}|=\frac{1}{64}(|G_{1}|+2C_{1}-2C_{2})\sum_{i=1}^{n}(|G_{1,i}|-|G_{1,i^{*}}|)^{2}.

Hence, $D$ is well defined except for when $|\Sigma_{R}|=0$ , i.e., the two scenarios in Theorem 2.

Let $subG_{1}^{\mathbf{i},\mathbf{j}}$ be the subgraph of $G_{1}$ that connects any nodes in the pairs of $\mathbf{i}$ and $\mathbf{j}$ . Then it has eight possible configurations if there is at least one edge in $subG_{1}^{\mathbf{i},\mathbf{j}}$ . The eight configurations are shown in Figure 1.

Refer to caption — Figure 1: For $(i,j)\in G_{1}$ , eight possible subgraphs of $G_{1}$ between pairs $\mathbf{i}$ and $\mathbf{j}$ .

Notice that

|G_{1}|+2C_{1}-2C_{2}=\sum_{\begin{subarray}{c}\text{pairs }\mathbf{i},\mathbf{j}\\ |subG_{1}^{\mathbf{i},\mathbf{j}}|>0\end{subarray}}T(subG_{1}^{\mathbf{i},\mathbf{j}}),

where

	$\displaystyle T(subG_{1}^{\mathbf{i},\mathbf{j}}):=$	$\displaystyle\|subG_{1}^{\mathbf{i},\mathbf{j}}\|+2\times\text{pairs of two edges in }subG_{1}^{\mathbf{i},\mathbf{j}}\text{ not sharing any node }$
		$\displaystyle-2\times\text{pairs of two edges in }subG_{1}^{\mathbf{i}\mathbf{j}}\text{ sharing a node}.$

It is not hard to see that

T(c)=T(d)=T(h)=0,\quad T(a)=T(e)=T(f)=T(g)=1,\quad T(b)=4.

Hence, if every subgraph $subG_{1}^{\mathbf{i},\mathbf{j}}$ belongs to one of the three configurations, $c$ , $d$ and $h$ , then $\Sigma_{R}$ is noninvertible.

2.2 Asymptotics

For the critical value $C(\alpha)$ , it can be determined by performing the paired-comparison permutation directly, which is however time consuming. To make the test more application friendly, we study the asymptotic distribution of the statistic $D$ .

Before stating the theorem, we define two additional terms on the similarity graph $G_{1}$ : for edge $e\in G_{1}$ , let $e_{-}$ and $e_{+}$ be the indices of the nodes connected by the edge $e$ .

	$\displaystyle A_{e}=\{(i,j)\in G_{1}:i\in\{e_{-},e_{+},e_{-}^{},e_{+}^{}\}\text{ or }j\in\{e_{-},e_{+},e_{-}^{},e_{+}^{}\}\},$
	$\displaystyle B_{e}=\cup_{\tilde{e}\in A_{e}}A_{\tilde{e}}.$

We use $a=O(b)$ to denote that $a$ and $b$ are of the same order, and $a=o(b)$ to denote that $a$ is of a smaller order than $b$ .

To derive the asymptotic behavior of our statistic, we work under the following conditions for some $\gamma>0$ :

Condition 1.

$\sum_{e\in G_{1}}|A_{e}||B_{e}|=o(N^{1.5\gamma})$ .

Condition 2.

$\sum_{i=1}^{n}(|G_{1,i}|-|G_{1,i^{*}}|)^{2}=O(N^{\gamma})$ .

Condition 3.

$|G_{1}|+2C_{1}-2C_{2}=O(N^{\gamma})$ .

Remark 3.

The parameter $\gamma$ could be any positive number. For example, when $\gamma=1$ , the conditions above become the following conditions (2), (3) and (4), respectively.

\sum_{e\in G_{1}}|A_{e}||B_{e}|=o(N^{1.5}),

(2)

\sum_{i=1}^{n}(|G_{1,i}|-|G_{1,i^{*}}|)^{2}=O(N),

(3)

|G_{1}|+2C_{1}-2C_{2}=O(N).

(4)

Here, Condition (2) sets a constraint on the number of edges sharing a pair in the graph $G_{1}$ such that it cannot be too large. (A similar condition was proposed for graph-based statistics for independent observations and was discussed in Chen and Friedman (2017) and Chen et al. (2018).)

Conditions (3) and (4) ensure that $(R_{1},R_{2})$ does not degenerate asymptotically. Let

L_{1}=\{\mathbf{i}=(i\wedge i^{*},i\vee i^{*}):|G_{1,i}|\neq|G_{1,i^{*}}|\}.

If $|L_{1}|=O(N)$ and $(|G_{1,i}|-|G_{1,i^{*}}|)^{2}=O(1),~{}\mathbf{i}\in L_{1}$ , then (3) could be satisfied.

Condition (4) sets a constraint on the structure of the graph $G_{1}$ . As shown in the proof of Theorem 2, we obtain

|G_{1}|+2C_{1}-2C_{2}=\sum_{\begin{subarray}{c}\text{pairs }\mathbf{i},\mathbf{j}\\ |subG_{1}^{\mathbf{i},\mathbf{j}}|>0\end{subarray}}T(subG_{1}^{\mathbf{i},\mathbf{j}})

and $T(subG_{1}^{\mathbf{i},\mathbf{j}})=0,1$ or $4$ . Let

L_{2}=\{subG_{1}^{\mathbf{i},\mathbf{j}}:subG_{1}^{\mathbf{i},\mathbf{j}}\text{ contains at least one edge, i.e., }|subG_{1}^{\mathbf{i},\mathbf{j}}|>0\},

L_{3}=\{subG_{1}^{\mathbf{i},\mathbf{j}}\in L_{2}:T(subG_{1}^{\mathbf{i},\mathbf{j}})\neq 0\}.

We have

|G_{1}|/4\leq|L_{2}|\leq|G_{1}|.

Thus, $|L_{2}|=O(|G_{1}|)$ . If $|G_{1}|=O(N)$ and $|L_{3}|=O(|L_{2}|)$ , then (4) could be satisfied.

Theorem 3.

Under Conditions 1, 2 and 3, as $N\rightarrow\infty$ , $((R_{1}-E(R_{1}))/\sqrt{\text{var}(R_{1})}$ , $(R_{2}-E(R_{2}))/\sqrt{\text{var}(R_{2})})^{T}$ converges in distribution to a bivariate Gaussian distribution under the paired-comparison permutation null distribution.

Based on Theorem 3, it is easy to obtain the asymptotic distribution of $D$ .

Corollary 1.

Under Conditions 1, 2 and 3, as $N\rightarrow\infty$ ,

\quad D\longrightarrow\chi_{2}^{2}

in distribution under the paired-comparison permutation null distribution.

We reject the null hypothesis at $\alpha$ level when $D>C(\alpha)$ . Based on Corollary 1, $C(\alpha)$ can be approximated by $\chi_{2}^{2}(1-\alpha)$ , the $(1-\alpha)$ quantile of the $\chi_{2}^{2}$ distribution.

We next check how well the rejection region $D>\chi_{2}^{2}(1-\alpha)$ can control Type I error through numerical studies. We consider the following three settings:

•

Setting 1 (S1): $(X_{1},Y_{1})^{T},(X_{2},Y_{2})^{T},\ldots,(X_{n},Y_{n})^{T}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}_{2d}(\nu,\Gamma)$ ;
•

Setting 2 (S2): $(X_{1},Y_{1})^{T},(X_{2},Y_{2})^{T},\ldots,(X_{n},Y_{n})^{T}\stackrel{{\scriptstyle iid}}{{\sim}}\text{multivariate }t_{3}(\nu,\Gamma)$ ;
•

Setting 3 (S3): $\log(X_{1},Y_{1})^{T},\log(X_{2},Y_{2})^{T},\ldots,\log(X_{n},Y_{n})^{T}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}_{2d}(\nu,\Gamma)$ .

Here, $\nu^{T}=(\nu_{1},\nu_{2})^{T}$ and $\Gamma=\begin{pmatrix}\Gamma_{1}&\Gamma_{12}\\ \Gamma_{12}&\Gamma_{2}\end{pmatrix}$ with $\nu_{1}=\nu_{2}=\mathbf{0}_{d}$ , $\Gamma_{1}=\Gamma_{2}=\mathbf{I}_{d}$ and $\Gamma_{12}=0.6\mathbf{I}_{d}$ . Then, in each setting, the distributions of $(X_{i},Y_{i})^{T}$ and $(Y_{i},X_{i})^{T}$ are the same.

Table 2: Examination of empirical size for data generated from a multivariate normal distribution (S1), a multivariate

t

distribution (S2), and a multivariate log-normal distribution (S3)

(a) Empirical size at 0.05 nominal level

	$d=10$				$d=100$
	$n=25$	$n=50$	$n=100$	$n=150$	$n=25$	$n=50$	$n=100$	$n=150$
(S1)	0.038	0.038	0.046	0.036	0.046	0.042	0.039	0.048
(S2)	0.040	0.043	0.030	0.045	0.041	0.048	0.042	0.041
(S3)	0.039	0.025	0.042	0.040	0.034	0.040	0.040	0.037

(b) Empirical size at 0.1 nominal level

	$d=10$				$d=100$
	$n=25$	$n=50$	$n=100$	$n=150$	$n=25$	$n=50$	$n=100$	$n=150$
(S1)	0.081	0.087	0.081	0.088	0.085	0.089	0.094	0.100
(S2)	0.077	0.100	0.078	0.095	0.084	0.094	0.095	0.096
(S3)	0.079	0.072	0.085	0.091	0.084	0.092	0.088	0.088

Table 2 shows the empirical sizes of the proposed test under 0.05 and 0.1 nominal levels. The empirical size is computed as the fraction of trials (out of 1000) that the test statistic is greater than $\chi_{2}^{2}(1-\alpha)$ . We see that the empirical sizes are well controlled for the proposed test even for quite small sample sizes under both low and high dimensions in all three settings.

3 Performance of the proposed test

In this section, we evaluate the performance of the paired graph-based test $D$ through two-sample testing for paired nonindependent data and pair matching. The proposed test is compared with the multivariate paired Hotelling’s $T^{2}$ test (pHT) in Section 3.1, and compared with pHT and four existing tests for matching: the method of combined differences (Hansen and Bowers, 2008), the crossmatch test (Rosenbaum, 2005), CrossNN and CrossMST tests (Chen and Small, 2020) in Section 3.2. The levels of the tests are all set to be 0.05. For the paired graph-based test, we reject the test when $D>\chi_{2}^{2}(0.95)$ . For the paired Hotelling’s $T^{2}$ test, let $T_{i}=X_{i}-Y_{i}~{}(i=1,\ldots,n)$ , $\bar{T}=\frac{1}{n}\sum_{i=1}^{n}T_{i}$ , and $\Sigma_{T}=\frac{1}{n-1}\sum_{i=1}^{n}(T_{i}-\bar{T})(T_{i}-\bar{T})^{T}$ . The null hypothesis is rejected if

\frac{(n-d)n}{d(n-1)}\bar{T}^{T}\Sigma_{T}^{-1}\bar{T}>F_{d,n-d}(0.95),

where $F_{d,n-d}(0.95)$ denotes the 0.95 quantile of an F-distribution with $d$ and $n-d$ degrees of freedom.

3.1 Two-sample testing for paired nonindependent data

We first examine the performance of the proposed test for data from the same family of distributions. We consider the three settings (S1), (S2) and (S3) in Section 2.2 but with different choices of $\nu$ and $\Gamma$ . Let the number of pairs $n$ be fixed at the moderate size with $n=60$ . We assess how the proposed statistic behaves when the dimension is comparable to or larger than the number of pairs by considering three dimensions: $d=50$ , $d=100$ and $d=1000$ . For each setting, we consider two alternatives for each $d$ listed below.

(i)

Only $\nu_{1}$ differs from $\nu_{2}$ with $\nu_{1}=\mathbf{0}_{d}$ , $\nu_{2}=0.5d^{-1/4}\mathbf{1}_{d}$ , $\Gamma_{1}=\Gamma_{2}=\mathbf{I}_{d}$ and $\Gamma_{12}=0.6\mathbf{I}_{d}$ .
(ii)

Parameter $\nu_{1}$ differs from $\nu_{2}$ and $\Gamma_{1}$ differs from $\Gamma_{2}$ with $\nu_{1}=\mathbf{0}_{d}$ , $\nu_{2}=0.5d^{-1/4}\mathbf{1}_{d}$ , $\Gamma_{1}=\mathbf{I}_{d}$ , $\Gamma_{2}=c_{d}\mathbf{I}_{d}$ and $\Gamma_{12}=0.6c_{d}^{1/2}\mathbf{I}_{d}$ , where $c_{d}=1.15$ for $d=50$ , $c_{d}=1.1$ for $d=100$ and $c_{d}=1.05$ for $d=1000$ .

Table 3: Estimated power at 0.05 significance level based on 1000 runs. The larger estimated power under each setting is in bold

(a) Data from the same family of distributions (i) $\nu_{1}\neq\nu_{2}$ , $\Gamma_{1}=\Gamma_{2}$ (ii) $\nu_{1}\neq\nu_{2}$ , $\Gamma_{1}\neq\Gamma_{2}$ $d$ 50 100 1000 50 100 1000 multivariate normal pHT 0.788 - - 0.729 - - $D$ 0.709 0.754 0.810 0.972 0.974 0.999 multivariate t pHT 0.750 - - 0.671 - - $D$ 0.826 0.823 0.543 0.942 0.941 0.916 multivariate log-normal pHT 0.450 - - 0.611 - - $D$ 0.780 0.821 0.847 0.995 0.993 0.992

(b) Data from different families of distributions $\Delta=\mathbf{0}_{d}$ $\Delta=0.2\mathbf{1}_{d}$ $d$ 50 100 1000 50 100 1000 $\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{normal}$ pHT 0.053 - - 0.429 - - $D$ 0.643 0.758 0.980 0.752 0.928 1.000 $\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{skew normal}$ pHT 0.039 - - 0.416 - - $D$ 0.566 0.658 0.960 0.639 0.856 1.000 $\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{Laplace}$ pHT 0.041 - - 0.440 - - $D$ 0.358 0.486 0.920 0.571 0.778 1.000

The results under various scenarios are summarized in Table 3. We first look at the results under moderate-dimensional settings ( $d=50$ ). When only $\nu_{1}$ differs from $\nu_{2}$ , the paired Hotelling’s $T^{2}$ test has high power for the multivariate normal and multivariate $t$ distributions, but is not good for the multivariate log-normal distribution. When $\nu_{1}\neq\nu_{2}$ and $\Gamma_{1}\neq\Gamma_{2}$ , the performance of paired Hotelling’s $T^{2}$ is much worse than the proposed test $D$ , especially for the multivariate log-normal distribution. For results under dimensions $d=100$ and $d=1000$ , $D$ exhibits pretty good performance under all the settings, while the paired Hotelling’s $T^{2}$ cannot be applied as $d>n$ .

Additionally, under setting (ii), $\nu_{1}$ and $\nu_{2}$ are set to be the same as those under setting (i), while $\Gamma_{2}$ becomes different from $\Gamma_{1}$ . Thus, the difference between the two samples should be more significant. Comparing the results under setting (ii) with those under setting (i), we observe an increasing power of $D$ for the three distributions, while the power of pHT decreases for the multivariate normal and multivariate $t$ distributions.

Now we examine the performance of $D$ when $X_{i}$ and $Y_{i}$ are from different families of distributions. Suppose $X_{i},Y_{i}\in\mathbb{R}^{d}$ , $X_{1},\ldots,X_{n}\stackrel{{\scriptstyle iid}}{{\sim}}F_{X}$ , $Y_{1},\ldots,Y_{n}\stackrel{{\scriptstyle iid}}{{\sim}}F_{Y}$ . We generate data by

X_{i}=\alpha_{i}+\epsilon_{i},\quad Y_{i}=\Delta+\alpha_{i}+\tau_{i},\quad i=1,\ldots,n,

where $\Delta$ is a constant vector, and $\alpha_{i},\epsilon_{i},\tau_{i}$ are independent with each other. Let $\alpha_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}_{d}(\mathbf{0}_{d},\Omega_{1})$ and $\epsilon_{i}\stackrel{{\scriptstyle iid}}{{\sim}}t_{3}(\mathbf{0}_{d},\Omega_{2}/3)$ . We consider three types of distributions for $\tau_{i}$ : (i) $\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}_{d}(\mathbf{0}_{d},\Omega_{2})$ , (ii) $\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{multivariate skew normal distribution with mean }\mathbf{0}_{d}$ , variance $\Omega_{2}$ , skewness 1 and (iii) $\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{multivariate Laplace distribution with mean }\mathbf{0}_{d}$ , variance $\Omega_{2}$ . Therefore, $E(X_{i})=\mathbf{0}_{d}$ , $E(Y_{i})=\Delta$ , $\text{var}(X_{i})=\text{var}(Y_{i})=\Omega_{1}+\Omega_{2}$ and $\text{cov}(X_{i},Y_{i})=\Omega_{1}$ . Let $n=60$ and $d=50,100,1000$ . Let $\Omega_{1}=(\Omega_{ij})$ with the $(i,j)$ element of $\Omega$ being $\Omega_{ij}=0.5^{|i-j|}$ and $\Omega_{2}=\mathbf{I}_{d}$ . For each $d$ , we consider two settings for $\Delta$ : $\Delta=\mathbf{0}_{d}$ and $\Delta=0.2\mathbf{1}_{d}$ .

Table 3 gives the results. We first take a look at the results when there is no mean difference between the two samples ( $\Delta=\mathbf{0}_{d}$ ). When $\Delta=\mathbf{0}_{d}$ , the means and variances of $X_{i}$ and $Y_{i}$ are the same, while the two distributions $F_{X}$ and $F_{Y}$ are different, and we would expect a powerful test to reject the null hypothesis. However, the paired Hotelling’s $T^{2}$ test has almost no power under $d=50$ and cannot be applied to high-dimensional scenarios ( $d=100,1000$ ), so it cannot detect the shape difference between two distributions. The proposed test $D$ performs well under all these scenarios. When there is a mean difference between the two samples ( $\Delta=0.2\mathbf{1}_{d}$ ), the paired Hotelling’s $T^{2}$ exhibits some power under $d=50$ , though the proposed test $D$ has a higher power. For the high dimensional scenarios ( $d=100,1000$ ), the paired Hotelling’s $T^{2}$ is inapplicable, while the power of $D$ is very high.

3.2 Assessing covariate balance for pair matching

Following a similar simulation setting as in Franklin et al. (2014), we assume that 20-dimensional covariates $\mathbf{X}_{(1)}=(X_{1},\cdots,X_{20})^{T}$ are observed, where $X_{j}~{}(j=1,\cdots,20)$ independent identically follows Laplace distribution with mean 0 and variance 0.65. The exposure/treatment, $T$ , depends on these 20 covariates as well as three unobserved transformations of them ( $\mathbf{X}_{(2)}=(X_{21},X_{22},X_{23})^{T}=(\sin(X_{1})/4,\cos(X_{2})/4,X_{3}^{4})^{T}$ ). We simulate $T$ as a binary variable via the logistic model $\text{logit}\{P(T=1)\}=\alpha_{0}+\bm{\alpha}_{1}^{T}\mathbf{X}_{(1)}+\bm{\alpha}_{2}^{T}\mathbf{X}_{(2)}$ , where $\bm{\alpha}_{1}$ and $\bm{\alpha}_{2}$ define the log-odds ratios between exposures and controls in the pre-matched data set. We generated 1000 subjects and determined whether each of them is exposed or not. Then the control subjects $(T=0)$ are matched to exposed subjects $(T=1)$ through their propensity scores on $\mathbf{X}_{(1)}$ .

We consider the following four scenarios.

(i)

Zero coefficients for both $\mathbf{X}_{(1)}$ and $\mathbf{X}_{(2)}$ : $\bm{\alpha}_{1}=\mathbf{0}_{20}$ and $\bm{\alpha}_{2}=\mathbf{0}_{3}$ .
(ii)

Nonzero coefficients for observed covariates $\mathbf{X}_{(1)}$ only: $\bm{\alpha}_{1}=0.2\mathbf{1}_{20}$ and $\bm{\alpha}_{2}=\mathbf{0}_{3}$ .
(iii)

Nonzero coefficients for unobserved covariates $\mathbf{X}_{(2)}$ only: $\bm{\alpha}_{1}=\mathbf{0}_{20}$ and $\bm{\alpha}_{2}=0.33\mathbf{1}_{3}$ .
(iv)

Nonzero coefficients for both $\mathbf{X}_{(1)}$ and $\mathbf{X}_{(2)}$ : $\bm{\alpha}_{1}=0.2\mathbf{1}_{20}$ and $\bm{\alpha}_{2}=0.33\mathbf{1}_{3}$ .

Here, scenario (i) is used to examine the empirical size since the covariates of treated subjects and controls are generated from the same distribution. For each scenario, we simulate 1000 data sets. Define the standardized mean difference as $\text{SD1}=(\bar{x}_{1}-\bar{x}_{0})/\sqrt{(s_{1}^{2}+s_{0}^{2})/2}$ where $\bar{x}_{m}$ and $s_{m}$ are the sample mean and variance for treated subjects $(m=1)$ and controls $(m=0)$ . For each of the 23 covariates, we calculate the standardized mean difference between the treatment group and control group before and after matching. Figure 2 shows the boxplots of standardized mean difference of 1000 data set with the left panels denoting before matching and right panels denoting after matching. We see that the standardized mean differences of the observed covariates $\mathbf{X}_{(1)}$ are relatively close to 0 after matching under all scenarios. However, the unobserved covariates $\mathbf{X}_{(2)}$ are very unbalanced after matching when $\bm{\alpha}_{2}\neq\mathbf{0}_{3}$ (scenarios (iii) and (iv)), which indicates the distribution of ( $X_{1},X_{2},X_{3}$ ) is not well balanced. Since we aim to test whether the joint distributions of the covariates in the matched control group and the treatment group are the same, we would expect that a good test rejects the null hypothesis under scenarios (iii) and (iv), while it accepts the null hypothesis under scenarios (i) and (ii).

Table 4: The proportion of trials (out of 1000) that the test rejects covariate balance at 0.05 significance level. The largest estimated power under scenarios (iii) and (iv) is in bold

Scenario	pHT	CD	CM	$\text{NN}1$	$\text{NN}2$	$\text{MST}1$	$\text{MST}2$	$D$
(i)	0.000	0.000	0.030	0.018	0.024	0.029	0.035	0.050
(ii)	0.000	0.000	0.018	0.020	0.034	0.022	0.040	0.069
(iii)	0.003	0.001	0.579	0.510	0.677	0.473	0.537	0.907
(iv)	0.010	0.005	0.540	0.530	0.670	0.505	0.515	0.925

Denote the method of combined differences and crossmatch test by CD and CM, respectively. To apply the CrossNN and CrossMST tests, we consider 1-MST and 5-MST as the similarity graph and denote these tests by $\text{NN}1$ , $\text{NN}2$ , $\text{MST}1$ and $\text{MST}2$ , respectively. We present the proportion of trials that the tests reject the null hypothesis at 0.05 significance level in Table 4. We see all tests control empirical size well under scenario (i) and they deem the covariates well balanced under the reasonably balanced scenario (ii). For the results under scenarios (iii) and (iv), we observe the proposed test $D$ performs best with the highest power as expected. The crossmatch, CrossNN and CrossMST tests also show some power, while it is much less effective than the proposed test $D$ . The paired Hotelling’s test and the method of combined differences do not have power under scenarios (iii) and (iv).

4 A real application on Alzheimer’s disease research

In this section, we illustrate the newly developed test in a research project studying the Alzheimer’s disease. The data were collected by multiple Alzheimer’s Disease Centers between September 2005 and December 2018 (see Beekly et al. (2007) for details). Participants were evaluated on cognitive performance at their initial visits, and latter approximate annual follow-up visits. The information is recorded in the Uniform Data Set (UDS), which is a longitudinal, standardized data set maintained by the National Alzheimer’s Coordinating Center (NACC) and more information can be found in https://www.alz.washington.edu/. The data set consists of about 725 variables obtained from approximately annual comprehensive evaluations of 39,412 research volunteers as of the December 2018 data freeze. In our study, we requested for data containing participants with at least 4 visits.

4.1 Comparison of neuropsychologic performances between two visits

We first study the participants’ neuropsychologic performance over time, so we focused on the Neuropsychological Battery Summary Scores in Form C1, which contains 22 neuropsychology measurement variables. A description of these 22 variables are provided in LABEL:app:variable. We group the participants into three groups according to the CDR Dementia Staging Instrument in their first visits: no dementia (Group I), very mild dementia (Group II), and mild dementia (Group III). We test whether their neuropsychologic performances in five years differ from those in their initial visits. After removing missing data, the sample sizes ( $n$ ) of three groups are 1747, 539, and 41, respectively.

Table 5: Test results of pHT and

D

(bold for those

p

-values

<

0.05)

	Group I		Group II		Group III
	statistic	$p$ -value	statistic	$p$ -value	statistic	$p$ -value
pHT	27.09	$<$ 1e-4	11.83	$<$ 1e-4	1.44	0.2121
$D$	39.78	$<$ 1e-4	37.93	$<$ 1e-4	7.60	0.0224

Table 5 presents the results of the paired Hotelling’s $T^{2}$ test (pHT) and the proposed test $D$ . We first check the results for Group I in testing whether their neuropsychologic measures in five years differ from those in their initial visits. We see that both the paired Hotelling’s $T^{2}$ test and our new test $D$ reject the null hypothesis with extremely small $p$ -value. Thus, it is clear there is a significant difference in the neuropsychologic measures between the two visits. Similarly, we would reject the null hypothesis for Group II.

For Group III, as the sample size is relatively small, the power of the tests decreases. We see that the paired Hotelling’s $T^{2}$ test cannot reject the null hypothesis, while the proposed test $D$ rejects the null hypothesis with a small $p$ -value. To see which result is more reliable, we check the data in more details. In particular, we perform the paired $t$ -test to each of the 22 neuropsychologic measurement variables. It turns out that the paired $t$ -test has $p$ -value less than 0.05 for 11 out of these 22 variables and the smallest $p$ -value is 0.001 (Table 1). Then, by the Bonferroni correction, the paired $t$ -test would reject the null hypothesis at 0.05 significance level ( $0.05/22=0.002>0.001$ ), supporting the result from $D$ .

These results reveal that for all the participants from the three groups, neuropsychologic measures after several years would become significantly different from those in the initial visits. It suggests that researchers pay attention to the changes of cognitive performance even for no dementia or (very) mild dementia participants.

4.2 Assessment of covariate balance in pair matching for examining sex effect

Several papers have previously reported a sex effect on Alzheimer’s disease (Payami et al., 1996; Podcasy and Epperson, 2016; Grimm et al., 2016). In this study, we focus on the data from the initial visit and only consider the white participant who has no stroke, no transient ischemic attack, no serious heart problem (e.g., atrial fibrillation, cardiac bypass procedure, congestive heart failure), no diabetes, no brain trauma, no Parkinson’s disease, no seizures, no other neurological and psychiatric disorder, do not abuse substances (except alcohol), and meanwhile, no Frontotemporal lobar degeneration mutation and Alzheimer’s disease mutation in her/his family, no cognitive impairment in her/his first-degree family. There are 13 covariates in total under consideration (Table 6). Here, each of TOBAC30, TOBAC100, CVHATT and DEP2YRS is a 0-1 variable, which is transformed to one dummy variable. CVANGIO and ALCOHOL are categorical variables with three categories: absent, recent/active, remote/inactive, and both of them are transformed to two dummy variables, respectively. We match 105 male participants with 105 female participants, among 153 female participants, expecting the covariates in the control (female) group are balanced with the exposed/treated (male) group. We adopt pairmatch() function in the optmatch package in R (Hansen, 2007) on treated to control distances with the distance computed from the match_on() using the logit propensity scores.

Table 6: Summary of covariates and statistics before and after the pair matching of a male participant and a female participant

(a) Matching results of Section 4.2

		Before matching					After matching
Description	Covariates	Male	Female	SD1	SD2	SD3	Female	SD1	SD2	SD3
Age at visit	NACCAGE	72.762	72.092	0.081	-0.130	0.048	72.848	-0.010	-0.149	0.247
Body mass index (BMI)	NACCBMI	26.827	26.852	-0.005	-0.489	-0.367	26.035	0.169	-0.354	0.073
Total years smoked cigarettes	SMOKYRS	9.114	9.248	-0.010	-0.238	-0.516	9.133	-0.001	-0.080	-0.178
Blood pressure, systolic	BPSYS	135.152	133.915	0.068	-0.079	0.091	134.419	0.041	-0.044	0.229
Blood pressure, diastolic	BPDIAS	75.505	76.307	-0.084	-0.329	-0.069	75.66	-0.016	-0.258	-0.038
Smoked cigarettes in last 30 days	TOBAC30	0.019	0.033	-0.086	-0.510	-2.869	0.019	0.000	0.000	0.000
Smoked $>$ 100 cigarettes in life	TOBAC100	0.486	0.425	0.122	0.022	-0.238	0.495	-0.019	-0.001	0.038
Heart attack	CVHATT	0.114	0.013	0.422	1.534	4.741	0.019	0.387	1.364	4.037
Angioplasty/endarterectomy/stent	CVANGIO
Recent/Active	CVANGIO1	0.019	0.000	0.196	1.981	19.620	0.000	0.196	1.981	19.620
Remote/Inactive	CVANGIO2	0.095	0.026	0.291	1.078	3.414	0.019	0.331	1.275	4.252
Alcohol abuse	ALCOHOL
Recent/Active	ALCOHOL1	0.019	0.007	0.111	0.960	8.080	0.010	0.080	0.652	5.156
Remote/Inactive	ALCOHOL2	0.029	0.020	0.058	0.360	2.112	0.029	0.000	0.000	0.000
Active depression in last 2 years	DEP2YRS	0.229	0.294	-0.149	-0.162	0.120	0.210	0.046	0.062	-0.007

(b) Matching results of Section 4.3 Before matching After matching Description Covariates Male Female SD1 SD2 SD3 Female SD1 SD2 SD3 Age at visit NACCAGE 74.750 73.500 0.162 -0.350 0.258 74.684 0.009 -0.180 0.367 Body mass index (BMI) NACCBMI 26.429 26.894 -0.095 -0.698 -0.816 25.961 0.106 -0.404 -0.844 Total years smoked cigarettes SMOKYRS 9.882 10.308 -0.031 -0.054 -0.039 11.000 -0.079 -0.114 -0.183 Smoked cigarettes in last 30 days TOBAC30 0.026 0.025 0.008 0.049 0.278 0.026 0.000 0.000 0.000

To measure the balance of covariates in the control group and the treatment group, we use the standardized mean difference (SD1), and define the standardized variance difference (SD2) and standardized third central moment difference (SD3) as

\text{SD2}=\frac{2(s_{1}-s_{0})}{s_{1}^{2}+s_{0}^{2}},\quad\text{SD3}=\frac{2^{\frac{3}{2}}(\nu_{1}-\nu_{0})}{(s_{1}^{2}+s_{0}^{2})^{\frac{3}{2}}},

where $s_{m}$ and $\nu_{m}$ are the sample variance and the third sample central moment for treated subjects $(m=1)$ and controls $(m=0)$ . Table 6 lists means, SD1, SD2 and SD3 before and after matching. We see that the covariates are not balanced before matching. For example, the values of SD1, SD2 and SD3 for CVHATT are 0.422, 1.534 and 4.741. After matching, it seems that the covariates are still not balanced well. For example, SD3 of the four covariates, CVHATT, CVANGIO1, CVANGIO2 and ALCOHOL1, are larger than 4.

Now we apply the paired Hotelling’s $T^{2}$ (pHT), the method of combined differences (CD), the crossmatch (CM), the CrossNN (NN1, NN2), the CrossMST (MST1, MST2) and the proposed test ( $D$ ) to the paired data after matching. Only our proposed test gives small $p$ -value 0.005. By contrast, the $p$ -values of pHT, CD, CM, NN1, NN2, MST1 and MST2 are 0.087, 0.100, 0.669, 0.541, 0.219, 0.249, 0.266, respectively, which indicates that the other tests cannot reject the null hypothesis of balanced covariates at 0.05 significance level. To explore which result is more reliable, we apply the paired $t$ -test to each of the 13 covariates. It turns out that the smallest $p$ -value of the paired $t$ -test is 0.001. So the paired $t$ -test would reject the null hypothesis at 0.05 significance level ( $0.05/13=0.004>0.001$ ) by the Bonferroni correction, supporting the result of $D$ .

4.3 Comparison of neuropsychologic performances for pair-matched data

Another question of interest is to compare the neuropsychologic performances between well-matched female participants and male participants. To make the matching easier, we focus on the data from the initial visit and only consider the white participant who has no stroke, no transient ischemic attack, no atrial fibrillation, no heart problem, no angioplasty/endarterectomy/stent, no active depression in the last two years, no diabetes, no brain trauma, no Parkinson’s disease, no seizures, no alcohol abuse, no other neurological and psychiatric disorder, and meanwhile, no Frontotemporal lobar degeneration mutation and Alzheimer’s disease mutation in her/his family, no cognitive impairment in her/his first-degree family. In addition, only four covariates, NACCAGE, NACCBMI, SMOKYRS and TOBAC30, are under consideration now (Table 6). Adopting the same matching method as in Section 4.2, we match 76 male participants with 76 female participants, among 120 female participants.

The means, SD1, SD2 and SD3 before and after matching are provided in Table 6. We see the covariates of the data after matching are much more balanced than those in Section 4.2. We apply the proposed test $D$ and other existing tests to the matched covariates. Since the $p$ -values of pHT, CD, CM, NN1, NN2, MST1, MST2 and $D$ are 0.372, 0.366, 0.804, 0.280, 0.688, 0.082, 0.163 and 0.056, respectively, all these tests cannot reject the null hypothesis at 0.05 significance level. So we deem the covariates reasonably balanced jointly.

After obtaining well-matched participants, we apply the paired Hotelling’s $T^{2}$ test (pHT) and the proposed test ( $D$ ) to test whether the neuropsychologic performances in the female group differ from those in the male group. Here, 22 neuropsychology measurement variables are considered as in Section 4.1. The $p$ -values of pHT and $D$ are 0.068 and 0.008, respectively, so the proposed test rejects the null hypothesis of equal neuropsychologic performances between the female group and the male group, while pHT accepts the null hypothesis at 0.05 significance level. We further explore the data by applying the paired $t$ -test to each of the 22 covariates. We see that the smallest $p$ -value of the paired $t$ -test is 0.001. The paired $t$ -test would reject the null hypothesis at 0.05 significance level ( $0.05/22=0.002>0.001$ ) by the Bonferroni correction. Therefore, the proposed test $D$ is more reliable and the sex may have an effect on Alzheimer’s disease.

5 Discussion

If we are only interested in the location difference or the scale difference. The following two statistics can be considered:

•

paired mean test

$D_{m}=\frac{R_{1}+R_{2}-E(R_{1}+R_{2})}{\sqrt{\text{var}(R_{1}+R_{2})}}.$ (5)

The test is rejected at level $\alpha$ if $D_{m}>C_{m}(\alpha)$ .
•

paired scale test

$D_{s}=\frac{R_{1}-R_{2}-E(R_{1}-R_{2})}{\sqrt{\text{var}(R_{1}-R_{2})}}.$ (6)

The test is rejected at level $\alpha$ if $|D_{s}|>C_{s}(\alpha)$ .

The statistics $D_{m}$ and $D_{s}$ are effective for the location and scale alternatives, respectively.

Based on Theorem 3, the asymptotic distributions of $D_{m}$ and $D_{s}$ can be derived, and we could set $C_{m}(\alpha)=\Phi^{-1}(1-\alpha)$ , the $(1-\alpha)$ quantile of the standard normal distribution, and $C_{s}(\alpha)=\Phi^{-1}(1-\alpha)$ , for easy implementation of the two tests.

Corollary 2.

Under Conditions 1, 2 and 3, as $N\rightarrow\infty$ ,

\begin{pmatrix}D_{m}\\ D_{s}\end{pmatrix}\longrightarrow\mathcal{N}\left(\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}1&0\\ 0&1\end{pmatrix}\right)

in distribution under the paired-comparison permutation null.

With the analytic expressions of the expectations and variances in Theorem 1, we could further study the relationship among the statistics $D$ , $D_{m}$ and $D_{s}$ .

Proposition 1.

We have

D=D_{m}^{2}+D_{s}^{2}\text{ and }\text{cov}(D_{m},D_{s})=0.

Proof.

Let

\mathbf{R}=\begin{pmatrix}R_{1}-E(R_{1})\\ R_{2}-E(R_{2})\end{pmatrix},\quad\mathbf{\tilde{D}}=\begin{pmatrix}D_{m}\\ D_{s}\end{pmatrix}=\begin{pmatrix}\frac{1}{\sqrt{\text{var}(R_{1}+R_{2})}}&\frac{1}{\sqrt{\text{var}(R_{1}+R_{2})}}\\ \frac{1}{\sqrt{\text{var}(R_{1}-R_{2})}}&-\frac{1}{\sqrt{\text{var}(R_{1}-R_{2})}}\end{pmatrix}\mathbf{R}\triangleq\mathbf{B}\mathbf{R}.

It is easy to see that $\mathbf{B}$ is invertible. From the definition of $D$ (Equation (1)), it can be written as

D=\mathbf{R}^{T}\mathbf{\Sigma}^{-1}_{R}\mathbf{R}=(\mathbf{B}^{-1}\mathbf{\tilde{D}})^{T}\mathbf{\Sigma}^{-1}_{R}(\mathbf{B}^{-1}\mathbf{\tilde{D}})=\mathbf{\tilde{D}}^{T}(\mathbf{B}\mathbf{\Sigma}_{R}\mathbf{B}^{T})^{-1}\mathbf{\tilde{D}}.

Plugging $\mathbf{B}$ and $\mathbf{\Sigma}_{R}$ , we have $\mathbf{B}\mathbf{\Sigma}_{R}\mathbf{B}^{T}=\begin{pmatrix}1&0\\ 0&1\end{pmatrix}$ . Then, $D=\mathbf{\tilde{D}}^{T}\mathbf{\tilde{D}}=D_{m}^{2}+D_{s}^{2}$ . ∎

6 Conclusion

Paired data arise naturally in a variety of applications. In many contemporary datasets, the number of measurements is often comparable to or even larger than the number of pairs. We propose a new non-parametric test for paired data upon a graph-based two-sample testing framework. Since all existing graph-based tests are inapplicable for paired data owing to the nature of paired observations, we propose to use the paired-comparison permutation null distribution and develop the statistic $D$ . The simulation studies demonstrate that $D$ works well for assessing covariate balance in pair matching besides testing paired data. It exhibits high power under a wide range of situations including shape, location and/or scale alternatives. Under the paired-comparison permutation null, the asymptotic distribution of our new statistic $D$ is also derived. The approximate $p$ -value based on the asymptotic result is reasonably accurate to the permutation $p$ -value under finite samples, making the test fast applicable in practice.

As one advantage inherited from the graph-based tests, the proposed test can also be applied to multi-type complex data, such as non-Euclidean data, as long as a reasonable distance can be defined over the observations.

Acknowledgements

Hao Chen was supported in part by NSF awards DMS-1513653 and DMS-1848579. Xiao-Hua Zhou was supported in part by the National Science Foundation of China (NSFC 81773546 and NSFC 12026606). The NACC database is funded by NIA/NIH Grant U24 AG072122. NACC data are contributed by the NIA-funded ADRCs: P30 AG019610 (PI Eric Reiman, MD), P30 AG013846 (PI Neil Kowall, MD), P50 AG008702 (PI Scott Small, MD), P50 AG025688 (PI Allan Levey, MD, PhD), P50 AG047266 (PI Todd Golde, MD, PhD), P30 AG010133 (PI Andrew Saykin, PsyD), P50 AG005146 (PI Marilyn Albert, PhD), P50 AG005134 (PI Bradley Hyman, MD, PhD), P50 AG016574 (PI Ronald Petersen, MD, PhD), P50 AG005138 (PI Mary Sano, PhD), P30 AG008051 (PI Thomas Wisniewski, MD), P30 AG013854 (PI Robert Vassar, PhD), P30 AG008017 (PI Jeffrey Kaye, MD), P30 AG010161 (PI David Bennett, MD), P50 AG047366 (PI Victor Henderson, MD, MS), P30 AG010129 (PI Charles DeCarli, MD), P50 AG016573 (PI Frank LaFerla, PhD), P50 AG005131 (PI James Brewer, MD, PhD), P50 AG023501 (PI Bruce Miller, MD), P30 AG035982 (PI Russell Swerdlow, MD), P30 AG028383 (PI Linda Van Eldik, PhD), P30 AG053760 (PI Henry Paulson, MD, PhD), P30 AG010124 (PI John Trojanowski, MD, PhD), P50 AG005133 (PI Oscar Lopez, MD), P50 AG005142 (PI Helena Chui, MD), P30 AG012300 (PI Roger Rosenberg, MD), P30 AG049638 (PI Suzanne Craft, PhD), P50 AG005136 (PI Thomas Grabowski, MD), P50 AG033514 (PI Sanjay Asthana, MD, FRCP), P50 AG005681 (PI John Morris, MD), P50 AG047270 (PI Stephen Strittmatter, MD, PhD).

References

Beekly et al. (2007) Beekly, D. L., Ramos, E. M., Lee, W. W., Deitrich, W. D., Jacka, M. E., Wu, J., Hubbard, J. L., Koepsell, T. D., Morris, J. C., Kukull, W. A., et al. (2007). The National Alzheimer’s Coordinating Center (NACC) database: the uniform data set. Alzheimer Disease & Associated Disorders 21, 249–258.
Carter et al. (2012) Carter, C. L., Resnick, E. M., Mallampalli, M., and Kalbarczyk, A. (2012). Sex and gender differences in alzheimer’s disease: recommendations for future research. Journal of Women’s Health 21, 1018–1023.
Chen et al. (2018) Chen, H., Chen, X., and Su, Y. (2018). A weighted edge-count two-sample test for multivariate and object data. Journal of the American Statistical Association 113, 1146–1155.
Chen and Friedman (2017) Chen, H. and Friedman, J. H. (2017). A new graph-based two-sample test for multivariate and object data. Journal of the American Statistical Association 112, 397–409.
Chen and Small (2020) Chen, H. and Small, D. S. (2020). New multivariate tests for assessing covariate balance in matched observational studies. Biometrics .
Conover (1998) Conover, W. J. (1998). Practical nonparametric statistics, volume 350. John Wiley & Sons.
Franklin et al. (2014) Franklin, J. M., Rassen, J. A., Ackermann, D., Bartels, D. B., and Schneeweiss, S. (2014). Metrics for covariate balance in cohort studies of causal effects. Statistics in Medicine 33, 1685–1699.
Friedman and Rafsky (1979) Friedman, J. H. and Rafsky, L. C. (1979). Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics 7, 697–717.
Grimm et al. (2016) Grimm, A., Mensah-Nyagan, A. G., and Eckert, A. (2016). Alzheimer, mitochondria and gender. Neuroscience & Biobehavioral Reviews 67, 89–101.
Hansen (2007) Hansen, B. B. (2007). Optmatch: Flexible, optimal matching for observational studies. New Functions for Multivariate Analysis 7, 18–24.
Hansen and Bowers (2008) Hansen, B. B. and Bowers, J. (2008). Covariate balance in simple, stratified and clustered comparative studies. Statistical Science 23, 219–236.
Mazure and Swendsen (2016) Mazure, C. M. and Swendsen, J. (2016). Sex differences in Alzheimer’s disease and other dementias. The Lancet. Neurology 15, 451.
McKhann et al. (1984) McKhann, G., Drachman, D., Folstein, M., Katzman, R., Price, D., and Stadlan, E. M. (1984). Clinical diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer’s Disease. Neurology 34, 939–939.
Payami et al. (1996) Payami, H., Zareparsi, S., Montee, K. R., Sexton, G. J., Kaye, J. A., Bird, T. D., Yu, C.-E., Wijsman, E. M., Heston, L. L., Litt, M., et al. (1996). Gender difference in apolipoprotein E-associated risk for familial Alzheimer disease: a possible clue to the higher incidence of Alzheimer disease in women. American Journal of Human Genetics 58, 803.
Podcasy and Epperson (2016) Podcasy, J. L. and Epperson, C. N. (2016). Considering sex and gender in Alzheimer disease and other dementias. Dialogues in Clinical Neuroscience 18, 437.
Rosenbaum (2005) Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 515–530.
Weintraub et al. (2009) Weintraub, S., Salmon, D., Mercaldo, N., Ferris, S., Graff-Radford, N. R., Chui, H., Cummings, J., DeCarli, C., Foster, N. L., Galasko, D., et al. (2009). The Alzheimer’s disease centers’ uniform data set (UDS): The neuropsychological test battery. Alzheimer Disease and Associated Disorders 23, 91.
Wilcoxon (1992) Wilcoxon, F. (1992). Individual comparisons by ranking methods. In Breakthroughs in statistics, pages 196–202. Springer.