This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A New Non-parametric Test for Multivariate Paired Data and Pair Matching

Jingru Zhang
Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
Hao Chen
Department of Statistics, University of California, Davis
Xiao-Hua Zhou
Beijing International Center for Mathematical Research and School of Public Health,
Peking University
Abstract

In paired design studies, it is common to have multiple measurements taken for the same set of subjects under different conditions. In observational studies, it is many times of interest to conduct pair matching on multiple covariates between a treatment group and a control group, and to test the treatment effect represented by multiple response variables on well pair-matched data. However, there is a lack of an effective test on multivariate paired data. The multivariate paired Hotelling’s T2T^{2} test can sometimes be used, but its power decreases fast as the dimension increases. Existing methods for assessing the balance of multiple covariates in matched observational studies usually ignore the paired structure and thus they do not perform well under some settings. In this work, we propose a new non-parametric test for paired data, which exhibits a substantial power improvement over existing methods under a wide range of situations. We also derive the asymptotic distribution of the new test and the approximate pp-value is reasonably accurate under finite samples through simulation studies even when the dimension is larger than the sample size, making the new test an easy-off-the-shelf tool for real applications. The proposed test is illustrated through an analysis of a real data set on the Alzheimer’s disease research.

Keywords: Graph-based test; Nonparametric test; Paired-comparison permutation null distribution; Pair matching.

1 Introduction

Paired data are prevalent in paired design studies, where each subject is measured twice at two different time points or under two different situations. In many applications, multiple variables are measured and it is of interest to test whether they are significantly different under the two situations. For example, in studying the Alzheimer’s disease, neuropsychologic evaluation plays a central role in the clinical characterization of the disease (McKhann et al., 1984; Weintraub et al., 2009). It provides confirmatory evidence of the diagnosis of dementia. In a typical study, cognitive performance of participants is evaluated at approximately annual visits. Variables, such as total number of animals named in 60 seconds, orientation subscale score, total number of story units recalled from the current test administration, and so on, are recorded to reflect the performance, and it is of scientific interest to find out whether there is a significant difference in their neuropsychologic measures after several years. In addition, pair-matched data are also frequently encountered in causal inference. For instance, sex differences in the Alzheimer’s disease have aroused more and more attention (Carter et al., 2012; Mazure and Swendsen, 2016). In particular, an important study is to compare neuropsychologic performances between well-matched female participants and male participants. In this case, the two groups are matched well on a few covariates such as age and BMI, and then the test is done on their neuropsychology measurement variables.

For two-sample comparison on paired data, the conventional way is to take the difference between the corresponding values from the two data sets and convert the problem into a one-sample problem. If only one variable is measured, there are many choices, including the paired tt-test, sign test, signed-rank test, paired-comparison permutation test, and general scoring systems (Wilcoxon, 1992; Conover, 1998). When more than one variable is measured, the multivariate paired Hotelling’s T2T^{2} test has been widely applied. However, unless the dimension of the data is very low, this test depends on whether the differences follow a multivariate Gaussian distribution. Also, the power of the test decreases quickly as the dimension increases. This is a common issue for parametric testing procedures as the number of parameters one has to estimate increases quickly as the dimension increases, such as the covariance matrix, unless strong assumptions are imposed. For example, in a real application on the Alzheimer’s disease research, 22 neuropsychology measurement variables are collected (details of the data are provided in Section 4). When the multivariate paired Hotelling’s T2T^{2} test is applied to test whether there is any difference in neuropsychology measures between the initial visit and the visit after five years for the mild dementia group, its pp-value is 0.2121. So the test cannot reject the null hypothesis and concludes no significant difference. However, if applying the paired tt-test to each variable, we see that 11 out of 22 pp-values of the paired tt-test are less than 0.05 and the smallest one is 0.001 (Table 1). So the paired tt-test would reject the null hypothesis at 0.05 significance level even by the Bonferroni correction. Hence, the multivariate paired Hotelling’s T2T^{2} test does not meet the needs for many applications with multiple variables.

Table 1: The pp-values of applying the paired tt-test to each of the variables separately for the mild dementia group, the initial visit versus the visit after five years.
Variable pp-value
MMSEORDA 0.068
MMSEORLO 0.173
PENTAGON 0.534
NACCMMSE 0.022
LOGIMEM 0.303
MEMUNITS 0.315
MEMTIME 0.740
DIGIF 0.024
DIGIFLEN 0.014
DIGIB 0.027
DIGIBLEN 0.031
ANIMALS 0.090
VEG 0.039
TRAILA 0.041
TRAILARR 0.071
TRAILALI 0.418
TRAILB 0.005
TRAILBRR 0.227
TRAILBLI 0.013
WAIS 0.001
BOSTON 0.005
COGSTAT 0.643

Paired settings also arise naturally from matched observational studies, where a key question is that whether the covariates of a treated group and its matched control group are well balanced. Some approaches have been proposed to compare the distributions of the observed covariates, such as the method of combined differences (Hansen and Bowers, 2008) and the crossmatch test (Rosenbaum, 2005). More recently, Chen and Small (2020) proposed new tests (CrossNN and CrossMST) based on the similarity information of subjects. However, these existing tests do not fully utilize the paired structure, which makes them less effective under some circumstances as illustrated in our simulations (Section 3.2) and real application (Section 4.2).

In this paper, we propose a new non-parametric framework for paired data that can be applied to multivariate data with the dimension comparable with or possibly higher than the sample size (Section 2). This new non-parametric framework for paired data relies on a similarity graph constructed on the observations. We also derive the asymptotic distribution of the statistic under the paired-comparison permutation null distribution, facilitating its application to large data sets. The performance of the proposed test is examined through extensive simulation studies in Section 3. The new test is illustrated through a real data application on the Alzheimer’s disease research in Section 4. We discuss two other related statistics in Section 5 and conclude in Section 6.

2 A new non-parametric test for paired data based on a similarity graph

In two-sample tests, with mm observations in sample 1 and nn observations in sample 2, the two-sample permutation framework is to randomly assign mm out of the total m+nm+n pooled observations to sample 1 and the rest to sample 2 due to exchangeability of the observations under the null hypothesis of equal distribution. However, for paired data, the observations from the two samples are no longer exchangeable. On the other hand, the corresponding observations (paired observations) are exchangeable under the null hypothesis. Let (Xi,Yi)(X_{i},Y_{i}), i=1,,ni=1,\ldots,n, be the paired data. The paired-comparison permutation null distribution places probability 2n2^{-n} on each of 2n2^{n} choices of “assigning XiX_{i} to sample 1 and YiY_{i} to sample 2” or “assigning YiY_{i} to sample 1 and XiX_{i} to sample 2” for each ii. In the following, when there is no further specification, we use pr, EE, var and cov to denote probability, expectation, variance and covariance, respectively, under this paired-comparison permutation null distribution.

We first construct a similarity graph on the pool observations. The idea of utilizing a similarity graph has proven to be successful under the two-sample hypothesis testing problems for high-dimensional data (Friedman and Rafsky, 1979; Chen and Friedman, 2017). We use GG to denote the similarity graph, such as the minimum spanning tree (MST), which is a spanning tree that connects all observations with the sum of the distances of the edges in the tree minimized. Since the graph-based statistic is usually more powerful under a slightly denser graph (Friedman and Rafsky, 1979), we choose 5-MST as the similarity graph in our simulation studies and real application as recommended by Chen and Friedman (2017). Here, a kk-MST is the union of the 1st, ,k\ldots,kth MSTs, where the 1st MST is the MST and the jth (j>1j>1) MST is a spanning tree that connects all observations such that the sum of the edges in the tree is minimized under the constraint that it does not contain any edge in the 1st, ,(j1)\ldots,(j-1)th MSTs.

Since the graph is constructed on the pooled observations, it is unchanged under the paired-comparison permutation null distribution. For notation simplicity, we do not include GG as a subscript for the quantities depending on GG. Let R1R_{1} be the number of edges in the similarity graph connecting observations from sample 1, and R2R_{2} be the number of edges connecting observations from sample 2. Here, R1R_{1} and R2R_{2} are random variables under the paired-comparison permutation null distribution. The new test statistic is defined as

D=(R1E(R1)R2E(R2))T𝚺R1(R1E(R1)R2E(R2)),D=\begin{pmatrix}R_{1}-E(R_{1})\\ R_{2}-E(R_{2})\end{pmatrix}^{T}\mathbf{\Sigma}^{-1}_{R}\begin{pmatrix}R_{1}-E(R_{1})\\ R_{2}-E(R_{2})\end{pmatrix}, (1)

where 𝚺R=var((R1,R2)T)\mathbf{\Sigma}_{R}=\text{var}((R_{1},R_{2})^{T}). The test is rejected at level α\alpha if D>C(α)D>C(\alpha).

In the following, we derive exact analytic expressions for E(R1)E(R_{1}), E(R2)E(R_{2}), var(R1)\text{var}(R_{1}), var(R2)\text{var}(R_{2}) and cov(R1,R2)\text{cov}(R_{1},R_{2}), so that the proposed test statistic can be computed quickly (Section 2.1), and then discuss how to determine the critical value C(α)C(\alpha) in an analytic way (Section 2.2).

2.1 Analytic expressions for the new paired test statistic

Let N=2nN=2n be the total number of observations and let Zi=I(in)Xi+I(i>n)YinZ_{i}=I(i\leq n)X_{i}+I(i>n)Y_{i-n}, i=1,,Ni=1,\ldots,N, where I()I(\cdot) is the indicator function. Let gig_{i} be the indicator function that equals 1 when ZiZ_{i} is assigned to sample 1 under the paired-comparison permutation null distribution, and 0 if ZiZ_{i} is assigned to sample 2. It is easy to see that pr(gi=1)=0.5\text{pr}(g_{i}=1)=0.5. We use aba\wedge b to denote the minimum of aa and bb, and use aba\vee b to denote the maximum of aa and bb. Let ii and ii^{*} be the two indices belonging to pair 𝐢=(ii,ii)\mathbf{i}=(i\wedge i^{*},i\vee i^{*}), i.e., i=i+ni^{*}=i+n if ini\leq n, and i=ini^{*}=i-n if i>ni>n. Then, we always have gi+gi=1g_{i}+g_{i^{*}}=1. Let jj and jj^{*} be the two indices belonging to pair 𝐣=(jj,jj)\mathbf{j}=(j\wedge j^{*},j\vee j^{*}). Since assigning ZiZ_{i} to sample 1 is independent of assigning ZjZ_{j} (𝐣𝐢)(\mathbf{j}\neq\mathbf{i}) to sample 1, gig_{i} and gjg_{j} are independent.

For an edge in GG, we denote it by the indices of the nodes connected by the edge. Then by definition, we have

R1=(i,j)GI(gi=gj=1),\displaystyle R_{1}=\sum_{(i,j)\in G}I(g_{i}=g_{j}=1),
R2=(i,j)GI(gi=gj=0).\displaystyle R_{2}=\sum_{(i,j)\in G}I(g_{i}=g_{j}=0).

Here, we do not distinguish edge (i,j)(i,j) from edge (j,i)(j,i). For graph GG, let |G||G| be the number of edges in the graph. Let G1G_{1} be the subgraph in GG that connects observations from different pairs, i.e., G1G_{1} contains edges {(i,j)G:ji}\{(i,j)\in G:j\neq i^{*}\}. Let G1,iG_{1,i} be the subgraph in G1G_{1} that connects to node ii. Then |G1,i||G_{1,i}| is the degree of node ii in G1G_{1}. Let C1C_{1} be the number of pairs of edges (i,j),(i,j)G1(i,j),~{}(i^{*},j^{*})\in G_{1}, and C2C_{2} be the number of pairs of edges (i,j),(i,j)G1(i,j),~{}(i,j^{*})\in G_{1}. The analytic expressions are provided in the following theorem.

Theorem 1.

The analytic expressions of the expectations and variances under the paired-comparison permutation null are as follows:

E(R1)=E(R2)=14|G1|,\displaystyle E(R_{1})=E(R_{2})=\frac{1}{4}|G_{1}|,
var(R1)=var(R2)=116(|G1|+2C12C2)+116i=1n(|G1,i||G1,i|)2,\displaystyle\text{var}(R_{1})=\text{var}(R_{2})=\frac{1}{16}(|G_{1}|+2C_{1}-2C_{2})+\frac{1}{16}\sum_{i=1}^{n}(|G_{1,i}|-|G_{1,i^{*}}|)^{2},
cov(R1,R2)=116(|G1|+2C12C2)116i=1n(|G1,i||G1,i|)2.\displaystyle\text{cov}(R_{1},R_{2})=\frac{1}{16}(|G_{1}|+2C_{1}-2C_{2})-\frac{1}{16}\sum_{i=1}^{n}(|G_{1,i}|-|G_{1,i^{*}}|)^{2}.
Remark 1.

The expectation and variance of (R1,R2)(R_{1},R_{2}) are very different from those under the permutation null for the two-sample test setting. Also, they only depend on G1G_{1}, the set of edges that connect observations from different pairs. This is because the two end nodes of an edge connecting observations from a pair would always be in different samples under the pair-comparison permutation null distribution. Hence, one could construct a similarity graph GG that does not include any edge connecting within a pair. It is reasonable to not use the information of edge connecting within pairs as they only reflect that the subjects are distinct enough rather than they are not changed under a different circumstance.

To ensure that the proposed test statistic DD is well defined, ΣR\Sigma_{R} needs to be invertible.

Theorem 2.

The proposed test statistic DD is well defined except the following two scenarios:

  1. 1.

    For each pair 𝐢\mathbf{i}, the two nodes have the same degree in G1G_{1}, i.e., |G1,i||G1,i|=0|G_{1,i}|-|G_{1,i^{*}}|=0, for all ii’s; or

  2. 2.

    |G1|+2C12C2=0|G_{1}|+2C_{1}-2C_{2}=0.

Remark 2.

This theorem follows straightforwardly from the analytic expression of ΣR\Sigma_{R} derived in Theorem 1. After some simplifications, we have that the determinant of ΣR\Sigma_{R} is

|ΣR|=164(|G1|+2C12C2)i=1n(|G1,i||G1,i|)2.|\Sigma_{R}|=\frac{1}{64}(|G_{1}|+2C_{1}-2C_{2})\sum_{i=1}^{n}(|G_{1,i}|-|G_{1,i^{*}}|)^{2}.

Hence, DD is well defined except for when |ΣR|=0|\Sigma_{R}|=0, i.e., the two scenarios in Theorem 2.

Let subG1𝐢,𝐣subG_{1}^{\mathbf{i},\mathbf{j}} be the subgraph of G1G_{1} that connects any nodes in the pairs of 𝐢\mathbf{i} and 𝐣\mathbf{j}. Then it has eight possible configurations if there is at least one edge in subG1𝐢,𝐣subG_{1}^{\mathbf{i},\mathbf{j}}. The eight configurations are shown in Figure 1.

Refer to caption
Figure 1: For (i,j)G1(i,j)\in G_{1}, eight possible subgraphs of G1G_{1} between pairs 𝐢\mathbf{i} and 𝐣\mathbf{j}.

Notice that

|G1|+2C12C2=pairs 𝐢,𝐣|subG1𝐢,𝐣|>0T(subG1𝐢,𝐣),|G_{1}|+2C_{1}-2C_{2}=\sum_{\begin{subarray}{c}\text{pairs }\mathbf{i},\mathbf{j}\\ |subG_{1}^{\mathbf{i},\mathbf{j}}|>0\end{subarray}}T(subG_{1}^{\mathbf{i},\mathbf{j}}),

where

T(subG1𝐢,𝐣):=\displaystyle T(subG_{1}^{\mathbf{i},\mathbf{j}}):= |subG1𝐢,𝐣|+2×pairs of two edges in subG1𝐢,𝐣 not sharing any node\displaystyle|subG_{1}^{\mathbf{i},\mathbf{j}}|+2\times\text{pairs of two edges in }subG_{1}^{\mathbf{i},\mathbf{j}}\text{ not sharing any node }
2×pairs of two edges in subG1𝐢𝐣 sharing a node.\displaystyle-2\times\text{pairs of two edges in }subG_{1}^{\mathbf{i}\mathbf{j}}\text{ sharing a node}.

It is not hard to see that

T(c)=T(d)=T(h)=0,T(a)=T(e)=T(f)=T(g)=1,T(b)=4.T(c)=T(d)=T(h)=0,\quad T(a)=T(e)=T(f)=T(g)=1,\quad T(b)=4.

Hence, if every subgraph subG1𝐢,𝐣subG_{1}^{\mathbf{i},\mathbf{j}} belongs to one of the three configurations, cc, dd and hh, then ΣR\Sigma_{R} is noninvertible.

2.2 Asymptotics

For the critical value C(α)C(\alpha), it can be determined by performing the paired-comparison permutation directly, which is however time consuming. To make the test more application friendly, we study the asymptotic distribution of the statistic DD.

Before stating the theorem, we define two additional terms on the similarity graph G1G_{1}: for edge eG1e\in G_{1}, let ee_{-} and e+e_{+} be the indices of the nodes connected by the edge ee.

Ae={(i,j)G1:i{e,e+,e,e+} or j{e,e+,e,e+}},\displaystyle A_{e}=\{(i,j)\in G_{1}:i\in\{e_{-},e_{+},e_{-}^{*},e_{+}^{*}\}\text{ or }j\in\{e_{-},e_{+},e_{-}^{*},e_{+}^{*}\}\},
Be=e~AeAe~.\displaystyle B_{e}=\cup_{\tilde{e}\in A_{e}}A_{\tilde{e}}.

We use a=O(b)a=O(b) to denote that aa and bb are of the same order, and a=o(b)a=o(b) to denote that aa is of a smaller order than bb.

To derive the asymptotic behavior of our statistic, we work under the following conditions for some γ>0\gamma>0:

Condition 1.

eG1|Ae||Be|=o(N1.5γ)\sum_{e\in G_{1}}|A_{e}||B_{e}|=o(N^{1.5\gamma}).

Condition 2.

i=1n(|G1,i||G1,i|)2=O(Nγ)\sum_{i=1}^{n}(|G_{1,i}|-|G_{1,i^{*}}|)^{2}=O(N^{\gamma}).

Condition 3.

|G1|+2C12C2=O(Nγ)|G_{1}|+2C_{1}-2C_{2}=O(N^{\gamma}).

Remark 3.

The parameter γ\gamma could be any positive number. For example, when γ=1\gamma=1, the conditions above become the following conditions (2), (3) and (4), respectively.

eG1|Ae||Be|=o(N1.5),\sum_{e\in G_{1}}|A_{e}||B_{e}|=o(N^{1.5}), (2)
i=1n(|G1,i||G1,i|)2=O(N),\sum_{i=1}^{n}(|G_{1,i}|-|G_{1,i^{*}}|)^{2}=O(N), (3)
|G1|+2C12C2=O(N).|G_{1}|+2C_{1}-2C_{2}=O(N). (4)

Here, Condition (2) sets a constraint on the number of edges sharing a pair in the graph G1G_{1} such that it cannot be too large. (A similar condition was proposed for graph-based statistics for independent observations and was discussed in Chen and Friedman (2017) and Chen et al. (2018).)

Conditions (3) and (4) ensure that (R1,R2)(R_{1},R_{2}) does not degenerate asymptotically. Let

L1={𝐢=(ii,ii):|G1,i||G1,i|}.L_{1}=\{\mathbf{i}=(i\wedge i^{*},i\vee i^{*}):|G_{1,i}|\neq|G_{1,i^{*}}|\}.

If |L1|=O(N)|L_{1}|=O(N) and (|G1,i||G1,i|)2=O(1),𝐢L1(|G_{1,i}|-|G_{1,i^{*}}|)^{2}=O(1),~{}\mathbf{i}\in L_{1}, then (3) could be satisfied.

Condition (4) sets a constraint on the structure of the graph G1G_{1}. As shown in the proof of Theorem 2, we obtain

|G1|+2C12C2=pairs 𝐢,𝐣|subG1𝐢,𝐣|>0T(subG1𝐢,𝐣)|G_{1}|+2C_{1}-2C_{2}=\sum_{\begin{subarray}{c}\text{pairs }\mathbf{i},\mathbf{j}\\ |subG_{1}^{\mathbf{i},\mathbf{j}}|>0\end{subarray}}T(subG_{1}^{\mathbf{i},\mathbf{j}})

and T(subG1𝐢,𝐣)=0,1T(subG_{1}^{\mathbf{i},\mathbf{j}})=0,1 or 44. Let

L2={subG1𝐢,𝐣:subG1𝐢,𝐣 contains at least one edge, i.e., |subG1𝐢,𝐣|>0},L_{2}=\{subG_{1}^{\mathbf{i},\mathbf{j}}:subG_{1}^{\mathbf{i},\mathbf{j}}\text{ contains at least one edge, i.e., }|subG_{1}^{\mathbf{i},\mathbf{j}}|>0\},
L3={subG1𝐢,𝐣L2:T(subG1𝐢,𝐣)0}.L_{3}=\{subG_{1}^{\mathbf{i},\mathbf{j}}\in L_{2}:T(subG_{1}^{\mathbf{i},\mathbf{j}})\neq 0\}.

We have

|G1|/4|L2||G1|.|G_{1}|/4\leq|L_{2}|\leq|G_{1}|.

Thus, |L2|=O(|G1|)|L_{2}|=O(|G_{1}|). If |G1|=O(N)|G_{1}|=O(N) and |L3|=O(|L2|)|L_{3}|=O(|L_{2}|), then (4) could be satisfied.

Theorem 3.

Under Conditions 1, 2 and 3, as NN\rightarrow\infty, ((R1E(R1))/var(R1)((R_{1}-E(R_{1}))/\sqrt{\text{var}(R_{1})}, (R2E(R2))/var(R2))T(R_{2}-E(R_{2}))/\sqrt{\text{var}(R_{2})})^{T} converges in distribution to a bivariate Gaussian distribution under the paired-comparison permutation null distribution.

Based on Theorem 3, it is easy to obtain the asymptotic distribution of DD.

Corollary 1.

Under Conditions 1, 2 and 3, as NN\rightarrow\infty,

Dχ22\quad D\longrightarrow\chi_{2}^{2}

in distribution under the paired-comparison permutation null distribution.

We reject the null hypothesis at α\alpha level when D>C(α)D>C(\alpha). Based on Corollary 1, C(α)C(\alpha) can be approximated by χ22(1α)\chi_{2}^{2}(1-\alpha), the (1α)(1-\alpha) quantile of the χ22\chi_{2}^{2} distribution.

We next check how well the rejection region D>χ22(1α)D>\chi_{2}^{2}(1-\alpha) can control Type I error through numerical studies. We consider the following three settings:

  • Setting 1 (S1): (X1,Y1)T,(X2,Y2)T,,(Xn,Yn)Tiid𝒩2d(ν,Γ)(X_{1},Y_{1})^{T},(X_{2},Y_{2})^{T},\ldots,(X_{n},Y_{n})^{T}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}_{2d}(\nu,\Gamma);

  • Setting 2 (S2): (X1,Y1)T,(X2,Y2)T,,(Xn,Yn)Tiidmultivariate t3(ν,Γ)(X_{1},Y_{1})^{T},(X_{2},Y_{2})^{T},\ldots,(X_{n},Y_{n})^{T}\stackrel{{\scriptstyle iid}}{{\sim}}\text{multivariate }t_{3}(\nu,\Gamma);

  • Setting 3 (S3): log(X1,Y1)T,log(X2,Y2)T,,log(Xn,Yn)Tiid𝒩2d(ν,Γ)\log(X_{1},Y_{1})^{T},\log(X_{2},Y_{2})^{T},\ldots,\log(X_{n},Y_{n})^{T}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}_{2d}(\nu,\Gamma).

Here, νT=(ν1,ν2)T\nu^{T}=(\nu_{1},\nu_{2})^{T} and Γ=(Γ1Γ12Γ12Γ2)\Gamma=\begin{pmatrix}\Gamma_{1}&\Gamma_{12}\\ \Gamma_{12}&\Gamma_{2}\end{pmatrix} with ν1=ν2=𝟎d\nu_{1}=\nu_{2}=\mathbf{0}_{d}, Γ1=Γ2=𝐈d\Gamma_{1}=\Gamma_{2}=\mathbf{I}_{d} and Γ12=0.6𝐈d\Gamma_{12}=0.6\mathbf{I}_{d}. Then, in each setting, the distributions of (Xi,Yi)T(X_{i},Y_{i})^{T} and (Yi,Xi)T(Y_{i},X_{i})^{T} are the same.

Table 2: Examination of empirical size for data generated from a multivariate normal distribution (S1), a multivariate tt distribution (S2), and a multivariate log-normal distribution (S3)

(a) Empirical size at 0.05 nominal level

d=10d=10 d=100d=100
n=25n=25 n=50n=50 n=100n=100 n=150n=150 n=25n=25 n=50n=50 n=100n=100 n=150n=150
(S1) 0.038 0.038 0.046 0.036 0.046 0.042 0.039 0.048
(S2) 0.040 0.043 0.030 0.045 0.041 0.048 0.042 0.041
(S3) 0.039 0.025 0.042 0.040 0.034 0.040 0.040 0.037

(b) Empirical size at 0.1 nominal level

d=10d=10 d=100d=100
n=25n=25 n=50n=50 n=100n=100 n=150n=150 n=25n=25 n=50n=50 n=100n=100 n=150n=150
(S1) 0.081 0.087 0.081 0.088 0.085 0.089 0.094 0.100
(S2) 0.077 0.100 0.078 0.095 0.084 0.094 0.095 0.096
(S3) 0.079 0.072 0.085 0.091 0.084 0.092 0.088 0.088

Table 2 shows the empirical sizes of the proposed test under 0.05 and 0.1 nominal levels. The empirical size is computed as the fraction of trials (out of 1000) that the test statistic is greater than χ22(1α)\chi_{2}^{2}(1-\alpha). We see that the empirical sizes are well controlled for the proposed test even for quite small sample sizes under both low and high dimensions in all three settings.

3 Performance of the proposed test

In this section, we evaluate the performance of the paired graph-based test DD through two-sample testing for paired nonindependent data and pair matching. The proposed test is compared with the multivariate paired Hotelling’s T2T^{2} test (pHT) in Section 3.1, and compared with pHT and four existing tests for matching: the method of combined differences (Hansen and Bowers, 2008), the crossmatch test (Rosenbaum, 2005), CrossNN and CrossMST tests (Chen and Small, 2020) in Section 3.2. The levels of the tests are all set to be 0.05. For the paired graph-based test, we reject the test when D>χ22(0.95)D>\chi_{2}^{2}(0.95). For the paired Hotelling’s T2T^{2} test, let Ti=XiYi(i=1,,n)T_{i}=X_{i}-Y_{i}~{}(i=1,\ldots,n), T¯=1ni=1nTi\bar{T}=\frac{1}{n}\sum_{i=1}^{n}T_{i}, and ΣT=1n1i=1n(TiT¯)(TiT¯)T\Sigma_{T}=\frac{1}{n-1}\sum_{i=1}^{n}(T_{i}-\bar{T})(T_{i}-\bar{T})^{T}. The null hypothesis is rejected if

(nd)nd(n1)T¯TΣT1T¯>Fd,nd(0.95),\frac{(n-d)n}{d(n-1)}\bar{T}^{T}\Sigma_{T}^{-1}\bar{T}>F_{d,n-d}(0.95),

where Fd,nd(0.95)F_{d,n-d}(0.95) denotes the 0.95 quantile of an F-distribution with dd and ndn-d degrees of freedom.

3.1 Two-sample testing for paired nonindependent data

We first examine the performance of the proposed test for data from the same family of distributions. We consider the three settings (S1), (S2) and (S3) in Section 2.2 but with different choices of ν\nu and Γ\Gamma. Let the number of pairs nn be fixed at the moderate size with n=60n=60. We assess how the proposed statistic behaves when the dimension is comparable to or larger than the number of pairs by considering three dimensions: d=50d=50, d=100d=100 and d=1000d=1000. For each setting, we consider two alternatives for each dd listed below.

  1. (i)

    Only ν1\nu_{1} differs from ν2\nu_{2} with ν1=𝟎d\nu_{1}=\mathbf{0}_{d}, ν2=0.5d1/4𝟏d\nu_{2}=0.5d^{-1/4}\mathbf{1}_{d}, Γ1=Γ2=𝐈d\Gamma_{1}=\Gamma_{2}=\mathbf{I}_{d} and Γ12=0.6𝐈d\Gamma_{12}=0.6\mathbf{I}_{d}.

  2. (ii)

    Parameter ν1\nu_{1} differs from ν2\nu_{2} and Γ1\Gamma_{1} differs from Γ2\Gamma_{2} with ν1=𝟎d\nu_{1}=\mathbf{0}_{d}, ν2=0.5d1/4𝟏d\nu_{2}=0.5d^{-1/4}\mathbf{1}_{d}, Γ1=𝐈d\Gamma_{1}=\mathbf{I}_{d}, Γ2=cd𝐈d\Gamma_{2}=c_{d}\mathbf{I}_{d} and Γ12=0.6cd1/2𝐈d\Gamma_{12}=0.6c_{d}^{1/2}\mathbf{I}_{d}, where cd=1.15c_{d}=1.15 for d=50d=50, cd=1.1c_{d}=1.1 for d=100d=100 and cd=1.05c_{d}=1.05 for d=1000d=1000.

Table 3: Estimated power at 0.05 significance level based on 1000 runs. The larger estimated power under each setting is in bold

(a) Data from the same family of distributions (i) ν1ν2\nu_{1}\neq\nu_{2}, Γ1=Γ2\Gamma_{1}=\Gamma_{2} (ii) ν1ν2\nu_{1}\neq\nu_{2}, Γ1Γ2\Gamma_{1}\neq\Gamma_{2} dd 50 100 1000 50 100 1000 multivariate normal pHT 0.788 - - 0.729 - - DD 0.709 0.754 0.810 0.972 0.974 0.999 multivariate t pHT 0.750 - - 0.671 - - DD 0.826 0.823 0.543 0.942 0.941 0.916 multivariate log-normal pHT 0.450 - - 0.611 - - DD 0.780 0.821 0.847 0.995 0.993 0.992

(b) Data from different families of distributions Δ=𝟎d\Delta=\mathbf{0}_{d} Δ=0.2𝟏d\Delta=0.2\mathbf{1}_{d} dd 50 100 1000 50 100 1000 τiiidnormal\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{normal} pHT 0.053 - - 0.429 - - DD 0.643 0.758 0.980 0.752 0.928 1.000 τiiidskew normal\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{skew normal} pHT 0.039 - - 0.416 - - DD 0.566 0.658 0.960 0.639 0.856 1.000 τiiidLaplace\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{Laplace} pHT 0.041 - - 0.440 - - DD 0.358 0.486 0.920 0.571 0.778 1.000

The results under various scenarios are summarized in Table 3. We first look at the results under moderate-dimensional settings (d=50d=50). When only ν1\nu_{1} differs from ν2\nu_{2}, the paired Hotelling’s T2T^{2} test has high power for the multivariate normal and multivariate tt distributions, but is not good for the multivariate log-normal distribution. When ν1ν2\nu_{1}\neq\nu_{2} and Γ1Γ2\Gamma_{1}\neq\Gamma_{2}, the performance of paired Hotelling’s T2T^{2} is much worse than the proposed test DD, especially for the multivariate log-normal distribution. For results under dimensions d=100d=100 and d=1000d=1000, DD exhibits pretty good performance under all the settings, while the paired Hotelling’s T2T^{2} cannot be applied as d>nd>n.

Additionally, under setting (ii), ν1\nu_{1} and ν2\nu_{2} are set to be the same as those under setting (i), while Γ2\Gamma_{2} becomes different from Γ1\Gamma_{1}. Thus, the difference between the two samples should be more significant. Comparing the results under setting (ii) with those under setting (i), we observe an increasing power of DD for the three distributions, while the power of pHT decreases for the multivariate normal and multivariate tt distributions.

Now we examine the performance of DD when XiX_{i} and YiY_{i} are from different families of distributions. Suppose Xi,YidX_{i},Y_{i}\in\mathbb{R}^{d}, X1,,XniidFXX_{1},\ldots,X_{n}\stackrel{{\scriptstyle iid}}{{\sim}}F_{X}, Y1,,YniidFYY_{1},\ldots,Y_{n}\stackrel{{\scriptstyle iid}}{{\sim}}F_{Y}. We generate data by

Xi=αi+ϵi,Yi=Δ+αi+τi,i=1,,n,X_{i}=\alpha_{i}+\epsilon_{i},\quad Y_{i}=\Delta+\alpha_{i}+\tau_{i},\quad i=1,\ldots,n,

where Δ\Delta is a constant vector, and αi,ϵi,τi\alpha_{i},\epsilon_{i},\tau_{i} are independent with each other. Let αiiid𝒩d(𝟎d,Ω1)\alpha_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}_{d}(\mathbf{0}_{d},\Omega_{1}) and ϵiiidt3(𝟎d,Ω2/3)\epsilon_{i}\stackrel{{\scriptstyle iid}}{{\sim}}t_{3}(\mathbf{0}_{d},\Omega_{2}/3). We consider three types of distributions for τi\tau_{i}: (i) τiiid𝒩d(𝟎d,Ω2)\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}_{d}(\mathbf{0}_{d},\Omega_{2}), (ii) τiiidmultivariate skew normal distribution with mean 𝟎d\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{multivariate skew normal distribution with mean }\mathbf{0}_{d}, variance Ω2\Omega_{2}, skewness 1 and (iii) τiiidmultivariate Laplace distribution with mean 𝟎d\tau_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{multivariate Laplace distribution with mean }\mathbf{0}_{d}, variance Ω2\Omega_{2}. Therefore, E(Xi)=𝟎dE(X_{i})=\mathbf{0}_{d}, E(Yi)=ΔE(Y_{i})=\Delta, var(Xi)=var(Yi)=Ω1+Ω2\text{var}(X_{i})=\text{var}(Y_{i})=\Omega_{1}+\Omega_{2} and cov(Xi,Yi)=Ω1\text{cov}(X_{i},Y_{i})=\Omega_{1}. Let n=60n=60 and d=50,100,1000d=50,100,1000. Let Ω1=(Ωij)\Omega_{1}=(\Omega_{ij}) with the (i,j)(i,j) element of Ω\Omega being Ωij=0.5|ij|\Omega_{ij}=0.5^{|i-j|} and Ω2=𝐈d\Omega_{2}=\mathbf{I}_{d}. For each dd, we consider two settings for Δ\Delta: Δ=𝟎d\Delta=\mathbf{0}_{d} and Δ=0.2𝟏d\Delta=0.2\mathbf{1}_{d}.

Table 3 gives the results. We first take a look at the results when there is no mean difference between the two samples (Δ=𝟎d\Delta=\mathbf{0}_{d}). When Δ=𝟎d\Delta=\mathbf{0}_{d}, the means and variances of XiX_{i} and YiY_{i} are the same, while the two distributions FXF_{X} and FYF_{Y} are different, and we would expect a powerful test to reject the null hypothesis. However, the paired Hotelling’s T2T^{2} test has almost no power under d=50d=50 and cannot be applied to high-dimensional scenarios (d=100,1000d=100,1000), so it cannot detect the shape difference between two distributions. The proposed test DD performs well under all these scenarios. When there is a mean difference between the two samples (Δ=0.2𝟏d\Delta=0.2\mathbf{1}_{d}), the paired Hotelling’s T2T^{2} exhibits some power under d=50d=50, though the proposed test DD has a higher power. For the high dimensional scenarios (d=100,1000d=100,1000), the paired Hotelling’s T2T^{2} is inapplicable, while the power of DD is very high.

3.2 Assessing covariate balance for pair matching

Following a similar simulation setting as in Franklin et al. (2014), we assume that 20-dimensional covariates 𝐗(1)=(X1,,X20)T\mathbf{X}_{(1)}=(X_{1},\cdots,X_{20})^{T} are observed, where Xj(j=1,,20)X_{j}~{}(j=1,\cdots,20) independent identically follows Laplace distribution with mean 0 and variance 0.65. The exposure/treatment, TT, depends on these 20 covariates as well as three unobserved transformations of them (𝐗(2)=(X21,X22,X23)T=(sin(X1)/4,cos(X2)/4,X34)T\mathbf{X}_{(2)}=(X_{21},X_{22},X_{23})^{T}=(\sin(X_{1})/4,\cos(X_{2})/4,X_{3}^{4})^{T}). We simulate TT as a binary variable via the logistic model logit{P(T=1)}=α0+𝜶1T𝐗(1)+𝜶2T𝐗(2)\text{logit}\{P(T=1)\}=\alpha_{0}+\bm{\alpha}_{1}^{T}\mathbf{X}_{(1)}+\bm{\alpha}_{2}^{T}\mathbf{X}_{(2)}, where 𝜶1\bm{\alpha}_{1} and 𝜶2\bm{\alpha}_{2} define the log-odds ratios between exposures and controls in the pre-matched data set. We generated 1000 subjects and determined whether each of them is exposed or not. Then the control subjects (T=0)(T=0) are matched to exposed subjects (T=1)(T=1) through their propensity scores on 𝐗(1)\mathbf{X}_{(1)}.

We consider the following four scenarios.

  1. (i)

    Zero coefficients for both 𝐗(1)\mathbf{X}_{(1)} and 𝐗(2)\mathbf{X}_{(2)}: 𝜶1=𝟎20\bm{\alpha}_{1}=\mathbf{0}_{20} and 𝜶2=𝟎3\bm{\alpha}_{2}=\mathbf{0}_{3}.

  2. (ii)

    Nonzero coefficients for observed covariates 𝐗(1)\mathbf{X}_{(1)} only: 𝜶1=0.2𝟏20\bm{\alpha}_{1}=0.2\mathbf{1}_{20} and 𝜶2=𝟎3\bm{\alpha}_{2}=\mathbf{0}_{3}.

  3. (iii)

    Nonzero coefficients for unobserved covariates 𝐗(2)\mathbf{X}_{(2)} only: 𝜶1=𝟎20\bm{\alpha}_{1}=\mathbf{0}_{20} and 𝜶2=0.33𝟏3\bm{\alpha}_{2}=0.33\mathbf{1}_{3}.

  4. (iv)

    Nonzero coefficients for both 𝐗(1)\mathbf{X}_{(1)} and 𝐗(2)\mathbf{X}_{(2)}: 𝜶1=0.2𝟏20\bm{\alpha}_{1}=0.2\mathbf{1}_{20} and 𝜶2=0.33𝟏3\bm{\alpha}_{2}=0.33\mathbf{1}_{3}.

Here, scenario (i) is used to examine the empirical size since the covariates of treated subjects and controls are generated from the same distribution. For each scenario, we simulate 1000 data sets. Define the standardized mean difference as SD1=(x¯1x¯0)/(s12+s02)/2\text{SD1}=(\bar{x}_{1}-\bar{x}_{0})/\sqrt{(s_{1}^{2}+s_{0}^{2})/2} where x¯m\bar{x}_{m} and sms_{m} are the sample mean and variance for treated subjects (m=1)(m=1) and controls (m=0)(m=0). For each of the 23 covariates, we calculate the standardized mean difference between the treatment group and control group before and after matching. Figure 2 shows the boxplots of standardized mean difference of 1000 data set with the left panels denoting before matching and right panels denoting after matching. We see that the standardized mean differences of the observed covariates 𝐗(1)\mathbf{X}_{(1)} are relatively close to 0 after matching under all scenarios. However, the unobserved covariates 𝐗(2)\mathbf{X}_{(2)} are very unbalanced after matching when 𝜶2𝟎3\bm{\alpha}_{2}\neq\mathbf{0}_{3} (scenarios (iii) and (iv)), which indicates the distribution of (X1,X2,X3X_{1},X_{2},X_{3}) is not well balanced. Since we aim to test whether the joint distributions of the covariates in the matched control group and the treatment group are the same, we would expect that a good test rejects the null hypothesis under scenarios (iii) and (iv), while it accepts the null hypothesis under scenarios (i) and (ii).

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2: Boxplots of standardized mean differences between the treatment group and control group over 1000 runs under each scenario. Left: before matching; right: after matching.
Table 4: The proportion of trials (out of 1000) that the test rejects covariate balance at 0.05 significance level. The largest estimated power under scenarios (iii) and (iv) is in bold
Scenario pHT CD CM NN1\text{NN}1 NN2\text{NN}2 MST1\text{MST}1 MST2\text{MST}2 DD
(i) 0.000 0.000 0.030 0.018 0.024 0.029 0.035 0.050
(ii) 0.000 0.000 0.018 0.020 0.034 0.022 0.040 0.069
(iii) 0.003 0.001 0.579 0.510 0.677 0.473 0.537 0.907
(iv) 0.010 0.005 0.540 0.530 0.670 0.505 0.515 0.925

Denote the method of combined differences and crossmatch test by CD and CM, respectively. To apply the CrossNN and CrossMST tests, we consider 1-MST and 5-MST as the similarity graph and denote these tests by NN1\text{NN}1, NN2\text{NN}2, MST1\text{MST}1 and MST2\text{MST}2, respectively. We present the proportion of trials that the tests reject the null hypothesis at 0.05 significance level in Table 4. We see all tests control empirical size well under scenario (i) and they deem the covariates well balanced under the reasonably balanced scenario (ii). For the results under scenarios (iii) and (iv), we observe the proposed test DD performs best with the highest power as expected. The crossmatch, CrossNN and CrossMST tests also show some power, while it is much less effective than the proposed test DD. The paired Hotelling’s test and the method of combined differences do not have power under scenarios (iii) and (iv).

4 A real application on Alzheimer’s disease research

In this section, we illustrate the newly developed test in a research project studying the Alzheimer’s disease. The data were collected by multiple Alzheimer’s Disease Centers between September 2005 and December 2018 (see Beekly et al. (2007) for details). Participants were evaluated on cognitive performance at their initial visits, and latter approximate annual follow-up visits. The information is recorded in the Uniform Data Set (UDS), which is a longitudinal, standardized data set maintained by the National Alzheimer’s Coordinating Center (NACC) and more information can be found in https://www.alz.washington.edu/. The data set consists of about 725 variables obtained from approximately annual comprehensive evaluations of 39,412 research volunteers as of the December 2018 data freeze. In our study, we requested for data containing participants with at least 4 visits.

4.1 Comparison of neuropsychologic performances between two visits

We first study the participants’ neuropsychologic performance over time, so we focused on the Neuropsychological Battery Summary Scores in Form C1, which contains 22 neuropsychology measurement variables. A description of these 22 variables are provided in LABEL:app:variable. We group the participants into three groups according to the CDR R Dementia Staging Instrument in their first visits: no dementia (Group I), very mild dementia (Group II), and mild dementia (Group III). We test whether their neuropsychologic performances in five years differ from those in their initial visits. After removing missing data, the sample sizes (nn) of three groups are 1747, 539, and 41, respectively.

Table 5: Test results of pHT and DD (bold for those pp-values << 0.05)
Group I Group II Group III
statistic pp-value statistic pp-value statistic pp-value
pHT 27.09 <<1e-4 11.83 <<1e-4 1.44 0.2121
DD 39.78 <<1e-4 37.93 <<1e-4 7.60 0.0224

Table 5 presents the results of the paired Hotelling’s T2T^{2} test (pHT) and the proposed test DD. We first check the results for Group I in testing whether their neuropsychologic measures in five years differ from those in their initial visits. We see that both the paired Hotelling’s T2T^{2} test and our new test DD reject the null hypothesis with extremely small pp-value. Thus, it is clear there is a significant difference in the neuropsychologic measures between the two visits. Similarly, we would reject the null hypothesis for Group II.

For Group III, as the sample size is relatively small, the power of the tests decreases. We see that the paired Hotelling’s T2T^{2} test cannot reject the null hypothesis, while the proposed test DD rejects the null hypothesis with a small pp-value. To see which result is more reliable, we check the data in more details. In particular, we perform the paired tt-test to each of the 22 neuropsychologic measurement variables. It turns out that the paired tt-test has pp-value less than 0.05 for 11 out of these 22 variables and the smallest pp-value is 0.001 (Table 1). Then, by the Bonferroni correction, the paired tt-test would reject the null hypothesis at 0.05 significance level (0.05/22=0.002>0.0010.05/22=0.002>0.001), supporting the result from DD.

These results reveal that for all the participants from the three groups, neuropsychologic measures after several years would become significantly different from those in the initial visits. It suggests that researchers pay attention to the changes of cognitive performance even for no dementia or (very) mild dementia participants.

4.2 Assessment of covariate balance in pair matching for examining sex effect

Several papers have previously reported a sex effect on Alzheimer’s disease (Payami et al., 1996; Podcasy and Epperson, 2016; Grimm et al., 2016). In this study, we focus on the data from the initial visit and only consider the white participant who has no stroke, no transient ischemic attack, no serious heart problem (e.g., atrial fibrillation, cardiac bypass procedure, congestive heart failure), no diabetes, no brain trauma, no Parkinson’s disease, no seizures, no other neurological and psychiatric disorder, do not abuse substances (except alcohol), and meanwhile, no Frontotemporal lobar degeneration mutation and Alzheimer’s disease mutation in her/his family, no cognitive impairment in her/his first-degree family. There are 13 covariates in total under consideration (Table 6). Here, each of TOBAC30, TOBAC100, CVHATT and DEP2YRS is a 0-1 variable, which is transformed to one dummy variable. CVANGIO and ALCOHOL are categorical variables with three categories: absent, recent/active, remote/inactive, and both of them are transformed to two dummy variables, respectively. We match 105 male participants with 105 female participants, among 153 female participants, expecting the covariates in the control (female) group are balanced with the exposed/treated (male) group. We adopt pairmatch() function in the optmatch package in R (Hansen, 2007) on treated to control distances with the distance computed from the match_on() using the logit propensity scores.

Table 6: Summary of covariates and statistics before and after the pair matching of a male participant and a female participant

(a) Matching results of Section 4.2

Before matching After matching
Description Covariates Male Female SD1 SD2 SD3 Female SD1 SD2 SD3
Age at visit NACCAGE 72.762 72.092 0.081 -0.130 0.048 72.848 -0.010 -0.149 0.247
Body mass index (BMI) NACCBMI 26.827 26.852 -0.005 -0.489 -0.367 26.035 0.169 -0.354 0.073
Total years smoked cigarettes SMOKYRS 9.114 9.248 -0.010 -0.238 -0.516 9.133 -0.001 -0.080 -0.178
Blood pressure, systolic BPSYS 135.152 133.915 0.068 -0.079 0.091 134.419 0.041 -0.044 0.229
Blood pressure, diastolic BPDIAS 75.505 76.307 -0.084 -0.329 -0.069 75.66 -0.016 -0.258 -0.038
Smoked cigarettes in last 30 days TOBAC30 0.019 0.033 -0.086 -0.510 -2.869 0.019 0.000 0.000 0.000
Smoked >> 100 cigarettes in life TOBAC100 0.486 0.425 0.122 0.022 -0.238 0.495 -0.019 -0.001 0.038
Heart attack CVHATT 0.114 0.013 0.422 1.534 4.741 0.019 0.387 1.364 4.037
Angioplasty/endarterectomy/stent CVANGIO
Recent/Active CVANGIO1 0.019 0.000 0.196 1.981 19.620 0.000 0.196 1.981 19.620
Remote/Inactive CVANGIO2 0.095 0.026 0.291 1.078 3.414 0.019 0.331 1.275 4.252
Alcohol abuse ALCOHOL
Recent/Active ALCOHOL1 0.019 0.007 0.111 0.960 8.080 0.010 0.080 0.652 5.156
Remote/Inactive ALCOHOL2 0.029 0.020 0.058 0.360 2.112 0.029 0.000 0.000 0.000
Active depression in last 2 years DEP2YRS 0.229 0.294 -0.149 -0.162 0.120 0.210 0.046 0.062 -0.007

(b) Matching results of Section 4.3 Before matching After matching Description Covariates Male Female SD1 SD2 SD3 Female SD1 SD2 SD3 Age at visit NACCAGE 74.750 73.500 0.162 -0.350 0.258 74.684 0.009 -0.180 0.367 Body mass index (BMI) NACCBMI 26.429 26.894 -0.095 -0.698 -0.816 25.961 0.106 -0.404 -0.844 Total years smoked cigarettes SMOKYRS 9.882 10.308 -0.031 -0.054 -0.039 11.000 -0.079 -0.114 -0.183 Smoked cigarettes in last 30 days TOBAC30 0.026 0.025 0.008 0.049 0.278 0.026 0.000 0.000 0.000

To measure the balance of covariates in the control group and the treatment group, we use the standardized mean difference (SD1), and define the standardized variance difference (SD2) and standardized third central moment difference (SD3) as

SD2=2(s1s0)s12+s02,SD3=232(ν1ν0)(s12+s02)32,\text{SD2}=\frac{2(s_{1}-s_{0})}{s_{1}^{2}+s_{0}^{2}},\quad\text{SD3}=\frac{2^{\frac{3}{2}}(\nu_{1}-\nu_{0})}{(s_{1}^{2}+s_{0}^{2})^{\frac{3}{2}}},

where sms_{m} and νm\nu_{m} are the sample variance and the third sample central moment for treated subjects (m=1)(m=1) and controls (m=0)(m=0). Table 6 lists means, SD1, SD2 and SD3 before and after matching. We see that the covariates are not balanced before matching. For example, the values of SD1, SD2 and SD3 for CVHATT are 0.422, 1.534 and 4.741. After matching, it seems that the covariates are still not balanced well. For example, SD3 of the four covariates, CVHATT, CVANGIO1, CVANGIO2 and ALCOHOL1, are larger than 4.

Now we apply the paired Hotelling’s T2T^{2} (pHT), the method of combined differences (CD), the crossmatch (CM), the CrossNN (NN1, NN2), the CrossMST (MST1, MST2) and the proposed test (DD) to the paired data after matching. Only our proposed test gives small pp-value 0.005. By contrast, the pp-values of pHT, CD, CM, NN1, NN2, MST1 and MST2 are 0.087, 0.100, 0.669, 0.541, 0.219, 0.249, 0.266, respectively, which indicates that the other tests cannot reject the null hypothesis of balanced covariates at 0.05 significance level. To explore which result is more reliable, we apply the paired tt-test to each of the 13 covariates. It turns out that the smallest pp-value of the paired tt-test is 0.001. So the paired tt-test would reject the null hypothesis at 0.05 significance level (0.05/13=0.004>0.0010.05/13=0.004>0.001) by the Bonferroni correction, supporting the result of DD.

4.3 Comparison of neuropsychologic performances for pair-matched data

Another question of interest is to compare the neuropsychologic performances between well-matched female participants and male participants. To make the matching easier, we focus on the data from the initial visit and only consider the white participant who has no stroke, no transient ischemic attack, no atrial fibrillation, no heart problem, no angioplasty/endarterectomy/stent, no active depression in the last two years, no diabetes, no brain trauma, no Parkinson’s disease, no seizures, no alcohol abuse, no other neurological and psychiatric disorder, and meanwhile, no Frontotemporal lobar degeneration mutation and Alzheimer’s disease mutation in her/his family, no cognitive impairment in her/his first-degree family. In addition, only four covariates, NACCAGE, NACCBMI, SMOKYRS and TOBAC30, are under consideration now (Table 6). Adopting the same matching method as in Section 4.2, we match 76 male participants with 76 female participants, among 120 female participants.

The means, SD1, SD2 and SD3 before and after matching are provided in Table 6. We see the covariates of the data after matching are much more balanced than those in Section 4.2. We apply the proposed test DD and other existing tests to the matched covariates. Since the pp-values of pHT, CD, CM, NN1, NN2, MST1, MST2 and DD are 0.372, 0.366, 0.804, 0.280, 0.688, 0.082, 0.163 and 0.056, respectively, all these tests cannot reject the null hypothesis at 0.05 significance level. So we deem the covariates reasonably balanced jointly.

After obtaining well-matched participants, we apply the paired Hotelling’s T2T^{2} test (pHT) and the proposed test (DD) to test whether the neuropsychologic performances in the female group differ from those in the male group. Here, 22 neuropsychology measurement variables are considered as in Section 4.1. The pp-values of pHT and DD are 0.068 and 0.008, respectively, so the proposed test rejects the null hypothesis of equal neuropsychologic performances between the female group and the male group, while pHT accepts the null hypothesis at 0.05 significance level. We further explore the data by applying the paired tt-test to each of the 22 covariates. We see that the smallest pp-value of the paired tt-test is 0.001. The paired tt-test would reject the null hypothesis at 0.05 significance level (0.05/22=0.002>0.0010.05/22=0.002>0.001) by the Bonferroni correction. Therefore, the proposed test DD is more reliable and the sex may have an effect on Alzheimer’s disease.

5 Discussion

If we are only interested in the location difference or the scale difference. The following two statistics can be considered:

  • paired mean test

    Dm=R1+R2E(R1+R2)var(R1+R2).D_{m}=\frac{R_{1}+R_{2}-E(R_{1}+R_{2})}{\sqrt{\text{var}(R_{1}+R_{2})}}. (5)

    The test is rejected at level α\alpha if Dm>Cm(α)D_{m}>C_{m}(\alpha).

  • paired scale test

    Ds=R1R2E(R1R2)var(R1R2).D_{s}=\frac{R_{1}-R_{2}-E(R_{1}-R_{2})}{\sqrt{\text{var}(R_{1}-R_{2})}}. (6)

    The test is rejected at level α\alpha if |Ds|>Cs(α)|D_{s}|>C_{s}(\alpha).

The statistics DmD_{m} and DsD_{s} are effective for the location and scale alternatives, respectively.

Based on Theorem 3, the asymptotic distributions of DmD_{m} and DsD_{s} can be derived, and we could set Cm(α)=Φ1(1α)C_{m}(\alpha)=\Phi^{-1}(1-\alpha), the (1α)(1-\alpha) quantile of the standard normal distribution, and Cs(α)=Φ1(1α)C_{s}(\alpha)=\Phi^{-1}(1-\alpha), for easy implementation of the two tests.

Corollary 2.

Under Conditions 1, 2 and 3, as NN\rightarrow\infty,

(DmDs)𝒩((00),(1001))\begin{pmatrix}D_{m}\\ D_{s}\end{pmatrix}\longrightarrow\mathcal{N}\left(\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}1&0\\ 0&1\end{pmatrix}\right)

in distribution under the paired-comparison permutation null.

With the analytic expressions of the expectations and variances in Theorem 1, we could further study the relationship among the statistics DD, DmD_{m} and DsD_{s}.

Proposition 1.

We have

D=Dm2+Ds2 and cov(Dm,Ds)=0.D=D_{m}^{2}+D_{s}^{2}\text{ and }\text{cov}(D_{m},D_{s})=0.
Proof.

Let

𝐑=(R1E(R1)R2E(R2)),𝐃~=(DmDs)=(1var(R1+R2)1var(R1+R2)1var(R1R2)1var(R1R2))𝐑𝐁𝐑.\mathbf{R}=\begin{pmatrix}R_{1}-E(R_{1})\\ R_{2}-E(R_{2})\end{pmatrix},\quad\mathbf{\tilde{D}}=\begin{pmatrix}D_{m}\\ D_{s}\end{pmatrix}=\begin{pmatrix}\frac{1}{\sqrt{\text{var}(R_{1}+R_{2})}}&\frac{1}{\sqrt{\text{var}(R_{1}+R_{2})}}\\ \frac{1}{\sqrt{\text{var}(R_{1}-R_{2})}}&-\frac{1}{\sqrt{\text{var}(R_{1}-R_{2})}}\end{pmatrix}\mathbf{R}\triangleq\mathbf{B}\mathbf{R}.

It is easy to see that 𝐁\mathbf{B} is invertible. From the definition of DD (Equation (1)), it can be written as

D=𝐑T𝚺R1𝐑=(𝐁1𝐃~)T𝚺R1(𝐁1𝐃~)=𝐃~T(𝐁𝚺R𝐁T)1𝐃~.D=\mathbf{R}^{T}\mathbf{\Sigma}^{-1}_{R}\mathbf{R}=(\mathbf{B}^{-1}\mathbf{\tilde{D}})^{T}\mathbf{\Sigma}^{-1}_{R}(\mathbf{B}^{-1}\mathbf{\tilde{D}})=\mathbf{\tilde{D}}^{T}(\mathbf{B}\mathbf{\Sigma}_{R}\mathbf{B}^{T})^{-1}\mathbf{\tilde{D}}.

Plugging 𝐁\mathbf{B} and 𝚺R\mathbf{\Sigma}_{R}, we have 𝐁𝚺R𝐁T=(1001)\mathbf{B}\mathbf{\Sigma}_{R}\mathbf{B}^{T}=\begin{pmatrix}1&0\\ 0&1\end{pmatrix}. Then, D=𝐃~T𝐃~=Dm2+Ds2D=\mathbf{\tilde{D}}^{T}\mathbf{\tilde{D}}=D_{m}^{2}+D_{s}^{2}. ∎

6 Conclusion

Paired data arise naturally in a variety of applications. In many contemporary datasets, the number of measurements is often comparable to or even larger than the number of pairs. We propose a new non-parametric test for paired data upon a graph-based two-sample testing framework. Since all existing graph-based tests are inapplicable for paired data owing to the nature of paired observations, we propose to use the paired-comparison permutation null distribution and develop the statistic DD. The simulation studies demonstrate that DD works well for assessing covariate balance in pair matching besides testing paired data. It exhibits high power under a wide range of situations including shape, location and/or scale alternatives. Under the paired-comparison permutation null, the asymptotic distribution of our new statistic DD is also derived. The approximate pp-value based on the asymptotic result is reasonably accurate to the permutation pp-value under finite samples, making the test fast applicable in practice.

As one advantage inherited from the graph-based tests, the proposed test can also be applied to multi-type complex data, such as non-Euclidean data, as long as a reasonable distance can be defined over the observations.

Acknowledgements

Hao Chen was supported in part by NSF awards DMS-1513653 and DMS-1848579. Xiao-Hua Zhou was supported in part by the National Science Foundation of China (NSFC 81773546 and NSFC 12026606). The NACC database is funded by NIA/NIH Grant U24 AG072122. NACC data are contributed by the NIA-funded ADRCs: P30 AG019610 (PI Eric Reiman, MD), P30 AG013846 (PI Neil Kowall, MD), P50 AG008702 (PI Scott Small, MD), P50 AG025688 (PI Allan Levey, MD, PhD), P50 AG047266 (PI Todd Golde, MD, PhD), P30 AG010133 (PI Andrew Saykin, PsyD), P50 AG005146 (PI Marilyn Albert, PhD), P50 AG005134 (PI Bradley Hyman, MD, PhD), P50 AG016574 (PI Ronald Petersen, MD, PhD), P50 AG005138 (PI Mary Sano, PhD), P30 AG008051 (PI Thomas Wisniewski, MD), P30 AG013854 (PI Robert Vassar, PhD), P30 AG008017 (PI Jeffrey Kaye, MD), P30 AG010161 (PI David Bennett, MD), P50 AG047366 (PI Victor Henderson, MD, MS), P30 AG010129 (PI Charles DeCarli, MD), P50 AG016573 (PI Frank LaFerla, PhD), P50 AG005131 (PI James Brewer, MD, PhD), P50 AG023501 (PI Bruce Miller, MD), P30 AG035982 (PI Russell Swerdlow, MD), P30 AG028383 (PI Linda Van Eldik, PhD), P30 AG053760 (PI Henry Paulson, MD, PhD), P30 AG010124 (PI John Trojanowski, MD, PhD), P50 AG005133 (PI Oscar Lopez, MD), P50 AG005142 (PI Helena Chui, MD), P30 AG012300 (PI Roger Rosenberg, MD), P30 AG049638 (PI Suzanne Craft, PhD), P50 AG005136 (PI Thomas Grabowski, MD), P50 AG033514 (PI Sanjay Asthana, MD, FRCP), P50 AG005681 (PI John Morris, MD), P50 AG047270 (PI Stephen Strittmatter, MD, PhD).

References

  • Beekly et al. (2007) Beekly, D. L., Ramos, E. M., Lee, W. W., Deitrich, W. D., Jacka, M. E., Wu, J., Hubbard, J. L., Koepsell, T. D., Morris, J. C., Kukull, W. A., et al. (2007). The National Alzheimer’s Coordinating Center (NACC) database: the uniform data set. Alzheimer Disease & Associated Disorders 21, 249–258.
  • Carter et al. (2012) Carter, C. L., Resnick, E. M., Mallampalli, M., and Kalbarczyk, A. (2012). Sex and gender differences in alzheimer’s disease: recommendations for future research. Journal of Women’s Health 21, 1018–1023.
  • Chen et al. (2018) Chen, H., Chen, X., and Su, Y. (2018). A weighted edge-count two-sample test for multivariate and object data. Journal of the American Statistical Association 113, 1146–1155.
  • Chen and Friedman (2017) Chen, H. and Friedman, J. H. (2017). A new graph-based two-sample test for multivariate and object data. Journal of the American Statistical Association 112, 397–409.
  • Chen and Small (2020) Chen, H. and Small, D. S. (2020). New multivariate tests for assessing covariate balance in matched observational studies. Biometrics .
  • Conover (1998) Conover, W. J. (1998). Practical nonparametric statistics, volume 350. John Wiley & Sons.
  • Franklin et al. (2014) Franklin, J. M., Rassen, J. A., Ackermann, D., Bartels, D. B., and Schneeweiss, S. (2014). Metrics for covariate balance in cohort studies of causal effects. Statistics in Medicine 33, 1685–1699.
  • Friedman and Rafsky (1979) Friedman, J. H. and Rafsky, L. C. (1979). Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics 7, 697–717.
  • Grimm et al. (2016) Grimm, A., Mensah-Nyagan, A. G., and Eckert, A. (2016). Alzheimer, mitochondria and gender. Neuroscience & Biobehavioral Reviews 67, 89–101.
  • Hansen (2007) Hansen, B. B. (2007). Optmatch: Flexible, optimal matching for observational studies. New Functions for Multivariate Analysis 7, 18–24.
  • Hansen and Bowers (2008) Hansen, B. B. and Bowers, J. (2008). Covariate balance in simple, stratified and clustered comparative studies. Statistical Science 23, 219–236.
  • Mazure and Swendsen (2016) Mazure, C. M. and Swendsen, J. (2016). Sex differences in Alzheimer’s disease and other dementias. The Lancet. Neurology 15, 451.
  • McKhann et al. (1984) McKhann, G., Drachman, D., Folstein, M., Katzman, R., Price, D., and Stadlan, E. M. (1984). Clinical diagnosis of Alzheimer’s disease: report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer’s Disease. Neurology 34, 939–939.
  • Payami et al. (1996) Payami, H., Zareparsi, S., Montee, K. R., Sexton, G. J., Kaye, J. A., Bird, T. D., Yu, C.-E., Wijsman, E. M., Heston, L. L., Litt, M., et al. (1996). Gender difference in apolipoprotein E-associated risk for familial Alzheimer disease: a possible clue to the higher incidence of Alzheimer disease in women. American Journal of Human Genetics 58, 803.
  • Podcasy and Epperson (2016) Podcasy, J. L. and Epperson, C. N. (2016). Considering sex and gender in Alzheimer disease and other dementias. Dialogues in Clinical Neuroscience 18, 437.
  • Rosenbaum (2005) Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 515–530.
  • Weintraub et al. (2009) Weintraub, S., Salmon, D., Mercaldo, N., Ferris, S., Graff-Radford, N. R., Chui, H., Cummings, J., DeCarli, C., Foster, N. L., Galasko, D., et al. (2009). The Alzheimer’s disease centers’ uniform data set (UDS): The neuropsychological test battery. Alzheimer Disease and Associated Disorders 23, 91.
  • Wilcoxon (1992) Wilcoxon, F. (1992). Individual comparisons by ranking methods. In Breakthroughs in statistics, pages 196–202. Springer.