Distance and Kernel-Based Measures for Global and Local Two-Sample Conditional Distribution Testing

Jian Yan Department of Statistics and Data Science, Cornell University Zhuoxi Li Department of Statistics and Data Science, SOE, Xiamen University Xianyang Zhang Department of Statistics, Texas A&M University
Abstract

Testing the equality of two conditional distributions is crucial in various modern applications, including transfer learning and causal inference. Despite its importance, this fundamental problem has received surprisingly little attention in the literature. This work aims to present a unified framework based on distance and kernel methods for both global and local two-sample conditional distribution testing. To this end, we introduce distance and kernel-based measures that characterize the homogeneity of two conditional distributions. Drawing from the concept of conditional U-statistics, we propose consistent estimators for these measures. Theoretically, we derive the convergence rates and the asymptotic distributions of the estimators under both the null and alternative hypotheses. Utilizing these measures, along with a local bootstrap approach, we develop global and local tests that can detect discrepancies between two conditional distributions at global and local levels, respectively. Our tests demonstrate reliable performance through simulations and real data analyses.

Keywords: Conditional distribution; Generalized energy distance; Kernel smoothing; Maximum mean discrepancy; Two-sample testing; U-statistics.

1 Introduction

The canonical setting for (nonparametric) two-sample testing focuses on assessing the equality of two unconditional distributions. In many contemporary applications, however, people are instead interested in testing the equality of two conditional distributions. Consider two independent data sets {(Yi(l),Xi(l))}i=1nl\{(Y_{i}^{(l)},X_{i}^{(l)})\}_{i=1}^{n_{l}} for l=1,2l=1,2. Here Yi(l)𝒴Y_{i}^{(l)}\in\mathcal{Y} and Xi(l)pX_{i}^{(l)}\in\mathbb{R}^{p}, where 𝒴\mathcal{Y} is allowed to be a general metric space (see Remark 2). For l=1,2l=1,2, assume that {(Yi(l),Xi(l))}i=1nl\{(Y_{i}^{(l)},X_{i}^{(l)})\}_{i=1}^{n_{l}} are independent and identically distributed (i.i.d.) samples of (Y(l),X(l))P(l)PYX(l)PX(l)(Y^{(l)},X^{(l)})\sim P^{(l)}\equiv P_{Y\mid X}^{(l)}\otimes P_{X}^{(l)}, where PX(l)P_{X}^{(l)} is the marginal distribution of X(l)X^{(l)} and PYX=x(l)P_{Y\mid X=x}^{(l)} is the conditional distribution of Y(l)Y^{(l)} given X(l)=xX^{(l)}=x. To ensure that the testing problem considered below is nontrivial, we assume that PX(1)P_{X}^{(1)} and PX(2)P_{X}^{(2)} are equivalent, i.e., PX(1)PX(2)P_{X}^{(1)}\ll P_{X}^{(2)} and PX(2)PX(1)P_{X}^{(2)}\ll P_{X}^{(1)}, where PQP\ll Q means that PP is absolutely continuous in reference to QQ. We aim to test the following hypothesis, which we call the global two-sample conditional distribution testing problem,

H0:PX(1)(PYX(1)=PYX(2))=1versusHa:PX(1)(PYX(1)PYX(2))>0.H_{0}:P^{(1)}_{X}(P_{Y\mid X}^{(1)}=P_{Y\mid X}^{(2)})=1\quad\text{versus}\quad H_{a}:P^{(1)}_{X}(P_{Y\mid X}^{(1)}\neq P_{Y\mid X}^{(2)})>0. (1)

Due to the equivalence between PX(1)P_{X}^{(1)} and PX(2)P_{X}^{(2)}, the hypotheses in (1) can be equivalently formulated by replacing PX(1)P^{(1)}_{X} with PX(2)P^{(2)}_{X}.

We want to emphasize that the marginal distributions of XX from the two populations may differ (i.e., PX(1)PX(2)P_{X}^{(1)}\neq P_{X}^{(2)}), as seen in the motivating examples below. Thus, H0H_{0} in (1) is not equivalent to P(1)=P(2)P^{(1)}=P^{(2)}, and unconditional two-sample tests for the equality of two joint distributions (i.e., P(1)=P(2)P^{(1)}=P^{(2)}) are generally not applicable in our context. Applying such tests would result in a failure to control the type I error when PX(1)PX(2)P_{X}^{(1)}\neq P_{X}^{(2)}. Besides, YY and XX denote two generic random variables and do not necessarily correspond to response and covariates, respectively. For instance, in the prior shift example below, YY are covariates, and XX is a response in (1).

Hypothesis (1) is central to many important problems in econometrics, machine learning, and statistics. For example, in transfer learning, the prior and covariate shift assumptions are commonly employed to tackle distributional differences between source and target populations (Kouw and Loog,, 2018). The prior shift assumption asserts that the conditional distribution of the covariates given the response is identical in both populations while allowing for a shift in the marginal distributions of the response. Conversely, the covariate shift assumption posits that the conditional distribution of the response given the covariates remains invariant across source and target populations, but the marginal distributions of the covariates can differ. Both assumptions are widely adopted in the literature; see, e.g., Huang et al., (2024); Lee et al., (2024) for the prior shift assumption, and Shimodaira, (2000); Tibshirani et al., (2019); Liu et al., (2023); Ma et al., (2023) for the covariate shift assumption. Despite their prevalence, there is a paucity of work that formally validates these assumptions. Both of them can be framed as testing the equality of two conditional distributions, as in (1). Such tests are essential to the validity of methods developed under the prior or covariate shift assumptions.

Another motivating example comes from causal inference. Testing hypotheses in the context of treatment effect analysis has always been of interest (Imbens and Wooldridge,, 2009, Sections 3.3 and 5.12). Consider the standard setup based on the potential outcome framework (Rubin,, 1974). Suppose {(Yi,Ti,Xi)}i=1n\{(Y_{i},T_{i},X_{i})\}_{i=1}^{n} are i.i.d. observations of (Y,T,X)(Y,T,X), where YY is the observed outcome of interest, TT denotes a binary treatment (1: treated, 0: untreated), and XX are pretreatment covariates. For each subject, we define a pair of potential outcomes, {Y(1),Y(0)}\{Y(1),Y(0)\}, that would be observed if the subject had been given treatment, Y(1)Y(1), and control, Y(0)Y(0). One may be interested in testing the null hypothesis that the conditional distribution of Y(1)XY(1)\mid X is the same as that of Y(0)XY(0)\mid X (Imbens and Wooldridge,, 2009), i.e., zero conditional distributional treatment effect. Under the prevalent assumptions of consistency, Y=TY(1)+(1T)Y(0)Y=TY(1)+(1-T)Y(0), and no unmeasured confounding, {Y(1),Y(0)}TX\{Y(1),Y(0)\}\perp\!\!\!\perp T\mid X, the conditional distributions of Y(1)XY(1)\mid X and Y(0)XY(0)\mid X can be identified as those of YX,T=1Y\mid X,T=1 and YX,T=0Y\mid X,T=0, respectively. Therefore, it can be formulated as testing hypothesis (1) with {(Yi,Xi):Ti=1}\{(Y_{i},X_{i}):T_{i}=1\} and {(Yi,Xi):Ti=0}\{(Y_{i},X_{i}):T_{i}=0\} being the two sets of independent samples.

When the null hypothesis of the global two-sample conditional distribution testing is rejected, one might wish to pinpoint local regions where the two conditional distributions are significantly different. Specifically, for a fixed xx in the support of PX(1)P_{X}^{(1)} (or equivalently, PX(2)P_{X}^{(2)}), we are interested in testing

H0:PYX=x(1)=PYX=x(2)versusHa:PYX=x(1)PYX=x(2),H_{0}:P_{Y\mid X=x}^{(1)}=P_{Y\mid X=x}^{(2)}\quad\text{versus}\quad H_{a}:P_{Y\mid X=x}^{(1)}\neq P_{Y\mid X=x}^{(2)}, (2)

which we refer to as the local two-sample conditional distribution testing problem, in contrast to the global problem (1). This problem is novel and is partly motivated by Duong, (2013) and Kim et al., (2019) in the context of unconditional two-sample testing. With a little extra effort, our framework can readily handle the local testing problem (2).

Recently, distance and kernel-based measures have received considerable attention in both the statistics and machine learning communities (Székely and Rizzo,, 2017; Muandet et al.,, 2017). These measures have been applied to a wide range of classical and modern hypothesis testing problems, including two-sample testing (Székely et al.,, 2004; Baringhaus and Franz,, 2004; Gretton et al.,, 2012), goodness-of-fit testing (Székely and Rizzo,, 2005; Balasubramanian et al.,, 2021), independence testing (Székely et al.,, 2007; Gretton et al.,, 2007; Chakraborty and Zhang,, 2019; Ke and Yin,, 2020; Deb et al.,, 2020), conditional independence testing (Fukumizu et al.,, 2007; Zhang et al.,, 2012; Wang et al.,, 2015; Sheng and Sriperumbudur,, 2023) and high-dimensional two sample and dependence testing (Zhang et al.,, 2018; Yao et al.,, 2018; Zhu et al.,, 2020; Chakraborty and Zhang,, 2021). In this paper, we establish a distance and kernel-based framework to tackle both the global and local two-sample conditional distribution testing problems (1) and (2). To achieve this, we introduce the conditional generalized energy distance (3) and the conditional maximum mean discrepancy (4), along with their integrated version (5), which fully characterize the homogeneity of two conditional distributions. Additionally, we show the equivalence between the conditional generalized energy distance and the conditional maximum mean discrepancy. Building upon estimators of these measures, we develop global and local tests capable of detecting discrepancies between two conditional distributions at global and local levels, respectively.

Our estimation strategy employs a combination of U-statistics and kernel smoothing, initially introduced in the so-called conditional U-statistics by Stute, (1991) and later applied to different problems by Wang et al., (2015) and Ke and Yin, (2020). To highlight our theoretical contributions, we summarize the distinct asymptotic behaviors of our global and local test statistics under both the null and alternative hypotheses in Table 1. It is noteworthy that local tests based on distance and kernel measures have not been previously explored in the literature. Furthermore, while Wang et al., (2015) and Ke and Yin, (2020) only provided the asymptotic distributions of their global test statistics under the null, we offer additional insights by systematically studying the properties of our statistics under both the null and alternative hypotheses. As a side note, we identify certain gaps in the derivations of the asymptotic null distribution in both Wang et al., (2015) and Ke and Yin, (2020), with details provided in Section 4.

Our theoretical analysis indicates that the implementation of kernel smoothing in U-statistic estimators of distance and kernel-based measures has two significant effects. On the one hand, as shown in Table 1, U-statistic estimators are no longer unbiased, and undersmoothing is required to mitigate the bias introduced by kernel smoothing. On the other hand, owing to kernel smoothing, the U-statistic estimators for distance and kernel-based measures become non-degenerate under the null hypotheses, which is in contrast to the unconditional situation (Székely et al.,, 2004; Gretton et al.,, 2012). Nevertheless, when undersmoothing is employed, the first-order projections in the Hoeffding decomposition of U-statistic estimators turn out to be asymptotically negligible under the null, and thus the asymptotic null distributions are determined by the second-order projections in the Hoeffding decomposition. For further discussion, please refer to Sections 3-4.

Table 1: Asymptotic behaviors in different scenarios, assuming the sample sizes n1n2nn_{1}\asymp n_{2}\asymp n, the bandwidths h1h2hh_{1}\asymp h_{2}\asymp h, and the smoothing kernel is of order ν\nu (see Assumption 1).

|                          | Global test, Null | Global test, Alternative | Local test, Null | Local test, Alternative |
| Asymptotic distribution  | Normal | Normal | Weighted sum of chi-squares | Normal |
| Convergence rate         | n^{-1}h^{-p/2} | n^{-1/2} | (nh^{p})^{-1} | (nh^{p})^{-1/2} |
| Undersmoothing condition | nh^{p/2+\nu}\rightarrow 0 | n^{1/2}h^{\nu}\rightarrow 0 | nh^{p+\nu}\rightarrow 0 | n^{1/2}h^{p/2+\nu}\rightarrow 0 |

We now discuss related work and highlight several notable features of our framework. Although problems (1) and (2) are fundamental in various modern applications, surprisingly, there are very few methods available to test the equality of two conditional distributions. In the nonparametric testing literature, existing methods mainly focus on testing the equality of conditional moments of YY given XX. Specifically, most studies aim at testing the equality of conditional means, also known as the comparison of regression curves, as seen in Hall and Hart, (1990); Kulasekera, (1995); Dette and Munk, (1998); Lavergne, (2001); Neumeyer and Dette, (2003), among others (see Section 7 in González-Manteiga and Crujeiras, (2013) for a detailed review). Similarly, in the causal inference literature, hypothesis testing has largely been limited to conditional average treatment effect (see, e.g., Crump et al., (2008)). The methods we develop in this paper can detect general discrepancies between two conditional distributions, beyond specific moments. Lee, (2009) and Chang et al., (2015) proposed nonparametric tests for the null hypothesis of zero conditional distributional treatment effect. Their tests are built upon a Mann-Whitney statistic and cumulative distribution functions, respectively, and thus are only applicable for univariate YY. In our framework, YY is allowed to take values in a general metric space, while XX can be multivariate. Very recently, Hu and Lei, (2024) proposed a test for the hypothesis (1) using techniques from conformal prediction. However, their test requires (possibly unbalanced) sample splitting, and its performance relies on high quality density ratio estimators (see Assumption 2(b) and related discussion therein). Chen et al., (2022) and Chatterjee et al., (2024) considered the paired-sample problem, which is very different from our two-sample setting. Notably, the local testing problem (2) is new and has not been addressed in the literature. Our framework can accommodate both global and local two-sample conditional distribution testing.

Notation. For l=1,2l=1,2, let fl(y,x)f_{l}(y,x), fl(x)f_{l}(x) and fl(yx)f_{l}(y\mid x) be the joint probability density function of P(l)P^{(l)}, the marginal probability density function of PX(l)P_{X}^{(l)} and the conditional probability density function of PYX=x(l)P_{Y\mid X=x}^{(l)}, respectively (assuming the existence of these densities). For two sequences of real numbers {an}n=1\{a_{n}\}_{n=1}^{\infty} and {bn}n=1\{b_{n}\}_{n=1}^{\infty}, we write anbna_{n}\asymp b_{n} if and only if an=O(bn)a_{n}=O(b_{n}) and bn=O(an)b_{n}=O(a_{n}). The symbols 𝑝\overset{p}{\rightarrow} and 𝑑\overset{d}{\rightarrow} stand for convergence in probability and in distribution, respectively.

2 Preliminaries

This section provides an overview of the generalized energy distance and maximum mean discrepancy, as well as their equivalence, which has been extensively discussed in Sejdinovic et al., (2013).

For a non-empty set 𝒰\mathcal{U}, a non-negative function ρ:𝒰×𝒰[0,)\rho:\mathcal{U}\times\mathcal{U}\rightarrow[0,\infty) is called a semimetric on 𝒰\mathcal{U} if for any u,u𝒰u,u^{\prime}\in\mathcal{U}, it satisfies (i) ρ(u,u)=0u=u\rho(u,u^{\prime})=0\iff u=u^{\prime} and (ii) ρ(u,u)=ρ(u,u)\rho(u,u^{\prime})=\rho(u^{\prime},u). Then (𝒰,ρ)(\mathcal{U},\rho) is said to be a semimetric space. The semimetric space (𝒰,ρ)(\mathcal{U},\rho) is said to have negative type if for all n2n\geq 2, u1,,un𝒰u_{1},\ldots,u_{n}\in\mathcal{U} and α1,,αn\alpha_{1},\ldots,\alpha_{n}\in\mathbb{R} with i=1nαi=0\sum_{i=1}^{n}\alpha_{i}=0, we have i,j=1nαiαjρ(ui,uj)0\sum_{i,j=1}^{n}\alpha_{i}\alpha_{j}\rho(u_{i},u_{j})\leq 0. Define ρ1(𝒰)={μ(𝒰):ρ(u,u0)dμ(u)<\mathcal{M}_{\rho}^{1}(\mathcal{U})=\{\mu\in\mathcal{M}(\mathcal{U}):\int\rho(u,u_{0})d\mu(u)<\infty, for some u0𝒰}u_{0}\in\mathcal{U}\}, where (𝒰)\mathcal{M}(\mathcal{U}) denotes the set of all probability measures on 𝒰\mathcal{U}. Suppose P,Qρ1(𝒰)P,Q\in\mathcal{M}_{\rho}^{1}(\mathcal{U}), we have ρd(PQ)20\int\rho d(P-Q)^{2}\leq 0 when (𝒰,ρ)(\mathcal{U},\rho) has negative type. We say that (𝒰,ρ)(\mathcal{U},\rho) has strong negative type if it has negative type and the equality holds only when P=QP=Q. The generalized energy distance between P,Qρ1(𝒰)P,Q\in\mathcal{M}_{\rho}^{1}(\mathcal{U}) is defined as (Sejdinovic et al.,, 2013)

𝒟ρ(P,Q)=2𝔼ρ(U,V)𝔼ρ(U,U)𝔼ρ(V,V)=ρd(PQ)2,\mathcal{D}_{\rho}(P,Q)=2\mathbb{E}\rho(U,V)-\mathbb{E}\rho(U,U^{\prime})-\mathbb{E}\rho(V,V^{\prime})=-\int\rho d(P-Q)^{2},

where U,Ui.i.d.PU,U^{\prime}\overset{i.i.d.}{\sim}P and V,Vi.i.d.QV,V^{\prime}\overset{i.i.d.}{\sim}Q. If (𝒰,ρ)(\mathcal{U},\rho) is of strong negative type, we have 𝒟ρ(P,Q)0\mathcal{D}_{\rho}(P,Q)\geq 0 and the equality holds if and only if P=QP=Q. Every separable Hilbert space is of strong negative type (Lyons,, 2013). In particular, Euclidean spaces are separable, and thus the generalized energy distance generalizes the usual energy distance introduced by Székely et al., (2004) and independently by Baringhaus and Franz, (2004) from Euclidean spaces to semimetric spaces of strong negative type.

Let \mathcal{H} be a Hilbert space of real-valued functions defined on 𝒰\mathcal{U}. A function k:𝒰×𝒰k:\mathcal{U}\times\mathcal{U}\rightarrow\mathbb{R} is a reproducing kernel of \mathcal{H} if (i) u𝒰\forall u\in\mathcal{U}, k(,u)k(\cdot,u)\in\mathcal{H} and (ii) u𝒰,f\forall u\in\mathcal{U},\forall f\in\mathcal{H}, f,k(,u)=f(u)\langle f,k(\cdot,u)\rangle_{\mathcal{H}}=f(u), where ,\langle\cdot,\cdot\rangle_{\mathcal{H}} is the inner product associated with \mathcal{H}. If \mathcal{H} has a reproducing kernel, it is said to be a reproducing kernel Hilbert space (RKHS). According to Moore-Aronszajn theorem (Berlinet and Thomas-Agnan,, 2011), for every symmetric, positive definite function (henceforth kernel) kk, there exists an associated RKHS k\mathcal{H}_{k} with reproducing kernel kk. For θ>0\theta>0, define kθ(𝒰)={μ(𝒰):kθ(u,u)𝑑μ(u)<}\mathcal{M}_{k}^{\theta}(\mathcal{U})=\{\mu\in\mathcal{M}(\mathcal{U}):\int k^{\theta}(u,u)d\mu(u)<\infty\}. The kernel mean embedding of Pk1/2(𝒰)P\in\mathcal{M}_{k}^{1/2}(\mathcal{U}) into RKHS k\mathcal{H}_{k} is defined by the Bochner integral Πk(P)=k(,u)𝑑P(u)k\Pi_{k}(P)=\int k(\cdot,u)dP(u)\in\mathcal{H}_{k}. Besides, the kernel kk is said to be characteristic if the mapping Πk\Pi_{k} is injective. Conditions under which kernels are characteristic have been studied by Sriperumbudur et al., (2008, 2010), and examples include the Gaussian kernel and the Laplace kernel. The maximum mean discrepancy (MMD) between P,Qk1/2(𝒰)P,Q\in\mathcal{M}_{k}^{1/2}(\mathcal{U}) is given by (Gretton et al.,, 2012)

γk(P,Q)=Πk(P)Πk(Q)k.\gamma_{k}(P,Q)=\|\Pi_{k}(P)-\Pi_{k}(Q)\|_{\mathcal{H}_{k}}.

When the kernel kk is characteristic, we have γk(P,Q)=0\gamma_{k}(P,Q)=0 if and only if P=QP=Q. Also, the following alternative representation of the squared MMD is useful

γk2(P,Q)=𝔼k(U,U)+𝔼k(V,V)2𝔼k(U,V)=kd(PQ)2,\gamma_{k}^{2}(P,Q)=\mathbb{E}k(U,U^{\prime})+\mathbb{E}k(V,V^{\prime})-2\mathbb{E}k(U,V)=\int kd(P-Q)^{2},

where U,Ui.i.d.PU,U^{\prime}\overset{i.i.d.}{\sim}P and V,Vi.i.d.QV,V^{\prime}\overset{i.i.d.}{\sim}Q.

Let ρ\rho be a semimetric of negative type on 𝒰\mathcal{U}. For any u0𝒰u_{0}\in\mathcal{U}, the function k(u,u)=ρ(u,u0)+ρ(u,u0)ρ(u,u)k(u,u^{\prime})=\rho(u,u_{0})+\rho(u^{\prime},u_{0})-\rho(u,u^{\prime}) is positive definite, and is said to be the distance-induced kernel induced by ρ\rho and centered at u0u_{0}. Correspondingly, for a kernel kk on 𝒰\mathcal{U}, the function ρ(u,u)=12{k(u,u)+k(u,u)}k(u,u)\rho(u,u^{\prime})=\frac{1}{2}\{k(u,u)+k(u^{\prime},u^{\prime})\}-k(u,u^{\prime}) defines a valid semimetric ρ\rho of negative type on 𝒰\mathcal{U}, and we say that kk generates ρ\rho. It is clear that every distance-induced kernel kk induced by ρ\rho, also generates ρ\rho. Theorem 22 in Sejdinovic et al., (2013) establishes the equivalence between the generalized energy distance and MMD. Specifically, suppose P,Qρ1(𝒰)P,Q\in\mathcal{M}_{\rho}^{1}(\mathcal{U}) and let kk be any kernel that generates ρ\rho, then 𝒟ρ(P,Q)=γk2(P,Q)\mathcal{D}_{\rho}(P,Q)=\gamma_{k}^{2}(P,Q).
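As a quick numerical sanity check of this equivalence (a sketch we add here in Python with NumPy; all function names are illustrative), the plug-in estimate of the energy distance under the Euclidean metric and the plug-in estimate of the squared MMD under the distance-induced kernel centered at u_0 = 0 coincide up to floating-point error:

```python
import numpy as np

def energy_distance(U, V):
    """Plug-in estimate of D_rho(P, Q) with rho(u, u') = ||u - u'|| (Euclidean)."""
    d = lambda A, B: np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return 2 * d(U, V).mean() - d(U, U).mean() - d(V, V).mean()

def mmd_sq(U, V):
    """Plug-in estimate of gamma_k^2(P, Q) with the distance-induced kernel
    k(u, u') = ||u|| + ||u'|| - ||u - u'||, i.e., rho centered at u0 = 0."""
    k = lambda A, B: (np.linalg.norm(A, axis=1)[:, None] + np.linalg.norm(B, axis=1)[None, :]
                      - np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1))
    return k(U, U).mean() + k(V, V).mean() - 2 * k(U, V).mean()

rng = np.random.default_rng(0)
U = rng.normal(0.0, 1.0, size=(200, 2))   # sample from P
V = rng.normal(0.5, 1.0, size=(200, 2))   # sample from Q
print(energy_distance(U, V), mmd_sq(U, V))  # the two numbers agree
```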

3 Conditional generalized energy distance and conditional maximum mean discrepancy

Let ρ\rho be a semimetric on 𝒴\mathcal{Y}. We define the conditional generalized energy distance at xpx\in\mathbb{R}^{p} as the generalized energy distance between PYX=x(1)P_{Y\mid X=x}^{(1)} and PYX=x(2)P_{Y\mid X=x}^{(2)}.

Definition 1.

Assume fl(x)>0f_{l}(x)>0 and PYX=x(l)ρ1(𝒴)P_{Y\mid X=x}^{(l)}\in\mathcal{M}_{\rho}^{1}(\mathcal{Y}) for l=1,2l=1,2. The conditional generalized energy distance between P(1)P^{(1)} and P(2)P^{(2)} at xpx\in\mathbb{R}^{p} is defined as

𝒟ρ(x)𝒟ρ(PYX=x(1),PYX=x(2))=𝒴𝒴ρ(y,y)d(PYX=x(1)PYX=x(2))2(y,y)=2𝔼{ρ(Y(1),Y(2))X(1)=x,X(2)=x}𝔼{ρ(Y(1),Y(1))X(1)=x,X(1)=x}𝔼{ρ(Y(2),Y(2))X(2)=x,X(2)=x},\displaystyle\begin{split}\mathcal{D}_{\rho}(x)&\coloneqq\mathcal{D}_{\rho}(P_{Y\mid X=x}^{(1)},P_{Y\mid X=x}^{(2)})=-\int_{\mathcal{Y}}\int_{\mathcal{Y}}\rho(y,y^{\prime})d(P_{Y\mid X=x}^{(1)}-P_{Y\mid X=x}^{(2)})^{2}(y,y^{\prime})\\ &=2\mathbb{E}\{\rho(Y^{(1)},Y^{(2)})\mid X^{(1)}=x,X^{(2)}=x\}-\mathbb{E}\{\rho(Y^{(1)},Y^{(1)\prime})\mid X^{(1)}=x,X^{(1)\prime}=x\}\\ &\quad-\mathbb{E}\{\rho(Y^{(2)},Y^{(2)\prime})\mid X^{(2)}=x,X^{(2)\prime}=x\},\end{split} (3)

where (Y(1),X(1))(Y^{(1)\prime},X^{(1)\prime}) and (Y(2),X(2))(Y^{(2)\prime},X^{(2)\prime}) are i.i.d. copies of (Y(1),X(1))(Y^{(1)},X^{(1)}) and (Y(2),X(2))(Y^{(2)},X^{(2)}), respectively.

When 𝒴=q\mathcal{Y}=\mathbb{R}^{q} and ρ\rho corresponds to the Euclidean distance, (3) also has a nice interpretation in terms of conditional characteristic function. Specifically, for l=1,2l=1,2 and tqt\in\mathbb{R}^{q}, the conditional characteristic function of Y(l)Y^{(l)} given X(l)=xX^{(l)}=x is defined as φYX=x(l)(t)=𝔼{exp(ıtY(l))X(l)=x}\varphi_{Y\mid X=x}^{(l)}(t)=\mathbb{E}\{\exp(\imath t^{\top}Y^{(l)})\mid X^{(l)}=x\}, where ı=1\imath=\sqrt{-1} is the imaginary unit. When ρ(y,y)=yy\rho(y,y^{\prime})=\|y-y^{\prime}\| with \|\cdot\| being the Euclidean norm, we have

𝒟ρ(x)=1cqq|φYX=x(1)(t)φYX=x(2)(t)|2tq+1𝑑t,\mathcal{D}_{\rho}(x)=\frac{1}{c_{q}}\int_{\mathbb{R}^{q}}\frac{|\varphi_{Y\mid X=x}^{(1)}(t)-\varphi_{Y\mid X=x}^{(2)}(t)|^{2}}{\|t\|^{q+1}}dt,

where cq=π(q+1)/2/Γ((q+1)/2)c_{q}=\pi^{(q+1)/2}/\Gamma((q+1)/2) is a constant with Γ()\Gamma(\cdot) being the gamma function. The proof of this fact follows a similar approach to Proposition 1 in Székely and Rizzo, (2017), and the details are omitted for brevity.

As the counterpart to the distance-based metric (3), we now introduce a kernel-based metric to measure the discrepancy between the two conditional distributions. Let k\mathcal{H}_{k} be a RKHS associated with a kernel kk on 𝒴\mathcal{Y}.

Definition 2.

Assume fl(x)>0f_{l}(x)>0 and PYX=x(l)k1/2(𝒴)P_{Y\mid X=x}^{(l)}\in\mathcal{M}_{k}^{1/2}(\mathcal{Y}) for l=1,2l=1,2. The conditional maximum mean discrepancy (CMMD) between P(1)P^{(1)} and P(2)P^{(2)} at xpx\in\mathbb{R}^{p} is defined as the square root of

γk2(x)γk2(PYX=x(1),PYX=x(2))=Πk(PYX=x(1))Πk(PYX=x(2))k2=𝒴𝒴k(y,y)d(PYX=x(1)PYX=x(2))2(y,y)=𝔼{k(Y(1),Y(1))X(1)=x,X(1)=x}+𝔼{k(Y(2),Y(2))X(2)=x,X(2)=x}2𝔼{k(Y(1),Y(2))X(1)=x,X(2)=x},\displaystyle\begin{split}\gamma_{k}^{2}(x)&\coloneqq\gamma_{k}^{2}(P_{Y\mid X=x}^{(1)},P_{Y\mid X=x}^{(2)})=\|\Pi_{k}(P_{Y\mid X=x}^{(1)})-\Pi_{k}(P_{Y\mid X=x}^{(2)})\|_{\mathcal{H}_{k}}^{2}\\ &=\int_{\mathcal{Y}}\int_{\mathcal{Y}}k(y,y^{\prime})d(P_{Y\mid X=x}^{(1)}-P_{Y\mid X=x}^{(2)})^{2}(y,y^{\prime})\\ &=\mathbb{E}\{k(Y^{(1)},Y^{(1)\prime})\mid X^{(1)}=x,X^{(1)\prime}=x\}+\mathbb{E}\{k(Y^{(2)},Y^{(2)\prime})\mid X^{(2)}=x,X^{(2)\prime}=x\}\\ &\quad-2\mathbb{E}\{k(Y^{(1)},Y^{(2)})\mid X^{(1)}=x,X^{(2)}=x\},\end{split} (4)

where (Y(1),X(1))(Y^{(1)\prime},X^{(1)\prime}) and (Y(2),X(2))(Y^{(2)\prime},X^{(2)\prime}) are i.i.d. copies of (Y(1),X(1))(Y^{(1)},X^{(1)}) and (Y(2),X(2))(Y^{(2)},X^{(2)}), respectively.

While conducting this research, we came across Park and Muandet, (2020), where the authors present the CMMD (4) in a slightly different form. However, their estimation strategy is entirely different from ours. Neither the convergence rate nor the asymptotic distribution of their estimator is provided. By contrast, we derive the exact convergence rate and the asymptotic distribution of our statistic under both the null and alternative hypotheses. In addition, we establish a unified framework for both the distance and kernel-based measures. Furthermore, it is important to note that the CMMD is a metric indexed by xpx\in\mathbb{R}^{p}, and the single measure (5) introduced in Section 4, which integrates the CMMD with a weight function, is not discussed in Park and Muandet, (2020).

The conditional generalized energy distance and CMMD both serve to characterize the homogeneity of two conditional distributions. Besides, we can demonstrate the equivalence between the conditional generalized energy distance and CMMD. As a result, we will focus on the CMMD for the remainder of this paper.

Theorem 1.
  1. 1.

    When the semimetric ρ\rho is of strong negative type, for any xpx\in\mathbb{R}^{p} such that fl(x)>0f_{l}(x)>0 and PYX=x(l)ρ1(𝒴)(l=1,2)P_{Y\mid X=x}^{(l)}\in\mathcal{M}_{\rho}^{1}(\mathcal{Y})\ (l=1,2), we have 𝒟ρ(x)0\mathcal{D}_{\rho}(x)\geq 0, and 𝒟ρ(x)=0\mathcal{D}_{\rho}(x)=0 if and only if H0H_{0} in (2) holds.

  2. 2.

    When the kernel kk is characteristic, for any xpx\in\mathbb{R}^{p} such that fl(x)>0f_{l}(x)>0 and PYX=x(l)k1/2(𝒴)(l=1,2)P_{Y\mid X=x}^{(l)}\in\mathcal{M}_{k}^{1/2}(\mathcal{Y})\ (l=1,2), we have γk(x)=0\gamma_{k}(x)=0 if and only if H0H_{0} in (2) holds.

  3. 3.

    Let ρ\rho be a semimetric of negative type on 𝒴\mathcal{Y}. Suppose fl(x)>0f_{l}(x)>0 and PYX=x(l)ρ1(𝒴)P_{Y\mid X=x}^{(l)}\in\mathcal{M}_{\rho}^{1}(\mathcal{Y}) for l=1,2l=1,2. If kk is a kernel that generates ρ\rho, i.e., ρ(y,y)=12{k(y,y)+k(y,y)}k(y,y)\rho(y,y^{\prime})=\frac{1}{2}\{k(y,y)+k(y^{\prime},y^{\prime})\}-k(y,y^{\prime}), then

    𝒟ρ(x)=γk2(x).\mathcal{D}_{\rho}(x)=\gamma_{k}^{2}(x).
Remark 1.

While this paper employs characteristic kernels to fully capture the homogeneity of two conditional distributions, it is also possible to utilize non-characteristic kernels to detect specific moment discrepancies. For instance, by using the linear kernel k(y,y)=yyk(y,y^{\prime})=y^{\top}y^{\prime}, we obtain γk(x)=|𝔼(Y(1)X(1)=x)𝔼(Y(2)X(2)=x)|\gamma_{k}(x)=|\mathbb{E}(Y^{(1)}\mid X^{(1)}=x)-\mathbb{E}(Y^{(2)}\mid X^{(2)}=x)|, which can be used to compare two conditional means. This illustrates the broad range of applications for our framework.
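To make the linear-kernel case concrete, here is the short calculation behind it (not spelled out above). Writing m_l(x)=\mathbb{E}(Y^{(l)}\mid X^{(l)}=x) and plugging k(y,y^{\prime})=y^{\top}y^{\prime} into (4),

\gamma_k^2(x) = \mathbb{E}\{Y^{(1)\top}Y^{(1)\prime}\mid X^{(1)}=x,X^{(1)\prime}=x\} + \mathbb{E}\{Y^{(2)\top}Y^{(2)\prime}\mid X^{(2)}=x,X^{(2)\prime}=x\} - 2\,\mathbb{E}\{Y^{(1)\top}Y^{(2)}\mid X^{(1)}=x,X^{(2)}=x\} = \|m_1(x)\|^2 + \|m_2(x)\|^2 - 2\,m_1(x)^{\top}m_2(x) = \|m_1(x)-m_2(x)\|^2,

where the second equality uses the independence of the two samples and of the primed and unprimed copies. Taking square roots gives the expression in Remark 1, with the absolute value read as the Euclidean norm when Y^{(l)} is multivariate.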

Remark 2.

Note that the definitions of the conditional generalized energy distance and CMMD, as well as the estimation procedures to be introduced, are applicable to a general metric space 𝒴\mathcal{Y}. This implies that our framework is capable of handling cases where YY is a non-Euclidean object, such as a curve, image, or even topological data encountered in various applications.

Now, we present an estimator for the CMMD. Let x(s)x(s) denote the ssth component of xpx\in\mathbb{R}^{p} for s=1,,ps=1,\ldots,p. For l=1,2l=1,2, define Ghl:pG_{h_{l}}:\mathbb{R}^{p}\rightarrow\mathbb{R} as

Ghl(x)=1hlps=1pg(x(s)hl),xp,G_{h_{l}}(x)=\frac{1}{h_{l}^{p}}\prod_{s=1}^{p}g\bigg{(}\frac{x(s)}{h_{l}}\bigg{)},\quad x\in\mathbb{R}^{p},

where g:g:\mathbb{R}\rightarrow\mathbb{R} is a univariate kernel function satisfying Assumption 1 below, and h1,h2h_{1},h_{2}\in\mathbb{R} are the bandwidth parameters. For ease of presentation, here we use the same bandwidth hlh_{l} for each component of X(l)(l=1,2)X^{(l)}\ (l=1,2). In practice, one should always allow for varying bandwidths across each component of X(l)X^{(l)}. Besides, it is important to note that the smoothing kernel gg applied to XpX\in\mathbb{R}^{p} differs from the reproducing kernel kk applied to Y𝒴Y\in\mathcal{Y}, and we employ different criteria for selecting these two kernels.

Motivated by the representation of CMMD in terms of the conditional moments given in (4), we propose the following estimator for the CMMD:

γk2^(x)\displaystyle\widehat{\gamma_{k}^{2}}(x) ={i1i2Gh1(Xi1(1)x)Gh1(Xi2(1)x)}1i1i2k(Yi1(1),Yi2(1))Gh1(Xi1(1)x)Gh1(Xi2(1)x)\displaystyle=\bigg{\{}\sum_{i_{1}\neq i_{2}}G_{h_{1}}(X_{i_{1}}^{(1)}-x)G_{h_{1}}(X_{i_{2}}^{(1)}-x)\bigg{\}}^{-1}\sum_{i_{1}\neq i_{2}}k(Y_{i_{1}}^{(1)},Y_{i_{2}}^{(1)})G_{h_{1}}(X_{i_{1}}^{(1)}-x)G_{h_{1}}(X_{i_{2}}^{(1)}-x)
+{j1j2Gh2(Xj1(2)x)Gh2(Xj2(2)x)}1j1j2k(Yj1(2),Yj2(2))Gh2(Xj1(2)x)Gh2(Xj2(2)x)\displaystyle\quad+\bigg{\{}\sum_{j_{1}\neq j_{2}}G_{h_{2}}(X_{j_{1}}^{(2)}-x)G_{h_{2}}(X_{j_{2}}^{(2)}-x)\bigg{\}}^{-1}\sum_{j_{1}\neq j_{2}}k(Y_{j_{1}}^{(2)},Y_{j_{2}}^{(2)})G_{h_{2}}(X_{j_{1}}^{(2)}-x)G_{h_{2}}(X_{j_{2}}^{(2)}-x)
2{i1i2Gh1(Xi1(1)x)Gh1(Xi2(1)x)j1j2Gh2(Xj1(2)x)Gh2(Xj2(2)x)}1\displaystyle\quad-2\bigg{\{}\sum_{i_{1}\neq i_{2}}G_{h_{1}}(X_{i_{1}}^{(1)}-x)G_{h_{1}}(X_{i_{2}}^{(1)}-x)\sum_{j_{1}\neq j_{2}}G_{h_{2}}(X_{j_{1}}^{(2)}-x)G_{h_{2}}(X_{j_{2}}^{(2)}-x)\bigg{\}}^{-1}
×i=1n1j=1n2k(Yi(1),Yj(2))Gh1(Xi(1)x)Gh2(Xj(2)x)iiGh1(Xi(1)x)jjGh2(Xj(2)x),\displaystyle\quad\times\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}k(Y_{i}^{(1)},Y_{j}^{(2)})G_{h_{1}}(X_{i}^{(1)}-x)G_{h_{2}}(X_{j}^{(2)}-x)\sum_{i^{\prime}\neq i}G_{h_{1}}(X_{i^{\prime}}^{(1)}-x)\sum_{j^{\prime}\neq j}G_{h_{2}}(X_{j^{\prime}}^{(2)}-x),

which we call the sample conditional maximum mean discrepancy. This estimation strategy, which combines U-statistics and kernel smoothing, has been adopted in previous studies such as Wang et al., (2015); Ke and Yin, (2020). It first appeared in the so-called conditional U-statistics (Stute,, 1991), which generalize the Nadaraya-Watson estimate of a regression function in the same way as Hoeffding’s classical U-statistic is a generalization of the sample mean.
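To fix ideas, here is a minimal NumPy transcription of the sample CMMD (our own sketch, not accompanying software), assuming a Gaussian smoothing kernel g, a Gaussian reproducing kernel k, and responses stored as two-dimensional arrays with one row per observation:

```python
import numpy as np

def prod_gauss_kernel(U, h):
    """Product smoothing kernel G_h(u) = h^{-p} prod_s g(u(s)/h) with Gaussian g."""
    g = np.exp(-(U / h) ** 2 / 2) / np.sqrt(2 * np.pi)
    return np.prod(g / h, axis=-1)

def rbf_gram(A, B, gamma):
    """Gaussian reproducing kernel k(y, y') = exp(-||y - y'||^2 / (2 gamma^2))."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * gamma ** 2))

def sample_cmmd(x, X1, Y1, X2, Y2, h1, h2, gamma):
    """Sample CMMD at the point x, term by term as in the display above."""
    w1 = prod_gauss_kernel(X1 - x, h1)      # G_{h1}(X_i^{(1)} - x), length n1
    w2 = prod_gauss_kernel(X2 - x, h2)      # G_{h2}(X_j^{(2)} - x), length n2
    K11, K22, K12 = rbf_gram(Y1, Y1, gamma), rbf_gram(Y2, Y2, gamma), rbf_gram(Y1, Y2, gamma)
    den1 = w1.sum() ** 2 - np.sum(w1 ** 2)  # sum over i1 != i2 of the weight products
    den2 = w2.sum() ** 2 - np.sum(w2 ** 2)
    term1 = (w1 @ K11 @ w1 - np.sum(np.diag(K11) * w1 ** 2)) / den1
    term2 = (w2 @ K22 @ w2 - np.sum(np.diag(K22) * w2 ** 2)) / den2
    # cross term uses the leave-one-out sums S1 - w1[i] and S2 - w2[j]
    cross = np.sum(K12 * np.outer(w1 * (w1.sum() - w1), w2 * (w2.sum() - w2)))
    return term1 + term2 - 2 * cross / (den1 * den2)
```

Since the division U / h broadcasts across the last axis, h1 and h2 may also be vectors of componentwise bandwidths, in line with the recommendation above.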

We shall make the following assumptions throughout the analysis.

Assumption 1.

The univariate smoothing kernel gg is of order ν\nu in the sense that g(u)𝑑u=1\int g(u)du=1, urg(u)𝑑u=0\int u^{r}g(u)du=0 for r=1,,ν1r=1,\cdots,\nu-1, and uνg(u)𝑑u0\int u^{\nu}g(u)du\neq 0. Also, gg is bounded with g2(u)𝑑u<\int g^{2}(u)du<\infty and |u|ν|g(u)|𝑑u<\int|u|^{\nu}|g(u)|du<\infty.

Assumption 2.

For l=1,2l=1,2, hl0h_{l}\rightarrow 0 and nlhlpn_{l}h_{l}^{p}\rightarrow\infty, as nln_{l}\rightarrow\infty.

Assumption 3.

The marginal densities f1(x)f_{1}(x) and f2(x)f_{2}(x) are ν\nu-times continuously differentiable.

Assumption 4.

𝔼{k(Y(1),Y(1))X(1)=x1,X(1)=x2}\mathbb{E}\{k(Y^{(1)},Y^{(1)\prime})\mid X^{(1)}=x_{1},X^{(1)\prime}=x_{2}\}, 𝔼{k(Y(2),Y(2))X(2)=x1,X(2)=x2}\mathbb{E}\{k(Y^{(2)},Y^{(2)\prime})\mid X^{(2)}=x_{1},X^{(2)\prime}=x_{2}\} and 𝔼{k(Y(1),Y(2))X(1)=x1,X(2)=x2}\mathbb{E}\{k(Y^{(1)},Y^{(2)})\mid X^{(1)}=x_{1},X^{(2)}=x_{2}\} are ν\nu-times continuously differentiable with respect to (x1,x2)(x_{1}^{\top},x_{2}^{\top})^{\top}.

Assumptions 1-3 are standard in the literature (see, e.g., Section 1.11 in Li and Racine, (2007)), allowing for multivariate XX and higher-order kernels. Assumption 4 can be seen as the counterpart of the usual smoothness condition imposed on the regression function in classical kernel regression. A similar assumption is used in Stute, (1991).

In what follows, we present the consistency of the sample CMMD in Theorem 2, while Theorems 3 and 4 provide its asymptotic distributions under HaH_{a} and H0H_{0} in (2), respectively.

Theorem 2.

Suppose fl(x)>0f_{l}(x)>0 and PYX=x(l)k1(𝒴)P_{Y\mid X=x}^{(l)}\in\mathcal{M}_{k}^{1}(\mathcal{Y}) for l=1,2l=1,2. Under Assumptions 1-4, we have

γk2^(x)𝑝γk2(x),as n1,n2.\widehat{\gamma_{k}^{2}}(x)\overset{p}{\longrightarrow}\gamma_{k}^{2}(x),\quad\text{as }n_{1},n_{2}\rightarrow\infty.
Theorem 3.

Suppose fl(x)>0f_{l}(x)>0 and PYX=x(l)k1+δ/2(𝒴)P_{Y\mid X=x}^{(l)}\in\mathcal{M}_{k}^{1+\delta/2}(\mathcal{Y}) for l=1,2l=1,2 and some δ>0\delta>0. Under HaH_{a} in (2), Assumptions 1-4 and nl1/2hlp/2+ν0n_{l}^{1/2}h_{l}^{p/2+\nu}\rightarrow 0 as nln_{l}\rightarrow\infty for l=1,2l=1,2, we have

γk2^(x)γk2(x)2{f1(x)f2(x)}2ξ12n1+ξ22n2𝑑𝒩(0,1),as n1,n2,\frac{\widehat{\gamma_{k}^{2}}(x)-\gamma_{k}^{2}(x)}{\frac{2}{\{f_{1}(x)f_{2}(x)\}^{2}}\sqrt{\frac{\xi_{1}^{2}}{n_{1}}+\frac{\xi_{2}^{2}}{n_{2}}}}\overset{d}{\longrightarrow}\mathcal{N}(0,1),\quad\text{as }n_{1},n_{2}\rightarrow\infty,

where ξ12\xi_{1}^{2} and ξ22\xi_{2}^{2} are defined in equations (A.3)-(A.4) of the appendix.

As shown in the proof in the appendix, we have ξ12h1p\xi_{1}^{2}\asymp h_{1}^{-p} and ξ22h2p\xi_{2}^{2}\asymp h_{2}^{-p}. Thus, under HaH_{a} in (2), γk2^(x)γk2(x)=Op((n1h1p)1/2+(n2h2p)1/2)\widehat{\gamma_{k}^{2}}(x)-\gamma_{k}^{2}(x)=O_{p}((n_{1}h_{1}^{p})^{-1/2}+(n_{2}h_{2}^{p})^{-1/2}), which is the same convergence rate as that of classical kernel regression (Ullah and Pagan,, 1999) and conditional U-statistics (Stute,, 1991). The condition nl1/2hlp/2+ν0n_{l}^{1/2}h_{l}^{p/2+\nu}\rightarrow 0 as nln_{l}\rightarrow\infty in Theorem 3 requires undersmoothing in order to make the estimator asymptotically unbiased. Undersmoothing, coupled with the use of a higher-order kernel, is commonly employed in nonparametric and semiparametric estimation to reduce the bias of estimators involving kernel smoothing (Li and Racine,, 2007). The same condition with p=1p=1 and ν=2\nu=2 is used in Section 3.4 of Ullah and Pagan, (1999) for the central limit theorem (CLT) of classical kernel regression and in Stute, (1991) for the CLT of conditional U-statistics.

Theorem 4.

Define the double centered kernel

k~(y,y)=k(y,y)𝔼{k(y,Y)X=x}𝔼{k(Y,y)X=x}+𝔼{k(Y,Y)X=x,X=x},\widetilde{k}(y,y^{\prime})=k(y,y^{\prime})-\mathbb{E}\{k(y,Y^{\prime})\mid X^{\prime}=x\}-\mathbb{E}\{k(Y,y^{\prime})\mid X=x\}+\mathbb{E}\{k(Y,Y^{\prime})\mid X=x,X^{\prime}=x\},

where YX=x,YX=xi.i.d.PYX=xPYX=x(1)=PYX=x(2)Y\mid X=x,Y^{\prime}\mid X^{\prime}=x\overset{i.i.d.}{\sim}P_{Y\mid X=x}\coloneqq P_{Y\mid X=x}^{(1)}=P_{Y\mid X=x}^{(2)} under H0H_{0} in (2). Suppose fl(x)>0f_{l}(x)>0 and PYX=xk1+δ/2(𝒴)P_{Y\mid X=x}\in\mathcal{M}_{k}^{1+\delta/2}(\mathcal{Y}) for l=1,2l=1,2 and some δ>0\delta>0. Under H0H_{0} in (2), Assumptions 1-4, n1h1p/(n1h1p+n2h2p)τ(0,1)n_{1}h_{1}^{p}/(n_{1}h_{1}^{p}+n_{2}h_{2}^{p})\rightarrow\tau\in(0,1), and nlhlp+ν0n_{l}h_{l}^{p+\nu}\rightarrow 0 as nln_{l}\rightarrow\infty for l=1,2l=1,2, we have

(n1h1p+n2h2p)γk2^(x)𝑑r=1λr{(1τζr11τηr)2(1τ)σ12+τσ22τ(1τ)},(n_{1}h_{1}^{p}+n_{2}h_{2}^{p})\widehat{\gamma_{k}^{2}}(x)\overset{d}{\longrightarrow}\sum_{r=1}^{\infty}\lambda_{r}\bigg{\{}\bigg{(}\frac{1}{\sqrt{\tau}}\zeta_{r}-\frac{1}{\sqrt{1-\tau}}\eta_{r}\bigg{)}^{2}-\frac{(1-\tau)\sigma_{1}^{2}+\tau\sigma_{2}^{2}}{\tau(1-\tau)}\bigg{\}},

where {ζr}r=1\{\zeta_{r}\}_{r=1}^{\infty} and {ηr}r=1\{\eta_{r}\}_{r=1}^{\infty} are two sequences of independent normal random variables with

ζr\displaystyle\zeta_{r} i.i.d.𝒩(0,σ12),σ12=1f1(x){g2(u)𝑑u}p,\displaystyle\overset{i.i.d.}{\sim}\mathcal{N}(0,\sigma_{1}^{2}),\quad\sigma_{1}^{2}=\frac{1}{f_{1}(x)}\bigg{\{}\int g^{2}(u)du\bigg{\}}^{p},
ηr\displaystyle\eta_{r} i.i.d.𝒩(0,σ22),σ22=1f2(x){g2(u)𝑑u}p,\displaystyle\overset{i.i.d.}{\sim}\mathcal{N}(0,\sigma_{2}^{2}),\quad\sigma_{2}^{2}=\frac{1}{f_{2}(x)}\bigg{\{}\int g^{2}(u)du\bigg{\}}^{p},

and {λr}r=1\{\lambda_{r}\}_{r=1}^{\infty} are the eigenvalues of k~\widetilde{k} with respect to the conditional distribution PYX=xP_{Y\mid X=x}, i.e.,

𝔼{k~(y,Y)ϕr(Y)X=x}=λrϕr(y).\mathbb{E}\{\widetilde{k}(y,Y^{\prime})\phi_{r}(Y^{\prime})\mid X^{\prime}=x\}=\lambda_{r}\phi_{r}(y).

As demonstrated in Theorem 4, the sample CMMD converges at a faster rate under H0H_{0} in (2): γk2^(x)=Op((n1h1p)1+(n2h2p)1)\widehat{\gamma_{k}^{2}}(x)=O_{p}((n_{1}h_{1}^{p})^{-1}+(n_{2}h_{2}^{p})^{-1}), compared to that under HaH_{a} in (2). This explains why the undersmoothing condition in Theorem 4 is more stringent than that in Theorem 3. Compared with the unconditional scenario (Gretton et al.,, 2012), our results differ in several aspects. First, the convergence rates are distinct. Second, our spectral decomposition is performed with respect to the conditional distribution. Third, in contrast to i.i.d. standard normal random variables in Gretton et al., (2012), {ζr}r=1\{\zeta_{r}\}_{r=1}^{\infty} and {ηr}r=1\{\eta_{r}\}_{r=1}^{\infty} have different variances when f1(x)f2(x)f_{1}(x)\neq f_{2}(x).

4 Integrated conditional maximum mean discrepancy

The CMMD γk2(x)\gamma_{k}^{2}(x) is indexed by xpx\in\mathbb{R}^{p}, and thus is not a single number. We can obtain a single measure by integrating the CMMD with some weight function ω:p\omega:\mathbb{R}^{p}\rightarrow\mathbb{R}, i.e.,

k,ωpγk2(x)ω(x)𝑑x,\mathcal{I}_{k,\omega}\coloneqq\int_{\mathbb{R}^{p}}\gamma_{k}^{2}(x)\omega(x)dx,

which we call the integrated conditional maximum mean discrepancy (ICMMD). It follows directly from Theorem 1 that k,ω\mathcal{I}_{k,\omega} fully characterizes the homogeneity of two conditional distributions.

Theorem 5.

When ω\omega is a non-negative function with the same support as PX(1)P^{(1)}_{X} (or equivalently, PX(2)P^{(2)}_{X}) and the kernel kk is characteristic, we have k,ω=0\mathcal{I}_{k,\omega}=0 if and only if H0H_{0} in (1) holds.

Motivated by Su and White, (2007); Wang et al., (2015); Ke and Yin, (2020), we consider the weight function ω(x)={f1(x)f2(x)}2\omega(x)=\{f_{1}(x)f_{2}(x)\}^{2} and define

kpγk2(x){f1(x)f2(x)}2𝑑x.\mathcal{I}_{k}\coloneqq\int_{\mathbb{R}^{p}}\gamma_{k}^{2}(x)\{f_{1}(x)f_{2}(x)\}^{2}dx. (5)

This particular choice of weight function circumvents the random denominator issue. Otherwise, to deal with the density near zero and mitigate significant bias, additional stringent assumptions or trimming schemes may need to be used. For example, Yin and Yuan, (2020) assumes that the density functions f1(x)f_{1}(x) and f2(x)f_{2}(x) are bounded away from zero, which can be quite restrictive.

We employ the same estimation strategy as in Section 3 and propose the following estimator for the ICMMD, which is a U-statistic by symmetrization (see the proof of Theorem 6):

k^\displaystyle\widehat{\mathcal{I}_{k}} =1n1(n11)1n2(n21)\displaystyle=\frac{1}{n_{1}(n_{1}-1)}\frac{1}{n_{2}(n_{2}-1)}
×[i1i2k(Yi1(1),Yi2(1))Gh1(Xi1(1)Xi2(1))j1j2Gh2(Xj1(2)Xi1(1))Gh2(Xj2(2)Xi1(1))\displaystyle\quad\times\bigg{[}\sum_{i_{1}\neq i_{2}}k(Y_{i_{1}}^{(1)},Y_{i_{2}}^{(1)})G_{h_{1}}(X_{i_{1}}^{(1)}-X_{i_{2}}^{(1)})\sum_{j_{1}\neq j_{2}}G_{h_{2}}(X_{j_{1}}^{(2)}-X_{i_{1}}^{(1)})G_{h_{2}}(X_{j_{2}}^{(2)}-X_{i_{1}}^{(1)})
+j1j2k(Yj1(2),Yj2(2))Gh2(Xj1(2)Xj2(2))i1i2Gh1(Xi1(1)Xj1(2))Gh1(Xi2(1)Xj1(2))\displaystyle\quad+\sum_{j_{1}\neq j_{2}}k(Y_{j_{1}}^{(2)},Y_{j_{2}}^{(2)})G_{h_{2}}(X_{j_{1}}^{(2)}-X_{j_{2}}^{(2)})\sum_{i_{1}\neq i_{2}}G_{h_{1}}(X_{i_{1}}^{(1)}-X_{j_{1}}^{(2)})G_{h_{1}}(X_{i_{2}}^{(1)}-X_{j_{1}}^{(2)})
i=1n1j=1n2k(Yi(1),Yj(2)){Gh1(Xi(1)Xj(2))+Gh2(Xj(2)Xi(1))}\displaystyle\quad-\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}k(Y_{i}^{(1)},Y_{j}^{(2)})\Big{\{}G_{h_{1}}(X_{i}^{(1)}-X_{j}^{(2)})+G_{h_{2}}(X_{j}^{(2)}-X_{i}^{(1)})\Big{\}}
×iiGh1(Xi(1)Xj(2))jjGh2(Xj(2)Xi(1))].\displaystyle\quad\times\sum_{i^{\prime}\neq i}G_{h_{1}}(X_{i^{\prime}}^{(1)}-X_{j}^{(2)})\sum_{j^{\prime}\neq j}G_{h_{2}}(X_{j^{\prime}}^{(2)}-X_{i}^{(1)})\bigg{]}.

We call k^\widehat{\mathcal{I}_{k}} the sample integrated conditional maximum mean discrepancy.
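A direct, unoptimized NumPy transcription of this estimator (again our own sketch, reusing prod_gauss_kernel and rbf_gram from the CMMD sketch in Section 3) may look as follows; each term mirrors the corresponding term in the display:

```python
import numpy as np

def sample_icmmd(X1, Y1, X2, Y2, h1, h2, gamma):
    """Sample ICMMD, transcribed term by term from the display above."""
    n1, n2 = len(X1), len(X2)
    K11, K22, K12 = rbf_gram(Y1, Y1, gamma), rbf_gram(Y2, Y2, gamma), rbf_gram(Y1, Y2, gamma)
    G11 = prod_gauss_kernel(X1[:, None, :] - X1[None, :, :], h1)  # G_{h1}(X_{i1}^{(1)} - X_{i2}^{(1)})
    G22 = prod_gauss_kernel(X2[:, None, :] - X2[None, :, :], h2)
    V1 = prod_gauss_kernel(X1[:, None, :] - X2[None, :, :], h1)   # G_{h1}(X_i^{(1)} - X_j^{(2)}), shape (n1, n2)
    V2 = prod_gauss_kernel(X2[None, :, :] - X1[:, None, :], h2)   # G_{h2}(X_j^{(2)} - X_i^{(1)}), shape (n1, n2)
    # first term: the inner sum over j1 != j2 depends on X_{i1}^{(1)} only
    S2 = V2.sum(axis=1) ** 2 - np.sum(V2 ** 2, axis=1)
    M11 = K11 * G11
    np.fill_diagonal(M11, 0.0)
    term1 = np.sum(M11.sum(axis=1) * S2)
    # second term: the same with the roles of the two samples swapped
    S1 = V1.sum(axis=0) ** 2 - np.sum(V1 ** 2, axis=0)
    M22 = K22 * G22
    np.fill_diagonal(M22, 0.0)
    term2 = np.sum(M22.sum(axis=1) * S1)
    # cross term with leave-one-out sums over i' != i and j' != j
    C1 = V1.sum(axis=0)[None, :] - V1
    C2 = V2.sum(axis=1)[:, None] - V2
    term3 = np.sum(K12 * (V1 + V2) * C1 * C2)
    return (term1 + term2 - term3) / (n1 * (n1 - 1) * n2 * (n2 - 1))
```

All quantities are built from pairwise kernel evaluations, so the cost is O((n_1+n_2)^2), consistent with the complexity statement in Remark 3 below.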

Remark 3.

The computational complexity of both the sample CMMD γk2^(x)\widehat{\gamma_{k}^{2}}(x) and sample ICMMD k^\widehat{\mathcal{I}_{k}} is O((n1+n2)2)O((n_{1}+n_{2})^{2}), which is the same as that of the unconditional situation (Székely et al.,, 2004; Gretton et al.,, 2012). The details are provided in the appendix.

Theorem 6 asserts that k^\widehat{\mathcal{I}_{k}} is a consistent estimator of k\mathcal{I}_{k}. By examining the Hoeffding decomposition of k^\widehat{\mathcal{I}_{k}}, Theorem 7 describes its asymptotic distribution under the alternative hypothesis HaH_{a} in (1), while Theorem 8 characterizes its asymptotic null distribution under H0H_{0} in (1). To get these results, we need to strengthen Assumptions 3-4 in the following manner.

Assumption 5.

In addition to Assumption 3, we require the derivatives of f1(x)f_{1}(x) and f2(x)f_{2}(x) to be bounded uniformly over xx.

Assumption 6.

In addition to Assumption 4, we require the derivatives of the bivariate functions defined therein to belong to the set {f:p×p,𝔼PX(1)|f(X,X)|<,𝔼PX(2)|f(X,X)|<}\{f:\mathbb{R}^{p}\times\mathbb{R}^{p}\rightarrow\mathbb{R},\ \mathbb{E}_{P_{X}^{(1)}}|f(X,X)|<\infty,\ \mathbb{E}_{P_{X}^{(2)}}|f(X,X)|<\infty\}.

Assumption 5 is standard in the literature, as seen in previous works such as Lee, (2009); Wang et al., (2015). Meanwhile, Assumption 6 is mild in that it does not necessitate bounded derivatives of the conditional densities or moments. Wang et al., (2015) assumes that the conditional densities are ν\nu-times continuously differentiable with respect to xx and have bounded derivatives. However, under their assumption, the proof in Wang et al., (2015) implicitly requires the distance function y1y2\|y_{1}-y_{2}\| to be integrable with respect to the Lebesgue measure (not some probability measure), which does not hold unless the support is bounded. Indeed, by examining the proof presented in Wang et al., (2015) and Ke and Yin, (2020), it appears that conditions analogous to our Assumption 6 are necessary.

Theorem 6.

Suppose PY(l)k2(𝒴)P_{Y}^{(l)}\in\mathcal{M}_{k}^{2}(\mathcal{Y}) for l=1,2l=1,2. Under Assumptions 1-2 and 5-6, we have

k^𝑝k,as n1,n2.\widehat{\mathcal{I}_{k}}\overset{p}{\longrightarrow}\mathcal{I}_{k},\quad\text{as }n_{1},n_{2}\rightarrow\infty.
Theorem 7.

Suppose PY(l)k2+δ(𝒴)P_{Y}^{(l)}\in\mathcal{M}_{k}^{2+\delta}(\mathcal{Y}) for l=1,2l=1,2 and some δ>0\delta>0. Under HaH_{a} in (1), Assumptions 1-2, 5-6 and nl1/2hlν0n_{l}^{1/2}h_{l}^{\nu}\rightarrow 0 as nln_{l}\rightarrow\infty for l=1,2l=1,2, we have

k^k2δ1,02n1+δ0,12n2𝑑𝒩(0,1),as n1,n2,\frac{\widehat{\mathcal{I}_{k}}-\mathcal{I}_{k}}{2\sqrt{\frac{\delta_{1,0}^{2}}{n_{1}}+\frac{\delta_{0,1}^{2}}{n_{2}}}}\overset{d}{\longrightarrow}\mathcal{N}(0,1),\quad\text{as }n_{1},n_{2}\rightarrow\infty,

where δ1,02\delta_{1,0}^{2} and δ0,12\delta_{0,1}^{2} are defined in equations (A.8)-(A.9) of the appendix.

As shown in the proof in the appendix, we have δ1,021\delta_{1,0}^{2}\asymp 1 and δ0,121\delta_{0,1}^{2}\asymp 1. Hence, under HaH_{a} in (1), k^k=Op(n11/2+n21/2)\widehat{\mathcal{I}_{k}}-\mathcal{I}_{k}=O_{p}(n_{1}^{-1/2}+n_{2}^{-1/2}), and the convergence rate is the same as that of the test statistic in Lee, (2009). Again, undersmoothing is required, and the same condition has also been used in Lee, (2009). Furthermore, we emphasize that undersmoothing is not needed for establishing the consistency of γk2^(x)\widehat{\gamma_{k}^{2}}(x) and k^\widehat{\mathcal{I}_{k}}.

Theorem 8.

Suppose PY(l)k2+δ(𝒴)P_{Y}^{(l)}\in\mathcal{M}_{k}^{2+\delta}(\mathcal{Y}) for l=1,2l=1,2 and some δ>0\delta>0. Under H0H_{0} in (1), Assumptions 1-2, 5-6 and nlhlp/2+ν0n_{l}h_{l}^{p/2+\nu}\rightarrow 0 as nln_{l}\rightarrow\infty for l=1,2l=1,2, we have

k^2δ2,02n1(n11)+16δ1,12n1n2+2δ0,22n2(n21)𝑑𝒩(0,1),as n1,n2,\frac{\widehat{\mathcal{I}_{k}}}{\sqrt{\frac{2\delta_{2,0}^{2}}{n_{1}(n_{1}-1)}+\frac{16\delta_{1,1}^{2}}{n_{1}n_{2}}+\frac{2\delta_{0,2}^{2}}{n_{2}(n_{2}-1)}}}\overset{d}{\longrightarrow}\mathcal{N}(0,1),\quad\text{as }n_{1},n_{2}\rightarrow\infty,

where δ2,02\delta_{2,0}^{2}, δ1,12\delta_{1,1}^{2} and δ0,22\delta_{0,2}^{2} are defined in equations (A.10)-(A.12) of the appendix.

As shown in the proof in the appendix, we have δ2,02h1p\delta_{2,0}^{2}\asymp h_{1}^{-p}, δ1,12h1p+h2p\delta_{1,1}^{2}\asymp h_{1}^{-p}+h_{2}^{-p} and δ0,22h2p\delta_{0,2}^{2}\asymp h_{2}^{-p}. Thus, the convergence rate of k^\widehat{\mathcal{I}_{k}} under H0H_{0} in (1) is k^=Op(n11h1p/2+n21h2p/2)\widehat{\mathcal{I}_{k}}=O_{p}(n_{1}^{-1}h_{1}^{-p/2}+n_{2}^{-1}h_{2}^{-p/2}). Because of the integration of xx over p\mathbb{R}^{p}, the convergence rate of the global statistic k^\widehat{\mathcal{I}_{k}} is faster than that of the local statistic γk2^(x)\widehat{\gamma_{k}^{2}}(x) under both the null and alternative hypotheses. Besides, instead of a weighted sum of chi-squares, the global statistic k^\widehat{\mathcal{I}_{k}} has a limiting normal distribution under the null, which can be proved by the martingale CLT (see, e.g., Corollary 3.1 of Hall and Heyde, (2014)).

Theorems 3-4 and 7-8 demonstrate that kernel smoothing has two key effects on the asymptotic behavior of distance and kernel-based measures:

  • Kernel smoothing introduces biases of the order O(h1ν+h2ν)O(h_{1}^{\nu}+h_{2}^{\nu}). To ensure valid inference, undersmoothing or other bias reduction techniques are necessary to achieve asymptotic unbiasedness. For the local statistic γk2^(x)\widehat{\gamma_{k}^{2}}(x), Theorem 4 and Assumption 2 require that the bandwidths in the statistic be chosen such that hlnlκ1h_{l}\asymp n_{l}^{\kappa_{1}}, where κ1(1p,1p+ν)\kappa_{1}\in(-\frac{1}{p},-\frac{1}{p+\nu}). There is no restriction on the relationship between the order ν\nu of the univariate smoothing kernel gg and the dimension pp of XX. For the global statistic k^\widehat{\mathcal{I}_{k}}, Theorem 8 and Assumption 2 dictate that the bandwidths be chosen such that hlnlκ2h_{l}\asymp n_{l}^{\kappa_{2}}, where κ2(1p,1p/2+ν)\kappa_{2}\in(-\frac{1}{p},-\frac{1}{p/2+\nu}). This condition requires ν>p/2\nu>p/2, implying that higher-order kernels become necessary as the dimension pp increases. We note that Wang et al., (2015) simply used a second-order smoothing kernel throughout, which is not suitable for p4p\geq 4 (see Theorem 7 therein).

  • Due to kernel smoothing, the U-statistic estimators γk2^(x)\widehat{\gamma_{k}^{2}}(x) and k^\widehat{\mathcal{I}_{k}} are non-degenerate under the null hypotheses, as shown in the proof in the appendix. This is in sharp contrast to the unconditional scenario where the U-statistic estimators for the energy distance and maximum mean discrepancy are degenerate under the null (Székely et al.,, 2004; Gretton et al.,, 2012). Nonetheless, when undersmoothing is employed, a more in-depth analysis reveals that the first-order projections in their Hoeffding decompositions are asymptotically negligible under the null, compared with the second-order projections. Therefore, their asymptotic null distributions are determined by the second-order projections in the Hoeffding decompositions. Specifically, we have ξ12=O(h1p(h12ν+h22ν))\xi_{1}^{2}=O(h_{1}^{-p}(h_{1}^{2\nu}+h_{2}^{2\nu})) under H0H_{0} in (2), instead of ξ12h1p\xi_{1}^{2}\asymp h_{1}^{-p} under HaH_{a} in (2). Additionally, we have δ1,02=O(h12ν+h22ν)\delta_{1,0}^{2}=O(h_{1}^{2\nu}+h_{2}^{2\nu}) under H0H_{0} in (1), instead of δ1,021\delta_{1,0}^{2}\asymp 1 under HaH_{a} in (1).

Remark 4.

Our work utilizes techniques similar to those used by Wang et al., (2015) and Ke and Yin, (2020), including U-statistics and kernel smoothing. However, we believe that there exists a gap in the derivations of the asymptotic null distribution in both studies. Specifically, Theorem 7 of Wang et al., (2015) and Theorem 9 of Ke and Yin, (2020) both rely on Lemma B.4 of Fan and Li, (1996), which establishes the CLT for degenerate U-statistics under certain conditions. To show that a U-statistic is degenerate, it must be demonstrated that 𝒫n1(W1)=0\mathcal{P}_{n1}(W_{1})=0 under the null hypothesis, as per the notation used in their proof. However, Wang et al., (2015) merely showed that 𝔼{𝒫n1(W1)}=O(h2)\mathbb{E}\{\mathcal{P}_{n1}(W_{1})\}=O(h^{2}), while Ke and Yin, (2020) incorrectly claimed that 𝔼{𝒫n1(W1)}=0\mathbb{E}\{\mathcal{P}_{n1}(W_{1})\}=0. As a result, both studies overlook the crucial undersmoothing step necessary for valid inference.

5 Applications to global and local two-sample conditional distribution testing

Theorems 6-8 collectively suggest that k^\widehat{\mathcal{I}_{k}} is a suitable test statistic for testing (1). Nevertheless, it is impractical to use Theorem 8 to compute the pp-value since it is arduous to estimate δ2,02\delta_{2,0}^{2}, δ1,12\delta_{1,1}^{2} and δ0,22\delta_{0,2}^{2}. Moreover, it is widely acknowledged that a nonparametric test that relies on asymptotic normal approximation may perform poorly in finite samples (Su and White,, 2007). Consequently, we resort to the local bootstrap proposed by Paparoditis and Politis, (2000). This approach has been widely employed in other works involving conditional distributions; see, e.g., Su and White, (2008); Huang, (2010); Bouezmarni et al., (2012); Su and Spindler, (2013); Taamouti et al., (2014); Wang et al., (2015). One can follow aforementioned references to verify the asymptotic validity of this bootstrap method in our framework. Define

P^YX=x={l=12i=1nlGhl(Xi(l)x)}1l=12i=1nlGhl(Xi(l)x)δYi(l),\widehat{P}_{Y\mid X=x}=\bigg{\{}\sum_{l=1}^{2}\sum_{i=1}^{n_{l}}G_{h_{l}}(X_{i}^{(l)}-x)\bigg{\}}^{-1}\sum_{l=1}^{2}\sum_{i=1}^{n_{l}}G_{h_{l}}(X_{i}^{(l)}-x)\delta_{Y_{i}^{(l)}}, (6)

where δY\delta_{Y} denotes a point mass at YY. Essentially, P^YX=x\widehat{P}_{Y\mid X=x} is a discrete distribution that assigns the probability Ghl(Xi(l)x)/l=12i=1nlGhl(Xi(l)x)G_{h_{l}}(X_{i}^{(l)}-x)/\sum_{l=1}^{2}\sum_{i=1}^{n_{l}}G_{h_{l}}(X_{i}^{(l)}-x) to the observation Yi(l)Y_{i}^{(l)}. Then the following steps outline the procedure for the global two-sample conditional distribution test:

  1. (i)

    Calculate the global test statistic, i.e., the sample ICMMD, k^\widehat{\mathcal{I}_{k}}.

  2. (ii)

    For l=1,2l=1,2 and i=1,,nli=1,\ldots,n_{l}, draw Y^i(l)\widehat{Y}_{i}^{(l)} from P^YX=Xi(l)\widehat{P}_{Y\mid X=X_{i}^{(l)}}. Calculate the global test statistic using the local bootstrap samples {(Y^i(l),Xi(l))}i=1nl(l=1,2)\{(\widehat{Y}_{i}^{(l)},X_{i}^{(l)})\}_{i=1}^{n_{l}}\ (l=1,2), which retains the marginal distributions of XX but imposes the null restriction (i.e., the same conditional distribution of YY given XX).

  3. (iii)

    Repeat step (ii) BB times, and collect {k^(b)}b=1B\{\widehat{\mathcal{I}_{k}}^{(b)}\}_{b=1}^{B}. Then, the bootstrap-based pp-value of the global test is given by

    p-value=b=1B𝟙{k^(b)>k^}+1B+1.p\text{-value}=\frac{\sum_{b=1}^{B}\mathbbm{1}\{\widehat{\mathcal{I}_{k}}^{(b)}>\widehat{\mathcal{I}_{k}}\}+1}{B+1}.
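Assembling the pieces, a minimal sketch of this global testing procedure (assuming the sample_icmmd and prod_gauss_kernel helpers sketched earlier; B = 299 mirrors the choice in Section 6) is:

```python
import numpy as np

def global_test_pvalue(X1, Y1, X2, Y2, h1, h2, gamma, B=299, rng=None):
    """Local-bootstrap p-value for the global test, following steps (i)-(iii)."""
    rng = np.random.default_rng() if rng is None else rng
    stat = sample_icmmd(X1, Y1, X2, Y2, h1, h2, gamma)      # step (i)
    Y_pool = np.vstack([Y1, Y2])                            # pooled responses, rows aligned with the weights below
    def draw_responses(Xq):
        # \hat{P}_{Y|X=x} in (6): weight Y_i^{(l)} proportionally to G_{h_l}(X_i^{(l)} - x)
        W = np.hstack([prod_gauss_kernel(Xq[:, None, :] - X1[None, :, :], h1),
                       prod_gauss_kernel(Xq[:, None, :] - X2[None, :, :], h2)])
        W /= W.sum(axis=1, keepdims=True)
        idx = np.array([rng.choice(len(Y_pool), p=w) for w in W])
        return Y_pool[idx]
    exceed = 0
    for _ in range(B):                                      # steps (ii)-(iii)
        Y1_b, Y2_b = draw_responses(X1), draw_responses(X2)
        if sample_icmmd(X1, Y1_b, X2, Y2_b, h1, h2, gamma) > stat:
            exceed += 1
    return (exceed + 1) / (B + 1)
```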

Theorems 2-4 support the use of γk2^(x)\widehat{\gamma_{k}^{2}}(x) as the test statistic for testing (2). Unfortunately, the asymptotic null distribution of γk2^(x)\widehat{\gamma_{k}^{2}}(x) in Theorem 4 is not pivotal and involves infinitely many nuisance parameters. To tackle this issue, we once again utilize the local bootstrap method to calculate the pp-value, replacing the k^\widehat{\mathcal{I}_{k}} in steps (i)-(iii) above with γk2^(x)\widehat{\gamma_{k}^{2}}(x).

6 Numerical studies

6.1 Monte Carlo simulation

In this subsection, we assess the finite-sample performance of the proposed methodologies through several simulation examples. In Examples 1-3, we compare our global test (denoted as TCDT) with the test using conformal prediction (Hu and Lei,, 2024) (denoted as CONF). The performance of the proposed local test is examined in Examples 4-5.

For our proposed global and local tests, the kernel function gg for local smoothing in the test statistic is chosen to be the Gaussian kernel. In Examples 1 and 3-5, we use g(u)=(2π)1/2exp(u2/2)g(u)=(2\pi)^{-1/2}\exp(-u^{2}/2) of order ν=2\nu=2. In Example 2, a higher-order Gaussian kernel, g(u)=(3/2u2/2)(2π)1/2exp(u2/2)g(u)=(3/2-u^{2}/2)(2\pi)^{-1/2}\exp(-u^{2}/2), of order ν=4\nu=4 is applied, since the dimension of XX is p=4p=4 therein. We take the undersmoothing bandwidth hl(s)=Cnl1/(p/2+ν0.1)σ^X(l)(s)h_{l}(s)=Cn_{l}^{-1/(p/2+\nu-0.1)}\hat{\sigma}_{X^{(l)}(s)} for the global testing problem (1) following the guidance from Theorem 8, and use hl(s)=Cnl1/(p+ν0.1)σ^X(l)(s)h_{l}(s)=Cn_{l}^{-1/(p+\nu-0.1)}\hat{\sigma}_{X^{(l)}(s)} for the local testing problem (2) following Theorem 4, where σ^X(l)(s)\hat{\sigma}_{X^{(l)}(s)} is the sample standard deviation of X(l)(s)X^{(l)}(s). We set the constant factor CC here to 1 for both global and local tests hereafter. The effect of different values of CC is explored in Example 6 of the appendix. The RKHS kernel kk is chosen to be the popular Gaussian kernel k(y,y)=exp(yy2/2γ2)k(y,y^{\prime})=\exp(-\|y-y^{\prime}\|^{2}/2\gamma^{2}) with the bandwidth γ\gamma chosen by the median heuristic (Gretton et al.,, 2012). For the local bootstrap procedure (6), we take the Gaussian kernel g(u)=(2π)1/2exp(u2/2)g(u)=(2\pi)^{-1/2}\exp(-u^{2}/2) with bandwidth hl(s)=nl1/(p+4)σ^X(l)(s)h_{l}(s)=n_{l}^{-1/(p+4)}\hat{\sigma}_{X^{(l)}(s)}. For CONF, we consider estimating the marginal and conditional density ratios both by kernel logistic regression, with parameters σ2=200\sigma^{2}=200 and λ\lambda selected by the data-driven out-of-sample cross entropy loss as Hu and Lei, (2024) suggested. The equal data-splitting ratio is taken for CONF. We calculate the pp-value of the proposed methods via the local bootstrap procedure with 299 replications, while CONF’s pp-value is calculated from its asymptotic distribution as stated in Hu and Lei, (2024).
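For reference, the bandwidth rules and the RKHS bandwidth choice described above can be transcribed as follows (our own sketch; the median heuristic here is taken as the median of pairwise Euclidean distances among the responses, which is one common variant):

```python
import numpy as np

def undersmoothing_bandwidths(X, nu, global_test=True, C=1.0):
    """Componentwise bandwidths h_l(s) = C * n_l^{-1/(p/2 + nu - 0.1)} * sd(X^{(l)}(s))
    for the global test (Theorem 8), with p/2 replaced by p for the local test (Theorem 4)."""
    n, p = X.shape
    rate = (p / 2 if global_test else p) + nu - 0.1
    return C * n ** (-1.0 / rate) * X.std(axis=0, ddof=1)

def median_heuristic(Y):
    """RKHS bandwidth gamma: median of the pairwise Euclidean distances among responses."""
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return np.median(d[np.triu_indices(len(Y), k=1)])
```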

Example 1.

In this example, we focus on the univariate regression framework

Yi(l)=f(l)(Xi(l))+ϵi(l),i=1,,nl,l=1,2,\displaystyle Y_{i}^{(l)}=f^{(l)}(X_{i}^{(l)})+\epsilon_{i}^{(l)},\quad i=1,\dots,n_{l},~{}~{}l=1,2, (7)

where {Xi(l)}i=1nl\{X_{i}^{(l)}\}_{i=1}^{n_{l}} are univariate i.i.d. observations, {ϵi(l)}i=1nl\{\epsilon_{i}^{(l)}\}_{i=1}^{n_{l}} are independent random noises, and f(l)f^{(l)} denotes the regression function, for l=1,2l=1,2. Here Xi(1)i.i.d.𝒩(0,1)X_{i}^{(1)}\overset{i.i.d.}{\sim}\mathcal{N}(0,1) and Xi(2)i.i.d.𝒩(1,1)X_{i}^{(2)}\overset{i.i.d.}{\sim}\mathcal{N}(1,1), which corresponds to PX(1)PX(2)P_{X}^{(1)}\neq P_{X}^{(2)}. For both l=1,2l=1,2, set ϵi(l)i.i.d.𝒩(0,0.5)\epsilon_{i}^{(l)}\overset{i.i.d.}{\sim}\mathcal{N}(0,0.5).

We first consider the scenarios in which the two conditional distributions differ only in the conditional means (i.e., f(l)f^{(l)}). Three settings of f(l)f^{(l)} are considered as follows:

  1. (1.1)

    f(1)(x)=1+xf^{(1)}(x)=1+x  and  f(2)(x)=1+(1+c)xf^{(2)}(x)=1+(1+c)x;

  2. (1.2)

    f(1)(x)=1+sin(2πx)f^{(1)}(x)=1+\sin(2\pi x)  and  f(2)(x)=1+(1+c)sin(2πx)f^{(2)}(x)=1+(1+c)\sin(2\pi x);

  3. (1.3)

    f(1)(x)=1+exp(x)f^{(1)}(x)=1+\exp(x)  and  f(2)(x)=1+exp(x)+csin(2πx)f^{(2)}(x)=1+\exp(x)+c\sin(2\pi x);

where cc is a parameter controlling the signal strength of the difference. Notice that c=0c=0 corresponds to the null hypothesis, and a larger cc corresponds to a greater difference. To examine and compare the empirical sizes and powers of different methods, we set n1=n2=50n_{1}=n_{2}=50 or 100100, and vary the signal strength cc from 0 to 22. Table 2 and Figure 1 summarize the empirical sizes and powers under the significance level α=0.05\alpha=0.05 via 1000 simulations.
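As a concrete illustration, data for setting (1.1) could be generated as in the sketch below (reading \mathcal{N}(0,0.5) as having variance 0.5; the function name is ours), after which the global test from Section 5 is applied to the two samples:

```python
import numpy as np

def generate_setting_1_1(n1, n2, c, rng=None):
    """Model (7) with f1(x) = 1 + x, f2(x) = 1 + (1 + c) x, X1 ~ N(0, 1), X2 ~ N(1, 1),
    and noise N(0, 0.5) in both samples (second argument read as the variance)."""
    rng = np.random.default_rng() if rng is None else rng
    X1 = rng.normal(0.0, 1.0, size=(n1, 1))
    X2 = rng.normal(1.0, 1.0, size=(n2, 1))
    Y1 = 1.0 + X1 + rng.normal(0.0, np.sqrt(0.5), size=(n1, 1))
    Y2 = 1.0 + (1.0 + c) * X2 + rng.normal(0.0, np.sqrt(0.5), size=(n2, 1))
    return X1, Y1, X2, Y2
```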

Table 2: Empirical size (c=0c=0) under α=0.05\alpha=0.05 in Example 1.

Difference in conditional means:

| Setting | Sample size | TCDT | CONF |
| (1.1) | n_1 = n_2 = 50 | 0.040 | 0.063 |
| (1.1) | n_1 = n_2 = 100 | 0.045 | 0.050 |
| (1.2) | n_1 = n_2 = 50 | 0.039 | 0.055 |
| (1.2) | n_1 = n_2 = 100 | 0.038 | 0.051 |
| (1.3) | n_1 = n_2 = 50 | 0.019 | 0.091 |
| (1.3) | n_1 = n_2 = 100 | 0.018 | 0.083 |

Difference in conditional variances:

| Setting | Sample size | TCDT | CONF |
| (1.4) | n_1 = n_2 = 100 | 0.022 | 0.063 |
| (1.4) | n_1 = n_2 = 200 | 0.020 | 0.055 |
| (1.5) | n_1 = n_2 = 100 | 0.030 | 0.058 |
| (1.5) | n_1 = n_2 = 200 | 0.025 | 0.059 |
| (1.6) | n_1 = n_2 = 100 | 0.004 | 0.062 |
| (1.6) | n_1 = n_2 = 200 | 0.003 | 0.061 |

Figure 1: Empirical size (c=0c=0) and power (c0c\neq 0) under α=0.05\alpha=0.05 in Example 1.

One can observe that TCDT controls the empirical size below the nominal level in all cases, whereas CONF tends to fail to control the type I error. A possible reason is that, as pointed out in Assumption 2(b) of Hu and Lei, (2024), CONF relies on accurate density ratio estimators and may require unbalanced sample splitting. In the linear regression setting (1.1), CONF achieves higher power than TCDT. In settings (1.2) and (1.3), which feature an oscillating curve and an exponential curve, TCDT demonstrates a power advantage over CONF while keeping the type I error well controlled.

Next we consider the scenarios in which the two conditional distributions differ solely in conditional variances. Let the regression curves f(1)(x)=f(2)(x)=1+x2f^{(1)}(x)=f^{(2)}(x)=1+x^{2} be identical quadratic functions across the two samples. The following three cases are considered:

  1. (1.4)

    ϵi(1)𝒩(0,0.25),ϵi(2)𝒩(0,0.25+c)\epsilon_{i}^{(1)}\sim\mathcal{N}(0,0.25),\quad\epsilon_{i}^{(2)}\sim\mathcal{N}(0,0.25+c);

  2. (1.5)

    ϵi(1)𝒩(0,0.252exp(2Xi(1))),ϵi(2)𝒩(0,(0.25+0.5c)2exp(2Xi(2)))\epsilon_{i}^{(1)}\sim\mathcal{N}(0,0.25^{2}\exp(2X^{(1)}_{i})),\quad\epsilon_{i}^{(2)}\sim\mathcal{N}(0,(0.25+0.5c)^{2}\exp(2X^{(2)}_{i}));

  3. (1.6)

    ϵi(1)𝒩(0,0.11+(Xi(1))2),ϵi(2)𝒩(0,0.1+c1+(Xi(2))2)\epsilon_{i}^{(1)}\sim\mathcal{N}(0,\frac{0.1}{1+(X^{(1)}_{i})^{2}}),\quad\epsilon_{i}^{(2)}\sim\mathcal{N}(0,\frac{0.1+c}{1+(X^{(2)}_{i})^{2}});

Here cc is the parameter controlling the signal strength of the difference. The first setting is homoscedastic, whereas the latter two are heteroscedastic, with error variances depending on the covariate. We set n1=n2=100n_{1}=n_{2}=100 or 200200, and vary the signal strength cc. Results of 1000 simulations are shown in Table 2 and Figure 1.
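A corresponding sketch for the heteroscedastic setting (1.5) is given below; again an illustrative NumPy implementation with hypothetical helper names, not the authors' code.

import numpy as np

def generate_setting_1_5(n1=100, n2=100, c=0.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x1 = rng.normal(0.0, 1.0, size=n1)                    # X^(1) ~ N(0, 1)
    x2 = rng.normal(1.0, 1.0, size=n2)                    # X^(2) ~ N(1, 1)
    # Identical regression curves f^(1)(x) = f^(2)(x) = 1 + x^2; the difference lies
    # solely in the conditional standard deviations 0.25 exp(X) vs (0.25 + 0.5 c) exp(X).
    y1 = 1.0 + x1**2 + 0.25 * np.exp(x1) * rng.normal(size=n1)
    y2 = 1.0 + x2**2 + (0.25 + 0.5 * c) * np.exp(x2) * rng.normal(size=n2)
    return (y1, x1), (y2, x2)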

CONF generally yields an empirical size above 0.05. Moreover, the power curve of TCDT rises more steeply than that of CONF. The good performance of TCDT holds across the different conditional-variance settings, indicating that TCDT is capable of handling various types of alternative hypotheses.

Example 2.

Similar to Example 1, we now consider the settings where the dimension of XX is p=4p=4. Let Xi(1)i.i.d.𝒩(𝟎,𝐈)X_{i}^{(1)}\overset{i.i.d.}{\sim}\mathcal{N}(\mathbf{0},\mathbf{I}) and Xi(2)i.i.d.𝒩(𝝁,𝐈)X_{i}^{(2)}\overset{i.i.d.}{\sim}\mathcal{N}(\boldsymbol{\mu},\mathbf{I}), where 𝝁=(1,1,1,0)\boldsymbol{\mu}=(1,1,-1,0)^{\top}. For both l=1,2l=1,2, set errors ϵi(l)i.i.d.310t(5)\epsilon_{i}^{(l)}\overset{i.i.d.}{\sim}\sqrt{\frac{3}{10}}t(5) with heavy tails. We let the difference between the two conditional distributions reside solely in the regression functions f(l)f^{(l)}. Specifically,

f(1)(x)={θ(x)}3+2{θ(x)}2θ(x)andf(2)(x)=(1+c){θ(x)}3+2{θ(x)}2θ(x),f^{(1)}(x)=\{\theta(x)\}^{3}+2\{\theta(x)\}^{2}-\theta(x)\quad\text{and}\quad f^{(2)}(x)=(1+c)\{\theta(x)\}^{3}+2\{\theta(x)\}^{2}-\theta(x),

where θ(x)=x(1)x(2)+x(3)x(4)0.5\theta(x)=x(1)-x(2)+x(3)-x(4)-0.5 and cc is the signal strength. Let the sample sizes be n1=n2=100n_{1}=n_{2}=100 or 200200, and vary cc from 0 to 22. We conduct the simulations 1000 times under significance level α=0.05\alpha=0.05. Table 3 and Figure 2 show the empirical sizes and powers. In this setting with multivariate XX, TCDT equipped with the higher-order Gaussian smoothing kernel achieves better power than CONF while maintaining type I error control.

Table 3: Empirical size (c=0c=0) under α=0.05\alpha=0.05 in Example 2.

Sample Size          TCDT    CONF
n_1 = n_2 = 100      0.019   0.018
n_1 = n_2 = 200      0.018   0.014

Figure 2: Empirical size (c=0c=0) and power (c0c\neq 0) under α=0.05\alpha=0.05 in Example 2.
Example 3.

We further consider a scenario that arises in transfer learning: testing whether the conditional distribution of covariates given the response, PXYP_{X\mid Y}, is the same for the two populations (i.e., the prior shift assumption). To generate data, in contrast to (7), we consider a multivariate regression model of XX on YY:

Xi(l)=f(l)(Yi(l))+ϵi(l),i=1,,nl,l=1,2,\displaystyle X_{i}^{(l)}=f^{(l)}(Y_{i}^{(l)})+\epsilon_{i}^{(l)},\quad i=1,\dots,n_{l},~{}~{}l=1,2,

where Yi(l)Y_{i}^{(l)}\in\mathbb{R}, ϵi(l)p\epsilon_{i}^{(l)}\in\mathbb{R}^{p} is the multivariate error, and f(l):pf^{(l)}:\mathbb{R}\to\mathbb{R}^{p} is the multivariate regression function. Let ϵi(l)i.i.d.𝒩(𝟎,𝚺)\epsilon_{i}^{(l)}\overset{i.i.d.}{\sim}\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}) for both l=1,2l=1,2, where 𝚺=(ρ|ij|)\boldsymbol{\Sigma}=(\rho^{|i-j|}) with ρ=0.5\rho=0.5. We consider the PY(1)PY(2)P_{Y}^{(1)}\neq P_{Y}^{(2)} case, by setting Yi(1)i.i.d.𝒩(0,1)Y_{i}^{(1)}\overset{i.i.d.}{\sim}\mathcal{N}(0,1) and Yi(2)i.i.d.𝒩(0,1.5)Y_{i}^{(2)}\overset{i.i.d.}{\sim}\mathcal{N}(0,1.5). Let θ(1)(y)=yβ\theta^{(1)}(y)=y\cdot\beta and θ(2)(y)=(1+c)yβ\theta^{(2)}(y)=(1+c)y\cdot\beta, where βp\beta\in\mathbb{R}^{p}, and cc controls the signal for the difference of the conditional distributions between two samples. Then

f(l)(y)=[{θ(l)(y)}2+3θ(l)(y)+2]÷[{θ(l)(y)}2+1],l=1,2,f^{(l)}(y)=[\{\theta^{(l)}(y)\}^{\circ 2}+3\cdot\theta^{(l)}(y)+2]\div[\{\theta^{(l)}(y)\}^{\circ 2}+1],\quad l=1,2,

where x2x^{\circ 2} is the Hadamard square of xx, and ÷\div is the Hadamard division. We consider p=5p=5 or 2020. When p=5p=5, we let β(j)=0.5\beta(j)=-0.5 for j=1,2j=1,2, β(j)=0.5\beta(j)=0.5 for j=3,4j=3,4 and β(j)=1\beta(j)=1 for j=5j=5. When p=20p=20, we let β(j)=0.5\beta(j)=-0.5 for j=1,2,3j=1,2,3, β(j)=0.5\beta(j)=0.5 for j=4,5j=4,5, β(j)=1\beta(j)=1 for j=6,,10j=6,\dots,10 and β(j)=0\beta(j)=0 for j=11,,20j=11,\dots,20. Let n1=n2=100n_{1}=n_{2}=100, 200200 or 500500, and vary the signal strength cc from 0 to 11. For each setting, we run 1000 simulations to obtain the empirical size (c=0c=0) and power (c0c\neq 0) of TCDT and CONF under the significance level of 0.05. The results are shown in Table 4 and Figure 3. Both methods can control the size in this setting. It can be observed that TCDT achieves greater power than CONF.
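To make the Hadamard (elementwise) construction of the regression function concrete, the following sketch generates one pair of samples for the p=5 case; the helper names and the reading of 𝒩(0,1.5) as having variance 1.5 are illustrative assumptions of ours.

import numpy as np

def generate_example_3(n1=100, n2=100, c=0.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p = 5
    beta = np.array([-0.5, -0.5, 0.5, 0.5, 1.0])           # the p = 5 specification of beta
    y1 = rng.normal(0.0, 1.0, size=n1)                     # Y^(1) ~ N(0, 1)
    y2 = rng.normal(0.0, np.sqrt(1.5), size=n2)            # Y^(2) ~ N(0, 1.5), 1.5 read as variance

    def f(y, slope):
        theta = slope * np.outer(y, beta)                  # theta^(l)(y) = slope * y * beta
        return (theta**2 + 3.0 * theta + 2.0) / (theta**2 + 1.0)   # elementwise (Hadamard) operations

    # AR(1)-type error covariance Sigma = (rho^{|i-j|}) with rho = 0.5.
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    x1 = f(y1, 1.0) + rng.multivariate_normal(np.zeros(p), Sigma, size=n1)
    x2 = f(y2, 1.0 + c) + rng.multivariate_normal(np.zeros(p), Sigma, size=n2)
    return (x1, y1), (x2, y2)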

Table 4: Empirical size (c=0c=0) under α=0.05\alpha=0.05 in Example 3.

                     p = 5                 p = 20
Sample Size          TCDT    CONF          TCDT    CONF
n_1 = n_2 = 100      0.035   0.047         0.013   0.025
n_1 = n_2 = 200      0.033   0.051         0.021   0.020
n_1 = n_2 = 500      0.039   0.048         0.025   0.019

Figure 3: Empirical size (c=0c=0) and power (c0c\neq 0) under α=0.05\alpha=0.05 in Example 3.
Example 4.

To examine the performance of the proposed test for the local testing problem (2), we first consider the data-generating setting under the univariate regression framework (7) with f(1)(x)=0.5sin(πx)f^{(1)}(x)=0.5\sin(\pi x) and f(2)(x)=0.5sin(πx)f^{(2)}(x)=-0.5\sin(\pi x). Sample sizes are set as n1=n2=150n_{1}=n_{2}=150. Set Xi(1)i.i.d.𝒩(0.5,1)X_{i}^{(1)}\overset{i.i.d.}{\sim}\mathcal{N}(-0.5,1), and Xi(2)i.i.d.𝒩(0.5,1)X_{i}^{(2)}\overset{i.i.d.}{\sim}\mathcal{N}(0.5,1), that is, PX(1)PX(2)P_{X}^{(1)}\neq P_{X}^{(2)}. The errors ϵi(l)i.i.d.𝒩(0,0.5)\epsilon_{i}^{(l)}\overset{i.i.d.}{\sim}\mathcal{N}(0,0.5) are considered for both l=1,2l=1,2. In this setting, the two samples differ only in the conditional means. When the local point xx is an integer, the null hypothesis holds, since the two conditional means coincide there. Between consecutive integers, the local signal first increases and then decreases, peaking at the half-integers.

We then conduct the proposed test at local points ranging from 1.3-1.3 to 1.31.3 with spacing 0.1, and obtain the empirical power or size at each local point based on 1000 simulations under a significance level of 0.05. The results are summarized in Figure 4. The empirical sizes at the local points x=1,0x=-1,0 and 11 are 0.055,0.0460.055,0.046 and 0.0540.054, respectively. The size is controlled around the nominal level of 0.05, and the power increases as the local signal strengthens. The power loss for xx of large magnitude is likely due to the scarcity of sample points near the boundary.

Figure 4: The empirical power curves and the properly scaled signals in Example 4.
Example 5.

We further consider a bivariate case to show how the proposed local test performs. Specifically, let Xi(1)X_{i}^{(1)} be i.i.d. and follow a bivariate truncated standard normal distribution with support on [2,2]×[2,2][-2,2]\times[-2,2], and let Xi(2)i.i.d.𝒰(2,2)×𝒰(2,2)X_{i}^{(2)}\overset{i.i.d.}{\sim}\mathcal{U}(-2,2)\times\mathcal{U}(-2,2). Then set Yi(1)=Xi(1)2+ϵi(1),i=1,,n1,Y_{i}^{(1)}=\|X_{i}^{(1)}\|^{2}+\epsilon_{i}^{(1)},\ i=1,\dots,n_{1}, and Yi(2)=Xi(2)2𝟙{Xi(2)(1)Xi(2)(2)0}Xi(2)2𝟙{Xi(2)(1)Xi(2)(2)<0}+ϵi(2),i=1,,n2Y_{i}^{(2)}=\|X_{i}^{(2)}\|^{2}\mathbbm{1}\{X_{i}^{(2)}(1)X_{i}^{(2)}(2)\geq 0\}-\|X_{i}^{(2)}\|^{2}\mathbbm{1}\{X_{i}^{(2)}(1)X_{i}^{(2)}(2)<0\}+\epsilon_{i}^{(2)},\ i=1,\dots,n_{2}, where ϵi(l)i.i.d.𝒩(0,1)\epsilon_{i}^{(l)}\overset{i.i.d.}{\sim}\mathcal{N}(0,1) for both l=1,2l=1,2. In this setting, the null in (2) holds when the local point xx is in the first or third quadrant, and does not hold when xx is in the second or fourth quadrant.
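A minimal sketch of this bivariate data-generating scheme is given below; the rejection-sampling step for the truncated normal and the function names are our own choices and purely illustrative.

import numpy as np

def truncated_std_normal_2d(n, bound=2.0, rng=None):
    # Bivariate standard normal truncated to [-bound, bound]^2, via simple rejection sampling.
    rng = np.random.default_rng() if rng is None else rng
    draws = []
    while sum(len(d) for d in draws) < n:
        cand = rng.normal(size=(2 * n, 2))
        draws.append(cand[np.all(np.abs(cand) <= bound, axis=1)])
    return np.concatenate(draws)[:n]

def generate_example_5(n1=300, n2=300, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    x1 = truncated_std_normal_2d(n1, rng=rng)
    x2 = rng.uniform(-2.0, 2.0, size=(n2, 2))
    sq1 = np.sum(x1**2, axis=1)                            # squared Euclidean norm ||X||^2
    sq2 = np.sum(x2**2, axis=1)
    sign2 = np.where(x2[:, 0] * x2[:, 1] >= 0, 1.0, -1.0)  # + in quadrants I/III, - in II/IV
    y1 = sq1 + rng.normal(size=n1)
    y2 = sign2 * sq2 + rng.normal(size=n2)
    return (y1, x1), (y2, x2)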

We generate n1=n2=300n_{1}=n_{2}=300 samples, and conduct the proposed local test at a fixed uniform grid of 20×2020\times 20 points over (x(1),x(2))[2,2]×[2,2](x(1),x(2))\in[-2,2]\times[-2,2], under the significance level of 0.05. Figure 5 shows the result for a single data generation. The proposed test correctly rejects H0H_{0} at most grid points in the region where the alternative holds, without producing any false rejections of H0H_{0}.

Figure 5: Significant local regions for Example 5, obtained by the proposed test under α=0.05\alpha=0.05 based on one-time data generation. The rejection regions are filled in red, whereas the non-significant regions are filled in gray.

6.2 Ethanol data example

We now apply the proposed test to analyze the ethanol data (Cleveland, 1993; Kulasekera, 1995; Kulasekera and Wang, 1997). The response variable YY is the sum of the concentrations of nitric oxide and nitrogen dioxide in the exhaust of an experimental engine operated at various equivalence ratios (XX) and compression ratios (CC). These oxides of nitrogen are major air pollutants, and the experiment was designed to determine how their concentration depends on the engine settings. The (Y,X)(Y,X) pairs are divided into two groups according to the compression ratio: low (Low): C<10C<10 and high (High): C10C\geq 10. We model YY given XX within each group and compare the two conditional distributions. The Low group contains 39 observations, while the High group contains 49 observations. Figure 6 shows the scatter plot for the two groups. For small values of XX, the Low group tends to give lower oxide concentrations than the High group, i.e., the two conditional distributions appear to differ. For large values of XX, however, the Low and High groups seem to display the same conditional distribution.

Figure 6: Scatter plot of the ethanol data for the two groups, and the curve of pp-values from the proposed local test for local XX between 0.75 and 1.15. The horizontal dashed line indicates 0.050.05.

To validate these findings, we first apply our global test, using the same configuration as in Example 1 except that the number of local bootstrap replications is set to B=499B=499. The resulting pp-value is 0.0200.020, providing strong evidence against the equality of the two conditional distributions. We also conduct the local tests for XX ranging from 0.75 to 1.15, and plot the corresponding curve of pp-values against XX in Figure 6. The pp-values are below 0.05 for XX from 0.750.75 to 0.830.83, suggesting that the two conditional distributions differ significantly at these local points. These results are consistent with what we observed from the scatter plot. Hence, our method works well even for data sets with small sample sizes.

6.3 Airfoil data example

We also validate the performance of the proposed global test through the airfoil dataset (Brooks et al., 2014). The dataset was collected by NASA to study the sound pressure of different airfoils, and has also been explored by Tibshirani et al. (2019) and Hu and Lei (2024). It consists of N=1503N=1503 observations, with a univariate response YY (scaled sound pressure level) and a 5-dimensional covariate vector XX (log frequency, angle of attack, chord length, free-stream velocity, and suction side log displacement thickness). We standardize each covariate to have zero mean and unit variance.

Since the dataset does not come as two samples, we adopt the following settings, similar to Hu and Lei (2024), to partition the data into two samples so that our proposed global test can be applied.

  1. 1.

    Random partition and exponential tilting on covariates. Following Tibshirani et al. (2019), we first randomly partition the dataset into two parts 𝒟1\mathcal{D}_{1} and 𝒟~2\mathcal{\widetilde{D}}_{2} with sample sizes n1=301n_{1}=301 and n~2=1202\tilde{n}_{2}=1202, and then sample n2=301n_{2}=301 points without replacement from 𝒟~2\mathcal{\widetilde{D}}_{2} to build the second sample 𝒟2\mathcal{D}_{2}. The sampling is conducted with probabilities proportional to w(X)=exp(αX)w(X)=\exp(\alpha^{\top}X), where α=(1,0,0,0,1)\alpha=(-1,0,0,0,1). A sketch of this tilted subsampling is given after this list.

  2. 2.

    Random partition and exponential tilting on response. Similar to Setting 1, except that the sampling is conducted with probabilities proportional to w(Y)=Yw(Y)=Y.

  3. 3.

    Response-based partition. We partition the dataset into two subsets based on the value of the “sound” variable. The first sample contains the n1=752n_{1}=752 points with smaller values, and the second sample contains the remaining n2=751n_{2}=751 points. To avoid singularity between the two populations, we randomly select 0.05n10.05n_{1} observations in each of the two samples and then swap their group labels.

  4. 4.

    Chord-based partition. Similar to Setting 3, except that the dataset is partitioned based on the value of the “chord” variable.
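Below is a small sketch of the exponential-tilting subsampling used in Setting 1 (referenced in that item above); the use of numpy.random.Generator.choice as a weighted without-replacement sampler is an implementation choice of ours, not taken from the original code.

import numpy as np

def tilted_split(X, alpha, n1=301, n2=301, rng=None):
    # Randomly split the N rows of X into D1 (size n1) and D2_tilde (the rest), then draw
    # n2 indices from D2_tilde without replacement with probabilities proportional to
    # w(X) = exp(alpha' X).
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(X.shape[0])
    idx1, idx_rest = perm[:n1], perm[n1:]
    w = np.exp(X[idx_rest] @ alpha)
    idx2 = rng.choice(idx_rest, size=n2, replace=False, p=w / w.sum())
    return idx1, idx2

# Setting 1 uses alpha = (-1, 0, 0, 0, 1) on the standardized covariates, e.g.
# idx1, idx2 = tilted_split(X_std, np.array([-1.0, 0.0, 0.0, 0.0, 1.0]))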

We conduct the proposed global test at significance level α=0.05\alpha=0.05 to test the equality between PYX(1)P^{(1)}_{Y\mid X} and PYX(2)P^{(2)}_{Y\mid X} in Settings 1 and 3 with a Gaussian smoothing kernel of order ν=4\nu=4, and to test the equality between PXY(1)P^{(1)}_{X\mid Y} and PXY(2)P^{(2)}_{X\mid Y} in Settings 2 and 4 with a Gaussian smoothing kernel of order ν=2\nu=2. Other implementation details are the same as those in the simulations. Notice that Settings 1-2 are under the null hypothesis, while Settings 3-4 are under the alternative hypothesis. For Settings 1-2, we repeat the data generation 500 times and report the empirical rejection rate with B=299B=299 in Table 5. For Settings 3-4, since the two samples are generated essentially deterministically apart from the small fraction of flipped group labels, we conduct the proposed global test once and report the pp-value in Table 5.

Table 5: Airfoil data example: Empirical rejection rate and pp-values for different partition settings. Tests are under α=0.05\alpha=0.05.

           Rejection rate                 p-value
Setting 1  0.008              Setting 3   0.003
Setting 2  0.050              Setting 4   0.003

As can be seen, for Settings 1-2 under the null, the proposed test controls the empirical rejection rate below or around the nominal level 0.05. For Settings 3-4, the null hypothesis is correctly rejected.

7 Discussions

Several research directions warrant further exploration. First, it would be valuable to extend the framework to KK-sample conditional distribution testing for K2K\geq 2 and establish connections with distance-based multivariate analysis of variance (Rizzo and Székely, 2010). In addition to kernel smoothing, alternative machine learning techniques, such as random forests and neural networks, could be considered for estimating the CMMD and ICMMD. These methods are anticipated to offer better performance for higher-dimensional XX, but determining the asymptotic distributions of the corresponding test statistics presents significant challenges. Lastly, compared to the sample CMMD, a local polynomial-based statistic may be preferable for addressing local testing problems, particularly when X=xX=x lies on the boundary of the region of interest. Such a statistic could potentially be helpful in the regression-discontinuity design (Calonico et al., 2014).

Appendix A Computation of the sample CMMD and ICMMD

For the sample CMMD γk2^(x)\widehat{\gamma_{k}^{2}}(x), note that, for i=1,,n1i=1,\ldots,n_{1},

iiGh1(Xi(1)x)=i=1n1Gh1(Xi(1)x)Gh1(Xi(1)x),\sum_{i^{\prime}\neq i}G_{h_{1}}(X_{i^{\prime}}^{(1)}-x)=\sum_{i^{\prime}=1}^{n_{1}}G_{h_{1}}(X_{i^{\prime}}^{(1)}-x)-G_{h_{1}}(X_{i}^{(1)}-x),

and, for j=1,,n2j=1,\ldots,n_{2},

jjGh2(Xj(2)x)=j=1n2Gh2(Xj(2)x)Gh2(Xj(2)x).\sum_{j^{\prime}\neq j}G_{h_{2}}(X_{j^{\prime}}^{(2)}-x)=\sum_{j^{\prime}=1}^{n_{2}}G_{h_{2}}(X_{j^{\prime}}^{(2)}-x)-G_{h_{2}}(X_{j}^{(2)}-x).

We first calculate i=1n1Gh1(Xi(1)x)\sum_{i^{\prime}=1}^{n_{1}}G_{h_{1}}(X_{i^{\prime}}^{(1)}-x) and j=1n2Gh2(Xj(2)x)\sum_{j^{\prime}=1}^{n_{2}}G_{h_{2}}(X_{j^{\prime}}^{(2)}-x), which can be computed in O(n1)O(n_{1}) and O(n2)O(n_{2}), respectively. Then it can be seen that the overall cost of computing the statistic γk2^(x)\widehat{\gamma_{k}^{2}}(x) is O((n1+n2)2)O((n_{1}+n_{2})^{2}).
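The leave-one-out sums above can be obtained in a single pass, as in the sketch below; the product Gaussian form of G_h and the helper names are illustrative choices of ours, consistent with but not identical to the smoothing kernels used in the paper.

import numpy as np

def G_h(diff, h):
    # Product kernel G_h(u) = prod_s h^{-1} g(u_s / h) with g the standard normal density;
    # other kernels satisfying the paper's assumptions could be substituted.
    u = np.atleast_2d(diff) / h
    return np.prod(np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h), axis=1)

def loo_kernel_sums(X, x, h):
    # sum_{i' != i} G_h(X_{i'} - x) for every i, in O(n): compute the total sum once and
    # subtract each observation's own kernel weight.
    w = G_h(X - x, h)
    return w.sum() - w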

For the sample ICMMD k^\widehat{\mathcal{I}_{k}}, note that, for i1=1,,n1i_{1}=1,\ldots,n_{1},

j1j2Gh2(Xj1(2)Xi1(1))Gh2(Xj2(2)Xi1(1))=[j=1n2Gh2(Xj(2)Xi1(1))]2j=1n2[Gh2(Xj(2)Xi1(1))]2,\sum_{j_{1}\neq j_{2}}G_{h_{2}}(X_{j_{1}}^{(2)}-X_{i_{1}}^{(1)})G_{h_{2}}(X_{j_{2}}^{(2)}-X_{i_{1}}^{(1)})=\bigg{[}\sum_{j=1}^{n_{2}}G_{h_{2}}(X_{j}^{(2)}-X_{i_{1}}^{(1)})\bigg{]}^{2}-\sum_{j=1}^{n_{2}}\Big{[}G_{h_{2}}(X_{j}^{(2)}-X_{i_{1}}^{(1)})\Big{]}^{2},

for j1=1,,n2j_{1}=1,\ldots,n_{2},

i1i2Gh1(Xi1(1)Xj1(2))Gh1(Xi2(1)Xj1(2))=[i=1n1Gh1(Xi(1)Xj1(2))]2i=1n1[Gh1(Xi(1)Xj1(2))]2,\sum_{i_{1}\neq i_{2}}G_{h_{1}}(X_{i_{1}}^{(1)}-X_{j_{1}}^{(2)})G_{h_{1}}(X_{i_{2}}^{(1)}-X_{j_{1}}^{(2)})=\bigg{[}\sum_{i=1}^{n_{1}}G_{h_{1}}(X_{i}^{(1)}-X_{j_{1}}^{(2)})\bigg{]}^{2}-\sum_{i=1}^{n_{1}}\Big{[}G_{h_{1}}(X_{i}^{(1)}-X_{j_{1}}^{(2)})\Big{]}^{2},

and, for i=1,,n1i=1,\ldots,n_{1} and j=1,,n2j=1,\ldots,n_{2},

iiGh1(Xi(1)Xj(2))\displaystyle\sum_{i^{\prime}\neq i}G_{h_{1}}(X_{i^{\prime}}^{(1)}-X_{j}^{(2)}) =i=1n1Gh1(Xi(1)Xj(2))Gh1(Xi(1)Xj(2)),\displaystyle=\sum_{i^{\prime}=1}^{n_{1}}G_{h_{1}}(X_{i^{\prime}}^{(1)}-X_{j}^{(2)})-G_{h_{1}}(X_{i}^{(1)}-X_{j}^{(2)}),
jjGh2(Xj(2)Xi(1))\displaystyle\sum_{j^{\prime}\neq j}G_{h_{2}}(X_{j^{\prime}}^{(2)}-X_{i}^{(1)}) =j=1n2Gh2(Xj(2)Xi(1))Gh2(Xj(2)Xi(1)).\displaystyle=\sum_{j^{\prime}=1}^{n_{2}}G_{h_{2}}(X_{j^{\prime}}^{(2)}-X_{i}^{(1)})-G_{h_{2}}(X_{j}^{(2)}-X_{i}^{(1)}).

We first calculate

{j=1n2Gh2(Xj(2)Xi(1))}i=1n1,{j=1n2[Gh2(Xj(2)Xi(1))]2}i=1n1,\displaystyle\bigg{\{}\sum_{j=1}^{n_{2}}G_{h_{2}}(X_{j}^{(2)}-X_{i}^{(1)})\bigg{\}}_{i=1}^{n_{1}},\quad\bigg{\{}\sum_{j=1}^{n_{2}}\Big{[}G_{h_{2}}(X_{j}^{(2)}-X_{i}^{(1)})\Big{]}^{2}\bigg{\}}_{i=1}^{n_{1}},
{i=1n1Gh1(Xi(1)Xj(2))}j=1n2,{i=1n1[Gh1(Xi(1)Xj(2))]2}j=1n2.\displaystyle\bigg{\{}\sum_{i=1}^{n_{1}}G_{h_{1}}(X_{i}^{(1)}-X_{j}^{(2)})\bigg{\}}_{j=1}^{n_{2}},\quad\bigg{\{}\sum_{i=1}^{n_{1}}\Big{[}G_{h_{1}}(X_{i}^{(1)}-X_{j}^{(2)})\Big{]}^{2}\bigg{\}}_{j=1}^{n_{2}}.

Each of these sequences can be computed with O(n1n2)O(n_{1}n_{2}) operations. Then, it can be seen that the overall cost of computing the statistic k^\widehat{\mathcal{I}_{k}} is O((n1+n2)2)O((n_{1}+n_{2})^{2}).
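Analogously, the four sequences above can be read off from the n1 × n2 cross kernel matrices, as in the sketch below (reusing the illustrative G_h defined in the previous sketch):

import numpy as np

def cross_kernel_sums(X1, X2, h):
    # K[i, j] = G_h(X_j^(2) - X_i^(1)); its row sums and row sums of squares give the first
    # pair of sequences, and swapping the roles of the two samples gives the other pair.
    K = np.stack([G_h(X2 - X1[i], h) for i in range(X1.shape[0])])   # O(n1 n2) operations
    return K.sum(axis=1), (K**2).sum(axis=1)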

Appendix B Ancillary results

Below we list some relevant results on generalized U-statistics from Lee (2019).

Definition B.1 (Section 2.2, Lee, 2019).

Assume that {Zi(1)}i=1n1\{Z_{i}^{(1)}\}_{i=1}^{n_{1}} and {Zj(2)}j=1n2\{Z_{j}^{(2)}\}_{j=1}^{n_{2}} are i.i.d. samples from distributions P(1)P^{(1)} and P(2)P^{(2)} respectively, and {Zi(1)}i=1n1\{Z_{i}^{(1)}\}_{i=1}^{n_{1}} is independent of {Zj(2)}j=1n2\{Z_{j}^{(2)}\}_{j=1}^{n_{2}}. Let ψ\psi be a function of m1+m2m_{1}+m_{2} arguments

ψ(z1(1),,zm1(1);z1(2),,zm2(2)),\psi(z_{1}^{(1)},\ldots,z_{m_{1}}^{(1)};z_{1}^{(2)},\ldots,z_{m_{2}}^{(2)}),

which is symmetric in z1(1),,zm1(1)z_{1}^{(1)},\ldots,z_{m_{1}}^{(1)} and z1(2),,zm2(2)z_{1}^{(2)},\ldots,z_{m_{2}}^{(2)}. The generalized U-statistic based on ψ\psi is a statistic of the form

Un1,n2=(n1m1)1(n2m2)1(n1,m1)(n2,m2)ψ(Zi1(1),,Zim1(1);Zj1(2),,Zjm2(2)),U_{n_{1},n_{2}}=\binom{n_{1}}{m_{1}}^{-1}\binom{n_{2}}{m_{2}}^{-1}\sum_{(n_{1},m_{1})}\sum_{(n_{2},m_{2})}\psi(Z_{i_{1}}^{(1)},\ldots,Z_{i_{m_{1}}}^{(1)};Z_{j_{1}}^{(2)},\ldots,Z_{j_{m_{2}}}^{(2)}),

where the summation is over all m1m_{1}-subsets of {Zi(1)}i=1n1\{Z_{i}^{(1)}\}_{i=1}^{n_{1}} and m2m_{2}-subsets of {Zj(2)}j=1n2\{Z_{j}^{(2)}\}_{j=1}^{n_{2}}. Then Un1,n2U_{n_{1},n_{2}} is an unbiased estimator of 𝔼{ψ(Z1(1),,Zm1(1);Z1(2),,Zm2(2))}\mathbb{E}\{\psi(Z_{1}^{(1)},\ldots,Z_{m_{1}}^{(1)};Z_{1}^{(2)},\ldots,Z_{m_{2}}^{(2)})\}.
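For small sample sizes, this definition can be implemented by direct enumeration, as in the brief sketch below; it is purely illustrative, since the cost grows combinatorially in n1 and n2.

from itertools import combinations
from math import comb

def generalized_u_stat(psi, Z1, Z2, m1, m2):
    # Average the symmetric kernel psi over all m1-subsets of the first sample and
    # all m2-subsets of the second sample.
    total = 0.0
    for I in combinations(range(len(Z1)), m1):
        for J in combinations(range(len(Z2)), m2):
            total += psi([Z1[i] for i in I], [Z2[j] for j in J])
    return total / (comb(len(Z1), m1) * comb(len(Z2), m2))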

Definition B.2 (Section 2.2, Lee, 2019).

For c=0,,m1c=0,\ldots,m_{1} and d=0,,m2d=0,\ldots,m_{2}, define

ψc,d(z1(1),,zc(1);z1(2),,zd(2))=𝔼{ψ(z1(1),,zc(1),Zc+1(1),,Zm1(1);z1(2),,zd(2),Zd+1(2),,Zm2(2))},\psi_{c,d}(z_{1}^{(1)},\ldots,z_{c}^{(1)};z_{1}^{(2)},\ldots,z_{d}^{(2)})=\mathbb{E}\{\psi(z_{1}^{(1)},\ldots,z_{c}^{(1)},Z_{c+1}^{(1)},\ldots,Z_{m_{1}}^{(1)};z_{1}^{(2)},\ldots,z_{d}^{(2)},Z_{d+1}^{(2)},\ldots,Z_{m_{2}}^{(2)})\},

and

σc,d2=var{ψc,d(Z1(1),,Zc(1);Z1(2),,Zd(2))}.\sigma_{c,d}^{2}=\operatorname{var}\{\psi_{c,d}(Z_{1}^{(1)},\ldots,Z_{c}^{(1)};Z_{1}^{(2)},\ldots,Z_{d}^{(2)})\}.
Theorem B.1 (Theorem 2 in Section 2.2, Lee, 2019).

The variance of the generalized U-statistic Un1,n2U_{n_{1},n_{2}} in Definition B.1 is given by

var(Un1,n2)=(n1m1)1(n2m2)1c=0m1d=0m2(m1c)(m2d)(n1m1m1c)(n2m2m2d)σc,d2,\operatorname{var}(U_{n_{1},n_{2}})=\binom{n_{1}}{m_{1}}^{-1}\binom{n_{2}}{m_{2}}^{-1}\sum_{c=0}^{m_{1}}\sum_{d=0}^{m_{2}}\binom{m_{1}}{c}\binom{m_{2}}{d}\binom{n_{1}-m_{1}}{m_{1}-c}\binom{n_{2}-m_{2}}{m_{2}-d}\sigma_{c,d}^{2},

where σc,d2\sigma_{c,d}^{2} is given in Definition B.2.

Definition B.3 (Section 2.2, Lee, 2019).

Let FzF_{z} denote the distribution function of a single point mass at zz. For c=0,,m1c=0,\ldots,m_{1} and d=0,,m2d=0,\ldots,m_{2}, define

ϕ(c,d)(z1(1),,zc(1);z1(2),,zd(2))\displaystyle\quad\phi^{(c,d)}(z_{1}^{(1)},\ldots,z_{c}^{(1)};z_{1}^{(2)},\ldots,z_{d}^{(2)})
=ψ(u1,,um1;v1,,vm2)i=1c{dFzi(1)(ui)dP(1)(ui)}i=c+1m1dP(1)(ui)\displaystyle=\int\ldots\int\psi(u_{1},\ldots,u_{m_{1}};v_{1},\ldots,v_{m_{2}})\prod_{i=1}^{c}\{dF_{z_{i}^{(1)}}(u_{i})-dP^{(1)}(u_{i})\}\prod_{i=c+1}^{m_{1}}dP^{(1)}(u_{i})
×j=1d{dFzj(2)(vj)dP(2)(vj)}j=d+1m2dP(2)(vj).\displaystyle\quad\times\prod_{j=1}^{d}\{dF_{z_{j}^{(2)}}(v_{j})-dP^{(2)}(v_{j})\}\prod_{j=d+1}^{m_{2}}dP^{(2)}(v_{j}).
Theorem B.2 (Hoeffding decomposition, Theorem 3 in Section 2.2, Lee, 2019).

The generalized U-statistic Un1,n2U_{n_{1},n_{2}} in Definition B.1 admits the representation

Un1,n2=c=0m1d=0m2(m1c)(m2d)Hn1,n2(c,d),U_{n_{1},n_{2}}=\sum_{c=0}^{m_{1}}\sum_{d=0}^{m_{2}}\binom{m_{1}}{c}\binom{m_{2}}{d}H_{n_{1},n_{2}}^{(c,d)},

where Hn1,n2(c,d)H_{n_{1},n_{2}}^{(c,d)} is the generalized U-statistic based on ϕ(c,d)\phi^{(c,d)} in Definition B.3 and is given by

Hn1,n2(c,d)=(n1c)1(n2d)1(n1,c)(n2,d)ϕ(c,d)(Zi1(1),,Zic(1);Zj1(2),,Zjd(2)).H_{n_{1},n_{2}}^{(c,d)}=\binom{n_{1}}{c}^{-1}\binom{n_{2}}{d}^{-1}\sum_{(n_{1},c)}\sum_{(n_{2},d)}\phi^{(c,d)}(Z_{i_{1}}^{(1)},\ldots,Z_{i_{c}}^{(1)};Z_{j_{1}}^{(2)},\ldots,Z_{j_{d}}^{(2)}).

Moreover, the functions ϕ(c,d)\phi^{(c,d)} satisfy

(i) 𝔼{ϕ(c,d)(Z1(1),,Zc(1);Z1(2),,Zd(2))}=0\mathbb{E}\{\phi^{(c,d)}(Z_{1}^{(1)},\ldots,Z_{c}^{(1)};Z_{1}^{(2)},\ldots,Z_{d}^{(2)})\}=0;

(ii) cov{ϕ(c,d)(S1;S2),ϕ(c,d)(S1;S2)}=0\operatorname{cov}\{\phi^{(c,d)}(S_{1};S_{2}),\phi^{(c^{\prime},d^{\prime})}(S_{1}^{\prime};S_{2}^{\prime})\}=0 for all integers c,d,c,dc,d,c^{\prime},d^{\prime} and sets S1,S2,S1,S2S_{1},S_{2},S_{1}^{\prime},S_{2}^{\prime} unless c=cc=c^{\prime}, d=dd=d^{\prime}, S1=S1S_{1}=S_{1}^{\prime} and S2=S2S_{2}=S_{2}^{\prime}.

The generalized U-statistics Hn1,n2(c,d)H_{n_{1},n_{2}}^{(c,d)} are thus all uncorrelated. Their variances are given by

var(Hn1,n2(c,d))=(n1c)1(n2d)1δc,d2,\operatorname{var}(H_{n_{1},n_{2}}^{(c,d)})=\binom{n_{1}}{c}^{-1}\binom{n_{2}}{d}^{-1}\delta_{c,d}^{2},

where δc,d2=var{ϕ(c,d)(Z1(1),,Zc(1);Z1(2),,Zd(2))}\delta_{c,d}^{2}=\operatorname{var}\{\phi^{(c,d)}(Z_{1}^{(1)},\ldots,Z_{c}^{(1)};Z_{1}^{(2)},\ldots,Z_{d}^{(2)})\}.

Appendix C Technical details

For l=1,2l=1,2 and i=1,,nli=1,\ldots,n_{l}, denote Zi(l)=(Yi(l),Xi(l))Z_{i}^{(l)}=(Y_{i}^{(l)},X_{i}^{(l)}).

Proof of Theorem 1.

Facts 1 and 2 follow directly from the definitions of a semimetric of strong negative type and a characteristic kernel.

Suppose kk generates ρ\rho, i.e., ρ(y,y)=12{k(y,y)+k(y,y)}k(y,y)\rho(y,y^{\prime})=\frac{1}{2}\{k(y,y)+k(y^{\prime},y^{\prime})\}-k(y,y^{\prime}). Then

𝒟ρ(x)\displaystyle\mathcal{D}_{\rho}(x) =𝒴𝒴ρ(y,y)d(PYX=x(1)PYX=x(2))(y)d(PYX=x(1)PYX=x(2))(y)\displaystyle=-\int_{\mathcal{Y}}\int_{\mathcal{Y}}\rho(y,y^{\prime})d(P_{Y\mid X=x}^{(1)}-P_{Y\mid X=x}^{(2)})(y)d(P_{Y\mid X=x}^{(1)}-P_{Y\mid X=x}^{(2)})(y^{\prime})
=𝒴𝒴[k(y,y)12{k(y,y)+k(y,y)}]d(PYX=x(1)PYX=x(2))(y)d(PYX=x(1)PYX=x(2))(y)\displaystyle=\int_{\mathcal{Y}}\int_{\mathcal{Y}}\bigg{[}k(y,y^{\prime})-\frac{1}{2}\{k(y,y)+k(y^{\prime},y^{\prime})\}\bigg{]}d(P_{Y\mid X=x}^{(1)}-P_{Y\mid X=x}^{(2)})(y)d(P_{Y\mid X=x}^{(1)}-P_{Y\mid X=x}^{(2)})(y^{\prime})
=𝒴𝒴k(y,y)d(PYX=x(1)PYX=x(2))(y)d(PYX=x(1)PYX=x(2))(y)\displaystyle=\int_{\mathcal{Y}}\int_{\mathcal{Y}}k(y,y^{\prime})d(P_{Y\mid X=x}^{(1)}-P_{Y\mid X=x}^{(2)})(y)d(P_{Y\mid X=x}^{(1)}-P_{Y\mid X=x}^{(2)})(y^{\prime})
=γk2(x),\displaystyle=\gamma_{k}^{2}(x),

where we have used the fact that 𝒴d(PYX=x(1)PYX=x(2))(y)=0\int_{\mathcal{Y}}d(P_{Y\mid X=x}^{(1)}-P_{Y\mid X=x}^{(2)})(y)=0. ∎

Proof of Theorem 2.

Define the generalized U-statistic

Un1,n2=(n12)1(n22)11i1<i2n11j1<j2n2ψ(Zi1(1),Zi2(1);Zj1(2),Zj2(2)),U_{n_{1},n_{2}}=\binom{n_{1}}{2}^{-1}\binom{n_{2}}{2}^{-1}\sum_{1\leq i_{1}<i_{2}\leq n_{1}}\sum_{1\leq j_{1}<j_{2}\leq n_{2}}\psi(Z_{i_{1}}^{(1)},Z_{i_{2}}^{(1)};Z_{j_{1}}^{(2)},Z_{j_{2}}^{(2)}),

where

ψ(Z1(1),Z2(1);Z1(2),Z2(2))\displaystyle\quad\psi(Z_{1}^{(1)},Z_{2}^{(1)};Z_{1}^{(2)},Z_{2}^{(2)})
=[k(Y1(1),Y2(1))+k(Y1(2),Y2(2))12{k(Y1(1),Y1(2))+k(Y1(1),Y2(2))+k(Y2(1),Y1(2))+k(Y2(1),Y2(2))}\displaystyle=\bigg{[}k(Y_{1}^{(1)},Y_{2}^{(1)})+k(Y_{1}^{(2)},Y_{2}^{(2)})-\frac{1}{2}\Big{\{}k(Y_{1}^{(1)},Y_{1}^{(2)})+k(Y_{1}^{(1)},Y_{2}^{(2)})+k(Y_{2}^{(1)},Y_{1}^{(2)})+k(Y_{2}^{(1)},Y_{2}^{(2)})\Big{\}}
γk2(x)]Gh1(X1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x).\displaystyle\quad-\gamma_{k}^{2}(x)\bigg{]}G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x).

Under Assumptions 1-3, we have

γk2^(x)γk2(x)\displaystyle\quad\widehat{\gamma_{k}^{2}}(x)-\gamma_{k}^{2}(x)
={1n1(n11)i1i2Gh1(Xi1(1)x)Gh1(Xi2(1)x)1n2(n21)j1j2Gh2(Xj1(2)x)Gh2(Xj2(2)x)}1\displaystyle=\bigg{\{}\frac{1}{n_{1}(n_{1}-1)}\sum_{i_{1}\neq i_{2}}G_{h_{1}}(X_{i_{1}}^{(1)}-x)G_{h_{1}}(X_{i_{2}}^{(1)}-x)\frac{1}{n_{2}(n_{2}-1)}\sum_{j_{1}\neq j_{2}}G_{h_{2}}(X_{j_{1}}^{(2)}-x)G_{h_{2}}(X_{j_{2}}^{(2)}-x)\bigg{\}}^{-1}
×Un1,n2\displaystyle\quad\times U_{n_{1},n_{2}}
=[{f1(x)f2(x)}2+op(1)]1Un1,n2.\displaystyle=\big{[}\{f_{1}(x)f_{2}(x)\}^{2}+o_{p}(1)\big{]}^{-1}U_{n_{1},n_{2}}.

It suffices to show that

𝔼(Un1,n2)=o(1),var(Un1,n2)=o(1).\mathbb{E}(U_{n_{1},n_{2}})=o(1),\quad\operatorname{var}(U_{n_{1},n_{2}})=o(1). (A.1)

For the first part of (A.1), we have

𝔼(Un1,n2)\displaystyle\mathbb{E}(U_{n_{1},n_{2}}) =𝔼{k(Y1(1),Y2(1))Gh1(X1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle=\mathbb{E}\{k(Y_{1}^{(1)},Y_{2}^{(1)})G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\}
+𝔼{k(Y1(2),Y2(2))Gh1(X1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle\quad+\mathbb{E}\{k(Y_{1}^{(2)},Y_{2}^{(2)})G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\}
2𝔼{k(Y1(1),Y1(2))Gh1(X1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle\quad-2\mathbb{E}\{k(Y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\}
γk2(x)𝔼{Gh1(X1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}.\displaystyle\quad-\gamma_{k}^{2}(x)\mathbb{E}\{G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\}.

We first consider the term

𝔼{k(Y1(1),Y1(2))Gh1(X1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle\quad\mathbb{E}\{k(Y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\}
=k(y1(1),y1(2))Gh1(x1(1)x)Gh1(x2(1)x)Gh2(x1(2)x)Gh2(x2(2)x)\displaystyle=\int k(y_{1}^{(1)},y_{1}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{1}}(x_{2}^{(1)}-x)G_{h_{2}}(x_{1}^{(2)}-x)G_{h_{2}}(x_{2}^{(2)}-x)
×f1(y1(1),x1(1))f2(y1(2),x1(2))f1(x2(1))f2(x2(2))dy1(1)dy1(2)dx1(1)dx2(1)dx1(2)dx2(2).\displaystyle\quad\times f_{1}(y_{1}^{(1)},x_{1}^{(1)})f_{2}(y_{1}^{(2)},x_{1}^{(2)})f_{1}(x_{2}^{(1)})f_{2}(x_{2}^{(2)})dy_{1}^{(1)}dy_{1}^{(2)}dx_{1}^{(1)}dx_{2}^{(1)}dx_{1}^{(2)}dx_{2}^{(2)}.

Let t=h11(x1(1)x)t=h_{1}^{-1}(x_{1}^{(1)}-x), u=h11(x2(1)x)u=h_{1}^{-1}(x_{2}^{(1)}-x), v=h21(x1(2)x)v=h_{2}^{-1}(x_{1}^{(2)}-x) and w=h21(x2(2)x)w=h_{2}^{-1}(x_{2}^{(2)}-x). We have

𝔼{k(Y1(1),Y1(2))Gh1(X1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle\quad\mathbb{E}\{k(Y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\}
=(i)k(y1(1),y1(2))s=1p{g(t(s))g(u(s))g(v(s))g(w(s))}\displaystyle\overset{(i)}{=}\int k(y_{1}^{(1)},y_{1}^{(2)})\prod_{s=1}^{p}\{g(t(s))g(u(s))g(v(s))g(w(s))\}
×f1(y1(1),x+h1t)f2(y1(2),x+h2v)f1(x+h1u)f2(x+h2w)dy1(1)dy1(2)dtdudvdw\displaystyle\quad\times f_{1}(y_{1}^{(1)},x+h_{1}t)f_{2}(y_{1}^{(2)},x+h_{2}v)f_{1}(x+h_{1}u)f_{2}(x+h_{2}w)dy_{1}^{(1)}dy_{1}^{(2)}dtdudvdw
=𝔼{k(Y(1),Y(2))X(1)=x+h1t,X(2)=x+h2v}s=1p{g(t(s))g(u(s))g(v(s))g(w(s))}\displaystyle=\int\mathbb{E}\{k(Y^{(1)},Y^{(2)})\mid X^{(1)}=x+h_{1}t,X^{(2)}=x+h_{2}v\}\prod_{s=1}^{p}\{g(t(s))g(u(s))g(v(s))g(w(s))\}
×f1(x+h1t)f1(x+h1u)f2(x+h2v)f2(x+h2w)dtdudvdw\displaystyle\quad\times f_{1}(x+h_{1}t)f_{1}(x+h_{1}u)f_{2}(x+h_{2}v)f_{2}(x+h_{2}w)dtdudvdw
=(ii)𝔼{k(Y(1),Y(2))X(1)=x,X(2)=x}s=1p{g(t(s))g(u(s))g(v(s))g(w(s))}\displaystyle\overset{(ii)}{=}\int\mathbb{E}\{k(Y^{(1)},Y^{(2)})\mid X^{(1)}=x,X^{(2)}=x\}\prod_{s=1}^{p}\{g(t(s))g(u(s))g(v(s))g(w(s))\}
×{f1(x)f2(x)}2dtdudvdw+O(h1ν+h2ν)\displaystyle\quad\times\{f_{1}(x)f_{2}(x)\}^{2}dtdudvdw+O(h_{1}^{\nu}+h_{2}^{\nu})
=𝔼{k(Y(1),Y(2))X(1)=x,X(2)=x}{f1(x)f2(x)}2+O(h1ν+h2ν),\displaystyle=\mathbb{E}\{k(Y^{(1)},Y^{(2)})\mid X^{(1)}=x,X^{(2)}=x\}\{f_{1}(x)f_{2}(x)\}^{2}+O(h_{1}^{\nu}+h_{2}^{\nu}),

where (i)(i) follows from change of variables, and (ii)(ii) holds by Taylor’s theorem and Assumptions 1 and 3-4. Similarly, one can verify that

𝔼{k(Y1(1),Y2(1))Gh1(X1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle\quad\mathbb{E}\{k(Y_{1}^{(1)},Y_{2}^{(1)})G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\}
=𝔼{k(Y(1),Y(1))X(1)=x,X(1)=x}{f1(x)f2(x)}2+O(h1ν+h2ν),\displaystyle=\mathbb{E}\{k(Y^{(1)},Y^{(1)\prime})\mid X^{(1)}=x,X^{(1)\prime}=x\}\{f_{1}(x)f_{2}(x)\}^{2}+O(h_{1}^{\nu}+h_{2}^{\nu}),
𝔼{k(Y1(2),Y2(2))Gh1(X1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle\quad\mathbb{E}\{k(Y_{1}^{(2)},Y_{2}^{(2)})G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\}
=𝔼{k(Y(2),Y(2))X(2)=x,X(2)=x}{f1(x)f2(x)}2+O(h1ν+h2ν),\displaystyle=\mathbb{E}\{k(Y^{(2)},Y^{(2)\prime})\mid X^{(2)}=x,X^{(2)\prime}=x\}\{f_{1}(x)f_{2}(x)\}^{2}+O(h_{1}^{\nu}+h_{2}^{\nu}),
𝔼{Gh1(X1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle\quad\mathbb{E}\{G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\}
={f1(x)f2(x)}2+O(h1ν+h2ν).\displaystyle=\{f_{1}(x)f_{2}(x)\}^{2}+O(h_{1}^{\nu}+h_{2}^{\nu}).

Thus, by Assumption 2,

𝔼(Un1,n2)=O(h1ν+h2ν)=o(1).\mathbb{E}(U_{n_{1},n_{2}})=O(h_{1}^{\nu}+h_{2}^{\nu})=o(1).

For the second part of (A.1), following Definition B.2 and Theorem B.1 (m1=m2=2m_{1}=m_{2}=2), we have

var(Un1,n2)=(n12)1(n22)1c=02d=02(2c)(2d)(n122c)(n222d)σc,d2.\operatorname{var}(U_{n_{1},n_{2}})=\binom{n_{1}}{2}^{-1}\binom{n_{2}}{2}^{-1}\sum_{c=0}^{2}\sum_{d=0}^{2}\binom{2}{c}\binom{2}{d}\binom{n_{1}-2}{2-c}\binom{n_{2}-2}{2-d}\sigma_{c,d}^{2}.

We first calculate

σ2,22=var{ψ(Z1(1),Z2(1);Z1(2),Z2(2))}=𝔼{ψ2(Z1(1),Z2(1);Z1(2),Z2(2))}{𝔼(Un1,n2)}2.\sigma_{2,2}^{2}=\operatorname{var}\{\psi(Z_{1}^{(1)},Z_{2}^{(1)};Z_{1}^{(2)},Z_{2}^{(2)})\}=\mathbb{E}\{\psi^{2}(Z_{1}^{(1)},Z_{2}^{(1)};Z_{1}^{(2)},Z_{2}^{(2)})\}-\{\mathbb{E}(U_{n_{1},n_{2}})\}^{2}.

As for 𝔼{ψ2(Z1(1),Z2(1);Z1(2),Z2(2))}\mathbb{E}\{\psi^{2}(Z_{1}^{(1)},Z_{2}^{(1)};Z_{1}^{(2)},Z_{2}^{(2)})\}, it can be expanded into several terms, and each of these terms can be shown to be of order (h1h2)2p(h_{1}h_{2})^{-2p}. We only present the proof for the term below. Derivations for the other terms are similar. Note that

𝔼[{k(Y1(1),Y1(2))Gh1(X1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}2]\displaystyle\quad\mathbb{E}\Big{[}\Big{\{}k(Y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\Big{\}}^{2}\Big{]}
={k(y1(1),y1(2))Gh1(x1(1)x)Gh1(x2(1)x)Gh2(x1(2)x)Gh2(x2(2)x)}2\displaystyle=\int\left\{k(y_{1}^{(1)},y_{1}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{1}}(x_{2}^{(1)}-x)G_{h_{2}}(x_{1}^{(2)}-x)G_{h_{2}}(x_{2}^{(2)}-x)\right\}^{2}
×f1(y1(1),x1(1))f2(y1(2),x1(2))f1(x2(1))f2(x2(2))dy1(1)dy1(2)dx1(1)dx2(1)dx1(2)dx2(2)\displaystyle\quad\times f_{1}(y_{1}^{(1)},x_{1}^{(1)})f_{2}(y_{1}^{(2)},x_{1}^{(2)})f_{1}(x_{2}^{(1)})f_{2}(x_{2}^{(2)})dy_{1}^{(1)}dy_{1}^{(2)}dx_{1}^{(1)}dx_{2}^{(1)}dx_{1}^{(2)}dx_{2}^{(2)}
=(i)(h1h2)2p[k(y1(1),y1(2))s=1p{g(t(s))g(u(s))g(v(s))g(w(s))}]2\displaystyle\overset{(i)}{=}(h_{1}h_{2})^{-2p}\int\bigg{[}k(y_{1}^{(1)},y_{1}^{(2)})\prod_{s=1}^{p}\{g(t(s))g(u(s))g(v(s))g(w(s))\}\bigg{]}^{2}
×f1(y1(1),x+h1t)f2(y1(2),x+h2v)f1(x+h1u)f2(x+h2w)dy1(1)dy1(2)dtdudvdw\displaystyle\quad\times f_{1}(y_{1}^{(1)},x+h_{1}t)f_{2}(y_{1}^{(2)},x+h_{2}v)f_{1}(x+h_{1}u)f_{2}(x+h_{2}w)dy_{1}^{(1)}dy_{1}^{(2)}dtdudvdw
(h1h2)2p,\displaystyle\asymp(h_{1}h_{2})^{-2p},

where (i)(i) follows from the same change of variables technique as in the proof of the first part of (A.1) above. Hence, σ2,22=O((h1h2)2p)\sigma_{2,2}^{2}=O((h_{1}h_{2})^{-2p}).

Analogously, using change of variables, one can verify that σ1,02=O(h1p)\sigma_{1,0}^{2}=O(h_{1}^{-p}), σ0,12=O(h2p)\sigma_{0,1}^{2}=O(h_{2}^{-p}), σ2,02=O(h12p)\sigma_{2,0}^{2}=O(h_{1}^{-2p}), σ1,12=O((h1h2)p)\sigma_{1,1}^{2}=O((h_{1}h_{2})^{-p}), σ0,22=O(h22p)\sigma_{0,2}^{2}=O(h_{2}^{-2p}), σ2,12=O((h12h2)p)\sigma_{2,1}^{2}=O((h_{1}^{2}h_{2})^{-p}) and σ1,22=O((h1h22)p)\sigma_{1,2}^{2}=O((h_{1}h_{2}^{2})^{-p}). Therefore, by Assumption 2,

var(Un1,n2)=c=02d=02O(n1cn2d)σc,d2=O((n1h1p)1+(n2h2p)1)=o(1).\operatorname{var}(U_{n_{1},n_{2}})=\sum_{c=0}^{2}\sum_{d=0}^{2}O(n_{1}^{-c}n_{2}^{-d})\sigma_{c,d}^{2}=O((n_{1}h_{1}^{p})^{-1}+(n_{2}h_{2}^{p})^{-1})=o(1).

This completes the proof. ∎

Proof of Theorem 3.

Recall from the proof of Theorem 2 that

γk2^(x)γk2(x)=[{f1(x)f2(x)}2+op(1)]1Un1,n2.\widehat{\gamma_{k}^{2}}(x)-\gamma_{k}^{2}(x)=\big{[}\{f_{1}(x)f_{2}(x)\}^{2}+o_{p}(1)\big{]}^{-1}U_{n_{1},n_{2}}.

Following Definition B.3 and Theorem B.2, we have the Hoeffding decomposition for the U-statistic Un1,n2U_{n_{1},n_{2}}:

Un1,n2=c=02d=02(2c)(2d)Hn1,n2(c,d)=𝔼(Un1,n2)+2Hn1,n2(1,0)+2Hn1,n2(0,1)+Rn1,n2,U_{n_{1},n_{2}}=\sum_{c=0}^{2}\sum_{d=0}^{2}\binom{2}{c}\binom{2}{d}H_{n_{1},n_{2}}^{(c,d)}=\mathbb{E}(U_{n_{1},n_{2}})+2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}+R_{n_{1},n_{2}},

where

Hn1,n2(1,0)=1n1i=1n1ϕ(1,0)(Zi(1)),Hn1,n2(0,1)=1n2j=1n2ϕ(0,1)(Zj(2)),H_{n_{1},n_{2}}^{(1,0)}=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\phi^{(1,0)}(Z_{i}^{(1)}),\quad H_{n_{1},n_{2}}^{(0,1)}=\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}\phi^{(0,1)}(Z_{j}^{(2)}),

and Rn1,n2R_{n_{1},n_{2}} is the remainder term. Furthermore, we give the explicit expression of ϕ(1,0)(z1(1))\phi^{(1,0)}(z_{1}^{(1)}) (the expression of ϕ(0,1)(z1(2))\phi^{(0,1)}(z_{1}^{(2)}) is similar and omitted):

ϕ(1,0)(z1(1))=𝔼{k(y1(1),Y2(1))Gh1(x1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}+𝔼{k(Y1(2),Y2(2))Gh1(x1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}𝔼{k(y1(1),Y1(2))Gh1(x1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}𝔼{k(Y2(1),Y1(2))Gh1(x1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}γk2(x)𝔼{Gh1(x1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}𝔼(Un1,n2).\displaystyle\begin{split}\phi^{(1,0)}(z_{1}^{(1)})&=\mathbb{E}\Big{\{}k(y_{1}^{(1)},Y_{2}^{(1)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\Big{\}}\\ &\quad+\mathbb{E}\Big{\{}k(Y_{1}^{(2)},Y_{2}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\Big{\}}\\ &\quad-\mathbb{E}\Big{\{}k(y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\Big{\}}\\ &\quad-\mathbb{E}\Big{\{}k(Y_{2}^{(1)},Y_{1}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\Big{\}}\\ &\quad-\gamma_{k}^{2}(x)\mathbb{E}\Big{\{}G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\Big{\}}\\ &\quad-\mathbb{E}(U_{n_{1},n_{2}}).\end{split} (A.2)

By similar calculations as in the proof of Theorem 2, we have

ξ12\displaystyle\xi_{1}^{2} var{ϕ(1,0)(Z1(1))}=σ1,02h1p,\displaystyle\coloneqq\operatorname{var}\{\phi^{(1,0)}(Z_{1}^{(1)})\}=\sigma_{1,0}^{2}\asymp h_{1}^{-p}, (A.3)
ξ22\displaystyle\xi_{2}^{2} var{ϕ(0,1)(Z1(2))}=σ0,12h2p,\displaystyle\coloneqq\operatorname{var}\{\phi^{(0,1)}(Z_{1}^{(2)})\}=\sigma_{0,1}^{2}\asymp h_{2}^{-p}, (A.4)

under HaH_{a} in (2). Applying the Lyapunov CLT as in Section 3.4 of Ullah and Pagan (1999),

Hn1,n2(1,0)+Hn1,n2(0,1)ξ12n1+ξ22n2𝑑𝒩(0,1).\frac{H_{n_{1},n_{2}}^{(1,0)}+H_{n_{1},n_{2}}^{(0,1)}}{\sqrt{\frac{\xi_{1}^{2}}{n_{1}}+\frac{\xi_{2}^{2}}{n_{2}}}}\overset{d}{\longrightarrow}\mathcal{N}(0,1).

The condition that |g(u)|2+δ𝑑u<\int|g(u)|^{2+\delta}du<\infty for some δ>0\delta>0 used in Ullah and Pagan (1999) is implied by our Assumption 1 that gg is bounded and g2(u)𝑑u<\int g^{2}(u)du<\infty.

Also, one can show that

var(Hn1,n2(2,0))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(2,0)}) n12δ2,02n12σ2,02(n1h1p)2,\displaystyle\asymp n_{1}^{-2}\delta_{2,0}^{2}\asymp n_{1}^{-2}\sigma_{2,0}^{2}\asymp(n_{1}h_{1}^{p})^{-2},
var(Hn1,n2(1,1))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(1,1)}) (n1n2)1δ1,12(n1n2)1σ1,12(n1h1p)1(n2h2p)1,\displaystyle\asymp(n_{1}n_{2})^{-1}\delta_{1,1}^{2}\asymp(n_{1}n_{2})^{-1}\sigma_{1,1}^{2}\asymp(n_{1}h_{1}^{p})^{-1}(n_{2}h_{2}^{p})^{-1},
var(Hn1,n2(0,2))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(0,2)}) n22δ0,22n22σ0,22(n2h2p)2,\displaystyle\asymp n_{2}^{-2}\delta_{0,2}^{2}\asymp n_{2}^{-2}\sigma_{0,2}^{2}\asymp(n_{2}h_{2}^{p})^{-2},
var(Hn1,n2(2,1))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(2,1)}) n12n21δ2,12n12n21σ2,12(n1h1p)2(n2h2p)1,\displaystyle\asymp n_{1}^{-2}n_{2}^{-1}\delta_{2,1}^{2}\asymp n_{1}^{-2}n_{2}^{-1}\sigma_{2,1}^{2}\asymp(n_{1}h_{1}^{p})^{-2}(n_{2}h_{2}^{p})^{-1},
var(Hn1,n2(1,2))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(1,2)}) n11n22δ1,22n11n22σ1,22(n1h1p)1(n2h2p)2,\displaystyle\asymp n_{1}^{-1}n_{2}^{-2}\delta_{1,2}^{2}\asymp n_{1}^{-1}n_{2}^{-2}\sigma_{1,2}^{2}\asymp(n_{1}h_{1}^{p})^{-1}(n_{2}h_{2}^{p})^{-2},
var(Hn1,n2(2,2))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(2,2)}) (n1n2)2δ2,22(n1n2)2σ2,22(n1h1p)2(n2h2p)2.\displaystyle\asymp(n_{1}n_{2})^{-2}\delta_{2,2}^{2}\asymp(n_{1}n_{2})^{-2}\sigma_{2,2}^{2}\asymp(n_{1}h_{1}^{p})^{-2}(n_{2}h_{2}^{p})^{-2}.

Thus,

var(Rn1,n2)ξ12n1+ξ22n2=O((n1h1p)1+(n2h2p)1).\frac{\operatorname{var}(R_{n_{1},n_{2}})}{\frac{\xi_{1}^{2}}{n_{1}}+\frac{\xi_{2}^{2}}{n_{2}}}=O((n_{1}h_{1}^{p})^{-1}+(n_{2}h_{2}^{p})^{-1}).

Now we have

Un1,n2ξ12n1+ξ22n2\displaystyle\frac{U_{n_{1},n_{2}}}{\sqrt{\frac{\xi_{1}^{2}}{n_{1}}+\frac{\xi_{2}^{2}}{n_{2}}}} =2Hn1,n2(1,0)+2Hn1,n2(0,1)ξ12n1+ξ22n2+𝔼(Un1,n2)ξ12n1+ξ22n2+Rn1,n2ξ12n1+ξ22n2\displaystyle=\frac{2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}}{\sqrt{\frac{\xi_{1}^{2}}{n_{1}}+\frac{\xi_{2}^{2}}{n_{2}}}}+\frac{\mathbb{E}(U_{n_{1},n_{2}})}{\sqrt{\frac{\xi_{1}^{2}}{n_{1}}+\frac{\xi_{2}^{2}}{n_{2}}}}+\frac{R_{n_{1},n_{2}}}{\sqrt{\frac{\xi_{1}^{2}}{n_{1}}+\frac{\xi_{2}^{2}}{n_{2}}}}
=2Hn1,n2(1,0)+2Hn1,n2(0,1)ξ12n1+ξ22n2+O((n1h1p)1/2h1ν+(n2h2p)1/2h2ν)+Op((n1h1p)1/2+(n2h2p)1/2)\displaystyle=\frac{2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}}{\sqrt{\frac{\xi_{1}^{2}}{n_{1}}+\frac{\xi_{2}^{2}}{n_{2}}}}+O((n_{1}h_{1}^{p})^{1/2}h_{1}^{\nu}+(n_{2}h_{2}^{p})^{1/2}h_{2}^{\nu})+O_{p}((n_{1}h_{1}^{p})^{-1/2}+(n_{2}h_{2}^{p})^{-1/2})
=2Hn1,n2(1,0)+2Hn1,n2(0,1)ξ12n1+ξ22n2+op(1)𝑑𝒩(0,4),\displaystyle=\frac{2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}}{\sqrt{\frac{\xi_{1}^{2}}{n_{1}}+\frac{\xi_{2}^{2}}{n_{2}}}}+o_{p}(1)\overset{d}{\longrightarrow}\mathcal{N}(0,4),

where we have used Assumption 2 and nl1/2hlp/2+ν0n_{l}^{1/2}h_{l}^{p/2+\nu}\rightarrow 0 as nln_{l}\rightarrow\infty for l=1,2l=1,2. Then

γk2^(x)γk2(x)ξ12n1+ξ22n2=Un1,n2[{f1(x)f2(x)}2+op(1)]ξ12n1+ξ22n2𝑑𝒩(0,4{f1(x)f2(x)}4).\frac{\widehat{\gamma_{k}^{2}}(x)-\gamma_{k}^{2}(x)}{\sqrt{\frac{\xi_{1}^{2}}{n_{1}}+\frac{\xi_{2}^{2}}{n_{2}}}}=\frac{U_{n_{1},n_{2}}}{\big{[}\{f_{1}(x)f_{2}(x)\}^{2}+o_{p}(1)\big{]}\sqrt{\frac{\xi_{1}^{2}}{n_{1}}+\frac{\xi_{2}^{2}}{n_{2}}}}\overset{d}{\longrightarrow}\mathcal{N}(0,4\{f_{1}(x)f_{2}(x)\}^{-4}).

Proof of Theorem 4.

Since γk2(x)=0\gamma_{k}^{2}(x)=0 under H0H_{0} in (2), following the proof of Theorems 2-3, we have

γk2^(x)=[{f1(x)f2(x)}2+op(1)]1Un1,n2,\widehat{\gamma_{k}^{2}}(x)=\big{[}\{f_{1}(x)f_{2}(x)\}^{2}+o_{p}(1)\big{]}^{-1}U_{n_{1},n_{2}},

and the Hoeffding decomposition for Un1,n2U_{n_{1},n_{2}}:

Un1,n2=𝔼(Un1,n2)+2Hn1,n2(1,0)+2Hn1,n2(0,1)+Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2)+Rn1,n2,U_{n_{1},n_{2}}=\mathbb{E}(U_{n_{1},n_{2}})+2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}+H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)}+R_{n_{1},n_{2}},

where

Hn1,n2(2,0)\displaystyle H_{n_{1},n_{2}}^{(2,0)} =(n12)11i1<i2n1ϕ(2,0)(Zi1(1),Zi2(1)),\displaystyle=\binom{n_{1}}{2}^{-1}\sum_{1\leq i_{1}<i_{2}\leq n_{1}}\phi^{(2,0)}(Z_{i_{1}}^{(1)},Z_{i_{2}}^{(1)}),
Hn1,n2(1,1)\displaystyle H_{n_{1},n_{2}}^{(1,1)} =1n1n2i=1n1j=1n2ϕ(1,1)(Zi(1);Zj(2)),\displaystyle=\frac{1}{n_{1}n_{2}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\phi^{(1,1)}(Z_{i}^{(1)};Z_{j}^{(2)}),
Hn1,n2(0,2)\displaystyle H_{n_{1},n_{2}}^{(0,2)} =(n22)11j1<j2n2ϕ(0,2)(Zj1(2),Zj2(2)),\displaystyle=\binom{n_{2}}{2}^{-1}\sum_{1\leq j_{1}<j_{2}\leq n_{2}}\phi^{(0,2)}(Z_{j_{1}}^{(2)},Z_{j_{2}}^{(2)}),

and Rn1,n2R_{n_{1},n_{2}} is the remainder term, with a slight abuse of notation.

Under H0H_{0} in (2), one can verify that ϕ(1,0)(z1(1))0\phi^{(1,0)}(z_{1}^{(1)})\neq 0 and ϕ(0,1)(z1(2))0\phi^{(0,1)}(z_{1}^{(2)})\neq 0, and thus the U-statistic Un1,n2U_{n_{1},n_{2}} is non-degenerate. Nonetheless, a more in-depth analysis reveals that Hn1,n2(1,0)H_{n_{1},n_{2}}^{(1,0)} and Hn1,n2(0,1)H_{n_{1},n_{2}}^{(0,1)} can be asymptotically negligible under H0H_{0} in (2). Specifically, by similar calculations as in the proof of Theorem 2,

𝔼{k(y1(1),Y2(1))Gh1(x1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle\quad\mathbb{E}\Big{\{}k(y_{1}^{(1)},Y_{2}^{(1)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\Big{\}}
=Gh1(x1(1)x)k(y1(1),y2(1))f1(y2(1)x)𝑑y2(1)×[f1(x){f2(x)}2+O(h1ν+h2ν)],\displaystyle=G_{h_{1}}(x_{1}^{(1)}-x)\int k(y_{1}^{(1)},y_{2}^{(1)})f_{1}(y_{2}^{(1)}\mid x)dy_{2}^{(1)}\times\big{[}f_{1}(x)\{f_{2}(x)\}^{2}+O(h_{1}^{\nu}+h_{2}^{\nu})\big{]},
𝔼{k(Y1(2),Y2(2))Gh1(x1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle\quad\mathbb{E}\Big{\{}k(Y_{1}^{(2)},Y_{2}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\Big{\}}
=Gh1(x1(1)x)[k(y1(2),y2(2))f2(y1(2)x)f2(y2(2)x)𝑑y1(2)𝑑y2(2)×f1(x){f2(x)}2+O(h1ν+h2ν)],\displaystyle=G_{h_{1}}(x_{1}^{(1)}-x)\bigg{[}\int k(y_{1}^{(2)},y_{2}^{(2)})f_{2}(y_{1}^{(2)}\mid x)f_{2}(y_{2}^{(2)}\mid x)dy_{1}^{(2)}dy_{2}^{(2)}\times f_{1}(x)\{f_{2}(x)\}^{2}+O(h_{1}^{\nu}+h_{2}^{\nu})\bigg{]},
𝔼{k(y1(1),Y1(2))Gh1(x1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle\quad\mathbb{E}\Big{\{}k(y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\Big{\}}
=Gh1(x1(1)x)k(y1(1),y1(2))f2(y1(2)x)𝑑y1(2)×[f1(x){f2(x)}2+O(h1ν+h2ν)],\displaystyle=G_{h_{1}}(x_{1}^{(1)}-x)\int k(y_{1}^{(1)},y_{1}^{(2)})f_{2}(y_{1}^{(2)}\mid x)dy_{1}^{(2)}\times\big{[}f_{1}(x)\{f_{2}(x)\}^{2}+O(h_{1}^{\nu}+h_{2}^{\nu})\big{]},
𝔼{k(Y2(1),Y1(2))Gh1(x1(1)x)Gh1(X2(1)x)Gh2(X1(2)x)Gh2(X2(2)x)}\displaystyle\quad\mathbb{E}\Big{\{}k(Y_{2}^{(1)},Y_{1}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{1}}(X_{2}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)G_{h_{2}}(X_{2}^{(2)}-x)\Big{\}}
=Gh1(x1(1)x)[k(y2(1),y1(2))f1(y2(1)x)f2(y1(2)x)𝑑y2(1)𝑑y1(2)×f1(x){f2(x)}2+O(h1ν+h2ν)].\displaystyle=G_{h_{1}}(x_{1}^{(1)}-x)\bigg{[}\int k(y_{2}^{(1)},y_{1}^{(2)})f_{1}(y_{2}^{(1)}\mid x)f_{2}(y_{1}^{(2)}\mid x)dy_{2}^{(1)}dy_{1}^{(2)}\times f_{1}(x)\{f_{2}(x)\}^{2}+O(h_{1}^{\nu}+h_{2}^{\nu})\bigg{]}.

Under H0H_{0} in (2), we have PYX=xPYX=x(1)=PYX=x(2)P_{Y\mid X=x}\coloneqq P_{Y\mid X=x}^{(1)}=P_{Y\mid X=x}^{(2)}, and f(yx)f1(yx)=f2(yx)f(y\mid x)\coloneqq f_{1}(y\mid x)=f_{2}(y\mid x) for all yy at the fixed xx. Hence, for ϕ(1,0)(z1(1))\phi^{(1,0)}(z_{1}^{(1)}) in (A.2), we have

ϕ(1,0)(z1(1))=Gh1(x1(1)x){k(y1(1),y)f(yx)𝑑y+1}×O(h1ν+h2ν)+O(h1ν+h2ν),\phi^{(1,0)}(z_{1}^{(1)})=G_{h_{1}}(x_{1}^{(1)}-x)\bigg{\{}\int k(y_{1}^{(1)},y)f(y\mid x)dy+1\bigg{\}}\times O(h_{1}^{\nu}+h_{2}^{\nu})+O(h_{1}^{\nu}+h_{2}^{\nu}),

under H0H_{0} in (2). It implies that

var{ϕ(1,0)(Z1(1))}=O(h1p(h12ν+h22ν)),\operatorname{var}\{\phi^{(1,0)}(Z_{1}^{(1)})\}=O(h_{1}^{-p}(h_{1}^{2\nu}+h_{2}^{2\nu})),

and thus,

var(Hn1,n2(1,0))=O(n11h1p(h12ν+h22ν)).\operatorname{var}(H_{n_{1},n_{2}}^{(1,0)})=O(n_{1}^{-1}h_{1}^{-p}(h_{1}^{2\nu}+h_{2}^{2\nu})).

Analogously, one can show that var(Hn1,n2(0,1))=O(n21h2p(h12ν+h22ν))\operatorname{var}(H_{n_{1},n_{2}}^{(0,1)})=O(n_{2}^{-1}h_{2}^{-p}(h_{1}^{2\nu}+h_{2}^{2\nu})) under H0H_{0} in (2). Following the proof of Theorem 3, we have var(Hn1,n2(2,0))(n1h1p)2\operatorname{var}(H_{n_{1},n_{2}}^{(2,0)})\asymp(n_{1}h_{1}^{p})^{-2}, var(Hn1,n2(1,1))(n1h1p)1(n2h2p)1\operatorname{var}(H_{n_{1},n_{2}}^{(1,1)})\asymp(n_{1}h_{1}^{p})^{-1}(n_{2}h_{2}^{p})^{-1}, and var(Hn1,n2(0,2))(n2h2p)2\operatorname{var}(H_{n_{1},n_{2}}^{(0,2)})\asymp(n_{2}h_{2}^{p})^{-2}. Hence, when undersmoothing is employed, the asymptotic distribution of Un1,n2U_{n_{1},n_{2}} is determined by Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2)H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)}, instead of 2Hn1,n2(1,0)+2Hn1,n2(0,1)2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}.

The proof below is similar in spirit to that in Appendix B.1 of Gretton et al. (2012). The main difference is that the spectral decomposition is performed with respect to the conditional distribution. Define the doubly centered version of kk:

k~(y,y)=k(y,y)𝔼{k(y,Y)X=x}𝔼{k(Y,y)X=x}+𝔼{k(Y,Y)X=x,X=x},\widetilde{k}(y,y^{\prime})=k(y,y^{\prime})-\mathbb{E}\{k(y,Y^{\prime})\mid X^{\prime}=x\}-\mathbb{E}\{k(Y,y^{\prime})\mid X=x\}+\mathbb{E}\{k(Y,Y^{\prime})\mid X=x,X^{\prime}=x\},

where YX=x,YX=xi.i.d.PYX=xY\mid X=x,Y^{\prime}\mid X^{\prime}=x\overset{i.i.d.}{\sim}P_{Y\mid X=x}. As PYX=xk1+δ/2(𝒴)P_{Y\mid X=x}\in\mathcal{M}_{k}^{1+\delta/2}(\mathcal{Y}), the kernel k~\widetilde{k} is square integrable with respect to PYX=xP_{Y\mid X=x}. Then k~(y,y)\widetilde{k}(y,y^{\prime}) admits a spectral decomposition

k~(y,y)=r=1λrϕr(y)ϕr(y),\widetilde{k}(y,y^{\prime})=\sum_{r=1}^{\infty}\lambda_{r}\phi_{r}(y)\phi_{r}(y^{\prime}), (A.5)

where {λr}r=1\{\lambda_{r}\}_{r=1}^{\infty} are the eigenvalues and {ϕr()}r=1\{\phi_{r}(\cdot)\}_{r=1}^{\infty} are the corresponding orthonormal eigenfunctions of k~\widetilde{k} with respect to the conditional distribution PYX=xP_{Y\mid X=x}, i.e.,

𝔼{k~(y,Y)ϕr(Y)X=x}\displaystyle\mathbb{E}\{\widetilde{k}(y,Y^{\prime})\phi_{r}(Y^{\prime})\mid X^{\prime}=x\} =λrϕr(y),\displaystyle=\lambda_{r}\phi_{r}(y),
𝔼{ϕr1(Y)ϕr2(Y)X=x}\displaystyle\mathbb{E}\{\phi_{r_{1}}(Y)\phi_{r_{2}}(Y)\mid X=x\} ={1,if r1=r2,0,if r1r2.\displaystyle=\left\{\begin{aligned} 1,\quad\text{if }r_{1}=r_{2},\\ 0,\quad\text{if }r_{1}\neq r_{2}.\end{aligned}\right.

We now find the asymptotic distribution of Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2)H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)}. First, under H0H_{0} in (2),

ϕ(1,1)(z1(1);z1(2))\displaystyle\phi^{(1,1)}(z_{1}^{(1)};z_{1}^{(2)}) =12f1(x)f2(x)×[k~(y1(1),y1(2))Gh1(x1(1)x)Gh2(x1(2)x)\displaystyle=-\frac{1}{2}f_{1}(x)f_{2}(x)\times\Big{[}\widetilde{k}(y_{1}^{(1)},y_{1}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{2}}(x_{1}^{(2)}-x)
𝔼{k~(y1(1),Y1(2))Gh1(x1(1)x)Gh2(X1(2)x)}\displaystyle\quad-\mathbb{E}\Big{\{}\widetilde{k}(y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)\Big{\}}
𝔼{k~(Y1(1),y1(2))Gh1(X1(1)x)Gh2(x1(2)x)}\displaystyle\quad-\mathbb{E}\Big{\{}\widetilde{k}(Y_{1}^{(1)},y_{1}^{(2)})G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{2}}(x_{1}^{(2)}-x)\Big{\}}
+𝔼{k~(Y1(1),Y1(2))Gh1(X1(1)x)Gh2(X1(2)x)}]\displaystyle\quad+\mathbb{E}\Big{\{}\widetilde{k}(Y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(X_{1}^{(1)}-x)G_{h_{2}}(X_{1}^{(2)}-x)\Big{\}}\Big{]}
×{1+O(h1ν+h2ν)}.\displaystyle\quad\times\{1+O(h_{1}^{\nu}+h_{2}^{\nu})\}.

Using the spectral decomposition (A.5), we have

Hn1,n2(1,1)\displaystyle H_{n_{1},n_{2}}^{(1,1)} =12f1(x)f2(x)r=1λr×1n1i=1n1[ϕr(Yi(1))Gh1(Xi(1)x)𝔼{ϕr(Y(1))Gh1(X(1)x)}]\displaystyle=-\frac{1}{2}f_{1}(x)f_{2}(x)\sum_{r=1}^{\infty}\lambda_{r}\times\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\Big{[}\phi_{r}(Y_{i}^{(1)})G_{h_{1}}(X_{i}^{(1)}-x)-\mathbb{E}\Big{\{}\phi_{r}(Y^{(1)})G_{h_{1}}(X^{(1)}-x)\Big{\}}\Big{]}
×1n2j=1n2[ϕr(Yj(2))Gh2(Xj(2)x)𝔼{ϕr(Y(2))Gh2(X(2)x)}]\displaystyle\quad\times\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}\Big{[}\phi_{r}(Y_{j}^{(2)})G_{h_{2}}(X_{j}^{(2)}-x)-\mathbb{E}\Big{\{}\phi_{r}(Y^{(2)})G_{h_{2}}(X^{(2)}-x)\Big{\}}\Big{]}
+Op((n1h1pn2h2p)1/2(h1ν+h2ν)).\displaystyle\quad+O_{p}((n_{1}h_{1}^{p}n_{2}h_{2}^{p})^{-1/2}(h_{1}^{\nu}+h_{2}^{\nu})).

Applying Lyapunov CLT,

n1h1p1n1i=1n1[ϕr(Yi(1))Gh1(Xi(1)x)𝔼{ϕr(Y(1))Gh1(X(1)x)}]\displaystyle\sqrt{n_{1}h_{1}^{p}}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\Big{[}\phi_{r}(Y_{i}^{(1)})G_{h_{1}}(X_{i}^{(1)}-x)-\mathbb{E}\Big{\{}\phi_{r}(Y^{(1)})G_{h_{1}}(X^{(1)}-x)\Big{\}}\Big{]}
𝑑𝒩(0,f1(x){g2(u)𝑑u}p),\displaystyle\overset{d}{\longrightarrow}\mathcal{N}\bigg{(}0,f_{1}(x)\bigg{\{}\int g^{2}(u)du\bigg{\}}^{p}\bigg{)},
n2h2p1n2j=1n2[ϕr(Yj(2))Gh2(Xj(2)x)𝔼{ϕr(Y(2))Gh2(X(2)x)}]\displaystyle\sqrt{n_{2}h_{2}^{p}}\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}\Big{[}\phi_{r}(Y_{j}^{(2)})G_{h_{2}}(X_{j}^{(2)}-x)-\mathbb{E}\Big{\{}\phi_{r}(Y^{(2)})G_{h_{2}}(X^{(2)}-x)\Big{\}}\Big{]}
𝑑𝒩(0,f2(x){g2(u)𝑑u}p),\displaystyle\overset{d}{\longrightarrow}\mathcal{N}\bigg{(}0,f_{2}(x)\bigg{\{}\int g^{2}(u)du\bigg{\}}^{p}\bigg{)},

since 𝔼{ϕr2(Y)X=x}=1\mathbb{E}\{\phi_{r}^{2}(Y)\mid X=x\}=1. Also, for l=1,2l=1,2 and r1r2r_{1}\neq r_{2}, we have

hlpcov[ϕr1(Y(l))Ghl(X(l)x),ϕr2(Y(l))Ghl(X(l)x)]\displaystyle\quad h_{l}^{p}\operatorname{cov}\Big{[}\phi_{r_{1}}(Y^{(l)})G_{h_{l}}(X^{(l)}-x),\phi_{r_{2}}(Y^{(l)})G_{h_{l}}(X^{(l)}-x)\Big{]}
𝔼{ϕr1(Y)ϕr2(Y)X=x}fl(x){g2(u)𝑑u}p=0.\displaystyle\longrightarrow\mathbb{E}\{\phi_{r_{1}}(Y)\phi_{r_{2}}(Y)\mid X=x\}f_{l}(x)\bigg{\{}\int g^{2}(u)du\bigg{\}}^{p}=0.

Thus,

{f1(x)f2(x)}2n1h1pn2h2pHn1,n2(1,1)𝑑12r=1λrζrηr,\{f_{1}(x)f_{2}(x)\}^{-2}\sqrt{n_{1}h_{1}^{p}n_{2}h_{2}^{p}}H_{n_{1},n_{2}}^{(1,1)}\overset{d}{\longrightarrow}-\frac{1}{2}\sum^{\infty}_{r=1}\lambda_{r}\zeta_{r}\eta_{r},

where {ζr}r=1\{\zeta_{r}\}_{r=1}^{\infty} and {ηr}r=1\{\eta_{r}\}_{r=1}^{\infty} are two sequences of independent normal random variables with

ζr\displaystyle\zeta_{r} i.i.d.𝒩(0,σ12),σ12=1f1(x){g2(u)𝑑u}p,\displaystyle\overset{i.i.d.}{\sim}\mathcal{N}(0,\sigma_{1}^{2}),\quad\sigma_{1}^{2}=\frac{1}{f_{1}(x)}\bigg{\{}\int g^{2}(u)du\bigg{\}}^{p},
ηr\displaystyle\eta_{r} i.i.d.𝒩(0,σ22),σ22=1f2(x){g2(u)𝑑u}p.\displaystyle\overset{i.i.d.}{\sim}\mathcal{N}(0,\sigma_{2}^{2}),\quad\sigma_{2}^{2}=\frac{1}{f_{2}(x)}\bigg{\{}\int g^{2}(u)du\bigg{\}}^{p}.

Similarly, one can show that

{f1(x)f2(x)}2n1h1pHn1,n2(2,0)𝑑r=1λr(ζr2σ12),\{f_{1}(x)f_{2}(x)\}^{-2}n_{1}h_{1}^{p}H_{n_{1},n_{2}}^{(2,0)}\overset{d}{\longrightarrow}\sum_{r=1}^{\infty}\lambda_{r}(\zeta_{r}^{2}-\sigma_{1}^{2}),

and

{f1(x)f2(x)}2n2h2pHn1,n2(0,2)𝑑r=1λr(ηr2σ22).\{f_{1}(x)f_{2}(x)\}^{-2}n_{2}h_{2}^{p}H_{n_{1},n_{2}}^{(0,2)}\overset{d}{\longrightarrow}\sum_{r=1}^{\infty}\lambda_{r}(\eta_{r}^{2}-\sigma_{2}^{2}).

Combining the above results, we have

{f1(x)f2(x)}2(n1h1p+n2h2p)(Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2))\displaystyle\quad\{f_{1}(x)f_{2}(x)\}^{-2}(n_{1}h_{1}^{p}+n_{2}h_{2}^{p})(H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)})
𝑑1τr=1λr(ζr2σ12)2τ(1τ)r=1λrζrηr+11τr=1λr(ηr2σ22)\displaystyle\overset{d}{\longrightarrow}\frac{1}{\tau}\sum_{r=1}^{\infty}\lambda_{r}(\zeta_{r}^{2}-\sigma_{1}^{2})-\frac{2}{\sqrt{\tau(1-\tau)}}\sum^{\infty}_{r=1}\lambda_{r}\zeta_{r}\eta_{r}+\frac{1}{1-\tau}\sum_{r=1}^{\infty}\lambda_{r}(\eta_{r}^{2}-\sigma_{2}^{2})
=r=1λr{(1τζr11τηr)2(1τ)σ12+τσ22τ(1τ)}.\displaystyle=\sum_{r=1}^{\infty}\lambda_{r}\bigg{\{}\bigg{(}\frac{1}{\sqrt{\tau}}\zeta_{r}-\frac{1}{\sqrt{1-\tau}}\eta_{r}\bigg{)}^{2}-\frac{(1-\tau)\sigma_{1}^{2}+\tau\sigma_{2}^{2}}{\tau(1-\tau)}\bigg{\}}.

Also, according to the results in the proof of Theorem 3,

(n1h1p+n2h2p)2var(Rn1,n2)=O((n1h1p)1+(n2h2p)1).(n_{1}h_{1}^{p}+n_{2}h_{2}^{p})^{2}\operatorname{var}(R_{n_{1},n_{2}})=O((n_{1}h_{1}^{p})^{-1}+(n_{2}h_{2}^{p})^{-1}).

Therefore,

{f1(x)f2(x)}2(n1h1p+n2h2p)Un1,n2\displaystyle\quad\{f_{1}(x)f_{2}(x)\}^{-2}(n_{1}h_{1}^{p}+n_{2}h_{2}^{p})U_{n_{1},n_{2}}
={f1(x)f2(x)}2(n1h1p+n2h2p)(Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2))\displaystyle=\{f_{1}(x)f_{2}(x)\}^{-2}(n_{1}h_{1}^{p}+n_{2}h_{2}^{p})(H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)})
+{f1(x)f2(x)}2(n1h1p+n2h2p){𝔼(Un1,n2)+2Hn1,n2(1,0)+2Hn1,n2(0,1)+Rn1,n2}\displaystyle\quad+\{f_{1}(x)f_{2}(x)\}^{-2}(n_{1}h_{1}^{p}+n_{2}h_{2}^{p})\{\mathbb{E}(U_{n_{1},n_{2}})+2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}+R_{n_{1},n_{2}}\}
={f1(x)f2(x)}2(n1h1p+n2h2p)(Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2))\displaystyle=\{f_{1}(x)f_{2}(x)\}^{-2}(n_{1}h_{1}^{p}+n_{2}h_{2}^{p})(H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)})
+O(n1h1p+ν+n2h2p+ν)+Op(n11/2h1p/2+ν+n21/2h2p/2+ν)+Op((n1h1p)1/2+(n2h2p)1/2)\displaystyle\quad+O(n_{1}h_{1}^{p+\nu}+n_{2}h_{2}^{p+\nu})+O_{p}(n_{1}^{1/2}h_{1}^{p/2+\nu}+n_{2}^{1/2}h_{2}^{p/2+\nu})+O_{p}((n_{1}h_{1}^{p})^{-1/2}+(n_{2}h_{2}^{p})^{-1/2})
={f1(x)f2(x)}2(n1h1p+n2h2p)(Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2))+op(1)\displaystyle=\{f_{1}(x)f_{2}(x)\}^{-2}(n_{1}h_{1}^{p}+n_{2}h_{2}^{p})(H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)})+o_{p}(1)
𝑑r=1λr{(1τζr11τηr)2(1τ)σ12+τσ22τ(1τ)},\displaystyle\overset{d}{\longrightarrow}\sum_{r=1}^{\infty}\lambda_{r}\bigg{\{}\bigg{(}\frac{1}{\sqrt{\tau}}\zeta_{r}-\frac{1}{\sqrt{1-\tau}}\eta_{r}\bigg{)}^{2}-\frac{(1-\tau)\sigma_{1}^{2}+\tau\sigma_{2}^{2}}{\tau(1-\tau)}\bigg{\}},

where we have used Assumption 2, and nlhlp+ν0n_{l}h_{l}^{p+\nu}\rightarrow 0 as nln_{l}\rightarrow\infty for l=1,2l=1,2. The result follows. ∎

Proof of Theorem 6.

Note that

k^=(n12)1(n22)11i1<i2n11j1<j2n2ψ(Zi1(1),Zi2(1);Zj1(2),Zj2(2)),\widehat{\mathcal{I}_{k}}=\binom{n_{1}}{2}^{-1}\binom{n_{2}}{2}^{-1}\sum_{1\leq i_{1}<i_{2}\leq n_{1}}\sum_{1\leq j_{1}<j_{2}\leq n_{2}}\psi(Z_{i_{1}}^{(1)},Z_{i_{2}}^{(1)};Z_{j_{1}}^{(2)},Z_{j_{2}}^{(2)}),

where

ψ(Z1(1),Z2(1);Z1(2),Z2(2))\displaystyle\quad\psi(Z_{1}^{(1)},Z_{2}^{(1)};Z_{1}^{(2)},Z_{2}^{(2)})
=12k(Y1(1),Y2(1))Gh1(X1(1)X2(1))\displaystyle=\frac{1}{2}k(Y_{1}^{(1)},Y_{2}^{(1)})G_{h_{1}}(X_{1}^{(1)}-X_{2}^{(1)})
×{Gh2(X1(2)X1(1))Gh2(X2(2)X1(1))+Gh2(X1(2)X2(1))Gh2(X2(2)X2(1))}\displaystyle\quad\times\left\{G_{h_{2}}(X_{1}^{(2)}-X_{1}^{(1)})G_{h_{2}}(X_{2}^{(2)}-X_{1}^{(1)})+G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(1)})G_{h_{2}}(X_{2}^{(2)}-X_{2}^{(1)})\right\}
+12k(Y1(2),Y2(2))Gh2(X1(2)X2(2))\displaystyle\quad+\frac{1}{2}k(Y_{1}^{(2)},Y_{2}^{(2)})G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(2)})
×{Gh1(X1(1)X1(2))Gh1(X2(1)X1(2))+Gh1(X1(1)X2(2))Gh1(X2(1)X2(2))}\displaystyle\quad\times\left\{G_{h_{1}}(X_{1}^{(1)}-X_{1}^{(2)})G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})+G_{h_{1}}(X_{1}^{(1)}-X_{2}^{(2)})G_{h_{1}}(X_{2}^{(1)}-X_{2}^{(2)})\right\}
14k(Y1(1),Y1(2)){Gh1(X1(1)X1(2))+Gh2(X1(2)X1(1))}Gh1(X2(1)X1(2))Gh2(X2(2)X1(1))\displaystyle\quad-\frac{1}{4}k(Y_{1}^{(1)},Y_{1}^{(2)})\left\{G_{h_{1}}(X_{1}^{(1)}-X_{1}^{(2)})+G_{h_{2}}(X_{1}^{(2)}-X_{1}^{(1)})\right\}G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-X_{1}^{(1)})
14k(Y1(1),Y2(2)){Gh1(X1(1)X2(2))+Gh2(X2(2)X1(1))}Gh1(X2(1)X2(2))Gh2(X1(2)X1(1))\displaystyle\quad-\frac{1}{4}k(Y_{1}^{(1)},Y_{2}^{(2)})\left\{G_{h_{1}}(X_{1}^{(1)}-X_{2}^{(2)})+G_{h_{2}}(X_{2}^{(2)}-X_{1}^{(1)})\right\}G_{h_{1}}(X_{2}^{(1)}-X_{2}^{(2)})G_{h_{2}}(X_{1}^{(2)}-X_{1}^{(1)})
14k(Y2(1),Y1(2)){Gh1(X2(1)X1(2))+Gh2(X1(2)X2(1))}Gh1(X1(1)X1(2))Gh2(X2(2)X2(1))\displaystyle\quad-\frac{1}{4}k(Y_{2}^{(1)},Y_{1}^{(2)})\left\{G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})+G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(1)})\right\}G_{h_{1}}(X_{1}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-X_{2}^{(1)})
14k(Y2(1),Y2(2)){Gh1(X2(1)X2(2))+Gh2(X2(2)X2(1))}Gh1(X1(1)X2(2))Gh2(X1(2)X2(1)).\displaystyle\quad-\frac{1}{4}k(Y_{2}^{(1)},Y_{2}^{(2)})\left\{G_{h_{1}}(X_{2}^{(1)}-X_{2}^{(2)})+G_{h_{2}}(X_{2}^{(2)}-X_{2}^{(1)})\right\}G_{h_{1}}(X_{1}^{(1)}-X_{2}^{(2)})G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(1)}).

It suffices to show that

𝔼(k^)=k+o(1),var(k^)=o(1).\mathbb{E}(\widehat{\mathcal{I}_{k}})=\mathcal{I}_{k}+o(1),\quad\operatorname{var}(\widehat{\mathcal{I}_{k}})=o(1). (A.6)

For the first part of (A.6), we have

𝔼(k^)\displaystyle\mathbb{E}(\widehat{\mathcal{I}_{k}}) =𝔼{k(Y1(1),Y2(1))Gh1(X1(1)X2(1))Gh2(X1(2)X1(1))Gh2(X2(2)X1(1))}\displaystyle=\mathbb{E}\{k(Y_{1}^{(1)},Y_{2}^{(1)})G_{h_{1}}(X_{1}^{(1)}-X_{2}^{(1)})G_{h_{2}}(X_{1}^{(2)}-X_{1}^{(1)})G_{h_{2}}(X_{2}^{(2)}-X_{1}^{(1)})\}
+𝔼{k(Y1(2),Y2(2))Gh2(X1(2)X2(2))Gh1(X1(1)X1(2))Gh1(X2(1)X1(2))}\displaystyle\quad+\mathbb{E}\{k(Y_{1}^{(2)},Y_{2}^{(2)})G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(2)})G_{h_{1}}(X_{1}^{(1)}-X_{1}^{(2)})G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})\}
𝔼{k(Y1(1),Y1(2))Gh1(X1(1)X1(2))Gh1(X2(1)X1(2))Gh2(X2(2)X1(1))}\displaystyle\quad-\mathbb{E}\{k(Y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(X_{1}^{(1)}-X_{1}^{(2)})G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-X_{1}^{(1)})\}
𝔼{k(Y1(1),Y1(2))Gh2(X1(2)X1(1))Gh1(X2(1)X1(2))Gh2(X2(2)X1(1))}.\displaystyle\quad-\mathbb{E}\{k(Y_{1}^{(1)},Y_{1}^{(2)})G_{h_{2}}(X_{1}^{(2)}-X_{1}^{(1)})G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-X_{1}^{(1)})\}.

We first consider the term

𝔼{k(Y1(1),Y1(2))Gh1(X1(1)X1(2))Gh1(X2(1)X1(2))Gh2(X2(2)X1(1))}\displaystyle\quad\mathbb{E}\{k(Y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(X_{1}^{(1)}-X_{1}^{(2)})G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-X_{1}^{(1)})\}
=k(y1(1),y1(2))Gh1(x1(1)x1(2))Gh1(x2(1)x1(2))Gh2(x2(2)x1(1))\displaystyle=\int k(y_{1}^{(1)},y_{1}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x_{1}^{(2)})G_{h_{1}}(x_{2}^{(1)}-x_{1}^{(2)})G_{h_{2}}(x_{2}^{(2)}-x_{1}^{(1)})
×f1(y1(1),x1(1))f2(y1(2),x1(2))f1(x2(1))f2(x2(2))dy1(1)dy1(2)dx1(1)dx2(1)dx1(2)dx2(2).\displaystyle\quad\times f_{1}(y_{1}^{(1)},x_{1}^{(1)})f_{2}(y_{1}^{(2)},x_{1}^{(2)})f_{1}(x_{2}^{(1)})f_{2}(x_{2}^{(2)})dy_{1}^{(1)}dy_{1}^{(2)}dx_{1}^{(1)}dx_{2}^{(1)}dx_{1}^{(2)}dx_{2}^{(2)}.

Let $x=x_{1}^{(2)}$, $u=h_{1}^{-1}(x_{1}^{(1)}-x_{1}^{(2)})$, $v=h_{1}^{-1}(x_{2}^{(1)}-x_{1}^{(2)})$, and $w=h_{2}^{-1}(x_{2}^{(2)}-x_{1}^{(1)})$. We have

𝔼{k(Y1(1),Y1(2))Gh1(X1(1)X1(2))Gh1(X2(1)X1(2))Gh2(X2(2)X1(1))}\displaystyle\quad\mathbb{E}\{k(Y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(X_{1}^{(1)}-X_{1}^{(2)})G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-X_{1}^{(1)})\}
=(i)k(y1(1),y1(2))s=1p{g(u(s))g(v(s))g(w(s))}\displaystyle\overset{(i)}{=}\int k(y_{1}^{(1)},y_{1}^{(2)})\prod_{s=1}^{p}\{g(u(s))g(v(s))g(w(s))\}
×f1(y1(1),x+h1u)f2(y1(2),x)f1(x+h1v)f2(x+h1u+h2w)dy1(1)dy1(2)dudvdwdx\displaystyle\quad\times f_{1}(y_{1}^{(1)},x+h_{1}u)f_{2}(y_{1}^{(2)},x)f_{1}(x+h_{1}v)f_{2}(x+h_{1}u+h_{2}w)dy_{1}^{(1)}dy_{1}^{(2)}dudvdwdx
=𝔼{k(Y(1),Y(2))X(1)=x+h1u,X(2)=x}s=1p{g(u(s))g(v(s))g(w(s))}\displaystyle=\int\mathbb{E}\{k(Y^{(1)},Y^{(2)})\mid X^{(1)}=x+h_{1}u,X^{(2)}=x\}\prod_{s=1}^{p}\{g(u(s))g(v(s))g(w(s))\}
×f1(x+h1u)f1(x+h1v)f2(x+h1u+h2w)f2(x)dudvdwdx\displaystyle\quad\times f_{1}(x+h_{1}u)f_{1}(x+h_{1}v)f_{2}(x+h_{1}u+h_{2}w)f_{2}(x)dudvdwdx
=(ii)𝔼{k(Y(1),Y(2))X(1)=x,X(2)=x}s=1p{g(u(s))g(v(s))g(w(s))}\displaystyle\overset{(ii)}{=}\int\mathbb{E}\{k(Y^{(1)},Y^{(2)})\mid X^{(1)}=x,X^{(2)}=x\}\prod_{s=1}^{p}\{g(u(s))g(v(s))g(w(s))\}
×{f1(x)f2(x)}2dudvdwdx+O(h1ν+h2ν)\displaystyle\quad\times\{f_{1}(x)f_{2}(x)\}^{2}dudvdwdx+O(h_{1}^{\nu}+h_{2}^{\nu})
=𝔼{k(Y(1),Y(2))X(1)=x,X(2)=x}{f1(x)f2(x)}2𝑑x+O(h1ν+h2ν),\displaystyle=\int\mathbb{E}\{k(Y^{(1)},Y^{(2)})\mid X^{(1)}=x,X^{(2)}=x\}\{f_{1}(x)f_{2}(x)\}^{2}dx+O(h_{1}^{\nu}+h_{2}^{\nu}),

where $(i)$ follows from a change of variables, and $(ii)$ holds by Taylor's theorem and Assumptions 1, 5, and 6. Similarly, one can verify that

𝔼{k(Y1(1),Y2(1))Gh1(X1(1)X2(1))Gh2(X1(2)X1(1))Gh2(X2(2)X1(1))}\displaystyle\quad\mathbb{E}\{k(Y_{1}^{(1)},Y_{2}^{(1)})G_{h_{1}}(X_{1}^{(1)}-X_{2}^{(1)})G_{h_{2}}(X_{1}^{(2)}-X_{1}^{(1)})G_{h_{2}}(X_{2}^{(2)}-X_{1}^{(1)})\}
=𝔼{k(Y(1),Y(1))X(1)=x,X(1)=x}{f1(x)f2(x)}2𝑑x+O(h1ν+h2ν),\displaystyle=\int\mathbb{E}\{k(Y^{(1)},Y^{(1)\prime})\mid X^{(1)}=x,X^{(1)\prime}=x\}\{f_{1}(x)f_{2}(x)\}^{2}dx+O(h_{1}^{\nu}+h_{2}^{\nu}),
𝔼{k(Y1(2),Y2(2))Gh2(X1(2)X2(2))Gh1(X1(1)X1(2))Gh1(X2(1)X1(2))}\displaystyle\quad\mathbb{E}\{k(Y_{1}^{(2)},Y_{2}^{(2)})G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(2)})G_{h_{1}}(X_{1}^{(1)}-X_{1}^{(2)})G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})\}
=𝔼{k(Y(2),Y(2))X(2)=x,X(2)=x}{f1(x)f2(x)}2𝑑x+O(h1ν+h2ν),\displaystyle=\int\mathbb{E}\{k(Y^{(2)},Y^{(2)\prime})\mid X^{(2)}=x,X^{(2)\prime}=x\}\{f_{1}(x)f_{2}(x)\}^{2}dx+O(h_{1}^{\nu}+h_{2}^{\nu}),
𝔼{k(Y1(1),Y1(2))Gh2(X1(2)X1(1))Gh1(X2(1)X1(2))Gh2(X2(2)X1(1))}\displaystyle\quad\mathbb{E}\{k(Y_{1}^{(1)},Y_{1}^{(2)})G_{h_{2}}(X_{1}^{(2)}-X_{1}^{(1)})G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-X_{1}^{(1)})\}
=𝔼{k(Y(1),Y(2))X(1)=x,X(2)=x}{f1(x)f2(x)}2𝑑x+O(h1ν+h2ν).\displaystyle=\int\mathbb{E}\{k(Y^{(1)},Y^{(2)})\mid X^{(1)}=x,X^{(2)}=x\}\{f_{1}(x)f_{2}(x)\}^{2}dx+O(h_{1}^{\nu}+h_{2}^{\nu}).

Thus, by Assumption 2,

𝔼(k^)\displaystyle\mathbb{E}(\widehat{\mathcal{I}_{k}}) =[𝔼{k(Y(1),Y(1))X(1)=x,X(1)=x}+𝔼{k(Y(2),Y(2))X(2)=x,X(2)=x}\displaystyle=\int\Big{[}\mathbb{E}\{k(Y^{(1)},Y^{(1)\prime})\mid X^{(1)}=x,X^{(1)\prime}=x\}+\mathbb{E}\{k(Y^{(2)},Y^{(2)\prime})\mid X^{(2)}=x,X^{(2)\prime}=x\}
2𝔼{k(Y(1),Y(2))X(1)=x,X(2)=x}]{f1(x)f2(x)}2dx+O(h1ν+h2ν)\displaystyle\quad-2\mathbb{E}\{k(Y^{(1)},Y^{(2)})\mid X^{(1)}=x,X^{(2)}=x\}\Big{]}\{f_{1}(x)f_{2}(x)\}^{2}dx+O(h_{1}^{\nu}+h_{2}^{\nu})
=γk2(x){f1(x)f2(x)}2𝑑x+o(1).\displaystyle=\int\gamma_{k}^{2}(x)\{f_{1}(x)f_{2}(x)\}^{2}dx+o(1).

For the second part of (A.6), following Definition B.2 and Theorem B.1 with $m_{1}=m_{2}=2$, we have

var(k^)=(n12)1(n22)1c=02d=02(2c)(2d)(n122c)(n222d)σc,d2.\operatorname{var}(\widehat{\mathcal{I}_{k}})=\binom{n_{1}}{2}^{-1}\binom{n_{2}}{2}^{-1}\sum_{c=0}^{2}\sum_{d=0}^{2}\binom{2}{c}\binom{2}{d}\binom{n_{1}-2}{2-c}\binom{n_{2}-2}{2-d}\sigma_{c,d}^{2}.
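To see at a glance how the combinatorial weights in this variance formula scale with the sample sizes, they can be tabulated numerically; the short sketch below (with arbitrary illustrative sample sizes) confirms that the weight attached to $\sigma_{c,d}^{2}$ behaves like $n_{1}^{-c}n_{2}^{-d}$, while the $(c,d)=(0,0)$ term drops out since $\sigma_{0,0}^{2}=0$.

```python
from math import comb

n1, n2 = 100, 150   # arbitrary illustrative sample sizes
denom = comb(n1, 2) * comb(n2, 2)
for c in range(3):
    for d in range(3):
        # weight multiplying sigma^2_{c,d} in var(I_k_hat); compare with n1^{-c} n2^{-d}
        w = comb(2, c) * comb(2, d) * comb(n1 - 2, 2 - c) * comb(n2 - 2, 2 - d) / denom
        print(f"(c,d)=({c},{d}): weight = {w:.3e},  n1^-c * n2^-d = {n1 ** -c * n2 ** -d:.3e}")
```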

We first calculate

σ2,22=var{ψ(Z1(1),Z2(1);Z1(2),Z2(2))}=𝔼{ψ2(Z1(1),Z2(1);Z1(2),Z2(2))}{k+o(1)}2.\sigma_{2,2}^{2}=\operatorname{var}\{\psi(Z_{1}^{(1)},Z_{2}^{(1)};Z_{1}^{(2)},Z_{2}^{(2)})\}=\mathbb{E}\{\psi^{2}(Z_{1}^{(1)},Z_{2}^{(1)};Z_{1}^{(2)},Z_{2}^{(2)})\}-\{\mathcal{I}_{k}+o(1)\}^{2}.

The term $\mathbb{E}\{\psi^{2}(Z_{1}^{(1)},Z_{2}^{(1)};Z_{1}^{(2)},Z_{2}^{(2)})\}$ can be expanded into several terms, each of which can be shown to be of order $(h_{1}^{2}h_{2})^{-p}$ or $(h_{1}h_{2}^{2})^{-p}$. We present the derivation only for the term below; the other terms are handled similarly. Note that

𝔼[{k(Y1(1),Y1(2))Gh1(X1(1)X1(2))Gh1(X2(1)X1(2))Gh2(X2(2)X1(1))}2]\displaystyle\quad\mathbb{E}\Big{[}\Big{\{}k(Y_{1}^{(1)},Y_{1}^{(2)})G_{h_{1}}(X_{1}^{(1)}-X_{1}^{(2)})G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-X_{1}^{(1)})\Big{\}}^{2}\Big{]}
={k(y1(1),y1(2))Gh1(x1(1)x1(2))Gh1(x2(1)x1(2))Gh2(x2(2)x1(1))}2\displaystyle=\int\left\{k(y_{1}^{(1)},y_{1}^{(2)})G_{h_{1}}(x_{1}^{(1)}-x_{1}^{(2)})G_{h_{1}}(x_{2}^{(1)}-x_{1}^{(2)})G_{h_{2}}(x_{2}^{(2)}-x_{1}^{(1)})\right\}^{2}
×f1(y1(1),x1(1))f2(y1(2),x1(2))f1(x2(1))f2(x2(2))dy1(1)dy1(2)dx1(1)dx2(1)dx1(2)dx2(2)\displaystyle\quad\times f_{1}(y_{1}^{(1)},x_{1}^{(1)})f_{2}(y_{1}^{(2)},x_{1}^{(2)})f_{1}(x_{2}^{(1)})f_{2}(x_{2}^{(2)})dy_{1}^{(1)}dy_{1}^{(2)}dx_{1}^{(1)}dx_{2}^{(1)}dx_{1}^{(2)}dx_{2}^{(2)}
=(i)(h12h2)p[k(y1(1),y1(2))s=1p{g(u(s))g(v(s))g(w(s))}]2\displaystyle\overset{(i)}{=}(h_{1}^{2}h_{2})^{-p}\int\bigg{[}k(y_{1}^{(1)},y_{1}^{(2)})\prod_{s=1}^{p}\{g(u(s))g(v(s))g(w(s))\}\bigg{]}^{2}
×f1(y1(1),x+h1u)f2(y1(2),x)f1(x+h1v)f2(x+h1u+h2w)dy1(1)dy1(2)dudvdwdx\displaystyle\quad\times f_{1}(y_{1}^{(1)},x+h_{1}u)f_{2}(y_{1}^{(2)},x)f_{1}(x+h_{1}v)f_{2}(x+h_{1}u+h_{2}w)dy_{1}^{(1)}dy_{1}^{(2)}dudvdwdx
(h12h2)p,\displaystyle\asymp(h_{1}^{2}h_{2})^{-p},

where $(i)$ follows from the same change-of-variables technique as in the proof of the first part of (A.6) above. Hence, $\sigma_{2,2}^{2}=O((h_{1}^{2}h_{2})^{-p}+(h_{1}h_{2}^{2})^{-p})$.

Analogously, using a change of variables, one can verify that $\sigma_{1,0}^{2}=O(1)$, $\sigma_{0,1}^{2}=O(1)$, $\sigma_{2,0}^{2}=O(h_{1}^{-p})$, $\sigma_{1,1}^{2}=O(h_{1}^{-p}+h_{2}^{-p})$, $\sigma_{0,2}^{2}=O(h_{2}^{-p})$, $\sigma_{2,1}^{2}=O(h_{1}^{-2p}+(h_{1}h_{2})^{-p})$, and $\sigma_{1,2}^{2}=O((h_{1}h_{2})^{-p}+h_{2}^{-2p})$. Therefore, since $n_{l}h_{l}^{p}\rightarrow\infty$ for $l=1,2$ under Assumption 2, every term $n_{1}^{-c}n_{2}^{-d}\sigma_{c,d}^{2}$ with $c+d\geq 2$ is $o(n_{1}^{-1}+n_{2}^{-1})$, and hence

var(k^)=c=02d=02O(n1cn2d)σc,d2=O(n11+n21)=o(1).\operatorname{var}(\widehat{\mathcal{I}_{k}})=\sum_{c=0}^{2}\sum_{d=0}^{2}O(n_{1}^{-c}n_{2}^{-d})\sigma_{c,d}^{2}=O(n_{1}^{-1}+n_{2}^{-1})=o(1).

This completes the proof. ∎

Proof of Theorem 7.

Following Definition B.3 and Theorem B.2, we have the Hoeffding decomposition for the U-statistic $\widehat{\mathcal{I}_{k}}$:

k^=c=02d=02(2c)(2d)Hn1,n2(c,d)=𝔼(k^)+2Hn1,n2(1,0)+2Hn1,n2(0,1)+Rn1,n2,\widehat{\mathcal{I}_{k}}=\sum_{c=0}^{2}\sum_{d=0}^{2}\binom{2}{c}\binom{2}{d}H_{n_{1},n_{2}}^{(c,d)}=\mathbb{E}(\widehat{\mathcal{I}_{k}})+2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}+R_{n_{1},n_{2}},

where

Hn1,n2(1,0)\displaystyle H_{n_{1},n_{2}}^{(1,0)} =1n1i=1n1ϕ(1,0)(Zi(1)),Hn1,n2(0,1)=1n2j=1n2ϕ(0,1)(Zj(2)),\displaystyle=\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\phi^{(1,0)}(Z_{i}^{(1)}),\quad H_{n_{1},n_{2}}^{(0,1)}=\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}\phi^{(0,1)}(Z_{j}^{(2)}),

and $R_{n_{1},n_{2}}$ is the remainder term. We further give the explicit expression of $\phi^{(1,0)}(z_{1}^{(1)})$; the expression of $\phi^{(0,1)}(z_{1}^{(2)})$ is similar and omitted:

ϕ(1,0)(z1(1))=12𝔼{k(y1(1),Y2(1))Gh1(x1(1)X2(1))Gh2(X1(2)x1(1))Gh2(X2(2)x1(1))}+12𝔼{k(y1(1),Y2(1))Gh1(x1(1)X2(1))Gh2(X1(2)X2(1))Gh2(X2(2)X2(1))}+𝔼{k(Y1(2),Y2(2))Gh2(X1(2)X2(2))Gh1(x1(1)X1(2))Gh1(X2(1)X1(2))}12𝔼[k(y1(1),Y1(2)){Gh1(x1(1)X1(2))+Gh2(X1(2)x1(1))}Gh1(X2(1)X1(2))Gh2(X2(2)x1(1))]12𝔼[k(Y2(1),Y1(2)){Gh1(X2(1)X1(2))+Gh2(X1(2)X2(1))}Gh1(x1(1)X1(2))Gh2(X2(2)X2(1))]𝔼(k^).\displaystyle\begin{split}&\quad\phi^{(1,0)}(z_{1}^{(1)})\\ &=\frac{1}{2}\mathbb{E}\Big{\{}k(y_{1}^{(1)},Y_{2}^{(1)})G_{h_{1}}(x_{1}^{(1)}-X_{2}^{(1)})G_{h_{2}}(X_{1}^{(2)}-x_{1}^{(1)})G_{h_{2}}(X_{2}^{(2)}-x_{1}^{(1)})\Big{\}}\\ &\quad+\frac{1}{2}\mathbb{E}\Big{\{}k(y_{1}^{(1)},Y_{2}^{(1)})G_{h_{1}}(x_{1}^{(1)}-X_{2}^{(1)})G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(1)})G_{h_{2}}(X_{2}^{(2)}-X_{2}^{(1)})\Big{\}}\\ &\quad+\mathbb{E}\Big{\{}k(Y_{1}^{(2)},Y_{2}^{(2)})G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(2)})G_{h_{1}}(x_{1}^{(1)}-X_{1}^{(2)})G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})\Big{\}}\\ &\quad-\frac{1}{2}\mathbb{E}\Big{[}k(y_{1}^{(1)},Y_{1}^{(2)})\left\{G_{h_{1}}(x_{1}^{(1)}-X_{1}^{(2)})+G_{h_{2}}(X_{1}^{(2)}-x_{1}^{(1)})\right\}G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-x_{1}^{(1)})\Big{]}\\ &\quad-\frac{1}{2}\mathbb{E}\Big{[}k(Y_{2}^{(1)},Y_{1}^{(2)})\left\{G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})+G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(1)})\right\}G_{h_{1}}(x_{1}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-X_{2}^{(1)})\Big{]}\\ &\quad-\mathbb{E}(\widehat{\mathcal{I}_{k}}).\end{split} (A.7)

By similar calculations as in the proof of Theorem 6, we have

δ1,02\displaystyle\delta_{1,0}^{2} =var{ϕ(1,0)(Z1(1))}=σ1,021,\displaystyle=\operatorname{var}\{\phi^{(1,0)}(Z_{1}^{(1)})\}=\sigma_{1,0}^{2}\asymp 1, (A.8)
δ0,12\displaystyle\delta_{0,1}^{2} =var{ϕ(0,1)(Z1(2))}=σ0,121,\displaystyle=\operatorname{var}\{\phi^{(0,1)}(Z_{1}^{(2)})\}=\sigma_{0,1}^{2}\asymp 1, (A.9)

under $H_{a}$ in (1). Applying the Lyapunov central limit theorem,

Hn1,n2(1,0)+Hn1,n2(0,1)δ1,02n1+δ0,12n2𝑑𝒩(0,1).\frac{H_{n_{1},n_{2}}^{(1,0)}+H_{n_{1},n_{2}}^{(0,1)}}{\sqrt{\frac{\delta_{1,0}^{2}}{n_{1}}+\frac{\delta_{0,1}^{2}}{n_{2}}}}\overset{d}{\longrightarrow}\mathcal{N}(0,1).

Also, one can show that

var(Hn1,n2(2,0))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(2,0)}) n12δ2,02n12σ2,02(n12h1p)1,\displaystyle\asymp n_{1}^{-2}\delta_{2,0}^{2}\asymp n_{1}^{-2}\sigma_{2,0}^{2}\asymp(n_{1}^{2}h_{1}^{p})^{-1},
var(Hn1,n2(1,1))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(1,1)}) (n1n2)1δ1,12(n1n2)1σ1,12(n1h1p)1n21+n11(n2h2p)1,\displaystyle\asymp(n_{1}n_{2})^{-1}\delta_{1,1}^{2}\asymp(n_{1}n_{2})^{-1}\sigma_{1,1}^{2}\asymp(n_{1}h_{1}^{p})^{-1}n_{2}^{-1}+n_{1}^{-1}(n_{2}h_{2}^{p})^{-1},
var(Hn1,n2(0,2))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(0,2)}) n22δ0,22n22σ0,22(n22h2p)1,\displaystyle\asymp n_{2}^{-2}\delta_{0,2}^{2}\asymp n_{2}^{-2}\sigma_{0,2}^{2}\asymp(n_{2}^{2}h_{2}^{p})^{-1},
var(Hn1,n2(2,1))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(2,1)}) n12n21δ2,12n12n21σ2,12(n1h1p)2n21+(n12h1p)1(n2h2p)1,\displaystyle\asymp n_{1}^{-2}n_{2}^{-1}\delta_{2,1}^{2}\asymp n_{1}^{-2}n_{2}^{-1}\sigma_{2,1}^{2}\asymp(n_{1}h_{1}^{p})^{-2}n_{2}^{-1}+(n_{1}^{2}h_{1}^{p})^{-1}(n_{2}h_{2}^{p})^{-1},
var(Hn1,n2(1,2))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(1,2)}) n11n22δ1,22n11n22σ1,22(n1h1p)1(n22h2p)1+n11(n2h2p)2,\displaystyle\asymp n_{1}^{-1}n_{2}^{-2}\delta_{1,2}^{2}\asymp n_{1}^{-1}n_{2}^{-2}\sigma_{1,2}^{2}\asymp(n_{1}h_{1}^{p})^{-1}(n_{2}^{2}h_{2}^{p})^{-1}+n_{1}^{-1}(n_{2}h_{2}^{p})^{-2},
var(Hn1,n2(2,2))\displaystyle\operatorname{var}(H_{n_{1},n_{2}}^{(2,2)}) (n1n2)2δ2,22(n1n2)2σ2,22(n1h1p)2(n22h2p)1+(n12h1p)1(n2h2p)2.\displaystyle\asymp(n_{1}n_{2})^{-2}\delta_{2,2}^{2}\asymp(n_{1}n_{2})^{-2}\sigma_{2,2}^{2}\asymp(n_{1}h_{1}^{p})^{-2}(n_{2}^{2}h_{2}^{p})^{-1}+(n_{1}^{2}h_{1}^{p})^{-1}(n_{2}h_{2}^{p})^{-2}.

Thus,

var(Rn1,n2)δ1,02n1+δ0,12n2=O((n1h1p)1+(n2h2p)1).\frac{\operatorname{var}(R_{n_{1},n_{2}})}{\frac{\delta_{1,0}^{2}}{n_{1}}+\frac{\delta_{0,1}^{2}}{n_{2}}}=O((n_{1}h_{1}^{p})^{-1}+(n_{2}h_{2}^{p})^{-1}).

Then we have

k^kδ1,02n1+δ0,12n2\displaystyle\frac{\widehat{\mathcal{I}_{k}}-\mathcal{I}_{k}}{\sqrt{\frac{\delta_{1,0}^{2}}{n_{1}}+\frac{\delta_{0,1}^{2}}{n_{2}}}} =2Hn1,n2(1,0)+2Hn1,n2(0,1)δ1,02n1+δ0,12n2+𝔼(k^)kδ1,02n1+δ0,12n2+Rn1,n2δ1,02n1+δ0,12n2\displaystyle=\frac{2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}}{\sqrt{\frac{\delta_{1,0}^{2}}{n_{1}}+\frac{\delta_{0,1}^{2}}{n_{2}}}}+\frac{\mathbb{E}(\widehat{\mathcal{I}_{k}})-\mathcal{I}_{k}}{\sqrt{\frac{\delta_{1,0}^{2}}{n_{1}}+\frac{\delta_{0,1}^{2}}{n_{2}}}}+\frac{R_{n_{1},n_{2}}}{\sqrt{\frac{\delta_{1,0}^{2}}{n_{1}}+\frac{\delta_{0,1}^{2}}{n_{2}}}}
=2Hn1,n2(1,0)+2Hn1,n2(0,1)δ1,02n1+δ0,12n2+O(n11/2h1ν+n21/2h2ν)+Op((n1h1p)1/2+(n2h2p)1/2)\displaystyle=\frac{2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}}{\sqrt{\frac{\delta_{1,0}^{2}}{n_{1}}+\frac{\delta_{0,1}^{2}}{n_{2}}}}+O(n_{1}^{1/2}h_{1}^{\nu}+n_{2}^{1/2}h_{2}^{\nu})+O_{p}((n_{1}h_{1}^{p})^{-1/2}+(n_{2}h_{2}^{p})^{-1/2})
=2Hn1,n2(1,0)+2Hn1,n2(0,1)δ1,02n1+δ0,12n2+op(1)𝑑𝒩(0,4),\displaystyle=\frac{2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}}{\sqrt{\frac{\delta_{1,0}^{2}}{n_{1}}+\frac{\delta_{0,1}^{2}}{n_{2}}}}+o_{p}(1)\overset{d}{\longrightarrow}\mathcal{N}(0,4),

where we have used Assumption 2 and $n_{l}^{1/2}h_{l}^{\nu}\rightarrow 0$ as $n_{l}\rightarrow\infty$ for $l=1,2$. ∎
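Before turning to the null case, we record a minimal numerical sketch of how the normal approximation in Theorem 7 might be used. It reuses psi and I_k_hat from the sketch following the definition of $\psi$ above (an assumption: those functions are taken to be in scope), and it estimates the projection variances $\delta_{1,0}^{2}$ and $\delta_{0,1}^{2}$ by Monte Carlo averaging of $\psi$ over the remaining arguments; this plug-in studentization is offered only as an illustration of the theorem, not as the inference procedure recommended in the paper (which relies on a local bootstrap).

```python
import numpy as np

def projection_variances(Y1, X1, Y2, X2, h1, h2, B=100, seed=0):
    # Estimate I_k_hat and the projection variances delta^2_{1,0}, delta^2_{0,1}
    # by Monte Carlo over the remaining arguments of psi.
    # Assumes psi(...) and I_k_hat(...) from the earlier sketch are in scope.
    rng = np.random.default_rng(seed)
    n1, n2 = len(Y1), len(Y2)
    Ihat = I_k_hat(Y1, X1, Y2, X2, h1, h2)

    def phi10(i):
        vals = []
        for _ in range(B):
            i2 = int(rng.choice([t for t in range(n1) if t != i]))
            j1, j2 = rng.choice(n2, size=2, replace=False)
            vals.append(psi((Y1[i], X1[i]), (Y1[i2], X1[i2]),
                            (Y2[j1], X2[j1]), (Y2[j2], X2[j2]), h1, h2))
        return np.mean(vals) - Ihat

    def phi01(j):
        vals = []
        for _ in range(B):
            j2 = int(rng.choice([t for t in range(n2) if t != j]))
            i1, i2 = rng.choice(n1, size=2, replace=False)
            vals.append(psi((Y1[i1], X1[i1]), (Y1[i2], X1[i2]),
                            (Y2[j], X2[j]), (Y2[j2], X2[j2]), h1, h2))
        return np.mean(vals) - Ihat

    d10 = np.var([phi10(i) for i in range(n1)], ddof=1)
    d01 = np.var([phi01(j) for j in range(n2)], ddof=1)
    return Ihat, d10, d01

# Per Theorem 7, (Ihat - I_k) / sqrt(d10/n1 + d01/n2) is approximately N(0, 4) under H_a,
# so Ihat +/- 1.96 * 2 * sqrt(d10/n1 + d01/n2) gives an asymptotic 95% interval for I_k.
```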

Proof of Theorem 8.

Following the proof of Theorem 7, we have the Hoeffding decomposition

k^=𝔼(k^)+2Hn1,n2(1,0)+2Hn1,n2(0,1)+Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2)+Rn1,n2,\widehat{\mathcal{I}_{k}}=\mathbb{E}(\widehat{\mathcal{I}_{k}})+2H_{n_{1},n_{2}}^{(1,0)}+2H_{n_{1},n_{2}}^{(0,1)}+H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)}+R_{n_{1},n_{2}},

where, with a slight abuse of notation, $R_{n_{1},n_{2}}$ again denotes the remainder term.

Under $H_{0}$ in (1), one can verify that $\phi^{(1,0)}(z_{1}^{(1)})\neq 0$ and $\phi^{(0,1)}(z_{1}^{(2)})\neq 0$, so the U-statistic $\widehat{\mathcal{I}_{k}}$ is non-degenerate. Nonetheless, as in the local case, $H_{n_{1},n_{2}}^{(1,0)}$ and $H_{n_{1},n_{2}}^{(0,1)}$ can be asymptotically negligible under $H_{0}$ in (1). Specifically, by similar calculations as in the proof of Theorem 6,

𝔼{k(y1(1),Y2(1))Gh1(x1(1)X2(1))Gh2(X1(2)x1(1))Gh2(X2(2)x1(1))}\displaystyle\quad\mathbb{E}\Big{\{}k(y_{1}^{(1)},Y_{2}^{(1)})G_{h_{1}}(x_{1}^{(1)}-X_{2}^{(1)})G_{h_{2}}(X_{1}^{(2)}-x_{1}^{(1)})G_{h_{2}}(X_{2}^{(2)}-x_{1}^{(1)})\Big{\}}
=f1(x1(1))f22(x1(1))k(y1(1),y2(1))f1(y2(1)x1(1))𝑑y2(1)×{1+O(h1ν+h2ν)},\displaystyle=f_{1}(x_{1}^{(1)})f_{2}^{2}(x_{1}^{(1)})\int k(y_{1}^{(1)},y_{2}^{(1)})f_{1}(y_{2}^{(1)}\mid x_{1}^{(1)})dy_{2}^{(1)}\times\{1+O(h_{1}^{\nu}+h_{2}^{\nu})\},
𝔼{k(y1(1),Y2(1))Gh1(x1(1)X2(1))Gh2(X1(2)X2(1))Gh2(X2(2)X2(1))}\displaystyle\quad\mathbb{E}\Big{\{}k(y_{1}^{(1)},Y_{2}^{(1)})G_{h_{1}}(x_{1}^{(1)}-X_{2}^{(1)})G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(1)})G_{h_{2}}(X_{2}^{(2)}-X_{2}^{(1)})\Big{\}}
=f1(x1(1))f22(x1(1))k(y1(1),y2(1))f1(y2(1)x1(1))𝑑y2(1)×{1+O(h1ν+h2ν)},\displaystyle=f_{1}(x_{1}^{(1)})f_{2}^{2}(x_{1}^{(1)})\int k(y_{1}^{(1)},y_{2}^{(1)})f_{1}(y_{2}^{(1)}\mid x_{1}^{(1)})dy_{2}^{(1)}\times\{1+O(h_{1}^{\nu}+h_{2}^{\nu})\},
𝔼{k(Y1(2),Y2(2))Gh2(X1(2)X2(2))Gh1(x1(1)X1(2))Gh1(X2(1)X1(2))}\displaystyle\quad\mathbb{E}\Big{\{}k(Y_{1}^{(2)},Y_{2}^{(2)})G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(2)})G_{h_{1}}(x_{1}^{(1)}-X_{1}^{(2)})G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})\Big{\}}
=f1(x1(1))f22(x1(1))k(y1(2),y2(2))f2(y1(2)x1(1))f2(y2(2)x1(1))𝑑y1(2)𝑑y2(2)×{1+O(h1ν+h2ν)},\displaystyle=f_{1}(x_{1}^{(1)})f_{2}^{2}(x_{1}^{(1)})\int k(y_{1}^{(2)},y_{2}^{(2)})f_{2}(y_{1}^{(2)}\mid x_{1}^{(1)})f_{2}(y_{2}^{(2)}\mid x_{1}^{(1)})dy_{1}^{(2)}dy_{2}^{(2)}\times\{1+O(h_{1}^{\nu}+h_{2}^{\nu})\},
𝔼[k(y1(1),Y1(2)){Gh1(x1(1)X1(2))+Gh2(X1(2)x1(1))}Gh1(X2(1)X1(2))Gh2(X2(2)x1(1))]\displaystyle\quad\mathbb{E}\Big{[}k(y_{1}^{(1)},Y_{1}^{(2)})\left\{G_{h_{1}}(x_{1}^{(1)}-X_{1}^{(2)})+G_{h_{2}}(X_{1}^{(2)}-x_{1}^{(1)})\right\}G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-x_{1}^{(1)})\Big{]}
=2f1(x1(1))f22(x1(1))k(y1(1),y1(2))f2(y1(2)x1(1))𝑑y1(2)×{1+O(h1ν+h2ν)},\displaystyle=2f_{1}(x_{1}^{(1)})f_{2}^{2}(x_{1}^{(1)})\int k(y_{1}^{(1)},y_{1}^{(2)})f_{2}(y_{1}^{(2)}\mid x_{1}^{(1)})dy_{1}^{(2)}\times\{1+O(h_{1}^{\nu}+h_{2}^{\nu})\},
𝔼[k(Y2(1),Y1(2)){Gh1(X2(1)X1(2))+Gh2(X1(2)X2(1))}Gh1(x1(1)X1(2))Gh2(X2(2)X2(1))]\displaystyle\quad\mathbb{E}\Big{[}k(Y_{2}^{(1)},Y_{1}^{(2)})\left\{G_{h_{1}}(X_{2}^{(1)}-X_{1}^{(2)})+G_{h_{2}}(X_{1}^{(2)}-X_{2}^{(1)})\right\}G_{h_{1}}(x_{1}^{(1)}-X_{1}^{(2)})G_{h_{2}}(X_{2}^{(2)}-X_{2}^{(1)})\Big{]}
=2f1(x1(1))f22(x1(1))k(y2(1),y1(2))f1(y2(1)x1(1))f2(y1(2)x1(1))𝑑y2(1)𝑑y1(2)×{1+O(h1ν+h2ν)}.\displaystyle=2f_{1}(x_{1}^{(1)})f_{2}^{2}(x_{1}^{(1)})\int k(y_{2}^{(1)},y_{1}^{(2)})f_{1}(y_{2}^{(1)}\mid x_{1}^{(1)})f_{2}(y_{1}^{(2)}\mid x_{1}^{(1)})dy_{2}^{(1)}dy_{1}^{(2)}\times\{1+O(h_{1}^{\nu}+h_{2}^{\nu})\}.

Under $H_{0}$ in (1), we have $f(y\mid x)\coloneqq f_{1}(y\mid x)=f_{2}(y\mid x)$ for all $y,x$. Hence, for $\phi^{(1,0)}(z_{1}^{(1)})$ in (A.7),

ϕ(1,0)(z1(1))\displaystyle\phi^{(1,0)}(z_{1}^{(1)}) =f1(x1(1))f22(x1(1)){k(y1(1),y)f(yx1(1))𝑑y+k(y,y)f(yx1(1))f(yx1(1))𝑑y𝑑y}\displaystyle=f_{1}(x_{1}^{(1)})f_{2}^{2}(x_{1}^{(1)})\bigg{\{}\int k(y_{1}^{(1)},y)f(y\mid x_{1}^{(1)})dy+\int k(y,y^{\prime})f(y\mid x_{1}^{(1)})f(y^{\prime}\mid x_{1}^{(1)})dydy^{\prime}\bigg{\}}
×O(h1ν+h2ν)+O(h1ν+h2ν),\displaystyle\quad\times O(h_{1}^{\nu}+h_{2}^{\nu})+O(h_{1}^{\nu}+h_{2}^{\nu}),

under $H_{0}$ in (1). This implies that

var{ϕ(1,0)(Z1(1))}=O(h12ν+h22ν),\operatorname{var}\{\phi^{(1,0)}(Z_{1}^{(1)})\}=O(h_{1}^{2\nu}+h_{2}^{2\nu}),

and thus,

var(Hn1,n2(1,0))=O(n11(h12ν+h22ν)).\operatorname{var}(H_{n_{1},n_{2}}^{(1,0)})=O(n_{1}^{-1}(h_{1}^{2\nu}+h_{2}^{2\nu})).

Analogously, one can show that $\operatorname{var}(H_{n_{1},n_{2}}^{(0,1)})=O(n_{2}^{-1}(h_{1}^{2\nu}+h_{2}^{2\nu}))$ under $H_{0}$ in (1). As in the local case, when undersmoothing is used, the asymptotic distribution of $\widehat{\mathcal{I}_{k}}$ under $H_{0}$ in (1) is determined by $H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)}$, since $\operatorname{var}(H_{n_{1},n_{2}}^{(2,0)})\asymp(n_{1}^{2}h_{1}^{p})^{-1}$, $\operatorname{var}(H_{n_{1},n_{2}}^{(1,1)})\asymp(n_{1}h_{1}^{p})^{-1}n_{2}^{-1}+n_{1}^{-1}(n_{2}h_{2}^{p})^{-1}$, and $\operatorname{var}(H_{n_{1},n_{2}}^{(0,2)})\asymp(n_{2}^{2}h_{2}^{p})^{-1}$.

We have

Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2)\displaystyle H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)} =(n12)11i1<i2n1ϕ(2,0)(Zi1(1),Zi2(1))+4n1n2i=1n1j=1n2ϕ(1,1)(Zi(1);Zj(2))\displaystyle=\binom{n_{1}}{2}^{-1}\sum_{1\leq i_{1}<i_{2}\leq n_{1}}\phi^{(2,0)}(Z_{i_{1}}^{(1)},Z_{i_{2}}^{(1)})+\frac{4}{n_{1}n_{2}}\sum_{i=1}^{n_{1}}\sum_{j=1}^{n_{2}}\phi^{(1,1)}(Z_{i}^{(1)};Z_{j}^{(2)})
+(n22)11j1<j2n2ϕ(0,2)(Zj1(2),Zj2(2)),\displaystyle\quad+\binom{n_{2}}{2}^{-1}\sum_{1\leq j_{1}<j_{2}\leq n_{2}}\phi^{(0,2)}(Z_{j_{1}}^{(2)},Z_{j_{2}}^{(2)}),

and aim to show that

σn1,n21(Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2))𝑑𝒩(0,1),as n1,n2,\sigma_{n_{1},n_{2}}^{-1}(H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)})\overset{d}{\longrightarrow}\mathcal{N}(0,1),\quad\text{as }n_{1},n_{2}\rightarrow\infty,

where

σn1,n22var(Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2))=2n1(n11)δ2,02+16n1n2δ1,12+2n2(n21)δ0,22,\sigma_{n_{1},n_{2}}^{2}\coloneqq\operatorname{var}(H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)})=\frac{2}{n_{1}(n_{1}-1)}\delta_{2,0}^{2}+\frac{16}{n_{1}n_{2}}\delta_{1,1}^{2}+\frac{2}{n_{2}(n_{2}-1)}\delta_{0,2}^{2},

and

δ2,02\displaystyle\delta_{2,0}^{2} =var{ϕ(2,0)(Z1(1),Z2(1))},\displaystyle=\operatorname{var}\{\phi^{(2,0)}(Z_{1}^{(1)},Z_{2}^{(1)})\}, (A.10)
δ1,12\displaystyle\delta_{1,1}^{2} =var{ϕ(1,1)(Z1(1);Z1(2))},\displaystyle=\operatorname{var}\{\phi^{(1,1)}(Z_{1}^{(1)};Z_{1}^{(2)})\}, (A.11)
δ0,22\displaystyle\delta_{0,2}^{2} =var{ϕ(0,2)(Z1(2),Z2(2))}.\displaystyle=\operatorname{var}\{\phi^{(0,2)}(Z_{1}^{(2)},Z_{2}^{(2)})\}. (A.12)

Let $W_{i}=Z_{i}^{(1)}$ for $i=1,\ldots,n_{1}$ and $W_{j+n_{1}}=Z_{j}^{(2)}$ for $j=1,\ldots,n_{2}$, and for $i\neq j$,

φij={n11(n11)1ϕ(2,0)(Wi,Wj),if i,j{1,,n1},2n11n21ϕ(1,1)(Wi,Wj),if i{1,,n1} and j{n1+1,,n1+n2},n21(n21)1ϕ(0,2)(Wi,Wj),if i,j{n1+1,,n1+n2}.\displaystyle\varphi_{ij}=\left\{\begin{aligned} n_{1}^{-1}(n_{1}-1)^{-1}\phi^{(2,0)}(W_{i},W_{j}),\quad&\text{if }i,j\in\{1,\ldots,n_{1}\},\\ 2n_{1}^{-1}n_{2}^{-1}\phi^{(1,1)}(W_{i},W_{j}),\quad&\text{if }i\in\{1,\ldots,n_{1}\}\text{ and }j\in\{n_{1}+1,\ldots,n_{1}+n_{2}\},\\ n_{2}^{-1}(n_{2}-1)^{-1}\phi^{(0,2)}(W_{i},W_{j}),\quad&\text{if }i,j\in\{n_{1}+1,\ldots,n_{1}+n_{2}\}.\end{aligned}\right.

Write $n=n_{1}+n_{2}$. Define $V_{n,j}=\sum_{i=1}^{j-1}\varphi_{ij}$ for $j=2,\ldots,n$, and $S_{n,m}=\sum_{j=2}^{m}V_{n,j}$ for $m=2,\ldots,n$, with $S_{n,1}\coloneqq 0$. Let $\mathcal{F}_{n,m}$ be the $\sigma$-algebra generated by $\{W_{1},\ldots,W_{m}\}$ for $m=1,\ldots,n$. It is straightforward that $\mathcal{F}_{n,m-1}\subset\mathcal{F}_{n,m}$ for any $m=2,\ldots,n$, and $\{S_{n,m}\}_{m=1}^{n}$ is of zero mean and square integrable. Also, for $m_{1}>m_{2}$, we have $\mathbb{E}(S_{n,m_{1}}\mid\mathcal{F}_{n,m_{2}})=S_{n,m_{2}}$. Hence, for each $n$, $\{S_{n,m},\mathcal{F}_{n,m}\}_{m=1}^{n}$ is a square integrable martingale of zero mean. As

Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2)=2Sn,n,H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)}=2S_{n,n},

its asymptotic normality can be established by Corollary 3.1 in Hall and Heyde (2014); that is, it suffices to show

σn1,n22j=2n𝔼(Vn,j2n,j1)\displaystyle\sigma_{n_{1},n_{2}}^{-2}\sum_{j=2}^{n}\mathbb{E}(V_{n,j}^{2}\mid\mathcal{F}_{n,j-1}) 𝑝14,\displaystyle\overset{p}{\longrightarrow}\frac{1}{4},
σn1,n22j=2n𝔼[Vn,j2𝟙{|Vn,j|>ϵσn1,n2}n,j1]\displaystyle\sigma_{n_{1},n_{2}}^{-2}\sum_{j=2}^{n}\mathbb{E}[V_{n,j}^{2}\mathbbm{1}\{|V_{n,j}|>\epsilon\sigma_{n_{1},n_{2}}\}\mid\mathcal{F}_{n,j-1}] 𝑝0,ϵ>0,\displaystyle\overset{p}{\longrightarrow}0,\quad\forall\epsilon>0,

which can be done with routine verification of the following conditions:

var{j=2n𝔼(Vn,j2n,j1)}\displaystyle\operatorname{var}\bigg{\{}\sum_{j=2}^{n}\mathbb{E}(V_{n,j}^{2}\mid\mathcal{F}_{n,j-1})\bigg{\}} =o(σn1,n24),\displaystyle=o(\sigma_{n_{1},n_{2}}^{4}),
j=2n𝔼(Vn,j4)\displaystyle\sum_{j=2}^{n}\mathbb{E}(V_{n,j}^{4}) =o(σn1,n24).\displaystyle=o(\sigma_{n_{1},n_{2}}^{4}).

In our case, we have

σn1,n24var{j=2n𝔼(Vn,j2n,j1)}\displaystyle\sigma_{n_{1},n_{2}}^{-4}\operatorname{var}\bigg{\{}\sum_{j=2}^{n}\mathbb{E}(V_{n,j}^{2}\mid\mathcal{F}_{n,j-1})\bigg{\}} =O(h1p+h2p)=o(1),\displaystyle=O(h_{1}^{p}+h_{2}^{p})=o(1),
σn1,n24j=2n𝔼(Vn,j4)\displaystyle\sigma_{n_{1},n_{2}}^{-4}\sum_{j=2}^{n}\mathbb{E}(V_{n,j}^{4}) =O((n1h1p+n2h2p)1)=o(1),\displaystyle=O((n_{1}h_{1}^{p}+n_{2}h_{2}^{p})^{-1})=o(1),

where we have used Assumption 2; the tedious but routine calculations are omitted here.
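As a sanity check on the rearrangement $H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)}=2S_{n,n}$ used above, note that the identity depends only on the weights $\varphi_{ij}$ and not on the particular projections, so it can be verified numerically with arbitrary placeholder kernels; a minimal sketch (phi20, phi11, and phi02 below are stand-ins, not the actual projections):

```python
import numpy as np
from itertools import combinations
from math import comb

rng = np.random.default_rng(0)
n1, n2 = 6, 5
W = rng.normal(size=n1 + n2)            # placeholder observations; only the index split matters

# arbitrary stand-ins for phi^{(2,0)}, phi^{(1,1)}, phi^{(0,2)}; the identity holds for any choice
phi20 = lambda a, b: a * b
phi11 = lambda a, b: np.sin(a) * np.cos(b)
phi02 = lambda a, b: (a - b) ** 2

def varphi(i, j):
    # Weights varphi_{ij} defining the martingale array (0-based indices, i < j)
    a, b = W[i], W[j]
    if j < n1:
        return phi20(a, b) / (n1 * (n1 - 1))
    if i < n1:
        return 2.0 * phi11(a, b) / (n1 * n2)
    return phi02(a, b) / (n2 * (n2 - 1))

S_nn = sum(varphi(i, j) for j in range(1, n1 + n2) for i in range(j))

H20 = sum(phi20(W[i], W[j]) for i, j in combinations(range(n1), 2)) / comb(n1, 2)
H11 = sum(phi11(W[i], W[n1 + j]) for i in range(n1) for j in range(n2)) / (n1 * n2)
H02 = sum(phi02(W[n1 + i], W[n1 + j]) for i, j in combinations(range(n2), 2)) / comb(n2, 2)

assert np.isclose(2.0 * S_nn, H20 + 4.0 * H11 + H02)
```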

Therefore,

σn1,n21k^\displaystyle\sigma_{n_{1},n_{2}}^{-1}\widehat{\mathcal{I}_{k}} =σn1,n21(Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2))+σn1,n21𝔼(k^)\displaystyle=\sigma_{n_{1},n_{2}}^{-1}(H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)})+\sigma_{n_{1},n_{2}}^{-1}\mathbb{E}(\widehat{\mathcal{I}_{k}})
+2σn1,n21(Hn1,n2(1,0)+Hn1,n2(0,1))+σn1,n21Rn1,n2\displaystyle\quad+2\sigma_{n_{1},n_{2}}^{-1}(H_{n_{1},n_{2}}^{(1,0)}+H_{n_{1},n_{2}}^{(0,1)})+\sigma_{n_{1},n_{2}}^{-1}R_{n_{1},n_{2}}
=σn1,n21(Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2))+O(n1h1p/2+ν+n2h2p/2+ν)\displaystyle=\sigma_{n_{1},n_{2}}^{-1}(H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)})+O(n_{1}h_{1}^{p/2+\nu}+n_{2}h_{2}^{p/2+\nu})
+Op(n11/2h1p/2+ν+n21/2h2p/2+ν)+Op((n1h1p)1/2+(n2h2p)1/2)\displaystyle\quad+O_{p}(n_{1}^{1/2}h_{1}^{p/2+\nu}+n_{2}^{1/2}h_{2}^{p/2+\nu})+O_{p}((n_{1}h_{1}^{p})^{-1/2}+(n_{2}h_{2}^{p})^{-1/2})
=σn1,n21(Hn1,n2(2,0)+4Hn1,n2(1,1)+Hn1,n2(0,2))+op(1)𝑑𝒩(0,1),\displaystyle=\sigma_{n_{1},n_{2}}^{-1}(H_{n_{1},n_{2}}^{(2,0)}+4H_{n_{1},n_{2}}^{(1,1)}+H_{n_{1},n_{2}}^{(0,2)})+o_{p}(1)\overset{d}{\longrightarrow}\mathcal{N}(0,1),

where we have used Assumption 2 and $n_{l}h_{l}^{p/2+\nu}\rightarrow 0$ as $n_{l}\rightarrow\infty$ for $l=1,2$. ∎

Appendix D Additional simulations and results

Example 6.

We explore the effect of different values of the constant factor $C$ in the bandwidth $h_{l}(s)=C\cdot n_{l}^{-1/(p/2+\nu-0.1)}\hat{\sigma}_{X^{(l)}(s)}$ for the proposed global test, under the same settings as in Example 1. The empirical sizes and power curves at the 0.05 significance level for the proposed TCDT with $C=0.5$, $1$, and $2$ are reported in Table 6 and Figure 7, respectively.

Table 6: Empirical size ($c=0$) under $\alpha=0.05$ for $C=0.5$, $1$, or $2$ in the bandwidth $h_{l}(s)=C\cdot n_{l}^{-1/(p/2+\nu-0.1)}\hat{\sigma}_{X^{(l)}(s)}$, following the same settings as in Example 1.

Setting   Sample size         $C=0.5$   $C=1$   $C=2$
(1.1)     $n_1=n_2=50$        0.043     0.040   0.031
          $n_1=n_2=100$       0.045     0.045   0.044
(1.2)     $n_1=n_2=50$        0.026     0.039   0.039
          $n_1=n_2=100$       0.025     0.038   0.040
(1.3)     $n_1=n_2=50$        0.021     0.019   0.014
          $n_1=n_2=100$       0.016     0.018   0.020

Figure 7: Empirical size ($c=0$) and power ($c\neq 0$) under $\alpha=0.05$ for $C=0.5$, $1$, or $2$ in the bandwidth $h_{l}(s)=C\cdot n_{l}^{-1/(p/2+\nu-0.1)}\hat{\sigma}_{X^{(l)}(s)}$, following the same settings as in Example 1.

We find that regardless of whether $C=0.5$, $1$, or $2$, the empirical size of TCDT remains controlled at the 0.05 level. However, different values of $C$ lead to different power performance. In Settings (1.1) and (1.2), TCDT achieves higher power with a larger $C$, while in Setting (1.3) it attains higher power when $C=1$ or $C=0.5$. To strike a balance, we set $C=1$ for all settings in the numerical studies.
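For completeness, a minimal sketch of how the coordinatewise bandwidth rule above could be computed; the default kernel order $\nu=2$ is an assumption made for illustration, and $\hat{\sigma}_{X^{(l)}(s)}$ is taken to be the sample standard deviation of the $s$-th coordinate.

```python
import numpy as np

def bandwidths(X, nu=2.0, C=1.0):
    # h_l(s) = C * n_l^{-1/(p/2 + nu - 0.1)} * sigma_hat_{X(s)}, one bandwidth per coordinate s
    n, p = X.shape
    rate = n ** (-1.0 / (p / 2.0 + nu - 0.1))
    return C * rate * X.std(axis=0, ddof=1)

# e.g.: h1 = bandwidths(X1, C=1.0); h2 = bandwidths(X2, C=1.0)
```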

References

  • Balasubramanian et al., (2021) Balasubramanian, K., Li, T., and Yuan, M. (2021). On the optimality of kernel-embedding based goodness-of-fit tests. The Journal of Machine Learning Research, 22(1):1–45.
  • Baringhaus and Franz, (2004) Baringhaus, L. and Franz, C. (2004). On a new multivariate two-sample test. Journal of Multivariate Analysis, 88(1):190–206.
  • Berlinet and Thomas-Agnan, (2011) Berlinet, A. and Thomas-Agnan, C. (2011). Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media.
  • Bouezmarni et al., (2012) Bouezmarni, T., Rombouts, J. V., and Taamouti, A. (2012). Nonparametric copula-based test for conditional independence with applications to Granger causality. Journal of Business & Economic Statistics, 30(2):275–287.
  • Brooks et al., (2014) Brooks, T., Pope, D., and Marcolini, M. (2014). Airfoil Self-Noise. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5VW2C.
  • Calonico et al., (2014) Calonico, S., Cattaneo, M. D., and Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6):2295–2326.
  • Chakraborty and Zhang, (2019) Chakraborty, S. and Zhang, X. (2019). Distance metrics for measuring joint dependence with application to causal inference. Journal of the American Statistical Association.
  • Chakraborty and Zhang, (2021) Chakraborty, S. and Zhang, X. (2021). A new framework for distance and kernel-based metrics in high dimensions. Electronic Journal of Statistics, 15(2):5455–5522.
  • Chang et al., (2015) Chang, M., Lee, S., and Whang, Y.-J. (2015). Nonparametric tests of conditional treatment effects with an application to single-sex schooling on academic achievements. The Econometrics Journal, 18(3):307–346.
  • Chatterjee et al., (2024) Chatterjee, A., Niu, Z., and Bhattacharya, B. B. (2024). A kernel-based conditional two-sample test using nearest neighbors (with applications to calibration, regression curves, and simulation-based inference). arXiv preprint arXiv:2407.16550.
  • Chen et al., (2022) Chen, M., Tian, T., Zhu, J., Pan, W., and Wang, X. (2022). Paired-sample tests for homogeneity with/without confounding variables. Statistics and Its Interface, 15(3):335–348.
  • Cleveland, (1993) Cleveland, W. S. (1993). Visualizing data. Hobart press.
  • Crump et al., (2008) Crump, R. K., Hotz, V. J., Imbens, G. W., and Mitnik, O. A. (2008). Nonparametric tests for treatment effect heterogeneity. The Review of Economics and Statistics, 90(3):389–405.
  • Deb et al., (2020) Deb, N., Ghosal, P., and Sen, B. (2020). Measuring association on topological spaces using kernels and geometric graphs. arXiv preprint arXiv:2010.01768.
  • Dette and Munk, (1998) Dette, H. and Munk, A. (1998). Nonparametric comparison of several regression functions: exact and asymptotic theory. The Annals of Statistics, 26(6):2339–2368.
  • Duong, (2013) Duong, T. (2013). Local significant differences from nonparametric two-sample tests. Journal of Nonparametric Statistics, 25(3):635–645.
  • Fan and Li, (1996) Fan, Y. and Li, Q. (1996). Consistent model specification tests: omitted variables and semiparametric functional forms. Econometrica: Journal of the Econometric Society, pages 865–890.
  • Fukumizu et al., (2007) Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2007). Kernel measures of conditional dependence. Advances in neural information processing systems, 20.
  • González-Manteiga and Crujeiras, (2013) González-Manteiga, W. and Crujeiras, R. M. (2013). An updated review of goodness-of-fit tests for regression models. Test, 22(3):361–411.
  • Gretton et al., (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773.
  • Gretton et al., (2007) Gretton, A., Fukumizu, K., Teo, C., Song, L., Schölkopf, B., and Smola, A. (2007). A kernel statistical test of independence. Advances in neural information processing systems, 20.
  • Hall and Hart, (1990) Hall, P. and Hart, J. D. (1990). Bootstrap test for difference between means in nonparametric regression. Journal of the American Statistical Association, 85(412):1039–1049.
  • Hall and Heyde, (2014) Hall, P. and Heyde, C. C. (2014). Martingale limit theory and its application. Academic Press.
  • Hu and Lei, (2024) Hu, X. and Lei, J. (2024). A two-sample conditional distribution test using conformal prediction and weighted rank sum. Journal of the American Statistical Association, 119(546):1136–1154.
  • Huang et al., (2024) Huang, M.-Y., Qin, J., and Huang, C.-Y. (2024). Efficient data integration under prior probability shift. Biometrics, 80(2):ujae035.
  • Huang, (2010) Huang, T.-M. (2010). Testing conditional independence using maximal nonlinear conditional correlation. The Annals of Statistics, 38(4):2047–2091.
  • Imbens and Wooldridge, (2009) Imbens, G. W. and Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1):5–86.
  • Ke and Yin, (2020) Ke, C. and Yin, X. (2020). Expected conditional characteristic function-based measures for testing independence. Journal of the American Statistical Association.
  • Kim et al., (2019) Kim, I., Lee, A. B., and Lei, J. (2019). Global and local two-sample tests via regression. Electronic Journal of Statistics, 13(2):5253–5305.
  • Kouw and Loog, (2018) Kouw, W. M. and Loog, M. (2018). An introduction to domain adaptation and transfer learning. arXiv preprint arXiv:1812.11806.
  • Kulasekera, (1995) Kulasekera, K. (1995). Comparison of regression curves using quasi-residuals. Journal of the American Statistical Association, 90(431):1085–1093.
  • Kulasekera and Wang, (1997) Kulasekera, K. and Wang, J. (1997). Smoothing parameter selection for power optimality in testing of regression curves. Journal of the American Statistical Association, 92(438):500–511.
  • Lavergne, (2001) Lavergne, P. (2001). An equality test across nonparametric regressions. Journal of Econometrics, 103(1-2):307–344.
  • Lee, (2019) Lee, A. J. (2019). U-statistics: Theory and Practice. Routledge.
  • Lee, (2009) Lee, M.-j. (2009). Non-parametric tests for distributional treatment effect for randomly censored responses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(1):243–264.
  • Lee et al., (2024) Lee, S.-h., Ma, Y., and Zhao, J. (2024). Doubly flexible estimation under label shift. Journal of the American Statistical Association, pages 1–13.
  • Li and Racine, (2007) Li, Q. and Racine, J. S. (2007). Nonparametric econometrics: theory and practice. Princeton University Press.
  • Liu et al., (2023) Liu, M., Zhang, Y., Liao, K. P., and Cai, T. (2023). Augmented transfer regression learning with semi-non-parametric nuisance models. Journal of Machine Learning Research, 24(293):1–50.
  • Lyons, (2013) Lyons, R. (2013). Distance covariance in metric spaces. The Annals of Probability, 41(5):3284–3305.
  • Ma et al., (2023) Ma, C., Pathak, R., and Wainwright, M. J. (2023). Optimally tackling covariate shift in RKHS-based nonparametric regression. The Annals of Statistics, 51(2):738–761.
  • Muandet et al., (2017) Muandet, K., Fukumizu, K., Sriperumbudur, B., Schölkopf, B., et al. (2017). Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141.
  • Neumeyer and Dette, (2003) Neumeyer, N. and Dette, H. (2003). Nonparametric comparison of regression curves: an empirical process approach. The Annals of Statistics, 31(3):880–920.
  • Paparoditis and Politis, (2000) Paparoditis, E. and Politis, D. N. (2000). The local bootstrap for kernel estimators under general dependence conditions. Annals of the Institute of Statistical Mathematics, 52:139–159.
  • Park and Muandet, (2020) Park, J. and Muandet, K. (2020). A measure-theoretic approach to kernel conditional mean embeddings. Advances in neural information processing systems, 33:21247–21259.
  • Rizzo and Székely, (2010) Rizzo, M. L. and Székely, G. J. (2010). Disco analysis: A nonparametric extension of analysis of variance. The Annals of Applied Statistics, pages 1034–1055.
  • Rubin, (1974) Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.
  • Sejdinovic et al., (2013) Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, pages 2263–2291.
  • Sheng and Sriperumbudur, (2023) Sheng, T. and Sriperumbudur, B. K. (2023). On distance and kernel measures of conditional dependence. Journal of Machine Learning Research, 24(7):1–16.
  • Shimodaira, (2000) Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244.
  • Sriperumbudur et al., (2008) Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G., and Schölkopf, B. (2008). Injective hilbert space embeddings of probability measures. In 21st Annual Conference on Learning Theory (COLT 2008), pages 111–122. Omnipress.
  • Sriperumbudur et al., (2010) Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research, 11:1517–1561.
  • Stute, (1991) Stute, W. (1991). Conditional U-statistics. The Annals of Probability, pages 812–825.
  • Su and Spindler, (2013) Su, L. and Spindler, M. (2013). Nonparametric testing for asymmetric information. Journal of Business & Economic Statistics, 31(2):208–225.
  • Su and White, (2007) Su, L. and White, H. (2007). A consistent characteristic function-based test for conditional independence. Journal of Econometrics, 141(2):807–834.
  • Su and White, (2008) Su, L. and White, H. (2008). A nonparametric Hellinger metric test for conditional independence. Econometric Theory, 24(4):829–864.
  • Székely and Rizzo, (2005) Székely, G. J. and Rizzo, M. L. (2005). A new test for multivariate normality. Journal of Multivariate Analysis, 93(1):58–80.
  • Székely and Rizzo, (2017) Székely, G. J. and Rizzo, M. L. (2017). The energy of data. Annual Review of Statistics and Its Application, 4(1):447–479.
  • Székely et al., (2007) Székely, G. J., Rizzo, M. L., and Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794.
  • Székely et al., (2004) Székely, G. J., Rizzo, M. L., et al. (2004). Testing for equal distributions in high dimension. InterStat, 5(16.10):1249–1272.
  • Taamouti et al., (2014) Taamouti, A., Bouezmarni, T., and El Ghouch, A. (2014). Nonparametric estimation and inference for conditional density based granger causality measures. Journal of Econometrics, 180(2):251–264.
  • Tibshirani et al., (2019) Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019). Conformal prediction under covariate shift. Advances in neural information processing systems, 32.
  • Ullah and Pagan, (1999) Ullah, A. and Pagan, A. (1999). Nonparametric econometrics. Cambridge University Press, Cambridge.
  • Wang et al., (2015) Wang, X., Pan, W., Hu, W., Tian, Y., and Zhang, H. (2015). Conditional distance correlation. Journal of the American Statistical Association, 110(512):1726–1734.
  • Yao et al., (2018) Yao, S., Zhang, X., and Shao, X. (2018). Testing mutual independence in high dimension via distance covariance. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 80(3):455–480.
  • Yin and Yuan, (2020) Yin, X. and Yuan, Q. (2020). A new class of measures for testing independence. Statistica Sinica, 30(4):2131–2154.
  • Zhang et al., (2012) Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. (2012). Kernel-based conditional independence test and application in causal discovery. arXiv preprint arXiv:1202.3775.
  • Zhang et al., (2018) Zhang, X., Yao, S., and Shao, X. (2018). Conditional mean and quantile dependence testing in high dimension. The Annals of Statistics, 46(1):219–246.
  • Zhu et al., (2020) Zhu, C., Zhang, X., Yao, S., and Shao, X. (2020). Distance-based and RKHS-based dependence metrics in high dimension. The Annals of Statistics, 48(6):3366–3394.