
k-Sample inference
via Multimarginal Optimal Transport

Natalia Kravtsova (kravtsova.2@osu.edu), Department of Mathematics, The Ohio State University
Abstract

This paper proposes a Multimarginal Optimal Transport (MOT) approach for simultaneously comparing $k\geq 2$ measures supported on finite subsets of $\mathbb{R}^d$, $d\geq 1$. We derive asymptotic distributions of the optimal value of the empirical MOT program under the null hypothesis that all $k$ measures are the same, and under the alternative hypothesis that at least two measures differ. We use these results to construct a test of the null hypothesis and provide consistency and power guarantees for this $k$-sample test. We consistently estimate the asymptotic distributions using the bootstrap, and propose a low complexity linear program to approximate the test cut-off. We demonstrate the advantages of our approach on synthetic and real datasets, including real data on cancers in the United States in 2004 - 2020.

keywords [class=MSC]: 62G10, 90C31, 60F05
keywords: k-Sample test, Optimal Transport

1 Introduction

The $k$-sample inference problem concerns simultaneously comparing several probability measures. The classical question of this inference is to determine whether $k\geq 2$ groups of observed data points have the same underlying probability distribution, i.e. to test

H_{0}:\mu_{1}=\cdots=\mu_{k} \qquad (1)
H_{a}:\mu_{i}\neq\mu_{j}\text{ for some }1\leq i<j\leq k

This testing problem has a long history in statistics, ranging from classical rank-based tests for univariate data [15, 49, 67, 87] to a recent extension [20] using multivariate ranks [13, 38], and to graph [56], distance [64] and kernel based [68, 50] methods. Direct applications of testing the hypotheses in (1) include simultaneously comparing gene expression profiles to assess presence of disease [40, 84], assessing differences in chronic disease levels based on quality of life [11, 76], analyzing associations between exercise and morphology of an animal [88], and comparing distributions of agents’ outcomes in reinforcement learning [61]. Moreover, the test of (1) is frequently viewed as a non-parametric version of ANOVA [12, 64] with a myriad of scientific applications, typically comparing treatment outcomes between multiple groups, e.g. in clinical trials [14] and cancer studies [43, 86]. Table 1 outlines additional instances of scientific applications of $k$-sample inference when the measures of interest have finite support, which is the case considered in this paper.

This paper proposes an Optimal Transport approach to $k$-sample inference for $k\geq 2$ probability measures with finite supports in $\mathbb{R}^d$, $d\geq 1$. The method provides a powerful $k$-sample test of (1), but also allows comparison between different collections of $k$ measures in terms of their within-collection variability. Optimal Transport based approaches have been shown successful in one-sample (goodness-of-fit) and two-sample problems on finite [6, 51, 72], countable [75], semidiscrete [39], and some continuous spaces [57, 46]. The test statistics employ $p$-Wasserstein distances $W_p$ (or their regularized variants) to quantify differences between measures of interest while respecting the metric structure of their supports [60].

Our test statistic employs a different functional - the Multimarginal Optimal Transport program $MOT$ [62] - which can be represented as a variance functional on the space of measures [10] and thus serves as a natural candidate for testing variability in a collection of $k$ measures. We show that despite well-documented differences in the solution structures of the MOT and $W_p$ problems (Section 1.7.4 of [66]), MOT shares the same benefits as $W_p$ when it comes to the limiting behavior of its optimal value.

Using MOT for $k$-sample inference brings several important advantages. The main advantage is that the asymptotic distributions of MOT can be derived under both $H_0$ and $H_a$. To the best of our knowledge, the only multivariate $k$-sample test statistic with known $H_a$ distribution is the one of [47], where the limit laws are known only for a specific subset of alternatives. Our laws cover all alternatives in (1), which allows us to explicitly derive a power function of the test and establish novel consistency results. The consistency analysis techniques developed in this paper can be further applied to one- and two-sample tests based on the asymptotic results in [46, 72, 75].

Another benefit of the MOT limit under $H_a$ is the ability to estimate functionals of the $H_a$ distribution, e.g. Confidence Regions for the MOT value. This allows for a novel application of comparing several collections of $k$ measures using the overlap between their Confidence Regions (see Figure 8 for a concrete example). The procedure can be viewed as a distributional analogue of multiple comparisons (see [45] for a review), performed in a space of measures rather than in Euclidean space. To the best of our knowledge, this type of analysis is not available with other $k$-sample statistics considered in the literature.

Conceptually, our approach to $k$-sample inference is equivalent to viewing measures as points and considering the variability within their collection. It follows a general framework of Optimal Transport based distribution comparison: distributions are viewed as points in a Wasserstein space, with the Wasserstein distance indicating their closeness [60], leading to the development of distributional analogues of traditional methods such as regression and time series [85, 89, 33], synthetic controls [36], and clustering [77]. Below we provide a formulation of our approach (Sections 1.1, 1.2) and summarize our contributions (Section 1.3).

1.1 Multimarginal Optimal Transport (MOT) for k-sample inference

Let $\mu_1,\ldots,\mu_k$ be Borel probability measures supported on $\mathcal{X}\subseteq\mathbb{R}^d$, $d\geq 1$. The Multimarginal Optimal Transport (MOT) problem (equation (4.3) of [1]) is the optimization problem

\inf_{\pi\in\mathcal{C}(\mu_{1},\ldots,\mu_{k})}\int_{\mathcal{X}\times\cdots\times\mathcal{X}}c(x_{1},\ldots,x_{k})\,d\pi(x_{1},\ldots,x_{k}) \qquad (2)

where $\mathcal{C}(\mu_1,\ldots,\mu_k)$ is the set of Borel probability measures on the product space $\mathcal{X}\times\cdots\times\mathcal{X}$ with marginals $\mu_1,\ldots,\mu_k$. Different choices of the cost function $c(x_1,\ldots,x_k)$ are possible; throughout the paper, we fix the choice to be

c(x_{1},\ldots,x_{k}):=\frac{1}{k}\sum_{i=1}^{k}\left\|x_{i}-\frac{1}{k}\sum_{j=1}^{k}x_{j}\right\|^{2} \qquad (3)
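For concreteness, the cost (3) is just the average squared distance of a $k$-tuple of points to its mean. A minimal sketch (in Python with numpy, an implementation choice of ours and not the paper's software):

```python
import numpy as np

def mot_cost(points):
    """Cost (3): average squared Euclidean distance of k points in R^d to their mean."""
    pts = np.atleast_2d(np.asarray(points, dtype=float))  # shape (k, d)
    center = pts.mean(axis=0)                             # the mean (1/k) sum_j x_j
    return float(np.mean(np.sum((pts - center) ** 2, axis=1)))
```

In particular, the cost vanishes exactly when all $k$ points coincide, foreshadowing the role of the MOT value as a variance functional.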

Under this choice, the MOT problem is equivalent (Proposition 3.1.2 of [60]) to the Wasserstein barycenter problem (equation 2.2 of [1])

\inf_{\nu\in\mathcal{P}^{2}(\mathbb{R}^{d})}\frac{1}{k}\sum_{i=1}^{k}W_{2}^{2}(\mu_{i},\nu) \qquad (4)

where $W_2(\cdot,\cdot)$ denotes the 2-Wasserstein distance. By equivalence here we mean that the optimal values of both programs are equal, and the optimal solutions $\pi^*$ of (2) and $\nu^*$ of (4) are related by $\nu^* = M_{\#}\pi^*$, where $M$ is the map that averages a given $k$-tuple of points from the supports of the $k$ measures ($M_{\#}\pi^*$ stands for the pushforward of the measure $\pi^*$ by the map $M$). We remark here that when the measures are discrete, the barycenter problem (4) generally has more than one optimal solution $\nu^*$. This presents challenges for statistical inference concerning barycenter solutions [52], but does not impede the analysis of the optimal value of the barycenter or MOT problems (recalling that all optimal solutions result in the same optimal value).

Let $MOT(\mu_1,\ldots,\mu_k)$ denote the optimal value of the MOT program (2). Observe that $MOT(\mu_1,\ldots,\mu_k)=0$ if and only if the $k$ measures $\mu_1,\ldots,\mu_k$ are all the same. Indeed, if $\nu$ is (any) optimal solution to the barycenter problem (4), then $MOT(\mu_1,\ldots,\mu_k)=0$ is equivalent to a zero optimal value in the barycenter program (4), i.e. $W_2^2(\mu_i,\nu)=0$ for all $i=1,\ldots,k$. Due to the metric properties of the 2-Wasserstein distance (Theorem 7.3 of [80]), this is equivalent to all $k$ measures being the same (and equal to $\nu$).

This observation suggests that testing $H_0$ in (1) can be addressed via testing for $MOT(\mu_1,\ldots,\mu_k)=0$. To this end, suppose that data on $k$ samples $(X^1_i)_{i=1}^{n_1},\ldots,(X^k_i)_{i=1}^{n_k}$ of sizes $n_1,\ldots,n_k$, respectively, is available to estimate the underlying measures $\mu_1,\ldots,\mu_k$ by the empirical measures $\widehat{\mu}_1:=\frac{1}{n_1}\sum_{i=1}^{n_1}\delta_{X^1_i},\ldots,\widehat{\mu}_k:=\frac{1}{n_k}\sum_{i=1}^{n_k}\delta_{X^k_i}$. To test for $MOT(\mu_1,\ldots,\mu_k)=0$ based on the data, we consider the asymptotic distribution $\mathcal{D}_0$ of the empirical estimator $MOT(\widehat{\mu}_1,\ldots,\widehat{\mu}_k)$ under $H_0$ and reject $H_0$ when the estimator value is large.
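On a finite support, each empirical measure is just the vector of sample proportions over the support points. A minimal sketch for a one-dimensional support (the helper name is ours, not from the paper):

```python
import numpy as np

def empirical_measure(sample, support):
    """Vector of sample proportions over the finite support {x_1, ..., x_N}."""
    sample = np.asarray(sample, dtype=float)
    return np.array([np.mean(sample == x) for x in support])
```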

More generally, once the asymptotic distribution $\mathcal{D}$ of the empirical MOT value is known, one can estimate various functionals of $\mathcal{D}$, such as Confidence Regions (CR’s) for the true MOT value under either $H_0$ or $H_a$ in (1). Inference of this type requires knowledge of the asymptotic distribution of the empirical version of the MOT value in (2). Our derivation of these distributions leverages a rich literature on asymptotic theory for the Wasserstein distance, whose main results we briefly review below.

Table 1: Examples of scientific data on finitely supported measures. Sample reference describes the set up, the data, and/or the comparison problem in each case.
Variable(s) of interest | Support of measures | Ref.
Age of patients (in years) | $\{0,1,2,\ldots,100\}\subset\mathbb{R}$ | [25]
Tumor size (in mm) | $\{1,2,\ldots,150\}\subset\mathbb{R}$ | [73]
Number of positive lymph nodes | $\{1,2,\ldots,7\}\subset\mathbb{R}$ | [31]
Joint distributions of the above variables | finite subsets of $\mathbb{R}^2$ or $\mathbb{R}^3$ | [34]
Cell counts | $\{0,1,\ldots,M\}^d\subset\mathbb{R}^d$ for $d$ sites | [16]
Demand over $N$ locations | $N$ points (longitude, latitude) in $\mathbb{R}^2$ | [3]
Disease rates over $N$ locations | $N$ points on the map in $\mathbb{R}^2$ | [48]
Pixel/voxel intensity in microscopy images | the grid in $\mathbb{R}^2$ or $\mathbb{R}^3$ | [78]

1.2 Existing results on weak limits for Optimal Transport

The squared 2-Wasserstein distance $W_2^2(\mu,\nu)$ is the optimal value of the problem

\inf_{\pi\in\mathcal{C}(\mu,\nu)}\int_{\mathcal{X}\times\mathcal{X}}c(x_{1},x_{2})\,d\pi(x_{1},x_{2}) \qquad (5)

with $c(x_1,x_2)=\|x_1-x_2\|^2$, which can be viewed as a particular case of the MOT problem (2) with $k=2$ measures. Being a true metric on the space of probability measures on a given metric space ([81]), the 2-Wasserstein distance $W_2$ (and, more generally, the $p$-Wasserstein distance $W_p$) provides a natural way to compare probability measures while respecting the geometry of the supporting metric space. Under this framework, the true measures are estimated by their empirical counterparts, and statistical inference is conducted using limiting laws for the empirical Wasserstein distance [59, 63].

The forms of the weak limits depend on two main factors: the dimensionality of the support and the nature of the measures (where the cases $\mu=\nu$ and $\mu\neq\nu$ may have different limits). Letting $OT_c$ denote the optimal value in (5) (with possibly different costs $c$), the limiting laws have the general form

\rho_{n}\left(OT_{c}(\widehat{\mu}_{n_{1}},\widehat{\nu}_{n_{2}})-OT_{c}(\mu,\nu)\right)\xrightarrow{\text{in law}}L \qquad (6)

with $\rho_n=\sqrt{\frac{n_1 n_2}{n_1+n_2}}$ (when only $\mu$ is estimated from the data while $\nu$ is not, the “one-sample” version with $\rho_n=\sqrt{n}$ is considered). When measures are supported on $\mathbb{R}$, the limits $L$ can be Gaussian, with variance depending on the truth $\mu,\nu$, under the “alternative” assumption $\mu\neq\nu$ [57, 22], and are non-Gaussian under the “null” assumption $\mu=\nu$ [22, 21]. When measures are supported on $\mathbb{R}^d$, $d>1$, and are absolutely continuous, the curse of dimensionality takes place: the empirical Wasserstein distance converges in expectation to the true one too slowly [26, 30]. It is still possible, however, to obtain a convergence statement similar to (6) in any dimension $d\geq 1$ by replacing the centering true value $OT_c(\mu,\nu)$ with the expectation of the empirical value $\mathbb{E}\left(OT_c(\widehat{\mu}_{n_1},\widehat{\nu}_{n_2})\right)$ [23]. The limit $L$ is Gaussian when $\mu\neq\nu$ and is degenerate (i.e. the limiting random variable has zero variance) when $\mu=\nu$.

A favorable situation arises when measures are supported on a finite space $\mathcal{X}=\{x_1,\ldots,x_N\}\subseteq\mathbb{R}^d$: [72] show that limit laws of the form (6) hold for the $W_p$ distance in any dimension $d\geq 1$ and use the resulting laws to construct statistical inference under $H_0$ and $H_a$ (the case of countable support is treated in [75]). The laws under either $H_0$ or $H_a$ are non-degenerate and given by

\rho_{n}\left(W_{p}^{p}(\widehat{\mu}_{n_{1}},\widehat{\nu}_{n_{2}})-W_{p}^{p}(\mu,\nu)\right)\xrightarrow{\text{in law}}\max_{(u_{1},u_{2})\in\Phi^{*}}\sqrt{\lambda}\langle u_{1},G_{1}\rangle+\sqrt{1-\lambda}\langle u_{2},G_{2}\rangle \qquad (7)

where $G_1,G_2$ are the weak limits of the multinomial processes $\sqrt{n_1}(\widehat{\mu}_{n_1}-\mu)$ and $\sqrt{n_2}(\widehat{\nu}_{n_2}-\nu)$, and $\Phi^*$ is the set of optimal solutions to the dual of the $W_p^p$ program (5). The results are extended in [46] to general measures supported on $\mathbb{R}^d$, $d=1,2,3$, and general costs $c$ (with a discussion of limitations in higher dimensions $d\geq 4$), thus providing a unified approach to weak limits of empirical OT costs centered by the true population value.

The starting point for the theoretical results of this paper is the weak limit (7) [72]. Inspired by this result, we establish limits of the form (6), where $OT_c$ is now the optimal value of the MOT program (2) with $k\geq 2$ measures supported on a finite space $\mathcal{X}=\{x_1,\ldots,x_N\}\subseteq\mathbb{R}^d$ for any $d\geq 1$. The implications of these results for $k$-sample inference and further theoretical findings related to our limits are summarized below.

1.3 Summary of contributions and outline

Asymptotic distributions of MOT: We provide the asymptotic distribution of the MOT cost on finite spaces by establishing Hadamard directional differentiability of the MOT functional and combining it with the functional Delta method [9, 29, 46, 70, 69, 65, 72, 75]. The resulting limit is a Hadamard directional derivative of MOT at the true $\mu$ in the direction of the limit $G$ of the empirical process $\rho_n\left(\widehat{\mu}_n-\mu\right)$ for a suitably defined rate $\rho_n$ (Theorem 2.2(a)). For $k=2$ measures, our limit recovers the one for the Wasserstein distance on finite spaces obtained in [72]; for $k>2$ measures, our limit allows us to construct novel inference procedures for the $k$-sample problem using MOT (Section 2.2). We specify the structure of the limit under $H_0$ and construct a low complexity stochastic upper bound on the null distribution (Theorem 2.2(b)) that is used to efficiently approximate the limit under $H_0$ (Section 3.2). The bound is tight for $k=2$. We specify the structure of the limit under $H_a$ and provide sufficient conditions for the limit to be Gaussian by leveraging results on the geometry of multitransportation polytopes from [28]. When the limits are not Gaussian, we construct Normal lower bounds on the alternative limiting distribution (Theorem 2.2(c)). Our stochastic bounds on the null and alternative distributions provide an analytically tractable way to analyze the power of Optimal Transport based tests (Section 2.4), which, to the best of our knowledge, has not yet been considered in the literature.

Consistency and power: We provide a novel power analysis for Optimal Transport based tests of the hypotheses (1) that encompasses both our test (12) and the two-sample test in [72], and can potentially be applied to tests based on the limiting laws in [75] and [46]. We show consistency of the test under fixed alternatives (Proposition 2.8), as well as uniform consistency over a certain broad class of alternatives (Theorem 2.9 and Proposition 2.11). We illustrate the theoretical power results in the $k=2$ case by providing a lower bound on the power function that explicitly relates the sample size and the effect size (Corollary 2.10, Figure 2). We also quantify how the population version of our statistic changes with the number of measures for certain sequences of alternatives (Lemmas 2.13 and 2.14), suggesting potential power advantages in these cases. For the case of small sample sizes, we provide a permutation version of the MOT based test (Section 3.3). Comparison with the state-of-the-art tests of [47, 50, 64] shows strong finite sample power performance of our tests (Figure 5).

Computational complexity results: Leveraging a recent complexity result for the MOT/barycenter program [2], we prove polynomial time complexity of the derivative bootstrap that consistently estimates the asymptotic distribution of MOT under $H_0$ (Lemma 3.2); polynomial complexity of the m-out-of-n bootstrap and the permutation procedure follows directly from [2] (Table 2). We demonstrate that the null upper bound of Theorem 2.2(b) can efficiently approximate the null distribution when the cardinality of the support $N=|\mathcal{X}|$ is large (Figure 4).

Applications to real data and software: We illustrate the performance of MOT based $k$-sample inference on two synthetic datasets, showing strong power performance when testing $H_0$ and the ability to produce meaningful and interpretable confidence regions under $H_a$ (Section 4.1). Further, we apply our methodology to real data on cancers in United States populations to confirm two claims in cancer studies that were previously established using different methodologies (Section 4.2, Figures 7 and 8). The current version of the software that implements our methods is available at https://github.com/kravtsova2/mot.

2 k-Sample inference on finite spaces using MOT

2.1 Notation and preliminary definitions

Denote the vector of the $k$ true measures supported on $\mathcal{X}=\{x_1,\ldots,x_N\}\subseteq\mathbb{R}^d$ by

\mu:=\begin{pmatrix}\mu_{1}\\ \vdots\\ \mu_{k}\end{pmatrix}\in\mathbb{R}^{kN}

and the vector of the empirical counterparts $\widehat{\mu}_i:=\frac{1}{n_i}\sum_{j=1}^{n_i}\delta_{X^i_j}$ by

\widehat{\mu}_{n}:=\begin{pmatrix}\widehat{\mu}_{1}\\ \vdots\\ \widehat{\mu}_{k}\end{pmatrix}\in\mathbb{R}^{kN}

with sample sizes

n:=(n_{1},\ldots,n_{k})

where $n\to\infty$ is to be interpreted as each sample size tending to infinity.

Let $MOT(\mu)$ be the optimal value of the program (2), which on the finite space $\mathcal{X}$ becomes the finite-dimensional linear program

\min_{\pi\geq 0}\langle c,\pi\rangle \qquad (8)
A\pi=\mu

The optimization variable $\pi\in\mathbb{R}^{N^k}$ is a column vector representing a joint probability distribution with marginals $\mu_1,\ldots,\mu_k$ (frequently called a multicoupling), the matrix $A\in\mathbb{R}^{kN\times N^k}$ encodes the constraints for $\pi$ to be a multicoupling (i.e. that summing certain entries of $\pi$ gives the marginals $\mu_1,\ldots,\mu_k$), and the cost column vector $c\in\mathbb{R}^{N^k}$ contains the averaged squared Euclidean distances from the measure support points to their averages, as given by (3). (The linear program formulation of the MOT problem is discussed, for example, on p. 3 of [54].)

For the reader’s convenience, Example 1 below illustrates the structure of the linear program (8) in the case of three measures, each supported on two points in $\mathbb{R}^1$:

Example 1 (Illustration of the MOT optimization problem for three measures).

Consider the finite set $\mathcal{X}=\{5,10\}\subset\mathbb{R}^1$, which could represent, for instance, tumor sizes (in centimeters) of cancer patients, and consider probability measures supported on $\mathcal{X}$ representing the probabilities of occurrence of $5\text{cm}$ and $10\text{cm}$ tumors in a given population of cancer patients. Suppose that the measure $\mu_1$ has probabilities recorded in a vector $\mu_1:=\begin{pmatrix}\mu_1^1\\ \mu_1^2\end{pmatrix}$, and, similarly, $\mu_2:=\begin{pmatrix}\mu_2^1\\ \mu_2^2\end{pmatrix}$, $\mu_3:=\begin{pmatrix}\mu_3^1\\ \mu_3^2\end{pmatrix}$. The multimarginal optimal transport problem (8) is to minimize a linear (with coefficients in $c$) function of a measure $\pi$ on the product space $\mathcal{X}\times\mathcal{X}\times\mathcal{X}$ whose marginals are $\mu_1,\mu_2$, and $\mu_3$, respectively (in this example, the three supports are all equal to $\mathcal{X}$). Technically, $\pi$ is an order-$3$ tensor, i.e. an array with $3$ indices with values in $\{1,2\}$, but for notational convenience we represent it by a long vector $(\pi_{ijk})_{i,j,k\in\{1,2\}}\in\mathbb{R}^{2\cdot 2\cdot 2}$. The cost $c_{ijk}$ in the objective of (8) associated with the entry $\pi_{ijk}$ is the average of the squared differences (or squared norms of the differences in the higher-dimensional case) from the support points $x_i,x_j,x_k$ to their mean $\overline{M}_{ijk}=\frac{1}{3}(x_i+x_j+x_k)$, i.e.

c_{ijk}=\frac{1}{3}\left((x_{i}-\overline{M}_{ijk})^{2}+(x_{j}-\overline{M}_{ijk})^{2}+(x_{k}-\overline{M}_{ijk})^{2}\right)

The objective is to minimize the total discrepancy weighted by $\pi$, which is given by

\langle c,\pi\rangle=c_{111}\pi_{111}+c_{112}\pi_{112}+\cdots+c_{222}\pi_{222}

The multicoupling $\pi$ is subject to having non-negative entries and is constrained linearly by $A\pi=\mu$. The constraint matrix $A$ ensures that the appropriate entries of $\pi$ sum to the given marginals $\mu_1$, $\mu_2$, and $\mu_3$, i.e.

\underbrace{\begin{pmatrix}1&1&1&1&0&0&0&0\\ 0&0&0&0&1&1&1&1\\ 1&1&0&0&1&1&0&0\\ 0&0&1&1&0&0&1&1\\ 1&0&1&0&1&0&1&0\\ 0&1&0&1&0&1&0&1\end{pmatrix}}_{A}\underbrace{\begin{pmatrix}\pi_{111}\\ \pi_{112}\\ \pi_{121}\\ \pi_{122}\\ \pi_{211}\\ \pi_{212}\\ \pi_{221}\\ \pi_{222}\end{pmatrix}}_{\pi}=\underbrace{\begin{pmatrix}\mu_{1}^{1}\\ \mu_{1}^{2}\\ \mu_{2}^{1}\\ \mu_{2}^{2}\\ \mu_{3}^{1}\\ \mu_{3}^{2}\end{pmatrix}}_{\mu} \qquad (9)

finishing the example.
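Example 1 can be solved numerically with any LP solver. The sketch below builds $c$ and $A$ exactly as in (9) and solves (8) with scipy's linprog; the marginals chosen here ($\mu_1=\mu_2=(0.5,0.5)$, $\mu_3=(0.2,0.8)$) are illustrative choices of ours, not from the paper.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

support = np.array([5.0, 10.0])                        # X = {5, 10} as in Example 1
k, N = 3, len(support)
tuples = list(itertools.product(range(N), repeat=k))   # index tuples (i, j, k)

# Cost vector: entry c_ijk is the average squared distance of (x_i, x_j, x_k)
# to their mean, as in eq. (3)
c = np.array([np.mean((support[list(t)] - support[list(t)].mean()) ** 2)
              for t in tuples])

# Constraint matrix A of eq. (9): row (m, s) sums the entries of pi whose
# m-th index equals s, producing the m-th marginal
A = np.zeros((k * N, N ** k))
for col, t in enumerate(tuples):
    for m in range(k):
        A[m * N + t[m], col] = 1.0

mu = np.concatenate([[0.5, 0.5], [0.5, 0.5], [0.2, 0.8]])  # stacked marginals
res = linprog(c, A_eq=A, b_eq=mu, bounds=(0, None), method="highs")
print(res.fun)   # the optimal MOT value
```

Here the optimal multicoupling puts as much mass as possible on the zero-cost "diagonal" tuples, and only the leftover mass pays the mixed-tuple cost.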

The dual program of (8) is given by

\max_{u}\langle u,\mu\rangle \qquad (10)
A^{\prime}u\leq c

(the derivation of the dual follows from the standard theory of linear programming, e.g. Section 4.1 of [5]). The column vector $u:=\begin{pmatrix}u_1\\ \vdots\\ u_k\end{pmatrix}\in\mathbb{R}^{kN}$ contains the dual variables, one block for each measure, and the objective of (10) can be thought of as summing the contributions $\langle u_1,\mu_1\rangle+\cdots+\langle u_k,\mu_k\rangle$.

Let $\Phi^*$ denote the set of dual optimal solutions to (10). This set consists of all vectors $u$ that attain the maximum value of the dual objective (equal to the minimum value of the primal objective by strong duality, e.g. Theorem 4.4 of [5]) and satisfy the dual constraints, i.e.

\Phi^{*}:=\left\{u=\begin{pmatrix}u_{1}\\ \vdots\\ u_{k}\end{pmatrix}\in\mathbb{R}^{kN}:\langle u,\mu\rangle=MOT(\mu),\ A^{\prime}u\leq c\right\} \qquad (11)
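An element of $\Phi^*$ can be read off a solved instance of (8): with scipy's HiGHS backend, the marginals of the equality constraints are one dual optimal vector $u$, and strong duality $\langle u,\mu\rangle=MOT(\mu)$ can be checked numerically. A self-contained sketch on a small $k=2$ instance (illustrative marginals of ours; when $\Phi^*$ is not a singleton, the solver returns just one of its elements):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Small instance of program (8): k = 2 measures on the support {0, 1}
support = np.array([0.0, 1.0])
k, N = 2, len(support)
tuples = list(itertools.product(range(N), repeat=k))
c = np.array([np.mean((support[list(t)] - support[list(t)].mean()) ** 2)
              for t in tuples])
A = np.zeros((k * N, N ** k))
for col, t in enumerate(tuples):
    for m in range(k):
        A[m * N + t[m], col] = 1.0
mu = np.concatenate([[0.3, 0.7], [0.6, 0.4]])    # mu_1 != mu_2

res = linprog(c, A_eq=A, b_eq=mu, bounds=(0, None), method="highs")
u = res.eqlin.marginals        # a dual optimal vector, i.e. one element of Phi*
print(np.dot(u, mu), res.fun)  # equal by strong duality
```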

We consider the asymptotic behavior of the scaled and centered empirical estimator $MOT(\widehat{\mu}_n)$ by establishing the weak limit

\rho_{n}\left(MOT(\widehat{\mu}_{n})-MOT(\mu)\right)\xrightarrow{\text{in law}}X

as $n\to\infty$, where $X\overset{d}{=}X_0$ under $H_0$ and $X\overset{d}{=}X_a$ under $H_a$. The set $\Phi^*$ will be needed to define the limit $X$.

2.2 Definitions of $H_0$ testing and $H_a$ inference procedures

Consider the statistic

T_{n}:=\rho_{n}\left(MOT(\widehat{\mu}_{n})-MOT(\mu)\right)

where $MOT(\mu)=0$ under $H_0$ and $MOT(\mu)>0$ under $H_a$.

An $\alpha$-level test of $H_0$ would reject $H_0$ if $\rho_n MOT(\widehat{\mu}_n)$ is large, i.e. if $T_n$ exceeds the $(1-\alpha)$-th quantile of its null distribution $\mathcal{D}_0$. However, as Theorem 2.2 shows, $\mathcal{D}_0$ depends on the unknown true $\mu$, and hence care must be taken to ensure that the estimated cut-off used for the test still results in the (asymptotic) level $\alpha$.

To this end, we consider a consistent bootstrap estimator of $\mathcal{D}_0$ given in Proposition 3.1 and denote its $(1-\alpha)$-th quantile by $c_{\alpha,\mathcal{D}_0}$. Consistency of the bootstrap is shown using the results of [29], and by Corollary 3.2 of the same work such a bootstrap based cut-off gives an asymptotic level $\alpha$ test of $H_0$. Using this cut-off, we define the asymptotic test of $H_0$ as the map

\phi_{n,\mu}:=\begin{cases}1&\text{ if }T_{n}\geq c_{\alpha,\mathcal{D}_{0}}\\ 0&\text{ otherwise }\end{cases} \qquad (12)
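The bootstrap estimator of $\mathcal{D}_0$ is developed in Section 3; for intuition, the permutation version mentioned in Section 3.3 is easy to sketch. The code below (a sketch of ours, not the paper's implementation, with hypothetical helper names) pools the $k$ samples, re-splits them at random, and reports a small p-value when the observed MOT value is large relative to the permutation distribution:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def mot_value(measures, support):
    """Optimal value of the MOT linear program (8) on a finite support."""
    support = np.asarray(support, dtype=float)
    k, N = len(measures), len(support)
    tuples = list(itertools.product(range(N), repeat=k))
    c = np.array([np.mean((support[list(t)] - support[list(t)].mean()) ** 2)
                  for t in tuples])
    A = np.zeros((k * N, N ** k))
    for col, t in enumerate(tuples):
        for m in range(k):
            A[m * N + t[m], col] = 1.0
    return linprog(c, A_eq=A, b_eq=np.concatenate(measures),
                   bounds=(0, None), method="highs").fun

def empirical(sample, support):
    """Sample proportions of a one-dimensional sample over the support."""
    return np.array([np.mean(np.asarray(sample) == x) for x in support])

def permutation_pvalue(samples, support, n_perm=200, seed=0):
    """Permutation analogue of the test (12): small p-values reject H0."""
    rng = np.random.default_rng(seed)
    sizes = [len(s) for s in samples]
    observed = mot_value([empirical(s, support) for s in samples], support)
    pooled = np.concatenate(samples)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        parts = np.split(pooled, np.cumsum(sizes)[:-1])
        stat = mot_value([empirical(p, support) for p in parts], support)
        count += stat >= observed - 1e-12
    return (1 + count) / (1 + n_perm)
```

The p-value uses the standard add-one correction; the exact permutation scheme of the paper's Section 3.3 may differ in details.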

Similarly, the distribution of $T_n$ under $H_a$ is consistently estimated by the bootstrap in Proposition 3.1, with the resulting $(\alpha/2)$-th and $(1-\alpha/2)$-th quantiles denoted by $c_{\alpha/2,\mathcal{D}_a}$ and $c_{1-\alpha/2,\mathcal{D}_a}$, respectively. The asymptotic $(1-\alpha)$-level Confidence Region for $MOT(\mu)$ under $H_a$ is given by

\left(MOT(\widehat{\mu}_{n})-\frac{1}{\rho_{n}}c_{1-\alpha/2,\mathcal{D}_{a}},\ MOT(\widehat{\mu}_{n})-\frac{1}{\rho_{n}}c_{\alpha/2,\mathcal{D}_{a}}\right) \qquad (13)

2.3 Asymptotic distributions of MOT under $H_0$ and $H_a$

Figure 1: Illustration of the asymptotic distributions of MOT given by Theorem 2.2 on the set up of Example 1. A. Under $H_0$ (Theorem 2.2(b)): the null distribution $X_0$, the upper bound $UB_0$, and MOT values sampled directly (black line) by sampling empirical measures from the truth and evaluating the MOT value. All densities here and in the rest of the paper are estimated by kernel density estimators in ggplot2 [82] with default parameters. B. Under $H_a$ (Theorem 2.2(c)): the Gaussian limit under Condition (A1) (Left) and a non-Gaussian limit (Right) with Normal Lower Bounds (NLB’s) ($NLB\overset{d}{=}X_a$ in the Gaussian case).

Theorem 2.2(a) provides the general form of the asymptotic distribution of $MOT(\widehat{\mu}_n)$ on finite spaces. This distribution is given by the Hadamard directional derivative of the MOT functional, which is the optimal value of a linear program with a feasible set consisting of the dual optimal solutions $\Phi^*$ (11). If $\Phi^*$ is a singleton, the limit in Theorem 2.2(a) is a linear combination of Gaussians, and hence is also Gaussian. If not, the limit is the maximum (taken over the feasible set $\Phi^*$) of such linear combinations.

By the theory of linear programming, it is possible to assess whether the set of dual optimal solutions $\Phi^*$ is a singleton based on the corresponding set of basic optimal solutions to the primal program (we use Theorem 5.6.1 of [71], and Chapters 4 and 5 of [5] for the general linear programming results employed below). In our case, the basic optimal solutions to the primal program (8) are the vertices of the multitransportation polytope $P(\mu_1,\ldots,\mu_k):=\{\pi\geq 0:A\pi=\mu\}$ given by multicouplings $\pi^*$. These vertices contain at most $\mathrm{rank}(A)=kN-k+1$ positive entries, and a vertex is termed degenerate if it contains strictly fewer (p. 366 of [28]).

The dual optimal set $\Phi^*$ cannot be a singleton if an optimal solution to the primal MOT program is unique and degenerate. This is always the case under $H_0$: the unique optimal solution $\pi^*$ is given by the “identity” multicoupling (with the entries of $\mu_1$ at the tuple indices whose coordinates coincide and zeros elsewhere, so it is degenerate). Hence, the asymptotic distribution of MOT under $H_0$ is never Gaussian (Theorem 2.2(b)).

The dual optimal set $\Phi^*$ is a singleton if there exists a non-degenerate primal optimal vertex, i.e. an optimal multicoupling $\pi^*$ with $kN-k+1$ positive entries. This can happen under certain $H_a$’s; in particular, it happens if the multitransportation polytope $P(\mu_1,\ldots,\mu_k)$ contains no degenerate vertices. In this case, $\Phi^*$ is a singleton, and the corresponding asymptotic distribution of MOT is Gaussian (Theorem 2.2(c)). We use a result in discrete geometry from [28] to provide a sufficient condition (A1) that leads to Gaussian limits under $H_a$:

Condition (A1).

(Regularity, Definition 1.5 in Chapter 8 of [28]) For each $i=1,\ldots,k$, order the entries of the vector $\mu_i$ as $\mu_i^1\geq\mu_i^2\geq\cdots\geq\mu_i^N$, and assume that $\mu_i^1<\mu_{i+1}^1$ for $i=1,\ldots,k-1$. The multitransportation polytope $P(\mu_1,\ldots,\mu_k)$ is regular if

\mu_{i}^{N}+\sum_{j=i+1}^{k}\mu_{j}^{1}>k-i,\qquad\forall i=1,\ldots,k-1

A regular multitransportation polytope does not have degenerate vertices (Lemma 1.4 in Chapter 8 of [28]).
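Condition (A1) is straightforward to check numerically. The sketch below (our helper, not from the paper) sorts each measure's entries in decreasing order, relabels the measures so that their largest entries strictly increase (the labeling assumed in the definition; MOT is symmetric in its arguments), and returns False when the required strict ordering fails:

```python
import numpy as np

def is_regular(measures):
    """Check Condition (A1): regularity of the polytope P(mu_1, ..., mu_k)."""
    k = len(measures)
    # Sort each measure's entries decreasingly: mu_i^1 >= ... >= mu_i^N
    m = [np.sort(np.asarray(mu, dtype=float))[::-1] for mu in measures]
    # Relabel the measures so that the largest entries increase across i
    m.sort(key=lambda v: v[0])
    # The definition assumes mu_i^1 < mu_{i+1}^1 strictly (no ties)
    if any(m[i][0] >= m[i + 1][0] for i in range(k - 1)):
        return False
    # Regularity: mu_i^N + sum_{j > i} mu_j^1 > k - i for all i = 1, ..., k-1
    # (code index i is 0-based, so the right-hand side is (k - 1) - i)
    return all(m[i][-1] + sum(m[j][0] for j in range(i + 1, k)) > (k - 1) - i
               for i in range(k - 1))
```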

Remark 2.1.

For $k=2$ measures, Condition (A1) is implied by the following No Subset Sum condition that ensures that a transportation polytope $P(\mu_1,\mu_2)$ has no degenerate vertices (this condition is mentioned by [74] for uniqueness of Kantorovich potentials for $k=2$ finitely supported measures): there are no proper subsets of indices $I,J\subset[N]$ such that $\sum_{i\in I}\mu_1^i=\sum_{j\in J}\mu_2^j$. This condition is both necessary and sufficient to exclude degenerate vertices in the $k=2$ case (see, e.g., Theorem 1.2 in Chapter 6 of [28]).

Theorem 2.2 (Asymptotic distribution of MOT on finite spaces).

Assume that the sizes of the $k$ samples $n_1,\ldots,n_k\to\infty$ satisfy $\frac{n_i}{n_1+\ldots+n_k}\to\lambda_i\in(0,1)$. Denote $\rho_n:=\frac{\sqrt{n_1\cdot\ldots\cdot n_k}}{\left(\sqrt{n_1+\ldots+n_k}\right)^{k-1}}$ and $a_i:=\prod_{j\neq i}\lambda_j$. Then,

  • (a)

    The asymptotic distribution of MOT is given by

    \rho_{n}\left(MOT(\widehat{\mu}_{n})-MOT(\mu)\right)\xrightarrow{\text{in law}}\max_{u\in\Phi^{*}}\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},G_{i}\rangle \qquad (14)

    where the feasible set $\Phi^*$ is given by (11), and $G_i\overset{\text{indep.}}{\sim}N(0,\Sigma_i)$ with $\Sigma_i=\mathrm{diag}(\mu_i)-\mu_i\mu_i^{\prime}$.

  • (b)

    Under H0H_{0}, the limit in (14) is non-Gaussian, and given by

    ρn(MOT(μ^n)0)in lawX0𝒟0\rho_{n}\left(MOT(\widehat{\mu}_{n})-0\right)\xrightarrow[]{\text{in law}}X_{0}\sim\mathcal{D}_{0}

    where 𝒟0\mathcal{D}_{0} is given by

    maxu\displaystyle\max_{u} i=1kaiui,Gi\displaystyle\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},G_{i}\rangle (15)
    s.t. i=1kui=0\displaystyle\sum_{i=1}^{k}u_{i}=0
    Auc\displaystyle A^{\prime}u\leq c

where Giindep.N(0,Σ1)G_{i}\overset{\text{indep.}}{\sim}N(0,\Sigma_{1}) and Σ1=diag(μ1)μ1μ1\Sigma_{1}=\text{diag}(\mu_{1})-\mu_{1}\mu_{1}^{\prime}.

    Furthermore, there exists UB0𝒟UB0{UB}_{0}\sim\mathcal{D}_{{UB}_{0}} on the same probability space as X0X_{0} such that UB0X0{UB}_{0}\geq X_{0} everywhere, and 𝒟UB0\mathcal{D}_{{UB}_{0}} is given by

    maxu\displaystyle\max_{u} i=2kui,aiGia1G1\displaystyle\sum_{i=2}^{k}\langle u_{i},\sqrt{a_{i}}G_{i}-\sqrt{a_{1}}G_{1}\rangle (16)
    A~uc~\displaystyle\widetilde{A}^{\prime}u\leq\widetilde{c}

    where A~uc~\widetilde{A}^{\prime}u\leq\widetilde{c} is a subset of constraints from (15).

  • (c)

    Under HaH_{a},

    ρn(MOT(μ^n)MOT(μ))in lawXa𝒟a\rho_{n}\left(MOT(\widehat{\mu}_{n})-MOT(\mu)\right)\xrightarrow[]{\text{in law}}X_{a}\sim\mathcal{D}_{a}

    where 𝒟a\mathcal{D}_{a} is given by (14). Furthermore, for every uΦu^{*}\in\Phi^{*} given by (11), there exists a random variable NLBuNLB_{u^{*}} on the same probability space as XaX_{a}, such that NLBuXaNLB_{u^{*}}\leq X_{a} everywhere, and

    NLBu𝒩(0,i=1kaiuiΣiui)NLB_{u^{*}}\sim\mathcal{N}\left(0,\sum_{i=1}^{k}a_{i}{u_{i}^{*}}^{\prime}\Sigma_{i}u_{i}^{*}\right) (17)

    If Condition (A1) holds, then Φ\Phi^{*} is a singleton {u}\{u^{*}\}, and Xa=𝑑NLBuX_{a}\overset{d}{=}NLB_{u^{*}}.
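To make the limit in part (b) concrete, the following sketch draws from the k=2k=2 null law on a small 1-D support. It assumes the dual constraints A'u <= c in (10) take the standard form u1(x_i) + u2(x_j) <= c_ij with c_ij = (x_i - x_j)^2/4, and equal sample sizes so that a_1 = a_2 = 1/2; on the kernel u1 + u2 = 0 we substitute u2 = -u1, and pin the first entry of u1 to zero to remove the additive-constant degeneracy (this leaves the optimum unchanged since the entries of each G_i sum to zero). All names are ours:

```python
import numpy as np
from scipy.optimize import linprog

def sample_null_limit(x, mu1, rng):
    """One draw from the k = 2 null law (15) on a 1-D support x (sketch).

    Assumes the dual constraints A'u <= c are u1(x_i) + u2(x_j) <= c_ij
    with c_ij = (x_i - x_j)^2 / 4; on the kernel u1 + u2 = 0 they
    become u1(x_i) - u1(x_j) <= c_ij.
    """
    N = len(x)
    # G = D^{1/2} Z - mu (sqrt(mu) . Z) has covariance diag(mu) - mu mu'
    # and entries summing to zero exactly, as multinomial fluctuations do.
    s = np.sqrt(mu1)
    G1, G2 = (s * z - mu1 * (s @ z) for z in rng.standard_normal((2, N)))
    g = np.sqrt(0.5) * G1 - np.sqrt(0.5) * G2      # objective vector for u1
    A_ub, b_ub = [], []
    for i in range(N):
        for j in range(N):
            row = np.zeros(N)
            row[i] += 1.0
            row[j] -= 1.0
            A_ub.append(row)
            b_ub.append((x[i] - x[j]) ** 2 / 4.0)
    bounds = [(0.0, 0.0)] + [(None, None)] * (N - 1)  # pin u1[0] = 0
    res = linprog(-g, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return -res.fun                                   # maximized objective

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0])
mu1 = np.array([0.2, 0.5, 0.3])
draws = np.array([sample_null_limit(x, mu1, rng) for _ in range(500)])
print(draws.min() >= -1e-9)   # u = 0 is feasible, so every draw is >= 0
```

Repeating the draw many times and taking the empirical (1-alpha)-quantile gives a Monte Carlo approximation of the cut-off for this instance.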

Proof summary for Theorem 2.2.

Proof of part (a) is outlined below, with details in Appendix A.1. In what follows, for each i=1,,ki=1,\ldots,k, we view the measures μi𝒫(𝒳)(l1(𝒳),l1)\mu_{i}\in\mathcal{P}(\mathcal{X})\subseteq\left(l^{1}(\mathcal{X}),\|\cdot\|_{l^{1}}\right), and the dual vectors ui(l(𝒳),l)u_{i}\in\left(l^{\infty}(\mathcal{X}),\|\cdot\|_{l^{\infty}}\right). The weak convergence and Hadamard directional differentiability are with respect to l1l^{1} norm on i=1kl1(𝒳)\bigotimes_{i=1}^{k}l^{1}(\mathcal{X}).

  • Step 1 Establish, for a suitable scaling ρn\rho_{n}, the weak limit aG\sqrt{a}G of the empirical process

    ρn(μ^nμ)in lawaG\rho_{n}\left(\widehat{\mu}_{n}-\mu\right)\xrightarrow[]{\text{in law}}\sqrt{a}G
  • Step 2 Confirm that the functional μMOT(μ)\mu\longrightarrow MOT(\mu) is Hadamard directionally differentiable at μ\mu with derivative

    fμ(G)=maxuΦi=1kaiui,Gif_{\mu}^{\prime}(G)=\max_{u\in\Phi^{*}}\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},G_{i}\rangle
  • Step 3 Use the Delta Method for Hadamard directionally differentiable maps [65, 70] to conclude that

    ρn(f(μ^n)f(μ))in lawfμ(G)\rho_{n}\left(f(\widehat{\mu}_{n})-f(\mu)\right)\xrightarrow[]{\text{in law}}f^{\prime}_{\mu}(G)

Proof of part (b) is given in Appendix A.2. It provides the exact form of the proposed upper bound UB0UB_{0} by reporting the constraints in A~\widetilde{A}. To construct A~\widetilde{A}, we consider how the inequality constraints AucA^{\prime}u\leq c behave on the kernel of the map {ui=1kui}\{u\longrightarrow\sum_{i=1}^{k}u_{i}\}. The resulting upper bound has only polynomially many constraints and can be sampled efficiently to approximate the null distribution in Section 3.2. We remark that the proposed bound is not unique: in particular, it can be strengthened by including more constraints from AA (Section 5.2). The proposed bound is tight when k=2k=2.333We remark here that the null distribution program (15) can be written with k1k-1 dual variables instead of kk due to the constraint i=1kui=0\sum_{i=1}^{k}u_{i}=0; this is what [72] refer to in the k=2k=2 case (p. 227). We choose to keep the form (15) for notational convenience.

For part (c), if the limit is not Gaussian (i.e., the set Φ\Phi^{*} of dual optimal solutions is not a singleton), one can take any uΦu^{*}\in\Phi^{*} and consider the (random) objective (14) evaluated at uu^{*}. The resulting value lower-bounds the optimal value of the maximization program (14) and is distributed according to (17).

Observation 2.3.

The entries of the cost vector ci1i2ikc_{i_{1}i_{2}\cdots i_{k}} indexed by i1,,ik{1,,N}i_{1},\cdots,i_{k}\in\{1,\ldots,N\} with k1k-1 coinciding index values can be written in terms of the distance between two points with unique indices scaled by k1k2\frac{k-1}{k^{2}}. For example, c112=k1k2x1x22c_{1\cdots 12}=\frac{k-1}{k^{2}}\|x_{1}-x_{2}\|^{2}. The details are provided in Appendix A.3.
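Observation 2.3 can be checked numerically, assuming the entries of cc are the barycentric costs c(x_{i_1}, ..., x_{i_k}) = min_y (1/k) sum_l ||x_{i_l} - y||^2, attained at the mean of the tuple (the normalization consistent with MOT equaling one quarter of the squared Wasserstein distance for k=2k=2):

```python
import numpy as np

def barycentric_cost(points):
    """c(x_1, ..., x_k) = min_y (1/k) sum_l ||x_l - y||^2, attained at the mean."""
    pts = np.asarray(points, dtype=float)
    return float(np.mean(np.sum((pts - pts.mean(axis=0)) ** 2, axis=1)))

x1, x2 = np.array([0.0, 0.0]), np.array([3.0, 4.0])   # ||x1 - x2||^2 = 25
for k in (2, 3, 4, 5):
    # tuple (x1, ..., x1, x2) with k - 1 coinciding indices, e.g. c_{1...12}
    lhs = barycentric_cost([x1] * (k - 1) + [x2])
    rhs = (k - 1) / k**2 * 25.0
    print(k, lhs, rhs)   # the two columns agree for every k
```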

Lemma 2.4 (Bounds on the dual variables).

Fix k2k\geq 2. Let u=(u1,,uk)u=(u_{1},\ldots,u_{k}) be optimal solutions to the dual MOTMOT program (10) satisfying i=1kui=0\sum_{i=1}^{k}u_{i}=0 and chosen444Recall that adding a constant to any dual variable uiu_{i} and subtracting the same constant from any other dual variable uiu_{i^{\prime}} does not change the dual objective and does not violate the dual constraints in (10). Hence, given any vector of dual solutions u=(u1,,uk)u=(u_{1},\ldots,u_{k}), the first entries of u2,,uku_{2},\ldots,u_{k} can be normalized to zero, and the constraint i=1kui=0\sum_{i=1}^{k}u_{i}=0 would force the first entry of u1u_{1} to be zero as well. Such normalization is frequently done to avoid redundant solutions - see, e.g., the definition of dual transportation polyhedron in [4]. such that the first entries ui1=0{u_{i}}_{1}=0. Then for each i=1,,ki=1,\ldots,k, the jjth entry of uiu_{i} is bounded as

|ui|jk1k2x1xj2|u_{i}|_{j}\leq\frac{k-1}{k^{2}}\|x_{1}-x_{j}\|^{2}

where 2\|\cdot-\cdot\|^{2} is the squared distance on the ground metric space 𝒳={x1,,xN}\mathcal{X}=\{x_{1},\ldots,x_{N}\}. It follows that

uik1k2C(𝒳), i=1,,k\|u_{i}\|\leq\frac{k-1}{k^{2}}C(\mathcal{X}),\text{ }i=1,\ldots,k

where C(𝒳)C(\mathcal{X}) depends only on the ground metric space 𝒳\mathcal{X}.

Remark 2.5.

The assumption i=1kui=0\sum_{i=1}^{k}u_{i}=0 holds for those μ\mu for which the primal optimal solution - the multicoupling (πi1,,ik)i1,,ik(\pi_{i_{1},\ldots,i_{k}})_{i_{1},\ldots,i_{k}} - assigns a positive mass to the “diagonal” tuples (xi1,,xi1)(x_{i_{1}},\ldots,x_{i_{1}}), i.e. πi1,,i1>0\pi_{i_{1},\ldots,i_{1}}>0 for i1=1,,Ni_{1}=1,\ldots,N. By the complementary slackness result in linear programming (see, e.g., Theorem 4.5 of [5]), in this case, the corresponding constraints of the dual

(u1)j++(uk)jcj,,j=0, j=1,,N(u_{1})_{j}+\cdots+(u_{k})_{j}\leq c_{j,\ldots,j}=0,\text{ }j=1,\ldots,N

hold with equality, giving i=1kui=0\sum_{i=1}^{k}u_{i}=0. This always holds under H0H_{0} and frequently happens under HaH_{a}.

Using the above results, Proposition 2.6 defines an upper bound on all test cut-offs cα,𝒟0c_{\alpha,\mathcal{D}_{0}}, which is independent of the nature and the number of measures in μ(k)=(μ1,,μk)\mu(k)=(\mu_{1},\ldots,\mu_{k}). This bound is used to prove consistency of the test (12) with cut-offs cα,𝒟0c_{\alpha,\mathcal{D}_{0}} uniformly over μ\mu (Theorem 2.9) and kk (Proposition 2.11).

Proposition 2.6 (Bound for test cut-off cα,𝒟0c_{\alpha,\mathcal{D}_{0}}).

Fix the test level α(0,1)\alpha\in(0,1). There exists cα(𝒳)c_{\alpha}(\mathcal{X}) depending only on the ground metric space 𝒳\mathcal{X} such that

cα,𝒟0cα(𝒳) for all μ(k) supported on 𝒳c_{\alpha,\mathcal{D}_{0}}\leq c_{\alpha}(\mathcal{X})\text{ for all }\mu(k)\text{ supported on }\mathcal{X}

where μ(k)=(μ1,,μk)\mu(k)=(\mu_{1},\ldots,\mu_{k}). In particular, for any k2k\geq 2,

cα,𝒟0(i=1kai)k1k2C(𝒳)8ln(α/4)=:cα(𝒳)c_{\alpha,\mathcal{D}_{0}}\leq\left(\sum_{i=1}^{k}\sqrt{a_{i}}\right)\frac{k-1}{k^{2}}C(\mathcal{X})\sqrt{-8\ln{(\alpha/4)}}=:c_{\alpha}(\mathcal{X})

For equal sample sizes n1==nkn_{1}=\cdots=n_{k}, this gives

cα,𝒟0k1k(k+1)/2C(𝒳)8ln(α/4)18C(𝒳)8ln(α/4)c_{\alpha,\mathcal{D}_{0}}\leq\frac{k-1}{k^{(k+1)/2}}C(\mathcal{X})\sqrt{-8\ln{(\alpha/4)}}\leq\frac{1}{\sqrt{8}}C(\mathcal{X})\sqrt{-8\ln{(\alpha/4)}}

for all k2k\geq 2.

Proof.

For any given μ(k)\mu(k), consider the null distribution 𝒟0\mathcal{D}_{0} given by linear program (15) and bound its objective (everywhere) as

i=1kaiui,Gii=1kai|ui,Gi|\displaystyle\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},G_{i}\rangle\leq\sum_{i=1}^{k}\sqrt{a_{i}}\left|\langle u_{i},G_{i}\rangle\right| i=1kaiuiGi\displaystyle\leq\sum_{i=1}^{k}\sqrt{a_{i}}\|u_{i}\|\|G_{i}\|
Lemma 2.4(i=1kai)k1k2C(𝒳)G1\displaystyle\underset{\text{Lemma }\ref{lem:dual_bd_null}}{\leq}\left(\sum_{i=1}^{k}\sqrt{a_{i}}\right)\frac{k-1}{k^{2}}C(\mathcal{X})\|G_{1}\|

Thus the optimal value X0X_{0} of (15) is also bounded by the same quantity.

The desired cut-off cα(𝒳)c_{\alpha}(\mathcal{X}) is obtained using the cut-off tαt_{\alpha} for the distribution of G1\|G_{1}\|. To define tαt_{\alpha}, we use the following concentration result from [53]:

Concentration of G1\|G_{1}\| (Equation (3.5) from [53]). For a centered Gaussian random vector G1G_{1}, given any t>0t>0,

(G1t)4et28𝔼G12\mathbb{P}\left(\|G_{1}\|\geq t\right)\leq 4e^{-\frac{t^{2}}{8\mathbb{E}\|G_{1}\|^{2}}} (18)

Recalling that G1=(G11,,G1N)G_{1}=(G_{1}^{1},\ldots,G_{1}^{N}) with Cov(G1)=diag(μ1)μ1μ1\text{Cov}(G_{1})=\text{diag}(\mu_{1})-\mu_{1}\mu_{1}^{\prime} where μ1=(p1,,pN)\mu_{1}=(p_{1},\ldots,p_{N}), we get that

𝔼G12\displaystyle\mathbb{E}\|G_{1}\|^{2} =𝔼[(G11)2++(G1N)2]\displaystyle=\mathbb{E}\left[(G_{1}^{1})^{2}+\cdots+(G_{1}^{N})^{2}\right]
=p1(1p1)++pN(1pN)\displaystyle=p_{1}(1-p_{1})+\cdots+p_{N}(1-p_{N})
=1i=1Npi2<1\displaystyle=1-\sum_{i=1}^{N}p_{i}^{2}<1

Hence, (G1t)4et28\mathbb{P}\left(\|G_{1}\|\geq t\right)\leq 4e^{-\frac{t^{2}}{8}}. Note that this bound holds for any μ1\mu_{1}.

We let tα:=8ln(α/4)t_{\alpha}:=\sqrt{-8\ln{(\alpha/4)}} which ensures (G1tα)α\mathbb{P}\left(\|G_{1}\|\geq t_{\alpha}\right)\leq\alpha. Thus,

(X0[i=1kai]k1k2C(𝒳)tα)\displaystyle\mathbb{P}\left(X_{0}\geq\left[\sum_{i=1}^{k}\sqrt{a_{i}}\right]\frac{k-1}{k^{2}}C(\mathcal{X})\cdot t_{\alpha}\right)
\displaystyle\leq ([i=1kai]k1k2C(𝒳)G1[i=1kai]k1k2C(𝒳)tα)\displaystyle\mathbb{P}\left(\left[\sum_{i=1}^{k}\sqrt{a_{i}}\right]\frac{k-1}{k^{2}}C(\mathcal{X})\|G_{1}\|\geq\left[\sum_{i=1}^{k}\sqrt{a_{i}}\right]\frac{k-1}{k^{2}}C(\mathcal{X})\cdot t_{\alpha}\right)
=\displaystyle= (G1tα)α\displaystyle\mathbb{P}\left(\|G_{1}\|\geq t_{\alpha}\right)\leq\alpha

which implies cα,𝒟0[i=1kai]k1k2C(𝒳)tα=:cα(𝒳)c_{\alpha,\mathcal{D}_{0}}\leq\left[\sum_{i=1}^{k}\sqrt{a_{i}}\right]\frac{k-1}{k^{2}}C(\mathcal{X})\cdot t_{\alpha}=:c_{\alpha}(\mathcal{X}). ∎
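The bound of Proposition 2.6 is easy to compute. The sketch below (names ours; C(X) treated as a given constant) evaluates the cut-off bound for general sample-size proportions and checks the equal-sample-size closed form and the uniform 1/sqrt(8) bound:

```python
import math

def cutoff_bound(lams, C, alpha):
    """c_alpha(X) = (sum_i sqrt(a_i)) * (k - 1) / k^2 * C(X) * t_alpha,
    where a_i = prod_{j != i} lambda_j and t_alpha = sqrt(-8 ln(alpha / 4))."""
    k = len(lams)
    t_alpha = math.sqrt(-8.0 * math.log(alpha / 4.0))
    s = sum(math.sqrt(math.prod(lam for j, lam in enumerate(lams) if j != i))
            for i in range(k))
    return s * (k - 1) / k**2 * C * t_alpha

alpha, C = 0.05, 1.0
t_alpha = math.sqrt(-8.0 * math.log(alpha / 4.0))
for k in range(2, 8):
    b = cutoff_bound([1.0 / k] * k, C, alpha)
    # equal sample sizes: (k-1)/k^((k+1)/2) * C * t_alpha <= C * t_alpha / sqrt(8)
    print(k, b, (k - 1) / k ** ((k + 1) / 2) * C * t_alpha)
```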

2.4 Consistency and power

Refer to caption
Figure 2: Illustration of theoretical power guarantees for k=2k=2 case (Theorem 2.9 and Corollary 2.10) for real data settings described in Section 4.2.1. The test of H0H_{0} compares distributions of tumor size for two groups of patients (marked “alive, no metastases” and “dead” in Figure 7B). Left: The effect size (Wasserstein squared distance δ\delta) is fixed at the value calculated from the data (blue line in the right panel), and the lower bound on power is shown as a function of nn. Right: The sample size (nn) was fixed at a high value (the actual sample size in the dataset is even larger), and power is shown as a function of δ\delta.

We start with the basic requirement for tests comparing k2k\geq 2 measures - consistency under any fixed alternative - by showing that the power tends to 11 with increasing sample sizes.

The power of the test (12) is

Power(μ)\displaystyle\text{Power}(\mu) =𝒟a(ϕn,μ=1)\displaystyle=\mathbb{P}_{\mathcal{D}_{a}}\left(\phi_{n,\mu}=1\right) (19)
=𝒟a(ρn(MOT(μ^n)0)cα,𝒟0)\displaystyle=\mathbb{P}_{\mathcal{D}_{a}}\left(\rho_{n}(MOT(\widehat{\mu}_{n})-0)\geq c_{\alpha,\mathcal{D}_{0}}\right)
=𝒟a(ρn(MOT(μ^n)MOT(μ))cα,𝒟0ρnMOT(μ))\displaystyle=\mathbb{P}_{\mathcal{D}_{a}}\left(\rho_{n}(MOT(\widehat{\mu}_{n})-MOT(\mu))\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}MOT(\mu)\right)
=(Xacα,𝒟0ρnMOT(μ))\displaystyle=\mathbb{P}\left(X_{a}\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}MOT(\mu)\right)
Remark 2.7.

To obtain the expression for the power of the test in [72], MOT(μ)MOT(\mu) in the above expression can be replaced by W22(μ)W_{2}^{2}(\mu)555This choice corresponds to p=2p=2-Wasserstein distance in [72]. Other choices of p[1,)p\in[1,\infty) can be used by adjusting the details accordingly., with necessary adjustments for the k=2k=2 measures case.

Proposition 2.8 below proves consistency under fixed alternatives for any k2k\geq 2. The proof utilizes a Normal Lower Bound guaranteed by Theorem 2.2(c) to lower bound the power of the test.

Proposition 2.8 (Consistency under fixed alternatives, k2k\geq 2 measures).

Under any given alternative μ=(μ1,,μk)\mu=(\mu_{1},\ldots,\mu_{k}), k2k\geq 2, the test in (12) satisfies

limn(ϕn,μ=1)=1\lim_{n\to\infty}\mathbb{P}(\phi_{n,\mu}=1)=1
Proof.

Given an alternative μ\mu, consider the set of dual optimal solutions Φ\Phi^{*} (11) to MOT(μ)MOT(\mu). Choose any uΦu^{*}\in\Phi^{*} (treated as fixed after the choice), and consider the Normal Lower Bound NLBu𝒩(0,i=1kaiuiΣiui)NLB_{u^{*}}\sim\mathcal{N}\left(0,\sum_{i=1}^{k}a_{i}{u_{i}^{*}}^{\prime}\Sigma_{i}u_{i}^{*}\right) guaranteed by Theorem 2.2(c). Using (19), we have that

𝒟a(ϕn,μ=1)\displaystyle\mathbb{P}_{\mathcal{D}_{a}}\left(\phi_{n,\mu}=1\right) =(Xacα,𝒟0ρnMOT(μ))\displaystyle=\mathbb{P}\left(X_{a}\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}MOT(\mu)\right)
(NLBucα,𝒟0ρnMOT(μ))\displaystyle\geq\mathbb{P}\left(NLB_{u^{*}}\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}MOT(\mu)\right)

Since ρn\rho_{n}\to\infty while the test’s cut-off cα,𝒟0c_{\alpha,\mathcal{D}_{0}} and the true value MOT(μ)>0MOT(\mu)>0 do not change with nn, we get that cα,𝒟0ρnMOT(μ)c_{\alpha,\mathcal{D}_{0}}-\rho_{n}MOT(\mu)\to-\infty, and hence the Gaussian random variable NLBuNLB_{u^{*}} exceeds it with probability tending to 11 as nn\to\infty. ∎

Next, we prove uniform consistency of the test (12) over a broad class of alternatives. We start with a case of k=2k=2 measures in Theorem 2.9, which proves uniform consistency of the test proposed by [72]. We then move to general k2k\geq 2 (kk is allowed to change) in Proposition 2.11, concluding uniform consistency of tests of this type.

Our results are proved under the following assumption (B1) below, which is discussed in Remark 2.5. This assumption is expected to hold when measures are not too far from each other, and the power without this assumption is expected to be higher. Removing (B1) poses the difficulty of bounding dual solutions uu uniformly over alternative polytopes {μu=MOT(μ),Auc}\{\mu^{\prime}u=MOT(\mu),A^{\prime}u\leq c\}; we use such a bound to control alternative variances. Under (B1), the condition i=1kui=0\sum_{i=1}^{k}u_{i}=0 allows us to bound uu (and hence alternative variances) explicitly and uniformly over μ\mu.

Assumption (B1).

There exist dual solutions uu to MOT(μ)MOT(\mu) satisfying i=1kui=0\sum_{i=1}^{k}u_{i}=0.

For any fixed metric space 𝒳\mathcal{X} with NN points and any δ>0\delta>0, define the class of alternatives

(δ):={μ on 𝒳:W22(μ)δ}\mathcal{F}(\delta):=\{\mu\text{ on }\mathcal{X}:W^{2}_{2}(\mu)\geq\delta\}
Theorem 2.9 (Uniform consistency, k=2k=2 measures).

For k=2k=2, the test (12) (or, equivalently, the test based on Theorem 1(c) of [72]) satisfies

limninfμ(δ)(B1)(ϕn,μ=1)=1\lim_{n\to\infty}\inf_{\mu\in\mathcal{F}(\delta)\cap(B1)}\mathbb{P}(\phi_{n,\mu}=1)=1
Proof.

Fix test level α(0,1)\alpha\in(0,1). The goal is to show that, independently of the nature of the alternative μ(δ)(B1)\mu\in\mathcal{F}(\delta)\cap(B1), the probability that the test rejects H0H_{0} given by

(ϕn,μ=1)=(Xacα,𝒟0ρn14W22(μ))\mathbb{P}(\phi_{n,\mu}=1)=\mathbb{P}\left(X_{a}\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}\frac{1}{4}W_{2}^{2}(\mu)\right)

tends to 11 as nn\to\infty (the factor of 14\frac{1}{4} is due to MOT(μ)=14W22(μ)MOT(\mu)=\frac{1}{4}W_{2}^{2}(\mu) for k=2k=2).

Assume for simplicity of notation that the sample sizes are equal, i.e. n1==nkn_{1}=\cdots=n_{k}666The proof for unequal sample sizes would be similar by considering n:=mini{ni}n:=\min_{i}\{n_{i}\}.. Note first that the null cut-offs cα,𝒟0c_{\alpha,\mathcal{D}_{0}} can be bounded above uniformly over the class (δ)(B1)\mathcal{F}(\delta)\cap(B1) using Proposition 2.6, which with k=2k=2 gives the bound

cα,𝒟018C(𝒳)8ln(α/4):=cα(𝒳)c_{\alpha,\mathcal{D}_{0}}\leq\frac{1}{\sqrt{8}}C(\mathcal{X})\sqrt{-8\ln(\alpha/4)}:=c_{\alpha}(\mathcal{X})

Hence,

cα,𝒟0ρn14W22(μ)cα(𝒳)ρnδ4 for all μ(δ)(B1)c_{\alpha,\mathcal{D}_{0}}-\rho_{n}\frac{1}{4}W_{2}^{2}(\mu)\leq c_{\alpha}(\mathcal{X})-\rho_{n}\frac{\delta}{4}\text{ for all }\mu\in\mathcal{F}(\delta)\cap(B1)

This expression represents the “worst” (over (δ)(B1)\mathcal{F}(\delta)\cap(B1)) value that any given XaX_{a} must exceed to give the test power.

Next we show that any XaX_{a} will exceed this bound with probability tending to 11 as nn\to\infty. To this end, let μ(δ)(B1)\mu\in\mathcal{F}(\delta)\cap(B1), and consider the corresponding alternative distribution of XaX_{a}. Consider the dual solutions (u1a,u2a)({u_{1}}_{a}^{*},{u_{2}}_{a}^{*}) satisfying Assumption (B1) and the corresponding Normal Lower Bound

NLBua𝒩(0,12u1aΣ1u1a+12u2aΣ2u2a)NLB_{u_{a}^{*}}\sim\mathcal{N}\left(0,\frac{1}{2}{{u_{1}}_{a}^{*}}^{\prime}\Sigma_{1}{{u_{1}}_{a}^{*}}+\frac{1}{2}{{u_{2}}_{a}^{*}}^{\prime}\Sigma_{2}{{u_{2}}_{a}^{*}}\right)

guaranteed by Theorem 2.2(c). Note that its variance

12u1aΣ1u1a+12u2aΣ2u2a12\displaystyle\frac{1}{2}{{u_{1}}_{a}^{*}}^{\prime}\Sigma_{1}{u_{1}}_{a}^{*}+\frac{1}{2}{{u_{2}}_{a}^{*}}^{\prime}\Sigma_{2}{u_{2}}_{a}^{*}\leq\frac{1}{2} u1a2λmax(Σ1)+12u2a2λmax(Σ2)\displaystyle\|{u_{1}}_{a}^{*}\|^{2}\cdot\lambda_{max}(\Sigma_{1})+\frac{1}{2}\|{u_{2}}_{a}^{*}\|^{2}\cdot\lambda_{max}(\Sigma_{2})
=(B1)\displaystyle\underset{(B1)}{=} 12u1a2(λmax(Σ1)+λmax(Σ2))\displaystyle\frac{1}{2}\|{u_{1}}_{a}^{*}\|^{2}\left(\lambda_{max}(\Sigma_{1})+\lambda_{max}(\Sigma_{2})\right)

where λmax()\lambda_{max}(\cdot) denotes the largest eigenvalue of a matrix argument.

Using Theorem 1 of [8], the eigenvalues of Σ1,Σ2\Sigma_{1},\Sigma_{2} are upper bounded by entries of μ1\mu_{1} and μ2\mu_{2} as λmax(Σ1)maxi(μ1)i\lambda_{max}(\Sigma_{1})\leq\max_{i}({\mu_{1}})_{i} and λmax(Σ2)maxi(μ2)i\lambda_{max}(\Sigma_{2})\leq\max_{i}({\mu_{2}})_{i}, with the uniform upper bound of 11 for all instances μ(δ)\mu\in\mathcal{F}(\delta)777In fact, the bounds hold for any μ\mu and are not restricted to the class \mathcal{F} - see [8].. Hence,

λmax(Σ1)+λmax(Σ2)2\lambda_{max}(\Sigma_{1})+\lambda_{max}(\Sigma_{2})\leq 2

providing a uniform upper bound on the eigenvalue part. Further, u1a14C(𝒳)\|{u_{1}}_{a}^{*}\|\leq\frac{1}{4}C(\mathcal{X}) by Lemma 2.4. Thus, letting

σ2(𝒳):=(14C(𝒳))2\sigma^{2}(\mathcal{X}):=\left(\frac{1}{4}C(\mathcal{X})\right)^{2} (20)

uniformly bounds the variances for all NLBNLB’s chosen as above for XaX_{a}’s arising from μ(δ)(B1)\mu\in\mathcal{F}(\delta)\cap(B1).

The final step is to combine the above uniform bounding arguments to get the power 1\to 1. Note that for large enough nn, cα(𝒳)ρn14δ<0c_{\alpha}(\mathcal{X})-\rho_{n}\frac{1}{4}\delta<0, and also that cα(𝒳)ρn14δc_{\alpha}(\mathcal{X})-\rho_{n}\frac{1}{4}\delta\to-\infty as nn\to\infty. Hence, we have that, for any μ(δ)(B1)\mu\in\mathcal{F}(\delta)\cap(B1), for large enough nn depending only on δ\delta and 𝒳\mathcal{X} but not the nature of μ\mu,

(ϕn,μ=1)\displaystyle\mathbb{P}(\phi_{n,\mu}=1) =(Xacα,𝒟0ρn14W22(μ))\displaystyle=\mathbb{P}\left(X_{a}\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}\frac{1}{4}W_{2}^{2}(\mu)\right) (21)
(Xacα(𝒳)ρnδ4)\displaystyle\geq\mathbb{P}\left(X_{a}\geq c_{\alpha}(\mathcal{X})-\rho_{n}\frac{\delta}{4}\right)
(NLBuacα(𝒳)ρnδ4)\displaystyle\geq\mathbb{P}\left(NLB_{u^{*}_{a}}\geq c_{\alpha}(\mathcal{X})-\rho_{n}\frac{\delta}{4}\right)
(𝒩(0,σ2(𝒳))cα(𝒳)ρnδ4)\displaystyle\geq\mathbb{P}\left(\mathcal{N}\left(0,\sigma^{2}(\mathcal{X})\right)\geq c_{\alpha}(\mathcal{X})-\rho_{n}\frac{\delta}{4}\right)

which tends to 11 with cα(𝒳)ρnδ4c_{\alpha}(\mathcal{X})-\rho_{n}\frac{\delta}{4}\to-\infty as nn\to\infty. This gives the uniform lower bound on the power over (δ)(B1)\mathcal{F}(\delta)\cap(B1) proving the uniform consistency of the test over this broad class of alternatives. ∎

Using Theorem 2.9, one can provide a practical lower bound on the power as a function of the sample size nn and/or the effect size δ\delta when measures are supported on a known metric space 𝒳\mathcal{X}. Note that the dual bound C(𝒳)C(\mathcal{X}) is rather conservative; it can be replaced by a bound on u1\|u_{1}\| computed for the polytope {u1+u2=0,Auc}\{u_{1}+u_{2}=0,A^{\prime}u\leq c\} in every specific case. To find these bounds, one could solve linear programs {max/minui:u1+u2=0,Auc}\{\max/\min u^{i}:u_{1}+u_{2}=0,A^{\prime}u\leq c\} for each entry uiu^{i} of uu to estimate magnitudes of the dual variables over a given polytope. Denoting the resulting bound C~(𝒳)\widetilde{C}(\mathcal{X}), we have

Corollary 2.10 (Lower bound on the power of the two-sample test).

For any alternative μ(δ)(B1)\mu\in\mathcal{F}(\delta)\cap(B1), the two-sample test (12) (or the test based on Theorem 1(c) of [72]) with equal sample sizes nn has

power1Φ(4C~(𝒳)log(α/4)n2δC~(𝒳))\text{power}\geq 1-\Phi\left(\frac{4\widetilde{C}(\mathcal{X})\sqrt{-\log(\alpha/4)}-\frac{\sqrt{n}}{\sqrt{2}}\delta}{\widetilde{C}(\mathcal{X})}\right)

where Φ()\Phi(\cdot) denotes the cumulative distribution function of the standard Normal distribution.

Illustration of this bound on the real data from Section 4.2.1 is provided in Figure 2.
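The bound of Corollary 2.10 is straightforward to evaluate in practice. The sketch below (names ours; the constant C~(𝒳)\widetilde{C}(\mathcal{X}) is supplied by the user) reproduces curves of the kind shown in Figure 2:

```python
import math

def normal_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_lower_bound(n, delta, C_tilde, alpha=0.05):
    """Corollary 2.10: power >= 1 - Phi((4 C~ sqrt(-log(alpha/4)) - sqrt(n/2) delta) / C~)."""
    z = (4.0 * C_tilde * math.sqrt(-math.log(alpha / 4.0))
         - math.sqrt(n / 2.0) * delta) / C_tilde
    return 1.0 - normal_cdf(z)

# the bound increases in both the sample size n and the effect size delta
print([round(power_lower_bound(n, 0.1, 1.0), 4) for n in (10**2, 10**4, 10**6)])
```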

Using techniques similar to the proof of Theorem 2.9, it is possible to prove uniform consistency of the test (12) for alternatives μ(k):=(μ1,,μk)\mu(k):=(\mu_{1},\ldots,\mu_{k}) in the class

K(δ):={μ(k) on 𝒳:kK, MOT(μ)δ}\mathcal{F}_{K}(\delta):=\{\mu(k)\text{ on }\mathcal{X}:k\leq K,\text{ }MOT(\mu)\geq\delta\}

defined for any δ>0\delta>0 and any fixed 2K<2\leq K<\infty under the Assumption (B1). This gives consistency of the test (12) uniformly over alternatives with kKk\leq K measures:

Proposition 2.11 (Uniform consistency in the class K(δ)\mathcal{F}_{K}(\delta)).

The test in (12) satisfies

limninfμ(k)K(δ)(B1)(ϕn,μ=1)=1\lim_{n\to\infty}\inf_{\mu(k)\in\mathcal{F}_{K}(\delta)\cap(B1)}\mathbb{P}(\phi_{n,\mu}=1)=1
Proof.

Similarly to the proof of Theorem 2.9, the null cut-offs are uniformly bounded using Proposition 2.6 as cα,𝒟0cα(𝒳)c_{\alpha,\mathcal{D}_{0}}\leq c_{\alpha}(\mathcal{X}). Recall that (taking equal sample sizes nn for simplicity) for any 2kK2\leq k\leq K, ρn=nkk1\rho_{n}=\sqrt{\frac{n}{k^{k-1}}}, and hence cαρnMOT(μ(k))<0c_{\alpha}-\rho_{n}MOT(\mu(k))<0 for nn large enough to ensure this holds for all kKk\leq K.

Similarly to the proof of Theorem 2.9, each alternative random variable XaX_{a} arising from μ(k)\mu(k) has a Normal Lower Bound

NLBua𝒩(0,1kk1u1aΣ1u1a++1kk1ukaΣkuka)NLB_{u_{a}^{*}}\sim\mathcal{N}\left(0,\frac{1}{k^{k-1}}{{u_{1}}_{a}^{*}}^{\prime}\Sigma_{1}{{u_{1}}_{a}^{*}}+\cdots+\frac{1}{k^{k-1}}{{u_{k}}_{a}^{*}}^{\prime}\Sigma_{k}{{u_{k}}_{a}^{*}}\right)

with the variance bounded above by

σ2(𝒳,k):=1kk1(k1k2C(𝒳))2k=(k1)2kk+2(C(𝒳))2\sigma^{2}(\mathcal{X},k):=\frac{1}{k^{k-1}}\cdot\left(\frac{k-1}{k^{2}}C(\mathcal{X})\right)^{2}\cdot k=\frac{(k-1)^{2}}{k^{k+2}}\left(C(\mathcal{X})\right)^{2} (22)

which decreases with kk. Hence, it is bounded uniformly in kk by σ2(𝒳)\sigma^{2}(\mathcal{X}) from the k=2k=2 case (equation (20)), and the rest of the argument carries over in exactly the same way as in the proof of Theorem 2.9. ∎
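The algebra in (22) and the monotonicity claim used above are quick to check numerically (a sketch; C stands for an arbitrary constant C(𝒳)C(\mathcal{X})):

```python
def sigma2(k, C):
    """Variance bound (22): (k - 1)^2 / k^(k + 2) * C(X)^2."""
    return (k - 1) ** 2 / k ** (k + 2) * C**2

C = 2.0
vals = [sigma2(k, C) for k in range(2, 12)]
unsimplified = [(1 / k ** (k - 1)) * ((k - 1) / k**2 * C) ** 2 * k
                for k in range(2, 12)]
print(vals[0] == (C / 4) ** 2)                        # k = 2 recovers (20): True
print(all(a > b for a, b in zip(vals, vals[1:])))     # decreasing in k: True
```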

Remark 2.12 (Large kk and connection with [50]).

In practice, the upper bound on the number kk of measures in K(δ)\mathcal{F}_{K}(\delta) cannot be too large. This is due to the requirement ρn=nkk1\rho_{n}=\sqrt{\frac{n}{k^{k-1}}}\to\infty, which forces very large sample sizes that may not be practically plausible. While this limitation is natural for asymptotic kk-sample tests that work with fixed kk (as discussed, e.g., in [84]), recent results of [50] show that the permutation approach for certain test statistics allows kk and nn to grow simultaneously. More precisely, in the class of alternatives where only a few measures differ from the rest of the collection, the permutation kernel-based test is uniformly powerful if the population version of the test statistic δ\delta sufficiently exceeds logkn\sqrt{\frac{\log k}{n}}. Below we discuss sequences of alternatives whose MOTMOT population value does not decrease with kk, and hence the MOTMOT test statistic is expected to perform well in a permutation procedure. We leave a theoretical power analysis concerning the MOTMOT permutation test for future work.

The “clustered” alternatives are collections μ=(μ1,,μk)\mu=(\mu_{1},\ldots,\mu_{k}) that separate into CC groups (or “clusters”), with k/Ck/C measures in each cluster that are all the same. Such a situation might arise, for example, if an applied treatment causes CC different types of responses. For instance, for C=2C=2, define

k2:={μ on 𝒳:μ1==μk/2μk/2+1==μk}\displaystyle\mathcal{F}^{2}_{k}:=\{\mu\text{ on }\mathcal{X}:\mu_{1}=\cdots=\mu_{k/2}\neq\mu_{k/2+1}=\cdots=\mu_{k}\}

(see Figure 5B for illustration). The classes kC\mathcal{F}^{C}_{k} with C=3,,kC=3,\cdots,k are defined analogously (in each case, kk is assumed to be divisible by CC).

Lemma 2.13 (MOT values for “clustered” alternatives).

For μk2\mu\in\mathcal{F}^{2}_{k}, we have MOT(μ)=MOT(μ1,μk)=14W22(μ1,μk)MOT(\mu)=MOT(\mu_{1},\mu_{k})=\frac{1}{4}W_{2}^{2}(\mu_{1},\mu_{k}). More generally, for μkC\mu\in\mathcal{F}^{C}_{k}, MOT(μ)=MOT({μi}i=1C)MOT(\mu)=MOT(\{\mu_{i}\}_{i=1}^{C}), where {μi}i=1C\{\mu_{i}\}_{i=1}^{C} is a collection consisting of one measure from each cluster.

Proof is provided in Appendix A.5. Note that true MOTMOT values for “clustered” alternatives do not decrease with the increasing number of measures, and thus the MOTMOT value may serve as a suitable test statistic in permutation tests against alternatives in kC\mathcal{F}^{C}_{k}.

Finally, we comment on the MOTMOT values in a “sparse” alternative class when only one measure is different from the rest (alternatives of this type are considered in both [50] and [84]):

ks:={μ on 𝒳:μ1==μk1μk}\mathcal{F}^{s}_{k}:=\{\mu\text{ on }\mathcal{X}:\mu_{1}=\cdots=\mu_{k-1}\neq\mu_{k}\}

While MOTMOT values do decrease with kk in this sequence of alternatives, we can state precisely how the rate of this decrease is controlled (proved in Appendix A.6):

Lemma 2.14 (MOT values for “sparse” alternatives).

For μks\mu\in\mathcal{F}^{s}_{k}, MOT(μ)=k1k2W22(μ1,μk)MOT(\mu)=\frac{k-1}{k^{2}}W_{2}^{2}(\mu_{1},\mu_{k}).
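Lemmas 2.13 and 2.14 can be checked on a toy instance by solving the MOT linear program directly. The sketch below assumes program (8) minimizes the barycentric cost c(x_{i_1}, ..., x_{i_k}) = min_y (1/k) sum_l (x_{i_l} - y)^2 over multicouplings (the normalization under which MOT equals W_2^2/4 for k = 2), and verifies the sparse-alternative rate of Lemma 2.14:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def mot_value(mus, x):
    """MOT on a 1-D support x via the LP over multicouplings, assuming the
    barycentric cost c(x_{i_1}, ..., x_{i_k}) = min_y (1/k) sum_l (x_{i_l} - y)^2."""
    k, N = len(mus), len(x)
    tuples = list(itertools.product(range(N), repeat=k))
    # population variance of the tuple equals the barycentric cost
    cost = np.array([np.var([x[i] for i in t]) for t in tuples])
    A_eq, b_eq = [], []
    for axis in range(k):                 # marginal constraints, k * N rows
        for j in range(N):
            A_eq.append([1.0 if t[axis] == j else 0.0 for t in tuples])
            b_eq.append(mus[axis][j])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq))  # pi >= 0 by default
    return float(res.fun)

x = np.array([0.0, 1.0])
mu1, muk = np.array([0.5, 0.5]), np.array([0.3, 0.7])
w2sq = 0.2   # W_2^2(mu1, muk): move mass 0.2 across unit distance
for k in (2, 3, 4):
    sparse = [mu1] * (k - 1) + [muk]      # mu_1 = ... = mu_{k-1} != mu_k
    print(k, mot_value(sparse, x), (k - 1) / k**2 * w2sq)   # columns agree
```

The same `mot_value` helper applied to a clustered collection (k/2 copies of each of two measures) reproduces the k = 2 value, as in Lemma 2.13.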

Empirical performance of the asymptotic MOTMOT test (12) and the permutation MOTMOT test (Section 3.3) on “clustered” and “sparse” alternatives is illustrated in Figure 5.

3 Sampling from null and alternative distributions

Refer to caption
Figure 3: Illustration of bootstrap consistency (Section 3.1). A. Convergence of m-out-of-n and derivative bootstrap sampling distributions (both sampled based on the empirical μ^n\widehat{\mu}_{n}) to the true null distribution (sampled based on the true μ\mu) assessed by 1-Wasserstein distance. The m-out-of-n bootstrap schemes are shown with m:=npm:=n^{p}, p{0.3,0.5,0.7,0.9}p\in\{0.3,0.5,0.7,0.9\}. Observed convergence rate is fastest for the derivative bootstrap (blue), and is slower for the m-out-of-n bootstrap for larger values of mm. The data is based on the 3D Experiment dataset (Figure 5) by choosing bottom unit square in 2D2D (Left panel here) and a unit square in 3D3D (Right panel here). B. Sampling distributions corresponding to panel A, n=500n=500. C. Quantile-quantile plot illustrating closeness of the derivative bootstrap to the true distribution from panel B.
Refer to caption
Figure 4: Illustration of performance of the null upper bound UB0UB_{0} (Section 3.2). A. H0H_{0} testing using the upper bound UB0UB_{0} and the true null distribution X0X_{0} for all three real (or real-based) datasets considered in this paper (Section 4). Observe that UB0UB_{0} produces the same conclusion as X0X_{0} (rejection of H0H_{0}) on all considered datasets, while having much lower computational complexity (Table 2). The whole analysis took under 5 minutes on a standard laptop, with a negligible fraction of the time taken by UB0UB_{0}. B. Times to compute 500500 samples from UB0UB_{0} for (subsampled grid) MNIST digits data [24] for kk images on a standard laptop. All simulations were conducted on AMD Ryzen 5 7520U with 16 GB RAM.

3.1 Bootstrap: mm-out-of-nn and derivative

We recall that the limiting laws in Theorem 2.2 depend on the true measures μ\mu, similarly to the k=1,2k=1,2-sample cases considered in [72] and [46]. More precisely, the laws are of the form

ρn(f(μ^n)f(μ))in lawfμ(G)\rho_{n}\left(f(\widehat{\mu}_{n})-\ f(\mu)\right)\xrightarrow[]{\text{in law}}f^{\prime}_{\mu}(G)

where f:μMOT(μ)f:\mu\longrightarrow MOT(\mu) is the map with Hadamard directional derivative ff^{\prime} at μ\mu in the direction of G=in lawlimnρn(μ^nμ)G\overset{\text{in law}}{=}\lim_{n}\rho_{n}\left(\widehat{\mu}_{n}-\mu\right) and n:=(n1,,nk)n:=(n_{1},\ldots,n_{k}). The classical bootstrap estimator of fμ(G)f^{\prime}_{\mu}(G) in the sense of [27] would be constructed by sampling from the conditional (given the data) law of ρn(f(μ^n)f(μ^n))\rho_{n}\left(f(\hat{\mu}_{n}^{*})-f(\widehat{\mu}_{n})\right), where μ^n\hat{\mu}_{n}^{*} is obtained by taking nn samples from the vector of empirical measures μ^n\widehat{\mu}_{n}. By Theorem 3.1 of [29], this estimator is not consistent when fμ(G)f^{\prime}_{\mu}(G) is non-Gaussian, which is always the case under H0H_{0} and frequently under HaH_{a}.

In place of the inconsistent classical bootstrap, [29] proposes a consistent bootstrap procedure to estimate the law of fμ(G)f^{\prime}_{\mu}(G). The approach of [29] is to ensure consistency of an estimator fn()f^{\prime}_{n}(\cdot) of the map fμ()f^{\prime}_{\mu}(\cdot) uniformly in the argument ()(\cdot), assuming that the law of the argument ()(\cdot) is estimated by (some) consistent bootstrap scheme. Two different choices for fn()f^{\prime}_{n}(\cdot) then lead to bootstrap schemes frequently termed m-out-of-n and derivative bootstrap methods, respectively (see Section 1 of [29] for historical notes on these methods).

The work of [72] outlines the consistency results for these two schemes in the k=1,2k=1,2-sample cases. For completeness, we describe these schemes in the general case of k2k\geq 2 (proved in Appendix A.7):

Proposition 3.1 (Consistency of bootstrap from [29]).

The results of parts (a) and (b) concern two estimators fn()f^{\prime}_{n}(\cdot) of the map fμ()f_{\mu}^{\prime}(\cdot).

  • (a)

    fn:hfn(h)f^{\prime}_{n}:h\longrightarrow f_{n}^{\prime}(h) given by

    fn(h):=f(μ^n+εnh)f(μ^n)εnf^{\prime}_{n}(h):=\frac{f(\widehat{\mu}_{n}+\varepsilon_{n}h)-f(\widehat{\mu}_{n})}{\varepsilon_{n}}

    composed with the estimator of GG given by

    G^:=ρm(μ^(m)μ^n)\hat{G}^{*}:=\rho_{m}\left(\widehat{\mu}^{*}_{(m)}-\widehat{\mu}_{n}\right)

    results in a consistent bootstrap estimator fn(G^)f^{\prime}_{n}(\hat{G}^{*}) of fμ(G)f_{\mu}^{\prime}(G) under both H0H_{0} and HaH_{a}. Here, μ^(m)\hat{\mu}^{*}_{(m)} is obtained by resampling m out of n observations from μ^n\widehat{\mu}_{n} with m:=nm:=\sqrt{n}, and εn0\varepsilon_{n}\to 0 such that ρnεn\rho_{n}\varepsilon_{n}\to\infty.

    Note: The choice εn:=1ρm\varepsilon_{n}:=\frac{1}{\rho_{m}} leads to

    fn(G^)=ρm(f(μ^(m))f(μ^n))f^{\prime}_{n}(\hat{G}^{*})=\rho_{m}\left(f(\widehat{\mu}^{*}_{(m)})-f(\widehat{\mu}_{n})\right)

    which is frequently termed the m-out-of-n bootstrap estimator of fμ(G)f_{\mu}^{\prime}(G) and is considered in [72] and [46] for the Wasserstein distance map ff.

  • (b)

    fn:hfμ^n(h)f_{n}^{\prime}:h\longrightarrow f_{\hat{\mu}_{n}}^{\prime}(h) given by

f^{\prime}_{\hat{\mu}_{n}}(h):=\max_{u}\sum_{i=1}^{k}\langle u_{i},h_{i}\rangle\quad\text{s.t.}\quad\langle u,\hat{\mu}_{n}\rangle=MOT(\hat{\mu}_{n}),\ A^{\prime}u\leq c

    composed with the estimator of GG given by

    G^:=(a^1G^1,,a^kG^k)\hat{G}^{*}:=\left(\sqrt{\hat{a}_{1}}\hat{G}_{1}^{*},\ldots,\sqrt{\hat{a}_{k}}\hat{G}^{*}_{k}\right)

    where each G^i\hat{G}_{i}^{*} is N(0,diag(μ^1)μ^1(μ^1))N\left(0,\text{diag}(\hat{\mu}^{*}_{1})-\hat{\mu}_{1}^{*}(\hat{\mu}_{1}^{*})^{\prime}\right) results in a consistent bootstrap estimator fn(G^)f_{n}^{\prime}(\hat{G}^{*}) of fμ(G)f_{\mu}^{\prime}(G) under H0H_{0}.

    Note: This estimator is frequently termed the derivative bootstrap estimator of fμ(G)f^{\prime}_{\mu}(G) and is considered in [72] for the Wasserstein distance map ff.
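The m-out-of-n scheme of part (a) can be illustrated numerically. The sketch below applies it to a toy directionally differentiable functional f(\mu)=\max_i \mu_i on a finite support (an assumption for illustration only; it is not the MOT map), with a single measure so that the rate is \rho_m=\sqrt{m}:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy directionally differentiable functional (an illustration, not MOT)
f = lambda mu: mu.max()

def m_out_of_n_replicate(mu_hat, m):
    """One draw of sqrt(m) * (f(mu*_(m)) - f(mu_hat)) by m-of-n resampling."""
    mu_star = rng.multinomial(m, mu_hat) / m
    return np.sqrt(m) * (f(mu_star) - f(mu_hat))

n = 10_000
mu_true = np.array([0.5, 0.5, 0.0])        # a tie, where f is non-smooth
mu_hat = rng.multinomial(n, mu_true) / n
m = int(np.sqrt(n))                        # m = sqrt(n), as in part (a)
reps = np.array([m_out_of_n_replicate(mu_hat, m) for _ in range(500)])
print(reps.mean(), reps.std())
```

The resampling size m grows slower than n, which is exactly what restores consistency at the non-smooth point (the tie) where the classical bootstrap fails.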

Pseudocodes 1 and 2 describe sampling from the limiting laws of MOT under H_{0} and H_{a} using the bootstrap schemes in Proposition 3.1.

Pseudocode 1 (m-out-of-n bootstrap to obtain one sample from H0H_{0} or HaH_{a} limiting law).

Given the data μ^n\widehat{\mu}_{n},

  1. 1.

    Let mi:=nipm_{i}:=n_{i}^{p}, p(0,1)p\in(0,1).

  2. 2.

    For each i=1,,ki=1,\ldots,k,
    sample μ^iMultinomial(mi,μ^1)\widehat{\mu}_{i}^{*}\sim\text{Multinomial}(m_{i},\widehat{\mu}_{1}) under H0H_{0}, or
    sample μ^iMultinomial(mi,μ^i)\widehat{\mu}_{i}^{*}\sim\text{Multinomial}(m_{i},\widehat{\mu}_{i}) under HaH_{a}.

  3. 3.

    Compute MOT(μ^1,,μ^k)MOT(\widehat{\mu}_{1}^{*},\ldots,\widehat{\mu}_{k}^{*}) by solving the program (8).

  4. 4.

    Report ρmMOT(μ^1,,μ^k)\rho_{m}MOT(\widehat{\mu}_{1}^{*},\ldots,\widehat{\mu}_{k}^{*}) under H0H_{0} or
    ρm(MOT(μ^1,,μ^k)MOT(μ^n))\rho_{m}\left(MOT(\widehat{\mu}_{1}^{*},\ldots,\widehat{\mu}_{k}^{*})-MOT(\widehat{\mu}_{n})\right) under HaH_{a},
    where ρm=m1mk(m1++mk)k1\rho_{m}=\frac{\sqrt{m_{1}\cdot\ldots\cdot m_{k}}}{\left(\sqrt{m_{1}+\ldots+m_{k}}\right)^{k-1}}.
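The steps above can be sketched for a small instance. The sketch below assumes a brute-force LP solver for program (8) and a stand-in pairwise quadratic cost; the support, sample sizes, and p = 1/2 are illustrative:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def mot_value(mus, x):
    """Optimal value of the primal MOT LP (8), solved by brute force:
    N^k coupling variables and kN marginal equality constraints."""
    k, N = len(mus), len(x)
    tuples = list(product(range(N), repeat=k))
    # stand-in pairwise quadratic cost on tuples (an assumption here)
    c = np.array([sum((x[t[i]] - x[t[j]]) ** 2
                      for i in range(k) for j in range(i + 1, k))
                  for t in tuples])
    A_eq = np.zeros((k * N, len(tuples)))
    for col, t in enumerate(tuples):
        for i in range(k):
            A_eq[i * N + t[i], col] = 1.0   # i-th marginal of the coupling
    b_eq = np.concatenate(mus)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

rng = np.random.default_rng(1)
x = np.array([0.0, 1.0, 2.0])              # support points in R^1
mu1 = np.array([0.5, 0.3, 0.2])            # pooled estimate under H_0
ns, p = [400, 500, 600], 0.5               # sample sizes n_i and m_i = n_i^p
ms = [int(n ** p) for n in ns]
# Steps 1-2: resample m_i observations from mu1 (the H_0 case)
mus_star = [rng.multinomial(m, mu1) / m for m in ms]
# Steps 3-4: MOT value of the resampled measures, rescaled by rho_m
rho_m = np.sqrt(np.prod(ms)) / np.sqrt(sum(ms)) ** (len(ms) - 1)
sample_H0 = rho_m * mot_value(mus_star, x)
print(sample_H0)
```

Repeating the draw many times approximates the H_{0} limiting law; for realistic N and k the brute-force solver should be replaced by the polynomial algorithm of [2].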

Pseudocode 2 (Derivative bootstrap to obtain one sample from H0H_{0} limiting law).

Given the data μ^n\widehat{\mu}_{n},

  1. 1.

    Sample μ^1Multinomial(n1,μ^1)\hat{\mu}_{1}^{*}\sim\text{Multinomial}(n_{1},\widehat{\mu}_{1}).

  2. 2.

    For each i=1,,ki=1,\ldots,k,
    sample G^iNormal(0,diag(μ^1)μ^1(μ^1))\hat{G}_{i}^{*}\sim\text{Normal}(0,\text{diag}(\widehat{\mu}_{1}^{*})-\hat{\mu}_{1}^{*}(\hat{\mu}_{1}^{*})^{\prime}).
    Let ai=jiλ^ja_{i}=\prod_{j\neq i}\hat{\lambda}_{j}, where λ^j=njn1++nk\hat{\lambda}_{j}=\frac{n_{j}}{n_{1}+\ldots+n_{k}}.

  3. 3.

    Solve the program (15) with {G^i}i=1k\{\hat{G}_{i}^{*}\}_{i=1}^{k} in place of {Gi}i=1k\{G_{i}\}_{i=1}^{k}.
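Steps 1 and 2 above can be sketched as follows (the pooled measure and sample sizes are illustrative; Step 3 would then solve the linear program (15) with the resulting \sqrt{a_i}\,\hat{G}_i^* as coefficients):

```python
import numpy as np

rng = np.random.default_rng(2)
mu1_hat = np.array([0.4, 0.4, 0.2])        # pooled empirical measure mu_hat_1
ns = np.array([300, 400, 500])             # sample sizes n_1, ..., n_k
k, N = len(ns), len(mu1_hat)

# Step 1: one multinomial resample of mu_hat_1
mu_star = rng.multinomial(ns[0], mu1_hat) / ns[0]

# Step 2: Gaussians with the multinomial covariance, and the weights a_i
Sigma = np.diag(mu_star) - np.outer(mu_star, mu_star)
G_star = rng.multivariate_normal(np.zeros(N), Sigma, size=k)
lam = ns / ns.sum()
a = np.array([np.prod(np.delete(lam, i)) for i in range(k)])
print(a, G_star.shape)
```

Note that the multinomial covariance is singular along the all-ones direction, so each sampled \hat{G}_i^* sums to zero, matching the structure of the dual feasible set.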

3.1.1 Computational complexity of bootstrap

For the m-out-of-n bootstrap, the computation in Step 3 requires solving the primal MOT program (8), which is a linear program with N^{k} variables, i.e., exponentially many in the cardinality N of the support space \mathcal{X}. By strong duality, the optimal value of the primal MOT program is the same as that of the dual MOT program (10), which is a linear program

maxuμu:Auc\max_{u}\mu^{\prime}u:A^{\prime}u\leq c

with polynomially many variables but exponentially many constraints. It is well known that a linear program with exponentially many constraints can be solved in polynomial time via the ellipsoid method provided its feasible set \{A^{\prime}u\leq c\} admits a polynomial time computable separation oracle (see, e.g., Section 8.5 of [5]). Such an oracle is a procedure which accepts a proposal point u\in\mathbb{R}^{kN} and either confirms that u\in\{A^{\prime}u\leq c\}, or outputs a violated constraint. A polynomial separation oracle for the dual MOT problem is found in [2] (Proposition 12), resulting in a polynomial time algorithm to solve the MOT problem with quadratic cost (3) (Theorem 2 of [2]). (Besides the optimal value, which agrees for the primal and the dual, [2] are also interested in a primal vertex solution; their Proposition 11 discusses how to obtain a primal solution in polynomial time.)
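The oracle idea can be illustrated on a toy feasible set of pairwise difference constraints u_j - u_l \leq c_{jl} (a stand-in for the exponentially many MOT constraints, which the oracle of [2] handles without enumeration):

```python
import numpy as np

def separation_oracle(u, cost):
    """Toy separation oracle for the difference constraints
    u_j - u_l <= cost[j, l]: certify feasibility or return one
    violated constraint (the real MOT oracle of [2] does this for
    the N^k tuple constraints without enumerating them)."""
    N = len(u)
    for j in range(N):
        for l in range(N):
            if j != l and u[j] - u[l] > cost[j, l] + 1e-12:
                return (j, l)              # index of a violated constraint
    return None                            # u is feasible

x = np.array([0.0, 1.0, 3.0])
cost = (x[:, None] - x[None, :]) ** 2      # stand-in constraint bounds
print(separation_oracle(np.zeros(3), cost))                 # -> None
print(separation_oracle(np.array([5.0, 0.0, 0.0]), cost))   # -> (0, 1)
```

Given such an oracle, the ellipsoid method queries it at each iterate and cuts the feasible region with any violated constraint it returns, never touching the full constraint list.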

For the derivative bootstrap, the computation in Step 3 of Pseudocode 2 requires solving program (15), which is similar to the dual MOT linear program and is given by

maxugu:Auc, Bu=0\max_{u}g^{\prime}u:A^{\prime}u\leq c,\text{ }Bu=0

where B is the matrix of the linear map u\longrightarrow\sum_{i=1}^{k}u_{i} and g is a realization of G^{*} (the nature of the coefficient vector g does not affect the complexity). Note that B contributes only polynomially many constraints (namely, N of them); hence, given a polynomial separation oracle for \{A^{\prime}u\leq c\}, the remaining constraints \{Bu=0\} can be checked in polynomial time, giving the following theoretical complexity result for the derivative bootstrap for MOT:

Lemma 3.2 (Polynomial complexity of derivative bootstrap).

The derivative bootstrap linear program (Step 3 in Pseudocode 2) has computational complexity \text{poly}(N,k,\log U), where \log U is an upper bound on the number of bits of precision used to represent the coefficient vector G^{*}.

Proof details missing from the above discussion are provided in Appendix A.8.

Computational complexity of bootstrap methods is summarized in Table 2, and consistency of bootstrap is illustrated in Figure 3.

3.2 Fast approximation of the null distribution by UB0{UB}_{0}

A fast alternative to bootstrap sampling from the null distribution is to utilize the stochastic upper bound UB_{0} on the null random variable X_{0} provided by equation (16) in Theorem 2.2(b). As the proof of the theorem shows, UB_{0} can be constructed to have kN(N-1) constraints in place of the N^{k} constraints in X_{0} by exploiting the constraint structure under H_{0} (see Appendix A.2 for details). With only quadratically many constraints, the linear program for UB_{0} with any realization of the coefficient vector G can be solved quickly by modern linear programming solvers.

Sampling from UB0UB_{0} can be viewed as obtaining an upper bound on the derivative bootstrap sampling distribution of X0X_{0}, via the following algorithm:

Pseudocode 3 (Sampling UB0UB_{0}).

Given the data μ^n\widehat{\mu}_{n},

  1. 1.

    Sample μ^1Multinomial(n1,μ^1)\hat{\mu}_{1}^{*}\sim\text{Multinomial}(n_{1},\widehat{\mu}_{1}).

  2. 2.

    For each i=1,,ki=1,\ldots,k,
    sample G^iNormal(0,diag(μ^1)μ^1(μ^1))\hat{G}_{i}^{*}\sim\text{Normal}(0,\text{diag}(\widehat{\mu}_{1}^{*})-\hat{\mu}_{1}^{*}(\hat{\mu}_{1}^{*})^{\prime}).
    Let ai=jiλ^ja_{i}=\prod_{j\neq i}\hat{\lambda}_{j}, where λ^j=njn1++nk\hat{\lambda}_{j}=\frac{n_{j}}{n_{1}+\ldots+n_{k}}.

  3. 3.

    Solve the program (16) with {G^i}i=1k\{\hat{G}_{i}^{*}\}_{i=1}^{k} in place of {Gi}i=1k\{G_{i}\}_{i=1}^{k}.

Computational complexity of sampling UB_{0} is included in Table 2. The performance of UB_{0} for testing H_{0} on all real datasets considered in the paper is illustrated in Figure 4. Note that the low computational complexity of UB_{0} makes it possible to approximate the H_{0} distribution on large datasets within a few minutes on a standard laptop.

3.3 Permutation approach

An alternative to the asymptotic test (12) is a permutation test. The permutation approach in k-sample testing is frequently used when the asymptotic distribution is difficult to sample from due to, for example, an infinite number of parameters and/or difficulties in their estimation (the cases of [64], [47], and [50]). Moreover, permutation procedures are applicable when the sample sizes are small (and hence the asymptotic distribution may not be valid), giving exact level \alpha permutation tests (Section 15.2 of [27]).

A permutation test accepts a set of data points with group labels, randomly permutes the labels, and computes the test statistic of interest on the permuted data to compare with the original one. The number of random permutations R is usually taken to be between 99 and 999 out of the total (typically very large) number of possible permutations (p. 158 of [17]).

The MOTMOT permutation test is described in Pseudocode 4.

Pseudocode 4 (MOT based permutation test).

Given the data μ^n=(μ^1,,μ^k)\widehat{\mu}_{n}=(\widehat{\mu}_{1},\cdots,\widehat{\mu}_{k}):

  1. 1.

    Compute MOT(μ^n)MOT(\widehat{\mu}_{n}).

  2. 2.

Convert \widehat{\mu}_{n} to a matrix of support points, where each support point belongs to the i-th group, i=1,\ldots,k, and is repeated according to the counts in \widehat{\mu}_{i}. Collect the group labels in the vector v.

  3. 3.

    For each r=1,,Rr=1,\ldots,R, sample random permutation πr(v)\pi_{r}(v), permute support points according to πr(v)\pi_{r}(v), and construct measures μ^nr\widehat{\mu}_{n}^{r} based on the frequencies of support points in new groups. Compute permuted test statistic MOT(μ^nr)MOT(\widehat{\mu}_{n}^{r}) by solving the program (8).

  4. 4.

    Compute approximate p-value (p. 158 of [17]) as

\hat{p}:=\frac{1+\sum_{r=1}^{R}\mathbbm{1}_{\{MOT(\widehat{\mu}_{n}^{r})\geq MOT(\widehat{\mu}_{n})\}}}{1+R}

Computation of the permuted test statistic in Step 3 requires solving the MOT program (8), similarly to the case of the m-out-of-n bootstrap in Pseudocode 1. Hence, both algorithms have the same complexity, as shown in Table 2. Empirical performance of the permutation test of Pseudocode 4 is illustrated in Figure 5.
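Pseudocode 4 can be sketched with a generic scalar test statistic in place of the MOT value (the statistic below, the spread of the group means, is a placeholder for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)

def permutation_pvalue(samples, statistic, R=199):
    """Label-permutation test (Pseudocode 4) for k groups of 1-D data;
    `statistic` maps a list of k samples to a scalar."""
    pooled = np.concatenate(samples)
    sizes = [len(s) for s in samples]
    t_obs = statistic(samples)
    count = 0
    for _ in range(R):
        perm = rng.permutation(pooled)               # permute group labels
        groups = np.split(perm, np.cumsum(sizes)[:-1])
        if statistic(groups) >= t_obs:
            count += 1
    return (1 + count) / (1 + R)                     # p-value as in Step 4

# placeholder statistic: spread of the group means (not the MOT value)
stat = lambda gs: np.ptp([g.mean() for g in gs])

same = [rng.normal(0, 1, 100) for _ in range(3)]        # H_0 holds
diff = [rng.normal(0, 1, 100), rng.normal(0, 1, 100),
        rng.normal(2, 1, 100)]                          # H_a holds
p_same, p_diff = permutation_pvalue(same, stat), permutation_pvalue(diff, stat)
print(p_same, p_diff)
```

Swapping the placeholder statistic for the MOT value of the relabeled measures recovers Pseudocode 4, at the cost of one MOT solve per permutation.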

Table 2: Complexity of computing a single sample for the permutation null distribution, X_0 with the m-out-of-n bootstrap, X_0 with the derivative bootstrap, and UB_0 given by Theorem 2.2(b). Recall that k is the number of measures \mu_1,\ldots,\mu_k, and N is the cardinality of the underlying metric space \mathcal{X}=\{x_1,\ldots,x_N\}\subset\mathbb{R}^d. The “algorithm (theory)” row reports the algorithm used to prove the theoretical complexity, while the “algorithm (practice)” row reports the algorithms implemented in this paper and available for use. The algorithm of Altschuler & Boix-Adserà [2] is abbreviated as AB-A and assumes fixed d.

distribution to sample: permutation or X_0 (m-out-of-n) | X_0 (derivative) | UB_0
hypothesis: null, alternative | null | null
optimization program: equation (8) | equation (15) | equation (16)
# variables: N^k | kN | (k-1)N
# equality constraints: kN | N | none
# inequality constraints: none | N^k | (k-1)N(N-1)
theoretical complexity: poly(N, k, log U) | poly(N, k, log U) | poly(N, k, log U)
reference: Theorem 2 of [2] | Lemma 3.2 here | Theorem 6 of [2]
algorithm (theory): AB-A [2] | AB-A [2] | Ellipsoid
algorithm (practice): AB-A [2], Simplex, Interior point | AB-A [2], Simplex, Interior point | Simplex, Interior point
software: Github for [2] (d=2), GUROBI [37], RSymphony [41] | GUROBI [37], RSymphony [41] | GUROBI [37], RSymphony [41]

4 Applications

Sections 4.1.1 and 4.1.2 illustrate basic properties of MOT based inference on synthetic datasets with measures supported on finite subsets of \mathbb{R}^{d}, d=1,2,3. The structures of these datasets emulate potential issues in real data settings while providing convenient models to demonstrate the advantages of MOT based procedures over existing methods (Figures 5 and 6).

Sections 4.2.1 and 4.2.2 illustrate how MOT based inference can be used in real biomedical settings where the measures of interest are naturally finitely supported on a given metric space. We use Surveillance, Epidemiology, and End Results (SEER) (https://seer.cancer.gov), a large database on cancers in the United States routinely used in the biomedical literature. A detailed description of the data used, including information on the sample sizes, is provided in the Supplementary Material (Kravtsova (2024)).

4.1 Illustrations on synthetic data

4.1.1 3D Experiment dataset: testing H0H_{0}

We construct the dataset 3D Experiment, which aims to emulate the experimental setting of counting the number of induced cells in response to a treatment. The model organism frequently used in such experiments is the nematode worm C. elegans. The goal of the experiment is to determine whether a certain genetic modification interferes with normal organ development, resulting in abnormal cell behavior observed in diseases [16, 83].

The abnormality is measured by the number of induced cells that emerge after genetic modification. There can be 0, 1, or 2 induced cells in each worm; a total of n worms are examined, giving a measure supported on \{0,1,2\}. (In the real experiment, the measure is supported on \{0,1,2,3\}, but we simplify it for the purpose of this synthetic dataset.) When constructing the 3D Experiment dataset, we assume that counting is simultaneously performed in two more sites of an animal, where the number of induced cells can be \{0,1\}. This results in the 3-dimensional support \{0,1\}\times\{0,1\}\times\{0,1,2\} with 2\times 2\times 3=12 points (Figure 5A).

Recent results in the biological literature report differences in the number of induced cells between the worm species C. elegans and C. briggsae [18, 55]. Inspired by these results, we construct four measures \{\mu_1,\mu_2,\mu_3,\mu_4\} in the 3D Experiment dataset: the first two correspond to two C. briggsae worm strains, and the last two to two C. elegans worm strains (Figure 5A). We use this setup to demonstrate the power of the MOT asymptotic and permutation tests for testing H_0 (Figure 5B).

Figure 5: MOT based testing of H_0 and empirical power on the 3D Experiment synthetic data. A. Structure and plot of the true measures (“clustered” alternative is shown). B. Left: Illustration of the “clustered” and “sparse” alternatives discussed in Section 2.4. Right: Empirical power against the “clustered” (\mu_1=\mu_2\neq\mu_3=\mu_4) and “sparse” (\mu_1=\mu_2=\mu_3\neq\mu_4) alternatives. The MOT asymptotic test (12) and MOT permutation test (Section 3.3) are compared with the following four tests: the distance based test DISCO from [64] (gray), kernel based tests from [50] with Gaussian (solid black) and energy distance (dashed black) kernels, and the test based on empirical characteristic functions from [47] (orange). Sample sizes are taken as n=300 and n=500 for the two classes, respectively.

4.1.2 Anderes et al. 2016 dataset: HaH_{a} inference

We consider the data constructed by [3], which demonstrates the properties of a barycenter of finitely supported measures. Each measure represents a demand distribution (for some hypothetical product) over nine locations on the map (these locations are cities in California, and they constitute a finite support \mathcal{X}=\{x_1,\ldots,x_9\}\subset\mathbb{R}^2 for the demand distributions). There are 12 measures in [3], each giving the demand distribution during a particular month (Figure 6A,B).

We use these measures as the ground truth and construct empirical measures by sampling multinomial counts based on this truth. We note that all 12 underlying true measures are different, i.e., H_a holds. Moreover, the differences are more drastic between months with different temperatures, since [3] allows the temperature to influence the demand. Our inference under H_a confirms this claim by examining sub-collections of measures with months from the same season versus months from different seasons and comparing Confidence Regions for MOT under these settings (Figure 6C).

Figure 6: MOT based inference under H_a for the Anderes et al. 2016 dataset from [3] described in Section 4.1.2. A. Schematic of the state of California map with nine cities supporting the 12 measures corresponding to monthly demand distributions (four of them shown in B). B. Illustration of the estimated MOT plan (multicoupling; five highest probability non-identity tuples shown) coupling the support points of four measures. C. 99% Confidence Regions for the MOT cost under H_a for 4-measure collections. Note the overlap of similar collections and no overlap between different collections, as well as higher MOT costs for collections whose monthly demands are more different.

4.2 Applications to real data

4.2.1 SEER Tumor size dataset: testing H0H_{0}

An important question in cancer studies is to determine which factors are associated with the development of metastases. In the case of breast cancer, [73] showed that metastatic risk increases with tumor size in intermediate and some of the large tumors (\geq 1 cm), but does not increase in small tumors (<1 cm). The study used the SEER database and considered a correlation between tumor size and prevalence of metastases. Here we confirm these results via k-sample testing, as described below. Further, we observe a similar trend in three more cancer types: prostate cancer, lung cancer in males, and lung cancer in females (Figure 7).

We use the SEER database to extract the data on distributions of tumor size and term this dataset SEER Tumor size. We consider three groups of patients with different disease progression status, giving k=3k=3 measures: patients with no metastases present at diagnosis and alive at the end of the study (μ1\mu_{1}), patients with metastases at diagnosis and alive at the end of the study (μ2\mu_{2}), and patients dead by the end of the study with death caused by the diagnosed cancer (μ3\mu_{3}).

First, we test H_0 for size distributions in the small tumor range (<1 cm); we find no difference between groups, which holds for breast, prostate, and both lung cancer types (Figure 7A). In contrast, the groups are found different for tumors in the larger range (1-9 cm), which again holds for all considered cancer types (Figure 7B). The analysis confirms the significance of metastatic status for the tumor size distribution in intermediate/large tumors, but not in small tumors.

Figure 7: Application of MOT based testing of H_0 to the comparison of tumor size distributions from the SEER Tumor size dataset described in Section 4.2.1. Size distributions are compared for three groups of patients with different metastatic/survival characteristics (alive at the end of the study with no metastases at diagnosis, alive with metastases present at diagnosis, and dead at the end of the study). All sample sizes are very large, except in the prostate cancer case (sample sizes are provided in the Supplementary Material (Kravtsova (2024))). A. Comparison of sizes in the smaller tumor size range shows no difference (at the 1% level) among the three groups of patients. B. Comparison of sizes for larger tumors shows a significant difference (at the 1% level) among the three groups of patients. Note: For prostate cancer with small tumor sizes, enough data was available only for two groups of patients rather than three, so we used the Sommerfeld-Munk test [72] in place of the MOT test and obtained a similar conclusion.

4.2.2 SEER Year of diagnosis dataset: HaH_{a} inference

Our final example concerns potential differences in the distributions of characteristics of patients diagnosed at different times. Such differences are discussed in the case of early stage lung cancer, possibly due to improvements in diagnostic technologies [58]. Here we compare these distributions in the framework of k-sample inference to confirm the differences in diagnosis results over time, and show that the trend is similar in both male and female patients.

We use the SEER database to extract joint distributions of tumor size and patients’ age for lung cancer in males and females and term this dataset SEER Year of diagnosis. We consider four time periods, giving k=4 measures: 2004 - 2006 (\mu_1), 2009 - 2011 (\mu_2), 2014 - 2016 (\mu_3), and 2019 - 2020 (\mu_4). The distributions are found different by the MOT test in both the male and female lung cancer cases, and we are interested in comparing the differences between the male and female collections of measures (Figure 8).

We observe visually that the differences between measures are of a similar nature in the male and female cases: later diagnostic years appear to have more small size tumors diagnosed in comparison to earlier years (Figure 8A). The similarities between the male and female cases are reflected in overlapping Confidence Regions. The reported MOT plan also highlights this finding by coupling the small size support points from the later periods with the larger size support points from the earlier period for patients of the same age (Figure 8B).

Figure 8: Application of MOT based inference under H_a to the comparison of bivariate (age/tumor size) distributions from the SEER Year of diagnosis dataset described in Section 4.2.2. Bivariate distributions are compared for four periods of diagnosis: 2004 - 2006, 2009 - 2011, 2014 - 2016, and 2019 - 2020. In both male and female cases, the age/tumor size distributions are different between the four periods of diagnosis (i.e., H_0 is rejected). A. Confidence Regions for the MOT cost for the male and female cases overlap, suggesting that differences between distributions are of similar magnitude. B. Illustration of the estimated MOT plan (multicoupling) for the male case (non-identity tuples with positive multicoupling mass are shown).

5 Discussion and Conclusions

5.1 Summary of results

In this paper, we proposed an Optimal Transport approach to k-sample inference. We used the optimal value of the Multimarginal Optimal Transport program (MOT) to quantify the difference in a given collection of k measures supported on finite subsets of \mathbb{R}^d, d\geq 1.

We derived limit laws for the empirical version of MOT under the assumptions of H_0 (all k measures are the same) and H_a (some measures may differ). We established that the limit cannot be Gaussian under H_0, and provided sufficient conditions for the limit to be Gaussian under H_a. Based on these results, we derived an expression for the power function of the test of H_0; using this function, we proved consistency of the test against any fixed alternative and uniform consistency over certain broad classes of alternatives.

To sample from the limit laws, we confirmed that the derivative and m-out-of-n bootstrap methods are consistent under H_0, and that the m-out-of-n bootstrap is consistent under H_a. We proved polynomial complexity of sampling via the derivative bootstrap, and defined a low complexity upper bound to approximate the test cut-off under H_0. As an alternative to sampling from the limit laws, we defined a permutation test that is suitable if the sample sizes are not large enough to ensure convergence to the limit.

We empirically showed that the MOT based test of H_0 has strong finite sample power when compared with state-of-the-art methods. We also showed how to construct Confidence Regions for the true MOT value under the assumptions of H_a, and how to use this procedure to compare variability between collections of k measures. Finally, we demonstrated the use of our methodology on several real biomedical datasets.

5.2 Limitations and future directions

Extensions to continuous measures: One of the main benefits of working on finite spaces is the ability to obtain a non-degenerate limit law under H_0 (i.e., a law with non-zero variance), which makes it possible to quantify fluctuations of the MOT value when all measures are the same and to test H_0. In the k=2 case, non-degeneracy may fail for continuous measures (see the discussion in Section 1.2), but holds for discrete measures with limit laws of the form [72]. When extending [72] to continuous measures in the k=2 case, [46] show that non-degenerate limit laws are possible provided that there exist dual variables (the Kantorovich potentials) which are not constant almost everywhere (Theorem 4.2 of [46]). While constant potentials are always present under H_0 (Corollary 4.6 of [46]), in the discrete case there are other potentials that are not constant (this holds in our case of k>2; see the discussion preceding Theorem 2.2). Lemma 11 of [74] shows that in the k=2 case, it is possible to get non-constant potentials for continuous measures by requiring the support to be disconnected (intuitively, this resembles the discrete situation). It is an interesting future direction to analyze how the k>2 potentials would behave if our limit laws were extended to continuous measures and possibly a different ground cost c.

Improving upper bounds on the null distribution: While the proposed null upper bound UB_0 is computationally tractable and tight for k=2 measures, it may be too large to provide good power for testing H_0 for larger k. The main reason for this weakness is the “independent” nature of the optimization over the dual variables u_1,\ldots,u_k recorded in the constraints. Indeed, the constraints that relate different entries of different dual vectors are omitted, and hence the dual vectors interact only via \sum_{i=1}^k u_i=0 when solving the UB_0 program. The bound UB_0 can be strengthened by introducing additional constraints from X_0, which will decrease the value of the program and provide a tighter bound on the null distribution. Two possible choices for these extra constraints are (1) including constraints that involve diverse entries from different dual vectors (e.g., u_1^1+u_2^2+\cdots+u_k^k\leq c_{12\cdots k}), and/or (2) sampling constraints at random (such constraint sampling techniques are widely applicable when solving large linear programs arising, for instance, in Markov Decision Processes [19]).

Faster computation of the MOT/barycenter value and permutation test: Our empirical power results suggest that MOT could serve as a suitable statistic for a permutation test, which can be performed if the sample sizes are not large enough to validate the asymptotic tests. In that case, the MOT (or, equivalently, the barycenter) value has to be computed for each permutation. While direct computation of the barycenter cost is not feasible for a large cardinality of the support N and/or number of measures k, recently proposed sampling techniques can be used to speed up the computation while preserving some statistical guarantees [42]. These methods are especially applicable when measures have different supports, and hence can be applied in extensions to continuous measures.

Appendix A Proofs of main results omitted in the main text

A.1 Details on the proof of Theorem 2.2(a)

Step 1 (Weak convergence of measures) For every i=1,,ki=1,\ldots,k, the empirical process converges weakly (Theorem 14.3 - 4 of [7]) as

ni(μ^iμ)in lawGi\sqrt{n_{i}}\left(\widehat{\mu}_{i}-\mu\right)\xrightarrow[]{\text{in law}}G_{i}

where G_{i}\sim N\left(0,\text{diag}(\mu_{i})-\mu_{i}\mu_{i}^{\prime}\right). Since the processes are independent in i=1,\ldots,k, by Theorem 1.4.8 of [79] we can view them jointly as

(n1(μ^1μ1)nk(μ^kμk))in law(G1Gk)\begin{pmatrix}\sqrt{n_{1}}(\hat{\mu}_{1}-\mu_{1})\\ \vdots\\ \sqrt{n_{k}}(\hat{\mu}_{k}-\mu_{k})\end{pmatrix}\xrightarrow[]{\text{in law}}\begin{pmatrix}G_{1}\\ \vdots\\ G_{k}\end{pmatrix}

with respect to l1l^{1} norm on i=1kl1(𝒳)\bigotimes_{i=1}^{k}l^{1}(\mathcal{X}). Using Slutsky’s Theorem (e.g., Example 1.4.7 of [79])

(n1n2n1++nknkn1++nk(μ^1μ1)nkn1n1++nknk1n1++nk(μ^kμk))in law(λ2λka1G1λ1λk1akGk)=:G\begin{pmatrix}\sqrt{n_{1}}\frac{\sqrt{n_{2}}}{\sqrt{n_{1}+\ldots+n_{k}}}\cdots\frac{\sqrt{n_{k}}}{\sqrt{n_{1}+\ldots+n_{k}}}(\hat{\mu}_{1}-\mu_{1})\\ \vdots\\ \sqrt{n_{k}}\frac{\sqrt{n_{1}}}{\sqrt{n_{1}+\ldots+n_{k}}}\cdots\frac{\sqrt{n_{k-1}}}{\sqrt{n_{1}+\ldots+n_{k}}}(\hat{\mu}_{k}-\mu_{k})\end{pmatrix}\xrightarrow[]{\text{in law}}\begin{pmatrix}\overbrace{\sqrt{\lambda_{2}}\cdots\sqrt{\lambda_{k}}}^{\sqrt{a_{1}}}G_{1}\\ \vdots\\ \underbrace{\sqrt{\lambda_{1}}\cdots\sqrt{\lambda_{k-1}}}_{\sqrt{a_{k}}}G_{k}\end{pmatrix}=:G

which is of the form

ρn(μ^nμ)in lawaG\rho_{n}(\widehat{\mu}_{n}-\mu)\xrightarrow[]{\text{in law}}\sqrt{a}G

with ρn:=n1nk(n1++nk)k1\rho_{n}:=\frac{\sqrt{n_{1}\cdot\ldots\cdot n_{k}}}{\left(\sqrt{n_{1}+\ldots+n_{k}}\right)^{k-1}}.
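As a quick sanity check on this joint rate, the following sketch verifies that \rho_n reduces to the classical two-sample rate \sqrt{n_1 n_2/(n_1+n_2)} when k=2:

```python
import numpy as np

def rho(ns):
    """Joint scaling rate rho_n = sqrt(prod n_i) / (sqrt(sum n_i))^(k-1)."""
    ns = np.asarray(ns, dtype=float)
    return np.sqrt(ns.prod()) / np.sqrt(ns.sum()) ** (len(ns) - 1)

# k = 2 recovers the classical two-sample rate sqrt(n1 * n2 / (n1 + n2))
print(rho([100, 200]), np.sqrt(100 * 200 / 300))
```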

Step 2 (Hadamard directional differentiability of MOT) Consider the functional f:\bigotimes_{i=1}^{k}\mathcal{P}(\mathcal{X})\subseteq\bigotimes_{i=1}^{k}l^{1}(\mathcal{X})\longrightarrow\mathbb{R} given by f(\mu)=MOT(\mu), where MOT(\mu) is the optimal value z of the primal program (8), or, equivalently, the dual program (10). The map \mu\longrightarrow z(\mu) is Gâteaux directionally differentiable at \mu tangentially to a certain set D\subseteq\bigotimes_{i=1}^{k}\mathcal{P}(\mathcal{X}) (Theorem 3.1 of [32]) and locally Lipschitz (Remark 2.1 of [9]; the locally Lipschitz property can also be shown directly, see Supplementary Material (Kravtsova (2024))). Hence, it is Hadamard directionally differentiable at \mu tangentially to D, and the two derivatives coincide (see Proposition 3.5 of [69] and also the discussion in Section 2.1 of [9]). The derivative is given by

fμ(g)=max{u:Auc,u,μ=MOT(μ)}u,gf^{\prime}_{\mu}(g)=\max_{\{u:A^{\prime}u\leq c,\langle u,\mu\rangle=MOT(\mu)\}}\langle u,g\rangle (23)

for gD:={limnμ^nμtng\in D:=\{\lim_{n\to\infty}\frac{\widehat{\mu}_{n}-\mu}{t_{n}}, μ^nl1(𝒳)\widehat{\mu}_{n}\in l^{1}(\mathcal{X}), tn0}t_{n}\searrow 0\}.

A.2 Details on the proof of Theorem 2.2(b)

Consider the null distribution of X0𝒟0X_{0}\sim\mathcal{D}_{0} in (15) given by

maxu\displaystyle\max_{u} i=1kaiui,Gi\displaystyle\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},G_{i}\rangle
s.t. i=1kui=0\displaystyle\sum_{i=1}^{k}u_{i}=0
Auc\displaystyle A^{\prime}u\leq c

Note that for any given realization of random coefficients Gi𝒩(0,Σ1)G_{i}\sim\mathcal{N}\left(0,\Sigma_{1}\right), this linear program has kk dual vectors of length NN each, i.e. kNkN variables in total. There are NN equality constraints, each for summing one of the NN entries of these vectors to zero. The large number NkN^{k} of inequality constraints comes from the size of the primal constraint matrix AkN×NkA\in\mathbb{R}^{kN\times N^{k}}, as discussed in Section 2.1.

To construct UB0𝒟UB0{UB}_{0}\sim\mathcal{D}_{{UB}_{0}} with smaller complexity than X0X_{0}, we take the same objective function as in X0X_{0} program above, but relax some of the inequality constraints AucA^{\prime}u\leq c subject to i=1kui=0\sum_{i=1}^{k}u_{i}=0. Formally, we represent the equality constraints of X0X_{0} as

i=1kui=0u=[u1uk]ker(B)\sum_{i=1}^{k}u_{i}=0\iff u=\begin{bmatrix}u_{1}\\ \vdots\\ u_{k}\end{bmatrix}\in\text{ker}(B)

where BB represents the linear operator with matrix whose jjth row picks jjth element from vectors u1,,uku_{1},\ldots,u_{k} and sums them up.

As detailed in the proof of Lemma 2.4 given in A.4, for uker(B)u\in\text{ker}(B), there are constraints in AucA^{\prime}u\leq c of the form

u11u1jc11j{u_{1}}_{1}-{u_{1}}_{j}\leq c_{1\cdots 1j}

which can be written as

(110010101001)A~1u1(c112c11N)c~1\underbrace{\begin{pmatrix}1&-1&0&\cdots&0\\ 1&0&-1&\ldots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1&0&0&\cdots&-1\end{pmatrix}}_{\widetilde{A}^{\prime}_{1}}u_{1}\leq\underbrace{\begin{pmatrix}c_{1\cdots 12}\\ \vdots\\ \vdots\\ c_{1\cdots 1N}\end{pmatrix}}_{\widetilde{c}_{1}}

Similarly, for uker(B){Auc}u\in\text{ker}(B)\cap\{A^{\prime}u\leq c\}, the first dual vector u1u_{1} satisfies

(110001100101)A~2u1(c221c22N)c~2\underbrace{\begin{pmatrix}-1&1&0&\cdots&0\\ 0&1&-1&\ldots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&1&0&\cdots&-1\end{pmatrix}}_{\widetilde{A}^{\prime}_{2}}u_{1}\leq\underbrace{\begin{pmatrix}c_{2\cdots 21}\\ \vdots\\ \vdots\\ c_{2\cdots 2N}\end{pmatrix}}_{\widetilde{c}_{2}}
\vdots
\underbrace{\begin{pmatrix}-1&0&0&\cdots&1\\ 0&-1&0&\ldots&1\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&-1&1\end{pmatrix}}_{\widetilde{A}^{\prime}_{N}}u_{1}\leq\underbrace{\begin{pmatrix}c_{N\cdots N1}\\ \vdots\\ \vdots\\ c_{N\cdots N(N-1)}\end{pmatrix}}_{\widetilde{c}_{N}}

Thus, for $u\in\ker(B)\cap\{A^{\prime}u\leq c\}$,

\begin{pmatrix}\widetilde{A}_{1}^{\prime}\\ \vdots\\ \widetilde{A}_{N}^{\prime}\end{pmatrix}u_{1}\leq\begin{pmatrix}\widetilde{c}_{1}\\ \vdots\\ \widetilde{c}_{N}\end{pmatrix}

and the same constraints are satisfied by $u_2,\ldots,u_k$. Combining these constraints, we obtain

\underbrace{\begin{pmatrix}\begin{pmatrix}\widetilde{A}_{1}^{\prime}\\ \vdots\\ \widetilde{A}_{N}^{\prime}\end{pmatrix}&0&\cdots&0\\ 0&\begin{pmatrix}\widetilde{A}_{1}^{\prime}\\ \vdots\\ \widetilde{A}_{N}^{\prime}\end{pmatrix}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\begin{pmatrix}\widetilde{A}_{1}^{\prime}\\ \vdots\\ \widetilde{A}_{N}^{\prime}\end{pmatrix}\end{pmatrix}}_{\widetilde{A}^{\prime}}\begin{bmatrix}u_{1}\\ \vdots\\ u_{k}\end{bmatrix}\leq\underbrace{\begin{pmatrix}\begin{pmatrix}\widetilde{c}_{1}\\ \vdots\\ \widetilde{c}_{N}\end{pmatrix}\\ \vdots\\ \begin{pmatrix}\widetilde{c}_{1}\\ \vdots\\ \widetilde{c}_{N}\end{pmatrix}\end{pmatrix}}_{\widetilde{c}}

for all $u\in\ker(B)\cap\{A^{\prime}u\leq c\}$. This gives the system $\widetilde{A}^{\prime}u\leq\widetilde{c}$ with $kN(N-1)$ constraints, which we choose as the constraint set for the linear program (16) defining ${UB}_0$ (which now has no equality constraints).

Note that for $u\in\ker(B)$ we have $u_{1}=-\sum_{i=2}^{k}u_{i}$, which, upon substitution into the objective function of (15), gives

\langle u_{2},\sqrt{a_{2}}G_{2}\rangle+\cdots+\langle u_{k},\sqrt{a_{k}}G_{k}\rangle-\langle u_{2},\sqrt{a_{1}}G_{1}\rangle-\cdots-\langle u_{k},\sqrt{a_{1}}G_{1}\rangle\overset{d}{=}\sum_{i=2}^{k}\langle u_{i},\sqrt{a_{i}}G_{i}-\sqrt{a_{1}}G_{1}\rangle

This is the objective in the linear program (16) defining ${UB}_0$, with $(k-1)N$ variables.
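The relaxed system $\widetilde{A}^{\prime}u\leq\widetilde{c}$ can likewise be assembled explicitly. The sketch below (again illustrative, with the same toy assumptions: equal weights, one-dimensional support, and row-centered Gaussian draws standing in for the multinomial limits) builds one block of $N(N-1)$ pairwise constraints, replicates it across the $k$ dual vectors, and solves the resulting program approximating ${UB}_0$.

```python
# Toy instance of the relaxed program defining UB_0: kN(N-1) pairwise
# constraints, no equality constraints (illustrative sketch only).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
k, N = 3, 5
x = rng.normal(size=N)                 # support points in R^1 (assumed)
a = np.full(k, 1.0 / k)                # sampling proportions (assumed equal)
G = rng.normal(size=(k, N))
G -= G.mean(axis=1, keepdims=True)     # rows sum to zero, as under Sigma_1

# One block of pairwise constraints: u[j] - u[l] <= (k-1)/k^2 * ||x_j - x_l||^2
rows, rhs = [], []
for j in range(N):
    for l in range(N):
        if j != l:
            row = np.zeros(N)
            row[j], row[l] = 1.0, -1.0
            rows.append(row)
            rhs.append((k - 1) / k**2 * (x[j] - x[l]) ** 2)
block, c_tilde = np.array(rows), np.array(rhs)

# Block-diagonal A-tilde: the same N(N-1) constraints for each of the k duals
A_tilde = np.kron(np.eye(k), block)    # shape kN(N-1) x kN
b_tilde = np.tile(c_tilde, k)
assert A_tilde.shape[0] == k * N * (N - 1)

obj = -np.concatenate([np.sqrt(a[i]) * G[i] for i in range(k)])  # maximize
res = linprog(obj, A_ub=A_tilde, b_ub=b_tilde,
              bounds=[(None, None)] * (k * N))
print(res.status, -res.fun)            # optimal value approximates UB_0
```

Since the relaxed feasible set contains that of $X_0$, the value computed here upper-bounds the $X_0$ value for the same realization of the coefficients.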

A.3 Details for Observation 2.3

Consider a cost vector in the MOT program (8) with entries $c_{i_{1}i_{2}\cdots i_{k}}$, where each index takes values in $\{1,\ldots,N\}$. Suppose that $k-1$ indices share the same value, e.g. $c_{i\cdots ij}$. Then,

c_{i\cdots ij}=\frac{1}{k}\left((k-1)\left\|x_{i}-\overline{x}_{i\cdots ij}\right\|^{2}+\left\|x_{j}-\overline{x}_{i\cdots ij}\right\|^{2}\right)

where $\overline{x}_{i\cdots ij}=\frac{1}{k}\left[(k-1)x_{i}+x_{j}\right]$. The first term gives

(k-1)\left\|x_{i}-\overline{x}_{i\cdots ij}\right\|^{2}=\frac{k-1}{k^{2}}\|x_{i}-x_{j}\|^{2}

and the second term gives

\left\|x_{j}-\overline{x}_{i\cdots ij}\right\|^{2}=\frac{(k-1)^{2}}{k^{2}}\|x_{i}-x_{j}\|^{2}

Combining the two terms and multiplying by $\frac{1}{k}$ gives the result.
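The identity is easy to verify numerically. The snippet below is an illustrative check (the points and the value of $k$ are arbitrary choices, not from the paper) of $c_{i\cdots ij}=\frac{k-1}{k^{2}}\|x_{i}-x_{j}\|^{2}$ for the barycentric cost.

```python
# Numerical check of Observation 2.3 for arbitrary points in R^2
import numpy as np

k = 5
xi, xj = np.array([0.3, -1.2]), np.array([2.0, 0.7])   # arbitrary points
xbar = ((k - 1) * xi + xj) / k                         # barycenter of the tuple
cost = ((k - 1) * np.sum((xi - xbar) ** 2) + np.sum((xj - xbar) ** 2)) / k
assert np.isclose(cost, (k - 1) / k**2 * np.sum((xi - xj) ** 2))
```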

A.4 Proof of Lemma 2.4

For notational clarity, we start with the case of $k=2$ measures, and assume for simplicity that they are supported on a metric space $\mathcal{X}$ with only two points. Let $u=(u_{1},u_{2})=\left(({u_{1}}_{1},{u_{1}}_{2}),({u_{2}}_{1},{u_{2}}_{2})\right)$ be a solution to the dual MOT program (10) satisfying $\sum_{i=1}^{k}u_{i}=0$, i.e. $u_{2}=-u_{1}$. Recall that the constraint matrix in the dual constraints $A^{\prime}u\leq c$ is

A^{\prime}=\begin{pmatrix}1&0&1&0\\ 1&0&0&1\\ 0&1&1&0\\ 0&1&0&1\end{pmatrix}

Applying it to $u=\begin{pmatrix}u_{1}\\ -u_{1}\end{pmatrix}$ gives the constraints on $u_1$ as

{u_{1}}_{1}-{u_{1}}_{1}\leq c_{11}=0
{u_{1}}_{1}-{u_{1}}_{2}\leq c_{12}
{u_{1}}_{2}-{u_{1}}_{1}\leq c_{21}=c_{12}
{u_{1}}_{2}-{u_{1}}_{2}\leq c_{22}=0

The middle two constraints give

|{u_{1}}_{1}-{u_{1}}_{2}|\leq c_{12}

Recalling that ${u_{1}}_{1}=0$, we get that

|{u_{1}}_{2}|\leq c_{12}

If the number of support points were $N>2$, a similar argument using the constraints ${u_{1}}_{1}-{u_{1}}_{j}\leq c_{1j}$ and ${u_{1}}_{j}-{u_{1}}_{1}\leq c_{1j}$ would give

|{u_{1}}_{j}|\leq c_{1j}\text{ for }j=1,\ldots,N

Recall from Observation 2.3 that $c_{1j}=\frac{k-1}{k^{2}}\|x_{1}-x_{j}\|^{2}$, which finishes the proof for $k=2$ measures.

To see the result for $k>2$ measures, note that $A^{\prime}u\leq c$ contains the constraints

\underbrace{{u_{1}}_{1}+{u_{2}}_{1}+\cdots+{u_{k-1}}_{1}}_{-{u_{k}}_{1}}+{u_{k}}_{2}\leq c_{1\cdots 12}
\underbrace{{u_{1}}_{2}+{u_{2}}_{2}+\cdots+{u_{k-1}}_{2}}_{-{u_{k}}_{2}}+{u_{k}}_{1}\leq c_{2\cdots 21}

and by Observation 2.3, $c_{1\cdots 12}=c_{2\cdots 21}=\frac{k-1}{k^{2}}\|x_{1}-x_{2}\|^{2}$, giving

|{u_{k}}_{2}|\leq\frac{k-1}{k^{2}}\|x_{1}-x_{2}\|^{2}

and, by similar reasoning,

|{u_{k}}_{j}|\leq\frac{k-1}{k^{2}}\|x_{1}-x_{j}\|^{2}\text{ for }j=1,\ldots,N

Similarly, we conclude the same property for all dual variables $u_i$, $i=1,\ldots,k$, i.e.

|{u_{i}}_{j}|\leq\frac{k-1}{k^{2}}\|x_{1}-x_{j}\|^{2}\text{ for }j=1,\ldots,N

concluding the proof.

A.5 Proof of Lemma 2.13

By the equivalence between the MOT and barycenter problems (2) and (4),

MOT(\mu)=\inf_{\nu\in\mathcal{P}^{2}(\mathbb{R}^{d})}\frac{1}{k}\sum_{i=1}^{k}W_{2}^{2}(\mu_{i},\nu)=\inf_{\nu\in\mathcal{P}^{2}(\mathbb{R}^{d})}\frac{1}{k}\left[\frac{k}{2}W_{2}^{2}(\mu_{1},\nu)+\frac{k}{2}W_{2}^{2}(\mu_{k},\nu)\right]=\inf_{\nu\in\mathcal{P}^{2}(\mathbb{R}^{d})}\frac{1}{2}W_{2}^{2}(\mu_{1},\nu)+\frac{1}{2}W_{2}^{2}(\mu_{k},\nu)

i.e. ν\nu is a solution to the barycenter problem between μ1\mu_{1} and μk\mu_{k} with optimal value MOT(μ1,μk)=14W22(μ1,μk)MOT(\mu_{1},\mu_{k})=\frac{1}{4}W_{2}^{2}(\mu_{1},\mu_{k}). By similar reasoning, for μkC\mu\in\mathcal{F}^{C}_{k}, the population value MOT(μ)MOT(\mu) is the same as the value of MOT computed with one measure from each cluster.

A.6 Proof of Lemma 2.14

Let $(u_{1}^{*},u_{k}^{*})$ be dual optimal solutions to the problem $W_{2}^{2}(\mu_{1},\mu_{k})$. We will show that $(\frac{k-1}{k^{2}}u_{1}^{*},0,\ldots,0,\frac{k-1}{k^{2}}u_{k}^{*})$ is dual optimal for $MOT(\mu_{1},\ldots,\mu_{k})$, and hence

MOT(\mu_{1},\ldots,\mu_{k})=\frac{k-1}{k^{2}}\langle u_{1}^{*},\mu_{1}\rangle+\frac{k-1}{k^{2}}\langle u_{k}^{*},\mu_{k}\rangle=\frac{k-1}{k^{2}}W_{2}^{2}(\mu_{1},\mu_{k})

By optimality of $(u_{1}^{*},u_{k}^{*})$ for $W_{2}^{2}(\mu_{1},\mu_{k})$, the dual constraints hold,

u_{1}^{*i}+u_{k}^{*j}\leq\|x_{i}-x_{j}\|^{2}

for all $i,j\in\{1,\ldots,N\}$, and hold with equality,

u_{1}^{*i}+u_{k}^{*j}=\|x_{i}-x_{j}\|^{2}

for some pairs $(i,j)\in I$, with the set $I$ indexing the pairs $(x_{i},x_{j})$ that support the optimal Wasserstein coupling $\pi^{*}$. Consider a multicoupling that agrees with $\pi^{*}$ on tuples $(x_{i},\ldots,x_{i},x_{j})$, $(i,j)\in I$, and is zero otherwise (so the set of such tuples carries the full multicoupling mass); this is the candidate for the primal optimal solution to MOT. Further, the above equality implies, for $(i,j)\in I$,

\frac{k-1}{k^{2}}u_{1}^{*i}+0+\cdots+0+\frac{k-1}{k^{2}}u_{k}^{*j}=\frac{k-1}{k^{2}}\|x_{i}-x_{j}\|^{2}\underset{\text{Observation 2.3}}{=}c_{i\cdots ij}

Moreover, $c_{i\cdots ij}$ is no larger than the corresponding entry of $c$ when some indices are not repeated, making the candidate $(\frac{k-1}{k^{2}}u_{1}^{*},0,\ldots,0,\frac{k-1}{k^{2}}u_{k}^{*})$ dual feasible for MOT. By complementary slackness (e.g., Lemma 1.1 of [35], which specifically addresses the multimarginal problem), our dual candidate is optimal.

A.7 Proof of Proposition 3.1

  • (a)

    By Theorem 3.1 of [44], the numerical directional derivative estimator $\frac{f(\hat{\mu}_{n}+\varepsilon_{n}\hat{G}^{*})-f(\hat{\mu}_{n})}{\varepsilon_{n}}$ is consistent for the directional derivative $f_{\mu}^{\prime}(G)$ under mild measurability conditions on $\hat{G}^{*}$. The choice $\varepsilon_{n}=\frac{1}{r_{m}}$ with $m=\sqrt{n}$ (or, more generally, $m=n^{p}$ with $p\in(0,1)$) ensures that the assumptions of the theorem are satisfied, i.e. that $\varepsilon_{n}\to 0$ and $\varepsilon_{n}r_{n}=\frac{r_{n}}{r_{m}}=\sqrt{\frac{n}{m}}\to\infty$, and allows us to conclude consistency of this estimator for $f^{\prime}_{\mu}(G)$. Note that consistency does not depend on the form of $f_{\mu}^{\prime}(G)$, and hence holds under both $H_{0}$ and $H_{a}$.

  • (b)

    We will check that the estimator $f^{\prime}_{n}$ of the directional derivative map $f^{\prime}_{\mu}$, given by $f^{\prime}_{n}=f^{\prime}_{\hat{\mu}_{n}}$, is uniformly consistent in the sense of Assumption 4 of [29]. Note that under $H_{0}$, the estimator is given by

    f_{\hat{\mu}_{n}}^{\prime}:h\to\max_{u}\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},h_{i}\rangle
    \text{subject to }\sum_{i=1}^{k}u_{i}=0,\ A^{\prime}u\leq c

    The expression is independent of $\widehat{\mu}_{n}$, and hence the assumption is trivially satisfied. Thus, the proposed bootstrap is consistent by Theorem 3.2 of [29].
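The finite-difference estimator in part (a) is easy to sketch generically. The example below is illustrative only: the functional $f$ (a coordinate maximum with a tie, which is directionally but not fully differentiable) and the rate $r_n=\sqrt{n}$ are stand-ins, not the paper's MOT functional; the step follows $\varepsilon_n = 1/r_m$ with $m=\sqrt{n}$.

```python
# Numerical delta method sketch: finite difference of a directionally
# differentiable functional with step eps_n = 1/r_m, m = sqrt(n)
import numpy as np

def numerical_directional_derivative(f, mu_hat, G_star, n):
    m = np.sqrt(n)                 # subsample size m = n^p with p = 1/2
    eps = 1.0 / np.sqrt(m)         # eps_n = 1/r_m, assuming r_m = sqrt(m)
    return (f(mu_hat + eps * G_star) - f(mu_hat)) / eps

f = lambda v: np.max(v)            # toy functional with a non-smooth point
mu_hat = np.array([1.0, 1.0, 0.5]) # tie at the max: only directional derivative
G = np.array([0.3, -0.2, 0.1])
est = numerical_directional_derivative(f, mu_hat, G, n=10**6)
# true directional derivative: max of G over the argmax set {0, 1} = 0.3
print(est)
```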

A.8 Details on the proof of Lemma 3.2

Consider the linear program

\max_{u}\ g^{\prime}u
Bu=0
A^{\prime}u\leq c

where $g\in\mathbb{R}^{kN}$ is a vector of objective coefficients, $B\in\mathbb{R}^{N\times kN}$ and $A^{\prime}\in\mathbb{R}^{N^{k}\times kN}$ are matrices with entries in $\{0,1\}$, and $c$ is the cost vector from the primal MOT problem, where the measures $\mu$ and the support points in $\mathcal{X}$ are represented with $\log U$ bits of precision.

Recall that a linear program over a polytope with exponentially many constraints can be solved in polynomial time by the ellipsoid method if there exists a polynomial time separation oracle for the polytope (Theorem 8.5 of [5]). Here, we construct a separation oracle as follows. Given any $u\in\mathbb{R}^{kN}$, check whether the $N$ constraints $Bu=0$ are satisfied; if not, output a violated constraint (this is done with $\text{poly}(N)$ complexity). If $Bu=0$, check the constraints $A^{\prime}u\leq c$ (and output a violated constraint if needed) by employing the polynomial time oracle in Definition 10 of [2], which runs with $\text{poly}(N,k,\log U)$ complexity (Proposition 12 of [2]).
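The oracle's interface can be sketched on a toy instance. In the illustrative code below, the polynomial-time inequality oracle of [2] is replaced by brute force over the $N^k$ tuples (an assumption that only makes sense for tiny instances; the point is the two-stage structure, not the complexity).

```python
# Two-stage separation oracle sketch: equality constraints first, then
# the (here brute-forced) inequality constraints A'u <= c
import itertools
import numpy as np

def separation_oracle(u, x):
    """u: (k, N) candidate dual; returns None if feasible, else a violated row."""
    k, N = u.shape
    # Stage 1: equality constraints Bu = 0 (the k vectors sum to zero entrywise)
    s = u.sum(axis=0)
    for j in range(N):
        if abs(s[j]) > 1e-9:
            return ("equality", j)
    # Stage 2: inequality constraints A'u <= c, checked tuple by tuple
    for t in itertools.product(range(N), repeat=k):
        lhs = sum(u[m, t[m]] for m in range(k))
        xt = x[list(t)]
        c_t = np.mean((xt - xt.mean()) ** 2)   # barycentric cost of the tuple
        if lhs > c_t + 1e-9:
            return ("inequality", t)
    return None                                # u is feasible

rng = np.random.default_rng(3)
k, N = 3, 4
x = rng.normal(size=N)                         # 1-D support (assumed)
assert separation_oracle(np.zeros((k, N)), x) is None   # u = 0 is feasible
bad = np.zeros((k, N))
bad[0, 0] = 1.0
assert separation_oracle(bad, x) is not None            # violates Bu = 0
```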

Acknowledgments

The author would like to thank the anonymous referees, an Associate Editor and the Editor for their constructive comments that greatly improved the quality of this paper. The author is indebted to Ilmun Kim for sharing the code for the tests used for comparisons in Figure 5. The author sincerely thanks Adriana Dawes for advice and support and Florian Gunsilius for comments that greatly improved the paper. Helpful discussions with Helen Chamberlin, Jun Kitagawa, Facundo Mémoli, and Dustin Mixon are gratefully acknowledged. The efforts of the SEER Program in the creation and maintenance of the SEER database are gratefully acknowledged. The author is solely responsible for all remaining mistakes.

Funding

The author was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM132651 to Adriana Dawes.

References

  • [1] Agueh, M. and Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis 43, 904–924.
  • [2] Altschuler, J. M. and Boix-Adsera, E. (2021). Wasserstein barycenters can be computed in polynomial time in fixed dimension. Journal of Machine Learning Research 22, 1–19.
  • [3] Anderes, E., Borgwardt, S. and Miller, J. (2016). Discrete Wasserstein barycenters: Optimal transport for discrete data. Mathematical Methods of Operations Research 84, 389–409.
  • [4] Balinski, M. L. and Russakoff, A. (1984). Faces of dual transportation polyhedra. Springer.
  • [5] Bertsimas, D. and Tsitsiklis, J. N. (1997). Introduction to Linear Optimization, Vol. 6. Athena Scientific, Belmont, MA.
  • [6] Bigot, J., Cazelles, E. and Papadakis, N. (2019). Central limit theorems for entropy-regularized optimal transport on finite spaces and statistical applications. Electronic Journal of Statistics 13, 5120–5150.
  • [7] Bishop, Y. M., Fienberg, S. E. and Holland, P. W. (2007). Discrete Multivariate Analysis: Theory and Practice. Springer Science & Business Media.
  • [8] Bénasséni, J. (2012). A new derivation of eigenvalue inequalities for the multinomial distribution. Journal of Mathematical Analysis and Applications 393, 697–698.
  • [9] Cárcamo, J., Cuevas, A. and Rodríguez, L.-A. (2020). Directional differentiability for supremum-type functionals: Statistical applications. Bernoulli 26, 2143–2175.
  • [10] Carlier, G., Delalande, A. and Merigot, Q. (2024). Quantitative stability of barycenters in the Wasserstein space. Probability Theory and Related Fields 188, 1257–1286.
  • [11] Chen, S. (2020). A new distribution-free k-sample test: Analysis of kernel density functionals. Canadian Journal of Statistics 48, 167–186.
  • [12] Chen, S. and Pokojovy, M. (2018). Modern and classical k-sample omnibus tests. Wiley Interdisciplinary Reviews: Computational Statistics 10, e1418.
  • [13] Chernozhukov, V., Galichon, A., Hallin, M. and Henry, M. (2017). Monge–Kantorovich depth, quantiles, ranks and signs. The Annals of Statistics 45, 223–256.
  • [14] Cleophas, T. J., Zwinderman, A. H., Cleophas, T. F. and Cleophas, E. P. (2009). Statistics Applied to Clinical Trials. Springer.
  • [15] Conover, W. J. (1965). Several k-sample Kolmogorov-Smirnov tests. The Annals of Mathematical Statistics 36, 1019–1026.
  • [16] Corchado-Sonera, M., Rambani, K., Navarro, K., Kladney, R., Dowdle, J., Leone, G. and Chamberlin, H. M. (2022). Discovery of nonautonomous modulators of activated Ras. G3 Genes|Genomes|Genetics 12, jkac200.
  • [17] Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their Application, Vol. 1. Cambridge University Press.
  • [18] Dawes, A. T., Wu, D., Mahalak, K. K., Zitnik, E. M., Kravtsova, N., Su, H. and Chamberlin, H. M. (2017). A computational model predicts genetic nodes that allow switching between species-specific responses in a conserved signaling network. Integrative Biology 9, 156–166.
  • [19] De Farias, D. P. and Van Roy, B. (2004). On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research 29, 462–478.
  • [20] Deb, N. and Sen, B. (2023). Multivariate rank-based distribution-free nonparametric testing using measure transportation. Journal of the American Statistical Association 118, 192–207.
  • [21] Del Barrio, E., Giné, E. and Utzet, F. (2005). Asymptotics for L2 functionals of the empirical quantile process, with applications to tests of fit based on weighted Wasserstein distances. Bernoulli 11, 131–189.
  • [22] Del Barrio, E., Gordaliza, P. and Loubes, J.-M. (2019). A central limit theorem for Lp transportation cost on the real line with application to fairness assessment in machine learning. Information and Inference: A Journal of the IMA 8, 817–849.
  • [23] del Barrio, E. and Loubes, J.-M. (2019). Central limit theorems for empirical transportation cost in general dimension. The Annals of Probability 47, 926–951.
  • [24] Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine 29, 141–142.
  • [25] Desai, S. and Guddati, A. K. (2022). Bimodal age distribution in cancer incidence. World Journal of Oncology 13, 329.
  • [26] Dudley, R. M. (1969). The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics 40, 40–50.
  • [27] Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics 7, 1–26.
  • [28] Emelichev, V. A., Kovalev, M. M. and Kravtsov, M. K. (1984). Polytopes, Graphs and Optimisation. Cambridge University Press.
  • [29] Fang, Z. and Santos, A. (2019). Inference on directionally differentiable functions. The Review of Economic Studies 86, 377–412.
  • [30] Fournier, N. and Guillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields 162, 707–738.
  • [31] Fukui, T., Mori, S., Yokoi, K. and Mitsudomi, T. (2006). Significance of the number of positive lymph nodes in resected non-small cell lung cancer. Journal of Thoracic Oncology 1, 120–125.
  • [32] Gal, T. (1997). A historical sketch on sensitivity analysis and parametric programming. In Advances in Sensitivity Analysis and Parametric Programming, 1–10. Springer.
  • [33] Ghodrati, L. and Panaretos, V. M. (2022). Distribution-on-distribution regression via optimal transport maps. Biometrika 109, 957–974.
  • [34] Giordano, S. H., Cohen, D. S., Buzdar, A. U., Perkins, G. and Hortobagyi, G. N. (2004). Breast carcinoma in men: a population-based study. Cancer: Interdisciplinary International Journal of the American Cancer Society 101, 51–57.
  • [35] Gladkov, N. A. and Zimin, A. P. (2020). An explicit solution for a multimarginal mass transportation problem. SIAM Journal on Mathematical Analysis 52, 3666–3696.
  • [36] Gunsilius, F. F. (2023). Distributional synthetic controls. Econometrica 91, 1105–1117.
  • [37] Gurobi Optimization, LLC (2023). Gurobi Optimizer Reference Manual.
  • [38] Hallin, M., Del Barrio, E., Cuesta-Albertos, J. and Matran, C. (2021). Distribution and quantile functions, ranks and signs in dimension d: A measure transportation approach. Annals of Statistics 49, 1139–1165.
  • [39] Hallin, M., Mordant, G. and Segers, J. (2021). Multivariate goodness-of-fit tests based on Wasserstein distance. Electronic Journal of Statistics 15, 1328–1371.
  • [40] Hart, J. D. and Cañette, I. (2011). Nonparametric estimation of distributions in random effects models. Journal of Computational and Graphical Statistics 20, 461–478.
  • [41] Harter, R., Hornik, K. and Theussl, S. (2021). Rsymphony: SYMPHONY in R. R package version 0.1-33.
  • [42] Heinemann, F., Munk, A. and Zemel, Y. (2022). Randomized Wasserstein barycenter computation: Resampling with statistical guarantees. SIAM Journal on Mathematics of Data Science 4, 229–259.
  • [43] Heitjan, D. F., Manni, A. and Santen, R. J. (1993). Statistical analysis of in vivo tumor growth experiments. Cancer Research 53, 6042–6050.
  • [44] Hong, H. and Li, J. (2018). The numerical delta method. Journal of Econometrics 206, 379–394.
  • [45] Hsu, J. (1996). Multiple Comparisons: Theory and Methods. CRC Press.
  • [46] Hundrieser, S., Klatt, M., Munk, A. and Staudt, T. (2024). A unifying approach to distributional limits for empirical optimal transport. Bernoulli 30, 2846–2877.
  • [47] Hušková, M. and Meintanis, S. G. (2008). Tests for the multivariate k-sample problem based on the empirical characteristic function. Journal of Nonparametric Statistics 20, 263–277.
  • [48] Khan, M. M., Odoi, A. and Odoi, E. W. (2023). Geographic disparities in COVID-19 testing and outcomes in Florida. BMC Public Health 23, 79.
  • [49] Kiefer, J. (1959). K-sample analogues of the Kolmogorov-Smirnov and Cramér-V. Mises tests. The Annals of Mathematical Statistics, 420–447.
  • [50] Kim, I. (2021). Comparing a large number of multivariate distributions. Bernoulli 27, 419–441.
  • [51] Klatt, M., Tameling, C. and Munk, A. (2020). Empirical regularized optimal transport: Statistical theory and applications. SIAM Journal on Mathematics of Data Science 2, 419–443.
  • [52] Le Gouic, T. and Loubes, J.-M. (2017). Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields 168, 901–917.
  • [53] Ledoux, M. and Talagrand, M. (2013). Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media.
  • [54] Lin, T., Ho, N., Cuturi, M. and Jordan, M. I. (2022). On the complexity of approximating multimarginal optimal transport. The Journal of Machine Learning Research 23, 2835–2877.
  • [55] Mahalak, K. K., Jama, A. M., Billups, S. J., Dawes, A. T. and Chamberlin, H. M. (2017). Differing roles for sur-2/MED23 in C. elegans and C. briggsae vulval development. Development Genes and Evolution 227, 213–218.
  • [56] Mukhopadhyay, S. and Wang, K. (2020). A nonparametric approach to high-dimensional k-sample comparison problems. Biometrika 107, 555–572.
  • [57] Munk, A. and Czado, C. (1998). Nonparametric validation of similar distributions and assessment of goodness of fit. Journal of the Royal Statistical Society Series B: Statistical Methodology 60, 223–241.
  • [58] Nations, J. A., Brown, D. W., Shao, S., Shriver, C. D. and Zhu, K. (2020). Comparative trends in the distribution of lung cancer stage at diagnosis in the Department of Defense Cancer Registry and the Surveillance, Epidemiology, and End Results data, 1989–2012. Military Medicine 185, e2044–e2048.
  • [59] Panaretos, V. M. and Zemel, Y. (2019). Statistical aspects of Wasserstein distances. Annual Review of Statistics and Its Application 6, 405–431.
  • [60] Panaretos, V. M. and Zemel, Y. (2020). An Invitation to Statistics in Wasserstein Space. Springer Nature.
  • [61] Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P. and Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 1–22.
  • [62] Pass, B. (2015). Multi-marginal optimal transport: theory and applications. ESAIM: Mathematical Modelling and Numerical Analysis 49, 1771–1790.
  • [63] Ramdas, A., García Trillos, N. and Cuturi, M. (2017). On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19, 47.
  • [64] Rizzo, M. L. and Székely, G. J. (2010). DISCO analysis: A nonparametric extension of analysis of variance. The Annals of Applied Statistics 4, 1034–1055.
  • [65] Römisch, W. (2004). Delta method, infinite dimensional. In Encyclopedia of Statistical Sciences. Wiley, New York.
  • [66] Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Birkhäuser, NY 55, 94.
  • [67] Scholz, F. W. and Stephens, M. A. (1987). K-sample Anderson–Darling tests. Journal of the American Statistical Association 82, 918–924.
  • [68] Sejdinovic, D., Sriperumbudur, B., Gretton, A. and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics 41, 2263–2291.
  • [69] Shapiro, A. (1990). On concepts of directional differentiability. Journal of Optimization Theory and Applications 66, 477–487.
  • [70] Shapiro, A. (1991). Asymptotic analysis of stochastic programs. Annals of Operations Research 30, 169–186.
  • [71] Sierksma, G. (2001). Linear and Integer Programming: Theory and Practice. CRC Press.
  • [72] Sommerfeld, M. and Munk, A. (2018). Inference for empirical Wasserstein distances on finite spaces. Journal of the Royal Statistical Society Series B: Statistical Methodology 80, 219–238.
  • [73] {barticle}[author] \bauthor\bsnmSopik, \bfnmVictoria\binitsV. and \bauthor\bsnmNarod, \bfnmSteven A\binitsS. A. (\byear2018). \btitleThe relationship between tumour size, nodal status and distant metastases: on the origins of breast cancer. \bjournalBreast cancer research and treatment \bvolume170 \bpages647–656. \endbibitem
  • [74] {barticle}[author] \bauthor\bsnmStaudt, \bfnmThomas\binitsT., \bauthor\bsnmHundrieser, \bfnmShayan\binitsS. and \bauthor\bsnmMunk, \bfnmAxel\binitsA. (\byear2022). \btitleOn the uniqueness of Kantorovich potentials. \bjournalarXiv preprint arXiv:2201.08316. \endbibitem
  • [75] {barticle}[author] \bauthor\bsnmTameling, \bfnmCarla\binitsC., \bauthor\bsnmSommerfeld, \bfnmMax\binitsM. and \bauthor\bsnmMunk, \bfnmAxel\binitsA. (\byear2019). \btitleEmpirical optimal transport on countable metric spaces: Distributional limits and statistical applications. \bjournalThe Annals of Applied Probability \bvolume29 \bpages2744–2781. \endbibitem
  • [76] {barticle}[author] \bauthor\bsnmTrick, \bfnmWilliam\binitsW. (\byear2011). \btitleComputer Assisted Quality of Life and Symptom Assessment of Complex Patients from April 2011-August 2012: Chicago, Illinois. \bjournalComputer \bvolume2012. \endbibitem
  • [77] {barticle}[author] \bauthor\bsnmTrillos, \bfnmNicolás García\binitsN. G., \bauthor\bsnmJacobs, \bfnmMatt\binitsM. and \bauthor\bsnmKim, \bfnmJakwang\binitsJ. (\byear2023). \btitleThe multimarginal optimal transport formulation of adversarial multiclass classification. \bjournalJournal of Machine Learning Research \bvolume24 \bpages1–56. \endbibitem
  • [78] {barticle}[author] \bauthor\bsnmUchida, \bfnmSeiichi\binitsS. (\byear2013). \btitleImage processing and recognition for biological images. \bjournalDevelopment, growth & differentiation \bvolume55 \bpages523–549. \endbibitem
  • [79] {bbook}[author] \bauthor\bsnmVan Der Vaart, \bfnmAad W\binitsA. W., \bauthor\bsnmWellner, \bfnmJon A\binitsJ. A., \bauthor\bparticlevan der \bsnmVaart, \bfnmAad W\binitsA. W. and \bauthor\bsnmWellner, \bfnmJon A\binitsJ. A. (\byear1996). \btitleWeak convergence. \bpublisherSpringer. \endbibitem
  • [80] {bbook}[author] \bauthor\bsnmVillani, \bfnmCédric\binitsC. (\byear2021). \btitleTopics in optimal transportation \bvolume58. \bpublisherAmerican Mathematical Soc. \endbibitem
  • [81] {bbook}[author] \bauthor\bsnmVillani, \bfnmCédric\binitsC. \betalet al. (\byear2009). \btitleOptimal transport: old and new \bvolume338. \bpublisherSpringer. \endbibitem
  • [82] {bbook}[author] \bauthor\bsnmWickham, \bfnmHadley\binitsH. (\byear2016). \btitleggplot2: Elegant Graphics for Data Analysis. \bpublisherSpringer-Verlag New York. \endbibitem
  • [83] {barticle}[author] \bauthor\bsnmZand, \bfnmTanya P\binitsT. P., \bauthor\bsnmReiner, \bfnmDavid J\binitsD. J. and \bauthor\bsnmDer, \bfnmChanning J\binitsC. J. (\byear2011). \btitleRas effector switching promotes divergent cell fates in C. elegans vulval patterning. \bjournalDevelopmental cell \bvolume20 \bpages84–96. \endbibitem
  • [84] {barticle}[author] \bauthor\bsnmZhan, \bfnmD\binitsD. and \bauthor\bsnmHart, \bfnmJD\binitsJ. (\byear2014). \btitleTesting equality of a large number of densities. \bjournalBiometrika \bvolume101 \bpages449–464. \endbibitem
  • [85] {barticle}[author] \bauthor\bsnmZhang, \bfnmChao\binitsC., \bauthor\bsnmKokoszka, \bfnmPiotr\binitsP. and \bauthor\bsnmPetersen, \bfnmAlexander\binitsA. (\byear2022). \btitleWasserstein autoregressive models for density time series. \bjournalJournal of Time Series Analysis \bvolume43 \bpages30–52. \endbibitem
  • [86] {barticle}[author] \bauthor\bsnmZhang, \bfnmHai-Liang\binitsH.-L., \bauthor\bsnmYang, \bfnmLi-Feng\binitsL.-F., \bauthor\bsnmZhu, \bfnmYao\binitsY., \bauthor\bsnmYao, \bfnmXu-Dong\binitsX.-D., \bauthor\bsnmZhang, \bfnmShi-Lin\binitsS.-L., \bauthor\bsnmDai, \bfnmBo\binitsB., \bauthor\bsnmZhu, \bfnmYi-Ping\binitsY.-P., \bauthor\bsnmShen, \bfnmYi-Jun\binitsY.-J., \bauthor\bsnmShi, \bfnmGuo-Hai\binitsG.-H. and \bauthor\bsnmYe, \bfnmDing-Wei\binitsD.-W. (\byear2011). \btitleSerum miRNA-21: Elevated levels in patients with metastatic hormone-refractory prostate cancer and potential predictive factor for the efficacy of docetaxel-based chemotherapy. \bjournalThe Prostate \bvolume71 \bpages326–331. \endbibitem
  • [87] {barticle}[author] \bauthor\bsnmZhang, \bfnmJin\binitsJ. and \bauthor\bsnmWu, \bfnmYuehua\binitsY. (\byear2007). \btitlek-Sample tests based on the likelihood ratio. \bjournalComputational Statistics & Data Analysis \bvolume51 \bpages4682–4691. \endbibitem
  • [88] {barticle}[author] \bauthor\bsnmZhang, \bfnmRuiyi\binitsR., \bauthor\bsnmOgden, \bfnmR Todd\binitsR. T., \bauthor\bsnmPicard, \bfnmMartin\binitsM. and \bauthor\bsnmSrivastava, \bfnmAnuj\binitsA. (\byear2022). \btitleNonparametric k-sample test on shape spaces with applications to mitochondrial shape analysis. \bjournalJournal of the Royal Statistical Society Series C: Applied Statistics \bvolume71 \bpages51–69. \endbibitem
  • [89] {barticle}[author] \bauthor\bsnmZhu, \bfnmChangbo\binitsC. and \bauthor\bsnmMüller, \bfnmHans-Georg\binitsH.-G. (\byear2023). \btitleAutoregressive optimal transport models. \bjournalJournal of the Royal Statistical Society Series B: Statistical Methodology \bvolume85 \bpages1012–1033. \endbibitem