
k-Sample inference
via Multimarginal Optimal Transport

Natalia Kravtsova (kravtsova.2@osu.edu), Department of Mathematics, The Ohio State University
Abstract

This paper proposes a Multimarginal Optimal Transport (MOT) approach for simultaneously comparing $k\geq 2$ measures supported on finite subsets of $\mathbb{R}^d$, $d\geq 1$. We derive asymptotic distributions of the optimal value of the empirical MOT program under the null hypothesis that all $k$ measures are the same, and under the alternative hypothesis that at least two measures differ. We use these results to construct a test of the null hypothesis and provide consistency and power guarantees for this $k$-sample test. We consistently estimate the asymptotic distributions using the bootstrap, and propose a low complexity linear program to approximate the test cut-off. We demonstrate the advantages of our approach on synthetic and real datasets, including real data on cancers in the United States in 2004 - 2020.

keywords [class=MSC]: 62G10, 90C31, 60F05
keywords: k-Sample test, Optimal Transport

1 Introduction

The $k$-sample inference problem concerns simultaneously comparing several probability measures. The classical question of this inference is to determine whether $k\geq 2$ groups of observed data points have the same underlying probability distribution, i.e. to test

H_{0}:\mu_{1}=\cdots=\mu_{k} \qquad (1)
H_{a}:\mu_{i}\neq\mu_{j}\text{ for some }1\leq i<j\leq k

This testing problem has a long history in statistics, ranging from classical rank-based tests for univariate data [15, 49, 67, 87] to a recent extension [20] using multivariate ranks [13, 38], and to graph [56], distance [64] and kernel based [68, 50] methods. Direct applications of testing the hypotheses in (1) include simultaneously comparing gene expression profiles to assess presence of disease [40, 84], assessing differences in chronic disease levels based on quality of life [11, 76], analyzing associations between exercise and morphology of an animal [88], and comparing distributions of agents’ outcomes in reinforcement learning [61]. Moreover, the test of (1) is frequently viewed as a non-parametric version of ANOVA [12, 64] with a myriad of scientific applications, typically comparing treatment outcomes between multiple groups, e.g. in clinical trials [14] and cancer studies [43, 86]. Table 1 outlines additional instances of scientific applications of $k$-sample inference when the measures of interest have finite support, which is the case considered in this paper.

This paper proposes an Optimal Transport approach to $k$-sample inference for $k\geq 2$ probability measures with finite supports in $\mathbb{R}^d$, $d\geq 1$. The method provides a powerful $k$-sample test of (1), but also allows comparison between different collections of $k$ measures in terms of their within-collection variability. Optimal Transport based approaches have been shown successful in one-sample (goodness-of-fit) and two-sample problems on finite [6, 51, 72], countable [75], semidiscrete [39], and some continuous spaces [57, 46]. The test statistics employ $p$-Wasserstein distances $W_p$ (or their regularized variants) to quantify differences between measures of interest while respecting the metric structure of their supports [60].

Our test statistic employs a different functional - the Multimarginal Optimal Transport program $MOT$ [62] - which can be represented as a variance functional on the space of measures [10] and thus serves as a natural candidate for testing variability in a collection of $k$ measures. We show that despite well-documented differences in the solution structures of the MOT and $W_p$ problems (Section 1.7.4 of [66]), MOT shares the same benefits as $W_p$ when it comes to the limiting behavior of its optimal value.

Using MOT for $k$-sample inference brings several important advantages. The main advantage is that the asymptotic distributions of MOT can be derived under both $H_0$ and $H_a$. To the best of our knowledge, the only multivariate $k$-sample test statistic with known $H_a$ distribution is the one of [47], where the limit laws are known only for a specific subset of alternatives. Our laws cover all alternatives in (1), which allows us to explicitly derive a power function of the test and establish novel consistency results. The consistency analysis techniques developed in this paper can be further applied to one- and two-sample tests based on the asymptotic results in [46, 72, 75].

Another benefit of the MOT limit under $H_a$ is the ability to estimate functionals of the $H_a$ distribution, e.g. Confidence Regions for the MOT value. This allows for a novel application of comparing several collections of $k$ measures using the overlap between their Confidence Regions (see Figure 8 for a concrete example). The procedure can be viewed as a distributional analogue of multiple comparisons (see [45] for a review), performed in a space of measures rather than in Euclidean space. To the best of our knowledge, this type of analysis is not available with other $k$-sample statistics considered in the literature.

Conceptually, our approach to $k$-sample inference is equivalent to viewing measures as points and considering the variability within their collection. It follows a general framework of Optimal Transport based distribution comparison: distributions are viewed as points in a Wasserstein space, with the Wasserstein distance indicating their closeness [60], leading to the development of distributional analogues of traditional methods such as regression and time series [85, 89, 33], synthetic controls [36], and clustering [77]. Below we provide a formulation of our approach (Sections 1.1, 1.2) and summarize our contributions (Section 1.3).

1.1 Multimarginal Optimal Transport (MOT) for k-sample inference

Let $\mu_1,\ldots,\mu_k$ be Borel probability measures supported on $\mathcal{X}\subseteq\mathbb{R}^d$, $d\geq 1$. The Multimarginal Optimal Transport (MOT) problem (equation (4.3) of [1]) is the optimization problem

\inf_{\pi\in\mathcal{C}(\mu_{1},\ldots,\mu_{k})}\int_{\mathcal{X}\times\cdots\times\mathcal{X}}c(x_{1},\ldots,x_{k})\,d\pi(x_{1},\ldots,x_{k}) \qquad (2)

where $\mathcal{C}(\mu_1,\ldots,\mu_k)$ is the set of Borel probability measures on the product space $\mathcal{X}\times\cdots\times\mathcal{X}$ with marginals $\mu_1,\ldots,\mu_k$. Different choices of the cost function $c(x_1,\ldots,x_k)$ are possible; throughout the paper, we fix the choice to be

c(x_{1},\ldots,x_{k}):=\frac{1}{k}\sum_{i=1}^{k}\left\|x_{i}-\frac{1}{k}\sum_{j=1}^{k}x_{j}\right\|^{2} \qquad (3)
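For concreteness, the cost (3) is just the average squared distance of a $k$-tuple of points to its mean. A minimal sketch (in Python with numpy, an implementation choice of ours and not the paper's software):

```python
import numpy as np

def mot_cost(points):
    """Cost (3): average squared Euclidean distance of k points in R^d to their mean."""
    pts = np.atleast_2d(np.asarray(points, dtype=float))  # shape (k, d)
    center = pts.mean(axis=0)                             # the mean (1/k) sum_j x_j
    return float(np.mean(np.sum((pts - center) ** 2, axis=1)))
```

In particular, the cost vanishes exactly when all $k$ points coincide, foreshadowing the role of the MOT value as a variance functional.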

Under this choice, the MOT problem is equivalent (Proposition 3.1.2 of [60]) to the Wasserstein barycenter problem (equation 2.2 of [1])

\inf_{\nu\in\mathcal{P}^{2}(\mathbb{R}^{d})}\frac{1}{k}\sum_{i=1}^{k}W_{2}^{2}(\mu_{i},\nu) \qquad (4)

where $W_2(\cdot,\cdot)$ denotes the 2-Wasserstein distance. By equivalence here we mean that the optimal values of both programs are equal, and the optimal solutions $\pi^*$ of (2) and $\nu^*$ of (4) are related by $\nu^* = M_{\#}\pi^*$, where $M$ is the map that averages a given $k$-tuple of points from the supports of the $k$ measures ($M_{\#}\pi^*$ stands for the pushforward of the measure $\pi^*$ by the map $M$). We remark here that when the measures are discrete, the barycenter problem (4) generally has more than one optimal solution $\nu^*$. This presents challenges for statistical inference concerning barycenter solutions [52], but does not impede the analysis of the optimal value of the barycenter or MOT problems (recalling that all optimal solutions result in the same optimal value).

Let $MOT(\mu_1,\ldots,\mu_k)$ denote the optimal value of the MOT program (2). Observe that $MOT(\mu_1,\ldots,\mu_k)=0$ if and only if the $k$ measures $\mu_1,\ldots,\mu_k$ are all the same. Indeed, if $\nu$ is (any) optimal solution to the barycenter problem (4), then $MOT(\mu_1,\ldots,\mu_k)=0$ is equivalent to a zero optimal value in the barycenter program (4), i.e. $W_2^2(\mu_i,\nu)=0$ for all $i=1,\ldots,k$. Due to the metric properties of the 2-Wasserstein distance (Theorem 7.3 of [80]), this is equivalent to all $k$ measures being the same (and equal to $\nu$).

This observation suggests that testing $H_0$ in (1) can be addressed via testing for $MOT(\mu_1,\ldots,\mu_k)=0$. To this end, suppose that data on $k$ samples $(X^1_i)_{i=1}^{n_1},\ldots,(X^k_i)_{i=1}^{n_k}$ of sizes $n_1,\ldots,n_k$, respectively, is available to estimate the underlying measures $\mu_1,\ldots,\mu_k$ by the empirical measures $\widehat{\mu}_1:=\frac{1}{n_1}\sum_{i=1}^{n_1}\delta_{X^1_i},\ldots,\widehat{\mu}_k:=\frac{1}{n_k}\sum_{i=1}^{n_k}\delta_{X^k_i}$. To test for $MOT(\mu_1,\ldots,\mu_k)=0$ based on the data, we consider the asymptotic distribution $\mathcal{D}_0$ of the empirical estimator $MOT(\widehat{\mu}_1,\ldots,\widehat{\mu}_k)$ under $H_0$ and reject $H_0$ when the estimator value is large.
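On a finite support, each empirical measure is just the vector of sample proportions over the support points. A minimal sketch for a one-dimensional support (the helper name is ours, not from the paper):

```python
import numpy as np

def empirical_measure(sample, support):
    """Vector of sample proportions over the finite support {x_1, ..., x_N}."""
    sample = np.asarray(sample, dtype=float)
    return np.array([np.mean(sample == x) for x in support])
```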

More generally, once the asymptotic distribution $\mathcal{D}$ of the empirical MOT value is known, one can estimate various functionals of $\mathcal{D}$, such as Confidence Regions (CR’s) for the true MOT value under either $H_0$ or $H_a$ in (1). Inference of this type requires knowledge of the asymptotic distribution of the empirical version of the MOT value in (2). Our derivation of these distributions leverages a rich literature on asymptotic theory for the Wasserstein distance, whose main results we briefly review below.

Table 1: Examples of scientific data on finitely supported measures. Sample reference describes the set up, the data, and/or the comparison problem in each case.
Variable(s) of interest | Support of measures | Ref.
Age of patients (in years) | $\{0,1,2,\ldots,100\}\subset\mathbb{R}$ | [25]
Tumor size (in mm) | $\{1,2,\ldots,150\}\subset\mathbb{R}$ | [73]
Number of positive lymph nodes | $\{1,2,\ldots,7\}\subset\mathbb{R}$ | [31]
Joint distributions of the above variables | finite subsets of $\mathbb{R}^2$ or $\mathbb{R}^3$ | [34]
Cell counts | $\{0,1,\ldots,M\}^d\subset\mathbb{R}^d$ for $d$ sites | [16]
Demand over $N$ locations | $N$ points (longitude, latitude) in $\mathbb{R}^2$ | [3]
Disease rates over $N$ locations | $N$ points on the map in $\mathbb{R}^2$ | [48]
Pixel/voxel intensity in microscopy images | the grid in $\mathbb{R}^2$ or $\mathbb{R}^3$ | [78]

1.2 Existing results on weak limits for Optimal Transport

The squared 2-Wasserstein distance $W_2^2(\mu,\nu)$ is the optimal value of the problem

\inf_{\pi\in\mathcal{C}(\mu,\nu)}\int_{\mathcal{X}\times\mathcal{X}}c(x_{1},x_{2})\,d\pi(x_{1},x_{2}) \qquad (5)

with $c(x_1,x_2)=\|x_1-x_2\|^2$, which can be viewed as a particular case of the MOT problem (2) with $k=2$ measures. Being a true metric on the space of probability measures on a given metric space ([81]), the 2-Wasserstein distance $W_2$ (and, more generally, the $p$-Wasserstein distance $W_p$) provides a natural way to compare probability measures while respecting the geometry of the supporting metric space. Under this framework, the true measures are estimated by their empirical counterparts, and statistical inference is conducted using limiting laws for the empirical Wasserstein distance [59, 63].

The forms of the weak limits depend on two main factors: the dimensionality of the support and the nature of the measures (where the cases $\mu=\nu$ and $\mu\neq\nu$ may have different limits). Letting $OT_c$ denote the optimal value in (5) (with possibly different costs $c$), the limiting laws have the general form

\rho_{n}\left(OT_{c}(\widehat{\mu}_{n_{1}},\widehat{\nu}_{n_{2}})-OT_{c}(\mu,\nu)\right)\xrightarrow{\text{in law}}L \qquad (6)

with $\rho_n=\sqrt{\frac{n_1 n_2}{n_1+n_2}}$ (when only $\mu$ is estimated from the data while $\nu$ is not, the “one-sample” version with $\rho_n=\sqrt{n}$ is considered). When measures are supported on $\mathbb{R}$, the limits $L$ can be Gaussian, with variance depending on the truth $\mu,\nu$, under the “alternative” assumption $\mu\neq\nu$ [57, 22], and are non-Gaussian under the “null” assumption $\mu=\nu$ [22, 21]. When measures are supported on $\mathbb{R}^d$, $d>1$, and are absolutely continuous, the curse of dimensionality takes place: the empirical Wasserstein distance converges in expectation to the true one too slowly [26, 30]. It is still possible, however, to obtain a convergence statement similar to (6) in any dimension $d\geq 1$ by replacing the centering true value $OT_c(\mu,\nu)$ with the expectation of the empirical value $\mathbb{E}\left(OT_c(\widehat{\mu}_{n_1},\widehat{\nu}_{n_2})\right)$ [23]. The limit $L$ is Gaussian when $\mu\neq\nu$ and is degenerate (i.e. the limiting random variable has zero variance) when $\mu=\nu$.

A favorable situation arises when measures are supported on a finite space $\mathcal{X}=\{x_1,\ldots,x_N\}\subseteq\mathbb{R}^d$: [72] show that limit laws of the form (6) hold for the $W_p$ distance in any dimension $d\geq 1$ and use the resulting laws to construct statistical inference under $H_0$ and $H_a$ (the case of countable support is treated in [75]). The laws under either $H_0$ or $H_a$ are non-degenerate and given by

\rho_{n}\left(W_{p}^{p}(\widehat{\mu}_{n_{1}},\widehat{\nu}_{n_{2}})-W_{p}^{p}(\mu,\nu)\right)\xrightarrow{\text{in law}}\max_{(u_{1},u_{2})\in\Phi^{*}}\sqrt{\lambda}\langle u_{1},G_{1}\rangle+\sqrt{1-\lambda}\langle u_{2},G_{2}\rangle \qquad (7)

where $G_1,G_2$ are the weak limits of the multinomial processes $\sqrt{n_1}(\widehat{\mu}_{n_1}-\mu)$ and $\sqrt{n_2}(\widehat{\nu}_{n_2}-\nu)$, and $\Phi^*$ is the set of optimal solutions to the dual of the $W_p^p$ program (5). The results are extended in [46] to general measures supported on $\mathbb{R}^d$, $d=1,2,3$, and general costs $c$ (with a discussion of limitations in higher dimensions $d\geq 4$), thus providing a unified approach to weak limits of empirical OT costs centered by the true population value.

The starting point for the theoretical results of this paper is the weak limit (7) [72]. Inspired by this result, we establish limits of the form (6), where $OT_c$ is now the optimal value of the MOT program (2) with $k\geq 2$ measures supported on a finite space $\mathcal{X}=\{x_1,\ldots,x_N\}\subseteq\mathbb{R}^d$ for any $d\geq 1$. The implications of these results for $k$-sample inference and further theoretical findings related to our limits are summarized below.

1.3 Summary of contributions and outline

Asymptotic distributions of MOT: We provide the asymptotic distribution of the MOT cost on finite spaces by establishing Hadamard directional differentiability of the MOT functional and combining it with the functional Delta method [9, 29, 46, 70, 69, 65, 72, 75]. The resulting limit is a Hadamard directional derivative of MOT at the true $\mu$ in the direction of the limit $G$ of the empirical process $\rho_n\left(\widehat{\mu}_n-\mu\right)$ for a suitably defined rate $\rho_n$ (Theorem 2.2(a)). For $k=2$ measures, our limit recovers the one for the Wasserstein distance on finite spaces obtained in [72]; for $k>2$ measures, our limit allows us to construct novel inference procedures for the $k$-sample problem using MOT (Section 2.2). We specify the structure of the limit under $H_0$ and construct a low complexity stochastic upper bound on the null distribution (Theorem 2.2(b)) that is used to efficiently approximate the limit under $H_0$ (Section 3.2). The bound is tight for $k=2$. We specify the structure of the limit under $H_a$ and provide sufficient conditions for the limit to be Gaussian by leveraging results on the geometry of multitransportation polytopes from [28]. When the limits are not Gaussian, we construct Normal lower bounds on the alternative limiting distribution (Theorem 2.2(c)). Our stochastic bounds on the null and alternative distributions provide an analytically tractable way to analyze the power of Optimal Transport based tests (Section 2.4), which, to the best of our knowledge, has not yet been considered in the literature.

Consistency and power: We provide a novel power analysis for Optimal Transport based tests of the hypotheses (1) that encompasses both our test (12) and the two-sample test in [72], and can potentially be applied to tests based on the limiting laws in [75] and [46]. We show consistency of the test under fixed alternatives (Proposition 2.8), as well as uniform consistency over a certain broad class of alternatives (Theorem 2.9 and Proposition 2.11). We illustrate the theoretical power results in the $k=2$ case by providing a lower bound on the power function that explicitly relates the sample size and the effect size (Corollary 2.10, Figure 2). We also quantify how the population version of our statistic changes with the number of measures for certain sequences of alternatives (Lemmas 2.13 and 2.14), suggesting potential power advantages in these cases. For the case of small sample sizes, we provide a permutation version of the MOT based test (Section 3.3). Comparison with the state-of-the-art tests of [47, 50, 64] shows strong finite sample power performance of our tests (Figure 5).

Computational complexity results: Leveraging a recent complexity result for the MOT/barycenter program [2], we prove polynomial time complexity of the derivative bootstrap that consistently estimates the asymptotic distribution of MOT under $H_0$ (Lemma 3.2); polynomial complexity of the m-out-of-n bootstrap and the permutation procedure follows directly from [2] (Table 2). We demonstrate that the null upper bound of Theorem 2.2(b) can efficiently approximate the null distribution when the cardinality of the support $N=|\mathcal{X}|$ is large (Figure 4).

Applications to real data and software: We illustrate the performance of MOT based $k$-sample inference on two synthetic datasets, showing strong power performance when testing $H_0$ and the ability to produce meaningful and interpretable confidence regions under $H_a$ (Section 4.1). Further, we apply our methodology to real data on cancers in United States populations to confirm two claims in cancer studies that were previously established using different methodologies (Section 4.2, Figures 7 and 8). The current version of the software that implements our methods is available at https://github.com/kravtsova2/mot.

2 k-Sample inference on finite spaces using MOT

2.1 Notation and preliminary definitions

Denote the vector of the $k$ true measures supported on $\mathcal{X}=\{x_1,\ldots,x_N\}\subseteq\mathbb{R}^d$ by

\mu:=\begin{pmatrix}\mu_{1}\\ \vdots\\ \mu_{k}\end{pmatrix}\in\mathbb{R}^{kN}

and the vector of the empirical counterparts $\widehat{\mu}_i:=\frac{1}{n_i}\sum_{j=1}^{n_i}\delta_{X^i_j}$ by

\widehat{\mu}_{n}:=\begin{pmatrix}\widehat{\mu}_{1}\\ \vdots\\ \widehat{\mu}_{k}\end{pmatrix}\in\mathbb{R}^{kN}

with sample sizes

n:=(n_{1},\ldots,n_{k})

where $n\to\infty$ is to be interpreted as each sample size tending to infinity.

Let $MOT(\mu)$ be the optimal value of the program (2), which on the finite space $\mathcal{X}$ becomes the finite-dimensional linear program

\min_{\pi\geq 0}\langle c,\pi\rangle \qquad (8)
A\pi=\mu

The optimization variable $\pi\in\mathbb{R}^{N^k}$ is a column vector representing a joint probability distribution with marginals $\mu_1,\ldots,\mu_k$ (frequently called a multicoupling), the matrix $A\in\mathbb{R}^{kN\times N^k}$ encodes the constraints for $\pi$ to be a multicoupling (i.e. that summing certain entries of $\pi$ gives the marginals $\mu_1,\ldots,\mu_k$), and the cost column vector $c\in\mathbb{R}^{N^k}$ contains the averaged squared Euclidean distances from the measure support points to their averages, as given by (3). (The linear program formulation of the MOT problem is discussed, for example, on p. 3 of [54].)

For the reader’s convenience, Example 1 below illustrates the structure of the linear program (8) in the case of three measures, each supported on two points in $\mathbb{R}^1$:

Example 1 (Illustration of the MOT optimization problem for three measures).

Consider the finite set $\mathcal{X}=\{5,10\}\subset\mathbb{R}^1$, which could represent, for instance, tumor sizes (in centimeters) of cancer patients, and consider probability measures supported on $\mathcal{X}$ representing the probabilities of occurrence of $5\text{cm}$ and $10\text{cm}$ tumors in a given population of cancer patients. Suppose that the measure $\mu_1$ has probabilities recorded in a vector $\mu_1:=\begin{pmatrix}\mu_1^1\\ \mu_1^2\end{pmatrix}$, and, similarly, $\mu_2:=\begin{pmatrix}\mu_2^1\\ \mu_2^2\end{pmatrix}$, $\mu_3:=\begin{pmatrix}\mu_3^1\\ \mu_3^2\end{pmatrix}$. The multimarginal optimal transport problem (8) is to minimize a linear (with coefficients in $c$) function of a measure $\pi$ on the product space $\mathcal{X}\times\mathcal{X}\times\mathcal{X}$ whose marginals are $\mu_1,\mu_2$, and $\mu_3$, respectively (in this example, the three supports are all equal to $\mathcal{X}$). Technically, $\pi$ is an order-$3$ tensor, i.e. an array with $3$ indices with values in $\{1,2\}$, but for notational convenience we represent it by a long vector $(\pi_{ijk})_{i,j,k\in\{1,2\}}\in\mathbb{R}^{2\cdot 2\cdot 2}$. The cost $c_{ijk}$ in the objective of (8) associated with the entry $\pi_{ijk}$ is the average of the squared differences (or squared norms of the differences in the higher-dimensional case) from the support points $x_i,x_j,x_k$ to their mean $\overline{M}_{ijk}=\frac{1}{3}(x_i+x_j+x_k)$, i.e.

c_{ijk}=\frac{1}{3}\left((x_{i}-\overline{M}_{ijk})^{2}+(x_{j}-\overline{M}_{ijk})^{2}+(x_{k}-\overline{M}_{ijk})^{2}\right)

The objective is to minimize the total discrepancy weighted by $\pi$, which is given by

\langle c,\pi\rangle=c_{111}\pi_{111}+c_{112}\pi_{112}+\cdots+c_{222}\pi_{222}

The multicoupling $\pi$ is subject to having non-negative entries and is constrained linearly by $A\pi=\mu$. The constraint matrix $A$ ensures that the appropriate entries of $\pi$ sum to the given marginals $\mu_1$, $\mu_2$, and $\mu_3$, i.e.

\underbrace{\begin{pmatrix}1&1&1&1&0&0&0&0\\ 0&0&0&0&1&1&1&1\\ 1&1&0&0&1&1&0&0\\ 0&0&1&1&0&0&1&1\\ 1&0&1&0&1&0&1&0\\ 0&1&0&1&0&1&0&1\end{pmatrix}}_{A}\underbrace{\begin{pmatrix}\pi_{111}\\ \pi_{112}\\ \pi_{121}\\ \pi_{122}\\ \pi_{211}\\ \pi_{212}\\ \pi_{221}\\ \pi_{222}\end{pmatrix}}_{\pi}=\underbrace{\begin{pmatrix}\mu_{1}^{1}\\ \mu_{1}^{2}\\ \mu_{2}^{1}\\ \mu_{2}^{2}\\ \mu_{3}^{1}\\ \mu_{3}^{2}\end{pmatrix}}_{\mu} \qquad (9)

finishing the example.
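Example 1 can be solved numerically with any LP solver. The sketch below builds $c$ and $A$ exactly as in (9) and solves (8) with scipy's linprog; the marginals chosen here ($\mu_1=\mu_2=(0.5,0.5)$, $\mu_3=(0.2,0.8)$) are illustrative choices of ours, not from the paper.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

support = np.array([5.0, 10.0])                        # X = {5, 10} as in Example 1
k, N = 3, len(support)
tuples = list(itertools.product(range(N), repeat=k))   # index tuples (i, j, k)

# Cost vector: entry c_ijk is the average squared distance of (x_i, x_j, x_k)
# to their mean, as in eq. (3)
c = np.array([np.mean((support[list(t)] - support[list(t)].mean()) ** 2)
              for t in tuples])

# Constraint matrix A of eq. (9): row (m, s) sums the entries of pi whose
# m-th index equals s, producing the m-th marginal
A = np.zeros((k * N, N ** k))
for col, t in enumerate(tuples):
    for m in range(k):
        A[m * N + t[m], col] = 1.0

mu = np.concatenate([[0.5, 0.5], [0.5, 0.5], [0.2, 0.8]])  # stacked marginals
res = linprog(c, A_eq=A, b_eq=mu, bounds=(0, None), method="highs")
print(res.fun)   # the optimal MOT value
```

Here the optimal multicoupling puts as much mass as possible on the zero-cost "diagonal" tuples, and only the leftover mass pays the mixed-tuple cost.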

The dual program of (8) is given by

\max_{u}\langle u,\mu\rangle \qquad (10)
A^{\prime}u\leq c

(the derivation of the dual follows from the standard theory of linear programming, e.g. Section 4.1 of [5]). The column vector $u:=\begin{pmatrix}u_1\\ \vdots\\ u_k\end{pmatrix}\in\mathbb{R}^{kN}$ contains the dual variables, one block for each measure, and the objective of (10) can be thought of as summing the contributions $\langle u_1,\mu_1\rangle+\cdots+\langle u_k,\mu_k\rangle$.

Let $\Phi^*$ denote the set of dual optimal solutions to (10). This set consists of all vectors $u$ that attain the maximum value of the dual objective (equal to the minimum value of the primal objective by strong duality, e.g. Theorem 4.4 of [5]) and satisfy the dual constraints, i.e.

\Phi^{*}:=\left\{u=\begin{pmatrix}u_{1}\\ \vdots\\ u_{k}\end{pmatrix}\in\mathbb{R}^{kN}:\langle u,\mu\rangle=MOT(\mu),\ A^{\prime}u\leq c\right\} \qquad (11)
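An element of $\Phi^*$ can be read off a solved instance of (8): with scipy's HiGHS backend, the marginals of the equality constraints are one dual optimal vector $u$, and strong duality $\langle u,\mu\rangle=MOT(\mu)$ can be checked numerically. A self-contained sketch on a small $k=2$ instance (illustrative marginals of ours; when $\Phi^*$ is not a singleton, the solver returns just one of its elements):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Small instance of program (8): k = 2 measures on the support {0, 1}
support = np.array([0.0, 1.0])
k, N = 2, len(support)
tuples = list(itertools.product(range(N), repeat=k))
c = np.array([np.mean((support[list(t)] - support[list(t)].mean()) ** 2)
              for t in tuples])
A = np.zeros((k * N, N ** k))
for col, t in enumerate(tuples):
    for m in range(k):
        A[m * N + t[m], col] = 1.0
mu = np.concatenate([[0.3, 0.7], [0.6, 0.4]])    # mu_1 != mu_2

res = linprog(c, A_eq=A, b_eq=mu, bounds=(0, None), method="highs")
u = res.eqlin.marginals        # a dual optimal vector, i.e. one element of Phi*
print(np.dot(u, mu), res.fun)  # equal by strong duality
```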

We consider the asymptotic behavior of the scaled and centered empirical estimator $MOT(\widehat{\mu}_n)$ by establishing the weak limit

\rho_{n}\left(MOT(\widehat{\mu}_{n})-MOT(\mu)\right)\xrightarrow{\text{in law}}X

as $n\to\infty$, where $X\overset{d}{=}X_0$ under $H_0$ and $X\overset{d}{=}X_a$ under $H_a$. The set $\Phi^*$ will be needed to define the limit $X$.

2.2 Definitions of $H_0$ testing and $H_a$ inference procedures

Consider the statistic

T_{n}:=\rho_{n}\left(MOT(\widehat{\mu}_{n})-MOT(\mu)\right)

where $MOT(\mu)=0$ under $H_0$ and $MOT(\mu)>0$ under $H_a$.

An $\alpha$-level test of $H_0$ would reject $H_0$ if $\rho_n MOT(\widehat{\mu}_n)$ is large, i.e. if $T_n$ exceeds the $(1-\alpha)$-th quantile of its null distribution $\mathcal{D}_0$. However, as Theorem 2.2 shows, $\mathcal{D}_0$ depends on the unknown true $\mu$, and hence care must be taken to ensure that the estimated cut-off used for the test still results in the (asymptotic) level $\alpha$.

To this end, we consider a consistent bootstrap estimator of $\mathcal{D}_0$ given in Proposition 3.1 and denote its $(1-\alpha)$-th quantile by $c_{\alpha,\mathcal{D}_0}$. Consistency of the bootstrap is shown using the results of [29], and by Corollary 3.2 of the same work such a bootstrap based cut-off gives an asymptotic level $\alpha$ test of $H_0$. Using this cut-off, we define the asymptotic test of $H_0$ as the map

\phi_{n,\mu}:=\begin{cases}1&\text{ if }T_{n}\geq c_{\alpha,\mathcal{D}_{0}}\\ 0&\text{ otherwise }\end{cases} \qquad (12)
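The bootstrap estimator of $\mathcal{D}_0$ is developed in Section 3; for intuition, the permutation version mentioned in Section 3.3 is easy to sketch. The code below (a sketch of ours, not the paper's implementation, with hypothetical helper names) pools the $k$ samples, re-splits them at random, and reports a small p-value when the observed MOT value is large relative to the permutation distribution:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def mot_value(measures, support):
    """Optimal value of the MOT linear program (8) on a finite support."""
    support = np.asarray(support, dtype=float)
    k, N = len(measures), len(support)
    tuples = list(itertools.product(range(N), repeat=k))
    c = np.array([np.mean((support[list(t)] - support[list(t)].mean()) ** 2)
                  for t in tuples])
    A = np.zeros((k * N, N ** k))
    for col, t in enumerate(tuples):
        for m in range(k):
            A[m * N + t[m], col] = 1.0
    return linprog(c, A_eq=A, b_eq=np.concatenate(measures),
                   bounds=(0, None), method="highs").fun

def empirical(sample, support):
    """Sample proportions of a one-dimensional sample over the support."""
    return np.array([np.mean(np.asarray(sample) == x) for x in support])

def permutation_pvalue(samples, support, n_perm=200, seed=0):
    """Permutation analogue of the test (12): small p-values reject H0."""
    rng = np.random.default_rng(seed)
    sizes = [len(s) for s in samples]
    observed = mot_value([empirical(s, support) for s in samples], support)
    pooled = np.concatenate(samples)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        parts = np.split(pooled, np.cumsum(sizes)[:-1])
        stat = mot_value([empirical(p, support) for p in parts], support)
        count += stat >= observed - 1e-12
    return (1 + count) / (1 + n_perm)
```

The p-value uses the standard add-one correction; the exact permutation scheme of the paper's Section 3.3 may differ in details.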

Similarly, the distribution of $T_n$ under $H_a$ is consistently estimated by the bootstrap in Proposition 3.1, with the resulting $(\alpha/2)$-th and $(1-\alpha/2)$-th quantiles denoted by $c_{\alpha/2,\mathcal{D}_a}$ and $c_{1-\alpha/2,\mathcal{D}_a}$, respectively. The asymptotic $(1-\alpha)$-level Confidence Region for $MOT(\mu)$ under $H_a$ is given by

\left(MOT(\widehat{\mu}_{n})-\frac{1}{\rho_{n}}c_{1-\alpha/2,\mathcal{D}_{a}},\ MOT(\widehat{\mu}_{n})-\frac{1}{\rho_{n}}c_{\alpha/2,\mathcal{D}_{a}}\right) \qquad (13)

2.3 Asymptotic distributions of MOT under $H_0$ and $H_a$

Figure 1: Illustration of the asymptotic distributions of MOT given by Theorem 2.2 on the set up of Example 1. A. Under $H_0$ (Theorem 2.2(b)): the null distribution $X_0$, the upper bound $UB_0$, and MOT values sampled directly (black line) by sampling empirical measures from the truth and evaluating the MOT value. All densities here and in the rest of the paper are estimated by kernel density estimators in ggplot2 [82] with default parameters. B. Under $H_a$ (Theorem 2.2(c)): the Gaussian limit under Condition (A1) (Left) and a non-Gaussian limit (Right) with Normal Lower Bounds (NLB’s) ($NLB\overset{d}{=}X_a$ in the Gaussian case).

Theorem 2.2(a) provides the general form of the asymptotic distribution of $MOT(\widehat{\mu}_n)$ on finite spaces. This distribution is given by the Hadamard directional derivative of the MOT functional, which is the optimal value of a linear program with a feasible set consisting of the dual optimal solutions $\Phi^*$ (11). If $\Phi^*$ is a singleton, the limit in Theorem 2.2(a) is a linear combination of Gaussians, and hence is also Gaussian. If not, the limit is the maximum (taken over the feasible set $\Phi^*$) of such linear combinations.

By the theory of linear programming, it is possible to assess whether the set of dual optimal solutions $\Phi^*$ is a singleton based on the corresponding set of basic optimal solutions to the primal program (we use Theorem 5.6.1 of [71], and Chapters 4 and 5 of [5] for the general linear programming results employed below). In our case, the basic optimal solutions to the primal program (8) are the vertices of the multitransportation polytope $P(\mu_1,\ldots,\mu_k):=\{\pi\geq 0:A\pi=\mu\}$ given by multicouplings $\pi^*$. These vertices contain at most $\mathrm{rank}(A)=kN-k+1$ positive entries, and a vertex is termed degenerate if it contains strictly fewer (p. 366 of [28]).

The dual optimal set $\Phi^*$ cannot be a singleton if an optimal solution to the primal MOT program is unique and degenerate. This is always the case under $H_0$: the unique optimal solution $\pi^*$ is given by the “identity” multicoupling (with the entries of $\mu_1$ at the tuple indices whose coordinates coincide and zeros elsewhere, so it is degenerate). Hence, the asymptotic distribution of MOT under $H_0$ is never Gaussian (Theorem 2.2(b)).

The dual optimal set $\Phi^*$ is a singleton if there exists a non-degenerate primal optimal vertex, i.e. an optimal multicoupling $\pi^*$ with $kN-k+1$ positive entries. This can happen under certain $H_a$’s; in particular, it happens if the multitransportation polytope $P(\mu_1,\ldots,\mu_k)$ contains no degenerate vertices. In this case, $\Phi^*$ is a singleton, and the corresponding asymptotic distribution of MOT is Gaussian (Theorem 2.2(c)). We use a result in discrete geometry from [28] to provide a sufficient condition (A1) that leads to Gaussian limits under $H_a$:

Condition (A1).

(Regularity, Definition 1.5 in Chapter 8 of [28]) For each $i=1,\ldots,k$, order the entries of the vector $\mu_i$ as $\mu_i^1\geq\mu_i^2\geq\cdots\geq\mu_i^N$, and assume that $\mu_i^1<\mu_{i+1}^1$ for $i=1,\ldots,k-1$. The multitransportation polytope $P(\mu_1,\ldots,\mu_k)$ is regular if

\mu_{i}^{N}+\sum_{j=i+1}^{k}\mu_{j}^{1}>k-i,\qquad\forall i=1,\ldots,k-1

A regular multitransportation polytope does not have degenerate vertices (Lemma 1.4 in Chapter 8 of [28]).
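Condition (A1) is straightforward to check numerically. The sketch below (our helper, not from the paper) sorts each measure's entries in decreasing order, relabels the measures so that their largest entries strictly increase (the labeling assumed in the definition; MOT is symmetric in its arguments), and returns False when the required strict ordering fails:

```python
import numpy as np

def is_regular(measures):
    """Check Condition (A1): regularity of the polytope P(mu_1, ..., mu_k)."""
    k = len(measures)
    # Sort each measure's entries decreasingly: mu_i^1 >= ... >= mu_i^N
    m = [np.sort(np.asarray(mu, dtype=float))[::-1] for mu in measures]
    # Relabel the measures so that the largest entries increase across i
    m.sort(key=lambda v: v[0])
    # The definition assumes mu_i^1 < mu_{i+1}^1 strictly (no ties)
    if any(m[i][0] >= m[i + 1][0] for i in range(k - 1)):
        return False
    # Regularity: mu_i^N + sum_{j > i} mu_j^1 > k - i for all i = 1, ..., k-1
    # (code index i is 0-based, so the right-hand side is (k - 1) - i)
    return all(m[i][-1] + sum(m[j][0] for j in range(i + 1, k)) > (k - 1) - i
               for i in range(k - 1))
```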

Remark 2.1.

For $k=2$ measures, Condition (A1) is implied by the following No Subset Sum condition that ensures that a transportation polytope $P(\mu_1,\mu_2)$ has no degenerate vertices (this condition is mentioned by [74] for uniqueness of Kantorovich potentials for $k=2$ finitely supported measures): there are no proper subsets of indices $I,J\subset[N]$ such that $\sum_{i\in I}\mu_1^i=\sum_{j\in J}\mu_2^j$. This condition is both necessary and sufficient to exclude degenerate vertices in the $k=2$ case (see, e.g., Theorem 1.2 in Chapter 6 of [28]).

Theorem 2.2 (Asymptotic distribution of MOT on finite spaces).

Assume that the sizes of the $k$ samples $n_1,\ldots,n_k\to\infty$ satisfy $\frac{n_i}{n_1+\ldots+n_k}\to\lambda_i\in(0,1)$. Denote $\rho_n:=\frac{\sqrt{n_1\cdot\ldots\cdot n_k}}{\left(\sqrt{n_1+\ldots+n_k}\right)^{k-1}}$ and $a_i:=\prod_{j\neq i}\lambda_j$. Then,

  • (a)

    The asymptotic distribution of MOT is given by

    \rho_{n}\left(MOT(\widehat{\mu}_{n})-MOT(\mu)\right)\xrightarrow{\text{in law}}\max_{u\in\Phi^{*}}\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},G_{i}\rangle \qquad (14)

    where the feasible set $\Phi^*$ is given by (11), and $G_i\overset{\text{indep.}}{\sim}N(0,\Sigma_i)$ with $\Sigma_i=\mathrm{diag}(\mu_i)-\mu_i\mu_i^{\prime}$.

  • (b)

    Under H0H_{0}, the limit in (14) is non-Gaussian, and given by

    ρn(MOT(μ^n)0)in lawX0𝒟0\rho_{n}\left(MOT(\widehat{\mu}_{n})-0\right)\xrightarrow[]{\text{in law}}X_{0}\sim\mathcal{D}_{0}

    where 𝒟0\mathcal{D}_{0} is given by

    maxu\displaystyle\max_{u} i=1kaiui,Gi\displaystyle\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},G_{i}\rangle (15)
    s.t. i=1kui=0\displaystyle\sum_{i=1}^{k}u_{i}=0
    Auc\displaystyle A^{\prime}u\leq c

where Giindep.N(0,Σ1)G_{i}\overset{\text{indep.}}{\sim}N(0,\Sigma_{1}) and Σ1=diag(μ1)μ1μ1\Sigma_{1}=\text{diag}(\mu_{1})-\mu_{1}\mu_{1}^{\prime}.

    Furthermore, there exists UB0𝒟UB0{UB}_{0}\sim\mathcal{D}_{{UB}_{0}} on the same probability space as X0X_{0} such that UB0X0{UB}_{0}\geq X_{0} everywhere, and 𝒟UB0\mathcal{D}_{{UB}_{0}} is given by

    maxu\displaystyle\max_{u} i=2kui,aiGia1G1\displaystyle\sum_{i=2}^{k}\langle u_{i},\sqrt{a_{i}}G_{i}-\sqrt{a_{1}}G_{1}\rangle (16)
    A~uc~\displaystyle\widetilde{A}^{\prime}u\leq\widetilde{c}

    where A~uc~\widetilde{A}^{\prime}u\leq\widetilde{c} is a subset of constraints from (15).

  • (c)

    Under HaH_{a},

    ρn(MOT(μ^n)MOT(μ))in lawXa𝒟a\rho_{n}\left(MOT(\widehat{\mu}_{n})-MOT(\mu)\right)\xrightarrow[]{\text{in law}}X_{a}\sim\mathcal{D}_{a}

    where 𝒟a\mathcal{D}_{a} is given by (14). Furthermore, for every uΦu^{*}\in\Phi^{*} given by (11), there exists a random variable NLBuNLB_{u^{*}} on the same probability space as XaX_{a}, such that NLBuXaNLB_{u^{*}}\leq X_{a} everywhere, and

    NLBu𝒩(0,i=1kaiuiΣiui)NLB_{u^{*}}\sim\mathcal{N}\left(0,\sum_{i=1}^{k}a_{i}{u_{i}^{*}}^{\prime}\Sigma_{i}u_{i}^{*}\right) (17)

    If Condition (A1) holds, then Φ\Phi^{*} is a singleton {u}\{u^{*}\}, and Xa=𝑑NLBuX_{a}\overset{d}{=}NLB_{u^{*}}.
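To make the limit in part (b) concrete, the following sketch draws from the k=2k=2 null law on a small 1-D support. It assumes the dual constraints A'u <= c in (10) take the standard form u1(x_i) + u2(x_j) <= c_ij with c_ij = (x_i - x_j)^2/4, and equal sample sizes so that a_1 = a_2 = 1/2; on the kernel u1 + u2 = 0 we substitute u2 = -u1, and pin the first entry of u1 to zero to remove the additive-constant degeneracy (this leaves the optimum unchanged since the entries of each G_i sum to zero). All names are ours:

```python
import numpy as np
from scipy.optimize import linprog

def sample_null_limit(x, mu1, rng):
    """One draw from the k = 2 null law (15) on a 1-D support x (sketch).

    Assumes the dual constraints A'u <= c are u1(x_i) + u2(x_j) <= c_ij
    with c_ij = (x_i - x_j)^2 / 4; on the kernel u1 + u2 = 0 they
    become u1(x_i) - u1(x_j) <= c_ij.
    """
    N = len(x)
    # G = D^{1/2} Z - mu (sqrt(mu) . Z) has covariance diag(mu) - mu mu'
    # and entries summing to zero exactly, as multinomial fluctuations do.
    s = np.sqrt(mu1)
    G1, G2 = (s * z - mu1 * (s @ z) for z in rng.standard_normal((2, N)))
    g = np.sqrt(0.5) * G1 - np.sqrt(0.5) * G2      # objective vector for u1
    A_ub, b_ub = [], []
    for i in range(N):
        for j in range(N):
            row = np.zeros(N)
            row[i] += 1.0
            row[j] -= 1.0
            A_ub.append(row)
            b_ub.append((x[i] - x[j]) ** 2 / 4.0)
    bounds = [(0.0, 0.0)] + [(None, None)] * (N - 1)  # pin u1[0] = 0
    res = linprog(-g, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return -res.fun                                   # maximized objective

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0])
mu1 = np.array([0.2, 0.5, 0.3])
draws = np.array([sample_null_limit(x, mu1, rng) for _ in range(500)])
print(draws.min() >= -1e-9)   # u = 0 is feasible, so every draw is >= 0
```

Repeating the draw many times and taking the empirical (1-alpha)-quantile gives a Monte Carlo approximation of the cut-off for this instance.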

Proof summary for Theorem 2.2.

Proof of part (a) is outlined below, with details in Appendix A.1. In what follows, for each i=1,,ki=1,\ldots,k, we view the measures μi𝒫(𝒳)(l1(𝒳),l1)\mu_{i}\in\mathcal{P}(\mathcal{X})\subseteq\left(l^{1}(\mathcal{X}),\|\cdot\|_{l^{1}}\right), and the dual vectors ui(l(𝒳),l)u_{i}\in\left(l^{\infty}(\mathcal{X}),\|\cdot\|_{l^{\infty}}\right). The weak convergence and Hadamard directional differentiability are with respect to l1l^{1} norm on i=1kl1(𝒳)\bigotimes_{i=1}^{k}l^{1}(\mathcal{X}).

  • Step 1 Establish, for a suitable scaling ρn\rho_{n}, the weak limit aG\sqrt{a}G of the empirical process

    ρn(μ^nμ)in lawaG\rho_{n}\left(\widehat{\mu}_{n}-\mu\right)\xrightarrow[]{\text{in law}}\sqrt{a}G
  • Step 2 Confirm that the functional μMOT(μ)\mu\longrightarrow MOT(\mu) is Hadamard directionally differentiable at μ\mu with derivative

    fμ(G)=maxuΦi=1kaiui,Gif_{\mu}^{\prime}(G)=\max_{u\in\Phi^{*}}\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},G_{i}\rangle
  • Step 3 Use the Delta Method for Hadamard directionally differentiable maps [65, 70] to conclude that

    ρn(f(μ^n)f(μ))in lawfμ(G)\rho_{n}\left(f(\widehat{\mu}_{n})-f(\mu)\right)\xrightarrow[]{\text{in law}}f^{\prime}_{\mu}(G)

Proof of part (b) is given in Appendix A.2. It provides the exact form of the proposed upper bound UB0UB_{0} by reporting the constraints in A~\widetilde{A}. To construct A~\widetilde{A}, we consider how the inequality constraints AucA^{\prime}u\leq c behave on the kernel of the map {ui=1kui}\{u\longrightarrow\sum_{i=1}^{k}u_{i}\}. The resulting upper bound has only polynomially many constraints and can be sampled efficiently to approximate the null distribution in Section 3.2. We remark that the proposed bound is not unique: in particular, it can be strengthened by including more constraints from AA (Section 5.2). The proposed bound is tight when k=2k=2.333We remark here that the null distribution program (15) can be written with k1k-1 dual variables instead of kk due to the constraint i=1kui=0\sum_{i=1}^{k}u_{i}=0; this is what [72] refer to in the k=2k=2 case (p. 227). We choose to keep the form (15) for notational convenience.

For part (c), if the limit is not Gaussian (i.e., the set Φ\Phi^{*} of dual optimal solutions is not a singleton), one can take any uΦu^{*}\in\Phi^{*} and consider the (random) objective (14) evaluated at uu^{*}. The resulting value lower-bounds the optimal value of the maximization program (14) and is distributed according to (17).

Observation 2.3.

The entries of the cost vector ci1i2ikc_{i_{1}i_{2}\cdots i_{k}} indexed by i1,,ik{1,,N}i_{1},\cdots,i_{k}\in\{1,\ldots,N\} with k1k-1 coinciding index values can be written in terms of the distance between two points with unique indices scaled by k1k2\frac{k-1}{k^{2}}. For example, c112=k1k2x1x22c_{1\cdots 12}=\frac{k-1}{k^{2}}\|x_{1}-x_{2}\|^{2}. The details are provided in Appendix A.3.
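Observation 2.3 can be checked numerically, assuming the entries of cc are the barycentric costs c(x_{i_1}, ..., x_{i_k}) = min_y (1/k) sum_l ||x_{i_l} - y||^2, attained at the mean of the tuple (the normalization consistent with MOT equaling one quarter of the squared Wasserstein distance for k=2k=2):

```python
import numpy as np

def barycentric_cost(points):
    """c(x_1, ..., x_k) = min_y (1/k) sum_l ||x_l - y||^2, attained at the mean."""
    pts = np.asarray(points, dtype=float)
    return float(np.mean(np.sum((pts - pts.mean(axis=0)) ** 2, axis=1)))

x1, x2 = np.array([0.0, 0.0]), np.array([3.0, 4.0])   # ||x1 - x2||^2 = 25
for k in (2, 3, 4, 5):
    # tuple (x1, ..., x1, x2) with k - 1 coinciding indices, e.g. c_{1...12}
    lhs = barycentric_cost([x1] * (k - 1) + [x2])
    rhs = (k - 1) / k**2 * 25.0
    print(k, lhs, rhs)   # the two columns agree for every k
```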

Lemma 2.4 (Bounds on the dual variables).

Fix k2k\geq 2. Let u=(u1,,uk)u=(u_{1},\ldots,u_{k}) be optimal solutions to the dual MOTMOT program (10) satisfying i=1kui=0\sum_{i=1}^{k}u_{i}=0 and chosen444Recall that adding a constant to any dual variable uiu_{i} and subtracting the same constant from any other dual variable uiu_{i^{\prime}} does not change the dual objective and does not violate the dual constraints in (10). Hence, given any vector of dual solutions u=(u1,,uk)u=(u_{1},\ldots,u_{k}), the first entries of u2,,uku_{2},\ldots,u_{k} can be normalized to zero, and the constraint i=1kui=0\sum_{i=1}^{k}u_{i}=0 would force the first entry of u1u_{1} to be zero as well. Such normalization is frequently done to avoid redundant solutions - see, e.g., the definition of dual transportation polyhedron in [4]. such that the first entries ui1=0{u_{i}}_{1}=0. Then for each i=1,,ki=1,\ldots,k, the jjth entry of uiu_{i} is bounded as

|ui|jk1k2x1xj2|u_{i}|_{j}\leq\frac{k-1}{k^{2}}\|x_{1}-x_{j}\|^{2}

where 2\|\cdot-\cdot\|^{2} is the squared distance on the ground metric space 𝒳={x1,,xN}\mathcal{X}=\{x_{1},\ldots,x_{N}\}. It follows that

uik1k2C(𝒳), i=1,,k\|u_{i}\|\leq\frac{k-1}{k^{2}}C(\mathcal{X}),\text{ }i=1,\ldots,k

where C(𝒳)C(\mathcal{X}) depends only on the ground metric space 𝒳\mathcal{X}.

Remark 2.5.

The assumption i=1kui=0\sum_{i=1}^{k}u_{i}=0 holds for those μ\mu for which the primal optimal solution - the multicoupling (πi1,,ik)i1,,ik(\pi_{i_{1},\ldots,i_{k}})_{i_{1},\ldots,i_{k}} - assigns a positive mass to the “diagonal” tuples (xi1,,xi1)(x_{i_{1}},\ldots,x_{i_{1}}), i.e. πi1,,i1>0\pi_{i_{1},\ldots,i_{1}}>0 for i1=1,,Ni_{1}=1,\ldots,N. By the complementary slackness result in linear programming (see, e.g., Theorem 4.5 of [5]), in this case, the corresponding constraints of the dual

(u1)j++(uk)jcj,,j=0, j=1,,N(u_{1})_{j}+\cdots+(u_{k})_{j}\leq c_{j,\ldots,j}=0,\text{ }j=1,\ldots,N

hold with equality, giving i=1kui=0\sum_{i=1}^{k}u_{i}=0. This always holds under H0H_{0} and frequently happens under HaH_{a}.

Using the above results, Proposition 2.6 defines an upper bound on all test cut-offs cα,𝒟0c_{\alpha,\mathcal{D}_{0}}, which is independent of the nature and the number of measures in μ(k)=(μ1,,μk)\mu(k)=(\mu_{1},\ldots,\mu_{k}). This bound is used to prove consistency of the test (12) with cut-offs cα,𝒟0c_{\alpha,\mathcal{D}_{0}} uniformly over μ\mu (Theorem 2.9) and kk (Proposition 2.11).

Proposition 2.6 (Bound for test cut-off cα,𝒟0c_{\alpha,\mathcal{D}_{0}}).

Fix the test level α(0,1)\alpha\in(0,1). There exists cα(𝒳)c_{\alpha}(\mathcal{X}) depending only on the ground metric space 𝒳\mathcal{X} such that

cα,𝒟0cα(𝒳) for all μ(k) supported on 𝒳c_{\alpha,\mathcal{D}_{0}}\leq c_{\alpha}(\mathcal{X})\text{ for all }\mu(k)\text{ supported on }\mathcal{X}

where μ(k)=(μ1,,μk)\mu(k)=(\mu_{1},\ldots,\mu_{k}). In particular, for any k2k\geq 2,

cα,𝒟0(i=1kai)k1k2C(𝒳)8ln(α/4)=:cα(𝒳)c_{\alpha,\mathcal{D}_{0}}\leq\left(\sum_{i=1}^{k}\sqrt{a_{i}}\right)\frac{k-1}{k^{2}}C(\mathcal{X})\sqrt{-8\ln{(\alpha/4)}}=:c_{\alpha}(\mathcal{X})

For equal sample sizes n1==nkn_{1}=\cdots=n_{k}, this gives

cα,𝒟0k1k(k+1)/2C(𝒳)8ln(α/4)18C(𝒳)8ln(α/4)c_{\alpha,\mathcal{D}_{0}}\leq\frac{k-1}{k^{(k+1)/2}}C(\mathcal{X})\sqrt{-8\ln{(\alpha/4)}}\leq\frac{1}{\sqrt{8}}C(\mathcal{X})\sqrt{-8\ln{(\alpha/4)}}

for all k2k\geq 2.

Proof.

For any given μ(k)\mu(k), consider the null distribution 𝒟0\mathcal{D}_{0} given by linear program (15) and bound its objective (everywhere) as

i=1kaiui,Gii=1kai|ui,Gi|\displaystyle\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},G_{i}\rangle\leq\sum_{i=1}^{k}\sqrt{a_{i}}\left|\langle u_{i},G_{i}\rangle\right| i=1kaiuiGi\displaystyle\leq\sum_{i=1}^{k}\sqrt{a_{i}}\|u_{i}\|\|G_{i}\|
Lemma 2.4(i=1kai)k1k2C(𝒳)G1\displaystyle\underset{\text{Lemma }\ref{lem:dual_bd_null}}{\leq}\left(\sum_{i=1}^{k}\sqrt{a_{i}}\right)\frac{k-1}{k^{2}}C(\mathcal{X})\|G_{1}\|

Thus the optimal value X0X_{0} of (15) is also bounded by the same quantity.

The desired cut-off cα(𝒳)c_{\alpha}(\mathcal{X}) is obtained using the cut-off tαt_{\alpha} for the distribution of G1\|G_{1}\|. To define tαt_{\alpha}, we use the following concentration result from [53]:

Concentration of G1\|G_{1}\| (Equation (3.5) from [53]). For a centered Gaussian random vector G1G_{1}, given any t>0t>0,

(G1t)4et28𝔼G12\mathbb{P}\left(\|G_{1}\|\geq t\right)\leq 4e^{-\frac{t^{2}}{8\mathbb{E}\|G_{1}\|^{2}}} (18)

Recalling that G1=(G11,,G1N)G_{1}=(G_{1}^{1},\ldots,G_{1}^{N}) with Cov(G1)=diag(μ1)μ1μ1\text{Cov}(G_{1})=\text{diag}(\mu_{1})-\mu_{1}\mu_{1}^{\prime} where μ1=(p1,,pN)\mu_{1}=(p_{1},\ldots,p_{N}), we get that

𝔼G12\displaystyle\mathbb{E}\|G_{1}\|^{2} =𝔼[(G11)2++(G1N)2]\displaystyle=\mathbb{E}\left[(G_{1}^{1})^{2}+\cdots+(G_{1}^{N})^{2}\right]
=p1(1p1)++pN(1pN)\displaystyle=p_{1}(1-p_{1})+\cdots+p_{N}(1-p_{N})
=1i=1Npi2<1\displaystyle=1-\sum_{i=1}^{N}p_{i}^{2}<1

Hence, (G1t)4et28\mathbb{P}\left(\|G_{1}\|\geq t\right)\leq 4e^{-\frac{t^{2}}{8}}. Note that this bound holds for any μ1\mu_{1}.

We let tα:=8ln(α/4)t_{\alpha}:=\sqrt{-8\ln{(\alpha/4)}} which ensures (G1tα)α\mathbb{P}\left(\|G_{1}\|\geq t_{\alpha}\right)\leq\alpha. Thus,

(X0[i=1kai]k1k2C(𝒳)tα)\displaystyle\mathbb{P}\left(X_{0}\geq\left[\sum_{i=1}^{k}\sqrt{a_{i}}\right]\frac{k-1}{k^{2}}C(\mathcal{X})\cdot t_{\alpha}\right)
\displaystyle\leq ([i=1kai]k1k2C(𝒳)G1[i=1kai]k1k2C(𝒳)tα)\displaystyle\mathbb{P}\left(\left[\sum_{i=1}^{k}\sqrt{a_{i}}\right]\frac{k-1}{k^{2}}C(\mathcal{X})\|G_{1}\|\geq\left[\sum_{i=1}^{k}\sqrt{a_{i}}\right]\frac{k-1}{k^{2}}C(\mathcal{X})\cdot t_{\alpha}\right)
=\displaystyle= (G1tα)α\displaystyle\mathbb{P}\left(\|G_{1}\|\geq t_{\alpha}\right)\leq\alpha

which implies cα,𝒟0[i=1kai]k1k2C(𝒳)tα=:cα(𝒳)c_{\alpha,\mathcal{D}_{0}}\leq\left[\sum_{i=1}^{k}\sqrt{a_{i}}\right]\frac{k-1}{k^{2}}C(\mathcal{X})\cdot t_{\alpha}=:c_{\alpha}(\mathcal{X}). ∎
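The bound of Proposition 2.6 is easy to compute. The sketch below (names ours; C(X) treated as a given constant) evaluates the cut-off bound for general sample-size proportions and checks the equal-sample-size closed form and the uniform 1/sqrt(8) bound:

```python
import math

def cutoff_bound(lams, C, alpha):
    """c_alpha(X) = (sum_i sqrt(a_i)) * (k - 1) / k^2 * C(X) * t_alpha,
    where a_i = prod_{j != i} lambda_j and t_alpha = sqrt(-8 ln(alpha / 4))."""
    k = len(lams)
    t_alpha = math.sqrt(-8.0 * math.log(alpha / 4.0))
    s = sum(math.sqrt(math.prod(lam for j, lam in enumerate(lams) if j != i))
            for i in range(k))
    return s * (k - 1) / k**2 * C * t_alpha

alpha, C = 0.05, 1.0
t_alpha = math.sqrt(-8.0 * math.log(alpha / 4.0))
for k in range(2, 8):
    b = cutoff_bound([1.0 / k] * k, C, alpha)
    # equal sample sizes: (k-1)/k^((k+1)/2) * C * t_alpha <= C * t_alpha / sqrt(8)
    print(k, b, (k - 1) / k ** ((k + 1) / 2) * C * t_alpha)
```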

2.4 Consistency and power

Refer to caption
Figure 2: Illustration of theoretical power guarantees for k=2k=2 case (Theorem 2.9 and Corollary 2.10) for real data settings described in Section 4.2.1. The test of H0H_{0} compares distributions of tumor size for two groups of patients (marked “alive, no metastases” and “dead” in Figure 7B). Left: The effect size (Wasserstein squared distance δ\delta) is fixed at the value calculated from the data (blue line in the right panel), and the lower bound on power is shown as a function of nn. Right: The sample size (nn) was fixed at a high value (the actual sample size in the dataset is even larger), and power is shown as a function of δ\delta.

We start with the basic requirement for tests comparing k2k\geq 2 measures - consistency under any fixed alternative - by showing that the power tends to 11 with increasing sample sizes.

The power of the test (12) is

Power(μ)\displaystyle\text{Power}(\mu) =𝒟a(ϕn,μ=1)\displaystyle=\mathbb{P}_{\mathcal{D}_{a}}\left(\phi_{n,\mu}=1\right) (19)
=𝒟a(ρn(MOT(μ^n)0)cα,𝒟0)\displaystyle=\mathbb{P}_{\mathcal{D}_{a}}\left(\rho_{n}(MOT(\widehat{\mu}_{n})-0)\geq c_{\alpha,\mathcal{D}_{0}}\right)
=𝒟a(ρn(MOT(μ^n)MOT(μ))cα,𝒟0ρnMOT(μ))\displaystyle=\mathbb{P}_{\mathcal{D}_{a}}\left(\rho_{n}(MOT(\widehat{\mu}_{n})-MOT(\mu))\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}MOT(\mu)\right)
=(Xacα,𝒟0ρnMOT(μ))\displaystyle=\mathbb{P}\left(X_{a}\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}MOT(\mu)\right)
Remark 2.7.

To obtain the expression for the power of the test in [72], MOT(μ)MOT(\mu) in the above expression can be replaced by W22(μ)W_{2}^{2}(\mu)555This choice corresponds to p=2p=2-Wasserstein distance in [72]. Other choices of p[1,)p\in[1,\infty) can be used by adjusting the details accordingly., with necessary adjustments for the k=2k=2 measures case.

Proposition 2.8 below proves consistency under fixed alternatives for any k2k\geq 2. The proof utilizes a Normal Lower Bound guaranteed by Theorem 2.2(c) to lower bound the power of the test.

Proposition 2.8 (Consistency under fixed alternatives, k2k\geq 2 measures).

Under any given alternative μ=(μ1,,μk)\mu=(\mu_{1},\ldots,\mu_{k}), k2k\geq 2, the test in (12) satisfies

limn(ϕn,μ=1)=1\lim_{n\to\infty}\mathbb{P}(\phi_{n,\mu}=1)=1
Proof.

Given an alternative μ\mu, consider the set of dual optimal solutions Φ\Phi^{*} (11) to MOT(μ)MOT(\mu). Choose any uΦu^{*}\in\Phi^{*} (treated as fixed after the choice), and consider the Normal Lower Bound NLBu𝒩(0,i=1kaiuiΣiui)NLB_{u^{*}}\sim\mathcal{N}\left(0,\sum_{i=1}^{k}a_{i}{u_{i}^{*}}^{\prime}\Sigma_{i}u_{i}^{*}\right) guaranteed by Theorem 2.2(c). Using (19), we have that

𝒟a(ϕn,μ=1)\displaystyle\mathbb{P}_{\mathcal{D}_{a}}\left(\phi_{n,\mu}=1\right) =(Xacα,𝒟0ρnMOT(μ))\displaystyle=\mathbb{P}\left(X_{a}\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}MOT(\mu)\right)
(NLBucα,𝒟0ρnMOT(μ))\displaystyle\geq\mathbb{P}\left(NLB_{u^{*}}\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}MOT(\mu)\right)

Since ρn\rho_{n}\to\infty while the test’s cut-off cα,𝒟0c_{\alpha,\mathcal{D}_{0}} and the true value MOT(μ)>0MOT(\mu)>0 do not change with nn, we get that cα,𝒟0ρnMOT(μ)c_{\alpha,\mathcal{D}_{0}}-\rho_{n}MOT(\mu)\to-\infty, and hence the Gaussian random variable NLBuNLB_{u^{*}} exceeds it with probability tending to 11 as nn\to\infty. ∎

Next, we prove uniform consistency of the test (12) over a broad class of alternatives. We start with a case of k=2k=2 measures in Theorem 2.9, which proves uniform consistency of the test proposed by [72]. We then move to general k2k\geq 2 (kk is allowed to change) in Proposition 2.11, concluding uniform consistency of tests of this type.

Our results are proved under the following assumption (B1) below, which is discussed in Remark 2.5. This assumption is expected to hold when measures are not too far from each other, and the power without this assumption is expected to be higher. Removing (B1) poses the difficulty of bounding dual solutions uu uniformly over alternative polytopes {μu=MOT(μ),Auc}\{\mu^{\prime}u=MOT(\mu),A^{\prime}u\leq c\}; we use such a bound to control alternative variances. Under (B1), the condition i=1kui=0\sum_{i=1}^{k}u_{i}=0 allows us to bound uu (and hence alternative variances) explicitly and uniformly over μ\mu.

Assumption (B1).

There exist dual solutions uu to MOT(μ)MOT(\mu) satisfying i=1kui=0\sum_{i=1}^{k}u_{i}=0.

For any fixed metric space 𝒳\mathcal{X} with NN points and any δ>0\delta>0, define the class of alternatives

(δ):={μ on 𝒳:W22(μ)δ}\mathcal{F}(\delta):=\{\mu\text{ on }\mathcal{X}:W^{2}_{2}(\mu)\geq\delta\}
Theorem 2.9 (Uniform consistency, k=2k=2 measures).

For k=2k=2, the test (12) (or, equivalently, the test based on Theorem 1(c) of [72]) satisfies

limninfμ(δ)(B1)(ϕn,μ=1)=1\lim_{n\to\infty}\inf_{\mu\in\mathcal{F}(\delta)\cap(B1)}\mathbb{P}(\phi_{n,\mu}=1)=1
Proof.

Fix test level α(0,1)\alpha\in(0,1). The goal is to show that, independently of the nature of the alternative μ(δ)(B1)\mu\in\mathcal{F}(\delta)\cap(B1), the probability that the test rejects H0H_{0} given by

(ϕn,μ=1)=(Xacα,𝒟0ρn14W22(μ))\mathbb{P}(\phi_{n,\mu}=1)=\mathbb{P}\left(X_{a}\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}\frac{1}{4}W_{2}^{2}(\mu)\right)

tends to 11 as nn\to\infty (the factor of 14\frac{1}{4} is due to MOT(μ)=14W22(μ)MOT(\mu)=\frac{1}{4}W_{2}^{2}(\mu) for k=2k=2).

Assume for simplicity of notation that the sample sizes are equal, i.e. n1==nkn_{1}=\cdots=n_{k}666The proof for unequal sample sizes would be similar by considering n:=mini{ni}n:=\min_{i}\{n_{i}\}.. Note first that the null cut-offs cα,𝒟0c_{\alpha,\mathcal{D}_{0}} can be bounded above uniformly over the class (δ)(B1)\mathcal{F}(\delta)\cap(B1) using Proposition 2.6, which with k=2k=2 gives the bound

cα,𝒟018C(𝒳)8ln(α/4):=cα(𝒳)c_{\alpha,\mathcal{D}_{0}}\leq\frac{1}{\sqrt{8}}C(\mathcal{X})\sqrt{-8\ln(\alpha/4)}:=c_{\alpha}(\mathcal{X})

Hence,

cα,𝒟0ρn14W22(μ)cα(𝒳)ρnδ4 for all μ(δ)(B1)c_{\alpha,\mathcal{D}_{0}}-\rho_{n}\frac{1}{4}W_{2}^{2}(\mu)\leq c_{\alpha}(\mathcal{X})-\rho_{n}\frac{\delta}{4}\text{ for all }\mu\in\mathcal{F}(\delta)\cap(B1)

This expression represents the “worst” (over (δ)(B1)\mathcal{F}(\delta)\cap(B1)) value that any given XaX_{a} must exceed to give the test power.

Next we show that any XaX_{a} will exceed this bound with probability tending to 11 as nn\to\infty. To this end, let μ(δ)(B1)\mu\in\mathcal{F}(\delta)\cap(B1), and consider the corresponding alternative distribution of XaX_{a}. Consider the dual solutions (u1a,u2a)({u_{1}}_{a}^{*},{u_{2}}_{a}^{*}) satisfying Assumption (B1) and the corresponding Normal Lower Bound

NLBua𝒩(0,12u1aΣ1u1a+12u2aΣ2u2a)NLB_{u_{a}^{*}}\sim\mathcal{N}\left(0,\frac{1}{2}{{u_{1}}_{a}^{*}}^{\prime}\Sigma_{1}{{u_{1}}_{a}^{*}}+\frac{1}{2}{{u_{2}}_{a}^{*}}^{\prime}\Sigma_{2}{{u_{2}}_{a}^{*}}\right)

guaranteed by Theorem 2.2(c). Note that its variance

12u1aΣ1u1a+12u2aΣ2u2a12\displaystyle\frac{1}{2}{{u_{1}}_{a}^{*}}^{\prime}\Sigma_{1}{u_{1}}_{a}^{*}+\frac{1}{2}{{u_{2}}_{a}^{*}}^{\prime}\Sigma_{2}{u_{2}}_{a}^{*}\leq\frac{1}{2} u1a2λmax(Σ1)+12u2a2λmax(Σ2)\displaystyle\|{u_{1}}_{a}^{*}\|^{2}\cdot\lambda_{max}(\Sigma_{1})+\frac{1}{2}\|{u_{2}}_{a}^{*}\|^{2}\cdot\lambda_{max}(\Sigma_{2})
=(B1)\displaystyle\underset{(B1)}{=} 12u1a2(λmax(Σ1)+λmax(Σ2))\displaystyle\frac{1}{2}\|{u_{1}}_{a}^{*}\|^{2}\left(\lambda_{max}(\Sigma_{1})+\lambda_{max}(\Sigma_{2})\right)

where λmax()\lambda_{max}(\cdot) denotes the largest eigenvalue of a matrix argument.

Using Theorem 1 of [8], the eigenvalues of Σ1,Σ2\Sigma_{1},\Sigma_{2} are upper bounded by entries of μ1\mu_{1} and μ2\mu_{2} as λmax(Σ1)maxi(μ1)i\lambda_{max}(\Sigma_{1})\leq\max_{i}({\mu_{1}})_{i} and λmax(Σ2)maxi(μ2)i\lambda_{max}(\Sigma_{2})\leq\max_{i}({\mu_{2}})_{i}, with the uniform upper bound of 11 for all instances μ(δ)\mu\in\mathcal{F}(\delta)777In fact, the bounds hold for any μ\mu and are not restricted to the class \mathcal{F} - see [8].. Hence,

λmax(Σ1)+λmax(Σ2)2\lambda_{max}(\Sigma_{1})+\lambda_{max}(\Sigma_{2})\leq 2

providing a uniform upper bound on the eigenvalue part. Further, u1a14C(𝒳)\|{u_{1}}_{a}^{*}\|\leq\frac{1}{4}C(\mathcal{X}) by Lemma 2.4. Thus, letting

σ2(𝒳):=(14C(𝒳))2\sigma^{2}(\mathcal{X}):=\left(\frac{1}{4}C(\mathcal{X})\right)^{2} (20)

uniformly bounds the variances for all NLBNLB’s chosen as above for XaX_{a}’s arising from μ(δ)(B1)\mu\in\mathcal{F}(\delta)\cap(B1).

The final step is to combine the above uniform bounding arguments to get the power 1\to 1. Note that for large enough nn, cα(𝒳)ρn14δ<0c_{\alpha}(\mathcal{X})-\rho_{n}\frac{1}{4}\delta<0, and also that cα(𝒳)ρn14δc_{\alpha}(\mathcal{X})-\rho_{n}\frac{1}{4}\delta\to-\infty as nn\to\infty. Hence, we have that, for any μ(δ)(B1)\mu\in\mathcal{F}(\delta)\cap(B1), for large enough nn depending only on δ\delta and 𝒳\mathcal{X} but not the nature of μ\mu,

(ϕn,μ=1)\displaystyle\mathbb{P}(\phi_{n,\mu}=1) =(Xacα,𝒟0ρn14W22(μ))\displaystyle=\mathbb{P}\left(X_{a}\geq c_{\alpha,\mathcal{D}_{0}}-\rho_{n}\frac{1}{4}W_{2}^{2}(\mu)\right) (21)
(Xacα(𝒳)ρnδ4)\displaystyle\geq\mathbb{P}\left(X_{a}\geq c_{\alpha}(\mathcal{X})-\rho_{n}\frac{\delta}{4}\right)
(NLBuacα(𝒳)ρnδ4)\displaystyle\geq\mathbb{P}\left(NLB_{u^{*}_{a}}\geq c_{\alpha}(\mathcal{X})-\rho_{n}\frac{\delta}{4}\right)
(𝒩(0,σ2(𝒳))cα(𝒳)ρnδ4)\displaystyle\geq\mathbb{P}\left(\mathcal{N}\left(0,\sigma^{2}(\mathcal{X})\right)\geq c_{\alpha}(\mathcal{X})-\rho_{n}\frac{\delta}{4}\right)

which tends to 11 with cα(𝒳)ρnδ4c_{\alpha}(\mathcal{X})-\rho_{n}\frac{\delta}{4}\to-\infty as nn\to\infty. This gives the uniform lower bound on the power over (δ)(B1)\mathcal{F}(\delta)\cap(B1) proving the uniform consistency of the test over this broad class of alternatives. ∎

Using Theorem 2.9, one can provide a practical lower bound on the power as a function of the sample size nn and/or the effect size δ\delta when measures are supported on a known metric space 𝒳\mathcal{X}. Note that the dual bound C(𝒳)C(\mathcal{X}) is rather conservative; it can be replaced by a bound on u1\|u_{1}\| computed for the polytope {u1+u2=0,Auc}\{u_{1}+u_{2}=0,A^{\prime}u\leq c\} in every specific case. To find these bounds, one could solve linear programs {max/minui:u1+u2=0,Auc}\{\max/\min u^{i}:u_{1}+u_{2}=0,A^{\prime}u\leq c\} for each entry uiu^{i} of uu to estimate magnitudes of the dual variables over a given polytope. Denoting the resulting bound C~(𝒳)\widetilde{C}(\mathcal{X}), we have

Corollary 2.10 (Lower bound on the power of the two-sample test).

For any alternative μ(δ)(B1)\mu\in\mathcal{F}(\delta)\cap(B1), the two-sample test (12) (or the test based on Theorem 1(c) of [72]) with equal sample sizes nn has

power1Φ(4C~(𝒳)log(α/4)n2δC~(𝒳))\text{power}\geq 1-\Phi\left(\frac{4\widetilde{C}(\mathcal{X})\sqrt{-\log(\alpha/4)}-\frac{\sqrt{n}}{\sqrt{2}}\delta}{\widetilde{C}(\mathcal{X})}\right)

where Φ()\Phi(\cdot) denotes the cumulative distribution function of the standard Normal distribution.

Illustration of this bound on the real data from Section 4.2.1 is provided in Figure 2.
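The bound of Corollary 2.10 is straightforward to evaluate in practice. The sketch below (names ours; the constant C~(𝒳)\widetilde{C}(\mathcal{X}) is supplied by the user) reproduces curves of the kind shown in Figure 2:

```python
import math

def normal_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_lower_bound(n, delta, C_tilde, alpha=0.05):
    """Corollary 2.10: power >= 1 - Phi((4 C~ sqrt(-log(alpha/4)) - sqrt(n/2) delta) / C~)."""
    z = (4.0 * C_tilde * math.sqrt(-math.log(alpha / 4.0))
         - math.sqrt(n / 2.0) * delta) / C_tilde
    return 1.0 - normal_cdf(z)

# the bound increases in both the sample size n and the effect size delta
print([round(power_lower_bound(n, 0.1, 1.0), 4) for n in (10**2, 10**4, 10**6)])
```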

Using techniques similar to the proof of Theorem 2.9, it is possible to prove uniform consistency of the test (12) for alternatives μ(k):=(μ1,,μk)\mu(k):=(\mu_{1},\ldots,\mu_{k}) in the class

K(δ):={μ(k) on 𝒳:kK, MOT(μ)δ}\mathcal{F}_{K}(\delta):=\{\mu(k)\text{ on }\mathcal{X}:k\leq K,\text{ }MOT(\mu)\geq\delta\}

defined for any δ>0\delta>0 and any fixed 2K<2\leq K<\infty under the Assumption (B1). This gives consistency of the test (12) uniformly over alternatives with kKk\leq K measures:

Proposition 2.11 (Uniform consistency in the class K(δ)\mathcal{F}_{K}(\delta)).

The test in (12) satisfies

limninfμ(k)K(δ)(B1)(ϕn,μ=1)=1\lim_{n\to\infty}\inf_{\mu(k)\in\mathcal{F}_{K}(\delta)\cap(B1)}\mathbb{P}(\phi_{n,\mu}=1)=1
Proof.

Similarly to the proof of Theorem 2.9, the null cut-offs are uniformly bounded using Proposition 2.6 as cα,𝒟0cα(𝒳)c_{\alpha,\mathcal{D}_{0}}\leq c_{\alpha}(\mathcal{X}). Recall that (taking equal sample sizes nn for simplicity) for any 2kK2\leq k\leq K, ρn=nkk1\rho_{n}=\sqrt{\frac{n}{k^{k-1}}}, and hence cαρnMOT(μ(k))<0c_{\alpha}-\rho_{n}MOT(\mu(k))<0 for nn large enough to ensure this holds for all kKk\leq K.

Similarly to the proof of Theorem 2.9, each alternative random variable XaX_{a} arising from μ(k)\mu(k) has a Normal Lower Bound

NLBua𝒩(0,1kk1u1aΣ1u1a++1kk1ukaΣkuka)NLB_{u_{a}^{*}}\sim\mathcal{N}\left(0,\frac{1}{k^{k-1}}{{u_{1}}_{a}^{*}}^{\prime}\Sigma_{1}{{u_{1}}_{a}^{*}}+\cdots+\frac{1}{k^{k-1}}{{u_{k}}_{a}^{*}}^{\prime}\Sigma_{k}{{u_{k}}_{a}^{*}}\right)

with the variance bounded above by

σ2(𝒳,k):=1kk1(k1k2C(𝒳))2k=(k1)2kk+2(C(𝒳))2\sigma^{2}(\mathcal{X},k):=\frac{1}{k^{k-1}}\cdot\left(\frac{k-1}{k^{2}}C(\mathcal{X})\right)^{2}\cdot k=\frac{(k-1)^{2}}{k^{k+2}}\left(C(\mathcal{X})\right)^{2} (22)

which decreases with kk. Hence, it is bounded uniformly in kk by σ2(𝒳)\sigma^{2}(\mathcal{X}) from the k=2k=2 case (equation (20)), and the rest of the argument carries over in exactly the same way as in the proof of Theorem 2.9. ∎
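The algebra in (22) and the monotonicity claim used above are quick to check numerically (a sketch; C stands for an arbitrary constant C(𝒳)C(\mathcal{X})):

```python
def sigma2(k, C):
    """Variance bound (22): (k - 1)^2 / k^(k + 2) * C(X)^2."""
    return (k - 1) ** 2 / k ** (k + 2) * C**2

C = 2.0
vals = [sigma2(k, C) for k in range(2, 12)]
unsimplified = [(1 / k ** (k - 1)) * ((k - 1) / k**2 * C) ** 2 * k
                for k in range(2, 12)]
print(vals[0] == (C / 4) ** 2)                        # k = 2 recovers (20): True
print(all(a > b for a, b in zip(vals, vals[1:])))     # decreasing in k: True
```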

Remark 2.12 (Large kk and connection with [50]).

In practice, the upper bound on the number kk of measures in K(δ)\mathcal{F}_{K}(\delta) cannot be too large. This is due to the requirement ρn=nkk1\rho_{n}=\sqrt{\frac{n}{k^{k-1}}}\to\infty, which forces very large sample sizes that may not be practically plausible. While this limitation is natural for asymptotic kk-sample tests that work with fixed kk (as discussed, e.g., in [84]), recent results of [50] show that the permutation approach for certain test statistics allows kk and nn to grow simultaneously. More precisely, in the class of alternatives where only a few measures differ from the rest of the collection, the permutation kernel-based test is uniformly powerful if the population version of the test statistic δ\delta sufficiently exceeds logkn\sqrt{\frac{\log k}{n}}. Below we discuss sequences of alternatives whose MOTMOT population value does not decrease with kk, and hence the MOTMOT test statistic is expected to perform well in a permutation procedure. We leave a theoretical power analysis concerning the MOTMOT permutation test for future work.

The “clustered” alternatives are collections μ=(μ1,,μk)\mu=(\mu_{1},\ldots,\mu_{k}) that separate into CC groups (or “clusters”), with k/Ck/C measures in each cluster that are all the same. Such a situation might arise, for example, if an applied treatment causes CC different types of responses. For instance, for C=2C=2, define

k2:={μ on 𝒳:μ1==μk/2μk/2+1==μk}\displaystyle\mathcal{F}^{2}_{k}:=\{\mu\text{ on }\mathcal{X}:\mu_{1}=\cdots=\mu_{k/2}\neq\mu_{k/2+1}=\cdots=\mu_{k}\}

(see Figure 5B for illustration). The classes kC\mathcal{F}^{C}_{k} with C=3,,kC=3,\cdots,k are defined analogously (in each case, kk is assumed to be divisible by CC).

Lemma 2.13 (MOT values for “clustered” alternatives).

For μk2\mu\in\mathcal{F}^{2}_{k}, we have MOT(μ)=MOT(μ1,μk)=14W22(μ1,μk)MOT(\mu)=MOT(\mu_{1},\mu_{k})=\frac{1}{4}W_{2}^{2}(\mu_{1},\mu_{k}). More generally, for μkC\mu\in\mathcal{F}^{C}_{k}, MOT(μ)=MOT({μi}i=1C)MOT(\mu)=MOT(\{\mu_{i}\}_{i=1}^{C}), where {μi}i=1C\{\mu_{i}\}_{i=1}^{C} is a collection consisting of one measure from each cluster.

Proof is provided in Appendix A.5. Note that true MOTMOT values for “clustered” alternatives do not decrease with the increasing number of measures, and thus the MOTMOT value may serve as a suitable test statistic in permutation tests against alternatives in kC\mathcal{F}^{C}_{k}.

Finally, we comment on the MOTMOT values in a “sparse” alternative class when only one measure is different from the rest (alternatives of this type are considered in both [50] and [84]):

ks:={μ on 𝒳:μ1==μk1μk}\mathcal{F}^{s}_{k}:=\{\mu\text{ on }\mathcal{X}:\mu_{1}=\cdots=\mu_{k-1}\neq\mu_{k}\}

While MOTMOT values do decrease with kk in this sequence of alternatives, we can state precisely how the rate of this decrease is controlled (proved in Appendix A.6):

Lemma 2.14 (MOT values for “sparse” alternatives).

For μks\mu\in\mathcal{F}^{s}_{k}, MOT(μ)=k1k2W22(μ1,μk)MOT(\mu)=\frac{k-1}{k^{2}}W_{2}^{2}(\mu_{1},\mu_{k}).
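Lemmas 2.13 and 2.14 can be checked on a toy instance by solving the MOT linear program directly. The sketch below assumes program (8) minimizes the barycentric cost c(x_{i_1}, ..., x_{i_k}) = min_y (1/k) sum_l (x_{i_l} - y)^2 over multicouplings (the normalization under which MOT equals W_2^2/4 for k = 2), and verifies the sparse-alternative rate of Lemma 2.14:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def mot_value(mus, x):
    """MOT on a 1-D support x via the LP over multicouplings, assuming the
    barycentric cost c(x_{i_1}, ..., x_{i_k}) = min_y (1/k) sum_l (x_{i_l} - y)^2."""
    k, N = len(mus), len(x)
    tuples = list(itertools.product(range(N), repeat=k))
    # population variance of the tuple equals the barycentric cost
    cost = np.array([np.var([x[i] for i in t]) for t in tuples])
    A_eq, b_eq = [], []
    for axis in range(k):                 # marginal constraints, k * N rows
        for j in range(N):
            A_eq.append([1.0 if t[axis] == j else 0.0 for t in tuples])
            b_eq.append(mus[axis][j])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq))  # pi >= 0 by default
    return float(res.fun)

x = np.array([0.0, 1.0])
mu1, muk = np.array([0.5, 0.5]), np.array([0.3, 0.7])
w2sq = 0.2   # W_2^2(mu1, muk): move mass 0.2 across unit distance
for k in (2, 3, 4):
    sparse = [mu1] * (k - 1) + [muk]      # mu_1 = ... = mu_{k-1} != mu_k
    print(k, mot_value(sparse, x), (k - 1) / k**2 * w2sq)   # columns agree
```

The same `mot_value` helper applied to a clustered collection (k/2 copies of each of two measures) reproduces the k = 2 value, as in Lemma 2.13.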

Empirical performance of the asymptotic MOTMOT test (12) and the permutation MOTMOT test (Section 3.3) on “clustered” and “sparse” alternatives is illustrated in Figure 5.

3 Sampling from null and alternative distributions

Refer to caption
Figure 3: Illustration of bootstrap consistency (Section 3.1). A. Convergence of m-out-of-n and derivative bootstrap sampling distributions (both sampled based on the empirical μ^n\widehat{\mu}_{n}) to the true null distribution (sampled based on the true μ\mu) assessed by 1-Wasserstein distance. The m-out-of-n bootstrap schemes are shown with m:=npm:=n^{p}, p{0.3,0.5,0.7,0.9}p\in\{0.3,0.5,0.7,0.9\}. Observed convergence rate is fastest for the derivative bootstrap (blue), and is slower for the m-out-of-n bootstrap for larger values of mm. The data is based on the 3D Experiment dataset (Figure 5) by choosing bottom unit square in 2D2D (Left panel here) and a unit square in 3D3D (Right panel here). B. Sampling distributions corresponding to panel A, n=500n=500. C. Quantile-quantile plot illustrating closeness of the derivative bootstrap to the true distribution from panel B.
Refer to caption
Figure 4: Illustration of performance of the null upper bound UB0UB_{0} (Section 3.2). A. H0H_{0} testing using the upper bound UB0UB_{0} and the true null distribution X0X_{0} for all three real (or real-based) datasets considered in this paper (Section 4). Observe that UB0UB_{0} produces the same conclusion as X0X_{0} (rejection of H0H_{0}) on all considered datasets, while having much lower computational complexity (Table 2). The whole analysis took under 5 minutes on a standard laptop, with a negligible fraction of the time taken by UB0UB_{0}. B. Times to compute 500500 samples from UB0UB_{0} for (subsampled grid) MNIST digits data [24] for kk images on a standard laptop. All simulations were conducted on AMD Ryzen 5 7520U with 16 GB RAM.

3.1 Bootstrap: mm-out-of-nn and derivative

We recall that the limiting laws in Theorem 2.2 depend on the true measures μ\mu, similarly to the k=1,2k=1,2-sample cases considered in [72] and [46]. More precisely, the laws are of the form

ρn(f(μ^n)f(μ))in lawfμ(G)\rho_{n}\left(f(\widehat{\mu}_{n})-\ f(\mu)\right)\xrightarrow[]{\text{in law}}f^{\prime}_{\mu}(G)

where f:μMOT(μ)f:\mu\longrightarrow MOT(\mu) is the map with Hadamard directional derivative ff^{\prime} at μ\mu in the direction of G=in lawlimnρn(μ^nμ)G\overset{\text{in law}}{=}\lim_{n}\rho_{n}\left(\widehat{\mu}_{n}-\mu\right) and n:=(n1,,nk)n:=(n_{1},\ldots,n_{k}). The classical bootstrap estimator of fμ(G)f^{\prime}_{\mu}(G) in the sense of [27] would be constructed by sampling from the conditional (given the data) law of ρn(f(μ^n)f(μ^n))\rho_{n}\left(f(\hat{\mu}_{n}^{*})-f(\widehat{\mu}_{n})\right), where μ^n\hat{\mu}_{n}^{*} is obtained by taking nn samples from the vector of empirical measures μ^n\widehat{\mu}_{n}. By Theorem 3.1 of [29], this estimator is not consistent when fμ(G)f^{\prime}_{\mu}(G) is non-Gaussian, which is always the case under H0H_{0} and frequently under HaH_{a}.

In place of the inconsistent classical bootstrap, [29] proposes a consistent bootstrap procedure to estimate the law of fμ(G)f^{\prime}_{\mu}(G). The approach of [29] is to ensure consistency of an estimator fn()f^{\prime}_{n}(\cdot) of the map fμ()f^{\prime}_{\mu}(\cdot) uniformly in the argument ()(\cdot), assuming that the law of the argument ()(\cdot) is estimated by (some) consistent bootstrap scheme. Two different choices for fn()f^{\prime}_{n}(\cdot) then lead to bootstrap schemes frequently termed m-out-of-n and derivative bootstrap methods, respectively (see Section 1 of [29] for historical notes on these methods).

The work of [72] outlines the consistency results for these two schemes in the k=1,2k=1,2-sample cases. For completeness, we describe these schemes in the general case of k2k\geq 2 (proved in Appendix A.7):

Proposition 3.1 (Consistency of bootstrap from [29]).

The results of parts (a) and (b) concern two estimators fn()f^{\prime}_{n}(\cdot) of the map fμ()f_{\mu}^{\prime}(\cdot).

  • (a)

    fn:hfn(h)f^{\prime}_{n}:h\longrightarrow f_{n}^{\prime}(h) given by

    fn(h):=f(μ^n+εnh)f(μ^n)εnf^{\prime}_{n}(h):=\frac{f(\widehat{\mu}_{n}+\varepsilon_{n}h)-f(\widehat{\mu}_{n})}{\varepsilon_{n}}

    composed with the estimator of GG given by

    G^:=ρm(μ^(m)μ^n)\hat{G}^{*}:=\rho_{m}\left(\widehat{\mu}^{*}_{(m)}-\widehat{\mu}_{n}\right)

    results in a consistent bootstrap estimator fn(G^)f^{\prime}_{n}(\hat{G}^{*}) of fμ(G)f_{\mu}^{\prime}(G) under both H0H_{0} and HaH_{a}. Here, μ^(m)\hat{\mu}^{*}_{(m)} is obtained by resampling m out of n observations from μ^n\widehat{\mu}_{n} with m:=nm:=\sqrt{n}, and εn0\varepsilon_{n}\to 0 such that ρnεn\rho_{n}\varepsilon_{n}\to\infty.

    Note: The choice εn:=1ρm\varepsilon_{n}:=\frac{1}{\rho_{m}} leads to

    fn(G^)=ρm(f(μ^(m))f(μ^n))f^{\prime}_{n}(\hat{G}^{*})=\rho_{m}\left(f(\widehat{\mu}^{*}_{(m)})-f(\widehat{\mu}_{n})\right)

    which is frequently termed the m-out-of-n bootstrap estimator of fμ(G)f_{\mu}^{\prime}(G) and is considered in [72] and [46] for the Wasserstein distance map ff.

  • (b)

    fn:hfμ^n(h)f_{n}^{\prime}:h\longrightarrow f_{\hat{\mu}_{n}}^{\prime}(h) given by

f^{\prime}_{\hat{\mu}_{n}}(h):=\max_{u}\sum_{i=1}^{k}\langle u_{i},h_{i}\rangle\quad\text{s.t.}\quad\langle u,\hat{\mu}_{n}\rangle=MOT(\hat{\mu}_{n}),\ A^{\prime}u\leq c

    composed with the estimator of GG given by

    G^:=(a^1G^1,,a^kG^k)\hat{G}^{*}:=\left(\sqrt{\hat{a}_{1}}\hat{G}_{1}^{*},\ldots,\sqrt{\hat{a}_{k}}\hat{G}^{*}_{k}\right)

    where each G^i\hat{G}_{i}^{*} is N(0,diag(μ^1)μ^1(μ^1))N\left(0,\text{diag}(\hat{\mu}^{*}_{1})-\hat{\mu}_{1}^{*}(\hat{\mu}_{1}^{*})^{\prime}\right) results in a consistent bootstrap estimator fn(G^)f_{n}^{\prime}(\hat{G}^{*}) of fμ(G)f_{\mu}^{\prime}(G) under H0H_{0}.

    Note: This estimator is frequently termed the derivative bootstrap estimator of fμ(G)f^{\prime}_{\mu}(G) and is considered in [72] for the Wasserstein distance map ff.
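The m-out-of-n scheme of part (a) can be illustrated numerically. The sketch below applies it to a toy directionally differentiable functional f(\mu)=\max_i \mu_i on a finite support (an assumption for illustration only; it is not the MOT map), with a single measure so that the rate is \rho_m=\sqrt{m}:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy directionally differentiable functional (an illustration, not MOT)
f = lambda mu: mu.max()

def m_out_of_n_replicate(mu_hat, m):
    """One draw of sqrt(m) * (f(mu*_(m)) - f(mu_hat)) by m-of-n resampling."""
    mu_star = rng.multinomial(m, mu_hat) / m
    return np.sqrt(m) * (f(mu_star) - f(mu_hat))

n = 10_000
mu_true = np.array([0.5, 0.5, 0.0])        # a tie, where f is non-smooth
mu_hat = rng.multinomial(n, mu_true) / n
m = int(np.sqrt(n))                        # m = sqrt(n), as in part (a)
reps = np.array([m_out_of_n_replicate(mu_hat, m) for _ in range(500)])
print(reps.mean(), reps.std())
```

The resampling size m grows slower than n, which is exactly what restores consistency at the non-smooth point (the tie) where the classical bootstrap fails.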

Pseudocodes 1 and 2 describe sampling from the limiting laws of MOT under H_{0} and H_{a} using the bootstrap schemes in Proposition 3.1.

Pseudocode 1 (m-out-of-n bootstrap to obtain one sample from H0H_{0} or HaH_{a} limiting law).

Given the data μ^n\widehat{\mu}_{n},

  1. 1.

    Let mi:=nipm_{i}:=n_{i}^{p}, p(0,1)p\in(0,1).

  2. 2.

    For each i=1,,ki=1,\ldots,k,
    sample μ^iMultinomial(mi,μ^1)\widehat{\mu}_{i}^{*}\sim\text{Multinomial}(m_{i},\widehat{\mu}_{1}) under H0H_{0}, or
    sample μ^iMultinomial(mi,μ^i)\widehat{\mu}_{i}^{*}\sim\text{Multinomial}(m_{i},\widehat{\mu}_{i}) under HaH_{a}.

  3. 3.

    Compute MOT(μ^1,,μ^k)MOT(\widehat{\mu}_{1}^{*},\ldots,\widehat{\mu}_{k}^{*}) by solving the program (8).

  4. 4.

    Report ρmMOT(μ^1,,μ^k)\rho_{m}MOT(\widehat{\mu}_{1}^{*},\ldots,\widehat{\mu}_{k}^{*}) under H0H_{0} or
    ρm(MOT(μ^1,,μ^k)MOT(μ^n))\rho_{m}\left(MOT(\widehat{\mu}_{1}^{*},\ldots,\widehat{\mu}_{k}^{*})-MOT(\widehat{\mu}_{n})\right) under HaH_{a},
    where ρm=m1mk(m1++mk)k1\rho_{m}=\frac{\sqrt{m_{1}\cdot\ldots\cdot m_{k}}}{\left(\sqrt{m_{1}+\ldots+m_{k}}\right)^{k-1}}.
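The steps above can be sketched for a small instance. The sketch below assumes a brute-force LP solver for program (8) and a stand-in pairwise quadratic cost; the support, sample sizes, and p = 1/2 are illustrative:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def mot_value(mus, x):
    """Optimal value of the primal MOT LP (8), solved by brute force:
    N^k coupling variables and kN marginal equality constraints."""
    k, N = len(mus), len(x)
    tuples = list(product(range(N), repeat=k))
    # stand-in pairwise quadratic cost on tuples (an assumption here)
    c = np.array([sum((x[t[i]] - x[t[j]]) ** 2
                      for i in range(k) for j in range(i + 1, k))
                  for t in tuples])
    A_eq = np.zeros((k * N, len(tuples)))
    for col, t in enumerate(tuples):
        for i in range(k):
            A_eq[i * N + t[i], col] = 1.0   # i-th marginal of the coupling
    b_eq = np.concatenate(mus)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

rng = np.random.default_rng(1)
x = np.array([0.0, 1.0, 2.0])              # support points in R^1
mu1 = np.array([0.5, 0.3, 0.2])            # pooled estimate under H_0
ns, p = [400, 500, 600], 0.5               # sample sizes n_i and m_i = n_i^p
ms = [int(n ** p) for n in ns]
# Steps 1-2: resample m_i observations from mu1 (the H_0 case)
mus_star = [rng.multinomial(m, mu1) / m for m in ms]
# Steps 3-4: MOT value of the resampled measures, rescaled by rho_m
rho_m = np.sqrt(np.prod(ms)) / np.sqrt(sum(ms)) ** (len(ms) - 1)
sample_H0 = rho_m * mot_value(mus_star, x)
print(sample_H0)
```

Repeating the draw many times approximates the H_{0} limiting law; for realistic N and k the brute-force solver should be replaced by the polynomial algorithm of [2].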

Pseudocode 2 (Derivative bootstrap to obtain one sample from H0H_{0} limiting law).

Given the data μ^n\widehat{\mu}_{n},

  1. 1.

    Sample μ^1Multinomial(n1,μ^1)\hat{\mu}_{1}^{*}\sim\text{Multinomial}(n_{1},\widehat{\mu}_{1}).

  2. 2.

    For each i=1,,ki=1,\ldots,k,
    sample G^iNormal(0,diag(μ^1)μ^1(μ^1))\hat{G}_{i}^{*}\sim\text{Normal}(0,\text{diag}(\widehat{\mu}_{1}^{*})-\hat{\mu}_{1}^{*}(\hat{\mu}_{1}^{*})^{\prime}).
    Let ai=jiλ^ja_{i}=\prod_{j\neq i}\hat{\lambda}_{j}, where λ^j=njn1++nk\hat{\lambda}_{j}=\frac{n_{j}}{n_{1}+\ldots+n_{k}}.

  3. 3.

    Solve the program (15) with {G^i}i=1k\{\hat{G}_{i}^{*}\}_{i=1}^{k} in place of {Gi}i=1k\{G_{i}\}_{i=1}^{k}.
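Steps 1 and 2 above can be sketched as follows (the pooled measure and sample sizes are illustrative; Step 3 would then solve the linear program (15) with the resulting \sqrt{a_i}\,\hat{G}_i^* as coefficients):

```python
import numpy as np

rng = np.random.default_rng(2)
mu1_hat = np.array([0.4, 0.4, 0.2])        # pooled empirical measure mu_hat_1
ns = np.array([300, 400, 500])             # sample sizes n_1, ..., n_k
k, N = len(ns), len(mu1_hat)

# Step 1: one multinomial resample of mu_hat_1
mu_star = rng.multinomial(ns[0], mu1_hat) / ns[0]

# Step 2: Gaussians with the multinomial covariance, and the weights a_i
Sigma = np.diag(mu_star) - np.outer(mu_star, mu_star)
G_star = rng.multivariate_normal(np.zeros(N), Sigma, size=k)
lam = ns / ns.sum()
a = np.array([np.prod(np.delete(lam, i)) for i in range(k)])
print(a, G_star.shape)
```

Note that the multinomial covariance is singular along the all-ones direction, so each sampled \hat{G}_i^* sums to zero, matching the structure of the dual feasible set.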

3.1.1 Computational complexity of bootstrap

For the m-out-of-n bootstrap, the computation in Step 3 requires solving the primal MOT program (8), which is a linear program with N^{k} variables, i.e., exponentially many in the cardinality N of the support space \mathcal{X}. By strong duality, the optimal value of the primal MOT program is the same as that of the dual MOT program (10), which is a linear program

maxuμu:Auc\max_{u}\mu^{\prime}u:A^{\prime}u\leq c

with polynomially many variables but exponentially many constraints. It is well known that a linear program with exponentially many constraints can be solved in polynomial time via the ellipsoid method provided its feasible set \{A^{\prime}u\leq c\} admits a polynomial time computable separation oracle (see, e.g., Section 8.5 of [5]). Such an oracle is a procedure which accepts a proposal point u\in\mathbb{R}^{kN} and either confirms that u\in\{A^{\prime}u\leq c\}, or outputs a violated constraint. A polynomial separation oracle for the dual MOT problem is found in [2] (Proposition 12), resulting in a polynomial time algorithm to solve the MOT problem with quadratic cost (3) (Theorem 2 of [2]). (Besides the optimal value, which agrees for the primal and the dual, [2] are also interested in a primal vertex solution; their Proposition 11 discusses how to obtain a primal solution in polynomial time.)
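The oracle idea can be illustrated on a toy feasible set of pairwise difference constraints u_j - u_l \leq c_{jl} (a stand-in for the exponentially many MOT constraints, which the oracle of [2] handles without enumeration):

```python
import numpy as np

def separation_oracle(u, cost):
    """Toy separation oracle for the difference constraints
    u_j - u_l <= cost[j, l]: certify feasibility or return one
    violated constraint (the real MOT oracle of [2] does this for
    the N^k tuple constraints without enumerating them)."""
    N = len(u)
    for j in range(N):
        for l in range(N):
            if j != l and u[j] - u[l] > cost[j, l] + 1e-12:
                return (j, l)              # index of a violated constraint
    return None                            # u is feasible

x = np.array([0.0, 1.0, 3.0])
cost = (x[:, None] - x[None, :]) ** 2      # stand-in constraint bounds
print(separation_oracle(np.zeros(3), cost))                 # -> None
print(separation_oracle(np.array([5.0, 0.0, 0.0]), cost))   # -> (0, 1)
```

Given such an oracle, the ellipsoid method queries it at each iterate and cuts the feasible region with any violated constraint it returns, never touching the full constraint list.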

For the derivative bootstrap, the computation in Step 3 of Pseudocode 2 requires solving program (15), which is similar to the dual MOT linear program and is given by

maxugu:Auc, Bu=0\max_{u}g^{\prime}u:A^{\prime}u\leq c,\text{ }Bu=0

where B is the matrix of the linear map u\longrightarrow\sum_{i=1}^{k}u_{i} and g is a realization of G^{*} (the nature of the coefficient vector g does not affect the complexity). Note that B contributes only polynomially many constraints (namely, N of them); hence, given a polynomial separation oracle for \{A^{\prime}u\leq c\}, the remaining constraints \{Bu=0\} can be checked in polynomial time, giving the following theoretical complexity result for the derivative bootstrap for MOT:

Lemma 3.2 (Polynomial complexity of derivative bootstrap).

The derivative bootstrap linear program (Step 3 in Pseudocode 2) has computational complexity \text{poly}(N,k,\log U), where \log U is an upper bound on the number of bits of precision used to represent the coefficient vector G^{*}.

Proof details missing from the above discussion are provided in Appendix A.8.

Computational complexity of bootstrap methods is summarized in Table 2, and consistency of bootstrap is illustrated in Figure 3.

3.2 Fast approximation of the null distribution by UB0{UB}_{0}

A fast alternative to bootstrap sampling from the null distribution is to utilize the stochastic upper bound UB_{0} on the null random variable X_{0} provided by equation (16) in Theorem 2.2(b). As the proof of the theorem shows, UB_{0} can be constructed to have kN(N-1) constraints in place of the N^{k} constraints in X_{0} by exploiting the constraint structure under H_{0} (see Appendix A.2 for details). With only quadratically many constraints, the linear program for UB_{0} with any realization of the coefficient vector G can be solved quickly by modern linear programming solvers.

Sampling from UB0UB_{0} can be viewed as obtaining an upper bound on the derivative bootstrap sampling distribution of X0X_{0}, via the following algorithm:

Pseudocode 3 (Sampling UB0UB_{0}).

Given the data μ^n\widehat{\mu}_{n},

  1. 1.

    Sample μ^1Multinomial(n1,μ^1)\hat{\mu}_{1}^{*}\sim\text{Multinomial}(n_{1},\widehat{\mu}_{1}).

  2. 2.

    For each i=1,,ki=1,\ldots,k,
    sample G^iNormal(0,diag(μ^1)μ^1(μ^1))\hat{G}_{i}^{*}\sim\text{Normal}(0,\text{diag}(\widehat{\mu}_{1}^{*})-\hat{\mu}_{1}^{*}(\hat{\mu}_{1}^{*})^{\prime}).
    Let ai=jiλ^ja_{i}=\prod_{j\neq i}\hat{\lambda}_{j}, where λ^j=njn1++nk\hat{\lambda}_{j}=\frac{n_{j}}{n_{1}+\ldots+n_{k}}.

  3. 3.

    Solve the program (16) with {G^i}i=1k\{\hat{G}_{i}^{*}\}_{i=1}^{k} in place of {Gi}i=1k\{G_{i}\}_{i=1}^{k}.

Computational complexity of sampling UB_{0} is included in Table 2. The performance of UB_{0} for testing H_{0} on all real datasets considered in the paper is illustrated in Figure 4. Note that the low computational complexity of UB_{0} makes it possible to approximate the H_{0} distribution on large datasets within a few minutes on a standard laptop.

3.3 Permutation approach

An alternative to the asymptotic test (12) is a permutation test. The permutation approach in k-sample testing is frequently used when the asymptotic distribution is difficult to sample from due to, for example, an infinite number of parameters and/or difficulties in their estimation (the cases of [64], [47], and [50]). Moreover, permutation procedures are applicable when the sample sizes are small (and hence the asymptotic distribution may not be valid), giving exact level \alpha permutation tests (Section 15.2 of [27]).

A permutation test accepts a set of data points with group labels, randomly permutes the labels, and computes the test statistic of interest on the permuted data to compare with the original one. The number of random permutations R is usually taken to be between 99 and 999 out of the total (typically very large) number of possible permutations (p. 158 of [17]).

The MOTMOT permutation test is described in Pseudocode 4.

Pseudocode 4 (MOT based permutation test).

Given the data μ^n=(μ^1,,μ^k)\widehat{\mu}_{n}=(\widehat{\mu}_{1},\cdots,\widehat{\mu}_{k}):

  1. 1.

    Compute MOT(μ^n)MOT(\widehat{\mu}_{n}).

  2. 2.

Convert \widehat{\mu}_{n} to a matrix of support points, where each support point belongs to the i-th group, i=1,\ldots,k, and is repeated according to the counts in \widehat{\mu}_{i}. Collect the group labels in the vector v.

  3. 3.

    For each r=1,,Rr=1,\ldots,R, sample random permutation πr(v)\pi_{r}(v), permute support points according to πr(v)\pi_{r}(v), and construct measures μ^nr\widehat{\mu}_{n}^{r} based on the frequencies of support points in new groups. Compute permuted test statistic MOT(μ^nr)MOT(\widehat{\mu}_{n}^{r}) by solving the program (8).

  4. 4.

    Compute approximate p-value (p. 158 of [17]) as

\hat{p}:=\frac{1+\sum_{r=1}^{R}\mathbbm{1}_{\{MOT(\widehat{\mu}_{n}^{r})\geq MOT(\widehat{\mu}_{n})\}}}{1+R}

Computation of the permuted test statistic in Step 3 requires solving the MOT program (8), similarly to the case of the m-out-of-n bootstrap in Pseudocode 1. Hence, both algorithms have the same complexity, as shown in Table 2. Empirical performance of the permutation test of Pseudocode 4 is illustrated in Figure 5.
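Pseudocode 4 can be sketched with a generic scalar test statistic in place of the MOT value (the statistic below, the spread of the group means, is a placeholder for illustration only):

```python
import numpy as np

rng = np.random.default_rng(4)

def permutation_pvalue(samples, statistic, R=199):
    """Label-permutation test (Pseudocode 4) for k groups of 1-D data;
    `statistic` maps a list of k samples to a scalar."""
    pooled = np.concatenate(samples)
    sizes = [len(s) for s in samples]
    t_obs = statistic(samples)
    count = 0
    for _ in range(R):
        perm = rng.permutation(pooled)               # permute group labels
        groups = np.split(perm, np.cumsum(sizes)[:-1])
        if statistic(groups) >= t_obs:
            count += 1
    return (1 + count) / (1 + R)                     # p-value as in Step 4

# placeholder statistic: spread of the group means (not the MOT value)
stat = lambda gs: np.ptp([g.mean() for g in gs])

same = [rng.normal(0, 1, 100) for _ in range(3)]        # H_0 holds
diff = [rng.normal(0, 1, 100), rng.normal(0, 1, 100),
        rng.normal(2, 1, 100)]                          # H_a holds
p_same, p_diff = permutation_pvalue(same, stat), permutation_pvalue(diff, stat)
print(p_same, p_diff)
```

Swapping the placeholder statistic for the MOT value of the relabeled measures recovers Pseudocode 4, at the cost of one MOT solve per permutation.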

Table 2: Complexity of computing a single sample for the permutation null distribution, X_0 with the m-out-of-n bootstrap, X_0 with the derivative bootstrap, and UB_0 given by Theorem 2.2(b). Recall that k is the number of measures \mu_1,\ldots,\mu_k, and N is the cardinality of the underlying metric space \mathcal{X}=\{x_1,\ldots,x_N\}\subset\mathbb{R}^d. The “algorithm (theory)” row reports the algorithm used to prove the theoretical complexity, while the “algorithm (practice)” row reports the algorithms implemented in this paper and available for use. The algorithm of Altschuler & Boix-Adserà [2] is abbreviated as AB-A and assumes fixed d.

distribution to sample: permutation or X_0 (m-out-of-n) | X_0 (derivative) | UB_0
hypothesis: null, alternative | null | null
optimization program: equation (8) | equation (15) | equation (16)
# variables: N^k | kN | (k-1)N
# equality constraints: kN | N | none
# inequality constraints: none | N^k | (k-1)N(N-1)
theoretical complexity: poly(N, k, log U) | poly(N, k, log U) | poly(N, k, log U)
reference: Theorem 2 of [2] | Lemma 3.2 here | Theorem 6 of [2]
algorithm (theory): AB-A [2] | AB-A [2] | Ellipsoid
algorithm (practice): AB-A [2], Simplex, Interior point | AB-A [2], Simplex, Interior point | Simplex, Interior point
software: Github for [2] (d=2), GUROBI [37], RSymphony [41] | GUROBI [37], RSymphony [41] | GUROBI [37], RSymphony [41]

4 Applications

Sections 4.1.1 and 4.1.2 illustrate basic properties of MOT based inference on synthetic datasets with measures supported on finite subsets of \mathbb{R}^{d}, d=1,2,3. The structures of these datasets emulate potential issues in real data settings while providing convenient models to demonstrate the advantages of MOT based procedures over existing methods (Figures 5 and 6).

Sections 4.2.1 and 4.2.2 illustrate how MOT based inference can be used in real biomedical settings where the measures of interest are naturally finitely supported on a given metric space. We use Surveillance, Epidemiology, and End Results (SEER) (https://seer.cancer.gov), a large database on cancers in the United States routinely used in the biomedical literature. A detailed description of the data used, including information on the sample sizes, is provided in the Supplementary Material (Kravtsova (2024)).

4.1 Illustrations on synthetic data

4.1.1 3D Experiment dataset: testing H0H_{0}

We construct the dataset 3D Experiment, which aims to emulate the experimental setting of counting the number of induced cells in response to a treatment. The model organism frequently used in such experiments is the nematode worm C. elegans. The goal of the experiment is to determine whether a certain genetic modification interferes with normal organ development, resulting in abnormal cell behavior observed in diseases [16, 83].

The abnormality is measured by the number of induced cells that emerge after genetic modification. There can be 0, 1, or 2 induced cells in each worm; a total of n worms are examined, giving a measure supported on \{0,1,2\}. (In the real experiment, the measure is supported on \{0,1,2,3\}, but we simplify it for the purpose of this synthetic dataset.) When constructing the 3D Experiment dataset, we assume that counting is simultaneously performed in two more sites of an animal, where the number of induced cells can be \{0,1\}. This results in the 3-dimensional support \{0,1\}\times\{0,1\}\times\{0,1,2\} with 2\times 2\times 3=12 points (Figure 5A).

Recent results in the biological literature report differences in the number of induced cells between the worm species C. elegans and C. briggsae [18, 55]. Inspired by these results, we construct four measures \{\mu_1,\mu_2,\mu_3,\mu_4\} in the 3D Experiment dataset: the first two correspond to two C. briggsae worm strains, and the last two to two C. elegans worm strains (Figure 5A). We use this setup to demonstrate the power of the MOT asymptotic and permutation tests for testing H_0 (Figure 5B).

Figure 5: MOT based testing of H_0 and empirical power on the 3D Experiment synthetic data. A. Structure and plot of the true measures (“clustered” alternative is shown). B. Left: Illustration of the “clustered” and “sparse” alternatives discussed in Section 2.4. Right: Empirical power against the “clustered” (\mu_1=\mu_2\neq\mu_3=\mu_4) and “sparse” (\mu_1=\mu_2=\mu_3\neq\mu_4) alternatives. The MOT asymptotic test (12) and MOT permutation test (Section 3.3) are compared with the following four tests: the distance based test DISCO from [64] (gray), kernel based tests from [50] with Gaussian (solid black) and energy distance (dashed black) kernels, and the test based on empirical characteristic functions from [47] (orange). Sample sizes are taken as n=300 and n=500 for the two classes, respectively.

4.1.2 Anderes et al. 2016 dataset: HaH_{a} inference

We consider the data constructed by [3], which demonstrates the properties of a barycenter of finitely supported measures. Each measure represents a demand distribution (for some hypothetical product) over nine locations on the map (these locations are cities in California, and they constitute a finite support \mathcal{X}=\{x_1,\ldots,x_9\}\subset\mathbb{R}^2 for the demand distributions). There are 12 measures in [3], each giving the demand distribution during a particular month (Figure 6A,B).

We use these measures as the ground truth and construct empirical measures by sampling multinomial counts based on this truth. We note that all 12 underlying true measures are different, i.e., H_a holds. Moreover, the differences are more drastic between months with different temperatures, since [3] allows the temperature to influence the demand. Our inference under H_a confirms this claim by examining sub-collections of measures with months from the same season versus months from different seasons and comparing Confidence Regions for MOT under these settings (Figure 6C).

Figure 6: MOT based inference under H_a for the Anderes et al. 2016 dataset from [3] described in Section 4.1.2. A. Schematic of the state of California map with nine cities supporting the 12 measures corresponding to monthly demand distributions (four of them shown in B). B. Illustration of the estimated MOT plan (multicoupling; five highest probability non-identity tuples shown) coupling the support points of four measures. C. 99% Confidence Regions for the MOT cost under H_a for 4-measure collections. Note the overlap of similar collections and no overlap between different collections, as well as higher MOT costs for collections whose monthly demands are more different.

4.2 Applications to real data

4.2.1 SEER Tumor size dataset: testing H0H_{0}

An important question in cancer studies is to determine which factors are associated with the development of metastases. In the case of breast cancer, [73] showed that metastatic risk increases with tumor size in intermediate and some of the large tumors (\geq 1 cm), but does not increase in small tumors (<1 cm). The study used the SEER database and considered a correlation between tumor size and prevalence of metastases. Here we confirm these results via k-sample testing, as described below. Further, we observe a similar trend in three more cancer types: prostate cancer, lung cancer in males, and lung cancer in females (Figure 7).

We use the SEER database to extract the data on distributions of tumor size and term this dataset SEER Tumor size. We consider three groups of patients with different disease progression status, giving k=3k=3 measures: patients with no metastases present at diagnosis and alive at the end of the study (μ1\mu_{1}), patients with metastases at diagnosis and alive at the end of the study (μ2\mu_{2}), and patients dead by the end of the study with death caused by the diagnosed cancer (μ3\mu_{3}).

First, we test H_0 for size distributions in the small tumor range (<1 cm); we find no difference between groups, which holds for breast, prostate, and both lung cancer types (Figure 7A). In contrast, the groups are found different for tumors in the larger range (1-9 cm), which again holds for all considered cancer types (Figure 7B). The analysis confirms the significance of metastatic status for the tumor size distribution in intermediate/large tumors, but not in small tumors.

Figure 7: Application of MOT based testing of H_0 to the comparison of tumor size distributions from the SEER Tumor size dataset described in Section 4.2.1. Size distributions are compared for three groups of patients with different metastatic/survival characteristics (alive at the end of the study with no metastases at diagnosis, alive with metastases present at diagnosis, and dead at the end of the study). All sample sizes are very large, except in the prostate cancer case (sample sizes are provided in the Supplementary Material (Kravtsova (2024))). A. Comparison of sizes in the smaller tumor size range shows no difference (at the 1% level) among the three groups of patients. B. Comparison of sizes for larger tumors shows a significant difference (at the 1% level) among the three groups of patients. Note: For prostate cancer with small tumor sizes, enough data was available only for two groups of patients rather than three, so we used the Sommerfeld-Munk test [72] in place of the MOT test and obtained a similar conclusion.

4.2.2 SEER Year of diagnosis dataset: HaH_{a} inference

Our final example concerns potential differences in the distributions of characteristics of patients diagnosed at different times. Such differences are discussed in the case of early stage lung cancer, possibly due to improvements in diagnostic technologies [58]. Here we compare these distributions in the framework of k-sample inference to confirm the differences in diagnosis results over time, and show that the trend is similar in both male and female patients.

We use the SEER database to extract joint distributions of tumor size and patients’ age for lung cancer in males and females and term this dataset SEER Year of diagnosis. We consider four time periods, giving k=4 measures: 2004 - 2006 (\mu_1), 2009 - 2011 (\mu_2), 2014 - 2016 (\mu_3), and 2019 - 2020 (\mu_4). The distributions are found different by the MOT test in both the male and female lung cancer cases, and we are interested in comparing the differences between the male and female collections of measures (Figure 8).

We observe visually that the differences between measures are of a similar nature in the male and female cases: later diagnostic years appear to have more small size tumors diagnosed in comparison to earlier years (Figure 8A). The similarities between the male and female cases are reflected in overlapping Confidence Regions. The reported MOT plan also highlights this finding by coupling the small size support points from the later periods with the larger size support points from the earlier period for patients of the same age (Figure 8B).

Figure 8: Application of MOT based inference under H_a to the comparison of bivariate (age/tumor size) distributions from the SEER Year of diagnosis dataset described in Section 4.2.2. Bivariate distributions are compared for four periods of diagnosis: 2004 - 2006, 2009 - 2011, 2014 - 2016, and 2019 - 2020. In both male and female cases, the age/tumor size distributions are different between the four periods of diagnosis (i.e., H_0 is rejected). A. Confidence Regions for the MOT cost for the male and female cases overlap, suggesting that differences between distributions are of similar magnitude. B. Illustration of the estimated MOT plan (multicoupling) for the male case (non-identity tuples with positive multicoupling mass are shown).

5 Discussion and Conclusions

5.1 Summary of results

In this paper, we proposed an Optimal Transport approach to k-sample inference. We used the optimal value of the Multimarginal Optimal Transport program (MOT) to quantify the difference in a given collection of k measures supported on finite subsets of \mathbb{R}^d, d\geq 1.

We derived limit laws for the empirical version of MOT under the assumptions of H_0 (all k measures are the same) and H_a (some measures may differ). We established that the limit cannot be Gaussian under H_0, and provided sufficient conditions for the limit to be Gaussian under H_a. Based on these results, we derived an expression for the power function of the test of H_0; using this function, we proved consistency of the test against any fixed alternative and uniform consistency over certain broad classes of alternatives.

To sample from the limit laws, we confirmed that the derivative and m-out-of-n bootstrap methods are consistent under H_0, and that the m-out-of-n bootstrap is consistent under H_a. We proved polynomial complexity of sampling via the derivative bootstrap, and defined a low complexity upper bound to approximate the test cut-off under H_0. As an alternative to sampling from the limit laws, we defined a permutation test that is suitable if the sample sizes are not large enough to ensure convergence to the limit.

We empirically showed that the MOT based test of H_0 has strong finite sample power when compared with state-of-the-art methods. We also showed how to construct Confidence Regions for the true MOT value under the assumptions of H_a, and how to use this procedure to compare variability between collections of k measures. Finally, we demonstrated the use of our methodology on several real biomedical datasets.

5.2 Limitations and future directions

Extensions to continuous measures: One of the main benefits of working on finite spaces is the ability to obtain a non-degenerate limit law under H_0 (i.e., a law with non-zero variance), which makes it possible to quantify fluctuations of the MOT value when all measures are the same and to test H_0. In the k=2 case, non-degeneracy may fail for continuous measures (see the discussion in Section 1.2), but holds for discrete measures with limit laws of the form [72]. When extending [72] to continuous measures in the k=2 case, [46] show that non-degenerate limit laws are possible provided that there exist dual variables (the Kantorovich potentials) which are not constant almost everywhere (Theorem 4.2 of [46]). While constant potentials are always present under H_0 (Corollary 4.6 of [46]), in the discrete case there are other potentials that are not constant (this holds in our case of k>2; see the discussion preceding Theorem 2.2). Lemma 11 of [74] shows that in the k=2 case, it is possible to get non-constant potentials for continuous measures by requiring the support to be disconnected (intuitively, this resembles the discrete situation). It is an interesting future direction to analyze how the k>2 potentials would behave if our limit laws were extended to continuous measures and possibly a different ground cost c.

Improving upper bounds on the null distribution: While the proposed null upper bound UB_0 is computationally tractable and tight for k=2 measures, it may be too large to provide good power for testing H_0 for larger k. The main reason for this weakness is the “independent” nature of the optimization over the dual variables u_1,\ldots,u_k recorded in the constraints. Indeed, the constraints that relate different entries of different dual vectors are omitted, and hence the dual vectors interact only via \sum_{i=1}^k u_i=0 when solving the UB_0 program. The bound UB_0 can be strengthened by introducing additional constraints from X_0, which will decrease the value of the program and provide a tighter bound on the null distribution. Two possible choices for these extra constraints are (1) including constraints that involve diverse entries from different dual vectors (e.g., u_1^1+u_2^2+\cdots+u_k^k\leq c_{12\cdots k}), and/or (2) sampling constraints at random (such constraint sampling techniques are widely applicable when solving large linear programs arising, for instance, in Markov Decision Processes [19]).

Faster computation of the MOT/barycenter value and permutation test: Our empirical power results suggest that MOT could serve as a suitable statistic for a permutation test, which can be performed if the sample sizes are not large enough to validate the asymptotic tests. In that case, the MOT (or, equivalently, the barycenter) value has to be computed for each permutation. While direct computation of the barycenter cost is not feasible for a large cardinality of the support N and/or number of measures k, recently proposed sampling techniques can be used to speed up the computation while preserving some statistical guarantees [42]. These methods are especially applicable when measures have different supports, and hence can be applied in extensions to continuous measures.

Appendix A Proofs of main results omitted in the main text

A.1 Details on the proof of Theorem 2.2(a)

Step 1 (Weak convergence of measures) For every i=1,,ki=1,\ldots,k, the empirical process converges weakly (Theorem 14.3 - 4 of [7]) as

ni(μ^iμ)in lawGi\sqrt{n_{i}}\left(\widehat{\mu}_{i}-\mu\right)\xrightarrow[]{\text{in law}}G_{i}

where G_{i}\sim N\left(0,\text{diag}(\mu_{i})-\mu_{i}\mu_{i}^{\prime}\right). Since the processes are independent in i=1,\ldots,k, by Theorem 1.4.8 of [79] we can view them jointly as

(n1(μ^1μ1)nk(μ^kμk))in law(G1Gk)\begin{pmatrix}\sqrt{n_{1}}(\hat{\mu}_{1}-\mu_{1})\\ \vdots\\ \sqrt{n_{k}}(\hat{\mu}_{k}-\mu_{k})\end{pmatrix}\xrightarrow[]{\text{in law}}\begin{pmatrix}G_{1}\\ \vdots\\ G_{k}\end{pmatrix}

with respect to l1l^{1} norm on i=1kl1(𝒳)\bigotimes_{i=1}^{k}l^{1}(\mathcal{X}). Using Slutsky’s Theorem (e.g., Example 1.4.7 of [79])

(n1n2n1++nknkn1++nk(μ^1μ1)nkn1n1++nknk1n1++nk(μ^kμk))in law(λ2λka1G1λ1λk1akGk)=:G\begin{pmatrix}\sqrt{n_{1}}\frac{\sqrt{n_{2}}}{\sqrt{n_{1}+\ldots+n_{k}}}\cdots\frac{\sqrt{n_{k}}}{\sqrt{n_{1}+\ldots+n_{k}}}(\hat{\mu}_{1}-\mu_{1})\\ \vdots\\ \sqrt{n_{k}}\frac{\sqrt{n_{1}}}{\sqrt{n_{1}+\ldots+n_{k}}}\cdots\frac{\sqrt{n_{k-1}}}{\sqrt{n_{1}+\ldots+n_{k}}}(\hat{\mu}_{k}-\mu_{k})\end{pmatrix}\xrightarrow[]{\text{in law}}\begin{pmatrix}\overbrace{\sqrt{\lambda_{2}}\cdots\sqrt{\lambda_{k}}}^{\sqrt{a_{1}}}G_{1}\\ \vdots\\ \underbrace{\sqrt{\lambda_{1}}\cdots\sqrt{\lambda_{k-1}}}_{\sqrt{a_{k}}}G_{k}\end{pmatrix}=:G

which is of the form

ρn(μ^nμ)in lawaG\rho_{n}(\widehat{\mu}_{n}-\mu)\xrightarrow[]{\text{in law}}\sqrt{a}G

with ρn:=n1nk(n1++nk)k1\rho_{n}:=\frac{\sqrt{n_{1}\cdot\ldots\cdot n_{k}}}{\left(\sqrt{n_{1}+\ldots+n_{k}}\right)^{k-1}}.
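As a quick sanity check on this joint rate, the following sketch verifies that \rho_n reduces to the classical two-sample rate \sqrt{n_1 n_2/(n_1+n_2)} when k=2:

```python
import numpy as np

def rho(ns):
    """Joint scaling rate rho_n = sqrt(prod n_i) / (sqrt(sum n_i))^(k-1)."""
    ns = np.asarray(ns, dtype=float)
    return np.sqrt(ns.prod()) / np.sqrt(ns.sum()) ** (len(ns) - 1)

# k = 2 recovers the classical two-sample rate sqrt(n1 * n2 / (n1 + n2))
print(rho([100, 200]), np.sqrt(100 * 200 / 300))
```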

Step 2 (Hadamard directional differentiability of MOT) Consider the functional f:\bigotimes_{i=1}^{k}\mathcal{P}(\mathcal{X})\subseteq\bigotimes_{i=1}^{k}l^{1}(\mathcal{X})\longrightarrow\mathbb{R} given by f(\mu)=MOT(\mu), where MOT(\mu) is the optimal value z of the primal program (8), or, equivalently, the dual program (10). The map \mu\longrightarrow z(\mu) is Gâteaux directionally differentiable at \mu tangentially to a certain set D\subseteq\bigotimes_{i=1}^{k}\mathcal{P}(\mathcal{X}) (Theorem 3.1 of [32]) and locally Lipschitz (Remark 2.1 of [9]; the locally Lipschitz property can also be shown directly, see Supplementary Material (Kravtsova (2024))). Hence, it is Hadamard directionally differentiable at \mu tangentially to D, and the two derivatives coincide (see Proposition 3.5 of [69] and also the discussion in Section 2.1 of [9]). The derivative is given by

fμ(g)=max{u:Auc,u,μ=MOT(μ)}u,gf^{\prime}_{\mu}(g)=\max_{\{u:A^{\prime}u\leq c,\langle u,\mu\rangle=MOT(\mu)\}}\langle u,g\rangle (23)

for gD:={limnμ^nμtng\in D:=\{\lim_{n\to\infty}\frac{\widehat{\mu}_{n}-\mu}{t_{n}}, μ^nl1(𝒳)\widehat{\mu}_{n}\in l^{1}(\mathcal{X}), tn0}t_{n}\searrow 0\}.

A.2 Details on the proof of Theorem 2.2(b)

Consider the null distribution of X0𝒟0X_{0}\sim\mathcal{D}_{0} in (15) given by

maxu\displaystyle\max_{u} i=1kaiui,Gi\displaystyle\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},G_{i}\rangle
s.t. i=1kui=0\displaystyle\sum_{i=1}^{k}u_{i}=0
Auc\displaystyle A^{\prime}u\leq c

Note that for any given realization of random coefficients Gi𝒩(0,Σ1)G_{i}\sim\mathcal{N}\left(0,\Sigma_{1}\right), this linear program has kk dual vectors of length NN each, i.e. kNkN variables in total. There are NN equality constraints, each for summing one of the NN entries of these vectors to zero. The large number NkN^{k} of inequality constraints comes from the size of the primal constraint matrix AkN×NkA\in\mathbb{R}^{kN\times N^{k}}, as discussed in Section 2.1.

To construct UB0𝒟UB0{UB}_{0}\sim\mathcal{D}_{{UB}_{0}} with smaller complexity than X0X_{0}, we take the same objective function as in X0X_{0} program above, but relax some of the inequality constraints AucA^{\prime}u\leq c subject to i=1kui=0\sum_{i=1}^{k}u_{i}=0. Formally, we represent the equality constraints of X0X_{0} as

i=1kui=0u=[u1uk]ker(B)\sum_{i=1}^{k}u_{i}=0\iff u=\begin{bmatrix}u_{1}\\ \vdots\\ u_{k}\end{bmatrix}\in\text{ker}(B)

where BB represents the linear operator with matrix whose jjth row picks jjth element from vectors u1,,uku_{1},\ldots,u_{k} and sums them up.

As detailed in the proof of Lemma 2.4 given in A.4, for uker(B)u\in\text{ker}(B), there are constraints in AucA^{\prime}u\leq c of the form

u11u1jc11j{u_{1}}_{1}-{u_{1}}_{j}\leq c_{1\cdots 1j}

which can be written as

(110010101001)A~1u1(c112c11N)c~1\underbrace{\begin{pmatrix}1&-1&0&\cdots&0\\ 1&0&-1&\ldots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 1&0&0&\cdots&-1\end{pmatrix}}_{\widetilde{A}^{\prime}_{1}}u_{1}\leq\underbrace{\begin{pmatrix}c_{1\cdots 12}\\ \vdots\\ \vdots\\ c_{1\cdots 1N}\end{pmatrix}}_{\widetilde{c}_{1}}

Similarly, for uker(B){Auc}u\in\text{ker}(B)\cap\{A^{\prime}u\leq c\}, the first dual vector u1u_{1} satisfies

(110001100101)A~2u1(c221c22N)c~2\underbrace{\begin{pmatrix}-1&1&0&\cdots&0\\ 0&1&-1&\ldots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&1&0&\cdots&-1\end{pmatrix}}_{\widetilde{A}^{\prime}_{2}}u_{1}\leq\underbrace{\begin{pmatrix}c_{2\cdots 21}\\ \vdots\\ \vdots\\ c_{2\cdots 2N}\end{pmatrix}}_{\widetilde{c}_{2}}
\vdots
\underbrace{\begin{pmatrix}-1&0&0&\cdots&1\\ 0&-1&0&\ldots&1\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&-1&1\end{pmatrix}}_{\widetilde{A}^{\prime}_{N}}u_{1}\leq\underbrace{\begin{pmatrix}c_{N\cdots N1}\\ \vdots\\ \vdots\\ c_{N\cdots N(N-1)}\end{pmatrix}}_{\widetilde{c}_{N}}

Thus, for $u\in\ker(B)\cap\{A^{\prime}u\leq c\}$,

\begin{pmatrix}\widetilde{A}_{1}^{\prime}\\ \vdots\\ \widetilde{A}_{N}^{\prime}\end{pmatrix}u_{1}\leq\begin{pmatrix}\widetilde{c}_{1}\\ \vdots\\ \widetilde{c}_{N}\end{pmatrix}

and the same constraints are satisfied by $u_2,\ldots,u_k$. Combining these constraints, we obtain

\underbrace{\begin{pmatrix}\begin{pmatrix}\widetilde{A}_{1}^{\prime}\\ \vdots\\ \widetilde{A}_{N}^{\prime}\end{pmatrix}&0&\cdots&0\\ 0&\begin{pmatrix}\widetilde{A}_{1}^{\prime}\\ \vdots\\ \widetilde{A}_{N}^{\prime}\end{pmatrix}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&\begin{pmatrix}\widetilde{A}_{1}^{\prime}\\ \vdots\\ \widetilde{A}_{N}^{\prime}\end{pmatrix}\end{pmatrix}}_{\widetilde{A}^{\prime}}\begin{bmatrix}u_{1}\\ \vdots\\ u_{k}\end{bmatrix}\leq\underbrace{\begin{pmatrix}\begin{pmatrix}\widetilde{c}_{1}\\ \vdots\\ \widetilde{c}_{N}\end{pmatrix}\\ \vdots\\ \begin{pmatrix}\widetilde{c}_{1}\\ \vdots\\ \widetilde{c}_{N}\end{pmatrix}\end{pmatrix}}_{\widetilde{c}}

for all $u\in\ker(B)\cap\{A^{\prime}u\leq c\}$. This gives the system $\widetilde{A}^{\prime}u\leq\widetilde{c}$ with $kN(N-1)$ constraints, which we choose as the constraint set for the linear program (16) defining ${UB}_0$ (which now has no equality constraints).

Note that for $u\in\ker(B)$ we have $u_{1}=-\sum_{i=2}^{k}u_{i}$, which, upon substitution into the objective function of (15), gives

\langle u_{2},\sqrt{a_{2}}G_{2}\rangle+\cdots+\langle u_{k},\sqrt{a_{k}}G_{k}\rangle-\langle u_{2},\sqrt{a_{1}}G_{1}\rangle-\cdots-\langle u_{k},\sqrt{a_{1}}G_{1}\rangle\overset{d}{=}\sum_{i=2}^{k}\langle u_{i},\sqrt{a_{i}}G_{i}-\sqrt{a_{1}}G_{1}\rangle

This is the objective in the linear program (16) defining ${UB}_0$, with $(k-1)N$ variables.
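The relaxed system $\widetilde{A}^{\prime}u\leq\widetilde{c}$ can likewise be assembled explicitly. The sketch below (again illustrative, with the same toy assumptions: equal weights, one-dimensional support, and row-centered Gaussian draws standing in for the multinomial limits) builds one block of $N(N-1)$ pairwise constraints, replicates it across the $k$ dual vectors, and solves the resulting program approximating ${UB}_0$.

```python
# Toy instance of the relaxed program defining UB_0: kN(N-1) pairwise
# constraints, no equality constraints (illustrative sketch only).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
k, N = 3, 5
x = rng.normal(size=N)                 # support points in R^1 (assumed)
a = np.full(k, 1.0 / k)                # sampling proportions (assumed equal)
G = rng.normal(size=(k, N))
G -= G.mean(axis=1, keepdims=True)     # rows sum to zero, as under Sigma_1

# One block of pairwise constraints: u[j] - u[l] <= (k-1)/k^2 * ||x_j - x_l||^2
rows, rhs = [], []
for j in range(N):
    for l in range(N):
        if j != l:
            row = np.zeros(N)
            row[j], row[l] = 1.0, -1.0
            rows.append(row)
            rhs.append((k - 1) / k**2 * (x[j] - x[l]) ** 2)
block, c_tilde = np.array(rows), np.array(rhs)

# Block-diagonal A-tilde: the same N(N-1) constraints for each of the k duals
A_tilde = np.kron(np.eye(k), block)    # shape kN(N-1) x kN
b_tilde = np.tile(c_tilde, k)
assert A_tilde.shape[0] == k * N * (N - 1)

obj = -np.concatenate([np.sqrt(a[i]) * G[i] for i in range(k)])  # maximize
res = linprog(obj, A_ub=A_tilde, b_ub=b_tilde,
              bounds=[(None, None)] * (k * N))
print(res.status, -res.fun)            # optimal value approximates UB_0
```

Since the relaxed feasible set contains that of $X_0$, the value computed here upper-bounds the $X_0$ value for the same realization of the coefficients.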

A.3 Details for Observation 2.3

Consider a cost vector in the MOT program (8) with entries $c_{i_{1}i_{2}\cdots i_{k}}$, where each index takes values in $\{1,\ldots,N\}$. Suppose that $k-1$ indices share the same value, e.g. $c_{i\cdots ij}$. Then,

c_{i\cdots ij}=\frac{1}{k}\left((k-1)\left\|x_{i}-\overline{x}_{i\cdots ij}\right\|^{2}+\left\|x_{j}-\overline{x}_{i\cdots ij}\right\|^{2}\right)

where $\overline{x}_{i\cdots ij}=\frac{1}{k}\left[(k-1)x_{i}+x_{j}\right]$. The first term gives

(k-1)\left\|x_{i}-\overline{x}_{i\cdots ij}\right\|^{2}=\frac{k-1}{k^{2}}\|x_{i}-x_{j}\|^{2}

and the second term gives

\left\|x_{j}-\overline{x}_{i\cdots ij}\right\|^{2}=\frac{(k-1)^{2}}{k^{2}}\|x_{i}-x_{j}\|^{2}

Combining the two terms and multiplying by $\frac{1}{k}$ gives the result.
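The identity is easy to verify numerically. The snippet below is an illustrative check (the points and the value of $k$ are arbitrary choices, not from the paper) of $c_{i\cdots ij}=\frac{k-1}{k^{2}}\|x_{i}-x_{j}\|^{2}$ for the barycentric cost.

```python
# Numerical check of Observation 2.3 for arbitrary points in R^2
import numpy as np

k = 5
xi, xj = np.array([0.3, -1.2]), np.array([2.0, 0.7])   # arbitrary points
xbar = ((k - 1) * xi + xj) / k                         # barycenter of the tuple
cost = ((k - 1) * np.sum((xi - xbar) ** 2) + np.sum((xj - xbar) ** 2)) / k
assert np.isclose(cost, (k - 1) / k**2 * np.sum((xi - xj) ** 2))
```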

A.4 Proof of Lemma 2.4

For notational clarity, we start with the case of $k=2$ measures, and assume for simplicity that they are supported on a metric space $\mathcal{X}$ with only two points. Let $u=(u_{1},u_{2})=\left(({u_{1}}_{1},{u_{1}}_{2}),({u_{2}}_{1},{u_{2}}_{2})\right)$ be a solution to the dual MOT program (10) satisfying $\sum_{i=1}^{k}u_{i}=0$, i.e. $u_{2}=-u_{1}$. Recall that the constraint matrix in the dual constraints $A^{\prime}u\leq c$ is

A^{\prime}=\begin{pmatrix}1&0&1&0\\ 1&0&0&1\\ 0&1&1&0\\ 0&1&0&1\end{pmatrix}

Applying it to $u=\begin{pmatrix}u_{1}\\ -u_{1}\end{pmatrix}$ gives the constraints on $u_1$ as

{u_{1}}_{1}-{u_{1}}_{1}\leq c_{11}=0
{u_{1}}_{1}-{u_{1}}_{2}\leq c_{12}
{u_{1}}_{2}-{u_{1}}_{1}\leq c_{21}=c_{12}
{u_{1}}_{2}-{u_{1}}_{2}\leq c_{22}=0

The middle two constraints give

|{u_{1}}_{1}-{u_{1}}_{2}|\leq c_{12}

Recalling that ${u_{1}}_{1}=0$, we get that

|{u_{1}}_{2}|\leq c_{12}

If the number of support points were $N>2$, a similar argument using the constraints ${u_{1}}_{1}-{u_{1}}_{j}\leq c_{1j}$ and ${u_{1}}_{j}-{u_{1}}_{1}\leq c_{1j}$ would give

|{u_{1}}_{j}|\leq c_{1j}\text{ for }j=1,\ldots,N

Recall from Observation 2.3 that $c_{1j}=\frac{k-1}{k^{2}}\|x_{1}-x_{j}\|^{2}$, which finishes the proof for $k=2$ measures.

To see the result for $k>2$ measures, note that $A^{\prime}u\leq c$ contains the constraints

\underbrace{{u_{1}}_{1}+{u_{2}}_{1}+\cdots+{u_{k-1}}_{1}}_{-{u_{k}}_{1}}+{u_{k}}_{2}\leq c_{1\cdots 12}
\underbrace{{u_{1}}_{2}+{u_{2}}_{2}+\cdots+{u_{k-1}}_{2}}_{-{u_{k}}_{2}}+{u_{k}}_{1}\leq c_{2\cdots 21}

and by Observation 2.3, $c_{1\cdots 12}=c_{2\cdots 21}=\frac{k-1}{k^{2}}\|x_{1}-x_{2}\|^{2}$, giving

|{u_{k}}_{2}|\leq\frac{k-1}{k^{2}}\|x_{1}-x_{2}\|^{2}

and, by similar reasoning,

|{u_{k}}_{j}|\leq\frac{k-1}{k^{2}}\|x_{1}-x_{j}\|^{2}\text{ for }j=1,\ldots,N

Similarly, we conclude the same property for all dual variables $u_i$, $i=1,\ldots,k$, i.e.

|{u_{i}}_{j}|\leq\frac{k-1}{k^{2}}\|x_{1}-x_{j}\|^{2}\text{ for }j=1,\ldots,N

concluding the proof.

A.5 Proof of Lemma 2.13

By the equivalence between the MOT and barycenter problems (2) and (4),

MOT(\mu)=\inf_{\nu\in\mathcal{P}^{2}(\mathbb{R}^{d})}\frac{1}{k}\sum_{i=1}^{k}W_{2}^{2}(\mu_{i},\nu)=\inf_{\nu\in\mathcal{P}^{2}(\mathbb{R}^{d})}\frac{1}{k}\left[\frac{k}{2}W_{2}^{2}(\mu_{1},\nu)+\frac{k}{2}W_{2}^{2}(\mu_{k},\nu)\right]=\inf_{\nu\in\mathcal{P}^{2}(\mathbb{R}^{d})}\frac{1}{2}W_{2}^{2}(\mu_{1},\nu)+\frac{1}{2}W_{2}^{2}(\mu_{k},\nu)

i.e. ν\nu is a solution to the barycenter problem between μ1\mu_{1} and μk\mu_{k} with optimal value MOT(μ1,μk)=14W22(μ1,μk)MOT(\mu_{1},\mu_{k})=\frac{1}{4}W_{2}^{2}(\mu_{1},\mu_{k}). By similar reasoning, for μkC\mu\in\mathcal{F}^{C}_{k}, the population value MOT(μ)MOT(\mu) is the same as the value of MOT computed with one measure from each cluster.

A.6 Proof of Lemma 2.14

Let $(u_{1}^{*},u_{k}^{*})$ be dual optimal solutions to the problem $W_{2}^{2}(\mu_{1},\mu_{k})$. We will show that $(\frac{k-1}{k^{2}}u_{1}^{*},0,\ldots,0,\frac{k-1}{k^{2}}u_{k}^{*})$ is dual optimal for $MOT(\mu_{1},\ldots,\mu_{k})$, and hence

MOT(\mu_{1},\ldots,\mu_{k})=\frac{k-1}{k^{2}}\langle u_{1}^{*},\mu_{1}\rangle+\frac{k-1}{k^{2}}\langle u_{k}^{*},\mu_{k}\rangle=\frac{k-1}{k^{2}}W_{2}^{2}(\mu_{1},\mu_{k})

By optimality of $(u_{1}^{*},u_{k}^{*})$ for $W_{2}^{2}(\mu_{1},\mu_{k})$, the dual constraints hold,

u_{1}^{*i}+u_{k}^{*j}\leq\|x_{i}-x_{j}\|^{2}

for all $i,j\in\{1,\ldots,N\}$, and hold with equality,

u_{1}^{*i}+u_{k}^{*j}=\|x_{i}-x_{j}\|^{2}

for some pairs $(i,j)\in I$, with the set $I$ indexing the pairs $(x_{i},x_{j})$ that support the optimal Wasserstein coupling $\pi^{*}$. Consider a multicoupling that agrees with $\pi^{*}$ on tuples $(x_{i},\ldots,x_{i},x_{j})$, $(i,j)\in I$, and is zero otherwise (so the set of such tuples carries the full multicoupling mass); this is the candidate for the primal optimal solution to MOT. Further, the above equality implies, for $(i,j)\in I$,

\frac{k-1}{k^{2}}u_{1}^{*i}+0+\cdots+0+\frac{k-1}{k^{2}}u_{k}^{*j}=\frac{k-1}{k^{2}}\|x_{i}-x_{j}\|^{2}\underset{\text{Observation 2.3}}{=}c_{i\cdots ij}

Moreover, $c_{i\cdots ij}$ is no larger than the corresponding entry of $c$ when some indices are not repeated, making the candidate $(\frac{k-1}{k^{2}}u_{1}^{*},0,\ldots,0,\frac{k-1}{k^{2}}u_{k}^{*})$ dual feasible for MOT. By complementary slackness (e.g., Lemma 1.1 of [35], which specifically addresses the multimarginal problem), our dual candidate is optimal.

A.7 Proof of Proposition 3.1

  • (a)

    By Theorem 3.1 of [44], the numerical directional derivative estimator $\frac{f(\hat{\mu}_{n}+\varepsilon_{n}\hat{G}^{*})-f(\hat{\mu}_{n})}{\varepsilon_{n}}$ is consistent for the directional derivative $f_{\mu}^{\prime}(G)$ under mild measurability conditions on $\hat{G}^{*}$. The choice $\varepsilon_{n}=\frac{1}{r_{m}}$ with $m=\sqrt{n}$ (or, more generally, $m=n^{p}$ with $p\in(0,1)$) ensures that the assumptions of the theorem are satisfied, i.e. that $\varepsilon_{n}\to 0$ and $\varepsilon_{n}r_{n}=\frac{r_{n}}{r_{m}}=\sqrt{\frac{n}{m}}\to\infty$, and allows us to conclude consistency of this estimator for $f^{\prime}_{\mu}(G)$. Note that consistency does not depend on the form of $f_{\mu}^{\prime}(G)$, and hence holds under both $H_{0}$ and $H_{a}$.

  • (b)

    We will check that the estimator $f^{\prime}_{n}$ of the directional derivative map $f^{\prime}_{\mu}$, given by $f^{\prime}_{n}=f^{\prime}_{\hat{\mu}_{n}}$, is uniformly consistent in the sense of Assumption 4 of [29]. Note that under $H_{0}$, the estimator is given by

    f_{\hat{\mu}_{n}}^{\prime}:h\to\max_{u}\sum_{i=1}^{k}\sqrt{a_{i}}\langle u_{i},h_{i}\rangle
    \text{subject to }\sum_{i=1}^{k}u_{i}=0,\ A^{\prime}u\leq c

    The expression is independent of $\widehat{\mu}_{n}$, and hence the assumption is trivially satisfied. Thus, the proposed bootstrap is consistent by Theorem 3.2 of [29].
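The finite-difference estimator in part (a) is easy to sketch generically. The example below is illustrative only: the functional $f$ (a coordinate maximum with a tie, which is directionally but not fully differentiable) and the rate $r_n=\sqrt{n}$ are stand-ins, not the paper's MOT functional; the step follows $\varepsilon_n = 1/r_m$ with $m=\sqrt{n}$.

```python
# Numerical delta method sketch: finite difference of a directionally
# differentiable functional with step eps_n = 1/r_m, m = sqrt(n)
import numpy as np

def numerical_directional_derivative(f, mu_hat, G_star, n):
    m = np.sqrt(n)                 # subsample size m = n^p with p = 1/2
    eps = 1.0 / np.sqrt(m)         # eps_n = 1/r_m, assuming r_m = sqrt(m)
    return (f(mu_hat + eps * G_star) - f(mu_hat)) / eps

f = lambda v: np.max(v)            # toy functional with a non-smooth point
mu_hat = np.array([1.0, 1.0, 0.5]) # tie at the max: only directional derivative
G = np.array([0.3, -0.2, 0.1])
est = numerical_directional_derivative(f, mu_hat, G, n=10**6)
# true directional derivative: max of G over the argmax set {0, 1} = 0.3
print(est)
```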

A.8 Details on the proof of Lemma 3.2

Consider the linear program

\max_{u}\ g^{\prime}u
Bu=0
A^{\prime}u\leq c

where $g\in\mathbb{R}^{kN}$ is a vector of objective coefficients, $B\in\mathbb{R}^{N\times kN}$ and $A^{\prime}\in\mathbb{R}^{N^{k}\times kN}$ are matrices with entries in $\{0,1\}$, and $c$ is the cost vector from the primal MOT problem, where the measures $\mu$ and the support points in $\mathcal{X}$ are represented with $\log U$ bits of precision.

Recall that a linear program over a polytope with exponentially many constraints can be solved in polynomial time by the ellipsoid method if there exists a polynomial time separation oracle for the polytope (Theorem 8.5 of [5]). Here, we construct a separation oracle as follows. Given any $u\in\mathbb{R}^{kN}$, check whether the $N$ constraints $Bu=0$ are satisfied; if not, output a violated constraint (this is done with $\text{poly}(N)$ complexity). If $Bu=0$, check the constraints $A^{\prime}u\leq c$ (and output a violated constraint if needed) by employing the polynomial time oracle in Definition 10 of [2], which runs with $\text{poly}(N,k,\log U)$ complexity (Proposition 12 of [2]).
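The oracle's interface can be sketched on a toy instance. In the illustrative code below, the polynomial-time inequality oracle of [2] is replaced by brute force over the $N^k$ tuples (an assumption that only makes sense for tiny instances; the point is the two-stage structure, not the complexity).

```python
# Two-stage separation oracle sketch: equality constraints first, then
# the (here brute-forced) inequality constraints A'u <= c
import itertools
import numpy as np

def separation_oracle(u, x):
    """u: (k, N) candidate dual; returns None if feasible, else a violated row."""
    k, N = u.shape
    # Stage 1: equality constraints Bu = 0 (the k vectors sum to zero entrywise)
    s = u.sum(axis=0)
    for j in range(N):
        if abs(s[j]) > 1e-9:
            return ("equality", j)
    # Stage 2: inequality constraints A'u <= c, checked tuple by tuple
    for t in itertools.product(range(N), repeat=k):
        lhs = sum(u[m, t[m]] for m in range(k))
        xt = x[list(t)]
        c_t = np.mean((xt - xt.mean()) ** 2)   # barycentric cost of the tuple
        if lhs > c_t + 1e-9:
            return ("inequality", t)
    return None                                # u is feasible

rng = np.random.default_rng(3)
k, N = 3, 4
x = rng.normal(size=N)                         # 1-D support (assumed)
assert separation_oracle(np.zeros((k, N)), x) is None   # u = 0 is feasible
bad = np.zeros((k, N))
bad[0, 0] = 1.0
assert separation_oracle(bad, x) is not None            # violates Bu = 0
```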

Acknowledgments

The author would like to thank the anonymous referees, an Associate Editor and the Editor for their constructive comments that greatly improved the quality of this paper. The author is indebted to Ilmun Kim for sharing the code for the tests used for comparisons in Figure 5. The author sincerely thanks Adriana Dawes for advice and support and Florian Gunsilius for comments that greatly improved the paper. Helpful discussions with Helen Chamberlin, Jun Kitagawa, Facundo Mémoli, and Dustin Mixon are gratefully acknowledged. The efforts of the SEER Program in the creation and maintenance of the SEER database are gratefully acknowledged. The author is solely responsible for all remaining mistakes.

Funding

The author was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM132651 to Adriana Dawes.

References

  • [1] Agueh, M. and Carlier, G. (2011). Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis 43, 904–924.
  • [2] Altschuler, J. M. and Boix-Adsera, E. (2021). Wasserstein barycenters can be computed in polynomial time in fixed dimension. Journal of Machine Learning Research 22, 1–19.
  • [3] Anderes, E., Borgwardt, S. and Miller, J. (2016). Discrete Wasserstein barycenters: Optimal transport for discrete data. Mathematical Methods of Operations Research 84, 389–409.
  • [4] Balinski, M. L. and Russakoff, A. (1984). Faces of dual transportation polyhedra. Springer.
  • [5] Bertsimas, D. and Tsitsiklis, J. N. (1997). Introduction to Linear Optimization, Vol. 6. Athena Scientific, Belmont, MA.
  • [6] Bigot, J., Cazelles, E. and Papadakis, N. (2019). Central limit theorems for entropy-regularized optimal transport on finite spaces and statistical applications. Electronic Journal of Statistics 13, 5120–5150.
  • [7] Bishop, Y. M., Fienberg, S. E. and Holland, P. W. (2007). Discrete Multivariate Analysis: Theory and Practice. Springer Science & Business Media.
  • [8] Bénasséni, J. (2012). A new derivation of eigenvalue inequalities for the multinomial distribution. Journal of Mathematical Analysis and Applications 393, 697–698.
  • [9] Cárcamo, J., Cuevas, A. and Rodríguez, L.-A. (2020). Directional differentiability for supremum-type functionals: Statistical applications. Bernoulli 26, 2143–2175.
  • [10] Carlier, G., Delalande, A. and Merigot, Q. (2024). Quantitative stability of barycenters in the Wasserstein space. Probability Theory and Related Fields 188, 1257–1286.
  • [11] Chen, S. (2020). A new distribution-free k-sample test: Analysis of kernel density functionals. Canadian Journal of Statistics 48, 167–186.
  • [12] Chen, S. and Pokojovy, M. (2018). Modern and classical k-sample omnibus tests. Wiley Interdisciplinary Reviews: Computational Statistics 10, e1418.
  • [13] Chernozhukov, V., Galichon, A., Hallin, M. and Henry, M. (2017). Monge–Kantorovich depth, quantiles, ranks and signs. The Annals of Statistics 45, 223–256.
  • [14] Cleophas, T. J., Zwinderman, A. H., Cleophas, T. F. and Cleophas, E. P. (2009). Statistics Applied to Clinical Trials. Springer.
  • [15] Conover, W. J. (1965). Several k-sample Kolmogorov-Smirnov tests. The Annals of Mathematical Statistics 36, 1019–1026.
  • [16] Corchado-Sonera, M., Rambani, K., Navarro, K., Kladney, R., Dowdle, J., Leone, G. and Chamberlin, H. M. (2022). Discovery of nonautonomous modulators of activated Ras. G3 Genes|Genomes|Genetics 12, jkac200.
  • [17] Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and Their Application, Vol. 1. Cambridge University Press.
  • [18] Dawes, A. T., Wu, D., Mahalak, K. K., Zitnik, E. M., Kravtsova, N., Su, H. and Chamberlin, H. M. (2017). A computational model predicts genetic nodes that allow switching between species-specific responses in a conserved signaling network. Integrative Biology 9, 156–166.
  • [19] De Farias, D. P. and Van Roy, B. (2004). On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research 29, 462–478.
  • [20] Deb, N. and Sen, B. (2023). Multivariate rank-based distribution-free nonparametric testing using measure transportation. Journal of the American Statistical Association 118, 192–207.
  • [21] Del Barrio, E., Giné, E. and Utzet, F. (2005). Asymptotics for L2 functionals of the empirical quantile process, with applications to tests of fit based on weighted Wasserstein distances. Bernoulli 11, 131–189.
  • [22] Del Barrio, E., Gordaliza, P. and Loubes, J.-M. (2019). A central limit theorem for Lp transportation cost on the real line with application to fairness assessment in machine learning. Information and Inference: A Journal of the IMA 8, 817–849.
  • [23] del Barrio, E. and Loubes, J.-M. (2019). Central limit theorems for empirical transportation cost in general dimension. The Annals of Probability 47, 926–951.
  • [24] Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine 29, 141–142.
  • [25] Desai, S. and Guddati, A. K. (2022). Bimodal age distribution in cancer incidence. World Journal of Oncology 13, 329.
  • [26] Dudley, R. M. (1969). The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics 40, 40–50.
  • [27] Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics 7, 1–26.
  • [28] Emelichev, V. A., Kovalev, M. M. and Kravtsov, M. K. (1984). Polytopes, Graphs and Optimisation. Cambridge University Press.
  • [29] Fang, Z. and Santos, A. (2019). Inference on directionally differentiable functions. The Review of Economic Studies 86, 377–412.
  • [30] Fournier, N. and Guillin, A. (2015). On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields 162, 707–738.
  • [31] Fukui, T., Mori, S., Yokoi, K. and Mitsudomi, T. (2006). Significance of the number of positive lymph nodes in resected non-small cell lung cancer. Journal of Thoracic Oncology 1, 120–125.
  • [32] Gal, T. (1997). A historical sketch on sensitivity analysis and parametric programming. In Advances in Sensitivity Analysis and Parametric Programming, 1–10. Springer.
  • [33] Ghodrati, L. and Panaretos, V. M. (2022). Distribution-on-distribution regression via optimal transport maps. Biometrika 109, 957–974.
  • [34] Giordano, S. H., Cohen, D. S., Buzdar, A. U., Perkins, G. and Hortobagyi, G. N. (2004). Breast carcinoma in men: a population-based study. Cancer: Interdisciplinary International Journal of the American Cancer Society 101, 51–57.
  • [35] Gladkov, N. A. and Zimin, A. P. (2020). An explicit solution for a multimarginal mass transportation problem. SIAM Journal on Mathematical Analysis 52, 3666–3696.
  • [36] Gunsilius, F. F. (2023). Distributional synthetic controls. Econometrica 91, 1105–1117.
  • [37] Gurobi Optimization, LLC (2023). Gurobi Optimizer Reference Manual.
  • [38] Hallin, M., Del Barrio, E., Cuesta-Albertos, J. and Matran, C. (2021). Distribution and quantile functions, ranks and signs in dimension d: A measure transportation approach. Annals of Statistics 49, 1139–1165.
  • [39] Hallin, M., Mordant, G. and Segers, J. (2021). Multivariate goodness-of-fit tests based on Wasserstein distance. Electronic Journal of Statistics 15, 1328–1371.
  • [40] Hart, J. D. and Cañette, I. (2011). Nonparametric estimation of distributions in random effects models. Journal of Computational and Graphical Statistics 20, 461–478.
  • [41] Harter, R., Hornik, K. and Theussl, S. (2021). Rsymphony: SYMPHONY in R. R package version 0.1-33.
  • [42] Heinemann, F., Munk, A. and Zemel, Y. (2022). Randomized Wasserstein barycenter computation: Resampling with statistical guarantees. SIAM Journal on Mathematics of Data Science 4, 229–259.
  • [43] Heitjan, D. F., Manni, A. and Santen, R. J. (1993). Statistical analysis of in vivo tumor growth experiments. Cancer Research 53, 6042–6050.
  • [44] Hong, H. and Li, J. (2018). The numerical delta method. Journal of Econometrics 206, 379–394.
  • [45] Hsu, J. (1996). Multiple Comparisons: Theory and Methods. CRC Press.
  • [46] Hundrieser, S., Klatt, M., Munk, A. and Staudt, T. (2024). A unifying approach to distributional limits for empirical optimal transport. Bernoulli 30, 2846–2877.
  • [47] Hušková, M. and Meintanis, S. G. (2008). Tests for the multivariate k-sample problem based on the empirical characteristic function. Journal of Nonparametric Statistics 20, 263–277.
  • [48] Khan, M. M., Odoi, A. and Odoi, E. W. (2023). Geographic disparities in COVID-19 testing and outcomes in Florida. BMC Public Health 23, 79.
  • [49] Kiefer, J. (1959). K-sample analogues of the Kolmogorov-Smirnov and Cramér-V. Mises tests. The Annals of Mathematical Statistics, 420–447.
  • [50] Kim, I. (2021). Comparing a large number of multivariate distributions. Bernoulli 27, 419–441.
  • [51] Klatt, M., Tameling, C. and Munk, A. (2020). Empirical regularized optimal transport: Statistical theory and applications. SIAM Journal on Mathematics of Data Science 2, 419–443.
  • [52] Le Gouic, T. and Loubes, J.-M. (2017). Existence and consistency of Wasserstein barycenters. Probability Theory and Related Fields 168, 901–917.
  • [53] Ledoux, M. and Talagrand, M. (2013). Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media.
  • [54] Lin, T., Ho, N., Cuturi, M. and Jordan, M. I. (2022). On the complexity of approximating multimarginal optimal transport. The Journal of Machine Learning Research 23, 2835–2877.
  • [55] Mahalak, K. K., Jama, A. M., Billups, S. J., Dawes, A. T. and Chamberlin, H. M. (2017). Differing roles for sur-2/MED23 in C. elegans and C. briggsae vulval development. Development Genes and Evolution 227, 213–218.
  • [56] Mukhopadhyay, S. and Wang, K. (2020). A nonparametric approach to high-dimensional k-sample comparison problems. Biometrika 107, 555–572.
  • [57] Munk, A. and Czado, C. (1998). Nonparametric validation of similar distributions and assessment of goodness of fit. Journal of the Royal Statistical Society Series B: Statistical Methodology 60, 223–241.
  • [58] Nations, J. A., Brown, D. W., Shao, S., Shriver, C. D. and Zhu, K. (2020). Comparative trends in the distribution of lung cancer stage at diagnosis in the Department of Defense Cancer Registry and the Surveillance, Epidemiology, and End Results data, 1989–2012. Military Medicine 185, e2044–e2048.
  • [59] Panaretos, V. M. and Zemel, Y. (2019). Statistical aspects of Wasserstein distances. Annual Review of Statistics and Its Application 6, 405–431.
  • [60] Panaretos, V. M. and Zemel, Y. (2020). An Invitation to Statistics in Wasserstein Space. Springer Nature.
  • [61] Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P. and Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 1–22.
  • [62] Pass, B. (2015). Multi-marginal optimal transport: theory and applications. ESAIM: Mathematical Modelling and Numerical Analysis 49, 1771–1790.
  • [63] Ramdas, A., García Trillos, N. and Cuturi, M. (2017). On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19, 47.
  • [64] Rizzo, M. L. and Székely, G. J. (2010). DISCO analysis: A nonparametric extension of analysis of variance. The Annals of Applied Statistics 4, 1034–1055.
  • [65] Römisch, W. (2004). Delta method, infinite dimensional. In Encyclopedia of Statistical Sciences. Wiley, New York.
  • [66] Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Birkhäuser, NY 55, 94.
  • [67] Scholz, F. W. and Stephens, M. A. (1987). K-sample Anderson–Darling tests. Journal of the American Statistical Association 82, 918–924.
  • [68] Sejdinovic, D., Sriperumbudur, B., Gretton, A. and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics 41, 2263–2291.
  • [69] Shapiro, A. (1990). On concepts of directional differentiability. Journal of Optimization Theory and Applications 66, 477–487.
  • [70] Shapiro, A. (1991). Asymptotic analysis of stochastic programs. Annals of Operations Research 30, 169–186.
  • [71] Sierksma, G. (2001). Linear and Integer Programming: Theory and Practice. CRC Press.
  • [72] Sommerfeld, M. and Munk, A. (2018). Inference for empirical Wasserstein distances on finite spaces. Journal of the Royal Statistical Society Series B: Statistical Methodology 80, 219–238.
  • [73] {barticle}[author] \bauthor\bsnmSopik, \bfnmVictoria\binitsV. and \bauthor\bsnmNarod, \bfnmSteven A\binitsS. A. (\byear2018). \btitleThe relationship between tumour size, nodal status and distant metastases: on the origins of breast cancer. \bjournalBreast cancer research and treatment \bvolume170 \bpages647–656. \endbibitem
  • [74] {barticle}[author] \bauthor\bsnmStaudt, \bfnmThomas\binitsT., \bauthor\bsnmHundrieser, \bfnmShayan\binitsS. and \bauthor\bsnmMunk, \bfnmAxel\binitsA. (\byear2022). \btitleOn the uniqueness of Kantorovich potentials. \bjournalarXiv preprint arXiv:2201.08316. \endbibitem
  • [75] {barticle}[author] \bauthor\bsnmTameling, \bfnmCarla\binitsC., \bauthor\bsnmSommerfeld, \bfnmMax\binitsM. and \bauthor\bsnmMunk, \bfnmAxel\binitsA. (\byear2019). \btitleEmpirical optimal transport on countable metric spaces: Distributional limits and statistical applications. \bjournalThe Annals of Applied Probability \bvolume29 \bpages2744–2781. \endbibitem
  • [76] {barticle}[author] \bauthor\bsnmTrick, \bfnmWilliam\binitsW. (\byear2011). \btitleComputer Assisted Quality of Life and Symptom Assessment of Complex Patients from April 2011-August 2012: Chicago, Illinois. \bjournalComputer \bvolume2012. \endbibitem
  • [77] {barticle}[author] \bauthor\bsnmTrillos, \bfnmNicolás García\binitsN. G., \bauthor\bsnmJacobs, \bfnmMatt\binitsM. and \bauthor\bsnmKim, \bfnmJakwang\binitsJ. (\byear2023). \btitleThe multimarginal optimal transport formulation of adversarial multiclass classification. \bjournalJournal of Machine Learning Research \bvolume24 \bpages1–56. \endbibitem
  • [78] {barticle}[author] \bauthor\bsnmUchida, \bfnmSeiichi\binitsS. (\byear2013). \btitleImage processing and recognition for biological images. \bjournalDevelopment, growth & differentiation \bvolume55 \bpages523–549. \endbibitem
  • [79] {bbook}[author] \bauthor\bsnmVan Der Vaart, \bfnmAad W\binitsA. W., \bauthor\bsnmWellner, \bfnmJon A\binitsJ. A., \bauthor\bparticlevan der \bsnmVaart, \bfnmAad W\binitsA. W. and \bauthor\bsnmWellner, \bfnmJon A\binitsJ. A. (\byear1996). \btitleWeak convergence. \bpublisherSpringer. \endbibitem
  • [80] {bbook}[author] \bauthor\bsnmVillani, \bfnmCédric\binitsC. (\byear2021). \btitleTopics in optimal transportation \bvolume58. \bpublisherAmerican Mathematical Soc. \endbibitem
  • [81] {bbook}[author] \bauthor\bsnmVillani, \bfnmCédric\binitsC. \betalet al. (\byear2009). \btitleOptimal transport: old and new \bvolume338. \bpublisherSpringer. \endbibitem
  • [82] {bbook}[author] \bauthor\bsnmWickham, \bfnmHadley\binitsH. (\byear2016). \btitleggplot2: Elegant Graphics for Data Analysis. \bpublisherSpringer-Verlag New York. \endbibitem
  • [83] {barticle}[author] \bauthor\bsnmZand, \bfnmTanya P\binitsT. P., \bauthor\bsnmReiner, \bfnmDavid J\binitsD. J. and \bauthor\bsnmDer, \bfnmChanning J\binitsC. J. (\byear2011). \btitleRas effector switching promotes divergent cell fates in C. elegans vulval patterning. \bjournalDevelopmental cell \bvolume20 \bpages84–96. \endbibitem
  • [84] {barticle}[author] \bauthor\bsnmZhan, \bfnmD\binitsD. and \bauthor\bsnmHart, \bfnmJD\binitsJ. (\byear2014). \btitleTesting equality of a large number of densities. \bjournalBiometrika \bvolume101 \bpages449–464. \endbibitem
  • [85] {barticle}[author] \bauthor\bsnmZhang, \bfnmChao\binitsC., \bauthor\bsnmKokoszka, \bfnmPiotr\binitsP. and \bauthor\bsnmPetersen, \bfnmAlexander\binitsA. (\byear2022). \btitleWasserstein autoregressive models for density time series. \bjournalJournal of Time Series Analysis \bvolume43 \bpages30–52. \endbibitem
  • [86] {barticle}[author] \bauthor\bsnmZhang, \bfnmHai-Liang\binitsH.-L., \bauthor\bsnmYang, \bfnmLi-Feng\binitsL.-F., \bauthor\bsnmZhu, \bfnmYao\binitsY., \bauthor\bsnmYao, \bfnmXu-Dong\binitsX.-D., \bauthor\bsnmZhang, \bfnmShi-Lin\binitsS.-L., \bauthor\bsnmDai, \bfnmBo\binitsB., \bauthor\bsnmZhu, \bfnmYi-Ping\binitsY.-P., \bauthor\bsnmShen, \bfnmYi-Jun\binitsY.-J., \bauthor\bsnmShi, \bfnmGuo-Hai\binitsG.-H. and \bauthor\bsnmYe, \bfnmDing-Wei\binitsD.-W. (\byear2011). \btitleSerum miRNA-21: Elevated levels in patients with metastatic hormone-refractory prostate cancer and potential predictive factor for the efficacy of docetaxel-based chemotherapy. \bjournalThe Prostate \bvolume71 \bpages326–331. \endbibitem
  • [87] {barticle}[author] \bauthor\bsnmZhang, \bfnmJin\binitsJ. and \bauthor\bsnmWu, \bfnmYuehua\binitsY. (\byear2007). \btitlek-Sample tests based on the likelihood ratio. \bjournalComputational Statistics & Data Analysis \bvolume51 \bpages4682–4691. \endbibitem
  • [88] {barticle}[author] \bauthor\bsnmZhang, \bfnmRuiyi\binitsR., \bauthor\bsnmOgden, \bfnmR Todd\binitsR. T., \bauthor\bsnmPicard, \bfnmMartin\binitsM. and \bauthor\bsnmSrivastava, \bfnmAnuj\binitsA. (\byear2022). \btitleNonparametric k-sample test on shape spaces with applications to mitochondrial shape analysis. \bjournalJournal of the Royal Statistical Society Series C: Applied Statistics \bvolume71 \bpages51–69. \endbibitem
  • [89] {barticle}[author] \bauthor\bsnmZhu, \bfnmChangbo\binitsC. and \bauthor\bsnmMüller, \bfnmHans-Georg\binitsH.-G. (\byear2023). \btitleAutoregressive optimal transport models. \bjournalJournal of the Royal Statistical Society Series B: Statistical Methodology \bvolume85 \bpages1012–1033. \endbibitem