
Exact Statistical Inference for the Wasserstein Distance by
Selective Inference

Vo Nguyen Le Duy
Nagoya Institute of Technology and RIKEN
duy.mllab.nit@gmail.com
   Ichiro Takeuchi
Nagoya Institute of Technology and RIKEN
takeuchi.ichiro@nitech.ac.jp
Abstract

In this paper, we study statistical inference for the Wasserstein distance, which has attracted much attention and has been applied to various machine learning tasks. Several studies have been proposed in the literature, but almost all of them are based on asymptotic approximation and do not have finite-sample validity. In this study, we propose an exact (non-asymptotic) inference method for the Wasserstein distance inspired by the concept of conditional Selective Inference (SI). To our knowledge, this is the first method that can provide a valid confidence interval (CI) for the Wasserstein distance with finite-sample coverage guarantee, which can be applied not only to one-dimensional problems but also to multi-dimensional problems. We evaluate the performance of the proposed method on both synthetic and real-world datasets.

1 Introduction

The Wasserstein distance, a metric for comparing probability distributions, has attracted significant attention and is being used more and more in Statistics and Machine Learning (ML) (Kolouri et al., 2017). This distance measures the cost of coupling one distribution with another, which arises from the notion of optimal transport (Villani, 2009). The Wasserstein distance has proved useful in several applications such as supervised learning (Frogner et al., 2015), generative modelling (Arjovsky et al., 2017), biology (Evans and Matsen, 2012), and computer vision (Ni et al., 2009).

When the Wasserstein distance calculated from noisy data is used for decision making, it is necessary to quantify its statistical reliability, e.g., in the form of confidence intervals (CIs). However, there is no satisfactory statistical inference method for the Wasserstein distance. The main reason is that the Wasserstein distance is defined through the optimal solution of a linear program (LP), and it is difficult to analyze how the uncertainty in the data is transmitted to the uncertainty in the Wasserstein distance. Several studies have been proposed in the literature (Bernton et al., 2017; Del Barrio et al., 1999, 2019; Ramdas et al., 2017; Imaizumi et al., 2019). However, they all rely on asymptotic approximation to mitigate the difficulty stemming from the fact that the Wasserstein distance depends on the optimization problem, and thus most of them can only be applied to one-dimensional problems (the details will be discussed in the related work section).

When an optimization problem such as LP is applied to random data, it is intrinsically difficult to derive the exact (non-asymptotic) sampling distribution of the optimal solution. In this paper, to resolve the difficulty, we introduce the idea of conditional Selective Inference (SI). It is well-known that the optimal solution of an LP depends only on a subset of variables, called basic variables. Therefore, the LP algorithm can be interpreted as first identifying the basic variables, and then solving the linear system of equations about the basic variables.

Our basic idea is based on the fact that, in the LP problem for the Wasserstein distance, the identification of the basic variables corresponds to the process of determining the coupling between the source and destination of the probability mass. Since the optimal coupling is determined (selected) based on the data, the selection bias must be properly considered. Therefore, to address the selection bias, we propose an exact statistical inference method for the Wasserstein distance by conditioning on the basic variables of the LP, i.e., optimal coupling.

The main advantage of the proposed method is that it can provide an exact (non-asymptotic) CI for the Wasserstein distance, unlike the asymptotic approximations in the existing studies. Moreover, while the existing methods are restricted to one-dimensional problems, the proposed method can be directly extended to multi-dimensional problems because the proposed CI is derived based on the fact that the Wasserstein distance depends on the optimal solution of the LP.

1.1 Contributions

Regarding the high-level conceptual contribution, we introduce a novel approach to explicitly characterize the selection bias of the data-driven coupling problem, inspired by the concept of conditional Selective Inference (SI). In regard to the technical contribution, we provide an exact (non-asymptotic) inference method for the Wasserstein distance. To our knowledge, this is the first method that can provide a valid CI, called a selective CI, for the Wasserstein distance that guarantees the coverage property in finite sample sizes. Another practically important advantage of this study is that the proposed method remains valid when the Wasserstein distance is computed in multi-dimensional problems, which is impossible for almost all existing asymptotic methods since the limiting distribution of the Wasserstein distance is only available for univariate data. We conduct experiments on both synthetic and real-world datasets to evaluate the performance of the proposed method.

1.2 Related works

In traditional statistics, reliability evaluation for the Wasserstein distance has been based on asymptotic theory, i.e., sample size \to\infty. In the univariate case, instead of solving the optimization problem, the Wasserstein distance can be expressed using the inverse of the distribution function. For example, letting F^{-1} be the quantile function of the data and F_{n}^{-1} be the empirical quantile function of the generated data, the Wasserstein distance with \ell_{2} cost is computed as \int_{0}^{1}(F^{-1}(t)-F_{n}^{-1}(t))^{2}\,dt. Based on the quantile function, several studies (Del Barrio et al., 1999, 2019; Ramdas et al., 2017) derived the asymptotic distribution of the Wasserstein distance. Obviously, these methods cannot guarantee validity in finite sample sizes. Moreover, since the quantile function is only available in the univariate case, these methods cannot be extended to multivariate cases, which are practically important.

Recently, Imaizumi et al. (2019) proposed an approach for multi-dimensional problems. However, it is important to clarify that this study does not provide statistical inference for the “original” Wasserstein distance. Instead, the authors consider an approximation of the Wasserstein distance that does not require solving an LP. Besides, this method also relies on the asymptotic distribution of the test statistic, which is approximated by the Gaussian multiplier bootstrap. Therefore, to our knowledge, statistical inference for the Wasserstein distance in multi-dimensional problems is still a challenging open problem.

Conditional SI is a statistical inference framework for correcting selection bias. Traditionally, there are two main approaches to selection bias correction. The first approach is family-wise error rate (FWER) control, which includes standard multiple comparison methods such as the traditional Bonferroni correction. However, FWER control is too conservative, and it is difficult to use for selection bias correction in complex adaptive data analysis. The second approach is false discovery rate (FDR) control, in which the target is to control the expected proportion of false discoveries at a given significance level. FDR control is less conservative than FWER control, and it is used in many high-dimensional statistical inference problems.

Conditional SI is a third approach to selection bias correction. The basic idea of conditional SI was known before, but it became popular through the recent seminal work of Lee et al. (2016). In that study, exact statistical inference on the features selected by the Lasso was considered. Their basic idea is to employ the sampling distribution of the selected parameter coefficients conditional on the selection event that a subset of features is selected by the Lasso. By using the conditional distribution in statistical inference, the selection bias of the Lasso (i.e., the fact that the Lasso selected the features based on the data) can be corrected. Their contribution was to show that, even for complex data analysis methods such as the Lasso, the exact sampling distribution can be characterized if appropriate selection events are considered.

After the seminal work (Lee et al., 2016), many conditional SI approaches for various feature selection methods were proposed in the literature (Loftus and Taylor, 2015; Yang et al., 2016; Tibshirani et al., 2016; Suzumura et al., 2017). Furthermore, theoretical analyses and new computational methods for conditional SI are still being actively studied (Fithian et al., 2014; Le Duy and Takeuchi, 2021; Duy and Takeuchi, 2021; Sugiyama et al., 2021a). However, most conditional SI studies focus on feature selection problems. Although there have been applications to other problems such as change point detection (Hyun et al., 2018; Duy et al., 2020b; Sugiyama et al., 2021b), outlier detection (Chen and Bien, 2019; Tsukurimichi et al., 2021), and image segmentation (Tanizaki et al., 2020; Duy et al., 2020a), these problems can also be interpreted as feature selection in a broad sense. The novelty of this study is that we are the first to introduce the conditional SI framework for statistical inference on the Wasserstein distance, which is a data-dependent adaptive distance measure. Our basic idea rests on the facts that the Wasserstein distance is formulated as the solution of a linear program (LP) and that the optimal solution of an LP is characterized by the selected basic variables. In this study, we consider the sampling distribution of the Wasserstein distance conditional on the selected basic variables, which can be interpreted as conditioning on the optimal coupling between the two distributions.

2 Problem Statement

To formulate the problem, we consider two vectors corrupted with Gaussian noise as

\bm{X} = (x_{1},\ldots,x_{n})^{\top} = \bm{\mu}_{\bm{X}} + \bm{\varepsilon}_{\bm{X}}, \quad \bm{\varepsilon}_{\bm{X}} \sim \mathbb{N}(\bm{0},\Sigma_{\bm{X}}),   (1)
\bm{Y} = (y_{1},\ldots,y_{m})^{\top} = \bm{\mu}_{\bm{Y}} + \bm{\varepsilon}_{\bm{Y}}, \quad \bm{\varepsilon}_{\bm{Y}} \sim \mathbb{N}(\bm{0},\Sigma_{\bm{Y}}),   (2)

where n and m are the numbers of instances in each vector, \bm{\mu}_{\bm{X}} and \bm{\mu}_{\bm{Y}} are unknown mean vectors, and \bm{\varepsilon}_{\bm{X}} and \bm{\varepsilon}_{\bm{Y}} are Gaussian noise vectors with covariance matrices \Sigma_{\bm{X}} and \Sigma_{\bm{Y}}, assumed to be known or estimable from independent data. We denote by P_{n} and Q_{m} the empirical measures corresponding to \bm{X} and \bm{Y}.

2.1 Cost matrix

We define the cost matrix C(\bm{X},\bm{Y}) of pairwise distances (\ell_{1} distance) between elements of \bm{X} and \bm{Y} as

C(\bm{X},\bm{Y}) = \big[\,|x_{i}-y_{j}|\,\big]_{ij} \in \mathbb{R}^{n\times m}.   (3)

We can vectorize C(\bm{X},\bm{Y}) in the form of

\bm{c}(\bm{X},\bm{Y}) = {\rm{vec}}(C(\bm{X},\bm{Y})) = \Theta(\bm{X}~\bm{Y})^{\top} \in \mathbb{R}^{nm},   (4)

where {\rm{vec}}(\cdot) is the operator that transforms a matrix into a vector by concatenating its rows. The matrix \Theta is defined as

\Theta = {\mathcal{S}}(\bm{X},\bm{Y}) \circ \Omega \in \mathbb{R}^{nm\times(n+m)},   (5)
{\mathcal{S}}(\bm{X},\bm{Y}) = {\rm{sign}}\left(\Omega(\bm{X}~\bm{Y})^{\top}\right) \in \mathbb{R}^{nm},
\Omega = \begin{pmatrix}\bm{1}_{m}&\bm{0}_{m}&\cdots&\bm{0}_{m}&-I_{m}\\ \bm{0}_{m}&\bm{1}_{m}&\cdots&\bm{0}_{m}&-I_{m}\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ \bm{0}_{m}&\bm{0}_{m}&\cdots&\bm{1}_{m}&-I_{m}\end{pmatrix} \in \mathbb{R}^{nm\times(n+m)},

where the operator \circ denotes the element-wise product, {\rm{sign}}(\cdot) is the operator that returns an element-wise indication of the sign of a number, \bm{1}_{m}\in\mathbb{R}^{m} is the vector of ones, \bm{0}_{m}\in\mathbb{R}^{m} is the vector of zeros, and I_{m}\in\mathbb{R}^{m\times m} is the identity matrix.
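The construction of \Omega, the sign vector {\mathcal{S}}(\bm{X},\bm{Y}), and \Theta can be checked numerically. Below is a minimal sketch in Python/NumPy illustrating the definitions (3)-(5); the variable names (X, Y, n, m, etc.) are our own and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
X, Y = rng.normal(size=n), rng.normal(size=m)

# Omega in (5): row (i*m + j) encodes x_i - y_j.
Omega = np.hstack([np.kron(np.eye(n), np.ones((m, 1))),   # picks x_i
                   np.tile(-np.eye(m), (n, 1))])          # subtracts y_j

data = np.concatenate([X, Y])            # (X Y)^T
s = np.sign(Omega @ data)                # sign vector S(X, Y)
Theta = s[:, None] * Omega               # element-wise product S(X, Y) o Omega

# c(X, Y) = Theta (X Y)^T recovers the vectorized l1 cost matrix in (3)-(4).
c = Theta @ data
assert np.allclose(c, np.abs(X[:, None] - Y[None, :]).ravel())
```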

2.2 The Wasserstein distance

To compare the two empirical measures P_{n} and Q_{m} with uniform weight vectors \bm{1}_{n}/n and \bm{1}_{m}/m, we consider the following Wasserstein distance, which is defined through a linear program (LP),

W(P_{n},Q_{m}) = \min_{T\in\mathbb{R}^{n\times m}} \langle T, C(\bm{X},\bm{Y})\rangle   (6)
    s.t.  T\bm{1}_{m} = \bm{1}_{n}/n,
          T^{\top}\bm{1}_{n} = \bm{1}_{m}/m,
          T \geq 0.

Given \bm{X}^{\rm{obs}} and \bm{Y}^{\rm{obs}} respectively sampled from models (1) and (2) (to make a distinction between random variables and observed variables, we use the superscript obs, e.g., \bm{X} is a random vector and \bm{X}^{\rm{obs}} is the observed data vector), the Wasserstein distance in (6) on the observed data can be re-written as

W(P_{n},Q_{m}) = \min_{\bm{t}\in\mathbb{R}^{nm}} \bm{t}^{\top}\bm{c}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}})   (7)
    s.t.  S\bm{t} = \bm{h}, \quad \bm{t} \geq \bm{0},

where \bm{t} = {\rm{vec}}(T) \in \mathbb{R}^{nm}, \bm{c}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}) \in \mathbb{R}^{nm} is defined in (4), and S = (M_{r}~M_{c})^{\top} \in \mathbb{R}^{(n+m)\times nm}, in which

M_{r} = \begin{bmatrix}1~\cdots~1&0~\cdots~0&\cdots&0~\cdots~0\\ 0~\cdots~0&1~\cdots~1&\cdots&0~\cdots~0\\ \vdots&\vdots&\ddots&\vdots\\ 0~\cdots~0&0~\cdots~0&\cdots&1~\cdots~1\end{bmatrix} \in \mathbb{R}^{n\times nm}

that performs the sum over the rows of T, and

M_{c} = \begin{bmatrix}I_{m}&I_{m}&\cdots&I_{m}\end{bmatrix} \in \mathbb{R}^{m\times nm}

that performs the sum over the columns of T, and \bm{h} = \left(\bm{1}_{n}/n~~\bm{1}_{m}/m\right)^{\top} \in \mathbb{R}^{n+m}. (We note that there always exists exactly one redundant equality constraint in the linear equality constraint system in (7). This is due to the fact that the sum of all the masses on \bm{X}^{\rm{obs}} is always equal to the sum of all the masses on \bm{Y}^{\rm{obs}} (i.e., they are both equal to 1). Therefore, any equality constraint can be expressed as a linear combination of the others, and hence any one constraint can be dropped. In this paper, we always drop the last equality constraint (i.e., the last row of the matrix S and the last element of the vector \bm{h}) before solving (7).)
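As an illustration, the LP in (7) can be assembled and solved with an off-the-shelf solver. The sketch below continues the notation of the previous snippet; SciPy's HiGHS solver is one possible choice (not necessarily the solver used in the paper). It builds M_r, M_c, S, and h, drops the redundant last equality constraint as described above, and returns the Wasserstein distance as the optimal value.

```python
from scipy.optimize import linprog

# Row-sum and column-sum operators acting on t = vec(T).
M_r = np.kron(np.eye(n), np.ones((1, m)))     # sums over the rows of T
M_c = np.tile(np.eye(m), (1, n))              # sums over the columns of T
S_mat = np.vstack([M_r, M_c])                 # (n + m) x nm
h = np.concatenate([np.ones(n) / n, np.ones(m) / m])

# Drop the redundant last equality constraint before solving (7).
res = linprog(c, A_eq=S_mat[:-1], b_eq=h[:-1],
              bounds=(0, None), method="highs")
t_hat = res.x                                  # optimal transport plan, vec(T)
W = res.fun                                    # Wasserstein distance W(P_n, Q_m)
```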

2.3 Optimal solution and closed-form expression of the distance

Let us denote the set of basic variables (the definition of a basic variable can be found in the LP literature, e.g., Murty (1983)) obtained when applying the LP in (7) to \bm{X}^{\rm{obs}} and \bm{Y}^{\rm{obs}} as

{\mathcal{M}}_{\rm{obs}} = {\mathcal{M}}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}).   (8)

We would like to note that the identification of the basic variables can be interpreted as the process of determining the optimal coupling between the elements of \bm{X}^{\rm{obs}} and \bm{Y}^{\rm{obs}} in the optimal transport problem for calculating the Wasserstein distance. Therefore, {\mathcal{M}}_{\rm{obs}} in (8) can be interpreted as the observed optimal coupling obtained after solving the LP in (7) on the observed data (we suppose that the LP is non-degenerate; a careful discussion might be needed in the presence of degeneracy). An illustration of this interpretation is shown in Figure 1. We also denote by {\mathcal{M}}_{\rm{obs}}^{c} the set of non-basic variables. Then, the optimal solution of (7) can be written as

\hat{\bm{t}} \in \mathbb{R}^{nm}, \quad \hat{\bm{t}}_{{\mathcal{M}}_{\rm{obs}}} = S_{:,{\mathcal{M}}_{\rm{obs}}}^{-1}\bm{h}, \quad \hat{\bm{t}}_{{\mathcal{M}}_{\rm{obs}}^{c}} = \bm{0}_{|{\mathcal{M}}_{\rm{obs}}^{c}|},

where S_{:,{\mathcal{M}}_{\rm{obs}}} is the sub-matrix of S made up of all rows and the columns in the set {\mathcal{M}}_{\rm{obs}}.

Example 1.

Given a matrix

S = \begin{bmatrix}1&1&0&0\\ 0&0&1&1\\ 1&0&1&0\end{bmatrix} \quad\text{and}\quad {\mathcal{M}}_{\rm{obs}} = \{1,2,4\},

then S_{:,{\mathcal{M}}_{\rm{obs}}} is constructed by extracting the first, second, and fourth columns of the matrix S, i.e.,

S_{:,{\mathcal{M}}_{\rm{obs}}} = \begin{bmatrix}1&1&0\\ 0&0&1\\ 1&0&0\end{bmatrix}.

In the LP literature, the matrix S_{:,{\mathcal{M}}_{\rm{obs}}} is also referred to as a basis. After obtaining \hat{\bm{t}}, the Wasserstein distance can be re-written as

W(P_{n},Q_{m}) = \hat{\bm{t}}^{\top}\bm{c}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}) = \hat{\bm{t}}_{{\mathcal{M}}_{\rm{obs}}}^{\top}\bm{c}_{{\mathcal{M}}_{\rm{obs}}}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}) = \hat{\bm{t}}_{{\mathcal{M}}_{\rm{obs}}}^{\top}\underbrace{\Theta_{{\mathcal{M}}_{\rm{obs}},:}(\bm{X}^{\rm{obs}}~\bm{Y}^{\rm{obs}})^{\top}}_{\bm{c}_{{\mathcal{M}}_{\rm{obs}}}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}})},   (9)

where \Theta is defined in (5), and \Theta_{{\mathcal{M}}_{\rm{obs}},:} is the sub-matrix of \Theta made up of the rows in the set {\mathcal{M}}_{\rm{obs}} and all columns.
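Under the non-degeneracy assumption above, the basic index set {\mathcal{M}}_{\rm{obs}} can be read off from the nonzero entries of the optimal plan, and the closed-form expression (9) can be verified directly. A minimal sketch continuing the previous snippets (in degenerate cases, the basis should instead be taken from a simplex solver that reports it explicitly):

```python
# Basic variables: nonzero entries of t_hat (exactly n + m - 1 of them
# when the LP is non-degenerate; otherwise take the basis from the solver).
M_obs = np.flatnonzero(t_hat > 1e-9)
S_red, h_red = S_mat[:-1], h[:-1]

t_basis = np.linalg.solve(S_red[:, M_obs], h_red)   # t_hat restricted to the basis
eta = Theta[M_obs].T @ t_basis                      # test-statistic direction in (10)

# Both expressions reproduce the LP optimal value, as in (9).
assert np.isclose(t_basis @ c[M_obs], W)
assert np.isclose(eta @ data, W)
```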

Figure 1: Illustration of the correspondence between basic variables in the LP and the optimal coupling. After inputting the data to the LP, we obtain the transportation matrix. The elements t_{1}, t_{2}, t_{3}, t_{8}, t_{9}, t_{10} (whose values are non-zero) in the matrix are called basic variables in the LP, and the identification of the basic variables corresponds to the process of determining the coupling between the source and target instances in the optimal transport problem for the Wasserstein distance.

2.4 Statistical inference (confidence interval)

The goal is to provide a confidence interval (CI) for the Wasserstein distance. The quantity W(P_{n},Q_{m}) in (9) can be written as

W(P_{n},Q_{m}) = \bm{\eta}^{\top}\left(\bm{X}^{\rm{obs}}~\bm{Y}^{\rm{obs}}\right)^{\top},   (10)

where \bm{\eta} = \Theta_{{\mathcal{M}}_{\rm{obs}},:}^{\top}\hat{\bm{t}}_{{\mathcal{M}}_{\rm{obs}}} is the test-statistic direction. It is important to note that \bm{\eta} is a random vector because {\mathcal{M}}_{\rm{obs}} is selected based on the data. For the purpose of explanation, let us suppose, for now, that the test-statistic direction \bm{\eta} in (10) is fixed before observing the data; that is, non-random. Let us define

\tilde{\Sigma} = \begin{pmatrix}\Sigma_{\bm{X}}&0\\ 0&\Sigma_{\bm{Y}}\end{pmatrix}.   (11)

Thus, we have \bm{\eta}^{\top}(\bm{X}~\bm{Y})^{\top} \sim \mathbb{N}\left(\bm{\eta}^{\top}(\bm{\mu}_{\bm{X}}~\bm{\mu}_{\bm{Y}})^{\top}, \bm{\eta}^{\top}\tilde{\Sigma}\bm{\eta}\right). Given a significance level \alpha\in[0,1] (e.g., 0.05) for the inference, the naive (classical) CI for

W^{\ast} = W^{\ast}(P_{n},Q_{m}) = \bm{\eta}^{\top}\left(\bm{\mu}_{\bm{X}}~\bm{\mu}_{\bm{Y}}\right)^{\top}

with (1-\alpha)-coverage is obtained by

C_{\rm{naive}} = \left\{w\in\mathbb{R} : \frac{\alpha}{2} \leq F_{w,\sigma^{2}}\left(\bm{\eta}^{\top}(\bm{X}^{\rm{obs}}~\bm{Y}^{\rm{obs}})^{\top}\right) \leq 1-\frac{\alpha}{2}\right\},   (12)

where \sigma^{2} = \bm{\eta}^{\top}\tilde{\Sigma}\bm{\eta} and F_{w,\sigma^{2}}(\cdot) is the c.d.f. of the normal distribution with mean w and variance \sigma^{2}. Under the assumption that \bm{\eta} in (10) is fixed in advance, the naive CI is valid in the sense that

\mathbb{P}\left(W^{\ast}\in C_{\rm{naive}}\right) = 1-\alpha.   (13)
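For a fixed \bm{\eta}, the interval (12) is simply the usual two-sided normal interval centered at the observed statistic. A small sketch continuing the previous snippets, assuming the covariances Sigma_X and Sigma_Y are available as NumPy arrays (e.g., np.eye(n) and np.eye(m)):

```python
from scipy.stats import norm

Sigma_tilde = np.block([[Sigma_X, np.zeros((n, m))],
                        [np.zeros((m, n)), Sigma_Y]])   # Sigma~ in (11)
stat = eta @ data                                       # eta^T (X_obs Y_obs)^T
sigma = np.sqrt(eta @ Sigma_tilde @ eta)
alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)
C_naive = (stat - z_crit * sigma, stat + z_crit * sigma)  # inverts (12)
```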

However, in reality, because the test-statistic direction \bm{\eta} is actually not fixed in advance, the naive CI in (12) is unreliable. In other words, the naive CI is invalid in the sense that the (1-\alpha)-coverage guarantee in (13) is no longer satisfied. This is because the construction of \bm{\eta} depends on the data and selection bias exists.

In the next section, we introduce an approach to correct the selection bias, which is inspired by the conditional SI framework, and propose a valid CI, called the selective CI (C_{\rm{sel}}), for the Wasserstein distance that guarantees the (1-\alpha)-coverage property in finite samples (i.e., non-asymptotically).

3 Proposed Method

We present the technical details of the proposed method in this section. We first introduce the selective CI for the Wasserstein distance in §3.1. To compute the proposed selective CI, we need to consider the sampling distribution of the Wasserstein distance conditional on the selection event. Thereafter, the selection event is explicitly characterized in a conditional data space whose identification is presented in §3.2. Finally, we end the section with the detailed algorithm.

3.1 Selective Confidence Interval for Wasserstein Distance

In this section, we propose an exact (non-asymptotic) selective CI for the Wasserstein distance by conditional SI. We interpret the computation of the Wasserstein distance in (7) as a two-step procedure:

  • Step 1: Compute the cost matrix in (3) with the \ell_{1} distance and obtain the vectorized form of the cost matrix in (4).

  • Step 2: Solve the LP in (7) to obtain {\mathcal{M}}_{\rm{obs}}, which is subsequently used to construct the test-statistic direction \bm{\eta} in (10) and to calculate the distance.

Since the distance is computed in a data-driven fashion, to construct a valid CI it is necessary to remove the information that has been used in the above calculation process, i.e., steps 1 and 2, in order to correct the selection bias. This can be achieved by considering the sampling distribution of the test statistic \bm{\eta}^{\top}(\bm{X}~\bm{Y})^{\top} conditional on the selection event; that is,

\bm{\eta}^{\top}(\bm{X}~\bm{Y})^{\top} \,\Big|\, \left\{{\mathcal{S}}(\bm{X},\bm{Y}) = {\mathcal{S}}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}),\ {\mathcal{M}}(\bm{X},\bm{Y}) = {\mathcal{M}}_{\rm{obs}}\right\},   (14)

where {\mathcal{S}}(\bm{X},\bm{Y}) is the sign vector introduced in the construction of \Theta in (5).

Interpretation of the selection event in (14). The first and second conditions in (14) respectively represent the selection events for steps 1 and 2 of the procedure described at the beginning of this section. The first condition {\mathcal{S}}(\bm{X},\bm{Y}) = {\mathcal{S}}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}) indicates that {\rm{sign}}(x_{i}-y_{j}) = {\rm{sign}}(x_{i}^{\rm{obs}}-y_{j}^{\rm{obs}}) for i\in[n], j\in[m]. In other words, for each i\in[n], j\in[m], we condition on whether x_{i}^{\rm{obs}} is greater than y_{j}^{\rm{obs}} or not. The second condition {\mathcal{M}}(\bm{X},\bm{Y}) = {\mathcal{M}}_{\rm{obs}} indicates that the set of selected basic variables for the random vectors \bm{X} and \bm{Y} is the same as that for \bm{X}^{\rm{obs}} and \bm{Y}^{\rm{obs}}. This condition can be interpreted as conditioning on the observed optimal coupling {\mathcal{M}}_{\rm{obs}} between the elements of \bm{X}^{\rm{obs}} and \bm{Y}^{\rm{obs}}, which is obtained after solving the LP in (7) on the observed data (see Figure 1).

Selective CI. To derive exact statistical inference for the Wasserstein distance, we introduce the so-called selective CI for W^{\ast} = W^{\ast}(P_{n},Q_{m}) = \bm{\eta}^{\top}\left(\bm{\mu}_{\bm{X}}~\bm{\mu}_{\bm{Y}}\right)^{\top} that satisfies the following (1-\alpha)-coverage property conditional on the selection event:

\mathbb{P}\left(W^{\ast}\in C_{\rm{sel}} \,\Big|\, {\mathcal{S}}(\bm{X},\bm{Y}) = {\mathcal{S}}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}),\ {\mathcal{M}}(\bm{X},\bm{Y}) = {\mathcal{M}}_{\rm{obs}}\right) = 1-\alpha,   (15)

for any \alpha\in[0,1]. The selective CI is defined as

C_{\rm{sel}} = \left\{w\in\mathbb{R} : \frac{\alpha}{2} \leq F_{w,\sigma^{2}}^{{\mathcal{Z}}}\left(\bm{\eta}^{\top}(\bm{X}^{\rm{obs}}~\bm{Y}^{\rm{obs}})^{\top}\right) \leq 1-\frac{\alpha}{2}\right\},   (16)

where the pivotal quantity

F_{w,\sigma^{2}}^{{\mathcal{Z}}}\left(\bm{\eta}^{\top}(\bm{X}~\bm{Y})^{\top}\right) \,\Big|\, \left\{{\mathcal{S}}(\bm{X},\bm{Y}) = {\mathcal{S}}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}),\ {\mathcal{M}}(\bm{X},\bm{Y}) = {\mathcal{M}}_{\rm{obs}},\ \bm{q}(\bm{X},\bm{Y}) = \bm{q}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}})\right\}   (17)

is the c.d.f. of the truncated normal distribution with mean w\in\mathbb{R}, variance \sigma^{2} = \bm{\eta}^{\top}\tilde{\Sigma}\bm{\eta}, and truncation region {\mathcal{Z}} (the detailed construction of {\mathcal{Z}} will be discussed in §3.2), which is calculated based on the selection event in (17). The quantity \bm{q}(\bm{X},\bm{Y}) in the additional third condition is the sufficient statistic of the nuisance parameter, defined as

\bm{q}(\bm{X},\bm{Y}) = \left(I_{n+m} - \bm{c}\bm{\eta}^{\top}\right)\left(\bm{X}~\bm{Y}\right)^{\top}

in which \bm{c} = \tilde{\Sigma}\bm{\eta}(\bm{\eta}^{\top}\tilde{\Sigma}\bm{\eta})^{-1}, where \tilde{\Sigma} is defined in (11). Here, we note that the selective CI depends on \bm{q}(\bm{X},\bm{Y}) because the pivotal quantity in (17) depends on this component, but the coverage property in (15) remains satisfied without this additional condition because we can marginalize over all values of \bm{q}(\bm{X},\bm{Y}). The quantity \bm{q}(\bm{X},\bm{Y}) corresponds to the component \bm{z} in the seminal paper of Lee et al. (2016) (see Section 5, Eq. 5.2, and Theorem 5.2). We note that additionally conditioning on \bm{q}(\bm{X},\bm{Y}) is a standard approach in the SI literature and is used in almost all the SI-related works cited in this paper.

To obtain the selective CI in (16), we need to compute the quantity in (17), which depends on the truncation region {\mathcal{Z}}. Therefore, the remaining task is to identify {\mathcal{Z}}, whose characterization is introduced in the next section.
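Once the truncation region {\mathcal{Z}} from §3.2 is available (a single interval [L, U] containing the observed statistic), the pivot (17) is the c.d.f. of a truncated normal and the selective CI (16) can be obtained by inverting it numerically in w. A minimal sketch under those assumptions; the function names tn_cdf and selective_ci are ours, and a careful implementation would evaluate the pivot on the log scale for numerical stability when the truncation is severe.

```python
from scipy.stats import norm
from scipy.optimize import brentq

def tn_cdf(x, w, sigma, L, U):
    """c.d.f. at x of N(w, sigma^2) truncated to the interval [L, U]."""
    num = norm.cdf((min(x, U) - w) / sigma) - norm.cdf((L - w) / sigma)
    den = norm.cdf((U - w) / sigma) - norm.cdf((L - w) / sigma)
    return num / den

def selective_ci(stat, sigma, L, U, alpha=0.05, width=10.0):
    # The pivot is decreasing in w, so the lower limit solves F = 1 - alpha/2
    # and the upper limit solves F = alpha/2 (cf. Eq. (16)).
    lo = brentq(lambda w: tn_cdf(stat, w, sigma, L, U) - (1 - alpha / 2),
                stat - width * sigma, stat + width * sigma)
    hi = brentq(lambda w: tn_cdf(stat, w, sigma, L, U) - alpha / 2,
                stat - width * sigma, stat + width * sigma)
    return lo, hi
```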

3.2 Conditional Data Space Characterization

We define the set of (\bm{X}~\bm{Y})^{\top}\in\mathbb{R}^{n+m} that satisfies the conditions in Equation (17) as

{\mathcal{D}} = \left\{(\bm{X}~\bm{Y})^{\top}\in\mathbb{R}^{n+m} ~\Big|~ {\mathcal{S}}(\bm{X},\bm{Y}) = {\mathcal{S}}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}),\ {\mathcal{M}}(\bm{X},\bm{Y}) = {\mathcal{M}}_{\rm{obs}},\ \bm{q}(\bm{X},\bm{Y}) = \bm{q}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}})\right\}.   (18)

According to the third condition \bm{q}(\bm{X},\bm{Y}) = \bm{q}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}), the data in {\mathcal{D}} are restricted to a line in \mathbb{R}^{n+m}, as stated in the following lemma.

Lemma 1.

Let us define

\bm{a} = \bm{q}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}) \quad\text{and}\quad \bm{b} = \tilde{\Sigma}\bm{\eta}(\bm{\eta}^{\top}\tilde{\Sigma}\bm{\eta})^{-1},   (19)

where \tilde{\Sigma} is defined in (11). Then, the set {\mathcal{D}} in (18) can be rewritten using the scalar parameter z\in\mathbb{R} as follows:

{\mathcal{D}} = \left\{(\bm{X}~\bm{Y})^{\top} = \bm{a} + \bm{b}z \mid z\in{\mathcal{Z}}\right\},   (20)

where

{\mathcal{Z}} = \left\{z\in\mathbb{R} ~\Big|~ {\mathcal{S}}(\bm{a}+\bm{b}z) = {\mathcal{S}}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}),\ {\mathcal{M}}(\bm{a}+\bm{b}z) = {\mathcal{M}}_{\rm{obs}}\right\}.   (23)

Here, with a slight abuse of notation, {\mathcal{S}}(\bm{a}+\bm{b}z) = {\mathcal{S}}\left((\bm{X}~\bm{Y})^{\top}\right) is equivalent to {\mathcal{S}}(\bm{X},\bm{Y}). The same applies to {\mathcal{M}}(\bm{a}+\bm{b}z).

Proof.

According to the third condition in (18), we have

\bm{q}(\bm{X},\bm{Y}) = \bm{q}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}})
\Leftrightarrow \left(I_{n+m} - \bm{c}\bm{\eta}^{\top}\right)(\bm{X}~\bm{Y})^{\top} = \bm{q}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}})
\Leftrightarrow (\bm{X}~\bm{Y})^{\top} = \bm{q}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}) + \frac{\tilde{\Sigma}\bm{\eta}}{\bm{\eta}^{\top}\tilde{\Sigma}\bm{\eta}}\,\bm{\eta}^{\top}(\bm{X}~\bm{Y})^{\top}.

By defining \bm{a} = \bm{q}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}), \bm{b} = \tilde{\Sigma}\bm{\eta}(\bm{\eta}^{\top}\tilde{\Sigma}\bm{\eta})^{-1}, z = \bm{\eta}^{\top}(\bm{X}~\bm{Y})^{\top}, and incorporating the first and second conditions in (18), we obtain the result of Lemma 1. We note that the fact that the data are restricted to a line has already been implicitly exploited in the seminal conditional SI work of Lee et al. (2016), but it was explicitly discussed for the first time in Section 6 of Liu et al. (2018). ∎

Lemma 1 indicates that we do not have to consider the (n+m)-dimensional data space. Instead, we only need to consider the one-dimensional projected data space {\mathcal{Z}} in (23), which is the truncation region required for computing the pivotal quantity in (17) and constructing the selective CI C_{\rm{sel}} in (16).
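Concretely, \bm{a} and \bm{b} in (19) can be computed from \bm{\eta} and \tilde{\Sigma}, and the one-dimensional parametrization of Lemma 1 can be verified on the observed data. A short sketch continuing the previous snippets (Sigma_tilde, eta, and data as defined there; variable names are ours):

```python
sigma2 = eta @ Sigma_tilde @ eta        # variance of the test statistic
b = Sigma_tilde @ eta / sigma2          # b in (19)
z_obs = eta @ data                      # observed value of the statistic
a = data - b * z_obs                    # a = q(X_obs, Y_obs)

# The observed data lie on the line a + b z at z = z_obs (Lemma 1).
assert np.allclose(a + b * z_obs, data)
```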

Characterization of the truncation region {\mathcal{Z}}. We can decompose {\mathcal{Z}} into two separate sets as {\mathcal{Z}} = {\mathcal{Z}}_{1}\cap{\mathcal{Z}}_{2}, where

{\mathcal{Z}}_{1} = \{z\in\mathbb{R} \mid {\mathcal{S}}(\bm{a}+\bm{b}z) = {\mathcal{S}}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}})\}
\quad\text{and}\quad
{\mathcal{Z}}_{2} = \{z\in\mathbb{R} \mid {\mathcal{M}}(\bm{a}+\bm{b}z) = {\mathcal{M}}_{\rm{obs}}\}.

We first present the construction of {\mathcal{Z}}_{1} in the following lemma.

Lemma 2.

For notational simplicity, we denote \bm{s}_{\rm{obs}} = {\mathcal{S}}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}). Then, {\mathcal{Z}}_{1} is an interval defined as:

{\mathcal{Z}}_{1} = \left\{z\in\mathbb{R} \mid \max_{j:\nu_{j}^{(2)}>0}\frac{-\nu_{j}^{(1)}}{\nu_{j}^{(2)}} \leq z \leq \min_{j:\nu_{j}^{(2)}<0}\frac{-\nu_{j}^{(1)}}{\nu_{j}^{(2)}}\right\},

where \bm{\nu}^{(1)} = \bm{s}_{\rm{obs}}\circ\Omega\bm{a} and \bm{\nu}^{(2)} = \bm{s}_{\rm{obs}}\circ\Omega\bm{b}.

Proof.

Let us first recall that {\mathcal{S}}(\bm{X},\bm{Y}) = {\rm{sign}}\left(\Omega(\bm{X}~\bm{Y})^{\top}\right), where \Omega is defined in (5). Then, {\mathcal{Z}}_{1} can be re-written as follows:

{\mathcal{Z}}_{1} = \{z\in\mathbb{R} \mid {\mathcal{S}}(\bm{a}+\bm{b}z) = \bm{s}_{\rm{obs}}\}
             = \left\{z\in\mathbb{R} \mid {\rm{sign}}\big(\Omega(\bm{a}+\bm{b}z)\big) = \bm{s}_{\rm{obs}}\right\}
             = \left\{z\in\mathbb{R} \mid \bm{s}_{\rm{obs}}\circ\Omega(\bm{a}+\bm{b}z) \geq \bm{0}\right\}.

By defining \bm{\nu}^{(1)} = \bm{s}_{\rm{obs}}\circ\Omega\bm{a} and \bm{\nu}^{(2)} = \bm{s}_{\rm{obs}}\circ\Omega\bm{b}, the result of Lemma 2 follows directly by solving the above system of linear inequalities. ∎
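The interval {\mathcal{Z}}_{1} of Lemma 2 amounts to solving the componentwise inequalities \bm{s}_{\rm{obs}}\circ\Omega(\bm{a}+\bm{b}z)\geq\bm{0}. A minimal sketch continuing the previous snippets (s is the observed sign vector computed earlier):

```python
nu1 = s * (Omega @ a)          # nu^(1) = s_obs o (Omega a)
nu2 = s * (Omega @ b)          # nu^(2) = s_obs o (Omega b)

# Lower/upper ends of Z_1 per Lemma 2; `initial` handles empty index sets.
L1 = np.max(-nu1[nu2 > 0] / nu2[nu2 > 0], initial=-np.inf)
U1 = np.min(-nu1[nu2 < 0] / nu2[nu2 < 0], initial=np.inf)
```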

Next, we present the construction of {\mathcal{Z}}_{2}. Here, {\mathcal{Z}}_{2} can be interpreted as the set of values of z for which we obtain the same set of selected basic variables {\mathcal{M}}_{\rm{obs}} when applying the LP in (7) to the parametrized data \bm{a}+\bm{b}z.

Lemma 3.

The set {\mathcal{Z}}_{2} is an interval defined as:

{\mathcal{Z}}_{2} = \left\{z\in\mathbb{R} \mid \max_{j\in{\mathcal{M}}_{\rm{obs}}^{c}:\tilde{v}_{j}>0}\frac{-\tilde{u}_{j}}{\tilde{v}_{j}} \leq z \leq \min_{j\in{\mathcal{M}}_{\rm{obs}}^{c}:\tilde{v}_{j}<0}\frac{-\tilde{u}_{j}}{\tilde{v}_{j}}\right\},

where

\tilde{\bm{u}} = \left(\bm{u}_{{\mathcal{M}}^{c}}^{\top} - \bm{u}_{{\mathcal{M}}}^{\top}S_{:,{\mathcal{M}}}^{-1}S_{:,{\mathcal{M}}^{c}}\right)^{\top},
\tilde{\bm{v}} = \left(\bm{v}_{{\mathcal{M}}^{c}}^{\top} - \bm{v}_{{\mathcal{M}}}^{\top}S_{:,{\mathcal{M}}}^{-1}S_{:,{\mathcal{M}}^{c}}\right)^{\top},

\bm{u} = \Theta\bm{a}, \bm{v} = \Theta\bm{b}, and \Theta is defined as in (5) with the observed data.

Proof.

We consider the LP in (7) with the parametrized data \bm{a}+\bm{b}z as follows:

\min_{\bm{t}\in\mathbb{R}^{nm}} \bm{t}^{\top}\Theta(\bm{a}+\bm{b}z) \quad\text{s.t.}\quad S\bm{t} = \bm{h},\ \bm{t}\geq\bm{0}.   (24)

Here, we recall that \Theta(\bm{a}+\bm{b}z) is the vectorized version of the cost matrix defined in (4). The optimization problem in (24) is well known as the parametric cost problem in the LP literature (e.g., see Section 8.2 of Murty (1983)). Denoting \bm{u} = \Theta\bm{a} and \bm{v} = \Theta\bm{b}, the LP in (24) can be re-written as

\min_{\bm{t}\in\mathbb{R}^{nm}} (\bm{u}+\bm{v}z)^{\top}\bm{t} \quad\text{s.t.}\quad S\bm{t} = \bm{h},\ \bm{t}\geq\bm{0}.   (25)

Given a fixed value z_{0}, let {\mathcal{M}} be an optimal basic index set of the LP in (25) at z = z_{0} and {\mathcal{M}}^{c} be its complement. Then, by partitioning S, \bm{t}, \bm{u}, and \bm{v} as

S = [S_{:,{\mathcal{M}}}, S_{:,{\mathcal{M}}^{c}}],
\bm{t} = (\bm{t}_{{\mathcal{M}}}, \bm{t}_{{\mathcal{M}}^{c}}), \quad \bm{u} = (\bm{u}_{{\mathcal{M}}}, \bm{u}_{{\mathcal{M}}^{c}}), \quad \bm{v} = (\bm{v}_{{\mathcal{M}}}, \bm{v}_{{\mathcal{M}}^{c}}),

the LP in (25) becomes

\min_{\bm{t}_{{\mathcal{M}}},\bm{t}_{{\mathcal{M}}^{c}}} (\bm{u}_{{\mathcal{M}}}+\bm{v}_{{\mathcal{M}}}z)^{\top}\bm{t}_{{\mathcal{M}}} + (\bm{u}_{{\mathcal{M}}^{c}}+\bm{v}_{{\mathcal{M}}^{c}}z)^{\top}\bm{t}_{{\mathcal{M}}^{c}}
\quad\text{s.t.}\quad S_{:,{\mathcal{M}}}\bm{t}_{{\mathcal{M}}} + S_{:,{\mathcal{M}}^{c}}\bm{t}_{{\mathcal{M}}^{c}} = \bm{h},   (26)
\qquad\qquad~ \bm{t}_{{\mathcal{M}}}\geq\bm{0},\ \bm{t}_{{\mathcal{M}}^{c}}\geq\bm{0}.

The value of \bm{t}_{{\mathcal{M}}} can be computed as

\bm{t}_{{\mathcal{M}}} = S_{:,{\mathcal{M}}}^{-1}\bm{h} - S_{:,{\mathcal{M}}}^{-1}S_{:,{\mathcal{M}}^{c}}\bm{t}_{{\mathcal{M}}^{c}},

and substituting this general expression into the objective (cost) function of (25) yields

f = (\bm{u}_{{\mathcal{M}}}+\bm{v}_{{\mathcal{M}}}z)^{\top}\left(S_{:,{\mathcal{M}}}^{-1}\bm{h} - S_{:,{\mathcal{M}}}^{-1}S_{:,{\mathcal{M}}^{c}}\bm{t}_{{\mathcal{M}}^{c}}\right) + (\bm{u}_{{\mathcal{M}}^{c}}+\bm{v}_{{\mathcal{M}}^{c}}z)^{\top}\bm{t}_{{\mathcal{M}}^{c}}
  = (\bm{u}_{{\mathcal{M}}}+\bm{v}_{{\mathcal{M}}}z)^{\top}S_{:,{\mathcal{M}}}^{-1}\bm{h} + \Big[\left(\bm{u}_{{\mathcal{M}}^{c}}^{\top} - \bm{u}_{{\mathcal{M}}}^{\top}S_{:,{\mathcal{M}}}^{-1}S_{:,{\mathcal{M}}^{c}}\right) + \left(\bm{v}_{{\mathcal{M}}^{c}}^{\top} - \bm{v}_{{\mathcal{M}}}^{\top}S_{:,{\mathcal{M}}}^{-1}S_{:,{\mathcal{M}}^{c}}\right)z\Big]\bm{t}_{{\mathcal{M}}^{c}},

which expresses the cost of (26) in terms of \bm{t}_{{\mathcal{M}}^{c}}. Let us denote

\tilde{\bm{u}} = \left(\bm{u}_{{\mathcal{M}}^{c}}^{\top} - \bm{u}_{{\mathcal{M}}}^{\top}S_{:,{\mathcal{M}}}^{-1}S_{:,{\mathcal{M}}^{c}}\right)^{\top} \text{ and } \tilde{\bm{v}} = \left(\bm{v}_{{\mathcal{M}}^{c}}^{\top} - \bm{v}_{{\mathcal{M}}}^{\top}S_{:,{\mathcal{M}}}^{-1}S_{:,{\mathcal{M}}^{c}}\right)^{\top};

then we can write \bm{r}_{{\mathcal{M}}^{c}} = \tilde{\bm{u}} + \tilde{\bm{v}}z, which is known as the relative cost vector in the LP literature. Then, {\mathcal{M}} is an optimal basic index set of (25) for all values of the parameter z satisfying

\bm{r}_{{\mathcal{M}}^{c}} = \tilde{\bm{u}} + \tilde{\bm{v}}z \geq 0,   (27)

which is also explicitly discussed in Section 8.2 of Murty (1983). Finally, the results in Lemma 3 are obtained by respectively replacing {\mathcal{M}} and {\mathcal{M}}^{c} with {\mathcal{M}}_{\rm{obs}} and {\mathcal{M}}_{\rm{obs}}^{c}, and solving the linear inequality system in (27). ∎
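Similarly, {\mathcal{Z}}_{2} follows from the sign conditions on the relative cost vector in (27). A sketch continuing the previous snippets (M_obs, S_red, Theta, a, b as defined earlier; again assuming a non-degenerate basis); the resulting interval [L, U] and the observed statistic can then be passed to the selective_ci sketch from §3.1.

```python
Mc_idx = np.setdiff1d(np.arange(n * m), M_obs)   # non-basic indices
u, v = Theta @ a, Theta @ b

B_inv_N = np.linalg.solve(S_red[:, M_obs], S_red[:, Mc_idx])  # S_B^{-1} S_N
u_t = u[Mc_idx] - u[M_obs] @ B_inv_N             # u~ in Lemma 3
v_t = v[Mc_idx] - v[M_obs] @ B_inv_N             # v~ in Lemma 3

L2 = np.max(-u_t[v_t > 0] / v_t[v_t > 0], initial=-np.inf)
U2 = np.min(-u_t[v_t < 0] / v_t[v_t < 0], initial=np.inf)

# Truncation region Z = Z_1 n Z_2 (Lemmas 2 and 3).
L, U = max(L1, L2), min(U1, U2)
```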

Once {\mathcal{Z}}_{1} and {\mathcal{Z}}_{2} are identified, we can compute the truncation region {\mathcal{Z}} = {\mathcal{Z}}_{1}\cap{\mathcal{Z}}_{2}. Finally, we can use {\mathcal{Z}} to calculate the pivotal quantity in (17), which is subsequently used to construct the proposed selective CI in (16). The details of the algorithm are presented in Algorithm 1.

Input: \bm{X}^{\rm{obs}}, \bm{Y}^{\rm{obs}}
1: Compute the cost matrix as in (3) and obtain its vectorized version \bm{c}(\bm{X}^{\rm{obs}},\bm{Y}^{\rm{obs}}) as in (4)
2: Solve the LP in (7) to obtain {\mathcal{M}}_{\rm{obs}}
3: Compute \bm{\eta} based on {\mathcal{M}}_{\rm{obs}} \leftarrow Equation (10)
4: Calculate \bm{a} and \bm{b} based on \bm{\eta} \leftarrow Equation (19)
5: Construct {\mathcal{Z}}_{1} \leftarrow Lemma 2
6: Identify {\mathcal{Z}}_{2} \leftarrow Lemma 3
7: Truncation region {\mathcal{Z}} = {\mathcal{Z}}_{1}\cap{\mathcal{Z}}_{2}
8: Compute C_{\rm{sel}} \leftarrow Equation (16)
Output: C_{\rm{sel}}

Algorithm 1 Selective CI for the Wasserstein Distance

4 Extension to Multi-Dimension

In §2 and §3, we mainly focused on the Wasserstein distance in one dimension, i.e., x_{i}\in\mathbb{R} for i\in[n] and y_{j}\in\mathbb{R} for j\in[m]. In this section, we generalize the problem setup and extend the proposed method to the Wasserstein distance in multiple dimensions. We consider two random sets X and Y of d-dimensional vectors

X = (\bm{x}_{1,:},\ldots,\bm{x}_{n,:})^{\top}\in\mathbb{R}^{n\times d}, \quad Y = (\bm{y}_{1,:},\ldots,\bm{y}_{m,:})^{\top}\in\mathbb{R}^{m\times d},   (28)

corrupted with Gaussian noise as

\mathbb{R}^{nd}\ni\bm{X}_{\rm{vec}} = {\rm{vec}}(X) = (\bm{x}_{1,:}^{\top},\ldots,\bm{x}_{n,:}^{\top})^{\top} = \bm{\mu}_{\bm{X}_{\rm{vec}}} + \bm{\varepsilon}_{\bm{X}_{\rm{vec}}}, \quad \bm{\varepsilon}_{\bm{X}_{\rm{vec}}}\sim\mathbb{N}(\bm{0},\Sigma_{\bm{X}_{\rm{vec}}}),
\mathbb{R}^{md}\ni\bm{Y}_{\rm{vec}} = {\rm{vec}}(Y) = (\bm{y}_{1,:}^{\top},\ldots,\bm{y}_{m,:}^{\top})^{\top} = \bm{\mu}_{\bm{Y}_{\rm{vec}}} + \bm{\varepsilon}_{\bm{Y}_{\rm{vec}}}, \quad \bm{\varepsilon}_{\bm{Y}_{\rm{vec}}}\sim\mathbb{N}(\bm{0},\Sigma_{\bm{Y}_{\rm{vec}}}),

where n and m are the numbers of instances in each set, \bm{\mu}_{\bm{X}_{\rm{vec}}} and \bm{\mu}_{\bm{Y}_{\rm{vec}}} are unknown mean vectors, and \bm{\varepsilon}_{\bm{X}_{\rm{vec}}} and \bm{\varepsilon}_{\bm{Y}_{\rm{vec}}} are Gaussian noise vectors with covariance matrices \Sigma_{\bm{X}_{\rm{vec}}} and \Sigma_{\bm{Y}_{\rm{vec}}} assumed to be known or estimable from independent data.

Cost matrix. The cost matrix C(X,Y) of pairwise distances (\ell_{1} distance) between the elements of X and Y is defined as

C(X,Y) = \left[\sum_{k=1}^{d}|\bm{x}_{i,k}-\bm{y}_{j,k}|\right]_{ij}\in\mathbb{R}^{n\times m}.   (29)

Then, the vectorized form of C(X,Y) can be defined as

\bm{c}(X,Y) = {\rm{vec}}\left(C(X,Y)\right) = \Theta^{\rm{mul}}\left(\bm{X}_{\rm{vec}}~\bm{Y}_{\rm{vec}}\right)^{\top}\in\mathbb{R}^{nm},   (30)

where

\Theta^{\rm{mul}} = \sum_{k=1}^{d}{\mathcal{S}}_{k}(X,Y)\circ\left(\Omega\otimes\bm{e}_{d,k}^{\top}\right)\in\mathbb{R}^{nm\times(nd+md)},
{\mathcal{S}}_{k}(X,Y) = {\rm{sign}}\left(\left(\Omega\otimes\bm{e}_{d,k}^{\top}\right)\left(\bm{X}_{\rm{vec}}~\bm{Y}_{\rm{vec}}\right)^{\top}\right)\in\mathbb{R}^{nm},

the matrix \Omega is defined in (5), the operator \otimes denotes the Kronecker product, and \bm{e}_{d,k}\in\mathbb{R}^{d} is the d-dimensional unit vector with 1 at position k\in[d].
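The matrix \Theta^{\rm{mul}} can be assembled coordinate-by-coordinate with Kronecker products. A small self-contained sketch in NumPy (our own variable names; X and Y are the n×d and m×d data matrices, and Omega is the matrix from (5)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d = 4, 3, 2
X, Y = rng.normal(size=(n, d)), rng.normal(size=(m, d))
Omega = np.hstack([np.kron(np.eye(n), np.ones((m, 1))),
                   np.tile(-np.eye(m), (n, 1))])

data_vec = np.concatenate([X.ravel(), Y.ravel()])   # (X_vec Y_vec)^T
Theta_mul = np.zeros((n * m, (n + m) * d))
for k in range(d):
    e_k = np.zeros(d); e_k[k] = 1.0
    Omega_k = np.kron(Omega, e_k)                   # Omega (x) e_{d,k}^T
    s_k = np.sign(Omega_k @ data_vec)               # S_k(X, Y): signs of x_{i,k} - y_{j,k}
    Theta_mul += s_k[:, None] * Omega_k

# c(X, Y) = Theta_mul (X_vec Y_vec)^T is the vectorized l1 cost in (29)-(30).
c_mul = Theta_mul @ data_vec
assert np.allclose(c_mul,
                   np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=2).ravel())
```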

The Wasserstein distance in multi-dimension. Given X^{\rm{obs}} and Y^{\rm{obs}} sampled from (28), after obtaining \bm{c}(X^{\rm{obs}},Y^{\rm{obs}}) as in (30), the Wasserstein distance in multi-dimension is defined as

W^{\rm{mul}}(P_{n},Q_{m}) = \min_{\bm{t}\in\mathbb{R}^{nm}} \bm{t}^{\top}\bm{c}(X^{\rm{obs}},Y^{\rm{obs}})   (31)
    s.t.  S\bm{t} = \bm{h}, \quad \bm{t}\geq\bm{0},

where S and \bm{h} are defined in (7). By solving the LP in (31), we obtain the set of selected basic variables

{\mathcal{M}}_{\rm{obs}} = {\mathcal{M}}(X^{\rm{obs}},Y^{\rm{obs}}).   (32)

Then, the Wasserstein distance can be re-written as

W^{\rm{mul}}(P_{n},Q_{m}) = \hat{\bm{t}}^{\top}\bm{c}(X^{\rm{obs}},Y^{\rm{obs}}) = \hat{\bm{t}}_{{\mathcal{M}}_{\rm{obs}}}^{\top}\bm{c}_{{\mathcal{M}}_{\rm{obs}}}(X^{\rm{obs}},Y^{\rm{obs}}) = \hat{\bm{t}}_{{\mathcal{M}}_{\rm{obs}}}^{\top}\Theta^{\rm{mul}}_{{\mathcal{M}}_{\rm{obs}},:}(\bm{X}^{\rm{obs}}_{\rm{vec}}~\bm{Y}^{\rm{obs}}_{\rm{vec}})^{\top} = \bm{\eta}^{\top}_{\rm{mul}}(\bm{X}^{\rm{obs}}_{\rm{vec}}~\bm{Y}^{\rm{obs}}_{\rm{vec}})^{\top},   (33)

where \bm{\eta}_{\rm{mul}} = \left(\hat{\bm{t}}_{{\mathcal{M}}_{\rm{obs}}}^{\top}\Theta^{\rm{mul}}_{{\mathcal{M}}_{\rm{obs}},:}\right)^{\top} is the test-statistic direction, \hat{\bm{t}}_{{\mathcal{M}}_{\rm{obs}}} = S_{:,{\mathcal{M}}_{\rm{obs}}}^{-1}\bm{h} is the optimal solution of (31), and the matrix \Theta^{\rm{mul}} is defined in (30).

Selection event and selective CI. Since we are dealing with the multi-dimensional case, the selection event is slightly different from, but more general than, the one presented in (17) of §3. Specifically, we consider the following conditional inference

\bm{\eta}^{\top}_{\rm{mul}}(\bm{X}_{\rm{vec}}~\bm{Y}_{\rm{vec}})^{\top} \mid {\mathcal{E}}^{\rm{mul}},   (34)

where

{\mathcal{E}}^{\rm{mul}} = \left\{{\mathcal{S}}_{k}(X,Y) = {\mathcal{S}}_{k}(X^{\rm{obs}},Y^{\rm{obs}})\ \text{for all}\ k\in[d],\ {\mathcal{M}}(X,Y) = {\mathcal{M}}_{\rm{obs}},\ \bm{q}(X,Y) = \bm{q}(X^{\rm{obs}},Y^{\rm{obs}})\right\}.

Once the selection event {\mathcal{E}}^{\rm{mul}} has been identified, the pivotal quantity can be computed:

F_{\bm{\eta}^{\top}_{\rm{mul}}(\bm{\mu}_{\bm{X}_{\rm{vec}}}~\bm{\mu}_{\bm{Y}_{\rm{vec}}})^{\top},\,\sigma^{2}_{\rm{mul}}}^{{\mathcal{Z}}^{\rm{mul}}}\left(\bm{\eta}^{\top}_{\rm{mul}}(\bm{X}_{\rm{vec}}~\bm{Y}_{\rm{vec}})^{\top}\right) \mid {\mathcal{E}}^{\rm{mul}},

where \sigma^{2}_{\rm{mul}} = \bm{\eta}_{\rm{mul}}^{\top}\tilde{\Sigma}^{\rm{mul}}\bm{\eta}_{\rm{mul}} with \tilde{\Sigma}^{\rm{mul}} = \begin{pmatrix}\Sigma_{\bm{X}_{\rm{vec}}}&0\\ 0&\Sigma_{\bm{Y}_{\rm{vec}}}\end{pmatrix}, and the truncation region {\mathcal{Z}}^{\rm{mul}} is calculated based on the selection event {\mathcal{E}}^{\rm{mul}}, which we discuss below. After {\mathcal{Z}}^{\rm{mul}} is identified, the selective CI is defined as

C^{\rm{mul}}_{\rm{sel}} = \left\{w\in\mathbb{R} : \frac{\alpha}{2} \leq F_{w,\sigma^{2}_{\rm{mul}}}^{{\mathcal{Z}}^{\rm{mul}}}\left(\bm{\eta}^{\top}_{\rm{mul}}(\bm{X}^{\rm{obs}}_{\rm{vec}}~\bm{Y}^{\rm{obs}}_{\rm{vec}})^{\top}\right) \leq 1-\frac{\alpha}{2}\right\}.   (35)

The remaining task is to identify {\mathcal{Z}}^{\rm{mul}}.

Characterization of the truncation region {\mathcal{Z}}^{\rm{mul}}. Similar to the discussion in §3, the data are restricted to a line due to the conditioning on the nuisance component \bm{q}(X,Y). Then, the set of data that satisfies the conditions in (34) is defined as

{\mathcal{D}}^{\rm{mul}} = \left\{(\bm{X}_{\rm{vec}}~\bm{Y}_{\rm{vec}})^{\top} = \bm{a}^{\rm{mul}} + \bm{b}^{\rm{mul}}z \mid z\in{\mathcal{Z}}^{\rm{mul}}\right\},

where

\bm{a}^{\rm{mul}} = \bm{q}(X^{\rm{obs}},Y^{\rm{obs}}),
\bm{b}^{\rm{mul}} = \tilde{\Sigma}^{\rm{mul}}\bm{\eta}_{\rm{mul}}(\bm{\eta}^{\top}_{\rm{mul}}\tilde{\Sigma}^{\rm{mul}}\bm{\eta}_{\rm{mul}})^{-1},
{\mathcal{Z}}^{\rm{mul}} = \left\{z\in\mathbb{R} ~\Big|~ {\mathcal{S}}_{k}(\bm{a}^{\rm{mul}}+\bm{b}^{\rm{mul}}z) = {\mathcal{S}}_{k}(X^{\rm{obs}},Y^{\rm{obs}})\ \text{for all}\ k\in[d],\ {\mathcal{M}}(\bm{a}^{\rm{mul}}+\bm{b}^{\rm{mul}}z) = {\mathcal{M}}_{\rm{obs}}\right\}

with z\in\mathbb{R}. Next, we can decompose {\mathcal{Z}}^{\rm{mul}} into two separate sets as {\mathcal{Z}}^{\rm{mul}} = {\mathcal{Z}}_{1}^{\rm{mul}}\cap{\mathcal{Z}}_{2}^{\rm{mul}}, where

{\mathcal{Z}}_{1}^{\rm{mul}} = \left\{z\in\mathbb{R} \mid {\mathcal{S}}_{k}(\bm{a}^{\rm{mul}}+\bm{b}^{\rm{mul}}z) = {\mathcal{S}}_{k}(X^{\rm{obs}},Y^{\rm{obs}})\ \text{for all}\ k\in[d]\right\},
{\mathcal{Z}}_{2}^{\rm{mul}} = \left\{z\in\mathbb{R} \mid {\mathcal{M}}(\bm{a}^{\rm{mul}}+\bm{b}^{\rm{mul}}z) = {\mathcal{M}}_{\rm{obs}}\right\}.

From here, the identification of {\mathcal{Z}}_{1}^{\rm{mul}} and {\mathcal{Z}}_{2}^{\rm{mul}} is straightforward and similar to the construction of {\mathcal{Z}}_{1} and {\mathcal{Z}}_{2} discussed in §3. Once {\mathcal{Z}}_{1}^{\rm{mul}} and {\mathcal{Z}}_{2}^{\rm{mul}} are identified, we can compute the truncation region {\mathcal{Z}}^{\rm{mul}} = {\mathcal{Z}}_{1}^{\rm{mul}}\cap{\mathcal{Z}}_{2}^{\rm{mul}} and use it to compute the selective CI in (35).

5 Experiment

In this section, we demonstrate the performance of the proposed method in both the univariate case and the multi-dimensional case. We present the results on synthetic data in §5.1. Thereafter, the results on real data are shown in §5.2. In all the experiments, we set the significance level \alpha = 0.05, i.e., all the experiments were conducted with a coverage level of 1-\alpha = 0.95.

5.1 Numerical Experiment

In this section, we evaluate the performance of the proposed selective CI in terms of coverage guarantee, CI length, and computational cost. We also compare our selective CI with the naive CI in (12) in terms of coverage guarantee. We would like to note that we did not compare CI lengths because the naive CI does not guarantee the coverage property; from a statistical viewpoint, if a CI is unreliable, i.e., invalid or does not satisfy the coverage property, comparing its length is not meaningful. Besides, we additionally compare the performance of the proposed method with the latest asymptotic study (Imaizumi et al., 2019).

Figure 2: Coverage guarantee, CI length and computational cost in the univariate case (d=1). Panels: (a) coverage guarantee, (b) CI length, (c) computational cost.

5.1.1 Univariate case (d=1d=1)

We generated the datasets \bm{X} and \bm{Y} with \bm{\mu}_{\bm{X}} = \bm{1}_{n}, \bm{\mu}_{\bm{Y}} = \bm{1}_{m} + \Delta (element-wise addition), \bm{\varepsilon}_{\bm{X}}\sim\mathbb{N}(\bm{0},I_{n}), and \bm{\varepsilon}_{\bm{Y}}\sim\mathbb{N}(\bm{0},I_{m}). For the experiments on coverage guarantee and CI length, we set n = m = 5 and ran 120 trials for each \Delta\in\{0,1,2,3,4\}. For the experiments on computational cost, we set \Delta = 2 and ran 10 trials for each n = m\in\{50,60,70,80\}. The results are shown in Figure 2. In the left plot, the naive CI cannot properly guarantee the coverage property, while the proposed selective CI does. The results in the middle plot indicate that the larger the true distance between \bm{X} and \bm{Y}, the shorter the selective CI we obtain. The right plot shows that the proposed method also has a reasonable computational cost.

Figure 3: Coverage guarantee, CI length and computational cost in the multi-dimensional case (d=2). Panels: (a) coverage guarantee, (b) CI length, (c) computational cost.

5.1.2 Multi-dimensional case (d=2d=2)

We generated the dataset X = \{\bm{x}_{i,:}\}_{i\in[n]} with \bm{x}_{i,:}\sim\mathbb{N}(\bm{1}_{d},I_{d}) and Y = \{\bm{y}_{j,:}\}_{j\in[m]} with \bm{y}_{j,:}\sim\mathbb{N}(\bm{1}_{d}+\Delta,I_{d}) (element-wise addition). Similar to the univariate case, we set n = m = 5 and ran 120 trials for each \Delta\in\{0,1,2,3,4\} for the experiments on coverage guarantee and CI length, and we set \Delta = 2 and ran 10 trials for each n = m\in\{50,60,70,80\} for the experiments on computational cost. The results are shown in Figure 3. The interpretation of the results is similar to and consistent with the univariate case.

5.1.3 Robustness of the proposed selective CI in terms of coverage guarantee

We additionally demonstrate the robustness of the proposed selective CI in terms of coverage guarantee by considering the following cases:

  • Non-normal noise: we considered noise \bm{\varepsilon}_{\bm{X}} and \bm{\varepsilon}_{\bm{Y}} following the Laplace distribution, the skew normal distribution (skewness coefficient: 10), and the t_{20} distribution.

  • Unknown variance: we considered the case in which the variance of the noises was also estimated from the data.

The datasets \bm{X} and \bm{Y} were generated with \bm{\mu}_{\bm{X}} = \bm{1}_{n} and \bm{\mu}_{\bm{Y}} = \bm{1}_{m} + \Delta. We set n = m = 5 and ran 120 trials for each \Delta\in\{1,2,3,4\}. We confirmed that our selective CI maintained good performance in terms of coverage guarantee. The results are shown in Figure 4.

Figure 4: Robustness of the proposed selective CI in terms of coverage guarantee. Panels: (a) Laplace, (b) skew normal, (c) t_{20}, (d) estimated variance.

5.1.4 Comparison with asymptotic method in Imaizumi et al. (2019)

The authors of Imaizumi et al. (2019) provided us with their code for computing the p-value in the hypothesis testing framework. Therefore, we conducted comparisons in terms of false positive rate (FPR) control and true positive rate (TPR). Although we mainly focused on the CI in the previous sections, the corresponding hypothesis testing problem is defined as follows:

{\rm H}_{0}: \bm{\eta}^{\top}(\bm{\mu}_{\bm{X}}~\bm{\mu}_{\bm{Y}})^{\top} = 0 \quad\text{vs.}\quad {\rm H}_{1}: \bm{\eta}^{\top}(\bm{\mu}_{\bm{X}}~\bm{\mu}_{\bm{Y}})^{\top} \neq 0.

The details are presented in Appendix A. We generated the datasets \bm{X} and \bm{Y} with \bm{\mu}_{\bm{X}} = \bm{1}_{n}, \bm{\mu}_{\bm{Y}} = \bm{1}_{m} + \Delta (element-wise addition), \bm{\varepsilon}_{\bm{X}}\sim\mathbb{N}(\bm{0},I_{n}), and \bm{\varepsilon}_{\bm{Y}}\sim\mathbb{N}(\bm{0},I_{m}). For the FPR experiments, we set \Delta = 0 and ran 120 trials for each n = m\in\{5,10,15,20\}. For the TPR experiments, we set \Delta\in\{1,2,3,4,5\} and ran 120 trials for each n = m\in\{5,10,20\}. The results are shown in Figure 5. In terms of FPR control, both methods successfully controlled the FPR under \alpha = 0.05. However, in terms of TPR, the proposed method outperformed the existing asymptotic one in all the cases. As observed in Figure 5 (a), the existing method is conservative in the sense that its FPR is smaller than the specified significance level \alpha = 0.05. As a consequence of this conservativeness, the power of their method was consistently lower than ours. Such a phenomenon is commonly observed in approximate statistical inference.

Figure 5: Comparisons with the asymptotic method (Imaizumi et al., 2019) in terms of FPR and TPR. Panels: (a) FPR, (b) TPR (n=m=5), (c) TPR (n=m=10), (d) TPR (n=m=20). While both methods can successfully control the FPR under the significance level \alpha = 0.05, the proposed method has higher TPR than the existing asymptotic method in all the cases.

5.2 Real Data Experiment

In this section, we evaluate the proposed selective CI on four real-world datasets. We used the Iris dataset, the Wine dataset, and the Breast Cancer dataset, which are available in the UCI machine learning repository, and the Lung Cancer dataset (Feltes et al., 2019) (we used the dataset Lung_GSE7670, which is available at https://sbcb.inf.ufrgs.br/cumida). The experiments were conducted with the following settings:

  • Setting 1: For each pair of classes in the dataset:

    • Randomly select n = m = 5 instances from each class. Here, each instance is represented by a d-dimensional vector, where d is the number of features.

    • Compute the selective CI.

    • Repeat the above process up to 120 times.

  • Setting 2: Given a dataset with two classes C1 and C2, we either chose n = 5 instances from C1 and m = 5 instances from C2 (X^{\rm{obs}} and Y^{\rm{obs}} are from different classes), or chose both X^{\rm{obs}} and Y^{\rm{obs}} from either C1 or C2 (X^{\rm{obs}} and Y^{\rm{obs}} are from the same class). Then, we computed the selective CI. We repeated this process up to 120 times.

Figure 6: Results on the Iris dataset in the univariate case (d=1). Panels: (a) C1-C2, (b) C1-C3, (c) C2-C3.

5.2.1 Univariate case (d=1d=1) with Setting 1

We conducted the experiments on the Iris dataset, which contains three classes: Iris Setosa (C1), Iris Versicolour (C2), and Iris Virginica (C3). The dataset has four features: sepal length (s-l), sepal width (s-w), petal length (p-l), and petal width (p-w). We ran the procedure described in Setting 1 on each individual feature. The results are shown in Figure 6. In all three plots of this figure, the two features p-l and p-w always have the shortest CI length among the four features, which indicates that these two features are informative for discriminating between the classes. Besides, the results in Figure 6 are also consistent with the results obtained by plotting the histogram of each feature in each class. In other words, the farther apart the two histograms, the smaller the length of the selective CI.

Figure 7: Results on the Iris dataset in the multi-dimensional case (d\in\{2,3\}). Panels: (a) d=2; (b) d=3.

5.2.2 Multi-dimensional case (d\in\{2,3\}) with Setting 1

For the experiments on the Iris dataset in the multi-dimensional case, we chose the two features p-l and p-w when d=2 and additionally included the feature s-l when d=3. The results are shown in Figure 7. In each sub-plot, we show the lengths of the selective CIs together with the corresponding scatter plot, which is used to verify the CI length results. For example, in Figure 7a, the scatter plot shows that the distance between C1 and C2, as well as the distance between C1 and C3, is larger than the distance between C2 and C3. Therefore, the CI lengths of C1-C2 and C1-C3 tend to be smaller than that of C2-C3. We also conducted experiments on the Wine dataset, which contains 3 classes of wine and 13 features. In the case of d=2, we conducted the experiments on each pair of features in the set \{7,12,13\} (feature 7: flavanoids, feature 12: od280/od315 of diluted wines, feature 13: proline). In the case of d=3, we conducted the experiments on all three features. The results are shown in Figure 8. In general, the CI length results are consistent with the scatter plots, i.e., the farther apart the scatter plots of two classes, the shorter the selective CI.

Figure 8: Results on the Wine dataset in the multi-dimensional case (d\in\{2,3\}). Panels: (a) d=2 (features 7 and 12); (b) d=2 (features 7 and 13); (c) d=2 (features 12 and 13); (d) d=3 (features 7, 12, and 13).

5.2.3 Multi-dimensional case with Setting 2

We conducted experiments on the Breast Cancer and Lung Cancer datasets. The Breast Cancer dataset has two classes (malignant and benign) and d=30 features. The Lung Cancer dataset has two classes (normal and adenocarcinoma), and we chose d=1,000 (we selected the top 1,000 genes with the largest standard deviations, as is commonly done in the literature). The results on these datasets under Setting 2 are shown in Figure 9. The results are consistent with intuitive expectation: when X^{\rm obs} and Y^{\rm obs} are from different classes, the Wasserstein distance tends to be larger than when X^{\rm obs} and Y^{\rm obs} are from the same class. Accordingly, the CI for the Wasserstein distance in the different-class case is shorter than that in the same-class case; in other words, the larger the Wasserstein distance, the shorter the CI.
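The gene-filtering step mentioned above (keeping the 1,000 genes with the largest standard deviation) can be sketched as follows; the expression matrix here is a random stand-in with arbitrary dimensions, since we do not reproduce the CuMiDa file format.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.lognormal(size=(60, 5000))   # stand-in expression matrix: samples x genes

def top_k_by_std(matrix, k=1000):
    """Keep the k columns (genes) with the largest standard deviation."""
    idx = np.argsort(matrix.std(axis=0))[::-1][:k]
    return matrix[:, idx], idx

filtered, selected = top_k_by_std(expr, k=1000)
print(filtered.shape)   # (60, 1000)
```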

Figure 9: Results on the Breast Cancer and Lung Cancer datasets in the multi-dimensional case. Panels: (a) Breast Cancer dataset (d=30); (b) Lung Cancer dataset (d=1,000).

6 Conclusion

In this paper, we present an exact (non-asymptotic) statistical inference method for the Wasserstein distance. We first introduce the problem setup and present the proposed method for the univariate case. We then extend it to the multi-dimensional case. Finally, we conduct experiments on both synthetic and real-world datasets to evaluate the performance of our method. To our knowledge, this is the first method that can provide a valid confidence interval (CI) for the Wasserstein distance with a finite-sample coverage guarantee. We believe this study is an important contribution toward reliable ML, which is one of the most critical issues in the ML community.

References

  • Arjovsky et al. [2017] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • Bernton et al. [2017] E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. Inference in generative models using the Wasserstein distance. arXiv preprint arXiv:1701.05146, 1(8):9, 2017.
  • Chen and Bien [2019] S. Chen and J. Bien. Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, pages 1–12, 2019.
  • Del Barrio et al. [1999] E. Del Barrio, J. A. Cuesta-Albertos, C. Matrán, and J. M. Rodríguez-Rodríguez. Tests of goodness of fit based on the L2-Wasserstein distance. Annals of Statistics, pages 1230–1239, 1999.
  • Del Barrio et al. [2019] E. Del Barrio, P. Gordaliza, H. Lescornel, and J.-M. Loubes. Central limit theorem and bootstrap procedure for Wasserstein's variations with an application to structural relationships between distributions. Journal of Multivariate Analysis, 169:341–362, 2019.
  • Duy and Takeuchi [2021] V. N. L. Duy and I. Takeuchi. More powerful conditional selective inference for generalized lasso by parametric programming. arXiv preprint arXiv:2105.04920, 2021.
  • Duy et al. [2020a] V. N. L. Duy, S. Iwazaki, and I. Takeuchi. Quantifying statistical significance of neural network representation-driven hypotheses by selective inference. arXiv preprint arXiv:2010.01823, 2020a.
  • Duy et al. [2020b] V. N. L. Duy, H. Toda, R. Sugiyama, and I. Takeuchi. Computing valid p-value for optimal changepoint by selective inference using dynamic programming. In Advances in Neural Information Processing Systems, 2020b.
  • Evans and Matsen [2012] S. N. Evans and F. A. Matsen. The phylogenetic Kantorovich–Rubinstein metric for environmental sequence samples. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):569–592, 2012.
  • Feltes et al. [2019] B. C. Feltes, E. B. Chandelier, B. I. Grisci, and M. Dorn. CuMiDa: an extensively curated microarray database for benchmarking and testing of machine learning approaches in cancer research. Journal of Computational Biology, 26(4):376–386, 2019.
  • Fithian et al. [2014] W. Fithian, D. Sun, and J. Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
  • Frogner et al. [2015] C. Frogner, C. Zhang, H. Mobahi, M. Araya-Polo, and T. Poggio. Learning with a Wasserstein loss. arXiv preprint arXiv:1506.05439, 2015.
  • Hyun et al. [2018] S. Hyun, K. Lin, M. G’Sell, and R. J. Tibshirani. Post-selection inference for changepoint detection algorithms with application to copy number variation data. arXiv preprint arXiv:1812.03644, 2018.
  • Imaizumi et al. [2019] M. Imaizumi, H. Ota, and T. Hamaguchi. Hypothesis test and confidence analysis with Wasserstein distance with general dimension. arXiv preprint arXiv:1910.07773, 2019.
  • Kolouri et al. [2017] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE signal processing magazine, 34(4):43–59, 2017.
  • Le Duy and Takeuchi [2021] V. N. Le Duy and I. Takeuchi. Parametric programming approach for more powerful and general lasso selective inference. In International Conference on Artificial Intelligence and Statistics, pages 901–909. PMLR, 2021.
  • Lee et al. [2016] J. D. Lee, D. L. Sun, Y. Sun, and J. E. Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
  • Liu et al. [2018] K. Liu, J. Markovic, and R. Tibshirani. More powerful post-selection inference, with application to the lasso. arXiv preprint arXiv:1801.09037, 2018.
  • Loftus and Taylor [2015] J. R. Loftus and J. E. Taylor. Selective inference in regression models with groups of variables. arXiv preprint arXiv:1511.01478, 2015.
  • Murty [1983] K. Murty. Linear Programming. Wiley, 1983. ISBN 9780471097259.
  • Ni et al. [2009] K. Ni, X. Bresson, T. Chan, and S. Esedoglu. Local histogram based segmentation using the Wasserstein distance. International Journal of Computer Vision, 84(1):97–111, 2009.
  • Ramdas et al. [2017] A. Ramdas, N. G. Trillos, and M. Cuturi. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.
  • Sugiyama et al. [2021a] K. Sugiyama, V. N. Le Duy, and I. Takeuchi. More powerful and general selective inference for stepwise feature selection using homotopy method. In International Conference on Machine Learning, pages 9891–9901. PMLR, 2021a.
  • Sugiyama et al. [2021b] R. Sugiyama, H. Toda, V. N. L. Duy, Y. Inatsu, and I. Takeuchi. Valid and exact statistical inference for multi-dimensional multiple change-points by selective inference. arXiv preprint arXiv:2110.08989, 2021b.
  • Suzumura et al. [2017] S. Suzumura, K. Nakagawa, Y. Umezu, K. Tsuda, and I. Takeuchi. Selective inference for sparse high-order interaction models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3338–3347. JMLR. org, 2017.
  • Tanizaki et al. [2020] K. Tanizaki, N. Hashimoto, Y. Inatsu, H. Hontani, and I. Takeuchi. Computing valid p-values for image segmentation by selective inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9553–9562, 2020.
  • Tibshirani et al. [2016] R. J. Tibshirani, J. Taylor, R. Lockhart, and R. Tibshirani. Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620, 2016.
  • Tsukurimichi et al. [2021] T. Tsukurimichi, Y. Inatsu, V. N. L. Duy, and I. Takeuchi. Conditional selective inference for robust regression and outlier detection using piecewise-linear homotopy continuation. arXiv preprint arXiv:2104.10840, 2021.
  • Villani [2009] C. Villani. Optimal transport: old and new, volume 338. Springer, 2009.
  • Yang et al. [2016] F. Yang, R. F. Barber, P. Jain, and J. Lafferty. Selective inference for group-sparse linear models. In Advances in Neural Information Processing Systems, pages 2469–2477, 2016.

Appendix A Proposed method in hypothesis testing framework

We present the proposed method in the setting of hypothesis testing and consider the case where the cost matrix is defined using the squared \ell_2 distance.

Cost matrix. We define the cost matrix C(\bm{X},\bm{Y}) of pairwise distances (squared \ell_2 distances) between elements of \bm{X} and \bm{Y} as

C(\bm{X},\bm{Y}) = \big[(x_{i}-y_{j})^{2}\big]_{ij} \in \mathbb{R}^{n\times m}. \qquad (36)

Then, the vectorized form of C(\bm{X},\bm{Y}) can be defined as

\bm{c}(\bm{X},\bm{Y}) = {\rm vec}(C(\bm{X},\bm{Y})) \in \mathbb{R}^{nm}
= \left[\Omega{\bm{X}\choose\bm{Y}}\right] \circ \left[\Omega{\bm{X}\choose\bm{Y}}\right], \qquad (37)

where \Omega is defined as in (5) and the operation \circ denotes the element-wise product.
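For concreteness, the following sketch builds C(X, Y) and checks (37) numerically; since (5) is not restated here, the \Omega below is one consistent choice (row-major vec ordering), which we assume matches the paper's definition up to the ordering of entries.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
X = rng.standard_normal(n)
Y = rng.standard_normal(m)

# Cost matrix C(X, Y) = [(x_i - y_j)^2]_{ij} in R^{n x m}, as in (36).
C = (X[:, None] - Y[None, :]) ** 2

# One consistent choice of Omega (row-major vec ordering):
# Omega @ [X; Y] stacks the pairwise differences x_i - y_j.
Omega = np.hstack([np.kron(np.eye(n), np.ones((m, 1))),
                   -np.kron(np.ones((n, 1)), np.eye(m))])
z = np.concatenate([X, Y])

c = C.reshape(-1)                                   # vec(C) with row-major ordering
assert np.allclose(c, (Omega @ z) * (Omega @ z))    # matches (37) element-wise
```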

The Wasserstein distance. By solving the LP with the cost vector defined in (37) on the observed data \bm{X}^{\rm obs} and \bm{Y}^{\rm obs}, we obtain the set of selected basic variables

\mathcal{M}_{\rm obs} = \mathcal{M}(\bm{X}^{\rm obs},\bm{Y}^{\rm obs}).

Then, the Wasserstein distance can be re-written as (we denote W = W(P_{n},Q_{m}) for notational simplicity)

W = \hat{\bm{t}}^{\top}\bm{c}(\bm{X}^{\rm obs},\bm{Y}^{\rm obs})
= \hat{\bm{t}}_{\mathcal{M}_{\rm obs}}^{\top}\bm{c}_{\mathcal{M}_{\rm obs}}(\bm{X}^{\rm obs},\bm{Y}^{\rm obs})
= \hat{\bm{t}}_{\mathcal{M}_{\rm obs}}^{\top}\Big[\big[\Omega_{\mathcal{M}_{\rm obs},:}{\bm{X}^{\rm obs}\choose\bm{Y}^{\rm obs}}\big]\circ\big[\Omega_{\mathcal{M}_{\rm obs},:}{\bm{X}^{\rm obs}\choose\bm{Y}^{\rm obs}}\big]\Big].
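As an illustration, a hedged sketch of this computation with scipy's LP solver: the constraint matrix S and vector h below follow the standard transportation-polytope form with uniform weights 1/n and 1/m, which we assume agrees with (7) up to the ordering of rows, and the basic variables are read off as the nonzero entries of the optimal coupling (assuming nondegeneracy).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m = 4, 3
X_obs = rng.standard_normal(n)
Y_obs = rng.standard_normal(m)

# Cost vector c(X, Y): row-major vectorization of the squared cost matrix.
c = ((X_obs[:, None] - Y_obs[None, :]) ** 2).reshape(-1)

# Transportation constraints S t = h for uniform marginals 1/n and 1/m.
S = np.vstack([np.kron(np.eye(n), np.ones(m)),    # row sums of the coupling
               np.kron(np.ones(n), np.eye(m))])   # column sums of the coupling
h = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])

res = linprog(c, A_eq=S, b_eq=h, bounds=(0, None), method="highs")
t_hat = res.x
W = res.fun                               # LP optimal value, i.e., W with squared l2 cost
M_obs = np.flatnonzero(t_hat > 1e-12)     # selected basic variables (optimal coupling)
print(W, M_obs)
```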

Hypothesis testing. Our goal is to test the following hypothesis:

{\rm H}_{0}: \hat{\bm{t}}_{\mathcal{M}_{\rm obs}}^{\top}\Big[\big[\Omega_{\mathcal{M}_{\rm obs},:}{\bm{\mu}_{\bm{X}}\choose\bm{\mu}_{\bm{Y}}}\big]\circ\big[\Omega_{\mathcal{M}_{\rm obs},:}{\bm{\mu}_{\bm{X}}\choose\bm{\mu}_{\bm{Y}}}\big]\Big] = 0.

Unfortunately, it is technically difficult to directly test the above hypothesis. Therefore, we propose to test the following equivalent one:

{\rm H}_{0}: \hat{\bm{t}}_{\mathcal{M}_{\rm obs}}^{\top}\left[\Theta_{\mathcal{M}_{\rm obs},:}{\bm{\mu}_{\bm{X}}\choose\bm{\mu}_{\bm{Y}}}\right] = 0
\;\Leftrightarrow\; {\rm H}_{0}: \bm{\eta}^{\top}{\bm{\mu}_{\bm{X}}\choose\bm{\mu}_{\bm{Y}}} = 0,

where \Theta is defined as in (5) and \bm{\eta} = \Theta_{\mathcal{M}_{\rm obs},:}^{\top}\hat{\bm{t}}_{\mathcal{M}_{\rm obs}}.

Conditional SI. To test the aforementioned hypothesis, we consider the following selective p-value:

p_{\rm selective} = \mathbb{P}_{{\rm H}_{0}}\left(\left|\bm{\eta}^{\top}{\bm{X}\choose\bm{Y}}\right| \geq \left|\bm{\eta}^{\top}{\bm{X}^{\rm obs}\choose\bm{Y}^{\rm obs}}\right| \;\middle|\; \mathcal{E}\right),

where the conditional selection event is defined as

\mathcal{E} = \left\{ \mathcal{M}(\bm{X},\bm{Y}) = \mathcal{M}_{\rm obs},\; \mathcal{S}_{\mathcal{M}_{\rm obs}}(\bm{X},\bm{Y}) = \mathcal{S}_{\mathcal{M}_{\rm obs}}(\bm{X}^{\rm obs},\bm{Y}^{\rm obs}),\; \bm{q}(\bm{X},\bm{Y}) = \bm{q}(\bm{X}^{\rm obs},\bm{Y}^{\rm obs}) \right\}.

Our next task is to identify the conditional data space whose elements satisfy \mathcal{E}.
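Once the truncation set \mathcal{Z} below is available as a union of intervals in z, the selective p-value reduces to a two-sided probability under a truncated Gaussian. The following sketch assumes the standard conditional-SI form (e.g., Lee et al., 2016) with identity covariance, which we believe matches the formula in §3; the function name and interval representation are ours, not fixed by the paper.

```python
import numpy as np
from scipy.stats import norm

def selective_p_value(eta, data, intervals):
    """Two-sided truncated-normal p-value for the statistic eta^T data,
    truncated to `intervals` (list of (lo, hi) pairs describing the set Z
    in terms of z = eta^T data). Assumes identity covariance."""
    sigma = np.sqrt(eta @ eta)            # std of eta^T data when Cov = I
    z_obs = eta @ data

    def trunc_mass(a, b):                 # P(a <= N(0, sigma^2) <= b)
        return norm.cdf(b, scale=sigma) - norm.cdf(a, scale=sigma)

    denom = sum(trunc_mass(a, b) for a, b in intervals)
    numer = (sum(trunc_mass(max(a, abs(z_obs)), b) for a, b in intervals if b > abs(z_obs))
             + sum(trunc_mass(a, min(b, -abs(z_obs))) for a, b in intervals if a < -abs(z_obs)))
    return numer / denom

# toy usage: a single truncation interval containing the observed statistic
eta = np.array([1.0, -1.0, 0.5])
data = np.array([0.3, -0.2, 0.1])
print(selective_p_value(eta, data, intervals=[(-2.0, 2.0)]))
```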

Characterization of the conditional data space. Similar to the discussion in §3, the data is restricted to a line due to the conditioning on the nuisance component \bm{q}(\bm{X},\bm{Y}). Then, the conditional data space is defined as

\mathcal{D} = \Big\{(\bm{X}~\bm{Y})^{\top} = \bm{a} + \bm{b}z \mid z \in \mathcal{Z}\Big\},

where

\mathcal{Z} = \left\{z \in \mathbb{R} \;\Big|\; \mathcal{M}(\bm{a}+\bm{b}z) = \mathcal{M}_{\rm obs},\; \mathcal{S}_{\mathcal{M}_{\rm obs}}(\bm{a}+\bm{b}z) = \mathcal{S}_{\mathcal{M}_{\rm obs}}(\bm{X}^{\rm obs},\bm{Y}^{\rm obs}) \right\}.

The remaining task is to construct \mathcal{Z}. We can decompose \mathcal{Z} into two separate sets as \mathcal{Z} = \mathcal{Z}_{1} \cap \mathcal{Z}_{2}, where

\mathcal{Z}_{1} = \{z \in \mathbb{R} \mid \mathcal{M}(\bm{a}+\bm{b}z) = \mathcal{M}_{\rm obs}\},
\mathcal{Z}_{2} = \{z \in \mathbb{R} \mid \mathcal{S}_{\mathcal{M}_{\rm obs}}(\bm{a}+\bm{b}z) = \mathcal{S}_{\mathcal{M}_{\rm obs}}(\bm{X}^{\rm obs},\bm{Y}^{\rm obs})\}.

The construction of \mathcal{Z}_{2} is as follows (we denote \bm{s}_{\mathcal{M}_{\rm obs}} = \mathcal{S}_{\mathcal{M}_{\rm obs}}(\bm{X}^{\rm obs},\bm{Y}^{\rm obs}) for notational simplicity):

\mathcal{Z}_{2} = \{z \in \mathbb{R} \mid \mathcal{S}_{\mathcal{M}_{\rm obs}}(\bm{a}+\bm{b}z) = \bm{s}_{\mathcal{M}_{\rm obs}}\}
= \left\{z \in \mathbb{R} \mid {\rm sign}\big(\Omega_{\mathcal{M}_{\rm obs},:}(\bm{a}+\bm{b}z)\big) = \bm{s}_{\mathcal{M}_{\rm obs}}\right\}
= \left\{z \in \mathbb{R} \mid \bm{s}_{\mathcal{M}_{\rm obs}} \circ \Omega_{\mathcal{M}_{\rm obs},:}(\bm{a}+\bm{b}z) \geq \bm{0}\right\},

which can be obtained by solving a system of linear inequalities. Next, we present the identification of \mathcal{Z}_{1}. Because we use the squared \ell_2 distance to define the cost matrix, the LP with the parametrized data \bm{a}+\bm{b}z is written as follows:

\min_{\bm{t}\in\mathbb{R}^{nm}} \; \bm{t}^{\top}[\Omega(\bm{a}+\bm{b}z)\circ\Omega(\bm{a}+\bm{b}z)] \quad \text{s.t.} \quad S\bm{t}=\bm{h},\; \bm{t}\geq\bm{0}
\;\Leftrightarrow\; \min_{\bm{t}\in\mathbb{R}^{nm}} \; (\bm{u}+\bm{v}z+\bm{w}z^{2})^{\top}\bm{t} \quad \text{s.t.} \quad S\bm{t}=\bm{h},\; \bm{t}\geq\bm{0},

where

\bm{u} = (\Omega\bm{a})\circ(\Omega\bm{a}),
\bm{v} = (\Omega\bm{a})\circ(\Omega\bm{b}) + (\Omega\bm{b})\circ(\Omega\bm{a}),
\bm{w} = (\Omega\bm{b})\circ(\Omega\bm{b}),

and S and \bm{h} are the same as in (7). By fixing \mathcal{M}_{\rm obs} as the optimal basis index set, the relative cost vector w.r.t. the set of non-basic variables is defined as

\bm{r}_{\mathcal{M}^{c}_{\rm obs}} = \tilde{\bm{u}} + \tilde{\bm{v}}z + \tilde{\bm{w}}z^{2},

where

\tilde{\bm{u}} = \left(\bm{u}_{\mathcal{M}^{c}_{\rm obs}}^{\top} - \bm{u}_{\mathcal{M}_{\rm obs}}^{\top}S_{:,\mathcal{M}_{\rm obs}}^{-1}S_{:,\mathcal{M}^{c}_{\rm obs}}\right)^{\top},
\tilde{\bm{v}} = \left(\bm{v}_{\mathcal{M}^{c}_{\rm obs}}^{\top} - \bm{v}_{\mathcal{M}_{\rm obs}}^{\top}S_{:,\mathcal{M}_{\rm obs}}^{-1}S_{:,\mathcal{M}^{c}_{\rm obs}}\right)^{\top},
\text{ and } \tilde{\bm{w}} = \left(\bm{w}_{\mathcal{M}^{c}_{\rm obs}}^{\top} - \bm{w}_{\mathcal{M}_{\rm obs}}^{\top}S_{:,\mathcal{M}_{\rm obs}}^{-1}S_{:,\mathcal{M}^{c}_{\rm obs}}\right)^{\top}.
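In code, these reduced-cost coefficients amount to a single linear solve against the basis columns of S; because S carries one redundant marginal constraint, the sketch below uses a least-squares solve in place of the explicit inverse S_{:,\mathcal{M}_{\rm obs}}^{-1} written above (an implementation detail we assume).

```python
import numpy as np

def reduced_cost_coefficients(S, M_obs, u, v, w):
    """Compute (u~, v~, w~) so that the reduced costs of the non-basic
    variables are r(z) = u~ + v~ z + w~ z^2, for a given basis index set M_obs."""
    Mc = np.setdiff1d(np.arange(S.shape[1]), M_obs)   # non-basic indices
    B, N = S[:, M_obs], S[:, Mc]

    def reduce_coef(coef):
        # coef_Mc - coef_M^T B^{-1} N, written as coef_Mc - N^T lambda with B^T lambda = coef_M;
        # lstsq is used because S contains one redundant marginal constraint.
        lam, *_ = np.linalg.lstsq(B.T, coef[M_obs], rcond=None)
        return coef[Mc] - N.T @ lam

    return reduce_coef(u), reduce_coef(v), reduce_coef(w)
```

The optimality check for \mathcal{M}_{\rm obs} at a given z then reduces to the sign condition stated next.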

The requirement for \mathcal{M}_{\rm obs} to be the optimal basis index set is \bm{r}_{\mathcal{M}^{c}_{\rm obs}} \geq \bm{0} (i.e., the cost in the minimization problem never decreases when a non-basic variable becomes positive and enters the basis). Finally, the set \mathcal{Z}_{1} can be defined as

\mathcal{Z}_{1} = \{z \in \mathbb{R} \mid \mathcal{M}(\bm{a}+\bm{b}z) = \mathcal{M}_{\rm obs}\}
= \{z \in \mathbb{R} \mid \bm{r}_{\mathcal{M}^{c}_{\rm obs}} = \tilde{\bm{u}} + \tilde{\bm{v}}z + \tilde{\bm{w}}z^{2} \geq \bm{0}\},

which can be obtained by solving a system of quadratic inequalities.
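As a rough sketch of how such a system can be solved in practice, each component inequality \tilde{u}_k + \tilde{v}_k z + \tilde{w}_k z^2 \geq 0 is handled in closed form via its roots and the resulting intervals are intersected; the same routine covers the linear inequalities defining \mathcal{Z}_{2} as the special case \tilde{w}_k = 0. Tolerances and the interval representation are our own choices.

```python
import numpy as np

def feasible_set_1d(u_t, v_t, w_t, tol=1e-10):
    """Intersect the solution sets of the inequalities
    u_t[k] + v_t[k] * z + w_t[k] * z**2 >= 0 over all k,
    returning a list of (lo, hi) intervals. Linear inequalities
    correspond to w_t[k] = 0, constants to v_t[k] = w_t[k] = 0."""
    intervals = [(-np.inf, np.inf)]
    for a0, a1, a2 in zip(u_t, v_t, w_t):
        if abs(a2) < tol and abs(a1) < tol:          # constant inequality
            pieces = [(-np.inf, np.inf)] if a0 >= -tol else []
        elif abs(a2) < tol:                          # linear: a0 + a1 z >= 0
            root = -a0 / a1
            pieces = [(root, np.inf)] if a1 > 0 else [(-np.inf, root)]
        else:                                        # quadratic
            disc = a1 * a1 - 4.0 * a2 * a0
            if disc <= 0:
                pieces = [(-np.inf, np.inf)] if a2 > 0 else []
            else:
                r1, r2 = sorted(np.roots([a2, a1, a0]).real)
                pieces = [(-np.inf, r1), (r2, np.inf)] if a2 > 0 else [(r1, r2)]
        # intersect the running solution set with the pieces of this inequality
        intervals = [(max(lo, plo), min(hi, phi))
                     for lo, hi in intervals for plo, phi in pieces
                     if max(lo, plo) < min(hi, phi)]
    return intervals

# toy usage: 4 - z^2 >= 0, z^2 - 1 >= 0, and 0.5 + z >= 0  ==>  z in [1, 2]
print(feasible_set_1d(u_t=[4.0, -1.0, 0.5], v_t=[0.0, 0.0, 1.0], w_t=[-1.0, 1.0, 0.0]))
```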

Appendix B Experiment on High-dimensional Data

We generated the dataset X=\{\bm{x}_{i,:}\}_{i\in[n]} with \bm{x}_{i,:}\sim\mathbb{N}(\bm{1}_{d},I_{d}) and Y=\{\bm{y}_{j,:}\}_{j\in[m]} with \bm{y}_{j,:}\sim\mathbb{N}(\bm{1}_{d}+\Delta,I_{d}) (element-wise addition). We set n=m=20, \Delta=2, and ran 10 trials for each d\in\{100,150,200,250\}. The results in Figure 10 show that the proposed method still has a reasonable computational time.
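For completeness, a minimal sketch of this high-dimensional data generation together with the squared \ell_2 cost matrix (the selective inference step is then identical to Appendix A); the dimensions follow the smallest setting above.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, m, d, delta = 20, 20, 100, 2.0

X = rng.standard_normal((n, d)) + 1.0           # rows ~ N(1_d, I_d)
Y = rng.standard_normal((m, d)) + 1.0 + delta   # rows ~ N(1_d + delta, I_d)

C = cdist(X, Y, metric="sqeuclidean")           # n x m matrix of squared l2 distances
c = C.reshape(-1)                               # cost vector for the LP, as in Appendix A
print(C.shape)                                  # (20, 20)
```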

Figure 10: Computational time of the proposed method when increasing the dimension d.