
J.M. Altschuler · E. Boix-Adserà
Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology, Cambridge MA 02139.
email: jasonalt@mit.edu
email: eboix@mit.edu

Polynomial-time algorithms for Multimarginal Optimal Transport problems with structure

(Work partially supported by NSF GRFP 1122374, a TwoSigma PhD fellowship, and a Siebel Scholarship.)

Jason M. Altschuler    Enric Boix-Adserà
Abstract

Multimarginal Optimal Transport (MOT) has attracted significant interest due to applications in machine learning, statistics, and the sciences. However, in most applications, the success of MOT is severely limited by a lack of efficient algorithms. Indeed, MOT in general requires exponential time in the number of marginals $k$ and their support sizes $n$. This paper develops a general theory about what “structure” makes MOT solvable in $\mathrm{poly}(n,k)$ time.

We develop a unified algorithmic framework for solving MOT in $\mathrm{poly}(n,k)$ time by characterizing the structure that different algorithms require in terms of simple variants of the dual feasibility oracle. This framework has several benefits. First, it enables us to show that the Sinkhorn algorithm, which is currently the most popular MOT algorithm, requires strictly more structure than other algorithms do to solve MOT in $\mathrm{poly}(n,k)$ time. Second, our framework makes it much simpler to develop $\mathrm{poly}(n,k)$-time algorithms for a given MOT problem. In particular, it is necessary and sufficient to (approximately) solve the dual feasibility oracle—which is much more amenable to standard algorithmic techniques.

We illustrate this ease-of-use by developing $\mathrm{poly}(n,k)$-time algorithms for three general classes of MOT cost structures: (1) graphical structure; (2) set-optimization structure; and (3) low-rank plus sparse structure. For structure (1), we recover the known result that Sinkhorn has $\mathrm{poly}(n,k)$ runtime (h20gm; teh2002unified); moreover, we provide the first $\mathrm{poly}(n,k)$-time algorithms for computing solutions that are exact and sparse. For structures (2)-(3), we give the first $\mathrm{poly}(n,k)$-time algorithms, even for approximate computation. Together, these three structures encompass many—if not most—current applications of MOT.

Keywords: Multimarginal optimal transport · polynomial-time algorithms · implicit linear programming · structured linear programs

MSC: 90C08 · 90C06

1 Introduction

Multimarginal Optimal Transport (MOT) is the problem of linear programming over joint probability distributions with fixed marginal distributions. In this way, MOT generalizes the classical Kantorovich formulation of Optimal Transport from $2$ marginal distributions to an arbitrary number $k\geqslant 2$ of them.

More precisely, an MOT problem is specified by a cost tensor $C$ in the $k$-fold tensor product space $(\mathbb{R}^{n})^{\otimes k}=\mathbb{R}^{n}\otimes\cdots\otimes\mathbb{R}^{n}$, and $k$ marginal distributions $\mu_{1},\dots,\mu_{k}$ in the simplex $\Delta_{n}=\{v\in\mathbb{R}_{\geqslant 0}^{n}:\sum_{i=1}^{n}v_{i}=1\}$. (For simplicity, all $\mu_{i}$ are assumed to have the same support size $n$. Everything in this paper extends in a straightforward way to non-uniform sizes $n_{i}$, where $n^{k}$ is replaced by $\prod_{i=1}^{k}n_{i}$ and $\mathrm{poly}(n,k)$ is replaced by $\mathrm{poly}(\max_{i}n_{i},k)$.) The MOT problem is to compute

$$\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,C\rangle \tag{MOT}$$

where $\mathcal{M}(\mu_{1},\dots,\mu_{k})$ is the “transportation polytope” consisting of all entrywise non-negative tensors $P\in(\mathbb{R}^{n})^{\otimes k}$ satisfying the marginal constraints $\sum_{j_{1},\dots,j_{i-1},j_{i+1},\dots,j_{k}}P_{j_{1},\dots,j_{i-1},j,j_{i+1},\dots,j_{k}}=[\mu_{i}]_{j}$ for all $i\in\{1,\dots,k\}$ and $j\in\{1,\dots,n\}$.
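To make the size of this LP concrete, the following minimal sketch (Python with numpy/scipy; variable names illustrative) solves a tiny instance by materializing the full LP, which has $n^{k}$ variables and $nk$ equality constraints—exactly why this direct approach cannot scale.

```python
# Solve a tiny MOT instance as an explicit LP with scipy.optimize.linprog.
# For n = k = 3 the LP already has n**k = 27 variables; the flattening
# order of C.reshape(-1) matches itertools.product's lexicographic order.
import itertools
import numpy as np
from scipy.optimize import linprog

n, k = 3, 3
rng = np.random.default_rng(0)
C = rng.random((n,) * k)                       # cost tensor, n^k entries
mus = [np.full(n, 1.0 / n) for _ in range(k)]  # uniform marginals

A_eq = np.zeros((n * k, n ** k))
for col, j in enumerate(itertools.product(range(n), repeat=k)):
    for i in range(k):
        A_eq[i * n + j[i], col] = 1.0          # entry j feeds bin j_i of marginal i
b_eq = np.concatenate(mus)

res = linprog(C.reshape(-1), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
P = res.x.reshape((n,) * k)                    # an optimal transport plan
print(res.fun, np.count_nonzero(P > 1e-9))     # vertex solutions have at most
                                               # nk - k + 1 = 7 non-zeros (Lemma 2.1)
```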

This MOT problem has many applications throughout machine learning, computer science, and the natural sciences since it arises in tasks that require “stitching” together aggregate measurements. For instance, applications of MOT include inference from collective dynamics (h20gm; h20partial), information fusion for Bayesian learning (srivastava2018scalable), averaging point clouds (AguCar11; CutDou14), the $n$-coupling problem (ruschendorf2002n), quantile aggregation (makarov1982estimates; ruschendorf1982random), matching for teams (ChiMccNes10; CarEke10), image processing (RabPeyDel11; solomon2015convolutional), random combinatorial optimization (zemel1982polynomial; weiss1986stochastic; Nat21extremal; Agr12; MeiNad79; nadas1979probabilistic; Han86), Distributionally Robust Optimization (Nat18dro; Nat09; MisNat14), simulation of incompressible fluids (Bre08; BenCarNen19), and Density Functional Theory (cotar2013density; buttazzo2012optimal; BenCarNen16).

However, in most applications, the success of MOT is severely limited by the lack of efficient algorithms. Indeed, in general, MOT requires exponential time in the number of marginals $k$ and their support sizes $n$. For instance, applying a linear program solver out-of-the-box takes $n^{\Theta(k)}$ time because MOT is a linear program with $n^{k}$ variables, $n^{k}$ non-negativity constraints, and $nk$ equality constraints. Specialized algorithms in the literature such as the Sinkhorn algorithm yield similar $n^{\Theta(k)}$ runtimes. Such runtimes currently limit the applicability of MOT to tiny-scale problems (e.g., $n=k=10$).

Polynomial-time algorithms for MOT.

This paper develops polynomial-time algorithms for MOT, where here and henceforth “polynomial” means in the number of marginals $k$ and their support sizes $n$—and possibly also $C_{\max}/\varepsilon$ for $\varepsilon$-additive approximation, where $C_{\max}$ is a bound on the entries of $C$.

At first glance, this may seem impossible for at least two “trivial” reasons. One is that it takes exponential time to read the input cost $C$ since it has $n^{k}$ entries. We circumvent this issue by considering costs $C$ with $\mathrm{poly}(n,k)$-size implicit representations, which encompasses essentially all MOT applications. (E.g., in the MOT problems of Wasserstein barycenters, generalized Euler flows, and Density Functional Theory, $C$ has entries $C_{j_{1},\dots,j_{k}}=\sum_{i,i^{\prime}=1}^{k}g_{i,i^{\prime}}(j_{i},j_{i^{\prime}})$ and thus can be implicitly input via the $k^{2}$ functions $g_{i,i^{\prime}}:\{1,\dots,n\}^{2}\to\mathbb{R}$.) A second obvious issue is that it takes exponential time to write the output variable $P$ since it has $n^{k}$ entries. We circumvent this issue by returning solutions $P$ with $\mathrm{poly}(n,k)$-size implicit representations, for instance sparse solutions.
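As a concrete illustration, the following sketch (hypothetical names; random pairwise costs) stores only the $k^{2}$ pairwise functions $g_{i,i^{\prime}}$ from the parenthetical above, replacing the $n^{k}$-entry tensor, and evaluates any single entry of $C$ in $O(k^{2})$ time.

```python
# A poly(n,k)-size implicit representation of a pairwise-decomposable cost:
# C_j = sum_{i,i'} g_{i,i'}(j_i, j_{i'}), stored as k^2 matrices of size n x n.
import numpy as np

n, k = 10, 30
rng = np.random.default_rng(1)
G = rng.random((k, k, n, n))           # G[i, ip, a, b] = g_{i,i'}(a, b)

def cost_entry(j):
    """Evaluate C_{j_1,...,j_k} without materializing the n^k tensor."""
    return sum(G[i, ip, j[i], j[ip]] for i in range(k) for ip in range(k))

j = tuple(rng.integers(0, n, size=k))  # a random multi-index
print(cost_entry(j))
```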

But, of course, circumventing these issues of input/output size is not enough to actually solve MOT in polynomial time. See (AltBoi20hard) for examples of $\mathsf{NP}$-hard MOT problems with costs that have $\mathrm{poly}(n,k)$-size implicit representations.

Remarkably, for several MOT problems, there are specially-tailored algorithms that run in polynomial time—notably, for MOT problems with graphically-structured costs of constant treewidth (h20tree; h20gm; teh2002unified), variational mean-field games (BenCarDi18), computing generalized Euler flows (BenCarCut15), computing low-dimensional Wasserstein barycenters (CarObeOud15; BenCarCut15; AltBoi20bary), and filtering and estimation tasks based on target tracking (h20tree; h20gm; h20incremental; h19hmc; h20partial). However, the number of MOT problems that are known to be solvable in polynomial time is small, and it is unknown if these techniques can be extended to the many other MOT problems arising in applications. This motivates the central question driving this paper:

Are there general “structural properties” that make MOT solvable in $\mathrm{poly}(n,k)$ time?

This paper is conceptually divided into two parts. In the first part of the paper, we develop a unified algorithmic framework for MOT that characterizes the structure required for different algorithms to solve MOT in $\mathrm{poly}(n,k)$ time, in terms of simple variants of the dual feasibility oracle. This enables us to prove that some algorithms can solve MOT problems in polynomial time whenever any algorithm can; whereas the popular SINKHORN algorithm cannot. Moreover, this algorithmic framework makes it significantly easier to design a $\mathrm{poly}(n,k)$-time algorithm for a given MOT problem (when possible) because it now suffices to solve the dual feasibility oracle—and this is much more amenable to standard algorithmic techniques. In the second part of the paper, we demonstrate the ease-of-use of our algorithmic framework by applying it to three general classes of MOT cost structures.

Below, we detail these two parts of the paper in §1.1 and §1.2, respectively.

1.1 Contribution 1: unified algorithmic framework for MOT

In order to understand what structural properties make MOT solvable in polynomial time, we first lay a more general groundwork. The purpose of this is to understand the following fundamental questions:

  • Q1. What are reasonable candidate algorithms for solving structured MOT problems in polynomial time?

  • Q2. What structure must an MOT problem have for these algorithms to have polynomial runtimes?

  • Q3. Is the structure required by a given algorithm more restrictive than the structure required by a different algorithm (or any algorithm)?

  • Q4. How can one check whether this structure occurs for a given MOT problem?

We detail our answers to these four questions below in §1.1.1 to §1.1.4, and then briefly discuss practical tradeoffs beyond polynomial-time solvability in §1.1.5; see Table 1 for a summary. We expect that this general groundwork will prove useful in future investigations of tractable MOT problems.

| Algorithm | Oracle | Runtime | Always applicable? | Exact solution? | Sparse solution? | Practical? |
|---|---|---|---|---|---|---|
| ELLIPSOID | MIN | Theorem 4.1 | Yes | Yes | Yes | No |
| MWU | AMIN | Theorem 4.7 | Yes | No | Yes | Yes |
| SINKHORN | SMIN | Theorem 4.18 | No | No | No | Yes |

Table 1: These MOT algorithms have polynomial runtime except for a bottleneck “oracle”. Each oracle is a simple variant of the dual feasibility oracle for MOT. The number of oracle computations is $\mathrm{poly}(n,k)$ for ELLIPSOID, and $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ for both MWU and SINKHORN. From a theoretical perspective, the most important aspect of an algorithm is whether it can solve MOT in polynomial time if and only if any algorithm can. We show that ELLIPSOID and MWU satisfy this (Theorem 1.1), but SINKHORN does not (Theorem 1.3). From a practical perspective, SINKHORN is the most scalable when applicable. (Code for implementing these algorithms and reproducing all numerical simulations in this paper is provided at https://github.com/eboix/mot.)

1.1.1 Answer to Q1: candidate $\mathrm{poly}(n,k)$-time algorithms

We consider three algorithms for MOT whose exponential runtimes can be isolated into a single bottleneck—and thus can be implemented in polynomial time whenever that bottleneck can. These algorithms are the Ellipsoid algorithm ELLIPSOID (GLSbook), the Multiplicative Weights Update algorithm MWU (young2001sequential), and the natural multidimensional analog of Sinkhorn’s scaling algorithm SINKHORN (BenCarCut15; PeyCut17). SINKHORN is specially tailored to MOT and is currently the predominant algorithm for it. To foreshadow our answer to Q3, the reason that we restrict to these candidate algorithms is: we show that ELLIPSOID and MWU can solve an MOT problem in polynomial time if and only if any algorithm can.

1.1.2 Answer to Q2: structure necessary to run candidate algorithms

These three algorithms only access the cost tensor $C$ through polynomially many calls of their respective bottlenecks. Thus the structure required to implement these candidate algorithms in polynomial time is equivalent to the structure required to implement their respective bottlenecks in polynomial time.

In §4, we show that the bottlenecks of these three algorithms are polynomial-time equivalent to natural analogs of the feasibility oracle for the dual LP to MOT. Namely, given weights $p_{1},\dots,p_{k}\in\mathbb{R}^{n}$, compute

$$\min_{(j_{1},\dots,j_{k})\in\{1,\dots,n\}^{k}}C_{j_{1},\dots,j_{k}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} \tag{1.1}$$

either exactly for ELLIPSOID, approximately for MWU, or with the “min” replaced by a “softmin” for SINKHORN. We call these three tasks the MIN, AMIN, and SMIN oracles, respectively. See Remark 3.4 for the interpretation of these oracles as variants of the dual feasibility oracle.
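For reference, here is a brute-force sketch of these three oracles for an explicitly stored cost tensor (Python with numpy/scipy; the function names mirror the oracle names but are illustrative). It takes $n^{k}$ time, which is exactly the cost that the structured implementations in §1.2 avoid.

```python
# Brute-force reference implementations of the MIN, AMIN, and SMIN oracles
# for a generic cost tensor C with k axes of size n (n^k time). MIN is
# (1.1) exactly; AMIN may return any value within additive eps of MIN
# (brute force is already exact); SMIN replaces min by smin_eta.
import itertools
import numpy as np
from scipy.special import logsumexp

def dual_values(C, p):
    k, n = C.ndim, C.shape[0]
    return np.array([C[j] - sum(p[i][j[i]] for i in range(k))
                     for j in itertools.product(range(n), repeat=k)])

def MIN(C, p):
    return float(dual_values(C, p).min())

def AMIN(C, p, eps):
    return MIN(C, p)          # an exact answer is a valid eps-approximation

def SMIN(C, p, eta):
    return float(-logsumexp(-eta * dual_values(C, p)) / eta)
```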

These three oracles take $n^{k}$ time to implement in general. However, for a wide range of structured cost tensors $C$ they can be implemented in $\mathrm{poly}(n,k)$ time, see §1.2 below. For such structured costs $C$, our oracle abstraction immediately implies that the MOT problem with cost $C$ and any input marginals $\mu_{1},\dots,\mu_{k}$ can be (approximately) solved in polynomial time by any of the three respective algorithms.

Our characterization of the algorithms’ bottlenecks as variations of the dual feasibility oracle has two key benefits—which are the answers to Q3 and Q4, described below.

1.1.3 Answer to Q3: characterizing what MOT problems each algorithm can solve

A key benefit of our characterization of the algorithms’ bottlenecks as variants of the dual feasibility oracle is that it enables us to establish whether the structure required by a given MOT algorithm is more restrictive than the structure required by a different algorithm (or by any algorithm).

In particular, this enables us to answer the natural question: why restrict to just the three algorithms described above? Can other algorithms solve MOT in $\mathrm{poly}(n,k)$ time in situations where these algorithms cannot? Critically, the answer is no: restricting ourselves to the ELLIPSOID and MWU algorithms comes at no loss of generality.

Theorem 1.1 (Informal statement of part of Theorems 4.1 and 4.7).

For any family of costs $C\in(\mathbb{R}^{n})^{\otimes k}$:

  • ELLIPSOID computes an exact solution for MOT in $\mathrm{poly}(n,k)$ time if and only if any algorithm can.

  • MWU computes an $\varepsilon$-approximate solution for MOT in $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ time if and only if any algorithm can.

The statement for ELLIPSOID is implicit from classical results about LP (GLSbook) combined with arguments from (AltBoi20bary); see the related work section §1.3. The statement for MWU is new to this paper.

The oracle abstraction helps us show Theorem 1.1 because it reduces this question of what structure is needed for the algorithms to solve MOT in polynomial time, to the question of what structure is needed to solve their respective bottlenecks in polynomial time. Thus Theorem 1.1 is a consequence of the following result. (The “if” part of this result is a contribution of this paper; the “only if” part was shown in (AltBoi20hard).)

Theorem 1.2 (Informal statement of part of Theorems 4.1 and 4.7).

For any family of costs $C\in(\mathbb{R}^{n})^{\otimes k}$:

  • MOT can be exactly solved in $\mathrm{poly}(n,k)$ time if and only if MIN can.

  • MOT can be $\varepsilon$-approximately solved in $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ time if and only if AMIN can.

Interestingly, a further consequence of our unified algorithm-to-oracle abstraction is that it enables us to show that SINKHORN—which is currently the most popular algorithm for MOT by far—requires strictly more structure to solve an MOT problem than other algorithms require. This is in sharp contrast to the complete generality of the other two algorithms shown in Theorem 1.1.

Theorem 1.3 (Informal statement of Theorem 4.19).

Under standard complexity-theoretic assumptions, there exists a family of MOT problems that can be solved exactly in $\mathrm{poly}(n,k)$ time using ELLIPSOID; however, it is impossible to implement a single iteration of SINKHORN (even approximately) in $\mathrm{poly}(n,k)$ time.

The reason that our unified algorithm-to-oracle abstraction helps us show Theorem 1.3 is that it puts SINKHORN on equal footing with the other two classical algorithms in terms of their reliance on variants of the dual feasibility oracle. This reduces proving Theorem 1.3 to showing the following separation between the SMIN oracle and the other two oracles.

Theorem 1.4 (Informal statement of Lemma 3.7).

Under standard complexity-theoretic assumptions, there exists a family of cost tensors $C\in(\mathbb{R}^{n})^{\otimes k}$ such that there are $\mathrm{poly}(n,k)$-time algorithms for MIN and AMIN; however, it is impossible to solve SMIN (even approximately) in $\mathrm{poly}(n,k)$ time.

1.1.4 Answer to Q4: ease-of-use for checking if MOT is solvable in polynomial time

The second key benefit of this oracle abstraction is that it makes it easy to show that a given MOT problem (whose cost $C$ is input implicitly through some concise representation) is solvable in polynomial time: without loss of generality, it suffices to solve any of the three corresponding oracles in polynomial time. The upshot is that these oracles are more directly amenable to standard algorithmic techniques since they are phrased as more conventional combinatorial-optimization problems. In the second part of the paper, we illustrate this ease-of-use via applications to three general classes of structured MOT problems; for an overview see §1.2.

1.1.5 Practical algorithmic tradeoffs beyond polynomial-time solvability

From a theoretical perspective, the most important aspect of an algorithm is whether it can solve MOT in polynomial time if and only if any algorithm can. As we have discussed, this is true for ELLIPSOID and MWU (Theorem 1.1) but not for SINKHORN (Theorem 1.3). Nevertheless, for a wide range of MOT cost structures, all three oracles can be implemented in polynomial time, which means that all three algorithms ELLIPSOID, MWU, and SINKHORN can be implemented in polynomial time. Which algorithm is best in practice depends on the relative importance of the following considerations for the particular application.

  • Error. ELLIPSOID computes exact solutions, whereas MWU and SINKHORN only compute low-precision solutions due to $\mathrm{poly}(1/\varepsilon)$ runtime dependence.

  • Solution sparsity. ELLIPSOID and MWU output solutions with polynomially many non-zero entries (roughly $nk$), whereas SINKHORN outputs fully dense solutions with $n^{k}$ non-zero entries (through a polynomial-size implicit representation, see §4.3). Solution sparsity enables interpretability, visualization, and efficient downstream computation—benefits which are helpful in diverse applications, for example ranging from computer graphics (solomon2015convolutional; blondel2018smooth; pitie2007automated) to facility location problems (anderes2016discrete) to machine learning (AltBoi20bary; flamary2016optimal) to ecological inference (muzellec2017tsallis) to fluid dynamics (see §5.3), and more. Furthermore, in §7.4, we show that sparse solutions for MOT (a.k.a. linear optimization over the transportation polytope) enable efficiently solving certain non-linear optimization problems over the transportation polytope.

  • Practical runtime. Although all three algorithms enjoy polynomial runtime guarantees, the polynomials are smaller for some algorithms than for others. In particular, SINKHORN has remarkably good scalability in practice as long as the error $\varepsilon$ is not too small and its bottleneck oracle SMIN is practically implementable. By Theorems 1.1 and 1.3, MWU can solve strictly more MOT problems in polynomial time than SINKHORN; however, it is less scalable in practice when both MWU and SINKHORN can be implemented. ELLIPSOID is not practical and is used solely as a proof of concept that problems are tractable to solve exactly; in practice, we use Column Generation (see, e.g., (BerTsi97, §6.1)) rather than ELLIPSOID as it has better empirical performance, yet still has the same bottleneck oracle MIN, see §4.1.3. Column Generation is not as practically scalable as SINKHORN in $n$ and $k$ but has the benefit of computing exact, sparse solutions.

To summarize: which algorithm is best in practice depends on the application. For example, Column Generation produces the qualitatively best solutions for the fluid dynamics application in §5.3, SINKHORN is the most scalable for the risk estimation application in §7.3, and MWU is the most scalable for the network reliability application in §6.3 (for that application there is no known implementation of SINKHORN that is practically efficient).

1.2 Contribution 2: applications to general classes of structured MOT problems

In the second part of the paper, we illustrate the algorithmic framework developed in the first part of the paper by applying it to three general classes of MOT cost structures:

  1. Graphical structure (in §5).

  2. Set-optimization structure (in §6).

  3. Low-rank plus sparse structure (in §7).

Specifically, if the cost $C$ is structured in any of these three ways, then MOT can be (approximately) solved in $\mathrm{poly}(n,k)$ time for any input marginals $\mu_{1},\dots,\mu_{k}$.

Previously, it was known how to solve MOT problems with structure (1) using SINKHORN (h20gm; teh2002unified), but this only computes solutions that are dense (with $n^{k}$ non-zero entries) and low-precision (due to $\mathrm{poly}(1/\varepsilon)$ runtime dependence). We therefore provide the first solutions that are sparse and exact for structure (1). For structures (2) and (3), we provide the first polynomial-time algorithms, even for approximate computation. These three structures are incomparable: it is in general not possible to model a problem falling under any of the three structures in a non-trivial way using any of the others; for details see Remarks 6.7 and 7.3. This means that the new structures (2) and (3) enable capturing a wide range of new applications.

Below, we detail these structures individually in §1.2.1, §1.2.2, and §1.2.3. See Table 2 for a summary.

| Structure | Definition | Complexity measure | Polynomial-time algorithm? (Approximate) | Polynomial-time algorithm? (Exact) |
|---|---|---|---|---|
| Graphical (§5) | $C_{\vec{j}}=\sum_{S\in\mathcal{S}}f_{S}(\vec{j}_{S})$ | treewidth | Known (teh2002unified; h20gm) | Corollary 5.6 |
| Set-optimization (§6) | $C_{\vec{j}}=\mathds{1}[\vec{j}\notin S]$ | optimization oracle over $S$ | Corollary 6.9 | Corollary 6.9 |
| Low-rank + sparse (§7) | $C=R+S$ | rank of $R$, sparsity of $S$ | Corollary 7.5 | Unknown |

Table 2: In the second part of the paper, we illustrate the ease-of-use of our algorithmic framework by applying it to three general classes of MOT cost structures. These structures encompass many—if not most—current applications of MOT.

1.2.1 Graphical structure

In §5, we apply our algorithmic framework to MOT problems with graphical structure, a broad class of MOT problems that have been previously studied (h20gm; h20tree; teh2002unified). Briefly, an MOT problem has graphical structure if its cost tensor $C$ decomposes as

$$C_{j_{1},\dots,j_{k}}=\sum_{S\in\mathcal{S}}f_{S}(\vec{j}_{S}),$$

where the $f_{S}(\vec{j}_{S})$ are arbitrary “local interactions” that depend only on tuples $\vec{j}_{S}:=\{j_{i}\}_{i\in S}$ of the $k$ variables.

In order to derive efficient algorithms, it is necessary to restrict how local the interactions are, because otherwise MOT is $\mathsf{NP}$-hard (even if all interaction sets $S\in\mathcal{S}$ have size $2$) (AltBoi20hard). We measure the locality of the interactions via the standard complexity measure of the “treewidth” of the associated graphical model. See §5.1 for formal definitions. While the runtimes of our algorithms (and all previous algorithms) depend exponentially on the treewidth, we emphasize that the treewidth is a very small constant (either $1$ or $2$) in all current applications of MOT falling under this framework; see the related work section.

We show that for MOT cost tensors that have graphical structure of constant treewidth, all three oracles can be implemented in $\mathrm{poly}(n,k)$ time. We accomplish this by leveraging the known connection between graphically structured MOT and graphical models shown in (h20gm). In particular, the MIN, AMIN, and SMIN oracles are respectively equivalent to the mode, approximate mode, and log-partition function of an associated graphical model. Thus we can implement all oracles in $\mathrm{poly}(n,k)$ time by simply applying classical algorithms from the graphical models community (KolFri09; wainwright2008graphical).
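As a small worked example of this connection, the following sketch (Python with numpy; names illustrative, not from the paper's codebase) implements the MIN oracle for the simplest graphical structure, a path (treewidth $1$) with cost $C_{\vec{j}}=\sum_{i=1}^{k-1}f_{i}(j_{i},j_{i+1})$: the minimum in (1.1) is computed by min-sum dynamic programming in $O(kn^{2})$ time instead of $n^{k}$. The general constant-treewidth case would use junction trees instead.

```python
# MIN oracle for a path-structured (treewidth-1) cost via min-sum DP.
# f: list of k-1 local-interaction matrices, f[i][a, b] = f_i(a, b);
# p: list of k dual weight vectors, as in (1.1).
import numpy as np

def min_oracle_path(f, p):
    k = len(p)
    m = -p[0]                     # m[b] = best partial value with last variable = b
    for i in range(k - 1):
        # extend by one variable: add f_i and -p_{i+1}, minimize out the old one
        m = np.min(m[:, None] + f[i] - p[i + 1][None, :], axis=0)
    return float(m.min())

n, k = 4, 6
rng = np.random.default_rng(0)
f = [rng.random((n, n)) for _ in range(k - 1)]
p = [rng.random(n) for _ in range(k)]
print(min_oracle_path(f, p))      # equals brute force over all n**k tuples
```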

Theorem 1.5 (Informal statement of Theorem 5.5).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have graphical structure of constant treewidth. Then the MIN, AMIN, and SMIN oracles can be computed in $\mathrm{poly}(n,k)$ time.

It is an immediate corollary of Theorem 1.5 and our algorithms-to-oracles reduction described in §1.1 that one can implement ELLIPSOID, MWU, and SINKHORN in polynomial time. Below, we record the theoretical guarantee of ELLIPSOID since it is the best of the three algorithms as it computes exact, sparse solutions.

Theorem 1.6 (Informal statement of Corollary 5.6).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have graphical structure of constant treewidth. Then an exact, sparse solution for MOT can be computed in $\mathrm{poly}(n,k)$ time.

Previously, it was known how to solve such MOT problems (teh2002unified; h20gm) using SINKHORN, but this only computes a solution that is fully dense (with $n^{k}$ non-zero entries) and low-precision (due to $\mathrm{poly}(1/\varepsilon)$ runtime dependence); details are in the related work section. Our result improves over this state-of-the-art algorithm by producing solutions that are exact and sparse in $\mathrm{poly}(n,k)$ time.

In §5.3, we demonstrate the benefit of Theorem 1.6 on the application of computing generalized Euler flows, which was historically the motivation of MOT and has received significant attention, e.g., (BenCarCut15; BenCarNen19; Bre08; Bre89; Bre93; Bre99). While there is a specially-tailored version of the SINKHORN algorithm for this problem that runs in polynomial time (BenCarCut15; BenCarNen19), it produces solutions that are approximate and fully dense. Our algorithm produces exact, sparse solutions which lead to sharp visualizations rather than blurry ones (see Figure 10).

1.2.2 Set-optimization structure

In §6, we apply our algorithmic framework to MOT problems whose cost tensors $C$ take value $0$ or $1$ in each entry. That is, costs $C$ of the form

$$C_{j_{1},\dots,j_{k}}=\mathds{1}[(j_{1},\dots,j_{k})\notin S],$$

for some subset $S\subseteq[n]^{k}$. Such MOT problems arise naturally in applications where one seeks to minimize the probability that some event $S$ occurs, given marginal probabilities on each variable $j_{i}$; see Example 6.1.

In order to derive efficient algorithms, it is necessary to restrict the (otherwise arbitrary) set $S$. We parametrize the complexity of such MOT problems via the complexity of finding the minimum-weight object in $S$. This opens the door to combinatorial applications of MOT because finding the minimum-weight object in $S$ is well-known to be polynomial-time solvable for many “combinatorially-structured” sets $S$ of interest—e.g., the set $S$ of cuts in a graph, or the set $S$ of independent sets in a matroid.
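To make this reduction concrete: for $C_{\vec{j}}=\mathds{1}[\vec{j}\notin S]$ and weights $w_{i}:=-p_{i}$, the value of (1.1) at $\vec{j}$ is $\sum_{i}[w_{i}]_{j_{i}}$ if $\vec{j}\in S$ and $1+\sum_{i}[w_{i}]_{j_{i}}$ otherwise, and a short case analysis shows that the overall minimum equals $\min\big(\min_{\vec{j}\in S}\sum_{i}[w_{i}]_{j_{i}},\;1+\min_{\vec{j}\in[n]^{k}}\sum_{i}[w_{i}]_{j_{i}}\big)$, where the second minimum is unconstrained and hence separable. The sketch below is an illustrative reading of this reduction (not necessarily the exact construction in §6), with a hypothetical problem-specific subroutine min_weight_in_S, e.g., a min-cut solver when $S$ is the set of cuts of a graph.

```python
# Sketch of the MIN oracle (1.1) for a set-optimization cost
# C_j = 1[j not in S], assuming access to a min-weight-object oracle for S.
# Identity used (case analysis sketched in the text above):
#     MIN = min( min_{j in S} sum_i w_i[j_i],  1 + min_{j} sum_i w_i[j_i] )
# where w_i = -p_i; the second term is separable across coordinates.
import numpy as np

def min_oracle_set(p, min_weight_in_S):
    w = [-pi for pi in p]                       # weights from dual potentials
    m_S = min_weight_in_S(w)                    # problem-specific oracle call
    m_all = sum(float(wi.min()) for wi in w)    # unconstrained, poly(n,k) time
    return min(m_S, 1.0 + m_all)
```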

We show that for MOT cost tensors with this structure, all three oracles can be implemented efficiently.

Theorem 1.7 (Informal statement of Theorem 6.8).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have set-optimization structure. Then the MIN, AMIN, and SMIN oracles can be computed in $\mathrm{poly}(n,k)$ time.

It is an immediate corollary of Theorem 1.7 and our algorithms-to-oracles reduction described in §1.1 that one can implement ELLIPSOID, MWU, and SINKHORN in polynomial time. Below, we record the theoretical guarantee for ELLIPSOID since it is the best of these three algorithms as it computes exact, sparse solutions.

Theorem 1.8 (Informal statement of Corollary 6.9).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have set-optimization structure. Then an exact, sparse solution for MOT can be computed in $\mathrm{poly}(n,k)$ time.

This is the first polynomial-time algorithm for this class of MOT problems. We note that a more restrictive class of MOT problems was studied in (zemel1982polynomial) under the additional restriction that $S$ is upwards-closed.

In §6.3, we show how this general class of set-optimization structure captures, for example, the classical application of computing the extremal reliability of a network with stochastic edge failures. Network reliability is a fundamental topic in network science and engineering (gertsbakh2011network; ball1995network; ball1986computational) which is often studied in an average-case setting where each edge fails independently with some given probability (moore1956reliable; karger2001randomized; valiant1979complexity; provan1983complexity). The application investigated here is a robust notion of network reliability in which edge failures may be maximally correlated (e.g., by an adversary) or minimally correlated (e.g., by a network maintainer) subject to a marginal constraint on each edge’s failure probability, a setting that dates back to the 1980s (zemel1982polynomial; weiss1986stochastic). We show how to express both the minimally and maximally correlated network reliability problems as MOT problems with set-optimization structure, recovering as a special case of our general framework the known polynomial-time algorithms in (zemel1982polynomial; weiss1986stochastic) as well as more practical polynomial-time algorithms that scale to input sizes that are an order-of-magnitude larger.

1.2.3 Low-rank and sparse structure

In §7, we apply our algorithmic framework to MOT problems whose cost tensors $C$ decompose as

$$C=R+S,$$

where $R$ is a constant-rank tensor, and $S$ is a polynomially-sparse tensor. We assume that $R$ is represented in factored form, and that $S$ is represented through its non-zero entries, which overall yields a $\mathrm{poly}(n,k)$-size representation of $C$.
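Concretely, one natural form of this representation is sketched below (hypothetical names): the rank-$r$ part is stored as $k$ factor matrices $U_{i}\in\mathbb{R}^{n\times r}$ with $R_{\vec{j}}=\sum_{l=1}^{r}\prod_{i=1}^{k}U_{i}[j_{i},l]$, and the sparse part as a dictionary of its non-zero entries, so any single entry of $C$ is evaluated in $O(kr)$ time plus a dictionary lookup.

```python
# Implicit representation of a low-rank plus sparse cost C = R + S.
import numpy as np

n, k, r = 10, 20, 2
rng = np.random.default_rng(2)
U = [rng.random((n, r)) for _ in range(k)]      # factored rank-r part
S = {tuple(rng.integers(0, n, size=k)): 5.0}    # sparse part: one non-zero

def cost_entry(j):
    rows = np.array([U[i][j[i]] for i in range(k)])  # shape (k, r)
    R_j = rows.prod(axis=0).sum()                    # sum of r rank-1 products
    return R_j + S.get(tuple(j), 0.0)

print(cost_entry(tuple(rng.integers(0, n, size=k))))
```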

We show that for MOT cost tensors with low-rank plus sparse structure, the AMIN and SMIN oracles can be implemented in polynomial time. (It is an interesting open question if the MIN oracle can similarly be implemented in $\mathrm{poly}(n,k)$ time. This would enable implementing ELLIPSOID in $\mathrm{poly}(n,k)$ time by our algorithms-to-oracles reduction, and thus would enable computing exact solutions for this class of MOT problems (cf. Theorem 1.10). This may be of independent interest because, by taking all oracle inputs $p_{i}=0$ in (1.1), it generalizes the previously open problem of approximately computing the smallest entry of a constant-rank tensor with $n^{k}$ entries in $\mathrm{poly}(n,k)$ time.)

Theorem 1.9 (Informal statement of Theorem 7.4).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have low-rank plus sparse structure. Then the AMIN and SMIN oracles can be computed in $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ time.

It is an immediate corollary of Theorem 1.9 and our algorithms-to-oracles reduction described in §1.1 that one can implement MWU and SINKHORN in polynomial time. Of these two algorithms, MWU computes sparse solutions, yielding the following theorem.

Theorem 1.10 (Informal statement of Corollary 7.5).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have low-rank plus sparse structure. Then a sparse, $\varepsilon$-approximate solution for MOT can be computed in $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ time.

This is the first polynomial-time result for this class of MOT problems. We note that the runtime of our MOT algorithm depends exponentially on the rank $r$ of $R$, which is why we take $r$ to be constant. Nevertheless, such a restriction on the rank is unavoidable: unless $\mathsf{P}=\mathsf{NP}$, there does not exist an algorithm with runtime that is jointly polynomial in $n$, $k$, and the rank $r$ (AltBoi20hard).

We demonstrate this polynomial-time algorithm concretely on two applications. First, in §7.3 we consider the risk estimation problem of computing an investor’s expected profit in the worst-case over all future prices that are consistent with given marginal distributions. We show that this is equivalent to an MOT problem with a low-rank tensor and thereby provide the first efficient algorithm for it.

Second, in §7.4, we consider the fundamental problem of projecting a joint distribution $Q$ onto the transportation polytope. We provide the first polynomial-time algorithm for solving this when $Q$ decomposes into a constant-rank and sparse component, which models mixtures of product distributions with polynomially many corruptions. This application illustrates the versatility of our algorithmic results beyond polynomial-time solvability of MOT, since this projection problem is a quadratic optimization over the transportation polytope rather than linear optimization (a.k.a. MOT). In order to achieve this, we develop a simple quadratic-to-linear reduction tailored to this problem that crucially exploits the sparsity of the MOT solutions enabled by the MWU algorithm.

1.3 Related work

1.3.1 MOT algorithms

MOT algorithms fall into two categories. One category consists of general-purpose algorithms that do not depend on the specific MOT cost. For example, this includes running an LP solver out-of-the-box, or running the Sinkhorn algorithm where in each iteration one sums over all $n^{k}$ entries of the cost tensor to implement the marginalization bottleneck (LinHoJor19; Fri20; tupitsa2020multimarginal). These approaches are robust in the sense that they do not need to be changed based on the specific MOT problem. However, they are impractical beyond tiny input sizes (e.g., $n=k=10$) because their runtimes scale as $n^{\Omega(k)}$.

The second category consists of algorithms that are much more scalable but require extra structure of the MOT problem. Specifically, these are algorithms that somehow exploit the structure of the relevant cost tensor $C$ in order to (approximately) solve an MOT problem in $\mathrm{poly}(n,k)$ time (teh2002unified; h20gm; h20tree; BenCarDi18; h20incremental; h19hmc; h20partial; CarObeOud15; Nen16; BenCarCut15; BenCarNen19; AltBoi20bary; Nat18dro; Nat09; MisNat14; zemel1982polynomial; weiss1986stochastic; Nat21extremal; Agr12; MeiNad79; nadas1979probabilistic; Han86). Such a $\mathrm{poly}(n,k)$ runtime is far more tractable—but it is not well understood for which MOT problems such a runtime is possible. The purpose of this paper is to clarify this question.

Contextualizing our answer to this question within the rapidly growing literature requires further splitting this second category of algorithms.

Sinkhorn algorithm.

Currently, the predominant approach in the second category is to solve an entropically regularized version of MOT with the Sinkhorn algorithm, a.k.a. Iterative Proportional Fitting, Iterative Bregman Projections, the RAS algorithm, or the Iterative Scaling algorithm; see, e.g., (teh2002unified; BenCarDi18; BenCarNen16; BenCarNen19; Nen16; h20tree; h20gm). Recent work has shown that a polynomial number of iterations of this algorithm suffices (LinHoJor19; Fri20; tupitsa2020multimarginal). However, the bottleneck is that each iteration requires $n^{k}$ operations in general because it requires marginalizing a tensor with $n^{k}$ entries. The critical question is therefore: what structure of an MOT problem enables implementing this marginalization bottleneck in polynomial time?
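To make this bottleneck concrete, here is a minimal sketch of multimarginal SINKHORN for an explicitly stored cost tensor (Python with numpy; names and the cyclic update order are illustrative). The line computing m_i is the marginalization bottleneck: it touches all $n^{k}$ entries, and structured implementations replace exactly this step with a $\mathrm{poly}(n,k)$-time computation.

```python
# Multimarginal Sinkhorn scaling for an explicit cost tensor (sketch).
# Maintains one positive scaling vector per marginal; each iteration
# rescales one of them so that the corresponding marginal is matched.
import numpy as np

def scaled(K, d):
    """P = K with axis ax multiplied entrywise by d[ax], for every ax."""
    P = K.copy()
    for ax in range(K.ndim):
        shape = [-1 if a == ax else 1 for a in range(K.ndim)]
        P *= d[ax].reshape(shape)
    return P

def sinkhorn_mot(C, mus, eta=50.0, iters=200):
    k, n = C.ndim, C.shape[0]
    K = np.exp(-eta * C)                       # Gibbs kernel
    d = [np.ones(n) for _ in range(k)]
    for t in range(iters):
        i = t % k
        P = scaled(K, d)                       # n^k entries ...
        m_i = P.sum(axis=tuple(a for a in range(k) if a != i))  # ... bottleneck
        d[i] *= mus[i] / m_i                   # match the i-th marginal
    return scaled(K, d)

n, k = 3, 3
rng = np.random.default_rng(3)
P = sinkhorn_mot(rng.random((n,) * k), [np.full(n, 1.0 / n)] * k)
print(P.sum(axis=(1, 2)))                      # approximately the first marginal
```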

This paper makes two contributions to this question. First, we identify new broad classes of MOT problems for which this bottleneck can be implemented in polynomial time, and thus SINKHORN can be implemented in polynomial time (see §1.2). Second, we propose other algorithms that require strictly less structure than SINKHORN does in order to solve an MOT problem in polynomial time (Theorem 4.19).

Ellipsoid algorithm.

The Ellipsoid algorithm is among the most classical algorithms for implicit LP (GLSbook; GroLovSch81; khachiyan1980polynomial); however, it has taken a back seat to the SINKHORN algorithm in the vast majority of the MOT literature.

In §4.1, we make explicit the fact that the variant of ELLIPSOID from (AltBoi20bary) can solve MOT exactly in $\mathrm{poly}(n,k)$ time if and only if any algorithm can (Theorem 4.1). This is implicit from combining several known results (AltBoi20hard; AltBoi20bary; GLSbook). In the process of making this result explicit, we exploit the special structure of the MOT LP to significantly simplify the reduction from the dual violation oracle to the dual feasibility oracle. The previously known reduction is highly impractical as it requires an indirect “back-and-forth” use of the Ellipsoid algorithm (GLSbook, page 107). In contrast, our reduction is direct and simple; this is critical for implementing our practical alternative to ELLIPSOID, namely COLGEN, with the dual feasibility oracle.

Multiplicative Weights Update algorithm.

This algorithm, first introduced by (young2001sequential), has been studied in the context of optimal transport when $k=2$ (BlaJamKenSid18; Qua18), in which case implicit LP is not necessary for a polynomial runtime. MWU lends itself to implicit LP (young2001sequential), but is notably absent from the MOT literature.

In §4.2, we show that MWU can be applied to MOT in polynomial time if and only if the approximate dual feasibility oracle can be solved in polynomial time. To do this, we show that in the special case of MOT, the well-known “softmax-derivative” bottleneck of MWU is polynomial-time equivalent to the approximate dual feasibility oracle. Since it is known that the approximate dual feasibility oracle is polynomial-time reducible to approximate MOT (AltBoi20hard), we therefore establish that MWU can solve MOT approximately in polynomial time if and only if any algorithm can (Theorem 4.7).

1.3.2 Graphically structured MOT problems with constant treewidth

We isolate here graphically structured costs with constant treewidth because this framework encompasses all MOT problems that were previously known to be tractable in polynomial time (teh2002unified; h20gm), with the exceptions of the fixed-dimensional Wasserstein barycenter problem and MOT problems related to combinatorial optimization—both of which are described below in §1.3.3. This family of graphically structured costs with treewidth $1$ (a.k.a. “tree-structured costs” (h20tree)) includes applications in economics such as variational mean-field games (BenCarDi18), interpolating histograms on trees (akagi2020probabilistic), and matching for teams (CarObeOud15; Nen16), as well as applications in filtering and estimation for collective dynamics such as target tracking (h20tree; h20gm; h20incremental; h19hmc; h20partial) and Wasserstein barycenters in the case of fixed support (h20partial; Nen16; BenCarCut15; CarObeOud15). With treewidth $2$, this family of costs also includes dynamic multi-commodity flow problems (haasler2021scalable), as well as the application of computing generalized Euler flows in fluid dynamics (BenCarCut15; BenCarNen19; Nen16), which was historically the original motivation of MOT (Bre89; Bre93; Bre99; Bre08).

Previous polynomial-time algorithms for graphically structured MOT compute approximate, dense solutions.

Implementing SINKHORN for graphically structured MOT problems by using belief propagation to efficiently implement the marginalization bottleneck was first proposed twenty years ago in (teh2002unified). There have been recent advancements in understanding connections of this algorithm to the Schrödinger bridge problem in the case of trees (h20tree), as well as developing more practically efficient single-loop variations (h20gm).

All of these works prove theoretical runtime guarantees only in the case of tree structure (i.e., treewidth $1$). However, this graphical model perspective for efficiently implementing SINKHORN readily extends to any constant treewidth: simply implement the marginalization bottleneck using junction trees. This, combined with the iteration complexity of SINKHORN, which is known to be polynomial (LinHoJor19; Fri20; tupitsa2020multimarginal), immediately yields an overall polynomial runtime. This is why we cite (teh2002unified; h20gm) throughout this paper regarding the fact that SINKHORN can be implemented in polynomial time for graphical structure with any constant treewidth.

While the use of SINKHORN for graphically structured MOT is mathematically elegant and can be impressively scalable in practice, it has two drawbacks. The first drawback of this algorithm is that it computes (implicit representations of) solutions that are fully dense with $n^{k}$ non-zero entries. Indeed, it is well-known that SINKHORN finds the unique optimal solution to the entropically regularized MOT problem $\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,C\rangle-\eta^{-1}H(P)$, and that this solution is fully dense (PeyCut17). For example, in the simple case of cost $C=0$, uniform marginals $\mu_{i}$, and any strictly positive regularization parameter $\eta>0$, this solution $P$ has value $1/n^{k}$ in each entry.

The second drawback of this algorithm is that it only computes solutions that are low-precision due to $\mathrm{poly}(1/\varepsilon)$ runtime dependence on the accuracy $\varepsilon$. This is because the number of SINKHORN iterations is known to scale polynomially in the entropic regularization parameter $\eta$ even in the matrix case $k=2$ (linial1998deterministic, §1.2), and it is known that $\eta=\Omega(\varepsilon^{-1}k\log n)$ is necessary for the converged solution of SINKHORN to be an $\varepsilon$-approximate solution to the (unregularized) original MOT problem (LinHoJor19).

Improved algorithms for graphically structured MOT problems.

The contribution of this paper to the study of graphically structured MOT problems is that we give the first $\mathrm{poly}(n,k)$-time algorithms that can compute solutions which are exact and sparse (Corollary 5.6). Our framework also directly recovers all known results about SINKHORN for graphically structured MOT problems—namely, that it can be implemented in polynomial time for trees (h20tree; teh2002unified) and for constant treewidth (h20gm; teh2002unified).

1.3.3 Tractable MOT problems beyond graphically structured costs

The two new classes of MOT problems studied in this paper—namely, set-optimization structure and low-rank plus sparse structure—are incomparable to each other as well as to graphical structure; details are in Remarks 6.7 and 7.3. This lets us handle a wide range of new MOT problems that could not be handled before.

There are two other classes of MOT problems studied in the literature which do not fall under the three structures studied in this paper. We elaborate on both below.

Remark 1.11 (Low-dimensional Wasserstein barycenter).

This MOT problem has cost $C_{j_{1},\dots,j_{k}}=\sum_{i,i^{\prime}=1}^{k}\|x_{i,j_{i}}-x_{i^{\prime},j_{i^{\prime}}}\|^{2}$, where $x_{i,j}\in\mathbb{R}^{d}$ denotes the $j$-th atom in the distribution $\mu_{i}$. Clearly this cost is not a graphically structured cost of constant treewidth—indeed, representing it through the lens of graphical structure requires the complete graph of interactions, which means a maximal treewidth of $k-1$. (We remark that the related but different problem of fixed-support Wasserstein barycenters has graphical structure with treewidth $1$ (h20partial; Nen16; BenCarCut15; CarObeOud15). However, it should be emphasized that the fixed-support Wasserstein barycenter problem is different from the Wasserstein barycenter problem: it only approximates the latter to $\varepsilon$ accuracy if the fixed support is restricted to an $O(\varepsilon)$-net, which requires $n=1/\varepsilon^{\Omega(d)}$ discretization size for the barycenter’s support, and thus (i) even in constant dimension, does not lead to high-precision algorithms due to $\mathrm{poly}(1/\varepsilon)$ runtime; and (ii) scales exponentially in the dimension $d$. See (AltBoi20baryhard, §1.3) for further details about the complexity of Wasserstein barycenters.) This problem also does not fall under the set-optimization or constant-rank structures. Nevertheless, this MOT problem can be solved in $\mathrm{poly}(n,k)$ time for any fixed dimension $d$ by exploiting the low-dimensional geometric structure of the points $\{x_{i,j}\}$ that implicitly define the cost (AltBoi20bary).

Remark 1.12 (Random combinatorial optimization).

MOT problems have also appeared in the random combinatorial optimization literature since the 1970s, see, e.g., (MeiNad79; zemel1982polynomial; weiss1986stochastic; nadas1979probabilistic; Han86), although under a different name and in a different community. These papers consider MOT problems with costs of the form $C(x)=\min_{v\in V}\langle x,v\rangle$ for polytopes $V\subseteq\{0,1\}^{k}$ given through a list of their extreme points. Applications include PERT (Program Evaluation and Review Technique), extremal network reliability, and scheduling. Recently, applications to Distributionally Robust Optimization were investigated in (Nat18dro; Nat09; MisNat14), which considered general polytopes $V\subset\mathbb{R}^{k}$, as well as in (Nat21extremal), which considered MOT costs of the related form $C(x)=\mathds{1}[\min_{v\in V}\langle x,v\rangle\geqslant t]$, and in (Agr12), which considers other combinatorial costs $C$ such as sub/supermodular functions. These papers show that these random combinatorial optimization problems are in general intractable, and give sufficient conditions on when they can be solved in polynomial time. In general, these families of MOT problems are different from the three structures studied in this paper, although some MOT applications fall under multiple umbrellas (e.g., extremal network reliability). It is an interesting question to understand to what extent these structures can be reconciled (as well as the algorithms, which sometimes use extended formulations in these papers).

1.3.4 Intractable MOT problems

These algorithmic results beg the question: what are the fundamental limitations of this line of work on polynomial-time algorithms for structured MOT problems? To this end, the recent paper (AltBoi20hard) provides a systematic investigation of $\mathsf{NP}$-hardness results for structured MOT problems, including converses to several results in this paper. In particular, (AltBoi20hard, Propositions 4.1 and 4.2) justify the constant-rank regime studied in §7 by showing that unless $\mathsf{P}=\mathsf{NP}$, there does not exist an algorithm with runtime that is jointly polynomial in the rank $r$ and the input parameters $n$ and $k$. Similarly, (AltBoi20hard, Propositions 5.1 and 5.2) justify the constant-treewidth regime for graphically structured costs studied in §5 and all previous work by showing that unless $\mathsf{P}=\mathsf{NP}$, there does not exist an algorithm with polynomial runtime even in the seemingly simple class of MOT costs that decompose into pairwise interactions $C_{j_{1},\dots,j_{k}}=\sum_{i\neq i^{\prime}\in[k]}c_{i,i^{\prime}}(j_{i},j_{i^{\prime}})$. The paper (AltBoi20hard) also shows $\mathsf{NP}$-hardness for several MOT problems with repulsive costs, including for example the MOT formulation of Density Functional Theory with the Coulomb-Buckingham potential. It is an open problem whether the Coulomb potential, studied in (BenCarNen16; cotar2013density; buttazzo2012optimal), also leads to an $\mathsf{NP}$-hard MOT problem (AltBoi20hard, Conjecture 6.4).

1.3.5 Variants of MOT

The literature has studied several other variants of the MOT problem, notably with entropic regularization and/or with constraints on a subset of the $k$ marginals; see, e.g., (BenCarCut15; BenCarDi18; BenCarNen19; BenCarNen16; h20partial; haasler2021scalable; h20incremental; h20gm; h20tree; h19hmc; LinHoJor19). Our techniques readily apply with little change. Briefly, to handle entropic regularization, simply use the SMIN oracle and SINKHORN algorithm with a fixed regularization parameter $1/\eta>0$ (rather than $1/\eta$ of vanishing size $\Theta(\varepsilon/\log n)$) as described in §4.3. And to handle partial marginal constraints, essentially the only change is that in the MIN, AMIN, and SMIN oracles, the potentials $p_{i}$ are zero for all indices $i\in[k]$ corresponding to unconstrained marginals $m_{i}(P)$. Full details are omitted for brevity since they are straightforward modifications of our main results.

1.3.6 Optimization over joint distributions

Optimization problems over exponential-size joint distributions appear in many domains. For instance, they arise in game theory when computing correlated equilibria (pap08); however, in that case the optimization has different constraints which lead to different algorithms. Such problems also arise in variational inference (wainwright2003variational); however, the optimization there typically constrains this distribution to ensure tractability (e.g., mean-field approximation restricts to product distributions). The different constraints in these optimization problems over joint distributions versus MOT lead to significant differences in computational complexity, and thus also necessitate different algorithmic techniques.

1.4 Organization

In §2 we recall preliminaries about MOT and establish notation. The first part of the paper then establishes our unified algorithmic framework for MOT. Specifically, in §3 we define and compare three variants of the dual feasibility oracle; and in §4 we characterize the structure that MOT algorithms require for polynomial-time implementation in terms of these three oracles. For an overview of these results, see §1.1. The second part of the paper applies this algorithmic framework to three general classes of MOT cost structures: graphical structure (§5), set-optimization structure (§6), and low-rank plus sparse structure (§7). For an overview of these results, see §1.2. These three application sections are independent of each other and can be read separately. We conclude in §8.

2 Preliminaries

General notation.

The set $\{1,\dots,n\}$ is denoted by $[n]$. For shorthand, we write $\mathrm{poly}(t_{1},\dots,t_{m})$ to denote a function that grows at most polynomially fast in those parameters. Throughout, we assume for simplicity of exposition that all entries of the input $C$ and $\mu_{1},\dots,\mu_{k}$ have bit complexity at most $\mathrm{poly}(n,k)$, and the same for the components defining $C$ in structured settings. As such, runtimes throughout refer to the number of arithmetic operations. The set $\mathbb{R}\cup\{-\infty\}$ is denoted by $\bar{\mathbb{R}}$; note that the value $-\infty$ can be represented efficiently by adding a single flag bit. We use the standard $O(\cdot)$ and $\Omega(\cdot)$ notation, and use $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ to denote that polylogarithmic factors may be omitted.

Tensor notation.

The $k$-fold tensor product space $\mathbb{R}^{n}\otimes\cdots\otimes\mathbb{R}^{n}$ is denoted by $(\mathbb{R}^{n})^{\otimes k}$, and similarly for $(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}$. Let $P\in(\mathbb{R}^{n})^{\otimes k}$. Its $i$-th marginal, $i\in[k]$, is denoted by $m_{i}(P)\in\mathbb{R}^{n}$ and has entries $[m_{i}(P)]_{j}:=\sum_{j_{1},\dots,j_{i-1},j_{i+1},\dots,j_{k}}P_{j_{1},\dots,j_{i-1},j,j_{i+1},\dots,j_{k}}$. For shorthand, we often denote an index $(j_{1},\dots,j_{k})$ by $\vec{j}$. The sum of $P$’s entries is denoted by $m(P)=\sum_{\vec{j}}P_{\vec{j}}$. The maximum absolute value of $P$’s entries is denoted by $\|P\|_{\max}:=\max_{\vec{j}}|P_{\vec{j}}|$, or simply $P_{\max}$ for short. For $\vec{j}\in[n]^{k}$, we write $\delta_{\vec{j}}$ to denote the tensor with value $1$ at entry $\vec{j}$, and $0$ elsewhere. The operations $\odot$ and $\otimes$ respectively denote the entrywise product and the Kronecker product. The notation $\otimes_{i=1}^{k}d_{i}$ is shorthand for $d_{1}\otimes\cdots\otimes d_{k}$. A non-standard notation we use throughout is that $f[P]$ denotes a function $f:\mathbb{R}\to\mathbb{R}$ (typically $\exp$, $\log$, or a polynomial) applied entrywise to a tensor $P$.
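For explicit (small) tensors, the marginal operator $m_{i}(P)$ is simply a sum over all axes except the $i$-th; a two-line numpy sketch:

```python
# Compute the i-th marginal m_i(P) of an explicit tensor P.
import numpy as np

def marginal(P, i):
    return P.sum(axis=tuple(ax for ax in range(P.ndim) if ax != i))

P = np.full((3, 3, 3, 3), 1.0 / 81)   # uniform joint distribution, k=4, n=3
print(marginal(P, 2))                 # each marginal is uniform: [1/3 1/3 1/3]
```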

2.1 Multimarginal Optimal Transport

The transportation polytope between measures $\mu_{1},\dots,\mu_{k}\in\Delta_{n}$ is

$$\mathcal{M}(\mu_{1},\dots,\mu_{k}):=\left\{P\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}\;:\;m_{i}(P)=\mu_{i},\;\forall i\in[k]\right\}. \tag{2.1}$$

For a fixed cost $C\in(\mathbb{R}^{n})^{\otimes k}$, the $\textsf{MOT}_{C}$ problem is to solve the following linear program, given input measures $\mu=(\mu_{1},\dots,\mu_{k})\in(\Delta_{n})^{k}$:

$$\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,C\rangle. \tag{MOT}$$

In the $k=2$ matrix case, (MOT) is the Kantorovich formulation of OT (Vil03). Its dual LP is

$$\max_{p_{1},\ldots,p_{k}\in\mathbb{R}^{n}}\sum_{i=1}^{k}\langle p_{i},\mu_{i}\rangle\quad\text{subject to}\quad C_{j_{1},\dots,j_{k}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\geqslant 0,\;\forall(j_{1},\dots,j_{k})\in[n]^{k}. \tag{MOT-D}$$

A basic, folklore fact about MOT is that it always has a sparse optimal solution (e.g., (anderes2016discrete, Lemma 3)). This follows from elementary facts about standard-form LP; we provide a short proof for completeness.

Lemma 2.1 (Sparse solutions for MOT).

For any cost $C\in(\mathbb{R}^{n})^{\otimes k}$ and any marginals $\mu_{1},\dots,\mu_{k}\in\Delta_{n}$, there exists an optimal solution $P$ to $\textsf{MOT}_{C}(\mu)$ that has at most $nk-k+1$ non-zero entries.

Proof.

Since (MOT) is an LP over a compact domain, it has an optimal solution at a vertex (BerTsi97, Theorem 2.7). Since (MOT) is a standard-form LP, these vertices are in correspondence with basic solutions, thus their sparsity is bounded by the number of linearly dependent constraints defining $\mathcal{M}(\mu_{1},\dots,\mu_{k})$ (BerTsi97, Theorem 2.4). We bound this quantity by $nk-k+1$ via two observations. First, $\mathcal{M}(\mu_{1},\dots,\mu_{k})$ is defined by $nk$ equality constraints $[m_{i}(P)]_{j}=[\mu_{i}]_{j}$ in (2.1), one for each coordinate $j\in[n]$ of each marginal constraint $i\in[k]$. Second, at least $k-1$ of these constraints are linearly dependent because we can construct $k$ distinct linear combinations of them, namely $\sum_{j\in[n]}[m_{i}(P)]_{j}=\sum_{j\in[n]}[\mu_{i}]_{j}$ for each marginal $i\in[k]$, which all simplify to the same constraint $m(P)=1$, and thus are redundant with each other. ∎

Definition 2.2 ($\varepsilon$-approximate MOT solution).

$P$ is an $\varepsilon$-approximate solution to $\textsf{MOT}_{C}(\mu)$ if $P$ is feasible (i.e., $P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})$) and $\langle C,P\rangle$ is at most $\varepsilon$ more than the optimal value.

2.2 Regularization

We introduce two standard regularization operators. First is the Shannon entropy $H(P):=-\sum_{\vec{j}}P_{\vec{j}}\log P_{\vec{j}}$ of a tensor $P\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}$ with entries summing to $m(P)=1$. We adopt the standard notational convention that $0\log 0=0$. Second is the softmin operator, which is defined for parameter $\eta>0$ as

sminηi[m]ai:=1ηlog(i=1meηai).\displaystyle\operatorname*{{\textstyle{smin_{\eta}}}}_{i\in[m]}a_{i}:=-\frac{1}{\eta}\log\left(\sum_{i=1}^{m}e^{-\eta a_{i}}\right). (2.2)

This softmin operator naturally extends to ai{}a_{i}\in\mathbb{R}\cup\{\infty\} by adopting the standard notational conventions that e=0e^{-\infty}=0 and log0=\log 0=-\infty.

We make use of the following folklore fact, which bounds the error between the min\min and smin\operatorname*{smin} operators based on the regularization and the number of points. For completeness, we provide a short proof.

Lemma 2.3 (Softmin approximation bound).

For any a1,,am{}a_{1},\dots,a_{m}\in\mathbb{R}\cup\{\infty\} and η>0\eta>0,

mini[m]aisminηi[m]aimini[m]ailogmη.\displaystyle\min_{i\in[m]}a_{i}\geqslant\operatorname*{{\textstyle{smin_{\eta}}}}_{i\in[m]}a_{i}\geqslant\min_{i\in[m]}a_{i}-\frac{\log m}{\eta}.
Proof.

Assume without loss of generality that all aia_{i} are finite, else aia_{i} can be dropped (if all ai=a_{i}=\infty then the claim is trivial). For shorthand, denote mini[m]ai\min_{i\in[m]}a_{i} by amina_{\min}. For the first inequality, use the non-negativity of the exponential function to bound

sminηi[m]ai=1ηlog(i=1meηai)1ηlog(eηamin)=amin.\operatorname*{{\textstyle{smin_{\eta}}}}_{i\in[m]}a_{i}=-\frac{1}{\eta}\log\left(\sum_{i=1}^{m}e^{-\eta a_{i}}\right)\leqslant-\frac{1}{\eta}\log\left(e^{-\eta a_{\min}}\right)=a_{\min}.

For the second inequality, use the fact that each aiamina_{i}\geqslant a_{\min} to bound

sminηi[m]ai=1ηlog(i=1meηai)1ηlog(meηamin)=aminlogmη.\operatorname*{{\textstyle{smin_{\eta}}}}_{i\in[m]}a_{i}=-\frac{1}{\eta}\log\left(\sum_{i=1}^{m}e^{-\eta a_{i}}\right)\geqslant-\frac{1}{\eta}\log\left(me^{-\eta a_{\min}}\right)=a_{\min}-\frac{\log m}{\eta}. ∎
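Numerically, the softmin is evaluated stably by factoring the minimum out of the log-sum-exp. Below is a minimal sketch (ours, not from the paper), together with an empirical check of the two-sided bound of Lemma 2.3.

import numpy as np

def softmin(a, eta):
    # smin_eta(a) = -(1/eta) * log(sum_i exp(-eta * a_i)), computed stably
    # by factoring out exp(-eta * min(a)) so that no argument overflows.
    a = np.asarray(a, dtype=float)
    a_min = a.min()
    return a_min - np.log(np.exp(-eta * (a - a_min)).sum()) / eta

a = np.array([0.3, 1.7, 0.9, 2.4])
eta = 50.0
s = softmin(a, eta)
# Lemma 2.3: min(a) - log(m)/eta <= smin_eta(a) <= min(a).
assert a.min() - np.log(a.size) / eta <= s <= a.min()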

The entropically regularized MOT problem (RMOT for short) is the convex optimization problem

minP(μ1,,μk)P,Cη1H(P).\displaystyle\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,C\rangle-\eta^{-1}H(P). (RMOT)

This is the natural multidimensional analog of entropically regularized OT, which has a rich literature in statistics (Leo13, ) and transportation theory (Wil69, ), and has recently attracted significant interest in machine learning (Cut13, ; PeyCut17, ). The convex dual of (RMOT) is the convex optimization problem

maxp1,,pkni=1kpi,μi+sminηj[n]k(Cji=1k[pi]ji).\displaystyle\max_{p_{1},\dots,p_{k}\in\mathbb{R}^{n}}\sum_{i=1}^{k}\langle p_{i},\mu_{i}\rangle+\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in[n]^{k}}\left(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right). (RMOT-D)

In contrast to MOT, there is no analog of Lemma 2.1 for RMOT: the unique optimal solution to RMOT is dense. Further, this solution may not even be “approximately” sparse. For example, when C=0C=0, all μi\mu_{i} are uniform, and η>0\eta>0 is any positive number, the solution is fully dense with all entries equal to 1/nk1/n^{k}.

We define PP to be an ε\varepsilon-approximate RMOT solution in the analogous way as in Definition 2.2. A basic, folklore fact about RMOT is that if the regularization η\eta is sufficiently large, then RMOT and MOT are equivalent in terms of approximate solutions.

Lemma 2.4 (MOT and RMOT are close for large regularization η\eta).

Let P(μ1,,μk)P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}), ε>0\varepsilon>0, and ηε1klogn\eta\geqslant\varepsilon^{-1}k\log n. If PP is an ε\varepsilon-approximate solution to (RMOT), then PP is also a (2ε)(2\varepsilon)-approximate solution to (MOT); and vice versa.

Proof.

Since a discrete distribution supported on nkn^{k} atoms has entropy at most klognk\log n (CoverThomas, ), the objectives of (MOT) and (RMOT) differ pointwise by at most η1klognε\eta^{-1}k\log n\leqslant\varepsilon. Since (MOT) and (RMOT) also have the same feasible sets, their optimal values therefore differ by at most ε\varepsilon. ∎

3 Oracles

Here we define the three oracle variants described in the introduction and discuss their relations. In the below definitions, let C(n)kC\in(\mathbb{R}^{n})^{\otimes k} be a cost tensor.

Definition 3.1 (MIN oracle).

For weights p=(p1,,pk)n×kp=(p_{1},\ldots,p_{k})\in\mathbb{R}^{n\times k}, MINC(p)\textsf{MIN}_{C}(p) returns

minj[n]kCji=1k[pi]ji.\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.
Definition 3.2 (AMIN oracle).

For weights p=(p1,,pk)n×kp=(p_{1},\ldots,p_{k})\in\mathbb{R}^{n\times k} and accuracy ε>0\varepsilon>0, AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon) returns MINC(p)\textsf{MIN}_{C}(p) up to additive error ε\varepsilon.

Definition 3.3 (SMIN oracle).

For weights p=(p1,,pk)¯n×kp=(p_{1},\ldots,p_{k})\in\bar{\mathbb{R}}^{n\times k} and regularization parameter η>0\eta>0, SMINC(p,η)\textsf{SMIN}_{C}(p,\eta) returns

sminηj[n]kCji=1k[pi]ji.\displaystyle\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.

An algorithm is said to “solve” or “implement” MINC\textsf{MIN}_{C} if given input pp, it outputs MINC(p)\textsf{MIN}_{C}(p). Similarly for AMINC\textsf{AMIN}_{C} and SMINC\textsf{SMIN}_{C}. Note that the weights pp that are input to SMIN have values inside ¯={}\bar{\mathbb{R}}=\mathbb{R}\cup\{-\infty\}; this simplifies the notation in the treatment of the SINKHORN algorithm below and does not increase the bit-complexity by more than 11 bit by adding a flag for the value -\infty.
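To fix ideas, here are brute-force reference implementations of these oracles for an explicitly stored cost tensor (a sketch of ours, usable only for small nn and kk): each enumerates all nkn^{k} tuples, which is exactly the exponential bottleneck that the structured algorithms of §4 must avoid. By Remark 3.6, the SMIN implementation with η ⩾ ε^{-1}k log n also serves as a valid AMIN implementation.

import itertools
import numpy as np

def min_oracle(C, p):
    # MIN_C(p) by brute force: p is a (k x n) array of dual weights.
    n, k = C.shape[0], C.ndim
    return min(C[j] - sum(p[i][j[i]] for i in range(k))
               for j in itertools.product(range(n), repeat=k))

def smin_oracle(C, p, eta):
    # SMIN_C(p, eta): softmin over the same n^k values (log-sum-exp trick).
    n, k = C.shape[0], C.ndim
    vals = np.array([C[j] - sum(p[i][j[i]] for i in range(k))
                     for j in itertools.product(range(n), repeat=k)])
    vmin = vals.min()
    return vmin - np.log(np.exp(-eta * (vals - vmin)).sum()) / eta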

Remark 3.4 (Interpretation as variants of the dual feasibility oracle).

These three oracles can be viewed as variants of the feasibility oracle for (MOT-D). For MINC(p)\textsf{MIN}_{C}(p), this relationship is exact: pn×kp\in\mathbb{R}^{n\times k} is feasible for (MOT-D) if and only if MINC(p)\textsf{MIN}_{C}(p) is non-negative. For AMINC\textsf{AMIN}_{C} and SMINC\textsf{SMIN}_{C}, this relationship is approximate, with the approximation depending on how small ε\varepsilon is and how large η\eta is, respectively.

Since these oracles form the respective bottlenecks of all algorithms from the MOT and implicit linear programming literatures (see the overview in the introduction §1.1), an important question is: if one oracle can be implemented in poly(n,k)\mathrm{poly}(n,k) time, does this imply that the other can be too?

Two reductions are straightforward: the AMIN oracle can be implemented in poly(n,k)\mathrm{poly}(n,k) time whenever either the MIN oracle or the SMIN oracle can be implemented in poly(n,k)\mathrm{poly}(n,k) time. We record these simple observations in remarks for easy recall.

Remark 3.5 (MIN implies AMIN).

For any accuracy ε>0\varepsilon>0, the MINC(p)\textsf{MIN}_{C}(p) oracle provides a valid answer to the AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon) oracle by definition.

Remark 3.6 (SMIN implies AMIN).

For any pn×kp\in\mathbb{R}^{n\times k} and regularization ηε1klogn\eta\geqslant\varepsilon^{-1}k\log n, the SMINC(p,η)\textsf{SMIN}_{C}(p,\eta) oracle provides a valid answer to the AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon) oracle due to the approximation property of the smin\operatorname*{smin} operator (Lemma 2.3).

In the remainder of this section, we show a separation between the SMIN oracle and both the MIN and AMIN oracles by exhibiting a family of cost tensors CC for which there exist polynomial-time algorithms for MIN and AMIN, but for which there is no polynomial-time algorithm for SMIN. The non-existence of a polynomial-time algorithm of course requires a complexity-theoretic assumption; our result holds under #\#BIS-hardness, a by-now standard assumption introduced in (dyer2000relative, ) which states that there is no polynomial-time algorithm for counting the number of independent sets in a bipartite graph.

Lemma 3.7 (Restrictiveness of the SMIN oracle).

There exists a family of costs C(n)kC\in(\mathbb{R}^{n})^{\otimes k} for which MINC\textsf{MIN}_{C} and AMINC\textsf{AMIN}_{C} can be solved in poly(n,k)\mathrm{poly}(n,k) time, however SMINC\textsf{SMIN}_{C} is #BIS-hard.

Proof.

In order to prove hardness for general nn, it suffices to exhibit such a family of cost tensors when n=2n=2. Since n=2n=2, it is convenient to abuse notation slightly by indexing a cost tensor C(n)kC\in(\mathbb{R}^{n})^{\otimes k} by j{1,1}k\vec{j}\in\{-1,1\}^{k} rather than by j{1,2}k\vec{j}\in\{1,2\}^{k}. The family we exhibit is {C(A,b):A0k×k,bk}\{C(A,b):A\in\mathbb{R}_{\geqslant 0}^{k\times k},\,b\in\mathbb{R}^{k}\}, where the cost tensors C(A,b)C(A,b) are parameterized by a non-negative square matrix AA and a vector bb, and have entries of the form

Cj(A,b):=j,Ajb,j,j{±1}k.C_{\vec{j}}(A,b):=-\langle\vec{j},A\vec{j}\rangle-\langle b,\vec{j}\rangle,\qquad\vec{j}\in\{\raisebox{0.86108pt}{$\scriptstyle\pm$}1\}^{k}.

Polynomial-time algorithm for MIN and AMIN. We show that given a matrix A0k×kA\in\mathbb{R}_{\geqslant 0}^{k\times k}, vector bkb\in\mathbb{R}^{k}, and weights p2×kp\in\mathbb{R}^{2\times k}, it is possible to compute MINC(p)\textsf{MIN}_{C}(p) on the cost tensor C(A,b)C(A,b) in poly(k)\mathrm{poly}(k) time. Clearly this also implies a poly(k)\mathrm{poly}(k) time algorithm for AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon) for any ε>0\varepsilon>0, see Remark 3.5.

To this end, we first re-write the MINC(p)\textsf{MIN}_{C}(p) problem on C(A,b)C(A,b) in a more convenient form that enables us to “ignore” the weights pp. Recall that MINC(p)\textsf{MIN}_{C}(p) is the problem of

MINC(p)=minj{±1}kj,Ajb,ji=1k[pi]ji.\textsf{MIN}_{C}(p)=\min_{\vec{j}\in\{\raisebox{0.60275pt}{$\scriptstyle\pm$}1\}^{k}}-\langle\vec{j},A\vec{j}\rangle-\langle b,\vec{j}\rangle-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.

Note that the linear part of the cost is equal to

b,j+i=1k[pi]ji=,j+d,\displaystyle\langle b,\vec{j}\rangle+\sum_{i=1}^{k}[p_{i}]_{j_{i}}=\langle\ell,\vec{j}\rangle+d, (3.1)

where k\ell\in\mathbb{R}^{k} is the vector with entries i=bi+12((pi)1(pi)1)\ell_{i}=b_{i}+\operatorname{\frac{1}{2}}((p_{i})_{1}-(p_{i})_{-1}), and dd is the scalar d=12i=1k([pi]1+[pi]1)d=\operatorname{\frac{1}{2}}\sum_{i=1}^{k}([p_{i}]_{1}+[p_{i}]_{-1}). Thus, since dd is clearly computable in O(k)O(k) time, the MINC\textsf{MIN}_{C} problem is equivalent to solving

minj{±1}kj,Aj,j,\displaystyle\min_{\vec{j}\in\{\raisebox{0.60275pt}{$\scriptstyle\pm$}1\}^{k}}-\langle\vec{j},A\vec{j}\rangle-\langle\ell,\vec{j}\rangle, (3.2)

when given as input a non-negative matrix A0k×kA\in\mathbb{R}_{\geqslant 0}^{k\times k} and a vector k\ell\in\mathbb{R}^{k}.

To show that this task is solvable in poly(k)\mathrm{poly}(k) time, note that the objective in (3.2) is a submodular function because it is a quadratic whose Hessian A-A has non-positive off-diagonal terms (Bach13, , Proposition 6.3). Therefore (3.2) is a submodular optimization problem, and thus can be solved in poly(k)\mathrm{poly}(k) time using classical algorithms from combinatorial optimization (GLSbook, , Chapter 10.2).
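To illustrate this part of the proof, the following sketch (ours) computes MINC(p)\textsf{MIN}_{C}(p) for the cost C(A,b)C(A,b) via the reparameterization (3.1). For clarity we brute-force the minimization (3.2) over {±1}^k; a poly(k)-time submodular minimizer would be substituted at the indicated line.

import itertools
import numpy as np

def min_oracle_CAb(A, b, p):
    # MIN_C(p) for the cost C(A, b). The weights p have shape (k, 2):
    # column 0 holds [p_i]_{+1} and column 1 holds [p_i]_{-1}.
    k = A.shape[0]
    ell = b + 0.5 * (p[:, 0] - p[:, 1])    # the vector l in (3.1)
    d = 0.5 * (p[:, 0] + p[:, 1]).sum()    # the scalar d in (3.1)
    # Solve (3.2). A poly(k)-time submodular minimizer goes here; we
    # enumerate {-1, +1}^k only to keep the illustration self-contained.
    best = min(-s @ A @ s - ell @ s
               for s in (np.array(t) for t in itertools.product((-1, 1), repeat=k)))
    return best - d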

SMIN oracle is #\#BIS-hard. On the other hand, by using the definition of the SMIN oracle, the re-parameterization (3.1), and then the definition of the softmin operator,

SMINC(p,η)=sminηj{±1}kj,Ajb,ji=1k[pi]ji=sminηj{±1}kj,Aj,jd=logZηd,\textsf{SMIN}_{C}(p,\eta)=\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in\{\raisebox{0.60275pt}{$\scriptstyle\pm$}1\}^{k}}-\langle\vec{j},A\vec{j}\rangle-\langle b,\vec{j}\rangle-\sum_{i=1}^{k}[p_{i}]_{j_{i}}=\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in\{\raisebox{0.60275pt}{$\scriptstyle\pm$}1\}^{k}}-\langle\vec{j},A\vec{j}\rangle-\langle\ell,\vec{j}\rangle-d=-\frac{\log Z}{\eta}-d,

where Z=j{±1}kQ(j)Z=\sum_{\vec{j}\in\{\raisebox{0.60275pt}{$\scriptstyle\pm$}1\}^{k}}Q(\vec{j}) is the partition function of the ferromagnetic Ising model with inconsistent external fields given by

Q(j)=exp(ηj,Aj+η,j).Q(\vec{j})=\exp\left(\eta\langle\vec{j},A\vec{j}\rangle+\eta\langle\ell,\vec{j}\rangle\right).

Because it is #\#BIS hard to compute the partition function ZZ of a ferromagnetic Ising model with inconsistent external fields (goldberg2007complexity, ), it is #\#BIS hard to compute the value η1logZd-\eta^{-1}\log Z-d of the oracle SMINC(p,η)\textsf{SMIN}_{C}(p,\eta). ∎

Remark 3.8 (The restrictiveness of SMIN extends to approximate computation).

The separation between the oracles shown in Lemma 3.7 further extends to approximate computation of the SMIN oracle under the assumption that #\#BIS is hard to approximate, since under this assumption it is hard to approximate the partition function of a ferromagnetic Ising model with inconsistent external fields (goldberg2007complexity, ).

4 Algorithms to oracles

In this section, we consider three algorithms for MOT. Each is iterative and requires only polynomially many iterations. The key issue for each algorithm is the per-iteration runtime, which is in general exponential (roughly nkn^{k}). We isolate the respective bottlenecks of these three algorithms into the three variants of the dual feasibility oracle defined in §3. See §1.1 and Table 4 for a high-level overview of this section’s results.

4.1 The Ellipsoid algorithm and the MIN oracle

Among the most classical algorithms for implicit LP is the Ellipsoid algorithm (GLSbook, ; GroLovSch81, ; khachiyan1980polynomial, ). However, it has taken a back seat to the SINKHORN algorithm in the vast majority of the MOT literature. The very recent paper (AltBoi20bary, ), which focuses on the specific MOT application of computing low-dimensional Wasserstein barycenters, develops a variant of the classical Ellipsoid algorithm specialized to MOT; henceforth this variant is called ELLIPSOID, and it is described in §4.1.1. The objective of this section is to analyze ELLIPSOID in the context of general MOT problems in order to prove the following.

Theorem 4.1.

For any family of cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k}, the following are equivalent:

  • (i)

    ELLIPSOID takes poly(n,k)\mathrm{poly}(n,k) time to solve the MOTC\textsf{MOT}_{C} problem. (Moreover, it outputs a vertex solution represented as a sparse tensor with at most nkk+1nk-k+1 non-zeros.)

  • (ii)

    There exists a poly(n,k)\mathrm{poly}(n,k) time algorithm that solves the MOTC\textsf{MOT}_{C} problem.

  • (iii)

    There exists a poly(n,k)\mathrm{poly}(n,k) time algorithm that solves the MINC\textsf{MIN}_{C} problem.

Interpretation of results.

In words, the equivalence “(i) \Longleftrightarrow (ii)” establishes that ELLIPSOID can solve any MOT problem in polynomial time that any other algorithm can. Thus from a theoretical perspective, this paper’s restriction to ELLIPSOID is at no loss of generality for developing polynomial-time algorithms that exactly solve MOT. In words, the equivalence “(ii) \Longleftrightarrow (iii)” establishes that the MOT and MIN problems are polynomial-time equivalent. Thus we may investigate when MOT is tractable by instead investigating the more amenable question of when MIN is tractable (see §1.1.4) at no loss of generality.

As stated in the related work section, Theorem 4.1 is implicit from combining several known results (AltBoi20hard, ; AltBoi20bary, ; GLSbook, ). Our contribution here is to make this result explicit, since this allows us to unify algorithms from the implicit LP literature with the SINKHORN algorithm. We also significantly simplify part of the implication “(iii) \Longrightarrow (i)”, which is crucial for making an algorithm that relies on the MIN oracle practical—namely, the Column Generation algorithm discussed below.

Organization of §4.1.

In §4.1.1, we recall the ELLIPSOID algorithm and how it depends on the violation oracle for (MOT-D). In §4.1.2, we give a significantly simpler proof that the violation and feasibility oracles are polynomial-time equivalent in the case of (MOT-D), and use this to prove Theorem 4.1. In §4.1.3, we describe a practical implementation that replaces the ELLIPSOID outer loop with Column Generation.

4.1.1 Algorithm

A key component of the proof of Theorem 4.1 is the ELLIPSOID algorithm introduced in (AltBoi20bary, ) for MOT, which we describe below. In order to present this, we first define a variant of the MIN oracle that returns a minimizing tuple rather than the minimizing value.

Definition 4.2 (Violation oracle for (MOT-D)).

Given weights p=(p1,,pk)n×kp=(p_{1},\dots,p_{k})\in\mathbb{R}^{n\times k}, ARGMINC\textsf{ARGMIN}_{C} returns the minimizing solution j\vec{j} and value of minj[n]kCji=1k[pi]ji\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.

ARGMINC\textsf{ARGMIN}_{C} can be viewed as a violation oracle for the decision set of (MOT-D). (Recall that a violation oracle for a polytope K={x:ai,xbi,i[N]}K=\{x:\langle a_{i},x\rangle\leqslant b_{i},\forall i\in[N]\} is an algorithm that, given a point pp, either asserts pp is in KK, or otherwise outputs the index ii of a violated constraint ai,p>bi\langle a_{i},p\rangle>b_{i}.) This is because, given p=(p1,,pk)n×kp=(p_{1},\dots,p_{k})\in\mathbb{R}^{n\times k}, the tuple j\vec{j} output by ARGMINC(p)\textsf{ARGMIN}_{C}(p) either provides a violated constraint if Cji=1k[pi]ji<0C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}<0, or otherwise certifies pp is feasible. In (AltBoi20bary, ) it is proved that MOT can be solved with polynomially many calls to the ARGMINC\textsf{ARGMIN}_{C} oracle.

Theorem 4.3 (ELLIPSOID guarantee; Proposition 12 of (AltBoi20bary, )).

Algorithm 1 finds an optimal vertex solution for MOTC(μ)\textsf{MOT}_{C}(\mu) using poly(n,k)\mathrm{poly}(n,k) calls to the ARGMINC\textsf{ARGMIN}_{C} oracle and poly(n,k)\mathrm{poly}(n,k) additional time. The solution is returned as a sparse tensor with at most nkk+1nk-k+1 non-zero entries.

Algorithm 1 ELLIPSOID: specialization of the classical Ellipsoid algorithm to MOT

Input: Cost C(n)kC\in(\mathbb{R}^{n})^{\otimes k}, marginals μ1,,μkΔn\mu_{1},\dots,\mu_{k}\in\Delta_{n}
      Output: Vertex solution to MOTC(μ)\textsf{MOT}_{C}(\mu)

1:\\  Solve dual
2:Solve (MOT-D) using the Ellipsoid algorithm, with ARGMINC\textsf{ARGMIN}_{C} as the violation oracle. Let SS denote the set of tuples returned by all calls to ARGMINC\textsf{ARGMIN}_{C}.
3:
4:\\  Solve primal
5:Solve (4.1) using a standard LP solver.
Sketch of algorithm.

Full details and a proof are in (AltBoi20bary, ). We give a brief overview here for completeness. First, recall from the implicit LP literature that the classical Ellipsoid algorithm can be implemented in polynomial time for an LP with arbitrarily many constraints so long as it has polynomially many variables and the violation oracle for its decision set is solvable in polynomial time (GLSbook, ). This does not directly apply to the LP (MOT) because that LP has nkn^{k} variables. However, it can apply to the dual LP (MOT-D) because that LP only has nknk variables.

This suggests a natural two-step algorithm for MOT. First, compute an optimal dual solution by directly applying the Ellipsoid algorithm to (MOT-D). Second, use this dual solution to construct a sparse primal solution. Although this dual-to-primal conversion does not extend to arbitrary LP (BerTsi97, , Exercise 4.17), the paper (AltBoi20bary, ) provides a solution by exploiting the standard-form structure of MOT. The procedure is to solve

minP(μ1,,μk)s.t. Pj=0,jSC,P\displaystyle\min_{\begin{subarray}{c}P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})\\ \text{s.t. }P_{\vec{j}}=0,\;\forall\vec{j}\notin S\end{subarray}}\langle C,P\rangle (4.1)

which is the MOT problem restricted to sparsity pattern SS, where SS is the set of tuples j\vec{j} returned by the violation oracle during the execution of step one of Algorithm 1. This second step takes poly(n,k)\mathrm{poly}(n,k) time using a standard LP solver, because running the Ellipsoid algorithm in the first step only calls the violation oracle poly(n,k)\mathrm{poly}(n,k) times, and thus SS has poly(n,k)\mathrm{poly}(n,k) size, and therefore the LP (4.1) has poly(n,k)\mathrm{poly}(n,k) variables and constraints. In (AltBoi20bary, ) it is proved that this produces a primal vertex solution to the original MOT problem.
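For concreteness, here is a sketch (ours) of this second step: the restricted LP (4.1) has one variable per tuple in SS and nknk marginal equality constraints, so it can be handed to any off-the-shelf LP solver; below we use SciPy's linprog.

import numpy as np
from scipy.optimize import linprog

def solve_restricted_primal(C, mus, S):
    # Solve (4.1): MOT restricted to the sparsity pattern S. mus is a (k x n)
    # array of marginals; S is a list of index tuples returned by ARGMIN_C.
    k, n = mus.shape
    c = np.array([C[j] for j in S])
    A_eq = np.zeros((k * n, len(S)))
    for col, j in enumerate(S):
        for i in range(k):
            A_eq[i * n + j[i], col] = 1.0   # variable P_j contributes to [m_i(P)]_{j_i}
    res = linprog(c, A_eq=A_eq, b_eq=mus.reshape(-1),
                  bounds=(0, None), method="highs")
    return res   # res.x holds the mass placed on each tuple in S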

4.1.2 Equivalence of bottleneck to MIN

Although Theorem 4.3 shows that ELLIPSOID can solve MOT in poly(n,k)\mathrm{poly}(n,k) time using the ARGMIN oracle, this is not sufficient to prove the implication “(iii)\implies(i)” in Theorem 4.1. Proving that implication requires showing the polynomial-time equivalence between MIN and ARGMIN.

Lemma 4.4 (Equivalence of MIN and ARGMIN).

Each of the oracles MINC\textsf{MIN}_{C} and ARGMINC\textsf{ARGMIN}_{C} can be implemented using poly(n,k)\mathrm{poly}(n,k) calls of the other oracle and poly(n,k)\mathrm{poly}(n,k) additional time.

This equivalence follows from classical results about the equivalence of violation and feasibility oracles (yudin1976informational, ). However, the known proof of that general result requires an involved and indirect argument based on “back-and-forth” applications of the Ellipsoid algorithm (GLSbook, , §4.3). Here we exploit the special structure of MOT to give a direct and elementary proof. This is essential to practical implementations (see §4.1.3).

Proof.

It is obvious how the MINC\textsf{MIN}_{C} oracle can be implemented via a single call of the ARGMINC\textsf{ARGMIN}_{C} oracle; we now show the converse. Specifically, given p1,,pknp_{1},\dots,p_{k}\in\mathbb{R}^{n}, we show how to compute a solution j=(j1,,jk)[n]k\vec{j}=(j_{1},\dots,j_{k})\in[n]^{k} for ARGMINC([p1,,pk])\textsf{ARGMIN}_{C}([p_{1},\dots,p_{k}]) using nknk calls to the MINC\textsf{MIN}_{C} oracle and polynomial additional time. We use the first nn calls to compute the first index j1j_{1} of the solution, the next nn calls to compute the next index j2j_{2}, and so on.

Formally, for s[k]s\in[k], let us say that (j1,,js)[n]s(j_{1}^{*},\dots,j_{s}^{*})\in[n]^{s} is a “partial solution” of size ss if there exists a solution j[n]kj\in[n]^{k} for ARGMINC([p1,,pk])\textsf{ARGMIN}_{C}([p_{1},\dots,p_{k}]) that satisfies ji=jij_{i}=j_{i}^{*} for all i[s]i\in[s]. Then it suffices to show that for every s[k]s\in[k], it is possible to compute a partial solution (j1,,js)(j_{1}^{*},\dots,j_{s}^{*}) of size ss from a partial solution (j1,,js1)(j_{1}^{*},\dots,j_{s-1}^{*}) of size s1s-1 using nn calls to the MINC\textsf{MIN}_{C} oracle and polynomial additional time.

The simple but key observation enabling this is the following. Below, for i[k]i\in[k] and j[n]j\in[n], define qi,jq_{i,j} to be the vector in n\mathbb{R}^{n} with value [pi]j[p_{i}]_{j} on entry jj, and value M-M on all other entries. In words, the following observation states that if the constant MM is sufficiently large, then for any indices jij_{i}^{\prime}, replacing the vectors pip_{i} with the vectors qi,jiq_{i,j_{i}^{\prime}} in a MIN oracle query effectively performs a MIN oracle query on the original input p1,,pkp_{1},\dots,p_{k} except that now the minimization is only over j[n]k\vec{j}\in[n]^{k} satisfying ji=jij_{i}=j_{i}^{\prime}.

Observation 4.5.

Set M:=2Cmax+2i=1kpimax+1M:=2C_{\max}+2\sum_{i=1}^{k}\|p_{i}\|_{\max}+1. Then for any s[k]s\in[k] and any (j1,,js)[n]s(j_{1}^{\prime},\dots,j_{s}^{\prime})\in[n]^{s},

MINC([q1,j1,,qs,js,ps+1,,pk])=minj[n]ks.t. j1=j1,,js=jsCji=1k[pi]ji.\displaystyle\textsf{MIN}_{C}([q_{1,j_{1}^{\prime}},\dots,q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}])=\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{\prime},\dots,j_{s}=j_{s}^{\prime}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.
Proof.

By definition of the MIN oracle,

MINC([q1,j1,,qs,js,ps+1,,pk])\displaystyle\textsf{MIN}_{C}([q_{1,j_{1}^{\prime}},\dots,q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}]) =minj[n]kCji=1s[qi,ji]jii=s+1k[pi]ji\displaystyle=\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{s}[q_{i,j_{i}^{\prime}}]_{j_{i}}-\sum_{i=s+1}^{k}[p_{i}]_{j_{i}}

It suffices to prove that every minimizing tuple j[n]k\vec{j}\in[n]^{k} for the right hand side satisfies ji=jij_{i}=j_{i}^{\prime} for all i[s]i\in[s]. Suppose not for sake of contradiction. Then there exists a minimizing tuple j[n]k\vec{j}\in[n]^{k} for which jjj_{\ell}\neq j_{\ell}^{\prime} for some [s]\ell\in[s]. But then [q,j]j=M[q_{\ell,j_{\ell}^{\prime}}]_{j_{\ell}}=-M, so the objective value of j\vec{j} is at least

Cji=1s[qi,ji]jii=s+1k[pi]jiMCmaxi=1kpimax=Cmax+i=1kpimax+1.C_{\vec{j}}-\sum_{i=1}^{s}[q_{i,j_{i}^{\prime}}]_{j_{i}}-\sum_{i=s+1}^{k}[p_{i}]_{j_{i}}\geqslant M-C_{\max}-\sum_{i=1}^{k}\|p_{i}\|_{\max}=C_{\max}+\sum_{i=1}^{k}\|p_{i}\|_{\max}+1.

But this is strictly larger (by at least 11) than the value of any tuple with prefix (j1,,js)(j_{1}^{\prime},\dots,j_{s}^{\prime}), contradicting the optimality of j\vec{j}. ∎

Thus, given a partial solution (j1,,js1)(j_{1}^{*},\dots,j_{s-1}^{*}) of length s1s-1, we construct a partial solution (j1,,js)(j_{1}^{*},\dots,j_{s}^{*}) of length ss by setting jsj_{s}^{*} to be a minimizer of

minjs[n]MINC([q1,j1,,qs1,js1,qs,js,ps+1,,pk]).\displaystyle\min_{j_{s}^{\prime}\in[n]}\textsf{MIN}_{C}([q_{1,j_{1}^{*}},\dots,q_{s-1,j_{s-1}^{*}},q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}]). (4.2)

The runtime bound is clear; it remains to show correctness. To this end, note that

minj[n]ks.t. j1=j1,,js=jsCji=1k[pi]ji\displaystyle\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots,j_{s}=j_{s}^{*}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} =MINC([q1,j1,,qs,js,ps+1,,pk])\displaystyle=\textsf{MIN}_{C}([q_{1,j_{1}^{*}},\dots,q_{s,j_{s}^{*}},p_{s+1},\dots,p_{k}])
=minjs[n]MINC([q1,j1,,qs1,js1,qs,js,ps+1,,pk])\displaystyle=\min_{j_{s}^{\prime}\in[n]}\textsf{MIN}_{C}([q_{1,j_{1}^{*}},\dots,q_{s-1,j_{s-1}^{*}},q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}])
=minjs[n]minj[n]ks.t. j1=j1,,js1=js1,js=jsCji=1k[pi]ji\displaystyle=\min_{j_{s}^{\prime}\in[n]}\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots,j_{s-1}=j_{s-1}^{*},j_{s}=j_{s}^{\prime}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}
=minj[n]ks.t. j1=j1,,js1=js1Cji=1k[pi]ji\displaystyle=\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots,j_{s-1}=j_{s-1}^{*}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}
=MINC([p1,,pk]),\displaystyle=\textsf{MIN}_{C}([p_{1},\dots,p_{k}]),

where above the first and third steps are by Observation 4.5, the second step is by construction of jsj_{s}^{*}, the fourth step is by simplifying, and the final step is by definition of (j1,,js1)(j_{1}^{*},\dots,j_{s-1}^{*}) being a partial solution of size s1s-1. We conclude that (j1,,js)(j_{1}^{*},\dots,j_{s}^{*}) is a partial solution of size ss, as desired. ∎
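The reduction in this proof is direct enough to implement. A minimal sketch (ours): given any black-box implementation of MINC\textsf{MIN}_{C} that accepts a (k×n) matrix of weights, it recovers a minimizing tuple with nknk calls, following the prefix-fixing scheme of Observation 4.5.

import numpy as np

def argmin_from_min(min_oracle, C_max, p):
    # ARGMIN_C from n*k calls to MIN_C. p has shape (k, n); min_oracle maps
    # such a weight matrix to the value MIN_C of that matrix.
    k, n = p.shape
    M = 2 * C_max + 2 * np.abs(p).max(axis=1).sum() + 1  # as in Observation 4.5
    q = p.copy()
    prefix = []
    for s in range(k):
        best_j, best_val = None, np.inf
        for j in range(n):
            q[s] = np.full(n, -M)
            q[s, j] = p[s, j]        # the vector q_{s,j}
            val = min_oracle(q)      # min restricted to tuples with this prefix
            if val < best_val:
                best_j, best_val = j, val
        q[s] = np.full(n, -M)
        q[s, best_j] = p[s, best_j]  # freeze coordinate s at j_s*
        prefix.append(best_j)
    return tuple(prefix), best_val   # best_val equals MIN_C(p)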

We can now conclude the proof of the main result of §4.1.

Proof of Theorem 4.1.

The implication “(i) \implies (ii)” is trivial, and the implication “(ii) \implies (iii)” is shown in (AltBoi20hard, ). It therefore suffices to show the implication “(iii) \implies (i)”. This follows from combining the fact that ELLIPSOID solves MOTC\textsf{MOT}_{C} in polynomial time given an efficient implementation of ARGMINC\textsf{ARGMIN}_{C} (Theorem 4.3), with the fact that the MINC\textsf{MIN}_{C} and ARGMINC\textsf{ARGMIN}_{C} oracles are polynomial-time equivalent (Lemma 4.4). ∎

4.1.3 Practical implementation via Column Generation

Although ELLIPSOID enjoys powerful theoretical runtime guarantees, it is slow in practice because the classical Ellipsoid algorithm is. Nevertheless, whenever ELLIPSOID is applicable (i.e., whenever the MINC\textsf{MIN}_{C} oracle can be efficiently implemented), we can use an alternative practical algorithm, namely the delayed Column Generation method COLGEN, to compute exact, sparse solutions to MOT.

For completeness, we briefly recall the idea behind COLGEN; for further details see the standard textbook (BerTsi97, , §6.1). COLGEN runs the Simplex method, keeping only basic variables in the tableau. Each time that COLGEN needs to find a Simplex variable on which to pivot, it solves the “pricing problem” of finding a variable with negative reduced cost. This is the key subroutine in COLGEN. In the present context of the MOT LP, this pricing problem is equivalent to a call to the ARGMIN violation oracle (see (BerTsi97, , Definition 3.2) for the definition of reduced costs). By the polynomial-time equivalence of the ARGMIN and MIN oracles shown in Lemma 4.4, this bottleneck subroutine in COLGEN can be computed in polynomial time whenever the MIN oracle can. For easy recall, we summarize this discussion as follows.

Theorem 4.6 (Standard guarantee for COLGEN; Section 6.1 of (BerTsi97, )).

For any T>0T>0, one can implement TT iterations of COLGEN in poly(n,k,T)\mathrm{poly}(n,k,T) time and calls to the MINC\textsf{MIN}_{C} oracle. When COLGEN terminates, it returns an optimal vertex solution, which is given as a sparse tensor with at most nkk+1nk-k+1 non-zero entries.

Note that COLGEN does not have a theoretical guarantee stating that it terminates after a polynomial number of iterations. But it often performs well in practice and terminates after a small number of iterations, leading to much better empirical performance than ELLIPSOID.
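For concreteness, a sketch (ours) of the COLGEN loop just described, using SciPy's linprog for the restricted LP. We seed the active set with a greedy feasible coupling (the same scheme as Step 2 of Algorithm 2), and we read the dual weights off SciPy's eqlin.marginals field; the sign convention of that field should be verified against the SciPy version in use.

import numpy as np
from scipy.optimize import linprog

def greedy_feasible_support(mus):
    # A feasible initial coupling: repeatedly saturate the largest residual
    # index of each marginal (at most n*k columns).
    r, S = mus.astype(float).copy(), []
    while r.max() > 1e-12:
        j = tuple(int(np.argmax(r[i])) for i in range(len(r)))
        alpha = min(r[i][j[i]] for i in range(len(r)))
        S.append(j)
        for i in range(len(r)):
            r[i][j[i]] -= alpha
    return S

def colgen(C, mus, argmin_oracle, max_iters=100, tol=1e-9):
    # Delayed column generation for MOT: solve the LP restricted to the active
    # columns S, then price a new column with the ARGMIN_C violation oracle.
    k, n = mus.shape
    S = greedy_feasible_support(mus)
    for _ in range(max_iters):
        c = np.array([C[j] for j in S])
        A_eq = np.zeros((k * n, len(S)))
        for col, j in enumerate(S):
            for i in range(k):
                A_eq[i * n + j[i], col] = 1.0
        res = linprog(c, A_eq=A_eq, b_eq=mus.reshape(-1),
                      bounds=(0, None), method="highs")
        p = res.eqlin.marginals.reshape(k, n)  # dual weights (check sign convention)
        j_new, val = argmin_oracle(p)          # pricing problem = ARGMIN_C(p)
        if val >= -tol:
            return S, res                      # no negative reduced cost: optimal
        S.append(j_new)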

4.2 The Multiplicative Weights Update algorithm and the AMIN oracle

The second classical algorithm for solving implicitly-structured LPs that we study in the context of MOT is the Multiplicative Weights Update algorithm MWU (young2001sequential, ). The objective of this section is to prove the following guarantees for its specialization to MOT.

Theorem 4.7.

For any family of cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k}, the following are equivalent:

  • (i)

    For any ε>0\varepsilon>0, MWU takes poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time to solve the MOTC\textsf{MOT}_{C} problem ε\varepsilon-approximately. (Moreover, it outputs a sparse solution with at most poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) non-zero entries.)

  • (ii)

    There exists a poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon)-time algorithm that solves the MOTC\textsf{MOT}_{C} problem ε\varepsilon-approximately for any ε>0\varepsilon>0.

  • (iii)

    There exists a poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon)-time algorithm that solves the AMINC\textsf{AMIN}_{C} problem ε\varepsilon-approximately for any ε>0\varepsilon>0.

Interpretation of results.

Similarly to the analogous Theorem 4.1 for ELLIPSOID, the equivalence “(i) \Longleftrightarrow (ii)” establishes that MWU can approximately solve any MOT problem in polynomial time that any other algorithm can. Thus, from a theoretical perspective, restricting to MWU for approximately solving MOT problems is at no loss of generality. In words, the equivalence “(ii) \Longleftrightarrow (iii)” establishes that approximating MOT and approximating MIN are polynomial-time equivalent. Thus we may investigate when MOT is tractable to approximate by instead investigating the more amenable question of when MIN is tractable (see §1.1.4) at no loss of generality.

Theorem 4.7 is new to this work. In particular, equivalences between problems with polynomially small error do not fall under the purview of classical LP theory, which deals with exponentially small error (GLSbook, ). Our use of the MWU algorithm exploits a simple reduction of MOT to a mixed packing-covering LP that has appeared in the k=2k=2 matrix case of Optimal Transport in (BlaJamKenSid18, ; Qua18, ), where implicit LP is not necessary for polynomial runtime.

Organization of §4.2.

In §4.2.1 we present the specialization of Multiplicative Weights Update to MOT, and recall how it runs in polynomial time and calls to a certain bottleneck oracle. In §4.2.2, we show that this bottleneck oracle is equivalent to the AMIN oracle, and then use this to prove Theorem 4.7.

4.2.1 Algorithm

Here we present the MWU algorithm, which combines the generic Multiplicative Weights Update algorithm of (young2001sequential, ) specialized to MOT, along with a final rounding step that ensures feasibility of the solution.

In order to present MWU, it is convenient to assume that the cost CC has entries in the range [1,2][1,2]\subset\mathbb{R}, which is at no loss of generality by simply translating and rescaling the cost (see §4.2.2), and can be done implicitly given a bound on CmaxC_{\max}. This is why in the rest of this subsection, every runtime dependence on ε\varepsilon is polynomial in 1/ε1/\varepsilon for costs in the range [1,2][1,2]; after transformation back to [Cmax,Cmax][-C_{\max},C_{\max}], this is polynomial dependence in the standard scale-invariant quantity Cmax/εC_{\max}/\varepsilon.

Since the cost CC is assumed to have non-negative entries, for any λ[1,2]\lambda\in[1,2], the polytope

K(λ)={P(μ1,,μk):C,Pλ}\displaystyle K(\lambda)=\{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}):\langle C,P\rangle\leqslant\lambda\}

of couplings with cost at most λ\lambda is a mixed packing-covering polytope (i.e., all variables are non-negative and all constraints have non-negative coefficients). Note that K(λ)K(\lambda) is non-empty if and only if MOTC(μ)\textsf{MOT}_{C}(\mu) has value at most λ\lambda. Thus, modulo a binary search on λ\lambda, this reduces computing the value of MOTC(μ)\textsf{MOT}_{C}(\mu) to the task of detecting whether K(λ)K(\lambda) is empty. Since K(λ)K(\lambda) is a mixed packing-covering polytope, the Multiplicative Weights Update algorithm of (young2001sequential, ) determines whether K(λ)K(\lambda) is empty, and runs in polynomial time apart from one bottleneck, which we now define.

In order to define the bottleneck, we first define a potential function. For this, we define the softmax analogously to the softmin as

smax(a1,,at)=smin(a1,,at)=log(i=1teai).\operatorname*{smax}(a_{1},\ldots,a_{t})=-\operatorname*{smin}(-a_{1},\ldots,-a_{t})=\log\left(\sum_{i=1}^{t}e^{a_{i}}\right).

Here we use regularization parameter η=1\eta=1 for simplicity, since this suffices for analyzing MWU, and thus we have dropped this index η\eta for shorthand.

Definition 4.8 (Potential function for MWU).

Fix a cost C(n)kC\in(\mathbb{R}^{n})^{\otimes k}, target marginals μ(Δn)k\mu\in(\Delta_{n})^{k}, and target value λ\lambda\in\mathbb{R}. Define the potential function Φ:=ΦC,μ,λ:(0n)k\Phi:=\Phi_{C,\mu,\lambda}:(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}\to\mathbb{R} by

Φ(P)=smax(C,Pλ,m1(P)μ1,,mk(P)μk).\Phi(P)=\operatorname*{smax}\left(\frac{\langle C,P\rangle}{\lambda},\frac{m_{1}(P)}{\mu_{1}},\ldots,\frac{m_{k}(P)}{\mu_{k}}\right).

The softmax in the above expression is interpreted as a softmax over the nk+1nk+1 values in the concatenation of vectors and scalars in its input. (This slight abuse of notation significantly reduces notational overhead.)

Given this potential function, we now define the bottleneck operation for MWU: find a direction j[n]k\vec{j}\in[n]^{k} in which PP can be increased such that the potential is increased as little as possible.

Definition 4.9 (Bottleneck oracle for MWU).

Given iterate P(0n)kP\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}, target marginals μ(Δn)k\mu\in(\Delta_{n})^{k}, target value λ\lambda\in\mathbb{R}, and accuracy ε>0\varepsilon>0, MWU_BOTTLENECKC(P,μ,λ,ε)\textsf{MWU\_BOTTLENECK}_{C}(P,\mu,\lambda,\varepsilon) either:

  • Outputs “null”, certifying that minj[n]khΦ(P+hδj)h=0>1\min_{\vec{j}\in[n]^{k}}\frac{\partial}{\partial h}\Phi(P+h\cdot\delta_{\vec{j}})\mid_{h=0}>1, or

  • Outputs j[n]k\vec{j}\in[n]^{k} such that hΦ(P+hδj)h=01+ε\frac{\partial}{\partial h}\Phi(P+h\cdot\delta_{\vec{j}})\mid_{h=0}\leqslant 1+\varepsilon.

(If minj[n]khΦ(P+hδj)h=0\min_{\vec{j}\in[n]^{k}}\frac{\partial}{\partial h}\Phi(P+h\cdot\delta_{\vec{j}})\mid_{h=0} is within (1,1+ε](1,1+\varepsilon], then either return behavior is a valid output.)
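For concreteness, here is a brute-force sketch (ours) of the potential Φ\Phi and of the derivative that the bottleneck oracle thresholds; it evaluates the softmax gradient directly over the nk+1nk+1 values, and can be used to test implementations of MWU_BOTTLENECK at small nn and kk.

import numpy as np

def _values_and_grads(P, C, mus, lam, j):
    # The nk+1 softmax arguments of Phi at P, and their derivatives in the
    # direction delta_j.
    k = P.ndim
    vals, grads = [np.vdot(C, P) / lam], [C[j] / lam]
    for i in range(k):
        m_i = P.sum(axis=tuple(ax for ax in range(k) if ax != i))
        vals.extend(m_i / mus[i])
        g = np.zeros_like(mus[i])
        g[j[i]] = 1.0 / mus[i][j[i]]
        grads.extend(g)
    return np.array(vals), np.array(grads)

def potential(P, C, mus, lam):
    # Phi(P) = smax of the nk+1 values, via a stable log-sum-exp.
    vals, _ = _values_and_grads(P, C, mus, lam, (0,) * P.ndim)
    vmax = vals.max()
    return vmax + np.log(np.exp(vals - vmax).sum())

def bottleneck_derivative(P, C, mus, lam, j):
    # d/dh Phi(P + h * delta_j) at h = 0: the softmax-weighted average of the
    # directional derivatives of the softmax arguments.
    vals, grads = _values_and_grads(P, C, mus, lam, j)
    w = np.exp(vals - vals.max())
    return float((w / w.sum()) @ grads)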

Pseudocode for the MWU algorithm is given in Algorithm 2. We prove that MWU runs in polynomial time given access to this bottleneck oracle.

Theorem 4.10.

Let the entries of the cost CC lie in the range [1,2][1,2]. Given λ\lambda\in\mathbb{R} and accuracy parameter ε>0\varepsilon>0, MWU either certifies that MOTC(μ)λ\textsf{MOT}_{C}(\mu)\leqslant\lambda, or returns a poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon)-sparse P(μ1,,μk)P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}) satisfying C,Pλ+8ε\langle C,P\rangle\leqslant\lambda+8\varepsilon.

Furthermore, the loop in Step 1 runs in O~(nk/ε2)\tilde{O}(nk/\varepsilon^{2}) iterations, and Step 2 runs in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time.

The MWU algorithm can be used to output an O(ε)O(\varepsilon)-approximate solution for MOT via an outer loop that performs binary search over λ\lambda; this only incurs O(log(1/ε))O(\log(1/\varepsilon)) multiplicative overhead in runtime.

Algorithm 2 MWU: specialization of Multiplicative Weights Update (young2001sequential, ) to MOT
1:Accuracy ε>0\varepsilon>0, marginals μ1,,μkΔn\mu_{1},\dots,\mu_{k}\in\Delta_{n}, target value λ>0\lambda>0
2:Either certifies MOTC(μ)>λ\textsf{MOT}_{C}(\mu)>\lambda by returning “infeasible”, or returns a solution PP with C,Pλ+8ε\langle C,P\rangle\leqslant\lambda+8\varepsilon
3:\\  Assume cost CC satisfies Cj[1,2]C_{\vec{j}}\in[1,2] for all j[n]k\vec{j}\in[n]^{k} (without loss of generality by rescaling)
4:\\  Step 1: Multiplicative Weights Update
5:P0(0n)kP\leftarrow 0\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}, η2(log(nk+1))/ε\eta\leftarrow 2(\log(nk+1))/\varepsilon
6:while m(P)<ηm(P)<\eta do \triangleright While total mass is small
7:     jMWU_BOTTLENECKC(P,μ,λ,ε)\vec{j}\leftarrow\textsf{MWU\_BOTTLENECK}_{C}(P,\mu,\lambda,\varepsilon) \triangleright Bottleneck: find direction with small potential increase
8:     if j=\vec{j}=“null” then return “infeasible” \triangleright Infeasible if no good direction
9:     else PP+δjεmin(λ/Cj,mini[μi]ji)P\leftarrow P+\delta_{\vec{j}}\cdot\varepsilon\cdot\min(\lambda/C_{\vec{j}},\min_{i}[\mu_{i}]_{j_{i}}) \triangleright Else, increase in good direction      
10:PP/(η(1+ε)4)P\leftarrow P/(\eta(1+\varepsilon)^{4}) \triangleright Rescale
11:
12:\\  Step 2: round to transportation polytope
13:while m(P)<1m(P)<1 do \triangleright Until all constraints are satisfied
14:     jiargmaxj[n]([μi]j[mi(P)]j)j_{i}\leftarrow\operatorname*{argmax}_{j\in[n]}([\mu_{i}]_{j}-[m_{i}(P)]_{j}) for each i[k]i\in[k] \triangleright For each marginal, find unsatisfied constraint
15:     αmini[k]([μi]ji[mi(P)]ji)\alpha\leftarrow\min_{i\in[k]}([\mu_{i}]_{j_{i}}-[m_{i}(P)]_{j_{i}}) \triangleright Maximal mass to add
16:     PP+αδjP\leftarrow P+\alpha\cdot\delta_{\vec{j}} \triangleright Add mass to saturate at least one constraint
Proof.

We analyze Step 1 (Multiplicative Weights Update) and Step 2 (rounding) of MWU separately.

Lemma 4.11 (Correctness of Step 1).

Step 1 of Algorithm 2 runs in O~(nk/ε2)\tilde{O}(nk/\varepsilon^{2}) iterations. It either returns (i) “infeasible”, certifying that K(λ)K(\lambda) is empty; or (ii) finds a poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon)-sparse tensor P(0n)kP\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k} that is approximately in K(λ)K(\lambda), i.e., PP satisfies:

m(P)14ε,C,Pλ, and mi(P)μi for all i[k]m(P)\geqslant 1-4\varepsilon,\quad\langle C,P\rangle\leqslant\lambda,\quad\mbox{ and }\quad m_{i}(P)\leqslant\mu_{i}\mbox{ for all }i\in[k]

Step 1 is the Multiplicative Weights Update algorithm of (young2001sequential, ) applied to the polytope K(λ)K(\lambda), so correctness follows from the analysis of (young2001sequential, ). We briefly recall the main idea behind this algorithm for the convenience of the reader, and provide a full proof in Appendix A.1 for completeness. The main idea behind the algorithm is that on each iteration, j[n]k\vec{j}\in[n]^{k} is chosen so that the increase in the potential Φ(P)\Phi(P) is approximately bounded by the increase in the total mass m(P)m(P). If this is impossible, then the bottleneck oracle returns null, which means K(λ)K(\lambda) is empty. So assume otherwise. Then once the total mass has increased to m(P)=η+O(ε)m(P)=\eta+O(\varepsilon), the potential Φ(P)\Phi(P) must be bounded by η(1+O(ε))\eta(1+O(\varepsilon)). By exploiting the inequality between the max and the softmax, this means that max(C,P/λ,maxi[n],j[k][mi(P)]j/[μi]j)Φ(P)η(1+O(ε))\max(\langle C,P\rangle/\lambda,\max_{i\in[n],j\in[k]}[m_{i}(P)]_{j}/[\mu_{i}]_{j})\leqslant\Phi(P)\leqslant\eta(1+O(\varepsilon)) as well. Thus, rescaling PP by 1/(η(1+O(ε)))1/(\eta(1+O(\varepsilon))) in Line 10 satisfies m(P)1O(ε)m(P)\geqslant 1-O(\varepsilon), C,P/λ1\langle C,P\rangle/\lambda\leqslant 1, and mi(P)/μi1m_{i}(P)/\mu_{i}\leqslant 1. See Appendix A.1 for full details and a proof of the runtime and sparsity claims.

Lemma 4.12 (Correctness of Step 2).

Step 2 of Algorithm 2 runs in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time and returns P(μ1,,μk)P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}) satisfying C,Pλ+8ε\langle C,P\rangle\leqslant\lambda+8\varepsilon. Furthermore, PP only has poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) non-zero entries.

Proof of Lemma 4.12.

By Lemma 4.11, PP satisfies mi(P)μim_{i}(P)\leqslant\mu_{i} for all i[k]i\in[k]. Observe that this is an invariant that holds throughout the execution of Step 22. This, along with the fact that j=1n[mi(P)]j=m(P)\sum_{j=1}^{n}[m_{i}(P)]_{j}=m(P) is equal for all ii, implies that the indices (j1,,jk)(j_{1},\dots,j_{k}) found in Line 14 satisfy [μi]ji[mi(P)]ji>0[\mu_{i}]_{j_{i}}-[m_{i}(P)]_{j_{i}}>0 for each i[k]i\in[k]. Thus in particular α>0\alpha>0 in Line 15. It follows that Line 16 makes at least one more constraint satisfied (in particular the constraint “[μi]ji=[mi(P)]ji[\mu_{i}]_{j_{i}}=[m_{i}(P)]_{j_{i}}” where ii is the minimizer in Line 15). Since there are nknk constraints total to be satisfied, Step 2 terminates in at most nknk iterations. Each iteration increases the number of non-zero entries in PP by at most one, thus PP is poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) sparse throughout. That PP is sparse also implies that each iteration can be performed in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time, thus Step 2 takes poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time overall.

Finally, we establish the quality guarantee on C,P\langle C,P\rangle. By Lemma 4.11, this is at most λ\lambda before starting Step 2. During Step 2, the total mass added to PP is equal to 1m(P)1-m(P). This is upper bounded by 4ε4\varepsilon by Lemma 4.11. Since Cmax2C_{\max}\leqslant 2, we conclude that the value of C,P\langle C,P\rangle is increased by at most 8ε8\varepsilon in Step 2. ∎

Combining Lemmas 4.11 and 4.12 concludes the proof of Theorem 4.10. ∎

4.2.2 Equivalence of bottleneck to AMIN

In order to prove Theorem 4.7, we show that the MWU algorithm can be implemented in polynomial time and calls to the AMIN oracle. First, we prove this fact for the ARGAMIN oracle, which differs from the AMIN oracle in that it also returns a tuple j[n]k\vec{j}\in[n]^{k} that is an approximate minimizer.

Definition 4.13 (Approximate violation oracle for (MOT-D)).

Given weights p=(p1,,pk)n×kp=(p_{1},\ldots,p_{k})\in\mathbb{R}^{n\times k} and accuracy ε>0\varepsilon>0, ARGAMINC\textsf{ARGAMIN}_{C} returns j[n]k\vec{j}\in[n]^{k} that minimizes minj[n]kCji=1k[pi]ji\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} up to additive error ε\varepsilon, and its value up to additive error ε\varepsilon.

Lemma 4.14.

Let the entries of the cost CC lie in the range [1,2][1,2]. The MWU algorithm (Algorithm 2) can be implemented in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time and calls to the ARGAMINC\textsf{ARGAMIN}_{C} oracle with accuracy parameter ε=Θ(ε2/(nk))\varepsilon^{\prime}=\Theta(\varepsilon^{2}/(nk)).

Proof.

We show that on each iteration of Step 1 of Algorithm 2 we can emulate the call to the MWU_BOTTLENECK oracle with one call to the ARGAMIN oracle. Recall that MWU_BOTTLENECKC(P,μ,λ,ε)\textsf{MWU\_BOTTLENECK}_{C}(P,\mu,\lambda,\varepsilon) seeks to find j[n]k\vec{j}\in[n]^{k} such that

Vj:=hΦ(P+hδj)|h=0V_{\vec{j}}:=\frac{\partial}{\partial h}\Phi(P+h\delta_{\vec{j}})\Big{|}_{h=0}

is at most 1+ε1+\varepsilon, or to certify that for all j\vec{j} it is greater than 11. By explicit computation,

Vj\displaystyle V_{\vec{j}} =hlog(exp((C,P+hCj)/λ+s=1kt=1nexp(([ms(P)]t+δt,js)/[μs]t)))|h=0\displaystyle=\frac{\partial}{\partial h}\log\left(\exp\left(\left(\langle C,P\rangle+hC_{\vec{j}}\right)/\lambda+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp\left(\left([m_{s}(P)]_{t}+\delta_{t,j_{s}}\right)/[\mu_{s}]_{t}\right)\right)\right)\Bigg{|}_{h=0}
=(Cji=1k[pi]ji)exp(C,P/λ)/λexp(C,P/λ)+s=1kt=1nexp([ms(P)]t/[μs]t),\displaystyle=\left(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right)\frac{\exp(\langle C,P\rangle/\lambda)/\lambda}{\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}, (4.3)

where the weights p=(p1,,pk)k×np=(p_{1},\ldots,p_{k})\in\mathbb{R}^{k\times n} in the last line are defined as

[pi]j=λexp(C,P/λ)exp([mi(P)]j/[μi]j)[μi]j,i[k],j[n].[p_{i}]_{j}=-\frac{\lambda}{\exp(\langle C,P\rangle/\lambda)}\cdot\frac{\exp([m_{i}(P)]_{j}/[\mu_{i}]_{j})}{[\mu_{i}]_{j}},\qquad\forall i\in[k],j\in[n].

Note that the second term in the product in (4.3) is positive and does not depend on j\vec{j}. This suggests that in order to minimize (4.3), it suffices to compute jARGAMINC(p,ε)\vec{j}\leftarrow\textsf{ARGAMIN}_{C}(p,\varepsilon^{\prime}) for some accuracy parameter ε>0\varepsilon^{\prime}>0.

The main technical difficulty with formalizing this intuitive approach is that the weights pp are not necessarily efficiently computable. Nevertheless, using poly(n,k)\mathrm{poly}(n,k) extra time on each iteration, we can compute the marginals m1(P),,mk(P)m_{1}(P),\ldots,m_{k}(P). Since the ARGAMIN oracle returns an ε\varepsilon^{\prime}-additive approximation of the cost, we can also compute a running estimate c~\tilde{c} of the cost such that, on iteration TT,

c~TεC,Pc~+Tε.\tilde{c}-T\varepsilon^{\prime}\leqslant\langle C,P\rangle\leqslant\tilde{c}+T\varepsilon^{\prime}.

Therefore, we define weights p~n×k\tilde{p}\in\mathbb{R}^{n\times k}, which approximate pp and which can be computed in poly(n,k)\mathrm{poly}(n,k) time on each iteration:

[p~i]j=λexp(c~/λ)exp([mi(P)]j/[μi]j)[μi]j,i[k],j[n].[\tilde{p}_{i}]_{j}=-\frac{\lambda}{\exp(\tilde{c}/\lambda)}\cdot\frac{\exp([m_{i}(P)]_{j}/[\mu_{i}]_{j})}{[\mu_{i}]_{j}},\qquad\forall i\in[k],j\in[n].

We also define the approximate value for any j[n]k\vec{j}\in[n]^{k}:

V~j:=(Cji=1k[p~i]ji)exp(c~/λ)/λexp(c~/λ)+s=1kt=1nexp([ms(P)]t/[μs]t)\displaystyle\tilde{V}_{\vec{j}}:=\left(C_{\vec{j}}-\sum_{i=1}^{k}[\tilde{p}_{i}]_{j_{i}}\right)\frac{\exp(\tilde{c}/\lambda)/\lambda}{\exp(\tilde{c}/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}

It holds that ARGAMINC(p~,ε)\textsf{ARGAMIN}_{C}(\tilde{p},\varepsilon^{\prime}) returns a j[n]k\vec{j}\in[n]^{k} that minimizes Cji=1k[p~i]jiC_{\vec{j}}-\sum_{i=1}^{k}[\tilde{p}_{i}]_{j_{i}} up to multiplicative error 1/(1ε)1/(1-\varepsilon^{\prime}), because the entries of the cost CC are lower-bounded by 1, and [p~i]j0[\tilde{p}_{i}]_{j}\leqslant 0 for all i[k],j[n]i\in[k],j\in[n]. In particular, ARGAMINC(p~,ε)\textsf{ARGAMIN}_{C}(\tilde{p},\varepsilon^{\prime}) minimizes V~j\tilde{V}_{\vec{j}} up to multiplicative error 1/(1ε)1/(1-\varepsilon^{\prime}). We prove the following claim relating VjV_{\vec{j}} and V~j\tilde{V}_{\vec{j}}:

Claim 4.15.

For any j[n]k\vec{j}\in[n]^{k}, on iteration TT, we have Vj/V~j[exp(2Tε/λ),exp(2Tε/λ)]V_{\vec{j}}/\tilde{V}_{\vec{j}}\in[\exp(-2T\varepsilon^{\prime}/\lambda),\exp(2T\varepsilon^{\prime}/\lambda)].

By the above claim, ARGAMINC(p~,ε)\textsf{ARGAMIN}_{C}(\tilde{p},\varepsilon^{\prime}) therefore minimizes VjV_{\vec{j}} up to multiplicative error exp(4Tε/λ)/(1ε)(1+ε/3)\exp(4T\varepsilon^{\prime}/\lambda)/(1-\varepsilon^{\prime})\leqslant(1+\varepsilon/3) if we choose ε=Θ(λε/T)\varepsilon^{\prime}=\Theta(\lambda\varepsilon/T). Thus one can implement MWU_BOTTLENECKC(P,μ,λ,ε)\textsf{MWU\_BOTTLENECK}_{C}(P,\mu,\lambda,\varepsilon) by returning the value of ARGAMINC(p~,ε)\textsf{ARGAMIN}_{C}(\tilde{p},\varepsilon^{\prime}) if its value is estimated to be at most 1+ε/31+\varepsilon/3, and returning “null” otherwise. The bound on the accuracy ε=Θ~(ε2/(nk))\varepsilon^{\prime}=\tilde{\Theta}(\varepsilon^{2}/(nk)) follows since λ[1,2]\lambda\in[1,2] and T=O~(nk/ε2)T=\tilde{O}(nk/\varepsilon^{2}) by Theorem 4.10.

Proof of Claim.

We compare the expressions for VjV_{\vec{j}} and V~j\tilde{V}_{\vec{j}}. Each of these is a product of two terms. Since Cj0C_{\vec{j}}\geqslant 0, and [p~i]ji,[pi]ji0[\tilde{p}_{i}]_{j_{i}},[p_{i}]_{j_{i}}\leqslant 0 for all ii, the ratio of the first terms is

Cji=1k[p~i]jiCji=1k[pi]ji[mini[p~i]ji/[pi]ji,maxi[p~i]ji/[pi]ji][exp(Tε/λ),exp(Tε/λ)],\displaystyle\frac{C_{\vec{j}}-\sum_{i=1}^{k}[\tilde{p}_{i}]_{j_{i}}}{C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}}\in[\min_{i}[\tilde{p}_{i}]_{j_{i}}/[p_{i}]_{j_{i}},\max_{i}[\tilde{p}_{i}]_{j_{i}}/[p_{i}]_{j_{i}}]\subset[\exp(-T\varepsilon^{\prime}/\lambda),\exp(T\varepsilon^{\prime}/\lambda)],

where we have used that, for all i[k]i\in[k],

[p~i]ji/[pi]ji=exp(C,P/λ)/exp(c~/λ)[exp(Tε/λ),exp(Tε/λ)].[\tilde{p}_{i}]_{j_{i}}/[p_{i}]_{j_{i}}=\exp(\langle C,P\rangle/\lambda)/\exp(\tilde{c}/\lambda)\in[\exp(-T\varepsilon^{\prime}/\lambda),\exp(T\varepsilon^{\prime}/\lambda)].

Similarly the ratio of the second terms in the expression for VjV_{\vec{j}} and V~j\tilde{V}_{\vec{j}} is also in the range [exp(Tε/λ),exp(Tε/λ)][\exp(-T\varepsilon^{\prime}/\lambda),\exp(T\varepsilon^{\prime}/\lambda)]. This concludes the proof of the claim. ∎

Finally, we show that the ARGAMIN oracle can be reduced to the AMIN oracle, which completes the proof that MWU can be run with AMIN.

Lemma 4.16 (Equivalence of AMIN and ARGAMIN).

Each of the oracles AMINC\textsf{AMIN}_{C} and ARGAMINC\textsf{ARGAMIN}_{C} with accuracy parameter ε>0\varepsilon>0 can be implemented using poly(n,k)\mathrm{poly}(n,k) calls of the other oracle with accuracy parameter Θ(ε/k)\Theta(\varepsilon/k) and poly(n,k)\mathrm{poly}(n,k) additional time.

It is worth remarking that the equivalence that we show between AMIN and ARGAMIN is not known to hold for the feasibility and separation oracles of general LPs, since the known result for general LPs requires exponentially small error in nknk (GLSbook, , §4.3). However, in the case of MOT the equivalence follows from a direct and practical reduction, similar to the proof of the equivalence of the exact oracles (Lemma 4.4). The main difference is that some care is needed to bound the propagation of the errors of the approximate oracles. For completeness, we provide the full proof of Lemma 4.16 in Appendix A.

We conclude by proving Theorem 4.7.

Proof of Theorem 4.7.

The implication “(i) \implies (ii)” is trivial, and the implication “(ii) \implies (iii)” is shown in (AltBoi20hard, ). It therefore suffices to show the implication “(iii) \implies (i)”. For costs CC with entries in the range [1,2][1,2], this follows from combining the fact that MWU can be implemented to solve MOTC\textsf{MOT}_{C} in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time given an efficient implementation of ARGAMINC\textsf{ARGAMIN}_{C} with polynomially-sized accuracy parameter ε=poly(1/n,1/k,ε)\varepsilon^{\prime}=\mathrm{poly}(1/n,1/k,\varepsilon) (Lemma 4.14), along with the fact that the AMINC\textsf{AMIN}_{C} and ARGAMINC\textsf{ARGAMIN}_{C} oracles are polynomial-time equivalent with polynomially-sized accuracy parameter (Lemma 4.16).

The assumption that CC has entries within the range [1,2][1,2] can be removed with no loss by translating and rescaling the original cost C(C+3Cmax)/(2Cmax)C^{\prime}\leftarrow(C+3C_{\max})/(2C_{\max}) and running Algorithm 2 on CC^{\prime} with approximation parameter εε/(2Cmax)\varepsilon^{\prime}\leftarrow\varepsilon/(2C_{\max}). Each τ\tau^{\prime}-approximate query to the AMINC\textsf{AMIN}_{C^{\prime}} oracle can be simulated by a τ\tau-approximate query to the AMINC\textsf{AMIN}_{C} oracle, where τ=2Cmaxτ\tau=2C_{\max}\tau^{\prime}. ∎

Remark 4.17 (Practical optimizations).

Our numerical implementation of MWU has two modifications that provide practical speedups. One is maintaining a cached list of the tuples j[n]k\vec{j}\in[n]^{k} previously returned by calls to MWU_BOTTLENECK. Whenever MWU_BOTTLENECK is called, we first check whether any tuple j\vec{j} in the cache satisfies the desiderata hΦ(P+hδj)h=01+ε\frac{\partial}{\partial h}\Phi(P+h\cdot\delta_{\vec{j}})\mid_{h=0}\leqslant 1+\varepsilon, in which case we use this j\vec{j} to answer the oracle query. Otherwise, we answer the oracle query using AMIN as explained above. In practice, this cache allows us to avoid many calls to the potentially expensive AMIN bottleneck. Our second optimization is that, at each iteration of MWU, we check whether the current iterate PP can be rescaled in order to satisfy the guarantees in Lemma 4.11 required from Step 1. If so, we stop Step 1 early and use this rescaled PP.

4.3 The Sinkhorn algorithm and the SMIN oracle

The Sinkhorn algorithm (SINKHORN) is specially tailored to MOT, and does not apply to general exponential-size LP. Currently it is by far the most popular algorithm in the MOT literature (see §1.3). However, in general each iteration of SINKHORN takes exponential time nΘ(k)n^{\Theta(k)}, and it is not well-understood when it can be implemented in polynomial-time. The objective of this section is to show that this bottleneck is polynomial-time equivalent to the SMIN oracle, and in doing so put SINKHORN on equal footing with classical implicit LP algorithms in terms of their reliance on variants of the dual feasibility oracle for MOT. Concretely, this lets us establish the following two results.

First, SINKHORN can solve MOT in polynomial time whenever SMIN can be solved in polynomial time.

Theorem 4.18.

For any family of cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k} and accuracy ε>0\varepsilon>0, SINKHORN solves MOTC\textsf{MOT}_{C} to ε\varepsilon accuracy in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time and poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) calls to the SMINC\textsf{SMIN}_{C} oracle with regularization η=(2klogn)/ε\eta=(2k\log n)/\varepsilon. (The solution is output through a polynomial-size implicit representation, see §4.3.1.)

Second, we show that SINKHORN requires strictly more structure than other algorithms do to solve an MOT problem. This is why the results about ELLIPSOID (Theorem 4.1) and MWU (Theorem 4.7) state that those algorithms solve an MOT problem whenever possible, whereas Theorem 4.18 cannot be analogously extended to such an “if and only if” characterization.

Theorem 4.19.

There is a family of cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k} for which ELLIPSOID solves MOTC\textsf{MOT}_{C} exactly in poly(n,k)\mathrm{poly}(n,k) time, however it is #\#BIS-hard to implement a single iteration of SINKHORN in poly(n,k)\mathrm{poly}(n,k) time.

Organization of §4.3.

In §4.3.1, we recall this SINKHORN algorithm and how it depends on a certain marginalization oracle. In §4.3.2, we show that this marginalization oracle is polynomial-time equivalent to the SMIN oracle, and use this to prove Theorems 4.18 and 4.19.

4.3.1 Algorithm

Here we recall SINKHORN and its known guarantees. To do this, we first define the following oracle. While this oracle does not have an interpretation as a dual feasibility oracle, we show below that it is polynomial-time equivalent to SMIN, which is a specific type of approximate dual feasibility oracle (Remark 3.6).

Definition 4.20 (MARG).

Given scalings d=(d1,,dk)0n×kd=(d_{1},\dots,d_{k})\in\mathbb{R}_{\geqslant 0}^{n\times k}, regularization η>0\eta>0, and an index i[k]i\in[k], the marginalization oracle MARGC(d,η,i)\textsf{MARG}_{C}(d,\eta,i) returns the vector mi((i=1kdi)exp[ηC])0nm_{i}((\otimes_{i^{\prime}=1}^{k}d_{i^{\prime}})\odot\exp[-\eta C])\in\mathbb{R}_{\geqslant 0}^{n}.
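For intuition, here is a brute-force implementation (ours) of the marginalization oracle for an explicitly stored cost tensor; it takes n^k time, so polynomial-time implementations must instead exploit structure in CC.

import numpy as np

def marg_oracle_dense(C, d, eta, i):
    # MARG_C(d, eta, i): the i-th marginal of (d_1 x ... x d_k) * exp[-eta C],
    # where the rank-one scaling is multiplied in axis by axis.
    k = C.ndim
    P = np.exp(-eta * C)
    for ax in range(k):
        shape = [1] * k
        shape[ax] = -1
        P = P * d[ax].reshape(shape)   # multiply in the scaling vector d_ax
    return P.sum(axis=tuple(ax for ax in range(k) if ax != i))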

It is known that SINKHORN can solve MOT with only polynomially many calls to this oracle (LinHoJor19, ). The approximate solution that SINKHORN computes is a fully dense tensor with nkn^{k} non-zero entries, but it is output implicitly in O(nk)O(nk) space through “scaling vectors” and “rounding vectors”, described below.

Theorem 4.21 (SINKHORN guarantee, (LinHoJor19, )).

Algorithm 3 computes an ε\varepsilon-approximate solution to MOTC(μ)\textsf{MOT}_{C}(\mu) using poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) calls to the MARGC\textsf{MARG}_{C} oracle with parameter η=(2klogn)/ε\eta=(2k\log n)/\varepsilon, and poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) additional time. The solution is of the form

P=(i=1kdi)exp[ηC]+(i=1kvi),\displaystyle P=\left(\otimes_{i=1}^{k}d_{i}\right)\odot\exp[-\eta C]+\left(\otimes_{i=1}^{k}v_{i}\right), (4.4)

and is output implicitly via the scaling vectors d1,,dk0nd_{1},\dots,d_{k}\in\mathbb{R}_{\geqslant 0}^{n} and rounding vectors v1,,vk0nv_{1},\dots,v_{k}\in\mathbb{R}_{\geqslant 0}^{n}.

Algorithm 3 SINKHORN: multidimensional analog of classical Sinkhorn scaling

Input: Cost C(n)kC\in(\mathbb{R}^{n})^{\otimes k}, marginals μ1,,μkΔn\mu_{1},\dots,\mu_{k}\in\Delta_{n}
      Output: Implicit representation of tensor (4.4) that is an ε\varepsilon-approximate solution to MOTC(μ)\textsf{MOT}_{C}(\mu)

1: // Step 1: scale
2: d_{1},\dots,d_{k}\leftarrow\mathbf{1} and \eta\leftarrow(2k\log n)/\varepsilon ▷ Initialize (no scaling)
3: for \mathrm{poly}(n,k,C_{\max}/\varepsilon) iterations do
4:      Choose i\in[k] ▷ Round-robin, greedily, or randomly
5:      \tilde{\mu}_{i}\leftarrow\textsf{MARG}_{C}(d,\eta,i) ▷ Bottleneck: compute i-th marginal
6:      d_{i}\leftarrow d_{i}\odot(\mu_{i}/\tilde{\mu}_{i}) ▷ Rescale i-th marginal (division is entrywise)
7: // Step 2: round to transportation polytope
8: for i=1,\dots,k do ▷ Rescale each marginal to be below marginal constraints
9:      \tilde{\mu}_{i}\leftarrow\textsf{MARG}_{C}(d,\eta,i) ▷ Bottleneck: compute i-th marginal
10:      d_{i}\leftarrow d_{i}\odot\min[\mathbf{1},\mu_{i}/\tilde{\mu}_{i}] ▷ Rescale i-th marginal (operations are entrywise)
11: v_{i}\leftarrow\mu_{i}-\textsf{MARG}_{C}(d,\eta,i) for each i\in[k] ▷ Add back mass
12: v_{1}\leftarrow v_{1}/\|v_{1}\|_{1}^{k-1} ▷ Rescale so that (4.4) satisfies the marginal constraints
13: Return d_{1},\dots,d_{k} and v_{1},\dots,v_{k} ▷ Implicit representation of solution (4.4)
Sketch of algorithm.

Full details and a proof are in (LinHoJor19, ). We give a brief overview here for completeness. The main idea of SINKHORN is to solve RMOT, the entropically regularized variant of MOT described in §2.2. On one hand, this provides an ε\varepsilon-approximate solution to MOT by taking the regularization parameter η=Θ(ε1klogn)\eta=\Theta(\varepsilon^{-1}k\log n) sufficiently high (Lemma 2.4). On the other hand, solving RMOT rather than MOT enables exploiting the first-order optimality conditions of RMOT, which imply that the unique solution to RMOT is the unique tensor in (μ1,,μk)\mathcal{M}(\mu_{1},\dots,\mu_{k}) of the form

P=(i=1kdi)K,\displaystyle P^{*}=(\otimes_{i=1}^{k}d_{i}^{*})\odot K, (4.5)

where KK denotes the entrywise exponentiated tensor exp[ηC]\exp[-\eta C], and d1,,dk0nd_{1}^{*},\dots,d_{k}^{*}\in\mathbb{R}_{\geqslant 0}^{n} are non-negative vectors. The SINKHORN algorithm approximately computes this solution in two steps.

The first and main step of Algorithm 3 is the natural multimarginal analog of the Sinkhorn scaling algorithm (Sin67, ). It computes an approximate solution P=(i=1kdi)KP=(\otimes_{i=1}^{k}d_{i})\odot K by finding d1,,dkd_{1},\dots,d_{k} such that PP is nearly feasible in the sense that mi(P)μim_{i}(P)\approx\mu_{i} for each i[k]i\in[k]. Briefly, it does this via alternating optimization: initialize did_{i} to the all-ones vector 𝟏n\mathbf{1}\in\mathbb{R}^{n}, and then iteratively update one did_{i} so that the ii-th marginal mi(P)m_{i}(P) of the current scaled iterate P=(i=1kdi)KP=(\otimes_{i=1}^{k}d_{i})\odot K is μi\mu_{i}. Although correcting one marginal can detrimentally affect the others, this algorithm nevertheless converges—in fact, in a polynomial number of iterations (LinHoJor19, ).

The second step of Algorithm 3 is the natural multimarginal analog of the rounding algorithm (AltWeeRig17, , Algorithm 2). It rounds the solution P=(\otimes_{i=1}^{k}d_{i})\odot K found in step one to the transportation polytope \mathcal{M}(\mu_{1},\dots,\mu_{k}). Briefly, it performs this by scaling each marginal m_{i}(P) to be entrywise less than the desired \mu_{i}, and then adding mass back to P so that all marginal constraints are exactly satisfied. The former adjustment is done by adjusting the diagonal scalings d_{1},\dots,d_{k}, and the latter adjustment is done by adding a rank-1 term \otimes_{i=1}^{k}v_{i}.

Critically, note that Algorithm 3 takes polynomial time except for possibly the calls to the MARGC\textsf{MARG}_{C} oracle. In the absence of structure in the cost tensor CC, evaluating this MARGC\textsf{MARG}_{C} oracle takes exponential time because it requires computing marginals of a tensor with nkn^{k} entries.
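To make the preceding discussion concrete, here is a minimal Python sketch of Algorithm 3 that treats \textsf{MARG}_{C} as a black-box callback (for instance, the brute-force reference above, or a polynomial-time structured implementation). This is a simplified sketch rather than the authors' code: it uses a round-robin index choice and a fixed iteration budget, and omits stopping criteria and numerical safeguards.

```python
import numpy as np

def sinkhorn_mot(marg, mu, eta, n_iters):
    """Sketch of Algorithm 3, given a callback marg(d, eta, i) for MARG_C.

    mu : list of k marginal distributions (nonnegative length-n, summing to 1)
    Returns the scaling vectors d_1..d_k and rounding vectors v_1..v_k of the
    implicit representation (4.4).
    """
    k, n = len(mu), len(mu[0])
    d = [np.ones(n) for _ in range(k)]

    # Step 1: scale, with a round-robin choice of update index.
    for t in range(n_iters):
        i = t % k
        d[i] *= mu[i] / marg(d, eta, i)  # match the i-th marginal (entrywise)

    # Step 2: round onto the transportation polytope.
    for i in range(k):
        d[i] *= np.minimum(1.0, mu[i] / marg(d, eta, i))  # push marginals below mu_i
    v = [mu[i] - marg(d, eta, i) for i in range(k)]       # mass still missing
    v[0] /= np.sum(v[0]) ** (k - 1)  # all v_i have equal mass, so (4.4) is feasible
    return d, v
```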

We conclude this discussion with several remarks about SINKHORN.

Remark 4.22 (Choice of update index in SINKHORN).

In line 4 there are several ways to choose update indices, all of which lead to the polynomial iteration complexity we desire. Iteration-complexity bounds are shown for a greedy choice in (LinHoJor19, ; Fri20, ). Similar bounds can be shown for random and round-robin choices by adapting the techniques of (AltPar20bal, ). These latter two choices do not incur the overhead of kk MARG computations per iteration required by the greedy choice, which is helpful in practice. Empirically, we observe that round-robin works quite well, and we use this in our experiments.

Remark 4.23 (Alternative implementations of SINKHORN).

For simplicity, Algorithm 3 provides pseudocode for the “vanilla” version of SINKHORN, as it performs well in practice and achieves the polynomial iteration complexity we desire. There are several variants in the literature, including accelerated versions and variants that first round small entries of the marginals—these have iteration-complexity bounds with better polynomial dependence on \varepsilon and k, albeit sometimes at the expense of larger polynomial factors in n (LinHoJor19, ; tupitsa2020multimarginal, ).

Remark 4.24 (Output of SINKHORN and efficient downstream tasks).

While the output P of SINKHORN is fully dense with n^{k} non-zero entries, its specific form (4.4) enables performing downstream tasks in polynomial time. This is conditional on a polynomial-time \textsf{MARG}_{C} oracle, which is no loss of generality since such an oracle is required for running SINKHORN in polynomial time in the first place. The basic idea is that P is a mixture of two simple distributions (modulo normalization). The first is (\otimes_{i=1}^{k}d_{i})\odot\exp[-\eta C], which is marginalizable using \textsf{MARG}_{C}. The second is \otimes_{i=1}^{k}v_{i}, which is easily marginalizable since it is a product distribution (as the v_{i} are non-negative). Together, this enables efficient marginalization of P. By recursively marginalizing conditional distributions, this enables efficient sampling from P, which in turn enables efficient estimation of bounded statistics of P (e.g., the cost \langle C,P\rangle) by Hoeffding’s inequality.
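As an illustration of the first step in this pipeline, the following sketch (notation ours) computes a marginal of the implicit representation (4.4) with a single \textsf{MARG}_{C} call plus O(nk) extra work; the rank-1 part marginalizes in closed form.

```python
import numpy as np

def marginal_of_implicit(marg, d, v, eta, i):
    """i-th marginal of P = (⊗ d_i) ⊙ exp[-eta C] + (⊗ v_i), from (4.4)."""
    mass_other = np.prod([v_j.sum() for j, v_j in enumerate(v) if j != i])
    return marg(d, eta, i) + v[i] * mass_other  # scaled-kernel part + product part
```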

4.3.2 Equivalence of bottleneck to SMIN

Although Theorem 4.21 shows that SINKHORN solves MOT in polynomial time using the MARG oracle, this by itself proves neither Theorem 4.18 nor Theorem 4.19. In order to prove these results, we show that SMIN and MARG are polynomial-time equivalent.

Lemma 4.25 (Equivalence of MARG and SMIN).

For any regularization \eta>0, each of the oracles \textsf{MARG}_{C} and \textsf{SMIN}_{C} can be implemented using \mathrm{poly}(n) calls to the other oracle and \mathrm{poly}(n,k) additional time.

Proof.

Reduction from SMIN to MARG. First, we show how to compute SMINC(p,η)\textsf{SMIN}_{C}(p,\eta) using one call to the marginalization oracle and O(n)O(n) additional time. Consider the entrywise exponentiated matrix d=exp[ηp]0n×kd=\exp[\eta p]\in\mathbb{R}_{\geqslant 0}^{n\times k}, and let μ1=m1((i=1kdi)exp[ηC])\mu_{1}=m_{1}((\otimes_{i=1}^{k}d_{i})\odot\exp[-\eta C]) be the answer to MARGC(d,η,1)\textsf{MARG}_{C}(d,\eta,1). Observe that

η1log(j1=1n[μ1]j1)\displaystyle-\eta^{-1}\log\left(\sum_{j_{1}=1}^{n}[\mu_{1}]_{j_{1}}\right) =η1log(j1=1nj2,,jk[n]i=1k[di]jieηCj)\displaystyle=-\eta^{-1}\log\left(\sum_{j_{1}=1}^{n}\sum_{j_{2},\dots,j_{k}\in[n]}\prod_{i=1}^{k}[d_{i}]_{j_{i}}e^{-\eta C_{\vec{j}}}\right)
=η1log(j[n]keη(Cji=1k[pi]ji))\displaystyle=-\eta^{-1}\log\left(\sum_{\vec{j}\in[n]^{k}}e^{-\eta(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}})}\right)
=sminηj[n]k(Cji=1k[pi]ji),\displaystyle=\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in[n]^{k}}\left(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right),

where above the first step is by definition of μ1\mu_{1}, the second step is by definition of dd and combining the sums, and the third step is by definition of smin\operatorname*{smin}. We conclude that η1logj1=1n[μ1]j1-\eta^{-1}\log\sum_{j_{1}=1}^{n}[\mu_{1}]_{j_{1}} is a valid answer to SMINC(p,η)\textsf{SMIN}_{C}(p,\eta). Since this is clearly computable from μ1\mu_{1} in O(n)O(n) time, this establishes the claimed reduction.

Reduction from MARG to SMIN. Next, we show that for any marginalization index i[k]i\in[k] and entry [n]\ell\in[n], it is possible to compute the \ell-th entry of the vector MARGC(d,η,i)\textsf{MARG}_{C}(d,\eta,i) using one call to the SMINC\textsf{SMIN}_{C} oracle and poly(n,k)\mathrm{poly}(n,k) additional time. Define vnv\in\mathbb{R}^{n} to be the vector with \ell-th entry equal to [di][d_{i}]_{\ell}, and all other entries 0. Define the matrix p=η1log[d1,,di1,v,di+1,,dk]¯n×kp=\eta^{-1}\log[d_{1},\ldots,d_{i-1},v,d_{i+1},\ldots,d_{k}]\in\bar{\mathbb{R}}^{n\times k}, where recall that log0=\log 0=-\infty (see §2). Let ss\in\mathbb{R} denote the answer to SMINC(p,η)\textsf{SMIN}_{C}(p,\eta). Observe that

eηs=j[n]keη(Cji=1k[pi]ji)=j[n]k:ji=i=1k[di]jieηCj=[mi((i=1kdi)exp[ηC])],\displaystyle e^{-\eta s}=\sum_{\vec{j}\in[n]^{k}}e^{-\eta(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}})}=\sum_{\vec{j}\in[n]^{k}\;:\;\vec{j}_{i}=\ell}\prod_{i=1}^{k}[d_{i}]_{j_{i}}e^{-\eta C_{\vec{j}}}=\left[m_{i}\left((\otimes_{i=1}^{k}d_{i})\odot\exp[-\eta C]\right)\right]_{\ell},

where above the first step is by definition of s, the second step is by definition of p and v, and the third step is by definition of the marginalization notation m_{i}(\cdot). We conclude that \exp(-\eta s) is a valid answer for the \ell-th entry of the vector \textsf{MARG}_{C}(d,\eta,i). This establishes the claimed reduction since we may repeat this procedure n times to compute all n entries of the vector \textsf{MARG}_{C}(d,\eta,i).
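Both reductions in this proof are short enough to transcribe directly; the following sketch (assuming black-box callbacks marg and smin with the interfaces used above) mirrors them line by line.

```python
import numpy as np

def smin_via_marg(marg, p, eta):
    """SMIN_C(p, eta) from one MARG call, as in the first reduction."""
    d = [np.exp(eta * p_i) for p_i in p]   # d = exp[eta * p], entrywise
    mu_1 = marg(d, eta, 0)                 # first marginal (index i = 1)
    return -np.log(mu_1.sum()) / eta

def marg_via_smin(smin, d, eta, i, n):
    """MARG_C(d, eta, i) from n SMIN calls, as in the second reduction."""
    with np.errstate(divide="ignore"):     # log(0) = -inf is intentional (see §2)
        log_d = [np.log(d_i) / eta for d_i in d]
    out = np.zeros(n)
    for ell in range(n):
        v = np.full(n, -np.inf)            # keep only the ell-th entry of d_i
        v[ell] = log_d[i][ell]
        p = log_d[:i] + [v] + log_d[i + 1:]
        out[ell] = np.exp(-eta * smin(p, eta))
    return out
```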

We can now conclude the proofs of the main results of §4.3.

Proof of Theorem 4.18.

This follows from the fact that SINKHORN approximates \textsf{MOT}_{C} in polynomial time given an efficient implementation of \textsf{MARG}_{C} (Theorem 4.21), combined with the fact that the \textsf{MARG}_{C} and \textsf{SMIN}_{C} oracles are polynomial-time equivalent (Lemma 4.25). ∎

Proof of Theorem 4.19.

Consider the family of cost tensors in Lemma 3.7 for which the MINC\textsf{MIN}_{C} oracle admits a polynomial-time algorithm, but for which the SMINC\textsf{SMIN}_{C} oracle is #BIS-hard. Then on one hand, the ELLIPSOID algorithm solves MOTC\textsf{MOT}_{C} in polynomial time by Theorem 4.1. And on the other hand, it is #BIS-hard to implement a single iteration of SINKHORN because that requires implementing the MARGC\textsf{MARG}_{C} oracle, which is polynomial-time equivalent to the SMINC\textsf{SMIN}_{C} oracle by Lemma 4.25. ∎

5 Application: MOT problems with graphical structure

In this section, we illustrate our algorithmic framework on MOT problems with graphical structure. Although a polynomial-time algorithm is already known for this particular structure (h20gm, ; teh2002unified, ), that algorithm computes solutions that are approximate and dense; see the related work section for details. By combining our algorithmic framework developed above with classical facts about graphical models, we show that it is possible to compute solutions that are exact and sparse in polynomial time.

The section is organized as follows. In §5.1, we recall the definition of graphical structure. In §5.2, we show that the MIN, AMIN, and SMIN oracles can be implemented in polynomial time for cost tensors with graphical structure; from this it immediately follows that all of the MOT algorithms discussed in part 1 of this paper can be implemented in polynomial time. Finally, in §5.3, we demonstrate our results on the popular application of computing generalized Euler flows, which was the original motivation of MOT. Numerical simulations demonstrate how the exact, sparse solutions produced by our new algorithms provide qualitatively better solutions than previously possible in polynomial time.

5.1 Setup

We begin by recalling preliminaries about undirected graphical models, a.k.a., Markov Random Fields. We recall only the relevant background; for further details we refer the reader to the textbooks (KolFri09, ; wainwright2008graphical, ).

In words, graphical models provide a way of encoding the independence structure of a collection of random variables in terms of a graph. The formal definition is as follows. Below, all graphs are undirected, and the notation 2V2^{V} means the power set of VV (i.e., the set of all subsets of VV).

Definition 5.1 (Graphical model structure).

Let 𝒮2[k]\mathcal{S}\subset 2^{[k]}. The graphical model structure corresponding to 𝒮\mathcal{S} is the graph G𝒮=(V,E)G_{\mathcal{S}}=(V,E) with vertices V=[k]V=[k] and edges E={(i,j):i,jS, for some S𝒮}E=\{(i,j):i,j\in S,\text{ for some }S\in\mathcal{S}\}.

Definition 5.2 (Graphical model).

Let 𝒮2[k]\mathcal{S}\subset 2^{[k]}. A probability distribution PP over {Xi}i[k]\{X_{i}\}_{i\in[k]} is a graphical model with structure 𝒮\mathcal{S} if there exist functions {ψS}S𝒮\{\psi_{S}\}_{S\in\mathcal{S}} and normalizing constant ZZ such that

P({xi}i[k])=1ZS𝒮ψS({xi}iS).P\Big{(}\{x_{i}\}_{i\in[k]}\Big{)}=\frac{1}{Z}\prod_{S\in\mathcal{S}}\psi_{S}\Big{(}\{x_{i}\}_{i\in S}\Big{)}.

A standard measure of complexity for graphical models is the treewidth of the underlying graphical model structure G𝒮G_{\mathcal{S}} because this captures not just the storage complexity, but also the algorithmic complexity of performing fundamental tasks such as computing the mode, log-partition function, and marginal distributions (KolFri09, ; wainwright2008graphical, ). There are a number of equivalent definitions of treewidth (bodlaender2007treewidth, ). Each requires defining intermediate combinatorial concepts. We recall here the definition that is based on the concept of a junction tree because this is perhaps the most standard definition in the graphical models community.

Definition 5.3 (Junction tree, treewidth).

A junction tree T=(VT,ET,{Bu}uVT)T=(V_{T},E_{T},\{B_{u}\}_{u\in V_{T}}) for a graph G=(V,E)G=(V,E) consists of a tree (VT,ET)(V_{T},E_{T}) and a set of bags {BuV}uVT\{B_{u}\subseteq V\}_{u\in V_{T}} satisfying:

  • For each variable iVi\in V, the set of nodes Ui={uVT:iBu}U_{i}=\{u\in V_{T}:i\in B_{u}\} induces a subtree of TT.

  • For each edge eEe\in E, there is some bag BuB_{u} containing both endpoints of ee.

The width of the junction tree is one less than the size of the largest bag, i.e., is maxuVT|Bu|1\max_{u\in V_{T}}|B_{u}|-1. The treewidth of a graph is the width of its minimum-width junction tree.

See Figures 1, 2, and 3 for illustrated examples.

We now formally recall the definition of graphical structure for MOT.

Figure 1: (a) Path graph; (b) Junction tree. The path graph (left) has treewidth 1 because the corresponding junction tree (right) has bags of size at most 2.
Figure 2: (a) Window graph with window size 2; (b) Junction tree. The graph that has an edge between all vertices at distance at most two when ordered sequentially (left) has treewidth 2 because the corresponding junction tree (right) has bags of size at most 3.
Figure 3: (a) Cycle graph; (b) Junction tree. The cycle graph (left) has treewidth 2 because the corresponding junction tree (right) has bags of size at most 3.
Definition 5.4 (Graphical structure for MOT).

An MOT cost tensor C(n)kC\in(\mathbb{R}^{n})^{\otimes k} has graphical structure with treewidth ω\omega if there is a graphical model structure 𝒮2[k]\mathcal{S}\subset 2^{[k]} and functions {fS}S𝒮\{f_{S}\}_{S\in\mathcal{S}} such that

Cj=S𝒮fS({ji}iS),j:=(j1,,jk)[n]k,C_{\vec{j}}=\sum_{S\in\mathcal{S}}f_{S}\Big{(}\{j_{i}\}_{i\in S}\Big{)},\qquad\forall\vec{j}:=(j_{1},\dots,j_{k})\in[n]^{k}, (5.1)

and such that the graph G𝒮G_{\mathcal{S}} has treewidth ω\omega.

We make three remarks about this structure. First, note that the functions {fS}S𝒮\{f_{S}\}_{S\in\mathcal{S}} can be arbitrary so long as the corresponding graphical model structure has treewidth at most ω\omega.

Second, if Definition 5.4 did not constrain the treewidth ω\omega, then every tensor CC would trivially have graphical structure with maximal treewidth ω=k1\omega=k-1 (take 𝒮\mathcal{S} to be the singleton containing [k][k], G𝒮G_{\mathcal{S}} to be the complete graph, and f[k]f_{[k]} to be CC). Just like all previous algorithms, our algorithms have runtimes that depend exponentially (only) on the treewidth of G𝒮G_{\mathcal{S}}. This is optimal in the sense that unless 𝖯=𝖭𝖯\mathsf{P}=\mathsf{NP}, there is no algorithm with jointly polynomial runtime in the input size and treewidth (AltBoi20hard, ). We also point out that in all current applications of graphically structured MOT, the treewidth is either 11 or 22, see §1.3.

Third, as in all previous work on graphically structured MOT, we make the natural assumptions that the cost CC is input implicitly through the functions {fS}S𝒮\{f_{S}\}_{S\in\mathcal{S}}, and that each function fSf_{S} can be evaluated in polynomial time, since otherwise graphical structure is useless for designing polynomial-time algorithms. In all applications in the literature, these two basic assumptions are always satisfied. Note also that if the treewidth of the graphical structure is constant, then there is a linear-time algorithm to compute the treewidth and a corresponding minimum-width junction tree (bodlaender1996linear, ).

5.2 Polynomial-time algorithms

By our oracle reductions in §4, in order to design polynomial-time algorithms for MOT with graphical structure, it suffices to design polynomial-time algorithms for the MIN, AMIN, or SMIN oracles. This follows directly from classical algorithmic results in the graphical models literature (KolFri09, ).

Theorem 5.5 (Polynomial-time algorithms for the MIN, AMIN, and SMIN oracles for costs with graphical structure).

Let C(n)kC\in(\mathbb{R}^{n})^{\otimes k} be a cost tensor that has graphical structure with constant treewidth ω\omega (see Definition 5.4). Then the MINC\textsf{MIN}_{C}, AMINC\textsf{AMIN}_{C}, and SMINC\textsf{SMIN}_{C} oracles can be computed in poly(n,k)\mathrm{poly}(n,k) time.

Algorithm 4 Polynomial-time algorithm for MIN for graphically structured costs (Theorem 5.5).

Input: Cost CC with graphical structure, matrix pn×kp\in\mathbb{R}^{n\times k}
      Output: Solution to MINC(p)\textsf{MIN}_{C}(p)

1: \vec{j}\leftarrow mode of the graphical model P in (5.2) ▷ Using the classical max-product algorithm (KolFri09, , §13.3)
2: Return C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} ▷ Value of the \textsf{MIN}_{C}(p) oracle
Algorithm 5 Polynomial-time algorithm for SMIN for graphically structured costs (Theorem 5.5).

Input: Cost CC with graphical structure, matrix p¯n×kp\in\bar{\mathbb{R}}^{n\times k}, regularization η>0\eta>0
      Output: Solution to SMINC(p,η)\textsf{SMIN}_{C}(p,\eta)

1: Z\leftarrow partition function of the graphical model P in (5.2) ▷ Using the classical sum-product algorithm (KolFri09, , §10.2)
2: Return -\eta^{-1}\log Z ▷ Value of the \textsf{SMIN}_{C}(p,\eta) oracle
Proof.

Consider input pp for the oracles. Let PP denote the probability distribution on [n]k[n]^{k} given by

P(j)=1Zexp(η(Cji=1k[pi]ji)),j[n]k,\displaystyle P(\vec{j})=\frac{1}{Z}\exp\left(-\eta\left(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right)\right),\qquad\forall\vec{j}\in[n]^{k}, (5.2)

where Z=\sum_{\vec{j}\in[n]^{k}}\exp(-\eta(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}})) ensures P is normalized. Observe that the \textsf{MIN}_{C} oracle amounts to computing the mode of the distribution P, because \textsf{MIN}_{C}(p)=C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} where \vec{j}\in[n]^{k} is a maximizer of P_{\vec{j}}. (In fact, for the purpose of computing \textsf{MIN}_{C}, the distribution P(\vec{j}) can be defined using any \eta>0.) Also, the \textsf{SMIN}_{C} oracle amounts to computing the partition function Z because \textsf{SMIN}_{C}(p)=-\eta^{-1}\log Z. Thus it suffices to compute the mode and partition function of P in polynomial time. (The \textsf{AMIN}_{C} oracle follows from the \textsf{MIN}_{C} oracle by Remark 3.5.)

To this end, observe that by assumption on CC, there is a graphical model structure 𝒮2[k]\mathcal{S}\in 2^{[k]} and functions {fS}S𝒮\{f_{S}\}_{S\in\mathcal{S}} such that the corresponding graph G𝒮G_{\mathcal{S}} has treewidth ω\omega and the distribution PP factors as

P(j)=exp(η(S𝒮fS({ji}iS)i=1k[pi]ji)).P(\vec{j})=\exp\left(-\eta\left(\sum_{S\in\mathcal{S}}f_{S}\left(\{j_{i}\}_{i\in S}\right)-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right)\right).

It follows that PP is a graphical model with respect to the same graphical model structure 𝒮\mathcal{S} because the “vertex potentials” exp(η[pi]ji)\exp(\eta[p_{i}]_{j_{i}}) do not affect the underlying graphical model structure. Thus PP is a graphical model with constant treewidth ω\omega, so we may compute the mode and partition function of PP in poly(n,k)\mathrm{poly}(n,k) time using, respectively, the classical max-product and sum-product algorithms (KolFri09, , Chapters 13.3 and 10.2). For convenience, pseudocode summarizing this discussion is provided in Algorithms 4 and 5. ∎
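For intuition, here is a minimal sketch of Algorithm 5 in the simplest case of a path-structured cost (treewidth 1, cf. Figure 1), where the sum-product recursion reduces to a sequence of matrix-vector products. For large \eta one would carry out the same recursion in log-space for numerical stability; that caveat is ours, not part of the algorithm statement.

```python
import numpy as np

def smin_chain(fs, p, eta):
    """SMIN_C(p, eta) for a path-structured cost C_j = sum_i f_i(j_i, j_{i+1}).

    fs : list of k-1 matrices, fs[i][a, b] = value of the factor on (j_{i+1}=a, j_{i+2}=b)
    p  : list of k length-n weight vectors (the oracle input)
    """
    m = np.exp(eta * p[0])                    # message over j_1
    for i, f in enumerate(fs):
        K = np.exp(-eta * f)                  # pairwise potential for this edge
        m = (m @ K) * np.exp(eta * p[i + 1])  # sum out j_i, absorb vertex potential
    return -np.log(m.sum()) / eta             # -eta^{-1} log Z, as in Algorithm 5
```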

An immediate consequence of Theorem 5.5 combined with our oracle reductions is that all candidate MOT algorithms in §4 can be efficiently implemented for MOT problems with graphical structure. From a theoretical perspective, ELLIPSOID gives the best guarantee since it produces an exact, sparse solution.

Corollary 5.6 (Polynomial-time algorithms for MOT problems with graphical structure).

Let C(n)kC\in(\mathbb{R}^{n})^{\otimes k} be a cost tensor that has graphical structure with constant treewidth ω\omega (see Definition 5.4). Then:

  • The ELLIPSOID algorithm in §4.1 computes an exact solution to MOTC\textsf{MOT}_{C} in poly(n,k)\mathrm{poly}(n,k) time.

  • The MWU algorithm in §4.2 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

  • The SINKHORN algorithm in §4.3 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

  • The COLGEN algorithm in §4.1.3 can be run for TT iterations in poly(n,k,T)\mathrm{poly}(n,k,T) time.

Moreover, ELLIPSOID, MWU, and COLGEN output a polynomially sparse tensor, whereas SINKHORN outputs a fully dense tensor through the implicit representation described in §4.3.1.

Proof.

Combine the polynomial-time implementations of the oracles in Theorem 5.5 with the polynomial-time algorithm-to-oracle reductions in Theorems 4.1, 4.7, 4.18, and 4.6, respectively. ∎

5.3 Application vignette: fluid dynamics

In this section, we numerically demonstrate our new results for graphically structured MOT—namely the ability to compute exact, sparse solutions in polynomial time (Corollary 5.6). We illustrate this on the problem of computing generalized Euler flows—an MOT application which has received significant interest and which was historically the motivation of MOT, see e.g., (BenCarCut15, ; BenCarNen16, ; Bre08, ; Bre89, ; Bre93, ; Bre99, ). This MOT problem is already known to be tractable via a popular, specially-tailored modification of SINKHORN (BenCarCut15, )—which can be interpreted as implementing SINKHORN using graphical structure (h20gm, ; teh2002unified, ). However, that algorithm is based on SINKHORN and thus unavoidably produces solutions that are low-precision (due to poly(1/ε)\mathrm{poly}(1/\varepsilon) runtime dependence), fully dense (with nkn^{k} non-zero entries), and have well-documented numerical precision issues. We offer the first polynomial-time algorithm for computing exact and/or sparse solutions.

We briefly recall the premise of this MOT problem; for further background see (Bre08, ; BenCarCut15, ). An incompressible fluid (e.g., water) is modeled by nn particles which are uniformly distributed in space (due to incompressibility) at all times t{1,,k+1}t\in\{1,\dots,k+1\}. We observe each particle’s location at initial time t=1t=1 and final time t=k+1t=k+1. The task is to infer the particles’ locations at all intermediate times t{2,,k}t\in\{2,\dots,k\}, and this is modeled by an MOT problem as follows.

Specifically, the locations of the fluid particles are discretized to points {xj}j[n]d\{x_{j}\}_{j\in[n]}\subset\mathbb{R}^{d}, and σ\sigma is a known permutation on this set that encodes the relation between each initial location xjx_{j} at time t=1t=1 and final location σ(xj)\sigma(x_{j}) at time t=k+1t=k+1. The total movement of a particle that takes the trajectory xj1,xj2,,xjk,σ(xj1)x_{j_{1}},x_{j_{2}},\ldots,x_{j_{k}},\sigma(x_{j_{1}}) is given by

C_{j_{1},\ldots,j_{k}}=\|\sigma(x_{j_{1}})-x_{j_{k}}\|^{2}+\sum_{t=1}^{k-1}\|x_{j_{t+1}}-x_{j_{t}}\|^{2}. (5.3)

By the principle of least action, the generalized Euler flow problem of inferring the most likely trajectories of the fluid particles is given by the solution to the MOT problem with this cost CC and uniform marginals μt=𝟏n/nΔn\mu_{t}=\mathbf{1}_{n}/n\in\Delta_{n} which impose the constraint that the fluid is incompressible.

Corollary 5.7 (Exact, sparse solutions for generalized Euler flows).

The MOT problem with cost (5.3) can be solved in dpoly(n,k)d\cdot\mathrm{poly}(n,k) time. The solution is returned as a sparse tensor with at most nkk+1nk-k+1 non-zeros.

Proof.

This cost tensor C can be expressed in graphical form C_{\vec{j}}=\sum_{S\in\mathcal{S}}f_{S}(\{j_{i}\}_{i\in S}), where \mathcal{S} consists of the sets \{1,2\},\dots,\{k-1,k\} of adjacent time points as well as the set \{1,k\}. Moreover, each function f_{S}:[n]^{2}\to\mathbb{R} can be computed in O(dn^{2}) time since this simply requires computing \|x_{j}-x_{j^{\prime}}\|^{2} for n^{2} pairs of points x_{j},x_{j^{\prime}}\in\mathbb{R}^{d}. Once this graphical representation is computed, Corollary 5.6 implies a \mathrm{poly}(n,k)-time algorithm for this MOT problem because the graphical model structure \mathcal{S} is a cycle graph and thus has treewidth 2 (cf. Figure 3). ∎
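As a concrete illustration of this proof, the following sketch (names ours) computes the two pairwise factors of the graphical representation in O(dn^2) time.

```python
import numpy as np

def euler_flow_factors(x, sigma):
    """Pairwise factors of the cycle-structured cost (5.3).

    x     : (n, d) array of discretized particle locations
    sigma : length-n integer array, sigma[j] = index of the final location of x_j
    Returns D with D[a, b] = ||x_a - x_b||^2 (factor for adjacent times) and
    F with F[a, b] = ||sigma(x_a) - x_b||^2 (factor coupling times 1 and k).
    """
    D = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    F = D[sigma]  # permute rows: F[a, b] = ||x_{sigma(a)} - x_b||^2
    return D, F
```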

Figure 4: Transport maps computed by the fast implementation of SINKHORN (BenCarCut15, ) (left) and our COLGEN implementation (right) on a standard fluid dynamics benchmark problem in dimension d=1 (Bre08, ), shown at timesteps t=1 through t=7. The pairwise transport maps between successive timesteps are plotted with opacity proportional to the mass. The SINKHORN algorithm is run at the highest precision (i.e., smallest regularization) before serious numerical precision issues (NaNs); it returns a dense, approximate solution in 2.25 seconds. COLGEN returns an exact, sparse solution in 9.52 seconds. (All experiments in this paper are run on a standard-issue Apple MacBook Pro 2020 laptop with an M1 chip.) Furthermore, in this particular problem instance, the COLGEN method returns a Monge solution, i.e., the sparsity is n so that the particles never split in the computed trajectories.

Figure 4 illustrates how the exact, sparse solutions found by our new algorithm provide visually sharper estimates than the popular modification of SINKHORN in (BenCarCut15, ), which blurs the trajectories. The latter is the state-of-the-art algorithm in the literature and in particular is the only previously known non-heuristic algorithm with polynomial-time guarantees. Note that this algorithm is identical to implementing SINKHORN by exploiting the graphical structure to perform exact marginalization efficiently (teh2002unified, ; h20gm, ).

The numerical simulation is on a standard benchmark problem used in the literature (see e.g., (BenCarCut15, , Figure 9) and (Bre08, , Figure 2)) in which the particle at initial location x\in[0,1] moves to final location \sigma(x)=x+\tfrac{1}{2}\pmod{1}. This is run with k=6 and marginals \mu_{1}=\dots=\mu_{k} uniformly supported on n=51 positions in [0,1]. See Appendix B for numerics on other standard benchmark instances. Note that this amounts to solving an MOT LP with n^{k}=51^{6}\approx 1.8\times 10^{10} variables, which is infeasible for standard LP solvers. Our algorithm is the first to compute exact solutions for problem instances of this scale.
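For concreteness, this benchmark instance can be set up roughly as follows, reusing the euler_flow_factors sketch from §5.3 above. The exact discretization grid is our assumption; see (Bre08, ) for the original setup.

```python
import numpy as np

# n = 51 positions in [0, 1); final location sigma(x) = x + 1/2 (mod 1).
n, k = 51, 6
x = np.linspace(0.0, 1.0, n, endpoint=False).reshape(n, 1)
targets = (x[:, 0] + 0.5) % 1.0
sigma = np.array([np.argmin(np.abs(x[:, 0] - t)) for t in targets])
D, F = euler_flow_factors(x, sigma)  # cost factors for the treewidth-2 cycle
```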

Two important remarks. First, since this MOT problem is a discretization of the underlying PDE, an exact solution is of course not necessary; however, there is an important—even qualitative—difference between low-precision solutions (computable with poly(1/ε)\mathrm{poly}(1/\varepsilon) runtime) and high-precision solutions (computable with polylog(1/ε)\mathrm{polylog}(1/\varepsilon) runtime) for the discretized problem. Second, a desirable feature of SINKHORN that should be emphasized is its practical scalability, which might make it advantageous for problems where very fine discretization is required. It is an interesting direction of practical relevance to develop algorithms that can compute high-precision solutions at a similarly large scale in practice (see the discussion in §8).

6 Application: MOT problems with set-optimization structure

In this section, we consider MOT problems whose cost tensors C take values 0 and 1—or more generally any two values, by a straightforward reduction: if C takes two values a<b, then define the tensor \tilde{C} with \{0,1\}-entries by \tilde{C}_{\vec{j}}=(C_{\vec{j}}-a)/(b-a); the MOT problems with costs C and \tilde{C} have identical solutions. Such MOT problems arise naturally in applications where one seeks to minimize or maximize the probability that some event occurs given marginal probabilities on each variable (see Example 6.1). We establish that this general class of MOT problems can be solved in polynomial time under a condition on the sparsity pattern of C that is often simple to check due to its connection to classical combinatorial optimization problems.

The section is organized as follows. In §6.1 we formally describe this setup and discuss why it is incomparable to all other structures discussed in this paper. In §6.2, we show that for costs with this structure, the MIN, AMIN, and SMIN oracles can be implemented in polynomial time; from this it immediately follows that the ELLIPSOID, MWU, SINKHORN, and COLGEN algorithms discussed in part 1 of this paper can be implemented in polynomial time. In §6.3, we illustrate our results via a case study on network reliability.

6.1 Setup

Example 6.1 (Motivation for binary-valued MOT costs: minimizing/maximizing probability of an event).

Let S[n]kS\subset[n]^{k}. If Cj=𝟙[jS]C_{\vec{j}}=\mathds{1}[\vec{j}\in S], then the MOTC\textsf{MOT}_{C} problem amounts to minimizing the probability that event SS occurs, given marginals on each variable. On the other hand, if Cj=𝟙[jS]C_{\vec{j}}=\mathds{1}[\vec{j}\notin S], then the MOTC\textsf{MOT}_{C} problem amounts to maximizing the probability that event SS occurs since

MOTC(μ1,,μk)=minP(μ1,,μk)jP[jS]=1maxP(μ1,,μk)jP[jS].\textsf{MOT}_{C}(\mu_{1},\dots,\mu_{k})=\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\mathbb{P}_{\vec{j}\sim P}[\vec{j}\notin S]=1-\max_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\mathbb{P}_{\vec{j}\sim P}[\vec{j}\in S].

Even if the cost is binary-valued, there is no hope to solve MOT in polynomial time without further assumptions—essentially because in the worst case, any algorithm must query all nkn^{k} entries if CC is a completely arbitrary {0,1}\{0,1\}-valued tensor.

We show that MOT is polynomial-time solvable under the general and often simple-to-check condition that the MIN, AMIN, and SMIN oracles introduced in §3 are polynomial-time solvable when restricted to the set SS of indices j[n]k\vec{j}\in[n]^{k} for which Cj=0C_{\vec{j}}=0. For simplicity, our definition of these set oracles removes the cost CjC_{\vec{j}} as it is constant on SS. Of course it is also possible to remove the negative sign in p-p by re-parameterizing the inputs as w=pw=-p; however, we keep this notation in order to parallel the original oracles.

Definition 6.2 (MIN set oracle).

Let S[n]kS\subset[n]^{k}. For weights p=(p1,,pk)n×kp=(p_{1},\ldots,p_{k})\in\mathbb{R}^{n\times k}, MINC,S(p)\textsf{MIN}_{C,S}(p) returns

minjSi=1k[pi]ji.\min_{\vec{j}\in S}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.
Definition 6.3 (AMIN set oracle).

Let S[n]kS\subset[n]^{k}. For weights p=(p1,,pk)n×kp=(p_{1},\ldots,p_{k})\in\mathbb{R}^{n\times k} and accuracy ε>0\varepsilon>0, AMINC,S(p,ε)\textsf{AMIN}_{C,S}(p,\varepsilon) returns MINC,S(p)\textsf{MIN}_{C,S}(p) up to additive error ε\varepsilon.

Definition 6.4 (SMIN set oracle).

Let S[n]kS\subset[n]^{k}. For weights p=(p1,,pk)¯n×kp=(p_{1},\ldots,p_{k})\in\bar{\mathbb{R}}^{n\times k} and regularization parameter η>0\eta>0, SMINC,S(p,η)\textsf{SMIN}_{C,S}(p,\eta) returns

sminηjSi=1k[pi]ji.\displaystyle\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in S}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.

The key motivation behind these set oracle definitions (aside from the syntactic similarity to the original oracles) is that they encode the problem of (approximately) finding the min-weight object in SS. This opens the door to combinatorial applications of MOT because finding the min-weight object in SS is well-known to be polynomial-time solvable for many “combinatorial-structured” sets SS of interest—e.g., the set SS of cuts in a graph, or the set SS of independent sets in a matroid. See §6.3 for fully-detailed applications.

Definition 6.5 (Set-optimization structure for MOT).

An MOT cost tensor C(n)kC\in(\mathbb{R}^{n})^{\otimes k} has exact, approximate, or soft set-optimization structure of complexity β\beta if

Cj=𝟙[jS]C_{\vec{j}}=\mathds{1}[\vec{j}\notin S]

for a set S[n]kS\subset[n]^{k} for which there is an algorithm solving MINC,S\textsf{MIN}_{C,S}, AMINC,S\textsf{AMIN}_{C,S}, or SMINC,S\textsf{SMIN}_{C,S}, respectively, in β\beta time.

We make two remarks about this structure.

Remark 6.6 (Only require set oracle for C1(0)C^{-1}(0), not for C1(1)C^{-1}(1)).

Note that Definition 6.5 only requires the set oracles for the set SS of entries where CC is 0, and does not need the set oracles for the set [n]kS[n]^{k}\setminus S where CC is 11. The fact that both set oracles are not needed makes set-optimization structure easier to check than the original oracles in §3, because those effectively require optimization over both SS and [n]kS[n]^{k}\setminus S.

Remark 6.7 (Set-optimization structure is incomparable to graphical and low-rank plus sparse structure).

Costs C that satisfy Definition 6.5 in general do not have non-trivial graphical structure or low-rank plus sparse structure. Specifically, there are costs C that satisfy Definition 6.5, yet require the maximal treewidth k-1 to model via graphical structure, and super-constant rank or exponential sparsity to model via low-rank plus sparse structure. (A concrete example is the network reliability application in §6.3.) Because of the \mathsf{NP}-hardness of MOT problems with (k-1)-treewidth graphical structure or super-constant rank (AltBoi20hard, ), simply modeling such problems with graphical structure or low-rank plus sparse structure is therefore useless for the purpose of designing polynomial-time MOT algorithms.

6.2 Polynomial-time algorithms

By our oracle reductions in part 1 of this paper, in order to design polynomial-time algorithms for MOT with set-optimization structure, it suffices to design polynomial-time algorithms for the MIN, AMIN, or SMIN oracles. We show how to do this for all three oracles in a straightforward way by exploiting the set-optimization structure.

Theorem 6.8 (Polynomial-time algorithms for the MIN, AMIN, and SMIN oracles for costs with set-optimization structure).

If C(n)kC\in(\mathbb{R}^{n})^{\otimes k} is a cost tensor with exact, approximate, or soft set-optimization structure of complexity β\beta (see Definition 6.5), then the MINC\textsf{MIN}_{C}, AMINC\textsf{AMIN}_{C}, and SMINC\textsf{SMIN}_{C} oracles, respectively, can be computed in β+poly(n,k)\beta+\mathrm{poly}(n,k) time.

Algorithm 6 Polynomial-time algorithm for MIN for costs with exact set-optimization structure (Theorem 6.8).

Input: Access to CC via MINC,S\textsf{MIN}_{C,S} oracle, matrix pn×kp\in\mathbb{R}^{n\times k}
      Output: Solution to MINC(p)\textsf{MIN}_{C}(p)

1: a\leftarrow\textsf{MIN}_{C,S}(p) ▷ One oracle call
2: x\leftarrow-\sum_{i=1}^{k}\max_{j\in[n]}[p_{i}]_{j} ▷ Takes O(nk) time
3: Return a if a\leqslant x, or \min(a,1+x) otherwise ▷ Takes O(1) time
Algorithm 7 Polynomial-time algorithm for SMIN for costs with soft set-optimization structure (Theorem 6.8).

Input: Access to CC via SMINC,S\textsf{SMIN}_{C,S} oracle, matrix p¯n×kp\in\bar{\mathbb{R}}^{n\times k}, regularization η\eta
      Output: Solution to SMINC(p,η)\textsf{SMIN}_{C}(p,\eta)

1: a\leftarrow\exp(-\eta\cdot\textsf{SMIN}_{C,S}(p,\eta)) ▷ One oracle call
2: x\leftarrow\prod_{i=1}^{k}\sum_{j=1}^{n}\exp(\eta[p_{i}]_{j}) ▷ Takes O(nk) time
3: Return -\eta^{-1}\log(e^{-\eta}x+(1-e^{-\eta})a) ▷ Takes O(1) time
Proof.

Polynomial-time algorithm for MIN. We first claim that Algorithm 6 implements the MINC(p)\textsf{MIN}_{C}(p) oracle. To this end, define

a:=MINC,S(p)=minj[n]ks.t. Cj=0i=1k[pi]jiandb:=minj[n]ks.t. Cj=1i=1k[pi]ji.\displaystyle a:=\textsf{MIN}_{C,S}(p)=\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }C_{\vec{j}}=0\end{subarray}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\qquad\text{and}\qquad b:=\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }C_{\vec{j}}=1\end{subarray}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}. (6.1)

By re-arranging the sum and max, it follows that

x:=i=1kmaxj[n][pi]j=maxj[n]ki=1k[pi]ji=minj[n]ki=1k[pi]ji=min(a,b).\displaystyle x:=-\sum_{i=1}^{k}\max_{j\in[n]}[p_{i}]_{j}=-\max_{\vec{j}\in[n]^{k}}\sum_{i=1}^{k}[p_{i}]_{j_{i}}=\min_{\vec{j}\in[n]^{k}}\sum_{i=1}^{k}-[p_{i}]_{j_{i}}=\min(a,b). (6.2)

Therefore

MINC(p)=minj[n]kCji=1k[pi]ji=min(a,1+b)\displaystyle\textsf{MIN}_{C}(p)=\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}=\min(a,1+b) ={a if abmin(a,1+min(a,b)) if a>b\displaystyle=\begin{cases}a&\text{ if }a\leqslant b\\ \min(a,1+\min(a,b))&\text{ if }a>b\end{cases}
={a if axmin(a,1+x) if a>x,\displaystyle=\begin{cases}a&\text{ if }a\leqslant x\\ \min(a,1+x)&\text{ if }a>x\end{cases}, (6.3)

where above the first step is by definition of MINC\textsf{MIN}_{C}; the second step is by partitioning the minimization over j[n]k\vec{j}\in[n]^{k} into j\vec{j} such that Cj=0C_{\vec{j}}=0 or Cj=1C_{\vec{j}}=1, and then plugging in the definitions of aa and bb; the third step is by manipulating min(a,1+b)\min(a,1+b) in both cases; and the last step is because x=min(a,b)x=\min(a,b) as shown above. We conclude that Algorithm 6 correctly outputs MINC(p)\textsf{MIN}_{C}(p). Since the algorithm uses one call to the MINC,S\textsf{MIN}_{C,S} oracle and O(nk)O(nk) additional time, the claim is proven.

Polynomial-time algorithm for AMIN. Next, we claim that the same Algorithm 6, now run with the approximate oracle AMINC,S(p,ε)\textsf{AMIN}_{C,S}(p,\varepsilon) in the first step instead of the exact oracle MINC,S(p)\textsf{MIN}_{C,S}(p), computes a valid solution to AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon). To prove this, let aa, bb, and xx be as defined in (6.1) and (6.2) for the MIN analysis, and let a~=AMINC,S(p,ε)\tilde{a}=\textsf{AMIN}_{C,S}(p,\varepsilon). By the same logic as in (6.3), except now reversed, the output

{a~ if a~xmin(a~,1+x) if a~>x\displaystyle\begin{cases}\tilde{a}&\text{ if }\tilde{a}\leqslant x\\ \min(\tilde{a},1+x)&\text{ if }\tilde{a}>x\end{cases}

is equal to min(a~,1+b)\min(\tilde{a},1+b). Now because a~\tilde{a} is within additive ε\varepsilon error of aa (by definition of the AMINC,S\textsf{AMIN}_{C,S} oracle), it follows that the above output is within ε\varepsilon additive error of

min(a,1+b)=minj[n]kCji=1k[pi]ji=MINC(p).\min(a,1+b)=\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}=\textsf{MIN}_{C}(p).

Thus the output is a valid answer to AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon), establishing correctness. The runtime claim is obvious.

Polynomial-time algorithm for SMIN. Finally, we claim that Algorithm 7 implements the SMINC(p,η)\textsf{SMIN}_{C}(p,\eta) oracle. To this end, define

a:=e^{-\eta\cdot\textsf{SMIN}_{C,S}(p,\eta)}=\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }C_{\vec{j}}=0\end{subarray}}e^{\eta\sum_{i=1}^{k}[p_{i}]_{j_{i}}}\qquad\text{ and }\qquad b:=\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }C_{\vec{j}}=1\end{subarray}}e^{\eta\sum_{i=1}^{k}[p_{i}]_{j_{i}}}.

By re-arranging products and sums, it follows that

x:=\prod_{i=1}^{k}\sum_{j=1}^{n}e^{\eta[p_{i}]_{j}}=\sum_{\vec{j}\in[n]^{k}}\prod_{i=1}^{k}e^{\eta[p_{i}]_{j_{i}}}=a+b.

Therefore

SMINC(p,η)=1ηlog(j[n]keη(Cji=1k[pi]ji))=1ηlog(a+eηb)=1ηlog(eηx+(1eη)a),\textsf{SMIN}_{C}(p,\eta)=-\frac{1}{\eta}\log\left(\sum_{\vec{j}\in[n]^{k}}e^{-\eta(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}})}\right)=-\frac{1}{\eta}\log\left(a+e^{-\eta}b\right)=-\frac{1}{\eta}\log\left(e^{-\eta}x+(1-e^{-\eta})a\right),

where above the first step is by definition of SMINC\textsf{SMIN}_{C}; the second step is by partitioning the sum over j[n]k\vec{j}\in[n]^{k} into j\vec{j} such that Cj=0C_{\vec{j}}=0 or Cj=1C_{\vec{j}}=1, and then plugging in the definitions of aa and bb; and the third step is because x=a+bx=a+b as shown above. We conclude that Algorithm 7 correctly outputs SMINC(p,η)\textsf{SMIN}_{C}(p,\eta). Since the algorithm uses one call to the SMINC,S\textsf{SMIN}_{C,S} oracle and O(nk)O(nk) additional time, the claim is proven. ∎
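Algorithms 6 and 7 translate almost verbatim into code. The following sketch treats the set oracles as black-box callbacks; note (our caveat, not the paper's) that for large \eta the exponentials in the SMIN branch can overflow in floating point, so a careful implementation would work in log-space.

```python
import numpy as np

def min_oracle_setopt(min_set, p):
    """Algorithm 6: MIN_C(p) for C_j = 1[j not in S], via one MIN_{C,S} call."""
    a = min_set(p)                          # min over j in S of -sum_i p_i[j_i]
    x = -sum(p_i.max() for p_i in p)        # min over all of [n]^k, i.e. min(a, b)
    return a if a <= x else min(a, 1.0 + x)

def smin_oracle_setopt(smin_set, p, eta):
    """Algorithm 7: SMIN_C(p, eta) via one SMIN_{C,S} call."""
    a = np.exp(-eta * smin_set(p, eta))     # sum over S of exp(eta * sum_i p_i[j_i])
    x = np.prod([np.exp(eta * p_i).sum() for p_i in p])  # sum over all of [n]^k
    return -np.log(np.exp(-eta) * x + (1.0 - np.exp(-eta)) * a) / eta
```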

An immediate consequence of Theorem 6.8 combined with our oracle reductions is that all of the candidate MOT algorithms described in §4 can be efficiently implemented for MOT problems with set-optimization structure. From a theoretical perspective, the ELLIPSOID algorithm gives the best guarantee since it produces an exact, sparse solution in polynomial time.

Corollary 6.9 (Polynomial-time algorithms for MOT problems with set-optimization structure).

Let C(n)kC\in(\mathbb{R}^{n})^{\otimes k} be a cost tensor that has set-optimization structure with poly(n,k)\mathrm{poly}(n,k) complexity (see Definition 6.5).

  • Exact set-optimization structure. The ELLIPSOID algorithm in §4.1 computes an exact solution to MOTC\textsf{MOT}_{C} in poly(n,k)\mathrm{poly}(n,k) time. Also, the COLGEN algorithm in §4.1.3 can be run for TT iterations in poly(n,k,T)\mathrm{poly}(n,k,T) time.

  • Approximate set-optimization structure. The MWU algorithm in §4.2 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

  • Soft set-optimization structure. The SINKHORN algorithm in §4.3 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

Moreover, ELLIPSOID, MWU, and COLGEN output a polynomially sparse tensor, whereas SINKHORN outputs a fully dense tensor through the implicit representation described in §4.3.1.

Proof.

Combine the polynomial-time implementations of the oracles in Theorem 6.8 with the polynomial-time algorithm-to-oracle reductions in Theorems 4.1, 4.6, 4.7, and 4.18, respectively. ∎

6.3 Application vignette: network reliability with correlations

In this section, we illustrate this class of MOT structures via an application to network reliability, a central topic in network science, engineering, and operations research, see e.g., the textbooks (gertsbakh2011network, ; ball1995network, ; ball1986computational, ). The basic network reliability question is: given an undirected graph G=(V,E)G=(V,E) where each edge eEe\in E is reliable with some probability qeq_{e} and fails with probability 1qe1-q_{e}, what is the probability that all vertices are reachable from all others? This connectivity is desirable in applications, e.g., if GG is a computer cluster, the vertices are the machines, and the edges are communication links, then connectivity corresponds to the reachability of all machines. See the aforementioned textbooks for many other applications.

Of course, the above network reliability question is not yet well-defined since the edge failures are only prescribed up to their marginal distributions. The choice of joint distribution greatly impacts the answer.

The most classical setup posits that edge failures are independent (moore1956reliable, ). Denote the network reliability probability for this setting by \rho^{\mathrm{ind}}. Computing \rho^{\mathrm{ind}} is #P-complete (valiant1979complexity, ; provan1983complexity, ) and thus \mathsf{NP}-hard, but there exist fully polynomial randomized approximation schemes (a.k.a. FPRAS) for multiplicatively approximating both the connection probability \rho^{\mathrm{ind}} (karger2001randomized, ) and the failure probability 1-\rho^{\mathrm{ind}} (guo2019polynomial, ).

Here we investigate the setting of coordinated edge failures, which dates back to the 1980s (weiss1986stochastic, ; zemel1982polynomial, ). This coordination may optimize for disconnection (e.g., by an adversary), or for connection (e.g., to maximize the time a network is connected while performing maintenance on each edge e during a 1-q_{e} fraction of the time). We define these notions below; see also Figure 5 for an illustration. Below, \operatorname{Ber}(q_{e}) denotes the Bernoulli distribution with parameter q_{e}.

Definition 6.10 (Network reliability with correlations).

For an undirected graph G=(V,E)G=(V,E) and edge reliability probabilities {qe}eE\{q_{e}\}_{e\in E}:

  • The worst-case network reliability is

    ρmin:=minP({Ber(qe)}eE)HP[H is a connected subgraph of G].\rho^{\min}:=\min_{P\in\mathcal{M}(\{\operatorname{Ber}(q_{e})\}_{e\in E})}\mathbb{P}_{H\sim P}\left[H\text{ is a connected subgraph of }G\right].
  • The best-case network reliability is

    ρmax:=maxP({Ber(qe)}eE)HP[H is a connected subgraph of G].\rho^{\max}:=\max_{P\in\mathcal{M}(\{\operatorname{Ber}(q_{e})\}_{e\in E})}\mathbb{P}_{H\sim P}[H\text{ is a connected subgraph of }G].
Figure 5: Optimal decompositions for the worst-case (top) and best-case (bottom) reliability problems on the same graph G and edge reliability probabilities q_{e} (left). Coordinating edge failures yields significantly different connection probabilities: \rho^{\min}=40\%, \rho^{\mathrm{ind}}\approx 60\%, and \rho^{\max}=90\%.

Clearly \rho^{\min}\leqslant\rho^{\mathrm{ind}}\leqslant\rho^{\max}. These gaps can be large (e.g., see Figure 5), which promises large opportunities for applications in which coordination is possible. However, realizing such an opportunity requires being able to compute \rho^{\min} and \rho^{\max}, and both of these problems require solving an exponentially large LP with 2^{|E|} variables. Below we show how to use set-optimization structure to compute these quantities in \mathrm{poly}(|E|) time, thereby recovering as special cases of our general framework the known polynomial-time algorithms for this particular problem in (zemel1982polynomial, ; weiss1986stochastic, ), as well as more practical polynomial-time algorithms that scale to input sizes an order of magnitude larger.

Corollary 6.11 (Polynomial-time algorithm for network reliability with correlations).

The worst-case and best-case network reliability can both be computed in poly(|E|)\mathrm{poly}(|E|) time.

Proof.

By the observation in Example 6.1, the optimization problems defining \rho^{\min} and 1-\rho^{\max} are instances of MOT in which k=|E|, n=2, \mu_{e}=\operatorname{Ber}(q_{e}), and the cost C\in(\mathbb{R}^{2})^{\otimes|E|} is \{0,1\}-valued, each entry indicating whether the corresponding subset of edges forms a connected or disconnected subgraph of G, respectively. It therefore suffices to show that both of these MOT cost tensors satisfy exact set-optimization structure (Definition 6.5), since that implies a polynomial-time algorithm for exactly solving MOT (Corollary 6.9).

Set-optimization structure for 1ρmax1-\rho^{\max}.

In this case, SS is the set of connected subgraphs of GG. Thus the MINC,S\textsf{MIN}_{C,S} problem is: given weights p2×|E|p\in\mathbb{R}^{2\times|E|}, compute

minconnected subgraph H of GeHp2,eeHp1,e.\min_{\text{connected subgraph }H\text{ of }G}-\sum_{e\in H}p_{2,e}-\sum_{e\notin H}p_{1,e}.

Note that this objective is equal to eHxeeEp1,e\sum_{e\in H}x_{e}-\sum_{e\in E}p_{1,e} where xe:=p1,ep2,ex_{e}:=p_{1,e}-p_{2,e}. Since the latter sum is independent of HH, the MINC,S\textsf{MIN}_{C,S} problem therefore reduces to the problem of finding a minimum-weight connected subgraph in GG; that is, given edge weights x|E|x\in\mathbb{R}^{|E|}, compute

minconnected subgraph H of GeHxe.\displaystyle\min_{\text{connected subgraph }H\text{ of }G}\sum_{e\in H}x_{e}. (6.4)

We first show how to solve this in polynomial time in the case that all edge weights xex_{e} are positive. In this case, the optimal solution HH is a minimum-weight spanning tree of GG. This can be found by Kruskal’s algorithm in O(|E|log|E|)O(|E|\log|E|) time (kruskal1956shortest, ).

For the general case of arbitrary edge weights, note that the edges e with non-positive weight x_{e}\leqslant 0 can be added to any solution without worsening the cost or feasibility. Thus these edges are without loss of generality in every solution H, and so it suffices to solve the same problem (6.4) on the graph G^{\prime} obtained by contracting these non-positively-weighted edges in G. This reduces (6.4) to the same problem of finding a minimum-weight connected subgraph, except now in the special case that all edge weights are positive. Since we have already shown how to solve this case in polynomial time, this establishes exact set-optimization structure for 1-\rho^{\max}.

Set-optimization structure for ρmin\rho^{\min}.

In this case, SS is the set of disconnected subgraphs of GG. We may simplify the MINC,S\textsf{MIN}_{C,S} problem for ρmin\rho^{\min} by re-parameterizing the input p2×|E|p\in\mathbb{R}^{2\times|E|} to edge weights x|E|x\in\mathbb{R}^{|E|} as done above in (6.4) for 1ρmax1-\rho^{\max}. Thus the MINC,S\textsf{MIN}_{C,S} problem for ρmin\rho^{\min} is: given weights x|E|x\in\mathbb{R}^{|E|}, compute

mindisconnected subgraph H of GeHxe.\displaystyle\min_{\text{disconnected subgraph }H\text{ of }G}\sum_{e\in H}x_{e}. (6.5)

We first show how to solve this in the case that all edge weights xex_{e} are negative. In that case, the optimal solution is of the form H=ECH=E\setminus C, where CC is a maximum-weight cut of the graph GG with weights xex_{e}. Equivalently, by negating all edge weights, CC is a minimum-weight cut of the graph GG with weights xe-x_{e}. Since a minimum-weight cut of a graph with positive weights can be found in polynomial time (stoer1997simple, ), the problem (6.5) can be solved in polynomial time when all xex_{e} are negative.

Now in the general case of arbitrary edge weights, note that the edges e with non-negative weight x_{e}\geqslant 0 can be removed from any solution without worsening the cost or feasibility. Thus these edges are without loss of generality excluded from every solution H, and so it suffices to solve the same problem (6.5) on the graph G^{\prime} obtained by deleting these non-negatively-weighted edges in G. This reduces (6.5) to the same problem of finding a minimum-weight disconnected subgraph, except now in the special case that all edge weights are negative. Since we have already shown how to solve this case in polynomial time, the proof is complete. ∎
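As one half of this proof in code, the following sketch (names ours) implements the \textsf{MIN}_{C,S} oracle for 1-\rho^{\max}, i.e., the minimum-weight connected subgraph problem (6.4): non-positively-weighted edges are taken for free (contraction), and Kruskal's algorithm connects what remains. It assumes the input graph G is connected. The oracle for \rho^{\min} is analogous, with edge deletion and a minimum cut (e.g., Stoer-Wagner) in place of contraction and a spanning tree.

```python
class DisjointSets:
    """Union-find, used both for contraction and for Kruskal's algorithm."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def min_weight_connected_subgraph(n_vertices, edges):
    """Solve (6.4): minimize sum of x_e over H such that (V, H) is connected.

    edges : list of (u, v, x_e) triples; assumes the full graph is connected.
    """
    dsu, value = DisjointSets(n_vertices), 0.0
    for u, v, w in edges:
        if w <= 0:              # including e cannot hurt: contract it
            dsu.union(u, v)
            value += w
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if w > 0 and dsu.union(u, v):   # Kruskal on the contracted graph
            value += w
    return value
```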

Figure 6: Top: comparison of the runtime (left) and accuracy (right) of the algorithms described in the main text, for the worst-case reliability of a clique graph on t vertices and k=\binom{t}{2} edges with reliability probabilities q_{e}=0.99. Bottom: same, but for best-case reliability and reliability probabilities q_{e}=0.01. For worst-case reliability, the algorithms compute an upper bound, so a smaller value is better; the reverse holds for best-case reliability. The algorithms are cut off at 2 minutes, denoted by an “x”. SINKHORN is run at the highest precision (i.e., highest \eta) before numerical precision issues. The COLGEN algorithm that our framework recovers computes exact solutions an order of magnitude faster than the other algorithms, and the new MWU algorithm computes reasonably approximate solutions for k=400, which amounts to an MOT LP with n^{k}=2^{400}\approx 2.6\times 10^{120} variables.

In Figure 6, we compare the numerical performance of the algorithms in Corollary 6.11 (COLGEN and MWU with polynomial-time implementations of their bottlenecks) with the fastest previous algorithms for both best-case and worst-case network reliability. Previously, the fastest algorithms that apply to this problem are (1) out-of-the-box LP solvers run on MOT, (2) the brute-force implementation of SINKHORN which marginalizes over all n^{k}=2^{|E|} entries in each iteration, and (3) the COLGEN algorithm that we recover (zemel1982polynomial, ; weiss1986stochastic, ). It is unknown whether there is a practically efficient implementation of the \textsf{SMIN}_{C,S} oracle (and thus of SINKHORN) for either best-case or worst-case reliability. Since the previous algorithms (1) and (2) have exponential runtime that scales as n^{\Omega(k)}=2^{\Omega(|E|)}, they do not scale past tiny input sizes. In contrast, the algorithms in Corollary 6.11 scale to much larger inputs. Indeed, the COLGEN algorithm that our framework recovers can compute exact solutions roughly an order of magnitude faster than the other algorithms, and the new MWU algorithm computes reasonably approximate solutions beyond k=400, which amounts to an MOT LP with n^{k}=2^{400}\approx 2.6\times 10^{120} variables.

7 Application: MOT problems with low-rank plus sparse structure

In this section, we consider MOT problems whose cost tensors CC decompose into low-rank and sparse components. We propose the first polynomial-time algorithms for this general class of MOT problems.

The section is organized as follows. In §7.1 we formally describe this setup and discuss why it is incomparable to all other structures discussed in this paper. In §7.2, we show that for costs with this structure, the AMIN and SMIN oracles can be implemented in polynomial time; from this it immediately follows that MWU and SINKHORN can be implemented in polynomial time. Finally, in §7.3 and §7.4, we provide two illustrative applications of these algorithms. The former regards portfolio risk management and is a direct application of our result for MOT with low-rank cost tensors. The latter regards projecting mixture distributions onto the transportation polytope and illustrates the versatility of our algorithmic results, since this problem is quadratic optimization over the transportation polytope rather than linear (a.k.a. MOT).

7.1 Setup

We begin by recalling the definition of tensor rank. It is the direct analog of the standard concept of matrix rank. See the survey (kolda2009tensor, ) for further background.

Definition 7.1 (Tensor rank).

A rank-rr factorization of a tensor R(n)kR\in(\mathbb{R}^{n})^{\otimes k} is a collection of rkrk vectors {ui,}i[k],[r]n\{u_{i,\ell}\}_{i\in[k],\ell\in[r]}\subset\mathbb{R}^{n} satisfying

R==1ri=1kui,.R=\sum_{\ell=1}^{r}\bigotimes_{i=1}^{k}u_{i,\ell}.

The rank of a tensor is the minimal rr for which there exists a rank-rr factorization.
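To make the factored representation concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) assembles a rank-$r$ tensor from its $rk$ factor vectors for $k=3$. The point is the storage gap: the factored form uses only $rkn$ numbers, whereas the dense tensor has $n^k$ entries.

```python
import numpy as np

# A rank-r tensor in (R^n)^{(x)k} assembled from its r*k factor vectors u_{i,l},
# here for k = 3. Materializing the dense tensor (done only for illustration)
# costs n^k entries, while the factored form stores just r*k*n numbers.
n, k, r = 4, 3, 2
rng = np.random.default_rng(0)
u = rng.random((k, r, n))  # u[i, l] is the factor vector u_{i,l}

R = sum(np.einsum('a,b,c->abc', u[0, l], u[1, l], u[2, l]) for l in range(r))
print(R.shape)  # (4, 4, 4): n^k = 64 dense entries vs r*k*n = 24 factored numbers
```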

In this section we consider MOT problems with the following “low-rank plus sparse” structure.

Definition 7.2 (Low-rank plus sparse structure for MOT).

An MOT cost tensor C(n)kC\in(\mathbb{R}^{n})^{\otimes k} has low-rank plus sparse structure of rank rr and sparsity ss if it decomposes as

C=R+S,\displaystyle C=R+S, (7.1)

where RR is a rank-rr tensor and SS is an ss-sparse tensor.

Throughout, we make the natural assumption that SS is input through its ss non-zero entries, and that RR is input through a rank-rr factorization. We also make the natural assumption that the entries of both RR and SS are of size O(Cmax)O(C_{\max})—this rules out the case of having extremely large entries of RR and SS, one positive and one negative, which cancel to yield a small entry of C=R+SC=R+S.

Remark 7.3 (Neither low-rank structure nor sparse structure can be modeled by graphical structure or set-optimization structure).

In general, both rank-11 costs and polynomially sparse costs do not have non-trivial graphical structure. Specifically, modeling these costs with graphical structure requires the complete graph (a.k.a., maximal treewidth of k1k-1)—and because MOT problems with graphical structure of treewidth k1k-1 are 𝖭𝖯\mathsf{NP}-hard to solve in the absence of further structure (AltBoi20hard, ), modeling such problems with graphical structure is useless for the purpose of designing polynomial-time MOT algorithms. It is also clear that neither low-rank structure nor sparse structure can be modeled by set-optimization structure because in general, neither RR nor SS nor R+SR+S has binary-valued entries.

7.2 Polynomial-time algorithms

From a technical perspective, the main result of this section is that there is a polynomial-time algorithm for approximating the minimum entry of a tensor that decomposes into constant-rank and sparse components. Previously, this was not known even for constant-rank tensors. This result may be of independent interest. We remark that this result is optimal in the sense that unless 𝖯=𝖭𝖯\mathsf{P}=\mathsf{NP}, there does not exist an algorithm with runtime that is jointly polynomial in the input size and the rank rr (AltBoi20hard, ).

Theorem 7.4 (Polynomial-time algorithm solving AMIN and SMIN for low-rank + sparse costs).

Consider cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k} that have low-rank plus sparse structure of rank rr and sparsity ss (see Definition 7.2). For any fixed rr, Algorithm 8 runs in poly(n,k,s,Cmax/ε)\mathrm{poly}(n,k,s,C_{\max}/\varepsilon) time and solves the ε\varepsilon-approximate AMINC\textsf{AMIN}_{C} oracle. Furthermore, it also solves the SMINC~\textsf{SMIN}_{\tilde{C}} oracle for η=(2klogn)/ε\eta=(2k\log n)/\varepsilon on some cost tensor C~(n)k\tilde{C}\in(\mathbb{R}^{n})^{\otimes k} satisfying CC~maxε/2\|C-\tilde{C}\|_{\max}\leqslant\varepsilon/2.

We make three remarks about Theorem 7.4. First, we are unaware of any polynomial-time implementation of $\textsf{SMIN}_{C}$ for the cost $C$ itself. Instead, Theorem 7.4 solves the $\textsf{SMIN}_{\tilde{C}}$ oracle for an $O(\varepsilon)$-approximate cost tensor $\tilde{C}$, since this is sufficient for implementing SINKHORN on the original cost tensor $C$ (see Corollary 7.5 below). Second, it is an interesting open question whether the $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ runtime for the $\varepsilon$-approximate $\textsf{AMIN}_{C}$ oracle can be improved to $\mathrm{poly}(n,k,\log(C_{\max}/\varepsilon))$, as this would imply a $\mathrm{poly}(n,k)$ runtime for the $\textsf{MIN}_{C}$ oracle and thus for this class of MOT problems (see also Footnote 5 in the introduction). Third, we remark on practical efficiency: the runtime of Algorithm 8 is not just polynomial in $s$ and $n$, but in fact linear in $s$ and near-linear in $n$. However, since this improved runtime is not needed for the theoretical results in the sequel, we do not pursue it further.

Combining the efficient oracle implementations in Theorem 7.4 with our algorithm-to-oracles reductions in §4 implies the first polynomial-time algorithms for MOT problems with costs that have constant-rank plus sparse structure. As noted above, this is optimal in the sense that, unless $\mathsf{P}=\mathsf{NP}$, no algorithm has runtime jointly polynomial in the input size and the rank $r$ (AltBoi20hard, ).

Corollary 7.5 (Polynomial-time algorithms solving MOT for low-rank + sparse costs).

Consider cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k} that have low-rank plus sparse structure of constant rank rr and poly(n,k)\mathrm{poly}(n,k) sparsity ss (see Definition 7.2). For any ε>0\varepsilon>0:

  • The MWU algorithm in §4.2 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

  • The SINKHORN algorithm in §4.3 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

Moreover, MWU outputs a polynomially sparse tensor, whereas SINKHORN outputs a fully dense tensor through the implicit representation described in §4.3.1.

Proof.

For MWU, simply combine the polynomial-time reduction to the AMINC\textsf{AMIN}_{C} oracle (Theorem 4.7) with the polynomial-time algorithm for the AMIN oracle (Theorem 7.4). For SINKHORN, combining the polynomial-time reduction to the SMINC~\textsf{SMIN}_{\tilde{C}} oracle (Theorem 4.18) with the polynomial-time algorithm for the SMINC~\textsf{SMIN}_{\tilde{C}} oracle (Theorem 7.4) yields a poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) algorithm for ε/2\varepsilon/2-approximating the MOT problem with cost tensor C~\tilde{C}. It therefore suffices to show that the values of the MOT problems with cost tensors CC and C~\tilde{C} differ by at most ε/2\varepsilon/2, that is,

|minP(μ1,,μk)P,CminP(μ1,,μk)P,C~|ε/2.\left\lvert\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,C\rangle-\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,\tilde{C}\rangle\right\rvert\leqslant\varepsilon/2.

But this holds because both MOT problems have the same feasible set, and for any feasible P(μ1,,μk)P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}) it follows from Hölder’s inequality that the objectives of the two MOT problems differ by at most

|P,CP,C~|P1CC~maxε/2.\left\lvert\langle P,C\rangle-\langle P,\tilde{C}\rangle\right\rvert\leqslant\|P\|_{1}\|C-\tilde{C}\|_{\max}\leqslant\varepsilon/2.

Below, we describe the algorithm in Theorem 7.4. Specifically, in §7.2.1, we give four helper lemmas which form the technical core of our algorithm; then, in §7.2.2, we combine these ingredients to design the algorithm and prove its correctness. Throughout, recall that we use the bracket notation $f[A]$ to denote the entrywise application of a univariate function $f$ (e.g., $\exp$, $\log$, or a polynomial) to $A$.

7.2.1 Technical ingredients

At a high level, our approach to designing the algorithm in Theorem 7.4 is to approximately compute the SMIN oracle in polynomial time by synthesizing four facts:

  1. 1.

    By expanding the softmin and performing simple operations, it suffices to compute the total sum of all nkn^{k} entries of the entrywise exponentiated tensor exp[ηR]\exp[-\eta R] (modulo simple transforms).

  2. 2.

    Although exp[ηR]\exp[-\eta R] is in general a full-rank tensor, we can exploit the fact that RR is a low-rank tensor in order to approximate exp[ηR]\exp[-\eta R] by a low-rank tensor LL. (Moreover, we can efficiently compute a low-rank factorization of LL in closed form.)

  3. 3.

    There is a simple algorithm for computing the sum of all nkn^{k} entries of LL in polynomial time because LL is low-rank. (And thus we may approximate the sum of all nkn^{k} entries of exp[ηR]\exp[-\eta R] as desired in step 1.)

  4. 4.

    This approximation is sufficient for computing both the AMIN and SMIN oracle in Theorem 7.4.

Of these four steps, the main technical step is the low-rank approximation in step two. Below, we formalize these four steps individually in Lemmas 7.6, 7.7, 7.8, and 7.9. Further detail on how to synthesize these four steps is provided afterwards, in the proof of Theorem 7.4.

It is convenient to write the first lemma in terms of an approximate tensor C~=R~+S\tilde{C}=\tilde{R}+S rather than the original cost C=R+SC=R+S.

Lemma 7.6 (Softmin for cost with sparse component).

Let C~=R~+S\tilde{C}=\tilde{R}+S and p1,,pknp_{1},\dots,p_{k}\in\mathbb{R}^{n}. Then

sminηj[n]kC~ji=1k[pi]ji=η1log(a+b),\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in[n]^{k}}\tilde{C}_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}=-\eta^{-1}\log(a+b),

where di:=exp[ηpi]0nd_{i}:=\exp[\eta p_{i}]\in\mathbb{R}_{\geqslant 0}^{n},

a:=j[n]ks.t. Sj0i=1k[di]jieηR~j(eηSj1)\displaystyle a:=\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }S_{\vec{j}}\neq 0\end{subarray}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}\cdot(e^{-\eta S_{\vec{j}}}-1) (7.2)

and

b:=j[n]ki=1k[di]jieηR~j.\displaystyle b:=\sum_{\vec{j}\in[n]^{k}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}. (7.3)
Proof.

By expanding the definition of softmin, and then substituting pip_{i} with did_{i} and C~\tilde{C} with R~+S\tilde{R}+S,

sminηj[n]kC~ji=1k[pi]ji\displaystyle\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in[n]^{k}}\tilde{C}_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} =1ηlog(j[n]keηi=1k[pi]jieηC~j)=1ηlog(j[n]ki=1k[di]jieηR~jeηSj).\displaystyle=-\frac{1}{\eta}\log\left(\sum_{\vec{j}\in[n]^{k}}e^{\eta\sum_{i=1}^{k}[p_{i}]_{j_{i}}}e^{-\eta\tilde{C}_{\vec{j}}}\right)=-\frac{1}{\eta}\log\left(\sum_{\vec{j}\in[n]^{k}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}e^{-\eta S_{\vec{j}}}\right).

By simple manipulations, we conclude that the above quantity is equal to the desired quantity:

=\displaystyle\cdots= 1ηlog(j[n]ks.t. Sj0i=1k[di]jieηR~jeηSj+j[n]ks.t. Sj=0i=1k[di]jieηR~j)\displaystyle-\frac{1}{\eta}\log\left(\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }S_{\vec{j}}\neq 0\end{subarray}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}e^{-\eta S_{\vec{j}}}+\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }S_{\vec{j}}=0\end{subarray}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}\right)
=\displaystyle= 1ηlog(j[n]ks.t. Sj0i=1k[di]jieηR~j(eηSj1)+j[n]ki=1k[di]jieηR~j)\displaystyle-\frac{1}{\eta}\log\left(\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }S_{\vec{j}}\neq 0\end{subarray}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}\left(e^{-\eta S_{\vec{j}}}-1\right)+\sum_{\vec{j}\in[n]^{k}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}\right)
=\displaystyle= 1ηlog(a+b).\displaystyle-\frac{1}{\eta}\log(a+b).

Above, the first step is by partitioning the sum over $\vec{j}\in[n]^{k}$ based on whether $S_{\vec{j}}=0$, the second step is by adding and subtracting $\sum_{\vec{j}\in[n]^{k}\text{ s.t. }S_{\vec{j}}\neq 0}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}$, and the last step is by the definitions of $a$ and $b$. ∎

Lemma 7.7 (Low-rank approximation of the exponential of a low-rank tensor).

There is an algorithm that given R(n)kR\in(\mathbb{R}^{n})^{\otimes k} in rank-rr factored form, η>0\eta>0, and a precision ε~<eηRmax\tilde{\varepsilon}<e^{-\eta R_{\max}}, takes npoly(k,r~)n\cdot\mathrm{poly}(k,\tilde{r}) time to compute a rank-r~\tilde{r} tensor L(n)kL\in(\mathbb{R}^{n})^{\otimes k} in factored form satisfying Lexp[ηR]maxε~\|L-\exp[-\eta R]\|_{\max}\leqslant\tilde{\varepsilon}, where

r~(r+O(log1ε~)r).\displaystyle\tilde{r}\leqslant\binom{r+O(\log\tfrac{1}{\tilde{\varepsilon}})}{r}. (7.4)
Proof.

By classical results from approximation theory (see, e.g., (Tre19, )), there exists a polynomial qq of degree m=O(log1/ε~)m=O(\log 1/\tilde{\varepsilon}) satisfying

|exp(ηx)q(x)|ε~,x[Rmax,Rmax].\left\lvert\exp(-\eta x)-q(x)\right\rvert\leqslant\tilde{\varepsilon},\qquad\forall x\in[-R_{\max},R_{\max}].

For instance, the Taylor or Chebyshev expansion of xexp(ηx)x\mapsto\exp(-\eta x) suffices. Thus the tensor LL with entries

Lj=q(Rj)L_{\vec{j}}=q(R_{\vec{j}})

approximates exp[ηR]\exp[-\eta R] to error

Lexp[ηR]maxε~.\|L-\exp[-\eta R]\|_{\max}\leqslant\tilde{\varepsilon}.

We now show that LL has rank r~(r+mr)\tilde{r}\leqslant\binom{r+m}{r}, and moreover that a rank-r~\tilde{r} factorization can be computed in npoly(k,r~)n\cdot\mathrm{poly}(k,\tilde{r}) time. Denote q(x)=t=0matxtq(x)=\sum_{t=0}^{m}a_{t}x^{t} and R==1ri=1kui,R=\sum_{\ell=1}^{r}\otimes_{i=1}^{k}u_{i,\ell}. By definition of LL, definition of qq and RR, and then the Multinomial Theorem,

L_{\vec{j}}=q(R_{\vec{j}})=\sum_{t=0}^{m}a_{t}\left(\sum_{\ell=1}^{r}\prod_{i=1}^{k}[u_{i,\ell}]_{j_{i}}\right)^{t}=\sum_{\alpha\in\mathbb{N}_{0}^{r}\,:\,|\alpha|\leqslant m}\binom{|\alpha|}{\alpha}a_{|\alpha|}\prod_{\ell=1}^{r}\prod_{i=1}^{k}[u_{i,\ell}]_{j_{i}}^{\alpha_{\ell}},

where the sum is over rr-tuples α\alpha with non-negative entries summing to at most mm. Thus

L=α0r:|α|mi=1kvi,α,L=\sum_{\alpha\in\mathbb{N}_{0}^{r}\,:\,|\alpha|\leqslant m}\bigotimes_{i=1}^{k}v_{i,\alpha},

where $v_{i,\alpha}\in\mathbb{R}^{n}$ denotes the vector with $j$-th entry $\binom{|\alpha|}{\alpha}a_{|\alpha|}\prod_{\ell=1}^{r}[u_{1,\ell}]_{j}^{\alpha_{\ell}}$ for $i=1$, and $\prod_{\ell=1}^{r}[u_{i,\ell}]_{j}^{\alpha_{\ell}}$ for $i>1$. This yields the desired low-rank factorization of $L$ because

r~#{α0r:|α|m}=(r+mr).\tilde{r}\leqslant\#\{\alpha\in\mathbb{N}_{0}^{r}\;:\;|\alpha|\leqslant m\}=\binom{r+m}{r}.

Finally, since each of the kr~k\tilde{r} vectors vi,αv_{i,\alpha} in the factorization of LL can be computed efficiently from the closed-form expression above, the desired runtime follows. ∎
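The following NumPy sketch illustrates the construction in the proof, using the Taylor coefficients $a_t=(-\eta)^t/t!$ for concreteness and a hand-picked degree $m$ rather than the precision-driven choice from the lemma; it is an illustration of the argument, not an optimized implementation.

```python
import numpy as np
from itertools import product
from math import factorial

def exp_low_rank_factors(U, eta, m):
    """Factors V[i, idx] of L = q[R], where q is the degree-m Taylor polynomial
    of exp(-eta*x) and R is given in factored form U of shape (k, r, n).
    One factor per multi-index alpha with |alpha| <= m, so r~ <= C(r+m, r)."""
    k, r, n = U.shape
    alphas = [al for al in product(range(m + 1), repeat=r) if sum(al) <= m]
    V = np.ones((k, len(alphas), n))
    for idx, alpha in enumerate(alphas):
        for i in range(k):
            for l in range(r):
                V[i, idx] *= U[i, l] ** alpha[l]
        # (|alpha| choose alpha) * a_{|alpha|} = (-eta)^{|alpha|} / prod_l alpha_l!
        V[0, idx] *= (-eta) ** sum(alpha) / np.prod([factorial(al) for al in alpha])
    return V

# Sanity check on a tiny instance: the factored L matches exp[-eta*R] entrywise.
rng = np.random.default_rng(1)
k, r, n, eta, m = 3, 2, 4, 0.5, 12
U = rng.random((k, r, n))
R = sum(np.einsum('a,b,c->abc', U[0, l], U[1, l], U[2, l]) for l in range(r))
V = exp_low_rank_factors(U, eta, m)
L = sum(np.einsum('a,b,c->abc', V[0, idx], V[1, idx], V[2, idx])
        for idx in range(V.shape[1]))
print(np.abs(L - np.exp(-eta * R)).max())  # tiny, since eta*R_max <= 1 here
```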

Lemma 7.8 (Marginalizing a scaled low-rank tensor).

Given vectors d1,,dknd_{1},\dots,d_{k}\in\mathbb{R}^{n} and a tensor L(n)kL\in(\mathbb{R}^{n})^{\otimes k} through a rank r~\tilde{r} factorization, we can compute m((i=1kdi)L)m((\otimes_{i=1}^{k}d_{i})\odot L) in O(nkr~)O(nk\tilde{r}) time.

Proof.

Denote the factorization of LL by L==1r~i=1kvi,L=\sum_{\ell=1}^{\tilde{r}}\otimes_{i=1}^{k}v_{i,\ell}. Then

m((i=1kdi)L)\displaystyle m((\otimes_{i=1}^{k}d_{i})\odot L) =j[n]k[(i=1kdi)L]j=j[n]k=1r~i=1k[di]ji[vi,]ji==1r~i=1kj=1n[di]j[vi,]j==1r~i=1kdi,vi,,\displaystyle=\sum_{\vec{j}\in[n]^{k}}\left[(\otimes_{i=1}^{k}d_{i})\odot L\right]_{\vec{j}}=\sum_{\vec{j}\in[n]^{k}}\sum_{\ell=1}^{\tilde{r}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}[v_{i,\ell}]_{j_{i}}=\sum_{\ell=1}^{\tilde{r}}\prod_{i=1}^{k}\sum_{j=1}^{n}[d_{i}]_{j}[v_{i,\ell}]_{j}=\sum_{\ell=1}^{\tilde{r}}\prod_{i=1}^{k}\langle d_{i},v_{i,\ell}\rangle,

where the first step is by definition of the $m(\cdot)$ operation that sums over all entries, the second step is by definition of $L$, and the third step is by swapping products and sums. Thus computing the desired quantity amounts to computing $\tilde{r}k$ inner products of $n$-dimensional vectors, which can be done in $O(nk\tilde{r})$ time. ∎
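A short sketch of this computation, with a brute-force check on a tiny instance (for illustration only; in the algorithm the dense $n^k$ tensors are of course never materialized):

```python
import numpy as np

def marginalize_scaled_low_rank(d, V):
    """m((d_1 (x) ... (x) d_k) ⊙ L) = sum_l prod_i <d_i, v_{i,l}>, computed
    from the factors V of shape (k, r~, n) in O(n*k*r~) time."""
    k, r_tilde, n = V.shape
    return sum(np.prod([d[i] @ V[i, l] for i in range(k)]) for l in range(r_tilde))

# Brute-force check on a tiny k = 3 instance.
rng = np.random.default_rng(2)
k, r_tilde, n = 3, 2, 4
V = rng.random((k, r_tilde, n))
d = [rng.random(n) for _ in range(k)]
L = sum(np.einsum('a,b,c->abc', V[0, l], V[1, l], V[2, l]) for l in range(r_tilde))
dense = np.einsum('a,b,c->abc', *d) * L
assert np.isclose(dense.sum(), marginalize_scaled_low_rank(d, V))
```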

Lemma 7.9 (Precision of the low-rank approximation).

Let $\varepsilon\leqslant 1$ and $\eta\geqslant 1$. Suppose $L\in(\mathbb{R}^{n})^{\otimes k}$ satisfies $\|L-\exp[-\eta R]\|_{\max}\leqslant\tfrac{\varepsilon}{3}e^{-\eta R_{\max}}$. Then the tensor $\tilde{C}:=-\frac{1}{\eta}\log[L]+S$ satisfies

C~Cmaxε2.\displaystyle\|\tilde{C}-C\|_{\max}\leqslant\frac{\varepsilon}{2}. (7.5)
Proof.

Observe that the minimum entry of LL is at least

eηRmaxε3eηRmax23eηRmax.\displaystyle e^{-\eta R_{\max}}-\tfrac{\varepsilon}{3}e^{-\eta R_{\max}}\geqslant\tfrac{2}{3}e^{-\eta R_{\max}}. (7.6)

Since this is strictly positive, the tensor R~:=η1log[L]\tilde{R}:=-\eta^{-1}\log[L] is well defined. Furthermore,

ηR~ηRmax=maxj[n]k|ηR~jηRj|maxj[n]k|LjeηRj|min(Lj,eηRj)ε3eηRmax23eηRmax=ε2,\|\eta\tilde{R}-\eta R\|_{\max}=\max_{\vec{j}\in[n]^{k}}\left\lvert\eta\tilde{R}_{\vec{j}}-\eta R_{\vec{j}}\right\rvert\leqslant\max_{\vec{j}\in[n]^{k}}\frac{\left\lvert L_{\vec{j}}-e^{-\eta R_{\vec{j}}}\right\rvert}{\min(L_{\vec{j}},e^{-\eta R_{\vec{j}}})}\leqslant\frac{\tfrac{\varepsilon}{3}e^{-\eta R_{\max}}}{\tfrac{2}{3}e^{-\eta R_{\max}}}=\frac{\varepsilon}{2},

where above the first step is by definition of the max norm; the second step is by the elementary inequality |logxlogy||xy|/min(x,y)|\log x-\log y|\leqslant|x-y|/\min(x,y) which holds for positive scalars xx and yy (AltBacRud19, , Lemma K); and the third step is by (7.6) and the approximation bound of LL. Since η1\eta\geqslant 1, we therefore conclude that R~Rmaxε/2\|\tilde{R}-R\|_{\max}\leqslant\varepsilon/2. By adding and subtracting SS, this implies C~Cmax=R~Rmaxε/2\|\tilde{C}-C\|_{\max}=\|\tilde{R}-R\|_{\max}\leqslant\varepsilon/2. ∎

7.2.2 Proof of Theorem 7.4

We are now ready to state the algorithm in Theorem 7.4. Pseudocode is in Algorithm 8. Note that R~=η1log[L]\tilde{R}=-\eta^{-1}\log[L] and C~=R~+S\tilde{C}=\tilde{R}+S are never explicitly computed because in both Lines 3 and 4, the algorithm performs the relevant operations only through the low-rank tensor LL and the sparse tensor SS.

Algorithm 8 Polynomial-time algorithm for AMIN and SMIN for low-rank + sparse costs (Theorem 7.4).

Input: Low-rank tensor RR, sparse tensor SS, matrix p¯n×kp\in\bar{\mathbb{R}}^{n\times k}, accuracy ε>0\varepsilon>0
      Output: Solution to both AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon) on cost tensor C=R+SC=R+S, and also SMINC~(p,(2klogn)/ε)\textsf{SMIN}_{\tilde{C}}(p,(2k\log n)/\varepsilon) on some approximate cost tensor C~\tilde{C} satisfying CC~maxε/2\|C-\tilde{C}\|_{\max}\leqslant\varepsilon/2

1:η(2klogn)/ε\eta\leftarrow(2k\log n)/\varepsilon
2:Compute low-rank approximation LL of exp[ηR]\exp[-\eta R] via Lemma 7.7, for precision ε~=ε3eηRmax\tilde{\varepsilon}=\frac{\varepsilon}{3}e^{-\eta R_{\max}}
3:Compute aa in (7.2) directly by enumerating over the polynomially many non-zero entries of SS, where R~=η1log[L]\tilde{R}=-\eta^{-1}\log[L]
4:Compute bb in (7.3) via Lemma 7.8, where R~=η1log[L]\tilde{R}=-\eta^{-1}\log[L]
5:Return η1log(a+b)-\eta^{-1}\log(a+b)
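For concreteness, the following self-contained sketch traces Algorithm 8 on a tiny rank-one instance and checks it against a dense softmin computation. The constants (in particular the Taylor degree $m$) are hand-picked for illustration rather than set by the precision analysis of Lemma 7.7, and all variable names are our own.

```python
import numpy as np
from itertools import product
from math import factorial

rng = np.random.default_rng(0)
n, k, r, eta, m = 4, 3, 1, 2.0, 25           # illustrative constants

U = rng.random((k, r, n))                    # factored rank-r part R
S = {(0, 1, 2): 0.7, (3, 3, 0): -0.4}        # sparse part: nonzero entries only
p = rng.random((k, n))                       # oracle input p = (p_1, ..., p_k)
d = np.exp(eta * p)                          # d_i = exp[eta p_i] as in Lemma 7.6

# Line 2 (Lemma 7.7): factors V of L ~ exp[-eta R], one per alpha, |alpha| <= m.
alphas = [al for al in product(range(m + 1), repeat=r) if sum(al) <= m]
V = np.ones((k, len(alphas), n))
for idx, alpha in enumerate(alphas):
    for i in range(k):
        for l in range(r):
            V[i, idx] *= U[i, l] ** alpha[l]
    V[0, idx] *= (-eta) ** sum(alpha) / np.prod([factorial(al) for al in alpha])

def L_entry(j):  # L_j = e^{-eta * Rtilde_j}, read off from the factors
    return sum(np.prod([V[i, idx, j[i]] for i in range(k)])
               for idx in range(len(alphas)))

# Line 3 (eq. (7.2)): the a-term, enumerating the s nonzero entries of S.
a = sum(np.prod([d[i, j[i]] for i in range(k)]) * L_entry(j)
        * (np.exp(-eta * S_j) - 1) for j, S_j in S.items())

# Line 4 (eq. (7.3), via Lemma 7.8): the b-term, marginalizing the scaled L.
b = sum(np.prod([d[i] @ V[i, idx] for i in range(k)]) for idx in range(len(alphas)))

smin_factored = -np.log(a + b) / eta         # Line 5: the softmin of Lemma 7.6

# Dense check of Lemma 7.6 with Ctilde = Rtilde + S (feasible only for tiny n^k).
Ctilde = np.zeros((n,) * k)
for j in np.ndindex(*(n,) * k):
    Ctilde[j] = -np.log(L_entry(j)) / eta    # Rtilde = -eta^{-1} log[L]
for j, S_j in S.items():
    Ctilde[j] += S_j
shift = p[0][:, None, None] + p[1][None, :, None] + p[2][None, None, :]
smin_dense = -np.log(np.exp(-eta * (Ctilde - shift)).sum()) / eta
print(smin_factored, smin_dense)             # agree up to floating-point error
```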
Proof of Theorem 7.4.

Proof of correctness for SMIN. Consider any oracle inputs p=(p1,,pk)¯n×kp=(p_{1},\dots,p_{k})\in\bar{\mathbb{R}}^{n\times k}. By Lemma 7.9, the tensor C~=R~+S=η1logL+S\tilde{C}=\tilde{R}+S=-\eta^{-1}\log L+S satisfies C~Cmaxε/2\|\tilde{C}-C\|_{\max}\leqslant\varepsilon/2. Therefore it suffices to show that Algorithm 8 correctly computes SMINC~(p,η)\textsf{SMIN}_{\tilde{C}}(p,\eta). This is true because that quantity is equal to η1log(a+b)-\eta^{-1}\log(a+b) by Lemma 7.6.

Proof of correctness for AMIN. We have just established that Algorithm 8 computes $\textsf{SMIN}_{\tilde{C}}(p,\eta)$. Because $\eta=(2k\log n)/\varepsilon$ and SMIN with this regularization implements AMIN (Remark 3.6), it follows that $\textsf{SMIN}_{\tilde{C}}(p,\eta)$ is within additive accuracy $\varepsilon/2$ of $\textsf{MIN}_{\tilde{C}}(p)$. Therefore, by the triangle inequality, it suffices to show that $\textsf{MIN}_{\tilde{C}}(p)$ is within $\varepsilon/2$ additive accuracy of $\textsf{MIN}_{C}(p)$. That is, it suffices to show that

\left\lvert\min_{\vec{j}\in[n]^{k}}\left(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right)-\min_{\vec{j}\in[n]^{k}}\left(\tilde{C}_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right)\right\rvert\leqslant\varepsilon/2.

But this is true because CC~maxε/2\|C-\tilde{C}\|_{\max}\leqslant\varepsilon/2 by Lemma 7.9, and thus the quantities Cji=1k[pi]jiC_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} and C~ji=1k[pi]ji\tilde{C}_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} are within additive accuracy ε/2\varepsilon/2 for each j[n]k\vec{j}\in[n]^{k}.

Proof of runtime. We prove the claimed runtime bound simultaneously for the AMIN and SMIN computation because we use the same algorithm for both. To this end, we first bound the rank r~\tilde{r} of the low-rank approximation LL computed in Lemma 7.7. Note that since ε~=ε3eηRmax\tilde{\varepsilon}=\tfrac{\varepsilon}{3}e^{-\eta R_{\max}} and since it is assumed that Rmax=O(Cmax)R_{\max}=O(C_{\max}), we have log1/ε~=O(Cmaxεklogn)\log 1/\tilde{\varepsilon}=O(\tfrac{C_{\max}}{\varepsilon}k\log n). Therefore

r~(r+O(log1/ε~)r)=O(log1/ε~)r=O(Cmaxεklogn)r=poly(logn,k,Cmax/ε).\tilde{r}\leqslant\binom{r+O(\log 1/\tilde{\varepsilon})}{r}=O(\log 1/\tilde{\varepsilon})^{r}=O(\tfrac{C_{\max}}{\varepsilon}k\log n)^{r}=\mathrm{poly}(\log n,k,C_{\max}/\varepsilon).

Above, the first step is by Lemma 7.7, and the final step is because rr is assumed constant.

Therefore Line 2 in Algorithm 8 takes polynomial time by Lemma 7.7, Line 3 takes polynomial time by simply enumerating over the ss non-zero entries of SS, and Line 4 takes polynomial time by Lemma 7.8. ∎

7.3 Application vignette: risk estimation

Here we consider an application to portfolio risk management. For simplicity of exposition, let us first describe the setting of 11 financial instrument (“stock”). Consider investing in one unit of a stock for kk years. For i{0,,k}i\in\{0,\dots,k\}, let XiX_{i} denote the price of the stock at year ii. Suppose that the return ρi=Xi/Xi1\rho_{i}=X_{i}/X_{i-1} of the stock between years i1i-1 and ii is believed to follow some distribution ρiμi\rho_{i}\sim\mu_{i}. A fundamental question about the riskiness of this stock is to compute the investor’s expected profit in the worst-case over all joint probability distributions on future returns (ρ1,,ρk)(\rho_{1},\dots,\rho_{k}) that are consistent with the modeled marginal distributions (μ1,,μk)(\mu_{1},\dots,\mu_{k}). This is an MOT problem with cost CC given by

C(ρ1,,ρk)=i[k]ρi,C(\rho_{1},\dots,\rho_{k})=\prod_{i\in[k]}\rho_{i},

where here we view CC as a function rather than a tensor for notational simplicity. If each return ρi\rho_{i} has nn possible values (e.g., after quantization), then the cost CC is equivalently represented as a rank-1 tensor in (n)k(\mathbb{R}^{n})^{\otimes k} (by assigning an index to each of the nn possible values of each ρi\rho_{i}). Therefore our result Corollary 7.5 provides a polynomial-time algorithm for solving this MOT problem defining the investor’s worst-case profit.
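To see the rank-one structure concretely, here is a small sketch (with hypothetical quantized return values) showing that the cost tensor is exactly the outer product of the vectors of possible yearly returns, so its entries can be read off from $k$ vectors without ever materializing the $n^k$ tensor.

```python
import numpy as np

# The cost C(rho_1,...,rho_k) = prod_i rho_i on quantized returns is the rank-1
# tensor atoms[0] (x) ... (x) atoms[k-1], where atoms[i] lists the n possible
# values of rho_i (hypothetical values below).
n, k = 5, 4
rng = np.random.default_rng(0)
atoms = 1.0 + 0.1 * rng.random((k, n))        # possible return values per year

j = (1, 3, 0, 2)                              # an index into the n^k cost tensor
C_j = np.prod([atoms[i][j[i]] for i in range(k)])

C_dense = np.einsum('a,b,c,d->abcd', *atoms)  # materialized only for this check
assert np.isclose(C_dense[j], C_j)
```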

Rather than formalizing this argument for a single stock, we directly state the general case of investing in $r\geqslant 1$ stocks. This is essentially identical to the case of $r=1$, modulo additional notation.

Corollary 7.10 (Polynomial-time algorithm for expected profit given marginals on the returns).

Suppose an investor holds 11 unit of rr stocks for kk years. For each stock [r]\ell\in[r] and each year i[k]i\in[k], let ρi,\rho_{i,\ell} denote the relative price of stock \ell between years ii and i1i-1. Suppose ρi,\rho_{i,\ell} has distribution μi,\mu_{i,\ell}, and that each μi,\mu_{i,\ell} has at most nn atoms. Let Rmax=max{ρi,}=1ri=1kρi,R_{\max}=\max_{\{\rho_{i,\ell}\}}\sum_{\ell=1}^{r}\prod_{i=1}^{k}\rho_{i,\ell} denote the maximal possible return. For any constant number of stocks rr, there is a poly(n,k,Rmax/ε)\mathrm{poly}(n,k,R_{\max}/\varepsilon) time algorithm for ε\varepsilon-approximating the expected profit in the worst-case over all futures that are consistent with the returns’ marginal distributions.

Proof.

This is the optimization problem

minP({μi,}i[k],[r])𝔼{ρi,}i[k],[r]P[=1ri=1kρi,]\min_{P\in\mathcal{M}(\{\mu_{i,\ell}\}_{i\in[k],\ell\in[r]})}\mathbb{E}_{\{\rho_{i,\ell}\}_{i\in[k],\ell\in[r]}\sim P}\left[\sum_{\ell=1}^{r}\prod_{i=1}^{k}\rho_{i,\ell}\right]

over all joint distributions $P$ on the returns $\{\rho_{i,\ell}\}_{i\in[k],\ell\in[r]}$ that are consistent with the marginal distributions $\{\mu_{i,\ell}\}_{i\in[k],\ell\in[r]}$. This is an MOT problem with $k^{\prime}=rk$ marginals, each over $n$ atoms, with cost function

C({ρi,}i[k],[r])=[r](i,)[k]×[r][k](ρi,𝟙[=]+𝟙[]).\displaystyle C\Big{(}\{\rho_{i,\ell}\}_{i\in[k],\ell\in[r]}\Big{)}=\sum_{\ell^{\prime}\in[r]}\prod_{(i,\ell)\in[k]\times[r]\cong[k^{\prime}]}\big{(}\rho_{i,\ell}\cdot\mathds{1}[\ell=\ell^{\prime}]+\mathds{1}[\ell\neq\ell^{\prime}]\big{)}. (7.7)

By viewing this cost function CC as a cost tensor in the natural way (i.e., assigning an index to each of the nn possible values of ρi,\rho_{i,\ell}), this representation (7.7) shows that the corresponding cost tensor C(n)kC\in(\mathbb{R}^{n})^{\otimes k^{\prime}} has rank rr. Moreover, observe that the maximum entry of the cost is RmaxR_{\max}. Therefore we may appeal to our polynomial-time MOT algorithms in Corollary 7.5 for costs with constant rank. ∎

The algorithm readily generalizes, e.g., to the case where the investor holds multiple units of a stock, or where stocks are held for different numbers of years. The former is modeled simply by adding an extra year in which the return of stock $\ell$ equals the number of units held, with probability $1$. The latter is modeled simply by setting the return of stock $\ell$ to $1$ for all years after it is held, with probability $1$.

In Figure 7, we provide a numerical illustration comparing our new polynomial-time algorithms for this risk estimation task with the previous fastest algorithms. Previously, the fastest algorithms that apply to this problem are out-of-the-box LP solvers run on MOT, and the brute-force implementation of SINKHORN which marginalizes over all nkn^{k} entries in each iteration. Since both of these previous algorithms have exponential runtime that scales as nΩ(k)n^{\Omega(k)}, they do not scale beyond tiny input sizes of n=10n=10 and k=8k=8 even with two minutes of computation time. In contrast, our new polynomial-time algorithms compute high-quality solutions for problems that are orders-of-magnitude larger. For example, our polynomial-time implementation of SINKHORN takes less than a second to solve an MOT LP with nk=1030n^{k}=10^{30} variables.

Details for this numerical experiment: we consider $r=1$ stock over $k$ timesteps, where each marginal distribution $\mu_i$ is uniform on $[1,1+1/k]$, discretized with $n=10$ atoms. We implement the AMIN and SMIN oracles efficiently by using our above algorithm to exploit the rank-one structure of the cost tensor. In particular, the polynomial approximation we use here to approximate $\exp[-\eta C]$ is the degree-$5$ Taylor approximation (cf. Lemma 7.7). This lets us run SINKHORN and MWU in polynomial time, as described above. In the numerical experiment, we also implement an approximate version of COLGEN using our polynomial-time implementation of the approximate violation oracle AMIN. Since the algorithms compute an upper bound, a lower value is better in the right plot of Figure 7. We observe that MWU yields the loosest approximation for this application, whereas our implementations of SINKHORN and COLGEN produce high-quality approximations, as is evident from comparison to the exact LP solver in the regime where the latter is tractable to run.

Figure 7: Comparison of the runtime (left) and accuracy (right) of the fastest existing algorithms (naive LP solver and naive SINKHORN which both have exponential runtimes that scale as nΩ(k)n^{\Omega(k)}) with our algorithms (SINKHORN, MWU, and COLGEN and MWU with polynomial-time implementations of their bottlenecks) for the risk estimation problem described in the main text. The algorithms are cut off at 2 minutes, denoted by an “x”. Our new polynomial-time implementation of SINKHORN returns high-quality solutions for problems that are orders-of-magnitude larger than previously possible: e.g., it takes less than a second to solve the problem for k=30k=30, which amounts to an MOT LP with 103010^{30} variables.

7.4 Application vignette: projection to the transportation polytope

Here we consider the fundamental problem of projecting a joint probability distribution QQ onto the transportation polytope (μ1,,μk)\mathcal{M}(\mu_{1},\dots,\mu_{k}), i.e.,

argminP(μ1,,μk)j(PjQj)2.\displaystyle\operatorname*{argmin}_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\sum_{\vec{j}}(P_{\vec{j}}-Q_{\vec{j}})^{2}. (7.8)

We provide the first polynomial-time algorithm for solving this problem in the case where QQ is a distribution that decomposes into a low-rank component plus a sparse component. The low-rank component enables modeling mixtures of product distributions (e.g., mixtures of isotropic Gaussians), which arise frequently in statistics and machine learning; see, e.g., (feldman2008learning, ). In such applications, the number of product distributions in the mixture corresponds to the tensor rank. The sparse component further enables modeling arbitrary corruptions to the distribution in polynomially many entries.

We emphasize that this projection problem (7.8) is not an MOT problem since the objective is quadratic rather than linear. This illustrates the versatility of our algorithmic results. Our algorithm is based on a reduction from quadratic optimization to linear optimization over (μ1,,μk)\mathcal{M}(\mu_{1},\dots,\mu_{k}) that is tailored to this problem. Crucial to this reduction is the fact that the MOT algorithms in §4 can compute sparse solutions. In particular, this reduction does not work with SINKHORN because SINKHORN cannot compute sparse solutions.

Corollary 7.11 (Efficient projection to the transportation polytope).

Let Q=R+S(0n)kQ=R+S\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}, where RR has constant rank and SS is polynomially sparse. Suppose that RmaxR_{\max} and SmaxS_{\max} are O(1)O(1). Given RR in factored form, SS through its non-zero entries, measures μ1,,μkΔn\mu_{1},\dots,\mu_{k}\in\Delta_{n}, and accuracy ε>0\varepsilon>0, we can compute in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time a feasible P(μ1,,μk)P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}) that has ε\varepsilon-suboptimal cost for the projection problem (7.8). This solution PP is a sparse tensor output through its poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) non-zero entries.

Proof.

We apply the Frank-Wolfe algorithm (a.k.a., Conditional Gradient Descent) to solve (7.8), specifically using approximate LP solutions for the descent direction as in (Jag13, , Algorithm 2). By the known convergence guarantee of this algorithm (Jag13, , Theorem 1.1), if each LP is solved to ε=O(ε)\varepsilon^{\prime}=O(\varepsilon) accuracy, then T=O(1/ε)T=O(1/\varepsilon) Frank-Wolfe iterations suffice to obtain an ε\varepsilon-suboptimal solution to (7.8).

The crux, therefore, is to show that each Frank-Wolfe iteration can be computed efficiently, and that the final solution is sparse. Initialize P(0)P^{(0)} to be an arbitrary vertex of (μ1,,μk)\mathcal{M}(\mu_{1},\dots,\mu_{k}). Then P(0)P^{(0)} is feasible and is polynomially sparse (see §2.1). Let P(t)(0n)kP^{(t)}\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k} denote the tt-th Frank-Wolfe iterate. Performing the next iteration requires two computations:

  1. 1.

    Approximately solve the following LP to ε\varepsilon^{\prime} accuracy:

    D(t)minP(μ1,,μk)P,P(t)Q.\displaystyle D^{(t)}\leftarrow\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,P^{(t)}-Q\rangle. (7.9)
  2. 2.

    Update P(t+1)(1γt)P(t)+γtD(t)P^{(t+1)}\leftarrow(1-\gamma_{t})P^{(t)}+\gamma_{t}D^{(t)}, where γt=2/(t+2)\gamma_{t}=2/(t+2) is the current stepsize.

For the first iteration t=0t=0, note that the LP (7.9) is an MOT problem with cost C(0)=P(0)Q=P(0)RSC^{(0)}=P^{(0)}-Q=P^{(0)}-R-S which decomposes into a polynomially sparse tensor P(0)SP^{(0)}-S plus a constant-rank tensor R-R. Therefore the algorithm in Corollary 7.5 can solve the LP (7.9) to ε=O(ε)\varepsilon^{\prime}=O(\varepsilon) additive accuracy in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time, and it outputs a solution D(0)D^{(0)} that is poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) sparse. It follows that P(1)P^{(1)} can be computed in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time and moreover is poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) sparse since it is a convex combination of the similarly sparse tensors P(0)P^{(0)} and D(0)D^{(0)}. By repeating this argument identically for T=O(1/ε)T=O(1/\varepsilon) iterations, it follows that each iteration takes poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time, and that each iterate P(t)P^{(t)} is poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) sparse. ∎
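To make the reduction concrete, here is a minimal Frank-Wolfe sketch for (7.8) on a toy $k=2$, $n=2$ instance, where the LP step (7.9) can be solved exactly by checking the two endpoints of the (one-dimensional) transportation polytope. The exact 2x2 oracle below is a stand-in for the approximate, sparse MOT solver of Corollary 7.5 that the proof actually invokes.

```python
import numpy as np

def project_to_polytope(Q, P0, lp_oracle, T):
    """Frank-Wolfe for min_P ||P - Q||^2 over the transportation polytope:
    each step solves the LP (7.9) with cost P - Q, then takes a convex step."""
    P = P0
    for t in range(T):
        D = lp_oracle(P - Q)                 # descent direction (a sparse vertex)
        gamma = 2.0 / (t + 2.0)              # standard Frank-Wolfe stepsize
        P = (1 - gamma) * P + gamma * D      # convex combination stays feasible
    return P

def lp_oracle_2x2(cost, mu, nu):
    # Couplings of (mu, nu) for n = 2: P(t) = [[t, mu0-t], [nu0-t, 1-mu0-nu0+t]]
    # with t in [lo, hi]; a linear objective is minimized at an endpoint.
    lo = max(0.0, mu[0] + nu[0] - 1.0)
    hi = min(mu[0], nu[0])
    P = lambda t: np.array([[t, mu[0] - t], [nu[0] - t, 1 - mu[0] - nu[0] + t]])
    return min((P(lo), P(hi)), key=lambda X: (X * cost).sum())

mu, nu = np.array([0.6, 0.4]), np.array([0.3, 0.7])
Q = np.outer(mu, nu) + 0.05                  # target to project (not itself feasible)
P0 = lp_oracle_2x2(np.zeros((2, 2)), mu, nu) # an arbitrary vertex to start from
P = project_to_polytope(Q, P0, lambda C: lp_oracle_2x2(C, mu, nu), T=500)
print(P.sum(axis=1), P.sum(axis=0))          # exactly mu and nu: iterates stay feasible
```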

8 Discussion

In this paper, we investigated what structure enables MOT—an LP with nkn^{k} variables—to be solved in poly(n,k)\mathrm{poly}(n,k) time. We developed a unified algorithmic framework for MOT by characterizing what “structure” is required to solve MOT in polynomial time by different algorithms in terms of simple variants of the dual feasibility oracle. On one hand, this enabled us to show that ELLIPSOID and MWU solve MOT in polynomial time whenever any algorithm can, whereas SINKHORN requires strictly more structure. And on the other hand, this made the design of polynomial-time algorithms for MOT much simpler, as we illustrated on three general classes of MOT cost structures.

Our results suggest several natural directions for future research. One exciting direction is to identify further tractable classes of MOT cost structures beyond the three studied in this paper, since this may enable new applications of MOT. Our results help guide this search because they make it significantly easier to identify if an MOT problem is polynomial-time solvable (see §1.1.4).

Another important direction is practicality. While the focus of this paper is to characterize when MOT problems can be solved in polynomial time, in practice there is of course a difference between small and large polynomial runtimes. It is therefore a question of practical significance to improve our “proof of concept” polynomial-time algorithms by designing algorithms with smaller polynomial runtimes; our characterization of which MOT problems are polynomial-time solvable identifies the problems for which this search is worthwhile.

In order to develop more practical algorithms, recall that, roughly speaking, our approach for designing MOT algorithms consisted of three parts:

  • An “outer loop” algorithm such as ELLIPSOID, MWU, or SINKHORN that solves MOT in polynomial time conditionally on a polynomial-time implementation of a certain bottleneck oracle.

  • An “intermediate” algorithm that reduces this bottleneck oracle to polynomial calls of a variant of the dual feasibility oracle.

  • An “inner loop” algorithm that solves the relevant variant of the dual feasibility oracle for the structured MOT problem at hand.

Obtaining a smaller polynomial runtime for any of these three parts immediately implies smaller polynomial runtimes for the overall MOT algorithm. Another approach is to design altogether different algorithms that avoid the polynomial blow-up of the runtime that arises from composing these three parts. Understanding how to solve an MOT problem more “directly” in this way is an interesting question.

Acknowledgements.
We are grateful to Jonathan Niles-Weed, Pablo Parrilo, and Philippe Rigollet for insightful conversations; to Frederic Koehler for suggesting a simpler proof of Lemma 3.7; to Ben Edelman and Siddhartha Jayanti who were involved in the brainstorming stages and provided helpful references; and to Karthik Natarajan for references to the random combinatorial optimization literature.

Appendix A Deferred proof details

A.1 Proof of Lemma 4.11

Our proof is based on two helper claims.

Claim A.1 (Lemma 3 of (young2001sequential, )).

If MWU_BOTTLENECK(P,μ,λ,ε)\textsf{MWU\_BOTTLENECK}(P,\mu,\lambda,\varepsilon) returns “null”, then MOTC(μ)>λ\textsf{MOT}_{C}(\mu)>\lambda.

Proof.

We prove the contrapositive. Let $P^{*}\in\mathcal{M}(\mu_{1},\dots,\mu_{k})$ be such that $\langle C,P^{*}\rangle\leqslant\lambda$. Then

j[n]kPjhΦ(P+hδj)h=0\displaystyle\sum_{\vec{j}\in[n]^{k}}P^{*}_{\vec{j}}\frac{\partial}{\partial h}\Phi(P+h\delta_{\vec{j}})\mid_{h=0} =j[n]kPj(Cjλexp(C,P/λ)+i=1k1[μi]jiexp([mi(P)]ji/[μi]ji))exp(C,P/λ)+s=1kt=1nexp([ms(P)]t/[μs]t)\displaystyle=\frac{\sum_{\vec{j}\in[n]^{k}}P^{*}_{\vec{j}}(\frac{C_{\vec{j}}}{\lambda}\exp(\langle C,P\rangle/\lambda)+\sum_{i=1}^{k}\frac{1}{[\mu_{i}]_{j_{i}}}\exp([m_{i}(P)]_{j_{i}}/[\mu_{i}]_{j_{i}}))}{\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}
=C,Pλexp(C,P/λ)+s=1kt=1n[ms(P)]t[μs]texp([ms(P)]t/[μs]t)exp(C,P/λ)+s=1kt=1nexp([ms(P)]t/[μs]t)\displaystyle=\frac{\frac{\langle C,P^{*}\rangle}{\lambda}\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\frac{[m_{s}(P^{*})]_{t}}{[\mu_{s}]_{t}}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}{\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}
exp(C,P/λ)+s=1kt=1nexp([ms(P)]t/[μs]t)exp(C,P/λ)+s=1kt=1nexp([ms(P)]t/[μs]t)\displaystyle\leqslant\frac{\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}{\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}
=1.\displaystyle=1.

Since PP^{*} is non-negative and j[n]kPj=m(P)=1\sum_{\vec{j}\in[n]^{k}}P^{*}_{\vec{j}}=m(P^{*})=1, there must exist a j[n]k\vec{j}\in[n]^{k} satisfying hΦ(P+hδj)h=01\frac{\partial}{\partial h}\Phi(P+h\delta_{\vec{j}})\mid_{h=0}\leqslant 1. ∎

This first claim ensures that if the algorithm returns “infeasible”, then indeed MOTC(μ)>λ\textsf{MOT}_{C}(\mu)>\lambda. This proves the first part of the lemma. We now prove a second claim, useful for bounding the running time and the quality of the returned solution when the algorithm does not return “infeasible”.

Claim A.2 (Lemmas 1 and 4 of (young2001sequential, )).

In Lines 5 to 9, we maintain the invariant that Φ(P)(1+ε)2m(P)log(nk+1)\Phi(P)-(1+\varepsilon)^{2}m(P)\leqslant\log(nk+1).

Proof.

Note that initially P=0P=0 and Φ(0)(1+ε)2m(0)=Φ(0)=log(nk+1)\Phi(0)-(1+\varepsilon)^{2}m(0)=\Phi(0)=\log(nk+1). So it suffices to prove that Φ(P)(1+ε)2m(P)\Phi(P)-(1+\varepsilon)^{2}m(P) does not increase on each iteration. Indeed, if j\vec{j} is returned by MWU_BOTTLENECK, then hΦ(P+hδj)h=0(1+ε)\frac{\partial}{\partial h}\Phi(P+h\delta_{\vec{j}})\mid_{h=0}\leqslant(1+\varepsilon). Furthermore, let ε=εmin(λ/Cj,mini[μi]ji)\varepsilon^{\prime}=\varepsilon\cdot\min(\lambda/C_{\vec{j}},\min_{i}[\mu_{i}]_{j_{i}}). By the smoothness of the softmax, it holds that

\Phi(P+\varepsilon^{\prime}\delta_{\vec{j}})\leqslant\Phi(P)+\varepsilon^{\prime}(1+\varepsilon)\frac{\partial}{\partial h}\Phi(P+h\delta_{\vec{j}})\mid_{h=0}\leqslant\Phi(P)+\varepsilon^{\prime}(1+\varepsilon)^{2}.

So, in Line 9, (1+ε)2m(P)(1+\varepsilon)^{2}m(P) increases by (1+ε)2ε(1+\varepsilon)^{2}\varepsilon^{\prime}, and Φ(P)\Phi(P) increases by at most (1+ε)2ε(1+\varepsilon)^{2}\varepsilon^{\prime}. This proves the claim. ∎

We use this second claim to bound the running time of Step 1. Each iteration of the loop from Line 5 to Line 9 of Algorithm 2 increases the value of

Ψ(P):=C,P/λ+imi(P)/μi1\Psi(P):=\langle C,P\rangle/\lambda+\sum_{i}\|m_{i}(P)/\mu_{i}\|_{1}

by at least ε\varepsilon. So after TT iterations we must have

Φ(P)Ψ(P)/(nk+1)Tε/(nk+1),\Phi(P)\geqslant\Psi(P)/(nk+1)\geqslant T\varepsilon/(nk+1),

where we have used Jensen’s inequality to relate Φ\Phi and Ψ\Psi. By the claim, this means that on any iteration T(η+log(nk+1))(nk+1)(1+ε)2/ε=O~(nk/ε2)T\geqslant(\eta+\log(nk+1))(nk+1)(1+\varepsilon)^{2}/\varepsilon=\tilde{O}(nk/\varepsilon^{2}), we must have

m(P)(Φ(P)log(nk+1))/(1+ε)2η,m(P)\geqslant(\Phi(P)-\log(nk+1))/(1+\varepsilon)^{2}\geqslant\eta,

so the loop must terminate after at most O~(nk/ε2)\tilde{O}(nk/\varepsilon^{2}) iterations. Since each iteration can increase the number of non-zero entries by at most 1, this also proves the sparsity bound on PP.

We now prove the bounds on the marginals and cost, again using the claim. When Line 10 is reached, we must have m(P)[η,η+ε]m(P)\in[\eta,\eta+\varepsilon], because each iteration increases m(P)m(P) by at most ε\varepsilon. Therefore,

max(C,P/λ,m1(P)/μ1,,mk(P)/μk)\displaystyle\max(\langle C,P\rangle/\lambda,m_{1}(P)/\mu_{1},\ldots,m_{k}(P)/\mu_{k}) Φ(P)\displaystyle\leqslant\Phi(P) by Lemma 2.3 for softmax
log(nk+1)+(1+ε)2m(P)\displaystyle\leqslant\log(nk+1)+(1+\varepsilon)^{2}m(P) by the claim
log(nk+1)+(1+ε)2(η+ε)\displaystyle\leqslant\log(nk+1)+(1+\varepsilon)^{2}(\eta+\varepsilon)
(1+ε)4η\displaystyle\leqslant(1+\varepsilon)^{4}\eta by η2log(nk+1)1\eta\geqslant 2\log(nk+1)\geqslant 1

Therefore, the rescaling in Line 10 yields PP satisfying the guarantees of Lemma 4.11.

A.2 Proof of Lemma 4.16

It is obvious how the AMINC\textsf{AMIN}_{C} oracle can be implemented via a single call of the ARGAMINC\textsf{ARGAMIN}_{C} oracle; we now show the converse. Specifically, given p1,,pknp_{1},\ldots,p_{k}\in\mathbb{R}^{n}, we show how to compute a solution j=(j1,,jk)[n]k\vec{j}=(j_{1},\dots,j_{k})\in[n]^{k} for ARGAMINC([p1,,pk],ε)\textsf{ARGAMIN}_{C}([p_{1},\dots,p_{k}],\varepsilon) using nknk calls to the AMINC\textsf{AMIN}_{C} oracle with accuracy ε/(2k)\varepsilon/(2k). As in the proof of Lemma 4.4, we use the first nn calls to compute the first index j1j_{1} of the solution, the next nn calls to compute the next index j2j_{2}, and so on.

Formally, for s[k]s\in[k], let us say that (j1,,js)[n]s(j_{1}^{*},\dots,j_{s}^{*})\in[n]^{s} is a δ\delta-approximate “partial solution” of size ss if there exists a solution j[n]kj\in[n]^{k} for ARGAMINC([p1,,pk],δ)\textsf{ARGAMIN}_{C}([p_{1},\dots,p_{k}],\delta) that satisfies ji=jij_{i}=j_{i}^{*} for all i[s]i\in[s]. Then it suffices to show that for every s[k]s\in[k], it is possible to compute an (sε/k)(s\varepsilon/k)-approximate partial solution (j1,,js)(j_{1}^{*},\dots,j_{s}^{*}) of size ss from an ((s1)ε/k)((s-1)\varepsilon/k)-approximate partial solution (j1,,js1)(j_{1}^{*},\dots,j_{s-1}^{*}) of size s1s-1 using nn calls to AMINC\textsf{AMIN}_{C} and polynomial additional time.

We do this by setting $j_{s}^{*}$ to be a minimizer of

minjs[n]AMINC([q1,j1,,qs1,js1,qs,js,ps+1,,pk],ε2k),\displaystyle\min_{j_{s}^{\prime}\in[n]}\textsf{AMIN}_{C}\Big{(}\Big{[}q_{1,j_{1}^{*}},\ldots,q_{s-1,j_{s-1}^{*}},q_{s,j_{s}^{\prime}},p_{s+1},\ldots,p_{k}\Big{]},\frac{\varepsilon}{2k}\Big{)}, (A.1)

where the qq vectors are defined as in the proof of Lemma 4.4. The runtime claim is obvious; it suffices to prove correctness. To this end, observe that

minj[n]ks.t. j1=j1,,js=jsCji=1k[pi]ji\displaystyle\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots,j_{s}=j_{s}^{*}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} =MINC([q1,j1,,qs,js,ps+1,,pk])\displaystyle=\textsf{MIN}_{C}\Big{(}\Big{[}q_{1,j_{1}^{*}},\dots,q_{s,j_{s}^{*}},p_{s+1},\dots,p_{k}\Big{]}\Big{)}
ε2k+AMINC([q1,j1,,qs,js,ps+1,,pk],ε2k)\displaystyle\leqslant\frac{\varepsilon}{2k}+\textsf{AMIN}_{C}\Big{(}\Big{[}q_{1,j_{1}^{*}},\dots,q_{s,j_{s}^{*}},p_{s+1},\dots,p_{k}\Big{]},\frac{\varepsilon}{2k}\Big{)}
=ε2k+minjs[n]AMINC([q1,j1,,qs1,js1,qs,js,ps+1,,pk],ε2k)\displaystyle=\frac{\varepsilon}{2k}+\min_{j_{s}^{\prime}\in[n]}\textsf{AMIN}_{C}\Big{(}\Big{[}q_{1,j_{1}^{*}},\dots,q_{s-1,j_{s-1}^{*}},q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}\Big{]},\frac{\varepsilon}{2k}\Big{)}
εk+minjs[n]MINC([q1,j1,,qs1,js1,qs,js,ps+1,,pk])\displaystyle\leqslant\frac{\varepsilon}{k}+\min_{j_{s}^{\prime}\in[n]}\textsf{MIN}_{C}\Big{(}\Big{[}q_{1,j_{1}^{*}},\dots,q_{s-1,j_{s-1}^{*}},q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}\Big{]}\Big{)}
=εk+minjs[n]minj[n]ks.t. j1=j1,js1=js1,js=jsCji=1k[pi]ji\displaystyle=\frac{\varepsilon}{k}+\min_{j_{s}^{\prime}\in[n]}\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots j_{s-1}=j_{s-1}^{*},j_{s}=j_{s}^{\prime}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}
=εk+minj[n]ks.t. j1=j1,js1=js1Cji=1k[pi]ji\displaystyle=\frac{\varepsilon}{k}+\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots j_{s-1}=j_{s-1}^{*}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}
=sεk+MINC([p1,,pk]).\displaystyle=\frac{s\varepsilon}{k}+\textsf{MIN}_{C}\Big{(}\Big{[}p_{1},\dots,p_{k}\Big{]}\Big{)}.

Above, the first and fifth steps are by Observation 4.5, the second and fourth steps are by definition of the AMIN oracle, the third step is by construction of jsj_{s}^{*}, the penultimate step is by simplifying, and the final step is by definition of (j1,,js1)(j_{1}^{*},\dots,j_{s-1}^{*}) being an ((s1)ε/k)((s-1)\varepsilon/k)-approximate partial solution of size s1s-1. We conclude that (j1,,js)(j_{1}^{*},\dots,j_{s}^{*}) is an (sε/k)(s\varepsilon/k)-approximate partial solution of size ss, as desired.
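The following sketch illustrates this coordinate-by-coordinate reduction, with an exact brute-force min oracle standing in for AMIN (so the accuracy bookkeeping above disappears), and assuming the forcing construction for the $q$ vectors, i.e., a copy of $p_s$ with every entry except the forced index set to $-\infty$, as in the proof of Lemma 4.4 (not reproduced here).

```python
import numpy as np

def argmin_from_min_oracle(min_oracle, p):
    """Recover a minimizing index vector with n*k calls to a value-only oracle
    min_oracle(p) = min_j C_j - sum_i p[i][j_i]. Coordinates are fixed one at
    a time via forcing vectors q_{s,j}: p_s with all other entries at -inf."""
    k, n = p.shape
    q = p.astype(float).copy()
    js = []
    for s in range(k):
        best_j, best_val = 0, np.inf
        for j in range(n):
            trial = q.copy()
            trial[s, :] = -np.inf
            trial[s, j] = p[s, j]            # force j_s = j
            val = min_oracle(trial)
            if val < best_val:
                best_j, best_val = j, val
        q[s, :] = -np.inf
        q[s, best_j] = p[s, best_j]          # commit j_s = best_j
        js.append(best_j)
    return tuple(js)

# Exact brute-force oracle on a random k = 3 cost, to sanity-check the reduction.
rng = np.random.default_rng(0)
n, k = 3, 3
C = rng.random((n,) * k)
shift_of = lambda p: p[0][:, None, None] + p[1][None, :, None] + p[2][None, None, :]
min_oracle = lambda p: (C - shift_of(p)).min()
p = rng.random((k, n))
j = argmin_from_min_oracle(min_oracle, p)
assert np.isclose((C - shift_of(p))[j], (C - shift_of(p)).min())
```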

Appendix B Additional numerical experiments

Here, we provide additional numerics for the generalized Euler flow application in §5.3 in order to demonstrate that similar behavior is observed on other standard benchmark inputs from the literature (Bre08, ). These instances are identical to those in Figure 10, except with different input permutations $\sigma$ between the initial and final positions of the particles. Note that our algorithm COLGEN computes an exact, sparse solution with at most $nk-k+1$ non-zero entries (Theorem 4.6). In contrast, the SINKHORN algorithm of (BenCarCut15, ) computes approximate, fully dense solutions with $n^{k}$ non-zero entries, which leads to blurry visualizations.

[Figure: snapshots of particle trajectories at times t = 1 through t = 7; SINKHORN solution (left) and COLGEN solution (right).]
Figure 8: Same as Figure 10, but now with the permutation σ\sigma that sends the particle at initial location x[0,1]x\in[0,1] to final location σ(x)=min(2x,22x)\sigma(x)=\min(2x,2-2x). COLGEN runs in 7.88 seconds, while SINKHORN with regularization η=2000\eta=2000 runs in 6.97 seconds. In the COLGEN solution, roughly half the particles have trajectories that never split.
[Figure: snapshots of particle trajectories at times t = 1 through t = 7; SINKHORN solution (left) and COLGEN solution (right).]
Figure 9: Same as Figure 10, but now with the permutation σ\sigma that sends the particle at initial location x[0,1]x\in[0,1] to final location σ(x)=1x\sigma(x)=1-x. COLGEN runs in 10.53 seconds, while SINKHORN with regularization η=1500\eta=1500 runs in 2.10 seconds.

References

  • [1] S. Agrawal, Y. Ding, A. Saberi, and Y. Ye. Price of correlations in stochastic optimization. Operations Research, 60(1):150–162, 2012.
  • [2] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
  • [3] Y. Akagi, Y. Tanaka, T. Iwata, T. Kurashima, and H. Toda. Probabilistic optimal transport based on collective graphical models. Preprint at arXiv:2006.08866, 2020.
  • [4] J. Altschuler, F. Bach, A. Rudi, and J. Niles-Weed. Massively scalable Sinkhorn distances via the Nyström method. In Advances in Neural Information Processing Systems, pages 4429–4439, 2019.
  • [5] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, 2017.
  • [6] J. M. Altschuler and E. Boix-Adserà. Hardness results for Multimarginal Optimal Transport problems. Discrete Optimization, 42:100669, 2021.
  • [7] J. M. Altschuler and E. Boix-Adserà. Wasserstein barycenters can be computed in polynomial time in fixed dimension. Journal of Machine Learning Research, 22:44–1, 2021.
  • [8] J. M. Altschuler and E. Boix-Adserà. Wasserstein barycenters are NP-hard to compute. SIAM Journal on Mathematics of Data Science, 2022.
  • [9] J. M. Altschuler and P. A. Parrilo. Near-linear convergence of the Random Osborne algorithm for Matrix Balancing. Mathematical Programming, pages 1–35, 2022.
  • [10] E. Anderes, S. Borgwardt, and J. Miller. Discrete Wasserstein barycenters: Optimal transport for discrete data. Mathematical Methods of Operations Research, 84(2):389–409, 2016.
  • [11] F. Bach. Learning with submodular functions: a convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.
  • [12] M. O. Ball. Computational complexity of network reliability analysis: An overview. IEEE Transactions on Reliability, 35(3):230–239, 1986.
  • [13] M. O. Ball, C. J. Colbourn, and J. S. Provan. Network reliability. Handbooks in Operations Research and Management Science, 7:673–762, 1995.
  • [14] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
  • [15] J.-D. Benamou, G. Carlier, S. Di Marino, and L. Nenna. An entropy minimization approach to second-order variational mean-field games. Mathematical Models and Methods in Applied Sciences, 29(08):1553–1583, 2019.
  • [16] J.-D. Benamou, G. Carlier, and L. Nenna. A numerical method to solve multi-marginal optimal transport problems with Coulomb cost. In Splitting Methods in Communication, Imaging, Science, and Engineering, pages 577–601. Springer, 2016.
  • [17] J.-D. Benamou, G. Carlier, and L. Nenna. Generalized incompressible flows, multi-marginal transport and Sinkhorn algorithm. Numerische Mathematik, 142(1):33–54, 2019.
  • [18] D. Bertsimas and J. N. Tsitsiklis. Introduction to linear optimization, volume 6. Athena Scientific Belmont, MA, 1997.
  • [19] J. Blanchet, A. Jambulapati, C. Kent, and A. Sidford. Towards optimal running times for optimal transport. Preprint at arXiv:1810.07717, 2018.
  • [20] M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. In International conference on Artificial Intelligence and Statistics, pages 880–889. PMLR, 2018.
  • [21] H. L. Bodlaender. A linear-time algorithm for finding tree-decompositions of small treewidth. SIAM Journal on computing, 25(6):1305–1317, 1996.
  • [22] H. L. Bodlaender. Treewidth: Structure and algorithms. In International Colloquium on Structural Information and Communication Complexity, pages 11–25. Springer, 2007.
  • [23] Y. Brenier. The least action principle and the related concept of generalized flows for incompressible perfect fluids. Journal of the American Mathematical Society, 2(2):225–255, 1989.
  • [24] Y. Brenier. The dual least action problem for an ideal, incompressible fluid. Archive for Rational Mechanics and Analysis, 122(4):323–351, 1993.
  • [25] Y. Brenier. Minimal geodesics on groups of volume-preserving maps and generalized solutions of the Euler equations. Communications on Pure and Applied Mathematics, 52(4):411–452, 1999.
  • [26] Y. Brenier. Generalized solutions and hydrostatic approximation of the Euler equations. Physica D: Nonlinear Phenomena, 237(14-17):1982–1988, 2008.
  • [27] G. Buttazzo, L. De Pascale, and P. Gori-Giorgi. Optimal-transport formulation of electronic density-functional theory. Physical Review A, 85(6):062502, 2012.
  • [28] G. Carlier and I. Ekeland. Matching for teams. Economic Theory, 42(2):397–418, 2010.
  • [29] G. Carlier, A. Oberman, and E. Oudet. Numerical methods for matching for teams and Wasserstein barycenters. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6):1621–1642, 2015.
  • [30] L. Chen, W. Ma, K. Natarajan, D. Simchi-Levi, and Z. Yan. Distributionally robust linear and discrete optimization with marginals. Operations Research, 2022.
  • [31] P.-A. Chiappori, R. J. McCann, and L. P. Nesheim. Hedonic price equilibria, stable matching, and optimal transport: equivalence, topology, and uniqueness. Economic Theory, 42(2):317–354, 2010.
  • [32] C. Cotar, G. Friesecke, and C. Klüppelberg. Density functional theory and optimal transportation with Coulomb cost. Communications on Pure and Applied Mathematics, 66(4):548–599, 2013.
  • [33] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 2016.
  • [34] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
  • [35] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, 2013.
  • [36] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, 2014.
  • [37] M. Dyer, L. A. Goldberg, C. Greenhill, and M. Jerrum. On the relative complexity of approximate counting problems. In International Workshop on Approximation Algorithms for Combinatorial Optimization, pages 108–119. Springer, 2000.
  • [38] F. Elvander, I. Haasler, A. Jakobsson, and J. Karlsson. Multi-marginal optimal transport using partial information with applications in robust localization and sensor fusion. Signal Processing, 171:107474, 2020.
  • [39] J. Feldman, R. O’Donnell, and R. A. Servedio. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 37(5):1536–1564, 2008.
  • [40] S. Friedland. Optimal transport, distance between sets of measures and tensor scaling. Preprint at arXiv:2005.00945, 2020.
  • [41] I. Gertsbakh and Y. Shpungin. Network reliability and resilience. Springer Science & Business Media, 2011.
  • [42] L. A. Goldberg and M. Jerrum. The complexity of ferromagnetic Ising with local fields. Combinatorics, Probability and Computing, 16(1):43–61, 2007.
  • [43] M. Grötschel, L. Lovász, and A. Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981.
  • [44] M. Grötschel, L. Lovász, and A. Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
  • [45] H. Guo and M. Jerrum. A polynomial-time approximation algorithm for all-terminal network reliability. SIAM Journal on Computing, 48(3):964–978, 2019.
  • [46] I. Haasler, A. Ringh, Y. Chen, and J. Karlsson. Estimating ensemble flows on a hidden Markov chain. In IEEE Conference on Decision and Control, pages 1331–1338. IEEE, 2019.
  • [47] I. Haasler, A. Ringh, Y. Chen, and J. Karlsson. Multi-marginal optimal transport and Schrödinger bridges on trees. Preprint at arXiv:2004.06909, 2020.
  • [48] I. Haasler, A. Ringh, Y. Chen, and J. Karlsson. Scalable computation of dynamic flow problems via multi-marginal graph-structured optimal transport. Preprint at arXiv:2106.14485, 2021.
  • [49] I. Haasler, R. Singh, Q. Zhang, J. Karlsson, and Y. Chen. Multi-marginal optimal transport and probabilistic graphical models. IEEE Transactions on Information Theory, 2021.
  • [50] W. K. Haneveld. Robustness against dependence in PERT: An application of duality and distributions with known marginals. In Stochastic Programming, pages 153–182. Springer, 1986.
  • [51] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435, 2013.
  • [52] D. R. Karger. A randomized fully polynomial time approximation scheme for the all-terminal network reliability problem. SIAM Review, 43(3):499–522, 2001.
  • [53] L. G. Khachiyan. Polynomial algorithms in linear programming. USSR Computational Mathematics and Mathematical Physics, 20(1):53–72, 1980.
  • [54] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
  • [55] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • [56] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.
  • [57] C. Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. Preprint at arXiv:1308.0215, 2013.
  • [58] T. Lin, N. Ho, M. Cuturi, and M. I. Jordan. On the complexity of approximating multimarginal optimal transport. Preprint at arXiv:1910.00152, 2019.
  • [59] N. Linial, A. Samorodnitsky, and A. Wigderson. A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents. In Symposium on Theory of Computing, pages 644–652, 1998.
  • [60] G. Makarov. Estimates for the distribution function of a sum of two random variables when the marginal distributions are fixed. Theory of Probability and its Applications, 26(4):803–806, 1982.
  • [61] I. Meilijson and A. Nádas. Convex majorization with an application to the length of critical paths. Journal of Applied Probability, 16(3):671–677, 1979.
  • [62] V. K. Mishra, K. Natarajan, D. Padmanabhan, C.-P. Teo, and X. Li. On theoretical and empirical aspects of marginal distribution choice models. Management Science, 60(6):1511–1531, 2014.
  • [63] E. F. Moore and C. E. Shannon. Reliable circuits using less reliable relays. Journal of the Franklin Institute, 262(3):191–208, 1956.
  • [64] B. Muzellec, R. Nock, G. Patrini, and F. Nielsen. Tsallis regularized optimal transport and ecological inference. In AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • [65] A. Nádas. Probabilistic PERT. IBM Journal of Research and Development, 23(3):339–347, 1979.
  • [66] K. Natarajan, M. Song, and C.-P. Teo. Persistency model and its applications in choice modeling. Management Science, 55(3):453–469, 2009.
  • [67] L. Nenna. Numerical methods for multi-marginal optimal transportation. PhD thesis, 2016.
  • [68] D. Padmanabhan, S. D. Ahipasaoglu, A. Ramachandra, and K. Natarajan. Extremal probability bounds in combinatorial optimization. Preprint at arXiv:2109.01591, 2021.
  • [69] C. H. Papadimitriou and T. Roughgarden. Computing correlated equilibria in multi-player games. Journal of the ACM, 55(3):1–29, 2008.
  • [70] G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
  • [71] F. Pitié, A. C. Kokaram, and R. Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1-2):123–137, 2007.
  • [72] J. S. Provan and M. O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM Journal on Computing, 12(4):777–788, 1983.
  • [73] K. Quanrud. Approximating optimal transport with linear programs. In Symposium on Simplicity in Algorithms, 2018.
  • [74] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.
  • [75] L. Rüschendorf. Random variables with maximum sums. Advances in Applied Probability, 14(3):623–632, 1982.
  • [76] L. Rüschendorf and L. Uckelmann. On the n-coupling problem. Journal of Multivariate Analysis, 81(2):242–258, 2002.
  • [77] R. Singh, I. Haasler, Q. Zhang, J. Karlsson, and Y. Chen. Incremental inference of collective graphical models. IEEE Control Systems Letters, 2020.
  • [78] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.
  • [79] J. Solomon, F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4):1–11, 2015.
  • [80] S. Srivastava, C. Li, and D. B. Dunson. Scalable Bayes via barycenter in Wasserstein space. Journal of Machine Learning Research, 19(1):312–346, 2018.
  • [81] M. Stoer and F. Wagner. A simple min-cut algorithm. Journal of the ACM, 44(4):585–591, 1997.
  • [82] Y. W. Teh and M. Welling. The unified propagation and scaling algorithm. In Advances in Neural Information Processing Systems, pages 953–960, 2002.
  • [83] L. N. Trefethen. Approximation theory and approximation practice, volume 164. SIAM, 2019.
  • [84] N. Tupitsa, P. Dvurechensky, A. Gasnikov, and C. A. Uribe. Multimarginal optimal transport by accelerated alternating minimization. In IEEE Conference on Decision and Control, pages 6132–6137. IEEE, 2020.
  • [85] L. G. Valiant. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.
  • [86] C. Villani. Topics in optimal transportation. Number 58. American Mathematical Society, 2003.
  • [87] M. J. Wainwright and M. I. Jordan. Variational inference in graphical models: The view from the marginal polytope. In Allerton Conference on Communication, Control, and Computing, volume 41, pages 961–971, 2003.
  • [88] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Now Publishers Inc., 2008.
  • [89] G. Weiss. Stochastic bounds on distributions of optimal value functions with applications to PERT, network flows and reliability. Operations Research, 34(4):595–605, 1986.
  • [90] A. G. Wilson. The use of entropy maximising models, in the theory of trip distribution, mode split and route split. Journal of Transport Economics and Policy, pages 108–126, 1969.
  • [91] N. E. Young. Sequential and parallel algorithms for mixed packing and covering. In Symposium on Foundations of Computer Science, pages 538–546. IEEE, 2001.
  • [92] D. B. Yudin and A. S. Nemirovskii. Informational complexity and efficient methods for the solution of convex extremal problems. Matekon, 13(2):22–45, 1976.
  • [93] E. Zemel. Polynomial algorithms for estimating network reliability. Networks, 12(4):439–452, 1982.