
J.M. Altschuler · E. Boix-Adserà
Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology, Cambridge MA 02139.
email: jasonalt@mit.edu
email: eboix@mit.edu

Polynomial-time algorithms for Multimarginal Optimal Transport problems with structure

(Work partially supported by NSF GRFP 1122374, a TwoSigma PhD fellowship, and a Siebel Scholarship.)

Jason M. Altschuler    Enric Boix-Adserà
Abstract

Multimarginal Optimal Transport (MOT) has attracted significant interest due to applications in machine learning, statistics, and the sciences. However, in most applications, the success of MOT is severely limited by a lack of efficient algorithms. Indeed, MOT in general requires exponential time in the number of marginals $k$ and their support sizes $n$. This paper develops a general theory about what “structure” makes MOT solvable in $\mathrm{poly}(n,k)$ time.

We develop a unified algorithmic framework for solving MOT in $\mathrm{poly}(n,k)$ time by characterizing the structure that different algorithms require in terms of simple variants of the dual feasibility oracle. This framework has several benefits. First, it enables us to show that the Sinkhorn algorithm, which is currently the most popular MOT algorithm, requires strictly more structure than other algorithms do to solve MOT in $\mathrm{poly}(n,k)$ time. Second, our framework makes it much simpler to develop $\mathrm{poly}(n,k)$-time algorithms for a given MOT problem. In particular, it is necessary and sufficient to (approximately) solve the dual feasibility oracle—which is much more amenable to standard algorithmic techniques.

We illustrate this ease-of-use by developing $\mathrm{poly}(n,k)$-time algorithms for three general classes of MOT cost structures: (1) graphical structure; (2) set-optimization structure; and (3) low-rank plus sparse structure. For structure (1), we recover the known result that Sinkhorn has $\mathrm{poly}(n,k)$ runtime (h20gm; teh2002unified); moreover, we provide the first $\mathrm{poly}(n,k)$-time algorithms for computing solutions that are exact and sparse. For structures (2)-(3), we give the first $\mathrm{poly}(n,k)$-time algorithms, even for approximate computation. Together, these three structures encompass many—if not most—current applications of MOT.

Keywords: Multimarginal optimal transport · polynomial-time algorithms · implicit linear programming · structured linear programs

MSC: 90C08 · 90C06

1 Introduction

Multimarginal Optimal Transport (MOT) is the problem of linear programming over joint probability distributions with fixed marginal distributions. In this way, MOT generalizes the classical Kantorovich formulation of Optimal Transport from $2$ marginal distributions to an arbitrary number $k\geqslant 2$ of them.

More precisely, an MOT problem is specified by a cost tensor $C$ in the $k$-fold tensor product space $(\mathbb{R}^{n})^{\otimes k}=\mathbb{R}^{n}\otimes\cdots\otimes\mathbb{R}^{n}$, and $k$ marginal distributions $\mu_{1},\dots,\mu_{k}$ in the simplex $\Delta_{n}=\{v\in\mathbb{R}_{\geqslant 0}^{n}:\sum_{i=1}^{n}v_{i}=1\}$. (For simplicity, all $\mu_{i}$ are assumed to have the same support size $n$. Everything in this paper extends in a straightforward way to non-uniform sizes $n_{i}$, where $n^{k}$ is replaced by $\prod_{i=1}^{k}n_{i}$ and $\mathrm{poly}(n,k)$ is replaced by $\mathrm{poly}(\max_{i}n_{i},k)$.) The MOT problem is to compute

$$\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,C\rangle \tag{MOT}$$

where $\mathcal{M}(\mu_{1},\dots,\mu_{k})$ is the “transportation polytope” consisting of all entrywise non-negative tensors $P\in(\mathbb{R}^{n})^{\otimes k}$ satisfying the marginal constraints $\sum_{j_{1},\dots,j_{i-1},j_{i+1},\dots,j_{k}}P_{j_{1},\dots,j_{i-1},j,j_{i+1},\dots,j_{k}}=[\mu_{i}]_{j}$ for all $i\in\{1,\dots,k\}$ and $j\in\{1,\dots,n\}$.
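To make the size of this LP concrete, the following minimal sketch (Python with numpy/scipy; variable names illustrative) solves a tiny instance by materializing the full LP, which has $n^{k}$ variables and $nk$ equality constraints—exactly why this direct approach cannot scale.

```python
# Solve a tiny MOT instance as an explicit LP with scipy.optimize.linprog.
# For n = k = 3 the LP already has n**k = 27 variables; the flattening
# order of C.reshape(-1) matches itertools.product's lexicographic order.
import itertools
import numpy as np
from scipy.optimize import linprog

n, k = 3, 3
rng = np.random.default_rng(0)
C = rng.random((n,) * k)                       # cost tensor, n^k entries
mus = [np.full(n, 1.0 / n) for _ in range(k)]  # uniform marginals

A_eq = np.zeros((n * k, n ** k))
for col, j in enumerate(itertools.product(range(n), repeat=k)):
    for i in range(k):
        A_eq[i * n + j[i], col] = 1.0          # entry j feeds bin j_i of marginal i
b_eq = np.concatenate(mus)

res = linprog(C.reshape(-1), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
P = res.x.reshape((n,) * k)                    # an optimal transport plan
print(res.fun, np.count_nonzero(P > 1e-9))     # vertex solutions have at most
                                               # nk - k + 1 = 7 non-zeros (Lemma 2.1)
```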

This MOT problem has many applications throughout machine learning, computer science, and the natural sciences since it arises in tasks that require “stitching” together aggregate measurements. For instance, applications of MOT include inference from collective dynamics (h20gm; h20partial), information fusion for Bayesian learning (srivastava2018scalable), averaging point clouds (AguCar11; CutDou14), the $n$-coupling problem (ruschendorf2002n), quantile aggregation (makarov1982estimates; ruschendorf1982random), matching for teams (ChiMccNes10; CarEke10), image processing (RabPeyDel11; solomon2015convolutional), random combinatorial optimization (zemel1982polynomial; weiss1986stochastic; Nat21extremal; Agr12; MeiNad79; nadas1979probabilistic; Han86), Distributionally Robust Optimization (Nat18dro; Nat09; MisNat14), simulation of incompressible fluids (Bre08; BenCarNen19), and Density Functional Theory (cotar2013density; buttazzo2012optimal; BenCarNen16).

However, in most applications, the success of MOT is severely limited by the lack of efficient algorithms. Indeed, in general, MOT requires exponential time in the number of marginals $k$ and their support sizes $n$. For instance, applying a linear program solver out-of-the-box takes $n^{\Theta(k)}$ time because MOT is a linear program with $n^{k}$ variables, $n^{k}$ non-negativity constraints, and $nk$ equality constraints. Specialized algorithms in the literature such as the Sinkhorn algorithm yield similar $n^{\Theta(k)}$ runtimes. Such runtimes currently limit the applicability of MOT to tiny-scale problems (e.g., $n=k=10$).

Polynomial-time algorithms for MOT.

This paper develops polynomial-time algorithms for MOT, where here and henceforth “polynomial” means in the number of marginals $k$ and their support sizes $n$—and possibly also $C_{\max}/\varepsilon$ for $\varepsilon$-additive approximation, where $C_{\max}$ is a bound on the entries of $C$.

At first glance, this may seem impossible for at least two “trivial” reasons. One is that it takes exponential time to read the input cost $C$ since it has $n^{k}$ entries. We circumvent this issue by considering costs $C$ with $\mathrm{poly}(n,k)$-size implicit representations, which encompasses essentially all MOT applications. (E.g., in the MOT problems of Wasserstein barycenters, generalized Euler flows, and Density Functional Theory, $C$ has entries $C_{j_{1},\dots,j_{k}}=\sum_{i,i^{\prime}=1}^{k}g_{i,i^{\prime}}(j_{i},j_{i^{\prime}})$ and thus can be implicitly input via the $k^{2}$ functions $g_{i,i^{\prime}}:\{1,\dots,n\}^{2}\to\mathbb{R}$.) A second obvious issue is that it takes exponential time to write the output variable $P$ since it has $n^{k}$ entries. We circumvent this issue by returning solutions $P$ with $\mathrm{poly}(n,k)$-size implicit representations, for instance sparse solutions.
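As a concrete illustration, the following sketch (hypothetical names; random pairwise costs) stores only the $k^{2}$ pairwise functions $g_{i,i^{\prime}}$ from the parenthetical above, replacing the $n^{k}$-entry tensor, and evaluates any single entry of $C$ in $O(k^{2})$ time.

```python
# A poly(n,k)-size implicit representation of a pairwise-decomposable cost:
# C_j = sum_{i,i'} g_{i,i'}(j_i, j_{i'}), stored as k^2 matrices of size n x n.
import numpy as np

n, k = 10, 30
rng = np.random.default_rng(1)
G = rng.random((k, k, n, n))           # G[i, ip, a, b] = g_{i,i'}(a, b)

def cost_entry(j):
    """Evaluate C_{j_1,...,j_k} without materializing the n^k tensor."""
    return sum(G[i, ip, j[i], j[ip]] for i in range(k) for ip in range(k))

j = tuple(rng.integers(0, n, size=k))  # a random multi-index
print(cost_entry(j))
```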

But, of course, circumventing these issues of input/output size is not enough to actually solve MOT in polynomial time. See (AltBoi20hard) for examples of $\mathsf{NP}$-hard MOT problems with costs that have $\mathrm{poly}(n,k)$-size implicit representations.

Remarkably, for several MOT problems, there are specially-tailored algorithms that run in polynomial time—notably, for MOT problems with graphically-structured costs of constant treewidth (h20tree; h20gm; teh2002unified), variational mean-field games (BenCarDi18), computing generalized Euler flows (BenCarCut15), computing low-dimensional Wasserstein barycenters (CarObeOud15; BenCarCut15; AltBoi20bary), and filtering and estimation tasks based on target tracking (h20tree; h20gm; h20incremental; h19hmc; h20partial). However, the number of MOT problems that are known to be solvable in polynomial time is small, and it is unknown if these techniques can be extended to the many other MOT problems arising in applications. This motivates the central question driving this paper:

Are there general “structural properties” that make MOT solvable in $\mathrm{poly}(n,k)$ time?

This paper is conceptually divided into two parts. In the first part of the paper, we develop a unified algorithmic framework for MOT that characterizes the structure required for different algorithms to solve MOT in $\mathrm{poly}(n,k)$ time, in terms of simple variants of the dual feasibility oracle. This enables us to prove that some algorithms can solve MOT problems in polynomial time whenever any algorithm can; whereas the popular SINKHORN algorithm cannot. Moreover, this algorithmic framework makes it significantly easier to design a $\mathrm{poly}(n,k)$-time algorithm for a given MOT problem (when possible) because it now suffices to solve the dual feasibility oracle—and this is much more amenable to standard algorithmic techniques. In the second part of the paper, we demonstrate the ease-of-use of our algorithmic framework by applying it to three general classes of MOT cost structures.

Below, we detail these two parts of the paper in §1.1 and §1.2, respectively.

1.1 Contribution 1: unified algorithmic framework for MOT

In order to understand what structural properties make MOT solvable in polynomial time, we first lay a more general groundwork. The purpose of this is to understand the following fundamental questions:

  • Q1. What are reasonable candidate algorithms for solving structured MOT problems in polynomial time?

  • Q2. What structure must an MOT problem have for these algorithms to have polynomial runtimes?

  • Q3. Is the structure required by a given algorithm more restrictive than the structure required by a different algorithm (or any algorithm)?

  • Q4. How can one check whether this structure occurs for a given MOT problem?

We detail our answers to these four questions below in §1.1.1 to §1.1.4, and then briefly discuss practical tradeoffs beyond polynomial-time solvability in §1.1.5; see Table 1 for a summary. We expect that this general groundwork will prove useful in future investigations of tractable MOT problems.

| Algorithm | Oracle | Runtime | Always applicable? | Exact solution? | Sparse solution? | Practical? |
|---|---|---|---|---|---|---|
| ELLIPSOID | MIN | Theorem 4.1 | Yes | Yes | Yes | No |
| MWU | AMIN | Theorem 4.7 | Yes | No | Yes | Yes |
| SINKHORN | SMIN | Theorem 4.18 | No | No | No | Yes |

Table 1: These MOT algorithms have polynomial runtime except for a bottleneck “oracle”. Each oracle is a simple variant of the dual feasibility oracle for MOT. The number of oracle computations is $\mathrm{poly}(n,k)$ for ELLIPSOID, and $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ for both MWU and SINKHORN. From a theoretical perspective, the most important aspect of an algorithm is whether it can solve MOT in polynomial time if and only if any algorithm can. We show that ELLIPSOID and MWU satisfy this (Theorem 1.1), but SINKHORN does not (Theorem 1.3). From a practical perspective, SINKHORN is the most scalable when applicable. (Code for implementing these algorithms and reproducing all numerical simulations in this paper is provided at https://github.com/eboix/mot.)

1.1.1 Answer to Q1: candidate $\mathrm{poly}(n,k)$-time algorithms

We consider three algorithms for MOT whose exponential runtimes can be isolated into a single bottleneck—and thus can be implemented in polynomial time whenever that bottleneck can. These algorithms are the Ellipsoid algorithm ELLIPSOID (GLSbook), the Multiplicative Weights Update algorithm MWU (young2001sequential), and the natural multidimensional analog of Sinkhorn’s scaling algorithm SINKHORN (BenCarCut15; PeyCut17). SINKHORN is specially tailored to MOT and is currently the predominant algorithm for it. To foreshadow our answer to Q3, the reason that we restrict to these candidate algorithms is: we show that ELLIPSOID and MWU can solve an MOT problem in polynomial time if and only if any algorithm can.

1.1.2 Answer to Q2: structure necessary to run candidate algorithms

These three algorithms only access the cost tensor $C$ through polynomially many calls of their respective bottlenecks. Thus the structure required to implement these candidate algorithms in polynomial time is equivalent to the structure required to implement their respective bottlenecks in polynomial time.

In §4, we show that the bottlenecks of these three algorithms are polynomial-time equivalent to natural analogs of the feasibility oracle for the dual LP to MOT. Namely, given weights $p_{1},\dots,p_{k}\in\mathbb{R}^{n}$, compute

$$\min_{(j_{1},\dots,j_{k})\in\{1,\dots,n\}^{k}}C_{j_{1},\dots,j_{k}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} \tag{1.1}$$

either exactly for ELLIPSOID, approximately for MWU, or with the “min” replaced by a “softmin” for SINKHORN. We call these three tasks the MIN, AMIN, and SMIN oracles, respectively. See Remark 3.4 for the interpretation of these oracles as variants of the dual feasibility oracle.
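For reference, here is a brute-force sketch of these three oracles for an explicitly stored cost tensor (Python with numpy/scipy; the function names mirror the oracle names but are illustrative). It takes $n^{k}$ time, which is exactly the cost that the structured implementations in §1.2 avoid.

```python
# Brute-force reference implementations of the MIN, AMIN, and SMIN oracles
# for a generic cost tensor C with k axes of size n (n^k time). MIN is
# (1.1) exactly; AMIN may return any value within additive eps of MIN
# (brute force is already exact); SMIN replaces min by smin_eta.
import itertools
import numpy as np
from scipy.special import logsumexp

def dual_values(C, p):
    k, n = C.ndim, C.shape[0]
    return np.array([C[j] - sum(p[i][j[i]] for i in range(k))
                     for j in itertools.product(range(n), repeat=k)])

def MIN(C, p):
    return float(dual_values(C, p).min())

def AMIN(C, p, eps):
    return MIN(C, p)          # an exact answer is a valid eps-approximation

def SMIN(C, p, eta):
    return float(-logsumexp(-eta * dual_values(C, p)) / eta)
```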

These three oracles take $n^{k}$ time to implement in general. However, for a wide range of structured cost tensors $C$ they can be implemented in $\mathrm{poly}(n,k)$ time, see §1.2 below. For such structured costs $C$, our oracle abstraction immediately implies that the MOT problem with cost $C$ and any input marginals $\mu_{1},\dots,\mu_{k}$ can be (approximately) solved in polynomial time by any of the three respective algorithms.

Our characterization of the algorithms’ bottlenecks as variations of the dual feasibility oracle has two key benefits—which are the answers to Q3 and Q4, described below.

1.1.3 Answer to Q3: characterizing what MOT problems each algorithm can solve

A key benefit of our characterization of the algorithms’ bottlenecks as variants of the dual feasibility oracle is that it enables us to establish whether the structure required by a given MOT algorithm is more restrictive than the structure required by a different algorithm (or by any algorithm).

In particular, this enables us to answer the natural question: why restrict to just the three algorithms described above? Can other algorithms solve MOT in $\mathrm{poly}(n,k)$ time in situations where these algorithms cannot? Critically, the answer is no: restricting ourselves to the ELLIPSOID and MWU algorithms comes at no loss of generality.

Theorem 1.1 (Informal statement of part of Theorems 4.1 and 4.7).

For any family of costs $C\in(\mathbb{R}^{n})^{\otimes k}$:

  • ELLIPSOID computes an exact solution for MOT in $\mathrm{poly}(n,k)$ time if and only if any algorithm can.

  • MWU computes an $\varepsilon$-approximate solution for MOT in $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ time if and only if any algorithm can.

The statement for ELLIPSOID is implicit from classical results about LP (GLSbook) combined with arguments from (AltBoi20bary); see the related work section §1.3. The statement for MWU is new to this paper.

The oracle abstraction helps us show Theorem 1.1 because it reduces this question of what structure is needed for the algorithms to solve MOT in polynomial time, to the question of what structure is needed to solve their respective bottlenecks in polynomial time. Thus Theorem 1.1 is a consequence of the following result. (The “if” part of this result is a contribution of this paper; the “only if” part was shown in (AltBoi20hard).)

Theorem 1.2 (Informal statement of part of Theorems 4.1 and 4.7).

For any family of costs $C\in(\mathbb{R}^{n})^{\otimes k}$:

  • MOT can be exactly solved in $\mathrm{poly}(n,k)$ time if and only if MIN can.

  • MOT can be $\varepsilon$-approximately solved in $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ time if and only if AMIN can.

Interestingly, a further consequence of our unified algorithm-to-oracle abstraction is that it enables us to show that SINKHORN—which is currently the most popular algorithm for MOT by far—requires strictly more structure to solve an MOT problem than other algorithms require. This is in sharp contrast to the complete generality of the other two algorithms shown in Theorem 1.1.

Theorem 1.3 (Informal statement of Theorem 4.19).

Under standard complexity-theoretic assumptions, there exists a family of MOT problems that can be solved exactly in $\mathrm{poly}(n,k)$ time using ELLIPSOID; however, it is impossible to implement a single iteration of SINKHORN (even approximately) in $\mathrm{poly}(n,k)$ time.

The reason that our unified algorithm-to-oracle abstraction helps us show Theorem 1.3 is that it puts SINKHORN on equal footing with the other two classical algorithms in terms of their reliance on variants of the dual feasibility oracle. This reduces proving Theorem 1.3 to showing the following separation between the SMIN oracle and the other two oracles.

Theorem 1.4 (Informal statement of Lemma 3.7).

Under standard complexity-theoretic assumptions, there exists a family of cost tensors $C\in(\mathbb{R}^{n})^{\otimes k}$ such that there are $\mathrm{poly}(n,k)$-time algorithms for MIN and AMIN; however, it is impossible to solve SMIN (even approximately) in $\mathrm{poly}(n,k)$ time.

1.1.4 Answer to Q4: ease-of-use for checking if MOT is solvable in polynomial time

The second key benefit of this oracle abstraction is that it makes it easy to show that a given MOT problem (whose cost $C$ is input implicitly through some concise representation) is solvable in polynomial time: without loss of generality, it suffices to solve any of the three corresponding oracles in polynomial time. The upshot is that these oracles are more directly amenable to standard algorithmic techniques since they are phrased as more conventional combinatorial-optimization problems. In the second part of the paper, we illustrate this ease-of-use via applications to three general classes of structured MOT problems; for an overview see §1.2.

1.1.5 Practical algorithmic tradeoffs beyond polynomial-time solvability

From a theoretical perspective, the most important aspect of an algorithm is whether it can solve MOT in polynomial time if and only if any algorithm can. As we have discussed, this is true for ELLIPSOID and MWU (Theorem 1.1) but not for SINKHORN (Theorem 1.3). Nevertheless, for a wide range of MOT cost structures, all three oracles can be implemented in polynomial time, which means that all three algorithms ELLIPSOID, MWU, and SINKHORN can be implemented in polynomial time. Which algorithm is best in practice depends on the relative importance of the following considerations for the particular application.

  • Error. ELLIPSOID computes exact solutions, whereas MWU and SINKHORN only compute low-precision solutions due to $\mathrm{poly}(1/\varepsilon)$ runtime dependence.

  • Solution sparsity. ELLIPSOID and MWU output solutions with polynomially many non-zero entries (roughly $nk$), whereas SINKHORN outputs fully dense solutions with $n^{k}$ non-zero entries (through a polynomial-size implicit representation, see §4.3). Solution sparsity enables interpretability, visualization, and efficient downstream computation—benefits which are helpful in diverse applications, for example ranging from computer graphics (solomon2015convolutional; blondel2018smooth; pitie2007automated) to facility location problems (anderes2016discrete) to machine learning (AltBoi20bary; flamary2016optimal) to ecological inference (muzellec2017tsallis) to fluid dynamics (see §5.3), and more. Furthermore, in §7.4, we show that sparse solutions for MOT (a.k.a. linear optimization over the transportation polytope) enable efficiently solving certain non-linear optimization problems over the transportation polytope.

  • Practical runtime. Although all three algorithms enjoy polynomial runtime guarantees, the polynomials are smaller for some algorithms than for others. In particular, SINKHORN has remarkably good scalability in practice as long as the error $\varepsilon$ is not too small and its bottleneck oracle SMIN is practically implementable. By Theorems 1.1 and 1.3, MWU can solve strictly more MOT problems in polynomial time than SINKHORN; however, it is less scalable in practice when both MWU and SINKHORN can be implemented. ELLIPSOID is not practical and is used solely as a proof of concept that problems are tractable to solve exactly; in practice, we use Column Generation (see, e.g., (BerTsi97, §6.1)) rather than ELLIPSOID as it has better empirical performance, yet still has the same bottleneck oracle MIN, see §4.1.3. Column Generation is not as practically scalable as SINKHORN in $n$ and $k$ but has the benefit of computing exact, sparse solutions.

To summarize: which algorithm is best in practice depends on the application. For example, Column Generation produces the qualitatively best solutions for the fluid dynamics application in §5.3, SINKHORN is the most scalable for the risk estimation application in §7.3, and MWU is the most scalable for the network reliability application in §6.3 (for that application there is no known implementation of SINKHORN that is practically efficient).

1.2 Contribution 2: applications to general classes of structured MOT problems

In the second part of the paper, we illustrate the algorithmic framework developed in the first part of the paper by applying it to three general classes of MOT cost structures:

  1. Graphical structure (in §5).

  2. Set-optimization structure (in §6).

  3. Low-rank plus sparse structure (in §7).

Specifically, if the cost $C$ is structured in any of these three ways, then MOT can be (approximately) solved in $\mathrm{poly}(n,k)$ time for any input marginals $\mu_{1},\dots,\mu_{k}$.

Previously, it was known how to solve MOT problems with structure (1) using SINKHORN (h20gm; teh2002unified), but this only computes solutions that are dense (with $n^{k}$ non-zero entries) and low-precision (due to $\mathrm{poly}(1/\varepsilon)$ runtime dependence). We therefore provide the first solutions that are sparse and exact for structure (1). For structures (2) and (3), we provide the first polynomial-time algorithms, even for approximate computation. These three structures are incomparable: it is in general not possible to model a problem falling under any of the three structures in a non-trivial way using any of the others; for details see Remarks 6.7 and 7.3. This means that the new structures (2) and (3) enable capturing a wide range of new applications.

Below, we detail these structures individually in §1.2.1, §1.2.2, and §1.2.3. See Table 2 for a summary.

| Structure | Definition | Complexity measure | Polynomial-time algorithm? (Approximate) | Polynomial-time algorithm? (Exact) |
|---|---|---|---|---|
| Graphical (§5) | $C_{\vec{j}}=\sum_{S\in\mathcal{S}}f_{S}(\vec{j}_{S})$ | treewidth | Known (teh2002unified; h20gm) | Corollary 5.6 |
| Set-optimization (§6) | $C_{\vec{j}}=\mathds{1}[\vec{j}\notin S]$ | optimization oracle over $S$ | Corollary 6.9 | Corollary 6.9 |
| Low-rank + sparse (§7) | $C=R+S$ | rank of $R$, sparsity of $S$ | Corollary 7.5 | Unknown |

Table 2: In the second part of the paper, we illustrate the ease-of-use of our algorithmic framework by applying it to three general classes of MOT cost structures. These structures encompass many—if not most—current applications of MOT.

1.2.1 Graphical structure

In §5, we apply our algorithmic framework to MOT problems with graphical structure, a broad class of MOT problems that have been previously studied (h20gm; h20tree; teh2002unified). Briefly, an MOT problem has graphical structure if its cost tensor $C$ decomposes as

$$C_{j_{1},\dots,j_{k}}=\sum_{S\in\mathcal{S}}f_{S}(\vec{j}_{S}),$$

where the $f_{S}(\vec{j}_{S})$ are arbitrary “local interactions” that depend only on tuples $\vec{j}_{S}:=\{j_{i}\}_{i\in S}$ of the $k$ variables.

In order to derive efficient algorithms, it is necessary to restrict how local the interactions are, because otherwise MOT is $\mathsf{NP}$-hard (even if all interaction sets $S\in\mathcal{S}$ have size $2$) (AltBoi20hard). We measure the locality of the interactions via the standard complexity measure of the “treewidth” of the associated graphical model. See §5.1 for formal definitions. While the runtimes of our algorithms (and all previous algorithms) depend exponentially on the treewidth, we emphasize that the treewidth is a very small constant (either $1$ or $2$) in all current applications of MOT falling under this framework; see the related work section.

We show that for MOT cost tensors that have graphical structure of constant treewidth, all three oracles can be implemented in $\mathrm{poly}(n,k)$ time. We accomplish this by leveraging the known connection between graphically structured MOT and graphical models shown in (h20gm). In particular, the MIN, AMIN, and SMIN oracles are respectively equivalent to the mode, approximate mode, and log-partition function of an associated graphical model. Thus we can implement all oracles in $\mathrm{poly}(n,k)$ time by simply applying classical algorithms from the graphical models community (KolFri09; wainwright2008graphical).
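As a small worked example of this connection, the following sketch (Python with numpy; names illustrative, not from the paper's codebase) implements the MIN oracle for the simplest graphical structure, a path (treewidth $1$) with cost $C_{\vec{j}}=\sum_{i=1}^{k-1}f_{i}(j_{i},j_{i+1})$: the minimum in (1.1) is computed by min-sum dynamic programming in $O(kn^{2})$ time instead of $n^{k}$. The general constant-treewidth case would use junction trees instead.

```python
# MIN oracle for a path-structured (treewidth-1) cost via min-sum DP.
# f: list of k-1 local-interaction matrices, f[i][a, b] = f_i(a, b);
# p: list of k dual weight vectors, as in (1.1).
import numpy as np

def min_oracle_path(f, p):
    k = len(p)
    m = -p[0]                     # m[b] = best partial value with last variable = b
    for i in range(k - 1):
        # extend by one variable: add f_i and -p_{i+1}, minimize out the old one
        m = np.min(m[:, None] + f[i] - p[i + 1][None, :], axis=0)
    return float(m.min())

n, k = 4, 6
rng = np.random.default_rng(0)
f = [rng.random((n, n)) for _ in range(k - 1)]
p = [rng.random(n) for _ in range(k)]
print(min_oracle_path(f, p))      # equals brute force over all n**k tuples
```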

Theorem 1.5 (Informal statement of Theorem 5.5).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have graphical structure of constant treewidth. Then the MIN, AMIN, and SMIN oracles can be computed in $\mathrm{poly}(n,k)$ time.

It is an immediate corollary of Theorem 1.5 and our algorithms-to-oracles reduction described in §1.1 that one can implement ELLIPSOID, MWU, and SINKHORN in polynomial time. Below, we record the theoretical guarantee of ELLIPSOID since it is the best of the three algorithms as it computes exact, sparse solutions.

Theorem 1.6 (Informal statement of Corollary 5.6).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have graphical structure of constant treewidth. Then an exact, sparse solution for MOT can be computed in $\mathrm{poly}(n,k)$ time.

Previously, it was known how to solve such MOT problems (teh2002unified; h20gm) using SINKHORN, but this only computes a solution that is fully dense (with $n^{k}$ non-zero entries) and low-precision (due to $\mathrm{poly}(1/\varepsilon)$ runtime dependence); details are in the related work section. Our result improves over this state-of-the-art algorithm by producing solutions that are exact and sparse in $\mathrm{poly}(n,k)$ time.

In §5.3, we demonstrate the benefit of Theorem 1.6 on the application of computing generalized Euler flows, which was historically the motivation of MOT and has received significant attention, e.g., (BenCarCut15; BenCarNen19; Bre08; Bre89; Bre93; Bre99). While there is a specially-tailored version of the SINKHORN algorithm for this problem that runs in polynomial time (BenCarCut15; BenCarNen19), it produces solutions that are approximate and fully dense. Our algorithm produces exact, sparse solutions which lead to sharp visualizations rather than blurry ones (see Figure 10).

1.2.2 Set-optimization structure

In §6, we apply our algorithmic framework to MOT problems whose cost tensors $C$ take value $0$ or $1$ in each entry. That is, costs $C$ of the form

$$C_{j_{1},\dots,j_{k}}=\mathds{1}[(j_{1},\dots,j_{k})\notin S],$$

for some subset $S\subseteq[n]^{k}$. Such MOT problems arise naturally in applications where one seeks to minimize the probability that some event $S$ occurs, given marginal probabilities on each variable $j_{i}$; see Example 6.1.

In order to derive efficient algorithms, it is necessary to restrict the (otherwise arbitrary) set $S$. We parametrize the complexity of such MOT problems via the complexity of finding the minimum-weight object in $S$. This opens the door to combinatorial applications of MOT because finding the minimum-weight object in $S$ is well-known to be polynomial-time solvable for many “combinatorially-structured” sets $S$ of interest—e.g., the set $S$ of cuts in a graph, or the set $S$ of independent sets in a matroid.
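To make this reduction concrete: for $C_{\vec{j}}=\mathds{1}[\vec{j}\notin S]$ and weights $w_{i}:=-p_{i}$, the value of (1.1) at $\vec{j}$ is $\sum_{i}[w_{i}]_{j_{i}}$ if $\vec{j}\in S$ and $1+\sum_{i}[w_{i}]_{j_{i}}$ otherwise, and a short case analysis shows that the overall minimum equals $\min\big(\min_{\vec{j}\in S}\sum_{i}[w_{i}]_{j_{i}},\;1+\min_{\vec{j}\in[n]^{k}}\sum_{i}[w_{i}]_{j_{i}}\big)$, where the second minimum is unconstrained and hence separable. The sketch below is an illustrative reading of this reduction (not necessarily the exact construction in §6), with a hypothetical problem-specific subroutine min_weight_in_S, e.g., a min-cut solver when $S$ is the set of cuts of a graph.

```python
# Sketch of the MIN oracle (1.1) for a set-optimization cost
# C_j = 1[j not in S], assuming access to a min-weight-object oracle for S.
# Identity used (case analysis sketched in the text above):
#     MIN = min( min_{j in S} sum_i w_i[j_i],  1 + min_{j} sum_i w_i[j_i] )
# where w_i = -p_i; the second term is separable across coordinates.
import numpy as np

def min_oracle_set(p, min_weight_in_S):
    w = [-pi for pi in p]                       # weights from dual potentials
    m_S = min_weight_in_S(w)                    # problem-specific oracle call
    m_all = sum(float(wi.min()) for wi in w)    # unconstrained, poly(n,k) time
    return min(m_S, 1.0 + m_all)
```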

We show that for MOT cost tensors with this structure, all three oracles can be implemented efficiently.

Theorem 1.7 (Informal statement of Theorem 6.8).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have set-optimization structure. Then the MIN, AMIN, and SMIN oracles can be computed in $\mathrm{poly}(n,k)$ time.

It is an immediate corollary of Theorem 1.7 and our algorithms-to-oracles reduction described in §1.1 that one can implement ELLIPSOID, MWU, and SINKHORN in polynomial time. Below, we record the theoretical guarantee for ELLIPSOID since it is the best of these three algorithms as it computes exact, sparse solutions.

Theorem 1.8 (Informal statement of Corollary 6.9).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have set-optimization structure. Then an exact, sparse solution for MOT can be computed in $\mathrm{poly}(n,k)$ time.

This is the first polynomial-time algorithm for this class of MOT problems. We note that a more restrictive class of MOT problems was studied in (zemel1982polynomial) under the additional restriction that $S$ is upwards-closed.

In §6.3, we show how this general class of set-optimization structure captures, for example, the classical application of computing the extremal reliability of a network with stochastic edge failures. Network reliability is a fundamental topic in network science and engineering (gertsbakh2011network; ball1995network; ball1986computational) which is often studied in an average-case setting where each edge fails independently with some given probability (moore1956reliable; karger2001randomized; valiant1979complexity; provan1983complexity). The application investigated here is a robust notion of network reliability in which edge failures may be maximally correlated (e.g., by an adversary) or minimally correlated (e.g., by a network maintainer) subject to a marginal constraint on each edge’s failure probability, a setting that dates back to the 1980s (zemel1982polynomial; weiss1986stochastic). We show how to express both the minimally and maximally correlated network reliability problems as MOT problems with set-optimization structure, recovering as a special case of our general framework the known polynomial-time algorithms in (zemel1982polynomial; weiss1986stochastic) as well as more practical polynomial-time algorithms that scale to input sizes that are an order-of-magnitude larger.

1.2.3 Low-rank and sparse structure

In §7, we apply our algorithmic framework to MOT problems whose cost tensors $C$ decompose as

$$C=R+S,$$

where $R$ is a constant-rank tensor, and $S$ is a polynomially-sparse tensor. We assume that $R$ is represented in factored form, and that $S$ is represented through its non-zero entries, which overall yields a $\mathrm{poly}(n,k)$-size representation of $C$.
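Concretely, one natural form of this representation is sketched below (hypothetical names): the rank-$r$ part is stored as $k$ factor matrices $U_{i}\in\mathbb{R}^{n\times r}$ with $R_{\vec{j}}=\sum_{l=1}^{r}\prod_{i=1}^{k}U_{i}[j_{i},l]$, and the sparse part as a dictionary of its non-zero entries, so any single entry of $C$ is evaluated in $O(kr)$ time plus a dictionary lookup.

```python
# Implicit representation of a low-rank plus sparse cost C = R + S.
import numpy as np

n, k, r = 10, 20, 2
rng = np.random.default_rng(2)
U = [rng.random((n, r)) for _ in range(k)]      # factored rank-r part
S = {tuple(rng.integers(0, n, size=k)): 5.0}    # sparse part: one non-zero

def cost_entry(j):
    rows = np.array([U[i][j[i]] for i in range(k)])  # shape (k, r)
    R_j = rows.prod(axis=0).sum()                    # sum of r rank-1 products
    return R_j + S.get(tuple(j), 0.0)

print(cost_entry(tuple(rng.integers(0, n, size=k))))
```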

We show that for MOT cost tensors with low-rank plus sparse structure, the AMIN and SMIN oracles can be implemented in polynomial time. (It is an interesting open question if the MIN oracle can similarly be implemented in $\mathrm{poly}(n,k)$ time. This would enable implementing ELLIPSOID in $\mathrm{poly}(n,k)$ time by our algorithms-to-oracles reduction, and thus would enable computing exact solutions for this class of MOT problems (cf. Theorem 1.10). This may be of independent interest because, by taking all oracle inputs $p_{i}=0$ in (1.1), it generalizes the previously open problem of approximately computing the smallest entry of a constant-rank tensor with $n^{k}$ entries in $\mathrm{poly}(n,k)$ time.)

Theorem 1.9 (Informal statement of Theorem 7.4).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have low-rank plus sparse structure. Then the AMIN and SMIN oracles can be computed in $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ time.

It is an immediate corollary of Theorem 1.9 and our algorithms-to-oracles reduction described in §1.1 that one can implement MWU and SINKHORN in polynomial time. Of these two algorithms, MWU computes sparse solutions, yielding the following theorem.

Theorem 1.10 (Informal statement of Corollary 7.5).

Let $C\in(\mathbb{R}^{n})^{\otimes k}$ have low-rank plus sparse structure. Then a sparse, $\varepsilon$-approximate solution for MOT can be computed in $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ time.

This is the first polynomial-time result for this class of MOT problems. We note that the runtime of our MOT algorithm depends exponentially on the rank $r$ of $R$, which is why we take $r$ to be constant. Nevertheless, such a restriction on the rank is unavoidable: unless $\mathsf{P}=\mathsf{NP}$, there does not exist an algorithm with runtime that is jointly polynomial in $n$, $k$, and the rank $r$ (AltBoi20hard).

We demonstrate this polynomial-time algorithm concretely on two applications. First, in §7.3 we consider the risk estimation problem of computing an investor’s expected profit in the worst-case over all future prices that are consistent with given marginal distributions. We show that this is equivalent to an MOT problem with a low-rank tensor and thereby provide the first efficient algorithm for it.

Second, in §7.4, we consider the fundamental problem of projecting a joint distribution $Q$ onto the transportation polytope. We provide the first polynomial-time algorithm for solving this when $Q$ decomposes into a constant-rank and sparse component, which models mixtures of product distributions with polynomially many corruptions. This application illustrates the versatility of our algorithmic results beyond polynomial-time solvability of MOT, since this projection problem is a quadratic optimization over the transportation polytope rather than linear optimization (a.k.a. MOT). In order to achieve this, we develop a simple quadratic-to-linear reduction tailored to this problem that crucially exploits the sparsity of the MOT solutions enabled by the MWU algorithm.

1.3 Related work

1.3.1 MOT algorithms

MOT algorithms fall into two categories. One category consists of general-purpose algorithms that do not depend on the specific MOT cost. For example, this includes running an LP solver out-of-the-box, or running the Sinkhorn algorithm where in each iteration one sums over all $n^{k}$ entries of the cost tensor to implement the marginalization bottleneck (LinHoJor19; Fri20; tupitsa2020multimarginal). These approaches are robust in the sense that they do not need to be changed based on the specific MOT problem. However, they are impractical beyond tiny input sizes (e.g., $n=k=10$) because their runtimes scale as $n^{\Omega(k)}$.

The second category consists of algorithms that are much more scalable but require extra structure of the MOT problem. Specifically, these are algorithms that somehow exploit the structure of the relevant cost tensor $C$ in order to (approximately) solve an MOT problem in $\mathrm{poly}(n,k)$ time (teh2002unified; h20gm; h20tree; BenCarDi18; h20incremental; h19hmc; h20partial; CarObeOud15; Nen16; BenCarCut15; BenCarNen19; AltBoi20bary; Nat18dro; Nat09; MisNat14; zemel1982polynomial; weiss1986stochastic; Nat21extremal; Agr12; MeiNad79; nadas1979probabilistic; Han86). Such a $\mathrm{poly}(n,k)$ runtime is far more tractable—but it is not well understood for which MOT problems such a runtime is possible. The purpose of this paper is to clarify this question.

Contextualizing our answer to this question within the rapidly growing literature requires further splitting this second category of algorithms.

Sinkhorn algorithm.

Currently, the predominant approach in the second category is to solve an entropically regularized version of MOT with the Sinkhorn algorithm, a.k.a. Iterative Proportional Fitting, Iterative Bregman Projections, the RAS algorithm, or the Iterative Scaling algorithm; see, e.g., (teh2002unified; BenCarDi18; BenCarNen16; BenCarNen19; Nen16; h20tree; h20gm). Recent work has shown that a polynomial number of iterations of this algorithm suffices (LinHoJor19; Fri20; tupitsa2020multimarginal). However, the bottleneck is that each iteration requires $n^{k}$ operations in general because it requires marginalizing a tensor with $n^{k}$ entries. The critical question is therefore: what structure of an MOT problem enables implementing this marginalization bottleneck in polynomial time?
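To make this bottleneck concrete, here is a minimal sketch of multimarginal SINKHORN for an explicitly stored cost tensor (Python with numpy; names and the cyclic update order are illustrative). The line computing m_i is the marginalization bottleneck: it touches all $n^{k}$ entries, and structured implementations replace exactly this step with a $\mathrm{poly}(n,k)$-time computation.

```python
# Multimarginal Sinkhorn scaling for an explicit cost tensor (sketch).
# Maintains one positive scaling vector per marginal; each iteration
# rescales one of them so that the corresponding marginal is matched.
import numpy as np

def scaled(K, d):
    """P = K with axis ax multiplied entrywise by d[ax], for every ax."""
    P = K.copy()
    for ax in range(K.ndim):
        shape = [-1 if a == ax else 1 for a in range(K.ndim)]
        P *= d[ax].reshape(shape)
    return P

def sinkhorn_mot(C, mus, eta=50.0, iters=200):
    k, n = C.ndim, C.shape[0]
    K = np.exp(-eta * C)                       # Gibbs kernel
    d = [np.ones(n) for _ in range(k)]
    for t in range(iters):
        i = t % k
        P = scaled(K, d)                       # n^k entries ...
        m_i = P.sum(axis=tuple(a for a in range(k) if a != i))  # ... bottleneck
        d[i] *= mus[i] / m_i                   # match the i-th marginal
    return scaled(K, d)

n, k = 3, 3
rng = np.random.default_rng(3)
P = sinkhorn_mot(rng.random((n,) * k), [np.full(n, 1.0 / n)] * k)
print(P.sum(axis=(1, 2)))                      # approximately the first marginal
```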

This paper makes two contributions to this question. First, we identify new broad classes of MOT problems for which this bottleneck can be implemented in polynomial time, and thus SINKHORN can be implemented in polynomial time (see §1.2). Second, we propose other algorithms that require strictly less structure than SINKHORN does in order to solve an MOT problem in polynomial time (Theorem 4.19).

Ellipsoid algorithm.

The Ellipsoid algorithm is among the most classical algorithms for implicit LP (GLSbook; GroLovSch81; khachiyan1980polynomial); however, it has taken a back seat to the SINKHORN algorithm in the vast majority of the MOT literature.

In §4.1, we make explicit the fact that the variant of ELLIPSOID from (AltBoi20bary) can solve MOT exactly in $\mathrm{poly}(n,k)$ time if and only if any algorithm can (Theorem 4.1). This is implicit from combining several known results (AltBoi20hard; AltBoi20bary; GLSbook). In the process of making this result explicit, we exploit the special structure of the MOT LP to significantly simplify the reduction from the dual violation oracle to the dual feasibility oracle. The previously known reduction is highly impractical as it requires an indirect “back-and-forth” use of the Ellipsoid algorithm (GLSbook, page 107). In contrast, our reduction is direct and simple; this is critical for implementing our practical alternative to ELLIPSOID, namely COLGEN, with the dual feasibility oracle.

Multiplicative Weights Update algorithm.

This algorithm, first introduced by (young2001sequential), has been studied in the context of optimal transport when $k=2$ (BlaJamKenSid18; Qua18), in which case implicit LP is not necessary for a polynomial runtime. MWU lends itself to implicit LP (young2001sequential), but is notably absent from the MOT literature.

In §4.2, we show that MWU can be applied to MOT in polynomial time if and only if the approximate dual feasibility oracle can be solved in polynomial time. To do this, we show that in the special case of MOT, the well-known “softmax-derivative” bottleneck of MWU is polynomial-time equivalent to the approximate dual feasibility oracle. Since it is known that the approximate dual feasibility oracle is polynomial-time reducible to approximate MOT (AltBoi20hard), we therefore establish that MWU can solve MOT approximately in polynomial time if and only if any algorithm can (Theorem 4.7).

1.3.2 Graphically structured MOT problems with constant treewidth

We isolate here graphically structured costs with constant treewidth because this framework encompasses all MOT problems that were previously known to be tractable in polynomial time (teh2002unified; h20gm), with the exceptions of the fixed-dimensional Wasserstein barycenter problem and MOT problems related to combinatorial optimization—both of which are described below in §1.3.3. This family of graphically structured costs with treewidth $1$ (a.k.a. “tree-structured costs” (h20tree)) includes applications in economics such as variational mean-field games (BenCarDi18), interpolating histograms on trees (akagi2020probabilistic), and matching for teams (CarObeOud15; Nen16), as well as applications in filtering and estimation for collective dynamics such as target tracking (h20tree; h20gm; h20incremental; h19hmc; h20partial) and Wasserstein barycenters in the case of fixed support (h20partial; Nen16; BenCarCut15; CarObeOud15). With treewidth $2$, this family of costs also includes dynamic multi-commodity flow problems (haasler2021scalable), as well as the application of computing generalized Euler flows in fluid dynamics (BenCarCut15; BenCarNen19; Nen16), which was historically the original motivation of MOT (Bre89; Bre93; Bre99; Bre08).

Previous polynomial-time algorithms for graphically structured MOT compute approximate, dense solutions.

Implementing SINKHORN for graphically structured MOT problems by using belief propagation to efficiently implement the marginalization bottleneck was first proposed twenty years ago in (teh2002unified). There have been recent advancements in understanding connections of this algorithm to the Schrödinger bridge problem in the case of trees (h20tree), as well as developing more practically efficient single-loop variations (h20gm).

All of these works prove theoretical runtime guarantees only in the case of tree structure (i.e., treewidth $1$). However, this graphical model perspective for efficiently implementing SINKHORN readily extends to any constant treewidth: simply implement the marginalization bottleneck using junction trees. This, combined with the iteration complexity of SINKHORN, which is known to be polynomial (LinHoJor19; Fri20; tupitsa2020multimarginal), immediately yields an overall polynomial runtime. This is why we cite (teh2002unified; h20gm) throughout this paper regarding the fact that SINKHORN can be implemented in polynomial time for graphical structure with any constant treewidth.

While the use of SINKHORN for graphically structured MOT is mathematically elegant and can be impressively scalable in practice, it has two drawbacks. The first drawback of this algorithm is that it computes (implicit representations of) solutions that are fully dense with $n^{k}$ non-zero entries. Indeed, it is well-known that SINKHORN finds the unique optimal solution to the entropically regularized MOT problem $\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,C\rangle-\eta^{-1}H(P)$, and that this solution is fully dense (PeyCut17). For example, in the simple case of cost $C=0$, uniform marginals $\mu_{i}$, and any strictly positive regularization parameter $\eta>0$, this solution $P$ has value $1/n^{k}$ in each entry.

The second drawback of this algorithm is that it only computes solutions that are low-precision due to $\mathrm{poly}(1/\varepsilon)$ runtime dependence on the accuracy $\varepsilon$. This is because the number of SINKHORN iterations is known to scale polynomially in the entropic regularization parameter $\eta$ even in the matrix case $k=2$ (linial1998deterministic, §1.2), and it is known that $\eta=\Omega(\varepsilon^{-1}k\log n)$ is necessary for the converged solution of SINKHORN to be an $\varepsilon$-approximate solution to the (unregularized) original MOT problem (LinHoJor19).

Improved algorithms for graphically structured MOT problems.

The contribution of this paper to the study of graphically structured MOT problems is that we give the first $\mathrm{poly}(n,k)$-time algorithms that can compute solutions which are exact and sparse (Corollary 5.6). Our framework also directly recovers all known results about SINKHORN for graphically structured MOT problems—namely, that it can be implemented in polynomial time for trees (h20tree; teh2002unified) and for constant treewidth (h20gm; teh2002unified).

1.3.3 Tractable MOT problems beyond graphically structured costs

The two new classes of MOT problems studied in this paper—namely, set-optimization structure and low-rank plus sparse structure—are incomparable to each other as well as to graphical structure; details are in Remarks 6.7 and 7.3. This lets us handle a wide range of new MOT problems that could not be handled before.

There are two other classes of MOT problems studied in the literature which do not fall under the three structures studied in this paper. We elaborate on both below.

Remark 1.11 (Low-dimensional Wasserstein barycenter).

This MOT problem has cost $C_{j_{1},\dots,j_{k}}=\sum_{i,i^{\prime}=1}^{k}\|x_{i,j_{i}}-x_{i^{\prime},j_{i^{\prime}}}\|^{2}$, where $x_{i,j}\in\mathbb{R}^{d}$ denotes the $j$-th atom in the distribution $\mu_{i}$. Clearly this cost is not a graphically structured cost of constant treewidth—indeed, representing it through the lens of graphical structure requires the complete graph of interactions, which means a maximal treewidth of $k-1$. (We remark that the related but different problem of fixed-support Wasserstein barycenters has graphical structure with treewidth $1$ (h20partial; Nen16; BenCarCut15; CarObeOud15). However, it should be emphasized that the fixed-support Wasserstein barycenter problem is different from the Wasserstein barycenter problem: it only approximates the latter to $\varepsilon$ accuracy if the fixed support is restricted to an $O(\varepsilon)$-net, which requires $n=1/\varepsilon^{\Omega(d)}$ discretization size for the barycenter’s support, and thus (i) even in constant dimension, does not lead to high-precision algorithms due to $\mathrm{poly}(1/\varepsilon)$ runtime; and (ii) scales exponentially in the dimension $d$. See (AltBoi20baryhard, §1.3) for further details about the complexity of Wasserstein barycenters.) This problem also does not fall under the set-optimization or constant-rank structures. Nevertheless, this MOT problem can be solved in $\mathrm{poly}(n,k)$ time for any fixed dimension $d$ by exploiting the low-dimensional geometric structure of the points $\{x_{i,j}\}$ that implicitly define the cost (AltBoi20bary).

Remark 1.12 (Random combinatorial optimization).

MOT problems have also appeared in the random combinatorial optimization literature since the 1970s, see, e.g., (MeiNad79; zemel1982polynomial; weiss1986stochastic; nadas1979probabilistic; Han86), although under a different name and in a different community. These papers consider MOT problems with costs of the form $C(x)=\min_{v\in V}\langle x,v\rangle$ for polytopes $V\subseteq\{0,1\}^{k}$ given through a list of their extreme points. Applications include PERT (Program Evaluation and Review Technique), extremal network reliability, and scheduling. Recently, applications to Distributionally Robust Optimization were investigated in (Nat18dro; Nat09; MisNat14), which considered general polytopes $V\subset\mathbb{R}^{k}$, as well as in (Nat21extremal), which considered MOT costs of the related form $C(x)=\mathds{1}[\min_{v\in V}\langle x,v\rangle\geqslant t]$, and in (Agr12), which considers other combinatorial costs $C$ such as sub/supermodular functions. These papers show that these random combinatorial optimization problems are in general intractable, and give sufficient conditions on when they can be solved in polynomial time. In general, these families of MOT problems are different from the three structures studied in this paper, although some MOT applications fall under multiple umbrellas (e.g., extremal network reliability). It is an interesting question to understand to what extent these structures can be reconciled (as well as the algorithms, which sometimes use extended formulations in these papers).

1.3.4 Intractable MOT problems

These algorithmic results beg the question: what are the fundamental limitations of this line of work on polynomial-time algorithms for structured MOT problems? To this end, the recent paper (AltBoi20hard) provides a systematic investigation of $\mathsf{NP}$-hardness results for structured MOT problems, including converses to several results in this paper. In particular, (AltBoi20hard, Propositions 4.1 and 4.2) justify the constant-rank regime studied in §7 by showing that unless $\mathsf{P}=\mathsf{NP}$, there does not exist an algorithm with runtime that is jointly polynomial in the rank $r$ and the input parameters $n$ and $k$. Similarly, (AltBoi20hard, Propositions 5.1 and 5.2) justify the constant-treewidth regime for graphically structured costs studied in §5 and all previous work by showing that unless $\mathsf{P}=\mathsf{NP}$, there does not exist an algorithm with polynomial runtime even in the seemingly simple class of MOT costs that decompose into pairwise interactions $C_{j_{1},\dots,j_{k}}=\sum_{i\neq i^{\prime}\in[k]}c_{i,i^{\prime}}(j_{i},j_{i^{\prime}})$. The paper (AltBoi20hard) also shows $\mathsf{NP}$-hardness for several MOT problems with repulsive costs, including for example the MOT formulation of Density Functional Theory with the Coulomb-Buckingham potential. It is an open problem whether the Coulomb potential, studied in (BenCarNen16; cotar2013density; buttazzo2012optimal), also leads to an $\mathsf{NP}$-hard MOT problem (AltBoi20hard, Conjecture 6.4).

1.3.5 Variants of MOT

The literature has studied several other variants of the MOT problem, notably with entropic regularization and/or with constraints on a subset of the $k$ marginals; see, e.g., (BenCarCut15; BenCarDi18; BenCarNen19; BenCarNen16; h20partial; haasler2021scalable; h20incremental; h20gm; h20tree; h19hmc; LinHoJor19). Our techniques readily apply with little change. Briefly, to handle entropic regularization, simply use the SMIN oracle and SINKHORN algorithm with a fixed regularization parameter $1/\eta>0$ (rather than $1/\eta$ of vanishing size $\Theta(\varepsilon/\log n)$) as described in §4.3. And to handle partial marginal constraints, essentially the only change is that in the MIN, AMIN, and SMIN oracles, the potentials $p_{i}$ are zero for all indices $i\in[k]$ corresponding to unconstrained marginals $m_{i}(P)$. Full details are omitted for brevity since they are straightforward modifications of our main results.

1.3.6 Optimization over joint distributions

Optimization problems over exponential-size joint distributions appear in many domains. For instance, they arise in game theory when computing correlated equilibria (pap08); however, in that case the optimization has different constraints which lead to different algorithms. Such problems also arise in variational inference (wainwright2003variational); however, the optimization there typically constrains this distribution to ensure tractability (e.g., mean-field approximation restricts to product distributions). The different constraints in these optimization problems over joint distributions versus MOT lead to significant differences in computational complexity, and thus also necessitate different algorithmic techniques.

1.4 Organization

In §2 we recall preliminaries about MOT and establish notation. The first part of the paper then establishes our unified algorithmic framework for MOT. Specifically, in §3 we define and compare three variants of the dual feasibility oracle; and in §4 we characterize the structure that MOT algorithms require for polynomial-time implementation in terms of these three oracles. For an overview of these results, see §1.1. The second part of the paper applies this algorithmic framework to three general classes of MOT cost structures: graphical structure (§5), set-optimization structure (§6), and low-rank plus sparse structure (§7). For an overview of these results, see §1.2. These three application sections are independent of each other and can be read separately. We conclude in §8.

2 Preliminaries

General notation.

The set $\{1,\dots,n\}$ is denoted by $[n]$. For shorthand, we write $\mathrm{poly}(t_{1},\dots,t_{m})$ to denote a function that grows at most polynomially fast in those parameters. Throughout, we assume for simplicity of exposition that all entries of the input $C$ and $\mu_{1},\dots,\mu_{k}$ have bit complexity at most $\mathrm{poly}(n,k)$, and the same for the components defining $C$ in structured settings. As such, runtimes throughout refer to the number of arithmetic operations. The set $\mathbb{R}\cup\{-\infty\}$ is denoted by $\bar{\mathbb{R}}$; note that the value $-\infty$ can be represented efficiently by adding a single flag bit. We use the standard $O(\cdot)$ and $\Omega(\cdot)$ notation, and use $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ to denote that polylogarithmic factors may be omitted.

Tensor notation.

The $k$-fold tensor product space $\mathbb{R}^{n}\otimes\cdots\otimes\mathbb{R}^{n}$ is denoted by $(\mathbb{R}^{n})^{\otimes k}$, and similarly for $(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}$. Let $P\in(\mathbb{R}^{n})^{\otimes k}$. Its $i$-th marginal, $i\in[k]$, is denoted by $m_{i}(P)\in\mathbb{R}^{n}$ and has entries $[m_{i}(P)]_{j}:=\sum_{j_{1},\dots,j_{i-1},j_{i+1},\dots,j_{k}}P_{j_{1},\dots,j_{i-1},j,j_{i+1},\dots,j_{k}}$. For shorthand, we often denote an index $(j_{1},\dots,j_{k})$ by $\vec{j}$. The sum of $P$’s entries is denoted by $m(P)=\sum_{\vec{j}}P_{\vec{j}}$. The maximum absolute value of $P$’s entries is denoted by $\|P\|_{\max}:=\max_{\vec{j}}|P_{\vec{j}}|$, or simply $P_{\max}$ for short. For $\vec{j}\in[n]^{k}$, we write $\delta_{\vec{j}}$ to denote the tensor with value $1$ at entry $\vec{j}$, and $0$ elsewhere. The operations $\odot$ and $\otimes$ respectively denote the entrywise product and the Kronecker product. The notation $\otimes_{i=1}^{k}d_{i}$ is shorthand for $d_{1}\otimes\cdots\otimes d_{k}$. A non-standard notation we use throughout is that $f[P]$ denotes a function $f:\mathbb{R}\to\mathbb{R}$ (typically $\exp$, $\log$, or a polynomial) applied entrywise to a tensor $P$.
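For explicit (small) tensors, the marginal operator $m_{i}(P)$ is simply a sum over all axes except the $i$-th; a two-line numpy sketch:

```python
# Compute the i-th marginal m_i(P) of an explicit tensor P.
import numpy as np

def marginal(P, i):
    return P.sum(axis=tuple(ax for ax in range(P.ndim) if ax != i))

P = np.full((3, 3, 3, 3), 1.0 / 81)   # uniform joint distribution, k=4, n=3
print(marginal(P, 2))                 # each marginal is uniform: [1/3 1/3 1/3]
```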

2.1 Multimarginal Optimal Transport

The transportation polytope between measures $\mu_{1},\dots,\mu_{k}\in\Delta_{n}$ is

$$\mathcal{M}(\mu_{1},\dots,\mu_{k}):=\left\{P\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}\;:\;m_{i}(P)=\mu_{i},\;\forall i\in[k]\right\}. \tag{2.1}$$

For a fixed cost $C\in(\mathbb{R}^{n})^{\otimes k}$, the $\textsf{MOT}_{C}$ problem is to solve the following linear program, given input measures $\mu=(\mu_{1},\dots,\mu_{k})\in(\Delta_{n})^{k}$:

$$\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,C\rangle. \tag{MOT}$$

In the $k=2$ matrix case, (MOT) is the Kantorovich formulation of OT (Vil03). Its dual LP is

$$\max_{p_{1},\ldots,p_{k}\in\mathbb{R}^{n}}\sum_{i=1}^{k}\langle p_{i},\mu_{i}\rangle\quad\text{subject to}\quad C_{j_{1},\dots,j_{k}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\geqslant 0,\;\forall(j_{1},\dots,j_{k})\in[n]^{k}. \tag{MOT-D}$$

A basic, folklore fact about MOT is that it always has a sparse optimal solution (e.g., (anderes2016discrete, Lemma 3)). This follows from elementary facts about standard-form LP; we provide a short proof for completeness.

Lemma 2.1 (Sparse solutions for MOT).

For any cost $C\in(\mathbb{R}^{n})^{\otimes k}$ and any marginals $\mu_{1},\dots,\mu_{k}\in\Delta_{n}$, there exists an optimal solution $P$ to $\textsf{MOT}_{C}(\mu)$ that has at most $nk-k+1$ non-zero entries.

Proof.

Since (MOT) is an LP over a compact domain, it has an optimal solution at a vertex (BerTsi97, Theorem 2.7). Since (MOT) is a standard-form LP, these vertices are in correspondence with basic solutions, thus their sparsity is bounded by the number of linearly dependent constraints defining $\mathcal{M}(\mu_{1},\dots,\mu_{k})$ (BerTsi97, Theorem 2.4). We bound this quantity by $nk-k+1$ via two observations. First, $\mathcal{M}(\mu_{1},\dots,\mu_{k})$ is defined by $nk$ equality constraints $[m_{i}(P)]_{j}=[\mu_{i}]_{j}$ in (2.1), one for each coordinate $j\in[n]$ of each marginal constraint $i\in[k]$. Second, at least $k-1$ of these constraints are linearly dependent because we can construct $k$ distinct linear combinations of them, namely $\sum_{j\in[n]}[m_{i}(P)]_{j}=\sum_{j\in[n]}[\mu_{i}]_{j}$ for each marginal $i\in[k]$, which all simplify to the same constraint $m(P)=1$, and thus are redundant with each other. ∎

Definition 2.2 ($\varepsilon$-approximate MOT solution).

$P$ is an $\varepsilon$-approximate solution to $\textsf{MOT}_{C}(\mu)$ if $P$ is feasible (i.e., $P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})$) and $\langle C,P\rangle$ is at most $\varepsilon$ more than the optimal value.

2.2 Regularization

We introduce two standard regularization operators. First is the Shannon entropy $H(P):=-\sum_{\vec{j}}P_{\vec{j}}\log P_{\vec{j}}$ of a tensor $P\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}$ with entries summing to $m(P)=1$. We adopt the standard notational convention that $0\log 0=0$. Second is the softmin operator, which is defined for parameter $\eta>0$ as

sminηi[m]ai:=1ηlog(i=1meηai).\displaystyle\operatorname*{{\textstyle{smin_{\eta}}}}_{i\in[m]}a_{i}:=-\frac{1}{\eta}\log\left(\sum_{i=1}^{m}e^{-\eta a_{i}}\right). (2.2)

This softmin operator naturally extends to ai{}a_{i}\in\mathbb{R}\cup\{\infty\} by adopting the standard notational conventions that e=0e^{-\infty}=0 and log0=\log 0=-\infty.

We make use of the following folklore fact, which bounds the error between the min\min and smin\operatorname*{smin} operators based on the regularization and the number of points. For completeness, we provide a short proof.

Lemma 2.3 (Softmin approximation bound).

For any a1,,am{}a_{1},\dots,a_{m}\in\mathbb{R}\cup\{\infty\} and η>0\eta>0,

mini[m]aisminηi[m]aimini[m]ailogmη.\displaystyle\min_{i\in[m]}a_{i}\geqslant\operatorname*{{\textstyle{smin_{\eta}}}}_{i\in[m]}a_{i}\geqslant\min_{i\in[m]}a_{i}-\frac{\log m}{\eta}.
Proof.

Assume without loss of generality that all aia_{i} are finite, else aia_{i} can be dropped (if all ai=a_{i}=\infty then the claim is trivial). For shorthand, denote mini[m]ai\min_{i\in[m]}a_{i} by amina_{\min}. For the first inequality, use the non-negativity of the exponential function to bound

sminηi[m]ai=1ηlog(i=1meηai)1ηlog(eηamin)=amin.\operatorname*{{\textstyle{smin_{\eta}}}}_{i\in[m]}a_{i}=-\frac{1}{\eta}\log\left(\sum_{i=1}^{m}e^{-\eta a_{i}}\right)\leqslant-\frac{1}{\eta}\log\left(e^{-\eta a_{\min}}\right)=a_{\min}.

For the second inequality, use the fact that each aiamina_{i}\geqslant a_{\min} to bound

sminηi[m]ai=1ηlog(i=1meηai)1ηlog(meηamin)=aminlogmη.\operatorname*{{\textstyle{smin_{\eta}}}}_{i\in[m]}a_{i}=-\frac{1}{\eta}\log\left(\sum_{i=1}^{m}e^{-\eta a_{i}}\right)\geqslant-\frac{1}{\eta}\log\left(me^{-\eta a_{\min}}\right)=a_{\min}-\frac{\log m}{\eta}. ∎
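Numerically, the softmin is evaluated stably by factoring the minimum out of the log-sum-exp. Below is a minimal sketch (ours, not from the paper), together with an empirical check of the two-sided bound of Lemma 2.3.

import numpy as np

def softmin(a, eta):
    # smin_eta(a) = -(1/eta) * log(sum_i exp(-eta * a_i)), computed stably
    # by factoring out exp(-eta * min(a)) so that no argument overflows.
    a = np.asarray(a, dtype=float)
    a_min = a.min()
    return a_min - np.log(np.exp(-eta * (a - a_min)).sum()) / eta

a = np.array([0.3, 1.7, 0.9, 2.4])
eta = 50.0
s = softmin(a, eta)
# Lemma 2.3: min(a) - log(m)/eta <= smin_eta(a) <= min(a).
assert a.min() - np.log(a.size) / eta <= s <= a.min()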

The entropically regularized MOT problem (RMOT for short) is the convex optimization problem

minP(μ1,,μk)P,Cη1H(P).\displaystyle\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,C\rangle-\eta^{-1}H(P). (RMOT)

This is the natural multidimensional analog of entropically regularized OT, which has a rich literature in statistics (Leo13, ) and transportation theory (Wil69, ), and has recently attracted significant interest in machine learning (Cut13, ; PeyCut17, ). The convex dual of (RMOT) is the convex optimization problem

maxp1,,pkni=1kpi,μi+sminηj[n]k(Cji=1k[pi]ji).\displaystyle\max_{p_{1},\dots,p_{k}\in\mathbb{R}^{n}}\sum_{i=1}^{k}\langle p_{i},\mu_{i}\rangle+\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in[n]^{k}}\left(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right). (RMOT-D)

In contrast to MOT, there is no analog of Lemma 2.1 for RMOT: the unique optimal solution to RMOT is dense. Further, this solution may not even be “approximately” sparse. For example, when C=0C=0, all μi\mu_{i} are uniform, and η>0\eta>0 is any positive number, the solution is fully dense with all entries equal to 1/nk1/n^{k}.

We define PP to be an ε\varepsilon-approximate RMOT solution in the analogous way as in Definition 2.2. A basic, folklore fact about RMOT is that if the regularization η\eta is sufficiently large, then RMOT and MOT are equivalent in terms of approximate solutions.

Lemma 2.4 (MOT and RMOT are close for large regularization η\eta).

Let P(μ1,,μk)P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}), ε>0\varepsilon>0, and ηε1klogn\eta\geqslant\varepsilon^{-1}k\log n. If PP is an ε\varepsilon-approximate solution to (RMOT), then PP is also a (2ε)(2\varepsilon)-approximate solution to (MOT); and vice versa.

Proof.

Since a discrete distribution supported on nkn^{k} atoms has entropy at most klognk\log n (CoverThomas, ), the objectives of (MOT) and (RMOT) differ pointwise by at most η1klognε\eta^{-1}k\log n\leqslant\varepsilon. Since (MOT) and (RMOT) also have the same feasible sets, their optimal values therefore differ by at most ε\varepsilon. ∎

3 Oracles

Here we define the three oracle variants described in the introduction and discuss their relations. In the below definitions, let C(n)kC\in(\mathbb{R}^{n})^{\otimes k} be a cost tensor.

Definition 3.1 (MIN oracle).

For weights p=(p1,,pk)n×kp=(p_{1},\ldots,p_{k})\in\mathbb{R}^{n\times k}, MINC(p)\textsf{MIN}_{C}(p) returns

minj[n]kCji=1k[pi]ji.\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.
Definition 3.2 (AMIN oracle).

For weights p=(p1,,pk)n×kp=(p_{1},\ldots,p_{k})\in\mathbb{R}^{n\times k} and accuracy ε>0\varepsilon>0, AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon) returns MINC(p)\textsf{MIN}_{C}(p) up to additive error ε\varepsilon.

Definition 3.3 (SMIN oracle).

For weights p=(p1,,pk)¯n×kp=(p_{1},\ldots,p_{k})\in\bar{\mathbb{R}}^{n\times k} and regularization parameter η>0\eta>0, SMINC(p,η)\textsf{SMIN}_{C}(p,\eta) returns

sminηj[n]kCji=1k[pi]ji.\displaystyle\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.

An algorithm is said to “solve” or “implement” MINC\textsf{MIN}_{C} if given input pp, it outputs MINC(p)\textsf{MIN}_{C}(p). Similarly for AMINC\textsf{AMIN}_{C} and SMINC\textsf{SMIN}_{C}. Note that the weights pp that are input to SMIN have values inside ¯={}\bar{\mathbb{R}}=\mathbb{R}\cup\{-\infty\}; this simplifies the notation in the treatment of the SINKHORN algorithm below and does not increase the bit-complexity by more than 11 bit by adding a flag for the value -\infty.
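To fix ideas, here are brute-force reference implementations of these oracles for an explicitly stored cost tensor (a sketch of ours, usable only for small nn and kk): each enumerates all nkn^{k} tuples, which is exactly the exponential bottleneck that the structured algorithms of §4 must avoid. By Remark 3.6, the SMIN implementation with η ⩾ ε^{-1}k log n also serves as a valid AMIN implementation.

import itertools
import numpy as np

def min_oracle(C, p):
    # MIN_C(p) by brute force: p is a (k x n) array of dual weights.
    n, k = C.shape[0], C.ndim
    return min(C[j] - sum(p[i][j[i]] for i in range(k))
               for j in itertools.product(range(n), repeat=k))

def smin_oracle(C, p, eta):
    # SMIN_C(p, eta): softmin over the same n^k values (log-sum-exp trick).
    n, k = C.shape[0], C.ndim
    vals = np.array([C[j] - sum(p[i][j[i]] for i in range(k))
                     for j in itertools.product(range(n), repeat=k)])
    vmin = vals.min()
    return vmin - np.log(np.exp(-eta * (vals - vmin)).sum()) / eta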

Remark 3.4 (Interpretation as variants of the dual feasibility oracle).

These three oracles can be viewed as variants of the feasibility oracle for (MOT-D). For MINC(p)\textsf{MIN}_{C}(p), this relationship is exact: pn×kp\in\mathbb{R}^{n\times k} is feasible for (MOT-D) if and only if MINC(p)\textsf{MIN}_{C}(p) is non-negative. For AMINC\textsf{AMIN}_{C} and SMINC\textsf{SMIN}_{C}, this relationship is approximate, with the approximation depending on how small ε\varepsilon is and how large η\eta is, respectively.

Since these oracles form the respective bottlenecks of all algorithms from the MOT and implicit linear programming literatures (see the overview in the introduction §1.1), an important question is: if one oracle can be implemented in poly(n,k)\mathrm{poly}(n,k) time, does this imply that the other can be too?

Two reductions are straightforward: the AMIN oracle can be implemented in poly(n,k)\mathrm{poly}(n,k) time whenever either the MIN oracle or the SMIN oracle can be implemented in poly(n,k)\mathrm{poly}(n,k) time. We record these simple observations in remarks for easy recall.

Remark 3.5 (MIN implies AMIN).

For any accuracy ε>0\varepsilon>0, the MINC(p)\textsf{MIN}_{C}(p) oracle provides a valid answer to the AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon) oracle by definition.

Remark 3.6 (SMIN implies AMIN).

For any pn×kp\in\mathbb{R}^{n\times k} and regularization ηε1klogn\eta\geqslant\varepsilon^{-1}k\log n, the SMINC(p,η)\textsf{SMIN}_{C}(p,\eta) oracle provides a valid answer to the AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon) oracle due to the approximation property of the smin\operatorname*{smin} operator (Lemma 2.3).

In the remainder of this section, we show a separation between the SMIN oracle and both the MIN and AMIN oracles by exhibiting a family of cost tensors CC for which there exist polynomial-time algorithms for MIN and AMIN, but for which there is no polynomial-time algorithm for SMIN. The non-existence of a polynomial-time algorithm of course requires a complexity-theoretic assumption; our result holds under #\#BIS-hardness, a by-now standard assumption introduced in (dyer2000relative, ) which states that there is no polynomial-time algorithm for counting the number of independent sets in a bipartite graph.

Lemma 3.7 (Restrictiveness of the SMIN oracle).

There exists a family of costs C(n)kC\in(\mathbb{R}^{n})^{\otimes k} for which MINC\textsf{MIN}_{C} and AMINC\textsf{AMIN}_{C} can be solved in poly(n,k)\mathrm{poly}(n,k) time, however SMINC\textsf{SMIN}_{C} is #BIS-hard.

Proof.

In order to prove hardness for general nn, it suffices to exhibit such a family of cost tensors when n=2n=2. Since n=2n=2, it is convenient to abuse notation slightly by indexing a cost tensor C(n)kC\in(\mathbb{R}^{n})^{\otimes k} by j{1,1}k\vec{j}\in\{-1,1\}^{k} rather than by j{1,2}k\vec{j}\in\{1,2\}^{k}. The family we exhibit is {C(A,b):A0k×k,bk}\{C(A,b):A\in\mathbb{R}_{\geqslant 0}^{k\times k},\,b\in\mathbb{R}^{k}\}, where the cost tensors C(A,b)C(A,b) are parameterized by a non-negative square matrix AA and a vector bb, and have entries of the form

Cj(A,b):=j,Ajb,j,j{±1}k.C_{\vec{j}}(A,b):=-\langle\vec{j},A\vec{j}\rangle-\langle b,\vec{j}\rangle,\qquad\vec{j}\in\{\raisebox{0.86108pt}{$\scriptstyle\pm$}1\}^{k}.

Polynomial-time algorithm for MIN and AMIN. We show that given a matrix A0k×kA\in\mathbb{R}_{\geqslant 0}^{k\times k}, vector bkb\in\mathbb{R}^{k}, and weights p2×kp\in\mathbb{R}^{2\times k}, it is possible to compute MINC(p)\textsf{MIN}_{C}(p) on the cost tensor C(A,b)C(A,b) in poly(k)\mathrm{poly}(k) time. Clearly this also implies a poly(k)\mathrm{poly}(k) time algorithm for AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon) for any ε>0\varepsilon>0, see Remark 3.5.

To this end, we first re-write the MINC(p)\textsf{MIN}_{C}(p) problem on C(A,b)C(A,b) in a more convenient form that enables us to “ignore” the weights pp. Recall that MINC(p)\textsf{MIN}_{C}(p) is the problem of

MINC(p)=minj{±1}kj,Ajb,ji=1k[pi]ji.\textsf{MIN}_{C}(p)=\min_{\vec{j}\in\{\raisebox{0.60275pt}{$\scriptstyle\pm$}1\}^{k}}-\langle\vec{j},A\vec{j}\rangle-\langle b,\vec{j}\rangle-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.

Note that the linear part of the cost is equal to

b,j+i=1k[pi]ji=,j+d,\displaystyle\langle b,\vec{j}\rangle+\sum_{i=1}^{k}[p_{i}]_{j_{i}}=\langle\ell,\vec{j}\rangle+d, (3.1)

where k\ell\in\mathbb{R}^{k} is the vector with entries i=bi+12((pi)1(pi)1)\ell_{i}=b_{i}+\operatorname{\frac{1}{2}}((p_{i})_{1}-(p_{i})_{-1}), and dd is the scalar d=12i=1k([pi]1+[pi]1)d=\operatorname{\frac{1}{2}}\sum_{i=1}^{k}([p_{i}]_{1}+[p_{i}]_{-1}). Thus, since dd is clearly computable in O(k)O(k) time, the MINC\textsf{MIN}_{C} problem is equivalent to solving

minj{±1}kj,Aj,j,\displaystyle\min_{\vec{j}\in\{\raisebox{0.60275pt}{$\scriptstyle\pm$}1\}^{k}}-\langle\vec{j},A\vec{j}\rangle-\langle\ell,\vec{j}\rangle, (3.2)

when given as input a non-negative matrix A0k×kA\in\mathbb{R}_{\geqslant 0}^{k\times k} and a vector k\ell\in\mathbb{R}^{k}.

To show that this task is solvable in poly(k)\mathrm{poly}(k) time, note that the objective in (3.2) is a submodular function because it is a quadratic whose Hessian A-A has non-positive off-diagonal terms (Bach13, , Proposition 6.3). Therefore (3.2) is a submodular optimization problem, and thus can be solved in poly(k)\mathrm{poly}(k) time using classical algorithms from combinatorial optimization (GLSbook, , Chapter 10.2).
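To illustrate this part of the proof, the following sketch (ours) computes MINC(p)\textsf{MIN}_{C}(p) for the cost C(A,b)C(A,b) via the reparameterization (3.1). For clarity we brute-force the minimization (3.2) over {±1}^k; a poly(k)-time submodular minimizer would be substituted at the indicated line.

import itertools
import numpy as np

def min_oracle_CAb(A, b, p):
    # MIN_C(p) for the cost C(A, b). The weights p have shape (k, 2):
    # column 0 holds [p_i]_{+1} and column 1 holds [p_i]_{-1}.
    k = A.shape[0]
    ell = b + 0.5 * (p[:, 0] - p[:, 1])    # the vector l in (3.1)
    d = 0.5 * (p[:, 0] + p[:, 1]).sum()    # the scalar d in (3.1)
    # Solve (3.2). A poly(k)-time submodular minimizer goes here; we
    # enumerate {-1, +1}^k only to keep the illustration self-contained.
    best = min(-s @ A @ s - ell @ s
               for s in (np.array(t) for t in itertools.product((-1, 1), repeat=k)))
    return best - d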

SMIN oracle is #\#BIS-hard. On the other hand, by using the definition of the SMIN oracle, the re-parameterization (3.1), and then the definition of the softmin operator,

SMINC(p,η)=sminηj{±1}kj,Ajb,ji=1k[pi]ji=sminηj{±1}kj,Aj,jd=logZηd,\textsf{SMIN}_{C}(p,\eta)=\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in\{\raisebox{0.60275pt}{$\scriptstyle\pm$}1\}^{k}}-\langle\vec{j},A\vec{j}\rangle-\langle b,\vec{j}\rangle-\sum_{i=1}^{k}[p_{i}]_{j_{i}}=\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in\{\raisebox{0.60275pt}{$\scriptstyle\pm$}1\}^{k}}-\langle\vec{j},A\vec{j}\rangle-\langle\ell,\vec{j}\rangle-d=-\frac{\log Z}{\eta}-d,

where Z=j{±1}kQ(j)Z=\sum_{\vec{j}\in\{\raisebox{0.60275pt}{$\scriptstyle\pm$}1\}^{k}}Q(\vec{j}) is the partition function of the ferromagnetic Ising model with inconsistent external fields given by

Q(j)=exp(ηj,Aj+η,j).Q(\vec{j})=\exp\left(\eta\langle\vec{j},A\vec{j}\rangle+\eta\langle\ell,\vec{j}\rangle\right).

Because it is #\#BIS hard to compute the partition function ZZ of a ferromagnetic Ising model with inconsistent external fields (goldberg2007complexity, ), it is #\#BIS hard to compute the value η1logZd-\eta^{-1}\log Z-d of the oracle SMINC(p,η)\textsf{SMIN}_{C}(p,\eta). ∎

Remark 3.8 (The restrictiveness of SMIN extends to approximate computation).

The separation between the oracles shown in Lemma 3.7 further extends to approximate computation of the SMIN oracle under the assumption that #\#BIS is hard to approximate, since under this assumption it is hard to approximate the partition function of a ferromagnetic Ising model with inconsistent external fields (goldberg2007complexity, ).

4 Algorithms to oracles

In this section, we consider three algorithms for MOT. Each is iterative and requires only polynomially many iterations. The key issue for each algorithm is the per-iteration runtime, which is in general exponential (roughly nkn^{k}). We isolate the respective bottlenecks of these three algorithms into the three variants of the dual feasibility oracle defined in §3. See §1.1 and Table 4 for a high-level overview of this section’s results.

4.1 The Ellipsoid algorithm and the MIN oracle

Among the most classical algorithms for implicit LP is the Ellipsoid algorithm (GLSbook, ; GroLovSch81, ; khachiyan1980polynomial, ). However, it has taken a back seat to the SINKHORN algorithm in the vast majority of the MOT literature. The very recent paper (AltBoi20bary, ), which focuses on the specific MOT application of computing low-dimensional Wasserstein barycenters, develops a variant of the classical Ellipsoid algorithm specialized to MOT; henceforth this variant is called ELLIPSOID, and it is described in §4.1.1. The objective of this section is to analyze ELLIPSOID in the context of general MOT problems in order to prove the following.

Theorem 4.1.

For any family of cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k}, the following are equivalent:

  • (i)

    ELLIPSOID takes poly(n,k)\mathrm{poly}(n,k) time to solve the MOTC\textsf{MOT}_{C} problem. (Moreover, it outputs a vertex solution represented as a sparse tensor with at most nkk+1nk-k+1 non-zeros.)

  • (ii)

    There exists a poly(n,k)\mathrm{poly}(n,k) time algorithm that solves the MOTC\textsf{MOT}_{C} problem.

  • (iii)

    There exists a poly(n,k)\mathrm{poly}(n,k) time algorithm that solves the MINC\textsf{MIN}_{C} problem.

Interpretation of results.

In words, the equivalence “(i) \Longleftrightarrow (ii)” establishes that ELLIPSOID can solve any MOT problem in polynomial time that any other algorithm can. Thus from a theoretical perspective, this paper’s restriction to ELLIPSOID is at no loss of generality for developing polynomial-time algorithms that exactly solve MOT. In words, the equivalence “(ii) \Longleftrightarrow (iii)” establishes that the MOT and MIN problems are polynomial-time equivalent. Thus we may investigate when MOT is tractable by instead investigating the more amenable question of when MIN is tractable (see §1.1.4) at no loss of generality.

As stated in the related work section, Theorem 4.1 is implicit from combining several known results (AltBoi20hard, ; AltBoi20bary, ; GLSbook, ). Our contribution here is to make this result explicit, since this allows us to unify algorithms from the implicit LP literature with the SINKHORN algorithm. We also significantly simplify part of the implication “(iii) \Longrightarrow (i)”, which is crucial for making an algorithm that relies on the MIN oracle practical—namely, the Column Generation algorithm discussed below.

Organization of §4.1.

In §4.1.1, we recall the ELLIPSOID algorithm and how it depends on the violation oracle for (MOT-D). In §4.1.2, we give a significantly simpler proof that the violation and feasibility oracles are polynomial-time equivalent in the case of (MOT-D), and use this to prove Theorem 4.1. In §4.1.3, we describe a practical implementation that replaces the ELLIPSOID outer loop with Column Generation.

4.1.1 Algorithm

A key component of the proof of Theorem 4.1 is the ELLIPSOID algorithm introduced in (AltBoi20bary, ) for MOT, which we describe below. In order to present this, we first define a variant of the MIN oracle that returns a minimizing tuple rather than the minimizing value.

Definition 4.2 (Violation oracle for (MOT-D)).

Given weights p=(p1,,pk)n×kp=(p_{1},\dots,p_{k})\in\mathbb{R}^{n\times k}, ARGMINC\textsf{ARGMIN}_{C} returns the minimizing solution j\vec{j} and value of minj[n]kCji=1k[pi]ji\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.

ARGMINC\textsf{ARGMIN}_{C} can be viewed as a violation oracle for the decision set of (MOT-D). (Recall that a violation oracle for a polytope K={x:ai,xbi,i[N]}K=\{x:\langle a_{i},x\rangle\leqslant b_{i},\forall i\in[N]\} is an algorithm that, given a point pp, either asserts pp is in KK, or otherwise outputs the index ii of a violated constraint ai,p>bi\langle a_{i},p\rangle>b_{i}.) This is because, given p=(p1,,pk)n×kp=(p_{1},\dots,p_{k})\in\mathbb{R}^{n\times k}, the tuple j\vec{j} output by ARGMINC(p)\textsf{ARGMIN}_{C}(p) either provides a violated constraint if Cji=1k[pi]ji<0C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}<0, or otherwise certifies pp is feasible. In (AltBoi20bary, ) it is proved that MOT can be solved with polynomially many calls to the ARGMINC\textsf{ARGMIN}_{C} oracle.

Theorem 4.3 (ELLIPSOID guarantee; Proposition 12 of (AltBoi20bary, )).

Algorithm 1 finds an optimal vertex solution for MOTC(μ)\textsf{MOT}_{C}(\mu) using poly(n,k)\mathrm{poly}(n,k) calls to the ARGMINC\textsf{ARGMIN}_{C} oracle and poly(n,k)\mathrm{poly}(n,k) additional time. The solution is returned as a sparse tensor with at most nkk+1nk-k+1 non-zero entries.

Algorithm 1 ELLIPSOID: specialization of the classical Ellipsoid algorithm to MOT

Input: Cost C(n)kC\in(\mathbb{R}^{n})^{\otimes k}, marginals μ1,,μkΔn\mu_{1},\dots,\mu_{k}\in\Delta_{n}
      Output: Vertex solution to MOTC(μ)\textsf{MOT}_{C}(\mu)

1:\\  Solve dual
2:Solve (MOT-D) using the Ellipsoid algorithm, with ARGMINC\textsf{ARGMIN}_{C} as the violation oracle. Let SS denote the set of tuples returned by all calls to ARGMINC\textsf{ARGMIN}_{C}.
3:
4:\\  Solve primal
5:Solve (4.1) using a standard LP solver.
Sketch of algorithm.

Full details and a proof are in (AltBoi20bary, ). We give a brief overview here for completeness. First, recall from the implicit LP literature that the classical Ellipsoid algorithm can be implemented in polynomial time for an LP with arbitrarily many constraints so long as it has polynomially many variables and the violation oracle for its decision set is solvable in polynomial time (GLSbook, ). This does not directly apply to the LP (MOT) because that LP has nkn^{k} variables. However, it can apply to the dual LP (MOT-D) because that LP only has nknk variables.

This suggests a natural two-step algorithm for MOT. First, compute an optimal dual solution by directly applying the Ellipsoid algorithm to (MOT-D). Second, use this dual solution to construct a sparse primal solution. Although this dual-to-primal conversion does not extend to arbitrary LP (BerTsi97, , Exercise 4.17), the paper (AltBoi20bary, ) provides a solution by exploiting the standard-form structure of MOT. The procedure is to solve

minP(μ1,,μk)s.t. Pj=0,jSC,P\displaystyle\min_{\begin{subarray}{c}P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})\\ \text{s.t. }P_{\vec{j}}=0,\;\forall\vec{j}\notin S\end{subarray}}\langle C,P\rangle (4.1)

which is the MOT problem restricted to sparsity pattern SS, where SS is the set of tuples j\vec{j} returned by the violation oracle during the execution of step one of Algorithm 1. This second step takes poly(n,k)\mathrm{poly}(n,k) time using a standard LP solver, because running the Ellipsoid algorithm in the first step only calls the violation oracle poly(n,k)\mathrm{poly}(n,k) times, and thus SS has poly(n,k)\mathrm{poly}(n,k) size, and therefore the LP (4.1) has poly(n,k)\mathrm{poly}(n,k) variables and constraints. In (AltBoi20bary, ) it is proved that this produces a primal vertex solution to the original MOT problem.
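For concreteness, here is a sketch (ours) of this second step: the restricted LP (4.1) has one variable per tuple in SS and nknk marginal equality constraints, so it can be handed to any off-the-shelf LP solver; below we use SciPy's linprog.

import numpy as np
from scipy.optimize import linprog

def solve_restricted_primal(C, mus, S):
    # Solve (4.1): MOT restricted to the sparsity pattern S. mus is a (k x n)
    # array of marginals; S is a list of index tuples returned by ARGMIN_C.
    k, n = mus.shape
    c = np.array([C[j] for j in S])
    A_eq = np.zeros((k * n, len(S)))
    for col, j in enumerate(S):
        for i in range(k):
            A_eq[i * n + j[i], col] = 1.0   # variable P_j contributes to [m_i(P)]_{j_i}
    res = linprog(c, A_eq=A_eq, b_eq=mus.reshape(-1),
                  bounds=(0, None), method="highs")
    return res   # res.x holds the mass placed on each tuple in S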

4.1.2 Equivalence of bottleneck to MIN

Although Theorem 4.3 shows that ELLIPSOID can solve MOT in poly(n,k)\mathrm{poly}(n,k) time using the ARGMIN oracle, this is not sufficient to prove the implication “(iii)\implies(i)” in Theorem 4.1. Proving that implication requires showing the polynomial-time equivalence between MIN and ARGMIN.

Lemma 4.4 (Equivalence of MIN and ARGMIN).

Each of the oracles MINC\textsf{MIN}_{C} and ARGMINC\textsf{ARGMIN}_{C} can be implemented using poly(n,k)\mathrm{poly}(n,k) calls of the other oracle and poly(n,k)\mathrm{poly}(n,k) additional time.

This equivalence follows from classical results about the equivalence of violation and feasibility oracles (yudin1976informational, ). However, the known proof of that general result requires an involved and indirect argument based on “back-and-forth” applications of the Ellipsoid algorithm (GLSbook, , §4.3). Here we exploit the special structure of MOT to give a direct and elementary proof. This is essential to practical implementations (see §4.1.3).

Proof.

It is obvious how the MINC\textsf{MIN}_{C} oracle can be implemented via a single call of the ARGMINC\textsf{ARGMIN}_{C} oracle; we now show the converse. Specifically, given p1,,pknp_{1},\dots,p_{k}\in\mathbb{R}^{n}, we show how to compute a solution j=(j1,,jk)[n]k\vec{j}=(j_{1},\dots,j_{k})\in[n]^{k} for ARGMINC([p1,,pk])\textsf{ARGMIN}_{C}([p_{1},\dots,p_{k}]) using nknk calls to the MINC\textsf{MIN}_{C} oracle and polynomial additional time. We use the first nn calls to compute the first index j1j_{1} of the solution, the next nn calls to compute the next index j2j_{2}, and so on.

Formally, for s[k]s\in[k], let us say that (j1,,js)[n]s(j_{1}^{*},\dots,j_{s}^{*})\in[n]^{s} is a “partial solution” of size ss if there exists a solution j[n]kj\in[n]^{k} for ARGMINC([p1,,pk])\textsf{ARGMIN}_{C}([p_{1},\dots,p_{k}]) that satisfies ji=jij_{i}=j_{i}^{*} for all i[s]i\in[s]. Then it suffices to show that for every s[k]s\in[k], it is possible to compute a partial solution (j1,,js)(j_{1}^{*},\dots,j_{s}^{*}) of size ss from a partial solution (j1,,js1)(j_{1}^{*},\dots,j_{s-1}^{*}) of size s1s-1 using nn calls to the MINC\textsf{MIN}_{C} oracle and polynomial additional time.

The simple but key observation enabling this is the following. Below, for i[k]i\in[k] and j[n]j\in[n], define qi,jq_{i,j} to be the vector in n\mathbb{R}^{n} with value [pi]j[p_{i}]_{j} on entry jj, and value M-M on all other entries. In words, the following observation states that if the constant MM is sufficiently large, then for any indices jij_{i}^{\prime}, replacing the vectors pip_{i} with the vectors qi,jiq_{i,j_{i}^{\prime}} in a MIN oracle query effectively performs a MIN oracle query on the original input p1,,pkp_{1},\dots,p_{k} except that now the minimization is only over j[n]k\vec{j}\in[n]^{k} satisfying ji=jij_{i}=j_{i}^{\prime}.

Observation 4.5.

Set M:=2Cmax+2i=1kpimax+1M:=2C_{\max}+2\sum_{i=1}^{k}\|p_{i}\|_{\max}+1. Then for any s[k]s\in[k] and any (j1,,js)[n]s(j_{1}^{\prime},\dots,j_{s}^{\prime})\in[n]^{s},

MINC([q1,j1,,qs,js,ps+1,,pk])=minj[n]ks.t. j1=j1,,js=jsCji=1k[pi]ji.\displaystyle\textsf{MIN}_{C}([q_{1,j_{1}^{\prime}},\dots,q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}])=\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{\prime},\dots,j_{s}=j_{s}^{\prime}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.
Proof.

By definition of the MIN oracle,

MINC([q1,j1,,qs,js,ps+1,,pk])\displaystyle\textsf{MIN}_{C}([q_{1,j_{1}^{\prime}},\dots,q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}]) =minj[n]kCji=1s[qi,ji]jii=s+1k[pi]ji\displaystyle=\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{s}[q_{i,j_{i}^{\prime}}]_{j_{i}}-\sum_{i=s+1}^{k}[p_{i}]_{j_{i}}

It suffices to prove that every minimizing tuple j[n]k\vec{j}\in[n]^{k} for the right hand side satisfies ji=jij_{i}=j_{i}^{\prime} for all i[s]i\in[s]. Suppose not for sake of contradiction. Then there exists a minimizing tuple j[n]k\vec{j}\in[n]^{k} for which jjj_{\ell}\neq j_{\ell}^{\prime} for some [s]\ell\in[s]. But then [q,j]j=M[q_{\ell,j_{\ell}^{\prime}}]_{j_{\ell}}=-M, so the objective value of j\vec{j} is at least

Cji=1s[qi,ji]jii=s+1k[pi]jiMCmaxi=1kpimax=Cmax+i=1kpimax+1.C_{\vec{j}}-\sum_{i=1}^{s}[q_{i,j_{i}^{\prime}}]_{j_{i}}-\sum_{i=s+1}^{k}[p_{i}]_{j_{i}}\geqslant M-C_{\max}-\sum_{i=1}^{k}\|p_{i}\|_{\max}=C_{\max}+\sum_{i=1}^{k}\|p_{i}\|_{\max}+1.

But this is strictly larger (by at least 11) than the value of any tuple with prefix (j1,,js)(j_{1}^{\prime},\dots,j_{s}^{\prime}), contradicting the optimality of j\vec{j}. ∎

Thus, given a partial solution (j1,,js1)(j_{1}^{*},\dots,j_{s-1}^{*}) of length s1s-1, we construct a partial solution (j1,,js)(j_{1}^{*},\dots,j_{s}^{*}) of length ss by setting jsj_{s}^{*} to be a minimizer of

minjs[n]MINC([q1,j1,,qs1,js1,qs,js,ps+1,,pk]).\displaystyle\min_{j_{s}^{\prime}\in[n]}\textsf{MIN}_{C}([q_{1,j_{1}^{*}},\dots,q_{s-1,j_{s-1}^{*}},q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}]). (4.2)

The runtime bound is clear; it remains to show correctness. To this end, note that

minj[n]ks.t. j1=j1,,js=jsCji=1k[pi]ji\displaystyle\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots,j_{s}=j_{s}^{*}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} =MINC([q1,j1,,qs,js,ps+1,,pk])\displaystyle=\textsf{MIN}_{C}([q_{1,j_{1}^{*}},\dots,q_{s,j_{s}^{*}},p_{s+1},\dots,p_{k}])
=minjs[n]MINC([q1,j1,,qs1,js1,qs,js,ps+1,,pk])\displaystyle=\min_{j_{s}^{\prime}\in[n]}\textsf{MIN}_{C}([q_{1,j_{1}^{*}},\dots,q_{s-1,j_{s-1}^{*}},q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}])
=minjs[n]minj[n]ks.t. j1=j1,,js1=js1,js=jsCji=1k[pi]ji\displaystyle=\min_{j_{s}^{\prime}\in[n]}\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots,j_{s-1}=j_{s-1}^{*},j_{s}=j_{s}^{\prime}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}
=minj[n]ks.t. j1=j1,,js1=js1Cji=1k[pi]ji\displaystyle=\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots,j_{s-1}=j_{s-1}^{*}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}
=MINC([p1,,pk]),\displaystyle=\textsf{MIN}_{C}([p_{1},\dots,p_{k}]),

where above the first and third steps are by Observation 4.5, the second step is by construction of jsj_{s}^{*}, the fourth step is by simplifying, and the final step is by definition of (j1,,js1)(j_{1}^{*},\dots,j_{s-1}^{*}) being a partial solution of size s1s-1. We conclude that (j1,,js)(j_{1}^{*},\dots,j_{s}^{*}) is a partial solution of size ss, as desired. ∎
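The reduction in this proof is direct enough to implement. A minimal sketch (ours): given any black-box implementation of MINC\textsf{MIN}_{C} that accepts a (k×n) matrix of weights, it recovers a minimizing tuple with nknk calls, following the prefix-fixing scheme of Observation 4.5.

import numpy as np

def argmin_from_min(min_oracle, C_max, p):
    # ARGMIN_C from n*k calls to MIN_C. p has shape (k, n); min_oracle maps
    # such a weight matrix to the value MIN_C of that matrix.
    k, n = p.shape
    M = 2 * C_max + 2 * np.abs(p).max(axis=1).sum() + 1  # as in Observation 4.5
    q = p.copy()
    prefix = []
    for s in range(k):
        best_j, best_val = None, np.inf
        for j in range(n):
            q[s] = np.full(n, -M)
            q[s, j] = p[s, j]        # the vector q_{s,j}
            val = min_oracle(q)      # min restricted to tuples with this prefix
            if val < best_val:
                best_j, best_val = j, val
        q[s] = np.full(n, -M)
        q[s, best_j] = p[s, best_j]  # freeze coordinate s at j_s*
        prefix.append(best_j)
    return tuple(prefix), best_val   # best_val equals MIN_C(p)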

We can now conclude the proof of the main result of §4.1.

Proof of Theorem 4.1.

The implication “(i) \implies (ii)” is trivial, and the implication “(ii) \implies (iii)” is shown in (AltBoi20hard, ). It therefore suffices to show the implication “(iii) \implies (i)”. This follows from combining the fact that ELLIPSOID solves MOTC\textsf{MOT}_{C} in polynomial time given an efficient implementation of ARGMINC\textsf{ARGMIN}_{C} (Theorem 4.3), with the fact that the MINC\textsf{MIN}_{C} and ARGMINC\textsf{ARGMIN}_{C} oracles are polynomial-time equivalent (Lemma 4.4). ∎

4.1.3 Practical implementation via Column Generation

Although ELLIPSOID enjoys powerful theoretical runtime guarantees, it is slow in practice because the classical Ellipsoid algorithm is. Nevertheless, whenever ELLIPSOID is applicable (i.e., whenever the MINC\textsf{MIN}_{C} oracle can be efficiently implemented), we can use an alternative practical algorithm, namely the delayed Column Generation method COLGEN, to compute exact, sparse solutions to MOT.

For completeness, we briefly recall the idea behind COLGEN; for further details see the standard textbook (BerTsi97, , §6.1). COLGEN runs the Simplex method, keeping only basic variables in the tableau. Each time that COLGEN needs to find a Simplex variable on which to pivot, it solves the “pricing problem” of finding a variable with negative reduced cost. This is the key subroutine in COLGEN. In the present context of the MOT LP, this pricing problem is equivalent to a call to the ARGMIN violation oracle (see (BerTsi97, , Definition 3.2) for the definition of reduced costs). By the polynomial-time equivalence of the ARGMIN and MIN oracles shown in Lemma 4.4, this bottleneck subroutine in COLGEN can be computed in polynomial time whenever the MIN oracle can. For easy recall, we summarize this discussion as follows.

Theorem 4.6 (Standard guarantee for COLGEN; Section 6.1 of (BerTsi97, )).

For any T>0T>0, one can implement TT iterations of COLGEN in poly(n,k,T)\mathrm{poly}(n,k,T) time and calls to the MINC\textsf{MIN}_{C} oracle. When COLGEN terminates, it returns an optimal vertex solution, which is given as a sparse tensor with at most nkk+1nk-k+1 non-zero entries.

Note that COLGEN does not have a theoretical guarantee stating that it terminates after a polynomial number of iterations. But it often performs well in practice and terminates after a small number of iterations, leading to much better empirical performance than ELLIPSOID.
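For concreteness, a sketch (ours) of the COLGEN loop just described, using SciPy's linprog for the restricted LP. We seed the active set with a greedy feasible coupling (the same scheme as Step 2 of Algorithm 2), and we read the dual weights off SciPy's eqlin.marginals field; the sign convention of that field should be verified against the SciPy version in use.

import numpy as np
from scipy.optimize import linprog

def greedy_feasible_support(mus):
    # A feasible initial coupling: repeatedly saturate the largest residual
    # index of each marginal (at most n*k columns).
    r, S = mus.astype(float).copy(), []
    while r.max() > 1e-12:
        j = tuple(int(np.argmax(r[i])) for i in range(len(r)))
        alpha = min(r[i][j[i]] for i in range(len(r)))
        S.append(j)
        for i in range(len(r)):
            r[i][j[i]] -= alpha
    return S

def colgen(C, mus, argmin_oracle, max_iters=100, tol=1e-9):
    # Delayed column generation for MOT: solve the LP restricted to the active
    # columns S, then price a new column with the ARGMIN_C violation oracle.
    k, n = mus.shape
    S = greedy_feasible_support(mus)
    for _ in range(max_iters):
        c = np.array([C[j] for j in S])
        A_eq = np.zeros((k * n, len(S)))
        for col, j in enumerate(S):
            for i in range(k):
                A_eq[i * n + j[i], col] = 1.0
        res = linprog(c, A_eq=A_eq, b_eq=mus.reshape(-1),
                      bounds=(0, None), method="highs")
        p = res.eqlin.marginals.reshape(k, n)  # dual weights (check sign convention)
        j_new, val = argmin_oracle(p)          # pricing problem = ARGMIN_C(p)
        if val >= -tol:
            return S, res                      # no negative reduced cost: optimal
        S.append(j_new)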

4.2 The Multiplicative Weights Update algorithm and the AMIN oracle

The second classical algorithm for solving implicitly-structured LPs that we study in the context of MOT is the Multiplicative Weights Update algorithm MWU (young2001sequential, ). The objective of this section is to prove the following guarantees for its specialization to MOT.

Theorem 4.7.

For any family of cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k}, the following are equivalent:

  • (i)

    For any ε>0\varepsilon>0, MWU takes poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time to solve the MOTC\textsf{MOT}_{C} problem ε\varepsilon-approximately. (Moreover, it outputs a sparse solution with at most poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) non-zero entries.)

  • (ii)

    There exists a poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon)-time algorithm that solves the MOTC\textsf{MOT}_{C} problem ε\varepsilon-approximately for any ε>0\varepsilon>0.

  • (iii)

    There exists a poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon)-time algorithm that solves the AMINC\textsf{AMIN}_{C} problem ε\varepsilon-approximately for any ε>0\varepsilon>0.

Interpretation of results.

Similarly to the analogous Theorem 4.1 for ELLIPSOID, the equivalence “(i) \Longleftrightarrow (ii)” establishes that MWU can approximately solve any MOT problem in polynomial time that any other algorithm can. Thus, from a theoretical perspective, restricting to MWU for approximately solving MOT problems is at no loss of generality. In words, the equivalence “(ii) \Longleftrightarrow (iii)” establishes that approximating MOT and approximating MIN are polynomial-time equivalent. Thus we may investigate when MOT is tractable to approximate by instead investigating the more amenable question of when MIN is tractable (see §1.1.4) at no loss of generality.

Theorem 4.7 is new to this work. In particular, equivalences between problems with polynomially small error do not fall under the purview of classical LP theory, which deals with exponentially small error (GLSbook, ). Our use of the MWU algorithm exploits a simple reduction of MOT to a mixed packing-covering LP that has appeared in the k=2k=2 matrix case of Optimal Transport in (BlaJamKenSid18, ; Qua18, ), where implicit LP is not necessary for polynomial runtime.

Organization of §4.2.

In §4.2.1 we present the specialization of Multiplicative Weights Update to MOT, and recall how it runs in polynomial time and calls to a certain bottleneck oracle. In §4.2.2, we show that this bottleneck oracle is equivalent to the AMIN oracle, and then use this to prove Theorem 4.7.

4.2.1 Algorithm

Here we present the MWU algorithm, which combines the generic Multiplicative Weights Update algorithm of (young2001sequential, ) specialized to MOT, along with a final rounding step that ensures feasibility of the solution.

In order to present MWU, it is convenient to assume that the cost CC has entries in the range [1,2][1,2]\subset\mathbb{R}, which is at no loss of generality by simply translating and rescaling the cost (see §4.2.2), and can be done implicitly given a bound on CmaxC_{\max}. This is why in the rest of this subsection, every runtime dependence on ε\varepsilon is polynomial in 1/ε1/\varepsilon for costs in the range [1,2][1,2]; after transformation back to [Cmax,Cmax][-C_{\max},C_{\max}], this is polynomial dependence in the standard scale-invariant quantity Cmax/εC_{\max}/\varepsilon.

Since the cost CC is assumed to have non-negative entries, for any λ[1,2]\lambda\in[1,2], the polytope

K(λ)={P(μ1,,μk):C,Pλ}\displaystyle K(\lambda)=\{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}):\langle C,P\rangle\leqslant\lambda\}

of couplings with cost at most λ\lambda is a mixed packing-covering polytope (i.e., all variables are non-negative and all constraints have non-negative coefficients). Note that K(λ)K(\lambda) is non-empty if and only if MOTC(μ)\textsf{MOT}_{C}(\mu) has value at most λ\lambda. Thus, modulo a binary search on λ\lambda, this reduces computing the value of MOTC(μ)\textsf{MOT}_{C}(\mu) to the task of detecting whether K(λ)K(\lambda) is empty. Since K(λ)K(\lambda) is a mixed packing-covering polytope, the Multiplicative Weights Update algorithm of (young2001sequential, ) determines whether K(λ)K(\lambda) is empty, and runs in polynomial time apart from one bottleneck, which we now define.

In order to define the bottleneck, we first define a potential function. For this, we define the softmax analogously to the softmin as

smax(a1,,at)=smin(a1,,at)=log(i=1teai).\operatorname*{smax}(a_{1},\ldots,a_{t})=-\operatorname*{smin}(-a_{1},\ldots,-a_{t})=\log\left(\sum_{i=1}^{t}e^{a_{i}}\right).

Here we use regularization parameter η=1\eta=1 for simplicity, since this suffices for analyzing MWU, and thus we have dropped this index η\eta for shorthand.

Definition 4.8 (Potential function for MWU).

Fix a cost C(n)kC\in(\mathbb{R}^{n})^{\otimes k}, target marginals μ(Δn)k\mu\in(\Delta_{n})^{k}, and target value λ\lambda\in\mathbb{R}. Define the potential function Φ:=ΦC,μ,λ:(0n)k\Phi:=\Phi_{C,\mu,\lambda}:(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}\to\mathbb{R} by

Φ(P)=smax(C,Pλ,m1(P)μ1,,mk(P)μk).\Phi(P)=\operatorname*{smax}\left(\frac{\langle C,P\rangle}{\lambda},\frac{m_{1}(P)}{\mu_{1}},\ldots,\frac{m_{k}(P)}{\mu_{k}}\right).

The softmax in the above expression is interpreted as a softmax over the nk+1nk+1 values in the concatenation of vectors and scalars in its input. (This slight abuse of notation significantly reduces notational overhead.)

Given this potential function, we now define the bottleneck operation for MWU: find a direction j[n]k\vec{j}\in[n]^{k} in which PP can be increased such that the potential is increased as little as possible.

Definition 4.9 (Bottleneck oracle for MWU).

Given iterate P(0n)kP\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}, target marginals μ(Δn)k\mu\in(\Delta_{n})^{k}, target value λ\lambda\in\mathbb{R}, and accuracy ε>0\varepsilon>0, MWU_BOTTLENECKC(P,μ,λ,ε)\textsf{MWU\_BOTTLENECK}_{C}(P,\mu,\lambda,\varepsilon) either:

  • Outputs “null”, certifying that minj[n]khΦ(P+hδj)h=0>1\min_{\vec{j}\in[n]^{k}}\frac{\partial}{\partial h}\Phi(P+h\cdot\delta_{\vec{j}})\mid_{h=0}>1, or

  • Outputs j[n]k\vec{j}\in[n]^{k} such that hΦ(P+hδj)h=01+ε\frac{\partial}{\partial h}\Phi(P+h\cdot\delta_{\vec{j}})\mid_{h=0}\leqslant 1+\varepsilon.

(If minj[n]khΦ(P+hδj)h=0\min_{\vec{j}\in[n]^{k}}\frac{\partial}{\partial h}\Phi(P+h\cdot\delta_{\vec{j}})\mid_{h=0} is within (1,1+ε](1,1+\varepsilon], then either return behavior is a valid output.)
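For concreteness, here is a brute-force sketch (ours) of the potential Φ\Phi and of the derivative that the bottleneck oracle thresholds; it evaluates the softmax gradient directly over the nk+1nk+1 values, and can be used to test implementations of MWU_BOTTLENECK at small nn and kk.

import numpy as np

def _values_and_grads(P, C, mus, lam, j):
    # The nk+1 softmax arguments of Phi at P, and their derivatives in the
    # direction delta_j.
    k = P.ndim
    vals, grads = [np.vdot(C, P) / lam], [C[j] / lam]
    for i in range(k):
        m_i = P.sum(axis=tuple(ax for ax in range(k) if ax != i))
        vals.extend(m_i / mus[i])
        g = np.zeros_like(mus[i])
        g[j[i]] = 1.0 / mus[i][j[i]]
        grads.extend(g)
    return np.array(vals), np.array(grads)

def potential(P, C, mus, lam):
    # Phi(P) = smax of the nk+1 values, via a stable log-sum-exp.
    vals, _ = _values_and_grads(P, C, mus, lam, (0,) * P.ndim)
    vmax = vals.max()
    return vmax + np.log(np.exp(vals - vmax).sum())

def bottleneck_derivative(P, C, mus, lam, j):
    # d/dh Phi(P + h * delta_j) at h = 0: the softmax-weighted average of the
    # directional derivatives of the softmax arguments.
    vals, grads = _values_and_grads(P, C, mus, lam, j)
    w = np.exp(vals - vals.max())
    return float((w / w.sum()) @ grads)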

Pseudocode for the MWU algorithm is given in Algorithm 2. We prove that MWU runs in polynomial time given access to this bottleneck oracle.

Theorem 4.10.

Let the entries of the cost CC lie in the range [1,2][1,2]. Given λ\lambda\in\mathbb{R} and accuracy parameter ε>0\varepsilon>0, MWU either certifies that MOTC(μ)λ\textsf{MOT}_{C}(\mu)\leqslant\lambda, or returns a poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon)-sparse P(μ1,,μk)P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}) satisfying C,Pλ+8ε\langle C,P\rangle\leqslant\lambda+8\varepsilon.

Furthermore, the loop in Step 1 runs in O~(nk/ε2)\tilde{O}(nk/\varepsilon^{2}) iterations, and Step 2 runs in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time.

The MWU algorithm can be used to output an O(ε)O(\varepsilon)-approximate solution for MOT via an outer loop that performs binary search over λ\lambda; this only incurs O(log(1/ε))O(\log(1/\varepsilon)) multiplicative overhead in runtime.

Algorithm 2 MWU: specialization of Multiplicative Weights Update (young2001sequential, ) to MOT
1:Accuracy ε>0\varepsilon>0, marginals μ1,,μkΔn\mu_{1},\dots,\mu_{k}\in\Delta_{n}, target value λ>0\lambda>0
2:Either certifies MOTC(μ)>λ\textsf{MOT}_{C}(\mu)>\lambda by returning “infeasible”, or returns a solution PP with C,Pλ+8ε\langle C,P\rangle\leqslant\lambda+8\varepsilon
3:\\  Assume cost CC satisfies Cj[1,2]C_{\vec{j}}\in[1,2] for all j[n]k\vec{j}\in[n]^{k} (without loss of generality by rescaling)
4:\\  Step 1: Multiplicative Weights Update
5:P0(0n)kP\leftarrow 0\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}, η2(log(nk+1))/ε\eta\leftarrow 2(\log(nk+1))/\varepsilon
6:while m(P)<ηm(P)<\eta do \triangleright While total mass is small
7:     jMWU_BOTTLENECKC(P,μ,λ,ε)\vec{j}\leftarrow\textsf{MWU\_BOTTLENECK}_{C}(P,\mu,\lambda,\varepsilon) \triangleright Bottleneck: find direction with small potential increase
8:     if j=\vec{j}=“null” then return “infeasible” \triangleright Infeasible if no good direction
9:     else PP+δjεmin(λ/Cj,mini[μi]ji)P\leftarrow P+\delta_{\vec{j}}\cdot\varepsilon\cdot\min(\lambda/C_{\vec{j}},\min_{i}[\mu_{i}]_{j_{i}}) \triangleright Else, increase in good direction      
10:PP/(η(1+ε)4)P\leftarrow P/(\eta(1+\varepsilon)^{4}) \triangleright Rescale
11:
12:\\  Step 2: round to transportation polytope
13:while m(P)<1m(P)<1 do \triangleright Until all constraints are satisfied
14:     jiargmaxj[n]([μi]j[mi(P)]j)j_{i}\leftarrow\operatorname*{argmax}_{j\in[n]}([\mu_{i}]_{j}-[m_{i}(P)]_{j}) for each i[k]i\in[k] \triangleright For each marginal, find unsatisfied constraint
15:     αmini[k]([μi]ji[mi(P)]ji)\alpha\leftarrow\min_{i\in[k]}([\mu_{i}]_{j_{i}}-[m_{i}(P)]_{j_{i}}) \triangleright Maximal mass to add
16:     PP+αδjP\leftarrow P+\alpha\cdot\delta_{\vec{j}} \triangleright Add mass to saturate at least one constraint
Proof.

We analyze Step 1 (Multiplicative Weights Update) and Step 2 (rounding) of MWU separately.

Lemma 4.11 (Correctness of Step 1).

Step 1 of Algorithm 2 runs in O~(nk/ε2)\tilde{O}(nk/\varepsilon^{2}) iterations. It either returns (i) “infeasible”, certifying that K(λ)K(\lambda) is empty; or (ii) finds a poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon)-sparse tensor P(0n)kP\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k} that is approximately in K(λ)K(\lambda), i.e., PP satisfies:

m(P)14ε,C,Pλ, and mi(P)μi for all i[k]m(P)\geqslant 1-4\varepsilon,\quad\langle C,P\rangle\leqslant\lambda,\quad\mbox{ and }\quad m_{i}(P)\leqslant\mu_{i}\mbox{ for all }i\in[k]

Step 1 is the Multiplicative Weights Update algorithm of (young2001sequential, ) applied to the polytope K(λ)K(\lambda), so correctness follows from the analysis of (young2001sequential, ). We briefly recall the main idea behind this algorithm for the convenience of the reader, and provide a full proof in Appendix A.1 for completeness. The main idea behind the algorithm is that on each iteration, j[n]k\vec{j}\in[n]^{k} is chosen so that the increase in the potential Φ(P)\Phi(P) is approximately bounded by the increase in the total mass m(P)m(P). If this is impossible, then the bottleneck oracle returns null, which means K(λ)K(\lambda) is empty. So assume otherwise. Then once the total mass has increased to m(P)=η+O(ε)m(P)=\eta+O(\varepsilon), the potential Φ(P)\Phi(P) must be bounded by η(1+O(ε))\eta(1+O(\varepsilon)). By exploiting the inequality between the max and the softmax, this means that max(C,P/λ,maxi[n],j[k][mi(P)]j/[μi]j)Φ(P)η(1+O(ε))\max(\langle C,P\rangle/\lambda,\max_{i\in[n],j\in[k]}[m_{i}(P)]_{j}/[\mu_{i}]_{j})\leqslant\Phi(P)\leqslant\eta(1+O(\varepsilon)) as well. Thus, rescaling PP by 1/(η(1+O(ε)))1/(\eta(1+O(\varepsilon))) in Line 10 satisfies m(P)1O(ε)m(P)\geqslant 1-O(\varepsilon), C,P/λ1\langle C,P\rangle/\lambda\leqslant 1, and mi(P)/μi1m_{i}(P)/\mu_{i}\leqslant 1. See Appendix A.1 for full details and a proof of the runtime and sparsity claims.

Lemma 4.12 (Correctness of Step 2).

Step 2 of Algorithm 2 runs in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time and returns P(μ1,,μk)P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}) satisfying C,Pλ+8ε\langle C,P\rangle\leqslant\lambda+8\varepsilon. Furthermore, PP only has poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) non-zero entries.

Proof of Lemma 4.12.

By Lemma 4.11, PP satisfies mi(P)μim_{i}(P)\leqslant\mu_{i} for all i[k]i\in[k]. Observe that this is an invariant that holds throughout the execution of Step 22. This, along with the fact that j=1n[mi(P)]j=m(P)\sum_{j=1}^{n}[m_{i}(P)]_{j}=m(P) is equal for all ii, implies that the indices (j1,,jk)(j_{1},\dots,j_{k}) found in Line 14 satisfy [μi]ji[mi(P)]ji>0[\mu_{i}]_{j_{i}}-[m_{i}(P)]_{j_{i}}>0 for each i[k]i\in[k]. Thus in particular α>0\alpha>0 in Line 15. It follows that Line 16 makes at least one more constraint satisfied (in particular the constraint “[μi]ji=[mi(P)]ji[\mu_{i}]_{j_{i}}=[m_{i}(P)]_{j_{i}}” where ii is the minimizer in Line 15). Since there are nknk constraints total to be satisfied, Step 2 terminates in at most nknk iterations. Each iteration increases the number of non-zero entries in PP by at most one, thus PP is poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) sparse throughout. That PP is sparse also implies that each iteration can be performed in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time, thus Step 2 takes poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time overall.

Finally, we establish the quality guarantee on C,P\langle C,P\rangle. By Lemma 4.11, this is at most λ\lambda before starting Step 2. During Step 2, the total mass added to PP is equal to 1m(P)1-m(P). This is upper bounded by 4ε4\varepsilon by Lemma 4.11. Since Cmax2C_{\max}\leqslant 2, we conclude that the value of C,P\langle C,P\rangle is increased by at most 8ε8\varepsilon in Step 2. ∎

Combining Lemmas 4.11 and 4.12 concludes the proof of Theorem 4.10. ∎

4.2.2 Equivalence of bottleneck to AMIN

In order to prove Theorem 4.7, we show that the MWU algorithm can be implemented in polynomial time and calls to the AMIN oracle. First, we prove this fact for the ARGAMIN oracle, which differs from the AMIN oracle in that it also returns a tuple j[n]k\vec{j}\in[n]^{k} that is an approximate minimizer.

Definition 4.13 (Approximate violation oracle for (MOT-D)).

Given weights p=(p1,,pk)n×kp=(p_{1},\ldots,p_{k})\in\mathbb{R}^{n\times k} and accuracy ε>0\varepsilon>0, ARGAMINC\textsf{ARGAMIN}_{C} returns j[n]k\vec{j}\in[n]^{k} that minimizes minj[n]kCji=1k[pi]ji\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} up to additive error ε\varepsilon, and its value up to additive error ε\varepsilon.

Lemma 4.14.

Let the entries of the cost CC lie in the range [1,2][1,2]. The MWU algorithm (Algorithm 2) can be implemented in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time and calls to the ARGAMINC\textsf{ARGAMIN}_{C} oracle with accuracy parameter ε=Θ(ε2/(nk))\varepsilon^{\prime}=\Theta(\varepsilon^{2}/(nk)).

Proof.

We show that on each iteration of Step 1 of Algorithm 2 we can emulate the call to the MWU_BOTTLENECK oracle with one call to the ARGAMIN oracle. Recall that MWU_BOTTLENECKC(P,μ,λ,ε)\textsf{MWU\_BOTTLENECK}_{C}(P,\mu,\lambda,\varepsilon) seeks to find j[n]k\vec{j}\in[n]^{k} such that

Vj:=hΦ(P+hδj)|h=0V_{\vec{j}}:=\frac{\partial}{\partial h}\Phi(P+h\delta_{\vec{j}})\Big{|}_{h=0}

is at most 1+ε1+\varepsilon, or to certify that for all j\vec{j} it is greater than 11. By explicit computation,

Vj\displaystyle V_{\vec{j}} =hlog(exp((C,P+hCj)/λ+s=1kt=1nexp(([ms(P)]t+δt,js)/[μs]t)))|h=0\displaystyle=\frac{\partial}{\partial h}\log\left(\exp\left(\left(\langle C,P\rangle+hC_{\vec{j}}\right)/\lambda+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp\left(\left([m_{s}(P)]_{t}+\delta_{t,j_{s}}\right)/[\mu_{s}]_{t}\right)\right)\right)\Bigg{|}_{h=0}
=(Cji=1k[pi]ji)exp(C,P/λ)/λexp(C,P/λ)+s=1kt=1nexp([ms(P)]t/[μs]t),\displaystyle=\left(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right)\frac{\exp(\langle C,P\rangle/\lambda)/\lambda}{\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}, (4.3)

where the weights p=(p1,,pk)k×np=(p_{1},\ldots,p_{k})\in\mathbb{R}^{k\times n} in the last line are defined as

[pi]j=λexp(C,P/λ)exp([mi(P)]j/[μi]j)[μi]j,i[k],j[n].[p_{i}]_{j}=-\frac{\lambda}{\exp(\langle C,P\rangle/\lambda)}\cdot\frac{\exp([m_{i}(P)]_{j}/[\mu_{i}]_{j})}{[\mu_{i}]_{j}},\qquad\forall i\in[k],j\in[n].

Note that the second term in the product in (4.3) is positive and does not depend on j\vec{j}. This suggests that in order to minimize (4.3), it suffices to compute jARGAMINC(p,ε)\vec{j}\leftarrow\textsf{ARGAMIN}_{C}(p,\varepsilon^{\prime}) for some accuracy parameter ε>0\varepsilon^{\prime}>0.

The main technical difficulty with formalizing this intuitive approach is that the weights pp are not necessarily efficiently computable. Nevertheless, using poly(n,k)\mathrm{poly}(n,k) extra time on each iteration, we can compute the marginals m1(P),,mk(P)m_{1}(P),\ldots,m_{k}(P). Since the ARGAMIN oracle returns an ε\varepsilon^{\prime}-additive approximation of the cost, we can also compute a running estimate c~\tilde{c} of the cost such that, on iteration TT,

c~TεC,Pc~+Tε.\tilde{c}-T\varepsilon^{\prime}\leqslant\langle C,P\rangle\leqslant\tilde{c}+T\varepsilon^{\prime}.

Therefore, we define weights p~n×k\tilde{p}\in\mathbb{R}^{n\times k}, which approximate pp and which can be computed in poly(n,k)\mathrm{poly}(n,k) time on each iteration:

[p~i]j=λexp(c~/λ)exp([mi(P)]j/[μi]j)[μi]j,i[k],j[n].[\tilde{p}_{i}]_{j}=-\frac{\lambda}{\exp(\tilde{c}/\lambda)}\cdot\frac{\exp([m_{i}(P)]_{j}/[\mu_{i}]_{j})}{[\mu_{i}]_{j}},\qquad\forall i\in[k],j\in[n].

We also define the approximate value for any j[n]k\vec{j}\in[n]^{k}:

V~j:=(Cji=1k[p~i]ji)exp(c~/λ)/λexp(c~/λ)+s=1kt=1nexp([ms(P)]t/[μs]t)\displaystyle\tilde{V}_{\vec{j}}:=\left(C_{\vec{j}}-\sum_{i=1}^{k}[\tilde{p}_{i}]_{j_{i}}\right)\frac{\exp(\tilde{c}/\lambda)/\lambda}{\exp(\tilde{c}/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}

It holds that ARGAMINC(p~,ε)\textsf{ARGAMIN}_{C}(\tilde{p},\varepsilon^{\prime}) returns a j[n]k\vec{j}\in[n]^{k} that minimizes Cji=1k[p~i]jiC_{\vec{j}}-\sum_{i=1}^{k}[\tilde{p}_{i}]_{j_{i}} up to multiplicative error 1/(1ε)1/(1-\varepsilon^{\prime}), because the entries of the cost CC are lower-bounded by 1, and [p~i]j0[\tilde{p}_{i}]_{j}\leqslant 0 for all i[k],j[n]i\in[k],j\in[n]. In particular, ARGAMINC(p~,ε)\textsf{ARGAMIN}_{C}(\tilde{p},\varepsilon^{\prime}) minimizes V~j\tilde{V}_{\vec{j}} up to multiplicative error 1/(1ε)1/(1-\varepsilon^{\prime}). We prove the following claim relating VjV_{\vec{j}} and V~j\tilde{V}_{\vec{j}}:

Claim 4.15.

For any j[n]k\vec{j}\in[n]^{k}, on iteration TT, we have Vj/V~j[exp(2Tε/λ),exp(2Tε/λ)]V_{\vec{j}}/\tilde{V}_{\vec{j}}\in[\exp(-2T\varepsilon^{\prime}/\lambda),\exp(2T\varepsilon^{\prime}/\lambda)].

By the above claim, ARGAMINC(p~,ε)\textsf{ARGAMIN}_{C}(\tilde{p},\varepsilon^{\prime}) therefore minimizes VjV_{\vec{j}} up to multiplicative error exp(4Tε/λ)/(1ε)(1+ε/3)\exp(4T\varepsilon^{\prime}/\lambda)/(1-\varepsilon^{\prime})\leqslant(1+\varepsilon/3) if we choose ε=Θ(λε/T)\varepsilon^{\prime}=\Theta(\lambda\varepsilon/T). Thus one can implement MWU_BOTTLENECKC(P,μ,λ,ε)\textsf{MWU\_BOTTLENECK}_{C}(P,\mu,\lambda,\varepsilon) by returning the value of ARGAMINC(p~,ε)\textsf{ARGAMIN}_{C}(\tilde{p},\varepsilon^{\prime}) if its value is estimated to be at most 1+ε/31+\varepsilon/3, and returning “null” otherwise. The bound on the accuracy ε=Θ~(ε2/(nk))\varepsilon^{\prime}=\tilde{\Theta}(\varepsilon^{2}/(nk)) follows since λ[1,2]\lambda\in[1,2] and T=O~(nk/ε2)T=\tilde{O}(nk/\varepsilon^{2}) by Theorem 4.10.

Proof of Claim.

We compare the expressions for VjV_{\vec{j}} and V~j\tilde{V}_{\vec{j}}. Each of these is a product of two terms. Since Cj0C_{\vec{j}}\geqslant 0, and [p~i]ji,[pi]ji0[\tilde{p}_{i}]_{j_{i}},[p_{i}]_{j_{i}}\leqslant 0 for all ii, the ratio of the first terms is

Cji=1k[p~i]jiCji=1k[pi]ji[mini[p~i]ji/[pi]ji,maxi[p~i]ji/[pi]ji][exp(Tε/λ),exp(Tε/λ)],\displaystyle\frac{C_{\vec{j}}-\sum_{i=1}^{k}[\tilde{p}_{i}]_{j_{i}}}{C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}}\in[\min_{i}[\tilde{p}_{i}]_{j_{i}}/[p_{i}]_{j_{i}},\max_{i}[\tilde{p}_{i}]_{j_{i}}/[p_{i}]_{j_{i}}]\subset[\exp(-T\varepsilon^{\prime}/\lambda),\exp(T\varepsilon^{\prime}/\lambda)],

where we have used that, for all i[k]i\in[k],

[p~i]ji/[pi]ji=exp(C,P/λ)/exp(c~/λ)[exp(Tε/λ),exp(Tε/λ)].[\tilde{p}_{i}]_{j_{i}}/[p_{i}]_{j_{i}}=\exp(\langle C,P\rangle/\lambda)/\exp(\tilde{c}/\lambda)\in[\exp(-T\varepsilon^{\prime}/\lambda),\exp(T\varepsilon^{\prime}/\lambda)].

Similarly the ratio of the second terms in the expression for VjV_{\vec{j}} and V~j\tilde{V}_{\vec{j}} is also in the range [exp(Tε/λ),exp(Tε/λ)][\exp(-T\varepsilon^{\prime}/\lambda),\exp(T\varepsilon^{\prime}/\lambda)]. This concludes the proof of the claim. ∎

Finally, we show that the ARGAMIN oracle can be reduced to the AMIN oracle, which completes the proof that MWU can be run with AMIN.

Lemma 4.16 (Equivalence of AMIN and ARGAMIN).

Each of the oracles AMINC\textsf{AMIN}_{C} and ARGAMINC\textsf{ARGAMIN}_{C} with accuracy parameter ε>0\varepsilon>0 can be implemented using poly(n,k)\mathrm{poly}(n,k) calls of the other oracle with accuracy parameter Θ(ε/k)\Theta(\varepsilon/k) and poly(n,k)\mathrm{poly}(n,k) additional time.

It is worth remarking that the equivalence that we show between AMIN and ARGAMIN is not known to hold for the feasibility and separation oracles of general LPs, since the known result for general LPs requires exponentially small error in nknk (GLSbook, , §4.3). However, in the case of MOT the equivalence follows from a direct and practical reduction, similar to the proof of the equivalence of the exact oracles (Lemma 4.4). The main difference is that some care is needed to bound the propagation of the errors of the approximate oracles. For completeness, we provide the full proof of Lemma 4.16 in Appendix A.

We conclude by proving Theorem 4.7.

Proof of Theorem 4.7.

The implication “(i) \implies (ii)” is trivial, and the implication “(ii) \implies (iii)” is shown in (AltBoi20hard, ). It therefore suffices to show the implication “(iii) \implies (i)”. For costs CC with entries in the range [1,2][1,2], this follows from combining the fact that MWU can be implemented to solve MOTC\textsf{MOT}_{C} in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time given an efficient implementation of ARGAMINC\textsf{ARGAMIN}_{C} with polynomially-sized accuracy parameter ε=poly(1/n,1/k,ε)\varepsilon^{\prime}=\mathrm{poly}(1/n,1/k,\varepsilon) (Lemma 4.14), along with the fact that the AMINC\textsf{AMIN}_{C} and ARGAMINC\textsf{ARGAMIN}_{C} oracles are polynomial-time equivalent with polynomially-sized accuracy parameter (Lemma 4.16).

The assumption that CC has entries within the range [1,2][1,2] can be removed with no loss by translating and rescaling the original cost C(C+3Cmax)/(2Cmax)C^{\prime}\leftarrow(C+3C_{\max})/(2C_{\max}) and running Algorithm 2 on CC^{\prime} with approximation parameter εε/(2Cmax)\varepsilon^{\prime}\leftarrow\varepsilon/(2C_{\max}). Each τ\tau^{\prime}-approximate query to the AMINC\textsf{AMIN}_{C^{\prime}} oracle can be simulated by a τ\tau-approximate query to the AMINC\textsf{AMIN}_{C} oracle, where τ=2Cmaxτ\tau=2C_{\max}\tau^{\prime}. ∎

Remark 4.17 (Practical optimizations).

Our numerical implementation of MWU has two modifications that provide practical speedups. One is maintaining a cached list of the tuples j[n]k\vec{j}\in[n]^{k} previously returned by calls to MWU_BOTTLENECK. Whenever MWU_BOTTLENECK is called, we first check whether any tuple j\vec{j} in the cache satisfies the desiderata hΦ(P+hδj)h=01+ε\frac{\partial}{\partial h}\Phi(P+h\cdot\delta_{\vec{j}})\mid_{h=0}\leqslant 1+\varepsilon, in which case we use this j\vec{j} to answer the oracle query. Otherwise, we answer the oracle query using AMIN as explained above. In practice, this cache allows us to avoid many calls to the potentially expensive AMIN bottleneck. Our second optimization is that, at each iteration of MWU, we check whether the current iterate PP can be rescaled in order to satisfy the guarantees in Lemma 4.11 required from Step 1. If so, we stop Step 1 early and use this rescaled PP.

4.3 The Sinkhorn algorithm and the SMIN oracle

The Sinkhorn algorithm (SINKHORN) is specially tailored to MOT, and does not apply to general exponential-size LP. Currently it is by far the most popular algorithm in the MOT literature (see §1.3). However, in general each iteration of SINKHORN takes exponential time nΘ(k)n^{\Theta(k)}, and it is not well-understood when it can be implemented in polynomial-time. The objective of this section is to show that this bottleneck is polynomial-time equivalent to the SMIN oracle, and in doing so put SINKHORN on equal footing with classical implicit LP algorithms in terms of their reliance on variants of the dual feasibility oracle for MOT. Concretely, this lets us establish the following two results.

First, SINKHORN can solve MOT in polynomial time whenever SMIN can be solved in polynomial time.

Theorem 4.18.

For any family of cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k} and accuracy ε>0\varepsilon>0, SINKHORN solves MOTC\textsf{MOT}_{C} to ε\varepsilon accuracy in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time and poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) calls to the SMINC\textsf{SMIN}_{C} oracle with regularization η=(2klogn)/ε\eta=(2k\log n)/\varepsilon. (The solution is output through a polynomial-size implicit representation, see §4.3.1.)

Second, we show that SINKHORN requires strictly more structure than other algorithms do to solve an MOT problem. This is why the results about ELLIPSOID (Theorem 4.1) and MWU (Theorem 4.7) state that those algorithms solve an MOT problem whenever possible, whereas Theorem 4.18 cannot be analogously extended to such an “if and only if” characterization.

Theorem 4.19.

There is a family of cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k} for which ELLIPSOID solves MOTC\textsf{MOT}_{C} exactly in poly(n,k)\mathrm{poly}(n,k) time, however it is #\#BIS-hard to implement a single iteration of SINKHORN in poly(n,k)\mathrm{poly}(n,k) time.

Organization of §4.3.

In §4.3.1, we recall this SINKHORN algorithm and how it depends on a certain marginalization oracle. In §4.3.2, we show that this marginalization oracle is polynomial-time equivalent to the SMIN oracle, and use this to prove Theorems 4.18 and 4.19.

4.3.1 Algorithm

Here we recall SINKHORN and its known guarantees. To do this, we first define the following oracle. While this oracle does not have an interpretation as a dual feasibility oracle, we show below that it is polynomial-time equivalent to SMIN, which is a specific type of approximate dual feasibility oracle (Remark 3.6).

Definition 4.20 (MARG).

Given scalings d=(d1,,dk)0n×kd=(d_{1},\dots,d_{k})\in\mathbb{R}_{\geqslant 0}^{n\times k}, regularization η>0\eta>0, and an index i[k]i\in[k], the marginalization oracle MARGC(d,η,i)\textsf{MARG}_{C}(d,\eta,i) returns the vector mi((i=1kdi)exp[ηC])0nm_{i}((\otimes_{i^{\prime}=1}^{k}d_{i^{\prime}})\odot\exp[-\eta C])\in\mathbb{R}_{\geqslant 0}^{n}.
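For intuition, here is a brute-force implementation (ours) of the marginalization oracle for an explicitly stored cost tensor; it takes n^k time, so polynomial-time implementations must instead exploit structure in CC.

import numpy as np

def marg_oracle_dense(C, d, eta, i):
    # MARG_C(d, eta, i): the i-th marginal of (d_1 x ... x d_k) * exp[-eta C],
    # where the rank-one scaling is multiplied in axis by axis.
    k = C.ndim
    P = np.exp(-eta * C)
    for ax in range(k):
        shape = [1] * k
        shape[ax] = -1
        P = P * d[ax].reshape(shape)   # multiply in the scaling vector d_ax
    return P.sum(axis=tuple(ax for ax in range(k) if ax != i))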

It is known that SINKHORN can solve MOT with only polynomially many calls to this oracle (LinHoJor19, ). The approximate solution that SINKHORN computes is a fully dense tensor with nkn^{k} non-zero entries, but it is output implicitly in O(nk)O(nk) space through “scaling vectors” and “rounding vectors”, described below.

Theorem 4.21 (SINKHORN guarantee, (LinHoJor19, )).

Algorithm 3 computes an ε\varepsilon-approximate solution to MOTC(μ)\textsf{MOT}_{C}(\mu) using poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) calls to the MARGC\textsf{MARG}_{C} oracle with parameter η=(2klogn)/ε\eta=(2k\log n)/\varepsilon, and poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) additional time. The solution is of the form

P=(i=1kdi)exp[ηC]+(i=1kvi),\displaystyle P=\left(\otimes_{i=1}^{k}d_{i}\right)\odot\exp[-\eta C]+\left(\otimes_{i=1}^{k}v_{i}\right), (4.4)

and is output implicitly via the scaling vectors d1,,dk0nd_{1},\dots,d_{k}\in\mathbb{R}_{\geqslant 0}^{n} and rounding vectors v1,,vk0nv_{1},\dots,v_{k}\in\mathbb{R}_{\geqslant 0}^{n}.

Algorithm 3 SINKHORN: multidimensional analog of classical Sinkhorn scaling

Input: Cost C(n)kC\in(\mathbb{R}^{n})^{\otimes k}, marginals μ1,,μkΔn\mu_{1},\dots,\mu_{k}\in\Delta_{n}
      Output: Implicit representation of tensor (4.4) that is an ε\varepsilon-approximate solution to MOTC(μ)\textsf{MOT}_{C}(\mu)

1: // Step 1: scale
2: d_{1},\dots,d_{k}\leftarrow\mathbf{1} and \eta\leftarrow(2k\log n)/\varepsilon ▷ Initialize (no scaling)
3: for \mathrm{poly}(n,k,C_{\max}/\varepsilon) iterations do
4:      Choose i\in[k] ▷ Round-robin, greedily, or randomly
5:      \tilde{\mu}_{i}\leftarrow\textsf{MARG}_{C}(d,\eta,i) ▷ Bottleneck: compute i-th marginal
6:      d_{i}\leftarrow d_{i}\odot(\mu_{i}/\tilde{\mu}_{i}) ▷ Rescale i-th marginal (division is entrywise)
7: // Step 2: round to transportation polytope
8: for i=1,\dots,k do ▷ Rescale each marginal to be below marginal constraints
9:      \tilde{\mu}_{i}\leftarrow\textsf{MARG}_{C}(d,\eta,i) ▷ Bottleneck: compute i-th marginal
10:      d_{i}\leftarrow d_{i}\odot\min[\mathbf{1},\mu_{i}/\tilde{\mu}_{i}] ▷ Rescale i-th marginal (operations are entrywise)
11: v_{i}\leftarrow\mu_{i}-\textsf{MARG}_{C}(d,\eta,i) for each i\in[k] ▷ Add back mass
12: v_{1}\leftarrow v_{1}/\|v_{1}\|_{1}^{k-1} ▷ Rescale so that (4.4) satisfies the marginal constraints
13: Return d_{1},\dots,d_{k} and v_{1},\dots,v_{k} ▷ Implicit representation of solution (4.4)
Sketch of algorithm.

Full details and a proof are in (LinHoJor19, ). We give a brief overview here for completeness. The main idea of SINKHORN is to solve RMOT, the entropically regularized variant of MOT described in §2.2. On one hand, this provides an ε\varepsilon-approximate solution to MOT by taking the regularization parameter η=Θ(ε1klogn)\eta=\Theta(\varepsilon^{-1}k\log n) sufficiently high (Lemma 2.4). On the other hand, solving RMOT rather than MOT enables exploiting the first-order optimality conditions of RMOT, which imply that the unique solution to RMOT is the unique tensor in (μ1,,μk)\mathcal{M}(\mu_{1},\dots,\mu_{k}) of the form

P=(i=1kdi)K,\displaystyle P^{*}=(\otimes_{i=1}^{k}d_{i}^{*})\odot K, (4.5)

where KK denotes the entrywise exponentiated tensor exp[ηC]\exp[-\eta C], and d1,,dk0nd_{1}^{*},\dots,d_{k}^{*}\in\mathbb{R}_{\geqslant 0}^{n} are non-negative vectors. The SINKHORN algorithm approximately computes this solution in two steps.

The first and main step of Algorithm 3 is the natural multimarginal analog of the Sinkhorn scaling algorithm (Sin67, ). It computes an approximate solution P=(i=1kdi)KP=(\otimes_{i=1}^{k}d_{i})\odot K by finding d1,,dkd_{1},\dots,d_{k} such that PP is nearly feasible in the sense that mi(P)μim_{i}(P)\approx\mu_{i} for each i[k]i\in[k]. Briefly, it does this via alternating optimization: initialize did_{i} to the all-ones vector 𝟏n\mathbf{1}\in\mathbb{R}^{n}, and then iteratively update one did_{i} so that the ii-th marginal mi(P)m_{i}(P) of the current scaled iterate P=(i=1kdi)KP=(\otimes_{i=1}^{k}d_{i})\odot K is μi\mu_{i}. Although correcting one marginal can detrimentally affect the others, this algorithm nevertheless converges—in fact, in a polynomial number of iterations (LinHoJor19, ).

The second step of Algorithm 3 is the natural multimarginal analog of the rounding algorithm (AltWeeRig17, , Algorithm 2). It rounds the solution P=(\otimes_{i=1}^{k}d_{i})\odot K found in step one to the transportation polytope \mathcal{M}(\mu_{1},\dots,\mu_{k}). Briefly, it performs this by scaling each marginal m_{i}(P) to be entrywise less than the desired \mu_{i}, and then adding mass back to P so that all marginal constraints are exactly satisfied. The former adjustment is done by adjusting the diagonal scalings d_{1},\dots,d_{k}, and the latter adjustment is done by adding a rank-1 term \otimes_{i=1}^{k}v_{i}.

Critically, note that Algorithm 3 takes polynomial time except for possibly the calls to the MARGC\textsf{MARG}_{C} oracle. In the absence of structure in the cost tensor CC, evaluating this MARGC\textsf{MARG}_{C} oracle takes exponential time because it requires computing marginals of a tensor with nkn^{k} entries.
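To make the preceding discussion concrete, here is a minimal Python sketch of Algorithm 3 that treats \textsf{MARG}_{C} as a black-box callback (for instance, the brute-force reference above, or a polynomial-time structured implementation). This is a simplified sketch rather than the authors' code: it uses a round-robin index choice and a fixed iteration budget, and omits stopping criteria and numerical safeguards.

```python
import numpy as np

def sinkhorn_mot(marg, mu, eta, n_iters):
    """Sketch of Algorithm 3, given a callback marg(d, eta, i) for MARG_C.

    mu : list of k marginal distributions (nonnegative length-n, summing to 1)
    Returns the scaling vectors d_1..d_k and rounding vectors v_1..v_k of the
    implicit representation (4.4).
    """
    k, n = len(mu), len(mu[0])
    d = [np.ones(n) for _ in range(k)]

    # Step 1: scale, with a round-robin choice of update index.
    for t in range(n_iters):
        i = t % k
        d[i] *= mu[i] / marg(d, eta, i)  # match the i-th marginal (entrywise)

    # Step 2: round onto the transportation polytope.
    for i in range(k):
        d[i] *= np.minimum(1.0, mu[i] / marg(d, eta, i))  # push marginals below mu_i
    v = [mu[i] - marg(d, eta, i) for i in range(k)]       # mass still missing
    v[0] /= np.sum(v[0]) ** (k - 1)  # all v_i have equal mass, so (4.4) is feasible
    return d, v
```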

We conclude this discussion with several remarks about SINKHORN.

Remark 4.22 (Choice of update index in SINKHORN).

In line 4 there are several ways to choose update indices, all of which lead to the polynomial iteration complexity we desire. Iteration-complexity bounds are shown for a greedy choice in (LinHoJor19, ; Fri20, ). Similar bounds can be shown for random and round-robin choices by adapting the techniques of (AltPar20bal, ). These latter two choices do not incur the overhead of kk MARG computations per iteration required by the greedy choice, which is helpful in practice. Empirically, we observe that round-robin works quite well, and we use this in our experiments.

Remark 4.23 (Alternative implementations of SINKHORN).

For simplicity, Algorithm 3 provides pseudocode for the “vanilla” version of SINKHORN, as it performs well in practice and achieves the polynomial iteration complexity we desire. There are several variants in the literature, including accelerated versions and variants that first round small entries of the marginals—these have iteration-complexity bounds with better polynomial dependence on \varepsilon and k, albeit sometimes at the expense of larger polynomial factors in n (LinHoJor19, ; tupitsa2020multimarginal, ).

Remark 4.24 (Output of SINKHORN and efficient downstream tasks).

While the output P of SINKHORN is fully dense with n^{k} non-zero entries, its specific form (4.4) enables performing downstream tasks in polynomial time. This is conditional on a polynomial-time \textsf{MARG}_{C} oracle, which is no loss of generality since such an oracle is required for running SINKHORN in polynomial time in the first place. The basic idea is that P is a mixture of two simple distributions (modulo normalization). The first is (\otimes_{i=1}^{k}d_{i})\odot\exp[-\eta C], which is marginalizable using \textsf{MARG}_{C}. The second is \otimes_{i=1}^{k}v_{i}, which is easily marginalizable since it is a product distribution (as the v_{i} are non-negative). Together, this enables efficient marginalization of P. By recursively marginalizing conditional distributions, this enables efficient sampling from P, which in turn enables efficient estimation of bounded statistics of P (e.g., the cost \langle C,P\rangle) by Hoeffding’s inequality.
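As an illustration of the first step in this pipeline, the following sketch (notation ours) computes a marginal of the implicit representation (4.4) with a single \textsf{MARG}_{C} call plus O(nk) extra work; the rank-1 part marginalizes in closed form.

```python
import numpy as np

def marginal_of_implicit(marg, d, v, eta, i):
    """i-th marginal of P = (⊗ d_i) ⊙ exp[-eta C] + (⊗ v_i), from (4.4)."""
    mass_other = np.prod([v_j.sum() for j, v_j in enumerate(v) if j != i])
    return marg(d, eta, i) + v[i] * mass_other  # scaled-kernel part + product part
```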

4.3.2 Equivalence of bottleneck to SMIN

Although Theorem 4.21 shows that SINKHORN solves MOT in polynomial time using the MARG oracle, this by itself proves neither Theorem 4.18 nor Theorem 4.19. In order to prove these results, we show that SMIN and MARG are polynomial-time equivalent.

Lemma 4.25 (Equivalence of MARG and SMIN).

For any regularization \eta>0, each of the oracles \textsf{MARG}_{C} and \textsf{SMIN}_{C} can be implemented using \mathrm{poly}(n) calls to the other oracle and \mathrm{poly}(n,k) additional time.

Proof.

Reduction from SMIN to MARG. First, we show how to compute SMINC(p,η)\textsf{SMIN}_{C}(p,\eta) using one call to the marginalization oracle and O(n)O(n) additional time. Consider the entrywise exponentiated matrix d=exp[ηp]0n×kd=\exp[\eta p]\in\mathbb{R}_{\geqslant 0}^{n\times k}, and let μ1=m1((i=1kdi)exp[ηC])\mu_{1}=m_{1}((\otimes_{i=1}^{k}d_{i})\odot\exp[-\eta C]) be the answer to MARGC(d,η,1)\textsf{MARG}_{C}(d,\eta,1). Observe that

η1log(j1=1n[μ1]j1)\displaystyle-\eta^{-1}\log\left(\sum_{j_{1}=1}^{n}[\mu_{1}]_{j_{1}}\right) =η1log(j1=1nj2,,jk[n]i=1k[di]jieηCj)\displaystyle=-\eta^{-1}\log\left(\sum_{j_{1}=1}^{n}\sum_{j_{2},\dots,j_{k}\in[n]}\prod_{i=1}^{k}[d_{i}]_{j_{i}}e^{-\eta C_{\vec{j}}}\right)
=η1log(j[n]keη(Cji=1k[pi]ji))\displaystyle=-\eta^{-1}\log\left(\sum_{\vec{j}\in[n]^{k}}e^{-\eta(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}})}\right)
=sminηj[n]k(Cji=1k[pi]ji),\displaystyle=\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in[n]^{k}}\left(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right),

where above the first step is by definition of μ1\mu_{1}, the second step is by definition of dd and combining the sums, and the third step is by definition of smin\operatorname*{smin}. We conclude that η1logj1=1n[μ1]j1-\eta^{-1}\log\sum_{j_{1}=1}^{n}[\mu_{1}]_{j_{1}} is a valid answer to SMINC(p,η)\textsf{SMIN}_{C}(p,\eta). Since this is clearly computable from μ1\mu_{1} in O(n)O(n) time, this establishes the claimed reduction.

Reduction from MARG to SMIN. Next, we show that for any marginalization index i[k]i\in[k] and entry [n]\ell\in[n], it is possible to compute the \ell-th entry of the vector MARGC(d,η,i)\textsf{MARG}_{C}(d,\eta,i) using one call to the SMINC\textsf{SMIN}_{C} oracle and poly(n,k)\mathrm{poly}(n,k) additional time. Define vnv\in\mathbb{R}^{n} to be the vector with \ell-th entry equal to [di][d_{i}]_{\ell}, and all other entries 0. Define the matrix p=η1log[d1,,di1,v,di+1,,dk]¯n×kp=\eta^{-1}\log[d_{1},\ldots,d_{i-1},v,d_{i+1},\ldots,d_{k}]\in\bar{\mathbb{R}}^{n\times k}, where recall that log0=\log 0=-\infty (see §2). Let ss\in\mathbb{R} denote the answer to SMINC(p,η)\textsf{SMIN}_{C}(p,\eta). Observe that

eηs=j[n]keη(Cji=1k[pi]ji)=j[n]k:ji=i=1k[di]jieηCj=[mi((i=1kdi)exp[ηC])],\displaystyle e^{-\eta s}=\sum_{\vec{j}\in[n]^{k}}e^{-\eta(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}})}=\sum_{\vec{j}\in[n]^{k}\;:\;\vec{j}_{i}=\ell}\prod_{i=1}^{k}[d_{i}]_{j_{i}}e^{-\eta C_{\vec{j}}}=\left[m_{i}\left((\otimes_{i=1}^{k}d_{i})\odot\exp[-\eta C]\right)\right]_{\ell},

where above the first step is by definition of s, the second step is by definition of p and v, and the third step is by definition of the marginalization notation m_{i}(\cdot). We conclude that \exp(-\eta s) is a valid answer for the \ell-th entry of the vector \textsf{MARG}_{C}(d,\eta,i). This establishes the claimed reduction since we may repeat this procedure n times to compute all n entries of the vector \textsf{MARG}_{C}(d,\eta,i).
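Both reductions in this proof are short enough to transcribe directly; the following sketch (assuming black-box callbacks marg and smin with the interfaces used above) mirrors them line by line.

```python
import numpy as np

def smin_via_marg(marg, p, eta):
    """SMIN_C(p, eta) from one MARG call, as in the first reduction."""
    d = [np.exp(eta * p_i) for p_i in p]   # d = exp[eta * p], entrywise
    mu_1 = marg(d, eta, 0)                 # first marginal (index i = 1)
    return -np.log(mu_1.sum()) / eta

def marg_via_smin(smin, d, eta, i, n):
    """MARG_C(d, eta, i) from n SMIN calls, as in the second reduction."""
    with np.errstate(divide="ignore"):     # log(0) = -inf is intentional (see §2)
        log_d = [np.log(d_i) / eta for d_i in d]
    out = np.zeros(n)
    for ell in range(n):
        v = np.full(n, -np.inf)            # keep only the ell-th entry of d_i
        v[ell] = log_d[i][ell]
        p = log_d[:i] + [v] + log_d[i + 1:]
        out[ell] = np.exp(-eta * smin(p, eta))
    return out
```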

We can now conclude the proofs of the main results of §4.3.

Proof of Theorem 4.18.

This follows from the fact that SINKHORN approximates \textsf{MOT}_{C} in polynomial time given an efficient implementation of \textsf{MARG}_{C} (Theorem 4.21), combined with the fact that the \textsf{MARG}_{C} and \textsf{SMIN}_{C} oracles are polynomial-time equivalent (Lemma 4.25). ∎

Proof of Theorem 4.19.

Consider the family of cost tensors in Lemma 3.7 for which the MINC\textsf{MIN}_{C} oracle admits a polynomial-time algorithm, but for which the SMINC\textsf{SMIN}_{C} oracle is #BIS-hard. Then on one hand, the ELLIPSOID algorithm solves MOTC\textsf{MOT}_{C} in polynomial time by Theorem 4.1. And on the other hand, it is #BIS-hard to implement a single iteration of SINKHORN because that requires implementing the MARGC\textsf{MARG}_{C} oracle, which is polynomial-time equivalent to the SMINC\textsf{SMIN}_{C} oracle by Lemma 4.25. ∎

5 Application: MOT problems with graphical structure

In this section, we illustrate our algorithmic framework on MOT problems with graphical structure. Although a polynomial-time algorithm is already known for this particular structure (h20gm, ; teh2002unified, ), that algorithm computes solutions that are approximate and dense; see the related work section for details. By combining our algorithmic framework developed above with classical facts about graphical models, we show that it is possible to compute solutions that are exact and sparse in polynomial time.

The section is organized as follows. In §5.1, we recall the definition of graphical structure. In §5.2, we show that the MIN, AMIN, and SMIN oracles can be implemented in polynomial time for cost tensors with graphical structure; from this it immediately follows that all of the MOT algorithms discussed in part 1 of this paper can be implemented in polynomial time. Finally, in §5.3, we demonstrate our results on the popular application of computing generalized Euler flows, which was the original motivation of MOT. Numerical simulations demonstrate how the exact, sparse solutions produced by our new algorithms provide qualitatively better solutions than previously possible in polynomial time.

5.1 Setup

We begin by recalling preliminaries about undirected graphical models, a.k.a., Markov Random Fields. We recall only the relevant background; for further details we refer the reader to the textbooks (KolFri09, ; wainwright2008graphical, ).

In words, graphical models provide a way of encoding the independence structure of a collection of random variables in terms of a graph. The formal definition is as follows. Below, all graphs are undirected, and the notation 2V2^{V} means the power set of VV (i.e., the set of all subsets of VV).

Definition 5.1 (Graphical model structure).

Let 𝒮2[k]\mathcal{S}\subset 2^{[k]}. The graphical model structure corresponding to 𝒮\mathcal{S} is the graph G𝒮=(V,E)G_{\mathcal{S}}=(V,E) with vertices V=[k]V=[k] and edges E={(i,j):i,jS, for some S𝒮}E=\{(i,j):i,j\in S,\text{ for some }S\in\mathcal{S}\}.

Definition 5.2 (Graphical model).

Let 𝒮2[k]\mathcal{S}\subset 2^{[k]}. A probability distribution PP over {Xi}i[k]\{X_{i}\}_{i\in[k]} is a graphical model with structure 𝒮\mathcal{S} if there exist functions {ψS}S𝒮\{\psi_{S}\}_{S\in\mathcal{S}} and normalizing constant ZZ such that

P({xi}i[k])=1ZS𝒮ψS({xi}iS).P\Big{(}\{x_{i}\}_{i\in[k]}\Big{)}=\frac{1}{Z}\prod_{S\in\mathcal{S}}\psi_{S}\Big{(}\{x_{i}\}_{i\in S}\Big{)}.

A standard measure of complexity for graphical models is the treewidth of the underlying graphical model structure G𝒮G_{\mathcal{S}} because this captures not just the storage complexity, but also the algorithmic complexity of performing fundamental tasks such as computing the mode, log-partition function, and marginal distributions (KolFri09, ; wainwright2008graphical, ). There are a number of equivalent definitions of treewidth (bodlaender2007treewidth, ). Each requires defining intermediate combinatorial concepts. We recall here the definition that is based on the concept of a junction tree because this is perhaps the most standard definition in the graphical models community.

Definition 5.3 (Junction tree, treewidth).

A junction tree T=(VT,ET,{Bu}uVT)T=(V_{T},E_{T},\{B_{u}\}_{u\in V_{T}}) for a graph G=(V,E)G=(V,E) consists of a tree (VT,ET)(V_{T},E_{T}) and a set of bags {BuV}uVT\{B_{u}\subseteq V\}_{u\in V_{T}} satisfying:

  • For each variable iVi\in V, the set of nodes Ui={uVT:iBu}U_{i}=\{u\in V_{T}:i\in B_{u}\} induces a subtree of TT.

  • For each edge eEe\in E, there is some bag BuB_{u} containing both endpoints of ee.

The width of the junction tree is one less than the size of the largest bag, i.e., is maxuVT|Bu|1\max_{u\in V_{T}}|B_{u}|-1. The treewidth of a graph is the width of its minimum-width junction tree.

See Figures 1, 2, and 3 for illustrated examples.

We now formally recall the definition of graphical structure for MOT.

Figure 1: (a) Path graph; (b) Junction tree. The path graph (left) has treewidth 1 because the corresponding junction tree (right) has bags of size at most 2.
Figure 2: (a) Window graph with window size 2; (b) Junction tree. The graph that has an edge between all vertices at distance at most two when ordered sequentially (left) has treewidth 2 because the corresponding junction tree (right) has bags of size at most 3.
Figure 3: (a) Cycle graph; (b) Junction tree. The cycle graph (left) has treewidth 2 because the corresponding junction tree (right) has bags of size at most 3.
Definition 5.4 (Graphical structure for MOT).

An MOT cost tensor C(n)kC\in(\mathbb{R}^{n})^{\otimes k} has graphical structure with treewidth ω\omega if there is a graphical model structure 𝒮2[k]\mathcal{S}\subset 2^{[k]} and functions {fS}S𝒮\{f_{S}\}_{S\in\mathcal{S}} such that

Cj=S𝒮fS({ji}iS),j:=(j1,,jk)[n]k,C_{\vec{j}}=\sum_{S\in\mathcal{S}}f_{S}\Big{(}\{j_{i}\}_{i\in S}\Big{)},\qquad\forall\vec{j}:=(j_{1},\dots,j_{k})\in[n]^{k}, (5.1)

and such that the graph G𝒮G_{\mathcal{S}} has treewidth ω\omega.

We make three remarks about this structure. First, note that the functions {fS}S𝒮\{f_{S}\}_{S\in\mathcal{S}} can be arbitrary so long as the corresponding graphical model structure has treewidth at most ω\omega.

Second, if Definition 5.4 did not constrain the treewidth ω\omega, then every tensor CC would trivially have graphical structure with maximal treewidth ω=k1\omega=k-1 (take 𝒮\mathcal{S} to be the singleton containing [k][k], G𝒮G_{\mathcal{S}} to be the complete graph, and f[k]f_{[k]} to be CC). Just like all previous algorithms, our algorithms have runtimes that depend exponentially (only) on the treewidth of G𝒮G_{\mathcal{S}}. This is optimal in the sense that unless 𝖯=𝖭𝖯\mathsf{P}=\mathsf{NP}, there is no algorithm with jointly polynomial runtime in the input size and treewidth (AltBoi20hard, ). We also point out that in all current applications of graphically structured MOT, the treewidth is either 11 or 22, see §1.3.

Third, as in all previous work on graphically structured MOT, we make the natural assumptions that the cost CC is input implicitly through the functions {fS}S𝒮\{f_{S}\}_{S\in\mathcal{S}}, and that each function fSf_{S} can be evaluated in polynomial time, since otherwise graphical structure is useless for designing polynomial-time algorithms. In all applications in the literature, these two basic assumptions are always satisfied. Note also that if the treewidth of the graphical structure is constant, then there is a linear-time algorithm to compute the treewidth and a corresponding minimum-width junction tree (bodlaender1996linear, ).

5.2 Polynomial-time algorithms

By our oracle reductions in §4, in order to design polynomial-time algorithms for MOT with graphical structure, it suffices to design polynomial-time algorithms for the MIN, AMIN, or SMIN oracles. This follows directly from classical algorithmic results in the graphical models literature (KolFri09, ).

Theorem 5.5 (Polynomial-time algorithms for the MIN, AMIN, and SMIN oracles for costs with graphical structure).

Let C(n)kC\in(\mathbb{R}^{n})^{\otimes k} be a cost tensor that has graphical structure with constant treewidth ω\omega (see Definition 5.4). Then the MINC\textsf{MIN}_{C}, AMINC\textsf{AMIN}_{C}, and SMINC\textsf{SMIN}_{C} oracles can be computed in poly(n,k)\mathrm{poly}(n,k) time.

Algorithm 4 Polynomial-time algorithm for MIN for graphically structured costs (Theorem 5.5).

Input: Cost CC with graphical structure, matrix pn×kp\in\mathbb{R}^{n\times k}
      Output: Solution to MINC(p)\textsf{MIN}_{C}(p)

1: \vec{j}\leftarrow mode of the graphical model P in (5.2) ▷ Using the classical max-product algorithm (KolFri09, , §13.3)
2: Return C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} ▷ Value of the \textsf{MIN}_{C}(p) oracle
Algorithm 5 Polynomial-time algorithm for SMIN for graphically structured costs (Theorem 5.5).

Input: Cost CC with graphical structure, matrix p¯n×kp\in\bar{\mathbb{R}}^{n\times k}, regularization η>0\eta>0
      Output: Solution to SMINC(p,η)\textsf{SMIN}_{C}(p,\eta)

1: Z\leftarrow partition function of the graphical model P in (5.2) ▷ Using the classical sum-product algorithm (KolFri09, , §10.2)
2: Return -\eta^{-1}\log Z ▷ Value of the \textsf{SMIN}_{C}(p,\eta) oracle
Proof.

Consider input pp for the oracles. Let PP denote the probability distribution on [n]k[n]^{k} given by

P(j)=1Zexp(η(Cji=1k[pi]ji)),j[n]k,\displaystyle P(\vec{j})=\frac{1}{Z}\exp\left(-\eta\left(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right)\right),\qquad\forall\vec{j}\in[n]^{k}, (5.2)

where Z=\sum_{\vec{j}\in[n]^{k}}\exp(-\eta(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}})) ensures P is normalized. Observe that the \textsf{MIN}_{C} oracle amounts to computing the mode of the distribution P, because \textsf{MIN}_{C}(p)=C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} where \vec{j}\in[n]^{k} is a maximizer of P_{\vec{j}}. (In fact, for the purpose of computing \textsf{MIN}_{C}, the distribution P(\vec{j}) can be defined using any \eta>0.) Also, the \textsf{SMIN}_{C} oracle amounts to computing the partition function Z because \textsf{SMIN}_{C}(p)=-\eta^{-1}\log Z. Thus it suffices to compute the mode and partition function of P in polynomial time. (The \textsf{AMIN}_{C} oracle follows from the \textsf{MIN}_{C} oracle by Remark 3.5.)

To this end, observe that by assumption on CC, there is a graphical model structure 𝒮2[k]\mathcal{S}\in 2^{[k]} and functions {fS}S𝒮\{f_{S}\}_{S\in\mathcal{S}} such that the corresponding graph G𝒮G_{\mathcal{S}} has treewidth ω\omega and the distribution PP factors as

P(j)=exp(η(S𝒮fS({ji}iS)i=1k[pi]ji)).P(\vec{j})=\exp\left(-\eta\left(\sum_{S\in\mathcal{S}}f_{S}\left(\{j_{i}\}_{i\in S}\right)-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right)\right).

It follows that PP is a graphical model with respect to the same graphical model structure 𝒮\mathcal{S} because the “vertex potentials” exp(η[pi]ji)\exp(\eta[p_{i}]_{j_{i}}) do not affect the underlying graphical model structure. Thus PP is a graphical model with constant treewidth ω\omega, so we may compute the mode and partition function of PP in poly(n,k)\mathrm{poly}(n,k) time using, respectively, the classical max-product and sum-product algorithms (KolFri09, , Chapters 13.3 and 10.2). For convenience, pseudocode summarizing this discussion is provided in Algorithms 4 and 5. ∎
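For intuition, here is a minimal sketch of Algorithm 5 in the simplest case of a path-structured cost (treewidth 1, cf. Figure 1), where the sum-product recursion reduces to a sequence of matrix-vector products. For large \eta one would carry out the same recursion in log-space for numerical stability; that caveat is ours, not part of the algorithm statement.

```python
import numpy as np

def smin_chain(fs, p, eta):
    """SMIN_C(p, eta) for a path-structured cost C_j = sum_i f_i(j_i, j_{i+1}).

    fs : list of k-1 matrices, fs[i][a, b] = value of the factor on (j_{i+1}=a, j_{i+2}=b)
    p  : list of k length-n weight vectors (the oracle input)
    """
    m = np.exp(eta * p[0])                    # message over j_1
    for i, f in enumerate(fs):
        K = np.exp(-eta * f)                  # pairwise potential for this edge
        m = (m @ K) * np.exp(eta * p[i + 1])  # sum out j_i, absorb vertex potential
    return -np.log(m.sum()) / eta             # -eta^{-1} log Z, as in Algorithm 5
```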

An immediate consequence of Theorem 5.5 combined with our oracle reductions is that all candidate MOT algorithms in §4 can be efficiently implemented for MOT problems with graphical structure. From a theoretical perspective, ELLIPSOID gives the best guarantee since it produces an exact, sparse solution.

Corollary 5.6 (Polynomial-time algorithms for MOT problems with graphical structure).

Let C(n)kC\in(\mathbb{R}^{n})^{\otimes k} be a cost tensor that has graphical structure with constant treewidth ω\omega (see Definition 5.4). Then:

  • The ELLIPSOID algorithm in §4.1 computes an exact solution to MOTC\textsf{MOT}_{C} in poly(n,k)\mathrm{poly}(n,k) time.

  • The MWU algorithm in §4.2 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

  • The SINKHORN algorithm in §4.3 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

  • The COLGEN algorithm in §4.1.3 can be run for TT iterations in poly(n,k,T)\mathrm{poly}(n,k,T) time.

Moreover, ELLIPSOID, MWU, and COLGEN output a polynomially sparse tensor, whereas SINKHORN outputs a fully dense tensor through the implicit representation described in §4.3.1.

Proof.

Combine the polynomial-time implementations of the oracles in Theorem 5.5 with the polynomial-time algorithm-to-oracle reductions in Theorems 4.1, 4.7, 4.18, and 4.6, respectively. ∎

5.3 Application vignette: fluid dynamics

In this section, we numerically demonstrate our new results for graphically structured MOT—namely the ability to compute exact, sparse solutions in polynomial time (Corollary 5.6). We illustrate this on the problem of computing generalized Euler flows—an MOT application which has received significant interest and which was historically the motivation of MOT, see e.g., (BenCarCut15, ; BenCarNen16, ; Bre08, ; Bre89, ; Bre93, ; Bre99, ). This MOT problem is already known to be tractable via a popular, specially-tailored modification of SINKHORN (BenCarCut15, )—which can be interpreted as implementing SINKHORN using graphical structure (h20gm, ; teh2002unified, ). However, that algorithm is based on SINKHORN and thus unavoidably produces solutions that are low-precision (due to poly(1/ε)\mathrm{poly}(1/\varepsilon) runtime dependence), fully dense (with nkn^{k} non-zero entries), and have well-documented numerical precision issues. We offer the first polynomial-time algorithm for computing exact and/or sparse solutions.

We briefly recall the premise of this MOT problem; for further background see (Bre08, ; BenCarCut15, ). An incompressible fluid (e.g., water) is modeled by nn particles which are uniformly distributed in space (due to incompressibility) at all times t{1,,k+1}t\in\{1,\dots,k+1\}. We observe each particle’s location at initial time t=1t=1 and final time t=k+1t=k+1. The task is to infer the particles’ locations at all intermediate times t{2,,k}t\in\{2,\dots,k\}, and this is modeled by an MOT problem as follows.

Specifically, the locations of the fluid particles are discretized to points {xj}j[n]d\{x_{j}\}_{j\in[n]}\subset\mathbb{R}^{d}, and σ\sigma is a known permutation on this set that encodes the relation between each initial location xjx_{j} at time t=1t=1 and final location σ(xj)\sigma(x_{j}) at time t=k+1t=k+1. The total movement of a particle that takes the trajectory xj1,xj2,,xjk,σ(xj1)x_{j_{1}},x_{j_{2}},\ldots,x_{j_{k}},\sigma(x_{j_{1}}) is given by

C_{j_{1},\ldots,j_{k}}=\|\sigma(x_{j_{1}})-x_{j_{k}}\|^{2}+\sum_{t=1}^{k-1}\|x_{j_{t+1}}-x_{j_{t}}\|^{2}. (5.3)

By the principle of least action, the generalized Euler flow problem of inferring the most likely trajectories of the fluid particles is given by the solution to the MOT problem with this cost CC and uniform marginals μt=𝟏n/nΔn\mu_{t}=\mathbf{1}_{n}/n\in\Delta_{n} which impose the constraint that the fluid is incompressible.

Corollary 5.7 (Exact, sparse solutions for generalized Euler flows).

The MOT problem with cost (5.3) can be solved in dpoly(n,k)d\cdot\mathrm{poly}(n,k) time. The solution is returned as a sparse tensor with at most nkk+1nk-k+1 non-zeros.

Proof.

This cost tensor C can be expressed in graphical form C_{\vec{j}}=\sum_{S\in\mathcal{S}}f_{S}(\{j_{i}\}_{i\in S}), where \mathcal{S} consists of the sets \{1,2\},\dots,\{k-1,k\} of adjacent time points as well as the set \{1,k\}. Moreover, each function f_{S}:[n]^{2}\to\mathbb{R} can be computed in O(dn^{2}) time since this simply requires computing \|x_{j}-x_{j^{\prime}}\|^{2} for n^{2} pairs of points x_{j},x_{j^{\prime}}\in\mathbb{R}^{d}. Once this graphical representation is computed, Corollary 5.6 implies a \mathrm{poly}(n,k)-time algorithm for this MOT problem because the graphical model structure \mathcal{S} is a cycle graph and thus has treewidth 2 (cf. Figure 3). ∎
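As a concrete illustration of this proof, the following sketch (names ours) computes the two pairwise factors of the graphical representation in O(dn^2) time.

```python
import numpy as np

def euler_flow_factors(x, sigma):
    """Pairwise factors of the cycle-structured cost (5.3).

    x     : (n, d) array of discretized particle locations
    sigma : length-n integer array, sigma[j] = index of the final location of x_j
    Returns D with D[a, b] = ||x_a - x_b||^2 (factor for adjacent times) and
    F with F[a, b] = ||sigma(x_a) - x_b||^2 (factor coupling times 1 and k).
    """
    D = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    F = D[sigma]  # permute rows: F[a, b] = ||x_{sigma(a)} - x_b||^2
    return D, F
```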

Figure 4: Transport maps computed by the fast implementation of SINKHORN (BenCarCut15, ) (left) and our COLGEN implementation (right) on a standard fluid dynamics benchmark problem in dimension d=1 (Bre08, ), shown at timesteps t=1 through t=7. The pairwise transport maps between successive timesteps are plotted with opacity proportional to the mass. The SINKHORN algorithm is run at the highest precision (i.e., smallest regularization) before serious numerical precision issues (NaNs); it returns a dense, approximate solution in 2.25 seconds. COLGEN returns an exact, sparse solution in 9.52 seconds. (All experiments in this paper are run on a standard-issue Apple MacBook Pro 2020 laptop with an M1 chip.) Furthermore, in this particular problem instance, the COLGEN method returns a Monge solution, i.e., the sparsity is n so that the particles never split in the computed trajectories.

Figure 4 illustrates how the exact, sparse solutions found by our new algorithm provide visually sharper estimates than the popular modification of SINKHORN in (BenCarCut15, ), which blurs the trajectories. The latter is the state-of-the-art algorithm in the literature and in particular is the only previously known non-heuristic algorithm with polynomial-time guarantees. Note that this algorithm is identical to implementing SINKHORN by exploiting the graphical structure to perform exact marginalization efficiently (teh2002unified, ; h20gm, ).

The numerical simulation is on a standard benchmark problem used in the literature (see e.g., (BenCarCut15, , Figure 9) and (Bre08, , Figure 2)) in which the particle at initial location x\in[0,1] moves to final location \sigma(x)=x+\tfrac{1}{2}\pmod{1}. This is run with k=6 and marginals \mu_{1}=\dots=\mu_{k} uniformly supported on n=51 positions in [0,1]. See Appendix B for numerics on other standard benchmark instances. Note that this amounts to solving an MOT LP with n^{k}=51^{6}\approx 1.8\times 10^{10} variables, which is infeasible for standard LP solvers. Our algorithm is the first to compute exact solutions for problem instances of this scale.
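For concreteness, this benchmark instance can be set up roughly as follows, reusing the euler_flow_factors sketch from §5.3 above. The exact discretization grid is our assumption; see (Bre08, ) for the original setup.

```python
import numpy as np

# n = 51 positions in [0, 1); final location sigma(x) = x + 1/2 (mod 1).
n, k = 51, 6
x = np.linspace(0.0, 1.0, n, endpoint=False).reshape(n, 1)
targets = (x[:, 0] + 0.5) % 1.0
sigma = np.array([np.argmin(np.abs(x[:, 0] - t)) for t in targets])
D, F = euler_flow_factors(x, sigma)  # cost factors for the treewidth-2 cycle
```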

Two important remarks. First, since this MOT problem is a discretization of the underlying PDE, an exact solution is of course not necessary; however, there is an important—even qualitative—difference between low-precision solutions (computable with poly(1/ε)\mathrm{poly}(1/\varepsilon) runtime) and high-precision solutions (computable with polylog(1/ε)\mathrm{polylog}(1/\varepsilon) runtime) for the discretized problem. Second, a desirable feature of SINKHORN that should be emphasized is its practical scalability, which might make it advantageous for problems where very fine discretization is required. It is an interesting direction of practical relevance to develop algorithms that can compute high-precision solutions at a similarly large scale in practice (see the discussion in §8).

6 Application: MOT problems with set-optimization structure

In this section, we consider MOT problems whose cost tensors C take values 0 and 1—or more generally any two values, by a straightforward reduction: if C takes two values a<b, then define the tensor \tilde{C} with \{0,1\}-entries by \tilde{C}_{\vec{j}}=(C_{\vec{j}}-a)/(b-a); the MOT problems with costs C and \tilde{C} have identical solutions. Such MOT problems arise naturally in applications where one seeks to minimize or maximize the probability that some event occurs given marginal probabilities on each variable (see Example 6.1). We establish that this general class of MOT problems can be solved in polynomial time under a condition on the sparsity pattern of C that is often simple to check due to its connection to classical combinatorial optimization problems.

The section is organized as follows. In §6.1 we formally describe this setup and discuss why it is incomparable to all other structures discussed in this paper. In §6.2, we show that for costs with this structure, the MIN, AMIN, and SMIN oracles can be implemented in polynomial time; from this it immediately follows that the ELLIPSOID, MWU, SINKHORN, and COLGEN algorithms discussed in part 1 of this paper can be implemented in polynomial time. In §6.3, we illustrate our results via a case study on network reliability.

6.1 Setup

Example 6.1 (Motivation for binary-valued MOT costs: minimizing/maximizing probability of an event).

Let S[n]kS\subset[n]^{k}. If Cj=𝟙[jS]C_{\vec{j}}=\mathds{1}[\vec{j}\in S], then the MOTC\textsf{MOT}_{C} problem amounts to minimizing the probability that event SS occurs, given marginals on each variable. On the other hand, if Cj=𝟙[jS]C_{\vec{j}}=\mathds{1}[\vec{j}\notin S], then the MOTC\textsf{MOT}_{C} problem amounts to maximizing the probability that event SS occurs since

MOTC(μ1,,μk)=minP(μ1,,μk)jP[jS]=1maxP(μ1,,μk)jP[jS].\textsf{MOT}_{C}(\mu_{1},\dots,\mu_{k})=\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\mathbb{P}_{\vec{j}\sim P}[\vec{j}\notin S]=1-\max_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\mathbb{P}_{\vec{j}\sim P}[\vec{j}\in S].

Even if the cost is binary-valued, there is no hope to solve MOT in polynomial time without further assumptions—essentially because in the worst case, any algorithm must query all nkn^{k} entries if CC is a completely arbitrary {0,1}\{0,1\}-valued tensor.

We show that MOT is polynomial-time solvable under the general and often simple-to-check condition that the MIN, AMIN, and SMIN oracles introduced in §3 are polynomial-time solvable when restricted to the set SS of indices j[n]k\vec{j}\in[n]^{k} for which Cj=0C_{\vec{j}}=0. For simplicity, our definition of these set oracles removes the cost CjC_{\vec{j}} as it is constant on SS. Of course it is also possible to remove the negative sign in p-p by re-parameterizing the inputs as w=pw=-p; however, we keep this notation in order to parallel the original oracles.

Definition 6.2 (MIN set oracle).

Let S[n]kS\subset[n]^{k}. For weights p=(p1,,pk)n×kp=(p_{1},\ldots,p_{k})\in\mathbb{R}^{n\times k}, MINC,S(p)\textsf{MIN}_{C,S}(p) returns

minjSi=1k[pi]ji.\min_{\vec{j}\in S}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.
Definition 6.3 (AMIN set oracle).

Let S[n]kS\subset[n]^{k}. For weights p=(p1,,pk)n×kp=(p_{1},\ldots,p_{k})\in\mathbb{R}^{n\times k} and accuracy ε>0\varepsilon>0, AMINC,S(p,ε)\textsf{AMIN}_{C,S}(p,\varepsilon) returns MINC,S(p)\textsf{MIN}_{C,S}(p) up to additive error ε\varepsilon.

Definition 6.4 (SMIN set oracle).

Let S[n]kS\subset[n]^{k}. For weights p=(p1,,pk)¯n×kp=(p_{1},\ldots,p_{k})\in\bar{\mathbb{R}}^{n\times k} and regularization parameter η>0\eta>0, SMINC,S(p,η)\textsf{SMIN}_{C,S}(p,\eta) returns

sminηjSi=1k[pi]ji.\displaystyle\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in S}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}.

The key motivation behind these set oracle definitions (aside from the syntactic similarity to the original oracles) is that they encode the problem of (approximately) finding the min-weight object in SS. This opens the door to combinatorial applications of MOT because finding the min-weight object in SS is well-known to be polynomial-time solvable for many “combinatorial-structured” sets SS of interest—e.g., the set SS of cuts in a graph, or the set SS of independent sets in a matroid. See §6.3 for fully-detailed applications.

Definition 6.5 (Set-optimization structure for MOT).

An MOT cost tensor C(n)kC\in(\mathbb{R}^{n})^{\otimes k} has exact, approximate, or soft set-optimization structure of complexity β\beta if

Cj=𝟙[jS]C_{\vec{j}}=\mathds{1}[\vec{j}\notin S]

for a set S[n]kS\subset[n]^{k} for which there is an algorithm solving MINC,S\textsf{MIN}_{C,S}, AMINC,S\textsf{AMIN}_{C,S}, or SMINC,S\textsf{SMIN}_{C,S}, respectively, in β\beta time.

We make two remarks about this structure.

Remark 6.6 (Only require set oracle for C1(0)C^{-1}(0), not for C1(1)C^{-1}(1)).

Note that Definition 6.5 only requires the set oracles for the set SS of entries where CC is 0, and does not need the set oracles for the set [n]kS[n]^{k}\setminus S where CC is 11. The fact that both set oracles are not needed makes set-optimization structure easier to check than the original oracles in §3, because those effectively require optimization over both SS and [n]kS[n]^{k}\setminus S.

Remark 6.7 (Set-optimization structure is incomparable to graphical and low-rank plus sparse structure).

Costs C that satisfy Definition 6.5 in general do not have non-trivial graphical structure or low-rank plus sparse structure. Specifically, there are costs C that satisfy Definition 6.5, yet require the maximal treewidth k-1 to model via graphical structure, and super-constant rank or exponential sparsity to model via low-rank plus sparse structure. (A concrete example is the network reliability application in §6.3.) Because of the \mathsf{NP}-hardness of MOT problems with (k-1)-treewidth graphical structure or super-constant rank (AltBoi20hard, ), simply modeling such problems with graphical structure or low-rank plus sparse structure is therefore useless for the purpose of designing polynomial-time MOT algorithms.

6.2 Polynomial-time algorithms

By our oracle reductions in part 1 of this paper, in order to design polynomial-time algorithms for MOT with set-optimization structure, it suffices to design polynomial-time algorithms for the MIN, AMIN, or SMIN oracles. We show how to do this for all three oracles in a straightforward way by exploiting the set-optimization structure.

Theorem 6.8 (Polynomial-time algorithms for the MIN, AMIN, and SMIN oracles for costs with set-optimization structure).

If C(n)kC\in(\mathbb{R}^{n})^{\otimes k} is a cost tensor with exact, approximate, or soft set-optimization structure of complexity β\beta (see Definition 6.5), then the MINC\textsf{MIN}_{C}, AMINC\textsf{AMIN}_{C}, and SMINC\textsf{SMIN}_{C} oracles, respectively, can be computed in β+poly(n,k)\beta+\mathrm{poly}(n,k) time.

Algorithm 6 Polynomial-time algorithm for MIN for costs with exact set-optimization structure (Theorem 6.8).

Input: Access to CC via MINC,S\textsf{MIN}_{C,S} oracle, matrix pn×kp\in\mathbb{R}^{n\times k}
      Output: Solution to MINC(p)\textsf{MIN}_{C}(p)

1: a\leftarrow\textsf{MIN}_{C,S}(p) ▷ One oracle call
2: x\leftarrow-\sum_{i=1}^{k}\max_{j\in[n]}[p_{i}]_{j} ▷ Takes O(nk) time
3: Return a if a\leqslant x, or \min(a,1+x) otherwise ▷ Takes O(1) time
Algorithm 7 Polynomial-time algorithm for SMIN for costs with soft set-optimization structure (Theorem 6.8).

Input: Access to CC via SMINC,S\textsf{SMIN}_{C,S} oracle, matrix p¯n×kp\in\bar{\mathbb{R}}^{n\times k}, regularization η\eta
      Output: Solution to SMINC(p,η)\textsf{SMIN}_{C}(p,\eta)

1: a\leftarrow\exp(-\eta\cdot\textsf{SMIN}_{C,S}(p,\eta)) ▷ One oracle call
2: x\leftarrow\prod_{i=1}^{k}\sum_{j=1}^{n}\exp(\eta[p_{i}]_{j}) ▷ Takes O(nk) time
3: Return -\eta^{-1}\log(e^{-\eta}x+(1-e^{-\eta})a) ▷ Takes O(1) time
Proof.

Polynomial-time algorithm for MIN. We first claim that Algorithm 6 implements the MINC(p)\textsf{MIN}_{C}(p) oracle. To this end, define

a:=MINC,S(p)=minj[n]ks.t. Cj=0i=1k[pi]jiandb:=minj[n]ks.t. Cj=1i=1k[pi]ji.\displaystyle a:=\textsf{MIN}_{C,S}(p)=\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }C_{\vec{j}}=0\end{subarray}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\qquad\text{and}\qquad b:=\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }C_{\vec{j}}=1\end{subarray}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}. (6.1)

By re-arranging the sum and max, it follows that

x:=i=1kmaxj[n][pi]j=maxj[n]ki=1k[pi]ji=minj[n]ki=1k[pi]ji=min(a,b).\displaystyle x:=-\sum_{i=1}^{k}\max_{j\in[n]}[p_{i}]_{j}=-\max_{\vec{j}\in[n]^{k}}\sum_{i=1}^{k}[p_{i}]_{j_{i}}=\min_{\vec{j}\in[n]^{k}}\sum_{i=1}^{k}-[p_{i}]_{j_{i}}=\min(a,b). (6.2)

Therefore

MINC(p)=minj[n]kCji=1k[pi]ji=min(a,1+b)\displaystyle\textsf{MIN}_{C}(p)=\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}=\min(a,1+b) ={a if abmin(a,1+min(a,b)) if a>b\displaystyle=\begin{cases}a&\text{ if }a\leqslant b\\ \min(a,1+\min(a,b))&\text{ if }a>b\end{cases}
={a if axmin(a,1+x) if a>x,\displaystyle=\begin{cases}a&\text{ if }a\leqslant x\\ \min(a,1+x)&\text{ if }a>x\end{cases}, (6.3)

where above the first step is by definition of MINC\textsf{MIN}_{C}; the second step is by partitioning the minimization over j[n]k\vec{j}\in[n]^{k} into j\vec{j} such that Cj=0C_{\vec{j}}=0 or Cj=1C_{\vec{j}}=1, and then plugging in the definitions of aa and bb; the third step is by manipulating min(a,1+b)\min(a,1+b) in both cases; and the last step is because x=min(a,b)x=\min(a,b) as shown above. We conclude that Algorithm 6 correctly outputs MINC(p)\textsf{MIN}_{C}(p). Since the algorithm uses one call to the MINC,S\textsf{MIN}_{C,S} oracle and O(nk)O(nk) additional time, the claim is proven.

Polynomial-time algorithm for AMIN. Next, we claim that the same Algorithm 6, now run with the approximate oracle AMINC,S(p,ε)\textsf{AMIN}_{C,S}(p,\varepsilon) in the first step instead of the exact oracle MINC,S(p)\textsf{MIN}_{C,S}(p), computes a valid solution to AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon). To prove this, let aa, bb, and xx be as defined in (6.1) and (6.2) for the MIN analysis, and let a~=AMINC,S(p,ε)\tilde{a}=\textsf{AMIN}_{C,S}(p,\varepsilon). By the same logic as in (6.3), except now reversed, the output

{a~ if a~xmin(a~,1+x) if a~>x\displaystyle\begin{cases}\tilde{a}&\text{ if }\tilde{a}\leqslant x\\ \min(\tilde{a},1+x)&\text{ if }\tilde{a}>x\end{cases}

is equal to min(a~,1+b)\min(\tilde{a},1+b). Now because a~\tilde{a} is within additive ε\varepsilon error of aa (by definition of the AMINC,S\textsf{AMIN}_{C,S} oracle), it follows that the above output is within ε\varepsilon additive error of

min(a,1+b)=minj[n]kCji=1k[pi]ji=MINC(p).\min(a,1+b)=\min_{\vec{j}\in[n]^{k}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}=\textsf{MIN}_{C}(p).

Thus the output is a valid answer to AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon), establishing correctness. The runtime claim is obvious.

Polynomial-time algorithm for SMIN. Finally, we claim that Algorithm 7 implements the SMINC(p,η)\textsf{SMIN}_{C}(p,\eta) oracle. To this end, define

a:=e^{-\eta\cdot\textsf{SMIN}_{C,S}(p,\eta)}=\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }C_{\vec{j}}=0\end{subarray}}e^{\eta\sum_{i=1}^{k}[p_{i}]_{j_{i}}}\qquad\text{ and }\qquad b:=\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }C_{\vec{j}}=1\end{subarray}}e^{\eta\sum_{i=1}^{k}[p_{i}]_{j_{i}}}.

By re-arranging products and sums, it follows that

x:=\prod_{i=1}^{k}\sum_{j=1}^{n}e^{\eta[p_{i}]_{j}}=\sum_{\vec{j}\in[n]^{k}}\prod_{i=1}^{k}e^{\eta[p_{i}]_{j_{i}}}=a+b.

Therefore

SMINC(p,η)=1ηlog(j[n]keη(Cji=1k[pi]ji))=1ηlog(a+eηb)=1ηlog(eηx+(1eη)a),\textsf{SMIN}_{C}(p,\eta)=-\frac{1}{\eta}\log\left(\sum_{\vec{j}\in[n]^{k}}e^{-\eta(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}})}\right)=-\frac{1}{\eta}\log\left(a+e^{-\eta}b\right)=-\frac{1}{\eta}\log\left(e^{-\eta}x+(1-e^{-\eta})a\right),

where above the first step is by definition of SMINC\textsf{SMIN}_{C}; the second step is by partitioning the sum over j[n]k\vec{j}\in[n]^{k} into j\vec{j} such that Cj=0C_{\vec{j}}=0 or Cj=1C_{\vec{j}}=1, and then plugging in the definitions of aa and bb; and the third step is because x=a+bx=a+b as shown above. We conclude that Algorithm 7 correctly outputs SMINC(p,η)\textsf{SMIN}_{C}(p,\eta). Since the algorithm uses one call to the SMINC,S\textsf{SMIN}_{C,S} oracle and O(nk)O(nk) additional time, the claim is proven. ∎
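Algorithms 6 and 7 translate almost verbatim into code. The following sketch treats the set oracles as black-box callbacks; note (our caveat, not the paper's) that for large \eta the exponentials in the SMIN branch can overflow in floating point, so a careful implementation would work in log-space.

```python
import numpy as np

def min_oracle_setopt(min_set, p):
    """Algorithm 6: MIN_C(p) for C_j = 1[j not in S], via one MIN_{C,S} call."""
    a = min_set(p)                          # min over j in S of -sum_i p_i[j_i]
    x = -sum(p_i.max() for p_i in p)        # min over all of [n]^k, i.e. min(a, b)
    return a if a <= x else min(a, 1.0 + x)

def smin_oracle_setopt(smin_set, p, eta):
    """Algorithm 7: SMIN_C(p, eta) via one SMIN_{C,S} call."""
    a = np.exp(-eta * smin_set(p, eta))     # sum over S of exp(eta * sum_i p_i[j_i])
    x = np.prod([np.exp(eta * p_i).sum() for p_i in p])  # sum over all of [n]^k
    return -np.log(np.exp(-eta) * x + (1.0 - np.exp(-eta)) * a) / eta
```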

An immediate consequence of Theorem 6.8 combined with our oracle reductions is that all of the candidate MOT algorithms described in §4 can be efficiently implemented for MOT problems with set-optimization structure. From a theoretical perspective, the ELLIPSOID algorithm gives the best guarantee since it produces an exact, sparse solution in polynomial time.

Corollary 6.9 (Polynomial-time algorithms for MOT problems with set-optimization structure).

Let C(n)kC\in(\mathbb{R}^{n})^{\otimes k} be a cost tensor that has set-optimization structure with poly(n,k)\mathrm{poly}(n,k) complexity (see Definition 6.5).

  • Exact set-optimization structure. The ELLIPSOID algorithm in §4.1 computes an exact solution to MOTC\textsf{MOT}_{C} in poly(n,k)\mathrm{poly}(n,k) time. Also, the COLGEN algorithm in §4.1.3 can be run for TT iterations in poly(n,k,T)\mathrm{poly}(n,k,T) time.

  • Approximate set-optimization structure. The MWU algorithm in §4.2 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

  • Soft set-optimization structure. The SINKHORN algorithm in §4.3 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

Moreover, ELLIPSOID, MWU, and COLGEN output a polynomially sparse tensor, whereas SINKHORN outputs a fully dense tensor through the implicit representation described in §4.3.1.

Proof.

Combine the polynomial-time implementations of the oracles in Theorem 6.8 with the polynomial-time algorithm-to-oracle reductions in Theorems 4.1, 4.6, 4.7, and 4.18, respectively. ∎

6.3 Application vignette: network reliability with correlations

In this section, we illustrate this class of MOT structures via an application to network reliability, a central topic in network science, engineering, and operations research, see e.g., the textbooks (gertsbakh2011network, ; ball1995network, ; ball1986computational, ). The basic network reliability question is: given an undirected graph G=(V,E)G=(V,E) where each edge eEe\in E is reliable with some probability qeq_{e} and fails with probability 1qe1-q_{e}, what is the probability that all vertices are reachable from all others? This connectivity is desirable in applications, e.g., if GG is a computer cluster, the vertices are the machines, and the edges are communication links, then connectivity corresponds to the reachability of all machines. See the aforementioned textbooks for many other applications.

Of course, the above network reliability question is not yet well-defined since the edge failures are only prescribed up to their marginal distributions. The choice of joint distribution greatly impacts the answer.

The most classical setup posits that edge failures are independent (moore1956reliable, ). Denote the network reliability probability for this setting by \rho^{\mathrm{ind}}. Computing \rho^{\mathrm{ind}} is #P-complete (valiant1979complexity, ; provan1983complexity, ) and thus \mathsf{NP}-hard, but there exist fully polynomial randomized approximation schemes (a.k.a. FPRAS) for multiplicatively approximating both the connection probability \rho^{\mathrm{ind}} (karger2001randomized, ) and the failure probability 1-\rho^{\mathrm{ind}} (guo2019polynomial, ).

Here we investigate the setting of coordinated edge failures, which dates back to the 1980s (weiss1986stochastic, ; zemel1982polynomial, ). This coordination may optimize for disconnection (e.g., by an adversary), or for connection (e.g., to maximize the time a network is connected while performing maintenance on each edge e during a 1-q_{e} fraction of the time). We define these notions below; see also Figure 5 for an illustration. Below, \operatorname{Ber}(q_{e}) denotes the Bernoulli distribution with parameter q_{e}.

Definition 6.10 (Network reliability with correlations).

For an undirected graph G=(V,E)G=(V,E) and edge reliability probabilities {qe}eE\{q_{e}\}_{e\in E}:

  • The worst-case network reliability is

    ρmin:=minP({Ber(qe)}eE)HP[H is a connected subgraph of G].\rho^{\min}:=\min_{P\in\mathcal{M}(\{\operatorname{Ber}(q_{e})\}_{e\in E})}\mathbb{P}_{H\sim P}\left[H\text{ is a connected subgraph of }G\right].
  • The best-case network reliability is

    ρmax:=maxP({Ber(qe)}eE)HP[H is a connected subgraph of G].\rho^{\max}:=\max_{P\in\mathcal{M}(\{\operatorname{Ber}(q_{e})\}_{e\in E})}\mathbb{P}_{H\sim P}[H\text{ is a connected subgraph of }G].
Figure 5: Optimal decompositions for the worst-case (top) and best-case (bottom) reliability problems on the same graph G and edge reliability probabilities q_{e} (left). Coordinating edge failures yields significantly different connection probabilities: \rho^{\min}=40\%, \rho^{\mathrm{ind}}\approx 60\%, and \rho^{\max}=90\%.

Clearly \rho^{\min}\leqslant\rho^{\mathrm{ind}}\leqslant\rho^{\max}. These gaps can be large (e.g., see Figure 5), which promises large opportunities for applications in which coordination is possible. However, realizing such an opportunity requires being able to compute \rho^{\min} and \rho^{\max}, and both of these problems require solving an exponentially large LP with 2^{|E|} variables. Below we show how to use set-optimization structure to compute these quantities in \mathrm{poly}(|E|) time, thereby recovering as special cases of our general framework the known polynomial-time algorithms for this particular problem in (zemel1982polynomial, ; weiss1986stochastic, ), as well as more practical polynomial-time algorithms that scale to input sizes an order of magnitude larger.

Corollary 6.11 (Polynomial-time algorithm for network reliability with correlations).

The worst-case and best-case network reliability can both be computed in poly(|E|)\mathrm{poly}(|E|) time.

Proof.

By the observation in Example 6.1, the optimization problems defining \rho^{\min} and 1-\rho^{\max} are instances of MOT in which k=|E|, n=2, \mu_{e}=\operatorname{Ber}(q_{e}), and the cost C\in(\mathbb{R}^{2})^{\otimes|E|} is \{0,1\}-valued, each entry indicating whether the corresponding subset of edges forms a connected or disconnected subgraph of G, respectively. It therefore suffices to show that both of these MOT cost tensors satisfy exact set-optimization structure (Definition 6.5), since that implies a polynomial-time algorithm for exactly solving MOT (Corollary 6.9).

Set-optimization structure for 1ρmax1-\rho^{\max}.

In this case, SS is the set of connected subgraphs of GG. Thus the MINC,S\textsf{MIN}_{C,S} problem is: given weights p2×|E|p\in\mathbb{R}^{2\times|E|}, compute

minconnected subgraph H of GeHp2,eeHp1,e.\min_{\text{connected subgraph }H\text{ of }G}-\sum_{e\in H}p_{2,e}-\sum_{e\notin H}p_{1,e}.

Note that this objective is equal to eHxeeEp1,e\sum_{e\in H}x_{e}-\sum_{e\in E}p_{1,e} where xe:=p1,ep2,ex_{e}:=p_{1,e}-p_{2,e}. Since the latter sum is independent of HH, the MINC,S\textsf{MIN}_{C,S} problem therefore reduces to the problem of finding a minimum-weight connected subgraph in GG; that is, given edge weights x|E|x\in\mathbb{R}^{|E|}, compute

minconnected subgraph H of GeHxe.\displaystyle\min_{\text{connected subgraph }H\text{ of }G}\sum_{e\in H}x_{e}. (6.4)

We first show how to solve this in polynomial time in the case that all edge weights xex_{e} are positive. In this case, the optimal solution HH is a minimum-weight spanning tree of GG. This can be found by Kruskal’s algorithm in O(|E|log|E|)O(|E|\log|E|) time (kruskal1956shortest, ).

For the general case of arbitrary edge weights, note that the edges e with non-positive weight x_{e}\leqslant 0 can be added to any solution without worsening the cost or feasibility. Thus these edges are without loss of generality in every solution H, and so it suffices to solve the same problem (6.4) on the graph G^{\prime} obtained by contracting these non-positively-weighted edges in G. This reduces (6.4) to the same problem of finding a minimum-weight connected subgraph, except now in the special case that all edge weights are positive. Since we have already shown how to solve this case in polynomial time, this establishes exact set-optimization structure for 1-\rho^{\max}.

Set-optimization structure for ρmin\rho^{\min}.

In this case, SS is the set of disconnected subgraphs of GG. We may simplify the MINC,S\textsf{MIN}_{C,S} problem for ρmin\rho^{\min} by re-parameterizing the input p2×|E|p\in\mathbb{R}^{2\times|E|} to edge weights x|E|x\in\mathbb{R}^{|E|} as done above in (6.4) for 1ρmax1-\rho^{\max}. Thus the MINC,S\textsf{MIN}_{C,S} problem for ρmin\rho^{\min} is: given weights x|E|x\in\mathbb{R}^{|E|}, compute

mindisconnected subgraph H of GeHxe.\displaystyle\min_{\text{disconnected subgraph }H\text{ of }G}\sum_{e\in H}x_{e}. (6.5)

We first show how to solve this in the case that all edge weights xex_{e} are negative. In that case, the optimal solution is of the form H=ECH=E\setminus C, where CC is a maximum-weight cut of the graph GG with weights xex_{e}. Equivalently, by negating all edge weights, CC is a minimum-weight cut of the graph GG with weights xe-x_{e}. Since a minimum-weight cut of a graph with positive weights can be found in polynomial time (stoer1997simple, ), the problem (6.5) can be solved in polynomial time when all xex_{e} are negative.

Now in the general case of arbitrary edge weights, note that the edges e with non-negative weight x_{e}\geqslant 0 can be removed from any solution without worsening the cost or feasibility. Thus these edges are without loss of generality excluded from every solution H, and so it suffices to solve the same problem (6.5) on the graph G^{\prime} obtained by deleting these non-negatively-weighted edges in G. This reduces (6.5) to the same problem of finding a minimum-weight disconnected subgraph, except now in the special case that all edge weights are negative. Since we have already shown how to solve this case in polynomial time, the proof is complete. ∎
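As one half of this proof in code, the following sketch (names ours) implements the \textsf{MIN}_{C,S} oracle for 1-\rho^{\max}, i.e., the minimum-weight connected subgraph problem (6.4): non-positively-weighted edges are taken for free (contraction), and Kruskal's algorithm connects what remains. It assumes the input graph G is connected. The oracle for \rho^{\min} is analogous, with edge deletion and a minimum cut (e.g., Stoer-Wagner) in place of contraction and a spanning tree.

```python
class DisjointSets:
    """Union-find, used both for contraction and for Kruskal's algorithm."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def min_weight_connected_subgraph(n_vertices, edges):
    """Solve (6.4): minimize sum of x_e over H such that (V, H) is connected.

    edges : list of (u, v, x_e) triples; assumes the full graph is connected.
    """
    dsu, value = DisjointSets(n_vertices), 0.0
    for u, v, w in edges:
        if w <= 0:              # including e cannot hurt: contract it
            dsu.union(u, v)
            value += w
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if w > 0 and dsu.union(u, v):   # Kruskal on the contracted graph
            value += w
    return value
```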

Figure 6: Top: comparison of the runtime (left) and accuracy (right) of the algorithms described in the main text, for the worst-case reliability of a clique graph on t vertices and k=\binom{t}{2} edges with reliability probabilities q_{e}=0.99. Bottom: same, but for best-case reliability and reliability probabilities q_{e}=0.01. For worst-case reliability, the algorithms compute an upper bound, so a smaller value is better; the reverse holds for best-case reliability. The algorithms are cut off at 2 minutes, denoted by an “x”. SINKHORN is run at the highest precision (i.e., highest \eta) before numerical precision issues. The COLGEN algorithm that our framework recovers computes exact solutions an order of magnitude faster than the other algorithms, and the new MWU algorithm computes reasonably approximate solutions for k=400, which amounts to an MOT LP with n^{k}=2^{400}\approx 2.6\times 10^{120} variables.

In Figure 6, we compare the numerical performance of the algorithms in Corollary 6.11 (COLGEN and MWU with polynomial-time implementations of their bottlenecks) with the fastest previous algorithms for both best-case and worst-case network reliability. Previously, the fastest algorithms that apply to this problem are (1) out-of-the-box LP solvers run on MOT, (2) the brute-force implementation of SINKHORN which marginalizes over all n^{k}=2^{|E|} entries in each iteration, and (3) the COLGEN algorithm that we recover (zemel1982polynomial, ; weiss1986stochastic, ). It is unknown whether there is a practically efficient implementation of the \textsf{SMIN}_{C,S} oracle (and thus of SINKHORN) for either best-case or worst-case reliability. Since the previous algorithms (1) and (2) have exponential runtime that scales as n^{\Omega(k)}=2^{\Omega(|E|)}, they do not scale past tiny input sizes. In contrast, the algorithms in Corollary 6.11 scale to much larger inputs. Indeed, the COLGEN algorithm that our framework recovers can compute exact solutions roughly an order of magnitude faster than the other algorithms, and the new MWU algorithm computes reasonably approximate solutions beyond k=400, which amounts to an MOT LP with n^{k}=2^{400}\approx 2.6\times 10^{120} variables.

7 Application: MOT problems with low-rank plus sparse structure

In this section, we consider MOT problems whose cost tensors CC decompose into low-rank and sparse components. We propose the first polynomial-time algorithms for this general class of MOT problems.

The section is organized as follows. In §7.1 we formally describe this setup and discuss why it is incomparable to all other structures discussed in this paper. In §7.2, we show that for costs with this structure, the AMIN and SMIN oracles can be implemented in polynomial time; from this it immediately follows that MWU and SINKHORN can be implemented in polynomial time. Finally, in §7.3 and §7.4, we provide two illustrative applications of these algorithms. The former regards portfolio risk management and is a direct application of our result for MOT with low-rank cost tensors. The latter regards projecting mixture distributions onto the transportation polytope and illustrates the versatility of our algorithmic results, since this problem is quadratic optimization over the transportation polytope rather than linear (a.k.a. MOT).

7.1 Setup

We begin by recalling the definition of tensor rank. It is the direct analog of the standard concept of matrix rank. See the survey (kolda2009tensor, ) for further background.

Definition 7.1 (Tensor rank).

A rank-rr factorization of a tensor R(n)kR\in(\mathbb{R}^{n})^{\otimes k} is a collection of rkrk vectors {ui,}i[k],[r]n\{u_{i,\ell}\}_{i\in[k],\ell\in[r]}\subset\mathbb{R}^{n} satisfying

R==1ri=1kui,.R=\sum_{\ell=1}^{r}\bigotimes_{i=1}^{k}u_{i,\ell}.

The rank of a tensor is the minimal rr for which there exists a rank-rr factorization.
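To make the factored representation concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) assembles a rank-$r$ tensor from its $rk$ factor vectors for $k=3$. The point is the storage gap: the factored form uses only $rkn$ numbers, whereas the dense tensor has $n^k$ entries.

```python
import numpy as np

# A rank-r tensor in (R^n)^{(x)k} assembled from its r*k factor vectors u_{i,l},
# here for k = 3. Materializing the dense tensor (done only for illustration)
# costs n^k entries, while the factored form stores just r*k*n numbers.
n, k, r = 4, 3, 2
rng = np.random.default_rng(0)
u = rng.random((k, r, n))  # u[i, l] is the factor vector u_{i,l}

R = sum(np.einsum('a,b,c->abc', u[0, l], u[1, l], u[2, l]) for l in range(r))
print(R.shape)  # (4, 4, 4): n^k = 64 dense entries vs r*k*n = 24 factored numbers
```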

In this section we consider MOT problems with the following “low-rank plus sparse” structure.

Definition 7.2 (Low-rank plus sparse structure for MOT).

An MOT cost tensor C(n)kC\in(\mathbb{R}^{n})^{\otimes k} has low-rank plus sparse structure of rank rr and sparsity ss if it decomposes as

C=R+S,\displaystyle C=R+S, (7.1)

where RR is a rank-rr tensor and SS is an ss-sparse tensor.

Throughout, we make the natural assumption that SS is input through its ss non-zero entries, and that RR is input through a rank-rr factorization. We also make the natural assumption that the entries of both RR and SS are of size O(Cmax)O(C_{\max})—this rules out the case of having extremely large entries of RR and SS, one positive and one negative, which cancel to yield a small entry of C=R+SC=R+S.

Remark 7.3 (Neither low-rank structure nor sparse structure can be modeled by graphical structure or set-optimization structure).

In general, both rank-11 costs and polynomially sparse costs do not have non-trivial graphical structure. Specifically, modeling these costs with graphical structure requires the complete graph (a.k.a., maximal treewidth of k1k-1)—and because MOT problems with graphical structure of treewidth k1k-1 are 𝖭𝖯\mathsf{NP}-hard to solve in the absence of further structure (AltBoi20hard, ), modeling such problems with graphical structure is useless for the purpose of designing polynomial-time MOT algorithms. It is also clear that neither low-rank structure nor sparse structure can be modeled by set-optimization structure because in general, neither RR nor SS nor R+SR+S has binary-valued entries.

7.2 Polynomial-time algorithms

From a technical perspective, the main result of this section is that there is a polynomial-time algorithm for approximating the minimum entry of a tensor that decomposes into constant-rank and sparse components. Previously, this was not known even for constant-rank tensors. This result may be of independent interest. We remark that this result is optimal in the sense that unless 𝖯=𝖭𝖯\mathsf{P}=\mathsf{NP}, there does not exist an algorithm with runtime that is jointly polynomial in the input size and the rank rr (AltBoi20hard, ).

Theorem 7.4 (Polynomial-time algorithm solving AMIN and SMIN for low-rank + sparse costs).

Consider cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k} that have low-rank plus sparse structure of rank rr and sparsity ss (see Definition 7.2). For any fixed rr, Algorithm 8 runs in poly(n,k,s,Cmax/ε)\mathrm{poly}(n,k,s,C_{\max}/\varepsilon) time and solves the ε\varepsilon-approximate AMINC\textsf{AMIN}_{C} oracle. Furthermore, it also solves the SMINC~\textsf{SMIN}_{\tilde{C}} oracle for η=(2klogn)/ε\eta=(2k\log n)/\varepsilon on some cost tensor C~(n)k\tilde{C}\in(\mathbb{R}^{n})^{\otimes k} satisfying CC~maxε/2\|C-\tilde{C}\|_{\max}\leqslant\varepsilon/2.

We make three remarks about Theorem 7.4. First, we are unaware of any polynomial-time implementation of $\textsf{SMIN}_{C}$ for the cost $C$ itself. Instead, Theorem 7.4 solves the $\textsf{SMIN}_{\tilde{C}}$ oracle for an $O(\varepsilon)$-approximate cost tensor $\tilde{C}$, since this is sufficient for implementing SINKHORN on the original cost tensor $C$ (see Corollary 7.5 below). Second, it is an interesting open question whether the $\mathrm{poly}(n,k,C_{\max}/\varepsilon)$ runtime for the $\varepsilon$-approximate $\textsf{AMIN}_{C}$ oracle can be improved to $\mathrm{poly}(n,k,\log(C_{\max}/\varepsilon))$, as this would imply a $\mathrm{poly}(n,k)$ runtime for the $\textsf{MIN}_{C}$ oracle and thus for this class of MOT problems (see also Footnote 5 in the introduction). Third, we remark on practical efficiency: the runtime of Algorithm 8 is not just polynomial in $s$ and $n$, but in fact linear in $s$ and near-linear in $n$. However, since this improved runtime is not needed for the theoretical results in the sequel, we do not pursue it further.

Combining the efficient oracle implementations in Theorem 7.4 with our algorithm-to-oracles reductions in §4 implies the first polynomial-time algorithms for MOT problems with costs that have constant-rank plus sparse structure. As noted above, this is optimal in the sense that, unless $\mathsf{P}=\mathsf{NP}$, no algorithm has runtime jointly polynomial in the input size and the rank $r$ (AltBoi20hard, ).

Corollary 7.5 (Polynomial-time algorithms solving MOT for low-rank + sparse costs).

Consider cost tensors C(n)kC\in(\mathbb{R}^{n})^{\otimes k} that have low-rank plus sparse structure of constant rank rr and poly(n,k)\mathrm{poly}(n,k) sparsity ss (see Definition 7.2). For any ε>0\varepsilon>0:

  • The MWU algorithm in §4.2 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

  • The SINKHORN algorithm in §4.3 computes an ε\varepsilon-approximate solution to MOTC\textsf{MOT}_{C} in poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) time.

Moreover, MWU outputs a polynomially sparse tensor, whereas SINKHORN outputs a fully dense tensor through the implicit representation described in §4.3.1.

Proof.

For MWU, simply combine the polynomial-time reduction to the AMINC\textsf{AMIN}_{C} oracle (Theorem 4.7) with the polynomial-time algorithm for the AMIN oracle (Theorem 7.4). For SINKHORN, combining the polynomial-time reduction to the SMINC~\textsf{SMIN}_{\tilde{C}} oracle (Theorem 4.18) with the polynomial-time algorithm for the SMINC~\textsf{SMIN}_{\tilde{C}} oracle (Theorem 7.4) yields a poly(n,k,Cmax/ε)\mathrm{poly}(n,k,C_{\max}/\varepsilon) algorithm for ε/2\varepsilon/2-approximating the MOT problem with cost tensor C~\tilde{C}. It therefore suffices to show that the values of the MOT problems with cost tensors CC and C~\tilde{C} differ by at most ε/2\varepsilon/2, that is,

|minP(μ1,,μk)P,CminP(μ1,,μk)P,C~|ε/2.\left\lvert\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,C\rangle-\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,\tilde{C}\rangle\right\rvert\leqslant\varepsilon/2.

But this holds because both MOT problems have the same feasible set, and for any feasible P(μ1,,μk)P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}) it follows from Hölder’s inequality that the objectives of the two MOT problems differ by at most

|P,CP,C~|P1CC~maxε/2.\left\lvert\langle P,C\rangle-\langle P,\tilde{C}\rangle\right\rvert\leqslant\|P\|_{1}\|C-\tilde{C}\|_{\max}\leqslant\varepsilon/2.

Below, we describe the algorithm in Theorem 7.4. Specifically, in §7.2.1, we give four helper lemmas which form the technical core of our algorithm; then, in §7.2.2, we combine these ingredients to design the algorithm and prove its correctness. Throughout, recall that we use the bracket notation $f[A]$ to denote the entrywise application of a univariate function $f$ (e.g., $\exp$, $\log$, or a polynomial) to $A$.

7.2.1 Technical ingredients

At a high level, our approach to designing the algorithm in Theorem 7.4 is to approximately compute the SMIN oracle in polynomial time by synthesizing four facts:

  1. 1.

    By expanding the softmin and performing simple operations, it suffices to compute the total sum of all nkn^{k} entries of the entrywise exponentiated tensor exp[ηR]\exp[-\eta R] (modulo simple transforms).

  2. 2.

    Although exp[ηR]\exp[-\eta R] is in general a full-rank tensor, we can exploit the fact that RR is a low-rank tensor in order to approximate exp[ηR]\exp[-\eta R] by a low-rank tensor LL. (Moreover, we can efficiently compute a low-rank factorization of LL in closed form.)

  3. 3.

    There is a simple algorithm for computing the sum of all nkn^{k} entries of LL in polynomial time because LL is low-rank. (And thus we may approximate the sum of all nkn^{k} entries of exp[ηR]\exp[-\eta R] as desired in step 1.)

  4. 4.

    This approximation is sufficient for computing both the AMIN and SMIN oracle in Theorem 7.4.

Of these four steps, the main technical step is the low-rank approximation in step two. Below, we formalize these four steps individually in Lemmas 7.6, 7.7, 7.8, and 7.9. Further detail on how to synthesize these four steps is provided afterwards, in the proof of Theorem 7.4.

It is convenient to write the first lemma in terms of an approximate tensor C~=R~+S\tilde{C}=\tilde{R}+S rather than the original cost C=R+SC=R+S.

Lemma 7.6 (Softmin for cost with sparse component).

Let C~=R~+S\tilde{C}=\tilde{R}+S and p1,,pknp_{1},\dots,p_{k}\in\mathbb{R}^{n}. Then

sminηj[n]kC~ji=1k[pi]ji=η1log(a+b),\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in[n]^{k}}\tilde{C}_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}=-\eta^{-1}\log(a+b),

where di:=exp[ηpi]0nd_{i}:=\exp[\eta p_{i}]\in\mathbb{R}_{\geqslant 0}^{n},

a:=j[n]ks.t. Sj0i=1k[di]jieηR~j(eηSj1)\displaystyle a:=\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }S_{\vec{j}}\neq 0\end{subarray}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}\cdot(e^{-\eta S_{\vec{j}}}-1) (7.2)

and

b:=j[n]ki=1k[di]jieηR~j.\displaystyle b:=\sum_{\vec{j}\in[n]^{k}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}. (7.3)
Proof.

By expanding the definition of softmin, and then substituting pip_{i} with did_{i} and C~\tilde{C} with R~+S\tilde{R}+S,

sminηj[n]kC~ji=1k[pi]ji\displaystyle\operatorname*{{\textstyle{smin_{\eta}}}}_{\vec{j}\in[n]^{k}}\tilde{C}_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} =1ηlog(j[n]keηi=1k[pi]jieηC~j)=1ηlog(j[n]ki=1k[di]jieηR~jeηSj).\displaystyle=-\frac{1}{\eta}\log\left(\sum_{\vec{j}\in[n]^{k}}e^{\eta\sum_{i=1}^{k}[p_{i}]_{j_{i}}}e^{-\eta\tilde{C}_{\vec{j}}}\right)=-\frac{1}{\eta}\log\left(\sum_{\vec{j}\in[n]^{k}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}e^{-\eta S_{\vec{j}}}\right).

By simple manipulations, we conclude that the above quantity is equal to the desired quantity:

=\displaystyle\cdots= 1ηlog(j[n]ks.t. Sj0i=1k[di]jieηR~jeηSj+j[n]ks.t. Sj=0i=1k[di]jieηR~j)\displaystyle-\frac{1}{\eta}\log\left(\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }S_{\vec{j}}\neq 0\end{subarray}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}e^{-\eta S_{\vec{j}}}+\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }S_{\vec{j}}=0\end{subarray}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}\right)
=\displaystyle= 1ηlog(j[n]ks.t. Sj0i=1k[di]jieηR~j(eηSj1)+j[n]ki=1k[di]jieηR~j)\displaystyle-\frac{1}{\eta}\log\left(\sum_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }S_{\vec{j}}\neq 0\end{subarray}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}\left(e^{-\eta S_{\vec{j}}}-1\right)+\sum_{\vec{j}\in[n]^{k}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}\right)
=\displaystyle= 1ηlog(a+b).\displaystyle-\frac{1}{\eta}\log(a+b).

Above, the first step is by partitioning the sum over $\vec{j}\in[n]^{k}$ based on whether $S_{\vec{j}}=0$, the second step is by adding and subtracting $\sum_{\vec{j}\in[n]^{k}\text{ s.t. }S_{\vec{j}}\neq 0}\prod_{i=1}^{k}[d_{i}]_{j_{i}}\cdot e^{-\eta\tilde{R}_{\vec{j}}}$, and the last step is by the definitions of $a$ and $b$. ∎

Lemma 7.7 (Low-rank approximation of the exponential of a low-rank tensor).

There is an algorithm that given R(n)kR\in(\mathbb{R}^{n})^{\otimes k} in rank-rr factored form, η>0\eta>0, and a precision ε~<eηRmax\tilde{\varepsilon}<e^{-\eta R_{\max}}, takes npoly(k,r~)n\cdot\mathrm{poly}(k,\tilde{r}) time to compute a rank-r~\tilde{r} tensor L(n)kL\in(\mathbb{R}^{n})^{\otimes k} in factored form satisfying Lexp[ηR]maxε~\|L-\exp[-\eta R]\|_{\max}\leqslant\tilde{\varepsilon}, where

r~(r+O(log1ε~)r).\displaystyle\tilde{r}\leqslant\binom{r+O(\log\tfrac{1}{\tilde{\varepsilon}})}{r}. (7.4)
Proof.

By classical results from approximation theory (see, e.g., (Tre19, )), there exists a polynomial qq of degree m=O(log1/ε~)m=O(\log 1/\tilde{\varepsilon}) satisfying

|exp(ηx)q(x)|ε~,x[Rmax,Rmax].\left\lvert\exp(-\eta x)-q(x)\right\rvert\leqslant\tilde{\varepsilon},\qquad\forall x\in[-R_{\max},R_{\max}].

For instance, the Taylor or Chebyshev expansion of xexp(ηx)x\mapsto\exp(-\eta x) suffices. Thus the tensor LL with entries

Lj=q(Rj)L_{\vec{j}}=q(R_{\vec{j}})

approximates exp[ηR]\exp[-\eta R] to error

Lexp[ηR]maxε~.\|L-\exp[-\eta R]\|_{\max}\leqslant\tilde{\varepsilon}.

We now show that LL has rank r~(r+mr)\tilde{r}\leqslant\binom{r+m}{r}, and moreover that a rank-r~\tilde{r} factorization can be computed in npoly(k,r~)n\cdot\mathrm{poly}(k,\tilde{r}) time. Denote q(x)=t=0matxtq(x)=\sum_{t=0}^{m}a_{t}x^{t} and R==1ri=1kui,R=\sum_{\ell=1}^{r}\otimes_{i=1}^{k}u_{i,\ell}. By definition of LL, definition of qq and RR, and then the Multinomial Theorem,

L_{\vec{j}}=q(R_{\vec{j}})=\sum_{t=0}^{m}a_{t}\left(\sum_{\ell=1}^{r}\prod_{i=1}^{k}[u_{i,\ell}]_{j_{i}}\right)^{t}=\sum_{\alpha\in\mathbb{N}_{0}^{r}\,:\,|\alpha|\leqslant m}\binom{|\alpha|}{\alpha}a_{|\alpha|}\prod_{\ell=1}^{r}\prod_{i=1}^{k}[u_{i,\ell}]_{j_{i}}^{\alpha_{\ell}},

where the sum is over rr-tuples α\alpha with non-negative entries summing to at most mm. Thus

L=α0r:|α|mi=1kvi,α,L=\sum_{\alpha\in\mathbb{N}_{0}^{r}\,:\,|\alpha|\leqslant m}\bigotimes_{i=1}^{k}v_{i,\alpha},

where $v_{i,\alpha}\in\mathbb{R}^{n}$ denotes the vector with $j$-th entry $\binom{|\alpha|}{\alpha}a_{|\alpha|}\prod_{\ell=1}^{r}[u_{1,\ell}]_{j}^{\alpha_{\ell}}$ for $i=1$, and $\prod_{\ell=1}^{r}[u_{i,\ell}]_{j}^{\alpha_{\ell}}$ for $i>1$. This yields the desired low-rank factorization of $L$ because

r~#{α0r:|α|m}=(r+mr).\tilde{r}\leqslant\#\{\alpha\in\mathbb{N}_{0}^{r}\;:\;|\alpha|\leqslant m\}=\binom{r+m}{r}.

Finally, since each of the kr~k\tilde{r} vectors vi,αv_{i,\alpha} in the factorization of LL can be computed efficiently from the closed-form expression above, the desired runtime follows. ∎
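The following NumPy sketch illustrates the construction in the proof, using the Taylor coefficients $a_t=(-\eta)^t/t!$ for concreteness and a hand-picked degree $m$ rather than the precision-driven choice from the lemma; it is an illustration of the argument, not an optimized implementation.

```python
import numpy as np
from itertools import product
from math import factorial

def exp_low_rank_factors(U, eta, m):
    """Factors V[i, idx] of L = q[R], where q is the degree-m Taylor polynomial
    of exp(-eta*x) and R is given in factored form U of shape (k, r, n).
    One factor per multi-index alpha with |alpha| <= m, so r~ <= C(r+m, r)."""
    k, r, n = U.shape
    alphas = [al for al in product(range(m + 1), repeat=r) if sum(al) <= m]
    V = np.ones((k, len(alphas), n))
    for idx, alpha in enumerate(alphas):
        for i in range(k):
            for l in range(r):
                V[i, idx] *= U[i, l] ** alpha[l]
        # (|alpha| choose alpha) * a_{|alpha|} = (-eta)^{|alpha|} / prod_l alpha_l!
        V[0, idx] *= (-eta) ** sum(alpha) / np.prod([factorial(al) for al in alpha])
    return V

# Sanity check on a tiny instance: the factored L matches exp[-eta*R] entrywise.
rng = np.random.default_rng(1)
k, r, n, eta, m = 3, 2, 4, 0.5, 12
U = rng.random((k, r, n))
R = sum(np.einsum('a,b,c->abc', U[0, l], U[1, l], U[2, l]) for l in range(r))
V = exp_low_rank_factors(U, eta, m)
L = sum(np.einsum('a,b,c->abc', V[0, idx], V[1, idx], V[2, idx])
        for idx in range(V.shape[1]))
print(np.abs(L - np.exp(-eta * R)).max())  # tiny, since eta*R_max <= 1 here
```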

Lemma 7.8 (Marginalizing a scaled low-rank tensor).

Given vectors d1,,dknd_{1},\dots,d_{k}\in\mathbb{R}^{n} and a tensor L(n)kL\in(\mathbb{R}^{n})^{\otimes k} through a rank r~\tilde{r} factorization, we can compute m((i=1kdi)L)m((\otimes_{i=1}^{k}d_{i})\odot L) in O(nkr~)O(nk\tilde{r}) time.

Proof.

Denote the factorization of LL by L==1r~i=1kvi,L=\sum_{\ell=1}^{\tilde{r}}\otimes_{i=1}^{k}v_{i,\ell}. Then

m((i=1kdi)L)\displaystyle m((\otimes_{i=1}^{k}d_{i})\odot L) =j[n]k[(i=1kdi)L]j=j[n]k=1r~i=1k[di]ji[vi,]ji==1r~i=1kj=1n[di]j[vi,]j==1r~i=1kdi,vi,,\displaystyle=\sum_{\vec{j}\in[n]^{k}}\left[(\otimes_{i=1}^{k}d_{i})\odot L\right]_{\vec{j}}=\sum_{\vec{j}\in[n]^{k}}\sum_{\ell=1}^{\tilde{r}}\prod_{i=1}^{k}[d_{i}]_{j_{i}}[v_{i,\ell}]_{j_{i}}=\sum_{\ell=1}^{\tilde{r}}\prod_{i=1}^{k}\sum_{j=1}^{n}[d_{i}]_{j}[v_{i,\ell}]_{j}=\sum_{\ell=1}^{\tilde{r}}\prod_{i=1}^{k}\langle d_{i},v_{i,\ell}\rangle,

where the first step is by definition of the $m(\cdot)$ operation that sums over all entries, the second step is by definition of $L$, and the third step is by swapping products and sums. Thus computing the desired quantity amounts to computing $\tilde{r}k$ inner products of $n$-dimensional vectors, which can be done in $O(nk\tilde{r})$ time. ∎
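A short sketch of this computation, with a brute-force check on a tiny instance (for illustration only; in the algorithm the dense $n^k$ tensors are of course never materialized):

```python
import numpy as np

def marginalize_scaled_low_rank(d, V):
    """m((d_1 (x) ... (x) d_k) ⊙ L) = sum_l prod_i <d_i, v_{i,l}>, computed
    from the factors V of shape (k, r~, n) in O(n*k*r~) time."""
    k, r_tilde, n = V.shape
    return sum(np.prod([d[i] @ V[i, l] for i in range(k)]) for l in range(r_tilde))

# Brute-force check on a tiny k = 3 instance.
rng = np.random.default_rng(2)
k, r_tilde, n = 3, 2, 4
V = rng.random((k, r_tilde, n))
d = [rng.random(n) for _ in range(k)]
L = sum(np.einsum('a,b,c->abc', V[0, l], V[1, l], V[2, l]) for l in range(r_tilde))
dense = np.einsum('a,b,c->abc', *d) * L
assert np.isclose(dense.sum(), marginalize_scaled_low_rank(d, V))
```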

Lemma 7.9 (Precision of the low-rank approximation).

Let $\varepsilon\leqslant 1$ and $\eta\geqslant 1$. Suppose $L\in(\mathbb{R}^{n})^{\otimes k}$ satisfies $\|L-\exp[-\eta R]\|_{\max}\leqslant\tfrac{\varepsilon}{3}e^{-\eta R_{\max}}$. Then the tensor $\tilde{C}:=-\frac{1}{\eta}\log[L]+S$ satisfies

C~Cmaxε2.\displaystyle\|\tilde{C}-C\|_{\max}\leqslant\frac{\varepsilon}{2}. (7.5)
Proof.

Observe that the minimum entry of LL is at least

eηRmaxε3eηRmax23eηRmax.\displaystyle e^{-\eta R_{\max}}-\tfrac{\varepsilon}{3}e^{-\eta R_{\max}}\geqslant\tfrac{2}{3}e^{-\eta R_{\max}}. (7.6)

Since this is strictly positive, the tensor R~:=η1log[L]\tilde{R}:=-\eta^{-1}\log[L] is well defined. Furthermore,

ηR~ηRmax=maxj[n]k|ηR~jηRj|maxj[n]k|LjeηRj|min(Lj,eηRj)ε3eηRmax23eηRmax=ε2,\|\eta\tilde{R}-\eta R\|_{\max}=\max_{\vec{j}\in[n]^{k}}\left\lvert\eta\tilde{R}_{\vec{j}}-\eta R_{\vec{j}}\right\rvert\leqslant\max_{\vec{j}\in[n]^{k}}\frac{\left\lvert L_{\vec{j}}-e^{-\eta R_{\vec{j}}}\right\rvert}{\min(L_{\vec{j}},e^{-\eta R_{\vec{j}}})}\leqslant\frac{\tfrac{\varepsilon}{3}e^{-\eta R_{\max}}}{\tfrac{2}{3}e^{-\eta R_{\max}}}=\frac{\varepsilon}{2},

where above the first step is by definition of the max norm; the second step is by the elementary inequality |logxlogy||xy|/min(x,y)|\log x-\log y|\leqslant|x-y|/\min(x,y) which holds for positive scalars xx and yy (AltBacRud19, , Lemma K); and the third step is by (7.6) and the approximation bound of LL. Since η1\eta\geqslant 1, we therefore conclude that R~Rmaxε/2\|\tilde{R}-R\|_{\max}\leqslant\varepsilon/2. By adding and subtracting SS, this implies C~Cmax=R~Rmaxε/2\|\tilde{C}-C\|_{\max}=\|\tilde{R}-R\|_{\max}\leqslant\varepsilon/2. ∎

7.2.2 Proof of Theorem 7.4

We are now ready to state the algorithm in Theorem 7.4. Pseudocode is in Algorithm 8. Note that R~=η1log[L]\tilde{R}=-\eta^{-1}\log[L] and C~=R~+S\tilde{C}=\tilde{R}+S are never explicitly computed because in both Lines 3 and 4, the algorithm performs the relevant operations only through the low-rank tensor LL and the sparse tensor SS.

Algorithm 8 Polynomial-time algorithm for AMIN and SMIN for low-rank + sparse costs (Theorem 7.4).

Input: Low-rank tensor RR, sparse tensor SS, matrix p¯n×kp\in\bar{\mathbb{R}}^{n\times k}, accuracy ε>0\varepsilon>0
      Output: Solution to both AMINC(p,ε)\textsf{AMIN}_{C}(p,\varepsilon) on cost tensor C=R+SC=R+S, and also SMINC~(p,(2klogn)/ε)\textsf{SMIN}_{\tilde{C}}(p,(2k\log n)/\varepsilon) on some approximate cost tensor C~\tilde{C} satisfying CC~maxε/2\|C-\tilde{C}\|_{\max}\leqslant\varepsilon/2

1:η(2klogn)/ε\eta\leftarrow(2k\log n)/\varepsilon
2:Compute low-rank approximation LL of exp[ηR]\exp[-\eta R] via Lemma 7.7, for precision ε~=ε3eηRmax\tilde{\varepsilon}=\frac{\varepsilon}{3}e^{-\eta R_{\max}}
3:Compute aa in (7.2) directly by enumerating over the polynomially many non-zero entries of SS, where R~=η1log[L]\tilde{R}=-\eta^{-1}\log[L]
4:Compute bb in (7.3) via Lemma 7.8, where R~=η1log[L]\tilde{R}=-\eta^{-1}\log[L]
5:Return η1log(a+b)-\eta^{-1}\log(a+b)
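For concreteness, the following self-contained sketch traces Algorithm 8 on a tiny rank-one instance and checks it against a dense softmin computation. The constants (in particular the Taylor degree $m$) are hand-picked for illustration rather than set by the precision analysis of Lemma 7.7, and all variable names are our own.

```python
import numpy as np
from itertools import product
from math import factorial

rng = np.random.default_rng(0)
n, k, r, eta, m = 4, 3, 1, 2.0, 25           # illustrative constants

U = rng.random((k, r, n))                    # factored rank-r part R
S = {(0, 1, 2): 0.7, (3, 3, 0): -0.4}        # sparse part: nonzero entries only
p = rng.random((k, n))                       # oracle input p = (p_1, ..., p_k)
d = np.exp(eta * p)                          # d_i = exp[eta p_i] as in Lemma 7.6

# Line 2 (Lemma 7.7): factors V of L ~ exp[-eta R], one per alpha, |alpha| <= m.
alphas = [al for al in product(range(m + 1), repeat=r) if sum(al) <= m]
V = np.ones((k, len(alphas), n))
for idx, alpha in enumerate(alphas):
    for i in range(k):
        for l in range(r):
            V[i, idx] *= U[i, l] ** alpha[l]
    V[0, idx] *= (-eta) ** sum(alpha) / np.prod([factorial(al) for al in alpha])

def L_entry(j):  # L_j = e^{-eta * Rtilde_j}, read off from the factors
    return sum(np.prod([V[i, idx, j[i]] for i in range(k)])
               for idx in range(len(alphas)))

# Line 3 (eq. (7.2)): the a-term, enumerating the s nonzero entries of S.
a = sum(np.prod([d[i, j[i]] for i in range(k)]) * L_entry(j)
        * (np.exp(-eta * S_j) - 1) for j, S_j in S.items())

# Line 4 (eq. (7.3), via Lemma 7.8): the b-term, marginalizing the scaled L.
b = sum(np.prod([d[i] @ V[i, idx] for i in range(k)]) for idx in range(len(alphas)))

smin_factored = -np.log(a + b) / eta         # Line 5: the softmin of Lemma 7.6

# Dense check of Lemma 7.6 with Ctilde = Rtilde + S (feasible only for tiny n^k).
Ctilde = np.zeros((n,) * k)
for j in np.ndindex(*(n,) * k):
    Ctilde[j] = -np.log(L_entry(j)) / eta    # Rtilde = -eta^{-1} log[L]
for j, S_j in S.items():
    Ctilde[j] += S_j
shift = p[0][:, None, None] + p[1][None, :, None] + p[2][None, None, :]
smin_dense = -np.log(np.exp(-eta * (Ctilde - shift)).sum()) / eta
print(smin_factored, smin_dense)             # agree up to floating-point error
```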
Proof of Theorem 7.4.

Proof of correctness for SMIN. Consider any oracle inputs p=(p1,,pk)¯n×kp=(p_{1},\dots,p_{k})\in\bar{\mathbb{R}}^{n\times k}. By Lemma 7.9, the tensor C~=R~+S=η1logL+S\tilde{C}=\tilde{R}+S=-\eta^{-1}\log L+S satisfies C~Cmaxε/2\|\tilde{C}-C\|_{\max}\leqslant\varepsilon/2. Therefore it suffices to show that Algorithm 8 correctly computes SMINC~(p,η)\textsf{SMIN}_{\tilde{C}}(p,\eta). This is true because that quantity is equal to η1log(a+b)-\eta^{-1}\log(a+b) by Lemma 7.6.

Proof of correctness for AMIN. We have just established that Algorithm 8 computes $\textsf{SMIN}_{\tilde{C}}(p,\eta)$. Because $\eta=(2k\log n)/\varepsilon$ and SMIN with this regularization implements AMIN (Remark 3.6), it follows that $\textsf{SMIN}_{\tilde{C}}(p,\eta)$ is within additive accuracy $\varepsilon/2$ of $\textsf{MIN}_{\tilde{C}}(p)$. Therefore, by the triangle inequality, it suffices to show that $\textsf{MIN}_{\tilde{C}}(p)$ is within $\varepsilon/2$ additive accuracy of $\textsf{MIN}_{C}(p)$. That is, it suffices to show that

\left\lvert\min_{\vec{j}\in[n]^{k}}\left(C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right)-\min_{\vec{j}\in[n]^{k}}\left(\tilde{C}_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}\right)\right\rvert\leqslant\varepsilon/2.

But this is true because CC~maxε/2\|C-\tilde{C}\|_{\max}\leqslant\varepsilon/2 by Lemma 7.9, and thus the quantities Cji=1k[pi]jiC_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} and C~ji=1k[pi]ji\tilde{C}_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} are within additive accuracy ε/2\varepsilon/2 for each j[n]k\vec{j}\in[n]^{k}.

Proof of runtime. We prove the claimed runtime bound simultaneously for the AMIN and SMIN computation because we use the same algorithm for both. To this end, we first bound the rank r~\tilde{r} of the low-rank approximation LL computed in Lemma 7.7. Note that since ε~=ε3eηRmax\tilde{\varepsilon}=\tfrac{\varepsilon}{3}e^{-\eta R_{\max}} and since it is assumed that Rmax=O(Cmax)R_{\max}=O(C_{\max}), we have log1/ε~=O(Cmaxεklogn)\log 1/\tilde{\varepsilon}=O(\tfrac{C_{\max}}{\varepsilon}k\log n). Therefore

r~(r+O(log1/ε~)r)=O(log1/ε~)r=O(Cmaxεklogn)r=poly(logn,k,Cmax/ε).\tilde{r}\leqslant\binom{r+O(\log 1/\tilde{\varepsilon})}{r}=O(\log 1/\tilde{\varepsilon})^{r}=O(\tfrac{C_{\max}}{\varepsilon}k\log n)^{r}=\mathrm{poly}(\log n,k,C_{\max}/\varepsilon).

Above, the first step is by Lemma 7.7, and the final step is because rr is assumed constant.

Therefore Line 2 in Algorithm 8 takes polynomial time by Lemma 7.7, Line 3 takes polynomial time by simply enumerating over the ss non-zero entries of SS, and Line 4 takes polynomial time by Lemma 7.8. ∎

7.3 Application vignette: risk estimation

Here we consider an application to portfolio risk management. For simplicity of exposition, let us first describe the setting of 11 financial instrument (“stock”). Consider investing in one unit of a stock for kk years. For i{0,,k}i\in\{0,\dots,k\}, let XiX_{i} denote the price of the stock at year ii. Suppose that the return ρi=Xi/Xi1\rho_{i}=X_{i}/X_{i-1} of the stock between years i1i-1 and ii is believed to follow some distribution ρiμi\rho_{i}\sim\mu_{i}. A fundamental question about the riskiness of this stock is to compute the investor’s expected profit in the worst-case over all joint probability distributions on future returns (ρ1,,ρk)(\rho_{1},\dots,\rho_{k}) that are consistent with the modeled marginal distributions (μ1,,μk)(\mu_{1},\dots,\mu_{k}). This is an MOT problem with cost CC given by

C(ρ1,,ρk)=i[k]ρi,C(\rho_{1},\dots,\rho_{k})=\prod_{i\in[k]}\rho_{i},

where here we view CC as a function rather than a tensor for notational simplicity. If each return ρi\rho_{i} has nn possible values (e.g., after quantization), then the cost CC is equivalently represented as a rank-1 tensor in (n)k(\mathbb{R}^{n})^{\otimes k} (by assigning an index to each of the nn possible values of each ρi\rho_{i}). Therefore our result Corollary 7.5 provides a polynomial-time algorithm for solving this MOT problem defining the investor’s worst-case profit.
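To see the rank-one structure concretely, here is a small sketch (with hypothetical quantized return values) showing that the cost tensor is exactly the outer product of the vectors of possible yearly returns, so its entries can be read off from $k$ vectors without ever materializing the $n^k$ tensor.

```python
import numpy as np

# The cost C(rho_1,...,rho_k) = prod_i rho_i on quantized returns is the rank-1
# tensor atoms[0] (x) ... (x) atoms[k-1], where atoms[i] lists the n possible
# values of rho_i (hypothetical values below).
n, k = 5, 4
rng = np.random.default_rng(0)
atoms = 1.0 + 0.1 * rng.random((k, n))        # possible return values per year

j = (1, 3, 0, 2)                              # an index into the n^k cost tensor
C_j = np.prod([atoms[i][j[i]] for i in range(k)])

C_dense = np.einsum('a,b,c,d->abcd', *atoms)  # materialized only for this check
assert np.isclose(C_dense[j], C_j)
```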

Rather than formalizing this argument for a single stock, we directly state the general case of investing in $r\geqslant 1$ stocks. This is essentially identical to the case of $r=1$, modulo additional notation.

Corollary 7.10 (Polynomial-time algorithm for expected profit given marginals on the returns).

Suppose an investor holds 11 unit of rr stocks for kk years. For each stock [r]\ell\in[r] and each year i[k]i\in[k], let ρi,\rho_{i,\ell} denote the relative price of stock \ell between years ii and i1i-1. Suppose ρi,\rho_{i,\ell} has distribution μi,\mu_{i,\ell}, and that each μi,\mu_{i,\ell} has at most nn atoms. Let Rmax=max{ρi,}=1ri=1kρi,R_{\max}=\max_{\{\rho_{i,\ell}\}}\sum_{\ell=1}^{r}\prod_{i=1}^{k}\rho_{i,\ell} denote the maximal possible return. For any constant number of stocks rr, there is a poly(n,k,Rmax/ε)\mathrm{poly}(n,k,R_{\max}/\varepsilon) time algorithm for ε\varepsilon-approximating the expected profit in the worst-case over all futures that are consistent with the returns’ marginal distributions.

Proof.

This is the optimization problem

minP({μi,}i[k],[r])𝔼{ρi,}i[k],[r]P[=1ri=1kρi,]\min_{P\in\mathcal{M}(\{\mu_{i,\ell}\}_{i\in[k],\ell\in[r]})}\mathbb{E}_{\{\rho_{i,\ell}\}_{i\in[k],\ell\in[r]}\sim P}\left[\sum_{\ell=1}^{r}\prod_{i=1}^{k}\rho_{i,\ell}\right]

over all joint distributions $P$ on the returns $\{\rho_{i,\ell}\}_{i\in[k],\ell\in[r]}$ that are consistent with the marginal distributions $\{\mu_{i,\ell}\}_{i\in[k],\ell\in[r]}$. This is an MOT problem with $k^{\prime}=rk$ marginals, each over $n$ atoms, with cost function

C({ρi,}i[k],[r])=[r](i,)[k]×[r][k](ρi,𝟙[=]+𝟙[]).\displaystyle C\Big{(}\{\rho_{i,\ell}\}_{i\in[k],\ell\in[r]}\Big{)}=\sum_{\ell^{\prime}\in[r]}\prod_{(i,\ell)\in[k]\times[r]\cong[k^{\prime}]}\big{(}\rho_{i,\ell}\cdot\mathds{1}[\ell=\ell^{\prime}]+\mathds{1}[\ell\neq\ell^{\prime}]\big{)}. (7.7)

By viewing this cost function CC as a cost tensor in the natural way (i.e., assigning an index to each of the nn possible values of ρi,\rho_{i,\ell}), this representation (7.7) shows that the corresponding cost tensor C(n)kC\in(\mathbb{R}^{n})^{\otimes k^{\prime}} has rank rr. Moreover, observe that the maximum entry of the cost is RmaxR_{\max}. Therefore we may appeal to our polynomial-time MOT algorithms in Corollary 7.5 for costs with constant rank. ∎

The algorithm readily generalizes, e.g., to the case where the investor holds multiple units of a stock, or where stocks are held for different numbers of years. The former is modeled simply by adding an extra year in which the return of stock $\ell$ equals the number of units held, with probability $1$. The latter is modeled simply by setting the return of stock $\ell$ to $1$ for all years after it is held, with probability $1$.

In Figure 7, we provide a numerical illustration comparing our new polynomial-time algorithms for this risk estimation task with the previous fastest algorithms. Previously, the fastest algorithms that apply to this problem are out-of-the-box LP solvers run on MOT, and the brute-force implementation of SINKHORN which marginalizes over all nkn^{k} entries in each iteration. Since both of these previous algorithms have exponential runtime that scales as nΩ(k)n^{\Omega(k)}, they do not scale beyond tiny input sizes of n=10n=10 and k=8k=8 even with two minutes of computation time. In contrast, our new polynomial-time algorithms compute high-quality solutions for problems that are orders-of-magnitude larger. For example, our polynomial-time implementation of SINKHORN takes less than a second to solve an MOT LP with nk=1030n^{k}=10^{30} variables.

Details for this numerical experiment: we consider $r=1$ stock over $k$ timesteps, where each marginal distribution $\mu_i$ is uniform on $[1,1+1/k]$, discretized with $n=10$ atoms. We implement the AMIN and SMIN oracles efficiently by using our above algorithm to exploit the rank-one structure of the cost tensor. In particular, the polynomial approximation we use here to approximate $\exp[-\eta C]$ is the degree-$5$ Taylor approximation (cf. Lemma 7.7). This lets us run SINKHORN and MWU in polynomial time, as described above. In the numerical experiment, we also implement an approximate version of COLGEN using our polynomial-time implementation of the approximate violation oracle AMIN. Since the algorithms compute an upper bound, a lower value is better in the right plot of Figure 7. We observe that MWU yields the loosest approximation for this application, whereas our implementations of SINKHORN and COLGEN produce high-quality approximations, as is evident from comparison to the exact LP solver in the regime where the latter is tractable to run.

Figure 7: Comparison of the runtime (left) and accuracy (right) of the fastest existing algorithms (naive LP solver and naive SINKHORN which both have exponential runtimes that scale as nΩ(k)n^{\Omega(k)}) with our algorithms (SINKHORN, MWU, and COLGEN and MWU with polynomial-time implementations of their bottlenecks) for the risk estimation problem described in the main text. The algorithms are cut off at 2 minutes, denoted by an “x”. Our new polynomial-time implementation of SINKHORN returns high-quality solutions for problems that are orders-of-magnitude larger than previously possible: e.g., it takes less than a second to solve the problem for k=30k=30, which amounts to an MOT LP with 103010^{30} variables.

7.4 Application vignette: projection to the transportation polytope

Here we consider the fundamental problem of projecting a joint probability distribution QQ onto the transportation polytope (μ1,,μk)\mathcal{M}(\mu_{1},\dots,\mu_{k}), i.e.,

argminP(μ1,,μk)j(PjQj)2.\displaystyle\operatorname*{argmin}_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\sum_{\vec{j}}(P_{\vec{j}}-Q_{\vec{j}})^{2}. (7.8)

We provide the first polynomial-time algorithm for solving this problem in the case where QQ is a distribution that decomposes into a low-rank component plus a sparse component. The low-rank component enables modeling mixtures of product distributions (e.g., mixtures of isotropic Gaussians), which arise frequently in statistics and machine learning; see, e.g., (feldman2008learning, ). In such applications, the number of product distributions in the mixture corresponds to the tensor rank. The sparse component further enables modeling arbitrary corruptions to the distribution in polynomially many entries.

We emphasize that this projection problem (7.8) is not an MOT problem since the objective is quadratic rather than linear. This illustrates the versatility of our algorithmic results. Our algorithm is based on a reduction from quadratic optimization to linear optimization over (μ1,,μk)\mathcal{M}(\mu_{1},\dots,\mu_{k}) that is tailored to this problem. Crucial to this reduction is the fact that the MOT algorithms in §4 can compute sparse solutions. In particular, this reduction does not work with SINKHORN because SINKHORN cannot compute sparse solutions.

Corollary 7.11 (Efficient projection to the transportation polytope).

Let Q=R+S(0n)kQ=R+S\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k}, where RR has constant rank and SS is polynomially sparse. Suppose that RmaxR_{\max} and SmaxS_{\max} are O(1)O(1). Given RR in factored form, SS through its non-zero entries, measures μ1,,μkΔn\mu_{1},\dots,\mu_{k}\in\Delta_{n}, and accuracy ε>0\varepsilon>0, we can compute in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time a feasible P(μ1,,μk)P\in\mathcal{M}(\mu_{1},\dots,\mu_{k}) that has ε\varepsilon-suboptimal cost for the projection problem (7.8). This solution PP is a sparse tensor output through its poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) non-zero entries.

Proof.

We apply the Frank-Wolfe algorithm (a.k.a., Conditional Gradient Descent) to solve (7.8), specifically using approximate LP solutions for the descent direction as in (Jag13, , Algorithm 2). By the known convergence guarantee of this algorithm (Jag13, , Theorem 1.1), if each LP is solved to ε=O(ε)\varepsilon^{\prime}=O(\varepsilon) accuracy, then T=O(1/ε)T=O(1/\varepsilon) Frank-Wolfe iterations suffice to obtain an ε\varepsilon-suboptimal solution to (7.8).

The crux, therefore, is to show that each Frank-Wolfe iteration can be computed efficiently, and that the final solution is sparse. Initialize P(0)P^{(0)} to be an arbitrary vertex of (μ1,,μk)\mathcal{M}(\mu_{1},\dots,\mu_{k}). Then P(0)P^{(0)} is feasible and is polynomially sparse (see §2.1). Let P(t)(0n)kP^{(t)}\in(\mathbb{R}_{\geqslant 0}^{n})^{\otimes k} denote the tt-th Frank-Wolfe iterate. Performing the next iteration requires two computations:

  1. 1.

    Approximately solve the following LP to ε\varepsilon^{\prime} accuracy:

    D(t)minP(μ1,,μk)P,P(t)Q.\displaystyle D^{(t)}\leftarrow\min_{P\in\mathcal{M}(\mu_{1},\dots,\mu_{k})}\langle P,P^{(t)}-Q\rangle. (7.9)
  2. 2.

    Update P(t+1)(1γt)P(t)+γtD(t)P^{(t+1)}\leftarrow(1-\gamma_{t})P^{(t)}+\gamma_{t}D^{(t)}, where γt=2/(t+2)\gamma_{t}=2/(t+2) is the current stepsize.

For the first iteration t=0t=0, note that the LP (7.9) is an MOT problem with cost C(0)=P(0)Q=P(0)RSC^{(0)}=P^{(0)}-Q=P^{(0)}-R-S which decomposes into a polynomially sparse tensor P(0)SP^{(0)}-S plus a constant-rank tensor R-R. Therefore the algorithm in Corollary 7.5 can solve the LP (7.9) to ε=O(ε)\varepsilon^{\prime}=O(\varepsilon) additive accuracy in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time, and it outputs a solution D(0)D^{(0)} that is poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) sparse. It follows that P(1)P^{(1)} can be computed in poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time and moreover is poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) sparse since it is a convex combination of the similarly sparse tensors P(0)P^{(0)} and D(0)D^{(0)}. By repeating this argument identically for T=O(1/ε)T=O(1/\varepsilon) iterations, it follows that each iteration takes poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) time, and that each iterate P(t)P^{(t)} is poly(n,k,1/ε)\mathrm{poly}(n,k,1/\varepsilon) sparse. ∎
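To make the reduction concrete, here is a minimal Frank-Wolfe sketch for (7.8) on a toy $k=2$, $n=2$ instance, where the LP step (7.9) can be solved exactly by checking the two endpoints of the (one-dimensional) transportation polytope. The exact 2x2 oracle below is a stand-in for the approximate, sparse MOT solver of Corollary 7.5 that the proof actually invokes.

```python
import numpy as np

def project_to_polytope(Q, P0, lp_oracle, T):
    """Frank-Wolfe for min_P ||P - Q||^2 over the transportation polytope:
    each step solves the LP (7.9) with cost P - Q, then takes a convex step."""
    P = P0
    for t in range(T):
        D = lp_oracle(P - Q)                 # descent direction (a sparse vertex)
        gamma = 2.0 / (t + 2.0)              # standard Frank-Wolfe stepsize
        P = (1 - gamma) * P + gamma * D      # convex combination stays feasible
    return P

def lp_oracle_2x2(cost, mu, nu):
    # Couplings of (mu, nu) for n = 2: P(t) = [[t, mu0-t], [nu0-t, 1-mu0-nu0+t]]
    # with t in [lo, hi]; a linear objective is minimized at an endpoint.
    lo = max(0.0, mu[0] + nu[0] - 1.0)
    hi = min(mu[0], nu[0])
    P = lambda t: np.array([[t, mu[0] - t], [nu[0] - t, 1 - mu[0] - nu[0] + t]])
    return min((P(lo), P(hi)), key=lambda X: (X * cost).sum())

mu, nu = np.array([0.6, 0.4]), np.array([0.3, 0.7])
Q = np.outer(mu, nu) + 0.05                  # target to project (not itself feasible)
P0 = lp_oracle_2x2(np.zeros((2, 2)), mu, nu) # an arbitrary vertex to start from
P = project_to_polytope(Q, P0, lambda C: lp_oracle_2x2(C, mu, nu), T=500)
print(P.sum(axis=1), P.sum(axis=0))          # exactly mu and nu: iterates stay feasible
```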

8 Discussion

In this paper, we investigated what structure enables MOT—an LP with nkn^{k} variables—to be solved in poly(n,k)\mathrm{poly}(n,k) time. We developed a unified algorithmic framework for MOT by characterizing what “structure” is required to solve MOT in polynomial time by different algorithms in terms of simple variants of the dual feasibility oracle. On one hand, this enabled us to show that ELLIPSOID and MWU solve MOT in polynomial time whenever any algorithm can, whereas SINKHORN requires strictly more structure. And on the other hand, this made the design of polynomial-time algorithms for MOT much simpler, as we illustrated on three general classes of MOT cost structures.

Our results suggest several natural directions for future research. One exciting direction is to identify further tractable classes of MOT cost structures beyond the three studied in this paper, since this may enable new applications of MOT. Our results help guide this search because they make it significantly easier to identify if an MOT problem is polynomial-time solvable (see §1.1.4).

Another important direction is practicality. While the focus of this paper is to characterize when MOT problems can be solved in polynomial time, in practice there is of course a difference between small and large polynomial runtimes. It is therefore a question of practical significance to improve our “proof of concept” polynomial-time algorithms by designing algorithms with smaller polynomial runtimes; our characterization of which MOT problems are polynomial-time solvable identifies the problems for which this search is worthwhile.

In order to develop more practical algorithms, recall that, roughly speaking, our approach for designing MOT algorithms consisted of three parts:

  • An “outer loop” algorithm such as ELLIPSOID, MWU, or SINKHORN that solves MOT in polynomial time conditionally on a polynomial-time implementation of a certain bottleneck oracle.

  • An “intermediate” algorithm that reduces this bottleneck oracle to polynomial calls of a variant of the dual feasibility oracle.

  • An “inner loop” algorithm that solves the relevant variant of the dual feasibility oracle for the structured MOT problem at hand.

Obtaining a smaller polynomial runtime for any of these three parts immediately implies smaller polynomial runtimes for the overall MOT algorithm. Another approach is to design altogether different algorithms that avoid the polynomial blow-up of the runtime that arises from composing these three parts. Understanding how to solve an MOT problem more “directly” in this way is an interesting question.

Acknowledgements.
We are grateful to Jonathan Niles-Weed, Pablo Parrilo, and Philippe Rigollet for insightful conversations; to Frederic Koehler for suggesting a simpler proof of Lemma 3.7; to Ben Edelman and Siddhartha Jayanti who were involved in the brainstorming stages and provided helpful references; and to Karthik Natarajan for references to the random combinatorial optimization literature.

Appendix A Deferred proof details

A.1 Proof of Lemma 4.11

Our proof is based on two helper claims.

Claim A.1 (Lemma 3 of (young2001sequential, )).

If MWU_BOTTLENECK(P,μ,λ,ε)\textsf{MWU\_BOTTLENECK}(P,\mu,\lambda,\varepsilon) returns “null”, then MOTC(μ)>λ\textsf{MOT}_{C}(\mu)>\lambda.

Proof.

We prove the contrapositive. Let $P^{*}\in\mathcal{M}(\mu_{1},\dots,\mu_{k})$ be such that $\langle C,P^{*}\rangle\leqslant\lambda$. Then

j[n]kPjhΦ(P+hδj)h=0\displaystyle\sum_{\vec{j}\in[n]^{k}}P^{*}_{\vec{j}}\frac{\partial}{\partial h}\Phi(P+h\delta_{\vec{j}})\mid_{h=0} =j[n]kPj(Cjλexp(C,P/λ)+i=1k1[μi]jiexp([mi(P)]ji/[μi]ji))exp(C,P/λ)+s=1kt=1nexp([ms(P)]t/[μs]t)\displaystyle=\frac{\sum_{\vec{j}\in[n]^{k}}P^{*}_{\vec{j}}(\frac{C_{\vec{j}}}{\lambda}\exp(\langle C,P\rangle/\lambda)+\sum_{i=1}^{k}\frac{1}{[\mu_{i}]_{j_{i}}}\exp([m_{i}(P)]_{j_{i}}/[\mu_{i}]_{j_{i}}))}{\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}
=C,Pλexp(C,P/λ)+s=1kt=1n[ms(P)]t[μs]texp([ms(P)]t/[μs]t)exp(C,P/λ)+s=1kt=1nexp([ms(P)]t/[μs]t)\displaystyle=\frac{\frac{\langle C,P^{*}\rangle}{\lambda}\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\frac{[m_{s}(P^{*})]_{t}}{[\mu_{s}]_{t}}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}{\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}
exp(C,P/λ)+s=1kt=1nexp([ms(P)]t/[μs]t)exp(C,P/λ)+s=1kt=1nexp([ms(P)]t/[μs]t)\displaystyle\leqslant\frac{\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}{\exp(\langle C,P\rangle/\lambda)+\sum_{s=1}^{k}\sum_{t=1}^{n}\exp([m_{s}(P)]_{t}/[\mu_{s}]_{t})}
=1.\displaystyle=1.

Since PP^{*} is non-negative and j[n]kPj=m(P)=1\sum_{\vec{j}\in[n]^{k}}P^{*}_{\vec{j}}=m(P^{*})=1, there must exist a j[n]k\vec{j}\in[n]^{k} satisfying hΦ(P+hδj)h=01\frac{\partial}{\partial h}\Phi(P+h\delta_{\vec{j}})\mid_{h=0}\leqslant 1. ∎

This first claim ensures that if the algorithm returns “infeasible”, then indeed MOTC(μ)>λ\textsf{MOT}_{C}(\mu)>\lambda. This proves the first part of the lemma. We now prove a second claim, useful for bounding the running time and the quality of the returned solution when the algorithm does not return “infeasible”.

Claim A.2 (Lemmas 1 and 4 of (young2001sequential, )).

In Lines 5 to 9, we maintain the invariant that Φ(P)(1+ε)2m(P)log(nk+1)\Phi(P)-(1+\varepsilon)^{2}m(P)\leqslant\log(nk+1).

Proof.

Note that initially P=0P=0 and Φ(0)(1+ε)2m(0)=Φ(0)=log(nk+1)\Phi(0)-(1+\varepsilon)^{2}m(0)=\Phi(0)=\log(nk+1). So it suffices to prove that Φ(P)(1+ε)2m(P)\Phi(P)-(1+\varepsilon)^{2}m(P) does not increase on each iteration. Indeed, if j\vec{j} is returned by MWU_BOTTLENECK, then hΦ(P+hδj)h=0(1+ε)\frac{\partial}{\partial h}\Phi(P+h\delta_{\vec{j}})\mid_{h=0}\leqslant(1+\varepsilon). Furthermore, let ε=εmin(λ/Cj,mini[μi]ji)\varepsilon^{\prime}=\varepsilon\cdot\min(\lambda/C_{\vec{j}},\min_{i}[\mu_{i}]_{j_{i}}). By the smoothness of the softmax, it holds that

\Phi(P+\varepsilon^{\prime}\delta_{\vec{j}})\leqslant\Phi(P)+\varepsilon^{\prime}(1+\varepsilon)\frac{\partial}{\partial h}\Phi(P+h\delta_{\vec{j}})\mid_{h=0}\leqslant\Phi(P)+\varepsilon^{\prime}(1+\varepsilon)^{2}.

So, in Line 9, (1+ε)2m(P)(1+\varepsilon)^{2}m(P) increases by (1+ε)2ε(1+\varepsilon)^{2}\varepsilon^{\prime}, and Φ(P)\Phi(P) increases by at most (1+ε)2ε(1+\varepsilon)^{2}\varepsilon^{\prime}. This proves the claim. ∎

We use this second claim to bound the running time of Step 1. Each iteration of the loop from Line 5 to Line 9 of Algorithm 2 increases the value of

Ψ(P):=C,P/λ+imi(P)/μi1\Psi(P):=\langle C,P\rangle/\lambda+\sum_{i}\|m_{i}(P)/\mu_{i}\|_{1}

by at least ε\varepsilon. So after TT iterations we must have

Φ(P)Ψ(P)/(nk+1)Tε/(nk+1),\Phi(P)\geqslant\Psi(P)/(nk+1)\geqslant T\varepsilon/(nk+1),

where we have used Jensen’s inequality to relate Φ\Phi and Ψ\Psi. By the claim, this means that on any iteration T(η+log(nk+1))(nk+1)(1+ε)2/ε=O~(nk/ε2)T\geqslant(\eta+\log(nk+1))(nk+1)(1+\varepsilon)^{2}/\varepsilon=\tilde{O}(nk/\varepsilon^{2}), we must have

m(P)(Φ(P)log(nk+1))/(1+ε)2η,m(P)\geqslant(\Phi(P)-\log(nk+1))/(1+\varepsilon)^{2}\geqslant\eta,

so the loop must terminate after at most O~(nk/ε2)\tilde{O}(nk/\varepsilon^{2}) iterations. Since each iteration can increase the number of non-zero entries by at most 1, this also proves the sparsity bound on PP.

We now prove the bounds on the marginals and cost, again using the claim. When Line 10 is reached, we must have m(P)[η,η+ε]m(P)\in[\eta,\eta+\varepsilon], because each iteration increases m(P)m(P) by at most ε\varepsilon. Therefore,

max(C,P/λ,m1(P)/μ1,,mk(P)/μk)\displaystyle\max(\langle C,P\rangle/\lambda,m_{1}(P)/\mu_{1},\ldots,m_{k}(P)/\mu_{k}) Φ(P)\displaystyle\leqslant\Phi(P) by Lemma 2.3 for softmax
log(nk+1)+(1+ε)2m(P)\displaystyle\leqslant\log(nk+1)+(1+\varepsilon)^{2}m(P) by the claim
log(nk+1)+(1+ε)2(η+ε)\displaystyle\leqslant\log(nk+1)+(1+\varepsilon)^{2}(\eta+\varepsilon)
(1+ε)4η\displaystyle\leqslant(1+\varepsilon)^{4}\eta by η2log(nk+1)1\eta\geqslant 2\log(nk+1)\geqslant 1

Therefore, the rescaling in Line 10 yields PP satisfying the guarantees of Lemma 4.11.

A.2 Proof of Lemma 4.16

It is obvious how the AMINC\textsf{AMIN}_{C} oracle can be implemented via a single call of the ARGAMINC\textsf{ARGAMIN}_{C} oracle; we now show the converse. Specifically, given p1,,pknp_{1},\ldots,p_{k}\in\mathbb{R}^{n}, we show how to compute a solution j=(j1,,jk)[n]k\vec{j}=(j_{1},\dots,j_{k})\in[n]^{k} for ARGAMINC([p1,,pk],ε)\textsf{ARGAMIN}_{C}([p_{1},\dots,p_{k}],\varepsilon) using nknk calls to the AMINC\textsf{AMIN}_{C} oracle with accuracy ε/(2k)\varepsilon/(2k). As in the proof of Lemma 4.4, we use the first nn calls to compute the first index j1j_{1} of the solution, the next nn calls to compute the next index j2j_{2}, and so on.

Formally, for s[k]s\in[k], let us say that (j1,,js)[n]s(j_{1}^{*},\dots,j_{s}^{*})\in[n]^{s} is a δ\delta-approximate “partial solution” of size ss if there exists a solution j[n]kj\in[n]^{k} for ARGAMINC([p1,,pk],δ)\textsf{ARGAMIN}_{C}([p_{1},\dots,p_{k}],\delta) that satisfies ji=jij_{i}=j_{i}^{*} for all i[s]i\in[s]. Then it suffices to show that for every s[k]s\in[k], it is possible to compute an (sε/k)(s\varepsilon/k)-approximate partial solution (j1,,js)(j_{1}^{*},\dots,j_{s}^{*}) of size ss from an ((s1)ε/k)((s-1)\varepsilon/k)-approximate partial solution (j1,,js1)(j_{1}^{*},\dots,j_{s-1}^{*}) of size s1s-1 using nn calls to AMINC\textsf{AMIN}_{C} and polynomial additional time.

We do this by setting $j_{s}^{*}$ to be a minimizer of

minjs[n]AMINC([q1,j1,,qs1,js1,qs,js,ps+1,,pk],ε2k),\displaystyle\min_{j_{s}^{\prime}\in[n]}\textsf{AMIN}_{C}\Big{(}\Big{[}q_{1,j_{1}^{*}},\ldots,q_{s-1,j_{s-1}^{*}},q_{s,j_{s}^{\prime}},p_{s+1},\ldots,p_{k}\Big{]},\frac{\varepsilon}{2k}\Big{)}, (A.1)

where the qq vectors are defined as in the proof of Lemma 4.4. The runtime claim is obvious; it suffices to prove correctness. To this end, observe that

minj[n]ks.t. j1=j1,,js=jsCji=1k[pi]ji\displaystyle\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots,j_{s}=j_{s}^{*}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}} =MINC([q1,j1,,qs,js,ps+1,,pk])\displaystyle=\textsf{MIN}_{C}\Big{(}\Big{[}q_{1,j_{1}^{*}},\dots,q_{s,j_{s}^{*}},p_{s+1},\dots,p_{k}\Big{]}\Big{)}
ε2k+AMINC([q1,j1,,qs,js,ps+1,,pk],ε2k)\displaystyle\leqslant\frac{\varepsilon}{2k}+\textsf{AMIN}_{C}\Big{(}\Big{[}q_{1,j_{1}^{*}},\dots,q_{s,j_{s}^{*}},p_{s+1},\dots,p_{k}\Big{]},\frac{\varepsilon}{2k}\Big{)}
=ε2k+minjs[n]AMINC([q1,j1,,qs1,js1,qs,js,ps+1,,pk],ε2k)\displaystyle=\frac{\varepsilon}{2k}+\min_{j_{s}^{\prime}\in[n]}\textsf{AMIN}_{C}\Big{(}\Big{[}q_{1,j_{1}^{*}},\dots,q_{s-1,j_{s-1}^{*}},q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}\Big{]},\frac{\varepsilon}{2k}\Big{)}
εk+minjs[n]MINC([q1,j1,,qs1,js1,qs,js,ps+1,,pk])\displaystyle\leqslant\frac{\varepsilon}{k}+\min_{j_{s}^{\prime}\in[n]}\textsf{MIN}_{C}\Big{(}\Big{[}q_{1,j_{1}^{*}},\dots,q_{s-1,j_{s-1}^{*}},q_{s,j_{s}^{\prime}},p_{s+1},\dots,p_{k}\Big{]}\Big{)}
=εk+minjs[n]minj[n]ks.t. j1=j1,js1=js1,js=jsCji=1k[pi]ji\displaystyle=\frac{\varepsilon}{k}+\min_{j_{s}^{\prime}\in[n]}\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots j_{s-1}=j_{s-1}^{*},j_{s}=j_{s}^{\prime}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}
=εk+minj[n]ks.t. j1=j1,js1=js1Cji=1k[pi]ji\displaystyle=\frac{\varepsilon}{k}+\min_{\begin{subarray}{c}\vec{j}\in[n]^{k}\\ \text{s.t. }j_{1}=j_{1}^{*},\dots j_{s-1}=j_{s-1}^{*}\end{subarray}}C_{\vec{j}}-\sum_{i=1}^{k}[p_{i}]_{j_{i}}
=sεk+MINC([p1,,pk]).\displaystyle=\frac{s\varepsilon}{k}+\textsf{MIN}_{C}\Big{(}\Big{[}p_{1},\dots,p_{k}\Big{]}\Big{)}.

Above, the first and fifth steps are by Observation 4.5, the second and fourth steps are by definition of the AMIN oracle, the third step is by construction of jsj_{s}^{*}, the penultimate step is by simplifying, and the final step is by definition of (j1,,js1)(j_{1}^{*},\dots,j_{s-1}^{*}) being an ((s1)ε/k)((s-1)\varepsilon/k)-approximate partial solution of size s1s-1. We conclude that (j1,,js)(j_{1}^{*},\dots,j_{s}^{*}) is an (sε/k)(s\varepsilon/k)-approximate partial solution of size ss, as desired.
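The following sketch illustrates this coordinate-by-coordinate reduction, with an exact brute-force min oracle standing in for AMIN (so the accuracy bookkeeping above disappears), and assuming the forcing construction for the $q$ vectors, i.e., a copy of $p_s$ with every entry except the forced index set to $-\infty$, as in the proof of Lemma 4.4 (not reproduced here).

```python
import numpy as np

def argmin_from_min_oracle(min_oracle, p):
    """Recover a minimizing index vector with n*k calls to a value-only oracle
    min_oracle(p) = min_j C_j - sum_i p[i][j_i]. Coordinates are fixed one at
    a time via forcing vectors q_{s,j}: p_s with all other entries at -inf."""
    k, n = p.shape
    q = p.astype(float).copy()
    js = []
    for s in range(k):
        best_j, best_val = 0, np.inf
        for j in range(n):
            trial = q.copy()
            trial[s, :] = -np.inf
            trial[s, j] = p[s, j]            # force j_s = j
            val = min_oracle(trial)
            if val < best_val:
                best_j, best_val = j, val
        q[s, :] = -np.inf
        q[s, best_j] = p[s, best_j]          # commit j_s = best_j
        js.append(best_j)
    return tuple(js)

# Exact brute-force oracle on a random k = 3 cost, to sanity-check the reduction.
rng = np.random.default_rng(0)
n, k = 3, 3
C = rng.random((n,) * k)
shift_of = lambda p: p[0][:, None, None] + p[1][None, :, None] + p[2][None, None, :]
min_oracle = lambda p: (C - shift_of(p)).min()
p = rng.random((k, n))
j = argmin_from_min_oracle(min_oracle, p)
assert np.isclose((C - shift_of(p))[j], (C - shift_of(p)).min())
```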

Appendix B Additional numerical experiments

Here, we provide additional numerics for the generalized Euler flow application in §5.3 in order to demonstrate that similar behavior is observed on other standard benchmark inputs from the literature (Bre08, ). These instances are identical to those in Figure 10, except with different input permutations $\sigma$ between the initial and final positions of the particles. Note that our algorithm COLGEN computes an exact, sparse solution with at most $nk-k+1$ non-zero entries (Theorem 4.6). In contrast, the SINKHORN algorithm of (BenCarCut15, ) computes approximate, fully dense solutions with $n^{k}$ non-zero entries, which leads to blurry visualizations.

[Figure: snapshots of particle trajectories at times t = 1 through t = 7; SINKHORN solution (left) and COLGEN solution (right).]
Figure 8: Same as Figure 10, but now with the permutation σ\sigma that sends the particle at initial location x[0,1]x\in[0,1] to final location σ(x)=min(2x,22x)\sigma(x)=\min(2x,2-2x). COLGEN runs in 7.88 seconds, while SINKHORN with regularization η=2000\eta=2000 runs in 6.97 seconds. In the COLGEN solution, roughly half the particles have trajectories that never split.
[Figure: snapshots of particle trajectories at times t = 1 through t = 7; SINKHORN solution (left) and COLGEN solution (right).]
Figure 9: Same as Figure 10, but now with the permutation σ\sigma that sends the particle at initial location x[0,1]x\in[0,1] to final location σ(x)=1x\sigma(x)=1-x. COLGEN runs in 10.53 seconds, while SINKHORN with regularization η=1500\eta=1500 runs in 2.10 seconds.

References

  • [1] S. Agrawal, Y. Ding, A. Saberi, and Y. Ye. Price of correlations in stochastic optimization. Operations Research, 60(1):150–162, 2012.
  • [2] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
  • [3] Y. Akagi, Y. Tanaka, T. Iwata, T. Kurashima, and H. Toda. Probabilistic optimal transport based on collective graphical models. Preprint at arXiv:2006.08866, 2020.
  • [4] J. Altschuler, F. Bach, A. Rudi, and J. Niles-Weed. Massively scalable Sinkhorn distances via the Nyström method. In Advances in Neural Information Processing Systems, pages 4429–4439, 2019.
  • [5] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, 2017.
  • [6] J. M. Altschuler and E. Boix-Adserà. Hardness results for Multimarginal Optimal Transport problems. Discrete Optimization, 42:100669, 2021.
  • [7] J. M. Altschuler and E. Boix-Adserà. Wasserstein barycenters can be computed in polynomial time in fixed dimension. Journal of Machine Learning Research, 22:44–1, 2021.
  • [8] J. M. Altschuler and E. Boix-Adserà. Wasserstein barycenters are NP-hard to compute. SIAM Journal on Mathematics of Data Science, 2022.
  • [9] J. M. Altschuler and P. A. Parrilo. Near-linear convergence of the Random Osborne algorithm for Matrix Balancing. Mathematical Programming, pages 1–35, 2022.
  • [10] E. Anderes, S. Borgwardt, and J. Miller. Discrete Wasserstein barycenters: Optimal transport for discrete data. Mathematical Methods of Operations Research, 84(2):389–409, 2016.
  • [11] F. Bach. Learning with submodular functions: a convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.
  • [12] M. O. Ball. Computational complexity of network reliability analysis: An overview. IEEE Transactions on Reliability, 35(3):230–239, 1986.
  • [13] M. O. Ball, C. J. Colbourn, and J. S. Provan. Network reliability. Handbooks in Operations Research and Management Science, 7:673–762, 1995.
  • [14] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
  • [15] J.-D. Benamou, G. Carlier, S. Di Marino, and L. Nenna. An entropy minimization approach to second-order variational mean-field games. Mathematical Models and Methods in Applied Sciences, 29(08):1553–1583, 2019.
  • [16] J.-D. Benamou, G. Carlier, and L. Nenna. A numerical method to solve multi-marginal optimal transport problems with Coulomb cost. In Splitting Methods in Communication, Imaging, Science, and Engineering, pages 577–601. Springer, 2016.
  • [17] J.-D. Benamou, G. Carlier, and L. Nenna. Generalized incompressible flows, multi-marginal transport and Sinkhorn algorithm. Numerische Mathematik, 142(1):33–54, 2019.
  • [18] D. Bertsimas and J. N. Tsitsiklis. Introduction to linear optimization, volume 6. Athena Scientific Belmont, MA, 1997.
  • [19] J. Blanchet, A. Jambulapati, C. Kent, and A. Sidford. Towards optimal running times for optimal transport. Preprint at arXiv:1810.07717, 2018.
  • [20] M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. In International conference on Artificial Intelligence and Statistics, pages 880–889. PMLR, 2018.
  • [21] H. L. Bodlaender. A linear-time algorithm for finding tree-decompositions of small treewidth. SIAM Journal on computing, 25(6):1305–1317, 1996.
  • [22] H. L. Bodlaender. Treewidth: Structure and algorithms. In International Colloquium on Structural Information and Communication Complexity, pages 11–25. Springer, 2007.
  • [23] Y. Brenier. The least action principle and the related concept of generalized flows for incompressible perfect fluids. Journal of the American Mathematical Society, 2(2):225–255, 1989.
  • [24] Y. Brenier. The dual least action problem for an ideal, incompressible fluid. Archive for Rational Mechanics and Analysis, 122(4):323–351, 1993.
  • [25] Y. Brenier. Minimal geodesics on groups of volume-preserving maps and generalized solutions of the Euler equations. Communications on Pure and Applied Mathematics, 52(4):411–452, 1999.
  • [26] Y. Brenier. Generalized solutions and hydrostatic approximation of the Euler equations. Physica D: Nonlinear Phenomena, 237(14-17):1982–1988, 2008.
  • [27] G. Buttazzo, L. De Pascale, and P. Gori-Giorgi. Optimal-transport formulation of electronic density-functional theory. Physical Review A, 85(6):062502, 2012.
  • [28] G. Carlier and I. Ekeland. Matching for teams. Economic Theory, 42(2):397–418, 2010.
  • [29] G. Carlier, A. Oberman, and E. Oudet. Numerical methods for matching for teams and Wasserstein barycenters. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6):1621–1642, 2015.
  • [30] L. Chen, W. Ma, K. Natarajan, D. Simchi-Levi, and Z. Yan. Distributionally robust linear and discrete optimization with marginals. Operations Research, 2022.
  • [31] P.-A. Chiappori, R. J. McCann, and L. P. Nesheim. Hedonic price equilibria, stable matching, and optimal transport: equivalence, topology, and uniqueness. Economic Theory, 42(2):317–354, 2010.
  • [32] C. Cotar, G. Friesecke, and C. Klüppelberg. Density functional theory and optimal transportation with Coulomb cost. Communications on Pure and Applied Mathematics, 66(4):548–599, 2013.
  • [33] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1, 2016.
  • [34] T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.
  • [35] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, 2013.
  • [36] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, 2014.
  • [37] M. Dyer, L. A. Goldberg, C. Greenhill, and M. Jerrum. On the relative complexity of approximate counting problems. In International Workshop on Approximation Algorithms for Combinatorial Optimization, pages 108–119. Springer, 2000.
  • [38] F. Elvander, I. Haasler, A. Jakobsson, and J. Karlsson. Multi-marginal optimal transport using partial information with applications in robust localization and sensor fusion. Signal Processing, 171:107474, 2020.
  • [39] J. Feldman, R. O’Donnell, and R. A. Servedio. Learning mixtures of product distributions over discrete domains. SIAM Journal on Computing, 37(5):1536–1564, 2008.
  • [40] S. Friedland. Optimal transport, distance between sets of measures and tensor scaling. Preprint at arXiv:2005.00945, 2020.
  • [41] I. Gertsbakh and Y. Shpungin. Network reliability and resilience. Springer Science & Business Media, 2011.
  • [42] L. A. Goldberg and M. Jerrum. The complexity of ferromagnetic Ising with local fields. Combinatorics, Probability and Computing, 16(1):43–61, 2007.
  • [43] M. Grötschel, L. Lovász, and A. Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981.
  • [44] M. Grötschel, L. Lovász, and A. Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
  • [45] H. Guo and M. Jerrum. A polynomial-time approximation algorithm for all-terminal network reliability. SIAM Journal on Computing, 48(3):964–978, 2019.
  • [46] I. Haasler, A. Ringh, Y. Chen, and J. Karlsson. Estimating ensemble flows on a hidden Markov chain. In IEEE Conference on Decision and Control, pages 1331–1338. IEEE, 2019.
  • [47] I. Haasler, A. Ringh, Y. Chen, and J. Karlsson. Multi-marginal optimal transport and Schrödinger bridges on trees. Preprint at arXiv:2004.06909, 2020.
  • [48] I. Haasler, A. Ringh, Y. Chen, and J. Karlsson. Scalable computation of dynamic flow problems via multi-marginal graph-structured optimal transport. Preprint at arXiv:2106.14485, 2021.
  • [49] I. Haasler, R. Singh, Q. Zhang, J. Karlsson, and Y. Chen. Multi-marginal optimal transport and probabilistic graphical models. IEEE Transactions on Information Theory, 2021.
  • [50] W. K. Haneveld. Robustness against dependence in PERT: An application of duality and distributions with known marginals. In Stochastic Programming, pages 153–182. Springer, 1986.
  • [51] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435, 2013.
  • [52] D. R. Karger. A randomized fully polynomial time approximation scheme for the all-terminal network reliability problem. SIAM Review, 43(3):499–522, 2001.
  • [53] L. G. Khachiyan. Polynomial algorithms in linear programming. USSR Computational Mathematics and Mathematical Physics, 20(1):53–72, 1980.
  • [54] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
  • [55] D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • [56] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.
  • [57] C. Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. Preprint at arXiv:1308.0215, 2013.
  • [58] T. Lin, N. Ho, M. Cuturi, and M. I. Jordan. On the complexity of approximating multimarginal optimal transport. Preprint at arXiv:1910.00152, 2019.
  • [59] N. Linial, A. Samorodnitsky, and A. Wigderson. A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents. In Symposium on Theory of Computing, pages 644–652, 1998.
  • [60] G. Makarov. Estimates for the distribution function of a sum of two random variables when the marginal distributions are fixed. Theory of Probability and its Applications, 26(4):803–806, 1982.
  • [61] I. Meilijson and A. Nádas. Convex majorization with an application to the length of critical paths. Journal of Applied Probability, 16(3):671–677, 1979.
  • [62] V. K. Mishra, K. Natarajan, D. Padmanabhan, C.-P. Teo, and X. Li. On theoretical and empirical aspects of marginal distribution choice models. Management Science, 60(6):1511–1531, 2014.
  • [63] E. F. Moore and C. E. Shannon. Reliable circuits using less reliable relays. Journal of the Franklin Institute, 262(3):191–208, 1956.
  • [64] B. Muzellec, R. Nock, G. Patrini, and F. Nielsen. Tsallis regularized optimal transport and ecological inference. In AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • [65] A. Nádas. Probabilistic PERT. IBM Journal of Research and Development, 23(3):339–347, 1979.
  • [66] K. Natarajan, M. Song, and C.-P. Teo. Persistency model and its applications in choice modeling. Management Science, 55(3):453–469, 2009.
  • [67] L. Nenna. Numerical methods for multi-marginal optimal transportation. PhD thesis, 2016.
  • [68] D. Padmanabhan, S. D. Ahipasaoglu, A. Ramachandra, and K. Natarajan. Extremal probability bounds in combinatorial optimization. Preprint at arXiv:2109.01591, 2021.
  • [69] C. H. Papadimitriou and T. Roughgarden. Computing correlated equilibria in multi-player games. Journal of the ACM, 55(3):1–29, 2008.
  • [70] G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
  • [71] F. Pitié, A. C. Kokaram, and R. Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1-2):123–137, 2007.
  • [72] J. S. Provan and M. O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM Journal on Computing, 12(4):777–788, 1983.
  • [73] K. Quanrud. Approximating optimal transport with linear programs. In Symposium on Simplicity in Algorithms, 2018.
  • [74] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.
  • [75] L. Rüschendorf. Random variables with maximum sums. Advances in Applied Probability, 14(3):623–632, 1982.
  • [76] L. Rüschendorf and L. Uckelmann. On the n-coupling problem. Journal of Multivariate Analysis, 81(2):242–258, 2002.
  • [77] R. Singh, I. Haasler, Q. Zhang, J. Karlsson, and Y. Chen. Incremental inference of collective graphical models. IEEE Control Systems Letters, 2020.
  • [78] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.
  • [79] J. Solomon, F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4):1–11, 2015.
  • [80] S. Srivastava, C. Li, and D. B. Dunson. Scalable Bayes via barycenter in Wasserstein space. Journal of Machine Learning Research, 19(1):312–346, 2018.
  • [81] M. Stoer and F. Wagner. A simple min-cut algorithm. Journal of the ACM, 44(4):585–591, 1997.
  • [82] Y. W. Teh and M. Welling. The unified propagation and scaling algorithm. In Advances in Neural Information Processing Systems, pages 953–960, 2002.
  • [83] L. N. Trefethen. Approximation theory and approximation practice, volume 164. SIAM, 2019.
  • [84] N. Tupitsa, P. Dvurechensky, A. Gasnikov, and C. A. Uribe. Multimarginal optimal transport by accelerated alternating minimization. In IEEE Conference on Decision and Control, pages 6132–6137. IEEE, 2020.
  • [85] L. G. Valiant. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.
  • [86] C. Villani. Topics in optimal transportation. Number 58. American Mathematical Society, 2003.
  • [87] M. J. Wainwright and M. I. Jordan. Variational inference in graphical models: The view from the marginal polytope. In Allerton Conference on Communication, Control, and Computing, volume 41, pages 961–971, 2003.
  • [88] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Now Publishers Inc., 2008.
  • [89] G. Weiss. Stochastic bounds on distributions of optimal value functions with applications to PERT, network flows and reliability. Operations Research, 34(4):595–605, 1986.
  • [90] A. G. Wilson. The use of entropy maximising models, in the theory of trip distribution, mode split and route split. Journal of Transport Economics and Policy, pages 108–126, 1969.
  • [91] N. E. Young. Sequential and parallel algorithms for mixed packing and covering. In Symposium on Foundations of Computer Science, pages 538–546. IEEE, 2001.
  • [92] D. B. Yudin and A. S. Nemirovskii. Informational complexity and efficient methods for the solution of convex extremal problems. Matekon, 13(2):22–45, 1976.
  • [93] E. Zemel. Polynomial algorithms for estimating network reliability. Networks, 12(4):439–452, 1982.