
Improved Complexity Analysis of the Sinkhorn and Greenkhorn Algorithms for Optimal Transport
(JL and DY contribute equally.)

Jianzhou Luo School of Data Science, Fudan University, Shanghai, China. Dingchuan Yang School of Data Science, Fudan University, Shanghai, China. Ke Wei School of Data Science, Fudan University, Shanghai, China.
(August 12, 2025)
Abstract

The Sinkhorn algorithm is a widely used method for solving the optimal transport problem, and the Greenkhorn algorithm is one of its variants. While there are modified versions of these two algorithms whose computational complexities are $O(n^{2}\|\mathbf{C}\|_{\infty}^{2}\log n/\varepsilon^{2})$ to achieve an $\varepsilon$-accuracy, to the best of our knowledge, the existing complexities for the vanilla versions are $O(n^{2}\|\mathbf{C}\|_{\infty}^{3}\log n/\varepsilon^{3})$. In this paper we fill this gap and show that the complexities of the vanilla Sinkhorn and Greenkhorn algorithms are indeed $O(n^{2}\|\mathbf{C}\|_{\infty}^{2}\log n/\varepsilon^{2})$. The analysis relies on the equicontinuity of the dual variables for the discrete entropic regularized optimal transport problem, which is of independent interest.

Keywords. Optimal transport, Sinkhorn algorithm, Greenkhorn algorithm, complexity analysis

1 Introduction

Optimal transport (OT) is the problem of finding the most economical way to transfer one measure to another. The optimal transport problem dates back to Gaspard Monge [1]. Kantorovich (see [2] for historical context) provided a relaxation of this problem by allowing mass splitting in the source, based on the notion of coupling, which advanced its applications in areas such as economics and physics. Recently, with the increasing demand for quantifying the dissimilarity between probability measures, the optimal transport cost, which can serve as a natural notion of metric between measures, has attracted a lot of attention in statistics and machine learning. For instance, as a special case of the optimal transport cost, the Wasserstein distance [3] can be used as a loss function when the output of a supervised learning method is a probability measure [4]. The use of the Wasserstein-1 metric in GANs can effectively mitigate the issues of vanishing gradients and mode collapse [5]. Other applications of optimal transport include shape matching [6], color transfer [7] and object detection [8].

For ease of notation, we consider the case where the target probability vectors are of the same length, denoted $\boldsymbol{a},\boldsymbol{b}\in\Delta_{n}$. Given a cost matrix $\mathbf{C}\in\mathbb{R}_{+}^{n\times n}$, the Kantorovich relaxation attempts to compute the distance between $\boldsymbol{a}$ and $\boldsymbol{b}$ by solving the following problem:

$$\min_{\mathbf{P}\in\boldsymbol{\Pi}(\boldsymbol{a},\boldsymbol{b})}\langle\mathbf{C},\mathbf{P}\rangle, \tag{1}$$

where

$$\boldsymbol{\Pi}(\boldsymbol{a},\boldsymbol{b}):=\left\{\mathbf{P}\in\mathbb{R}_{+}^{n\times n}:\mathbf{P}\boldsymbol{1}_{n}=\boldsymbol{a},\ \mathbf{P}^{\mathsf{T}}\boldsymbol{1}_{n}=\boldsymbol{b}\right\}.$$

As a linear program, the Kantorovich problem can be solved by the simplex method or the interior-point method. Yet both of them suffer from a high computational complexity (overall $O(n^{3})$) [2, 9]. A computationally efficient algorithm is developed in [10], which solves the Shannon entropic regularized Kantorovich problem

$$\min_{\mathbf{P}\in\boldsymbol{\Pi}(\boldsymbol{a},\boldsymbol{b})}\langle\mathbf{C},\mathbf{P}\rangle-\gamma H(\mathbf{P}) \tag{2}$$

by the Sinkhorn update, where $\gamma$ and $H(\mathbf{P})=-\sum_{i,j}\mathbf{P}_{i,j}(\log\mathbf{P}_{i,j}-1)$ are the regularization parameter and the discrete entropy, respectively. The easily parallelizable and GPU-friendly features of the Sinkhorn algorithm make it popular in applications, especially in machine learning [4]. The idea of the Sinkhorn update dates back to [11, 12], and there are many studies on its convergence. A linear convergence rate (not for optimal transport) under the Hilbert projective metric is provided in [13]. Regarding optimal transport, the computational complexity of the Sinkhorn algorithm to achieve an $\varepsilon$-accuracy is shown to be $O(n^{2}\|\mathbf{C}\|_{\infty}^{3}\log n/\varepsilon^{3})$ [14]. The Greenkhorn algorithm is an elementwise variant of the Sinkhorn algorithm which has the same computational complexity [14]. In [15], a variant of the Sinkhorn algorithm which utilizes modified probability vectors is developed, and the $O(n^{2}\|\mathbf{C}\|_{\infty}^{2}\log n/\varepsilon^{2})$ computational complexity is established. The idea has been extended to the Greenkhorn algorithm with the same improved complexity [16]. However, as can be seen later, the numerical results show that the practical performances of the original methods and the modified ones are overall indistinguishable, which motivates us to ask whether we can improve the computational complexity of the vanilla Sinkhorn and Greenkhorn algorithms. The answer is affirmative.

Main contributions. Inspired by the equicontinuity for the continuous KL-divergence regularized OT problem, in this paper we identify a discrete analogue of this property and show that the $O(n^{2}\|\mathbf{C}\|_{\infty}^{2}\log n/\varepsilon^{2})$ complexity of the vanilla Sinkhorn algorithm can be achieved based on this property. In addition, it is shown that this property can also be used to improve the complexity of the vanilla Greenkhorn algorithm to $O(n^{2}\|\mathbf{C}\|_{\infty}^{2}\log n/\varepsilon^{2})$.

Notation. Let $\mathbb{R}_{+}$ and $\mathbb{R}_{++}$ denote the non-negative real numbers and the positive real numbers, respectively. For a vector $\boldsymbol{x}\in\mathbb{R}^{n}$, we denote by $\|\boldsymbol{x}\|_{1}:=\sum_{i=1}^{n}|\boldsymbol{x}_{i}|$ the sum of the absolute values of its elements, and by $\|\boldsymbol{x}\|_{\infty}$ the maximum absolute value of its elements. Given a matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$, we write $\|\mathbf{A}\|_{1}:=\sum_{ij}|\mathbf{A}_{ij}|$ and $\|\mathbf{A}\|_{\infty}:=\max_{ij}|\mathbf{A}_{ij}|$. Let $\Delta_{n}:=\{\boldsymbol{x}\in\mathbb{R}^{n}_{+}:\|\boldsymbol{x}\|_{1}=1\}$ be the probability simplex in $\mathbb{R}^{n}$. For a matrix $\mathbf{A}$ and a vector $\boldsymbol{x}$, we denote by $e^{\mathbf{A}}$, $e^{\boldsymbol{x}}$, $\log\mathbf{A}$, $\log\boldsymbol{x}$ their entrywise exponentials and natural logarithms, respectively. For vectors $\boldsymbol{x},\boldsymbol{y}$ of the same dimension, the product $\boldsymbol{x}\odot\boldsymbol{y}$ and the division $\frac{\boldsymbol{x}}{\boldsymbol{y}}$ should be interpreted entrywise. For two probability vectors $\boldsymbol{a},\boldsymbol{b}\in\Delta_{n}$, the KL-divergence is defined as

$$KL(\boldsymbol{a}\|\boldsymbol{b}):=\sum_{i=1}^{n}\boldsymbol{a}_{i}\log\frac{\boldsymbol{a}_{i}}{\boldsymbol{b}_{i}}.$$

Pinsker's inequality provides a lower bound for $KL(\boldsymbol{a}\|\boldsymbol{b})$ in terms of the $\ell_{1}$-norm:

$$KL(\boldsymbol{a}\|\boldsymbol{b})\geq\frac{1}{2}\|\boldsymbol{a}-\boldsymbol{b}\|^{2}_{1}.$$
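As a quick numerical sanity check (a minimal NumPy sketch, not part of the analysis), the two sides of the inequality can be compared on random probability vectors:

```python
import numpy as np

def kl_div(a, b):
    """KL(a || b) for probability vectors with positive entries."""
    return np.sum(a * np.log(a / b))

# Check Pinsker's inequality on a random pair of probability vectors.
rng = np.random.default_rng(0)
a = rng.random(8); a /= a.sum()
b = rng.random(8); b /= b.sum()
assert kl_div(a, b) >= 0.5 * np.sum(np.abs(a - b)) ** 2
```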

2 Sinkhorn and Its Complexity

Algorithm 1 Sinkhorn [10]
Input: probability vectors $\boldsymbol{a},\boldsymbol{b}$, accuracy $\delta$, $\mathbf{K}=\exp(-\frac{\mathbf{C}}{\gamma})$.
Initialize $k\leftarrow 0$, $\boldsymbol{u}_{0}>0$, $\boldsymbol{v}_{0}>0$
repeat
  if $k$ mod $2=0$ then
    $\boldsymbol{u}_{k+1}=\frac{\boldsymbol{a}}{\mathbf{K}\boldsymbol{v}_{k}}$
    $\boldsymbol{v}_{k+1}=\boldsymbol{v}_{k}$
  else
    $\boldsymbol{u}_{k+1}=\boldsymbol{u}_{k}$
    $\boldsymbol{v}_{k+1}=\frac{\boldsymbol{b}}{\mathbf{K}^{\mathsf{T}}\boldsymbol{u}_{k}}$
  end if
  $k\leftarrow k+1$
  $\mathbf{P}_{k}\leftarrow\operatorname{diag}(\boldsymbol{u}_{k})\mathbf{K}\operatorname{diag}(\boldsymbol{v}_{k})$
until $\|\mathbf{P}_{k}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}+\|\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1}\leq\delta$
Output: $\mathbf{P}_{k}$.
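For concreteness, a minimal NumPy transcription of Algorithm 1 follows (an illustrative sketch, not a reference implementation; for very small $\gamma$ the kernel $\mathbf{K}$ may underflow, in which case the log-domain form discussed below is preferable):

```python
import numpy as np

def sinkhorn(a, b, C, gamma, delta, max_iter=100_000):
    """Vanilla Sinkhorn (Algorithm 1) for positive probability vectors a, b."""
    K = np.exp(-C / gamma)
    u = np.ones_like(a)
    v = np.ones_like(b)
    P = u[:, None] * K * v[None, :]
    for k in range(max_iter):
        if k % 2 == 0:
            u = a / (K @ v)                   # enforce the row marginals
        else:
            v = b / (K.T @ u)                 # enforce the column marginals
        P = u[:, None] * K * v[None, :]       # P_k = diag(u) K diag(v)
        if np.abs(P.sum(1) - a).sum() + np.abs(P.sum(0) - b).sum() <= delta:
            break
    return P
```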

Without loss of generality, assume $\boldsymbol{a}>0$ and $\boldsymbol{b}>0$ (the case where there are zeros in $\boldsymbol{a}$ or $\boldsymbol{b}$ is discussed in Remark 2.13). Then the solution to the entropic regularized Kantorovich problem (2) is the unique matrix of the form

$$\mathbf{P}=\operatorname{diag}(\boldsymbol{u})\mathbf{K}\operatorname{diag}(\boldsymbol{v}) \tag{3}$$

that satisfies

$$\boldsymbol{u}\odot(\mathbf{K}\boldsymbol{v})=\boldsymbol{a}\quad\text{and}\quad\boldsymbol{v}\odot(\mathbf{K}^{\mathsf{T}}\boldsymbol{u})=\boldsymbol{b}. \tag{4}$$

Here, $\mathbf{K}=\exp(-\frac{\mathbf{C}}{\gamma})$, $(\boldsymbol{u},\boldsymbol{v})\in\mathbb{R}_{++}^{n}\times\mathbb{R}_{++}^{n}$ are two (unknown) scaling vectors, and (4) is a reformulation of the constraints $\mathbf{P}\boldsymbol{1}_{n}=\boldsymbol{a}$, $\mathbf{P}^{\mathsf{T}}\boldsymbol{1}_{n}=\boldsymbol{b}$ in terms of $\boldsymbol{u}$ and $\boldsymbol{v}$. For the details of the derivation, see [2]. Based on this optimality condition, the Sinkhorn algorithm with arbitrary positive initialization vectors $\boldsymbol{u}_{0}$ and $\boldsymbol{v}_{0}$ conducts alternating projections to satisfy (4); see Algorithm 1 for a description of the algorithm. As already mentioned, the best known complexity for the vanilla Sinkhorn algorithm is $O(n^{2}\|\mathbf{C}\|_{\infty}^{3}\log n/\varepsilon^{3})$ [14], while this complexity can be improved to $O(n^{2}\|\mathbf{C}\|_{\infty}^{2}\log n/\varepsilon^{2})$ for the modified variant which uses lifted values of $\boldsymbol{a}$ and $\boldsymbol{b}$ [15] (more precisely, run Sinkhorn using $(1-\frac{\delta}{8})((\boldsymbol{a},\boldsymbol{b})+\frac{\delta}{n(8-\delta)}(\boldsymbol{1},\boldsymbol{1}))$ for a small mismatch $\delta$ instead of $(\boldsymbol{a},\boldsymbol{b})$). However, the numerical results in Figure 1 demonstrate that the practical performances of the vanilla and modified Sinkhorn algorithms are overall similar, which motivates us to improve the convergence analysis of the vanilla Sinkhorn algorithm. The numerical experiments are conducted on two data sets: the MNIST data set and a widely used synthetic data set [14]. In the experiments, a pair of images is randomly sampled from each data set, and $\boldsymbol{a}$ and $\boldsymbol{b}$ are obtained by vectorizing the images, followed by normalization. The transport cost is computed as the $\ell_{2}$-distance between the pixel positions. The plots in Figure 1 are average results over 10 random instances.

Figure 1: Transport cost error vs. number of iterations for the vanilla and modified Sinkhorn algorithms. (a) MNIST data set; (b) synthetic data set.

There are several equivalent interpretations of the Sinkhorn algorithm [2], and our convergence analysis is based on the dual perspective. First note that the dual problem of the entropic regularized Kantorovich problem (2) is given by

$$\max_{\mathbf{f},\mathbf{g}\in\mathbb{R}^{n}}h(\mathbf{f},\mathbf{g}):=\langle\mathbf{f},\boldsymbol{a}\rangle+\langle\mathbf{g},\boldsymbol{b}\rangle-\gamma\sum_{i,j}\exp\left[\frac{1}{\gamma}(\mathbf{f}_{i}+\mathbf{g}_{j}-\mathbf{C}_{ij})\right]. \tag{5}$$

For the details of the derivation, see for example [10]. If we define

$$\mathbf{f}_{k}=\gamma\log\boldsymbol{u}_{k},\qquad\mathbf{g}_{k}=\gamma\log\boldsymbol{v}_{k},$$

it can be easily verified that

$$\begin{cases}\mathbf{f}_{k+1}=\gamma\log\boldsymbol{a}-\gamma\log(\mathbf{K}e^{\mathbf{g}_{k}/\gamma}),\quad\mathbf{g}_{k+1}=\mathbf{g}_{k},&\text{if $k$ is even},\\ \mathbf{f}_{k+1}=\mathbf{f}_{k},\quad\mathbf{g}_{k+1}=\gamma\log\boldsymbol{b}-\gamma\log(\mathbf{K}^{\mathsf{T}}e^{\mathbf{f}_{k}/\gamma}),&\text{if $k$ is odd}.\end{cases}$$

Moreover, when $k\geq 2$,

$$\begin{cases}\nabla_{\mathbf{f}}h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})=\boldsymbol{a}-e^{\mathbf{f}_{k+1}/\gamma}\odot(\mathbf{K}e^{\mathbf{g}_{k+1}/\gamma})=\boldsymbol{0},&\text{if $k$ is even},\\ \nabla_{\mathbf{g}}h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})=\boldsymbol{b}-e^{\mathbf{g}_{k+1}/\gamma}\odot(\mathbf{K}^{\mathsf{T}}e^{\mathbf{f}_{k+1}/\gamma})=\boldsymbol{0},&\text{if $k$ is odd}.\end{cases}$$

Therefore, in terms of $(\mathbf{f}_{k},\mathbf{g}_{k})$, the Sinkhorn algorithm can be interpreted as block coordinate ascent on the dual problem (5), and the analysis will be primarily based on $(\mathbf{f}_{k},\mathbf{g}_{k})$.
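The dual updates above can be computed directly in the log domain, which is numerically stable for small $\gamma$. The sketch below (illustrative only; `logsumexp` is from SciPy) performs the block coordinate ascent on $h$ and also evaluates $h$ itself, whose monotone increase along the iterates can be checked numerically:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_dual(a, b, C, gamma, n_iter):
    """Sinkhorn as block coordinate ascent on the dual h(f, g) of (5)."""
    f = np.zeros_like(a)
    g = np.zeros_like(b)
    for k in range(n_iter):
        if k % 2 == 0:
            # f_{k+1} = gamma*log(a) - gamma*log(K e^{g_k/gamma})
            f = gamma * np.log(a) - gamma * logsumexp((g[None, :] - C) / gamma, axis=1)
        else:
            # g_{k+1} = gamma*log(b) - gamma*log(K^T e^{f_k/gamma})
            g = gamma * np.log(b) - gamma * logsumexp((f[:, None] - C) / gamma, axis=0)
    return f, g

def h(f, g, a, b, C, gamma):
    """Dual objective h(f, g) in (5)."""
    return f @ a + g @ b - gamma * np.exp((f[:, None] + g[None, :] - C) / gamma).sum()
```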

Algorithm 2 $\mathrm{Round}(\mathbf{P},\boldsymbol{a},\boldsymbol{b})$ [14]
Input: $\mathbf{P}\in\mathbb{R}_{+}^{n\times n}$, $\boldsymbol{a},\boldsymbol{b}\in\mathbb{R}^{n}_{+}$.
for $i\in[n]$ do
  if $(\mathbf{P}\boldsymbol{1}_{n})_{i}>\boldsymbol{a}_{i}$ then $\boldsymbol{x}_{i}=\frac{\boldsymbol{a}_{i}}{(\mathbf{P}\boldsymbol{1}_{n})_{i}}$; else $\boldsymbol{x}_{i}=1$
end for
$\mathbf{X}\leftarrow\operatorname{diag}(\boldsymbol{x})$, $\mathbf{P}^{\prime}\leftarrow\mathbf{X}\mathbf{P}$
for $j\in[n]$ do
  if $(\mathbf{P}^{\prime\mathsf{T}}\boldsymbol{1}_{n})_{j}>\boldsymbol{b}_{j}$ then $\boldsymbol{y}_{j}=\frac{\boldsymbol{b}_{j}}{(\mathbf{P}^{\prime\mathsf{T}}\boldsymbol{1}_{n})_{j}}$; else $\boldsymbol{y}_{j}=1$
end for
$\mathbf{Y}\leftarrow\operatorname{diag}(\boldsymbol{y})$, $\mathbf{P}^{\prime\prime}\leftarrow\mathbf{P}^{\prime}\mathbf{Y}$
$\Delta_{\boldsymbol{a}}\leftarrow\boldsymbol{a}-\mathbf{P}^{\prime\prime}\boldsymbol{1}_{n}$, $\Delta_{\boldsymbol{b}}\leftarrow\boldsymbol{b}-\mathbf{P}^{\prime\prime\mathsf{T}}\boldsymbol{1}_{n}$
$\widehat{\mathbf{P}}\leftarrow\mathbf{P}^{\prime\prime}+\frac{1}{\|\Delta_{\boldsymbol{a}}\|_{1}}\Delta_{\boldsymbol{a}}\Delta_{\boldsymbol{b}}^{\mathsf{T}}$
Output: $\widehat{\mathbf{P}}\in\boldsymbol{\Pi}(\boldsymbol{a},\boldsymbol{b})$.
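A compact NumPy transcription of Algorithm 2 (a sketch; the guard for the degenerate case $\|\Delta_{\boldsymbol{a}}\|_{1}=0$, where $\mathbf{P}''$ already lies in $\boldsymbol{\Pi}(\boldsymbol{a},\boldsymbol{b})$, is our addition):

```python
import numpy as np

def round_to_marginals(P, a, b):
    """Round(P, a, b) from Algorithm 2: returns a matrix in Pi(a, b)."""
    x = np.minimum(a / P.sum(axis=1), 1.0)    # shrink rows that overshoot a
    P1 = x[:, None] * P
    y = np.minimum(b / P1.sum(axis=0), 1.0)   # shrink columns that overshoot b
    P2 = P1 * y[None, :]
    da = a - P2.sum(axis=1)                   # remaining row deficit (nonnegative)
    db = b - P2.sum(axis=0)                   # remaining column deficit (nonnegative)
    if da.sum() == 0:                         # P2 already has the right marginals
        return P2
    return P2 + np.outer(da, db) / da.sum()
```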

After computing an approximate transport plan matrix $\mathbf{P}_{k}$ which satisfies $\|\mathbf{P}_{k}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}+\|\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1}\leq\delta$ by the Sinkhorn algorithm, a rounding step presented in Algorithm 2 usually follows to ensure that the output matrix $\widehat{\mathbf{P}}_{k}$ satisfies $\widehat{\mathbf{P}}_{k}\boldsymbol{1}_{n}=\boldsymbol{a}$ and $\widehat{\mathbf{P}}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}=\boldsymbol{b}$ simultaneously. The goal in the analysis is to study the computational complexity of the algorithm when an $\varepsilon$-approximate solution of the original Kantorovich problem (1) is obtained, that is, when $\widehat{\mathbf{P}}_{k}$ satisfies

$$\langle\mathbf{C},\widehat{\mathbf{P}}_{k}\rangle\leq\langle\mathbf{C},\mathbf{P}^{*}\rangle+\varepsilon,$$

where $\mathbf{P}^{*}$ is the solution to (1).

To carry out the analysis, we first list three lemmas from [14]. For readers’ convenience, we have included the proofs in the appendix. The first lemma concerns the error bound of the rounding scheme, which is presented in a general form where 𝒂\boldsymbol{a} and 𝒃\boldsymbol{b} are not necessarily probability vectors. This will be useful in the analysis of the Greenkhorn algorithm in the next section.

Lemma 2.1 ([14, Lemma 7]).

For any vectors $\boldsymbol{a},\boldsymbol{b}\in\mathbb{R}_{+}^{n}$ satisfying $\|\boldsymbol{a}\|_{1}=\|\boldsymbol{b}\|_{1}$ and any matrix $\mathbf{P}\in\mathbb{R}^{n\times n}_{+}$, Algorithm 2 outputs a matrix $\mathrm{Round}(\mathbf{P},\boldsymbol{a},\boldsymbol{b})\in\boldsymbol{\Pi}(\boldsymbol{a},\boldsymbol{b})$ such that

$$\|\mathbf{P}-\mathrm{Round}(\mathbf{P},\boldsymbol{a},\boldsymbol{b})\|_{1}\leq 2\left(\|\boldsymbol{a}-\mathbf{P}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}\right).$$

The second lemma provides a bound for the error between the true OT distance $\langle\mathbf{C},\mathbf{P}^{*}\rangle$ and the approximate OT distance $\langle\mathbf{C},\widehat{\mathbf{P}}_{k}\rangle$.

Lemma 2.2 ([14, Theorem 1]).

Let $\mathbf{P}^{*}$ be the solution to the Kantorovich problem (1) with marginals $\boldsymbol{a}$, $\boldsymbol{b}$ and cost matrix $\mathbf{C}$. Let $\mathbf{B}$ be a probability matrix (i.e., $\|\mathbf{B}\|_{1}=1$) of the form $\mathbf{B}=\operatorname{diag}(\boldsymbol{u})\mathbf{K}\operatorname{diag}(\boldsymbol{v})$, where $\boldsymbol{u},\boldsymbol{v}\in\mathbb{R}^{n}_{++}$ and $\mathbf{K}\in\mathbb{R}^{n\times n}$ is defined by $\mathbf{K}=\exp(-\frac{\mathbf{C}}{\gamma})$. Letting $\widehat{\mathbf{B}}=\mathrm{Round}(\mathbf{B},\boldsymbol{a},\boldsymbol{b})$, we have

$$\langle\mathbf{C},\widehat{\mathbf{B}}\rangle\leq\langle\mathbf{C},\mathbf{P}^{*}\rangle+2\gamma\log n+4(\|\mathbf{B}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}+\|\mathbf{B}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1})\|\mathbf{C}\|_{\infty}. \tag{6}$$
Remark 2.3.

If, in (6), we set $\mathbf{B}$ to be $\mathbf{P}_{k}$, the output of Algorithm 1, we have

$$\langle\mathbf{C},\widehat{\mathbf{P}}_{k}\rangle\leq\langle\mathbf{C},\mathbf{P}^{*}\rangle+2\gamma\log n+4(\|\mathbf{P}_{k}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}+\|\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1})\|\mathbf{C}\|_{\infty}.$$

This implies that if we set $\gamma=\frac{\varepsilon}{4\log n}$ for the regularization parameter and choose $\delta=\frac{\varepsilon}{8\|\mathbf{C}\|_{\infty}}$ as the termination threshold, the Sinkhorn algorithm will output a $\mathbf{P}_{k}$ such that

$$\langle\mathbf{C},\mathrm{Round}(\mathbf{P}_{k},\boldsymbol{a},\boldsymbol{b})\rangle\leq\langle\mathbf{C},\mathbf{P}^{*}\rangle+\varepsilon.$$
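In code, these parameter choices combine with the earlier sketches as follows (illustrative; `sinkhorn` and `round_to_marginals` are the hypothetical helpers sketched above, and small $\varepsilon$ may require many iterations):

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 100, 0.25
C = rng.random((n, n)); C /= C.max()      # cost matrix with ||C||_inf = 1
a = rng.random(n); a /= a.sum()
b = rng.random(n); b /= b.sum()

gamma = eps / (4 * np.log(n))             # regularization level from Remark 2.3
delta = eps / (8 * np.abs(C).max())       # termination threshold from Remark 2.3
P = sinkhorn(a, b, C, gamma, delta)
P_hat = round_to_marginals(P, a, b)       # feasible plan with <C, P_hat> <= <C, P*> + eps
```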

The third lemma characterizes the one-iteration improvement of the Sinkhorn algorithm in terms of the function value.

Lemma 2.4 ([14, Lemma 2]).

If $k\geq 2$, then

$$h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})-h(\mathbf{f}_{k},\mathbf{g}_{k})\geq\frac{\gamma}{2}(\|\mathbf{P}_{k}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}+\|\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1})^{2}.$$

The improved convergence analysis of the Sinkhorn algorithm relies heavily on the equicontinuity of the discrete regularized optimal transport problem. To state this property, we first follow the convention in the optimal transport literature [17] and define the $(\mathbf{C},\gamma)$-transform and $\overline{(\mathbf{C},\gamma)}$-transform of vectors.

Definition 2.5 ($(\mathbf{C},\gamma)$-transform, $\overline{(\mathbf{C},\gamma)}$-transform).

For any vector $\mathbf{f}\in\mathbb{R}^{n}$, we define its $(\mathbf{C},\gamma)$-transform by

$$\mathbf{f}^{(\mathbf{C},\gamma)}:=\mathop{\arg\max}_{\mathbf{g}\in\mathbb{R}^{n}}h(\mathbf{f},\mathbf{g}).$$

For any vector $\mathbf{g}\in\mathbb{R}^{n}$, we define its $\overline{(\mathbf{C},\gamma)}$-transform by

$$\mathbf{g}^{\overline{(\mathbf{C},\gamma)}}:=\mathop{\arg\max}_{\mathbf{f}\in\mathbb{R}^{n}}h(\mathbf{f},\mathbf{g}).$$

The definitions of the $(\mathbf{C},\gamma)$-transform and the $\overline{(\mathbf{C},\gamma)}$-transform enable us to describe the Sinkhorn update as

$$\left\{\begin{array}{ll}\mathbf{f}_{k+1}=\mathbf{g}_{k}^{\overline{(\mathbf{C},\gamma)}},\ \mathbf{g}_{k+1}=\mathbf{g}_{k},&\text{if }k\text{ is even},\\ \mathbf{f}_{k+1}=\mathbf{f}_{k},\ \mathbf{g}_{k+1}=\mathbf{f}_{k}^{(\mathbf{C},\gamma)},&\text{if }k\text{ is odd}.\end{array}\right.$$

Furthermore, we have

$$\mathbf{f}_{*}^{\gamma}=\left(\mathbf{g}_{*}^{\gamma}\right)^{\overline{(\mathbf{C},\gamma)}},\quad\mathbf{g}_{*}^{\gamma}=\left(\mathbf{f}_{*}^{\gamma}\right)^{(\mathbf{C},\gamma)},$$

where $(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})$ is any solution of (5). Next we present the key property used in the analysis, which is inspired by the equicontinuity of the $(c,\gamma)$-transform and $\overline{(c,\gamma)}$-transform for the continuous KL-divergence regularized optimal transport problem.

Lemma 2.6 (Equicontinuity).

Assume $h$ in (5) is defined with $\boldsymbol{a}>0$, $\boldsymbol{b}>0$, and $\mathbf{C}\geq 0$. Then for any vectors $\mathbf{f},\mathbf{g}\in\mathbb{R}^{n}$,

$$\max(\mathbf{f}^{(\mathbf{C},\gamma)}-\gamma\log\boldsymbol{b})-\min(\mathbf{f}^{(\mathbf{C},\gamma)}-\gamma\log\boldsymbol{b})\leq\max(\mathbf{C})-\min(\mathbf{C})\leq\|\mathbf{C}\|_{\infty},$$
$$\max(\mathbf{g}^{\overline{(\mathbf{C},\gamma)}}-\gamma\log\boldsymbol{a})-\min(\mathbf{g}^{\overline{(\mathbf{C},\gamma)}}-\gamma\log\boldsymbol{a})\leq\max(\mathbf{C})-\min(\mathbf{C})\leq\|\mathbf{C}\|_{\infty},$$

where $\max(\cdot)$ and $\min(\cdot)$ refer to the largest and smallest entry of a vector or matrix, respectively.

Proof.

By the definition of the $(\mathbf{C},\gamma)$-transform, we have

$$\left[\nabla_{\mathbf{g}}h(\mathbf{f},\mathbf{f}^{(\mathbf{C},\gamma)})\right]_{j}=\boldsymbol{b}_{j}-\sum_{i=1}^{n}\exp\left[\frac{1}{\gamma}(\mathbf{f}_{i}+\mathbf{f}^{(\mathbf{C},\gamma)}_{j}-\mathbf{C}_{i,j})\right]=0$$

for any $j\in[n]$. This means that for any $j,j^{\prime}\in[n]$,

$$\sum_{i=1}^{n}\exp\left[\frac{1}{\gamma}(\mathbf{f}_{i}+\mathbf{f}^{(\mathbf{C},\gamma)}_{j}-\gamma\log\boldsymbol{b}_{j}-\mathbf{C}_{i,j})\right]=\sum_{i=1}^{n}\exp\left[\frac{1}{\gamma}(\mathbf{f}_{i}+\mathbf{f}^{(\mathbf{C},\gamma)}_{j^{\prime}}-\gamma\log\boldsymbol{b}_{j^{\prime}}-\mathbf{C}_{i,j^{\prime}})\right]=1. \tag{7}$$

Suppose that there exist $j,j^{\prime}\in[n]$ satisfying

$$\mathbf{f}^{(\mathbf{C},\gamma)}_{j}-\gamma\log\boldsymbol{b}_{j}>\mathbf{f}^{(\mathbf{C},\gamma)}_{j^{\prime}}-\gamma\log\boldsymbol{b}_{j^{\prime}}+\max(\mathbf{C})-\min(\mathbf{C}).$$

Then we have

$$\exp\left[\frac{1}{\gamma}(\mathbf{f}_{i}+\mathbf{f}^{(\mathbf{C},\gamma)}_{j}-\gamma\log\boldsymbol{b}_{j}-\mathbf{C}_{i,j})\right]>\exp\left[\frac{1}{\gamma}(\mathbf{f}_{i}+\mathbf{f}^{(\mathbf{C},\gamma)}_{j^{\prime}}-\gamma\log\boldsymbol{b}_{j^{\prime}}-\mathbf{C}_{i,j^{\prime}})\right]$$

for any $i\in[n]$, which contradicts (7) after summing over all $i\in[n]$. This completes the proof of the first claim, and the second claim can be proved similarly. ∎

Remark 2.7.

The upper bound in Lemma 2.6 is tight. Here we give an example. Let $\mathbf{f}=\boldsymbol{0}$. By the definition of the $(\mathbf{C},\gamma)$-transform, we have

$$\mathbf{f}^{(\mathbf{C},\gamma)}_{j}-\gamma\log\boldsymbol{b}_{j}=-\gamma\log\sum_{i}\exp\left(-\frac{1}{\gamma}\mathbf{C}_{i,j}\right)$$

for any $j\in[n]$. Consider the case where the entries of the $j$-th column of $\mathbf{C}$ are all equal to $\|\mathbf{C}\|_{\infty}$ and the entries of the $j^{\prime}$-th column of $\mathbf{C}$ are all zeros. Then a simple calculation yields

$$\mathbf{f}^{(\mathbf{C},\gamma)}_{j}-\gamma\log\boldsymbol{b}_{j}=\mathbf{f}^{(\mathbf{C},\gamma)}_{j^{\prime}}-\gamma\log\boldsymbol{b}_{j^{\prime}}+\|\mathbf{C}\|_{\infty}.$$
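Since $\nabla_{\mathbf{g}}h(\mathbf{f},\mathbf{f}^{(\mathbf{C},\gamma)})=\boldsymbol{0}$ gives the closed form $\mathbf{f}^{(\mathbf{C},\gamma)}_{j}=\gamma\log\boldsymbol{b}_{j}-\gamma\log\sum_{i}\exp((\mathbf{f}_{i}-\mathbf{C}_{i,j})/\gamma)$, both Lemma 2.6 and the tightness example above can be checked numerically; a minimal sketch:

```python
import numpy as np
from scipy.special import logsumexp

def C_gamma_transform(f, b, C, gamma):
    """Closed form of f^{(C,gamma)} obtained from grad_g h = 0."""
    return gamma * np.log(b) - gamma * logsumexp((f[:, None] - C) / gamma, axis=0)

rng = np.random.default_rng(2)
n, gamma = 6, 0.3
C = rng.random((n, n))
C[:, 0] = C.max(); C[:, 1] = 0.0       # tight case: a max column and a zero column
b = rng.random(n); b /= b.sum()
f = rng.standard_normal(n)

# Lemma 2.6: oscillation of f^{(C,gamma)} - gamma*log(b) is at most max(C) - min(C).
w = C_gamma_transform(f, b, C, gamma) - gamma * np.log(b)
assert w.max() - w.min() <= C.max() - C.min() + 1e-12

# Remark 2.7: equality is attained for f = 0 with the two special columns.
w0 = C_gamma_transform(np.zeros(n), b, C, gamma) - gamma * np.log(b)
assert abs((w0[0] - w0[1]) - np.abs(C).max()) < 1e-9
```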
Remark 2.8.

Let $a$, $b$ be two regular Borel measures supported on compact metric spaces $(X,d_{X})$ and $(Y,d_{Y})$, respectively, and let $c\in C(X\times Y)$ be a continuous function defined on $X\times Y$. The continuous KL-divergence regularized Kantorovich problem is defined by

$$\min_{P\in\pi(a,b)}\int_{X\times Y}c(x,y)\,dP(x,y)+\gamma KL(P\|a\otimes b), \tag{8}$$

where the $KL$-divergence between measures is defined by

$$KL(\mu\|\nu):=\int_{X\times Y}\log\left(\frac{d\mu}{d\nu}(x,y)\right)d\mu(x,y)$$

for $\mu,\nu\in\mathcal{P}(X\times Y)$. The dual of (8) is given by (see for example [2])

$$\max_{\alpha\in C(X),\beta\in C(Y)}\xi(\alpha,\beta):=\int_{X}\alpha(x)\,da(x)+\int_{Y}\beta(y)\,db(y)-\gamma\int_{X\times Y}\exp\left(\frac{1}{\gamma}(\alpha(x)+\beta(y)-c(x,y))\right)d(a\otimes b)(x,y),$$

and the $(c,\gamma)$-transform and $\overline{(c,\gamma)}$-transform are partial maximizations of the functional $\xi(\alpha,\beta)$. We only consider the $(c,\gamma)$-transform here; the $\overline{(c,\gamma)}$-transform can be similarly discussed. For any $\alpha:X\to\mathbb{R}$, its $(c,\gamma)$-transform $\alpha^{(c,\gamma)}:Y\to\mathbb{R}$ is defined by

$$\alpha^{(c,\gamma)}(y):=\mathop{\arg\max}_{s\in\mathbb{R}}\left\{s-\gamma\int_{X}\exp\left(\frac{1}{\gamma}(\alpha(x)+s-c(x,y))\right)da(x)\right\}.$$

For any $\alpha:X\to\mathbb{R}$, it has been shown that $\alpha^{(c,\gamma)}$ shares the same modulus of continuity as the family $\{c(x,\cdot)\}_{x\in X}$ [18]. That is, if

$$|c(x,y)-c(x,y^{\prime})|\leq\omega(d_{Y}(y,y^{\prime}))$$

for any $x\in X$ and $y,y^{\prime}\in Y$, where $\omega:\mathbb{R}_{+}\to\mathbb{R}_{+}$ is an increasing continuous function with $\omega(0)=0$, then

$$|\alpha^{(c,\gamma)}(y)-\alpha^{(c,\gamma)}(y^{\prime})|\leq\omega(d_{Y}(y,y^{\prime}))$$

for any $y,y^{\prime}\in Y$.

When $X=\{x_{1},\dots,x_{n}\}$, $Y=\{y_{1},\dots,y_{n}\}$, the measures $a$ and $b$ can be described by vectors $\boldsymbol{a},\boldsymbol{b}$ satisfying $\boldsymbol{a}_{i}=a(\{x_{i}\})$, $\boldsymbol{b}_{j}=b(\{y_{j}\})$ for any $i,j\in[n]$, and the function $c$ can be described by a matrix $\mathbf{C}\in\mathbb{R}^{n\times n}$ satisfying $\mathbf{C}_{i,j}=c(x_{i},y_{j})$ for any $(i,j)\in[n]\times[n]$. It is noted in [10] that the KL-divergence regularized Kantorovich problem is equivalent to the entropic regularized Kantorovich problem in this situation. Let $\boldsymbol{\alpha}=[\alpha(x_{1}),\dots,\alpha(x_{n})]^{\mathsf{T}}$, $\boldsymbol{\beta}=[\beta(y_{1}),\dots,\beta(y_{n})]^{\mathsf{T}}$. Then we have

$$\begin{aligned}\xi(\alpha,\beta)&=\langle\boldsymbol{\alpha},\boldsymbol{a}\rangle+\langle\boldsymbol{\beta},\boldsymbol{b}\rangle-\gamma\sum_{i,j}\exp\left(\frac{1}{\gamma}(\boldsymbol{\alpha}_{i}+\boldsymbol{\beta}_{j}-\mathbf{C}_{i,j})\right)\boldsymbol{a}_{i}\boldsymbol{b}_{j}\\ &=h(\boldsymbol{\alpha}+\gamma\log\boldsymbol{a},\boldsymbol{\beta}+\gamma\log\boldsymbol{b})-\gamma\langle\log\boldsymbol{a},\boldsymbol{a}\rangle-\gamma\langle\log\boldsymbol{b},\boldsymbol{b}\rangle\\ &=h(\mathbf{f},\mathbf{g})-\gamma\langle\log\boldsymbol{a},\boldsymbol{a}\rangle-\gamma\langle\log\boldsymbol{b},\boldsymbol{b}\rangle,\quad\text{where }\mathbf{f}=\boldsymbol{\alpha}+\gamma\log\boldsymbol{a}\text{ and }\mathbf{g}=\boldsymbol{\beta}+\gamma\log\boldsymbol{b}.\end{aligned}$$

The equicontinuity property with respect to $\boldsymbol{\alpha}$ inherited from the continuous setting (under certain $\omega$ and $d_{Y}$) inspires us to seek the equicontinuity property with respect to $\mathbf{f}-\gamma\log\boldsymbol{a}$ (with a similar discussion for $\mathbf{g}-\gamma\log\boldsymbol{b}$) directly for the discrete entropic regularized optimal transport problem, as presented in Lemma 2.6. Moreover, we will show that the complexities of the vanilla Sinkhorn and Greenkhorn algorithms can both be improved to $O(n^{2}\|\mathbf{C}\|_{\infty}^{2}\log n/\varepsilon^{2})$ based on this property.

Recall that in the Sinkhorn algorithm $\mathbf{f}_{k}=\mathbf{g}_{k}^{\overline{(\mathbf{C},\gamma)}}$ when $k\geq 2$ is odd and $\mathbf{g}_{k}=\mathbf{f}_{k}^{(\mathbf{C},\gamma)}$ when $k\geq 2$ is even. Moreover, since $\mathbf{f}_{k}=\mathbf{f}_{k-1}$ if $k\geq 2$ is even and $\mathbf{g}_{k}=\mathbf{g}_{k-1}$ if $k\geq 2$ is odd, the following corollary is a straightforward consequence of Lemma 2.6.

Corollary 2.9.

No matter what initial variables $\boldsymbol{u}_{0}>0$, $\boldsymbol{v}_{0}>0$ are chosen in the Sinkhorn algorithm, for every $k\geq 2$ the corresponding dual variables satisfy

$$\max(\mathbf{f}_{k}-\gamma\log\boldsymbol{a})-\min(\mathbf{f}_{k}-\gamma\log\boldsymbol{a})\leq\|\mathbf{C}\|_{\infty},$$
$$\max(\mathbf{g}_{k}-\gamma\log\boldsymbol{b})-\min(\mathbf{g}_{k}-\gamma\log\boldsymbol{b})\leq\|\mathbf{C}\|_{\infty}.$$

Moreover, for any solution $(\mathbf{f}^{\gamma}_{*},\mathbf{g}^{\gamma}_{*})$ of problem (5), we have

$$\max(\mathbf{f}^{\gamma}_{*}-\gamma\log\boldsymbol{a})-\min(\mathbf{f}^{\gamma}_{*}-\gamma\log\boldsymbol{a})\leq\|\mathbf{C}\|_{\infty},$$
$$\max(\mathbf{g}^{\gamma}_{*}-\gamma\log\boldsymbol{b})-\min(\mathbf{g}^{\gamma}_{*}-\gamma\log\boldsymbol{b})\leq\|\mathbf{C}\|_{\infty}.$$

The following lemma provides a bound for the gap between $h(\mathbf{f}_{k},\mathbf{g}_{k})$ and the optimal value $h(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})$ utilizing Corollary 2.9.

Lemma 2.10.

Define $T_{k}:=h(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})-h(\mathbf{f}_{k},\mathbf{g}_{k})$. Then for $k\geq 2$,

$$T_{k}\leq\|\mathbf{C}\|_{\infty}(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}),$$

where $\mathbf{P}_{k}=\operatorname{diag}(e^{\mathbf{f}_{k}/\gamma})\mathbf{K}\operatorname{diag}(e^{\mathbf{g}_{k}/\gamma})$ is the corresponding transport plan matrix.

Proof.

First recall that

$$\nabla_{\mathbf{f}}h(\mathbf{f},\mathbf{g})=\boldsymbol{a}-e^{\mathbf{f}/\gamma}\odot(\mathbf{K}e^{\mathbf{g}/\gamma})=\boldsymbol{a}-\mathbf{P}\boldsymbol{1}_{n},$$
$$\nabla_{\mathbf{g}}h(\mathbf{f},\mathbf{g})=\boldsymbol{b}-e^{\mathbf{g}/\gamma}\odot(\mathbf{K}^{\mathsf{T}}e^{\mathbf{f}/\gamma})=\boldsymbol{b}-\mathbf{P}^{\mathsf{T}}\boldsymbol{1}_{n},$$

where $\mathbf{P}=\operatorname{diag}(e^{\mathbf{f}/\gamma})\mathbf{K}\operatorname{diag}(e^{\mathbf{g}/\gamma})$. Suppose without loss of generality that $k\geq 2$ is even. Since $h(\mathbf{f},\mathbf{g})$ is a concave function, we have

$$\begin{aligned}T_{k}&=h(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})-h(\mathbf{f}_{k},\mathbf{g}_{k})\\ &\leq\langle\mathbf{f}_{*}^{\gamma}-\mathbf{f}_{k},\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\rangle+\langle\mathbf{g}_{*}^{\gamma}-\mathbf{g}_{k},\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\rangle\\ &=\langle\mathbf{f}_{*}^{\gamma}-\mathbf{f}_{k},\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\rangle\\ &=\langle(\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a})-(\mathbf{f}_{k}-\gamma\log\boldsymbol{a}),\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\rangle,\end{aligned} \tag{9}$$

where the second equality follows from $\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}=\boldsymbol{b}$. Note that $\boldsymbol{a}$ and $\mathbf{P}_{k}\boldsymbol{1}_{n}$ are both probability vectors. Thus,

$$\begin{aligned}\langle\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a},\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\rangle&=\langle\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a}-\lambda\boldsymbol{1}_{n},\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\rangle\\ &\leq\|\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a}-\lambda\boldsymbol{1}_{n}\|_{\infty}\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}.\end{aligned}$$

Since it holds for any $\lambda\in\mathbb{R}$, we have

$$\begin{aligned}\langle\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a},\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\rangle&\leq\min_{\lambda}\|\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a}-\lambda\boldsymbol{1}_{n}\|_{\infty}\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}\\ &=\frac{1}{2}\left[\max(\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a})-\min(\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a})\right]\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}\\ &\leq\frac{\|\mathbf{C}\|_{\infty}}{2}\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1},\end{aligned}$$

where it is easy to see that the minimum is achieved at $\lambda=\frac{1}{2}\left[\max(\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a})+\min(\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a})\right]$, and the last inequality follows from Corollary 2.9. Since $\langle\mathbf{f}_{k}-\gamma\log\boldsymbol{a},\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\rangle$ in (9) can be bounded similarly, the proof is complete. ∎

Remark 2.11.

In [15], it has been shown that

$$\max_{i}(\mathbf{f}_{k})_{i}-\min_{i}(\mathbf{f}_{k})_{i}\leq R,\quad\max_{j}(\mathbf{g}_{k})_{j}-\min_{j}(\mathbf{g}_{k})_{j}\leq R,$$
$$\max_{i}(\mathbf{f}^{*})_{i}-\min_{i}(\mathbf{f}^{*})_{i}\leq R,\quad\max_{j}(\mathbf{g}^{*})_{j}-\min_{j}(\mathbf{g}^{*})_{j}\leq R$$

and

$$T_{k}\leq R(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}),$$

where $R=-\log(\nu\min_{i,j}\{\boldsymbol{a}_{i},\boldsymbol{b}_{j}\})$ and $\nu=\min_{i,j}\mathbf{K}_{ij}=e^{-\|\mathbf{C}\|_{\infty}/\gamma}$. By modifying $\boldsymbol{a},\boldsymbol{b}$ into $\tilde{\boldsymbol{a}},\tilde{\boldsymbol{b}}$ satisfying

$$\|\tilde{\boldsymbol{a}}-\boldsymbol{a}\|_{1}\leq\frac{\delta}{4},\quad\|\tilde{\boldsymbol{b}}-\boldsymbol{b}\|_{1}\leq\frac{\delta}{4}$$

and

$$\min_{i}\tilde{\boldsymbol{a}}_{i}\geq\frac{\delta}{8n},\quad\min_{j}\tilde{\boldsymbol{b}}_{j}\geq\frac{\delta}{8n},$$

the Sinkhorn algorithm using $\tilde{\boldsymbol{a}},\tilde{\boldsymbol{b}}$ achieves an $\varepsilon$-approximate solution to the original OT problem (1) with complexity $O(\frac{n^{2}\|\mathbf{C}\|_{\infty}^{2}\log n}{\varepsilon^{2}})$. Such $\tilde{\boldsymbol{a}},\tilde{\boldsymbol{b}}$ exist; for example, we can choose

$$(\tilde{\boldsymbol{a}},\tilde{\boldsymbol{b}})=\left(1-\frac{\delta}{8}\right)\left((\boldsymbol{a},\boldsymbol{b})+\frac{\delta}{n(8-\delta)}(\boldsymbol{1},\boldsymbol{1})\right).$$

However, using the equicontinuity property we have actually shown that $R$ can be replaced by $\|\mathbf{C}\|_{\infty}$, which is independent of $\boldsymbol{a}$ and $\boldsymbol{b}$. This fact enables us to prove that even for the vanilla Sinkhorn algorithm (without modifying $\boldsymbol{a}$ and $\boldsymbol{b}$), the computational complexity to achieve an $\varepsilon$-accuracy after rounding is $O(\frac{n^{2}\|\mathbf{C}\|_{\infty}^{2}\log n}{\varepsilon^{2}})$.
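For reference, the lifting used by the modified variant (and in the comparisons of Figure 1) is a one-liner; a sketch with a hypothetical helper name:

```python
import numpy as np

def lift_marginals(a, b, delta):
    """Modified marginals from [15]; keeps a, b on the simplex and
    guarantees that the minimum entries are at least delta / (8n)."""
    n = a.shape[0]
    shift = delta / (n * (8 - delta))
    return (1 - delta / 8) * (a + shift), (1 - delta / 8) * (b + shift)
```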

Theorem 2.12.

For any positive probability vectors $\boldsymbol{a},\boldsymbol{b}\in\Delta_{n}$, Algorithm 1 outputs a matrix $\mathbf{P}_{k}$ such that $\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}\leq\delta$ in fewer than $\lceil\frac{4\|\mathbf{C}\|_{\infty}}{\gamma\delta}\rceil+2$ iterations. Furthermore, the computational complexity for the Sinkhorn algorithm to achieve an $\varepsilon$-accuracy after rounding is $O\left(\frac{\|\mathbf{C}\|^{2}_{\infty}n^{2}\log n}{\varepsilon^{2}}\right)$.

Proof.

It follows from Lemmas 2.4 and 2.10 that

$$T_{k}-T_{k+1}\geq\frac{\gamma}{2}(\|\mathbf{P}_{k}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}+\|\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1})^{2}$$

and

$$T_{k}\leq\|\mathbf{C}\|_{\infty}(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}).$$

Combining these two inequalities yields

$$T_{k}-T_{k+1}\geq\frac{\gamma}{2\|\mathbf{C}\|_{\infty}^{2}}T_{k}^{2}.$$

Since $\{\frac{1}{T_{k}}\}_{k}$ is an increasing sequence, we have

$$\frac{1}{T_{k+1}}-\frac{1}{T_{k}}=\frac{T_{k}-T_{k+1}}{T_{k}T_{k+1}}\geq\frac{T_{k}-T_{k+1}}{T_{k}^{2}}\geq\frac{\gamma}{2\|\mathbf{C}\|_{\infty}^{2}}.$$

It follows immediately that

$$\frac{1}{T_{k}}\geq\frac{1}{T_{k}}-\frac{1}{T_{2}}\geq\frac{(k-2)\gamma}{2\|\mathbf{C}\|_{\infty}^{2}}.$$

Therefore, we have

$$T_{k}\leq\frac{2\|\mathbf{C}\|_{\infty}^{2}}{\gamma(k-2)}\quad\text{for any }k\geq 3.$$

Thus, if we choose $k_{0}=2+\lceil\frac{2\|\mathbf{C}\|_{\infty}}{\gamma\delta}\rceil$, then $T_{k_{0}}\leq\delta\|\mathbf{C}\|_{\infty}$. Applying Lemma 2.4 again shows that the iteration will terminate within at most $\lceil\delta\|\mathbf{C}\|_{\infty}/\frac{\gamma\delta^{2}}{2}\rceil=\lceil\frac{2\|\mathbf{C}\|_{\infty}}{\gamma\delta}\rceil$ additional iterations. Therefore, the total number of iterations required by the Sinkhorn algorithm to terminate is smaller than

$$k_{0}+\lceil\frac{2\|\mathbf{C}\|_{\infty}}{\gamma\delta}\rceil=2+2\lceil\frac{2\|\mathbf{C}\|_{\infty}}{\gamma\delta}\rceil.$$

Noting Remark 2.3, since the per-iteration computational complexity of the Sinkhorn algorithm is $O(n^{2})$, we can finally conclude that the total computational cost for the Sinkhorn algorithm to achieve an $\varepsilon$-accuracy after rounding is $O\left(\frac{\|\mathbf{C}\|_{\infty}^{2}n^{2}\log n}{\varepsilon^{2}}\right)$ by setting $\gamma=\frac{\varepsilon}{4\log n}$ and $\delta=\frac{\varepsilon}{8\|\mathbf{C}\|_{\infty}}$. ∎

Remark 2.13.

For the case when there are zeros in $\boldsymbol{a}$ and $\boldsymbol{b}$, define $A_{1}:=\{i\in[n]:\boldsymbol{a}_{i}=0\}$ and $A_{2}:=\{j\in[n]:\boldsymbol{b}_{j}=0\}$. Then for any $i\in A_{1}$, $j\in A_{2}$ and $k\geq 2$, it can be easily verified that $(\boldsymbol{u}_{k})_{i}=0$ and $(\boldsymbol{v}_{k})_{j}=0$. Thus the rows with index in $A_{1}$ and the columns with index in $A_{2}$ of each $\mathbf{P}_{k}$ and of $\mathbf{P}^{*}$ are all zeros, and we only need to focus on the nonzero part of $\{\mathbf{P}_{k}\}_{k}$. It is equivalent to applying the Sinkhorn algorithm to solve the following compact problem

$$\min_{\tilde{\mathbf{P}}\in\mathbb{R}_{+}^{n_{1}\times n_{2}}}\langle\tilde{\mathbf{C}},\tilde{\mathbf{P}}\rangle+\gamma\sum_{i,j}\tilde{\mathbf{P}}_{i,j}(\log\tilde{\mathbf{P}}_{i,j}-1)\quad\text{subject to}\quad\tilde{\mathbf{P}}\boldsymbol{1}_{n_{2}}=\tilde{\boldsymbol{a}},\ \tilde{\mathbf{P}}^{\mathsf{T}}\boldsymbol{1}_{n_{1}}=\tilde{\boldsymbol{b}}, \tag{10}$$

where $n_{1}=n-|A_{1}|$, $n_{2}=n-|A_{2}|$, $\tilde{\boldsymbol{a}}\in\mathbb{R}_{++}^{n_{1}}$ and $\tilde{\boldsymbol{b}}\in\mathbb{R}_{++}^{n_{2}}$ are the vectors obtained by removing the zeros from $\boldsymbol{a}$ and $\boldsymbol{b}$ respectively, and $\tilde{\mathbf{C}}\in\mathbb{R}_{+}^{n_{1}\times n_{2}}$ is the matrix obtained by removing the rows with index in $A_{1}$ and the columns with index in $A_{2}$ from $\mathbf{C}$. Therefore the complexity in this case is also $O\left(\frac{\|\mathbf{C}\|^{2}_{\infty}n^{2}\log n}{\varepsilon^{2}}\right)$.
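In code, restricting to the support as described in this remark amounts to masking; a sketch reusing the hypothetical `sinkhorn` helper from earlier:

```python
import numpy as np

def sinkhorn_with_zeros(a, b, C, gamma, delta):
    """Apply Sinkhorn to the compact problem (10), then re-embed the plan."""
    ia, jb = a > 0, b > 0
    P_sub = sinkhorn(a[ia], b[jb], C[np.ix_(ia, jb)], gamma, delta)
    P = np.zeros_like(C)
    P[np.ix_(ia, jb)] = P_sub      # rows in A1 / columns in A2 stay zero
    return P
```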

Remark 2.14.

A set of experiments has been conducted to investigate the dependency of the computational complexity of Sinkhorn on the accuracy $\varepsilon$. See Figure 2 for the plots of the average number of iterations, over 10 random trials, needed for Sinkhorn to achieve an $\varepsilon$-accuracy against $1/\varepsilon^{2}$. A desirable linear relation is clearly shown in the plots.

Figure 2: Number of iterations vs. $1/\varepsilon^{2}$. (a) MNIST data set; (b) synthetic data set.

3 Greenkhorn and Its Complexity

Algorithm 3 Greenkhorn [14]
Input: probability vectors $\boldsymbol{a},\boldsymbol{b}$, accuracy $\delta$, $\mathbf{K}\leftarrow\exp\left(-\frac{\mathbf{C}}{\gamma}\right)$
Initialize $k\leftarrow 0$, $\boldsymbol{u}_{0}=\boldsymbol{a}$, $\boldsymbol{v}_{0}=\boldsymbol{b}$, $\mathbf{P}_{0}\leftarrow\operatorname{diag}(\boldsymbol{u}_{0})\mathbf{K}\operatorname{diag}(\boldsymbol{v}_{0})$
repeat
  $I\leftarrow\mathop{\arg\max}_{i}\rho(\boldsymbol{a}_{i},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{i})$ (break ties arbitrarily)
  $J\leftarrow\mathop{\arg\max}_{j}\rho(\boldsymbol{b}_{j},(\mathbf{P}^{\mathsf{T}}_{k}\boldsymbol{1}_{n})_{j})$ (break ties arbitrarily)
  $\boldsymbol{u}_{k+1}\leftarrow\boldsymbol{u}_{k}$, $\boldsymbol{v}_{k+1}\leftarrow\boldsymbol{v}_{k}$
  if $\rho(\boldsymbol{a}_{I},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I})>\rho(\boldsymbol{b}_{J},(\mathbf{P}^{\mathsf{T}}_{k}\boldsymbol{1}_{n})_{J})$ then
    $(\boldsymbol{u}_{k+1})_{I}\leftarrow\frac{\boldsymbol{a}_{I}}{(\mathbf{K}\boldsymbol{v}_{k})_{I}}$
  else
    $(\boldsymbol{v}_{k+1})_{J}\leftarrow\frac{\boldsymbol{b}_{J}}{(\mathbf{K}^{\mathsf{T}}\boldsymbol{u}_{k})_{J}}$
  end if
  $k\leftarrow k+1$
  $\mathbf{P}_{k}\leftarrow\operatorname{diag}(\boldsymbol{u}_{k})\mathbf{K}\operatorname{diag}(\boldsymbol{v}_{k})$
until $\|\mathbf{P}_{k}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}+\|\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1}\leq\delta$
Output: $\mathbf{P}_{k}$.

Again, without loss of generality, we assume $\boldsymbol{a}>0$ and $\boldsymbol{b}>0$ (the case where there are zeros in $\boldsymbol{a}$ or $\boldsymbol{b}$ is discussed in Remark 3.9). Recall that in each iteration of the Sinkhorn algorithm, all the elements of $\boldsymbol{u}_{k}$ or $\boldsymbol{v}_{k}$ are updated simultaneously such that the row sum of $\mathbf{P}_{k}$ equals $\boldsymbol{a}$ or the column sum of $\mathbf{P}_{k}$ equals $\boldsymbol{b}$. In contrast, only a single element of $\boldsymbol{u}_{k}$ or $\boldsymbol{v}_{k}$ is updated at a time in the Greenkhorn algorithm, such that only one element of the row sum or column sum of $\mathbf{P}_{k}$ is equal to the target value. To determine which element of $\boldsymbol{u}_{k}$ or $\boldsymbol{v}_{k}$ is updated, the following scalar version of the KL-divergence is used to quantify the mismatch between the elements of $\boldsymbol{a}$ or $\boldsymbol{b}$ and the corresponding elements of $\mathbf{P}\boldsymbol{1}_{n}$ or $\mathbf{P}^{\mathsf{T}}\boldsymbol{1}_{n}$:

$$\rho(x,y):=y-x+x\log\frac{x}{y},$$

and then the element with the largest mismatch is chosen to be updated. For two probability vectors $\boldsymbol{a}$ and $\boldsymbol{b}$, it can be easily verified that $KL(\boldsymbol{a}\|\boldsymbol{b})=\sum_{i=1}^{n}\rho(\boldsymbol{a}_{i},\boldsymbol{b}_{i})$. Moreover, $\rho(x,y)$ is exactly the Bregman distance between $x$ and $y$ associated with the function $\phi(t)=t\log t$. The Greenkhorn algorithm is described in Algorithm 3, where the initial variables are set to $\boldsymbol{u}_{0}=\boldsymbol{a}$ and $\boldsymbol{v}_{0}=\boldsymbol{b}$. Note that, without further specification, in this section $\widehat{\mathbf{P}}_{k}$ is the matrix obtained by applying Algorithm 2 to the output $\mathbf{P}_{k}$ of the Greenkhorn algorithm. The existing complexity for the Greenkhorn algorithm is $O(\|\mathbf{C}\|^{3}_{\infty}n^{2}\log n/\varepsilon^{3})$ [14], while the complexity for the modified version using lifted $\boldsymbol{a}$ and $\boldsymbol{b}$ is $O(\|\mathbf{C}\|^{2}_{\infty}n^{2}\log n/\varepsilon^{2})$ [16]. However, a set of numerical experiments similar to those in Section 2 suggests that the performances of the vanilla and modified versions are overall similar; see Figure 3. Thus, it also makes sense to provide an improved analysis of the vanilla Greenkhorn algorithm.
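The following NumPy sketch transcribes Algorithm 3 directly (illustrative; it recomputes the row and column sums at each step for clarity, whereas an efficient implementation would maintain them incrementally at $O(n)$ cost per iteration):

```python
import numpy as np

def rho(x, y):
    """Scalar KL mismatch rho(x, y) = y - x + x*log(x/y)."""
    return y - x + x * np.log(x / y)

def greenkhorn(a, b, C, gamma, delta, max_iter=1_000_000):
    """Vanilla Greenkhorn (Algorithm 3) with u0 = a, v0 = b."""
    K = np.exp(-C / gamma)
    u, v = a.copy(), b.copy()
    P = u[:, None] * K * v[None, :]
    for _ in range(max_iter):
        r, c = P.sum(axis=1), P.sum(axis=0)
        if np.abs(r - a).sum() + np.abs(c - b).sum() <= delta:
            break
        I = np.argmax(rho(a, r))            # greedy row index
        J = np.argmax(rho(b, c))            # greedy column index
        if rho(a[I], r[I]) > rho(b[J], c[J]):
            u[I] = a[I] / (K[I] @ v)        # fix row I
        else:
            v[J] = b[J] / (K[:, J] @ u)     # fix column J
        P = u[:, None] * K * v[None, :]
    return P
```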

Figure 3: Transport cost error vs. number of iterations for the vanilla and modified Greenkhorn algorithms. (a) MNIST data set; (b) synthetic data set.

The convergence analysis for the Greenkhorn algorithm is overall parallel to that for the Sinkhorn algorithm, and it is carried out on the dual variables

$$\mathbf{f}_{k}:=\gamma\log\boldsymbol{u}_{k}\quad\text{and}\quad\mathbf{g}_{k}:=\gamma\log\boldsymbol{v}_{k}.$$

That being said, one fact requires caution: the transport plan matrix $\mathbf{P}_{k}$ produced by the Greenkhorn algorithm may not be a probability matrix, which requires a careful treatment in the analysis. We first extend Lemma 2.2 to the situation where $\mathbf{B}$ is not necessarily a probability matrix. The proof of this lemma is presented in the appendix.

Lemma 3.1.

Let $\mathbf{P}^{*}$ be the solution to the Kantorovich problem (1) with marginals $\boldsymbol{a},\boldsymbol{b}$ and cost matrix $\mathbf{C}$. Let $\mathbf{B}=\operatorname{diag}(\boldsymbol{u})\mathbf{K}\operatorname{diag}(\boldsymbol{v})$, where $\boldsymbol{u},\boldsymbol{v}\in\mathbb{R}^{n}_{++}$ and $\mathbf{K}\in\mathbb{R}^{n\times n}$ is defined by $\mathbf{K}=\exp(-\frac{\mathbf{C}}{\gamma})$. Letting $\widehat{\mathbf{B}}=\mathrm{Round}(\mathbf{B},\boldsymbol{a},\boldsymbol{b})$, we have

$$\langle\mathbf{C},\widehat{\mathbf{B}}\rangle\leq\langle\mathbf{C},\mathbf{P}^{*}\rangle+(2+\|\mathbf{B}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}+\|\mathbf{B}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1})\gamma\log n+4(\|\mathbf{B}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}+\|\mathbf{B}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1})\|\mathbf{C}\|_{\infty}. \tag{11}$$
Remark 3.2.

If, in (11), we set $\mathbf{B}$ to be $\mathbf{P}_{k}$, the output of Algorithm 3, we have

$$\langle\mathbf{C},\widehat{\mathbf{P}}_{k}\rangle\leq\langle\mathbf{C},\mathbf{P}^{*}\rangle+(2+\|\mathbf{P}_{k}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}+\|\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1})\gamma\log n+4(\|\mathbf{P}_{k}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}+\|\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1})\|\mathbf{C}\|_{\infty}.$$

This implies that if we set $\gamma=\frac{\varepsilon}{6\log n}$ for the regularization parameter and choose $\delta=\min\left(1,\frac{\varepsilon}{8\|\mathbf{C}\|_{\infty}}\right)$ as the termination threshold, the Greenkhorn algorithm will output a $\mathbf{P}_{k}$ such that

$$\langle\mathbf{C},\mathrm{Round}(\mathbf{P}_{k},\boldsymbol{a},\boldsymbol{b})\rangle\leq\langle\mathbf{C},\mathbf{P}^{*}\rangle+\varepsilon.$$
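Analogously to the Sinkhorn case, and continuing the illustrative snippets above (with `eps`, `n`, `C`, `a`, `b` as defined there), these parameter choices read:

```python
import numpy as np

gamma = eps / (6 * np.log(n))                    # Remark 3.2
delta = min(1.0, eps / (8 * np.abs(C).max()))
P = greenkhorn(a, b, C, gamma, delta)
P_hat = round_to_marginals(P, a, b)
```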

The next lemma characterizes the one-iteration improvement of the Greenkhorn algorithm.

Lemma 3.3 ([14, Lemma 5, Lemma 6]).

For any $k\geq 0$, if $\max\left(\sum_{i=1}^{n}\rho(\boldsymbol{a}_{i},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{i}),\sum_{j=1}^{n}\rho(\boldsymbol{b}_{j},(\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n})_{j})\right)\leq 1$, then

$$h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})-h(\mathbf{f}_{k},\mathbf{g}_{k})\geq\frac{\gamma}{28n}(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1})^{2},$$

and if $\sum_{i=1}^{n}\rho(\boldsymbol{a}_{i},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{i})>1$ or $\sum_{j=1}^{n}\rho(\boldsymbol{b}_{j},(\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n})_{j})>1$, then

$$h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})-h(\mathbf{f}_{k},\mathbf{g}_{k})\geq\frac{\gamma}{n}.$$

The following lemma demonstrates that the infinity norm error of the iterates in the Greenkhorn algorithm will not increase.

Lemma 3.4 ([16, Lemma 3.1]).

Let $(\mathbf{f}^{\gamma}_{*},\mathbf{g}^{\gamma}_{*})$ be any solution of (5). Then for any $k\geq 0$, we have

$$\max(\|\mathbf{f}_{k+1}-\mathbf{f}^{\gamma}_{*}\|_{\infty},\|\mathbf{g}_{k+1}-\mathbf{g}^{\gamma}_{*}\|_{\infty})\leq\max(\|\mathbf{f}_{k}-\mathbf{f}^{\gamma}_{*}\|_{\infty},\|\mathbf{g}_{k}-\mathbf{g}^{\gamma}_{*}\|_{\infty}).$$

The proofs of Lemma 3.3 and Lemma 3.4 can be found in [14] and [16], respectively. To keep the work self-contained, we include the proofs in the appendix. It is worth noting that by a similar argument we can show that Lemma 3.4 also holds for the Sinkhorn algorithm. The details are omitted since this result is not used in the convergence analysis of the Sinkhorn algorithm.

The following lemma provides an upper bound on the infinity norm of the optimal solution (after translation), which plays an important role in the analysis of the Greenkhorn algorithm. The proof of this lemma relies on the equicontinuity property of the optimal solution.

Lemma 3.5.

For the dual entropic regularized optimal transport problem (5), there exists an optimal solution $(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})$ satisfying

$$\max(\|\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a}\|_{\infty},\|\mathbf{g}_{*}^{\gamma}-\gamma\log\boldsymbol{b}\|_{\infty})\leq 2\|\mathbf{C}\|_{\infty}. \tag{12}$$
Proof.

We first claim that, for any solution $(\tilde{\mathbf{f}}_{*}^{\gamma},\tilde{\mathbf{g}}_{*}^{\gamma})$, there exist $(s,t),(s^{\prime},t^{\prime})\in[n]\times[n]$ satisfying

$$(\tilde{\mathbf{f}}_{*}^{\gamma})_{s}-\gamma\log\boldsymbol{a}_{s}+(\tilde{\mathbf{g}}_{*}^{\gamma})_{t}-\gamma\log\boldsymbol{b}_{t}\geq 0, \tag{13}$$
$$(\tilde{\mathbf{f}}_{*}^{\gamma})_{s^{\prime}}-\gamma\log\boldsymbol{a}_{s^{\prime}}+(\tilde{\mathbf{g}}_{*}^{\gamma})_{t^{\prime}}-\gamma\log\boldsymbol{b}_{t^{\prime}}\leq\|\mathbf{C}\|_{\infty}. \tag{14}$$

Indeed, noting that $\sum_{i,j}\boldsymbol{a}_{i}\boldsymbol{b}_{j}=1$, if for all $(s,t)\in[n]\times[n]$,

$$(\tilde{\mathbf{f}}_{*}^{\gamma})_{s}-\gamma\log\boldsymbol{a}_{s}+(\tilde{\mathbf{g}}_{*}^{\gamma})_{t}-\gamma\log\boldsymbol{b}_{t}<0,$$

we have

$$\sum_{i,j}\exp\left(\frac{1}{\gamma}((\tilde{\mathbf{f}}_{*}^{\gamma})_{i}+(\tilde{\mathbf{g}}_{*}^{\gamma})_{j}-\mathbf{C}_{i,j})\right)=\sum_{i,j}\exp\left(\frac{1}{\gamma}((\tilde{\mathbf{f}}_{*}^{\gamma})_{i}-\gamma\log\boldsymbol{a}_{i}+(\tilde{\mathbf{g}}_{*}^{\gamma})_{j}-\gamma\log\boldsymbol{b}_{j}-\mathbf{C}_{i,j})\right)\boldsymbol{a}_{i}\boldsymbol{b}_{j}<1,$$

which contradicts the fact that $\operatorname{diag}\left(\exp(\frac{\tilde{\mathbf{f}}_{*}^{\gamma}}{\gamma})\right)\mathbf{K}\operatorname{diag}\left(\exp(\frac{\tilde{\mathbf{g}}_{*}^{\gamma}}{\gamma})\right)$ is a probability matrix. Thus, there exists $(s,t)$ such that (13) holds. Similarly, we can show that there exists $(s^{\prime},t^{\prime})$ such that (14) holds.

Note that $h(\tilde{\mathbf{f}}_{*}^{\gamma},\tilde{\mathbf{g}}_{*}^{\gamma})=h(\tilde{\mathbf{f}}_{*}^{\gamma}+\eta\boldsymbol{1}_{n},\tilde{\mathbf{g}}_{*}^{\gamma}-\eta\boldsymbol{1}_{n})$ for any $\eta\in\mathbb{R}$. Therefore, we can choose an $\eta$ such that $(\tilde{\mathbf{f}}_{*}^{\gamma})_{s}-\gamma\log\boldsymbol{a}_{s}+\eta=\|\mathbf{C}\|_{\infty}$. Set $\mathbf{f}_{*}^{\gamma}=\tilde{\mathbf{f}}_{*}^{\gamma}+\eta\boldsymbol{1}_{n}$ and $\mathbf{g}_{*}^{\gamma}=\tilde{\mathbf{g}}_{*}^{\gamma}-\eta\boldsymbol{1}_{n}$. It is evident that $(\mathbf{f}_{*}^{\gamma})_{s}-\gamma\log\boldsymbol{a}_{s}=\|\mathbf{C}\|_{\infty}$. Thus the application of Corollary 2.9 implies

$$\|\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a}\|_{\infty}\leq 2\|\mathbf{C}\|_{\infty}$$

and

$$(\mathbf{f}_{*}^{\gamma})_{s^{\prime}}-\gamma\log\boldsymbol{a}_{s^{\prime}}\geq 0.$$

Since $(\mathbf{f}_{*}^{\gamma})_{s}-\gamma\log\boldsymbol{a}_{s}=\|\mathbf{C}\|_{\infty}$ and (13) holds, we have

$$(\mathbf{g}_{*}^{\gamma})_{t}-\gamma\log\boldsymbol{b}_{t}\geq-\|\mathbf{C}\|_{\infty}.$$

In addition, it follows from (14) that

$$(\mathbf{g}_{*}^{\gamma})_{t^{\prime}}-\gamma\log\boldsymbol{b}_{t^{\prime}}\leq\|\mathbf{C}\|_{\infty}.$$

Thus, applying Corollary 2.9 again yields

$$\|\mathbf{g}_{*}^{\gamma}-\gamma\log\boldsymbol{b}\|_{\infty}\leq 2\|\mathbf{C}\|_{\infty},$$

which concludes the proof. ∎

The following two lemmas can be obtained based on the above lemmas. Noting that $h(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})$ is the same for all optimal solutions, without loss of generality we assume that the $(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})$ used in the proofs satisfies (12).

Lemma 3.6.

Let $(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})$ be any solution of (5), and define $T_{k}:=h(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})-h(\mathbf{f}_{k},\mathbf{g}_{k})$. Let $(\mathbf{f}_{0},\mathbf{g}_{0})=(\gamma\log\boldsymbol{a},\gamma\log\boldsymbol{b})$. We have

$$T_{k}=h(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})-h(\mathbf{f}_{k},\mathbf{g}_{k})\leq 2\|\mathbf{C}\|_{\infty}(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}).$$
Proof.

Since $h(\mathbf{f},\mathbf{g})$ is concave, we have

$$\begin{aligned}T_{k}=h(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})-h(\mathbf{f}_{k},\mathbf{g}_{k})&\leq\langle\mathbf{f}_{*}^{\gamma}-\mathbf{f}_{k},\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\rangle+\langle\mathbf{g}_{*}^{\gamma}-\mathbf{g}_{k},\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\rangle\\ &\leq\|\mathbf{f}_{*}^{\gamma}-\mathbf{f}_{k}\|_{\infty}\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\mathbf{g}_{*}^{\gamma}-\mathbf{g}_{k}\|_{\infty}\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}\\ &\leq\max(\|\mathbf{f}_{*}^{\gamma}-\mathbf{f}_{k}\|_{\infty},\|\mathbf{g}_{*}^{\gamma}-\mathbf{g}_{k}\|_{\infty})(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1})\\ &\leq\max(\|\mathbf{f}_{*}^{\gamma}-\mathbf{f}_{0}\|_{\infty},\|\mathbf{g}_{*}^{\gamma}-\mathbf{g}_{0}\|_{\infty})(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1})\\ &=\max(\|\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a}\|_{\infty},\|\mathbf{g}_{*}^{\gamma}-\gamma\log\boldsymbol{b}\|_{\infty})(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1})\quad(15)\\ &\leq 2\|\mathbf{C}\|_{\infty}(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}),\quad(16)\end{aligned}$$

where the fourth inequality follows from Lemma 3.4 and the last inequality follows from Lemma 3.5. ∎

Lemma 3.7.

Let (𝐟γ,𝐠γ)(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma}) be any solution of (5) and let (𝐟0,𝐠0)=(γlog𝐚,γlog𝐛)(\mathbf{f}_{0},\mathbf{g}_{0})=(\gamma\log\boldsymbol{a},\gamma\log\boldsymbol{b}). Then

h(𝐟γ,𝐠γ)h(𝐟0,𝐠0)=h(𝐟γ,𝐠γ)h(γlog𝒂,γlog𝒃)4𝐂.\displaystyle h(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})-h(\mathbf{f}_{0},\mathbf{g}_{0})=h(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})-h(\gamma\log\boldsymbol{a},\gamma\log\boldsymbol{b})\leq 4\|\mathbf{C}\|_{\infty}.
Proof.

By the definition of function hh, we have

h(𝐟γ,𝐠γ)h(γlog𝒂,γlog𝒃)\displaystyle h(\mathbf{f}_{*}^{\gamma},\mathbf{g}_{*}^{\gamma})-h(\gamma\log\boldsymbol{a},\gamma\log\boldsymbol{b}) =𝐟γγlog𝒂,𝒂+𝐠γγlog𝒃,𝒃\displaystyle=\langle\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a},\boldsymbol{a}\rangle+\langle\mathbf{g}_{*}^{\gamma}-\gamma\log\boldsymbol{b},\boldsymbol{b}\rangle
γi,jexp(1γ((𝐟γ)i+(𝐠γ)j𝐂i,j))+γi,jexp(1γ(γlog𝒂i+γlog𝒃j𝐂i,j))\displaystyle-\gamma\sum_{i,j}\exp\left(\frac{1}{\gamma}((\mathbf{f}_{*}^{\gamma})_{i}+(\mathbf{g}_{*}^{\gamma})_{j}-\mathbf{C}_{i,j})\right)+\gamma\sum_{i,j}\exp\left(\frac{1}{\gamma}(\gamma\log\boldsymbol{a}_{i}+\gamma\log\boldsymbol{b}_{j}-\mathbf{C}_{i,j})\right)
=𝐟γγlog𝒂,𝒂+𝐠γγlog𝒃,𝒃γ+γi,j𝒂i𝒃jexp(1γ𝐂i,j)\displaystyle=\langle\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a},\boldsymbol{a}\rangle+\langle\mathbf{g}_{*}^{\gamma}-\gamma\log\boldsymbol{b},\boldsymbol{b}\rangle-\gamma+\gamma\sum_{i,j}\boldsymbol{a}_{i}\boldsymbol{b}_{j}\exp\left(-\frac{1}{\gamma}\mathbf{C}_{i,j}\right)
𝐟γγlog𝒂,𝒂+𝐠γγlog𝒃,𝒃\displaystyle\leq\langle\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a},\boldsymbol{a}\rangle+\langle\mathbf{g}_{*}^{\gamma}-\gamma\log\boldsymbol{b},\boldsymbol{b}\rangle
𝐟γγlog𝒂𝒂1+𝐠γγlog𝒃𝒃1\displaystyle\leq\|\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a}\|_{\infty}\|\boldsymbol{a}\|_{1}+\|\mathbf{g}_{*}^{\gamma}-\gamma\log\boldsymbol{b}\|_{\infty}\|\boldsymbol{b}\|_{1}
=𝐟γγlog𝒂+𝐠γγlog𝒃\displaystyle=\|\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a}\|_{\infty}+\|\mathbf{g}_{*}^{\gamma}-\gamma\log\boldsymbol{b}\|_{\infty}
2max(𝐟γγlog𝒂,𝐠γγlog𝒃)\displaystyle\leq 2\max(\|\mathbf{f}_{*}^{\gamma}-\gamma\log\boldsymbol{a}\|_{\infty},\|\mathbf{g}_{*}^{\gamma}-\gamma\log\boldsymbol{b}\|_{\infty})
4𝐂,\displaystyle\leq 4\|\mathbf{C}\|_{\infty},

where the first inequality holds since 𝐂\mathbf{C} is nonnegative and i,j𝒂i𝒃j=1\sum_{i,j}\boldsymbol{a}_{i}\boldsymbol{b}_{j}=1, and the last inequality follows from applying Lemma 3.5 to both terms. ∎

Now we are ready to prove the main theorem about the computational complexity of the Greenkhorn algorithm. Note that, compared with the proof of Theorem 2.12, we need to carefully distinguish two different cases of the iterates.

Theorem 3.8.

Algorithm 3 with initialization (𝒖0,𝒗0)=(𝒂,𝒃)(\boldsymbol{u}_{0},\boldsymbol{v}_{0})=(\boldsymbol{a},\boldsymbol{b}) outputs a matrix 𝐏k\mathbf{P}_{k} satisfying 𝒂𝐏k𝟏n1+𝒃𝐏k𝖳𝟏n1δ\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}\leq\delta in fewer than 2\lceil\frac{56n\|\mathbf{C}\|_{\infty}}{\gamma\delta}\rceil+2\lceil\frac{4n\|\mathbf{C}\|_{\infty}}{\gamma}\rceil iterations. Furthermore, the computational complexity for the algorithm to achieve an ε\varepsilon-accuracy after rounding is

O(max(𝐂2n2lognε2,𝐂n2lognε))=O(𝐂2n2lognε2)\displaystyle O\left(\max\left(\frac{\|\mathbf{C}\|^{2}_{\infty}n^{2}\log n}{\varepsilon^{2}},\frac{\|\mathbf{C}\|_{\infty}n^{2}\log n}{\varepsilon}\right)\right)=O\left(\frac{\|\mathbf{C}\|^{2}_{\infty}n^{2}\log n}{\varepsilon^{2}}\right)

when ε\varepsilon is sufficiently small.

Proof.

Define

S1={kN:i=1nρ(𝒂i,(𝐏k𝟏n)i)>1 or j=1nρ(𝒃j,(𝐏k𝖳𝟏n)j)>1},\displaystyle S_{1}=\{k\in N:\sum_{i=1}^{n}\rho(\boldsymbol{a}_{i},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{i})>1\text{ or }\sum_{j=1}^{n}\rho(\boldsymbol{b}_{j},(\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n})_{j})>1\},
S2=N\S1.\displaystyle S_{2}=N\backslash S_{1}.

When kS1k\in S_{1}, by Lemma 3.3,

TkTk+1=h(𝐟k+1,𝐠k+1)h(𝐟k,𝐠k)γn.\displaystyle T_{k}-T_{k+1}=h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})-h(\mathbf{f}_{k},\mathbf{g}_{k})\geq\frac{\gamma}{n}.

On the other hand, when kS2k\in S_{2}, by Lemma 3.3 and Lemma 3.6

TkTk+1=h(𝐟k+1,𝐠k+1)h(𝐟k,𝐠k)γ28n(𝒂𝐏k𝟏n1+𝒃𝐏k𝖳𝟏n1)2γ112n𝐂2Tk2.\displaystyle T_{k}-T_{k+1}=h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})-h(\mathbf{f}_{k},\mathbf{g}_{k})\geq\frac{\gamma}{28n}(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1})^{2}\geq\frac{\gamma}{112n\|\mathbf{C}\|_{\infty}^{2}}T_{k}^{2}.

This means that {Tk}k\{T_{k}\}_{k} is a nonincreasing sequence. Thus, by Lemma 3.7, we have

|S1|h(𝐟γ,𝐠γ)h(𝐟0,𝐠0)γn4n𝐂γ.\displaystyle|S_{1}|\leq\frac{h(\mathbf{f}^{\gamma}_{*},\mathbf{g}_{*}^{\gamma})-h(\mathbf{f}_{0},\mathbf{g}_{0})}{\frac{\gamma}{n}}\leq\frac{4n\|\mathbf{C}\|_{\infty}}{\gamma}.

Noting that {1Tk}k\left\{\frac{1}{T_{k}}\right\}_{k} is an increasing sequence, if kS2k\in S_{2},

1Tk+11Tk=TkTk+1Tk+1TkTkTk+1Tk2γ112n𝐂2.\displaystyle\frac{1}{T_{k+1}}-\frac{1}{T_{k}}=\frac{T_{k}-T_{k+1}}{T_{k+1}T_{k}}\geq\frac{T_{k}-T_{k+1}}{T_{k}^{2}}\geq\frac{\gamma}{112n\|\mathbf{C}\|_{\infty}^{2}}. (17)

Setting k0=56n𝐂δγ+|S1|k_{0}=\lceil\frac{56n\|\mathbf{C}\|_{\infty}}{\delta\gamma}\rceil+|S_{1}|, one has

1Tk01Tk01T0kS2 and k<k0(1Tk+11Tk)12𝐂δ,\displaystyle\frac{1}{T_{k_{0}}}\geq\frac{1}{T_{k_{0}}}-\frac{1}{T_{0}}\geq\sum_{k\in S_{2}\mbox{ and }k<k_{0}}\left(\frac{1}{T_{k+1}}-\frac{1}{T_{k}}\right)\geq\frac{1}{2\|\mathbf{C}\|_{\infty}\delta},

since there are at least 56n𝐂δγ\lceil\frac{56n\|\mathbf{C}\|_{\infty}}{\delta\gamma}\rceil terms in the sum. Hence Tk02𝐂δT_{k_{0}}\leq 2\|\mathbf{C}\|_{\infty}\delta, and applying Lemma 3.3 again shows that the algorithm will terminate within at most 2𝐂δ/γδ228n+|S1|=56n𝐂γδ+|S1|\lceil 2\|\mathbf{C}\|_{\infty}\delta/\frac{\gamma\delta^{2}}{28n}\rceil+|S_{1}|=\lceil\frac{56n\|\mathbf{C}\|_{\infty}}{\gamma\delta}\rceil+|S_{1}| additional iterations. Therefore, the total number of iterations required by the Greenkhorn algorithm to terminate is smaller than

k0+56n𝐂γδ+|S1|256n𝐂γδ+24n𝐂γ.\displaystyle k_{0}+\lceil\frac{56n\|\mathbf{C}\|_{\infty}}{\gamma\delta}\rceil+|S_{1}|\leq 2\lceil\frac{56n\|\mathbf{C}\|_{\infty}}{\gamma\delta}\rceil+2\lceil\frac{4n\|\mathbf{C}\|_{\infty}}{\gamma}\rceil.

Noting Remark 3.2 and that the per-iteration computational complexity of the Greenkhorn algorithm is O(n)O(n), we can finally conclude that the total computational cost for the Greenkhorn algorithm to achieve an ε\varepsilon-accuracy after rounding is O(max(𝐂2n2lognε2,𝐂n2lognε))O\left(\max\left(\frac{\|\mathbf{C}\|^{2}_{\infty}n^{2}\log n}{\varepsilon^{2}},\frac{\|\mathbf{C}\|_{\infty}n^{2}\log n}{\varepsilon}\right)\right) by setting γ=ε6logn\gamma=\frac{\varepsilon}{6\log n} and δ=min(1,ε8𝐂)\delta=\min(1,\frac{\varepsilon}{8\|\mathbf{C}\|_{\infty}}). ∎
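To make the iteration analyzed above concrete, the following NumPy sketch implements the vanilla Greenkhorn update with the initialization (𝒖0,𝒗0)=(𝒂,𝒃)(\boldsymbol{u}_{0},\boldsymbol{v}_{0})=(\boldsymbol{a},\boldsymbol{b}) and the stopping rule of Theorem 3.8. It is a minimal illustration assuming 𝒂,𝒃\boldsymbol{a},\boldsymbol{b} have positive entries and 𝐂\mathbf{C} is dense; the names greenkhorn and rho are ours, not from any library.

    import numpy as np

    def rho(x, y):
        # rho(x, y) = y - x + x * log(x / y), with the convention rho(0, 0) = 0
        ratio = np.where(x > 0, x / np.where(y > 0, y, 1.0), 1.0)
        return y - x + x * np.log(ratio)

    def greenkhorn(C, a, b, gamma, delta, max_iter=10**6):
        # Vanilla Greenkhorn: P_k = diag(u) K diag(v) with u = exp(f/gamma), v = exp(g/gamma)
        K = np.exp(-C / gamma)
        u, v = a.copy(), b.copy()  # (u0, v0) = (a, b), i.e. (f0, g0) = (gamma*log a, gamma*log b)
        for k in range(max_iter):
            P = u[:, None] * K * v[None, :]
            r, c = P.sum(axis=1), P.sum(axis=0)
            if np.abs(a - r).sum() + np.abs(b - c).sum() <= delta:
                return P, k  # ||a - P 1||_1 + ||b - P^T 1||_1 <= delta
            i, j = np.argmax(rho(a, r)), np.argmax(rho(b, c))
            if rho(a, r)[i] >= rho(b, c)[j]:
                u[i] = a[i] / (K[i, :] @ v)  # greedy row update: new row sum i equals a_i
            else:
                v[j] = b[j] / (K[:, j] @ u)  # greedy column update: new column sum j equals b_j
        return u[:, None] * K * v[None, :], max_iter

With γ = ε/(6 log n) and δ = min(1, ε/(8‖𝐂‖∞)) as in the proof, rounding the output as in Lemma 2.1 then gives an ε-accurate transport plan.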

Remark 3.9.

For the case when there are zeros in 𝒂\boldsymbol{a} and 𝒃\boldsymbol{b}, due to our selection of 𝐮0=𝐚\boldsymbol{u}_{0}=\boldsymbol{a} and 𝐯0=𝐛\boldsymbol{v}_{0}=\boldsymbol{b}, if we use the convention ρ(0,0)=0\rho(0,0)=0, the Greenkhorn algorithm never updates the zero entries of 𝐮\boldsymbol{u} and 𝐯\boldsymbol{v} in any iteration. This means that the rows with indices in A1A_{1} and the columns with indices in A2A_{2} of each 𝐏k\mathbf{P}_{k} are all zeros, where A1A_{1} and A2A_{2} are defined in the same way as in Remark 2.13. Thus we only need to focus on the nonzero part of {𝐏k}k\{\mathbf{P}_{k}\}_{k}, which is identical to the sequence generated by applying the Greenkhorn algorithm to solve (10). Therefore, the complexity is still O(max(𝐂2n2lognε2,𝐂n2lognε))O\left(\max\left(\frac{\|\mathbf{C}\|^{2}_{\infty}n^{2}\log n}{\varepsilon^{2}},\frac{\|\mathbf{C}\|_{\infty}n^{2}\log n}{\varepsilon}\right)\right) in this case.

Remark 3.10.

As in Section 2, numerical experiments have been conducted to verify the linear dependence of the number of iterations on 1/ε2{1}/{\varepsilon^{2}}; see Figure 4 and the sketch below.

Figure 4: Number of iterations vs 1ε2\frac{1}{\varepsilon^{2}}. (a) MNIST data set; (b) synthetic data set.
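A minimal way to reproduce such iteration counts, assuming the greenkhorn function sketched after Theorem 3.8 is in scope (the data below is synthetic and purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    C = rng.random((n, n))
    a = rng.random(n); a /= a.sum()
    b = rng.random(n); b /= b.sum()

    # record the number of iterations against 1/eps^2; for very small eps the
    # kernel exp(-C/gamma) underflows and log-domain updates would be needed
    for eps in [1.0, 0.5, 0.25]:
        gamma = eps / (6 * np.log(n))          # gamma = eps / (6 log n)
        delta = min(1.0, eps / (8 * C.max()))  # delta = min(1, eps / (8 ||C||_inf))
        _, iters = greenkhorn(C, a, b, gamma, delta)
        print(1 / eps**2, iters)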
Remark 3.11.

Though our work focuses on the complexity analysis of the Sinkhorn and Greenkhorn algorithms, it is worth noting that there are also other algorithms for solving the OT problem. For example, a family of first-order methods (APDAGD [15], APDAMD [16], AAM [19], HPD [20]) based on primal-dual gradient descent or mirror descent schemes has been shown to enjoy an O~(n2.5/ε)\tilde{O}({n^{2.5}}/{\varepsilon}) complexity. Note that the technique of modifying 𝒂,𝒃\boldsymbol{a},\boldsymbol{b} is also used in APDAGD [15] and APDAMD [16], and our analysis indeed suggests that the same computational complexity can be achieved without modifying 𝒂,𝒃\boldsymbol{a},\boldsymbol{b}. Algorithms with theoretical runtime O~(n2/ε)\tilde{O}({n^{2}}/{\varepsilon}) have been developed in [21] and [22], although practical implementations of these algorithms remain unavailable. In [23, 24], more practical algorithms with the O~(n2/ε)\tilde{O}({n^{2}}/{\varepsilon}) complexity have been developed.

4 Conclusion

We first show that the computational complexity of the vanilla Sinkhorn algorithm with any feasible initial vectors is O(n2𝐂2lognε2)O(\frac{n^{2}\|\mathbf{C}\|_{\infty}^{2}\log n}{\varepsilon^{2}}). For the vanilla Greenkhorn algorithm, a careful analysis shows that if the initialization is chosen to be the target probability vectors, the computational complexity of the algorithm is also O(n2𝐂2lognε2)O(\frac{n^{2}\|\mathbf{C}\|_{\infty}^{2}\log n}{\varepsilon^{2}}). The equicontinuity property of the dual variables for the discrete entropic regularized optimal transport problem plays a key role in the analysis.

References

  • [1] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Mem. Math. Phys. Acad. Royale Sci., pages 666–704, 1781.
  • [2] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • [3] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008.
  • [4] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a Wasserstein loss. Advances in Neural Information Processing Systems, 28, 2015.
  • [5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • [6] Zhengyu Su, Yalin Wang, Rui Shi, Wei Zeng, Jian Sun, Feng Luo, and Xianfeng Gu. Optimal mass transport for shape matching and comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11):2246–2259, 2015.
  • [7] Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (ToG), 34(4):1–11, 2015.
  • [8] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. OTA: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 303–312, 2021.
  • [9] Ofir Pele and Michael Werman. Fast and robust Earth Mover's distances. In 2009 IEEE 12th International Conference on Computer Vision, pages 460–467. IEEE, 2009.
  • [10] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transportation distances. Advances in Neural Information Processing Systems, 26, 06 2013.
  • [11] G Udny Yule. On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75(6):579–652, 1912.
  • [12] Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics, 35(2):876–879, 1964.
  • [13] Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its applications, 114:717–735, 1989.
  • [14] Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration, 2018.
  • [15] Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1367–1376. PMLR, 10–15 Jul 2018.
  • [16] Tianyi Lin, Nhat Ho, and Michael I Jordan. On the efficiency of entropic regularized algorithms for optimal transport. Journal of Machine Learning Research, 23(137):1–42, 2022.
  • [17] F. Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Springer International Publishing :Imprint: Birkhäuser, 2015.
  • [18] Simone Di Marino and Augusto Gerolin. An optimal transport approach for the Schrödinger bridge problem and convergence of Sinkhorn algorithm. Journal of Scientific Computing, 85(2):27, 2020.
  • [19] Sergey Guminov, Pavel Dvurechensky, Nazarii Tupitsa, and Alexander Gasnikov. On a combination of alternating minimization and Nesterov's momentum. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3886–3898. PMLR, 18–24 Jul 2021.
  • [20] Antonin Chambolle and Juan Pablo Contreras. Accelerated Bregman primal-dual methods applied to optimal transport and Wasserstein barycenter problems. SIAM Journal on Mathematics of Data Science, 4(4):1369–1395, 2022.
  • [21] Jose Blanchet, Arun Jambulapati, Carson Kent, and Aaron Sidford. Towards optimal running times for optimal transport, 2020.
  • [22] Kent Quanrud. Approximating optimal transport with linear programs, 2018.
  • [23] Arun Jambulapati, Aaron Sidford, and Kevin Tian. A direct O~(1/ϵ)\tilde{O}(1/\epsilon) iteration parallel algorithm for optimal transport, 2019.
  • [24] Gen Li, Yanxi Chen, Yuejie Chi, H. Vincent Poor, and Yuxin Chen. Fast computation of optimal transport via entropy-regularized extragradient methods, 2023.

Appendix A Supplementary Proofs for Lemmas 2.1, 2.2, and 2.4

Proof of Lemma 2.1.

It is clear that 𝐏′′𝐏𝐏\mathbf{P}^{\prime\prime}\leq\mathbf{P}^{\prime}\leq\mathbf{P} and Δ𝒂1=Δ𝒃1\|\Delta_{\boldsymbol{a}}\|_{1}=\|\Delta_{\boldsymbol{b}}\|_{1}. Let 𝐏^=Round(𝐏,𝒂,𝒃)\widehat{\mathbf{P}}=\mathrm{Round}(\mathbf{P},\boldsymbol{a},\boldsymbol{b}). It is easy to verify that 𝐏^𝟏n=𝒂\widehat{\mathbf{P}}\boldsymbol{1}_{n}=\boldsymbol{a} and 𝐏^𝖳𝟏n=𝒃\widehat{\mathbf{P}}^{\mathsf{T}}\boldsymbol{1}_{n}=\boldsymbol{b}. Moreover,

𝐏𝐏^1\displaystyle\|\mathbf{P}-\widehat{\mathbf{P}}\|_{1} 𝐏𝐏′′1+1Δ𝒂1Δ𝒂Δ𝒃𝖳1=𝐏𝐏′′1+𝒂1𝐏′′1\displaystyle\leq\|\mathbf{P}-\mathbf{P}^{\prime\prime}\|_{1}+\frac{1}{\|\Delta_{\boldsymbol{a}}\|_{1}}\|\Delta_{\boldsymbol{a}}\Delta_{\boldsymbol{b}}^{\mathsf{T}}\|_{1}=\|\mathbf{P}-\mathbf{P}^{\prime\prime}\|_{1}+\|\boldsymbol{a}\|_{1}-\|\mathbf{P}^{\prime\prime}\|_{1}
=2𝐏𝐏′′1+𝒂1𝐏1=2(𝐏𝐏1+𝐏𝐏′′1)+𝒂1𝐏1,\displaystyle=2\|\mathbf{P}-\mathbf{P}^{\prime\prime}\|_{1}+\|\boldsymbol{a}\|_{1}-\|\mathbf{P}\|_{1}=2\left(\|\mathbf{P}-\mathbf{P}^{\prime}\|_{1}+\|\mathbf{P}^{\prime}-\mathbf{P}^{\prime\prime}\|_{1}\right)+\|\boldsymbol{a}\|_{1}-\|\mathbf{P}\|_{1},

where the last equality arises from 𝐏𝐏𝐏′′\mathbf{P}\geq\mathbf{P}^{\prime}\geq\mathbf{P}^{\prime\prime}. Since

𝐏𝐏1\displaystyle\|\mathbf{P}-\mathbf{P}^{\prime}\|_{1} =i=1n[(𝐏𝟏n)i𝒂i]+=12i=1n[(𝐏𝟏n)i𝒂i+|(𝐏𝟏n)i𝒂i|]=12(𝒂𝐏𝟏n1+𝐏1𝒂1)\displaystyle=\sum_{i=1}^{n}\left[(\mathbf{P}\boldsymbol{1}_{n})_{i}-\boldsymbol{a}_{i}\right]_{+}=\frac{1}{2}\sum_{i=1}^{n}\left[(\mathbf{P}\boldsymbol{1}_{n})_{i}-\boldsymbol{a}_{i}+|(\mathbf{P}\boldsymbol{1}_{n})_{i}-\boldsymbol{a}_{i}|\right]=\frac{1}{2}\left(\|\boldsymbol{a}-\mathbf{P}\boldsymbol{1}_{n}\|_{1}+\|\mathbf{P}\|_{1}-\|\boldsymbol{a}\|_{1}\right)

and

𝐏𝐏′′1=j=1n[(𝐏𝖳𝟏n)j𝒃j]+j=1n[(𝐏𝖳𝟏n)j𝒃j]+𝒃𝐏𝖳𝟏n1,\displaystyle\|\mathbf{P}^{\prime}-\mathbf{P}^{\prime\prime}\|_{1}=\sum_{j=1}^{n}[(\mathbf{P}^{\prime\mathsf{T}}\boldsymbol{1}_{n})_{j}-\boldsymbol{b}_{j}]_{+}\leq\sum_{j=1}^{n}[(\mathbf{P}^{\mathsf{T}}\boldsymbol{1}_{n})_{j}-\boldsymbol{b}_{j}]_{+}\leq\|\boldsymbol{b}-\mathbf{P}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1},

we have

𝐏𝐏^1\displaystyle\|\mathbf{P}-\widehat{\mathbf{P}}\|_{1} 2(𝐏𝐏1+𝐏𝐏′′1)+𝒂1𝐏1\displaystyle\leq 2\left(\|\mathbf{P}-\mathbf{P}^{\prime}\|_{1}+\|\mathbf{P}^{\prime}-\mathbf{P}^{\prime\prime}\|_{1}\right)+\|\boldsymbol{a}\|_{1}-\|\mathbf{P}\|_{1}
𝒂𝐏𝟏n1+2𝒃𝐏𝖳𝟏n12(𝒂𝐏𝟏n1+𝒃𝐏𝖳𝟏n1),\displaystyle\leq\|\boldsymbol{a}-\mathbf{P}\boldsymbol{1}_{n}\|_{1}+2\|\boldsymbol{b}-\mathbf{P}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}\leq 2\left(\|\boldsymbol{a}-\mathbf{P}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}\right),

which completes the proof. ∎
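For reference, the rounding procedure Round(𝐏,𝒂,𝒃) can be written directly from the quantities 𝐏′, 𝐏′′, Δ𝒂, Δ𝒃 appearing in the proof above. The following NumPy sketch (the name round_to_polytope is ours) assumes 𝐏 has positive entries:

    import numpy as np

    def round_to_polytope(P, a, b):
        # P' = diag(min(a / (P 1), 1)) P: scale down rows whose sums exceed a
        P1 = P * np.minimum(a / P.sum(axis=1), 1.0)[:, None]
        # P'' = P' diag(min(b / (P'^T 1), 1)): scale down columns whose sums exceed b
        P2 = P1 * np.minimum(b / P1.sum(axis=0), 1.0)[None, :]
        # rank-one correction with Delta_a = a - P'' 1 >= 0 and Delta_b = b - P''^T 1 >= 0;
        # note ||Delta_a||_1 = ||Delta_b||_1 since ||a||_1 = ||b||_1
        da, db = a - P2.sum(axis=1), b - P2.sum(axis=0)
        if da.sum() > 0:
            P2 = P2 + np.outer(da, db) / da.sum()
        return P2  # the output satisfies P_hat 1 = a and P_hat^T 1 = b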

Proof of Lemma 2.2.

Let 𝐁^\widehat{\mathbf{B}} be the output of Round(𝐁,𝒂,𝒃)(\mathbf{B},\boldsymbol{a},\boldsymbol{b}) and 𝐏^\widehat{\mathbf{P}} be the output of Round(𝐏,𝐁𝟏n,𝐁𝖳𝟏n)(\mathbf{P}^{*},\mathbf{B}\boldsymbol{1}_{n},\mathbf{B}^{\mathsf{T}}\boldsymbol{1}_{n}). By verifying the KKT condition, we have

𝐁argmin𝐏𝚷(𝐁𝟏n,𝐁𝖳𝟏n)𝐂,𝐏+γi,j𝐏i,j(log𝐏i,j1).\displaystyle\mathbf{B}\in\mathop{\arg\min}_{\mathbf{P}\in\boldsymbol{\Pi}(\mathbf{B}\boldsymbol{1}_{n},\mathbf{B}^{\mathsf{T}}\boldsymbol{1}_{n})}\langle\mathbf{C},\mathbf{P}\rangle+\gamma\sum_{i,j}\mathbf{P}_{i,j}(\log\mathbf{P}_{i,j}-1). (18)

Thus,

𝐂,𝐁^\displaystyle\langle\mathbf{C},\widehat{\mathbf{B}}\rangle 𝐂,𝐁+𝐂𝐁𝐁^1\displaystyle\leq\langle\mathbf{C},\mathbf{B}\rangle+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}
𝐂,𝐏^+γi,j(𝐏^)i,j(log(𝐏^)i,j1)γi,j(𝐁)i,j(log(𝐁)i,j1)+𝐂𝐁𝐁^1\displaystyle\leq\langle\mathbf{C},\widehat{\mathbf{P}}\rangle+\gamma\sum_{i,j}(\widehat{\mathbf{P}})_{i,j}(\log(\widehat{\mathbf{P}})_{i,j}-1)-\gamma\sum_{i,j}(\mathbf{B})_{i,j}(\log(\mathbf{B})_{i,j}-1)+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}
=𝐂,𝐏^+γi,j(𝐏^)i,jlog(𝐏^)i,jγi,j(𝐁)i,jlog(𝐁)i,j+𝐂𝐁𝐁^1\displaystyle=\langle\mathbf{C},\widehat{\mathbf{P}}\rangle+\gamma\sum_{i,j}(\widehat{\mathbf{P}})_{i,j}\log(\widehat{\mathbf{P}})_{i,j}-\gamma\sum_{i,j}(\mathbf{B})_{i,j}\log(\mathbf{B})_{i,j}+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}
𝐂,𝐏^γi,j(𝐁)i,jlog(𝐁)i,j+𝐂𝐁𝐁^1\displaystyle\leq\langle\mathbf{C},\widehat{\mathbf{P}}\rangle-\gamma\sum_{i,j}(\mathbf{B})_{i,j}\log(\mathbf{B})_{i,j}+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}
𝐂,𝐏^+2γlogn+𝐂𝐁𝐁^1\displaystyle\leq\langle\mathbf{C},\widehat{\mathbf{P}}\rangle+2\gamma\log n+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}
𝐂,𝐏+𝐂𝐏^𝐏1+2γlogn+𝐂𝐁𝐁^1.\displaystyle\leq\langle\mathbf{C},\mathbf{P}^{*}\rangle+\|\mathbf{C}\|_{\infty}\|\widehat{\mathbf{P}}-\mathbf{P}^{*}\|_{1}+2\gamma\log n+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}.

Then the proof is complete after applying Lemma 2.1 to bound 𝐏^𝐏1\|\widehat{\mathbf{P}}-\mathbf{P}^{*}\|_{1} and 𝐁𝐁^1\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}. ∎

Proof of Lemma 2.4.

Assume k2k\geq 2 and kk is even. According to the update rule of the Sinkhorn algorithm, we have

𝐟k+1=γlog𝒂γlog𝐊e𝐠kγ,𝐠k+1=𝐠k\mathbf{f}_{k+1}=\gamma\log\boldsymbol{a}-\gamma\log\mathbf{K}e^{\frac{\mathbf{g}_{k}}{\gamma}},\quad\mathbf{g}_{k+1}=\mathbf{g}_{k}

and i,jexp(1γ((𝐟k)i+(𝐠k)j𝐂ij))=i,jexp(1γ((𝐟k+1)i+(𝐠k+1)j𝐂ij))=1\sum_{i,j}\exp\left(\frac{1}{\gamma}((\mathbf{f}_{k})_{i}+(\mathbf{g}_{k})_{j}-\mathbf{C}_{ij})\right)=\sum_{i,j}\exp\left(\frac{1}{\gamma}((\mathbf{f}_{k+1})_{i}+(\mathbf{g}_{k+1})_{j}-\mathbf{C}_{ij})\right)=1. Thus

h(𝐟k+1,𝐠k+1)h(𝐟k,𝐠k)\displaystyle h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})-h(\mathbf{f}_{k},\mathbf{g}_{k}) =𝒂,𝐟k+1𝐟k=𝒂,γlog𝒂𝐊e𝐠k/γγloge𝐟kγ\displaystyle=\langle\boldsymbol{a},\mathbf{f}_{k+1}-\mathbf{f}_{k}\rangle=\langle\boldsymbol{a},\gamma\log\frac{\boldsymbol{a}}{\mathbf{K}e^{\mathbf{g}_{k}/{\gamma}}}-\gamma\log e^{\frac{\mathbf{f}_{k}}{\gamma}}\rangle
=𝒂,γlog𝒂e𝐟k/γ𝐊e𝐠k/γ=γ𝒂,log𝒂𝐏k𝟏n=γKL(𝒂𝐏k𝟏n)\displaystyle=\langle\boldsymbol{a},\gamma\log\frac{\boldsymbol{a}}{e^{\mathbf{f}_{k}/{\gamma}}\odot\mathbf{K}e^{\mathbf{g}_{k}/{\gamma}}}\rangle=\gamma\langle\boldsymbol{a},\log\frac{\boldsymbol{a}}{\mathbf{P}_{k}\boldsymbol{1}_{n}}\rangle=\gamma KL(\boldsymbol{a}\|\mathbf{P}_{k}\boldsymbol{1}_{n})
γ2𝒂𝐏k𝟏n12=γ2(𝒂𝐏k𝟏n1+𝒃𝐏k𝖳𝟏n1)2,\displaystyle\geq\frac{\gamma}{2}\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}^{2}=\frac{\gamma}{2}(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1})^{2},

where the inequality utilizes Pinsker's inequality and the last equality follows from the fact that 𝐏k𝖳𝟏n=𝒃\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n}=\boldsymbol{b} if k2k\geq 2 and kk is even. The case when kk is odd can be proved in a similar manner. ∎
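The exact identity h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})-h(\mathbf{f}_{k},\mathbf{g}_{k})=\gamma KL(\boldsymbol{a}\|\mathbf{P}_{k}\boldsymbol{1}_{n}) established above is easy to check numerically. The following sketch (all names are ours) runs a few full Sinkhorn sweeps and then compares both sides of the identity for one further update of 𝐟\mathbf{f}:

    import numpy as np

    rng = np.random.default_rng(0)
    n, gamma = 5, 0.5
    C = rng.random((n, n))
    a = rng.random(n); a /= a.sum()
    b = rng.random(n); b /= b.sum()
    K = np.exp(-C / gamma)

    def h(f, g):
        # h(f, g) = <f, a> + <g, b> - gamma * sum_ij exp((f_i + g_j - C_ij) / gamma)
        return f @ a + g @ b - gamma * np.exp((f[:, None] + g[None, :] - C) / gamma).sum()

    f, g = gamma * np.log(a), gamma * np.log(b)
    for _ in range(3):  # full sweeps, so that P_k^T 1 = b holds before the next f-update
        f = gamma * np.log(a) - gamma * np.log(K @ np.exp(g / gamma))
        g = gamma * np.log(b) - gamma * np.log(K.T @ np.exp(f / gamma))

    P = np.exp((f[:, None] + g[None, :] - C) / gamma)
    f_next = gamma * np.log(a) - gamma * np.log(K @ np.exp(g / gamma))
    lhs = h(f_next, g) - h(f, g)
    rhs = gamma * (a * np.log(a / P.sum(axis=1))).sum()  # gamma * KL(a || P_k 1_n)
    print(lhs, rhs)  # the two values agree up to floating-point error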

Appendix B Proof of Lemma 3.1

Let 𝐁^=Round(𝐁,𝒂,𝒃)\widehat{\mathbf{B}}=\mbox{Round}(\mathbf{B},\boldsymbol{a},\boldsymbol{b}) and 𝐏^=Round(𝐏,𝐁𝟏n,𝐁𝖳𝟏n)\widehat{\mathbf{P}}=\mbox{Round}(\mathbf{P}^{*},\mathbf{B}\boldsymbol{1}_{n},\mathbf{B}^{\mathsf{T}}\boldsymbol{1}_{n}). Note that 𝐁1=𝐏^1\|\mathbf{B}\|_{1}=\|\widehat{\mathbf{P}}\|_{1}. Utilizing (18) again yields that

𝐂,𝐁^\displaystyle\langle\mathbf{C},\widehat{\mathbf{B}}\rangle 𝐂,𝐁+𝐂𝐁𝐁^1\displaystyle\leq\langle\mathbf{C},\mathbf{B}\rangle+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}
𝐂,𝐏^+γi,j(𝐏^)i,j(log(𝐏^)i,j1)γi,j(𝐁)i,j(log(𝐁)i,j1)+𝐂𝐁𝐁^1\displaystyle\leq\langle\mathbf{C},\widehat{\mathbf{P}}\rangle+\gamma\sum_{i,j}(\widehat{\mathbf{P}})_{i,j}(\log(\widehat{\mathbf{P}})_{i,j}-1)-\gamma\sum_{i,j}(\mathbf{B})_{i,j}(\log(\mathbf{B})_{i,j}-1)+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}
=𝐂,𝐏^+γ𝐁1i,j(𝐏^)i,j𝐁1log(𝐏^)i,j𝐁1γ𝐁1i,j(𝐁)i,j𝐁1log(𝐁)i,j𝐁1+𝐂𝐁𝐁^1\displaystyle=\langle\mathbf{C},\widehat{\mathbf{P}}\rangle+\gamma\|\mathbf{B}\|_{1}\sum_{i,j}\frac{(\widehat{\mathbf{P}})_{i,j}}{\|\mathbf{B}\|_{1}}\log\frac{(\widehat{\mathbf{P}})_{i,j}}{\|\mathbf{B}\|_{1}}-\gamma\|\mathbf{B}\|_{1}\sum_{i,j}\frac{(\mathbf{B})_{i,j}}{\|\mathbf{B}\|_{1}}\log\frac{(\mathbf{B})_{i,j}}{\|\mathbf{B}\|_{1}}+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}
𝐂,𝐏^γ𝐁1i,j(𝐁)i,j𝐁1log(𝐁)i,j𝐁1+𝐂𝐁𝐁^1\displaystyle\leq\langle\mathbf{C},\widehat{\mathbf{P}}\rangle-\gamma\|\mathbf{B}\|_{1}\sum_{i,j}\frac{(\mathbf{B})_{i,j}}{\|\mathbf{B}\|_{1}}\log\frac{(\mathbf{B})_{i,j}}{\|\mathbf{B}\|_{1}}+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}
𝐂,𝐏^+2𝐁1γlogn+𝐂𝐁𝐁^1\displaystyle\leq\langle\mathbf{C},\widehat{\mathbf{P}}\rangle+2\|\mathbf{B}\|_{1}\gamma\log n+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}
𝐂,𝐏+𝐂𝐏^𝐏1+2𝐁1γlogn+𝐂𝐁𝐁^1.\displaystyle\leq\langle\mathbf{C},\mathbf{P}^{*}\rangle+\|\mathbf{C}\|_{\infty}\|\widehat{\mathbf{P}}-\mathbf{P}^{*}\|_{1}+2\|\mathbf{B}\|_{1}\gamma\log n+\|\mathbf{C}\|_{\infty}\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}.

Note that 𝐁1\|\mathbf{B}\|_{1} can be bounded as follows:

𝐁1=𝐁𝟏n1𝒂1+𝐁𝟏n𝒂1=1+𝐁𝟏n𝒂1,\displaystyle\|\mathbf{B}\|_{1}=\|\mathbf{B}\boldsymbol{1}_{n}\|_{1}\leq\|\boldsymbol{a}\|_{1}+\|\mathbf{B}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1}=1+\|\mathbf{B}\boldsymbol{1}_{n}-\boldsymbol{a}\|_{1},
𝐁1=𝐁𝖳𝟏n1𝒃1+𝐁𝖳𝟏n𝒃1=1+𝐁𝖳𝟏n𝒃1.\displaystyle\|\mathbf{B}\|_{1}=\|\mathbf{B}^{\mathsf{T}}\boldsymbol{1}_{n}\|_{1}\leq\|\boldsymbol{b}\|_{1}+\|\mathbf{B}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1}=1+\|\mathbf{B}^{\mathsf{T}}\boldsymbol{1}_{n}-\boldsymbol{b}\|_{1}.

We can conclude the proof by applying Lemma 2.1 to bound 𝐏^𝐏1\|\widehat{\mathbf{P}}-\mathbf{P}^{*}\|_{1} and 𝐁𝐁^1\|\mathbf{B}-\widehat{\mathbf{B}}\|_{1}. ∎

Appendix C Supplementary Proofs for Lemmas 3.3 and 3.4

Proof of Lemma 3.3.

We first establish the following result, which can be viewed as a generalization of Pinsker's inequality. For any 𝒙Δn,𝒚++n\boldsymbol{x}\in\Delta_{n},\boldsymbol{y}\in\mathbb{R}_{++}^{n}, if i=1nρ(𝒙i,𝒚i)1\sum_{i=1}^{n}\rho(\boldsymbol{x}_{i},\boldsymbol{y}_{i})\leq 1, we have

𝒙𝒚127i=1nρ(𝒙i,𝒚i).\displaystyle\|\boldsymbol{x}-\boldsymbol{y}\|_{1}^{2}\leq 7\sum_{i=1}^{n}\rho(\boldsymbol{x}_{i},\boldsymbol{y}_{i}). (19)

The proof of this result is as follows. Letting 𝒚¯:=𝒚𝒚1\bar{\boldsymbol{y}}:=\frac{\boldsymbol{y}}{\|\boldsymbol{y}\|_{1}}, we have

i=1nρ(𝒙i,𝒚i)\displaystyle\sum_{i=1}^{n}\rho(\boldsymbol{x}_{i},\boldsymbol{y}_{i}) =i=1n(𝒚i𝒙i)+𝒙ilog𝒙i𝒚i=𝒚11+i=1n𝒙ilog𝒙i𝒚1𝒚¯i\displaystyle=\sum_{i=1}^{n}(\boldsymbol{y}_{i}-\boldsymbol{x}_{i})+\boldsymbol{x}_{i}\log\frac{\boldsymbol{x}_{i}}{\boldsymbol{y}_{i}}=\|\boldsymbol{y}\|_{1}-1+\sum_{i=1}^{n}\boldsymbol{x}_{i}\log\frac{\boldsymbol{x}_{i}}{\|\boldsymbol{y}\|_{1}\bar{\boldsymbol{y}}_{i}}
=𝒚11log𝒚1i=1n𝒙i+KL(𝒙𝒚¯)=𝒚11log𝒚1+KL(𝒙𝒚¯).\displaystyle=\|\boldsymbol{y}\|_{1}-1-\log\|\boldsymbol{y}\|_{1}\sum_{i=1}^{n}\boldsymbol{x}_{i}+KL(\boldsymbol{x}\|\bar{\boldsymbol{y}})=\|\boldsymbol{y}\|_{1}-1-\log\|\boldsymbol{y}\|_{1}+KL(\boldsymbol{x}\|\bar{\boldsymbol{y}}).

When i=1nρ(𝒙i,𝒚i)1\sum_{i=1}^{n}\rho(\boldsymbol{x}_{i},\boldsymbol{y}_{i})\leq 1, it is easy to see that 𝒚11log𝒚11\|\boldsymbol{y}\|_{1}-1-\log\|\boldsymbol{y}\|_{1}\leq 1. Moreover, it can be verified directly that 𝒚11log𝒚115(𝒚11)2\|\boldsymbol{y}\|_{1}-1-\log\|\boldsymbol{y}\|_{1}\geq\frac{1}{5}(\|\boldsymbol{y}\|_{1}-1)^{2} whenever 𝒚11log𝒚11\|\boldsymbol{y}\|_{1}-1-\log\|\boldsymbol{y}\|_{1}\leq 1. Applying Pinsker's inequality yields

i=1nρ(𝒙i,𝒚i)\displaystyle\sum_{i=1}^{n}\rho(\boldsymbol{x}_{i},\boldsymbol{y}_{i}) =𝒚11log𝒚1+KL(𝒙𝒚¯)\displaystyle=\|\boldsymbol{y}\|_{1}-1-\log\|\boldsymbol{y}\|_{1}+KL(\boldsymbol{x}\|\bar{\boldsymbol{y}})
15(𝒚11)2+12𝒙𝒚¯12.\displaystyle\geq\frac{1}{5}(\|\boldsymbol{y}\|_{1}-1)^{2}+\frac{1}{2}\|\boldsymbol{x}-\bar{\boldsymbol{y}}\|_{1}^{2}.

It follows that

𝒙𝒚12\displaystyle\|\boldsymbol{x}-\boldsymbol{y}\|_{1}^{2} (𝒚¯𝒚1+𝒙𝒚¯1)2=(|𝒚11|+𝒙𝒚¯1)2\displaystyle\leq(\|\bar{\boldsymbol{y}}-\boldsymbol{y}\|_{1}+\|\boldsymbol{x}-\bar{\boldsymbol{y}}\|_{1})^{2}=(|\|\boldsymbol{y}\|_{1}-1|+\|\boldsymbol{x}-\bar{\boldsymbol{y}}\|_{1})^{2}
75(𝒚11)2+72𝒙𝒚¯127i=1nρ(𝒙i,𝒚i),\displaystyle\leq\frac{7}{5}(\|\boldsymbol{y}\|_{1}-1)^{2}+\frac{7}{2}\|\boldsymbol{x}-\bar{\boldsymbol{y}}\|_{1}^{2}\leq 7\sum_{i=1}^{n}\rho(\boldsymbol{x}_{i},\boldsymbol{y}_{i}),

which establishes (19) provided i=1nρ(𝒙i,𝒚i)1\sum_{i=1}^{n}\rho(\boldsymbol{x}_{i},\boldsymbol{y}_{i})\leq 1.
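Inequality (19) can also be sanity-checked numerically. The following sketch (all names are ours) draws random positive pairs (𝒙,𝒚) and verifies the bound whenever the premise holds:

    import numpy as np

    rng = np.random.default_rng(0)

    def rho_sum(x, y):
        # sum_i rho(x_i, y_i) with rho(x, y) = y - x + x * log(x / y)
        return (y - x + x * np.log(x / y)).sum()

    for _ in range(10000):
        x = rng.random(8) + 1e-3; x /= x.sum()        # x in the simplex, entries positive
        y = x * np.exp(0.3 * rng.standard_normal(8))  # positive perturbation of x
        s = rho_sum(x, y)
        if s <= 1.0:  # (19) is asserted only in this regime
            assert np.abs(x - y).sum() ** 2 <= 7 * s + 1e-10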

To proceed, without loss of generality, suppose the (k+1)(k+1)-th iteration of the Greenkhorn algorithm updates 𝐟I\mathbf{f}_{I}. Then we have

h(𝐟k+1,𝐠k+1)h(𝐟k,𝐠k)\displaystyle h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})-h(\mathbf{f}_{k},\mathbf{g}_{k}) =𝒂I((𝐟k+1)I(𝐟k)I)\displaystyle=\boldsymbol{a}_{I}((\mathbf{f}_{k+1})_{I}-(\mathbf{f}_{k})_{I})
γj=1n[exp(1γ((𝐟k+1)I+(𝐠k+1)j𝐂I,j))exp(1γ((𝐟k)I+(𝐠k)j𝐂I,j))]\displaystyle-\gamma\sum_{j=1}^{n}\left[\exp\left(\frac{1}{\gamma}((\mathbf{f}_{k+1})_{I}+(\mathbf{g}_{k+1})_{j}-\mathbf{C}_{I,j})\right)-\exp\left(\frac{1}{\gamma}((\mathbf{f}_{k})_{I}+(\mathbf{g}_{k})_{j}-\mathbf{C}_{I,j})\right)\right]
=𝒂I((𝐟k+1)I(𝐟k)I)γ((𝐏k+1𝟏n)I(𝐏k𝟏n)I)\displaystyle=\boldsymbol{a}_{I}((\mathbf{f}_{k+1})_{I}-(\mathbf{f}_{k})_{I})-\gamma((\mathbf{P}_{k+1}\boldsymbol{1}_{n})_{I}-(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I})
=γ𝒂Ilog(exp((𝐟k+1)Iγ)exp((𝐟k)Iγ))γ((𝐏k+1𝟏n)I(𝐏k𝟏n)I)\displaystyle=\gamma\boldsymbol{a}_{I}\log\left(\frac{\exp\left(\frac{(\mathbf{f}_{k+1})_{I}}{\gamma}\right)}{\exp\left(\frac{(\mathbf{f}_{k})_{I}}{\gamma}\right)}\right)-\gamma((\mathbf{P}_{k+1}\boldsymbol{1}_{n})_{I}-(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I})
=γ𝒂Ilog(j=1nexp((𝐟k+1)I+(𝐠k+1)j𝐂I,jγ)j=1nexp((𝐟k)I+(𝐠k)j𝐂I,jγ))γ((𝐏k+1𝟏n)I(𝐏k𝟏n)I)\displaystyle=\gamma\boldsymbol{a}_{I}\log\left(\frac{\sum_{j=1}^{n}\exp\left(\frac{(\mathbf{f}_{k+1})_{I}+(\mathbf{g}_{k+1})_{j}-\mathbf{C}_{I,j}}{\gamma}\right)}{\sum_{j=1}^{n}\exp\left(\frac{(\mathbf{f}_{k})_{I}+(\mathbf{g}_{k})_{j}-\mathbf{C}_{I,j}}{\gamma}\right)}\right)-\gamma((\mathbf{P}_{k+1}\boldsymbol{1}_{n})_{I}-(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I})
=γ𝒂Ilog((𝐏k+1𝟏n)I(𝐏k𝟏n)I)γ((𝐏k+1𝟏n)I(𝐏k𝟏n)I)\displaystyle=\gamma\boldsymbol{a}_{I}\log\left(\frac{(\mathbf{P}_{k+1}\boldsymbol{1}_{n})_{I}}{(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I}}\right)-\gamma((\mathbf{P}_{k+1}\boldsymbol{1}_{n})_{I}-(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I})
=γ𝒂Ilog(𝒂I(𝐏k𝟏n)I)γ(𝒂I(𝐏k𝟏n)I)\displaystyle=\gamma\boldsymbol{a}_{I}\log\left(\frac{\boldsymbol{a}_{I}}{(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I}}\right)-\gamma(\boldsymbol{a}_{I}-(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I})
=γρ(𝒂I,(𝐏k𝟏n)I),\displaystyle=\gamma\rho(\boldsymbol{a}_{I},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I}),

where the sixth equality follows from (𝐏k+1𝟏n)I=𝒂I(\mathbf{P}_{k+1}\boldsymbol{1}_{n})_{I}=\boldsymbol{a}_{I}. If

max(i=1nρ(𝒂i,(𝐏k𝟏n)i),j=1nρ(𝒃j,(𝐏k𝖳𝟏n)j))1,\displaystyle\max\left(\sum_{i=1}^{n}\rho(\boldsymbol{a}_{i},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{i}),\sum_{j=1}^{n}\rho(\boldsymbol{b}_{j},(\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n})_{j})\right)\leq 1,

according to the update rule of the Greenkhorn algorithm and inequality (19), we have

h(𝐟k+1,𝐠k+1)h(𝐟k,𝐠k)\displaystyle h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})-h(\mathbf{f}_{k},\mathbf{g}_{k}) =γρ(𝒂I,(𝐏k𝟏n)I)\displaystyle=\gamma\rho(\boldsymbol{a}_{I},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I})
γ2n(i=1nρ(𝒂i,(𝐏k𝟏n)i)+j=1nρ(𝒃j,(𝐏k𝖳𝟏n)j))\displaystyle\geq\frac{\gamma}{2n}\left(\sum_{i=1}^{n}\rho(\boldsymbol{a}_{i},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{i})+\sum_{j=1}^{n}\rho(\boldsymbol{b}_{j},(\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n})_{j})\right)
γ14n(𝒂𝐏k𝟏n12+𝒃𝐏k𝖳𝟏n12)\displaystyle\geq\frac{\gamma}{14n}(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}^{2}+\|\boldsymbol{b}-\mathbf{P}^{\mathsf{T}}_{k}\boldsymbol{1}_{n}\|_{1}^{2})
γ28n(𝒂𝐏k𝟏n1+𝒃𝐏k𝖳𝟏n1)2.\displaystyle\geq\frac{\gamma}{28n}(\|\boldsymbol{a}-\mathbf{P}_{k}\boldsymbol{1}_{n}\|_{1}+\|\boldsymbol{b}-\mathbf{P}^{\mathsf{T}}_{k}\boldsymbol{1}_{n}\|_{1})^{2}.

If i=1nρ(𝒂i,(𝐏k𝟏n)i)>1\sum_{i=1}^{n}\rho(\boldsymbol{a}_{i},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{i})>1 or j=1nρ(𝒃j,(𝐏k𝖳𝟏n)j)>1\sum_{j=1}^{n}\rho(\boldsymbol{b}_{j},(\mathbf{P}_{k}^{\mathsf{T}}\boldsymbol{1}_{n})_{j})>1, the update rule of the Greenkhorn algorithm implies that

h(𝐟k+1,𝐠k+1)h(𝐟k,𝐠k)=γρ(𝒂I,(𝐏k𝟏n)I)γn,\displaystyle h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})-h(\mathbf{f}_{k},\mathbf{g}_{k})=\gamma\rho(\boldsymbol{a}_{I},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I})\geq\frac{\gamma}{n},

which completes the proof. ∎
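As with Lemma 2.4, the per-iteration identity h(\mathbf{f}_{k+1},\mathbf{g}_{k+1})-h(\mathbf{f}_{k},\mathbf{g}_{k})=\gamma\rho(\boldsymbol{a}_{I},(\mathbf{P}_{k}\boldsymbol{1}_{n})_{I}) derived above holds exactly and can be verified numerically for a single row update. A short sketch (all names are ours):

    import numpy as np

    rng = np.random.default_rng(0)
    n, gamma, I = 6, 0.7, 2  # update row I
    C = rng.random((n, n))
    a = rng.random(n); a /= a.sum()
    b = rng.random(n); b /= b.sum()
    K = np.exp(-C / gamma)

    def h(f, g):
        return f @ a + g @ b - gamma * np.exp((f[:, None] + g[None, :] - C) / gamma).sum()

    f, g = gamma * np.log(a), gamma * np.log(b)
    r_I = np.exp((f[I] + g - C[I, :]) / gamma).sum()  # (P_k 1_n)_I
    f_next = f.copy()
    f_next[I] = gamma * np.log(a[I]) - gamma * np.log(K[I, :] @ np.exp(g / gamma))
    lhs = h(f_next, g) - h(f, g)
    rhs = gamma * (r_I - a[I] + a[I] * np.log(a[I] / r_I))  # gamma * rho(a_I, (P_k 1_n)_I)
    print(lhs, rhs)  # the two values agree up to floating-point error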

Proof of Lemma 3.4.

For any k0k\geq 0, without loss of generality, suppose the value of 𝒖I\boldsymbol{u}_{I} is updated at the (k+1)(k+1)-th iteration. Then we have

𝐠γ𝐠k+1=𝐠γ𝐠k,\displaystyle\|\mathbf{g}^{\gamma}_{*}-\mathbf{g}_{k+1}\|_{\infty}=\|\mathbf{g}^{\gamma}_{*}-\mathbf{g}_{k}\|_{\infty},
𝐟γ𝐟k+1max(𝐟γ𝐟k,|(𝐟γ)I(𝐟k+1)I|).\displaystyle\|\mathbf{f}^{\gamma}_{*}-\mathbf{f}_{k+1}\|_{\infty}\leq\max\left(\|\mathbf{f}^{\gamma}_{*}-\mathbf{f}_{k}\|_{\infty},|(\mathbf{f}_{*}^{\gamma})_{I}-(\mathbf{f}_{k+1})_{I}|\right). (20)

By the update rule of the Greenkhorn algorithm, we have

exp(1γ(𝐟k+1)I)=𝒂Ij=1n𝐊I,j(𝒗k)j,\displaystyle\exp\left(\frac{1}{\gamma}(\mathbf{f}_{k+1})_{I}\right)=\frac{\boldsymbol{a}_{I}}{\sum_{j=1}^{n}\mathbf{K}_{I,j}(\boldsymbol{v}_{k})_{j}},
exp(1γ(𝐟γ)I)=𝒂Ij=1n𝐊I,j(𝒗γ)j,\displaystyle\exp\left(\frac{1}{\gamma}(\mathbf{f}_{*}^{\gamma})_{I}\right)=\frac{\boldsymbol{a}_{I}}{\sum_{j=1}^{n}\mathbf{K}_{I,j}(\boldsymbol{v}_{*}^{\gamma})_{j}},

where 𝒗γ=exp(1γ𝐠γ)\boldsymbol{v}_{*}^{\gamma}=\exp(\frac{1}{\gamma}\mathbf{g}^{\gamma}_{*}) and 𝒗k=exp(1γ𝐠k)\boldsymbol{v}_{k}=\exp(\frac{1}{\gamma}\mathbf{g}_{k}). It follows that

|(𝐟γ)I(𝐟k+1)I|=γ|log(j=1n𝐊I,j(𝒗γ)jj=1n𝐊I,j(𝒗k)j)|γmaxj|log(𝒗γ)jlog(𝒗k)j|=𝐠γ𝐠k.\displaystyle|(\mathbf{f}^{\gamma}_{*})_{I}-(\mathbf{f}_{k+1})_{I}|=\gamma\left|\log\left(\frac{\sum_{j=1}^{n}\mathbf{K}_{I,j}(\boldsymbol{v}_{*}^{\gamma})_{j}}{\sum_{j=1}^{n}\mathbf{K}_{I,j}(\boldsymbol{v}_{k})_{j}}\right)\right|\leq\gamma\max_{j}|\log(\boldsymbol{v}^{\gamma}_{*})_{j}-\log(\boldsymbol{v}_{k})_{j}|=\|\mathbf{g}^{\gamma}_{*}-\mathbf{g}_{k}\|_{\infty}.

The proof is complete by combining this result with (20). ∎