Improved Complexity Analysis of the Sinkhorn and Greenkhorn Algorithms for Optimal Transport
Footnote: JL and DY contribute equally.
Abstract
The Sinkhorn algorithm is a widely used method for solving the optimal transport problem, and the Greenkhorn algorithm is one of its variants. While there are modified versions of these two algorithms whose computational complexities are $\widetilde{O}(n^2\varepsilon^{-2})$ to achieve an $\varepsilon$-accuracy, to the best of our knowledge, the existing complexity bounds for the vanilla versions are $\widetilde{O}(n^2\varepsilon^{-3})$. In this paper we fill this gap and show that the complexities of the vanilla Sinkhorn and Greenkhorn algorithms are indeed $\widetilde{O}(n^2\varepsilon^{-2})$. The analysis relies on the equicontinuity of the dual variables for the discrete entropic regularized optimal transport problem, which is of independent interest.
Keywords. Optimal transport, Sinkhorn algorithm, Greenkhorn algorithm, complexity analysis
1 Introduction
Optimal transport (OT) is the problem of finding the most economical way to transfer one measure to another. The optimal transport problem dates back to Gaspard Monge [1]. Kantorovich (see [2] for historical context) provided a relaxation of this problem that allows mass splitting at the source based on the notion of coupling, which broadens its applications in areas such as economics and physics. Recently, with the increasing demand for quantifying the dissimilarity between probability measures, the optimal transport cost, which serves as a natural notion of metric between measures, has attracted a lot of attention in statistics and machine learning. For instance, as a special case of the optimal transport cost, the Wasserstein distance [3] can be used as a loss function when the output of a supervised learning method is a probability measure [4]. Using the Wasserstein-1 metric in GANs can effectively mitigate the issues of vanishing gradients and mode collapse [5]. Other applications of optimal transport include shape matching [6], color transfer [7], and object detection [8].
For ease of notation, we consider the case where the target probability vectors are of the same length, denoted . Given a cost matrix , the Kantorovich relaxation attempts to compute the distance between and by solving the following problem:
(1) |
where
As a linear program, the Kantorovich problem can be solved by the simplex method or interior-point methods. However, both suffer from a large computational complexity (overall $O(n^3\log n)$) [2, 9]. A computationally efficient algorithm is developed in [10], which solves the Shannon entropic regularized Kantorovich problem
(2) |
by the Sinkhorn update, where the regularization parameter and the discrete entropy enter the objective. The easily parallelizable and GPU-friendly nature of the Sinkhorn algorithm makes it popular in applications, especially in machine learning [4]. The idea of the Sinkhorn update dates back to [11, 12], and there are many studies on its convergence. A linear convergence rate (not for optimal transport) under the Hilbert projective metric is provided in [13]. Regarding optimal transport, the computational complexity of the Sinkhorn algorithm to achieve an $\varepsilon$-accuracy is shown to be $\widetilde{O}(n^2\varepsilon^{-3})$ [14]. The Greenkhorn algorithm is an elementwise variant of the Sinkhorn algorithm which has the same computational complexity [14]. In [15], a variant of the Sinkhorn algorithm which utilizes modified probability vectors is developed and an $\widetilde{O}(n^2\varepsilon^{-2})$ computational complexity is established. The idea has been extended to the Greenkhorn algorithm with the same improved complexity [16]. However, as can be seen later, the numerical results show that the practical performances of the original methods and the modified ones are overall indistinguishable, which motivates us to ask whether the computational complexity of the vanilla Sinkhorn and Greenkhorn algorithms can be improved. The answer is affirmative.
Main contributions. Inspired by the equicontinuity for the continuous KL-divergence regularized OT problem, in this paper we identify a discrete analogue of this property and show that the $\widetilde{O}(n^2\varepsilon^{-2})$ complexity of the vanilla Sinkhorn algorithm can be achieved based on this property. In addition, it is shown that this property can also be used to improve the complexity of the vanilla Greenkhorn algorithm to $\widetilde{O}(n^2\varepsilon^{-2})$.
Notation. Let $\mathbb{R}_+$ and $\mathbb{R}_{++}$ denote the non-negative real numbers and the positive real numbers, respectively. For a vector, $\|\cdot\|_1$ denotes the sum of the absolute values of its elements, and $\|\cdot\|_\infty$ the maximum absolute value of its elements. For a matrix, $\|\cdot\|_1$ and $\|\cdot\|_\infty$ are defined entrywise in the same way. Let $\Delta_n$ be the probability simplex in $\mathbb{R}^n$. Exponentials and natural logarithms of matrices and vectors are taken entrywise. For vectors of the same dimension, their product and division should be interpreted entrywise. For two probability vectors $a, b \in \Delta_n$, the KL-divergence is defined as
$$\mathrm{KL}(a\,\|\,b) = \sum_{i=1}^{n} a_i \log\frac{a_i}{b_i}.$$
Pinsker's inequality provides a lower bound of the KL-divergence in terms of the $\ell_1$-norm:
$$\mathrm{KL}(a\,\|\,b) \ge \frac{1}{2}\|a-b\|_1^2.$$
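As a quick numerical sanity check (not part of the paper; NumPy and the variable names are our own), the definition and Pinsker's inequality can be verified on random probability vectors:

```python
import numpy as np

def kl(a, b):
    """KL-divergence between two strictly positive probability vectors."""
    return float(np.sum(a * np.log(a / b)))

rng = np.random.default_rng(0)
a = rng.random(5); a /= a.sum()
b = rng.random(5); b /= b.sum()
# Pinsker's inequality: KL(a||b) >= 0.5 * ||a - b||_1^2
assert kl(a, b) >= 0.5 * np.abs(a - b).sum() ** 2
```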
2 Sinkhorn and Its Complexity
Without loss of generality, assume both marginal vectors are entrywise positive (the case where they contain zeros is discussed in Remark 2.13). Then the solution to the entropic regularized Kantorovich problem (2) is the unique matrix of the form
(3) |
that satisfies
(4) |
Here, the two (unknown) scaling vectors appear in (3), and (4) is a reformulation of the marginal constraints in terms of them. For the details of the derivation, see [2]. Based on this optimality condition, the Sinkhorn algorithm starts from arbitrary positive initialization vectors and performs alternating projections to satisfy (4); see Algorithm 1 for a description of the algorithm. As already mentioned, the best known complexity for the vanilla Sinkhorn algorithm is $\widetilde{O}(n^2\varepsilon^{-3})$ [14], while this can be improved to $\widetilde{O}(n^2\varepsilon^{-2})$ for the modified variant which uses lifted values of the marginals [15] (more precisely, Sinkhorn is run on slightly perturbed marginals with a small mismatch instead of the original ones). However, the numerical results in Figure 1 demonstrate that the practical performances of the vanilla and modified Sinkhorn algorithms are overall similar, which motivates us to improve the convergence analysis of the vanilla Sinkhorn algorithm. The numerical experiments are conducted on two data sets: the MNIST data set and a widely used synthetic data set [14]. In each experiment, a pair of images is randomly sampled from a data set, and the two marginal vectors are obtained by vectorizing the images, followed by normalization. The transport cost is given by the distance between pixel positions. The plots in Figure 1 are average results over 10 random instances.
[Figure 1: comparison of the vanilla and modified Sinkhorn algorithms on the MNIST and synthetic data sets (averages over 10 random instances).]
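For concreteness, the following is a minimal NumPy sketch of the vanilla Sinkhorn iteration in the spirit of Algorithm 1; the variable names, the $\ell_1$ stopping rule, and the all-ones initialization are our own choices rather than the paper's exact pseudocode.

```python
import numpy as np

def sinkhorn(C, r, c, eta, tol, max_iter=100000):
    """Vanilla Sinkhorn: alternately rescale rows and columns of the Gibbs kernel.

    C    : (n, n) nonnegative cost matrix
    r, c : strictly positive probability vectors (the target marginals)
    eta  : entropic regularization parameter
    tol  : stop when the l1 marginal error drops below tol
    """
    K = np.exp(-C / eta)                      # Gibbs kernel
    u = np.ones_like(r)                       # arbitrary positive scaling vectors
    v = np.ones_like(c)
    for k in range(max_iter):
        P = u[:, None] * K * v[None, :]       # current transport plan
        err = np.abs(P.sum(1) - r).sum() + np.abs(P.sum(0) - c).sum()
        if err <= tol:
            break
        if k % 2 == 0:
            u = r / (K @ v)                   # match the row sums to r
        else:
            v = c / (K.T @ u)                 # match the column sums to c
    return u[:, None] * K * v[None, :]
```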
There are several different equivalent interpretations of the Sinkhorn algorithm [2], and the convergence analysis is based on the dual perspective. First note that the dual problem of the entropic regularized Kantorovich problem (2) is given by
(5) |
For the details of derivation, see for example [10]. If we define
it can be easily verified that
Moreover, when ,
Therefore, in terms of , the Sinkhorn algorithm can be interpreted as a block coordinate ascent on the dual problem (5), and the analysis will be primarily based on .
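To make the dual view concrete, the sketch below runs the same iteration in the log domain, where each half-step exactly maximizes a standard form of the entropic dual over one block of variables; the symbols f, g, and eta are ours, and the precise normalization of (5) may differ.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_dual(C, r, c, eta, n_iter):
    """Block coordinate ascent on the entropic dual, written in the log domain."""
    f = np.zeros_like(r)
    g = np.zeros_like(c)
    for _ in range(n_iter):
        # maximize the dual over f with g fixed (a smoothed c-transform)
        f = eta * np.log(r) - eta * logsumexp((g[None, :] - C) / eta, axis=1)
        # maximize the dual over g with f fixed
        g = eta * np.log(c) - eta * logsumexp((f[:, None] - C) / eta, axis=0)
    # recover the primal plan; equivalent to u = exp(f/eta), v = exp(g/eta)
    return np.exp((f[:, None] + g[None, :] - C) / eta), f, g
```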
After the Sinkhorn algorithm computes an approximate transport plan matrix meeting the stopping criterion, a rounding step presented in Algorithm 2 usually follows to ensure that the output matrix satisfies both marginal constraints simultaneously. The goal in the analysis is to study the computational complexity of the algorithm when an $\varepsilon$-approximate solution of the original Kantorovich problem (1) is obtained, that is, when the rounded matrix satisfies
where is the solution to (1).
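A sketch of the rounding scheme of [14], which Algorithm 2 follows, is given below; the function and variable names are ours, and a guard is added for the degenerate case where no mass is missing.

```python
import numpy as np

def round_to_marginals(F, r, c):
    """Round a positive matrix F onto the transport polytope with marginals (r, c),
    following the rescale-then-rank-one-correction scheme of [14]."""
    x = np.minimum(r / F.sum(axis=1), 1.0)    # shrink rows whose sum exceeds r
    F = x[:, None] * F
    y = np.minimum(c / F.sum(axis=0), 1.0)    # shrink columns whose sum exceeds c
    F = F * y[None, :]
    err_r = r - F.sum(axis=1)                 # remaining row deficits (nonnegative)
    err_c = c - F.sum(axis=0)                 # remaining column deficits
    if err_r.sum() > 0:
        F = F + np.outer(err_r, err_c) / err_r.sum()
    return F
```

Feeding the output of the Sinkhorn sketch above into this routine yields a feasible plan whose row and column sums match the target marginals exactly.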
To carry out the analysis, we first list three lemmas from [14]. For readers’ convenience, we have included the proofs in the appendix. The first lemma concerns the error bound of the rounding scheme, which is presented in a general form where and are not necessarily probability vectors. This will be useful in the analysis of the Greenkhorn algorithm in the next section.
Lemma 2.1 ([14, Lemma 7] ).
For any vectors satisfying and matrix , Algorithm 2 outputs a matrix such that
The second lemma provides a bound for the error between the true OT distance and the approximate OT distance .
Lemma 2.2 ([14, Theorem 1] ).
Let be the solution to the Kantorovich problem (1) with marginals , and cost matrix . Let be a probability matrix (i.e. ) of the form , where and is defined by . Letting , we have
(6) |
Remark 2.3.
The third lemma characterizes the one-iteration improvement of the Sinkhorn algorithm in terms of the function value.
Lemma 2.4 ([14, Lemma 2] ).
If , then
The improved convergence analysis of the Sinkhorn algorithm relies heavily on the equicontinuity of the discrete regularized optimal transport problem. To state this property, we first follow the convention in the optimal transport literature [17] and define the -transform and -transform of vectors.
Definition 2.5 (-transform, -transform).
For any vector , we define its -transform by
For any vector , we define its -transform by
The definitions of -transform and -transform enable us to describe the Sinkhorn update as
Furthermore, we have
where is any solution of (5). Next we present the key property used in the analysis, which is inspired by the equicontinuity of the -transform and -transform for the continuous KL-divergence regularized optimal transport problem.
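For orientation, in the standard notation of the entropic OT literature (our symbols: regularization parameter $\eta$, cost $C$, marginals $r$ and $c$; the paper's sign and normalization conventions may differ), such transforms typically take the soft-min form
$$f_i = \eta\log r_i - \eta\log\sum_{j=1}^{n}\exp\Big(\frac{g_j - C_{ij}}{\eta}\Big), \qquad g_j = \eta\log c_j - \eta\log\sum_{i=1}^{n}\exp\Big(\frac{f_i - C_{ij}}{\eta}\Big),$$
and one Sinkhorn half-step applies exactly one of the two transforms.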
Lemma 2.6 (Equicontinuity).
Assume in (5) is defined with , , and . Then for any vectors ,
where and refer to the largest entry and smallest entry of a vector or matrix, respectively.
Proof.
By the definition of -transform, we have
for any . This means for any ,
(7) |
Suppose that there exist satisfying
Then we have
for any , which contradicts (7) after summing it up for all . This completes the proof of the first claim, and the second claim can be proved similarly. ∎
Remark 2.7.
The upper bound in Lemma 2.6 is tight. Here we give an example. Let . By the definition of the -transform, we have
for any . Consider the case where the entries of the -th column of are all equal to and the entries of the -th column of are all zeros. Then a simple calculation yields
Remark 2.8.
Let , be two regular Borel measures supported on compact metric spaces and , respectively, and be a continuous function defined on . The continuous KL-divergence regularized Kantorovich problem is defined by
(8) |
where the -divergence between measures is defined by
for . The dual of (8) can is given by (see for example [2])
and the -transform and -transform are partial maximizations of the functional . We only consider the -transform here, and the -transform can be similarly discussed. For any , its -transform is defined by
For any and , it has been shown that share the same modulus of continuity as [18]. That is, if
for any and , where is an increasing continuous function with , then
for any .
When , the measures and can be described by vectors satisfying for any , and the function can be described by a matrix satisfying for any . It is noted in [10] that the KL-divergence regularized Kantorovich problem is equivalent to the entropic regularized Kantorovich problem in this situation. Let , . Then we have
The equicontinuity property with respect to inherited from the continuous setting (under certain and ) inspires us to seek the equicontinuity property with respect to (similar discussion for ) directly for the discrete entropic regularized optimal transport problem, as presented in Lemma 2.6. Moreover, we will show that the complexities of the vanilla Sinkhorn and Greenkhorn algorithms can both be improved to $\widetilde{O}(n^2\varepsilon^{-2})$ based on this property.
Recall that in the Sinkhorn algorithm when is odd and when is even. Moreover, since = if is even and = if is odd, the following corollary is a straightforward consequence of Lemma 2.6.
Corollary 2.9.
No matter what initial variables are chosen in the Sinkhorn algorithm, for every , the corresponding dual variables satisfy
Moreover, for any solution of problem (5), we have
The following lemma provides a bound for the gap between and the optimal value utilizing Corollary 2.9.
Lemma 2.10.
Define . Then for ,
where is the corresponding transport plan matrix.
Proof.
First recall that
where . Suppose without loss of generality that is even. Since is a concave function, we have
(9) |
where the second equality follows from . Note that and are both probability vectors. Thus,
Since it holds for any , we have
where it is easy to see that the minimum is achieved at , and the last inequality follows from Corollary 2.9. Since in (9) can be bounded similarly, the proof is complete. ∎
Remark 2.11.
In [15], it has been shown that
and
where and . By modifying into , satisfying
and
the Sinkhorn algorithm using the modified vectors achieves an $\varepsilon$-approximate solution to the original OT problem (1) with complexity $\widetilde{O}(n^2\varepsilon^{-2})$. Such modified vectors exist. For example, we can choose
However, using the equicontinuity property we have actually shown that can be replaced by , which is independent of and . This fact enables us to prove that even for the vanilla Sinkhorn algorithm (without modifying the marginals), the computational complexity to achieve an $\varepsilon$-accuracy after rounding is $\widetilde{O}(n^2\varepsilon^{-2})$.
Theorem 2.12.
For any positive probability vectors, Algorithm 1 outputs a matrix such that in less than number of iterations. Furthermore, the computational complexity for the Sinkhorn algorithm to achieve an $\varepsilon$-accuracy after rounding is $\widetilde{O}(n^2\varepsilon^{-2})$.
Proof.
It follows from Lemmas 2.4 and 2.10 that
and
Combining these two inequalities together yields
Since is an increasing sequence, we have
It follows immediately that
Therefore, we have
Thus, if we choose , then Applying Lemma 2.4 again shows that the iteration will terminate within at most additional number of iterations. Therefore, the total number of iterations required by the Sinkhorn algorithm to terminate is smaller than
Noting Remark 2.3, since the per-iteration computational complexity of the Sinkhorn algorithm is $O(n^2)$, we can finally conclude that the total computational cost for the Sinkhorn algorithm to achieve an $\varepsilon$-accuracy after rounding is $\widetilde{O}(n^2\varepsilon^{-2})$ with an appropriate choice of the regularization parameter and the stopping threshold. ∎
Remark 2.13.
For the case when there are zeros in and , define and . Then for any , and , it can be easily verified that and . Thus the rows with index in and the columns with index in of each and are all zeros, and we only need to focus on the nonzero part of . It is equivalent to applying the Sinkhorn algorithm to solve the following compact problem
(10) |
where the two marginal vectors are obtained by removing their zero entries, and the cost matrix is obtained by removing the corresponding rows and columns. Therefore the complexity in this case is also $\widetilde{O}(n^2\varepsilon^{-2})$.
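A small preprocessing sketch of this reduction (our own function names, assuming NumPy arrays) is:

```python
import numpy as np

def compactify(C, r, c):
    """Drop the zero entries of the marginals and the corresponding rows/columns of C."""
    I, J = r > 0, c > 0
    return C[np.ix_(I, J)], r[I], c[J], I, J

def embed_back(P_small, I, J):
    """Place the plan of the reduced problem back into the full-size (zero-padded) plan."""
    P = np.zeros((I.size, J.size))
    P[np.ix_(I, J)] = P_small
    return P
```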
Remark 2.14.
A set of experiments has been conducted to investigate the dependency of the computational complexity of the Sinkhorn algorithm on the accuracy $\varepsilon$. See Figure 2 for the plots of the average number of iterations over random trials needed for Sinkhorn to achieve an $\varepsilon$-accuracy, plotted against the accuracy level. A desirable linear relation is clearly shown in the plots.
[Figure 2: average number of iterations needed by the Sinkhorn algorithm to achieve an ε-accuracy on the MNIST and synthetic data sets (see Remark 2.14).]
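A rough, hypothetical version of such an experiment (ours, not the paper's exact setup) is sketched below; the regularization parameter and the stopping threshold are scaled with $\varepsilon$ in the spirit of Theorem 2.12, and a log-domain implementation would be needed for much smaller values of $\varepsilon$.

```python
import numpy as np

def sinkhorn_iterations(C, r, c, eta, tol, max_iter=10**6):
    """Count Sinkhorn iterations until the l1 marginal error drops below tol."""
    K = np.exp(-C / eta)
    u, v = np.ones_like(r), np.ones_like(c)
    for k in range(max_iter):
        P = u[:, None] * K * v[None, :]
        if np.abs(P.sum(1) - r).sum() + np.abs(P.sum(0) - c).sum() <= tol:
            return k
        if k % 2 == 0:
            u = r / (K @ v)
        else:
            v = c / (K.T @ u)
    return max_iter

rng = np.random.default_rng(0)
n = 32
C = rng.random((n, n))
r = rng.random(n); r /= r.sum()
c = rng.random(n); c /= c.sum()
for eps in [0.5, 0.2, 0.1]:
    # one common theoretical choice: eta ~ eps / log n, tolerance ~ eps / ||C||_inf
    eta = eps / (4 * np.log(n))
    tol = eps / (8 * C.max())
    print(eps, sinkhorn_iterations(C, r, c, eta, tol))
```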
3 Greenkhorn and Its Complexity
Again, without loss of generality, we assume both marginal vectors are entrywise positive (the case where they contain zeros is discussed in Remark 3.9).
and then the one with the largest mismatch is chosen to be updated. For two probability vectors and , it can be easily verified that Moreover, is indeed the Bregman distance between and associated with the function . The Greenkhorn algorithm is described in Algorithm 3, where the initial variables are set to be and . Note that without further specification, is the matrix that is obtained by applying Algorithm 2 to the output of the Greenkhorn algorithm in this section. The existing complexity for the Greenkhorn algorithm is [14], while the complexity for the modified version using lifted and is [16]. However, a set of similar numerical experiments as in Section 2 suggest that the performance of the vanilla and modified versions are overall similar, see Figure 3. Thus, it also makes sense to provide an improved analysis of the vanilla Greenkhorn algorithm.
[Figure 3: comparison of the vanilla and modified Greenkhorn algorithms on the MNIST and synthetic data sets.]
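A compact NumPy sketch of the greedy update is given below; the initialization with the target marginals reflects our reading of the text, and a practical implementation maintains the row and column sums incrementally so that each iteration costs $O(n)$ after the initial $O(n^2)$ setup.

```python
import numpy as np

def greenkhorn(C, r, c, eta, tol, max_iter=10**6):
    """Greenkhorn sketch: greedily update the row or column with the largest mismatch."""
    K = np.exp(-C / eta)
    u = r.copy()                               # initialization with the target marginals
    v = c.copy()                               # (our reading; Algorithm 3 may differ in detail)
    P = u[:, None] * K * v[None, :]

    def rho(a, b):
        # scalar mismatch; the Bregman distance associated with x*log(x)
        return b - a + a * np.log(a / b)

    for _ in range(max_iter):
        row, col = P.sum(axis=1), P.sum(axis=0)
        if np.abs(row - r).sum() + np.abs(col - c).sum() <= tol:
            break
        dr, dc = rho(r, row), rho(c, col)
        if dr.max() >= dc.max():
            i = int(dr.argmax())
            u[i] = r[i] / (K[i, :] @ v)        # make row i sum exactly to r[i]
            P[i, :] = u[i] * K[i, :] * v
        else:
            j = int(dc.argmax())
            v[j] = c[j] / (K[:, j] @ u)        # make column j sum exactly to c[j]
            P[:, j] = u * K[:, j] * v[j]
    return P
```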
The convergence analysis for the Greenkhorn algorithm is overall parallel to that for the Sinkhorn algorithm, which is carried out on the dual variables:
That said, one point requires caution: the transport plan matrix produced by the Greenkhorn algorithm may not be a probability matrix, which requires careful treatment in the analysis. We first extend Lemma 2.2 to the situation where the plan is not necessarily a probability matrix. The proof of this lemma is presented in the appendix.
Lemma 3.1.
Let be the solution to the Kantorovich problem (1) with marginals and cost matrix . Let , where and is defined by . Letting , we have
(11) |
Remark 3.2.
The next lemma characterizes the one-iteration improvement of the Greenkhorn algorithm.
Lemma 3.3 ([14, Lemma 5, Lemma 6] ).
For any , if ,
and if or ,
The following lemma demonstrates that the infinity norm error of the iterates in the Greenkhorn algorithm will not increase.
The proofs of Lemma 3.3 and Lemma 3.4 can be found in [14] and [16], respectively. To keep the work self-contained, we include the proofs in the appendix. It is worth noting that by a similar argument we can show that Lemma 3.4 also holds for the Sinkhorn algorithm. The details are omitted since this result is not used in the convergence analysis of the Sinkhorn algorithm.
The following lemma provides an upper bound on the infinity norm of the optimal solution (after translation), which plays an important role in the analysis of the Greenkhorn algorithm. The proof of this lemma relies on the equicontinuity property of the optimal solution.
Lemma 3.5.
For the dual entropic regularized optimal transport problem (5), there exists a pair of optimal solutions satisfying
(12) |
Proof.
The following two lemmas can be obtained based on the above lemmas. Noting that the relevant quantity is the same for all optimal solutions, without loss of generality we assume the optimal solution used in the proof satisfies (12).
Lemma 3.6.
Let be any solution of (5), and define . Let . We have
Proof.
Lemma 3.7.
Let be any solution of (5) and let . Then
Proof.
Now we are ready to prove the main theorem about the computational complexity of the Greenkhorn algorithm. Note that, compared with the proof for Theorem 2.12, we need to discuss the two different cases of the iterates very carefully.
Theorem 3.8.
Algorithm 3 with initialization outputs a matrix satisfying in less than number of iterations. Furthermore, the computational complexity for the algorithm to achieve an -accuracy after rounding is
when is sufficiently small.
Proof.
Define
When , by Lemma 3.3,
On the other hand, when , by Lemma 3.3 and Lemma 3.6
This means that is a decreasing sequence. So by Lemma 3.7, we have
Noting that is an increasing sequence, if ,
(17) |
Setting , one has
since there are at least terms in the sum. Since , applying Lemma 3.3 again shows that the algorithm will terminate within at most additional number of iterations. Therefore, the total number of iterations required by the Greenkhorn algorithm to terminate is smaller than
Noting Remark 3.2, since the per-iteration computational complexity of the Greenkhorn algorithm is $O(n)$, we can finally conclude that the total computational cost for the Greenkhorn algorithm to achieve an $\varepsilon$-accuracy after rounding is $\widetilde{O}(n^2\varepsilon^{-2})$ with an appropriate choice of the regularization parameter and the stopping threshold. ∎
Remark 3.9.
For the case when there are zeros in the marginals, due to our choice of the initial variables and the convention used for zero entries, the Greenkhorn algorithm never updates the zero entries in any iteration. This means that the corresponding rows and columns of each iterate are all zeros, where the index sets are defined in the same way as in Remark 2.13. Thus we only need to focus on the nonzero part of the iterates, which is identical to the sequence generated by applying the Greenkhorn algorithm to solve (10). Therefore, the complexity is still $\widetilde{O}(n^2\varepsilon^{-2})$ in this case.
Remark 3.10.
Remark 3.11.
Though our work focuses on the complexity analysis of the Sinkhorn and Greenkhorn algorithms, it is worth noting that there are also other algorithms for solving the OT problem. For example, a set of first-order methods (APDAGD [15]; APDAMD [16]; AAM [19]; HPD [20]) using primal-dual gradient descent or mirror descent schemes have been shown to enjoy an $\widetilde{O}(n^{2.5}\varepsilon^{-1})$ complexity. Note that the technique of modifying the marginals is also used in APDAGD [15] and APDAMD [16], and our analysis indeed suggests that the same computational complexity can be achieved without this modification. Algorithms with $\widetilde{O}(n^2\varepsilon^{-1})$ theoretical runtime have been developed in [21] and [22], although practical implementations of these algorithms remain unavailable. In [23, 24], more practical algorithms with the $\widetilde{O}(n^2\varepsilon^{-1})$ complexity have been developed.
4 Conclusion
We first show that the computational complexity of the vanilla Sinkhorn algorithm with any feasible initial vectors is $\widetilde{O}(n^2\varepsilon^{-2})$. For the vanilla Greenkhorn algorithm, a careful analysis shows that if the initialization is chosen to be the target probability vectors, the computational complexity of the algorithm is also $\widetilde{O}(n^2\varepsilon^{-2})$. The equicontinuity property of the discrete entropic regularized optimal transport problem plays a key role in the analysis.
References
- [1] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Mem. Math. Phys. Acad. Royale Sci., pages 666–704, 1781.
- [2] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
- [3] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008.
- [4] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a wasserstein loss. Advances in neural information processing systems, 28, 2015.
- [5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
- [6] Zhengyu Su, Yalin Wang, Rui Shi, Wei Zeng, Jian Sun, Feng Luo, and Xianfeng Gu. Optimal mass transport for shape matching and comparison. IEEE transactions on pattern analysis and machine intelligence, 37(11):2246–2259, 2015.
- [7] Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (ToG), 34(4):1–11, 2015.
- [8] Zheng Ge, Songtao Liu, Zeming Li, Osamu Yoshie, and Jian Sun. Ota: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 303–312, 2021.
- [9] Ofir Pele and Michael Werman. Fast and robust earth mover’s distances. In 2009 IEEE 12th international conference on computer vision, pages 460–467. IEEE, 2009.
- [10] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transportation distances. Advances in Neural Information Processing Systems, 26, 06 2013.
- [11] G Udny Yule. On the methods of measuring association between two attributes. Journal of the Royal Statistical Society, 75(6):579–652, 1912.
- [12] Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematical statistics, 35(2):876–879, 1964.
- [13] Joel Franklin and Jens Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its applications, 114:717–735, 1989.
- [14] Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via sinkhorn iteration, 2018.
- [15] Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by sinkhorn’s algorithm. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1367–1376. PMLR, 10–15 Jul 2018.
- [16] Tianyi Lin, Nhat Ho, and Michael I Jordan. On the efficiency of entropic regularized algorithms for optimal transport. Journal of Machine Learning Research, 23(137):1–42, 2022.
- [17] F. Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Springer International Publishing, Imprint: Birkhäuser, 2015.
- [18] Simone Di Marino and Augusto Gerolin. An optimal transport approach for the schrödinger bridge problem and convergence of sinkhorn algorithm. Journal of Scientific Computing, 85(2):27, 2020.
- [19] Sergey Guminov, Pavel Dvurechensky, Nazarii Tupitsa, and Alexander Gasnikov. On a combination of alternating minimization and nesterov’s momentum. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3886–3898. PMLR, 18–24 Jul 2021.
- [20] Antonin Chambolle and Juan Pablo Contreras. Accelerated bregman primal-dual methods applied to optimal transport and wasserstein barycenter problems. SIAM Journal on Mathematics of Data Science, 4(4):1369–1395, 2022.
- [21] Jose Blanchet, Arun Jambulapati, Carson Kent, and Aaron Sidford. Towards optimal running times for optimal transport, 2020.
- [22] Kent Quanrud. Approximating optimal transport with linear programs, 2018.
- [23] Arun Jambulapati, Aaron Sidford, and Kevin Tian. A direct iteration parallel algorithm for optimal transport, 2019.
- [24] Gen Li, Yanxi Chen, Yuejie Chi, H. Vincent Poor, and Yuxin Chen. Fast computation of optimal transport via entropy-regularized extragradient methods, 2023.
Appendix A Supplementary Proofs for Lemmas 2.1, 2.2, and 2.4
Proof of Lemma 2.1.
It is clear that , . Let . It is easy to verify that and . Moreover,
where the last equality arises from . Since
and
we have
which completes the proof. ∎
Proof of Lemma 2.2.
Let be the output of Round and be the output of Round. By verifying the KKT condition, we have
(18) |
Thus,
Then the proof is complete after applying Lemma 2.1 to bound and . ∎
Proof of Lemma 2.4.
Assume and is even. According to the update rule of the Sinkhorn algorithm, we have
and . Thus
where the inequality utilizes Pinsker's inequality and the last equality follows from the fact that if and is even. The case when is odd can be proved in a similar manner. ∎
Appendix B Proof of Lemma 3.1
Appendix C Supplementary Proofs for Lemmas 3.3 and 3.4
Proof of Lemma 3.3.
We first establish the following result, which can be viewed as a generalization of Pinsker's inequality. For any , if , we have
(19) |
The proof of this result is as follows. Letting , we have
When , it is easy to see . Moreover, it can be verified directly that if . Applying Pinsker's inequality yields
It follows that
which establishes (19) provided .
To proceed, without loss of generality, suppose the -th iteration of the Greenkhorn algorithm updates . Then we have
where the sixth equality follows from . If
according to the update rule of the Greenkhorn algorithm and inequality (19), we have
If or , the update rule of Greenkhorn algorithm implies that
which completes the proof. ∎