
Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction

Khai Nguyen    Dang Nguyen    Nhat Ho
Abstract

Max sliced Wasserstein (Max-SW) distance has been widely known as a solution for less discriminative projections of sliced Wasserstein (SW) distance. In applications that have various independent pairs of probability measures, amortized projection optimization is utilized to predict the “max” projecting directions given two input measures instead of using projected gradient ascent multiple times. Despite being efficient, Max-SW and its amortized version cannot guarantee metricity property due to the sub-optimality of the projected gradient ascent and the amortization gap. Therefore, we propose to replace Max-SW with distributional sliced Wasserstein distance with von Mises-Fisher (vMF) projecting distribution (v-DSW). Since v-DSW is a metric with any non-degenerate vMF distribution, its amortized version can guarantee the metricity when performing amortization. Furthermore, current amortized models are not permutation invariant and symmetric. To address the issue, we design amortized models based on self-attention architecture. In particular, we adopt efficient self-attention architectures to make the computation linear in the number of supports. With the two improvements, we derive self-attention amortized distributional projection optimization and show its appealing performance in point-cloud reconstruction and its downstream applications.


1 Introduction

Wasserstein distance (Villani, 2008; Peyré & Cuturi, 2019) has been widely recognized in the machine learning community as an effective tool. For example, Wasserstein distance is used to explore clusters inside data (Ho et al., 2017), to transfer knowledge between different domains (Courty et al., 2016; Damodaran et al., 2018), to learn generative models (Arjovsky et al., 2017; Tolstikhin et al., 2018), to extract features from graphs (Vincent-Cuaz et al., 2022), to compare datasets (Alvarez-Melis & Fusi, 2020), and in many other applications. Despite being effective, Wasserstein distance is extremely expensive to compute. In particular, the computational complexity and memory complexity of Wasserstein distance in the discrete case are $\mathcal{O}(m^{3}\log m)$ and $\mathcal{O}(m^{2})$ respectively, where $m$ is the number of supports. The computational problem becomes more severe for applications that require computing the Wasserstein distance multiple times on different pairs of measures. Some examples are deep generative modeling (Genevay et al., 2018), deep domain adaptation (Bhushan Damodaran et al., 2018), comparing datasets (Alvarez-Melis & Fusi, 2020), topic modeling (Huynh et al., 2020), point-cloud reconstruction (Achlioptas et al., 2018), and so on.

By adding entropic regularization (Cuturi, 2013), an $\varepsilon$-approximation of Wasserstein distance can be obtained in $\mathcal{O}(m^{2}/\varepsilon^{2})$. However, this approach cannot reduce the memory complexity of $\mathcal{O}(m^{2})$ due to the storage of the cost matrix. A more efficient approach is based on the closed-form solution of Wasserstein distance in one dimension, which is known as sliced Wasserstein distance (Bonneel et al., 2015). Sliced Wasserstein (SW) distance is defined as the expectation of the Wasserstein distance between random one-dimensional push-forward measures from the two original measures. Thanks to the closed-form solution, SW can be computed in $\mathcal{O}(m\log_{2}m)$ while having a linear memory complexity $\mathcal{O}(m)$. Moreover, SW is also better than Wasserstein distance in high-dimensional statistical inference. Namely, the sample complexity (statistical estimation rate) of SW is $\mathcal{O}(n^{-1/2})$ compared to $\mathcal{O}(n^{-1/d})$ for Wasserstein distance, where $d$ is the number of dimensions and $n$ is the number of data samples. Due to these appealing properties, SW has been utilized successfully in various applications, e.g., generative modeling (Deshpande et al., 2018; Nguyen & Ho, 2022b; Nguyen et al., 2023), domain adaptation (Lee et al., 2019a), Bayesian inference (Nadjahi et al., 2020; Yi & Liu, 2021), point-cloud representation learning (Nguyen et al., 2021c; Naderializadeh et al., 2021), and so on.

The downside of SW is that it treats all projections the same due to the usage of a uniform distribution over projecting directions. This choice is inappropriate in practice since there exist projecting directions that cannot discriminate the two measures of interest (Kolouri et al., 2018). As a solution, max sliced Wasserstein distance (Max-SW) (Deshpande et al., 2019) is introduced by searching for the best projecting direction that maximizes the projected Wasserstein distance. Max-SW needs a projected sub-gradient ascent algorithm to find the "max" slice. Therefore, in applications that need to evaluate Max-SW multiple times on different pairs of measures, the repeated optimization procedure is costly. For example, this paper focuses on point-cloud reconstruction applications where Max-SW needs to be computed between various pairs of empirical measures over a point-cloud and its reconstructed version.

To address this problem, amortized projection optimization is proposed in (Nguyen & Ho, 2022a). As in other amortized optimization (learning to learn) approaches (Shu, 2017; Amos, 2022), an amortized model is estimated to predict the best projecting direction given the two input empirical measures. The authors in (Nguyen & Ho, 2022a) propose three types of amortized models: the linear model, the generalized linear model, and the non-linear model. The linear model assumes that the "max" projecting direction is a linear combination of the supports of the two measures. The generalized linear model injects the linearity through a link function on the supports of the two measures, while the non-linear model uses multilayer perceptrons to gain more expressiveness.

Despite performing well in practice, the previous work has not explored the full potential of amortized optimization in the sliced Wasserstein setting. There are two issues in the current amortized optimization framework. Firstly, the sub-optimality of amortized optimization leads to losing the metricity of the projected distance computed from the predicted projecting direction. In particular, the metricity of Max-SW is only obtained at the global optimum. Therefore, using an amortized model with sub-optimal solutions cannot achieve metricity for all pairs of measures. Losing the metricity property could hurt the performance of downstream applications. Secondly, the current amortized models are not permutation invariant to the supports of the two input measures and are not symmetric. The permutation-invariance and symmetry properties are vital since the "max" projecting direction does not change when permuting the supports of the two input empirical measures or exchanging the two input empirical measures. Inducing permutation-invariance and symmetry in the amortized model could help to learn a better amortized model and reduce the amortization gap.

In this paper, we focus on overcoming the two issues of the current amortized projection optimization framework. For metricity preservation, we propose an amortized distributional projection optimization framework which predicts the best distribution over projecting directions. In particular, we perform amortized optimization for distributional sliced Wasserstein (DSW) distance (Nguyen et al., 2021a) with a von Mises-Fisher (vMF) slicing distribution (Jupp & Mardia, 1979) instead of Max-SW. Thanks to the smoothness of the vMF distribution, the metricity can be preserved even without a zero amortization gap. For the permutation-invariance and symmetry properties, we propose to use the self-attention mechanism (Vaswani et al., 2017) to design the amortized model. Moreover, we utilize efficient self-attention approaches whose computational complexity scales linearly in the number of supports, including efficient attention (Shen et al., 2021) and linear attention (Wang et al., 2020a).

Contribution. In summary, our contribution is two-fold:

1. First, we introduce the amortized distributional projection optimization framework which predicts the best location parameter of the von Mises-Fisher (vMF) distribution in distributional sliced Wasserstein (DSW) distance. Due to the smoothness of the vMF distribution, the metricity is guaranteed for all pairs of measures. Moreover, we enhance amortized models by inducing inductive biases, namely permutation invariance and symmetry. To improve efficiency, we leverage two linear-complexity attention mechanisms, efficient attention (Shen et al., 2021) and linear attention (Wang et al., 2020a), to parameterize the amortized model. Combining the above two improvements, we obtain the self-attention amortized distributional projection optimization framework.

2. Second, we adapt the new framework to the point-cloud reconstruction problem. In particular, we want to learn an autoencoder that can reconstruct (encode and decode) all point-clouds through their latent representations. The main idea is to treat a point-cloud as an empirical measure and use sliced Wasserstein distances as the reconstruction losses. Here, amortized optimization serves as a fast way to yield projecting directions that allow sliced Wasserstein distance to discriminate all pairs of original and reconstructed point-clouds. Empirically, we show that self-attention amortized distributional projection optimization provides better reconstructed point-clouds on the ModelNet40 dataset (Wu et al., 2015) than the amortized projection optimization framework and widely used distances. Moreover, on downstream tasks, the new framework also leads to higher classification accuracy on ModelNet40 and generates ShapeNet chairs with better quality.

Organization. The remainder of the paper is organized as follows. In Section 2, we provide backgrounds for point-cloud reconstruction and popular distances. In Section 3, we define the new amortized distributional projection optimization framework for the point-cloud reconstruction problem. Section 4 benchmarks the proposed method by extensive experiments on point-cloud reconstruction, transfer learning, and point-cloud generation. Finally, proofs of key results and extra materials are in the supplementary.

Notation. For any $d\geq 2$, we denote by $\mathcal{U}(\mathbb{S}^{d-1})$ the uniform measure over the unit hypersphere $\mathbb{S}^{d-1}:=\{\theta\in\mathbb{R}^{d}\mid \|\theta\|_{2}^{2}=1\}$. For $p\geq 1$, $\mathcal{P}_{p}(\mathbb{R}^{d})$ is the set of all probability measures on $\mathbb{R}^{d}$ that have finite $p$-th moments. For any two sequences $a_{n}$ and $b_{n}$, the notation $a_{n}=\mathcal{O}(b_{n})$ means that $a_{n}\leq Cb_{n}$ for all $n\geq 1$, where $C$ is some universal constant. We denote by $\theta\sharp\mu$ the push-forward measure of $\mu$ through the function $f:\mathbb{R}^{d}\to\mathbb{R}$ defined as $f(x)=\theta^{\top}x$.

2 Preliminaries

We first review the point-cloud reconstruction framework in Section 2.1. After that, we discuss famous choices of metrics between two point-clouds in Section 2.2. Finally, we present an adapted definition of the amortized projection optimization framework in the point-cloud reconstruction setting in Section 2.3.

2.1 Point-Cloud Reconstruction

Figure 1: The reconstruction of a point-cloud $X$ (a plane).

We denote a point-cloud of $m$ points $x_{1},\ldots,x_{m}\in\mathbb{R}^{d}$ ($d\geq 1$) as $X=(x_{1},\ldots,x_{m})\in\mathbb{R}^{dm}$, which is the concatenation of all points in the point-cloud into a single vector. We denote the set of all possible point-clouds as $\mathcal{X}\subset\mathbb{R}^{dm}$.

Permutation invariant metric space. Given a one-to-one permutation mapping $\sigma:[m]\to[m]$, we have $\sigma(X)\in\mathcal{X}$ for all $X\in\mathcal{X}$, where $\sigma(X)=(x_{\sigma(1)},\ldots,x_{\sigma(m)})$. Moreover, we need a metric $\mathcal{D}:\mathcal{X}\times\mathcal{X}\to\mathbb{R}^{+}$ such that $\mathcal{D}(X,\sigma(X))=0$ for all $X\in\mathcal{X}$. Here, $\mathcal{D}$ is a metric, namely, it needs to satisfy non-negativity, symmetry, the triangle inequality, and the identity property. The pair $(\mathcal{X},\mathcal{D})$ forms a point-cloud metric space.

Learning representation via reconstruction. The raw representation of point-clouds is hard to work with in applications due to the complicated metric space. Therefore, a popular approach is to map point-clouds to points in a different space, e.g., a Euclidean space, where it is easier to apply machine learning algorithms. In more detail, we want to estimate a function $f_{\phi}:\mathcal{X}\to\mathcal{Z}$ ($\phi\in\Phi$) where $\mathcal{Z}$ is a set that belongs to another metric space. Then, we can apply machine learning algorithms on $\mathcal{Z}$ instead of $\mathcal{X}$. The most well-known and effective way to estimate the function $f_{\phi}$ is through a reconstruction loss. Namely, we estimate $f_{\phi}$ jointly with a function $g_{\gamma}:\mathcal{Z}\to\mathcal{X}$ ($\gamma\in\Gamma$), given a point-cloud dataset $p(X)$ (a distribution over $\mathcal{X}$), by minimizing the objective:

\min_{\phi\in\Phi,\gamma\in\Gamma}\mathbb{E}_{X\sim p(X)}\mathcal{D}(X,g_{\gamma}(f_{\phi}(X))). \quad (1)

The loss $\mathbb{E}_{X\sim p(X)}\mathcal{D}(X,g_{\gamma}(f_{\phi}(X)))$ is known as the reconstruction loss. If the reconstruction loss is 0, we have $g_{\gamma}=f_{\phi}^{-1}$ $p$-almost surely. Therefore, we can move from $\mathcal{X}$ to $\mathcal{Z}$ and back from $\mathcal{Z}$ to $\mathcal{X}$ without losing information through the functions $f_{\phi}$ (referred to as the encoder) and $g_{\gamma}$ (referred to as the decoder). We show an illustration of the framework (Achlioptas et al., 2018) in Figure 1. After learning how to reconstruct well, other point-cloud tasks can be done using the autoencoder (the pair $(f_{\phi},g_{\gamma})$), e.g., shape interpolation, shape editing, shape analogy, shape completion, point-cloud classification, and point-cloud generation (Achlioptas et al., 2018).
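As a concrete illustration of the objective in Equation 1, the following is a minimal PyTorch sketch of one stochastic reconstruction step; the `encoder`, `decoder`, and `distance` callables are placeholders for any parameterization of $f_{\phi}$, $g_{\gamma}$, and $\mathcal{D}$, not the exact architecture used in the experiments.

```python
import torch

def reconstruction_step(encoder, decoder, distance, batch, optimizer):
    """One stochastic step of Equation (1): minimize E_X D(X, g_gamma(f_phi(X))).

    batch: tensor of shape (B, m, d), a mini-batch of point-clouds.
    distance: callable taking two (m, d) point-clouds and returning a scalar loss.
    """
    optimizer.zero_grad()
    latent = encoder(batch)              # (B, z_dim) latent representations
    reconstruction = decoder(latent)     # (B, m, d) reconstructed point-clouds
    loss = torch.stack([distance(x, y) for x, y in zip(batch, reconstruction)]).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```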

2.2 Metric Spaces for Point-Clouds

We now review some popular choices of the metric $\mathcal{D}$: Chamfer distance (Barrow et al., 1977), Wasserstein distance (Villani, 2008), sliced Wasserstein (SW) distance (Bonneel et al., 2015), and max sliced Wasserstein (Max-SW) distance (Deshpande et al., 2019).

Chamfer distance. For any two point-clouds $X$ and $Y$, the Chamfer distance is defined as follows:

\text{CD}(X,Y)=\frac{1}{|X|}\sum_{x\in X}\min_{y\in Y}\|x-y\|_{2}^{2}+\frac{1}{|Y|}\sum_{y\in Y}\min_{x\in X}\|x-y\|_{2}^{2}, \quad (2)

where $|X|$ denotes the number of points in $X$.
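A minimal PyTorch sketch of the Chamfer distance in Equation 2 (assuming point-clouds are given as `(m, d)` tensors) could look as follows.

```python
import torch

def chamfer_distance(X, Y):
    """Chamfer distance of Equation (2) between point-clouds X (m, d) and Y (n, d)."""
    dists = torch.cdist(X, Y, p=2) ** 2           # (m, n) squared Euclidean distances
    loss_xy = dists.min(dim=1).values.mean()      # for each x, its nearest y
    loss_yx = dists.min(dim=0).values.mean()      # for each y, its nearest x
    return loss_xy + loss_yx
```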

Wasserstein distance. Given two probability measures $\mu\in\mathcal{P}_{p}(\mathbb{R}^{d})$ and $\nu\in\mathcal{P}_{p}(\mathbb{R}^{d})$, the Wasserstein distance between $\mu$ and $\nu$ is defined as follows:

\text{W}_{p}(\mu,\nu)=\left(\inf_{\pi\in\Pi(\mu,\nu)}\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\|x-y\|_{p}^{p}\,d\pi(x,y)\right)^{\frac{1}{p}}, \quad (3)

where $\Pi(\mu,\nu)$ is the set of all couplings whose marginals are $\mu$ and $\nu$ respectively. Since the Wasserstein distance is originally defined on the space of probability measures, we need to convert a point-cloud $X=(x_{1},\ldots,x_{m})\in\mathcal{X}$ to the corresponding empirical probability measure $P_{X}=\frac{1}{m}\sum_{i=1}^{m}\delta_{x_{i}}\in\mathcal{P}(\mathbb{R}^{d})$. Therefore, we can use $\mathcal{D}(X,Y)=\text{W}_{p}(P_{X},P_{Y})$ for $X,Y\in\mathcal{X}$.

Sliced Wasserstein distance. As discussed, the Wasserstein distance is expensive to compute, with time complexity $\mathcal{O}(m^{3}\log m)$ and memory complexity $\mathcal{O}(m^{2})$. Therefore, an alternative choice is the sliced Wasserstein (SW) distance, defined between two probability measures $\mu\in\mathcal{P}_{p}(\mathbb{R}^{d})$ and $\nu\in\mathcal{P}_{p}(\mathbb{R}^{d})$ as:

\text{SW}_{p}(\mu,\nu)=\left(\mathbb{E}_{\theta\sim\mathcal{U}(\mathbb{S}^{d-1})}\text{W}_{p}^{p}(\theta\sharp\mu,\theta\sharp\nu)\right)^{\frac{1}{p}}. \quad (4)

The benefit of SW is that $\text{W}_{p}(\theta\sharp\mu,\theta\sharp\nu)$ has a closed-form solution, which is

\text{W}_{p}(\theta\sharp\mu,\theta\sharp\nu)=\left(\int_{0}^{1}|F_{\theta\sharp\mu}^{-1}(z)-F_{\theta\sharp\nu}^{-1}(z)|^{p}\,dz\right)^{\frac{1}{p}},

where $F^{-1}$ denotes the inverse CDF. The expectation is often approximated by Monte Carlo sampling, namely, it is replaced by the average over $\theta_{1},\ldots,\theta_{L}$ drawn i.i.d. from $\mathcal{U}(\mathbb{S}^{d-1})$. The computational complexity and memory complexity of SW then become $\mathcal{O}(Lm\log_{2}m)$ and $\mathcal{O}(Lm)$, respectively.
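The Monte Carlo estimator of SW described above can be sketched in PyTorch as follows; the sketch assumes the two point-clouds have the same number of equal-weight supports, so the one-dimensional Wasserstein distance reduces to sorting both projections.

```python
import torch

def sliced_wasserstein(X, Y, L=50, p=2):
    """Monte Carlo estimate of SW_p (Equation 4) with L random projecting directions.

    X, Y: (m, d) tensors viewed as empirical measures with m equal-weight supports.
    """
    d = X.shape[1]
    theta = torch.randn(d, L)
    theta = theta / theta.norm(dim=0, keepdim=True)       # uniform directions on S^{d-1}
    x_proj, _ = torch.sort(X @ theta, dim=0)              # (m, L) sorted projections
    y_proj, _ = torch.sort(Y @ theta, dim=0)
    w_pp = (x_proj - y_proj).abs().pow(p).mean(dim=0)     # closed-form W_p^p per slice
    return w_pp.mean().pow(1.0 / p)
```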

Max sliced Wasserstein distance. It is well known that SW has many less discriminative projections due to the uniform sampling. Therefore, the max sliced Wasserstein distance is proposed to use the most discriminative projecting direction. The max sliced Wasserstein (Max-SW) distance (Deshpande et al., 2019) between $\mu\in\mathcal{P}_{p}(\mathbb{R}^{d})$ and $\nu\in\mathcal{P}_{p}(\mathbb{R}^{d})$ is defined as follows:

\text{Max-SW}_{p}(\mu,\nu)=\max_{\theta\in\mathbb{S}^{d-1}}\text{W}_{p}(\theta\sharp\mu,\theta\sharp\nu). \quad (5)

Max-SW is often computed by a projected sub-gradient ascent algorithm. When the projected sub-gradient ascent algorithm runs for $T\geq 1$ iterations, the computational complexity of Max-SW is $\mathcal{O}(Tm\log_{2}m)$ and the memory complexity is $\mathcal{O}(m)$. Both SW and Max-SW have been applied successfully in point-cloud reconstruction (Nguyen et al., 2021c).
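For comparison, below is a hedged sketch of Max-SW (Equation 5) computed by projected gradient ascent on $\theta$; the step size, initialization, and use of plain SGD are illustrative choices, not the exact settings of the paper.

```python
import torch

def max_sliced_wasserstein(X, Y, T=50, lr=1e-2, p=2):
    """Max-SW (Equation 5) estimated by T projected gradient ascent steps on theta."""
    d = X.shape[1]
    theta = torch.randn(d)
    theta = (theta / theta.norm()).requires_grad_(True)
    optimizer = torch.optim.SGD([theta], lr=lr)
    for _ in range(T):
        optimizer.zero_grad()
        x_sorted, _ = torch.sort(X @ theta)
        y_sorted, _ = torch.sort(Y @ theta)
        w_p = (x_sorted - y_sorted).abs().pow(p).mean().pow(1.0 / p)
        (-w_p).backward()                                  # ascent on the projected W_p
        optimizer.step()
        with torch.no_grad():
            theta.data = theta.data / theta.data.norm()    # project back onto S^{d-1}
    return w_p.detach()
```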

2.3 Amortized Projection Optimization

Amortized Optimization. We start with the definition of amortized optimization.

Definition 1.

For each context variable $x$ in the context space $\mathcal{X}$, $\theta^{\star}(x)$ is the solution of the optimization problem $\theta^{\star}(x)=\operatorname*{arg\,min}_{\theta\in\Theta}\mathcal{L}(\theta,x)$, where $\Theta$ is the solution space. A parametric function $f_{\psi}:\mathcal{X}\to\Theta$, where $\psi\in\Psi$, is called an amortized model if

f_{\psi}(x)\approx\theta^{\star}(x),\quad\forall x\in\mathcal{X}. \quad (6)

The amortized model is trained by the amortized optimization objective, which is defined as:

\min_{\psi\in\Psi}\mathbb{E}_{x\sim p(x)}\mathcal{L}(f_{\psi}(x),x), \quad (7)

where $p(x)$ is a probability measure on $\mathcal{X}$ which measures the "importance" of the optimization problems.

Amortized Projection Optimization. We now revisit the point-cloud reconstruction objective with $\mathcal{D}(X,Y)=\text{Max-SW}_{p}(P_{X},P_{Y})$:

\min_{\phi\in\Phi,\gamma\in\Gamma}\mathbb{E}\left[\max_{\theta\in\mathbb{S}^{d-1}}\text{W}_{p}(\theta\sharp P_{X},\theta\sharp P_{g_{\gamma}(f_{\phi}(X))})\right], \quad (8)

where the expectation is with respect to $X\sim p(X)$. For each point-cloud $X\in\mathcal{X}$, we need to compute a Max-SW distance with an iterative optimization procedure. Therefore, it is computationally expensive.

The authors in (Nguyen & Ho, 2022a) propose to use amortized optimization (Shu, 2017; Amos, 2022) to speed up this problem. Instead of solving all optimization problems independently, an amortized model is trained to predict the optimal solutions of all problems. In greater detail, given a parametric function $a_{\psi}:\mathcal{X}\times\mathcal{X}\to\mathbb{S}^{d-1}$ ($\psi\in\Psi$), the amortized objective is:

\min_{\phi\in\Phi,\gamma\in\Gamma}\max_{\psi\in\Psi}\mathbb{E}\,\text{W}_{p}(\theta_{\psi,\gamma,\phi}\sharp P_{X},\theta_{\psi,\gamma,\phi}\sharp P_{g_{\gamma}(f_{\phi}(X))}), \quad (9)

where the expectation is with respect to $X\sim p(X)$, and $\theta_{\psi,\gamma,\phi}=a_{\psi}(X,g_{\gamma}(f_{\phi}(X)))$. The above optimization is solved by an alternating stochastic (projected) gradient descent-ascent algorithm. Therefore, it is faster to compute in each update iteration of $\phi$ and $\gamma$. It is worth noting that the previous work (Nguyen & Ho, 2022a) considers the generative model application, which is unstable and hard to analyze. Here, we adapt the framework to the point-cloud reconstruction application, where it is easier to explore the behavior of amortized optimization. We refer the reader to Algorithms 2-3 in Appendix A.3 for algorithms on training an autoencoder with Max-SW and amortized projection optimization.
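A schematic PyTorch sketch of one alternating update of the min-max objective in Equation 9 is given below; `amortized_model` stands for any $a_{\psi}$, and the batched projected Wasserstein helper is a simplification that assumes equal-size, equal-weight point-clouds.

```python
import torch

def projected_wasserstein(X, Y, theta, p=2):
    """Batched 1D W_p along directions theta; X, Y: (B, m, d), theta: (B, d)."""
    x_proj, _ = torch.sort(torch.einsum('bmd,bd->bm', X, theta), dim=1)
    y_proj, _ = torch.sort(torch.einsum('bmd,bd->bm', Y, theta), dim=1)
    return (x_proj - y_proj).abs().pow(p).mean(dim=1).pow(1.0 / p).mean()

def amortized_maxsw_step(encoder, decoder, amortized_model,
                         ae_optimizer, amor_optimizer, batch, p=2):
    """One alternating step of Equation (9): ascent on psi, then descent on (phi, gamma)."""
    # Ascent on the amortized model: maximize the projected Wasserstein distance.
    reconstruction = decoder(encoder(batch)).detach()
    theta = amortized_model(batch, reconstruction)          # (B, d) unit directions
    amor_loss = -projected_wasserstein(batch, reconstruction, theta, p)
    amor_optimizer.zero_grad()
    amor_loss.backward()
    amor_optimizer.step()

    # Descent on the autoencoder: minimize the same objective with theta fixed.
    reconstruction = decoder(encoder(batch))
    theta = amortized_model(batch, reconstruction).detach()
    ae_loss = projected_wasserstein(batch, reconstruction, theta, p)
    ae_optimizer.zero_grad()
    ae_loss.backward()
    ae_optimizer.step()
    return ae_loss.item()
```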

Amortized models. The authors in (Nguyen & Ho, 2022a) propose three types of amortized models that are based on the literature on linear models (Christensen, 2002). In particular, the linear amortized model is defined as follows:

Definition 2.

Given $X,Y\in\mathbb{R}^{dm}$, the linear amortized model is defined as:

a_{\psi}(X,Y):=\frac{w_{0}+X^{\prime}w_{1}+Y^{\prime}w_{2}}{\|w_{0}+X^{\prime}w_{1}+Y^{\prime}w_{2}\|_{2}},

where $X^{\prime}$ and $Y^{\prime}$ are matrices of size $d\times m$ that are reshaped from the concatenated vectors $X$ and $Y$ of size $dm$, and $\psi=(w_{0},w_{1},w_{2})$ with $w_{1},w_{2}\in\mathbb{R}^{m}$ and $w_{0}\in\mathbb{R}^{d}$.
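Under the notation of Definition 2, a PyTorch sketch of the linear amortized model (batched over point-cloud pairs, an assumption made for readability) is:

```python
import torch
import torch.nn as nn

class LinearAmortizedModel(nn.Module):
    """Linear amortized model of Definition 2: a normalized linear combination
    of the supports of the two input point-clouds."""

    def __init__(self, d, m):
        super().__init__()
        self.w0 = nn.Parameter(torch.randn(d))
        self.w1 = nn.Parameter(torch.randn(m))
        self.w2 = nn.Parameter(torch.randn(m))

    def forward(self, X, Y):
        # X, Y: (B, m, d) point-clouds; X' w1 is a weighted sum of the supports of X.
        out = (self.w0 + torch.einsum('bmd,m->bd', X, self.w1)
                       + torch.einsum('bmd,m->bd', Y, self.w2))
        return out / out.norm(dim=-1, keepdim=True)
```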

Similarly, the generalized linear amortized model and the non-linear amortized model are defined by injecting non-linearity into the linear model. We review the definitions of the generalized linear amortized model and non-linear amortized model in Definitions 4-5 in Appendix A.1.

Sub-optimality. Despite being faster, amortized optimization often cannot recover the global optima of the optimization problems. Namely, we denote

\theta^{\star}(X)=\operatorname*{arg\,max}_{\theta\in\mathbb{S}^{d-1}}\text{W}_{p}(\theta\sharp P_{X},\theta\sharp P_{g_{\gamma}(f_{\phi}(X))})

and

\psi^{\star}=\operatorname*{arg\,max}_{\psi\in\Psi}\mathbb{E}_{X\sim p(X)}\left[\text{W}_{p}(\theta_{\psi,\gamma,\phi}\sharp P_{X},\theta_{\psi,\gamma,\phi}\sharp P_{g_{\gamma}(f_{\phi}(X))})\right].

Then, it is well known that the amortization gap $\mathbb{E}_{X\sim p(X)}[c(\theta^{\star}(X),a_{\psi^{\star}}(X,g_{\gamma}(f_{\phi}(X))))]>0$ for a metric $c:\mathbb{S}^{d-1}\times\mathbb{S}^{d-1}\to\mathbb{R}^{+}$. A great amortized model is one that can minimize the amortization gap. However, in the amortized projection optimization setting, we cannot obtain $\theta^{\star}(X)$ since the projected gradient ascent algorithm can only yield a local optimum. Therefore, a careful investigation of the amortization gap is challenging.

3 Self-Attention Amortized Distributional Projection Optimization

In this section, we propose the self-attention amortized distributional projection optimization framework. First, we present amortized distributional projection optimization to maintain the metricity property in Section 3.1. We then introduce self-attention amortized models which are symmetric and permutation invariant in Section 3.2.

3.1 Amortized Distributional Projection Optimization

Figure 2: The difference between amortized projection optimization and amortized distributional projection optimization.

The current amortized projection optimization framework predicts the "max" projecting direction in Max-SW. However, the projected one-dimensional Wasserstein is only a metric on the space of probability measures at the global optimum of Max-SW. Therefore, the local optimum from the projected sub-gradient ascent algorithm (Nietert et al., 2022) and the prediction from the amortized model only yield pseudo-metricity for the projected Wasserstein.

Proposition 1.

Let the projected one-dimensional Wasserstein be $\text{PW}_{p}(\mu,\nu;\hat{\theta})=\text{W}_{p}(\hat{\theta}\sharp\mu,\hat{\theta}\sharp\nu)$ for any $\mu,\nu\in\mathcal{P}_{p}(\mathbb{R}^{d})$ ($p\geq 1$, $d\geq 1$) and $\hat{\theta}\in\mathbb{S}^{d-1}$ such that $\hat{\theta}\neq\operatorname*{arg\,max}_{\theta\in\mathbb{S}^{d-1}}\text{W}_{p}(\theta\sharp\mu,\theta\sharp\nu)$. Then $\text{PW}_{p}(\mu,\nu;\hat{\theta})$ is a pseudo-metric on $\mathcal{P}_{p}(\mathbb{R}^{d})$: it satisfies symmetry, non-negativity, the triangle inequality, and $\mu=\nu$ implies $\text{PW}_{p}(\mu,\nu;\hat{\theta})=0$; however, $\text{PW}_{p}(\mu,\nu;\hat{\theta})=0$ does not imply $\mu=\nu$.

The proof for Proposition 1 is given in Appendix B.1. This result implies that if the reconstruction loss $\mathbb{E}_{X\sim p(X)}[\text{PW}_{p}(P_{X},P_{g_{\gamma}(f_{\phi}(X))};\hat{\theta}(X))]=0$, it does not follow that $X=g_{\gamma}(f_{\phi}(X))$ for $p$-almost every $X\in\mathcal{X}$. Therefore, a local maximum for $\max_{\theta\in\mathbb{S}^{d-1}}$ in the Max-SW reconstruction (Equation 8) and the global maximum for $\max_{\psi\in\Psi}$ in the amortized Max-SW reconstruction (Equation 9 with a misspecified amortized model) cannot guarantee perfect reconstruction even when their objectives attain the value 0.

Amortized Distributional Projection Optimization. To overcome the issue, we propose to replace Max-SW in Equation 8 with the von Mises-Fisher distributional sliced Wasserstein (v-DSW) distance (Nguyen et al., 2021b):

\min_{\phi\in\Phi,\gamma\in\Gamma}\mathbb{E}_{X\sim p(X)}\Big[\max_{\epsilon\in\mathbb{S}^{d-1}}\Big(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\text{W}_{p}^{p}(\theta\sharp P_{X},\theta\sharp P_{g_{\gamma}(f_{\phi}(X))})\Big)^{\frac{1}{p}}\Big], \quad (10)

where $\text{vMF}(\epsilon,\kappa)$ is the von Mises-Fisher distribution with mean location parameter $\epsilon\in\mathbb{S}^{d-1}$ and concentration parameter $\kappa>0$, and $\text{v-DSW}_{p}(\mu,\nu;\kappa)=\max_{\epsilon\in\mathbb{S}^{d-1}}\big(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\text{W}_{p}^{p}(\theta\sharp\mu,\theta\sharp\nu)\big)^{\frac{1}{p}}$ is the von Mises-Fisher distributional sliced Wasserstein distance. The optimization can be solved by a stochastic projected gradient ascent algorithm with the vMF reparameterization trick. In particular, $\theta_{1},\ldots,\theta_{L}$ ($L\geq 1$) are sampled i.i.d. from $\text{vMF}(\epsilon,\kappa)$ via reparameterized acceptance-rejection sampling (Davidson et al., 2018a) to approximate $\nabla_{\epsilon}\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}[\text{W}_{p}^{p}(\theta\sharp\mu,\theta\sharp\nu)]$ via Monte Carlo integration. We refer the reader to Section A.2 for more detail about the vMF distribution, its sampling algorithm, its reparameterization trick, and the stochastic gradient estimators. We present a visualization of the difference between the new amortized distributional projection optimization framework and the conventional amortized projection optimization framework in Figure 2. The corresponding amortized objective is:

\min_{\phi\in\Phi,\gamma\in\Gamma}\max_{\psi\in\Psi}\mathbb{E}_{X\sim p(X)}\Big(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon_{\psi,\gamma,\phi},\kappa)}\text{W}_{p}^{p}(\theta\sharp P_{X},\theta\sharp P_{g_{\gamma}(f_{\phi}(X))})\Big)^{\frac{1}{p}}, \quad (11)

where $\epsilon_{\psi,\gamma,\phi}=a_{\psi}(X,g_{\gamma}(f_{\phi}(X)))$. The optimization is solved by an alternating stochastic (projected) gradient descent-ascent algorithm with the vMF reparameterization.
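To make the procedure concrete, the following is a rough PyTorch sketch of the v-DSW objective inside Equation 10, optimized over the vMF location by projected gradient ascent; `sample_vmf` is a hypothetical helper standing in for the reparameterized acceptance-rejection sampler of Davidson et al. (2018a), whose implementation is omitted here.

```python
import torch

def v_dsw(X, Y, sample_vmf, kappa=10.0, L=10, T=50, lr=1e-2, p=2):
    """Sketch of v-DSW_p(P_X, P_Y; kappa): projected gradient ascent on epsilon,
    with the expectation over theta ~ vMF(epsilon, kappa) estimated by L samples.

    sample_vmf(epsilon, kappa, L): assumed to return (L, d) reparameterized samples.
    """
    d = X.shape[1]
    epsilon = torch.randn(d)
    epsilon = (epsilon / epsilon.norm()).requires_grad_(True)
    optimizer = torch.optim.SGD([epsilon], lr=lr)
    for _ in range(T):
        optimizer.zero_grad()
        theta = sample_vmf(epsilon, kappa, L)                 # (L, d) slicing directions
        x_proj, _ = torch.sort(X @ theta.T, dim=0)            # (m, L) projections
        y_proj, _ = torch.sort(Y @ theta.T, dim=0)
        w_pp = (x_proj - y_proj).abs().pow(p).mean(dim=0)     # W_p^p per direction
        objective = w_pp.mean().pow(1.0 / p)
        (-objective).backward()                               # ascent on epsilon
        optimizer.step()
        with torch.no_grad():
            epsilon.data = epsilon.data / epsilon.data.norm() # project onto S^{d-1}
    return objective.detach()
```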

Theorem 1.

For any $\epsilon\in\mathbb{S}^{d-1}$ and $0\leq\kappa<\infty$, if $\mathbb{E}_{X\sim p(X)}\left(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\text{W}_{p}^{p}(\theta\sharp P_{X},\theta\sharp P_{g_{\gamma}(f_{\phi}(X))})\right)^{\frac{1}{p}}=0$, then $X=g_{\gamma}(f_{\phi}(X))$ for $p$-almost every $X\in\mathcal{X}$.

The proof of Theorem 1 is given in Appendix B.2. The proof is based on establishing the metricity of the non-optimal von Mises-Fisher distributional sliced Wasserstein distance (v-DSW) using the smoothness of the vMF distribution. It is worth noting that this proof of metricity of the von Mises-Fisher distributional sliced Wasserstein distance is new, since the original work (Nguyen et al., 2021b) only shows the pseudo-metricity with the global optimality condition. Theorem 1 indicates that a perfect reconstruction can be obtained with a local optimum for $\max_{\epsilon\in\mathbb{S}^{d-1}}$ in the v-DSW reconstruction (Equation 10) and a local optimum for $\max_{\psi\in\Psi}$ in the amortized v-DSW reconstruction (Equation 11).

Comparison with SW and Max-SW: When $\kappa\to 0$, the vMF distribution converges weakly to the uniform distribution over the unit hypersphere. Hence, we recover the conventional sliced Wasserstein reconstruction in both Equation 10 and Equation 11. When $\kappa\to\infty$, the vMF distribution converges weakly to the Dirac delta at the location parameter. Therefore, we obtain Max-SW reconstruction and amortized Max-SW reconstruction in Equation 10 and Equation 11, respectively. However, when $0<\kappa<\infty$, v-DSW reconstruction and amortized v-DSW reconstruction can find a region of discriminative projecting directions while preserving the metricity needed for perfect reconstruction.

3.2 Self-Attention Amortized Models

Figure 3: Visualization of an amortized model that is not symmetric and permutation invariant in two dimensions.

We now discuss the parameterization of the amortized model for amortized optimization.

Permutation Invariance and Symmetry. Let $X$ and $Y$ be two point-clouds; the optimal slicing distribution $\text{vMF}(\epsilon^{\star},\kappa)$ of v-DSW between $P_{X}$ and $P_{Y}$ can be obtained by running Algorithm 4 in Appendix A.3. Clearly, $\text{vMF}(\epsilon^{\star},\kappa)$ is invariant to the permutation of the supports since $P_{\sigma(X)}=P_{X}$ and $P_{\sigma(Y)}=P_{Y}$ for a permutation function $\sigma$. Moreover, the optimal slicing distribution $\text{vMF}(\epsilon^{\star},\kappa)$ is also unchanged when we exchange $P_{X}$ and $P_{Y}$ since v-DSW is symmetric. However, the current amortized models (see Definition 2 and Definitions 4-5 in Appendix A.1) are neither permutation invariant nor symmetric, namely, $a_{\psi}(X,Y)\neq a_{\psi}(X,\sigma(Y))$ and $a_{\psi}(X,Y)\neq a_{\psi}(Y,X)$. Therefore, the current amortized models could be strongly misspecified. We show a visualization of an amortized model that is not symmetric and permutation invariant in Figure 3. To address the issue, we propose amortized models that are symmetric and permutation invariant based on the self-attention mechanism.

Self-Attention Mechanism. Attention is well known for its effectiveness in learning long-range dependencies when data are sequences such as text (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020) or speech (Li et al., 2019; Wang et al., 2020b). This mechanism was then successfully generalized to other data types, including images (Carion et al., 2020; Dosovitskiy et al., 2020), video (Sun et al., 2019), graphs (Dwivedi & Bresson, 2021), and point-clouds (Zhao et al., 2021; Guo et al., 2021), to name a few. We now revisit the attention mechanism (Vaswani et al., 2017). Given $Q,K\in\mathbb{R}^{m\times d_{k}}$ and $V\in\mathbb{R}^{m\times d_{v}}$, the scaled dot-product attention operator is defined as:

\mathrm{Att}(Q,K,V)=\text{softmax}_{\rm{row}}\left[\frac{QK^{T}}{\sqrt{d_{k}}}\right]V, \quad (12)

where $\text{softmax}_{\rm{row}}$ denotes the row-wise softmax function. In the self-attention mechanism, the query matrix $Q$, the key matrix $K$, and the value matrix $V$ are usually computed by projecting the input sequence $X$ into different subspaces. Thus, the self-attention mechanism is given as follows. Given $X\in\mathbb{R}^{m\times d}$, the self-attention operator is:

\mathcal{A}_{\zeta}(X)=\mathrm{Att}(XW_{q},XW_{k},XW_{v}), \quad (13)

where $W_{q},W_{k}\in\mathbb{R}^{d\times d_{k}}$, $W_{v}\in\mathbb{R}^{d\times d_{v}}$, and $\zeta=(W_{q},W_{k},W_{v})$. The self-attention operator is infamous for its quadratic memory and computational costs. In particular, given an input sequence of length $m$, both the time and space complexity are $\mathcal{O}(m^{2})$. Since we focus on the sliced Wasserstein setting, where the computational complexity should be at most $\mathcal{O}(m\log m)$, the conventional self-attention is not appropriate. Several works (Li et al., 2020; Katharopoulos et al., 2020; Wang et al., 2020a; Shen et al., 2021) have been proposed to reduce the overall complexity from $\mathcal{O}(m^{2})$ to $\mathcal{O}(m)$. In this paper, we utilize two linear-complexity variants of attention: efficient attention (Shen et al., 2021) and linear attention (Wang et al., 2020a). Given $X\in\mathbb{R}^{m\times d}$, the efficient self-attention is defined as:

\mathcal{E}\mathcal{A}_{\zeta}(X)=\text{softmax}_{\rm{row}}(XW_{q})\left[\text{softmax}_{\rm{col}}(XW_{k})^{T}(XW_{v})\right], \quad (14)

where $W_{q},W_{k}\in\mathbb{R}^{d\times d_{k}}$, $W_{v}\in\mathbb{R}^{d\times d_{v}}$, $\zeta=(W_{q},W_{k},W_{v})$, and $\text{softmax}_{\rm{col}}$ denotes applying the softmax function column-wise. The linear self-attention is:

\mathcal{L}\mathcal{A}_{\zeta}(X)=\mathrm{Att}(XW_{q},W_{k1}XW_{k2},W_{v1}XW_{v2}), \quad (15)

where $W_{q},W_{k2}\in\mathbb{R}^{d\times d_{k}}$, $W_{v2}\in\mathbb{R}^{d\times d_{v}}$, $W_{k1},W_{v1}\in\mathbb{R}^{k\times m}$, and $\zeta=(W_{q},W_{k1},W_{k2},W_{v1},W_{v2})$. The projected dimension $k$ is chosen such that $m\gg k$ to reduce the memory and space consumption significantly.
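As an illustration, a PyTorch sketch of the efficient self-attention operator of Equation 14 is given below; the layer sizes are arbitrary and the module is a simplification (single head, no residual connections).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientSelfAttention(nn.Module):
    """Efficient self-attention (Equation 14): the (d_k x d_v) context matrix
    softmax_col(XW_k)^T (XW_v) is formed first, so the cost is linear in m."""

    def __init__(self, d, d_k, d_v):
        super().__init__()
        self.W_q = nn.Linear(d, d_k, bias=False)
        self.W_k = nn.Linear(d, d_k, bias=False)
        self.W_v = nn.Linear(d, d_v, bias=False)

    def forward(self, X):
        # X: (B, m, d)
        Q = F.softmax(self.W_q(X), dim=-1)      # row-wise softmax over the d_k features
        K = F.softmax(self.W_k(X), dim=1)       # column-wise softmax over the m points
        V = self.W_v(X)
        context = K.transpose(1, 2) @ V         # (B, d_k, d_v), never forms an m x m map
        return Q @ context                      # (B, m, d_v)
```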

Self-Attention Amortized Models: Based on the self-attention mechanism, we introduce the self-attention amortized model which is permutation invariant and symmetric. Formally, the self-attention amortized model is defined as:

Definition 3.

Given $X,Y\in\mathbb{R}^{dm}$, the self-attention amortized model is defined as:

a_{\psi}(X,Y)=\frac{\mathcal{A}_{\zeta}(X^{\prime\top})^{\top}\bm{1}_{m}+\mathcal{A}_{\zeta}(Y^{\prime\top})^{\top}\bm{1}_{m}}{\|\mathcal{A}_{\zeta}(X^{\prime\top})^{\top}\bm{1}_{m}+\mathcal{A}_{\zeta}(Y^{\prime\top})^{\top}\bm{1}_{m}\|_{2}}, \quad (16)

where $X^{\prime}$ and $Y^{\prime}$ are matrices of size $d\times m$ that are reshaped from the concatenated vectors $X$ and $Y$ of size $dm$, $\bm{1}_{m}$ is the $m$-dimensional vector whose entries are all $1$, and $\psi=(\zeta)$.

By replacing the conventional self-attention with the linear self-attention and the efficient self-attention, we obtain the linear self-attention amortized model and the efficient self-attention amortized model.
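A sketch of the self-attention amortized model of Definition 3 is shown below; `attention` can be any self-attention operator with output dimension $d$ (e.g., the efficient self-attention sketch above). Pooling over the points makes the output permutation invariant, while the sum over the two clouds makes it symmetric.

```python
import torch
import torch.nn as nn

class SelfAttentionAmortizedModel(nn.Module):
    """Self-attention amortized model of Definition 3 (Equation 16): attention
    features of each point-cloud are pooled over points, summed across the two
    clouds, and normalized onto the unit sphere."""

    def __init__(self, attention):
        super().__init__()
        self.attention = attention   # maps (B, m, d) -> (B, m, d)

    def forward(self, X, Y):
        # X, Y: (B, m, d) point-clouds.
        feat = self.attention(X).sum(dim=1) + self.attention(Y).sum(dim=1)  # (B, d)
        return feat / feat.norm(dim=-1, keepdim=True)                       # on S^{d-1}
```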

Proposition 2.

Self-attention amortized models are symmetric and permutation invariant.

The proof of Proposition 2 is given in Appendix B.3. The symmetry follows directly from the definition of the self-attention amortized models. The permutation invariance is proved by showing that the self-attention operators combined with average pooling are permutation invariant.

Comparison with Set Transformer. The authors in (Lee et al., 2019b) also proposed a method to guarantee permutation invariance on sets. There are two main differences between our work and theirs. Firstly, Set Transformer introduced a new attention mechanism and a new Transformer architecture, while we only present an approach to apply any attention mechanism while preserving the permutation invariance property of amortized models. Secondly, Set Transformer maintains the permutation invariance property by using a learnable multi-head attention as the aggregation scheme. We instead still rely on average pooling, a conventional permutation invariant aggregation scheme, to accumulate the features learned by self-attention operations. Nevertheless, our work is orthogonal to Set Transformer; in other words, it is possible to apply techniques from Set Transformer to our attention-based amortized models. We leave this investigation for future work.

Table 1: Reconstruction and transfer learning performance on the ModelNet40 dataset. CD and SW are multiplied by 100.
Method | CD ($10^{-2}$, $\downarrow$) | SW ($10^{-2}$, $\downarrow$) | EMD ($\downarrow$) | Acc ($\uparrow$) | Time ($\downarrow$)
CD | 1.25 ± 0.03 | 681.20 ± 16.73 | 653.52 ± 10.43 | 86.28 ± 0.34 | 95
EMD | 0.40 ± 0.00 | 94.54 ± 2.90 | 168.60 ± 1.57 | 88.45 ± 0.20 | 208
SW | 0.68 ± 0.01 | 89.61 ± 3.88 | 191.12 ± 2.88 | 87.90 ± 0.27 | 106
Max-SW | 0.68 ± 0.01 | 88.22 ± 1.45 | 190.23 ± 0.1 | 87.97 ± 0.14 | 116
ASW | 0.69 ± 0.01 | 89.42 ± 5.07 | 192.03 ± 3.09 | 87.78 ± 0.20 | 103
v-DSW | 0.67 ± 0.00 | 85.03 ± 3.31 | 187.75 ± 2.00 | 87.83 ± 0.40 | 633
$\mathcal{L}$-Max-SW | 1.06 ± 0.03 | 121.85 ± 5.77 | 236.87 ± 3.42 | 87.70 ± 0.23 | 94
$\mathcal{G}$-Max-SW | 12.11 ± 0.29 | 851.07 ± 2.11 | 829.28 ± 5.53 | 87.49 ± 0.36 | 97
$\mathcal{N}$-Max-SW | 7.38 ± 3.29 | 618.74 ± 153.87 | 648.32 ± 117.03 | 87.43 ± 0.15 | 96
$\mathcal{L}$v-DSW | 0.68 ± 0.00 | 85.32 ± 0.54 | 188.32 ± 0.23 | 87.70 ± 0.34 | 114
$\mathcal{G}$v-DSW | 0.68 ± 0.01 | 82.77 ± 0.48 | 187.04 ± 1.11 | 87.75 ± 0.19 | 117
$\mathcal{N}$v-DSW | 0.67 ± 0.00 | 83.47 ± 0.49 | 186.66 ± 0.81 | 87.84 ± 0.07 | 115
$\mathcal{A}$v-DSW | 0.67 ± 0.01 | 83.08 ± 1.22 | 186.27 ± 0.56 | 88.05 ± 0.17 | 230
$\mathcal{EA}$v-DSW | 0.68 ± 0.01 | 82.05 ± 0.40 | 186.46 ± 0.25 | 88.07 ± 0.21 | 125
$\mathcal{LA}$v-DSW | 0.68 ± 0.00 | 81.03 ± 0.18 | 185.26 ± 0.31 | 88.28 ± 0.13 | 123
Figure 4: Qualitative results of reconstructing point-clouds in the ShapeNet Core-55 dataset. From top to bottom, the point-clouds are the input, SW, Max-SW (T = 50), v-DSW (T = 50), and $\mathcal{LA}$v-DSW, respectively.

4 Experiments

To verify the effectiveness of our proposal, we evaluate our methods on the point-cloud reconstruction task and its two downstream tasks: transfer learning and point-cloud generation. Three important questions we want to answer are:

1. Does the sub-optimality issue of amortized Max-SW occur when working with point-clouds, and does replacing Max-SW with v-DSW alleviate the problem?

2. Does the proposed amortized distributional projection optimization framework improve the performance over the conventional amortized projection optimization framework and commonly used distances, e.g., Chamfer distance, Earth Mover's Distance (Wasserstein distance), SW, Max-SW, adaptive SW (ASW) (Nguyen et al., 2021c), and v-DSW?

3. Are self-attention amortized models better than the previous misspecified amortized models in (Nguyen & Ho, 2022a)?

Table 2: Comparison between amortized models when approximating von Mises-Fisher distributional sliced Wasserstein (v-DSW). T denotes the number of projected sub-gradient ascent iterations.
Method | T | Distance ($\uparrow$) | Time ($\downarrow$)
$\mathcal{L}$v-DSW | 1 | 52.73 | 0.06
$\mathcal{G}$v-DSW | 1 | 50.73 | 0.07
$\mathcal{N}$v-DSW | 1 | 51.89 | 0.07
$\mathcal{A}$v-DSW | 1 | 53.07 | 1.00
$\mathcal{EA}$v-DSW | 1 | 53.17 | 0.17
$\mathcal{LA}$v-DSW | 1 | 53.83 | 0.14
v-DSW | 1 | 51.87 | 0.1
v-DSW | 5 | 51.90 | 0.33
v-DSW | 10 | 52.65 | 0.5
v-DSW | 50 | 53.16 | 2.00
v-DSW | 100 | 54.39 | 4.00
Table 3: Performance comparison of point-cloud generation on the chair category of ShapeNet. The values of JSD, MMD-CD, and MMD-EMD are multiplied by 100.
Method | JSD ($\times 10^{-2}$, $\downarrow$) | MMD-CD ($\times 10^{-2}$, $\downarrow$) | MMD-EMD ($\times 10^{-2}$, $\downarrow$) | COV-CD (%, $\uparrow$) | COV-EMD (%, $\uparrow$) | 1-NNA-CD (%, $\downarrow$) | 1-NNA-EMD (%, $\downarrow$)
CD | 17.88 ± 1.14 | 1.12 ± 0.02 | 17.19 ± 0.36 | 23.73 ± 1.69 | 10.83 ± 0.89 | 98.45 ± 0.10 | 100.00 ± 0.00
EMD | 5.15 ± 1.52 | 0.61 ± 0.09 | 10.37 ± 0.61 | 41.65 ± 2.19 | 42.54 ± 2.42 | 87.76 ± 1.46 | 87.30 ± 1.22
SW | 1.56 ± 0.06 | 0.72 ± 0.02 | 10.80 ± 0.11 | 38.55 ± 0.43 | 45.35 ± 0.48 | 89.91 ± 1.17 | 88.28 ± 0.70
Max-SW | 1.63 ± 0.32 | 0.74 ± 0.01 | 10.84 ± 0.08 | 40.47 ± 1.04 | 47.81 ± 0.78 | 91.46 ± 0.72 | 89.93 ± 0.86
ASW | 1.75 ± 0.38 | 0.78 ± 0.05 | 11.27 ± 0.38 | 38.16 ± 2.15 | 45.45 ± 1.40 | 91.21 ± 0.40 | 89.36 ± 0.40
v-DSW | 1.79 ± 0.17 | 0.72 ± 0.02 | 10.73 ± 0.20 | 37.76 ± 0.71 | 45.49 ± 1.37 | 90.23 ± 0.13 | 88.33 ± 0.95
$\mathcal{L}$v-DSW | 1.67 ± 0.07 | 0.77 ± 0.04 | 11.10 ± 0.33 | 37.91 ± 1.84 | 45.64 ± 2.30 | 90.42 ± 0.53 | 88.82 ± 0.38
$\mathcal{G}$v-DSW | 1.56 ± 0.22 | 0.75 ± 0.02 | 10.99 ± 0.11 | 37.81 ± 1.70 | 45.69 ± 0.46 | 90.32 ± 0.38 | 88.26 ± 0.28
$\mathcal{N}$v-DSW | 1.44 ± 0.06 | 0.75 ± 0.02 | 10.95 ± 0.09 | 38.40 ± 1.34 | 46.28 ± 2.06 | 90.15 ± 0.80 | 88.65 ± 0.82
$\mathcal{EA}$v-DSW | 1.73 ± 0.21 | 0.71 ± 0.04 | 10.70 ± 0.26 | 40.03 ± 1.28 | 48.01 ± 1.07 | 89.98 ± 0.57 | 88.55 ± 0.38
$\mathcal{LA}$v-DSW | 1.54 ± 0.09 | 0.72 ± 0.03 | 10.74 ± 0.35 | 40.62 ± 1.39 | 45.84 ± 1.23 | 89.44 ± 0.28 | 87.79 ± 0.37

Experiment settings: Our settings, which can be found in Appendix C.1, are identical to the settings in the ASW paper (code for this paper is published at https://github.com/hsgser/Self-Amortized-DSW). We compare our methods, amortized v-DSW, with the following loss functions: Chamfer discrepancy (CD), Earth Mover's distance (EMD), SW, Max-SW, adaptive SW (ASW), v-DSW, and amortized Max-SW variants. For amortized models, we consider 6 different ones. The prefixes $\mathcal{L}$, $\mathcal{G}$, and $\mathcal{N}$ denote the linear, generalized linear, and non-linear amortized models in (Nguyen & Ho, 2022a), respectively. $\mathcal{A}$, $\mathcal{EA}$, and $\mathcal{LA}$ represent self-attention, efficient self-attention, and linear self-attention, respectively. Implementation details for baseline distances and amortized models are given in Appendices C.2 and C.3, respectively. Each experiment was run over three different random seeds. We report the average performance along with the standard deviation for each entry. All experiments are run on NVIDIA V100 GPUs.

Comparison with CD and EMD: The main focus of the paper is to compare the new amortized framework with the conventional amortized framework and sliced Wasserstein variants. The results with CD and EMD are provided only for completeness. In addition, we found that there is an unfair comparison between EMD and sliced Wasserstein variants in the ASW paper. In particular, the EMD loss is normalized by the number of points in a point-cloud while the SW variants are not. To fix this issue, we modified the implementation of the EMD loss by scaling it by the number of points (2048 in this case). As a "perfect" objective, EMD performs better than all SW variants. However, EMD suffers from huge computational costs compared to SW variants.

Point-cloud reconstruction: Following ASW (Nguyen et al., 2021c), we measure the reconstruction performance of different autoencoders on the ModelNet40 dataset (Wu et al., 2015) using three discrepancies: Chamfer discrepancy (CD), sliced Wasserstein distance (SW), and EMD. The quantitative results are summarized in Table 1. For each method, we only report the best-performing (based on EMD) model among all choices of hyper-parameters. Full quantitative results (including standard deviations) can be found in Table 4. Our methods achieve the best performance in all three discrepancies. In contrast, autoencoders with amortized Max-SW losses fail in this scenario due to the sub-optimality and loss-of-metricity issues that we discussed in Section 2.3. In addition, amortized v-DSW losses have smaller standard deviations over 3 runs than v-DSW. Moreover, using amortized optimization reduces the training time compared to the conventional computation using the projected sub-gradient ascent algorithm (e.g., Max-SW and v-DSW). For example, training one iteration of the autoencoder using $\mathcal{LA}$v-DSW only takes 123 seconds while using v-DSW costs 633 seconds. In terms of amortized models, attention-based amortized models lead to lower EMD between the reconstruction and the input. Qualitative results are given in Figure 4, showing the success of our methods in reconstructing 3D point-clouds. Full qualitative results are reported in Figure 6.

Amortization Gaps: To validate the advantage of self-attention amortized models over the previous misspecified amortized models, we compare their effectiveness in approximating v-DSW. We create a dataset by sampling 1000 pairs of point-clouds from the ShapeNet Core-55 dataset. Due to the memory constraint when solving amortized optimization, the dataset is divided into 10 batches of size 100. We compute v-DSW and its amortized versions between all pairs of point-clouds and report their average loss values in Table 2. Compared to previous misspecified amortized models, attention-based amortized models produce higher losses which are closer to the conventional computation of v-DSW (T = 100). To achieve the same level as efficient/linear self-attention amortized models, one needs to run more than 50 sub-gradient iterations, which is more than 10 times slower.

Transfer learning: We further feed the latent vectors learned by the above autoencoders into a classifier. Following the settings in ASW’s paper, we train our classifier for 500 epochs with a batch size of 256. The optimizer is the same as that in the reconstruction experiment. Table 1 illustrates the classification result. Again, we see a boost in accuracy when using self-attention amortized v-DSW.

Point-cloud generation: We also evaluate our methods on the 3D point-cloud generation task. Following (Achlioptas et al., 2018), the chair category of ShapeNet is divided into train/valid/test sets in an 85/5/10 ratio. We train each autoencoder on the train set for 100 epochs and evaluate on the valid set. The generator is then trained to generate latent codes learned by the autoencoder, following (Achlioptas et al., 2018). For evaluation, the same set of metrics as in (Yang et al., 2019a) is used. The quantitative results on the test set are given in Table 3. Our methods yield the best performance in all metrics. In addition, attention-based amortized models lead to higher performance than previous amortized models in all metrics except for JSD. Full quantitative results are reported in Table 9.

5 Conclusion

We have proposed a self-attention amortized distributional projection optimization framework which uses a self-attention amortized model to predict the best discriminative distribution over projecting directions for each pair of probability measures. The efficient self-attention mechanism helps to inject the geometric inductive biases of permutation invariance and symmetry into the amortized model while retaining fast computation. Furthermore, the amortized distributional projection optimization framework guarantees metricity for all pairs of probability measures even though the amortization gap still exists. On the experimental side, we compare the proposed framework to the conventional amortized projection optimization framework and other widely used distances on the point-cloud reconstruction application and its two downstream tasks, transfer learning and point-cloud generation, to show the superior performance of the proposed framework.

Acknowledgements

Nhat Ho acknowledges support from the NSF IFML 2019844 and the NSF AI Institute for Foundations of Machine Learning.

References

  • Achlioptas et al. (2018) Achlioptas, P., Diamanti, O., Mitliagkas, I., and Guibas, L. Learning representations and generative models for 3d point clouds. In International conference on machine learning, pp.  40–49. PMLR, 2018.
  • Alvarez-Melis & Fusi (2020) Alvarez-Melis, D. and Fusi, N. Geometric dataset distances via optimal transport. Advances in Neural Information Processing Systems, 33:21428–21439, 2020.
  • Amos (2022) Amos, B. Tutorial on amortized optimization for learning to optimize over continuous domains. arXiv preprint arXiv:2202.00665, 2022.
  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.
  • Barrow et al. (1977) Barrow, H. G., Tenenbaum, J. M., Bolles, R. C., and Wolf, H. C. Parametric correspondence and chamfer matching: Two new techniques for image matching. Technical report, SRI INTERNATIONAL MENLO PARK CA ARTIFICIAL INTELLIGENCE CENTER, 1977.
  • Bhushan Damodaran et al. (2018) Bhushan Damodaran, B., Kellenberger, B., Flamary, R., Tuia, D., and Courty, N. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  447–463, 2018.
  • Bonneel et al. (2015) Bonneel, N., Rabin, J., Peyré, G., and Pfister, H. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 1(51):22–45, 2015.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Carion et al. (2020) Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-end object detection with transformers. In European conference on computer vision, pp.  213–229. Springer, 2020.
  • Chang et al. (2015) Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., and Yu, F. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago, 2015.
  • Christensen (2002) Christensen, R. Plane answers to complex questions, volume 35. Springer, 2002.
  • Courty et al. (2016) Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853–1865, 2016.
  • Cuturi (2013) Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.
  • Damodaran et al. (2018) Damodaran, B. B., Kellenberger, B., Flamary, R., Tuia, D., and Courty, N. Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp.  447–463, 2018.
  • Davidson et al. (2018a) Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, pp.  856–865. Association For Uncertainty in Artificial Intelligence (AUAI), 2018a.
  • Davidson et al. (2018b) Davidson, T. R., Falorsi, L., De Cao, N., and Tomczak, T. K. J. M. Hyperspherical variational auto-encoders. In Conference on Uncertainty in Artificial Intelligence (UAI), 2018b.
  • Deshpande et al. (2018) Deshpande, I., Zhang, Z., and Schwing, A. G. Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  3483–3491, 2018.
  • Deshpande et al. (2019) Deshpande, I., Hu, Y.-T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D., and Schwing, A. G. Max-sliced Wasserstein distance and its use for GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  10648–10656, 2019.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  • Dosovitskiy et al. (2020) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Dwivedi & Bresson (2021) Dwivedi, V. P. and Bresson, X. A generalization of transformer networks to graphs. AAAI Workshop on Deep Learning on Graphs: Methods and Applications, 2021.
  • Genevay et al. (2018) Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pp.  1608–1617. PMLR, 2018.
  • Guo et al. (2021) Guo, M.-H., Cai, J.-X., Liu, Z.-N., Mu, T.-J., Martin, R. R., and Hu, S.-M. Pct: Point cloud transformer. Computational Visual Media, 7(2):187–199, Apr 2021. ISSN 2096-0662. doi: 10.1007/s41095-021-0229-5. URL http://dx.doi.org/10.1007/s41095-021-0229-5.
  • Ho et al. (2017) Ho, N., Nguyen, X., Yurochkin, M., Bui, H. H., Huynh, V., and Phung, D. Multilevel clustering via Wasserstein means. In International Conference on Machine Learning, pp. 1501–1509, 2017.
  • Huynh et al. (2020) Huynh, V., Zhao, H., and Phung, D. Otlda: A geometry-aware optimal transport approach for topic modeling. Advances in Neural Information Processing Systems, 33:18573–18582, 2020.
  • Jupp & Mardia (1979) Jupp, P. E. and Mardia, K. V. Maximum likelihood estimators for the matrix von Mises-Fisher and bingham distributions. The Annals of Statistics, 7(3):599–606, 1979.
  • Katharopoulos et al. (2020) Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pp. 5156–5165. PMLR, 2020.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kolouri et al. (2018) Kolouri, S., Pope, P. E., Martin, C. E., and Rohde, G. K. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.
  • Lee et al. (2019a) Lee, C.-Y., Batra, T., Baig, M. H., and Ulbricht, D. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10285–10295, 2019a.
  • Lee et al. (2019b) Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set transformer: A framework for attention-based permutation-invariant neural networks. In International conference on machine learning, pp. 3744–3753. PMLR, 2019b.
  • Lee et al. (2022) Lee, Y., Kim, S., Choi, J., and Park, F. A statistical manifold framework for point cloud data. In International Conference on Machine Learning, pp. 12378–12402. PMLR, 2022.
  • Li et al. (2020) Li, R., Su, J., Duan, C., and Zheng, S. Linear attention mechanism: An efficient attention for semantic segmentation. arXiv preprint arXiv:2007.14902, 2020.
  • Li et al. (2019) Li, S., Raj, D., Lu, X., Shen, P., Kawahara, T., and Kawai, H. Improving Transformer-Based Speech Recognition Systems with Compressed Structure and Speech Attributes Augmentation. In Proc. Interspeech 2019, pp.  4400–4404, 2019. doi: 10.21437/Interspeech.2019-2112. URL http://dx.doi.org/10.21437/Interspeech.2019-2112.
  • Liu et al. (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • Naderializadeh et al. (2021) Naderializadeh, N., Comer, J., Andrews, R., Hoffmann, H., and Kolouri, S. Pooling by sliced-Wasserstein embedding. Advances in Neural Information Processing Systems, 34, 2021.
  • Nadjahi et al. (2020) Nadjahi, K., De Bortoli, V., Durmus, A., Badeau, R., and Şimşekli, U. Approximate Bayesian computation with the sliced-Wasserstein distance. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  5470–5474. IEEE, 2020.
  • Nguyen & Ho (2022a) Nguyen, K. and Ho, N. Amortized projection optimization for sliced Wasserstein generative models. Advances in Neural Information Processing Systems, 2022a.
  • Nguyen & Ho (2022b) Nguyen, K. and Ho, N. Revisiting sliced Wasserstein on images: From vectorization to convolution. Advances in Neural Information Processing Systems, 2022b.
  • Nguyen et al. (2021a) Nguyen, K., Ho, N., Pham, T., and Bui, H. Distributional sliced-Wasserstein and applications to generative modeling. In International Conference on Learning Representations, 2021a.
  • Nguyen et al. (2021b) Nguyen, K., Nguyen, S., Ho, N., Pham, T., and Bui, H. Improving relational regularized autoencoders with spherical sliced fused Gromov Wasserstein. In International Conference on Learning Representations, 2021b.
  • Nguyen et al. (2023) Nguyen, K., Ren, T., Nguyen, H., Rout, L., Nguyen, T. M., and Ho, N. Hierarchical sliced Wasserstein distance. In The Eleventh International Conference on Learning Representations, 2023.
  • Nguyen et al. (2021c) Nguyen, T., Pham, Q.-H., Le, T., Pham, T., Ho, N., and Hua, B.-S. Point-set distances for learning representations of 3d point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021c.
  • Nietert et al. (2022) Nietert, S., Sadhu, R., Goldfeld, Z., and Kato, K. Statistical, robustness, and computational guarantees for sliced Wasserstein distances. Advances in Neural Information Processing Systems, 2022.
  • Peyré & Cuturi (2019) Peyré, G. and Cuturi, M. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • Pham et al. (2020) Pham, Q.-H., Uy, M. A., Hua, B.-S., Nguyen, D. T., Roig, G., and Yeung, S.-K. LCD: Learned cross-domain descriptors for 2D-3D matching. In AAAI Conference on Artificial Intelligence, 2020.
  • Qi et al. (2017) Qi, C. R., Su, H., Mo, K., and Guibas, L. J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  652–660, 2017.
  • Shen et al. (2021) Shen, Z., Zhang, M., Zhao, H., Yi, S., and Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp.  3531–3539, 2021.
  • Shu (2017) Shu, R. Amortized optimization http://ruishu.io/2017/11/07/amortized-optimization/. Personal Blog, 2017.
  • Sra (2016) Sra, S. Directional statistics in machine learning: a brief review. arXiv preprint arXiv:1605.00316, 2016.
  • Sun et al. (2019) Sun, C., Myers, A., Vondrick, C., Murphy, K., and Schmid, C. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  7464–7473, 2019.
  • Temme (2011) Temme, N. M. Special functions: An introduction to the classical functions of mathematical physics. John Wiley & Sons, 2011.
  • Tolstikhin et al. (2018) Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • Vincent-Cuaz et al. (2022) Vincent-Cuaz, C., Flamary, R., Corneli, M., Vayer, T., and Courty, N. Template based graph neural network with optimal transport distances. Advances in Neural Information Processing Systems, 2022.
  • Wang et al. (2020a) Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020a.
  • Wang et al. (2020b) Wang, Y., Mohamed, A., Le, D., Liu, C., Xiao, A., Mahadeokar, J., Huang, H., Tjandra, A., Zhang, X., Zhang, F., et al. Transformer-based acoustic modeling for hybrid speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  6874–6878. IEEE, 2020b.
  • Wu et al. (2015) Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., and Xiao, J. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1912–1920, 2015.
  • Yang et al. (2019a) Yang, G., Huang, X., Hao, Z., Liu, M.-Y., Belongie, S., and Hariharan, B. PointFlow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4541–4550, 2019a.
  • Yang et al. (2019b) Yang, J., Zhang, Q., Ni, B., Li, L., Liu, J., Zhou, M., and Tian, Q. Modeling point clouds with self-attention and Gumbel subset sampling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  3323–3332, 2019b.
  • Yi & Liu (2021) Yi, M. and Liu, S. Sliced Wasserstein variational inference. In Fourth Symposium on Advances in Approximate Bayesian Inference, 2021.
  • Zhao et al. (2021) Zhao, H., Jiang, L., Jia, J., Torr, P. H., and Koltun, V. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  16259–16268, 2021.

Supplement to “Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction”

In this supplementary material, we first provide additional materials in Appendix A, including the definitions of the generalized linear and non-linear amortized models in Appendix A.1, the details of computing the von Mises-Fisher distributional sliced Wasserstein distance in Appendix A.2, and the training algorithms for autoencoders in Appendix A.3. Next, we collect the proofs deferred from the main text in Appendix B. After that, we describe the settings of our experiments in Appendix C. Finally, we present additional experimental results in Appendix D.

Appendix A Additional Materials

A.1 Amortized models

We now review the generalized linear amortized model and the non-linear amortized model (Nguyen & Ho, 2022a).

Definition 4.

Given X,YdmX,Y\in{\mathbb{R}}^{dm}, the generalized linear amortized model is defined as:

fψ(X,Y):=gψ1(X)w1+gψ1(Y)w2gψ1(X)w1+gψ1(Y)w22,\displaystyle f_{\psi}(X,Y):=\frac{g_{\psi_{1}}(X)^{\prime}w_{1}+g_{\psi_{1}}(Y)^{\prime}w_{2}}{||g_{\psi_{1}}(X)^{\prime}w_{1}+g_{\psi_{1}}(Y)^{\prime}w_{2}||_{2}}, (17)

where XX^{\prime} and YY^{\prime} are matrices of size d×md\times m that are reshaped from the concatenated vectors XX and YY of size dmdm, w1,w2mw_{1},w_{2}\in{\mathbb{R}}^{m}, w0dw_{0}\in{\mathbb{R}}^{d}, ψ1Ψ1\psi_{1}\in\Psi_{1}, gψ1:dmdmg_{\psi_{1}}:{\mathbb{R}}^{dm}\to{\mathbb{R}}^{dm}, ψ=(w0,w1,w2,ψ1)\psi=(w_{0},w_{1},w_{2},\psi_{1}), and gψ1(X)=(x1,,xm)g_{\psi_{1}}(X)=(x^{\prime}_{1},\ldots,x^{\prime}_{m}) and gψ1(Y)=(y1,,ym)g_{\psi_{1}}(Y)=(y^{\prime}_{1},\ldots,y^{\prime}_{m}). To specify, we let gψ1(X)=(W2σ(W1x1)+b0,,W2σ(W1xm)+b0)g_{\psi_{1}}(X)=(W_{2}\sigma(W_{1}x_{1})+b_{0},\ldots,W_{2}\sigma(W_{1}x_{m})+b_{0}), where σ()\sigma(\cdot) is the Sigmoid function, W1d×dW_{1}\in\mathbb{R}^{d\times d}, W2d×dW_{2}\in\mathbb{R}^{d\times d}, and b0db_{0}\in\mathbb{R}^{d}.

Definition 5.

Given X,YdmX,Y\in{\mathbb{R}}^{dm}, the non-linear amortized model is defined as:

fψ(X,Y):=hψ2(Xw1+Yw2)hψ2(Xw1+Yw2)2,\displaystyle f_{\psi}(X,Y):=\frac{h_{\psi_{2}}(X^{\prime}w_{1}+Y^{\prime}w_{2})}{||h_{\psi_{2}}(X^{\prime}w_{1}+Y^{\prime}w_{2})||_{2}}, (18)

where XX^{\prime} and YY^{\prime} are matrices of size d×md\times m that are reshaped from the concatenated vectors XX and YY of size dmdm, w1,w2mw_{1},w_{2}\in{\mathbb{R}}^{m}, ψ2Ψ2\psi_{2}\in\Psi_{2}, hψ2:ddh_{\psi_{2}}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{d}, ψ=(w1,w2,ψ2)\psi=(w_{1},w_{2},\psi_{2}), and hψ2(x)=W4σ(W3x)+b0h_{\psi_{2}}(x)=W_{4}\sigma(W_{3}x)+b_{0}, where σ()\sigma(\cdot) is the Sigmoid function.
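For concreteness, below is a minimal PyTorch sketch of the non-linear amortized model in Definition 5, assuming point-clouds are stored as (m, d) tensors (i.e., the transposes of X' and Y'); the class name, the initialization scale, and the use of a square hidden layer are our illustrative choices rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class NonLinearAmortizedModel(nn.Module):
    """A sketch of the non-linear amortized model of Definition 5.
    X and Y are point clouds stored as (m, d) tensors; the output is a unit vector in R^d."""

    def __init__(self, d, m):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(m) / m ** 0.5)   # weights over the supports of X
        self.w2 = nn.Parameter(torch.randn(m) / m ** 0.5)   # weights over the supports of Y
        self.h = nn.Sequential(                              # h_{psi_2}(x) = W_4 sigma(W_3 x) + b_0
            nn.Linear(d, d, bias=False), nn.Sigmoid(), nn.Linear(d, d),
        )

    def forward(self, X, Y):
        z = X.t() @ self.w1 + Y.t() @ self.w2                # X' w_1 + Y' w_2, a vector in R^d
        z = self.h(z)
        return z / z.norm(p=2)                                # normalize onto the unit hypersphere
```

The final normalization keeps the output on the unit hypersphere, as required for a projecting direction (or a vMF location vector).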

A.2 Von Mises-Fisher distributional sliced Wasserstein distance

We first recall the definition of the von Mises-Fisher (vMF) distribution. The vMF distribution (Jupp & Mardia, 1979) is a probability distribution on the unit hypersphere 𝕊d1\mathbb{S}^{d-1} with the density function:

f(x|ϵ,κ):=Cd(κ)exp(κϵx),\displaystyle f(x|\epsilon,\kappa):=C_{d}(\kappa)\exp(\kappa\epsilon^{\top}x), (19)

where ϵ𝕊d1\epsilon\in\mathbb{S}^{d-1} is the location vector, κ0\kappa\geq 0 is the concentration parameter, and Cd(κ):=κd/21(2π)d/2Id/21(κ)C_{d}(\kappa):=\frac{\kappa^{d/2-1}}{(2\pi)^{d/2}I_{d/2-1}(\kappa)} is the normalization constant. Here, IvI_{v} is the modified Bessel function of the first kind at order vv (Temme, 2011).

The vMF distribution is a continuous distribution whose mass concentrates around the location ϵ\epsilon, and its density decreases as xx moves away from ϵ\epsilon. When κ0\kappa\to 0, vMF converges in distribution to 𝒰(𝕊d1)\mathcal{U}(\mathbb{S}^{d-1}), and when κ\kappa\to\infty, vMF converges in distribution to the Dirac distribution centered at ϵ\epsilon (Sra, 2016).

Reparameterized Rejection Sampling: Sampling from the vMF distribution relies on a rejection sampling procedure, which we review in Algorithm 1 (Davidson et al., 2018a; Nguyen et al., 2021b); a minimal code sketch is given after the algorithm. The algorithm reparameterizes the proposal distribution. We now derive the gradient estimator for ϵ𝔼vMF(θ|ϵ,κ)[f(θ)]\nabla_{\epsilon}\mathbb{E}_{\text{vMF}(\theta|\epsilon,\kappa)}\big{[}f(\theta)\big{]} for a general function f(θ)f(\theta), which is needed to find the maximizer ϵ\epsilon^{*} of the optimization problem maxϵ𝕊d1𝔼vMF(θ|ϵ,κ)[f(θ)]\max_{\epsilon\in\mathbb{S}^{d-1}}\mathbb{E}_{\text{vMF}(\theta|\epsilon,\kappa)}\big{[}f(\theta)\big{]}.

In dimension d>0d>0, let (ϵ,κ)(\epsilon,\kappa) be the parameters of the vMF distribution. We denote b=2κ+4κ2+(d1)2d1b=\frac{-2\kappa+\sqrt{4\kappa^{2}+(d-1)^{2}}}{d-1}, the two conditional distributions g(ωκ)=2(πd/2)Γ(d/2)𝒞d(κ)exp(ωκ)(1ω2)12(d3)Beta(12,12(d1)),r(ω|κ)=2b1/2(d1)Beta(12(d1),12(d1))(1ω2)1/2(d3)[(1+b)(1b)ω]d1,g(\omega\mid\kappa)=\frac{2\left(\pi^{d/2}\right)}{\Gamma(d/2)}\mathcal{C}_{d}(\kappa)\frac{\exp(\omega\kappa)\left(1-\omega^{2}\right)^{\frac{1}{2}(d-3)}}{\text{Beta}\left(\frac{1}{2},\frac{1}{2}(d-1)\right)},\quad r(\omega|\kappa)=\frac{2b^{1/2(d-1)}}{\text{Beta}\left(\frac{1}{2}(d-1),\frac{1}{2}(d-1)\right)}\frac{\left(1-\omega^{2}\right)^{1/2(d-3)}}{[(1+b)-(1-b)\omega]^{d-1}}, the distribution s(ψ):=Beta(12(d1),12(d1))s(\psi):=\operatorname{Beta}\left(\frac{1}{2}(d-1),\frac{1}{2}(d-1)\right), the function h(ψ,κ)=1(1+b)ψ1(1b)ψh(\psi,\kappa)=\frac{1-(1+b)\psi}{1-(1-b)\psi}, the distributions π1(ψ|κ)=s(ψ)g(h(ψ,κ)|κ)r(h(ψ,κ)|κ)\pi_{1}(\psi|\kappa)=s(\psi)\frac{g(h(\psi,\kappa)|\kappa)}{r(h(\psi,\kappa)|\kappa)}, π2(v):=𝒰(𝕊d2)\pi_{2}(v):=\mathcal{U}(\mathbb{S}^{d-2}), and the function T(ω,v,ϵ)=(I2e1ϵe1ϵ2e1ϵe1ϵ2)(ω,1ω2v):=θ.T(\omega,v,\epsilon)=\Big{(}I-2\frac{e_{1}-\epsilon}{||e_{1}-\epsilon||_{2}}\frac{e_{1}-\epsilon}{||e_{1}-\epsilon||_{2}}^{\top}\Big{)}\big{(}\omega,\sqrt{1-\omega^{2}}v^{\top}\big{)}^{\top}:=\theta.

We can obtain the gradient estimator via the following identity, which is Lemma 2 in Davidson et al. (2018b):

ϵ𝔼vMF(θ|ϵ,κ)[f(θ)]=ϵ𝔼(ψ,v)π1(ψ|κ)π2(v)[f(T(h(ψ,κ),v,ϵ))]=𝔼(ψ,v)π1(ψ|κ)π2(v)[ϵf(T(h(ψ,κ),v,ϵ))].\displaystyle\nabla_{\epsilon}\mathbb{E}_{\text{vMF}(\theta|\epsilon,\kappa)}\big{[}f(\theta)\big{]}=\nabla_{\epsilon}\mathbb{E}_{(\psi,v)\sim\pi_{1}(\psi|\kappa)\pi_{2}(v)}\Big{[}f\big{(}T(h(\psi,\kappa),v,\epsilon)\big{)}\Big{]}=\mathbb{E}_{(\psi,v)\sim\pi_{1}(\psi|\kappa)\pi_{2}(v)}\Big{[}\nabla_{\epsilon}f\big{(}T(h(\psi,\kappa),v,\epsilon)\big{)}\Big{]}.

In the v-DSW case, we have f(θ)=Wpp(θμ,θν)f(\theta)=\text{W}_{p}^{p}(\theta\sharp\mu,\theta\sharp\nu). Therefore, we have:

ϵ𝔼vMF(θ|ϵ,κ)[Wpp(θμ,θν)]=𝔼(ψ,v)π1(ψ|κ)π2(v)[ϵWpp(T(h(ψ,κ),v,ϵ)μ,T(h(ψ,κ),v,ϵ)ν)].\displaystyle\nabla_{\epsilon}\mathbb{E}_{\text{vMF}(\theta|\epsilon,\kappa)}\big{[}\text{W}_{p}^{p}(\theta\sharp\mu,\theta\sharp\nu)\big{]}=\mathbb{E}_{(\psi,v)\sim\pi_{1}(\psi|\kappa)\pi_{2}(v)}\Big{[}\nabla_{\epsilon}\text{W}_{p}^{p}\big{(}T(h(\psi,\kappa),v,\epsilon)\sharp\mu,T(h(\psi,\kappa),v,\epsilon)\sharp\nu\big{)}\Big{]}.

Then we can obtain a gradient estimator by using a Monte Carlo estimation scheme:

ϵ𝔼vMF(θ|ϵ,κ)[Wpp(θμ,θν)]1Li=1L[ϵWpp(T(h(ψi,κ),vi,ϵ)μ,T(h(ψi,κ),vi,ϵ)ν)],\displaystyle\nabla_{\epsilon}\mathbb{E}_{\text{vMF}(\theta|\epsilon,\kappa)}\big{[}\text{W}_{p}^{p}(\theta\sharp\mu,\theta\sharp\nu)\big{]}\approx\frac{1}{L}\sum_{i=1}^{L}\Big{[}\nabla_{\epsilon}\text{W}_{p}^{p}\big{(}T(h(\psi_{i},\kappa),v_{i},\epsilon)\sharp\mu,T(h(\psi_{i},\kappa),v_{i},\epsilon)\sharp\nu\big{)}\Big{]},

where {ψi}i=1Lπ1(ψ|κ)\{\psi_{i}\}_{i=1}^{L}\sim\pi_{1}(\psi|\kappa) i.i.d., {vi}i=1Lπ2(v)\{v_{i}\}_{i=1}^{L}\sim\pi_{2}(v) i.i.d., and LL is the number of projections. Sampling from π1(ψ|κ)\pi_{1}(\psi|\kappa) is equivalent to the acceptance-rejection scheme in the vMF sampling procedure, while sampling from π2(v)\pi_{2}(v) amounts to sampling directly from 𝒰(𝕊d2)\mathcal{U}(\mathbb{S}^{d-2}). It is worth noting that a gradient estimator for κ𝔼vMF(θ|ϵ,κ)[f(θ)]\nabla_{\kappa}\mathbb{E}_{\text{vMF}(\theta|\epsilon,\kappa)}\big{[}f(\theta)\big{]} can be derived by using the log-derivative trick; however, we do not need it here since we do not optimize κ\kappa in v-DSW.

Algorithm 1 Sampling from vMF distribution
Input: location ϵ\epsilon, concentration κ\kappa, dimension dd, unit vector e1=(1,0,..,0)e_{1}=(1,0,..,0)
 Draw v𝒰(𝕊d2)v\sim\mathcal{U}(\mathbb{S}^{d-2})
b2κ+4κ2+(d1)2d1b\leftarrow\frac{-2\kappa+\sqrt{4\kappa^{2}+(d-1)^{2}}}{d-1}, a(d1)+2κ+4κ2+(d1)24a\leftarrow\frac{(d-1)+2\kappa+\sqrt{4\kappa^{2}+(d-1)^{2}}}{4}, m4ab(1+b)(d1)log(d1)m\leftarrow\frac{4ab}{(1+b)}-(d-1)\log(d-1)
repeat
  Draw ψBeta(12(d1),12(d1))\psi\sim\operatorname{Beta}\left(\frac{1}{2}(d-1),\frac{1}{2}(d-1)\right)
  ωh(ψ,κ)=1(1+b)ψ1(1b)ψ\omega\leftarrow h(\psi,\kappa)=\frac{1-(1+b)\psi}{1-(1-b)\psi}
  t2ab1(1b)ψt\leftarrow\frac{2ab}{1-(1-b)\psi}
  Draw u𝒰([0,1])u\sim\mathcal{U}([0,1])
until (d1)log(t)t+mlog(u)(d-1)\log(t)-t+m\geq\log(u)
h1(ω,1ω2v)h_{1}\leftarrow(\omega,\sqrt{1-\omega^{2}}v^{\top})^{\top}
ϵe1ϵ\epsilon^{\prime}\leftarrow e_{1}-\epsilon
u=ϵϵ2u=\frac{\epsilon^{\prime}}{||\epsilon^{\prime}||_{2}}
U=I2uuU=I-2uu^{\top}
Output: Uh1Uh_{1}
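Below is a minimal PyTorch sketch of the sampling path of Algorithm 1. It only reproduces the acceptance-rejection loop and the Householder reflection; the reparameterized form through π1(ψ|κ) needed to differentiate through the accepted ψ is omitted, and all function and variable names, as well as the guard for ϵ = e1, are ours.

```python
import math
import torch

def sample_vmf(epsilon, kappa, n_samples=1):
    """Rejection sampling from vMF(epsilon, kappa) on S^{d-1} following Algorithm 1.
    epsilon: unit-norm tensor of shape (d,); returns a tensor of shape (n_samples, d)."""
    d = epsilon.shape[0]
    b = (-2 * kappa + math.sqrt(4 * kappa ** 2 + (d - 1) ** 2)) / (d - 1)
    a = ((d - 1) + 2 * kappa + math.sqrt(4 * kappa ** 2 + (d - 1) ** 2)) / 4
    m = 4 * a * b / (1 + b) - (d - 1) * math.log(d - 1)
    beta = torch.distributions.Beta(0.5 * (d - 1), 0.5 * (d - 1))

    samples = []
    for _ in range(n_samples):
        v = torch.randn(d - 1)
        v = v / v.norm()                              # v ~ U(S^{d-2})
        while True:                                   # acceptance-rejection loop for omega
            psi = beta.sample().item()
            omega = (1 - (1 + b) * psi) / (1 - (1 - b) * psi)
            t = 2 * a * b / (1 - (1 - b) * psi)
            u = torch.rand(1).item()
            if (d - 1) * math.log(t) - t + m >= math.log(u):
                break
        h1 = torch.cat([torch.tensor([omega]), math.sqrt(max(1 - omega ** 2, 0.0)) * v])
        e1 = torch.zeros(d)
        e1[0] = 1.0
        u_vec = e1 - epsilon                          # Householder reflection mapping e_1 to epsilon
        if u_vec.norm() < 1e-12:                      # guard added in this sketch: epsilon == e_1
            samples.append(h1)
        else:
            u_vec = u_vec / u_vec.norm()
            samples.append(h1 - 2 * torch.dot(u_vec, h1) * u_vec)
    return torch.stack(samples)
```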

A.3 Training algorithms

Training point-cloud autoencoder with Max-SW: We present the algorithm for training an autoencoder with Max-SW in Algorithm 2. The algorithm contains a nested loop: the outer loop trains the autoencoder, while the inner loop finds the max projecting direction of Max-SW; a minimal code sketch of the inner loop is given after the algorithm.

Algorithm 2 Training point-cloud autoencoder with max sliced Wasserstein distance
Input: Point-cloud distribution p(X)p(X), learning rate η\eta, slice learning rate ηs\eta_{s}, model maximum number of iterations 𝒯\mathcal{T}, slice maximum number of iterations TT, mini-batch size kk.
Initialization: Initialize the encoder fϕf_{\phi} and the decoder gγg_{\gamma}
while ϕ,γ\phi,\gamma not converge or reach 𝒯\mathcal{T} do
  Sample a mini-batch X1,,XkX_{1},\ldots,X_{k} i.i.d from p(X)p(X)
  ϕ=0,γ=0\nabla_{\phi}=0,\nabla_{\gamma}=0
  for i=1i=1 to kk do
   Initialize θ\theta
   while θ\theta not converge or reach TT do
    θ=θ+ηsθWp(θPXi,θPgγ(fϕ(Xi)))\theta=\theta+\eta_{s}\cdot\nabla_{\theta}\text{W}_{p}(\theta\sharp P_{X_{i}},\theta\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))}) # Other update rules can be used
    θ=θθ2\theta=\frac{\theta}{||\theta||_{2}} #Project back to the unit-hypersphere 𝕊d1\mathbb{S}^{d-1}
   end while
   ϕ=ϕ+1kϕWp(θPXi,θPgγ(fϕ(Xi)))\nabla_{\phi}=\nabla_{\phi}+\frac{1}{k}\nabla_{\phi}\text{W}_{p}(\theta\sharp P_{X_{i}},\theta\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))})
   γ=γ+1kγWp(θPXi,θPgγ(fϕ(Xi)))\nabla_{\gamma}=\nabla_{\gamma}+\frac{1}{k}\nabla_{\gamma}\text{W}_{p}(\theta\sharp P_{X_{i}},\theta\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))})
  end for
  ϕ=ϕηϕ\phi=\phi-\eta\cdot\nabla_{\phi} # Other update rules can be used
  γ=γηγ\gamma=\gamma-\eta\cdot\nabla_{\gamma} # Other update rules can be used
end while
Return: ϕ,γ\phi,\gamma
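For reference, the following is a minimal PyTorch sketch of the inner loop of Algorithm 2, i.e., projected gradient ascent for the max projecting direction between two point-clouds with the same number of supports. It relies on the closed-form one-dimensional Wasserstein distance between empirical measures, uses plain gradient ascent instead of the Adam update used in the experiments (Appendix C.2), and the function names are ours.

```python
import torch

def wasserstein_1d(x_proj, y_proj, p=2):
    # Closed-form W_p^p between two 1-D empirical measures with the same number of
    # supports: sort the projected supports and average the p-th powers of their gaps.
    return torch.mean(torch.abs(torch.sort(x_proj).values - torch.sort(y_proj).values) ** p)

def max_sw(X, Y, p=2, T=50, eta_s=1e-2):
    """Projected gradient ascent over theta on S^{d-1} (inner loop of Algorithm 2)
    for Max-SW between two point clouds X, Y of shape (m, d)."""
    theta = torch.randn(X.shape[1])
    theta = theta / theta.norm()
    theta.requires_grad_(True)
    for _ in range(T):
        w_pp = wasserstein_1d(X @ theta, Y @ theta, p)
        grad = torch.autograd.grad(w_pp, theta)[0]
        with torch.no_grad():
            theta += eta_s * grad          # ascent step (Adam is used in the experiments)
            theta /= theta.norm()          # project back onto the unit hypersphere
    theta = theta.detach()
    return wasserstein_1d(X @ theta, Y @ theta, p) ** (1.0 / p), theta
```

In Algorithm 2, the direction returned by this inner loop is then plugged into the gradients with respect to the encoder and decoder parameters.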

Training point-cloud autoencoder with amortized projection optimization: We present the training algorithm for a point-cloud autoencoder with amortized projection optimization in Algorithm 3. With amortized optimization, the inner loop for finding the max projecting direction is removed.

Algorithm 3 Training point-cloud autoencoder with amortized projection optimization
Input: Point-cloud distribution p(X)p(X), learning rate η\eta, slice learning rate ηs\eta_{s}, model maximum number of iterations 𝒯\mathcal{T}, mini-batch size kk.
Initialization: Initialize the encoder fϕf_{\phi}, the decoder gγg_{\gamma}, and the amortized model aψa_{\psi}
while ϕ,γ,ψ\phi,\gamma,\psi not converge or reach 𝒯\mathcal{T} do
  Sample a mini-batch X1,,XkX_{1},\ldots,X_{k} i.i.d from p(X)p(X)
  ϕ=0,γ=0,ψ=0\nabla_{\phi}=0,\nabla_{\gamma}=0,\nabla_{\psi}=0
  for i=1i=1 to kk do
   θψ,γ,ϕ=aψ(Xi,gγ(fϕ(Xi)))\theta_{\psi,\gamma,\phi}=a_{\psi}(X_{i},g_{\gamma}(f_{\phi}(X_{i})))
   ψ=ψ+1kψWp(θψ,γ,ϕPXi,θψ,γ,ϕPgγ(fϕ(Xi)))\nabla_{\psi}=\nabla_{\psi}+\frac{1}{k}\nabla_{\psi}\text{W}_{p}(\theta_{\psi,\gamma,\phi}\sharp P_{X_{i}},\theta_{\psi,\gamma,\phi}\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))})
   ϕ=ϕ+1kϕWp(θψ,γ,ϕPXi,θψ,γ,ϕPgγ(fϕ(Xi)))\nabla_{\phi}=\nabla_{\phi}+\frac{1}{k}\nabla_{\phi}\text{W}_{p}(\theta_{\psi,\gamma,\phi}\sharp P_{X_{i}},\theta_{\psi,\gamma,\phi}\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))})
   γ=γ+1kγWp(θψ,γ,ϕPXi,θψ,γ,ϕPgγ(fϕ(Xi)))\nabla_{\gamma}=\nabla_{\gamma}+\frac{1}{k}\nabla_{\gamma}\text{W}_{p}(\theta_{\psi,\gamma,\phi}\sharp P_{X_{i}},\theta_{\psi,\gamma,\phi}\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))})
  end for
  ψ=ψ+ηsψ\psi=\psi+\eta_{s}\cdot\nabla_{\psi} # Other update rules can be used
  ϕ=ϕηϕ\phi=\phi-\eta\cdot\nabla_{\phi} # Other update rules can be used
  γ=γηγ\gamma=\gamma-\eta\cdot\nabla_{\gamma} # Other update rules can be used
end while
Return: ϕ,γ\phi,\gamma

Training point-cloud autoencoder with v-DSW: We present the algorithm for training an autoencoder with v-DSW in Algorithm 4. The algorithm contains a nested loop: the outer loop trains the autoencoder, while the inner loop finds the best distribution over projecting directions for v-DSW; a minimal sketch of the Monte Carlo v-DSW estimate used inside the loop is given after the algorithm.

Algorithm 4 Training point-cloud autoencoder with von Mises-Fisher distributional sliced Wasserstein distance
Input: Point-cloud distribution p(X)p(X), learning rate η\eta, slice learning rate ηs\eta_{s}, model maximum number of iterations 𝒯\mathcal{T}, slice maximum number of iterations TT, mini-batch size kk, the number of projections LL, and the concentration hyperparameter κ\kappa.
Initialization: Initialize the encoder fϕf_{\phi} and the decoder gγg_{\gamma}
while ϕ,γ\phi,\gamma not converge or reach 𝒯\mathcal{T} do
  Sample a mini-batch X1,,XkX_{1},\ldots,X_{k} i.i.d from p(X)p(X)
  ϕ=0,γ=0\nabla_{\phi}=0,\nabla_{\gamma}=0
  for i=1i=1 to kk do
   Initialize ϵ\epsilon
   while ϵ\epsilon not converge or reach TT do
    Sample θ1ϵ,,θLϵ\theta_{1}^{\epsilon},\ldots,\theta_{L}^{\epsilon} i.i.d from vMF(ϵ,κ)\text{vMF}(\epsilon,\kappa) via the reparameterized acceptance-rejection sampling in Algorithm 1
    ϵ=ϵ+ηs1Ll=1LϵWp(θlϵPXi,θlϵPgγ(fϕ(Xi)))\epsilon=\epsilon+\eta_{s}\cdot\frac{1}{L}\sum_{l=1}^{L}\nabla_{\epsilon}\text{W}_{p}(\theta_{l}^{\epsilon}\sharp P_{X_{i}},\theta_{l}^{\epsilon}\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))}) # Other update rules can be used
    ϵ=ϵϵ2\epsilon=\frac{\epsilon}{||\epsilon||_{2}} #Project back to the unit-hypersphere 𝕊d1\mathbb{S}^{d-1}
   end while
   Sample θ1ϵ,,θLϵ\theta_{1}^{\epsilon},\ldots,\theta_{L}^{\epsilon} i.i.d from vMF(ϵ,κ)\text{vMF}(\epsilon,\kappa) via the reparameterized acceptance-rejection sampling in Algorithm 1.
    ϕ=ϕ+1k1Ll=1LϕWp(θlϵPXi,θlϵPgγ(fϕ(Xi)))\nabla_{\phi}=\nabla_{\phi}+\frac{1}{k}\frac{1}{L}\sum_{l=1}^{L}\nabla_{\phi}\text{W}_{p}(\theta_{l}^{\epsilon}\sharp P_{X_{i}},\theta_{l}^{\epsilon}\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))})
    γ=γ+1k1Ll=1LγWp(θlϵPXi,θlϵPgγ(fϕ(Xi)))\nabla_{\gamma}=\nabla_{\gamma}+\frac{1}{k}\frac{1}{L}\sum_{l=1}^{L}\nabla_{\gamma}\text{W}_{p}(\theta_{l}^{\epsilon}\sharp P_{X_{i}},\theta_{l}^{\epsilon}\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))})
  end for
  ϕ=ϕηϕ\phi=\phi-\eta\cdot\nabla_{\phi} # Other update rules can be used
  γ=γηγ\gamma=\gamma-\eta\cdot\nabla_{\gamma} # Other update rules can be used
end while
Return: ϕ,γ\phi,\gamma
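For reference, below is a minimal PyTorch sketch of the Monte Carlo estimate of v-DSW that appears inside the loops of Algorithm 4 (and, without the inner loop over ϵ, in Algorithm 5). It assumes the L projecting directions have already been drawn from vMF(ϵ,κ), for instance with the sample_vmf sketch in Appendix A.2; the function name and argument layout are ours.

```python
import torch

def v_dsw(X, Y, thetas, p=2):
    """Monte Carlo estimate of v-DSW_p(P_X, P_Y; epsilon, kappa).
    X, Y: point clouds of shape (m, d); thetas: (L, d) directions drawn from vMF(epsilon, kappa)."""
    x_proj = X @ thetas.t()                 # (m, L): projections of the supports of P_X per slice
    y_proj = Y @ thetas.t()                 # (m, L): projections of the supports of P_Y per slice
    # Closed-form one-dimensional W_p^p for each slice, then average over the L slices.
    w_pp = torch.mean(
        torch.abs(torch.sort(x_proj, dim=0).values - torch.sort(y_proj, dim=0).values) ** p,
        dim=0,
    )
    return torch.mean(w_pp) ** (1.0 / p)
```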

Training point-cloud autoencoder with amortized distributional projection optimization: We present the training algorithm for a point-cloud autoencoder with amortized distributional projection optimization in Algorithm 5. With amortized distributional optimization, the inner loop for finding the best distribution over projecting directions is removed.

Algorithm 5 Training point-cloud autoencoder with amortized distributional projection optimization
Input: Point-cloud distribution p(X)p(X), learning rate η\eta, slice learning rate ηs\eta_{s}, model maximum number of iterations 𝒯\mathcal{T}, mini-batch size kk, the number of projections LL, and the concentration hyperparameter κ\kappa.
Initialization: Initialize the encoder fϕf_{\phi}, the decoder gγg_{\gamma}, and the amortized model aψa_{\psi}
while ϕ,γ,ψ\phi,\gamma,\psi not converge or reach 𝒯\mathcal{T} do
  Sample a mini-batch X1,,XkX_{1},\ldots,X_{k} i.i.d from p(X)p(X)
  ϕ=0,γ=0,ψ=0\nabla_{\phi}=0,\nabla_{\gamma}=0,\nabla_{\psi}=0
  for i=1i=1 to kk do
   ϵψ,γ,ϕ=aψ(Xi,gγ(fϕ(Xi)))\epsilon_{\psi,\gamma,\phi}=a_{\psi}(X_{i},g_{\gamma}(f_{\phi}(X_{i})))
   Sample θ1ψ,γ,ϕ,,θLψ,γ,ϕ\theta_{1}^{\psi,\gamma,\phi},\ldots,\theta_{L}^{\psi,\gamma,\phi} i.i.d from vMF(ϵψ,γ,ϕ,κ)\text{vMF}(\epsilon_{\psi,\gamma,\phi},\kappa) via the reparameterized acceptance-rejection sampling in Algorithm 1
    ψ=ψ+1k1Ll=1LψWp(θlψ,γ,ϕPXi,θlψ,γ,ϕPgγ(fϕ(Xi)))\nabla_{\psi}=\nabla_{\psi}+\frac{1}{k}\frac{1}{L}\sum_{l=1}^{L}\nabla_{\psi}\text{W}_{p}(\theta_{l}^{\psi,\gamma,\phi}\sharp P_{X_{i}},\theta_{l}^{\psi,\gamma,\phi}\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))})
    ϕ=ϕ+1k1Ll=1LϕWp(θlψ,γ,ϕPXi,θlψ,γ,ϕPgγ(fϕ(Xi)))\nabla_{\phi}=\nabla_{\phi}+\frac{1}{k}\frac{1}{L}\sum_{l=1}^{L}\nabla_{\phi}\text{W}_{p}(\theta_{l}^{\psi,\gamma,\phi}\sharp P_{X_{i}},\theta_{l}^{\psi,\gamma,\phi}\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))})
    γ=γ+1k1Ll=1LγWp(θlψ,γ,ϕPXi,θlψ,γ,ϕPgγ(fϕ(Xi)))\nabla_{\gamma}=\nabla_{\gamma}+\frac{1}{k}\frac{1}{L}\sum_{l=1}^{L}\nabla_{\gamma}\text{W}_{p}(\theta_{l}^{\psi,\gamma,\phi}\sharp P_{X_{i}},\theta_{l}^{\psi,\gamma,\phi}\sharp P_{g_{\gamma}(f_{\phi}(X_{i}))})
  end for
  ψ=ψ+ηsψ\psi=\psi+\eta_{s}\cdot\nabla_{\psi} # Other update rules can be used
  ϕ=ϕηϕ\phi=\phi-\eta\cdot\nabla_{\phi} # Other update rules can be used
  γ=γηγ\gamma=\gamma-\eta\cdot\nabla_{\gamma} # Other update rules can be used
end while
Return: ϕ,γ\phi,\gamma

Appendix B Proofs

B.1 Proof for Proposition 1

We first recall the definition of the projected one-dimensional Wasserstein distance between two probability measures μ\mu and ν\nu: PWp(μ,ν;θ^)=Wp(θ^μ,θ^ν)\text{PW}_{p}(\mu,\nu;\hat{\theta})=\text{W}_{p}(\hat{\theta}\sharp\mu,\hat{\theta}\sharp\nu) for a fixed projecting direction θ^argmaxθ𝕊d1Wp(θμ,θν)\hat{\theta}\neq\text{argmax}_{\theta\in\mathbb{S}^{d-1}}\text{W}_{p}(\theta\sharp\mu,\theta\sharp\nu).

Non-negativity and Symmetry: Due to the non-negativity and symmetry of the Wasserstein distance, the non-negativity and symmetry of the projected Wasserstein follow directly from its definition.

Triangle inequality: For any three probability measures μ1,μ2,μ3𝒫p(d)\mu_{1},\mu_{2},\mu_{3}\in\mathcal{P}_{p}(\mathbb{R}^{d}), we have:

PWp(μ1,μ3;θ^)\displaystyle\text{PW}_{p}(\mu_{1},\mu_{3};\hat{\theta}) =Wp(θ^μ1,θ^μ3)\displaystyle=\text{W}_{p}(\hat{\theta}\sharp\mu_{1},\hat{\theta}\sharp\mu_{3})
Wp(θ^μ1,θ^μ2)+Wp(θ^μ2,θ^μ3)\displaystyle\leq\text{W}_{p}(\hat{\theta}\sharp\mu_{1},\hat{\theta}\sharp\mu_{2})+\text{W}_{p}(\hat{\theta}\sharp\mu_{2},\hat{\theta}\sharp\mu_{3})
=PWp(μ1,μ2;θ^)+PWp(μ2,μ3;θ^),\displaystyle=\text{PW}_{p}(\mu_{1},\mu_{2};\hat{\theta})+\text{PW}_{p}(\mu_{2},\mu_{3};\hat{\theta}),

where the first inequality is due to the triangle inequality of the Wasserstein distance.

Identity: If μ=ν\mu=\nu, we have PWp(μ,ν;θ^)=0\text{PW}_{p}(\mu,\nu;\hat{\theta})=0 due to the identity property of the Wasserstein distance. However, PWp(μ,ν;θ^)=0\text{PW}_{p}(\mu,\nu;\hat{\theta})=0 does not imply μ=ν\mu=\nu: since θ^\hat{\theta} is not the maximizing direction, there can exist θ𝕊d1\theta^{\prime}\in\mathbb{S}^{d-1} such that 0=PWp(μ,ν;θ^)<PWp(μ,ν;θ)0=\text{PW}_{p}(\mu,\nu;\hat{\theta})<\text{PW}_{p}(\mu,\nu;\theta^{\prime}), which implies θμθν\theta^{\prime}\sharp\mu\neq\theta^{\prime}\sharp\nu. Let [γ](w)=deiw,x𝑑γ(x)\mathcal{F}[\gamma](w)=\int_{\mathbb{R}^{d^{\prime}}}e^{-i\langle w,x\rangle}d\gamma(x) denote the Fourier transform of γ𝒫(d)\gamma\in\mathcal{P}(\mathbb{R}^{d^{\prime}}). Then there exists tt\in\mathbb{R} such that

[μ](tθ)\displaystyle\mathcal{F}[\mu](t\theta^{\prime}) =deitθ,x𝑑μ(x)=eitz𝑑θμ(z)=[θμ](t)\displaystyle=\int_{\mathbb{R}^{d}}e^{-it\langle\theta^{\prime},x\rangle}d\mu(x)=\int_{\mathbb{R}}e^{-itz}d\theta^{\prime}\sharp\mu(z)=\mathcal{F}[\theta^{\prime}\sharp\mu](t)
[θν](t)=eitz𝑑θν(z)=deitθ,x𝑑ν(x)=[ν](tθ).\displaystyle\neq\mathcal{F}[\theta^{\prime}\sharp\nu](t)=\int_{\mathbb{R}}e^{-itz}d\theta^{\prime}\sharp\nu(z)=\int_{\mathbb{R}^{d}}e^{-it\langle\theta^{\prime},x\rangle}d\nu(x)=\mathcal{F}[\nu](t\theta^{\prime}).

Therefore, μν\mu\neq\nu even though PWp(μ,ν;θ^)=0\text{PW}_{p}(\mu,\nu;\hat{\theta})=0, so the identity property does not hold in general. We complete the proof.

B.2 Proof for Theorem 1

We first prove the metricity of the non-optimal von Mises-Fisher distributional sliced Wasserstein distance (v-DSW). For any two probability measures μ,ν𝒫p(d)\mu,\nu\in\mathcal{P}_{p}(\mathbb{R}^{d}), it is defined as follows:

v-DSWp(μ,ν;ϵ,κ)=(𝔼θvMF(ϵ,κ)Wpp(θμ,θν))1p,\displaystyle\text{v-DSW}_{p}(\mu,\nu;\epsilon,\kappa)=\left(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\text{W}_{p}^{p}(\theta\sharp\mu,\theta\sharp\nu)\right)^{\frac{1}{p}},

where ϵ𝕊d1\epsilon\in\mathbb{S}^{d-1} and 0<κ<0<\kappa<\infty.

Lemma 1.

For any ϵ𝕊d1\epsilon\in\mathbb{S}^{d-1} and κ<\kappa<\infty, v-DSWp(,;ϵ,κ)\text{v-DSW}_{p}(\cdot,\cdot;\epsilon,\kappa) is a valid metric on the space of probability measures.

Proof.

We now prove that v-DSW satisfies non-negativity, symmetry, triangle inequality, and identity.

Non-negativity and Symmetry: The non-negativity and symmetry of v-DSW follow directly from the non-negativity and symmetry of the Wasserstein distance.

Triangle inequality: For any three probability measures μ1,μ2,μ3𝒫p(d)\mu_{1},\mu_{2},\mu_{3}\in\mathcal{P}_{p}(\mathbb{R}^{d}), we have

v-DSWp(μ1,μ3;ϵ,κ)\displaystyle\text{v-DSW}_{p}(\mu_{1},\mu_{3};\epsilon,\kappa) =(𝔼θvMF(ϵ,κ)Wpp(θμ1,θμ3))1p\displaystyle=\left(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\text{W}_{p}^{p}(\theta\sharp\mu_{1},\theta\sharp\mu_{3})\right)^{\frac{1}{p}}
(𝔼θvMF(ϵ,κ)[Wp(θμ1,θμ2)+Wp(θμ2,θμ3)]p)1p\displaystyle\leq\left(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\left[\text{W}_{p}(\theta\sharp\mu_{1},\theta\sharp\mu_{2})+\text{W}_{p}(\theta\sharp\mu_{2},\theta\sharp\mu_{3})\right]^{p}\right)^{\frac{1}{p}}
(𝔼θvMF(ϵ,κ)Wpp(θμ1,θμ2))1p+(𝔼θvMF(ϵ,κ)Wpp(θμ2,θμ3))1p\displaystyle\leq\left(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\text{W}_{p}^{p}(\theta\sharp\mu_{1},\theta\sharp\mu_{2})\right)^{\frac{1}{p}}+\left(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\text{W}_{p}^{p}(\theta\sharp\mu_{2},\theta\sharp\mu_{3})\right)^{\frac{1}{p}}
=v-DSWp(μ1,μ2;ϵ,κ)+v-DSWp(μ2,μ3;ϵ,κ)\displaystyle=\text{v-DSW}_{p}(\mu_{1},\mu_{2};\epsilon,\kappa)+\text{v-DSW}_{p}(\mu_{2},\mu_{3};\epsilon,\kappa)
where the first inequality follows from the triangle inequality of the Wasserstein distance and the second inequality follows from the Minkowski inequality.

Identity: From the definition, if μ=ν\mu=\nu, we obtain v-DSWp(μ,ν;ϵ,κ)=0\text{v-DSW}_{p}(\mu,\nu;\epsilon,\kappa)=0. Now, we need to show that if v-DSWp(μ,ν;ϵ,κ)=0\text{v-DSW}_{p}(\mu,\nu;\epsilon,\kappa)=0, then μ=ν\mu=\nu.

If v-DSWp(μ,ν;ϵ,κ)=0\text{v-DSW}_{p}(\mu,\nu;\epsilon,\kappa)=0, we have (𝔼θvMF(ϵ,κ)Wpp(θμ,θν))1p=0\left(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\text{W}_{p}^{p}(\theta\sharp\mu,\theta\sharp\nu)\right)^{\frac{1}{p}}=0, which implies 𝔼θvMF(ϵ,κ)Wpp(θμ,θν)=0\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\text{W}_{p}^{p}(\theta\sharp\mu,\theta\sharp\nu)=0. Therefore, Wp(θμ,θν)=0\text{W}_{p}(\theta\sharp\mu,\theta\sharp\nu)=0 for vMF(ϵ,κ)\text{vMF}(\epsilon,\kappa)-almost every θ𝕊d1\theta\in\mathbb{S}^{d-1}. Using the identity property of the Wasserstein distance, we obtain θμ=θν\theta\sharp\mu=\theta\sharp\nu for vMF(ϵ,κ)\text{vMF}(\epsilon,\kappa)-almost every θ𝕊d1\theta\in\mathbb{S}^{d-1}. Since vMF(ϵ,κ)\text{vMF}(\epsilon,\kappa) with 0<κ<0<\kappa<\infty has full support on 𝕊d1\mathbb{S}^{d-1} and both sides of the identity below are continuous in θ\theta, for any tt\in\mathbb{R} and θ𝕊d1\theta\in\mathbb{S}^{d-1}, we have:

[μ](tθ)\displaystyle\mathcal{F}[\mu](t\theta) =deitθ,x𝑑μ(x)=eitz𝑑θμ(z)=[θμ](t)\displaystyle=\int_{\mathbb{R}^{d}}e^{-it\langle\theta,x\rangle}d\mu(x)=\int_{\mathbb{R}}e^{-itz}d\theta\sharp\mu(z)=\mathcal{F}[\theta\sharp\mu](t)
=[θν](t)=eitz𝑑θν(z)=deitθ,x𝑑ν(x)=[ν](tθ),\displaystyle=\mathcal{F}[\theta\sharp\nu](t)=\int_{\mathbb{R}}e^{-itz}d\theta\sharp\nu(z)=\int_{\mathbb{R}^{d}}e^{-it\langle\theta,x\rangle}d\nu(x)=\mathcal{F}[\nu](t\theta),

where [γ](w)=deiw,x𝑑γ(x)\mathcal{F}[\gamma](w)=\int_{\mathbb{R}^{d^{\prime}}}e^{-i\langle w,x\rangle}d\gamma(x) denotes the Fourier transform of γ𝒫(d)\gamma\in\mathcal{P}(\mathbb{R}^{d^{\prime}}). We then obtain μ=ν\mu=\nu by the injectivity of the Fourier transform. We complete the proof. ∎

By abuse of notation, we write v-DSW(X,Y;ϵ,κ)=v-DSW(PX,PY;ϵ,κ)\text{v-DSW}(X,Y;\epsilon,\kappa)=\text{v-DSW}(P_{X},P_{Y};\epsilon,\kappa), where X,Y𝒳X,Y\in\mathcal{X} are two point-clouds, PX=1mi=1mδxiP_{X}=\frac{1}{m}\sum_{i=1}^{m}\delta_{x_{i}}, and PY=1mi=1mδyiP_{Y}=\frac{1}{m}\sum_{i=1}^{m}\delta_{y_{i}}. This casts v-DSW from a metric on the space of probability measures to a metric on the space of point-clouds 𝒳\mathcal{X}.

Corollary 1.

For any ϵ𝕊d1\epsilon\in\mathbb{S}^{d-1} and κ<\kappa<\infty, v-DSWp(,;ϵ,κ)\text{v-DSW}_{p}(\cdot,\cdot;\epsilon,\kappa) is a valid metric on the space of point-clouds 𝒳\mathcal{X}.

Proof.

Since PX,PY𝒫p(d)P_{X},P_{Y}\in\mathcal{P}_{p}(\mathbb{R}^{d}), the non-negativity, symmetry, triangle inequality, and identity properties follow directly from Lemma 1. We only need to show that v-DSW is invariant to permutation, which is straightforward from the definition of empirical probability measures. For any permutation function σ\sigma, we have PX=1mi=1mδxi=1mi=1mδxσ(i)=Pσ(X)P_{X}=\frac{1}{m}\sum_{i=1}^{m}\delta_{x_{i}}=\frac{1}{m}\sum_{i=1}^{m}\delta_{x_{\sigma(i)}}=P_{\sigma(X)}, which completes the proof. ∎

We now continue the proof of Theorem 1. If 𝔼Xp(X)(𝔼θvMF(ϵ,κ)Wpp(θPX,θPgγ(fϕ(X))))1p=0\mathbb{E}_{X\sim p(X)}\left(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\text{W}_{p}^{p}(\theta\sharp P_{X},\theta\sharp P_{g_{\gamma}(f_{\phi}(X))})\right)^{\frac{1}{p}}=0, we obtain (𝔼θvMF(ϵ,κ)Wpp(θPX,θPgγ(fϕ(X))))1p=v-DSW(X,gγ(fϕ(X));ϵ,κ)=0\left(\mathbb{E}_{\theta\sim\text{vMF}(\epsilon,\kappa)}\text{W}_{p}^{p}(\theta\sharp P_{X},\theta\sharp P_{g_{\gamma}(f_{\phi}(X))})\right)^{\frac{1}{p}}=\text{v-DSW}(X,g_{\gamma}(f_{\phi}(X));\epsilon,\kappa)=0 for p(X)p(X)-almost every X𝒳X\in\mathcal{X}. By Corollary 1, we obtain X=gγ(fϕ(X))X=g_{\gamma}(f_{\phi}(X)) for p(X)p(X)-almost every X𝒳X\in\mathcal{X}. We complete the proof.

B.3 Proof for Proposition 2

We first recall the definition of the self-attention amortized model in Definition 3:

aψ(X,Y)=𝒜ζ(X)𝟏m+𝒜ζ(Y)𝟏m𝒜ζ(X)𝟏m+𝒜ζ(Y)𝟏m2,\displaystyle a_{\psi}(X,Y)=\frac{{\mathcal{A}}_{\zeta}(X^{\prime\top})^{\top}{\bm{1}}_{m}+{\mathcal{A}}_{\zeta}(Y^{\prime\top})^{\top}{\bm{1}}_{m}}{||{\mathcal{A}}_{\zeta}(X^{\prime\top})^{\top}{\bm{1}}_{m}+{\mathcal{A}}_{\zeta}(Y^{\prime\top})^{\top}{\bm{1}}_{m}||_{2}},

Symmetry: Since the self-attention amortized model uses the same attention weights ζ\zeta for both XX and YY, exchanging XX and YY yields the same result, i.e., aψ(X,Y)=aψ(Y,X)a_{\psi}(X,Y)=a_{\psi}(Y,X).

Permutation invariance: Based on the results in Yang et al. (2019b, Appendix A), we show that the self-attention amortized model is permutation invariant. In particular, for any permutation σ\sigma of the supports of XX, we have:

𝒜ζ(X)𝟏m\displaystyle{\mathcal{A}}_{\zeta}(X^{\prime\top})^{\top}{\bm{1}}_{m} =Att(XWq,XWk,XWv)𝟏m\displaystyle=\mathrm{Att}(X^{\prime\top}W_{q},X^{\prime\top}W_{k},X^{\prime\top}W_{v})^{\top}{\bm{1}}_{m}
=(softmaxrow[XWqWkXdk]XWv)𝟏m\displaystyle=\left(\text{softmax}_{\text{row}}\left[\frac{X^{\prime\top}W_{q}W_{k}^{\top}X^{\prime}}{\sqrt{d_{k}}}\right]X^{\prime\top}W_{v}\right)^{\top}{\bm{1}}_{m}
=(softmaxrow[σ(X)WqWkσ(X)dk]σ(X)Wv)𝟏m\displaystyle=\left(\text{softmax}_{\text{row}}\left[\frac{\sigma(X)^{\prime\top}W_{q}W_{k}^{\top}\sigma(X)^{\prime}}{\sqrt{d_{k}}}\right]\sigma(X)^{\prime\top}W_{v}\right)^{\top}{\bm{1}}_{m}
=𝒜ζ(σ(X))𝟏m.\displaystyle={\mathcal{A}}_{\zeta}(\sigma(X)^{\prime\top})^{\top}{\bm{1}}_{m}.

The same argument carries over to the linear self-attention and efficient self-attention variants; a minimal code sketch illustrating both properties is given below.
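To make the argument concrete, below is a minimal PyTorch sketch of the self-attention amortized model with standard softmax attention, together with a numerical check of the symmetry and permutation-invariance properties shown above. The parameter shapes, initialization, and tolerance are our illustrative choices (in the experiments, d_v = 3 and d_k is tuned as described in Appendix C.3), not the exact implementation.

```python
import torch
import torch.nn as nn

class SelfAttentionAmortizedModel(nn.Module):
    """A sketch of the self-attention amortized model a_psi of Definition 3 with standard
    softmax attention (the efficient/linear variants only change A_zeta).
    Point clouds are stored as (m, d) tensors whose rows are the supports."""

    def __init__(self, d, d_k):
        super().__init__()
        self.W_q = nn.Parameter(torch.randn(d, d_k) / d ** 0.5)
        self.W_k = nn.Parameter(torch.randn(d, d_k) / d ** 0.5)
        self.W_v = nn.Parameter(torch.randn(d, d) / d ** 0.5)   # d_v = d so the output lives in R^d
        self.d_k = d_k

    def attention_pool(self, X):
        # A_zeta(X'^T)^T 1_m: row-softmax attention followed by sum pooling over the supports.
        scores = (X @ self.W_q) @ (X @ self.W_k).t() / self.d_k ** 0.5
        attended = torch.softmax(scores, dim=-1) @ (X @ self.W_v)   # (m, d)
        return attended.sum(dim=0)

    def forward(self, X, Y):
        z = self.attention_pool(X) + self.attention_pool(Y)    # symmetric in (X, Y)
        return z / z.norm(p=2)                                  # location vector on S^{d-1}

# Numerical checks of symmetry and permutation invariance (Proposition 2).
model = SelfAttentionAmortizedModel(d=3, d_k=16)
X, Y = torch.randn(2048, 3), torch.randn(2048, 3)
perm = torch.randperm(2048)
assert torch.allclose(model(X, Y), model(Y, X))                  # symmetry
assert torch.allclose(model(X, Y), model(X[perm], Y), atol=1e-4) # permutation invariance
```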

Appendix C Experiment settings

In this section, we first provide the details of the training process and the architectures for point-cloud reconstruction, transfer learning, and point-cloud generation. Then, we present the implementation details and hyper-parameter settings for the different distances used in our experiments.

C.1 Details of point-cloud reconstruction and downstream applications

Point-cloud reconstruction: We use the same settings as ASW (Nguyen et al., 2021c) to train autoencoders. We utilize a variant of Point-Net (Qi et al., 2017) with an embedding size of 256, proposed in Pham et al. (2020). The architectures of the autoencoder and classifier are shown in Figure 5. Our autoencoder is trained on the ShapeNet Core-55 dataset (Chang et al., 2015) with a batch size of 128 and a point-cloud size of 2048. We train it for 300 epochs using an SGD optimizer with an initial learning rate of 1e-3, a momentum of 0.9, and a weight decay of 5e-4.

Figure 5: The architecture of the Point-Net variant in our experiments. For transfer learning, we use a simple classifier with 3 fully-connected layers. All layers are followed by ReLU activation and batch normalization by default, except for the final layers.

Next, we detail the process of conducting two downstream applications of point-cloud reconstruction.

Transfer learning: A classifier is trained on the latent space of the autoencoder. Particularly, we extract a 256-dimensional latent vector (which is smaller than the setting in Lee et al. (2022)) of an input 3D point-cloud via the pre-trained encoder. Then, this vector is fed into a multi-layer perceptron with hidden layers of size 512 and 256. The last layer outputs a 40-dimensional vector representing the prediction over the 40 classes of the ModelNet40 dataset.

Point-cloud generation: Our generative model is trained on the latent space of the autoencoder as follows. First, we extract a 256-dimensional latent vector of an input 3D point-cloud via the pre-trained encoder. Then a 64-dimensional vector is drawn from a normal distribution 𝒩(0,𝕀64){\mathcal{N}}(0,{\mathbb{I}}_{64}), where 𝕀64{\mathbb{I}}_{64} is the 64×64 identity matrix, and fed into a generator which also outputs a 256-dimensional vector. Finally, the generator learns by minimizing the optimal transport distance between the generated and ground-truth latent codes.

C.2 Details of baseline distances

We want to emphasize that we use the same set of hyper-parameters reported in (Nguyen et al., 2021c) for Chamfer, EMD, SW, and Max-SW.

Chamfer and EMD: We use the CUDA implementation from (Yang et al., 2019a).

SW: We use the Monte Carlo estimation with 100 slices.

Max-SW: We use the projected sub-gradient ascent algorithm to optimize the projection. It is trained with an Adam optimizer with an initial learning rate of 1e-4. The number of iterations T is chosen from {1,10,50}\{1,10,50\}.

Adaptive SW: We use Algorithm 1 in (Nguyen et al., 2021c) with the same set of parameters as follows: N0=2,s=1,ϵ=0.5,N_{0}=2,s=1,\epsilon=0.5, and M=500M=500.

Table 4: Reconstruction and transfer learning performance of different autoencoders on the ModelNet40 dataset. For v-DSW and Max-SW, T denotes the number of projected sub-gradient ascent iterations. In Table 1, both v-DSW and Max-SW have T = 50 iterations. All reconstruction losses except EMD are multiplied by 100.
Method CD (102,)(10^{-2},\downarrow) SW (102,)(10^{-2},\downarrow) EMD ()(\downarrow) Acc ()(\uparrow) Time ()(\downarrow)
CD 1.25 ±\pm 0.03 681.20 ±\pm 16.73 653.52 ±\pm 10.43 86.28 ±\pm 0.34 95
EMD 0.40 ±\pm 0.00 94.54 ±\pm 2.90 168.60 ±\pm 1.57 88.45 ±\pm 0.20 208
SW 0.68 ±\pm 0.01 89.61 ±\pm 3.88 191.12 ±\pm 2.88 87.90 ±\pm 0.27 106
Max-SW (T = 1) 0.69 ±\pm 0.01 87.60 ±\pm 0.95 190.88 ±\pm 0.40 88.05 ±\pm 0.23 97
Max-SW (T = 10) 0.69 ±\pm 0.01 90.72 ±\pm 0.58 192.82 ±\pm 0.73 87.82 ±\pm 0.37 102
Max-SW (T = 50) 0.68 ±\pm 0.01 88.22 ±\pm 1.45 190.23 ±\pm 0.1 87.97 ±\pm 0.14 116
ASW 0.69 ±\pm 0.01 89.42 ±\pm 5.07 192.03 ±\pm 3.09 87.78 ±\pm 0.20 103
v-DSW (T = 1) 0.67 ±\pm 0.01 87.29 ±\pm 1.49 188.52 ±\pm 1.47 87.87 ±\pm 0.28 115
v-DSW (T = 10) 0.68 ±\pm 0.00 87.44 ±\pm 1.07 189.97 ±\pm 1.04 87.98 ±\pm 0.23 205
v-DSW (T = 50) 0.67 ±\pm 0.00 85.03 ±\pm 3.31 187.75 ±\pm 2.00 87.83 ±\pm 0.40 633
{\mathcal{L}}-Max-SW 1.06 ±\pm 0.03 121.85 ±\pm 5.77 236.87 ±\pm 3.42 87.70 ±\pm 0.23 94
𝒢{\mathcal{G}}-Max-SW 12.11 ±\pm 0.29 851.07 ±\pm 2.11 829.28 ±\pm 5.53 87.49 ±\pm 0.36 97
𝒩{\mathcal{N}}-Max-SW 7.38 ±\pm 3.29 618.74 ±\pm 153.87 648.32 ±\pm 117.03 87.43 ±\pm 0.15 96
{\mathcal{L}}v-DSW (ours) 0.68 ±\pm 0.00 85.32 ±\pm 0.54 188.32 ±\pm 0.23 87.70 ±\pm 0.34 114
𝒢{\mathcal{G}}v-DSW (ours) 0.68 ±\pm 0.01 82.77 ±\pm 0.48 187.04 ±\pm 1.11 87.75 ±\pm 0.19 117
𝒩{\mathcal{N}}v-DSW (ours) 0.67 ±\pm 0.00 83.47 ±\pm 0.49 186.66 ±\pm 0.81 87.84 ±\pm 0.07 115
𝒜{\mathcal{A}}v-DSW (ours) 0.67 ±\pm 0.01 83.08 ±\pm 1.22 186.27 ±\pm 0.56 88.05 ±\pm 0.17 230
𝒜{\mathcal{E}}{\mathcal{A}}v-DSW (ours) 0.68 ±\pm 0.01 82.05 ±\pm 0.40 186.46 ±\pm 0.25 88.07 ±\pm 0.21 125
𝒜{\mathcal{L}}{\mathcal{A}}v-DSW (ours) 0.68 ±\pm 0.00 81.03 ±\pm 0.18 185.26 ±\pm 0.31 88.28 ±\pm 0.13 123

v-DSW: We use the stochastic projected gradient ascent algorithm to optimize the location vector ϵ\epsilon in Equation 19, while we fix the concentration parameter κ\kappa to 1 for both v-DSW and all of its amortized versions. Similar to Max-SW, the location is optimized with an Adam optimizer with an initial learning rate of 1e-4. The number of iterations TT is selected from {1,10,50}\{1,10,50\} based on the task performance. Intuitively, increasing the number of iterations leads to an approximation that is closer to the optimal value but comes with a higher computational cost. We also use the Monte Carlo estimation with 100 slices, as in SW.

C.3 Details of amortized sliced Wasserstein distances

Linear, generalized linear, and non-linear models: We adopt the official implementations in (Nguyen & Ho, 2022a).

Self-attention-based models: We adapt the official implementations from their corresponding papers in our experiments. For all variants, dvd_{v} is set to 3, which equals the dimension of the point-clouds, while dkd_{k} is chosen from {16,32,64,128}\{16,32,64,128\}. In Equation 15, the projected dimension kk is selected from {64,128}\{64,128\}.

Figure 6: Qualitative results of reconstructing point-clouds in the ShapeNet Core-55 dataset. From top to bottom: input, SW, Max-SW (T = 50), ASW, v-DSW (T = 50), 𝒩{\mathcal{N}}v-DSW, 𝒜{\mathcal{E}}{\mathcal{A}}v-DSW, and 𝒜{\mathcal{L}}{\mathcal{A}}v-DSW.

Training amortized models: The learning rate is set to 1e-3 and the optimizer is set to Adam (Kingma & Ba, 2014) with (β1,β2)=(0,0.9)(\beta_{1},\beta_{2})=(0,0.9).

Table 5: Quantitative results (measured in EMD) of reconstructing point-clouds in the ShapeNet Core-55 dataset.
Method PC1 PC2 PC3 PC4 PC5 PC6 Avg
SW 141.07 139.50 118.83 99.11 150.28 128.46 129.54
Max-SW (T = 50) 145.15 131.76 112.13 116.73 139.91 115.79 126.91
ASW 139.17 126.55 115.49 91.07 153.87 114.84 123.50
v-DSW (T = 50) 133.06 146.99 105.65 105.66 137.32 110.50 123.20
𝒩{\mathcal{N}}v-DSW 132.60 127.57 100.81 94.31 131.04 116.34 117.11
𝒜{\mathcal{E}}{\mathcal{A}}v-DSW 139.64 124.28 100.34 98.33 123.59 115.05 116.87
𝒜{\mathcal{L}}{\mathcal{A}}v-DSW 130.21 127.00 96.75 98.09 132.33 114.11 116.41
Table 6: Reconstruction results of SW and 𝒜{\mathcal{L}}{\mathcal{A}}v-DSW when changing LL. CD and SWD are multiplied by 100.
Method LL CD (102,)(10^{-2},\downarrow) SW (102,)(10^{-2},\downarrow) EMD ()(\downarrow) Time
SW 50 0.67 ±\pm 0.00 90.17 ±\pm 2.97 190.97 ±\pm 1.87 100
100 0.68 ±\pm 0.01 89.61 ±\pm 3.88 191.12 ±\pm 2.88 107
200 0.67 ±\pm 0.00 89.54 ±\pm 4.57 191.21 ±\pm 3.87 111
500 0.67 ±\pm 0.01 88.20 ±\pm 4.22 190.14 ±\pm 2.35 142
𝒜{\mathcal{L}}{\mathcal{A}}v-DSW 50 0.68 ±\pm 0.01 85.88 ±\pm 4.03 188.80 ±\pm 2.55 133
100 0.68 ±\pm 0.00 81.03 ±\pm 0.18 185.26 ±\pm 0.31 123
Table 7: Reconstruction results of v-DSW when changing the number of projected sub-gradient ascent iterations (T). CD and SWD are multiplied by 100.
Method CD (102,)(10^{-2},\downarrow) SW (102,)(10^{-2},\downarrow) EMD ()(\downarrow)
v-DSW (T = 0) 0.67 ±\pm 0.01 88.63 ±\pm 2.30 189.81 ±\pm 1.19
v-DSW (T = 1) 0.67 ±\pm 0.01 87.29 ±\pm 1.49 188.52 ±\pm 1.47
v-DSW (T = 10) 0.68 ±\pm 0.00 87.44 ±\pm 1.07 189.97 ±\pm 1.04
v-DSW (T = 50) 0.67 ±\pm 0.00 85.03 ±\pm 3.31 187.75 ±\pm 2.00
𝒜{\mathcal{L}}{\mathcal{A}}v-DSW 0.68 ±\pm 0.00 81.03 ±\pm 0.18 185.26 ±\pm 0.31

Appendix D Additional experimental results

Point-cloud reconstruction: Table 4 illustrates the full quantitative results of the point-cloud reconstruction experiment. For Max-SW and v-DSW, we vary the number of gradient iterations T in {1,10,50}\{1,10,50\}. Because CD is not a proper distance, we choose the best number of iterations based on the SW and EMD losses (we prioritize the EMD loss first, then SW). As can be seen from the table, increasing the number of gradient ascent iterations (T)(T) improves the reconstruction performance of Max-SW and v-DSW but comes with the cost of additional computation, especially for v-DSW. However, for all choices of T, the reconstruction performance (measured in SW and EMD) of both Max-SW and v-DSW is generally worse than that of our amortized methods. In addition, our amortized methods have smaller standard deviations over 3 runs, thus they are more stable than conventional optimization using gradient ascent. The qualitative results are given in Figure 6, and the corresponding quantitative results in EMD are given in Table 5. It can be seen that our amortized v-DSW methods have more favorable performance.

Table 8: Reconstruction results of 𝒜{\mathcal{L}}{\mathcal{A}}v-DSW when changing κ\kappa. CD and SWD are multiplied by 100.
κ\kappa CD (102,)(10^{-2},\downarrow) SW (102,)(10^{-2},\downarrow) EMD ()(\downarrow)
0.1 0.67 ±\pm 0.00 81.88 ±\pm 1.09 185.30 ±\pm 0.94
1 0.68 ±\pm 0.00 81.03 ±\pm 0.18 185.26 ±\pm 0.31
10 0.85 ±\pm 0.01 96.01 ±\pm 4.24 208.46 ±\pm 4.03
Table 9: Performance comparison of point-cloud generation on the chair category of ShapeNet. For v-DSW and Max-SW, T denotes the number of projected sub-gradient ascent iterations. In Table 3, v-DSW and Max-SW have T = 50 and 10 iterations, respectively. JSD, MMD-CD, and MMD-EMD are multiplied by 100.
Method JSD (102,)(10^{-2},\downarrow) MMD (102,)(10^{-2},\downarrow) COV (%,)(\%,\uparrow) 1-NNA (%,)(\%,\downarrow)
CD EMD CD EMD CD EMD
CD 17.88 ±\pm 1.14 1.12 ±\pm 0.02 17.19 ±\pm 0.36 23.73 ±\pm 1.69 10.83 ±\pm 0.89 98.45 ±\pm 0.10 100.00 ±\pm 0.00
EMD 5.15 ±\pm 1.52 0.61 ±\pm 0.09 10.37 ±\pm 0.61 41.65 ±\pm 2.19 42.54 ±\pm 2.42 87.76 ±\pm 1.46 87.30 ±\pm 1.22
SW 1.56 ±\pm 0.06 0.72 ±\pm 0.02 10.80 ±\pm 0.11 38.55 ±\pm 0.43 45.35 ±\pm 0.48 89.91 ±\pm 1.17 88.28 ±\pm 0.70
Max-SW (T = 1) 1.74 ±\pm 0.22 0.78 ±\pm 0.05 11.05 ±\pm 0.31 39.39 ±\pm 2.28 46.82 ±\pm 0.79 92.15 ±\pm 0.95 90.20 ±\pm 0.87
Max-SW (T = 10) 1.63 ±\pm 0.32 0.74 ±\pm 0.01 10.84 ±\pm 0.08 40.47 ±\pm 1.04 47.81 ±\pm 0.78 91.46 ±\pm 0.72 89.93 ±\pm 0.86
Max-SW (T = 50) 1.57 ±\pm 0.26 0.80 ±\pm 0.05 11.25 ±\pm 0.34 37.81 ±\pm 1.69 46.23 ±\pm 0.64 92.15 ±\pm 0.72 90.35 ±\pm 0.28
ASW 1.75 ±\pm 0.38 0.78 ±\pm 0.05 11.27 ±\pm 0.38 38.16 ±\pm 2.15 45.45 ±\pm 1.40 91.21 ±\pm 0.40 89.36 ±\pm 0.40
v-DSW (T = 1) 1.84 ±\pm 0.17 0.75 ±\pm 0.03 11.02 ±\pm 0.21 38.26 ±\pm 1.46 45.35 ±\pm 1.70 90.08 ±\pm 0.48 87.81 ±\pm 0.16
v-DSW (T = 10) 1.48 ±\pm 0.17 0.77 ±\pm 0.02 11.09 ±\pm 0.09 37.22 ±\pm 0.96 43.77 ±\pm 0.39 90.40 ±\pm 1.05 88.87 ±\pm 1.04
v-DSW (T = 50) 1.79 ±\pm 0.17 0.72 ±\pm 0.02 10.73 ±\pm 0.20 37.76 ±\pm 0.71 45.49 ±\pm 1.37 90.23 ±\pm 0.13 88.33 ±\pm 0.95
{\mathcal{L}}v-DSW (ours) 1.67 ±\pm 0.07 0.77 ±\pm 0.04 11.10 ±\pm 0.33 37.91 ±\pm 1.84 45.64 ±\pm 2.30 90.42 ±\pm 0.53 88.82 ±\pm 0.38
𝒢{\mathcal{G}}v-DSW (ours) 1.56 ±\pm 0.22 0.75 ±\pm 0.02 10.99 ±\pm 0.11 37.81 ±\pm 1.70 45.69 ±\pm 0.46 90.32 ±\pm 0.38 88.26 ±\pm 0.28
𝒩{\mathcal{N}}v-DSW (ours) 1.44 ±\pm 0.06 0.75 ±\pm 0.02 10.95 ±\pm 0.09 38.40 ±\pm 1.34 46.28 ±\pm 2.06 90.15 ±\pm 0.80 88.65 ±\pm 0.82
𝒜{\mathcal{E}}{\mathcal{A}}v-DSW (ours) 1.73 ±\pm 0.21 0.71 ±\pm 0.04 10.70 ±\pm 0.26 40.03 ±\pm 1.28 48.01 ±\pm 1.07 89.98 ±\pm 0.57 88.55 ±\pm 0.38
𝒜{\mathcal{L}}{\mathcal{A}}v-DSW (ours) 1.54 ±\pm 0.09 0.72 ±\pm 0.03 10.74 ±\pm 0.35 40.62 ±\pm 1.39 45.84 ±\pm 1.23 89.44 ±\pm 0.28 87.79 ±\pm 0.37

On the number of projections (LL). In our experiments, LL is fixed to 100 as in the ASW paper. Here, we conduct an ablation study on the number of projections LL and report the results in Table 6. As can be seen from the table, increasing the number of projections improves the performance in terms of EMD but comes with extra running time. We see that 𝒜{\mathcal{L}}{\mathcal{A}}v-DSW with L=50L=50 and L=100L=100 are faster than SW with L=500L=500 while being better in terms of the SW and EMD evaluation metrics. Compared to SW with L=200L=200, 𝒜{\mathcal{L}}{\mathcal{A}}v-DSW with L=50L=50 has approximately the same computational time while achieving lower SW and EMD evaluation metrics.

On the choice of location vector ϵ\epsilon. We would like to recall that the optimal location vector ϵ\epsilon^{\star} of v-DSW is computed using Algorithm 4 in Appendix A.3. To show its effectiveness, we compare it with a random location ϵ\epsilon, i.e., T = 0. Table 7 illustrates that optimizing the location parameter of the vMF distribution helps to improve the reconstruction. Moreover, our amortized optimization still gives better reconstruction scores than both the randomly initialized location ϵ\epsilon and the location optimized with the conventional method. Therefore, using amortized optimization is indeed beneficial.

On the choice of parameter κ\kappa. We would first like to recall that κ\kappa is set to 1 for all v-DSW and amortized v-DSW methods in our experiments. In practice, the parameter κ\kappa can be chosen by a grid search. Here, we conduct an ablation study by varying κ{0.1,1,10}\kappa\in\{0.1,1,10\} for 𝒜{\mathcal{L}}{\mathcal{A}}v-DSW and report the results in Table 8. As can be seen from the table, κ=1\kappa=1 yields the best EMD.

Point-cloud generation. We summarize the full quantitative results for point-cloud generation in Table 9. For Max-SW and v-DSW, we again vary the number of gradient iterations TT in {1,10,50}\{1,10,50\}. Note that 𝒜{\mathcal{A}}v-DSW cannot be used in this experiment because it runs out of memory, while the performance of amortized Max-SW is poor; therefore, their results are not reported. As can be seen from the table, amortized v-DSW methods achieve the best performance in 7 out of 7 metrics. Using more than one gradient ascent iteration (T{10,50})(T\in\{10,50\}) does improve the generation performance of Max-SW and v-DSW but comes with the cost of additional computation.