
Principal Geodesic Analysis for Probability Measures under the Optimal Transport Metric

Vivien Seguy
Graduate School of Informatics
Kyoto University
vivien.seguy@iip.ist.i.kyoto-u.ac.jp
Marco Cuturi
Graduate School of Informatics
Kyoto University
mcuturi@i.kyoto-u.ac.jp
Abstract

Given a family of probability measures in P(\mathcal{X}), the space of probability measures on a Hilbert space \mathcal{X}, our goal in this paper is to highlight one or more curves in P(\mathcal{X}) that summarize efficiently that family. We propose to study this problem under the optimal transport (Wasserstein) geometry, using curves that are restricted to be geodesic segments under that metric. We show that concepts that play a key role in Euclidean PCA, such as data centering or orthogonality of principal directions, find a natural equivalent in the optimal transport geometry, using Wasserstein means and differential geometry. The implementation of these ideas is, however, computationally challenging. To achieve scalable algorithms that can handle thousands of measures, we propose to use a relaxed definition for geodesics and regularized optimal transport distances. The interest of our approach is demonstrated on images seen either as shapes or color histograms.

1 Introduction

Optimal transport distances (Villani, 2008), a.k.a. Wasserstein or earth mover's distances, define a powerful geometry to compare probability measures supported on a metric space \mathcal{X}. The Wasserstein space P(\mathcal{X}), the space of probability measures on \mathcal{X} endowed with the Wasserstein distance, is a metric space which has received ample interest from a theoretical perspective. Given the prominent role played by probability measures and feature histograms in machine learning, the properties of P(\mathcal{X}) can also have practical implications in data science. This was shown by Agueh and Carlier (2011), who first described Wasserstein means of probability measures. Wasserstein means have been recently used in Bayesian inference (Srivastava et al., 2015), clustering (Cuturi and Doucet, 2014), graphics (Solomon et al., 2015) or brain imaging (Gramfort et al., 2015). When \mathcal{X} is not just metric but also a Hilbert space, P(\mathcal{X}) is an infinite-dimensional Riemannian manifold (Ambrosio et al. 2006, Chap. 8; Villani 2008, Part II). Three recent contributions by Boissard et al. (2015, §5.2), Bigot et al. (2015) and Wang et al. (2013) exploit directly or indirectly this structure to extend Principal Component Analysis (PCA) to P(\mathcal{X}). These important seminal papers are, however, limited in their applicability and/or the type of curves they output. Our goal in this paper is to propose more general and scalable algorithms to carry out Wasserstein principal geodesic analysis on probability measures, and not simply dimensionality reduction as explained below.

Principal Geodesics in P(\mathcal{X}) vs. Dimensionality Reduction on P(\mathcal{X})

We provide in Fig. 1 a simple example that illustrates the motivation of this paper, and which also shows how our approach differentiates itself from existing dimensionality reduction algorithms (linear and non-linear) that draw inspiration from PCA. As shown in Fig. 1, linear PCA cannot produce components that remain in P(\mathcal{X}). Even more advanced tools, such as those proposed by Hastie and Stuetzle (1989), fall slightly short of that goal. On the other hand, Wasserstein geodesic analysis yields geodesic components in P(\mathcal{X}) that are easy to interpret and which can also be used to reduce dimensionality.

[Figure: the dataset drawn on P(\mathcal{X}), with its Wasserstein principal geodesics, Euclidean principal components and principal curve.]
Figure 1: (top-left) Dataset: 60\times 60 images of a single Chinese character randomly translated, scaled and slightly rotated (36 images displayed out of the 300 used). Each image is handled as a normalized histogram of 3,600 non-negative intensities. (middle-left) Dataset schematically drawn on P(\mathcal{X}). The Wasserstein principal geodesics of this dataset are depicted in red, its Euclidean components in blue, and its principal curve (Verbeek et al., 2002) in yellow. (right) Actual curves (blue colors depict negative intensities, green intensities \geq 1). Neither the Euclidean components nor the principal curve belong to P(\mathcal{X}), nor can they be interpreted as meaningful axes of variation.

Foundations of PCA and Riemannian Extensions

Carrying out PCA on a family (x_1,\dots,x_N) of points taken in a space X can be described in abstract terms as: (i) define a mean element \bar{x} for that dataset; (ii) define a family of components in X, typically geodesic curves, that contain \bar{x}; (iii) fit a component by making it follow the x_i's as closely as possible, in the sense that the sum of the distances of each point x_i to that component is minimized; (iv) fit additional components by iterating step (iii) several times, with the added constraint that each new component is different (orthogonal) enough to the previous components. When X is Euclidean and the x_i's are vectors in \mathbb{R}^d, the (n+1)-th component v_{n+1} can be computed iteratively by solving:

v_{n+1} \in \underset{v\in V_n^{\perp},\,\|v\|_2=1}{\operatorname{argmin}} \sum_{i=1}^{N} \min_{t\in\mathbb{R}} \|x_i-(\bar{x}+tv)\|_2^2, \quad\text{where } V_0 \stackrel{\text{def.}}{=} \emptyset, \text{ and } V_n \stackrel{\text{def.}}{=} \operatorname{span}\{v_1,\dots,v_n\}. (1)

Since PCA is known to boil down to a simple eigen-decomposition when X is Euclidean or Hilbertian (Schölkopf et al., 1997), Eq. (1) looks artificially complicated. This formulation is, however, extremely useful to generalize PCA to Riemannian manifolds (Fletcher et al., 2004). This generalization proceeds first by replacing vector means, lines and orthogonality conditions using respectively Fréchet means (1948), geodesics, and orthogonality in tangent spaces. Riemannian PCA then builds upon the knowledge of the exponential map at each point x of the manifold X. Each exponential map \exp_x is locally bijective between the tangent space T_x of x and X. After computing the Fréchet mean \bar{x} of the dataset, the logarithmic map \log_{\bar{x}} at \bar{x} (the inverse of \exp_{\bar{x}}) is used to map all data points x_i onto T_{\bar{x}}. Because T_{\bar{x}} is a Euclidean space by definition of Riemannian manifolds, the dataset (\log_{\bar{x}}x_i)_i can be studied using Euclidean PCA. Principal geodesics in X can then be recovered by applying the exponential map to a principal component v^{\star}, \{\exp_{\bar{x}}(tv^{\star}), |t|<\varepsilon\}.
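In the Euclidean case, steps (i)-(iv) indeed reduce to an eigendecomposition; a minimal numpy sketch (function and variable names are ours) that makes the correspondence with Eq. (1) explicit:

```python
import numpy as np

def euclidean_pca(X, n_components):
    """Euclidean instance of steps (i)-(iv): rows of X are the points x_i."""
    x_bar = X.mean(axis=0)                        # (i) mean element
    C = (X - x_bar).T @ (X - x_bar) / len(X)      # empirical covariance
    eigval, eigvec = np.linalg.eigh(C)            # eigenvalues in ascending order
    # (ii)-(iv): the leading eigenvectors are the successive minimizers of
    # Eq. (1), and they are mutually orthogonal by construction.
    return x_bar, eigvec[:, ::-1][:, :n_components]

# each returned column v spans one principal line {x_bar + t v, t in R}
```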

From Riemannian PCA to Wasserstein PCA: Related Work

As remarked by Bigot et al. (2015), Fletcher et al.'s approach cannot be used as-is to define Wasserstein geodesic PCA, because P(\mathcal{X}) is infinite-dimensional and because there are no known ways to define exponential maps which are locally bijective between Wasserstein tangent spaces and the manifold of probability measures. To circumvent this problem, Boissard et al. (2015) and Bigot et al. (2015) have proposed to formulate the geodesic PCA problem directly as an optimization problem over curves in P(\mathcal{X}).

Boissard et al. and Bigot et al. study the Wasserstein PCA problem in restricted scenarios: Bigot et al. focus their attention on measures supported on \mathcal{X}=\mathbb{R}, which considerably simplifies their analysis since it is known in that case that the Wasserstein space P(\mathbb{R}) can be embedded isometrically in L^2(0,1), through quantile functions; Boissard et al. assume that each input measure has been generated from a single template density (the mean measure) which has been transformed according to one "admissible deformation" taken in a parameterized family of deformation maps. Their approach to Wasserstein PCA boils down to a functional PCA on such maps. Wang et al. proposed a more general approach: given a family of input empirical measures (\mu_1,\dots,\mu_N), they propose to compute first a "template measure" \tilde{\mu} using k-means clustering on \sum_i \mu_i. They next consider all optimal transport plans \pi_i between that template \tilde{\mu} and each of the measures \mu_i, and propose to compute the barycentric projection (see Eq. 8) of each optimal transport plan \pi_i to recover Monge maps T_i, on which standard PCA can be used. This approach is computationally attractive since it requires the computation of only one optimal transport per input measure. Its weakness lies, however, in the fact that the curves in P(\mathcal{X}) obtained by displacing \tilde{\mu} along each of these PCA directions are not geodesics in general.

Contributions and Outline

We propose a new algorithm to compute Wasserstein Principal Geodesics (WPG) in P(\mathcal{X}) for arbitrary Hilbert spaces \mathcal{X}. We use several approximations, both of the optimal transport metric and of its geodesics, to obtain tractable algorithms that can scale to thousands of measures. We provide first in §2 a review of the key concepts used in this paper, namely Wasserstein distances and means, geodesics and tangent spaces in the Wasserstein space. We propose in §3 to parameterize a Wasserstein principal component (PC) using two velocity fields defined on the support of the Wasserstein mean of all measures, and formulate the WPG problem as that of optimizing these velocity fields so that the average distance of all measures to that PC is minimal. This problem is non-convex and non-smooth. We propose to optimize smooth upper-bounds of that objective using entropy regularized optimal transport in §4. The practical interest of our approach is demonstrated in §5 on toy samples, datasets of shapes and histograms of colors.

Notations

We write \langle A,B\rangle for the Frobenius dot-product of matrices A and B. \mathbf{D}(u) is the diagonal matrix of vector u. For a mapping f:\mathcal{Y}\rightarrow\mathcal{Y}, we say that f acts on a measure \mu\in P(\mathcal{Y}) through the pushforward operator \# to define a new measure f\#\mu\in P(\mathcal{Y}). This measure is characterized by the identity (f\#\mu)(B)=\mu(f^{-1}(B)) for any Borel set B\subset\mathcal{Y}. We write p_1 and p_2 for the canonical projection operators \mathcal{X}^2\rightarrow\mathcal{X}, defined as p_1(x_1,x_2)=x_1 and p_2(x_1,x_2)=x_2.
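For the empirical measures used throughout this paper, the pushforward is very concrete: f\#\mu keeps the weights of \mu and moves its atoms. A short illustration (our own notation, not from the paper):

```python
import numpy as np

# a discrete measure mu = sum_k a_k delta_{x_k}: locations and weights
x = np.array([[0.0, 0.0], [1.0, 0.0]])   # atom locations in R^2
a = np.array([0.3, 0.7])                  # atom weights, summing to 1

f = lambda p: 2.0 * p + 1.0               # a map f : X -> X, applied to each atom
fx = f(x)                                 # f#mu has atoms f(x_k) and the same weights a
```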

2 Background on Optimal Transport

Wasserstein Distances

We start this section with the main mathematical object of this paper:

Definition 1.

(Villani, 2008, Def. 6.1) Let P(\mathcal{X}) be the space of probability measures on a Hilbert space \mathcal{X}. Let \Pi(\nu,\eta) be the set of probability measures on \mathcal{X}^2 with marginals \nu and \eta, i.e. p_1\#\pi=\nu and p_2\#\pi=\eta. The squared 2-Wasserstein distance between \nu and \eta in P(\mathcal{X}) is defined as:

W_2^2(\nu,\eta) = \underset{\pi\in\Pi(\nu,\eta)}{\inf} \int_{\mathcal{X}^2} \|x-y\|_{\mathcal{X}}^2 \, d\pi(x,y). (2)
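For two empirical measures, Eq. (2) becomes a finite linear program over the transportation polytope. A self-contained sketch (all names ours), exact but practical only for small supports:

```python
import numpy as np
from scipy.optimize import linprog

def w2_squared(x, y, a, b):
    """Squared 2-Wasserstein distance between sum_i a_i delta_{x_i} and
    sum_j b_j delta_{y_j}, with x of shape (n, d) and y of shape (m, d)."""
    M = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # squared costs
    n, m = M.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                  # row marginals: P 1 = a
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                  # column marginals: P^T 1 = b
        A_eq[n + j, j::m] = 1.0
    res = linprog(M.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)  # optimal cost and optimal coupling
```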

Wasserstein Barycenters

Given a family of N probability measures (\mu_1,\dots,\mu_N) in P(\mathcal{X}) and weights \lambda\in\mathbb{R}^N_+, Agueh and Carlier (2011) define \bar{\mu}, the Wasserstein barycenter of these measures:

\bar{\mu} \in \underset{\nu\in P(\mathcal{X})}{\operatorname{argmin}} \sum_{i=1}^N \lambda_i W_2^2(\mu_i,\nu).

Our paper relies on several algorithms which have been recently proposed (Cuturi and Doucet, 2014; Carlier et al., 2015; Benamou et al., 2015; Bonneel et al., 2015) to compute such barycenters.
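As a concrete illustration, a fixed-support sketch in the spirit of the iterative Bregman projections of Benamou et al. (2015), with entropic regularization ε; this is our own simplified transcription, not the authors' code:

```python
import numpy as np

def barycenter_ibp(cost_mats, hists, lambdas, eps=1e-2, n_iter=200):
    """Barycenter histogram on a fixed support of p points.
    cost_mats[i]: p x n_i squared-distance matrix to measure i;
    hists[i]: histogram a_i of measure i; lambdas: weights summing to 1."""
    Ks = [np.exp(-M / eps) for M in cost_mats]
    us = [np.ones(K.shape[0]) for K in Ks]
    for _ in range(n_iter):
        vs = [a / (K.T @ u) for K, u, a in zip(Ks, us, hists)]   # fit each a_i
        # project on the shared-marginal constraint: weighted geometric mean
        b = np.exp(sum(l * np.log(u * (K @ v))
                       for l, K, u, v in zip(lambdas, Ks, us, vs)))
        us = [b / (K @ v) for K, v in zip(Ks, vs)]               # fit marginal b
    return b
```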

Wasserstein Geodesics

Given two measures \nu and \eta, let \Pi^{\star}(\nu,\eta) be the set of optimal couplings for Eq. (2). Informally speaking, it is well known that if either \nu or \eta are absolutely continuous measures, then any optimal coupling \pi^{\star}\in\Pi^{\star}(\nu,\eta) is degenerate in the sense that, assuming for instance that \nu is absolutely continuous, for all x in the support of \nu only one point y\in\mathcal{X} is such that d\pi^{\star}(x,y)>0. In that case, the optimal transport is said to have no mass splitting, and there exists an optimal mapping T:\mathcal{X}\rightarrow\mathcal{X} such that \pi^{\star} can be written, using a pushforward, as \pi^{\star}=(\text{id}\times T)\#\nu. When there is no mass splitting to transport \nu to \eta, McCann's interpolant (1997):

g_t = ((1-t)\,\text{id} + tT)\#\nu, \quad t\in[0,1], (3)

defines a geodesic curve in the Wasserstein space, i.e. (g_t)_t is locally the shortest path between any two measures located on the geodesic, with respect to W_2. In the more general case, where no optimal map T exists and mass splitting occurs (for some locations x one may have d\pi^{\star}(x,y)>0 for several y), a geodesic can still be defined, but it relies on the optimal plan \pi^{\star} instead: g_t = ((1-t)p_1 + tp_2)\#\pi^{\star}, t\in[0,1] (Ambrosio et al., 2006, §7.2). Both cases are shown in Fig. 2.

Figure 2: Both plots display geodesic curves between two empirical measures \nu and \eta on \mathbb{R}^2. An optimal map exists in the left plot (no mass splitting occurs), whereas some of the mass of \nu needs to be split to be transported onto \eta in the right plot.
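In the discrete case, both formulas amount to sampling the coupling; a sketch (names ours) of a point g_t of the geodesic, given any optimal plan P between the two measures:

```python
import numpy as np

def geodesic_point(x, y, P, t):
    """g_t = ((1-t) p_1 + t p_2)#pi for empirical measures: one atom of mass
    P_ij at (1-t) x_i + t y_j for every pair carrying mass.  When P has a
    single nonzero entry per row, this is exactly McCann's interpolant (3)."""
    i, j = np.nonzero(P > 1e-12)
    return (1 - t) * x[i] + t * y[j], P[i, j]   # locations and weights of g_t
```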

Tangent Space and Tangent Vectors

We briefly describe in this section the tangent spaces of P(\mathcal{X}), and refer to (Ambrosio et al., 2006, Chap. 8) for more details. Let \mu : I\subset\mathbb{R} \rightarrow P(\mathcal{X}) be a curve in P(\mathcal{X}). For a given time t, the tangent space of P(\mathcal{X}) at \mu_t is a subset of L^2(\mu_t,\mathcal{X}), the space of square-integrable velocity fields supported on \text{Supp}(\mu_t). At any t, there exist tangent vectors v_t in L^2(\mu_t,\mathcal{X}) such that \lim_{h\to 0} W_2(\mu_{t+h},(\text{id}+hv_t)\#\mu_t)/|h| = 0. Given a geodesic curve in P(\mathcal{X}) parameterized as in Eq. (3), its corresponding tangent vector at time zero is v = T-\text{id}.

3 Wasserstein Principal Geodesics

Geodesic Parameterization

The goal of principal geodesic analysis is to define geodesic curves in P(\mathcal{X}) that go through the mean \bar{\mu} and which pass close enough to all target measures \mu_i. To that end, geodesic curves can be parameterized with two end points \nu and \eta. However, to avoid dealing with the constraint that a principal geodesic needs to go through \bar{\mu}, one can start instead from \bar{\mu}, and consider a velocity field v\in L^2(\bar{\mu},\mathcal{X}) which displaces all of the mass of \bar{\mu} in both directions:

g_t(v) \stackrel{\text{def.}}{=} \left(\text{id}+tv\right)\#\bar{\mu}, \quad t\in[-1,1]. (4)

Lemma 7.2.1 of Ambrosio et al. (2006) implies that any geodesic going through \bar{\mu} can be written as in Eq. (4). Hence, we do not lose any generality using this parameterization. However, given an arbitrary vector field v, the curve (g_t(v))_t is not necessarily a geodesic. Indeed, the maps \text{id}\pm v are not necessarily in the set \mathcal{C}_{\bar{\mu}} \stackrel{\text{def.}}{=} \{r\in L^2(\bar{\mu},\mathcal{X}) \,|\, (\text{id}\times r)\#\bar{\mu}\in\Pi^{\star}(\bar{\mu},r\#\bar{\mu})\} of maps that are optimal when moving mass away from \bar{\mu}. Ensuring, at each step of our algorithm, that v remains such that (g_t(v))_t is a geodesic curve is therefore particularly challenging. To relax this strong assumption, we propose to use a generalized formulation of geodesics, which builds upon not one but two velocity fields, as introduced by Ambrosio et al. (2006, §9.2):

Definition 2.

(adapted from (Ambrosio et al., 2006, §9.2)) Let \sigma, \nu, \eta\in P(\mathcal{X}), and assume there is an optimal mapping T^{(\sigma,\nu)} from \sigma to \nu and an optimal mapping T^{(\sigma,\eta)} from \sigma to \eta. A generalized geodesic between \nu and \eta with base \sigma, illustrated in Fig. 3, is defined by

g_t = \left((1-t)T^{(\sigma,\nu)} + tT^{(\sigma,\eta)}\right)\#\sigma, \quad t\in[0,1].

Choosing \bar{\mu} as the base measure in Definition 2, and two fields v_1, v_2 such that \text{id}-v_1, \text{id}+v_2 are optimal mappings (in \mathcal{C}_{\bar{\mu}}), we can define the following generalized geodesic g_t(v_1,v_2):

g_t(v_1,v_2) \stackrel{\text{def.}}{=} \left(\text{id} - v_1 + t(v_1+v_2)\right)\#\bar{\mu}, \text{ for } t\in[0,1]. (5)

Generalized geodesics become true geodesics when v_1 and v_2 are positively proportional. We can thus consider a regularizer that controls the deviation from that property by defining \Omega(v_1,v_2)=(\langle v_1,v_2\rangle_{L^2(\bar{\mu},\mathcal{X})}-\|v_1\|_{L^2(\bar{\mu},\mathcal{X})}\|v_2\|_{L^2(\bar{\mu},\mathcal{X})})^2, which is minimal when v_1 and v_2 are indeed positively proportional. We can now formulate the WPG problem as computing, for n\geq 0, the (n+1)^{\text{th}} principal (generalized) geodesic component of a family of measures (\mu_i)_i by solving, with \lambda>0:

\min_{v_1,v_2\in L^2(\bar{\mu},\mathcal{X})} \lambda\Omega(v_1,v_2) + \sum_{i=1}^N \min_{t\in[0,1]} W_2^2\left(g_t(v_1,v_2),\mu_i\right), \quad \text{s.t.} \begin{cases} \text{id}-v_1,\ \text{id}+v_2\in\mathcal{C}_{\bar{\mu}},\\ v_1+v_2\in\operatorname{span}(\{v_1^{(i)}+v_2^{(i)}\}_{i\leq n})^{\perp}. \end{cases} (6)
Figure 3: Generalized geodesic interpolation between two empirical measures \nu and \eta using the base measure \sigma, all defined on \mathcal{X}=\mathbb{R}^2.
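Anticipating the discrete parameterization of §4 (barycenter locations Y, histogram b, velocity matrices V_1, V_2; all names ours), both the regularizer \Omega and points on the curve are cheap to evaluate; a minimal sketch:

```python
import numpy as np

def omega(V1, V2, b):
    """Omega(v1, v2): squared gap between <v1, v2> and ||v1|| ||v2|| in
    L^2(mu_bar), whose inner product weights each atom k by b_k."""
    dot = np.sum(b * np.sum(V1 * V2, axis=0))
    n1 = np.sqrt(np.sum(b * np.sum(V1 * V1, axis=0)))
    n2 = np.sqrt(np.sum(b * np.sum(V2 * V2, axis=0)))
    return (dot - n1 * n2) ** 2

def geodesic_support(Y, V1, V2, t):
    """Locations of g_t(V1, V2) in Eq. (5); the weights remain b."""
    return Y - V1 + t * (V1 + V2)
```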

This problem is not convex in v_1, v_2. We propose to find an approximation of that minimum by a projected gradient descent, with a projection that is to be understood in terms of an alternative metric on the space of vector fields L^2(\bar{\mu},\mathcal{X}). To preserve the optimality of the mappings \text{id}-v_1 and \text{id}+v_2 between iterations, we introduce in the next paragraph a suitable projection operator on L^2(\bar{\mu},\mathcal{X}).

Remark 1.

A trivial way to ensure that (g_t(v))_t is geodesic is to impose that the vector field v is a translation, namely that v is uniformly equal to a vector \tau on all of \text{Supp}(\bar{\mu}). One can show in that case that the WPG problem described in Eq. (6) outputs an optimal vector \tau which is the Euclidean principal component of the family formed by the means of each measure \mu_i.

Projection on the Optimal Mapping Set

We use a projected gradient descent method to solve Eq. (6) approximately. We will compute the gradient of a local upper-bound of the objective of Eq. (6) and update v_1 and v_2 accordingly. We then need to ensure that v_1 and v_2 are such that \text{id}-v_1 and \text{id}+v_2 belong to the set of optimal mappings \mathcal{C}_{\bar{\mu}}. To do so, we would ideally want to compute the projection r_2 of \text{id}+v_2 onto \mathcal{C}_{\bar{\mu}},

r_2 = \underset{r\in\mathcal{C}_{\bar{\mu}}}{\operatorname{argmin}} \|(\text{id}+v_2)-r\|^2_{L^2(\bar{\mu},\mathcal{X})}, (7)

to update v_2 \leftarrow r_2-\text{id}. Westdickenberg (2010) has shown that the set of optimal mappings \mathcal{C}_{\bar{\mu}} is a closed convex cone in L^2(\bar{\mu},\mathcal{X}), which guarantees the existence and uniqueness of the solution of Eq. (7). However, to our knowledge there is no known method to compute the projection r_2 of \text{id}+v_2. There is nevertheless a well-known and efficient approach to find a mapping r_2 in \mathcal{C}_{\bar{\mu}} which is close to \text{id}+v_2. That approach, known as the barycentric projection, requires computing first an optimal coupling \pi^{\star} between \bar{\mu} and (\text{id}+v_2)\#\bar{\mu}, and then defining the (conditional expectation) map

T_{\pi^{\star}}(x) \stackrel{\text{def.}}{=} \int_{\mathcal{X}} y \, d\pi^{\star}(y|x). (8)

Ambrosio et al. (2006, Theorem 12.4.4) and Reich (2013, Lemma 3.1) have shown that T_{\pi^{\star}} is indeed an optimal mapping between \bar{\mu} and T_{\pi^{\star}}\#\bar{\mu}. We can thus set the velocity field as v_2 \leftarrow T_{\pi^{\star}}-\text{id} to carry out an approximate projection. We show in the supplementary material that this operator can in fact be interpreted as a projection under a pseudo-metric GW_{\bar{\mu}} on L^2(\bar{\mu},\mathcal{X}).
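In the discrete setting of the next section, Eq. (8) reduces to a single matrix product; a sketch with our own names, where P is a p\times n coupling between \bar{\mu} (histogram b) and a measure with locations X (d\times n):

```python
import numpy as np

def barycentric_projection(P, X, b):
    """T_pi(y_k) = sum_j P_kj x_j / b_k: the conditional mean of the target
    locations given source atom k; returns the d x p matrix X P^T D(1/b)."""
    return (X @ P.T) / b[None, :]
```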

4 Computing Principal Generalized Geodesics in Practice

We show in this section that when \mathcal{X}=\mathbb{R}^d, the steps outlined above can be implemented efficiently.

Input Measures and Their Barycenter

Each input measure in the family (\mu_1,\dots,\mu_N) is a finite weighted sum of Diracs, described by n_i points contained in a matrix X_i of size d\times n_i, and a (non-negative) weight vector a_i of dimension n_i summing to 1. The Wasserstein mean of these measures is assumed to be given, equal to \bar{\mu}=\sum_{k=1}^p b_k\delta_{y_k}, where the nonnegative vector b=(b_1,\dots,b_p) sums to one, and Y=[y_1,\dots,y_p]\in\mathbb{R}^{d\times p} is the matrix containing the locations of \bar{\mu}.

Generalized Geodesic

Two velocity vectors for each of the p points in \bar{\mu} are needed to parameterize a generalized geodesic. These velocity fields will be represented by two matrices V_1=[v_1^1,\dots,v_p^1] and V_2=[v_1^2,\dots,v_p^2] in \mathbb{R}^{d\times p}. Assuming that these velocity fields yield optimal mappings, the points at time t of that generalized geodesic are the measures parameterized by t,

g_t(V_1,V_2) = \sum_{k=1}^p b_k\delta_{z^t_k}, \text{ with locations } Z_t=[z^t_1,\dots,z^t_p] \stackrel{\text{def.}}{=} Y - V_1 + t(V_1+V_2).

The squared 2-Wasserstein distance between a datum \mu_i and a point g_t(V_1,V_2) on the geodesic is:

W_2^2(g_t(V_1,V_2),\mu_i) = \underset{P\in U(b,a_i)}{\min} \langle P, M_{Z_tX_i}\rangle, (9)

where U(b,a_i) is the transportation polytope \{P\in\mathbb{R}^{p\times n_i}_+,\ P\mathbf{1}_{n_i}=b,\ P^T\mathbf{1}_p=a_i\}, and M_{Z_tX_i} stands for the p\times n_i matrix of squared Euclidean distances between the p and n_i column vectors of Z_t and X_i respectively. Writing \mathbf{z}_t=\operatorname{diag}(Z_t^TZ_t) and \mathbf{x}_i=\operatorname{diag}(X_i^TX_i) for the vectors of squared norms of the columns of Z_t and X_i, we have that

M_{Z_tX_i} = \mathbf{z}_t\mathbf{1}_{n_i}^T + \mathbf{1}_p\mathbf{x}_i^T - 2Z_t^TX_i \in \mathbb{R}^{p\times n_i},

which, by taking into account the marginal conditions on P\in U(b,a_i), leads to,

\langle P, M_{Z_tX_i}\rangle = b^T\mathbf{z}_t + a_i^T\mathbf{x}_i - 2\langle P, Z_t^TX_i\rangle. (10)
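Numerically, the decomposition in Eq. (10) is also the standard way to form M_{Z_tX_i} itself; a small sketch (names ours), reused by the steps below:

```python
import numpy as np

def sq_dist_matrix(Z, X):
    """p x n matrix of squared Euclidean distances between the columns of
    Z (d x p) and X (d x n), computed as z 1^T + 1 x^T - 2 Z^T X."""
    z = np.sum(Z * Z, axis=0)            # vector of squared column norms of Z
    x = np.sum(X * X, axis=0)            # idem for X
    return z[:, None] + x[None, :] - 2.0 * Z.T @ X
```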

1. Majorization of the Distance of each \mu_i to the Principal Geodesic

Using Eq. (10), the distance between each \mu_i and the PC (g_t(V_1,V_2))_t can be cast as a function f_i of (V_1,V_2):

f_i(V_1,V_2) \stackrel{\text{def.}}{=} \min_{t\in[0,1]} \left( b^T\mathbf{z}_t + a_i^T\mathbf{x}_i + \min_{P\in U(b,a_i)} -2\langle P, \left(Y-V_1+t(V_1+V_2)\right)^TX_i\rangle \right), (11)

where we have replaced Z_t above by its explicit form in t to highlight that the objective above is quadratic convex plus piecewise linear concave as a function of t, and thus neither convex nor concave. Assume now that we are given P^{\sharp} and t^{\sharp}, approximate arg-minima for f_i(V_1,V_2). For any A, B in \mathbb{R}^{d\times p}, each distance f_i appearing in Eq. (6) then satisfies

f_i(A,B) \leqslant m_i^{V_1V_2}(A,B) \stackrel{\text{def.}}{=} \langle P^{\sharp}, M_{Z_{t^{\sharp}}X_i}\rangle, \text{ with } Z_{t^{\sharp}} = Y - A + t^{\sharp}(A+B). (12)

We can thus use a majorization-minimization procedure (Hunter and Lange, 2000) to minimize the sum of the f_i by iteratively creating majorization functions m_i^{V_1V_2} at each iterate (V_1,V_2). All functions m_i^{V_1V_2} are quadratic convex. Given that we need to ensure that these velocity fields yield optimal mappings, and that they may also need to satisfy orthogonality constraints with respect to lower-order principal components, we use gradient steps to update V_1, V_2, which can be recovered using (Cuturi and Doucet, 2014, §4.3) and the chain rule as:

\nabla_1 m_i^{V_1V_2} = 2(t^{\sharp}-1)(Z_{t^{\sharp}} - X_iP^{\sharp T}\mathbf{D}(b^{-1})), \quad \nabla_2 m_i^{V_1V_2} = 2t^{\sharp}(Z_{t^{\sharp}} - X_iP^{\sharp T}\mathbf{D}(b^{-1})). (13)

2. Efficient Approximation of P^{\sharp} and t^{\sharp}

As discussed above, gradients for the majorization functions m_i^{V_1V_2} can be obtained using approximate minima P^{\sharp} and t^{\sharp} for each function f_i. Because the objective of Eq. (11) is not convex w.r.t. t, we propose an exhaustive 1-d grid search over K values in [0,1]. In theory, this still requires solving K optimal transport problems for each of the N input measures. To carry out this step efficiently, we use entropy regularized transport (Cuturi, 2013), which allows for much faster computations and efficient parallelization to recover approximately optimal transports P^{\sharp}.
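A minimal version of that inner step, pairing a plain Sinkhorn loop in the spirit of (Cuturi, 2013) with the grid search over t; this is a sketch with our own names, reusing sq_dist_matrix above, and a log-domain stabilized implementation is preferable for small ε:

```python
import numpy as np

def sinkhorn(M, b, a, eps=1e-2, n_iter=100):
    """Approximate optimal coupling between histograms b and a for cost M."""
    K = np.exp(-M / eps)
    u = np.ones_like(b)
    for _ in range(n_iter):
        v = a / (K.T @ u)
        u = b / (K @ v)
    return u[:, None] * K * v[None, :]          # coupling diag(u) K diag(v)

def approx_P_t(Y, V1, V2, X_i, b, a_i, K_grid=10):
    """Approximate (P_sharp, t_sharp) for f_i by exhaustive search over t."""
    best_cost, best = np.inf, None
    for t in np.linspace(0.0, 1.0, K_grid):
        M = sq_dist_matrix(Y - V1 + t * (V1 + V2), X_i)
        P = sinkhorn(M, b, a_i)
        cost = np.sum(P * M)
        if cost < best_cost:
            best_cost, best = cost, (P, t)
    return best                                  # (P_sharp, t_sharp)
```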

3. Projected Gradient Update

Velocity fields are updated with a gradient stepsize \beta>0,

V_1 \leftarrow V_1 - \beta\left(\sum_{i=1}^N \nabla_1 m_i^{V_1V_2} + \lambda\nabla_1\Omega\right), \quad V_2 \leftarrow V_2 - \beta\left(\sum_{i=1}^N \nabla_2 m_i^{V_1V_2} + \lambda\nabla_2\Omega\right),

followed by a projection step to enforce that V_1 and V_2 lie in \operatorname{span}(V_1^{(1)}+V_2^{(1)},\dots,V_1^{(n)}+V_2^{(n)})^{\perp} in the L^2(\bar{\mu},\mathcal{X}) sense when computing the (n+1)^{\text{th}} PC. We finally apply the barycentric projection operator defined at the end of §3: we first compute two optimal transport plans,

P_1^{\star} \in \underset{P\in U(b,b)}{\operatorname{argmin}} \langle P, M_{Y(Y-V_1)}\rangle, \quad P_2^{\star} \in \underset{P\in U(b,b)}{\operatorname{argmin}} \langle P, M_{Y(Y+V_2)}\rangle, (14)

to form the barycentric projections, which then yield updated velocity vectors:

V_1 \leftarrow -\left((Y-V_1)P_1^{\star T}\mathbf{D}(b^{-1}) - Y\right), \quad V_2 \leftarrow (Y+V_2)P_2^{\star T}\mathbf{D}(b^{-1}) - Y. (15)

We repeat steps 1,2,3 until convergence. Pseudo-code is given in the supplementary material.
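Putting steps 1-3 together for the first PC, a sketch of one full iteration that reuses the helper functions defined above (orthogonality constraints and the \Omega gradient are omitted for brevity; this is our own transcription, not the authors' code):

```python
import numpy as np

def wpg_step(Y, V1, V2, X_list, a_list, b, beta=0.1):
    """One pass of steps 1-3 for the first principal generalized geodesic."""
    g1 = np.zeros_like(V1)
    g2 = np.zeros_like(V2)
    for X_i, a_i in zip(X_list, a_list):
        P, t = approx_P_t(Y, V1, V2, X_i, b, a_i)       # step 2
        Z = Y - V1 + t * (V1 + V2)
        D = Z - (X_i @ P.T) / b[None, :]                # shared factor in Eq. (13)
        g1 += 2.0 * (t - 1.0) * D
        g2 += 2.0 * t * D
    V1, V2 = V1 - beta * g1, V2 - beta * g2             # gradient update
    P1 = sinkhorn(sq_dist_matrix(Y, Y - V1), b, b)      # plans of Eq. (14)
    P2 = sinkhorn(sq_dist_matrix(Y, Y + V2), b, b)
    V1 = -(barycentric_projection(P1, Y - V1, b) - Y)   # updates of Eq. (15)
    V2 = barycentric_projection(P2, Y + V2, b) - Y
    return V1, V2
```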

5 Experiments

Figure 4: Wasserstein mean \bar{\mu} and first PC computed on a dataset of four (left) and three (right) empirical measures. The second PC is also displayed in the right figure.

Toy samples: We first run our algorithm on two simple synthetic examples. We consider respectively 4 and 3 empirical measures supported on a small number of locations in \mathcal{X}=\mathbb{R}^2, so that we can compute their exact Wasserstein means, using the multi-marginal linear programming formulation given in (Agueh and Carlier, 2011, §4). These measures and their mean (red squares) are shown in Fig. 4. The first principal component on the left example is able to capture both the variability of average measure locations, from left to right, and also the variability in the spread of the measure locations. On the right example, the first principal component captures the overall elliptic shape of the supports of all considered measures. The second principal component reflects the variability in the parameters of each ellipse on which measures are located. The variability in the weights of each location is also captured through the Wasserstein mean, since each single line of a generalized geodesic has a corresponding location and weight in the Wasserstein mean.

MNIST: For each of the digits ranging from 0 to 9, we sample 1,000 images from the MNIST database representing that digit. Each image, originally a 28\times 28 grayscale image, is converted into a probability distribution on that grid by normalizing each intensity by the total intensity in the image. We compute the Wasserstein mean for each digit using the approach of Benamou et al. (2015). We then follow our approach to compute the first three principal geodesics for each digit. Geodesics for four of these digits are displayed in Fig. 5 by showing intermediary (rasterized) measures on the curves. While some deformations in these curves can be attributed to relatively simple rotations around the digit center, more interesting deformations appear in some of the curves, such as the loop on the bottom left of digit 2. Our results are easy to interpret, unlike those obtained with Wang et al.'s approach (2013) on these datasets; see the supplementary material. Fig. 6 displays the first PC obtained on a subset of MNIST composed of 2,000 images of 2s and 4s in equal proportions.

[Figure: for each digit, the first three PCs (PC_1, PC_2, PC_3), each displayed from t=0 to t=1.]
Figure 5: 1,000 images for each of the digits 1, 2, 3, 4 were sampled from the MNIST database. We display above the first three PCs sampled at times t_k=k/4, k=0,\dots,4 for each of these digits.
Figure 6: First PC on a subset of MNIST composed of one thousand 2s and one thousand 4s.

Color histograms: We consider a subset of the Caltech-256 dataset composed of three image categories: waterfalls, tomatoes and tennis balls, resulting in a set of 295 color images. The pixels contained in each image can be seen as a point-cloud in the RGB color space [0,1]^3. We use k-means quantization to reduce the size of these uniform point-clouds into a set of k=128 weighted points, using cluster assignments to define the weights of each of the k cluster centroids. Each image can thus be regarded as a discrete probability measure of 128 atoms in the tridimensional RGB space. We then compute the Wasserstein barycenter of these measures supported on p=256 locations using (Cuturi and Doucet, 2014, Alg. 2). Principal components are then computed as described in §4. The computation of a single PC is performed within 15 minutes on an iMac (3.4GHz Intel Core i7). Fig. 7 displays color palettes sampled along each of the first three PCs. The first PC suggests that the main source of color variability in the dataset is illumination, each pixel going from dark to light. The second and third PCs display the variation of colors induced by the typical dominant colors of the images (blue, red, yellow). Fig. 8 displays the second PC, along with three images projected on that curve. The projection of a given image on a PC is obtained by first finding the optimal time t^{\star} at which the distance of that image to the PC is minimal, and then computing an optimal color transfer (Pitié et al., 2007) between the original image and the histogram at time t^{\star}.

Figure 7: Each row represents a PC displayed at regular time intervals from t=0 (left) to t=1 (right), from the first PC (top) to the third PC (bottom).
Figure 8: Color palettes from the second PC (t=0 on the left, t=1 on the right) displayed at times t=0, \tfrac{1}{3}, \tfrac{2}{3}, 1. Images displayed in the top row are original; their projections on the PC are displayed below, using a color transfer with the palette in the PC to which they are closest.

Conclusion We have proposed an approximate projected gradient descent method to compute generalized geodesic principal components for probability measures. Our experiments suggest that these principal geodesics may be useful to analyze shapes and distributions, and that they do not require any parameterization of shapes or deformations to be used in practice.

Acknowledgements

MC acknowledges the support of JSPS young researcher A grant 26700002.

References

  • Agueh and Carlier (2011) Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
  • Ambrosio et al. (2006) Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Springer, 2006.
  • Benamou et al. (2015) Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
  • Bigot et al. (2015) Jérémie Bigot, Raúl Gouet, Thierry Klein, and Alfredo López. Geodesic PCA in the Wasserstein space by convex PCA. Annales de l'Institut Henri Poincaré B: Probability and Statistics, 2015.
  • Boissard et al. (2015) Emmanuel Boissard, Thibaut Le Gouic, and Jean-Michel Loubes. Distribution's template estimate with Wasserstein metrics. Bernoulli, 21(2):740–759, 2015.
  • Bonneel et al. (2015) Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
  • Carlier et al. (2015) Guillaume Carlier, Adam Oberman, and Edouard Oudet. Numerical methods for matching for teams and Wasserstein barycenters. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6):1621–1642, 2015.
  • Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
  • Cuturi and Doucet (2014) Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 685–693, 2014.
  • Fletcher et al. (2004) P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004.
  • Fréchet (1948) Maurice Fréchet. Les éléments aléatoires de nature quelconque dans un espace distancié. In Annales de l'Institut Henri Poincaré, volume 10, pages 215–310. Presses Universitaires de France, 1948.
  • Gramfort et al. (2015) Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of neuroimaging data. In Information Processing in Medical Imaging (IPMI). Springer, 2015.
  • Hastie and Stuetzle (1989) Trevor Hastie and Werner Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502–516, 1989.
  • Hunter and Lange (2000) David R. Hunter and Kenneth Lange. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9(1):60–77, 2000.
  • McCann (1997) Robert J. McCann. A convexity principle for interacting gases. Advances in Mathematics, 128(1):153–179, 1997.
  • Pitié et al. (2007) François Pitié, Anil C. Kokaram, and Rozenn Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1):123–137, 2007.
  • Reich (2013) Sebastian Reich. A nonparametric ensemble transform method for Bayesian inference. SIAM Journal on Scientific Computing, 35(4):A2013–A2024, 2013.
  • Schölkopf et al. (1997) Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In Artificial Neural Networks—ICANN'97, pages 583–588. Springer, 1997.
  • Solomon et al. (2015) Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (Proc. SIGGRAPH 2015), to appear, 2015.
  • Srivastava et al. (2015) Sanvesh Srivastava, Volkan Cevher, Quoc Tran-Dinh, and David B. Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 912–920, 2015.
  • Verbeek et al. (2002) Jakob J. Verbeek, Nikos Vlassis, and Ben Kröse. A k-segments algorithm for finding principal curves. Pattern Recognition Letters, 23(8):1009–1017, 2002.
  • Villani (2008) Cédric Villani. Optimal Transport: Old and New, volume 338. Springer, 2008.
  • Wang et al. (2013) Wei Wang, Dejan Slepčev, Saurav Basu, John A. Ozolek, and Gustavo K. Rohde. A linear optimal transportation framework for quantifying and visualizing variations in sets of images. International Journal of Computer Vision, 101(2):254–269, 2013.