
On the representation and learning of monotone triangular transport maps

Ricardo Baptista (rsb@caltech.edu), California Institute of Technology, Pasadena, CA 91125 USA
Youssef Marzouk (ymarz@mit.edu), Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Olivier Zahm (olivier.zahm@inria.fr), Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
Abstract

Transportation of measure provides a versatile approach for modeling complex probability distributions, with applications in density estimation, Bayesian inference, generative modeling, and beyond. Monotone triangular transport maps—approximations of the Knothe–Rosenblatt (KR) rearrangement—are a canonical choice for these tasks. Yet the representation and parameterization of such maps have a significant impact on their generality and expressiveness, and on properties of the optimization problem that arises in learning a map from data (e.g., via maximum likelihood estimation). We present a general framework for representing monotone triangular maps via invertible transformations of smooth functions. We establish conditions on the transformation such that the associated infinite-dimensional minimization problem has no spurious local minima, i.e., all local minima are global minima; and we show for target distributions satisfying certain tail conditions that the unique global minimizer corresponds to the KR map. Given a sample from the target, we then propose an adaptive algorithm that estimates a sparse semi-parametric approximation of the underlying KR map. We demonstrate how this framework can be applied to joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models, with stable generalization performance across a range of sample sizes.

Keywords: Knothe–Rosenblatt rearrangement, normalizing flows, monotone functions, infinite-dimensional optimization, adaptive approximation, multivariate polynomials, wavelets, density estimation

MSC subject classifications: 65C20, 49Q22, 62G07, 41A10

1 Introduction

Many sampling, estimation, and inference algorithms seek to characterize a somehow intractable or complex probability distribution 𝝁\bm{\mu} on d\mathbb{R}^{d}. Transportation of measure provides a useful and versatile approach to this problem. The underlying idea is to construct a coupling of 𝝁\bm{\mu} with a tractable “reference” distribution 𝝂\bm{\nu} on d\mathbb{R}^{d}—for instance, a standard normal. Formally, one jointly constructs a pair of random variables (𝑿,𝒁)(\bm{X},\bm{Z}) such that 𝑿𝝁\bm{X}\sim\bm{\mu} and 𝒁𝝂\bm{Z}\sim\bm{\nu}. A special class of couplings is given by deterministic transformations S:ddS\colon\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} such that S(𝑿)=𝒁S(\bm{X})=\bm{Z} in distribution, and a transformation SS that satisfies this property is called a transport map [69]. As a result, if the transport map is invertible, one can generate realizations of 𝑿\bm{X} by first simulating 𝒁\bm{Z}. If 𝝁\bm{\mu} and 𝝂\bm{\nu} have densities π\pi and η\eta with respect to a common base measure, one can also explicitly represent the target density π\pi as a transformation of the reference density η\eta. The construction of such transport maps has found numerous applications: density estimation [62, 2, 18, 21]; variational Bayesian inference [20, 50, 10]; generative modeling of images, video, and other structured objects [42, 27]; likelihood-free inference [43, 47, 33]; and beyond.

In general, there exist infinitely many transport maps between two absolutely continuous distributions on d\mathbb{R}^{d}. In this paper, we focus on a specific, canonical choice: triangular transport maps [8] of the form

S(\bm{x})=\begin{bmatrix} S_{1}(x_{1})\\ S_{2}(x_{1},x_{2})\\ \vdots\\ S_{d}(x_{1},\ldots,x_{d})\end{bmatrix}, \qquad (1)

where each component function SkS_{k} depends only on the kk variables 𝒙k(x1,,xk)\bm{x}_{\leq k}\coloneqq(x_{1},\ldots,x_{k}) and is monotone increasing with respect to the last input xkx_{k}. In particular, each function SkS_{k} encodes an increasing rearrangement [69] between ordered marginal conditionals of 𝝁\bm{\mu} and 𝝂\bm{\nu}, i.e., 𝝁(dxk|𝒙<k)\bm{\mu}(\mathrm{d}x_{k}|\bm{x}_{<k}) and 𝝂(dzk|𝒛<k)\bm{\nu}(\mathrm{d}z_{k}|\bm{z}_{<k}). Later we will discuss properties of such triangular maps—known also as Knothe–Rosenblatt (KR) rearrangements [69, 51, 53]—more precisely, but we comment here on their utility. First of all, triangular structure facilitates computational tractability: SS is easy to invert and the determinant of its Jacobian (a lower triangular matrix) is easy to evaluate (see Section 2). Triangular maps have thus been used extensively in the density estimation, inference, and generative modeling applications noted above. For example, triangular maps are the building blocks of many normalizing flows, popularized by the machine learning community [28, 44]; specifically, autoregressive normalizing flows define triangular maps [25] via particular structural choices and parameterizations, and compose these maps to produce more expressive transformations. More fundamentally, because triangular maps expose certain conditionals of 𝝁\bm{\mu}, they are particularly well suited to conditional density estimation and conditional sampling; we will describe this link in Section 2. Triangular maps also inherit sparsity from the conditional independence properties of 𝝁\bm{\mu} and 𝝂\bm{\nu}, as detailed in [61].
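
For intuition, consider a case where the KR rearrangement is available in closed form: if \pi is a zero-mean Gaussian density with covariance \Sigma and the reference is standard normal, the KR map is the linear map S(\bm{x})=L^{-1}\bm{x}, where \Sigma=LL^{\top} is the (lower) Cholesky factorization; L^{-1} is lower triangular with positive diagonal, hence of the form (1). The short Python sketch below is illustrative only (not part of our implementation) and verifies this numerically.

```python
import numpy as np

# KR map for a zero-mean Gaussian target N(0, Sigma) with standard normal
# reference: S(x) = L^{-1} x, where Sigma = L L^T is the lower Cholesky
# factorization. S is lower triangular with positive diagonal.
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
L = np.linalg.cholesky(Sigma)
S = lambda x: np.linalg.solve(L, x)

# Check: if X ~ N(0, Sigma), then S(X) should be approximately N(0, I).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(2), Sigma, size=5000)
Z = np.linalg.solve(L, X.T).T
print(np.round(np.cov(Z.T), 2))   # close to the 2x2 identity matrix
```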

It is worth noting, of course, that other canonical choices of transport have different attractive features. For instance, optimal transport maps are invariant under relabeling of the coordinates or more general isometries on d\mathbb{R}^{d} (unlike triangular maps), and have deep links to partial differential equations [69]. But optimal transport maps are in general more challenging to represent, evaluate, and estimate from data in the continuous setting; moreover, they do not enjoy such a direct link to conditioning or to graphical models.

Many representations and finite-dimensional parameterizations of monotone triangular maps have been proposed in recent years. These include representations based on polynomials [35, 25], radial basis functions [62], neural networks of varying capacity [18, 27], and tensor decompositions [15, 16]. A core challenge in this setting is to satisfy the monotonicity constraint xkSk>0\partial_{x_{k}}S_{k}>0. For instance, one might enforce the monotonicity constraint at a finite collection of points in the support of π\pi [46], but this approach cannot in general guarantee that SS is monotone over the entire support of π\pi. Other efforts have sought to enforce monotonicity by construction—via the parameterization of SS itself. For example, [45] employs map components with affine dependence on the last variable, i.e., Sk(𝒙k)=α(𝒙<k)+exp(β(𝒙<k))xkS_{k}(\bm{x}_{\leq k})=\alpha(\bm{x}_{<k})+\exp(\beta(\bm{x}_{<k}))x_{k}, where α\alpha and β\beta are neural networks. While SS is then guaranteed to be monotone, it can only represent a restricted class of distributions 𝝁\bm{\mu}. (If 𝝂\bm{\nu} is Gaussian, then 𝝁\bm{\mu} can only be a product of Gaussian marginal conditionals.) Such representations therefore cannot consistently approximate the true KR rearrangement for general 𝝁\bm{\mu}. Recent work [63] has shown that a composition of such affine maps, interleaved with rotations and permutations, can approximate a general class of distributions, though approximation rates remain unknown. The required rotations or permutations break the overall triangular structure of the transformation, however. Alternatively, to increase the “expressiveness” of a given triangular function, [72, 22, 19] have introduced more general parametric representations of the monotone component functions SkS_{k}. For distributions with analytic densities on bounded domains, a complete approximation theory for the KR map was recently developed in [75, 76], using polynomial or ReLU neural network representations of range-constrained monotone triangular functions.

Despite these myriad proposals, relatively little attention has been paid to the structure and tractability of the optimization problem involved in learning triangular maps (e.g., in estimating a map given an i.i.d. sample {𝑿i}\{\bm{X}^{i}\} from 𝝁\bm{\mu}). Properties of this optimization problem are intimately tied to the representation and parameterization of the associated map. It is desirable to have a flexible and general representation—one that can consistently recover the KR rearrangement for a broad class of distributions (𝝁,𝝂)(\bm{\mu},\bm{\nu})—that at the same time makes optimization tractable. It is also desirable to have adaptivity: a parameterization whose size or complexity can be adapted to properties of the target distribution and the available sample size, for good empirical statistical performance.

This paper directly addresses these desiderata. We do so by developing and analyzing a functional framework for representing and learning triangular maps. Our main contributions are as follows. We propose a general representation of monotone triangular functions, based on a rectification operator k\mathcal{R}_{k} that transforms sufficiently smooth non-monotone functions fk:kf_{k}:\mathbb{R}^{k}\to\mathbb{R} into monotone component functions SkS_{k} of a triangular map. This operator takes the form

\mathcal{R}_{k}(f_{k})(\bm{x}_{\leq k})=f_{k}(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g\big(\partial_{k}f_{k}(\bm{x}_{<k},t)\big)\,\mathrm{d}t, \qquad (2)

where g:>0g\colon\mathbb{R}\to\mathbb{R}_{>0} is a positive function. We then analyze the infinite-dimensional optimization problem associated with learning maps from data, recast as optimization over functions {fk}k=1d\{f_{k}\}_{k=1}^{d}. Specifically, we establish conditions on the rectification operator and on the target distribution such that the resulting optimization problem is well-posed, smooth, and has no spurious local minima. Under further conditions on the target distribution (essentially that it has Gaussian tails), we show that the optimization problem has a unique global minimizer corresponding to the KR map.

These theoretical results guarantee, in practice, fast and reliable learning of monotone triangular maps given an appropriate function space VkV_{k} for each fkf_{k}. The second main contribution of our paper is algorithmic: given a hierarchical basis (e.g., polynomials or wavelets) for each VkV_{k}, we propose a greedy adaptive procedure to learn parametric representations of fkf_{k}. The procedure naturally produces map representations that are sparse and interpretable—in particular, it exploits and implicitly discovers conditional independence. We use these learned maps for density estimation, given an i.i.d. sample {𝑿i}\{\bm{X}^{i}\} from π\pi. Maintaining a strict triangular structure also exposes marginal conditionals of the target density, thus enabling conditional density estimation. Our numerical experiments show that the algorithm provides robust performance at small-to-moderate sample sizes, and constitutes a semi-parametric approach that naturally links map complexity to the size of the data.

The remainder of the paper is organized as follows. Section 2 recalls properties of triangular transport maps and introduces some estimation problems of interest. Our main theoretical contributions are in Section 3, which introduces a framework for representing monotone triangular maps and analyzes the resulting optimization problem. Section 4 then introduces our greedy adaptive procedure for learning maps, and Section 5 contains numerical experiments. Proofs of certain theoretical results are deferred to the appendix.

2 Triangular transport for density estimation and simulation

Consider the unsupervised learning problem of approximating a target probability density function π\pi defined on d\mathbb{R}^{d}, given an i.i.d. sample from π\pi. Our goal is to construct a sufficiently smooth and invertible map S:ddS\colon\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} such that the pullback density

S^{\sharp}\eta(\bm{x})=\eta\left(S(\bm{x})\right)|\det\nabla S(\bm{x})|, \qquad (3)

is a good approximation to π\pi, where η\eta is a simple/tractable probability density function on d\mathbb{R}^{d}. The choice of η\eta is a degree of freedom of the method, and here we take η(𝒙)exp(𝒙2/2)\eta(\bm{x})\propto\exp(-\|\bm{x}\|^{2}/2); i.e., η\eta is the density of the standard normal distribution on d\mathbb{R}^{d}, where \|\cdot\| is the canonical norm of d\mathbb{R}^{d}.
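
For concreteness, the following Python sketch (an illustration only, with user-supplied callables for the map and for the diagonal of its Jacobian) evaluates the log of the pullback density (3) for a monotone triangular map with standard normal reference.

```python
import numpy as np

def log_pullback_density(S, dSdx_diag, x):
    """Evaluate log S^# eta(x) as in Eq. (3), with eta standard normal on R^d.

    S(x)         : callable returning the d map components at x (shape (d,))
    dSdx_diag(x) : callable returning the diagonal entries d S_k / d x_k
                   (positive for a monotone triangular map)
    """
    z = S(x)
    d = z.size
    # log of the standard normal density eta(z) on R^d
    log_eta = -0.5 * np.dot(z, z) - 0.5 * d * np.log(2.0 * np.pi)
    # for a triangular map, log |det grad S(x)| = sum_k log d_k S_k(x)
    log_det = np.sum(np.log(dSdx_diag(x)))
    return log_eta + log_det

# Example: the identity map pulls eta back to itself.
S = lambda x: x
dS = lambda x: np.ones_like(x)
print(log_pullback_density(S, dS, np.zeros(2)))   # equals log eta(0) in R^2
```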

To ensure invertibility of SS, a common practice is to constrain SS to be an increasing lower triangular map of the form (1), where each component Sk:kS_{k}\colon\mathbb{R}^{k}\rightarrow\mathbb{R} is such that xkSk(𝒙<k,xk)x_{k}\mapsto S_{k}(\bm{x}_{<k},x_{k}) is increasing for all 𝒙<k=(x1,,xk1)k1\bm{x}_{<k}=(x_{1},\ldots,x_{k-1})\in\mathbb{R}^{k-1}. Such a map is easy to invert111For any 𝒛d\bm{z}\in\mathbb{R}^{d}, 𝒙=S1(𝒛)\bm{x}=S^{-1}(\bm{z}) can be computed recursively as xk=Tk(𝒙<k,zk)x_{k}=T^{k}(\bm{x}_{<k},z_{k}) for k=1,,dk=1,\dots,d, where the function Tk(𝒙<k,)T^{k}(\bm{x}_{<k},\cdot) is the inverse of xkSk(𝒙<k,xk)x_{k}\mapsto S_{k}(\bm{x}_{<k},x_{k}). In practice, evaluating TkT^{k} requires solving a root-finding problem which is guaranteed to have a unique (real) root, and for which the bisection method converges geometrically fast. Therefore, S1(𝒛)S^{-1}(\bm{z}) can be evaluated to machine precision in negligible computational time. and has a lower triangular Jacobian S(𝒙)\nabla S(\bm{x}) so that |detS(𝒙)|=k=1dkSk(𝒙k)|\det\nabla S(\bm{x})|=\prod_{k=1}^{d}\partial_{k}S_{k}(\bm{x}_{\leq k}) is readily computable. This is a major benefit of working with monotone triangular maps. This structure in fact corresponds to the Knothe–Rosenblatt (KR) rearrangement SKRS_{\text{KR}}: the increasing lower triangular map satisfying

\pi(\bm{x})=S_{\text{KR}}^{\sharp}\eta(\bm{x}).

For a measure 𝝁\bm{\mu} on d\mathbb{R}^{d} that is absolutely continuous with respect to a Gaussian measure 𝝂\bm{\nu} (and hence has a density π\pi with respect to the Lebesgue measure), the KR rearrangement SKRS_{\textrm{KR}} exists and is the unique map of the form (1) that pulls back η\eta to π\pi (or equivalently pushes forward π\pi to η\eta), up to sets of measure zero [8].

A useful way to measure the difference between π\pi and its approximation SηS^{\sharp}\eta is the Kullback–Leibler (KL) divergence 𝒟KL(π||Sη)=log(π/Sη)dπ\mathcal{D}_{\text{KL}}(\pi||S^{\sharp}\eta)=\int\log(\pi/S^{\sharp}\eta)\mathrm{d}\pi. As explained below, this choice has direct links to maximum likelihood estimation of SS. Furthermore, the following inequality shows that convergence in the KL sense 𝒟KL(π||Sη)0\mathcal{D}_{\text{KL}}(\pi||S^{\sharp}\eta)\rightarrow 0 implies convergence of SS towards SKRS_{\mathrm{KR}} in the Lπ2L^{2}_{\pi} sense. This result is a direct consequence of Corollary 3.10 in [8]; see Section A.1 for a proof.

Proposition 1.

Let SKRS_{\textrm{KR}} be the KR rearrangement pushing forward a distribution with density π\pi on d\mathbb{R}^{d} to the standard normal distribution on d\mathbb{R}^{d}, with density η\eta. For any map S:ddS:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} as in (1), we have

\int\|S_{\mathrm{KR}}(\bm{x})-S(\bm{x})\|^{2}\,\mathrm{d}\pi(\bm{x})\leq 2\,\mathcal{D}_{\text{KL}}(\pi||S^{\sharp}\eta). \qquad (4)

Since the standard normal density η\eta is a product of its marginal densities, the KL divergence decomposes as

\mathcal{D}_{\text{KL}}(\pi||S^{\sharp}\eta)=\sum_{k=1}^{d}\mathcal{J}_{k}(S_{k})-\mathcal{J}_{k}(S_{\text{KR},k}), \qquad (5)

where the functionals 𝒥1,,𝒥d\mathcal{J}_{1},\ldots,\mathcal{J}_{d} are given by

\mathcal{J}_{k}(s)=\int\left(\frac{1}{2}s(\bm{x}_{\leq k})^{2}-\log\left|\partial_{k}s(\bm{x}_{\leq k})\right|\right)\pi(\bm{x})\,\mathrm{d}\bm{x}. \qquad (6)

Minimizing the KL divergence (5) over triangular maps SS of the form (1) is therefore equivalent to independently minimizing each objective functional 𝒥k\mathcal{J}_{k} to find the associated map component SkS_{k} [35]. Solution of these optimization problems is thus embarrassingly parallel. This parallel structure was also exploited for Cholesky factorization via KL minimization in [54]. In addition, minimizing each objective s𝒥k(s)s\mapsto\mathcal{J}_{k}(s) over functions s:ks\colon\mathbb{R}^{k}\rightarrow\mathbb{R} that are strictly increasing in the last variable is a strictly convex optimization problem; see Lemma 10.

Given an i.i.d. sample {𝐗i}i=1n\{\mathbf{X}^{i}\}_{i=1}^{n} from π\pi, we can replace the expectation in (6) by the sample average, which yields the objective

\widehat{\mathcal{J}}_{k}(s)=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{2}s(\mathbf{X}^{i}_{\leq k})^{2}-\log\left|\partial_{k}s(\mathbf{X}^{i}_{\leq k})\right|\right). \qquad (7)

Minimizing (7) under the constraint ks(𝒙k)>0\partial_{k}s(\bm{x}_{\leq k})>0 produces an estimator S^k\widehat{S}_{k} of SKR,kS_{\text{KR},k}. The collection of all such component functions defines an estimator S^=(S^1,,S^d)\widehat{S}=(\widehat{S}_{1},\ldots,\widehat{S}_{d}) of the KR rearrangement, along with an estimate of the density π\pi as π^(𝒙)S^η(𝒙)\widehat{\pi}(\bm{x})\coloneqq\widehat{S}^{\sharp}\eta(\bm{x}). This S^\widehat{S} is also the maximum likelihood estimator of SKRS_{\text{KR}}, i.e.,

\widehat{S}=\operatorname*{arg\,max}_{\{S\ \text{of the form (1)},\ \partial_{k}S_{k}>0,\ k=1,\ldots,d\}}\ \sum_{i=1}^{n}\log S^{\sharp}\eta(\bm{X}^{i}),

with the optimization being over the space of monotone increasing and triangular maps of the form (1). This connection between maximum likelihood estimation and minimization of an empirical forward KL divergence is standard. Furthermore, convexity of the optimization objective is preserved when replacing the expectation in (6) by the sample average in (7).
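
The sample-average objective (7) is straightforward to transcribe. The sketch below is illustrative only; the map component s and its partial derivative with respect to the last input are assumed to be supplied as callables.

```python
import numpy as np

def empirical_objective(s, ds_dxk, X_leq_k):
    """Sample-average objective (7) for one map component.

    s       : callable, s(x_leq_k) -> scalar, increasing in the last input
    ds_dxk  : callable, partial derivative of s with respect to the last input
    X_leq_k : array of shape (n, k), first k coordinates of the sample
    """
    vals = np.array([0.5 * s(x) ** 2 - np.log(np.abs(ds_dxk(x))) for x in X_leq_k])
    return vals.mean()

# Example with k = 1 and the affine component s(x) = 2 x_1, so ds/dx_1 = 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
print(empirical_objective(lambda x: 2 * x[-1], lambda x: 2.0, X))
```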

A core question when estimating maps and densities by minimizing (7) is how to parameterize sufficiently expressive monotone map components S_{k}—i.e., maps capable of representing a wide class of distributions \pi—while ensuring that the optimization problem can be solved efficiently. As explained in the introduction, this question is intimately tied to the monotonicity constraint \partial_{k}S_{k}(\bm{x}_{\leq k})>0. For example, choosing a linear parameterization for S_{k} that admits only affine dependence on x_{k} (to easily enforce monotonicity) allows the map component to be identified efficiently through the solution of a least-squares problem (see [60, Appendix A]), but such maps can only capture distributions that factor into a sequence of Gaussian marginal conditionals. On the other hand, a more complex ansatz for S_{k} will often yield a much more difficult optimization problem. Note that with any nonlinear parameterization of S_{k}, the convexity of (6) and (7) with respect to S_{k} does not in general yield convexity in the parameters. As we shall demonstrate later, many nonlinear parameterizations that enforce monotonicity yield optimization problems that may not even be smooth, and that have many local minima. We will address these issues in Section 3.

Conditional density estimation and sampling

Another important feature of the triangular structure (1) is that each component of the map represents one marginal conditional density of π\pi. More precisely, in the present setting where η\eta is a product density, SKR,kS_{\text{KR},k} pushes forward the marginal conditional πk(xk|𝒙<k)\pi_{k}(x_{k}|\bm{x}_{<k}) to the kk-th marginal of the reference ηk(zk)\eta_{k}(z_{k}) [53, Section 2.3]. Now partition 𝒙=(𝒚,𝒘)\bm{x}=(\bm{y},\bm{w}), where 𝒚m\bm{y}\in\mathbb{R}^{m} and 𝒘p\bm{w}\in\mathbb{R}^{p}, with d=m+pd=m+p. This property of the KR map lets us estimate the conditional probability density function π(𝒘|𝒚)\pi(\bm{w}|\bm{y}), for any value of 𝒚\bm{y}, given a sample {(𝒀i,𝑾i)}i=1n\{(\bm{Y}^{i},\bm{W}^{i})\}_{i=1}^{n} from the joint density π(𝒚,𝒘)\pi(\bm{y},\bm{w}). Observe that the KR map immediately has the block structure:

S(\bm{y},\bm{w})=\begin{bmatrix} S^{\mathcal{Y}}(\bm{y})\\ S^{\mathcal{W}}(\bm{y},\bm{w})\end{bmatrix}, \qquad (8)

where 𝒚S𝒴(𝒚):mm\bm{y}\mapsto S^{\mathcal{Y}}(\bm{y})\colon\mathbb{R}^{m}\rightarrow\mathbb{R}^{m} and 𝒘S𝒲(𝒚,𝒘):pp\bm{w}\mapsto S^{\mathcal{W}}(\bm{y}^{\ast},\bm{w})\colon\mathbb{R}^{p}\rightarrow\mathbb{R}^{p} are increasing lower triangular maps, the latter for any 𝒚m\bm{y}^{\ast}\in\mathbb{R}^{m}. Recall that the reference density η\eta is the standard normal on d\mathbb{R}^{d}, and thus is a product of two standard normals, η1(𝒛)η2(𝒛)\eta_{1}(\bm{z}^{\prime})\eta_{2}(\bm{z}), where 𝒛m\bm{z}^{\prime}\in\mathbb{R}^{m} and 𝒛p\bm{z}\in\mathbb{R}^{p}. Using a KR map of the form (8), we can write the marginal density of 𝒀\bm{Y} as π(𝒚)=(SKR𝒴)η1(𝒚)\pi(\bm{y})=(S_{\text{KR}}^{\mathcal{Y}})^{\sharp}\eta_{1}(\bm{y}) and, more interestingly, the conditional density of 𝑾\bm{W} as π(𝒘|𝒚)=SKR𝒲(𝒚,)η2(𝒘)\pi(\bm{w}|\bm{y})=S^{\mathcal{W}}_{\text{KR}}(\bm{y},\cdot)^{\sharp}\eta_{2}(\bm{w}).

Each of the last pp components of the KR rearrangement, SKR𝒲,k(𝒚,𝒘k)S^{\mathcal{W},k}_{\text{KR}}(\bm{y},\bm{w}_{\leq k}), 1kp1\leq k\leq p, can be estimated from a sample {(𝒀i,𝑾i)}i=1n\{(\bm{Y}^{i},\bm{W}^{i})\}_{i=1}^{n} by minimizing

\widehat{\mathcal{J}}_{k}(s)=\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{2}s(\bm{Y}^{i},\bm{W}^{i}_{\leq k})^{2}-\log\left|\partial_{m+k}s(\bm{Y}^{i},\bm{W}^{i}_{\leq k})\right|\right), \qquad (9)

under the constraints m+ks(𝒚,𝒘k)>0\partial_{m+k}s(\bm{y},\bm{w}_{\leq k})>0. This produces an estimator S^𝒲\widehat{S}^{\mathcal{W}} of SKR𝒲S_{\text{KR}}^{\mathcal{W}}, which in turn yields an estimator of the conditional density π(𝒘|𝒚)\pi(\bm{w}|\bm{y}) as π^(𝒘|𝒚)S^𝒲(𝒚,)η2(𝒘)\widehat{\pi}(\bm{w}|\bm{y})\coloneqq\widehat{S}^{\mathcal{W}}(\bm{y},\cdot)^{\sharp}\eta_{2}(\bm{w}). This property has been used to perform conditional density estimation (CDE) in [35, 43, 3, 16].

One important application of CDE is likelihood-free inference, where 𝑾\bm{W} represents a parameter to be inferred and 𝒀\bm{Y} are data whose conditional density π(𝒚|𝒘)\pi(\bm{y}|\bm{w}) is computationally intractable or unavailable in closed form. This setting arises in many applications: inference in stochastic models, or inference in the presence of high-dimensional nuisance parameters or latent variables (including parameter inference for state-space models). Given a joint sample {(𝒀i,𝑾i)}i=1n\{(\bm{Y}^{i},\bm{W}^{i})\}_{i=1}^{n}, we can estimate the map component S𝒲S^{\mathcal{W}} and use it to simulate from the estimated conditional density π^(𝒘|𝒚)\widehat{\pi}(\bm{w}|\bm{y}^{*}) given any realization of the data 𝒚\bm{y}^{*}, simply by sampling 𝒛iη2\bm{z}^{i}\sim\eta_{2} and solving the triangular system S^𝒲(𝒚,𝒘i)=𝒛i\widehat{S}^{\mathcal{W}}(\bm{y}^{*},\bm{w}^{i})=\bm{z}^{i} for each 𝒘i\bm{w}^{i}. Note that we can simulate from or evaluate the estimated conditional densities for multiple realizations of the data 𝒚\bm{y}^{*}, including values that are not present in the dataset {𝒀i}i=1n\{\bm{Y}^{i}\}_{i=1}^{n}. Thus, learning a single map S𝒲S^{\mathcal{W}} parameterized by 𝒚\bm{y} is said to amortize the cost of conditional sampling over the data.
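
The following Python sketch illustrates this amortized conditional sampling. It assumes (as an illustration) that a fitted map \widehat{S}^{\mathcal{W}} is available as a list of callables, one per component, each increasing in its last w-coordinate; the component-wise bisection follows the inversion strategy described earlier in this section.

```python
import numpy as np
from scipy.optimize import brentq

def sample_conditional(S_W, y_star, n_samples, rng=np.random.default_rng(0)):
    """Draw samples from the estimated conditional density pi(w | y*).

    S_W    : list of p callables; S_W[k](y, w_leq_k) is increasing in w_k
    y_star : conditioning value y* in R^m
    """
    p = len(S_W)
    W = np.zeros((n_samples, p))
    for i in range(n_samples):
        z = rng.standard_normal(p)            # z^i ~ eta_2
        for k in range(p):                    # solve S^W(y*, w) = z^i component-wise
            res = lambda t: S_W[k](y_star, np.append(W[i, :k], t)) - z[k]
            a, b = -1.0, 1.0
            while res(a) > 0:                 # expand bracket to the left
                a *= 2.0
            while res(b) < 0:                 # expand bracket to the right
                b *= 2.0
            W[i, k] = brentq(res, a, b)       # monotonicity gives a unique root
    return W

# Example: p = 1 and S^W(y, w) = w - y, so that W | y* ~ N(y*, 1).
S_W = [lambda y, w: w[-1] - y[0]]
print(sample_conditional(S_W, np.array([2.0]), 5).ravel())
```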

3 Representing and learning continuous monotone functions

In this section, we define a general representation for components of monotone triangular maps and present our main theoretical results. The essential idea, as mentioned in the introduction, is to express each component SkS_{k} as a nonlinear transformation Sk=k(f)S_{k}=\mathcal{R}_{k}(f) of a smooth function ff, where k\mathcal{R}_{k} (2) is an operator that enforces the monotonicity constraint by construction. We then identify the map component by solving the re-parameterized optimization problem

\min_{f\in V_{k}}\mathcal{L}_{k}(f),\quad\text{where }\mathcal{L}_{k}(f)=\mathcal{J}_{k}(\mathcal{R}_{k}(f)), \qquad (10)

where VkV_{k} is a linear space of functions in which we seek ff. With this nonlinear transformation, we lose the convexity of the constrained problem min{s:ks>0}𝒥k(s)\min_{\{s\colon\partial_{k}s>0\}}\mathcal{J}_{k}(s), but obtain an unconstrained minimization problem instead. It turns out that, with appropriate conditions on k\mathcal{R}_{k}, the transformed optimization problem retains many desirable properties.

In Section 3.1, we discuss the construction of k\mathcal{R}_{k} and motivate certain critical choices therein. Then, in Section 3.2, we analyze the regularity of the composed objective functional k\mathcal{L}_{k} for a certain choice of the function space VkV_{k}. In Section 3.3 we discuss the existence and uniqueness of solutions to (10), and describe conditions under which solving (10) exactly recovers the KR rearrangement.

3.1 The rectification operator

Recalling (2), for any sufficiently smooth f:kf\colon\mathbb{R}^{k}\rightarrow\mathbb{R} we let k(f):k\mathcal{R}_{k}(f)\colon\mathbb{R}^{k}\rightarrow\mathbb{R} be the function defined by

\mathcal{R}_{k}(f)(\bm{x}_{\leq k})=f(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g\big(\partial_{k}f(\bm{x}_{<k},t)\big)\,\mathrm{d}t, \qquad (11)

where g:>0g\colon\mathbb{R}\rightarrow\mathbb{R}_{>0} is a positive function. We call the operator k:fk(f)\mathcal{R}_{k}\colon f\mapsto\mathcal{R}_{k}(f) a rectifier because it transforms any ff into a function which is increasing in the kk-th variable, i.e., kk(f)(𝒙k)=g(kf(𝒙k))>0\partial_{k}\mathcal{R}_{k}(f)(\bm{x}_{\leq k})=g(\partial_{k}f(\bm{x}_{\leq k}))>0. As a simple example, functions of the form f(𝒙k)=α(𝒙<k)+β(𝒙<k)xkf(\bm{x}_{\leq k})=\alpha(\bm{x}_{<k})+\beta(\bm{x}_{<k})x_{k} are transformed into k(f)(𝒙k)=α(𝒙<k)+g(β(𝒙<k))xk\mathcal{R}_{k}(f)(\bm{x}_{\leq k})=\alpha(\bm{x}_{<k})+g(\beta(\bm{x}_{<k}))x_{k}. Figure 1 illustrates the application of the rectifier to a nonlinear univariate function ff.

Figure 1: The rectifier (2) transforms the non-monotone function f into the monotone function S=\mathcal{R}(f). Here we choose g(\cdot)=\log(1+\exp(\cdot)). S is an increasing transport map that pushes forward a one-dimensional mixture of Gaussians \pi(x)=0.5\,\mathcal{N}(x;-1,1)+0.5\,\mathcal{N}(x;1,1) to the standard Gaussian reference density \eta.

The choice of the function gg in (2) has a crucial impact on properties of the optimization problem (10). One possible choice, proposed in [25, 10], is the square function g(ξ)=ξ2g(\xi)=\xi^{2}. While this choice permits closed-form computation of the integral in (11) when ff is polynomial, it yields an optimization problem (10) which possesses many spurious local minima and is in fact non-smooth; see Figure 2 for an illustration. This can be explained in part by the fact that this gg is not bijective.

Instead, one can choose gg to be a bijective function from \mathbb{R} to >0\mathbb{R}_{>0}, such as the soft-plus function,

g(\xi)=\log(1+\exp(\xi)), \qquad (12)

whose inverse is g1(ξ)=log(exp(ξ)1)g^{-1}(\xi)=\log(\exp(\xi)-1). Another example of a bijective function, considered in [72], is the shifted exponential linear unit (ELU),

g(\xi)=\begin{cases}\exp(\xi) & \xi<0\\ \xi+1 & \xi\geq 0\end{cases}, \qquad (13)

whose inverse is g1(ξ)=ξ1g^{-1}(\xi)=\xi-1 if ξ1\xi\geq 1 and g1(ξ)=log(ξ)g^{-1}(\xi)=\log(\xi) otherwise.
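
For illustration, both choices of g and the univariate version of the rectifier (11) can be implemented with one-dimensional quadrature; the sketch below is not part of our numerical scheme, and the test function f is arbitrary.

```python
import numpy as np
from scipy.integrate import quad

def softplus(xi):        # Eq. (12)
    return np.log1p(np.exp(xi))

def shifted_elu(xi):     # Eq. (13)
    return np.where(xi < 0.0, np.exp(xi), xi + 1.0)

def rectify_1d(f, df, x, g=softplus):
    """Univariate rectifier: R(f)(x) = f(0) + int_0^x g(f'(t)) dt, cf. (11)."""
    integral, _ = quad(lambda t: g(df(t)), 0.0, x)
    return f(0.0) + integral

# Example: rectify f(x) = -x^2 (non-monotone) into an increasing function.
f, df = lambda x: -x ** 2, lambda x: -2.0 * x
print([rectify_1d(f, df, x) for x in (-1.0, 0.0, 1.0)])   # increasing values
```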

As a consequence of gg being bijective, the inverse of the rectifier k1(s)\mathcal{R}_{k}^{-1}(s) exists for any sufficiently smooth s:ks\colon\mathbb{R}^{k}\rightarrow\mathbb{R} with ks(𝒙k)>0\partial_{k}s(\bm{x}_{\leq k})>0 and can be written as

\mathcal{R}_{k}^{-1}(s)(\bm{x}_{\leq k})=s(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g^{-1}\big(\partial_{k}s(\bm{x}_{<k},t)\big)\,\mathrm{d}t. \qquad (14)

More importantly, the fact that g is invertible yields an objective function \mathcal{L}_{k} that is far better behaved than with g(\xi)=\xi^{2}; see the numerical illustration in Figure 2, with both the soft-plus and the shifted ELU g. With these choices of g, we observe that \mathcal{L}_{k}=\mathcal{J}_{k}\circ\mathcal{R}_{k}, though non-convex in general, has no spurious local minima and no saddle points. In the next sections, we analyze these properties of the optimization problem (10) (e.g., smoothness, existence and uniqueness of solutions) and elucidate their precise dependence on g, V_{k}, and \pi.

Remark 1.

In the finite-dimensional setting, it can easily be shown that composing a smooth and (strictly) convex objective function with a 𝒞1\mathcal{C}^{1}-diffeomorphic map preserves the (unique) global minima of the objective. Such diffeomorphic maps have been explored for accelerating optimization on finite-dimensional manifolds; see, e.g., [31]. Establishing that the operator k\mathcal{R}_{k} in (11) is 𝒞1\mathcal{C}^{1}-diffeomorphic is, however, non-trivial. We will show in Section 3.3 that, under additional assumptions, our reparameterization yields an infinite-dimensional optimization problem where local minima are global minima.

Figure 2: Objective function \widehat{\mathcal{L}}_{1}=\widehat{\mathcal{J}}_{1}\circ\mathcal{R}_{1}, where the rectifier \mathcal{R}_{1} is defined using the soft-plus g (12) (left), the shifted ELU g (13) (middle), or the square function g(\xi)=\xi^{2} (right). Here, \pi(x)=\frac{1}{2}\mathcal{N}(x;-2,0.5)+\frac{1}{2}\mathcal{N}(x;2,2) is a univariate Gaussian mixture, and we use n=50 samples to define \widehat{\mathcal{J}}_{1}, with f_{1} represented using a linear combination of Hermite functions up to degree 10. The objective is evaluated along line segments that interpolate between random initial maps (t=0) and critical points resulting from a gradient-based optimization method (t=1). Observe that with bijective g (left and middle) the algorithm always arrives at the same optimal value, whereas with the square function g (right) the algorithm gets stuck in local minima and rarely attains the optimal value.

3.2 Smoothness of the optimization problem

In this section we give sufficient conditions on the function gg and the target density π\pi to guarantee that the objective function k\mathcal{L}_{k} is smooth over an appropriate function space for ff. We introduce this function space, VkV_{k}, and show continuity properties of the rectifier that hold for functions fVkf\in V_{k}, before stating our main result in Theorem 4.

We begin by defining the weighted Sobolev space

V_{k}=\left\{f\colon\mathbb{R}^{k}\rightarrow\mathbb{R}\ \text{such that}\ \|f\|_{V_{k}}^{2}\coloneqq\int\Big(|f(\bm{x})|^{2}+|\partial_{k}f(\bm{x})|^{2}\Big)\,\eta_{\leq k}(\bm{x})\,\mathrm{d}\bm{x}<+\infty\right\}, \qquad (15)

where ηk\eta_{\leq k} is the standard normal density on k\mathbb{R}^{k}. By [30, Theorem 1.11], this space is complete and thus is a Hilbert space. The space VkV_{k} has sufficient regularity for kf\partial_{k}f to exist, but also to permit the pointwise evaluation f(𝒙<k,0)f(\bm{x}_{<k},0), as required in the definition (11) of the rectifier. This property is formalized by the following proposition, a trace theorem, which shows that for any fVkf\in V_{k}, the function 𝒙<kf(𝒙<k,0)\bm{x}_{<k}\mapsto f(\bm{x}_{<k},0) is a function in Lη<k2L^{2}_{\eta_{<k}}, the η<k\eta_{<k}-weighted space of square integrable functions.

Proposition 2.

There exists a constant CT<C_{T}<\infty such that for any fVkf\in V_{k}

\int f(\bm{x}_{<k},0)^{2}\,\eta_{<k}(\bm{x})\,\mathrm{d}\bm{x}\leq C_{T}\|f\|_{V_{k}}^{2}. \qquad (16)
Proof.

See Appendix A.3. ∎

Remark 2.

Notice that Hηk1VkLηk2H^{1}_{\eta_{\leq k}}\subset V_{k}\subset L^{2}_{\eta_{\leq k}}, where Lηk2={f:k:fLηk22f2dηk<}L^{2}_{\eta_{\leq k}}=\{f\colon\mathbb{R}^{k}\rightarrow\mathbb{R}:\|f\|_{L^{2}_{\eta_{\leq k}}}^{2}\coloneqq\int f^{2}\mathrm{d}\eta_{\leq k}<\infty\} and Hηk1={f:k:fHηk12f2+f2dηk<}H^{1}_{\eta_{\leq k}}=\{f\colon\mathbb{R}^{k}\rightarrow\mathbb{R}:\|f\|_{H^{1}_{\eta_{\leq k}}}^{2}\coloneqq\int f^{2}+\|\nabla f\|^{2}\mathrm{d}\eta_{\leq k}<\infty\} are the standard weighted Sobolev spaces. Given that the standard normal density factorizes as ηk(𝐱)=i=1kηi(xi)\eta_{\leq k}(\bm{x})=\prod_{i=1}^{k}\eta_{i}(x_{i}), the space VkV_{k} admits the following tensor product structure

V_{k}=L^{2}_{\eta_{1}}\otimes\cdots\otimes L^{2}_{\eta_{k-1}}\otimes H^{1}_{\eta_{k}}, \qquad (17)

and the norm \|\cdot\|_{V_{k}} is a product norm, that is, \|v_{1}\otimes\cdots\otimes v_{k}\|_{V_{k}}=\|v_{1}\|_{L^{2}_{\eta_{1}}}\cdots\|v_{k-1}\|_{L^{2}_{\eta_{k-1}}}\|v_{k}\|_{H^{1}_{\eta_{k}}} for any v_{j}\in L^{2}_{\eta_{j}}, j<k, and v_{k}\in H^{1}_{\eta_{k}}. This tensor product structure will be used later in Section 4 to construct a numerical scheme for approximating the map components S_{k}.
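
As a simple illustration of the definition (15), the weighted norm \|f\|_{V_{k}} can be estimated by Monte Carlo over the Gaussian weight \eta_{\leq k}; the sketch below assumes user-supplied callables for f and \partial_{k}f and is not part of the algorithm developed later.

```python
import numpy as np

def vk_norm_sq(f, df_dxk, k, n_mc=100_000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of ||f||_{V_k}^2 = E_{eta_{<=k}}[ f^2 + (d_k f)^2 ]."""
    X = rng.standard_normal((n_mc, k))        # samples from eta_{<=k}
    return np.mean(f(X) ** 2 + df_dxk(X) ** 2)

# Example with k = 2 and f(x) = x_1 x_2, so that d_2 f = x_1:
f = lambda X: X[:, 0] * X[:, 1]
df = lambda X: X[:, 0]
print(vk_norm_sq(f, df, k=2))   # E[x1^2 x2^2] + E[x1^2] = 1 + 1 = 2
```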

The following proposition shows that, under mild assumptions on gg, the rectifier k\mathcal{R}_{k} is a Lipschitz continuous operator from VkV_{k} to itself. The proof relies on the Hardy inequality [39] and on Proposition 2.

Proposition 3.

Let g:>0g\colon\mathbb{R}\rightarrow\mathbb{R}_{>0} be a Lipschitz function, i.e., there exists a constant L<L<\infty so that

|g(\xi)-g(\xi^{\prime})|\leq L|\xi-\xi^{\prime}|, \qquad (18)

holds for any ξ,ξ\xi,\xi^{\prime}\in\mathbb{R}. Then k(f)Vk\mathcal{R}_{k}(f)\in V_{k} for any fVkf\in V_{k}, where k(f)\mathcal{R}_{k}(f) is defined in (11). Furthermore there exists a constant C<C<\infty such that

\|\mathcal{R}_{k}(f_{1})-\mathcal{R}_{k}(f_{2})\|_{V_{k}}\leq C\|f_{1}-f_{2}\|_{V_{k}}, \qquad (19)

holds for any f1,f2Vkf_{1},f_{2}\in V_{k}.

Proof.

See Section A.4. ∎

The next result relies on Proposition 3 to show that the objective functional k\mathcal{L}_{k} in (10), seen as a function from VkV_{k} to \mathbb{R}, is well-defined, continuous, and differentiable.

Theorem 4.

Let π\pi be a probability density function on d\mathbb{R}^{d} satisfying

\pi(\bm{x})\leq C_{\pi}\eta(\bm{x}), \qquad (20)

for all 𝐱d\bm{x}\in\mathbb{R}^{d} and a constant Cπ<C_{\pi}<\infty, and let g:0g\colon\mathbb{R}\rightarrow\mathbb{R}_{\geq 0} be an increasing function where

|g(\xi)-g(\xi^{\prime})|\leq L|\xi-\xi^{\prime}|, \qquad (21)
|\log\circ\,g(\xi)-\log\circ\,g(\xi^{\prime})|\leq L|\xi-\xi^{\prime}|, \qquad (22)

holds for any ξ,ξ\xi,\xi^{\prime}\in\mathbb{R} and a constant L<L<\infty. Then

\mathcal{L}_{k}(f)=\mathcal{J}_{k}(\mathcal{R}_{k}(f))<\infty,

for any fVkf\in V_{k}, where 𝒥k(s)=(12s(𝐱k)2log|ks(𝐱k)|)π(𝐱)d𝐱\mathcal{J}_{k}(s)=\int\left(\frac{1}{2}s(\bm{x}_{\leq k})^{2}-\log\left|\partial_{k}s(\bm{x}_{\leq k})\right|\right)\pi(\bm{x})\mathrm{d}\bm{x} as in (6) and where k(f)\mathcal{R}_{k}(f) is defined as in (11). Moreover, the function k:Vk\mathcal{L}_{k}\colon V_{k}\rightarrow\mathbb{R} is continuous and differentiable at any fVkf\in V_{k} and its gradient k(f)Vk\nabla\mathcal{L}_{k}(f)\in V_{k} is given by

\langle\nabla\mathcal{L}_{k}(f),\varepsilon\rangle_{V_{k}}=\int\mathcal{R}_{k}(f)(\bm{x})\left(\varepsilon(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g^{\prime}\big(\partial_{k}f(\bm{x}_{<k},t)\big)\,\partial_{k}\varepsilon(\bm{x}_{<k},t)\,\mathrm{d}t\right)\pi(\bm{x})\,\mathrm{d}\bm{x}-\int(\log\circ\,g)^{\prime}(\partial_{k}f(\bm{x}))\,\partial_{k}\varepsilon(\bm{x})\,\pi(\bm{x})\,\mathrm{d}\bm{x}, \qquad (23)

for any εVk\varepsilon\in V_{k}, where ,Vk\langle\cdot,\cdot\rangle_{V_{k}} is the scalar product in VkV_{k}.

Proof.

See Section A.5. ∎

It is useful to discuss some implications of Theorem 4. A smooth objective function permits us to use deterministic first- or second-order optimization algorithms. Furthermore, in Section 4, we exploit the gradient of \mathcal{L}_{k} in (23) to construct an adaptive polynomial (or wavelet) basis for V_{k}. We also note that many functions g satisfy the conditions (21) and (22) of the theorem; for example, these include the soft-plus (12) and shifted ELU (13) functions considered in Figure 2.

Remark 3 (Assumption (20) implies Gaussian tails).

The condition (20) implies that 𝐗π(𝐗2>t)Cπ𝐙η(𝐙2>t)\mathbb{P}_{\bm{X}\sim\pi}(\|\bm{X}\|_{2}>t)\leq C_{\pi}\mathbb{P}_{\bm{Z}\sim\eta}(\|\bm{Z}\|_{2}>t) holds for any t0t\geq 0, and hence that π\pi has sub-Gaussian tails [67]. The reverse, however, does not hold in general. For instance, a compactly supported random variable with density π(x)1/|x|𝟙{x[1,1]}\pi(x)\propto 1/\sqrt{|x|}\mathbbm{1}_{\{x\in[-1,1]\}} is sub-Gaussian, but the density is unbounded at x=0x=0 and thus does not satisfy (20).

Remark 4.

We note that (20) can be relaxed with the (less interpretable) assumption that there exist diagonal scalings dk>0d_{k}>0 and a constant Cπ<C_{\pi}<\infty such that π(𝐱)Cπk=1dηk(dkxk)\pi(\bm{x})\leq C_{\pi}\prod_{k=1}^{d}\eta_{k}(d_{k}x_{k}) for all 𝐱d\bm{x}\in\mathbb{R}^{d}. In practice, we can always standardize (i.e., center and rescale) the marginals of π\pi to match the mean and variance of the standard Gaussian reference. This preconditioning step should yield a smaller constant CπC_{\pi} in (20). Thus, without loss of generality, we will use the assumption above for the remainder of this article.

Remark 5.

Under additional assumptions on gg, the gradient fk(f)f\mapsto\nabla\mathcal{L}_{k}(f) is locally Lipschitz from V¯k\overline{V}_{k} to VkV_{k}, where V¯k={fVk,kfL}\overline{V}_{k}=\{f\in V_{k},\partial_{k}f\in L^{\infty}\} is the space endowed with the norm fV¯k=fVk+kfL\|f\|_{\overline{V}_{k}}=\|f\|_{V_{k}}+\|\partial_{k}f\|_{L^{\infty}}. More specifically, if gg^{\prime} and (logg)(\log\circ\,g)^{\prime} are Lipschitz functions, then there exists a constant M<M<\infty such that

\|\nabla\mathcal{L}_{k}(f_{1})-\nabla\mathcal{L}_{k}(f_{2})\|_{V_{k}}\leq M(1+\|\mathcal{R}_{k}(f_{2})\|_{V_{k}})\|f_{1}-f_{2}\|_{\overline{V}_{k}}, \qquad (24)

holds for any f1,f2V¯kf_{1},f_{2}\in\overline{V}_{k}. See the derivation of this result in Appendix A.6. Such local Lipschitz regularity is useful to analyze the convergence of backtracking gradient descent procedures, i.e., gradient descent with an inexact line search such as Armijo’s rule, to a stationary point; see Theorem 2.1 in [65] and Proposition 2.1.1 in [5]. We leave the extension of such optimization guarantees for solving minfVkk(f)\min_{f\in V_{k}}\mathcal{L}_{k}(f) to future work.

3.3 Existence and uniqueness of solutions

In this section, we show that the optimization problem (10) does not admit any spurious local minima, meaning that local minimizers are in fact global minimizers. We also show that problem (10) admits a unique global minimizer which permits us to recover the KR rearrangement. To prove these results, we will need the following proposition, which provides conditions ensuring that the inverse rectifier k1\mathcal{R}_{k}^{-1} is continuous.

Proposition 5.

Let gg be a bijective function from \mathbb{R} to >0\mathbb{R}_{>0} such that for any c>0c>0 there exists a constant Lc<L_{c}<\infty so that

|g^{-1}(\xi)-g^{-1}(\xi^{\prime})|\leq L_{c}|\xi-\xi^{\prime}|, \qquad (25)

holds for any ξ,ξc\xi,\xi^{\prime}\geq c. Then, for any sVks\in V_{k} such that essinfks>0\operatorname*{ess\,inf}\partial_{k}s>0, we have k1(s)Vk\mathcal{R}_{k}^{-1}(s)\in V_{k} and essinfkk1(s)>\operatorname*{ess\,inf}\partial_{k}\mathcal{R}_{k}^{-1}(s)>-\infty. Furthermore for any c>0c>0, there exists a constant Cc<C_{c}<\infty such that

\|\mathcal{R}_{k}^{-1}(s_{1})-\mathcal{R}_{k}^{-1}(s_{2})\|_{V_{k}}\leq C_{c}\|s_{1}-s_{2}\|_{V_{k}}, \qquad (26)

holds for any s1,s2Vks_{1},s_{2}\in V_{k} such that essinfksic\operatorname*{ess\,inf}\partial_{k}s_{i}\geq c.

Proof.

See Section A.8. ∎

Note that the softplus function (12) and the shifted exponential linear unit (13) satisfy (25) with Lc=(1ec)1L_{c}=(1-e^{-c})^{-1} and Lc=max{1/c,1}L_{c}=\max\{1/c,1\}, respectively.

3.3.1 Local minima are global minima

The following proposition shows that, under certain conditions on the function gg, the image set k(Vk)={k(f):fVk}\mathcal{R}_{k}(V_{k})=\{\mathcal{R}_{k}(f)\,:\,f\in V_{k}\} is convex.

Proposition 6.

Let g:0g\colon\mathbb{R}\rightarrow\mathbb{R}_{\geq 0} be an increasing function such that ξ(g1(ξ))2\xi\mapsto(g^{-1}(\xi))^{2} is a convex function. Let k(f)\mathcal{R}_{k}(f) be defined as in (11) and VkV_{k} as in (15). Then k(Vk)={k(f):fVk}\mathcal{R}_{k}(V_{k})=\{\mathcal{R}_{k}(f)\,:\,f\in V_{k}\} is convex.

Proof.

See Section A.7. ∎

When gg is the softplus function (12) or the exponential linear unit (13), the mapping ξ(g1(ξ))2\xi\mapsto(g^{-1}(\xi))^{2} is indeed convex. Interestingly, however, the exponential function g(ξ)=exp(ξ)g(\xi)=\exp(\xi)—which has been used to parameterize monotone maps in [35, 61] and for monotone regression in [48, 57]—does not satisfy this property, and there is no guarantee for k(Vk)\mathcal{R}_{k}(V_{k}) to be convex in this case.

An important consequence of Proposition 6 is that the constrained problem minsk(Vk)𝒥k(s)\min_{s\in\mathcal{R}_{k}(V_{k})}\mathcal{J}_{k}(s) remains convex. Hence, the strict convexity of 𝒥k\mathcal{J}_{k} (see Appendix A.2) yields the uniqueness of any solution to minsk(Vk)𝒥k(s)\min_{s\in\mathcal{R}_{k}(V_{k})}\mathcal{J}_{k}(s). The following theorem now establishes the existence of a unique global minimizer of the unconstrained problem (10), under the assumption that a local minimizer exists.

Theorem 7.

Under the assumptions of Propositions 5 and 6, let fVkf^{*}\in V_{k} be a local minimizer of f𝒥k(k(f))f\mapsto\mathcal{J}_{k}(\mathcal{R}_{k}(f)), meaning that there exists ρ>0\rho>0 such that

\mathcal{J}_{k}(\mathcal{R}_{k}(f^{*}))\leq\mathcal{J}_{k}(\mathcal{R}_{k}(f)),\quad\forall f\in V_{k}\text{ such that }\|f-f^{*}\|_{V_{k}}\leq\rho. \qquad (27)

If essinfkf>\operatorname*{ess\,inf}\partial_{k}f^{*}>-\infty, then ff^{*} is the unique global minimizer, meaning that

\mathcal{J}_{k}(\mathcal{R}_{k}(f^{*}))<\mathcal{J}_{k}(\mathcal{R}_{k}(f)),\quad\forall f\in V_{k}\setminus\{f^{*}\}.
Proof.

Let fVkf\in V_{k} with fff\neq f^{*}. For any 0<t<10<t<1 we let st=tk(f)+(1t)k(f)s_{t}=t\mathcal{R}_{k}(f)+(1-t)\mathcal{R}_{k}(f^{*}). By Proposition 6, k(Vk)\mathcal{R}_{k}(V_{k}) is convex so that stk(Vk)s_{t}\in\mathcal{R}_{k}(V_{k}). By the strict convexity of 𝒥k\mathcal{J}_{k} (cf. Appendix A.2), we have 𝒥k(st)<t𝒥k(k(f))+(1t)𝒥k(k(f))\mathcal{J}_{k}(s_{t})<t\mathcal{J}_{k}(\mathcal{R}_{k}(f))+(1-t)\mathcal{J}_{k}(\mathcal{R}_{k}(f^{*})), or equivalently

\mathcal{J}_{k}(s_{t})-\mathcal{J}_{k}(\mathcal{R}_{k}(f^{*}))<t\Big(\mathcal{J}_{k}(\mathcal{R}_{k}(f))-\mathcal{J}_{k}(\mathcal{R}_{k}(f^{*}))\Big). \qquad (28)

Next we show that there exists a sufficiently small t>0t>0 such that the above left-hand side is non-negative. As a consequence, the right hand side of (28) will be positive, which will conclude the proof.

Let b=essinfkfb=\operatorname*{ess\,inf}\partial_{k}f^{*} and c=g(b)/2>0c=g(b)/2>0. For any t1/2t\leq 1/2, we have kst=tg(kf)+(1t)g(kf)1/2g(kf)\partial_{k}s_{t}=tg(\partial_{k}f)+(1-t)g(\partial_{k}f^{*})\geq 1/2g(\partial_{k}f^{*}) so that essinfkstc\operatorname*{ess\,inf}\partial_{k}s_{t}\geq c. In addition, we have essinfkk(f)=essinfg(kf)c\operatorname*{ess\,inf}\partial_{k}\mathcal{R}_{k}(f^{*})=\operatorname*{ess\,inf}g(\partial_{k}f^{*})\geq c. Thus, by Proposition 5, there exists a constant Cc<C_{c}<\infty such that

\|\mathcal{R}^{-1}_{k}(s_{t})-f^{*}\|_{V_{k}}=\|\mathcal{R}^{-1}_{k}(s_{t})-\mathcal{R}^{-1}_{k}(\mathcal{R}_{k}(f^{*}))\|_{V_{k}}\leq C_{c}\|s_{t}-\mathcal{R}_{k}(f^{*})\|_{V_{k}}=tC_{c}\|\mathcal{R}_{k}(f)-\mathcal{R}_{k}(f^{*})\|_{V_{k}}.

By letting t=ρ/(Cck(f)k(f)Vk)t=\rho/(C_{c}\|\mathcal{R}_{k}(f)-\mathcal{R}_{k}(f^{*})\|_{V_{k}}), we have k1(st)fVkρ\|\mathcal{R}^{-1}_{k}(s_{t})-f^{*}\|_{V_{k}}\leq\rho. Therefore, setting f=k1(st)f=\mathcal{R}^{-1}_{k}(s_{t}) in (27) ensures that 0𝒥k(st)𝒥k(k(f))0\leq\mathcal{J}_{k}(s_{t})-\mathcal{J}_{k}(\mathcal{R}_{k}(f^{*})). ∎

Let us remark that Theorem 7 holds for general target densities π\pi without the bound assumed in Theorem 4. It establishes that any local minimizer of 𝒥kk\mathcal{J}_{k}\circ\mathcal{R}_{k} on VkV_{k} is in fact the unique global minimizer. Applying this result, however, depends on the existence of a local minimizer in the function space VkV_{k}, which must be established for a given target density. The following section studies a relatively broad class of distributions where the kk-th component of the KR rearrangement is in k(Vk)\mathcal{R}_{k}(V_{k}), thereby providing a candidate solution to apply Theorem 7.

3.3.2 Recovery of the KR rearrangement

As discussed in Section 2, the Knothe–Rosenblatt rearrangement SKRS_{\textrm{KR}} is the unique lower triangular and monotone map such that 𝒟KL(π||SKRη)=0\mathcal{D}_{\textrm{KL}}(\pi||S_{\textrm{KR}}^{\sharp}\eta)=0; see [8]. The decomposition (5) of the KL divergence 𝒟KL(π||Sη)\mathcal{D}_{\textrm{KL}}(\pi||S^{\sharp}\eta) thus permits us to write

\mathcal{J}_{k}(S_{\textrm{KR},k})\leq\mathcal{J}_{k}(\mathcal{R}_{k}(f)), \qquad (29)

for any fVkf\in V_{k} and for any 1kd1\leq k\leq d. Indeed, by letting SS be the map such that Sk=k(f)S_{k}=\mathcal{R}_{k}(f) and Si=SKR,iS_{i}=S_{\textrm{KR},i} for iki\neq k, (5) yields (29). Thus, if there exists a function fKR,kVkf_{\textrm{KR},k}\in V_{k} such that SKR,k=k(fKR,k)S_{\textrm{KR},k}=\mathcal{R}_{k}(f_{\textrm{KR},k}), then fKR,kf_{\textrm{KR},k} is a global minimizer of f𝒥k(k(f))f\mapsto\mathcal{J}_{k}(\mathcal{R}_{k}(f)) over fVkf\in V_{k}. To show this, we first need the following intermediate result.

Proposition 8.

Let π\pi be a probability density function on d\mathbb{R}^{d} such that

c\,\eta(\bm{x})\leq\pi(\bm{x})\leq C\,\eta(\bm{x}), \qquad (30)

for all 𝐱d\bm{x}\in\mathbb{R}^{d} with some constants 0<cC<0<c\leq C<\infty. Then, for all 𝐱<kk1\bm{x}_{<k}\in\mathbb{R}^{k-1} and k=1,,dk=1,\dots,d, SKR,k(𝐱<k,xk)=𝒪(xk)S_{\textrm{KR},k}(\bm{x}_{<k},x_{k})=\mathcal{O}(x_{k}) and kSKR,k(𝐱<k,xk)=𝒪(1)\partial_{k}S_{\textrm{KR},k}(\bm{x}_{<k},x_{k})=\mathcal{O}(1) as |xk||x_{k}|\rightarrow\infty. Furthermore, we have SKR,kVkS_{\textrm{KR},k}\in V_{k} and essinfkSKR,k>0\operatorname*{ess\,inf}\partial_{k}S_{\textrm{KR},k}>0 for all k=1,,dk=1,\dots,d.

Proof.

See Appendix A.9. ∎

Let us comment on the assumption (30). First, we note that it is equivalent to assuming π(x)=exp(Ψ(x))η(x)\pi(x)=\exp(\Psi(x))\eta(x) for some integrable function Ψ:d\Psi\colon\mathbb{R}^{d}\rightarrow\mathbb{R} with bounded oscillation, meaning supxdΨ(x)infxdΨ(x)<\sup_{x\in\mathbb{R}^{d}}\Psi(x)-\inf_{x\in\mathbb{R}^{d}}\Psi(x)<\infty. As an example, it is satisfied for any Gaussian mixture of the form π(x)=i=1kwiη(xxi)\pi(x)=\sum_{i=1}^{k}w_{i}\eta(x-x_{i}) with weights wi0w_{i}\geq 0 that sum to one, centers xidx_{i}\in\mathbb{R}^{d}, and η(x)exp(x22)\eta(x)\propto\exp(-\frac{\|x\|^{2}}{2}); see Example 2.7 in [74] for more details.

Second, while the results in Section 3.2 assume only an upper bound on the joint density π\pi, here we have both an upper and a lower bound. These bounds together imply that the kk-th marginal conditional density, for 1kd1\leq k\leq d, satisfies (c/C)ηk(xk)π(xk|𝒙<k)(C/c)ηk(xk)(c/C)\eta_{k}(x_{k})\leq\pi(x_{k}|\bm{x}_{<k})\leq(C/c)\eta_{k}(x_{k}) for all 𝒙kk\bm{x}_{\leq k}\in\mathbb{R}^{k}. This means that the marginal conditionals of the target density have Gaussian tails, rather than potentially being lighter than Gaussian (as in assumption (20)). This condition guarantees that the asymptotic behavior of each map component SKR,kS_{\textrm{KR},k} in its last variable (as |xk||x_{k}|\to\infty) is affine.

We now combine our previous results to show that under the assumption (30) on the target density π\pi, we have fKR,kVkf_{\textrm{KR},k}\in V_{k} and the existence of a unique solution to minfVk𝒥k(k(f))\min_{f\in V_{k}}\mathcal{J}_{k}(\mathcal{R}_{k}(f)).

Corollary 9.

Under the assumptions on π\pi in Proposition 8, let gg be a Lipschitz bijection from \mathbb{R} to 0\mathbb{R}_{\geq 0} such that ξ(g1(ξ))2\xi\mapsto(g^{-1}(\xi))^{2} is convex and, for any c>0c>0, there exists a constant Lc<L_{c}<\infty such that |g1(ξ)g1(ξ)|Lc|ξξ||g^{-1}(\xi)-g^{-1}(\xi^{\prime})|\leq L_{c}|\xi-\xi^{\prime}| for any ξ,ξc\xi,\xi^{\prime}\geq c. Then, for any 1kd1\leq k\leq d, there exists a unique function fKR,kVkf_{\textrm{KR},k}\in V_{k} that satisfies essinfkfKR,k>\operatorname*{ess\,inf}\partial_{k}f_{\textrm{KR},k}>-\infty such that k(fKR,k)\mathcal{R}_{k}(f_{\textrm{KR},k}) is the kk-th component of the KR rearrangement SKRS_{\textrm{KR}}. As a consequence, minfVk𝒥k(k(f))\min_{f\in V_{k}}\mathcal{J}_{k}(\mathcal{R}_{k}(f)) admits a unique solution and

\mathcal{J}_{k}(\mathcal{R}_{k}(f_{\textrm{KR},k}))=\min_{f\in V_{k}}\mathcal{J}_{k}(\mathcal{R}_{k}(f)).
Proof.

Combining Proposition 8 and Proposition 5, there exists a function fKR,kVkf_{\textrm{KR},k}\in V_{k} that satisfies essinfkfKR,k>\operatorname*{ess\,inf}\partial_{k}f_{\textrm{KR},k}>-\infty such that k(fKR,k)=SKR,k\mathcal{R}_{k}(f_{\textrm{KR},k})=S_{\textrm{KR},k}. Then from Theorem 7 and inequality (29), we have that fKR,kf_{\textrm{KR},k} is the unique global minimizer of f𝒥k(k(f))f\mapsto\mathcal{J}_{k}(\mathcal{R}_{k}(f)) over fVkf\in V_{k}. ∎

Corollary 9 gives sufficient conditions for the existence of a unique solution to argminfVk𝒥k(k(f))\operatorname*{arg\,min}_{f\in V_{k}}\mathcal{J}_{k}(\mathcal{R}_{k}(f)), and shows that this solution (after rectification) is in fact the kk-th component of the KR rearrangement. It is natural to ask whether, outside of these assumptions, there still exists a local minimizer of 𝒥kk\mathcal{J}_{k}\circ\mathcal{R}_{k} over VkV_{k}. If so, then Theorem 7 would guarantee that this function is in fact the unique global minimizer of the same objective. Following the preceding analysis, one route towards answering this question would be to elucidate whether k1(SKR,k)Vk\mathcal{R}_{k}^{-1}(S_{\textrm{KR},k})\in V_{k} holds for tail assumptions on π\pi that are different from (30). In general, this is not clear. For targets with lighter tails, the KR map to a Gaussian η\eta will increase super-linearly as |xk||x_{k}|\to\infty, but we hypothesize that many such maps are still recoverable using VkV_{k}, due to its Gaussian weighting. We comment further on this possibility in Section 6. For targets with heavier tails, however, we may have that kSKR,k(x)0\partial_{k}S_{\textrm{KR},k}(x)\to 0 as xk±x_{k}\to\pm\infty, which implies that essinfkfKR,k=\operatorname*{ess\,inf}\partial_{k}f_{\textrm{KR},k}=-\infty. In this setting, a more promising route may be to modify the reference distribution, i.e., to choose a reference that itself has heavier tails, so that assumption (30) is satisfied. See [24] for work in this direction. Doing so would break the strong log-concavity of η\eta, however, which is an essential element of our analysis; hence it is unclear whether any optimization guarantees could then hold.

Alternatively, one could avoid appealing to the properties of the KR rearrangement for a given target π\pi and more directly ask whether a local minimizer of (10) exists even if it is not fKR,kf_{\textrm{KR},k}. Our current theory does not address this question. One possible strategy involves elucidating under what additional assumptions (including for what class of targets π\pi) the objective function k\mathcal{L}_{k} is coercive on VkV_{k}; a minimizer of (10) would therefore exist by the direct method in the calculus of variations. The resulting map would, by construction, be the closest approximation to SKRS_{\textrm{KR}} (in the forward KL sense of (5)) within the space of triangular functions ×kk(Vk)\bigtimes_{k}\mathcal{R}_{k}(V_{k}). Proposition 1 ensures that such a map is also close to SKRS_{\textrm{KR}} in the Lπ2L_{\pi}^{2} sense.

4 Adaptive parameterization of transport maps

Given an i.i.d. sample {𝐗i}i=1nπ\{\mathbf{X}^{i}\}_{i=1}^{n}\sim\pi, we now propose an adaptive algorithm to build fVkf\in V_{k} that minimizes the empirical objective ^k𝒥^kk(f)\widehat{\mathcal{L}}_{k}\coloneqq\widehat{\mathcal{J}}_{k}\,\circ\,\mathcal{R}_{k}(f), where 𝒥^k\widehat{\mathcal{J}}_{k} is as in (7). Doing so for each k=1,,dk=1,\ldots,d yields a monotone triangular map, as described in Section 2. For each kk, we represent fVkf\in V_{k} with an mm-term expansion

f(\bm{x}_{\leq k})=\sum_{\bm{\alpha}\in\Lambda}c_{\bm{\alpha}}\psi_{\bm{\alpha}}(\bm{x}_{\leq k}), \qquad (31)

where Λ\Lambda is a set of multi-indices 𝜶=(α1,,αk)0k\bm{\alpha}=(\alpha_{1},\dots,\alpha_{k})\in\mathbb{N}_{0}^{k} with #Λ=m\#\Lambda=m, c𝜶c_{\bm{\alpha}}\in\mathbb{R} are coefficients, and ψ𝜶:k\psi_{\bm{\alpha}}\colon\mathbb{R}^{k}\rightarrow\mathbb{R} are basis functions for VkV_{k}, constructed as products of univariate functions,

\psi_{\bm{\alpha}}(\bm{x}_{\leq k})=\prod_{j=1}^{k}\psi_{\alpha_{j}}^{j}(x_{j}).

Here, {ψαj}α0\{\psi_{\alpha}^{j}\}_{\alpha\in\mathbb{N}_{0}} is chosen to be a basis of Lηj2L^{2}_{\eta_{j}} if j<kj<k and of Hηk1H^{1}_{\eta_{k}} if j=kj=k. Because VkV_{k} possesses a tensor product structure (17) and because its norm Vk\|\cdot\|_{V_{k}} is a product norm, the basis {ψ𝜶}𝜶0k\{\psi_{\bm{\alpha}}\}_{\bm{\alpha}\in\mathbb{N}_{0}^{k}} is orthonormal if {ψαj}α0\{\psi_{\alpha}^{j}\}_{\alpha\in\mathbb{N}_{0}} is an orthonormal basis for all jkj\leq k. Since ff depends linearly on the coefficients c𝜶c_{\bm{\alpha}}, the smoothness properties of the objective k\mathcal{L}_{k} transfer to the objective function parameterized by the coefficients c𝜶c_{\bm{\alpha}}, 𝜶Λ\bm{\alpha}\in\Lambda.
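
A minimal sketch of evaluating the m-term expansion (31) from a multi-index set \Lambda and families of univariate basis functions is given below; the helper names are placeholders, and the monomial "basis" used in the example is for illustration only.

```python
import numpy as np

def evaluate_expansion(coeffs, Lambda, univariate_bases, X):
    """Evaluate f(x) = sum_{alpha in Lambda} c_alpha prod_j psi^j_{alpha_j}(x_j).

    coeffs           : sequence of m = #Lambda coefficients c_alpha
    Lambda           : list of multi-indices alpha = (alpha_1, ..., alpha_k)
    univariate_bases : list of k callables; univariate_bases[j](a, x) = psi_a^j(x)
    X                : array of shape (n, k), evaluation points
    """
    f_vals = np.zeros(X.shape[0])
    for c, alpha in zip(coeffs, Lambda):
        phi = np.ones(X.shape[0])
        for j, a in enumerate(alpha):          # tensor product over coordinates
            phi *= univariate_bases[j](a, X[:, j])
        f_vals += c * phi
    return f_vals

# Example: k = 2 with monomial "bases" and Lambda = {(0,0), (1,1)},
# i.e., f(x) = 1 + 0.5 * x_1 * x_2.
mono = lambda a, x: x ** a
X = np.array([[1.0, 2.0], [0.0, 3.0]])
print(evaluate_expansion([1.0, 0.5], [(0, 0), (1, 1)], [mono, mono], X))
```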

Sections 4.1 and 4.2 describe two different choices of basis: polynomials and wavelets, respectively. Then, in Section 4.3, we present a greedy algorithm for building the multi-index set Λ\Lambda, applicable to any such hierarchical basis. In addition, we propose a cross validation procedure to determine when to stop enriching Λ\Lambda in order to avoid over-fitting to the finite sample.

4.1 Polynomial basis

Probabilists’ Hermite polynomials {φα}α\{\varphi_{\alpha}\}_{\alpha\in\mathbb{N}} form an orthogonal basis for Lη2L^{2}_{\eta} where η\eta is the univariate standard Gaussian density. φα\varphi_{\alpha} is a polynomial of degree α0\alpha\geq 0 defined as

\varphi_{\alpha}(x)=(-1)^{\alpha}\exp(x^{2}/2)\frac{\mathrm{d}^{\alpha}}{\mathrm{d}x^{\alpha}}\exp(-x^{2}/2).

Furthermore, we have \langle\varphi_{\alpha},\varphi_{\beta}\rangle_{L^{2}_{\eta}}=\alpha!\,\delta_{\alpha,\beta}, so that \{\varphi_{\alpha}/\sqrt{\alpha!}\}_{\alpha\in\mathbb{N}_{0}} forms an orthonormal basis for L^{2}_{\eta}. Similarly, \langle\varphi_{\alpha},\varphi_{\beta}\rangle_{H^{1}_{\eta}}=(\alpha+1)!\,\delta_{\alpha,\beta}, so that \{\varphi_{\alpha}/\sqrt{(\alpha+1)!}\}_{\alpha\in\mathbb{N}_{0}} forms an orthonormal basis for H^{1}_{\eta}; see [55, Proposition 1.3]. We can thus tensorize these basis functions to obtain an orthonormal basis for the anisotropic space V_{k} in (17).

In practice, we modify the univariate Hermite polynomials to be linear outside of an arbitrary compact domain. Following [46], we let

\psi_{\alpha_{j}}(x_{j})=\frac{1}{\sqrt{Z_{\alpha_{j}}}}\begin{cases}\varphi_{\alpha_{j}}(x_{j}^{a})+\varphi_{\alpha_{j}}^{\prime}(x_{j}^{a})(x_{j}-x_{j}^{a}) & \text{if }x_{j}<x_{j}^{a}\\ \varphi_{\alpha_{j}}(x_{j}) & \text{if }x_{j}^{a}\leq x_{j}\leq x_{j}^{b}\\ \varphi_{\alpha_{j}}(x_{j}^{b})+\varphi_{\alpha_{j}}^{\prime}(x_{j}^{b})(x_{j}-x_{j}^{b}) & \text{if }x_{j}>x_{j}^{b}\end{cases} \qquad (32)

where Zαj=αj!Z_{\alpha_{j}}=\alpha_{j}! for j=1,,k1j=1,\dots,k-1 and Zαj=(αj+1)!Z_{\alpha_{j}}=(\alpha_{j}+1)! for j=kj=k. In our numerical experiments, we set xjax_{j}^{a} and xjbx_{j}^{b} to be the 0.010.01 and 0.990.99 (empirical) quantiles, respectively, of the marginal sample {Xji}i=1n\{X_{j}^{i}\}_{i=1}^{n}. The basis {ψαj}αj0\{\psi_{\alpha_{j}}\}_{\alpha_{j}\in\mathbb{N}_{0}}, albeit not exactly orthonormal, is close to being orthonormal for sufficiently small xjax_{j}^{a} and large xjbx_{j}^{b}. This is important for the numerical stability of the algorithm presented in the following section.
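A minimal sketch of the linearized basis function (32), assuming NumPy's probabilists' Hermite polynomials and user-supplied cut-off points x_a, x_b standing in for the quantiles x_j^a, x_j^b; the normalization follows the choice of Z_{α_j} above, with the last-coordinate variant selected by a flag.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import HermiteE

def modified_hermite(alpha, x_a, x_b, last_coordinate=False):
    """psi_alpha from (32): He_alpha on [x_a, x_b], extended linearly outside
    with matching value and slope, and normalized by sqrt(Z_alpha)."""
    He = HermiteE.basis(alpha)
    dHe = He.deriv()
    Z = factorial(alpha + 1) if last_coordinate else factorial(alpha)

    def psi(x):
        x = np.asarray(x, dtype=float)
        inside = He(np.clip(x, x_a, x_b))
        left = He(x_a) + dHe(x_a) * (x - x_a)
        right = He(x_b) + dHe(x_b) * (x - x_b)
        return np.where(x < x_a, left, np.where(x > x_b, right, inside)) / np.sqrt(Z)

    return psi

# Example: degree-3 basis function linearized outside the empirical 1% / 99% quantiles
sample = np.random.randn(10_000)
x_a, x_b = np.quantile(sample, [0.01, 0.99])
psi3 = modified_hermite(3, x_a, x_b, last_coordinate=True)
print(psi3(np.array([-5.0, 0.0, 5.0])))
```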

The basis {ψ𝜶}𝜶\{\psi_{\bm{\alpha}}\}_{\bm{\alpha}} generated by the piecewise polynomials (32) has numerical and structural advantages over the Hermite polynomials. First, these functions avoid the fast growth rate of standard polynomials as 𝒙\|\bm{x}\|\rightarrow\infty, providing a more stable extrapolation of the map to the tails of π\pi, where there are typically no samples available to otherwise constrain its growth. Second, a map that reverts to linearity far from the origin yields densities in the class considered in Subsection 3.3.2; see condition (30). Indeed, each map component Sk=k(f)S_{k}=\mathcal{R}_{k}(f) with fspan{ψ𝜶:𝜶Λ}f\in\text{span}\{\psi_{\bm{\alpha}}:\bm{\alpha}\in\Lambda\} behaves linearly as |xk||x_{k}|\rightarrow\infty, so that the pullback density SηS^{\sharp}\eta has Gaussian tails like the reference η\eta.

4.2 Wavelet basis

Wavelets are a popular approach for approximating functions that are not uniformly smooth [13, 68]. In particular, wavelet techniques define a multi-resolution approximation to a function that can better capture local features than, for instance, global polynomials. Given a compactly supported function ψ:\psi\colon\mathbb{R}\rightarrow\mathbb{R} called the mother wavelet, and indices (l,q)2(l,q)\in\mathbb{Z}^{2}, the function

ψ(l,q):x2l/2ψ(2lxq),\psi_{(l,q)}:x\mapsto 2^{l/2}\psi(2^{l}x-q),

is the qqth wavelet at the level ll. Common choices for the mother wavelet include the Haar, the Daubechies, and the Meyer wavelets.

In our numerical experiments, we use the continuous Mexican hat mother wavelet

ψ(x)=23σπ1/4(1(x/σ)2)exp(x2/(2σ2)),\psi(x)=\frac{2}{\sqrt{3\sigma}\pi^{1/4}}(1-\left(x/\sigma\right)^{2})\exp(-x^{2}/(2\sigma^{2})), (33)

with scale parameter σ=1\sigma=1 [34]; this function is proportional to the negative second derivative of the univariate Gaussian density. The Mexican hat function is also commonly referred to as the Ricker wavelet in the geophysics community. We treat this function as having essentially compact support on [6,6][-6,6], up to numerical precision.

Tensorizing a univariate wavelet in the kkth coordinate direction with modified Hermite polynomials in the first k1k-1 directions yields an expansion of the form in (31), where the indices in 𝜶=(α1,,αk)\bm{\alpha}=(\alpha_{1},\ldots,\alpha_{k}) are tuples of the form

α1,,αk10,αk=(lk,qk)2.\alpha_{1},\ldots,\alpha_{k-1}\in\mathbb{N}_{0},\quad\alpha_{k}=(l_{k},q_{k})\in\mathbb{Z}^{2}. (34)

In practice, we consider a truncated set of wavelets by first rescaling the mother wavelet ψ\psi to have the same support as the interval between the 0.01 and 0.99 empirical quantiles of the marginal sample {Xji}i=1n\{X_{j}^{i}\}_{i=1}^{n}. Then, we constrain lj0l_{j}\geq 0 and consider lj=0l_{j}=0 to be the coarsest level of the approximation. Next, for any ljl_{j}, the translation parameter qjq_{j} is bounded as 0qj2lj10\leq q_{j}\leq 2^{l_{j}}-1. Thus, we can consider

α1,,αk10,αk=(lk,qk)02.\alpha_{1},\ldots,\alpha_{k-1}\in\mathbb{N}_{0},\quad\alpha_{k}=(l_{k},q_{k})\in\mathbb{N}_{0}^{2}.

rather than (34).
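The following sketch builds such a truncated wavelet family from the Mexican hat mother wavelet (33). The affine map of the nominal support [−6,6] onto an interval [a,b] (e.g., the empirical 1%/99% quantiles) and the placement of the 2^l translates within it are illustrative implementation choices, not details fixed by the text.

```python
import numpy as np

def mexican_hat(x, sigma=1.0):
    """Mother wavelet (33)."""
    c = 2.0 / (np.sqrt(3.0 * sigma) * np.pi ** 0.25)
    return c * (1.0 - (x / sigma) ** 2) * np.exp(-x ** 2 / (2.0 * sigma ** 2))

def wavelet_family(a, b, max_level):
    """Callables psi_{(l, q)} on [a, b], with l = 0, ..., max_level and 0 <= q <= 2^l - 1."""
    def make(l, q):
        def psi(x):
            # map [a, b] onto the nominal support [-6, 6] of the mother wavelet
            u = -6.0 + 12.0 * (np.asarray(x, dtype=float) - a) / (b - a)
            # center the q-th translate at level l in the q-th of 2^l subintervals
            shift = 12.0 * (q + 0.5) - 6.0 * 2 ** l
            return 2.0 ** (l / 2.0) * mexican_hat(2.0 ** l * u - shift)
        return psi
    return [make(l, q) for l in range(max_level + 1) for q in range(2 ** l)]

# Example: levels 0, 1, 2 give 1 + 2 + 4 = 7 wavelets on [a, b]
sample = np.random.randn(10_000)
a, b = np.quantile(sample, [0.01, 0.99])
family = wavelet_family(a, b, max_level=2)
print(len(family), family[3](np.linspace(a, b, 5)))
```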

Lastly, we note that any function ff of the form (31) expanded with compactly supported wavelets will decay to 0 as |xk||x_{k}|\rightarrow\infty. Thus, the map Sk=k(f)S_{k}=\mathcal{R}_{k}(f) will have asymptotic linear growth as |xk||x_{k}|\rightarrow\infty, similarly to the modified Hermite polynomial expansions.

4.3 Adaptive transport map algorithm

Now we propose a greedy method for constructing the multi-index set Λ=Λt\Lambda=\Lambda_{t} in (31). For simplicity, we consider only the case of single indices αj0\alpha_{j}\in\mathbb{N}_{0}. The extension to the case αj02\alpha_{j}\in\mathbb{N}_{0}^{2} (as in the wavelet construction above) is described in Appendix B. At each greedy iteration tt, we add one multi-index 𝜶t\bm{\alpha}_{t}^{\ast} to Λt\Lambda_{t}. Starting with Λ0=\Lambda_{0}=\emptyset we let

Λt+1=Λt{𝜶t},\Lambda_{t+1}=\Lambda_{t}\cup\{\bm{\alpha}_{t}^{\ast}\}, (35)

where 𝜶tΛt\bm{\alpha}_{t}^{\ast}\notin\Lambda_{t} is a multi-index that yields the best improvement of ^k\widehat{\mathcal{L}}_{k} in a sense to be defined below. Borrowing ideas from [36, 37], we constrain the sets {Λt}t0\{\Lambda_{t}\}_{t\geq 0} to be downward closed [12, 14], meaning that they satisfy the property

𝜶Λt and 𝜶𝜶𝜶Λt,\bm{\alpha}\in\Lambda_{t}\text{ and }\bm{\alpha}^{\prime}\leq\bm{\alpha}~{}\Rightarrow~{}\bm{\alpha}^{\prime}\in\Lambda_{t}, (36)

where 𝜶𝜶\bm{\alpha}^{\prime}\leq\bm{\alpha} denotes αiαi\alpha_{i}^{\prime}\leq\alpha_{i} for all 1ik1\leq i\leq k. Intuitively, (36) means that Λt\Lambda_{t} has a pyramidal shape containing no holes. Downward-closed sets are known to preserve good approximation properties and permit a tractable construction of Λt\Lambda_{t}; see [14]. Indeed, Λt+1\Lambda_{t+1} remains downward-closed if and only if the multi-index 𝜶t\bm{\alpha}_{t}^{\ast} is selected from the so-called reduced margin of Λt\Lambda_{t}, defined by

ΛtRM={𝜶Λt:𝜶𝐞iΛt for all 1ik with αi0},\Lambda_{t}^{\text{RM}}=\{\bm{\alpha}\notin\Lambda_{t}\,:\,\bm{\alpha}-\mathbf{e}_{i}\in\Lambda_{t}\text{ for all }1\leq i\leq k\text{ with }\alpha_{i}\neq 0\},

where 𝐞i\mathbf{e}_{i} denotes the ii-th canonical vector of k\mathbb{N}^{k}. The reduced margin is a subset of the margin set ΛtM\Lambda_{t}^{\text{M}} (i.e., multi-indices 𝜶Λt\bm{\alpha}\not\in\Lambda_{t} such that i>0\exists\,i>0 where 𝜶𝐞iΛt\bm{\alpha}-\mathbf{e}_{i}\in\Lambda_{t}); see Figure 3. The size of the reduced margin typically grows more slowly with respect to the dimension kk than the margin itself. For instance, if Λt\Lambda_{t} contains all multi-indices in a hypercube of width pp in dimension kk, the margin has cardinality (p+1)kpk(p+1)^{k}-p^{k}, while the reduced margin has cardinality kk.

Figure 3: A k=2k=2 dimensional downward-closed active set of multi-indices Λt\Lambda_{t} with its margin ΛtM\Lambda_{t}^{\text{M}} and reduced margin ΛtRM\Lambda_{t}^{\text{RM}}. The margin and reduced margin are plotted before (left) and after (right) adding 𝜶t=(2,1)\bm{\alpha}_{t}^{\ast}=(2,1), denoted with a cross, to Λt\Lambda_{t}.
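A small sketch of the reduced-margin computation used in the selection step below, directly transcribing the definition above for multi-index sets stored as tuples; the example set is illustrative and, consistent with Figure 3, has (2,1) in its reduced margin.

```python
def reduced_margin(Lambda, k):
    """Reduced margin of a downward-closed set of length-k multi-indices (tuples)."""
    active = set(Lambda)
    # candidates: every active index advanced by one in one coordinate, plus the zero index
    candidates = {tuple(a[j] + (j == i) for j in range(k)) for a in active for i in range(k)}
    candidates.add((0,) * k)
    rm = set()
    for alpha in candidates:
        if alpha in active:
            continue
        # alpha - e_i must be active for every coordinate i with alpha_i > 0
        if all(tuple(alpha[j] - (j == i) for j in range(k)) in active
               for i in range(k) if alpha[i] > 0):
            rm.add(alpha)
    return rm

# Example: a small downward-closed set in dimension k = 2
Lambda_t = {(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (0, 2)}
print(sorted(reduced_margin(Lambda_t, k=2)))   # [(0, 3), (1, 2), (2, 1), (3, 0)]
```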

Denoting by ftf_{t} the minimizer of f^k(f)f\mapsto\widehat{\mathcal{L}}_{k}(f) over fspan{ψ𝜶,𝜶Λt}f\in\text{span}\{\psi_{\bm{\alpha}},\bm{\alpha}\in\Lambda_{t}\}, we select 𝜶t\bm{\alpha}_{t}^{\ast} in the reduced margin ΛtRM\Lambda_{t}^{\text{RM}} with the following heuristic

𝜶targmax𝜶ΛtRM|𝜶^k(ft)|.\bm{\alpha}_{t}^{*}\in\operatorname*{arg\,max}_{\bm{\alpha}\in\Lambda_{t}^{\text{RM}}}|\nabla_{{\bm{\alpha}}}\widehat{\mathcal{L}}_{k}(f_{t})|. (37)

Here, the notation 𝜶^k(ft)\nabla_{{\bm{\alpha}}}\widehat{\mathcal{L}}_{k}(f_{t}) denotes the derivative of c𝜶^k(ft+c𝜶ψ𝜶)c_{\bm{\alpha}}\mapsto\widehat{\mathcal{L}}_{k}(f_{t}+c_{\bm{\alpha}}\psi_{\bm{\alpha}}) evaluated at c𝜶=0c_{\bm{\alpha}}=0. In other words, we select the multi-index 𝜶t\bm{\alpha}_{t}^{*} by choosing the largest functional derivative of ^k\widehat{\mathcal{L}}_{k} at ftf_{t} with respect to the basis functions not contained in the current expansion. This greedy procedure for learning each map component is presented in detail in Algorithm 1.

Algorithm 1 Estimate map component SkS_{k}
1:  Input: training sample {𝐗1:ki}i=1n\{\mathbf{X}_{1:k}^{i}\}_{i=1}^{n}, maximum cardinality mm for Λt\Lambda_{t}
2:  Initialize Λ0=\Lambda_{0}=\emptyset, f0=0f_{0}=0
3:  for t=0,,m1t=0,\dots,m-1 do
4:     Construct the reduced margin: ΛtRM\Lambda_{t}^{\text{RM}}
5:     Select the new multi-index: 𝜶targmax𝜶ΛtRM|𝜶^k(ft)|\bm{\alpha}_{t}^{*}\in\operatorname*{arg\,max}_{\bm{\alpha}\in\Lambda_{t}^{\text{RM}}}|\nabla_{{\bm{\alpha}}}\widehat{\mathcal{L}}_{k}(f_{t})|
6:     Update the active set: Λt+1=Λt{𝜶t}\Lambda_{t+1}=\Lambda_{t}\cup\{\bm{\alpha}_{t}^{*}\}
7:     Update the approximation: ft+1=argminfspan{ψ𝜶:𝜶Λt+1}^k(f)f_{t+1}=\operatorname*{arg\,min}_{f\in\text{span}\{\psi_{\bm{\alpha}}:\bm{\alpha}\in\Lambda_{t+1}\}}\widehat{\mathcal{L}}_{k}(f)
8:  end for
9:  Output: S^k=k(fm)\widehat{S}_{k}=\mathcal{R}_{k}(f_{m})

We emphasize that this procedure achieves adaptivity by refining the approximation spaces for (fk)k=1d(f_{k})_{k=1}^{d}, and seeking fkf_{k} in these linear spaces at each iteration of Algorithm 1. This approach contrasts with normalizing flows that are typically formulated over nonlinear spaces of deep neural networks; see [72] for an example of combining neural networks with a rectifier to represent monotone maps. Training the latter can be understood as simultaneously minimizing the KL objective (5) and learning features relevant to the approximation of fkf_{k}, but to our knowledge optimization guarantees for such models are unavailable and much more difficult to achieve.

Remark 6.

If the objective function ^k\widehat{\mathcal{L}}_{k} is the linear least-squares loss, then Algorithm 1 corresponds to orthogonal matching pursuit for sparse linear regression with normalized basis functions. An alternative heuristic for selecting 𝛂t\bm{\alpha}_{t}^{\ast} that requires second-order information from the objective is 𝛂targmax𝛂ΛtRM|𝛂^(ft)|2/|𝛂2^(ft)|\bm{\alpha}_{t}^{\ast}\in\operatorname*{arg\,max}_{\bm{\alpha}\in\Lambda_{t}^{\text{RM}}}|\nabla_{\bm{\alpha}}\widehat{\mathcal{L}}(f_{t})|^{2}/|\nabla^{2}_{\bm{\alpha}}\widehat{\mathcal{L}}(f_{t})|. We found this criterion to perform similarly to (37) for the polynomial basis, but it led to faster convergence of the objective when working with the wavelet basis.
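To make the structure of Algorithm 1 concrete, the sketch below instantiates the greedy loop with the linear least-squares loss of this remark, for which the selection rule (37) reduces to correlating the current residual with each candidate basis function (i.e., orthogonal matching pursuit). The data, tensorized Hermite features, and least-squares refit are illustrative stand-ins; in the transport-map setting the empirical objective and its gradient would take their place.

```python
import numpy as np
from numpy.polynomial.hermite_e import HermiteE

def reduced_margin(Lambda, k):
    # same computation as in the earlier sketch
    cand = {tuple(a[j] + (j == i) for j in range(k)) for a in Lambda for i in range(k)}
    cand.add((0,) * k)
    return {a for a in cand if a not in Lambda and all(
        tuple(a[j] - (j == i) for j in range(k)) in Lambda for i in range(k) if a[i] > 0)}

def basis_column(alpha, X):
    """Tensorized Hermite feature prod_j He_{alpha_j}(X[:, j])."""
    col = np.ones(X.shape[0])
    for j, a in enumerate(alpha):
        col *= HermiteE.basis(a)(X[:, j])
    return col

def greedy_fit(X, y, m):
    n, k = X.shape
    Lambda, coeffs = [], np.zeros(0)
    for _ in range(m):
        Phi = (np.column_stack([basis_column(a, X) for a in Lambda])
               if Lambda else np.zeros((n, 0)))
        resid = y - Phi @ coeffs
        # selection rule (37) for the least-squares loss: the gradient w.r.t. each
        # candidate coefficient at zero is (minus) the residual correlation
        scores = {a: abs(basis_column(a, X) @ resid) / n
                  for a in reduced_margin(set(Lambda), k)}
        alpha_star = max(scores, key=scores.get)
        Lambda.append(alpha_star)
        # "update the approximation": re-fit the coefficients on the enlarged active set
        Phi = np.column_stack([basis_column(a, X) for a in Lambda])
        coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Lambda, coeffs

# Example: recover a sparse polynomial of two variables from noisy data
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
y = 1.0 + 0.7 * X[:, 0] - 0.4 * X[:, 1] + 0.5 * X[:, 0] * X[:, 1] + 0.05 * rng.standard_normal(500)
Lambda, coeffs = greedy_fit(X, y, m=4)
print(Lambda)                    # expect {(0,0), (1,0), (0,1), (1,1)} up to selection order
print(np.round(coeffs, 2))
```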

Remark 7.

A potential drawback of Algorithm 1 is that the greedy enrichment procedure cannot “see” behind the reduced margin. For instance, if a relevant multi-index is located far beyond the reduced margin and the gradient 𝛂^k(ft)\nabla_{{\bm{\alpha}}}\widehat{\mathcal{L}}_{k}(f_{t}) vanishes for all 𝛂ΛtRM\bm{\alpha}\in\Lambda_{t}^{RM}, then the algorithm will get stuck. [37] suggested a safeguard mechanism to avoid this behaviour: every tsgt_{\textrm{sg}}\in\mathbb{N} iterations, arbitrarily activate the oldest index in the reduced margin. This modification, however, was not needed in our numerical experiments.

Remark 8.

As pointed out in [14, 36, 37], adding multiple multi-indices at each greedy iteration could also yield better performance compared to adding only one multi-index at a time. The so-called “bulk-chasing” procedure identifies a subset λtΛtRM\lambda_{t}^{\ast}\subset\Lambda_{t}^{\text{RM}} of multi-indices that capture a fraction of the L2L_{2}-norm of the gradient 𝛂^k(ft)\nabla_{\bm{\alpha}}\widehat{\mathcal{L}}_{k}(f_{t}) along the reduced margin.

To determine the maximal cardinality of Λt\Lambda_{t} (i.e., when to stop adaptation), we use ν\nu-fold cross-validation as in [71, 6]. For each fold, we run Algorithm 1 with ν1\nu-1 folds of the training data for m=nν1νm=n\frac{\nu-1}{\nu} iterations, and evaluate the objective function in (7) on the held-out test set. We then select the number of terms mnm^{\ast}\leq n in the expansion (31) that minimizes the test error averaged over the ν\nu folds. The full sample is then used to run Algorithm 1 for mm^{\ast} iterations. The complete procedure for learning each map component SkS_{k}, with cross-validation, is called the Adaptive Transport Map (ATM) algorithm and is detailed in Algorithm 2. In practice, we also stop the training on each fold early if the held-out log-likelihood does not increase for more than 2020 consecutive iterations.
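A minimal sketch of the model-size selection just described, assuming the per-fold held-out negative log-likelihood curves (one value per greedy iteration) have already been computed by running Algorithm 1 on each fold; the 20-iteration patience implements the early-stopping rule mentioned above.

```python
import numpy as np

def select_num_terms(test_nll_per_fold, patience=20):
    """Pick m* minimizing the average held-out negative log-likelihood across folds,
    truncating each fold's curve once it has not improved for `patience` iterations."""
    truncated = []
    for curve in test_nll_per_fold:
        best, best_t = np.inf, 0
        for t, val in enumerate(curve):
            if val < best:
                best, best_t = val, t
            if t - best_t >= patience:      # early stopping on this fold
                break
        truncated.append(np.asarray(curve[:t + 1]))
    # average over the iterations reached by every fold, then take the argmin
    m_common = min(len(c) for c in truncated)
    mean_nll = np.mean([c[:m_common] for c in truncated], axis=0)
    return int(np.argmin(mean_nll)) + 1      # number of terms m* (1-indexed)

# Example with synthetic U-shaped held-out curves whose minimum sits near 30 terms
rng = np.random.default_rng(1)
curves = [(np.arange(100) - 30) ** 2 / 900 + 0.05 * rng.standard_normal(100)
          for _ in range(5)]
print(select_num_terms(curves))
```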

The cross-validation procedure produces an approximation for ff with an expansion whose cardinality is adapted to the size of the training sample nn. With a larger sample, we observe that the cross-validation procedure reliably adds more parameters, when needed, to reduce the bias in the approximation while controlling the variance of the estimated map parameters. Thus, we consider ATM a semi-parametric approach for approximating monotone triangular transport maps. Adaptation of the map complexity to the sample size nn will be demonstrated in the numerical results to follow.

Algorithm 2 ATM algorithm for learning map component SkS_{k}
1:  Input: Training sample 𝝌={𝐗1:ki}i=1n\bm{\chi}=\{\mathbf{X}_{1:k}^{i}\}_{i=1}^{n}, number of folds ν\nu
2:  Partition data into ν\nu folds of equal size
3:  for j=1,,νj=1,\dots,\nu do
4:     Partition 𝝌\bm{\chi}: 𝝌jtest\bm{\chi}^{\text{test}}_{j} is the jjth subset of 𝝌\bm{\chi} and 𝝌jtrain=𝝌𝝌jtest\bm{\chi}^{\text{train}}_{j}=\bm{\chi}\setminus\bm{\chi}^{\text{test}}_{j}
5:     Estimate ftf_{t} using Algorithm 1 for m=|𝝌jtrain|m=|\bm{\chi}^{\text{train}}_{j}| iterations with sample 𝝌jtrain\bm{\chi}^{\text{train}}_{j}
6:     Store iterates S^kj,1k(f1),,S^kj,mk(fm)\widehat{S}_{k}^{j,1}\coloneqq\mathcal{R}_{k}(f_{1}),\dots,\widehat{S}_{k}^{j,m}\coloneqq\mathcal{R}_{k}(f_{m})
7:     Evaluate log-likelihood log(S^kj,t)η\log(\widehat{S}^{j,t}_{k})^{\sharp}\eta for iterations t=1,,mt=1,\dots,m with sample 𝝌jtest\bm{\chi}^{\text{test}}_{j}
8:  end for
9:  Define m{1,,m}m^{\ast}\in\{1,\ldots,m\} as the minimizer of negative log-likelihood tj=1νlog(S^kj,t)η(𝝌jtest)t\mapsto\sum_{j=1}^{\nu}-\log(\widehat{S}^{j,t}_{k})^{\sharp}\eta(\bm{\chi}^{\text{test}}_{j})
10:  Estimate map ff^{\ast} using Algorithm 1 for mm^{\ast} iterations with sample 𝝌\bm{\chi}
11:  Output: S^k=k(f)\widehat{S}_{k}=\mathcal{R}_{k}(f^{\ast})

5 Numerical experiments

In this section, we evaluate the performance of the ATM algorithm for learning monotone triangular transport maps associated with a variety of target distributions. Sections 5.1 and 5.2 evaluate and visualize the approximation power of the proposed method for strongly non-Gaussian targets. Section 5.3 illustrates the benefits of adaptivity over a range of sample sizes, and Section 5.4 demonstrates how the estimated maps reveal and exploit conditional independence structure in the target distribution. Section 5.5 presents joint and conditional density estimation results for a suite of UCI datasets. Code for reproducing all of these numerical results is available online at https://github.com/baptistar/ATM.

In all of the numerical experiments, we pre-process the data by standardizing each marginal, i.e., subtracting the empirical mean and dividing each component of 𝐗\mathbf{X} by its empirical standard deviation. Within the rectifier k\mathcal{R}_{k}, we employ the modified soft-plus function g(ξ)=log(1+2ξ)/log(2)g(\xi)=\log(1+2^{\xi})/\log(2), which satisfies the conditions of Propositions 4, 5, and 6. This function also satisfies g(0)=1g(0)=1, so that f(𝒙1:k)=0f(\bm{x}_{1:k})=0 is transformed into k(f)(𝒙1:k)=xk\mathcal{R}_{k}(f)(\bm{x}_{1:k})=x_{k}. We evaluate the integral in (11) numerically using an adaptive quadrature method based on Clenshaw-Curtis points with a relative error of 10310^{-3}. At each iteration of the ATM algorithm, we optimize over the coefficients c𝜶c_{\bm{\alpha}} for 𝜶Λt\bm{\alpha}\in\Lambda_{t} using a BFGS (quasi-Newton) method [40].
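For concreteness, the sketch below evaluates a rectified map component ℛk(f)(x) = f(x_{<k},0) + ∫_0^{x_k} g(∂_k f(x_{<k},t)) dt with the modified soft-plus g above, using SciPy's generic adaptive quadrature in place of the Clenshaw-Curtis rule used in our experiments; the bivariate f in the example is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.integrate import quad

def g(xi):
    """Modified soft-plus: g(0) = 1, so f = 0 is mapped to the identity in x_k."""
    return np.log1p(2.0 ** xi) / np.log(2.0)

def rectify(f, df_dxk, x):
    """Evaluate R_k(f)(x) = f(x_{<k}, 0) + int_0^{x_k} g(d_k f(x_{<k}, t)) dt."""
    x_prev, x_k = x[:-1], x[-1]
    integrand = lambda t: g(df_dxk(np.append(x_prev, t)))
    integral, _ = quad(integrand, 0.0, x_k)       # generic adaptive quadrature
    return f(np.append(x_prev, 0.0)) + integral

# Example with k = 2 and an arbitrary smooth, non-monotone f
f = lambda z: 0.5 * z[0] + 0.3 * z[0] * z[1] + 0.1 * z[1] ** 2
df_dx2 = lambda z: 0.3 * z[0] + 0.2 * z[1]        # partial derivative of f in x_2
print(rectify(f, df_dx2, np.array([1.0, -0.5])))  # increasing in x_2 by construction
```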

5.1 One-dimensional examples

We first consider a one-dimensional mixture of Gaussians with density π(x)=0.5𝒩(x;2,0.5)+0.5𝒩(x;2,2)\pi(x)=0.5\mathcal{N}(x;-2,0.5)+0.5\mathcal{N}(x;2,2). To estimate π\pi, we generate n=104n=10^{4} training samples {Xi}i=1n\{X^{i}\}_{i=1}^{n} and use them to estimate the map SKRS_{\textrm{KR}} that pushes forward π\pi to η\eta. In one dimension, the KR rearrangement (which coincides with the optimal transport map [53]) is simply the increasing function SKR=Fη1FπS_{\textrm{KR}}=F_{\eta}^{-1}\circ F_{\pi}, where FηF_{\eta} and FπF_{\pi} denote the cumulative distribution functions of the reference and target distributions, respectively. We can thus compare our estimated maps to this exact expression.
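As a reference for this example, the exact one-dimensional map can be written down directly; the sketch below does so, assuming the second arguments of 𝒩(·;·,·) above denote variances.

```python
import numpy as np
from scipy.stats import norm

def F_pi(x):
    """CDF of pi(x) = 0.5 N(x; -2, 0.5) + 0.5 N(x; 2, 2), second arguments read as variances."""
    return (0.5 * norm.cdf(x, loc=-2.0, scale=np.sqrt(0.5))
            + 0.5 * norm.cdf(x, loc=2.0, scale=np.sqrt(2.0)))

def S_KR(x):
    """Increasing rearrangement S_KR = F_eta^{-1} o F_pi pushing pi forward to N(0, 1)."""
    return norm.ppf(F_pi(x))

print(S_KR(np.linspace(-5.0, 7.0, 7)))
```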

We approximate SS as the transformation (f)\mathcal{R}(f) of a smooth non-monotone function ff, seeking ff in the finite-dimensional space V1pV1V_{1}^{p}\subset V_{1} spanned by polynomials of degree at most pp. Figure 4(a) plots the estimated pullback densities π^=S^η\widehat{\pi}=\widehat{S}^{\sharp}\eta for increasing polynomial degree pp, or equivalently an increasing number of terms #Λ\#\Lambda in ff (31). Figure 4(b) shows convergence of this approximation in KL divergence, again for increasing pp. We also observe convergence of the estimated map S^\widehat{S} towards the true KR rearrangement in the Lπ2L^{2}_{\pi} norm, as expected from the upper bound in (4). Figure 5 illustrates the approximation to the map SS and the associated non-monotone function ff for different choices of basis {ψα}α\{\psi_{\alpha}\}_{\alpha}. Here we observe that approximating ff using the modified Hermite polynomials (32) very closely tracks the KR rearrangement SKRS_{\textrm{KR}} and the function fKR=1(SKR)f_{\textrm{KR}}=\mathcal{R}^{-1}(S_{\textrm{KR}}) as x±x\to\pm\infty, as compared to standard Hermite polynomials or Hermite functions [9]. Enforcing linear asymptotic behavior of the transport map ensures that the approximate distribution has Gaussian tails in regions where there are few samples.

Refer to caption
(a)
Refer to caption
(b)
Figure 4: (a) The pullback density S^η\widehat{S}^{\sharp}\eta approaches the target Gaussian mixture density π\pi when increasing the maximum polynomial degree pp of the space V1pV1V_{1}^{p}\subset V_{1}. (b) With increasing pp, the pullback density converges to π\pi in KL divergence, and the estimated map converges to SKRS_{\textrm{KR}} in Lπ2L^{2}_{\pi}.
Refer to caption
(a)
Refer to caption
(b)
Figure 5: (a) The approximate transport maps S^\widehat{S} compared to SKRS_{\textrm{KR}} (black), and (b) the corresponding non-monotone functions ff compared to fKR1(SKR)f_{\textrm{KR}}\coloneqq\mathcal{R}^{-1}(S_{\textrm{KR}}). Both subfigures illustrate different choices of basis {ψα}α\{\psi_{\alpha}\}_{\alpha}, for the Gaussian mixture target of Figure 4. The modified Hermite polynomial basis provides the closest approximation to SKRS_{\textrm{KR}} and fKRf_{\textrm{KR}}.

Next we consider a Gaussian mixture with density π(x)=0.5𝒩(x;0,1)+0.5𝒩(x;0,0.025)\pi(x)=0.5\mathcal{N}(x;0,1)+0.5\mathcal{N}(x;0,0.025). This mixture is a common test case for sampling and density estimation methods, as it evaluates their ability to capture densities with multiple scales [59]. Given n=104n=10^{4} samples, we compute the approximate map using either modified Hermite polynomials (Section 4.1) or Mexican hat wavelets (Section 4.2). Figure 6(a) plots approximations to the target density using an m=15m=15 term expansion (31) for ff, while Figure 6(b) shows convergence in KL divergence for both the polynomial and wavelet bases. We observe that the polynomial approximation suffers from oscillations, while the wavelets better capture localized features. This results in a much faster convergence of the KL divergence for wavelets than for polynomials.

Refer to caption
(a)
Refer to caption
(b)
Figure 6: Approximation of a scale mixture of Gaussians, using Hermite polynomials or Ricker wavelet expansions. (a) A wavelet expansion with m=15m=15 coefficients approximates the target density better than a polynomial expansion of the same size, and (b) yields significantly lower KL divergence for this example.

5.2 Two-dimensional datasets

To demonstrate the expressiveness of the modified Hermite polynomial basis (32), we use the ATM algorithm to approximate the Knothe–Rosenblatt rearrangement for several two-dimensional distributions with strongly non-Gaussian geometries: the “banana,” “funnel,” “cosine,” “mixture of Gaussians (MoG),” and “ring” distributions π\pi considered in [73, 25]. For each distribution, we generate an i.i.d. sample of size n=104n=10^{4} from π\pi and apply a random rotation to the data using an angle uniformly distributed in [0,π/2][0,\pi/2]. Figure 7 plots the true densities π\pi and the approximate densities π^S^η\widehat{\pi}\coloneqq\widehat{S}^{\sharp}\eta found using ATM. We use 55-fold cross-validation to identify the optimal number of elements in the multi-index set for each map component: S1S_{1} and S2S_{2}. Figure 7 indicates the total number of coefficients #Λm\#\Lambda_{m^{\ast}} across both map components, underscoring the parsimony of the approximation.

Truth

(a)
(b) Banana
Refer to caption
(c) Funnel
Refer to caption
(d) Cosine
Refer to caption
(e) MoG
Refer to caption
(f) Ring
Refer to caption

ATM

(g)
Refer to caption
(h) #Λm=44\#\Lambda_{m^{\ast}}=44
Refer to caption
(i) #Λm=31\#\Lambda_{m^{\ast}}=31
Refer to caption
(j) #Λm=72\#\Lambda_{m^{\ast}}=72
Refer to caption
(k) #Λm=10\#\Lambda_{m^{\ast}}=10
Refer to caption
(l) #Λm=39\#\Lambda_{m^{\ast}}=39
Figure 7: Five different densities π\pi (top row) and their approximations π^\widehat{\pi} (bottom row) built using the ATM algorithm with a Hermite polynomial basis for VkV_{k}, given a sample of size n=104n=10^{4} from π\pi. The total number of coefficients #Λm\#\Lambda_{m^{\ast}} in both map components is indicated below each density.

5.3 Random mixture of Gaussians

In this example, we evaluate the stability and accuracy of the ATM algorithm for learning transport maps in the small sample regime. We consider a three-dimensional distribution π\pi defined as a mixture of Gaussians centered at the eight vertices of the hypercube [4,4]3[-4,4]^{3}, each with covariance matrix IdI_{d}. The weights of the mixture components are randomly sampled from a uniform distribution 𝒰([0,1])\mathcal{U}([0,1]) and then normalized so that 3dπ=1\int_{\mathbb{R}^{3}}\mathrm{d}\pi=1. To estimate π\pi, we generate a training sample {𝑿i}i=1nπ\{\bm{X}^{i}\}_{i=1}^{n}\sim\pi and use the ATM algorithm to estimate the KR rearrangement that pushes forward π\pi to η\eta using 55-fold cross validation.

Figure 8(a) plots the KL divergence of π^\widehat{\pi} from π\pi averaged over 10 experiments. Here, the training sets used to build π^=S^η\widehat{\pi}=\widehat{S}^{\sharp}\eta are of varying size n[101,104]n\in[10^{1},10^{4}] and the reported KL divergence is computed on a common test set of 10410^{4} samples. Figure 8(b) plots the total number of coefficients #Λm\#\Lambda_{m^{*}} identified by the ATM algorithm. The performance of ATM is compared to a non-adaptive method where Λ=Λ(p){𝜶0k,𝜶1p}\Lambda=\Lambda(p)\coloneqq\{\bm{\alpha}\in\mathbb{N}_{0}^{k},\|\bm{\alpha}\|_{1}\leq p\} is arbitrarily fixed, for p=1,3,5p=1,3,5. Note that Λ(p)\Lambda(p) corresponds to polynomial ff with total degree pp. For each sample size nn, ATM consistently finds a better estimator of π\pi (in the sense of KL divergence) than a non-adaptive method with a fixed degree pp, by adaptively identifying the basis {ψ𝜶}𝜶Λt\{\psi_{\bm{\alpha}}\}_{\bm{\alpha}\in\Lambda_{t}} to represent the map components. In addition, the ATM estimator achieves smaller KL divergence with fewer total map coefficients, as seen in Figure 8(b).

Refer to caption
(a)
Refer to caption
(b)
Figure 8: (a) KL divergence over 10 sets of training samples is lower for ATM than for non-adapted expansions. (b) Trade-off between the approximation quality and number of coefficients using the adaptive algorithm for different nn. For any given number of map coefficients, the ATM algorithm finds a representation with similar or lower KL divergence than the lowest achievable error of each non-adaptive approximation (indicated with stars in plot (b)).

5.4 Stochastic volatility

Next we consider data from a Markov process that describes the volatility of the return on a financial asset over time. The model has two hyperparameters μ\mu and ϕ\phi, with a state (Zk)k({Z}_{k})_{k} that represents the log-volatility at times k=1,,Tk=1,\ldots,T. The two hyperparameters are drawn from the distributions μ𝒩(0,1)\mu\sim\mathcal{N}(0,1) and ϕ=2exp(ϕ)/(1+exp(ϕ))1\phi=2\exp(\phi^{*})/(1+\exp(\phi^{*}))-1 for ϕ𝒩(3,1)\phi^{*}\sim\mathcal{N}(3,1), so that ϕ(1,1)\phi\in(-1,1). The log-volatility obeys the order-one autoregressive process Zk+1=μ+ϕ(Zkμ)+ϵkZ_{k+1}=\mu+\phi(Z_{k}-\mu)+\epsilon_{k} for all k0k\geq 0, where ϵk𝒩(0,1)\epsilon_{k}\sim\mathcal{N}(0,1) is independent of all other variables and Z0|μ,ϕ𝒩(μ,11ϕ2)Z_{0}|\mu,\phi\sim\mathcal{N}(\mu,\frac{1}{1-\phi^{2}}). While the states are Gaussian conditioned on the hyperparameters, the joint distribution of

𝑿=(μ,ϕ,Z1,,ZT),\bm{X}=(\mu,\phi,Z_{1},\ldots,Z_{T}),

is non-Gaussian. In this example, the dimension of 𝑿\bm{X} is made arbitrarily large by increasing TT.
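A sketch of simulating this model to generate the joint sample used below; treating Z_0 as an auxiliary initial state that is not included in X, and iterating the recursion T times to produce Z_1, ..., Z_T, are indexing assumptions.

```python
import numpy as np

def sample_stochastic_volatility(T, n, seed=0):
    """Draw n joint samples X = (mu, phi, Z_1, ..., Z_T) of dimension d = T + 2."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, 1.0, size=n)
    phi_star = rng.normal(3.0, 1.0, size=n)
    phi = 2.0 * np.exp(phi_star) / (1.0 + np.exp(phi_star)) - 1.0  # phi in (-1, 1)
    Z = rng.normal(mu, 1.0 / np.sqrt(1.0 - phi ** 2))              # stationary initial state Z_0
    states = []
    for _ in range(T):
        Z = mu + phi * (Z - mu) + rng.normal(0.0, 1.0, size=n)     # AR(1) recursion
        states.append(Z)
    return np.column_stack([mu, phi] + states)

X = sample_stochastic_volatility(T=40, n=1000)
print(X.shape)   # (1000, 42)
```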

A key property of triangular transport maps is that they inherit sparsity from the conditional independence structure of π\pi. [61] showed that the Markov structure of π\pi yields a lower bound on the sparsity of the map SS (i.e., the functional dependence of each component SkS^{k} on the input variables 𝒙1:k1\bm{x}_{1:k-1}). This sparsity was exploited to learn the structure of undirected probabilistic graph models from data in [38, 4]. From the conditional independence properties of the stochastic volatility model described above, we know that the KR rearrangement SKRS_{\textrm{KR}} between the joint distribution of 𝑿\bm{X} and a standard Gaussian reference η\eta is sparse. Moreover, the exact sparsity of SKRS_{\textrm{KR}} can be derived from Theorem 5.1 in [61]. We now show that the ATM algorithm is capable of detecting and exploiting this structure—without knowing in advance that it exists.

Figure 9(a) compares the variable dependence of the true KR rearrangement SKRS_{\textrm{KR}} and the map S^\widehat{S} learned by the ATM algorithm for a distribution with T=40T=40 using a sample of size n=103n=10^{3} from π\pi; a non-filled (j,k)(j,k) entry of the plot indicates that the kk-th map component does not depend on variable xjx_{j}. The dependence of component SkS_{k} on (xk,xk1)(x_{k},x_{k-1}) shows that each state ZkZ_{k} strongly depends on the previous state in time. Most of the map components also show dependence on the hyperparameters (μ,ϕ)(\mu,\phi). The estimated sparsity closely matches the exact sparsity of the KR rearrangement. Furthermore, the sparse variable dependence of the kkth map component S^k\widehat{S}_{k} on parent nodes Pa(k){1,,k1}\text{Pa}(k)\subseteq\{1,\dots,k-1\} produces an approximation to the kkth marginal conditional density given by

π^(xk|𝒙<k)=π^(xk|𝒙Pa(k)).\widehat{\pi}(x_{k}|\bm{x}_{<k})=\widehat{\pi}(x_{k}|\bm{x}_{\text{Pa}(k)}).

By identifying the parent nodes Pa(k)\text{Pa}(k) of each variable kk, we also learn a sparse Bayesian network or directed acyclic graphical (DAG) model representing the target distribution [29]. As a result, we can see the ATM algorithm as a technique for learning DAGs from samples, given a prescribed variable ordering.

Next, we consider the approximation of SKRS_{\textrm{KR}} for Markov models of increasing state dimension TT and hence increasing map dimension d=T+2d=T+2; these experiments use a fixed sample size n=103n=10^{3} as before. Figure 9(b) plots the KL divergence from the ATM approximation to the joint density of 𝑿d\bm{X}\in\mathbb{R}^{d}, for increasing dimensions dd. We compare ATM with a variant of ATM where the exact sparsity pattern of the KR rearrangement is provided to the algorithm in advance. This variant, labeled “sparsity-aware ATM” in Figure 9(b), differs from the ATM algorithm in that it only activates multi-indices 𝜶t\bm{\alpha}^{\ast}_{t} that match the sparsity of SKRS_{\text{KR}} (meaning of the form 𝜶=(α1,α2,0,,0,αk1,αk)\bm{\alpha}=(\alpha_{1},\alpha_{2},0,\ldots,0,\alpha_{k-1},\alpha_{k})). We also compare ATM with non-adaptive maps of degree p=1p=1 and p=2p=2; these maps do not exploit the conditional independence structure of π\pi, and hence depend on all input variables. For low dimensions, we see that degree p=2p=2 maps can better capture the non-Gaussian target distributions than p=1p=1 linear maps. As the dimension dd increases, however, the growing number of coefficients in the p=2p=2 maps results in higher-variance estimators, outweighing their smaller bias, and hence the KL divergence grows rapidly with dd (with crossover around d=21d=21). In contrast, ATM outperforms the non-adaptive maps for all dd and, more importantly, achieves a KL divergence that grows linearly with dd; indeed, it performs similarly to sparsity-aware ATM, i.e., the best-case “oracle sparsity” estimator.

Refer to caption
(a)
Refer to caption
(b)
Figure 9: (a) The sparsity of an ATM map S^\widehat{S} and of the KR rearrangement SKRS_{\textrm{KR}} for the stochastic volatility model with T=40T=40 time steps. (b) KL divergence using ATM with increasing dimension dd, compared to adaptive map estimators with known sparsity and to non-adaptive map estimators.

5.5 Tabular datasets

Lastly, we evaluate ATM’s performance for density estimation on a suite of UCI datasets [32] that were also considered in [66]. These datasets have dimensionalities between d=10d=10 and 1515 and sample sizes between n=506n=506 and 58755875. We pre-process each dataset to eliminate discrete-valued variables and one variable in every pair that has Pearson correlation coefficient greater than 0.98, following the procedure in [66]. We consider 10 splits of the data. For each split, we use one fold (i.e., 10% of the data) as a test set and the remaining 9 folds (i.e., 90% of the data) as the training set to build S^\widehat{S}. To assess the quality of our estimated map S^\widehat{S}, we evaluate the negative log-likelihood of the pullback density π^=S^η\widehat{\pi}=\widehat{S}^{\sharp}\eta on the test set. The negative log-likelihood is an empirical estimator of 𝔼π[logS^η]=𝒟KL(π||S^η)logπdπ-\mathbb{E}_{\pi}[\log\widehat{S}^{\sharp}\eta]=\mathcal{D}_{\text{KL}}(\pi||\widehat{S}^{\sharp}\eta)-\int\log\pi\mathrm{d}\pi and is, up to the unknown constant logπdπ-\int\log\pi\mathrm{d}\pi, the KL divergence from S^η\widehat{S}^{\sharp}\eta to π\pi. Table 1 presents the mean of the negative log-likelihoods over the 1010 splits and a 95% confidence interval for the mean. For all datasets, we observe an improvement using ATM over non-adaptive p=2p=2 maps and multivariate Gaussian approximations (i.e., p=1p=1 maps), with a lower number of total coefficients.
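The test metric in Table 1 depends only on the estimated map components: for a triangular map with standard Gaussian reference, the pullback log-density decomposes as a sum over components of log η_1(Ŝ_k(x_{≤k})) + log ∂_k Ŝ_k(x_{≤k}). The sketch below assumes the components and their partial derivatives are available as vectorized callables; the identity-map example is only a sanity check.

```python
import numpy as np

def negative_log_likelihood(S_list, dS_list, X_test):
    """Average negative pullback log-density -(1/n) sum_i log (S^sharp eta)(x^i),
    where S_list[k] and dS_list[k] map arrays of shape (n, k+1) to length-n arrays."""
    n, d = X_test.shape
    logpdf = np.zeros(n)
    for k in range(d):
        x_le_k = X_test[:, :k + 1]
        s = S_list[k](x_le_k)
        logpdf += -0.5 * s ** 2 - 0.5 * np.log(2.0 * np.pi)   # log eta_1(S_k(x_{<=k}))
        logpdf += np.log(dS_list[k](x_le_k))                  # log d_k S_k(x_{<=k})
    return -np.mean(logpdf)

# Example: the identity map scores a standard Gaussian sample at about d/2 (1 + log 2 pi)
d = 3
S_list = [lambda x: x[:, -1]] * d
dS_list = [lambda x: np.ones(x.shape[0])] * d
X_test = np.random.default_rng(2).standard_normal((1000, d))
print(negative_log_likelihood(S_list, dS_list, X_test))
```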

Table 1: Mean negative log-likelihood for UCI datasets over 10 sets of training samples.
The estimator with best performance (lowest negative log-likelihood) is highlighted in bold.
Dataset (d,n)(d,n) Gaussian ATM # coefficients p=2p=2 maps # coefficients
Boston (10,506)(10,506) 11.3±0.511.3\pm 0.5 3.1±0.6\mathbf{3.1\pm 0.6} 228±7228\pm 7 6.5±0.46.5\pm 0.4 285285
Red Wine (11,1599)(11,1599) 13.2±0.313.2\pm 0.3 9.8±0.4\mathbf{9.8\pm 0.4} 289±9289\pm 9 10.5±0.210.5\pm 0.2 363363
White Wine (11,4898)(11,4898) 13.2±0.513.2\pm 0.5 11.0±0.2\mathbf{11.0\pm 0.2} 342±26342\pm 26 12.0±1.012.0\pm 1.0 363363
Parkinsons (15,5875)(15,5875) 10.8±0.410.8\pm 0.4 2.8±0.4\mathbf{2.8\pm 0.4} 783±17783\pm 17 5.1±0.45.1\pm 0.4 815815

Lastly, we evaluate ATM’s performance for conditional density estimation on a different suite of UCI datasets. We follow a similar procedure as above to pre-process each dataset. Each dataset has a one-dimensional predictor variable XdX_{d} and covariates 𝑿<d\bm{X}_{<d} of varying dimension. To approximate the conditional density πd(xd|𝒙<d)\pi_{d}(x_{d}|\bm{x}_{<d}), we estimate one map component SdS_{d} as a function of the predictor and the covariates using joint samples {(𝑿<di,Xdi)}i=1n\{(\bm{X}_{<d}^{i},X_{d}^{i})\}_{i=1}^{n}. Table 2 presents the mean negative conditional log-likelihoods over 1010 splits of the training data. The negative conditional log-likelihood is an empirical estimator of 𝔼π[logS^dη]=𝔼π(𝒙<d)[𝒟KL(πd(xd|𝒙<d)||S^dη)]logπd(xd|𝒙<d)dπd-\mathbb{E}_{\pi}[\log\widehat{S}_{d}^{\sharp}\eta]=\mathbb{E}_{\pi(\bm{x}_{<d})}[\mathcal{D}_{\textrm{KL}}(\pi_{d}(x_{d}|\bm{x}_{<d})||\widehat{S}_{d}^{\sharp}\eta)]-\int\log\pi_{d}(x_{d}|\bm{x}_{<d})\textrm{d}\pi_{d} and is, up to the unknown constant logπd(xd|𝒙<d)dπd-\int\log\pi_{d}(x_{d}|\bm{x}_{<d})\textrm{d}\pi_{d}, the expected KL divergence from S^dη\widehat{S}_{d}^{\sharp}\eta to the marginal conditional πd(xd|𝒙<d)\pi_{d}(x_{d}|\bm{x}_{<d}).

In Table 2, we compare ATM to conditional kernel density estimation (CKDE) [58], ϵ\epsilon-neighborhood kernel density estimation (NKDE), and kernel mixture networks (KMN) [1], using the implementation provided by [52]. We also include results for high-capacity parametric conditional density estimation methods that rely on neural networks: mixture density networks (MDN) [7] and conditional normalizing flows (NF) [64]; implementation details of these methods are provided in Appendix C. For all datasets, we observe that ATM has performance comparable to the neural-network based methods and improved performance over the nonparametric approaches (e.g., on the Concrete and Yacht datasets).

In Table 3, we compare the number of coefficients in the ATM and MDN approximations and the computational time required to learn each of these conditional density representations. The runtime of the greedy procedure in Algorithm 1 (sequentially, for all dd map components) is generally less than that of MDN, achieving similar approximation performance with about two orders of magnitude fewer coefficients. It is more realistic, however, to consider ATM with the cross-validation procedure of Algorithm 2 and to compare this with the hyperparameter tuning required to achieve the reported performance of MDN. ATM requires cross-validation over only a single hyperparameter, namely the total number of coefficients mm^{\ast} (see Algorithm 2). In comparison, tuning several hyperparameters in the neural-network based methods (e.g., the learning rate and the structure of the networks) by performing cross-validation over a tensor product grid increases runtimes by more than an order of magnitude relative to ATM, as reported in Table 3.

Table 2: Mean negative conditional log-likelihood for UCI datasets over 10 sets of training samples.
The method with best performance from both categories is highlighted in bold.
Dataset (d,n)(d,n) ATM CKDE NKDE MDN KMN NF
Boston (12,506) 2.6±0.2\mathbf{2.6\pm 0.2} 2.6±0.2\mathbf{2.6\pm 0.2} 3.1±0.23.1\pm 0.2 2.4±0.2\mathbf{2.4\pm 0.2} 2.7±0.22.7\pm 0.2 2.4±0.1\mathbf{2.4\pm 0.1}
Concrete (9,1030) 3.1±0.1\mathbf{3.1\pm 0.1} 3.2±0.13.2\pm 0.1 3.9±0.13.9\pm 0.1 2.9±0.1\mathbf{2.9\pm 0.1} 3.5±0.13.5\pm 0.1 3.2±0.23.2\pm 0.2
Energy (10,768) 1.5±0.11.5\pm 0.1 1.0±0.1\mathbf{1.0\pm 0.1} 2.1±0.22.1\pm 0.2 1.2±0.1\mathbf{1.2\pm 0.1} 1.7±0.11.7\pm 0.1 1.7±0.31.7\pm 0.3
Yacht (7,308) 0.5±0.2\mathbf{0.5\pm 0.2} 1.1±0.31.1\pm 0.3 3.8±0.23.8\pm 0.2 0.7±0.2\mathbf{0.7\pm 0.2} 1.8±0.21.8\pm 0.2 1.3±0.51.3\pm 0.5
Table 3: The number of coefficients in the ATM and MDN approximations, as well as the runtimes (in seconds) to identify coefficients via optimization and to perform cross-validation of hyperparameters.
ATM MDN
Dataset # coefficients runtime (s) runtime w/CV # coefficients runtime (s) runtime w/CV
Boston 31±331\pm 3   6±1\;\;6\pm 1   88±14\;\;88\pm 14 (5±2)×103(5\pm 2)\times 10^{3} 18±818\pm 8 1584±131584\pm 13
Concrete 42±542\pm 5 14±314\pm 3 230±18230\pm 18 (2.6±0.3)×103(2.6\pm 0.3)\times 10^{3} 11±111\pm 1 1846±101846\pm 10
Energy 41±741\pm 7 12±512\pm 5 193±29193\pm 29 (8±3)×103(8\pm 3)\times 10^{3}   32±12\;\,32\pm 12 1758±171758\pm 17
Yacht 28±428\pm 4   7±2\;\;7\pm 2   99±20\;\;99\pm 20 (6±2)×103(6\pm 2)\times 10^{3} 23±623\pm 6 1453±151453\pm 15

6 Conclusions

This paper has presented and analyzed a functional framework for learning monotone triangular transport maps. Our approach represents monotone component functions of a triangular map through the action of an invertible rectification operator k\mathcal{R}_{k} on smooth, generally non-monotone, functions. Imposing appropriate structure on this operator and the function space VkV_{k} on which it acts, along with conditions on the target density π\pi, yields an unconstrained optimization problem for learning the map components that has many desirable and useful properties. First, under certain assumptions on k\mathcal{R}_{k} and π\pi, we show that the optimization objective is bounded, continuous, and differentiable for all fVkf\in V_{k}. This permits the use of deterministic gradient-based optimization methods to find the minimizers. Next, under certain conditions on k\mathcal{R}_{k}, we show that the optimization problem has no spurious local minima on VkV_{k}. In practice, this yields important robustness to the initial guess and other parameters of the optimization algorithms. Finally, under the same (cumulative) conditions on k\mathcal{R}_{k} and some additional assumptions on the target density, we show that the optimization problem has a unique global minimizer on VkV_{k} that corresponds to the canonical KR rearrangement.

Our functional framework also enables the construction of novel transport map estimators, based on maximum likelihood, given a finite sample from the target. The procedure we develop here is semi-parametric, making use of different hierarchical bases for VkV_{k}. In particular, we propose a greedy, adaptive algorithm that identifies a sparse set of basis functions, automatically tailored to the target density and to the sample size. Sparsity in the basis selection reflects conditional independence structure in the target. More generally, the maps we build can capture the structure of complex probability distributions with sufficient data, but also are robust in settings with few observations. This is crucial to deploying these algorithms within large-scale applications, such as data assimilation, where triangular transport maps must be learned online [60]. We demonstrate good computational performance of our algorithm—and the parsimonious representations it produces—in a variety of joint and conditional density estimation problems. We outline some interesting future research directions below.

KR recovery for broader classes of target distributions

While our result guaranteeing no spurious local minima (Theorem 7) makes no explicit assumptions on π\pi, our subsequent results on recovery of the KR rearrangement (see Corollary 9) make particular use of Gaussian tail assumptions, relating the target π\pi to the Gaussian reference η\eta. In principle, however, the space VkV_{k} contains functions k1(SKR,k)\mathcal{R}_{k}^{-1}(S_{\textrm{KR},k}) corresponding to a broader class of target densities; VkV_{k} includes functions that grow up to sub-exponentially fast, as a result of its Gaussian weighting. Hence, it includes the inverse images of KR rearrangements for lighter-tailed π\pi, whose component functions grow faster than linear as xk±x_{k}\to\pm\infty. For instance, [17, Sec. 2.1] notes that the increasing rearrangement from the univariate exponential power distribution π(x)e|x|p\pi(x)\propto e^{-|x|^{p}} with p>2p>2 to a Gaussian η\eta grows as 𝒪(xp/2)\mathcal{O}(x^{p/2}) when xx\rightarrow\infty; see [25] for the asymptotic behavior of monotone maps for other light-tailed densities. It may be possible to obtain unique recovery results analogous to Corollary 9 for such densities, although some modifications to our analysis would be needed.

Approximation theory and statistical guarantees

An open research topic, to the best of our knowledge, is to analyze the finite-dimensional approximation of triangular transport maps on unbounded domains. (See [76, 75] for an approximation theory of triangular transports between analytic densities π,η\pi,\eta on bounded domains.) These results would show how approximate maps converge to the KR rearrangement with different parameterizations (e.g., polynomial spaces or neural networks of increasing size), and we expect that our rectification framework could be a useful ingredient of such analyses. A parallel line of work is to develop non-asymptotic statistical convergence results for triangular transports (e.g., in the context of density estimation), to understand how the quality of density estimates and of map estimates depends on the sample size nn. The analyses in [23, 70] address this question for certain classes of distributions on bounded domains. Combining these lines of work could provide lower bounds on the number of samples required to learn maps of a given complexity, and also some analytical guidance for how to navigate the bias-variance tradeoff of a map estimator given finite samples.

Other sources of low-dimensional structure

The ATM algorithm in this paper identifies maps with sparse variable dependence by exploiting conditional independence structure in the target distribution [61]. In future work it will be interesting to investigate other notions of low-dimensional structure and how they could facilitate the learning of maps from small samples. One promising notion is when the target density π\pi departs from the reference η\eta only along a low-dimensional subspace. In this case, the triangular map SS pushing forward π\pi to η\eta can be written as a low-dimensional perturbation of the identity map, after a variable rotation. See [10] for a recent contribution in this direction applied to variational inference.

Map ordering

One disadvantage of triangular maps is that the approximation depends on the choice of variable ordering. In particular, each variable ordering yields a different factorization of the target density and a different Knothe–Rosenblatt rearrangement SKRS_{\textrm{KR}}. Thus, it is of interest to develop variable ordering algorithms that minimize the finite-sample error of the estimated transport map S^\widehat{S}, the estimated pullback density π^=S^η\widehat{\pi}=\widehat{S}^{\sharp}\eta, or other goal-oriented metrics. For target distributions that induce sparse variable dependence in the map, one approach is to find the permutation that maximizes the sparsity of S^\widehat{S}. This is equivalent to finding the sparsest Bayesian network or directed acyclic graphical (DAG) model for a distribution from samples. While this problem is in general NP-complete, effective algorithms have been proposed [49]. In the Gaussian setting, such an algorithm reduces to finding a sparse maximum likelihood estimator of the Cholesky factor of the inverse covariance matrix of π\pi. Since a linear transport map is precisely this Cholesky factor for Gaussian π\pi and standard normal η\eta (see [4, Section 3]), we expect that such sparse permutation algorithms could be generalized to the nonlinear transport map setting.

Nonparametric methods

Instead of finding a particular finite-dimensional basis for the map components Sk=k(f)S_{k}=\mathcal{R}_{k}(f), nonparametric methods do not limit the functional form of ff (e.g., as linear combinations of multivariate polynomials). A broadly useful nonparametric approach involves seeking ff in a reproducing kernel Hilbert space (RKHS). Thanks to the representer theorem [56], the optimal ff in the RKHS can still be identified by solving a finite-dimensional optimization problem. Choosing the kernel so that the RKHS is a suitable weighted Sobolev space (see, e.g., [41]) will yield map representations that fall within the framework of Section 3.2. A related approach uses Gaussian process representations for the map components [26]. It will be interesting to compare the finite-sample performance of such methods to the semi-parametric procedure proposed in this work.

Acknowledgments

RB, YM, and OZ gratefully acknowledge support from the INRIA associate team Unquestionable. RB and YM are also grateful for support from the AFOSR Computational Mathematics program (MURI award FA9550-15-1-0038) and the US Department of Energy AEOLUS center. RB acknowledges support from an NSERC PGSD-D fellowship. OZ also acknowledges support from the ANR JCJC project MODENA (ANR-21-CE46-0006-01).

Appendix A Proofs and theoretical details

A.1 Proof of Proposition 1

Proof.

Recall that the KR rearrangement SKRS_{\mathrm{KR}} is a transport map that satisfies SKRη=πS_{\mathrm{KR}}^{\sharp}\eta=\pi, where η\eta is the density of the standard Gaussian measure on d\mathbb{R}^{d} and π\pi is the target density. Corollary 3.10 in [8] states that for any PDF ϱ\varrho on d\mathbb{R}^{d} of the form ϱ(𝒙):=f(𝒙)η(𝒙)\varrho(\bm{x}):=f(\bm{x})\eta(\bm{x}) with flogfLη1f\log f\in L^{1}_{\eta}, the inequality

𝒙T(𝒙)2η(𝒙)d𝒙2f(𝒙)logf(𝒙)η(𝒙)d𝒙,\int\|\bm{x}-T(\bm{x})\|^{2}\eta(\bm{x})\mathrm{d}\bm{x}\leq 2\int f(\bm{x})\log f(\bm{x})\eta(\bm{x})\mathrm{d}\bm{x}, (38)

holds, where TT is the KR rearrangement such that Tη=ϱT_{\sharp}\eta=\varrho. Let SS be an increasing lower triangular map as in (1) and let ϱ=Sπ\varrho=S_{\sharp}\pi. Thus we have T=SSKR1T=S\circ S_{\mathrm{KR}}^{-1} and so the left-hand side of (38) becomes

𝒙T(𝒙)2η(𝒙)d𝒙=𝒙SSKR1(𝒙)2η(𝒙)d𝒙=SKR(𝒙)S(𝒙)2π(𝒙)d𝒙,\int\|\bm{x}-T(\bm{x})\|^{2}\eta(\bm{x})\mathrm{d}\bm{x}=\int\|\bm{x}-S\circ S_{\mathrm{KR}}^{-1}(\bm{x})\|^{2}\eta(\bm{x})\mathrm{d}\bm{x}=\int\|S_{\mathrm{KR}}(\bm{x})-S(\bm{x})\|^{2}\pi(\bm{x})\mathrm{d}\bm{x},

and the right-hand side becomes

2f(𝒙)logf(𝒙)η(𝒙)d𝒙=2𝒟KL(ϱ||η)=2𝒟KL(π||Sη),2\int f(\bm{x})\log f(\bm{x})\eta(\bm{x})\mathrm{d}\bm{x}=2\mathcal{D}_{\text{KL}}(\varrho||\eta)=2\mathcal{D}_{\text{KL}}(\pi||S^{\sharp}\eta),

which yields (4). ∎

A.2 Convexity of s𝒥k(s)s\mapsto\mathcal{J}_{k}(s)

Lemma 10.

The optimization problem min{s:ks>0}𝒥k(s)\min_{\{s\colon\partial_{k}s>0\}}\mathcal{J}_{k}(s) is strictly convex.

Proof.

Let s1,s2:ks_{1},s_{2}\colon\mathbb{R}^{k}\rightarrow\mathbb{R} be two functions such that ks1(𝒙k)>0\partial_{k}s_{1}(\bm{x}_{\leq k})>0 and ks2(𝒙k)>0\partial_{k}s_{2}(\bm{x}_{\leq k})>0. Let st=ts1+(1t)s2s_{t}=ts_{1}+(1-t)s_{2} for 0<t<10<t<1. Then sts_{t} also satisfies kst(𝒙k)>0\partial_{k}s_{t}(\bm{x}_{\leq k})>0. Finally, because both ξ12ξ2\xi\mapsto\frac{1}{2}\xi^{2} and ξlog(ξ)\xi\mapsto-\log(\xi) are strictly convex functions, we have

𝒥k(st)\displaystyle\mathcal{J}_{k}(s_{t}) =(6)(12st(𝒙k)2logkst(𝒙k))π(𝒙)d𝒙\displaystyle\overset{\eqref{eq:opt_map_component}}{=}\int\left(\frac{1}{2}s_{t}(\bm{x}_{\leq k})^{2}-\log\partial_{k}s_{t}(\bm{x}_{\leq k})\right)\pi(\bm{x})\mathrm{d}\bm{x}
<(t12s1(𝒙k)2+(1t)12s2(𝒙k)2)(tlogks1(𝒙k)+(1t)logks2(𝒙k))π(𝒙)d𝒙\displaystyle\;<\;\int\Big{(}t\frac{1}{2}s_{1}(\bm{x}_{\leq k})^{2}+(1-t)\frac{1}{2}s_{2}(\bm{x}_{\leq k})^{2}\Big{)}-\Big{(}t\log\partial_{k}s_{1}(\bm{x}_{\leq k})+(1-t)\log\partial_{k}s_{2}(\bm{x}_{\leq k})\Big{)}\pi(\bm{x})\mathrm{d}\bm{x}
=t𝒥k(s1)+(1t)𝒥k(s2),\displaystyle\;=\;t\mathcal{J}_{k}(s_{1})+(1-t)\mathcal{J}_{k}(s_{2}),

which shows that 𝒥k\mathcal{J}_{k} is strictly convex. ∎

A.3 Proof of Proposition 2

To prove Proposition 2 we need the following lemma.

Lemma 11.

Let

H1([0,1])={f:[0,1] such that fH1([0,1])201f(t)2+f(t)2dt<+}.H^{1}([0,1])=\left\{f\colon[0,1]\rightarrow\mathbb{R}\text{ such that }\|f\|^{2}_{H^{1}([0,1])}\coloneqq\int_{0}^{1}f(t)^{2}+f^{\prime}(t)^{2}\,\mathrm{d}t<+\infty\right\}.

Then

|f(0)|2fH1([0,1]),|f(0)|\leq\sqrt{2}\|f\|_{H^{1}([0,1])}, (39)

holds for any fH1([0,1])f\in H^{1}([0,1]).

Proof.

Because 𝒞([0,1])\mathcal{C}^{\infty}([0,1]) is dense in H1([0,1])H^{1}([0,1]), it suffices to show (39) for any f𝒞([0,1])f\in\mathcal{C}^{\infty}([0,1]). By the mean value theorem, there exists 0z10\leq z\leq 1 such that

f(z)=11001f(t)dt.f(z)=\frac{1}{1-0}\int_{0}^{1}f(t)\mathrm{d}t.

Thus we can write

|f(0)|2\displaystyle|f(0)|^{2} 2|f(z)f(0)|2+2|f(z)|2\displaystyle\leq 2|f(z)-f(0)|^{2}+2|f(z)|^{2}
=2|0zf(t)dt|2+2|01f(t)dt|2\displaystyle=2\left|\int_{0}^{z}f^{\prime}(t)\mathrm{d}t\right|^{2}+2\left|\int_{0}^{1}f(t)\mathrm{d}t\right|^{2}
201|f(t)|2dt+201|f(t)|2dt.\displaystyle\leq 2\int_{0}^{1}\left|f^{\prime}(t)\right|^{2}\mathrm{d}t+2\int_{0}^{1}\left|f(t)\right|^{2}\mathrm{d}t.

This concludes the proof. ∎

We now prove Proposition 2.

Proof.

For any fVkf\in V_{k}, Lemma 11 permits us to write

|f(𝒙<k,0)|2η<k(𝒙<k)d𝒙<k\displaystyle\int|f(\bm{x}_{<k},0)|^{2}\eta_{<k}(\bm{x}_{<k})\mathrm{d}\bm{x}_{<k} (39)(201|f(𝒙<k,t)|2+|kf(𝒙<k,t)|2dt)η<k(𝒙<k)d𝒙<k\displaystyle\overset{\eqref{eq:RKHS}}{\leq}\int\left(2\int_{0}^{1}|f(\bm{x}_{<k},t)|^{2}+|\partial_{k}f(\bm{x}_{<k},t)|^{2}\mathrm{d}t\right)\eta_{<k}(\bm{x}_{<k})\mathrm{d}\bm{x}_{<k}
CT01(|f(𝒙<k,t)|2+|kf(𝒙<k,t)|2)η<k(𝒙<k)η1(t)d𝒙<kdt\displaystyle\;\leq\;C_{T}\int\int_{0}^{1}\Big{(}|f(\bm{x}_{<k},t)|^{2}+|\partial_{k}f(\bm{x}_{<k},t)|^{2}\Big{)}\eta_{<k}(\bm{x}_{<k})\eta_{1}(t)\mathrm{d}\bm{x}_{<k}\mathrm{d}t
CT+(|f(𝒙<k,t)|2+|kf(𝒙<k,t)|2)ηk(𝒙<k,t)d𝒙<kdt\displaystyle\;\leq\;C_{T}\int\int_{-\infty}^{+\infty}\Big{(}|f(\bm{x}_{<k},t)|^{2}+|\partial_{k}f(\bm{x}_{<k},t)|^{2}\Big{)}\eta_{\leq k}(\bm{x}_{<k},t)\mathrm{d}\bm{x}_{<k}\mathrm{d}t
=CTfVk2,\displaystyle\;=\;C_{T}\|f\|_{V_{k}}^{2},

where CT=2sup0t1η1(t)1C_{T}=2\sup_{0\leq t\leq 1}\eta_{1}(t)^{-1}. ∎

A.4 Proof of Proposition 3

The proof relies on Proposition 2 and on the following generalized integral Hardy inequality; see [39].

Lemma 12.

Let ηk\eta_{\leq k} be the standard Gaussian density on k\mathbb{R}^{k}. Then there exists a constant CHC_{H} such that for any vLη2(k)v\in L^{2}_{\eta}(\mathbb{R}^{k}),

(0xkv(𝒙<k,t)dt)2ηk(𝒙)d𝒙CHv(𝒙)2ηk(𝒙)d𝒙.\int\left(\int_{0}^{x_{k}}v(\bm{x}_{<k},t)\mathrm{d}t\right)^{2}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x}\leq C_{H}\int v(\bm{x})^{2}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x}. (40)
Proof of Lemma 12.

Let us recall the integral Hardy inequality [39].

Theorem 13 (from [39]).

For weight ρ:++\rho\colon\mathbb{R}_{+}\rightarrow\mathbb{R}_{+} and uLρ2()u\in L^{2}_{\rho}(\mathbb{R}), there exists a constant CH<C_{H}<\infty such that

0+(0xu(t)dt)2ρ(x)dxCH0+u(x)2ρ(x)dx\int_{0}^{+\infty}\left(\int_{0}^{x}u(t)\mathrm{d}t\right)^{2}\rho(x)\mathrm{d}x\leq C_{H}\int_{0}^{+\infty}u(x)^{2}\rho(x)\mathrm{d}x (41)

if and only if

supx>0(x+ρ(t)dt)1/2(0xρ(t)1dt)1/2<+.\sup_{x>0}\,\left(\int_{x}^{+\infty}\rho(t)\mathrm{d}t\right)^{1/2}\left(\int_{0}^{x}\rho(t)^{-1}\mathrm{d}t\right)^{1/2}<+\infty. (42)

We apply Theorem 13 with the one-dimensional standard Gaussian density ρ=η\rho=\eta for x>0x>0. In order to check condition (42), we need to show that

D(x)(x+ρ(t)dt)1/2(0xρ(t)1dt)1/2=(x+et2/2dt)1/2(0xet2/2dt)1/2,D(x)\coloneqq\left(\int_{x}^{+\infty}\rho(t)\mathrm{d}t\right)^{1/2}\left(\int_{0}^{x}\rho(t)^{-1}\mathrm{d}t\right)^{1/2}=\left(\int_{x}^{+\infty}e^{-t^{2}/2}\mathrm{d}t\right)^{1/2}\left(\int_{0}^{x}e^{t^{2}/2}\mathrm{d}t\right)^{1/2},

is bounded. Since xD(x)x\mapsto D(x) is a continuous function with a finite limit as x0x\rightarrow 0, it is sufficient to show that D(x)D(x) has a finite limit when xx\rightarrow\infty. For x>1x>1, x+et2/2dtex2/2\int_{x}^{+\infty}e^{-t^{2}/2}\mathrm{d}t\leq e^{-x^{2}/2} and D(x)2ex2/20xet2/2dtD(x)^{2}\leq e^{-x^{2}/2}\int_{0}^{x}e^{t^{2}/2}\mathrm{d}t. Furthermore, using integration-by-parts we have 0xet2/2dt=01et2/2dt+ex2/2/xe+1xet2/2/t2dt\int_{0}^{x}e^{t^{2}/2}\mathrm{d}t=\int_{0}^{1}e^{t^{2}/2}\mathrm{d}t+e^{x^{2}/2}/x-\sqrt{e}+\int_{1}^{x}e^{t^{2}/2}/t^{2}\mathrm{d}t. As xx\rightarrow\infty the dominating term in the sum is ex2/2/xe^{x^{2}/2}/x. Thus, ex2/20xet2/2dte^{-x^{2}/2}\int_{0}^{x}e^{t^{2}/2}\mathrm{d}t behaves asymptotically as 𝒪(1x)\mathcal{O}(\frac{1}{x}), so that D(x)0D(x)\rightarrow 0 when xx\rightarrow\infty. Thus, condition (42) is satisfied.

Then, by the Hardy inequality in (41) for uLη2()u\in L_{\eta}^{2}(\mathbb{R}) we have

0+(0xku(t)dt)2η(xk)dxkCH0+u(xk)2η(xk)dxk.\int_{0}^{+\infty}\left(\int_{0}^{x_{k}}u(t)\mathrm{d}t\right)^{2}\eta(x_{k})\mathrm{d}x_{k}\leq C_{H}\int_{0}^{+\infty}u(x_{k})^{2}\eta(x_{k})\mathrm{d}x_{k}. (43)

For the symmetric density η(xk)=η(xk)\eta(x_{k})=\eta(-x_{k}) we also have

0(0xku(t)dt)2η(xk)dxkCH0u(xk)2η(xk)dxk.\int_{-\infty}^{0}\left(\int_{0}^{x_{k}}u(t)\mathrm{d}t\right)^{2}\eta(x_{k})\mathrm{d}x_{k}\leq C_{H}\int_{-\infty}^{0}u(x_{k})^{2}\eta(x_{k})\mathrm{d}x_{k}. (44)

Combining the results in (43) and (44) we have

+(0xku(t)dt)2η(xk)dxkCH+u(xk)2η(xk)dxk.\int_{-\infty}^{+\infty}\left(\int_{0}^{x_{k}}u(t)\mathrm{d}t\right)^{2}\eta(x_{k})\mathrm{d}x_{k}\leq C_{H}\int_{-\infty}^{+\infty}u(x_{k})^{2}\eta(x_{k})\mathrm{d}x_{k}.

Setting u(t)=v(𝒙<k,t)u(t)=v(\bm{x}_{<k},t) and integrating both sides over 𝒙<kk1\bm{x}_{<k}\in\mathbb{R}^{k-1} with the standard Gaussian weight function η(𝒙<k)\eta(\bm{x}_{<k}) gives the result. ∎

We now prove Proposition 3.

Proof.

By Proposition 2, Lemma 12 and by the Lipschitz property of gg, we can write

k(f1)k(f2)Lηk22\displaystyle\|\mathcal{R}_{k}(f_{1})-\mathcal{R}_{k}(f_{2})\|_{L^{2}_{\eta_{\leq k}}}^{2} 2(f1(𝒙<k,0)f2(𝒙<k,0))2ηk(𝒙)d𝒙\displaystyle\leq 2\int\Big{(}f_{1}(\bm{x}_{<k},0)-f_{2}(\bm{x}_{<k},0)\Big{)}^{2}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x}
+2(0xkg(kf1(𝒙<k,t))g(kf2(𝒙<k,t))dt)2ηk(𝒙)d𝒙\displaystyle~{}+2\int\Big{(}\int_{0}^{x_{k}}g\big{(}\partial_{k}f_{1}(\bm{x}_{<k},t)\big{)}-g\big{(}\partial_{k}f_{2}(\bm{x}_{<k},t)\big{)}\mathrm{d}t\Big{)}^{2}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x}
2CTf1f2Vk2+2CHg(kf1)g(kf2)Lηk22\displaystyle\leq 2C_{T}\|f_{1}-f_{2}\|_{V_{k}}^{2}+2C_{H}\|g(\partial_{k}f_{1})-g(\partial_{k}f_{2})\|_{L^{2}_{\eta_{\leq k}}}^{2}
2CTf1f2Vk2+2CHL2kf1kf2Lηk22\displaystyle\leq 2C_{T}\|f_{1}-f_{2}\|_{V_{k}}^{2}+2C_{H}L^{2}\|\partial_{k}f_{1}-\partial_{k}f_{2}\|_{L^{2}_{\eta_{\leq k}}}^{2}
2(CT+CHL2)f1f2Vk2,\displaystyle\leq 2(C_{T}+C_{H}L^{2})\|f_{1}-f_{2}\|_{V_{k}}^{2}, (45)

for any f1,f2Vkf_{1},f_{2}\in V_{k}. Furthermore, using the Lipschitz property of gg we have

kk(f1)kk(f2)Lηk22\displaystyle\|\partial_{k}\mathcal{R}_{k}(f_{1})-\partial_{k}\mathcal{R}_{k}(f_{2})\|_{L^{2}_{\eta_{\leq k}}}^{2} =(g(kf1(𝒙<k,t))g(kf2(𝒙<k,t)))2ηk(𝒙)d𝒙\displaystyle=\int\Big{(}g(\partial_{k}f_{1}(\bm{x}_{<k},t))-g(\partial_{k}f_{2}(\bm{x}_{<k},t))\Big{)}^{2}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x}
L2(kf1(𝒙<k,t)kf2(𝒙<k,t))2ηk(𝒙)d𝒙\displaystyle\leq L^{2}\int\Big{(}\partial_{k}f_{1}(\bm{x}_{<k},t)-\partial_{k}f_{2}(\bm{x}_{<k},t)\Big{)}^{2}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x}
L2f1f2Vk2.\displaystyle\leq L^{2}\|f_{1}-f_{2}\|_{V_{k}}^{2}. (46)

Combining (45) with (46), we obtain (19) with C=2(CT+CHL2)+L2C=\sqrt{2(C_{T}+C_{H}L^{2})+L^{2}}.

It remains to show that k(f)Vk<\|\mathcal{R}_{k}(f)\|_{V_{k}}<\infty for any fVkf\in V_{k}. Taking f1=ff_{1}=f and f2=0f_{2}=0 in (19), the triangle inequality yields

k(f)Vkk(0)Vk+CfVk\|\mathcal{R}_{k}(f)\|_{V_{k}}\leq\|\mathcal{R}_{k}(0)\|_{V_{k}}+C\|f\|_{V_{k}}

Because k(0)\mathcal{R}_{k}(0) is the affine function 𝒙g(0)xk\bm{x}\mapsto g(0)x_{k}, we have that k(0)Lηk22=g(0)2xk2η(𝒙)d𝒙\|\mathcal{R}_{k}(0)\|_{L^{2}_{\eta_{\leq k}}}^{2}=g(0)^{2}\int x_{k}^{2}\eta(\bm{x})\mathrm{d}\bm{x} and kk(0)Lηk22=g(0)2\|\partial_{k}\mathcal{R}_{k}(0)\|_{L^{2}_{\eta_{\leq k}}}^{2}=g(0)^{2} are finite and so is k(0)Vk\|\mathcal{R}_{k}(0)\|_{V_{k}}. Thus, k(f)Vk\mathcal{R}_{k}(f)\in V_{k} for all fVkf\in V_{k}. ∎

A.5 Proof of Proposition 4

Proof.

For any fVkf\in V_{k} we have

|k(f)|\displaystyle\left|\mathcal{L}_{k}(f)\right| =|(12k(f)2log(kk(f)))dπ|\displaystyle\;=\;\left|\int\left(\frac{1}{2}\mathcal{R}_{k}(f)^{2}-\log(\partial_{k}\mathcal{R}_{k}(f))\right)\mathrm{d}\pi\right|
(20)Cπ2k(f)Lηk22+Cπ|log(g(kf))|dηk\displaystyle\overset{\eqref{eq:AssumptionPiBounded}}{\leq}\frac{C_{\pi}}{2}\|\mathcal{R}_{k}(f)\|_{L^{2}_{\eta_{\leq k}}}^{2}+C_{\pi}\int\left|\log(g(\partial_{k}f))\right|\mathrm{d}\eta_{\leq k}
Cπ2k(f)Lηk22+Cπ|log(g(0))|+|log(g(kf))log(g(0))|dηk\displaystyle\;\leq\;\frac{C_{\pi}}{2}\|\mathcal{R}_{k}(f)\|_{L^{2}_{\eta_{\leq k}}}^{2}+C_{\pi}\int|\log(g(0))|+\left|\log(g(\partial_{k}f))-\log(g(0))\right|\mathrm{d}\eta_{\leq k}
(22)Cπ2k(f)Lηk22+Cπ|log(g(0))|+CπL|kf0|dηk\displaystyle\overset{\eqref{eq:lipschitz_logg}}{\leq}\frac{C_{\pi}}{2}\|\mathcal{R}_{k}(f)\|_{L^{2}_{\eta_{\leq k}}}^{2}+C_{\pi}|\log(g(0))|+C_{\pi}L\int\left|\partial_{k}f-0\right|\mathrm{d}\eta_{\leq k}
Cπ2k(f)Lηk22+Cπ|log(g(0))|+CπLfVk.\displaystyle\;\leq\;\frac{C_{\pi}}{2}\|\mathcal{R}_{k}(f)\|_{L^{2}_{\eta_{\leq k}}}^{2}+C_{\pi}|\log(g(0))|+C_{\pi}L\|f\|_{V_{k}}.

Because Proposition 3 ensures k(f)VkLηk2\mathcal{R}_{k}(f)\in V_{k}\subset L^{2}_{\eta_{\leq k}}, we have that k(f)\mathcal{L}_{k}(f) is finite for any fVkf\in V_{k}. Now, for any f1,f2Vkf_{1},f_{2}\in V_{k}, we can write

|k(f1)k(f2)|\displaystyle\left|\mathcal{L}_{k}(f_{1})-\mathcal{L}_{k}(f_{2})\right| =|(12k(f1)212k(f2)2log(kk(f1))+log(kk(f2)))dπ|\displaystyle\;=\;\left|\int\left(\frac{1}{2}\mathcal{R}_{k}(f_{1})^{2}-\frac{1}{2}\mathcal{R}_{k}(f_{2})^{2}-\log(\partial_{k}\mathcal{R}_{k}(f_{1}))+\log(\partial_{k}\mathcal{R}_{k}(f_{2}))\right)\mathrm{d}\pi\right|
(20)Cπ12|k(f1)2k(f2)2|+|log(g(kf1))log(g(kf2))|dη\displaystyle\overset{\eqref{eq:AssumptionPiBounded}}{\leq}C_{\pi}\int\frac{1}{2}\Big{|}\mathcal{R}_{k}(f_{1})^{2}-\mathcal{R}_{k}(f_{2})^{2}\Big{|}+\Big{|}\log(g(\partial_{k}f_{1}))-\log(g(\partial_{k}f_{2}))\Big{|}\mathrm{d}\eta
(22)Cπ2k(f1)+k(f2)Lηk2k(f1)k(f2)Lηk2+CπLkf1kf2Lηk2\displaystyle\overset{\eqref{eq:lipschitz_logg}}{\leq}\frac{C_{\pi}}{2}\|\mathcal{R}_{k}(f_{1})+\mathcal{R}_{k}(f_{2})\|_{L^{2}_{\eta_{\leq k}}}\|\mathcal{R}_{k}(f_{1})-\mathcal{R}_{k}(f_{2})\|_{L^{2}_{\eta_{\leq k}}}+C_{\pi}L\|\partial_{k}f_{1}-\partial_{k}f_{2}\|_{L^{2}_{\eta_{\leq k}}}
(19)Cπk(f1)Lηk2+k(f2)Lηk22Cf1f2Vk+CπLf1f2Vk.\displaystyle\overset{\eqref{eq:RectifierIsLip}}{\leq}C_{\pi}\frac{\|\mathcal{R}_{k}(f_{1})\|_{L^{2}_{\eta_{\leq k}}}+\|\mathcal{R}_{k}(f_{2})\|_{L^{2}_{\eta_{\leq k}}}}{2}C\|f_{1}-f_{2}\|_{V_{k}}+C_{\pi}L\|f_{1}-f_{2}\|_{V_{k}}.

This shows that k:Vk\mathcal{L}_{k}\colon V_{k}\rightarrow\mathbb{R} is continuous. To show that k\mathcal{L}_{k} is differentiable, we let f,εVkf,\varepsilon\in V_{k} so that

k(f+ε)\displaystyle\mathcal{L}_{k}(f+\varepsilon) =(12k(f+ε)2log(kk(f+ε)))dπ\displaystyle=\int\left(\frac{1}{2}\mathcal{R}_{k}(f+\varepsilon)^{2}-\log(\partial_{k}\mathcal{R}_{k}(f+\varepsilon))\right)\mathrm{d}\pi
=(12(f(𝒙<k,0)+ε(𝒙<k,0)+0xkg(k(f+ε)(𝒙<k,t))dt)2logg(k(f+ε)(𝒙)))π(𝒙)d𝒙\displaystyle=\int\left(\frac{1}{2}\left(f(\bm{x}_{<k},0)+\varepsilon(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g\big{(}\partial_{k}(f+\varepsilon)(\bm{x}_{<k},t)\big{)}\mathrm{d}t\right)^{2}-\log\circ g(\partial_{k}(f+\varepsilon)(\bm{x}))\right)\pi(\bm{x})\mathrm{d}\bm{x}
=(12k(f)(𝒙)2+k(f)(𝒙)(ε(𝒙<k,0)+0xkg(kf(𝒙<k,t))kε(𝒙<k,t)dt))π(𝒙)d𝒙\displaystyle=\int\left(\frac{1}{2}\mathcal{R}_{k}(f)(\bm{x})^{2}+\mathcal{R}_{k}(f)(\bm{x})\left(\varepsilon(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g^{\prime}\big{(}\partial_{k}f(\bm{x}_{<k},t)\big{)}\partial_{k}\varepsilon(\bm{x}_{<k},t)\mathrm{d}t\right)\right)\pi(\bm{x})\mathrm{d}\bm{x}
logg(kf)+(logg)(kf)kεdπ+𝒪(εVk2)\displaystyle~{}~{}-\int\log\circ g(\partial_{k}f)+(\log\circ g)^{\prime}(\partial_{k}f)\partial_{k}\varepsilon\mathrm{d}\pi+\mathcal{O}(\|\varepsilon\|_{V_{k}}^{2})
=k(f)+(ε)+𝒪(εVk2)\displaystyle=\mathcal{L}_{k}(f)+\ell(\varepsilon)+\mathcal{O}(\|\varepsilon\|_{V_{k}}^{2})

where :Vk\ell\colon V_{k}\rightarrow\mathbb{R} is the linear form defined by

(ε)=k(f)(𝒙)(ε(𝒙<k,0)+0xkg(kf(𝒙<k,t))kε(𝒙<k,t)dt)(logg)(kf(𝒙))kε(𝒙)π(𝒙)d𝒙.\ell(\varepsilon)=\int\mathcal{R}_{k}(f)(\bm{x})\left(\varepsilon(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g^{\prime}\big{(}\partial_{k}f(\bm{x}_{<k},t)\big{)}\partial_{k}\varepsilon(\bm{x}_{<k},t)\mathrm{d}t\right)-(\log\circ g)^{\prime}(\partial_{k}f(\bm{x}))\partial_{k}\varepsilon(\bm{x})~{}\pi(\bm{x})\mathrm{d}\bm{x}.

If \ell is continuous, that is, if there exists a constant CC_{\ell} such that |(ε)|CεVk|\ell(\varepsilon)|\leq C_{\ell}\|\varepsilon\|_{V_{k}} for any εVk\varepsilon\in V_{k}, then the Riesz representation theorem provides an element k(f)Vk\nabla\mathcal{L}_{k}(f)\in V_{k} such that (ε)=k(f),εVk\ell(\varepsilon)=\langle\nabla\mathcal{L}_{k}(f),\varepsilon\rangle_{V_{k}}. This proves that k\mathcal{L}_{k} is differentiable everywhere.

To show that \ell is continuous, we write

|(ε)|\displaystyle|\ell(\varepsilon)| (20)Cπ|k(f)(𝒙)(ε(𝒙<k,0)+0xkg(kf(𝒙<k,t))kε(𝒙<k,t)dt)|ηk(𝒙)d𝒙\displaystyle\overset{\eqref{eq:AssumptionPiBounded}}{\leq}C_{\pi}\int\Big{|}\mathcal{R}_{k}(f)(\bm{x})\left(\varepsilon(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g^{\prime}\big{(}\partial_{k}f(\bm{x}_{<k},t)\big{)}\partial_{k}\varepsilon(\bm{x}_{<k},t)\mathrm{d}t\right)\Big{|}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x}
+Cπ|(logg)(kf(𝒙))kε(𝒙)|ηk(𝒙)d𝒙\displaystyle~{}~{}+C_{\pi}\int\Big{|}(\log\circ g)^{\prime}(\partial_{k}f(\bm{x}))\partial_{k}\varepsilon(\bm{x})\Big{|}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x}
(22)Cπk(f)Lηk2|ε(𝒙<k,0)+0xkg(kf(𝒙<k,t))kε(𝒙<k,t)dt|2ηk(𝒙)d𝒙+CπLkεLηk2\displaystyle\overset{\eqref{eq:lipschitz_logg}}{\leq}C_{\pi}\|\mathcal{R}_{k}(f)\|_{L^{2}_{\eta_{\leq k}}}\sqrt{\int\Big{|}\varepsilon(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g^{\prime}\big{(}\partial_{k}f(\bm{x}_{<k},t)\big{)}\partial_{k}\varepsilon(\bm{x}_{<k},t)\mathrm{d}t\Big{|}^{2}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x}}+C_{\pi}L\|\partial_{k}\varepsilon\|_{L_{\eta_{\leq k}}^{2}}
(21)Cπk(f)Lηk22CTεVk2+2CHL2|kε(𝒙))|2ηk(𝒙)d𝒙+CπLkεLηk2\displaystyle\overset{\eqref{eq:lipschitz_g}}{\leq}C_{\pi}\|\mathcal{R}_{k}(f)\|_{L^{2}_{\eta_{\leq k}}}\sqrt{2C_{T}\|\varepsilon\|_{V_{k}}^{2}+2C_{H}L^{2}\int\big{|}\partial_{k}\varepsilon(\bm{x}))\big{|}^{2}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x}}+C_{\pi}L\|\partial_{k}\varepsilon\|_{L_{\eta_{\leq k}}^{2}}
Cπ(k(f)Lηk22CT+2CHL2+L)εVk,\displaystyle\;\leq\;C_{\pi}\Big{(}\|\mathcal{R}_{k}(f)\|_{L^{2}_{\eta_{\leq k}}}\sqrt{2C_{T}+2C_{H}L^{2}}+L\Big{)}\|\varepsilon\|_{V_{k}},

where the second-to-last inequality also uses Proposition 2 and Lemma 12. This concludes the proof. ∎
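
The expansion above can also be checked numerically. The following Python sketch (an illustration only, not the paper's implementation) takes k=1, a softplus rectifier g, and simple illustrative choices of the target density π and of the functions f and ε, and compares the linear form ℓ(ε) with a central finite-difference approximation of the directional derivative of L_k at f in the direction ε; the two values agree up to quadrature and finite-difference error.

```python
# A minimal numerical sketch (illustrative choices throughout; not the paper's code)
# of the first-order expansion above in one dimension:
#   L(f) = \int ( 0.5*R(f)(x)^2 - log g(f'(x)) ) pi(x) dx,  R(f)(x) = f(0) + \int_0^x g(f'(t)) dt.
import numpy as np

def g(z):                       # softplus rectifier, used here for illustration
    return np.logaddexp(0.0, z)

def gprime(z):                  # g'(z) = sigmoid(z)
    return 1.0 / (1.0 + np.exp(-z))

# quadrature grid covering the bulk of the (illustrative) target pi = N(1, 1.5^2)
x = np.linspace(-8.0, 8.0, 4001)
i0 = np.argmin(np.abs(x))       # index of the grid point at x = 0
pi = np.exp(-0.5 * ((x - 1.0) / 1.5) ** 2) / (1.5 * np.sqrt(2 * np.pi))

f  = lambda t: 0.5 * t + 0.3 * np.sin(t);   df  = lambda t: 0.5 + 0.3 * np.cos(t)
ep = lambda t: 0.05 * t + 0.1 * np.cos(t);  dep = lambda t: 0.05 - 0.1 * np.sin(t)

def cumint_from_zero(values):
    """Cumulative trapezoidal integral x -> int_0^x values(t) dt on the grid."""
    c = np.concatenate(([0.0], np.cumsum(0.5 * (values[1:] + values[:-1]) * np.diff(x))))
    return c - c[i0]

def R(f0, dfvals):              # rectified operator in one dimension
    return f0 + cumint_from_zero(g(dfvals))

def L(f0, dfvals):              # objective L_k(f) computed by quadrature
    integrand = 0.5 * R(f0, dfvals) ** 2 - np.log(g(dfvals))
    return np.trapz(integrand * pi, x)

# linear form l(eps) appearing in the expansion of L_k(f + eps)
Rf = R(f(0.0), df(x))
inner = ep(0.0) + cumint_from_zero(gprime(df(x)) * dep(x))
l_eps = np.trapz((Rf * inner - (gprime(df(x)) / g(df(x))) * dep(x)) * pi, x)

# central finite-difference approximation of the directional derivative
tau = 1e-4
fd = (L(f(0.0) + tau * ep(0.0), df(x) + tau * dep(x))
      - L(f(0.0) - tau * ep(0.0), df(x) - tau * dep(x))) / (2 * tau)

print(f"linear form l(eps) = {l_eps:.6f}, finite difference = {fd:.6f}")
```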

A.6 Proof of the local Lipschitz regularity (24)

Proposition 14.

In addition to the assumptions of Theorem 4, we further assume there exists a constant L<L<\infty such that for all ξ,ξ\xi,\xi^{\prime}\in\mathbb{R} we have

|g(ξ)g(ξ)|\displaystyle|g^{\prime}(\xi)-g^{\prime}(\xi^{\prime})| L|ξξ|\displaystyle\leq L|\xi-\xi^{\prime}| (47)
|(logg)(ξ)(logg)(ξ)|\displaystyle|(\log\circ g)^{\prime}(\xi)-(\log\circ g)^{\prime}(\xi^{\prime})| L|ξξ|.\displaystyle\leq L|\xi-\xi^{\prime}|. (48)

Then there exists M<M<\infty such that

k(f1)k(f2)VkM(1+k(f2)Vk)f1f2V¯k,\|\nabla\mathcal{L}_{k}(f_{1})-\nabla\mathcal{L}_{k}(f_{2})\|_{V_{k}}\leq M(1+\|\mathcal{R}_{k}(f_{2})\|_{V_{k}})\|f_{1}-f_{2}\|_{\overline{V}_{k}},

for any f1,f2V¯kf_{1},f_{2}\in\overline{V}_{k}, where V¯k={fVk,kfL}\overline{V}_{k}=\{f\in V_{k},\partial_{k}f\in L^{\infty}\} is the space endowed with the norm fV¯k=fVk+kfL\|f\|_{\overline{V}_{k}}=\|f\|_{V_{k}}+\|\partial_{k}f\|_{L^{\infty}}.

Proof.

Recall the definition (23) of k(f)\nabla\mathcal{L}_{k}(f)

k(f),εVk\displaystyle\langle\nabla\mathcal{L}_{k}(f),\varepsilon\rangle_{V_{k}} =k(f)(𝒙)(ε(𝒙<k,0)+0xkg(kf(𝒙<k,t))kε(𝒙<k,t)dt)π(𝒙)d𝒙\displaystyle=\int\mathcal{R}_{k}(f)(\bm{x})\left(\varepsilon(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g^{\prime}\big{(}\partial_{k}f(\bm{x}_{<k},t)\big{)}\partial_{k}\varepsilon(\bm{x}_{<k},t)\mathrm{d}t\right)\pi(\bm{x})\mathrm{d}\bm{x}
(logg)(kf(𝒙))kε(𝒙)π(𝒙)d𝒙.\displaystyle-\int(\log\circ g)^{\prime}(\partial_{k}f(\bm{x}))\partial_{k}\varepsilon(\bm{x})\pi(\bm{x})\mathrm{d}\bm{x}.

Then, for any f1,f2V¯kf_{1},f_{2}\in\overline{V}_{k}, we can write

k(f1)k(f2),εVk=A+B+C+D,\langle\nabla\mathcal{L}_{k}(f_{1})-\nabla\mathcal{L}_{k}(f_{2}),\varepsilon\rangle_{V_{k}}=A+B+C+D,

where

A\displaystyle A =(k(f1)(𝒙)k(f2)(𝒙))ε(𝒙<k,0)π(𝒙)d𝒙\displaystyle=\int\Big{(}\mathcal{R}_{k}(f_{1})(\bm{x})-\mathcal{R}_{k}(f_{2})(\bm{x})\Big{)}\varepsilon(\bm{x}_{<k},0)\pi(\bm{x})\mathrm{d}\bm{x}
B\displaystyle B =(k(f1)(𝒙)k(f2)(𝒙))(0xkg(kf1(𝒙<k,t))kε(𝒙<k,t)dt)π(𝒙)d𝒙\displaystyle=\int\Big{(}\mathcal{R}_{k}(f_{1})(\bm{x})-\mathcal{R}_{k}(f_{2})(\bm{x})\Big{)}\left(\int_{0}^{x_{k}}g^{\prime}(\partial_{k}f_{1}(\bm{x}_{<k},t))\partial_{k}\varepsilon(\bm{x}_{<k},t)\mathrm{d}t\right)\pi(\bm{x})\mathrm{d}\bm{x}
C\displaystyle C =k(f2)(𝒙)(0xk(g(kf1(𝒙<k,t))g(kf2(𝒙<k,t)))kε(𝒙<k,t)dt)π(𝒙)d𝒙\displaystyle=\int\mathcal{R}_{k}(f_{2})(\bm{x})\left(\int_{0}^{x_{k}}\Big{(}g^{\prime}(\partial_{k}f_{1}(\bm{x}_{<k},t))-g^{\prime}(\partial_{k}f_{2}(\bm{x}_{<k},t))\Big{)}\partial_{k}\varepsilon(\bm{x}_{<k},t)\mathrm{d}t\right)\pi(\bm{x})\mathrm{d}\bm{x}
D\displaystyle D =((logg)(kf1(𝒙))(logg)(kf2(𝒙)))kε(𝒙)π(𝒙)d𝒙.\displaystyle=\int\Big{(}(\log\circ g)^{\prime}(\partial_{k}f_{1}(\bm{x}))-(\log\circ g)^{\prime}(\partial_{k}f_{2}(\bm{x}))\Big{)}\partial_{k}\varepsilon(\bm{x})\pi(\bm{x})\mathrm{d}\bm{x}.

For the first term AA we write

|A|\displaystyle|A| (20)Cπ|k(f1)(𝒙)k(f2)(𝒙)||ε(𝒙<k,0)|η(𝒙)d𝒙\displaystyle\overset{\eqref{eq:AssumptionPiBounded}}{\leq}C_{\pi}\int\Big{|}\mathcal{R}_{k}(f_{1})(\bm{x})-\mathcal{R}_{k}(f_{2})(\bm{x})\Big{|}|\varepsilon(\bm{x}_{<k},0)|\eta(\bm{x})\mathrm{d}\bm{x}
Cπk(f1)k(f2)Vk(|ε(𝒙<k,0)|2η<k(𝒙)d𝒙)1/2\displaystyle\;\leq\;C_{\pi}\|\mathcal{R}_{k}(f_{1})-\mathcal{R}_{k}(f_{2})\|_{V_{k}}\left(\int|\varepsilon(\bm{x}_{<k},0)|^{2}\eta_{<k}(\bm{x})\mathrm{d}\bm{x}\right)^{1/2}
(16)CπCTk(f1)k(f2)VkεVk\displaystyle\overset{\eqref{eq:trace_theorem}}{\leq}C_{\pi}\sqrt{C_{T}}\|\mathcal{R}_{k}(f_{1})-\mathcal{R}_{k}(f_{2})\|_{V_{k}}\|\varepsilon\|_{V_{k}}
(19)CπCTCf1f2VkεVk.\displaystyle\overset{\eqref{eq:RectifierIsLip}}{\leq}C_{\pi}\sqrt{C_{T}}C\|f_{1}-f_{2}\|_{V_{k}}\|\varepsilon\|_{V_{k}}.

For the second term BB we write

|B|\displaystyle|B| (20)Cπk(f1)k(f2)Vk((0xkg(kf1(𝒙<k,t))kε(𝒙<k,t)dt)2η(𝒙)d𝒙)1/2\displaystyle\overset{\eqref{eq:AssumptionPiBounded}}{\leq}C_{\pi}\|\mathcal{R}_{k}(f_{1})-\mathcal{R}_{k}(f_{2})\|_{V_{k}}\left(\int\left(\int_{0}^{x_{k}}g^{\prime}(\partial_{k}f_{1}(\bm{x}_{<k},t))\partial_{k}\varepsilon(\bm{x}_{<k},t)\mathrm{d}t\right)^{2}\eta(\bm{x})\mathrm{d}\bm{x}\right)^{1/2}
(40)(19)CπCHCf1f2Vk((g(kf1(𝒙k))kε(𝒙k))2η(𝒙)d𝒙)1/2\displaystyle\overset{\begin{subarray}{c}\eqref{eq:Hardy}\\ \eqref{eq:RectifierIsLip}\end{subarray}}{\leq}C_{\pi}\sqrt{C_{H}}C\|f_{1}-f_{2}\|_{V_{k}}\Big{(}\int\left(g^{\prime}(\partial_{k}f_{1}(\bm{x}_{\leq k}))\partial_{k}\varepsilon(\bm{x}_{\leq k})\Big{)}^{2}\eta(\bm{x})\mathrm{d}\bm{x}\right)^{1/2}
(21)CπCHCLf1f2Vk((kε(𝒙k))2η(𝒙)d𝒙)1/2\displaystyle\overset{\eqref{eq:lipschitz_g}}{\leq}C_{\pi}\sqrt{C_{H}}CL\|f_{1}-f_{2}\|_{V_{k}}\Big{(}\int\left(\partial_{k}\varepsilon(\bm{x}_{\leq k})\Big{)}^{2}\eta(\bm{x})\mathrm{d}\bm{x}\right)^{1/2}
CπCHCLf1f2VkεVk.\displaystyle\;\leq\;C_{\pi}\sqrt{C_{H}}CL\|f_{1}-f_{2}\|_{V_{k}}\|\varepsilon\|_{V_{k}}.

For the third term CC we write

|C|\displaystyle|C| (20)Cπk(f2)Vk((0xk(g(kf1(𝒙<k,t))g(kf2(𝒙<k,t)))kε(𝒙<k,t)dt)2η(𝒙)d𝒙)1/2\displaystyle\overset{\eqref{eq:AssumptionPiBounded}}{\leq}C_{\pi}\|\mathcal{R}_{k}(f_{2})\|_{V_{k}}\left(\int\left(\int_{0}^{x_{k}}\Big{(}g^{\prime}(\partial_{k}f_{1}(\bm{x}_{<k},t))-g^{\prime}(\partial_{k}f_{2}(\bm{x}_{<k},t))\Big{)}\partial_{k}\varepsilon(\bm{x}_{<k},t)\mathrm{d}t\right)^{2}\eta(\bm{x})\mathrm{d}\bm{x}\right)^{1/2}
(40)CπCHk(f2)Vk(((g(kf1(𝒙k))g(kf2(𝒙k)))kε(𝒙k))2η(𝒙)d𝒙)1/2\displaystyle\overset{\eqref{eq:Hardy}}{\leq}C_{\pi}\sqrt{C_{H}}\|\mathcal{R}_{k}(f_{2})\|_{V_{k}}\left(\int\left(\Big{(}g^{\prime}(\partial_{k}f_{1}(\bm{x}_{\leq k}))-g^{\prime}(\partial_{k}f_{2}(\bm{x}_{\leq k}))\Big{)}\partial_{k}\varepsilon(\bm{x}_{\leq k})\right)^{2}\eta(\bm{x})\mathrm{d}\bm{x}\right)^{1/2}
CπCHk(f2)Vk(esssup|gkf1gkf2|)((kε(𝒙k))2η(𝒙)d𝒙)1/2\displaystyle\;\leq\;C_{\pi}\sqrt{C_{H}}\|\mathcal{R}_{k}(f_{2})\|_{V_{k}}\left(\operatorname*{ess\,sup}\Big{|}g^{\prime}\circ\partial_{k}f_{1}-g^{\prime}\circ\partial_{k}f_{2}\Big{|}\right)\left(\int\left(\partial_{k}\varepsilon(\bm{x}_{\leq k})\right)^{2}\eta(\bm{x})\mathrm{d}\bm{x}\right)^{1/2}
(47)CπCHLk(f2)Vk(esssup|kf1kf2|)εVk\displaystyle\overset{\eqref{eq:gPrimeLip}}{\leq}C_{\pi}\sqrt{C_{H}}L\|\mathcal{R}_{k}(f_{2})\|_{V_{k}}\left(\operatorname*{ess\,sup}\Big{|}\partial_{k}f_{1}-\partial_{k}f_{2}\Big{|}\right)\|\varepsilon\|_{V_{k}}
CπCHLk(f2)Vkf1f2V¯kεVk.\displaystyle\;\leq\;C_{\pi}\sqrt{C_{H}}L\|\mathcal{R}_{k}(f_{2})\|_{V_{k}}\|f_{1}-f_{2}\|_{\overline{V}_{k}}\|\varepsilon\|_{V_{k}}.

For the last term DD we write

|D|\displaystyle|D| (20)Cπ(((logg)(kf1(𝒙))(logg)(kf2(𝒙)))2η(𝒙)d𝒙)1/2εVk\displaystyle\overset{\eqref{eq:AssumptionPiBounded}}{\leq}C_{\pi}\left(\int\Big{(}(\log\circ g)^{\prime}(\partial_{k}f_{1}(\bm{x}))-(\log\circ g)^{\prime}(\partial_{k}f_{2}(\bm{x}))\Big{)}^{2}\eta(\bm{x})\mathrm{d}\bm{x}\right)^{1/2}\|\varepsilon\|_{V_{k}}
(48)CπL((kf1(𝒙)kf2(𝒙))2η(𝒙)d𝒙)1/2εVk\displaystyle\overset{\eqref{eq:loggPrimeLip}}{\leq}C_{\pi}L\left(\int\Big{(}\partial_{k}f_{1}(\bm{x})-\partial_{k}f_{2}(\bm{x})\Big{)}^{2}\eta(\bm{x})\mathrm{d}\bm{x}\right)^{1/2}\|\varepsilon\|_{V_{k}}
CπLf1f2VkεVk.\displaystyle\;\leq\;C_{\pi}L\|f_{1}-f_{2}\|_{V_{k}}\|\varepsilon\|_{V_{k}}.

Thus, because f1f2Vkf1f2V¯k\|f_{1}-f_{2}\|_{V_{k}}\leq\|f_{1}-f_{2}\|_{\overline{V}_{k}} we obtain

|k(f1)k(f2),εVk|εVk\displaystyle\frac{|\langle\nabla\mathcal{L}_{k}(f_{1})-\nabla\mathcal{L}_{k}(f_{2}),\varepsilon\rangle_{V_{k}}|}{\|\varepsilon\|_{V_{k}}} Cπ(CTC+CHCL+CHLk(f2)Vk+L)f1f2V¯k\displaystyle\leq C_{\pi}\Big{(}\sqrt{C_{T}}C+\sqrt{C_{H}}CL+\sqrt{C_{H}}L\|\mathcal{R}_{k}(f_{2})\|_{V_{k}}+L\Big{)}\|f_{1}-f_{2}\|_{\overline{V}_{k}}
M(1+k(f2)Vk)f1f2V¯k,\displaystyle\leq M(1+\|\mathcal{R}_{k}(f_{2})\|_{V_{k}})\|f_{1}-f_{2}\|_{\overline{V}_{k}},

where

M=Cπmax{CTC+CHCL+L;CHL}.M=C_{\pi}\max\{\sqrt{C_{T}}C+\sqrt{C_{H}}CL+L;\sqrt{C_{H}}L\}.

This concludes the proof. ∎

A.7 Proof of Proposition 6

Proof.

To show that k(Vk)={k(f):fVk}\mathcal{R}_{k}(V_{k})=\{\mathcal{R}_{k}(f):f\in V_{k}\} is convex, let f1,f2Vkf_{1},f_{2}\in V_{k} and 0α10\leq\alpha\leq 1. We need to show that there exists fαVkf_{\alpha}\in V_{k} such that k(fα)=Sα\mathcal{R}_{k}(f_{\alpha})=S_{\alpha}, where

Sααk(f1)+(1α)k(f2).S_{\alpha}\coloneqq\alpha\mathcal{R}_{k}(f_{1})+(1-\alpha)\mathcal{R}_{k}(f_{2}).

Let

fα(𝒙k)k1(Sα)(𝒙k)=Sα(𝒙<k,0)+0xkg1(kSα(𝒙<k,t))dt.\displaystyle f_{\alpha}(\bm{x}_{\leq k})\coloneqq\mathcal{R}_{k}^{-1}(S_{\alpha})(\bm{x}_{\leq k})=S_{\alpha}(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g^{-1}(\partial_{k}S_{\alpha}(\bm{x}_{<k},t))\textrm{d}t.

It remains to show that fαVkf_{\alpha}\in V_{k}, meaning that fαLηk2f_{\alpha}\in L^{2}_{\eta_{\leq k}} and kfαLηk2\partial_{k}f_{\alpha}\in L^{2}_{\eta_{\leq k}}. By convexity of ξg1(ξ)2\xi\mapsto g^{-1}(\xi)^{2}, we have

kfαLηk22\displaystyle\|\partial_{k}f_{\alpha}\|_{L^{2}_{\eta_{\leq k}}}^{2} =g1(αkk(f1)+(1α)kk(f2))2dηk\displaystyle=\int g^{-1}\big{(}\alpha\partial_{k}\mathcal{R}_{k}(f_{1})+(1-\alpha)\partial_{k}\mathcal{R}_{k}(f_{2})\big{)}^{2}\textrm{d}\eta_{\leq k}
=g1(αg(kf1)+(1α)g(kf2))2dηk\displaystyle=\int g^{-1}\big{(}\alpha g(\partial_{k}f_{1})+(1-\alpha)g(\partial_{k}f_{2})\big{)}^{2}\textrm{d}\eta_{\leq k}
αg1(g(kf1))2+(1α)g1(g(kf2))2dηk\displaystyle\leq\int\alpha g^{-1}(g(\partial_{k}f_{1}))^{2}+(1-\alpha)g^{-1}(g(\partial_{k}f_{2}))^{2}\textrm{d}\eta_{\leq k}
=αkf1Lηk22+(1α)kf2Lηk22.\displaystyle=\alpha\|\partial_{k}f_{1}\|^{2}_{L^{2}_{\eta_{\leq k}}}+(1-\alpha)\|\partial_{k}f_{2}\|^{2}_{L^{2}_{\eta_{\leq k}}}. (49)

Thus kfαLηk2\partial_{k}f_{\alpha}\in L^{2}_{\eta_{\leq k}}. Furthermore we have

fαLηk22\displaystyle\|f_{\alpha}\|_{L^{2}_{\eta_{\leq k}}}^{2} =(Sα(𝒙<k,0)+0xkg1(kSα(𝒙<k,t))dt)2ηk(𝒙)d𝒙\displaystyle=\int\left(S_{\alpha}(\bm{x}_{<k},0)+\int_{0}^{x_{k}}g^{-1}(\partial_{k}S_{\alpha}(\bm{x}_{<k},t))\textrm{d}t\right)^{2}\eta_{\leq k}(\bm{x})\textrm{d}\bm{x}
2Sα(𝒙<k,0)2η<k(𝒙<k)d𝒙+2(0xkg1(kSα(𝒙<k,t))dt)2ηk(𝒙)d𝒙.\displaystyle\leq 2\int S_{\alpha}(\bm{x}_{<k},0)^{2}\eta_{<k}(\bm{x}_{<k})\textrm{d}\bm{x}+2\int\left(\int_{0}^{x_{k}}g^{-1}(\partial_{k}S_{\alpha}(\bm{x}_{<k},t))\textrm{d}t\right)^{2}\eta_{\leq k}(\bm{x})\textrm{d}\bm{x}.

To show that the above quantity is finite, Proposition 2 permits us to write

Sα(𝒙<k,0)2η<k(𝒙<k)d𝒙\displaystyle\int S_{\alpha}(\bm{x}_{<k},0)^{2}\eta_{<k}(\bm{x}_{<k})\textrm{d}\bm{x} =(αf1(𝒙<k,0)+(1α)f2(𝒙<k,0))2η<k(𝒙<k)d𝒙\displaystyle=\int\Big{(}\alpha f_{1}(\bm{x}_{<k},0)+(1-\alpha)f_{2}(\bm{x}_{<k},0)\Big{)}^{2}\eta_{<k}(\bm{x}_{<k})\textrm{d}\bm{x}
CTαf1+(1α)f2Vk2,\displaystyle\leq C_{T}\|\alpha f_{1}+(1-\alpha)f_{2}\|^{2}_{V_{k}},

which is finite. Finally, because g1(kSα)=kfαLηk2g^{-1}(\partial_{k}S_{\alpha})=\partial_{k}f_{\alpha}\in L^{2}_{\eta_{\leq k}} by (49), Lemma 12 yields

(0xkg1(kSα(𝒙<k,t))dt)2ηk(𝒙)d𝒙\displaystyle\int\left(\int_{0}^{x_{k}}g^{-1}(\partial_{k}S_{\alpha}(\bm{x}_{<k},t))\textrm{d}t\right)^{2}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x} CHg1(kSα(𝒙k))2ηk(𝒙)d𝒙\displaystyle\leq C_{H}\int g^{-1}(\partial_{k}S_{\alpha}(\bm{x}_{\leq k}))^{2}\eta_{\leq k}(\bm{x})\mathrm{d}\bm{x}
=CHkfαLηk22,\displaystyle=C_{H}\|\partial_{k}f_{\alpha}\|_{L^{2}_{\eta_{\leq k}}}^{2},

which is finite. We deduce that fαLηk2f_{\alpha}\in L^{2}_{\eta_{\leq k}} and therefore that fαVkf_{\alpha}\in V_{k}. ∎
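
The construction in this proof is easy to illustrate numerically. The sketch below (with illustrative choices of g, f_1 and f_2; it is not part of the paper's code) builds f_α = R_k^{-1}(α R_k(f_1) + (1-α) R_k(f_2)) in one dimension using the inverse of a softplus rectifier, and verifies on a grid that R_k(f_α) reproduces the convex combination S_α.

```python
# Minimal 1D illustration of the convexity construction above (illustrative choices).
import numpy as np

g     = lambda z: np.logaddexp(0.0, z)        # softplus rectifier
g_inv = lambda y: y + np.log1p(-np.exp(-y))   # inverse softplus, valid for y > 0

x  = np.linspace(-5.0, 5.0, 2001)
i0 = np.argmin(np.abs(x))                     # grid point at x = 0

def cumint_from_zero(values):
    """x -> int_0^x values(t) dt by the trapezoidal rule on the grid."""
    c = np.concatenate(([0.0], np.cumsum(0.5 * (values[1:] + values[:-1]) * np.diff(x))))
    return c - c[i0]

def R(f0, dfvals):                            # R(f)(x) = f(0) + int_0^x g(f'(t)) dt
    return f0 + cumint_from_zero(g(dfvals))

# two illustrative map components, specified by their value at 0 and derivative
f1_0, df1 = 0.3, 0.8 + 0.2 * np.sin(x)
f2_0, df2 = -0.5, -0.4 + 0.1 * np.cos(x)

alpha   = 0.3
S_alpha = alpha * R(f1_0, df1) + (1.0 - alpha) * R(f2_0, df2)

# derivative of S_alpha is the convex combination of the positive derivatives g(df_i)
dS_alpha  = alpha * g(df1) + (1.0 - alpha) * g(df2)
f_alpha_0 = S_alpha[i0]                       # value of f_alpha at x = 0
df_alpha  = g_inv(dS_alpha)                   # f_alpha'(x) = g^{-1}(d_x S_alpha)

err = np.max(np.abs(R(f_alpha_0, df_alpha) - S_alpha))
print(f"max |R(f_alpha) - S_alpha| on the grid: {err:.2e}")
```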

A.8 Proof of Proposition 5

Proof.

Let s1,s2Vks_{1},s_{2}\in V_{k} be strictly increasing functions with respect to xkx_{k} that satisfy ksi(𝒙k)c\partial_{k}s_{i}(\bm{x}_{\leq k})\geq c for i=1,2i=1,2 and all 𝒙kk\bm{x}_{\leq k}\in\mathbb{R}^{k}. By the Lipschitz property of g1g^{-1} on the domain [c,)[c,\infty) with constant LcL_{c}, we can write

kk1(s1)kk1(s2)Lηk22\displaystyle\|\partial_{k}\mathcal{R}_{k}^{-1}(s_{1})-\partial_{k}\mathcal{R}_{k}^{-1}(s_{2})\|_{L^{2}_{\eta_{\leq k}}}^{2} =(g1(ks1(𝒙k))g1(ks2(𝒙k)))2ηk(𝒙)d𝒙\displaystyle=\int(g^{-1}(\partial_{k}s_{1}(\bm{x}_{\leq k}))-g^{-1}(\partial_{k}s_{2}(\bm{x}_{\leq k})))^{2}\eta_{\leq k}(\bm{x})\textrm{d}\bm{x}
Lc2(ks1(𝒙k)ks2(𝒙k))2ηk(𝒙)d𝒙\displaystyle\leq L_{c}^{2}\int(\partial_{k}s_{1}(\bm{x}_{\leq k})-\partial_{k}s_{2}(\bm{x}_{\leq k}))^{2}\eta_{\leq k}(\bm{x})\textrm{d}\bm{x}
Lc2s1s2Vk2.\displaystyle\leq L_{c}^{2}\|s_{1}-s_{2}\|_{V_{k}}^{2}. (50)

Applying Proposition 2 to s1,s2Vks_{1},s_{2}\in V_{k} and Lemma 12 to kk1(si)=g1(ksi)Lηk2\partial_{k}\mathcal{R}_{k}^{-1}(s_{i})=g^{-1}(\partial_{k}s_{i})\in L_{\eta_{\leq k}}^{2} for i=1,2i=1,2 we have

k1(s1)k1(s2)Lηk22\displaystyle\|\mathcal{R}_{k}^{-1}(s_{1})-\mathcal{R}_{k}^{-1}(s_{2})\|_{L^{2}_{\eta_{\leq k}}}^{2} 2(s1(𝒙<k,0)s2(𝒙<k,0))2ηk(𝒙)𝑑𝒙\displaystyle\leq 2\int\left(s_{1}(\bm{x}_{<k},0)-s_{2}(\bm{x}_{<k},0)\right)^{2}\eta_{\leq k}(\bm{x})d\bm{x}
+2(0xkg1(ks1)g1(ks2)dt)2ηk(𝒙)d𝒙\displaystyle\quad+2\int\left(\int_{0}^{x_{k}}g^{-1}(\partial_{k}s_{1})-g^{-1}(\partial_{k}s_{2})\textrm{d}t\right)^{2}\eta_{\leq k}(\bm{x})\textrm{d}\bm{x}
2CTs1s2Vk2\displaystyle\leq 2C_{T}\|s_{1}-s_{2}\|_{V_{k}}^{2}
+2CH(g1(ks1(𝒙k))g1(ks2(𝒙k)))2ηk(𝒙)d𝒙\displaystyle\quad+2C_{H}\int(g^{-1}(\partial_{k}s_{1}(\bm{x}_{\leq k}))-g^{-1}(\partial_{k}s_{2}(\bm{x}_{\leq k})))^{2}\eta_{\leq k}(\bm{x})\textrm{d}\bm{x}
(2CT+2CHLc2)s1s2Vk2,\displaystyle\leq(2C_{T}+2C_{H}L_{c}^{2})\|s_{1}-s_{2}\|_{V_{k}}^{2}, (51)

where the last inequality follows from (50).

It remains to show that k1(s)Vk<\|\mathcal{R}_{k}^{-1}(s)\|_{V_{k}}<\infty for any sVks\in V_{k} such that essinfks>0\operatorname*{ess\,inf}\partial_{k}s>0. Letting s1=ss_{1}=s and s2=g(0)xks_{2}=g(0)x_{k}, the triangle inequality combined with (26) yields

k1(s)Vkk1(g(0)xk)Vk+Ccsg(0)xkVk.\|\mathcal{R}_{k}^{-1}(s)\|_{V_{k}}\leq\|\mathcal{R}_{k}^{-1}(g(0)x_{k})\|_{V_{k}}+C_{c}\|s-g(0)x_{k}\|_{V_{k}}.

The function k1(g(0)xk)\mathcal{R}_{k}^{-1}(g(0)x_{k}) is zero. Therefore, k1(s)VkCcsg(0)xkVkCc(sVk+g(0)xkVk)\|\mathcal{R}_{k}^{-1}(s)\|_{V_{k}}\leq C_{c}\|s-g(0)x_{k}\|_{V_{k}}\leq C_{c}(\|s\|_{V_{k}}+\|g(0)x_{k}\|_{V_{k}}). Because xkg(0)xkx_{k}\mapsto g(0)x_{k} is linear, g(0)xkVk2=g(0)xkLηk22+g(0)Lηk22=2g(0)2\|g(0)x_{k}\|_{V_{k}}^{2}=\|g(0)x_{k}\|^{2}_{L^{2}_{\eta_{\leq k}}}+\|g(0)\|^{2}_{L^{2}_{\eta_{\leq k}}}=2g(0)^{2} is finite, and so k1(s)Vk<\|\mathcal{R}_{k}^{-1}(s)\|_{V_{k}}<\infty for sVks\in V_{k}. Furthermore, if ksc>0\partial_{k}s\geq c>0, then kk1(s)=g1(ks)g1(c)>\partial_{k}\mathcal{R}_{k}^{-1}(s)=g^{-1}(\partial_{k}s)\geq g^{-1}(c)>-\infty and so essinfkk1(s)>\operatorname*{ess\,inf}\partial_{k}\mathcal{R}_{k}^{-1}(s)>-\infty. ∎

A.9 Proof for the KR rearrangement

Proof.

Let SKR,kS_{\textrm{KR},k} be the kkth component of the KR rearrangement, obtained by composing the inverse of the standard Gaussian marginal CDF Fηk(xk)F_{\eta_{k}}(x_{k}) with the CDF Fπk(xk|𝒙<k)F_{\pi_{k}}(x_{k}|\bm{x}_{<k}) of the target’s kkth marginal conditional. That is,

SKR,k(𝒙k)=Fηk1Fπk(xk|𝒙<k).S_{\textrm{KR},k}(\bm{x}_{\leq k})=F_{\eta_{k}}^{-1}\circ F_{\pi_{k}}(x_{k}|\bm{x}_{<k}). (52)

The goal is to show SKR,kVkS_{\textrm{KR},k}\in V_{k}, that is, SKR,kLηk2S_{\textrm{KR},k}\in L^{2}_{\eta_{\leq k}} and kSKR,kLηk2\partial_{k}S_{\textrm{KR},k}\in L^{2}_{\eta_{\leq k}}.

First we show SKR,kLηk2S_{\textrm{KR},k}\in L^{2}_{\eta_{\leq k}}. From condition (30), we have Fηk1(C1Fηk(xk))SKR,k(xk|𝒙<k)Fηk1(C2Fηk(xk))F_{\eta_{k}}^{-1}(C_{1}F_{\eta_{k}}(x_{k}))\leq S_{\textrm{KR},k}(x_{k}|\bm{x}_{<k})\leq F_{\eta_{k}}^{-1}(C_{2}F_{\eta_{k}}(x_{k})) for some constants C1,C2>0C_{1},C_{2}>0 so that

SKR,k(xk|𝒙<k)2max{Fηk1(C1Fηk(xk))2;Fηk1(C2Fηk(xk))2},S_{\textrm{KR},k}(x_{k}|\bm{x}_{<k})^{2}\leq\max\{F_{\eta_{k}}^{-1}(C_{1}F_{\eta_{k}}(x_{k}))^{2};F_{\eta_{k}}^{-1}(C_{2}F_{\eta_{k}}(x_{k}))^{2}\}, (53)

for all 𝒙<kk1\bm{x}_{<k}\in\mathbb{R}^{k-1}. To show that SKR,kLηk2S_{\textrm{KR},k}\in L^{2}_{\eta_{\leq k}}, it is sufficient to prove that any function of the form xkFηk1(CFηk(xk))x_{k}\mapsto F_{\eta_{k}}^{-1}(CF_{\eta_{k}}(x_{k})) is in Lηk2L^{2}_{\eta_{\leq k}} for any C>0C>0. From Theorems 1 and 2 in [11], there exist strictly positive constants αi,βi>0\alpha_{i},\beta_{i}>0 for i=1,2i=1,2 such that

1α1exp(β1xk2)Fηk(xk)1α2exp(β2xk2),1-\alpha_{1}\exp(-\beta_{1}x_{k}^{2})\leq F_{\eta_{k}}(x_{k})\leq 1-\alpha_{2}\exp(-\beta_{2}x_{k}^{2}), (54)

for xk>0x_{k}>0. With a change of variable u=Fηk(xk)u=F_{\eta_{k}}(x_{k}) we obtain Fηk1(u)21/β2log(α2/(1u))F^{-1}_{\eta_{k}}(u)^{2}\leq 1/\beta_{2}\log(\alpha_{2}/(1-u)) for all u>Fηk(0)=1/2u>F_{\eta_{k}}(0)=1/2. Letting u=CFηk(xk)u=CF_{\eta_{k}}(x_{k}) yields

Fηk1(CFηk(xk))2\displaystyle F^{-1}_{\eta_{k}}(CF_{\eta_{k}}(x_{k}))^{2} 1β2log(α2CFηk(xk)),\displaystyle\;\leq\;\frac{1}{\beta_{2}}\log\left(\frac{\alpha_{2}}{CF_{\eta_{k}}(x_{k})}\right),
(54)1β2log(α2Cα2exp(β2xk2))\displaystyle\overset{\eqref{eq:bounds_Gaussian_cdf}}{\leq}\frac{1}{\beta_{2}}\log\left(\frac{\alpha_{2}}{C\alpha_{2}\exp(-\beta_{2}x_{k}^{2})}\right)
=1β2log(1C)+xk2\displaystyle\;=\;\frac{1}{\beta_{2}}\log\left(\frac{1}{C}\right)+x_{k}^{2}

for all xk>max{Fηk1(1/(2C)),0}x_{k}>\max\{F_{\eta_{k}}^{-1}(1/(2C)),0\}. Using the same argument, we obtain a similar bound on Fηk1(CFηk(xk))2F^{-1}_{\eta_{k}}(CF_{\eta_{k}}(x_{k}))^{2} for all xkx_{k} smaller than a certain value. Together with the continuity of xkFηk1(CFηk(xk))2x_{k}\mapsto F^{-1}_{\eta_{k}}(CF_{\eta_{k}}(x_{k}))^{2}, these bounds ensure that xkFηk1(CFηk(xk))x_{k}\mapsto F^{-1}_{\eta_{k}}(CF_{\eta_{k}}(x_{k})) is in Lηk2L^{2}_{\eta_{\leq k}} for any C>0C>0. Hence SKR,kLηk2S_{\textrm{KR},k}\in L^{2}_{\eta_{\leq k}}. Furthermore, we have SKR,k(𝒙k)=𝒪(xk)S_{\textrm{KR},k}(\bm{x}_{\leq k})=\mathcal{O}(x_{k}) as |xk||x_{k}|\rightarrow\infty.

Now we show that kSKR,kLηk2\partial_{k}S_{\textrm{KR},k}\in L^{2}_{\eta_{\leq k}} by showing kSKR,k\partial_{k}S_{\textrm{KR},k} is a continuous and bounded function. From the absolute continuity of 𝝁\bm{\mu} and 𝝂\bm{\nu}, we have that

kSKR,k(𝒙k)=πk(xk|𝒙<k)ηk(SKR,k(𝒙k))=πk(Fπk1(Fπk(xk|𝒙<k)|𝒙<k)|𝒙<k)ηk(Fηk1Fπk(xk|𝒙<k)),\partial_{k}S_{\textrm{KR},k}(\bm{x}_{\leq k})=\frac{\pi_{k}(x_{k}|\bm{x}_{<k})}{\eta_{k}(S_{\textrm{KR},k}(\bm{x}_{\leq k}))}=\frac{\pi_{k}(F_{\pi_{k}}^{-1}(F_{\pi_{k}}(x_{k}|\bm{x}_{<k})|\bm{x}_{<k})|\bm{x}_{<k})}{\eta_{k}(F_{\eta_{k}}^{-1}\circ F_{\pi_{k}}(x_{k}|\bm{x}_{<k}))}, (55)

is continuous, where Fπk1(|𝒙<k)F_{\pi_{k}}^{-1}(\cdot|\bm{x}_{<k}) denotes the inverse of the map xkFπk(xk|𝒙<k)x_{k}\mapsto F_{\pi_{k}}(x_{k}|\bm{x}_{<k}) for each 𝒙<kk1\bm{x}_{<k}\in\mathbb{R}^{k-1}. Hence, it is sufficient to show that kSKR,k\partial_{k}S_{\textrm{KR},k} goes to a finite limit as |xk||x_{k}|\rightarrow\infty. For the right-hand limit, we can write

limxkkSKR,k(𝒙k)\displaystyle\lim_{x_{k}\rightarrow\infty}\partial_{k}S_{\textrm{KR},k}(\bm{x}_{\leq k}) =limu1πk(Fπk1(u|𝒙<k)|𝒙<k)ηk(Fηk1(u))\displaystyle=\lim_{u\rightarrow 1^{-}}\frac{\pi_{k}(F_{\pi_{k}}^{-1}(u|\bm{x}_{<k})|\bm{x}_{<k})}{\eta_{k}(F_{\eta_{k}}^{-1}(u))}
=limu1(Fη,k1)(u)(Fπk1)(u|𝒙<k)\displaystyle=\lim_{u\rightarrow 1^{-}}\frac{(F_{\eta,k}^{-1})^{\prime}(u)}{(F_{\pi_{k}}^{-1})^{\prime}(u|\bm{x}_{<k})}
=limu1Fηk1(u)Fπk1(u|𝒙<k),\displaystyle=\lim_{u\rightarrow 1^{-}}\frac{F_{\eta_{k}}^{-1}(u)}{F_{\pi_{k}}^{-1}(u|\bm{x}_{<k})}, (56)

where in the second equality we used the inverse function theorem and the third equality follows from l’Hôpital’s rule. To analyze the ratio Fηk1(u)/Fπk1(u|𝒙<k)F_{\eta_{k}}^{-1}(u)/F_{\pi_{k}}^{-1}(u|\bm{x}_{<k}), we combine the lower bound in (30) and the bounds in (54) to get

Fηk1(u)Fπk1(u|𝒙<k)Fηk1(u)Fηk1((u1)/C1+1)β1(logα2log(1u))β2(logα1log((1u)/C1)).\frac{F_{\eta_{k}}^{-1}(u)}{F_{\pi_{k}}^{-1}(u|\bm{x}_{<k})}\leq\frac{F_{\eta_{k}}^{-1}(u)}{F_{\eta_{k}}^{-1}((u-1)/C_{1}+1)}\leq\sqrt{\frac{\beta_{1}(\log\alpha_{2}-\log(1-u))}{\beta_{2}(\log\alpha_{1}-\log((1-u)/C_{1}))}}.

Similarly, from the upper bound in (30) and the bounds in (54), we have

Fηk1(u)Fπk1(u|𝒙<k)Fηk1(u)Fηk1((u1)/C2+1)β2(logα1log(1u))β1(logα2log((1u)/C2)).\frac{F_{\eta_{k}}^{-1}(u)}{F_{\pi_{k}}^{-1}(u|\bm{x}_{<k})}\geq\frac{F_{\eta_{k}}^{-1}(u)}{F_{\eta_{k}}^{-1}((u-1)/C_{2}+1)}\geq\sqrt{\frac{\beta_{2}(\log\alpha_{1}-\log(1-u))}{\beta_{1}(\log\alpha_{2}-\log((1-u)/C_{2}))}}.

Thus, kSKR,k(𝒙k)=𝒪(1)\partial_{k}S_{\textrm{KR},k}(\bm{x}_{\leq k})=\mathcal{O}(1) as xkx_{k}\rightarrow\infty; an analogous argument applies as xkx_{k}\rightarrow-\infty, and we conclude that kSKR,kLηk2\partial_{k}S_{\textrm{KR},k}\in L^{2}_{\eta_{\leq k}}.

Lastly, taking the limit in (A.9), we have limxkkSKR,k(𝒙k)β2/β1\lim_{x_{k}\rightarrow\infty}\partial_{k}S_{\textrm{KR},k}(\bm{x}_{\leq k})\geq\sqrt{\beta_{2}/\beta_{1}}. For a target distribution π\pi with full support, all marginal conditional densities satisfy πk(xk|𝒙<k)>0\pi_{k}(x_{k}|\bm{x}_{<k})>0 for each 𝒙kk\bm{x}_{\leq k}\in\mathbb{R}^{k}. Given that kSKR,k\partial_{k}S_{\textrm{KR},k} does not approach zero as |xk||x_{k}|\rightarrow\infty, we can find a strictly positive constant ck>0c_{k}>0 such that kSKR,k(𝒙k)ck\partial_{k}S_{\textrm{KR},k}(\bm{x}_{\leq k})\geq c_{k} for all 𝒙kk\bm{x}_{\leq k}\in\mathbb{R}^{k}. This shows that essinfkSKR,k>0\operatorname*{ess\,inf}\partial_{k}S_{\textrm{KR},k}>0. ∎
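
To make (52) concrete, the following sketch (a hypothetical one-dimensional example, not taken from the paper) computes S_KR = F_η^{-1} ∘ F_π for a Gaussian-mixture target and a standard normal reference, and checks numerically that the map is strictly increasing on a wide grid, has a bounded derivative there, and pushes samples of π forward to an approximately standard normal distribution.

```python
# Illustrative 1D KR/Rosenblatt component S_KR(x) = F_eta^{-1}(F_pi(x)), cf. (52).
import numpy as np
from scipy.stats import norm

# illustrative target: Gaussian mixture 0.3*N(-2, 0.5^2) + 0.7*N(1, 1.0^2)
w, mu, sig = np.array([0.3, 0.7]), np.array([-2.0, 1.0]), np.array([0.5, 1.0])

def F_pi(x):
    """CDF of the mixture target, evaluated componentwise."""
    x = np.atleast_1d(x)[:, None]
    return np.sum(w * norm.cdf((x - mu) / sig), axis=1)

def S_KR(x):
    """Compose the target CDF with the standard-normal quantile function."""
    return norm.ppf(F_pi(x))

x  = np.linspace(-6.0, 6.0, 2001)
s  = S_KR(x)
ds = np.gradient(s, x)
print("monotone on the grid:", bool(np.all(ds > 0)))
print("derivative range on the grid: [%.3f, %.3f]" % (ds.min(), ds.max()))

# pushforward check: S_KR(X) should be approximately standard normal for X ~ pi
rng  = np.random.default_rng(0)
comp = rng.choice(2, size=20000, p=w)
X    = rng.normal(mu[comp], sig[comp])
Z    = S_KR(X)
print("mean/std of S_KR(X): %.3f / %.3f" % (Z.mean(), Z.std()))
```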

Appendix B Multi-index refinement for the wavelet basis

In this section we show how to greedily enrich the index set Λt\Lambda_{t} for a one-dimensional wavelet basis parameterized by the tuple of indices (l,q)(l,q) representing the level ll and translation qq of each wavelet ψ(l,q)\psi_{(l,q)}. To define the allowable indices, we construct a binary tree where each node is indexed by (l,q)(l,q) and has two children with indices (l+1,2q)(l+1,2q) and (l+1,2q+1)(l+1,2q+1). The root of the tree has index (0,0)(0,0) and corresponds to the mother wavelet ψ\psi. Analogously to the downward-closed property for polynomial indices, we only add a node to the tree (i.e., an index to Λt\Lambda_{t}) if its parent has already been added. Given any set Λt\Lambda_{t}, we define its reduced margin as

ΛtRM={α=(l,q)Λt such that (l1,q/2)Λt for even q(l1,(q1)/2)Λt for odd q}.\Lambda_{t}^{\text{RM}}=\left\{\alpha=(l,q)\not\in\Lambda_{t}\text{ such that }\begin{array}[]{ll}(l-1,q/2)\in\Lambda_{t}&\text{ for even }q\\ (l-1,(q-1)/2)\in\Lambda_{t}&\text{ for odd }q\end{array}\right\}.

Then, the ATM algorithm with a wavelet basis follows from Algorithm 1 with this construction for the reduced margin at each iteration.
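
As an illustration of this construction, the short Python sketch below (not the paper's implementation) computes the reduced margin of a wavelet index set and performs a few greedy enrichment steps using a placeholder selection criterion; in the ATM algorithm the criterion would instead be the improvement in the training objective afforded by each candidate feature.

```python
# Illustrative reduced-margin computation for the wavelet index tree described above.
def parent(index):
    l, q = index
    return (l - 1, q // 2)          # (l-1, q/2) for even q, (l-1, (q-1)/2) for odd q

def children(index):
    l, q = index
    return [(l + 1, 2 * q), (l + 1, 2 * q + 1)]

def reduced_margin(Lambda_t):
    if not Lambda_t:
        return {(0, 0)}             # only the mother wavelet can be added first
    candidates = {c for idx in Lambda_t for c in children(idx)}
    return {c for c in candidates if c not in Lambda_t and parent(c) in Lambda_t}

# greedy enrichment: repeatedly add an index from the reduced margin; the score
# below is a placeholder (prefer coarser levels), standing in for the objective
# improvement used by ATM
score = lambda idx: -idx[0]
Lambda_t = set()
for _ in range(6):
    rm = reduced_margin(Lambda_t)
    Lambda_t.add(max(rm, key=score))
print(sorted(Lambda_t))
```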

Appendix C Architecture details of alternative methods

In this section we present the details of the alternative methods to ATM that we consider in Section 5.

For each normalizing flow model, we use the recommended stochastic gradient descent optimizer with a learning rate of 10310^{-3}. We partition 10% of the samples in each training set to be validation samples and use the remaining samples for training the model. We select the optimal hyper-parameters for each dataset by fitting the density to the training data and choosing the parameters that minimize the negative log-likelihood of the approximate density on the validation samples. We also use the validation samples to determine the termination criterion during optimization.

We follow the implementation of [52] to define the architectures of these models. The hyper-parameters we consider for the neural networks in the MDN and NF models are: 22 hidden layers, 3232 hidden units in each layer, {5,10,20,50,100}\{5,10,20,50,100\} centers or flows, weight normalization, and a dropout probability of {0,0.2}\{0,0.2\} for regularizing the neural networks during training. For CKDE and NKDE we select the bandwidth of the kernel estimators using 55-fold cross-validation.
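
The selection procedure described above can be summarized by the schematic Python sketch below. The functions fit_model and neg_log_likelihood are hypothetical placeholders for whichever estimator is being tuned, and the hyper-parameter grid shown corresponds to the ranges of centers/flows and dropout probabilities listed above.

```python
# Schematic hyper-parameter selection by validation negative log-likelihood.
# `fit_model` and `neg_log_likelihood` are placeholders for the estimator at hand.
import itertools
import numpy as np

def select_hyperparameters(samples, grid, fit_model, neg_log_likelihood, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    n_val = int(0.1 * len(samples))                  # hold out 10% for validation
    val, train = samples[idx[:n_val]], samples[idx[n_val:]]

    best_params, best_nll = None, np.inf
    for params in grid:
        model = fit_model(train, **params)           # fit the estimator (placeholder)
        nll = neg_log_likelihood(model, val)         # validation criterion (placeholder)
        if nll < best_nll:
            best_params, best_nll = params, nll
    return best_params, best_nll

# example grid matching the ranges of centers/flows and dropout reported above
grid = [dict(n_components=c, dropout=p)
        for c, p in itertools.product([5, 10, 20, 50, 100], [0.0, 0.2])]
```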

References

  • [1] Ambrogioni, L., Güçlü, U., van Gerven, M. A. J. and Maris, E. (2017). The kernel mixture network: A nonparametric method for conditional density estimation of continuous random variables. arXiv preprint arXiv:1705.07111.
  • [2] Anderes, E. and Coram, M. (2012). A general spline representation for nonparametric and semiparametric density estimates using diffeomorphisms. arXiv preprint arXiv:1205.5314.
  • [3] Baptista, R., Hosseini, B., Kovachki, N. B. and Marzouk, Y. (2023). Conditional sampling with monotone GANs: from generative models to likelihood-free inference. arXiv preprint arXiv:2006.06755v3.
  • [4] Baptista, R., Marzouk, Y., Morrison, R. E. and Zahm, O. (2021). Learning non-Gaussian graphical models via Hessian scores and triangular transport. arXiv preprint arXiv:2101.03093.
  • [5] Bertsekas, D. P. (1997). Nonlinear programming. Journal of the Operational Research Society 48, 334–334.
  • [6] Bigoni, D., Marzouk, Y., Prieur, C. and Zahm, O. (2022). Nonlinear dimension reduction for surrogate modeling using gradient information. Information and Inference: A Journal of the IMA.
  • [7] Bishop, C. M. (1994). Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University.
  • [8] Bogachev, V. I., Kolesnikov, A. V. and Medvedev, K. V. (2005). Triangular transformations of measures. Sbornik: Mathematics 196, 309.
  • [9] Boyd, J. P. (1984). Asymptotic coefficients of Hermite function series. Journal of Computational Physics 54, 382–410.
  • [10] Brennan, M., Bigoni, D., Zahm, O., Spantini, A. and Marzouk, Y. (2020). Greedy inference with structure-exploiting lazy maps. Advances in Neural Information Processing Systems 33.
  • [11] Chang, S.-H., Cosman, P. C. and Milstein, L. B. (2011). Chernoff-type bounds for the Gaussian error function. IEEE Transactions on Communications 59, 2939–2944.
  • [12] Chkifa, A., Cohen, A. and Schwab, C. (2015). Breaking the curse of dimensionality in sparse polynomial approximation of parametric PDEs. Journal de Mathématiques Pures et Appliquées 103, 400–428.
  • [13] Cohen, A. (2003). Numerical analysis of wavelet methods. Elsevier.
  • [14] Cohen, A. and Migliorati, G. (2018). Multivariate approximation in downward closed polynomial spaces. In Contemporary Computational Mathematics - A Celebration of the 80th Birthday of Ian Sloan, 233–282. Springer.
  • [15] Cui, T. and Dolgov, S. (2021). Deep composition of tensor trains using squared inverse Rosenblatt transports. Foundations of Computational Mathematics, 1–60.
  • [16] Cui, T., Dolgov, S. and Zahm, O. (2023). Scalable conditional deep inverse Rosenblatt transports using tensor trains and gradient-based dimension reduction. Journal of Computational Physics 485, 112103. https://doi.org/10.1016/j.jcp.2023.112103
  • [17] Cui, T., Tong, X. T. and Zahm, O. (2022). Prior normalization for certified likelihood-informed subspace detection of Bayesian inverse problems. Inverse Problems 38, 124002.
  • [18] Dinh, L., Sohl-Dickstein, J. and Bengio, S. (2017). Density estimation using Real NVP. In International Conference on Learning Representations.
  • [19] Durkan, C., Bekasov, A., Murray, I. and Papamakarios, G. (2019). Neural spline flows. In Advances in Neural Information Processing Systems, 7509–7520.
  • [20] El Moselhy, T. A. and Marzouk, Y. M. (2012). Bayesian inference with optimal maps. Journal of Computational Physics 231, 7815–7850.
  • [21] Huang, C.-W., Chen, R. T. Q., Tsirigotis, C. and Courville, A. (2020). Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization. In International Conference on Learning Representations.
  • [22] Huang, C.-W., Krueger, D., Lacoste, A. and Courville, A. (2018). Neural Autoregressive Flows. In International Conference on Machine Learning, 2083–2092.
  • [23] Irons, N. J., Scetbon, M., Pal, S. and Harchaoui, Z. (2022). Triangular flows for generative modeling: Statistical consistency, smoothness classes, and fast rates. In International Conference on Artificial Intelligence and Statistics, 10161–10195. PMLR.
  • [24] Jaini, P., Kobyzev, I., Yu, Y. and Brubaker, M. (2020). Tails of Lipschitz triangular flows. In International Conference on Machine Learning, 4673–4681. PMLR.
  • [25] Jaini, P., Selby, K. A. and Yu, Y. (2019). Sum-of-squares polynomial flow. In International Conference on Machine Learning, 3009–3018.
  • [26] Katzfuss, M. and Schäfer, F. (2023). Scalable Bayesian transport maps for high-dimensional non-Gaussian spatial fields. Journal of the American Statistical Association, 1–15. https://doi.org/10.1080/01621459.2023.2197158
  • [27] Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, 10215–10224.
  • [28] Kobyzev, I., Prince, S. and Brubaker, M. (2020). Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [29] Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT Press.
  • [30] Kufner, A. and Opic, B. (1984). How to define reasonably weighted Sobolev spaces. Commentationes Mathematicae Universitatis Carolinae 25, 537–554.
  • [31] Lezcano Casado, M. (2019). Trivializations for gradient-based optimization on manifolds. Advances in Neural Information Processing Systems 32, 9157–9168.
  • [32] Lichman, M. (2013). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.
  • [33] Lueckmann, J.-M., Boelts, J., Greenberg, D., Goncalves, P. and Macke, J. (2021). Benchmarking simulation-based inference. In International Conference on Artificial Intelligence and Statistics, 343–351. PMLR.
  • [34] Mallat, S. (1999). A wavelet tour of signal processing. Elsevier.
  • [35] Marzouk, Y., Moselhy, T., Parno, M. and Spantini, A. (2016). Sampling via Measure Transport: An Introduction. In Handbook of Uncertainty Quantification, 1–41. Springer International Publishing.
  • [36] Migliorati, G. (2015). Adaptive polynomial approximation by means of random discrete least squares. In Numerical Mathematics and Advanced Applications - ENUMATH 2013, 547–554. Springer.
  • [37] Migliorati, G. (2019). Adaptive approximation by optimal weighted least-squares methods. SIAM Journal on Numerical Analysis 57, 2217–2245.
  • [38] Morrison, R., Baptista, R. and Marzouk, Y. (2017). Beyond normality: Learning sparse probabilistic graphical models in the non-Gaussian setting. In Advances in Neural Information Processing Systems, 2359–2369.
  • [39] Muckenhoupt, B. (1972). Hardy's inequality with weights. Studia Mathematica 44, 31–38.
  • [40] Nocedal, J. and Wright, S. (2006). Numerical optimization. Springer Science & Business Media.
  • [41] Novak, E., Ullrich, M., Woźniakowski, H. and Zhang, S. (2018). Reproducing kernels of Sobolev spaces on ℝ^d and applications to embedding constants and tractability. Analysis and Applications 16, 693–715.
  • [42] van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L. C., Stimberg, F. et al. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.
  • [43] Papamakarios, G. and Murray, I. (2016). Fast ε-free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems, 1028–1036.
  • [44] Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S. and Lakshminarayanan, B. (2021). Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research 22, 1–64.
  • [45] Papamakarios, G., Pavlakou, T. and Murray, I. (2017). Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, 2338–2347.
  • [46] Parno, M. D. and Marzouk, Y. M. (2018). Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification 6, 645–682.
  • [47] Radev, S. T., Mertens, U. K., Voss, A., Ardizzone, L. and Köthe, U. (2020). BayesFlow: Learning complex stochastic models with invertible neural networks. IEEE Transactions on Neural Networks and Learning Systems.
  • [48] Ramsay, J. O. (1998). Estimating smooth monotone functions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60, 365–375.
  • [49] Raskutti, G. and Uhler, C. (2018). Learning directed acyclic graph models based on sparsest permutations. Stat 7, e183.
  • [50] Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning, 1530–1538. PMLR.
  • [51] Rosenblatt, M. (1952). Remarks on a multivariate transformation. The Annals of Mathematical Statistics 23, 470–472.
  • [52] Rothfuss, J., Ferreira, F., Walther, S. and Ulrich, M. (2019). Conditional density estimation with neural networks: Best practices and benchmarks. arXiv preprint arXiv:1903.00954.
  • [53] Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Springer International Publishing.
  • [54] Schäfer, F., Katzfuss, M. and Owhadi, H. (2021). Sparse Cholesky Factorization by Kullback–Leibler Minimization. SIAM Journal on Scientific Computing 43, A2019–A2046.
  • [55] Schmuland, B. (1992). Dirichlet forms with polynomial domain. Math. Japon. 37, 1015–1024.
  • [56] Schölkopf, B., Herbrich, R. and Smola, A. J. (2001). A generalized representer theorem. In International Conference on Computational Learning Theory, 416–426. Springer.
  • [57] Shin, Y. E., Zhou, L. and Ding, Y. (2022). Joint estimation of monotone curves via functional principal component analysis. Computational Statistics & Data Analysis 166, 107343.
  • [58] Silverman, B. W. (1982). On the estimation of a probability density function by the maximum penalized likelihood method. The Annals of Statistics, 795–810.
  • [59] Sisson, S. A., Fan, Y. and Tanaka, M. M. (2007). Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 104, 1760–1765.
  • [60] Spantini, A., Baptista, R. and Marzouk, Y. (2022). Coupling techniques for nonlinear ensemble filtering. SIAM Review 64, 921–953.
  • [61] Spantini, A., Bigoni, D. and Marzouk, Y. (2018). Inference via low-dimensional couplings. The Journal of Machine Learning Research 19, 2639–2709.
  • [62] Tabak, E. G. and Turner, C. V. (2013). A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66, 145–164.
  • [63] Teshima, T., Ishikawa, I., Tojo, K., Oono, K., Ikeda, M. and Sugiyama, M. (2020). Coupling-based invertible neural networks are universal diffeomorphism approximators. In Advances in Neural Information Processing Systems 33, 3362–3373.
  • [64] Trippe, B. L. and Turner, R. E. (2018). Conditional density estimation with Bayesian normalising flows. In Bayesian Deep Learning: NIPS 2017 Workshop.
  • [65] Truong, T. T. and Nguyen, H.-T. (2021). Backtracking Gradient Descent Method and Some Applications in Large Scale Optimisation. Part 2: Algorithms and Experiments. Applied Mathematics & Optimization 84, 2557–2586.
  • [66] Uria, B., Murray, I. and Larochelle, H. (2013). RNADE: The real-valued neural autoregressive density-estimator. arXiv preprint arXiv:1306.0186.
  • [67] Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, Vol. 47. Cambridge University Press.
  • [68] Vidakovic, B. (2009). Statistical modeling by wavelets, Vol. 503. John Wiley & Sons.
  • [69] Villani, C. (2008). Optimal transport: old and new, Vol. 338. Springer Science & Business Media.
  • [70] Wang, S. and Marzouk, Y. (2022). On minimax density estimation via measure transport. arXiv preprint arXiv:2207.10231.
  • [71] Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.
  • [72] Wehenkel, A. and Louppe, G. (2019). Unconstrained monotonic neural networks. In Advances in Neural Information Processing Systems, 1543–1553.
  • [73] Wenliang, L., Sutherland, D., Strathmann, H. and Gretton, A. (2019). Learning deep kernels for exponential family densities. In International Conference on Machine Learning, 6737–6746.
  • [74] Zahm, O., Cui, T., Law, K., Spantini, A. and Marzouk, Y. (2022). Certified dimension reduction in nonlinear Bayesian inverse problems. Mathematics of Computation 91, 1789–1835.
  • [75] Zech, J. and Marzouk, Y. (2022). Sparse approximation of triangular transports. Part II: the infinite-dimensional case. Constructive Approximation 55, 987–1036.
  • [76] Zech, J. and Marzouk, Y. (2022). Sparse approximation of triangular transports. Part I: the finite-dimensional case. Constructive Approximation 55, 919–986.