This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Convergence of Stein Variational Gradient Descent under a Weaker Smoothness Condition

Lukang Sun
CEMSE, KAUST
lukang.sun@kaust.edu.sa
&Avetik Karagulyan
CEMSE, KAUST
avetik.karagulyan@kaust.edu.sa
&Peter Richtarik
CEMSE, KAUST
peter.richtarik@kaust.edu.sa
Abstract

Stein Variational Gradient Descent (SVGD) is an important alternative to the Langevin-type algorithms for sampling from probability distributions of the form π(x)exp(V(x))\pi(x)\propto\exp(-V(x)). In the existing theory of Langevin-type algorithms and SVGD, the potential function VV is often assumed to be LL-smooth. However, this restrictive condition excludes a large class of potential functions such as polynomials of degree greater than 22. Our paper studies the convergence of the SVGD algorithm for distributions with (L0,L1)(L_{0},L_{1})-smooth potentials. This relaxed smoothness assumption was introduced by Zhang et al. [2019a] for the analysis of gradient clipping algorithms. With the help of trajectory-independent auxiliary conditions, we provide a descent lemma establishing that the algorithm decreases the KL\mathop{\mathrm{KL}}\nolimits divergence at each iteration and prove a complexity bound for SVGD in the population limit in terms of the Stein Fisher information.

1 Introduction

Bayesian methods are widely implemented in various inference tasks that emerge in computational statistics [Neal, 1992, Roberts and Tweedie, 1996], machine learning [Grenander and Miller, 1994, Şimşekli, 2017], inverse problems [Zhou et al., 2020, Durmus et al., 2018] and model selection [Feroz and Skilling, 2013, Cai et al., 2021]. Often Bayesian methods boil down to approximate integration problems which are solved using Markov Chain Monte-Carlo algorithms [Robert and Casella, 1999]. In practice, the target distribution π\pi defined on d\mathbb{R}^{d} is often absolutely continuous w.r.t. the Lebesgue measure and we have access to its density up to a normalization constant. Throughout the paper we will use the same notation for probability distributions and their corresponding densities. In Bayesian statistics the target distribution is generally given by

π(x)=1ZeV(x),\pi(x)=\frac{1}{Z}e^{-V(x)},

where ZZ is the normalization constant and V:dV:{\mathbb{R}}^{d}\to\mathbb{R} is called potential function. Here, we usually do not know the value of ZZ. The approximate sampling methods are proceed as follows. For every ε>0\varepsilon>0, the goal is to construct a distribution μ\mu that approximates the target π\pi in some probability distance dist:

𝖽𝗂𝗌𝗍(μ,π)<ε.{\sf dist}(\mu,\pi)<\varepsilon.

In this paper, we will study the case when 𝖽𝗂𝗌𝗍{\sf dist} is the Kullback-Leibler divergence:

KL(μπ):=log(μ(x)π(x))μ(dx)=𝔼Xμ[V(X)]H(μ).\mathop{\mathrm{KL}}\nolimits(\mu\mid\pi):=\int\log\left(\frac{\mu(x)}{\pi(x)}\right)\mu({\mathrm{d}}x)=\mathbb{E}_{X\sim\mu}\big{[}V(X)\big{]}-H(\mu).

Here, H(μ):=log(μ(x))μ(dx)H(\mu):=-\int\log(\mu(x))\mu({\mathrm{d}}x) is the negative entropy. The sampling problem can also be seen as a minimization problem in the space of probability measures (see e.g. [Liu and Wang, 2016, Wibisono, 2018, Durmus et al., 2019]). Indeed, consider the functional ():=KL(π){\mathcal{F}}(\cdot):=\mathop{\mathrm{KL}}\nolimits(\cdot\mid\pi). This functional is non-negative and it admits its minimum value 0 only when for the target distribution: π=argminμ(μ)\pi=\arg\min_{\mu}{\mathcal{F}}(\mu). The connection of sampling and optimization has been repeatedly leveraged in the previous literature of Langevin sampling. Various algorithms such as Langevin Monte-Carlo (e.g. Dalalyan [2017a], Wibisono [2019], Durmus et al. [2019]), Underdamped Langevin Algorithm (e.g. Ma et al. [2019], Chatterji et al. [2018]) have often been influenced by the known optimization methods. Conversely, another line of research has studied the application of sampling algorithms to solve optimization problems (see e.g. [Raginsky et al., 2017]). In this paper, we will study another algorithm called Stein Variational Gradient Descent (SVGD), which is designed as a gradient descent algorithm for the KL\mathop{\mathrm{KL}}\nolimits divergence in the space of probability measures.

1.1 SVGD

The LMC algorithm, treats the functional KL(|π)\mathop{\mathrm{KL}}\nolimits(\cdot|\pi) as a composite functional described in (1). For details, we refer the reader to Appendix˜A (see also [Wibisono, 2018]). Unlike the LMC, the Stein Variational Gradient Descent (SVGD) algorithm applies gradient descent directly to KL(|π)\mathop{\mathrm{KL}}\nolimits(\cdot|\pi) (see Section˜2.2 for the complete definition). SVGD is an important alternative to the Langevin algorithm and already has been used extensively in different settings of machine learning, such as variational auto-encoders [Pu et al., 2017], reinforcement learning [Liu et al., 2017], sequential decision making [Zhang et al., 2018, 2019b], generative adversarial networks [Tao et al., 2019] and federated learning [Kassab and Simeone, 2022].

The seminal work of Liu and Wang [2016] introduced SVGD as a sampling method. Since then several variants of SVGD have been proposed. Here is a non-exclusive list: random batch method SVGD [Li et al., 2020], matrix kernel SVGD [Wang et al., 2019], Newton version SVGD [Detommaso et al., 2018]. However, the theoretical understanding of SVGD is still limited to the continuous time approximation of SVGD or the so called infinite particle regime. The work by Liu [2017] was the first that proved a convergence result of the SVGD in the population limit. Later, Duncan et al. [2019] studied the geometry related to the SVGD and proposed a scheme to choose the kernel. The clean analysis of the finite particle regime remains an open question (see also [Lu et al., 2019, Nüsken and Renger, 2021]). In [Korba et al., 2020], a descent lemma was established for the SVGD in population limit in terms of Kullback-Leibler divergence. The drawback of this result was that the analysis relied on the path information of the SVGD which is unknown beforehand. Salim et al. [2021] improved the work of  Korba et al. [2020] and provided a clean analysis for the convergence. They assumed π\pi satisfies T1\text{T}_{1} (see Section˜2) inequality which essentially replaced the initial trajectory condition. This implied a complexity bound for the SVGD in terms of the desired accuracy ϵ\epsilon and dimension dd. To the best of our knowledge, this is the state of the art result on SVGD.

1.2 Contributions

The main contribution of the paper relies on its weaker set of assumptions, that allow to treat a larger class of probability distributions which includes densities with polynomials. We enlarge the class of probability distributions two-fold.

Smoothness:

The gradient smoothness assumption is very common in the sampling literature (see e.g., Durmus and Moulines [2017], Dalalyan [2017b], Dalalyan and Karagulyan [2019], Durmus et al. [2019], Vempala and Wibisono [2019], Shen and Lee [2019], Korba et al. [2020], Salim et al. [2021]). It is formulated as the Lipschitz continuity of the gradient. That is the Hessian 2V()\nabla^{2}V(\cdot) is well-defined on d\mathbb{R}^{d} and

2V(x)opL,xd.\left\|{\nabla^{2}V}(x)\right\|_{op}\leq L,\quad\forall x\in\mathbb{R}^{d}. (1)

Though people have made great progress towards the understanding of these algorithms, the LL-smoothness condition is quite restrictive. In fact from LL-smoothness condition we can easily deduce that VV has at most quadratic growth rate. In particular, the large class of polynomial potentials whose order is higher than 2 does not satisfy this condition. For the LMC algorithm, several papers have proposed different methods to relax or remove this assumption [Lehec, 2021, Brosse et al., 2019, Hutzenthaler et al., 2012, Sabanis, 2013, Erdogdu and Hosseinzadeh, 2021, Chewi et al., 2021].

However, to the best of our knowledge there is no result for the SVGD under a relaxed smoothness assumption. In this paper we introduce the concept of (L0,L1)(L_{0},L_{1}) smoothness (see ˜(L0,L1)(L_{0},L_{1}) for the definition). The latter was initially proposed by Zhang et al. [2019a], for the gradient clipping algorithm. The new condition with parameters (L,0)(L,0) is equivalent to the LL-smoothness and therefore, it is indeed a weaker assumption. Important example of such functions are the higher order polynomials (see Section˜4 for details).

Functional inequalities:

The analysis of Langevin algorithms often relies on the strong convexity of the potential function VV. A line of work has proposed different methods to relax or bypass this assumption. One possible solution is to slightly modify the original algorithm (e.g. [Dalalyan et al., 2019, Karagulyan and Dalalyan, 2020]). Another approach relies on functional inequalities such as Poincaré or logarithmic Sobolev inequalities (e.g. [Vempala and Wibisono, 2019, Chewi et al., 2020, Ahn and Chewi, 2021]). However, the verification of these inequalities is not straightforward.

Unlike the Langevin algorithm, the SVGD algorithm does not require convexity in any form. However, functional inequalities also serve for the analysis of the SVGD but with a different purpose. Salim et al. [2021] have proved a complexity result in dimension and precision error for the SVGD algorithm performed on the targets that satisfy Talagrand’s T1{\mathrm{T}}_{1} inequality (see (5) for details). This assumption replaces the bound on the trajectory that was initially proposed by Korba et al. [2020]. Despite the major improvement, they still assumed the LL-smoothness of the potential, which as mentioned previously does not cover the polynomials.

In this regard, we propose to replace the Talagrand’s inequality with the generalized Tp{\mathrm{T}}_{p} (see ˜(Tp,J)({\mathrm{T}}_{p},{\rm J}) for the definition). Due its general form, this new condition includes inequalities that are easy to verify (see Appendix˜E). See also the Table˜1 for a visual representation of the literature review.

Paper Method Smoothness Other conditions Can deal with higher-order polynomials? Complexity to get ε\varepsilon error Criterion
Vempala and Wibisono [2019] LMC LL-smoothness λLS\lambda_{LS}-Log-Sobolev 𝒪~[dλLS2ε]\widetilde{{\mathcal{O}}}\left[\frac{d}{\lambda_{LS}^{2}\varepsilon}\right] KL\mathop{\mathrm{KL}}\nolimits
Chatterji et al. [2020] LMC (L,α)(L,\alpha)-Hölder continuity + Gaussian smoothing Log-concavity 𝒪~[d(53α)/2ε4/(1+α)]\widetilde{\mathcal{O}}\left[\frac{d^{\nicefrac{{(5-3\alpha)}}{{2}}}}{\varepsilon^{\nicefrac{{4}}{{(1+\alpha)}}}}\right] W2W_{2}
Chewi et al. [2021] LMC Hölder continuity α\alpha-tails + Modified Log-Sobolev O~[d(2/α)(1+1/s)1/sε1/s]\widetilde{O}\left[\frac{d^{(2/\alpha)(1+1/s)-1/s}}{\varepsilon^{1/s}}\right] KL\mathop{\mathrm{KL}}\nolimits
Korba et al. [2020] SVGD LL-smoothness Trajectory bounds 𝒪[dεC]{\mathcal{O}}\left[\frac{d}{\varepsilon}\sqrt{C}\right] ISteinI_{\rm Stein}
Salim et al. [2021] SVGD LL-smoothness T1{\mathrm{T}}_{1} inequality with constant λT\lambda_{{\mathrm{T}}} 𝒪~[d3/2λT1/2ε]\widetilde{\mathcal{O}}\left[\frac{d^{3/2}}{\lambda_{{\mathrm{T}}}^{1/2}\varepsilon}\right] ISteinI_{\rm Stein}
This paper SVGD (L0,L1)(L_{0},L_{1}) smoothness Tp{\mathrm{T}}_{p} inequality with constant λT\lambda_{{\mathrm{T}}} 𝒪[(λTp2ε)1(pd)(p+1)(p+2)4]{\mathcal{O}}\Big{[}{(\lambda_{{\mathrm{T}}}^{\frac{p}{2}}\varepsilon)}^{-1}{(pd)^{\frac{(p+1)(p+2)}{4}}}\Big{]} ISteinI_{\rm Stein}
This paper SVGD (L0,L1)(L_{0},L_{1}) smoothness Generalized (Tp,J)({\mathrm{T}}_{p},{\rm J}) 𝒪[λBVp(pd)p+1ε]{\mathcal{O}}\left[\frac{\lambda_{BV}^{p}(pd)^{p+1}}{\varepsilon}\right] ISteinI_{\rm Stein}
Table 1: This summarizes several previous results on the LMC and SVGD algorithms. Hear 𝒪~\tilde{{\mathcal{O}}} corresponds to the order without the log-polynomial factors. One has to bear in mind, that the complexities of the SVGD and LMC cannot be compared to each other as they use different error metrics. The complexity of [Korba et al., 2020][Corollary 6] contains the trajectory bound constant CC, which may depend on the dimension dd. In the last row of the table, the (Tp,J)({\mathrm{T}}_{p},{\rm J}) is satisfied for the function J(r)=λBV(r1/p+(r/2)1/2p){\mathrm{J}}(r)=\lambda_{BV}(r^{1/p}+(r/2)^{1/2p}), where λBV\lambda_{BV} is a constant that may depend on the dimension (see Section˜2.3).

1.3 Paper structure

The paper is organized as follows. In Section˜2, we present the mathematical setting of our problem. We introduce the basic notions, describe the main algorithm and list the necessary assumptions. The first part of Section˜3 is devoted to the main result, which is a descent lemma for KL(,π)\mathop{\mathrm{KL}}\nolimits(\cdot,\pi) functional on the Wasserstein space. The second part contains the complexity results and discussions. Section˜4 presents two examples. Section˜5 concludes the results of the paper and discusses possible future work. Finally in the Appendix, the reader may find the postponed proofs, intermediary technical lemmas and supplementary discussions.

2 Mathematical setting of the problem

2.1 Notations

We will denote the random variables with uppercase (resp. lowercase) Latin letters the random (resp. deterministic) vectors, unless specified. The space of dd-dimensional real vectors is denoted by d\mathbb{R}^{d}, while the set of non-negative real numbers is denoted by +\mathbb{R}^{+}. Our target distribution π\pi is defined on d\mathbb{R}^{d}, which is enhanced with the Euclidean norm. The notation \|\cdot\| will correspond to the 2{\ell}_{2} norm on d\mathbb{R}^{d} unless specified. Also, we will assume that π\pi has the pp-th moment, that is π𝒫p(d)\pi\in\mathcal{P}_{p}(\mathbb{R}^{d}). The Jacobian of a vector valued function h()=(h1(),,hd())h(\cdot)=(h_{1}(\cdot),\ldots,h_{d}(\cdot))^{\top} is a d×dd\times d matrix defined as

Jh(x):=(xihj)i=1,j=1d,d.Jh(x):=\big{(}{\partial_{x_{i}}h_{j}}\big{)}_{i=1,j=1}^{d,d}.

The divergence of the vector valued function hh is the trace of its Jacobian:

divh(x):=i=1dxihi(x).\operatorname{div}h(x):=\sum_{i=1}^{d}\partial_{x_{i}}h_{i}(x).

The Hessian of a real valued function U:dU:\mathbb{R}^{d}\rightarrow\mathbb{R} is defined as the following square matrix:

2U(x):=(2hxixj(x))i=1,j=1d,d.\nabla^{2}U(x):=\left(\frac{\partial^{2}h}{\partial_{x_{i}}\partial_{x_{j}}}(x)\right)_{i=1,j=1}^{d,d}.

For any Hilbert space \mathcal{H}, we denote by ,\langle\cdot,\cdot\rangle_{\mathcal{H}} the inner product of \mathcal{H} and by \|\cdot\|_{\mathcal{H}} its norm. Moreover, op\|\cdot\|_{\mathrm{op}} denotes the operator norm on the set of matrices. Let μ,ν𝒫p(d)\mu,\nu\in\mathcal{P}_{p}(\mathbb{R}^{d}). The set of couplings Γ(μ,ν)\Gamma(\mu,\nu) is the set of all joint distributions defined on d×d\mathbb{R}^{d}\times\mathbb{R}^{d} having μ\mu and ν\nu as its marginals. The Wasserstein-pp distance between two probability measures is defined as

Wp(μ,ν):=infηΓ(μ,ν)[xypη(dx,dy)]1/p.W_{p}(\mu,\nu):=\inf_{\eta\in{\Gamma}(\mu,\nu)}\left[\int\|x-y\|^{p}\eta({\mathrm{d}}x,{\mathrm{d}}y)\right]^{1/p}.

The Kullback-Leibler divergence is defined as

KL(μν)={dlog(μ(x)ν(x))μ(dx), if μν;+, otherwise.\mathop{\mathrm{KL}}\nolimits(\mu\mid\nu)=\begin{cases}\int_{\mathbb{R}^{d}}\log\left(\frac{\mu(x)}{\nu(x)}\right)\mu({\mathrm{d}}x),\text{ if }\mu\ll\nu;\\ +\infty,\text{ otherwise}.\end{cases}

We will use the spectral and the Hilbert-Schmidt norms for matrices. For Md×dM\in\mathbb{R}^{d\times d} they are respectively defined as

Mop\displaystyle\|M\|_{op} :=λmax(MM),andMHS\displaystyle=\sqrt{\lambda_{\max}(M^{\top}M)},\qquad\text{and}\qquad\|M\|_{HS} :=i=1dj=1dMi,j2.\displaystyle=\sqrt{\sum_{i=1}^{d}\sum_{j=1}^{d}M_{i,j}^{2}}.

Here λmax\lambda_{\max} corresponds to the largest eigenvalue.

2.2 The definition of the SVGD

Let us present briefly Reproducing Kernel Hilbert Spaces (RKHS) and some of its essential properties. We refer the reader to [Steinwart and Christmann, 2008][Chapter 4] for a detailed introduction. Let the map k:d×dk:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R} be a reproducing kernel and let 0{\mathcal{H}}_{0} be its corresponding RKHS. This means that 0{\mathcal{H}}_{0} consists of real-valued maps from d\mathbb{R}^{d} to \mathbb{R}, including the feature maps Φ(x):=k(x,)0\Phi(x):=k(x,\cdot)\in{\mathcal{H}}_{0}, and the reproducing property is satisfied:

f(x)=f,k(,x)0.f(x)=\langle f,k(\cdot,x)\rangle_{{\mathcal{H}}_{0}}.

See the discussion on the reproducing property in Section˜E.1. Let {\mathcal{H}} be the space of the dd-dimensional maps {(f1,,fd)fi0,i=1,,d}\{(f_{1},\ldots,f_{d})^{\top}\mid f_{i}\in{\mathcal{H}}_{0},i=1,\ldots,d\}. For two vector functions f=(f1,,fd)f=(f_{1},\ldots,f_{d})^{\top} and g=(g1,,gd)g=(g_{1},\ldots,g_{d})^{\top} from {\mathcal{H}}, we define the scalar product as

f,g:=i=1dfi,gi0.\langle f,g\rangle_{{\mathcal{H}}}:=\sum_{i=1}^{d}\langle f_{i},g_{i}\rangle_{{\mathcal{H}}_{0}}.

Suppose that we have a kernel k:d×d+k:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}^{+} and 0{\mathcal{H}}_{0} is its corresponding RKHS. As described above, we construct the Hilbert space =0d{\mathcal{H}}={\mathcal{H}}_{0}^{d}. Our goal is to construct an iterative algorithm, that has its iterates in the space of probability measures 𝒫p(d){\mathcal{P}}_{p}(\mathbb{R}^{d}). Each iterate is defined as a pushforward measure from the previous one in a way that it minimizes the KL\mathop{\mathrm{KL}}\nolimits distance the most. To do so, for every ψ\psi\in{\mathcal{H}} and γ>0\gamma>0 let us define the operator

Tγ(x):=xγψ(x).T_{\gamma}(x):=x-\gamma\psi(x).

The operator ψ\psi will serve us as the direction or the perturbation, while as γ\gamma is the step-size. The goal is to choose the direction in which the KL\mathop{\mathrm{KL}}\nolimits error descends the most. Thus, for every μ\mu, the optimal choice of ψ\psi is the solution of the following problem:

gμ:=argmaxψ{ddγKL(Tγ#μπ)|γ=0 s.t. ψ1}.g_{\mu}:=\arg\max_{\psi\in\mathcal{H}}\left\{-\frac{\mathrm{d}}{\mathrm{d}\gamma}\mathrm{KL}\left(T_{\gamma}{\#}\mu\|\pi\right)\Big{|}_{\gamma=0}\quad\text{ s.t. }\quad\|\psi\|_{\mathcal{H}}\leq 1\right\}. (2)

The functional objective depends linearly on the perturbation function ψ\psi. Indeed, Liu and Wang [2016] have proved that

ddγKL(Tγ#μπ)|γ=0=d(V(x)ψ(x)+divψ(x))μ(dx).-\frac{\mathrm{d}}{\mathrm{d}\gamma}\mathrm{KL}\left(T_{\gamma}{\#}\mu\|\pi\right)\Big{|}_{\gamma=0}=\int_{\mathbb{R}^{d}}\big{(}-V(x)\psi(x)+\operatorname{div}\psi(x)\big{)}\mu({\mathrm{d}}x). (3)

If ψ\psi satisfies mild conditions, then the right-hand side of (3) is equal to zero if and only if the measures μ\mu and π\pi coincide [Stein, 1972]. This property motivates to define a discrepancy measure called Stein discrepancy as follows:

IStein(μπ):=maxψ{d(V(x)ψ(x)+divψ(x))μ(dx) s.t. ψ1}2.I_{\rm Stein}(\mu\mid\pi):=\max_{\psi\in\mathcal{H}}\left\{\int_{\mathbb{R}^{d}}\big{(}-V(x)\psi(x)+\operatorname{div}\psi(x)\big{)}\mu({\mathrm{d}}x)\quad\text{ s.t. }\quad\|\psi\|_{\mathcal{H}}\leq 1\right\}^{2}.

Liu et al. [2016] have proved that the solution of the optimization problem (2) is given explicitly by

gμ()=d[logπ(x)k(x,)+xk(x,)]μ(dx)g_{\mu}(\cdot)=-\int_{\mathbb{R}^{d}}\big{[}\nabla\log\pi(x)k(x,\cdot)+\nabla_{x}k(x,\cdot)\big{]}\mu({\mathrm{d}}x)

and gμ=IStein(μπ)\|g_{\mu}\|_{{\mathcal{H}}}=\sqrt{I_{\rm Stein}(\mu\mid\pi)}. For the proof of this inequality, we refer the reader to the remark in Section˜B.1. In our case, ISteinI_{\rm Stein} is often referred to as the squared Kernelized Stein Discrepancy (KSD) or the Stein-Fisher information. Under certain conditions (see Section˜E.4) on the target distribution π\pi and the kernel kk, the convergence in KSD implies weak convergence:

IStein(μnπ)0μnπ.I_{\rm Stein}(\mu_{n}\mid\pi)\rightarrow 0\implies\mu_{n}\Rightarrow\pi.
Remark 1.

Integration by parts yields the following formula for gμg_{\mu}:

gμ()=dlog(μ(x)π(x))k(x,)μ(dx).g_{\mu}(\cdot)=\int_{\mathbb{R}^{d}}\nabla\log\left(\frac{\mu(x)}{\pi(x)}\right)k(x,\cdot)\mu({\mathrm{d}}x).

This latter equality is the alternative definition of the optimal direction gμg_{\mu} given by Korba et al. [2020]. The proof of the remark can be found in Section˜E.5. Thus, we have determined the descent direction. In order to formulate the algorithm let us initialize our algorithm at μ0𝒫p(d)\mu_{0}\in{\mathcal{P}}_{p}(\mathbb{R}^{d}). The (i+1)(i+1)-th iterate μi+1\mu_{i+1} is obtained by transferring μi\mu_{i} with the map IγgμiI-\gamma g_{\mu_{i}}:

μi+1=(Iγgμi)#μi.\mu_{i+1}=(I-\gamma g_{\mu_{i}})\#\mu_{i}. (SVGD)

In case when kLp(μ)k\in L_{p}(\mu), then 0Lp(μ){\mathcal{H}}_{0}\subset L_{p}(\mu) (see [Steinwart and Christmann, 2008][Theorem 4.26]). This guarantees that every iteration of the SVGD has a finite pp-th moment, that is for every ii\in\mathbb{N}, μi𝒫p(d)\mu_{i}\in{\mathcal{P}}_{p}(\mathbb{R}^{d}). The rest of the section is devoted to the assumptions that we will use later in the analysis.

2.3 Assumptions

In order to perform SVGD we need the target π\pi, in particular the gradient of its potential potential VV and the kernel k(,)k(\cdot,\cdot). Below we present the four main assumptions on π\pi,VV and kk, that we use in the analysis. The first assumptions is a relaxed form of smoothness condition for V\nabla V.

We study the SVGD algorithm under the so called (L0,L1)(L_{0},L_{1}) smoothness, which has been borrowed from optimization literature. In [Zhang et al., 2019a], the authors studied the convergence of the clipped gradient descent under this condition. The assumption goes as follows.

Assumption (L0,L1)(L_{0},L_{1}).

The Hessian 2V{\nabla^{2}V} of V=logπV=-\log\pi is well-defined and L0,L10\exists L_{0},L_{1}\geq 0 s.t.

2V(x)opL0+L1V(x),\left\|{\nabla^{2}V}(x)\right\|_{op}\leq L_{0}+L_{1}\|\nabla V(x)\|, (4)

for any xnx\in\mathbb{R}^{n}.

First let us notice that when L1=0L_{1}=0 the inequality (4) becomes the usual smoothness condition (1). On the other hand, let us consider the generalized normal distribution (also known as exponential power distribution) which is given by its density π(x)exp(xaβ)\pi(x)\propto\exp(-\|x-a\|^{\beta}), where β2\beta\geq 2. In this case, Hessian 2V\nabla^{2}V is not bounded by a constant in terms of the operator norm. However, it is easy to verify that VV satisfies ˜(L0,L1)(L_{0},L_{1}), which yields that this assumption is indeed a relaxation of the LL-smoothness.

The next two assumptions are intended to replace the trajectory condition [Korba et al., 2020, Duncan et al., 2019] for polynomial potentials. However, this condition limits the applicability of the algorithm as these results cannot guarantee the convergence of the algorithm before actually implementing it. This condition was replaced by Talagrand’s T1{\mathrm{T}}_{1} inequality in the recent work by Salim et al. [2021], where complexity bounds are proved for LL-smooth potentials. Talagrand’s Tp{\mathrm{T}}_{p} inequality goes as follows:

Wp(μ,π)2KL(μπ)λT,μ𝒫p(d).W_{p}(\mu,\pi)\leq\sqrt{\frac{2\mathop{\mathrm{KL}}\nolimits(\mu\mid\pi)}{\lambda_{{\mathrm{T}}}}},\quad\forall\mu\in{\mathcal{P}}_{p}({\mathbb{R}}^{d}). (5)

Since the Wasserstein-pp distance is increasing with respect to the order pp, then Tp{\mathrm{T}}_{p} implies T1{\mathrm{T}}_{1}. In general, it is hard to verify if the distribution satisfies the Tp{\mathrm{T}}_{p} inequality. However, one may check that T2{\mathrm{T}}_{2} is true for mm-strongly log-concave distributions with λT=m\lambda_{{\mathrm{T}}}=m. According to Bakry-Émery theory, sharp estimates on this constant are available due to perturbation arguments, such as the well-known Holley-Stroock method (see Steiner [2021]). We propose a more general form of the Tp{\mathrm{T}}_{p} inequalities which will help us to analyze the convergence of the SVGD under the (L0,L1)(L_{0},L_{1}) smoothness condition.

Definition 1 (Generalized Tp{\mathrm{T}}_{p} inequality).

Let p1p\geq 1. The distribution π\pi satisfies the generalized Tp{\mathrm{T}}_{p} inequality if there exists an increasing function J:++{\mathrm{J}}:\mathbb{R}^{+}\rightarrow\mathbb{R}^{+} such that for all μ𝒫p(d)\mu\in\mathcal{P}_{p}(\mathbb{R}^{d}), we have Wp(μ,π)J(KL(μ|π))W_{p}(\mu,\pi)\leq{\mathrm{J}}(\rm{KL}(\mu|\pi)), where WpW_{p} is the Wasserstein-pp distance.

Assumption (Tp,J)({\mathrm{T}}_{p},{\rm J}).

The target distribution π\pi satisfies the generalized Tp{\mathrm{T}}_{p} inequality for some increasing function J:++{\mathrm{J}}:\mathbb{R}^{+}\rightarrow\mathbb{R}^{+}.

If we take J(r)2r/λTJ(r)\equiv\sqrt{2r/\lambda_{{\mathrm{T}}}},for every r+r\in\mathbb{R}^{+}, then we retrieve the classical Tp{\mathrm{T}}_{p} inequality. Thus, we indeed generalize Tp{\mathrm{T}}_{p} inequality with this assumption. An important example of ˜(Tp,J)({\mathrm{T}}_{p},{\rm J}) is a consequence of Bolley and Villani [2005][Corollary 2.3]. They prove that if

dexp(xx0p)π(dx)<+, for some x0d,\int_{\mathbb{R}^{d}}\exp(\|x-x_{0}\|^{p})\pi({\mathrm{d}}x)<+\infty,\text{ for some }x_{0}\in\mathbb{R}^{d},

then ˜(Tp,J)({\mathrm{T}}_{p},{\rm J}) is satisfied for J(r)=λBV(r1/p+(r/2)1/2p){\mathrm{J}}(r)=\lambda_{BV}(r^{\nicefrac{{1}}{{p}}}+(r/2)^{\nicefrac{{1}}{{2p}}}), where λBV\lambda_{BV} is a constant that may depend on the dimension (see Section˜E.2 for details). Therefore, in this case the assumption is essentially a condition on the tails of the target distribution π\pi. In particular, for the generalized normal distribution the verification of this bound is straightforward.

The third assumption is chosen to essentially restrict ourselves for potentials with (at most) polynomial growth. Mathematically, it is expressed as follows:

Assumption (poly,Q)(\mathrm{poly},{\mathrm{Q}}).

For some p>0p>0, there exists a polynomial with positive coefficients such that ord(Q)=p\mathrm{ord}({{\mathrm{Q}}})=p and the following inequality is true:

V(x)Q(x).\big{\|}\nabla V(x)\big{\|}\leq{{\mathrm{Q}}}(\|x\|).

Using Taylor formula, one may check that LL-smooth potentials satisfy this condition with Q(r)=Lr+V(0){\mathrm{Q}}(r)=Lr+\|\nabla V(0)\|. We want highlight that the constant pp is the same for the assumptions (Tp,J)({\mathrm{T}}_{p},{\rm J}) and (poly,Q)(\mathrm{poly},{\mathrm{Q}}). This will allow us to treat the polynomials of order pp. For detailed examples of distributions that satisfy our set of assumptions please refer to Section˜4.

We conclude our list of assumptions with a bound on the kernel k(,)k(\cdot,\cdot).

Assumption (ker,B)({\rm ker},B).

There exists B>0B>0 such that k(x,.)0B\|k(x,.)\|_{\mathcal{H}_{0}}\leq B and

xk(x,.)=(i=1dxik(x,.)02)12B,\left\|\nabla_{x}k(x,.)\right\|_{\mathcal{H}}=\left(\sum_{i=1}^{d}\left\|\partial_{x_{i}}k\left(x,.\right)\right\|_{\mathcal{H}_{0}}^{2}\right)^{\frac{1}{2}}\leq B,

for all xdx\in\mathbb{R}^{d}.

Due to the reproducing property, these conditions are equivalent to k(x,x)Bk(x,x)\leq B and xi,xi2k(x,x)B\partial^{2}_{x_{i},x_{i}}k(x,x)\leq B. Here, the first and the second partial derivatives are operated, respectively, on the first and the second variables of k(,)k(\cdot,\cdot). Based on this criterion, one may check that the multiquadratic kernel k(x,y):=(c2+xy2)βk(x,y):=(c^{2}+\|x-y\|^{2})^{\beta} for some c>0>β>1c>0>\beta>-1 satisfies the ˜(ker,B)({\rm ker},B). See appendices E.1 for details on the reproductive property.

3 Main result

In this section we present our main results. We start with a proposition that bounds the difference between the value of the objective at two consecutive iterations. The proof of the proposition can be found in Section˜B.1

Proposition 1.

Suppose that Assumptions (𝑘𝑒𝑟,B)({\rm ker},B) and (L0,L1)(L_{0},L_{1}) are satisfied. Let α>1\alpha>1 and choose

γ(α1)min{1,1/L1}αBgμn.\gamma\leq\frac{(\alpha-1)\cdot\min\{1,{1}/{L_{1}}\}}{\alpha B\|g_{\mu_{n}}\|_{\mathcal{H}}}. (6)

Then

KL(μn+1π)KL(μnπ)γ[1γ2B2(α2+(e1)An)]IStein(μnπ),\mathrm{KL}\left(\mu_{n+1}\mid\pi\right)-\mathrm{KL}\left(\mu_{n}\mid\pi\right)\leq-\gamma\left[1-\frac{\gamma}{2}B^{2}\big{(}\alpha^{2}+(e-1)A_{n}\big{)}\right]I_{\rm Stein}(\mu_{n}\mid\pi), (7)

where An:=L0+L1𝔼Xμn[V(X)]A_{n}:=L_{0}+L_{1}\mathbb{E}_{X\sim\mu_{n}}[\|\nabla V(X)\|].

Proposition˜1 may be seen as a descent lemma on the KL\mathop{\mathrm{KL}}\nolimits divergence. Let us develop the condition (6). The following lemma provides us with an upper bound on gμn\|g_{\mu_{n}}\|_{{\mathcal{H}}}.

Lemma 1.

If ˜(𝑘𝑒𝑟,B)({\rm ker},B) is satisfied, then

gμn=IStein(μnπ)12B(𝔼Xμn[V(X)]+1)\|g_{\mu_{n}}\|_{{\mathcal{H}}}=I_{\rm Stein}(\mu_{n}\mid\pi)^{\frac{1}{2}}\leq B\left(\mathbb{E}_{X\sim\mu_{n}}\big{[}\|\nabla V(X)\|\big{]}+1\right)

for all nn\in\mathbb{N}.

Thus (6) may be replaced by

γ(α1)min{1,1/L1}[αB2(𝔼Xμn[V(X)]+1)]1.\gamma\leq(\alpha-1)\min\{1,{1}/{L_{1}}\}\left[\alpha B^{2}\left(\mathbb{E}_{X\sim\mu_{n}}\big{[}\|\nabla V(X)\|\big{]}+1\right)\right]^{-1}. (8)

On the other hand, we would like the right-hand side of (7) to be negative and it yields the following condition on the step-size:

γ\displaystyle\gamma <2B2(α2+(e1)An)1=2B2(α2+(e1)(L0+L1𝔼Xμn[V(X)]))1.\displaystyle<2B^{-2}\big{(}{\alpha^{2}+(e-1)A_{n}}\big{)}^{-1}={2B^{-2}\big{(}\alpha^{2}+(e-1)\big{(}L_{0}+L_{1}\mathbb{E}_{X\sim\mu_{n}}[\|\nabla V(X)\|]\big{)}\big{)}}^{-1}. (9)

Both in (8) and (9) we have dependence on 𝔼Xμn[V(X)]\mathbb{E}_{X\sim\mu_{n}}[\|\nabla V(X)\|]. The lemma below establishes an upper bound on this quantity.

Lemma 2.

Suppose that the potential function VV satisfies the Assumptions (Tp,J)({\mathrm{T}}_{p},{\rm J}) and (𝑝𝑜𝑙𝑦,Q)(\mathrm{poly},{\mathrm{Q}}) for some constants pp and a polynomial Q{\mathrm{Q}}. Then,

𝔼Xμn[V(X)]Q(J(KL(μnπ))+Wp(π,δ0)).\mathbb{E}_{X\sim\mu_{n}}[\|\nabla V(X)\|]\leq{\mathrm{Q}}\left({\mathrm{J}}\big{(}\mathop{\mathrm{KL}}\nolimits(\mu_{n}\mid\pi)\big{)}+W_{p}(\pi,\delta_{0})\right).

The important implication of Lemma˜2 is that both upper bounds on the step-size are inversely proportional to KL(μnπ)\mathop{\mathrm{KL}}\nolimits(\mu_{n}\mid\pi). Let us now state the main theorem.

Theorem 1 (Descent lemma).

Let the target distribution π\pi and its potential function VV satisfy the Assumptions (𝑘𝑒𝑟,B)({\rm ker},B), (L0,L1)(L_{0},L_{1}), (Tp,J)({\mathrm{T}}_{p},{\rm J}), and (𝑝𝑜𝑙𝑦,Q)(\mathrm{poly},{\mathrm{Q}}). Define C0:=Q(J(KL(μ0π))+Wp(π,δ0))C_{0}:={\mathrm{Q}}\left({\mathrm{J}}\big{(}\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)\big{)}+W_{p}(\pi,\delta_{0})\right) and suppose that the step-size γ\gamma satisfies

γ<α1αB2(α2+(e1)(max(L0,L1,1)+max(L1,1)C0)).\gamma<\frac{\alpha-1}{\alpha B^{2}\big{(}\alpha^{2}+(e-1)(\max(L_{0},L_{1},1)+\max(L_{1},1)C_{0})\big{)}}. (10)

Then for every n=0,1,n=0,1,\ldots the following inequality is true:

KL(μn+1π)KL(μnπ)γ2IStein(μnπ).\mathrm{KL}\left(\mu_{n+1}\mid\pi\right)-\mathrm{KL}\left(\mu_{n}\mid\pi\right)\leq-\frac{\gamma}{2}I_{\rm Stein}(\mu_{n}\mid\pi).

The proof is postponed to Section˜B.2.

Corollary 1 (Convergence).

Let the assumptions of Theorem˜1 be satisfied. Then following statements are true.

  • 1.

    IStein(μnπ)I_{\rm Stein}(\mu_{n}\mid\pi) converges to zero, when nn\rightarrow\infty.

  • 2.

    The average Stein-Fisher error of the first nn iterates converges to 0, with O(1/n)O(1/n) rate:

    1ni=0n1IStein(μiπ)2nγKL(μ0π).\frac{1}{n}\sum_{i=0}^{n-1}I_{\rm Stein}\left(\mu_{i}\mid\pi\right)\leq\frac{2}{n\gamma}{\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)}.

This yields in particular that the series with the general term IStein(μnπ)I_{\rm Stein}(\mu_{n}\mid\pi) is convergent, that is IStein(μnπ)0I_{\rm Stein}(\mu_{n}\mid\pi)\rightarrow 0. When π\pi is distant dissipative and the kk is the inverse multiquadratic kernel, then convergence in ISteinI_{\rm Stein} yields weak convergence of the sequence μn\mu_{n} to the target π\pi (see Section˜E.4). Proofs of the corollary can be found in Section˜B.3. Corollary˜1 yields that if n>KL(μ0π)/γεn>\nicefrac{{\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)}}{{\gamma\varepsilon}}, then 1ni=0n1IStein(μiπ)<ε\frac{1}{n}\sum_{i=0}^{n-1}I_{\rm Stein}\left(\mu_{i}\mid\pi\right)<\varepsilon. Based on this inequality, the next theorem estimates the complexity of the algorithm for some particular choices of J{\mathrm{J}} and μ0\mu_{0}.

Theorem 2.

Let assumptions (𝑘𝑒𝑟,B)({\rm ker},B), (L0,L1)(L_{0},L_{1}), and (𝑝𝑜𝑙𝑦,Q)(\mathrm{poly},{\mathrm{Q}}) hold and let μ0=𝒩(0,Id)\mu_{0}={\mathcal{N}}(0,{\rm I}_{d}). Then in order to have ε\varepsilon average Stein-Fisher error it is sufficient to perform nn iterations of the SVGD, where

  • n=𝒪(ε1Q(1)3max(L1,1)λBVp(pd)p+1)n={\mathcal{O}}\left({\varepsilon}^{-1}{Q(1)^{3}\max(L_{1},1)\lambda_{BV}^{p}(pd)^{p+1}}\right), if the target π\pi satisfies ˜(Tp,J)({\mathrm{T}}_{p},{\rm J}) with J(r)=λBV(r1/p+(r/2)1/2p){\mathrm{J}}(r)=\lambda_{BV}(r^{\nicefrac{{1}}{{p}}}+(r/2)^{\nicefrac{{1}}{{2p}}});

  • n=𝒪(ε1Q(1)p+22max(L1,1)λTp2(pd)(p+1)(p+2)4)n={\mathcal{O}}\left({\varepsilon}^{-1}{{\mathrm{Q}}(1)^{\frac{p+2}{2}}\max(L_{1},1)\lambda_{{\mathrm{T}}}^{-\frac{p}{2}}\left(pd\right)^{\frac{(p+1)(p+2)}{4}}}\right), if the target π\pi satisfies Talagrand’s Tp{\mathrm{T}}_{p} inequality (5) with constant λT\lambda_{{\mathrm{T}}}.

The proof is postponed to Section˜C.1. We observe that in both cases we achieve polynomial convergence of the algorithm in terms of the dimension and the precision. As mentioned in Section˜2, in the first setting, (Tp,J)({\mathrm{T}}_{p},{\rm J}) essentially becomes a tail condition, which can be easily verified for a wide class of log-polynomial densities. The Talagrand’s Tp{\mathrm{T}}_{p} is harder to verify in the general case. However, unlike λT\lambda_{{\mathrm{T}}}, the constant λBV\lambda_{BV} may be dimension dependent (see Section˜E.2). Therefore, when p2p\leq 2 Talagrand’s inequality yields better convergence. In particular, when p=1p=1, we recover the complexity 𝒪(d3/2){\mathcal{O}}(d^{3/2}) which coincides with the result from [Salim et al., 2021]. On the other hand, when pp is large, the algorithm is likely to perform significantly faster under the generalized Tp{\mathrm{T}}_{p}.

4 Examples

In this section, we discuss a few cases that our method can be applied on. As mentioned in Section˜2, the family of generalized Gaussian distributions satisfy our set of assumptions:

π(x)exp(xap2σp).\pi(x)\propto\exp\left(-\frac{\|x-a\|^{p}}{2\sigma^{p}}\right).

This family of distributions has received considerable attention from the engineering community, due to the flexible parametric form of its probability density function in modeling many physical phenomena. For instance values of p=2.2p=2.2 and p=1.6p=1.6 have been found to model the ship transit noise and the sea surface agitation noise respectively Banerjee and Agrawal [2013]. Due to its simple explicit form of the density function, many important probabilistic quantities, such as moment, entropy etc. are easy to compute for the generalized Gaussians. It is known that that this class minimizes the entropy under under a pp-th absolute moment constraint [Cover and Thomas, 2006]. An overview of the analytic properties of this family of distributions can be found in Dytso et al. [2018]. As seen previously, when p2p\geq 2, this family of distributions meets the assumptions of this work, so we can apply SVGD on them.

The second example has its roots in the Bayesian LASSO problem . The LASSO estimator (proposed by[Tibshirani, 1996]) can be seen as the 1\ell_{1}-penalized least square estimator: This can be interpreted also as a Bayesian posterior mode (also known as Bayesian LASSO), with Laplacian priors (see [Park and Casella, 2008, Fu, 1998, Mallick and Yi, 2014]).

From the sampling perspective, the Bayesian LASSO boils down to

π(x)exp(f(x)τxpp).\pi(x)\propto\exp(-f(x)-\tau\|x\|_{p}^{p}).

Here ff is the likelihood function of the prior. In [Park and Casella, 2008], the authors consider the simpler case where f(x)=Axb2f(x)=\|{Ax}-{b}\|^{2}, with A{A} is the measurement matrix, while b{b} is the vector of labels. In the case when p>1p>1 the classical sampling methods fail to provide guarantees as the potential does not satisfy the smoothness condition. When p(1,2)p\in(1,2), Chatterji et al. [2020] applied the Gaussian smoothing technique on the potential function V(x)=f(x)+τxppV(x)=f(x)+\tau\|x\|_{p}^{p}. This allowed to perform the usual Langevin Monte-Carlo algorithm on the modified potential. However, their method does not deal with the case p2p\geq 2. Moreover, they assume the strong convexity of the potential. Our analysis for the SVGD can deal with the case when p2p\geq 2. In particular, we do not need to put any restriction on the matrix A{A}.

5 Conclusion and Future work

In this work, we studied the convergence of the SVGD under a relaxed smoothness condition. The latter was first proposed in the context of gradient clipping for the optimization problem. This relaxed smoothness condition allows to treat distributions with polynomial potentials. Main result consists of a descent lemma for the KL\mathop{\mathrm{KL}}\nolimits error of the SVGD algorithm. The result implies polynomial convergence for the average Stein-Fisher information error for two types of functional inequalities. However, at the moment this setting is of purely theoretical interest as the algorithm operates on the space of probability distributions. Despite numerous applications, the discretized algorithm does not have practical convergence guarantees in the general case. This problem remains open for future work.

References

  • Ahn and Chewi [2021] K. Ahn and S. Chewi. Efficient constrained sampling via the mirror-langevin algorithm. Advances in Neural Information Processing Systems, 34, 2021.
  • Amann [2011] H. Amann. Ordinary differential equations: an introduction to nonlinear analysis, volume 13. Walter de gruyter, 2011.
  • Banerjee and Agrawal [2013] S. Banerjee and M. Agrawal. Underwater acoustic noise with generalized Gaussian statistics: Effects on error performance. In 2013 MTS/IEEE OCEANS-Bergen, pages 1–8. IEEE, 2013.
  • Bhattacharya [1978] R. Bhattacharya. Criteria for recurrence and existence of invariant measures for multidimensional diffusions. The Annals of Probability, pages 541–553, 1978.
  • Bolley and Villani [2005] F. Bolley and C. Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 14, pages 331–352, 2005.
  • Brosse et al. [2019] N. Brosse, A. Durmus, É. Moulines, and S. Sabanis. The tamed unadjusted Langevin algorithm. Stochastic Processes and their Applications, 129(10):3638–3663, 2019.
  • Cai et al. [2021] X. Cai, J. D. McEwen, and M. Pereyra. High-dimensional Bayesian model selection by proximal nested sampling. arXiv preprint arXiv:2106.03646, 2021.
  • Chatterji et al. [2020] N. Chatterji, J. Diakonikolas, M. I. Jordan, and P. Bartlett. Langevin Monte Carlo without smoothness. In S. Chiappa and R. Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1716–1726. PMLR, 26–28 Aug 2020. URL https://proceedings.mlr.press/v108/chatterji20a.html.
  • Chatterji et al. [2018] N. S. Chatterji, N. Flammarion, Y.-A. Ma, P. L. Bartlett, and M. I. Jordan. On the theory of variance reduction for stochastic gradient Monte Carlo. arXiv preprint arXiv:1802.05431, 2018.
  • Cheng and Bartlett [2018] X. Cheng and P. L. Bartlett. Convergence of Langevin MCMC in KL-divergence. PMLR 83, (83):186–211, 2018.
  • Cheng et al. [2018] X. Cheng, N. S. Chatterji, Y. Abbasi-Yadkori, P. L. Bartlett, and M. I. Jordan. Sharp convergence rates for Langevin dynamics in the nonconvex setting. arXiv preprint arXiv:1805.01648, 2018.
  • Chewi et al. [2020] S. Chewi, T. Le Gouic, C. Lu, T. Maunu, P. Rigollet, and A. Stromme. Exponential ergodicity of mirror-Langevin diffusions. Advances in Neural Information Processing Systems, 33:19573–19585, 2020.
  • Chewi et al. [2021] S. Chewi, M. A. Erdogdu, M. B. Li, R. Shen, and M. Zhang. Analysis of Langevin Monte Carlo from Poincaré to Log-Sobolev. arXiv preprint arXiv:2112.12662, 2021.
  • Cover and Thomas [2006] T. M. Cover and J. A. Thomas. Elements of information theory. Wiley series in telecommunications and signal processing, 2006.
  • Dalalyan [2017a] A. S. Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Conference on Learning Theory (COLT), pages 678–689, 2017a.
  • Dalalyan [2017b] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017b.
  • Dalalyan and Karagulyan [2019] A. S. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
  • Dalalyan et al. [2019] A. S. Dalalyan, L. Riou-Durand, and A. Karagulyan. Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets. 2019.
  • Detommaso et al. [2018] G. Detommaso, T. Cui, Y. Marzouk, A. Spantini, and R. Scheichl. A Stein variational Newton method. Advances in Neural Information Processing Systems, 31, 2018.
  • Duncan et al. [2019] A. Duncan, N. Nüsken, and L. Szpruch. On the geometry of Stein variational gradient descent. arXiv preprint arXiv:1912.00894, 2019.
  • Durmus and Moulines [2017] A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
  • Durmus and Moulines [2019] A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
  • Durmus et al. [2018] A. Durmus, E. Moulines, and M. Pereyra. Efficient Bayesian computation by proximal Markov Chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
  • Durmus et al. [2019] A. Durmus, S. Majewski, and B. Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
  • Dwivedi et al. [2018] R. Dwivedi, Y. Chen, M. J. Wainwright, and B. Yu. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Conference on Learning Theory, pages 793–797. PMLR, 2018.
  • Dytso et al. [2018] A. Dytso, R. Bustin, H. V. Poor, and S. Shamai. Analytical properties of generalized Gaussian distributions. Journal of Statistical Distributions and Applications, 5(1):1–40, 2018.
  • Erdogdu and Hosseinzadeh [2021] M. A. Erdogdu and R. Hosseinzadeh. On the convergence of Langevin Monte Carlo: the interplay between tail growth and smoothness. In Conference on Learning Theory, pages 1776–1822. PMLR, 2021.
  • Feroz and Skilling [2013] F. Feroz and J. Skilling. Exploring multi-modal distributions with nested sampling. In AIP Conference Proceedings, volume 1553, pages 106–113. American Institute of Physics, 2013.
  • Fu [1998] W. J. Fu. Penalized regressions: the bridge versus the LASSO. Journal of computational and graphical statistics, 7(3):397–416, 1998.
  • Gorham and Mackey [2017] J. Gorham and L. Mackey. Measuring sample quality with kernels. In International Conference on Machine Learning (ICML), pages 1292–1301, 2017.
  • Grenander and Miller [1994] U. Grenander and M. I. Miller. Representations of knowledge in complex systems. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):549–581, 1994.
  • Hutzenthaler et al. [2012] M. Hutzenthaler, A. Jentzen, and P. E. Kloeden. Strong convergence of an explicit numerical method for SDEs with non-globally Lipschitz continuous coefficients. The Annals of Applied Probability, 22(4):1611–1641, 2012.
  • Karagulyan and Dalalyan [2020] A. Karagulyan and A. Dalalyan. Penalized Langevin dynamics with vanishing penalty for smooth and log-concave targets. Advances in Neural Information Processing Systems, 33, 2020.
  • Kassab and Simeone [2022] R. Kassab and O. Simeone. Federated generalized Bayesian learning via distributed Stein variational gradient descent. IEEE Transactions on Signal Processing, 2022.
  • Korba et al. [2020] A. Korba, A. Salim, M. Arbel, G. Luise, and A. Gretton. A non-asymptotic analysis for Stein variational gradient descent. In Advances in Neural Information Processing Systems (NeurIPS), pages 4672–4682, 2020.
  • Lehec [2021] J. Lehec. The Langevin Monte Carlo algorithm in the non-smooth log-concave case. arXiv preprint arXiv:2101.10695, 2021.
  • Li et al. [2020] L. Li, Y. Li, J.-G. Liu, Z. Liu, and J. Lu. A stochastic version of Stein variational gradient descent for efficient sampling. Communications in Applied Mathematics and Computational Science, 15(1):37–63, 2020.
  • Liu [2017] Q. Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems (NIPS), pages 3115–3123, 2017.
  • Liu and Wang [2016] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (NIPS), pages 2378–2386, 2016.
  • Liu et al. [2016] Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning (ICML), pages 276–284, 2016.
  • Liu et al. [2017] Y. Liu, P. Ramachandran, Q. Liu, and J. Peng. Stein variational policy gradient. arXiv preprint arXiv:1704.02399, 2017.
  • Lu et al. [2019] J. Lu, Y. Lu, and J. Nolen. Scaling limit of the Stein variational gradient descent: The mean field regime. SIAM Journal on Mathematical Analysis, 51(2):648–671, 2019.
  • Ma et al. [2019] Y.-A. Ma, N. Chatterji, X. Cheng, N. Flammarion, P. L. Bartlett, and M. I. Jordan. Is there an analog of Nesterov acceleration for MCMC? arXiv preprint arXiv:1902.00996, 2019.
  • Mallick and Yi [2014] H. Mallick and N. Yi. A new Bayesian lasso. Statistics and its interface, 7(4):571–582, 2014.
  • Neal [1992] R. Neal. Bayesian learning via stochastic dynamics. Advances in neural information processing systems, 5, 1992.
  • Nüsken and Renger [2021] N. Nüsken and D. Renger. Stein variational gradient descent: many-particle and long-time asymptotics. arXiv preprint arXiv:2102.12956, 2021.
  • Park and Casella [2008] T. Park and G. Casella. The Bayesian LASSO. Journal of the American Statistical Association, 103(482):681–686, 2008.
  • Pu et al. [2017] Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin. VAE learning via Stein variational gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 4236–4245, 2017.
  • Raginsky et al. [2017] M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory (COLT), pages 1674–1703, 2017.
  • Robert and Casella [1999] C. P. Robert and G. Casella. Monte Carlo statistical methods, volume 2. Springer, 1999.
  • Roberts and Rosenthal [1998] G. O. Roberts and J. S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
  • Roberts and Stramer [2002] G. O. Roberts and O. Stramer. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and computing in applied probability, 4(4):337–357, 2002.
  • Roberts and Tweedie [1996] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
  • Sabanis [2013] S. Sabanis. A note on tamed Euler approximations. Electronic Communications in Probability, 18:1–10, 2013.
  • Salim et al. [2021] A. Salim, L. Sun, and P. Richtárik. Complexity analysis of Stein variational gradient descent under Talagrand’s inequality T1. arXiv preprint arXiv:2106.03076, 2021.
  • Shen and Lee [2019] R. Shen and Y. T. Lee. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems (NeurIPS), pages 2100–2111, 2019.
  • Şimşekli [2017] U. Şimşekli. Fractional Langevin Monte Carlo: Exploring Lévy driven stochastic differential equations for Markov chain Monte Carlo. In International Conference on Machine Learning, pages 3200–3209. PMLR, 2017.
  • Stein [1972] C. Stein. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 2: Probability theory, volume 6, pages 583–603. University of California Press, 1972.
  • Steiner [2021] C. Steiner. A Feynman-Kac approach for logarithmic Sobolev inequalities. Electronic Journal of Probability, 26:1–19, 2021.
  • Steinwart and Christmann [2008] I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008.
  • Tao et al. [2019] C. Tao, S. Dai, L. Chen, K. Bai, J. Chen, C. Liu, R. Zhang, G. Bobashev, and L. Carin. Variational annealing of GANs: A Langevin perspective. In International Conference on Machine Learning (ICML), pages 6176–6185, 2019.
  • Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • Vempala and Wibisono [2019] S. Vempala and A. Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. In Advances in Neural Information Processing Systems (NeurIPS), pages 8092–8104, 2019.
  • Wang et al. [2019] D. Wang, Z. Tang, C. Bajaj, and Q. Liu. Stein variational gradient descent with matrix-valued kernels. Advances in neural information processing systems, 32, 2019.
  • Wibisono [2018] A. Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory (COLT), pages 2093–3027, 2018.
  • Wibisono [2019] A. Wibisono. Proximal Langevin algorithm: Rapid convergence under isoperimetry. arXiv preprint arXiv:1911.01469, 2019.
  • Winkelbauer [2012] A. Winkelbauer. Moments and absolute moments of the normal distribution. arXiv preprint arXiv:1209.4340, 2012.
  • Zhang et al. [2019a] J. Zhang, T. He, S. Sra, and A. Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2019a.
  • Zhang et al. [2018] R. Zhang, C. Li, C. Chen, and L. Carin. Learning structural weight uncertainty for sequential decision-making. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1137–1146, 2018.
  • Zhang et al. [2019b] R. Zhang, Z. Wen, C. Chen, C. Fang, T. Yu, and L. Carin. Scalable Thompson sampling via optimal transport. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 87–96, 2019b.
  • Zhou et al. [2020] Q. Zhou, T. Yu, X. Zhang, and J. Li. Bayesian inference and uncertainty quantification for medical image reconstruction with Poisson data. SIAM Journal on Imaging Sciences, 13(1):29–52, 2020.

Appendix

Appendix A Langevin Monte-Carlo from the optimization perspective

Langevin algorithms have extensively been studied by statistics and machine learning communities during the last decades. The algorithms rely on an SDE called (vanilla) Langevin dynamics:

dXt=V(Xt)dt+2dBt,dX_{t}=-\nabla V(X_{t})dt+\sqrt{2}dB_{t},

where BtB_{t} is a dd-dimensional Brownian motion on d\mathbb{R}^{d}. The important property of this SDE is that under technical conditions the target π()exp(V())\pi(\cdot)\propto\exp(-V(\cdot)) is its unique invariant distribution. Moreover, it is ergodic and KL(μt,π)\mathop{\mathrm{KL}}\nolimits(\mu_{t},\pi) converges linearly to zero, where μt\mu_{t} is the distribution of XtX_{t} [Bhattacharya, 1978]. The Euler-Maruyama discretization of this SDE over the time axis results in the Langevin Monte-Carlo (LMC) algorithm:

θi+1=θγV(θi)+2hξi+1, for i.\theta_{i+1}=\theta-\gamma\nabla V(\theta_{i})+\sqrt{2h}\xi_{i+1},\text{ for }i\in\mathbb{N}. (11)

Here γ>0\gamma>0 is the discretization step-size and ξ1,ξ2,\xi_{1},\xi_{2},\ldots is a sequence of independent standard Gaussians independent from the initial point θ0\theta_{0}. This algorithm initially was studied by Roberts and Tweedie [1996], Roberts and Rosenthal [1998], Roberts and Stramer [2002], Dwivedi et al. [2018] who proposed to apply Metropolis-Hastings step (MALA). The reason for this adjustment is that the constant step-size γ\gamma induces a bias and the target distribution π\pi is no longer the invariant distribution of the discrete-time process. Later, Dalalyan [2017b] suggested to remove the Metropolis-Hastings step with a result that essentially controls the bias depending on the step-size γ\gamma. This instigated a new line of research which studied the convergence properties of the LMC in different settings (see Durmus and Moulines [2017], Cheng et al. [2018], Cheng and Bartlett [2018], Dalalyan and Karagulyan [2019], Durmus and Moulines [2019], Vempala and Wibisono [2019]).

The LMC algorithm essentially performs a Forward-Flow iteration in the space of distributions (see [Wibisono, 2018, Durmus et al., 2019]). Indeed, if we define by μi\mu_{i} the distribution of the ii-th iterate θi\theta_{i}, then (11) is equivalent to

μi+1/2=(IγV(μi))#μi\displaystyle\mu_{i+\nicefrac{{1}}{{2}}}=\big{(}I-\gamma\nabla V(\mu_{i})\big{)}{\#}\mu_{i}
μi+1=𝒩(0,Id)μi+1/2,\displaystyle\mu_{i+1}=\mathcal{N}(0,{\rm I}_{d})\hskip 2.84544pt\textasteriskcentered\hskip 2.84544pt\mu_{i+\nicefrac{{1}}{{2}}},

where the #\# is the pushforward measure, while \textasteriskcentered is the convolution. The first equation corresponds to the gradient descent for 𝔼Xμ[V(X)]\mathbb{E}_{X\sim\mu}[V(X)] where the argument is the measure μ\mu. The second equation can be interpreted as the exact gradient flow of the negative entropy. The combination of this two steps results in a biased algorithm (as noticed in [Roberts and Tweedie, 1996]) since the flow step is not the adjoint of the gradient step.

Appendix B Convergence results

B.1 Proofs of the Proposition˜1

For the sake of brevity we will omit the index when referring to gμng_{\mu_{n}} and will write simply gg. Let us define ϕt:=Itg\phi_{t}:=I-tg and ρt:=ϕt#μn\rho_{t}:={\phi_{t}}_{\#}\mu_{n} for every t[0,γ]t\in[0,\gamma]. Then, applying Taylor formula to the function

φ(t):=KL(ρtπ)\varphi(t):=\mathrm{KL}\left(\rho_{t}\mid\pi\right) (12)

we have the following

φ(γ)=φ(0)+γφ(0)+0γ(γt)φ′′(t)dt.\varphi(\gamma)=\varphi(0)+\gamma\varphi^{\prime}(0)+\int_{0}^{\gamma}(\gamma-t)\varphi^{\prime\prime}(t){{\mathrm{d}}}t. (13)

By the definition of the SVGD iteration, we have that φ(0)=KL(μnπ)\varphi(0)={\mathop{\mathrm{KL}}\nolimits}\left(\mu_{n}\mid\pi\right) and φ(γ)=KL(μn+1π)\varphi(\gamma)={\mathop{\mathrm{KL}}\nolimits}\left(\mu_{n+1}\mid\pi\right). Let us now compute the term of (13) corresponding to the first order derivative.

Lemma 3.

Suppose that Assumption (𝑘𝑒𝑟,B)({\rm ker},B) holds. Then, for any xdx\in\mathbb{R}^{d} and hh\in\mathcal{H},

Jh(x)HSBh.\|Jh(x)\|_{{\rm HS}}\leq B\|h\|_{\mathcal{H}}.

The proof of the lemma can be found in Section˜D.3. Applying the lemma to the function gg, we obtain the following:

tJg(x)optJg(x)HStBg<1.\|tJg(x)\|_{op}\leq\|tJg(x)\|_{{\rm HS}}\leq tB\|g\|_{\mathcal{H}}<1. (14)

The latter inequality is due to the condition on the step-size γ\gamma. The bound (14) implies that ϕt\phi_{t} is a diffeomorphism. Therefore, ρt\rho_{t} admits a density given by the change of variables formula:

ρt(x)=|Jϕt(ϕt1(x))|1μn(ϕt1(x)).\rho_{t}(x)=\left|J\phi_{t}\left(\phi_{t}^{-1}(x)\right)\right|^{-1}\mu_{n}\left(\phi_{t}^{-1}(x)\right).

Changing the variable of integration and applying the transfer lemma we get the following formula for φ(t)\varphi(t):

φ(t)\displaystyle\varphi(t) =log(ρt(y)π(y))ρt(dy)\displaystyle=\int\log\left(\frac{\rho_{t}(y)}{\pi(y)}\right)\rho_{t}({\mathrm{d}}y)
=log(μn(x)|Jϕt(x)|1π(ϕt(x)))μn(dx)\displaystyle=\int\log\left(\frac{\mu_{n}(x)\left|J\phi_{t}(x)\right|^{-1}}{\pi\left(\phi_{t}(x)\right)}\right)\mu_{n}({\mathrm{d}}x)
=[log(μn(x))+log(|Jϕt(x)|1)log(π(ϕt(x)))]μn(dx).\displaystyle=\int\left[\log\left(\mu_{n}(x)\right)+\log\left(\left|J\phi_{t}(x)\right|^{-1}\right)-\log\left({\pi\left(\phi_{t}(x)\right)}\right)\right]\mu_{n}({\mathrm{d}}x).

Let us then compute the time derivative of φ(t)\varphi(t). Taking the derivative inside and applying Jacobi’s formula for matrix determinant differentiation we obtain the following equality:

φ(t)=tr((Jϕt(x))1dJϕt(x)dt)μn(dx)logπ(ϕt(x)),dϕt(x)dtμn(dx).\varphi^{\prime}(t)=-\int\operatorname{tr}\left((J\phi_{t}(x))^{-1}\frac{dJ\phi_{t}(x)}{dt}\right)\mu_{n}({\mathrm{d}}x)-\int\left\langle\nabla\log\pi\left(\phi_{t}(x)\right),\frac{d\phi_{t}(x)}{dt}\right\rangle\mu_{n}({\mathrm{d}}x).

By definition, dϕt/dt=gd\phi_{t}/dt=g. Therefore, we can use the explicit expression of ϕt\phi_{t} to write:

φ(t)=tr((Jϕt(x))1Jg(x))μn(dx)+V(ϕt(x)),g(x)μn(dx).\varphi^{\prime}(t)=\int\operatorname{tr}\left((J\phi_{t}(x))^{-1}Jg(x)\right)\mu_{n}({\mathrm{d}}x)+\int\left\langle\nabla V\left(\phi_{t}(x)\right),g(x)\right\rangle\mu_{n}({\mathrm{d}}x). (15)

The Jacobian at time t=0t=0 is simply equal to the identity matrix since ϕ0=Id\phi_{0}={\rm I}_{d}. It follows that tr((Jϕ0(x))1Jg(x))=\operatorname{tr}\left((J\phi_{0}(x))^{-1}Jg(x)\right)= tr(Jg(x))=div(g)(x)\operatorname{tr}(Jg(x))=\operatorname{div}(g)(x) by the definition of the divergence operator. Using integration by parts:

φ(0)\displaystyle\varphi^{\prime}(0) =[div(g)(x)logπ(x),g(x)]μn(dx)\displaystyle=-\int[-\operatorname{div}(g)(x)-\langle\nabla\log\pi(x),g(x)\rangle]\mu_{n}({\mathrm{d}}x)
=log(μnπ)(x),g(x)μn(dx).\displaystyle=-\int\left\langle\nabla\log\left(\frac{\mu_{n}}{\pi}\right)(x),g(x)\right\rangle\mu_{n}({\mathrm{d}}x).

Based on the alternative definition of gμg_{\mu} (see Remark˜1) and the reproducing property, we have

log(μnπ)(x)\displaystyle\int\Big{\langle}\nabla\log\left(\frac{\mu_{n}}{\pi}\right)(x) ,g(x)μn(dx)\displaystyle,g(x)\Big{\rangle}\mu_{n}({\mathrm{d}}x)
=k(x,y)log(μnπ)(x),log(μnπ)(y)μn(dx)μn(dy)\displaystyle=\iint k(x,y)\left\langle\nabla\log\left(\frac{\mu_{n}}{\pi}\right)(x),\nabla\log\left(\frac{\mu_{n}}{\pi}\right)(y)\right\rangle\mu_{n}({\mathrm{d}}x)\mu_{n}({\mathrm{d}}y)
=k(x,),k(y,)0log(μnπ)(x),log(μnπ)(y)μn(dx)μn(dy)\displaystyle=\iint\langle k(x,\cdot),k(y,\cdot)\rangle_{{\mathcal{H}}_{0}}\left\langle\nabla\log\left(\frac{\mu_{n}}{\pi}\right)(x),\nabla\log\left(\frac{\mu_{n}}{\pi}\right)(y)\right\rangle\mu_{n}({\mathrm{d}}x)\mu_{n}({\mathrm{d}}y)
=log(μnπ)(x)k(x,)μn(dx),log(μnπ)(y)k(y,)μn(dy)\displaystyle=\left\langle\int\nabla\log\left(\frac{\mu_{n}}{\pi}\right)(x)k(x,\cdot)\mu_{n}({\mathrm{d}}x),\int\nabla\log\left(\frac{\mu_{n}}{\pi}\right)(y)k(y,\cdot)\mu_{n}({\mathrm{d}}y)\right\rangle_{{\mathcal{H}}}
=g2.\displaystyle=\left\|g\right\|_{{\mathcal{H}}}^{2}.

Therefore,

φ(0)=g2.\varphi^{\prime}(0)=-\left\|g\right\|_{{\mathcal{H}}}^{2}. (16)

Next, we calculate the term of (13) that contains the second derivative using (15). First,

ddttr((Jϕt(x))1Jg(x))\displaystyle\frac{d}{dt}\operatorname{tr}\left((J\phi_{t}(x))^{-1}Jg(x)\right) =tr(ddt(Jϕt(x))1Jg(x))\displaystyle=\operatorname{tr}\left(\frac{d}{dt}(J\phi_{t}(x))^{-1}Jg(x)\right)
=tr((Jϕt(x))2ddtJϕt(x)Jg(x))\displaystyle=-\operatorname{tr}\left((J\phi_{t}(x))^{-2}\frac{d}{dt}J\phi_{t}(x)Jg(x)\right)
=tr(Jϕt(x)2Jg(x)Jg(x))\displaystyle=\operatorname{tr}\left(J\phi_{t}(x)^{-2}Jg(x)Jg(x)\right)
=tr((Jϕt(x)1Jg(x))2).\displaystyle=\operatorname{tr}\left(\left(J\phi_{t}(x)^{-1}Jg(x)\right)^{2}\right).

From the definition of the function ϕt\phi_{t}, we know that Jϕt(x)=(Id+tJg)(x)J\phi_{t}(x)=({\rm I}_{d}+tJg)(x). Thus, (Jϕt)1(J\phi_{t})^{-1} and JgJg commute. The latter yields tr((Jg(x)(Jϕt(x))1)2)=Jg(x)(Jϕt(x))1HS2\operatorname{tr}\left(\left(Jg(x)\left(J\phi_{t}(x)\right)^{-1}\right)^{2}\right)=\left\|Jg(x)\left(J\phi_{t}(x)\right)^{-1}\right\|_{\rm HS}^{2}. On the other hand, applying the chain rule on second term of (15) yields

t(V(ϕt(x)),g(x))=g(x),2V(ϕt(x))g(x).\partial_{t}\big{(}\left\langle\nabla V\left(\phi_{t}(x)\right),g(x)\right\rangle\big{)}=\left\langle g(x),{\nabla^{2}V}\left(\phi_{t}(x)\right)g(x)\right\rangle.

Summing up, we have the following:

φ′′(t)=Jg(x)(Jϕt(x))1HS2μn(dx):=ψ1(t)+g(x),2V(ϕt(x))g(x)μn(dx):=ψ2(t).\varphi^{\prime\prime}(t)=\underbrace{\int\left\|Jg(x)\left(J\phi_{t}(x)\right)^{-1}\right\|_{\rm HS}^{2}\mu_{n}({\mathrm{d}}x)}_{:=\psi_{1}(t)}+\underbrace{\int\left\langle g(x),{\nabla^{2}V}\left(\phi_{t}(x)\right)g(x)\right\rangle\mu_{n}({\mathrm{d}}x)}_{:=\psi_{2}(t)}.

First, we bound ψ1(t)\psi_{1}(t). Cauchy-Schwarz implies that

Jg(x)(Jϕt(x))1HS2Jg(x)HS2(Jϕt(x))1op2.\left\|Jg(x)\left(J\phi_{t}(x)\right)^{-1}\right\|_{\rm HS}^{2}\leq\|Jg(x)\|_{\rm HS}^{2}\left\|\left(J\phi_{t}(x)\right)^{-1}\right\|_{op}^{2}.

From Lemma˜3, we have Jg(x)HSBg\|Jg(x)\|_{\rm HS}\leq B\|g\|_{{\mathcal{H}}}. To bound the second term, let us recall that ϕt=Itg\phi_{t}=I-tg and that tγt\leq\gamma. Thus, the following bound is true:

(Jϕt(x))1op\displaystyle\left\|\left(J\phi_{t}(x)\right)^{-1}\right\|_{op} =((IdtJg)(x))1opi=0tJg(x)opii=0γJg(x)HSi.\displaystyle=\left\|\big{(}({\rm I}_{d}-tJg)(x)\big{)}^{-1}\right\|_{op}\leq\sum_{i=0}^{\infty}\|tJg(x)\|_{op}^{i}\leq\sum_{i=0}^{\infty}\|\gamma Jg(x)\|_{\rm HS}^{i}.

Recalling (6) and combining it with Lemma˜3 we obtain

(Jϕt(x))1opi=0(γBg)ii=0(α1α)i=α.\displaystyle\left\|\left(J\phi_{t}(x)\right)^{-1}\right\|_{op}\leq\sum_{i=0}^{\infty}(\gamma B\|g\|_{{\mathcal{H}}})^{i}\leq\sum_{i=0}^{\infty}\left(\frac{\alpha-1}{\alpha}\right)^{i}=\alpha.

Summing up, we have that

ψ1(t)α2B2g2.\psi_{1}(t)\leq\alpha^{2}B^{2}\|g\|_{{\mathcal{H}}}^{2}.

Next, we bound ψ2(t)\psi_{2}(t). By definition,

ψ2(t)\displaystyle\psi_{2}(t) =𝔼Xμn[g(X),2V(ϕt(X))g(X)]𝔼Xμn[2V(ϕt(X))opg(X)22].\displaystyle=\mathbb{E}_{X\sim\mu_{n}}\left[\left\langle g(X),{\nabla^{2}V}\left(\phi_{t}(X)\right)g(X)\right\rangle\right]\leq\mathbb{E}_{X\sim\mu_{n}}\left[\|{\nabla^{2}V}\left(\phi_{t}(X)\right)\|_{op}\|g(X)\|_{2}^{2}\right].

Let us bound the norm of g(x)g(x). The reproduction property of the RKHS yields the following:

g(x)22=i=1dk(x,.),gi02k(x,.)02g2B2g2.\|g(x)\|_{2}^{2}=\sum_{i=1}^{d}\left\langle k(x,.),g_{i}\right\rangle_{\mathcal{H}_{0}}^{2}\leq\|k(x,.)\|_{\mathcal{H}_{0}}^{2}\|g\|_{\mathcal{H}}^{2}\leq B^{2}\|g\|_{\mathcal{H}}^{2}. (17)

Therefore,

ψ2(t)B2g2𝔼Xμn[2V(ϕt(X))op].\psi_{2}(t)\leq B^{2}\|g\|_{\mathcal{H}}^{2}\mathbb{E}_{X\sim\mu_{n}}\left[\|{\nabla^{2}V}\left(\phi_{t}(X)\right)\|_{op}\right]. (18)

Let us bound 𝔼Xμn[2V(ϕt(X))op]\mathbb{E}_{X\sim\mu_{n}}\left[\|{\nabla^{2}V}\left(\phi_{t}(X)\right)\|_{op}\right]. ˜(L0,L1)(L_{0},L_{1}) implies the following inequality:

2V(ϕt(x))opL0+L1V(ϕt(x)),\left\|\nabla^{2}V(\phi_{t}(x))\right\|_{op}\leq L_{0}+L_{1}\|\nabla V(\phi_{t}(x))\|,

for every xdx\in\mathbb{R}^{d}. To bound the term V(ϕt(x))\|\nabla V(\phi_{t}(x))\|, we introduce the following lemma.

Lemma 4.

Let VV be an (L0,L1)\left(L_{0},L_{1}\right)-smooth function and Δ>0\Delta>0 be a constant. For any x,x+dx,x^{+}\in\mathbb{R}^{d} such that x+xΔ\left\|x^{+}-x\right\|\leq\Delta, we have

V(x+)L0L1(exp(ΔL1)1)+V(x)exp(ΔL1).\left\|\nabla V\left(x^{+}\right)\right\|\leq\frac{L_{0}}{L_{1}}\left(\exp(\Delta L_{1})-1\right)+\|\nabla V(x)\|\exp(\Delta L_{1}).

We will apply Lemma˜4 to ϕt(x)\phi_{t}(x) and ϕ0(x)\phi_{0}(x). By definition, ϕt(x)ϕ0(x)=tg(x)\phi_{t}(x)-\phi_{0}(x)=tg(x) and according to inequality (17),

ϕt(x)ϕ0(x)2tBg.\|\phi_{t}(x)-\phi_{0}(x)\|_{2}\leq tB\|g\|_{{\mathcal{H}}}.

Thus, using Lemma˜4 for x=ϕ0(x)x=\phi_{0}(x), x+=ϕt(x)x^{+}=\phi_{t}(x) and Δ=tBg\Delta=tB\|g\|_{{\mathcal{H}}}, we obtain the following:

2V(ϕt(x))op\displaystyle\left\|\nabla^{2}V(\phi_{t}(x))\right\|_{op} L0+L1(L0L1(exp(tBgL1)1)+V(ϕ0(x))exp(tBgL1))\displaystyle\leq L_{0}+L_{1}\left(\frac{L_{0}}{L_{1}}\big{(}\exp(tB\|g\|_{{\mathcal{H}}}L_{1})-1\big{)}+\left\|\nabla V\left(\phi_{0}(x)\right)\right\|\exp(tB\|g\|_{{\mathcal{H}}}L_{1})\right) (19)
=(L0+L1V(x))exp(tBL1g),\displaystyle=\big{(}L_{0}+L_{1}\left\|\nabla V\left(x\right)\right\|\big{)}\exp(tBL_{1}\|g\|_{{\mathcal{H}}}),

Combining (18) and (19) we obtain

ψ2(t)B2g2(L0+L1𝔼Xμn[V(X)])exp(tBL1g).\psi_{2}(t)\leq B^{2}\|g\|_{{\mathcal{H}}}^{2}\left(L_{0}+L_{1}\mathbb{E}_{X\sim\mu_{n}}\big{[}\left\|\nabla V(X)\right\|\big{]}\right)\exp(tBL_{1}\|g\|_{{\mathcal{H}}}).

Summing up, the bounds on ψ1\psi_{1} and ψ2\psi_{2} yield the following inequality:

φ′′(t)B2g2[α2+(L0+L1𝔼Xμn[V(X)])exp(tBL1g)].\varphi^{\prime\prime}(t)\leq B^{2}\|g\|_{{\mathcal{H}}}^{2}\Big{[}\alpha^{2}+\Big{(}L_{0}+L_{1}\mathbb{E}_{X\sim\mu_{n}}\big{[}\left\|\nabla V(X)\right\|\big{]}\Big{)}\exp(tBL_{1}\|g\|_{{\mathcal{H}}})\Big{]}.

Recall that by definition An=(L0+L1𝔼Xμn[V(X)]).A_{n}=\big{(}L_{0}+L_{1}\mathbb{E}_{X\sim\mu_{n}}\big{[}\left\|\nabla V(X)\right\|\big{]}\big{)}. Inserting the previous inequality along with (16) to (13), we get the following bound:

φ(γ)φ(0)\displaystyle\varphi(\gamma)-\varphi(0) γg2+[12γ2α2B2g2+An(exp(γBL1g)γBL1g1)].\displaystyle\leq-\gamma\|g\|_{{\mathcal{H}}}^{2}+\Big{[}\frac{1}{2}\gamma^{2}\alpha^{2}B^{2}\|g\|_{{\mathcal{H}}}^{2}+A_{n}\big{(}\exp(\gamma BL_{1}\|g\|_{{\mathcal{H}}})-\gamma BL_{1}\|g\|_{{\mathcal{H}}}-1\big{)}\Big{]}.

One can check that exp(t)t1(e1)t2/2\exp(t)-t-1\leq(e-1)t^{2}/2, when t[0,1]t\in[0,1]. Since γBL1g<1\gamma BL_{1}\|g\|_{{\mathcal{H}}}<1, we deduce

φ(γ)φ(0)[γ+12(γ2α2B2+(e1)Anγ2B2)]g2.\varphi(\gamma)-\varphi(0)\leq\Big{[}-\gamma+\frac{1}{2}(\gamma^{2}\alpha^{2}B^{2}+(e-1)A_{n}\gamma^{2}B^{2})\Big{]}\|g\|_{{\mathcal{H}}}^{2}.

Finally, by the definition of Stein’s information g2=IStein(μnπ)\|g\|_{{\mathcal{H}}}^{2}=I_{\rm Stein}(\mu_{n}\mid\pi). This concludes the proof.

Remark 2.

In fact, the derivation of (16) contains the proof of the fact that g=IStein(μnπ)\|g\|_{{\mathcal{H}}}=\sqrt{I_{\rm Stein}(\mu_{n}\mid\pi)}. Indeed, by the definition of the SVGD (Section˜2.2) the direction function gg is chosen to minimize the descent the most. On the other hand, the Stein-Fisher information is the maximum of the Stein’s linear operator which coincides with our objective. Thus, the absolute value of the objective function at gg, that is |ϕ(0)||\phi^{\prime}(0)| equals IStein(μnπ)\sqrt{I_{\rm Stein}(\mu_{n}\mid\pi)}.

B.2 Proof of Theorem˜1

Proof.

Let us first prove that the sequence KL(μnπ)\mathop{\mathrm{KL}}\nolimits(\mu_{n}\mid\pi) is monotonically decreasing. We will use the method of mathematical induction. First let us notice that (10) implies the following system of inequalities for the step-size:

{γ(α1)min{1,1/L1}[αB2(C0+1)]1;γB2[α2+(e1)(L0+L1C0)]1.\begin{cases}\gamma&\leq{(\alpha-1)}\min\{1,{1}/{L_{1}}\}\left[{\alpha B^{2}\left(C_{0}+1\right)}\right]^{-1};\\ \gamma&\leq B^{-2}\Big{[}\alpha^{2}+(e-1)\Big{(}L_{0}+L_{1}C_{0}\Big{)}\Big{]}^{-1}.\end{cases} (20)

For n=0n=0 the first equation of the system (20) combined with Lemma˜2 implies that the step-size condition is satisfied. Thus, according to Proposition˜1,

KL(μ1π)KL(μ0π)γ[1γ2B2(α2+(e1)(L0+L1C0))]IStein(μ0π).\mathrm{KL}\left(\mu_{1}\mid\pi\right)-\mathrm{KL}\left(\mu_{0}\mid\pi\right)\leq-\gamma\left[1-\frac{\gamma}{2}B^{2}(\alpha^{2}+(e-1)(L_{0}+L_{1}C_{0}))\right]I_{\rm Stein}(\mu_{0}\mid\pi).

On the other hand, from the second inequality of (20) we get that

1γ2B2(α2+(e1)(L0+L1C0))12.1-\frac{\gamma}{2}B^{2}(\alpha^{2}+(e-1)(L_{0}+L_{1}C_{0}))\geq\frac{1}{2}.

Therefore, KL(μ1π)KL(μ0π)\mathrm{KL}\left(\mu_{1}\mid\pi\right)\leq\mathrm{KL}\left(\mu_{0}\mid\pi\right). Let us define by Cn:=Q(J(KL(μnπ))+Wp(π,δ0))C_{n}:={\mathrm{Q}}\left({\mathrm{J}}\big{(}\mathop{\mathrm{KL}}\nolimits(\mu_{n}\mid\pi)\big{)}+W_{p}(\pi,\delta_{0})\right). Since Q{\mathrm{Q}} and J{\mathrm{J}} are monotonically increasing for positive arguments, we obtain C1C0C_{1}\leq C_{0}. Therefore,

{γ(α1)min{1,1/L1}[αB2(C1+1)]1;γB2[α2+(e1)(L0+L1C1)]1.\begin{cases}\gamma&\leq{(\alpha-1)}\min\{1,{1}/{L_{1}}\}\left[{\alpha B^{2}\left(C_{1}+1\right)}\right]^{-1};\\ \gamma&\leq B^{-2}\Big{[}\alpha^{2}+(e-1)\big{(}L_{0}+L_{1}C_{1}\big{)}\Big{]}^{-1}.\end{cases}

We retrieve (20) where the term C0C_{0} is replaced by C1C_{1}. Thus, we can repeat the previous arguments for μ1\mu_{1} and μ2\mu_{2}. Similarly, we can iterate till nn. Therefore we obtain the following descent bound:

KL(μn+1π)KL(μnπ)\displaystyle\mathrm{KL}\left(\mu_{n+1}\mid\pi\right)-\mathrm{KL}\left(\mu_{n}\mid\pi\right) γ[1γ2B2(α2+(e1)(L0+L1Cn))]IStein(μnπ).\displaystyle\leq-\gamma\left[1-\frac{\gamma}{2}B^{2}(\alpha^{2}+(e-1)(L_{0}+L_{1}C_{n}))\right]I_{\rm Stein}(\mu_{n}\mid\pi).

This also yields that the sequence KL(μn)\mathop{\mathrm{KL}}\nolimits(\mu_{n}) is decreasing. Since J{\mathrm{J}} and P{\mathrm{P}} are monotonically increasing for positive arguments, CnC_{n} is also decreasing. Thus,

KL(μn+1π)KL(μnπ)\displaystyle\mathrm{KL}\left(\mu_{n+1}\mid\pi\right)-\mathrm{KL}\left(\mu_{n}\mid\pi\right) γ[1γ2B2(α2+(e1)(L0+L1C0))]IStein(μnπ)\displaystyle\leq-\gamma\left[1-\frac{\gamma}{2}B^{2}(\alpha^{2}+(e-1)(L_{0}+L_{1}C_{0}))\right]I_{\rm Stein}(\mu_{n}\mid\pi)
γ2IStein(μnπ).\displaystyle\leq-\frac{\gamma}{2}I_{\rm Stein}(\mu_{n}\mid\pi).

The last inequality is due to the second condition on the step-size γ\gamma. ∎

B.3 Proof of Corollary˜1

Let us start with the first statement. Summing the descent bounds of Theorem˜1 for i=0,1,,n1i=0,1,\ldots,n-1, we obtain the following:

KL(μnπ)KL(μ0π)γ2i=0n1IStein(μiπ).\mathrm{KL}\left(\mu_{n}\mid\pi\right)-\mathrm{KL}\left(\mu_{0}\mid\pi\right)\leq-\frac{\gamma}{2}\sum_{i=0}^{n-1}I_{\rm Stein}(\mu_{i}\mid\pi).

Rearranging the terms we get

i=0n1IStein(μiπ)\displaystyle\sum_{i=0}^{n-1}I_{\rm Stein}(\mu_{i}\mid\pi) 2γ(KL(μ0π)KL(μnπ))\displaystyle\leq\frac{2}{\gamma}(\mathrm{KL}\left(\mu_{0}\mid\pi\right)-\mathrm{KL}\left(\mu_{n}\mid\pi\right))
2γKL(μ0π).\displaystyle\leq\frac{2}{\gamma}\mathrm{KL}\left(\mu_{0}\mid\pi\right).

This means that the series on the left-hand side is convergent and thus, its general term converges to zero. Thus the first point is proved. Dividing both sides of the previous inequality on nn, we deduce the second statement.

Appendix C Complexity results

C.1 Proof of Theorem˜2

Proof.

From Corollary˜1, we know that if (10) is satisfied, then

1nk=0n1IStein(μkπ)2KL(μ0π)nγ.\frac{1}{n}\sum_{k=0}^{n-1}I_{\rm Stein}\left(\mu_{k}\mid\pi\right)\leq\frac{2\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)}{n\gamma}.

Let us bound the initial KL\mathop{\mathrm{KL}}\nolimits error.

Lemma 5.

Let ˜(𝑝𝑜𝑙𝑦,Q)(\mathrm{poly},{\mathrm{Q}}) hold with some polynomial Q{\mathrm{Q}} and μ0=𝒩(0,Id)\mu_{0}={\mathcal{N}}(0,{\rm I}_{d}). We then have

KL(μ0π)d2log(12Πe)+V(0)+Q(1)d2Π+Q(1)(2d)p+12Γ(p+22)Π(p+1),\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)\leq\frac{d}{2}\log\left(\frac{1}{2{\Pi}e}\right)+V(0)+{{\mathrm{Q}}(1)d}\sqrt{\frac{2}{\Pi}}+\frac{{\mathrm{Q}}(1)(2d)^{\frac{p+1}{2}}\Gamma\big{(}\frac{p+2}{2}\big{)}}{\sqrt{\Pi}(p+1)},

where Π\Pi is the area of the circle of radius 11.

Since p1p\geq 1, Lemma˜5 implies that

KL(μ0π)\displaystyle\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi) =𝒪(Q(1)(2d)p+12Γ(p+22)Π(p+1))\displaystyle={\mathcal{O}}\left(\frac{{\mathrm{Q}}(1)(2d)^{\frac{p+1}{2}}\Gamma\big{(}\frac{p+2}{2}\big{)}}{\sqrt{\Pi}(p+1)}\right) (21)
=𝒪(Q(1)(pd)p+12).\displaystyle={\mathcal{O}}\left({\mathrm{Q}}(1)\left({pd}\right)^{\frac{p+1}{2}}\right).

The second equality is due to Stirling’s formula: Γ(r+1)=𝒪(2Πr(r/e)r)\Gamma(r+1)={\mathcal{O}}(\sqrt{2\Pi r}(r/e)^{r}), for every r>0r>0. In order to estimate the order of the step-size, let us compute the order of C0C_{0}. By definition,

C0\displaystyle C_{0} =Q(J(KL(μ0π))+Wp(π,δ0))\displaystyle={\mathrm{Q}}\left({\mathrm{J}}\big{(}\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)\big{)}+W_{p}(\pi,\delta_{0})\right)
Q(J(KL(μ0π))+Wp(π,μ0)+Wp(μ0,δ0))\displaystyle\leq{\mathrm{Q}}\left({\mathrm{J}}\big{(}\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)\big{)}+W_{p}(\pi,\mu_{0})+W_{p}(\mu_{0},\delta_{0})\right)
Q(2J(KL(μ0π))+Wp(μ0,δ0)).\displaystyle\leq{\mathrm{Q}}\left(2{\mathrm{J}}\big{(}\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)\big{)}+W_{p}(\mu_{0},\delta_{0})\right).

Here we applied the triangle inequality of WpW_{p} and the ˜(Tp,J)({\mathrm{T}}_{p},{\rm J}).

Let us now consider the first point of the theorem. Using Q(r)Q(1)(rp+1)Q(r)\leq Q(1)(r^{p}+1) and J(r)=λBV(r1/p+(r/2)1/2p)=𝒪(λBVr1/p){\mathrm{J}}(r)=\lambda_{BV}(r^{\nicefrac{{1}}{{p}}}+(r/2)^{\nicefrac{{1}}{{2p}}})={\mathcal{O}}(\lambda_{BV}r^{\nicefrac{{1}}{{p}}}), we obtain

C0\displaystyle C_{0} Q(1)((2J(KL(μ0π))+Wp(μ0,δ0))p+1)\displaystyle\leq{\mathrm{Q}}(1)\cdot\left(\left(2{\mathrm{J}}\big{(}\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)\big{)}+W_{p}(\mu_{0},\delta_{0})\right)^{p}+1\right) (22)
=𝒪(Q(1)((2λBVKL(μ0π)1/p+Wp(μ0,δ0))p+1))\displaystyle={\mathcal{O}}\Big{(}{\mathrm{Q}}(1)\cdot\left(\left(2\lambda_{BV}\mathop{\mathrm{KL}}\nolimits\big{(}\mu_{0}\mid\pi\big{)}^{\nicefrac{{1}}{{p}}}+W_{p}(\mu_{0},\delta_{0})\right)^{p}+1\right)\Big{)}
=𝒪(Q(1)((4λBV)pKL(μ0π)+2pWpp(μ0,δ0)+1)).\displaystyle={\mathcal{O}}\Big{(}{\mathrm{Q}}(1)\cdot\left((4\lambda_{BV})^{p}\mathop{\mathrm{KL}}\nolimits\big{(}\mu_{0}\mid\pi\big{)}+2^{p}W_{p}^{p}(\mu_{0},\delta_{0})+1\right)\Big{)}.

Applying (21) we get

C0\displaystyle C_{0} =𝒪(Q(1)((4λBV)p𝒪(Q(1)(pd)p+12)+2pWpp(μ0,δ0)))\displaystyle={\mathcal{O}}\left({\mathrm{Q}}(1)\cdot\left((4\lambda_{BV})^{p}{\mathcal{O}}\left({\mathrm{Q}}(1)\left({pd}\right)^{\frac{p+1}{2}}\right)+2^{p}W_{p}^{p}(\mu_{0},\delta_{0})\right)\right) (23)
=𝒪(Q(1)2(4λBV)p(pd)p+12+Q(1)2pWpp(μ0,δ0)).\displaystyle={\mathcal{O}}\left({\mathrm{Q}}(1)^{2}(4\lambda_{BV})^{p}\left(pd\right)^{\frac{p+1}{2}}+{\mathrm{Q}}(1)2^{p}W_{p}^{p}(\mu_{0},\delta_{0})\right).

The following lemma is to bound the (p+1)(p+1)-th norm of the standard multivariate Gaussian.

Lemma 6.

Let μ0\mu_{0} be the standard multivariate Gaussian defined on d\mathbb{R}^{d}. Then, for every integer m2m\geq 2

dxmμ0(dx)(2d)m2ΠΓ(m+12).\int_{\mathbb{R}^{d}}\|{x}\|^{m}\mu_{0}({\mathrm{d}}x)\leq\frac{(2d)^{\frac{m}{2}}}{\sqrt{\Pi}}\Gamma\left(\frac{m+1}{2}\right).

If we write down the definition of the Wasserstein distance, then Lemma˜6 for m=pm=p yields

Wpp(μ0,δ0)=dxpμ0(dx)(2d)p2Γ(p+12)Π=𝒪((pd)p2).W_{p}^{p}(\mu_{0},\delta_{0})=\int_{\mathbb{R}^{d}}\|x\|^{p}\mu_{0}({\mathrm{d}}x)\leq\frac{(2d)^{\frac{p}{2}}\Gamma\left(\frac{p+1}{2}\right)}{\sqrt{\Pi}}={\mathcal{O}}\left(\left({pd}\right)^{\frac{p}{2}}\right). (24)

This implies

C0\displaystyle C_{0} =𝒪(Q(1)2(4λBV)p(pd)p+12+Q(1)2p(pd)p2)\displaystyle={\mathcal{O}}\left({\mathrm{Q}}(1)^{2}(4\lambda_{BV})^{p}\left({pd}\right)^{\frac{p+1}{2}}+{\mathrm{Q}}(1)2^{p}\left({pd}\right)^{\frac{p}{2}}\right)
=𝒪(Q(1)2(4λBV)p(pd)p+12).\displaystyle={\mathcal{O}}\left({\mathrm{Q}}(1)^{2}(4\lambda_{BV})^{p}\left({pd}\right)^{\frac{p+1}{2}}\right).

Now let us look back at (10). This inequality yields that Corollary˜1 is true for

γ\displaystyle\gamma =𝒪({B2(α2+(e1)(L0+L1Q(1)2(4λBV)p(pd)p+12))}1)\displaystyle={\mathcal{O}}\left(\left\{B^{2}(\alpha^{2}+(e-1)(L_{0}+L_{1}{\mathrm{Q}}(1)^{2}(4\lambda_{BV})^{p}\left({pd}\right)^{\frac{p+1}{2}}))\right\}^{-1}\right)
=𝒪({B2max(L1,1)Q(1)2λBVp(pd)p+12}1).\displaystyle={\mathcal{O}}\left(\left\{B^{2}\max(L_{1},1){\mathrm{Q}}(1)^{2}\lambda_{BV}^{p}\left({pd}\right)^{\frac{p+1}{2}}\right\}^{-1}\right).

This implies the following equality and concludes the proof of the first point:

n=KL(μ0π)2γε\displaystyle n=\frac{\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)}{2\gamma\varepsilon} =𝒪(1ε[Q(1)(pd)p+12][B2(α2+(e1)(L0+L1Q(1)2(4λBV)p(pd)p+12))])\displaystyle={\mathcal{O}}\left(\frac{1}{\varepsilon}\left[{Q(1)\left({pd}\right)^{\frac{p+1}{2}}}\right]\left[{B^{2}(\alpha^{2}+(e-1)(L_{0}+L_{1}{\mathrm{Q}}(1)^{2}(4\lambda_{BV})^{p}\left({pd}\right)^{\frac{p+1}{2}}))}\right]\right)
=𝒪(B2max(L1,1)Q(1)3λBVp(pd)p+1ε).\displaystyle={\mathcal{O}}\left(B^{2}\max(L_{1},1)\frac{Q(1)^{3}\lambda_{BV}^{p}(pd)^{p+1}}{\varepsilon}\right).

For the second point, we will do the same analysis. In the case of Tp{\mathrm{T}}_{p} inequality, the target satisfies ˜(Tp,J)({\mathrm{T}}_{p},{\rm J}) with J(r)=2r/λT{\mathrm{J}}(r)=\sqrt{2r/\lambda_{{\mathrm{T}}}}, for some constant λT>0\lambda_{{\mathrm{T}}}>0. Then, similar to (22), we obtain the following complexity for C0C_{0}:

C0\displaystyle C_{0} Q(1)((2J(KL(μ0π))+Wp(μ0,δ0))p+1)\displaystyle\leq{\mathrm{Q}}(1)\cdot\left(\left(2{\mathrm{J}}\big{(}\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)\big{)}+W_{p}(\mu_{0},\delta_{0})\right)^{p}+1\right)
=Q(1)((22KL(μ0π)/λT+Wp(μ0,δ0))p+1)\displaystyle={\mathrm{Q}}(1)\cdot\left(\left(2\sqrt{2\mathop{\mathrm{KL}}\nolimits\big{(}\mu_{0}\mid\pi\big{)}/\lambda_{{\mathrm{T}}}}+W_{p}(\mu_{0},\delta_{0})\right)^{p}+1\right)
Q(1)((32/λT)p2KL(μ0π)p2+2pWpp(μ0,δ0)+1).\displaystyle\leq{\mathrm{Q}}(1)\cdot\left((32/\lambda_{{\mathrm{T}}})^{\frac{p}{2}}\mathop{\mathrm{KL}}\nolimits\big{(}\mu_{0}\mid\pi\big{)}^{\frac{p}{2}}+2^{p}W_{p}^{p}(\mu_{0},\delta_{0})+1\right).

Applying (21)

C0\displaystyle C_{0} =𝒪(Q(1)((32/λT)p2𝒪(Q(1)p2(pd)p(p+1)4)+2pWpp(μ0,δ0)))\displaystyle={\mathcal{O}}\left({\mathrm{Q}}(1)\cdot\left((32/\lambda_{{\mathrm{T}}})^{\frac{p}{2}}{\mathcal{O}}\left({\mathrm{Q}}(1)^{\frac{p}{2}}\left({pd}\right)^{\frac{p(p+1)}{4}}\right)+2^{p}W_{p}^{p}(\mu_{0},\delta_{0})\right)\right)
=𝒪(Q(1)p+22(32/λT)p2(pd)p(p+1)4+Q(1)2pWpp(μ0,δ0)).\displaystyle={\mathcal{O}}\left({\mathrm{Q}}(1)^{\frac{p+2}{2}}(32/\lambda_{{\mathrm{T}}})^{\frac{p}{2}}\left(pd\right)^{\frac{p(p+1)}{4}}+{\mathrm{Q}}(1)2^{p}W_{p}^{p}(\mu_{0},\delta_{0})\right).

Applying (24), we conclude

C0=𝒪(Q(1)p+22λTp2(pd)p(p+1)4).C_{0}={\mathcal{O}}\left({\mathrm{Q}}(1)^{\frac{p+2}{2}}\lambda_{{\mathrm{T}}}^{-\frac{p}{2}}\left(pd\right)^{\frac{p(p+1)}{4}}\right).

Thus, in order to have ε\varepsilon average ISteinI_{\rm Stein} error it is sufficient to perform nn iterations, where

n=KL(μ0π)γε\displaystyle n=\frac{\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi)}{\gamma\varepsilon} =𝒪((pd)p+12εB2(α2+(e1)(max(L0,L1,1)+max(L1,1)C0)))\displaystyle={\mathcal{O}}\left(\frac{\left({pd}\right)^{\frac{p+1}{2}}}{\varepsilon}\cdot B^{2}\big{(}\alpha^{2}+(e-1)(\max(L_{0},L_{1},1)+\max(L_{1},1)C_{0})\big{)}\right)
=𝒪(B2max(L1,1)Q(1)p+22λTp2(pd)(p+1)(p+2)4ε).\displaystyle={\mathcal{O}}\left(B^{2}\max(L_{1},1)\frac{{\mathrm{Q}}(1)^{\frac{p+2}{2}}\lambda_{{\mathrm{T}}}^{-\frac{p}{2}}\left(pd\right)^{\frac{(p+1)(p+2)}{4}}}{\varepsilon}\right).

This concludes the proof.

Appendix D Proofs of the lemmas

D.1 Proof of Lemma˜1

Proof.

Denote Φ(x):=k(x,)\Phi(x):=k(x,\cdot)\in\mathcal{H}. Then by the definition of the Stein discrepancy we have the following:

IStein(μnπ)12\displaystyle I_{\rm Stein}(\mu_{n}\mid\pi)^{\frac{1}{2}} =𝔼Xμn[(V(X)Φ(X)Φ(X))]\displaystyle=\left\|\mathbb{E}_{X\sim\mu_{n}}\big{[}(\nabla V(X)\Phi(X)-\nabla\Phi(X))\big{]}\right\|_{\mathcal{H}}
𝔼Xμn[V(X)Φ(X)Φ(X)].\displaystyle\leq\mathbb{E}_{X\sim\mu_{n}}\big{[}\|\nabla V(X)\Phi(X)-\nabla\Phi(X)\big{]}\|_{\mathcal{H}}.

Applying the triangle inequality and Cauchy-Schwartz inequality, we obtain

IStein(μnπ)12\displaystyle I_{\rm Stein}(\mu_{n}\mid\pi)^{\frac{1}{2}} 𝔼Xμn[V(X)Φ(X)]+𝔼Xμn[Φ(X)]\displaystyle\leq\mathbb{E}_{X\sim\mu_{n}}\big{[}\|\nabla V(X)\Phi(X)\|_{\mathcal{H}}\big{]}+\mathbb{E}_{X\sim\mu_{n}}\big{[}\|\nabla\Phi(X)\|_{\mathcal{H}}\big{]}
=𝔼Xμn[V(X)Φ(X)]+𝔼Xμn[Φ(X)]\displaystyle=\mathbb{E}_{X\sim\mu_{n}}\big{[}\|\nabla V(X)\|\|\Phi(X)\|_{\mathcal{H}}\big{]}+\mathbb{E}_{X\sim\mu_{n}}\big{[}\|\nabla\Phi(X)\|_{\mathcal{H}}\big{]}
B(𝔼Xμn[V(X)]+1).\displaystyle\leq B\left(\mathbb{E}_{X\sim\mu_{n}}\big{[}\|\nabla V(X)\|\big{]}+1\right).

D.2 Proof of Lemma˜2

Let the polynomial Q{\mathrm{Q}} have the following explicit form:

Q(r)=i=0mairpi{\mathrm{Q}}(r)=\sum_{i=0}^{m}a_{i}r^{p_{i}} (25)

for every rr\in\mathbb{R}. Here p=p0>p1>>pmp=p_{0}>p_{1}>\ldots>p_{m} and ai>0a_{i}>0. Then ˜(poly,Q)(\mathrm{poly},{\mathrm{Q}}) yields

𝔼Xμn[V(X)]\displaystyle\mathbb{E}_{X\sim\mu_{n}}\big{[}\|\nabla V(X)\|\big{]} 𝔼Xμn[P(X)]\displaystyle\leq\mathbb{E}_{X\sim\mu_{n}}\big{[}{\mathrm{P}}(\|X\|)\big{]}
=𝔼Xμn[i=0maiXpi]\displaystyle=\mathbb{E}_{X\sim\mu_{n}}\left[\sum_{i=0}^{m}a_{i}\|X\|^{p_{i}}\right]
=i=0maiWpi(μn,δ0)pi,\displaystyle=\sum_{i=0}^{m}a_{i}W_{p_{i}}\big{(}\mu_{n},\delta_{0}\big{)}^{p_{i}},

where δ0\delta_{0} is Dirac measure on d\mathbb{R}^{d} at point 0. Using the fact that WrW_{r} is monotonically increasing w.r.t. to rr, we have the following

𝔼Xμn[V(X)]\displaystyle\mathbb{E}_{X\sim\mu_{n}}\big{[}\|\nabla V(X)\|\big{]} i=0maiWp(μn,δ0)pi\displaystyle\leq\sum_{i=0}^{m}a_{i}W_{p}\big{(}\mu_{n},\delta_{0}\big{)}^{p_{i}}
i=1mai(Wp(μn,π)+Wp(π,δ0))pi\displaystyle\leq\sum_{i=1}^{m}a_{i}\big{(}W_{p}(\mu_{n},\pi)+W_{p}(\pi,\delta_{0})\big{)}^{p_{i}}
i=1mai(J(KL(μnπ))+Wp(π,δ0))pi.\displaystyle\leq\sum_{i=1}^{m}a_{i}\left({\mathrm{J}}\big{(}\mathop{\mathrm{KL}}\nolimits(\mu_{n}\mid\pi)\big{)}+W_{p}(\pi,\delta_{0})\right)^{p_{i}}.
Q(J(KL(μnπ))+Wp(π,δ0)).\displaystyle\leq{\mathrm{Q}}\left({\mathrm{J}}\big{(}\mathop{\mathrm{KL}}\nolimits(\mu_{n}\mid\pi)\big{)}+W_{p}(\pi,\delta_{0})\right).

The third inequality is due to ˜(Tp,J)({\mathrm{T}}_{p},{\rm J}). Recalling (25) we conclude the proof.

D.3 Proof of Lemma˜3

The proof is based on the reproducing property and Cauchy-Schwarz inequality in the RKHS space. Indeed,

Jh(x)HS2\displaystyle\|Jh(x)\|_{\rm HS}^{2} =i,j=1d|hi(x)xj|2\displaystyle=\sum_{i,j=1}^{d}\left|\frac{\partial h_{i}(x)}{\partial x_{j}}\right|^{2}
=i,j=1dxjk(x,.),hi0\displaystyle=\sum_{i,j=1}^{d}\left\langle\partial_{x_{j}}k(x,.),h_{i}\right\rangle_{\mathcal{H}_{0}}
i,j=1dxjk(x,.)02hi02\displaystyle\leq\sum_{i,j=1}^{d}\left\|\partial_{x_{j}}k(x,.)\right\|_{\mathcal{H}_{0}}^{2}\left\|h_{i}\right\|_{\mathcal{H}_{0}}^{2}
=k(x,.)2h2\displaystyle=\|\nabla k(x,.)\|_{\mathcal{H}}^{2}\left\|h\right\|_{\mathcal{H}}^{2}
B2h2.\displaystyle\leq B^{2}\left\|h\right\|_{\mathcal{H}}^{2}.

This concludes the proof.

D.4 Proof of Lemma˜4

Proof.

Let us fix x,x+dx,x^{+}\in\mathbb{R}^{d} and let τ(t)\tau(t) be defined as τ(t)=t(x+x)+x\tau(t)=t\left(x^{+}-x\right)+x with t[0,1]t\in[0,1]. Then we have

V(τ(t))=0t2V(τ(u))(x+x)dτ+V(τ(0)).\nabla V(\tau(t))=\int_{0}^{t}\nabla^{2}V(\tau(u))\left(x^{+}-x\right)\mathrm{d}\tau+\nabla V(\tau(0)).

Let us bound the norm of V(τ(t))\nabla V(\tau(t)). Applying triangle and Cauchy-Schwartz inequalities we obtain:

V(τ(t))\displaystyle\|\nabla V(\tau(t))\| 0t2V(τ(u))(x+x)du+V(τ(0))\displaystyle\leq\int_{0}^{t}\left\|\nabla^{2}V(\tau(u))\left(x^{+}-x\right)\right\|\mathrm{d}u+\|\nabla V(\tau(0))\|
x+x0t2V(τ(u))du+V(x).\displaystyle\leq\left\|x^{+}-x\right\|\int_{0}^{t}\left\|\nabla^{2}V(\tau(u))\right\|\mathrm{d}u+\|\nabla V(x)\|.

˜(L0,L1)(L_{0},L_{1}) yields

V(τ(t))\displaystyle\|\nabla V(\tau(t))\| Δ0t(L0+L1V(τ(u)))du+V(x)\displaystyle\leq\Delta\int_{0}^{t}\left(L_{0}+L_{1}\|\nabla V(\tau(u))\|\right)\mathrm{d}u+\|\nabla V(x)\|
=ΔL0t+V(x)+0tΔL1V(τ(u))du.\displaystyle=\Delta L_{0}t+\|\nabla V(x)\|+\int_{0}^{t}\Delta L_{1}\big{\|}\nabla V(\tau(u))\big{\|}\mathrm{d}u.

Applying Grönwall’s integral inequality (see Lemma˜8) to the function V(τ())\|\nabla V(\tau(\cdot))\| we obtain

V(τ(t))\displaystyle\|\nabla V(\tau(t))\| =ΔL0t+V(x)+0tΔL1(ΔL0u+V(x))exp(ΔL1(tu))du.\displaystyle=\Delta L_{0}t+\|\nabla V(x)\|+\int_{0}^{t}\Delta L_{1}\big{(}\Delta L_{0}u+\|\nabla V(x)\|\big{)}\exp\big{(}\Delta L_{1}(t-u)\big{)}\mathrm{d}u.

Inserting t=1t=1, we get the following:

V(x+)\displaystyle\|\nabla V(x^{+})\| ΔL0+V(x)+01ΔL1(ΔL0u+V(x))exp(ΔL1(1u))du\displaystyle\leq\Delta L_{0}+\|\nabla V(x)\|+\int_{0}^{1}\Delta L_{1}\big{(}\Delta L_{0}u+\|\nabla V(x)\|\big{)}\exp\big{(}\Delta L_{1}(1-u)\big{)}\mathrm{d}u
=L0L1+(L0L1+V(x))exp(ΔL1).\displaystyle=-\frac{L_{0}}{L_{1}}+\left(\frac{L_{0}}{L_{1}}+\|\nabla V(x)\|\right)\exp(\Delta L_{1}).

This concludes the proof.∎

The rest of the section contains the proofs of the lemmas appearing in the proof of Theorem˜2. Without loss of generality, we may assume that the normalizing constant of the density of π\pi is equal to 1. Thus, π(x)=eV(x)\pi(x)=e^{-V(x)} and the following lemma is true.

D.5 Proof of Lemma˜5

Proof.

By ˜(poly,Q)(\mathrm{poly},{\mathrm{Q}}), we know that V(x)Q(x)\big{\|}\nabla V(x)\big{\|}\leq{\mathrm{Q}}(\|x\|). Thus, applying Taylor formula

V(x)\displaystyle V(x) =01V(tx),xdt+V(0)\displaystyle=\int_{0}^{1}\left<\nabla V(tx),x\right>{\mathrm{d}}t+V(0)
01V(tx)xdt+V(0)\displaystyle\leq\int_{0}^{1}\left\|\nabla V(tx)\right\|\|x\|{\mathrm{d}}t+V(0)
01Q(tx)xdt+V(0).\displaystyle\leq\int_{0}^{1}{\mathrm{Q}}(\|tx\|)\|x\|{\mathrm{d}}t+V(0).

Since the coefficients of Q{\mathrm{Q}} are positive, one may verify that for every r0r\geq 0

Q(r)Q(1)(rp+1).{\mathrm{Q}}(r)\leq{\mathrm{Q}}(1)(r^{p}+1).

Therefore,

V(x)\displaystyle V(x) 01Q(1)(txp+1)xdt+V(0)\displaystyle\leq\int_{0}^{1}{\mathrm{Q}}(1)\left(\left\|tx\right\|^{p}+1\right)\|x\|{\mathrm{d}}t+V(0)
Q(1)(xp+1p+1+x)+V(0).\displaystyle\leq{\mathrm{Q}}(1)\left(\frac{\|x\|^{p+1}}{p+1}+\|x\|\right)+V(0).

Now, let us calculate KL(μ0π)\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi),

KL(μ0π)\displaystyle\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi) =dlog(μ0π(x))μ0(dx)\displaystyle=\int_{\mathbb{R}^{d}}\log\left(\frac{\mu_{0}}{\pi}(x)\right)\mu_{0}({\mathrm{d}}x)
=dlog(μ0(x))μ0(x)dx+dV(x)μ0(dx)\displaystyle=\int_{\mathbb{R}^{d}}\log(\mu_{0}(x))\mu_{0}(x){\mathrm{d}}x+\int_{\mathbb{R}^{d}}V(x)\mu_{0}({\mathrm{d}}x)
d2log(12Πe)+d[Q(1)(xp+1p+1+x)+V(0)]μ0(dx).\displaystyle\leq\frac{d}{2}\log\left(\frac{1}{2\Pi e}\right)+\int_{\mathbb{R}^{d}}\left[{\mathrm{Q}}(1)\left(\frac{\|x\|^{p+1}}{p+1}+\|x\|\right)+V(0)\right]\mu_{0}({\mathrm{d}}x).

Combining Lemma˜6 with the inequality below

x=(i=1d|xi|2)12(i=1d|xi|)212=i=1d|xi|,\|x\|=\left(\sum_{i=1}^{d}|x_{i}|^{2}\right)^{\frac{1}{2}}\leq\left(\sum_{i=1}^{d}|x_{i}|\right)^{2\cdot\frac{1}{2}}=\sum_{i=1}^{d}|x_{i}|,

we obtain the following:

KL(μ0π)\displaystyle\mathop{\mathrm{KL}}\nolimits(\mu_{0}\mid\pi) d2log(12Πe)+V(0)+Q(1)p+1(2d)p+12ΠΓ(p+22)+Q(1)di=1d|xi|μ0(dx)\displaystyle\leq\frac{d}{2}\log\left(\frac{1}{2{\Pi}e}\right)+V(0)+\frac{{\mathrm{Q}}(1)}{p+1}\cdot\frac{(2d)^{\frac{p+1}{2}}}{\sqrt{\Pi}}\Gamma\left(\frac{p+2}{2}\right)+{\mathrm{Q}}(1)\int_{\mathbb{R}^{d}}\sum_{i=1}^{d}|x_{i}|\mu_{0}({\mathrm{d}}x)
=d2log(12Πe)+V(0)+Q(1)p+1(2d)p+12ΠΓ(p+22)+Q(1)d|r|12Πer22dr\displaystyle=\frac{d}{2}\log\left(\frac{1}{2{\Pi}e}\right)+V(0)+\frac{{\mathrm{Q}}(1)}{p+1}\cdot\frac{(2d)^{\frac{p+1}{2}}}{\sqrt{\Pi}}\Gamma\left(\frac{p+2}{2}\right)+{\mathrm{Q}}(1)d\int_{\mathbb{R}}|r|\frac{1}{\sqrt{2\Pi}}e^{-\frac{r^{2}}{2}}{\mathrm{d}}r
=d2log(12Πe)+V(0)+Q(1)(2d)p+12Γ(p+22)Π(p+1)+Q(1)d2Π.\displaystyle=\frac{d}{2}\log\left(\frac{1}{2{\Pi}e}\right)+V(0)+\frac{{\mathrm{Q}}(1)(2d)^{\frac{p+1}{2}}\Gamma\left(\frac{p+2}{2}\right)}{\sqrt{\Pi}(p+1)}+{{\mathrm{Q}}(1)d}\sqrt{\frac{2}{\Pi}}.

This concludes the proof. ∎

D.6 Proof of Lemma˜6

Proof.

By Jensen inequality we have

xm=(i=1d|xi|2)m2=dm2(1di=1d|xi|2)m2dm22i=1d|xi|m.\|{x}\|^{m}=\left(\sum_{i=1}^{d}|x_{i}|^{2}\right)^{\frac{m}{2}}=d^{\frac{m}{2}}\left(\frac{1}{d}\sum_{i=1}^{d}|x_{i}|^{2}\right)^{\frac{m}{2}}\leq d^{\frac{m-2}{2}}\sum_{i=1}^{d}|x_{i}|^{m}.

Thus, we obtain

dxmμ0(dx)\displaystyle\int_{\mathbb{R}^{d}}\|{x}\|^{m}\mu_{0}({\mathrm{d}}x) 1(2Π)d/2ddm22i=1d|xi|mexp(x22)dx\displaystyle\leq\frac{1}{(2\Pi)^{\nicefrac{{d}}{{2}}}}\int_{\mathbb{R}^{d}}d^{\frac{m-2}{2}}\sum_{i=1}^{d}|x_{i}|^{m}\exp\left(-\frac{\|x\|^{2}}{2}\right){\mathrm{d}}x
dm22(2Π)d/2i=1dd|xi|mexp(x22)dx\displaystyle\leq\frac{d^{\frac{m-2}{2}}}{(2\Pi)^{\nicefrac{{d}}{{2}}}}\sum_{i=1}^{d}\int_{\mathbb{R}^{d}}|x_{i}|^{m}\exp\left(-\frac{\|x\|^{2}}{2}\right){\mathrm{d}}x
dm222Πi=1d|xi|mexp(xi22)dxi.\displaystyle\leq\frac{d^{\frac{m-2}{2}}}{\sqrt{2\Pi}}\sum_{i=1}^{d}\int_{\mathbb{R}}|x_{i}|^{m}\exp\left(-\frac{x_{i}^{2}}{2}\right){\mathrm{d}}x_{i}.

From Winkelbauer [2012], we know the mm-th central absolute moment of the standard one-dimensional Gaussian is equal to 2m/2Γ((m+1)/2)/Π{2^{\nicefrac{{m}}{{2}}}\Gamma(\nicefrac{{(m+1)}}{{2}})}/{\sqrt{\Pi}}. Thus, we obtain

dxmμ0(x)dx\displaystyle\int_{\mathbb{R}^{d}}\|{x}\|^{m}\mu_{0}(x){\mathrm{d}}x dm222Πi=1d|xi|mexp(xi22)dxi\displaystyle\leq\frac{d^{\frac{m-2}{2}}}{\sqrt{2\Pi}}\sum_{i=1}^{d}\int_{\mathbb{R}}|x_{i}|^{m}\exp\left(-\frac{x_{i}^{2}}{2}\right){\mathrm{d}}x_{i}
dm22Π|r|mexp(r22)dr\displaystyle\leq\frac{d^{\frac{m}{2}}}{\sqrt{2\Pi}}\int_{\mathbb{R}}|r|^{m}\exp\left(-\frac{r^{2}}{2}\right){\mathrm{d}}r
(2d)m2ΠΓ(m+12).\displaystyle\leq\frac{(2d)^{\frac{m}{2}}}{\sqrt{\Pi}}\Gamma\left(\frac{m+1}{2}\right).

Appendix E Miscellaneous

In this section, we present previously known auxiliary results that were mentioned in the paper. Some of the results are proved, the others refer to their papers of origin.

E.1 Reproducing property

Here we remind you of the reproducing property: let f0f\in{\mathcal{H}}_{0}, then we have f(x)=f(),k(x,)0f(x)=\langle f(\cdot),k(x,\cdot)\rangle_{{\mathcal{H}}_{0}}, if we choose f()=k(y,)0f(\cdot)=k(y,\cdot)\in{\mathcal{H}}_{0}, then k(y,x)=k(y,),k(x,)0k(y,x)=\langle k(y,\cdot),k(x,\cdot)\rangle_{{\mathcal{H}}_{0}}.

Let us first use the reproducing property to calculate k(x,)0\left\|k(x,\cdot)\right\|_{{\mathcal{H}}_{0}}. The definition of the norm in Hilbert spaces yields:

k(x,)02=k(x,),k(x,)0=k(x,x).\left\|k(x,\cdot)\right\|_{{\mathcal{H}}_{0}}^{2}=\langle k(x,\cdot),k(x,\cdot)\rangle_{{\mathcal{H}}_{0}}=k(x,x).

Now, let us proceed to xik(x,)0\left\|\partial_{x_{i}}k(x,\cdot)\right\|_{{\mathcal{H}}_{0}}. Let eie_{i} be the ii-th standard basis of d{\mathbb{R}}^{d}. Then

xik(x,)\displaystyle\langle\partial_{x_{i}}k(x,\cdot) ,yik(y,)0=limϵ10limϵ20k(x+ϵ1ei,)k(x,)ϵ1,k(y+ϵ2ei,)k(y,)ϵ20\displaystyle,\partial_{y_{i}}k(y,\cdot)\rangle_{{\mathcal{H}}_{0}}=\lim_{\epsilon_{1}\to 0}\lim_{\epsilon_{2}\to 0}\left\langle\frac{k(x+\epsilon_{1}e_{i},\cdot)-k(x,\cdot)}{\epsilon_{1}},\frac{k(y+\epsilon_{2}e_{i},\cdot)-k(y,\cdot)}{\epsilon_{2}}\right\rangle_{{\mathcal{H}}_{0}}
=limϵ10limϵ20k(x+ϵ1ei,y+ϵ2ei)k(x+ϵ1ei,y)k(x,y+ϵ2ei)+k(x,y)ϵ1ϵ2\displaystyle=\lim_{\epsilon_{1}\to 0}\lim_{\epsilon_{2}\to 0}\frac{k(x+\epsilon_{1}e_{i},y+\epsilon_{2}e_{i})-k(x+\epsilon_{1}e_{i},y)-k(x,y+\epsilon_{2}e_{i})+k(x,y)}{\epsilon_{1}\epsilon_{2}}
=limϵ10limϵ20(k(x+ϵ1ei,y+ϵ2ei)k(x+ϵ1ei,y))(k(x,y+ϵ2ei)k(x,y))ϵ1ϵ2\displaystyle=\lim_{\epsilon_{1}\to 0}\lim_{\epsilon_{2}\to 0}\frac{\left(k(x+\epsilon_{1}e_{i},y+\epsilon_{2}e_{i})-k(x+\epsilon_{1}e_{i},y)\right)-\left(k(x,y+\epsilon_{2}e_{i})-k(x,y)\right)}{\epsilon_{1}\epsilon_{2}}
=limϵ10yik(x+ϵ1ei,y)yik(x,y)ϵ1\displaystyle=\lim_{\epsilon_{1}\to 0}\frac{\partial_{y_{i}}k(x+\epsilon_{1}e_{i},y)-\partial_{y_{i}}k(x,y)}{\epsilon_{1}}
=xiyik(x,y).\displaystyle=\partial_{x_{i}}\partial_{y_{i}}k(x,y).

Setting x=yx=y, we obtain xik(x,)02=xixik(x,x)=:xi,xi2k(x,x)\left\|\partial_{x_{i}}k(x,\cdot)\right\|_{{\mathcal{H}}_{0}}^{2}=\partial_{x_{i}}\partial_{x_{i}}k(x,x)=:\partial_{x_{i},x_{i}}^{2}k(x,x), where the first xi\partial_{x_{i}} is operated on the first variable of k(,)k(\cdot,\cdot), the second xi\partial_{x_{i}} is operated on the second variable of k(,)k(\cdot,\cdot).

E.2 The generalized Tp{\mathrm{T}}_{p} inequality

The following lemma is from Bolley and Villani [2005]. The proof is omitted.

Lemma 7.

[Bolley and Villani, 2005, Corollary 2.3] Let d\mathbb{R}^{d} be the Euclidean space with its usual norm. Let p1p\geq 1 and let π\pi be a probability measure on d\mathbb{R}^{d}. Assume that there exist x0dx_{0}\in\mathbb{R}^{d} and s>0s>0 such that dexp(sx0xp)π(dx)\int_{\mathbb{R}^{d}}\exp({s\|x_{0}-x\|^{p}})\pi({\mathrm{d}}x) is finite. Then

μ𝒫(d),Wp(μ,π)λBV[KL(μπ)1p+(KL(μπ)2)12p],\forall\mu\in{\mathcal{P}}(\mathbb{R}^{d}),\quad W_{p}(\mu,\pi)\leq\lambda_{BV}\left[\mathop{\mathrm{KL}}\nolimits(\mu\mid\pi)^{\frac{1}{p}}+\left(\frac{\mathop{\mathrm{KL}}\nolimits(\mu\mid\pi)}{2}\right)^{\frac{1}{2p}}\right], (26)

where

λBV:=2infx0X,s>0(1s(32+logdexp(sx0xp)π(dx)))1p<+.\lambda_{BV}:=2\inf_{x_{0}\in X,s>0}\left(\frac{1}{s}\left(\frac{3}{2}+\log\int_{\mathbb{R}^{d}}\exp({s\|x_{0}-x\|^{p}})\pi({\mathrm{d}}x)\right)\right)^{\frac{1}{p}}<+\infty. (27)

We want to underline the fact that the constant here may depend on the dimension. Below, we explicit the dimension-dependence of λBV\lambda_{BV} for the general class of distributions π(x)exp(xp)\pi(x)\propto\exp(-\left\|x\right\|^{p}) with p1p\geq 1. It is straightforward the condition of the lemma is satisfied for x0=0x_{0}=0. Now, let us compute the objective function in (27). We start with the integral term:

xdexp(sxp)π(x)𝑑x\displaystyle\int_{x\in{\mathbb{R}}^{d}}\exp(s\left\|x\right\|^{p})\pi(x)dx =1Zxdexp((1s)xp)𝑑x\displaystyle=\frac{1}{Z}\int_{x\in{\mathbb{R}}^{d}}\exp(-(1-s)\left\|x\right\|^{p})dx
=1Z(1s)dpxdexp(xp)𝑑x\displaystyle=\frac{1}{Z}(1-s)^{-\frac{d}{p}}\int_{x\in{\mathbb{R}}^{d}}\exp(-\left\|x\right\|^{p})dx
=(1s)dp.\displaystyle=(1-s)^{-\frac{d}{p}}.

Inserting the previous formula into (27), we obtain

λBV:\displaystyle\lambda_{BV}: =2infx0d,s>0(1s(32+logexp(sxx0p)𝑑π(x)))1p\displaystyle=2\inf_{x_{0}\in{\mathbb{R}}^{d},s>0}\left(\frac{1}{s}\left(\frac{3}{2}+\log\int\exp({s\left\|x-x_{0}\right\|^{p}})d\pi(x)\right)\right)^{\frac{1}{p}}
=2infs(0,1)(32sdlog(1s)ps)1p\displaystyle=2\inf_{s\in(0,1)}\left(\frac{3}{2s}-\frac{d\log(1-s)}{ps}\right)^{\frac{1}{p}}
2infs(0,1)(32s+dp)1p\displaystyle\geq 2\inf_{s\in(0,1)}\left(\frac{3}{2s}+\frac{d}{p}\right)^{\frac{1}{p}}
=2d1p.\displaystyle=2d^{\frac{1}{p}}.

Thus, indeed the constant λBV\lambda_{BV} can be dimension dependent.

E.3 Grönwall’s integral inequality

The following lemma is the integral form of Grönwall inequality from [Amann, 2011, Chapter II.].

Lemma 8 (Grönwall Inequality).

Assume ϕ,B:[0,T]\phi,B:[0,T]\rightarrow\mathbb{R} are bounded non-negative measurable function and C:[0,T]C:[0,T]\rightarrow\mathbb{R} is a non-negative integrable function with the property that

ϕ(t)B(t)+0tC(τ)ϕ(τ)𝑑τ for all t[0,T]\phi(t)\leq B(t)+\int_{0}^{t}C(\tau)\phi(\tau)d\tau\quad\text{ for all }t\in[0,T] (28)

Then

ϕ(t)B(t)+0tB(s)C(s)exp(stC(τ)𝑑τ)𝑑s for all t[0,T].\phi(t)\leq B(t)+\int_{0}^{t}B(s)C(s)\exp\left(\int_{s}^{t}C(\tau)d\tau\right)ds\quad\text{ for all }t\in[0,T]. (29)

E.4 Stein-Fisher information and weak convergence

We provide a sufficient condition on which limnIStein(μnπ)\lim_{n\to\infty}I_{Stein}(\mu_{n}\mid\pi) implies μnπ\mu_{n}\to\pi weakly. This condition can be found in Gorham and Mackey [2017]. Since Stein Fisher information IStein(π)I_{Stein}(\cdot\mid\pi) depends on the target distribution π\pi and the kernel k(,)k(\cdot,\cdot), we need the following two properties, respectively:

  1. 1.

    π\pi is distant dissipative, that is κ0lim infrκ(r)>0\kappa_{0}\triangleq\liminf_{r\rightarrow\infty}\kappa(r)>0 with

    κ(r)=inf{2V(x)V(y),xyxy2:xy=r}.\kappa(r)=\inf\left\{2\frac{\langle\nabla V(x)-\nabla V(y),x-y\rangle}{\|x-y\|^{2}}:\|x-y\|=r\right\}.

    If VV is strongly convex outside a compact set, then π\pi is distant dissipative, for instance V(x)=x2+δV(x)=\left\|x\right\|^{2+\delta} with δ0\delta\geq 0.

  2. 2.

    k(,)k(\cdot,\cdot) is an inverse multiquadratic kernel, i.e., k(x,y)=(c2+xy2)βk(x,y)=\left(c^{2}+\|x-y\|^{2}\right)^{\beta} for some c>0c>0 and β(1,0)\beta\in(-1,0). It is easy to check that ˜(ker,B)({\rm ker},B) is satisfied.

E.5 Proof of Remark˜1

Chain rule implies

d\displaystyle\int_{\mathbb{R}^{d}}\nabla log(μ(x)π(x))k(x,)μ(dx)\displaystyle\log\left(\frac{\mu(x)}{\pi(x)}\right)k(x,\cdot)\mu({\mathrm{d}}x)
=dlog(μ(x))k(x,)μ(dx)dlog(π(x))k(x,)μ(dx)\displaystyle=\int_{\mathbb{R}^{d}}\nabla\log({\mu(x)})k(x,\cdot)\mu({\mathrm{d}}x)-\int_{\mathbb{R}^{d}}\nabla\log(\pi(x))k(x,\cdot)\mu({\mathrm{d}}x)
=dμ(x)k(x,)dxdlog(π(x))xk(x,)μ(dx).\displaystyle=\int_{\mathbb{R}^{d}}\nabla{\mu(x)}k(x,\cdot){\mathrm{d}}x-\int_{\mathbb{R}^{d}}\log(\pi(x))\nabla_{x}k(x,\cdot)\mu({\mathrm{d}}x).

Since μ(x)0\mu(x)\rightarrow 0, when x+\|x\|\rightarrow+\infty, the integration by parts yields

dμ(x)k(x,)dx\displaystyle\int_{\mathbb{R}^{d}}\nabla{\mu(x)}k(x,\cdot){\mathrm{d}}x dlog(π(x))xk(x,)μ(dx)\displaystyle-\int_{\mathbb{R}^{d}}\log(\pi(x))\nabla_{x}k(x,\cdot)\mu({\mathrm{d}}x)
=d{μ(x)k(x,)+log(π(x))xk(x,)}μ(dx)\displaystyle=-\int_{\mathbb{R}^{d}}\left\{{\mu(x)}\nabla k(x,\cdot)+\log(\pi(x))\nabla_{x}k(x,\cdot)\right\}\mu({\mathrm{d}}x)
=gμ().\displaystyle=g_{\mu}(\cdot).

This concludes the proof.