
Improved dimension dependence in the Bernstein-von Mises Theorem via a new Laplace approximation bound

Anya Katsevich
akatsevi@mit.edu
This work was supported by NSF grant DMS-2202963
Abstract

The Bernstein-von Mises theorem (BvM) gives conditions under which the posterior distribution of a parameter $\theta\in\Theta\subseteq\mathbb{R}^{d}$ based on $n$ independent samples is asymptotically normal. In the high-dimensional regime, a key question is to determine the growth rate of $d$ with $n$ required for the BvM to hold. We show that up to a model-dependent coefficient, $n\gg d^{2}$ suffices for the BvM to hold in two settings: arbitrary generalized linear models, which include exponential families as a special case, and multinomial data, in which the parameter of interest is an unknown probability mass function on $d+1$ states. Our results improve on the tightest previously known condition for posterior asymptotic normality, $n\gg d^{3}$. Our statements of the BvM are nonasymptotic, taking the form of explicit high-probability bounds. To prove the BvM, we derive a new simple and explicit bound on the total variation distance between a measure $\pi\propto e^{-nf}$ on $\Theta\subseteq\mathbb{R}^{d}$ and its Laplace approximation.

1 Introduction

In frequentist inference, classical theory shows that the maximum likelihood estimator (MLE) $\hat{\theta}_{n}=\hat{\theta}_{n}(Y_{1:n})$ based on independent data $Y_{1:n}=\{Y_{i}\}_{i=1}^{n}$ is asymptotically normal in the large sample limit:

\mathrm{Law}(\hat{\theta}_{n})\approx\mathcal{N}({\theta^{*}},I_{n}({\theta^{*}})^{-1}),  (1.1)

where ${\theta^{*}}$ is the ground truth parameter and $I_{n}({\theta^{*}})=nI({\theta^{*}})$ is the scaled Fisher information matrix. The Bernstein-von Mises theorem (BvM) is the Bayesian counterpart of this result. Under conditions similar to those of the classical theory of the MLE, and certain natural assumptions on the prior, it states that the posterior $\pi_{n}$ of the unknown parameter given independent data $Y_{1:n}$ is also asymptotically normal,

\pi_{n}(\cdot\mid Y_{1:n})\approx\gamma^{*}_{n}:=\mathcal{N}(\hat{\theta}_{n},I_{n}({\theta^{*}})^{-1})  (1.2)

with high probability under the ground truth data generating process [38]. Here, approximation accuracy is typically measured in total variation (TV) distance. The BvM is a frequentist view on Bayesian inference, because it breaks out of the classic Bayesian paradigm of fixed data by making a statement about the posterior as a random object which varies with random draws of the data. The BvM is an important result because it justifies Bayesian uncertainty quantification from the frequentist perspective. Namely, the BvM shows that Bayesian credible sets $B_{n}$ of level $1-\alpha$ are asymptotic frequentist confidence sets of the same level [22]. This is true in the sense that $B_{n}$ can be expressed by taking a set $C_{n}$ of approximate probability $1-\alpha$ under the standard normal, scaling it by $I_{n}({\theta^{*}})^{-1/2}$, and recentering it about the MLE $\hat{\theta}_{n}$:

B_{n}=\hat{\theta}_{n}+I_{n}({\theta^{*}})^{-1/2}C_{n}.
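To make this correspondence concrete, here is a minimal sketch (with hypothetical placeholder values for $\hat{\theta}_{n}$ and $I_{n}({\theta^{*}})$, not taken from the paper) of testing membership in such a set $B_{n}$ when $C_{n}$ is a centered Euclidean ball; membership reduces to a quadratic-form inequality.

import numpy as np
from scipy import stats

# Hypothetical placeholder values for the MLE and scaled Fisher information.
d, n = 3, 1000
theta_hat = np.zeros(d)
I_n = n * np.eye(d)  # stands in for I_n(theta*)

# C_n: centered Euclidean ball of standard-normal probability 1 - alpha,
# i.e. {z : |z|^2 <= chi^2_{d, 1-alpha}}.
alpha = 0.05
chi2_quantile = stats.chi2.ppf(1 - alpha, df=d)

def in_credible_set(theta):
    # theta is in B_n = theta_hat + I_n^{-1/2} C_n iff
    # (theta - theta_hat)^T I_n (theta - theta_hat) <= chi^2 quantile.
    diff = theta - theta_hat
    return diff @ I_n @ diff <= chi2_quantile

print(in_credible_set(theta_hat))               # True: the center
print(in_credible_set(theta_hat + np.ones(d)))  # False: a distant point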

Thus in the limit, Bayesian and frequentist inference are two sides of the same coin. Interestingly, despite this, there is a notable difference between the conditions under which asymptotic normality has been shown to hold in Bayesian and frequentist inference. This difference occurs in the high-dimensional regime, in which the parameter $\theta$ lies in $\Theta\subseteq\mathbb{R}^{d}$, where $d$ grows with the sample size $n$. The question is: how fast can $d$ grow with $n$ while preserving asymptotic normality?

In frequentist inference, the pioneering work [29] showed that for exponential families, the MLE is asymptotically normal provided $d^{2}/n\to 0$. The work [17] extended this result to more general parametric models, while [31] proved finite sample error bounds on the Fisher expansion of the MLE (closely related to normality), provided $d^{2}\ll n$.

In a departure from the condition $d^{2}/n\to 0$ on the frequentist side, all of the works on the BvM for parametric Bayesian inference in growing dimension required at least that $d^{3}/n\to 0$ [28, 25]. The first works were by [15, 16], proving BvMs for posteriors arising from linear regression models and exponential families, respectively. [8] proves a BvM for the posterior of a discrete probability mass function truncated to its first $d$ entries. [32] proves a BvM for parametric models under more general conditions, which include e.g. generalized linear models (GLMs). [5] extends the BvM result of [16] to include the case of curved exponential families. [25] proves a BvM for nonlinear Bayesian inverse problems. See all of the above works for further references on BvMs with growing parameter dimension.

This discrepancy between frequentist and Bayesian asymptotics motivates the question: is the more stringent requirement on the growth of dd with nn in the BvM an inherent difference between the limiting behavior of the posterior and the law of the MLE? In this work, we resolve this question in the negative.

1.1 Model dependence in high-dimensional BvMs

Before conveying our results, we first raise an additional important consideration in the study of high-dimensional BvMs. Clearly, the TV distance between $\pi_{n}$ and $\gamma^{*}_{n}$ depends not only on $d$ and $n$, but also on the statistical model for the data-generating process. And in the high-dimensional regime, the contribution to the TV error stemming from the model can itself depend on $d$. Thus the condition on the growth of $n$ relative to $d$ required for asymptotic normality is model-dependent.

To demonstrate this phenomenon, consider the aforementioned works [8] and [25]. In [8], inspecting the proof of the main result, Theorem 3.7, shows that the intermediate Proposition 3.10 implies the bound

\mathrm{TV}(\pi_{n},\gamma^{*}_{n})=\mathcal{O}\left(\sqrt{d^{3+\epsilon}/(n\theta_{\mathrm{min}}^{*})}\right)  (1.3)

for any $\epsilon>0$. We have assumed a flat prior for simplicity, i.e. we set $w_{n}\equiv 1$ in the statement of the proposition, and made some other simplifications. Here, $\theta_{\mathrm{min}}^{*}$ is the minimum probability in the unknown probability mass function over $d+1$ states, and therefore $\theta_{\mathrm{min}}^{*}\leq 1/(d+1)$. Thus $n\geq d^{3+\epsilon}/\theta_{\mathrm{min}}^{*}\geq d^{4+\epsilon}$ is required.

In the work [25] on the BvM for inverse problems, it is shown that

\mathrm{TV}(\pi_{n},\gamma^{*}_{n})=\mathcal{O}\left(\sigma(d)^{2}\log\left(d\vee\sigma(d)\right)\sqrt{d^{3}/n}\right),  (1.4)

where $\sigma(d)$ characterizes the degree of ill-posedness of the inverse problem. The author states that $\sigma(d)$ grows as a power of $d$ for mildly ill-posed inverse problems, and exponentially in $d$ for severely ill-posed inverse problems.

To meaningfully compare BvM results across different works studying different statistical models, we separate the contribution to the TV error into a “model-dependent” factor and a “universal” factor. We informally define the universal factor as the term in the error bound depending on $d$ and $n$ only. For example, $1/\sqrt{\theta_{\mathrm{min}}^{*}}$ and $\sigma(d)^{2}\log(d\vee\sigma(d))$ are the model-dependent terms in the above TV bounds of [8] and [25], respectively, while $\sqrt{d^{3+\epsilon}/n}$ and $\sqrt{d^{3}/n}$ are the respective universal factors. Thus to clarify our above overview of prior works, the stated condition $n\gg d^{3}$ stems from the universal factor appearing in the bounds.

1.2 Main Contributions

We prove nonasymptotic statements of the BvM for several representative statistical settings, showing for the first time that, up to model-dependent factors, $n\gg d^{2}$ suffices for $\mathrm{TV}(\pi_{n},\gamma^{*}_{n})\to 0$. Specifically, we prove this for posteriors $\pi_{n}$ over an unknown parameter $\theta$ in the following settings: 1) arbitrary generalized linear models (GLMs) with data $Y_{i}\sim p(\cdot\mid X_{i}^{\intercal}\theta)$, $i=1,\dots,n$, for an exponential family $p$ in canonical form, and 2) a multinomial observation $Y^{n}\sim\mathrm{Multinomial}(n,\theta)$, where $\theta$ is an unknown probability mass function (pmf) on $d+1$ states (as in [8]). In each case, we show that, up to negligible terms,

\mathrm{TV}(\pi_{n},\gamma^{*}_{n})\leq\delta^{*}_{3}\sqrt{d^{2}/n}  (1.5)

with high probability. Here, $\delta^{*}_{3}$ is an explicit model-dependent coefficient related to the third derivative of the population log likelihood. This quantity may itself depend on $d$ in some cases; the second setting is one such example, as discussed below.

Our first setting, that of GLMs, is very general, allowing the exponential family $p(\cdot\mid\omega)$ to have a multivariate parameter $\omega\in\mathbb{R}^{k}$, $k\geq 1$. Thus, the $X_{i}$ are $d\times k$ matrices. The dimension $k$ is itself also allowed to grow with $n$. In particular, taking $k=d$ and $X_{i}=I_{d}$ for all $i=1,\dots,n$, we recover the setting of i.i.d. observations of a $d$-parameter exponential family $p$. Our BvM in the GLM setting can be compared to the works [16, 5] on the BvM for exponential families and [32] on the BvM for univariate ($k=1$) GLMs. To make this comparison, some work is required to first recover explicit TV bounds from the proofs of [16, 5]. With some effort, the recovered bounds in these two works, as well as the stated TV bound of [32], can be brought into the structure discussed above (see Appendix C.1). In the end, we see that our BvM tightens the dimension dependence of the universal factor from $\sqrt{d^{3}/n}$ to $\sqrt{d^{2}/n}$, while our model-dependent factor $\delta^{*}_{3}$ is of the same order of magnitude as that of prior works.

To apply our general result to a particular GLM of interest, it remains to bound the quantity $\delta^{*}_{3}$. We do so in two particular cases: an i.i.d. log-concave exponential family, and a logistic regression model with Gaussian design. In both cases we show that $\delta^{*}_{3}$ is bounded by an absolute constant, where in the second case, this holds with high probability over the random design. Thus $d^{2}\ll n$ suffices for the BvM to hold in these settings, with no extra dimension dependence stemming from the model. Moreover, our treatment of the random design case seems to be new; we are not aware of any works proving the BvM for a random design GLM.

In our second setting of inference of a pmf, the model-dependent factor is $\delta^{*}_{3}=1/\sqrt{\theta_{\mathrm{min}}^{*}}$, where $\theta_{\mathrm{min}}^{*}$ is as above, i.e. the minimum probability in the ground truth pmf. Thus $\delta^{*}_{3}\geq\sqrt{d}$ in this setting. Our BvM directly improves on the result of [8]: from $\mathrm{TV}(\pi_{n},\gamma^{*}_{n})=\mathcal{O}(\sqrt{d^{3}/(n\theta_{\mathrm{min}}^{*})})$ to $\mathrm{TV}(\pi_{n},\gamma^{*}_{n})=\mathcal{O}(\sqrt{d^{2}/(n\theta_{\mathrm{min}}^{*})})$.

The GLM setting differs from the pmf setting in that the former enjoys a stochastic linearity property, to use the terminology of [33], while the latter does not. Stochastic linearity means that the sample log likelihood deviates from the population log likelihood by a linear function of $\theta$, and this property simplifies certain elements of the BvM proof. Despite this difference between the two settings, we use the same overall structure for our two BvM proofs.

The key element of this common proof structure is a bound on the TV accuracy of the Laplace approximation (LA) to $\pi_{n}$. The LA is another Gaussian approximation $\gamma_{n}$ to $\pi_{n}$; for more details, see Section 1.3 below. We derive a new, simple, and elegant bound on the LA accuracy $\mathrm{TV}(\pi_{n},\gamma_{n})$ which is tight in its dimension dependence (tightness is known thanks to lower bounds on $\mathrm{TV}(\pi_{n},\gamma_{n})$ derived in [20]). This bound is at the heart of our BvM proof, and it may also be of independent interest due to its simplicity.

To summarize, our main contributions are as follows.

  • We prove the BvM for arbitrary GLMs under the condition $n\gg d^{2}$ up to a model-dependent contribution $\delta^{*}_{3}$, improving for the first time on the condition $n\gg d^{3}$ from prior work.

  • We study two special cases of a GLM: log-concave exponential families and logistic regression with Gaussian design, showing $\delta^{*}_{3}$ contributes only a constant factor in both cases.

  • We are the first to prove the BvM with high probability over the distribution of the random design in a GLM, in the high-dimensional regime.

  • We prove the BvM for a posterior over an unknown pmf, again showing $n\gg d^{2}$ suffices up to the explicit model-dependent contribution $\delta^{*}_{3}=1/\sqrt{\theta_{\mathrm{min}}^{*}}$.

  • All of our statements of the BvM are nonasymptotic, explicitly quantifying both the bound on $\mathrm{TV}(\pi_{n},\gamma^{*}_{n})$ and the probability with which the bound holds. Only one other work [32] has done this (with the worse rate $\sqrt{d^{3}/n}$).

  • Our proof is based on a new and simple bound on the TV accuracy of the Laplace approximation to a posterior.

1.3 Proof via Laplace approximation

The main tool in our proof is a new bound on the accuracy of the Laplace approximation (LA). The LA is another Gaussian distribution similar in spirit to $\gamma^{*}_{n}$, but which can be computed in practice from given data. Namely, the LA to $\pi_{n}$ is given by

\gamma_{n}:=\mathcal{N}(\tilde{\theta}_{n},\,-\nabla^{2}(\log\pi_{n})(\tilde{\theta}_{n})^{-1}),\qquad\tilde{\theta}_{n}:=\operatorname*{argmax}_{\theta\in\mathbb{R}^{d}}\pi_{n}(\theta).
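As a simple illustration (a toy sketch, not the paper's procedure for any specific model), the LA can be computed numerically by finding the posterior mode and inverting the Hessian of $-\log\pi_{n}$ there; the negative log-posterior below is hypothetical.

import numpy as np
from scipy.optimize import minimize

# Toy negative log-posterior: smooth, strictly convex, minimized at 0.
def neg_log_post(theta):
    return 0.5 * np.sum(theta**2) + np.sum(theta**4) / 12.0

def neg_log_post_hess(theta):
    # analytic Hessian of the toy function above
    return np.eye(theta.size) + np.diag(theta**2)

d = 5
theta_tilde = minimize(neg_log_post, x0=np.ones(d)).x  # posterior mode
cov = np.linalg.inv(neg_log_post_hess(theta_tilde))    # LA covariance
# gamma_n = N(theta_tilde, cov); e.g. draw approximate posterior samples:
samples = np.random.default_rng(0).multivariate_normal(theta_tilde, cov, size=10)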

Bayesian inference can be very costly in high dimensions, since typical quantities of interest such as the posterior mean, covariance, and credible sets involve taking integrals against or sampling from $\pi_{n}$. Making the approximation $\pi_{n}\approx\gamma_{n}$ can simplify these computations. Here, we prove a new bound on the TV accuracy of the LA, which essentially takes the form

\mathrm{TV}(\pi_{n},\gamma_{n})\leq\delta_{3}\sqrt{d^{2}/n}.  (1.6)

See Theorem 2.2 for the precise statement. Here, $\delta_{3}$ depends on the third derivative of the function $\log\pi_{n}$ in a neighborhood of $\tilde{\theta}_{n}$; it is intimately related to the “model-dependent” factor $\delta^{*}_{3}$ in our bound (1.5) in the statement of the BvM.

The bound (1.6) on $\mathrm{TV}(\pi_{n},\gamma_{n})$ is the key ingredient in our BvM proof. Namely, we apply the triangle inequality $\mathrm{TV}(\pi_{n},\gamma^{*}_{n})\leq\mathrm{TV}(\pi_{n},\gamma_{n})+\mathrm{TV}(\gamma_{n},\gamma^{*}_{n})$, and use the bound (1.6) on the first term on the righthand side. Since $\delta_{3}$ is random (depending on the log posterior, which in turn depends on the data $Y_{1:n}$), it remains to prove a high-probability, deterministic upper bound on $\delta_{3}$. The second term $\mathrm{TV}(\gamma_{n},\gamma^{*}_{n})$ is a TV distance between two Gaussians, and can be bounded in a straightforward manner. There is an additional first step in which we quantify the effect of removing the prior from $\pi_{n}$, but we neglect this technicality in the present overview to simplify the discussion.

The LA bound (1.6) has two important features which are noteworthy on their own and which have bearing on the BvM result. First, we see that the “universal” factor in the upper bound is $\sqrt{d^{2}/n}$. This directly translates to the tightened dimension dependence in the BvM. Second, the bound is remarkably simple and transparent, which allows for a very clear-cut proof of the BvM.

Let us put our bound (1.6) into context. Like prior work on the BvM, most prior work on the accuracy of the LA in high dimensions also required $d^{3}\ll n$. In particular, the three works [11, 34, 18] all show, under slightly varying conditions, a bound of the form $\mathrm{TV}(\pi_{n},\gamma_{n})\leq\delta_{3}^{\prime}\sqrt{d^{3}/n}$. Here, $\delta_{3}^{\prime}$ denotes various model-dependent coefficients analogous to $\delta_{3}$ from (1.6). Recently, however, the works [19, 20] obtained bounds on the TV distance with “universal” factor $d/\sqrt{n}$. The bound we derive here is significantly simpler than those of [19, 20]; see Section 2.2 for a more detailed comparison.

1.4 Other related works.

We note that in certain special cases, the limit of posterior distributions can be obtained in regimes in which $d^{2}/n$ is not small, including infinite-dimensional regimes. However, the type of limit, and the shape of the limiting posterior distribution, may be different.

The work [7] proves the BvM for Gaussian linear regression in increasing dimensions. This is a case in which the log likelihood is quadratic, so that the only non-Gaussian contribution to the posterior stems from the prior. The author shows that under certain conditions on the prior, $d\ll n$ suffices for the BvM to hold. This weaker requirement on $n$ (compared to our $n\gg d^{2}$) is consistent with our findings. Indeed, our analysis shows that the main source of error in the BvM vanishes when the log likelihood is quadratic; see Remark 3.6.

The work [10] proves a BvM for the coefficient vector in a sparse linear regression model, in which $d$ may be larger than $n$ but the true coefficient vector has few nonzero entries (exactly how many nonzero entries are allowed depends on various problem parameters). The authors show that the posterior is asymptotically a mixture of degenerate normal distributions (owing in part to a special prior construction incorporating sparsity). In particular, the posterior marginal of each coefficient is a mixture of a point mass at zero and a continuous distribution. This has implications for the construction of credible sets.

In infinite dimensions, [13] gives an example in which the posterior distribution of a functional of an infinite-dimensional parameter converges in distribution to a Gaussian, but which gives rise to credible intervals differing from frequentist confidence intervals. The work [9] proves a BvM for the Gaussian white noise model, in the sense that the authors show the posterior converges to the “correct” (from the frequentist perspective) normal distribution in a weaker sense than the total variation norm. The notion of convergence is still strong enough to prove that linear functionals of the parameter are asymptotically normal, centered at efficient estimators and with asymptotically efficient variance. Moreover, certain weighted $L^{2}$-credible ellipsoids coincide asymptotically with their frequentist counterparts.

Since the BvM is closely related to a high probability bound on the LA error, we mention a few works which prove the latter. This is in contrast to the Laplace bounds cited above, which are for a fixed realization of the data. In [3], the authors prove that for sparse generalized linear models with $n$ samples and $q$ active covariates, the relative LA error of the posterior normalizing constant is bounded by $(q^{3}\log^{3}n/n)^{1/2}$ with high probability. In [37], the authors study the pointwise ratio between the density of the posterior $\pi_{n}$ and that of the LA $\gamma_{n}$. They show that for generalized linear models with Gaussian design, the relative error is bounded by $d^{3}\log n/n$ with probability tending to 1, provided $d^{3+\epsilon}\ll n$ for some $\epsilon>0$. For logistic regression with Gaussian design, they obtain the improved rate $d^{2}\log n/n$ provided $d^{2.5}\ll n$. Finally, [11] shows that $\mathrm{KL}(\pi_{n}\,||\,\gamma_{n})$ goes to 0 with probability tending to 1 if $d^{3}/n\to 0$.

Finally, we note that the LA is just one approach to approximating a posterior measure by a simpler distribution. Another approach that is similar in spirit is to approximate the posterior $\pi_{n}$ with a log-concave distribution [27, 6]. Roughly speaking, the log-concave approximation coincides with $\pi_{n}$ in a neighborhood of the mode $\tilde{\theta}_{n}$ (where the negative log posterior is indeed convex), and it “convexifies” the tails of $-\log\pi_{n}$. The two works [27, 6] prove bounds on the accuracy of such log-concave approximations in the context of high-dimensional Bayesian nonlinear inverse problems. Under suitable restrictions on the growth of $d$ relative to $n$, it is shown that the Wasserstein-2 distance between the two measures is exponentially small in $n$ with high probability under the ground truth data generating process.

Organization

The rest of the paper is organized as follows. In Section 2 we state, discuss, and outline the proof of our new Laplace approximation bound in a general setting, where $\pi_{v}\propto e^{-nv}$ need not have the structure of a posterior. In Section 3 we introduce the Bayesian setting and use our Laplace bound to prove some key preliminary results for the BvM proofs. In Sections 4 and 5 we prove the BvM for posteriors stemming from GLMs and observations of a discrete probability distribution, respectively. Omitted proofs can be found in the Appendix.

1.5 Notation

The letter $C$ always denotes an absolute constant. The inequality $a\lesssim b$ denotes $a\leq Cb$ for an absolute constant $C$. The statement that $a\lesssim b$ on a random event indicates that there is an absolute nonrandom constant $C$ such that $a\leq Cb$ on the event. We let $|\cdot|$ denote the standard Euclidean norm in $\mathbb{R}^{d}$. If $A\in\mathbb{R}^{d\times d}$ is a symmetric positive definite matrix and $u\in\mathbb{R}^{d}$ is a vector, we let

|u|_{A}=|A^{1/2}u|.

Let $k\geq 1$ and let $S$ be a symmetric $k$-linear form, identified with its symmetric tensor representation. We define the $A$-weighted operator norm of $S$ to be

\|S\|_{A}:=\sup_{|u|_{A}=1}\langle S,u^{\otimes k}\rangle=\sup_{|u|_{A}=1}\sum_{i_{1},\dots,i_{k}=1}^{d}S_{i_{1}i_{2}\dots i_{k}}u_{i_{1}}u_{i_{2}}\dots u_{i_{k}}.  (1.7)

For example, if $k=2$ then $S$ is a symmetric matrix and $\|S\|_{A}=\|A^{-1/2}SA^{-1/2}\|$, where $\|\cdot\|$ is the standard matrix operator norm. Note that if $k=1$ and $S$ is a 1-linear form identified with its vector representation, then

\|S\|_{A}=|A^{-1/2}S|,\qquad|S|_{A}=|A^{1/2}S|.

Thus it is important to distinguish between the meanings of $\|\cdot\|_{A}$ and $|\cdot|_{A}$ when the quantity inside the norm is a vector. By Theorem 2.1 of [39], for symmetric tensors the definition (1.7) coincides with the standard definition of the operator norm:

\sup_{|u_{1}|_{A}=\dots=|u_{k}|_{A}=1}\langle S,u_{1}\otimes\dots\otimes u_{k}\rangle=\|S\|_{A}=\sup_{|u|_{A}=1}\langle S,u^{\otimes k}\rangle.
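For readers who prefer code, the following small numerical sketch (toy values, assuming nothing beyond the definitions above) evaluates $|\cdot|_{A}$ and $\|\cdot\|_{A}$ for $k=1$ and $k=2$:

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
d = 4
G = rng.standard_normal((d, d))
A = G @ G.T + d * np.eye(d)          # symmetric positive definite
A_half = np.real(sqrtm(A))           # A^{1/2}
A_half_inv = np.linalg.inv(A_half)   # A^{-1/2}

u = rng.standard_normal(d)
norm_u_A = np.linalg.norm(A_half @ u)        # |u|_A = |A^{1/2} u|
opnorm_u_A = np.linalg.norm(A_half_inv @ u)  # ||u||_A = |A^{-1/2} u|  (k = 1)

B = rng.standard_normal((d, d))
S = B + B.T                                  # symmetric matrix (k = 2)
opnorm_S_A = np.linalg.norm(A_half_inv @ S @ A_half_inv, ord=2)
# = ||A^{-1/2} S A^{-1/2}||, the standard operator norm after reweighting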

2 Laplace approximation bound

We begin by stating and discussing our general bound on the Laplace approximation accuracy. In Section 2.1, we outline the proof. In Section 2.2, we compare our bound to other bounds in the literature which achieve the tight $d/\sqrt{n}$ rate. We consider a general setting which need not stem from Bayesian inference. Throughout the paper, we let $\Theta$ denote a convex subset of $\mathbb{R}^{d}$. The value $n$ should be thought of as some large positive number.

Definition 2.1.

Let $f\in C^{2}(\Theta)$. Whenever $e^{-nf}$ is integrable, we define the probability density $\pi_{f}(\theta)\propto e^{-nf(\theta)}$, $\theta\in\Theta$. If $f$ has a unique strict global minimizer $\hat{\theta}\in\Theta$, we define the following Laplace approximation (LA) to $\pi_{f}$:

\gamma_{f}=\mathcal{N}(\hat{\theta},(nH)^{-1}),\qquad H:=\nabla^{2}f(\hat{\theta}).
Theorem 2.2.

Let $f\in C^{2}(\Theta)$ be convex with unique strict global minimizer $\hat{\theta}$. Assume there is $r\geq 6$ such that ${\mathcal{U}}(r)\subset\Theta$, where ${\mathcal{U}}(r)=\{|\theta-\hat{\theta}|_{H}\leq r\sqrt{d/n}\}$. Let

\delta_{3}(r)=\sup_{\theta\in{\mathcal{U}}(r),\,\theta\neq\hat{\theta}}\frac{\|\nabla^{2}f(\theta)-\nabla^{2}f(\hat{\theta})\|_{H}}{|\theta-\hat{\theta}|_{H}}  (2.1)

and suppose $r\delta_{3}(r)\sqrt{d/n}\leq 1/2$. Then $e^{-nf}\in L^{1}(\Theta)$, so $\pi_{f}\propto e^{-nf}$ is well-defined, and we have

\mathrm{TV}(\pi_{f},\gamma_{f})\leq\frac{\delta_{3}(r)\,d}{\sqrt{2n}}+3e^{-dr^{2}/9}.  (2.2)

Recall from Section 1.5 that $\|\nabla^{2}f(\theta)-\nabla^{2}f(\hat{\theta})\|_{H}=\|H^{-1/2}(\nabla^{2}f(\theta)-\nabla^{2}f(\hat{\theta}))H^{-1/2}\|$ and $|\theta-\hat{\theta}|_{H}=|H^{1/2}(\theta-\hat{\theta})|$, where $\|\cdot\|$ is the standard matrix operator norm and $|\cdot|$ is the Euclidean norm on $\mathbb{R}^{d}$. We hope that the remarkable simplicity of the bound (2.2) will lead to its broad applicability. See the discussion in Section 2.2 on how our bound compares to previous bounds in the literature.

Remark 2.3 (Convexity assumption).

To prove (2.2), we use convexity only through the linear growth lower bound (2.6) in Lemma 2.10 below. Therefore, convexity of $f$ can be replaced by (2.6) in the statement of Theorem 2.2. However, under convexity, (2.6) holds for any $r\geq 0$, giving us the flexibility to optimize the bound (2.2) over $r$.

Remark 2.4 (Smoothness and $\delta_{3}$).

As noted in the introduction, the only smoothness we require is that $f\in C^{2}(\Theta)$ and $\|\nabla^{2}f(\theta)-\nabla^{2}f(\hat{\theta})\|_{H}=\mathcal{O}(|\theta-\hat{\theta}|_{H})$ for $\theta\in{\mathcal{U}}(r)$, to ensure that $\delta_{3}(r)<\infty$. Thus for example $f(\theta)=|\theta|^{2}/2+|\theta|^{3}/6$ satisfies the conditions of the theorem: it is convex, strictly minimized at $\hat{\theta}=0$, and $f\in C^{2}$, with

\nabla^{2}f(\theta)=\left(1+\frac{1}{2}|\theta|\right)I_{d}+\frac{1}{2|\theta|}\theta\theta^{\intercal},\quad\theta\neq 0

and $H=\nabla^{2}f(0)=I_{d}$. It is straightforward to show that

\|\nabla^{2}f(\theta)-\nabla^{2}f(0)\|=\left\|\frac{1}{2}|\theta|I_{d}+\frac{1}{2|\theta|}\theta\theta^{\intercal}\right\|=|\theta|

for all $\theta\neq 0$, and therefore $\delta_{3}(r)=1$ for all $r>0$, even though $\nabla^{3}f$ does not exist at $\theta=0$. If $f$ is $C^{3}$ in a neighborhood of $\hat{\theta}$, then one can bound $\delta_{3}(r)$ via

\delta_{3}(r)\leq\sup_{\theta\in{\mathcal{U}}(r)}\|\nabla^{3}f(\theta)\|_{H}.  (2.3)
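The identity $\|\nabla^{2}f(\theta)-\nabla^{2}f(0)\|=|\theta|$ in this example is easy to verify numerically; a minimal check (using the explicit Hessian above) is:

import numpy as np

# For f(theta) = |theta|^2/2 + |theta|^3/6: H = I_d, and
# ||grad^2 f(theta) - grad^2 f(0)|| = |theta|, hence delta_3(r) = 1.
rng = np.random.default_rng(1)
d = 6
theta = rng.standard_normal(d)
t = np.linalg.norm(theta)
hess = (1 + 0.5 * t) * np.eye(d) + np.outer(theta, theta) / (2 * t)
diff = hess - np.eye(d)                            # grad^2 f(theta) - grad^2 f(0)
assert np.isclose(np.linalg.norm(diff, ord=2), t)  # spectral norm equals |theta|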
Remark 2.5 (Dependence of $\delta_{3}$ on $d$).

We have not indicated the dependence of $\delta_{3}(r)$ on $d$ or $n$, but it does in fact depend on both parameters. The dependence of $\delta_{3}(r)$ on $n$ is weak, but the dependence on $d$ may be significant, simply by virtue of the fact that $f$ is defined on (a subset of) $\mathbb{R}^{d}$. For example, consider the function $f(\theta)=|\theta|^{2}/2+|1^{\intercal}\theta|^{3}/6$, where $1$ is the all-ones vector in $\mathbb{R}^{d}$. The minimizer of $f$ is $\hat{\theta}=0$, with $H=\nabla^{2}f(0)=I_{d}$. Using that $\nabla^{2}f(\theta)=I_{d}+|1^{\intercal}\theta|11^{\intercal}$, it is straightforward to compute that $\delta_{3}(r)=d^{3/2}$ for all $r>0$.

Remark 2.6 (Choice of $r$).

Note that (2.2) is valid for all $r\geq 6$ such that ${\mathcal{U}}(r)\subseteq\Theta$ and $r\delta_{3}(r)\sqrt{d/n}\leq 1/2$. Thus $r$ can be chosen to optimize the bound. If $d$ is very large, then the second term in the bound (2.2) is exponentially small in $d$, so it suffices to choose $r=6$. If $d$ cannot be considered too large, then the optimal choice of $r$ depends on how $\delta_{3}(r)$ scales with $r$ and $d$. The following example shows how to choose $r$ in the special case when $\delta_{3}(r)$ is uniformly bounded.

Example 2.7 (Uniformly bounded third derivative).

Suppose $f\in C^{3}(\Theta)$ and $\sup_{\theta\in\Theta}\|\nabla^{3}f(\theta)\|_{H}\leq M$, where $M$ may depend on $d$. For one example in which the third derivative is indeed uniformly bounded over $\Theta$, see the logistic regression setting in Section 4.4, and the bound (4.22) in particular (the quantity $\delta^{*}_{3}$ in that bound is closely related to the above $\delta_{3}$).

Under the stated assumptions, we claim that

\mathrm{TV}(\pi_{f},\gamma_{f})\leq M\left(\frac{d}{\sqrt{2n}}+\frac{9}{\sqrt{n}}\right)  (2.4)

for all $n/d\geq(12M)^{2}$. Indeed, note that $M$ is a uniform bound on $\delta_{3}(r)$ over all $r\geq 0$, so (2.2) is optimized by taking $r$ as large as possible while satisfying the constraint $r\delta_{3}(r)\sqrt{d/n}\leq 1/2$. Thus we take $r=\sqrt{n/d}/(2M)$, which satisfies $r\geq 6$ provided $n/d\geq(12M)^{2}$. Substituting this choice of $r$ into the second term of (2.2), and using $\delta_{3}(r)\leq M$ in the first term, gives

\mathrm{TV}(\pi_{f},\gamma_{f})\leq\frac{Md}{\sqrt{2n}}+3\exp\left(-n/(36M^{2})\right).

Applying the inequality $e^{-x^{2}/4}\leq 1/x$ with $x=\sqrt{n}/(3M)$ yields (2.4).
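A small sketch (with hypothetical values of $n$, $d$, $M$) that evaluates both sides of this derivation and checks the constraints on $r$:

import numpy as np

def tv_bound(n, d, M):
    # Example 2.7: take r = sqrt(n/d)/(2M), valid when n/d >= (12M)^2.
    r = np.sqrt(n / d) / (2 * M)
    assert r >= 6 and r * M * np.sqrt(d / n) <= 0.5 + 1e-9
    raw = M * d / np.sqrt(2 * n) + 3 * np.exp(-d * r**2 / 9)  # plug r into (2.2)
    simplified = M * (d / np.sqrt(2 * n) + 9 / np.sqrt(n))    # the bound (2.4)
    return raw, simplified  # raw <= simplified for valid (n, d, M)

print(tv_bound(n=10**6, d=10, M=1.0))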

Remark 2.8 (Affine invariance).

If $T:\mathbb{R}^{d}\to\mathbb{R}^{d}$ is a bijective affine map, then the pushforward $T_{\#}\pi_{f}$ of $\pi_{f}$ under $T$ is given by $T_{\#}\pi_{f}=\pi_{f\circ T^{-1}}$. Also, one can check that $\gamma_{f\circ T^{-1}}=T_{\#}\gamma_{f}$. Since the TV distance is invariant under bijective coordinate transformations, we therefore have $\mathrm{TV}(\pi_{f},\gamma_{f})=\mathrm{TV}(T_{\#}\pi_{f},T_{\#}\gamma_{f})=\mathrm{TV}(\pi_{f\circ T^{-1}},\gamma_{f\circ T^{-1}})$. Due to this invariance, a good upper bound on $\mathrm{TV}(\pi_{f},\gamma_{f})$ should also be affine invariant, and so should the conditions under which the bound holds. It is straightforward to check that Theorem 2.2 has this property.

Remark 2.9 (Application to Bayesian inference).

In Bayesian inference, the posterior distribution can be written as $\pi_{f_{n}}\propto e^{-nf_{n}}$ for some function $f_{n}$ which depends weakly on $n$. Standard quantities of interest in Bayesian inference are posterior credible sets of level $1-\alpha$; that is, central regions $A$ such that $\pi_{f_{n}}(A)=1-\alpha$ [14]. The LA $\gamma_{f_{n}}$ to $\pi_{f_{n}}$ immediately yields explicit, approximate credible sets $\hat{A}$ based on level sets of $\gamma_{f_{n}}$, and Theorem 2.2 can be used to bound the deviation of the actual probability $\pi_{f_{n}}(\hat{A})$ from $1-\alpha$.

2.1 Proof

The assumptions of Theorem 2.2 imply two key properties used in the proof. First, $\pi_{f}$ is strongly log-concave in the neighborhood ${\mathcal{U}}(r)$ of $\hat{\theta}$. Second, $f$ grows linearly away from $\hat{\theta}$ in ${\mathcal{U}}(r)^{c}$. These two observations are stated in the following lemma:

Lemma 2.10 (Hessian lower bound and linear growth).

Under the assumptions of Theorem 2.2, it holds

\nabla^{2}f(\theta)\succeq\frac{1}{2}H,\quad\forall\theta\in{\mathcal{U}}(r).  (2.5)

Furthermore, we have the following growth bound:

f(\theta)-f(\hat{\theta})\geq\frac{r}{4}\sqrt{d/n}\,|\theta-\hat{\theta}|_{H}\qquad\forall\theta\in\Theta\setminus{\mathcal{U}}(r).  (2.6)

See Appendix A for the proof of the lemma. The bound (2.5) does not require convexity. Similarly, (2.6) only uses convexity via the property of convex functions that the infimum of $(f(\theta)-f(\hat{\theta}))/|\theta-\hat{\theta}|_{H}$ over $\theta\in{\mathcal{U}}(r)^{c}$ is achieved on the boundary $\{|\theta-\hat{\theta}|_{H}=r\sqrt{d/n}\}$. Thus any other function with this property will satisfy the lemma. Note also that (2.6) proves $e^{-nf}\in L^{1}(\Theta)$.

Next, the following lemma presents the main tools in the proof of Theorem 2.2.

Lemma 2.11.

Let $\mu$ and $\hat{\mu}$ be two probability densities on $\mathbb{R}^{d}$ and let ${\mathcal{U}}\subset\mathbb{R}^{d}$ be such that $\mu({\mathcal{U}})>0$ and $\hat{\mu}({\mathcal{U}})>0$. Let $\mu|_{\mathcal{U}}$, $\hat{\mu}|_{\mathcal{U}}$ be the restrictions of $\mu$ and $\hat{\mu}$ to ${\mathcal{U}}$, respectively, renormalized to be probability measures. Then

\mathrm{TV}(\mu,\hat{\mu})\leq\mu({\mathcal{U}}^{c})+\hat{\mu}({\mathcal{U}}^{c})+\mathrm{TV}\left(\mu|_{\mathcal{U}},\hat{\mu}|_{\mathcal{U}}\right).  (2.7)

Furthermore, suppose $\mu$ and $\hat{\mu}$ are strictly positive on ${\mathcal{U}}$ with $\mu\propto e^{-nf}$ and $\hat{\mu}\propto e^{-n\hat{f}}$, where $f\in C^{2}({\mathcal{U}})$ and $\hat{f}\in C^{1}({\mathcal{U}})$. Suppose also that $\nabla^{2}f(\theta)\succeq\lambda H$ for all $\theta\in{\mathcal{U}}$ and some $H\succ 0$. Then

\mathrm{TV}\left(\mu|_{\mathcal{U}},\hat{\mu}|_{\mathcal{U}}\right)^{2}\leq\frac{n}{4\lambda}\,\mathbb{E}_{X\sim\hat{\mu}|_{\mathcal{U}}}\left[\|\nabla(f-\hat{f})(X)\|_{H}^{2}\right],  (2.8)

and therefore the total TV distance between $\mu$ and $\hat{\mu}$ is bounded as

\mathrm{TV}(\mu,\hat{\mu})\leq\mu({\mathcal{U}}^{c})+\hat{\mu}({\mathcal{U}}^{c})+\sqrt{\frac{n}{4\lambda}}\,\mathbb{E}_{X\sim\hat{\mu}|_{\mathcal{U}}}\left[\|\nabla(f-\hat{f})(X)\|_{H}^{2}\right]^{\frac{1}{2}}.  (2.9)

See Appendix A.1 for the proof of this lemma. The bound (2.8) stems from a log-Sobolev inequality (LSI) [2], which can be applied since $\mu$ is strongly log-concave in the neighborhood ${\mathcal{U}}$. The inspiration for this technique comes from [19], who also prove a bound on $\mathrm{TV}(\pi_{f},\gamma_{f})$ in the context of Bayesian inference. See Section 2.2 for a comparison between our bound and that of [19].

We apply (2.9) with ${\mathcal{U}}={\mathcal{U}}(r)$, $\mu=\pi_{f}$ (extended to be zero on $\mathbb{R}^{d}\setminus\Theta$) and $\hat{\mu}=\gamma_{f}\propto e^{-n\hat{f}}$, where $\hat{f}(\theta)=\frac{1}{2}|\theta-\hat{\theta}|^{2}_{H}$. Since $\nabla^{2}f\succeq\frac{1}{2}H$ on ${\mathcal{U}}(r)$ by (2.5), the function $f$ enjoys local strong convexity. In particular, the lower bound condition in the lemma is satisfied with $\lambda=1/2$. Therefore, in our setting, the bound (2.9) takes the form

\mathrm{TV}(\pi_{f},\gamma_{f})\leq\pi_{f}({\mathcal{U}}^{c})+\gamma_{f}({\mathcal{U}}^{c})+\sqrt{\frac{n}{2}}\,\mathbb{E}_{\theta\sim\gamma_{f}|_{\mathcal{U}}}\left[\|\nabla f(\theta)-H(\theta-\hat{\theta})\|^{2}_{H}\right]^{\frac{1}{2}}.  (2.10)

We see that to bound the TV distance between $\pi_{f}$ and $\gamma_{f}$, it suffices to bound the two tail integrals and the local expectation in (2.10). The tail integral $\pi_{f}({\mathcal{U}}^{c})=\int_{{\mathcal{U}}^{c}}e^{-nf}/\int e^{-nf}$ can be bounded using the linear growth of $f$ in ${\mathcal{U}}^{c}$. The bound on $\gamma_{f}({\mathcal{U}}^{c})$ follows by standard Gaussian tail inequalities. Finally, to bound the local expectation we use a Taylor expansion of $\nabla f(\theta)$ about $\hat{\theta}$ to get

\nabla f(\theta)-H(\theta-\hat{\theta})=\nabla f(\theta)-\nabla f(\hat{\theta})-\nabla^{2}f(\hat{\theta})(\theta-\hat{\theta})=(\nabla^{2}f(\xi)-\nabla^{2}f(\hat{\theta}))(\theta-\hat{\theta})

for a point $\xi$ between $\theta$ and $\hat{\theta}$. From here we use the definition of $\delta_{3}$ and simple Gaussian calculations to show that the third term in (2.10) is bounded by $\delta_{3}(r)d/\sqrt{n}$. See Appendix A.2 for these calculations, which finish the proof of Theorem 2.2.

2.2 Comparison with [19] and [20]

So far, only two other works have obtained bounds on $\mathrm{TV}(\pi_{f},\gamma_{f})$ scaling as $d/\sqrt{n}$: those of [19] and [20]. We compare (2.2) to these two bounds from the perspective of the BvM.

The bound of [19] requires $f\in C^{3}(\Theta)$. Here, we require only that $f\in C^{2}(\Theta)$ and that $\delta_{3}(r)$ defined in (2.1) is finite. The bound of [19] is also significantly more complex than ours; see Theorem 3.1 of [19] (which relies on the quantities defined in its Section 2) compared to our (2.2). The bound of [20] involves fourth derivatives of $f$. Thus greater regularity of $f$ is required, and the presence of fourth derivatives makes the bound more complex than (2.2).

The weaker regularity condition required in our bound (2.2) translates into weaker regularity of the log likelihood in the statement of the BvM. As can be seen from the definition of the event $E_{3}$ in (3.4) and the third condition in (3.5), only boundedness of the Lipschitz constant of the log likelihood’s second derivative is needed. Also, as already mentioned, the simplicity of (2.2) is useful for obtaining a streamlined proof and statement of the BvM.

Finally, let us compare our proof technique to those of [19] and [20]. The key idea behind our proof is the same as in [19], that is, to use the log-Sobolev inequality in a local region. However, the implementation is different. Here, we (1) do not split the function $f$ into a sum of two parts coming from the log likelihood and the log prior, (2) apply the log-Sobolev inequality in an affine-invariant way, and (3) handle the tail integrals differently, relying on the linear growth of $f$ at infinity rather than on the fact that the prior is a proper Lebesgue density integrating to 1.

The proof technique in [20] is quite different. In that work, the leading-order term in the TV distance between $\pi_{f}$ and $\gamma_{f}$ is derived, and the main challenge is to bound the difference between $\mathrm{TV}(\pi_{f},\gamma_{f})$ and this leading term. This is done using high-dimensional Gaussian concentration. Overall, the calculation in [20] is more delicate, but leads to a more complicated bound involving fourth derivatives.

3 From Laplace approximation to Bernstein-von Mises: an overview

In Section 3.1 we review the general Bayesian set-up and state the BvM result we wish to prove. In Section 3.2 we use Theorem 2.2 to prove a key preliminary bound. Using this bound, it is then straightforward to finish the BvM proof for the specific statistical models considered in this work; we do so later in Sections 4 and 5. Finally, in Section 3.3 we discuss the conditions on the prior and how the condition $d^{2}\ll n$ arises in the BvM.

3.1 Bayesian preliminaries and statement of BvM

Let $\{P^{n}_{\theta}:\theta\in\Theta\}$ be a parameterized family of probability distributions, with $\Theta$ an open convex subset of $\mathbb{R}^{d}$. We assume there is a fixed measure with respect to which $P^{n}_{\theta}$ has density $p^{n}(\cdot\mid\theta)$, for each $\theta\in\Theta$. We observe a sample $Y^{n}\sim P^{n}_{\theta^{*}}$ for a ground truth parameter ${\theta^{*}}\in\Theta$. We define the likelihood $L:\Theta\to(0,\infty)$ by $L(\theta)=p^{n}(Y^{n}\mid\theta)$, and the negative normalized log likelihood $\ell:\Theta\to\mathbb{R}$ by

\ell(\theta)=-\frac{1}{n}\log L(\theta)=-\frac{1}{n}\log p^{n}(Y^{n}\mid\theta).

The normalization by $1/n$ is natural in the standard case that $P^{n}_{\theta}$ is an $n$-fold product measure. In other words, $Y^{n}=(Y_{1},\dots,Y_{n})$ and $p^{n}(Y^{n}\mid\theta)=\prod_{i=1}^{n}p_{i}(Y_{i}\mid\theta)$. We then have that $\log L$ is a sum of $n$ terms. When it exists and is unique, we define

\hat{\theta}=\operatorname*{argmin}_{\theta\in\Theta}\ell(\theta),

the maximum likelihood estimator (MLE). We define the negative population log likelihood to be the function

\ell^{*}(\theta)=\mathbb{E}_{Y^{n}\sim P^{n}_{\theta^{*}}}[\ell(\theta)]=-\frac{1}{n}\mathbb{E}_{Y^{n}\sim P^{n}_{\theta^{*}}}\left[\log p^{n}(Y^{n}\mid\theta)\right].
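As a concrete (hypothetical) instance of this setup, the sketch below forms $\ell$ for a logistic regression model and computes the MLE numerically; none of the specific values are from the paper.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.standard_normal((n, d))
theta_star = np.ones(d) / np.sqrt(d)
Y = rng.random(n) < 1 / (1 + np.exp(-X @ theta_star))  # Y_i ~ Bernoulli

def ell(theta):
    # negative normalized log likelihood:
    # ell(theta) = -(1/n) sum_i [Y_i z_i - log(1 + e^{z_i})], z_i = x_i^T theta
    z = X @ theta
    return -np.mean(Y * z - np.logaddexp(0.0, z))

theta_hat = minimize(ell, x0=np.zeros(d)).x  # the MLE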
Assumption A1.

The random function $\ell$ is convex and belongs to $C^{2}(\Theta)$ with probability 1 with respect to the distribution $Y^{n}\sim p^{n}(\cdot\mid{\theta^{*}})$. The function $\ell^{*}$ belongs to $C^{2}(\Theta)$, and the Fisher information matrix

F^{*}:=\nabla^{2}\ell^{*}({\theta^{*}})

is strictly positive definite.

Let the prior probability density over $\theta$ be denoted $\pi_{0}$. We assume without loss of generality that $\Theta$ lies in the support of $\pi_{0}$; otherwise, redefine $\Theta$ to be its intersection with the support of $\pi_{0}$. Thus there is a function $v_{0}:\Theta\to\mathbb{R}$ such that

\pi_{0}(\theta)\propto e^{-v_{0}(\theta)},\quad\theta\in\Theta.

Finally, the posterior distribution is the probability density $\pi_{v}$ such that

\pi_{v}(\theta)\propto e^{-n\ell(\theta)}\pi_{0}(\theta)\propto e^{-nv(\theta)},\qquad v=\ell+n^{-1}v_{0},\quad\theta\in\Theta.

Classically, the BvM states that $\mathrm{TV}(\pi_{v},\gamma^{*})=o(1)$ with probability tending to 1 (under $P^{n}_{\theta^{*}}$) as $n\to\infty$ [38]. Here,

\gamma^{*}=\mathcal{N}(\hat{\theta},(nF^{*})^{-1}).  (3.1)

From the definition of $\gamma^{*}$, we see that a necessary ingredient in proving the BvM is showing that the MLE exists and is unique with probability tending to 1.

Here, we will be interested in a nonasymptotic version of the BvM. Namely, we will prove a result of the form: “under certain conditions on $d$, $n$, and the statistical model, $\mathrm{TV}(\pi_{v},\gamma^{*})\leq\epsilon$ with probability at least $1-p$,” for explicit quantities $\epsilon$ and $p$ depending on $d$, $n$, and the model. We are aware of only one other work which proves BvMs in this style [32]. Our nonasymptotic BvM results then easily lead to the more classical asymptotic statements.

3.2 Proof outline and preliminary bounds

The proof idea is as follows. First, we show that discarding the prior has a negligible effect by bounding $\mathrm{TV}(\pi_{v},\pi_{\ell})$. Here, $\pi_{\ell}\propto e^{-n\ell}$. Second, we bound $\mathrm{TV}(\pi_{\ell},\gamma_{\ell})$ by directly applying Theorem 2.2 with $f=\ell$. Here,

\gamma_{\ell}=\mathcal{N}(\hat{\theta},(n\nabla^{2}\ell(\hat{\theta}))^{-1})  (3.2)

is the LA to $\pi_{\ell}$. Third, we bound $\mathrm{TV}(\gamma_{\ell},\gamma^{*})$ by comparing the inverse covariance matrices $n\nabla^{2}\ell(\hat{\theta})$ and $n\nabla^{2}\ell^{*}({\theta^{*}})$ of the two Gaussian distributions (which have the same mean). We then use the triangle inequality to bound $\mathrm{TV}(\pi_{v},\gamma^{*})$ via the sum of the three bounds. To be more precise, we show in this section that these three bounds hold on the event $E(s,\epsilon_{2})$ from Definition 3.1 below. To finish the BvM proof, it remains to prove that this is a high-probability event. We will carry out this final probabilistic step using the particulars of each of our models in the following two sections.

We now introduce several quantities and specify the event $E(s,\epsilon_{2})$.

Definition 3.1.

Let

\begin{split}{\mathcal{U}}^{*}(s)&=\left\{\theta\in\mathbb{R}^{d}\;:\;|\theta-{\theta^{*}}|_{F^{*}}\leq s\sqrt{d/n}\right\},\\ \delta^{*}_{01}(s)&=\sup_{\theta\in{\mathcal{U}}^{*}(s)}\|\nabla\log\pi_{0}(\theta)\|_{F^{*}},\\ M_{0}^{*}&=d^{-1}\log\sup_{\theta\in\Theta}\pi_{0}(\theta)/\pi_{0}({\theta^{*}}),\end{split}  (3.3)

with the convention that $\delta^{*}_{01}(s)=\infty$ if the prior is not $C^{1}$ or not strictly positive on ${\mathcal{U}}^{*}(s)$. Also, let $\delta^{*}_{3}:[0,\infty)\to[0,\infty)$ be some deterministic nondecreasing function to be specified. Then for $s,\epsilon_{2}\geq 0$, define the events

\begin{split}E_{1}(s)&=\left\{\|\nabla\ell({\theta^{*}})\|_{F^{*}}\leq s\sqrt{d/n}\right\},\\ E_{2}(\epsilon_{2})&=\left\{\|\nabla^{2}\ell({\theta^{*}})-F^{*}\|_{F^{*}}\leq\epsilon_{2}\right\},\\ E_{3}(s)&=\left\{\sup_{\theta,\theta^{\prime}\in{\mathcal{U}}^{*}(s)}\frac{\|\nabla^{2}\ell(\theta)-\nabla^{2}\ell(\theta^{\prime})\|_{F^{*}}}{|\theta-\theta^{\prime}|_{F^{*}}}\leq\delta^{*}_{3}(s)\right\}.\end{split}  (3.4)

Finally, let

\begin{split}E(s,\epsilon_{2})&=E_{1}(s)\cap E_{2}(\epsilon_{2})\cap E_{3}(2s),\\ \bar{E}(s,\epsilon_{2})&=\{\exists!\,\mathrm{MLE}\,\hat{\theta},\;|\hat{\theta}-{\theta^{*}}|_{F^{*}}\leq s\sqrt{d/n}\}\cap E_{2}(\epsilon_{2})\cap E_{3}(2s).\end{split}

Here, “$\exists!\,\mathrm{MLE}\,\hat{\theta}$” is an abbreviation for “the MLE exists and is unique.” Also, recall from Section 1.5 that $\|\nabla\ell({\theta^{*}})\|_{F^{*}}=|{F^{*}}^{-1/2}\nabla\ell({\theta^{*}})|$.

We will show that $E(s,\epsilon_{2})\subset\bar{E}(s,\epsilon_{2})$, so that in particular, a unique MLE $\hat{\theta}$ is guaranteed to exist on $E(s,\epsilon_{2})$, and $\hat{\theta}$ is close to ${\theta^{*}}$. We then state our key bounds on the event $\bar{E}(s,\epsilon_{2})$.

Remark 3.2.

Consider the quantity $\delta_{3}$ defined in (2.1), when applied to $f=\ell$. There are several slight differences between $\delta_{3}$ and the above $\delta^{*}_{3}$. First, $\delta_{3}$ is a local Lipschitz constant of $\nabla^{2}\ell$ in the neighborhood ${\mathcal{U}}(r)$, anchored at the point $\hat{\theta}$, whereas $\delta^{*}_{3}(s)$ is an upper bound on a “uniform” Lipschitz constant in the neighborhood ${\mathcal{U}}^{*}(s)$. Second, $\delta^{*}_{3}$ is a deterministic function, bounding this random Lipschitz constant uniformly over all realizations of $\ell$ in the event $E_{3}$. When applying Theorem 2.2 to bound $\mathrm{TV}(\pi_{\ell},\gamma_{\ell})$, we will show that the random quantity $\delta_{3}$ appearing in the upper bound (2.2) can be further bounded by the deterministic $\delta^{*}_{3}$.

Remark 3.3 (Intuition for events $E_{1},E_{2},E_{3}$).

Let us show why we expect $E_{1},E_{2},E_{3}$ to be high-probability events in general (though as mentioned above, we will prove this rigorously using the particulars of our models). First, note that ${\theta^{*}}$ is the global minimizer of $\ell^{*}$, since $\ell^{*}(\theta)=\mathrm{KL}(P^{n}({\theta^{*}})\,||\,P^{n}(\theta))+\mathrm{const.}$ Thus $\nabla\ell^{*}({\theta^{*}})=0$. Second, recall that $F^{*}=\nabla^{2}\ell^{*}({\theta^{*}})$. Thus the events $E_{1},E_{2}$ express that $\nabla^{k}\ell({\theta^{*}})\approx\nabla^{k}\ell^{*}({\theta^{*}})$ for $k=1,2$, respectively. This is reasonable to expect when $n$ is large and $P^{n}_{\theta}$ is a product measure (with $Y^{n}=(Y_{1},\dots,Y_{n})$), since then $\nabla^{k}\ell({\theta^{*}})$ can be written as an average of the $n$ independent random variables $\nabla_{\theta}^{k}\log p_{i}(Y_{i}\mid{\theta^{*}})$, $i=1,\dots,n$. To interpret the event $E_{3}$, consider the case when $\ell\in C^{3}(\Theta)$ with probability 1, and $P^{n}_{\theta}$ is still a product measure. Since $\nabla^{3}\ell\approx\nabla^{3}\ell^{*}$ for large $n$, we expect $\sup_{\theta\in{\mathcal{U}}^{*}(s)}\|\nabla^{3}(\ell-\ell^{*})(\theta)\|_{F^{*}}\leq\epsilon_{3}(s)$ with high probability, for some suitably chosen $\epsilon_{3}(s)$. We can then take $\delta^{*}_{3}(s)=\epsilon_{3}(s)+\sup_{\theta\in{\mathcal{U}}^{*}(s)}\|\nabla^{3}\ell^{*}(\theta)\|_{F^{*}}$.
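A quick numerical illustration of the intuition for $E_{1}$ (a toy logistic model with hypothetical values; here $F^{*}$ is of constant order, so the $F^{*}$-weighted norm behaves like the Euclidean one):

import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 10
X = rng.standard_normal((n, d))
theta_star = np.ones(d) / np.sqrt(d)
p = 1 / (1 + np.exp(-X @ theta_star))
Y = rng.random(n) < p

# grad ell(theta*) = -(1/n) sum_i (Y_i - p_i) x_i: an average of n
# independent mean-zero vectors, so its norm is of order sqrt(d/n).
score = -(X * (Y - p)[:, None]).mean(axis=0)
print(np.linalg.norm(score), np.sqrt(d / n))  # comparable magnitudes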

Lemma 3.4 (Nonasymptotic BvM: preliminary lemma).

Suppose $0\leq\epsilon_{2}\leq 1/2$ and

s\geq 12,\qquad{\mathcal{U}}^{*}(2s)\subset\Theta,\qquad 2s\delta^{*}_{3}(2s)\sqrt{d/n}\leq 1/4.  (3.5)

Then $E(s,\epsilon_{2})\subseteq\bar{E}(s,\epsilon_{2})$, and on $\bar{E}(s,\epsilon_{2})$ it holds that

\mathrm{TV}(\pi_{\ell},\gamma_{\ell})\lesssim\delta^{*}_{3}(2s)\,\frac{d}{\sqrt{n}}+e^{-ds^{2}/36},  (3.6)
\mathrm{TV}(\gamma_{\ell},\gamma^{*})\lesssim s\delta^{*}_{3}(s)\,\frac{d}{\sqrt{n}}+\sqrt{d}\,\epsilon_{2}.  (3.7)

If also $\delta^{*}_{01}(2s)\leq\sqrt{nd}/6$, then we have the following inequality on $\bar{E}(s,\epsilon_{2})$:

\mathrm{TV}(\pi_{v},\pi_{\ell})\lesssim\frac{\delta^{*}_{01}(2s)}{\sqrt{n}}+e^{d[M_{0}^{*}-(s/12)^{2}]}.  (3.8)

In each of the above bounds, the suppressed constant is absolute, independent of all problem parameters.

Let us briefly outline the proof. On $E_{1}(s)$ we have that $\|\nabla\ell({\theta^{*}})\|_{F^{*}}$ is small. By the inverse function theorem, we can then show that there is a point $\hat{\theta}$ near ${\theta^{*}}$ such that $\nabla\ell(\hat{\theta})=0$. Specifically, we show that this point satisfies $|\hat{\theta}-{\theta^{*}}|_{F^{*}}\leq s\sqrt{d/n}$. Next we can combine the conditions from $E_{2}(\epsilon_{2})$ and $E_{3}(s)$ to show $\nabla^{2}\ell(\hat{\theta})\succ 0$. Thus since $\ell$ is convex, we conclude that $\hat{\theta}$ is the unique global minimizer of $\ell$, i.e. it is the MLE. To prove the bound (3.6), we use the definition of $E_{2}(\epsilon_{2})$ and $E_{3}(s)$, and the proximity of $\hat{\theta}$ to ${\theta^{*}}$, to bound the quantity $\delta_{3}$ from Theorem 2.2 via the quantity $\delta^{*}_{3}$. We then directly apply Theorem 2.2. The proof of (3.8) is similar to the proof of Theorem 2.2, with Lemma 2.11 (i.e. the log-Sobolev inequality) as the key tool. Finally, the bound (3.7) is straightforward since $\gamma_{\ell}$ and $\gamma^{*}$ are Gaussians with the same mean, and their inverse covariances are close on the event $E_{2}(\epsilon_{2})$. (Recall the definitions of $\gamma_{\ell}$ and $\gamma^{*}$ from (3.2) and (3.1).) See Appendix B for the full proof.
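To build intuition for the last step, here is a toy Monte Carlo sketch (hypothetical precision matrices, not from the paper) showing that two Gaussians with the same mean and nearby precisions are close in TV:

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 5
P1 = np.eye(d)                                  # stands in for n F^*
P2 = np.eye(d) + 0.05 * np.diag(rng.random(d))  # nearby precision matrix

g1 = multivariate_normal(np.zeros(d), np.linalg.inv(P1))
g2 = multivariate_normal(np.zeros(d), np.linalg.inv(P2))
Xs = g1.rvs(size=100_000, random_state=1)
# TV(g1, g2) = 0.5 * E_{X ~ g1} |1 - g2(X)/g1(X)|
tv = 0.5 * np.mean(np.abs(1.0 - g2.pdf(Xs) / g1.pdf(Xs)))
print(tv, np.linalg.norm(P2 - P1, 'fro'))       # both small, same order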

Given Lemma 3.4, we can finish the nonasymptotic BvM proof by finding $s,\epsilon_{2},\delta^{*}_{3}$ which balance the two requirements that $E(s,\epsilon_{2})$ is a high-probability event and that the righthand sides of (3.6)-(3.8) are small. In certain cases, the choice of $\epsilon_{2}$ and $\delta^{*}_{3}$ is immediate. Specifically, suppose $\ell-\ell^{*}$ is a linear function of $\theta$. This occurs for generalized linear models, as we discuss in Section 4.1. This structure has been termed “stochastically linear” by [33]. The implication of this linearity is that $\nabla^{k}\ell=\nabla^{k}\ell^{*}$ for $k=2,3$. As a result, we can take $\epsilon_{2}=0$ and any $\delta^{*}_{3}(s)$ such that

\sup_{\theta,\theta^{\prime}\in{\mathcal{U}}^{*}(s)}\frac{\|\nabla^{2}\ell^{*}(\theta)-\nabla^{2}\ell^{*}(\theta^{\prime})\|_{F^{*}}}{|\theta-\theta^{\prime}|_{F^{*}}}\leq\delta^{*}_{3}(s).  (3.9)

The events $E_{2}(0)$ and $E_{3}(s)$ are then trivially satisfied with probability 1. Thus the event $E(s,0)$ reduces to $E(s,0)=E_{1}(s)\cap E_{2}(0)\cap E_{3}(2s)=E_{1}(s)$. Note that if $\ell\in C^{3}(\Theta)$ then we can take

\delta^{*}_{3}(s)=\sup_{\theta\in{\mathcal{U}}^{*}(s)}\|\nabla^{3}\ell^{*}(\theta)\|_{F^{*}}.  (3.10)
Lemma 3.5 (Nonasymptotic BvM: preliminary lemma under stochastic linearity).

Suppose $\ell-\ell^{*}$ is a linear function of $\theta$. Let $\delta^{*}_{3}$ be any function satisfying (3.9), and assume (3.5). Then on the event $E_{1}(s)$, the function $\ell$ has a unique global minimizer $\hat{\theta}$, which satisfies $|\hat{\theta}-{\theta^{*}}|_{F^{*}}\leq s\sqrt{d/n}$. Moreover,

\mathrm{TV}(\pi_{\ell},\gamma_{\ell})\lesssim\delta^{*}_{3}(2s)\,\frac{d}{\sqrt{n}}+e^{-ds^{2}/36},  (3.11)
\mathrm{TV}(\gamma_{\ell},\gamma^{*})\lesssim s\delta^{*}_{3}(s)\,\frac{d}{\sqrt{n}},  (3.12)

where the suppressed constants are absolute. If in addition $\delta^{*}_{01}(2s)\leq\sqrt{nd}/6$, then (3.8) also holds on $E_{1}(s)$.

Remark 3.6 (Linear regression setting).

In the linear regression model, we have $Y=\Phi\theta+\frac{1}{\sqrt{n}}\epsilon$, where $Y\in\mathbb{R}^{n}$ is all the data arranged in a column vector, $\Phi\in\mathbb{R}^{n\times d}$, $\theta\in\mathbb{R}^{d}$, and $\epsilon\sim\mathcal{N}(0,I_{n})$. In this model, $\ell$ is a quadratic function of $\theta$, and $\nabla^{2}\ell=\nabla^{2}\ell^{*}\equiv F^{*}$. Thus we have $\pi_{\ell}=\gamma_{\ell}=\gamma^{*}$. We see that $\mathrm{TV}(\pi_{v},\gamma^{*})=\mathrm{TV}(\pi_{v},\pi_{\ell})$, and we can now use the bound from (3.8). Thus in this case, we can expect the BvM to hold under much weaker conditions than $d^{2}\ll n$. Indeed, the work [7] shows that under certain conditions on the prior, $d\ll n$ suffices; see the discussion following Theorem 2 of that work. Note that [7] differs from our setting in that the author allows for misspecification.
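For concreteness, a minimal sketch of this model (toy dimensions, flat prior) in which the posterior is exactly Gaussian:

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
Phi = rng.standard_normal((n, d))
theta_star = np.ones(d)
Y = Phi @ theta_star + rng.standard_normal(n) / np.sqrt(n)

H = Phi.T @ Phi                            # grad^2 ell = F^*, constant in theta
theta_hat = np.linalg.solve(H, Phi.T @ Y)  # MLE = posterior mode
post_cov = np.linalg.inv(n * H)            # exact posterior covariance,
                                           # which also equals the LA covariance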

3.3 Discussion

Assumptions on the prior.

In this discussion we take an asymptotic viewpoint and consider the conditions under which the prior contribution (3.8) can be made arbitrarily small as $n\to\infty$. By definition, all of the above quantities (most notably $d=d_{n}$) are now indexed by $n$. However, for simplicity of notation, we omit the $n$ subscript.

Now, we would like the first term on the righthand side of (3.8) to go to zero as $n\to\infty$ for each fixed $s$. Meanwhile, the quantity $M_{0}^{*}$ in the exponent of the second term should remain bounded as $n\to\infty$. If these two conditions hold, we can then take $s\to\infty$ in order to make the righthand side go to zero and the probability of the event $\bar{E}(s,\epsilon_{2})$ go to 1. We prove the latter fact in our examples, with a proper choice of $\epsilon_{2}$. Thus to summarize, we should have

M_{0}^{*}=\mathcal{O}(1),\qquad\delta^{*}_{01}(2s)=o(\sqrt{n}),\qquad\text{as}\;n\to\infty.  (3.13)

The condition $M_{0}^{*}=\mathcal{O}(1)$ is standard and has been assumed in BvM proofs in the works [15, 16, 5, 25, 32]. Regarding the regularity of $\log\pi_{0}$, we impose the slightly stronger assumption that $\log\pi_{0}$ is $C^{1}$ in a neighborhood of ${\theta^{*}}$, whereas the cited works only require that $\log\pi_{0}$ is Lipschitz in this neighborhood. However, suppose $\log\pi_{0}$ is $C^{1}$, and we use $\delta^{*}_{01}(2s)$ as an upper bound on this Lipschitz constant. Then we have actually relaxed the condition on this quantity, from $\delta^{*}_{01}(2s)=o(\sqrt{n/d})$ in the five cited works to $\delta^{*}_{01}(2s)=o(\sqrt{n})$ here. Strengthening the regularity from Lipschitz to $C^{1}$ allows us to take advantage of the log-Sobolev inequality in ${\mathcal{U}}^{*}(s)$ to obtain a tighter bound on $\mathrm{TV}(\pi_{\ell},\pi_{v})$ than that of prior works.

To get a better sense of the quantities $M_{0}^{*}$ and $\delta^{*}_{01}(s)$, and particularly their magnitude, we compute them for several priors.

Example 3.7.

Flat prior: π0(θ)1\pi_{0}(\theta)\equiv 1. Then M0=0{M_{0}}^{*}=0, δ01(s)0.\delta^{*}_{01}(s)\equiv 0.
Gaussian prior π0=𝒩(μ,Σ)\pi_{0}=\mathcal{N}(\mu,\Sigma):

M012dΣ1F|μθ|F2,δ01(s)Σ1F(|μθ|F+sd/n).\begin{split}{M_{0}}^{*}&\leq\frac{1}{2d}\|\Sigma^{-1}\|_{{F^{*}}}|\mu-{\theta^{*}}|_{F^{*}}^{2},\\ \delta^{*}_{01}(s)&\leq\|\Sigma^{-1}\|_{F^{*}}(|\mu-{\theta^{*}}|_{F^{*}}+s\sqrt{d/n}).\end{split} (3.14)

Multivariate Student’s t prior π0=tν(μ,Σ)\pi_{0}=t_{\nu}(\mu,\Sigma):

M0ν+d2νdΣ1F|μθ|F2,δ01(s)ν+dνΣ1F(|μθ|F+sd/n).\begin{split}{M_{0}}^{*}&\leq\frac{\nu+d}{2\nu d}\|\Sigma^{-1}\|_{F^{*}}|\mu-{\theta^{*}}|_{F^{*}}^{2},\\ \delta^{*}_{01}(s)&\leq\frac{\nu+d}{\nu}\|\Sigma^{-1}\|_{F^{*}}(|\mu-{\theta^{*}}|_{F^{*}}+s\sqrt{d/n}).\end{split} (3.15)

See Appendix B for these calculations. Now suppose s=𝒪(1)s=\mathcal{O}(1) and consider the Gaussian prior 𝒩(μ,Σ)\mathcal{N}(\mu,\Sigma). If, for example, we have

|μθ|F=𝒪(1),Σ1F=𝒪(d),|\mu-{\theta^{*}}|_{F^{*}}=\mathcal{O}(1),\qquad\|\Sigma^{-1}\|_{{F^{*}}}=\mathcal{O}(d),

then (3.14) gives M0=𝒪(1){M_{0}}^{*}=\mathcal{O}(1) and δ01(s)=𝒪(d)\delta^{*}_{01}(s)=\mathcal{O}(d). Now, since we should have d/n=o(1)d/\sqrt{n}=o(1) for our bounds in Lemma 3.4 to be small, it follows that δ01(s)=o(n)\delta^{*}_{01}(s)=o(\sqrt{n}). Therefore, the conditions (3.13) are satisfied. For the Student’s t prior tν(μ,Σ)t_{\nu}(\mu,\Sigma) with ν=𝒪(1)\nu=\mathcal{O}(1) degrees of freedom, if

|μθ|F=𝒪(1),Σ1F=𝒪(1),|\mu-{\theta^{*}}|_{F^{*}}=\mathcal{O}(1),\qquad\|\Sigma^{-1}\|_{{F^{*}}}=\mathcal{O}(1),

then (3.15) gives M0=𝒪(1){M_{0}}^{*}=\mathcal{O}(1) and δ01(s)=𝒪(d)=o(n)\delta^{*}_{01}(s)=\mathcal{O}(d)=o(\sqrt{n}), satisfying the conditions (3.13).
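To make Example 3.7 concrete, the following sketch evaluates the right-hand sides of (3.14) for a synthetic Gaussian prior. It uses the conventions |v|_{F^{*}}=(v^{\intercal}F^{*}v)^{1/2} for vectors and \|A\|_{F^{*}}=\|F^{*-1/2}AF^{*-1/2}\| for matrices; all numerical values are arbitrary illustrations.

```python
import numpy as np

# Evaluate the right-hand sides of (3.14): |v|_{F*} = (v^T F* v)^{1/2} for
# vectors and ||A||_{F*} = ||F*^{-1/2} A F*^{-1/2}||_op for matrices. All
# numerical values are arbitrary illustrations.
rng = np.random.default_rng(1)
n, d, s = 10_000, 20, 1.0

A = rng.standard_normal((d, d))
F_star = A @ A.T / d + np.eye(d)              # a synthetic positive definite F*
L = np.linalg.cholesky(F_star)                # F* = L L^T
Linv = np.linalg.inv(L)

def vec_norm(v):                              # |v|_{F*}
    return np.sqrt(v @ F_star @ v)

def mat_norm(M):                              # ||M||_{F*}: Linv @ M @ Linv.T is
    return np.linalg.norm(Linv @ M @ Linv.T, 2)  # similar to F*^{-1/2} M F*^{-1/2}

theta_star = rng.standard_normal(d)
mu = theta_star + 0.1 * rng.standard_normal(d)
Sigma_inv = d * F_star                        # scaled so that ||Sigma^{-1}||_{F*} = d

w = vec_norm(mu - theta_star)
M0_bound = mat_norm(Sigma_inv) * w**2 / (2 * d)
d01_bound = mat_norm(Sigma_inv) * (w + s * np.sqrt(d / n))
print(f"M0* <= {M0_bound:.3f},  delta01*(s) <= {d01_bound:.3f}")
```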

The condition d2nd^{2}\ll n.

Here we discuss how the “universal” factor d/nd/\sqrt{n} arises in our bounds, and the obstacles to showing that d2nd^{2}\ll n is a necessary condition for the BvM to hold. For simplicity of discussion, we focus on the stochastically linear case. Thus 2=2\nabla^{2}\ell=\nabla^{2}\ell^{*} and therefore in particular, ϵ2=0\epsilon_{2}=0. Also, we assume a flat prior, so that πv=π\pi_{v}=\pi_{\ell}.

Consider the preliminary bounds (3.6) and (3.7) in Lemma 3.4. The inequality (3.6) is a bound on TV(π,γ)\mathrm{TV}(\pi_{\ell},\gamma_{\ell}), the TV error of the LA to π\pi_{\ell}. It has been shown in [20] that d/nd/\sqrt{n} is the sharp rate of approximation for the LA; see that work for the intuition behind this rate. Interestingly, the bound (3.7) on TV(γ,γ)\mathrm{TV}(\gamma_{\ell},\gamma^{*}) is of the same order of magnitude as the LA error bound. Thus our overall bound on TV(π,γ)\mathrm{TV}(\pi_{\ell},\gamma^{*}) is made up of two dominant contributions of order d/nd/\sqrt{n}, stemming from both TV(π,γ)\mathrm{TV}(\pi_{\ell},\gamma_{\ell}) and TV(γ,γ)\mathrm{TV}(\gamma_{\ell},\gamma^{*}).

Let us take a closer look at how d/nd/\sqrt{n} arises in the bound on TV(γ,γ)\mathrm{TV}(\gamma_{\ell},\gamma^{*}). Using that 2=2\nabla^{2}\ell=\nabla^{2}\ell^{*}, we have γ=𝒩(θ^,(n2(θ^))1)\gamma_{\ell}=\mathcal{N}(\hat{\theta},(n\nabla^{2}\ell^{*}(\hat{\theta}))^{-1}) and γ=𝒩(θ^,(n2(θ))1)\gamma^{*}=\mathcal{N}(\hat{\theta},(n\nabla^{2}\ell^{*}({\theta^{*}}))^{-1}). Now, using a result of [12] and a calculation in Lemma E.2, we have

TV(𝒩(μ,Σ1),𝒩(μ,Σ2))Σ11/2Σ2Σ11/2IdFrodτΣ21Σ11Σ11,\mathrm{TV}\left(\mathcal{N}(\mu,\Sigma_{1}),\;\mathcal{N}(\mu,\Sigma_{2})\right)\asymp\|\Sigma_{1}^{-1/2}\Sigma_{2}\Sigma_{1}^{-1/2}-I_{d}\|_{\text{Fro}}\leq\frac{\sqrt{d}}{\tau}\|\Sigma_{2}^{-1}-\Sigma_{1}^{-1}\|_{\Sigma_{1}^{-1}}, (3.16)

where τ>0\tau>0 is such that Σ21τΣ11\Sigma_{2}^{-1}\succeq\tau\Sigma_{1}^{-1}. Here, aba\asymp b means cabCaca\leq b\leq Ca for absolute constants 0<c<C0<c<C. The d\sqrt{d} in this inequality arises by upper bounding the Frobenius norm by an operator norm. In our case, we apply this result with Σ11=n2(θ)=nF\Sigma_{1}^{-1}=n\nabla^{2}\ell^{*}({\theta^{*}})=n{F^{*}} and Σ21=n2(θ^)\Sigma_{2}^{-1}=n\nabla^{2}\ell^{*}(\hat{\theta}). Now on the event E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}), τ\tau can be taken to be an absolute constant, and we have

Σ21Σ11Σ11=2(θ^)2(θ)Fδ3(s)sd/n,\|\Sigma_{2}^{-1}-\Sigma_{1}^{-1}\|_{\Sigma_{1}^{-1}}=\|\nabla^{2}\ell^{*}(\hat{\theta})-\nabla^{2}\ell^{*}({\theta^{*}})\|_{{F^{*}}}\leq\delta^{*}_{3}(s)s\sqrt{d/n},

by the definition of E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}). Thus overall we get

TV(γ,γ)d2(θ^)2(θ)Fd(δ3(s)sd/n).\mathrm{TV}(\gamma_{\ell},\gamma^{*})\lesssim\sqrt{d}\|\nabla^{2}\ell^{*}(\hat{\theta})-\nabla^{2}\ell^{*}({\theta^{*}})\|_{{F^{*}}}\leq\sqrt{d}\left(\delta^{*}_{3}(s)s\sqrt{d/n}\right). (3.17)

We see that the d/nd/\sqrt{n} comes from the d\sqrt{d} prefactor and the fact that θ^\hat{\theta} is at distance d/n\sqrt{d/n} away from θ{\theta^{*}}. Whether or not this upper bound is tight depends on a number of factors, such as whether upper bounding the Frobenius norm by d\sqrt{d} times the operator norm is tight. We leave this question to future work.
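As a numerical illustration of (3.16), the following sketch estimates the TV distance between two equal-mean Gaussians by Monte Carlo and compares it to the Frobenius-norm proxy; the covariance perturbation is an arbitrary choice.

```python
import numpy as np
from scipy.stats import multivariate_normal

# TV(P, Q) = (1/2) E_{X~P} |1 - q(X)/p(X)| for equal-mean Gaussians, estimated
# by Monte Carlo, versus the Frobenius proxy in (3.16). The perturbation of the
# covariance is an arbitrary illustration.
rng = np.random.default_rng(2)
d = 5
A = rng.standard_normal((d, d))
Sigma1 = A @ A.T / d + np.eye(d)
Sigma2 = Sigma1 + 0.05 * np.eye(d)

P = multivariate_normal(mean=np.zeros(d), cov=Sigma1)
Q = multivariate_normal(mean=np.zeros(d), cov=Sigma2)

X = P.rvs(size=200_000, random_state=rng)
tv_mc = 0.5 * np.mean(np.abs(1.0 - np.exp(Q.logpdf(X) - P.logpdf(X))))

L1inv = np.linalg.inv(np.linalg.cholesky(Sigma1))            # L1^{-1} S L1^{-T} is
frob = np.linalg.norm(L1inv @ Sigma2 @ L1inv.T - np.eye(d))  # similar to S1^{-1/2} S S1^{-1/2}
print(f"MC estimate of TV: {tv_mc:.4f};  Frobenius proxy: {frob:.4f}")
```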

Finally, we consider the issue of deriving a lower bound on TV(π,γ)\mathrm{TV}(\pi_{\ell},\gamma^{*}). A natural approach would be to leverage the lower bound on the LA error TV(π,γ)\mathrm{TV}(\pi_{\ell},\gamma_{\ell}) derived in [20] via the following inequality:

TV(π,γ)TV(π,γ)TV(γ,γ).\mathrm{TV}(\pi_{\ell},\gamma^{*})\geq\mathrm{TV}(\pi_{\ell},\gamma_{\ell})-\mathrm{TV}(\gamma_{\ell},\gamma^{*}).

However, it is not currently possible to use such a technique. This is because the lower bound on TV(π,γ)\mathrm{TV}(\pi_{\ell},\gamma_{\ell}) and the upper bound on TV(γ,γ)\mathrm{TV}(\gamma_{\ell},\gamma^{*}) are both of the order d/nd/\sqrt{n}. It may be possible to directly argue that TV(π,γ)TV(π,γ)\mathrm{TV}(\pi_{\ell},\gamma^{*})\gtrsim\mathrm{TV}(\pi_{\ell},\gamma_{\ell}), which would allow us to take advantage of the Laplace lower bound. We leave this question to future work.

4 BvM for generalized linear models and exponential families

In this section, we prove the BvM for generalized linear models (GLMs), which encompass the i.i.d. exponential family setting as a special case. We describe the set-up in Section 4.1 and prove the BvM in Section 4.2; see also the end of the latter section for a comparison with the literature. In Section 4.3, we specialize the general result to the case of a log-concave exponential family. In Section 4.4, we specialize it to the case of logistic regression with Gaussian design.

4.1 Set-up

The general setting we consider is a GLM, in which we observe feature-label pairs (Xi,Yi)(X_{i},Y_{i}), i=1,,ni=1,\dots,n. Here, Xid×kX_{i}\in\mathbb{R}^{d\times k} is a matrix whose columns constitute kk feature vectors in d\mathbb{R}^{d} associated to sample ii. The vector Yi𝒴kY_{i}\in\mathcal{Y}\subseteq\mathbb{R}^{k} is the kk-variate “label” corresponding to XiX_{i}. Given XiX_{i} and a parameter vector θd\theta\in\mathbb{R}^{d}, the model for the distribution of YiY_{i} is

YiXip(Xiθ)dμ,i=1,,nY_{i}\mid X_{i}\sim p(\cdot\mid X_{i}^{\intercal}\theta)d\mu,\qquad i=1,\dots,n (4.1)

for some base measure μ\mu supported on 𝒴\mathcal{Y}. Here, pp is a kk-parameter full, minimal, regular exponential family of the form

p(yω)\displaystyle p(y\mid\omega) =exp(ωyψ(ω)),\displaystyle=\,\mathrm{exp}\left(\omega^{\intercal}y-\psi(\omega)\right), (4.2)
ψ(ω)\displaystyle\psi(\omega) =log𝒴eωy𝑑μ(y).\displaystyle=\log\int_{\mathcal{Y}}e^{\omega^{\intercal}y}d\mu(y). (4.3)
Example 4.1 (Reduction to i.i.d. exponential family).

As an important special case, if Xi=IdX_{i}=I_{d} for all i=1,,ni=1,\dots,n (so in particular, k=dk=d) then the GLM reduces to the following i.i.d. exponential family setting:

Yii.i.d.p(θ)dμ,i=1,,n,Y_{i}\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}p(\cdot\mid\theta)d\mu,\qquad i=1,\dots,n, (4.4)

where pp is as in (4.2).

Because the exponential family is full, minimal, and regular, the domain Ω\Omega, given by

Ω={ωk:𝒴eωyμ(dy)<},\Omega=\left\{\omega\in\mathbb{R}^{k}\;:\;\int_{\mathcal{Y}}e^{\omega^{\intercal}y}\mu(dy)<\infty\right\}, (4.5)

is known to be convex and open. Moreover, ψ\psi from (4.3) is strictly convex and infinitely differentiable in Ω\Omega [4, Chapters 7,8]. The domain Ω\Omega gives rise to a domain of valid θ\theta values, which depends on the chosen features XiX_{i}:

Θ={θd:XiθΩi=1,,n}.\Theta=\{\theta\in\mathbb{R}^{d}\;:\;X_{i}^{\intercal}\theta\in\Omega\quad\forall i=1,\dots,n\}. (4.6)

That Ω\Omega is open and convex in k\mathbb{R}^{k} implies Θ\Theta is open and convex in d\mathbb{R}^{d}.

Example 4.2.

In the i.i.d. exponential family setting, we simply have Θ=Ωd\Theta=\Omega\subseteq\mathbb{R}^{d}.

Example 4.3.

In Poisson, logistic, and binomial regression we have Ω=\Omega=\mathbb{R} and hence Θ=d\Theta=\mathbb{R}^{d}. Similarly, in multinomial logistic regression we have Ω=k\Omega=\mathbb{R}^{k} and hence Θ=d\Theta=\mathbb{R}^{d} as well. When pp is the exponential distribution, which is a one parameter (k=1k=1) exponential family, we have Ω=(,0)\Omega=(-\infty,0). Thus in this case Θ\Theta is given by the intersection of the nn half-spaces {θd:Xiθ<0}\{\theta\in\mathbb{R}^{d}\;:X_{i}^{\intercal}\theta<0\}.

The model (4.1) leads to the following normalized negative log likelihood \ell:

\ell(\theta)=\frac{1}{n}\sum_{i=1}^{n}\left[\psi(X_{i}^{\intercal}\theta)-Y_{i}^{\intercal}X_{i}^{\intercal}\theta\right]. (4.7)

The derivatives of \ell are given as follows. Below, we treat ψ\nabla\psi as a column vector in kk dimensions.

(θ)=1ni=1nXi[ψ(Xiθ)Yi],2(θ)=1ni=1nXi2ψ(Xiθ)Xi,3(θ),u3=1ni=1n3ψ(Xiθ),(Xiu)3.\begin{split}\nabla\ell(\theta)&=\frac{1}{n}\sum_{i=1}^{n}X_{i}\left[\nabla\psi(X_{i}^{\intercal}\theta)-Y_{i}\right],\\ \nabla^{2}\ell(\theta)&=\frac{1}{n}\sum_{i=1}^{n}X_{i}\nabla^{2}\psi(X_{i}^{\intercal}\theta)X_{i}^{\intercal},\\ \langle\nabla^{3}\ell(\theta),u^{\otimes 3}\rangle&=\frac{1}{n}\sum_{i=1}^{n}\langle\nabla^{3}\psi(X_{i}^{\intercal}\theta),(X_{i}^{\intercal}u)^{\otimes 3}\rangle.\end{split} (4.8)
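To make the formulas (4.7) and (4.8) concrete, here is a minimal sketch for Poisson regression (k=1, \psi(\omega)=e^{\omega}), an illustrative special case; it validates the gradient formula against finite differences and checks convexity (cf. Lemma 4.4 below).

```python
import numpy as np

# Poisson regression: psi(w) = e^w, so psi' = psi'' = psi''' = exp.
# Synthetic data; the functions implement (4.7)-(4.8) with k = 1.
rng = np.random.default_rng(3)
n, d = 500, 4
X = rng.standard_normal((n, d))                  # rows are the feature vectors X_i
theta_star = rng.standard_normal(d) / np.sqrt(d)
Y = rng.poisson(np.exp(X @ theta_star))

def ell(theta):
    w = X @ theta
    return np.mean(np.exp(w) - Y * w)

def grad_ell(theta):
    w = X @ theta
    return X.T @ (np.exp(w) - Y) / n

def hess_ell(theta):
    w = X @ theta
    return (X * np.exp(w)[:, None]).T @ X / n

theta = rng.standard_normal(d) / np.sqrt(d)
g, eps = grad_ell(theta), 1e-6
for j in range(d):                               # finite-difference check of (4.8)
    e = np.zeros(d); e[j] = eps
    assert np.isclose((ell(theta + e) - ell(theta - e)) / (2 * eps), g[j], atol=1e-5)
assert np.linalg.eigvalsh(hess_ell(theta)).min() > 0   # convexity, cf. Lemma 4.4
print("gradient matches finite differences; Hessian is positive definite")
```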

We now check Assumption A1. Since ψC(Ω)\psi\in C^{\infty}(\Omega) and ψ\psi is convex, we see from (4.7) that C(Θ)\ell\in C^{\infty}(\Theta) and \ell is convex with probability 1. Furthermore, we see from (4.8) that 2=2\nabla^{2}\ell=\nabla^{2}\ell^{*} since 2\nabla^{2}\ell does not depend on the random YiY_{i}. Thus it remains to check 2(θ)0\nabla^{2}\ell({\theta^{*}})\succ 0, which we do in the following lemma.

Lemma 4.4.

If the linear span of the union of all knkn columns of the matrices XiX_{i}, i=1,,ni=1,\dots,n equals d\mathbb{R}^{d}, then 2(θ)0\nabla^{2}\ell(\theta)\succ 0 for all θΘ\theta\in\Theta.

Proof.

It suffices to show u2(θ)u=0u^{\intercal}\nabla^{2}\ell(\theta)u=0 implies u=0u=0. We have

u2(θ)u=1ni=1nuXi2ψ(Xiθ)Xiumini=1,,nλmin(2ψ(Xiθ))1ni=1n|Xiu|2.\begin{split}u^{\intercal}\nabla^{2}\ell(\theta)u=\frac{1}{n}\sum_{i=1}^{n}u^{\intercal}X_{i}\nabla^{2}\psi(X_{i}^{\intercal}\theta)X_{i}^{\intercal}u\geq\min_{i=1,\dots,n}\lambda_{\min}\left(\nabla^{2}\psi(X_{i}^{\intercal}\theta)\right)\frac{1}{n}\sum_{i=1}^{n}|X_{i}^{\intercal}u|^{2}.\end{split} (4.9)

Since ψ\psi is strictly convex, we have that λmin(2ψ(Xiθ))>0\lambda_{\min}\left(\nabla^{2}\psi(X_{i}^{\intercal}\theta)\right)>0 for all ii. Hence u2(θ)u=0u^{\intercal}\nabla^{2}\ell(\theta)u=0 implies Xiu=0X_{i}^{\intercal}u=0 for all ii. This implies uu is orthogonal to each column of XiX_{i} for all i=1,,ni=1,\dots,n. But if the span of the columns is d\mathbb{R}^{d}, then uu must be zero. ∎

4.2 BvM for GLMs

In this section, we prove the BvM in the setting described above. Note from (4.8) that the random YiY_{i} appears only in the first derivative \nabla\ell. Thus \ell-\ell^{*} is linear. As discussed below Lemma 3.4, this structure leads to some simplifications. In particular, we can apply the more specialized Lemma 3.5, with δ3\delta^{*}_{3} as in (3.10). It remains only to bound from below the probability of the event E1(s)E_{1}(s). First, we record the specific form of the Fisher information matrix F{F^{*}} and δ3\delta^{*}_{3} for our model. We have

F=2(θ)=1ni=1nXi2ψ(Xiθ)Xi{F^{*}}=\nabla^{2}\ell^{*}({\theta^{*}})=\frac{1}{n}\sum_{i=1}^{n}X_{i}\nabla^{2}\psi(X_{i}^{\intercal}{\theta^{*}})X_{i}^{\intercal} (4.10)

and

δ3(s)=supθ𝒰(s),u01ni=1n3ψ(Xiθ),(Xiu)3(1ni=1n2ψ(Xiθ),(XiTu)2)3/2.\begin{split}\delta^{*}_{3}(s)&=\sup_{\theta\in{\mathcal{U}}^{*}(s),\,u\neq 0}\;\frac{\frac{1}{n}\sum_{i=1}^{n}\left\langle\nabla^{3}\psi(X_{i}^{\intercal}\theta),\,(X_{i}^{\intercal}u)^{\otimes 3}\right\rangle}{\left(\frac{1}{n}\sum_{i=1}^{n}\left\langle\nabla^{2}\psi(X_{i}^{\intercal}{\theta^{*}}),\,(X_{i}^{T}u)^{\otimes 2}\right\rangle\right)^{3/2}}.\end{split} (4.11)
Example 4.5 (Key quantities for i.i.d. exponential family setting).

In the i.i.d. exponential family setup, F{F^{*}} and δ3\delta^{*}_{3} reduce to the following:

F=2ψ(θ),δ3(s)=supθ𝒰(s),u03ψ(θ),u32ψ(θ),u23/2.\begin{split}{F^{*}}=\nabla^{2}\psi({\theta^{*}}),\qquad\delta^{*}_{3}(s)=\sup_{\theta\in{\mathcal{U}}^{*}(s),\,u\neq 0}\frac{\left\langle\nabla^{3}\psi(\theta),\,u^{\otimes 3}\right\rangle}{\left\langle\nabla^{2}\psi({\theta^{*}}),u^{\otimes 2}\right\rangle^{3/2}}.\end{split} (4.12)

The following lemma bounds the probability of E1(s)E_{1}(s).

Lemma 4.6.

Let YiXip(Xiθ)Y_{i}\mid X_{i}\sim p(\cdot\mid X_{i}^{\intercal}{\theta^{*}}). Suppose (3.5) is satisfied. Then the event E1(s)E_{1}(s) has probability at least 1exp(s2d/10)1-\exp(-s^{2}d/10).

Let us give some intuition for this result. Both for GLMs and in fact for much more general settings, it holds

𝔼YnPθn[(θ)]=(θ)=0,VarYnPθn[(θ)]=1nF.\mathbb{E}\,_{Y^{n}\sim P^{n}_{\theta^{*}}}[\nabla\ell({\theta^{*}})]=\nabla\ell^{*}({\theta^{*}})=0,\qquad{\mathrm{Var}}_{Y^{n}\sim P^{n}_{\theta^{*}}}[\nabla\ell({\theta^{*}})]=\frac{1}{n}{F^{*}}. (4.13)

We have used the more general notation from Section 3.1. In the setting of GLMs, we have Yn=(Y1,,Yn)Y^{n}=(Y_{1},\dots,Y_{n}), and Pθn=i=1np(Xiθ)dμP^{n}_{\theta^{*}}=\otimes_{i=1}^{n}p(\cdot\mid X_{i}^{\intercal}{\theta^{*}})d\mu. Using (4.13), we find that

n(θ)F=|(F/n)1/2(θ)|=|Var((θ))1/2((θ)𝔼(θ))|,\sqrt{n}\|\nabla\ell({\theta^{*}})\|_{{F^{*}}}=|({F^{*}}/n)^{-1/2}\nabla\ell({\theta^{*}})|=\left|{\mathrm{Var}}(\nabla\ell({\theta^{*}}))^{-1/2}\left(\nabla\ell({\theta^{*}})-\mathbb{E}\,\nabla\ell({\theta^{*}})\right)\right|, (4.14)

where the expectation and variance are with respect to PθnP^{n}_{\theta^{*}}. Now, recalling (4.8), we see that (θ)\nabla\ell({\theta^{*}}) is given by an average of nn independent random variables. Therefore, the Central Limit Theorem suggests Zn:=Var((θ))1/2((θ)𝔼(θ))Z_{n}:={\mathrm{Var}}(\nabla\ell({\theta^{*}}))^{-1/2}\left(\nabla\ell({\theta^{*}})-\mathbb{E}\,\nabla\ell({\theta^{*}})\right) is approximately standard normal. Moreover, we can write the event E1(s)E_{1}(s) as E1(s)={|Zn|sd}E_{1}(s)=\{|Z_{n}|\leq s\sqrt{d}\}. Thus the probability of E1(s)cE_{1}(s)^{c} should scale as a Gaussian tail probability and indeed, Lemma 4.6 shows that (|Zn|sd)es2d/10\mathbb{P}(|Z_{n}|\geq s\sqrt{d})\leq e^{-s^{2}d/10}.
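The following Monte Carlo sketch illustrates this tail behavior in the simplest i.i.d. setting, a Gaussian location family, where Z_{n} is exactly standard normal; the constants are arbitrary illustrations.

```python
import numpy as np

# Gaussian location family: Y_i ~ N(theta*, I_d) i.i.d., psi(theta) = |theta|^2/2.
# Then grad ell(theta*) = theta* - Ybar, F* = I_d, and Z_n = sqrt(n)(theta* - Ybar)
# is exactly N(0, I_d). We check the tail of |Z_n| against the bound of Lemma 4.6.
rng = np.random.default_rng(4)
n, d, s, reps = 100, 10, 1.5, 100_000

theta_star = np.zeros(d)
Ybar = theta_star + rng.standard_normal((reps, d)) / np.sqrt(n)  # exact law of the sample mean
Zn = np.sqrt(n) * (theta_star - Ybar)
tail = np.mean(np.linalg.norm(Zn, axis=1) >= s * np.sqrt(d))
print(f"P(|Z_n| >= s sqrt(d)) ~ {tail:.5f}  vs  bound exp(-s^2 d / 10) = {np.exp(-s**2*d/10):.5f}")
```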

The asymptotic distribution of the norm of a standardized sample mean was first considered in [29]. Later, [35, Section F.3] proved nonasymptotic tail bounds on the norm of a sub-Gaussian or sub-exponential random vector. To be self-contained we give our own proof of Lemma 4.6. The proof does not actually rely on the CLT; we use an ϵ\epsilon-net argument and apply Chernoff’s inequality.

We now conclude the BvM by a direct application of Lemmas 4.6 and 3.5.

Proposition 4.7 (Non-asymptotic BvM for GLMs).

Suppose the columns of the matrices XiX_{i}, i=1,,ni=1,\dots,n span d\mathbb{R}^{d}, let Θ\Theta be as in (4.6), F{F^{*}} as in (4.10), and δ3\delta^{*}_{3} be as in (4.11). Suppose that for some s12s\geq 12, the conditions from (3.5) are satisfied, and δ01(2s)nd/6\delta^{*}_{01}(2s)\leq\sqrt{nd}/6. Then on an event of probability at least 1exp(ds2/10)1-\exp(-ds^{2}/10) with respect to the distribution of the YiY_{i}, we have

TV(πv,γ)δ01(2s)n+sδ3(2s)dn+ed(M0(s/12)2).\mathrm{TV}(\pi_{v},\gamma^{*})\lesssim\frac{\delta^{*}_{01}(2s)}{\sqrt{n}}+s\delta^{*}_{3}(2s)\frac{d}{\sqrt{n}}+e^{d({M_{0}}^{*}-(s/12)^{2})}. (4.15)

We now prove the traditional asymptotic BvM. We assume that d=d_{n} and k=k_{n} may change with n, and for each n we are given n matrices X_{i,n}\in\mathbb{R}^{d_{n}\times k_{n}}, i=1,\dots,n, and a sequence of functions \psi=\psi_{n} in the exponential family model (4.2). This induces the following n-dependent quantities: \Omega=\Omega_{n}\subseteq\mathbb{R}^{k_{n}}, \Theta=\Theta_{n}\subseteq\mathbb{R}^{d_{n}}, F^{*}=F^{*}_{n}, \delta^{*}_{3}=\delta^{*}_{3n}, defined as in Section 4.1. In the definition (4.11) of \delta^{*}_{3}, note that the local neighborhood {\mathcal{U}}^{*}={\mathcal{U}}^{*}_{n} also changes with n. Also, write M_{0}^{*}=M_{0n}^{*} and \delta^{*}_{01}=\delta^{*}_{01n}, which are as in Definition 3.1 for each n. Finally, we write \pi_{v_{n}},\gamma^{*}_{n} to emphasize the dependence on n of the posterior and of the Gaussian in the BvM.

To obtain the asymptotic BvM from the bound (4.15), we take the n\to\infty limit for each fixed s, and then take s\to\infty. Note that if d_{n}\to\infty as n\to\infty, then the second step of taking s\to\infty is not necessary: for any fixed s with (s/12)^{2}>\limsup_{n}M_{0n}^{*}, the third term in (4.15) already vanishes as d_{n}\to\infty.

Corollary 4.8 (Asymptotic BvM for GLMs).

Suppose the linear span of the knnk_{n}n columns of the matrices Xi,nX_{i,n}, i=1,,ni=1,\dots,n equals dn\mathbb{R}^{d_{n}}, and that M0n=𝒪(1)M_{0n}^{*}=\mathcal{O}(1) as nn\to\infty. Also, for each fixed s0s\geq 0, suppose the following hold: (1) the neighborhood 𝒰n(s){\mathcal{U}}^{*}_{n}(s) is contained in Θn\Theta_{n} when nn is large enough, (2) δ01n(s)=o(n)\delta^{*}_{01n}(s)=o(\sqrt{n}), and (3)

δ3n(s)dnn=o(1).\delta^{*}_{3n}(s)\frac{d_{n}}{\sqrt{n}}=o(1).

Then TV(πvn,γn)0\mathrm{TV}(\pi_{v_{n}},\gamma^{*}_{n})\to 0 with probability tending to 1 as nn\to\infty.

This result is as explicit as we can hope for in the very general setting of an arbitrary GLM. Given a particular GLM of interest, it remains to bound the quantity δ3\delta^{*}_{3} from (4.11). In the following two subsections, we consider two special cases in which δ3\delta^{*}_{3} can either be simplified or explicitly bounded. In the first case, a log-concave exponential family, we show that the third derivative of ψ\psi can be bounded in terms of the second derivative. This leads to a simple bound on δ3\delta^{*}_{3} avoiding operator norms of d×d×dd\times d\times d tensors. In the second case, logistic regression with Gaussian design, we show the numerator and denominator in (4.11) are upper- and lower-bounded by absolute constants, respectively. This holds with high probability over the random design.

Let us compare Proposition 4.7 to prior work. The work [32] proves a BvM for generalized linear models with k=1, though the conditions on the design under which the BvM holds are left unspecified. The special case of exponential families was studied in [16] and [5]. As we show in Appendix C.1, an explicit TV bound can be recovered from the proof of [5] and cast as the product of a model-dependent and a universal factor. The work [32] does (essentially) state an explicit TV bound, which can also be brought into this same form. Once this has been done, we see that our BvM tightens the dimension dependence of the universal factor from \sqrt{d^{3}/n} to \sqrt{d^{2}/n}, while our model-dependent factor \delta^{*}_{3} is very analogous to that of [32] and [5].

The work [5] in turn improves on [16] by removing a logd\log d factor.

4.3 Application to log concave exponential families

Similarly to [5], we now consider the case of a log concave exponential family. Recall that in the i.i.d. exponential family setting, we have k=dk=d and Xi=IdX_{i}=I_{d} for all i=1,,ni=1,\dots,n, so that

Yii.i.d.p(θ)dμ,i=1,,n.Y_{i}\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}p(\cdot\mid\theta)d\mu,\qquad i=1,\dots,n. (4.16)

Furthermore, using standard properties of exponential families, we can write the derivatives of ψ\psi in terms of moments of the distribution (4.2). In particular,

ψ(θ)=𝔼θ[Y1],2ψ(θ)=𝔼θ[(Y1𝔼θ[Y1])(Y1𝔼θ[Y1])]=Varθ(Y1),3ψ(θ)=𝔼θ[(Y1𝔼θ[Y1])3].\begin{split}\nabla\psi(\theta)&=\mathbb{E}\,_{\theta}[Y_{1}],\\ \nabla^{2}\psi(\theta)&=\mathbb{E}\,_{\theta}\left[(Y_{1}-\mathbb{E}\,_{\theta}[Y_{1}])(Y_{1}-\mathbb{E}\,_{\theta}[Y_{1}])^{\intercal}\right]={\mathrm{Var}}_{\theta}(Y_{1}),\\ \nabla^{3}\psi(\theta)&=\mathbb{E}\,_{\theta}\left[(Y_{1}-\mathbb{E}\,_{\theta}[Y_{1}])^{\otimes 3}\right].\end{split} (4.17)

Here, 𝔼θ[]\mathbb{E}\,_{\theta}[\cdot] is shorthand for 𝔼Y1p(θ)[]\mathbb{E}\,_{Y_{1}\sim p(\cdot\mid\theta)}[\cdot], and similarly for Varθ(){\mathrm{Var}}_{\theta}(\cdot).

Let us now consider the key quantity δ3\delta^{*}_{3} in the BvM, which for exponential families takes the form given in (4.12). Using the above relationship between derivatives of ψ\psi and centered moments of Y1Y_{1}, we can write δ3\delta^{*}_{3} as follows:

δ3(s)=supθ𝒰(s),u0𝔼θ[(uYu𝔼θ[Y])3]Varθ(uY)3/2.\delta^{*}_{3}(s)=\sup_{\theta\in{\mathcal{U}}^{*}(s),\,u\neq 0}\;\frac{\mathbb{E}\,_{\theta}\left[\left(u^{\intercal}Y-u^{\intercal}\mathbb{E}\,_{\theta}[Y]\right)^{3}\right]}{{\mathrm{Var}}_{\theta^{*}}(u^{\intercal}Y)^{3/2}}. (4.18)

Thus we see that we need to bound a ratio of third and second moments of the distribution p(|θ)p(\cdot|\theta) from (4.2). When this distribution is log concave, it is well known that the third moment can be bounded above in terms of the second moment. This allows us to prove the following lemma.

Lemma 4.9.

Let F(θ)=Varθ(Y)F(\theta)={\mathrm{Var}}_{\theta}(Y) and define

δ2(s)=supθ𝒰(s)F(θ)1/2F(θ)F(θ)1/2=supθ𝒰(s)F(θ)F,\delta^{*}_{2}(s)=\sup_{\theta\in{\mathcal{U}}^{*}(s)}\|F({\theta^{*}})^{-1/2}F(\theta)F({\theta^{*}})^{-1/2}\|=\sup_{\theta\in{\mathcal{U}}^{*}(s)}\|F(\theta)\|_{F^{*}},

where F(θ)=Varθ(Y)=2ψ(θ)F(\theta)={\mathrm{Var}}_{\theta}(Y)=\nabla^{2}\psi(\theta). If the base distribution μ\mu from (4.16) has a log concave density, then

δ3(s)C23δ2(s)3/2\delta^{*}_{3}(s)\leq C_{23}\delta^{*}_{2}(s)^{3/2}

for some absolute constant C23C_{23}, where δ3\delta^{*}_{3} is defined as in (4.18).

Proof.

If μ\mu has a log concave density then so does p(|θ)p(\cdot|\theta), for each θΘ\theta\in\Theta. Now, log concavity is preserved under affine transformations  [30, Section 3.1.1]. Therefore if Yp(θ)Y\sim p(\cdot\mid\theta) then uY𝔼θ[uY]u^{\intercal}Y-\mathbb{E}\,_{\theta}[u^{\intercal}Y] also has a log concave distribution for all udu\in\mathbb{R}^{d}. [24, Theorem 5.22] now gives that

𝔼θ[|uY𝔼θ[uY]|3]C23Varθ(uY)3/2=C23(uF(θ)u)3/2.\mathbb{E}\,_{\theta}\left[\left|u^{\intercal}Y-\mathbb{E}\,_{\theta}[u^{\intercal}Y]\right|^{3}\right]\leq C_{23}{\mathrm{Var}}_{\theta}(u^{\intercal}Y)^{3/2}=C_{23}\left(u^{\intercal}F(\theta)u\right)^{3/2}.

Since u^{\intercal}F(\theta)u\leq\delta^{*}_{2}(s)|u|_{F^{*}}^{2} for all \theta\in{\mathcal{U}}^{*}(s) by the definition of \delta^{*}_{2}, taking the supremum over all u such that |u|_{F^{*}}^{2}=1 and over \theta\in{\mathcal{U}}^{*}(s) concludes the proof.∎
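The moment comparison at the heart of this proof is easy to check empirically. The sketch below (an illustration, not a proof) estimates \mathbb{E}|X-\mathbb{E}X|^{3}/{\mathrm{Var}}(X)^{3/2} for a few standard log concave laws and finds a modest constant in each case.

```python
import numpy as np

# Monte Carlo check of the third-vs-second moment bound for log concave laws:
# E|X - EX|^3 <= C * Var(X)^{3/2} with a modest absolute constant C.
rng = np.random.default_rng(5)
samplers = {
    "exponential": lambda m: rng.exponential(size=m),
    "logistic":    lambda m: rng.logistic(size=m),
    "uniform":     lambda m: rng.uniform(size=m),
}
for name, draw in samplers.items():
    X = draw(2_000_000)
    Xc = X - X.mean()
    ratio = np.mean(np.abs(Xc) ** 3) / np.var(X) ** 1.5
    print(f"{name:12s} E|X-EX|^3 / Var^(3/2) = {ratio:.3f}")
```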

Using this result in Proposition 4.7 leads to the following simplified BvM.

Proposition 4.10 (Nonasymptotic BvM for log concave exponential families).

Suppose the base measure \mu of the exponential family (4.2) has a log concave density. Suppose that for some s\geq 12 it holds

δ2(2s)=supθ𝒰(2s)F(θ)1/2F(θ)F(θ)1/23/2,sd/n(16C23)1,δ01(2s)nd/6,𝒰(2s)Θ.\begin{split}\delta^{*}_{2}(2s)=&\sup_{\theta\in{\mathcal{U}}^{*}(2s)}\|F({\theta^{*}})^{-1/2}F(\theta)F({\theta^{*}})^{-1/2}\|\leq 3/2,\\ s\sqrt{d/n}\leq&(16C_{23})^{-1},\qquad\delta^{*}_{01}(2s)\leq\sqrt{nd}/6,\qquad{\mathcal{U}}^{*}(2s)\subset\Theta.\end{split} (4.19)

Then on an event of probability at least 1exp(ds2/10)1-\exp(-ds^{2}/10) with respect to the distribution of the YiY_{i}, we have

TV(πv,γ)δ01(2s)n+sdn+ed(M0(s/12)2).\begin{split}\mathrm{TV}(\pi_{v},\gamma^{*})\lesssim\frac{\delta^{*}_{01}(2s)}{\sqrt{n}}+s\frac{d}{\sqrt{n}}+e^{d({M_{0}}^{*}-(s/12)^{2})}.\end{split} (4.20)
Proof.

The assumptions and Lemma 4.9 give δ3(2s)(3/2)3/2C232C23\delta^{*}_{3}(2s)\leq(3/2)^{3/2}C_{23}\leq 2C_{23}. Thus in particular sδ3(2s)d/n2C23sd/n1/8.s\delta^{*}_{3}(2s)\sqrt{d/n}\leq 2C_{23}s\sqrt{d/n}\leq 1/8. Therefore the conditions in (3.5) are satisfied. Substituting δ3(2s)1\delta^{*}_{3}(2s)\lesssim 1 in the righthand side of (4.15) in Proposition 4.7 concludes the proof. ∎

In the above proposition and proof, it may seem like we simply imposed a strong assumption — that δ2(2s)\delta^{*}_{2}(2s) is bounded by an absolute constant — in order to conclude that δ3(2s)\delta^{*}_{3}(2s) is also bounded by an absolute constant, using the log concavity. However, we claim that the assumption δ2(2s)3/2\delta^{*}_{2}(2s)\leq 3/2 is actually weaker than the assumption 2sδ3(2s)d/n1/42s\delta^{*}_{3}(2s)\sqrt{d/n}\leq 1/4 we have been using in previous results, including Proposition 4.7. Indeed, to see this, first note that δ2(0)=1\delta^{*}_{2}(0)=1. It then follows by Taylor’s theorem that

δ2(2s)1+supθ𝒰(2s)2ψ(θ)2ψ(θ)F1+δ3(2s)2sd/n32.\delta^{*}_{2}(2s)\leq 1+\sup_{\theta\in{\mathcal{U}}^{*}(2s)}\|\nabla^{2}\psi(\theta)-\nabla^{2}\psi({\theta^{*}})\|_{F^{*}}\leq 1+\delta^{*}_{3}(2s)2s\sqrt{d/n}\leq\frac{3}{2}.

Thus, if one prefers to start with the stronger condition 2sδ3(2s)d/n1/42s\delta^{*}_{3}(2s)\sqrt{d/n}\leq 1/4, we have shown that for log concave families,

δ3(2s)18sn/dδ3(2s)C=(3/2)3/2C23.\delta^{*}_{3}(2s)\leq\frac{1}{8s}\sqrt{n/d}\implies\delta^{*}_{3}(2s)\leq C=(3/2)^{3/2}C_{23}.

4.4 Application to logistic regression with random design

We now apply Proposition 4.7 to logistic regression. In this model, k=1k=1 and the YiY_{i} are binary. In other words, XidX_{i}\in\mathbb{R}^{d} and Yi{0,1}Y_{i}\in\{0,1\} is the corresponding label. The distribution of the random variables YiY_{i} given the XiX_{i} is

YiXiBernoulli(ψ(Xiθ)),ψ(ω):=log(1+eω).Y_{i}\mid X_{i}\sim\mathrm{Bernoulli}(\psi^{\prime}(X_{i}^{\intercal}\theta)),\qquad\psi(\omega):=\log(1+e^{\omega}).

The probability mass function of this Bernoulli random variable can be written in the form (4.2), with ψ\psi as above.

Now, the statement of the BvM in Proposition 4.7 is conditional on the XiX_{i}’s. But by specifying a distribution on the XiX_{i}’s, we can gain insight into the “typical” size (typical with respect to this distribution) of the key quantity δ3(2s)\delta^{*}_{3}(2s) in the bound (4.15). For the distribution, we choose the standard example of i.i.d. Gaussian design:

Xii.i.d.𝒩(0,Id),i=1,,n.X_{i}\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathcal{N}(0,I_{d}),\quad i=1,\dots,n.

The case of a non-identity covariance matrix can be similarly handled by a linear transformation. The following two lemmas will allow us to bound δ3(2s)\delta^{*}_{3}(2s) with high probability with respect to the design.

Lemma 4.11 (Adaptation of Lemma 7, Chapter 3, [36]).

Suppose d<n/2d<n/2. Then for some λ=λ(|θ|)>0\lambda=\lambda(|{\theta^{*}}|)>0 depending only on |θ||{\theta^{*}}|, it holds

(FλId)=(1ni=1nψ′′(Xiθ)XiXiλId)14eCn,\mathbb{P}({F^{*}}\succeq\lambda I_{d})=\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime\prime}(X_{i}^{\intercal}{\theta^{*}})X_{i}X_{i}^{\intercal}\succeq\lambda I_{d}\right)\geq 1-4e^{-Cn},

where CC is an absolute constant (independent of λ\lambda).

The function |θ|λ(|θ|)|{\theta^{*}}|\mapsto\lambda(|{\theta^{*}}|) is nonincreasing, so if |θ||{\theta^{*}}| is bounded above by a constant then λ=λ(|θ|)\lambda=\lambda(|{\theta^{*}}|) is bounded away from zero by a constant.
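The following sketch illustrates Lemma 4.11 empirically: it draws several i.i.d. Gaussian designs and reports the smallest eigenvalue of {F^{*}} for logistic regression, which the lemma predicts stays bounded away from zero with high probability; the sizes and |{\theta^{*}}| are arbitrary illustrations.

```python
import numpy as np

# Empirical illustration of Lemma 4.11: under i.i.d. N(0, I_d) design, the
# Fisher information of logistic regression stays uniformly positive definite
# with high probability. Sizes and theta* are arbitrary.
rng = np.random.default_rng(6)
n, d = 2000, 50
theta_star = rng.standard_normal(d)
theta_star *= 2.0 / np.linalg.norm(theta_star)      # |theta*| = 2

def psi2(w):                                        # psi''(w) for psi = log(1 + e^w)
    p = 1.0 / (1.0 + np.exp(-w))
    return p * (1.0 - p)

lam_min = []
for _ in range(20):                                 # 20 independent designs
    X = rng.standard_normal((n, d))
    w = X @ theta_star
    F_star = (X * psi2(w)[:, None]).T @ X / n
    lam_min.append(np.linalg.eigvalsh(F_star)[0])
print(f"min eigenvalue of F* over 20 draws: {min(lam_min):.3f}")
```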

Lemma 4.12.

If dnd\leq n then there are absolute constants C,CC,C^{\prime} such that

(sup|u|=11ni=1n|uXi|3C(1+d3/2n))14eCn.\begin{split}\mathbb{P}\left(\sup_{|u|=1}\frac{1}{n}\sum_{i=1}^{n}|u^{\intercal}X_{i}|^{3}\leq C\left(1+\frac{d^{3/2}}{n}\right)\right)\geq 1-4e^{-C^{\prime}\sqrt{n}}.\end{split} (4.21)

The lemma follows almost immediately from Theorem 4.2 in [1]; see the appendix for the short calculation. We now combine Lemma 4.11, Lemma 4.12, and the fact that ψ′′′:=supt|ψ′′′(t)|<\|\psi^{\prime\prime\prime}\|_{\infty}:=\sup_{t\in\mathbb{R}}|\psi^{\prime\prime\prime}(t)|<\infty, to derive the following bound on δ3(2s)\delta^{*}_{3}(2s):

δ3(2s)=supθ𝒰(2s),u01ni=1nψ′′′(Xiθ)(Xiu)3(1ni=1nψ′′(Xiθ)(XiTu)2)3/2ψ′′′λ3/2supu01ni=1n|Xiu|3/u3Cψ′′′λ3/2(1+d3/2/n).\begin{split}\delta^{*}_{3}(2s)&=\sup_{\theta\in{\mathcal{U}}^{*}(2s),\,u\neq 0}\;\frac{\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime\prime\prime}(X_{i}^{\intercal}\theta)(X_{i}^{\intercal}u)^{3}}{\left(\frac{1}{n}\sum_{i=1}^{n}\psi^{\prime\prime}(X_{i}^{\intercal}{\theta^{*}})(X_{i}^{T}u)^{2}\right)^{3/2}}\\ &\leq\|\psi^{\prime\prime\prime}\|_{\infty}\lambda^{-3/2}\sup_{u\neq 0}\frac{1}{n}\sum_{i=1}^{n}|X_{i}^{\intercal}u|^{3}/\|u\|^{3}\\ &\leq C\|\psi^{\prime\prime\prime}\|_{\infty}\lambda^{-3/2}\left(1+d^{3/2}/n\right).\end{split} (4.22)

This bound holds on an event of probability at least 18eCn1-8e^{-C\sqrt{n}} for some new constant CC. On this same event, we also have

δ01(s)λ1/2δ¯01(λ1/2s),\delta^{*}_{01}(s)\leq\lambda^{-1/2}\bar{\delta}^{*}_{01}\left(\lambda^{-1/2}s\right), (4.23)

where

δ¯01(t)=sup|θθ|td/n|logπ0(θ)|.\bar{\delta}^{*}_{01}(t)=\sup_{|\theta-{\theta^{*}}|\leq t\sqrt{d/n}}|\nabla\log\pi_{0}(\theta)|.

Using the bounds (4.22) and (4.23) in Proposition 4.7, we conclude the following non-asymptotic BvM.

Corollary 4.13 (Non-asymptotic BvM for logistic regression with random design).

Suppose |θ|C0|{\theta^{*}}|\leq C_{0} and d3/2C0nd^{3/2}\leq C_{0}n for an absolute constant C0C_{0}. There exists a constant C1C_{1} depending only on C0C_{0} and an absolute constant C2C_{2} such that if

s12,C1sd/n1,C1δ¯01(C1s)nds\geq 12,\qquad C_{1}s\sqrt{d/n}\leq 1,\qquad C_{1}\bar{\delta}^{*}_{01}(C_{1}s)\leq\sqrt{nd}

then

TV(πv,γ)δ¯01(C1s)n+sdn+ed[M0(s/12)2]\begin{split}\mathrm{TV}(\pi_{v},\gamma^{*})&\lesssim\frac{\bar{\delta}^{*}_{01}(C_{1}s)}{\sqrt{n}}+\frac{sd}{\sqrt{n}}+e^{d[{M_{0}}^{*}-(s/12)^{2}]}\end{split} (4.24)

with probability at least 18eC2neds2/101-8e^{-C_{2}\sqrt{n}}-e^{-ds^{2}/10} with respect to the joint feature and label distribution. The suppressed constant in (4.24) depends only on C0C_{0}.

We now take nn\to\infty and then ss\to\infty to prove the following asymptotic BvM.

Corollary 4.14 (Asymptotic BvM for logistic regression with random design).

Let θndn\theta^{*}_{n}\in\mathbb{R}^{d_{n}} be a sequence of ground truth vectors such that |θn|=𝒪(1)|\theta^{*}_{n}|=\mathcal{O}(1) as nn\to\infty. Write vn,γn,M0n,δ¯01nv_{n},\gamma^{*}_{n},M_{0n}^{*},\bar{\delta}^{*}_{01n} to emphasize the dependence of these quantities on nn. If M0n=𝒪(1)M_{0n}^{*}=\mathcal{O}(1), δ¯01n(s)=o(n)\bar{\delta}_{01n}^{*}(s)=o(\sqrt{n}) for each s0s\geq 0, and

dn2/n=o(1),d_{n}^{2}/n=o(1),

then TV(πvn,γn)0\mathrm{TV}(\pi_{v_{n}},\gamma^{*}_{n})\to 0 with probability tending to 11 as n.n\to\infty. The probability is with respect to the joint feature-label distribution Xii.i.d.𝒩(0,Idn)X_{i}\stackrel{{\scriptstyle\mathrm{i.i.d.}}}{{\sim}}\mathcal{N}(0,I_{d_{n}}), Yi|XiBer(ψ(Xiθn))Y_{i}|X_{i}\sim\mathrm{Ber}(\psi^{\prime}(X_{i}^{\intercal}\theta^{*}_{n})), i=1,,ni=1,\dots,n.

To our knowledge, this is the first BvM that accounts for the randomness of the design in the high-dimensional regime.

5 Observations of a discrete probability distribution

In this section, we consider a model in which the parameter of interest is the vector of probabilities of a finite-state probability mass function (pmf) {\theta^{*}}=(\theta_{0}^{*},\theta_{1}^{*},\dots,\theta_{d}^{*}), and we are given n i.i.d. draws from {\theta^{*}}. This is equivalent to observing a single sample Y\sim\mathrm{Multi}(n,{\theta^{*}}), the multinomial distribution with n trials and probabilities \theta_{0}^{*},\theta_{1}^{*},\dots,\theta_{d}^{*}. A BvM for this model, in which d can grow large, was first proved in [8]. We compare our result to that of [8] at the end of Section 5.2.

There is a reparameterization which turns the model into an exponential family; the works [16] and [5] on BvMs for exponential families both studied it in this form. However, a BvM for one form of the model does not immediately imply the BvM for the other form; as noted in [8], the effect of reparameterization on asymptotic normality must first be established.

Proofs of omitted results in this section can be found in Appendix D.

5.1 Set-up

The d+1d+1 probabilities of a pmf over states 0,1,,d0,1,\dots,d add up to 1. We therefore define the parameter space to be the probabilities of states 1,,d1,\dots,d only:

\Theta=\{\theta=(\theta_{1},\dots,\theta_{d})\;\mid\;\theta_{1},\dots,\theta_{d}>0,\;\;\theta_{1}+\dots+\theta_{d}<1\}. (5.1)

The boundary cases, in which some \theta_{j} equals zero or \theta_{1}+\dots+\theta_{d}=1 (so that \theta_{0}=0), are degenerate, so we exclude them. Now, to every point \theta\in\Theta, we associate the value

θ0=1j=1dθj=1𝟙θ.\theta_{0}=1-\sum_{j=1}^{d}\theta_{j}=1-\mathds{1}^{\intercal}\theta.

Note that if θΘ\theta\in\Theta then (θ0,,θd)(\theta_{0},\dots,\theta_{d}) is a pmf on d+1d+1 states such that each probability θj,j=0,,d\theta_{j},j=0,\dots,d is strictly positive. We will often abuse notation by interchangeably using θ\theta to denote either (θ0,,θd)(\theta_{0},\dots,\theta_{d}) or (θ1,,θd)(\theta_{1},\dots,\theta_{d}). One need only remember that θ1,,θd\theta_{1},\dots,\theta_{d} are free parameters, while θ0\theta_{0} is determined from the others. We observe counts N=(N0,N1,,Nd)Multi(n,θ)N=(N_{0},N_{1},\dots,N_{d})\sim\mathrm{Multi}(n,{\theta^{*}}), where θ=(θ0,θ1,,θd){\theta^{*}}=(\theta_{0}^{*},\theta_{1}^{*},\dots,\theta_{d}^{*}) is the ground truth pmf for some (θ1,,θd)Θ(\theta_{1}^{*},\dots,\theta_{d}^{*})\in\Theta. For the proportions corresponding to the counts, we use the notation

N¯=(N¯0,,N¯d),N¯j:=1nNj.\bar{N}=(\bar{N}_{0},\dots,\bar{N}_{d}),\quad\bar{N}_{j}:=\frac{1}{n}N_{j}.

Define also

θmin=minj=0,,dθj,{\theta_{\mathrm{min}}^{*}}=\min_{j=0,\dots,d}\theta_{j}^{*},

which will play an important role in describing how far θ{\theta^{*}} is from the boundary of Θ\Theta. Now, the multinomial observations give rise to the likelihood L(θ)=j=0dθjNjL(\theta)=\prod_{j=0}^{d}\theta_{j}^{N_{j}}, and negative normalized log likelihood =1nlogL\ell=-\frac{1}{n}\log L:

(θ)=j=0dN¯jlogθj=N¯0log(1𝟙θ)j=1dN¯jlogθj.\ell(\theta)=-\sum_{j=0}^{d}\bar{N}_{j}\log\theta_{j}=-\bar{N}_{0}\log(1-\mathds{1}^{\intercal}\theta)-\sum_{j=1}^{d}\bar{N}_{j}\log\theta_{j}. (5.2)

The first three derivatives of \ell, with respect to the free parameters θ1,,θd\theta_{1},\dots,\theta_{d}, are given as follows:

(θ)=(N¯jθj)j=1d+N¯0θ0𝟙,2(θ)=diag((N¯jθj2)j=1d)+N¯0θ02𝟙𝟙3(θ)=2diag((N¯jθj3)j=1d)+2N¯0θ03𝟙3.\begin{split}\nabla\ell(\theta)&=-\bigg{(}\frac{\bar{N}_{j}}{\theta_{j}}\bigg{)}_{j=1}^{d}+\frac{\bar{N}_{0}}{\theta_{0}}\mathds{1},\\ \nabla^{2}\ell(\theta)&=\mathrm{diag}\bigg{(}\bigg{(}\frac{\bar{N}_{j}}{\theta_{j}^{2}}\bigg{)}_{j=1}^{d}\bigg{)}+\frac{\bar{N}_{0}}{\theta_{0}^{2}}\mathds{1}\mathds{1}^{\intercal}\\ \nabla^{3}\ell(\theta)&=-2\mathrm{diag}\bigg{(}\bigg{(}\frac{\bar{N}_{j}}{\theta_{j}^{3}}\bigg{)}_{j=1}^{d}\bigg{)}+2\frac{\bar{N}_{0}}{\theta_{0}^{3}}\mathds{1}^{\otimes 3}.\end{split} (5.3)

Next, recall that F=2(θ)=𝔼[2(θ)]{F^{*}}=\nabla^{2}\ell^{*}({\theta^{*}})=\mathbb{E}\,[\nabla^{2}\ell({\theta^{*}})]. Using that 𝔼[N¯]=θ\mathbb{E}\,[\bar{N}]={\theta^{*}} gives

F=2(θ)=diag((1θj)j=1d)+1θ0𝟙𝟙.{F^{*}}=\nabla^{2}\ell^{*}({\theta^{*}})=\mathrm{diag}\bigg{(}\bigg{(}\frac{1}{\theta_{j}^{*}}\bigg{)}_{j=1}^{d}\bigg{)}+\frac{1}{\theta_{0}^{*}}\mathds{1}\mathds{1}^{\intercal}. (5.4)

It is clear from (5.2) that C(Θ)\ell\in C^{\infty}(\Theta). Also, we see from the second equation in (5.3) that 2(θ)0\nabla^{2}\ell(\theta)\succeq 0 for all θΘ\theta\in\Theta with probability 1. Hence, \ell is convex on Θ\Theta. Finally, since θΘ{\theta^{*}}\in\Theta (i.e. all θj\theta_{j}^{*} are strictly positive), we see from (5.4) that F0{F^{*}}\succ 0. Thus Assumption A1 is satisfied for this model.

Remark 5.1.

Suppose N¯Θ\bar{N}\in\Theta, which implies the counts NjN_{j}, j=0,,dj=0,\dots,d are all strictly positive. Since \ell is convex, any strict local minimizer θ^\hat{\theta} of \ell in Θ\Theta must be the unique global minimizer, i.e. the MLE. But we see from the first two equations in (5.3) that (N¯)=0\nabla\ell(\bar{N})=0 and 2(N¯)0\nabla^{2}\ell(\bar{N})\succ 0. Thus N¯\bar{N} is the unique MLE as long as N¯Θ\bar{N}\in\Theta.
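The following sketch illustrates Remark 5.1 and the formula (5.4) numerically: it draws multinomial counts from an arbitrary illustrative pmf, verifies that \nabla\ell(\bar{N})=0, and assembles {F^{*}}.

```python
import numpy as np

# Draw N ~ Multi(n, theta*), form Nbar, and verify grad ell(Nbar) = 0, so that
# Nbar is the unique MLE when all counts are positive (Remark 5.1). The pmf is
# an arbitrary illustration.
rng = np.random.default_rng(7)
theta_full = np.array([0.30, 0.25, 0.20, 0.15, 0.10])   # (theta_0*, ..., theta_d*)
d, n = theta_full.size - 1, 10_000
N = rng.multinomial(n, theta_full)
Nbar = N / n

def grad_ell(theta):                 # the gradient in (5.3), in the free parameters
    theta0 = 1.0 - theta.sum()
    return -Nbar[1:] / theta + (Nbar[0] / theta0) * np.ones(d)

assert (N > 0).all()                 # all counts positive, with overwhelming probability here
assert np.allclose(grad_ell(Nbar[1:]), 0.0)

F_star = np.diag(1.0 / theta_full[1:]) + np.ones((d, d)) / theta_full[0]   # (5.4)
print("grad ell(Nbar) = 0 verified; F* assembled with shape", F_star.shape)
```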

As usual, we define

𝒰(r)={θd:|θθ|Frd/n}.{\mathcal{U}}^{*}(r)=\{\theta\in\mathbb{R}^{d}:|\theta-{\theta^{*}}|_{F^{*}}\leq r\sqrt{d/n}\}. (5.5)
Definition 5.2 (Chi-squared divergence).

Let θ=(θ0,,θd)\theta=(\theta_{0},\dots,\theta_{d}) be a strictly positive pmf on d+1d+1 states. Let ω\omega be either ω=(ω0,,ωd)d+1\omega=(\omega_{0},\dots,\omega_{d})\in\mathbb{R}^{d+1} such that ω𝟙=1\omega^{\intercal}\mathds{1}=1, or ω=(ω1,,ωd)d\omega=(\omega_{1},\dots,\omega_{d})\in\mathbb{R}^{d}. In the latter case, set ω0=1j=1dωj\omega_{0}=1-\sum_{j=1}^{d}\omega_{j}. Then we define

χ2(ω||θ)=j=0d(ωjθj)2/θj.\chi^{2}(\omega||\theta)=\sum_{j=0}^{d}(\omega_{j}-\theta_{j})^{2}/\theta_{j}.
Remark 5.3.

We will make frequent use of the bound

maxj=0,,d|θjθj1|2χ2(θ||θ)/θmin,\max_{j=0,\dots,d}\left|\frac{\theta_{j}}{\theta_{j}^{*}}-1\right|^{2}\leq\chi^{2}(\theta||{\theta^{*}})/{\theta_{\mathrm{min}}^{*}}, (5.6)

which follows directly from the above definition of the \chi^{2} divergence.

We now show that the neighborhoods {\mathcal{U}}^{*}(r) from (5.5) are \chi^{2} balls around {\theta^{*}}. We also characterize how large r can be to ensure {\mathcal{U}}^{*}(r) remains a subset of \Theta, and describe a useful property of pmfs in {\mathcal{U}}^{*}(r).

Lemma 5.4.

The neighborhoods 𝒰(r){\mathcal{U}}^{*}(r) from (5.5) are equivalently given by

𝒰(r)={θd:χ2(θ||θ)r2d/n}.{\mathcal{U}}^{*}(r)=\{\theta\in\mathbb{R}^{d}\;:\;\chi^{2}(\theta||{\theta^{*}})\leq r^{2}d/n\}.

If r2d/n<θmin/4r^{2}d/n<{\theta_{\mathrm{min}}^{*}}/4, then 𝒰(2r)Θ{\mathcal{U}}^{*}(2r)\subset\Theta and 𝒰(r){θΘ:θjθj/2j=0,1,,d}{\mathcal{U}}^{*}(r)\subset\{\theta\in\Theta\;:\theta_{j}\geq\theta_{j}^{*}/2\;\,\forall j=0,1,\dots,d\}.

Remark 5.5.

The lemma shows that how large rr can be to ensure 𝒰(2r)Θ{\mathcal{U}}^{*}(2r)\subset\Theta depends on how small θmin{\theta_{\mathrm{min}}^{*}} is. This explains our earlier comment that θmin{\theta_{\mathrm{min}}^{*}} controls how far θ{\theta^{*}} is from the boundary of Θ\Theta.
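As a quick numerical check of Lemma 5.4, the sketch below verifies the identity |\theta-{\theta^{*}}|_{F^{*}}^{2}=\chi^{2}(\theta||{\theta^{*}}) at a few nearby points; the pmf is an arbitrary illustration.

```python
import numpy as np

# Check |theta - theta*|_{F*}^2 == chi^2(theta || theta*) (Lemma 5.4), with F*
# from (5.4). The ground truth pmf is an arbitrary illustration.
theta_full = np.array([0.30, 0.25, 0.20, 0.15, 0.10])   # (theta_0*, ..., theta_d*)
d = theta_full.size - 1
F_star = np.diag(1.0 / theta_full[1:]) + np.ones((d, d)) / theta_full[0]

rng = np.random.default_rng(8)
for _ in range(5):
    theta = theta_full[1:] + 0.01 * rng.standard_normal(d)   # a nearby point in Theta
    v = theta - theta_full[1:]
    quad = v @ F_star @ v                                    # |theta - theta*|_{F*}^2
    chi2 = (theta.sum() - theta_full[1:].sum())**2 / theta_full[0] \
           + np.sum(v**2 / theta_full[1:])
    assert np.isclose(quad, chi2)
print("|theta - theta*|_{F*}^2 == chi^2(theta || theta*) verified")
```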

5.2 BvM proof

In this section, we prove the BvM by applying Lemma 3.4, though we will not use event E(s,ϵ2)E(s,\epsilon_{2}). Instead, we define a different event E0(s)E_{0}(s) and show that E0(s)E¯(s,ϵ2)E_{0}(s)\subset\bar{E}(s,\epsilon_{2}) for an appropriate choice of ϵ2=ϵ2(s)\epsilon_{2}=\epsilon_{2}(s). Specifically, for s0s\geq 0, let

E0(s):={N¯𝒰(s)}={χ2(N¯||θ)s2d/n}={|N¯θ|F2s2d/n}.E_{0}(s):=\left\{\bar{N}\in{\mathcal{U}}^{*}(s)\right\}=\left\{\chi^{2}(\bar{N}||{\theta^{*}})\leq s^{2}d/n\right\}=\left\{|\bar{N}-{\theta^{*}}|_{F^{*}}^{2}\leq s^{2}d/n\right\}.

We prove that if s is larger than a sufficiently large absolute constant and s^{2}d/(n{\theta_{\mathrm{min}}^{*}}) is smaller than a sufficiently small absolute constant, then:

(1) on E0(s)E_{0}(s), there exists a unique MLE θ^\hat{\theta} satisfying |θ^θ|Fsd/n|\hat{\theta}-{\theta^{*}}|_{F^{*}}\leq s\sqrt{d/n},

(2) on E_{0}(s), it holds \|\nabla^{2}\ell({\theta^{*}})-{F^{*}}\|_{F^{*}}\leq\epsilon_{2}(s):=\sqrt{s^{2}d/(n{\theta_{\mathrm{min}}^{*}})},

(3) on E0(s)E_{0}(s), it holds supθ𝒰(2s)3(θ)Fδ3(2s):=C/θmin\sup_{\theta\in{\mathcal{U}}^{*}(2s)}\|\nabla^{3}\ell(\theta)\|_{{F^{*}}}\leq\delta^{*}_{3}(2s):=C/\sqrt{{\theta_{\mathrm{min}}^{*}}},

(4) E0(s)E_{0}(s) has probability at least 1eCs2d1-e^{-Cs^{2}d}.

The proof of (1) is immediate. Indeed, if s^{2}d/(n{\theta_{\mathrm{min}}^{*}}) is small enough then {\mathcal{U}}^{*}(s)\subset\Theta by Lemma 5.4, and we know that \bar{N}\in{\mathcal{U}}^{*}(s) on the event E_{0}(s). Hence \bar{N}\in\Theta, and Remark 5.1 then shows \hat{\theta}=\bar{N} is the unique MLE. Since \bar{N}\in{\mathcal{U}}^{*}(s), we also have |\bar{N}-{\theta^{*}}|_{F^{*}}\leq s\sqrt{d/n} by definition. See Appendix D for the proofs of (2), (3), and (4). Note that the result (3) holds only for s^{2}d/(n{\theta_{\mathrm{min}}^{*}}) small enough; we do not expect a uniform-in-s bound on \sup_{\theta\in{\mathcal{U}}^{*}(2s)}\|\nabla^{3}\ell(\theta)\|_{{F^{*}}}. To prove (4) we use a result from the BvM proof of [8] for this model, which is in turn based on Talagrand’s inequality for the suprema of empirical processes [26].

Remark 5.6.

Since the ground truth pmf values add up to 1, it follows that θmin1/(d+1){\theta_{\mathrm{min}}^{*}}\leq 1/(d+1). Therefore the upper bound in (3) on the third derivative tensor operator norm scales as d\sqrt{d}. This bound is tight. To see this, suppose for simplicity that N¯j=θj\bar{N}_{j}=\theta_{j}^{*} exactly. This allows us to compute 3(θ)F\|\nabla^{3}\ell({\theta^{*}})\|_{F^{*}} explicitly; see Appendix D. The result is that

supθ𝒰(2s)3(θ)F3(θ)F=212θminθmin1θmin1θmin.\sup_{\theta\in{\mathcal{U}}^{*}(2s)}\|\nabla^{3}\ell(\theta)\|_{{F^{*}}}\geq\|\nabla^{3}\ell({\theta^{*}})\|_{F^{*}}=2\frac{1-2{\theta_{\mathrm{min}}^{*}}}{\sqrt{{\theta_{\mathrm{min}}^{*}}}\sqrt{1-{\theta_{\mathrm{min}}^{*}}}}\sim\frac{1}{\sqrt{\theta_{\mathrm{min}}^{*}}}. (5.7)
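The following sketch probes (5.7) numerically: it maximizes \langle\nabla^{3}\ell({\theta^{*}}),u^{\otimes 3}\rangle over |u|_{F^{*}}=1 by random restarts (a heuristic lower bound on the operator norm) and compares the result to the closed form; the pmf is an arbitrary illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Maximize <grad^3 ell(theta*), u^{x3}> over |u|_{F*} = 1 (with Nbar = theta*)
# by random restarts, and compare to the closed form in (5.7). The restarts
# give a lower bound on the operator norm; the pmf is an arbitrary illustration.
theta_full = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
d = theta_full.size - 1
t0, t = theta_full[0], theta_full[1:]
F_star = np.diag(1.0 / t) + np.ones((d, d)) / t0
Linv = np.linalg.inv(np.linalg.cholesky(F_star))     # u = Linv.T @ v has |u|_{F*} = |v|

def T3(u):                                           # third derivative form from (5.3)
    return -2.0 * np.sum(u**3 / t**2) + 2.0 * (u.sum()**3) / t0**2

def neg_abs(v):
    v = v / np.linalg.norm(v)
    return -abs(T3(Linv.T @ v))

rng = np.random.default_rng(9)
best = max(-minimize(neg_abs, rng.standard_normal(d)).fun for _ in range(50))
tmin = theta_full.min()
closed = 2 * (1 - 2 * tmin) / np.sqrt(tmin * (1 - tmin))
print(f"numerical maximum {best:.4f}  vs  closed form (5.7): {closed:.4f}")
```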

Let us show that (1)-(4) combined with Lemma 3.4 finish the BvM proof. First, provided s^{2}d/(n{\theta_{\mathrm{min}}^{*}}) is small enough, (1)-(4) give that {\mathcal{U}}^{*}(2s)\subseteq\Theta, \epsilon_{2}\leq 1/2, and 2s\delta^{*}_{3}(2s)\sqrt{d/n}\leq 1/4. Thus (3.5) is satisfied. Second, (1)-(4) show that E_{0}(s)\subset\bar{E}(s,\epsilon_{2}), and hence the bounds in Lemma 3.4 are satisfied on E_{0}(s). Finally, in (4) we obtain a lower bound on the probability of E_{0}(s). Thus the BvM is proved. Substituting \delta^{*}_{3}(2s)=C/\sqrt{{\theta_{\mathrm{min}}^{*}}} into the bounds from Lemma 3.4, we obtain the following corollary.

Corollary 5.7.

Suppose s is larger than a sufficiently large absolute constant, s^{2}d/(n{\theta_{\mathrm{min}}^{*}}) is smaller than a sufficiently small absolute constant, and \delta^{*}_{01}(2s)\leq\sqrt{nd}/6, where \delta^{*}_{01} is as in Definition 3.1 with {F^{*}} as in (5.4). Then on an event of probability at least 1-\,\mathrm{exp}(-Cds^{2}), it holds

TV(πv,γ)sdnθmin+δ01(2s)n+ed(M0Cs2).\begin{split}\mathrm{TV}(\pi_{v},\gamma^{*})\lesssim\frac{sd}{\sqrt{n{\theta_{\mathrm{min}}^{*}}}}+\frac{\delta^{*}_{01}(2s)}{\sqrt{n}}+e^{d({M_{0}}^{*}-Cs^{2})}.\end{split} (5.8)
Corollary 5.8 (Asymptotic BvM for discrete probability distribution).

Let d=dnd=d_{n} and Θn\Theta_{n} be the corresponding sequence of parameter spaces (5.1). Let θnΘn\theta^{*}_{n}\in\Theta_{n} be a sequence of ground truth pmfs, and write vn,γn,M0n,δ01n,θmin,nv_{n},\gamma^{*}_{n},M_{0n}^{*},\delta^{*}_{01n},\theta^{*}_{\mathrm{min},n} to emphasize the dependence of these quantities on nn. If M0n=𝒪(1)M_{0n}^{*}=\mathcal{O}(1), δ01n(s)=o(n)\delta^{*}_{01n}(s)=o(\sqrt{n}) for each s0s\geq 0, and if

dn2nθmin,n=o(1),\frac{d_{n}^{2}}{n\theta^{*}_{\mathrm{min},n}}=o(1),

then TV(πvn,γn)0\mathrm{TV}(\pi_{v_{n}},\gamma^{*}_{n})\to 0 with probability tending to 1 as nn\to\infty.

This result is stronger than the BvM result of [8] in two ways. First, we assume only that the prior satisfies \delta^{*}_{01n}(s)=o(\sqrt{n}), whereas [8] essentially requires \delta^{*}_{01n}(s)=o(1), provided one defines \delta^{*}_{01n}(s) as a Lipschitz constant; recall the discussion at the end of Section 3.2. Second, we require only that d_{n}^{2}/(n\theta^{*}_{\mathrm{min},n})=o(1), whereas [8] requires d_{n}^{3}/(n\theta^{*}_{\mathrm{min},n})=o(1).

Appendix A Proofs from Section 2

Recall that

δ3(r)=sup|θθ^|Hrd/n2f(θ)2f(θ^)H|θθ^|H,H=2f(θ^),\delta_{3}(r)=\sup_{|\theta-\hat{\theta}|_{H}\leq r\sqrt{d/n}}\frac{\|\nabla^{2}f(\theta)-\nabla^{2}f(\hat{\theta})\|_{H}}{|\theta-\hat{\theta}|_{H}},\quad H=\nabla^{2}f(\hat{\theta}), (A.1)

where θ^\hat{\theta} is the unique minimizer of ff.

A.1 Proofs from Section 2.1

Proof of Lemma 2.10.

To prove the lower bound on 2f(θ)\nabla^{2}f(\theta) in 𝒰(r){\mathcal{U}}(r), note that

2f(θ)HH=2f(θ)2f(θ^)Hδ3(r)θθ^Hrd/nδ3(r)1/2,\|\nabla^{2}f(\theta)-H\|_{H}=\|\nabla^{2}f(\theta)-\nabla^{2}f(\hat{\theta})\|_{H}\leq\delta_{3}(r)\|\theta-\hat{\theta}\|_{H}\leq r\sqrt{d/n}\delta_{3}(r)\leq 1/2, (A.2)

using the assumption on δ3(r)\delta_{3}(r) from Theorem 2.2. Hence 2f(θ)12H\nabla^{2}f(\theta)\succeq\frac{1}{2}H for all θ𝒰(r)\theta\in{\mathcal{U}}(r). To prove the linear growth bound, recall that ff is convex by assumption. Fix a point θΘ\theta\in\Theta such that |θθ^|Hrd/n|\theta-\hat{\theta}|_{H}\geq r\sqrt{d/n}. Then the whole segment between θ^\hat{\theta} and θ\theta is also in Θ\Theta. In particular, convexity of ff gives

f(θ)f(θ^)1t[f(θ^+t(θθ^))f(θ^)],t=rd/n|θθ^|H.\begin{split}f(\theta)-f(\hat{\theta})\geq\frac{1}{t}\left[f(\hat{\theta}+t(\theta-\hat{\theta}))-f(\hat{\theta})\right],\quad t=\frac{r\sqrt{d/n}}{|\theta-\hat{\theta}|_{H}}.\end{split}

Dividing both sides by |\theta-\hat{\theta}|_{H}, we get

f(θ)f(θ^)|θθ^|Hf(θ^+t(θθ^))f(θ^)|t(θθ^)|Hinf|u|H=rd/nf(θ^+u)f(θ^)|u|H=:ψ.\begin{split}\frac{f(\theta)-f(\hat{\theta})}{|\theta-\hat{\theta}|_{H}}\geq\frac{f(\hat{\theta}+t(\theta-\hat{\theta}))-f(\hat{\theta})}{|t(\theta-\hat{\theta})|_{H}}\geq\inf_{|u|_{H}=r\sqrt{d/n}}\frac{f(\hat{\theta}+u)-f(\hat{\theta})}{|u|_{H}}=:\psi.\end{split} (A.3)

To get the second inequality, we replaced t(θθ^)t(\theta-\hat{\theta}), which has HH-norm rd/nr\sqrt{d/n} by construction, with any uu such that |u|H=rd/n|u|_{H}=r\sqrt{d/n}. To finish the proof, we bound ψ\psi from below. A Taylor expansion around θ=θ^\theta=\hat{\theta} gives that for |u|H=rd/n|u|_{H}=r\sqrt{d/n}, we have

f(θ^+u)f(θ^)=122f(θ^+ξ),u214|u|H2,\begin{split}f(\hat{\theta}+u)-f(\hat{\theta})&=\frac{1}{2}\langle\nabla^{2}f(\hat{\theta}+\xi),u^{\otimes 2}\rangle\geq\frac{1}{4}|u|_{H}^{2},\end{split} (A.4)

using that 2f(θ)12H\nabla^{2}f(\theta)\succeq\frac{1}{2}H for all θ𝒰(r)\theta\in{\mathcal{U}}(r) to get the second inequality. We now divide by |u|H|u|_{H} to get

\begin{split}\frac{f(\hat{\theta}+u)-f(\hat{\theta})}{|u|_{H}}&\geq\frac{1}{4}|u|_{H}=\frac{r}{4}\sqrt{d/n}.\end{split} (A.5)

Substituting this bound into (A.3) concludes the proof. ∎

Proof of Lemma 2.11.

We let 𝒰c{\mathcal{U}}^{c} be the complement of 𝒰{\mathcal{U}} in d\mathbb{R}^{d}, and omit the range of integration when the integral is over all of d\mathbb{R}^{d}. Suppose μF\mu\propto F and μ^F^\hat{\mu}\propto\hat{F}. Then

2TV(μ,μ^)=|FFF^F^|𝒰cFF+𝒰cF^F^+𝒰|FFF^F^|.\begin{split}2\mathrm{TV}(\mu,\hat{\mu})=\int\left|\frac{F}{\int F}-\frac{\hat{F}}{\int\hat{F}}\right|\leq\frac{\int_{{\mathcal{U}}^{c}}F}{\int F}+\frac{\int_{{\mathcal{U}}^{c}}\hat{F}}{\int\hat{F}}+\int_{\mathcal{U}}\left|\frac{F}{\int F}-\frac{\hat{F}}{\int\hat{F}}\right|.\end{split} (A.6)

Next, note that

𝒰|FFF^F^|𝒰|F𝒰FF^𝒰F^|𝒰|FFF𝒰F|+𝒰|F^𝒰F^F^F^|=𝒰F|(𝒰F)1(F)1|+𝒰F^|(𝒰F^)1(F^)1|=|1𝒰FF|+|1𝒰F^F^|=𝒰cFF+𝒰cF^F^.\begin{split}\int_{\mathcal{U}}&\left|\frac{F}{\int F}-\frac{\hat{F}}{\int\hat{F}}\right|-\int_{\mathcal{U}}\left|\frac{F}{\int_{\mathcal{U}}F}-\frac{\hat{F}}{\int_{\mathcal{U}}\hat{F}}\right|\leq\int_{\mathcal{U}}\left|\frac{F}{\int F}-\frac{F}{\int_{\mathcal{U}}F}\right|+\int_{\mathcal{U}}\left|\frac{\hat{F}}{\int_{\mathcal{U}}\hat{F}}-\frac{\hat{F}}{\int\hat{F}}\right|\\ &=\int_{\mathcal{U}}F\left|\left(\int_{\mathcal{U}}F\right)^{-1}-\left(\int F\right)^{-1}\right|+\int_{\mathcal{U}}\hat{F}\left|\left(\int_{\mathcal{U}}\hat{F}\right)^{-1}-\left(\int\hat{F}\right)^{-1}\right|\\ &=\left|1-\frac{\int_{\mathcal{U}}F}{\int F}\right|+\left|1-\frac{\int_{\mathcal{U}}\hat{F}}{\int\hat{F}}\right|=\frac{\int_{{\mathcal{U}}^{c}}F}{\int F}+\frac{\int_{{\mathcal{U}}^{c}}\hat{F}}{\int\hat{F}}.\end{split} (A.7)

Substituting (A.7) into (A.6) gives

2TV(μ,μ^)2𝒰cFF+2𝒰cF^F^+𝒰|F𝒰FF^𝒰F^|=2𝒰cF𝒰F+2𝒰cF^𝒰F^+2TV(μ|𝒰,μ^|𝒰).\begin{split}2\mathrm{TV}(\mu,\hat{\mu})&\leq 2\frac{\int_{{\mathcal{U}}^{c}}F}{\int F}+2\frac{\int_{{\mathcal{U}}^{c}}\hat{F}}{\int\hat{F}}+\int_{\mathcal{U}}\left|\frac{F}{\int_{\mathcal{U}}F}-\frac{\hat{F}}{\int_{\mathcal{U}}\hat{F}}\right|\\ &=2\frac{\int_{{\mathcal{U}}^{c}}F}{\int_{\mathcal{U}}F}+2\frac{\int_{{\mathcal{U}}^{c}}\hat{F}}{\int_{\mathcal{U}}\hat{F}}+2\mathrm{TV}\left(\mu|_{\mathcal{U}},\hat{\mu}|_{\mathcal{U}}\right).\end{split} (A.8)

Dividing by 2 finishes the proof of the first statement.

Next we prove the second statement. For brevity, redefine μ\mu and μ^\hat{\mu} to be their restrictions to 𝒰{\mathcal{U}} and recall that μenf\mu\propto e^{-nf}, μ^enf^\hat{\mu}\propto e^{-n\hat{f}} on 𝒰{\mathcal{U}}. Let T(x)=(nH)1/2xT(x)=(nH)^{1/2}x. Then

(T#μ)(y)efH(y),(T#μ^)(y)ef^H(y),y(nH)1/2𝒰,(T_{\#}\mu)(y)\propto e^{-f_{H}(y)},\quad(T_{\#}\hat{\mu})(y)\propto e^{-\hat{f}_{H}(y)},\quad y\in(nH)^{1/2}{\mathcal{U}},

where

f_{H}(y)=nf((nH)^{-1/2}y),\quad\hat{f}_{H}(y)=n\hat{f}((nH)^{-1/2}y).

Note that \nabla^{2}f_{H}(y)=H^{-1/2}\nabla^{2}f((nH)^{-1/2}y)H^{-1/2}\succeq\lambda I_{d} for all y\in(nH)^{1/2}{\mathcal{U}} by the assumption \nabla^{2}f(x)\succeq\lambda H for all x\in{\mathcal{U}}. Therefore, T_{\#}\mu is \lambda-strongly log-concave, so it satisfies a log-Sobolev inequality (LSI) with constant 1/\lambda [2]. Using the affine invariance of the TV distance, then Pinsker’s inequality, then the LSI, we get

TV(μ^,μ)2=TV(T#μ^,T#μ)212KL(T#μ^||T#μ)14λFI(T#μ^||T#μ)\mathrm{TV}(\hat{\mu},\mu)^{2}=\mathrm{TV}(T_{\#}\hat{\mu},T_{\#}\mu)^{2}\leq\frac{1}{2}\mathrm{KL}\left(\,T_{\#}\hat{\mu}\;||\;T_{\#}\mu\right)\leq\frac{1}{4\lambda}\mathrm{FI}\left(\,T_{\#}\hat{\mu}\;||\;T_{\#}\mu\right) (A.9)

where

FI(T#μ^||T#μ)=𝔼YT#μ^[|logT#μ^T#μ(Y)|2].\mathrm{FI}\left(\,T_{\#}\hat{\mu}\;||\;T_{\#}\mu\right)=\mathbb{E}\,_{Y\sim T_{\#}\hat{\mu}}\left[\left|\nabla\log\frac{T_{\#}\hat{\mu}}{T_{\#}\mu}(Y)\right|^{2}\right].

Now, log((T#μ^/T#μ)(Y))=(fHf^H)(Y)=n(ff^)((nH)1/2Y).\log((T_{\#}\hat{\mu}/T_{\#}\mu)(Y))=(f_{H}-\hat{f}_{H})(Y)=n(f-\hat{f})((nH)^{-1/2}Y). Therefore,

log((T#μ^/T#μ)(Y))=nH1/2((ff^)((nH)1/2Y))=dnH1/2(ff^)(X),\begin{split}\nabla\log((T_{\#}\hat{\mu}/T_{\#}\mu)(Y))&=\sqrt{n}H^{-1/2}\left(\nabla(f-\hat{f})((nH)^{-1/2}Y)\right)\\ &\stackrel{{\scriptstyle d}}{{=}}\sqrt{n}H^{-1/2}\nabla(f-\hat{f})(X),\end{split}

where Xμ^X\sim\hat{\mu}. Therefore,

FI(T#μ^||T#μ)=n𝔼Xμ^[(ff^)(X)H2].\mathrm{FI}\left(\,T_{\#}\hat{\mu}\;||\;T_{\#}\mu\right)=n\mathbb{E}\,_{X\sim\hat{\mu}}\left[\left\|\nabla(f-\hat{f})(X)\right\|_{H}^{2}\right].

Substituting this into (A.9) finishes the proof. ∎
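The chain of inequalities in this proof can be checked numerically in one dimension. The sketch below (an arbitrary illustrative example) computes both sides of the resulting bound \mathrm{TV}(\hat{\mu},\mu)\leq\sqrt{\frac{n}{4\lambda}\mathbb{E}\,_{X\sim\hat{\mu}}\|\nabla(f-\hat{f})(X)\|_{H}^{2}} by quadrature.

```python
import numpy as np

# 1d illustration of the second bound in Lemma 2.11: with mu ∝ e^{-nf} and
# mu_hat ∝ e^{-n f_hat} restricted to U, the TV distance is controlled by
# sqrt((n / 4 lambda) E_{mu_hat} |f' - f_hat'|^2 / H). All choices below
# (functions, window, constants) are arbitrary illustrations.
n, lam = 50.0, 1.0
x = np.linspace(-1.0, 1.0, 200_001)                 # the window U = [-1, 1]
dx = x[1] - x[0]

f = 0.5 * x**2 + 0.05 * x**4                        # f'' = 1 + 0.6 x^2 >= lam * H, H = 1
fh = 0.5 * x**2                                     # the quadratic surrogate f_hat
df_diff = 0.2 * x**3                                # f' - f_hat'

p = np.exp(-n * f);  p /= p.sum() * dx              # mu restricted to U
ph = np.exp(-n * fh); ph /= ph.sum() * dx           # mu_hat restricted to U

tv = 0.5 * np.abs(p - ph).sum() * dx
fisher = (ph * df_diff**2).sum() * dx               # E_{mu_hat} |f' - f_hat'|^2
bound = np.sqrt(n / (4.0 * lam) * fisher)
print(f"TV = {tv:.6f} <= bound = {bound:.6f}")
```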

A.2 Proof of Theorem 2.2

In this section, we make the assumptions from Theorem 2.2. Recall that it suffices to bound the local expectation and the two tail integrals in (2.10). These quantities are bounded in Lemma A.1 and Corollary A.5 below. The proof of Theorem 2.2 follows immediately by adding up these bounds.

A.2.1 Local Expectation

Lemma A.1.

It holds

n2𝔼θγf|𝒰[f(θ)H(θθ^)H2]12δ3(r)2dn.\sqrt{\frac{n}{2}}\mathbb{E}\,_{\theta\sim\gamma_{f}|{\mathcal{U}}}\left[\|\nabla f(\theta)-H(\theta-\hat{\theta})\|_{H}^{2}\right]^{\frac{1}{2}}\leq\frac{\delta_{3}(r)}{\sqrt{2}}\frac{d}{\sqrt{n}}. (A.10)
Proof.

First note

f(θ)H(θθ^)=01(2f(θ^+t(θθ^))2f(θ^))(θθ^)𝑑t.\nabla f(\theta)-H(\theta-\hat{\theta})=\int_{0}^{1}\left(\nabla^{2}f(\hat{\theta}+t(\theta-\hat{\theta}))-\nabla^{2}f(\hat{\theta})\right)(\theta-\hat{\theta})dt.

Hence if θ𝒰(r)\theta\in{\mathcal{U}}(r) then

f(θ)H(θθ^)H01δ3(r)tθθ^H2𝑑t=12δ3(r)θθ^H2.\|\nabla f(\theta)-H(\theta-\hat{\theta})\|_{H}\leq\int_{0}^{1}\delta_{3}(r)t\|\theta-\hat{\theta}\|_{H}^{2}dt=\frac{1}{2}\delta_{3}(r)\|\theta-\hat{\theta}\|_{H}^{2}.

Therefore, using that θ^+(nH)1/2Zγf\hat{\theta}+(nH)^{-1/2}Z\sim\gamma_{f} when Z𝒩(0,Id)Z\sim\mathcal{N}(0,I_{d}), we get

𝔼θγf|𝒰[f(θ)H(θθ^)H2]14δ3(r)2𝔼θγf|𝒰[θθ^H4]=δ3(r)24n2(|Z|rd)1𝔼[|Z|4𝟙(|Z|rd)]3δ3(r)2d24n2(|Z|rd)1.\begin{split}\mathbb{E}\,_{\theta\sim\gamma_{f}|{\mathcal{U}}}&\left[\|\nabla f(\theta)-H(\theta-\hat{\theta})\|_{H}^{2}\right]\leq\frac{1}{4}\delta_{3}(r)^{2}\mathbb{E}\,_{\theta\sim\gamma_{f}|{\mathcal{U}}}[\|\theta-\hat{\theta}\|_{H}^{4}]\\ &=\frac{\delta_{3}(r)^{2}}{4n^{2}}\mathbb{P}(|Z|\leq r\sqrt{d})^{-1}\mathbb{E}\,\left[|Z|^{4}\mathds{1}(|Z|\leq r\sqrt{d})\right]\leq\frac{3\delta_{3}(r)^{2}d^{2}}{4n^{2}}\mathbb{P}(|Z|\leq r\sqrt{d})^{-1}.\end{split} (A.11)

Now note that

(|Z|rd)1ed(r1)2/21e25/23/4\mathbb{P}(|Z|\leq r\sqrt{d})\geq 1-e^{-d(r-1)^{2}/2}\geq 1-e^{-25/2}\geq 3/4

for all d1d\geq 1 since r6r\geq 6. Substituting this bound into (A.11) gives

𝔼θγf|𝒰[f(θ)H(θθ^)H2]δ3(r)2d2n2.\mathbb{E}\,_{\theta\sim\gamma_{f}|{\mathcal{U}}}\left[\|\nabla f(\theta)-H(\theta-\hat{\theta})\|_{H}^{2}\right]\leq\delta_{3}(r)^{2}\frac{d^{2}}{n^{2}}. (A.12)

Taking the square root and then multiplying by n/2\sqrt{n/2} gives the desired bound. ∎

A.2.2 Tail integrals

Lemma A.2.

Let 𝒰=𝒰(r){\mathcal{U}}={\mathcal{U}}(r) for some r>2r>2. It holds

\int_{{\mathcal{U}}^{c}\cap\Theta}e^{n(f(\hat{\theta})-f(\theta))}d\theta\leq(2\pi)^{d/2}n^{-d/2}\det(H)^{-1/2}\,\mathrm{exp}\left(\left[2+\log r-r^{2}/4\right]d\right). (A.13)
Proof.

The linear growth bound (2.6) gives

\begin{split}\int_{{\mathcal{U}}^{c}\cap\Theta}e^{n(f(\hat{\theta})-f(\theta))}d\theta&\leq\int_{|\theta-\hat{\theta}|_{H}\geq r\sqrt{d/n}}\,\mathrm{exp}\left(-\frac{r}{4}\sqrt{nd}|\theta-\hat{\theta}|_{H}\right)d\theta\\&=n^{-d/2}\det(H)^{-1/2}\int_{|u|\geq r\sqrt{d}}\,\mathrm{exp}\left(-\frac{r}{4}\sqrt{d}|u|\right)du,\end{split} (A.14)

where we made the change of variables u=\sqrt{n}H^{1/2}(\theta-\hat{\theta}).

Using Lemma E.3 with a=r and b=r/4 (the lemma applies since we assumed r^{2}/4>1), we get

\int_{|u|\geq r\sqrt{d}}\,\mathrm{exp}\left(-\frac{r}{4}\sqrt{d}|u|\right)du\leq(2\pi)^{d/2}d\,\mathrm{exp}\left(\left[\frac{3}{2}+\log r-\frac{r^{2}}{4}\right]d\right). (A.15)

To conclude, bound de3d/2de^{3d/2} by e2de^{2d}, noting that ded/21de^{-d/2}\leq 1 for all dd. ∎

Lemma A.3.

Let 𝒰=𝒰(r){\mathcal{U}}={\mathcal{U}}(r) for some r2r\geq 2. Then

\int_{{\mathcal{U}}}e^{n(f(\hat{\theta})-f(\theta))}d\theta\geq\frac{1}{2}(2\pi)^{d/2}(3n/2)^{-d/2}\det(H)^{-1/2}. (A.16)
Proof.

For all |θθ^|Hrd/n|\theta-\hat{\theta}|_{H}\leq r\sqrt{d/n}, it holds

f(θ)f(θ^)=122f(ξ),(θθ^)234|θθ^|H2\begin{split}f(\theta)-f(\hat{\theta})&=\frac{1}{2}\left\langle\nabla^{2}f(\xi),(\theta-\hat{\theta})^{\otimes 2}\right\rangle\leq\frac{3}{4}|\theta-\hat{\theta}|_{H}^{2}\end{split} (A.17)

for a point ξ\xi on the interval between θ\theta and θ^\hat{\theta}. Here, we have used that supθ𝒰2f(θ)H3/2\sup_{\theta\in{\mathcal{U}}}\|\nabla^{2}f(\theta)\|_{H}\leq 3/2, since 2f(θ)HH1/2\|\nabla^{2}f(\theta)-H\|_{H}\leq 1/2, shown in the proof of Lemma 2.10. Hence

\begin{split}\int_{{\mathcal{U}}}e^{n(f(\hat{\theta})-f(\theta))}d\theta&\geq\int_{\mathcal{U}}\,\mathrm{exp}\left(-\frac{3n}{4}\left|\theta-\hat{\theta}\right|_{H}^{2}\right)d\theta\\&=(3n/2)^{-d/2}\det(H)^{-1/2}\int_{|u|\leq r\sqrt{3/2}\sqrt{d}}\,\mathrm{exp}\left(-|u|^{2}/2\right)du\\&=(3n/2)^{-d/2}\det(H)^{-1/2}(2\pi)^{d/2}\mathbb{P}(|Z|\leq r\sqrt{3/2}\sqrt{d}).\end{split} (A.18)

To conclude the proof note that when r2r\geq 2 we have (|Z|r3/2d)(|Z|6d)1ed(61)2/21/2\mathbb{P}(|Z|\leq r\sqrt{3/2}\sqrt{d})\geq\mathbb{P}\left(|Z|\leq\sqrt{6}\sqrt{d}\right)\geq 1-e^{-d(\sqrt{6}-1)^{2}/2}\geq 1/2 for all d1d\geq 1. ∎

Corollary A.4.

Let 𝒰=𝒰(r){\mathcal{U}}={\mathcal{U}}(r). If r6r\geq 6 then

Θ𝒰enf(θ)𝑑θ𝒰enf(θ)𝑑θ2e5dr2/362edr2/9.\frac{\int_{\Theta\setminus{\mathcal{U}}}e^{-nf(\theta)}d\theta}{\int_{{\mathcal{U}}}e^{-nf(\theta)}d\theta}\leq 2e^{-5dr^{2}/36}\leq 2e^{-dr^{2}/9}.
Proof.

Dividing the upper bound (A.13) on the numerator by the lower bound (A.16) on the denominator (the \det(H)^{-1/2} factors cancel) gives

Θ𝒰enf(θ)𝑑θ𝒰enf(θ)𝑑θ2exp(d[2+log(3/2)+logrr24]).\frac{\int_{\Theta\setminus{\mathcal{U}}}e^{-nf(\theta)}d\theta}{\int_{{\mathcal{U}}}e^{-nf(\theta)}d\theta}\leq 2\,\mathrm{exp}\left(d\left[2+\log(\sqrt{3/2})+\log r-\frac{r^{2}}{4}\right]\right).

We now note that

2+\log(\sqrt{3/2})+\log r\leq\frac{r^{2}}{9}\qquad\forall r\geq 6. (A.19)

Substituting this bound into the exponential concludes the proof. ∎

Corollary A.5.

Let r6r\geq 6. Then

T:=Θ𝒰enf𝒰enf+γf(𝒰c)3exp(dr2/9).T:=\frac{\int_{\Theta\setminus{\mathcal{U}}}e^{-nf}}{\int_{\mathcal{U}}e^{-nf}}+\gamma_{f}({\mathcal{U}}^{c})\leq 3\,\mathrm{exp}\left(-dr^{2}/9\right).
Proof.

From the above corollary and a standard Gaussian tail bound (i.e. using that γf(𝒰(r)c)=(|Z|rd)exp(d(r1)2/2)\gamma_{f}({\mathcal{U}}(r)^{c})=\mathbb{P}(|Z|\geq r\sqrt{d})\leq\,\mathrm{exp}(-d(r-1)^{2}/2)) we get

T2edr2/9+exp(d2(r1)2)3edr2/9,\begin{split}T\leq 2e^{-dr^{2}/9}+\,\mathrm{exp}\left(-\frac{d}{2}\left(r-1\right)^{2}\right)\leq 3e^{-dr^{2}/9},\end{split} (A.20)

since (d/2)(r1)2(d/2)(r/2)2=dr2/8(d/2)(r-1)^{2}\geq(d/2)(r/2)^{2}=dr^{2}/8. ∎

Appendix B Proofs from Section 3.2

Definition B.1.

On the event {!MLEθ^}\{\exists!\,\mathrm{MLE}\,\hat{\theta}\}, define H=2(θ^)H=\nabla^{2}\ell(\hat{\theta}) and 𝒰(r)={|θθ^|Hrd/n}{\mathcal{U}}(r)=\{|\theta-\hat{\theta}|_{H}\leq r\sqrt{d/n}\}. Assuming 𝒰(r)Θ{\mathcal{U}}(r)\subset\Theta, define

δ3(r)=supθ𝒰(r)2(θ)2(θ^)H|θθ^|H.\delta_{3}(r)=\sup_{\theta\in{\mathcal{U}}(r)}\frac{\|\nabla^{2}\ell(\theta)-\nabla^{2}\ell(\hat{\theta})\|_{H}}{|\theta-\hat{\theta}|_{H}}. (B.1)

Next, recall the definition of the event E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}):

E¯(s,ϵ2)={!MLEθ^,|θ^θ|Fsd/n}E2(ϵ2)E3(2s).\bar{E}(s,\epsilon_{2})=\{\exists!\,\mathrm{MLE}\,\hat{\theta},\,|\hat{\theta}-{\theta^{*}}|_{F^{*}}\leq s\sqrt{d/n}\}\cap E_{2}(\epsilon_{2})\cap E_{3}(2s).
Lemma B.2.

Suppose (3.5) holds. Then on the event E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}), it holds

2(θ^)14F,\nabla^{2}\ell(\hat{\theta})\succeq\frac{1}{4}{F^{*}},
𝒰(s/2)𝒰(2s),δ3(s/2)8δ3(2s),\begin{gathered}{\mathcal{U}}(s/2)\subseteq{\mathcal{U}}^{*}(2s),\qquad\delta_{3}(s/2)\leq 8\delta^{*}_{3}(2s),\end{gathered} (B.2)
(s/2)δ3(s/2)d/n1/2.(s/2)\delta_{3}(s/2)\sqrt{d/n}\leq 1/2. (B.3)

See Section B.1 for the proof.

Proof of Lemma 3.4.

We split up the proof into steps. First note that by the assumption ϵ21/2\epsilon_{2}\leq 1/2, and recalling that F=2(θ){F^{*}}=\nabla^{2}\ell({\theta^{*}}), we have 2(θ)12F\nabla^{2}\ell({\theta^{*}})\succeq\frac{1}{2}{F^{*}} on E2(ϵ2)E_{2}(\epsilon_{2}). Also, recall that (θ)=0\nabla\ell^{*}({\theta^{*}})=0.

Proof that a unique MLE \hat{\theta} exists and satisfies |\hat{\theta}-{\theta^{*}}|_{F^{*}}\leq s\sqrt{d/n}.

We check the assumptions of Corollary E.6 with h=\ell, y_{0}={\theta^{*}}, q=s\sqrt{d/n}, and \lambda=1/2. The conditions \nabla^{2}\ell({\theta^{*}})\succeq\lambda{F^{*}} and \|\nabla\ell({\theta^{*}})\|_{F^{*}}\leq 2\lambda q are both satisfied on E(s,\epsilon_{2}). The condition \|\nabla^{2}\ell(\theta)-\nabla^{2}\ell({\theta^{*}})\|_{F^{*}}\leq\lambda/4 for all |\theta-{\theta^{*}}|_{F^{*}}\leq q is also satisfied on this event, since

2(θ)2(θ)Fsδ3(s)d/n1/8=λ/4.\begin{split}\|\nabla^{2}\ell(\theta)-\nabla^{2}\ell({\theta^{*}})\|_{F^{*}}&\leq s\delta^{*}_{3}(s)\sqrt{d/n}\leq 1/8=\lambda/4.\end{split} (B.4)

Therefore, the assumptions of Corollary E.6 are satisfied so we conclude there exists θ^\hat{\theta} such that (θ^)=0\nabla\ell(\hat{\theta})=0 and |θ^θ|Fq=sd/n|\hat{\theta}-{\theta^{*}}|_{F^{*}}\leq q=s\sqrt{d/n}. Also, we immediately get by (B.4) that 2(θ^)2(θ)F1/8\|\nabla^{2}\ell(\hat{\theta})-\nabla^{2}\ell({\theta^{*}})\|_{F^{*}}\leq 1/8. Using this and the fact that 2(θ)12F\nabla^{2}\ell({\theta^{*}})\succeq\frac{1}{2}{F^{*}} we get 2(θ^)38F\nabla^{2}\ell(\hat{\theta})\succeq\frac{3}{8}{F^{*}}. Hence θ^\hat{\theta} is a strict local minimizer, and therefore a global minimizer, since \ell is convex.∎

This first part of the proof shows that E(s,ϵ2)E¯(s,ϵ2)E(s,\epsilon_{2})\subset\bar{E}(s,\epsilon_{2}). In the rest of the proof, we show (3.6)-(3.8) hold on E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}).

Proof of Bound (3.6).

By Assumption A1, \ell is convex and C2(Θ)\ell\in C^{2}(\Theta) with probability 1. Also, \ell has a unique global minimizer θ^\hat{\theta} on E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}). Let r=s/2r=s/2, so that r6r\geq 6. By Lemma B.2, it holds on E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}) that 𝒰(r)𝒰(2s)Θ{\mathcal{U}}(r)\subset{\mathcal{U}}^{*}(2s)\subset\Theta and rδ3(r)d/n1/2r\delta_{3}(r)\sqrt{d/n}\leq 1/2. Therefore, we can apply Theorem 2.2 with f=f=\ell to get that TV(π,γ)δ3(r)d2n+3edr2/9\mathrm{TV}(\pi_{\ell},\gamma_{\ell})\leq\delta_{3}(r)\frac{d}{\sqrt{2n}}+3e^{-dr^{2}/9}. Using that δ3(r)8δ3(2s)\delta_{3}(r)\leq 8\delta^{*}_{3}(2s) on E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}) and that edr2/9=eds2/36e^{-dr^{2}/9}=e^{-ds^{2}/36} concludes the proof.∎

Proof of Bound (3.8).

Let r=s/2r=s/2, so that r6r\geq 6. Since 2sδ3(2s)d/n1/42s\delta^{*}_{3}(2s)\sqrt{d/n}\leq 1/4 and 2(θ)12F\nabla^{2}\ell({\theta^{*}})\succeq\frac{1}{2}{F^{*}} on E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}), it follows that 2(θ)14F\nabla^{2}\ell(\theta)\succeq\frac{1}{4}{F^{*}} for all θ𝒰(2s)\theta\in{\mathcal{U}}^{*}(2s). We apply Lemma 2.11 with 𝒰=𝒰(2s){\mathcal{U}}={\mathcal{U}}^{*}(2s), μ=π\mu=\pi_{\ell}, μ^=πv\hat{\mu}=\pi_{v}, λ=1/4\lambda=1/4, and F{F^{*}} instead of HH. This gives

TV(πv,π)Θ𝒰(2s)enπ0𝒰(2s)enπ0+Θ𝒰(2s)en𝒰(2s)en+n𝔼θπv|𝒰[(n1v0)F2]12.\mathrm{TV}(\pi_{v},\pi_{\ell})\leq\frac{\int_{\Theta\setminus{\mathcal{U}}^{*}(2s)}e^{-n\ell}\pi_{0}}{\int_{{\mathcal{U}}^{*}(2s)}e^{-n\ell}\pi_{0}}+\frac{\int_{\Theta\setminus{\mathcal{U}}^{*}(2s)}e^{-n\ell}}{\int_{{\mathcal{U}}^{*}(2s)}e^{-n\ell}}+\sqrt{n}\mathbb{E}\,_{\theta\sim\pi_{v}|_{\mathcal{U}}}\left[\|\nabla(n^{-1}v_{0})\|_{F^{*}}^{2}\right]^{\frac{1}{2}}. (B.5)

In the last term, we simply bound the square root of the expectation of the square by the maximum of (n1v0(θ))F\|\nabla(n^{-1}v_{0}(\theta))\|_{F^{*}} over θ𝒰(2s)\theta\in{\mathcal{U}}^{*}(2s), which is n1δ01(2s)n^{-1}\delta^{*}_{01}(2s) by definition. Thus

\sqrt{n}\,\mathbb{E}\,_{\theta\sim\pi_{v}|_{\mathcal{U}}}\left[\|\nabla(n^{-1}v_{0})\|_{F^{*}}^{2}\right]^{\frac{1}{2}}\leq\frac{1}{\sqrt{n}}\delta^{*}_{01}(2s). (B.6)

Now, by definition of M0{M_{0}}^{*} and δ01\delta^{*}_{01} and using that 𝒰(r)𝒰(2s){\mathcal{U}}(r)\subset{\mathcal{U}}^{*}(2s) on E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}) by Lemma B.2, we have

Θ𝒰(2s)enπ0𝒰(2s)enπ0supθΘπ0(θ)/π0(θ)infθ𝒰(2s)π0(θ)/π0(θ)Θ𝒰(2s)en𝒰(2s)enexp(dM0+2sδ01(2s)d/n)Θ𝒰(2s)en𝒰(2s)enexp(dM0+ds/3)Θ𝒰(r)en𝒰(r)en.\begin{split}\frac{\int_{\Theta\setminus{\mathcal{U}}^{*}(2s)}e^{-n\ell}\pi_{0}}{\int_{{\mathcal{U}}^{*}(2s)}e^{-n\ell}\pi_{0}}&\leq\frac{\sup_{\theta\in\Theta}\pi_{0}(\theta)/\pi_{0}({\theta^{*}})}{\inf_{\theta\in{\mathcal{U}}^{*}(2s)}\pi_{0}(\theta)/\pi_{0}({\theta^{*}})}\frac{\int_{\Theta\setminus{\mathcal{U}}^{*}(2s)}e^{-n\ell}}{\int_{{\mathcal{U}}^{*}(2s)}e^{-n\ell}}\\ &\leq\,\mathrm{exp}(d{M_{0}}^{*}+2s\delta^{*}_{01}(2s)\sqrt{d/n})\frac{\int_{\Theta\setminus{\mathcal{U}}^{*}(2s)}e^{-n\ell}}{\int_{{\mathcal{U}}^{*}(2s)}e^{-n\ell}}\\ &\leq\,\mathrm{exp}(d{M_{0}}^{*}+ds/3)\frac{\int_{\Theta\setminus{\mathcal{U}}(r)}e^{-n\ell}}{\int_{{\mathcal{U}}(r)}e^{-n\ell}}.\end{split} (B.7)

In the last line we used that δ01(2s)nd/6\delta^{*}_{01}(2s)\leq\sqrt{nd}/6. Hence

Θ𝒰(2s)enπ0𝒰(2s)enπ0+Θ𝒰(2s)en𝒰(2s)en2exp(d[M0+s/3])Θ𝒰(r)en𝒰(r)en4exp(d[M0+s/35r2/36])4exp(d(M0s2/144)).\begin{split}\frac{\int_{\Theta\setminus{\mathcal{U}}^{*}(2s)}e^{-n\ell}\pi_{0}}{\int_{{\mathcal{U}}^{*}(2s)}e^{-n\ell}\pi_{0}}&+\frac{\int_{\Theta\setminus{\mathcal{U}}^{*}(2s)}e^{-n\ell}}{\int_{{\mathcal{U}}^{*}(2s)}e^{-n\ell}}\leq 2\,\mathrm{exp}\left(d[{M_{0}}^{*}+s/3]\right)\frac{\int_{\Theta\setminus{\mathcal{U}}(r)}e^{-n\ell}}{\int_{{\mathcal{U}}(r)}e^{-n\ell}}\\ \leq 4\,\mathrm{exp}\bigg{(}d&\left[{M_{0}}^{*}+s/3-5r^{2}/36\right]\bigg{)}\leq 4\,\mathrm{exp}\left(d({M_{0}}^{*}-s^{2}/144)\right).\end{split} (B.8)

We used (B.7) to get the first inequality and Corollary A.4 to get the second. To get the third inequality we used that 5r^{2}/36=5s^{2}/144 and that s/3\leq 4s^{2}/144 when s\geq 12. Combining (B.8) with (B.6) concludes the proof. ∎

Proof of Bound (3.7).

We use Lemma E.2 with Σ11=n2(θ)=nF\Sigma_{1}^{-1}=n\nabla^{2}\ell^{*}({\theta^{*}})=n{F^{*}} and Σ21=n2(θ^)\Sigma_{2}^{-1}=n\nabla^{2}\ell(\hat{\theta}). We have 2(θ^)14F\nabla^{2}\ell(\hat{\theta})\succeq\frac{1}{4}{F^{*}}, so we can take τ=1/4\tau=1/4. Also, we have

n2(θ^)nFnF=2(θ^)FF2(θ^)2(θ)F+2()(θ)Fsδ3(s)d/n+ϵ2=:ϵ\begin{split}\|n\nabla^{2}\ell(\hat{\theta})-n{F^{*}}\|_{n{F^{*}}}&=\|\nabla^{2}\ell(\hat{\theta})-{F^{*}}\|_{{F^{*}}}\leq\|\nabla^{2}\ell(\hat{\theta})-\nabla^{2}\ell({\theta^{*}})\|_{F^{*}}+\|\nabla^{2}(\ell-\ell^{*})({\theta^{*}})\|_{F^{*}}\\ &\leq s\delta^{*}_{3}(s)\sqrt{d/n}+\epsilon_{2}=:\epsilon\end{split}

on E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}). Substituting the above values of τ\tau and ϵ\epsilon into (E.3) concludes the proof of (3.7). ∎

Proof of bounds in Example 3.7.

The bounds for the flat prior are trivial. For the Gaussian prior π0=𝒩(μ,Σ)\pi_{0}=\mathcal{N}(\mu,\Sigma), we have logπ0(θ)=12(θμ)Σ1(θμ)+const\log\pi_{0}(\theta)=-\frac{1}{2}(\theta-\mu)^{\intercal}\Sigma^{-1}(\theta-\mu)+\mathrm{const}, so that

dM0=supθdlogπ0(θ)logπ0(θ)=12(θμ)Σ1(θμ)=12|θμ|Σ12.\begin{split}d{M_{0}}^{*}=\sup_{\theta\in\mathbb{R}^{d}}\log\pi_{0}(\theta)-\log\pi_{0}({\theta^{*}})=\frac{1}{2}({\theta^{*}}-\mu)^{\intercal}\Sigma^{-1}({\theta^{*}}-\mu)=\frac{1}{2}|{\theta^{*}}-\mu|_{\Sigma^{-1}}^{2}.\end{split} (B.9)

Next, we have logπ0(θ)=Σ1(θμ)\nabla\log\pi_{0}(\theta)=\Sigma^{-1}(\theta-\mu), and therefore

logπ0(θ)F=|F1/2Σ1(θμ)|Σ1F|θμ|F.\|\nabla\log\pi_{0}(\theta)\|_{F^{*}}=|{F^{*}}^{-1/2}\Sigma^{-1}(\theta-\mu)|\leq\|\Sigma^{-1}\|_{F^{*}}|\theta-\mu|_{F^{*}}.

Hence

δ01(r)Σ1F(|μθ|F+rd/n).\delta^{*}_{01}(r)\leq\|\Sigma^{-1}\|_{F^{*}}(|\mu-{\theta^{*}}|_{F^{*}}+r\sqrt{d/n}).

For the multivariate Student's t prior \pi_{0}=t_{\nu}(\mu,\Sigma), we have \log\pi_{0}(\theta)=-\frac{\nu+d}{2}\log\left(1+\frac{1}{\nu}|\theta-\mu|^{2}_{\Sigma^{-1}}\right)+\mathrm{const}, so

logπ0(θ)logπ0(θ)=ν+d2log(1+1ν|θμ|Σ121+1ν|θμ|Σ12)ν+d2log(1+1ν|θμ|Σ12)ν+d2ν|θμ|Σ12.\begin{split}\log\pi_{0}(\theta)&-\log\pi_{0}({\theta^{*}})=\frac{\nu+d}{2}\log\left(\frac{1+\frac{1}{\nu}|{\theta^{*}}-\mu|^{2}_{\Sigma^{-1}}}{1+\frac{1}{\nu}|\theta-\mu|^{2}_{\Sigma^{-1}}}\right)\\ &\leq\frac{\nu+d}{2}\log\left(1+\frac{1}{\nu}|{\theta^{*}}-\mu|^{2}_{\Sigma^{-1}}\right)\leq\frac{\nu+d}{2\nu}|{\theta^{*}}-\mu|^{2}_{\Sigma^{-1}}.\end{split} (B.10)

Hence M0ν+d2νd|θμ|Σ12{M_{0}}^{*}\leq\frac{\nu+d}{2\nu d}|{\theta^{*}}-\mu|^{2}_{\Sigma^{-1}}. Finally,

logπ0(θ)=ν+dνΣ1(θμ)1+1ν|θμ|Σ12,\nabla\log\pi_{0}(\theta)=-\frac{\nu+d}{\nu}\frac{\Sigma^{-1}(\theta-\mu)}{1+\frac{1}{\nu}|\theta-\mu|^{2}_{\Sigma^{-1}}},

so

\begin{split}\delta^{*}_{01}(r)&=\sup_{\theta\in{\mathcal{U}}^{*}(r)}\|\nabla\log\pi_{0}(\theta)\|_{F^{*}}\leq\frac{\nu+d}{\nu}\sup_{\theta\in{\mathcal{U}}^{*}(r)}\|\Sigma^{-1}(\theta-\mu)\|_{{F^{*}}}\\ &\leq\frac{\nu+d}{\nu}\|\Sigma^{-1}\|_{F^{*}}(|\mu-{\theta^{*}}|_{F^{*}}+r\sqrt{d/n}).\end{split} (B.11) ∎
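As a numerical illustration of the Gaussian-prior bound, the inequality \|\nabla\log\pi_{0}(\theta)\|_{F^{*}}\leq\|\Sigma^{-1}\|_{F^{*}}|\theta-\mu|_{F^{*}} can be checked directly. The sketch below (assuming numpy; the matrices and points are randomly generated placeholders) writes {F^{*}}=LL^{\intercal} via a Cholesky factor, so that |v|_{F^{*}}=|L^{\intercal}v|, \|v\|_{F^{*}}=|L^{-1}v|, and \|A\|_{F^{*}} is the spectral norm of L^{-1}AL^{-\intercal}:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
F = A @ A.T + d * np.eye(d)          # placeholder for F*
B = rng.standard_normal((d, d))
Sigma = B @ B.T + np.eye(d)          # prior covariance
mu = rng.standard_normal(d)          # prior mean

L = np.linalg.cholesky(F)            # F* = L L^T
Linv = np.linalg.inv(L)
Sinv = np.linalg.inv(Sigma)
norm_Sinv = np.linalg.norm(Linv @ Sinv @ Linv.T, 2)   # ||Sigma^{-1}||_{F*}

for _ in range(1_000):
    theta = mu + 3 * rng.standard_normal(d)
    grad = -Sinv @ (theta - mu)                       # gradient of log pi_0
    lhs = np.linalg.norm(Linv @ grad)                 # ||grad log pi_0||_{F*}
    rhs = norm_Sinv * np.linalg.norm(L.T @ (theta - mu))
    assert lhs <= rhs + 1e-9
```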

B.1 Auxiliary results

Proof of Lemma B.2.

Using (3.5), the definition of event E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}), and Taylor’s theorem, we have

2(θ^)2(θ)Fsup|θθ|Fsd/n2(θ)2(θ)Fsδ3(s)d/n1/4.\begin{split}\|\nabla^{2}\ell(\hat{\theta})-\nabla^{2}\ell({\theta^{*}})\|_{F^{*}}&\leq\sup_{|\theta-{\theta^{*}}|_{F^{*}}\leq s\sqrt{d/n}}\|\nabla^{2}\ell(\theta)-\nabla^{2}\ell({\theta^{*}})\|_{F^{*}}\\ &\leq s\delta^{*}_{3}(s)\sqrt{d/n}\leq 1/4.\end{split}

Since \nabla^{2}\ell({\theta^{*}})\succeq\frac{1}{2}{F^{*}} on \bar{E}(s,\epsilon_{2}) (using that \epsilon_{2}\leq 1/2), we conclude that H=\nabla^{2}\ell(\hat{\theta})\succeq\frac{1}{4}{F^{*}}. This proves the first statement. Next, fix \theta\in{\mathcal{U}}(s/2), so that |\theta-\hat{\theta}|_{H}\leq(s/2)\sqrt{d/n}. Using that H\succeq\frac{1}{4}{F^{*}} and |\hat{\theta}-{\theta^{*}}|_{F^{*}}\leq s\sqrt{d/n} on \bar{E}(s,\epsilon_{2}), we then have

|θθ|F|θθ^|F+|θ^θ|F2|θθ^|H+sd/n2sd/n.|\theta-{\theta^{*}}|_{F^{*}}\leq|\theta-\hat{\theta}|_{F^{*}}+|\hat{\theta}-{\theta^{*}}|_{F^{*}}\leq 2|\theta-\hat{\theta}|_{H}+s\sqrt{d/n}\leq 2s\sqrt{d/n}.

We conclude that 𝒰(s/2)𝒰(2s){\mathcal{U}}(s/2)\subseteq{\mathcal{U}}^{*}(2s). Next, using that 𝒰(s/2)𝒰(2s){\mathcal{U}}(s/2)\subset{\mathcal{U}}^{*}(2s) and that θ^𝒰(s)𝒰(2s)\hat{\theta}\in{\mathcal{U}}^{*}(s)\subset{\mathcal{U}}^{*}(2s) on event E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}), we have

δ3(s/2)=supθ𝒰(s/2)2(θ)2(θ^)H|θθ^|Hsupθ,θ𝒰(2s)2(θ)2(θ)H|θθ|H8supθ,θ𝒰(2s)2(θ)2(θ)F|θθ|F8δ3(2s).\begin{split}\delta_{3}(s/2)&=\sup_{\theta\in{\mathcal{U}}(s/2)}\frac{\|\nabla^{2}\ell(\theta)-\nabla^{2}\ell(\hat{\theta})\|_{H}}{|\theta-\hat{\theta}|_{H}}\leq\sup_{\theta,\theta^{\prime}\in{\mathcal{U}}^{*}(2s)}\frac{\|\nabla^{2}\ell(\theta)-\nabla^{2}\ell(\theta^{\prime})\|_{H}}{|\theta-\theta^{\prime}|_{H}}\\ &\leq 8\sup_{\theta,\theta^{\prime}\in{\mathcal{U}}^{*}(2s)}\frac{\|\nabla^{2}\ell(\theta)-\nabla^{2}\ell(\theta^{\prime})\|_{F^{*}}}{|\theta-\theta^{\prime}|_{F^{*}}}\leq 8\delta^{*}_{3}(2s).\end{split}

Here, we used that H14FH\succeq\frac{1}{4}{F^{*}} on E¯(s,ϵ2)\bar{E}(s,\epsilon_{2}). Finally,

(s/2)δ3(s/2)d/n(s/2)(8δ3(2s))d/n=4sδ3(2s)d/n1/2,(s/2)\delta_{3}(s/2)\sqrt{d/n}\leq(s/2)(8\delta^{*}_{3}(2s))\sqrt{d/n}=4s\delta^{*}_{3}(2s)\sqrt{d/n}\leq 1/2,

proving (B.3). ∎

Appendix C Proofs from Section 4 (GLMs)

Proof of Lemma 4.6.

Let

Q=1ni=1nXi[Yiψ(Xiθ)],Q=\frac{1}{n}\sum_{i=1}^{n}X_{i}\left[Y_{i}-\nabla\psi(X_{i}^{\intercal}{\theta^{*}})\right],

treating \nabla\psi as a column vector in \mathbb{R}^{k}. We are interested in bounding from above the tail probability \mathbb{P}(\|Q\|_{{F^{*}}}\geq s\sqrt{d/n}). Let \mathcal{N} be a 1/2-net of the unit sphere in \mathbb{R}^{d}, i.e. a collection of unit norm vectors u such that \inf_{u\in\mathcal{N}}|u-w|\leq 1/2 for all |w|=1. By standard arguments, we can take \mathcal{N} to have at most 5^{d} elements. Next, let \mathcal{N}_{{F^{*}}}:=\{{{F^{*}}}^{-1/2}u\mid u\in\mathcal{N}\}. Then

QF2supu𝒩FuQ,\|Q\|_{F^{*}}\leq 2\sup_{u\in\mathcal{N}_{F^{*}}}u^{\intercal}Q,

and hence

(QFsd/n)(supu𝒩FuQ(s/2)d/n)5dsup|u|F=1(uQ(s/2)d/n),\mathbb{P}(\|Q\|_{{F^{*}}}\geq s\sqrt{d/n})\leq\mathbb{P}(\sup_{u\in\mathcal{N}_{{F^{*}}}}u^{\intercal}Q\geq(s/2)\sqrt{d/n})\leq 5^{d}\sup_{|u|_{{F^{*}}}=1}\mathbb{P}(u^{\intercal}Q\geq(s/2)\sqrt{d/n}), (C.1)

by a union bound. Now, fix |u|F=1|u|_{{F^{*}}}=1. Then

\begin{split}\mathbb{P}&(u^{\intercal}Q\geq(s/2)\sqrt{d/n})=\mathbb{P}\left(\sum_{i=1}^{n}u^{\intercal}X_{i}(Y_{i}-\nabla\psi(X_{i}^{\intercal}{\theta^{*}}))\geq(s/2)\sqrt{nd}\right)\\ &\leq\,\mathrm{exp}\left(-\tau\frac{s}{2}\sqrt{nd}\right)\prod_{i=1}^{n}\mathbb{E}\,\left[\exp\left(\tau u^{\intercal}X_{i}(Y_{i}-\nabla\psi(X_{i}^{\intercal}{\theta^{*}}))\right)\right]\\ &=\exp\left(-\tau\frac{s}{2}\sqrt{nd}+\sum_{i=1}^{n}\left[\psi(X_{i}^{\intercal}({\theta^{*}}+\tau u))-\psi(X_{i}^{\intercal}{\theta^{*}})-\tau u^{\intercal}X_{i}\nabla\psi(X_{i}^{\intercal}{\theta^{*}})\right]\right)\\ &=\exp\left(-\tau\frac{s}{2}\sqrt{nd}+n\left[h_{u}(\tau)-h_{u}(0)-\tau h_{u}^{\prime}(0)\right]\right),\end{split} (C.2)

where

hu(t)=1ni=1nψ(Xiθ+tXiu).h_{u}(t)=\frac{1}{n}\sum_{i=1}^{n}\psi(X_{i}^{\intercal}{\theta^{*}}+tX_{i}^{\intercal}u).

We used Chernoff’s inequality and standard properties of exponential families in the second to last line, which is valid for any τ\tau such that Xi(θ+τu)ΩX_{i}^{\intercal}({\theta^{*}}+\tau u)\in\Omega for all ii. In particular, since 𝒰(2s)Θ{\mathcal{U}}^{*}(2s)\subset\Theta we know by the definition of Θ\Theta that Xi(θ+τu)ΩX_{i}^{\intercal}({\theta^{*}}+\tau u)\in\Omega for all ii and all τ2sd/n\tau\leq 2s\sqrt{d/n}, |u|F=1|u|_{F^{*}}=1. We take τ=(s/2)d/n\tau=(s/2)\sqrt{d/n}. This gives, for some ξ[0,τ]\xi\in[0,\tau],

\begin{split}\sup_{|u|_{F^{*}}=1}\left\{h_{u}(\tau)-h_{u}(0)-\tau h_{u}^{\prime}(0)\right\}&=\sup_{|u|_{F^{*}}=1}\frac{\tau^{2}}{2}\left(h_{u}^{\prime\prime}(0)+\frac{\tau}{3}h_{u}^{\prime\prime\prime}(\xi)\right)\\ &\leq\frac{\tau^{2}}{2}\left(1+\frac{\tau}{3}\sup_{|u|_{F^{*}}=1,\,t\in[0,\tau]}h_{u}^{\prime\prime\prime}(t)\right)\leq\frac{\tau^{2}}{2}\left(1+\frac{\tau}{3}\delta^{*}_{3}(s/2)\right).\end{split} (C.3)

To get the first inequality we used that hu′′(0)=uFuh_{u}^{\prime\prime}(0)=u^{\intercal}{F^{*}}u. To get the second inequality, we used that

hu′′′(t)=1ni=1n3ψ(Xi(θ+tu)),(Xiu)3=3(θ+tu),u33(θ+tu)F\begin{split}h_{u}^{\prime\prime\prime}(t)=\frac{1}{n}\sum_{i=1}^{n}\langle\nabla^{3}\psi(X_{i}^{\intercal}({\theta^{*}}+tu)),(X_{i}^{\intercal}u)^{\otimes 3}\rangle=\langle\nabla^{3}\ell({\theta^{*}}+tu),u^{\otimes 3}\rangle\leq\|\nabla^{3}\ell({\theta^{*}}+tu)\|_{{F^{*}}}\end{split}

for all |u|F=1|u|_{F^{*}}=1 (recall the third line of (4.8)). Using (C.3) and that τ=(s/2)d/n\tau=(s/2)\sqrt{d/n}, we obtain

5d(uQ(s/2)d/n)exp(dlog5τs2nd+nτ22(1+τ3δ3(s/2)))=exp(dlog5s2d4+s2d8(1+s6δ3(s/2)d/n))exp(ds2[log512214+18(1+1/48)])eds2/10.\begin{split}5^{d}\mathbb{P}(u^{\intercal}Q\geq(s/2)\sqrt{d/n})&\leq\exp\left(d\log 5-\tau\frac{s}{2}\sqrt{nd}+\frac{n\tau^{2}}{2}\left(1+\frac{\tau}{3}\delta^{*}_{3}(s/2)\right)\right)\\ &=\exp\left(d\log 5-\frac{s^{2}d}{4}+\frac{s^{2}d}{8}\left(1+\frac{s}{6}\delta^{*}_{3}(s/2)\sqrt{d/n}\right)\right)\\ &\leq\exp\left(ds^{2}\left[\frac{\log 5}{12^{2}}-\frac{1}{4}+\frac{1}{8}(1+1/48)\right]\right)\leq e^{-ds^{2}/10}.\end{split} (C.4)

To get the first inequality in the last line we used that 2sδ3(2s)d/n1/42s\delta^{*}_{3}(2s)\sqrt{d/n}\leq 1/4 and s12s\geq 12. ∎
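The bracketed constant in the last line of (C.4) can be verified by direct arithmetic:

```python
import numpy as np

val = np.log(5) / 144 - 1 / 4 + (1 / 8) * (1 + 1 / 48)
print(val)            # approximately -0.1112
assert val <= -1 / 10
```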

Proof of Lemma 4.12.

The following calculation has been shown in [21]; we repeat it here for convenience. Let m=\max(d,\log^{2}n), which ensures d\leq m\leq n\leq e^{\sqrt{m}}; this is needed to apply the result of [1] cited below. Let \bar{X}_{i}=(X_{i},\tilde{X}_{i}), where \tilde{X}_{i}\stackrel{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,I_{m-d}), and the \tilde{X}_{i} are independent of the X_{i}. Then \bar{X}_{i}\stackrel{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,I_{m}) and we have

supuSd11ni=1n|Xiu|ksupuSm11ni=1n|X¯iu|k\sup_{u\in S^{d-1}}\frac{1}{n}\sum_{i=1}^{n}|X_{i}^{\intercal}u|^{k}\leq\sup_{u\in S^{m-1}}\frac{1}{n}\sum_{i=1}^{n}|\bar{X}_{i}^{\intercal}u|^{k} (C.5)

with probability 1. Here, Sd1S^{d-1} is the unit sphere in d\mathbb{R}^{d}. Next, by [1, Proposition 4.4] applied with t=s=1t=s=1, it holds

supuSm11ni=1n|X¯iu|ksupuSm11ni=1n𝔼[|X¯iu|k]+Cklogk1(2nm)mn+Ckmk/2n+Ckmn\begin{split}\sup_{u\in S^{m-1}}\frac{1}{n}\sum_{i=1}^{n}|\bar{X}_{i}^{\intercal}u|^{k}\leq&\sup_{u\in S^{m-1}}\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\,\left[|\bar{X}_{i}^{\intercal}u|^{k}\right]\\ &+C_{k}\log^{k-1}\left(\frac{2n}{m}\right)\sqrt{\frac{m}{n}}+C_{k}\frac{m^{k/2}}{n}+C_{k}\frac{m}{n}\end{split} (C.6)

with probability at least

1exp(Cn)exp(Ckmin(nlog2k2(2n/m),nm/log(2n/m)))12exp(Ckn),1-\exp(-C\sqrt{n})-\exp(-C_{k}\min(n\log^{2k-2}(2n/m),\sqrt{nm}/\log(2n/m)))\geq 1-2\exp(-C_{k}\sqrt{n}), (C.7)

where CC is an absolute constant and CkC_{k} depends only on kk. The inequality in (C.7) follows by noting that log(2n/m)log2\log(2n/m)\geq\log 2 and log(2n/m)log(2em)log(e2m)=2m\log(2n/m)\leq\log(2e^{\sqrt{m}})\leq\log(e^{2\sqrt{m}})=2\sqrt{m}, and therefore, nlog2k2(2n/m)Cknn\log^{2k-2}(2n/m)\geq C_{k}n and nm/log(2n/m)n/2\sqrt{nm}/\log(2n/m)\geq\sqrt{n}/2. We can also further upper bound (C.6) by using that mnm\leq n and 𝔼[|X¯iu|k]=Ck\mathbb{E}\,\left[|\bar{X}_{i}^{\intercal}u|^{k}\right]=C_{k} for all ii and some CkC_{k}. Therefore,

supuSm11ni=1n|X¯iu|kCk(1+mk/2n)Ck(1+dk/2n),\sup_{u\in S^{m-1}}\frac{1}{n}\sum_{i=1}^{n}|\bar{X}_{i}^{\intercal}u|^{k}\leq C_{k}\left(1+\frac{m^{k/2}}{n}\right)\leq C_{k}\left(1+\frac{d^{k/2}}{n}\right), (C.8)

where C_{k} may change value. The second inequality uses that m\leq d+\log^{2}n together with \log^{k}n\leq C_{k}n. Combining (C.5) with (C.8) completes the proof. ∎

C.1 Analysis of BvM in [5] and [32]

In this section, all equation and theorem numbers refer to the works of [5] and [32] unless otherwise specified with the signifier “our”.

We study the proof of Theorem 2.2 of [5] for \alpha=0; this recovers the TV distance. The main preliminary bound is (B.1). The first term on the right-hand side has to do with the prior; for simplicity we neglect it, i.e. we assume a flat prior. The third and fourth terms on the right-hand side are exponentially small tail integrals. Thus the main contribution stems from the second term, which is bounded in the four displays starting with (B.2). The fourth of these displays, with \alpha=0, gives the result:

TV(πv,γ)=𝒪(r1n+er2nc¯Md,0+r3n)\mathrm{TV}(\pi_{v},\gamma^{*})=\mathcal{O}(r_{1n}+e^{-r_{2n}\bar{c}M_{d,0}}+r_{3n}) (C.9)

with high probability. Here M_{d,0}=d; this quantity is defined above Theorem 2.2. The quantities r_{1n},r_{2n},r_{3n} are defined in Lemma A.5. The quantity r_{2n} is bounded below by a constant, so the middle term is exponentially small in d. When the prior is flat, r_{1n}=0. Finally, r_{3n}=cd\lambda_{n}(c) for an absolute constant c. Using this in our (C.9) gives

TV(πv,γ)=𝒪(dλn(c)).\mathrm{TV}(\pi_{v},\gamma^{*})=\mathcal{O}(d\lambda_{n}(c)). (C.10)

Here, λn(c)\lambda_{n}(c) is defined in (2.6), in terms of B1n(0)B_{1n}(0) and B2n(c)B_{2n}(c) defined in (2.4) and (2.5). Note that B1n(c)B_{1n}(c) from (2.4) is precisely equal to δ3(c)\delta^{*}_{3}(\sqrt{c}) in our notation; this follows from the representation in our (4.18). To continue in the style of our notation, we can write B2n(c)=δ4(c)B_{2n}(c)=\delta^{*}_{4}(\sqrt{c}), where

\delta^{*}_{4}(s)=\sup_{\theta\in{\mathcal{U}}^{*}(s),\,u\neq 0}\;\frac{\mathbb{E}\,_{\theta}\left[\left(u^{\intercal}Y-u^{\intercal}\mathbb{E}\,_{\theta}[Y]\right)^{4}\right]}{{\mathrm{Var}}_{\theta^{*}}(u^{\intercal}Y)^{2}}. (C.11)

Now, in our notation, we have

λn(s2)=s6d/n(δ3(0)+sd/nδ4(s)).\lambda_{n}(s^{2})=\frac{s}{6}\sqrt{d/n}\left(\delta^{*}_{3}(0)+s\sqrt{d/n}\delta^{*}_{4}(s)\right).

Substituting this into our (C.10) and writing c=s2c=s^{2} finally gives

TV(πv,γ)=𝒪((δ3(0)+sd/nδ4(s))ddn).\mathrm{TV}(\pi_{v},\gamma^{*})=\mathcal{O}\left(\left(\delta^{*}_{3}(0)+s\sqrt{d/n}\delta^{*}_{4}(s)\right)\frac{d\sqrt{d}}{\sqrt{n}}\right).

To interpret the model-dependent factor, note that we can write δ3,4\delta^{*}_{3,4} as follows: δk(s)=supθ𝒰(s)kψ(θ)F\delta^{*}_{k}(s)=\sup_{\theta\in{\mathcal{U}}^{*}(s)}\|\nabla^{k}\psi(\theta)\|_{{F^{*}}} for k=3,4k=3,4. Thus a Taylor expansion shows that δ3(0)+δ4(s)sd/nδ3(s)\delta^{*}_{3}(0)+\delta^{*}_{4}(s)s\sqrt{d/n}\geq\delta^{*}_{3}(s).

Next, we apply the result of [32] to GLMs. In condition (ED2)(ED_{2}), the quantity 2ξ(θ)0\nabla^{2}\xi(\theta)\equiv 0 due to stochastic linearity. Therefore, ω\omega can be taken arbitrarily small. Thus we set ω=0\omega=0. Substituting ω=0\omega=0 into the quantity (r0,x)\Diamond(r_{0},x) defined in (2.6) of Theorem 2.2 gives (r0,x)=r0δ(r0)\Diamond(r_{0},x)=r_{0}\delta(r_{0}). Thus, the last two inequalities in Theorem 2.4 roughly give that

TV(πv,γ)Δ0(x):=r0(r0,x)=r02δ(r0)\mathrm{TV}(\pi_{v},\gamma^{*})\lesssim\Delta_{0}(x):=r_{0}\Diamond(r_{0},x)=r_{0}^{2}\delta(r_{0}) (C.12)

with probability 15ex1-5e^{-x} for a suitably chosen r0r_{0} depending on xx. Roughly speaking, r02=C(d+x)r_{0}^{2}=C(d+x); see the discussion in Section 2.4. Thus for simplicity, and to parallel our results, we take r0=s0dr_{0}=s_{0}\sqrt{d}.

Now, the quantity δ(r0)\delta(r_{0}) is defined in (0)(\mathcal{L}_{0}). Using that r0=s0dr_{0}=s_{0}\sqrt{d}, the condition defining δ(r0)\delta(r_{0}) is as follows, in our notation:

supθ𝒰(s0)2(θ)2(θ)Fδ(r0).\sup_{\theta\in{\mathcal{U}}^{*}(s_{0})}\|\nabla^{2}\ell^{*}(\theta)-\nabla^{2}\ell^{*}({\theta^{*}})\|_{{F^{*}}}\leq\delta(r_{0}).

Thus we can write δ(r0)=δ~3(s0)s0d/n,\delta(r_{0})=\tilde{\delta}^{*}_{3}(s_{0})s_{0}\sqrt{d/n}, where δ~3\tilde{\delta}^{*}_{3} is comparable to our δ3\delta^{*}_{3}. Substituting this into our (C.12) gives

TV(πv,γ)r02δ(r0)=s0dδ~3(s0)s0d/n=s02δ~3(s0)ddn.\mathrm{TV}(\pi_{v},\gamma^{*})\lesssim r_{0}^{2}\delta(r_{0})=s_{0}d\tilde{\delta}^{*}_{3}(s_{0})s_{0}\sqrt{d/n}=s_{0}^{2}\tilde{\delta}^{*}_{3}(s_{0})\frac{d\sqrt{d}}{\sqrt{n}}.

Appendix D Proofs from Section 5

Proof of Lemma 5.4.

To prove the χ2\chi^{2} characterization of 𝒰(r){\mathcal{U}}^{*}(r), fix θd\theta\in\mathbb{R}^{d} and let θ0:=1θ𝟙\theta_{0}:=1-\theta^{\intercal}\mathds{1}. Now note that

|θθ|F2=(θθ)F(θθ)=j=1d(θjθj)2θj+((θθ)𝟙)2θ0=j=1d(θjθj)2θj+(θ0θ0)2θ0=χ2(θ||θ).\begin{split}|\theta-{\theta^{*}}|_{F^{*}}^{2}&=(\theta-{\theta^{*}})^{\intercal}{F^{*}}(\theta-{\theta^{*}})=\sum_{j=1}^{d}\frac{(\theta_{j}-\theta_{j}^{*})^{2}}{\theta_{j}^{*}}+\frac{((\theta-{\theta^{*}})^{\intercal}\mathds{1})^{2}}{\theta_{0}^{*}}\\ &=\sum_{j=1}^{d}\frac{(\theta_{j}-\theta_{j}^{*})^{2}}{\theta_{j}^{*}}+\frac{(\theta_{0}-\theta_{0}^{*})^{2}}{\theta_{0}^{*}}=\chi^{2}(\theta||{\theta^{*}}).\end{split} (D.1)

Hence {\mathcal{U}}^{*}(r)=\{\theta\in\mathbb{R}^{d}\;:\;\chi^{2}(\theta||{\theta^{*}})\leq r^{2}d/n\}. Note that in the first line above, {\theta^{*}} denotes (\theta_{1}^{*},\dots,\theta_{d}^{*}). Next, suppose {\mathcal{U}}^{*}(2r)\not\subseteq\Theta. We will show that {\theta_{\mathrm{min}}^{*}}\leq 4r^{2}d/n. If {\mathcal{U}}^{*}(2r)\not\subseteq\Theta, then there exists \theta\in{\mathcal{U}}^{*}(2r) and j\in\{1,\dots,d\} such that \theta_{j}\leq 0 or \theta_{j}\geq 1. Suppose first that \theta_{j}\leq 0. Then

θj(θjθj)2θjχ2(θ||θ)4r2d/n\theta_{j}^{*}\leq\frac{(\theta_{j}-\theta_{j}^{*})^{2}}{\theta_{j}^{*}}\leq\chi^{2}(\theta||{\theta^{*}})\leq 4r^{2}d/n (D.2)

and hence θmin4r2d/n{\theta_{\mathrm{min}}^{*}}\leq 4r^{2}d/n. Next, suppose θj1\theta_{j}\geq 1. Then θ00\theta_{0}\leq 0, and we can apply the above argument with j=0j=0. Finally, fix θ𝒰(r)Θ\theta\in{\mathcal{U}}^{*}(r)\subset\Theta. Then the bound (5.6) gives

maxj=0,,d|θjθj1|2χ2(θ||θ)θminr2d/nθmin14\max_{j=0,\dots,d}\left|\frac{\theta_{j}}{\theta_{j}^{*}}-1\right|^{2}\leq\frac{\chi^{2}(\theta||{\theta^{*}})}{{\theta_{\mathrm{min}}^{*}}}\leq\frac{r^{2}d/n}{{\theta_{\mathrm{min}}^{*}}}\leq\frac{1}{4}

by assumption on rr, and hence θjθj/2j=0,1,,d\theta_{j}\geq\theta_{j}^{*}/2\;\forall j=0,1,\dots,d, as desired. ∎
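The identity (D.1) can be confirmed numerically: the matrix of the quadratic form is {F^{*}}=\mathrm{diag}(1/\theta_{1}^{*},\dots,1/\theta_{d}^{*})+\mathds{1}\mathds{1}^{\intercal}/\theta_{0}^{*}, and the squared F^{*}-distance between two pmfs equals the \chi^{2} divergence over all d+1 states. A minimal sketch assuming numpy (the pmfs are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
p = rng.dirichlet(np.ones(d + 1))    # ground truth pmf; p[0] plays the role of theta*_0
q = rng.dirichlet(np.ones(d + 1))    # plays the role of theta, with theta_0 = q[0]

F = np.diag(1 / p[1:]) + np.ones((d, d)) / p[0]
lhs = (q[1:] - p[1:]) @ F @ (q[1:] - p[1:])   # |theta - theta*|_{F*}^2
chi2_div = np.sum((q - p) ** 2 / p)           # chi^2(theta || theta*)
assert abs(lhs - chi2_div) < 1e-10
```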

Throughout the rest of the section, we use the shorthand notation

χ2=χ2(N¯||θ).\chi^{2}=\chi^{2}(\bar{N}||{\theta^{*}}).
Proof of (4).

Using (4.3) of [8], we have

χCd/n+Cx/n+Cxnθmin\chi\leq C\sqrt{d/n}+C\sqrt{x/n}+C\frac{x}{n\sqrt{{\theta_{\mathrm{min}}^{*}}}} (D.3)

with probability at least 1ex1-e^{-x} for any x>0x>0. Now, let x=dt2x=dt^{2} for some t1t\geq 1 for which t2dnθmin1\frac{t^{2}d}{n{\theta_{\mathrm{min}}^{*}}}\leq 1. Since t1t\geq 1 we can upper bound (D.3) as

χCt2d/n+Ct2dnθmin=Ct2dn(1+t2dnθmin)C1t2dn,\chi\leq C\sqrt{t^{2}d/n}+C\frac{t^{2}d}{n\sqrt{\theta_{\mathrm{min}}^{*}}}=C\sqrt{\frac{t^{2}d}{n}}\left(1+\sqrt{\frac{t^{2}d}{n{\theta_{\mathrm{min}}^{*}}}}\right)\leq C_{1}\sqrt{\frac{t^{2}d}{n}}, (D.4)

using the assumption t2dnθmin1\frac{t^{2}d}{n{\theta_{\mathrm{min}}^{*}}}\leq 1. The bound (D.4) holds with probability at least 1et2d1-e^{-t^{2}d}. We now choose t=s/C1t=s/C_{1}, which satisfies t1t\geq 1 and t2dnθmin1\frac{t^{2}d}{n{\theta_{\mathrm{min}}^{*}}}\leq 1 if sC1s\geq C_{1} and s2d/nC12θmins^{2}d/n\leq C_{1}^{2}{\theta_{\mathrm{min}}^{*}}. Thus E0(s)={χsd/n}E_{0}(s)=\{\chi\leq s\sqrt{d/n}\} holds with probability at least 1et2d=1es2d/C121-e^{-t^{2}d}=1-e^{-s^{2}d/C_{1}^{2}}. ∎

Remark D.1.

Since we have assumed s2d/nC12θmins^{2}d/n\leq C_{1}^{2}{\theta_{\mathrm{min}}^{*}}, note that we also have χ2/θmins2d/(nθmin)C12\chi^{2}/{\theta_{\mathrm{min}}^{*}}\leq s^{2}d/(n{\theta_{\mathrm{min}}^{*}})\leq C_{1}^{2} on E0(s)E_{0}(s).

Proof of (3).

First recall that if (2s)2d/nθmin/4(2s)^{2}d/n\leq{\theta_{\mathrm{min}}^{*}}/4, then Lemma 5.4 shows that

θjθj2j=0,,dθ𝒰(2s).\frac{\theta_{j}^{*}}{\theta_{j}}\leq 2\quad\forall j=0,\dots,d\quad\forall\theta\in{\mathcal{U}}^{*}(2s). (D.5)

Now, fix θ𝒰(2s)\theta\in{\mathcal{U}}^{*}(2s). We upper bound 3(θ)F\|\nabla^{3}\ell(\theta)\|_{F^{*}}, which is given by the maximum of

3(θ),u3=2j=1dN¯jθj3uj3+2N¯0θ03(u𝟙)3=2j=0dN¯jθj3uj3\begin{split}\left\langle\nabla^{3}\ell(\theta),u^{\otimes 3}\right\rangle=-2\sum_{j=1}^{d}\frac{\bar{N}_{j}}{\theta_{j}^{3}}u_{j}^{3}+2\frac{\bar{N}_{0}}{\theta_{0}^{3}}(u^{\intercal}\mathds{1})^{3}=-2\sum_{j=0}^{d}\frac{\bar{N}_{j}}{\theta_{j}^{3}}u_{j}^{3}\end{split} (D.6)

over all uu such that

1=uFu=j=1duj2θj+(u𝟙)2θ0=j=0duj2θj,1=u^{\intercal}{F^{*}}u=\sum_{j=1}^{d}\frac{u_{j}^{2}}{\theta_{j}^{*}}+\frac{(u^{\intercal}\mathds{1})^{2}}{\theta_{0}^{*}}=\sum_{j=0}^{d}\frac{u_{j}^{2}}{\theta_{j}^{*}}, (D.7)

where u0:=j=1duju_{0}:=-\sum_{j=1}^{d}u_{j}. Dropping this constraint on the sum of the uju_{j} and using Lagrange multipliers, we see that the maximum is achieved at a uu such that uj=0u_{j}=0 for jIj\in I and uj=λθj3/(θjN¯j)u_{j}=\lambda\theta_{j}^{3}/(\theta_{j}^{*}\bar{N}_{j}) for j{0,,d}Ij\in\{0,\dots,d\}\setminus I, where II is some non-empty index set. The constraint (D.7) reduces to λ2S=1\lambda^{2}S=1 and the objective (D.6) reduces to 2λ3S-2\lambda^{3}S, where S=iIθj6(θj)3N¯j2S=\sum_{i\in I}\frac{\theta_{j}^{6}}{(\theta_{j}^{*})^{3}\bar{N}_{j}^{2}}. Thus λ=1/S\lambda=-1/\sqrt{S}, and the objective (D.6) is given by 2λ3S=2/S-2\lambda^{3}S=2/\sqrt{S}, which is maximized when S=minj=0,,dθj6(θj)3N¯j2S=\min_{j=0,\dots,d}\frac{\theta_{j}^{6}}{(\theta_{j}^{*})^{3}\bar{N}_{j}^{2}}. Thus

3(θ)F2maxj=0,,dN¯j(θj)3/2θj3.\|\nabla^{3}\ell(\theta)\|_{F^{*}}\leq 2\max_{j=0,\dots,d}\frac{\bar{N}_{j}(\theta_{j}^{*})^{3/2}}{\theta_{j}^{3}}. (D.8)

We now take the supremum over all θ𝒰(2s)\theta\in{\mathcal{U}}^{*}(2s) and apply (D.5) to get that

δ3(2s)=supθ𝒰(2s)3(θ)FCmaxj=0,,d1θjN¯jθjCθmin(1+χθmin).\delta^{*}_{3}(2s)=\sup_{\theta\in{\mathcal{U}}^{*}(2s)}\|\nabla^{3}\ell(\theta)\|_{F^{*}}\leq C\max_{j=0,\dots,d}\frac{1}{\sqrt{\theta_{j}^{*}}}\frac{\bar{N}_{j}}{\theta_{j}^{*}}\leq\frac{C}{\sqrt{\theta_{\mathrm{min}}^{*}}}\left(1+\frac{\chi}{\sqrt{\theta_{\mathrm{min}}^{*}}}\right). (D.9)

The last inequality follows by (5.6) applied with θ=N¯\theta=\bar{N}. Finally, as noted in Remark D.1, we have that χ/θmin\chi/\sqrt{{\theta_{\mathrm{min}}^{*}}} is bounded by an absolute constant on E0(s)E_{0}(s). This concludes the proof. ∎

Calculation from Remark 5.6.

In the case that N¯j=θj\bar{N}_{j}=\theta_{j}^{*}, we have as in the above proof that

3(θ)F=sup{2j=0duj3θj2:j=0duj=0,j=0duj2θj=1}.\|\nabla^{3}\ell({\theta^{*}})\|_{F^{*}}=\sup\left\{-2\sum_{j=0}^{d}\frac{u_{j}^{3}}{{\theta_{j}^{*}}^{2}}\;:\;\sum_{j=0}^{d}u_{j}=0,\sum_{j=0}^{d}\frac{u_{j}^{2}}{\theta_{j}^{*}}=1\right\}. (D.10)

Using Lagrange multipliers we get that u_{j}=c_{1}\theta_{j}^{*} for all j in some nonempty index set I, and u_{j}=c_{2}\theta_{j}^{*} for all other j. If P=\sum_{j\in I}\theta_{j}^{*}, the constraints reduce to c_{1}P+c_{2}(1-P)=0 and c_{1}^{2}P+c_{2}^{2}(1-P)=1. Solving for c_{1},c_{2} we get c_{1}=-\sqrt{(1-P)/P} and c_{2}=\sqrt{P/(1-P)}. The objective value is then equal to 2[(1-P)^{3/2}/\sqrt{P}-P^{3/2}/\sqrt{1-P}]=2(1-2P)/\sqrt{P(1-P)}. This is maximized when P is smallest, which is P={\theta_{\mathrm{min}}^{*}}. This yields

3(θ)F=212θminθmin1θmin.\|\nabla^{3}\ell({\theta^{*}})\|_{F^{*}}=2\frac{1-2{\theta_{\mathrm{min}}^{*}}}{\sqrt{{\theta_{\mathrm{min}}^{*}}}\sqrt{1-{\theta_{\mathrm{min}}^{*}}}}. (D.11)
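The closed form (D.11) can be cross-checked numerically: the maximizer produced by the Lagrange analysis attains it, while random feasible points do not exceed it. A sketch assuming numpy (the pmf is a random placeholder):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
p = rng.dirichlet(np.ones(d + 1))            # theta*, all d+1 coordinates
jmin = p.argmin()
P = p[jmin]
closed_form = 2 * (1 - 2 * P) / np.sqrt(P * (1 - P))

obj = lambda u: -2 * np.sum(u**3 / p**2)

# the maximizer from the Lagrange analysis: I = {jmin}, u_j = c1 p_j or c2 p_j
u_star = np.where(np.arange(d + 1) == jmin,
                  -np.sqrt((1 - P) / P) * p, np.sqrt(P / (1 - P)) * p)
assert abs(u_star.sum()) < 1e-12
assert abs(np.sum(u_star**2 / p) - 1) < 1e-12
assert abs(obj(u_star) - closed_form) < 1e-10

# random points satisfying both constraints never beat the closed form
for _ in range(20_000):
    u = rng.standard_normal(d + 1)
    u -= u.mean()                            # enforce sum u_j = 0
    u /= np.sqrt(np.sum(u**2 / p))           # enforce sum u_j^2 / theta*_j = 1
    assert obj(u) <= closed_form + 1e-9
```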

Proof of (2).

We have

2(θ)FF=supu0j=1dN¯jθj(θj)2uj2+N¯0θ0(θ0)2(𝟙u)2j=1d1θjuj2+1θ0(𝟙u)2maxj=0,1,,d|N¯jθj|θjχθmins2dnθmin,\begin{split}\|\nabla^{2}\ell({\theta^{*}})-{F^{*}}\|_{F^{*}}&=\sup_{u\neq 0}\frac{\sum_{j=1}^{d}\frac{\bar{N}_{j}-\theta_{j}^{*}}{(\theta_{j}^{*})^{2}}u_{j}^{2}+\frac{\bar{N}_{0}-\theta_{0}^{*}}{(\theta_{0}^{*})^{2}}(\mathds{1}^{\intercal}u)^{2}}{\sum_{j=1}^{d}\frac{1}{\theta_{j}^{*}}u_{j}^{2}+\frac{1}{\theta_{0}^{*}}(\mathds{1}^{\intercal}u)^{2}}\\ &\leq\max_{j=0,1,\dots,d}\frac{|\bar{N}_{j}-\theta_{j}^{*}|}{\theta_{j}^{*}}\leq\frac{\chi}{\sqrt{\theta_{\mathrm{min}}^{*}}}\leq\sqrt{\frac{s^{2}d}{n{\theta_{\mathrm{min}}^{*}}}},\end{split} (D.12)

using (5.6) to get the second-to-last inequality, and that χsd/n\chi\leq s\sqrt{d/n} on E0(s)E_{0}(s) to get the last inequality. ∎
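The first inequality in (D.12) holds deterministically and is simple to confirm numerically, computing the weighted norm \|\cdot\|_{F^{*}} through a Cholesky factor of {F^{*}}. A sketch assuming numpy (the pmf and the counts are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 6, 5_000
p = rng.dirichlet(np.ones(d + 1))
Nbar = rng.multinomial(n, p) / n
p0, pv = p[0], p[1:]
N0, Nv = Nbar[0], Nbar[1:]

F = np.diag(1 / pv) + np.ones((d, d)) / p0
H = np.diag(Nv / pv**2) + (N0 / p0**2) * np.ones((d, d))  # Hessian of ell at theta*

L = np.linalg.cholesky(F)
Linv = np.linalg.inv(L)
M = Linv @ (H - F) @ Linv.T
lhs = np.max(np.abs(np.linalg.eigvalsh(M)))   # ||grad^2 ell(theta*) - F*||_{F*}
rhs = np.max(np.abs(Nbar - p) / p)            # max_j |Nbar_j - theta*_j| / theta*_j
assert lhs <= rhs + 1e-12
```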

Appendix E Auxiliary

Lemma E.1.

Let {F^{*}},H be two symmetric positive definite matrices in \mathbb{R}^{d\times d} such that H\succeq\lambda{F^{*}}. Also, let u\in\mathbb{R}^{d} and let T be a symmetric k-tensor, k\geq 1. Then

|u|Hλ1/2|u|F,THλk/2TF.|u|_{H}\geq\lambda^{1/2}|u|_{F^{*}},\quad\|T\|_{H}\leq\lambda^{-k/2}\|T\|_{F^{*}}. (E.1)
Proof.

The first bound is by definition. For the second bound, we use the first bound to get

\|T\|_{H}=\sup_{u\neq 0}\frac{\langle T,u^{\otimes k}\rangle}{(u^{\intercal}Hu)^{k/2}}\leq\lambda^{-k/2}\sup_{u\neq 0}\frac{\langle T,u^{\otimes k}\rangle}{(u^{\intercal}{F^{*}}u)^{k/2}}=\lambda^{-k/2}\|T\|_{F^{*}}. (E.2) ∎

Lemma E.2.

Suppose Σ21τΣ11\Sigma_{2}^{-1}\succeq\tau\Sigma_{1}^{-1} and Σ21Σ11Σ11ϵ\|\Sigma_{2}^{-1}-\Sigma_{1}^{-1}\|_{\Sigma_{1}^{-1}}\leq\epsilon. Then

TV(𝒩(μ,Σ1),𝒩(μ,Σ2))2ϵτd.\mathrm{TV}\left(\mathcal{N}(\mu,\Sigma_{1}),\;\mathcal{N}(\mu,\Sigma_{2})\right)\leq 2\frac{\epsilon}{\tau}\sqrt{d}. (E.3)
Proof.

We use Theorem 1.1 of [12] and equality (2) of that work to get that

TV(𝒩(μ,Σ1),𝒩(μ,Σ2))2Σ11/2Σ2Σ11/2IdF2dΣ11/2Σ2Σ11/2Id.\mathrm{TV}(\mathcal{N}(\mu,\Sigma_{1}),\mathcal{N}(\mu,\Sigma_{2}))\leq 2\|\Sigma_{1}^{-1/2}\Sigma_{2}\Sigma_{1}^{-1/2}-I_{d}\|_{F}\leq 2\sqrt{d}\|\Sigma_{1}^{-1/2}\Sigma_{2}\Sigma_{1}^{-1/2}-I_{d}\|. (E.4)

Next, let A=Σ11/2Σ21Σ11/2A=\Sigma_{1}^{1/2}\Sigma_{2}^{-1}\Sigma_{1}^{1/2} and note that λmin(A)τ\lambda_{\min}(A)\geq\tau by the assumption Σ21τΣ11\Sigma_{2}^{-1}\succeq\tau\Sigma_{1}^{-1}. We then have

A1IdA1AId=Σ11/2Σ21Σ11/2Idλmin(Σ11/2Σ21Σ11/2)=Σ21Σ11Σ11λmin(Σ11/2Σ21Σ11/2)ϵτ.\begin{split}\|A^{-1}-I_{d}\|\leq\|A^{-1}\|\|A-I_{d}\|&=\frac{\|\Sigma_{1}^{1/2}\Sigma_{2}^{-1}\Sigma_{1}^{1/2}-I_{d}\|}{\lambda_{\min}(\Sigma_{1}^{1/2}\Sigma_{2}^{-1}\Sigma_{1}^{1/2})}=\frac{\|\Sigma_{2}^{-1}-\Sigma_{1}^{-1}\|_{\Sigma_{1}^{-1}}}{\lambda_{\min}(\Sigma_{1}^{1/2}\Sigma_{2}^{-1}\Sigma_{1}^{1/2})}\leq\frac{\epsilon}{\tau}.\end{split}

Substituting this bound into (E.4) concludes the proof. ∎
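Lemma E.2 can be illustrated by Monte Carlo, using that for equal means \mathrm{TV}(\mathcal{N}(0,\Sigma_{1}),\mathcal{N}(0,\Sigma_{2}))=\frac{1}{2}\mathbb{E}_{X\sim\mathcal{N}(0,\Sigma_{1})}|1-p_{2}(X)/p_{1}(X)|. The sketch below (assuming numpy and scipy; the dimension, perturbation size, and seed are illustrative) builds \Sigma_{2}^{-1}=L(I+B)L^{\intercal} from \Sigma_{1}^{-1}=LL^{\intercal}, so that one may take \epsilon=\|B\| and \tau=1-\|B\|:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
d = 4
A = rng.standard_normal((d, d))
P1 = A @ A.T + d * np.eye(d)                  # Sigma_1^{-1}
L = np.linalg.cholesky(P1)
G = rng.standard_normal((d, d))
B = 0.05 * (G + G.T) / 2                      # small symmetric perturbation
P2 = L @ (np.eye(d) + B) @ L.T                # Sigma_2^{-1}

eps = np.linalg.norm(B, 2)                    # ||Sigma_2^{-1} - Sigma_1^{-1}||_{Sigma_1^{-1}}
tau = 1 - eps                                 # Sigma_2^{-1} >= tau Sigma_1^{-1}

S1, S2 = np.linalg.inv(P1), np.linalg.inv(P2)
X = rng.multivariate_normal(np.zeros(d), S1, size=200_000)
log_ratio = (multivariate_normal.logpdf(X, mean=np.zeros(d), cov=S2)
             - multivariate_normal.logpdf(X, mean=np.zeros(d), cov=S1))
tv_est = 0.5 * np.mean(np.abs(1 - np.exp(log_ratio)))
print(tv_est, 2 * (eps / tau) * np.sqrt(d))   # estimate vs. the bound in (E.3)
```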

Lemma E.3 (Lemma E.3 of [20] with p=0p=0).

If ab>1ab>1, then

1(2π)d/2|x|adebd|x|𝑑xdexp([32+logaab]d).\begin{split}\frac{1}{(2\pi)^{d/2}}\int_{|x|\geq a\sqrt{d}}e^{-b\sqrt{d}|x|}dx\leq d\,\mathrm{exp}\left(\left[\frac{3}{2}+\log a-ab\right]d\right).\end{split} (E.5)
Lemma E.4 (Modified from Lemma 1.3 in Chapter XIV of [23]).

Let UdU\subset\mathbb{R}^{d} be open, and let f:Udf:U\to\mathbb{R}^{d} be of class C1C^{1}. Assume that f(0)=0f(0)=0 and f(0)=Id\nabla f(0)=I_{d}. Let r>0r>0 be such that B¯r(0)U\bar{B}_{r}(0)\subset U. If

|f(x)f(0)|14,|x|r,|\nabla f(x)-\nabla f(0)|\leq\frac{1}{4},\qquad\forall|x|\leq r,

then ff maps B¯r(0)\bar{B}_{r}(0) bijectively onto B¯2r(0)\bar{B}_{2r}(0).

Now let h:dh:\mathbb{R}^{d}\to\mathbb{R} be of class C2C^{2}, such that H=2h(y0)H=\nabla^{2}h(y_{0}) is symmetric positive definite. Define

h~(x)=H1/2(h(y0+H1/2x)h(y0)).\tilde{h}(x)=H^{-1/2}\left(\nabla h\left(y_{0}+H^{-1/2}x\right)-\nabla h(y_{0})\right).

Then h~(0)=0\tilde{h}(0)=0 and h~(0)=Id\nabla\tilde{h}(0)=I_{d}, and one can check that |h~(x)h~(0)|1/4|\nabla\tilde{h}(x)-\nabla\tilde{h}(0)|\leq 1/4 for all xB¯r(0)x\in\bar{B}_{r}(0) if and only if

2h(y)2h(y0)H1/4,|yy0|Hr.\|\nabla^{2}h(y)-\nabla^{2}h(y_{0})\|_{H}\leq 1/4,\quad\forall|y-y_{0}|_{H}\leq r. (E.6)

We conclude that if (E.6) is satisfied, then h~\nabla\tilde{h} maps B¯r(0)\bar{B}_{r}(0) bijectively onto B¯2r(0)\bar{B}_{2r}(0) and hence h\nabla h maps y0+H1/2B¯r(0)y_{0}+H^{-1/2}\bar{B}_{r}(0) bijectively onto h(y0)+H1/2B¯2r(0)\nabla h(y_{0})+H^{1/2}\bar{B}_{2r}(0). We have proved the following corollary.

Corollary E.5.

Let h:dh:\mathbb{R}^{d}\to\mathbb{R} be C2C^{2} such that H=2h(y0)H=\nabla^{2}h(y_{0}) is symmetric positive definite. Suppose (E.6) holds, and suppose zz is such that

zh(y0)H=|H1/2(zh(y0))|2r.\|z-\nabla h(y_{0})\|_{H}=|H^{-1/2}(z-\nabla h(y_{0}))|\leq 2r.

Then there exists yy such that |yy0|Hr|y-y_{0}|_{H}\leq r and h(y)=z\nabla h(y)=z.

Next, it will be convenient to use a weighting matrix F{F^{*}} that does not exactly equal H=2h(y0)H=\nabla^{2}h(y_{0}). We will also only need the case z=0z=0.

Corollary E.6.

Let h:dh:\mathbb{R}^{d}\to\mathbb{R} be C2C^{2} such that H=2h(y0)H=\nabla^{2}h(y_{0}) is symmetric positive definite, and let F{F^{*}} be a symmetric positive definite matrix such that HλFH\succeq\lambda{F^{*}}. Suppose

2h(y)2h(y0)Fλ4|yy0|Fq,|F1/2h(y0)|2λq\begin{split}\|\nabla^{2}h(y)-\nabla^{2}h(y_{0})\|_{{F^{*}}}&\leq\frac{\lambda}{4}\quad\forall|y-y_{0}|_{{F^{*}}}\leq q,\\ |{F^{*}}^{-1/2}\nabla h(y_{0})|&\leq 2\lambda q\end{split}

for some q0q\geq 0. Then there exists yy such that |yy0|Fq|y-y_{0}|_{F^{*}}\leq q and h(y)=0\nabla h(y)=0.

Proof.

If |yy0|Hr=λq|y-y_{0}|_{H}\leq r=\sqrt{\lambda}q, then |yy0|Fq|y-y_{0}|_{F^{*}}\leq q and hence 2h(y)2h(y0)Fλ/4\|\nabla^{2}h(y)-\nabla^{2}h(y_{0})\|_{{F^{*}}}\leq\lambda/4, which implies 2h(y)2h(y0)H1/4\|\nabla^{2}h(y)-\nabla^{2}h(y_{0})\|_{H}\leq 1/4. Also, note that |H1/2h(y0)|2λq=2r|H^{-1/2}\nabla h(y_{0})|\leq 2\sqrt{\lambda}q=2r. Thus the assumptions of Corollary E.5 are satisfied, so that there exists yy such that |yy0|Hr|y-y_{0}|_{H}\leq r and h(y)=0\nabla h(y)=0. But then |yy0|Fr/λ=q|y-y_{0}|_{F^{*}}\leq r/\sqrt{\lambda}=q, as desired. ∎

References

  • [1] Radoslaw Adamczak, Alexander Litvak, Alain Pajor, and Nicole Tomczak-Jaegermann. Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. Journal of the American Mathematical Society, 23(2):535–561, 2010.
  • [2] Dominique Bakry, Ivan Gentil, Michel Ledoux, et al. Analysis and geometry of Markov diffusion operators, volume 103. Springer, 2014.
  • [3] Rina Foygel Barber, Mathias Drton, and Kean Ming Tan. Laplace approximation in high-dimensional Bayesian regression. In Statistical Analysis for High-Dimensional Data: The Abel Symposium 2014, pages 15–36. Springer, 2016.
  • [4] Ole Barndorff-Nielsen. Information and exponential families: in statistical theory. John Wiley & Sons, 2014.
  • [5] Alexandre Belloni and Victor Chernozhukov. Posterior inference in curved exponential families under increasing dimensions. The Econometrics Journal, 17(2):S75–S100, 2014.
  • [6] Jan Bohr and Richard Nickl. On log-concave approximations of high-dimensional posterior measures and stability properties in non-linear inverse problems. Annales de l’Institut Henri Poincaré, to appear.
  • [7] Dominique Bontemps. Bernstein–von Mises theorems for Gaussian regression with increasing number of regressors. The Annals of Statistics, 39(5):2557–2584, 2011.
  • [8] S. Boucheron and E. Gassiat. A Bernstein–von Mises theorem for discrete probability distributions. Electronic Journal of Statistics, 3:114–148, 2009.
  • [9] Ismaël Castillo and Richard Nickl. Nonparametric Bernstein–von Mises theorems in Gaussian white noise. The Annals of Statistics, 41(4):1999–2028, 2013.
  • [10] Ismaël Castillo, Johannes Schmidt-Hieber, and Aad van der Vaart. Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5):1986–2018, 2015.
  • [11] Guillaume P Dehaene. A deterministic and computable Bernstein-von Mises theorem. arXiv preprint arXiv:1904.02505, 2019.
  • [12] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional Gaussians with the same mean. arXiv preprint arXiv:1810.08693, 2018.
  • [13] David Freedman. Wald lecture: On the Bernstein-von Mises theorem with infinite-dimensional parameters. The Annals of Statistics, 27(4):1119–1141, 1999.
  • [14] Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 1995.
  • [15] Subhashis Ghosal. Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli, pages 315–331, 1999.
  • [16] Subhashis Ghosal. Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. Journal of Multivariate Analysis, 74(1):49–68, 2000.
  • [17] Xuming He and Qi-Man Shao. On parameters of increasing dimensions. Journal of Multivariate Analysis, 73(1):120–135, 2000.
  • [18] Tapio Helin and Remo Kretschmann. Non-asymptotic error estimates for the Laplace approximation in Bayesian inverse problems. Numerische Mathematik, 150(2):521–549, 2022.
  • [19] Mikolaj J Kasprzak, Ryan Giordano, and Tamara Broderick. How good is your Gaussian approximation of the posterior? Finite-sample computable error bounds for a variety of useful divergences. arXiv preprint arXiv:2209.14992, 2022.
  • [20] Anya Katsevich. The Laplace approximation accuracy in high dimensions: a refined analysis and new skew adjustment. arXiv preprint arXiv:2306.07262, 2023.
  • [21] Anya Katsevich. The Laplace asymptotic expansion in high dimensions. arXiv preprint arXiv:2406.12706, 2024.
  • [22] Bas JK Kleijn and Aad W van der Vaart. The Bernstein-von-Mises theorem under misspecification. Electronic Journal of Statistics, 6:354–381, 2012.
  • [23] Serge Lang. Real and Functional Analysis. Graduate Texts in Mathematics. Springer New York, NY, 3 edition, 1993.
  • [24] László Lovász and Santosh Vempala. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307–358, 2007.
  • [25] Yulong Lu. On the Bernstein-von Mises theorem for high dimensional nonlinear Bayesian inverse problems. arXiv preprint arXiv:1706.00289, 2017.
  • [26] Pascal Massart. Concentration Inequalities and Model Selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII-2003. Lecture Notes in Mathematics. Springer, 2007.
  • [27] Richard Nickl and Sven Wang. On polynomial-time computation of high-dimensional posterior measures by Langevin-type algorithms. Journal of the European Mathematical Society, 2022.
  • [28] Maxim Panov and Vladimir Spokoiny. Finite sample Bernstein–von Mises theorem for semiparametric problems. Bayesian Analysis, 10(3):665–710, 2015.
  • [29] Stephen Portnoy. Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. The Annals of Statistics, pages 356–366, 1988.
  • [30] Adrien Saumard and Jon A Wellner. Log-concavity and strong log-concavity: a review. Statistics surveys, 8:45, 2014.
  • [31] Vladimir Spokoiny. Parametric estimation. Finite sample theory. The Annals of Statistics, 40(6):2877–2909, 2012.
  • [32] Vladimir Spokoiny. Bernstein-von Mises theorem for growing parameter dimension. arXiv preprint arXiv:1302.3430, 2013.
  • [33] Vladimir Spokoiny. Finite samples inference and critical dimension for stochastically linear models. arXiv preprint arXiv:2201.06327, 2022.
  • [34] Vladimir Spokoiny. Dimension free nonasymptotic bounds on the accuracy of high-dimensional Laplace approximation. SIAM/ASA Journal on Uncertainty Quantification, 11(3):1044–1068, 2023.
  • [35] Vladimir Spokoiny. Inexact Laplace approximation and the use of posterior mean in Bayesian inference. Bayesian Analysis, 1(1):1–28, 2023.
  • [36] Pragya Sur. A modern maximum-likelihood theory for high-dimensional logistic regression. PhD thesis, Stanford University, 2019.
  • [37] Yanbo Tang and Nancy Reid. Laplace and saddlepoint approximations in high dimensions. arXiv preprint arXiv:2107.10885, 2021.
  • [38] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.
  • [39] Xinzhen Zhang, Chen Ling, and Liqun Qi. The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM Journal on Matrix Analysis and Applications, 33(3):806–821, 2012.