
Gibbs posterior concentration rates under sub-exponential type losses

Nicholas Syring (Department of Statistics, Iowa State University; nsyring@iastate.edu)   and   Ryan Martin (Department of Statistics, North Carolina State University; rgmarti3@ncsu.edu)
(October 1, 2025)
Abstract

Bayesian posterior distributions are widely used for inference, but their dependence on a statistical model creates some challenges. In particular, there may be lots of nuisance parameters that require prior distributions and posterior computations, plus a potentially serious risk of model misspecification bias. Gibbs posterior distributions, on the other hand, offer direct, principled, probabilistic inference on quantities of interest through a loss function, not a model-based likelihood. Here we provide simple sufficient conditions for establishing Gibbs posterior concentration rates when the loss function is of a sub-exponential type. We apply these general results in a range of practically relevant examples, including mean regression, quantile regression, and sparse high-dimensional classification. We also apply these techniques in an important problem in medical statistics, namely, estimation of a personalized minimum clinically important difference.

Keywords and phrases: classification; generalized Bayes; high-dimensional problem; M-estimation; model misspecification.

1 Introduction

A major selling point of the Bayesian framework is that it is normative: to solve a new problem, one only needs a statistical model/likelihood, a prior distribution for the model parameters, and the means to compute the corresponding posterior distribution. Bayesians’ obligation to specify a prior attracts criticism, but their need to specify a likelihood has a number of potentially negative consequences too, especially when the quantity of interest has meaning independent of a statistical model, like a quantile. On the one hand, even if the posited model is “correct,” it is rare that all the parameters of that model are relevant to the problem at hand. For example, if we are interested in a quantile of a distribution, which we model as skew-normal, then the focus is only on a specific real-valued function of the model parameters. In such a case, the non-trivial effort invested in dealing with these nuisance parameters, e.g., specifying prior distributions and designing computational algorithms, is effectively wasted. On the other hand, in the far more likely case where the posited model is “wrong,” that model misspecification can negatively impact one’s conclusions about the quantity of interest. For example, both skew-normal and Pareto models have a τth\tau^{\text{th}} quantile, but the quality of inferences drawn about that quantile will vary depending on which of these two models is chosen.

The non-negligible dependence on the posited statistical model puts a burden on the data analyst, and those reluctant to take that risk tend to opt for a non-Bayesian approach. After all, if one can get a solution without specifying a statistical model, then it is impossible to incur model misspecification bias. But in taking such an approach, they give up the normative advantage of Bayesian analysis. Can they get the best of both worlds? That is, can one construct a posterior distribution for the quantity of interest directly, incorporating available prior information, without specifying a statistical model and incurring the associated model misspecification risks, and without the need for marginalization over nuisance parameters? Fortunately, the answer is Yes, and this is the present paper’s focus.

The so-called Gibbs posterior distribution is the proper prior-to-posterior update when data and the interest parameter are linked by a loss function rather than a likelihood (Catoni,, 2004; Zhang,, 2006; Bissiri et al.,, 2016). Intuitively, the Gibbs and Bayesian posterior distributions coincide when the loss function linking data and parameter is a (negative) log-likelihood. In that case the properties of the Gibbs posterior can be inferred from the literature on Bayesian asymptotics in both the well-specified and misspecified contexts. For cases where the link is not through a likelihood, the large-sample behavior of the Gibbs posterior is less clear and elucidating this behavior under some simple and fairly general conditions is our goal here.

As a practical example, medical investigators may want to know if a treatment whose effect has been judged to be statistically significant is also clinically significant in the sense that the patients feel better post-treatment. Therefore, they are interested in inference about the effect size cutoff beyond which patients feel better; this is called the minimum clinically important difference or MCID, e.g., Jaescheke et al., (1989). Estimation of the MCID boils down to a classification problem, and standard Bayesian approaches to binary regression do not perform well in this setting; misspecifying the link function leads to bias, and nonparametric modeling of the link function is inefficient (Choudhuri et al.,, 2007). Instead, we found that a Gibbs posterior distribution, as described above, provided a very reasonable and robust solution to the MCID problem (Syring and Martin,, 2017). In some applications, one seeks a “personalized” or subject-specific cutoff that depends on a set of additional covariates. This personalized MCID could be high- or even infinite-dimensional, and our previous Gibbs posterior analysis is not equipped to handle such situations. But the framework developed here is; see Section 6.

In the following sections we lay out and apply conditions under which a Gibbs posterior distribution concentrates, asymptotically, on a neighborhood of the true value of the inferential target as the sample sizes increases. Our focus is not on the most general set of sufficient conditions for concentration; rather, we aim for conditions that are both widely applicable and easily verified. To this end, we consider loss functions of a sub-exponential type, ones that satisfy an inequality similar to the moment-generating function bound for sub-exponential random variables (Boucheron et al.,, 2012). We can apply this condition in a variety of problems, from regression to classification, and in both fixed- and high-dimensional settings. An added advantage is that our conditions lead to straightforward proofs of concentration.

Section 2 provides some background and formally defines the Gibbs posterior distribution. In Section 3, we state our theoretical objectives and present our main results, namely, sets of sufficient conditions under which the Gibbs posterior achieves a specified asymptotic concentration rate. A unique attribute of the Gibbs posterior distribution is its dependence on a tuning parameter called the learning rate, and our results cover a constant, vanishing sequence, and even data-dependent learning rates. Section 4 further discusses verifying our conditions and extends our conditions and main results to handle certain unbounded loss functions. Section 5 applies our general theorems to establish Gibbs posterior concentration rates in a number of practically relevant examples, including nonparametric curve estimation, and high-dimensional sparse classification. Section 6 formulates the personalized MCID application, presents a relevant Gibbs posterior concentration rate result, and gives a brief numerical illustration. Concluding remarks are given in Section 7, and proofs, etc. are postponed to the appendix.

2 Background on Gibbs posteriors

2.1 Notation and definitions

Consider a measurable space (𝕌,𝒰)(\mathbb{U},\mathcal{U}), with 𝒰\mathcal{U} a σ\sigma-algebra of subsets of 𝕌\mathbb{U}, on which a probability measure PP is defined. A random element UPU\sim P need not be a scalar, and many of the applications we have in mind involve U=(X,Y)U=(X,Y) or U=(X,Y,Z)U=(X,Y,Z), where YY denotes a “response” variable and XX or (X,Z)(X,Z) denotes a “predictor” variable, and PP encodes the dependence between the entries in UU. Then the real-world phenomenon under investigation is determined by PP and our goal is to make inference on a relevant feature of PP, which we define as a given functional θ=θ(P)\theta=\theta(P), taking values in Θ\Theta. Note that θ\theta could be finite-, high-, or even infinite-dimensional.

The specific way θ\theta relates to PP guides our posterior construction. Suppose there is a loss function, θ(u)\ell_{\theta}(u), that measures how closely a generic value of θ\theta agrees with a data point uu. (As is customary, “θ\theta” will denote both the quantity of interest and a generic value of that quantity; when we need to distinguish the true from a generic value, we will write “θ\theta^{\star}.”) For example, if u=(x,y)u=(x,y) is a predictor–response pair, and θ\theta is a function, then the loss might be

\ell_{\theta}(u)=|y-\theta(x)|\quad\text{or}\quad\ell_{\theta}(u)=1\{y\neq\theta(x)\}, (1)

depending on whether yy is continuous or discrete/binary, where 1(A)1(A) denotes the indicator function at the event AA. Another common situation is when one specifies a statistical model, say, {Pθ:θΘ}\{P_{\theta}:\theta\in\Theta\}, indexed by a parameter θ\theta, and sets θ(u)=logpθ(u)\ell_{\theta}(u)=-\log p_{\theta}(u), where pθp_{\theta} is the density of PθP_{\theta} with respect to some fixed dominating measure. In all of these cases, the idea is that a loss is incurred when there is a certain discrepancy between θ\theta and the data point uu. Then our inferential target is the value of θ\theta that minimizes the risk or average loss/discrepancy.

Definition 2.1.

Consider a real-valued loss function θ(u)\ell_{\theta}(u) defined on 𝕌×Θ\mathbb{U}\times\Theta, and define the risk function R(θ)=PθR(\theta)=P\ell_{\theta}, the expected loss with respect to PP; throughout, PfPf denotes expected value of f(U)f(U) with respect to UPU\sim P. Then the inferential target is

\theta^{\star}\in\arg\min_{\theta\in\Theta}R(\theta). (2)

Given that estimation/inference is our goal, our focus will be on the case where the risk minimizer, \theta^{\star}, is unique, so that the “\in” in (2) is an equality. But this is not absolutely necessary for our theory. Indeed, the main results in Section 3 remain valid even if the risk minimizer is not unique, and we make a few brief comments about this extension in the discussion following Theorem 3.2.

The risk function itself is unavailable—it depends on PP—and, therefore, so is θ\theta^{\star}. However, suppose that we have an independent and identically distributed (iid) sample Un=(U1,,Un)U^{n}=(U_{1},\ldots,U_{n}) of size nn, with each UiU_{i} having marginal distribution PP on 𝕌\mathbb{U}. The iid assumption is not crucial, but it makes the notation and discussion easier; an extension to independent but not identically distributed (inid) cases is discussed in the context of an example in Section 5.4. In general, we have data UnU^{n} taking values in the measurable space (𝕌n,𝒰n)(\mathbb{U}^{n},\mathcal{U}^{n}), with joint distribution denoted by PnP^{n}. From here, we can naturally replace the unknown risk in Definition 2.1 with an empirical version and proceed accordingly.

Definition 2.2.

For a loss function θ\ell_{\theta} as described above, define the empirical risk as

R_{n}(\theta)=\mathbb{P}_{n}\ell_{\theta}=\frac{1}{n}\sum_{i=1}^{n}\ell_{\theta}(U_{i}), (3)

where n=n1i=1nδUi\mathbb{P}_{n}=n^{-1}\sum_{i=1}^{n}\delta_{U_{i}}, with δu\delta_{u} the Dirac point-mass measure at uu, is the empirical distribution.

Naturally, if the inferential target is the risk minimizer, then it makes sense to estimate that quantity based on data UnU^{n} by minimizing the empirical risk, i.e.,

\hat{\theta}_{n}\in\arg\min_{\theta\in\Theta}R_{n}(\theta). (4)

This is the M-estimator based on an objective function determined by the loss θ\ell_{\theta}; when RnR_{n} is differentiable, the root of R˙n\dot{R}_{n}, the derivative of RnR_{n}, is a Z-estimator and “R˙n(θ)=0\dot{R}_{n}(\theta)=0” is often called an estimating equation (Godambe,, 1991; van der Vaart,, 1998). Since θθ\theta\mapsto\ell_{\theta} need not be smooth or convex, and RnR_{n} is an average over a finite set of data, it is possible that its minimizer is not unique, even if θ\theta^{\star} is. These computational challenges are, in fact, part of what motivates the Gibbs posterior, as we discuss below.
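
As a concrete illustration of (4), the following minimal sketch computes an M-estimator by direct numerical minimization of the empirical risk under the absolute-error loss in (1), with a linear \theta(x)=a+bx; the simulated data and the use of scipy's Nelder-Mead optimizer are our own illustrative choices, not part of the development above.

import numpy as np
from scipy.optimize import minimize

# Simulate predictor-response pairs; no statistical model is assumed below.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=n)

def empirical_risk(par, x, y):
    # R_n(theta) = (1/n) sum_i |y_i - (a + b x_i)|, the empirical risk in (3)
    a, b = par
    return np.mean(np.abs(y - (a + b * x)))

# theta_hat_n in (4): a minimizer of the empirical risk; Nelder-Mead handles
# the non-smooth objective without derivatives
theta_hat = minimize(empirical_risk, x0=np.zeros(2), args=(x, y),
                     method="Nelder-Mead").x
print(theta_hat)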

There is a rich literature on the asymptotic distribution properties of M-estimators, which can be used for developing hypothesis tests and confidence intervals (Maronna et al.,, 2006; Huber and Ronchetti,, 2009). As an alternative, one might consider a Bayesian approach to quantify uncertainty, but there is an immediate obstacle, namely, no statistical model/likelihood connecting the data to the quantity of interest. If we did have a statistical model, with a density function pθp_{\theta}, then the most natural loss is θ(u)=logpθ(u)\ell_{\theta}(u)=-\log p_{\theta}(u) and the likelihood is exp{nRn(θ)}\exp\{-nR_{n}(\theta)\}. It is, therefore, tempting to follow that same strategy for general losses, resulting in a sort of generalized posterior distribution for θ\theta.

Definition 2.3.

Given a loss function θ\ell_{\theta} and the corresponding empirical risk RnR_{n} in Definition 2.2, define the Gibbs posterior distribution as

\Pi_{n}(d\theta)\propto e^{-\omega\,nR_{n}(\theta)}\,\Pi(d\theta),\quad\theta\in\Theta, (5)

where Π\Pi is a prior distribution and ω>0\omega>0 is a so-called learning rate (Holmes and Walker,, 2017; Syring and Martin,, 2019; Grünwald,, 2012; van Erven et al.,, 2015). The dependence of Πn\Pi_{n} on ω\omega will generally be omitted from the notation, but see Sections 2.2 and 3.3.

We will assume that the right-hand side of (5) is integrable in θ\theta, so that the proportionality constant is well-defined. Integrability holds whenever the loss function is bounded from below, like for those in (1), but this could fail in some cases where the loss is not bounded away from -\infty, e.g., when θ(u)\ell_{\theta}(u) is a negative log-density. In such cases, extra conditions on the prior distribution would be required to ensure the Gibbs posterior is well-defined.

An immediate advantage of this approach, compared to the M-estimation strategy described above, is that the user is able to incorporate available prior information about θ\theta directly into the analysis. This is especially important in cases where the quantity of interest has a real-world meaning, as opposed to being just a model parameter, where having genuine prior information is the norm rather than the exception. Additionally, even though there is no likelihood, the same computational methods, such as Markov chain Monte Carlo (Chernozhukov and Hong,, 2003) and variational approximations (Alquier et al.,, 2016), common in Bayesian analysis, can be employed to numerically approximate the Gibbs posterior.
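
For instance, since only the un-normalized density \exp\{-\omega nR_{n}(\theta)\} times the prior density is needed, a generic random-walk Metropolis sampler applies directly. The sketch below reuses the empirical risk from the earlier illustration and takes an independent N(0,10^{2}) prior on each coordinate and a fixed learning rate \omega; these choices are purely illustrative and not prescribed here.

import numpy as np

def log_gibbs(theta, x, y, omega):
    # log of the un-normalized Gibbs posterior in (5):
    # -omega * n * R_n(theta) + log prior density (up to constants)
    n = len(y)
    risk = np.mean(np.abs(y - (theta[0] + theta[1] * x)))
    return -omega * n * risk - 0.5 * np.sum(theta ** 2) / 10.0 ** 2

def rw_metropolis(x, y, omega=1.0, n_iter=5000, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    current = log_gibbs(theta, x, y, omega)
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal(2)   # random-walk proposal
        cand = log_gibbs(prop, x, y, omega)
        if np.log(rng.uniform()) < cand - current:     # Metropolis accept/reject
            theta, current = prop, cand
        draws[t] = theta
    return draws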

We have opted here to define the Gibbs posterior directly as an object to be used and studied, but there is a more formal, more principled way in which Gibbs posteriors emerge. In the PAC-Bayes literature, the goal is to construct a randomized estimator that concentrates in regions of Θ\Theta where the risk, R(θ)R(\theta), or its empirical version, Rn(θ)R_{n}(\theta), is small (Valiant,, 1984; McAllester,, 1999; Alquier,, 2008; Guedj,, 2019). That is, the Gibbs posterior can be viewed as a solution to an optimization problem rather than a solution to the updating-prior-beliefs problem. More formally, for a given prior Π\Pi on Θ\Theta, suppose the goal is to find

\inf_{\mu}\Bigl\{\int R_{n}(\theta)\,\mu(d\theta)+(\omega n)^{-1}K(\mu,\Pi)\Bigr\},

where the infimum is over all probability measures μ\mu that are absolutely continuous with respect to Π\Pi, and KK denotes the Kullback–Leibler divergence. Then it can be shown (e.g., Zhang,, 2006; Bissiri et al.,, 2016) that the unique solution is Πn\Pi_{n}, the Gibbs posterior defined in (5). Therefore, the Gibbs posterior distribution is the measure minimizing a penalized risk, averaged with respect to a given prior, Π\Pi.
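
The key step in that argument is a standard change-of-measure identity, recorded here as a brief sketch (a routine calculation, not quoted verbatim from the cited works). Writing Z_{n}=\int_{\Theta}e^{-\omega nR_{n}(\theta)}\,\Pi(d\theta) for the normalizing constant in (5), for any \mu absolutely continuous with respect to \Pi,

\int R_{n}(\theta)\,\mu(d\theta)+(\omega n)^{-1}K(\mu,\Pi)=(\omega n)^{-1}\bigl\{K(\mu,\Pi_{n})-\log Z_{n}\bigr\},

and since K(\mu,\Pi_{n})\geq 0 with equality if and only if \mu=\Pi_{n}, the left-hand side is uniquely minimized at \mu=\Pi_{n}.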

2.2 Learning rate

Readers familiar with M-estimation may not recognize the learning rate, ω\omega. This does not appear in the M-estimation context because all that influences the optimization problem—and the corresponding asymptotic distribution theory—is the shape of the loss/risk function, not its magnitude or scale. On the other hand, the learning rate is an essential component of the Gibbs posterior distribution in (5) since the distribution depends on both the shape and scale of the loss function. Data-driven strategies for tuning the learning rate are available (Holmes and Walker,, 2017; Syring and Martin,, 2019; Lyddon et al.,, 2019; Wu and Martin,, 2022, 2021).

Here we focus on how the learning rate affects posterior concentration. In typical examples, our results require the learning rate to be a sufficiently small constant. That constant depends on features of PP, which are generally unknown, so, in practice, the learning rate can be taken to be a slowly vanishing sequence, which has a negligible effect on the concentration rate. In more challenging examples, we require the learning rate to vanish sufficiently fast in nn; this is also the case in Grünwald and Mehta, (2020).

2.3 Relation to other generalized posterior distributions

A generalized posterior is any data-dependent distribution other than a well-specified Bayesian posterior. Examples include Gibbs and misspecified Bayesian posteriors, which we compare first.

A key characteristic of the misspecified Bayesian posterior is that it is accidentally misspecified. That is, the data analyst does his/her best to posit a sound model, QγQ_{\gamma}, for the data-generating process and obtains the corresponding posterior for the model parameter γ\gamma. That model will typically be misspecified, i.e., P{Qγ:γΓ}P\not\in\{Q_{\gamma}:\gamma\in\Gamma\}, so the aforementioned posterior will, under certain conditions, concentrate around the point γ\gamma^{\dagger} that minimizes the Kullback–Leibler divergence of QγQ_{\gamma} from PP (Kleijn and van der Vaart,, 2006; De Blasi and Walker,, 2013; Ramamoorthi et al.,, 2015). For a feature θ()\theta(\cdot) of interest, typically θ(Qγ)θ(P)\theta(Q_{\gamma^{\dagger}})\neq\theta(P), so there is generally a bias that the data analyst can do nothing about. A Gibbs posterior, on the other hand, is purposely misspecified—no attempt is made to model PP. Rather, it directly targets the feature of interest via the loss function that defines it. This strategy avoids model misspecification bias, but its point of view also sheds light on the importance of the choice of learning rate. Since the data analyst knows the Gibbs posterior is not a correctly specified Bayesian posterior, they know the learning rate must be handled with care.

A number of authors have studied generalized posterior distributions formed using a likelihood raised to a power η(0,1]\eta\in(0,1]; see, e.g., Martin et al., (2017) and Grünwald and van Ommen, (2017) among others. These η\eta-generalized posteriors tend to be robust to misspecification of the probability model, and data-driven methods to tune η\eta are developed in, e.g., Grünwald and van Ommen, (2017). These posteriors coincide with Gibbs posteriors based on a log-loss and with the learning rate ω\omega corresponding to η\eta.

A common non-Bayesian approach to statistics bases inferences on moment conditions of the form Pf_{j}=0 for functions f_{1},\ldots,f_{J}, rather than on a fully-specified likelihood. A number of authors have extended this framework to posterior inference. Kim, (2002) uses the moment conditions to construct a so-called limited-information likelihood to use in place of a fully-specified likelihood in a Bayesian formulation. Similarly, Chernozhukov and Hong, (2003) construct a posterior by taking a (pseudo) log-likelihood equal to a quadratic form determined by the set of moment conditions. Chib et al., (2018) study Bayesian exponentially-tilted empirical likelihood posterior distributions, also defined by moment conditions. In some cases the Gibbs posterior distribution coincides with the above moment-based methods. For instance, in the special case that the risk R is differentiable at \theta^{\star} with derivative \dot{R}(\theta^{\star}), risk minimization corresponds to the single moment condition \dot{R}(\theta^{\star})=0.

3 Asymptotic concentration rates

3.1 Objective

A large part of the Bayesian literature concerns the asymptotic concentration properties of their posterior distributions. Roughly, if data are generated from a distribution for which the quantity of interest takes value θ\theta^{\star}, then, as nn\to\infty, the posterior distribution ought to concentrate its mass around that same θ\theta^{\star}. As we will show, optimal concentration rates are possible with Gibbs posteriors, so the robustness achieved by not specifying a statistical model has no cost in terms of (asymptotic) efficiency.

Towards this, for a fixed θΘ\theta^{\star}\in\Theta, let d(θ;θ)d(\theta;\theta^{\star}) denote a divergence measure on Θ\Theta in the sense that d(θ;θ)0d(\theta;\theta^{\star})\geq 0 for all θ\theta, with equality if and only if θ=θ\theta=\theta^{\star}. The divergence measure could depend on the sample size nn or other deterministic features of the problem at hand, especially in the independent but not iid setting; see Section 5.4. Our objective is to provide conditions under which the Gibbs posterior will concentrate asymptotically, at a certain rate, around θ\theta^{\star} relative to the divergence measure dd. Throughout this paper, (εn)(\varepsilon_{n}) denotes a deterministic sequence of positive numbers with εn0\varepsilon_{n}\to 0, which will be referred to as the Gibbs posterior concentration rate.

Definition 3.1.

The Gibbs posterior Πn\Pi_{n} in (5) asymptotically concentrates around θ\theta^{\star} at rate (at least) εn\varepsilon_{n}, with respect to dd, if

P^{n}\Pi_{n}(\{\theta:d(\theta;\theta^{\star})>M_{n}\varepsilon_{n}\})\rightarrow 0,\quad\text{as $n\to\infty$,} (6)

where Mn>0M_{n}>0 is either a (deterministic) sequence satisfying MnM_{n}\to\infty arbitrarily slowly or is a sufficiently large constant, MnMM_{n}\equiv M.

In the PAC-Bayes literature, the Gibbs posterior distribution is interpreted as a “randomized estimator,” a generator of random θ\theta values that tend to make the risk difference small. For iid data and with risk divergence d(θ;θ)={R(θ)R(θ)}1/2d(\theta;\theta^{\star})=\{R(\theta)-R(\theta^{\star})\}^{1/2}, a concentration result like that in Definition 3.1 makes this strategy clear, since the Πn\Pi_{n}-probability of the event {R(θ)R(θ)εn2}\{R(\theta)-R(\theta^{\star})\leq\varepsilon_{n}^{2}\} would be 1\to 1.

If the Gibbs posterior concentrates around θ\theta^{\star} in the sense of Definition 3.1, then any reasonable estimator derived from that distribution, such as the mean, should inherit the εn\varepsilon_{n} rate at θ\theta^{\star} relative to the divergence measure dd. This can be made formal under certain conditions on dd; see, e.g., Corollary 1 in Barron et al., (1999) and the discussion following the proof of Theorem 2.5 in Ghosal et al., (2000).

Besides concentration rates, in certain cases it is possible to establish distributional approximations to Gibbs posteriors, i.e., Bernstein–von Mises theorems. Results for finite-dimensional problems and with sufficiently smooth loss functions can be found in, e.g., Bhattacharya and Martin, (2022) and Chernozhukov and Hong, (2003).

3.2 Conditions

Here we discuss a general strategy for proving Gibbs posterior concentration and the kinds of sufficient conditions needed for the strategy to be successful. To start, set An={θ:d(θ,θ)>Mnεn}ΘA_{n}=\{\theta:d(\theta,\theta^{\star})>M_{n}\varepsilon_{n}\}\subset\Theta. Then our first step towards proving concentration is to express Πn(An)\Pi_{n}(A_{n}) as the ratio

\Pi_{n}(A_{n})=\frac{N_{n}(A_{n})}{D_{n}}=\frac{\int_{A_{n}}\exp[-\omega n\{R_{n}(\theta)-R_{n}(\theta^{\star})\}]\,\Pi(d\theta)}{\int_{\Theta}\exp[-\omega n\{R_{n}(\theta)-R_{n}(\theta^{\star})\}]\,\Pi(d\theta)}. (7)

The goal is to suitably upper and lower bound Nn(An)N_{n}(A_{n}) and DnD_{n}, respectively, in such a way that the ratios of these bounds vanish. Two sufficient conditions are discussed below, the first primarily dealing with the loss function and aiding in bounding Nn(An)N_{n}(A_{n}) and the second primarily concerning the prior distribution and aiding in bounding DnD_{n}. Both conditions concern the excess loss θ(U)θ(U)\ell_{\theta}(U)-\ell_{\theta^{\star}}(U) and its mean and variance:

m(\theta,\theta^{\star}):=P(\ell_{\theta}-\ell_{\theta^{\star}})\quad\text{and}\quad v(\theta,\theta^{\star}):=P(\ell_{\theta}-\ell_{\theta^{\star}})^{2}-m(\theta,\theta^{\star})^{2}.

3.2.1 Sub-exponential type losses

Our method for proving posterior concentration requires a vanishing upper bound on the expectation of the numerator term Nn(An)N_{n}(A_{n}) in (7). Since the integrand, exp[ωn{Rn(θ)Rn(θ)}]\exp[-\omega n\{R_{n}(\theta)-R_{n}(\theta^{\star})\}], in Nn(An)N_{n}(A_{n}) is non-negative, Fubini’s theorem says it suffices to bound its expectation. Further, by independence

P^{n}e^{-\omega n\{R_{n}(\theta)-R_{n}(\theta^{\star})\}}=\{Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\}^{n},

which reveals that the key to bounding PnNn(An)P^{n}N_{n}(A_{n}) is to bound Peω(θθ)Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}, the expected exponentiated excess loss. In order for the bound on PnNn(An)P^{n}N_{n}(A_{n}) to vanish it must be that Peω(θθ)<1Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}<1, but this is not enough to identify the concentration rate in (6). Rather, the speed at which Peω(θθ)Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})} vanishes must be a function of d(θ,θ)d(\theta,\theta^{\star}), and we take this relationship as our key condition for Gibbs posterior concentration. When this holds we say the loss function is of sub-exponential type.

Condition 1.

There exists an interval (0,ω¯)(0,\bar{\omega}) and constants K,r>0K,\,r>0, such that for all ω(0,ω¯)\omega\in(0,\bar{\omega}) and for all sufficiently small δ>0\delta>0, for θΘ\theta\in\Theta

d(\theta;\theta^{\star})>\delta\implies Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}<e^{-K\omega\delta^{r}} (8)

(The constant r>0 that appears here and below can take on different values depending on the context. The case r=2 is common, but some “non-regular” problems require r\neq 2; see Section 5.5.)

An immediate consequence of Condition 1 and the definition of AnA_{n} is the key finite-sample, exponential bound on the Gibbs posterior numerator

P^{n}N_{n}(A_{n})\leq e^{-K\omega nM_{n}^{r}\varepsilon_{n}^{r}}. (9)
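
To spell out the short calculation behind (9): every \theta\in A_{n} satisfies d(\theta;\theta^{\star})>M_{n}\varepsilon_{n}, so Condition 1 with \delta=M_{n}\varepsilon_{n}, together with Fubini's theorem and the identity P^{n}e^{-\omega n\{R_{n}(\theta)-R_{n}(\theta^{\star})\}}=\{Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\}^{n}, gives

P^{n}N_{n}(A_{n})=\int_{A_{n}}\{Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\}^{n}\,\Pi(d\theta)\leq e^{-K\omega n(M_{n}\varepsilon_{n})^{r}}\,\Pi(A_{n})\leq e^{-K\omega nM_{n}^{r}\varepsilon_{n}^{r}}.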

For some intuition behind Condition 1, consider the following. Let ff be a real-valued function such that the random variable f(U)f(U) has a distribution with sufficiently thin tails. This is automatic when ff is bounded, but suppose f(U)f(U) has a moment-generating function, i.e., is sub-exponential. For bounding the moment-generating function, the dream case would be if PeωfeωPfPe^{-\omega f}\leq e^{-\omega Pf}. Unfortunately, Jensen’s inequality implies the dream is an impossibility. It is possible, however, to show

Pe^{-\omega f}\leq e^{-\omega Pf+\omega^{2}G(f)},

for suitable G(f)>0G(f)>0 depending on certain features of ff (and of PP). If it could also be shown, e.g., that G(f)PfG(f)\lesssim Pf, then we are in a “near-dream” case where

Pe^{-\omega f}\leq e^{-c\omega Pf},\quad\text{for a constant $c\in(0,1)$ and sufficiently small $\omega$.}

The little extra needed beyond sub-exponential to achieve the “near-dream” case bound above is why we refer to such ff as being sub-exponential type.
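
As a toy illustration of this “near-dream” bound (our own example, chosen only for transparency), take f(U)=a\,1(U\in B) for a constant a>0 and an event B with P(B)=p, so that Pf=ap. Then

Pe^{-\omega f}=1-p(1-e^{-\omega a})\leq e^{-p(1-e^{-\omega a})}\leq e^{-\omega ap(1-\omega a/2)}=e^{-c\omega Pf},\quad c=1-\tfrac{\omega a}{2},

using 1+x\leq e^{x} and 1-e^{-x}\geq x-x^{2}/2 for x\geq 0; so the “near-dream” bound holds with c\in(0,1) for any \omega<2/a.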

Towards verifying Condition 1 we briefly review some developments in Grünwald and Mehta, (2020); further comments can be found in Section 3.4. These authors focus on an annealed expectation which, for a real-valued function ff of the random element UPU\sim P and a fixed constant ω>0\omega>0, is defined as Pann(ω)f=ω1logPeωfP^{\text{\sc ann}(\omega)}f=-\omega^{-1}\log Pe^{-\omega f}. With this, we find that Peωf=exp{ωPann(ω)f}Pe^{-\omega f}=\exp\{-\omega P^{\text{\sc ann}(\omega)}f\}. So an upper bound as we require in Condition 1 is equivalent to a corresponding lower bound on Pann(ω)(θθ)P^{\text{\sc ann}(\omega)}(\ell_{\theta}-\ell_{\theta^{\star}}). The “strong central condition” in Grünwald and Mehta, (2020) states that,

P^{\text{\sc ann}(\omega)}(\ell_{\theta}-\ell_{\theta^{\star}})\geq 0,\quad\text{for all $\theta\in\Theta$}.

The above inequality implies Peω(θθ)1Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\leq 1 and, in turn, that PnNn(An)=O(1)P^{n}N_{n}(A_{n})=O(1) as nn\to\infty. The other conditions in Grünwald and Mehta, (2020) aim to lower-bound the annealed expected excess loss by a suitable function of the excess risk. For example, Lemma 13 in Grünwald and Mehta, (2020) shows that a “witness condition” implies

P^{\text{\sc ann}(\omega)}(\ell_{\theta}-\ell_{\theta^{\star}})\gtrsim R(\theta)-R(\theta^{\star}),\quad\text{for all $\theta\in\Theta$},

so, if R(θ)R(θ)R(\theta)-R(\theta^{\star}) is of the order d(θ,θ)rd(\theta,\theta^{\star})^{r} for some r>0r>0, then we recover Condition 1. Therefore, our Condition 1 is exactly what is needed to control the numerator of the Gibbs posterior, and the strong central and witness conditions developed in Grünwald and Mehta, (2020) and elsewhere constitute a set of sufficient conditions for our Condition 1.
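
In symbols, combining the two displays above (with implied constants suppressed): if R(\theta)-R(\theta^{\star})\gtrsim d(\theta,\theta^{\star})^{r}, then

Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}=e^{-\omega P^{\text{\sc ann}(\omega)}(\ell_{\theta}-\ell_{\theta^{\star}})}\leq e^{-c\omega\{R(\theta)-R(\theta^{\star})\}}\leq e^{-c^{\prime}\omega\,d(\theta,\theta^{\star})^{r}},

for constants c,c^{\prime}>0, which is the bound required by (8), with K=c^{\prime}.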

A subtle difference between our approach and Grünwald and Mehta’s is that the bounds in the above two displays are required for all \theta\in\Theta; Condition 1 only deals with \theta bounded away from \theta^{\star}. This difference arises because we directly target a bound on the Gibbs posterior probability, \Pi_{n}(A_{n}), an integration over A_{n}\not\ni\theta^{\star}, whereas Grünwald and Mehta, (2020) target a bound on the Gibbs posterior mean of \theta\mapsto R(\theta)-R(\theta^{\star}), an integration over all of \Theta. This global requirement can be a disadvantage: Example 3.8 of van Erven et al., (2015) illustrates that it creates some challenges in checking the bounds in the above two displays.

Besides Condition 1, other strategies for bounding Nn(An)N_{n}(A_{n}) in (7) have been employed in the literature. For example, empirical process techniques are used in Syring and Martin, (2017); Bhattacharya and Martin, (2022) to prove Gibbs posterior concentration in finite-dimensional applications. Generally, such proofs hinge on a uniform law of large numbers, which can be challenging to verify in non-parametric problems. Chernozhukov and Hong, (2003) require the stronger condition that supd(θ,θ)>δ{Rn(θ)Rn(θ)}0\sup_{d(\theta,\theta^{\star})>\delta}\{R_{n}(\theta)-R_{n}(\theta^{\star})\}\to 0 in PnP^{n}-probability. When it holds they immediately recover an in-probability bound on Nn(An)N_{n}(A_{n}), irrespective of ω\omega, but they do not obtain a finite-sample bound on PnΠn(An)P^{n}\Pi_{n}(A_{n}). Their analysis is limited to finite-dimensional parameters and empirical risk functions Rn(θ)R_{n}(\theta) that are smooth in a neighborhood of θ\theta^{\star}. This latter condition excludes important examples like the misclassification-error loss function; see Section 5.5.

There are situations in which Condition 1 does not hold, and we address this formally in Section 4. As an example, note that both Condition 1 and the witness condition used in Grünwald and Mehta, (2020) are closely related to a Bernstein-type condition relating the first two moments of the excess loss: P(θθ)2c{R(θ)R(θ)}αP(\ell_{\theta}-\ell_{\theta^{\star}})^{2}\leq c\{R(\theta)-R(\theta^{\star})\}^{\alpha} for constants (c,α)>0(c,\,\alpha)>0. For the “check” loss used in quantile estimation, the Bernstein condition is generally not satisfied and, consequently, neither Grünwald and Mehta’s witness condition nor our Condition 1—at least not in their original forms—can be verified. Similarly, if the excess loss, θ(U)θ(U)\ell_{\theta}(U)-\ell_{\theta^{\star}}(U), a function of data UU, is heavy-tailed, then the moment-generating function bound in Condition 1 does not hold. For these cases, some modifications to the basic setup are required, which we present in Section 4.

3.2.2 Prior distribution

Generally, the prior must place enough mass on certain “risk-function” neighborhoods Gn:={θ:m(θ,θ)v(θ,θ)εnr}G_{n}:=\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq\varepsilon_{n}^{r}\}. This is analogous to the requirement in the Bayesian literature that the prior place sufficient mass on Kullback–Leibler neighborhoods of θ\theta^{\star}; see, e.g., Shen and Wasserman, (2001) and Ghosal et al., (2000). Some version of the following prior bound is needed

\log\Pi(G_{n})\gtrsim-n\varepsilon_{n}^{r}, (10)

for rr as in Condition 1. Lemma 1 in the appendix, along with (10), provides a lower probability bound on the denominator term DnD_{n} defined in (7). That is,

P^{n}\bigl(D_{n}\leq b_{n}\bigr)\leq(n\varepsilon_{n}^{r})^{-1}, (11)

where bn=12Π(Gn)e2ωnεnrb_{n}=\frac{1}{2}\Pi(G_{n})e^{-2\omega n\varepsilon_{n}^{r}}. Both the form of the risk-function neighborhoods and the precise lower bound in (10) depend on the concentration rate and the learning rate, so the results in Section 3.3 all require their own version of the above prior bound.

Grünwald and Mehta, (2020) only require bounds on the prior mass of the larger neighborhoods {θ:m(θ,θ)εnr}\{\theta:m(\theta,\theta^{\star})\leq\varepsilon_{n}^{r}\}. Under their condition, we can derive a lower bound on PnDnP^{n}D_{n} similar to Lemma 1 in Martin et al., (2013). However, our proofs require an in-probability lower bound on DnD_{n}, which in turn requires stronger prior concentration like that in (10). While there are important examples where the lower bounds on m(θ,θ)m(\theta,\theta^{\star}) and v(θ,θ)v(\theta,\theta^{\star}) are of different orders, none of the applications we consider here are of that type. Therefore, the stronger prior concentration condition in (10) does not affect the rates we derive for the examples in Section 5. Moreover, as discussed following the statement of Theorem 3.2 below, our finite-sample bounds are better than those in Grünwald and Mehta, (2020), a direct consequence of our method of proof that uses a smaller neighborhood GnG_{n}.

3.3 Main results

In this section we present general results on Gibbs posterior concentration. Proofs can be found in Section 1 of the supplementary material. Our first result establishes Gibbs posterior concentration, under Condition 1 and a local prior condition, for sufficiently small constant learning rates.

Theorem 3.2.

Let εn\varepsilon_{n} be a vanishing sequence satisfying nεnrn\varepsilon_{n}^{r}\to\infty for a constant r>0r>0. Suppose the prior satisfies

\log\Pi(\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq\varepsilon_{n}^{r}\})\geq-C_{1}n\varepsilon_{n}^{r}, (12)

for some C_{1}>0. If the loss function satisfies Condition 1 for the divergence measure d, the same r>0 as above, and learning rate \omega\in(0,\bar{\omega}) for some \bar{\omega}>0, then the Gibbs posterior distribution in (5) has asymptotic concentration rate \varepsilon_{n}, with M_{n}\equiv M a large constant as in Definition 3.1.

For a brief sketch of the proof, bound the posterior probability of A_{n} by

\Pi_{n}(A_{n})\leq\frac{N_{n}(A_{n})}{D_{n}}\,1(D_{n}>b_{n})+1(D_{n}\leq b_{n})\leq b_{n}^{-1}N_{n}(A_{n})+1(D_{n}\leq b_{n}),

where bnb_{n} is as above. Taking expectation of both sides, and applying (11) and the consequence (9) of Condition 1, we get

P^{n}\Pi_{n}(A_{n})\leq b_{n}^{-1}e^{-K\omega nM_{n}^{r}\varepsilon_{n}^{r}}+(n\varepsilon_{n}^{r})^{-1}.
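
To see that the first term vanishes, plug in b_{n}=\frac{1}{2}\Pi(G_{n})e^{-2\omega n\varepsilon_{n}^{r}} and the prior bound (12), so that, with M_{n}\equiv M,

b_{n}^{-1}e^{-K\omega nM^{r}\varepsilon_{n}^{r}}\leq 2\exp\{(C_{1}+2\omega-K\omega M^{r})\,n\varepsilon_{n}^{r}\}\to 0,

provided M^{r}>(C_{1}+2\omega)/(K\omega), since n\varepsilon_{n}^{r}\to\infty.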

Then the right-hand side is generally of order (nεnr)10(n\varepsilon_{n}^{r})^{-1}\to 0. To compare with the results in Grünwald and Mehta, (2020, Example 2), their upper bound on PnΠn(An)P^{n}\Pi_{n}(A_{n}) is O(Mn1)O(M_{n}^{-1}), which vanishes arbitrarily slowly.

For the case where the risk minimizer in (2) is not unique, certain modifications of the above argument can be made, similar to those in Kleijn and van der Vaart, (2006, Theorem 2.4). Roughly, to our Theorem 3.2 above, we would add the requirements that (12) and Condition 1 hold uniformly in θΘ\theta^{\star}\in\Theta^{\star}, where Θ\Theta^{\star} is the set of risk minimizers. Then virtually the same proof shows that PnΠn({θ:d(θ,Θ)εn})0P^{n}\Pi_{n}(\{\theta:d(\theta,\Theta^{\star})\gtrsim\varepsilon_{n}\})\to 0, where d(θ,Θ)=infθΘd(θ,θ)d(\theta,\Theta^{\star})=\inf_{\theta^{\star}\in\Theta^{\star}}d(\theta,\theta^{\star}).

Theorem 3.2 is quite flexible and can be applied in a range of settings; see Section 5. However, one case in which it cannot be directly applied is when nεnrn\varepsilon_{n}^{r} is bounded. For example, in sufficiently smooth finite-dimensional problems, we have r=2r=2 and the target rate is εn=n1/2\varepsilon_{n}=n^{-1/2}. The difficulty is caused by the prior bound in (12), since it is impossible—at least with a fixed prior—to assign mass bounded away from 0 to a shrinking neighborhood of θ\theta^{\star}. One option is to add a logarithmic factor to the rate, i.e., take εn=(logn)kn1/2\varepsilon_{n}=(\log n)^{k}n^{-1/2}, so that eCnεn2e^{-Cn\varepsilon_{n}^{2}} is a power of n1/2n^{-1/2}. Alternatively, a refinement of the proof of Theorem 3.2 lets us avoid slowing down the rate.

Theorem 3.3.

Consider a finite-dimensional θ\theta, taking values in Θq\Theta\subseteq\mathbb{R}^{q} for some q1q\geq 1. Suppose that the target rate εn\varepsilon_{n} is such that nεnrn\varepsilon_{n}^{r} is bounded for some constant r>0r>0. If the prior Π\Pi satisfies

\Pi(\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq\varepsilon_{n}^{2}\})\gtrsim\varepsilon_{n}^{q}, (13)

and if Condition 1 holds, then the Gibbs posterior distribution in (5), with any learning rate ω(0,ω¯)\omega\in(0,\bar{\omega}), has asymptotic concentration rate εn\varepsilon_{n} at θ\theta^{\star} with respect to any divergence d(θ,θ)d(\theta,\theta^{\star}) satisfying θθd(θ,θ)θθ\|\theta-\theta^{\star}\|\lesssim d(\theta,\theta^{\star})\lesssim\|\theta-\theta^{\star}\| and for any diverging, positive sequence MnM_{n} in Definition 3.1.

The learning rate is critical to the Gibbs posterior’s performance, but in applications it may be challenging to determine the upper bound \bar{\omega} for which Condition 1 holds. For a simple illustration, suppose the excess loss \ell_{\theta}-\ell_{\theta^{\star}} is normally distributed with variance \sigma^{2}(\theta) satisfying \sigma^{2}(\theta)\leq c\{R(\theta)-R(\theta^{\star})\}, a kind of Bernstein condition. In this case, Condition 1 holds for d(\theta,\theta^{\star})=\{R(\theta)-R(\theta^{\star})\}^{1/2}, with r=2, if \omega<2c^{-1}; a short verification is given after this paragraph. Bernstein conditions can be verified in many practical examples, but the factor c would rarely be known. Consequently, we need \omega to be sufficiently small, but the meaning of “sufficiently small” depends on unknowns involving P. However, any positive, vanishing learning rate sequence \omega=\omega_{n} satisfies \omega_{n}\in(0,\bar{\omega}) for all sufficiently large n. And if \omega_{n} vanishes arbitrarily slowly, then it has no effect on the Gibbs posterior concentration rate; see Section 5.6. All we require to accommodate a vanishing learning rate is a slightly stronger \omega_{n}-dependent version of the prior concentration bound in (12).
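
To verify the claim in this normal illustration (a two-line calculation under the stated assumptions, using m(\theta,\theta^{\star})=R(\theta)-R(\theta^{\star})=d(\theta,\theta^{\star})^{2} and the normal moment-generating function):

Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}=\exp\bigl\{-\omega m(\theta,\theta^{\star})+\tfrac{\omega^{2}}{2}\sigma^{2}(\theta)\bigr\}\leq\exp\bigl\{-\omega\bigl(1-\tfrac{\omega c}{2}\bigr)d(\theta,\theta^{\star})^{2}\bigr\},

so (8) holds with r=2 and K=1-\omega c/2>0 whenever \omega<2c^{-1}.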

Theorem 3.4.

Let εn\varepsilon_{n} be a vanishing sequence and ωn\omega_{n} be a learning rate sequence satisfying nωnεnrn\omega_{n}\varepsilon_{n}^{r}\to\infty for a constant r>0r>0. Consider a Gibbs posterior distribution Πn=Πnωn\Pi_{n}=\Pi_{n}^{\omega_{n}} in (5) based on this sequence of learning rates. If the prior satisfies

\log\Pi(\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq\varepsilon_{n}^{r}\})\geq-Cn\omega_{n}\varepsilon_{n}^{r}, (14)

and if Condition 1 holds for ωn\omega_{n}, then the Gibbs posterior distribution in (5), with learning rate sequence ωn\omega_{n}, has concentration rate εn\varepsilon_{n} at θ\theta^{\star} for a sufficiently large constant M>0M>0 in Definition 3.1.

The proof of Theorem 3.4 is almost identical to that of Theorem 3.2, hence omitted. But to see the basic idea, we mention two key observations. First, since Condition 1 holds for ωn\omega_{n} for all sufficiently large nn, and since ωn\omega_{n} is deterministic, by the same argument producing the bound in (9), we get

P^{n}N_{n}(A_{n})\leq e^{-Kn\omega_{n}M_{n}^{r}\varepsilon_{n}^{r}}\quad\text{for all {\em sufficiently large} $n$}.

Second, the same argument producing the bound in (11) shows that

P^{n}\bigl(D_{n}\leq\tfrac{1}{2}\Pi(G_{n})e^{-2n\omega_{n}\varepsilon_{n}^{r}}\bigr)\leq P^{n}\bigl(D_{n}\leq\tfrac{1}{2}e^{-(C+2)n\omega_{n}\varepsilon_{n}^{r}}\bigr)\leq(n\omega_{n}\varepsilon_{n}^{r})^{-1}.

Then the only difference between this situation and that in Theorem 3.2 is that, here, the numerator bound only holds for “sufficiently large n,” instead of for all n. This makes Theorem 3.4 slightly weaker than Theorem 3.2, since there are no finite-sample bounds.

When \omega_{n}\equiv\omega, the constant learning rate is absorbed by C and there is no difference between the prior bounds in (12) and (14). But the prior probability assigned to the (m\vee v)-neighborhood of \theta^{\star} does not depend on \omega_{n}, so if it satisfies (12), then the only way it could also satisfy (14) is if \varepsilon_{n} is bigger than it would have been without a vanishing learning rate. In other words, Theorem 3.2 requires n\varepsilon_{n}^{r}\rightarrow\infty whereas Theorem 3.4 requires n\omega_{n}\varepsilon_{n}^{r}\rightarrow\infty, which implies that, for a given vanishing \omega_{n}, Theorem 3.2 can potentially accommodate a faster rate \varepsilon_{n}. Therefore, we see that a vanishing learning rate can slow down the Gibbs posterior concentration rate if it does not vanish arbitrarily slowly. There are applications that require the learning rate to vanish at a particular n-dependent rate, and these tend to be those where adjustments like in Section 4 are needed; see Sections 5.3.2 and 5.5.2.

If we can use the data to estimate \bar{\omega} consistently, then it makes sense to choose a learning rate sequence depending on this estimator. Suppose our data-dependent learning rate \hat{\omega}_{n} satisfies \hat{\omega}_{n}<\bar{\omega} with probability converging to 1 as n\rightarrow\infty. For \omega\equiv\hat{\omega}_{n}, the conclusion of Theorem 3.2 holds for all sufficiently large n; see Section 4.2 of the supplementary material. One advantage of this strategy is that it avoids using a vanishing learning rate, which may slow concentration.

Theorem 3.5.

Fix a positive deterministic learning rate sequence ωn\omega_{n} such that the conditions of Theorem 3.4 hold and as a result Πnωn\Pi_{n}^{\omega_{n}} has asymptotic concentration rate εn\varepsilon_{n}. Consider a random learning rate sequence ω^n\hat{\omega}_{n} satisfying

P^{n}(\omega_{n}/2<\hat{\omega}_{n}<\omega_{n})\rightarrow 1,\quad n\to\infty. (15)

Then Πnω^n\Pi_{n}^{\hat{\omega}_{n}}, the Gibbs posterior distribution in (5) scaled by the random learning rate sequence ω^n\hat{\omega}_{n}, also has concentration rate εn\varepsilon_{n} at θ\theta^{\star} for a sufficiently large constant M>0M>0 in Definition 3.1.

3.4 Checking Condition 1

Of course, Condition 1 is only useful if it can be checked in practically relevant examples. As mentioned in Section 3.2, Grünwald and Mehta’s strong central and witness conditions are sufficient for Condition 1. A pair of slightly stronger, but practically verifiable, conditions require that the excess loss be sub-exponential with first and second moments obeying a Bernstein condition. For the examples we consider in Section 5 where Condition 1 can be used, these two conditions are convenient.

3.4.1 Bounded excess losses

For bounded excess losses, i.e., \ell_{\theta}(u)-\ell_{\theta^{\star}}(u)<C for all (\theta,u), Lemma 7.26 in Lafferty et al., (2010) gives:

Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\leq\exp\bigl[-\omega m(\theta,\theta^{\star})+\omega^{2}v(\theta,\theta^{\star})\bigl\{\tfrac{\exp(C\omega)-1-C\omega}{C^{2}\omega^{2}}\bigr\}\bigr]. (16)

Whether Condition 1 holds depends on the choice of ω\omega and the relationship between m(θ,θ)m(\theta,\theta^{\star}) and v(θ,θ)v(\theta,\theta^{\star}) as defined by the Bernstein condition:

v(\theta,\theta^{\star})\leq C_{1}m(\theta,\theta^{\star})^{\alpha},\quad\alpha\in(0,1],\quad C_{1}>0. (17)

Denote the bracketed expression in the exponent of (16) by C(\omega)=(C^{2}\omega^{2})^{-1}\{\exp(C\omega)-1-C\omega\}. When \alpha=1, the m and v functions are of the same order and, if d(\theta,\theta^{\star})=m(\theta,\theta^{\star}) and \omega C(\omega)\leq(2C_{1})^{-1}, then Condition 1 holds with K=1/2. Since \omega C(\omega)\to 0 as \omega\to 0, it suffices to take the learning rate to be a sufficiently small constant.
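
Explicitly, substituting the Bernstein condition (17) with \alpha=1 into (16) gives

Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\leq\exp\bigl[-\omega m(\theta,\theta^{\star})\{1-\omega C(\omega)C_{1}\}\bigr]\leq e^{-\frac{1}{2}\omega m(\theta,\theta^{\star})},

where the second inequality uses \omega C(\omega)\leq(2C_{1})^{-1}; this is the bound in (8), with K=1/2, for the divergence d(\theta,\theta^{\star})=m(\theta,\theta^{\star}).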

On the other hand, when v is larger than m, i.e., the Bernstein condition holds with \alpha<1, Condition 1 requires that \omega depend on \alpha and m(\theta,\theta^{\star}). Suppose m(\theta,\theta^{\star})\geq d(\theta,\theta^{\star})>\varepsilon_{n}. For all small enough \varepsilon_{n}, we can set \omega_{n}=(4C)^{-1/2}\varepsilon_{n}^{(1-\alpha)/2} and Condition 1 is again satisfied with K=1/2.

The above strategies are implemented in the classification examples in Section 5.5 where we also discuss connections to the Tsybakov (Tsybakov,, 2004) and Massart (Massart and Nedelec,, 2006) conditions. For the former, see Theorem 22 and the subsequent discussion in Grünwald and Mehta, (2020), where the learning rate ends up being a vanishing sequence depending on the concentration rate and the Bernstein exponent α\alpha.

3.4.2 Sub-exponential excess losses

Unbounded but light-tailed loss differences θθ\ell_{\theta}-\ell_{\theta^{\star}} may also satisfy Condition 1. Both sub-exponential and sub-Gaussian random variables (Boucheron et al.,, 2012, Sec. 2.3–4) admit an upper bound on their moment-generating functions. When the loss difference θθ\ell_{\theta}-\ell_{\theta^{\star}} is sub-Gaussian

Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\leq\exp\bigl\{\tfrac{\omega^{2}}{2}\sigma^{2}(\theta,\theta^{\star})-\omega m(\theta,\theta^{\star})\bigr\}, (18)

for all ω\omega, where the variance proxy σ2(θ,θ)\sigma^{2}(\theta,\theta^{\star}) may depend on (θ,θ)(\theta,\theta^{\star}). If θθ\ell_{\theta}-\ell_{\theta^{\star}} is sub-exponential, then the above bound holds for all ω(2b)1\omega\leq(2b)^{-1} for some b<b<\infty indexing the tail behavior of PP.

If σ2(θ,θ)\sigma^{2}(\theta,\theta^{\star}) is upper-bounded by Lm(θ,θ)Lm(\theta,\theta^{\star}) for a constant L>0L>0, then the above bound can be rewritten as

Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\leq\exp\bigl\{-\omega m(\theta,\theta^{\star})(1-\tfrac{\omega L}{2})\bigr\}. (19)

Then Condition 1 holds in the sub-Gaussian case if \omega<2L^{-1}, and in the sub-exponential case if \omega<2L^{-1}\wedge(2b)^{-1}.

In practice it may be awkward to assume θ(U)θ(U)\ell_{\theta}(U)-\ell_{\theta^{\star}}(U) is sub-exponential, but in certain problems it is sufficient to make such an assumption about features of UU, which may be more reasonable. See Section 5.4 for an application of this idea to a fixed-design regression problem where the fact the response variable is sub-Gaussian implies the excess loss is itself sub-Gaussian.

4 Extensions

4.1 Locally sub-exponential type loss functions

In some cases the moment generating function bound in Condition 1 can be verified in a neighborhood of θ\theta^{\star} but not for all θΘ\theta\in\Theta. For example, suppose θθ(u)\theta\mapsto\ell_{\theta}(u) is Lipschitz with respect to a metric \|\cdot\| with uniformly bounded Lipschitz constant L=L(u)L=L(u), and that, for some δ>0\delta>0,

\|\theta-\theta^{\star}\|<\delta\implies v(\theta,\theta^{\star})\lesssim m(\theta,\theta^{\star})\quad\text{and}\quad\|\theta-\theta^{\star}\|>\delta\implies v(\theta,\theta^{\star})\lesssim m(\theta,\theta^{\star})^{2}.

That is, the Bernstein condition (17) holds but for different values of α\alpha depending on θ\theta. This is the case for quantile regression; see Section 5.6. This class of problems, where the Bernstein exponent varies across Θ\Theta, apparently has not been considered previously. For example, Grünwald and Mehta, (2020) only consider cases where the Bernstein exponent α\alpha is constant across the entire parameter space; consequently, their results assume Θ\Theta is bounded (e.g., Grünwald and Mehta,, 2020, Example 10), so (17) holds trivially with exponent α=0\alpha=0.

To address this problem, our idea is a simple one. We propose to introduce a sieve which is large enough that we can safely assume it contains \theta^{\star}, but small enough that the m and v functions can be appropriately controlled on it. Towards this, let \Theta_{n} be an increasing sequence of subsets of \Theta, indexed by the sample size n. The “size” of \Theta_{n} will play an important role in the result below. While more general sieves are possible, to keep the notion of size concrete, let \Theta_{n}=\{\theta\in\Theta:\|\theta\|\leq\Delta_{n}\}, so that size is controlled by the non-decreasing sequence \Delta_{n}>0, which would typically satisfy \Delta_{n}\to\infty as n\to\infty. The metric \|\cdot\| in the definition of \Theta_{n} is at the user’s discretion; it would be chosen so that \theta\in\Theta_{n} provides information that can be used to control the excess loss \ell_{\theta}-\ell_{\theta^{\star}}. This leads to the following straightforward modification of Condition 1.

Condition 2.

For Θn\Theta_{n} with size controlled by Δn\Delta_{n}, there exists an interval (0,ω¯)(0,\bar{\omega}), a constant r>0r>0, and a sequence Kn=K(Δn)>0K_{n}=K(\Delta_{n})>0 such that, for all ω(0,ω¯)\omega\in(0,\bar{\omega}) and for all small δ>0\delta>0,

\theta\in\Theta_{n}\;\text{ and }\;d(\theta;\theta^{\star})>\delta\implies Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}<e^{-\omega K_{n}^{r}\delta^{r}}. (20)

Aside from the restriction Θn\Theta_{n}, the key difference between the bounds here and in Condition 1 is in the exponent. Instead of there being a constant KK, there is a sequence KnK_{n} which is determined by the sieve’s size, controlled by Δn\Delta_{n}. If the sequence KnK_{n} is increasing, as we expect it will be, then we can anticipate that the overall concentration rate would be adversely affected—unless the learning rate is vanishing suitably fast. The following theorem explains this more precisely.

Condition 2 can be used exactly as Condition 1 to upper bound PnNn(AnΘn)P^{n}N_{n}(A_{n}\cap\Theta_{n}). However, the Gibbs posterior probability assigned to AnΘncA_{n}\cap\Theta_{n}^{\text{\sc c}} must be handled separately.

Theorem 4.1.

Let Θn\Theta_{n} be a sequence of subsets for which the loss function satisfies Condition 2 for a sequence Kn>0K_{n}>0, a constant r>0r>0, and a learning rate ωn(0,ω¯)\omega_{n}\in(0,\overline{\omega}) for all sufficiently large nn. Let εn\varepsilon_{n} be a vanishing sequence satisfying nωnKnrεnrn\omega_{n}K_{n}^{r}\varepsilon_{n}^{r}\rightarrow\infty. Suppose the prior satisfies

\log\Pi(\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq(K_{n}\varepsilon_{n})^{r}\})\gtrsim-Cn\omega_{n}K_{n}^{r}\varepsilon_{n}^{r}, (21)

for some C>0C>0 and the same KnK_{n}, rr, and ωn\omega_{n} as above. Then the Gibbs posterior in (5) satisfies

\limsup_{n\to\infty}P^{n}\Pi_{n}(A_{n})\leq\limsup_{n\to\infty}P^{n}\Pi_{n}(A_{n}\cap\Theta_{n}^{\text{\sc c}}).

Consequently, if

P^{n}\Pi_{n}(\Theta_{n}^{\text{\sc c}})\to 0\quad\text{as $n\to\infty$}, (22)

then the Gibbs posterior has concentration rate εn\varepsilon_{n} for all large constants M>0M>0 in Definition 3.1.

There are two aspects of Theorem 4.1 that deserve further explanation. We start with the point about separate handling of Θnc\Theta_{n}^{\text{\sc c}}. Condition (22) is easy to check for well-specified Bayesian posteriors, but these results do not carry over even to a misspecified Bayesian model. Of course, (22) can always be arranged by restricting the support of the prior distribution to Θn\Theta_{n}, which is the suggestion in Kleijn and van der Vaart, (2006, Theorem 2.3) for the Bayesian case and how we handle (22) for our infinite-dimensional Gibbs example in Section 5.6. However, as (22) suggests, this is not entirely necessary. Indeed, Kleijn and van der Vaart, (2006) describe a trade-off between model complexity and prior support, and offer a more complicated form of their sufficient condition, which they do not explore in that paper. The fact is, without a well-specified likelihood, checking a condition like our (22) or Equation (2.13) in Kleijn and van der Vaart, (2006) is a challenge, at least in infinite-dimensional problems. For finite-dimensional problems, it may be possible to verify (22) directly using properties of the loss functions. For example, we use convexity of the check loss function for quantile estimation to verify (22) in Section 5.1.

Next, how might one proceed to check Condition 2? Go back to the Lipschitz loss case at the start of this subsection. For a sieve Θn\Theta_{n} as described above, if θ\theta and θ\theta^{\star} are in Θn\Theta_{n}, then θθ\|\theta-\theta^{\star}\| is bounded by a multiple of Δn\Delta_{n}. This, together with the Lipschitz property and (16), implies

Pe^{-\omega_{n}(\ell_{\theta}-\ell_{\theta^{\star}})}\leq\exp\bigl\{C_{n}\omega_{n}^{2}v(\theta,\theta^{\star})-\omega_{n}m(\theta,\theta^{\star})\bigr\},

where Cn=O(1+Δnωn)C_{n}=O(1+\Delta_{n}\omega_{n}). Suppose the user chooses ωn\omega_{n} and Δn\Delta_{n} such that ωnΔn=O(1)\omega_{n}\Delta_{n}=O(1); then we can replace CnC_{n} by a constant CC. If there exist functions gg and GG such that

m(\theta,\theta^{\star})\geq g(\Delta_{n})\|\theta-\theta^{\star}\|^{2}\quad\text{and}\quad v(\theta,\theta^{\star})\leq G(\Delta_{n})\|\theta-\theta^{\star}\|^{2},\quad\theta\in\Theta_{n}, (23)

then the above upper bound simplifies to

\exp[-\omega_{n}\|\theta-\theta^{\star}\|^{2}\{g(\Delta_{n})-C\omega_{n}G(\Delta_{n})\}],\quad\theta\in\Theta_{n}.

Then Condition 2 holds with Kn=g(Δn)CωnG(Δn)K_{n}=g(\Delta_{n})-C\omega_{n}G(\Delta_{n}), and it remains to balance the choices of ωn\omega_{n} and Δn\Delta_{n} to achieve the best possible concentration rate εn\varepsilon_{n}; see Section 5.6.

4.2 Clipping the loss function

When the excess loss is heavy-tailed, i.e., not sub-exponential, like in Section 5.3, its moment-generating function does not exist and, therefore, Condition 1 cannot be satisfied. In this section, we assume the loss function θ(u)\ell_{\theta}(u) is non-negative or lower-bounded by a negative constant. In the latter case we can work with the shifted loss—θ\ell_{\theta} minus its lower bound. Many practically useful loss functions are non-negative, including squared-error loss, which we cover in Section 5.3.

In such cases, it may be reasonable to replace the heavy-tailed loss with a clipped version

θn(u)=θ(u)tn,\ell_{\theta}^{n}(u)=\ell_{\theta}(u)\wedge t_{n},

where tn>0t_{n}>0 is a diverging clipping sequence. Since θn(u)\ell_{\theta}^{n}(u) is bounded in (u,θ)(u,\theta) for each fixed nn, the strategy for checking Condition 1 described in Section 3.4.1 suggests that, for certain choices of (tn,εn,ωn)(t_{n},\varepsilon_{n},\omega_{n}), Πn\Pi_{n} places vanishing mass on the sequence of sets {θ:PθnPθnn>εn}\{\theta:P\ell^{n}_{\theta}-P\ell_{\theta_{n}^{\star}}^{n}>\varepsilon_{n}\}, where θn=argminPθn\theta_{n}^{\star}=\arg\min P\ell_{\theta}^{n}. This makes θn\theta_{n}^{\star} the (moving) target of the Gibbs posterior, instead of the fixed θ\theta^{\star}. On the other hand, if the loss function admits more than one finite moment, then, for a correspondingly increasing clipping sequence tnt_{n}, the clipped risk neighborhoods of θn\theta^{\star}_{n} are contained in risk neighborhoods of θ\theta^{\star} for all large nn, that is,

{θ:R(θ)R(θ)>Cεn}{θ:PθnPθnn>εn},for some C>0.\{\theta:R(\theta)-R(\theta^{\star})>C\varepsilon_{n}\}\subset\{\theta:P\ell^{n}_{\theta}-P\ell_{\theta_{n}^{\star}}^{n}>\varepsilon_{n}\},\quad\text{for some $C>0$}. (24)

Then we further expect concentration of the clipped loss-based Gibbs posterior at θ\theta^{\star} with respect to the excess risk divergence at rate εn\varepsilon_{n}. Condition 3 and Theorem 4.2 below provide a set of sufficient conditions under which these expectations are realized and a concentration rate can be established.

Condition 3.

Let θ\ell_{\theta} be the loss. Define θn=θtn\ell_{\theta}^{n}=\ell_{\theta}\wedge t_{n} as the clipped loss and Θn={θ:θΔn}\Theta_{n}=\{\theta:\|\theta\|\leq\Delta_{n}\} as a sieve, depending on constants tn,Δnt_{n},\Delta_{n}\to\infty.

  1. 1.

    For some s>1s>1, the sequence Bn=supθΘnP|θ|sB_{n}=\sup_{\theta\in\Theta_{n}}P|\ell_{\theta}|^{s} is finite for all nn.

  2. 2.

    There exists a sequence ω¯n>0\bar{\omega}_{n}>0, and a sequence Kn>0K_{n}>0, such that for all sequences 0<ωnω¯n0<\omega_{n}\leq\bar{\omega}_{n} and for all sufficiently small δ>0\delta>0,

    θΘn and PθnPθnn>δPeωn(θnθnn)<eKnωnδ.\theta\in\Theta_{n}\text{ and }P\ell_{\theta}^{n}-P\ell_{\theta_{n}^{\star}}^{n}>\delta\implies Pe^{-\omega_{n}(\ell_{\theta}^{n}-\ell^{n}_{\theta_{n}^{\star}})}<e^{-K_{n}\omega_{n}\delta}. (25)
Theorem 4.2.

For a given loss θ\ell_{\theta} and sieve Θn\Theta_{n}, suppose that Condition 3 holds for (ωn,Δn,tn)(\omega_{n},\Delta_{n},t_{n}); and let Bn=Bn(s)B_{n}=B_{n}(s) be as defined in Condition 3.1, for s>1s>1. Let εn\varepsilon_{n} be a vanishing sequence such that nωnKnεnn\omega_{n}K_{n}\varepsilon_{n}\rightarrow\infty, and suppose the prior satisfies

logΠ({θ:mn(θ,θn)vn(θ,θn)Knεn})CnωnKnεn.\log\Pi(\{\theta:m_{n}(\theta,\theta_{n}^{\star})\vee v_{n}(\theta,\theta_{n}^{\star})\leq K_{n}\varepsilon_{n}\})\gtrsim-Cn\omega_{n}K_{n}\varepsilon_{n}. (26)

Then the Gibbs posterior in (5) based on the clipped loss θn\ell_{\theta}^{n} satisfies

lim supnPnΠn(An)lim supnPnΠn(AnΘnc)\limsup_{n\to\infty}P^{n}\Pi_{n}(A_{n})\leq\limsup_{n\to\infty}P^{n}\Pi_{n}(A_{n}\cap\Theta_{n}^{\text{\sc c}})

where An:={θ:R(θ)R(θ)>εnBntn1s}A_{n}:=\{\theta:R(\theta)-R(\theta^{\star})>\varepsilon_{n}\vee B_{n}t_{n}^{1-s}\}. Consequently, if

PnΠn(Θnc)0as n,P^{n}\Pi_{n}(\Theta_{n}^{\text{\sc c}})\to 0\quad\text{as $n\to\infty$},

then the Gibbs posterior has asymptotic concentration rate εnBntn1s\varepsilon_{n}\vee B_{n}t_{n}^{1-s} at θ\theta^{\star} with respect to the excess risk d(θ,θ)=R(θ)R(θ)d(\theta,\theta^{\star})=R(\theta)-R(\theta^{\star}).

The setup here is more complicated than in previous sections, so some further explanation is warranted. First, we sketch out how Condition 3.1 leads to the critical property (24). A well-known identity for the expectation of a non-negative random variable, together with Markov's inequality and the moment bound in Condition 3.1, for s>1s>1, leads to

P\ell_{\theta}1(\ell_{\theta}>t_{n})=t_{n}P(\ell_{\theta}>t_{n})+\int_{t_{n}}^{\infty}P(\ell_{\theta}>x)\,dx\leq B_{n}t_{n}^{1-s}+B_{n}\int_{t_{n}}^{\infty}x^{-s}\,dx\lesssim B_{n}t_{n}^{1-s}. (27)

This, in turn, implies R(θ)=Pθn+O(Bntn1s)R(\theta)=P\ell_{\theta}^{n}+O(B_{n}t_{n}^{1-s}), uniformly in θΘn\theta\in\Theta_{n}. If Bntn1s0B_{n}t_{n}^{1-s}\rightarrow 0, then the difference between the risk and the clipped risk is vanishing, so the excess risk R(θ)R(θ)R(\theta)-R(\theta^{\star}) is bounded by the clipped excess risk PθnPθnnP\ell_{\theta}^{n}-P\ell_{\theta_{n}^{\star}}^{n} plus a term of order Bntn1sB_{n}t_{n}^{1-s} for all sufficiently large nn, which leads to the inclusion (24). In the application we explore in Section 5.3, BnB_{n} is related to the radius of the sieve Θn\Theta_{n}, which grows logarithmically in nn, while tnt_{n} is related to the polynomial tail behavior of θ\ell_{\theta}, so that Bntn1s0B_{n}t_{n}^{1-s}\rightarrow 0 happens naturally if s>1s>1. Additional details are given in the proof of Theorem 4.2, which can be found in Appendix A.

Second, how might Condition 3.2 be checked? Start by defining the excess clipped risk mn(θ;θn)=PθnPθnnm_{n}(\theta;\theta_{n}^{\star})=P\ell_{\theta}^{n}-P\ell_{\theta_{n}^{\star}}^{n} and the corresponding variance vn(θ,θn)=P(θnθnn)2mn(θ,θn)2v_{n}(\theta,\theta_{n}^{\star})=P(\ell_{\theta}^{n}-\ell_{\theta_{n}^{\star}}^{n})^{2}-m_{n}(\theta,\theta_{n}^{\star})^{2}. Now suppose it can be shown that there exists a function GG such that

vn(θ,θn)G(Δn)mn(θ,θn),for all θΘn,v_{n}(\theta,\theta_{n}^{\star})\leq G(\Delta_{n})\,m_{n}(\theta,\theta_{n}^{\star}),\quad\text{for all $\theta\in\Theta_{n}$},

where Δn\Delta_{n} is the size index of the sieve. This amounts to the excess clipped loss satisfying a Bernstein condition (17), with exponent α=1\alpha=1. The excess clipped loss itself is tn\lesssim t_{n} (and maybe substantially smaller, depending on the form of θ\ell_{\theta}). So, we can apply the moment-generating function bound in (16) for bounded excess losses to get

Pe^{-\omega_{n}(\ell_{\theta}^{n}-\ell_{\theta_{n}^{\star}}^{n})}<e^{-K_{n}\omega_{n}m_{n}(\theta,\theta_{n}^{\star})},

where Kn={1CωnG(Δn)}K_{n}=\{1-C\omega_{n}G(\Delta_{n})\}, for a constant C>0C>0, provided that ωntn=O(1)\omega_{n}t_{n}=O(1). Now it is easy to see that the above display implies Condition 3.2.

For the rate calculation with respect to the excess risk, the decomposition of R(θ)R(\theta) based on (27) implies we need εnBntn1s\varepsilon_{n}\geq B_{n}t_{n}^{1-s}, subject to the constraint nωnKnεnn\omega_{n}K_{n}\varepsilon_{n}\rightarrow\infty, for KnK_{n} as above. As we see in Section 5.3, BnB_{n}, Δn\Delta_{n}, and KnK_{n} can often be taken as powers of logn\log n, so the critical components determining the rate are ss and tnt_{n}. The optimal rate depends on the upper bound of the excess clipped loss, which, in the worst case, equals tnt_{n}. To apply (16) we need the learning rate to vanish like the reciprocal of this bound, so we take ωn=tn1\omega_{n}=t_{n}^{-1}. Then, we determine εn\varepsilon_{n} to satisfy ntn1εnnt_{n}^{-1}\varepsilon_{n}\rightarrow\infty and εntn1s\varepsilon_{n}\geq t_{n}^{1-s}, up to a log term. The clipping sequence tnn1/st_{n}\approx n^{1/s} is sufficient, and yields the rate εnn1/s1\varepsilon_{n}\approx n^{1/s-1}, modulo log terms.
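As a concrete companion to this recipe, the following minimal Python sketch computes the tuning sequences (clip at t_{n}\approx n^{1/s}, learning rate \omega_{n}=t_{n}^{-1}, rate \varepsilon_{n}\approx n^{1/s-1}) and the clipped empirical risk; the Pareto-distributed toy losses and all function and variable names are our own illustrative assumptions, and example-specific choices, such as those in Section 5.3.2, will generally differ.

```python
import numpy as np

def clipped_gibbs_tuning(n, s):
    """Generic Section 4.2 recipe, up to constants and log factors:
    clip at t_n ~ n^(1/s), take omega_n = 1/t_n, and read off eps_n ~ n^(1/s - 1)."""
    t_n = n ** (1.0 / s)           # clipping level
    omega_n = 1.0 / t_n            # learning rate: reciprocal of the worst-case loss bound
    eps_n = n ** (1.0 / s - 1.0)   # implied concentration rate (modulo log terms)
    return t_n, omega_n, eps_n

def clipped_empirical_risk(losses, t_n):
    """Empirical risk based on the clipped loss ell^n = min(ell, t_n)."""
    return np.minimum(losses, t_n).mean()

# toy usage: heavy-tailed losses with roughly s finite moments
rng = np.random.default_rng(0)
n, s = 10_000, 3.0
t_n, omega_n, eps_n = clipped_gibbs_tuning(n, s)
losses = rng.pareto(s, size=n)     # Pareto(s): moments of order < s are finite
print(t_n, omega_n, eps_n, clipped_empirical_risk(losses, t_n))
```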

5 Examples

This section presents several illustrations of the general theory presented in Sections 3 and 4. The strategies laid out in Sections 3.4, 4.1, and 4.2 are put to use in the following examples to verify our sufficient conditions for Gibbs posterior concentration. All proofs of results in this section can be found in Appendix C.

5.1 Quantile regression

Consider making inferences on the τth\tau^{\text{th}} conditional quantile of a response YY given a predictor X=xX=x. We model this quantile, denoted QY|X=x(τ)Q_{Y|X=x}(\tau), as a linear combination of functions of xx, that is, QY|X=x(τ)=θf(x)Q_{Y|X=x}(\tau)=\theta^{\top}f(x), for a fixed, finite dictionary of functions f(x)=(f1(x),,fJ(x))f(x)=(f_{1}(x),\dots,f_{J}(x))^{\top} and where θ=(θ1,,θJ)\theta=(\theta_{1},\dots,\theta_{J})^{\top} is a coefficient vector with θΘ\theta\in\Theta. Here we assume the model is well-specified so the true conditional quantile is θf(x){\theta^{\star}}^{\top}f(x) for some θΘ\theta^{\star}\in\Theta. The standard check loss for quantile estimation is

θ(u)=(yθf(x))(τ1{y<θf(x)}).\ell_{\theta}(u)=(y-\theta^{\top}f(x))(\tau-1\{y<\theta^{\top}f(x)\}). (28)

We show θ\theta^{\star} minimizes R(θ)R(\theta) in the proof of Proposition 1 below. It can be shown that θθ(u)\theta\mapsto\ell_{\theta}(u) is LL-Lipschitz, with L<1L<1, and convex, so the strategy in Section 4.1 is helpful here for verifying Condition 2, and Lemma 2 in Appendix B can be used to verify (22).
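For concreteness, here is a minimal Python sketch of the check loss (28) and the resulting log Gibbs pseudo-likelihood; the dictionary f(x)=(1,x)^{\top}, the simulated data, and all names are illustrative assumptions and not part of the formal development.

```python
import numpy as np

def check_loss(theta, y, F, tau):
    """Check loss (28), row-wise: (y - theta'f(x)) * (tau - 1{y < theta'f(x)});
    F is the n x J matrix whose i-th row is f(x_i)."""
    resid = y - F @ theta
    return resid * (tau - (resid < 0).astype(float))

def gibbs_log_pseudolikelihood(theta, y, F, tau, omega):
    """log of exp{-omega * n * R_n(theta)}, i.e., minus omega times the summed check loss."""
    return -omega * check_loss(theta, y, F, tau).sum()

# toy usage with dictionary f(x) = (1, x) and tau = 0.5 (conditional median)
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(size=n)
y = 1.0 + 2.0 * x + rng.standard_normal(n)
F = np.column_stack([np.ones(n), x])
print(gibbs_log_pseudolikelihood(np.array([1.0, 2.0]), y, F, tau=0.5, omega=1.0))
```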

Inference on quantiles is a challenging problem from a Bayesian perspective because the quantile is well-defined irrespective of any particular likelihood. Sriram et al., (2013) interpret the check loss as the negative log-density of an asymmetric Laplace distribution and construct a corresponding pseudo-posterior using this likelihood, but their posterior is effectively a Gibbs posterior of the form (5).

With a few mild assumptions about the underlying distribution PP, our general result in Theorem 4.1 can be used to establish Gibbs posterior concentration at rate n1/2n^{-1/2}.

Assumption 1.

  1. 1.

    The marginal distribution of XX is such that PffPff^{\top} exists and is positive definite;

  2. 2.

    the conditional distribution of YY, given X=xX=x, has at least one finite moment and admits a continuous density pxp_{x} such that px(θf)p_{x}(\theta^{\star\top}f) is bounded away from zero for PP-almost all xx; and

  3. 3.

    the prior Π\Pi has a density bounded away from 0 in a neighborhood of θ\theta^{\star}.

Proposition 1.

Under Assumption 1, if the learning rate is sufficiently small, then the Gibbs posterior concentrates at θ\theta^{\star} with rate εn=n1/2\varepsilon_{n}=n^{-1/2} with respect to d(θ,θ)=θθd(\theta,\theta^{\star})=\|\theta-\theta^{\star}\|.

5.2 Area under the receiver operating characteristic curve

The receiver operating characteristic (ROC) curve and corresponding area under the curve (AUC) are diagnostic tools often used to judge the effectiveness of a binary classifier. Suppose a binary classifier produces a score UU characterizing the likelihood that an individual belongs to Group 11 versus Group 0. We can estimate an individual’s group by 1(U>t)1(U>t) where different values of the cutoff score tt may provide more or less accurate estimates. Suppose U0U_{0} and U1U_{1} are independent scores corresponding to individuals from Group 0 and Group 11, respectively. The specificity and sensitivity of the test of H0:individual i belongs to Group 0H_{0}:\text{individual }i\text{ belongs to Group 0} that rejects when U>tU>t are defined by spec(t)=P(U0<t)\text{spec}(t)=P(U_{0}<t) and sens(t)=P(U1>t)\text{sens}(t)=P(U_{1}>t). When the type 1 and type 2 errors of the test are equally costly, the optimal cutoff is the value of tt maximizing \text{spec}(t)+\text{sens}(t), or, in other words, the cutoff that maximizes the sum of the power and one minus the type 1 error probability. The ROC is the parametric curve (1spec(t),sens(t))(1-\text{spec}(t),\text{sens}(t)) in [0,1]2[0,1]^{2} which provides a graphical summary of the tradeoff between type 1 and type 2 errors for different choices of the cutoff. The AUC, equal to P(U1>U0)P(U_{1}>U_{0}), gives an overall numerical summary of the quality of the binary classifier, independent of the choice of threshold.

Our goal is to make posterior inferences on the AUC, but the usual Bayesian approach immediately runs into the kinds of problems we see in the examples in Sections 5.1 and 6. The parameter of interest is one-dimensional, but it depends on a completely unknown joint distribution PP. Within a Bayesian framework, the options are to fix a parametric model for this joint distribution and risk model misspecification or work with a complicated nonparametric model. Wang and Martin, (2020) constructed a Gibbs posterior for the AUC that avoids both of these issues.

Suppose U_{1,1},\ldots,U_{1,m} and U_{0,1},\ldots,U_{0,n} denote random samples of size mm and nn, respectively, of binary classifier scores for individuals belonging to Groups 1 and 0, and write \theta^{\star}=P(U_{1}>U_{0}) for the true AUC. Wang and Martin, (2020) consider the loss function

θ(u0,u1)={θ1(u1>u0)}2,θ[0,1],\ell_{\theta}(u_{0},u_{1})=\{\theta-1(u_{1}>u_{0})\}^{2},\quad\theta\in[0,1], (29)

for which the excess risk satisfies R(\theta)-R(\theta^{\star})=(\theta-\theta^{\star})^{2}. If we interpret m=mnm=m_{n} as a function of nn, then it makes sense to write the empirical risk function as

Rn(θ)=1mni=1mj=1n{θ1(U1,i>U0,j)}2.R_{n}(\theta)=\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\{\theta-1(U_{1,i}>U_{0,j})\}^{2}. (30)

Note the minimizer of the empirical risk is equal to

θ^n=1mni=1mj=1n1(U1,i>U0,j).\hat{\theta}_{n}=\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}1(U_{1,i}>U_{0,j}). (31)

Wang and Martin, (2020) prove concentration of the Gibbs posterior at rate n1/2n^{-1/2} under the following assumption.

Assumption 2.

  1. 1.

    The sample sizes (m,n)(m,n) satisfy m(m+n)1λ(0,1)m(m+n)^{-1}\rightarrow\lambda\in(0,1).

  2. 2.

    The prior distribution has a density function π\pi that is bounded away from zero in a neighborhood of θ\theta^{\star}.

Wang and Martin, (2020) note that their concentration result holds for fixed learning rates and deterministic learning rates that vanish more slowly than min(m,n)1\min(m,n)^{-1}. As discussed in Syring and Martin, (2019), one motivation for choosing a particular learning rate is to calibrate Gibbs posterior credible intervals to attain a nominal coverage probability, at least approximately. With this goal in mind, Wang and Martin, (2020) suggest the following random learning rate. Define the covariances

τ10\displaystyle\tau_{10} =Cov{1(U1,1>U0,1), 1(U1,1>U0,2)}\displaystyle=\text{Cov}\{1(U_{1,1}>U_{0,1}),\,1(U_{1,1}>U_{0,2})\}
τ01\displaystyle\tau_{01} =Cov{1(U1,1>U0,1), 1(U1,2>U0,1)}.\displaystyle=\text{Cov}\{1(U_{1,1}>U_{0,1}),\,1(U_{1,2}>U_{0,1})\}. (32)

Wang and Martin, (2020) note the asymptotic variance of θ^n\hat{\theta}_{n} is given by

1m+n(τ10λ+τ011λ),\frac{1}{m+n}\Bigl{(}\frac{\tau_{10}}{\lambda}+\frac{\tau_{01}}{1-\lambda}\Bigr{)}, (33)

and that the Gibbs posterior variance can be made to match this, at least asymptotically, by using the random learning rate

ω^n=m+n2mn(τ^10λ+τ^011λ)1,\hat{\omega}_{n}=\frac{m+n}{2mn}\Bigl{(}\frac{\hat{\tau}_{10}}{\lambda}+\frac{\hat{\tau}_{01}}{1-\lambda}\Bigr{)}^{-1}, (34)

where τ^10\hat{\tau}_{10} and τ^01\hat{\tau}_{01} are the corresponding empirical covariances:

τ^10\displaystyle\hat{\tau}_{10} =2mn(n1)i=1mjj1(U1,i>U0,j)1(U1,i>U0,j)θ^n2,\displaystyle=\frac{2}{mn(n-1)}\sum_{i=1}^{m}\sum_{j\neq j^{\prime}}1(U_{1,i}>U_{0,j})1(U_{1,i}>U_{0,j^{\prime}})-\hat{\theta}_{n}^{2},
τ^01\displaystyle\hat{\tau}_{01} =2nm(m1)j=1nii1(U1,i>U0,j)1(U1,i>U0,j)θ^n2.\displaystyle=\frac{2}{nm(m-1)}\sum_{j=1}^{n}\sum_{i\neq i^{\prime}}1(U_{1,i}>U_{0,j})1(U_{1,i^{\prime}}>U_{0,j})-\hat{\theta}_{n}^{2}. (35)

The hope is that the Gibbs posterior with the learning rate ω^n\hat{\omega}_{n} has asymptotically calibrated credible intervals. It turns out that the concentration result in Wang and Martin, (2020), along with Theorem 4, implies that the Gibbs posterior with a slightly adjusted version of the learning rate ω^n\hat{\omega}_{n} also concentrates at rate n1/2n^{-1/2}. The adjustment to the learning rate has the effect of slightly widening Gibbs posterior credible intervals, so their asymptotic calibration is not adversely affected.
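The following minimal Python sketch computes the point estimate (31), the empirical covariances (35), and the random learning rate (34) from two samples of scores; the plug-in value \lambda=m/(m+n) and the Gaussian toy scores are illustrative assumptions.

```python
import numpy as np

def auc_gibbs_quantities(u1, u0):
    """Point estimate (31), empirical covariances (35), and learning rate (34).
    u1: Group 1 scores (length m); u0: Group 0 scores (length n)."""
    m, n = len(u1), len(u0)
    ind = (u1[:, None] > u0[None, :]).astype(float)   # m x n matrix of 1(U_{1,i} > U_{0,j})
    theta_hat = ind.mean()                            # (31)

    s_i = ind.sum(axis=1)   # number of U_{0,j} that U_{1,i} exceeds
    t_j = ind.sum(axis=0)   # number of U_{1,i} exceeding U_{0,j}
    # (35): 2 * sum_i sum_{j<j'} 1(.)1(.) equals sum_i (s_i^2 - s_i), and similarly for t_j
    tau10_hat = (s_i**2 - s_i).sum() / (m * n * (n - 1)) - theta_hat**2
    tau01_hat = (t_j**2 - t_j).sum() / (n * m * (m - 1)) - theta_hat**2

    lam = m / (m + n)   # plug-in for the limit in Assumption 2.1
    omega_hat = (m + n) / (2 * m * n) / (tau10_hat / lam + tau01_hat / (1 - lam))  # (34)
    return theta_hat, omega_hat

# toy usage with Gaussian scores
rng = np.random.default_rng(2)
u1 = rng.normal(1.0, 1.0, size=150)   # Group 1 scores
u0 = rng.normal(0.0, 1.0, size=100)   # Group 0 scores
print(auc_gibbs_quantities(u1, u0))
```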

Proposition 2.

Suppose Assumption 2 holds and let ana_{n} denote any diverging sequence. Then, the Gibbs posterior with learning rate anω^na_{n}\hat{\omega}_{n} concentrates at rate n1/2n^{-1/2} with respect to d(θ,θ)=|θθ|d(\theta,\theta^{\star})=|\theta-\theta^{\star}|.

5.3 Finite-dimensional regression with squared-error loss

Consider predicting a response yy\in\mathbb{R} using a linear function xθx^{\top}\theta by minimizing the sum of squared-error losses θ(u)=(yxθ)2\ell_{\theta}(u)=(y-x^{\top}\theta)^{2}, with u=(x,y)J+1u=(x,y)\in\mathbb{R}^{J+1}, over a parameter space θΘJ\theta\in\Theta\subseteq\mathbb{R}^{J}. Suppose the covariate-response variable pairs (Xi,Yi)(X_{i},Y_{i}) are iid with XX taking values in a compact subset 𝒳J\mathcal{X}\subset\mathbb{R}^{J}. To complement this example we present a more flexible, non-parametric regression problem in Section 5.4 below. For the current example, we focus on how the tail behavior of the response variable affects posterior concentration; see Assumptions 3 and 4 below.

5.3.1 Light-tailed response

When the response is sub-exponential so is the excess loss, and by the argument outlined in Section 3.4 we can verify Condition 1 for d(θ,θ)=θθ2d(\theta,\theta^{\star})=\|\theta-\theta^{\star}\|_{2} and with r=2r=2. Then, the Gibbs posterior distribution concentrates at rate n1/2n^{-1/2} as a consequence of Theorem 3.3.

Assumption 3.

  1. 1.

    The response YY, given xx, is sub-exponential with parameters (σ2,b)(\sigma^{2},\,b) for all xx;

  2. 2.

    XX is bounded and its marginal distribution is such that PXXPXX^{\top} exists and is positive definite with eigenvalues bounded away from 0; and

  3. 3.

    the prior Π\Pi has a density bounded away from 0 in a neighborhood of θ\theta^{\star}.

Proposition 3.

If Assumption 3 holds, and the learning rate ω\omega is a sufficiently small constant, then the Gibbs posterior concentrates at rate εn=n1/2\varepsilon_{n}=n^{-1/2} with respect to d(θ,θ)=θθ2d(\theta,\theta^{\star})=\|\theta-\theta^{\star}\|_{2}.

5.3.2 Heavy-tailed response

As discussed in Section 3.4, Condition 1 can be expected to hold when the response is light-tailed, but not when it is heavy-tailed. However, for a clipped loss, θn=θtn\ell_{\theta}^{n}=\ell_{\theta}\wedge t_{n}, with increasing tnt_{n}, and for a suitable sieve on the parameter space, we can show concentration of the Gibbs posterior at the risk minimizer θ\theta^{\star} via the argument given in Section 4.2.

Assumption 4.

The marginal distribution of YY satisfies P|Y|s<P|Y|^{s}<\infty for s>2s>2.

Define a sieve by Θn={θJ:θ2<Δn}\Theta_{n}=\{\theta\in\mathbb{R}^{J}:\|\theta\|_{2}<\Delta_{n}\} for an increasing sequence Δn\Delta_{n}, e.g., Δn=logn\Delta_{n}=\log n. The moment condition in Assumption 4 implies three important properties of θn\ell_{\theta}^{n} for θΘn\theta\in\Theta_{n}:

  • The excess clipped loss satisfies \sup_{y,x}|\ell_{\theta}^{n}(y,x)-\ell_{\theta_{n}^{\star}}^{n}(y,x)|<\Delta_{n}t_{n}^{1/2}.

  • The risk and clipped risk are equivalent up to an error of order Δnstn1s/2\Delta_{n}^{s}t_{n}^{1-s/2}.

  • The clipped loss satisfies mn(θ,θn)vn(θ,θn)Δn2(θθ22+Δnstn1s/2)m_{n}(\theta,\theta^{\star}_{n})\vee v_{n}(\theta,\theta^{\star}_{n})\lesssim\Delta_{n}^{2}(\|\theta-\theta^{\star}\|_{2}^{2}+\Delta_{n}^{s}t_{n}^{1-s/2}).

These three properties can be used to verify Condition 3 using the argument sketched out in Section 4.2 and to verify the prior bound in (26) required to apply Theorem 4.2.

Proposition 4.

Suppose Assumption 4 holds for some s>2s>2. Let tn=n2/(s1)t_{n}=n^{2/(s-1)}, Δn=logn\Delta_{n}=\log n, ωn=Δn1tn1/2\omega_{n}=\Delta_{n}^{-1}t_{n}^{-1/2}, and εn=Δns/2n(s2)/(2s2)\varepsilon_{n}=\Delta_{n}^{s/2}n^{-(s-2)/(2s-2)}. If Assumptions 3.2–3.3 also hold, then the Gibbs posterior with learning rate ωn\omega_{n} concentrates at rate εn\varepsilon_{n} with respect to d(θ,θ)=θθ2d(\theta,\theta^{\star})=\|\theta-\theta^{\star}\|_{2}.

Proposition 4 continues to hold if we replace ss in the definitions of tnt_{n} and εn\varepsilon_{n} by ss^{\prime} satisfying 2<s<s2<s^{\prime}<s. That means we only need an accurate lower bound for ss to construct a consistent Gibbs posterior for θ\theta^{\star}, albeit with a slower concentration rate.

It is clear the clipping sequence may bias the clipped risk minimizer. For a given clipping level, the bias is smaller for light-tailed (large ss) losses than for heavy-tailed losses because a loss in excess of the clipping level is rarer for the former. This explains why the clipping sequence tn=n2/(s1)t_{n}=n^{2/(s-1)} decreases in ss.

Note that our εn2\varepsilon_{n}^{2} can be compared to the rate derived in Grünwald and Mehta, (2020, Example 11). Indeed, up to log factors, their rate, ns/(s+2)n^{-s/(s+2)}, is smaller than ours for s<4s<4 but larger for s>4s>4. That is, their rate is slightly better when the response has between 2 and 4 moments, while ours is better when the response has 4 or more moments. Also, their result assumes that the parameter space is fixed and bounded, whereas we avoid this assumption with a suitably chosen sieve.

5.4 Mean regression curve

Let Y1,,YnY_{1},\ldots,Y_{n} be independent, where the marginal distribution of YiY_{i} depends on a fixed covariate xi[0,1]x_{i}\in[0,1] through the mean, i.e., the expected value of YiY_{i} is θ(xi)\theta^{\star}(x_{i}), i=1,,ni=1,\ldots,n. For simplicity, set xi=i/nx_{i}=i/n, corresponding to an equally-spaced design. Then the goal is estimation of the mean function θ:[0,1]\theta^{\star}:[0,1]\to\mathbb{R}, which resides in a specified function class Θ\Theta defined below.

A natural starting point is to define an empirical risk based on squared error loss, i.e.,

Rn(θ)=1ni=1n{Yiθ(xi)}2.R_{n}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}-\theta(x_{i})\}^{2}. (36)

However, any function θ\theta that passes through the observations would be an empirical risk minimizer, so some additional structure is needed to make the solution to the empirical risk minimization problem meaningful. Towards this, as is customary in the literature, we parametrize the mean function as a linear combination of a fixed set of basis functions, f(x)=(f1(x),,fJ(x))f(x)=(f_{1}(x),\ldots,f_{J}(x))^{\top}. That is, we consider only functions θ=θβ\theta=\theta_{\beta}, where

θβ(x)=βf(x),βJ.\theta_{\beta}(x)=\beta^{\top}f(x),\quad\beta\in\mathbb{R}^{J}. (37)

Note that we do not assume that θ\theta^{\star} is of the specified form; more specifically, we do not assume existence of a vector β\beta^{\star} such that θ=θβ\theta^{\star}=\theta_{\beta^{\star}}. The idea is that the structure imposed via the basis functions will force certain smoothness, etc., so that minimization of the risk over this restricted class of functions would identify a suitable estimate.

This structure changes the focus of our investigation from the mean function θ\theta to the JJ-vector of coefficients β\beta. We now proceed by first constructing a Gibbs posterior for β\beta and then obtain the corresponding Gibbs posterior for θ\theta by pushing the former through the mapping βθβ\beta\mapsto\theta_{\beta}. In particular, define the empirical risk function in terms of β\beta:

rn(β)=Rn(θβ)=1ni=1n{Yiθβ(xi)}2=1n(YFnβ)(YFnβ),r_{n}(\beta)=R_{n}(\theta_{\beta})=\frac{1}{n}\sum_{i=1}^{n}\{Y_{i}-\theta_{\beta}(x_{i})\}^{2}=\tfrac{1}{n}(Y-F_{n}\beta)^{\top}(Y-F_{n}\beta), (38)

where βJ\beta\in\mathbb{R}^{J} and where FnF_{n} is the n×Jn\times J matrix whose (i,j)(i,j) entry is fj(xi)f_{j}(x_{i}), and for which F_{n}^{\top}F_{n} is assumed to be positive definite; see below. Given a prior distribution Π~\widetilde{\Pi} for β\beta—which determines a prior Π\Pi for θ\theta through the aforementioned mapping—we can first construct the Gibbs posterior for β\beta as in (5) with the pseudo-likelihood βexp{ωnrn(β)}\beta\mapsto\exp\{-\omega nr_{n}(\beta)\}. If we write Π~n\widetilde{\Pi}_{n} for this Gibbs posterior for β\beta, then the corresponding Gibbs posterior for θ\theta is given by

Πn(A)=Π~n({β:θβA}),AΘ.\Pi_{n}(A)=\widetilde{\Pi}_{n}(\{\beta:\theta_{\beta}\in A\}),\quad A\subseteq\Theta. (39)

Therefore, the concentration properties of Πn\Pi_{n} are determined by those of Π~n\widetilde{\Pi}_{n}.
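Because the empirical risk (38) is quadratic in \beta, the Gibbs posterior \widetilde{\Pi}_{n} is available in closed form under a Gaussian prior for \beta; that prior, the cosine basis, and all names in the following minimal sketch are illustrative assumptions (the theory only requires a prior mass condition of the form (44) below).

```python
import numpy as np

def gibbs_posterior_beta(F, y, omega, prior_sd):
    """Gibbs posterior for beta based on (38): pseudo-likelihood
    exp{-omega * n * r_n(beta)} = exp{-omega * ||y - F beta||^2}, combined with an
    illustrative N(0, prior_sd^2 I) prior, gives a Gaussian posterior in closed form."""
    J = F.shape[1]
    precision = 2.0 * omega * F.T @ F + np.eye(J) / prior_sd**2
    cov = np.linalg.inv(precision)
    mean = cov @ (2.0 * omega * F.T @ y)
    return mean, cov

def curve_draws(F_grid, mean, cov, n_draws, rng):
    """Push posterior draws of beta through beta -> theta_beta = F_grid beta, as in (39)."""
    betas = rng.multivariate_normal(mean, cov, size=n_draws)
    return betas @ F_grid.T

# toy usage: cosine basis on an equally spaced design
rng = np.random.default_rng(3)
n, J = 200, 8
x = np.arange(1, n + 1) / n
F = np.column_stack([np.cos(np.pi * j * x) for j in range(J)])
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
mean, cov = gibbs_posterior_beta(F, y, omega=1.0, prior_sd=10.0)
curves = curve_draws(F, mean, cov, n_draws=5, rng=rng)
```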

We can now proceed very much like we did before, but the details are slightly more complicated in the present inid case. The expectation of the empirical risk with respect to the joint distribution of (Y1,,Yn)(Y_{1},\ldots,Y_{n}) is, as usual, an average of marginal expectations; however, since the data are not iid, these marginal expectations are not all the same. Therefore, the expected empirical risk function is

r¯n(β)=Pnrn(β)\displaystyle\bar{r}_{n}(\beta)=P^{n}r_{n}(\beta) =1ni=1nPi{Yiθβ(xi)}2\displaystyle=\frac{1}{n}\sum_{i=1}^{n}P_{i}\{Y_{i}-\theta_{\beta}(x_{i})\}^{2} (40)

where Pi=PxiP_{i}=P_{x_{i}} is the marginal distribution of YiY_{i} and where Fn,iF_{n,i} is the ithi^{\text{th}} row of FnF_{n}. Since the expected empirical risk function depends on nn, through (x1,,xn)(x_{1},\ldots,x_{n}), so too does the risk minimizer

βn=argminβr¯n(β).\beta^{\dagger}_{n}=\arg\min_{\beta}\bar{r}_{n}(\beta). (41)

If PiP_{i} has finite variance, then r¯n(β)\bar{r}_{n}(\beta) differs from n^{-1}\|\theta^{\star}(x_{1:n})-F_{n}\beta\|_{2}^{2} by only an additive constant not depending on β\beta, and this becomes a least-squares problem, with solution

βn=(FnFn)1Fnθ(x1:n),\beta_{n}^{\dagger}=(F_{n}^{\top}F_{n})^{-1}F_{n}^{\top}\,\theta^{\star}(x_{1:n}), (42)

where θ(x1:n)\theta^{\star}(x_{1:n}) is the nn-vector (θ(x1),,θ(xn))(\theta^{\star}(x_{1}),\ldots,\theta^{\star}(x_{n}))^{\top}. Our expectation is that the Gibbs posterior Π~n\widetilde{\Pi}_{n} for β\beta will suitably concentrate around βn\beta_{n}^{\dagger}, which implies that the Gibbs posterior Πn\Pi_{n} for θ\theta will suitably concentrate around θβn\theta_{\beta_{n}^{\dagger}}. Finally, if the above holds and the basis representation is suitably flexible, then θβn\theta_{\beta_{n}^{\dagger}} will be close to θ\theta^{\star} in some sense and, hence, we achieve the desired concentration.

The flexibility of the basis representation depends on the dimension JJ. Since θ\theta^{\star} need not be of the form θβ\theta_{\beta}, a good approximation will require that J=JnJ=J_{n} be increasing with nn. How fast J=JnJ=J_{n} must increase depends on the smoothness of θ\theta^{\star}. Indeed, if θ\theta^{\star} has smoothness index α>0\alpha>0 (made precise below), then many systems of basis functions—including Fourier series and B-splines—have the following approximation property: there exists an H>0H>0 such that for every JJ

there exists βJ such that β<H and θβθJα.\text{there exists $\beta\in\mathbb{R}^{J}$ such that $\|\beta\|_{\infty}<H$ and $\|\theta_{\beta}-\theta^{\star}\|_{\infty}\lesssim J^{-\alpha}$}. (43)

Then the idea is to set the approximation error in (43) equal to the target rate of convergence, which depends on nn and on α\alpha, and then solve for J=JnJ=J_{n}.

For Gibbs posterior concentration at or near the optimal rate, we need the prior distribution for β\beta to be sufficiently concentrated in a bounded region of the JJ-dimensional space in the sense that

Π~({β:ββ2ε})(Cε)J,for all βJ with βH,\widetilde{\Pi}(\{\beta:\|\beta-\beta^{\prime}\|_{2}\leq\varepsilon\})\gtrsim(C\varepsilon)^{J},\quad\text{for all $\beta^{\prime}\in\mathbb{R}^{J}$ with $\|\beta^{\prime}\|_{\infty}\leq H$}, (44)

for the same HH as in (43), for a small constant C>0C>0, and for all small ε>0\varepsilon>0.

Assumption 5.

  1. 1.

    The function θ:[0,1]\theta^{\star}:[0,1]\to\mathbb{R} belongs to a class Θ=Θ(α,L)\Theta=\Theta(\alpha,L) of Hölder smooth functions parametrized by α1/2\alpha\geq 1/2 and L>0L>0. That is, θ\theta^{\star} satisfies

    |θ([α])(x)θ([α])(x)|L|xx|α[α],for all x,x[0,1],|\theta^{\star([\alpha])}(x)-\theta^{\star([\alpha])}(x^{\prime})|\leq L|x-x^{\prime}|^{\alpha-[\alpha]},\quad\text{for all $x,x^{\prime}\in[0,1]$},

    where the superscript “(k)(k)” means kthk^{\text{th}} derivative and [α][\alpha] is the integer part of α\alpha;

  2. 2.

for a given xx, the response YY is sub-Gaussian, with variance and variance proxy—both of which can depend on xx—uniformly upper bounded by σ2\sigma^{2};

  3. 3.

    the eigenvalues of FnFnF_{n}^{\top}F_{n} are bounded away from zero and \infty;

  4. 4.

    the approximation property (43) holds; and

  5. 5.

    the prior for β\beta satisfies (44) and has a bounded density on the JnJ_{n}-dimensional parameter space.

The bounded variance assumption is implied, for example, if the variance of YY is a smooth function of xx in [0,1][0,1], which is rather mild. And assuming the eigenvalues of FnFnF_{n}^{\top}F_{n} are bounded is not especially strong since, in many cases, the basis functions would be orthonormal. In that case, the diagonal and off-diagonal entries of FnFnF_{n}^{\top}F_{n} would be approximately 1 and 0, respectively, and the bounds are almost trivial. The conditions on the prior distribution are weak; as we argue in the proof, they can be satisfied by taking the joint prior density to be the product of JJ independent prior densities on the components of β\beta, where each component density is strictly positive.

Proposition 5.

If Assumption 5 holds, and the learning rate ω\omega is a sufficiently small constant, then the Gibbs posterior Πn\Pi_{n} for θ\theta concentrates at θ\theta^{\star} with rate εn=nα/(1+2α)\varepsilon_{n}=n^{-\alpha/(1+2\alpha)} with respect to the empirical L2L_{2} norm θθn,2\|\theta-\theta^{\star}\|_{n,2}, where fn,22=n1i=1nf2(xi)\|f\|_{n,2}^{2}=n^{-1}\sum_{i=1}^{n}f^{2}(x_{i}).

We should emphasize that the quantity of interest, θ\theta, is high-dimensional, and the rate εn\varepsilon_{n} given in Proposition 5 is optimal for the given smoothness α\alpha; there are not even any nuisance logarithmic terms.

The simpler fixed-dimensional setting with constant JJ can be analyzed similarly as above. In that case, we can simultaneously weaken the requirement on the response YY in Assumption 5.2 from sub-Gaussian to sub-exponential, and strengthen the conclusion to a root-nn concentration rate.

5.5 Binary classification

Let Y{0,1}Y\in\{0,1\} be a binary response variable and X=(X0,X1,,Xq)X=(X_{0},X_{1},\ldots,X_{q})^{\top} a (q+1)(q+1)-dimensional predictor variable. We consider classification rules of the form

ϕθ(X)=1{Xθ>0}=1{αX0+(X1,,Xq)β>0},θ=(α,β)q+1,\phi_{\theta}(X)=1\{X^{\top}\theta>0\}=1\{\alpha X_{0}+(X_{1},\ldots,X_{q})^{\top}\beta>0\},\quad\theta=(\alpha,\beta)\in\mathbb{R}^{q+1}, (45)

and the goal is to learn the optimal θ\theta vector, i.e., θ=argminθR(θ)\theta^{\star}=\arg\min_{\theta}R(\theta), where R(θ)=P{Yϕθ(X)}R(\theta)=P\{Y\neq\phi_{\theta}(X)\} is the misclassification error probability, and PP is the joint distribution of (X,Y)(X,Y). This optimal θ\theta^{\star} is such that η(x)>12\eta(x)>\frac{1}{2} if xθ>0x^{\top}\theta^{\star}>0 and η(x)<12\eta(x)<\frac{1}{2} if xθ0x^{\top}\theta^{\star}\leq 0, where η(x)=P(Y=1X=x)\eta(x)=P(Y=1\mid X=x) is the conditional probability function. Below we construct a Gibbs posterior distribution that concentrates around this optimal θ\theta^{\star} at a rate that depends on certain local features of that η\eta function.

Suppose our data consists of iid copies (Xi,Yi)(X_{i},Y_{i}), i=1,,ni=1,\ldots,n, of (X,Y)(X,Y) from PP, and define the empirical risk function

Rn(θ)=1ni=1n1{Yiϕθ(Xi)}.R_{n}(\theta)=\frac{1}{n}\sum_{i=1}^{n}1\{Y_{i}\neq\phi_{\theta}(X_{i})\}. (46)

In addition to the empirical risk function we need to specify a prior Π\Pi, and here the prior plays a significant role in the Gibbs posterior concentration results.

A unique feature of this problem, which makes the prior specification a little different than in a linear regression problem, is that the scale of θ\theta does not affect classification performance, e.g., replacing θ\theta with 1000θ1000\theta gives exactly the same classification performance. To fix a scale, we follow Jiang and Tanner, (2008), and

  • assume that the x0x_{0} component of xx is of known importance and always included in the classifier,

  • and constrain the corresponding coefficient, α\alpha, to take values ±1\pm 1.

This implies that the α\alpha and β\beta components of the θ\theta vector should be handled very differently in terms of prior specification. In particular, α\alpha is a scalar with a discrete prior—which we take here to be uniform on ±1\pm 1—and β\beta, being potentially high-dimensional, will require setting-specific prior considerations.

The characteristic that determines the difficulty of a classification problem is the distribution of η(X)\eta(X) or, more specifically, how concentrated η(X)\eta(X) is near the value 12\frac{1}{2}, where one could do virtually no worse by classifying according to a coin flip. The set {x:η(x)=12}\{x:\eta(x)=\frac{1}{2}\} is called the margin, and conditions that control the concentration of the distribution of η(X)\eta(X) around 12\frac{1}{2} are generally called margin conditions. Roughly, if η\eta has a “jump” or discontinuity at the margin, then classification is easier and η(X)\eta(X) does not have to be so tightly concentrated around 12\frac{1}{2}. On the other hand, if η\eta is smooth at the margin, then the classification problem is more challenging in the sense that more data near the margin is needed to learn the optimal classifier, hence, tighter concentration of η(X)\eta(X) near 12\frac{1}{2} is required.

In Sections 5.5.1 and 5.5.2 that follow, we consider two such margin conditions, namely, the so-called Massart and Tsybakov conditions. The first is relatively strong, corresponding to a jump in η\eta at the margin, and the result we establish in Proposition 6 is accordingly strong. In particular, we show that the Gibbs posterior achieves the optimal and adaptive concentration rate in a class of high-dimensional problems (qnq\gg n) under a certain sparsity assumption on θ\theta^{\star}. The Tsybakov margin condition we consider below is weaker than the first, in the sense that η\eta can be smooth near the “η=12\eta=\frac{1}{2}” boundary and, as expected, the Gibbs posterior concentration rate result is not as strong as the first.

5.5.1 Massart’s noise condition

Here we allow the dimension q+1q+1 of the coefficient vector θ=(α,β)\theta=(\alpha,\beta) to exceed the sample size nn, i.e., we consider the so-called high-dimensional problem with qnq\gg n. Accurate estimation and inference is not possible in high-dimensional settings without imposing some low-dimensional structure on the inferential target, θ\theta^{\star}. Here, as is typical in the literature on high-dimensional inference, we assume that θ\theta^{\star} is sparse in the sense that most of its entries are exactly zero, which corresponds to most of the predictor variables being irrelevant to classification. Below we construct a Gibbs posterior distribution for θ\theta that concentrates around the unknown sparse θ\theta^{\star} at a (near) optimal rate.

Since the sparsity in θ\theta^{\star} is crucial to the success of any statistical method, the prior needs to be chosen carefully so that sparsity is encouraged in the posterior. The prior Π\Pi for θ\theta will treat α\alpha and β\beta as independent, and the prior for β\beta will be defined hierarchically. Start with the reparametrization β(S,βS)\beta\to(S,\beta_{S}), where S{1,2,,q}S\subseteq\{1,2,\ldots,q\} denotes the configuration of zeros and non-zeros in the β\beta vector, and βS\beta_{S} denotes the |S||S|-vector of non-zero values. Following Castillo et al., (2015), for the marginal prior π(S)\pi(S) for SS, we take

π(S)=(q|S|)1f(|S|),\pi(S)=\textstyle\binom{q}{|S|}^{-1}\,f(|S|),

where ff is a prior for the size |S||S| and the first factor on the right-hand side is the uniform prior for SS of the given size |S||S|. Various choices of ff are possible, but here we take the complexity prior f(s)(cqa)s,s=0,1,,qf(s)\propto(cq^{a})^{-s},\,s=0,1,\ldots,q, a truncated geometric density, where aa and cc are fixed (and here arbitrary) hyperparameters; a similar choice is also made in Martin et al., (2017). Second, for the conditional prior of βS\beta_{S}, given SS, again following Castillo et al., (2015), we take its density to be

gS(βS)=kSλ2eλ|βk|,g_{S}(\beta_{S})=\prod_{k\in S}\tfrac{\lambda}{2}e^{-\lambda|\beta_{k}|},

a product of |S||S| many Laplace densities with rate λ\lambda to be specified.
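To fix ideas, the following minimal sketch draws \theta=(\alpha,\beta) from the prior just described and evaluates the empirical misclassification risk (46); the hyperparameter values and the toy data are arbitrary illustrative choices.

```python
import numpy as np

def draw_theta_from_prior(q, a, c, lam, rng):
    """One draw of theta = (alpha, beta): alpha uniform on {-1, +1}; |S| from the
    complexity prior f(s) proportional to (c q^a)^(-s); S uniform given |S|;
    beta_S iid Laplace with rate lam; remaining coordinates of beta set to zero."""
    alpha = rng.choice([-1.0, 1.0])
    sizes = np.arange(q + 1)
    w = np.exp(-sizes * np.log(c * q**a))       # unnormalized prior on |S|
    size = rng.choice(sizes, p=w / w.sum())
    beta = np.zeros(q)
    S = rng.choice(q, size=size, replace=False)
    beta[S] = rng.laplace(scale=1.0 / lam, size=size)
    return np.concatenate([[alpha], beta])

def empirical_risk(theta, X, y):
    """Misclassification empirical risk (46): mean of 1{y_i != 1(x_i' theta > 0)}."""
    return np.mean(y != (X @ theta > 0).astype(int))

# toy usage in a q >> n regime
rng = np.random.default_rng(4)
n, q = 100, 500
X = rng.uniform(-1.0, 1.0, size=(n, q + 1))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # sparse optimal classifier
theta = draw_theta_from_prior(q, a=1.0, c=1.0, lam=np.sqrt(np.log(q)), rng=rng)
print(empirical_risk(theta, X, y))
```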

Assumption 6.

  1. 1.

    The marginal distribution of XX is compactly supported, say, on [1,1]q+1[-1,1]^{q+1}.

  2. 2.

    The conditional distribution of X0X_{0}, given X~=(X1,,Xq)\tilde{X}=(X_{1},\ldots,X_{q}), has a density with respect to Lebesgue measure that is uniformly bounded.

  3. 3.

    The rate parameter λ\lambda in the Laplace prior satisfies λ(logq)1/2\lambda\lesssim(\log q)^{1/2}.

  4. 4.

    The optimal θ=(α,β)\theta^{\star}=(\alpha^{\star},\beta^{\star}) is sparse in the sense that |S|logq=o(n)|S^{\star}|\log q=o(n), where SS^{\star} is the configuration of non-zero entries in β\beta^{\star}, and β=O(1)\|\beta^{\star}\|_{\infty}=O(1).

  5. 5.

    There exists h(0,12)h\in(0,\frac{1}{2}) such that P(|η(X)12|h)=0P(|\eta(X)-\frac{1}{2}|\leq h)=0.

The first two parts of Assumption 6 correspond to Conditions 00^{\prime} and 0′′0^{\prime\prime} in Jiang and Tanner, (2008), and Assumption 6.5 above is precisely the margin condition imposed in Equation (5) of Massart and Nedelec, (2006); see, also, Mammen and Tsybakov, (1999) and Koltchinskii, (2006). This condition states that there is either a jump in the η\eta function at the margin or that the marginal distribution of XX is not supported near the margin; in either case, there is separation between the Y=0Y=0 and Y=1Y=1 cases, which makes the classification problem relatively easy.

With these assumptions, we get the following Gibbs posterior asymptotic concentration rate result. Note that, in order to preserve the high-dimensionality in the asymptotic limit, we let the dimension q=qnq=q_{n} increase with the sample size nn. So, the data sequence actually forms a triangular array but, as is common in the literature, we suppress this formulation in our notation.

Proposition 6.

Consider a classification problem as described above, with qnq\gg n. Under Assumption 6, the Gibbs posterior, with sufficiently small constant learning rate, concentrates at rate εn=(n1|S|logq)1/2\varepsilon_{n}=(n^{-1}|S^{\star}|\log q)^{1/2} with respect to d(θ,θ)={R(θ)R(θ)}1/2d(\theta,\theta^{\star})=\{R(\theta)-R(\theta^{\star})\}^{1/2}.

This result shows that, even in very high dimensional settings, the Gibbs posterior concentrates on the optimal rule θ\theta^{\star} at a fast rate. For example, suppose that the dimension qq is polynomial in nn, i.e., qnbq\sim n^{b} for any b>0b>0, while the “effective dimension,” or complexity, is sub-linear, i.e., |S|na|S^{\star}|\sim n^{a} for a<1a<1. Then we get that {θ:R(θ)R(θ)n(1a)logn}\{\theta:R(\theta)-R(\theta^{\star})\lesssim n^{-(1-a)}\log n\} has Gibbs posterior probability converging to 1 as nn\to\infty. That is, rates better than n1/2n^{-1/2} can easily be achieved, and even arbitrarily close to n1n^{-1} is possible. Compare this to the rates in Propositions 2–3 in Jiang and Tanner, (2008), also in terms of risk difference, that cannot be faster than n1/2n^{-1/2}. Further, the concentration rate in Proposition 6 is nearly the optimal rate corresponding to an oracle who has knowledge of SS^{\star}. That is, the Gibbs posterior concentrates at nearly the optimal rate adaptively with respect to the unknown complexity.

5.5.2 Tsybakov’s margin condition

Next, we consider classification under the more general Tsybakov margin condition (e.g., Tsybakov,, 2004). The problem setup is the same as above, except that here we consider the simpler low-dimensional case, with the number of predictors (q+1)(q+1) small relative to nn. Since the dimension is no longer large, prior specification is much simpler. We will continue to assume, as before, that the x0x_{0} component of xx has a constrained coefficient α{±1}\alpha\in\{\pm 1\}, to which we assign a discrete uniform prior. Otherwise, we simply require that the prior Π\Pi have a (marginal) density, π\pi, for β\beta, with respect to Lebesgue measure on q\mathbb{R}^{q}.

Assumption 7.

  1. 1.

    The marginal prior density for β\beta is continuous and bounded away from 0 near β\beta^{\star}.

  2. 2.

    There exists c>0c>0 and γ>0\gamma>0 such that P(|2η(X)1|h)chγP(|2\eta(X)-1|\leq h)\leq ch^{\gamma} for all sufficiently small h>0h>0.

The concentration of the marginal distribution of η(X)\eta(X) around 12\frac{1}{2} controls the difficulty of the classification problem, and Assumption 7.2 concerns exactly this. Note that larger γ\gamma implies η(X)\eta(X) is less concentrated around 12\frac{1}{2}, so we expect our Gibbs posterior concentration rate, say, εn=εn(γ)\varepsilon_{n}=\varepsilon_{n}(\gamma), to be a decreasing function of γ\gamma. The following result confirms this.

Proposition 7.

Suppose Assumption 7 holds and, for the specified γ>0\gamma>0, let

εn=(logn)γ/(2+2γ)nγ/(3+2γ).\varepsilon_{n}=(\log n)^{\gamma/(2+2\gamma)}n^{-\gamma/(3+2\gamma)}.

Then the Gibbs posterior distribution, with learning rate ωn=εn1/γ\omega_{n}=\varepsilon_{n}^{1/\gamma}, concentrates at rate εn\varepsilon_{n} with respect to d(θ,θ)={R(θ)R(θ)}1/2d(\theta,\theta^{\star})=\{R(\theta)-R(\theta^{\star})\}^{1/2}.

Note that Massart’s condition from Section 5.5.1 corresponds to Tsybakov’s condition above with γ=\gamma=\infty. In that case, the Gibbs posterior concentration rate we recover from Proposition 7 is εn=(logn)1/2n1/2\varepsilon_{n}=(\log n)^{1/2}n^{-1/2}, which is achieved with a suitable constant learning rate. This is within a logarithmic factor of the optimal rate for finite-dimensional problems. Moreover, for both the γ<\gamma<\infty and γ=\gamma=\infty cases, we expect that the logarithmic factor could be removed following an approach like that described in Theorem 3.3, but we do not explore this possible extension here.

We should emphasize that this case is unusual because the learning rate ωn\omega_{n} depends on the (likely unknown) smoothness exponent γ\gamma. This means the rate in Proposition 7 is not adaptive to the margin. However, this dependence is not surprising, as it also appears in Grünwald and Mehta, (2020, Section 6). The reason the learning rate depends on γ\gamma is that the Tsybakov margin condition in Assumption 7.2 implies the Bernstein condition in (17) takes the form

v(θ,θ)m(θ,θ)γ/(1+γ).v(\theta,\theta^{\star})\lesssim m(\theta,\theta^{\star})^{\gamma/(1+\gamma)}.

Therefore, in order to verify Condition 1 using the strategy in Section 3.4, we need ωnm(θ,θ)\omega_{n}m(\theta,\theta^{\star}) and ωn3v(θ,θ)\omega_{n}^{3}v(\theta,\theta^{\star}) to have the same order when d(θ,θ)Mnεnd(\theta,\theta^{\star})\geq M_{n}\varepsilon_{n}. This requires that the learning rate depends on γ\gamma, in particular, ωn=εn1/γ\omega_{n}=\varepsilon_{n}^{1/\gamma}.

5.6 Quantile regression curve

In this section we revisit inference on a conditional quantile, covered in Section 5.1. The τth\tau^{\text{th}} conditional quantile of a response YY given a covariate X=xX=x is modeled by a linear combination of basis functions f(x)=(f1(x),,fJ(x))f(x)=(f_{1}(x),...,f_{J}(x))^{\top}:

QY|X=x(τ)=βf(x),βJ.Q_{Y|X=x}(\tau)=\beta^{\top}f(x),\quad\beta\in\mathbb{R}^{J}.

In Section 5.1, we made the rather restrictive assumption that the true conditional quantile function θ(x)\theta^{\star}(x) belonged to the span of a fixed set of JJ basis functions. In practice, it may not be possible to identify such a set of functions, which is why we considered using a sample-size dependent sequence of sets of basis functions in Section 5.4 to model a smooth function, θ\theta^{\star}. When the degree of smoothness, α\alpha, of θ\theta^{\star} is known we can choose the number of basis functions to use in order to achieve the optimal concentration rate. But, in practice, α\alpha may not be known, which creates a challenge because, as mentioned, the number of terms needed in the basis function expansion modeling θ\theta^{\star} depends on this unknown degree of smoothness.

To achieve optimal concentration rates adaptive to unknown smoothness, the choice of prior is crucial. In particular, the prior must support a very large model space in order to guarantee it places sufficient mass near θ\theta^{\star}. Our approximation of θ\theta^{\star} by a linear combination of basis functions suggests a hierarchical prior for θ(J,βJ)\theta\equiv(J,\beta_{J}), similar to Section 5.5.1, with a marginal prior π\pi for the number of basis functions JJ and a conditional prior Π~J\widetilde{\Pi}_{J} for βJ\beta_{J}, given JJ. The resulting prior for θ\theta is given by a mixture,

Π(A)=j=1π(j)Π~j({βjj:βjfA}),AΘ.\Pi(A)=\sum_{j=1}^{\infty}\pi(j)\,\widetilde{\Pi}_{j}(\{\beta_{j}\in\mathbb{R}^{j}:\beta_{j}^{\top}f\in A\}),\quad A\subseteq\Theta. (47)

Then, in order for Π\Pi to place sufficient mass near θ\theta^{\star}, it is sufficient that the marginal and conditional priors satisfy the following conditions: the marginal prior π\pi for JJ satisfies, for some c1>0c_{1}>0 and every J=jJ=j,

π(j)ec1jlogj;\pi(j)\geq e^{-c_{1}j\log j}; (48)

and the conditional prior Π~\widetilde{\Pi} for βJ\beta_{J}, given J=jJ=j, satisfies, for every jj,

Π~({β:ββ2ε})eCjlog(1/ε),for all βj with βH,\widetilde{\Pi}(\{\beta:\|\beta-\beta^{\prime}\|_{2}\leq\varepsilon\})\gtrsim e^{-Cj\log(1/\varepsilon)},\quad\text{for all $\beta^{\prime}\in\mathbb{R}^{j}$ with $\|\beta^{\prime}\|_{\infty}\leq H$}, (49)

for the same HH as in (43) and for some constant C>0C>0 for all sufficiently small ε>0\varepsilon>0. Fortunately, many simple choices of (π,Π~J)(\pi,\widetilde{\Pi}_{J}) are satisfactory for obtaining adaptive concentration rates, e.g., a Poisson prior on JJ and a JJ-dimensional normal conditional prior for β\beta, given JJ; see Conditions (A1) and (A2) and Remark 1 in Shen and Ghosal, (2015). Besides the conditions in (48) and (49) we need to make a minor modification of Π\Pi to make it suitable for our proof of Gibbs posterior concentration; see below.

Similar to our choice in Section 5.1 we link the data and parameter through the check loss function

\ell_{\theta}(u)=\tfrac{1}{2}(|\theta(x)-y|-|y|)+(\tfrac{1}{2}-\tau)\theta(x),

where θ(x)=βf(x)\theta(x)=\beta^{\top}f(x). See Koltchinskii, (1997) for a proof that PθP\ell_{\theta} is minimized at θ\theta^{\star}. It is straightforward to show the check loss θθ(u)\theta\mapsto\ell_{\theta}(u) is LL-Lipschitz with L<1L<1. From there, if YY were bounded we could use Condition 1 to compute the concentration rate. However, to handle an unbounded response we need the flexibility of Condition 2 and Theorem 4.1. To verify (22), we found it necessary to limit the complexity of the parameter space by imposing a constraint on the prior distribution, namely that the sequence of prior distributions places all its mass on the set Θn:={θ:θΔn}\Theta_{n}:=\{\theta:\|\theta\|_{\infty}\leq\Delta_{n}\} for some diverging sequence Δn\Delta_{n}; see Assumption 8.4. This constraint implies the prior depends on nn, and we refer to this sequence of prior distributions by Π(n)\Pi^{(n)}. Given the hierarchical prior Π\Pi in (47) one straightforward way to define a sequence of prior distributions satisfying the constraint is to restrict and renormalize Π\Pi to Θn\Theta_{n}, i.e., define Π(n)\Pi^{(n)} as

Π(n)(A)=Π(A)/Π(Θn),AΘΘn\displaystyle\Pi^{(n)}(A)=\Pi(A)/\Pi(\Theta_{n}),\quad A\subseteq\Theta\cap\Theta_{n} (50)

This particular construction of Π(n)\Pi^{(n)} in (50) is not the only way to define a sequence of priors satisfying the restriction to Θn\Theta_{n}, but it is convenient. That is, if Π\Pi places mass η\eta on a sup-norm neighborhood of θ\theta^{\star} (see the proof of Proposition 8), then, by construction, Π(n)\Pi^{(n)} in (50) places at least mass η\eta on the same neighborhood.
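As an illustration of (50), the minimal sketch below draws from \Pi^{(n)} by rejection, using a Poisson prior on J and an independent normal conditional prior for \beta_{J} (the simple choices mentioned above); the cosine basis, the grid approximation to the sup-norm, and all tuning values are illustrative assumptions.

```python
import numpy as np

def draw_theta_restricted(delta_n, x_grid, rng, mean_J=5, beta_sd=1.0, max_tries=1000):
    """One draw from Pi^(n) in (50) by rejection: draw (J, beta) from the hierarchical
    prior (Poisson prior on J, independent N(0, beta_sd^2) coordinates for beta given J),
    form theta = beta' f on a grid, and keep the draw only if max_x |theta(x)| <= delta_n."""
    for _ in range(max_tries):
        J = 1 + rng.poisson(mean_J)
        beta = beta_sd * rng.standard_normal(J)
        F = np.column_stack([np.cos(np.pi * j * x_grid) for j in range(J)])  # illustrative basis
        theta = F @ beta
        if np.max(np.abs(theta)) <= delta_n:      # sup-norm restriction (grid approximation)
            return J, beta, theta
    raise RuntimeError("sup-norm restriction rejected every draw; loosen delta_n")

# toy usage
rng = np.random.default_rng(5)
x_grid = np.linspace(0.0, 1.0, 101)
J, beta, theta = draw_theta_restricted(delta_n=10.0, x_grid=x_grid, rng=rng)
```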

We should emphasize this restriction of the prior to Θn\Theta_{n} is only a technical requirement needed for the proof, but it is not unreasonable. Since the true function θ\theta^{\star} is bounded, it is eventually in the growing support of the prior Π\Pi. Similar assumptions have been used in the literature on quantile curve regression; for example, Theorem 6 in Takeuchi et al., (2006) requires that the parameter space consists only of bounded functions, which is a stricter assumption than ours here.

Assumption 8.

  1. 1.

    The function θ:𝕏\theta^{\star}:\mathbb{X}\mapsto\mathbb{R} is Hölder smooth with parameters (α,L)(\alpha,L) (see Assumption 5.1);

  2. 2.

    the basis functions satisfy the approximation property in (43);

  3. 3.

the covariate space 𝕏\mathbb{X} is compact and there exists a δ>0\delta>0 such that the conditional density of YY, given X=xX=x, is continuous and bounded away from 0, uniformly in xx, on the interval (θ(x)δ,θ(x)+δ)(\theta^{\star}(x)-\delta,\,\theta^{\star}(x)+\delta); and,

  4. 4.

    the sequence Π(n)\Pi^{(n)} of prior distributions satisfies (50) for a sequence of subsets of the parameter space Θn:={θ:θΔn}\Theta_{n}:=\{\theta:\|\theta\|_{\infty}\leq\Delta_{n}\} for some sequence Δn>0\Delta_{n}>0, for Π\Pi as defined in (47), and for marginal and conditional priors (π,Π~)(\pi,\widetilde{\Pi}) for JJ and βJ\beta_{J} given J=jJ=j that satisfy (48) and (49).

Proposition 8.

Define εn=(logn)1/2Δn2nα/(1+2α)\varepsilon_{n}=(\log n)^{1/2}\Delta_{n}^{2}n^{-\alpha/(1+2\alpha)}. If the learning rate satisfies ωn=cΔn2\omega_{n}=c\Delta_{n}^{-2} for some 0<c<1/20<c<1/2 and Assumption 8 holds, then the Gibbs posterior distribution concentrates at rate εn\varepsilon_{n} with respect to d(θ,θ)=θθL2(P)d(\theta,\theta^{\star})=\|\theta-\theta^{\star}\|_{L_{2}(P)}.

Since the mathematical statement does not give sufficient emphasis to the adaptation feature, we should follow up on this point. That is, nα/(2α+1)n^{-\alpha/(2\alpha+1)} is the optimal rate for estimating an α\alpha-Hölder smooth function (Shen and Ghosal,, 2015), and it is not difficult to construct an estimator that achieves this, at least approximately, if α\alpha is known. However, α\alpha is unknown in virtually all practical situations, so it is desirable for an estimator, Gibbs posterior, etc. to adapt to the unknown α\alpha. The concentration rate result in Proposition 8 says that the Gibbs posterior achieves nearly the optimal rate adaptively in the sense that it concentrates at nearly the optimal rate as if α\alpha were known.

The concentration rate in Proposition 8 depends on the complexity of the parameter space as determined by Δn\Delta_{n} in Assumption 8.4. For example, if the sup-norm bound on θ\theta^{\star} were known, then Δn\Delta_{n} and the learning rate ωn\omega_{n} could be taken as constants and the rate would be optimal up to a (logn)1/2(\log n)^{1/2} factor. On the other hand, if greater complexity is allowed, e.g., Δn=(logn)p\Delta_{n}=(\log n)^{p} for some power p>0p>0, then the concentration rate takes on an additional (logn)2p(\log n)^{2p} factor, which is not a serious concern.

6 Application: personalized MCID

6.1 Problem setup

In the medical sciences, physicians who investigate the efficacy of new treatments are challenged to determine both statistically and practically significant effects. In many applications some quantitative effectiveness score can be used for assessing the statistical significance of the treatment, but physicians are increasingly interested also in patients’ qualitative assessments of whether they believed the treatment was effective. The aim of the approach described below is to find the cutoff on the effectiveness score scale that best separates patients by their reported outcomes. That cutoff value is called the minimum clinically important difference, or MCID. For this application we follow up on the MCID problem discussed in Syring and Martin, (2017) with a covariate-adjusted, or personalized, version. In medicine, there is a trend away from the classical “one size fits all” treatment procedures, to treatments that are tailored more-or-less to each individual. Along these lines, naturally, doctors would be interested to understand how that threshold for practical significance depends on the individual, hence there is interest in a so-called personalized MCID (Hedayat et al.,, 2015; Zhou et al.,, 2020).

Let the data Un=(U1,,Un)U^{n}=(U_{1},\ldots,U_{n}) be iid PP, where each observation is a triple Ui=(Xi,Yi,Zi)U_{i}=(X_{i},Y_{i},Z_{i}) denoting the patient’s diagnostic measurement, their self-reported effectiveness outcome Yi{1,1}Y_{i}\in\{-1,1\}, and covariate value ZiqZ_{i}\in\mathbb{Z}\subseteq\mathbb{R}^{q}, for i=1,,ni=1,\ldots,n and q1q\geq 1. In practice the diagnostic measurement has something to do with the effectiveness of the treatment so one can imagine examples including blood pressure, blood glucose level, and viral load. Examples of covariates include a patient’s age, weight, and gender. The idea is that the xx-scale cutoff for practical significance would depend on the covariate zz, hence the MCID is a function, say, θ(z)\theta(z), and the goal is to learn this function.

The true MCID θ\theta^{\star} is defined as the solution to an optimization problem. That is, if

θ(x,y,z)=12[1ysign{xθ(z)}],(x,y,z)×{1,1}×,\ell_{\theta}(x,y,z)=\tfrac{1}{2}[1-y\,\mathrm{sign}\{x-\theta(z)\}],\quad(x,y,z)\in\mathbb{R}\times\{-1,1\}\times\mathbb{Z}, (51)

then the expected loss is R(θ)=P[Ysign{Xθ(Z)}]R(\theta)=P[Y\neq\mathrm{sign}\{X-\theta(Z)\}], and the true MCID function is defined as the minimizer θ=argminθΘR(θ)\theta^{\star}=\arg\min_{\theta\in\Theta}R(\theta), where the minimum is taken over a class Θ\Theta of functions on \mathbb{Z}. Alternatively, as in Section 5.5, the true θ\theta^{\star} satisfies ηz(x)>12\eta_{z}(x)>\tfrac{1}{2} if x>θ(z)x>\theta^{\star}(z) and ηz(x)12\eta_{z}(x)\leq\tfrac{1}{2} if xθ(z)x\leq\theta^{\star}(z), where ηz(x)=P(Y=1X=x,Z=z)\eta_{z}(x)=P(Y=1\mid X=x,\,Z=z) is the conditional probability function.

As described in Section 2, the Gibbs posterior distribution is based on an empirical risk function which, in the present case, is given by

Rn(θ)=12ni=1n[1Yisign{Xiθ(Zi)}],θΘ.R_{n}(\theta)=\frac{1}{2n}\sum_{i=1}^{n}[1-Y_{i}\,\mathrm{sign}\{X_{i}-\theta(Z_{i})\}],\quad\theta\in\Theta. (52)

In order to put this theory into practice, it is necessary to give the function space Θ\Theta a lower-dimensional parametrization. In particular, we consider a true MCID function θ\theta^{\star} belonging to a Hölder class as in Assumption 5 but with unknown smoothness, as in Section 5.6. And, we model θ\theta^{\star} by a linear combination of basis functions θ(z)=θJ,β(z):=j=1Jβjfj(z)\theta(z)=\theta_{J,\beta}(z):=\textstyle\sum_{j=1}^{J}\beta_{j}f_{j}(z), for basis functions fjf_{j}, j=1,,Jj=1,\ldots,J. Then, each θ\theta is identified by a pair (J,β)(J,\beta) consisting of a positive integer JJ and a JJ-dimensional coefficient vector β\beta. We use cubic B-splines in the numerical examples in Section 2 of the supplementary material, but any basis capable of approximating θ\theta^{\star} will work; see (43).
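A minimal sketch of the empirical risk (52) under this parametrization, and of the corresponding log Gibbs pseudo-likelihood, is given below; the basis evaluation matrix F and all names are illustrative assumptions.

```python
import numpy as np

def mcid_empirical_risk(beta, F, x, y):
    """Empirical risk (52): average of (1/2) * [1 - y_i * sign{x_i - theta(z_i)}],
    where theta(z_i) = f(z_i)' beta and F is the n x J matrix with rows f(z_i)."""
    theta_z = F @ beta
    return 0.5 * np.mean(1.0 - y * np.sign(x - theta_z))

def mcid_log_pseudolikelihood(beta, F, x, y, omega=1.0):
    """log Gibbs pseudo-likelihood, -omega * n * R_n(theta_beta)."""
    return -omega * len(y) * mcid_empirical_risk(beta, F, x, y)
```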

The prior setup is similar to that in Section 5.6, (47). That is, the prior is specified hierarchically with a marginal prior π\pi for JJ and a suitable conditional prior Π~J\widetilde{\Pi}_{J} for βJ\beta_{J}, given JJ. And, as mentioned before, very simple choices of the marginal and conditional priors achieve the desired adaptive rates.

6.2 Concentration rate result

Assumption 9 below concerns the smoothness of θ\theta^{\star} and requires that the chosen basis satisfy the approximation property used previously; it also imposes the same mild assumptions on random series priors used in Section 5.6, sufficient to ensure adequate prior mass is assigned to a neighborhood of θ\theta^{\star}; finally, it assumes a margin condition on the classifier like that in Assumption 6.5 of Section 5.5.1. These conditions are sufficient to establish a Gibbs posterior concentration rate.

Assumption 9.

  1. 1.

The true MCID function \theta^{\star}:\mathbb{Z}\to\mathbb{R}, where \mathbb{Z} is a compact subset of \mathbb{R}, is Hölder smooth with parameters (α,L)(\alpha,L) (see Assumption 5.1);

  2. 2.

    the basis functions satisfy the approximation property in (43);

  3. 3.

    the prior distribution Π\Pi for θ\theta is defined hierarchically as in (47) with marginal and conditional priors (π,Π~)(\pi,\widetilde{\Pi}) for JJ and βJ\beta_{J} given J=jJ=j that satisfy (48) and (49); and,

  4. 4.

    there exists h(0,1)h\in(0,1) such that P{|2ηZ(X)1|h}=0P\{|2\eta_{Z}(X)-1|\leq h\}=0; and,

  5. 5.

    the conditional distribution, PzP_{z}, of XX, given Z=zZ=z, has a density with respect to Lebesgue measure that is uniformly bounded away from infinity.

Proposition 9.

Suppose Assumption 9 holds, with α\alpha as defined there, and set εn=(logn)nα/(1+α)\varepsilon_{n}=(\log n)n^{-\alpha/(1+\alpha)}. For any fixed ω>0\omega>0 the Gibbs posterior concentrates at rate εn\varepsilon_{n} with respect to the divergence

d(θ,θ)\displaystyle d(\theta,\theta^{\star}) =P{θ(Z)θ(Z)Xθ(Z)θ(Z)}\displaystyle=P\{\theta(Z)\wedge\theta^{\star}(Z)\leq X\leq\theta(Z)\vee\theta^{\star}(Z)\}
=θ(z)θ(z)θ(z)θ(z)Pz(dx)P(dz).\displaystyle=\int_{\mathbb{Z}}\int_{\theta(z)\wedge\theta^{\star}(z)}^{\theta(z)\vee\theta^{\star}(z)}P_{z}(dx)\,P(dz). (53)

The Gibbs posterior distribution we have defined for the personalized MCID function achieves the concentration rate in Proposition 9 adaptively to the unknown smoothness α\alpha of θ\theta^{\star}. Mammen and Tsybakov, (1995) consider estimation of the boundary curve of a set, and they show that the minimax optimal rate is nα/(α+1)n^{-\alpha/(\alpha+1)} when the boundary curve is α\alpha-Hölder smooth and distance is measured by the Lebesgue measure of the set symmetric difference. In our case, if (X,Z)(X,Z) has a joint density, bounded away from 0, then our divergence measure d(θ,θ)d(\theta,\theta^{\star}) is equivalent to

Leb({(x,z):xθ(z)}{(x,z):xθ(z)}),\text{Leb}(\{(x,z):x\leq\theta(z)\}\,\triangle\,\{(x,z):x\leq\theta^{\star}(z)\}),

in which case our rate is within a logarithmic factor of the minimax optimal rate.
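In words, the divergence in (53) is the probability that the diagnostic measure XX falls between the two candidate MCID curves. Given a sample of (X,Z) pairs and two candidate functions, it can be approximated by a simple Monte Carlo average; the sketch below is purely illustrative and the function names are our own.

import numpy as np

def divergence_mc(theta, theta_star, x, z):
    # Monte Carlo estimate of d(theta, theta*) in (53) based on draws (x_i, z_i)
    lo = np.minimum(theta(z), theta_star(z))
    hi = np.maximum(theta(z), theta_star(z))
    return np.mean((x >= lo) & (x <= hi))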

Hedayat et al., (2015) also study the personalized MCID and derive a convergence rate for an M-estimator of θ\theta^{\star} based on a smoothed and penalized version of (52). It is difficult to compare our result with theirs because, for instance, their rate depends on two user-controlled sequences related to the smoothing and penalization of their loss. But, as mentioned above, our rate is near optimal in certain cases, so the asymptotic results in Hedayat et al., (2015) cannot be appreciably better than our rate in Proposition 9.

6.3 Numerical illustrations

We performed two simulation examples to investigate the performance of the Gibbs posterior for the personalized MCID. In both examples we use a constant learning rate ω=1\omega=1, but we generally recommend data-driven learning rates; see Syring and Martin, (2019).

For the first example we sample n=100n=100 independent observations of (X,Y,Z)(X,Y,Z). The covariate ZZ is sampled from a uniform distribution on the interval [0,3][0,3]. Given Z=zZ=z, the diagnostic measure XX is sampled from a normal distribution with mean z33z2+5z^{3}-3z^{2}+5 and variance 11, and the patient-reported outcome YY is sampled from a Rademacher distribution with probability

ηz(x)={Φ(x;z33z2+50.05,1/2),x>z33z2+5Φ(x;z33z2+5+0.05,1/2),xz33z2+5,\eta_{z}(x)=\begin{cases}\Phi(x;z^{3}-3z^{2}+5-0.05,1/2),&x>z^{3}-3z^{2}+5\\ \Phi(x;z^{3}-3z^{2}+5+0.05,1/2),&x\leq z^{3}-3z^{2}+5,\end{cases} (54)

where Φ(x;μ,σ)\Phi(x;\mu,\sigma) denotes the 𝖭(μ,σ){\sf N}(\mu,\sigma) distribution function. The addition of ±0.05\pm 0.05 in the formula of ηz(x)\eta_{z}(x) is to ensure that the margin condition in Assumption 9 is met. As mentioned above, we parametrize the MCID function by piecewise polynomials, specifically, cubic B-splines. For highly varying MCID functions, a reversible-jump MCMC algorithm that allows the number and locations of the break points in the piecewise polynomials to change may be helpful; see Syring and Martin, (2020). However, for this example we fix the parameter dimension to just six B-spline functions, which allows us to use a simple Metropolis–Hastings algorithm to sample from the Gibbs posterior distribution. Since the dimension is fixed, the prior is only needed for the B-spline coefficients, and for these we use diffuse independent normal priors with mean zero and standard deviation of 66. Over 250250 replications, the average empirical misclassification rate is 16% using the Gibbs posterior mean MCID function compared to 13% using the true MCID function when applying these two classifiers to a hold-out sample of 100100 data points.
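For readers wishing to reproduce this kind of computation, the following is a rough sketch of the random-walk Metropolis–Hastings sampler we have in mind for the fixed-dimension case; it reuses the log_gibbs_post function sketched in Section 6.1, and the proposal scale, chain length, and construction of the B-spline basis matrix F are illustrative choices rather than the exact settings behind Figure 1.

import numpy as np

def rw_metropolis(log_post, beta0, n_iter=20000, step=0.25, seed=0):
    # random-walk Metropolis targeting an unnormalized log posterior
    rng = np.random.default_rng(seed)
    beta = np.array(beta0, dtype=float)
    lp = log_post(beta)
    draws = np.empty((n_iter, beta.size))
    for i in range(n_iter):
        prop = beta + step * rng.standard_normal(beta.size)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept with probability min(1, ratio)
            beta, lp = prop, lp_prop
        draws[i] = beta
    return draws

# Illustration (F is the n-by-6 cubic B-spline design matrix at z_1,...,z_n, and
# simulating (x, y, z) as described in the text is assumed):
#   draws = rw_metropolis(lambda b: log_gibbs_post(b, F, x, y, omega=1.0, prior_sd=6.0),
#                         beta0=np.zeros(6))
#   theta_hat = F @ draws[5000:].mean(axis=0)   # pointwise Gibbs posterior mean MCID curve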

The left pane of Figure 1 shows the results for one simulated data set under the above formulation. Even with only n=100n=100 samples, the Gibbs posterior does a good job of centering on the true MCID function. The right pane displays the pointwise Gibbs posterior mean MCID function for each of 250250 repetitions of simulation 1, along with the overall pointwise mean of these functions, and the true MCID function. The Gibbs posterior mean function is stable across repetitions of the simulation.

Figure 1: Left: the posterior mean function (dashed), true MCID function (solid), and data (Y=1Y=1 black points, Y=1Y=-1 gray points) for one replication of the first simulation. Right: the Gibbs posterior mean MCID functions (solid gray) for each of 250250 repetitions of the first simulation, the overall mean function across those repetitions (dashed black), and the true MCID function (solid black).

The second example we consider includes a vector covariate similar to that in Example 1 of Scenario 2 in Hedayat et al., (2015). We sample n=1000n=1000 independent observations of (X,Y,Z)(X,Y,Z), where Z=(Z1,Z2)Z=(Z_{1},Z_{2}) has a uniform distribution on the square [0,3]2[0,3]^{2}. Given Z=zZ=z, the diagnostic measure XX has a normal distribution with mean z1+2z2z_{1}+2z_{2} and variance 11, and the patient-reported outcome YY is a Rademacher random variable with probability

ηz(x)={Φ(x;z1+2z20.05,1),x>z1+2z2Φ(x;z1+2z2+0.05,1),xz1+2z2.\eta_{z}(x)=\begin{cases}\Phi(x;z_{1}+2z_{2}-0.05,1),&x>z_{1}+2z_{2}\\ \Phi(x;z_{1}+2z_{2}+0.05,1),&x\leq z_{1}+2z_{2}.\end{cases} (55)

In practice it is common to have more than one covariate, so this second example is perhaps more realistic than the first. However, it is much more difficult to visualize the MCID function for more than one covariate, so we do not display any figures for this example. We use tensor product B-splines with 88 fixed B-spline functions (1616 coefficients) to parametrize the MCID function. Again, we use independent diffuse normal priors with zero mean and standard deviation equal to 66 for each coefficient. Over 100100 repetitions of this simulation we observe an average empirical misclassification rate of 24% using the Gibbs posterior mean MCID function compared to 23% using the true MCID function when applied to a hold-out sample of 10001000 data points.
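The tensor-product parametrization used here simply takes all pairwise products of two marginal basis expansions. A minimal sketch of the corresponding design matrix, assuming the marginal basis matrices B1 and B2 (evaluated at the first and second covariates, respectively) are already available, is the following.

import numpy as np

def tensor_design(B1, B2):
    # row-wise Kronecker product: column (j, k) holds f_j(z_{i1}) * g_k(z_{i2})
    n = B1.shape[0]
    return np.einsum('ij,ik->ijk', B1, B2).reshape(n, -1)

# For example, with 4 marginal cubic B-splines per coordinate this returns an
# n-by-16 design matrix, which would match the 16 coefficients mentioned above.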

7 Concluding remarks

In this paper we focus on developing some simple, yet general, techniques for establishing asymptotic concentration rates for Gibbs posteriors. A key take-away is that the robustness to model misspecification offered by the Gibbs framework does not come at the expense of slower concentration rates. Indeed, in the examples presented here—and others presented elsewhere (Syring and Martin, 2020)—the rates achieved are the same as those achieved by traditional Bayesian posteriors and (nearly) minimax optimal. Another main point is that Gibbs posterior distributions are not inherently challenging to analyze; on the contrary, the proofs presented herein are concise and transparent. An additional novelty of the analysis presented here is that we consider cases where the learning rate can be non-constant, i.e., either a vanishing sequence or data-dependent, and prove corresponding posterior concentration rate results.

Here the focus has been on deriving Gibbs posteriors with the best possible concentration rates, and selection of learning rates has proceeded with these asymptotic properties in mind. In other works (Syring and Martin, 2019), random learning rates are derived for good uncertainty quantification in finite samples. We conjecture, however, that the two are not mutually exclusive, and that learning rates arising as solutions to the calibration algorithm in Syring and Martin, (2019) also have the desirable concentration rate properties. Proving this conjecture seems challenging, but Section 3.3 provides a first step in this direction.

Acknowledgments

The authors sincerely thank the reviewers for their helpful feedback on previous versions of the manuscript. This work is partially supported by the U.S. National Science Foundation, DMS–1811802.

Appendix A Proofs of the main theorems

A.1 Proof of Theorem 1

As a first step, we state and prove a result that gives an in-probability lower bound on the denominator of the Gibbs posterior, the so-called partition function. The proof closely follows that of Lemma 1 in Shen and Wasserman, (2001) but is, in some sense, more general, so we present the details here for the sake of completeness.

Lemma 1.

Define

Dn=eωn{Rn(θ)Rn(θ)}Π(dθ).D_{n}=\int e^{-\omega\,n\{R_{n}(\theta)-R_{n}(\theta^{\star})\}}\,\Pi(d\theta). (56)

If Gn={θ:m(θ,θ)v(θ,θ)εnr}G_{n}=\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq\varepsilon_{n}^{r}\} is as in (12), with εn\varepsilon_{n} satisfying εn0\varepsilon_{n}\to 0 and nεnrn\varepsilon_{n}^{r}\to\infty, then Dn>12Π(Gn)e2ωnεnrD_{n}>\tfrac{1}{2}\Pi(G_{n})e^{-2\omega n\varepsilon_{n}^{r}} with PnP^{n}-probability converging to 1.

Proof.

Define a standardized version of the empirical risk difference, i.e.,

Zn(θ)={nRn(θ)nRn(θ)}nm(θ){nv(θ)}1/2,Z_{n}(\theta)=\frac{\{nR_{n}(\theta)-nR_{n}(\theta^{\star})\}-nm(\theta)}{\{nv(\theta)\}^{1/2}},

where m(θ)=m(θ,θ)m(\theta)=m(\theta,\theta^{\star}) and v(θ)=v(θ,θ)v(\theta)=v(\theta,\theta^{\star}), the mean and variance of the risk difference. Of course, Zn(θ)Z_{n}(\theta) depends (implicitly) on the data UnU^{n}. Let

𝒵n={(θ,Un):|Zn(θ)|(nεnr)1/2}.\mathcal{Z}_{n}=\{(\theta,U^{n}):|Z_{n}(\theta)|\geq(n\varepsilon_{n}^{r})^{1/2}\}.

Next, define the cross-sections

𝒵n(θ)={Un:(θ,Un)𝒵n}and𝒵n(Un)={θ:(θ,Un)𝒵n}.\mathcal{Z}_{n}(\theta)=\{U^{n}:(\theta,U^{n})\in\mathcal{Z}_{n}\}\quad\text{and}\quad\mathcal{Z}_{n}(U^{n})=\{\theta:(\theta,U^{n})\in\mathcal{Z}_{n}\}.

For GnG_{n} as above, since

nRn(θ)nRn(θ)=nm(θ)+{nv(θ)}1/2Zn(θ),nR_{n}(\theta)-nR_{n}(\theta^{\star})=nm(\theta)+\{nv(\theta)\}^{1/2}Z_{n}(\theta),

and, on Gn𝒵n(Un)cG_{n}\cap\mathcal{Z}_{n}(U^{n})^{c}, we have nm(θ)nεnrnm(\theta)\leq n\varepsilon_{n}^{r} and {nv(θ)}1/2|Zn(θ)|nεnr\{nv(\theta)\}^{1/2}|Z_{n}(\theta)|\leq n\varepsilon_{n}^{r}, we immediately get

DnGn𝒵n(Un)ceωnm(θ)ω{nv(θ)}1/2Zn(θ)Π(dθ)e2ωnεnrΠ{Gn𝒵n(Un)c}.D_{n}\geq\int_{G_{n}\cap\mathcal{Z}_{n}(U^{n})^{c}}e^{-\omega nm(\theta)-\omega\{nv(\theta)\}^{1/2}Z_{n}(\theta)}\,\Pi(d\theta)\geq e^{-2\omega n\varepsilon_{n}^{r}}\Pi\{G_{n}\cap\mathcal{Z}_{n}(U^{n})^{c}\}.

From this lower bound, we get

Pn{Dn12Π(Gn)e2ωnεnr}\displaystyle P^{n}\{D_{n}\leq\tfrac{1}{2}\Pi(G_{n})e^{-2\omega n\varepsilon_{n}^{r}}\} Pn[e2ωnεnrΠ{Gn𝒵n(Un)c}12Π(Gn)e2ωnεnr]\displaystyle\leq P^{n}\bigl{[}e^{-2\omega n\varepsilon_{n}^{r}}\Pi\{G_{n}\cap\mathcal{Z}_{n}(U^{n})^{c}\}\leq\tfrac{1}{2}\Pi(G_{n})e^{-2\omega n\varepsilon_{n}^{r}}\bigr{]}
=Pn[Π{Gn𝒵n(Un)}12Π(Gn)]\displaystyle=P^{n}\bigl{[}\Pi\{G_{n}\cap\mathcal{Z}_{n}(U^{n})\}\geq\tfrac{1}{2}\Pi(G_{n})\bigr{]}
2PnΠ{Gn𝒵n(Un)}Π(Gn),\displaystyle\leq\frac{2P^{n}\Pi\{G_{n}\cap\mathcal{Z}_{n}(U^{n})\}}{\Pi(G_{n})},

where the last line is by Markov’s inequality. We can then simplify the expectation in the upper bound displayed above using Fubini’s theorem:

PnΠ{Gn𝒵n(Un)}\displaystyle P^{n}\Pi\{G_{n}\cap\mathcal{Z}_{n}(U^{n})\} =1{θGn𝒵n(Un)}Π(dθ)Pn(dUn)\displaystyle=\int\int 1\{\theta\in G_{n}\cap\mathcal{Z}_{n}(U^{n})\}\,\Pi(d\theta)\,P^{n}(dU^{n})
=1{θGn} 1{θ𝒵n(Un)}Pn(dUn)Π(dθ)\displaystyle=\int\int 1\{\theta\in G_{n}\}\,1\{\theta\in\mathcal{Z}_{n}(U^{n})\}\,P^{n}(dU^{n})\,\Pi(d\theta)
=GnPn{𝒵n(θ)}Π(dθ).\displaystyle=\int_{G_{n}}P^{n}\{\mathcal{Z}_{n}(\theta)\}\,\Pi(d\theta).

By Chebyshev’s inequality, Pn{𝒵n(θ)}(nεnr)1P^{n}\{\mathcal{Z}_{n}(\theta)\}\leq(n\varepsilon_{n}^{r})^{-1}, and hence

Pn{Dn12Π(Gn)e2ωnεnr}2(nεnr)1.P^{n}\{D_{n}\leq\tfrac{1}{2}\Pi(G_{n})e^{-2\omega n\varepsilon_{n}^{r}}\}\leq 2(n\varepsilon_{n}^{r})^{-1}.

Finally, since nεnrn\varepsilon_{n}^{r}\to\infty, the left-hand side vanishes, completing the proof. ∎

For the proof of Theorem 1, write

Πn(An)=Nn(An)Dn,\Pi_{n}(A_{n})=\frac{N_{n}(A_{n})}{D_{n}},

where An={θ:d(θ;θ)>Mεn}A_{n}=\{\theta:d(\theta;\theta^{\star})>M\varepsilon_{n}\}, DnD_{n} is as in (56), and

Nn(An)=Aneωn{Rn(θ)Rn(θ)}Π(dθ).N_{n}(A_{n})=\int_{A_{n}}e^{-\omega\,n\{R_{n}(\theta)-R_{n}(\theta^{\star})\}}\,\Pi(d\theta).

For GnG_{n} as in Lemma 1, write bn=12Π(Gn)e2ωnεnrb_{n}=\frac{1}{2}\Pi(G_{n})e^{-2\omega n\varepsilon_{n}^{r}} for the lower bound on DnD_{n}. Then

Πn(An)\displaystyle\Pi_{n}(A_{n}) Nn(An)Dn 1(Dn>bn)+1(Dnbn)\displaystyle\leq\frac{N_{n}(A_{n})}{D_{n}}\,1(D_{n}>b_{n})+1(D_{n}\leq b_{n})
bn1Nn(An)+1(Dnbn).\displaystyle\leq b_{n}^{-1}N_{n}(A_{n})+1(D_{n}\leq b_{n}).

By Fubini’s theorem, independence of the data UnU^{n}, and Condition 1, we get

PnNn(An)=An{Peω(θθ)}nΠ(dθ)<eKMrωnεnr.P^{n}N_{n}(A_{n})=\int_{A_{n}}\{Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\}^{n}\,\Pi(d\theta)<e^{-KM^{r}\omega n\varepsilon_{n}^{r}}.

Take expectation of Πn(An)\Pi_{n}(A_{n}) and plug in the upper bound above, along with Π(Gn)eC1nεnr\Pi(G_{n})\geq e^{-C_{1}n\varepsilon_{n}^{r}} from (15) and Pn(Dnbn)2(nεnr)1P^{n}(D_{n}\leq b_{n})\leq 2(n\varepsilon_{n}^{r})^{-1} from Lemma 1, to get

PnΠn(An)2e(ωKMrC12ω)nεnr+2(nεnr)1.P^{n}\Pi_{n}(A_{n})\leq 2e^{-(\omega KM^{r}-C_{1}-2\omega)n\varepsilon_{n}^{r}}+2(n\varepsilon_{n}^{r})^{-1}.

Since the right-hand side is vanishing for sufficiently large MM, the claim follows.

A.2 Proof of Theorem 2

A special case of this result was first presented in Bhattacharya and Martin, (2022), but we are including the proof here for completeness.

Recall that the Gibbs posterior probability, Πn(An)\Pi_{n}(A_{n}), is a ratio, namely, Nn(An)/DnN_{n}(A_{n})/D_{n}. Both the numerator and denominator are integrals, and the key idea here is to split the range of integration in the numerator into countably many disjoint pieces as follows:

Nn(An)\displaystyle N_{n}(A_{n}) =d(θ,θ)>Mnεneωn{Rn(θ)Rn(θ)}Π(dθ)\displaystyle=\int_{d(\theta,\theta^{\star})>M_{n}\varepsilon_{n}}e^{-\omega n\{R_{n}(\theta)-R_{n}(\theta^{\star})\}}\,\Pi(d\theta)
=t=1tMnεn<d(θ,θ)<(t+1)Mnεneωn{Rn(θ)Rn(θ)}Π(dθ).\displaystyle=\sum_{t=1}^{\infty}\int_{tM_{n}\varepsilon_{n}<d(\theta,\theta^{\star})<(t+1)M_{n}\varepsilon_{n}}e^{-\omega n\{R_{n}(\theta)-R_{n}(\theta^{\star})\}}\,\Pi(d\theta).

Taking expectation of the left-hand side and moving it under the sum and under the integral on the right-hand side, we need to bound

tMnεn<d(θ,θ)<(t+1)Mnεn{Peω(θθ)}nΠ(dθ),t=1,2,.\int_{tM_{n}\varepsilon_{n}<d(\theta,\theta^{\star})<(t+1)M_{n}\varepsilon_{n}}\{Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\}^{n}\,\Pi(d\theta),\quad t=1,2,\ldots.

By Condition 1, on the given range of integration, the integrand is bounded above by

eωKn(tMnεn)r=eωKtrMnr,e^{-\omega Kn(tM_{n}\varepsilon_{n})^{r}}=e^{-\omega Kt^{r}M_{n}^{r}},

so the expectation of the integral itself is bounded by

eωKtrMnrΠ({θ:d(θ,θ)<(t+1)Mnεn}),t=1,2,e^{-\omega Kt^{r}M_{n}^{r}}\Pi(\{\theta:d(\theta,\theta^{\star})<(t+1)M_{n}\varepsilon_{n}\}),\quad t=1,2,\ldots

Since Π\Pi has a bounded density on the qq-dimensional parameter space, we clearly have

Π({θ:d(θ,θ)<(t+1)Mnεn}){(t+1)Mnεn}q.\Pi(\{\theta:d(\theta,\theta^{\star})<(t+1)M_{n}\varepsilon_{n}\})\lesssim\{(t+1)M_{n}\varepsilon_{n}\}^{q}.

Plug all this back into the summation above to get

PnNn(An)(Mnεn)qt=1(t+1)qeωKtrMnr.P^{n}N_{n}(A_{n})\lesssim(M_{n}\varepsilon_{n})^{q}\sum_{t=1}^{\infty}(t+1)^{q}e^{-\omega Kt^{r}M_{n}^{r}}.

The above sum is finite for all nn and bounded by a multiple of eωKMnre^{-\omega KM_{n}^{r}}. Then MnqM_{n}^{q} times the sum is vanishing as nn\to\infty and, consequently, we find that the expectation of the Gibbs posterior numerator is o(εnq)o(\varepsilon_{n}^{q}).
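To see the stated bound on the sum, note that, assuming without loss of generality that M_{n}\geq 1, we have t^{r}M_{n}^{r}\geq M_{n}^{r}+(t^{r}-1) for every t\geq 1, so that

\sum_{t=1}^{\infty}(t+1)^{q}e^{-\omega Kt^{r}M_{n}^{r}}\leq e^{-\omega KM_{n}^{r}}\sum_{t=1}^{\infty}(t+1)^{q}e^{-\omega K(t^{r}-1)}\lesssim e^{-\omega KM_{n}^{r}},

where the final series converges and does not depend on n.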

For the denominator DnD_{n}, we can proceed just like in the proof of Lemma 1. The key difference is that we redefine

𝒵n={(θ,Un):|Zn(θ)|(Qnεnr)1/2},\mathcal{Z}_{n}=\{(\theta,U^{n}):|Z_{n}(\theta)|\geq(Qn\varepsilon_{n}^{r})^{1/2}\},

with an arbitrary constant Q>1Q>1, so that

Pn{Dn12Π(Gn)e2Qωnεnr}2(Qnεnr)1.P^{n}\{D_{n}\leq\tfrac{1}{2}\Pi(G_{n})e^{-2Q\omega n\varepsilon_{n}^{r}}\}\leq 2(Qn\varepsilon_{n}^{r})^{-1}.

Then, just like in the proof of Theorem 1, since nεnr=1n\varepsilon_{n}^{r}=1, we have

PnΠn(An)o(εnq)e2Qωεnq+2Q1,P^{n}\Pi_{n}(A_{n})\leq\frac{o(\varepsilon_{n}^{q})}{e^{-2Q\omega}\varepsilon_{n}^{q}}+2Q^{-1},

which implies

lim supnPnΠn(An)2Q1.\limsup_{n\to\infty}P^{n}\Pi_{n}(A_{n})\leq 2Q^{-1}.

Since QQ is arbitrary, we conclude that PnΠn(An)0P^{n}\Pi_{n}(A_{n})\to 0, completing the proof.

A.3 Proof of Theorem 3

The proof is nearly identical to that of Theorem 1. Begin with

Πn(An)=Nn(An)Dn,\Pi_{n}(A_{n})=\frac{N_{n}(A_{n})}{D_{n}},

where An={θ:d(θ;θ)>Mεn}A_{n}=\{\theta:d(\theta;\theta^{\star})>M\varepsilon_{n}\}, DnD_{n} is as in (56), and

Nn(An)=Aneωnn{Rn(θ)Rn(θ)}Π(dθ).N_{n}(A_{n})=\int_{A_{n}}e^{-\omega_{n}\,n\{R_{n}(\theta)-R_{n}(\theta^{\star})\}}\,\Pi(d\theta).

When the learning rate is a sequence ωn\omega_{n} rather than constant, Lemma 1 can be applied with no alterations provided nωnεnrn\omega_{n}\varepsilon_{n}^{r}\rightarrow\infty, as assumed in the statement of Theorem 3. Then, for GnG_{n} as in Lemma 1, write bn=12Π(Gn)e2ωnnεnrb_{n}=\frac{1}{2}\Pi(G_{n})e^{-2\omega_{n}n\varepsilon_{n}^{r}} for the lower bound on DnD_{n}. Bound the posterior probability of AnA_{n} by

Πn(An)\displaystyle\Pi_{n}(A_{n}) Nn(An)Dn 1(Dn>bn)+1(Dnbn)\displaystyle\leq\frac{N_{n}(A_{n})}{D_{n}}\,1(D_{n}>b_{n})+1(D_{n}\leq b_{n})
bn1Nn(An)+1(Dnbn).\displaystyle\leq b_{n}^{-1}N_{n}(A_{n})+1(D_{n}\leq b_{n}).

By Fubini’s theorem, independence of the data UnU^{n}, and Condition 1, we get

PnNn(An)=An{Peωn(θθ)}nΠ(dθ)<eKMrωnnεnr.P^{n}N_{n}(A_{n})=\int_{A_{n}}\{Pe^{-\omega_{n}(\ell_{\theta}-\ell_{\theta^{\star}})}\}^{n}\,\Pi(d\theta)<e^{-KM^{r}\omega_{n}n\varepsilon_{n}^{r}}.

Take expectation of Πn(An)\Pi_{n}(A_{n}) and plug in the upper bound above, along with Π(Gn)eCωnnεnr\Pi(G_{n})\gtrsim e^{-C\omega_{n}n\varepsilon_{n}^{r}} from (14) and Pn(Dnbn)=o(1)P^{n}(D_{n}\leq b_{n})=o(1) from Lemma 1, to get

PnΠn(An)e(KMrC2)ωnnεnr+o(1).P^{n}\Pi_{n}(A_{n})\lesssim e^{-(KM^{r}-C-2)\omega_{n}n\varepsilon_{n}^{r}}+o(1).

Since the right-hand side is vanishing for sufficiently large MM, the claim follows.

A.4 Proof of Theorem 4

First note that if the conditions of Theorem 3 hold for ωn\omega_{n}, then Πnωn/2\Pi_{n}^{\omega_{n}/2} also concentrates at rate εn\varepsilon_{n}. That is, at least asymptotically, there is no difference between the learning rates ωn\omega_{n} and ωn/2\omega_{n}/2.

Next, as in the proof of Theorem 1, denote the numerator and denominator of Πnω^n(A)\Pi_{n}^{\hat{\omega}_{n}}(A) by Nnω^n(A)N_{n}^{\hat{\omega}_{n}}(A) and Dnω^nD_{n}^{\hat{\omega}_{n}}. Let W={Un:ωn/2<ω^n<ωn}W=\{U^{n}:\omega_{n}/2<\hat{\omega}_{n}<\omega_{n}\}. By the assumptions of Theorem 4, Pn1(W)1P^{n}1(W)\rightarrow 1, so in the argument below we focus on bounding the numerator and denominator of the Gibbs posterior given WW.

Restricting to the set WW, using Lemma 1, and noting that ωe2nωεnr\omega\mapsto e^{-2n\omega\varepsilon_{n}^{r}} decreases in ω\omega we have Dnω^n>bnD_{n}^{\hat{\omega}_{n}}>b_{n} with PnP^{n}-probability approaching 11 where

bn=12Π(Gn)e2nωnεnreC1nωnεnrb_{n}=\tfrac{1}{2}\Pi(G_{n})e^{-2n\omega_{n}\varepsilon_{n}^{r}}\gtrsim e^{-C_{1}n\omega_{n}\varepsilon_{n}^{r}}

for some C1>0C_{1}>0, where the last inequality follows from (14).

Set 𝕎={θ:Rn(θ)Rn(θ)>0}\mathbb{W}=\{\theta:R_{n}(\theta)-R_{n}(\theta^{\star})>0\} and then bound the numerator as follows:

Nnω^n(An)\displaystyle N_{n}^{\hat{\omega}_{n}}(A_{n}) =Nnω^n(An𝕎)+Nnω^n(An𝕎c)\displaystyle=N_{n}^{\hat{\omega}_{n}}(A_{n}\cap\mathbb{W})+N_{n}^{\hat{\omega}_{n}}(A_{n}\cap\mathbb{W}^{c})
≤An𝕎e(ωn/2)n[Rn(θ)Rn(θ)]Π(dθ)+An𝕎ceωnn[Rn(θ)Rn(θ)]Π(dθ)\displaystyle\leq\int_{A_{n}\cap\mathbb{W}}e^{-(\omega_{n}/2)\,n[R_{n}(\theta)-R_{n}(\theta^{\star})]}\Pi(d\theta)+\int_{A_{n}\cap\mathbb{W}^{c}}e^{-\omega_{n}\,n[R_{n}(\theta)-R_{n}(\theta^{\star})]}\Pi(d\theta)
≤Ane(ωn/2)n[Rn(θ)Rn(θ)]Π(dθ)+Aneωnn[Rn(θ)Rn(θ)]Π(dθ)\displaystyle\leq\int_{A_{n}}e^{-(\omega_{n}/2)\,n[R_{n}(\theta)-R_{n}(\theta^{\star})]}\Pi(d\theta)+\int_{A_{n}}e^{-\omega_{n}\,n[R_{n}(\theta)-R_{n}(\theta^{\star})]}\Pi(d\theta)
=Nnωn/2(An)+Nnωn(An).\displaystyle=N_{n}^{\omega_{n}/2}(A_{n})+N_{n}^{\omega_{n}}(A_{n}).

Then, by Condition 1, Fubini’s theorem, and independence of UnU^{n}, we have

PnNnω^n(An)2e(1/2)KMrnωnεnr.P^{n}N_{n}^{\hat{\omega}_{n}}(A_{n})\leq 2e^{-(1/2)KM^{r}n\omega_{n}\varepsilon_{n}^{r}}.

Similar to the proof of Theorem 1, we can bound Πnω^n(An)\Pi_{n}^{\hat{\omega}_{n}}(A_{n}) using the above exponential bounds on Nnω^n(An)N_{n}^{\hat{\omega}_{n}}(A_{n}) and Dnω^nD_{n}^{\hat{\omega}_{n}}:

Πnω^n(An)\displaystyle\Pi_{n}^{\hat{\omega}_{n}}(A_{n}) 1(W)Nnω^n(An)/Dnω^n+1(Wc)\displaystyle\leq 1(W)N_{n}^{\hat{\omega}_{n}}(A_{n})/D_{n}^{\hat{\omega}_{n}}+1(W^{c})
1(W)bn1Nnω^n(An)+1(W)1(Dnbn)+1(Wc).\displaystyle\leq 1(W)b_{n}^{-1}N_{n}^{\hat{\omega}_{n}}(A_{n})+1(W)1(D_{n}\leq b_{n})+1(W^{c}).

Taking expectation of Πnω^n(An)\Pi_{n}^{\hat{\omega}_{n}}(A_{n}) and applying the numerator and denominator bounds and the fact Pn(W)1P^{n}(W)\rightarrow 1, we have

PnΠnω^n(An)enωnεnr(MrK/2C1)+o(1).P^{n}\Pi_{n}^{\hat{\omega}_{n}}(A_{n})\lesssim e^{-n\omega_{n}\varepsilon_{n}^{r}(M^{r}K/2-C_{1})}+o(1).

The result follows since the right-hand side vanishes for all sufficiently large MM.

A.5 Proof of Theorem 5

The proof is very similar to that of Theorem 1. Start with the decomposition

Πn(An)=Πn(AnΘn)+Πn(AnΘnc),\Pi_{n}(A_{n})=\Pi_{n}(A_{n}\cap\Theta_{n})+\Pi_{n}(A_{n}\cap\Theta_{n}^{\text{\sc c}}),

where An={θ:d(θ;θ)>Mnεn}A_{n}=\{\theta:d(\theta;\theta^{\star})>M_{n}\varepsilon_{n}\} and Θn\Theta_{n} is defined in Condition 2. We consider the first term in the above decomposition. As before, we have

Πn(AnΘn)=Nn(AnΘn)Dn,\Pi_{n}(A_{n}\cap\Theta_{n})=\frac{N_{n}(A_{n}\cap\Theta_{n})}{D_{n}},

where DnD_{n} is as in (56), and

Nn(AnΘn)=AnΘneωnn{Rn(θ)Rn(θ)}Π(dθ).N_{n}(A_{n}\cap\Theta_{n})=\int_{A_{n}\cap\Theta_{n}}e^{-\omega_{n}\,n\{R_{n}(\theta)-R_{n}(\theta^{\star})\}}\,\Pi(d\theta).

Apply Lemma 1, with Gn={θ:m(θ,θ)v(θ,θ)(Knεn)r}G_{n}=\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq(K_{n}\varepsilon_{n})^{r}\}, and write

bn=12Π(Gn)e2ωnnKnrεnrb_{n}=\tfrac{1}{2}\Pi(G_{n})e^{-2\omega_{n}nK_{n}^{r}\varepsilon_{n}^{r}}

for the lower bound on DnD_{n}. This immediately leads to

Πn(AnΘn)\displaystyle\Pi_{n}(A_{n}\cap\Theta_{n}) Nn(AnΘn)Dn 1(Dn>bn)+1(Dnbn)\displaystyle\leq\frac{N_{n}(A_{n}\cap\Theta_{n})}{D_{n}}\,1(D_{n}>b_{n})+1(D_{n}\leq b_{n})
bn1Nn(AnΘn)+1(Dnbn).\displaystyle\leq b_{n}^{-1}N_{n}(A_{n}\cap\Theta_{n})+1(D_{n}\leq b_{n}).

By Fubini’s theorem, independence of the data UnU^{n}, and Condition 2, we get

PnNn(AnΘn)=AnΘn{Peωn(θθ)}nΠ(dθ)<eMnrωnnKnrεnr.P^{n}N_{n}(A_{n}\cap\Theta_{n})=\int_{A_{n}\cap\Theta_{n}}\{Pe^{-\omega_{n}(\ell_{\theta}-\ell_{\theta^{\star}})}\}^{n}\,\Pi(d\theta)<e^{-M_{n}^{r}\omega_{n}nK_{n}^{r}\varepsilon_{n}^{r}}.

Take expectation of Πn(AnΘn)\Pi_{n}(A_{n}\cap\Theta_{n}) and plug in the upper bound above, along with Π(Gn)eCnωnKnrεnr\Pi(G_{n})\gtrsim e^{-Cn\omega_{n}K_{n}^{r}\varepsilon_{n}^{r}} from (21) and Pn(Dnbn)=o(1)P^{n}(D_{n}\leq b_{n})=o(1) from Lemma 1, to get

PnΠn(AnΘn)e(MnrC2)ωnKnrnεnr+o(1).P^{n}\Pi_{n}(A_{n}\cap\Theta_{n})\lesssim e^{-(M_{n}^{r}-C-2)\omega_{n}K_{n}^{r}n\varepsilon_{n}^{r}}+o(1).

Since the right-hand side is vanishing for sufficiently large MnM_{n}, we can conclude that

lim supnPnΠn(An)lim supnPnΠn(AnΘnc).\limsup_{n\to\infty}P^{n}\Pi_{n}(A_{n})\leq\limsup_{n\to\infty}P^{n}\Pi_{n}(A_{n}\cap\Theta_{n}^{\text{\sc c}}).

Of course, since Πn(AnΘnc)Πn(Θnc)\Pi_{n}(A_{n}\cap\Theta_{n}^{\text{\sc c}})\leq\Pi_{n}(\Theta_{n}^{\text{\sc c}}), if (22) holds, then the upper bound in the above display is 0 and we obtain the claimed Gibbs posterior concentration rate result.

A.6 Proof of Theorem 6

The proof is nearly identical to the proof of Theorem 5 appearing above in Section A.5. However, it remains to show that concentration at θn\theta^{\star}_{n} with respect to PθnPθnnP\ell_{\theta}^{n}-P\ell_{\theta_{n}^{\star}}^{n} at rate εn\varepsilon_{n} implies concentration at θ\theta^{\star} with respect to R(θ)R(θ)R(\theta)-R(\theta^{\star}) at rate εnBntn1sε\varepsilon_{n}\vee B_{n}t_{n}^{1-s}.

By assumption P|θ|s<BnP|\ell_{\theta}|^{s}<B_{n} for some s>1s>1 and 0<Bn<0<B_{n}<\infty for all θΘn\theta\in\Theta_{n} so that

Pθ\displaystyle P\ell_{\theta} =Pθ1(θtn)+Pθ1(θ>tn)\displaystyle=P\ell_{\theta}\cdot 1(\ell_{\theta}\leq t_{n})+P\ell_{\theta}\cdot 1(\ell_{\theta}>t_{n})
Pθ1(θtn)+tnBnxs𝑑x\displaystyle\leq P\ell_{\theta}\cdot 1(\ell_{\theta}\leq t_{n})+\int_{t_{n}}^{\infty}\frac{B_{n}}{x^{s}}\,dx
Pθ1(θtn)+Bntn1s,\displaystyle\leq P\ell_{\theta}\cdot 1(\ell_{\theta}\leq t_{n})+B_{n}t_{n}^{1-s},

by Markov’s inequality. By definition of θn\ell_{\theta}^{n}

Pθn=Pθ1(θtn)+tnP(θ>tn),P\ell_{\theta}^{n}=P\ell_{\theta}1(\ell_{\theta}\leq t_{n})+t_{n}\,P(\ell_{\theta}>t_{n}),

and, therefore,

PθPθn\displaystyle P\ell_{\theta}-P\ell_{\theta}^{n} Bntn1stnP(θ>tn)Bntn1s.\displaystyle\leq B_{n}t_{n}^{1-s}-t_{n}\,P(\ell_{\theta}>t_{n})\leq B_{n}t_{n}^{1-s}.

Similarly,

PθnPθ\displaystyle P\ell_{\theta}^{n}-P\ell_{\theta} =P{(tnθ)1(θ>tn)}\displaystyle=P\{(t_{n}-\ell_{\theta})\cdot 1(\ell_{\theta}>t_{n})\}
Bntn1sPθ1(θ>tn)\displaystyle\leq B_{n}t_{n}^{1-s}-P\ell_{\theta}\cdot 1(\ell_{\theta}>t_{n})
Bntn1s,\displaystyle\leq B_{n}t_{n}^{1-s},

so we can conclude

|PθPθn|Bntn1s,θΘn.|P\ell_{\theta}-P\ell_{\theta}^{n}|\leq B_{n}t_{n}^{1-s},\quad\theta\in\Theta_{n}. (57)

Next, note that by definition R(θn)R(θ)=PθnPθ>0R(\theta_{n}^{\star})-R(\theta^{\star})=P\ell_{\theta_{n}^{\star}}-P\ell_{\theta^{\star}}>0. Using (57), replace the risk by the clipped risk at θ\theta^{\star} and at θn\theta^{\star}_{n}, up to error of Bntn1sB_{n}t_{n}^{1-s} each time to see that

0<PθnPθ<PθnnPθn+2Bntn1s,0<P\ell_{\theta_{n}^{\star}}-P\ell_{\theta^{\star}}<P\ell_{\theta_{n}^{\star}}^{n}-P\ell_{\theta^{\star}}^{n}+2B_{n}t_{n}^{1-s},

and, since PθnnPθn<0P\ell_{\theta_{n}^{\star}}^{n}-P\ell_{\theta^{\star}}^{n}<0 by definition, we have

0<PθnPθ<2Bntn1s,0<P\ell_{\theta_{n}^{\star}}-P\ell_{\theta^{\star}}<2B_{n}t_{n}^{1-s},

and, therefore,

|PθnPθ|<2Bntn1s.|P\ell_{\theta_{n}^{\star}}-P\ell_{\theta^{\star}}|<2B_{n}t_{n}^{1-s}. (58)

Now, for some constants C1,C2>0C_{1},\,C_{2}>0 to be determined, define the sets

An\displaystyle A_{n} ={θΘn:PθnPθnn>C1Bntn1s}\displaystyle=\{\theta\in\Theta_{n}:P\ell_{\theta}^{n}-P\ell_{\theta_{n}^{\star}}^{n}>C_{1}B_{n}t_{n}^{1-s}\}
An\displaystyle A_{n}^{\prime} ={θΘn:PθPθ>C2Bntn1s}.\displaystyle=\{\theta\in\Theta_{n}:P\ell_{\theta}-P\ell_{\theta^{\star}}>C_{2}B_{n}t_{n}^{1-s}\}.

Let ϑAn\vartheta\in A_{n}^{\prime}. By (57), swap PϑP\ell_{\vartheta} for PϑnP\ell_{\vartheta}^{n} up to an error of Bntn1sB_{n}t_{n}^{1-s}:

ϑAnPϑnPθ>(C21)Bntn1s.\vartheta\in A_{n}^{\prime}\implies P\ell_{\vartheta}^{n}-P\ell_{\theta^{\star}}>(C_{2}-1)B_{n}t_{n}^{1-s}.

Using (58), swap PθP\ell_{\theta^{\star}} for PθnP\ell_{\theta^{\star}_{n}} up to an error of Bntn1sB_{n}t_{n}^{1-s}:

ϑAnPϑnPθn>(C22)Bntn1s.\vartheta\in A_{n}^{\prime}\implies P\ell_{\vartheta}^{n}-P\ell_{\theta^{\star}_{n}}>(C_{2}-2)B_{n}t_{n}^{1-s}.

Finally, use (57) again and swap PθnP\ell_{\theta^{\star}_{n}} for PθnnP\ell_{\theta^{\star}_{n}}^{n} up to an error of Bntn1sB_{n}t_{n}^{1-s}:

ϑAnPϑnPθnn>(C23)Bntn1s.\vartheta\in A_{n}^{\prime}\implies P\ell_{\vartheta}^{n}-P\ell_{\theta_{n}^{\star}}^{n}>(C_{2}-3)B_{n}t_{n}^{1-s}.

Conclude that if we choose C1,C2C_{1},\,C_{2} such that 0<C1<C230<C_{1}<C_{2}-3, then we get AnAnA_{n}^{\prime}\subset A_{n}. Consequently, if the Gibbs posterior vanishes on AnA_{n} in PnP^{n}-expectation, then it also vanishes on AnA_{n}^{\prime}.

Appendix B A strategy for checking (22)

Recall, a condition for posterior concentration under Theorems 5–6 is

PnΠn(Θnc)0 as n.P^{n}\Pi_{n}(\Theta_{n}^{\text{\sc c}})\rightarrow 0\quad\text{as }n\rightarrow\infty.

Below we describe a strategy for checking this, based on convexity of θ\ell_{\theta}.

Lemma 2.

The Gibbs posterior distribution satisfies (22) if the following conditions hold:

  1. θθ(u)\theta\mapsto\ell_{\theta}(u) is convex,

  2. inf{θ:d(θ,θ)>δ}R(θ)R(θ)>0\inf_{\{\theta:d(\theta,\theta^{\star})>\delta\}}R(\theta)-R(\theta^{\star})>0 for all δ>0\delta>0, and

  3. the prior distribution satisfies (14).

Proof.

Let A:={θ:d(θ,θ)>ε}A:=\{\theta:d(\theta,\theta^{\star})>\varepsilon\} for any fixed ε>0\varepsilon>0 and write the Gibbs posterior probability of AA as

Πn(A)=Nn(A)Dn:=Aen[Rn(θ)Rn(θ)]Π(dθ)en[Rn(θ)Rn(θ)]Π(dθ).\Pi_{n}(A)=\frac{N_{n}(A)}{D_{n}}:=\frac{\int_{A}e^{-n[R_{n}(\theta)-R_{n}(\theta^{\star})]}\Pi(d\theta)}{\int e^{-n[R_{n}(\theta)-R_{n}(\theta^{\star})]}\Pi(d\theta)}.

Conditions 1 and 2 imply the minimizer θ^n\hat{\theta}_{n} of Rn(θ)R_{n}(\theta) converges to θ\theta^{\star} in PnP^{n}-probability; see Hjort and Pollard, (1993), Lemmas 1–2. Therefore, it suffices to work on the event {d(θ^n,θ)ε/2}\{d(\hat{\theta}_{n},\theta^{\star})\leq\varepsilon/2\}, whose PnP^{n}-probability converges to 11. By convexity, and the fact that θ^nA\hat{\theta}_{n}\notin A on this event,

Rn(θ)Rn(θ)infu{Rn(θ+(ε/2)u)Rn(θ)},R_{n}(\theta)-R_{n}(\theta^{\star})\geq\inf_{u}\{R_{n}(\theta^{\star}+(\varepsilon/2)u)-R_{n}(\theta^{\star})\},

where the infimum is over all unit vectors. The infimum on the right-hand side of the above display converges in PnP^{n}-probability to a positive limit, by Lemma 1 in Hjort and Pollard, (1993); hence, there exists ψ>0\psi>0 such that

infθA{Rn(θ)Rn(θ)}ψ\inf_{\theta\in A}\{R_{n}(\theta)-R_{n}(\theta^{\star})\}\geq\psi

with PnP^{n}-probability converging to 11. Uniform convergence of the empirical risk functions implies

Nn(A)enψΠ(A)N_{n}(A)\leq e^{-n\psi}\Pi(A)

with PnP^{n}-probability converging to 11 as nn\to\infty. Combining this upper bound on Nn(A)N_{n}(A) with the lower bound on DnD_{n} provided by Lemma 1, we have

Πn(A)en(ψC1εnr)0\Pi_{n}(A)\leq e^{-n(\psi-C_{1}\varepsilon_{n}^{r})}\rightarrow 0

where the bound vanishes because ψ>C1εnr\psi>C_{1}\varepsilon_{n}^{r} for all large enough nn. By the bounded convergence theorem, PnΠn(A)0P^{n}\Pi_{n}(A)\rightarrow 0, as claimed. ∎

Appendix C Proofs of posterior concentration results for examples

C.1 Proof of Proposition 1

The proof proceeds by checking the conditions of the extended version of Theorem 2, namely, the version based on Condition 2. First, we confirm that R(θ)R(\theta) is minimized at θ\theta^{\star}. Write

R(θ)=𝕏[\displaystyle R(\theta)=\int_{\mathbb{X}}\Bigl{[} (τ1)θf(x){yθf(x)}px(y)𝑑y\displaystyle(\tau-1)\int_{-\infty}^{\theta^{\top}f(x)}\{y-\theta^{\top}f(x)\}\,p_{x}(y)\,dy
+τθf(x){yθf(x)}px(y)dy]P(dx).\displaystyle+\tau\int_{\theta^{\top}f(x)}^{\infty}\{y-\theta^{\top}f(x)\}\,p_{x}(y)\,dy\Bigr{]}\,P(dx).

Assumption 1(1–2) implies R(θ)R(\theta) can be differentiated twice under the integral:

R˙(θ)\displaystyle\dot{R}(\theta) =f(x){Px(θf(x))τ}P(dx)\displaystyle=\int f(x)\{P_{x}(\theta^{\top}f(x))-\tau\}\,P(dx)
R¨(θ)\displaystyle\ddot{R}(\theta) =f(x)f(x)px(θf(x))P(dx),\displaystyle=\int f(x)f(x)^{\top}\,p_{x}(\theta^{\top}f(x))\,P(dx),

where PxP_{x} denotes the distribution function corresponding to the density pxp_{x}. By definition, Px(θf(x))=τP_{x}(\theta^{\star\top}f(x))=\tau, so it follows that R˙(θ)=0\dot{R}(\theta^{\star})=0. Moreover, the following Taylor approximation holds in the neighborhood {θ:θθ<δ}\{\theta:\|\theta-\theta^{\star}\|<\delta\}:

R(θ)R(θ)=12(θθ)R¨(θ)(θθ)+o(θθ2),R(\theta)-R(\theta^{\star})=\tfrac{1}{2}(\theta-\theta^{\star})^{\top}\ddot{R}(\theta^{\star})(\theta-\theta^{\star})+o(\|\theta-\theta^{\star}\|^{2}),

where Assumption 1(2) implies R¨(θ)\ddot{R}(\theta^{\star}) is positive definite. Then, R(θ)R(\theta) is convex and minimized at θ\theta^{\star}.

Next, note θ(u)\ell_{\theta}(u) satisfies a Lipschitz property:

θθLθθ,\|\ell_{\theta}-\ell_{\theta^{\prime}}\|\leq L\|\theta-\theta^{\prime}\|,

where L=max{τ,1τ}f(x)L=\max\{\tau,1-\tau\}\|f(x)\|. This follows from considering the cases y<θf(x)<θf(x)y<\theta^{\top}f(x)<\theta^{\prime\top}f(x), θf(x)<y<θf(x)\theta^{\top}f(x)<y<\theta^{\prime\top}f(x), and θf(x)<θf(x)<y\theta^{\top}f(x)<\theta^{\prime\top}f(x)<y, together with the Cauchy–Schwarz inequality. By Assumption 1(1), LL is uniformly bounded in xx. Then, using the Taylor approximation for R(θ)R(\theta) above and following the strategy laid out in Section 4.1, we have

Peω(θθ)eωθθ2(aωL2/2)Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\leq e^{-\omega\|\theta-\theta^{\star}\|^{2}(a-\omega L^{2}/2)}

where a>0a>0 is a constant such that 2a2a is no larger than the smallest eigenvalue of R¨(θ)\ddot{R}(\theta^{\star}). Therefore, Condition 2 holds for all sufficiently small learning rates, i.e., ω<2aL2\omega<2aL^{-2}.
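For completeness, the first of the three cases behind the Lipschitz bound above can be written out explicitly; the other two cases are handled in the same way. If y<\theta^{\top}f(x)<\theta^{\prime\top}f(x), then \ell_{\theta}(u)=(1-\tau)\{\theta^{\top}f(x)-y\} and \ell_{\theta^{\prime}}(u)=(1-\tau)\{\theta^{\prime\top}f(x)-y\}, so

|\ell_{\theta}(u)-\ell_{\theta^{\prime}}(u)|=(1-\tau)\,|(\theta-\theta^{\prime})^{\top}f(x)|\leq\max\{\tau,1-\tau\}\,\|f(x)\|\,\|\theta-\theta^{\prime}\|,

where the last step is by the Cauchy–Schwarz inequality.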

Assumption 1(3) says the prior density is bounded away from zero in a neighborhood of θ\theta^{\star}. By the above computations,

{θ:m(θ,θ)v(θ,θ)δ}{θ:θθ<Cδ}\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq\delta\}\supset\{\theta:\|\theta-\theta^{\star}\|<C\delta\}

for all small enough δ>0\delta>0 and some C>0C>0. Therefore,

Π({θ:m(θ,θ)v(θ,θ)δ})Π({θ:θθ<Cδ})δJ,\Pi(\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq\delta\})\geq\Pi(\{\theta:\|\theta-\theta^{\star}\|<C\delta\})\gtrsim\delta^{J},

verifying the prior condition in (15).

Since θθ(u)\theta\mapsto\ell_{\theta}(u) is convex and the Taylor approximation for R(θ)R(\theta) implies that

θθ>δm(θ,θ)δ2,\|\theta-\theta^{\star}\|>\delta\implies m(\theta,\theta^{\star})\gtrsim\delta^{2},

the conditions of Lemma 2 hold.

C.2 Proof of Proposition 2

For λ(0,1)\lambda\in(0,1) as in Assumption 2(1), define

ωn=m+n2mn(τ10λ+τ011λ)1,\omega_{n}=\frac{m+n}{2mn}\Bigl{(}\frac{\tau_{10}}{\lambda}+\frac{\tau_{01}}{1-\lambda}\Bigr{)}^{-1},

where τ10\tau_{10} and τ01\tau_{01} are not both 0, so that

(m+n)ωn{2(λτ01+(1λ)τ10)}1.(m+n)\omega_{n}\to\{2(\lambda\tau_{01}+(1-\lambda)\tau_{10})\}^{-1}.

For any deterministic sequence ana_{n}, with ana_{n}\to\infty, the learning rate anωna_{n}\omega_{n} vanishes strictly more slowly than min(m,n)1\min(m,n)^{-1}, and, therefore, according to Theorem 1 in Wang and Martin, (2020), the Gibbs posterior with learning rate anωna_{n}\omega_{n} concentrates at rate n1/2n^{-1/2} in the sense of Definition 4 in the main article. By the law of large numbers, τ^01τ01\hat{\tau}_{01}\rightarrow\tau_{01} and τ^10τ10\hat{\tau}_{10}\rightarrow\tau_{10} in PnP^{n}-probability, so

(m+n)ω^n{2(λτ01+(1λ)τ10)}1in Pn-probability.(m+n)\hat{\omega}_{n}\rightarrow\{2(\lambda\tau_{01}+(1-\lambda)\tau_{10})\}^{-1}\quad\text{in $P^{n}$-probability}.

Therefore, for any α(1/2,1)\alpha\in(1/2,1), we have

P(12anωn<αanω^n<anωn)1,P(\tfrac{1}{2}a_{n}\omega_{n}<\alpha a_{n}\hat{\omega}_{n}<a_{n}\omega_{n})\to 1,

and it follows from Theorem 4 that the Gibbs posterior with learning rate αanω^n\alpha a_{n}\hat{\omega}_{n} also concentrates at rate n1/2n^{-1/2}. Finally, since ana_{n} is arbitrary, the constant α\alpha is unimportant and may be implicitly absorbed into ana_{n}. Therefore, the conclusion of Proposition 2, omitting α\alpha, also holds.

C.3 Proof of Proposition 3

The proof proceeds by applying Lemma 2 and then checking the conditions of Theorem 2. The squared-error loss is convex in θ\theta and the corresponding excess risk R(θ)R(θ)R(\theta)-R(\theta^{\star}) equals PX(θθ)XX(θθ)θθ22P_{X}(\theta-\theta^{\star})^{\top}XX^{\top}(\theta-\theta^{\star})\gtrsim\|\theta-\theta^{\star}\|_{2}^{2} by Assumption 1a.2, which implies condition 2 in Lemma 2. Further, the condition on the prior in Assumption 1a.3 is sufficient for verifying condition 3 in Lemma 2. Then, Lemma 2 implies Πn({θ:θθ2>δ})\Pi_{n}(\{\theta:\|\theta-\theta^{\star}\|_{2}>\delta\}) vanishes in PnP^{n}-expectation for any δ>0\delta>0.

Next, verify the conditions of Theorem 2. The excess loss can be written θ(u)θ(u)=(θθ)x{2y(θ+θ)x}\ell_{\theta}(u)-\ell_{\theta^{\star}}(u)=(\theta^{\star}-\theta)^{\top}x\,\{2y-(\theta+\theta^{\star})^{\top}x\}. To verify Condition 1, we start by computing the conditional expectation of eω(θθ)e^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}, given X=xX=x, using the fact that PY|xY=xθP_{Y|x}Y=x^{\top}\theta^{\star}. Lemma 2 implies we may restrict our attention to bounded θ\theta so that Assumption 1a.2 implies |θX|<B|\theta^{\top}X|<B. Therefore, we take ω<(4Bb)1\omega<(4Bb)^{-1} where bb is given in Assumption 1a.1. Specifically, YY given xx is sub-exponential with parameters (σ2,b)(\sigma^{2},\,b), so that PetY<exp{txθ+t2σ2/2}Pe^{tY}<\exp\{tx^{\top}\theta^{\star}+t^{2}\sigma^{2}/2\} for all t<(2b)1t<(2b)^{-1}. In the excess loss, add and subtract 2ω(θθ)x2\omega(\theta^{\star}-\theta)^{\top}x times xθx^{\top}\theta^{\star} and apply Assumption 1a.1 to get the following bounds:

Peω(θθ)\displaystyle Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})} =Peω(θθ)X{2Y(θ+θ)X}\displaystyle=Pe^{-\omega(\theta^{\star}-\theta)^{\top}X\{2Y-(\theta+\theta^{\star})^{\top}X\}}
PXPY|Xeω(θθ)X(θ+θ)Xe2ω(θ)X(θθ)Xe2ω(θθ)X(YXθ)\displaystyle\leq P_{X}P_{Y|X}e^{\omega(\theta^{\star}-\theta)^{\top}X(\theta+\theta^{\star})^{\top}X}e^{-2\omega(\theta^{\star})^{\top}X(\theta^{\star}-\theta)^{\top}X}e^{-2\omega(\theta^{\star}-\theta)^{\top}X(Y-X^{\top}\theta^{\star})}
PXeω(θθ)X(θ+θ)Xe2ω(θ)X(θθ)Xe2ω2σ2[(θθ)X]2\displaystyle\leq P_{X}e^{\omega(\theta^{\star}-\theta)^{\top}X(\theta+\theta^{\star})^{\top}X}e^{-2\omega(\theta^{\star})^{\top}X(\theta^{\star}-\theta)^{\top}X}e^{2\omega^{2}\sigma^{2}[(\theta^{\star}-\theta)^{\top}X]^{2}}
PXeω(12ωσ2){(θθ)X}2.\displaystyle\leq P_{X}e^{-\omega(1-2\omega\sigma^{2})\{(\theta^{\star}-\theta)^{\top}X\}^{2}}.

Apply (16) in the paper, which is Lemma 7.26 in Lafferty et al., (2010), using the facts from Assumption 1a.2 that (θθ)(PXXT)(θθ)cθθ22(\theta^{\star}-\theta)^{\top}(PXX^{T})(\theta^{\star}-\theta)\geq c\|\theta-\theta^{\star}\|_{2}^{2} and from consistency that |θX|<B|\theta^{\top}X|<B so that {(θθ)X}24B2\{(\theta^{\star}-\theta)^{\top}X\}^{2}\leq 4B^{2}. Then, (16) implies

Peω(θθ)\displaystyle Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})} PXeω(12ωσ2){(θθ)X}2\displaystyle\leq P_{X}e^{-\omega(1-2\omega\sigma^{2})\{(\theta^{\star}-\theta)^{\top}X\}^{2}}
eω(12ωσ2)PX{(θθ)X}2eHω2(12ωσ2)2PX{(θθ)X}4\displaystyle\leq e^{-\omega(1-2\omega\sigma^{2})P_{X}\{(\theta^{\star}-\theta)^{\top}X\}^{2}}e^{H\omega^{2}(1-2\omega\sigma^{2})^{2}P_{X}\{(\theta^{\star}-\theta)^{\top}X\}^{4}}
eω(12ωσ2)θθ22{c4B2CHω(12ωσ2)}\displaystyle\leq e^{-\omega(1-2\omega\sigma^{2})\|\theta-\theta^{\star}\|_{2}^{2}\{c-4B^{2}CH\omega(1-2\omega\sigma^{2})\}}
ec2ωθθ22\displaystyle\leq e^{-c_{2}\omega\|\theta^{\star}-\theta\|^{2}_{2}}

where

H={ω2(12ωσ2)216B4}1{eω(12ωσ2)4B21ω(12ωσ2)4B2},H=\{\omega^{2}(1-2\omega\sigma^{2})^{2}16B^{4}\}^{-1}\bigl\{e^{\omega(1-2\omega\sigma^{2})4B^{2}}-1-\omega(1-2\omega\sigma^{2})4B^{2}\bigr\},

and where the last line holds for some c2>0c_{2}>0 and for all sufficiently small ω>0\omega>0. In particular, ω\omega must satisfy c>4B2CHω(12ωσ2)c>4B^{2}CH\omega(1-2\omega\sigma^{2}).

To verify the prior condition in (13) of the paper, we need upper bounds on m(θ,θ)m(\theta,\theta^{\star}) and v(θ,θ)v(\theta,\theta^{\star}). From above, we have m(θ,θ)=(θθ)(PXXT)(θθ)m(\theta,\theta^{\star})=(\theta^{\star}-\theta)^{\top}(PXX^{T})(\theta^{\star}-\theta) and by Assumption 1a.2 it follows m(θ,θ)Cθθ22m(\theta,\theta^{\star})\leq C\|\theta-\theta^{\star}\|_{2}^{2} for some C>0C>0. To bound v(θ,θ)v(\theta,\theta^{\star}), use the total variance formula “V(X)=E{V(XY)}+V{E(XY)}V(X)=E\{V(X\mid Y)\}+V\{E(X\mid Y)\},” noting V(YX)σ2V(Y\mid X)\leq\sigma^{2} by Assumption 1a.1. Then,

v(θ,θ)\displaystyle v(\theta,\theta^{\star}) σ2(θθ)(PXXT)(θθ)+V{(θθ)XXT(θθ)}\displaystyle\leq\sigma^{2}(\theta^{\star}-\theta)^{\top}(PXX^{T})(\theta^{\star}-\theta)+V\{(\theta^{\star}-\theta)^{\top}XX^{T}(\theta^{\star}-\theta)\}
Cσ2θθ22+P{(θθ)XXT(θθ)}2\displaystyle\leq C\sigma^{2}\|\theta-\theta^{\star}\|_{2}^{2}+P\{(\theta^{\star}-\theta)^{\top}XX^{T}(\theta^{\star}-\theta)\}^{2}
Cσ2θθ22+4B2P{(θθ)XXT(θθ)}\displaystyle\leq C\sigma^{2}\|\theta-\theta^{\star}\|_{2}^{2}+4B^{2}P\{(\theta^{\star}-\theta)^{\top}XX^{T}(\theta^{\star}-\theta)\}
(Cσ2+4CB2)θθ22.\displaystyle\leq(C\sigma^{2}+4CB^{2})\|\theta-\theta^{\star}\|_{2}^{2}.

Then, by Cauchy–Schwarz,

{θ:m(θ,θ)v(θ,θ)δ}{θ:θθC11/2δ1/2}\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq\delta\}\supset\{\theta:\|\theta-\theta^{\star}\|\leq C_{1}^{-1/2}\delta^{1/2}\}

where C1=max{C,Cσ2+4CB2}C_{1}=\max\{C,C\sigma^{2}+4CB^{2}\}. Assumption 1a.3 says the prior density is bounded away from zero in a neighborhood of θ\theta^{\star}, so that

Π({θ:m(θ,θ)v(θ,θ)δ})Π({θ:θθC11/2δ1/2})δJ/2,\Pi(\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq\delta\})\geq\Pi(\{\theta:\|\theta-\theta^{\star}\|\leq C_{1}^{-1/2}\delta^{1/2}\})\gtrsim\delta^{J/2},

which verifies (13).

C.4 Proof of Proposition 4

The proof proceeds by checking the conditions of Theorem 6.

First, note that θΘn\theta\in\Theta_{n} and Assumption 1b together imply P|θ|s/2<CΔnsP|\ell_{\theta}|^{s/2}<C\Delta_{n}^{s} for some C>0C>0 and s>2s>2; we define Bn:=CΔnsB_{n}:=C\Delta_{n}^{s}. This fulfills the condition in Theorem 6 that

P|θ|s/2<Bn<,θΘn,s>2.P|\ell_{\theta}|^{s/2}<B_{n}<\infty,\quad\forall\theta\in\Theta_{n},\quad\exists s>2. (59)

Second, we verify Condition 3. We apply the strategy sketched in Section 4.2 of the paper: verify the Bernstein condition with α=1\alpha=1 for the clipped loss function, and then apply the inequality in (16) of the paper. For the Bernstein condition we want to show

θΘn,mn(θ,θn)>Bntn1s/2vn(θ,θn)G(Δn)mn(θ,θn)\theta\in\Theta_{n},\,m_{n}(\theta,\theta_{n}^{\star})>B_{n}t_{n}^{1-s/2}\Rightarrow v_{n}(\theta,\theta^{\star}_{n})\leq G(\Delta_{n})m_{n}(\theta,\theta^{\star}_{n}) (60)

for some function G()G(\cdot) to be specified. By construction, the second moment of excess clipped loss satisfies P[{θnθnn}2]P[{θθn}2]P[\{\ell_{\theta}^{n}-\ell_{\theta_{n}^{\star}}^{n}\}^{2}]\leq P[\{\ell_{\theta}-\ell_{\theta_{n}^{\star}}\}^{2}], which implies

vn(θ,θn)\displaystyle v_{n}(\theta,\theta_{n}^{\star}) v(θ,θn)+m(θ,θn)2mn(θ,θn)2\displaystyle\leq v(\theta,\theta_{n}^{\star})+m(\theta,\theta_{n}^{\star})^{2}-m_{n}(\theta,\theta_{n}^{\star})^{2}
v(θ,θn)+m(θ,θn)2.\displaystyle\leq v(\theta,\theta_{n}^{\star})+m(\theta,\theta_{n}^{\star})^{2}.

So, we want to upper bound v(θ,θn)v(\theta,\theta_{n}^{\star}) and m(θ,θn)2m(\theta,\theta_{n}^{\star})^{2}, to get a bound on vn(θ,θn)v_{n}(\theta,\theta_{n}^{\star}), and then relate this bound to mn(θ,θn)m_{n}(\theta,\theta_{n}^{\star}) to determine G()G(\cdot).

Similarly to the proof of Proposition 3 we use the total variance formula to upper bound v(θ,θn)v(\theta,\theta_{n}^{\star}). For the “E(V(X|Y))E(V(X|Y))” part of the formula, we have

E(V(θθn|X=x))\displaystyle E(V(\ell_{\theta}-\ell_{\theta^{\star}_{n}}|X=x)) =P{4σX2(θnθ)XX(θnθ)}\displaystyle=P\{4\sigma^{2}_{X}(\theta^{\star}_{n}-\theta)^{\top}XX^{\top}(\theta^{\star}_{n}-\theta)\}
θnθ22Pσx2\displaystyle\lesssim\|\theta^{\star}_{n}-\theta\|_{2}^{2}P\sigma_{x}^{2}
θnθ22\displaystyle\lesssim\|\theta^{\star}_{n}-\theta\|_{2}^{2}
θθ22+θnθ22\displaystyle\leq\|\theta-\theta^{\star}\|_{2}^{2}+\|\theta^{\star}_{n}-\theta^{\star}\|_{2}^{2}

where the second line follows from the fact that X\|X\|_{\infty} is bounded almost surely according to Assumption 1a and where σx2\sigma_{x}^{2} denotes the conditional variance of YY, given X=xX=x, which has finite expectation according to Assumption 1b.

Next, for the “V(E(X|Y))V(E(X|Y))” part we have

V(E(θθn|X))\displaystyle V(E(\ell_{\theta}-\ell_{\theta^{\star}_{n}}|X)) =V{2θXX(θnθ)+(θθn)XX(θ+θn)}\displaystyle=V\{2\theta^{\star^{\top}}XX^{\top}(\theta^{\star}_{n}-\theta)+(\theta-\theta^{\star}_{n})^{\top}XX^{\top}(\theta+\theta_{n}^{\star})\}
E{2θXX(θnθ)}2\displaystyle\leq E\{2\theta^{\star^{\top}}XX^{\top}(\theta_{n}^{\star}-\theta)\}^{2}
+2E{[2θXX(θnθ)][(θnθ)XX(θn+θ)]}\displaystyle+2E\{[2\theta^{\star^{\top}}XX^{\top}(\theta^{\star}_{n}-\theta)][(\theta_{n}^{\star}-\theta)^{\top}XX^{\top}(\theta_{n}^{\star}+\theta)]\}
+E{(θnθ)XX(θn+θ)}2.\displaystyle+E\{(\theta_{n}^{\star}-\theta)^{\top}XX^{\top}(\theta_{n}^{\star}+\theta)\}^{2}.
Δn2θθn22\displaystyle\lesssim\Delta_{n}^{2}\|\theta-\theta_{n}^{\star}\|_{2}^{2}
Δn2[θθ22+θnθ22],\displaystyle\leq\Delta_{n}^{2}\left[\|\theta-\theta^{\star}\|_{2}^{2}+\|\theta_{n}^{\star}-\theta^{\star}\|_{2}^{2}\right],

again, using the facts X\|X\|_{\infty} is bounded almost surely and that θ2<Δn\|\theta\|_{2}<\Delta_{n} for θΘn\theta\in\Theta_{n}.

Similarly,

m(θ,θn)2\displaystyle m(\theta,\theta^{\star}_{n})^{2} =[P{2θXX(θnθ)+(θθn)XX(θ+θn)}]2\displaystyle=[P\{2\theta^{\star^{\top}}XX^{\top}(\theta^{\star}_{n}-\theta)+(\theta-\theta^{\star}_{n})^{\top}XX^{\top}(\theta+\theta_{n}^{\star})\}]^{2}
{θθn2+Δnθθn2}2\displaystyle\lesssim\{\|\theta-\theta_{n}^{\star}\|_{2}+\Delta_{n}\|\theta-\theta_{n}^{\star}\|_{2}\}^{2}
Δn2{θθ22+θnθ22}.\displaystyle\leq\Delta_{n}^{2}\{\|\theta-\theta^{\star}\|_{2}^{2}+\|\theta_{n}^{\star}-\theta^{\star}\|_{2}^{2}\}.

So far, we have shown

vn(θ,θn)Δn2[θθ22+θnθ22].v_{n}(\theta,\theta^{\star}_{n})\lesssim\Delta_{n}^{2}\left[\|\theta-\theta^{\star}\|_{2}^{2}+\|\theta_{n}^{\star}-\theta^{\star}\|_{2}^{2}\right].

And, since

θθ22m(θ,θ)θθ22\|\theta-\theta^{\star}\|_{2}^{2}\lesssim m(\theta,\theta^{\star})\lesssim\|\theta-\theta^{\star}\|_{2}^{2}

this implies

vn(θ,θn)Δn2[m(θ,θ)+m(θn,θ)].v_{n}(\theta,\theta^{\star}_{n})\lesssim\Delta_{n}^{2}\left[m(\theta,\theta^{\star})+m(\theta_{n}^{\star},\theta^{\star})\right].

Finally, we note that the proof of Theorem 6 above shows that (59) implies

|m(θ,θ)mn(θ,θn)|2Bntn1s/2andm(θn,θ)<2Bntn1s/2.|m(\theta,\theta^{\star})-m_{n}(\theta,\theta^{\star}_{n})|\leq 2B_{n}t_{n}^{1-s/2}\quad\text{and}\quad m(\theta_{n}^{\star},\theta^{\star})<2B_{n}t_{n}^{1-s/2}.

Conclude

vn(θ,θn)Δn2[mn(θ,θn)+4Bntn1s/2].v_{n}(\theta,\theta^{\star}_{n})\lesssim\Delta_{n}^{2}\left[m_{n}(\theta,\theta^{\star}_{n})+4B_{n}t_{n}^{1-s/2}\right].

Therefore, (60) is verified if we take G(Δn)=CΔn2G(\Delta_{n})=C\Delta_{n}^{2} for some C>0C>0.

Next, apply the inequality in (16). Note that Ln:=supu,θΘn|θn(u)θn(u)|Δntn1/2L_{n}:=\sup_{u,\theta\in\Theta_{n}}|\ell_{\theta}^{n}(u)-\ell_{\theta_{n}^{\star}}(u)|\lesssim\Delta_{n}t_{n}^{1/2} and recall that we choose ωn=(Δntn1/2)1\omega_{n}=(\Delta_{n}t_{n}^{1/2})^{-1}. Then, (16) and (60) imply

mn(θ,θn)>Bntn1s/2\displaystyle m_{n}(\theta,\theta_{n}^{\star})>B_{n}t_{n}^{1-s/2}\implies Peωn(θnθnn)eωnmn(θ,θn)+ωn2vn(θ,θn){eLnωn1Lnωn(Lnωn)2}\displaystyle Pe^{-\omega_{n}(\ell_{\theta}^{n}-\ell_{\theta_{n}^{\star}}^{n})}\leq e^{-\omega_{n}m_{n}(\theta,\theta_{n}^{\star})+\omega_{n}^{2}v_{n}(\theta,\theta_{n}^{\star})\left\{\frac{e^{L_{n}\omega_{n}}-1-L_{n}\omega_{n}}{(L_{n}\omega_{n})^{2}}\right\}}
eωnmn(θ,θn)+cnωn2vn(θ,θn)\displaystyle\leq e^{-\omega_{n}m_{n}(\theta,\theta_{n}^{\star})+c_{n}\omega_{n}^{2}v_{n}(\theta,\theta_{n}^{\star})}
eωnmn(θ,θn)+cnωn2Δn2[mn(θ,θn)+4Bntn1s/2]\displaystyle\leq e^{-\omega_{n}m_{n}(\theta,\theta_{n}^{\star})+c_{n}^{\prime}\omega_{n}^{2}\Delta_{n}^{2}[m_{n}(\theta,\theta_{n}^{\star})+4B_{n}t_{n}^{1-s/2}]}
eωnmn(θ,θn)[1cnωnΔn24cnωnΔn2Bntn1s/2mn(θ,θn)1]\displaystyle\leq e^{-\omega_{n}m_{n}(\theta,\theta_{n}^{\star})[1-c_{n}^{\prime}\omega_{n}\Delta_{n}^{2}-4c_{n}^{\prime}\omega_{n}\Delta_{n}^{2}B_{n}t_{n}^{1-s/2}m_{n}(\theta,\theta^{\star}_{n})^{-1}]}
eKnωnεn2for mn(θ,θn)>εn2:=Bntn1s/2,\displaystyle\leq e^{-K_{n}\omega_{n}\varepsilon_{n}^{2}}\quad\text{for }m_{n}(\theta,\theta^{\star}_{n})>\varepsilon_{n}^{2}:=B_{n}t_{n}^{1-s/2},

where cn,cn=O(1)c_{n},c_{n}^{\prime}=O(1) by the choice of ωn\omega_{n}, and where Kn=15cnωnΔn2>1/2K_{n}=1-5c_{n}^{\prime}\omega_{n}\Delta_{n}^{2}>1/2 for all large enough nn. This verifies Condition 3.

Finally, we verify the prior bound in (26). By Assumption 3(3) the prior is bounded away from zero in a neighborhood of θ\theta^{\star}, which implies, by the above bound on θnθ2\|\theta_{n}^{\star}-\theta^{\star}\|_{2}, that the prior is also bounded away from zero in a neighborhood of θn\theta_{n}^{\star} for all sufficiently large nn. Using the above bounds on the mnm_{n} and vnv_{n} functions

Π({θ:mn(θ,θn)vn(θ,θn)<Knεn2})\displaystyle\Pi(\{\theta:m_{n}(\theta,\theta_{n}^{\star})\vee v_{n}(\theta,\theta_{n}^{\star})<K_{n}\varepsilon_{n}^{2}\}) Π({θ:θθn22<Δn2Bntn1s/2})\displaystyle\geq\Pi(\{\theta:\|\theta-\theta^{\star}_{n}\|_{2}^{2}<\Delta_{n}^{-2}B_{n}t_{n}^{1-s/2}\})
(Δn2Bntn1s/2)J\displaystyle\gtrsim(\Delta_{n}^{-2}B_{n}t_{n}^{1-s/2})^{J}
eClogn for some C>0,\displaystyle\gtrsim e^{-C\log n}\quad\text{for some }C>0,
>eKnnωnBntn1s/2=eKnΔns1\displaystyle>e^{-K_{n}n\omega_{n}B_{n}t_{n}^{1-s/2}}=e^{-K_{n}\Delta_{n}^{s-1}}

where the last inequality holds because Δn=logn\Delta_{n}=\log n and s>2s>2. This verifies the prior bound required by Theorem 6 in (26).

C.5 Proof of Proposition 5

Proposition 5 follows from a slight refinement of Theorem 3. We verify Condition 1 and a prior bound essentially equivalent to (14). We also use an argument similar to the proof of Theorem 2 to improve the bound on expectation of the numerator of the Gibbs posterior distribution derived in the proof of Theorem 3.

Towards verifying Condition 1, we first need to define the loss function being used. Even though the xix_{i} values are technically not “data” in this inid setting, it is convenient to express the loss function in terms of the (x,y)(x,y) pairs. Moreover, while the quantity of interest is the mean function θ\theta, since we have introduced the parametric representation θ=θβ\theta=\theta_{\beta} and the focus shifts to the basis coefficients in the β\beta vector, it makes sense to express the loss function in terms of β\beta instead of θ\theta. That is, the squared error loss is

β(x,y)={yθβ(x)}2.\ell_{\beta}(x,y)=\{y-\theta_{\beta}(x)\}^{2}.

For β=βn\beta^{\dagger}=\beta_{n}^{\dagger} as defined in Section 5.4, the loss difference equals

β(x,y)β(x,y)={θβ(x)θβ(x)}2+2{θβ(x)θβ(x)}{yθβ(x)}.\ell_{\beta}(x,y)-\ell_{\beta^{\dagger}}(x,y)=\{\theta_{\beta}(x)-\theta_{\beta^{\dagger}}(x)\}^{2}+2\{\theta_{\beta^{\dagger}}(x)-\theta_{\beta}(x)\}\{y-\theta_{\beta^{\dagger}}(x)\}. (61)

Since the responses are independent, the expectation in Condition 1 can be expressed as the product

Pnenω{rn(β)rn(β)}\displaystyle P^{n}e^{-n\omega\{r_{n}(\beta)-r_{n}(\beta^{\dagger})\}} =i=1neω{θβ(xi)θβ(xi)}2Pie2ω{θβ(xi)θβ(xi)}{Yiθβ(xi)}\displaystyle=\prod_{i=1}^{n}e^{-\omega\{\theta_{\beta}(x_{i})-\theta_{\beta^{\dagger}}(x_{i})\}^{2}}P_{i}e^{-2\omega\{\theta_{\beta^{\dagger}}(x_{i})-\theta_{\beta}(x_{i})\}\{Y_{i}-\theta_{\beta^{\dagger}}(x_{i})\}}
=enωθβθβn,22i=1nPie2ω{θβ(xi)θβ(xi)}{Yiθβ(xi)}.\displaystyle=e^{-n\omega\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}^{2}}\prod_{i=1}^{n}P_{i}e^{-2\omega\{\theta_{\beta^{\dagger}}(x_{i})-\theta_{\beta}(x_{i})\}\{Y_{i}-\theta_{\beta^{\dagger}}(x_{i})\}}.

According to Assumption 5(2), YiY_{i} is sub-Gaussian, so the product in the last line above can be upper-bounded by

e2nω2σ2θβθβn,22×e2ωi=1n{θβ(xi)θβ(xi)}{θ(xi)θβ(xi)}.e^{2n\omega^{2}\sigma^{2}\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}^{2}}\times e^{-2\omega\sum_{i=1}^{n}\{\theta_{\beta^{\dagger}}(x_{i})-\theta_{\beta}(x_{i})\}\{\theta^{\star}(x_{i})-\theta_{\beta^{\dagger}}(x_{i})\}}.

The second exponential term above is identically 11 because the exponent vanishes—a consequence of the Pythagorean theorem. To see this, first write the quantity in the exponent as an inner product

(ββ)Fn{θ(x1:n)Fnβ}=(ββ){Fnθ(x1:n)FnFnβ}.(\beta-\beta^{\dagger})^{\top}F_{n}^{\top}\{\theta^{\star}(x_{1:n})-F_{n}\beta^{\dagger}\}=(\beta-\beta^{\dagger})^{\top}\{F_{n}^{\top}\theta^{\star}(x_{1:n})-F_{n}^{\top}F_{n}\beta^{\dagger}\}.

Recall from the discussion in Section 5.4 that β\beta^{\dagger} satisfies (FnFn)β=Fnθ(x1:n)(F_{n}^{\top}F_{n})\beta^{\dagger}=F_{n}^{\top}\theta^{\star}(x_{1:n}); from this, it follows that the above display vanishes for all vectors β\beta. Therefore,

Pnenω{rn(β)rn(β)}enωθβθβn,22(12ωσ2),P^{n}e^{-n\omega\{r_{n}(\beta)-r_{n}(\beta^{\dagger})\}}\leq e^{-n\omega\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}^{2}(1-2\omega\sigma^{2})},

and Condition 1 is satisfied provided the learning rate ω\omega is less than (2σ2)1(2\sigma^{2})^{-1}.
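The Pythagorean step above is nothing more than the statement that the residual vector \theta^{\star}(x_{1:n})-F_{n}\beta^{\dagger} is orthogonal to the column space of F_{n}. A small numerical sketch of this fact, using an arbitrary stand-in basis matrix and mean function purely for illustration, is the following.

import numpy as np

rng = np.random.default_rng(1)
n, J = 200, 7
F = rng.standard_normal((n, J))                 # stand-in for the basis matrix F_n
theta_star = np.sin(3.0 * rng.uniform(size=n))  # stand-in for theta*(x_{1:n})

# beta_dagger solves the normal equations (F'F) beta = F' theta*
beta_dag, *_ = np.linalg.lstsq(F, theta_star, rcond=None)
resid = theta_star - F @ beta_dag

# the cross term in the exponent vanishes for every beta (up to rounding error)
beta = rng.standard_normal(J)
print(abs((beta - beta_dag) @ (F.T @ resid)))    # prints a value on the order of 1e-13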

Next, we derive a prior bound similar to (14). By Assumption 5(3), all eigenvalues of FnFnF_{n}^{\top}F_{n} are bounded away from zero and infinity, which implies

ββ22θβθβn,22ββ22.\|\beta-\beta^{\dagger}\|_{2}^{2}\lesssim\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}^{2}\lesssim\|\beta-\beta^{\dagger}\|_{2}^{2}. (62)

In the arguments that follow, we show that θβθβn,2εn\|\theta_{\beta^{\star}}-\theta_{\beta^{\dagger}}\|_{n,2}\lesssim\varepsilon_{n} which implies, by (62), that βnβ22εn2\|\beta^{\star}_{n}-\beta^{\dagger}\|_{2}^{2}\lesssim\varepsilon_{n}^{2}. The approximation property in (43) implies that βn<H\|\beta_{n}^{\star}\|_{\infty}<H. Therefore, β\|\beta^{\dagger}\|_{\infty} is bounded because, if it were not bounded, then (62) would be contradicted.

Since β\|\beta^{\dagger}\|_{\infty} is bounded, Assumption 5(5) implies that the prior for β\beta satisfies

Π~({β:ββ2εn})εnJneJnlogC.\widetilde{\Pi}(\{\beta:\|\beta-\beta^{\dagger}\|_{2}\leq\varepsilon_{n}\})\gtrsim\varepsilon_{n}^{J_{n}}e^{-J_{n}\log C}. (63)

Recall that (14) involves the mean and variance of the empirical risk, and we can directly calculate these. For the mean,

m(θβ,θβ)\displaystyle m(\theta_{\beta},\theta_{\beta^{\dagger}}) =r¯n(β)r¯n(β)\displaystyle=\bar{r}_{n}(\beta)-\bar{r}_{n}(\beta^{\dagger})
=θβθβn,22+i=1n{θβ(xi)θ(xi)}{θβ(xi)θβ(xi)}\displaystyle=\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}^{2}+\sum_{i=1}^{n}\{\theta_{\beta^{\dagger}}(x_{i})-\theta^{\star}(x_{i})\}\{\theta_{\beta^{\dagger}}(x_{i})-\theta_{\beta}(x_{i})\}
=θβθβn,22,\displaystyle=\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}^{2},

where the last equality is by the same Pythagorean theorem argument presented above. Similarly, the variance satisfies v(θβ,θβ)4σ2n1θβθβn,22v(\theta_{\beta},\theta_{\beta^{\dagger}})\leq 4\sigma^{2}n^{-1}\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}^{2}. Therefore,

{m(θβ,θβ)v(θβ,θβ)}θβθβn,22,\{m(\theta_{\beta},\theta_{\beta^{\dagger}})\vee v(\theta_{\beta},\theta_{\beta^{\dagger}})\}\lesssim\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}^{2},

and (63) and (62) imply

Π~({β:m(θβ,θβ)v(θβ,θβ)εn2})\displaystyle\tilde{\Pi}(\{\beta:m(\theta_{\beta},\theta_{\beta^{\dagger}})\vee v(\theta_{\beta},\theta_{\beta^{\dagger}})\leq\varepsilon_{n}^{2}\}) εnJneCJn.\displaystyle\gtrsim\varepsilon_{n}^{J_{n}}e^{-CJ_{n}}.

The fact that nεn2=Jnn\varepsilon_{n}^{2}=J_{n} along with Lemma 1 implies

DnεnJne(2ω+logC)JnD_{n}\gtrsim\varepsilon_{n}^{J_{n}}e^{-(2\omega+\log C)J_{n}}

with PnP^{n}-probability converging to 11 as nn\rightarrow\infty.

We briefly verify the claim made just before the statement of Proposition 2, i.e., that (44) holds for an independence prior consisting of strictly positive densities. To see this, note that the volume of a JnJ_{n}-dimensional sphere with radius ε\varepsilon is a constant multiple of εJn\varepsilon^{J_{n}}, so that, if the joint prior density is bounded away from zero by CJnC^{J_{n}} on the sphere, then we have Π~({β:ββ2ε})(Cε)Jn\widetilde{\Pi}(\{\beta:\|\beta-\beta^{\prime}\|_{2}\leq\varepsilon\})\gtrsim(C\varepsilon)^{J_{n}}, which is (44). So we must verify the bound on the prior density. Suppose Π~\tilde{\Pi} has a density π~\tilde{\pi} equal to a product of independent prior densities π~j\tilde{\pi}_{j}, j=1,,Jnj=1,\ldots,J_{n}. Since the cube {β:ββJn1/2ε}\{\beta:\|\beta-\beta^{\prime}\|_{\infty}\leq J_{n}^{1/2}\varepsilon\} contains the ball {β:ββ2ε}\{\beta:\|\beta-\beta^{\prime}\|_{2}\leq\varepsilon\} by the Cauchy–Schwarz inequality, the infimum of the prior density on the ball is no smaller than the infimum of the prior density on the cube. To bound the prior density on the cube, consider any of the JnJ_{n} components and note that |βjβj|Jn1/2εn|\beta_{j}-\beta^{\prime}_{j}|\leq J_{n}^{1/2}\varepsilon_{n} implies βj[HJn1/2εn,H+Jn1/2εn]\beta_{j}\in[-H-J_{n}^{1/2}\varepsilon_{n},H+J_{n}^{1/2}\varepsilon_{n}] since βH\|\beta^{\prime}\|_{\infty}\leq H. Moreover, since α1/2\alpha\geq 1/2 we have Jn1/2εn0J_{n}^{1/2}\varepsilon_{n}\rightarrow 0 so that this interval lies within a compact set, say, [H1,H+1][-H-1,H+1]. And, since the prior density π~j\tilde{\pi}_{j} is strictly positive, it is bounded away from zero by a constant CC on this compact set. Finally, by independence, we have π~(β)CJn\tilde{\pi}(\beta)\geq C^{J_{n}} for all β{β:ββ2ε}\beta\in\{\beta:\|\beta-\beta^{\prime}\|_{2}\leq\varepsilon\}, which verifies the claim concerning (44).

In order to obtain the optimal rate of εn=nα/(1+2α)\varepsilon_{n}=n^{-\alpha/(1+2\alpha)}, without an extra logarithmic term, we need a slightly better bound on PnNn(An)P^{n}N_{n}(A_{n}) than that used to prove Theorem 3. Our strategy, as in the proof of Theorem 2, will be to split the range of integration in the numerator into countably many disjoint pieces, and use bounds on the prior probability on those pieces to improve the numerator bound. Define A~n:={β:θβθβn,2>Mnεn}\tilde{A}_{n}:=\{\beta:\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}>M_{n}\varepsilon_{n}\} and write the numerator Nn(A~n)N_{n}(\tilde{A}_{n}) as follows

Nn(A~n)\displaystyle N_{n}(\tilde{A}_{n}) =θβθβn,2>Mnεnenω{rn(β)rn(β)}Π~(dβ)\displaystyle=\int_{\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}>M_{n}\varepsilon_{n}}e^{-n\omega\{r_{n}(\beta)-r_{n}(\beta^{\dagger})\}}\,\tilde{\Pi}(d\beta)
=t=1tMnεn<θβθβn,2<(t+1)Mnεnenω{rn(β)rn(β)}Π~(dβ).\displaystyle=\sum_{t=1}^{\infty}\int_{tM_{n}\varepsilon_{n}<\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}<(t+1)M_{n}\varepsilon_{n}}e^{-n\omega\{r_{n}(\beta)-r_{n}(\beta^{\dagger})\}}\,\tilde{\Pi}(d\beta).

Taking expectation of the left-hand side and moving it under the sum and under the integral on the right-hand side, we need to bound

tMnεn<θβθβn,2<(t+1)MnεnPnenω{rn(β)rn(β)}Π~(dβ),t=1,2,.\int_{tM_{n}\varepsilon_{n}<\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}<(t+1)M_{n}\varepsilon_{n}}P^{n}e^{-n\omega\{r_{n}(\beta)-r_{n}(\beta^{\dagger})\}}\,\tilde{\Pi}(d\beta),\quad t=1,2,\ldots.

By Condition 1, verified above, on the given range of integration the integrand is bounded above by enω(tMnεn)2/2e^{-n\omega(tM_{n}\varepsilon_{n})^{2}/2}, so the expectation of the integral itself is bounded by

e^{-n\omega(tM_{n}\varepsilon_{n})^{2}/2}\,\tilde{\Pi}(\{\beta:\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}<(t+1)M_{n}\varepsilon_{n}\}),\quad t=1,2,\ldots.

Since Π~\tilde{\Pi} has a bounded density on the JnJ_{n}-dimensional parameter space, we clearly have

\tilde{\Pi}(\{\beta:\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}<(t+1)M_{n}\varepsilon_{n}\})\lesssim\{(t+1)M_{n}\varepsilon_{n}\}^{J_{n}}.

Plug all this back into the summation above to get

PnNn(A~n)(Mnεn)Jnt=1(t+1)Jneω(tMn)2Jn/2.P^{n}N_{n}(\tilde{A}_{n})\lesssim(M_{n}\varepsilon_{n})^{J_{n}}\sum_{t=1}^{\infty}(t+1)^{J_{n}}e^{-\omega(tM_{n})^{2}J_{n}/2}.

The above sum is finite for all $n$ and, for all sufficiently large $M_{n}$, bounded by a multiple of $e^{-\omega M_{n}^{2}J_{n}/4}$. Consequently, the expectation of the Gibbs posterior numerator is bounded by a constant multiple of $(M_{n}\varepsilon_{n})^{J_{n}}e^{-\omega M_{n}^{2}J_{n}/4}$.
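
For intuition, the geometric decay of the summands can be checked numerically; the values of $\omega$, $M$, and $J_n$ below are hypothetical, chosen so that $M$ is comfortably large, and the computation is only an illustration, not part of the argument.

import numpy as np

# The series sum_{t>=1} (t+1)^J exp(-omega (t M)^2 J / 2) is dominated by its
# t = 1 term; check that its logarithm stays below -omega M^2 J / 4 for a
# large enough (assumed) constant M.
omega, M = 0.5, 4.0                      # hypothetical learning rate and radius constant
t = np.arange(1, 200)                    # truncation; the omitted tail is negligible
for J in (5, 20, 50):
    log_terms = J * np.log(t + 1.0) - omega * (t * M) ** 2 * J / 2
    log_sum = np.logaddexp.reduce(log_terms)          # stable log of the partial sum
    print(J, log_sum <= -omega * M ** 2 * J / 4)      # True for these values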

Combining the in-expectation and in-probability bounds on Nn(A~n)N_{n}(\tilde{A}_{n}) and DnD_{n}, respectively, as in the proof of Theorem 1, we find that

P^{n}\widetilde{\Pi}_{n}(\tilde{A}_{n})\lesssim e^{-J_{n}(\omega M_{n}^{2}/4-\log M_{n}-2\omega-\log C)},

which vanishes as nn\rightarrow\infty for any sufficiently large constant MnM>0M_{n}\equiv M>0.

The above arguments establish that the Gibbs posterior Π~n\widetilde{\Pi}_{n} for β\beta satisfies

PnΠ~n({βJ:θβθβn,2>Mnεn})0.P^{n}\widetilde{\Pi}_{n}(\{\beta\in\mathbb{R}^{J}:\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}>M_{n}\varepsilon_{n}\})\to 0. (64)

But this implies the proposition's claim, in which $\theta_{\beta^{\dagger}}$ is replaced by $\theta^{\star}$. To see this, first recall that Assumption 2(4) implies the existence of a $J$-vector $\beta^{\star}=\beta_{n}^{\star}$ such that $\|\theta_{\beta^{\star}}-\theta^{\star}\|_{\infty}\lesssim J^{-\alpha}$. Next, use the triangle inequality to get

θβθn,2θβθβn,2+θβθn,2.\|\theta_{\beta}-\theta^{\star}\|_{n,2}\leq\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}+\|\theta_{\beta^{\dagger}}-\theta^{\star}\|_{n,2}.

Now apply the Pythagorean theorem argument as above to show that

θβθn,22=θβθβn,22+θβθn,22.\|\theta_{\beta^{\star}}-\theta^{\star}\|_{n,2}^{2}=\|\theta_{\beta^{\star}}-\theta_{\beta^{\dagger}}\|_{n,2}^{2}+\|\theta_{\beta^{\dagger}}-\theta^{\star}\|_{n,2}^{2}.

Since the sup-norm dominates the empirical L2L_{2} norm, the left-hand side is bounded by CJ2αCJ^{-2\alpha} for some C>0C>0. But both terms on the right-hand side are non-negative, so it must be that the right-most term is also bounded by CJ2αCJ^{-2\alpha}. Putting these together, we find that

θβθn,2>Mnεnθβθβn,2>MnεnC1/2Jα.\|\theta_{\beta}-\theta^{\star}\|_{n,2}>M_{n}^{\prime}\varepsilon_{n}\implies\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}>M_{n}^{\prime}\varepsilon_{n}-C^{1/2}J^{-\alpha}.

Therefore, with $\varepsilon_{n}=n^{-\alpha/(2\alpha+1)}\log n$ and $J=J_{n}=n^{1/(2\alpha+1)}$, the lower bound on the right-hand side of the previous display is a constant multiple of $\varepsilon_{n}$. So we can choose $M_{n}^{\prime}$ such that $M_{n}^{\prime}\varepsilon_{n}-C^{1/2}J^{-\alpha}\geq M_{n}\varepsilon_{n}$ for all large $n$, with $M_{n}$ as above. In the end,

Π~n({β:θβθn,2>Mnεn})Π~n({β:θβθβn,2>Mnεn}),\widetilde{\Pi}_{n}(\{\beta:\|\theta_{\beta}-\theta^{\star}\|_{n,2}>M_{n}^{\prime}\varepsilon_{n}\})\leq\widetilde{\Pi}_{n}(\{\beta:\|\theta_{\beta}-\theta_{\beta^{\dagger}}\|_{n,2}>M_{n}\varepsilon_{n}\}),

so the Gibbs posterior concentration claim in the proposition follows from that established above. Finally, by definition of the prior and Gibbs posterior for θ\theta, we have that

P^{n}\Pi_{n}(\{\theta\in\Theta:\|\theta-\theta^{\star}\|_{n,2}>M_{n}^{\prime}\varepsilon_{n}\})\to 0,

which completes the proof.

C.6 Proof of Proposition 6

The proof proceeds by checking the conditions of Theorem 1. We begin by verifying (12). Evaluate m(θ,θ)m(\theta,\theta^{\star}) and v(θ,θ)v(\theta,\theta^{\star}) for the loss function θ\ell_{\theta} as defined above:

m(θ;θ)\displaystyle m(\theta;\theta^{\star}) =P{Yϕθ(X)}P{Yϕθ(X)}\displaystyle=P\{Y\neq\phi_{\theta}(X)\}-P\{Y\neq\phi_{\theta^{\star}}(X)\}
=xθ<0,xθ>0(2η1)𝑑P+xθ>0,xθ<0(12η)𝑑P\displaystyle=\int_{x^{\top}\theta<0,x^{\top}\theta^{\star}>0}(2\eta-1)\,dP+\int_{x^{\top}\theta>0,x^{\top}\theta^{\star}<0}(1-2\eta)\,dP
v(θ,θ)\displaystyle v(\theta,\theta^{\star}) P(θθ)2\displaystyle\leq P(\ell_{\theta}-\ell_{\theta^{\star}})^{2}
=P(Xθ<0,Xθ>0)+P(Xθ>0,Xθ<0)\displaystyle=P(X^{\top}\theta<0,X^{\top}\theta^{\star}>0)+P(X^{\top}\theta>0,X^{\top}\theta^{\star}<0)
=P(ϕθϕθ)2.\displaystyle=P(\phi_{\theta}-\phi_{\theta^{\star}})^{2}.

It follows from arguments in Tsybakov, (2004) that, under the margin condition in Assumption 6(5), with h(0,1)h\in(0,1), we have

hP(ϕθϕθ)2m(θ,θ).hP(\phi_{\theta}-\phi_{\theta^{\star}})^{2}\leq m(\theta,\theta^{\star}).
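
As an informal illustration (not part of the proof), this margin inequality can be checked by simulation for linear classifiers; the Gaussian design, the conditional probability $\eta$ sitting exactly at the Massart margin, and the particular directions below are all hypothetical choices.

import numpy as np

# Monte Carlo illustration of h * P(phi_theta != phi_theta*) <= m(theta, theta*)
# for linear classifiers phi_theta(x) = 1{x'theta > 0} under a Massart-type
# margin |eta(x) - 1/2| >= h/2; with eta exactly at the margin, the bound holds
# with (near) equality.
rng = np.random.default_rng(1)
h = 0.4
theta_star = np.array([1.0, 0.0])
theta = np.array([np.cos(0.5), np.sin(0.5)])        # an assumed competing direction

x = rng.standard_normal((1_000_000, 2))
eta = 0.5 + (h / 2) * np.sign(x @ theta_star)       # Bayes classifier is phi_theta*
phi_star = x @ theta_star > 0
phi = x @ theta > 0

disagree = np.mean(phi != phi_star)
# excess risk, following the two-region expression for m(theta; theta*) above
m = np.mean((2 * eta - 1) * (phi_star & ~phi) + (1 - 2 * eta) * (~phi_star & phi))
print(m >= 0.99 * h * disagree, m, h * disagree)    # small slack for Monte Carlo error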

Further, rewrite m(θ,θ)m(\theta,\theta^{\star}) as

m(θ;θ)\displaystyle m(\theta;\theta^{\star}) =η(x){ϕθ(x)ϕθ(x)}P(dx)+[1η(x)]{ϕθ(x)ϕθ(x)}P(dx)\displaystyle=\int\eta(x)\{\phi_{\theta^{\star}}(x)-\phi_{\theta}(x)\}\,P(dx)+\int[1-\eta(x)]\{\phi_{\theta}(x)-\phi_{\theta^{\star}}(x)\}\,P(dx)
2|ϕθ(x)ϕθ(x)|P(dx),\displaystyle\leq 2\int|\phi_{\theta}(x)-\phi_{\theta^{\star}}(x)|\,P(dx), (65)

where the latter inequality holds because $0\leq\eta\leq 1$. Since $\phi_{\theta}-\phi_{\theta^{\star}}$ is a difference of indicators,

P(ϕθϕθ)2=P|ϕθϕθ|andm(θ,θ)P(ϕθϕθ)2.P(\phi_{\theta}-\phi_{\theta^{\star}})^{2}=P|\phi_{\theta}-\phi_{\theta^{\star}}|\quad\text{and}\quad m(\theta,\theta^{\star})\lesssim P(\phi_{\theta}-\phi_{\theta^{\star}})^{2}.

This latter inequality will be useful below. Under the stated assumptions, the integrand in (65) can be handled exactly as in Lemma 4 of Jiang and Tanner, (2008). That is, if $\|\beta-\beta^{\star}\|_{1}$ is sufficiently small, then $m(\theta;\theta^{\star})\lesssim\|\beta-\beta^{\star}\|_{1}$. Since $v(\theta;\theta^{\star})\lesssim m(\theta;\theta^{\star})$, it follows that

{θ:m(θ;θ)v(θ;θ)ε2}{θ=(α,β):ββ1cε2},\{\theta:m(\theta;\theta^{\star})\vee v(\theta;\theta^{\star})\leq\varepsilon^{2}\}\supseteq\{\theta=(\alpha,\beta):\|\beta-\beta^{\star}\|_{1}\leq c\varepsilon^{2}\},

for a constant c>0c>0. To lower-bound the prior probability of the event on the right-hand side, we follow the proof of Lemma 2 in Castillo et al., (2015). First, for SS^{\star} the configuration of β\beta^{\star}, we can immediately get

Π({β:ββ1cε2})π(S)βSβS1cε2gS(βS)𝑑βS.\Pi(\{\beta:\|\beta-\beta^{\star}\|_{1}\leq c\varepsilon^{2}\})\geq\pi(S^{\star})\int_{\|\beta_{S^{\star}}-\beta_{S^{\star}}^{\star}\|_{1}\leq c\varepsilon^{2}}g_{S^{\star}}(\beta_{S^{\star}})\,d\beta_{S^{\star}}.

Now make a change of variable b=βSβSb=\beta_{S^{\star}}-\beta_{S^{\star}}^{\star} and note that

gS(βS)=gS(b+βS)eλβ1gS(b).g_{S^{\star}}(\beta_{S^{\star}})=g_{S^{\star}}(b+\beta_{S^{\star}}^{\star})\geq e^{-\lambda\|\beta^{\star}\|_{1}}g_{S^{\star}}(b).

Therefore,

Π({β:ββ1cε2})π(S)eλβ1b1cε2gS(b)𝑑b,\Pi(\{\beta:\|\beta-\beta^{\star}\|_{1}\leq c\varepsilon^{2}\})\geq\pi(S^{\star})e^{-\lambda\|\beta^{\star}\|_{1}}\int_{\|b\|_{1}\leq c\varepsilon^{2}}g_{S^{\star}}(b)\,db,

and after plugging in the definition of π(S)\pi(S^{\star}), using the bound in Equation (6.2) of Castillo et al., (2015), and simplifying, we get

Π({β:ββ1cε2})f(|S|)q2|S|eλβ1.\Pi(\{\beta:\|\beta-\beta^{\star}\|_{1}\leq c\varepsilon^{2}\})\gtrsim f(|S^{\star}|)\,q^{-2|S^{\star}|}e^{-\lambda\|\beta^{\star}\|_{1}}.

From the form of the complexity prior ff, the bounds on λ\lambda, and the assumption that β=O(1)\|\beta^{\star}\|_{\infty}=O(1), we see that the lower bound is no smaller than eC|S|logqe^{-C|S^{\star}|\log q} for some constant C>0C>0, which implies (12), i.e.,

Π({θ:m(θ;θ)v(θ;θ)εn2})eCnεn2,\Pi(\{\theta:m(\theta;\theta^{\star})\vee v(\theta;\theta^{\star})\leq\varepsilon_{n}^{2}\})\gtrsim e^{-Cn\varepsilon_{n}^{2}},

where εn={n1|S|logq}1/2\varepsilon_{n}=\{n^{-1}|S^{\star}|\log q\}^{1/2}.
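
The Laplace shift bound used a few displays above, $g_{S^{\star}}(b+\beta^{\star}_{S^{\star}})\geq e^{-\lambda\|\beta^{\star}\|_{1}}g_{S^{\star}}(b)$, is a one-line consequence of the triangle inequality; the following sketch checks it numerically, assuming, as in Castillo et al., (2015), that $g_{S^{\star}}$ is a product of Laplace densities with rate $\lambda$, and using hypothetical values of $\lambda$ and $\beta^{\star}$.

import numpy as np

# The shift bound follows from ||b + beta*||_1 <= ||b||_1 + ||beta*||_1 applied
# to the log of the product Laplace density, which is -lambda * ||.||_1 up to an
# additive constant.  (Hypothetical lambda and beta*.)
rng = np.random.default_rng(3)
lam = 1.5
beta_star = np.array([0.8, -1.2, 0.4])

def log_g(b):                       # log density up to the normalizing constant
    return -lam * np.abs(b).sum(axis=-1)

b = rng.uniform(-2, 2, size=(100_000, 3))
lhs = log_g(b + beta_star)
rhs = -lam * np.abs(beta_star).sum() + log_g(b)
print(np.all(lhs >= rhs - 1e-12))   # True: the shift bound holds pointwise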

Next, we verify Condition 1. By direct computation, we get

Peω(θθ)\displaystyle Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})} =1P(ϕθϕθ)2+eωη(x)1{xθ0,xθ>0}P(dx)\displaystyle=1-P(\phi_{\theta}-\phi_{\theta^{\star}})^{2}+e^{-\omega}\int\eta(x)1\{x^{\top}\theta\leq 0,x^{\top}\theta^{\star}>0\}\,P(dx)
+eω(1η(x))1{xθ>0,xθ0}P(dx)\displaystyle\qquad+e^{-\omega}\int(1-\eta(x))1\{x^{\top}\theta>0,x^{\top}\theta^{\star}\leq 0\}\,P(dx)
+eω(1η(x))1{xθ0,xθ>0}P(dx)\displaystyle\qquad+e^{\omega}\int(1-\eta(x))1\{x^{\top}\theta\leq 0,x^{\top}\theta^{\star}>0\}\,P(dx)
+eωη(x)1{xθ>0,xθ0}P(dx).\displaystyle\qquad+e^{\omega}\int\eta(x)1\{x^{\top}\theta>0,x^{\top}\theta^{\star}\leq 0\}\,P(dx).

Using the Massart margin condition, we can upper bound the above by

Peω(θθ)\displaystyle Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})} 1min{a,b}P(ϕθϕθ)2\displaystyle\leq 1-\min\{a,b\}P(\phi_{\theta}-\phi_{\theta^{\star}})^{2}

where $a=1-e^{-\omega}-(\tfrac{1}{2}-\tfrac{h}{2})(e^{\omega}-e^{-\omega})$ and $b=1-e^{\omega}+(\tfrac{1}{2}+\tfrac{h}{2})(e^{\omega}-e^{-\omega})$; both simplify to $1-\cosh\omega+h\sinh\omega$. For all small enough $\omega$, $a$ and $b$ are bounded below by a constant multiple of $h\omega$, so for some constants $c,c^{\prime}>0$,

Peω(θθ)\displaystyle Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})} 1chωP(ϕθϕθ)2\displaystyle\leq 1-ch\omega P(\phi_{\theta}-\phi_{\theta^{\star}})^{2}
1cωm(θ,θ).\displaystyle\leq 1-c^{\prime}\omega m(\theta,\theta^{\star}).

Then Condition 1 follows from the elementary inequality 1tet1-t\leq e^{-t} for t>0t>0.
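
The small-$\omega$ behavior of the constants $a$ and $b$ is easy to check numerically; the margin value $h$ below is hypothetical, and this is only an illustration of the last step, not part of the proof.

import numpy as np

# Check that a = b = 1 - cosh(omega) + h*sinh(omega) and that, for omega small
# relative to the (assumed) Massart margin h, both are at least (h * omega) / 2.
h = 0.3
for omega in (0.2, 0.1, 0.05, 0.01):
    a = 1 - np.exp(-omega) - (0.5 - h / 2) * (np.exp(omega) - np.exp(-omega))
    b = 1 - np.exp(omega) + (0.5 + h / 2) * (np.exp(omega) - np.exp(-omega))
    closed = 1 - np.cosh(omega) + h * np.sinh(omega)
    print(omega, np.isclose(a, b), np.isclose(a, closed), min(a, b) >= h * omega / 2)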

C.7 Proof of Proposition 7

The proof proceeds by checking the conditions of Theorem 3. We begin by verifying (14). The upper bounds on m(θ,θ)m(\theta,\theta^{\star}) and v(θ,θ)v(\theta,\theta^{\star}) given in the proof of Proposition 6 are also valid in this setting, which means

Π({θ:m(θ,θ)v(θ,θ)ε2})Π({θ:ββ1C1ε2})\Pi(\{\theta:m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\leq\varepsilon^{2}\})\geq\Pi(\{\theta:\|\beta-\beta^{\star}\|_{1}\leq C_{1}\varepsilon^{2}\})

for some C1>0C_{1}>0. Further, Assumption 7 implies

\Pi(\{\theta:\|\beta-\beta^{\star}\|_{1}\leq C_{1}\varepsilon^{2}\})\gtrsim\varepsilon^{2q}=e^{-2q\log(1/\varepsilon)}.

By definition of εn\varepsilon_{n} and ωn\omega_{n} we have that

\exp\{-2q\log(1/\varepsilon_{n})\}\geq\exp\{-n\omega_{n}\varepsilon_{n}^{(2+2\gamma)/\gamma}\},

which, combined with the previous display, verifies (14).

Next, to verify Condition 1, note that the misclassification-error loss difference $\ell_{\theta}(u)-\ell_{\theta^{\star}}(u)$ is bounded in absolute value by $1$, so we can use the moment-generating function bound from Lemma 7.26 in Lafferty et al., (2010); see (16). Proposition 1 in Tsybakov, (2004) provides the lower bound on $m(\theta,\theta^{\star})$ that we need, i.e., Tsybakov proves that

\text{our Assumption 7(2)}\implies m(\theta,\theta^{\star})\gtrsim d(\theta,\theta^{\star})^{(2+2\gamma)/\gamma}.

With this and the upper bound on v(θ,θ)v(\theta,\theta^{\star}) derived above, (16) implies

Peωn(θθ)eC2ωnεn(2+2γ)/γPe^{-\omega_{n}(\ell_{\theta}-\ell_{\theta^{\star}})}\leq e^{-C_{2}\omega_{n}\varepsilon_{n}^{(2+2\gamma)/\gamma}}

for some C2>0C_{2}>0, which verifies Condition 1.

C.8 Proof of Proposition 8

The proof proceeds by checking the conditions of Theorem 5. First, we verify (21). Starting with m(θ,θ)m(\theta,\theta^{\star}), by direct calculation,

m(θ,θ)=12𝕏{\displaystyle m(\theta,\theta^{\star})=\frac{1}{2}\int_{\mathbb{X}}\Bigl{\{} (|θ(x)θ(x)y||θ(x)θ(x)y|)Px(dy)\displaystyle\int\bigl{(}|\theta(x)\vee\theta^{\star}(x)-y|-|\theta(x)\wedge\theta^{\star}(x)-y|\bigr{)}\,P_{x}(dy)
+(12τ)|θ(x)θ(x)|}P(dx).\displaystyle+(1-2\tau)|\theta(x)-\theta^{\star}(x)|\Bigr{\}}\,P(dx).

Partitioning the range of integration with respect to yy, for given xx, into the disjoint intervals (,θθ](-\infty,\theta\wedge\theta^{\star}], (θθ,θθ)(\theta\wedge\theta^{\star},\theta\vee\theta^{\star}), and [θθ,)[\theta\vee\theta^{\star},\infty), and simplifying, we get

m(θ,θ)=12𝕏θ(x)θ(x)θ(x)θ(x){θ(x)θ(x)y}Px(dy)P(dx).m(\theta,\theta^{\star})=\frac{1}{2}\int_{\mathbb{X}}\int_{\theta(x)\wedge\theta^{\star}(x)}^{\theta(x)\vee\theta^{\star}(x)}\{\theta(x)\vee\theta^{\star}(x)-y\}\,P_{x}(dy)\,P(dx). (66)

It follows immediately that $m(\theta,\theta^{\star})\lesssim\|\theta-\theta^{\star}\|_{L_{1}(P)}$. Also, since $\theta\mapsto\ell_{\theta}(x,y)$ clearly satisfies the fixed-$x$ Lipschitz bound

|θ(x,y)θ(x,y)||θ(x)θ(x)|,for all (x,y),|\ell_{\theta}(x,y)-\ell_{\theta^{\star}}(x,y)|\leq|\theta(x)-\theta^{\star}(x)|,\quad\text{for all $(x,y)$}, (67)

we get a similar bound for the variance, i.e., $v(\theta,\theta^{\star})\leq\|\theta-\theta^{\star}\|^{2}_{L_{2}(P)}$. Let $J_{n}=n^{1/(1+2\alpha)}$ and $\hat{\theta}_{J,\beta}:=\beta^{\top}f$, and define a sup-norm ball around $\theta^{\star}$

Bn={(β,J):βJ,J=Jn,θθ^J,βCJnα}.B_{n}^{\star}=\{(\beta,J):\beta\in\mathbb{R}^{J},J=J_{n},\|\theta^{\star}-\hat{\theta}_{J,\beta}\|_{\infty}\leq CJ_{n}^{-\alpha}\}.

By the above upper bounds on m(θ,θ)m(\theta,\theta^{\star}) and v(θ,θ)v(\theta,\theta^{\star}) in terms of θθL2(P)2\|\theta-\theta^{\star}\|_{L_{2}(P)}^{2}, we have

θθ^J,βCJnα{m(θ,θ)v(θ,θ)}Jn2α.\|\theta^{\star}-\hat{\theta}_{J,\beta}\|_{\infty}\leq CJ_{n}^{-\alpha}\implies\{m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\}\lesssim J_{n}^{-2\alpha}.

Then, by Assumptions 8(3-4), and using the same argument as in the proof of Theorem 1 in Shen and Ghosal, (2015) we have

Π(n)({m(θ,θ)v(θ,θ)}Jn2α)\displaystyle\Pi^{(n)}(\{m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\}\lesssim J_{n}^{-2\alpha}) =Π({m(θ,θ)v(θ,θ)}Jn2α)/Π(Θn)\displaystyle=\Pi(\{m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\}\lesssim J_{n}^{-2\alpha})/\Pi(\Theta_{n})
Π({m(θ,θ)v(θ,θ)}Jn2α)\displaystyle\geq\Pi(\{m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\}\lesssim J_{n}^{-2\alpha})
Π(Bn)eC1Jnlogn,\displaystyle\geq\Pi(B_{n}^{\star})\gtrsim e^{-C_{1}J_{n}\log n},

for some C1>0C_{1}>0. By the definitions of εn\varepsilon_{n} and ωn\omega_{n} in Proposition 8 it follows that Jn2αΔn2εn2J_{n}^{-2\alpha}\leq\Delta_{n}^{-2}\varepsilon_{n}^{2} for Δn\Delta_{n} as defined in Condition 2. Define KnΔn1K_{n}\propto\Delta_{n}^{-1}, with precise proportionality determined below. Then,

C1Jn(logn)CnKn2ωnεn2,C_{1}J_{n}(\log n)\leq CnK_{n}^{2}\omega_{n}\varepsilon_{n}^{2},

for a sufficiently small C>0C>0 and all large enough nn, which verifies the prior condition in (21) with r=2r=2.

Next we verify Condition 2. Define the sets An:={θ:θθL2(P)>Mεn}A_{n}:=\{\theta:\|\theta-\theta^{\star}\|_{L_{2}(P)}>M\varepsilon_{n}\} and Θn:={θ:θΔn}\Theta_{n}:=\{\theta:\|\theta\|_{\infty}\leq\Delta_{n}\}. Note that Assumption 8(4) implies that Πn(AnΘnc)=0\Pi_{n}(A_{n}\cap\Theta_{n}^{\text{\sc c}})=0, and, therefore,

Πn(An)=Πn(AnΘn).\Pi_{n}(A_{n})=\Pi_{n}(A_{n}\cap\Theta_{n}).

The following computations provide a lower bound on $m(\theta,\theta^{\star})$ for $\theta\in\Theta_{n}$. Partition $\mathbb{X}$ as $\mathbb{X}=\mathbb{X}_{1}\cup\mathbb{X}_{2}$, where $\mathbb{X}_{1}=\{x:|\theta(x)-\theta^{\star}(x)|\geq\delta\}$, $\mathbb{X}_{2}=\mathbb{X}_{1}^{\text{\sc c}}$, and $\delta>0$ is as in Assumption 8(2). Using (66), the mean function can be expressed as

2m(θ,θ)\displaystyle 2m(\theta,\theta^{\star}) =𝕏1θ(x)θ(x)θ(x)θ(x){θ(x)θ(x)y}Px(dy)P(dx)\displaystyle=\int_{\mathbb{X}_{1}}\int_{\theta(x)\wedge\theta^{\star}(x)}^{\theta(x)\vee\theta^{\star}(x)}\{\theta(x)\vee\theta^{\star}(x)-y\}\,P_{x}(dy)\,P(dx)
+𝕏2θ(x)θ(x)θ(x)θ(x){θ(x)θ(x)y}Px(dy)P(dx).\displaystyle\qquad+\int_{\mathbb{X}_{2}}\int_{\theta(x)\wedge\theta^{\star}(x)}^{\theta(x)\vee\theta^{\star}(x)}\{\theta(x)\vee\theta^{\star}(x)-y\}\,P_{x}(dy)\,P(dx).

For convenience, refer to the two integrals on the right-hand side of the above display as I1I_{1} and I2I_{2}, respectively. Using Assumption 8(2) and replacing the range of integration in the inner integral by a (δ/2)(\delta/2)-neighborhood of θ(x)\theta^{\star}(x) we can lower bound I1I_{1} as

I1\displaystyle I_{1} 𝕏1{x:θ(x)>θ(x)}θ(x)δθ(x)δ/2{θ(x)y}Px(dy)P(dx)\displaystyle\geq\int_{\mathbb{X}_{1}\cap\{x:\theta^{\star}(x)>\theta(x)\}}\int_{\theta^{\star}(x)-\delta}^{\theta^{\star}(x)-\delta/2}\{\theta^{\star}(x)-y\}\,P_{x}(dy)\,P(dx)
+𝕏1{x:θ(x)θ(x)}θ(x)θ(x)+δ/2{θ(x)y}Px(dy)P(dx)\displaystyle\qquad+\int_{\mathbb{X}_{1}\cap\{x:\theta^{\star}(x)\leq\theta(x)\}}\int_{\theta^{\star}(x)}^{\theta^{\star}(x)+\delta/2}\{\theta(x)-y\}\,P_{x}(dy)\,P(dx)
(βδ2/4)P(𝕏1).\displaystyle\geq(\beta\delta^{2}/4)\,P(\mathbb{X}_{1}).

Next, for I2I_{2}, we can again use Assumption 8(2) to get the lower bound

I2\displaystyle I_{2} 𝕏2θ(x)θ(x){θ(x)+θ(x)}/2{θ(x)θ(x)y}Px(dy)P(dx)\displaystyle\geq\int_{\mathbb{X}_{2}}\int_{\theta(x)\wedge\theta^{\star}(x)}^{\{\theta(x)+\theta^{\star}(x)\}/2}\{\theta(x)\vee\theta^{\star}(x)-y\}\,P_{x}(dy)\,P(dx)
β2𝕏2|θ(x)θ(x)|2P(dx).\displaystyle\geq\frac{\beta}{2}\int_{\mathbb{X}_{2}}|\theta(x)-\theta^{\star}(x)|^{2}\,P(dx).

Similarly, for sufficiently large nn, if θΘn\theta\in\Theta_{n}, then the L2(P)L_{2}(P) norm is bounded as

θθL2(P)2𝕏2|θ(x)θ(x)|2P(dx)+(Δn)2P(𝕏1).\|\theta-\theta^{\star}\|_{L_{2}(P)}^{2}\leq\int_{\mathbb{X}_{2}}|\theta(x)-\theta^{\star}(x)|^{2}\,P(dx)+(\Delta_{n})^{2}\,P(\mathbb{X}_{1}).

Comparing the lower and upper bounds for m(θ,θ)m(\theta,\theta^{\star}) and θθL2(P)2\|\theta-\theta^{\star}\|_{L_{2}(P)}^{2} in terms of integration over 𝕏1\mathbb{X}_{1} and 𝕏2\mathbb{X}_{2} we have

𝕏2|θ(x)θ(x)|2P(dx)I2,\int_{\mathbb{X}_{2}}|\theta(x)-\theta^{\star}(x)|^{2}\,P(dx)\lesssim I_{2},

and

(Δn)2𝕏1|θ(x)θ(x)|2P(dx)I1,(\Delta_{n})^{-2}\int_{\mathbb{X}_{1}}|\theta(x)-\theta^{\star}(x)|^{2}\,P(dx)\lesssim I_{1},

which together imply

m(θ,θ)(Δn)2θθL2(P)2.m(\theta,\theta^{\star})\gtrsim(\Delta_{n})^{-2}\|\theta-\theta^{\star}\|_{L_{2}(P)}^{2}.
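
For intuition, the quadratic dependence on $\theta-\theta^{\star}$ behind this display can be seen in a small simulation; the conditional quantile function, Gaussian noise, and constant shifts below are hypothetical, and the check loss is taken to be the standard $\rho_{\tau}$, which matches the loss in this example up to, at most, a constant factor that does not affect the scaling being illustrated.

import numpy as np

# Monte Carlo illustration: with theta in a fixed sup-norm neighborhood of
# theta*, the excess check-loss risk m(theta, theta*) scales like the squared
# perturbation, consistent with the L2-type lower bound displayed above.
rng = np.random.default_rng(4)
tau, sigma, n = 0.5, 0.5, 2_000_000
x = rng.uniform(-1, 1, n)
theta_star = np.sin(np.pi * x)            # conditional tau-quantile of Y given X = x
y = theta_star + rng.normal(0, sigma, n)

def check_loss(u):
    return u * (tau - (u < 0))            # standard check loss rho_tau

for shift in (0.3, 0.15, 0.075):
    theta = theta_star + shift            # competitor at sup-norm distance `shift`
    m = np.mean(check_loss(y - theta) - check_loss(y - theta_star))
    print(shift, m, m / shift ** 2)       # the ratio m / shift^2 stays roughly constant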

Recall, from (67), that $\theta\mapsto\ell_{\theta}(x,y)$ is $1$-Lipschitz. Therefore, if $\theta\in\Theta_{n}$, then, for large enough $n$, the loss difference is bounded by $\Delta_{n}$, so Lemma 7.26 in Lafferty et al., (2010), along with the lower and upper bounds on $m(\theta,\theta^{\star})$ and $v(\theta,\theta^{\star})$, can be used to verify Condition 2. That is, there exists a $K>0$ such that, for all $\theta\in\Theta_{n}$,

Peωn(θθ)\displaystyle Pe^{-\omega_{n}(\ell_{\theta}-\ell_{\theta^{\star}})} exp{2ωn2v(θ,θ)Kωnm(θ,θ)}\displaystyle\leq\exp\{2\omega_{n}^{2}v(\theta,\theta^{\star})-K\omega_{n}m(\theta,\theta^{\star})\}
exp[KωnΔn2θθL2(P)2{12ωnΔn2}]\displaystyle\leq\exp[-K\omega_{n}\Delta_{n}^{-2}\|\theta-\theta^{\star}\|_{L_{2}(P)}^{2}\{1-2\omega_{n}\Delta_{n}^{2}\}]
exp(Knωnεn2)\displaystyle\leq\exp(-K_{n}\omega_{n}\varepsilon_{n}^{2})

where the last inequality holds for $K_{n}=(K/2)\Delta_{n}^{-2}$. Given Assumption 8(4), the above inequality verifies Condition 2 for $\omega_{n}$ and $\Delta_{n}$ as in Proposition 8.
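
As a final informal check (not part of the proof), an exponential-moment inequality of the form displayed above can be verified by simulation for the check loss with a bounded loss difference; the data-generating choices, the rate $\omega$, and the constant $K=1/2$ below are hypothetical, and $v$ is taken to be the second moment of the loss difference.

import numpy as np

# Monte Carlo check of P exp{-omega (l_theta - l_theta*)} <= exp{2 omega^2 v - K omega m}
# for the check loss, a bounded competitor, and a small rate omega.
rng = np.random.default_rng(5)
tau, sigma, omega, K, n = 0.5, 0.5, 0.2, 0.5, 2_000_000

x = rng.uniform(-1, 1, n)
theta_star = np.sin(np.pi * x)                 # conditional tau-quantile of Y given X = x
y = theta_star + rng.normal(0, sigma, n)
theta = theta_star + 0.3                       # bounded loss difference: |l_theta - l_theta*| <= 0.3

def check_loss(u):
    return u * (tau - (u < 0))

z = check_loss(y - theta) - check_loss(y - theta_star)
m, v = z.mean(), np.mean(z ** 2)
lhs = np.mean(np.exp(-omega * z))
rhs = np.exp(2 * omega ** 2 * v - K * omega * m)
print(lhs <= rhs, lhs, rhs)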

C.9 Proof of Proposition 9

Proposition 9 follows from Theorem 1 by verifying (12) and Condition 1.

First, we verify (12). By definition

m(θ,θ)\displaystyle m(\theta,\theta^{\star}) =R(θ)R(θ)\displaystyle=R(\theta)-R(\theta^{\star})
=θ(z)θ(z)θ(z)θ(z)|2ηz(x)1|P(dx)P(dz).\displaystyle=\int_{\mathbb{Z}}\int_{\theta(z)\wedge\theta^{\star}(z)}^{\theta(z)\vee\theta^{\star}(z)}|2\eta_{z}(x)-1|P(dx)P(dz).

And, since θ\ell_{\theta} is bounded by 11,

v(θ,θ)θ(z)θ(z)θ(z)θ(z)P(dx)P(dz)=d(θ,θ).v(\theta,\theta^{\star})\leq\int_{\mathbb{Z}}\int_{\theta(z)\wedge\theta^{\star}(z)}^{\theta(z)\vee\theta^{\star}(z)}P(dx)P(dz)=d(\theta,\theta^{\star}).

Since $|2\eta_{z}(x)-1|\leq 1$ and, by Assumption 9(5),

θ(z)θ(z)θ(z)θ(z)P(dx)|θ(z)θ(z)|,\int_{\theta(z)\wedge\theta^{\star}(z)}^{\theta(z)\vee\theta^{\star}(z)}P(dx)\lesssim|\theta(z)-\theta^{\star}(z)|,

it follows that

{m(θ,θ)v(θ,θ)}θθL1(P):=|θ(z)θ(z)|P(dz).\{m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\}\lesssim\|\theta-\theta^{\star}\|_{L_{1}(P)}:=\int_{\mathbb{Z}}|\theta(z)-\theta^{\star}(z)|\,P(dz).

Let Jn=n1/(1+α)J_{n}=n^{1/(1+\alpha)} and define a sup-norm ball around θ\theta^{\star}

Bn={(β,J):βJ,J=Jn,θθ^J,βCJnα}.B_{n}^{\star}=\{(\beta,J):\beta\in\mathbb{R}^{J},J=J_{n},\|\theta^{\star}-\hat{\theta}_{J,\beta}\|_{\infty}\leq CJ_{n}^{-\alpha}\}.

Then, by Assumption 9(3), and using the same argument as in the proof of Theorem 1 in Shen and Ghosal, (2015) we have

Π(Bn)eCJnlogn,\Pi(B_{n}^{\star})\gtrsim e^{-CJ_{n}\log n},

for some C>0C>0. Since JnlognnωεnJ_{n}\log n\lesssim n\omega\varepsilon_{n} and θBn\theta\in B_{n}^{\star} implies {m(θ,θ)v(θ,θ)}εn\{m(\theta,\theta^{\star})\vee v(\theta,\theta^{\star})\}\lesssim\varepsilon_{n}, it follows that (14) holds with r=1r=1.

Next, we verify Condition 1. By Assumption 9(4)

m(θ,θ)\displaystyle m(\theta,\theta^{\star}) hθ(z)θ(z)θ(z)θ(z)P(dx)P(dz)=hd(θ,θ).\displaystyle\geq h\int_{\mathbb{Z}}\int_{\theta(z)\wedge\theta^{\star}(z)}^{\theta(z)\vee\theta^{\star}(z)}P(dx)P(dz)=h\,d(\theta,\theta^{\star}).

Then, Lemma 7.26 in Lafferty et al., (2010) implies

Peω(θθ)\displaystyle Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})} exp{Cω2v(θ,θ)ωm(θ,θ)}\displaystyle\leq\exp\{C\omega^{2}v(\theta,\theta^{\star})-\omega m(\theta,\theta^{\star})\}
exp{ω(hCω)d(θ,θ)},\displaystyle\leq\exp\{-\omega(h-C\omega)\,d(\theta,\theta^{\star})\},

where C>0C>0 depends only on ω\omega. For small ω\omega, C=O(1+ω)C=O(1+\omega), so if ω(1+ω)h\omega(1+\omega)\leq h, then

Peω(θθ)exp{Kωd(θ,θ)}Pe^{-\omega(\ell_{\theta}-\ell_{\theta^{\star}})}\leq\exp\{-K\omega d(\theta,\theta^{\star})\}

for a constant KK depending on hh, which verifies Condition 1 with r=1r=1.

References

  • Alquier, (2008) Alquier, P. (2008). PAC-Bayesian bounds for randomized empirical risk minimizers. Math. Methods Statist. 17(4):279–304.
  • Alquier et al., (2016) Alquier, P., Ridgway, J., and Chopin, N. (2016). On the properties of variational approximations of Gibbs posteriors. J. Mach. Learn. Res. 17:1–41.
  • Barron et al., (1999) Barron, A., Schervish, M. J., and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27(2):536–561.
  • Bhattacharya and Martin, (2022) Bhattacharya, I., and Martin, R. (2022). Gibbs posterior inference on multivariate quantiles. J. Statist. Plann. Inference 218:106–121.
  • Bissiri et al., (2016) Bissiri, P.G., Holmes, C.C., and Walker, S.G. (2016). A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B. Stat. Methodol. 78:1103–1130.
  • Boucheron et al., (2012) Boucheron, S., Lugosi, G., and Massart, P. (2012). Concentration Inequalities: A Nonasymptotic Theory of Independence. Clarendon Press, Oxford.
  • Castillo et al., (2015) Castillo, I., Schmidt-Hieber, J., and van der Vaart, A.W. (2015). Bayesian linear regression with sparse priors. Ann. Statist. 43(5):1986–2018.
  • Catoni, (2004) Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization, Lecture Notes in Mathematics. Springer-Verlag.
  • Chernozhukov and Hong, (2003) Chernozhukov, V. and Hong, H. (2003). An MCMC approach to classical estimation. J. Econometrics 115(2):293–346.
  • Chib et al., (2018) Chib, S., Shin, M., and Simoni, A. (2018). Bayesian estimation and comparison of moment condition models. J. Am. Stat. Assoc. 113(524):1656–1668.
  • Choudhuri et al., (2007) Choudhuri, N., Ghosal, S., and Roy, A. (2007). Nonparametric binary regression using a Gaussian process prior. Stat. Methodol. 4:227–243.
  • De Blasi and Walker, (2013) De Blasi, P., Walker, S. G. (2013). Bayesian asymptotics with misspecified models. Statist. Sinica 23:169–187.
  • Ghosal et al., (2000) Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 28(2):500–531.
  • Godambe, (1991) Godambe, V. P., ed. (1991). Estimating Functions. Oxford University Press, New York.
  • Grünwald, (2012) Grünwald, P. (2012). The safe Bayesian: learning the learning rate via the mixability gap. Algorithmic Learning Theory, Springer, Heidelberg, 7568:169–183.
  • Grünwald and Mehta, (2020) Grünwald, P. and Mehta N. (2020). Fast rates for general unbounded loss functions: From ERM to generalized Bayes. J. Mach. Learn. Res. 21:1–80.
  • Grünwald and van Ommen, (2017) Grünwald, P. and van Ommen, T. (2017). Inconsistency of Bayesian inference for misspecified models, and a proposal for repairing it. Bayesian Anal. 12:1069–1103.
  • Guedj, (2019) Guedj, B. (2019). A primer on PAC-Bayes learning. arXiv:1901.05353.
  • Hedayat et al., (2015) Hedayat, S., Wang, J., and Xu, T. (2015). Minimum clinically important difference in medical studies. Biometrics 71:33–41.
  • Hjort and Pollard, (1993) Hjort, N. L., and Pollard, D. (1993). Asymptotics for minimisers of convex processes. http://www.stat.yale.edu/~pollard/Papers/convex.pdf.
  • Holmes and Walker, (2017) Holmes, C. C., and Walker, S. G. (2017). Assigning a value to a power likelihood in a general Bayesian model. Biometrika 104(2):497–503.
  • Huber and Ronchetti, (2009) Huber, P.J., and Ronchetti, E. (2009). Robust Statistics. 2nd ed. Wiley, New York.
  • Jaeschke et al., (1989) Jaeschke, R., Singer, J., and Guyatt, G. (1989). Measurement of health status: ascertaining the minimal clinically important difference. Control. Clin. Trials 10:407–415.
  • Jiang and Tanner, (2008) Jiang, W. and Tanner, M. A. (2008). Gibbs posterior for variable selection in high-dimensional classification and data mining. Ann. Statist. 36:2207–2231.
  • Kim, (2002) Kim, J.-Y. (2002). Limited information likelihood and Bayesian analysis. J. Econom. 107(1-2):175–193.
  • Kleijn and van der Vaart, (2006) Kleijn, B. J. K. and van der Vaart, A. W. (2006). Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist. 34(2):837–877.
  • Koltchinskii, (1997) Koltchinskii, V. (1997). M-estimation, convexity and quantiles. Ann. Statist. 25(2):435–477.
  • Koltchinskii, (2006) Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34(6):2593–2656.
  • Lafferty et al., (2010) Lafferty, J., Liu, H., and Wasserman, L. (2010). Chapter 10: Concentration of Measure. In Statistical Machine Learning, http://www.stat.cmu.edu/~larry/=sml/Concentration.pdf.
  • Lyddon et al., (2019) Lyddon, S. P., Holmes, C. C., and Walker, S. G. (2019). General Bayesian updating and the loss-likelihood bootstrap. Biometrika. 106(2):465–478.
  • Mammen and Tsybakov, (1995) Mammen, E., and Tsybakov, A. B. (1995). Asymptotical minimax recovery of sets with smooth boundaries. Ann. Statist. 23(2):502–524.
  • Mammen and Tsybakov, (1999) Mammen, E., and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27(6):1808–1829.
  • Maronna et al., (2006) Maronna, R. A., Martin, D. R., and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics.
  • Martin et al., (2013) Martin, R., Hong, L., and Walker, S.G. (2013). A note on Bayesian convergence rates under local prior support conditions. arXiv:1201.3102.
  • Martin et al., (2017) Martin, R., Mess, R., and Walker, S.G. (2017). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli 23:1822–1847.
  • Massart and Nedelec, (2006) Massart, P., and Nedelec, E. (2006). Risk bounds for statistical learning. Ann. Statist. 34(5):2326–2366.
  • McAllester, (1999) McAllester, D. (1999). PAC-Bayesian model averaging. COLT‘99 164–170.
  • Ramamoorthi et al., (2015) Ramamoorthi, R.V., Sriram, K., and Martin, R. (2015). On posterior concentration in misspecified models. Bayesian Anal. 10(4):759–789.
  • Shen and Ghosal, (2015) Shen, W. and Ghosal, S. (2015). Adaptive Bayesian procedures using random series priors. Scand. J. Stat. 42:1194–1213.
  • Shen and Wasserman, (2001) Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. Ann. Statist. 29(3):687–714.
  • Sriram et al., (2013) Sriram, K., Ramamoorthi, R. V., and Ghosh, P. (2013). Posterior consistency of Bayesian quantile regression based on the misspecified asymmetric Laplace density. Bayesian Anal. 8(2):479–504.
  • Syring and Martin, (2017) Syring, N. and Martin, R. (2017). Gibbs posterior inference on the minimum clinically important difference. J. Statist. Plann. Inference 187:67–77.
  • Syring and Martin, (2019) Syring, N. and Martin, R. (2019). Calibration of general posterior credible regions. Biometrika 106(2):479–486.
  • Syring and Martin, (2020) Syring, N. and Martin, R. (2020). Robust and rate optimal Gibbs posterior inference on the boundary of a noisy image. Ann. Statist. 48(3):1498–1513.
  • Takeuchi et al., (2006) Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. (2006). Nonparametric quantile estimation. J. Mach. Learn. Res. 7:1231–1264.
  • Tsybakov, (2004) Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32(1):135–166.
  • Valiant, (1984) Valiant, L.G. (1984). A theory of the learnable. Communications of the ACM 27(11):1134–1142.
  • van der Vaart, (1998) van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
  • van Erven et al., (2015) van Erven, T., Grünwald, P., Mehta, N., Reid, M. and Williamson, R. (2015). Fast rates in statistical and online learning. J. Mach. Learn. Res. 16:1793–1861.
  • Wang and Martin, (2020) Wang, Z. and Martin, R. (2020). Model-free posterior inference on the area under the receiver operating characteristic curve. J. Statist. Plann. Inference 209:174–186.
  • Wu and Martin, (2022) Wu, P.-S. and Martin, R. (2022). A comparison of learning rate selection methods in generalized Bayesian inference. Bayesian Anal., to appear; arXiv:2012.11349.
  • Wu and Martin, (2021) Wu, P.-S. and Martin, R. (2021). Calibrating generalized predictive distributions. arXiv:2107.01688.
  • Zhang, (2006) Zhang, T. (2006). Information theoretical upper and lower bounds for statistical estimation. IEEE Trans. Inf. Theory 52(4):1307–1321.
  • Zhou et al., (2020) Zhou, Z, Zhao, J., and Bisson, L.J. (2020). Estimation of data adaptive minimal clinically important difference with a nonconvex optimization procedure. Stat. Methods Med. Res. 29(3):879–893.