Gibbs posterior concentration rates under sub-exponential type losses
Abstract
Bayesian posterior distributions are widely used for inference, but their dependence on a statistical model creates some challenges. In particular, there may be many nuisance parameters that require prior distributions and posterior computations, and there is a potentially serious risk of model misspecification bias. Gibbs posterior distributions, on the other hand, offer direct, principled, probabilistic inference on quantities of interest through a loss function, not a model-based likelihood. Here we provide simple sufficient conditions for establishing Gibbs posterior concentration rates when the loss function is of a sub-exponential type. We apply these general results in a range of practically relevant examples, including mean regression, quantile regression, and sparse high-dimensional classification. We also apply these techniques in an important problem in medical statistics, namely, estimation of a personalized minimum clinically important difference.
Keywords and phrases: classification; generalized Bayes; high-dimensional problem; M-estimation; model misspecification.
1 Introduction
A major selling point of the Bayesian framework is that it is normative: to solve a new problem, one only needs a statistical model/likelihood, a prior distribution for the model parameters, and the means to compute the corresponding posterior distribution. Bayesians’ obligation to specify a prior attracts criticism, but their need to specify a likelihood has a number of potentially negative consequences too, especially when the quantity of interest has meaning independent of a statistical model, like a quantile. On the one hand, even if the posited model is “correct,” it is rare that all the parameters of that model are relevant to the problem at hand. For example, if we are interested in a quantile of a distribution, which we model as skew-normal, then the focus is only on a specific real-valued function of the model parameters. In such a case, the non-trivial effort invested in dealing with these nuisance parameters, e.g., specifying prior distributions and designing computational algorithms, is effectively wasted. On the other hand, in the far more likely case where the posited model is “wrong,” that model misspecification can negatively impact one’s conclusions about the quantity of interest. For example, both skew-normal and Pareto models have a quantile, but the quality of inferences drawn about that quantile will vary depending on which of these two models is chosen.
The non-negligible dependence on the posited statistical model puts a burden on the data analyst, and those reluctant to take that risk tend to opt for a non-Bayesian approach. After all, if one can get a solution without specifying a statistical model, then it is impossible to incur model misspecification bias. But in taking such an approach, they give up the normative advantage of Bayesian analysis. Can they get the best of both worlds? That is, can one construct a posterior distribution for the quantity of interest directly, incorporating available prior information, without specifying a statistical model and incurring the associated model misspecification risks, and without the need for marginalization over nuisance parameters? Fortunately, the answer is Yes, and this is the present paper’s focus.
The so-called Gibbs posterior distribution is the proper prior-to-posterior update when data and the interest parameter are linked by a loss function rather than a likelihood (Catoni,, 2004; Zhang,, 2006; Bissiri et al.,, 2016). Intuitively, the Gibbs and Bayesian posterior distributions coincide when the loss function linking data and parameter is a (negative) log-likelihood. In that case the properties of the Gibbs posterior can be inferred from the literature on Bayesian asymptotics in both the well-specified and misspecified contexts. For cases where the link is not through a likelihood, the large-sample behavior of the Gibbs posterior is less clear and elucidating this behavior under some simple and fairly general conditions is our goal here.
As a practical example, medical investigators may want to know if a treatment whose effect has been judged to be statistically significant is also clinically significant in the sense that the patients feel better post-treatment. Therefore, they are interested in inference about the effect size cutoff beyond which patients feel better; this is called the minimum clinically important difference or MCID, e.g., Jaeschke et al., (1989). Estimation of the MCID boils down to a classification problem, and standard Bayesian approaches to binary regression do not perform well in this setting; misspecifying the link function leads to bias, and nonparametric modeling of the link function is inefficient (Choudhuri et al.,, 2007). Instead, we found that a Gibbs posterior distribution, as described above, provided a very reasonable and robust solution to the MCID problem (Syring and Martin,, 2017). In some applications, one seeks a “personalized” or subject-specific cutoff that depends on a set of additional covariates. This personalized MCID could be high- or even infinite-dimensional, and our previous Gibbs posterior analysis is not equipped to handle such situations. But the framework developed here is; see Section 6.
In the following sections we lay out and apply conditions under which a Gibbs posterior distribution concentrates, asymptotically, on a neighborhood of the true value of the inferential target as the sample size increases. Our focus is not on the most general set of sufficient conditions for concentration; rather, we aim for conditions that are both widely applicable and easily verified. To this end, we consider loss functions of a sub-exponential type, ones that satisfy an inequality similar to the moment-generating function bound for sub-exponential random variables (Boucheron et al.,, 2012). We can apply this condition in a variety of problems, from regression to classification, and in both fixed- and high-dimensional settings. An added advantage is that our conditions lead to straightforward proofs of concentration.
Section 2 provides some background and formally defines the Gibbs posterior distribution. In Section 3, we state our theoretical objectives and present our main results, namely, sets of sufficient conditions under which the Gibbs posterior achieves a specified asymptotic concentration rate. A unique attribute of the Gibbs posterior distribution is its dependence on a tuning parameter called the learning rate, and our results cover a constant, vanishing sequence, and even data-dependent learning rates. Section 4 further discusses verifying our conditions and extends our conditions and main results to handle certain unbounded loss functions. Section 5 applies our general theorems to establish Gibbs posterior concentration rates in a number of practically relevant examples, including nonparametric curve estimation, and high-dimensional sparse classification. Section 6 formulates the personalized MCID application, presents a relevant Gibbs posterior concentration rate result, and gives a brief numerical illustration. Concluding remarks are given in Section 7, and proofs, etc. are postponed to the appendix.
2 Background on Gibbs posteriors
2.1 Notation and definitions
Consider a measurable space $(\mathbb{Z}, \mathscr{Z})$, with $\mathscr{Z}$ a $\sigma$-algebra of subsets of $\mathbb{Z}$, on which a probability measure $P$ is defined. A random element $Z \sim P$ need not be a scalar, and many of the applications we have in mind involve $Z = (X, Y)$, where $Y$ denotes a “response” variable, $X$ denotes a “predictor” variable, and $P$ encodes the dependence between the entries in $Z$. Then the real-world phenomenon under investigation is determined by $P$ and our goal is to make inference on a relevant feature of $P$, which we define as a given functional $\theta = \theta(P)$, taking values in a space $\Theta$. Note that $\Theta$ could be finite-, high-, or even infinite-dimensional.
The specific way $\theta$ relates to $P$ guides our posterior construction. Suppose there is a loss function, $\ell_\theta(z)$, that measures how closely a generic value $\theta$ of the quantity of interest agrees with a data point $z$. (As is customary, “$\theta$” will denote both the quantity of interest and a generic value of that quantity; when we need to distinguish the true value from a generic value, we will write “$\theta^\star$.”) For example, if $z = (x, y)$ is a predictor–response pair, and $\theta$ is a function, then the loss might be
$\ell_\theta(z) = \{y - \theta(x)\}^2 \quad \text{or} \quad \ell_\theta(z) = 1\{y \neq \theta(x)\},$   (1)
depending on whether $y$ is continuous or discrete/binary, where $1\{A\}$ denotes the indicator function of the event $A$. Another common situation is when one specifies a statistical model, say, $\{P_\theta : \theta \in \Theta\}$, indexed by a parameter $\theta$, and sets $\ell_\theta(z) = -\log p_\theta(z)$, where $p_\theta$ is the density of $P_\theta$ with respect to some fixed dominating measure. In all of these cases, the idea is that a loss is incurred when there is a certain discrepancy between $\theta$ and the data point $z$. Then our inferential target is the value of $\theta$ that minimizes the risk or average loss/discrepancy.
Definition 2.1.
Consider a real-valued loss function $(z, \theta) \mapsto \ell_\theta(z)$ defined on $\mathbb{Z} \times \Theta$, and define the risk function $R(\theta) = P\ell_\theta = \int \ell_\theta \, dP$, the expected loss with respect to $P$; throughout, $Pf$ denotes the expected value of $f(Z)$ with respect to $Z \sim P$. Then the inferential target is
$\theta^\star \in \arg\min_{\theta \in \Theta} R(\theta).$   (2)
Given that estimation/inference is our goal, our focus will be on the case where the risk minimizer, $\theta^\star$, is unique, so that the “$\in$” in (2) is an equality. But this is not absolutely necessary for our theory. Indeed, the main results in Section 3 remain valid even if the risk minimizer is not unique, and we make a few brief comments about this extension in the discussion following Theorem 3.2.
The risk function itself is unavailable—it depends on $P$—and, therefore, so is $\theta^\star$. However, suppose that we have an independent and identically distributed (iid) sample $Z_1, \ldots, Z_n$ of size $n$, with each $Z_i$ having marginal distribution $P$ on $\mathbb{Z}$. The iid assumption is not crucial, but it makes the notation and discussion easier; an extension to independent but not identically distributed (inid) cases is discussed in the context of an example in Section 5.4. In general, we have data $Z^n = (Z_1, \ldots, Z_n)$ taking values in the measurable space $(\mathbb{Z}^n, \mathscr{Z}^n)$, with joint distribution denoted by $P^n$. From here, we can naturally replace the unknown risk in Definition 2.1 with an empirical version and proceed accordingly.
Definition 2.2.
For a loss function $\ell_\theta$ as described above, define the empirical risk as
$R_n(\theta) = \int \ell_\theta \, d\mathbb{P}_n = \frac{1}{n} \sum_{i=1}^n \ell_\theta(Z_i),$   (3)
where $\mathbb{P}_n = n^{-1} \sum_{i=1}^n \delta_{Z_i}$, with $\delta_z$ the Dirac point-mass measure at $z$, is the empirical distribution.
Naturally, if the inferential target is the risk minimizer, then it makes sense to estimate that quantity based on data by minimizing the empirical risk, i.e.,
$\hat\theta_n = \arg\min_{\theta \in \Theta} R_n(\theta).$   (4)
This is the M-estimator based on an objective function determined by the loss $\ell_\theta$; when $\theta \mapsto \ell_\theta(z)$ is differentiable, the root of $\dot R_n$, the derivative of $R_n$, is a Z-estimator and “$\dot R_n(\theta) = 0$” is often called an estimating equation (Godambe, 1991; van der Vaart, 1998). Since $\ell_\theta$ need not be smooth or convex, and $R_n$ is an average over a finite set of data, it is possible that its minimizer is not unique, even if $\theta^\star$ is. These computational challenges are, in fact, part of what motivates the Gibbs posterior, as we discuss below.
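To make Definitions 2.1–2.2 and the estimator in (4) concrete, here is a minimal numerical sketch for a scalar median (the 0.5-quantile), using the check loss that reappears in Section 5.1. The simulated data, the optimizer, and all tuning choices below are illustrative only and are not part of the paper's methodology.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def check_loss(theta, z, tau=0.5):
    """Check (pinball) loss ell_theta(z) for the tau-th quantile."""
    u = z - theta
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def empirical_risk(theta, data, tau=0.5):
    """Empirical risk R_n(theta): the sample average of the loss, as in (3)."""
    return np.mean(check_loss(theta, data, tau))

rng = np.random.default_rng(0)
data = rng.standard_t(df=3, size=200)   # no model is posited for the data

# M-estimator (4): minimizer of the empirical risk; for tau = 0.5 this is
# (numerically) the sample median
fit = minimize_scalar(empirical_risk, args=(data, 0.5))
print(fit.x, np.median(data))
```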
There is a rich literature on the asymptotic distribution properties of M-estimators, which can be used for developing hypothesis tests and confidence intervals (Maronna et al., 2006; Huber and Ronchetti, 2009). As an alternative, one might consider a Bayesian approach to quantify uncertainty, but there is an immediate obstacle, namely, no statistical model/likelihood connecting the data to the quantity of interest. If we did have a statistical model, with a density function $p_\theta$, then the most natural loss is $\ell_\theta(z) = -\log p_\theta(z)$ and the likelihood is $\prod_{i=1}^n p_\theta(Z_i) = e^{-n R_n(\theta)}$. It is, therefore, tempting to follow that same strategy for general losses, resulting in a sort of generalized posterior distribution for $\theta$.
Definition 2.3.
Given a loss function $\ell_\theta$ and the corresponding empirical risk $R_n$ in Definition 2.2, define the Gibbs posterior distribution as
$\Pi_n(d\theta) \propto e^{-\omega n R_n(\theta)} \, \Pi(d\theta),$   (5)
where $\Pi$ is a prior distribution and $\omega > 0$ is a so-called learning rate (Holmes and Walker, 2017; Syring and Martin, 2019; Grünwald, 2012; van Erven et al., 2015). The dependence of $\Pi_n$ on $\omega$ will generally be omitted from the notation, but see Sections 2.2 and 3.3.
We will assume that the right-hand side of (5) is integrable in $\theta$, so that the proportionality constant is well-defined. Integrability holds whenever the loss function is bounded from below, like those in (1), but it could fail in cases where the loss is not bounded away from $-\infty$, e.g., when $\ell_\theta$ is a negative log-density. In such cases, extra conditions on the prior distribution would be required to ensure the Gibbs posterior is well-defined.
An immediate advantage of this approach, compared to the M-estimation strategy described above, is that the user is able to incorporate available prior information about directly into the analysis. This is especially important in cases where the quantity of interest has a real-world meaning, as opposed to being just a model parameter, where having genuine prior information is the norm rather than the exception. Additionally, even though there is no likelihood, the same computational methods, such as Markov chain Monte Carlo (Chernozhukov and Hong,, 2003) and variational approximations (Alquier et al.,, 2016), common in Bayesian analysis, can be employed to numerically approximate the Gibbs posterior.
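To illustrate the point about computation, the sketch below draws from a Gibbs posterior of the form (5) by random-walk Metropolis, again for a median under the check loss and with a zero-centered Gaussian prior. The learning rate, prior scale, proposal step size, and burn-in are placeholder values chosen for illustration; in practice the learning rate would be set by one of the calibration methods cited in Section 2.2.

```python
import numpy as np

def check_loss(theta, z, tau=0.5):
    u = z - theta
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def log_gibbs(theta, data, omega, prior_sd=10.0):
    """Unnormalized log Gibbs posterior: -omega * n * R_n(theta) + log prior."""
    return -omega * np.sum(check_loss(theta, data)) - 0.5 * (theta / prior_sd) ** 2

def metropolis(data, omega=1.0, n_iter=5000, step=0.5, seed=1):
    """Random-walk Metropolis targeting the (unnormalized) Gibbs posterior."""
    rng = np.random.default_rng(seed)
    theta, draws = np.median(data), np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal()
        if np.log(rng.uniform()) < log_gibbs(prop, data, omega) - log_gibbs(theta, data, omega):
            theta = prop
        draws[t] = theta
    return draws

rng = np.random.default_rng(0)
data = rng.standard_t(df=3, size=200)
draws = metropolis(data)[1000:]          # discard burn-in
print(draws.mean(), np.quantile(draws, [0.025, 0.975]))
```

Only the unnormalized density in (5) is needed, exactly as in ordinary Bayesian MCMC; the loss-times-learning-rate term simply plays the role of the log-likelihood.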
We have opted here to define the Gibbs posterior directly as an object to be used and studied, but there is a more formal, more principled way in which Gibbs posteriors emerge. In the PAC-Bayes literature, the goal is to construct a randomized estimator that concentrates in regions of $\Theta$ where the risk, $R$, or its empirical version, $R_n$, is small (Valiant, 1984; McAllester, 1999; Alquier, 2008; Guedj, 2019). That is, the Gibbs posterior can be viewed as a solution to an optimization problem rather than a solution to the updating-prior-beliefs problem. More formally, for a given prior $\Pi$ on $\Theta$, suppose the goal is to find
$\inf_\mu \Bigl\{ \int_\Theta R_n(\theta)\,\mu(d\theta) + (\omega n)^{-1} K(\mu, \Pi) \Bigr\},$
where the infimum is over all probability measures $\mu$ that are absolutely continuous with respect to $\Pi$, and $K(\mu, \Pi)$ denotes the Kullback–Leibler divergence of $\mu$ from $\Pi$. Then it can be shown (e.g., Zhang, 2006; Bissiri et al., 2016) that the unique solution is $\Pi_n$, the Gibbs posterior defined in (5). Therefore, the Gibbs posterior distribution is the measure that minimizes the averaged empirical risk, penalized by divergence from the given prior, $\Pi$.
2.2 Learning rate
Readers familiar with M-estimation may not recognize the learning rate, $\omega$. It does not appear in the M-estimation context because all that influences the optimization problem—and the corresponding asymptotic distribution theory—is the shape of the loss/risk function, not its magnitude or scale. On the other hand, the learning rate is an essential component of the Gibbs posterior distribution in (5), since that distribution depends on both the shape and the scale of the loss function. Data-driven strategies for tuning the learning rate are available (Holmes and Walker, 2017; Syring and Martin, 2019; Lyddon et al., 2019; Wu and Martin, 2022, 2021).
Here we focus on how the learning rate affects posterior concentration. In typical examples, our results require the learning rate to be a sufficiently small constant. That constant depends on features of $P$, which are generally unknown, so, in practice, the learning rate can be taken to be a slowly vanishing sequence, which has a negligible effect on the concentration rate. In more challenging examples, we require the learning rate to vanish sufficiently fast in $n$; this is also the case in Grünwald and Mehta, (2020).
2.3 Relation to other generalized posterior distributions
A generalized posterior is any data-dependent distribution other than a well-specified Bayesian posterior. Examples include Gibbs and misspecified Bayesian posteriors, which we compare first.
A key characteristic of the misspecified Bayesian posterior is that it is accidentally misspecified. That is, the data analysts do their best to posit a sound model for the data-generating process and obtain the corresponding posterior for the model parameter. That model will typically be misspecified, so the aforementioned posterior will, under certain conditions, concentrate around the point that minimizes the Kullback–Leibler divergence of the model from the true distribution $P$ (Kleijn and van der Vaart, 2006; De Blasi and Walker, 2013; Ramamoorthi et al., 2015). The corresponding value of the feature of interest typically differs from the true value, so there is generally a bias that the data analyst can do nothing about. A Gibbs posterior, on the other hand, is purposely misspecified—no attempt is made to model $P$. Rather, it directly targets the feature of interest via the loss function that defines it. This strategy avoids model misspecification bias, but this point of view also sheds light on the importance of the choice of learning rate: since the data analyst knows the Gibbs posterior is not a correctly specified Bayesian posterior, they know the learning rate must be handled with care.
A number of authors have studied generalized posterior distributions formed by raising the likelihood to a power; see, e.g., Martin et al., (2017) and Grünwald and van Ommen, (2017), among others. These power-likelihood posteriors tend to be robust to misspecification of the probability model, and data-driven methods to tune the power are developed in, e.g., Grünwald and van Ommen, (2017). Such posteriors coincide with Gibbs posteriors based on a log-loss, with the learning rate playing the role of the power.
A common non-Bayesian approach to statistics bases inference on moment conditions of the form $E\{g(Z, \theta)\} = 0$ for given functions $g$, rather than on a fully specified likelihood. A number of authors have extended this framework to posterior inference. Kim, (2002) uses the moment conditions to construct a so-called limited-information likelihood to use in place of a fully specified likelihood in a Bayesian formulation. Similarly, Chernozhukov and Hong, (2003) construct a posterior by taking a (pseudo) log-likelihood equal to a quadratic form determined by the set of moment conditions. Chib et al., (2018) study Bayesian exponentially tilted empirical likelihood posterior distributions, also defined by moment conditions. In some cases the Gibbs posterior distribution coincides with the above moment-based methods. For instance, in the special case that the risk is differentiable at $\theta^\star$, risk minimization itself corresponds to a single moment condition, as spelled out below.
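To spell out that last connection, suppose, purely for the sake of this sketch, that $\theta \mapsto \ell_\theta(z)$ is differentiable and that differentiation and expectation can be interchanged. Then the first-order condition for risk minimization,
$\nabla_\theta R(\theta^\star) = \int \nabla_\theta \ell_{\theta^\star}(z) \, P(dz) = 0,$
is a moment condition with moment function $g(z, \theta) = \nabla_\theta \ell_\theta(z)$, so the Gibbs posterior built from $\ell_\theta$ targets the same parameter value as a moment-based posterior built from this $g$.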
3 Asymptotic concentration rates
3.1 Objective
A large part of the Bayesian literature concerns the asymptotic concentration properties of their posterior distributions. Roughly, if data are generated from a distribution for which the quantity of interest takes value , then, as , the posterior distribution ought to concentrate its mass around that same . As we will show, optimal concentration rates are possible with Gibbs posteriors, so the robustness achieved by not specifying a statistical model has no cost in terms of (asymptotic) efficiency.
Towards this, for a fixed , let denote a divergence measure on in the sense that for all , with equality if and only if . The divergence measure could depend on the sample size or other deterministic features of the problem at hand, especially in the independent but not iid setting; see Section 5.4. Our objective is to provide conditions under which the Gibbs posterior will concentrate asymptotically, at a certain rate, around relative to the divergence measure . Throughout this paper, denotes a deterministic sequence of positive numbers with , which will be referred to as the Gibbs posterior concentration rate.
Definition 3.1.
The Gibbs posterior in (5) asymptotically concentrates around at rate (at least) , with respect to , if
(6) |
where is either a (deterministic) sequence satisfying arbitrarily slowly or is a sufficiently large constant, .
In the PAC-Bayes literature, the Gibbs posterior distribution is interpreted as a “randomized estimator,” a generator of random values that tend to make the risk difference small. For iid data and with risk divergence , a concentration result like that in Definition 3.1 makes this strategy clear, since the -probability of the event would be .
If the Gibbs posterior concentrates around in the sense of Definition 3.1, then any reasonable estimator derived from that distribution, such as the mean, should inherit the rate at relative to the divergence measure . This can be made formal under certain conditions on ; see, e.g., Corollary 1 in Barron et al., (1999) and the discussion following the proof of Theorem 2.5 in Ghosal et al., (2000).
Besides concentration rates, in certain cases it is possible to establish distributional approximations to Gibbs posteriors, i.e., Bernstein–von Mises theorems. Results for finite-dimensional problems and with sufficiently smooth loss functions can be found in, e.g., Bhattacharya and Martin, (2022) and Chernozhukov and Hong, (2003).
3.2 Conditions
Here we discuss a general strategy for proving Gibbs posterior concentration and the kinds of sufficient conditions needed for the strategy to be successful. To start, set . Then our first step towards proving concentration is to express as the ratio
(7) |
The goal is to suitably upper and lower bound and , respectively, in such a way that the ratios of these bounds vanish. Two sufficient conditions are discussed below, the first primarily dealing with the loss function and aiding in bounding and the second primarily concerning the prior distribution and aiding in bounding . Both conditions concern the excess loss and its mean and variance:
3.2.1 Sub-exponential type losses
Our method for proving posterior concentration requires a vanishing upper bound on the expectation of the numerator term in (7). Since the integrand defining is non-negative, Fubini’s theorem allows the expectation and the integration over the parameter to be interchanged, so it suffices to bound the expectation of the integrand. Further, by independence
which reveals that the key to bounding is to bound , the expected exponentiated excess loss. In order for the bound on to vanish it must be that , but this is not enough to identify the concentration rate in (6). Rather, the speed at which vanishes must be a function of , and we take this relationship as our key condition for Gibbs posterior concentration. When this holds we say the loss function is of sub-exponential type.
Condition 1.
There exists an interval and constants , such that for all and for all sufficiently small , for
(8) |
(The constant that appears here and below can take on different values depending on the context. The case is common, but some “non-regular” problems require ; see Section 5.5.)
An immediate consequence of Condition 1 and the definition of is the key finite-sample, exponential bound on the Gibbs posterior numerator
(9) |
For some intuition behind Condition 1, consider the following. Let be a real-valued function such that the random variable has a distribution with sufficiently thin tails. This is automatic when is bounded, but suppose has a moment-generating function, i.e., is sub-exponential. For bounding the moment-generating function, the dream case would be if . Unfortunately, Jensen’s inequality implies the dream is an impossibility. It is possible, however, to show
for suitable depending on certain features of (and of ). If it could also be shown, e.g., that , then we are in a “near-dream” case where
The little extra needed beyond sub-exponentiality to achieve the “near-dream” case bound above is why we refer to such losses as being of sub-exponential type.
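To see how the “near-dream” bound can arise, write $f(Z)$ for the excess loss in question and suppose, purely for illustration, that its lower tail is sub-Gaussian, $E\,e^{-\omega f(Z)} \le \exp\{-\omega\,E f(Z) + \tfrac12 \omega^2 v\}$ for small $\omega > 0$, and that the variance proxy satisfies a Bernstein-type bound $v \le c\,E f(Z)$ with $E f(Z) \ge 0$. Then
$E\,e^{-\omega f(Z)} \le \exp\bigl\{-\omega\,E f(Z)\,(1 - \tfrac12 \omega c)\bigr\} \le \exp\bigl\{-\tfrac12\,\omega\,E f(Z)\bigr\}, \qquad 0 < \omega \le c^{-1},$
which is a bound of the “near-dream” form, with the constant $\tfrac12$ in place of 1.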
Towards verifying Condition 1 we briefly review some developments in Grünwald and Mehta, (2020); further comments can be found in Section 3.4. These authors focus on an annealed expectation which, for a real-valued function of the random element and a fixed constant , is defined as . With this, we find that . So an upper bound as we require in Condition 1 is equivalent to a corresponding lower bound on . The “strong central condition” in Grünwald and Mehta, (2020) states that,
The above inequality implies and, in turn, that as . The other conditions in Grünwald and Mehta, (2020) aim to lower-bound the annealed expected excess loss by a suitable function of the excess risk. For example, Lemma 13 in Grünwald and Mehta, (2020) shows that a “witness condition” implies
so, if is of the order for some , then we recover Condition 1. Therefore, our Condition 1 is exactly what is needed to control the numerator of the Gibbs posterior, and the strong central and witness conditions developed in Grünwald and Mehta, (2020) and elsewhere constitute a set of sufficient conditions for our Condition 1.
A subtle difference between our approach and Grünwald and Mehta’s is that the bounds in the above two displays are required for all ; Condition 1 only deals with bounded away from . This difference arises because we directly target a bound on the Gibbs posterior probability, , an integration over , whereas Grünwald and Mehta, (2020) target a bound on the Gibbs posterior mean of , an integration over all of . Example 3.8 of van Erven et al., (2015) illustrates that the global requirement in Grünwald and Mehta, (2020)’s condition can be a disadvantage, as it creates some challenges in checking the bounds in the above two displays.
Besides Condition 1, other strategies for bounding in (7) have been employed in the literature. For example, empirical process techniques are used in Syring and Martin, (2017); Bhattacharya and Martin, (2022) to prove Gibbs posterior concentration in finite-dimensional applications. Generally, such proofs hinge on a uniform law of large numbers, which can be challenging to verify in non-parametric problems. Chernozhukov and Hong, (2003) require the stronger condition that in -probability. When it holds they immediately recover an in-probability bound on , irrespective of , but they do not obtain a finite-sample bound on . Their analysis is limited to finite-dimensional parameters and empirical risk functions that are smooth in a neighborhood of . This latter condition excludes important examples like the misclassification-error loss function; see Section 5.5.
There are situations in which Condition 1 does not hold, and we address this formally in Section 4. As an example, note that both Condition 1 and the witness condition used in Grünwald and Mehta, (2020) are closely related to a Bernstein-type condition relating the first two moments of the excess loss: for constants . For the “check” loss used in quantile estimation, the Bernstein condition is generally not satisfied and, consequently, neither Grünwald and Mehta’s witness condition nor our Condition 1—at least not in their original forms—can be verified. Similarly, if the excess loss, , a function of data , is heavy-tailed, then the moment-generating function bound in Condition 1 does not hold. For these cases, some modifications to the basic setup are required, which we present in Section 4.
3.2.2 Prior distribution
Generally, the prior must place enough mass on certain “risk-function” neighborhoods . This is analogous to the requirement in the Bayesian literature that the prior place sufficient mass on Kullback–Leibler neighborhoods of ; see, e.g., Shen and Wasserman, (2001) and Ghosal et al., (2000). Some version of the following prior bound is needed
(10) |
for as in Condition 1. Lemma 1 in the appendix, along with (10), provides a lower probability bound on the denominator term defined in (7). That is,
(11) |
where . Both the form of the risk-function neighborhoods and the precise lower bound in (10) depend on the concentration rate and the learning rate, so the results in Section 3.3 all require their own version of the above prior bound.
Grünwald and Mehta, (2020) only require bounds on the prior mass of the larger neighborhoods . Under their condition, we can derive a lower bound on similar to Lemma 1 in Martin et al., (2013). However, our proofs require an in-probability lower bound on , which in turn requires stronger prior concentration like that in (10). While there are important examples where the lower bounds on and are of different orders, none of the applications we consider here are of that type. Therefore, the stronger prior concentration condition in (10) does not affect the rates we derive for the examples in Section 5. Moreover, as discussed following the statement of Theorem 3.2 below, our finite-sample bounds are better than those in Grünwald and Mehta, (2020), a direct consequence of our method of proof that uses a smaller neighborhood .
3.3 Main results
In this section we present general results on Gibbs posterior concentration. Proofs can be found in Section 1 of the supplementary material. Our first result establishes Gibbs posterior concentration, under Condition 1 and a local prior condition, for sufficiently small constant learning rates.
Theorem 3.2.
Let be a vanishing sequence satisfying for a constant . Suppose the prior satisfies
(12) |
for and for divergence measure , the same as above, and learning rate for some . If the loss function satisfies Condition 1, then the Gibbs posterior distribution in (5) has asymptotic concentration rate , with a large constant as in Definition 3.1.
For a brief sketch of the proof bound the posterior probability of by
where is as above. Taking expectation of both sides, and applying (11) and the consequence (9) of Condition 1, we get
Then the right-hand side is generally of order . To compare with the results in Grünwald and Mehta, (2020, Example 2), their upper bound on is , which vanishes arbitrarily slowly.
For the case where the risk minimizer in (2) is not unique, certain modifications of the above argument can be made, similar to those in Kleijn and van der Vaart, (2006, Theorem 2.4). Roughly, to our Theorem 3.2 above, we would add the requirements that (12) and Condition 1 hold uniformly in , where is the set of risk minimizers. Then virtually the same proof shows that , where .
Theorem 3.2 is quite flexible and can be applied in a range of settings; see Section 5. However, one case in which it cannot be directly applied is when is bounded. For example, in sufficiently smooth finite-dimensional problems, we have and the target rate is . The difficulty is caused by the prior bound in (12), since it is impossible—at least with a fixed prior—to assign mass bounded away from 0 to a shrinking neighborhood of . One option is to add a logarithmic factor to the rate, i.e., take , so that is a power of . Alternatively, a refinement of the proof of Theorem 3.2 lets us avoid slowing down the rate.
Theorem 3.3.
Consider a finite-dimensional , taking values in for some . Suppose that the target rate is such that is bounded for some constant . If the prior satisfies
(13) |
and if Condition 1 holds, then the Gibbs posterior distribution in (5), with any learning rate , has asymptotic concentration rate at with respect to any divergence satisfying and for any diverging, positive sequence in Definition 3.1.
The learning rate is critical to the Gibbs posterior’s performance, but in applications it may be challenging to determine the upper bound for which Condition 1 holds. For a simple illustration, suppose the excess loss is normally distributed with variance satisfying , a kind of Bernstein condition. In this case, Condition 1 holds for , with , if . Bernstein conditions can be verified in many practical examples, but the factor would rarely be known. Consequently, we need to be sufficiently small, but the meaning of “sufficiently small” depends on unknowns involving . However, any positive, vanishing learning rate sequence satisfies for all sufficiently large . And if vanishes arbitrarily slowly, then it has no effect on the Gibbs posterior concentration rate; see Section 5.6. All we require to accommodate a vanishing learning rate is a slightly stronger -dependent version of the prior concentration bound in (12).
Theorem 3.4.
Let be a vanishing sequence and be a learning rate sequence satisfying for a constant . Consider a Gibbs posterior distribution in (5) based on this sequence of learning rates. If the prior satisfies
(14) |
and if Condition 1 holds for , then the Gibbs posterior distribution in (5), with learning rate sequence , has concentration rate at for a sufficiently large constant in Definition 3.1.
The proof of Theorem 3.4 is almost identical to that of Theorem 3.2, hence omitted. But to see the basic idea, we mention two key observations. First, since Condition 1 holds for for all sufficiently large , and since is deterministic, by the same argument producing the bound in (9), we get
Second, the same argument producing the bound in (11) shows that
Then the only difference between this situation and that in Theorem 3.2 is that, here, the numerator bound only holds for “sufficiently large ,” instead of for all . This makes Theorem 3.4 slightly weaker than Theorem 3.2, since there are no finite-sample bounds.
When the learning rate is a constant, it is absorbed by and there is no difference between the prior bounds in (12) and (14). But the prior probability assigned to the -neighborhood of does not depend on , so if it satisfies (12), then the only way it can also satisfy (14) is if is bigger than it would have been without a vanishing learning rate. In other words, Theorem 3.2 requires whereas Theorem 3.4 requires , which implies that, for a given vanishing , Theorem 3.2 can potentially accommodate a faster rate . Therefore, a vanishing learning rate can slow down the Gibbs posterior concentration rate if it does not vanish arbitrarily slowly. There are applications that require the learning rate to vanish at a particular -dependent rate, and these tend to be those where adjustments like in Section 4 are needed; see Sections 5.3.2 and 5.5.2.
If we can use the data to estimate consistently, then it makes sense to choose a learning rate sequence depending on this estimator. Suppose our data-dependent learning rate satisfies with probability converging to as . Then the conclusion of Theorem 3.2 holds for all sufficiently large ; see Section 4.2 in the Supplementary Material. One advantage of this strategy is that it avoids using a vanishing learning rate, which may slow concentration.
Theorem 3.5.
Fix a positive deterministic learning rate sequence such that the conditions of Theorem 3.4 hold and as a result has asymptotic concentration rate . Consider a random learning rate sequence satisfying
(15) |
Then , the Gibbs posterior distribution in (5) scaled by the random learning rate sequence , also has concentration rate at for a sufficiently large constant in Definition 3.1.
3.4 Checking Condition 1
Of course, Condition 1 is only useful if it can be checked in practically relevant examples. As mentioned in Section 3.2, Grünwald and Mehta’s strong central and witness conditions are sufficient for Condition 1. A pair of slightly stronger, but practically verifiable, conditions require that the excess loss be sub-exponential and that its first and second moments obey a Bernstein condition. For the examples we consider in Section 5 where Condition 1 can be used, these two conditions are convenient.
3.4.1 Bounded excess losses
For bounded excess losses, i.e., for all , Lemma 7.26 in Lafferty et al., (2010) gives:
(16) |
Whether Condition 1 holds depends on the choice of and the relationship between and as defined by the Bernstein condition:
(17) |
Denote the bracketed expression in the exponent of (16) by . When , the and functions are of the same order and, if , then Condition 1 holds with . Since for small , it suffices to take the learning rate to be a sufficiently small constant.
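For a concrete version of this first strategy, suppose, for illustration, that the excess loss $f(Z)$ satisfies $|f(Z)| \le b$ and the Bernstein condition $E f(Z)^2 \le c\,E f(Z)$ with exponent 1. Using the elementary inequality $e^u \le 1 + u + u^2$ for $u \le 1$, together with $1 + x \le e^x$, gives, for $0 < \omega \le \min\{b^{-1}, (2c)^{-1}\}$,
$E\,e^{-\omega f(Z)} \le 1 - \omega\,E f(Z) + \omega^2 E f(Z)^2 \le \exp\bigl\{-\omega\,E f(Z)\,(1 - \omega c)\bigr\} \le \exp\bigl\{-\tfrac12\,\omega\,E f(Z)\bigr\},$
so a sufficiently small constant learning rate suffices in this case, in line with the preceding discussion.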
On the other hand, when is larger than , i.e., the Bernstein condition holds with , Condition 1 requires that depend on and . Suppose . For all small enough , we can set and Condition 1 is again satisfied with .
The above strategies are implemented in the classification examples in Section 5.5, where we also discuss connections to the Tsybakov (Tsybakov, 2004) and Massart (Massart and Nédélec, 2006) conditions. For the former, see Theorem 22 and the subsequent discussion in Grünwald and Mehta, (2020), where the learning rate ends up being a vanishing sequence depending on the concentration rate and the Bernstein exponent .
3.4.2 Sub-exponential excess losses
Unbounded but light-tailed loss differences may also satisfy Condition 1. Both sub-exponential and sub-Gaussian random variables (Boucheron et al.,, 2012, Sec. 2.3–4) admit an upper bound on their moment-generating functions. When the loss difference is sub-Gaussian
(18) |
for all , where the variance proxy may depend on . If is sub-exponential, then the above bound holds for all for some indexing the tail behavior of .
If is upper-bounded by for a constant , then the above bound can be rewritten as
(19) |
Then Condition 1 holds when the loss difference is sub-Gaussian or sub-exponential, provided and , respectively.
In practice it may be awkward to assume is sub-exponential, but in certain problems it is sufficient to make such an assumption about features of , which may be more reasonable. See Section 5.4 for an application of this idea to a fixed-design regression problem, where the fact that the response variable is sub-Gaussian implies that the excess loss is itself sub-Gaussian.
4 Extensions
4.1 Locally sub-exponential type loss functions
In some cases the moment generating function bound in Condition 1 can be verified in a neighborhood of but not for all . For example, suppose is Lipschitz with respect to a metric with uniformly bounded Lipschitz constant , and that, for some ,
That is, the Bernstein condition (17) holds but for different values of depending on . This is the case for quantile regression; see Section 5.6. This class of problems, where the Bernstein exponent varies across , apparently has not been considered previously. For example, Grünwald and Mehta, (2020) only consider cases where the Bernstein exponent is constant across the entire parameter space; consequently, their results assume is bounded (e.g., Grünwald and Mehta,, 2020, Example 10), so (17) holds trivially with exponent .
To address this problem, our idea is a simple one. We propose to introduce a sieve which is large enough that we can safely assume it contains but small enough that the and functions can be appropriately controlled on it. Towards this, let be an increasing sequence of subsets of , indexed by the sample size . The “size” of will play an important role in the result below. While more general sieves are possible, to keep the notion of size concrete, let , so that size is controlled by the non-decreasing sequence , which would typically satisfy as . The metric in the definition of is at the user’s discretion; it would be chosen so that provides information that can be used to control the excess loss . This leads to the following straightforward modification of Condition 1.
Condition 2.
For with size controlled by , there exists an interval , a constant , and a sequence such that, for all and for all small ,
(20) |
Aside from the restriction , the key difference between the bounds here and in Condition 1 is in the exponent. Instead of there being a constant , there is a sequence which is determined by the sieve’s size, controlled by . If the sequence is increasing, as we expect it will be, then we can anticipate that the overall concentration rate would be adversely affected—unless the learning rate is vanishing suitably fast. The following theorem explains this more precisely.
Condition 2 can be used exactly as Condition 1 to upper bound . However, the Gibbs posterior probability assigned to must be handled separately.
Theorem 4.1.
Let be a sequence of subsets for which the loss function satisfies Condition 2 for a sequence , a constant , and a learning rate for all sufficiently large . Let be a vanishing sequence satisfying . Suppose the prior satisfies
(21) |
for some and the same , , and as above. Then the Gibbs posterior in (5) satisfies
Consequently, if
(22) |
then the Gibbs posterior has concentration rate for all large constants in Definition 3.1.
There are two aspects of Theorem 4.1 that deserve further explanation. We start with the point about separate handling of . Condition (22) is easy to check for well-specified Bayesian posteriors, but these results do not carry over even to a misspecified Bayesian model. Of course, (22) can always be arranged by restricting the support of the prior distribution to , which is the suggestion in Kleijn and van der Vaart, (2006, Theorem 2.3) for the Bayesian case and how we handle (22) for our infinite-dimensional Gibbs example in Section 5.6. However, as (22) suggests, this is not entirely necessary. Indeed, Kleijn and van der Vaart, (2006) describe a trade-off between model complexity and prior support, and offer a more complicated form of their sufficient condition, which they do not explore in that paper. The fact is, without a well-specified likelihood, checking a condition like our (22) or Equation (2.13) in Kleijn and van der Vaart, (2006) is a challenge, at least in infinite-dimensional problems. For finite-dimensional problems, it may be possible to verify (22) directly using properties of the loss functions. For example, we use convexity of the check loss function for quantile estimation to verify (22) in Section 5.1.
Next, how might one proceed to check Condition 2? Go back to the Lipschitz loss case at the start of this subsection. For a sieve as described above, if and are in , then is bounded by a multiple of . This, together with the Lipschitz property and (16), implies
where . Suppose the user chooses and such that ; then we can replace by a constant . If there exist functions and such that
(23) |
then the above upper bound simplifies to
Then Condition 2 holds with , and it remains to balance the choices of and to achieve the best possible concentration rate ; see Section 5.6.
4.2 Clipping the loss function
When the excess loss is heavy-tailed, i.e., not sub-exponential, as in Section 5.3, its moment-generating function does not exist and, therefore, Condition 1 cannot be satisfied. In this section, we assume the loss function is non-negative or lower-bounded by a negative constant; in the latter case we can work with the shifted loss, i.e., the loss minus its lower bound. Many practically useful loss functions are non-negative, including the squared-error loss, which we cover in Section 5.3.
In such cases, it may be reasonable to replace the heavy-tailed loss with a clipped version
where is a diverging clipping sequence. Since is bounded in for each fixed , the strategy for checking Condition 1 described in Section 3.4.1 suggests that, for certain choices of , places vanishing mass on the sequence of sets , where . This makes the the (moving) target of the Gibbs posterior, instead of the fixed . On the other hand, if the loss function admits more than one finite moment, then for a corresponding, increasing clipping sequence the clipped risk neighborhoods of contain the risk neighborhoods of for all large , that is,
(24) |
Then we further expect concentration of the clipped loss-based Gibbs posterior at with respect to the excess risk divergence at rate . Condition 3 and Theorem 4.2 below provide a set of sufficient conditions under which these expectations are realized and a concentration rate can be established.
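As a minimal sketch of the clipping device, the snippet below clips a squared-error loss at a level that grows with the sample size and averages the result. The linear model, the heavy-tailed errors, and the particular growth rate of the clip are placeholders for illustration, since the appropriate clipping sequence is problem-dependent (see Section 5.3.2).

```python
import numpy as np

def clipped_empirical_risk(theta, X, Y, clip):
    """Empirical risk of the clipped squared-error loss min{loss, clip}."""
    losses = (Y - X @ theta) ** 2          # possibly heavy-tailed losses
    return np.mean(np.minimum(losses, clip))

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
Y = X @ np.ones(p) + rng.standard_t(df=3, size=n)   # heavy-tailed errors

clip = np.log(n) ** 2       # placeholder clipping sequence; grows slowly with n
print(clipped_empirical_risk(np.ones(p), X, Y, clip))
```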
Condition 3.
Let be the loss. Define as the clipped loss and as a sieve, depending on constants .
1. For some , the sequence is finite for all .
2. There exists a sequence , and a sequence , such that for all sequences and for all sufficiently small ,
(25)
Theorem 4.2.
For a given loss and sieve , suppose that Condition 3 holds for ; and let be as defined in Condition 3.1, for . Let be a vanishing sequence such that , and suppose the prior satisfies
(26) |
Then the Gibbs posterior in (5) based on the clipped loss satisfies
where . Consequently, if
then the Gibbs posterior has asymptotic concentration rate at with respect to the excess risk .
The setup here is more complicated than in previous sections, so some further explanation is warranted. First, we sketch out how Condition 3.1 leads to the critical property (24). A well-known bound on the expectation of a non-negative random variable, plus the moment bound in Condition 3.1, for , and Markov’s inequality leads to
(27) |
This, in turn, implies . If , then the difference between the risk and the clipped risk is vanishing, so we can bound by a multiple of the excess risk for all sufficiently large . In the application we explore in Section 5.3, is related to the radius of the sieve , which grows logarithmically in , while is related to the polynomial tail behavior of , so that happens naturally if . Additional details are given in the proof of Theorem 4.2, which can be found in Appendix A.
Second, how might Condition 3.2 be checked? Start by defining the excess clipped risk and the corresponding variance . Now suppose it can be shown that there exists a function such that
is the size index of the sieve. This amounts to the excess clipped loss satisfying a Bernstein condition (17), with exponent . The excess clipped loss itself is (and maybe substantially smaller, depending on form of ). So, we can apply the moment-generating function bound in (16) for bounded excess losses to get
where , for a constant , provided that . Now it is easy to see that the above display implies Condition 3.2.
For the rate calculation with respect to the excess risk, the decomposition of based on (27) implies we need , subject to the constraint , for as above. As we see in Section 5.3, , , and can often be taken as powers of , so the critical components determining the rate are and . The optimal rate depends on the upper bound of the the excess clipped loss, which, in the worst case, equals . To apply (16) we need the learning rate to vanish like the reciprocal of this bound, so we take . Then, we determine to satisfy and , up to a log term. The clipping sequence is sufficient, and yields the rate , modulo log terms.
5 Examples
This section presents several illustrations of the general theory presented in Sections 3 and 4. The strategies laid out in Sections 3.4, 4.1, and 4.2 are put to use in the following examples to verify our sufficient conditions for Gibbs posterior concentration. All proofs of results in this section can be found in Appendix C.
5.1 Quantile regression
Consider making inferences on the conditional quantile of a response given a predictor . We model this quantile, denoted , as a linear combination of functions of , that is, , for a fixed, finite dictionary of functions and where is a coefficient vector with . Here we assume the model is well-specified so the true conditional quantile is for some . The standard check loss for quantile estimation is
(28) |
We show that minimizes in the proof of Proposition 1 below. It can be shown that is -Lipschitz, with , and convex, so the strategy in Section 4.1 is helpful here for verifying Condition 2, and Lemma 2 in Appendix B can be used to verify (22).
Inference on quantiles is a challenging problem from a Bayesian perspective because the quantile is well-defined irrespective of any particular likelihood. Sriram et al., (2013) interprets the check loss as the negative log-density of an asymmetric Laplace distribution and constructs a corresponding pseudo-posterior using this likelihood, but their posterior is effectively a Gibbs posterior as in Definition 2.3.
With a few mild assumptions about the underlying distribution , our general result in Theorem 4.1 can be used to establish Gibbs posterior concentration at rate .
Assumption 1.
1. The marginal distribution of is such that exists and is positive definite;
2. the conditional distribution of , given , has at least one finite moment and admits a continuous density such that is bounded away from zero for -almost all ; and
3. the prior has a density bounded away from 0 in a neighborhood of .
Proposition 1.
Under Assumption 1, if the learning rate is sufficiently small, then the Gibbs posterior concentrates at with rate with respect to .
5.2 Area under the receiver operating characteristic curve
The receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) are diagnostic tools often used to judge the effectiveness of a binary classifier. Suppose a binary classifier produces a score characterizing the likelihood that an individual belongs to Group 1 rather than Group 0. We can estimate an individual’s group by comparing the score to a cutoff, where different values of the cutoff score may provide more or less accurate estimates. Suppose and are independent scores corresponding to individuals from Group and Group , respectively. The specificity and sensitivity of the test that rejects when are defined by and . When the type I and type II errors of the test are equally costly, the optimal cutoff is the value maximizing , or, in other words, the test maximizing the sum of the power and one minus the type I error probability. The ROC is the parametric curve in which provides a graphical summary of the trade-off between type I and type II errors for different choices of the cutoff. The AUC, equal to , gives an overall numerical summary of the quality of the binary classifier, independent of the choice of cutoff.
Our goal is to make posterior inferences on the AUC, but the usual Bayesian approach immediately runs into the kinds of problems we see in the examples in Sections 5.1 and later in 6. The parameter of interest is one-dimensional, but it depends on a completely unknown joint distribution . Within a Bayesian framework, the options are to fix a parametric model for this joint distribution and risk model misspecification or work with a complicated nonparametric model. Wang and Martin, (2020) constructed a Gibbs posterior for the AUC that avoids both of these issues.
Suppose and denote random samples of size and , respectively, of binary classifier scores for individuals belonging to Groups 0 and 1, and denote . Wang and Martin, (2020) consider the loss function
(29) |
for which the risk satisfies . If we interpret as a function of , then it makes sense to write the empirical risk function as
(30) |
Note the minimizer of the empirical risk is equal to
(31) |
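If, as we assume here for illustration, the loss in (29) compares the AUC parameter with the pairwise concordance indicators via squared error, then the empirical risk minimizer in (31) is the familiar Mann–Whitney estimate of the AUC; a small numerical sketch, with simulated Gaussian scores purely for illustration:

```python
import numpy as np

def empirical_auc_risk(theta, x0, x1):
    """Empirical risk for the AUC parameter theta under a squared-error
    pairwise loss between theta and the indicators 1{X1 > X0} (illustrative)."""
    conc = (x1[:, None] > x0[None, :]).astype(float)   # n1-by-n0 concordance
    return np.mean((conc - theta) ** 2)

def mann_whitney_auc(x0, x1):
    """Minimizer of the empirical risk: the Mann-Whitney AUC estimate."""
    return np.mean(x1[:, None] > x0[None, :])

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, size=100)    # classifier scores, Group 0
x1 = rng.normal(1.0, 1.0, size=120)    # classifier scores, Group 1
theta_hat = mann_whitney_auc(x0, x1)   # true AUC here is about 0.76
print(theta_hat, empirical_auc_risk(theta_hat, x0, x1))
```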
Wang and Martin, (2020) prove concentration of the Gibbs posterior at rate under the following assumption.
Assumption 2.
1. The sample sizes satisfy .
2. The prior distribution has a density function that is bounded away from zero in a neighborhood of .
Wang and Martin, (2020) note that their concentration result holds for fixed learning rates and for deterministic learning rates that vanish more slowly than . As discussed in Syring and Martin, (2019), one motivation for choosing a particular learning rate is to calibrate Gibbs posterior credible intervals to attain a nominal coverage probability, at least approximately. With this goal in mind, Wang and Martin, (2020) suggest the following random learning rate. Define the covariances
(32) |
Wang and Martin, (2020) note the asymptotic covariance of is given by
(33) |
and that the Gibbs posterior variance can be made to match this, at least asymptotically, by using the random learning rate
(34) |
where and are the corresponding empirical covariances:
(35) |
The hope is that the Gibbs posterior with the learning rate has asymptotically calibrated credible intervals. It turns out that the concentration result in Wang and Martin, (2020), together with Theorem 3.5, implies that the Gibbs posterior with a slightly adjusted version of the learning rate also concentrates at rate . The adjustment to the learning rate has the effect of slightly widening Gibbs posterior credible intervals, so their asymptotic calibration is not adversely affected.
Proposition 2.
Suppose Assumption 2 holds and let denote any diverging sequence. Then, the Gibbs posterior with learning rate concentrates at rate with respect to .
5.3 Finite-dimensional regression with squared-error loss
Consider predicting a response using a linear function by minimizing the sum of squared-error losses , with , over a parameter space . Suppose the covariate-response variable pairs are iid with taking values in a compact subset . To complement this example we present a more flexible, non-parametric regression problem in Section 5.4 below. For the current example, we focus on how the tail behavior of the response variable affects posterior concentration; see Assumptions 3 and 4 below.
5.3.1 Light-tailed response
When the response is sub-exponential, so is the excess loss, and by the argument outlined in Section 3.4 we can verify Condition 1 for and with . Then the Gibbs posterior distribution concentrates at rate as a consequence of Theorem 3.3.
Assumption 3.
1. The response , given , is sub-exponential with parameters for all ;
2. is bounded and its marginal distribution is such that exists and is positive definite with eigenvalues bounded away from ; and
3. the prior has a density bounded away from in a neighborhood of .
Proposition 3.
If Assumption 3 holds, and the learning rate is a sufficiently small constant, then the Gibbs posterior concentrates at rate with respect to .
5.3.2 Heavy-tailed response
As discussed in Section 3.4, Condition 1 can be expected to hold when the response is light-tailed, but not when it is heavy-tailed. However, for a clipped loss, , with increasing , and for a suitable sieve on the parameter space, we can show concentration of the Gibbs posterior at the risk minimizer via the argument given in Section 4.2.
Assumption 4.
The marginal distribution of satisfies for .
Define a sieve by for an increasing sequence , e.g., . The moment condition in Assumption 4 implies three important properties of for :
• The excess clipped loss satisfies .
• The risk and clipped risk are equivalent up to an error of order .
• The clipped loss satisfies .
These three properties can be used to verify Condition 3 using the argument sketched out in Section 4.2 and to verify the prior bound in (26) required to apply Theorem 4.2.
Proposition 4.
Proposition 4 continues to hold if we replace in the definitions of and by satisfying . That means we only need an accurate lower bound for to construct a consistent Gibbs posterior for , albeit with a slower concentration rate.
It is clear that the clipping sequence may bias the clipped risk minimizer. For a given clip level, the bias is smaller for light-tailed (large ) losses than for heavy-tailed losses, because a loss in excess of the clip is rarer in the former case. This explains why the clipping sequence decreases in .
Note that our can be compared to the rate derived in Grünwald and Mehta, (2020, Example 11). Indeed, up to log factors, their rate, , is smaller than ours for but larger for . That is, their rate is slightly better when the response has between 2 and 4 moments, while ours is better when the response has 4 or more moments. Also, their result assumes that the parameter space is fixed and bounded, whereas we avoid this assumption with a suitably chosen sieve.
5.4 Mean regression curve
Let be independent, where the marginal distribution of depends on a fixed covariate through the mean, i.e., the expected value of is , . For simplicity, set , corresponding to an equally-spaced design. Then the goal is estimation of the mean function , which resides in a specified function class defined below.
A natural starting point is to define an empirical risk based on squared error loss, i.e.,
(36) |
However, any function that passes through the observations would be an empirical risk minimizer, so some additional structure is needed to make the solution to the empirical risk minimization problem meaningful. Towards this, as is customary in the literature, we parametrize the mean function as a linear combination of a fixed set of basis functions, . That is, we consider only functions , where
(37) |
Note that we do not assume that is of the specified form; more specifically, we do not assume existence of a vector such that . The idea is that the structure imposed via the basis functions will force certain smoothness, etc., so that minimization of the risk over this restricted class of functions would identify a suitable estimate.
This structure changes the focus of our investigation from the mean function to the -vector of coefficients . We now proceed by first constructing a Gibbs posterior for and then obtain the corresponding Gibbs posterior for by pushing the former through the mapping . In particular, define the empirical risk function in terms of :
(38) |
where and where is the matrix whose entry is , assumed to be positive definite; see below. Given a prior distribution for —which determines a prior for through the aforementioned mapping—we can first construct the Gibbs posterior for as in (5) with the pseudo-likelihood . If we write for this Gibbs posterior for , then the corresponding Gibbs posterior for is given by
(39)
Therefore, the concentration properties of are determined by those of .
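As a concrete illustration of the construction just described, here is a minimal Python sketch of the basis-expansion empirical risk in (38) and the corresponding Gibbs pseudo-likelihood; the cosine basis and the names `omega`, `basis_matrix`, and so on are our own illustrative choices rather than the paper's notation.

```python
import numpy as np

def basis_matrix(x, J):
    """n x J matrix of basis functions evaluated at the design points; a cosine
    system is used here purely for illustration (B-splines, etc., also qualify)."""
    return np.column_stack([np.cos(np.pi * j * x) for j in range(J)])

def empirical_risk(beta, B, y):
    """Average squared-error loss when the mean function is represented as B @ beta."""
    return np.mean((y - B @ beta) ** 2)

def log_gibbs_pseudolikelihood(beta, B, y, omega):
    """Logarithm of exp(-omega * n * R_n(beta)), the Gibbs pseudo-likelihood that
    replaces a model-based likelihood in the prior-to-posterior update."""
    return -omega * len(y) * empirical_risk(beta, B, y)
```

Any prior density on the coefficient vector can then be combined with this pseudo-likelihood, e.g., via MCMC, exactly as one would with a genuine likelihood.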
We can now proceed very much like we did before, but the details are slightly more complicated in the present inid case. Taking expectation with respect to the joint distribution of is, as usual, the average of marginal expectations; however, since the data are not iid, these marginal expectations are not all the same. Therefore, the expected empirical risk function is
(40)
where is the marginal distribution of and where is the row of . Since the expected empirical risk function depends on , through , so too does the risk minimizer
(41)
If has finite variance, then differs from by only an additive constant not depending on , and this becomes a least-squares problem, with solution
(42)
where is the -vector . Our expectation is that the Gibbs posterior for will suitably concentrate around , which implies that the Gibbs posterior for will suitably concentrate around . Finally, if the above holds and the basis representation is suitably flexible, then will be close to in some sense and, hence, we achieve the desired concentration.
The flexibility of the basis representation depends on the dimension . Since need not be of the form , a good approximation will require that be increasing with . How fast must increase depends on the smoothness of . Indeed, if has smoothness index (made precise below), then many systems of basis functions—including Fourier series and B-splines—have the following approximation property: there exists an such that for every
(43)
Then the idea is to set the approximation error in (43) equal to the target rate of convergence, which depends on and on , and then solve for .
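The balance behind this choice can be written out heuristically in generic notation; this is a standard back-of-the-envelope calculation consistent with the rate appearing in Proposition 5, not a display taken from the paper. Writing J for the number of basis functions and alpha for the smoothness, equating an approximation error of order J^{-alpha} with an estimation error of order (J/n)^{1/2} gives

```latex
J^{-\alpha} \asymp \Bigl(\frac{J}{n}\Bigr)^{1/2}
\quad\Longrightarrow\quad
J_n \asymp n^{1/(2\alpha+1)},
\qquad
\varepsilon_n \asymp J_n^{-\alpha} \asymp n^{-\alpha/(2\alpha+1)}.
```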
For Gibbs posterior concentration at or near the optimal rate, we need the prior distribution for to be sufficiently concentrated in a bounded region of the -dimensional space in the sense that
(44)
for the same as in (43), for a small constant , and for all small .
Assumption 5.
1. The function belongs to a class of Hölder smooth functions parametrized by and . That is, satisfies
where the superscript “” means derivative and is the integer part of ;
2. for a given , the response is sub-Gaussian, with variance and variance proxy—both of which can depend on —uniformly upper bounded by ;
3. the eigenvalues of are bounded away from zero and ;
4. the approximation property (43) holds; and
5. the prior for satisfies (44) and has a bounded density on the -dimensional parameter space.
The bounded variance assumption is implied, for example, if the variance of is a smooth function of in , which is rather mild. And assuming the eigenvalues of are bounded is not especially strong since, in many cases, the basis functions would be orthonormal. In that case, the diagonal and off-diagonal entries of would be approximately 1 and 0, respectively, and the bounds are almost trivial. The conditions on the prior distribution are weak; as we argue in the proof, they can be satisfied by taking the joint prior density to be the product of independent prior densities on the components of , where each component density is strictly positive.
Proposition 5.
If Assumption 5 holds, and the learning rate is a sufficiently small constant, then the Gibbs posterior for concentrates at with rate with respect to the empirical norm , where .
We should emphasize that the quantity of interest, , is high-dimensional, and the rate given in Proposition 5 is optimal for the given smoothness ; there are not even any nuisance logarithmic terms.
The simpler fixed-dimensional setting with constant can be analyzed similarly as above. In that case, we can simultaneously weaken the requirement on the response in Assumption 5.2 from sub-Gaussian to sub-exponential, and strengthen the conclusion to a root- concentration rate.
5.5 Binary classification
Let be a binary response variable and a -dimensional predictor variable. We consider classification rules of the form
(45)
and the goal is to learn the optimal vector, i.e., , where is the misclassification error probability, and is the joint distribution of . This optimal is such that if and if , where is the conditional probability function. Below we construct a Gibbs posterior distribution that concentrates around this optimal at a rate that depends on certain local features of that function.
Suppose our data consists of iid copies , , of from , and define the empirical risk function
(46)
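For concreteness, here is a minimal Python sketch of the empirical misclassification risk in (46) for the linear rule above; the variable names and the ±1 coding of the response are our own illustrative conventions.

```python
import numpy as np

def classify(theta0, theta, X):
    """Linear classification rule: predict +1 when theta0 + x'theta >= 0, else -1."""
    return np.where(theta0 + X @ theta >= 0, 1, -1)

def empirical_misclassification_risk(theta0, theta, X, Y):
    """Proportion of training points whose +/-1 label disagrees with the rule."""
    return np.mean(classify(theta0, theta, X) != Y)
```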
In addition to the empirical risk function, we need to specify a prior, and here the prior plays a significant role in the Gibbs posterior concentration results.
A unique feature of this problem, which makes the prior specification a little different from that in a linear regression problem, is that the scale of does not affect classification performance, e.g., replacing with gives exactly the same classifier. To fix a scale, we follow Jiang and Tanner, (2008), and
• assume that the component of is of known importance and always included in the classifier,
• and constrain the corresponding coefficient, , to take values .
This implies that the and components of the vector should be handled very differently in terms of prior specification. In particular, is a scalar with a discrete prior—which we take here to be uniform on —and , being potentially high-dimensional, will require setting-specific prior considerations.
The characteristic that determines the difficulty of a classification problem is the distribution of or, more specifically, how concentrated is near the value , where one could do virtually no worse by classifying according to a coin flip. The set is called the margin, and conditions that control the concentration of the distribution of around are generally called margin conditions. Roughly, if has a “jump” or discontinuity at the margin, then classification is easier and does not have to be so tightly concentrated around . On the other hand, if is smooth at the margin, then the classification problem is more challenging in the sense that more data near the margin is needed to learn the optimal classifier, hence, tighter concentration of near is required.
In Sections 5.5.1 and 5.5.2 that follow, we consider two such margin conditions, namely, the so-called Massart and Tsybakov conditions. The first is relatively strong, corresponding to a jump in at the margin, and the result we establish in Proposition 6 is accordingly strong. In particular, we show that the Gibbs posterior achieves the optimal and adaptive concentration rate in a class of high-dimensional problems () under a certain sparsity assumption on . The Tsybakov margin condition we consider below is weaker than the first, in the sense that can be smooth near the “” boundary and, as expected, the Gibbs posterior concentration rate result is not as strong as the first.
5.5.1 Massart’s noise condition
Here we allow the dimension of the coefficient vector to exceed the sample size , i.e., we consider the so-called high-dimensional problem with . Accurate estimation and inference is not possible in high-dimensional settings without imposing some low-dimensional structure on the inferential target, . Here, as is typical in the literature on high-dimensional inference, we assume that is sparse in the sense that most of its entries are exactly zero, which corresponds to most of the predictor variables being irrelevant to classification. Below we construct a Gibbs posterior distribution for that concentrates around the unknown sparse at a (near) optimal rate.
Since the sparsity in is crucial to the success of any statistical method, the prior needs to be chosen carefully so that sparsity is encouraged in the posterior. The prior for will treat and as independent, and the prior for will be defined hierarchically. Start with the reparametrization , where denotes the configuration of zeros and non-zeros in the vector, and denotes the -vector of non-zero values. Following Castillo et al., (2015), for the marginal prior for , we take
where is a prior for the size and the first factor on the right-hand side is the uniform prior for of the given size . Various choices of are possible, but here we take the complexity prior , a truncated geometric density, where and are fixed (and here arbitrary) hyperparameters; a similar choice is made in Martin et al., (2017). Second, for the conditional prior of , given , again following Castillo et al., (2015), we take its density to be
a product of many Laplace densities with rate to be specified.
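The following Python sketch draws a coefficient vector from a prior of this hierarchical form; the particular decay constants in the complexity prior, the Laplace rate, and all variable names are illustrative placeholders rather than the hyperparameters prescribed in the paper.

```python
import numpy as np

def sample_sparse_prior(p, rng, a=1.0, c=2.0, lam=1.0):
    """Draw a coefficient vector from a hierarchical sparsity prior:
    (i) a geometric-type complexity prior on the number of non-zeros,
    (ii) a uniform prior over configurations of that size, and
    (iii) iid Laplace(lam) values on the selected coordinates."""
    sizes = np.arange(p + 1)
    weights = c ** (-a * sizes)              # decaying complexity prior (illustrative)
    s = rng.choice(sizes, p=weights / weights.sum())
    config = rng.choice(p, size=s, replace=False)   # uniform over size-s configurations
    theta = np.zeros(p)
    theta[config] = rng.laplace(scale=1.0 / lam, size=s)
    return config, theta

# Example: one draw from the prior with p = 100 candidate predictors.
rng = np.random.default_rng(0)
config, theta = sample_sparse_prior(100, rng)
```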
Assumption 6.
1. The marginal distribution of is compactly supported, say, on .
2. The conditional distribution of , given , has a density with respect to Lebesgue measure that is uniformly bounded.
3. The rate parameter in the Laplace prior satisfies .
4. The optimal is sparse in the sense that , where is the configuration of non-zero entries in , and .
5. There exists such that .
The first two parts of Assumption 6 correspond to Conditions and in Jiang and Tanner, (2008), and Assumption 6(5) above is precisely the margin condition imposed in Equation (5) of Massart and Nedelec, (2006); see also Mammen and Tsybakov, (1999) and Koltchinskii, (2006). Concisely, the condition states that there is either a jump in the function at the margin or that the marginal distribution of is not supported near the margin; in either case, there is separation between the and cases, which makes the classification problem relatively easy.
With these assumptions, we get the following Gibbs posterior asymptotic concentration rate result. Note that, in order to preserve the high-dimensionality in the asymptotic limit, we let the dimension increase with the sample size . So, the data sequence actually forms a triangular array but, as is common in the literature, we suppress this formulation in our notation.
Proposition 6.
Consider a classification problem as described above, with . Under Assumption 6, the Gibbs posterior, with sufficiently small constant learning rate, concentrates at rate with respect to .
This result shows that, even in very high dimensional settings, the Gibbs posterior concentrates on the optimal rule at a fast rate. For example, suppose that the dimension is polynomial in , i.e., for any , while the “effective dimension,” or complexity, is sub-linear, i.e., for . Then we get that has Gibbs posterior probability converging to 1 as . That is, rates better than can easily be achieved, and even arbitrarily close to is possible. Compare this to the rates in Propositions 2–3 in Jiang and Tanner, (2008), also in terms of risk difference, that cannot be faster than . Further, the concentration rate in Proposition 6 is nearly the optimal rate corresponding to an oracle who has knowledge of . That is, the Gibbs posterior concentrates at nearly the optimal rate adaptively with respect to the unknown complexity.
5.5.2 Tsybakov’s margin condition
Next, we consider classification under the more general Tsybakov margin condition (e.g., Tsybakov,, 2004). The problem set up is the same as above, except that here we consider the simpler low-dimensional case, with the number of predictors small relative to . Since the dimension is no longer large, prior specification is much simpler. We will continue to assume, as before, that the component of has a constrained coefficient , to which we assign a discrete uniform prior. Otherwise, we simply require the prior have a (marginal) density, , for , with respect to Lebesgue measure on .
Assumption 7.
1. The marginal prior density for is continuous and bounded away from 0 near .
2. There exists and such that for all sufficiently small .
The concentration of the marginal distribution of around controls the difficulty of the classification problem, and Condition 7.2 concerns exactly this. Note that smaller implies is less concentrated around , so we expect our Gibbs posterior concentration rate, say, , to be a decreasing function of . The following result confirms this.
Proposition 7.
Suppose Assumption 7 holds and, for the specified , let
Then the Gibbs posterior distribution, with learning rate , concentrates at rate with respect to .
Note that Massart’s condition from Section 5.5.1 corresponds to Tsybakov’s condition above with . In that case, the Gibbs posterior concentration rate we recover from Proposition 7 is , which is achieved with a suitable constant learning rate. This is within a logarithmic factor of the optimal rate for finite-dimensional problems. Moreover, for both the and cases, we expect that the logarithmic factor could be removed following an approach like that described in Theorem 3.3, but we do not explore this possible extension here.
We should emphasize that this case is unusual because the learning rate depends on the (likely unknown) smoothness exponent . This means the rate in Proposition 7 is not adaptive to the margin. However, this dependence is not surprising, as it also appears in Grünwald and Mehta, (2020, Section 6). The reason the learning rate depends on is that the Tsybakov margin condition in Assumption 7.2 implies the Bernstein condition in (17) takes the form
Therefore, in order to verify Condition 1 using the strategy in Section 3.4, we need and to have the same order when . This requires that the learning rate depends on , in particular, .
5.6 Quantile regression curve
In this section we revisit inference on a conditional quantile, covered in Section 5.1. The conditional quantile of a response given a covariate is modeled by a linear combination of basis functions :
In Section 5.1, we made the rather restrictive assumption that the true conditional quantile function belonged to the span of a fixed set of basis functions. In practice, it may not be possible to identify such a set of functions, which is why we considered using a sample-size dependent sequence of sets of basis functions in Section 5.4 to model a smooth function, . When the degree of smoothness, , of is known we can choose the number of basis functions to use in order to achieve the optimal concentration rate. But, in practice, may not be known, which creates a challenge because, as mentioned, the number of terms needed in the basis function expansion modeling depends on this unknown degree of smoothness.
To achieve optimal concentration rates adaptive to unknown smoothness, the choice of prior is crucial. In particular, the prior must support a very large model space in order to guarantee it places sufficient mass near . Our approximation of by a linear combination of basis functions suggests a hierarchical prior for , similar to Section 5.5.1, with a marginal prior for the number of basis functions and a conditional prior for , given . The resulting prior for is given by a mixture,
(47)
Then, in order for to place sufficient mass near , it is sufficient that the marginal and conditional priors satisfy the following conditions: the marginal prior for satisfies for some for every
(48)
and the conditional prior for , given , satisfies for every
(49)
for the same as in (43) and for some constant for all sufficiently small . Fortunately, many simple choices of are satisfactory for obtaining adaptive concentration rates, e.g., a Poisson prior on and a -dimensional normal conditional prior for , given ; see Conditions (A1) and (A2) and Remark 1 in Shen and Ghosal, (2015). Besides the conditions in (48) and (49) we need to make a minor modification of to make it suitable for our proof of Gibbs posterior concentration; see below.
Similar to our choice in Section 5.1 we link the data and parameter through the check loss function
where . See Koltchinskii, (1997) for a proof that is minimized at . It is straightforward to show the check loss is -Lipschitz with . From there, if were bounded we could use Condition 1 to compute the concentration rate. However, to handle an unbounded response we need the flexibility of Condition 2 and Theorem 4.1. To verify (22), we found it necessary to limit the complexity of the parameter space by imposing a constraint on the prior distribution, namely that the sequence of prior distributions places all its mass on the set for some diverging sequence ; see Assumption 8.4. This constraint implies the prior depends on , and we refer to this sequence of prior distributions by . Given the hierarchical prior in (47) one straightforward way to define a sequence of prior distributions satisfying the constraint is to restrict and renormalize to , i.e., define as
(50)
This particular construction of in (50) is not the only way to define a sequence of priors satisfying the restriction to , but it is convenient. That is, if places mass on a sup-norm neighborhood of (see the proof of Proposition 8), then, by construction, in (50) places at least mass on the same neighborhood.
We should emphasize that this restriction of the prior to is only a technical requirement needed for the proof, but it is not unreasonable. Since the true function is bounded, it is eventually contained in the growing support of the prior . Similar assumptions have been used in the literature on quantile curve regression; for example, Theorem 6 in Takeuchi et al., (2006) requires that the parameter space consist only of bounded functions, which is a stricter assumption than ours here.
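To fix ideas, here is a minimal Python sketch of the check-loss empirical risk for the basis-expansion quantile model introduced above; the names `tau`, `B`, and `beta` are generic placeholders.

```python
import numpy as np

def check_loss(u, tau):
    """Check (pinball) loss at quantile level tau: u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0).astype(float))

def empirical_check_risk(beta, B, y, tau):
    """Average check loss when the conditional tau-quantile is modeled as B @ beta,
    with B the n x J matrix of basis functions evaluated at the covariates."""
    return np.mean(check_loss(y - B @ beta, tau))
```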
Assumption 8.
1. The function is Hölder smooth with parameters (see Assumption 5.1);
2. the basis functions satisfy the approximation property in (43);
3. the covariate space is compact and there exists a such that the conditional density of , given , is continuous and bounded away from by a constant in the interval for every ; and,
4. the prior is the restricted and renormalized prior in (50), so that it places all of its mass on the constrained set described above.
Proposition 8.
Define . If the learning rate satisfies for some and Assumption 8 holds, then the Gibbs posterior distribution concentrates at rate with respect to .
Since the mathematical statement does not give sufficient emphasis to the adaptation feature, we elaborate on this point here. That is, is the optimal rate (Shen and Ghosal, 2015) for estimating an -Hölder smooth function, and it is not difficult to construct an estimator that achieves this, at least approximately, if is known. However, is unknown in virtually all practical situations, so it is desirable for an estimator, Gibbs posterior, etc., to adapt to the unknown . The concentration rate result in Proposition 8 says that the Gibbs posterior achieves nearly the optimal rate adaptively, in the sense that it concentrates at nearly the optimal rate as if were known.
The concentration rate in Proposition 8 depends on the complexity of the parameter space as determined by in Assumption 8(3). For example, if the sup-norm bound on were known, then and the learning rate could be taken as constants and the rate would be optimal up to a factor. On the other hand, if greater complexity is allowed, e.g., for some power , then the concentration rate takes on an additional factor, which is not a serious concern.
6 Application: personalized MCID
6.1 Problem setup
In the medical sciences, physicians who investigate the efficacy of new treatments are challenged to determine both statistically and practically significant effects. In many applications some quantitative effectiveness score can be used for assessing the statistical significance of the treatment, but physicians are increasingly interested also in patients’ qualitative assessments of whether they believed the treatment was effective. The aim of the approach described below is to find the cutoff on the effectiveness score scale that best separates patients by their reported outcomes. That cutoff value is called the minimum clinically important difference, or MCID. For this application we follow up on the MCID problem discussed in Syring and Martin, (2017) with a covariate-adjusted, or personalized, version. In medicine, there is a trend away from the classical “one size fits all” treatment procedures, to treatments that are tailored more-or-less to each individual. Along these lines, naturally, doctors would be interested to understand how that threshold for practical significance depends on the individual, hence there is interest in a so-called personalized MCID (Hedayat et al.,, 2015; Zhou et al.,, 2020).
Let the data be iid , where each observation is a triple denoting the patient’s diagnostic measurement, their self-reported effectiveness outcome , and covariate value , for and . In practice, the diagnostic measurement is related to the effectiveness of the treatment; examples include blood pressure, blood glucose level, and viral load. Examples of covariates include a patient’s age, weight, and gender. The idea is that the -scale cutoff for practical significance would depend on the covariate , hence the MCID is a function, say, , and the goal is to learn this function.
The true MCID is defined as the solution to an optimization problem. That is, if
(51)
then the expected loss is , and the true MCID function is defined as the minimizer , where the minimum is taken over a class of functions on . Alternatively, as in Section 5.5, the true satisfies if and if , where is the conditional probability function.
As described in Section 2, the Gibbs posterior distribution is based on an empirical risk function which, in the present case, is given by
(52)
In order to put this theory into practice, it is necessary to give the function space a lower-dimensional parametrization. In particular, we consider a true MCID function belonging to a Hölder class as in Assumption 5 but with unknown smoothness, as in Section 5.6. We model by a linear combination of basis functions , for basis functions , . Then, each is identified by a pair consisting of a positive integer and a -dimensional coefficient vector . We use cubic B-splines in the numerical examples in Section 2 of the supplementary material, but any basis capable of approximating will work; see (43).
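In this parametrization, the empirical risk in (52) can be evaluated along the lines of the following Python sketch, where `basis` stands for any map from covariate values to basis-function evaluations and the ±1 coding of the reported outcome is our own convention.

```python
import numpy as np

def mcid_empirical_risk(beta, X, Y, Z, basis):
    """Empirical risk for a basis-expansion MCID function theta(z) = basis(z) @ beta:
    the proportion of patients whose +/-1 reported outcome Y disagrees with the
    sign of X - theta(Z)."""
    theta_z = basis(Z) @ beta
    return np.mean(Y != np.sign(X - theta_z))
```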
6.2 Concentration rate result
Assumption 9 below concerns the smoothness of and requires the chosen basis satisfy the approximation property used previously; it also refers to the same mild assumptions on random series priors used in Section 5.6 sufficient to ensure adequate prior mass is assigned to a neighborhood of ; finally, it assumes a margin condition on the classifier like that used in Section 5.5.1 and Assumption 6(5). These conditions are sufficient to establish a Gibbs posterior concentration rate.
Assumption 9.
1. The true MCID function for a compact subset of and is Hölder smooth with parameters (see Assumption 5.1);
2. the basis functions satisfy the approximation property in (43);
3. the prior satisfies the random series prior conditions from Section 5.6, i.e., (48) and (49);
4. there exists such that ; and,
5. the conditional distribution, , of , given , has a density with respect to Lebesgue measure that is uniformly bounded away from infinity.
Proposition 9.
Suppose Assumption 9 holds, with as defined there, and set . For any fixed the Gibbs posterior concentrates at rate with respect to the divergence
(53)
The Gibbs posterior distribution we have defined for the personalized MCID function achieves the concentration rate in Proposition 9 adaptively to the unknown smoothness of . Mammen and Tsybakov, (1995) consider estimation of the boundary curve of a set, and they show that the minimax optimal rate is when the boundary curve is -Hölder smooth and distance is measured by the Lebesgue measure of the set symmetric difference. In our case, if has a joint density, bounded away from 0, then our divergence measure is equivalent to
in which case our rate is within a logarithmic factor of the minimax optimal rate.
Hedayat et al., (2015) also study the personalized MCID and derive a convergence rate for an M-estimator of based on a smoothed and penalized version of (52). It is difficult to compare our result with theirs because, for instance, their rate depends on two user-controlled sequences related to the smoothing and penalization of their loss. But, as mentioned above, our rate is nearly optimal in certain cases, so the asymptotic results in Hedayat et al., (2015) cannot be appreciably better than our rate in Proposition 9.
6.3 Numerical illustrations
We performed two simulation examples to investigate the performance of the Gibbs posterior for the personalized MCID. In both examples we use a constant learning rate , but we generally recommend data-driven learning rates; see Syring and Martin, (2019).
For the first example we sample independent observations of . The covariate is sampled from a uniform distribution on the interval . Given , the diagnostic measure is sampled from a normal distribution with mean and variance , and the patient-reported outcome is sampled from a Rademacher distribution with probability
(54)
where denotes the distribution function. The addition of in the formula of is to meet the margin condition in Assumption 5.4 in the main article. As mentioned above, we parametrize the MCID function by piecewise polynomials, specifically, cubic B-splines. For highly varying MCID functions, a reversible-jump MCMC algorithm that allows for changing numbers of and break points in the piecewise polynomials may be helpful; see Syring and Martin, (2020). However, for this example we fix the parameter dimension to just six B-spline functions, which allows us to use a simple Metropolis–Hastings algorithm to sample from the Gibbs posterior distribution. Since the dimension is fixed, the prior is only needed for the B-spline coefficients, and for these we use diffuse independent normal priors with mean zero and standard deviation of . Over replications, the average empirical misclassification rate is 16% using the Gibbs posterior mean MCID function compared to 13% using the true MCID function when applying these two classifiers to a hold-out sample of data points.
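A minimal sketch of this kind of sampler is given below in Python, with a simple polynomial basis standing in for the six cubic B-spline functions, and with arbitrary illustrative values for the learning rate, prior scale, and proposal step; none of these are the exact settings used in the simulation.

```python
import numpy as np

def poly_basis(z, J=6):
    """Polynomial basis with J = 6 terms, standing in for the six cubic
    B-spline functions used in the simulation."""
    return np.column_stack([z ** j for j in range(J)])

def mh_gibbs_mcid(X, Y, Z, omega=1.0, prior_sd=10.0, step=0.25,
                  n_iter=10_000, seed=0):
    """Random-walk Metropolis-Hastings targeting the Gibbs posterior density
    proportional to exp(-omega * n * R_n(beta)) times a N(0, prior_sd^2 I) prior,
    where R_n is the empirical misclassification-type risk for the MCID."""
    rng = np.random.default_rng(seed)
    n, B = len(Y), poly_basis(Z)
    J = B.shape[1]

    def log_target(beta):
        risk = np.mean(Y != np.sign(X - B @ beta))
        return -omega * n * risk - 0.5 * np.sum(beta ** 2) / prior_sd ** 2

    beta, lt = np.zeros(J), log_target(np.zeros(J))
    draws = np.empty((n_iter, J))
    for t in range(n_iter):
        prop = beta + step * rng.standard_normal(J)
        lt_prop = log_target(prop)
        if np.log(rng.uniform()) < lt_prop - lt:   # Metropolis accept/reject step
            beta, lt = prop, lt_prop
        draws[t] = beta
    return draws
```

After discarding a burn-in portion of `draws`, the pointwise posterior mean MCID curve can be summarized as `poly_basis(z_grid) @ draws.mean(axis=0)` for a grid of covariate values `z_grid`.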
The left pane of Figure 1 shows the results for one simulated data set under the above formulation. Even with only samples, the Gibbs posterior does a good job of centering on the true MCID function. The right pane displays the pointwise Gibbs posterior mean MCID function for each of repetitions of simulation 1, along with the overall pointwise mean of these functions, and the true MCID function. The Gibbs posterior mean function is stable across repetitions of the simulation.
[Figure 1: MCID simulation results described above.]
The second example we consider includes a vector covariate similar to that in Example 1 of Scenario 2 in Hedayat et al., (2015). We sample independent observations of , where has a uniform distribution on the square . Given , the diagnostic measure has a normal distribution with mean and variance , and the patient-reported outcome is a Rademacher random variable with probability
(55)
In practice it is common to have more than one covariate, so this second example is perhaps more realistic than the first. However, it is much more difficult to visualize the MCID function for more than one covariate, so we do not display any figures for this example. We use tensor product B-splines with fixed B-spline functions ( coefficients) to parametrize the MCID function. Again, we use independent diffuse normal priors with zero mean and standard deviation equal to for each coefficient. Over repetitions of this simulation we observe an average empirical misclassification rate of 24% using the Gibbs posterior mean MCID function compared to 23% using the true MCID function when applied to a hold-out sample of data points.
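For readers who wish to reproduce a bivariate fit of this kind, the tensor-product basis can be assembled as in the following Python sketch, with a polynomial basis again standing in for the univariate cubic B-splines.

```python
import numpy as np

def univariate_basis(z, J):
    """Simple polynomial basis in one covariate; a stand-in for the univariate
    cubic B-splines used in the paper."""
    return np.column_stack([z ** j for j in range(J)])

def tensor_product_basis(z1, z2, J1, J2):
    """Tensor-product basis for a bivariate covariate: all pairwise products of
    the two univariate bases, giving J1 * J2 coefficients per observation."""
    B1 = univariate_basis(z1, J1)
    B2 = univariate_basis(z2, J2)
    # Row-wise outer products, flattened to an n x (J1*J2) design matrix.
    return np.einsum('ij,ik->ijk', B1, B2).reshape(len(z1), J1 * J2)
```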
7 Concluding remarks
In this paper we focus on developing some simple, yet general, techniques for establishing asymptotic concentration rates for Gibbs posteriors. A key take-away is that the robustness to model misspecification offered by the Gibbs framework does not come at the expense of slower concentration rates. Indeed, in the examples presented here—and others presented elsewhere (Syring and Martin, 2020)—the rates achieved are the same as those achieved by traditional Bayesian posteriors and (nearly) minimax optimal. Another main point is that Gibbs posterior distributions are not inherently challenging to analyze; on the contrary, the proofs presented herein are concise and transparent. An additional novelty of the analysis presented here is that we consider cases where the learning rate can be non-constant, i.e., either a vanishing sequence or data-dependent, and we prove corresponding posterior concentration rate results.
Here the focus has been on deriving Gibbs posteriors with the best possible concentration rates, and selection of learning rates has proceeded with these asymptotic properties in mind. In other work (Syring and Martin, 2019), random learning rates are derived for good uncertainty quantification in finite samples. We conjecture, however, that the two goals are not mutually exclusive, and that learning rates arising as solutions to the calibration algorithm in Syring and Martin, (2019) also have the desirable concentration rate properties. Proving this conjecture seems challenging, but Section 3.3 provides a first step in this direction.
Acknowledgments
The authors sincerely thank the reviewers for their helpful feedback on previous versions of the manuscript. This work is partially supported by the U.S. National Science Foundation, DMS–1811802.
Appendix A Proofs of the main theorems
A.1 Proof of Theorem 1
As a first step, we state and prove a result that gives an in-probability lower bound on the denominator of the Gibbs posterior, the so-called partition function. The proof closely follows that of Lemma 1 in Shen and Wasserman, (2001) but is, in some sense, more general, so we present the details here for the sake of completeness.
Lemma 1.
Define
(56)
If is as in (12), with satisfying and , then with -probability converging to 1.
Proof.
Define a standardized version of the empirical risk difference, i.e.,
where and , the mean and variance of the risk difference. Of course, depends (implicitly) on the data . Let
Next, define the cross-sections
For as above, since
and , , and are suitably bounded on , we immediately get
From this lower bound, we get
where the last line is by Markov’s inequality. We can then simplify the expectation in the upper bound displayed above using Fubini’s theorem:
By Chebyshev’s inequality, , and hence
Finally, since , the left-hand side vanishes, completing the proof. ∎
For the proof of Theorem 1, write
where , is as in (56), and
For as in Lemma 1, write for the lower bound on . Then
By Fubini’s theorem, independence of the data , and Condition 1, we get
Take expectation of and plug in the upper bound above, along with from (15) and from Lemma 1, to get
Since the right-hand side is vanishing for sufficiently large , the claim follows.
A.2 Proof of Theorem 2
A special case of this result was first presented in Bhattacharya and Martin, (2022), but we are including the proof here for completeness.
Recall that the Gibbs posterior probability, , is a ratio, namely, . Both the numerator and denominator are integrals, and the key idea here is to split the range of integration in the numerator into countably many disjoint pieces as follows:
Taking expectation of the left-hand side and moving it under the sum and under the integral on the right-hand side, we need to bound
By Condition 1, on the given range of integration, the integrand is bounded above by
so the expectation of the integral itself is bounded by
Since has a bounded density on the -dimensional parameter space, we clearly have
Plug all this back into the summation above to get
The above sum is finite for all and bounded by a multiple of . Then times the sum is vanishing as and, consequently, we find that the expectation of the Gibbs posterior numerator is .
For the denominator , we can proceed just like in the proof of Lemma 1. The key difference is that we redefine
with an arbitrary constant , so that
Then, just like in the proof of Theorem 1, since , we have
which implies
Since is arbitrary, we conclude that , completing the proof.
A.3 Proof of Theorem 3
The proof is nearly identical to that of Theorem 1. Begin with
where , is as in (56), and
When the learning rate is a sequence rather than constant, Lemma 1 can be applied with no alterations provided , as assumed in the statement of Theorem 3. Then, for as in Lemma 1, write for the lower bound on . Bound the posterior probability of by
By Fubini’s theorem, independence of the data , and Condition 1, we get
Take expectation of and plug in the upper bound above, along with from (14) and from Lemma 1, to get
Since the right-hand side is vanishing for sufficiently large , the claim follows.
A.4 Proof of Theorem 4
First note that if the conditions of Theorem 3 hold for , then also concentrates at rate . That is, at least asymptotically, there is no difference between the learning rates and .
Next, as in the proof of Theorem 1, denote the numerator and denominator of by and . Let . By the assumptions of Theorem 4, , so in the argument below we focus on bounding the numerator and denominator of the Gibbs posterior given .
Restricting to the set , using Lemma 1, and noting that decreases in we have with probability approaching where
for some , where the last inequality follows from (14).
Set and then bound the numerator as follows:
Then, by Condition 1, Fubini’s theorem, and independence of , we have
Similar to the proof of Theorem 1, we can bound using the above exponential bounds on and :
Taking expectation of and applying the numerator and denominator bounds and the fact , we have
The result follows since is arbitrary.
A.5 Proof of Theorem 5
The proof is very similar to that of Theorem 1. Start with the decomposition
where and is defined in Condition 2. We consider the first term in the above decomposition. As before, we have
where , is as in (56), and
Apply Lemma 1, with , and write
for the lower bound on . This immediately leads to
By Fubini’s theorem, independence of the data , and Condition 2, we get
Take expectation of and plug in the upper bound above, along with from (21) and from Lemma 1, to get
Since the right-hand side is vanishing for sufficiently large , we can conclude that
Of course, if (22) holds, then the upper bound in the above display is 0 and we obtain the claimed Gibbs posterior concentration rate result.
A.6 Proof of Theorem 6
The proof is nearly identical to the proof of Theorem 5 appearing above in Section A.5. However, it remains to show that concentration at with respect to at rate implies concentration at with respect to at rate .
By assumption for some and for all so that
by Markov’s inequality. By definition of
and, therefore,
Similarly,
so we can conclude
(57)
Next, note that by definition . Using (57), replace the risk by the clipped risk at and at , up to error of each time to see that
and, since by definition, we have
and, therefore,
(58)
Now, for some constants to be determined, define the sets
Let . By (57), swap for up to an error of :
Using (58), swap for up to an error of :
Finally, use (57) again and swap for up to an error of :
Conclude that if we choose such that , then we get . Consequently, if the Gibbs posterior vanishes on in -expectation, then it also vanishes on .
Appendix B A strategy for checking (22)
Recall, a condition for posterior concentration under Theorems 5–6 is
Below we describe a strategy for checking this, based on convexity of .
Lemma 2.
The Gibbs posterior distribution satisfies (22) if the following conditions hold:
-
1.
is convex,
-
2.
for all , and
-
3.
The prior distribution satisfies (14) in the main article.
Proof.
Let for any fixed and write the Gibbs posterior probability of as
Assumption 2 implies that the minimizer of converges to in -probability; see Hjort and Pollard, (1993), Lemmas 1–2. Therefore, assume since
in probability. By convexity, and the fact that ,
where the infimum is over all unit vectors. The infimum on the right-hand side of the above display converges to a positive number, say , in probability by Lemma 1 in Hjort and Pollard, (1993). Therefore,
with probability converging to . Uniform convergence of the empirical risk functions implies
with -probability converging to as . Combining this upper bound on with the lower bound on provided by Lemma 1 in the main article we have
where the bound vanishes because for all large enough . By the bounded convergence theorem, , as claimed. ∎
Appendix C Proofs of posterior concentration results for examples
C.1 Proof of Proposition 1
The proof proceeds by checking the conditions of the extended version of Theorem 2, namely, the version based on Condition 2. First, we confirm that is minimized at . Write
Assumption 1(1–2) implies can be differentiated twice under the integral:
where denotes the distribution function corresponding to the density . By definition, , so it follows that . Moreover, the following Taylor approximation holds in the neighborhood :
where Assumption 1(2) implies is positive definite. Then, is convex and minimized at .
Next, note satisfies a Lipschitz property:
where . This follows from considering the cases , , and , together with the Cauchy–Schwarz inequality. By Assumption 1(1), is uniformly bounded in . Then, using the Taylor approximation for above and following the strategy laid out in Section 4.1 of the main article, we have
where is bounded below by the smallest eigenvalue of . Therefore, Condition 2 holds for all sufficiently small learning rates, i.e., .
Assumption 1(3) says the prior density is bounded away from zero in a neighborhood of . By the above computations,
for all small enough and some . Therefore,
verifying the prior condition in (15).
C.2 Proof of Proposition 2
For as in Assumption 2(1), define
where and are not both 0, so that
For any deterministic sequence , with , the learning rate vanishes strictly more slowly than , and, therefore, according to Theorem 1 in Wang and Martin, (2020), the Gibbs posterior with learning rate concentrates at rate in the sense of Definition 4 in the main article. By the law of large numbers, and in -probability, so
Therefore, for any , we have
and it follows from Theorem 4 that the Gibbs posterior with learning rate also concentrates at rate . Finally, since is arbitrary, the constant is unimportant and may be implicitly absorbed into . Therefore, the conclusion of Proposition 2, omitting , also holds.
C.3 Proof of Proposition 3
The proof proceeds by applying Lemma 2 and then checking the conditions of Theorem 2. The squared-error loss is convex in and its corresponding risk equals by Assumption 1a.2, which implies condition 2 in Lemma 2. Further, the condition on the prior in Assumption 1a.3 is sufficient for verifying condition 3 in Lemma 2. Then, Lemma 2 implies vanishes in expectation for any .
Next, verify the conditions of Theorem 2. The excess loss can be written . To verify Condition 1, we start by computing the conditional expectation of , given , using the fact that . Lemma 2 implies that we may restrict our attention to bounded so that Assumption 1a.2 implies . Therefore, we take where is given in Assumption 1a.1. Specifically, given is sub-exponential with parameters , so that for all . In the excess loss, add and subtract times and apply Assumption 1a.1 to get the following bounds:
Apply (16) in the paper, which is Lemma 7.26 in Lafferty et al., (2010), using the facts from Assumption 1a.2 that and from consistency that so that . Then, (16) implies
where
and where the last line holds for some and for all sufficiently small . In particular, must satisfy .
To verify the prior condition in (13) of the paper, we need upper bounds on and . From above, we have and by Assumption 1a.2 it follows for some . To bound , use the total variance formula “,” noting by Assumption 1a.1. Then,
Then, by the Cauchy–Schwarz inequality,
where . Assumption 1a.3 says the prior density is bounded away from zero in a neighborhood of , so that
which verifies (13).
C.4 Proof of Proposition 4
The proof proceeds by checking the conditions of Theorem 6.
First, note that along with Assumption 1b together imply which we define to be for some and where . This fulfills the condition in Theorem 6 that
(59)
Second, we verify Condition 3. We apply the strategy sketched in Section 4.2 of the paper: verify the Bernstein condition with for the clipped loss function, and then apply the inequality in (16) of the paper. For the Bernstein condition we want to show
(60)
for some function to be specified. By construction, the second moment of excess clipped loss satisfies , which implies
So, we want to upper bound and , to get a bound on , and then relate this bound to to determine .
Similarly to the proof of Proposition 3 we use the total variance formula to upper bound . For the “” part of the formula, we have
where the second line follows from the fact that is bounded almost surely according to Assumption 1a, and where denotes the marginal variance of , given , which has finite expectation according to Assumption 1b.
Next, for the “” part we have
again using the facts that is bounded almost surely and that for .
Similarly,
So far, we have shown
And, since
this implies
Finally, we note that the proof of Theorem 6 above shows that (59) implies
Conclude
Therefore, (60) is verified if we take for some .
Next, apply the inequality in (16). Note that and recall that we choose . Then, (16) and (60) imply
where by the choice of , and where for all large enough . This verifies Condition 3.
Finally, we verify the prior bound in (26). By Assumption 3(3) the prior is bounded away from zero in a neighborhood of , which implies, by the above bound on , that the prior is also bounded away from zero in a neighborhood of for all sufficiently large . Using the above bounds on the and functions
where the last inequality holds because and . This verifies the prior bound required by Theorem 6 in (26).
C.5 Proof of Proposition 5
Proposition 5 follows from a slight refinement of Theorem 3. We verify Condition 1 and a prior bound essentially equivalent to (14). We also use an argument similar to the proof of Theorem 2 to improve the bound on expectation of the numerator of the Gibbs posterior distribution derived in the proof of Theorem 3.
Towards verifying Condition 1, we first need to define the loss function being used. Even though the values are technically not “data” in this inid setting, it is convenient to express the loss function in terms of the pairs. Moreover, while the quantity of interest is the mean function , since we have introduced the parametric representation , the focus shifts to the basis coefficients in the vector, so it makes sense to express the loss function in terms of instead of . That is, the squared error loss is
For as defined in Section 5.4, the loss difference equals
(61)
Since the responses are independent, the expectation in Condition 1 can be expressed as the product
According to Assumption 5(2), is sub-Gaussian, so the product in the last line above can be upper-bounded by
The second exponential term above is identically because the exponent vanishes—a consequence of the Pythagorean theorem. To see this, first write the quantity in the exponent as an inner product
Recall from the discussion in Section 5.4 that satisfies ; from this, it follows that the above display vanishes for all vectors . Therefore,
and Condition 1 is satisfied since provided the learning rate is less than .
Next, we derive a prior bound similar to (14). By Assumption 5(3), all eigenvalues of are bounded away from zero and infinity, which implies
(62)
In the arguments that follow, we show that which implies, by (62), that . The approximation property in (43) implies that . Therefore, is bounded because, if it were not bounded, then (62) would be contradicted.
Since is bounded, Assumption 5(5) implies that the prior for satisfies
(63)
Recall that (14) involves the mean and variance of the empirical risk, and we can directly calculate these. For the mean,
where the last equality is by the same Pythagorean theorem argument presented above. Similarly, the variance is given by . Therefore,
The fact that along with Lemma 1 implies
with -probability converging to as .
We briefly verify the claim made just before the statement of Proposition 5, i.e., that (44) holds for an independence prior consisting of strictly positive densities. To see this, note that the volume of a -dimensional sphere with radius is a constant multiple of , so that, if the joint prior density is bounded away from zero by on the sphere, then we have , which is (44). So we must verify the bound on the prior density. Suppose has a density equal to a product of independent prior densities , . Since the cube contains the ball by Cauchy–Schwarz, the infimum of the prior density on the ball is no smaller than the infimum of the prior density on the cube. To bound the prior density on the cube, consider any of the components and note implies since . Moreover, since we have so that this interval lies within a compact set, say, . And, since the prior density is strictly positive, it is bounded away from zero by a constant on this compact set. Finally, by independence, we have for all , which verifies the claim concerning (44).
In order to obtain the optimal rate of , without an extra logarithmic term, we need a slightly better bound on than that used to prove Theorem 3. Our strategy, as in the proof of Theorem 2, will be to split the range of integration in the numerator into countably many disjoint pieces, and use bounds on the prior probability on those pieces to improve the numerator bound. Define and write the numerator as follows
Taking expectation of the left-hand side and moving it under the sum and under the integral on the right-hand side, we need to bound
By Condition 1, verified above, on the given range of integration the integrand is bounded above by , so the expectation of the integral itself is bounded by
Since has a bounded density on the dimensional parameter space, we clearly have
Plug all this back into the summation above to get
The above sum is finite for all and bounded by a multiple of . Consequently, we find that the expectation of the Gibbs posterior numerator is bounded by a constant multiple of .
Combining the in-expectation and in-probability bounds on and , respectively, as in the proof of Theorem 1, we find that
which vanishes as for any sufficiently large constant .
The above arguments establish that the Gibbs posterior for satisfies
(64)
But this is equivalent to the proposition’s claim, with replaced by . To see this, first recall that Assumption 5.4 implies the existence of a -vector such that is small. Next, use the triangle inequality to get
Now apply the Pythagorean theorem argument as above to show that
Since the sup-norm dominates the empirical norm, the left-hand side is bounded by for some . But both terms on the right-hand side are non-negative, so it must be that the right-most term is also bounded by . Putting these together, we find that
Therefore, with and , the lower bound on the right-hand side of the previous display is a constant multiple of . We can choose such that the aforementioned sequence is at least as big as above. In the end,
so the Gibbs posterior concentration claim in the proposition follows from that established above. Finally, by definition of the prior and Gibbs posterior for , we have that
which completes the proof.
C.6 Proof of Proposition 6
The proof proceeds by checking the conditions of Theorem 1. We begin by verifying (12). Evaluate and for the loss function as defined above:
It follows from arguments in Tsybakov, (2004) that, under the margin condition in Assumption 6(5), with , we have
Further, rewrite as
(65)
where the latter inequality follows since . Since is a difference of indicators
This latter inequality will be useful below. Under the stated assumptions, the integrand in (C.6) can be handled exactly like in Lemma 4 of Jiang and Tanner, (2008). That is, if is sufficiently small, then . Since , it follows that
for a constant . To lower-bound the prior probability of the event on the right-hand side, we follow the proof of Lemma 2 in Castillo et al., (2015). First, for the configuration of , we can immediately get
Now make a change of variable and note that
Therefore,
and after plugging in the definition of , using the bound in Equation (6.2) of Castillo et al., (2015), and simplifying, we get
From the form of the complexity prior , the bounds on , and the assumption that , we see that the lower bound is no smaller than for some constant , which implies (12), i.e.,
where .
Next, we verify Condition 1. By direct computation, we get
Using the Massart margin condition, we can upper bound the above by
where and . For all small enough , both and are , so for some constants ,
Then Condition 1 follows from the elementary inequality for .
C.7 Proof of Proposition 7
The proof proceeds by checking the conditions of Theorem 3. We begin by verifying (14). The upper bounds on and given in the proof of Proposition 6 are also valid in this setting, which means
for some . Further, Assumption 7 implies
By definition of and we have that
which, combined with the previous display, verifies (14).
Next, to verify Condition 1, we note that the misclassification error loss difference is bounded in absolute value by , so we can proceed using the moment-generating function bound from Lemma 7.26 in Lafferty et al., (2010); see (16). Proposition 1 in Tsybakov, (2004) provides the lower bound on we need, i.e., Tsybakov proves that
With this and the upper bound on derived above, (16) implies
for some , which verifies Condition 1.
C.8 Proof of Proposition 8
The proof proceeds by checking the conditions of Theorem 5. First, we verify (21). Starting with , by direct calculation,
Partitioning the range of integration with respect to , for given , into the disjoint intervals , , and , and simplifying, we get
(66)
It follows immediately that . Also, since clearly satisfies the fixed- Lipschitz bound
(67)
we get a similar bound for the variance, i.e., . Let and , and define a sup-norm ball around
By the above upper bounds on and in terms of , we have
Then, by Assumptions 8(3-4), and using the same argument as in the proof of Theorem 1 in Shen and Ghosal, (2015) we have
for some . By the definitions of and in Proposition 8 it follows that for as defined in Condition 2. Define , with precise proportionality determined below. Then,
for a sufficiently small and all large enough , which verifies the prior condition in (21) with .
Next we verify Condition 2. Define the sets and . Note that Assumption 8(4) implies that , and, therefore,
The following computations provide a lower bound on for . Partition as where and and where is as in Assumption 4(2). Using (66), the mean function can be expressed as
For convenience, refer to the two integrals on the right-hand side of the above display as and , respectively. Using Assumption 8(2) and replacing the range of integration in the inner integral by a -neighborhood of we can lower bound as
Next, for , we can again use Assumption 8(2) to get the lower bound
Similarly, for sufficiently large , if , then the norm is bounded as
Comparing the lower and upper bounds for and in terms of integration over and we have
and
which together imply
Recall, from (67), that is Lipschitz. Therefore, if , for large enough , then the loss difference is bounded by , so Lemma 7.26 in Lafferty et al., (2010), along with the lower and upper bounds on and , can be used to verify Condition 2. That is, there exists a such that for all
where the last inequality holds for . Given Assumption 8(4), the above inequality verifies Condition 2 for and as in Proposition 4.
C.9 Proof of Proposition 9
Proposition 9 follows from Theorem 1 by verifying (12) and Condition 1.
First, we verify (12). By definition
And, since is bounded by ,
Since and, by Assumption 5(5),
it follows that
Let and define a sup-norm ball around
Then, by Assumption 9(3), and using the same argument as in the proof of Theorem 1 in Shen and Ghosal, (2015) we have
for some . Since and implies , it follows that (14) holds with .
References
- Alquier, (2008) Alquier, P. (2008). PAC-Bayesian bounds for randomized empirical risk minimizers. Math. Methods Statist. 17(4):279–304.
- Alquier et al., (2016) Alquier, P., Ridgway, J., and Chopin, N. (2016). On the properties of variational approximations of Gibbs posteriors. J. Mach. Learn. Res. 17:1–41.
- Barron et al., (1999) Barron, A., Schervish, M. J., and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27(2):536–561.
- Bhattacharya and Martin, (2022) Bhattacharya, I., and Martin, R. (2022). Gibbs posterior inference on multivariate quantiles. J. Statist. Plann. Inference 218:106–121.
- Bissiri et al., (2016) Bissiri, P.G., Holmes, C.C., and Walker, S.G. (2016). A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B. Stat. Methodol. 78:1103–1130.
- Boucheron et al., (2012) Boucheron, S., Lugosi, G., and Massart, P. (2012). Concentration Inequalities: A Nonasymptotic Theory of Independence. Clarendon Press, Oxford.
- Castillo et al., (2015) Castillo, I., Schmidt-Hieber, J., and van der Vaart, A.W. (2015). Bayesian linear regression with sparse priors. Ann. Statist. 43(5):1986–2018.
- Catoni, (2004) Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization, Lecture Notes in Mathematics. Springer-Verlag.
- Chernozhukov and Hong, (2003) Chernozhukov, V. and Hong, H. (2003). An MCMC approach to classical estimation. J. Econometrics 115(2):293–346.
- Chib et al., (2018) Chib, S., Shin, M., and Simoni, A. (2018). Bayesian estimation and comparison of moment condition models. J. Am. Stat. Assoc. 113(524):1656–1668.
- Choudhuri et al., (2007) Choudhuri, N., Ghosal, S., and Roy, A. (2007). Nonparametric binary regression using a Gaussian process prior. Stat. Methodol. 4:227–243.
- De Blasi and Walker, (2013) De Blasi, P., Walker, S. G. (2013). Bayesian asymptotics with misspecified models. Statist. Sinica 23:169–187.
- Ghosal et al., (2000) Ghosal, S., Ghosh, J. K., and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 28(2):500–531.
- Godambe, (1991) Godambe, V. P., ed. (1991). Estimating Functions. Oxford University Press, New York.
- Grünwald, (2012) Grünwald, P. (2012). The safe Bayesian: learning the learning rate via the mixability gap. Algorithmic Learning Theory, Springer, Heidelberg, 7568:169–183.
- Grünwald and Mehta, (2020) Grünwald, P. and Mehta N. (2020). Fast rates for general unbounded loss functions: From ERM to generalized Bayes. J. Mach. Learn. Res. 21:1–80.
- Grünwald and van Ommen, (2017) Grünwald, P. and van Ommen, T. (2017). Inconsistency of Bayesian inference for misspecified models, and a proposal for repairing it. Bayesian Anal. 12:1069–1103.
- Guedj, (2019) Guedj, B. (2019). A primer on PAC-Bayes learning. arXiv:1901.05353.
- Hedayat et al., (2015) Hedayat, S., Wang, J., and Xu, T. (2015). Minimum clinically important difference in medical studies. Biometrics 71:33–41.
- Hjort and Pollard, (1993) Hjort, N. L., and Pollard, D. (1993). Asymptotics for minimisers of convex processes. http://www.stat.yale.edu/~pollard/Papers/convex.pdf.
- Holmes and Walker, (2017) Holmes, C. C., and Walker, S. G. (2017). Assigning a value to a power likelihood in a general Bayesian model. Biometrika 104(2):497–503.
- Huber and Ronchetti, (2009) Huber, P.J., and Ronchetti, E. (2009). Robust Statistics. 2nd ed. Wiley, New York.
- Jaeschke et al., (1989) Jaeschke, R., Singer, J., and Guyatt, G. (1989). Measurement of health status: ascertaining the minimal clinically important difference. Control. Clin. Trials 10:407–415.
- Jiang and Tanner, (2008) Jiang, W. and Tanner, M. A. (2008). Gibbs posterior for variable selection in high-dimensional classification and data mining. Ann. Statist. 36:2207–2231.
- Kim, (2002) Kim, J.-Y. (2002). Limited information likelihood and Bayesian analysis. J. Econom. 107(1-2):175–193.
- Kleijn and van der Vaart, (2006) Kleijn, B. J. K. and van der Vaart, A. W. (2006). Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist. 34(2):837–877.
- Koltchinskii, (1997) Koltchinskii, V. (1997). M-estimation, convexity and quantiles. Ann. Statist. 25(2):435–477.
- Koltchinskii, (2006) Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34(6):2593–2656.
- Lafferty et al., (2010) Lafferty, J., Liu, H., and Wasserman, L. (2010). Chapter 10: Concentration of Measure. In Statistical Machine Learning, http://www.stat.cmu.edu/~larry/=sml/Concentration.pdf.
- Lyddon et al., (2019) Lyddon, S. P., Holmes, C. C., and Walker, S. G. (2019). General Bayesian updating and the loss-likelihood bootstrap. Biometrika. 106(2):465–478.
- Mammen and Tsybakov, (1995) Mammen, E., and Tsybakov, A. B. (1995). Asymptotical minimax recovery of sets with smooth boundaries. Ann. Statist. 23(2):502–524.
- Mammen and Tsybakov, (1999) Mammen, E., and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27(6):1808–1829.
- Maronna et al., (2006) Maronna, R. A., Martin, R. D., and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics. Wiley, Chichester.
- Martin et al., (2013) Martin, R., Hong, L., and Walker, S. G. (2013). A note on Bayesian convergence rates under local prior support conditions. arXiv:1201.3102.
- Martin et al., (2017) Martin, R., Mess, R., and Walker, S. G. (2017). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli 23:1822–1847.
- Massart and Nédélec, (2006) Massart, P., and Nédélec, É. (2006). Risk bounds for statistical learning. Ann. Statist. 34(5):2326–2366.
- McAllester, (1999) McAllester, D. (1999). PAC-Bayesian model averaging. Proceedings of COLT '99, 164–170.
- Ramamoorthi et al., (2015) Ramamoorthi, R.V., Sriram, K., and Martin, R. (2015). On posterior concentration in misspecified models. Bayesian Anal. 10(4):759–789.
- Shen and Ghosal, (2015) Shen, W. and Ghosal, S. (2015). Adaptive Bayesian procedures using random series priors. Scand. J. Stat. 42:1194–1213.
- Shen and Wasserman, (2001) Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. Ann. Statist. 29(3):687–714.
- Sriram et al., (2013) Sriram, K., Ramamoorthi, R. V., and Ghosh, P. (2013). Posterior consistency of Bayesian quantile regression based on the misspecified asymmetric Laplace density. Bayesian Anal. 8(2):479–504.
- Syring and Martin, (2017) Syring, N. and Martin, R. (2017). Gibbs posterior inference on the minimum clinically important difference. J. Statist. Plann. Inference 187:67–77.
- Syring and Martin, (2019) Syring, N. and Martin, R. (2019). Calibration of general posterior credible regions. Biometrika 106(2):479–486.
- Syring and Martin, (2020) Syring, N. and Martin, R. (2020). Robust and rate optimal Gibbs posterior inference on the boundary of a noisy image. Ann. Statist. 48(3):1498–1513.
- Takeuchi et al., (2006) Takeuchi, I., Le, Q. V., Sears, T. D., and Smola, A. J. (2006). Nonparametric quantile estimation. J. Mach. Learn. Res. 7:1231–1264.
- Tsybakov, (2004) Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32(1):135–166.
- Valiant, (1984) Valiant, L. G. (1984). A theory of the learnable. Commun. ACM 27(11):1134–1142.
- van der Vaart, (1998) van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
- van Erven et al., (2015) van Erven, T., Grünwald, P., Mehta, N., Reid, M. and Williamson, R. (2015). Fast rates in statistical and online learning. J. Mach. Learn. Res. 16:1793–1861.
- Wang and Martin, (2020) Wang, Z. and Martin, R. (2020). Model-free posterior inference on the area under the receiver operating characteristic curve. J. Statist. Plann. Inference 209:174–186.
- Wu and Martin, (2022) Wu, P.-S. and Martin, R. (2022). A comparison of learning rate selection methods in generalized Bayesian inference. Bayesian Anal., to appear; arXiv:2012.11349.
- Wu and Martin, (2021) Wu, P.-S. and Martin, R. (2021). Calibrating generalized predictive distributions. arXiv:2107.01688.
- Zhang, (2006) Zhang, T. (2006). Information theoretical upper and lower bounds for statistical estimation. IEEE Trans. Inf. Theory 52(4):1307–1321.
- Zhou et al., (2020) Zhou, Z., Zhao, J., and Bisson, L. J. (2020). Estimation of data adaptive minimal clinically important difference with a nonconvex optimization procedure. Stat. Methods Med. Res. 29(3):879–893.