
11institutetext: IBM T.J. Watson Research Center, Yorktown Heights, NY USA.
11email: yangli@us.ibm.com
22institutetext: Princeton, NJ USA.
22email: steve.hanneke@gmail.com
33institutetext: Carnegie Mellon University, Pittsburgh, PA USA
33email: jgc@cs.cmu.edu

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Liu Yang    Steve Hanneke    Jaime Carbonell
Abstract

We study the optimal rates of convergence for estimating a prior distribution over a VC class from a sequence of independent data sets respectively labeled by independent target functions sampled from the prior. We specifically derive upper and lower bounds on the optimal rates under a smoothness condition on the correct prior, with the number of samples per data set equal to the VC dimension. These results have implications for the improvements achievable via transfer learning. We additionally extend this setting to real-valued functions, where we establish consistency of an estimator for the prior, and discuss an additional application to a preference elicitation problem in algorithmic economics.

1 Introduction

In the transfer learning setting, we are presented with a sequence of learning problems, each with some respective target concept we are tasked with learning. The key question in transfer learning is how to leverage our access to past learning problems in order to improve performance on learning problems we will be presented with in the future.

Among the several proposed models for transfer learning, one particularly appealing model supposes the learning problems are independent and identically distributed, with unknown distribution, and the advantage of transfer learning then comes from the ability to estimate this shared distribution based on the data from past learning problems [2, 12]. For instance, when customizing a speech recognition system to a particular speaker’s voice, we might expect that the first few users would need to speak many words or phrases before the system could accurately identify the nuances of their speech. However, after the system has been trained by many different people, if it has access to those past training sessions when customizing itself to a new user, it should already have identified important properties of the speech patterns, such as the common patterns within each of the major dialects or accents, and other such information about the distribution of speech patterns within the user population. It should then be able to leverage this information to reduce the number of words or phrases the next user needs to speak in order to train the system, for instance by first trying to identify the individual’s dialect, then presenting phrases that differentiate common subpatterns within that dialect, and so forth.

In analyzing the benefits of transfer learning in such a setting, one important question to ask is how quickly we can estimate the distribution from which the learning problems are sampled. In recent work, [12] have shown that under mild conditions on the family of possible distributions, if the target concepts reside in a known VC class, then it is possible to estimate this distribution using only a bounded number of training samples per task: specifically, a number of samples equal to the VC dimension. However, that work left open the question of quantifying the rate of convergence. This rate of convergence can have a direct impact on how much benefit we gain from transfer learning when we are faced with only a finite sequence of learning problems. As such, it is certainly desirable to derive tight characterizations of this rate of convergence.

The present work continues that of [12], bounding the rate of convergence for estimating this distribution, under a smoothness condition on the distribution. We derive a generic upper bound, which holds regardless of the VC class the target concepts reside in. The proof of this result builds on that earlier work, but requires several interesting innovations to make the rate of convergence explicit, and to dramatically improve the upper bound implicit in the proofs of those earlier results. We further derive a nontrivial lower bound that holds for certain constructed scenarios, which illustrates a lower limit on how good a general upper bound we can hope for in results expressed only in terms of the number of tasks, the smoothness conditions, and the VC dimension.

We additionally include an extension of the results of [12] to the setting of real-valued functions, establishing consistency (at a uniform rate) for an estimator of a prior over any VC subgraph class. In addition to the application to transfer learning, analogous to the original work of [12], we also discuss an application of this result to a preference elicitation problem in algorithmic economics, in which we are tasked with allocating items to a sequence of customers to approximately maximize the customers’ satisfaction, while permitted access to the customer valuation functions only via value queries.

2 The Setting

Let (𝒳,𝒳)(\mathcal{X},{\cal B}_{\mathcal{X}}) be a measurable space [8] (where 𝒳\mathcal{X} is called the instance space), and let 𝒟\mathcal{D} be a distribution on 𝒳\mathcal{X} (called the data distribution). Let \mathbb{C} be a VC class of measurable classifiers h:𝒳{1,+1}h:\mathcal{X}\to\{-1,+1\} (called the concept space), and denote by dd the VC dimension of \mathbb{C} [10]. We suppose \mathbb{C} is equipped with its Borel σ\sigma-algebra {\cal B} induced by the pseudo-metric ρ(h,g)=𝒟({x𝒳:h(x)g(x)})\rho(h,g)=\mathcal{D}(\{x\in\mathcal{X}:h(x)\neq g(x)\}). Though our results can be formulated for general 𝒟\mathcal{D} (with somewhat more complicated theorem statements), to simplify the statement of results we suppose ρ\rho is actually a metric, which would follow from appropriate topological conditions on \mathbb{C} relative to 𝒟\mathcal{D}.
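As a simple concrete instance (an illustrative example of ours, not drawn from the original text): if 𝒳=[0,1], 𝒟 is the uniform distribution, and \mathbb{C} is the class of threshold classifiers h_{a}(x)=2\cdot\mathbbm{1}[x\geq a]-1 for a\in[0,1], then d=1 and

\rho(h_{a},h_{b})=\mathcal{D}([\min(a,b),\max(a,b)))=|a-b|,

which is indeed a metric on \mathbb{C}, so the simplifying assumption above holds in this case.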

For any two probability measures μ1,μ2\mu_{1},\mu_{2} on a measurable space (Ω,)(\Omega,\mathcal{F}), define the total variation distance

μ1μ2=supAμ1(A)μ2(A).\|\mu_{1}-\mu_{2}\|=\sup_{A\in\mathcal{F}}\mu_{1}(A)-\mu_{2}(A).
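To make this definition concrete, here is a minimal Python sketch (our own illustration; the distributions below are hypothetical) computing the total variation distance on a finite space, where the supremum is attained at A=\{\omega:\mu_{1}(\omega)>\mu_{2}(\omega)\} and equals half the L_{1} distance between the probability mass functions.

```python
# A minimal sketch (hypothetical example, not from the paper): the total variation
# distance between two distributions on a finite measurable space, using the identity
# ||mu1 - mu2|| = sup_A (mu1(A) - mu2(A)) = (1/2) * sum_w |mu1(w) - mu2(w)|,
# with the supremum attained at A = {w : mu1(w) > mu2(w)}.
def total_variation(mu1, mu2):
    support = set(mu1) | set(mu2)
    return 0.5 * sum(abs(mu1.get(w, 0.0) - mu2.get(w, 0.0)) for w in support)

# Example: two distributions on Omega = {"a", "b", "c"}.
print(total_variation({"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.25, "c": 0.5}))  # 0.5
```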

For a set function μ\mu on a finite measurable space (Ω,)(\Omega,\mathcal{F}), we abbreviate μ(ω)=μ({ω})\mu(\omega)=\mu(\{\omega\}), ωΩ\forall\omega\in\Omega. Let ΠΘ={πθ:θΘ}\Pi_{\Theta}=\{\pi_{\theta}:\theta\in\Theta\} be a family of probability measures on \mathbb{C} (called priors), where Θ\Theta is an arbitrary index set (called the parameter space). We suppose there exists a probability measure π0\pi_{0} on \mathbb{C} (the reference measure) such that every πθ\pi_{\theta} is absolutely continuous with respect to π0\pi_{0}, and therefore has a density function fθf_{\theta} given by the Radon-Nikodym derivative dπθdπ0\frac{{\rm d}\pi_{\theta}}{{\rm d}\pi_{0}} [8].

We consider the following type of estimation problem. There is a collection of \mathbb{C}-valued random variables {htθ:t,θΘ}\{h^{*}_{t\theta}:t\in\mathbb{N},\theta\in\Theta\}, where for any fixed θΘ\theta\in\Theta the {htθ}t=1\{h^{*}_{t\theta}\}_{t=1}^{\infty} variables are i.i.d. with distribution πθ\pi_{\theta}. For each θΘ\theta\in\Theta, there is a sequence 𝒵t(θ)={(Xt1,Yt1(θ)),(Xt2,Yt2(θ)),}\mathcal{Z}^{t}(\theta)=\{(X_{t1},Y_{t1}(\theta)),(X_{t2},Y_{t2}(\theta)),\ldots\}, where {Xti}t,i\{X_{ti}\}_{t,i\in\mathbb{N}} are i.i.d. 𝒟\mathcal{D}, and for each t,it,i\in\mathbb{N}, Yti(θ)=htθ(Xti)Y_{ti}(\theta)=h^{*}_{t\theta}(X_{ti}). We additionally denote by 𝒵kt(θ)={(Xt1,Yt1(θ)),,(Xtk,Ytk(θ))}\mathcal{Z}^{t}_{k}(\theta)=\{(X_{t1},Y_{t1}(\theta)),\ldots,(X_{tk},Y_{tk}(\theta))\} the first kk elements of 𝒵t(θ)\mathcal{Z}^{t}(\theta), for any kk\in\mathbb{N}, and similarly 𝕏tk={Xt1,,Xtk}\mathbb{X}_{tk}=\{X_{t1},\ldots,X_{tk}\} and 𝕐tk(θ)={Yt1(θ),,Ytk(θ)}\mathbb{Y}_{tk}(\theta)=\{Y_{t1}(\theta),\ldots,Y_{tk}(\theta)\}. Following the terminology used in the transfer learning literature, we refer to the collection of variables associated with each tt collectively as the ttht^{{\rm th}} task. We will be concerned with sequences of estimators θ^Tθ=θ^T(𝒵k1(θ),,𝒵kT(θ))\hat{\theta}_{T\theta}=\hat{\theta}_{T}(\mathcal{Z}^{1}_{k}(\theta),\ldots,\mathcal{Z}^{T}_{k}(\theta)), for TT\in\mathbb{N}, which are based on only a bounded number kk of samples per task, among the first TT tasks. Our main results specifically study the case of dd samples per task. For any such estimator, we measure the risk as 𝔼[πθ^Tθπθ]\mathbb{E}\left[\|\pi_{\hat{\theta}_{T{\theta_{\star}}}}-\pi_{{\theta_{\star}}}\|\right], and will be particularly interested in upper-bounding the worst-case risk supθΘ𝔼[πθ^Tθπθ]\sup_{{\theta_{\star}}\in\Theta}\mathbb{E}\left[\|\pi_{\hat{\theta}_{T{\theta_{\star}}}}-\pi_{{\theta_{\star}}}\|\right] as a function of TT, and lower-bounding the minimum possible value of this worst-case risk over all possible θ^T\hat{\theta}_{T} estimators (called the minimax risk).

In previous work, [12] showed that, if ΠΘ\Pi_{\Theta} is a totally bounded family, then even with only dd samples per task, the minimax risk (as a function of the number of tasks TT) converges to zero. In fact, that work also proved this is not necessarily the case in general for any number of samples less than dd. However, the actual rates of convergence were not explicitly derived in that work, and indeed the upper bounds on the rates of convergence implicit in that analysis may often have fairly complicated dependences on \mathbb{C}, ΠΘ\Pi_{\Theta}, and 𝒟\mathcal{D}, and furthermore often provide only very slow rates of convergence.

To derive explicit bounds on the rates of convergence, in the present work we specifically focus on families of smooth densities. The motivation for involving a notion of smoothness in characterizing rates of convergence is clear if we consider the extreme case in which ΠΘ\Pi_{\Theta} contains two priors π1\pi_{1} and π2\pi_{2}, with π1({h})=π2({g})=1\pi_{1}(\{h\})=\pi_{2}(\{g\})=1, where ρ(h,g)\rho(h,g) is a very small but nonzero value; in this case, if we have only a small number of samples per task, we would require many tasks (on the order of 1/ρ(h,g)1/\rho(h,g)) to observe any data points carrying any information that would distinguish between these two priors (namely, points xx with h(x)g(x)h(x)\neq g(x)); yet π1π2=1\|\pi_{1}-\pi_{2}\|=1, so that we have a slow rate of convergence (at least initially). A total boundedness condition on ΠΘ\Pi_{\Theta} would limit the number of such pairs present in ΠΘ\Pi_{\Theta}, so that for instance we cannot have arbitrarily close hh and gg, but less extreme variants of this can lead to slow asymptotic rates of convergence as well. Specifically, in the present work we consider the following notion of smoothness. For L(0,)L\in(0,\infty) and α(0,1]\alpha\in(0,1], a function f:f:\mathbb{C}\to\mathbb{R} is (L,α)(L,\alpha)-Hölder smooth if

h,g,|f(h)f(g)|Lρ(h,g)α.\forall h,g\in\mathbb{C},|f(h)-f(g)|\leq L\rho(h,g)^{\alpha}.
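One simple observation about this condition (ours, not from the original text): since \rho(h,g)=\mathcal{D}(\{x:h(x)\neq g(x)\})\in[0,1], any density that is L-Lipschitz with respect to \rho, i.e. (L,1)-Hölder smooth, is automatically (L,\alpha)-Hölder smooth for every \alpha\in(0,1], because

|f(h)-f(g)|\leq L\rho(h,g)\leq L\rho(h,g)^{\alpha}.

Thus smaller exponents \alpha correspond to weaker requirements on the prior.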

3 An Upper Bound

We now have the following theorem, holding for an arbitrary VC class \mathbb{C} and data distribution 𝒟\mathcal{D}; it is the main result of this work.

Theorem 3.1

For ΠΘ\Pi_{\Theta} any class of priors on \mathbb{C} having (L,α)(L,\alpha)-Hölder smooth densities {fθ:θΘ}\{f_{\theta}:\theta\in\Theta\}, for any TT\in\mathbb{N}, there exists an estimator θ^Tθ=θ^T(𝒵d1(θ),,𝒵dT(θ))\hat{\theta}_{T\theta}=\hat{\theta}_{T}(\mathcal{Z}^{1}_{d}(\theta),\ldots,\mathcal{Z}^{T}_{d}(\theta)) such that

supθΘ𝔼πθ^Tπθ=O~(LTα22(d+2α)(α+2(d+1))).\sup_{{\theta_{\star}}\in\Theta}\mathbb{E}\|\pi_{\hat{\theta}_{T}}-\pi_{{\theta_{\star}}}\|=\tilde{O}\left(LT^{-\frac{\alpha^{2}}{2(d+2\alpha)(\alpha+2(d+1))}}\right).
Proof

By the standard PAC analysis [9, 3], for any γ>0\gamma>0, with probability greater than 1γ1-\gamma, a sample of k=O((d/γ)log(1/γ))k=O((d/\gamma)\log(1/\gamma)) random points will partition \mathbb{C} into regions of width less than γ\gamma (under L1(𝒟)L_{1}(\mathcal{D})). For brevity, we omit the tt subscripts and superscripts on quantities such as 𝒵kt(θ)\mathcal{Z}^{t}_{k}(\theta) throughout the following analysis, since the claims hold for any arbitrary value of tt.

For any θΘ\theta\in\Theta, let πθ\pi_{\theta}^{\prime} denote a (conditional on X1,,XkX_{1},\ldots,X_{k}) distribution defined as follows. Let fθf_{\theta}^{\prime} denote the (conditional on X1,,XkX_{1},\ldots,X_{k}) density function of πθ\pi_{\theta}^{\prime} with respect to π0\pi_{0}, and for any gg\in\mathbb{C}, let fθ(g)=πθ({h:ik,h(Xi)=g(Xi)})π0({h:ik,h(Xi)=g(Xi)})f_{\theta}^{\prime}(g)=\frac{\pi_{\theta}(\{h\in\mathbb{C}:\forall i\leq k,h(X_{i})=g(X_{i})\})}{\pi_{0}(\{h\in\mathbb{C}:\forall i\leq k,h(X_{i})=g(X_{i})\})} (or 0 if π0({h:ik,h(Xi)=g(Xi)})=0\pi_{0}(\{h\in\mathbb{C}:\forall i\leq k,h(X_{i})=g(X_{i})\})=0). In other words, πθ\pi_{\theta}^{\prime} has the same probability mass as πθ\pi_{\theta} for each of the equivalence classes induced by X1,,XkX_{1},\ldots,X_{k}, but conditioned on the equivalence class, simply has a constant-density distribution over that equivalence class. Note that every hh\in\mathbb{C} has fθ(h)f_{\theta}^{\prime}(h) between the smallest and largest values of fθ(g)f_{\theta}(g) among gg\in\mathbb{C} with ik,g(Xi)=h(Xi)\forall i\leq k,g(X_{i})=h(X_{i}); therefore, by the smoothness condition, on the event (of probability greater than 1γ1-\gamma) that each of these regions has diameter less than γ\gamma, we have h,|fθ(h)fθ(h)|<Lγα\forall h\in\mathbb{C},|f_{\theta}(h)-f_{\theta}^{\prime}(h)|<L\gamma^{\alpha}. On this event, for any θ,θΘ\theta,\theta^{\prime}\in\Theta,

πθπθ=(1/2)|fθfθ|dπ0<Lγα+(1/2)|fθfθ|dπ0.\|\pi_{\theta}-\pi_{\theta^{\prime}}\|=(1/2)\int|f_{\theta}-f_{\theta^{\prime}}|{\rm d}\pi_{0}<L\gamma^{\alpha}+(1/2)\int|f_{\theta}^{\prime}-f_{\theta^{\prime}}^{\prime}|{\rm d}\pi_{0}.

Furthermore, since the regions that define fθf_{\theta}^{\prime} and fθf_{\theta^{\prime}}^{\prime} are the same (namely, the partition induced by X1,,XkX_{1},\ldots,X_{k}), we have

(1/2)|fθfθ|dπ0\displaystyle(1/2)\int|f_{\theta}^{\prime}-f_{\theta^{\prime}}^{\prime}|{\rm d}\pi_{0} =(1/2)y1,,yk{1,+1}|πθ({h:ik,h(Xi)=yi})\displaystyle=(1/2)\!\!\!\!\!\!\!\!\!\sum_{y_{1},\ldots,y_{k}\in\{-1,+1\}}\!\!\!|\pi_{\theta}(\{h\in\mathbb{C}:\forall i\leq k,h(X_{i})=y_{i}\})
πθ({h:ik,h(Xi)=yi})|\displaystyle\phantom{aaaaaaaaaaaaaaaaa}-\pi_{\theta^{\prime}}(\{h\in\mathbb{C}:\forall i\leq k,h(X_{i})=y_{i}\})|
=𝕐k(θ)|𝕏k𝕐k(θ)|𝕏k.\displaystyle=\|\mathbb{P}_{\mathbb{Y}_{k}(\theta)|\mathbb{X}_{k}}-\mathbb{P}_{\mathbb{Y}_{k}(\theta^{\prime})|\mathbb{X}_{k}}\|.

Thus, we have that with probability at least 1γ1-\gamma,

πθπθ<Lγα+𝕐k(θ)|𝕏k𝕐k(θ)|𝕏k.\|\pi_{\theta}-\pi_{\theta^{\prime}}\|<L\gamma^{\alpha}+\|\mathbb{P}_{\mathbb{Y}_{k}(\theta)|\mathbb{X}_{k}}-\mathbb{P}_{\mathbb{Y}_{k}(\theta^{\prime})|\mathbb{X}_{k}}\|.

Following an argument analogous to the inductive argument of [12], suppose I{1,,k}I\subseteq\{1,\ldots,k\}, fix x¯I𝒳|I|\bar{x}_{I}\in\mathcal{X}^{|I|} and y¯I{1,+1}|I|\bar{y}_{I}\in\{-1,+1\}^{|I|}. Then the y~I{1,+1}|I|\tilde{y}_{I}\in\{-1,+1\}^{|I|} for which y¯Iy~I1\|\bar{y}_{I}-\tilde{y}_{I}\|_{1} is minimal, subject to the constraint that no hh\in\mathbb{C} has h(x¯I)=y~Ih(\bar{x}_{I})=\tilde{y}_{I}, has (1/2)y¯Iy~I1d+1(1/2)\|\bar{y}_{I}-\tilde{y}_{I}\|_{1}\leq d+1; also, for any iIi\in I with y¯iy~i\bar{y}_{i}\neq\tilde{y}_{i}, letting y¯j=y¯j\bar{y}^{\prime}_{j}=\bar{y}_{j} for jI{i}j\in I\setminus\{i\} and y¯i=y~i\bar{y}^{\prime}_{i}=\tilde{y}_{i}, we have

𝕐I(θ)|𝕏I(y¯I|x¯I)=𝕐I{i}(θ)|𝕏I{i}(y¯I{i}|x¯I{i})𝕐I(θ)|𝕏I(y¯I|x¯I),\mathbb{P}_{\mathbb{Y}_{I}(\theta)|\mathbb{X}_{I}}(\bar{y}_{I}|\bar{x}_{I})=\mathbb{P}_{\mathbb{Y}_{I\setminus\{i\}}(\theta)|\mathbb{X}_{I\setminus\{i\}}}(\bar{y}_{I\setminus\{i\}}|\bar{x}_{I\setminus\{i\}})-\mathbb{P}_{\mathbb{Y}_{I}(\theta)|\mathbb{X}_{I}}(\bar{y}^{\prime}_{I}|\bar{x}_{I}),

and similarly for θ\theta^{\prime}, so that

|𝕐I(θ)|𝕏I(y¯I|x¯I)𝕐I(θ)|𝕏I(y¯I|x¯I)|\displaystyle|\mathbb{P}_{\mathbb{Y}_{I}(\theta)|\mathbb{X}_{I}}(\bar{y}_{I}|\bar{x}_{I})-\mathbb{P}_{\mathbb{Y}_{I}(\theta^{\prime})|\mathbb{X}_{I}}(\bar{y}_{I}|\bar{x}_{I})|
|𝕐I{i}(θ)|𝕏I{i}(y¯I{i}|x¯I{i})𝕐I{i}(θ)|𝕏I{i}(y¯I{i}|x¯I{i})|\displaystyle\leq|\mathbb{P}_{\mathbb{Y}_{I\setminus\{i\}}(\theta)|\mathbb{X}_{I\setminus\{i\}}}(\bar{y}_{I\setminus\{i\}}|\bar{x}_{I\setminus\{i\}})-\mathbb{P}_{\mathbb{Y}_{I\setminus\{i\}}(\theta^{\prime})|\mathbb{X}_{I\setminus\{i\}}}(\bar{y}_{I\setminus\{i\}}|\bar{x}_{I\setminus\{i\}})|
+|𝕐I(θ)|𝕏I(y¯I|x¯I)𝕐I(θ)|𝕏I(y¯I|x¯I)|.\displaystyle\phantom{aaaa}+|\mathbb{P}_{\mathbb{Y}_{I}(\theta)|\mathbb{X}_{I}}(\bar{y}^{\prime}_{I}|\bar{x}_{I})-\mathbb{P}_{\mathbb{Y}_{I}(\theta^{\prime})|\mathbb{X}_{I}}(\bar{y}^{\prime}_{I}|\bar{x}_{I})|.

Now consider that these two terms inductively define a binary tree. Every time the tree branches left once, it arrives at a difference of probabilities for a set II of one less element than that of its parent. Every time the tree branches right once, it arrives at a difference of probabilities for a y¯I\bar{y}_{I} one closer to an unrealized y~I\tilde{y}_{I} than that of its parent. Say we stop branching the tree upon reaching a set II and a y¯I\bar{y}_{I} such that either y¯I\bar{y}_{I} is an unrealized labeling, or |I|=d|I|=d. Thus, we can bound the original (root node) difference of probabilities by the sum of the differences of probabilities for the leaf nodes with |I|=d|I|=d. Any path in the tree can branch left at most kdk-d times (total) before reaching a set II with only dd elements, and can branch right at most d+1d+1 times in a row before reaching a y¯I\bar{y}_{I} such that both probabilities are zero, so that the difference is zero. So the depth of any leaf node with |I|=d|I|=d is at most (kd)d(k-d)d. Furthermore, at any level of the tree, from left to right the nodes have strictly decreasing |I||I| values, so that the maximum width of the tree is at most kdk-d. So the total number of leaf nodes with |I|=d|I|=d is at most (kd)2d(k-d)^{2}d. Thus, for any y¯{1,+1}k\bar{y}\in\{-1,+1\}^{k} and x¯𝒳k\bar{x}\in\mathcal{X}^{k},

|𝕐k(θ)|𝕏k(y¯|x¯)𝕐k(θ)|𝕏k(y¯|x¯)|\displaystyle|\mathbb{P}_{\mathbb{Y}_{k}(\theta)|\mathbb{X}_{k}}(\bar{y}|\bar{x})-\mathbb{P}_{\mathbb{Y}_{k}(\theta^{\prime})|\mathbb{X}_{k}}(\bar{y}|\bar{x})|
(kd)2dmaxy¯d{1,+1}dmaxD{1,,k}d|𝕐d(θ)|𝕏d(y¯d|x¯D)𝕐d(θ)|𝕏d(y¯d|x¯D)|.\displaystyle\leq(k-d)^{2}d\cdot\max_{\bar{y}^{d}\in\{-1,+1\}^{d}}\max_{D\in\{1,\ldots,k\}^{d}}|\mathbb{P}_{\mathbb{Y}_{d}(\theta)|\mathbb{X}_{d}}(\bar{y}^{d}|\bar{x}_{D})-\mathbb{P}_{\mathbb{Y}_{d}(\theta^{\prime})|\mathbb{X}_{d}}(\bar{y}^{d}|\bar{x}_{D})|.

Since

𝕐k(θ)|𝕏k𝕐k(θ)|𝕏k=(1/2)y¯k{1,+1}k|𝕐k(θ)|𝕏k(y¯k)𝕐k(θ)|𝕏k(y¯k)|,\|\mathbb{P}_{\mathbb{Y}_{k}(\theta)|\mathbb{X}_{k}}-\mathbb{P}_{\mathbb{Y}_{k}(\theta^{\prime})|\mathbb{X}_{k}}\|=(1/2)\sum_{\bar{y}^{k}\in\{-1,+1\}^{k}}|\mathbb{P}_{\mathbb{Y}_{k}(\theta)|\mathbb{X}_{k}}(\bar{y}^{k})-\mathbb{P}_{\mathbb{Y}_{k}(\theta^{\prime})|\mathbb{X}_{k}}(\bar{y}^{k})|,

and by Sauer’s Lemma this is at most

(ek)dmaxy¯k{1,+1}k|𝕐k(θ)|𝕏k(y¯k)𝕐k(θ)|𝕏k(y¯k)|,(ek)^{d}\max_{\bar{y}^{k}\in\{-1,+1\}^{k}}|\mathbb{P}_{\mathbb{Y}_{k}(\theta)|\mathbb{X}_{k}}(\bar{y}^{k})-\mathbb{P}_{\mathbb{Y}_{k}(\theta^{\prime})|\mathbb{X}_{k}}(\bar{y}^{k})|,

we have that

𝕐k(θ)|𝕏k𝕐k(θ)|𝕏k\displaystyle\|\mathbb{P}_{\mathbb{Y}_{k}(\theta)|\mathbb{X}_{k}}-\mathbb{P}_{\mathbb{Y}_{k}(\theta^{\prime})|\mathbb{X}_{k}}\|
(ek)dk2dmaxy¯d{1,+1}dmaxD{1,,k}d|𝕐d(θ)|𝕏D(y¯d)𝕐d(θ)|𝕏D(y¯d)|.\displaystyle\leq(ek)^{d}k^{2}d\max_{\bar{y}^{d}\in\{-1,+1\}^{d}}\max_{D\in\{1,\ldots,k\}^{d}}|\mathbb{P}_{\mathbb{Y}_{d}(\theta)|\mathbb{X}_{D}}(\bar{y}^{d})-\mathbb{P}_{\mathbb{Y}_{d}(\theta^{\prime})|\mathbb{X}_{D}}(\bar{y}^{d})|.

Thus, we have that

πθπθ=𝔼πθπθ\displaystyle\|\pi_{\theta}-\pi_{\theta^{\prime}}\|=\mathbb{E}\|\pi_{\theta}-\pi_{\theta^{\prime}}\|
<γ+Lγα+(ek)dk2d𝔼[maxy¯d{1,+1}dmaxD{1,,k}d|𝕐d(θ)|𝕏D(y¯d)𝕐d(θ)|𝕏D(y¯d)|].\displaystyle<\gamma\!+\!L\gamma^{\alpha}\!+\!(ek)^{d}k^{2}d\mathbb{E}\bigg{[}\max_{\bar{y}^{d}\in\{-1,+1\}^{d}}\max_{D\in\{1,\ldots,k\}^{d}}|\mathbb{P}_{\mathbb{Y}_{d}(\theta)|\mathbb{X}_{D}}(\bar{y}^{d})-\mathbb{P}_{\mathbb{Y}_{d}(\theta^{\prime})|\mathbb{X}_{D}}(\bar{y}^{d})|\bigg{]}.

Note that

𝔼[maxy¯d{1,+1}dmaxD{1,,k}d|𝕐d(θ)|𝕏D(y¯d)𝕐d(θ)|𝕏D(y¯d)|]\displaystyle\mathbb{E}\bigg{[}\max_{\bar{y}^{d}\in\{-1,+1\}^{d}}\max_{D\in\{1,\ldots,k\}^{d}}|\mathbb{P}_{\mathbb{Y}_{d}(\theta)|\mathbb{X}_{D}}(\bar{y}^{d})-\mathbb{P}_{\mathbb{Y}_{d}(\theta^{\prime})|\mathbb{X}_{D}}(\bar{y}^{d})|\bigg{]}
y¯d{1,+1}dD{1,,k}d𝔼[|𝕐d(θ)|𝕏D(y¯d)𝕐d(θ)|𝕏D(y¯d)|]\displaystyle\leq\sum_{\bar{y}^{d}\in\{-1,+1\}^{d}}\sum_{D\in\{1,\ldots,k\}^{d}}\mathbb{E}\big{[}|\mathbb{P}_{\mathbb{Y}_{d}(\theta)|\mathbb{X}_{D}}(\bar{y}^{d})-\mathbb{P}_{\mathbb{Y}_{d}(\theta^{\prime})|\mathbb{X}_{D}}(\bar{y}^{d})|\big{]}
(2k)dmaxy¯d{1,+1}dmaxD{1,,k}d𝔼[|𝕐d(θ)|𝕏D(y¯d)𝕐d(θ)|𝕏D(y¯d)|],\displaystyle\leq(2k)^{d}\max_{\bar{y}^{d}\in\{-1,+1\}^{d}}\max_{D\in\{1,\ldots,k\}^{d}}\mathbb{E}\big{[}|\mathbb{P}_{\mathbb{Y}_{d}(\theta)|\mathbb{X}_{D}}(\bar{y}^{d})-\mathbb{P}_{\mathbb{Y}_{d}(\theta^{\prime})|\mathbb{X}_{D}}(\bar{y}^{d})|\big{]},

and by exchangeability, this last line equals

(2k)dmaxy¯d{1,+1}d𝔼[|𝕐d(θ)|𝕏d(y¯d)𝕐d(θ)|𝕏d(y¯d)|].(2k)^{d}\max_{\bar{y}^{d}\in\{-1,+1\}^{d}}\mathbb{E}\left[|\mathbb{P}_{\mathbb{Y}_{d}(\theta)|\mathbb{X}_{d}}(\bar{y}^{d})-\mathbb{P}_{\mathbb{Y}_{d}(\theta^{\prime})|\mathbb{X}_{d}}(\bar{y}^{d})|\right].

[12] showed that 𝔼[|𝕐d(θ)|𝕏d(y¯d)𝕐d(θ)|𝕏d(y¯d)|]4𝒵d(θ)𝒵d(θ)\mathbb{E}\left[|\mathbb{P}_{\mathbb{Y}_{d}(\theta)|\mathbb{X}_{d}}(\bar{y}^{d})-\mathbb{P}_{\mathbb{Y}_{d}(\theta^{\prime})|\mathbb{X}_{d}}(\bar{y}^{d})|\right]\leq 4\sqrt{\|\mathbb{P}_{\mathcal{Z}_{d}(\theta)}-\mathbb{P}_{\mathcal{Z}_{d}(\theta^{\prime})}\|}, so that in total we have πθπθ<(L+1)γα+4(2ek)2d+2𝒵d(θ)𝒵d(θ)\|\pi_{\theta}-\pi_{\theta^{\prime}}\|\!<\!(L\!+\!1)\gamma^{\alpha}\!+\!4(2ek)^{2d+2}\!\!\sqrt{\|\mathbb{P}_{\mathcal{Z}_{d}(\theta)}\!-\!\mathbb{P}_{\mathcal{Z}_{d}(\theta^{\prime})}\|}. Plugging in the value of k=c(d/γ)log(1/γ)k=c(d/\gamma)\log(1/\gamma), this is

(L+1)γα+4(2ecdγlog(1γ))2d+2𝒵d(θ)𝒵d(θ).(L\!+\!1)\gamma^{\alpha}+4\!\left(\!2ec\frac{d}{\gamma}\log\!\left(\frac{1}{\gamma}\right)\!\right)^{\!\!2d+2}\!\sqrt{\|\mathbb{P}_{\mathcal{Z}_{d}(\theta)}\!-\!\mathbb{P}_{\mathcal{Z}_{d}(\theta^{\prime})}\|}.

Thus, it suffices to bound the rate of convergence (in total variation distance) of some estimator of 𝒵d(θ)\mathbb{P}_{\mathcal{Z}_{d}({\theta_{\star}})}. If N(ε)N(\varepsilon) is the ε\varepsilon-covering number of {𝒵d(θ):θΘ}\{\mathbb{P}_{\mathcal{Z}_{d}(\theta)}:\theta\in\Theta\}, then taking θ^Tθ\hat{\theta}_{T{\theta_{\star}}} as the minimum distance skeleton estimate of [13, 5] achieves expected total variation distance ε\varepsilon from 𝒵d(θ)\mathbb{P}_{\mathcal{Z}_{d}({\theta_{\star}})}, for some T=O((1/ε2)logN(ε/4))T=O((1/\varepsilon^{2})\log N(\varepsilon/4)). We can partition \mathbb{C} into O((L/ε)d/α)O((L/\varepsilon)^{d/\alpha}) cells of diameter O((ε/L)1/α)O((\varepsilon/L)^{1/\alpha}), and set a constant density value within each cell, on an O(ε)O(\varepsilon)-grid of density values, and every prior with (L,α)(L,\alpha)-Hölder smooth density will have density within ε\varepsilon of some density so-constructed; there are then at most (1/ε)O((L/ε)d/α)(1/\varepsilon)^{O((L/\varepsilon)^{d/\alpha})} such densities, so this bounds the covering numbers of ΠΘ\Pi_{\Theta}. Furthermore, the covering number of ΠΘ\Pi_{\Theta} upper bounds N(ε)N(\varepsilon) [12], so that N(ε)(1/ε)O((L/ε)d/α)N(\varepsilon)\leq(1/\varepsilon)^{O((L/\varepsilon)^{d/\alpha})}.

Solving T=O(ε2(L/ε)d/αlog(1/ε))T\!=\!O(\varepsilon^{-2}(L/\varepsilon)^{d/\alpha}\log(1/\varepsilon)) for ε\varepsilon, we have ε=O(L(log(TL)T)αd+2α)\varepsilon\!=\!O\!\left(L\!\left(\frac{\log(TL)}{T}\right)^{\frac{\alpha}{d+2\alpha}}\right). So this bounds the rate of convergence for 𝔼𝒵d(θ^T)𝒵d(θ)\mathbb{E}\|\mathbb{P}_{\mathcal{Z}_{d}(\hat{\theta}_{T})}-\mathbb{P}_{\mathcal{Z}_{d}({\theta_{\star}})}\|, for θ^T\hat{\theta}_{T} the minimum distance skeleton estimate. Plugging this rate into the bound on the priors, combined with Jensen’s inequality, we have

𝔼πθ^Tπθ<(L+1)γα+4(2ecdγlog(1γ))2d+2×O(L(log(TL)T)α2d+4α).\mathbb{E}\|\pi_{\hat{\theta}_{T}}-\pi_{{\theta_{\star}}}\|<(L+1)\gamma^{\alpha}+4\left(2ec\frac{d}{\gamma}\log\left(\frac{1}{\gamma}\right)\right)^{2d+2}\!\!\!\!\!\times O\left(L\left(\frac{\log(TL)}{T}\right)^{\frac{\alpha}{2d+4\alpha}}\right).

This holds for any γ>0\gamma>0, so minimizing this expression over γ>0\gamma>0 yields a bound on the rate. For instance, with γ=O~(Tα2(d+2α)(α+2(d+1)))\gamma=\tilde{O}\left(T^{-\frac{\alpha}{2(d+2\alpha)(\alpha+2(d+1))}}\right), we have

𝔼πθ^Tπθ=O~(LTα22(d+2α)(α+2(d+1))).\mathbb{E}\|\pi_{\hat{\theta}_{T}}-\pi_{{\theta_{\star}}}\|=\tilde{O}\left(LT^{-\frac{\alpha^{2}}{2(d+2\alpha)(\alpha+2(d+1))}}\right).
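For the reader's convenience, here is an informal accounting of the exponents in the last two steps (our own unpacking, suppressing constants, the factor d, and logarithmic terms). First, since the bound on \|\pi_{\theta}-\pi_{\theta^{\prime}}\| involves \sqrt{\|\mathbb{P}_{\mathcal{Z}_{d}(\theta)}-\mathbb{P}_{\mathcal{Z}_{d}(\theta^{\prime})}\|}, Jensen's inequality (concavity of the square root) gives \mathbb{E}\sqrt{\|\mathbb{P}_{\mathcal{Z}_{d}(\hat{\theta}_{T})}-\mathbb{P}_{\mathcal{Z}_{d}({\theta_{\star}})}\|}\leq\sqrt{\mathbb{E}\|\mathbb{P}_{\mathcal{Z}_{d}(\hat{\theta}_{T})}-\mathbb{P}_{\mathcal{Z}_{d}({\theta_{\star}})}\|}, which converts the T^{-\frac{\alpha}{d+2\alpha}} rate of the skeleton estimate into the T^{-\frac{\alpha}{2d+4\alpha}} rate above. Second, the final bound behaves like \gamma^{\alpha}+\gamma^{-(2d+2)}T^{-\frac{\alpha}{2(d+2\alpha)}}, and the two terms are balanced by taking

\gamma^{\alpha+2(d+1)}=T^{-\frac{\alpha}{2(d+2\alpha)}},\qquad\text{i.e.,}\qquad\gamma=T^{-\frac{\alpha}{2(d+2\alpha)(\alpha+2(d+1))}},

so that \gamma^{\alpha}=T^{-\frac{\alpha^{2}}{2(d+2\alpha)(\alpha+2(d+1))}}, which is the stated rate.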

4 A Minimax Lower Bound

One natural question is whether Theorem 3.1 can generally be improved. While we expect this to be true for some fixed VC classes (e.g., those of finite size), and in any case we expect that some of the constant factors in the exponent may be improvable, it is not at this time clear whether the general form of TΘ(α2/(d+α)2)T^{-\Theta(\alpha^{2}/(d+\alpha)^{2})} is sometimes optimal. One way to investigate this question is to construct specific spaces \mathbb{C} and distributions 𝒟\mathcal{D} for which a lower bound can be obtained. In particular, we are generally interested in exhibiting lower bounds that are worse than those that apply to the usual problem of density estimation based on direct access to the htθh^{*}_{t{\theta_{\star}}} values (see Theorem 4.2 below).

Here we present a lower bound that is interesting for this reason. However, although larger than the optimal rate for methods with direct access to the target concepts, it is still far from matching the upper bound above, so that the question of tightness remains open. Specifically, we have the following result.

Theorem 4.1

For any integer d1d\geq 1, any L>0,α(0,1]L>0,\alpha\in(0,1], there is a value C(d,L,α)(0,)C(d,L,\alpha)\in(0,\infty) such that, for any TT\in\mathbb{N}, there exists an instance space 𝒳\mathcal{X}, a concept space \mathbb{C} of VC dimension dd, a distribution 𝒟\mathcal{D} over 𝒳\mathcal{X}, and a distribution π0\pi_{0} over \mathbb{C} such that, for ΠΘ\Pi_{\Theta} a set of distributions over \mathbb{C} with (L,α)(L,\alpha)-Hölder smooth density functions with respect to π0\pi_{0}, any estimator θ^T=θ^T(𝒵d1(θ),,𝒵dT(θ))\hat{\theta}_{T}=\hat{\theta}_{T}(\mathcal{Z}^{1}_{d}({\theta_{\star}}),\ldots,\mathcal{Z}^{T}_{d}({\theta_{\star}})) has

supθΘ𝔼[πθ^Tπθ]C(d,L,α)Tα2(d+α).\sup_{{\theta_{\star}}\in\Theta}\mathbb{E}\left[\|\pi_{\hat{\theta}_{T}}-\pi_{{\theta_{\star}}}\|\right]\geq C(d,L,\alpha)T^{-\frac{\alpha}{2(d+\alpha)}}.
Proof

(Sketch) We proceed by a reduction from the task of determining the bias of a coin from among two given possibilities. Specifically, fix any γ(0,1/2)\gamma\in(0,1/2), nn\in\mathbb{N}, and let B1(p),,Bn(p)B_{1}(p),\ldots,B_{n}(p) be i.i.d. Bernoulli(p){\rm Bernoulli}(p) random variables, for each p[0,1]p\in[0,1]; then it is known that, for any (possibly nondeterministic) decision rule p^n:{0,1}n{(1+γ)/2,(1γ)/2}\hat{p}_{n}:\{0,1\}^{n}\to\{(1+\gamma)/2,(1-\gamma)/2\},

12p{(1+γ)/2,(1γ)/2}(p^n(B1(p),,Bn(p))p)(1/32)exp{128γ2n/3}.\frac{1}{2}\sum_{p\in\{(1+\gamma)/2,(1-\gamma)/2\}}\mathbb{P}(\hat{p}_{n}(B_{1}(p),\ldots,B_{n}(p))\neq p)\\ \geq(1/32)\cdot\exp\left\{-128\gamma^{2}n/3\right\}. (1)

This easily follows from the results of [1], combined with a result of [7] bounding the KL divergence (see also [11]).

To use this result, we construct a learning problem as follows. Fix some mm\in\mathbb{N} with mdm\geq d, let 𝒳={1,,m}\mathcal{X}=\{1,\ldots,m\}, and let \mathbb{C} be the space of all classifiers h:𝒳{1,+1}h:\mathcal{X}\to\{-1,+1\} such that |{x𝒳:h(x)=+1}|d|\{x\in\mathcal{X}:h(x)=+1\}|\leq d. Clearly the VC dimension of \mathbb{C} is dd. Define the distribution 𝒟\mathcal{D} as uniform over 𝒳\mathcal{X}. Finally, we specify a family of (L,α)(L,\alpha)-Hölder smooth priors, parameterized by Θ={1,+1}(md)\Theta=\{-1,+1\}^{\binom{m}{d}}, as follows. Let γm=(L/2)(1/m)α\gamma_{m}=(L/2)(1/m)^{\alpha}. First, enumerate the (md)\binom{m}{d} distinct dd-sized subsets of {1,,m}\{1,\ldots,m\} as 𝒳1,𝒳2,,𝒳(md)\mathcal{X}_{1},\mathcal{X}_{2},\ldots,\mathcal{X}_{\binom{m}{d}}. Define the reference distribution π0\pi_{0} by the property that, for any hh\in\mathbb{C}, letting q=|{x:h(x)=+1}|q=|\{x:h(x)=+1\}|, π0({h})=(12)d(mqdq)/(md)\pi_{0}(\{h\})=(\frac{1}{2})^{d}\binom{m-q}{d-q}/\binom{m}{d}. For any 𝐛=(b1,,b(md)){1,1}(md)\mathbf{b}=(b_{1},\ldots,b_{\binom{m}{d}})\in\{-1,1\}^{\binom{m}{d}}, define the prior π𝐛\pi_{\mathbf{b}} as the distribution of a random variable h𝐛h_{\mathbf{b}} specified by the following generative model. Let iUniform({1,,(md)})i^{*}\sim{\rm Uniform}(\{1,\ldots,\binom{m}{d}\}), let C𝐛(i)Bernoulli((1+γmbi)/2)C_{\mathbf{b}}(i^{*})\sim{\rm Bernoulli}((1+\gamma_{m}b_{i^{*}})/2); finally, h𝐛Uniform({h:{x:h(x)=+1}𝒳i,Parity(|{x:h(x)=+1}|)=C𝐛(i)})h_{\mathbf{b}}\sim{\rm Uniform}(\{h\in\mathbb{C}:\{x:h(x)=+1\}\subseteq\mathcal{X}_{i^{*}},{\rm Parity}(|\{x:h(x)=+1\}|)=C_{\mathbf{b}}(i^{*})\}), where Parity(n){\rm Parity}(n) is 11 if nn is odd, or 0 if nn is even. We will refer to the variables in this generative model below. For any hh\in\mathbb{C}, letting H={x:h(x)=+1}H=\{x:h(x)=+1\} and q=|H|q=|H|, we can equivalently express π𝐛({h})=(12)d(md)1i=1(md)𝟙[H𝒳i](1+γmbi)Parity(q)(1γmbi)1Parity(q)\pi_{\mathbf{b}}(\{h\})=(\frac{1}{2})^{d}\binom{m}{d}^{-1}\sum_{i=1}^{\binom{m}{d}}\mathbbm{1}[H\subseteq\mathcal{X}_{i}](1+\gamma_{m}b_{i})^{{\rm Parity}(q)}(1-\gamma_{m}b_{i})^{1-{\rm Parity}(q)}. From this explicit representation, it is clear that, letting f𝐛=dπ𝐛dπ0f_{\mathbf{b}}=\frac{{\rm d}\pi_{\mathbf{b}}}{{\rm d}\pi_{0}}, we have f𝐛(h)[1γm,1+γm]f_{\mathbf{b}}(h)\in[1-\gamma_{m},1+\gamma_{m}] for all hh\in\mathbb{C}. The fact that f𝐛f_{\mathbf{b}} is Hölder smooth follows from this, since every distinct h,gh,g\in\mathbb{C} have 𝒟({x:h(x)g(x)})1/m=(2γm/L)1/α\mathcal{D}(\{x:h(x)\neq g(x)\})\geq 1/m=(2\gamma_{m}/L)^{1/\alpha}.
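To make the construction concrete, here is a minimal Python sketch (our own illustration, not the authors' code) of the generative model defining π𝐛\pi_{\mathbf{b}}; the function name and the representation of 𝐛\mathbf{b} as a sequence of \pm 1 values indexed in the same order as the enumeration \mathcal{X}_{1},\ldots,\mathcal{X}_{\binom{m}{d}} are assumptions made only for this illustration.

```python
import itertools
import random

# Sketch (our illustration, not the authors' code) of the generative model for pi_b:
# X = {1, ..., m}, C = classifiers labeling at most d points +1, and b a sequence of
# +/-1 values indexed consistently with the enumeration X_1, ..., X_{binom(m,d)}.
def sample_h(m, d, L, alpha, b, rng=random):
    gamma_m = (L / 2) * (1 / m) ** alpha
    subsets = list(itertools.combinations(range(1, m + 1), d))  # X_1, ..., X_{binom(m,d)}
    i_star = rng.randrange(len(subsets))                        # i* ~ Uniform
    p = (1 + gamma_m * b[i_star]) / 2
    c = 1 if rng.random() < p else 0                            # C_b(i*) ~ Bernoulli(p)
    # h_b ~ Uniform over classifiers whose +1 set is contained in X_{i*} and whose
    # number of +1 points has parity equal to C_b(i*).
    candidates = [S for r in range(d + 1) if r % 2 == c
                  for S in itertools.combinations(subsets[i_star], r)]
    positives = set(rng.choice(candidates))
    return lambda x: +1 if x in positives else -1
```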

Next we set up the reduction as follows. For any estimator π^T=π^T(𝒵d1(θ),\hat{\pi}_{T}=\hat{\pi}_{T}(\mathcal{Z}^{1}_{d}({\theta_{\star}}), ,𝒵dT(θ))\ldots,\mathcal{Z}^{T}_{d}({\theta_{\star}})), and each i{1,,(md)}i\in\{1,\ldots,\binom{m}{d}\}, let hih_{i} be the classifier with {x:hi(x)=+1}=𝒳i\{x:h_{i}(x)=+1\}=\mathcal{X}_{i}; also, if π^T({hi})>(12)d/(md)\hat{\pi}_{T}(\{h_{i}\})>(\frac{1}{2})^{d}/\binom{m}{d}, let b^i=2Parity(d)1\hat{b}_{i}=2{\rm Parity}(d)-1, and otherwise b^i=12Parity(d)\hat{b}_{i}=1-2{\rm Parity}(d). We use these b^i\hat{b}_{i} values to estimate the original bib_{i} values. Specifically, let p^i=(1+γmb^i)/2\hat{p}_{i}=(1+\gamma_{m}\hat{b}_{i})/2 and pi=(1+γmbi)/2p_{i}=(1+\gamma_{m}b_{i})/2, where 𝐛=θ\mathbf{b}={\theta_{\star}}. Then

π^Tπθ\displaystyle\|\hat{\pi}_{T}-\pi_{{\theta_{\star}}}\| (1/2)i=1(md)|π^T({hi})πθ({hi})|\displaystyle\geq(1/2)\sum_{i=1}^{\binom{m}{d}}|\hat{\pi}_{T}(\{h_{i}\})-\pi_{{\theta_{\star}}}(\{h_{i}\})|
(1/2)i=1(md)γm2d(md)|b^ibi|/2=(1/2)i=1(md)12d(md)|p^ipi|.\displaystyle\geq(1/2)\sum_{i=1}^{\binom{m}{d}}\frac{\gamma_{m}}{2^{d}\binom{m}{d}}|\hat{b}_{i}-b_{i}|/2=(1/2)\sum_{i=1}^{\binom{m}{d}}\frac{1}{2^{d}\binom{m}{d}}|\hat{p}_{i}-p_{i}|.

Thus, we have reduced from the problem of deciding the biases of these (md)\binom{m}{d} independent Bernoulli random variables. To complete the proof, it suffices to lower bound the expectation of the right side for an arbitrary estimator.

Toward this end, we in fact study an even easier problem. Specifically, consider an estimator q^i=q^i(𝒵d1(θ),,𝒵dT(θ),i1,,iT)\hat{q}_{i}=\hat{q}_{i}(\mathcal{Z}^{1}_{d}({\theta_{\star}}),\ldots,\mathcal{Z}^{T}_{d}({\theta_{\star}}),i_{1}^{*},\ldots,i_{T}^{*}), where iti_{t}^{*} is the ii^{*} random variable in the generative model that defines htθh^{*}_{t{\theta_{\star}}}; that is, itUniform({1,i_{t}^{*}\sim{\rm Uniform}(\{1, ,(md)})\ldots,\binom{m}{d}\}), CtBernoulli((1+γmbit)/2)C_{t}\sim{\rm Bernoulli}((1+\gamma_{m}b_{i_{t}^{*}})/2), and htθUniform({h:{x:h(x)=+1}𝒳it,Parity(|{x:h(x)=+1}|)=Ct})h^{*}_{t{\theta_{\star}}}\sim{\rm Uniform}(\{h\in\mathbb{C}:\{x:h(x)=+1\}\subseteq\mathcal{X}_{i_{t}^{*}},{\rm Parity}(|\{x:h(x)=+1\}|)=C_{t}\}), where the iti_{t}^{*} are independent across tt, as are the CtC_{t} and htθh^{*}_{t{\theta_{\star}}}. Clearly the p^i\hat{p}_{i} from above can be viewed as an estimator of this type, which simply ignores the knowledge of iti_{t}^{*}. The knowledge of these iti_{t}^{*} variables simplifies the analysis, since given {it:tT}\{i_{t}^{*}:t\leq T\}, the data can be partitioned into (md)\binom{m}{d} disjoint sets, {{𝒵dt(θ):it=i}:i=1,,(md)}\{\{\mathcal{Z}^{t}_{d}({\theta_{\star}}):i_{t}^{*}=i\}:i=1,\ldots,\binom{m}{d}\}, and we can use only the set {𝒵dt(θ):it=i}\{\mathcal{Z}^{t}_{d}({\theta_{\star}}):i_{t}^{*}=i\} to estimate pip_{i}. Furthermore, we can use only the subset of these for which 𝕏td=𝒳i\mathbb{X}_{td}=\mathcal{X}_{i}, since otherwise we have zero information about the value of Parity(|{x:htθ(x)=+1}|){\rm Parity}(|\{x:h^{*}_{t{\theta_{\star}}}(x)=+1\}|). That is, given it=ii_{t}^{*}=i, any 𝒵dt(θ)\mathcal{Z}^{t}_{d}({\theta_{\star}}) is conditionally independent from every bjb_{j} for jij\neq i, and is even conditionally independent from bib_{i} when 𝕏td\mathbb{X}_{td} is not completely contained in 𝒳i\mathcal{X}_{i}; specifically, in this case, regardless of bib_{i}, the conditional distribution of 𝕐td(θ)\mathbb{Y}_{td}({\theta_{\star}}) given it=ii_{t}^{*}=i and given 𝕏td\mathbb{X}_{td} is a product distribution, which deterministically assigns label 1-1 to those Ytk(θ)Y_{tk}({\theta_{\star}}) with Xtk𝒳iX_{tk}\notin\mathcal{X}_{i}, and gives uniform random values to the subset of 𝕐td(θ)\mathbb{Y}_{td}({\theta_{\star}}) with their respective Xtk𝒳iX_{tk}\in\mathcal{X}_{i}. Finally, letting rt=Parity(|{kd:Ytk(θ)=+1}|)r_{t}={\rm Parity}(|\{k\leq d:Y_{tk}({\theta_{\star}})=+1\}|), we note that given it=ii_{t}^{*}=i, 𝕏td=𝒳i\mathbb{X}_{td}=\mathcal{X}_{i}, and the value rtr_{t}, bib_{i} is conditionally independent from 𝒵dt(θ)\mathcal{Z}^{t}_{d}({\theta_{\star}}). Thus, the set of values CiT(θ)={rt:it=i,𝕏td=𝒳i}C_{iT}({\theta_{\star}})=\{r_{t}:i_{t}^{*}=i,\mathbb{X}_{td}=\mathcal{X}_{i}\} is a sufficient statistic for bib_{i} (hence for pip_{i}). Recall that, when it=ii_{t}^{*}=i and 𝕏td=𝒳i\mathbb{X}_{td}=\mathcal{X}_{i}, the value of rtr_{t} is equal to CtC_{t}, a Bernoulli(pi){\rm Bernoulli}(p_{i}) random variable. 
Thus, we neither lose nor gain anything (in terms of risk) by restricting ourselves to estimators q^i\hat{q}_{i} of the type q^i=q^i(𝒵d1(θ),,𝒵dT(θ),i1,,iT)=q^i(CiT(θ))\hat{q}_{i}=\hat{q}_{i}(\mathcal{Z}^{1}_{d}({\theta_{\star}}),\ldots,\mathcal{Z}^{T}_{d}({\theta_{\star}}),i_{1}^{*},\ldots,i_{T}^{*})=\hat{q}_{i}^{\prime}(C_{iT}({\theta_{\star}})), for some q^i\hat{q}_{i}^{\prime} [8]: that is, estimators that are a function of the NiT(θ)=|CiT(θ)|N_{iT}({\theta_{\star}})=|C_{iT}({\theta_{\star}})| Bernoulli(pi){\rm Bernoulli}(p_{i}) random variables, which we should note are conditionally i.i.d. given NiT(θ)N_{iT}({\theta_{\star}}).

Thus, by (1), for any nTn\leq T,

12bi{1,+1}𝔼[|q^ipi||NiT(θ)=n]\displaystyle\frac{1}{2}\sum_{b_{i}\in\{-1,+1\}}\!\mathbb{E}\left[|\hat{q}_{i}-p_{i}|\Big{|}N_{iT}({\theta_{\star}})=n\right] =12bi{1,+1}γm(q^ipi|NiT(θ)=n)\displaystyle=\frac{1}{2}\sum_{b_{i}\in\{-1,+1\}}\!\gamma_{m}\mathbb{P}\left(\hat{q}_{i}\neq p_{i}\Big{|}N_{iT}({\theta_{\star}})=n\right)
(γm/32)exp{128γm2Ni/3}.\displaystyle\geq(\gamma_{m}/32)\cdot\exp\left\{-128\gamma_{m}^{2}N_{i}/3\right\}.

Also note that, for each ii, 𝔼[Ni]=d!(1/m)d(md)T(d/m)2dT=d2d(2γm/L)2d/αT\mathbb{E}[N_{i}]=\frac{d!(1/m)^{d}}{\binom{m}{d}}T\leq(d/m)^{2d}T=d^{2d}(2\gamma_{m}/L)^{2d/\alpha}T. Thus, Jensen’s inequality, linearity of expectation, and the law of total expectation imply

12bi{1,+1}𝔼[|q^ipi|](γm/32)exp{43(2/L)2d/αd2dγm2+2d/αT}.\frac{1}{2}\sum_{b_{i}\in\{-1,+1\}}\mathbb{E}\left[|\hat{q}_{i}-p_{i}|\right]\geq(\gamma_{m}/32)\cdot\exp\left\{-43(2/L)^{2d/\alpha}d^{2d}\gamma_{m}^{2+2d/\alpha}T\right\}.
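In a bit more detail (our own expansion of the computation of \mathbb{E}[N_{i}] above): N_{i} counts the tasks t with i_{t}^{*}=i and \mathbb{X}_{td}=\mathcal{X}_{i}; for each task, \mathbb{P}(i_{t}^{*}=i)=1/\binom{m}{d}, and given i_{t}^{*}=i, the event \mathbb{X}_{td}=\mathcal{X}_{i} requires the d independent {\rm Uniform}(\{1,\ldots,m\}) draws to realize each element of \mathcal{X}_{i} exactly once, which has probability d!/m^{d}. Hence

\mathbb{E}[N_{i}]=\frac{d!(1/m)^{d}}{\binom{m}{d}}T\leq d^{d}\left(\frac{1}{m}\right)^{d}\left(\frac{d}{m}\right)^{d}T=(d/m)^{2d}T,

using d!\leq d^{d} and \binom{m}{d}\geq(m/d)^{d}; and (d/m)^{2d}=d^{2d}(2\gamma_{m}/L)^{2d/\alpha} since 1/m=(2\gamma_{m}/L)^{1/\alpha}.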

Thus, by linearity of the expectation,

(12)(md)𝐛{1,+1}(md)𝔼[i=1(md)12d(md)|q^ipi|]=i=1(md)12d(md)12bi{1,+1}𝔼[|q^ipi|]\displaystyle\left(\frac{1}{2}\right)^{\binom{m}{d}}\!\!\!\!\sum_{\mathbf{b}\in\{-1,+1\}^{\binom{m}{d}}}\!\!\!\!\mathbb{E}\left[\sum_{i=1}^{\binom{m}{d}}\frac{1}{2^{d}\binom{m}{d}}|\hat{q}_{i}-p_{i}|\right]=\sum_{i=1}^{\binom{m}{d}}\frac{1}{2^{d}\binom{m}{d}}\frac{1}{2}\sum_{b_{i}\in\{-1,+1\}}\!\!\!\!\mathbb{E}\left[|\hat{q}_{i}-p_{i}|\right]
(γm/(322d))exp{43(2/L)2d/αd2dγm2+2d/αT}.\displaystyle\geq(\gamma_{m}/(32\cdot 2^{d}))\cdot\exp\left\{-43(2/L)^{2d/\alpha}d^{2d}\gamma_{m}^{2+2d/\alpha}T\right\}.

In particular, taking m=(L/2)1/α(43(2/L)2d/αd2dT)12(d+α)m=\left\lceil(L/2)^{1/\alpha}\left(43(2/L)^{2d/\alpha}d^{2d}T\right)^{\frac{1}{2(d+\alpha)}}\right\rceil, we have γm=Θ((43(2/L)2d/αd2dT)α2(d+α))\gamma_{m}=\Theta\left(\left(43(2/L)^{2d/\alpha}d^{2d}T\right)^{-\frac{\alpha}{2(d+\alpha)}}\right), so that

(12)(md)𝐛{1,+1}(md)𝔼[i=1(md)12d(md)|q^ipi|]=Ω(2d(43(2/L)2d/αd2dT)α2(d+α)).\left(\frac{1}{2}\right)^{\binom{m}{d}}\sum_{\mathbf{b}\in\{-1,+1\}^{\binom{m}{d}}}\mathbb{E}\left[\sum_{i=1}^{\binom{m}{d}}\frac{1}{2^{d}\binom{m}{d}}|\hat{q}_{i}-p_{i}|\right]\\ =\Omega\left(2^{-d}\left(43(2/L)^{2d/\alpha}d^{2d}T\right)^{-\frac{\alpha}{2(d+\alpha)}}\right).
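To see why this choice of m yields the displayed bound (our own unpacking): since \gamma_{m}=(L/2)m^{-\alpha},

43(2/L)^{2d/\alpha}d^{2d}\gamma_{m}^{2+2d/\alpha}T=43(2/L)^{2d/\alpha}d^{2d}(L/2)^{2+2d/\alpha}m^{-2(d+\alpha)}T=\Theta(1)

for the stated m, so the exponential factor in the previous bound is \Theta(1) and the bound reduces to \Omega(2^{-d}\gamma_{m}), with \gamma_{m} of the order given above.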

In particular, this implies there exists some 𝐛\mathbf{b} for which

𝔼[i=1(md)12d(md)|q^ipi|]=Ω(2d(43(2/L)2d/αd2dT)α2(d+α)).\mathbb{E}\left[\sum_{i=1}^{\binom{m}{d}}\frac{1}{2^{d}\binom{m}{d}}|\hat{q}_{i}-p_{i}|\right]=\Omega\left(2^{-d}\left(43(2/L)^{2d/\alpha}d^{2d}T\right)^{-\frac{\alpha}{2(d+\alpha)}}\right).

Applying this lower bound to the estimator p^i\hat{p}_{i} above yields the result. ∎

It is natural to wonder how these rates might potentially improve if we allow θ^T\hat{\theta}_{T} to depend on more than dd samples per data set. To establish limits on such improvements, we note that in the extreme case of allowing the estimator to depend on the full 𝒵t(θ)\mathcal{Z}^{t}({\theta_{\star}}) data sets, we may recover the known results lower bounding the risk of density estimation from i.i.d. samples from a smooth density, as indicated by the following result.

Theorem 4.2

For any integer d1d\geq 1, there exists an instance space 𝒳\mathcal{X}, a concept space \mathbb{C} of VC dimension dd, a distribution 𝒟\mathcal{D} over 𝒳\mathcal{X}, and a distribution π0\pi_{0} over \mathbb{C} such that, for ΠΘ\Pi_{\Theta} the set of distributions over \mathbb{C} with (L,α)(L,\alpha)-Hölder smooth density functions with respect to π0\pi_{0}, any sequence of estimators, θ^T=θ^T(𝒵1(θ),,𝒵T(θ))\hat{\theta}_{T}=\hat{\theta}_{T}(\mathcal{Z}^{1}({\theta_{\star}}),\ldots,\mathcal{Z}^{T}({\theta_{\star}})) (T=1,2,T=1,2,\ldots), has

supθΘ𝔼[πθ^Tπθ]=Ω(Tαd+2α).\sup_{{\theta_{\star}}\in\Theta}\mathbb{E}\left[\|\pi_{\hat{\theta}_{T}}-\pi_{{\theta_{\star}}}\|\right]=\Omega\left(T^{-\frac{\alpha}{d+2\alpha}}\right).

The proof is a simple reduction from the problem of estimating πθ\pi_{{\theta_{\star}}} based on direct access to h1θ,,hTθh^{*}_{1{\theta_{\star}}},\ldots,h^{*}_{T{\theta_{\star}}}, which is essentially equivalent to the standard model of density estimation, and indeed the lower bound in Theorem 4.2 is a well-known result for density estimation from TT i.i.d. samples from a Hölder smooth density in a dd-dimensional space [5].

5 Real-Valued Functions and an Application in Algorithmic Economics

In this section, we present results generalizing the analysis of [12] to classes of real-valued functions. We also present an application of this generalization to a preference elicitation problem.

5.1 Consistent Estimation of Priors over Real-Valued Functions at a Bounded Rate

In this section, we let {\cal B} denote a σ\sigma-algebra on 𝒳×\mathcal{X}\times\mathbb{R}, and again let 𝒳{\cal B}_{\mathcal{X}} denote the corresponding σ\sigma-algebra on 𝒳\mathcal{X}. Also, for measurable functions h,g:𝒳h,g:\mathcal{X}\to\mathbb{R}, let ρ(h,g)=|hg|dPX\rho(h,g)=\int|h-g|{\rm d}P_{X}, where PXP_{X} is a distribution over 𝒳\mathcal{X}. Let \mathcal{F} be a class of functions 𝒳\mathcal{X}\to\mathbb{R} with Borel σ\sigma-algebra {\cal B}_{\mathcal{F}} induced by ρ\rho. Let Θ\Theta be a set, and for each θΘ\theta\in\Theta, let πθ\pi_{\theta} denote a probability measure on (,)(\mathcal{F},{\cal B}_{\mathcal{F}}). We suppose {πθ:θΘ}\{\pi_{\theta}:\theta\in\Theta\} is totally bounded in total variation distance, and that \mathcal{F} is a uniformly bounded VC subgraph class with pseudodimension dd. We also suppose ρ\rho is a metric when restricted to \mathcal{F}.

As above, let {Xti}t,i\{X_{ti}\}_{t,i\in\mathbb{N}} be i.i.d. PXP_{X} random variables. For each θΘ\theta\in\Theta, let {htθ}t\{h^{*}_{t\theta}\}_{t\in\mathbb{N}} be i.i.d. πθ\pi_{\theta} random variables, independent from {Xti}t,i\{X_{ti}\}_{t,i\in\mathbb{N}}. For each tt\in\mathbb{N} and θΘ\theta\in\Theta, let Yti(θ)=htθ(Xti)Y_{ti}(\theta)=h^{*}_{t\theta}(X_{ti}) for ii\in\mathbb{N}, and let 𝒵t(θ)={(Xt1,Yt1(θ)),(Xt2,Yt2(θ)),}\mathcal{Z}^{t}(\theta)=\{(X_{t1},Y_{t1}(\theta)),(X_{t2},Y_{t2}(\theta)),\ldots\}; for each kk\in\mathbb{N}, define 𝒵kt(θ)={(Xt1,Yt1(θ)),\mathcal{Z}^{t}_{k}(\theta)=\{(X_{t1},Y_{t1}(\theta)), ,(Xtk,Ytk(θ))}\ldots,(X_{tk},Y_{tk}(\theta))\}, 𝕏tk={Xt1,,Xtk}\mathbb{X}_{tk}=\{X_{t1},\ldots,X_{tk}\}, and 𝕐tk(θ)={Yt1(θ),,Ytk(θ)}\mathbb{Y}_{tk}(\theta)=\{Y_{t1}(\theta),\ldots,Y_{tk}(\theta)\}.

We have the following result. The proof parallels that of [12] (who studied the special case of binary functions), with a few important twists (in particular, a significantly different approach in the analogue of their Lemma 3). The details are included in Appendix 0.A.

Theorem 5.1

There exists an estimator θ^Tθ=θ^T(𝒵d1(θ),,𝒵dT(θ))\hat{\theta}_{T{\theta_{\star}}}=\hat{\theta}_{T}(\mathcal{Z}^{1}_{d}({\theta_{\star}}),\ldots,\mathcal{Z}^{T}_{d}({\theta_{\star}})), and functions R:0×(0,1][0,)R:\mathbb{N}_{0}\times(0,1]\to[0,\infty) and δ:0×(0,1][0,1]\delta:\mathbb{N}_{0}\times(0,1]\to[0,1] such that, for any α>0\alpha>0, limTR(T,α)=limTδ(T,α)=0\lim\limits_{T\to\infty}R(T,\alpha)=\lim\limits_{T\to\infty}\delta(T,\alpha)=0 and for any T0T\in\mathbb{N}_{0} and θΘ{\theta_{\star}}\in\Theta,

(πθ^Tθπθ>R(T,α))δ(T,α)α.\mathbb{P}\left(\|\pi_{\hat{\theta}_{T{\theta_{\star}}}}-\pi_{{\theta_{\star}}}\|>R(T,\alpha)\right)\leq\delta(T,\alpha)\leq\alpha.

5.2 Maximizing Customer Satisfaction in Combinatorial Auctions

Theorem 5.1 has a clear application in the context of transfer learning, following analogous arguments to those given in the special case of binary classification by [12]. In addition to that application, we can also use Theorem 5.1 in the context of the following problem in algorithmic economics, where the objective is to serve a sequence of customers so as to maximize their satisfaction.

Consider an online travel agency, where customers go to the site with some idea of what type of travel they are interested in; the site then poses a series of questions to each customer, and identifies a travel package that best suits their desires, budget, and dates. There are many travel packages to choose from, with options for location, sight-seeing tours, hotel and room quality, etc. Because of this, serving the needs of an arbitrary customer might be a lengthy process, requiring many detailed questions. Fortunately, the stream of customers is typically not a worst-case sequence, and obeys many statistical regularities: in particular, it is not too far from reality to think of the customers as independent and identically distributed samples. With this assumption in mind, it becomes desirable to identify some of these statistical regularities so that we can pose the questions that are typically most relevant, and thereby more quickly identify the travel package that best suits the needs of the typical customer. One straightforward way to do this is to directly estimate the distribution of customer value functions, and optimize the questioning system to minimize the expected number of questions needed to find a suitable travel package.

One can model this problem in the style of Bayesian combinatorial auctions, in which each customer has a value function for each possible bundle of items. However, it is slightly different, in that we do not assume the distribution of customers is known, but rather are interested in estimating this distribution; the obtained estimate can then be used in combination with methods based on Bayesian decision theory. In contrast to the literature on Bayesian auctions (and subjectivist Bayesian decision theory in general), this technique is able to maintain general guarantees on performance that hold under an objective interpretation of the problem, rather than merely guarantees holding under an arbitrary assumed prior belief. This general idea is sometimes referred to as Empirical Bayesian decision theory in the machine learning and statistics literatures. The ideal result for an Empirical Bayesian algorithm is to be competitive with the corresponding Bayesian methods based on the actual distribution of the data (assuming the data are random, with an unknown distribution); that is, although the Empirical Bayesian methods only operate with a data-based estimate of the distribution, the aim is to perform nearly as well as methods based on the true (unobservable) distribution. In this work, we present results of this type, in the context of an abstraction of the aforementioned online travel agency problem, where the measure of performance is the expected number of questions to find a suitable package.

The specific application we are interested in here may be expressed abstractly as a kind of combinatorial auction with preference elicitation. Specifically, we suppose there is a collection of items on a menu, and each possible bundle of items has an associated fixed price. There is a stream of customers, each with a valuation function that provides a value for each possible bundle of items. The objective is to serve each customer a bundle of items that nearly maximizes his or her surplus value (value minus price). However, we are not permitted direct observation of the customer valuation functions; rather, we may query for the value of any given bundle of items; this is referred to as a value query in the literature on preference elicitation in combinatorial auctions (see Chapter 14 of [4], [14]). The objective is to achieve this near-maximal surplus guarantee, while making only a small number of queries per customer. We suppose the customer valuation functions are sampled i.i.d. according to an unknown distribution over a known (but arbitrary) class of real-valued functions having finite pseudodimension. Reasoning that knowledge of this distribution should allow one to make a smaller number of value queries per customer, we are interested in estimating this unknown distribution, so that as we serve more and more customers, the number of queries per customer required to identify a near-optimal bundle should decrease. In this context, we in fact prove that in the limit, the expected number of queries per customer converges to the number required of a method having direct knowledge of the true distribution of valuation functions.

Formally, suppose there is a menu of nn items [n]={1,,n}[n]=\{1,\ldots,n\}, and each bundle B[n]B\subseteq[n] has an associated price p(B)0p(B)\geq 0. Suppose also there is a sequence of customers, each with a valuation function vt:2[n]v_{t}:2^{[n]}\to\mathbb{R}. We suppose these vtv_{t} functions are i.i.d. samples. We can then calculate the satisfaction function for each customer as st(x)s_{t}(x), where x{0,1}nx\in\{0,1\}^{n}, and st(x)=vt(Bx)p(Bx)s_{t}(x)=v_{t}(B_{x})-p(B_{x}), where Bx[n]B_{x}\subseteq[n] contains element i[n]i\in[n] iff xi=1x_{i}=1.
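As a small illustration of this encoding (a hypothetical example of ours; the value and price functions below are made up), the bundle B_{x} and the satisfaction s_{t}(x) can be computed as follows.

```python
# Sketch (hypothetical example values): bundles are encoded as bit vectors x in {0,1}^n,
# B_x = {i : x_i = 1}, and satisfaction is value minus price, s_t(x) = v_t(B_x) - p(B_x).
def bundle(x):
    return frozenset(i + 1 for i, bit in enumerate(x) if bit == 1)

def satisfaction(x, value_fn, price_fn):
    B = bundle(x)
    return value_fn(B) - price_fn(B)

# Example with n = 3 items, an additive (made-up) value function and a per-item price.
v = lambda B: sum({1: 1.0, 2: 0.5, 3: 0.75}[i] for i in B)
p = lambda B: 0.5 * len(B)
print(satisfaction((1, 0, 1), v, p))  # v({1,3}) - p({1,3}) = 1.75 - 1.0 = 0.75
```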

Now suppose we are able to ask each customer a number of questions before serving up a bundle Bx^tB_{\hat{x}_{t}} to that customer. More specifically, we are able to ask for the value st(x)s_{t}(x) for any x{0,1}nx\in\{0,1\}^{n}. This is referred to as a value query in the literature on preference elicitation in combinatorial auctions (see Chapter 14 of [4], [14]). We are interested in asking as few questions as possible, while satisfying the guarantee that 𝔼[st(x^t)maxxst(x)]ε\mathbb{E}[s_{t}(\hat{x}_{t})-\max_{x}s_{t}(x)]\leq\varepsilon.

Now suppose, for every π\pi and ε\varepsilon, we have a method A(π,ε)A(\pi,\varepsilon) such that, given that π\pi is the actual distribution of the sts_{t} functions, A(π,ε)A(\pi,\varepsilon) guarantees that the x^t\hat{x}_{t} value it selects has 𝔼[maxxst(x)st(x^t)]ε\mathbb{E}[\max_{x}s_{t}(x)-s_{t}(\hat{x}_{t})]\leq\varepsilon; also let N^t(π,ε)\hat{N}_{t}(\pi,\varepsilon) denote the actual (random) number of queries the method A(π,ε)A(\pi,\varepsilon) would ask for the sts_{t} function, and let Q(π,ε)=𝔼[N^t(π,ε)]Q(\pi,\varepsilon)=\mathbb{E}[\hat{N}_{t}(\pi,\varepsilon)]. We suppose the method never queries any st(x)s_{t}(x) value twice for a given tt, so that its number of queries for any given tt is bounded.

Also suppose \mathcal{F} is a VC subgraph class of functions mapping 𝒳={0,1}n\mathcal{X}=\{0,1\}^{n} into [1,1][-1,1] with pseudodimension dd, and that {πθ:θΘ}\{\pi_{\theta}:\theta\in\Theta\} is a known totally bounded family of distributions over \mathcal{F} such that the sts_{t} functions have distribution πθ\pi_{{\theta_{\star}}} for some unknown θΘ{\theta_{\star}}\in\Theta. For any θΘ\theta\in\Theta and γ>0\gamma>0, let B(θ,γ)={θΘ:πθπθγ}{\rm B}(\theta,\gamma)=\{\theta^{\prime}\in\Theta:\|\pi_{\theta}-\pi_{\theta^{\prime}}\|\leq\gamma\}.

Suppose, in addition to AA, we have another method A(ε)A^{\prime}(\varepsilon) that is not π\pi-dependent, but still provides the ε\varepsilon-correctness guarantee, and makes a bounded number of queries (e.g., in the worst case, we could consider querying all 2n2^{n} points, but in most cases there are more clever π\pi-independent methods that use far fewer queries, such as O(1/ε2)O(1/\varepsilon^{2})). Consider the following method; the quantities θ^Tθ\hat{\theta}_{T{\theta_{\star}}}, R(T,α)R(T,\alpha), and δ(T,α)\delta(T,\alpha) from Theorem 5.1 are here considered with respect to PXP_{X} taken as the uniform distribution on {0,1}n\{0,1\}^{n}.

for t=1,2,,Tt=1,2,\ldots,T do
  Pick points Xt1,Xt2,,XtdX_{t1},X_{t2},\ldots,X_{td} uniformly at random from {0,1}n\{0,1\}^{n}
  if R(t1,ε/2)>ε/8R(t-1,\varepsilon/2)>\varepsilon/8 then
   Run A(ε)A^{\prime}(\varepsilon)
   Take x^t\hat{x}_{t} as the returned value
  else
   Let θˇtθB(θ^(t1)θ,R(t1,ε/2))\check{\theta}_{t{\theta_{\star}}}\in{\rm B}\left(\hat{\theta}_{(t-1){\theta_{\star}}},R(t-1,\varepsilon/2)\right) be such that Q(πθˇtθ,ε/4)minθB(θ^(t1)θ,R(t1,ε/2))Q(πθ,ε/4)+1tQ(\pi_{\check{\theta}_{t{\theta_{\star}}}},\varepsilon/4)\leq\!\!\!\!\!\min\limits_{\theta\in{\rm B}\left(\hat{\theta}_{(t-1){\theta_{\star}}},R(t-1,\varepsilon/2)\right)}\!\!\!\!\!Q(\pi_{\theta},\varepsilon/4)+\frac{1}{t}
   Run A(πθˇtθ,ε/4)A(\pi_{\check{\theta}_{t{\theta_{\star}}}},\varepsilon/4) and let x^t\hat{x}_{t} be its return value
  end if
end for
Algorithm 1 An algorithm for sequentially maximizing expected customer satisfaction.
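For readers who prefer code, here is a minimal Python sketch of Algorithm 1 (our own rendering, not the authors' implementation). The estimator theta_hat, the radius function R, the ball B, the query-complexity function Q, and the elicitation routines A and A_prime are assumed interfaces standing in for the abstract objects of Theorem 5.1 and Section 5.2; in particular we assume ball(theta, r) returns a finite set of candidate parameters, and we take an exact minimizer over this set, which is within the algorithm's 1/t tolerance.

```python
import random

def serve_customers(customers, eps, d, n, theta_hat, R, ball, Q, A, A_prime):
    # Sketch of Algorithm 1 with assumed interfaces: `customers` is a list of
    # satisfaction functions s_t (answering value queries); `theta_hat`, `R`,
    # `ball`, `Q`, `A`, and `A_prime` stand in for the abstract objects of
    # Theorem 5.1 and Section 5.2.
    data, chosen = [], []
    for t, s_t in enumerate(customers, start=1):
        # d points drawn uniformly at random from {0,1}^n, labeled by s_t,
        # recorded as Z^t_d for the prior estimator.
        X = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(d)]
        data_t = [(x, s_t(x)) for x in X]
        if R(t - 1, eps / 2) > eps / 8:
            x_t = A_prime(eps, s_t)                      # pi-independent elicitation
        else:
            candidates = ball(theta_hat(data), R(t - 1, eps / 2))
            theta_check = min(candidates, key=lambda th: Q(th, eps / 4))
            x_t = A(theta_check, eps / 4, s_t)           # elicitation under prior pi_theta_check
        data.append(data_t)
        chosen.append(x_t)
    return chosen
```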

The following theorem indicates that this method is correct, and furthermore that the long-run average number of queries is not much worse than that of a method that has direct knowledge of πθ\pi_{{\theta_{\star}}}. The proof of this result parallels that of [12] for the transfer learning setting, but is included here for completeness.

Theorem 5.2

For the above method, tT,𝔼[maxxst(x)st(x^t)]ε\forall t\leq T,\mathbb{E}[\max_{x}s_{t}(x)-s_{t}(\hat{x}_{t})]\leq\varepsilon. Furthermore, if ST(ε)S_{T}(\varepsilon) is the total number of queries made by the method, then

lim supT𝔼[ST(ε)]TQ(πθ,ε/4)+d.\limsup\limits_{T\to\infty}\frac{\mathbb{E}[S_{T}(\varepsilon)]}{T}\leq Q(\pi_{{\theta_{\star}}},\varepsilon/4)+d.
Proof

By Theorem 5.1, for any tTt\leq T, if R(t1,ε/2)ε/8R(t-1,\varepsilon/2)\leq\varepsilon/8, then with probability at least 1ε/21-\varepsilon/2, πθπθ^(t1)θR(t1,ε/2)\|\pi_{{\theta_{\star}}}-\pi_{\hat{\theta}_{(t-1){\theta_{\star}}}}\|\leq R(t-1,\varepsilon/2), so that a triangle inequality implies πθπθˇtθ2R(t1,ε/2)ε/4\|\pi_{{\theta_{\star}}}-\pi_{\check{\theta}_{t{\theta_{\star}}}}\|\leq 2R(t-1,\varepsilon/2)\leq\varepsilon/4. Thus,

𝔼[maxxst(x)st(x^t)]ε/2+𝔼[𝔼[maxxst(x)st(x^t)|θˇtθ]𝟙[πθˇtθπθε/2]].\mathbb{E}\left[\max_{x}s_{t}(x)-s_{t}(\hat{x}_{t})\right]\\ \leq\varepsilon/2+\mathbb{E}\left[\mathbb{E}\left[\max_{x}s_{t}(x)-s_{t}(\hat{x}_{t})\Big{|}\check{\theta}_{t{\theta_{\star}}}\right]\mathbbm{1}\left[\|\pi_{\check{\theta}_{t{\theta_{\star}}}}-\pi_{{\theta_{\star}}}\|\leq\varepsilon/2\right]\right].

For θΘ\theta\in\Theta, let x^tθ\hat{x}_{t\theta} denote the point xx that would be returned by A(πθˇtθ,ε/4)A(\pi_{\check{\theta}_{t{\theta_{\star}}}},\varepsilon/4) when queries are answered by some stθπθs_{t\theta}\sim\pi_{\theta} instead of sts_{t} (and supposing st=stθs_{t}=s_{t{\theta_{\star}}}). If πθˇtθπθε/4\|\pi_{\check{\theta}_{t{\theta_{\star}}}}-\pi_{{\theta_{\star}}}\|\leq\varepsilon/4, then

𝔼[maxxst(x)st(x^t)|θˇtθ]=𝔼[maxxstθ(x)stθ(x^t)|θˇtθ]\displaystyle\mathbb{E}\left[\max_{x}s_{t}(x)-s_{t}(\hat{x}_{t})\Big{|}\check{\theta}_{t{\theta_{\star}}}\right]=\mathbb{E}\left[\max_{x}s_{t{\theta_{\star}}}(x)-s_{t{\theta_{\star}}}(\hat{x}_{t})\Big{|}\check{\theta}_{t{\theta_{\star}}}\right]
𝔼[maxxstθˇtθ(x)stθˇtθ(x^tθˇtθ)|θˇtθ]+πθˇtθπθε/4+ε/4=ε/2.\displaystyle\leq\mathbb{E}\left[\max_{x}s_{t\check{\theta}_{t{\theta_{\star}}}}(x)-s_{t\check{\theta}_{t{\theta_{\star}}}}(\hat{x}_{t\check{\theta}_{t{\theta_{\star}}}})\Big{|}\check{\theta}_{t{\theta_{\star}}}\right]+\|\pi_{\check{\theta}_{t{\theta_{\star}}}}-\pi_{{\theta_{\star}}}\|\leq\varepsilon/4+\varepsilon/4=\varepsilon/2.

Plugging into the above bound, we have 𝔼[maxxst(x)st(x^t)]ε\mathbb{E}\left[\max_{x}s_{t}(x)-s_{t}(\hat{x}_{t})\right]\leq\varepsilon.

For the result on ST(ε)S_{T}(\varepsilon), first note that R(t1,ε/2)>ε/8R(t-1,\varepsilon/2)>\varepsilon/8 only finitely many times (due to R(t,α)=o(1)R(t,\alpha)=o(1)), so that we can ignore those values of tt in the asymptotic calculation (as the number of queries is always bounded), and rely on the correctness guarantee of AA^{\prime}. For the remaining values tt, let NtN_{t} denote the number of queries made by A(πθˇtθ,ε/4)A(\pi_{\check{\theta}_{t{\theta_{\star}}}},\varepsilon/4). Then

lim supT𝔼[ST(ε)]Td+lim supTt=1T𝔼[Nt]T.\limsup\limits_{T\to\infty}\frac{\mathbb{E}[S_{T}(\varepsilon)]}{T}\leq d+\limsup\limits_{T\to\infty}\sum_{t=1}^{T}\frac{\mathbb{E}\left[N_{t}\right]}{T}.

Since

limT1Tt=1T𝔼[Nt𝟙[πθ^(t1)θπθ>R(t1,ε/2)]]\displaystyle\lim\limits_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[N_{t}\mathbbm{1}[\|\pi_{\hat{\theta}_{(t-1){\theta_{\star}}}}-\pi_{{\theta_{\star}}}\|>R(t-1,\varepsilon/2)]\right]
limT1Tt=1T2n(πθ^(t1)θπθ>R(t1,ε/2))\displaystyle\leq\lim\limits_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}2^{n}\mathbb{P}\left(\|\pi_{\hat{\theta}_{(t-1){\theta_{\star}}}}-\pi_{{\theta_{\star}}}\|>R(t-1,\varepsilon/2)\right)
2nlimT1Tt=1Tδ(t1,ε/2)=0,\displaystyle\leq 2^{n}\lim\limits_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\delta(t-1,\varepsilon/2)=0,

we have

lim supTt=1T𝔼[Nt]T=lim supT1Tt=1T𝔼[Nt𝟙[πθ^(t1)θπθR(t1,ε/2)]].\limsup\limits_{T\to\infty}\sum_{t=1}^{T}\frac{\mathbb{E}\left[N_{t}\right]}{T}=\limsup\limits_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\Big{[}N_{t}\mathbbm{1}[\|\pi_{\hat{\theta}_{(t-1){\theta_{\star}}}}-\pi_{{\theta_{\star}}}\|\leq R(t-1,\varepsilon/2)]\Big{]}.

For tTt\leq T, let Nt(θˇtθ)N_{t}(\check{\theta}_{t{\theta_{\star}}}) denote the number of queries A(πθˇtθ,ε/4)A(\pi_{\check{\theta}_{t{\theta_{\star}}}},\varepsilon/4) would make if queries were answered with stθˇtθs_{t\check{\theta}_{t{\theta_{\star}}}} instead of sts_{t}. On the event πθ^(t1)θπθR(t1,ε/2)\|\pi_{\hat{\theta}_{(t-1){\theta_{\star}}}}-\pi_{{\theta_{\star}}}\|\leq R(t-1,\varepsilon/2), we have

𝔼[Nt|θˇtθ]\displaystyle\mathbb{E}\left[N_{t}\Big{|}\check{\theta}_{t{\theta_{\star}}}\right] 𝔼[Nt(θˇtθ)|θˇtθ]+2R(t1,ε/2)\displaystyle\leq\mathbb{E}\left[N_{t}(\check{\theta}_{t{\theta_{\star}}})\Big{|}\check{\theta}_{t{\theta_{\star}}}\right]+2R(t-1,\varepsilon/2)
=Q(πθˇtθ,ε/4)+2R(t1,ε/2)Q(πθ,ε/4)+2R(t1,ε/2)+1/t.\displaystyle=Q(\pi_{\check{\theta}_{t{\theta_{\star}}}}\!,\varepsilon/4)+2R(t\!-\!1,\varepsilon/2)\leq Q(\pi_{{\theta_{\star}}},\varepsilon/4)+2R(t\!-\!1,\varepsilon/2)+1/t.

Therefore,

\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[N_{t}\mathbbm{1}\big[\|\pi_{\hat{\theta}_{(t-1)\theta_{\star}}}-\pi_{\theta_{\star}}\|\leq R(t-1,\varepsilon/2)\big]\right]
\leq Q(\pi_{\theta_{\star}},\varepsilon/4)+\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\left(2R(t-1,\varepsilon/2)+1/t\right)=Q(\pi_{\theta_{\star}},\varepsilon/4).

In many cases, this result will even continue to hold with an infinite number of goods ($n=\infty$), since Theorem 5.1 has no dependence on the cardinality of the space $\mathcal{X}$.
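To make the structure of this reduction concrete, the following schematic sketch outlines the transfer loop analyzed above. It is an illustration only, not the paper's pseudocode: the helpers estimate_prior, nearest_cover_element, A, A_prime, and query_values are hypothetical placeholders standing in for the prior estimator built from past tasks, the surrogate parameter $\check{\theta}_{t\theta_{\star}}$, the prior-aware elicitation algorithm, the prior-free fallback, and the $d$ value queries reserved for estimation.

def transfer_elicitation(tasks, eps, d, R, estimate_prior, nearest_cover_element,
                         A, A_prime):
    """Schematic sketch of the transfer elicitation loop (all helpers hypothetical)."""
    past_responses = []            # data pooled across previous tasks
    chosen_bundles = []
    for t, task in enumerate(tasks, start=1):
        theta_hat = estimate_prior(past_responses)        # from tasks 1..t-1
        theta_check = nearest_cover_element(theta_hat, t)  # surrogate parameter
        if R(t - 1, eps / 2) > eps / 8:
            # Prior estimate not yet trustworthy: fall back on the prior-free
            # algorithm A'; this branch occurs only finitely often since R(t, a) = o(1).
            bundle = A_prime(task, eps)
        else:
            # Run the prior-aware algorithm with the estimated prior at accuracy eps/4.
            bundle = A(task, theta_check, eps / 4)
        past_responses.append(task.query_values(d))        # d queries used for estimation
        chosen_bundles.append(bundle)
    return chosen_bundles

The two branches correspond to the two cases in the calculation above: the fallback branch can occur only finitely often, while the main branch contributes at most $Q(\pi_{\theta_{\star}},\varepsilon/4)+o(1)$ expected queries per task on average.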

6 Open Problems

There are several interesting questions that remain open at this time. Can either the lower bound or the upper bound be improved in general? If, instead of $d$ samples per task, we use $m\geq d$ samples, how does the minimax risk vary with $m$? Related to this, what value of $m$ optimizes the rate of convergence as a function of $mT$, the total number of samples? More generally, if an estimator is permitted to use $N$ total samples, taken from however many tasks it wishes, what is the optimal rate of convergence as a function of $N$?

Appendix 0.A Proofs for Section 5

The proof of Theorem 5.1 is based on the following sequence of lemmas, which parallel those used by [12] to establish the analogous result for consistent estimation of priors over binary functions. The last of these lemmas (namely, Lemma 3) requires substantial modifications to the original argument of [12]; the others use arguments more directly based on those of [12].

Lemma 1

For any $\theta,\theta^{\prime}\in\Theta$ and $t\in\mathbb{N}$,

\|\pi_{\theta}-\pi_{\theta^{\prime}}\|=\|\mathbb{P}_{\mathcal{Z}^{t}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}\|.
Proof

Fix $\theta,\theta^{\prime}\in\Theta$ and $t\in\mathbb{N}$. Let $\mathbb{X}=\{X_{t1},X_{t2},\ldots\}$ and $\mathbb{Y}(\theta)=\{Y_{t1}(\theta),Y_{t2}(\theta),\ldots\}$, and for $k\in\mathbb{N}$ let $\mathbb{X}_{k}=\{X_{t1},\ldots,X_{tk}\}$ and $\mathbb{Y}_{k}(\theta)=\{Y_{t1}(\theta),\ldots,Y_{tk}(\theta)\}$. For $h\in\mathcal{F}$, let $c_{\mathbb{X}}(h)=\{(X_{t1},h(X_{t1})),(X_{t2},h(X_{t2})),\ldots\}$.

For $h,g\in\mathcal{F}$, define $\rho_{\mathbb{X}}(h,g)=\lim_{m\to\infty}\frac{1}{m}\sum_{i=1}^{m}|h(X_{ti})-g(X_{ti})|$ (if the limit exists), and $\rho_{\mathbb{X}_{k}}(h,g)=\frac{1}{k}\sum_{i=1}^{k}|h(X_{ti})-g(X_{ti})|$. Note that since $\mathcal{F}$ is a uniformly bounded VC subgraph class, so is the collection of functions $\{|h-g|:h,g\in\mathcal{F}\}$, so the uniform strong law of large numbers implies that, with probability one, $\rho_{\mathbb{X}}(h,g)$ exists and equals $\rho(h,g)$ for all $h,g\in\mathcal{F}$ [10].

Consider any $\theta,\theta^{\prime}\in\Theta$ and any $A\in{\cal B}_{\mathcal{F}}$. Then any $h\notin A$ has $\rho(h,g)>0$ for every $g\in A$ (by the metric assumption). Thus, if $\rho_{\mathbb{X}}(h,g)=\rho(h,g)$ for all $h,g\in\mathcal{F}$, then for every $h\notin A$,

\forall g\in A,\;\rho_{\mathbb{X}}(h,g)=\rho(h,g)>0\implies\forall g\in A,\;c_{\mathbb{X}}(h)\neq c_{\mathbb{X}}(g)\implies c_{\mathbb{X}}(h)\notin c_{\mathbb{X}}(A).

This implies $c_{\mathbb{X}}^{-1}(c_{\mathbb{X}}(A))=A$. Under these conditions,

\mathbb{P}_{\mathcal{Z}^{t}(\theta)|\mathbb{X}}(c_{\mathbb{X}}(A))=\pi_{\theta}(c_{\mathbb{X}}^{-1}(c_{\mathbb{X}}(A)))=\pi_{\theta}(A),

and similarly for $\theta^{\prime}$.

Any measurable set $C$ in the range of $\mathcal{Z}^{t}(\theta)$ can be expressed as $C=\{c_{\bar{x}}(h):(h,\bar{x})\in C^{\prime}\}$ for some appropriate $C^{\prime}\in{\cal B}_{\mathcal{F}}\otimes{\cal B}_{\mathcal{X}}^{\infty}$. Letting $C^{\prime}_{\bar{x}}=\{h:(h,\bar{x})\in C^{\prime}\}$, we have

\mathbb{P}_{\mathcal{Z}^{t}(\theta)}(C)=\int\pi_{\theta}(c_{\bar{x}}^{-1}(c_{\bar{x}}(C^{\prime}_{\bar{x}})))\,\mathbb{P}_{\mathbb{X}}({\rm d}\bar{x})=\int\pi_{\theta}(C^{\prime}_{\bar{x}})\,\mathbb{P}_{\mathbb{X}}({\rm d}\bar{x})=\mathbb{P}_{(h^{*}_{t\theta},\mathbb{X})}(C^{\prime}).

Likewise, this reasoning holds for $\theta^{\prime}$. Then

\|\mathbb{P}_{\mathcal{Z}^{t}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}\|=\|\mathbb{P}_{(h^{*}_{t\theta},\mathbb{X})}-\mathbb{P}_{(h^{*}_{t\theta^{\prime}},\mathbb{X})}\|
=\sup_{C^{\prime}\in{\cal B}_{\mathcal{F}}\otimes{\cal B}_{\mathcal{X}}^{\infty}}\left|\int\left(\pi_{\theta}(C^{\prime}_{\bar{x}})-\pi_{\theta^{\prime}}(C^{\prime}_{\bar{x}})\right)\mathbb{P}_{\mathbb{X}}({\rm d}\bar{x})\right|
\leq\int\sup_{A\in{\cal B}_{\mathcal{F}}}|\pi_{\theta}(A)-\pi_{\theta^{\prime}}(A)|\,\mathbb{P}_{\mathbb{X}}({\rm d}\bar{x})=\|\pi_{\theta}-\pi_{\theta^{\prime}}\|.

Since $h^{*}_{t\theta}$ and $\mathbb{X}$ are independent, for every $A\in{\cal B}_{\mathcal{F}}$ we have $\pi_{\theta}(A)=\mathbb{P}_{h^{*}_{t\theta}}(A)=\mathbb{P}_{h^{*}_{t\theta}}(A)\,\mathbb{P}_{\mathbb{X}}(\mathcal{X}^{\infty})=\mathbb{P}_{(h^{*}_{t\theta},\mathbb{X})}(A\times\mathcal{X}^{\infty})$. Analogous reasoning holds for $h^{*}_{t\theta^{\prime}}$. Thus, we have

\|\pi_{\theta}-\pi_{\theta^{\prime}}\|=\|\mathbb{P}_{(h^{*}_{t\theta},\mathbb{X})}(\cdot\times\mathcal{X}^{\infty})-\mathbb{P}_{(h^{*}_{t\theta^{\prime}},\mathbb{X})}(\cdot\times\mathcal{X}^{\infty})\|
\leq\|\mathbb{P}_{(h^{*}_{t\theta},\mathbb{X})}-\mathbb{P}_{(h^{*}_{t\theta^{\prime}},\mathbb{X})}\|=\|\mathbb{P}_{\mathcal{Z}^{t}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}\|.

Altogether, we have $\|\mathbb{P}_{\mathcal{Z}^{t}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}\|=\|\pi_{\theta}-\pi_{\theta^{\prime}}\|$. ∎

Lemma 2

There exists a sequence $r_{k}=o(1)$ such that, for all $t,k\in\mathbb{N}$ and all $\theta,\theta^{\prime}\in\Theta$,

\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|\leq\|\pi_{\theta}-\pi_{\theta^{\prime}}\|\leq\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|+r_{k}.
Proof

This proof follows a proof of [12] essentially verbatim, but is included here for completeness. Since $\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(A)=\mathbb{P}_{\mathcal{Z}^{t}(\theta)}(A\times(\mathcal{X}\times\mathbb{R})^{\infty})$ for all measurable $A\subseteq(\mathcal{X}\times\mathbb{R})^{k}$, and similarly for $\theta^{\prime}$, we have

\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|=\sup_{A\in{\cal B}^{k}}\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(A)-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(A)
=\sup_{A\in{\cal B}^{k}}\mathbb{P}_{\mathcal{Z}^{t}(\theta)}(A\times(\mathcal{X}\times\mathbb{R})^{\infty})-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}(A\times(\mathcal{X}\times\mathbb{R})^{\infty})
\leq\sup_{A\in{\cal B}^{\infty}}\mathbb{P}_{\mathcal{Z}^{t}(\theta)}(A)-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}(A)=\|\mathbb{P}_{\mathcal{Z}^{t}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}\|,

which implies the left inequality when combined with Lemma 1.

Next, we focus on the right inequality. Fix $\theta,\theta^{\prime}\in\Theta$ and $\gamma>0$, and let $B\in{\cal B}^{\infty}$ be such that

\|\pi_{\theta}-\pi_{\theta^{\prime}}\|=\|\mathbb{P}_{\mathcal{Z}^{t}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}\|<\mathbb{P}_{\mathcal{Z}^{t}(\theta)}(B)-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}(B)+\gamma.

Let $\mathcal{A}=\{A\times(\mathcal{X}\times\mathbb{R})^{\infty}:A\in{\cal B}^{k},k\in\mathbb{N}\}$. Note that $\mathcal{A}$ is an algebra that generates ${\cal B}^{\infty}$. Thus, Carathéodory's extension theorem (specifically, the version presented by [8]) implies that there exist disjoint sets $\{A_{i}\}_{i\in\mathbb{N}}$ in $\mathcal{A}$ such that $B\subseteq\bigcup_{i\in\mathbb{N}}A_{i}$ and

\mathbb{P}_{\mathcal{Z}^{t}(\theta)}(B)-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}(B)<\sum_{i\in\mathbb{N}}\mathbb{P}_{\mathcal{Z}^{t}(\theta)}(A_{i})-\sum_{i\in\mathbb{N}}\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}(A_{i})+\gamma.

Since the $A_{i}$ sets are disjoint, each of these sums is bounded by a probability value, which implies that there exists some $n\in\mathbb{N}$ such that

\sum_{i\in\mathbb{N}}\mathbb{P}_{\mathcal{Z}^{t}(\theta)}(A_{i})<\gamma+\sum_{i=1}^{n}\mathbb{P}_{\mathcal{Z}^{t}(\theta)}(A_{i}),

which implies

\sum_{i\in\mathbb{N}}\mathbb{P}_{\mathcal{Z}^{t}(\theta)}(A_{i})-\sum_{i\in\mathbb{N}}\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}(A_{i})<\gamma+\sum_{i=1}^{n}\mathbb{P}_{\mathcal{Z}^{t}(\theta)}(A_{i})-\sum_{i=1}^{n}\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}(A_{i})
=\gamma+\mathbb{P}_{\mathcal{Z}^{t}(\theta)}\left(\bigcup_{i=1}^{n}A_{i}\right)-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}\left(\bigcup_{i=1}^{n}A_{i}\right).

As $\bigcup_{i=1}^{n}A_{i}\in\mathcal{A}$, there exist $m\in\mathbb{N}$ and a measurable $B_{m}\in{\cal B}^{m}$ such that $\bigcup_{i=1}^{n}A_{i}=B_{m}\times(\mathcal{X}\times\mathbb{R})^{\infty}$, and therefore

\mathbb{P}_{\mathcal{Z}^{t}(\theta)}\left(\bigcup_{i=1}^{n}A_{i}\right)-\mathbb{P}_{\mathcal{Z}^{t}(\theta^{\prime})}\left(\bigcup_{i=1}^{n}A_{i}\right)=\mathbb{P}_{\mathcal{Z}^{t}_{m}(\theta)}(B_{m})-\mathbb{P}_{\mathcal{Z}^{t}_{m}(\theta^{\prime})}(B_{m})\leq\|\mathbb{P}_{\mathcal{Z}^{t}_{m}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{m}(\theta^{\prime})}\|\leq\lim_{k\to\infty}\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|.

Combining the above, we have $\|\pi_{\theta}-\pi_{\theta^{\prime}}\|\leq\lim_{k\to\infty}\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|+3\gamma$. Letting $\gamma$ approach 0, we have

\|\pi_{\theta}-\pi_{\theta^{\prime}}\|\leq\lim_{k\to\infty}\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|.

So there exists a sequence $r_{k}(\theta,\theta^{\prime})=o(1)$ such that

\forall k\in\mathbb{N},\quad\|\pi_{\theta}-\pi_{\theta^{\prime}}\|\leq\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|+r_{k}(\theta,\theta^{\prime}).

Now let $\gamma>0$ and let $\Theta_{\gamma}$ be a minimal $\gamma$-cover of $\Theta$. Define the quantity $r_{k}(\gamma)=\max_{\theta,\theta^{\prime}\in\Theta_{\gamma}}r_{k}(\theta,\theta^{\prime})$. Then for any $\theta,\theta^{\prime}\in\Theta$, let $\theta_{\gamma}=\mathop{\rm argmin}_{\theta^{\prime\prime}\in\Theta_{\gamma}}\|\pi_{\theta}-\pi_{\theta^{\prime\prime}}\|$ and $\theta_{\gamma}^{\prime}=\mathop{\rm argmin}_{\theta^{\prime\prime}\in\Theta_{\gamma}}\|\pi_{\theta^{\prime}}-\pi_{\theta^{\prime\prime}}\|$. Then the triangle inequality implies that, for every $k\in\mathbb{N}$,

\|\pi_{\theta}-\pi_{\theta^{\prime}}\|\leq\|\pi_{\theta}-\pi_{\theta_{\gamma}}\|+\|\pi_{\theta_{\gamma}}-\pi_{\theta_{\gamma}^{\prime}}\|+\|\pi_{\theta_{\gamma}^{\prime}}-\pi_{\theta^{\prime}}\|
<2\gamma+r_{k}(\theta_{\gamma},\theta_{\gamma}^{\prime})+\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma})}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma}^{\prime})}\|\leq 2\gamma+r_{k}(\gamma)+\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma})}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma}^{\prime})}\|.

Triangle inequalities and the left inequality from the lemma statement (already established) imply

\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma})}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma}^{\prime})}\|\leq\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma})}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}\|+\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|+\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma}^{\prime})}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|
\leq\|\pi_{\theta_{\gamma}}-\pi_{\theta}\|+\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|+\|\pi_{\theta_{\gamma}^{\prime}}-\pi_{\theta^{\prime}}\|<2\gamma+\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|.

So in total we have

\|\pi_{\theta}-\pi_{\theta^{\prime}}\|\leq 4\gamma+r_{k}(\gamma)+\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|.

Since this holds for all $\gamma>0$, defining $r_{k}=\inf_{\gamma>0}(4\gamma+r_{k}(\gamma))$ yields the right inequality of the lemma statement. Furthermore, since each $r_{k}(\theta,\theta^{\prime})=o(1)$ and $|\Theta_{\gamma}|<\infty$, we have $r_{k}(\gamma)=o(1)$ for each $\gamma>0$, and thus also $r_{k}=o(1)$. ∎

Lemma 3

For all $t,k\in\mathbb{N}$, there exists a monotone function $M_{k}(x)=o(1)$ such that, for all $\theta,\theta^{\prime}\in\Theta$,

\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|\leq M_{k}\left(\|\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta^{\prime})}\|\right).
Proof

Fix any $t\in\mathbb{N}$, let $\mathbb{X}=\{X_{t1},X_{t2},\ldots\}$ and $\mathbb{Y}(\theta)=\{Y_{t1}(\theta),Y_{t2}(\theta),\ldots\}$, and for $k\in\mathbb{N}$ let $\mathbb{X}_{k}=\{X_{t1},\ldots,X_{tk}\}$ and $\mathbb{Y}_{k}(\theta)=\{Y_{t1}(\theta),\ldots,Y_{tk}(\theta)\}$.

If $k\leq d$, then $\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(\cdot)=\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta)}(\cdot\times(\mathcal{X}\times\mathbb{R})^{d-k})$, so that

\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|\leq\|\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta^{\prime})}\|,

and therefore the result trivially holds (e.g., taking $M_{k}(x)=x$).

Now suppose $k>d$. Fix any $\gamma>0$, and let $B_{\theta,\theta^{\prime}}\subseteq(\mathcal{X}\times\mathbb{R})^{k}$ be a measurable set such that

\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(B_{\theta,\theta^{\prime}})-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(B_{\theta,\theta^{\prime}})\leq\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|
\leq\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(B_{\theta,\theta^{\prime}})-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(B_{\theta,\theta^{\prime}})+\gamma.

By Carathéodory's extension theorem (specifically, the version presented by [8]), there exists a disjoint sequence of sets $\{B_{i}(\theta,\theta^{\prime})\}_{i=1}^{\infty}$ such that

\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(B_{\theta,\theta^{\prime}})-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(B_{\theta,\theta^{\prime}})<\gamma+\sum_{i=1}^{\infty}\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(B_{i}(\theta,\theta^{\prime}))-\sum_{i=1}^{\infty}\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(B_{i}(\theta,\theta^{\prime})),

and such that each $B_{i}(\theta,\theta^{\prime})$ is representable as follows: for some $\ell_{i}(\theta,\theta^{\prime})\in\mathbb{N}$ and sets $C_{ij}=(A_{ij1}\times(-\infty,t_{ij1}])\times\cdots\times(A_{ijk}\times(-\infty,t_{ijk}])$, for $j\leq\ell_{i}(\theta,\theta^{\prime})$, where each $A_{ijp}\in{\cal B}_{\mathcal{X}}$, the set $B_{i}(\theta,\theta^{\prime})$ is representable as $\bigcup_{s\in S_{i}}\bigcap_{j=1}^{\ell_{i}(\theta,\theta^{\prime})}D_{ijs}$, where $S_{i}\subseteq\{0,\ldots,2^{\ell_{i}(\theta,\theta^{\prime})}-1\}$, each $D_{ijs}\in\{C_{ij},C_{ij}^{c}\}$, and $s\neq s^{\prime}\Rightarrow\bigcap_{j=1}^{\ell_{i}(\theta,\theta^{\prime})}D_{ijs}\cap\bigcap_{j=1}^{\ell_{i}(\theta,\theta^{\prime})}D_{ijs^{\prime}}=\emptyset$. Since the $B_{i}(\theta,\theta^{\prime})$ are disjoint, the above sums are bounded, so there exists $m_{k}(\theta,\theta^{\prime},\gamma)\in\mathbb{N}$ such that every $m\geq m_{k}(\theta,\theta^{\prime},\gamma)$ satisfies

\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(B_{\theta,\theta^{\prime}})-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(B_{\theta,\theta^{\prime}})<2\gamma+\sum_{i=1}^{m}\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(B_{i}(\theta,\theta^{\prime}))-\sum_{i=1}^{m}\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(B_{i}(\theta,\theta^{\prime})).

Now define $\tilde{M}_{k}(\gamma)=\max_{\theta,\theta^{\prime}\in\Theta_{\gamma}}m_{k}(\theta,\theta^{\prime},\gamma)$. Then for any $\theta,\theta^{\prime}\in\Theta$, let $\theta_{\gamma},\theta_{\gamma}^{\prime}\in\Theta_{\gamma}$ be such that $\|\pi_{\theta}-\pi_{\theta_{\gamma}}\|<\gamma$ and $\|\pi_{\theta^{\prime}}-\pi_{\theta^{\prime}_{\gamma}}\|<\gamma$, which implies $\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma})}\|<\gamma$ and $\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime}_{\gamma})}\|<\gamma$ by Lemma 2. Then

\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\|<\|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma})}-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime}_{\gamma})}\|+2\gamma
\leq\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma})}(B_{\theta_{\gamma},\theta^{\prime}_{\gamma}})-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma}^{\prime})}(B_{\theta_{\gamma},\theta^{\prime}_{\gamma}})+3\gamma
\leq\sum_{i=1}^{\tilde{M}_{k}(\gamma)}\left(\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma})}(B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma}))-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime}_{\gamma})}(B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma}))\right)+5\gamma.

Again, since the $B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})$ are disjoint, this last expression equals

5\gamma+\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta_{\gamma})}\left(\bigcup_{i=1}^{\tilde{M}_{k}(\gamma)}B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})\right)-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime}_{\gamma})}\left(\bigcup_{i=1}^{\tilde{M}_{k}(\gamma)}B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})\right)
\leq 7\gamma+\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}\left(\bigcup_{i=1}^{\tilde{M}_{k}(\gamma)}B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})\right)-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\left(\bigcup_{i=1}^{\tilde{M}_{k}(\gamma)}B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})\right)
=7\gamma+\sum_{i=1}^{\tilde{M}_{k}(\gamma)}\left(\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma}))-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma}))\right)
\leq 7\gamma+\tilde{M}_{k}(\gamma)\max_{i\leq\tilde{M}_{k}(\gamma)}\left|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma}))-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma}))\right|.

Thus, if we can show that each term $\left|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma}))-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma}))\right|$ is bounded by an $o(1)$ function of $\|\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta^{\prime})}\|$, then the result will follow by substituting this relaxation into the above expression and defining $M_{k}$ by minimizing the resulting expression over $\gamma>0$.

Toward this end, let $C_{ij}$ be as above in the definition of $B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})$, and note that $I_{B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})}$ is representable as a function of the $I_{C_{ij}}$ indicators, so that

\left|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma}))-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma}))\right|=\|\mathbb{P}_{I_{B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})}(\mathcal{Z}^{t}_{k}(\theta))}-\mathbb{P}_{I_{B_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})}(\mathcal{Z}^{t}_{k}(\theta^{\prime}))}\|
\leq\|\mathbb{P}_{(I_{C_{i1}}(\mathcal{Z}^{t}_{k}(\theta)),\ldots,I_{C_{i\ell_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})}}(\mathcal{Z}^{t}_{k}(\theta)))}-\mathbb{P}_{(I_{C_{i1}}(\mathcal{Z}^{t}_{k}(\theta^{\prime})),\ldots,I_{C_{i\ell_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})}}(\mathcal{Z}^{t}_{k}(\theta^{\prime})))}\|
\leq 2^{\ell_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})}\max_{J\subseteq\{1,\ldots,\ell_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})\}}\mathbb{E}\Bigg[\Bigg(\prod_{j\in J}I_{C_{ij}}(\mathcal{Z}^{t}_{k}(\theta))\Bigg)\prod_{j\notin J}\Bigg(1-I_{C_{ij}}(\mathcal{Z}^{t}_{k}(\theta))\Bigg)
\qquad\qquad-\Bigg(\prod_{j\in J}I_{C_{ij}}(\mathcal{Z}^{t}_{k}(\theta^{\prime}))\Bigg)\prod_{j\notin J}\Bigg(1-I_{C_{ij}}(\mathcal{Z}^{t}_{k}(\theta^{\prime}))\Bigg)\Bigg]
\leq 2^{\ell_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})}\sum_{J\subseteq\{1,\ldots,\ell_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})\}}\left|\mathbb{E}\left[\prod_{j\in J}I_{C_{ij}}(\mathcal{Z}^{t}_{k}(\theta))-\prod_{j\in J}I_{C_{ij}}(\mathcal{Z}^{t}_{k}(\theta^{\prime}))\right]\right|
\leq 4^{\ell_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})}\max_{J\subseteq\{1,\ldots,\ell_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})\}}\left|\mathbb{E}\left[\prod_{j\in J}I_{C_{ij}}(\mathcal{Z}^{t}_{k}(\theta))-\prod_{j\in J}I_{C_{ij}}(\mathcal{Z}^{t}_{k}(\theta^{\prime}))\right]\right|
=4^{\ell_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})}\max_{J\subseteq\{1,\ldots,\ell_{i}(\theta_{\gamma},\theta^{\prime}_{\gamma})\}}\left|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}\left(\bigcap_{j\in J}C_{ij}\right)-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}\left(\bigcap_{j\in J}C_{ij}\right)\right|.

Note that $\bigcap_{j\in J}C_{ij}$ can be expressed as some $(A_{1}\times(-\infty,t_{1}])\times\cdots\times(A_{k}\times(-\infty,t_{k}])$, where each $A_{p}\in{\cal B}_{\mathcal{X}}$ and $t_{p}\in\mathbb{R}$, so that, for $\hat{\ell}=\max_{\theta,\theta^{\prime}\in\Theta_{\gamma}}\max_{i\leq\tilde{M}_{k}(\gamma)}\ell_{i}(\theta,\theta^{\prime})$ and $\mathcal{C}_{k}=\{(A_{1}\times(-\infty,t_{1}])\times\cdots\times(A_{k}\times(-\infty,t_{k}]):\forall j\leq k,\ A_{j}\in{\cal B}_{\mathcal{X}},\ t_{j}\in\mathbb{R}\}$, this last expression is at most

4^{\hat{\ell}}\sup_{C\in\mathcal{C}_{k}}\left|\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(C)-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(C)\right|.

Next, note that for any $C=(A_{1}\times(-\infty,t_{1}])\times\cdots\times(A_{k}\times(-\infty,t_{k}])\in\mathcal{C}_{k}$, letting $C_{1}=A_{1}\times\cdots\times A_{k}$ and $C_{2}=(-\infty,t_{1}]\times\cdots\times(-\infty,t_{k}]$,

\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta)}(C)-\mathbb{P}_{\mathcal{Z}^{t}_{k}(\theta^{\prime})}(C)=\mathbb{E}\left[\left(\mathbb{P}_{\mathbb{Y}_{tk}(\theta)|\mathbb{X}_{tk}}(C_{2})-\mathbb{P}_{\mathbb{Y}_{tk}(\theta^{\prime})|\mathbb{X}_{tk}}(C_{2})\right)I_{C_{1}}(\mathbb{X}_{tk})\right]
\leq\mathbb{E}\left[\left|\mathbb{P}_{\mathbb{Y}_{tk}(\theta)|\mathbb{X}_{tk}}(C_{2})-\mathbb{P}_{\mathbb{Y}_{tk}(\theta^{\prime})|\mathbb{X}_{tk}}(C_{2})\right|\right].

For $p\in\{1,\ldots,k\}$, let $C_{2p}=(-\infty,t_{p}]$. Then note that, by definition of $d$, for any given $x=(x_{1},\ldots,x_{k})$, the class $\mathcal{H}_{x}=\{x_{p}\mapsto I_{C_{2p}}(h(x_{p})):h\in\mathcal{F}\}$ is a VC class over $\{x_{1},\ldots,x_{k}\}$ with VC dimension at most $d$. Furthermore, we have

\left|\mathbb{P}_{\mathbb{Y}_{tk}(\theta)|\mathbb{X}_{tk}}(C_{2})-\mathbb{P}_{\mathbb{Y}_{tk}(\theta^{\prime})|\mathbb{X}_{tk}}(C_{2})\right|=\Big|\mathbb{P}_{(I_{C_{21}}(h^{*}_{t\theta}(X_{t1})),\ldots,I_{C_{2k}}(h^{*}_{t\theta}(X_{tk})))|\mathbb{X}_{tk}}(\{(1,\ldots,1)\})-\mathbb{P}_{(I_{C_{21}}(h^{*}_{t\theta^{\prime}}(X_{t1})),\ldots,I_{C_{2k}}(h^{*}_{t\theta^{\prime}}(X_{tk})))|\mathbb{X}_{tk}}(\{(1,\ldots,1)\})\Big|.

Therefore, the results of [12] (in the proof of their Lemma 3) imply that

\left|\mathbb{P}_{\mathbb{Y}_{tk}(\theta)|\mathbb{X}_{tk}}(C_{2})-\mathbb{P}_{\mathbb{Y}_{tk}(\theta^{\prime})|\mathbb{X}_{tk}}(C_{2})\right|
\leq 2^{k}\max_{y\in\{0,1\}^{d}}\max_{D\in\{1,\ldots,k\}^{d}}\Big|\mathbb{P}_{\{I_{C_{2j}}(h^{*}_{t\theta}(X_{tj}))\}_{j\in D}|\{X_{tj}\}_{j\in D}}(\{y\})-\mathbb{P}_{\{I_{C_{2j}}(h^{*}_{t\theta^{\prime}}(X_{tj}))\}_{j\in D}|\{X_{tj}\}_{j\in D}}(\{y\})\Big|.

Thus, we have

\mathbb{E}\left[\left|\mathbb{P}_{\mathbb{Y}_{tk}(\theta)|\mathbb{X}_{tk}}(C_{2})-\mathbb{P}_{\mathbb{Y}_{tk}(\theta^{\prime})|\mathbb{X}_{tk}}(C_{2})\right|\right]
\leq 2^{k}\,\mathbb{E}\Bigg[\max_{y\in\{0,1\}^{d}}\max_{D\in\{1,\ldots,k\}^{d}}\Big|\mathbb{P}_{\{I_{C_{2j}}(h^{*}_{t\theta}(X_{tj}))\}_{j\in D}|\{X_{tj}\}_{j\in D}}(\{y\})-\mathbb{P}_{\{I_{C_{2j}}(h^{*}_{t\theta^{\prime}}(X_{tj}))\}_{j\in D}|\{X_{tj}\}_{j\in D}}(\{y\})\Big|\Bigg]
\leq 2^{k}\sum_{y\in\{0,1\}^{d}}\sum_{D\in\{1,\ldots,k\}^{d}}\mathbb{E}\Bigg[\Big|\mathbb{P}_{\{I_{C_{2j}}(h^{*}_{t\theta}(X_{tj}))\}_{j\in D}|\{X_{tj}\}_{j\in D}}(\{y\})-\mathbb{P}_{\{I_{C_{2j}}(h^{*}_{t\theta^{\prime}}(X_{tj}))\}_{j\in D}|\{X_{tj}\}_{j\in D}}(\{y\})\Big|\Bigg]
\leq 2^{d+k}k^{d}\max_{y\in\{0,1\}^{d}}\max_{D\in\{1,\ldots,k\}^{d}}\mathbb{E}\Bigg[\Big|\mathbb{P}_{\{I_{C_{2j}}(h^{*}_{t\theta}(X_{tj}))\}_{j\in D}|\{X_{tj}\}_{j\in D}}(\{y\})-\mathbb{P}_{\{I_{C_{2j}}(h^{*}_{t\theta^{\prime}}(X_{tj}))\}_{j\in D}|\{X_{tj}\}_{j\in D}}(\{y\})\Big|\Bigg].

Exchangeability implies this is at most

2^{d+k}k^{d}\max_{y\in\{0,1\}^{d}}\sup_{t_{1},\ldots,t_{d}\in\mathbb{R}}\mathbb{E}\Bigg[\Big|\mathbb{P}_{\{I_{(-\infty,t_{j}]}(h^{*}_{t\theta}(X_{tj}))\}_{j=1}^{d}|\mathbb{X}_{td}}(\{y\})-\mathbb{P}_{\{I_{(-\infty,t_{j}]}(h^{*}_{t\theta^{\prime}}(X_{tj}))\}_{j=1}^{d}|\mathbb{X}_{td}}(\{y\})\Big|\Bigg]
=2^{d+k}k^{d}\max_{y\in\{0,1\}^{d}}\sup_{t_{1},\ldots,t_{d}\in\mathbb{R}}\mathbb{E}\Bigg[\Big|\mathbb{P}_{\{I_{(-\infty,t_{j}]}(Y_{tj}(\theta))\}_{j=1}^{d}|\mathbb{X}_{td}}(\{y\})-\mathbb{P}_{\{I_{(-\infty,t_{j}]}(Y_{tj}(\theta^{\prime}))\}_{j=1}^{d}|\mathbb{X}_{td}}(\{y\})\Big|\Bigg].

[12] argue that for all $y\in\{0,1\}^{d}$ and $t_{1},\ldots,t_{d}\in\mathbb{R}$,

\mathbb{E}\Big[\Big|\mathbb{P}_{\{I_{(-\infty,t_{j}]}(Y_{tj}(\theta))\}_{j=1}^{d}|\mathbb{X}_{td}}(\{y\})-\mathbb{P}_{\{I_{(-\infty,t_{j}]}(Y_{tj}(\theta^{\prime}))\}_{j=1}^{d}|\mathbb{X}_{td}}(\{y\})\Big|\Big]
\leq 4\sqrt{\|\mathbb{P}_{\{I_{(-\infty,t_{j}]}(Y_{tj}(\theta))\}_{j=1}^{d},\mathbb{X}_{td}}-\mathbb{P}_{\{I_{(-\infty,t_{j}]}(Y_{tj}(\theta^{\prime}))\}_{j=1}^{d},\mathbb{X}_{td}}\|}.

Noting that

\|\mathbb{P}_{\{I_{(-\infty,t_{j}]}(Y_{tj}(\theta))\}_{j=1}^{d},\mathbb{X}_{td}}-\mathbb{P}_{\{I_{(-\infty,t_{j}]}(Y_{tj}(\theta^{\prime}))\}_{j=1}^{d},\mathbb{X}_{td}}\|\leq\|\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta)}-\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta^{\prime})}\|

completes the proof. ∎

We are now ready for the proof of Theorem 5.1.

Proof (Proof of Theorem 5.1)

The estimator $\hat{\theta}_{T\theta_{\star}}$ we will use is precisely the minimum-distance skeleton estimate of $\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta_{\star})}$ [13, 5]. [13] proved that if $N(\varepsilon)$ is the $\varepsilon$-covering number of $\{\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta)}:\theta\in\Theta\}$, then for this estimator $\hat{\theta}_{T\theta_{\star}}$ there is some $T_{\varepsilon}=O((1/\varepsilon^{2})\log N(\varepsilon/4))$ such that any $T\geq T_{\varepsilon}$ has

\mathbb{E}\left[\|\mathbb{P}_{\mathcal{Z}^{t}_{d}(\hat{\theta}_{T\theta_{\star}})}-\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta_{\star})}\|\right]<\varepsilon.

Thus, taking $G_{T}=\inf\{\varepsilon>0:T\geq T_{\varepsilon}\}$, we have

\mathbb{E}\left[\|\mathbb{P}_{\mathcal{Z}^{t}_{d}(\hat{\theta}_{T\theta_{\star}})}-\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta_{\star})}\|\right]\leq G_{T}=o(1).

Letting $R^{\prime}(T,\alpha)$ be any positive sequence with $G_{T}\ll R^{\prime}(T,\alpha)\ll 1$ and $R^{\prime}(T,\alpha)\geq G_{T}/\alpha$, and letting $\delta(T,\alpha)=G_{T}/R^{\prime}(T,\alpha)=o(1)$, Markov's inequality implies

\mathbb{P}\left(\|\mathbb{P}_{\mathcal{Z}^{t}_{d}(\hat{\theta}_{T\theta_{\star}})}-\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta_{\star})}\|>R^{\prime}(T,\alpha)\right)\leq\delta(T,\alpha)\leq\alpha.\qquad(2)

Letting $R(T,\alpha)=\min_{k}\left(M_{k}\left(R^{\prime}(T,\alpha)\right)+r_{k}\right)$, since $R^{\prime}(T,\alpha)=o(1)$ and $r_{k}=o(1)$, we have $R(T,\alpha)=o(1)$. Furthermore, composing (2) with Lemmas 1, 2, and 3, we have

\mathbb{P}\left(\|\pi_{\hat{\theta}_{T\theta_{\star}}}-\pi_{\theta_{\star}}\|>R(T,\alpha)\right)\leq\delta(T,\alpha)\leq\alpha. ∎
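For intuition, here is a hedged worked instance of how these quantities compose; the rate $G_{T}\asymp T^{-1/2}$ is assumed only for illustration and is not a consequence of the theorem. Under that assumption, one admissible choice in the argument above is

R^{\prime}(T,\alpha)=\max\left\{T^{-1/4},\,T^{-1/2}/\alpha\right\},\qquad \delta(T,\alpha)=\frac{G_{T}}{R^{\prime}(T,\alpha)}=\min\left\{T^{-1/4},\,\alpha\right\},\qquad R(T,\alpha)=\min_{k}\left(M_{k}(R^{\prime}(T,\alpha))+r_{k}\right),

which satisfies $G_{T}\ll R^{\prime}(T,\alpha)\ll 1$, $R^{\prime}(T,\alpha)\geq G_{T}/\alpha$, and $\delta(T,\alpha)\leq\alpha$, as required.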

Remark:

Although the above proof makes use of the minimum-distance skeleton estimator, which is typically not computationally efficient, it is often possible to achieve this same result (for certain families of distributions) using a simpler estimator, such as the maximum likelihood estimator. All we require is that the risk of the estimator converges to 0 at a known rate that is independent of $\theta_{\star}$. For instance, see [6] for conditions on the family of distributions sufficient for this to be true of the maximum likelihood estimator.
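To illustrate the shape of the estimator used above, the following is a small hedged sketch of a minimum-distance (Yatracos-class) estimate over a finite candidate set on a finite sample space, in the spirit of [13, 5]. It is not the paper's implementation: the candidate set skeleton stands in for an $\varepsilon$-cover of $\{\mathbb{P}_{\mathcal{Z}^{t}_{d}(\theta)}:\theta\in\Theta\}$ and is assumed to be given.

from collections import Counter

def minimum_distance_estimate(samples, skeleton):
    """samples: list of observed outcomes (hashable).
    skeleton: dict mapping candidate name -> dict of outcome -> probability.
    Returns the candidate whose mass agrees best with the empirical measure,
    uniformly over the Yatracos sets A_{p,q} = {z : p(z) > q(z)}."""
    T = len(samples)
    counts = Counter(samples)
    empirical = {z: c / T for z, c in counts.items()}
    outcomes = set(empirical)
    for p in skeleton.values():
        outcomes |= set(p)

    def mass(dist, A):
        return sum(dist.get(z, 0.0) for z in A)

    # Yatracos class: one set for every ordered pair of distinct candidates.
    names = list(skeleton)
    yatracos = [frozenset(z for z in outcomes
                          if skeleton[a].get(z, 0.0) > skeleton[b].get(z, 0.0))
                for a in names for b in names if a != b]

    def discrepancy(p):
        return max((abs(mass(p, A) - mass(empirical, A)) for A in yatracos),
                   default=0.0)

    return min(names, key=lambda name: discrepancy(skeleton[name]))

On finite spaces the Yatracos sets can be enumerated directly, as above; in general one works with the abstract class of sets $\{p>q\}$, and the selected candidate's total variation distance to the truth is within a constant factor of the best candidate in the skeleton, plus an empirical-process term that decays with $T$.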

References

  • [1] Bar-Yossef, Z.: Sampling lower bounds via information theory. In: Proceedings of the 35th Annual ACM Symposium on the Theory of Computing. pp. 335–344 (2003)
  • [2] Baxter, J.: A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning 28, 7–39 (1997)
  • [3] Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery 36(4), 929–965 (1989)
  • [4] Cramton, P., Shoham, Y., Steinberg, R.: Combinatorial Auctions. The MIT Press (2006)
  • [5] Devroye, L., Lugosi, G.: Combinatorial Methods in Density Estimation. Springer, New York, NY, USA (2001)
  • [6] van de Geer, S.: Empirical Processes in M-Estimation. Cambridge University Press (2000)
  • [7] Poland, J., Hutter, M.: MDL convergence speed for Bernoulli sequences. Statistics and Computing 16, 161–175 (2006)
  • [8] Schervish, M.J.: Theory of Statistics. Springer, New York, NY, USA (1995)
  • [9] Vapnik, V.: Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York (1982)
  • [10] Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16, 264–280 (1971)
  • [11] Wald, A.: Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics 16(2), 117–186 (1945)
  • [12] Yang, L., Hanneke, S., Carbonell, J.: A theory of transfer learning with applications to active learning. Machine Learning 90(2), 161–189 (2013)
  • [13] Yatracos, Y.G.: Rates of convergence of minimum distance estimators and Kolmogorov’s entropy. The Annals of Statistics 13, 768–774 (1985)
  • [14] Zinkevich, M., Blum, A., Sandholm, T.: On polynomial-time preference elicitation with value queries. In: Proceedings of the 4th ACM Conference on Electronic Commerce. pp. 175–185 (2003)