yangli@us.ibm.com
Princeton, NJ USA. steve.hanneke@gmail.com
Carnegie Mellon University, Pittsburgh, PA USA. jgc@cs.cmu.edu
Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks
Abstract
We study the optimal rates of convergence for estimating a prior distribution over a VC class from a sequence of independent data sets respectively labeled by independent target functions sampled from the prior. We specifically derive upper and lower bounds on the optimal rates under a smoothness condition on the correct prior, with the number of samples per data set equal to the VC dimension. These results have implications for the improvements achievable via transfer learning. We additionally extend this setting to real-valued functions, where we establish consistency of an estimator for the prior, and discuss an additional application to a preference elicitation problem in algorithmic economics.
1 Introduction
In the transfer learning setting, we are presented with a sequence of learning problems, each with some respective target concept we are tasked with learning. The key question in transfer learning is how to leverage our access to past learning problems in order to improve performance on learning problems we will be presented with in the future.
Among the several proposed models for transfer learning, one particularly appealing model supposes the learning problems are independent and identically distributed, with unknown distribution, and the advantage of transfer learning then comes from the ability to estimate this shared distribution based on the data from past learning problems [2, 12]. For instance, when customizing a speech recognition system to a particular speaker’s voice, we might expect the first few people would need to speak many words or phrases in order for the system to accurately identify the nuances. However, after performing this for many different people, if the software has access to those past training sessions when customizing itself to a new user, it should have identified important properties of the speech patterns, such as the common patterns within each of the major dialects or accents, and other such information about the distribution of speech patterns within the user population. It should then be able to leverage this information to reduce the number of words or phrases the next user needs to speak in order to train the system, for instance by first trying to identify the individual’s dialect, then presenting phrases that differentiate common subpatterns within that dialect, and so forth.
In analyzing the benefits of transfer learning in such a setting, one important question to ask is how quickly we can estimate the distribution from which the learning problems are sampled. In recent work, [12] have shown that under mild conditions on the family of possible distributions, if the target concepts reside in a known VC class, then it is possible to estimate this distribution using only a bounded number of training samples per task: specifically, a number of samples equal to the VC dimension. However, that work left open the question of quantifying the rate of convergence. This rate of convergence can have a direct impact on how much benefit we gain from transfer learning when we are faced with only a finite sequence of learning problems. As such, it is certainly desirable to derive tight characterizations of this rate of convergence.
The present work continues that of [12], bounding the rate of convergence for estimating this distribution, under a smoothness condition on the distribution. We derive a generic upper bound, which holds regardless of the VC class the target concepts reside in. The proof of this result builds on that earlier work, but requires several interesting innovations to make the rate of convergence explicit, and to dramatically improve the upper bound implicit in the proofs of those earlier results. We further derive a nontrivial lower bound that holds for certain constructed scenarios, which illustrates a lower limit on how good a general upper bound we might hope for in results expressed only in terms of the number of tasks, the smoothness conditions, and the VC dimension.
We additionally include an extension of the results of [12] to the setting of real-valued functions, establishing consistency (at a uniform rate) for an estimator of a prior over any VC subgraph class. In addition to the application to transfer learning, analogous to the original work of [12], we also discuss an application of this result to a preference elicitation problem in algorithmic economics, in which we are tasked with allocating items to a sequence of customers to approximately maximize the customers’ satisfaction, while permitted access to the customer valuation functions only via value queries.
2 The Setting
Let be a measurable space [8] (where is called the instance space), and let be a distribution on (called the data distribution). Let be a VC class of measurable classifiers (called the concept space), and denote by the VC dimension of [10]. We suppose is equipped with its Borel -algebra induced by the pseudo-metric . Though our results can be formulated for general (with somewhat more complicated theorem statements), to simplify the statement of results we suppose is actually a metric, which would follow from appropriate topological conditions on relative to .
For any two probability measures on a measurable space , define the total variation distance
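For finite spaces, the total variation distance admits the familiar half-$\ell_1$ form. As a quick illustrative sketch (not part of the paper's formal development, and using hypothetical distributions):

```python
def total_variation(p, q):
    """Half the L1 distance between two discrete distributions,
    which equals sup_A |P(A) - Q(A)| over all events A."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.4, "b": 0.4, "c": 0.2}
print(total_variation(p, q))  # approximately 0.1, attained e.g. by A = {"a"}
```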
For a set function on a finite measurable space , we abbreviate , . Let be a family of probability measures on (called priors), where is an arbitrary index set (called the parameter space). We suppose there exists a probability measure on (the reference measure) such that every is absolutely continuous with respect to , and therefore has a density function given by the Radon-Nikodym derivative [8].
We consider the following type of estimation problem. There is a collection of -valued random variables , where for any fixed the variables are i.i.d. with distribution . For each , there is a sequence , where are i.i.d. , and for each , . We additionally denote by the first elements of , for any , and similarly and . Following the terminology used in the transfer learning literature, we refer to the collection of variables associated with each collectively as the task. We will be concerned with sequences of estimators , for , which are based on only a bounded number of samples per task, among the first tasks. Our main results specifically study the case of samples per task. For any such estimator, we measure the risk as , and will be particularly interested in upper-bounding the worst-case risk as a function of , and lower-bounding the minimum possible value of this worst-case risk over all possible estimators (called the minimax risk).
In previous work, [12] showed that, if is a totally bounded family, then even with only a bounded number of samples per task, the minimax risk (as a function of the number of tasks ) converges to zero. In fact, that work also proved this is not necessarily the case in general for any number of samples less than . However, the actual rates of convergence were not explicitly derived in that work, and indeed the upper bounds on the rates of convergence implicit in that analysis may often have fairly complicated dependences on , , and , and furthermore often provide only very slow rates of convergence.
To derive explicit bounds on the rates of convergence, in the present work we specifically focus on families of smooth densities. The motivation for involving a notion of smoothness in characterizing rates of convergence is clear if we consider the extreme case in which contains two priors and , with , where is a very small but nonzero value; in this case, if we have only a small number of samples per task, we would require many tasks (on the order of ) to observe any data points carrying any information that would distinguish between these two priors (namely, points with ); yet , so that we have a slow rate of convergence (at least initially). A total boundedness condition on would limit the number of such pairs present in , so that for instance we cannot have arbitrarily close and , but less extreme variants of this can lead to slow asymptotic rates of convergence as well. Specifically, in the present work we consider the following notion of smoothness. For and , a function is -Hölder smooth if
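The smoothness condition referred to above is the standard Hölder condition. As a reconstruction (the specific symbols $f$, $L$, $\alpha$, and the metric $\rho$ here are our own notation, since the original inline formula did not survive in this copy), a density $f$ on the concept space is $(L, \alpha)$-Hölder smooth if:

```latex
% Standard (L, alpha)-Holder smoothness for a density f on the concept
% space, with L > 0 a constant and alpha in (0, 1] the exponent; rho is
% assumed to be the pseudo-metric on the concept space defined earlier.
\[
  \forall h, g, \qquad |f(h) - f(g)| \;\le\; L \, \rho(h, g)^{\alpha}.
\]
```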
3 An Upper Bound
We now have the following theorem, holding for an arbitrary VC class and data distribution ; it is the main result of this work.
Theorem 3.1
For any class of priors on having -Hölder smooth densities , for any , there exists an estimator such that
Proof
By the standard PAC analysis [9, 3], for any , with probability greater than , a sample of random points will partition into regions of width less than (under ). For brevity, we omit the subscripts and superscripts on quantities such as throughout the following analysis, since the claims hold for any arbitrary value of .
For any , let denote a (conditional on ) distribution defined as follows. Let denote the (conditional on ) density function of with respect to , and for any , let (or if ). In other words, has the same probability mass as for each of the equivalence classes induced by , but conditioned on the equivalence class, simply has a constant-density distribution over that equivalence class. Note that every has between the smallest and largest values of among with ; therefore, by the smoothness condition, on the event (of probability greater than ) that each of these regions has diameter less than , we have . On this event, for any ,
Furthermore, since the regions that define and are the same (namely, the partition induced by ), we have
Thus, we have that with probability at least ,
Following an argument analogous to the inductive argument of [12], suppose , fix and . Then the for which is minimal, subject to the constraint that no has , has ; also, for any with , letting for and , we have
and similarly for , so that
Now consider that these two terms inductively define a binary tree. Every time the tree branches left once, it arrives at a difference of probabilities for a set of one less element than that of its parent. Every time the tree branches right once, it arrives at a difference of probabilities for a one closer to an unrealized than that of its parent. Say we stop branching the tree upon reaching a set and a such that either is an unrealized labeling, or . Thus, we can bound the original (root node) difference of probabilities by the sum of the differences of probabilities for the leaf nodes with . Any path in the tree can branch left at most times (total) before reaching a set with only elements, and can branch right at most times in a row before reaching a such that both probabilities are zero, so that the difference is zero. So the depth of any leaf node with is at most . Furthermore, at any level of the tree, from left to right the nodes have strictly decreasing values, so that the maximum width of the tree is at most . So the total number of leaf nodes with is at most . Thus, for any and ,
Since
and by Sauer’s Lemma this is at most
we have that
Thus, we have that
Note that
and by exchangeability, this last line equals
[12] showed that , so that in total we have . Plugging in the value of , this is
Thus, it suffices to bound the rate of convergence (in total variation distance) of some estimator of . If is the -covering number of , then taking as the minimum distance skeleton estimate of [13, 5] achieves expected total variation distance from , for some . We can partition into cells of diameter , and set a constant density value within each cell, on an -grid of density values, and every prior with -Hölder smooth density will have density within of some density so-constructed; there are then at most such densities, so this bounds the covering numbers of . Furthermore, the covering number of upper bounds [12], so that .
Solving for , we have . So this bounds the rate of convergence for , for the minimum distance skeleton estimate. Plugging this rate into the bound on the priors, combined with Jensen’s inequality, we have
This holds for any , so minimizing this expression over yields a bound on the rate. For instance, with , we have
∎
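The minimum distance skeleton estimator invoked in the proof can be illustrated in a simplified discrete form. The following sketch is not the actual skeleton estimate of [13, 5] (which compares probabilities over Yatracos sets); instead it selects, from a finite cover of hypothetical candidate distributions, the one nearest the empirical measure in total variation, which conveys the same idea for discrete distributions:

```python
import random
from collections import Counter

def empirical(sample):
    """Empirical distribution of a finite sample."""
    n = len(sample)
    return {x: c / n for x, c in Counter(sample).items()}

def tv(p, q):
    """Total variation distance between discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def min_distance_estimate(candidates, sample):
    """Pick the candidate distribution closest in TV to the empirical measure."""
    emp = empirical(sample)
    return min(candidates, key=lambda p: tv(p, emp))

random.seed(0)
truth = {"a": 0.7, "b": 0.2, "c": 0.1}
candidates = [
    {"a": 0.7, "b": 0.2, "c": 0.1},        # the true distribution
    {"a": 0.3, "b": 0.3, "c": 0.4},
    {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3},
]
sample = random.choices(list(truth), weights=list(truth.values()), k=500)
est = min_distance_estimate(candidates, sample)
print(est)  # the first candidate, almost surely, at this sample size
```

The rate at which such an estimate converges is governed by the covering numbers of the candidate family, which is exactly the role the covering-number bound plays in the proof above.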
4 A Minimax Lower Bound
One natural question is whether Theorem 3.1 can generally be improved. While we expect this to be true for some fixed VC classes (e.g., those of finite size), and in any case we expect that some of the constant factors in the exponent may be improvable, it is not at this time clear whether the general form of is sometimes optimal. One way to investigate this question is to construct specific spaces and distributions for which a lower bound can be obtained. In particular, we are generally interested in exhibiting lower bounds that are worse than those that apply to the usual problem of density estimation based on direct access to the values (see Theorem 4.2 below).
Here we present a lower bound that is interesting for this reason. However, although larger than the optimal rate for methods with direct access to the target concepts, it is still far from matching the upper bound above, so that the question of tightness remains open. Specifically, we have the following result.
Theorem 4.1
For any integer , any , there is a value such that, for any , there exists an instance space , a concept space of VC dimension , a distribution over , and a distribution over such that, for a set of distributions over with -Hölder smooth density functions with respect to , any estimator has
Proof
(Sketch) We proceed by a reduction from the task of determining the bias of a coin from among two given possibilities. Specifically, fix any , , and let be i.i.d. random variables, for each ; then it is known that, for any (possibly nondeterministic) decision rule ,
(1)
This easily follows from the results of [1], combined with a result of [7] bounding the KL divergence (see also [11])
To use this result, we construct a learning problem as follows. Fix some with , let , and let be the space of all classifiers such that . Clearly the VC dimension of is . Define the distribution as uniform over . Finally, we specify a family of -Hölder smooth priors, parameterized by , as follows. Let . First, enumerate the distinct -sized subsets of as . Define the reference distribution by the property that, for any , letting , . For any , define the prior as the distribution of a random variable specified by the following generative model. Let , let ; finally, , where is if is odd, or if is even. We will refer to the variables in this generative model below. For any , letting and , we can equivalently express . From this explicit representation, it is clear that, letting , we have for all . The fact that is Hölder smooth follows from this, since every distinct have .
Next we set up the reduction as follows. For any estimator , and each , let be the classifier with ; also, if , let , and otherwise . We use these values to estimate the original values. Specifically, let and , where . Then
Thus, we have reduced from the problem of deciding the biases of these independent Bernoulli random variables. To complete the proof, it suffices to lower bound the expectation of the right side for an arbitrary estimator.
Toward this end, we in fact study an even easier problem. Specifically, consider an estimator , where is the random variable in the generative model that defines ; that is, , , and , where the are independent across , as are the and . Clearly the from above can be viewed as an estimator of this type, which simply ignores the knowledge of . The knowledge of these variables simplifies the analysis, since given , the data can be partitioned into disjoint sets, , and we can use only the set to estimate . Furthermore, we can use only the subset of these for which , since otherwise we have zero information about the value of . That is, given , any is conditionally independent from every for , and is even conditionally independent from when is not completely contained in ; specifically, in this case, regardless of , the conditional distribution of given and given is a product distribution, which deterministically assigns label to those with , and gives uniform random values to the subset of with their respective . Finally, letting , we note that given , , and the value , is conditionally independent from . Thus, the set of values is a sufficient statistic for (hence for ). Recall that, when and , the value of is equal to , a random variable. Thus, we neither lose nor gain anything (in terms of risk) by restricting ourselves to estimators of the type , for some [8]: that is, estimators that are a function of the random variables, which we should note are conditionally i.i.d. given .
Thus, by (1), for any ,
Also note that, for each , . Thus, Jensen’s inequality, linearity of expectation, and the law of total expectation imply
Thus, by linearity of the expectation,
In particular, taking , we have , so that
In particular, this implies there exists some for which
Applying this lower bound to the estimator above yields the result. ∎
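The two-coin distinguishing problem underlying this reduction can also be checked numerically. The following Monte Carlo sketch (with hypothetical parameter values of our choosing) estimates the error of the natural rule that compares the empirical frequency of heads to 1/2, and shows it decaying as the number of flips grows, consistent with the exponential dependence on the squared bias gap in (1):

```python
import random

def error_probability(eps, m, trials=2000, seed=1):
    """Monte Carlo estimate of the error of the natural decision rule
    (compare the fraction of heads to 1/2) for telling a coin of bias
    1/2 + eps from one of bias 1/2 - eps using m flips."""
    rng = random.Random(seed)
    errors = 0
    for t in range(trials):
        sign = 1 if t % 2 == 0 else -1   # alternate the true bias
        p = 0.5 + sign * eps
        heads = sum(rng.random() < p for _ in range(m))
        guess = 1 if heads > m / 2 else -1
        errors += (guess != sign)
    return errors / trials

for m in (10, 100, 1000):
    print(m, error_probability(0.05, m))  # error shrinks as m grows
```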
It is natural to wonder how these rates might potentially improve if we allow to depend on more than samples per data set. To establish limits on such improvements, we note that in the extreme case of allowing the estimator to depend on the full data sets, we may recover the known results lower bounding the risk of density estimation from i.i.d. samples from a smooth density, as indicated by the following result.
Theorem 4.2
For any integer , there exists an instance space , a concept space of VC dimension , a distribution over , and a distribution over such that, for the set of distributions over with -Hölder smooth density functions with respect to , any sequence of estimators, (), has
The proof is a simple reduction from the problem of estimating based on direct access to , which is essentially equivalent to the standard model of density estimation, and indeed the lower bound in Theorem 4.2 is a well-known result for density estimation from i.i.d. samples from a Hölder smooth density in a -dimensional space [5].
5 Real-Valued Functions and an Application in Algorithmic Economics
In this section, we present results generalizing the analysis of [12] to classes of real-valued functions. We also present an application of this generalization to a preference elicitation problem.
5.1 Consistent Estimation of Priors over Real-Valued Functions at a Bounded Rate
In this section, we let denote a -algebra on , and again let denote the corresponding -algebra on . Also, for measurable functions , let , where is a distribution over . Let be a class of functions with Borel -algebra induced by . Let be a set, and for each , let denote a probability measure on . We suppose is totally bounded in total variation distance, and that is a uniformly bounded VC subgraph class with pseudodimension . We also suppose is a metric when restricted to .
As above, let be i.i.d. random variables. For each , let be i.i.d. random variables, independent from . For each and , let for , and let ; for each , define , , and .
We have the following result. The proof parallels that of [12] (who studied the special case of binary functions), with a few important twists (in particular, a significantly different approach in the analogue of their Lemma 3). The details are included in Appendix 0.A.
Theorem 5.1
There exists an estimator , and functions and such that, for any , and for any and ,
5.2 Maximizing Customer Satisfaction in Combinatorial Auctions
Theorem 5.1 has a clear application in the context of transfer learning, following analogous arguments to those given in the special case of binary classification by [12]. In addition to that application, we can also use Theorem 5.1 in the context of the following problem in algorithmic economics, where the objective is to serve a sequence of customers so as to maximize their satisfaction.
Consider an online travel agency, where customers go to the site with some idea of what type of travel they are interested in; the site then poses a series of questions to each customer, and identifies a travel package that best suits their desires, budget, and dates. There are many options of travel packages, with options on location, sightseeing tours, hotel and room quality, etc. Because of this, serving the needs of an arbitrary customer might be a lengthy process, requiring many detailed questions. Fortunately, the stream of customers is typically not a worst-case sequence, and in particular obeys many statistical regularities: in particular, it is not too far from reality to think of the customers as being independent and identically distributed samples. With this assumption in mind, it becomes desirable to identify some of these statistical regularities so that we can pose the questions that are typically most relevant, and thereby more quickly identify the travel package that best suits the needs of the typical customer. One straightforward way to do this is to directly estimate the distribution of customer value functions, and optimize the questioning system to minimize the expected number of questions needed to find a suitable travel package.
One can model this problem in the style of Bayesian combinatorial auctions, in which each customer has a value function for each possible bundle of items. However, it is slightly different, in that we do not assume the distribution of customers is known, but rather are interested in estimating this distribution; the obtained estimate can then be used in combination with methods based on Bayesian decision theory. In contrast to the literature on Bayesian auctions (and subjectivist Bayesian decision theory in general), this technique is able to maintain general guarantees on performance that hold under an objective interpretation of the problem, rather than merely guarantees holding under an arbitrary assumed prior belief. This general idea is sometimes referred to as Empirical Bayesian decision theory in the machine learning and statistics literatures. The ideal result for an Empirical Bayesian algorithm is to be competitive with the corresponding Bayesian methods based on the actual distribution of the data (assuming the data are random, with an unknown distribution); that is, although the Empirical Bayesian methods only operate with a data-based estimate of the distribution, the aim is to perform nearly as well as methods based on the true (unobservable) distribution. In this work, we present results of this type, in the context of an abstraction of the aforementioned online travel agency problem, where the measure of performance is the expected number of questions to find a suitable package.
The specific application we are interested in here may be expressed abstractly as a kind of combinatorial auction with preference elicitation. Specifically, we suppose there is a collection of items on a menu, and each possible bundle of items has an associated fixed price. There is a stream of customers, each with a valuation function that provides a value for each possible bundle of items. The objective is to serve each customer a bundle of items that nearly-maximizes his or her surplus value (value minus price). However, we are not permitted direct observation of the customer valuation functions; rather, we may query for the value of any given bundle of items; this is referred to as a value query in the literature on preference elicitation in combinatorial auctions (see Chapter 14 of [4], [14]). The objective is to achieve this near-maximal surplus guarantee, while making only a small number of queries per customer. We suppose the customer valuation functions are sampled i.i.d. according to an unknown distribution over a known (but arbitrary) class of real-valued functions having finite pseudo-dimension. Reasoning that knowledge of this distribution should allow one to make a smaller number of value queries per customer, we are interested in estimating this unknown distribution, so that as we serve more and more customers, the number of queries per customer required to identify a near-optimal bundle should decrease. In this context, we in fact prove that in the limit, the expected number of queries per customer converges to the number required of a method having direct knowledge of the true distribution of valuation functions.
Formally, suppose there is a menu of items , and each bundle has an associated price . Suppose also there is a sequence of customers, each with a valuation function . We suppose these functions are i.i.d. samples. We can then calculate the satisfaction function for each customer as , where , and , where contains element iff .
Now suppose we are able to ask each customer a number of questions before serving up a bundle to that customer. More specifically, we are able to ask for the value for any ; as noted above, this is the value query model. We are interested in asking as few questions as possible, while satisfying the guarantee that .
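As a toy illustration of the value query model (all names, the finite candidate set, and the stopping rule below are our own illustrative assumptions, not the method from the text), one can serve a customer by querying bundle values until the surviving candidate valuations pin down an optimal bundle:

```python
from itertools import combinations

def bundles(items):
    """All subsets of the item set, as frozensets."""
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def serve_customer(candidates, price, query):
    """Ask value queries until the surviving candidate valuations are
    indistinguishable from the customer's, then return an optimal bundle.
    `candidates`: valuation functions (dict: bundle -> value), assumed
    to contain the customer's true valuation.
    `query(b)`: returns the customer's true value for bundle b."""
    live = list(candidates)
    queries = 0
    while True:
        # find a bundle on which the surviving candidates still disagree
        disputed = next((b for b in price
                         if len({v[b] for v in live}) > 1), None)
        if disputed is None:
            break
        answer = query(disputed)
        queries += 1
        live = [v for v in live if v[disputed] == answer]
    v = live[0]  # every survivor agrees with the truth on all bundles
    best = max(price, key=lambda b: v[b] - price[b])
    return best, queries

items = ("flight", "hotel")
B = bundles(items)
price = {b: 10 * len(b) for b in B}
v1 = {b: 12 * len(b) for b in B}                   # values every item at 12
v2 = {b: (25 if "hotel" in b else 0) for b in B}   # only wants the hotel
bundle, nq = serve_customer([v1, v2], price, lambda b: v2[b])
print(sorted(bundle), nq)  # ['hotel'] 1
```

This sketch identifies the valuation exactly, which can require more queries than necessary; the methods discussed below instead tolerate a small surplus loss in exchange for fewer queries.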
Now suppose, for every and , we have a method such that, given that is the actual distribution of the functions, guarantees that the value it selects has ; also let denote the actual (random) number of queries the method would ask for the function, and let . We suppose the method never queries any value twice for a given , so that its number of queries for any given is bounded.
Also suppose is a VC subgraph class of functions mapping into with pseudodimension , and that is a known totally bounded family of distributions over such that the functions have distribution for some unknown . For any and , let .
Suppose, in addition to , we have another method that is not -dependent, but still provides the -correctness guarantee, and makes a bounded number of queries (e.g., in the worst case, we could consider querying all points, but in most cases there are more clever -independent methods that use far fewer queries, such as ). Consider the following method; the quantities , , and from Theorem 5.1 are here considered with the data distribution taken to be uniform on .
The following theorem indicates that this method is correct, and furthermore that the long-run average number of queries is not much worse than that of a method that has direct knowledge of . The proof of this result parallels that of [12] for the transfer learning setting, but is included here for completeness.
Theorem 5.2
For the above method, . Furthermore, if is the total number of queries made by the method, then
Proof
By Theorem 5.1, for any , if , then with probability at least , , so that a triangle inequality implies . Thus,
For , let denote the point that would be returned by when queries are answered by some instead of (and supposing ). If , then
Plugging into the above bound, we have .
For the result on , first note that only finitely many times (due to ), so that we can ignore those values of in the asymptotic calculation (as the number of queries is always bounded), and rely on the correctness guarantee of . For the remaining values , let denote the number of queries made by . Then
Since
we have
For , let denote the number of queries would make if queries were answered with instead of . On the event , we have
Therefore,
∎
In many cases, this result will even continue to hold with an infinite number of goods (), since Theorem 5.1 has no dependence on the cardinality of the space .
6 Open Problems
There are several interesting questions that remain open at this time. Can either the lower bound or upper bound be improved in general? If, instead of samples per task, we instead use samples, how does the minimax risk vary with ? Related to this, what is the optimal value of to optimize the rate of convergence as a function of , the total number of samples? More generally, if an estimator is permitted to use total samples, taken from however many tasks it wishes, what is the optimal rate of convergence as a function of ?
Appendix 0.A Proofs for Section 5
The proof of Theorem 5.1 is based on the following sequence of lemmas, which parallel those used by [12] for establishing the analogous result for consistent estimation of priors over binary functions. The last of these lemmas (namely, Lemma 3) requires substantial modifications to the original argument of [12]; the others use arguments more-directly based on those of [12].
Lemma 1
For any and ,
Proof
Fix , . Let , , and for let and . For , let .
For , define (if the limit exists), and . Note that since is a uniformly bounded VC subgraph class, so is the collection of functions , so that the uniform strong law of large numbers implies that with probability one, , exists and has [10].
Consider any , and any . Then any has , (by the metric assumption). Thus, if for all , then ,
This implies . Under these conditions,
and similarly for .
Any measurable set for the range of can be expressed as for some appropriate . Letting , we have
Likewise, this reasoning holds for . Then
Since and are independent, , . Analogous reasoning holds for . Thus, we have
Altogether, we have . ∎
Lemma 2
There exists a sequence such that, , ,
Proof
This proof follows identically to a proof of [12], but is included here for completeness. Since for all measurable , and similarly for , we have
which implies the left inequality when combined with Lemma 1.
Next, we focus on the right inequality. Fix and , and let be such that
Let . Note that is an algebra that generates . Thus, Carathéodory’s extension theorem (specifically, the version presented by [8]) implies that there exist disjoint sets in such that and
Since these sets are disjoint, each of these sums is bounded by a probability value, which implies that there exists some such that
which implies
As , there exists and measurable such that , and therefore
Combining the above, we have . By letting approach , we have
So there exists a sequence such that
Now let and let be a minimal -cover of . Define the quantity . Then for any , let and . Then a triangle inequality implies that ,
Triangle inequalities and the left inequality from the lemma statement (already established) imply
So in total we have
Since this holds for all , defining , we have the right inequality of the lemma statement. Furthermore, since each , and , we have for each , and thus we also have . ∎
Lemma 3
, there exists a monotone function such that, ,
Proof
Fix any , and let and , and for let and .
If , then , so that
and therefore the result trivially holds.
Now suppose . Fix any , and let be a measurable set such that
By Carathéodory’s extension theorem (specifically, the version presented by [8]), there exists a disjoint sequence of sets such that
and such that each is representable as follows; for some , and sets , for , where each , the set is representable as , where , each , and . Since the are disjoint, the above sums are bounded, so that there exists such that every has
Now define . Then for any , let be such that and , which implies and by Lemma 2. Then
Again, since the are disjoint, this equals
Thus, if we can show that each term is bounded by a function of , then the result will follow by substituting this relaxation into the above expression and defining by minimizing the resulting expression over .
Toward this end, let be as above from the definition of , and note that is representable as a function of the indicators, so that
Note that can be expressed as some , where each and , so that, for and , this last expression is at most
Next note that for any , letting and ,
For , let . Then note that, by definition of , for any given , the class is a VC class over with VC dimension at most . Furthermore, we have
Therefore, the results of [12] (in the proof of their Lemma 3) imply that
Thus, we have
Exchangeability implies this is at most
[12] argue that for all and ,
Noting that
completes the proof. ∎
We are now ready for the proof of Theorem 5.1.
Proof (Proof of Theorem 5.1)
The estimator we will use is precisely the minimum-distance skeleton estimate of [13, 5]. [13] proved that if is the -covering number of , then, for this estimator, for some , any has
Thus, taking , we have
Letting be any positive sequence with and , and letting , Markov’s inequality implies
(2)
Letting , since and , we have . Furthermore, composing (2) with Lemmas 1, 2, and 3, we have
∎
Remark:
Although the above proof makes use of the minimum-distance skeleton estimator, which is typically not computationally efficient, it is often possible to achieve this same result (for certain families of distributions) using a simpler estimator, such as the maximum likelihood estimator. All we require is that the risk of the estimator converges to at a known rate that is independent of . For instance, see [6] for conditions on the family of distributions sufficient for this to be true of the maximum likelihood estimator.
References
- [1] Bar-Yossef, Z.: Sampling lower bounds via information theory. In: Proceedings of the 35th Annual ACM Symposium on the Theory of Computing. pp. 335–344 (2003)
- [2] Baxter, J.: A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning 28, 7–39 (1997)
- [3] Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery 36(4), 929–965 (1989)
- [4] Cramton, P., Shoham, Y., Steinberg, R.: Combinatorial Auctions. The MIT Press (2006)
- [5] Devroye, L., Lugosi, G.: Combinatorial Methods in Density Estimation. Springer, New York, NY, USA (2001)
- [6] van de Geer, S.: Empirical Processes in M-Estimation. Cambridge University Press (2000)
- [7] Poland, J., Hutter, M.: MDL convergence speed for Bernoulli sequences. Statistics and Computing 16, 161–175 (2006)
- [8] Schervish, M.J.: Theory of Statistics. Springer, New York, NY, USA (1995)
- [9] Vapnik, V.: Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York (1982)
- [10] Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16, 264–280 (1971)
- [11] Wald, A.: Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics 16(2), 117–186 (1945)
- [12] Yang, L., Hanneke, S., Carbonell, J.: A theory of transfer learning with applications to active learning. Machine Learning 90(2), 161–189 (2013)
- [13] Yatracos, Y.G.: Rates of convergence of minimum distance estimators and Kolmogorov’s entropy. The Annals of Statistics 13, 768–774 (1985)
- [14] Zinkevich, M., Blum, A., Sandholm, T.: On polynomial-time preference elicitation with value queries. In: Proceedings of the ACM Conference on Electronic Commerce. pp. 175–185 (2003)