
A No-Free-Lunch Theorem for MultiTask Learning

Steve Hanneke
Toyota Technological Institute at Chicago
steve.hanneke@gmail.com

Samory Kpotufe
Columbia University, Statistics
skk2175@columbia.edu
Abstract

Multitask learning and related areas such as multi-source domain adaptation address modern settings where datasets from $N$ related distributions $\{P_t\}$ are to be combined towards improving performance on any single such distribution $\mathcal{D}$. A perplexing fact remains in the evolving theory on the subject: while we would hope for performance bounds that account for the contribution from multiple tasks, the vast majority of analyses result in bounds that improve at best in the number $n$ of samples per task, but most often do not improve in $N$. As such, it might seem at first that the distributional settings or aggregation procedures considered in such analyses might be somehow unfavorable; however, as we show, the picture happens to be more nuanced, with interestingly hard regimes that might appear otherwise favorable.

In particular, we consider a seemingly favorable classification scenario where all tasks $P_t$ share a common optimal classifier $h^*$, and which can be shown to admit a broad range of regimes with improved oracle rates in terms of $N$ and $n$. Some of our main results are as follows:

  • We show that, even though such regimes admit minimax rates accounting for both $n$ and $N$, no adaptive algorithm exists; that is, without access to distributional information, no algorithm can guarantee rates that improve with large $N$ for $n$ fixed.

  • With a bit of additional information, namely, a ranking of tasks $\{P_t\}$ according to their distance to a target $\mathcal{D}$, a simple rank-based procedure can achieve near optimal aggregations of tasks' datasets, despite a search space exponential in $N$. Interestingly, the optimal aggregation might exclude certain tasks, even though they all share the same $h^*$.

1 Introduction

Multitask learning and related areas such as multi-source domain adaptation address a statistical setting where multiple datasets $Z_t \sim P_t$, $t = 1, 2, \ldots$, are to be aggregated towards improving performance w.r.t. a single one (or, in the case of multitask, any) of these distributions $P_t$. This is motivated by applications in the sciences and engineering where data availability is an issue, e.g., medical analytics typically require aggregating data from loosely related subpopulations, while identifying traffic patterns in a given city might benefit from pulling data from somewhat similar other cities.

While these problems have received much recent theoretical attention, especially in classification, a perplexing reality emerges: the bulk of results appear to show little improvement from such aggregation over using a single data source. Namely, given $N$ datasets, each of size $n$, one would hope for convergence rates in terms of the aggregate data size $N \cdot n$, somehow adjusted w.r.t. discrepancies between the distributions $P_t$, but which clearly improve on rates in terms of just $n$ as would be obtained with a single dataset. However, such clear improvements on rates appear elusive, as typical bounds on excess risk w.r.t. a target $\mathcal{D}$ (i.e., one of the $P_t$'s) are of the form (see e.g., CKW (08); BDB (08); BDBC+ (10))

\[\mathcal{E}_{\mathcal{D}}(\hat{h}) \;\lesssim\; (nN)^{-\alpha} + n^{-\alpha} + \text{disc}\!\left(\{P_t\}; \mathcal{D}\right), \quad \text{for some } \alpha \in [1/2, 1], \tag{1}\]

where in some results, one of the last two terms is dropped. In other words, typical upper bounds are either dominated by the rate $n^{-\alpha}$, or might not go to 0 with sample size at all due to the discrepancy term $\text{disc}(\{P_t\}; \mathcal{D}) > 0$, even while the excess risk of a naive classifier $\tilde{h}$ trained only on the target dataset would be $\mathcal{E}_{\mathcal{D}}(\tilde{h}) \propto n^{-\alpha} \to 0$. As such, it might seem at first that there is a gap in either algorithmic approaches, formalism and assumptions, or statistical analyses of the problem. However, as we argue here, no algorithm can guarantee a rate improving in aggregate sample size for $n$ fixed, even under seemingly generous assumptions on how the sources $P_t$ relate to a given target $\mathcal{D}$.

In particular, we consider a seemingly favorable classification setting, where all data distributions $P_t$ induce the same optimal classifier $h^*$ over a hypothesis class $\mathcal{H}$. This situation is of independent interest, e.g., appearing recently under the name of invariant risk minimization (see ABGLP (19), where it is motivated through invariants under causality), but is motivated here by our aim to elucidate basic limits of the multitask problem. As a starting point to understanding the best achievable rates, we first establish minimax upper and lower bounds, matching up to log terms, for this setting. These oracle rates, as might be expected in such benign settings, do indeed improve with both $N$ and $n$, as allowed by the level of discrepancy between distributions, appropriately formalized (Theorem 1). We then turn to characterizing the extent to which such favorable rates might be achieved by reasonable procedures, i.e., adaptive procedures with little to no prior distributional information. Many interesting messages arise, some of which we highlight below:

  • No adaptive procedure exists in the regime $\beta < 1$, where $\beta \in [0,1]$ (a so-called Bernstein class condition) parametrizes the level of noise in the label distribution (the term noise is used here to indicate nondeterminism in the label $Y$ of a sample point $X$, and so is rather non-adversarial). Namely, while oracle rates might decrease fast in $N$ and $n$, no procedure based on the aggregate samples alone can guarantee a rate better than $n^{-1/(2-\beta)}$ without further favorable restrictions on distributions (Theorem 5).

  • At low noise, $\beta = 1$ (e.g., so-called Massart's noise), even the naive yet common approach of pooling all datasets, i.e., treating them as if identically distributed, is nearly minimax optimal, achieving rates improving in both $N$ and $n$. This of course would not hold if the optimal classifiers $h^*$ differed considerably across the $P_t$'s (Theorem 3).

  • At any noise level, a ranking of the sources $P_t$ according to their discrepancy to a target $\mathcal{D}$ is sufficient information for (partial) adaptivity. While a precise ranking is probably unlikely in practice, an approximate ranking might be available as domain knowledge on how sources rank in fidelity w.r.t. an ideal target, e.g., in settings such as learning with slowly drifting distributions. Here we show that a simple rank-based procedure, using such a ranking, efficiently achieves a near optimal aggregation rate, despite the exponential search space of over $2^N$ possible aggregations (Theorem 4).

Interestingly, even assuming all $h^*$'s are the same, the optimal aggregation of datasets can change with the choice of target $\mathcal{D}$ in $\{P_t\}$ (see Theorem 2), due to the inherent asymmetry in the information tasks might have on each other, e.g., some $P_t$ might yield data in $P_s$-dense regions but not the other way around. Hence, to capture a proper picture of the problem, we cannot employ a symmetric notion of discrepancy as is common in the literature on the subject. Instead, we proceed with a notion of transfer exponent, which we recently introduced in HK (19) for the setting of domain adaptation with $N=1$ source distribution, and which we show here to also successfully capture performance limits in the present setting with $N \gg 1$ sources (see definition in Section 3.1).

We note that in the case $N=1$, there is always a minimax adaptive procedure (as shown in HK (19)), while this is not the case for $N \gg 1$. In hindsight, the reason is simple: for $N = O(1)$, i.e., constant, there is no consequential difference between a rate in terms of $N \cdot n$ and one in terms of $n$. In other words, the case $N = O(1)$ does not yield adequate insight into multitask with $N \gg 1$, and therefore does not properly inform practice as to the best approaches to aggregating and benefiting from multiple datasets.

Background and Related Work

[Figure: a source $P$ and target $\mathcal{D}$ whose supports $\mathcal{X}_P, \mathcal{X}_{\mathcal{D}}$ assign very different mass to regions of the data space]

The bulk of theoretical work on multitask and related areas builds on early work on domain adaptation (i.e., $N=1$) such as BDBCP (07); CMRR (08); BDBC+ (10), which introduced notions of discrepancy, such as the $d_{\mathcal{A}}$-divergence and the $\mathcal{Y}$-discrepancy, that specialize the total-variation metric to the setting of domain adaptation. These notions often result in bounds of the form (1), starting with CKW (08); BDBC+ (10). Such bounds can in fact be shown to be tight w.r.t. the discrepancy term $\text{disc}(\{P_t\}; \mathcal{D})$, for given distributional settings and sample sizes, owing for instance to early work on the limits of learning under distribution drift (see e.g. Bar (92)). However, the rates of (1) appear pessimistic when we consider settings of interest here where the $h^*$'s remain the same (or nearly so) across tasks, as they suggest no general improvement in risk with larger sample size. Consider for instance simple situations, as depicted in the accompanying figure, where a source $P$ and target $\mathcal{D}$ (with respective supports $\mathcal{X}_P, \mathcal{X}_{\mathcal{D}}$) might differ considerably in the mass they assign to regions of data space, thus inducing large discrepancies, but where both assign sufficient mass to decision boundaries to help identify $h^*$ with enough samples from either distribution. Proper parametrization resolves this issue, as we show through new multitask rates $\mathcal{E}_{\mathcal{D}}(\hat{h}) \to 0$ in natural situations with $\text{disc}(\{P_t\}; \mathcal{D}) > 0$, and even with no target sample.

We remark that other notions of discrepancy, e.g., Maximum Mean Discrepancy GSH+ (09) and Wasserstein distance RHS (17); SQZY (18), are employed in domain adaptation; however, they appear relatively less often in the theoretical literature on multitask and related areas. For multitask, the work of BDB (08) proposes a more structured notion of task relatedness, whereby a source $P_t$ is induced from a target $\mathcal{D}$ through a transformation of the domain $\mathcal{X}_{\mathcal{D}}$; that work also incurs an $n^{-1/2}$ term in the risk bound, but no discrepancy term. The work of MMR (09) considers Rényi divergences in the context of optimal aggregation in multitask under population risk, but does not study sample complexity.

The use of a non-metric discrepancy such as the Rényi divergence brings back an important point: two distributions might have asymmetric information on each other w.r.t. domain adaptation. Such insight was raised recently, and independently, in KM (18); HK (19); APMS (19), with various natural examples therein (see also Section 3.1). In particular, it motivates a more unified view of multitask and multisource domain adaptation, which are often treated separately. Namely, if the goal in multitask is to perform as well as possible on each task $P_t$ in our set, then, as we show, such asymmetry in information between tasks calls for a different aggregation of datasets for each target $P_t$: in other words, treating multitask as separate multisource problems, even if the optimal $h^*$ is the same across tasks.

In contrast, a frequent aim in multitask, dating back to Car (97), has been to ideally arrive at a single aggregation of task datasets that simultaneously benefits all tasks $P_t$. Following this spirit, many theoretical works on the subject are concerned with bounds on the average risk across tasks Bax (97); AZ (05); MPRP (13); PM (13); PL (14); YHC (13); MPRP (16), rather than bounding the supremum risk across tasks, i.e., treating multitask as separate multisource problems, as is of interest here. Some of these average bounds, e.g., MPRP (13, 16); PM (13), remove the dependence on discrepancy inherent in bounds of the form (1), but maintain a term of the form $n^{-\alpha}$; in other words, any bound on the supremum risk derived from such results would be in terms of a single dataset size $n$. The work of BHPQ (17) directly addresses the problem of bounding the supremum risk, but also incurs a term of the form $n^{-\alpha}$ in the risk bound.

In the context of multisource domain adaptation, it has been recognized in practice that datasets that are too far from the ideal target might hurt target performance and should be downweighted accordingly, a situation that has been termed negative transfer. These situations further motivate the need for adaptive procedures that can automatically identify good datasets. As far as theoretical insights, it is clear that negative transfer might happen, for instance under ERM, if the optimal classifiers $h^*$ are considerably different across tasks. Interestingly, even when the $h^*$'s are allowed to be arbitrarily close (but not equal), BDLLP (10) shows that, for $N=1$, the source dataset can be useless without labeled target data. We later derived minimax lower bounds for the case $N=1$, with or without labeled target data, in HK (19) for a range of situations including those considered in BDLLP (10). Such results however do not quite confirm negative transfer, as they allow the possibility that useless datasets might remain safe to include. For the multisource case $N \gg 1$, to the best of our knowledge, situations of negative transfer have only been described in adversarial settings with corrupted labels. For instance, the recent papers of Qia (18); MMM (19); KFAL (20) show limits of multitask under various adversarial corruptions of labels in datasets, while SZ (19) derives a positive result, i.e., rates (for Lipschitz loss) decreasing in both $N$ and $n$, up to excluded or downweighted datasets. The procedure of SZ (19) is however nonadaptive, as it requires known noise proportions.

KFAL (20) is of particular interest as they are concerned with adaptivity under label corruption. They show that, even if at most $N/2$ of the datasets are corrupted, no procedure can get a rate better than $1/n$. In contrast, in the stochastic setting considered here, there is always an adaptive procedure with significantly faster rates than $1/n$ if at most $N/2$ (in fact any fixed fraction) of the distributions $P_t$ are far from the target $\mathcal{D}$ (Theorem 9 of Appendix B). This dichotomy is due to the strength of the adversarial corruptions they consider, which in effect can flip the optimal $h^*$'s on corrupted sources. What we show here is that, even when $h^*$ is fixed across tasks, and datasets are sampled i.i.d. from each $P_t$, i.e., non-adversarially, no algorithm can achieve a rate better than $1/\sqrt{n}$, while a non-adaptive oracle procedure can (see Theorem 5). In other words, some datasets are indeed unsafe for any multisource procedure even in nonadversarial settings, as they can force suboptimal choices w.r.t. a target, absent additional restrictions on the problem setup.

As discussed earlier, such favorable restrictions concern for instance situations where information is available on how sources rank in distance to a target. In particular, the case $n=1$ intersects with the so-called distribution drift setting, where each distribution $P_t$, $t \in [N]$, has bounded discrepancy $\sup_{t\geq 1} \text{dist}(P_t, P_{t+1})$ w.r.t. the next distribution $P_{t+1}$ as time $t$ varies. While the notion of discrepancy is typically taken to be total variation BDBM (89); Bar (92); BL (97) or related notions MM (12); ZL (19); HY (19), both our upper and lower bounds, specialized to the case $n=1$, provide a new perspective on distribution drift under our distinct parametrization of how the $P_t$'s relate to each other. For instance, our results on multisource under ranking (Theorem 4) imply new rates for distribution drift, when all $h^*$'s are the same across time, with excess error at time $N$ going to 0 with $N$, even in situations where $\sup_{t\geq 1} \text{dist}(P_t, P_{t+1}) > 0$ in total variation. Such consistency in $N$ is unavailable in prior work on distribution drift.

Finally, we note that there have been many recent theoretical efforts towards other formalisms of relations between distributions in a multitask or multisource scenario, with particular emphasis on distributions sharing common latent substructures, both as applied to classification MBS (13); MB (17); AKK+ (19), and to regression settings JSRR (10); LPVDG+ (11); NW (11); DHK+ (20); TJJ (20). The present work does not address such settings.

Paper Outline

We start with setup and definitions in Sections 2 and 3. This is followed by a technical overview of results, along with discussions of the analysis and novel proof techniques, in Section 4. Minimax lower and upper bounds are derived in Sections 5 and 6. Constrained regimes allowing partial adaptivity are discussed in Section 7. This is followed by impossibility theorems for adaptivity in Section 8.

2 Basic Classification Concepts

We consider a classification setting $X \mapsto Y$, where $X, Y$ are drawn from some space $\mathcal{X} \times \mathcal{Y}$, with $\mathcal{Y} \doteq \{-1, 1\}$. We focus on proper learning where, given data, a learner is to return a classifier $h: \mathcal{X} \mapsto \mathcal{Y}$ from some fixed class $\mathcal{H} \doteq \{h\}$.

Assumption 1 (Bounded VC).

Throughout we will let $\mathcal{H} \subset 2^{\mathcal{X}}$ denote a hypothesis class of finite VC dimension $d_{\mathcal{H}}$, which we consider fixed in all subsequent discussions. To focus on nontrivial cases, we assume $|\mathcal{H}| \geq 3$ throughout.

We note that our algorithmic techniques and analysis extend to more general $\mathcal{H}$ through Rademacher complexity or empirical covering numbers. We focus on VC classes for simplicity, and to allow simple expressions of minimax rates.

The performance of any classifier $h$ will be captured through the 0-1 risk and excess risk as defined below.

Definition 1.

Let $R_P(h) \doteq P_{X,Y}(h(X) \neq Y)$ denote the risk of any $h: \mathcal{X} \mapsto \mathcal{Y}$ under a distribution $P = P_{X,Y}$. The excess risk of $h$ over any $h' \in \mathcal{H}$ is then defined as $\mathcal{E}_P(h; h') \doteq R_P(h) - R_P(h')$, while for the excess risk over the best in class we simply write $\mathcal{E}_P(h) \doteq R_P(h) - \inf_{h' \in \mathcal{H}} R_P(h')$. We let $h^*_P$ denote any element of $\operatorname{argmin}_{h \in \mathcal{H}} R_P(h)$ (which we will assume exists); if $P$ is clear from context we might just write $h^*$ for a minimizer (a.k.a. best in class). Also define the pseudo-distance $P(h \neq h') \doteq P_X(h(X) \neq h'(X))$.

Given a finite dataset $S$ of $(x,y)$ pairs in $\mathcal{X} \times \mathcal{Y}$, we let $\hat{R}_S(h) \doteq \frac{1}{|S|}\sum_{(x,y) \in S} \mathbb{1}\!\left\{h(x) \neq y\right\}$ denote the empirical risk of $h$ under $S$; if $S = \emptyset$, define $\hat{R}_S(h) \doteq 0$. The excess empirical risk over any $h'$ is $\hat{\mathcal{E}}_S(h; h') \doteq \hat{R}_S(h) - \hat{R}_S(h')$. Also define the empirical pseudo-distance $\hat{\mathbb{P}}_S(h \neq h') \doteq \frac{1}{|S|}\sum_{(x,y) \in S} \mathbb{1}\!\left\{h(x) \neq h'(x)\right\}$.
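For concreteness, the following minimal Python helpers (our own sketch, not part of the paper) mirror the empirical quantities just defined, for classifiers represented as vectorized functions $h: x \mapsto \{-1, +1\}$.

```python
# Minimal helpers (ours) mirroring the empirical quantities defined above, for a dataset S
# given as arrays (x, y) and classifiers h represented as vectorized functions x -> {-1, +1}.
import numpy as np

def emp_risk(h, x, y):                      # \hat{R}_S(h); equals 0 on an empty dataset
    return 0.0 if len(x) == 0 else float(np.mean(h(x) != y))

def excess_emp_risk(h, h_prime, x, y):      # \hat{E}_S(h; h')
    return emp_risk(h, x, y) - emp_risk(h_prime, x, y)

def emp_pseudo_dist(h, h_prime, x):         # \hat{P}_S(h != h')
    return 0.0 if len(x) == 0 else float(np.mean(h(x) != h_prime(x)))
```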

The following condition is a classical way to capture a continuum from easy to hard classification. In particular, in vanilla classification, the best excess risk achievable by a classifier trained on data of size $m$ can be shown to be of order $m^{-1/(2-\beta)}$, i.e., it interpolates between $1/m$ and $1/\sqrt{m}$, as controlled by the parameter $\beta \in [0,1]$ defined below.

Definition 2.

Let $\beta \in [0,1]$ and $C_\beta \geq 2$. A distribution $P$ is said to satisfy a Bernstein class condition with parameters $(C_\beta, \beta)$ if the following holds:

\[\forall h \in \mathcal{H}, \quad P(h \neq h^*_P) \leq C_\beta \cdot \mathcal{E}_P^\beta(h). \tag{2}\]

Notice that the above always holds with at least $\beta = 0$.

The condition can be viewed as quantifying the amount of noise in $Y$, since we always have $P_X(h \neq h^*_P) \geq \mathcal{E}_P(h)$, with equality when $Y = h^*(X)$. In particular, it captures the so-called Tsybakov noise margin condition when the Bayes classifier is in $\mathcal{H}$: that is, letting $\eta(x) \doteq \mathbb{P}(Y = 1 \mid X = x)$, the margin condition

\[P_X\left(x : |\eta(x) - 1/2| \leq \tau\right) \leq C\tau^\kappa, \quad \forall \tau > 0, \text{ and some } \kappa \geq 0,\]

implies that $P$ satisfies (2) with $\beta = \kappa/(1+\kappa)$ for some $C_\beta$ Tsy (04).
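As a sanity check of this relation, the following small numerical sketch (our own illustration; the particular regression function and uniform marginal are assumptions made for the example) verifies that a margin exponent $\kappa$ yields a finite Bernstein constant at $\beta = \kappa/(1+\kappa)$ in a one-dimensional threshold model.

```python
# Illustrative check (ours) of the margin -> Bernstein-class relation for one-sided thresholds:
# X ~ Uniform[0,1], h* thresholds at 0, eta(x) = P(Y=1|X=x) = 1/2 + x^(1/kappa)/2, so that
# P_X(|eta - 1/2| <= tau) = (2*tau)^kappa, i.e. margin exponent kappa.
import numpy as np

kappa = 2.0
beta = kappa / (1 + kappa)                     # Bernstein exponent predicted by Tsy(04)
taus = np.linspace(0.01, 1.0, 200)             # classifiers h_tau disagreeing with h* on [0, tau)

dist = taus                                    # P_X(h_tau != h*) = tau
excess = taus ** (1 + 1/kappa) / (1 + 1/kappa) # E_P(h_tau) = int_0^tau |2*eta(x) - 1| dx

# The Bernstein condition P(h != h*) <= C_beta * E_P(h)^beta should hold with a finite C_beta:
print(f"beta = {beta:.2f}, smallest valid C_beta over this family ~ {np.max(dist / excess**beta):.2f}")
```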

The importance of such margin considerations becomes evident when we consider whether a multitask learner is able to automatically adapt optimally to unknown relations between distributions; interestingly, hardness of adaptation has to do not only with relations (or discrepancies) between distributions, but also with $\beta$.

3 Multitask Setting

We consider a setting where multiple datasets are drawn independently from (related) distributions $P_t$, $t \in [N+1]$, with the aim of returning hypotheses $h$ from $\mathcal{H}$ with low excess error $\mathcal{E}_{P_t}(h)$ w.r.t. any of these distributions. W.l.o.g. we fix attention to a single target distribution $\mathcal{D} \doteq P_{N+1}$, and therefore reduce the setting to that of multisource (the term multisource is often used for situations where the learner has no access to target data; this is not required to be the case here, although it is handled simply by setting the target sample size to 0 in our bounds).

Definition 3.

A multisource learner $\hat{h}$ is any function $\prod_{t \in [N+1]} (\mathcal{X} \times \mathcal{Y})^{n_t} \mapsto \mathcal{H}$ (for some sample sizes $n_t \in \mathbb{N}$); i.e., given a multisample $Z = \{Z_t\}_{t \in [N+1]}$, $|Z_t| = n_t$, it returns a hypothesis in $\mathcal{H}$. In an abuse of notation, we will often conflate the learner $\hat{h}$ with the hypothesis it returns.

A multitask setting can then be viewed as one where $N+1$ multisource learners $\hat{h}_t$ (targeting each $P_t$) are trained on the same multisample $Z$. Much of the rest of the paper will therefore discuss learners for a given target $\mathcal{D} \doteq P_{N+1}$.

3.1 Relating Sources to Target

Clearly, how well one can do in multitask depends on how the distributions relate to each other. The following parametrization will serve to capture the relation between source and target distributions.

Definition 4.

Let $\rho > 0$ (up to $\rho = \infty$), and $C_\rho \geq 2$. We say that a distribution $P$ has transfer exponent $(C_\rho, \rho)$ w.r.t. a distribution $\mathcal{D}$ (under $\mathcal{H}$) if

\[\forall h \in \mathcal{H}, \quad \mathcal{E}_{\mathcal{D}}(h) \leq C_\rho \cdot \mathcal{E}_P^{1/\rho}(h).\]

Notice that the above always holds with at least $\rho = \infty$.

We have shown in earlier work HK (19) that the transfer exponent manages to tightly capture the minimax rates of transfer in various situations with a single source $P$ and target $\mathcal{D}$ ($N=1$), including ones where the best hypotheses $h^*_P, h^*_{\mathcal{D}}$ are different for source and target. In the case of main interest here, where $P$ and $\mathcal{D}$ share the same best in class $h^* \doteq h^*_P \doteq h^*_{\mathcal{D}}$, for respective data sizes $n_P, n_{\mathcal{D}}$, the best possible excess risk $\mathcal{E}_{\mathcal{D}}$ is of order $(n_P^{1/\rho} + n_{\mathcal{D}})^{-1/(2-\beta)}$, which is tight for any values of $\rho$ and $\beta$ (a Bernstein class parameter on $P$ and $\mathcal{D}$). In other words, $\rho$ captures an effective data size $n_P^{1/\rho}$ contributed by the source to the target: this decreases as $\rho \to \infty$, delineating a continuum from easy to hard transfer. Interestingly, $\rho < 1$ reveals the fact that source data can be more useful than target data, for instance if the classification problem is easier under the source (e.g., $P$ has more mass at the decision boundary).
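To make the effective-sample-size interpretation concrete, here is a small numerical sketch (ours, with made-up sample sizes) of the $N=1$ rate $(n_P^{1/\rho} + n_{\mathcal{D}})^{-1/(2-\beta)}$ from HK (19), showing how $\rho$ discounts the source data.

```python
# Illustration (ours) of the effective source sample size n_P^(1/rho) in the N = 1 transfer rate.
n_P, n_D = 10_000, 100

for beta in (0.0, 1.0):
    for rho in (1.0, 2.0, 4.0, float("inf")):
        eff = 1.0 if rho == float("inf") else n_P ** (1.0 / rho)   # effective source contribution
        rate = (eff + n_D) ** (-1.0 / (2.0 - beta))
        print(f"beta={beta:.0f} rho={rho:>4}: effective source size ~ {eff:8.1f}, rate ~ {rate:.4f}")
```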

Altogether, the transfer exponent $\rho$ appears as a more optimistic measure of discrepancy between source and target, as it reveals the possibility of transfer, even at fast rates, in many situations where traditional measures are pessimistically large. To illustrate this, we recall some of the examples from HK (19). In all examples below, we assume for simplicity that $Y = h^*(X)$ for some $h^* \doteq h^*_P \doteq h^*_{\mathcal{D}}$.

Example 1. (Discrepancies $d_{\mathcal{A}}, d_{\mathcal{Y}}$ can be too large) Let $\mathcal{H}$ consist of one-sided thresholds on the line, and let $P_X \doteq \mathcal{U}[0,2]$ and $\mathcal{D}_X \doteq \mathcal{U}[0,1]$.

[Figure: one-sided thresholds, with $P_X = \mathcal{U}[0,2]$, $\mathcal{D}_X = \mathcal{U}[0,1]$, and $h^*$ thresholded at $1/2$]

Let $h^*$ be the threshold at $1/2$. We then see that for all $h_\tau$ thresholded at $\tau \in [0,1]$, $2P_X(h_\tau \neq h^*) = \mathcal{D}_X(h_\tau \neq h^*)$, while for $\tau > 1$, $P_X(h_\tau \neq h^*) = \frac{1}{2}(\tau - 1/2) \geq \frac{1}{2}\mathcal{D}_X(h_\tau \neq h^*) = \frac{1}{4}$. Thus, the transfer exponent is $\rho = 1$ with $C_\rho = 2$, so we have fast transfer at the same rate $1/n_P$ as if we were sampling from $\mathcal{D}$.

On the other hand, recall that the $d_{\mathcal{A}}$-divergence takes the form $d_{\mathcal{A}}(P, \mathcal{D}) \doteq \sup_{h \in \mathcal{H}} |P_X(h \neq h^*) - \mathcal{D}_X(h \neq h^*)|$, while the $\mathcal{Y}$-discrepancy takes the form $d_{\mathcal{Y}}(P, \mathcal{D}) \doteq \sup_{h \in \mathcal{H}} |\mathcal{E}_P(h) - \mathcal{E}_{\mathcal{D}}(h)|$ (note that these divergences are often defined w.r.t. every pair $h, h' \in \mathcal{H}$ rather than w.r.t. $h^*$ as here, which makes them smaller). The two coincide when $Y = h^*(X)$.

Now, take $h_\tau$ as the threshold at $\tau = 1$: this yields $d_{\mathcal{A}} = d_{\mathcal{Y}} = \frac{1}{4}$, which would wrongly imply that transfer is not feasible at a rate faster than $\frac{1}{4}$; we can in fact make this situation worse, i.e., let $d_{\mathcal{A}} = d_{\mathcal{Y}} \to \frac{1}{2}$, by letting $h^*$ correspond to a threshold close to 0. A first issue is that these divergences get large in large disagreement regions; this is somewhat mitigated by localization, i.e., defining these discrepancies w.r.t. $h$'s in a vicinity of $h^*$, but this does not quite resolve the issue, as discussed in earlier work HK (19).
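The quantities in this example are easy to verify numerically; the sketch below (ours; the sign convention for the thresholds is an arbitrary choice) recovers both the transfer-exponent constant and the value of $d_{\mathcal{A}} = d_{\mathcal{Y}}$.

```python
# Numerical check (ours) of Example 1: one-sided thresholds h_tau, P_X = Uniform[0,2],
# D_X = Uniform[0,1], and h* the threshold at 1/2 (labels Y = h*(X)).
import numpy as np

taus = np.linspace(0.0, 2.0, 2001)
h_star = 0.5

# disagreement mass between h_tau and h*: the interval between tau and 1/2, within each support
P_dis = np.abs(taus - h_star) / 2.0                 # Uniform[0, 2] has density 1/2
D_dis = np.abs(np.clip(taus, 0.0, 1.0) - h_star)    # Uniform[0, 1] has density 1

mask = P_dis > 0                                    # exclude tau = 1/2, where both vanish
print("smallest C_rho for rho = 1:", round(float(np.max(D_dis[mask] / P_dis[mask])), 2))  # ~ 2.0
print("d_A = d_Y (w.r.t. h*):", round(float(np.max(np.abs(P_dis - D_dis))), 2))           # ~ 0.25
```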

Example 2. (Minimum $\rho$, and the inherent asymmetry of transfer)

[Figure: uniform density $f_{\mathcal{D}}$ and density $f_P(\tau) \propto \tau^{\rho-1}$ near the optimal threshold at 0]

Suppose $\mathcal{H}$ is the class of one-sided thresholds on the line, and $h^*$ is the threshold at 0. The marginal $\mathcal{D}_X$ has uniform density $f_{\mathcal{D}}$ (on an interval containing 0), while, for some $\rho \geq 1$, $P_X$ has density $f_P(\tau) \propto \tau^{\rho-1}$ on $\tau > 0$ (and is uniform on the rest of the support of $\mathcal{D}$, not shown). Consider any $h_\tau$ with threshold $\tau > 0$: we have $P_X(h_\tau \neq h^*) = \int_0^\tau f_P \propto \tau^\rho$, while $\mathcal{D}_X(h_\tau \neq h^*) \propto \tau$. Notice that for any fixed $\epsilon > 0$, $\lim_{\tau > 0,\, \tau \to 0} \frac{\mathcal{D}_X(h_\tau \neq h^*)^{\rho - \epsilon}}{P_X(h_\tau \neq h^*)} = \lim_{\tau > 0,\, \tau \to 0} C\,\frac{\tau^{\rho - \epsilon}}{\tau^\rho} = \infty$.

We therefore see that $\rho$ is the smallest possible transfer exponent. Interestingly, now consider transferring instead from $\mathcal{D}$ to $P$: we would have $\rho(\mathcal{D} \to P) = 1 \leq \rho \doteq \rho(P \to \mathcal{D})$; in other words, there are natural situations where it is easier to transfer from $\mathcal{D}$ to $P$ than from $P$ to $\mathcal{D}$, as is the case here where $P$ gives relatively little mass to the decision boundary. This is not captured by symmetric notions of distance, e.g., metrics or semi-metrics such as $d_{\mathcal{A}}$, $d_{\mathcal{Y}}$, MMD, TV, or Wasserstein.

Finally, note that the above examples can be extended to more general hypothesis classes $\mathcal{H}$, as the examples merely play on how fast $f_P$ decreases w.r.t. $f_{\mathcal{D}}$ in regions of space.
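The asymmetry in Example 2 can also be checked directly against Definition 4, as in the following sketch (ours; the particular supports and the value $\rho = 3$ are assumptions made for the illustration).

```python
# Numerical sketch (ours) of the asymmetry in Example 2 with rho = 3: near the boundary,
# E_D(h_tau) ~ tau while E_P(h_tau) ~ tau^rho, so P -> D needs exponent rho in Definition 4,
# whereas D -> P already holds with exponent 1.
import numpy as np

rho = 3.0
taus = np.logspace(-4, 0, 50)     # thresholds h_tau approaching h* at 0
E_D = taus / 2.0                  # excess risk under D (uniform on [-1, 1], Y = h*(X))
E_P = taus ** rho                 # excess risk under P (density ~ rho * s^(rho-1) on (0, 1])

for r in (1.0, 2.0, rho):         # candidate transfer exponents for P -> D
    print(f"P -> D with exponent {r}: smallest valid C_rho ~ {np.max(E_D / E_P**(1.0/r)):.3g}")
print(f"D -> P with exponent 1.0: smallest valid C_rho ~ {np.max(E_P / E_D):.3g}")
```

Only the exponent $\rho$ keeps the required constant bounded in the $P \to \mathcal{D}$ direction, while exponent 1 already suffices for $\mathcal{D} \to P$.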

3.2 Multisource Class

We are now ready to formalize the main class of distributional settings considered in this work.

Definition 5 (Multisource class).

We consider classes $\mathcal{M} = \mathcal{M}\left(C_\rho, \{\rho_t\}_{t \in [N]}, \{n_t\}_{t \in [N+1]}, C_\beta, \beta\right)$ of product distributions of the form $\Pi = \prod_{t \in [N]} P_t^{n_t} \times \mathcal{D}^{n_{\mathcal{D}}}$, with $n_t \geq 1$ and $n_{\mathcal{D}} \doteq n_{N+1} \geq 0$, satisfying:

  1. (A1). There exists $h^* \in \mathcal{H}$ such that $\forall t \in [N+1]$, $h^* \in \operatorname{argmin}_{h \in \mathcal{H}} \mathcal{E}_{P_t}(h)$,

  2. (A2). Each source $P_t$ has transfer exponent $(C_\rho, \rho_t)$ w.r.t. the target $\mathcal{D}$,

  3. (A3). All sources $P_t$ and the target $\mathcal{D}$ admit a Bernstein class condition with parameters $(C_\beta, \beta)$.

For notational simplicity, we will often let $P_{N+1} \doteq \mathcal{D}$. Also, although it is not a parameter of the class, we will also refer to $\rho_{N+1} = 1$, as $\mathcal{D}$ always has transfer exponent $(C_\rho, 1)$ w.r.t. itself.

Remark 1 ($h^*$ is almost unique).

Note that, for $\beta > 0$, the Bernstein class condition implies that $h^*$ above satisfies $P_t(h^* \neq h^*_t) = 0$ for any other $h^*_t \in \operatorname{argmin}_{h \in \mathcal{H}} \mathcal{E}_{P_t}(h)$. Furthermore, for any $t \in [N]$ such that $\rho_t < \infty$, we also have $\mathcal{E}_{\mathcal{D}}(h^*_t) = 0$, implying $\mathcal{D}(h^* \neq h^*_t) = 0$ for $\beta > 0$.

3.3 Additional Notation

Implicit target $\mathcal{D} = \mathcal{D}(\Pi)$. Every time we write $\Pi$, we will implicitly mean $\Pi = \prod_{t \in [N]} P_t^{n_t} \times \mathcal{D}^{n_{N+1}}$, so that in any context where a $\Pi$ is introduced we may write, for instance, $\mathcal{E}_{\mathcal{D}}(h)$, which then refers to the $\mathcal{D}$ distribution in $\Pi$.

Indices and Order Statistics. For any $Z = \{Z_t\}_{t \in [N+1]} \sim \Pi$ and indices $I \subset [N+1]$, we let $Z^I \doteq \bigcup_{s \in I} Z_s$.

We will often be interested in the order statistics $\rho_{(1)} \leq \rho_{(2)} \leq \ldots \leq \rho_{(N+1)}$ of the $\rho_t$, $t \in [N+1]$, values, in which case $P_{(t)}, Z_{(t)}, n_{(t)}$ denote the distribution, sample, and sample size at index $(t)$. We then let $Z^{(t)} \doteq Z^{\{(1), \ldots, (t)\}}$.

Average transfer exponent. For any $t \in [N+1]$, define $\bar{\rho}_t \doteq \sum_{s \in [t]} \alpha_{(s)} \cdot \rho_{(s)}$, where $\alpha_{(s)} \doteq \frac{n_{(s)}}{\sum_{r \in [t]} n_{(r)}}$.

Aggregate ERM. For any $Z^I$, $I \subset [N+1]$, we let $\hat{h}_{Z^I} \doteq \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_{Z^I}(h)$, and correspondingly we also define $\hat{h}_{Z^{(t)}}$, $t \in [N+1]$, as the ERM over $Z^{(t)}$. When $t = N+1$ we simply write $\hat{h}_Z$ for $\hat{h}_{Z^{(N+1)}}$.

Min and Max. We often use the short notations $a \land b \doteq \min\{a, b\}$ and $a \lor b \doteq \max\{a, b\}$.

Positive Logarithm. For any $x \geq 0$, define $\log(x) \doteq \max\{\ln(x), 1\}$.

$\boldsymbol{1/0}$ Convention. We adopt the convention that $1/0 = \infty$.

Asymptotic Order. We often write $a \lesssim b$ or $a \asymp b$ in the statements of key results to indicate inequality, respectively equality, up to constants and logarithmic factors. The precise constants and logarithmic factors are always presented in supporting results.

4 Results Overview

We start by investigating what the best possible transfer rates are for multisource classes $\mathcal{M}$, and then investigate the extent to which these rates are attainable by adaptive procedures, i.e., procedures with little access to class information such as the transfer exponents $\rho_t$ from sources to target.

From this point on, we let $\mathcal{M} = \mathcal{M}\left(C_\rho, \{\rho_t\}_{t \in [N]}, \{n_t\}_{t \in [N+1]}, C_\beta, \beta\right)$ denote any multisource class, with any admissible values of the relevant parameters, unless these parameters are specifically constrained in a result's statement.

4.1 Minimax Rates

Theorem 1 (Minimax Rates).

Let $\mathcal{M}$ denote any multisource class where $\rho_t \geq 1$, $\forall t \in [N]$. Let $\hat{h}$ denote any multisource learner with knowledge of $\mathcal{M}$. We have:

\[\inf_{\hat{h}}\, \sup_{\Pi \in \mathcal{M}} \mathbb{E}_{\Pi}\!\left[\mathcal{E}_{\mathcal{D}}\big(\hat{h}\big)\right] \;\asymp\; \min_{t \in [N+1]} \left(\sum_{s=1}^{t} n_{(s)}\right)^{-1/(2-\beta)\bar{\rho}_t}.\]
Remark 2.

We remark that the proofs of the above result (see Theorems 6 and 7 of Sections 5 and 6 for matching lower and upper bounds) imply that, in fact, we could replace $\bar{\rho}_t$ with simply $\rho_{(t)}$ and the result would still be true. In other words, although intuitively $\bar{\rho}_t$ might be much smaller than $\rho_{(t)}$ for any fixed $t$, the minimum values over $t \in [N+1]$ can only differ up to logarithmic terms.

We also note that the constraint $\rho_t \geq 1$ is only needed for the lower bound (Theorem 6), whereas all of our upper bounds (Theorems 7, 3, and 4) hold for any values $\rho_t > 0$. Moreover, there exist classes $\mathcal{H}$ where the lower bound also holds for all $\rho_t > 0$, so that the form of the bound is generally not improvable. The case $\rho_t \in (0,1)$ represents a kind of super transfer, where the source samples are actually more informative than target samples.

It follows from Theorem 1 that, despite there being $2^{N+1}$ possible ways of aggregating datasets (or more if we consider general weightings of datasets), it is sufficient to search over $N+1$ possible aggregations, defined by the ranking $\rho_{(1)} \leq \rho_{(2)} \leq \ldots \leq \rho_{(N+1)}$, to nearly achieve the minimax rate.
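As an illustration of this reduction, the following sketch (ours, with made-up parameter values) evaluates the oracle rate of Theorem 1 over the $N+1$ prefix aggregations induced by the ranking.

```python
# Sketch (ours) of the oracle rate in Theorem 1: given transfer exponents rho_t and per-task
# sample sizes n_t, evaluate the bound over the N+1 "prefix" aggregations defined by the ranking
# rho_(1) <= ... <= rho_(N+1), instead of over all 2^(N+1) subsets of datasets.
import numpy as np

beta = 0.5
rho = np.array([1.0, 1.0, 2.0, 4.0, 16.0, 1.0])     # last entry is the target D itself (rho = 1)
n   = np.array([200, 200, 200, 200, 200,  50])

order   = np.argsort(rho, kind="stable")            # ranking rho_(1) <= ... <= rho_(N+1)
n_cum   = np.cumsum(n[order])                       # sum_{s <= t} n_(s)
rho_bar = np.cumsum(rho[order] * n[order]) / n_cum  # average transfer exponents rho_bar_t

rates = n_cum ** (-1.0 / ((2.0 - beta) * rho_bar))
t_star = int(np.argmin(rates))
print("prefix rates:", np.round(rates, 4))
print(f"best prefix keeps the {t_star + 1} closest tasks; oracle rate ~ {rates[t_star]:.4f}")
```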

The lower bound (Theorem 6) relies on constructing a subset (of $\mathcal{M}$) of product distributions $\Pi_h$, $h \in \mathcal{H}$, which are mutually close in KL-divergence, but far under the pseudo-metric $\mathcal{D}_X(h \neq h')$. For illustration, consider the case $\beta = 1$: for any $h, h'$ that are sufficiently far under $\mathcal{D}_X$ so that $\mathcal{E}_{\mathcal{D}}(h'; h) \gtrsim \epsilon$, the lower-bound construction is such that

\[{\rm KL}(\Pi_h \| \Pi_{h'}) \lesssim \sum_{t=1}^{N+1} n_t\, \epsilon^{\rho_t} \lesssim 1, \tag{3}\]

ensuring that $\Pi_h, \Pi_{h'}$ are hard to distinguish from a finite sample. Thus, the largest $\epsilon$ satisfying the second inequality above, say $\epsilon_{N+1}$, is a minimax lower bound. On the other hand, the upper bound (Theorem 7) relies on a uniform Bernstein's inequality that holds for non-identically distributed r.v.'s (Lemma 1); in particular, by accounting for the variance in the risk, such a Bernstein-type inequality allows us to extend to the multisource setting the usual fixed-point arguments that capture the effect of the noise parameter $\beta$. Now, again for illustration, let $\beta = 1$, and consider the ERM $\hat{h} \doteq \hat{h}_Z$ combining all $N+1$ datasets. Let $\mathcal{E}_\alpha \doteq \sum_{t=1}^{N+1} \alpha_t\, \mathcal{E}_t(\hat{h})$, with $\alpha_t \doteq n_t/\left(\sum_s n_s\right)$; then the concentration arguments described above ensure that $\mathcal{E}_\alpha(\hat{h}) \lesssim 1/\left(\sum_t n_t\right)$. Now notice that, by definition of $\rho_t$, $\mathcal{E}_\alpha(\hat{h}) \gtrsim \sum_t \alpha_t\, \mathcal{E}_{\mathcal{D}}^{\rho_t}(\hat{h})$; in other words, $\mathcal{E}_{\mathcal{D}}(\hat{h})$ satisfies the second inequality in (3) (in the role of $\epsilon$), and must therefore be at most of order $\epsilon_{N+1}$. This establishes the tightness of $\epsilon_{N+1}$ as a minimax rate, all that is left being to elucidate its exact form in terms of the sample sizes. Similar, but somewhat more involved, arguments apply for general $\beta$, though in that case we find that pooling all of the data does not suffice to achieve the minimax rate.
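For the $\beta = 1$ illustration above, the scale $\epsilon_{N+1}$ can be computed numerically as a fixed point; the following sketch (ours, with made-up $\rho_t$ and $n_t$ values) finds the largest $\epsilon$ with $\sum_t n_t \epsilon^{\rho_t} \leq 1$ by bisection.

```python
# Sketch (ours) of the quantity epsilon_{N+1} in the proof outline (beta = 1 case): the largest
# epsilon with sum_t n_t * epsilon^{rho_t} <= 1, found by bisection since the sum is increasing.
import numpy as np

rho = np.array([1.0, 1.0, 2.0, 4.0, 1.0])   # made-up transfer exponents (last one: the target)
n   = np.array([500, 500, 500, 500,  20])

def total(eps):
    return float(np.sum(n * eps ** rho))

lo, hi = 0.0, 1.0
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if total(mid) <= 1.0 else (lo, mid)
print(f"epsilon_(N+1) ~ {lo:.5f}")          # the minimax-rate scale in this illustration
```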

Notice that the rates of Theorem 1 immediately imply minimax rates for multitask under assumption (A1) of sharing the same $h^*$ (with appropriate $\rho_t$'s w.r.t. any target $\mathcal{D} \doteq P_s$, $s \in [N+1]$). It is then natural to ask whether the minimax rate for various targets $P_s$ might be achieved by the same algorithm, i.e., the same aggregation of tasks, in light of a common approach in the literature (and practice) of optimizing for a single classifier that does well simultaneously on all tasks. We show that even when all $h^*$'s are the same, the optimal aggregation might differ across targets $P_s$, simply due to the inherent asymmetry of transfer. We have the following theorem, proved in Appendix C.

Theorem 2 (Target affects aggregation).

Set $N = 1$. There exist $P, \mathcal{D}$ satisfying a Bernstein class condition with parameters $(C_\beta, \beta)$ for some $0 \leq \beta < 1$, and sharing the same $h^* = h^*_P = h^*_{\mathcal{D}}$, such that the following holds. Consider a multisample $Z = \{Z_P, Z_{\mathcal{D}}\}$ consisting of independent datasets $Z_P \sim P^{n_P}$, $Z_{\mathcal{D}} \sim \mathcal{D}^{n_{\mathcal{D}}}$.

Let $\hat{h}_Z, \hat{h}_{Z_P}, \hat{h}_{Z_{\mathcal{D}}}$ denote the ERMs over $Z$, $Z_P$, $Z_{\mathcal{D}}$ respectively. Suppose $1 \leq n_{\mathcal{D}}^2 \leq \frac{1}{8} n_P^{(2-2\beta)/(2-\beta)}$. Then

\begin{align*}
\mathbb{E}\left[\mathcal{E}_{\mathcal{D}}(\hat{h}_{Z_P})\right] \land \mathbb{E}\left[\mathcal{E}_{\mathcal{D}}(\hat{h}_Z)\right] &\geq \frac{1}{4}, \quad \text{while } \mathbb{E}\left[\mathcal{E}_{\mathcal{D}}(\hat{h}_{Z_{\mathcal{D}}})\right] \lesssim n_{\mathcal{D}}^{-1/(2-\beta)};\\
\text{however, } \mathbb{E}\left[\mathcal{E}_P(\hat{h}_{Z_P})\right] \lor \mathbb{E}\left[\mathcal{E}_P(\hat{h}_Z)\right] &\lesssim n_P^{-1/(2-\beta)}.
\end{align*}
Remark 3 (Suboptimality of pooling).

A common practice is to pool all datasets together and return an ERM, as in $\hat{h}_Z$. We see from the above result that this might be optimal for some targets while suboptimal for other targets. However, pooling is near optimal (simultaneously for all targets $P_s$) whenever $\beta = 1$, as discussed in Section 4.2 below.

4.2 Some Regimes of (Semi) Adaptivity

It is natural to ask whether the above minimax rates for $\mathcal{M}$ are attainable by adaptive procedures, i.e., reasonable procedures with no access to prior information on (the parameters of) $\mathcal{M}$, but only access to a multisample $Z \sim \Pi$ for some unknown $\Pi \in \mathcal{M}$. As we will see in Section 4.3, this is not possible in general, i.e., outside of the regimes considered here. Our work however leaves open the existence of more refined regimes of adaptivity.

• Low Noise $\beta = 1$. To start, when the Bernstein class parameter $\beta = 1$ (which would often be a priori unknown to the learner), pooling all datasets is near minimax optimal, as stated in the next result. This corresponds to low noise situations, e.g., so-called Massart's noise (where $\mathbb{P}(Y \neq h^*(X) \mid X) \leq (1/2) - \tau$ for some $\tau > 0$), including the realizable case (where $Y = h^*(X)$ deterministically). Note that the $\rho_t$'s are nonetheless nontrivial (see the examples of Section 3.1); however, the distributions $P_t$ are then sufficiently related that their datasets are mutually valuable.

Theorem 3 (Pooling under low noise).

Suppose $\beta = 1$. Consider any $\Pi \in \mathcal{M}$ and let $\hat{h}_Z$ denote the ERM over $Z \sim \Pi$. Let $\delta \in (0,1)$. There exists a universal constant $c > 0$ such that, with probability at least $1 - \delta$,

\[\mathcal{E}_{\mathcal{D}}(\hat{h}_Z) \leq \min_{t \in [N+1]} C_\rho \left(c\, C_\beta\, \frac{d_{\mathcal{H}} \log\!\left(\frac{1}{d_{\mathcal{H}}}\sum_{s=1}^{N+1} n_s\right) + \log(1/\delta)}{\sum_{s=1}^{t} n_{(s)}}\right)^{1/\bar{\rho}_t}.\]

The theorem is proven in Section 7.1. We also state a general bound on $\mathcal{E}_{\mathcal{D}}(\hat{h}_Z)$, holding for any $\beta$, in Corollary 2 of Appendix B; the implied rates are not always optimal, though interestingly they are near-optimal in the case that $\sum_{t=1}^{t^*} n_{(t)} \propto \sum_{t=1}^N n_t$, where $t^*$ is the minimizer of the r.h.s. in Theorem 1. We note that, unlike in the oracle upper bounds of Theorem 7, the logarithmic term in the above result is in terms of the entire sample size, rather than the sample size at which the minimum over $t \in [N+1]$ is attained.
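As a toy illustration of this regime (ours, not an experiment from the paper), the simulation below pools datasets from several sources that share the optimal threshold but have different marginals, with realizable labels ($\beta = 1$); the pooled ERM's distance to $h^*$ shrinks as more source data is pooled.

```python
# Toy simulation (ours) of pooling under low noise: several sources with different marginals but
# a common optimal threshold h* = 0.5 and realizable labels (beta = 1). The pooled ERM over
# one-sided thresholds keeps improving as more source datasets are added.
import numpy as np

rng = np.random.default_rng(0)
h_star = 0.5

def sample(power, n):                     # marginal with density ~ power * x^(power-1) on [0, 1]
    x = rng.random(n) ** (1.0 / power)
    return x, np.where(x >= h_star, 1, -1)

def pooled_erm(xs, ys):                   # ERM over thresholds: the split with fewest errors
    cand = np.unique(np.concatenate(([0.0, 1.0], xs)))
    errs = [np.mean(np.where(xs >= c, 1, -1) != ys) for c in cand]
    return cand[int(np.argmin(errs))]

xs, ys = np.empty(0), np.empty(0)
for power in (1, 2, 4, 8):                # sources put less and less mass near the boundary
    x, y = sample(power, 200)
    xs, ys = np.concatenate([xs, x]), np.concatenate([ys, y])
    t_hat = pooled_erm(xs, ys)
    print(f"pooled {len(xs):4d} points: |t_hat - h*| = {abs(t_hat - h_star):.4f}")
```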

• Available ranking information. Now assume that, on top of $Z \sim \Pi \in \mathcal{M}$, we have access to the ranking information $\rho_{(1)} \leq \rho_{(2)} \leq \ldots \leq \rho_{(N+1)}$, but no additional information on $\mathcal{M}$. Namely, $C_\beta$, $\beta$, and the actual values of $\rho_t$, $t \in [N]$, are unknown to the learner. We show that, in this case, a simple rank-based procedure achieves the minimax rates of Theorem 1, without knowledge of the additional distributional parameters.

Define $\varepsilon(m, \delta) \doteq \frac{d_{\mathcal{H}}}{m}\log\!\left(\frac{m}{d_{\mathcal{H}}}\right) + \frac{1}{m}\log\!\left(\frac{1}{\delta}\right)$, for $\delta \in (0,1)$ and $m \in \mathbb{N}$. Let $\mathbf{n}_t \doteq \sum_{s=1}^{t} n_{(s)}$ for each $t \in [N+1]$, and recall that $\hat{h}_{Z^{(t)}}$ denotes the ERM over the aggregate sample $Z^{(t)}$.

Rank-based Procedure $\hat{h}$: let $\delta_t \doteq \delta/(6t^2)$, and let $C_0$ be as in Lemma 1. For each $t \in [N+1]$, define:

\[\mathcal{H}_{(t)} \doteq \left\{h \in \mathcal{H} : \hat{\mathcal{E}}_{Z^{(t)}}(h; \hat{h}_{Z^{(t)}}) \leq C_0\sqrt{\hat{\mathbb{P}}_{Z^{(t)}}(h \neq \hat{h}_{Z^{(t)}})\, \varepsilon(\mathbf{n}_t, \delta_t)} + C_0\, \varepsilon(\mathbf{n}_t, \delta_t)\right\}. \tag{4}\]

Return any $h$ in $\bigcap_{s=1}^{N+1} \mathcal{H}_{(s)}$ if this intersection is not empty; otherwise return any $h \in \mathcal{H}$.
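Below is an implementation sketch of this procedure (our own, for a finite class of one-sided thresholds); the constant $C_0$ and the value of $d_{\mathcal{H}}$ are placeholders (the paper's $C_0$ comes from its Lemma 1), and the datasets are assumed to be given already ranked from closest to farthest from the target.

```python
# Implementation sketch (ours) of the rank-based procedure for a finite class of one-sided
# thresholds. C_0 and d_H are placeholder values; datasets are assumed pre-ranked by rho.
import numpy as np

C_0, d_H, delta = 2.0, 1.0, 0.05
thresholds = np.linspace(0.0, 1.0, 101)                  # finite hypothesis class H

def predict(th, x):                                      # h_th(x) = +1 iff x >= th
    return np.where(x >= th, 1, -1)

def rank_based(datasets):                                # datasets: list of (x, y) arrays, ranked
    candidates = np.ones(len(thresholds), dtype=bool)    # tracks the intersection of the H_(t)'s
    xs, ys = np.empty(0), np.empty(0)
    for t, (x, y) in enumerate(datasets, start=1):
        xs, ys = np.concatenate([xs, x]), np.concatenate([ys, y])
        m, delta_t = len(xs), delta / (6 * t * t)
        eps = (d_H / m) * max(np.log(m / d_H), 1.0) + np.log(1 / delta_t) / m   # epsilon(n_t, delta_t)
        risks = np.array([np.mean(predict(th, xs) != ys) for th in thresholds])
        h_t = thresholds[int(np.argmin(risks))]          # ERM over the t closest datasets
        dis = np.array([np.mean(predict(th, xs) != predict(h_t, xs)) for th in thresholds])
        in_H_t = risks - risks.min() <= C_0 * np.sqrt(dis * eps) + C_0 * eps    # the set H_(t) of (4)
        candidates &= in_H_t
    keep = thresholds[candidates] if candidates.any() else thresholds
    return keep[0]                                       # any element of the intersection
```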

We have the following result for this learning algorithm.

Theorem 4 (Semi-adaptive method under ranking information).

Let $\delta \in (0,1)$. Let $\hat{h}$ denote the above procedure, trained on $Z \sim \Pi$, for any $\Pi \in \mathcal{M}$. There exists a universal constant $c > 0$ such that, with probability at least $1 - \delta$,

\[\mathcal{E}_{\mathcal{D}}(\hat{h}) \leq \min_{t \in [N+1]} C_\rho \left(c\, C_\beta\, \frac{d_{\mathcal{H}} \log\!\left(\frac{1}{d_{\mathcal{H}}}\sum_{s=1}^{t} n_{(s)}\right) + \log(1/\delta)}{\sum_{s=1}^{t} n_{(s)}}\right)^{1/(2-\beta)\bar{\rho}_t}.\]

The result is shown in Section 7.2 by arguing that each $h$ in $\mathcal{H}_{(t)}$ has $\mathcal{E}_{\mathcal{D}}(h)$ of order $\left(\sum_{s \in [t]} n_{(s)}\right)^{-1/(2-\beta)\bar{\rho}_t}$, so that the minimum over $t$ is attained whenever the returned hypothesis belongs to all of the sets $\mathcal{H}_{(t)}$ (which is probable, as $h^*$ is likely to be in each $\mathcal{H}_{(t)}$, making the intersection nonempty).

Recalling our earlier discussion of Section 1, the above result applies to the case of drifting distributions by letting $n_t = 1$ and $(t) = N+2-t$, i.e., the $t^{\rm th}$ previous example is considered the $t^{\rm th}$ most relevant to the target $\mathcal{D}$. In this case, in contrast to the lower bounds proven by Bar (92), the above result of Theorem 4 reveals situations where the risk at time $N$ approaches 0 as $N \to \infty$, even though the total-variation distance between adjacent distributions may be bounded away from zero (see the examples of Section 3.1). In other words, by constraining the sequence of $P_t$ distributions by the sequence of $\rho_t$ values, we can describe scenarios where the traditional upper bound analysis of drifting distributions from Bar (92); BDBM (89); BL (97) can be improved.

Remark 4 (approximate ranking).

Theorem 4, and its proof, also have implications for scenarios where we may only have access to an approximate ranking, that is, where the ranking indices $(t)$ do not strictly order the tasks by their respective minimum valid $\rho_t$ values. For instance, one immediate observation is that, since Definition 4 does not require $\rho_t$ to be minimal, any larger value of $\rho_t$ is also valid; therefore, for any permutation $\sigma: [N+1] \to [N+1]$, there always exist some valid choices of $\rho_t$ values so that the sequence $\rho_{\sigma(t)}$ is non-decreasing, and hence we can define $(t) = \sigma(t)$, so that Theorem 4 holds (with these $\rho_t$ values) for the rank-based procedure applied with this $\sigma(t)$ ordering of the tasks. For instance, this means that if we use an ordering $\sigma(t)$ that only swaps the order of some $\rho_{(t)}$ values that are within some $\epsilon$ of each other, then the result in the theorem remains valid aside from replacing $\bar{\rho}_t$ with $\bar{\rho}_t + \epsilon$. A second observation is that, upon inspecting the proof, it is clear that the result is only truly sensitive to the ranking relative to the index $t^*$ achieving the minimum in the bound: that is, if some indices $t, t'$ with $(t) < (t^*)$ and $(t') < (t^*)$ are incorrectly ordered, while still both ranked before $t^*$ (or likewise for indices with $(t) > (t^*)$ and $(t') > (t^*)$), the result remains valid as stated nonetheless.

4.3 Impossibility of Adaptivity in General Regimes

Adaptivity in general is not possible, even though the rates of Theorem 1 appear to reduce the exponential search space over aggregations to a more manageable one based on rankings of the data. As seen in the previous section, easier settings, such as ones with ranking information or low noise, allow adaptivity to the remaining unknown distributional parameters. The result below states that, outside of these regimes, no algorithm can guarantee a rate better than a dependence on the number of samples $n_{\mathcal{D}}$ from the target task alone, even when a semi-adaptive procedure using ranking information can achieve any desired rate $\epsilon$.

Theorem 5 (Impossibility of adaptivity).

Pick any $0 \leq \beta < 1$, $C_\beta \geq 2$, and let $1 \leq n < 2/\beta - 1$, $n_{\mathcal{D}} \geq 0$, and $C_\rho = 3$. Pick any $\epsilon > 0$. The following holds for $N$ sufficiently large.

• Let $\hat{h}$ denote any multisource learner with no knowledge of $\mathcal{M}$. There exists a multisource class $\mathcal{M}$ with parameters $n_{N+1} = n_{\mathcal{D}}$, $n_t = n$, $\forall t \in [N]$, and all other parameters satisfying the above, such that

\[\sup_{\Pi \in \mathcal{M}} \mathbb{E}_{\Pi}\left[\mathcal{E}_{\mathcal{D}}(\hat{h})\right] \geq c \cdot \left(1 \land n_{\mathcal{D}}^{-1/(2-\beta)}\right) \text{ for a universal constant } c > 0.\]

• On the other hand, there exists a semi-adaptive classifier $\tilde{h}$ which, given data $Z \sim \Pi$, along with a ranking $\rho_{(1)} \leq \rho_{(2)} \leq \ldots \leq \rho_{(N+1)}$, but no other information on the above $\mathcal{M}$, achieves the rate

\[\sup_{\Pi \in \mathcal{M}} \mathbb{E}_{\Pi}\left[\mathcal{E}_{\mathcal{D}}(\tilde{h})\right] \leq \epsilon.\]

The first part of the result follows from Theorem 8 of Section 8, while the second part follows from both Theorem 8 and Theorem 4. The main idea of the proof is to inject enough randomness into the choice of ranking, while at the same time allowing bad datasets from distributions with large $\rho_t$ (which would force a wrong choice of $h^*$) to appear as benign as good samples from distributions with small $\rho_t = 1$. Hence, we let $N$ be large enough in our constructions so that the bulk of bad datasets significantly overwhelms the information from good datasets.

As a technical point, having no knowledge of $\mathcal{M}$ simply means that the minimax analysis is performed over a larger family containing such $\mathcal{M}$'s, indexed over choices of ranking, each corresponding to a fixed $\mathcal{M}$.

Finally, we note that the result leaves open the possibility of adaptivity under further distributional restrictions on $\mathcal{M}$, for instance requiring that the number of samples $n$ per source task be large w.r.t. other parameters such as $\beta$ and $N$; although this remains unclear, large values of $n$ could perhaps yield sufficient information to compare and rank tasks.

Another possible restriction towards adaptivity is to require that a large proportion of the samples come from datasets that are good w.r.t. the target $\mathcal{D} \doteq P_{N+1}$. In particular, we show in Theorem 9 of Appendix B that the ERM $\hat{h}_Z$ which pools all datasets achieves a target excess risk $\mathcal{E}_{\mathcal{D}}(\hat{h}_Z)$ depending on the (weighted) median of the $\rho_t$ values (or, more generally, any quantile); in other words, as long as a constant fraction of all datapoints (pooled from all datasets) come from tasks with relatively small $\rho_t$, the bound will be small. However, this is not a safe algorithm in general, as per Theorem 2.

5 Lower Bound Analysis

Theorem 6.

Suppose $|\mathcal{H}| \geq 3$. If every $\rho_t \geq 1$, then for any learning rule $\hat{h}$, there exists $\Pi \in \mathcal{M}$ such that, with probability at least $1/50$,

\[\mathcal{E}_{\mathcal{D}}(\hat{h}) > c_1 \min_{t \in [N+1]} \left(\frac{c_2\, d_{\mathcal{H}}}{\left(\sum_{s=1}^{t} n_{(s)}\right)\log^2\!\left(\sum_{s=1}^{t} n_{(s)}\right)}\right)^{1/(2-\beta)\bar{\rho}_t}\]

for numerical constants $c_1, c_2 > 0$.

Proof.

We will prove this result with $C_\beta = C_\rho = 2$; the general cases $C_\beta \geq 2$ and $C_\rho \geq 2$ follow immediately (since distributions satisfying $C_\beta = C_\rho = 2$ also satisfy the respective conditions for any $C_\beta \geq 2$ and $C_\rho \geq 2$). As in the proof of Theorem 7, for each $t \in [N+1]$, define $\mathbf{n}_t \doteq \sum_{s=1}^{t} n_{(s)}$. We prove the theorem using a standard approach to lower bounds by the probabilistic method. Let $x_0, x_1, \ldots, x_d$ be a sequence in $\mathcal{X}$ such that $\exists y_0 \in \mathcal{Y}$ for which every $y_1, \ldots, y_d \in \mathcal{Y}$ can be realized by $h(x_1), \ldots, h(x_d)$ for some $h \in \mathcal{H}$ with $h(x_0) = y_0$. If $d_{\mathcal{H}} = 1$, we let $d = 1$, and we can find such a sequence since $|\mathcal{H}| \geq 3$; if $d_{\mathcal{H}} \geq 2$, then we let $d = d_{\mathcal{H}} - 1$, and any shattered sequence of $d_{\mathcal{H}}$ points will suffice. We now specify a family of $2^d$ distributions, as follows.

Fix $\epsilon \in (0,1)$. For each $\sigma \in \{-1,1\}^d$ and each $t \in [N+1]$, let $P_t^\sigma(\{x_0, y_0\}) = 1 - \epsilon^{\rho_t \beta}$, $P_t^\sigma(\{x_0, -y_0\}) = 0$, and for each $i \in [d]$ and $y \in \{-1,1\}$ let $P_t^\sigma(\{x_i, y\}) = \frac{1}{d}\epsilon^{\rho_t \beta}\left(\frac{1}{2} + \frac{y\sigma_i}{2}\epsilon^{\rho_t(1-\beta)}\right)$. In other words, the marginal probability of each $x_i$ is $\frac{1}{d}\epsilon^{\rho_t \beta}$ and the conditional probability of label 1 given $x_i$ is $\frac{1}{2} + \frac{\sigma_i}{2}\epsilon^{\rho_t(1-\beta)}$.

We first verify that $\Pi_\sigma = \prod_{t \in [N+1]} P_t^\sigma$ is in $\mathcal{M}$. For every $\sigma$ and every $t$, the Bayes optimal label at $x_0$ is always $y_0$, and the Bayes optimal label at $x_i$ ($i \in [d]$) is always $\sigma_i$; since this is the same for every $t$, the classifier $h^*_\sigma \doteq h^*_{P_{N+1}^\sigma}$ is optimal under every $P_t^\sigma$. Next, we check the Bernstein class condition. Define $\ell(h, \sigma) \doteq |\{i \in [d] : h(x_i) \neq \sigma_i\}|$. For any $t \in [N+1]$ and $h \in \mathcal{H}$, we have

\[P_t^\sigma(h \neq h^*_\sigma) = (1 - \epsilon^{\rho_t \beta})\,\mathbb{1}\!\left\{h(x_0) \neq y_0\right\} + \frac{\ell(h,\sigma)}{d}\,\epsilon^{\rho_t \beta}\]

while

\[\mathcal{E}_{P_t^\sigma}(h) = (1 - \epsilon^{\rho_t \beta})\,\mathbb{1}\!\left\{h(x_0) \neq y_0\right\} + \frac{\ell(h,\sigma)}{d}\,\epsilon^{\rho_t}.\]

Thus,

\[2\,\mathcal{E}_{P_t^\sigma}^\beta(h) \geq 2\max\!\left\{(1 - \epsilon^{\rho_t \beta})^\beta\,\mathbb{1}\!\left\{h(x_0) \neq y_0\right\},\ \epsilon^{\rho_t \beta}\left(\frac{\ell(h,\sigma)}{d}\right)^{\!\beta}\right\} \geq P_t^\sigma(h \neq h^*_\sigma).\]

Finally, we verify the ρt\rho_{t} values are satisfied. First note that any tt with ρt=∞\rho_{t}=\infty trivially satisfies the condition. Denote by π’ŸΟƒ{\cal D}^{\sigma} the distribution PN+1ΟƒP_{N+1}^{\sigma}. Then for each t∈[N]t\in[N] with 1≀ρt<∞1\leq\rho_{t}<\infty, and each hβˆˆβ„‹h\in\mathcal{H},

β„°π’ŸΟƒβ€‹(h)\displaystyle\mathcal{E}_{{\cal D}^{\sigma}}(h) =(((1βˆ’Ο΅Ξ²)​1​{h​(x0)β‰ y0}+ϡ​ℓ​(h,Οƒ)d)ρt)1/ρt\displaystyle=\left(\left((1-\epsilon^{\beta}){\mathbbold 1}\!\left\{h(x_{0})\neq y_{0}\right\}+\epsilon\frac{\ell(h,\sigma)}{d}\right)^{\rho_{t}}\right)^{1/\rho_{t}}
≀2​((1βˆ’Ο΅Ξ²)ρt​1​{h​(x0)β‰ y0}+(ϡ​ℓ​(h,Οƒ)d)ρt)1/ρt≀2​ℰPtΟƒ1/ρt​(h),\displaystyle\leq 2\left((1-\epsilon^{\beta})^{\rho_{t}}{\mathbbold 1}\!\left\{h(x_{0})\neq y_{0}\right\}+\left(\epsilon\frac{\ell(h,\sigma)}{d}\right)^{\rho_{t}}\right)^{1/\rho_{t}}\leq 2\mathcal{E}_{P_{t}^{\sigma}}^{1/\rho_{t}}(h),

where the final inequality follows from ρtβ‰₯1\rho_{t}\geq 1 and a bit of calculus to verify that (1βˆ’x)ρt≀1βˆ’xρt(1-x)^{\rho_{t}}\leq 1-x^{\rho_{t}} for all x∈[0,1]x\in[0,1].
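For completeness, the calculus step can be spelled out in one line: for x∈[0,1] and ρt≥1, both 1−x and x lie in [0,1], so that (1−x)^{ρt} ≤ 1−x and x^{ρt} ≤ x, whence

(1−x)^{ρt} + x^{ρt} ≤ (1−x) + x = 1,

which rearranges to the stated inequality (1−x)^{ρt} ≤ 1−x^{ρt}.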

Now the Varshamov-Gilbert bound (see PropositionΒ 10 of AppendixΒ E) implies there exists a subset {Οƒ0,Οƒ1,…,ΟƒM}βŠ‚{βˆ’1,1}d\{\sigma^{0},\sigma^{1},\ldots,\sigma^{M}\}\subset\{-1,1\}^{d} with Mβ‰₯2d/8M\geq 2^{d/8}, Οƒ0=(1,1,…,1)\sigma^{0}=(1,1,\ldots,1), and every distinct i,ji,j have βˆ‘t1​{Οƒtiβ‰ Οƒtj}β‰₯d8\sum_{t}{\mathbbold 1}\!\left\{\sigma^{i}_{t}\neq\sigma^{j}_{t}\right\}\geq\frac{d}{8}. Furthermore, for any iβ‰ 0i\neq 0,

KL​(Ξ Οƒiβˆ₯Ξ Οƒ0)\displaystyle{\rm KL}(\Pi_{\sigma^{i}}\|\Pi_{\sigma^{0}}) =βˆ‘t∈[N+1]nt​KL​(PtΟƒiβˆ₯PtΟƒ0)=βˆ‘t∈[N+1]nt​ϡρt​β​1dβ€‹βˆ‘j∈[d]1​{Οƒjiβ‰ 1}​KL​(12βˆ’12​ϡρt​(1βˆ’Ξ²)βˆ₯12+12​ϡρt​(1βˆ’Ξ²))\displaystyle=\sum_{t\in[N+1]}n_{t}{\rm KL}(P_{t}^{\sigma^{i}}\|P_{t}^{\sigma^{0}})=\sum_{t\in[N+1]}n_{t}\epsilon^{\rho_{t}\beta}\frac{1}{d}\sum_{j\in[d]}{\mathbbold 1}\!\left\{\sigma^{i}_{j}\neq 1\right\}{\rm KL}(\frac{1}{2}-\frac{1}{2}\epsilon^{\rho_{t}(1-\beta)}\|\frac{1}{2}+\frac{1}{2}\epsilon^{\rho_{t}(1-\beta)})
≀c3β€‹βˆ‘t∈[N+1]nt​ϡρt​β​1dβ€‹βˆ‘j∈[d]1​{Οƒjiβ‰ 1}​ϡ2​ρt​(1βˆ’Ξ²)≀c3β€‹βˆ‘t∈[N+1]nt​ϡρt​(2βˆ’Ξ²)\displaystyle\leq c_{3}\sum_{t\in[N+1]}n_{t}\epsilon^{\rho_{t}\beta}\frac{1}{d}\sum_{j\in[d]}{\mathbbold 1}\!\left\{\sigma^{i}_{j}\neq 1\right\}\epsilon^{2\rho_{t}(1-\beta)}\leq c_{3}\sum_{t\in[N+1]}n_{t}\epsilon^{\rho_{t}(2-\beta)}

for a numerical constant c3c_{3}, where we have used a quadratic approximation of KL divergence between Bernoullis (LemmaΒ 8 of AppendixΒ E). Consider any choice of Ο΅>0\epsilon>0 making this last expression less than d64\frac{d}{64}.

Since d64≀(1/8)​log⁑(M)\frac{d}{64}\leq(1/8)\log(M), Theorem 2.5 of (Tsy, 09) (see PropositionΒ 9 of AppendixΒ E) implies that for any (possibly randomized) estimator Οƒ^:(𝒳×𝒴)βˆ‘tntβ†’{βˆ’1,1}d\hat{\sigma}:(\mathcal{X}\times\mathcal{Y})^{\sum_{t}{n}_{t}}\to\{-1,1\}^{d}, there exists a choice of Οƒ\sigma such that a sample ZβˆΌΞ ΟƒZ\sim\Pi_{\sigma} will have |{i:Οƒ^​(Z)iβ‰ Οƒi}|β‰₯d16|\{i:\hat{\sigma}(Z)_{i}\neq\sigma_{i}\}|\geq\frac{d}{16} with probability at least 150\frac{1}{50}.

In particular, for any learning algorithm h^\hat{h}, we can take Οƒ^=(h^​(x1),…,h^​(xd))\hat{\sigma}=(\hat{h}(x_{1}),\ldots,\hat{h}(x_{d})), and this result implies that there is a choice of Οƒ\sigma such that, for h^\hat{h} trained on ZβˆΌΞ ΟƒZ\sim\Pi_{\sigma}, with probability at least 150\frac{1}{50}, there are at least d16\frac{d}{16} points xix_{i} with h^​(xi)β‰ Οƒi=hΟƒβˆ—β€‹(xi)\hat{h}(x_{i})\neq\sigma_{i}=h^{\!*}_{\sigma}(x_{i}), which implies β„°π’ŸΟƒβ€‹(h^)β‰₯Ο΅16\mathcal{E}_{{\cal D}^{\sigma}}(\hat{h})\geq\frac{\epsilon}{16}.

It remains only to identify an explicit value of Ο΅\epsilon for which βˆ‘t∈[N+1]nt​ϡρt​(2βˆ’Ξ²)≀d64\sum_{t\in[N+1]}n_{t}\epsilon^{\rho_{t}(2-\beta)}\leq\frac{d}{64}.

In particular, fix any positive function ϕ​(m)\phi(m) such that βˆ‘m=1βˆžΟ•β€‹(m)βˆ’1≀c​d\sum_{m=1}^{\infty}\phi(m)^{-1}\leq cd for a finite numerical constant cc: for instance, ϕ​(m)=1d​m​log2⁑m\phi(m)=\frac{1}{d}m\log^{2}m will suffice to prove the theorem. Then consider

Ο΅=mint∈[N+1](c2βˆ’1Ο•(𝐧t))βˆ’1/(2βˆ’Ξ²)​ρ(t)\epsilon=\min_{t\in[N+1]}\left(c_{2}^{-1}\phi(\mathbf{n}_{t})\right)^{-1/(2-\beta)\rho_{(t)}}

for a numerical constant c2∈(0,1)c_{2}\in(0,1). Denote by tβˆ—t_{*} the value of tt obtaining the minimum in this expression. Then note that every other tβ‰ tβˆ—t\neq t_{*} has

(c2βˆ’1​ϕ​(𝐧t))βˆ’1/ρ(t)β‰₯(c2βˆ’1​ϕ​(𝐧tβˆ—))βˆ’1/ρ(tβˆ—),\left(c_{2}^{-1}\phi(\mathbf{n}_{t})\right)^{-1/\rho_{(t)}}\geq\left(c_{2}^{-1}\phi(\mathbf{n}_{t_{*}})\right)^{-1/\rho_{(t_{*})}},

which implies

ρ(t)ρ(tβˆ—)β‰₯ln⁑(c2βˆ’1​ϕ​(𝐧t))ln⁑(c2βˆ’1​ϕ​(𝐧tβˆ—)),\frac{\rho_{(t)}}{\rho_{(t_{*})}}\geq\frac{\ln\left(c_{2}^{-1}\phi(\mathbf{n}_{t})\right)}{\ln\left(c_{2}^{-1}\phi(\mathbf{n}_{t_{*}})\right)},

so that

(c2βˆ’1​ϕ​(𝐧tβˆ—))βˆ’Ο(t)/ρ(tβˆ—)≀(c2βˆ’1​ϕ​(𝐧tβˆ—))βˆ’ln⁑(c2βˆ’1​ϕ​(𝐧t))ln⁑(c2βˆ’1​ϕ​(𝐧tβˆ—))=(c2βˆ’1​ϕ​(𝐧t))βˆ’1.\left(c_{2}^{-1}\phi(\mathbf{n}_{t_{*}})\right)^{-\rho_{(t)}/\rho_{(t_{*})}}\leq\left(c_{2}^{-1}\phi(\mathbf{n}_{t_{*}})\right)^{-\frac{\ln\left(c_{2}^{-1}\phi(\mathbf{n}_{t})\right)}{\ln\left(c_{2}^{-1}\phi(\mathbf{n}_{t_{*}})\right)}}=\left(c_{2}^{-1}\phi(\mathbf{n}_{t})\right)^{-1}.

Thus, we have

βˆ‘t∈[N+1]nt​ϡρt​(2βˆ’Ξ²)β‰€βˆ‘t∈[N+1]n(t)​(c2βˆ’1​ϕ​(𝐧t))βˆ’1≀c2β€‹βˆ‘m=1∞1ϕ​(m)≀c2​c​d\sum_{t\in[N+1]}n_{t}\epsilon^{\rho_{t}(2-\beta)}\leq\sum_{t\in[N+1]}n_{(t)}\left(c_{2}^{-1}\phi(\mathbf{n}_{t})\right)^{-1}\leq c_{2}\sum_{m=1}^{\infty}\frac{1}{\phi(m)}\leq c_{2}cd

for a finite numerical constant cc. Thus, choosing any c2<1/(64​c)c_{2}<1/(64c), we have βˆ‘t∈[N+1]nt​ϡρt​(2βˆ’Ξ²)<d64\sum_{t\in[N+1]}n_{t}\epsilon^{\rho_{t}(2-\beta)}<\frac{d}{64}, as desired.

Altogether, we have that for any learning rule h^\hat{h}, there exists Ξ βˆˆβ„³\Pi\in\mathcal{M} such that if h^\hat{h} is trained on Z∼ΠZ\sim\Pi, then with probability at least 1/501/50,

β„°π’Ÿ(h^)β‰₯Ο΅16=116mint∈[N+1](c2ϕ​(𝐧t))1/(2βˆ’Ξ²)​ρ(t).\mathcal{E}_{{\cal D}}(\hat{h})\geq\frac{\epsilon}{16}=\frac{1}{16}\min_{t\in[N+1]}\left(\frac{c_{2}}{\phi(\mathbf{n}_{t})}\right)^{1/(2-\beta)\rho_{(t)}}.

In particular, since we always have ρ¯t≀ρ(t)\bar{\rho}_{t}\leq\rho_{(t)}, the theorem immediately follows. ∎

Remark 5.

We note that it is clear from the proof that the (βˆ‘s=1tn(s))​log2⁑(βˆ‘s=1tn(s))\left(\sum_{s=1}^{t}n_{(s)}\right)\log^{2}\!\left(\sum_{s=1}^{t}n_{(s)}\right) denominator in the lower bound is not strictly optimal. It can be replaced by any function ϕ′​(βˆ‘s=1tn(s))\phi^{\prime}\!\left(\sum_{s=1}^{t}n_{(s)}\right) satisfying βˆ‘m=1βˆžΟ•β€²β€‹(m)βˆ’1<∞\sum_{m=1}^{\infty}\phi^{\prime}(m)^{-1}<\infty: for instance, ϕ′​(m)=m​log⁑(m)​(log⁑log⁑(m))2\phi^{\prime}(m)=m\log(m)(\log\log(m))^{2}.

Remark 6.

We also note that there exist classes β„‹\mathcal{H} for which the lower bound can be extended to the full range of {ρt}\{\rho_{t}\} (i.e., any values ρt>0\rho_{t}>0). Specifically, in this general case, if there exist two points x0,x1βˆˆπ’³x_{0},x_{1}\in\mathcal{X} such that all hβˆˆβ„‹h\in\mathcal{H} agree on h​(x0)h(x_{0}) while βˆƒh,hβ€²βˆˆβ„‹\exists h,h^{\prime}\in\mathcal{H} with h​(x1)β‰ h′​(x1)h(x_{1})\neq h^{\prime}(x_{1}), then by the same construction (with d=1d=1) used in the proof we would have, with probability at least 1/501/50,

β„°π’Ÿ(h^)>c1mint∈[N+1](c2(βˆ‘s=1tn(s))​log2⁑(βˆ‘s=1tn(s)))1/(2βˆ’Ξ²)​ρ¯t.\mathcal{E}_{{\cal D}}(\hat{h})>c_{1}\min\limits_{t\in[N+1]}\left(\frac{c_{2}}{\left(\sum_{s=1}^{t}n_{(s)}\right)\log^{2}\!\left(\sum_{s=1}^{t}n_{(s)}\right)}\right)^{1/(2-\beta)\bar{\rho}_{t}}.

6 Upper Bound Analysis

We will in fact establish the upper bound as a bound holding with high probability 1βˆ’Ξ΄1-\delta, for any δ∈(0,1)\delta\in(0,1). Throughout this subsection, let β„³=ℳ​(Cρ,{ρt}t∈[N],{nt}t∈[N+1],CΞ²,Ξ²)\mathcal{M}=\mathcal{M}\left(C_{\rho},\left\{{\rho_{t}}\right\}_{t\in[N]},\left\{{n}_{t}\right\}_{t\in[N+1]},C_{\beta},\beta\right), for any admissible values of the parameters. Let

tβˆ—β‰argmint∈[N+1]Cρ​(210​C04​Cβ​dℋ​log⁑(1dβ„‹β€‹βˆ‘s=1tn(s))+log⁑(1/Ξ΄)βˆ‘s=1tn(s))1/(2βˆ’Ξ²)​ρ¯t,t^{*}\doteq\operatorname*{argmin}\limits_{t\in[N+1]}C_{\rho}\left(2^{10}C_{0}^{4}C_{\beta}\frac{d_{\mathcal{H}}\log\!\left(\frac{1}{d_{\mathcal{H}}}\sum_{s=1}^{t}n_{(s)}\right)+\log(1/\delta)}{\sum_{s=1}^{t}n_{(s)}}\right)^{1/(2-\beta)\bar{\rho}_{t}},

for C0C_{0} as in LemmaΒ 1 below, and for δ∈(0,1)\delta\in(0,1). The oracle procedure just returns h^Z(tβˆ—)\hat{h}_{Z^{(t^{*})}}, the ERM over Z(tβˆ—)Z^{(t^{*})}.
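As a purely illustrative sketch of this oracle selection (Python; the constants C0, Cβ, Cρ and the exponents ρ̄t are taken as inputs, since their values are defined elsewhere and are not computed here, and all names below are ours), the index t* is obtained by a single pass over prefix sample sizes:

```python
import math

def oracle_index(n_sorted, rho_bar, d_H, delta, beta, C0=1.0, C_beta=2.0, C_rho=2.0):
    """Return (t_star, value) minimizing, over t = 1, ..., N+1,
        C_rho * (2**10 * C0**4 * C_beta * (d_H*log(n_t/d_H) + log(1/delta)) / n_t)**(1/((2-beta)*rho_bar_t)),
    where n_t = n_(1) + ... + n_(t) is the prefix sample size under the given task ordering
    (the same ordering that defines rho_bar)."""
    best_t, best_val, prefix = None, math.inf, 0
    for t, (n_t, rb) in enumerate(zip(n_sorted, rho_bar), start=1):
        prefix += n_t
        # log guarded away from zero for very small prefixes (illustration only)
        eps = 2**10 * C0**4 * C_beta * (d_H * math.log(max(prefix / d_H, math.e))
                                        + math.log(1.0 / delta)) / prefix
        val = C_rho * eps ** (1.0 / ((2.0 - beta) * rb))
        if val < best_val:
            best_t, best_val = t, val
    return best_t, best_val

# Example: six tasks (sources and target), already ordered as in the paper's (t) notation.
t_star, rate = oracle_index(n_sorted=[100, 100, 100, 100, 100, 20],
                            rho_bar=[1.0, 1.0, 1.2, 1.5, 2.0, 2.0],
                            d_H=3, delta=0.05, beta=0.5)
```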

We have the following theorem.

Theorem 7.

For any Ξ βˆˆβ„³\Pi\in\mathcal{M}, for any δ∈(0,1)\delta\in(0,1), with probability at least 1βˆ’Ξ΄1-\delta, we have

β„°π’Ÿβ€‹(h^Z(tβˆ—))≀mint∈[N+1]⁑Cρ​(C​dℋ​log⁑(1dβ„‹β€‹βˆ‘s=1tn(s))+log⁑(1/Ξ΄)βˆ‘s=1tn(s))1/(2βˆ’Ξ²)​ρ¯t\mathcal{E}_{{\cal D}}(\hat{h}_{Z^{(t^{*})}})\leq\min_{t\in[N+1]}C_{\rho}\left(C\frac{d_{\mathcal{H}}\log\left(\frac{1}{d_{\mathcal{H}}}\sum_{s=1}^{t}n_{(s)}\right)+\log(1/\delta)}{\sum_{s=1}^{t}n_{(s)}}\right)^{1/(2-\beta)\bar{\rho}_{t}}

for a constant C=210​C04​CΞ²C=2^{10}C_{0}^{4}C_{\beta}, where C0C_{0} is a numerical constant from LemmaΒ 1 below.

The proof will rely on the following lemma: a uniform Bernstein inequality for independent but non-identically distributed data. Results of this type are well known for i.i.d.Β data (see e.g., Kol (06)). For completeness, we include a proof of the extension to non-identically distributed data in AppendixΒ A. The main technical modification of the proof, compared to the i.i.d.Β case, is employing a generalization of Bousquet’s inequality for non-identical data, due to KR (05).

Lemma 1.

(Uniform Bernstein inequality for non-identical distributions)Β Β  For any mβˆˆβ„•m\in\mathbb{N} and δ∈(0,1)\delta\in(0,1), define Ρ​(m,Ξ΄)≐dβ„‹m​log⁑(mdβ„‹)+1m​log⁑(1Ξ΄)\varepsilon(m,\delta)\doteq\frac{d_{\mathcal{H}}}{m}\log\!\left(\frac{m}{d_{\mathcal{H}}}\right)+\frac{1}{m}\log\!\left(\frac{1}{\delta}\right) and let S={(X1,Y1),…,(Xm,Ym)}S=\{(X_{1},Y_{1}),\ldots,(X_{m},Y_{m})\} be independent samples. With probability at least 1βˆ’Ξ΄1-\delta, βˆ€h,hβ€²βˆˆβ„‹\forall h,h^{\prime}\in\mathcal{H},

𝔼​[β„°^S​(h;hβ€²)]≀ℰ^S​(h;hβ€²)+C0​min⁑{𝔼​[β„™^S​(hβ‰ hβ€²)],β„™^S​(hβ‰ hβ€²)}​Ρ​(m,Ξ΄)+C0​Ρ​(m,Ξ΄),\mathbb{E}\!\left[\hat{\mathcal{E}}_{S}(h;h^{\prime})\right]\leq\hat{\mathcal{E}}_{S}(h;h^{\prime})+C_{0}\sqrt{\min\!\left\{\mathbb{E}\!\left[\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\right],\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\right\}\varepsilon(m,\delta)}+C_{0}\varepsilon(m,\delta), (5)

and

12​𝔼​[β„™^S​(hβ‰ hβ€²)]βˆ’C0​Ρ​(m,Ξ΄)≀ℙ^S​(hβ‰ hβ€²)≀2​𝔼​[β„™^S​(hβ‰ hβ€²)]+C0​Ρ​(m,Ξ΄),\frac{1}{2}\mathbb{E}\!\left[\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\right]-C_{0}\varepsilon(m,\delta)\leq\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\leq 2\mathbb{E}\!\left[\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\right]+C_{0}\varepsilon(m,\delta), (6)

for a universal numerical constant C0∈(0,∞)C_{0}\in(0,\infty).

In particular, we will use this lemma via the following implication.

Lemma 2.

For any Ξ \Pi as in TheoremΒ 7, for any IβŠ†[N+1]I\subseteq[N+1], letting 𝐧Iβ‰βˆ‘t∈Int\mathbf{n}_{I}\doteq\sum\limits_{t\in I}n_{t} and PΒ―I≐𝐧Iβˆ’1β€‹βˆ‘t∈Int​Pt\bar{P}_{I}\doteq\mathbf{n}_{I}^{-1}\sum_{t\in I}n_{t}P_{t}, for any δ∈(0,1)\delta\in(0,1), on the event (of probability at least 1βˆ’Ξ΄1-\delta) from LemmaΒ 1 (for S=ZIS=Z^{I} there), for any hβˆˆβ„‹h\in\mathcal{H} satisfying

β„°^ZI​(h;h^ZI)≀C0​ℙ^ZI​(hβ‰ h^ZI)​Ρ​(𝐧I,Ξ΄)+C0​Ρ​(𝐧I,Ξ΄),\hat{\mathcal{E}}_{Z^{I}}(h;\hat{h}_{Z^{I}})\leq C_{0}\sqrt{\hat{\mathbb{P}}_{Z^{I}}(h\neq\hat{h}_{Z^{I}})\varepsilon(\mathbf{n}_{I},\delta)}+C_{0}\varepsilon(\mathbf{n}_{I},\delta), (7)

it holds that

β„°PΒ―I​(h)≀32​C02​(Cβ​Ρ​(𝐧I,Ξ΄))1/(2βˆ’Ξ²).\mathcal{E}_{\bar{P}_{I}}(h)\leq 32C_{0}^{2}\left(C_{\beta}\varepsilon(\mathbf{n}_{I},\delta)\right)^{1/(2-\beta)}.
Proof.

On the event from LemmaΒ 1 for S=ZIS=Z^{I}, it holds that βˆ€h,hβ€²βˆˆβ„‹\forall h,h^{\prime}\in\mathcal{H}, (since β„°PΒ―I​(h;hβ€²)=𝔼​[β„°^ZI​(h;hβ€²)]\mathcal{E}_{\bar{P}_{I}}(h;h^{\prime})=\mathbb{E}[\hat{\mathcal{E}}_{Z^{I}}(h;h^{\prime})])

β„°PΒ―I​(h;hβ€²)≀ℰ^ZI​(h;hβ€²)+C0​min⁑{PΒ―I​(hβ‰ hβ€²),β„™^ZI​(hβ‰ hβ€²)}​Ρ​(𝐧I,Ξ΄)+C0​Ρ​(𝐧I,Ξ΄)\mathcal{E}_{\bar{P}_{I}}(h;h^{\prime})\leq\hat{\mathcal{E}}_{Z^{I}}(h;h^{\prime})+C_{0}\sqrt{\min\!\left\{\bar{P}_{I}(h\neq h^{\prime}),\hat{\mathbb{P}}_{Z^{I}}(h\neq h^{\prime})\right\}\varepsilon(\mathbf{n}_{I},\delta)}+C_{0}\varepsilon(\mathbf{n}_{I},\delta) (8)

and

β„™^ZI​(hβ‰ hβ€²)≀2​PΒ―I​(hβ‰ hβ€²)+C0​Ρ​(𝐧I,Ξ΄).\hat{\mathbb{P}}_{Z^{I}}(h\neq h^{\prime})\leq 2\bar{P}_{I}(h\neq h^{\prime})+C_{0}\varepsilon(\mathbf{n}_{I},\delta). (9)

Suppose this event occurs.

By the Bernstein class condition and Jensen’s inequality, for any hβˆˆβ„‹h\in\mathcal{H} we have

PΒ―I​(hβ‰ hβˆ—)≀Cβ​𝐧Iβˆ’1β€‹βˆ‘t∈Int​ℰPtβ​(h)≀Cβ​(β„°PΒ―I​(h))Ξ².\bar{P}_{I}(h\neq h^{\!*})\leq C_{\beta}\mathbf{n}_{I}^{-1}\sum_{t\in I}n_{t}\mathcal{E}_{P_{t}}^{\beta}(h)\leq C_{\beta}\left(\mathcal{E}_{\bar{P}_{I}}(h)\right)^{\beta}. (10)

Furthermore, applying (8) with h=h^ZIh=\hat{h}_{Z^{I}} and hβ€²=hβˆ—h^{\prime}=h^{\!*} reveals that

β„°PΒ―I​(h^ZI)≀C0​PΒ―I​(h^ZIβ‰ hβˆ—)​Ρ​(𝐧I,Ξ΄)+C0​Ρ​(𝐧I,Ξ΄).\mathcal{E}_{\bar{P}_{I}}(\hat{h}_{Z^{I}})\leq C_{0}\sqrt{\bar{P}_{I}(\hat{h}_{Z^{I}}\neq h^{\!*})\varepsilon(\mathbf{n}_{I},\delta)}+C_{0}\varepsilon(\mathbf{n}_{I},\delta).

Combining this with (10) (for h=h^ZIh=\hat{h}_{Z^{I}}) implies that

β„°PΒ―I​(h^ZI)≀C0​Cβ​(β„°PΒ―I​(h^ZI))β​Ρ​(𝐧I,Ξ΄)+C0​Ρ​(𝐧I,Ξ΄),\mathcal{E}_{\bar{P}_{I}}(\hat{h}_{Z^{I}})\leq C_{0}\sqrt{C_{\beta}\left(\mathcal{E}_{\bar{P}_{I}}(\hat{h}_{Z^{I}})\right)^{\beta}\varepsilon(\mathbf{n}_{I},\delta)}+C_{0}\varepsilon(\mathbf{n}_{I},\delta), (11)

which implies

β„°PΒ―I​(h^ZI)≀2​C0​(Cβ​Ρ​(𝐧I,Ξ΄))1/(2βˆ’Ξ²).\mathcal{E}_{\bar{P}_{I}}(\hat{h}_{Z^{I}})\leq 2C_{0}\left(C_{\beta}\varepsilon(\mathbf{n}_{I},\delta)\right)^{1/(2-\beta)}. (12)

Next, applying (8) with any hβˆˆβ„‹h\in\mathcal{H} satisfying (7) and with hβ€²=h^ZIh^{\prime}=\hat{h}_{Z^{I}} implies

β„°PΒ―I​(h;h^ZI)≀2​C0​ℙ^ZI​(hβ‰ h^ZI)​Ρ​(𝐧I,Ξ΄)+2​C0​Ρ​(𝐧I,Ξ΄),\mathcal{E}_{\bar{P}_{I}}(h;\hat{h}_{Z^{I}})\leq 2C_{0}\sqrt{\hat{\mathbb{P}}_{Z^{I}}(h\neq\hat{h}_{Z^{I}})\varepsilon(\mathbf{n}_{I},\delta)}+2C_{0}\varepsilon(\mathbf{n}_{I},\delta), (13)

and (9) implies the right hand side is at most (after some simplifications of the resulting expression)

2​C0​2​PΒ―I​(hβ‰ h^ZI)​Ρ​(𝐧I,Ξ΄)+2​C0​(1+C0)​Ρ​(𝐧I,Ξ΄).2C_{0}\sqrt{2\bar{P}_{I}(h\neq\hat{h}_{Z^{I}})\varepsilon(\mathbf{n}_{I},\delta)}+2C_{0}(1+\sqrt{C_{0}})\varepsilon(\mathbf{n}_{I},\delta). (14)

By the triangle inequality and (10), together with (12), we have

PΒ―I​(hβ‰ h^ZI)≀PΒ―I​(hβ‰ hβˆ—)+PΒ―I​(h^ZIβ‰ hβˆ—)\displaystyle\bar{P}_{I}(h\neq\hat{h}_{Z^{I}})\leq\bar{P}_{I}(h\neq h^{\!*})+\bar{P}_{I}(\hat{h}_{Z^{I}}\neq h^{\!*}) ≀Cβ​(β„°PΒ―I​(h))Ξ²+Cβ​(β„°PΒ―I​(h^ZI))Ξ²\displaystyle\leq C_{\beta}\left(\mathcal{E}_{\bar{P}_{I}}(h)\right)^{\beta}+C_{\beta}\left(\mathcal{E}_{\bar{P}_{I}}(\hat{h}_{Z^{I}})\right)^{\beta}
≀Cβ​(β„°PΒ―I​(h))Ξ²+Cβ​(2​C0)β​(Cβ​Ρ​(𝐧I,Ξ΄))Ξ²/(2βˆ’Ξ²).\displaystyle\leq C_{\beta}\left(\mathcal{E}_{\bar{P}_{I}}(h)\right)^{\beta}+C_{\beta}(2C_{0})^{\beta}\left(C_{\beta}\varepsilon(\mathbf{n}_{I},\delta)\right)^{\beta/(2-\beta)}.

Combining this with (14) and (13) implies (with some simplification of the resulting expression)

β„°PΒ―I​(h;h^ZI)≀2​C0​2​Cβ​(β„°PΒ―I​(h))β​Ρ​(𝐧I,Ξ΄)+8​C02​(Cβ​Ρ​(𝐧I,Ξ΄))1/(2βˆ’Ξ²).\mathcal{E}_{\bar{P}_{I}}(h;\hat{h}_{Z^{I}})\leq 2C_{0}\sqrt{2C_{\beta}\left(\mathcal{E}_{\bar{P}_{I}}(h)\right)^{\beta}\varepsilon(\mathbf{n}_{I},\delta)}+8C_{0}^{2}\left(C_{\beta}\varepsilon(\mathbf{n}_{I},\delta)\right)^{1/(2-\beta)}.

Since β„°PΒ―I​(h)=β„°PΒ―I​(h;h^ZI)+β„°PΒ―I​(h^ZI)\mathcal{E}_{\bar{P}_{I}}(h)=\mathcal{E}_{\bar{P}_{I}}(h;\hat{h}_{Z^{I}})+\mathcal{E}_{\bar{P}_{I}}(\hat{h}_{Z^{I}}), together with (12) this implies

β„°PΒ―I​(h)≀2​C0​2​Cβ​(β„°PΒ―I​(h))β​Ρ​(𝐧I,Ξ΄)+10​C02​(Cβ​Ρ​(𝐧I,Ξ΄))1/(2βˆ’Ξ²).\mathcal{E}_{\bar{P}_{I}}(h)\leq 2C_{0}\sqrt{2C_{\beta}\left(\mathcal{E}_{\bar{P}_{I}}(h)\right)^{\beta}\varepsilon(\mathbf{n}_{I},\delta)}+10C_{0}^{2}\left(C_{\beta}\varepsilon(\mathbf{n}_{I},\delta)\right)^{1/(2-\beta)}.

In particular, this inequality immediately implies the claimed inequality in the lemma. ∎

We now present the proof of TheoremΒ 7.

Proof of TheoremΒ 7.

For each t∈[N+1]t\in[N+1], define 𝐧tβ‰βˆ‘s=1tn(s)\mathbf{n}_{t}\doteq\sum\limits_{s=1}^{t}n_{(s)}. For brevity, let ρ¯=ρ¯tβˆ—\bar{\rho}=\bar{\rho}_{t^{*}} and h^=h^Z(tβˆ—)\hat{h}=\hat{h}_{Z^{(t^{*})}}. Let PΒ―=𝐧tβˆ—βˆ’1β€‹βˆ‘t=1tβˆ—n(t)​P(t)\bar{P}=\mathbf{n}_{t^{*}}^{-1}\sum_{t=1}^{t^{*}}n_{(t)}P_{(t)}. First note that

β„°P¯​(h^)β‰₯1𝐧tβˆ—β€‹βˆ‘t=1tβˆ—n(t)​(CΟβˆ’1β€‹β„°π’Ÿβ€‹(h^))ρ(t)β‰₯(CΟβˆ’1β€‹β„°π’Ÿβ€‹(h^))ρ¯,\mathcal{E}_{\bar{P}}(\hat{h})\geq\frac{1}{\mathbf{n}_{t^{*}}}\sum_{t=1}^{t^{*}}n_{(t)}\left(C_{\rho}^{-1}\mathcal{E}_{{\cal D}}(\hat{h})\right)^{\rho_{(t)}}\geq\left(C_{\rho}^{-1}\mathcal{E}_{{\cal D}}(\hat{h})\right)^{\bar{\rho}},

where the final inequality is due to Jensen’s inequality, applied to the convex map ρ↦zρ\rho\mapsto z^{\rho} for z∈(0,1]z\in(0,1]. In particular, this implies

β„°π’Ÿβ€‹(h^)≀Cρ​ℰPΒ―1/ρ¯​(h^),\mathcal{E}_{{\cal D}}(\hat{h})\leq C_{\rho}\mathcal{E}_{\bar{P}}^{1/\bar{\rho}}(\hat{h}), (15)

so that it suffices to upper bound the expression on the right hand side. Toward this end, note that h^\hat{h} trivially satisfies (7) for I={(1),…,(tβˆ—)}I=\{(1),\ldots,(t^{*})\}. Therefore, LemmaΒ 2 implies that with probability at least 1βˆ’Ξ΄1-\delta,

β„°P¯​(h^)≀32​C02​(Cβ​Ρ​(𝐧tβˆ—,Ξ΄))1/(2βˆ’Ξ²).\mathcal{E}_{\bar{P}}(\hat{h})\leq 32C_{0}^{2}(C_{\beta}\varepsilon(\mathbf{n}_{t^{*}},\delta))^{1/(2-\beta)}.

Combining this with (15) immediately implies the theorem. ∎

Remark 7.

In particular, the upper bound for TheoremΒ 1 follows from TheoremΒ 7 by plugging in Ξ΄=1/βˆ‘s=1tβˆ—n(s)\delta=1/\sum_{s=1}^{t^{*}}n_{(s)}.

7 Partially Adaptive Procedures

Fix any NN, CρC_{\rho}, {ρt}t∈[N]\{\rho_{t}\}_{t\in[N]}, {nt}t∈[N+1]\{{n}_{t}\}_{t\in[N+1]}, CΞ²C_{\beta}, and β∈[0,1]\beta\in[0,1], and let β„³=ℳ​(Cρ,{ρt}t∈[N],{nt}t∈[N+1],CΞ²,Ξ²)\mathcal{M}=\mathcal{M}\left(C_{\rho},\left\{{\rho_{t}}\right\}_{t\in[N]},\left\{{n}_{t}\right\}_{t\in[N+1]},C_{\beta},\beta\right).

7.1 Pooling Under Low Noise Ξ²=1\beta=1

We now present the proof of Theorem 3 which states the near-optimality of pooling, independent of the choice of target π’Ÿ{\cal D}, whenever Ξ²=1\beta=1.

Proof of Theorem 3.

For any t∈[N+1]t\in[N+1], let 𝐧tβ‰βˆ‘s=1tn(s)\mathbf{n}_{t}\doteq\sum_{s=1}^{t}n_{(s)} and PΒ―t≐(𝐧t)βˆ’1β€‹βˆ‘s=1tn(s)​P(s)\bar{P}_{t}\doteq(\mathbf{n}_{t})^{-1}\sum_{s=1}^{t}n_{(s)}P_{(s)}. Suppose the event from LemmaΒ 1 holds for Z=Z(N+1)Z=Z^{(N+1)}, which occurs with probability at least 1βˆ’Ξ΄1-\delta. By LemmaΒ 2, we know that on this event,

β„°PΒ―N+1​(h^Z)≀32​C02​Cβ⋅Ρ​(𝐧N+1,Ξ΄).\mathcal{E}_{\bar{P}_{N+1}}(\hat{h}_{Z})\leq 32C_{0}^{2}C_{\beta}\cdot\varepsilon(\mathbf{n}_{N+1},\delta).

Combining this with the definition of ρt\rho_{t}, we have

𝐧N+1βˆ’1β€‹βˆ‘t∈[N+1]nt​(CΟβˆ’1β€‹β„°π’Ÿβ€‹(h^Z))ρt≀32​C02​Cβ⋅Ρ​(𝐧N+1,Ξ΄).\mathbf{n}_{N+1}^{-1}\sum_{t\in[N+1]}{n}_{t}\left(C_{\rho}^{-1}\mathcal{E}_{{\cal D}}(\hat{h}_{Z})\right)^{\rho_{t}}\leq 32C_{0}^{2}C_{\beta}\cdot\varepsilon(\mathbf{n}_{N+1},\delta).

Since the left hand side is monotonic in β„°π’Ÿβ€‹(h^Z)\mathcal{E}_{{\cal D}}(\hat{h}_{Z}), if we wish to bound β„°π’Ÿβ€‹(h^Z)\mathcal{E}_{{\cal D}}(\hat{h}_{Z}) by some value Ο΅\epsilon, it suffices to take any value of Ο΅\epsilon such that

𝐧N+1βˆ’1β€‹βˆ‘t∈[N+1]nt​(CΟβˆ’1​ϡ)ρtβ‰₯32​C02​Cβ⋅Ρ​(𝐧N+1,Ξ΄),\mathbf{n}_{N+1}^{-1}\sum_{t\in[N+1]}{n}_{t}\left(C_{\rho}^{-1}\epsilon\right)^{\rho_{t}}\geq 32C_{0}^{2}C_{\beta}\cdot\varepsilon(\mathbf{n}_{N+1},\delta),

or more simply

βˆ‘t∈[N+1]nt​(CΟβˆ’1​ϡ)ρtβ‰₯32​C02​Cβ​(dℋ​log⁑(𝐧N+1dβ„‹)+log⁑(1Ξ΄)).\sum_{t\in[N+1]}{n}_{t}\left(C_{\rho}^{-1}\epsilon\right)^{\rho_{t}}\geq 32C_{0}^{2}C_{\beta}\left(d_{\mathcal{H}}\log\!\left(\frac{\mathbf{n}_{N+1}}{d_{\mathcal{H}}}\right)+\log\!\left(\frac{1}{\delta}\right)\right).

In particular, let us take

ϡ≐Cρ​(32​C02​Cβ​dℋ​log⁑(𝐧N+1/dβ„‹)+log⁑(1/Ξ΄)𝐧tβˆ—)1/ρ¯tβˆ—\epsilon\doteq C_{\rho}\left(32C_{0}^{2}C_{\beta}\frac{d_{\mathcal{H}}\log\!\left(\mathbf{n}_{N+1}/d_{\mathcal{H}}\right)+\log(1/\delta)}{\mathbf{n}_{t^{*}}}\right)^{1/\bar{\rho}_{t^{*}}}

where tβˆ—t^{*} is the value of tt that minimizes the right hand side of this definition of Ο΅\epsilon. Then

βˆ‘t∈[N+1]nt​(CΟβˆ’1​ϡ)ρt\displaystyle\sum_{t\in[N+1]}{n}_{t}\left(C_{\rho}^{-1}\epsilon\right)^{\rho_{t}} β‰₯𝐧tβˆ—β€‹βˆ‘t=1tβˆ—n(t)𝐧tβˆ—β€‹(CΟβˆ’1​ϡ)ρ(t)β‰₯𝐧tβˆ—β€‹(CΟβˆ’1​ϡ)ρ¯tβˆ—\displaystyle\geq\mathbf{n}_{t^{*}}\sum_{t=1}^{t^{*}}\frac{n_{(t)}}{\mathbf{n}_{t^{*}}}\left(C_{\rho}^{-1}\epsilon\right)^{\rho_{(t)}}\geq\mathbf{n}_{t^{*}}(C_{\rho}^{-1}\epsilon)^{\bar{\rho}_{t^{*}}}
=32​C02​Cβ​(dℋ​log⁑(𝐧N+1/dβ„‹)+log⁑(1/Ξ΄)).\displaystyle=32C_{0}^{2}C_{\beta}\left(d_{\mathcal{H}}\log\!\left(\mathbf{n}_{N+1}/d_{\mathcal{H}}\right)+\log(1/\delta)\right).

∎
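The monotonicity argument above also indicates how an explicit value of ε could be extracted numerically: the left hand side Σt nt(Cρ^{-1}ε)^{ρt} is increasing in ε, so the smallest ε satisfying the sufficient condition can be found by bisection. Below is a minimal sketch (Python; C0 is left as an input since its value comes from Lemma 1, and all names are ours and purely illustrative).

```python
import math

def smallest_epsilon(n, rho, d_H, delta, C0=1.0, C_beta=2.0, C_rho=2.0, iters=100):
    """Bisection for the smallest eps in (0, C_rho] such that
        sum_t n[t] * (eps / C_rho)**rho[t] >= 32*C0**2*C_beta*(d_H*log(sum(n)/d_H) + log(1/delta)).
    Returns C_rho if even eps = C_rho fails (the resulting bound is then vacuous,
    since the excess risk is at most 1 anyway)."""
    total = sum(n)
    target = 32 * C0**2 * C_beta * (d_H * math.log(total / d_H) + math.log(1.0 / delta))
    lhs = lambda eps: sum(nt * (eps / C_rho) ** rt for nt, rt in zip(n, rho))
    if lhs(C_rho) < target:
        return C_rho
    lo, hi = 0.0, C_rho
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if lhs(mid) >= target else (mid, hi)
    return hi

# Example: twenty easy source tasks (rho = 1) and five harder ones (rho = 3).
eps = smallest_epsilon(n=[50] * 25, rho=[1.0] * 20 + [3.0] * 5, d_H=3, delta=0.05)
```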

7.2 Aggregation Under Ranking Information

Proof of Theorem 4.

For any t∈[N+1]t\in[N+1] let PΒ―t=𝐧tβˆ’1β€‹βˆ‘s=1tn(s)​P(s)\bar{P}_{t}=\mathbf{n}_{t}^{-1}\sum_{s=1}^{t}n_{(s)}P_{(s)}. Let tβˆ—t^{*} be as in TheoremΒ 7, and by the same argument from the proof of TheoremΒ 7, (15) holds for h^\hat{h} in the present context as well (where PΒ―\bar{P} there is PΒ―tβˆ—\bar{P}_{t^{*}} in the present notation). Thus, the theorem will be proven if we can bound β„°PΒ―tβˆ—β€‹(h^)\mathcal{E}_{\bar{P}_{t^{*}}}(\hat{h}) by (c​Cβ​Ρ​(𝐧tβˆ—,Ξ΄tβˆ—))1/(2βˆ’Ξ²)\left(cC_{\beta}\varepsilon(\mathbf{n}_{t^{*}},\delta_{t^{*}})\right)^{1/(2-\beta)} for some numerical constant cc.

By a union bound, with probability at least 1βˆ’βˆ‘t=1N+1Ξ΄tβ‰₯1βˆ’Ξ΄1-\sum_{t=1}^{N+1}\delta_{t}\geq 1-\delta, for every t∈[N+1]t\in[N+1], the event from LemmaΒ 1 holds for S=Z(t)S=Z^{(t)} (and with Ξ΄t\delta_{t} in place of Ξ΄\delta there). In particular, this implies that for every t∈[N+1]t\in[N+1],

β„°^Z(t)​(hβˆ—;h^Z(t))≀C0​ℙ^Z(t)​(hβˆ—β‰ h^Z(t))​Ρ​(𝐧t,Ξ΄t)+C0​Ρ​(𝐧t,Ξ΄t),\hat{\mathcal{E}}_{Z^{(t)}}(h^{\!*};\hat{h}_{Z^{(t)}})\leq C_{0}\sqrt{\hat{\mathbb{P}}_{Z^{(t)}}(h^{\!*}\neq\hat{h}_{Z^{(t)}})\varepsilon(\mathbf{n}_{t},\delta_{t})}+C_{0}\varepsilon(\mathbf{n}_{t},\delta_{t}),

so that there do exist classifiers hh in β„‹\mathcal{H} satisfying (4), and hence h^\hat{h} satisfies (4). In particular, this implies h^\hat{h} satisfies the inequality in (4) for t=tβˆ—t=t^{*}. By LemmaΒ 2, this implies that on this same event from above, it holds that

β„°PΒ―tβˆ—β€‹(h^)≀32​C02​(Cβ​Ρ​(𝐧tβˆ—,Ξ΄tβˆ—))1/(2βˆ’Ξ²),\mathcal{E}_{\bar{P}_{t^{*}}}(\hat{h})\leq 32C_{0}^{2}(C_{\beta}\varepsilon(\mathbf{n}_{t^{*}},\delta_{t^{*}}))^{1/(2-\beta)},

which completes the proof (for instance, taking c=210​C04c=2^{10}C_{0}^{4}). ∎

8 Impossibility of Adaptivity over Multisource Classes β„³\mathcal{M}

Theorem 5 follows as a corollary to Theorem 8, the main result of this section. In particular, the second part of Theorem 5 follows from the condition on β„³\mathcal{M} that it contains enough tasks with ρt=1\rho_{t}=1, together with an appeal to Theorem 4.

Theorem 8 (Impossibility of Adaptivity).

Pick Cρ=3C_{\rho}=3, and any 0≀β<10\leq\beta<1, CΞ²β‰₯2C_{\beta}\geq 2, a number of samples per source task 1≀n<2/Ξ²βˆ’11\leq n<2/\beta-1, and a number of target samples nπ’Ÿβ‰₯0n_{\cal D}\geq 0. Let the number of source tasks N=NP+NQN=N_{P}+N_{Q}, for NP,NQN_{P},N_{Q} as specified below. There exist universal constants C0,C1,c>0C_{0},C_{1},c>0 such that the following holds.

Choose any NQβ‰₯C0N_{Q}\geq C_{0}, and suppose NPN_{P} is sufficiently large so that NPβ‰₯3​NQN_{P}\geq 3N_{Q}, and furthermore

NP(2βˆ’(n+1)​β)/(2βˆ’Ξ²)β‰₯C1β‹…NQ2β‹…215​n.N_{P}^{(2-(n+1)\beta)/(2-\beta)}\geq C_{1}\cdot N_{Q}^{2}\cdot 2^{15n}.

Let π“œ\boldsymbol{\mathcal{M}} denote the family of multisource classes β„³\mathcal{M} satisfying the above conditions, with parameters nt=n,βˆ€t∈[N]n_{t}=n,\forall t\in[N], nN+1=nπ’Ÿn_{N+1}=n_{\cal D}, and, in addition, such that at least 12​NQ\frac{1}{2}N_{Q} of the exponents {ρt}t∈[N]\left\{\rho_{t}\right\}_{t\in[N]} are at most 11.

βˆ™\bullet Let h^\hat{h} denote any classification procedure having access to Z∼Π,Ξ βˆˆβ„³Z\sim\Pi,\,\Pi\in\mathcal{M}, but without knowledge of β„³\mathcal{M}. We have:

infh^supβ„³βˆˆπ“œsupΞ βˆˆβ„³β„™Ξ β€‹(β„°π’Ÿβ€‹(h^)β‰₯14β‹…(1∧nπ’Ÿβˆ’1/(2βˆ’Ξ²)))β‰₯c.\displaystyle\inf_{\hat{h}}\sup_{{\cal M}\in\boldsymbol{\mathcal{M}}}\sup_{\Pi\in\mathcal{M}}\mathbb{P}_{\Pi}\left(\mathcal{E}_{{\cal D}}(\hat{h})\geq\frac{1}{4}\cdot\left(1\land n_{\cal D}^{-1/(2-\beta)}\right)\right)\geq c.

βˆ™\bullet On the other hand, there exists a semi-adaptive classifier h~\tilde{h}, which, given data Z∼ΠZ\sim\Pi, along with a ranking {ρ(t)}t∈[N+1]\{\rho_{(t)}\}_{t\in[N+1]} of increasing exponent values, achieves the rate

supβ„³βˆˆπ“œsupΞ βˆˆβ„³π”ΌΞ β€‹[β„°π’Ÿβ€‹(h~)]≲(nβ‹…NQ)βˆ’1/(2βˆ’Ξ²).\sup_{{\cal M}\in\boldsymbol{\mathcal{M}}}\sup_{\Pi\in\mathcal{M}}\,\mathbb{E}_{\Pi}\left[\mathcal{E}_{\cal D}(\tilde{h})\right]\lesssim\left(n\cdot N_{Q}\right)^{-1/(2-\beta)}.

As it turns out, the above theorem in fact holds for even smaller classes β„³\mathcal{M} admitting just two choices of distributions, namely PΟƒP_{\sigma} and QΟƒQ_{\sigma} (for fixed Οƒβˆˆ{Β±1}\sigma\in\left\{\pm 1\right\}) below, to which the sources PtP_{t} might be assigned. In fact, the result holds even if h^\hat{h} has knowledge of the construction below, but of course not of Οƒ\sigma.

Construction.

We build on the following distributions supported on 2 points x0,x1x_{0},x_{1}: here we simply assume that β„‹\mathcal{H} contains at least 2 classifiers that disagree on x1x_{1} but agree on x0x_{0} (this will be the case, for some choice of x0,x1x_{0},x_{1}, if β„‹\mathcal{H} contains at least 3 classifiers). Therefore, w.l.o.g., assume that x0x_{0} has label 1. Let nβ‰₯1,nπ’Ÿβ‰₯0,NP,NQβ‰₯1n\geq 1,n_{\cal D}\geq 0,N_{P},N_{Q}\geq 1, 0≀β<10\leq\beta<1, and define ϡ≐(nβ‹…NP)βˆ’1/(2βˆ’Ξ²)\epsilon\doteq(n\cdot N_{P})^{-1/(2-\beta)} and Ο΅0≐1∧nπ’Ÿβˆ’1/(2βˆ’Ξ²)\epsilon_{0}\doteq 1\land n_{\cal D}^{-1/(2-\beta)}. Let Οƒβˆˆ{Β±1}\sigma\in\left\{\pm 1\right\} – which we will often abbreviate as Β±\pm. In all that follows, we let ημ​(X)\eta_{\mu}(X) denote the regression function ℙμ​[Y=1∣X]\mathbb{P}_{\mu}[Y=1\mid X] under distribution ΞΌ\mu.

  •

    Target π’ŸΟƒ=π’ŸXΓ—π’ŸY|XΟƒ{\cal D}_{\sigma}={\cal D}_{X}\times{\cal D}^{\sigma}_{Y|X}: Let π’ŸX​(x1)=12β‹…Ο΅0Ξ²{\cal D}_{X}(x_{1})=\frac{1}{2}\cdot\epsilon_{0}^{\beta}, π’ŸX​(x0)=1βˆ’12β‹…Ο΅0Ξ²{\cal D}_{X}(x_{0})=1-\frac{1}{2}\cdot\epsilon_{0}^{\beta}; finally π’ŸY|XΟƒ{\cal D}^{\sigma}_{Y|X} is determined by
    Ξ·π’Ÿ,σ​(x1)=1/2+Οƒβ‹…c0​ϡ01βˆ’Ξ²\eta_{{\cal D},\sigma}(x_{1})=1/2+\sigma\cdot{c_{0}}\epsilon_{0}^{1-\beta}, and Ξ·π’Ÿ,σ​(x0)=1\eta_{{\cal D},\sigma}(x_{0})=1, for some c0c_{0} to be specified later.

  •

    Noisy PΟƒ=PXΓ—PY|XΟƒP_{\sigma}=P_{X}\times P^{\sigma}_{Y|X}: Let PX​(x1)=c1​ϡβP_{X}(x_{1})=c_{1}\epsilon^{\beta}, PX​(x0)=1βˆ’c1​ϡβP_{X}(x_{0})=1-c_{1}\epsilon^{\beta}; finally PY|XΟƒP^{\sigma}_{Y|X} is determined by
    Ξ·P,σ​(x1)=1/2+Οƒβ‹…Ο΅1βˆ’Ξ²\eta_{P,\sigma}(x_{1})=1/2+\sigma\cdot\epsilon^{1-\beta}, and Ξ·P,σ​(x0)=1\eta_{P,\sigma}(x_{0})=1, for an appropriate constant c1c_{1} specified later.

  •

    Benign QΟƒ=QXΓ—QY|XΟƒQ_{\sigma}=Q_{X}\times Q^{\sigma}_{Y|X}: Let QX​(x1)=1Q_{X}(x_{1})=1; finally QY|XΟƒQ^{\sigma}_{Y|X} is determined by Ξ·Q,σ​(x1)=1/2+Οƒ/2\eta_{Q,\sigma}(x_{1})=1/2+\sigma/2.

The above construction is such that PΟƒP_{\sigma} can be pushed far from π’ŸΟƒ{\cal D}_{\sigma}, while QΟƒQ_{\sigma} remains close to π’ŸΟƒ{\cal D}_{\sigma}. This is formalized in the following proposition, which can be verified by checking that the required inequalities (from DefinitionsΒ 2 and 4) are satisfied in all 4 cases of classifications of x0x_{0}, x1x_{1}.

Proposition 1 (Exponents of PP and QQ w.r.t. π’Ÿ{\cal D}).

PΟƒP_{\sigma} and QΟƒQ_{\sigma} have transfer-exponents (3,ρP)(3,\rho_{P}) and (3,ρQ)(3,\rho_{Q}), respectively w.r.t. π’ŸΟƒ{\cal D}_{\sigma}, with ρPβ‰₯log⁑(c1βˆ’(2βˆ’Ξ²)​nβ‹…NP)log⁑(c0βˆ’(2βˆ’Ξ²)​(1∨nπ’Ÿ))\rho_{P}\geq\frac{\log(c_{1}^{-(2-\beta)}n\cdot N_{P})}{\log(c_{0}^{-(2-\beta)}(1\lor n_{\cal D}))} and ρQ≀1\rho_{Q}\leq 1. Furthermore, the 3 distributions PΟƒ,QΟƒ,π’ŸΟƒP_{\sigma},Q_{\sigma},{\cal D}_{\sigma} satisfy the Bernstein class condition with parameters (CΞ²,Ξ²),CΞ²=max⁑{(1/2)​c0βˆ’Ξ²,2}{(C_{\beta},\beta)},C_{\beta}=\max\{(1/2)c_{0}^{-\beta},2\}.

Approximate Multitask.

Let N≐NP+NQN\doteq N_{P}+N_{Q}, Ξ±P≐NP/N\alpha_{P}\doteq N_{P}/N and Ξ±Q=NQ/N\alpha_{Q}=N_{Q}/N. Now consider the distribution

Γσ=(Ξ±P​PΟƒn+Ξ±Q​QΟƒn)NΓ—π’ŸΟƒnπ’Ÿ.\Gamma_{\sigma}=\left(\alpha_{P}P_{\sigma}^{n}+\alpha_{Q}Q_{\sigma}^{n}\right)^{N}\times{\cal D}_{\sigma}^{n_{\cal D}}.
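To make the construction and the role of homogeneous vectors tangible, here is a minimal simulation sketch (Python with numpy; all function names are ours, and c0, c1 are the free constants of the construction). It samples a multitask dataset Z ∼ Γσ and computes the counts of homogeneous all-x1 source blocks and of ± labels at x1, which reappear below as the sufficient statistics of Proposition 2.

```python
import numpy as np

def sample_block(dist, sigma, m, eps, eps0, beta, c0, c1, rng):
    """One block of m points from P_sigma ('P'), Q_sigma ('Q'), or D_sigma ('D').
    Points: x0 -> 0, x1 -> 1; labels in {-1, +1}; x0 is always labeled +1."""
    if dist == 'Q':                                            # benign: all mass on x1, label = sigma
        return np.ones(m, dtype=int), np.full(m, sigma, dtype=int)
    if dist == 'P':
        p_x1, eta_x1 = c1 * eps ** beta, 0.5 + sigma * eps ** (1.0 - beta)
    else:                                                      # 'D', the target
        p_x1, eta_x1 = 0.5 * eps0 ** beta, 0.5 + sigma * c0 * eps0 ** (1.0 - beta)
    X = (rng.random(m) < p_x1).astype(int)
    Y = np.where(X == 0, 1, np.where(rng.random(m) < eta_x1, 1, -1))
    return X, Y

def sample_Gamma(sigma, n, n_D, N_P, N_Q, beta, c0=0.25, c1=2.0 ** -10, seed=0):
    """Draw Z ~ Gamma_sigma: N source blocks of size n, each P_sigma^n or Q_sigma^n with
    probabilities alpha_P, alpha_Q, followed by n_D target points from D_sigma."""
    rng = np.random.default_rng(seed)
    eps = (n * N_P) ** (-1.0 / (2.0 - beta))
    eps0 = min(1.0, n_D ** (-1.0 / (2.0 - beta))) if n_D > 0 else 1.0
    N = N_P + N_Q
    kinds = rng.choice(['P', 'Q'], size=N, p=[N_P / N, N_Q / N])
    Z = [sample_block(k, sigma, n, eps, eps0, beta, c0, c1, rng) for k in kinds]
    Z.append(sample_block('D', sigma, n_D, eps, eps0, beta, c0, c1, rng))
    return Z

def sufficient_stats(Z):
    """N_hat[s]: homogeneous all-x1 source blocks with constant label s; n_hat[s]: s-labels at x1."""
    N_hat, n_hat = {+1: 0, -1: 0}, {+1: 0, -1: 0}
    for X, Y in Z[:-1]:                                        # source blocks only
        for s in (+1, -1):
            N_hat[s] += int(len(X) > 0 and np.all(X == 1) and np.all(Y == s))
            n_hat[s] += int(np.sum((X == 1) & (Y == s)))
    return N_hat, n_hat

Z = sample_Gamma(sigma=-1, n=3, n_D=10, N_P=3000, N_Q=30, beta=0.5)
N_hat, n_hat = sufficient_stats(Z)
```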

Proof Strategy.

Henceforth, let Z={Zt}t=1N+1Z=\left\{Z_{t}\right\}_{t=1}^{N+1} denote a random draw from Γσ\Gamma_{\sigma}, where, for t∈[N]t\in[N], Zt={(Xt,i,Yt,i)}i=1nZ_{t}=\left\{(X_{t,i},Y_{t,i})\right\}_{i=1}^{n} denotes the tt-th source sample, and ZN+1βˆΌπ’ŸΟƒnπ’ŸZ_{N+1}\sim{\cal D}_{\sigma}^{n_{\cal D}} denotes the target sample. Much of the effort will be in showing that any learner has nontrivial error in guessing Οƒ\sigma from a random ZZ; we then reduce this fact to a lower-bound on the risk of an adaptive classifier that takes as input a multitask sample of the form Zβ€²βˆΌβˆt=1NPtnΓ—π’ŸΟƒnπ’ŸZ^{\prime}\sim\prod_{t=1}^{N}P_{t}^{n}\times{\cal D}_{\sigma}^{n_{\cal D}}, where the PtP_{t}’s denote PΟƒP_{\sigma} or QΟƒQ_{\sigma}.

For intuition, the true label Οƒ\sigma is hard to guess from samples Zt∼PΟƒnZ_{t}\sim P_{\sigma}^{n} alone since Ξ·P,Οƒβ‰ˆ1/2\eta_{P,\sigma}\approx 1/2, but easy to guess from samples Zt∼QΟƒnZ_{t}\sim Q_{\sigma}^{n}, each being a homogeneous vector of nn points x1x_{1} with identical labels Y=ΟƒY=\sigma. However, such homogeneous vectors can be produced by PΟƒnP_{\sigma}^{n} also, with identical but flipped labels Y=βˆ’ΟƒY=-\sigma, and the product of mixtures Γσ\Gamma_{\sigma} adds in enough randomness to make it hard to guess the source of a given vector ZtZ_{t}. In fact, this intuition becomes evident by considering the likelihood-ratio between Ξ“+\Gamma_{+} and Ξ“βˆ’\Gamma_{-}, which decomposes into the contribution of homogeneous vs non-homogeneous vectors. This is formalized in the following proposition.

Proposition 2 (Likelihood ratio and sufficient statistics).

Let ZβˆΌΞ“βˆ’Z\sim\Gamma_{-}, and let N^+{\hat{N}}_{+} and N^βˆ’{\hat{N}}_{-} denote the number of homogeneous vectors ZtZ_{t} in ZZ with marginals at x1x_{1}, that is:

N^σ​(Z)β‰βˆ‘t∈[N]1​{βˆ€i∈[n],Xt,i=x1∧Yt,i=Οƒ}.{\hat{N}}_{\sigma}(Z)\doteq\sum_{t\in[N]}{\mathbbold 1}\!\left\{\forall i\in[n],X_{t,i}=x_{1}\,\land\,Y_{t,i}=\sigma\right\}.

Next, let n^+{\hat{n}}_{+} and n^βˆ’{\hat{n}}_{-} denote the total number of Β±1\pm 1 labels in ZZ, over occurrences of x1x_{1}. That is:

n^σ​(Z)β‰βˆ‘t∈[N],i∈[n]1​{Xt,i=x1∧Yt,i=Οƒ}.{\hat{n}}_{\sigma}(Z)\doteq\sum_{t\in[N],i\in[n]}{\mathbbold 1}\!\left\{X_{t,i}=x_{1}\,\land\,Y_{t,i}=\sigma\right\}.

We then have that (where β„™\mathbb{P} is under the randomness of ZβˆΌΞ“βˆ’Z\sim\Gamma_{-}, and we let π’ŸΟƒβ€‹(ZN+1)=1{\cal D}_{\sigma}(Z_{N+1})=1 when nπ’Ÿ=0n_{\cal D}=0):

ℙ​(Ξ“+​(Z)Ξ“βˆ’β€‹(Z)>1)β‰₯ℙ​(N^+​(Z)>N^βˆ’β€‹(Z)∧n^+​(Z)β‰₯n^βˆ’β€‹(Z))⋅ℙ​(π’Ÿ+​(ZN+1)π’Ÿβˆ’β€‹(ZN+1)β‰₯1).\displaystyle\mathbb{P}\left(\frac{\Gamma_{+}(Z)}{\Gamma_{-}(Z)}>1\right)\geq\mathbb{P}\left({\hat{N}}_{+}(Z)>{\hat{N}}_{-}(Z)\,\land\,{\hat{n}}_{+}(Z)\geq{\hat{n}}_{-}(Z)\right)\cdot\mathbb{P}\left(\frac{{\cal D}_{+}(Z_{N+1})}{{\cal D}_{-}(Z_{N+1})}\geq 1\right). (16)
Remark 8 (Likelihood Ratio and Optimal Discriminants).

Consider sampling ZZ from the mixture 12​Γ++12β€‹Ξ“βˆ’\frac{1}{2}\Gamma_{+}+\frac{1}{2}\Gamma_{-}. Then the Bayes classifier (for identifying Οƒ=Β±\sigma=\pm) returns +1+1 if Ξ“+​(Z)>Ξ“βˆ’β€‹(Z)\Gamma_{+}(Z)>\Gamma_{-}(Z), and therefore, given the symmetry between Γ±\Gamma_{\pm}, has probability of misclassification at least β„™Ξ“βˆ’β€‹(Ξ“+​(Z)>Ξ“βˆ’β€‹(Z))\mathbb{P}_{\Gamma_{-}}\left({\Gamma_{+}(Z)}>{\Gamma_{-}(Z)}\right). We emphasize here that enforcing that Γσ\Gamma_{\sigma} be defined in terms of a mixture – rather than as product of PΟƒP_{\sigma} and QΟƒQ_{\sigma} terms – allows us to bound β„™Ξ“βˆ’β€‹(Ξ“+​(Z)>Ξ“βˆ’β€‹(Z))\mathbb{P}_{\Gamma_{-}}\left({\Gamma_{+}(Z)}>{\Gamma_{-}(Z)}\right) below as in Proposition 2 above; otherwise this probability is always 0 for a product distribution containing QΟƒQ_{\sigma} since Qσ​(Zt)=0Q_{\sigma}(Z_{t})=0 whenever some Yt,i=βˆ’ΟƒY_{t,i}=-\sigma. In fact a product distribution inherently encodes the idea that the learner knows the positions of PΟƒP_{\sigma} and QΟƒQ_{\sigma} vectors in ZZ, and can therefore simply focus on QΟƒQ_{\sigma} vectors to easily discover Οƒ\sigma.

We now further reduce the r.h.s. of (16) to events that are simpler to bound: the main strategy is to properly condition on intermediate events to reveal independences that induce simple i.i.d. Bernoulli variables we can exploit. Towards this end, we first consider the high-probability events of the following proposition.

Proposition 3.

Let ZβˆΌΞ“βˆ’Z\sim\Gamma_{-}. Define N^P​(Z),N^Q​(Z){\hat{N}}_{P}(Z),{\hat{N}}_{Q}(Z) as the number of homogeneous vectors
{Zt:βˆ€i,j∈[n],Xt,i=Xt,j=x1∧Yt,i=Yt,j}\left\{Z_{t}:\forall i,j\in[n],X_{t,i}=X_{t,j}=x_{1}\,\land\,Y_{t,i}=Y_{t,j}\right\} generated respectively by Pβˆ’P_{-}, and Qβˆ’Q_{-}.

Define the events (on ZZ):

𝐄P≐{𝔼​[N^P]/2≀N^P≀2​𝔼​[N^P]}​ and ​𝐄Q≐{N^Q≀2​𝔼​[N^Q]}.\mathbf{E}_{P}\doteq\left\{\mathbb{E}\left[{\hat{N}}_{P}\right]/2\leq{\hat{N}}_{P}\leq 2\mathbb{E}\left[{\hat{N}}_{P}\right]\right\}\text{ and }\mathbf{E}_{Q}\doteq\left\{{\hat{N}}_{Q}\leq 2\mathbb{E}\left[{\hat{N}}_{Q}\right]\right\}.

We have that ℙ​(𝐄P𝖼)≀2​exp⁑(βˆ’π”Όβ€‹[N^P]/8)\mathbb{P}\left(\mathbf{E}_{P}^{\mathsf{c}}\right)\leq 2\exp\left(-\mathbb{E}\left[{\hat{N}}_{P}\right]/8\right) while ℙ​(𝐄Q𝖼)≀exp⁑(βˆ’π”Όβ€‹[N^Q]/3)\mathbb{P}\left(\mathbf{E}_{Q}^{\mathsf{c}}\right)\leq\exp\left(-\mathbb{E}\left[{\hat{N}}_{Q}\right]/3\right).

Proof.

The proposition follows from multiplicative Chernoff bounds. ∎

Notice that, in the above proposition, for ZβˆΌΞ“βˆ’Z\sim\Gamma_{-}, the expectations in the events are given by

𝔼​[N^P​(Z)]=NP​PXn​(x1)​(Ξ·P,βˆ’n​(x1)+Ξ·P,+n​(x1))​ and ​𝔼​[N^Q​(Z)]=NQ.\displaystyle\mathbb{E}\left[{\hat{N}}_{P}(Z)\right]=N_{P}P_{X}^{n}(x_{1})\left(\eta_{P,-}^{n}(x_{1})+\eta_{P,+}^{n}(x_{1})\right)\text{ and }\mathbb{E}\left[{\hat{N}}_{Q}(Z)\right]=N_{Q}. (17)

By the proposition, we just need these quantities large enough for the events to hold with sufficient probability. The next proposition conditions on these events to reduce the likelihood to simpler events.

Proposition 4 (Further Reducing (16)).

Let ZβˆΌΞ“βˆ’Z\sim\Gamma_{-}. For Οƒ=Β±\sigma=\pm, let N^σ​(Z){\hat{N}}_{\sigma}(Z) and n^σ​(Z){\hat{n}}_{\sigma}(Z) be as defined in PropositionΒ 2. Furthermore, let n~σ​(Z)≐n^σ​(Z)βˆ’(nβ‹…N^σ​(Z))\tilde{n}_{\sigma}(Z)\doteq{\hat{n}}_{\sigma}(Z)-(n\cdot{\hat{N}}_{\sigma}(Z)) denote the total number of Β±\pm labels at x1x_{1} excluding homogeneous vectors. In all that follows we let β„™\mathbb{P} denote β„™Ξ“βˆ’\mathbb{P}_{\Gamma_{-}}, and we drop the dependence on ZZ for ease of notation.

Let N^P,N^Q{\hat{N}}_{P},{\hat{N}}_{Q} and the events 𝐄P,𝐄Q\mathbf{E}_{P},\mathbf{E}_{Q} be as defined in PropositionΒ 3. Suppose that for some Ξ΄1,Ξ΄2>0\delta_{1},\delta_{2}>0, we have

  (i)

    ℙ​(N^+>N^βˆ’βˆ£N^P,N^Q)​1​{𝐄Pβˆ©π„Q}β‰₯Ξ΄1β‹…1​{𝐄Pβˆ©π„Q}\mathbb{P}\left({\hat{N}}_{+}>{\hat{N}}_{-}\mid{\hat{N}}_{P},{\hat{N}}_{Q}\right){\mathbbold 1}\!\left\{\mathbf{E}_{P}\cap\mathbf{E}_{Q}\right\}\geq\delta_{1}\cdot{\mathbbold 1}\!\left\{\mathbf{E}_{P}\cap\mathbf{E}_{Q}\right\}.

  (ii)

    ℙ​(n~+>n~βˆ’)β‰₯Ξ΄2\mathbb{P}\left(\tilde{n}_{+}>\tilde{n}_{-}\right)\geq\delta_{2}, further assuming that n>1n>1 so that the event is well defined (i.e., n~Β±\tilde{n}_{\pm} are not both 0).

We then have that:

ℙ​(N^+β‰₯N^βˆ’βˆ§n^+β‰₯n^βˆ’)β‰₯Ξ΄1β‹…(Ξ΄2βˆ’β„™β€‹(𝐄P𝖼)βˆ’β„™β€‹(𝐄Q𝖼)).\displaystyle\mathbb{P}\left({\hat{N}}_{+}\geq{\hat{N}}_{-}\,\land\,{\hat{n}}_{+}\geq{\hat{n}}_{-}\right)\geq\delta_{1}\cdot\left(\delta_{2}-\mathbb{P}\left(\mathbf{E}_{P}^{\mathsf{c}}\right)-\mathbb{P}\left(\mathbf{E}_{Q}^{\mathsf{c}}\right)\right). (18)
Proof.

Let A≐{n^+β‰₯n^βˆ’}A\doteq\left\{{\hat{n}}_{+}\geq{\hat{n}}_{-}\right\}, B≐{N^+>N^βˆ’}B\doteq\left\{{\hat{N}}_{+}>{\hat{N}}_{-}\right\}, and A~≐{n~+>n~βˆ’}\tilde{A}\doteq\left\{\tilde{n}_{+}>\tilde{n}_{-}\right\}. First notice that, by definition, A~∩B⟹A∩B\tilde{A}\cap B\implies A\cap B, so we just need to bound ℙ​(A~∩B)\mathbb{P}(\tilde{A}\cap B) below. We have:

ℙ​(A~∩B)\displaystyle\mathbb{P}\left(\tilde{A}\cap B\right) =𝔼​[ℙ​(A~∩B∣N^P,N^Q)]=𝔼​[ℙ​(A~∣N^P,N^Q)⋅ℙ​(B∣N^P,N^Q)]\displaystyle=\mathbb{E}\left[\mathbb{P}\left(\tilde{A}\cap B\mid{\hat{N}}_{P},{\hat{N}}_{Q}\right)\right]={\mathbb{E}\left[\mathbb{P}\left(\tilde{A}\mid{\hat{N}}_{P},{\hat{N}}_{Q}\right)\cdot\mathbb{P}\left(B\mid{\hat{N}}_{P},{\hat{N}}_{Q}\right)\right]} (19)
β‰₯Ξ΄1⋅𝔼​[ℙ​(A~∣N^P,N^Q)β‹…1​{𝐄Pβˆ©π„Q}]\displaystyle\geq\delta_{1}\cdot\mathbb{E}\left[\mathbb{P}\left(\tilde{A}\mid{\hat{N}}_{P},{\hat{N}}_{Q}\right)\cdot{\mathbbold 1}\!\left\{\mathbf{E}_{P}\cap\mathbf{E}_{Q}\right\}\right]
β‰₯Ξ΄1⋅𝔼​[ℙ​(A~∣N^P,N^Q)βˆ’1​{𝐄P𝖼βˆͺ𝐄Q𝖼}]=Ξ΄1β‹…(ℙ​(A~)βˆ’β„™β€‹(𝐄P𝖼βˆͺ𝐄Q𝖼))\displaystyle\geq\delta_{1}\cdot\mathbb{E}\left[\mathbb{P}\left(\tilde{A}\mid{\hat{N}}_{P},{\hat{N}}_{Q}\right)-{\mathbbold 1}\!\left\{\mathbf{E}_{P}^{\mathsf{c}}\cup\mathbf{E}_{Q}^{\mathsf{c}}\right\}\right]=\delta_{1}\cdot\left(\mathbb{P}\left(\tilde{A}\right)-\mathbb{P}\left(\mathbf{E}_{P}^{\mathsf{c}}\cup\mathbf{E}_{Q}^{\mathsf{c}}\right)\right) (20)
β‰₯Ξ΄1β‹…(Ξ΄2βˆ’β„™β€‹(𝐄P𝖼)βˆ’β„™β€‹(𝐄Q𝖼)).\displaystyle\geq\delta_{1}\cdot\left(\delta_{2}-\mathbb{P}\left(\mathbf{E}_{P}^{\mathsf{c}}\right)-\mathbb{P}\left(\mathbf{E}_{Q}^{\mathsf{c}}\right)\right).

In (19), we used the fact that, all vectors in ZZ being independent, A~\tilde{A} is independent of BB given any value of {N^P,N^Q}\left\{{\hat{N}}_{P},{\hat{N}}_{Q}\right\}; in (20) we used the fact that for any p∈[0,1]p\in[0,1], we have pβ‹…1​{𝐄}=pβˆ’pβ‹…1​{𝐄𝖼}β‰₯pβˆ’1​{𝐄𝖼}p\cdot{\mathbbold 1}\!\left\{\mathbf{E}\right\}=p-p\cdot{\mathbbold 1}\!\left\{\mathbf{E}^{\mathsf{c}}\right\}\geq p-{\mathbbold 1}\!\left\{\mathbf{E}^{\mathsf{c}}\right\}. ∎

The following is a known counterpart of Chernoff bounds, following from Slud’s inequalities Slu (77) for Binomials.

Lemma 3 (Anticoncentration (Slud’s Inequality)).

Let {Xi}i∈[m]\left\{X_{i}\right\}_{i\in[m]} denote i.i.d. Bernoulli random variables with parameter 0<p≀1/20<p\leq 1/2. Then for any 0≀m0≀m​(1βˆ’2​p)0\leq m_{0}\leq m(1-2p), we have

ℙ​(βˆ‘i∈[m]Xi>m​p+m0)β‰₯14​exp⁑(βˆ’m02m​p​(1βˆ’p)).\displaystyle\mathbb{P}\left(\sum_{i\in[m]}X_{i}>mp+m_{0}\right)\geq\frac{1}{4}\exp\left(\frac{-m_{0}^{2}}{mp(1-p)}\right).
Proof.

This form of the lower bound follows from Theorem 2.1 of Slu (77) – which here implies a lower bound 1βˆ’Ξ¦β€‹(m0/m​p​(1βˆ’p))1-\Phi(m_{0}/\sqrt{mp(1-p)}), for Ξ¦\Phi the standard normal CDF – together with the well-known bound: 1βˆ’Ξ¦β€‹(x)β‰₯12​(1βˆ’1βˆ’eβˆ’x2)β‰₯14​eβˆ’x21-\Phi(x)\geq\frac{1}{2}\left(1-\sqrt{1-e^{-x^{2}}}\right)\geq\frac{1}{4}e^{-x^{2}} Tat (53). See Mou (10) for a detailed derivation of a similar expression. ∎
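As a quick numerical illustration of Lemma 3 (an illustrative check only, assuming scipy is available; the values of m, p and the grid of deviations m0 ≤ m(1−2p) are arbitrary choices), one can compare the exact Binomial tail with the stated lower bound:

```python
import numpy as np
from scipy.stats import binom

m, p = 200, 0.3
for m0 in np.linspace(0.0, m * (1 - 2 * p), 6):
    exact = binom.sf(np.floor(m * p + m0), m, p)          # P(Bin(m, p) > m*p + m0)
    slud = 0.25 * np.exp(-m0 ** 2 / (m * p * (1 - p)))    # lower bound from Lemma 3
    print(f"m0 = {m0:6.1f}   exact tail = {exact:.3e}   Slud bound = {slud:.3e}")
```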

Proposition 5 (Ξ΄1\delta_{1} from Proposition 4).

Pick any 0≀β<10\leq\beta<1, 1≀n<2/Ξ²βˆ’11\leq n<2/\beta-1, and NQβ‰₯1N_{Q}\geq 1. Let 0<c1≀1/640<c_{1}\leq 1/64 in the construction of Ξ·P,Οƒ\eta_{P,\sigma}, Οƒ=Β±\sigma=\pm. Suppose NPN_{P} is sufficiently large so that

(nβ‹…NP)(2βˆ’(n+1)​β)/(2βˆ’Ξ²)β‰₯4β‹…NQ2β‹…nβ‹…24​n​c1βˆ’n.(n\cdot N_{P})^{(2-(n+1)\beta)/(2-\beta)}\geq 4\cdot N_{Q}^{2}\cdot n\cdot 2^{4n}c_{1}^{-n}.

Let 𝐄=𝐄Pβˆ©π„Q\mathbf{E}=\mathbf{E}_{P}\cap\mathbf{E}_{Q} as defined in Proposition 3 over ZβˆΌΞ“βˆ’Z\sim\Gamma_{-}. Adopting the notation of Proposition 4, we have

ℙ​(N^+>N^βˆ’βˆ£N^P,N^Q)​1​{𝐄}β‰₯112β‹…1​{𝐄}.\mathbb{P}\left({\hat{N}}_{+}>{\hat{N}}_{-}\mid{\hat{N}}_{P},{\hat{N}}_{Q}\right){\mathbbold 1}\!\left\{\mathbf{E}\right\}\geq\frac{1}{12}\cdot{\mathbbold 1}\!\left\{\mathbf{E}\right\}.
Proof.

Consider NH={N^P,N^Q}N_{H}=\left\{{\hat{N}}_{P},{\hat{N}}_{Q}\right\} such that 𝐄≐𝐄Pβˆ©π„Q\mathbf{E}\doteq\mathbf{E}_{P}\cap\mathbf{E}_{Q} holds. Let B~≐{N^+>12​N^P+𝔼​[N^Q]}\tilde{B}\doteq\left\{{\hat{N}}_{+}>\frac{1}{2}{\hat{N}}_{P}+\mathbb{E}\left[{\hat{N}}_{Q}\right]\right\}, and notice that B~βŠ‚{N^+>N^βˆ’}\tilde{B}\subset\left\{{\hat{N}}_{+}>{\hat{N}}_{-}\right\} under 𝐄Q≐{𝔼​[N^Q]β‰₯N^Q/2}\mathbf{E}_{Q}\doteq\left\{\mathbb{E}\left[{\hat{N}}_{Q}\right]\geq{\hat{N}}_{Q}/2\right\}. Therefore we only need to bound ℙ​(B~∣NH)=ℙ​(B~∣N^P)\mathbb{P}\left(\tilde{B}\mid N_{H}\right)=\mathbb{P}\left(\tilde{B}\mid{\hat{N}}_{P}\right). Now, for Οƒ=Β±\sigma=\pm, let 𝒡P,Οƒ{\cal Z}_{P,\sigma} denote the set of homogeneous vectors in {Zt:βˆ€i∈[n],Xt,i​x1∧Yt,i=Οƒ}\left\{Z_{t}:\forall i\in[n],X_{t,i}x_{1}\,\land\,Y_{t,i}=\sigma\right\} generated by Pβˆ’P_{-}, and notice that, conditioned on N^P{\hat{N}}_{P}, N^+{\hat{N}}_{+} is distributed as Binomial​(N^P,p)\text{Binomial}({\hat{N}}_{P},p), where

p\displaystyle p =β„™(Ztβˆˆπ’΅P,+∣Ztβˆˆπ’΅P,+βˆͺ𝒡P,βˆ’,N^P)=β„™(Ztβˆˆπ’΅P,+∣Ztβˆˆπ’΅P,+βˆͺ𝒡P,βˆ’)=Ξ·P,βˆ’n​(x1)Ξ·P,βˆ’n​(x1)+Ξ·P,+n​(x1).\displaystyle=\mathbb{P}\left(Z_{t}\in{\cal Z}_{P,+}\mid Z_{t}\in{\cal Z}_{P,+}\cup{\cal Z}_{P,-},{\hat{N}}_{P}\right)=\mathbb{P}\left(Z_{t}\in{\cal Z}_{P,+}\mid Z_{t}\in{\cal Z}_{P,+}\cup{\cal Z}_{P,-}\right)=\frac{\eta^{n}_{P,-}(x_{1})}{\eta^{n}_{P,-}(x_{1})+\eta^{n}_{P,+}(x_{1})}.

Therefore, applying Lemma 3,

ℙ​(N^+>N^Pβ‹…p+N^Pβ‹…p​(1βˆ’p)∣N^P)β‰₯112.\displaystyle\mathbb{P}\left({\hat{N}}_{+}>{\hat{N}}_{P}\cdot p+\sqrt{{\hat{N}}_{P}\cdot p(1-p)}\mid{\hat{N}}_{P}\right)\geq\frac{1}{12}.

We now just need to show that the event under the probability implies B~\tilde{B}, in other words, that

N^Pβ‹…p​(1βˆ’p)\displaystyle\sqrt{{\hat{N}}_{P}\cdot p(1-p)} β‰₯12​N^Pβˆ’N^Pβ‹…p+𝔼​[N^Q]=N^Pβ‹…(12βˆ’p)+NQ.\displaystyle\geq\frac{1}{2}{\hat{N}}_{P}-{\hat{N}}_{P}\cdot p+\mathbb{E}\left[{\hat{N}}_{Q}\right]={\hat{N}}_{P}\cdot\left(\frac{1}{2}-p\right)+N_{Q}. (21)

Next we upper bound the r.h.s of (21) and lower-bound its l.h.s. Under 𝐄P\mathbf{E}_{P} and using (17), we have:

N^Pβ‹…(12βˆ’p)\displaystyle{\hat{N}}_{P}\cdot\left(\frac{1}{2}-p\right) ≀𝔼​[N^P]β‹…((1βˆ’p)βˆ’p)=NP​PXn​(x1)β‹…(Ξ·P,+n​(x1)βˆ’Ξ·P,βˆ’n​(x1))\displaystyle\leq\mathbb{E}\left[{\hat{N}}_{P}\right]\cdot\left((1-p)-p\right)=N_{P}P_{X}^{n}(x_{1})\cdot\left(\eta^{n}_{P,+}(x_{1})-\eta^{n}_{P,-}(x_{1})\right)
≀NP​PXn​(x1)β‹…nβ‹…(Ξ·P,+​(x1)βˆ’Ξ·P,βˆ’β€‹(x1)),\displaystyle\leq N_{P}P_{X}^{n}(x_{1})\cdot n\cdot\left(\eta_{P,+}(x_{1})-\eta_{P,-}(x_{1})\right), (22)

where for the last inequality we used the fact that, for aβ‰₯b>0a\geq b>0, we have anβˆ’bn=(aβˆ’b)β€‹βˆ‘k=0nβˆ’1anβˆ’1β‹…(ba)ka^{n}-b^{n}=(a-b)\sum_{k=0}^{n-1}a^{n-1}\cdot\left(\frac{b}{a}\right)^{k}. On the other hand, the conditions on NPN_{P} let us lower-bound Ξ·P,βˆ’β€‹(x1)\eta_{P,-}(x_{1}) by 1/41/4, and we have:

N^Pβ‹…p​(1βˆ’p)\displaystyle{\hat{N}}_{P}\cdot p(1-p) β‰₯12​𝔼​[N^P]β‹…p​(1βˆ’p)=12​NP​PXn​(x1)β‹…Ξ·P,βˆ’n​(x1)β‹…Ξ·P,+n​(x1)Ξ·P,βˆ’n​(x1)+Ξ·P,+n​(x1)\displaystyle\geq\frac{1}{2}\mathbb{E}\left[{\hat{N}}_{P}\right]\cdot p(1-p)=\frac{1}{2}N_{P}P_{X}^{n}(x_{1})\cdot\frac{\eta^{n}_{P,-}(x_{1})\cdot\eta^{n}_{P,+}(x_{1})}{\eta^{n}_{P,-}(x_{1})+\eta^{n}_{P,+}(x_{1})}
β‰₯12​NP​PXn​(x1)β‹…2βˆ’3​n.\displaystyle\geq\frac{1}{2}N_{P}P_{X}^{n}(x_{1})\cdot 2^{-3n}. (23)

Combining (22) and (23), we see that B~\tilde{B} is implied by

2βˆ’(2​n)​NP​PXn​(x1)\displaystyle 2^{-(2n)}\sqrt{N_{P}P_{X}^{n}(x_{1})} β‰₯NP​PXn​(x1)β‹…nβ‹…(Ξ·P,+​(x1)βˆ’Ξ·P,βˆ’β€‹(x1))+NQ,Β which is in turn implied by\displaystyle\geq N_{P}P_{X}^{n}(x_{1})\cdot n\cdot\left(\eta_{P,+}(x_{1})-\eta_{P,-}(x_{1})\right)+N_{Q},\text{ which is in turn implied by}
nβˆ’1​2βˆ’(4​n)​c1n​(nβ‹…NP)2βˆ’(n+1)​β2βˆ’Ξ²\displaystyle n^{-1}2^{-(4n)}c_{1}^{n}\left(n\cdot N_{P}\right)^{\frac{2-(n+1)\beta}{2-\beta}} β‰₯2​c12​n​(nβ‹…NP)2βˆ’2​n​β2βˆ’Ξ²+2​NQ2.\displaystyle\geq 2c_{1}^{2n}\left(n\cdot N_{P}\right)^{\frac{2-2n\beta}{2-\beta}}+2N_{Q}^{2}.

The conditions of the proposition ensure that the above inequality holds. ∎

We obtain a bound on ℙ​(N^+>N^βˆ’)\mathbb{P}({\hat{N}}_{+}>{\hat{N}}_{-}) as an immediate corollary to the above proposition.

Corollary 1.

Under the conditions of Proposition 5, we have:

ℙ​(N^+>N^βˆ’)β‰₯𝔼​[ℙ​(N^+>N^βˆ’βˆ£N^P,N^Q)​1​{𝐄}]β‰₯112​ℙ​(𝐄).\mathbb{P}\left({\hat{N}}_{+}>{\hat{N}}_{-}\right)\geq\mathbb{E}\left[\mathbb{P}\left({\hat{N}}_{+}>{\hat{N}}_{-}\mid{\hat{N}}_{P},{\hat{N}}_{Q}\right){\mathbbold 1}\!\left\{\mathbf{E}\right\}\right]\geq\frac{1}{12}\mathbb{P}(\mathbf{E}).

We now turn to the second condition of Proposition 4.

Proposition 6 (Ξ΄2\delta_{2} from Proposition 4).

Let 0≀β<10\leq\beta<1, n>1n>1, and NQβ‰₯16N_{Q}\geq 16. Let 0<c1≀2βˆ’100<c_{1}\leq 2^{-10} in the construction of Ξ·P,Οƒ\eta_{P,\sigma}, Οƒ=Β±\sigma=\pm. Suppose NPN_{P} is sufficiently large so that

(i)​NPβ‰₯3​NQand(ii)​(nβ‹…NP)2βˆ’2​β2βˆ’Ξ²β‰₯4096β‹…n2β‹…c1βˆ’1.\text{(i)}\ N_{P}\geq 3N_{Q}\quad\text{and}\quad\text{(ii)}\ (n\cdot N_{P})^{\frac{2-2\beta}{2-\beta}}\geq{4096\cdot n^{2}\cdot c_{1}^{-1}}.

Let ZβˆΌΞ“βˆ’Z\sim\Gamma_{-}, and n~Οƒ=n~σ​(Z),Οƒ=Β±\tilde{n}_{\sigma}=\tilde{n}_{\sigma}(Z),\,\sigma=\pm as defined in Proposition 4. We then have that ℙ​(n~+>n~βˆ’)β‰₯148\mathbb{P}\left(\tilde{n}_{+}>\tilde{n}_{-}\right)\geq\frac{1}{48}.

Proof.

Under the notation of Proposition 4, let N^P,βˆ’β‰N^Pβˆ’N^+{\hat{N}}_{P,-}\doteq{\hat{N}}_{P}-{\hat{N}}_{+}, and for homogeneity of notation herein, let N^P,+≐N^+{\hat{N}}_{P,+}\doteq{\hat{N}}_{+}. Fix Ξ΄>0\delta>0 to be defined, and notice that if Ξ΄>nβ‹…(N^P,+βˆ’N^P,βˆ’)\delta>n\cdot\left({\hat{N}}_{P,+}-{\hat{N}}_{P,-}\right), then the event

A~δ≐{n~++nβ‹…N^P,+β‰₯n~βˆ’+nβ‹…N^P,βˆ’+Ξ΄}​ implies ​{n~+>n~βˆ’}.\tilde{A}_{\delta}\doteq\left\{\tilde{n}_{+}+n\cdot{\hat{N}}_{P,+}\geq\tilde{n}_{-}+n\cdot{\hat{N}}_{P,-}+\delta\right\}\text{ implies }\left\{\tilde{n}_{+}>\tilde{n}_{-}\right\}.

As a first step, we want to upper-bound (N^P,+βˆ’N^P,βˆ’)\left({\hat{N}}_{P,+}-{\hat{N}}_{P,-}\right). Let VΞ΄V_{\delta} denote an upper-bound on the variance of this quantity: we have by Bernstein’s inequality that, for any t≀VΞ΄t\leq\sqrt{V_{\delta}}, with probability at least 1βˆ’eβˆ’t2/41-e^{-t^{2}/4},

(N^P,+βˆ’N^P,βˆ’)≀𝔼​[N^P,+βˆ’N^P,βˆ’]+t​Vδ≀t​VΞ΄.\displaystyle\left({\hat{N}}_{P,+}-{\hat{N}}_{P,-}\right)\leq\mathbb{E}\left[{\hat{N}}_{P,+}-{\hat{N}}_{P,-}\right]+t\sqrt{V_{\delta}}\leq t\sqrt{V_{\delta}}. (24)

We therefore set Ξ΄=4​nβ‹…VΞ΄\delta=4n\cdot\sqrt{V_{\delta}}, whereby, for VΞ΄β‰₯4\sqrt{V_{\delta}}\geq 4, the event of (24) (with t=4t=4) happens with probability at least 1βˆ’1/481-1/48. Hence, we set Vδ≐16βˆ¨π”Όβ€‹[N^P,++N^P,βˆ’]β‰₯16∨Var​(N^P,+βˆ’N^P,βˆ’)V_{\delta}\doteq 16\lor\mathbb{E}\left[{\hat{N}}_{P,+}+{\hat{N}}_{P,-}\right]\geq 16\lor\text{Var}\left({\hat{N}}_{P,+}-{\hat{N}}_{P,-}\right), where 𝔼​[N^P,++N^P,βˆ’]≐𝔼​[N^P]\mathbb{E}\left[{\hat{N}}_{P,+}+{\hat{N}}_{P,-}\right]\doteq\mathbb{E}\left[{\hat{N}}_{P}\right] is given in equation (17). We now proceed with a lower-bound on ℙ​(A~Ξ΄)\mathbb{P}(\tilde{A}_{\delta}).

Let n^x1{\hat{n}}_{x_{1}} denote the number of points sampled from Pβˆ’P_{-} that fall on x1x_{1}. Notice that, conditioned on these samples’ indices, n+≐n~++nβ‹…N^P,+n_{+}\doteq\tilde{n}_{+}+n\cdot{\hat{N}}_{P,+} is distributed as Binomial​(n^x1,p)\text{Binomial}\left({\hat{n}}_{x_{1}},p\right), where p=Ξ·P,βˆ’β€‹(x1)p=\eta_{P,-}(x_{1}), the probability of ++ given that x1x_{1} is sampled from Pβˆ’P_{-}. Applying Lemma 3, and integrating over N^Q{\hat{N}}_{Q}, we have

ℙ​(n+>n^x1β‹…p+n^x1β‹…p​(1βˆ’p))β‰₯112.\displaystyle\mathbb{P}\left(n_{+}>{\hat{n}}_{x_{1}}\cdot p+\sqrt{{\hat{n}}_{x_{1}}\cdot p(1-p)}\right)\geq\frac{1}{12}. (25)

Now notice that A~Ξ΄\tilde{A}_{\delta} holds whenever n^+β‰₯12​(n^x1+Ξ΄){\hat{n}}_{+}\geq\frac{1}{2}\left({\hat{n}}_{x_{1}}+\delta\right), since n^x1=(n~++nβ‹…N^P,+)+(n~βˆ’+nβ‹…N^P,βˆ’){\hat{n}}_{x_{1}}=\left(\tilde{n}_{+}+n\cdot{\hat{N}}_{P,+})+(\tilde{n}_{-}+n\cdot{\hat{N}}_{P,-}\right). Under the event of (25), we have n^+>12​(n^x1+Ξ΄){\hat{n}}_{+}>\frac{1}{2}\left({\hat{n}}_{x_{1}}+\delta\right), whenever it holds that

n^x1β‹…p​(1βˆ’p)β‰₯12​(n^x1+Ξ΄)βˆ’n^x1β‹…p=n^x1​(12βˆ’p)+12​δ.\displaystyle\sqrt{{\hat{n}}_{x_{1}}\cdot p(1-p)}\geq\frac{1}{2}\left({\hat{n}}_{x_{1}}+\delta\right)-{\hat{n}}_{x_{1}}\cdot p={\hat{n}}_{x_{1}}\left(\frac{1}{2}-p\right)+\frac{1}{2}\delta. (26)

Next we bound n^x1{\hat{n}}_{x_{1}} with high probability. Consider any value of N^Q{\hat{N}}_{Q} such that 𝐄Q\mathbf{E}_{Q} (from Proposition 3) holds, i.e., N^Q≀2​NQ{\hat{N}}_{Q}\leq 2N_{Q}. Conditioned on such N^Q{\hat{N}}_{Q}, n^x1{\hat{n}}_{x_{1}} is itself a Binomial with

𝔼​[n^x1∣N^Q]=n​(Nβˆ’N^Q)β‹…PX​(x1)β‰₯12​n​Nβ‹…PX​(x1)β‰₯12​nβ‹…NPβ‹…PX​(x1)=12​c1​(nβ‹…NP)2βˆ’2​β2βˆ’Ξ²,\mathbb{E}\left[{\hat{n}}_{x_{1}}\mid{\hat{N}}_{Q}\right]=n(N-{\hat{N}}_{Q})\cdot P_{X}(x_{1})\geq\frac{1}{2}nN\cdot P_{X}(x_{1})\geq\frac{1}{2}n\cdot N_{P}\cdot P_{X}(x_{1})=\frac{1}{2}c_{1}\left(n\cdot N_{P}\right)^{\frac{2-2\beta}{2-\beta}},

where for the first inequality we used the fact that NPβ‰₯3​NQN_{P}\geq 3N_{Q}. Hence, by a multiplicative Chernoff bound,

β„™(12𝔼[n^x1|N^Q]≀n^x1≀2𝔼[n^x1|N^Q]|N^Q)β‹…1{𝐄Q}β‰₯(1βˆ’148)1{𝐄Q},\displaystyle\mathbb{P}\left(\frac{1}{2}\mathbb{E}\left[{\hat{n}}_{x_{1}}|{\hat{N}}_{Q}\right]\leq{\hat{n}}_{x_{1}}\leq 2\mathbb{E}\left[{\hat{n}}_{x_{1}}|{\hat{N}}_{Q}\right]\middle|{\hat{N}}_{Q}\right)\cdot{\mathbbold 1}\!\left\{\mathbf{E}_{Q}\right\}\geq\left(1-\frac{1}{48}\right){\mathbbold 1}\!\left\{\mathbf{E}_{Q}\right\}, (27)

whenever c1​(nβ‹…NP)2βˆ’2​β2βˆ’Ξ²β‰₯40c_{1}\left(n\cdot N_{P}\right)^{\frac{2-2\beta}{2-\beta}}\geq 40. Now, by Proposition 3, 𝐄Q\mathbf{E}_{Q} holds with probability at least 1βˆ’1/481-1/48 whenever 𝔼​[N^Q]=NQ>16\mathbb{E}\left[{\hat{N}}_{Q}\right]=N_{Q}>16. Thus, integrating (27) over N^Q{\hat{N}}_{Q}, we get that, with probability at least 1βˆ’1/241-1/24,

14​nβ‹…NPβ‹…PX​(x1)≀n^x1≀2​nβ‹…Nβ‹…PX​(x1)≀83​nβ‹…NPβ‹…PX​(x1).\displaystyle\frac{1}{4}n\cdot N_{P}\cdot P_{X}(x_{1})\leq{\hat{n}}_{x_{1}}\leq 2n\cdot N\cdot P_{X}(x_{1})\leq{\frac{8}{3}n\cdot N_{P}\cdot P_{X}(x_{1})}. (28)

Thus, bounding both sides of (26), A~Ξ΄\tilde{A}_{\delta} holds whenever a) the events of (25) and (28) hold, and b) the following inequality is satisfied:

14​nβ‹…NPβ‹…PX​(x1)β‹…p​(1βˆ’p)\displaystyle\sqrt{\frac{1}{4}n\cdot N_{P}\cdot P_{X}(x_{1})\cdot p(1-p)} β‰₯83​nβ‹…NPβ‹…PX​(x1)​(12βˆ’p)+2​nβ‹…VΞ΄,Β which holds whenever\displaystyle\geq{\frac{8}{3}}n\cdot N_{P}\cdot P_{X}(x_{1})\left(\frac{1}{2}-p\right)+2n\cdot\sqrt{V_{\delta}},\text{ which holds whenever }
14​2​c11/2​(nβ‹…NP)1βˆ’Ξ²2βˆ’Ξ²\displaystyle\frac{1}{4\sqrt{2}}c_{1}^{1/2}\left(n\cdot N_{P}\right)^{\frac{1-\beta}{2-\beta}} β‰₯83​c1​(nβ‹…NP)1βˆ’Ξ²2βˆ’Ξ²+2​nβ‹…(4∨nβˆ’1/2​c1n/2​(nβ‹…NP)1βˆ’(n+1)​β/22βˆ’Ξ²),\displaystyle\geq\frac{8}{3}c_{1}\left(n\cdot N_{P}\right)^{\frac{1-\beta}{2-\beta}}+2n\cdot\left(4\lor n^{-1/2}c_{1}^{n/2}\left(n\cdot N_{P}\right)^{\frac{1-(n+1)\beta/2}{2-\beta}}\right), (29)

where the r.h.s. and l.h.s. of (29) upper and lower bound, respectively, the r.h.s. and l.h.s. of the previous inequality (using the setting of VΞ΄V_{\delta} and (17), and lower-bounding p​(1βˆ’p)p(1-p) by 1/81/8 for NPN_{P} as large as assumed). By the conditions of the Proposition, (29) is satisfied. Thus, A~Ξ΄\tilde{A}_{\delta} holds with probability at least 1/241/24 since the events of (25) and (28) hold together with that probability (using the fact that ℙ​(A∩B)β‰₯ℙ​(A)βˆ’β„™β€‹(B𝖼)\mathbb{P}(A\cap B)\geq\mathbb{P}(A)-\mathbb{P}(B^{\mathsf{c}})). Finally, we can conclude that {n~+>n~βˆ’}\left\{\tilde{n}_{+}>\tilde{n}_{-}\right\} with probability at least 1/481/48 since A~Ξ΄\tilde{A}_{\delta} and the event of (24) hold together with that probability. ∎

Finally we bound the π’ŸΒ±{\cal D}_{\pm} term in the likelihood equation (16).

Proposition 7 (π’Ÿ+/π’Ÿβˆ’{\cal D}_{+}/{\cal D}_{-}).

Let nπ’Ÿ>0n_{\cal D}>0. Again let ZβˆΌΞ“βˆ’Z\sim\Gamma_{-}, and let 0<c0≀1/40<c_{0}\leq 1/4 in the construction of Ξ·π’Ÿ,Οƒ,Οƒ=Β±\eta_{{\cal D},\sigma},\sigma=\pm.

ℙ​(π’Ÿ+​(ZN+1)π’Ÿβˆ’β€‹(ZN+1)β‰₯1)β‰₯184.\mathbb{P}\left(\frac{{\cal D}_{+}(Z_{N+1})}{{\cal D}_{-}(Z_{N+1})}\geq 1\right)\geq\frac{1}{84}.

The proof is given in AppendixΒ D, and follows similar lines as above, namely, isolating sufficient statistics (the number of Β±\pm labels in ZN+1Z_{N+1}) and concluding by anticoncentration upon proper conditioning.

We can now combine all the above analysis into the following main proposition.

Proposition 8.

Pick any 0≀β<10\leq\beta<1, 1≀n<2/Ξ²βˆ’11\leq n<2/\beta-1, and NQβ‰₯16N_{Q}\geq 16. Let 0<c1≀2βˆ’100<c_{1}\leq 2^{-10}, and 0<c0≀1/40<c_{0}\leq 1/4 in the constructions of Ξ·P,Οƒ\eta_{P,\sigma}, Ξ·π’Ÿ,Οƒ\eta_{{\cal D},\sigma}, Οƒ=Β±\sigma=\pm. Suppose NPN_{P} is sufficiently large so that NPβ‰₯3​NQN_{P}\geq 3N_{Q}, and also

  • (i)

    (nβ‹…NP)2βˆ’2​β2βˆ’Ξ²β‰₯4096β‹…n2β‹…c1βˆ’1(n\cdot N_{P})^{\frac{2-2\beta}{2-\beta}}\geq{4096\cdot n^{2}\cdot c_{1}^{-1}}.

  • (ii)

    (nβ‹…NP)(2βˆ’(n+1)​β)/(2βˆ’Ξ²)β‰₯4β‹…NQ2β‹…nβ‹…24​n​c1βˆ’n(n\cdot N_{P})^{(2-(n+1)\beta)/(2-\beta)}\geq 4\cdot N_{Q}^{2}\cdot n\cdot 2^{4n}c_{1}^{-n}.

  • (iii)

    NP(2βˆ’(n+1)​β)/(2βˆ’Ξ²)β‰₯22β‹…nβ‹…2nβ‹…c1βˆ’nN_{P}^{(2-(n+1)\beta)/(2-\beta)}\geq 22\cdot n\cdot 2^{n}\cdot c_{1}^{-n}.

Let h^\hat{h} denote any classification procedure having access to ZβˆΌΞ“Οƒ,Οƒ=Β±Z\sim\Gamma_{\sigma},\sigma=\pm. We then have that

infh^supΟƒβˆˆ{Β±}ℙΓσ​(β„°π’Ÿβ€‹(h^)β‰₯c0β‹…(1∧nπ’Ÿβˆ’1/(2βˆ’Ξ²)))β‰₯112β‹…196β‹…184.\displaystyle\inf_{\hat{h}}\sup_{\sigma\in\left\{\pm\right\}}\mathbb{P}_{\Gamma_{\sigma}}\left(\mathcal{E}_{{\cal D}}(\hat{h})\geq c_{0}\cdot\left(1\land n_{\cal D}^{-1/(2-\beta)}\right)\right)\geq\frac{1}{12}\cdot\frac{1}{96}\cdot\frac{1}{84}.
Proof.

Following the above propositions, again assume w.l.o.g. that ZβˆΌΞ“βˆ’Z\sim\Gamma_{-}. Let 𝐄P,𝐄Q\mathbf{E}_{P},\mathbf{E}_{Q} be as defined in PropositionΒ 3 over ZβˆΌΞ“βˆ’Z\sim\Gamma_{-}, and notice that, under our assumptions on NQN_{Q} and (iii) on NPN_{P}, each of these events occurs with probability at least 1βˆ’1/(2β‹…96)1-1/(2\cdot 96).

Thus, for n>1n>1, by Propositions 2, 4, and 7, we have that β„™Ξ“βˆ’β€‹(Ξ“+​(Z)>Ξ“βˆ’β€‹(Z))\mathbb{P}_{\Gamma_{-}}\left(\Gamma_{+}(Z)>\Gamma_{-}(Z)\right) is at least Ξ΄1​(Ξ΄2βˆ’1/96)​184\delta_{1}(\delta_{2}-1/96)\frac{1}{84}. Now plug in Ξ΄1=1/12\delta_{1}=1/12 and Ξ΄2=1/48\delta_{2}=1/48 from Propositions 5 and 6. For n=1n=1, using Proposition 2 and 7, and noticing that {N^+​(Z)>N^βˆ’β€‹(Z)}⟹{n^+​(Z)β‰₯n^βˆ’β€‹(Z)}\{{\hat{N}}_{+}(Z)>{\hat{N}}_{-}(Z)\}\implies\{{\hat{n}}_{+}(Z)\geq{\hat{n}}_{-}(Z)\}, we can conclude by Corollary 1 that β„™Ξ“βˆ’β€‹(Ξ“+​(Z)>Ξ“βˆ’β€‹(Z))\mathbb{P}_{\Gamma_{-}}\left(\Gamma_{+}(Z)>\Gamma_{-}(Z)\right) is at least 112​ℙ​(𝐄Pβˆ©π„Q)β‹…184\frac{1}{12}\mathbb{P}\left(\mathbf{E}_{P}\cap\mathbf{E}_{Q}\right)\cdot\frac{1}{84}, again matching the lower-bound in the statement.

Now, if h^\hat{h} wrongly picks Οƒ=+\sigma=+ (i.e. picks hβˆˆβ„‹h\in\mathcal{H}, h​(x1)=+h(x_{1})=+), then β„°π’Ÿβ€‹(h^)β‰₯π’ŸX​(x1)β‹…(Ξ·π’Ÿ,+βˆ’Ξ·π’Ÿ,βˆ’)=c0​ϡ0\mathcal{E}_{\cal D}(\hat{h})\geq{\cal D}_{X}(x_{1})\cdot\left(\eta_{{\cal D},+}-\eta_{{\cal D},-}\right)=c_{0}\epsilon_{0}. By Remark 8, for any h^\hat{h}, the probability that h^\hat{h} picks Οƒ=+\sigma=+ is bounded below by β„™Ξ“βˆ’β€‹(Ξ“+​(Z)>Ξ“βˆ’β€‹(Z))\mathbb{P}_{\Gamma_{-}}\left(\Gamma_{+}(Z)>\Gamma_{-}(Z)\right). ∎

We can now conclude with the proof of the main result of this section.

Proof of Theorem 8.

The first part of the statement builds on PropositionΒ 8 as follows. Set c0=1/4c_{0}=1/4 and c1=2βˆ’10c_{1}=2^{-10}. First, let ZβˆΌΞ“ΟƒZ\sim\Gamma_{\sigma}, and let N^Q{\hat{N}}_{Q} denote the number of vectors Zt∈ZZ_{t}\in Z that were generated by QQ (as in PropositionΒ 3). Let 𝐄Q,β™―\mathbf{E}_{Q,\sharp} denote the event that N^Qβ‰₯𝔼​[N^Q]/2=NQ/2{\hat{N}}_{Q}\geq\mathbb{E}[{\hat{N}}_{Q}]/2=N_{Q}/2, and let 𝐄ℰ≐{β„°π’Ÿβ€‹(h^)β‰₯c0β‹…Ο΅0}\mathbf{E}_{\mathcal{E}}\doteq\left\{\mathcal{E}_{{\cal D}}(\hat{h})\geq c_{0}\cdot\epsilon_{0}\right\}. By PropositionΒ 8, for some Οƒβˆˆ{Β±}\sigma\in\left\{\pm\right\}, we have that ℙ​(𝐄ℰ)\mathbb{P}\left(\mathbf{E}_{\mathcal{E}}\right) is bounded below by the numerical constant of that proposition.

Now decouple the randomness in ZZ as follows. Let ΢≐{ΞΆt}t∈[N]\zeta\doteq\left\{\zeta_{t}\right\}_{t\in[N]} denote NN i.i.d. choices of PΟƒP_{\sigma} or QΟƒQ_{\sigma} with respective probabilities Ξ±P=NP/N\alpha_{P}=N_{P}/N and Ξ±Q=NQ/N\alpha_{Q}=N_{Q}/N; choose Zt∈ZZ_{t}\in Z according to ΞΆtn\zeta_{t}^{n}. We then have that

𝔼​[ℙ​(π„β„°βˆ£ΞΆ)βˆ£π„Q,β™―]\displaystyle\mathbb{E}\left[\mathbb{P}\left(\mathbf{E}_{\mathcal{E}}\mid\zeta\right)\mid\mathbf{E}_{Q,\sharp}\right] β‰₯𝔼​[ℙ​(π„β„°βˆ£ΞΆ)β‹…1​{𝐄Q,β™―}]\displaystyle\geq\mathbb{E}\left[\mathbb{P}\left(\mathbf{E}_{\mathcal{E}}\mid\zeta\right)\cdot{\mathbbold 1}\!\left\{\mathbf{E}_{Q,\sharp}\right\}\right]
β‰₯𝔼​[ℙ​(π„β„°βˆ£ΞΆ)]βˆ’β„™β€‹(𝐄Q,♯𝖼)=ℙ​(𝐄ℰ)βˆ’β„™β€‹(𝐄Q,♯𝖼)β‰₯c,\displaystyle\geq\mathbb{E}\left[\mathbb{P}\left(\mathbf{E}_{\mathcal{E}}\mid\zeta\right)\right]-\mathbb{P}\left(\mathbf{E}_{Q,\sharp}^{\mathsf{c}}\right)=\mathbb{P}\left(\mathbf{E}_{\mathcal{E}}\right)-\mathbb{P}\left(\mathbf{E}_{Q,\sharp}^{\mathsf{c}}\right)\geq c,

where we can make ℙ​(𝐄Q,♯𝖼)\mathbb{P}\left(\mathbf{E}_{Q,\sharp}^{\mathsf{c}}\right) as small as desired for NQN_{Q} sufficiently large (by a multiplicative Chernoff bound). Now conclude by noticing that the above conditional expectation is a projection of the measure Ξ±N\alpha^{N} onto β„³\mathcal{M} (via the injection ΞΆβ†¦Ξ βˆˆβ„³\zeta\mapsto\Pi\in\mathcal{M}) and is bounded below, implying that supΞΆβˆ£π„Q,♯ℙ​(π„β„°βˆ£ΞΆ)\sup_{\zeta\mid\mathbf{E}_{Q,\sharp}}\mathbb{P}\left(\mathbf{E}_{\mathcal{E}}\mid\zeta\right) must be bounded below as well.

The second part of the theorem is a direct consequence of the results of Section 7. ∎

References

  • ABGLP [19] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv:1907.02893, 2019.
  • AKK+ [19] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv:1902.09229, 2019.
  • APMS [19] Alessandro Achille, Giovanni Paolini, Glen Mbeng, and Stefano Soatto. The information complexity of learning tasks, their structure and their distance. arXiv:1904.03292, 2019.
  • AZ [05] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
  • Bar [92] Peter L Bartlett. Learning with a slowly changing distribution. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, 1992.
  • Bax [97] Jonathan Baxter. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39, 1997.
  • BDB [08] Shai Ben-David and Reba Schuller Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Machine Learning, 73(3):273–287, 2008.
  • BDBC+ [10] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
  • BDBCP [07] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, 2007.
  • BDBM [89] Shai Ben-David, Gyora M Benedek, and Yishay Mansour. A parametrization scheme for classifying models of learnability. In Proceedings of the 2nd Annual Workshop on Computational Learning Theory, 1989.
  • BDLLP [10] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
  • BHPQ [17] Avrim Blum, Nika Haghtalab, Ariel D Procaccia, and Mingda Qiao. Collaborative PAC learning. In Advances in Neural Information Processing Systems, 2017.
  • BL [97] Rakesh D Barve and Philip M Long. On the complexity of learning from drifting distributions. Information and Computation, 138(2):170–193, 1997.
  • BLM [13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  • Car [97] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
  • CKW [08] Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757–1774, 2008.
  • CMRR [08] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International Conference on Algorithmic Learning Theory, 2008.
  • DHK+ [20] Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv:2002.09434, 2020.
  • GSH+ [09] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. Covariate shift by kernel mean matching. In Dataset Shift in Machine Learning, pages 131–160, 2009.
  • HK [19] Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. In Advances in Neural Information Processing Systems, 2019.
  • HY [19] Steve Hanneke and Liu Yang. Statistical learning under nonstationary mixing processes. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
  • JSRR [10] Ali Jalali, Sujay Sanghavi, Chao Ruan, and Pradeep Ravikumar. A dirty model for multi-task learning. In Advances in Neural Information Processing Systems, 2010.
  • KFAL [20] Nikola Konstantinov, Elias Frantar, Dan Alistarh, and Christoph H Lampert. On the sample complexity of adversarial multi-source PAC learning. arXiv:2002.10384, 2020.
  • KM [18] Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift. arXiv:1803.01833, 2018.
  • Kol [06] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.
  • Kol [11] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d'Eté de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. 2011.
  • KR [05] Thierry Klein and Emmanuel Rio. Concentration around the mean for maxima of empirical processes. The Annals of Probability, 33(3):1060–1077, 2005.
  • LPVDG+ [11] Karim Lounici, Massimiliano Pontil, Sara Van De Geer, Alexandre B Tsybakov, et al. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164–2204, 2011.
  • MB [17] Daniel McNamara and Maria-Florina Balcan. Risk bounds for transferring representations with and without fine-tuning. In International Conference on Machine Learning, 2017.
  • MBS [13] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, 2013.
  • MM [12] Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, 2012.
  • MMM [19] Saeed Mahloujifar, Mohammad Mahmoody, and Ameer Mohammed. Universal multi-party poisoning attacks. In Proceedings of the 36th International Conference on Machine Learning, 2019.
  • MMR [09] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
  • Mou [10] Nima Mousavi. How tight is Chernoff bound? https://ece.uwaterloo.ca/~nmousavi/Papers/Chernoff-Tightness.pdf, 2010.
  • MPRP [13] Andreas Maurer, Massi Pontil, and Bernardino Romera-Paredes. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning, 2013.
  • MPRP [16] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853–2884, 2016.
  • NW [11] S. N. Negahban and M. J. Wainwright. Simultaneous support recovery in high dimensions: Benefits and perils of block \ell_{1}/\ell_{\infty}-regularization. IEEE Transactions on Information Theory, 57(6):3841–3863, 2011.
  • PL [14] Anastasia Pentina and Christoph Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, 2014.
  • PM [13] Massimiliano Pontil and Andreas Maurer. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory, 2013.
  • Qia [18] Mingda Qiao. Do outliers ruin collaboration? arXiv:1805.04720, 2018.
  • RHS [17] Ievgen Redko, Amaury Habrard, and Marc Sebban. Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2017.
  • Sau [72] Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.
  • Slu [77] Eric V Slud. Distribution inequalities for the binomial law. The Annals of Probability, pages 404–412, 1977.
  • SQZY [18] Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In 32nd AAAI Conference on Artificial Intelligence, 2018.
  • SZ [19] Clayton Scott and Jianxin Zhang. Learning from multiple corrupted sources, with application to learning from label proportions. arXiv:1910.04665, 2019.
  • Tat [53] Robert F Tate. On a double inequality of the normal distribution. The Annals of Mathematical Statistics, 24(1):132–134, 1953.
  • TJJ [20] Nilesh Tripuraneni, Michael I Jordan, and Chi Jin. On the theory of transfer learning: The importance of task diversity. arXiv:2006.11650, 2020.
  • Tsy [04] Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
  • Tsy [09] Alexandre B Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
  • VC [71] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their expectation. Theory of Probability and its Applications, 16:264–280, 1971.
  • vW [96] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag New York, 1996.
  • YHC [13] Liu Yang, Steve Hanneke, and Jaime Carbonell. A theory of transfer learning with applications to active learning. Machine Learning, 90(2):161–189, 2013.
  • ZL [19] Alexander Zimin and Christoph H Lampert. Tasks without borders: A new approach to online multi-task learning. In Workshop on Adaptive & Multitask Learning, 2019.

Appendix

Appendix A Proof of LemmaΒ 1

Let W_{1:m}=(W_{1},\ldots,W_{m}) be a vector of independent \mathcal{W}-valued random variables (for some space \mathcal{W}), not necessarily identically distributed. Let \mathcal{F} be a set of measurable functions \mathcal{W}\to[-1,1], and let |\mathcal{F}(m)| denote the maximum number of distinct vectors \left(f(w_{1}),\ldots,f(w_{m})\right), f\in\mathcal{F}, realizable on any m points w_{1},\ldots,w_{m}\in\mathcal{W}. Let \alpha=(\alpha_{1},\ldots,\alpha_{m})\in[0,1]^{m}. Define \hat{\mu}_{\alpha}(f)\doteq\sum_{i=1}^{m}\alpha_{i}f(W_{i}) and \mu_{\alpha}(f)\doteq\mathbb{E}\!\left[\hat{\mu}_{\alpha}(f)\right], and also \hat{\sigma}_{\alpha}^{2}(f)\doteq\sum_{i=1}^{m}\alpha_{i}^{2}f^{2}(W_{i}) and \sigma_{\alpha}^{2}(f)\doteq\mathbb{E}\hat{\sigma}_{\alpha}^{2}(f). Define \mathcal{F}_{\sigma}\doteq\{f\in\mathcal{F}:\sigma_{\alpha}(f)\leq\sigma\}.

The following lemma is immediate from a result of [27] (see also [14] Section 12.5).

Lemma 4.

For any Οƒ>0\sigma>0 with β„±Οƒβ‰ βˆ…\mathcal{F}_{\sigma}\neq\emptyset, letting Lσ≐supfβˆˆβ„±Οƒ(ΞΌ^α​(f)βˆ’ΞΌΞ±β€‹(f))L_{\sigma}\doteq\sup_{f\in\mathcal{F}_{\sigma}}\left(\hat{\mu}_{\alpha}(f)-\mu_{\alpha}(f)\right), βˆ€Ο΅>0\forall\epsilon>0,

ℙ​(LΟƒβ‰₯𝔼​[LΟƒ]+2​ϡ)≀exp⁑{βˆ’Ο΅4​ln⁑(1+2​ln⁑(1+Ο΅2​𝔼​[LΟƒ]+Οƒ2))}.\displaystyle\mathbb{P}\!\left(L_{\sigma}\geq\mathbb{E}\!\left[L_{\sigma}\right]+2\epsilon\right)\leq\exp\!\left\{-\frac{\epsilon}{4}\ln\!\left(1+2\ln\!\left(1+\frac{\epsilon}{2\mathbb{E}\!\left[L_{\sigma}\right]+\sigma^{2}}\right)\right)\right\}.

In particular, based on the inequality ln⁑(1+x)β‰₯x1+x\ln(1+x)\geq\frac{x}{1+x}, this implies

ℙ​(LΟƒβ‰₯𝔼​[LΟƒ]+2​ϡ)≀exp⁑{βˆ’Ο΅26​ϡ+4​𝔼​[LΟƒ]+2​σ2}.\mathbb{P}\!\left(L_{\sigma}\geq\mathbb{E}\!\left[L_{\sigma}\right]+2\epsilon\right)\leq\exp\!\left\{-\frac{\epsilon^{2}}{6\epsilon+4\mathbb{E}\!\left[L_{\sigma}\right]+2\sigma^{2}}\right\}.

In particular, for any Ξ΄>0\delta>0, setting

Ο΅=12​max⁑{(𝔼​[LΟƒ]+Οƒ2)​ln⁑(1Ξ΄),ln⁑(1Ξ΄)}\epsilon=12\max\!\left\{\sqrt{(\mathbb{E}[L_{\sigma}]+\sigma^{2})\ln\!\left(\frac{1}{\delta}\right)},\ln\!\left(\frac{1}{\delta}\right)\right\}

reveals that, with probability at least 1βˆ’Ξ΄1-\delta,

Lσ≀𝔼​[LΟƒ]+24​max⁑{(𝔼​[LΟƒ]+Οƒ2)​ln⁑(1Ξ΄),ln⁑(1Ξ΄)}.L_{\sigma}\leq\mathbb{E}[L_{\sigma}]+24\max\!\left\{\sqrt{(\mathbb{E}[L_{\sigma}]+\sigma^{2})\ln\!\left(\frac{1}{\delta}\right)},\ln\!\left(\frac{1}{\delta}\right)\right\}. (30)
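For completeness, here is a quick check (added in this presentation) that the above choice of \epsilon indeed yields (30). Writing L\doteq\ln(1/\delta), the choice guarantees both \epsilon\geq 12\sqrt{(\mathbb{E}[L_{\sigma}]+\sigma^{2})L} and \epsilon\geq 12L, so that

6\epsilon+4\mathbb{E}[L_{\sigma}]+2\sigma^{2}\leq\frac{\epsilon^{2}}{2L}+4\left(\mathbb{E}[L_{\sigma}]+\sigma^{2}\right)\leq\frac{\epsilon^{2}}{2L}+\frac{\epsilon^{2}}{36L}\leq\frac{\epsilon^{2}}{L},

and hence \exp\!\left\{-\frac{\epsilon^{2}}{6\epsilon+4\mathbb{E}[L_{\sigma}]+2\sigma^{2}}\right\}\leq\exp\{-L\}=\delta; on the complementary event, L_{\sigma}\leq\mathbb{E}[L_{\sigma}]+2\epsilon, which is exactly (30).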

Next, the following lemma bounds 𝔼​[LΟƒ]\mathbb{E}[L_{\sigma}] using a standard route.

Lemma 5.

There is a numerical constant Cβ‰₯1C\geq 1 such that, for any Οƒ>0\sigma>0 with β„±Οƒβ‰ βˆ…\mathcal{F}_{\sigma}\neq\emptyset, for LΟƒL_{\sigma} as in LemmaΒ 4,

𝔼​[LΟƒ]≀C​σ2​ln⁑(|ℱ​(m)|)+C​ln⁑(|ℱ​(m)|).\mathbb{E}[L_{\sigma}]\leq C\sqrt{\sigma^{2}\ln(|\mathcal{F}(m)|)}+C\ln(|\mathcal{F}(m)|).
Proof.

From Lemma 11.4 of [14], for Ο΅1,…,Ο΅m\epsilon_{1},\ldots,\epsilon_{m} independent Uniform​({βˆ’1,1}){\rm Uniform}(\{-1,1\}) (and independent of W1:mW_{1:m}),

𝔼​[LΟƒ]≀2​𝔼​[supfβˆˆβ„±Οƒβˆ‘i=1mΟ΅i​αi​(f​(Wi)βˆ’π”Όβ€‹[f​(Wi)])].\mathbb{E}[L_{\sigma}]\leq 2\mathbb{E}\!\left[\sup_{f\in\mathcal{F}_{\sigma}}\sum_{i=1}^{m}\epsilon_{i}\alpha_{i}\left(f(W_{i})-\mathbb{E}[f(W_{i})]\right)\right].

Furthermore, from [14] (Exercise 13.2, based on a result proved in [51]; see also [26]), there is a numerical constant C>0C>0 such that

𝔼​[supfβˆˆβ„±Οƒβˆ‘i=1mΟ΅i​αi​(f​(Wi)βˆ’π”Όβ€‹[f​(Wi)])]≀C​σ​log⁑(|ℱ​(m)|)+C​log⁑(|ℱ​(m)|).\displaystyle\mathbb{E}\!\left[\sup_{f\in\mathcal{F}_{\sigma}}\sum_{i=1}^{m}\epsilon_{i}\alpha_{i}\left(f(W_{i})-\mathbb{E}[f(W_{i})]\right)\right]\leq C\sigma\sqrt{\log(|\mathcal{F}(m)|)}+C\log(|\mathcal{F}(m)|).

∎

In particular, the above results imply the following lemma.

Lemma 6.

There exists a numerical constant Cβ‰₯1C\geq 1 such that, for any δ∈(0,1)\delta\in(0,1), with probability at least 1βˆ’Ξ΄1-\delta, every fβˆˆβ„±f\in\mathcal{F} satisfies

ΞΌ^α​(f)≀μα​(f)+C​σα2​(f)​ln⁑(|ℱ​(m)|​log2⁑(2​m)Ξ΄)+C​ln⁑(|ℱ​(m)|​log2⁑(2​m)Ξ΄).\hat{\mu}_{\alpha}(f)\leq\mu_{\alpha}(f)+C\sqrt{\sigma_{\alpha}^{2}(f)\ln\!\left(\frac{|\mathcal{F}(m)|\log_{2}(2m)}{\delta}\right)}+C\ln\!\left(\frac{|\mathcal{F}(m)|\log_{2}(2m)}{\delta}\right).
Proof.

Let Ξ΄m≐δlog2⁑(2​m)\delta_{m}\doteq\frac{\delta}{\log_{2}(2m)}. Combining LemmaΒ 5 with (30) implies that, letting Οƒk2=2k\sigma_{k}^{2}=2^{k} (k∈{0,…,log2⁑(m)}k\in\{0,\ldots,\log_{2}(m)\}), with probability at least 1βˆ’Ξ΄1-\delta (by a union bound), every k∈{0,…,log2⁑(m)}k\in\{0,\ldots,\log_{2}(m)\} has (for a numerical constant Cβ‰₯1C\geq 1)

LΟƒk\displaystyle L_{\sigma_{k}} ≀24​max⁑{(𝔼​[LΟƒk]+Οƒk2)​ln⁑(1Ξ΄m),ln⁑(1Ξ΄m)}+C​σk2​ln⁑(|ℱ​(m)|)+C​ln⁑(|ℱ​(m)|)\displaystyle\leq 24\max\!\left\{\sqrt{(\mathbb{E}[L_{\sigma_{k}}]+\sigma_{k}^{2})\ln\!\left(\frac{1}{\delta_{m}}\right)},\ln\!\left(\frac{1}{\delta_{m}}\right)\right\}+C\sqrt{\sigma_{k}^{2}\ln(|\mathcal{F}(m)|)}+C\ln(|\mathcal{F}(m)|)
≀24​(𝔼​[LΟƒk]+Οƒk2)​ln⁑(1Ξ΄m)+C​σk2​ln⁑(|ℱ​(m)|)+(24+C)​ln⁑(|ℱ​(m)|Ξ΄m)\displaystyle\leq 24\sqrt{(\mathbb{E}[L_{\sigma_{k}}]+\sigma_{k}^{2})\ln\!\left(\frac{1}{\delta_{m}}\right)}+C\sqrt{\sigma_{k}^{2}\ln(|\mathcal{F}(m)|)}+(24+C)\ln\!\left(\frac{|\mathcal{F}(m)|}{\delta_{m}}\right)
≀24​2​σk2​ln⁑(1Ξ΄m)+48​C​σk2​ln⁑(|ℱ​(m)|)​ln⁑(1Ξ΄m)+48​C​ln⁑(|ℱ​(m)|)​ln⁑(1Ξ΄m)\displaystyle\leq 24\sqrt{2\sigma_{k}^{2}\ln\!\left(\frac{1}{\delta_{m}}\right)}+48\sqrt{C\sqrt{\sigma_{k}^{2}\ln(|\mathcal{F}(m)|)}\ln\!\left(\frac{1}{\delta_{m}}\right)}+48\sqrt{C\ln(|\mathcal{F}(m)|)\ln\!\left(\frac{1}{\delta_{m}}\right)}
+C​σk2​ln⁑(|ℱ​(m)|)+(24+C)​ln⁑(|ℱ​(m)|Ξ΄m)\displaystyle\phantom{\leq}+C\sqrt{\sigma_{k}^{2}\ln(|\mathcal{F}(m)|)}+(24+C)\ln\!\left(\frac{|\mathcal{F}(m)|}{\delta_{m}}\right)
≀(72+2​C+48​C)​(Οƒk2​ln⁑(|ℱ​(m)|Ξ΄m)+ln⁑(|ℱ​(m)|Ξ΄m)).\displaystyle\leq(72+2C+48\sqrt{C})\left(\sqrt{\sigma_{k}^{2}\ln\!\left(\frac{|\mathcal{F}(m)|}{\delta_{m}}\right)}+\ln\!\left(\frac{|\mathcal{F}(m)|}{\delta_{m}}\right)\right).

Suppose this event occurs. In particular, for each f\in\mathcal{F}, let k(f)\doteq\min\{k:f\in\mathcal{F}_{\sigma_{k}}\}, which is well defined since \sigma_{\alpha}^{2}(f)\leq m; by minimality of k(f), either k(f)=0 and \sigma_{k(f)}^{2}=1, or \sigma_{k(f)}^{2}=2\sigma_{k(f)-1}^{2}<2\sigma_{\alpha}^{2}(f), so that \sigma_{k(f)}^{2}\leq\max\{1,2\sigma_{\alpha}^{2}(f)\}, which justifies the second inequality below. Then every f\in\mathcal{F} has (for a universal numerical constant C^{\prime}\geq 1)

ΞΌ^α​(f)\displaystyle\hat{\mu}_{\alpha}(f) ≀μα​(f)+C′​σk​(f)2​ln⁑(|ℱ​(m)|Ξ΄m)+C′​ln⁑(|ℱ​(m)|Ξ΄m)\displaystyle\leq\mu_{\alpha}(f)+C^{\prime}\sqrt{\sigma_{k(f)}^{2}\ln\!\left(\frac{|\mathcal{F}(m)|}{\delta_{m}}\right)}+C^{\prime}\ln\!\left(\frac{|\mathcal{F}(m)|}{\delta_{m}}\right)
≀μα​(f)+C′​max⁑{1,2​σα2​(f)}​ln⁑(|ℱ​(m)|Ξ΄m)+C′​ln⁑(|ℱ​(m)|Ξ΄m)\displaystyle\leq\mu_{\alpha}(f)+C^{\prime}\sqrt{\max\!\left\{1,2\sigma_{\alpha}^{2}(f)\right\}\ln\!\left(\frac{|\mathcal{F}(m)|}{\delta_{m}}\right)}+C^{\prime}\ln\!\left(\frac{|\mathcal{F}(m)|}{\delta_{m}}\right)
≀μα​(f)+C′​2​σα2​(f)​ln⁑(|ℱ​(m)|Ξ΄m)+2​C′​ln⁑(|ℱ​(m)|Ξ΄m).\displaystyle\leq\mu_{\alpha}(f)+C^{\prime}\sqrt{2\sigma_{\alpha}^{2}(f)\ln\!\left(\frac{|\mathcal{F}(m)|}{\delta_{m}}\right)}+2C^{\prime}\ln\!\left(\frac{|\mathcal{F}(m)|}{\delta_{m}}\right).

∎

We can also state a concentration result specific to the Οƒ^Ξ±2​(f)\hat{\sigma}_{\alpha}^{2}(f) values, as follows.

Lemma 7.

There exists a numerical constant Cβ‰₯1C\geq 1 such that, for any δ∈(0,1)\delta\in(0,1), with probability at least 1βˆ’Ξ΄1-\delta, every fβˆˆβ„±f\in\mathcal{F} satisfies

12​σα2​(f)βˆ’C​ln⁑(|ℱ​(m)|​log2⁑(2​m)Ξ΄)≀σ^Ξ±2​(f)≀2​σα2​(f)+C​ln⁑(|ℱ​(m)|​log2⁑(2​m)Ξ΄).\frac{1}{2}\sigma_{\alpha}^{2}(f)-C\ln\!\left(\frac{|\mathcal{F}(m)|\log_{2}(2m)}{\delta}\right)\leq\hat{\sigma}_{\alpha}^{2}(f)\leq 2\sigma_{\alpha}^{2}(f)+C\ln\!\left(\frac{|\mathcal{F}(m)|\log_{2}(2m)}{\delta}\right).
Proof.

Define ℱ′≐{f2:fβˆˆβ„±}\mathcal{F}^{\prime}\doteq\{f^{2}:f\in\mathcal{F}\} and note that |ℱ′​(m)|≀|ℱ​(m)||\mathcal{F}^{\prime}(m)|\leq|\mathcal{F}(m)|. Also define Ξ±i′≐αi2\alpha^{\prime}_{i}\doteq\alpha_{i}^{2}. Applying LemmaΒ 6 with this β„±β€²\mathcal{F}^{\prime} and Ξ±β€²\alpha^{\prime}, we have that, with probability at least 1βˆ’Ξ΄/21-\delta/2, every fβˆˆβ„±f\in\mathcal{F} satisfies (for some numerical constant Cβ‰₯1C\geq 1)

Οƒ^Ξ±2​(f)\displaystyle\hat{\sigma}_{\alpha}^{2}(f) ≀σα2​(f)+C​𝔼​[βˆ‘i=1mΞ±i4​f4​(Wi)]​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄)+C​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄)\displaystyle\leq\sigma_{\alpha}^{2}(f)+C\sqrt{\mathbb{E}\!\left[\sum_{i=1}^{m}\alpha_{i}^{4}f^{4}(W_{i})\right]\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right)}+C\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right)
≀σα2​(f)+C​σα2​(f)​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄)+C​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄).\displaystyle\leq\sigma_{\alpha}^{2}(f)+C\sqrt{\sigma_{\alpha}^{2}(f)\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right)}+C\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right). (31)

If σα2​(f)β‰₯C2​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄)\sigma_{\alpha}^{2}(f)\geq C^{2}\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right) then the expression in (31) is at most

2​σα2​(f)+C​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄),2\sigma_{\alpha}^{2}(f)+C\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right),

while if σα2​(f)≀C2​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄)\sigma_{\alpha}^{2}(f)\leq C^{2}\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right) then the expression in (31) is at most

σα2​(f)+(C2+C)​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄).\sigma_{\alpha}^{2}(f)+(C^{2}+C)\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right).

Thus, either way we have

Οƒ^Ξ±2​(f)≀2​σα2​(f)+(C2+C)​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄).\hat{\sigma}_{\alpha}^{2}(f)\leq 2\sigma_{\alpha}^{2}(f)+(C^{2}+C)\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right).

On the other hand, consider the set ℱ′′≐{βˆ’f2:fβˆˆβ„±}\mathcal{F}^{\prime\prime}\doteq\{-f^{2}:f\in\mathcal{F}\} and note that again we have |ℱ′′​(m)|≀|ℱ​(m)||\mathcal{F}^{\prime\prime}(m)|\leq|\mathcal{F}(m)|. Applying LemmaΒ 6 with this β„±β€²β€²\mathcal{F}^{\prime\prime} and Ξ±β€²\alpha^{\prime}, we have that, with probability at least 1βˆ’Ξ΄/21-\delta/2, every fβˆˆβ„±f\in\mathcal{F} satisfies (for some numerical constant Cβ‰₯1C\geq 1)

βˆ’Οƒ^Ξ±2​(f)\displaystyle-\hat{\sigma}_{\alpha}^{2}(f) β‰€βˆ’ΟƒΞ±2​(f)+C​𝔼​[βˆ‘i=1mΞ±i4​f4​(Wi)]​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄)+C​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄)\displaystyle\leq-\sigma_{\alpha}^{2}(f)+C\sqrt{\mathbb{E}\!\left[\sum_{i=1}^{m}\alpha_{i}^{4}f^{4}(W_{i})\right]\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right)}+C\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right)
β‰€βˆ’ΟƒΞ±2​(f)+C​σα2​(f)​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄)+C​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄).\displaystyle\leq-\sigma_{\alpha}^{2}(f)+C\sqrt{\sigma_{\alpha}^{2}(f)\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right)}+C\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right). (32)

If σα2​(f)β‰₯4​C2​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄)\sigma_{\alpha}^{2}(f)\geq 4C^{2}\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right) then the expression in (32) is at most

βˆ’12​σα2​(f)+C​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄),-\frac{1}{2}\sigma_{\alpha}^{2}(f)+C\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right),

while if σα2​(f)≀4​C2​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄)\sigma_{\alpha}^{2}(f)\leq 4C^{2}\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right) then the expression in (32) is at most

βˆ’ΟƒΞ±2​(f)+(2​C2+C)​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄).-\sigma_{\alpha}^{2}(f)+(2C^{2}+C)\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right).

Thus, either way we have

βˆ’Οƒ^Ξ±2​(f)β‰€βˆ’12​σα2​(f)+(2​C2+C)​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄),-\hat{\sigma}_{\alpha}^{2}(f)\leq-\frac{1}{2}\sigma_{\alpha}^{2}(f)+(2C^{2}+C)\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right),

which implies

Οƒ^Ξ±2​(f)β‰₯12​σα2​(f)βˆ’(2​C2+C)​ln⁑(|ℱ​(m)|​2​log2⁑(2​m)Ξ΄).\hat{\sigma}_{\alpha}^{2}(f)\geq\frac{1}{2}\sigma_{\alpha}^{2}(f)-(2C^{2}+C)\ln\!\left(\frac{|\mathcal{F}(m)|2\log_{2}(2m)}{\delta}\right).

The lemma now follows by a union bound, so that these two events (each of probability at least 1βˆ’Ξ΄/21-\delta/2) occur simultaneously, with probability at least 1βˆ’Ξ΄1-\delta. ∎

We will now show that LemmaΒ 1 follows directly from LemmasΒ 6 and 7.

Proof of LemmaΒ 1.

Set \mathcal{W}=\mathcal{X}\times\mathcal{Y}, W_{i}=(X_{i},Y_{i}), and \alpha_{i}=1. For each h,h^{\prime}\in\mathcal{H}, define f_{h,h^{\prime}}(x,y)\doteq{\mathbbold 1}\!\left\{h^{\prime}(x)\neq y\right\}-{\mathbbold 1}\!\left\{h(x)\neq y\right\}. Note that \hat{\sigma}_{\alpha}^{2}(f_{h,h^{\prime}})=m\hat{\mathbb{P}}_{S}(h\neq h^{\prime}). Define \mathcal{F}\doteq\{f_{h,h^{\prime}}:h,h^{\prime}\in\mathcal{H}\} and note that |\mathcal{F}(m)|\leq|\mathcal{H}(m)|^{2}. Applying Lemma 6 with this \mathcal{F}, we have that with probability at least 1-\delta/2, every h,h^{\prime}\in\mathcal{H} satisfy (for some universal numerical constant C\geq 1)

β„°^S​(hβ€²;h)≀𝔼​[β„°^S​(hβ€²;h)]+C​𝔼​[β„™^S​(hβ‰ hβ€²)]​1m​ln⁑(2​|ℋ​(m)|2​log2⁑(2​m)Ξ΄)+Cm​ln⁑(2​|ℋ​(m)|2​log2⁑(2​m)Ξ΄).\hat{\mathcal{E}}_{S}(h^{\prime};h)\leq\mathbb{E}\!\left[\hat{\mathcal{E}}_{S}(h^{\prime};h)\right]+C\sqrt{\mathbb{E}\!\left[\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\right]\frac{1}{m}\ln\!\left(\frac{2|\mathcal{H}(m)|^{2}\log_{2}(2m)}{\delta}\right)}+\frac{C}{m}\ln\!\left(\frac{2|\mathcal{H}(m)|^{2}\log_{2}(2m)}{\delta}\right). (33)

Furthermore, LemmaΒ 7 implies that, with probability at least 1βˆ’Ξ΄/21-\delta/2, every h,hβ€²βˆˆβ„‹h,h^{\prime}\in\mathcal{H} satisfy (for some universal numerical constant Cβ€²β‰₯1C^{\prime}\geq 1)

12​𝔼​[β„™^S​(hβ‰ hβ€²)]βˆ’Cβ€²m​ln⁑(2​|ℋ​(m)|2​log2⁑(2​m)Ξ΄)≀ℙ^S​(hβ‰ hβ€²)≀2​𝔼​[β„™^S​(hβ‰ hβ€²)]+Cβ€²m​ln⁑(2​|ℋ​(m)|2​log2⁑(2​m)Ξ΄).\frac{1}{2}\mathbb{E}\!\left[\hat{\mathbb{P}}_{S}(h\!\neq\!h^{\prime})\right]-\frac{C^{\prime}}{m}\ln\!\left(\frac{2|\mathcal{H}(m)|^{2}\log_{2}(2m)}{\delta}\right)\leq\hat{\mathbb{P}}_{S}(h\!\neq\!h^{\prime})\leq 2\mathbb{E}\!\left[\hat{\mathbb{P}}_{S}(h\!\neq\!h^{\prime})\right]+\frac{C^{\prime}}{m}\ln\!\left(\frac{2|\mathcal{H}(m)|^{2}\log_{2}(2m)}{\delta}\right). (34)

By a union bound, with probability at least 1βˆ’Ξ΄1-\delta, every h,hβ€²βˆˆβ„‹h,h^{\prime}\in\mathcal{H} satisfy both of (33) and (34). In particular, combining (33) with the left inequality in (34), this also implies every h,hβ€²βˆˆβ„‹h,h^{\prime}\in\mathcal{H} satisfy

β„°^S​(hβ€²;h)\displaystyle\hat{\mathcal{E}}_{S}(h^{\prime};h) ≀𝔼​[β„°^S​(hβ€²;h)]+2​C​ℙ^S​(hβ‰ hβ€²)​1m​ln⁑(2​|ℋ​(m)|2​log2⁑(2​m)Ξ΄)\displaystyle\leq\mathbb{E}\!\left[\hat{\mathcal{E}}_{S}(h^{\prime};h)\right]+2C\sqrt{\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\frac{1}{m}\ln\!\left(\frac{2|\mathcal{H}(m)|^{2}\log_{2}(2m)}{\delta}\right)}
+(2​Cβ€²+1)​C​1m​ln⁑(2​|ℋ​(m)|2​log2⁑(2​m)Ξ΄).\displaystyle{\hskip 28.45274pt}+\left(2\sqrt{C^{\prime}}+1\right)C\frac{1}{m}\ln\!\left(\frac{2|\mathcal{H}(m)|^{2}\log_{2}(2m)}{\delta}\right). (35)

Finally, recall from Sauer’s lemma [50, 42] that |ℋ​(m)|≀(e​max⁑{m,dβ„‹}dβ„‹)dβ„‹|\mathcal{H}(m)|\leq\left(\frac{e\max\{m,d_{\mathcal{H}}\}}{d_{\mathcal{H}}}\right)^{d_{\mathcal{H}}}, and therefore

1m​ln⁑(2​|ℋ​(m)|2​log2⁑(2​m)Ξ΄)\displaystyle\frac{1}{m}\ln\!\left(\frac{2|\mathcal{H}(m)|^{2}\log_{2}(2m)}{\delta}\right) ≀1m​ln⁑((e​max⁑{m,dβ„‹}dβ„‹)2​dℋ​2​log2⁑(2​m)Ξ΄)\displaystyle\leq\frac{1}{m}\ln\!\left(\left(\frac{e\max\{m,d_{\mathcal{H}}\}}{d_{\mathcal{H}}}\right)^{2d_{\mathcal{H}}}\frac{2\log_{2}(2m)}{\delta}\right)
≀1m​ln⁑((e​max⁑{m,dβ„‹}dβ„‹)3​dℋ​4Ξ΄)≀C′′​Ρ​(m,Ξ΄)\displaystyle\leq\frac{1}{m}\ln\!\left(\left(\frac{e\max\{m,d_{\mathcal{H}}\}}{d_{\mathcal{H}}}\right)^{3d_{\mathcal{H}}}\frac{4}{\delta}\right)\leq C^{\prime\prime}\varepsilon(m,\delta)

for some numerical constant Cβ€²β€²β‰₯1C^{\prime\prime}\geq 1.

Altogether (and subtracting 𝔼​[β„°^S​(hβ€²;h)]\mathbb{E}\!\left[\hat{\mathcal{E}}_{S}(h^{\prime};h)\right] and β„°^S​(hβ€²;h)\hat{\mathcal{E}}_{S}(h^{\prime};h) from both sides of (33) and (35)), we have that with probability at least 1βˆ’Ξ΄1-\delta, every h,hβ€²βˆˆβ„‹h,h^{\prime}\in\mathcal{H} satisfy

𝔼​[β„°^S​(h;hβ€²)]≀ℰ^S​(h;hβ€²)+2​C​C′′​min⁑{𝔼​[β„™^S​(hβ‰ hβ€²)],β„™^S​(hβ‰ hβ€²)}​Ρ​(m,Ξ΄)+(2​Cβ€²+1)​C​C′′​Ρ​(m,Ξ΄)\mathbb{E}\!\left[\hat{\mathcal{E}}_{S}(h;h^{\prime})\right]\leq\hat{\mathcal{E}}_{S}(h;h^{\prime})+2C\sqrt{C^{\prime\prime}}\sqrt{\min\!\left\{\mathbb{E}\!\left[\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\right],\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\right\}\varepsilon(m,\delta)}+\left(2\sqrt{C^{\prime}}+1\right)CC^{\prime\prime}\varepsilon(m,\delta)

and

12​𝔼​[β„™^S​(hβ‰ hβ€²)]βˆ’C′​C′′​Ρ​(m,Ξ΄)≀ℙ^S​(hβ‰ hβ€²)≀2​𝔼​[β„™^S​(hβ‰ hβ€²)]+C′​C′′​Ρ​(m,Ξ΄),\frac{1}{2}\mathbb{E}\!\left[\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\right]-C^{\prime}C^{\prime\prime}\varepsilon(m,\delta)\leq\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\leq 2\mathbb{E}\!\left[\hat{\mathbb{P}}_{S}(h\neq h^{\prime})\right]+C^{\prime}C^{\prime\prime}\varepsilon(m,\delta),

which completes the proof. ∎

Appendix B Pooling is Optimal if Enough Tasks are Good

While our results in Sections 4.3 and 8 imply that, in general, one cannot achieve optimal rates by simply pooling all of the data and using the global ERM \hat{h}_{Z}, in this section we find that in some special cases this naive approach can actually be successful: namely, cases where most of the tasks have \rho_{t} below the cut-off value \rho_{(t^{*})} chosen by the optimization in the optimal procedure from Section 6. We in fact show a general result for pooling, arguing that it always achieves a rate depending on the (weighted) median value of \bar{\rho}_{t}, or more generally on any quantile of the \bar{\rho}_{t} values.

Theorem 9 (Pooling Beyond Ξ²=1\beta=1).

For any α∈(0,1]\alpha\in(0,1], let t​(Ξ±)t(\alpha) be the smallest value in [N+1][N+1] such that βˆ‘t∈[t​(Ξ±)]n(t)β‰₯Ξ±β€‹βˆ‘t=1N+1nt\sum_{t\in[t(\alpha)]}n_{(t)}\geq\alpha\sum_{t=1}^{N+1}n_{t}. Then, for any δ∈(0,1)\delta\in(0,1), with probability at least 1βˆ’Ξ΄1-\delta we have

β„°π’Ÿβ€‹(h^Z)≀Cρ​(C​dℋ​log⁑(1dβ„‹β€‹βˆ‘t=1N+1nt)+log⁑(1/Ξ΄)βˆ‘t=1N+1nt)1/(2βˆ’Ξ²)​ρ¯t​(Ξ±)\mathcal{E}_{{\cal D}}\!\left(\hat{h}_{Z}\right)\leq C_{\rho}\left(C\frac{d_{\mathcal{H}}\log\!\left(\frac{1}{d_{\mathcal{H}}}\sum_{t=1}^{N+1}n_{t}\right)+\log(1/\delta)}{\sum_{t=1}^{N+1}n_{t}}\right)^{1/(2-\beta)\bar{\rho}_{t(\alpha)}}

for a constant C=(32​C02/Ξ±)2βˆ’Ξ²β€‹CΞ²C=(32C_{0}^{2}/\alpha)^{2-\beta}C_{\beta}.

Proof.

Letting 𝐧[N+1]β‰βˆ‘t=1N+1nt\mathbf{n}_{[N+1]}\doteq\sum_{t=1}^{N+1}n_{t}, P¯≐𝐧[N+1]βˆ’1β€‹βˆ‘t=1N+1nt​Pt\bar{P}\doteq\mathbf{n}_{[N+1]}^{-1}\sum_{t=1}^{N+1}n_{t}P_{t}, and π§Ξ±β‰βˆ‘t∈[t​(Ξ±)]n(t)\mathbf{n}_{\alpha}\doteq\sum_{t\in[t(\alpha)]}n_{(t)}, we have

β„°P¯​(h^Z)\displaystyle\mathcal{E}_{\bar{P}}(\hat{h}_{Z}) β‰₯𝐧[N+1]βˆ’1β€‹βˆ‘t=1N+1nt​(CΟβˆ’1β€‹β„°π’Ÿβ€‹(h^Z))ρt\displaystyle\geq\mathbf{n}_{[N+1]}^{-1}\sum_{t=1}^{N+1}n_{t}\left(C_{\rho}^{-1}\mathcal{E}_{{\cal D}}(\hat{h}_{Z})\right)^{\rho_{t}}
β‰₯𝐧α𝐧[N+1]β€‹βˆ‘t∈[t​(Ξ±)]n(t)𝐧α​(CΟβˆ’1β€‹β„°π’Ÿβ€‹(h^Z))ρ(t)β‰₯α​(CΟβˆ’1β€‹β„°π’Ÿβ€‹(h^Z))ρ¯t​(Ξ±),\displaystyle\geq\frac{\mathbf{n}_{\alpha}}{\mathbf{n}_{[N+1]}}\sum_{t\in[t(\alpha)]}\frac{n_{(t)}}{\mathbf{n}_{\alpha}}\left(C_{\rho}^{-1}\mathcal{E}_{{\cal D}}(\hat{h}_{Z})\right)^{\rho_{(t)}}\geq\alpha\left(C_{\rho}^{-1}\mathcal{E}_{{\cal D}}(\hat{h}_{Z})\right)^{\bar{\rho}_{t(\alpha)}},

where the third inequality is due to Jensen’s inequality. This implies

β„°π’Ÿβ€‹(h^Z)≀Cρ​((1/Ξ±)​ℰP¯​(h^Z))1/ρ¯t​(Ξ±).\mathcal{E}_{{\cal D}}(\hat{h}_{Z})\leq C_{\rho}\left((1/\alpha)\mathcal{E}_{\bar{P}}(\hat{h}_{Z})\right)^{1/\bar{\rho}_{t(\alpha)}}. (36)

LemmaΒ 2 implies that with probability at least 1βˆ’Ξ΄1-\delta,

β„°P¯​(h^Z)≀32​C02​(Cβ​Ρ​(𝐧[N+1],Ξ΄))1/(2βˆ’Ξ²).\mathcal{E}_{\bar{P}}(\hat{h}_{Z})\leq 32C_{0}^{2}(C_{\beta}\varepsilon(\mathbf{n}_{[N+1]},\delta))^{1/(2-\beta)}.

Combining this with (36) completes the proof. ∎

For instance, if all n_{t} are equal to some common value n, and we take \alpha=1/2, then we can take t(\alpha)=\lceil(N+1)/2\rceil, so that the optimal rate will be achieved as long as at least half of the tasks have \bar{\rho}_{t} below the value \bar{\rho}_{t^{*}} for t^{*} the minimizer of the bound in Theorem 1.
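For concreteness, specializing Theorem 9 to this case (equal sample sizes n_{t}\equiv n and \alpha=1/2, so that \sum_{t}n_{t}=(N+1)n and the constant becomes C=(64C_{0}^{2})^{2-\beta}C_{\beta}), the bound reads

\mathcal{E}_{{\cal D}}\!\left(\hat{h}_{Z}\right)\leq C_{\rho}\left(C\,\frac{d_{\mathcal{H}}\log\!\left(\frac{(N+1)n}{d_{\mathcal{H}}}\right)+\log(1/\delta)}{(N+1)n}\right)^{1/(2-\beta)\bar{\rho}_{\lceil(N+1)/2\rceil}},

that is, a rate in terms of the aggregate sample size (N+1)n, with the exponent governed by \bar{\rho} at the (weighted) median task.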

Optimizing the bound in TheoremΒ 9 over the choice of Ξ±\alpha yields the following result.

Corollary 2 (General Pooling Bound).

For any δ∈(0,1)\delta\in(0,1), with probability at least 1βˆ’Ξ΄1-\delta we have

β„°π’Ÿβ€‹(h^Z)≀mint∈[N+1]⁑Cρ​(C​dℋ​log⁑(1dβ„‹β€‹βˆ‘s=1N+1ns)+log⁑(1/Ξ΄)(βˆ‘s=1tn(s))2βˆ’Ξ²β€‹(βˆ‘s=1N+1ns)βˆ’(1βˆ’Ξ²))1/(2βˆ’Ξ²)​ρ¯t\mathcal{E}_{{\cal D}}\!\left(\hat{h}_{Z}\right)\leq\min\limits_{t\in[N+1]}C_{\rho}\left(C\frac{d_{\mathcal{H}}\log\!\left(\frac{1}{d_{\mathcal{H}}}\sum_{s=1}^{N+1}n_{s}\right)+\log(1/\delta)}{\left(\sum_{s=1}^{t}n_{(s)}\right)^{2-\beta}\left(\sum_{s=1}^{N+1}n_{s}\right)^{-(1-\beta)}}\right)^{1/(2-\beta)\bar{\rho}_{t}}

for a constant C=(32​C02)2βˆ’Ξ²β€‹CΞ²C=(32C_{0}^{2})^{2-\beta}C_{\beta}.

Remark 9.

In particular, note that this result recovers the bound of TheoremΒ 3 in the special case of Ξ²=1\beta=1.
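To make the trade-off in Corollary 2 concrete, the following small numerical sketch (not part of the analysis; all inputs are hypothetical placeholders) evaluates the right-hand side of the bound over t\in[N+1]. It assumes the tasks are listed in the rank order \rho_{(1)}\leq\ldots\leq\rho_{(N+1)}, takes \bar{\rho}_{t} to be the sample-weighted average of \rho_{(s)}, s\leq t, as in the proof of Theorem 9, and sets the constants C, C_{\rho} to 1 for illustration.

import numpy as np

def corollary2_curve(n_ranked, rho_ranked, d_H, beta, delta, C=1.0, C_rho=1.0):
    # n_ranked, rho_ranked: per-task sample sizes and exponents rho_t,
    # listed in the rank order rho_(1) <= ... <= rho_(N+1).
    n_ranked = np.asarray(n_ranked, dtype=float)
    rho_ranked = np.asarray(rho_ranked, dtype=float)
    total = n_ranked.sum()
    log_term = d_H * np.log(total / d_H) + np.log(1.0 / delta)
    cum_n = np.cumsum(n_ranked)                          # sum_{s <= t} n_(s)
    bar_rho = np.cumsum(n_ranked * rho_ranked) / cum_n   # weighted average of rho_(s), s <= t
    effective_n = cum_n ** (2 - beta) * total ** (-(1 - beta))
    return C_rho * (C * log_term / effective_n) ** (1.0 / ((2 - beta) * bar_rho))

# Hypothetical example: 50 source tasks plus the target, 100 samples each,
# with transfer exponents growing linearly from 1 to 3.
bounds = corollary2_curve(n_ranked=[100] * 51,
                          rho_ranked=np.linspace(1.0, 3.0, 51),
                          d_H=5, beta=0.5, delta=0.05)
print("minimizing t:", int(bounds.argmin()) + 1, " bound:", bounds.min())

The minimizing t balances the effective sample size (which grows with t) against the exponent \bar{\rho}_{t} (which also grows with t), so it is typically interior rather than at t=N+1.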

Appendix C Different Optimal Aggregations of Tasks in Multitask

We present a proof of Theorem 2 in this section. Recall that this result states that different choices of target in the same multitask setting can induce different optimal aggregations of tasks. As a consequence, the naive approach of pooling all tasks can adversely affect target risk even when all h^{\!*}'s are the same.

We employ a similar construction to that of Section 8.

Setup.

We again build on distributions supported on two datapoints x_{0},x_{1}. W.l.o.g., assume that x_{0} has label 1. Let n_{P},n_{{\cal D}}\geq 1, 0\leq\beta<1, and define \epsilon\doteq n_{P}^{-1/(2-\beta)}. Let \sigma\in\left\{\pm 1\right\}, which we will often abbreviate as \pm. In all that follows, we let \eta_{\mu}(X) denote the regression function \mathbb{P}_{\mu}[Y=1\mid X] under distribution \mu.

  • β€’

    Target π’ŸΟƒ=π’ŸXΓ—π’ŸY|XΟƒ{\cal D}_{\sigma}={\cal D}_{X}\times{\cal D}^{\sigma}_{Y|X}: Let π’ŸX​(x1)=1/2{\cal D}_{X}(x_{1})={1}/{2}, π’ŸX​(x0)=1/2{\cal D}_{X}(x_{0})={1}/{2}; finally π’ŸY|XΟƒ{\cal D}^{\sigma}_{Y|X} is determined by
    Ξ·π’Ÿ,σ​(x1)=1/2+Οƒβ‹…(1/4)\eta_{{\cal D},\sigma}(x_{1})=1/2+\sigma\cdot({1}/{4}), and Ξ·π’Ÿ,σ​(x0)=1\eta_{{\cal D},\sigma}(x_{0})=1.

  • β€’

    Source PΟƒ=PXΓ—PY|XΟƒP_{\sigma}=P_{X}\times P^{\sigma}_{Y|X}: Let PX​(x1)=ϡβP_{X}(x_{1})=\epsilon^{\beta}, PX​(x0)=1βˆ’Ο΅Ξ²P_{X}(x_{0})=1-\epsilon^{\beta}; finally PY|XΟƒP^{\sigma}_{Y|X} is determined by
    \eta_{P,\sigma}(x_{1})=1/2+\sigma\cdot{c_{1}}\epsilon^{1-\beta}, and \eta_{P,\sigma}(x_{0})=1, for an appropriate constant c_{1} specified in the proof.

Proof of Theorem 2.

Let P\doteq P_{-} and {\cal D}\doteq{\cal D}_{-}. The above construction then ensures that any h s.t. h(x_{1})=+1 has excess error \mathcal{E}_{{\cal D}}(h)=1/4. Thus we just have to show that the number of +1 labels at x_{1} exceeds the number of -1 labels at x_{1} with non-zero probability (so that both \hat{h}_{Z},\hat{h}_{Z_{P}} would pick +1 at x_{1}). Let {\hat{n}}_{P,1} denote the number of samples from P at x_{1}, and {\hat{n}}_{P,1}^{+} denote those samples from P having label +1. Notice that if {\hat{n}}_{P,1}^{+}>\frac{1}{2}\left({\hat{n}}_{P,1}+n_{\cal D}\right), then +1 necessarily dominates -1 at x_{1}.

Now, conditioned on n^P,1{\hat{n}}_{P,1}, n^P,1+{\hat{n}}_{P,1}^{+} is distributed as Binomial​(n^P,1,p)\text{Binomial}({\hat{n}}_{P,1},p) with p=Ξ·P,βˆ’β€‹(x1)p=\eta_{P,-}(x_{1}). Applying Lemma 3, we have

\displaystyle\mathbb{P}\left({\hat{n}}_{P,1}^{+}>{\hat{n}}_{P,1}\cdot p+\sqrt{{\hat{n}}_{P,1}\cdot p(1-p)}\right)\geq\frac{1}{12}.

In other words, under the above binomial event, +1+1 dominates whenever (the second inequality below holds)

n^P,1β‹…p​(1βˆ’p)β‰₯14​n^P,1β‰₯n^P,1​(12βˆ’p)+12​nπ’Ÿ,\displaystyle\sqrt{{\hat{n}}_{P,1}\cdot p(1-p)}\geq\frac{1}{4}\sqrt{{\hat{n}}_{P,1}}\geq{\hat{n}}_{P,1}\left(\frac{1}{2}-p\right)+\frac{1}{2}n_{\cal D},

in other words, if we have both n^P,1β‹…c12β‹…Ο΅2​(1βˆ’Ξ²)≀164{{\hat{n}}_{P,1}}\cdot c_{1}^{2}\cdot\epsilon^{2(1-\beta)}\leq\frac{1}{64} and nπ’Ÿ2≀14​n^P,1n_{\cal D}^{2}\leq\frac{1}{4}{{\hat{n}}_{P,1}}. Let 𝐄P≐{12​𝔼​[n^P,1]≀n^P,1≀2​𝔼​[n^P,1]}\mathbf{E}_{P}\doteq\left\{\frac{1}{2}\mathbb{E}\left[{\hat{n}}_{P,1}\right]\leq{\hat{n}}_{P,1}\leq 2\mathbb{E}\left[{\hat{n}}_{P,1}\right]\right\}, where 𝔼​[n^P,1]=nPβ‹…PX​(x1)=nP⋅ϡβ\mathbb{E}\left[{\hat{n}}_{P,1}\right]=n_{P}\cdot P_{X}(x_{1})=n_{P}\cdot\epsilon^{\beta}. Under this event, we just need that 2​c12β‹…nPβ‹…Ο΅(2βˆ’Ξ²)=2​c12≀1642c_{1}^{2}\cdot n_{P}\cdot\epsilon^{(2-\beta)}=2c_{1}^{2}\leq\frac{1}{64}, and that nπ’Ÿ2≀18​nP⋅ϡβ=18​nP(2βˆ’2​β)/(2βˆ’Ξ²)n_{\cal D}^{2}\leq\frac{1}{8}n_{P}\cdot\epsilon^{\beta}=\frac{1}{8}n_{P}^{(2-2\beta)/(2-\beta)}, requiring Ξ²<1\beta<1. Hence, integrating over n^P,1{\hat{n}}_{P,1}, we have that

ℙ​(n^P,1+>12​(n^P,1+nπ’Ÿ))β‰₯112​ℙ​(𝐄P)>0,\displaystyle\mathbb{P}\left({\hat{n}}_{P,1}^{+}>\frac{1}{2}\left({\hat{n}}_{P,1}+n_{\cal D}\right)\right)\geq\frac{1}{12}\mathbb{P}\left(\mathbf{E}_{P}\right)>0,

where ℙ​(𝐄P)\mathbb{P}\left(\mathbf{E}_{P}\right) is bounded below by a multiplicative Chernoff whenever 𝔼​[n^P,1]=nP(2βˆ’2​β)/(2βˆ’Ξ²)β‰₯1\mathbb{E}\left[{\hat{n}}_{P,1}\right]=n_{P}^{(2-2\beta)/(2-\beta)}\geq 1. ∎
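As an illustrative sanity check of this construction (not part of the proof), the following short simulation draws pooled data from P_{-} and {\cal D}_{-} as specified above and estimates the probability that the +1 labels outnumber the -1 labels at x_{1}, in which case the pooled ERM labels x_{1} as +1 and suffers excess target risk 1/4. Only the label counts at x_{1} are simulated, since the argument turns on the majority label there; the parameter values n_{P}, n_{{\cal D}}, \beta, c_{1} below are made-up choices satisfying the constraints used in the proof.

import numpy as np

rng = np.random.default_rng(0)

# Made-up parameters satisfying 2*c_1**2 <= 1/64 and n_D**2 <= n_P**((2-2*beta)/(2-beta))/8.
n_P, n_D, beta, c_1 = 1000, 3, 0.5, 0.08
eps = n_P ** (-1.0 / (2 - beta))

def labels_at_x1(n, p_x1, p_plus):
    # Return (#(+1) labels, #(-1) labels) among n draws that land on x_1.
    m = rng.binomial(n, p_x1)         # number of the n points falling on x_1
    plus = rng.binomial(m, p_plus)    # +1 labels among them
    return plus, m - plus

trials, bad = 20000, 0
for _ in range(trials):
    s_plus, s_minus = labels_at_x1(n_P, eps ** beta, 0.5 - c_1 * eps ** (1 - beta))  # source P_-
    t_plus, t_minus = labels_at_x1(n_D, 0.5, 0.25)                                   # target D_-
    if s_plus + t_plus > s_minus + t_minus:   # pooled majority at x_1 is +1
        bad += 1                              # pooled ERM then has excess risk 1/4 on D_-
print("estimated probability of the bad event:", bad / trials)

The estimate should come out as a nontrivial constant, in line with the constant-probability lower bound established above.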

Appendix D Supporting Results for Section 8 on Impossibility of Adaptivity

Recall that from our construction, we set

Γσ=(Ξ±P​PΟƒn+Ξ±Q​QΟƒn)NΓ—π’ŸΟƒnπ’Ÿ.\Gamma_{\sigma}=\left(\alpha_{P}P_{\sigma}^{n}+\alpha_{Q}Q_{\sigma}^{n}\right)^{N}\times{\cal D}_{\sigma}^{n_{\cal D}}.
Proof of Proposition 2.

Let Γσ,α≐αP​PΟƒn+Ξ±Q​QΟƒn\Gamma_{\sigma,\alpha}\doteq{\alpha_{P}P_{\sigma}^{n}+\alpha_{Q}Q_{\sigma}^{n}}. Clearly

Ξ“+​(Z)Ξ“βˆ’β€‹(Z)=π’Ÿ+​(ZN+1)π’Ÿβˆ’β€‹(ZN+1)β‹…βˆt=1NΞ“+,α​(Zt)Ξ“βˆ’,α​(Zt).\frac{\Gamma_{+}(Z)}{\Gamma_{-}(Z)}=\frac{{\cal D}_{+}(Z_{N+1})}{{\cal D}_{-}(Z_{N+1})}\cdot\prod_{t=1}^{N}\frac{\Gamma_{+,\alpha}(Z_{t})}{\Gamma_{-,\alpha}(Z_{t})}.

Define n^t,0{\hat{n}}_{t,0}, n^t,+{\hat{n}}_{t,+}, and n^t,βˆ’{\hat{n}}_{t,-} as the number of points (Xt,i,i∈[n]X_{t,i},i\in[n]) in ZtZ_{t} which, respectively fall on x0x_{0}, or fall on x1x_{1} with Yt,i=+1Y_{t,i}=+1, or fall on x1x_{1} with Yt,i=βˆ’1Y_{t,i}=-1. Recall that n^Οƒ=βˆ‘t∈[N]n^t,Οƒ{\hat{n}}_{\sigma}=\sum_{t\in[N]}{\hat{n}}_{t,\sigma} for Οƒ=Β±\sigma=\pm. Next, define 𝒡σ≐{Zt,t∈[N]:n^t,Οƒ=n}{\cal Z}_{\sigma}\doteq\left\{Z_{t},t\in[N]:{\hat{n}}_{t,\sigma}=n\right\}, i.e., the set of Οƒ\sigma-homogeneous vectors. We have:

Ξ“+,α​(Zt)\displaystyle\Gamma_{+,\alpha}(Z_{t}) =Ξ±Pβ‹…(Ξ·P,βˆ’n^t,βˆ’β€‹(x1)β‹…Ξ·P,+n^t,+​(x1)β‹…PX(n^t,++n^t,βˆ’)​(x1)β‹…PXn^t,0​(x0))+Ξ±Qβ‹…1​{Ztβˆˆπ’΅+},\displaystyle=\alpha_{P}\cdot\left(\eta_{P,-}^{{\hat{n}}_{t,-}}(x_{1})\cdot\eta_{P,+}^{{\hat{n}}_{t,+}}(x_{1})\cdot P_{X}^{\left({\hat{n}}_{t,+}+{\hat{n}}_{t,-}\right)}(x_{1})\cdot P_{X}^{{\hat{n}}_{t,0}}(x_{0})\right)+\alpha_{Q}\cdot{\mathbbold 1}\!\left\{Z_{t}\in{\cal Z}_{+}\right\},
Ξ“βˆ’,α​(Zt)\displaystyle\Gamma_{-,\alpha}(Z_{t}) =Ξ±Pβ‹…(Ξ·P,βˆ’n^t,+​(x1)β‹…Ξ·P,+n^t,βˆ’β€‹(x1)β‹…PX(n^t,++n^t,βˆ’)​(x1)β‹…PXn^t,0​(x0))+Ξ±Qβ‹…1​{Ztβˆˆπ’΅βˆ’}.\displaystyle=\alpha_{P}\cdot\left(\eta_{P,-}^{{\hat{n}}_{t,+}}(x_{1})\cdot\eta_{P,+}^{{\hat{n}}_{t,-}}(x_{1})\cdot P_{X}^{\left({\hat{n}}_{t,+}+{\hat{n}}_{t,-}\right)}(x_{1})\cdot P_{X}^{{\hat{n}}_{t,0}}(x_{0})\right)+\alpha_{Q}\cdot{\mathbbold 1}\!\left\{Z_{t}\in{\cal Z}_{-}\right\}.

We can then consider homogeneous and non-homogeneous vectors separately. Let 𝒡±≐𝒡+βˆͺπ’΅βˆ’{\cal Z}_{\pm}\doteq{\cal Z}_{+}\cup{\cal Z}_{-},

∏Ztβˆˆπ’΅Β±Ξ“+,α​(Zt)Ξ“βˆ’,α​(Zt)\displaystyle\prod_{Z_{t}\in{\cal Z}_{\pm}}\frac{\Gamma_{+,\alpha}(Z_{t})}{\Gamma_{-,\alpha}(Z_{t})} =Ξ±PN^βˆ’β€‹(Ξ·P,βˆ’β€‹(x1)β‹…PX​(x1))nβ‹…N^βˆ’(Ξ±P​(Ξ·P,+​(x1)β‹…PX​(x1))n+Ξ±Q)N^βˆ’β‹…(Ξ±P​(Ξ·P,+​(x1)β‹…PX​(x1))n+Ξ±Q)N^+Ξ±PN^+​(Ξ·P,βˆ’β€‹(x1)β‹…PX​(x1))nβ‹…N^+\displaystyle=\frac{\alpha_{P}^{{\hat{N}}_{-}}\left(\eta_{P,-}(x_{1})\cdot P_{X}(x_{1})\right)^{n\cdot{\hat{N}}_{-}}}{\left(\alpha_{P}\left(\eta_{P,+}(x_{1})\cdot P_{X}(x_{1})\right)^{n}+\alpha_{Q}\right)^{{\hat{N}}_{-}}}\cdot\frac{\left(\alpha_{P}\left(\eta_{P,+}(x_{1})\cdot P_{X}(x_{1})\right)^{n}+\alpha_{Q}\right)^{{\hat{N}}_{+}}}{\alpha_{P}^{{\hat{N}}_{+}}\left(\eta_{P,-}(x_{1})\cdot P_{X}(x_{1})\right)^{n\cdot{\hat{N}}_{+}}}
=Ξ±P(N^βˆ’βˆ’N^+)β‹…(Ξ·P,βˆ’β€‹(x1)β‹…PX​(x1))n​(N^βˆ’βˆ’N^+)β‹…(Ξ±P​(Ξ·P,+​(x1)β‹…PX​(x1))n+Ξ±Q)N^+βˆ’N^βˆ’\displaystyle=\alpha_{P}^{\left({\hat{N}}_{-}-{\hat{N}}_{+}\right)}\cdot\left(\eta_{P,-}(x_{1})\cdot P_{X}(x_{1})\right)^{n\left({\hat{N}}_{-}-{\hat{N}}_{+}\right)}\cdot\left(\alpha_{P}\left(\eta_{P,+}(x_{1})\cdot P_{X}(x_{1})\right)^{n}+\alpha_{Q}\right)^{{\hat{N}}_{+}-{\hat{N}}_{-}} (37)

where we broke up the product over π’΅βˆ’\cal{Z}_{-} and 𝒡+\cal{Z}_{+}. Now, canceling out similar terms in the fraction,

∏Ztβˆ‰π’΅Β±Ξ“+,α​(Zt)Ξ“βˆ’,α​(Zt)\displaystyle\prod_{Z_{t}\notin{\cal Z}_{\pm}}\frac{\Gamma_{+,\alpha}(Z_{t})}{\Gamma_{-,\alpha}(Z_{t})} =(Ξ·P,βˆ’β€‹(x1)Ξ·P,+​(x1))(n^βˆ’βˆ’nβ‹…N^βˆ’)β‹…(Ξ·P,+​(x1)Ξ·P,βˆ’β€‹(x1))(n^+βˆ’nβ‹…N^+)\displaystyle=\left(\frac{\eta_{P,-}(x_{1})}{\eta_{P,+}(x_{1})}\right)^{\left({\hat{n}}_{-}-n\cdot{\hat{N}}_{-}\right)}\cdot\left(\frac{\eta_{P,+}(x_{1})}{\eta_{P,-}(x_{1})}\right)^{\left({\hat{n}}_{+}-n\cdot{\hat{N}}_{+}\right)}
=(Ξ·P,+​(x1)Ξ·P,βˆ’β€‹(x1))(n^+βˆ’n^βˆ’)β‹…(Ξ·P,βˆ’β€‹(x1)Ξ·P,+​(x1))n​(N^+βˆ’N^βˆ’)\displaystyle=\left(\frac{\eta_{P,+}(x_{1})}{\eta_{P,-}(x_{1})}\right)^{\left({\hat{n}}_{+}-{\hat{n}}_{-}\right)}\cdot\left(\frac{\eta_{P,-}(x_{1})}{\eta_{P,+}(x_{1})}\right)^{n\left({\hat{N}}_{+}-{\hat{N}}_{-}\right)} (38)

Now, the second factor in (38) can be expanded as follows to cancel out the second factor in (37):

(Ξ·P,βˆ’β€‹(x1)Ξ·P,+​(x1))n​(N^+βˆ’N^βˆ’)=(Ξ·P,βˆ’β€‹(x1)β‹…PX​(x1)Ξ·P,+​(x1)β‹…PX​(x1))n​(N^+βˆ’N^βˆ’).\displaystyle\left(\frac{\eta_{P,-}(x_{1})}{\eta_{P,+}(x_{1})}\right)^{n\left({\hat{N}}_{+}-{\hat{N}}_{-}\right)}=\left(\frac{\eta_{P,-}(x_{1})\cdot P_{X}(x_{1})}{\eta_{P,+}(x_{1})\cdot P_{X}(x_{1})}\right)^{n\left({\hat{N}}_{+}-{\hat{N}}_{-}\right)}.

That is, we have:

(Ξ·P,βˆ’β€‹(x1)Ξ·P,+​(x1))n​(N^+βˆ’N^βˆ’)β‹…βˆZtβˆˆπ’΅Β±Ξ“+,α​(Zt)Ξ“βˆ’,α​(Zt)=(Ξ±P​(Ξ·P,+​(x1)β‹…PX​(x1))n+Ξ±QΞ±P​(Ξ·P,+​(x1)β‹…PX​(x1))n)(N^+βˆ’N^βˆ’).\displaystyle\left(\frac{\eta_{P,-}(x_{1})}{\eta_{P,+}(x_{1})}\right)^{n\left({\hat{N}}_{+}-{\hat{N}}_{-}\right)}\cdot\prod_{Z_{t}\in{\cal Z}_{\pm}}\frac{\Gamma_{+,\alpha}(Z_{t})}{\Gamma_{-,\alpha}(Z_{t})}=\left(\frac{\alpha_{P}\left(\eta_{P,+}(x_{1})\cdot P_{X}(x_{1})\right)^{n}+\alpha_{Q}}{\alpha_{P}\left(\eta_{P,+}(x_{1})\cdot P_{X}(x_{1})\right)^{n}}\right)^{\left({\hat{N}}_{+}-{\hat{N}}_{-}\right)}.

In other words, we have:

∏t=1NΞ“+,α​(Zt)Ξ“βˆ’,α​(Zt)=(Ξ·P,+​(x1)Ξ·P,βˆ’β€‹(x1))(n^+βˆ’n^βˆ’)β‹…(Ξ±P​(Ξ·P,+​(x1)β‹…PX​(x1))n+Ξ±QΞ±P​(Ξ·P,+​(x1)β‹…PX​(x1))n)(N^+βˆ’N^βˆ’).\displaystyle\prod_{t=1}^{N}\frac{\Gamma_{+,\alpha}(Z_{t})}{\Gamma_{-,\alpha}(Z_{t})}=\left(\frac{\eta_{P,+}(x_{1})}{\eta_{P,-}(x_{1})}\right)^{\left({\hat{n}}_{+}-{\hat{n}}_{-}\right)}\cdot\left(\frac{\alpha_{P}\left(\eta_{P,+}(x_{1})\cdot P_{X}(x_{1})\right)^{n}+\alpha_{Q}}{\alpha_{P}\left(\eta_{P,+}(x_{1})\cdot P_{X}(x_{1})\right)^{n}}\right)^{\left({\hat{N}}_{+}-{\hat{N}}_{-}\right)}.

We then conclude by noticing that each of the above fractions is greater than 11. ∎

Proof of Proposition 7.

For t=N+1t=N+1, define n^t,0{\hat{n}}_{t,0}, n^t,+{\hat{n}}_{t,+}, and n^t,βˆ’{\hat{n}}_{t,-} as the number of points (Xt,i,i∈[n]X_{t,i},i\in[n]) in ZtZ_{t} which, respectively fall on x0x_{0}, or fall on x1x_{1} with Yt,i=+1Y_{t,i}=+1, or fall on x1x_{1} with Yt,i=βˆ’1Y_{t,i}=-1. Thus, for t=N+1t=N+1 fixed, we have

π’Ÿ+​(ZN+1)π’Ÿβˆ’β€‹(ZN+1)=π’ŸXn^t,0​(x0)β‹…π’ŸX(n^t,++n^t,βˆ’)​(x1)β‹…Ξ·π’Ÿ,+n^t,+​(x1)β‹…Ξ·π’Ÿ,βˆ’n^t,βˆ’β€‹(x1)π’ŸXn^t,0​(x0)β‹…π’ŸX(n^t,++n^t,βˆ’)​(x1)β‹…Ξ·π’Ÿ,+n^t,βˆ’β€‹(x1)β‹…Ξ·π’Ÿ,βˆ’n^t,+​(x1)=(Ξ·π’Ÿ,+​(x1)Ξ·π’Ÿ,βˆ’β€‹(x1))(n^t,+βˆ’n^t,βˆ’).\displaystyle\frac{{\cal D}_{+}(Z_{N+1})}{{\cal D}_{-}(Z_{N+1})}=\frac{{\cal D}_{X}^{{\hat{n}}_{t,0}}(x_{0})\cdot{\cal D}_{X}^{({\hat{n}}_{t,+}+{\hat{n}}_{t,-})}(x_{1})\cdot\eta_{{\cal D},+}^{{\hat{n}}_{t,+}}(x_{1})\cdot\eta_{{\cal D},-}^{{\hat{n}}_{t,-}}(x_{1})}{{\cal D}_{X}^{{\hat{n}}_{t,0}}(x_{0})\cdot{\cal D}_{X}^{({\hat{n}}_{t,+}+{\hat{n}}_{t,-})}(x_{1})\cdot\eta_{{\cal D},+}^{{\hat{n}}_{t,-}}(x_{1})\cdot\eta_{{\cal D},-}^{{\hat{n}}_{t,+}}(x_{1})}=\left(\frac{\eta_{{\cal D},+}(x_{1})}{\eta_{{\cal D},-}(x_{1})}\right)^{({\hat{n}}_{t,+}-{\hat{n}}_{t,-})}. (39)

Now (39) β‰₯1\geq 1 whenever n^t,+β‰₯n^t,βˆ’{\hat{n}}_{t,+}\geq{\hat{n}}_{t,-}, so we proceed to bounding the probability of this event under π’Ÿβˆ’{\cal D}_{-}. In particular, when n=1n=1, the event has probability at least ℙ​(n^t,0=1)=π’ŸX​(x0)=1βˆ’12=12\mathbb{P}({\hat{n}}_{t,0}=1)={\cal D}_{X}(x_{0})=1-\frac{1}{2}=\frac{1}{2}. Assume henceforth that n>1n>1, and let n^t≐n^t,++n^t,βˆ’{\hat{n}}_{t}\doteq{\hat{n}}_{t,+}+{\hat{n}}_{t,-}. Conditioned on n^t{\hat{n}}_{t}, n^t,+{\hat{n}}_{t,+} is distributed as Binomial​(n^t,p)\text{Binomial}({\hat{n}}_{t},p), with p=Ξ·π’Ÿ,βˆ’β€‹(x1)p=\eta_{{\cal D},-}(x_{1}). By the anticoncentration Lemma 3, we then have

ℙ​(n^t,+>n^tβ‹…p+n^tβ‹…p​(1βˆ’p))β‰₯112,\displaystyle\mathbb{P}\left({\hat{n}}_{t,+}>{\hat{n}}_{t}\cdot p+\sqrt{{\hat{n}}_{t}\cdot p(1-p)}\right)\geq\frac{1}{12}, (40)

so the event n^t,+β‰₯n^t,βˆ’{\hat{n}}_{t,+}\geq{\hat{n}}_{t,-} holds whenever

n^tβ‹…p​(1βˆ’p)β‰₯12​n^tβˆ’n^tβ‹…p\displaystyle\sqrt{{\hat{n}}_{t}\cdot p(1-p)}\geq\frac{1}{2}{\hat{n}}_{t}-{\hat{n}}_{t}\cdot p =n^t​(12βˆ’p)=n^tβ‹…c0β‹…Ο΅01βˆ’Ξ²,in other words, whenever\displaystyle={\hat{n}}_{t}\left(\frac{1}{2}-p\right)={\hat{n}}_{t}\cdot c_{0}\cdot\epsilon_{0}^{1-\beta},\text{in other words, whenever}
n^tβ‹…c02β‹…Ο΅02​(1βˆ’Ξ²)\displaystyle{\hat{n}}_{t}\cdot c_{0}^{2}\cdot\epsilon_{0}^{2(1-\beta)} ≀18,Β (assumingΒ c0≀1/4).\displaystyle\leq\frac{1}{8},\text{ (assuming }{c_{0}\leq 1/4}). (41)

Now, consider the event π„π’Ÿβ‰{n^t≀2​𝔼​[n^t]}\mathbf{E}_{{\cal D}}\doteq\left\{{\hat{n}}_{t}\leq 2\mathbb{E}[{\hat{n}}_{t}]\right\}, where 𝔼​[n^t]=nπ’Ÿβ‹…π’ŸX​(x1)=12​nπ’Ÿβ‹…Ο΅0Ξ²\mathbb{E}[{\hat{n}}_{t}]=n_{\cal D}\cdot{\cal D}_{X}(x_{1})=\frac{1}{2}n_{\cal D}\cdot\epsilon_{0}^{\beta}. Under π„π’Ÿ\mathbf{E}_{{\cal D}}, (41) is satisfied whenever c02≀18c_{0}^{2}\leq\frac{1}{8}, recalling Ο΅0≐nπ’Ÿβˆ’1/(2βˆ’Ξ²)\epsilon_{0}\doteq n_{\cal D}^{-1/(2-\beta)}. In other words, under this condition, we have

ℙ​(n^t,+β‰₯n^t,βˆ’)β‰₯𝔼​[ℙ​(n^t,+β‰₯n^t,βˆ’βˆ£n^t)​1​{π„π’Ÿ}]β‰₯112​ℙ​(π„π’Ÿ).\displaystyle\mathbb{P}\left({\hat{n}}_{t,+}\geq{\hat{n}}_{t,-}\right)\geq\mathbb{E}\left[\mathbb{P}\left({\hat{n}}_{t,+}\geq{\hat{n}}_{t,-}\mid{\hat{n}}_{t}\right){\mathbbold 1}\!\left\{\mathbf{E}_{\cal D}\right\}\right]\geq\frac{1}{12}\mathbb{P}(\mathbf{E}_{\cal D}).

Finally, by multiplicative Chernoff, the event π„π’Ÿ\mathbf{E}_{{\cal D}} holds with probability at least 1βˆ’exp⁑(βˆ’1/6)>1/71-\exp(-1/6)>1/7. ∎

Appendix E Auxiliary Lemmas

The following propositions are taken verbatim from [20].

Proposition 9 (Thm 2.5 of [49]).

Let {Ξ h}hβˆˆβ„‹\{\Pi_{h}\}_{h\in\mathcal{H}} be a family of distributions indexed over a subset β„‹\mathcal{H} of a semi-metric (β„±,dist)(\mathcal{F},{\rm dist}). Suppose βˆƒh0,…,hMβˆˆβ„‹\exists\,h_{0},\ldots,h_{M}\in\mathcal{H}, where Mβ‰₯2M\geq 2, such that:

(i)\displaystyle\qquad{\rm(i)}\quad dist​(hi,hj)β‰₯2​s>0,βˆ€0≀i<j≀M,\displaystyle{\rm dist}\!\left(h_{i},h_{j}\right)\geq 2s>0,\quad\forall 0\leq i<j\leq M,
(ii)\displaystyle\qquad{\rm(ii)}\quad Ξ hiβ‰ͺΞ h0βˆ€i∈[M],Β and the average KL-divergence to ​Πh0​ satisfies\displaystyle\Pi_{h_{i}}\ll\Pi_{h_{0}}\quad\forall i\in[M],\text{ and the average KL-divergence to }\Pi_{h_{0}}\text{ satisfies }
1Mβ€‹βˆ‘i=1Mπ’Ÿkl​(Ξ hi|Ξ h0)≀α​log⁑M,Β where ​0<Ξ±<1/8.\displaystyle\qquad\frac{1}{M}\sum_{i=1}^{M}\mathcal{D}_{\text{kl}}\!\left(\Pi_{h_{i}}|\Pi_{h_{0}}\right)\leq\alpha\log M,\text{ where }0<\alpha<1/8.

Let Z∼ΠhZ\sim\Pi_{h}, and let h^:Z↦ℱ\hat{h}:Z\mapsto\mathcal{F} denote any improper learner of hβˆˆβ„‹h\in\mathcal{H}. We have for any h^\hat{h}:

suphβˆˆβ„‹Ξ h​(dist​(h^​(Z),h)β‰₯s)β‰₯M1+M​(1βˆ’2β€‹Ξ±βˆ’2​αlog⁑(M))β‰₯3βˆ’2​28.\sup_{h\in\mathcal{H}}\Pi_{h}\left({\rm dist}\!\left(\hat{h}(Z),h\right)\geq s\right)\geq\frac{\sqrt{M}}{1+\sqrt{M}}\left(1-2\alpha-\sqrt{\frac{2\alpha}{\log(M)}}\right)\geq\frac{3-2\sqrt{2}}{8}.
Proposition 10 (Varshamov-Gilbert bound).

Let dβ‰₯8d\geq 8. Then there exists a subset {Οƒ0,…,ΟƒM}\{\sigma_{0},\ldots,\sigma_{M}\} of {βˆ’1,1}d\{-1,1\}^{d} such that Οƒ0=(1,…,1)\sigma_{0}=(1,\ldots,1),

dist​(Οƒi,Οƒj)β‰₯d8,βˆ€β€‰0≀i<j≀M,andMβ‰₯2d/8,{\rm dist}(\sigma_{i},\sigma_{j})\geq\frac{d}{8},\quad\forall\,0\leq i<j\leq M,\quad\text{and}\quad M\geq 2^{d/8},

where {\rm dist}(\sigma,\sigma^{\prime})\doteq{\rm card}(\{i\in[d]:\sigma(i)\neq\sigma^{\prime}(i)\}) is the Hamming distance.

Lemma 8 (A basic KL upper-bound).

For any 0<p,q<10<p,q<1, we let π’Ÿkl​(p|q)\mathcal{D}_{\text{kl}}\!\left(p|q\right) denote π’Ÿkl​(Ber​(p)|Ber​(q))\mathcal{D}_{\text{kl}}\!\left({\rm Ber}(p)|{\rm Ber}(q)\right). Now let 0<Ο΅<1/20<\epsilon<1/2 and let z∈{βˆ’1,1}z\in\{-1,1\}. We have

π’Ÿkl​(1/2+(z/2)β‹…Ο΅| 1/2βˆ’(z/2)β‹…Ο΅)≀c0β‹…Ο΅2,Β for some ​c0​ independent of ​ϡ.\mathcal{D}_{\text{kl}}\!\left(1/2+(z/2)\cdot\epsilon\,|\,1/2-(z/2)\cdot\epsilon\right)\leq c_{0}\cdot\epsilon^{2},\text{ for some }c_{0}\text{ independent of }\epsilon.