A No-Free-Lunch Theorem for MultiTask Learning
Abstract
Multitask learning and related areas such as multi-source domain adaptation address modern settings where datasets from $N$ related distributions are to be combined towards improving performance on any single such distribution. A perplexing fact remains in the evolving theory on the subject: while we would hope for performance bounds that account for the contribution from multiple tasks, the vast majority of analyses result in bounds that improve at best in the number $n$ of samples per task, but most often do not improve in $N$. As such, it might seem at first that the distributional settings or aggregation procedures considered in such analyses might be somehow unfavorable; however, as we show, the picture happens to be more nuanced, with interestingly hard regimes that might appear otherwise favorable.
In particular, we consider a seemingly favorable classification scenario where all tasks share a common optimal classifier $h^*$, and which can be shown to admit a broad range of regimes with improved oracle rates in terms of $N$ and $n$. Some of our main results are as follows:
• We show that, even though such regimes admit minimax rates accounting for both $n$ and $N$, no adaptive algorithm exists; that is, without access to distributional information, no algorithm can guarantee rates that improve with large $N$ for $n$ fixed.
• With a bit of additional information, namely, a ranking of tasks according to their distance to a target, a simple rank-based procedure can achieve near optimal aggregations of tasks' datasets, despite a search space exponential in $N$. Interestingly, the optimal aggregation might exclude certain tasks, even though they all share the same $h^*$.
1 Introduction
Multitask learning and related areas such as multi-source domain adaptation address a statistical setting where multiple datasets are to be aggregated towards improving performance w.r.t. a single (or any of, in the case of multitask) such distributions. This is motivated by applications in the sciences and engineering where data availability is an issue, e.g., medical analytics typically require aggregating data from loosely related subpopulations, while identifying traffic patterns in a given city might benefit from pulling data from somewhat similar other cities.
While these problems have received much recent theoretical attention, especially in classification, a perplexing reality emerges: the bulk of results appear to show little improvement from such aggregation over using a single data source. Namely, given $N$ datasets, each of size $n$, one would hope for convergence rates in terms of the aggregate data size $nN$, somehow adjusted w.r.t. discrepancies between the distributions, but which clearly improve on rates in terms of just $n$ as would be obtained with a single dataset. However, such clear improvements on rates appear elusive, as typical bounds on excess risk w.r.t. a target $P_t$ (i.e., one of the task distributions) are of the form (see e.g., CKW (08); BDB (08); BDBC+ (10))
$$\mathcal E_{P_t}(\hat h) \ \lesssim\ \sqrt{\frac{d}{nN}} \ +\ \sqrt{\frac{d}{n}} \ +\ \operatorname{disc}\big(\{P_s\}_{s \neq t},\, P_t\big), \qquad (1)$$
where in some results, one of the last two terms is dropped. In other words, typical upper-bounds are either dominated by the rate $\sqrt{d/n}$, or altogether might not go to $0$ with sample size due to the discrepancy terms, even while the excess risk of a naive classifier trained on the target dataset alone would be $\tilde O(\sqrt{d/n})$. As such, it might seem at first that there is a gap in either algorithmic approaches, formalism and assumptions, or statistical analyses of the problem. However, as we argue here, no algorithm can guarantee a rate improving in the aggregate sample size $nN$ for $n$ fixed, even under seemingly generous assumptions on how sources $P_s$ relate to a given target $P_t$.
In particular, we consider a seemingly favorable classification setting, where all data distributions $P_t$ induce the same optimal classifier $h^*$ over a hypothesis class $\mathcal H$. This situation is of independent interest, e.g., appearing recently under the name of invariant risk minimization (see ABGLP (19) where it is motivated through invariants under causality), but is motivated here by our aim to elucidate basic limits of the multitask problem. As a starting point to understanding the best achievable rates, we first establish minimax upper and lower bounds, up to log terms, for the setting. These oracle rates, as might be expected in such benign settings, do indeed improve with both $n$ and $N$, as allowed by the level of discrepancy between distributions, appropriately formalized (Theorem 1). We then turn to characterizing the extent to which such favorable rates might be achieved by reasonable procedures, i.e., adaptive procedures with little to no prior distributional information. Many interesting messages arise, some of which we highlight below:
• No adaptive procedure exists outside a threshold value of $\beta$, where $\beta$ (a so-called Bernstein class condition parameter) parametrizes the level of noise¹ in the label distribution. Namely, while oracle rates might decrease fast in $N$ and $n$, no procedure based on the aggregate samples alone can guarantee a rate better than what the $n$ target samples alone would yield, without further favorable restrictions on distributions (Theorem 5). (¹The term noise is used here to indicate nondeterminism in the label $Y$ of a sample point $X$, and so is rather non-adversarial.)
• At low noise (e.g., so-called Massart's noise) even the naive, yet common approach of pooling all $N$ datasets, i.e., treating them as if identically distributed, is nearly minimax optimal, achieving rates improving in both $n$ and $N$. This of course would not hold if optimal classifiers differ considerably across tasks (Theorem 3).
• At any noise level, a ranking of sources according to discrepancy to a target is sufficient information for (partial) adaptivity. While a precise ranking is probably unlikely in practice, an approximate ranking might be available, as domain knowledge on how sources rank in fidelity w.r.t. an ideal target data: e.g., in settings such as learning with slowly drifting distributions. Here we show that a simple rank-based procedure, using such ranking, efficiently achieves a near optimal aggregation rate, despite the exponential search space of over $2^N$ possible aggregations (Theorem 4).
Interestingly, even assuming all optimal classifiers $h^*_t$ are the same, the optimal aggregation of datasets can change with the choice of target $P_t$ (see Theorem 2), due to the inherent asymmetry in the information tasks might have on each other, e.g., some might yield data in target-dense regions but not the other way around. Hence, to capture a proper picture of the problem, we cannot employ a symmetric notion of discrepancy as is common in the literature on the subject. Instead, we proceed with a notion of transfer exponent, which we recently introduced in HK (19) for the setting of domain adaptation with a single source distribution, and which we show here to also successfully capture performance limits in the present setting with $N$ sources (see definition in Section 3.1).
We note that in the case $N = 1$, there is always a minimax adaptive procedure (as shown in HK (19)), while this is not the case here for $N > 1$. In hindsight, the reason is simple: considering $N = O(1)$, i.e., as a constant, there is no consequential difference between a rate in terms of $n$ and one in terms of $nN$. In other words, the case $N = 1$ does not yield adequate insight into multitask with $N \gg 1$, and therefore does not properly inform practice as to the best approaches to aggregating and benefiting from multiple datasets.
Background and Related Work
[Figure: a source and target assigning very different mass to regions of data space (large total-variation discrepancy), yet both placing sufficient mass near the decision boundary.]
The bulk of theoretical work on multitask and related areas builds on early work on domain adaptation (i.e., $N = 1$) such as BDBCP (07); CMRR (08); BDBC+ (10), which introduce notions of discrepancy such as the $d_{\mathcal A}$-divergence and the $\mathcal Y$-discrepancy, that specialize the total-variation metric to the setting of domain adaptation. These notions often result in bounds of the form (1), starting with CKW (08); BDBC+ (10). Such bounds can in fact be shown to be tight w.r.t. the discrepancy term, for given distributional settings and sample sizes, owing for instance to early work on the limits of learning under distribution drift (see e.g. Bar (92)). However, the rates of (1) appear pessimistic when we consider settings of interest here where the optimal classifiers remain the same (or nearly so) across tasks, as they suggest no general improvement on risk with larger sample size. Consider for instance simple situations as depicted on the right, where a source $P$ and target $Q$ (with respective supports) might differ considerably in the mass they assign to regions of data space, thus inducing large discrepancies, but where both assign sufficient mass to decision boundaries to help identify $h^*$ with enough samples from either distribution. Proper parametrization resolves this issue, as we show through new multitask rates in such natural situations, and even with no target sample.
We remark that other notions of discrepancy, e.g., Maximum Mean Discrepancy GSH+ (09), or Wasserstein distance RHS (17); SQZY (18), are employed in domain adaptation; however they appear relatively less often in the theoretical literature on multitask and related areas. For multitask, the work of BDB (08) proposes a more-structured notion of task relatedness, whereby a source distribution is induced from a target through a transformation of the domain $\mathcal X$; that work also incurs a term in $n$ alone in the risk bound, but no discrepancy term. The work of MMR (09) considers Rényi divergences in the context of optimal aggregation in multitask under population risk, but does not study sample complexity.
The use of a non-metric discrepancy such as the Rényi divergence brings back an important point: two distributions might have asymmetric information on each other w.r.t. domain adaptation. Such insight was raised recently, independently in KM (18); HK (19); APMS (19), with various natural examples therein (see also Section 3.1). In particular, it motivates a more unified view of multitask and multisource domain adaptation, which are often treated separately. Namely, if the goal in multitask is to perform as well as possible on each task in our set, then as we show, such asymmetry in information between tasks calls for a different aggregation of datasets for each target: in other words, treating multitask as separate multisource problems, even if the optimal $h^*$ is the same across tasks.
In contrast, a frequent aim in multitask, dating back to Car (97), has been to ideally arrive at a single aggregation of task datasets that simultaneously benefits all tasks. Following this spirit, many theoretical works on the subject are concerned with bounds on average risk across tasks Bax (97); AZ (05); MPRP (13); PM (13); PL (14); YHC (13); MPRP (16), rather than bounding the supremum risk across tasks, i.e., treating multitask as separate multisources, as of interest here. Some of these average bounds, e.g., MPRP (13, 16); PM (13), remove the dependence on discrepancy inherent in bounds of the form (1), but maintain a term of order $\sqrt{1/n}$; in other words, any bound on supremum risk derived from such results would be in terms of a single dataset size $n$. The work of BHPQ (17) directly addresses the problem of bounding the supremum risk, but also incurs a term of order $\sqrt{1/n}$ in the risk bound.
In the context of multisource domain adaptation, it has been recognized in practice that some datasets that are too far from the ideal target might hurt target performance and should be downweighted accordingly, a situation that has been termed negative transfer. These situations further motivate the need for adaptive procedures that can automatically identify good datasets. As far as theoretical insights, it is clear that negative transfer might happen, for instance under ERM, if optimal classifiers are considerably different across tasks. Interestingly, even when optimal classifiers are allowed arbitrarily close (but not equal), BDLLP (10) shows that, for $N = 1$, the source dataset can be useless without labeled target data. We later derived minimax lower-bounds for the case $N = 1$, with or without labeled target data, in HK (19) for a range of situations including those considered in BDLLP (10). Such results however do not quite confirm negative transfer, as they allow the possibility that useless datasets might remain safe to include. For the multisource case $N > 1$, to the best of our knowledge, situations of negative transfer have only been described in adversarial settings with corrupted labels. For instance, the recent papers of Qia (18); MMM (19); KFAL (20) show limits of multitask under various adversarial corruption of labels in datasets, while SZ (19) derives a positive result, i.e., rates (for Lipschitz loss) decreasing in both $n$ and $N$, up to excluded or downweighted datasets. The procedure of SZ (19) is however nonadaptive as it requires known noise proportions.
KFAL (20) is of particular interest as they are concerned with adaptivity under label corruption. They show that, even if at most a constant fraction of the datasets are corrupted, no procedure can get a rate better than one in terms of $n$ alone. In contrast, in the stochastic setting considered here, there is always an adaptive procedure with significantly faster rates, provided at most a constant (in fact any fixed) fraction of the distributions are far from the target (Theorem 9 of Appendix B). This dichotomy is due to the strength of the adversarial corruptions they consider, which in effect can flip optimal classifiers on corrupted sources. What we show here is that, even when the optimal $h^*$ is fixed across tasks, and datasets are sampled i.i.d. from each source, i.e., non-adversarially, no algorithm can achieve a rate better than one in terms of $n$ alone, while a non-adaptive oracle procedure can (see Theorem 5). In other words, some datasets are indeed unsafe for any multisource procedure even in nonadversarial settings, as they can force suboptimal choices w.r.t. a target, absent additional restrictions on the problem setup.
As discussed earlier, such favorable restrictions concern for instance situations where information is available on how sources rank in distance to a target. In particular, this case intersects with the so-called distribution drift setting, where each distribution has bounded discrepancy w.r.t. the next distribution as time varies. While the notion of discrepancy is typically taken to be total-variation BDBM (89); Bar (92); BL (97) or related notions MM (12); ZL (19); HY (19), both our upper and lower-bounds, specialized to the case of one sample per time step, provide a new perspective for distribution drift under our distinct parametrization of how the distributions relate to each other. For instance, our results on multisource under ranking (Theorem 4) imply new rates for distribution drift, when all optimal classifiers are the same across time, with excess error at time $T$ going to $0$ with $T$, even in situations where adjacent distributions remain a constant distance apart in total variation. Such consistency in $T$ is unavailable in prior work on distribution drift.
Finally we note that there has been much recent theoretical effort towards other formalisms of relations between distributions in a multitask or multisource scenario, with particular emphasis on distributions sharing common latent substructures, both as applied to classification MBS (13); MB (17); AKK+ (19), and to regression settings JSRR (10); LPVDG+ (11); NW (11); DHK+ (20); TJJ (20). The present work does not address such settings.
Paper Outline
We start with setup and definitions in Sections 2 and 3. This is followed by a technical overview of results, along with discussions of the analysis and novel proof techniques, in Section 4. Minimax lower and upper bounds are derived in Sections 5 and 6. Constrained regimes allowing partial adaptivity are discussed in Section 7. This is followed by impossibility theorems for adaptivity in Section 8.
2 Basic Classification Concepts
We consider a classification setting where labeled samples $(X, Y)$ are drawn from a distribution on some space $\mathcal X \times \{0, 1\}$. We focus on proper learning where, given data, a learner is to return a classifier from some fixed class $\mathcal H$.
Assumption 1 (Bounded VC).
Throughout we will let $\mathcal H$ denote a hypothesis class of finite VC dimension $d$, which we consider fixed in all subsequent discussions. To focus on nontrivial cases, we assume throughout that $\mathcal H$ contains at least 3 distinct classifiers.
We note that our algorithmic techniques and analysis extend to more general classes $\mathcal H$ through Rademacher complexity or empirical covering numbers. We focus on VC classes for simplicity, and to allow simple expressions of minimax rates.
The performance of any classifier will be captured through the 0-1 risk and excess risk as defined below.
Definition 1.
Let $R_P(h) \doteq P(h(X) \neq Y)$ denote the risk of any $h \in \mathcal H$ under a distribution $P$. The excess risk of $h$ over any $h'$ is then defined as $\mathcal E_P(h, h') \doteq R_P(h) - R_P(h')$, while for the excess risk over the best in class we simply write $\mathcal E_P(h)$. We let $h^*_P$ denote any element of $\operatorname{argmin}_{h' \in \mathcal H} R_P(h')$ (which we will assume exists); if $P$ is clear from context we might just write $h^*$ for a minimizer (a.k.a. best in class). Also define the pseudo-distance $P(h \neq h') \doteq P_X(h(X) \neq h'(X))$.
Given a finite dataset $S$ of $(x, y)$ pairs in $\mathcal X \times \{0, 1\}$, we let $\hat R_S(h) \doteq \frac{1}{|S|} \sum_{(x,y) \in S} \mathbb 1\{h(x) \neq y\}$ denote the empirical risk of $h$ under $S$; if $S = \emptyset$, define $\hat R_S(h) \doteq 0$. The excess empirical risk over any $h'$ is $\hat{\mathcal E}_S(h, h') \doteq \hat R_S(h) - \hat R_S(h')$. Also define the empirical pseudo-distance $\hat P_S(h \neq h') \doteq \frac{1}{|S|} \sum_{(x,y) \in S} \mathbb 1\{h(x) \neq h'(x)\}$.
The following condition is a classical way to capture a continuum from easy to hard classification. In particular, in vanilla classification, the best excess risk achievable by a classifier trained on data of size $n$ can be shown to be of order $(d/n)^{1/(2-\beta)}$, i.e., it interpolates between $\sqrt{d/n}$ and $d/n$, as controlled by the parameter $\beta$ defined below.
Definition 2.
Let $\beta \in [0, 1]$. A distribution $P$ is said to satisfy a Bernstein class condition with parameters $(c_\beta, \beta)$, $c_\beta \in (0, 1]$, if the following holds:
$$\forall h \in \mathcal H, \quad c_\beta \cdot P_X\big(h(X) \neq h^*_P(X)\big) \ \le\ \mathcal E_P(h)^{\beta}. \qquad (2)$$
Notice that the above always holds with at least $\beta = 0$.
The condition can be viewed as quantifying the amount of noise in $Y$, since we always have $\mathcal E_P(h) \le P_X(h \neq h^*_P)$, with equality when $Y = h^*_P(X)$ deterministically. In particular, it captures the so-called Tsybakov noise margin condition when the Bayes classifier is in $\mathcal H$: that is, letting $\eta(x) \doteq \mathbb E[Y \mid X = x]$, a margin condition bounding the mass of $X$ with $\eta(X)$ near $1/2$ implies (2).
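For intuition, the following classical computation (standard, and not specific to this paper) makes both claims above explicit:

```latex
% With \eta(x) = E[Y | X = x], a standard rewriting of the 0-1 risk gives
%   E_P(h) = E_X[ (2\eta(X) - 1)( 1{h^*_P(X) = 1} - 1{h(X) = 1} ) ]
%         <= E_X[ |2\eta(X) - 1| \, 1{h(X) != h^*_P(X)} ]
%         <= P_X( h != h^*_P ),
% with equalities when h^*_P is the Bayes classifier and |2\eta - 1| = 1 a.s.,
% i.e., Y = h^*_P(X) deterministically. Thus (2) constrains how much mass
% \eta places near 1/2, i.e., the amount of label noise.
\mathcal{E}_P(h) \;\le\; \mathbb{E}_X\big[\,|2\eta(X)-1|\,\mathbb{1}\{h(X)\neq h^\ast_P(X)\}\,\big]
\;\le\; P_X\big(h \neq h^\ast_P\big).
```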
The importance of such margin considerations becomes evident as we consider whether a multitask learner is able to automatically adapt optimally to unknown relations between distributions; interestingly, hardness of adaptation has to do not only with relations (or discrepancies) between distributions, but also with $\beta$.
3 Multitask Setting
We consider a setting where multiple datasets are drawn independently from (related) distributions, with the aim to return hypotheses from $\mathcal H$ with low excess error w.r.t. any of these distributions. W.l.o.g. we fix attention to a single target distribution, denoted $P_0$, and therefore reduce the setting to that of multisource². (²The term multisource is often used for situations where the learner has no access to target data, which is not required to be the case here, although it is handled simply by setting the target sample size to $0$ in our bounds.)
Definition 3.
A multisource learner is any function mapping a collection of labeled datasets to $\mathcal H$ (for some sample sizes $n_0, \ldots, n_N$), i.e., given a multisample $Z \doteq \{S_t\}_{t=0}^{N}$, it returns a hypothesis in $\mathcal H$. In an abuse of notation, we will often conflate the learner with the hypothesis it returns.
A multitask setting can then be viewed as one where multiple multisource learners (targeting each distribution in turn) are trained on the same multisample $Z$. Much of the rest of the paper will therefore discuss learners for a given target $P_0$.
3.1 Relating Sources to Target
Clearly, how well one can do in multitask depends on how distributions relate to each other. The following parametrization will serve to capture the relation between sources and target distributions.
Definition 4.
Let $\rho > 0$ (up to $\rho = \infty$), and $C_\rho \in (0, \infty)$. We say that a distribution $P$ has transfer exponent $\rho$ w.r.t. a distribution $Q$ (under $\mathcal H$), if
$$\forall h \in \mathcal H, \quad C_\rho \cdot \mathcal E_Q(h)^{\rho} \ \le\ \mathcal E_P(h).$$
Notice that the above always holds with at least $\rho = \infty$.
We have shown in earlier work HK (19) that the transfer exponent manages to tightly capture the minimax rates of transfer in various situations with a single source and target ($N = 1$), including ones where the best hypotheses are different for source and target; in the case of main interest here where $P$ and $Q$ share a same best in class $h^*$, for respective data sizes $n_P, n_Q$, the best possible excess risk (w.r.t. $Q$) is of order $(d/n_P)^{1/(2-\beta)\rho} \wedge (d/n_Q)^{1/(2-\beta)}$, which is tight for any values of $\rho$ and $\beta$ (a Bernstein class parameter on $P$ and $Q$). In other words, $n_P^{1/\rho}$ captures an effective data size contributed by the source to the target: this decreases as $\rho$ increases, delineating a continuum from easy to hard transfer. Interestingly, $\rho < 1$ reveals the fact that source data could be more useful than target data, for instance if the classification problem is easier under the source (e.g., $P$ has more mass at the decision boundary).
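To unpack where such a rate comes from, here is a one-line sketch under the parametrization above (our own back-of-the-envelope illustration, not a formal statement from HK (19)):

```latex
% ERM trained on the source alone attains, up to log factors,
%   E_P(\hat h) ~< (d / n_P)^{1/(2 - \beta)},
% and the transfer exponent converts source guarantees into target guarantees:
%   C_\rho \, E_Q(\hat h)^\rho <= E_P(\hat h)
%     ==>  E_Q(\hat h) ~< (d / n_P)^{1 / ((2 - \beta)\rho)},
% to be compared against the target-only rate (d / n_Q)^{1/(2 - \beta)}.
```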
Altogether, the transfer exponent appears as a more optimistic measure of discrepancy between source and target, as it reveals the possibility of transfer, even at fast rates, in many situations where traditional measures are pessimistically large. To illustrate this, we recall some of the examples from HK (19). In all examples below, we assume for simplicity that $Y = h^*(X)$ deterministically for some $h^* \in \mathcal H$.
Example 1. (Discrepancies can be too large) Let $\mathcal H$ consist of one-sided thresholds on the line, and let the marginals $P_X$ and $Q_X$ be as depicted below.
[Figure: Example 1, one-sided thresholds on the line with the marginals $P_X$, $Q_X$ described above.]
Let $h^*$ be thresholded at $\tau$. We then see that, for all $h$ thresholded at $t$, $\mathcal E_P(h)$ is proportional to $\mathcal E_Q(h)$. Thus, the transfer exponent is $\rho = 1$ with an appropriate $C_\rho$, so we have fast transfer at the same rate as if we were sampling from $Q$.
On the other hand, recall that the $d_{\mathcal A}$-divergence takes the form³ $\sup_{h \in \mathcal H} |P_X(h \neq h^*) - Q_X(h \neq h^*)|$, while the $\mathcal Y$-discrepancy takes the form $\sup_{h \in \mathcal H} |\mathcal E_P(h) - \mathcal E_Q(h)|$. The two coincide when labels are noiseless. (³Note that these divergences are often defined w.r.t. every pair $h, h' \in \mathcal H$ rather than w.r.t. $h^*$, which makes them smaller here.)
Now, take $h_1$ as the threshold depicted, whose divergence values would wrongly imply that transfer is not feasible at a rate faster than a constant; we can in fact make this situation worse, i.e., drive the divergences to $1$, by letting $h_1$ correspond to a threshold close to $\tau$. A first issue is that these divergences get large in large disagreement regions; this is somewhat mitigated by localization, i.e., defining these discrepancies w.r.t. $h$'s in a vicinity of $h^*$, but that does not quite resolve the issue, as discussed in earlier work HK (19).
Example 2. (Minimum $\rho$, and the inherent asymmetry of transfer)
[Figure: Example 2, $Q_X$ uniform and $P_X$ with density vanishing at the threshold $\tau$.]
Suppose $\mathcal H$ is the class of one-sided thresholds on the line, and $h^*$ is a threshold at some $\tau$. The marginal $Q_X$ has uniform density (on an interval containing $\tau$), while, for some $\gamma > 0$, $P_X$ has density proportional to $|t - \tau|^{\gamma}$ on an interval to the right of $\tau$ (and uniform on the rest of the support of $Q_X$, not shown). Consider any $h$ at threshold $t > \tau$: we have $\mathcal E_Q(h) \propto (t - \tau)$, while $\mathcal E_P(h) \propto (t - \tau)^{\gamma + 1}$. Notice that for any fixed $\rho < \gamma + 1$, the ratio $\mathcal E_Q(h)^{\rho} / \mathcal E_P(h)$ diverges as $t \to \tau$.
We therefore see that $\rho = \gamma + 1$ is the smallest possible transfer-exponent. Interestingly, now consider transferring instead from $Q$ to $P$: we would have $\rho = 1$; in other words, there are natural situations where it is easier to transfer from $Q$ to $P$ than from $P$ to $Q$, as in the case here where $P$ gives relatively little mass to the decision boundary. This is not captured by symmetric notions of distance, e.g., metrics or semi-metrics such as $d_{\mathcal A}$, $\mathcal Y$-discrepancy, MMD, TV, or Wasserstein.
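To make the exponent computation concrete, here is the corresponding one-line verification (assuming noiseless labels and the density form described above, with a normalization constant $c$ of our choosing):

```latex
% For a threshold h at t > \tau:
%   E_Q(h) = \int_\tau^t 1 \, dz                = (t - \tau),
%   E_P(h) = \int_\tau^t c (z - \tau)^\gamma dz = \tfrac{c}{\gamma + 1}(t - \tau)^{\gamma + 1},
% so E_P(h) \propto E_Q(h)^{\gamma + 1}: the smallest valid transfer exponent of
% P w.r.t. Q is \rho = \gamma + 1, while in the reverse direction \rho = 1 suffices.
```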
Finally note that the above examples can be extended to more general hypothesis classes, as the examples merely play on how fast the density of $P_X$ decreases w.r.t. that of $Q_X$ in regions of space.
3.2 Multisource Class
We are now ready to formalize the main class of distributional settings considered in this work.
Definition 5 (Multisource class).
We consider classes of product distributions over tuples $(P_0, P_1, \ldots, P_N)$, with $P_0$ denoting the target, satisfying:
(A1). There exists $h^* \in \bigcap_{t=0}^{N} \operatorname{argmin}_{h \in \mathcal H} R_{P_t}(h)$,
(A2). Sources $P_s$, $s \in \{1, \ldots, N\}$, have transfer exponents $\rho_s$ w.r.t. the target $P_0$,
(A3). All sources and target admit a Bernstein class condition with parameter $\beta$.
For notational simplicity, we will often write $\mathcal M$ for such a class, leaving its parameters implicit. Also, although it is not a parameter of the class, we will also refer to $\rho_0 \doteq 1$, as $P_0$ always has transfer exponent $1$ w.r.t. itself.
Remark 1 ($h^*$ is almost unique).
Note that, for $\beta > 0$, the Bernstein class condition implies that $h^*$ above satisfies $P_{t,X}(h^* \neq h') = 0$ for any other minimizer $h'$ of $R_{P_t}$. Furthermore, for any $h, h'$ such that $P_{t,X}(h \neq h') = 0$, we also have $R_{P_t}(h) = R_{P_t}(h')$, implying that (A1) holds with any such minimizer in place of $h^*$.
3.3 Additional Notation
Implicit target $P_0$. Every time we write $\mathcal E$, $h^*$, or the pseudo-distance without a subscript, we will implicitly mean these under $P_0$, so that in any context where a multisource class is introduced, we may write, for instance, $\mathcal E(\hat h)$, which then refers to the target distribution $P_0$ in the class.
Indices and Order Statistics. For any indices $0 \le s_1 \le s_2 \le N$, we let $n_{s_1:s_2} \doteq \sum_{s = s_1}^{s_2} n_s$.
We will often be interested in order statistics of the ordered $\rho_s$ values, $\rho_{(1)} \le \cdots \le \rho_{(N)}$, in which case $P_{(s)}, S_{(s)}, n_{(s)}$ will denote distribution, sample and sample size at index $(s)$. We will then let $n_{(0:s)} \doteq n_0 + \sum_{s' = 1}^{s} n_{(s')}$.
Average transfer exponent. For any $0 \le s \le N$, define $\bar\rho_{0:s} \doteq \frac{1}{n_{0:s}} \sum_{s' = 0}^{s} n_{s'} \rho_{s'}$, where we recall the convention $\rho_0 = 1$.
Aggregate ERM. For any $0 \le s \le N$, we let $Z_{0:s} \doteq \bigcup_{s' \le s} S_{s'}$, and correspondingly we also define $\hat h_{0:s} \doteq \operatorname{argmin}_{h \in \mathcal H} \hat R_{Z_{0:s}}(h)$, as the ERM over $Z_{0:s}$. When $s = N$ we simply write $\hat h$ for $\hat h_{0:N}$.
Min and Max. We often use the short notations $a \wedge b \doteq \min\{a, b\}$, $a \vee b \doteq \max\{a, b\}$.
Positive Logarithm. For any $x > 0$, define $\log_+(x) \doteq \log(x) \vee 1$.
Convention. We adopt the convention that $1/\infty = 0$.
Asymptotic Order. We often write $\lesssim$ or $\asymp$ in the statements of key results, to indicate inequality, respectively equality, up to constants and logarithmic factors. The precise constants and logarithmic factors are always presented in supporting results.
4 Results Overview
We start by investigating what the best possible transfer rates are for multisource classes $\mathcal M$, and then investigate the extent to which these rates are attainable by adaptive procedures, i.e., procedures with little access to class information such as transfer exponents from sources to target.
From this point on, we let $\mathcal M$ denote any multisource class, with any admissible value of relevant parameters, unless these parameters are specifically constrained in a result's statement.
4.1 Minimax Rates
Theorem 1 (Minimax Rates).
Let $\mathcal M$ denote any multisource class where all $\rho_s \ge 1$. Let $\hat h$ denote any multisource learner with knowledge of $\mathcal M$. We have:
$$\inf_{\hat h}\ \sup_{\Pi \in \mathcal M}\ \mathbb E\, \mathcal E_0(\hat h) \ \asymp\ \min_{0 \le s \le N} \left( \frac{d}{n_{(0:s)}} \right)^{\!\frac{1}{(2 - \beta)\,\bar\rho_{(0:s)}}},$$
where the minimum is over prefixes of the ranking $\rho_{(1)} \le \cdots \le \rho_{(N)}$ (with $\rho_0 = 1$ for the target).
Remark 2.
We remark that the proofs of the above result (see Theorems 6 and 7 of Sections 5 and 6 for matching lower and upper bounds) imply that, in fact, we could replace $\bar\rho_{(0:s)}$ with simply $\max_{s' \le s} \rho_{(s')}$ and the result would still be true. In other words, although intuitively $\bar\rho_{(0:s)}$ might be much smaller than $\max_{s' \le s} \rho_{(s')}$ for any fixed $s$, the minimum values over $s$ can only differ up to logarithmic terms.
We also note that the constraint that $\rho_s \ge 1$ is only needed for the lower bound (Theorem 6), whereas all of our upper bounds (Theorems 7, 3, and 4) hold for any values $\rho_s > 0$. Moreover, there exist classes $\mathcal H$ where the lower bound also holds for all $\rho_s > 0$, so that the form of the bound is generally not improvable. The case $\rho_s < 1$ represents a kind of super transfer, where the source samples are actually more informative than target samples.
It follows from Theorem 1 that, despite there being $2^N$ possible ways of aggregating datasets (or more if we consider general weightings of datasets), it is sufficient to search over $N + 1$ possible aggregations, defined by the ranking of transfer exponents, to nearly achieve the minimax rate.
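For concreteness, a minimal sketch of this prefix search follows; the rate formula mirrors the exponent $1/((2-\beta)\bar\rho)$ used above, and all names (`d`, `beta`, `rho`, `n`) are our own illustration rather than notation from the paper.

```python
import numpy as np

def oracle_prefix_rate(d, beta, rho, n):
    """Oracle minimax-rate proxy: minimize over prefixes of ranked sources.

    rho, n: arrays of transfer exponents and sample sizes, with index 0 the
    target (rho[0] = 1). Sources are ranked by increasing rho, and for each
    prefix {0, ..., s} we evaluate
        (d / n_(0:s)) ** (1 / ((2 - beta) * avg_rho)),
    with avg_rho the sample-size-weighted average exponent on the prefix.
    """
    order = np.argsort(rho)                # target (rho = 1) sorts first
    rho_sorted, n_sorted = rho[order], n[order]
    cum_n = np.cumsum(n_sorted)                          # n_(0:s)
    avg_rho = np.cumsum(n_sorted * rho_sorted) / cum_n   # weighted average
    rates = (d / cum_n) ** (1.0 / ((2.0 - beta) * avg_rho))
    s_star = int(np.argmin(rates))
    return rates[s_star], order[: s_star + 1]   # best rate, tasks to pool

# Example: target plus 4 sources; distant sources (large rho) get excluded.
d, beta = 10, 0.5
rho = np.array([1.0, 1.2, 1.5, 4.0, 9.0])
n = np.array([50, 200, 200, 200, 200])
rate, pooled = oracle_prefix_rate(d, beta, rho, n)
print(rate, pooled)
```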
The lower-bound (Theorem 6) relies on constructing a subset (of $\mathcal M$) of product distributions $\{\Pi_\sigma\}$, which are mutually close in KL-divergence, but far under the pseudo-distance on $P_0$. For illustration, in a simplified case, for any $\sigma, \sigma'$ that are sufficiently far under the pseudo-distance so that the corresponding optimal classifiers satisfy $\mathcal E_0(h_{\sigma'}) \gtrsim \varepsilon$, the lower-bound construction is such that
$$\mathcal E_0(h_{\sigma'}) \ \gtrsim\ \varepsilon \qquad \text{while} \qquad \mathrm{KL}\big( \Pi_\sigma \,\big\|\, \Pi_{\sigma'} \big) \ \lesssim\ 1, \qquad (3)$$
ensuring that $\Pi_\sigma, \Pi_{\sigma'}$ are hard to distinguish from a finite sample. Thus, the largest $\varepsilon$ satisfying the second inequality above, say $\varepsilon_{n,N}$, is a minimax lower-bound. On the other hand, the upper-bound (Theorem 7) relies on a uniform Bernstein's inequality that holds for non-identically distributed r.v.s (Lemma 1); in particular, by accounting for variance in the risk, such a Bernstein-type inequality allows us to extend (to the multisource setting) usual fixed-point arguments that capture the effect of the noise parameter $\beta$. Now, again for illustration, consider the ERM $\hat h$ combining all datasets. The concentration arguments described above ensure that the excess risk $\varepsilon$ of $\hat h$ satisfies the second inequality in (3), and must therefore be at most of order $\varepsilon_{n,N}$. This establishes the tightness of $\varepsilon_{n,N}$ as a minimax rate, all that is left being to elucidate its exact form in terms of sample sizes. Similar, but somewhat more involved arguments apply for general transfer exponents, though in that case we find that pooling all of the data does not suffice to achieve the minimax rate.
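For intuition, the fixed-point computation alluded to here takes the following schematic form in the single-dataset case (a sketch under the Bernstein class condition, with our own shorthand rather than the paper's exact constants):

```latex
% Uniform Bernstein: up to log factors, uniformly over h in H,
%   E(h) <= \hat E(h) + sqrt( P(h != h^*) \, d / n ) + d / n.
% At the ERM, \hat E(\hat h) <= 0, and the Bernstein class condition gives
%   P(\hat h != h^*) <= c \, E(\hat h)^\beta,
% so that E(\hat h)^{2 - \beta} ~< d / n, i.e., E(\hat h) ~< (d/n)^{1/(2-\beta)};
% transfer exponents then rescale the effective sample sizes in this fixed point.
```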
Notice that the rates of Theorem 1 immediately imply minimax rates for multitask under the assumption (A1) of sharing a same $h^*$ (with appropriate transfer exponents w.r.t. any choice of target). It is then natural to ask whether the minimax rate for various targets might be achieved by the same algorithm, i.e., the same aggregation of tasks, in light of a common approach in the literature (and practice) of optimizing for a single classifier that does well simultaneously on all tasks. We show that even when all optimal classifiers are the same, the optimal aggregation might differ across targets, simply due to the inherent asymmetry of transfer. We have the following theorem, proved in Appendix C.
Theorem 2 (Target affects aggregation).
Set . There exist distributions satisfying a Bernstein class condition with parameters for some , and sharing the same $h^*$, such that the following holds. Consider a multisample consisting of independent datasets .
Let denote the ERMs over , , respectively. Suppose . Then
Remark 3 (Suboptimality of pooling).
A common practice is to pool all datasets together and return an ERM over the aggregate. We see from the above result that this might be optimal for some targets while suboptimal for other targets. However, pooling is near optimal (simultaneously for all targets) whenever $\beta = 1$, as discussed in Section 4.2 below.
4.2 Some Regimes of (Semi) Adaptivity
It is natural to ask whether the above minimax rates for $\mathcal M$ are attainable by adaptive procedures, i.e., reasonable procedures with no access to prior information on (the parameters of) $\mathcal M$, but only access to a multisample $Z \sim \Pi$ for some unknown $\Pi \in \mathcal M$. As we will see in Section 4.3, this is not possible in general, i.e., outside of the regimes considered here. Our work however leaves open the existence of more refined regimes of adaptivity.
Low Noise ($\beta = 1$). To start, when the Bernstein class parameter $\beta = 1$ (which would often be a priori unknown to the learner), pooling of all datasets is near minimax optimal as stated in the next result. This corresponds to low noise situations, e.g., so-called Massart's noise (where $|\eta(X) - 1/2| \ge \tau_0$ a.s., for some $\tau_0 > 0$), including the realizable case (where $Y = h^*(X)$ deterministically). Note that the transfer exponents are nonetheless nontrivial (see examples of Section 3.1); however the distributions are then sufficiently related that their datasets are mutually valuable.
Theorem 3 (Pooling under low noise).
Suppose $\beta = 1$. Consider any $\Pi \in \mathcal M$, and let $\hat h$ denote the ERM over the pooled sample $Z_{0:N}$. Let $\bar\rho \doteq \bar\rho_{0:N}$. There exists a universal constant $c$, such that, with probability at least $1 - \delta$,
$$\mathcal E_0(\hat h) \ \le\ c \left( \frac{d \log(n_{0:N}) + \log(1/\delta)}{n_{0:N}} \right)^{\!1/\bar\rho}.$$
The theorem is proven in Section 7.1. We also state a general bound on the pooled ERM, holding for any $\beta$, in Corollary 2 of Appendix B; the implied rates are not always optimal, though interestingly they are near-optimal in the case that the overall average exponent $\bar\rho_{0:N}$ is close to $\bar\rho_{0:s^*}$, where $s^*$ is the minimizer of the r.h.s. in Theorem 1. We note that, unlike in the oracle upper-bounds of Theorem 7, the logarithmic term in the above result is in terms of the entire sample size, rather than the sample size at which the minimizer in Theorem 1 is attained.
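As a point of comparison, the pooling baseline discussed here is simply ERM over the concatenated sample; a minimal sketch follows (the finite threshold grid is purely our own illustration).

```python
import numpy as np

def pooled_erm(datasets, hypotheses):
    """Pooling baseline (sketch): concatenate all task datasets as if
    identically distributed, and return the index of the empirical risk
    minimizer over the illustrative finite grid `hypotheses`."""
    X = np.concatenate([x for x, _ in datasets])
    y = np.concatenate([t for _, t in datasets])
    risks = [np.mean(h(X) != y) for h in hypotheses]
    return int(np.argmin(risks))

# Example with one-sided thresholds on the line (labels noiseless at 0.5):
thresholds = np.linspace(0.0, 1.0, 21)
grid = [lambda X, t=t: (X >= t).astype(int) for t in thresholds]
rng = np.random.default_rng(0)
Xs = [rng.uniform(size=500) for _ in range(3)]          # three "tasks"
datasets = [(X, (X >= 0.5).astype(int)) for X in Xs]
print(thresholds[pooled_erm(datasets, grid)])           # ~0.5
```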
Available ranking information. Now, assume that on top of $Z$, we have access to ranking information, but no additional information on $\mathcal M$. Namely, w.l.o.g., $\rho_1 \le \rho_2 \le \cdots \le \rho_N$, and the actual values of the $\rho_s$'s are unknown to the learner. We show that, in this case, a simple rank-based procedure achieves the minimax rates of Theorem 1, without knowledge of the additional distributional parameters.
Define $Z_{0:s} \doteq \bigcup_{s' \le s} S_{s'}$, for $0 \le s \le N$, with $n_{0:s} \doteq |Z_{0:s}|$. Let $\hat h_s$ denote the ERM over $S_s$ for each $s$, and recall that $\hat h_{0:s}$ denotes the ERM over the aggregate sample $Z_{0:s}$.
Rank-based Procedure $\tilde h$: let $\delta \in (0, 1)$, and let $A_s \doteq A_s(\delta)$ be as in Lemma 1. For any $0 \le s \le N$, define:
$$\hat{\mathcal H}_s \doteq \big\{ h \in \mathcal H : \hat{\mathcal E}_{Z_{0:s}}(h, \hat h_{0:s}) \le A_s \big\}. \qquad (4)$$
Return any $\tilde h$ in $\bigcap_{s = 0}^{N} \hat{\mathcal H}_s$, if not empty; otherwise return any $h \in \mathcal H$.
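Before stating the guarantee, here is a minimal sketch of this procedure in Python, under our reading of (4): keep hypotheses whose empirical excess risk over each prefix aggregate is within the slack $A_s$, and return any survivor. The helper names (`slack`, the finite hypothesis grid) are our own illustration, not the paper's exact constants.

```python
import numpy as np

def rank_based_select(prefix_samples, hypotheses, slack):
    """Rank-based procedure (sketch): keep hypotheses whose empirical excess
    risk on every prefix aggregate Z_{0:s} is within the concentration slack,
    and return a hypothesis surviving all prefixes.

    prefix_samples: list of aggregates Z_{0:s} as (X, y) arrays, s = 0..N,
                    ordered by the (assumed known) ranking of exponents.
    hypotheses:     list of classifiers h(X) -> {0,1} predictions.
    slack:          slack(s, n_s) -> float, a stand-in for the Lemma 1 bound.
    """
    emp_risk = lambda h, X, y: np.mean(h(X) != y)
    candidates = set(range(len(hypotheses)))
    for s, (X, y) in enumerate(prefix_samples):
        risks = np.array([emp_risk(h, X, y) for h in hypotheses])
        best = risks.min()   # empirical risk of the prefix ERM \hat h_{0:s}
        keep = {i for i in candidates if risks[i] - best <= slack(s, len(y))}
        if not keep:         # fall back if the intersection empties out
            break
        candidates = keep
    return hypotheses[min(candidates)]
```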
We have the following result for this learning algorithm.
Theorem 4 (Semi-adaptive method under ranking information).
Assume $\rho_1 \le \cdots \le \rho_N$. Let $\tilde h$ denote the above procedure, trained on $Z$, for any $\Pi \in \mathcal M$. There exists a universal constant $c$, such that, with probability at least $1 - \delta$,
$$\mathcal E_0(\tilde h) \ \le\ c \, \min_{0 \le s \le N} \left( \frac{d \log(n_{0:s}) + \log(1/\delta)}{n_{0:s}} \right)^{\!\frac{1}{(2 - \beta)\,\bar\rho_{0:s}}}.$$
The result is shown in Section 7.2 by arguing that each $h$ in $\hat{\mathcal H}_s$ has excess risk of order the $s$-th term in the minimum above, so the minimizer is attained whenever the intersection is nonempty (which is probable, as $h^*$ is likely to be in each $\hat{\mathcal H}_s$).
Recalling our earlier discussion of Section 1, the above result applies to the case of drifting distributions by letting each time step contribute a single sample, ranked in reverse time order, i.e., the previous example is considered the most relevant to the target. In this case, in contrast to the lower bounds proven by Bar (92), the above result of Theorem 4 reveals situations where the risk at time $T$ approaches $0$ as $T$ grows, even though the total-variation distance between adjacent distributions may be bounded away from zero (see examples of Section 3.1). In other words, by constraining the sequence of distributions by the sequence of transfer exponent values, we can describe scenarios where the traditional upper bound analysis of drifting distributions from Bar (92); BDBM (89); BL (97) can be improved.
Remark 4 (approximate ranking).
Theorem 4, and its proof, also have implications for scenarios where we may only have access to an approximate ranking: that is, where the ranking indices don't strictly order the tasks by their respective minimum valid $\rho_s$ values. For instance, one immediate observation is that, since Definition 4 does not require $\rho$ to be minimal, any larger value of $\rho$ would also be valid; therefore, for any permutation of the tasks, there always exist some valid choices of $\rho_s$ values so that the sequence is non-decreasing, and hence we can define the corresponding average exponents, so that Theorem 4 holds (with these values) for the rank-based procedure applied with this ordering of the tasks. For instance, this means that if we use an ordering that only swaps the order of some $\rho_s$ values that are within some $\epsilon$ of each other, then the result in the theorem remains valid aside from replacing each $\rho_s$ with $\rho_s + \epsilon$. A second observation is that, upon inspecting the proof, it is clear that the result is only truly sensitive to the ranking relative to the index $s^*$ achieving the minimum in the bound: that is, if some indices below $s^*$ are incorrectly ordered, while still both ranked before $s^*$ (or likewise for indices above $s^*$), the result remains valid as stated nonetheless.
4.3 Impossibility of Adaptivity in General Regimes
Adaptivity in general is not possible, even though the rates of Theorem 1 appear to reduce the exponential search space over aggregations to a more manageable one based on rankings of datasets. As seen in the previous section, easier settings, such as ones with ranking information or low noise, allow adaptivity to the remaining unknown distributional parameters. The result below states that, outside of these regimes, no algorithm can guarantee a rate better than one depending on the number of samples from the target task alone, even when a semi-adaptive procedure using ranking information can achieve an arbitrarily faster rate.
Theorem 5 (Impossibility of adaptivity).
Pick any , , and let , , and . Pick any . The following holds for $N$ sufficiently large.
Let $\hat h$ denote any multisource learner with no knowledge of $\mathcal M$. There exists a multisource class $\mathcal M$ with parameters , , and all other parameters satisfying the above, such that
On the other hand, there exists a semi-adaptive classifier , which, given data , along with a ranking , but no other information on the above , achieves the rate
The first part of the result follows from Theorem 8 of Section 8, while the second part follows from both Theorem 8 and Theorem 4. The main idea of the proof is to inject enough randomness into the choice of ranking, while at the same time allowing bad datasets from distributions with large transfer exponents (which would force a wrong choice of hypothesis) to appear as benign as good samples from distributions with small exponents. Hence, we let $N$ large enough in our constructions so that the bulk of bad datasets would significantly overwhelm the information from good datasets.
As a technical point, no knowledge of $\mathcal M$ simply means that a minimax analysis is performed over a larger family containing such classes, indexed over choices of ranking, each corresponding to a fixed class.
Finally, we note that the result leaves open the possibility of adaptivity under further distributional restrictions, for instance requiring that the number $n$ of samples per source task be large w.r.t. other parameters such as $N$ and $d$; although this remains unclear, large values of $n$ could perhaps yield sufficient information to compare and rank tasks.
Another possible restriction towards adaptivity is to require that a large proportion of the samples are from good datasets w.r.t. the target. In particular, we show in Theorem 9 of Appendix B that the ERM which pools all datasets achieves a target excess risk depending on the (weighted) median of the $\rho_s$ values (or more generally, any quantile); in other words, as long as a constant fraction of all datapoints (pooled from all datasets) are from tasks with relatively small $\rho_s$, the bound will be small. However, this is not a safe algorithm in general, as per Theorem 2.
5 Lower Bound Analysis
Theorem 6.
Suppose . If every , then for any learning rule , there exists such that, with probability at least ,
for numerical constants .
Proof.
We will prove this result with ; the general cases and follow immediately (since distributions satisfying also satisfy the respective conditions for any and ). As in the proof of Theorem 7, for each , define . We prove the theorem using a standard approach to lower bounds by the probabilistic method. Let be a sequence in such that every labeling can be realized by some $h \in \mathcal H$. If , we let , and we can find such a sequence since ; if , then we let and any shattered sequence of points will suffice. We now specify a family of distributions, as follows.
Fix . For each , for each , let , , and for each and let . In other words, the marginal probability of each is and the conditional probability of label given is .
We first verify that is in . For every and every , the Bayes optimal label at is always , and the Bayes optimal label at () is always ; since this is the same for every , the classifier is optimal under every . Next, we check the Bernstein class condition. Define . For any and , we have
while
Thus,
Finally, we verify that the transfer exponent values are satisfied. First note that any with trivially satisfies the condition. Denote by the distribution . Then for each with , and each ,
where the final inequality follows from and a bit of calculus to verify that for all .
Now the Varshamov-Gilbert bound (see Proposition 10 of Appendix E) implies there exists a subset with , , and every distinct have . Furthermore, for any ,
for a numerical constant , where we have used a quadratic approximation of KL divergence between Bernoullis (Lemma 8 of Appendix E). Consider any choice of making this last expression less than .
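For reference, the quadratic approximation invoked above is of the following standard form (a generic reminder under our own notation, not the exact statement of Lemma 8):

```latex
% KL between Bernoullis is dominated by chi-square:
%   KL( Ber(p) || Ber(q) ) <= (p - q)^2 / ( q(1 - q) ),
% so for p = 1/2 + eps and q = 1/2 - eps with eps <= 1/4,
%   KL( Ber(p) || Ber(q) ) <= 4 eps^2 / (1/4 - eps^2) <= 22 eps^2.
```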
Since , Theorem 2.5 of (Tsy, 09) (see Proposition 9 of Appendix E) implies that for any (possibly randomized) estimator , there exists a choice of such that a sample will have with probability at least .
In particular, for any learning algorithm , we can take , and this result implies that there is a choice of such that, for trained on , with probability at least , there are at least points with , which implies .
It remains only to identify an explicit value of for which .
In particular, fix any positive function such that for a finite numerical constant : for instance, will suffice to prove the theorem. Then consider
for a numerical constant . Denote by the value of obtaining the minimum in this expression. Then note that every other has
which implies
so that
Thus, we have
for a finite numerical constant . Thus, choosing any , we have , as desired.
Altogether, we have that for any learning rule , there exists such that if is trained on , then with probability at least ,
In particular, since we always have , the theorem immediately follows. ∎
Remark 5.
We note that it is clear from the proof that the denominator in the lower bound is not strictly optimal. It can be replaced by any function satisfying : for instance, .
Remark 6.
We also note that there exist classes for which the lower bound can be extended to the full range of (i.e., any values ). Specifically, in this general case, if there exist two points such that all agree on while with , then by the same construction (with ) used in the proof we would have, with probability at least ,
6 Upper Bound Analysis
We will in fact establish the upper bound as a bound holding with high probability $1 - \delta$, for any $\delta \in (0, 1)$. Throughout this subsection, let , for any admissible values of the parameters. Let
for as in Lemma 1 below, and for . The oracle procedure just returns , the ERM over .
We have the following theorem.
Theorem 7.
For any $\Pi \in \mathcal M$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
for a constant , where is a numerical constant from Lemma 1 below.
The proof will rely on the following lemma: a uniform Bernstein inequality for independent but non-identically distributed data. Results of this type are well known for i.i.d. data (see e.g., Kol (06)). For completeness, we include a proof of the extension to non-identically distributed data in Appendix A. The main technical modification of the proof, compared to the i.i.d. case, is employing a generalization of Bousquet's inequality for non-identical data, due to KR (05).
Lemma 1.
(Uniform Bernstein inequality for non-identical distributions) For any and , define and let be independent samples. With probability at least $1 - \delta$, for all $h, h' \in \mathcal H$,
(5) |
and
(6) |
for a universal numerical constant .
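For orientation, the i.i.d. specialization of such a bound typically reads as follows (our paraphrase of the standard form, see e.g. Kol (06), not the exact constants of Lemma 1):

```latex
% With probability at least 1 - delta, for all h, h' in H:
%  | (R(h) - R(h')) - (\hat R(h) - \hat R(h')) |
%    <= c sqrt( P(h != h') ( d log(n) + log(1/delta) ) / n )
%     + c ( d log(n) + log(1/delta) ) / n,
% where the variance factor P(h != h') is what later interacts with the
% Bernstein class condition to produce fast rates.
```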
In particular, we will use this lemma via the following implication.
Lemma 2.
Proof.
By the Bernstein class condition and Jensen's inequality, for any we have
(10) |
Furthermore, applying (8) with and reveals that
Combining this with (10) (for ) implies that
(11) |
which implies
(12) |
Next, applying (8) with any satisfying (7) and with implies
(13) |
and (9) implies the right hand side is at most (after some simplifications of the resulting expression)
(14) |
By the triangle inequality and (10), together with (12), we have
Combining this with (14) and (13) implies (with some simplification of the resulting expression)
Since , together with (12) this implies
In particular, this inequality immediately implies the claimed inequality in the lemma. ∎
We now present the proof of TheoremΒ 7.
Proof of TheoremΒ 7.
For each , define . For brevity, let and . Let . First note that
where the final inequality is due to Jensen's inequality. In particular, this implies
(15) |
so that it suffices to upper bound the expression on the right hand side. Toward this end, note that trivially satisfies (7) for . Therefore, Lemma 2 implies that with probability at least ,
Combining this with (15) immediately implies the theorem. ∎
7 Partially Adaptive Procedures
Fix any , , , , , and , and let .
7.1 Pooling Under Low Noise
We now present the proof of Theorem 3 which states the near-optimality of pooling, independent of the choice of target , whenever .
Proof of Theorem 3.
For any , let and . Suppose the event from Lemma 1 holds for , which occurs with probability at least . By Lemma 2, we know that on this event,
Combining this with the definition of , we have
Since the left hand side is monotonic in , if we wish to bound by some value , it suffices to take any value of such that
or more simply
In particular, let us take
where is the value of that minimizes the right hand side of this definition of . Then
∎
7.2 Aggregation Under Ranking Information
Proof of Theorem 4.
For any let . Let be as in Theorem 7, and by the same argument from the proof of Theorem 7, (15) holds for in the present context as well (where there is in the present notation). Thus, the theorem will be proven if we can bound by for some numerical constant .
By a union bound, with probability at least , for every , the event from Lemma 1 holds for (and with in place of there). In particular, this implies that for every ,
so that there do exist classifiers in satisfying (4), and hence satisfies (4). In particular, this implies satisfies the inequality in (4) for . By Lemma 2, this implies that on this same event from above, it holds that
which completes the proof (for instance, taking ). ∎
8 Impossibility of Adaptivity over Multisource Classes
Theorem 5 follows as a corollary to Theorem 8, the main result of this section. In particular, the second part of Theorem 5 follows from the condition that the family contains enough tasks with small exponents, and calling on Theorem 4.
Theorem 8 (Impossibility of Adaptivity).
Pick , and any , , a number of samples per source task , and a number of target samples . Let the number of source tasks , for as specified below. There exist universal constants such that the following holds.
Choose any , and suppose is sufficiently large so that , and furthermore
Let denote the family of multisource classes satisfying the above conditions, with parameters , , and, in addition, such that at least of the exponents are at most .
Let denote any classification procedure having access to , but without knowledge of . We have:
On the other hand, there exists a semi-adaptive classifier , which, given data , along with a ranking of increasing exponent values, achieves the rate
As it turns out, the above theorem in fact holds for even smaller classes admitting just two choices of distributions, namely and (for fixed ) below, to which sources might be assigned. In fact the result holds even if the learner has knowledge of the construction below, but of course not of the assignment.
Construction.
We build on the following distributions supported on 2 points : here we simply assume that contains at least 2 classifiers that disagree on but agree on (this will be the case, for some choice of , if contains at least 3 classifiers). Therefore, w.l.o.g., assume that has label 1. Let , , and define and . Let , which we will often abbreviate as . In all that follows, we let denote the regression function under distribution .
• Target : Let , ; finally is determined by , and , for some to be specified later.
• Noisy : Let , ; finally is determined by , and , for an appropriate constant specified later.
• Benign : Let ; finally is determined by .
The above construction is such that can be pushed far from , while remains close to . This is formalized in the following proposition, which can be verified by checking the required inequalities (from Definitions 2 and 4) are satisfied in all cases of classifications of , .
Proposition 1 (Exponents of and w.r.t. ).
and have transfer-exponents and , respectively w.r.t. , with and . Furthermore, the 3 distributions satisfy the Bernstein class condition with parameters .
Approximate Multitask.
Let , and . Now consider the distribution
Proof Strategy.
Henceforth, let denote a random draw from , where for , and . Much of the effort will be in showing that any learner has nontrivial error in guessing from a random ; we then reduce this fact to a lower-bound on the risk of an adaptive classifier that takes as input a multitask sample of the form , where βs denote or .
For intuition, the true label is hard to guess from samples alone since , but easy to guess from samples , each being a homogeneous vector of points with identical labels . However, such homogeneous vectors can be produced by also, with identical but flipped labels , and the product of mixtures adds in enough randomness to make it hard to guess the source of a given vector . In fact, this intuition becomes evident by considering the likelihood-ratio between and which decomposes into the contribution of homogeneous vs non-homogenous vectors. This is formalized in the following proposition.
Proposition 2 (Likelihood ratio and sufficient statistics).
Let , and let and and denote the number of homogeneous vectors in with marginals at , that is:
Next, let and denote the total number of labels in , over occurrences of . That is:
We then have that (where is under the randomness of , and we let when ):
(16) |
Remark 8 (Likelihood Ratio and Optimal Discriminants).
Consider sampling from the mixture . Then the Bayes classifier (for identifying ) returns if , and therefore, given the symmetry between , has probability of misclassification at least . We emphasize here that enforcing that be defined in terms of a mixture (rather than as a product of and terms) allows us to bound below as in Proposition 2 above; otherwise this probability is always for a product distribution containing since whenever some . In fact a product distribution inherently encodes the idea that the learner knows the positions of and vectors in , and can therefore simply focus on vectors to easily discover .
We now further reduce the r.h.s. of (16) to events that are simpler to bound: the main strategy is to properly condition on intermediate events to reveal independences that induce simple i.i.d. Bernoullis we can exploit. Towards this end, we first consider the high-probability events of the following proposition.
Proposition 3.
Let .
Define as the number of homogeneous vectors
generated respectively by , and .
Define the events (on ):
We have that while .
Proof.
The proposition follows from multiplicative Chernoff bounds. ∎
Notice that, in the above proposition, for , the expectations in the events are given by
(17) |
By the proposition, we just need these quantities large enough for the events to hold with sufficient probability. The next proposition conditions on these events to reduce the likelihood to simpler events.
Proposition 4 (Further Reducing (16)).
Let . For , let and as defined in Proposition 2. Furthermore, let denote the total number of labels at excluding homogeneous vectors. In all that follows we let denote , and we drop the dependence on for ease of notation.
Let and the events as defined in Proposition 3. Suppose that for some , we have
(i) .
(ii) , further assuming that so that the event is well defined (i.e., are not both ).
We then have that:
(18) |
Proof.
The following is a known counterpart of Chernoff bounds, following from Slud's inequalities Slu (77) for Binomials.
Lemma 3 (Anticoncentration; Slud's Inequality).
Let $B$ be a Binomial$(m, p)$ random variable, i.e., a sum of $m$ i.i.d. Bernoullis with parameter $p \le 1/2$. Then for any integer $k$ with $mp \le k \le m(1 - p)$, we have
$$\mathbb P\big(B \ge k\big) \ \ge\ \mathbb P\left( Z \ge \frac{k - mp}{\sqrt{m p (1 - p)}} \right),$$
where $Z$ is a standard Normal random variable.
Proof.
Proposition 5 ( from Proposition 4).
Proof.
Consider such that holds. Let , and notice that under . Therefore we only need to bound . Now, for , let denote the set of homogeneous vectors in generated by , and notice that, conditioned on , is distributed as , where
Therefore, applying Lemma 3,
We now just need to show that the event under the probability implies , in other words, that
(21) |
Next we upper-bound the r.h.s. of (21) and lower-bound its l.h.s. Under and using (17), we have:
(22) |
where for the last inequality we used the fact that, for , we have . On the other hand, the conditions on let us lower-bound by , and we have:
(23) |
We obtain a bound on as an immediate corollary to the above proposition.
Corollary 1.
Under the conditions of Proposition 5, we have:
We now turn to the second condition of Proposition 4.
Proposition 6 ( from Proposition 4).
Let , , and . Let in the construction of , . Suppose is sufficiently large so that
Let , and as defined in Proposition 4. We then have that .
Proof.
Under the notation of Proposition 4, let , and for homogeneity of notation herein, let . Fix to be defined, and notice that if , then the event
As a first step, we want to upper-bound . Let denote an upper-bound on the variance of this quantity: we have by Bernstein's inequality that, for any , with probability at least ,
(24) |
We therefore set , whereby, for , the event of (24) (with ) happens with probability at least . Hence, we set , where is given in equation (17). We now proceed with a lower-bound on .
Let denote the number of points sampled from that fall on . Notice that, conditioned on these samples' indices, is distributed as , where , the probability of given that is sampled from . Applying Lemma 3, and integrating over , we have
(25) |
Now notice that holds whenever , since . Under the event of (25), we have , whenever it holds that
(26) |
Next we bound with high probability. Consider any value of such that (from Proposition 3) holds, i.e., . Conditioned on such , is itself a Binomial with
where for the first inequality we used the fact that . Hence, by a multiplicative Chernoff bound,
(27) |
whenever . Now, by Proposition 3, holds with probability at least whenever . Thus, integrating (27) over , we get that, with probability at least ,
(28) |
Thus, bounding both sides of (26), holds whenever a) the events of (25) and (28) hold, and b) the following inequality is satisfied:
(29) |
where the r.h.s. and l.h.s. of (29) upper and lower bound, respectively, the r.h.s. and l.h.s. of the previous inequality (using the setting of and (17), and lower-bounding by for as large as assumed). By the conditions of the Proposition, (29) is satisfied. Thus, holds with probability at least since the events of (25) and (28) hold together with that probability (using the fact that ). Finally, we can conclude that with probability at least since and the event of (24) hold together with that probability. ∎
Finally we bound the term in the likelihood equation (16).
Proposition 7 ().
Let . Again let , and let in the construction of .
The proof is given in Appendix D, and follows similar lines as above, namely, isolating sufficient statistics (the number of in ) and concluding by anticoncentration upon proper conditioning.
We can now combine all the above analysis into the following main proposition.
Proposition 8.
Pick any , , and . Let , and in the constructions of , , . Suppose is sufficiently large so that , and also
(i) .
(ii) .
(iii) .
Let denote any classification procedure having access to . We then have that
Proof.
Following the above propositions, again assume w.l.o.g. that . Let be as defined in Proposition 3 over , and notice that, under our assumptions on and (iii) on , each of these events occurs with probability at least .
Thus, for , by Propositions 2, 4, and 7, we have that is at least . Now plug in and from Propositions 5 and 6. For , using Propositions 2 and 7, and noticing that , we can conclude by Corollary 1 that is at least , again matching the lower-bound in the statement.
Now, if wrongly picks (i.e., picks , ), then . By Remark 8, for any , the probability that picks is bounded below by . ∎
We can now conclude with the proof of the main result of this section.
Proof of Theorem 8.
The first part of the statement builds on Proposition 8 as follows. Set and . First, let , and let denote the number of vectors that were generated by (as in Proposition 3). Let denote the event that . Let . By Proposition 8, for some , we have that is bounded below.
Now decouple the randomness in as follows. Let denote i.i.d. choices of or with respective probabilities and ; choose according to . We then have that
where we can bound however small for sufficiently large (by a multiplicative Chernoff). Now conclude by noticing that the above conditional expectation is a projection of the measure onto (via the injection ) and is bounded below, implying must be bounded below.
The second part of the theorem is a direct consequence of the results of Section 7. ∎
References
- ABGLP [19] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv:1907.02893, 2019.
- AKK+ [19] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv:1902.09229, 2019.
- APMS [19] Alessandro Achille, Giovanni Paolini, Glen Mbeng, and Stefano Soatto. The information complexity of learning tasks, their structure and their distance. arXiv:1904.03292, 2019.
- AZ [05] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.
- Bar [92] Peter L Bartlett. Learning with a slowly changing distribution. In Proceedings of the 5th Annual Workshop on Computational Learning Theory, 1992.
- Bax [97] Jonathan Baxter. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39, 1997.
- BDB [08] Shai Ben-David and Reba Schuller Borbely. A notion of task relatedness yielding provable multiple-task learning guarantees. Machine Learning, 73(3):273–287, 2008.
- BDBC+ [10] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
- BDBCP [07] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, 2007.
- BDBM [89] Shai Ben-David, Gyora M Benedek, and Yishay Mansour. A parametrization scheme for classifying models of learnability. In Proceedings of the 2nd Annual Workshop on Computational Learning Theory, 1989.
- BDLLP [10] Shai Ben-David, Tyler Lu, Teresa Luu, and Dávid Pál. Impossibility theorems for domain adaptation. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
- BHPQ [17] Avrim Blum, Nika Haghtalab, Ariel D Procaccia, and Mingda Qiao. Collaborative PAC learning. In Advances in Neural Information Processing Systems, 2017.
- BL [97] Rakesh D Barve and Philip M Long. On the complexity of learning from drifting distributions. Information and Computation, 138(2):170–193, 1997.
- BLM [13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
- Car [97] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
- CKW [08] Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757–1774, 2008.
- CMRR [08] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International Conference on Algorithmic Learning Theory, 2008.
- DHK+ [20] SimonΒ S Du, Wei Hu, ShamΒ M Kakade, JasonΒ D Lee, and QiΒ Lei. Few-shot learning via learning the representation, provably. arXiv:2002.09434, 2020.
- GSH+ [09] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard SchΓΆlkopf. Covariate shift by kernel mean matching. In Dataset Shift in Machine Learning, pages 131β160, 2009.
- HK [19] Steve Hanneke and Samory Kpotufe. On the value of target data in transfer learning. In Advances in Neural Information Processing Systems, 2019.
- HY [19] Steve Hanneke and Liu Yang. Statistical learning under nonstationary mixing processes. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
- JSRR [10] Ali Jalali, Sujay Sanghavi, Chao Ruan, and Pradeep Ravikumar. A dirty model for multi-task learning. In Advances in Neural Information Processing Systems, 2010.
- KFAL [20] Nikola Konstantinov, Elias Frantar, Dan Alistarh, and ChristophΒ H Lampert. On the sample complexity of adversarial multi-source PAC learning. arXiv:2002.10384, 2020.
- KM [18] Samory Kpotufe and Guillaume Martinet. Marginal singularity, and the benefits of labels in covariate-shift. arXiv:1803.01833, 2018.
- Kol [06] V.Β Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593β2656, 2006.
- Kol [11] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole dβEtΓ© de ProbabilitΓ©s de Saint-Flour XXXVIII-2008, volume 2033. 2011.
- KR [05] Thierry Klein and Emmanuel Rio. Concentration around the mean for maxima of empirical processes. The Annals of Probability, 33(3):1060β1077, 2005.
- LPVDG+ [11] Karim Lounici, Massimiliano Pontil, Sara Van DeΒ Geer, AlexandreΒ B Tsybakov, etΒ al. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164β2204, 2011.
- MB [17] Daniel McNamara and Maria-Florina Balcan. Risk bounds for transferring representations with and without fine-tuning. In International Conference on Machine Learning, 2017.
- MBS [13] Krikamol Muandet, David Balduzzi, and Bernhard SchΓΆlkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, 2013.
- MM [12] Mehryar Mohri and AndresΒ Munoz Medina. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, 2012.
- MMM [19] Saeed Mahloujifar, Mohammad Mahmoody, and Ameer Mohammed. Universal multi-party poisoning attacks. In Proceedings of the 36th International Conference on Machine Learning, 2019.
- MMR [09] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the RΓ©nyi divergence. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
- Mou [10] Nima Mousavi. How tight is Chernoff bound? https://ece.uwaterloo.ca/~nmousavi/Papers/Chernoff-Tightness.pdf, 2010.
- MPRP [13] Andreas Maurer, Massi Pontil, and Bernardino Romera-Paredes. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning, 2013.
- MPRP [16] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853β2884, 2016.
- NW [11] S.Β N. Negahban and M.Β J. Wainwright. Simultaneous support recovery in high dimensions: Benefits and perils of block -regularization. IEEE Transactions on Information Theory, 57(6):3841β3863, 2011.
- PL [14] Anastasia Pentina and Christoph Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, 2014.
- PM [13] Massimiliano Pontil and Andreas Maurer. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory, 2013.
- Qia [18] Mingda Qiao. Do outliers ruin collaboration? arXiv:1805.04720, 2018.
- RHS [17] Ievgen Redko, Amaury Habrard, and Marc Sebban. Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2017.
- Sau [72] Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145β147, 1972.
- Slu [77] EricΒ V Slud. Distribution inequalities for the binomial law. The Annals of Probability, pages 404β412, 1977.
- SQZY [18] Jian Shen, Yanru Qu, Weinan Zhang, and Yong Yu. Wasserstein distance guided representation learning for domain adaptation. In 32nd AAAI Conference on Artificial Intelligence, 2018.
- SZ [19] Clayton Scott and Jianxin Zhang. Learning from multiple corrupted sources, with application to learning from label proportions. arXiv:1910.04665, 2019.
- Tat [53] RobertΒ F Tate. On a double inequality of the normal distribution. The Annals of Mathematical Statistics, 24(1):132β134, 1953.
- TJJ [20] Nilesh Tripuraneni, MichaelΒ I Jordan, and Chi Jin. On the theory of transfer learning: The importance of task diversity. arXiv:2006.11650, 2020.
- Tsy [04] AlexanderΒ B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135β166, 2004.
- Tsy [09] AlexandreΒ B Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
- VC [71] V.Β Vapnik and A.Β Chervonenkis. On the uniform convergence of relative frequencies of events to their expectation. Theory of Probability and its Applications, 16:264β280, 1971.
- vW [96] A.Β W. van der Vaart and J.Β A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag New York, 1996.
- YHC [13] Liu Yang, Steve Hanneke, and Jaime Carbonell. A theory of transfer learning with applications to active learning. Machine learning, 90(2):161β189, 2013.
- ZL [19] Alexander Zimin and ChristophΒ H Lampert. Tasks without borders: A new approach to online multi-task learning. In Workshop on Adaptive & Multitask Learning, 2019.
Appendix
Appendix A Proof of Lemma 1
Let be a vector of independent -valued random variables (for some space ), not necessarily identically distributed. Let be a set of measurable functions , and let be the number of distinct vectors possible on points. Let . Define , and , and also , and . Define .
Lemma 4.
For any with , letting , ,
In particular, based on the inequality , this implies
In particular, for any , setting
reveals that, with probability at least ,
(30)
Next, the following lemma bounds using a standard route.
Lemma 5.
There is a numerical constant such that, for any with , for as in Lemma 4,
Proof.
In particular, the above results imply the following lemma.
Lemma 6.
There exists a numerical constant such that, for any , with probability at least , every satisfies
Proof.
We can also state a concentration result specific to the values, as follows.
Lemma 7.
There exists a numerical constant such that, for any , with probability at least , every satisfies
Proof.
Define and note that . Also define . Applying Lemma 6 with this and , we have that, with probability at least , every satisfies (for some numerical constant )
(31)
If then the expression in (31) is at most
while if then the expression in (31) is at most
Thus, either way we have
On the other hand, consider the set and note that again we have . Applying Lemma 6 with this and , we have that, with probability at least , every satisfies (for some numerical constant )
(32)
If then the expression in (32) is at most
while if then the expression in (32) is at most
Thus, either way we have
which implies
The lemma now follows by a union bound: the two events above (each holding with probability at least ) occur simultaneously with probability at least . ∎
Proof of Lemma 1.
Set , , and . For each , define , and note that . Applying Lemma 6 with this , we have that, with probability at least , every satisfies (for some universal numerical constant )
(33)
Furthermore, Lemma 7 implies that, with probability at least , every satisfies (for some universal numerical constant )
(34)
By a union bound, with probability at least , every satisfies both (33) and (34). In particular, combining (33) with the left inequality in (34), this also implies that every satisfies
(35)
Appendix B Pooling is Optimal if Enough Tasks are Good
While our results in Sections 4.3 and 8 imply that, in general, one cannot achieve optimal rates by simply pooling all of the data and using the global ERM , in this section we find that in some special cases this naive approach can actually be successful: namely, cases where most of the tasks have below the cut-off value chosen by the optimization in the optimal procedure from Section 6. We in fact show a general result for pooling, arguing that it always achieves a rate depending on the (weighted) median value of , or more generally on any quantile of the values.
Theorem 9 (Pooling Beyond ).
For any , let be the smallest value in such that . Then, for any , with probability at least we have
for a constant .
Proof.
For instance, if all are equal to some common value , and we take , then we can take , so that the optimal rate is achieved as long as at least half of the tasks have below the value for the minimizer of the bound in Theorem 1.
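As a concrete illustration of the quantile rule in Theorem 9, the following is a minimal sketch (with hypothetical names; the discrepancy values, sample sizes, and the choice q = 1/2 are illustrative assumptions, not the paper's notation) of how the weighted q-quantile cut-off over tasks might be computed:

```python
import numpy as np

def weighted_quantile_cutoff(discrepancies, sample_sizes, q=0.5):
    """Smallest discrepancy value r such that the tasks with discrepancy
    at most r hold at least a q-fraction of the total sample: a sketch of
    the quantile rule in Theorem 9 (names are illustrative)."""
    order = np.argsort(discrepancies)
    weights = np.asarray(sample_sizes, dtype=float)[order]
    cumulative = np.cumsum(weights) / weights.sum()
    idx = np.searchsorted(cumulative, q)  # first index where mass >= q
    return np.asarray(discrepancies)[order][idx]

# Example: half the tasks are "good" (small discrepancy), so the
# median-based cut-off stays small and pooling is near-optimal.
disc = [0.01, 0.02, 0.02, 0.5, 0.8, 1.0]
sizes = [100] * 6
print(weighted_quantile_cutoff(disc, sizes, q=0.5))  # -> 0.02
```

Here, with half the tasks good, the median-based cut-off ignores the three high-discrepancy tasks, matching the intuition that pooling succeeds when enough tasks are good.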
Optimizing the bound in Theorem 9 over the choice of yields the following result.
Corollary 2 (General Pooling Bound).
For any , with probability at least we have
for a constant .
Remark 9.
In particular, note that this result recovers the bound of Theorem 3 in the special case of .
Appendix C Different Optimal Aggregations of Tasks in Multitask
We present a proof of Theorem 2 in this section. Recall that this result states that different choices of target in the same multitask setting can induce different optimal aggregations of tasks. As a consequence, the naive approach of pooling all tasks' data can adversely affect target risk even when all 's are the same.
We employ a similar construction to that of Section 8.
Setup.
We again build on distributions supported on two datapoints . W.l.o.g., assume that has label 1. Let , , and define . Let , which we will often abbreviate as . In all that follows, we let denote the regression function under distribution .
• Target : Let , ; finally is determined by , and .
• Source : Let , ; finally is determined by , and , for an appropriate constant specified in the proof.
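To make the two-point construction concrete, here is a minimal simulation sketch (all names and parameter values are hypothetical assumptions, since the construction's exact marginals and regression values are given symbolically in the proof): each distribution is supported on two points, and labels at each point are drawn from a Bernoulli with mean given by the regression function there.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(n, p_x1, eta_x0, eta_x1):
    """Draw n (x, y) pairs from a two-point distribution: x is one of two
    datapoints (coded 0/1), chosen w.p. (1 - p_x1, p_x1); the label y is
    Bernoulli with mean given by the regression function at x.
    All parameter values are illustrative stand-ins."""
    x = rng.binomial(1, p_x1, size=n)
    eta = np.where(x == 1, eta_x1, eta_x0)
    y = rng.binomial(1, eta)
    return x, y

# Hypothetical instantiation: the target puts little mass on the second
# point, while the source puts more mass there but with noisier labels.
x_t, y_t = sample_task(100, p_x1=0.05, eta_x0=0.9, eta_x1=0.8)
x_s, y_s = sample_task(100, p_x1=0.30, eta_x0=0.9, eta_x1=0.55)
```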
Proof of Theorem 2.
Let and . The above construction then ensures that any s.t. has excess error . Thus, we just have to show that the number of labels at exceeds the number of labels at with non-zero probability (so that both would pick at ). Let denote the number of samples from at , and let denote those samples from having label . Notice that if , then necessarily dominates at .
Now, conditioned on , is distributed as with . Applying Lemma 3, we have
In other words, under the above binomial event, dominates whenever (the second inequality below holds)
that is, if we have both and . Let , where . Under this event, we just need that and that , requiring . Hence, integrating over , we have that
where is bounded below by a multiplicative Chernoff bound whenever . ∎
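The anticoncentration step (Lemma 3) used above is, to our reading, of the standard Slud type for binomials; a hedged reconstruction of such a statement, following Slu [77], is:

```latex
% Slud-type binomial anticoncentration: for B ~ Binomial(n, p) with
% p <= 1/2, and any integer k with np <= k <= n(1-p),
\Pr\left[\, B \ge k \,\right] \;\ge\; 1 - \Phi\!\left(\frac{k - np}{\sqrt{np(1-p)}}\right),
% where \Phi denotes the standard normal CDF.
```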
Appendix D Supporting Results for Section 8 on Impossibility of Adaptivity
Recall that from our construction, we set
Proof of Proposition 2.
Let . Clearly
Define , , and as the numbers of points () in which, respectively, fall on , fall on with , or fall on with . Recall that for . Next, define , i.e., the set of -homogeneous vectors. We have:
We can then consider homogeneous and non-homogeneous vectors separately. Let ,
(37)
where we broke up the product over and . Now, canceling out similar terms in the fraction,
(38)
Now, the second factor in (38) can be expanded as follows to cancel out the second factor in (37):
That is, we have:
In other words, we have:
We then conclude by noticing that each of the above fractions is greater than . ∎
Proof of Proposition 7.
For , define , , and as the numbers of points () in which, respectively, fall on , fall on with , or fall on with . Thus, for fixed, we have
(39)
Now (39) whenever , so we proceed to bound the probability of this event under . In particular, when , the event has probability at least . Assume henceforth that , and let . Conditioned on , is distributed as , with . By the anticoncentration Lemma 3, we then have
(40)
so the event holds whenever
(41)
Now, consider the event , where . Under , (41) is satisfied whenever , recalling . In other words, under this condition, we have
Finally, by a multiplicative Chernoff bound, the event holds with probability at least . ∎
Appendix E Auxiliary Lemmas
The following propositions are taken verbatim from [20].
Proposition 9 (Thm 2.5 of [49]).
Let be a family of distributions indexed over a subset of a semi-metric . Suppose , where , such that:
Let , and let denote any improper learner of . We have for any :
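Since the display above is not reproduced, we recall the standard form of Theorem 2.5 of [49] (Tsybakov's Introduction to Nonparametric Estimation); this paraphrase is offered only for the reader's convenience and should be checked against the source:

```latex
% Tsybakov, Thm 2.5 (standard form): suppose \theta_0, \dots, \theta_M \in \Theta
% satisfy d(\theta_j, \theta_k) \ge 2s for all j \ne k, and
%   \frac{1}{M} \sum_{j=1}^{M} \mathrm{KL}(P_{\theta_j} \,\|\, P_{\theta_0}) \le \alpha \log M
% for some 0 < \alpha < 1/8. Then
\inf_{\hat\theta} \max_{0 \le j \le M}
  P_{\theta_j}\!\left( d(\hat\theta, \theta_j) \ge s \right)
  \;\ge\; \frac{\sqrt{M}}{1 + \sqrt{M}}
  \left( 1 - 2\alpha - \sqrt{\frac{2\alpha}{\log M}} \right) \;>\; 0.
```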
Proposition 10 (Varshamov-Gilbert bound).
Let . Then there exists a subset of such that ,
where is the Hamming distance.
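For reference, the standard statement of the Varshamov–Gilbert bound (as in, e.g., Lemma 2.9 of Tsybakov's book; a reconstruction, since the symbols above are elided) reads:

```latex
% Varshamov-Gilbert: for any m >= 8 there exist binary vectors
% \sigma^{(0)} = (0, \dots, 0), \sigma^{(1)}, \dots, \sigma^{(M)} \in \{0,1\}^m
% with M \ge 2^{m/8}, such that for all 0 \le i < j \le M,
\rho\left( \sigma^{(i)}, \sigma^{(j)} \right) \;\ge\; \frac{m}{8},
% where \rho denotes the Hamming distance.
```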
Lemma 8 (A basic KL upper-bound).
For any , we let denote . Now let and let . We have
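A standard bound of this kind, for Bernoulli measures, is the following (a hedged reconstruction of the likely intended statement, since the symbols are elided above):

```latex
% For p, q \in (0, 1), writing kl for the KL divergence between Bernoulli laws:
\mathrm{kl}\big( \mathrm{Ber}(p) \,\|\, \mathrm{Ber}(q) \big)
  \;=\; p \log\frac{p}{q} + (1 - p) \log\frac{1 - p}{1 - q}
  \;\le\; \frac{(p - q)^2}{q(1 - q)},
% which follows from the elementary inequality \log x \le x - 1.
```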