yangli@us.ibm.com
Princeton, NJ USA. steve.hanneke@gmail.com
Carnegie Mellon University, Pittsburgh, PA USA. jgc@cs.cmu.edu
Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks
Abstract
We study the optimal rates of convergence for estimating a prior distribution over a VC class from a sequence of independent data sets respectively labeled by independent target functions sampled from the prior. We specifically derive upper and lower bounds on the optimal rates under a smoothness condition on the correct prior, with the number of samples per data set equal to the VC dimension. These results have implications for the improvements achievable via transfer learning. We additionally extend this setting to real-valued functions, where we establish consistency of an estimator for the prior, and discuss an additional application to a preference elicitation problem in algorithmic economics.
1 Introduction
In the transfer learning setting, we are presented with a sequence of learning problems, each with some respective target concept we are tasked with learning. The key question in transfer learning is how to leverage our access to past learning problems in order to improve performance on learning problems we will be presented with in the future.
Among the several proposed models for transfer learning, one particularly appealing model supposes the learning problems are independent and identically distributed, with unknown distribution, and the advantage of transfer learning then comes from the ability to estimate this shared distribution based on the data from past learning problems [2, 12]. For instance, when customizing a speech recognition system to a particular speaker’s voice, we might expect the first few people would need to speak many words or phrases in order for the system to accurately identify the nuances. However, after performing this for many different people, if the software has access to those past training sessions when customizing itself to a new user, it should have identified important properties of the speech patterns, such as the common patterns within each of the major dialects or accents, and other such information about the distribution of speech patterns within the user population. It should then be able to leverage this information to reduce the number of words or phrases the next user needs to speak in order to train the system, for instance by first trying to identify the individual’s dialect, then presenting phrases that differentiate common subpatterns within that dialect, and so forth.
In analyzing the benefits of transfer learning in such a setting, one important question to ask is how quickly we can estimate the distribution from which the learning problems are sampled. In recent work, [12] have shown that under mild conditions on the family of possible distributions, if the target concepts reside in a known VC class, then it is possible to estimate this distribution using only a bounded number of training samples per task: specifically, a number of samples equal to the VC dimension. However, that work left open the question of quantifying the rate of convergence. This rate of convergence can have a direct impact on how much benefit we gain from transfer learning when we are faced with only a finite sequence of learning problems. As such, it is certainly desirable to derive tight characterizations of this rate of convergence.
The present work continues that of [12], bounding the rate of convergence for estimating this distribution, under a smoothness condition on the distribution. We derive a generic upper bound, which holds regardless of the VC class the target concepts reside in. The proof of this result builds on that earlier work, but requires several interesting innovations to make the rate of convergence explicit, and to dramatically improve the upper bound implicit in the proofs of those earlier results. We further derive a nontrivial lower bound that holds for certain constructed scenarios, which illustrates a lower limit on how good a general upper bound we might hope for in results expressed only in terms of the number of tasks, the smoothness conditions, and the VC dimension.
We additionally include an extension of the results of [12] to the setting of real-valued functions, establishing consistency (at a uniform rate) for an estimator of a prior over any VC subgraph class. In addition to the application to transfer learning, analogous to the original work of [12], we also discuss an application of this result to a preference elicitation problem in algorithmic economics, in which we are tasked with allocating items to a sequence of customers to approximately maximize the customers’ satisfaction, while permitted access to the customer valuation functions only via value queries.
2 The Setting
Let be a measurable space [8] (where is called the instance space), and let be a distribution on (called the data distribution). Let be a VC class of measurable classifiers (called the concept space), and denote by the VC dimension of [10]. We suppose is equipped with its Borel -algebra induced by the pseudo-metric . Though our results can be formulated for general (with somewhat more complicated theorem statements), to simplify the statement of results we suppose is actually a metric, which would follow from appropriate topological conditions on relative to .
For any two probability measures on a measurable space , define the total variation distance
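For finite spaces, the total variation distance admits the familiar half-$\ell_1$ form. As a quick illustrative sketch (not part of the paper's formal development, and using hypothetical distributions):

```python
def total_variation(p, q):
    """Half the L1 distance between two discrete distributions,
    which equals sup_A |P(A) - Q(A)| over all events A."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.4, "b": 0.4, "c": 0.2}
print(total_variation(p, q))  # approximately 0.1, attained e.g. by A = {"a"}
```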
For a set function on a finite measurable space , we abbreviate , . Let be a family of probability measures on (called priors), where is an arbitrary index set (called the parameter space). We suppose there exists a probability measure on (the reference measure) such that every is absolutely continuous with respect to , and therefore has a density function given by the Radon-Nikodym derivative [8].
We consider the following type of estimation problem. There is a collection of -valued random variables , where for any fixed the variables are i.i.d. with distribution . For each , there is a sequence , where are i.i.d. , and for each , . We additionally denote by the first elements of , for any , and similarly and . Following the terminology used in the transfer learning literature, we refer to the collection of variables associated with each collectively as the task. We will be concerned with sequences of estimators , for , which are based on only a bounded number of samples per task, among the first tasks. Our main results specifically study the case of samples per task. For any such estimator, we measure the risk as , and will be particularly interested in upper-bounding the worst-case risk as a function of , and lower-bounding the minimum possible value of this worst-case risk over all possible estimators (called the minimax risk).
In previous work, [12] showed that, if is a totally bounded family, then even with only a bounded number of samples per task, the minimax risk (as a function of the number of tasks ) converges to zero. In fact, that work also proved this is not necessarily the case in general for any number of samples less than . However, the actual rates of convergence were not explicitly derived in that work, and indeed the upper bounds on the rates of convergence implicit in that analysis may often have fairly complicated dependences on , , and , and furthermore often provide only very slow rates of convergence.
To derive explicit bounds on the rates of convergence, in the present work we specifically focus on families of smooth densities. The motivation for involving a notion of smoothness in characterizing rates of convergence is clear if we consider the extreme case in which contains two priors and , with , where is a very small but nonzero value; in this case, if we have only a small number of samples per task, we would require many tasks (on the order of ) to observe any data points carrying any information that would distinguish between these two priors (namely, points with ); yet , so that we have a slow rate of convergence (at least initially). A total boundedness condition on would limit the number of such pairs present in , so that for instance we cannot have arbitrarily close and , but less extreme variants of this can lead to slow asymptotic rates of convergence as well. Specifically, in the present work we consider the following notion of smoothness. For and , a function is -Hölder smooth if
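The smoothness condition referred to above is the standard Hölder condition. As a reconstruction (the specific symbols $f$, $L$, $\alpha$, and the metric $\rho$ here are our own notation, since the original inline formula did not survive in this copy), a density $f$ on the concept space is $(L, \alpha)$-Hölder smooth if:

```latex
% Standard (L, alpha)-Holder smoothness for a density f on the concept
% space, with L > 0 a constant and alpha in (0, 1] the exponent; rho is
% assumed to be the pseudo-metric on the concept space defined earlier.
\[
  \forall h, g, \qquad |f(h) - f(g)| \;\le\; L \, \rho(h, g)^{\alpha}.
\]
```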
3 An Upper Bound
We now have the following theorem, holding for an arbitrary VC class and data distribution ; it is the main result of this work.
Theorem 3.1
For any class of priors on having -Hölder smooth densities , for any , there exists an estimator such that
Proof
By the standard PAC analysis [9, 3], for any , with probability greater than , a sample of random points will partition into regions of width less than (under ). For brevity, we omit the subscripts and superscripts on quantities such as throughout the following analysis, since the claims hold for any arbitrary value of .
For any , let denote a (conditional on ) distribution defined as follows. Let denote the (conditional on ) density function of with respect to , and for any , let (or if ). In other words, has the same probability mass as for each of the equivalence classes induced by , but conditioned on the equivalence class, simply has a constant-density distribution over that equivalence class. Note that every has between the smallest and largest values of among with ; therefore, by the smoothness condition, on the event (of probability greater than ) that each of these regions has diameter less than , we have . On this event, for any ,
Furthermore, since the regions that define and are the same (namely, the partition induced by ), we have
Thus, we have that with probability at least ,
Following an argument analogous to the inductive argument of [12], suppose , fix and . Then the for which is minimal, subject to the constraint that no has , has ; also, for any with , letting for and , we have
and similarly for , so that
Now consider that these two terms inductively define a binary tree. Every time the tree branches left once, it arrives at a difference of probabilities for a set of one less element than that of its parent. Every time the tree branches right once, it arrives at a difference of probabilities for a one closer to an unrealized than that of its parent. Say we stop branching the tree upon reaching a set and a such that either is an unrealized labeling, or . Thus, we can bound the original (root node) difference of probabilities by the sum of the differences of probabilities for the leaf nodes with . Any path in the tree can branch left at most times (total) before reaching a set with only elements, and can branch right at most times in a row before reaching a such that both probabilities are zero, so that the difference is zero. So the depth of any leaf node with is at most . Furthermore, at any level of the tree, from left to right the nodes have strictly decreasing values, so that the maximum width of the tree is at most . So the total number of leaf nodes with is at most . Thus, for any and ,
Since
and by Sauer’s Lemma this is at most
we have that
Thus, we have that
Note that
and by exchangeability, this last line equals
[12] showed that , so that in total we have . Plugging in the value of , this is
Thus, it suffices to bound the rate of convergence (in total variation distance) of some estimator of . If is the -covering number of , then taking as the minimum distance skeleton estimate of [13, 5] achieves expected total variation distance from , for some . We can partition into cells of diameter , and set a constant density value within each cell, on an -grid of density values, and every prior with -Hölder smooth density will have density within of some density so-constructed; there are then at most such densities, so this bounds the covering numbers of . Furthermore, the covering number of upper bounds [12], so that .
Solving for , we have . So this bounds the rate of convergence for , for the minimum distance skeleton estimate. Plugging this rate into the bound on the priors, combined with Jensen’s inequality, we have
This holds for any , so minimizing this expression over yields a bound on the rate. For instance, with , we have
∎
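The minimum distance skeleton estimator invoked in the proof can be illustrated in a simplified discrete form. The following sketch is not the actual skeleton estimate of [13, 5] (which compares probabilities over Yatracos sets); instead it selects, from a finite cover of hypothetical candidate distributions, the one nearest the empirical measure in total variation, which conveys the same idea for discrete distributions:

```python
import random
from collections import Counter

def empirical(sample):
    """Empirical distribution of a finite sample."""
    n = len(sample)
    return {x: c / n for x, c in Counter(sample).items()}

def tv(p, q):
    """Total variation distance between discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def min_distance_estimate(candidates, sample):
    """Pick the candidate distribution closest in TV to the empirical measure."""
    emp = empirical(sample)
    return min(candidates, key=lambda p: tv(p, emp))

random.seed(0)
truth = {"a": 0.7, "b": 0.2, "c": 0.1}
candidates = [
    {"a": 0.7, "b": 0.2, "c": 0.1},        # the true distribution
    {"a": 0.3, "b": 0.3, "c": 0.4},
    {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3},
]
sample = random.choices(list(truth), weights=list(truth.values()), k=500)
est = min_distance_estimate(candidates, sample)
print(est)  # the first candidate, almost surely, at this sample size
```

The rate at which such an estimate converges is governed by the covering numbers of the candidate family, which is exactly the role the covering-number bound plays in the proof above.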
4 A Minimax Lower Bound
One natural question is whether Theorem 3.1 can generally be improved. While we expect this to be true for some fixed VC classes (e.g., those of finite size), and in any case we expect that some of the constant factors in the exponent may be improvable, it is not at this time clear whether the general form of is sometimes optimal. One way to investigate this question is to construct specific spaces and distributions for which a lower bound can be obtained. In particular, we are generally interested in exhibiting lower bounds that are worse than those that apply to the usual problem of density estimation based on direct access to the values (see Theorem 4.2 below).
Here we present a lower bound that is interesting for this reason. However, although larger than the optimal rate for methods with direct access to the target concepts, it is still far from matching the upper bound above, so that the question of tightness remains open. Specifically, we have the following result.
Theorem 4.1
For any integer , any , there is a value such that, for any , there exists an instance space , a concept space of VC dimension , a distribution over , and a distribution over such that, for a set of distributions over with -Hölder smooth density functions with respect to , any estimator has
Proof
(Sketch) We proceed by a reduction from the task of determining the bias of a coin from among two given possibilities. Specifically, fix any , , and let be i.i.d. random variables, for each ; then it is known that, for any (possibly nondeterministic) decision rule ,
(1)
This easily follows from the results of [1], combined with a result of [7] bounding the KL divergence (see also [11])
To use this result, we construct a learning problem as follows. Fix some with , let , and let be the space of all classifiers such that . Clearly the VC dimension of is . Define the distribution as uniform over . Finally, we specify a family of -Hölder smooth priors, parameterized by , as follows. Let . First, enumerate the distinct -sized subsets of as . Define the reference distribution by the property that, for any , letting , . For any , define the prior as the distribution of a random variable specified by the following generative model. Let , let ; finally, , where is if is odd, or if is even. We will refer to the variables in this generative model below. For any , letting and , we can equivalently express . From this explicit representation, it is clear that, letting , we have for all . The fact that is Hölder smooth follows from this, since every distinct have .
Next we set up the reduction as follows. For any estimator , and each , let be the classifier with ; also, if , let , and otherwise . We use these values to estimate the original values. Specifically, let and , where . Then
Thus, we have reduced from the problem of deciding the biases of these independent Bernoulli random variables. To complete the proof, it suffices to lower bound the expectation of the right side for an arbitrary estimator.
Toward this end, we in fact study an even easier problem. Specifically, consider an estimator , where is the random variable in the generative model that defines ; that is, , , and , where the are independent across , as are the and . Clearly the from above can be viewed as an estimator of this type, which simply ignores the knowledge of . The knowledge of these variables simplifies the analysis, since given , the data can be partitioned into disjoint sets, , and we can use only the set to estimate . Furthermore, we can use only the subset of these for which , since otherwise we have zero information about the value of . That is, given , any is conditionally independent from every for , and is even conditionally independent from when is not completely contained in ; specifically, in this case, regardless of , the conditional distribution of given and given is a product distribution, which deterministically assigns label to those with , and gives uniform random values to the subset of with their respective . Finally, letting , we note that given , , and the value , is conditionally independent from . Thus, the set of values is a sufficient statistic for (hence for ). Recall that, when and , the value of is equal to , a random variable. Thus, we neither lose nor gain anything (in terms of risk) by restricting ourselves to estimators of the type , for some [8]: that is, estimators that are a function of the random variables, which we should note are conditionally i.i.d. given .
Thus, by (1), for any ,
Also note that, for each , . Thus, Jensen’s inequality, linearity of expectation, and the law of total expectation imply
Thus, by linearity of the expectation,
In particular, taking , we have , so that
In particular, this implies there exists some for which
Applying this lower bound to the estimator above yields the result. ∎
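The two-coin distinguishing problem underlying this reduction can also be checked numerically. The following Monte Carlo sketch (with hypothetical parameter values of our choosing) estimates the error of the natural rule that compares the empirical frequency of heads to 1/2, and shows it decaying as the number of flips grows, consistent with the exponential dependence on the squared bias gap in (1):

```python
import random

def error_probability(eps, m, trials=2000, seed=1):
    """Monte Carlo estimate of the error of the natural decision rule
    (compare the fraction of heads to 1/2) for telling a coin of bias
    1/2 + eps from one of bias 1/2 - eps using m flips."""
    rng = random.Random(seed)
    errors = 0
    for t in range(trials):
        sign = 1 if t % 2 == 0 else -1   # alternate the true bias
        p = 0.5 + sign * eps
        heads = sum(rng.random() < p for _ in range(m))
        guess = 1 if heads > m / 2 else -1
        errors += (guess != sign)
    return errors / trials

for m in (10, 100, 1000):
    print(m, error_probability(0.05, m))  # error shrinks as m grows
```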
It is natural to wonder how these rates might potentially improve if we allow to depend on more than samples per data set. To establish limits on such improvements, we note that in the extreme case of allowing the estimator to depend on the full data sets, we may recover the known results lower bounding the risk of density estimation from i.i.d. samples from a smooth density, as indicated by the following result.
Theorem 4.2
For any integer , there exists an instance space , a concept space of VC dimension , a distribution over , and a distribution over such that, for the set of distributions over with -Hölder smooth density functions with respect to , any sequence of estimators, (), has
The proof is a simple reduction from the problem of estimating based on direct access to , which is essentially equivalent to the standard model of density estimation, and indeed the lower bound in Theorem 4.2 is a well-known result for density estimation from i.i.d. samples from a Hölder smooth density in a -dimensional space [5].
5 Real-Valued Functions and an Application in Algorithmic Economics
In this section, we present results generalizing the analysis of [12] to classes of real-valued functions. We also present an application of this generalization to a preference elicitation problem.
5.1 Consistent Estimation of Priors over Real-Valued Functions at a Bounded Rate
In this section, we let denote a -algebra on , and again let denote the corresponding -algebra on . Also, for measurable functions , let , where is a distribution over . Let be a class of functions with Borel -algebra induced by . Let be a set, and for each , let denote a probability measure on . We suppose is totally bounded in total variation distance, and that is a uniformly bounded VC subgraph class with pseudodimension . We also suppose is a metric when restricted to .
As above, let be i.i.d. random variables. For each , let be i.i.d. random variables, independent from . For each and , let for , and let ; for each , define , , and .
We have the following result. The proof parallels that of [12] (who studied the special case of binary functions), with a few important twists (in particular, a significantly different approach in the analogue of their Lemma 3). The details are included in Appendix 0.A.
Theorem 5.1
There exists an estimator , and functions and such that, for any , and for any and ,
5.2 Maximizing Customer Satisfaction in Combinatorial Auctions
Theorem 5.1 has a clear application in the context of transfer learning, following analogous arguments to those given in the special case of binary classification by [12]. In addition to that application, we can also use Theorem 5.1 in the context of the following problem in algorithmic economics, where the objective is to serve a sequence of customers so as to maximize their satisfaction.
Consider an online travel agency, where customers go to the site with some idea of what type of travel they are interested in; the site then poses a series of questions to each customer, and identifies a travel package that best suits their desires, budget, and dates. There are many options of travel packages, with options on location, sightseeing tours, hotel and room quality, etc. Because of this, serving the needs of an arbitrary customer might be a lengthy process, requiring many detailed questions. Fortunately, the stream of customers is typically not a worst-case sequence, and in particular obeys many statistical regularities: in particular, it is not too far from reality to think of the customers as being independent and identically distributed samples. With this assumption in mind, it becomes desirable to identify some of these statistical regularities so that we can pose the questions that are typically most relevant, and thereby more quickly identify the travel package that best suits the needs of the typical customer. One straightforward way to do this is to directly estimate the distribution of customer value functions, and optimize the questioning system to minimize the expected number of questions needed to find a suitable travel package.
One can model this problem in the style of Bayesian combinatorial auctions, in which each customer has a value function for each possible bundle of items. However, it is slightly different, in that we do not assume the distribution of customers is known, but rather are interested in estimating this distribution; the obtained estimate can then be used in combination with methods based on Bayesian decision theory. In contrast to the literature on Bayesian auctions (and subjectivist Bayesian decision theory in general), this technique is able to maintain general guarantees on performance that hold under an objective interpretation of the problem, rather than merely guarantees holding under an arbitrary assumed prior belief. This general idea is sometimes referred to as Empirical Bayesian decision theory in the machine learning and statistics literatures. The ideal result for an Empirical Bayesian algorithm is to be competitive with the corresponding Bayesian methods based on the actual distribution of the data (assuming the data are random, with an unknown distribution); that is, although the Empirical Bayesian methods only operate with a data-based estimate of the distribution, the aim is to perform nearly as well as methods based on the true (unobservable) distribution. In this work, we present results of this type, in the context of an abstraction of the aforementioned online travel agency problem, where the measure of performance is the expected number of questions to find a suitable package.
The specific application we are interested in here may be expressed abstractly as a kind of combinatorial auction with preference elicitation. Specifically, we suppose there is a collection of items on a menu, and each possible bundle of items has an associated fixed price. There is a stream of customers, each with a valuation function that provides a value for each possible bundle of items. The objective is to serve each customer a bundle of items that nearly-maximizes his or her surplus value (value minus price). However, we are not permitted direct observation of the customer valuation functions; rather, we may query for the value of any given bundle of items; this is referred to as a value query in the literature on preference elicitation in combinatorial auctions (see Chapter 14 of [4], [14]). The objective is to achieve this near-maximal surplus guarantee, while making only a small number of queries per customer. We suppose the customer valuation functions are sampled i.i.d. according to an unknown distribution over a known (but arbitrary) class of real-valued functions having finite pseudo-dimension. Reasoning that knowledge of this distribution should allow one to make a smaller number of value queries per customer, we are interested in estimating this unknown distribution, so that as we serve more and more customers, the number of queries per customer required to identify a near-optimal bundle should decrease. In this context, we in fact prove that in the limit, the expected number of queries per customer converges to the number required of a method having direct knowledge of the true distribution of valuation functions.
Formally, suppose there is a menu of items , and each bundle has an associated price . Suppose also there is a sequence of customers, each with a valuation function . We suppose these functions are i.i.d. samples. We can then calculate the satisfaction function for each customer as , where , and , where contains element iff .
Now suppose we are able to ask each customer a number of questions before serving up a bundle to that customer. More specifically, we are able to ask for the value for any ; as noted above, this is the value query model. We are interested in asking as few questions as possible, while satisfying the guarantee that .
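As a toy illustration of the value query model (all names, the finite candidate set, and the stopping rule below are our own illustrative assumptions, not the method from the text), one can serve a customer by querying bundle values until the surviving candidate valuations pin down an optimal bundle:

```python
from itertools import combinations

def bundles(items):
    """All subsets of the item set, as frozensets."""
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]

def serve_customer(candidates, price, query):
    """Ask value queries until the surviving candidate valuations are
    indistinguishable from the customer's, then return an optimal bundle.
    `candidates`: valuation functions (dict: bundle -> value), assumed
    to contain the customer's true valuation.
    `query(b)`: returns the customer's true value for bundle b."""
    live = list(candidates)
    queries = 0
    while True:
        # find a bundle on which the surviving candidates still disagree
        disputed = next((b for b in price
                         if len({v[b] for v in live}) > 1), None)
        if disputed is None:
            break
        answer = query(disputed)
        queries += 1
        live = [v for v in live if v[disputed] == answer]
    v = live[0]  # every survivor agrees with the truth on all bundles
    best = max(price, key=lambda b: v[b] - price[b])
    return best, queries

items = ("flight", "hotel")
B = bundles(items)
price = {b: 10 * len(b) for b in B}
v1 = {b: 12 * len(b) for b in B}                   # values every item at 12
v2 = {b: (25 if "hotel" in b else 0) for b in B}   # only wants the hotel
bundle, nq = serve_customer([v1, v2], price, lambda b: v2[b])
print(sorted(bundle), nq)  # ['hotel'] 1
```

This sketch identifies the valuation exactly, which can require more queries than necessary; the methods discussed below instead tolerate a small surplus loss in exchange for fewer queries.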
Now suppose, for every and , we have a method such that, given that is the actual distribution of the functions, guarantees that the value it selects has ; also let denote the actual (random) number of queries the method would ask for the function, and let . We suppose the method never queries any value twice for a given , so that its number of queries for any given is bounded.
Also suppose is a VC subgraph class of functions mapping into with pseudodimension , and that is a known totally bounded family of distributions over such that the functions have distribution for some unknown . For any and , let .
Suppose, in addition to , we have another method that is not -dependent, but still provides the -correctness guarantee, and makes a bounded number of queries (e.g., in the worst case, we could consider querying all points, but in most cases there are more clever -independent methods that use far fewer queries, such as ). Consider the following method; the quantities , , and from Theorem 5.1 are here considered with the data distribution taken to be uniform on .
The following theorem indicates that this method is correct, and furthermore that the long-run average number of queries is not much worse than that of a method that has direct knowledge of . The proof of this result parallels that of [12] for the transfer learning setting, but is included here for completeness.
Theorem 5.2
For the above method, . Furthermore, if is the total number of queries made by the method, then
Proof
By Theorem 5.1, for any , if , then with probability at least , , so that a triangle inequality implies . Thus,
For , let denote the point that would be returned by when queries are answered by some instead of (and supposing ). If , then
Plugging into the above bound, we have .
For the result on , first note that only finitely many times (due to ), so that we can ignore those values of in the asymptotic calculation (as the number of queries is always bounded), and rely on the correctness guarantee of . For the remaining values , let denote the number of queries made by . Then
Since
we have
For , let denote the number of queries would make if queries were answered with instead of . On the event , we have
Therefore,
∎
In many cases, this result will even continue to hold with an infinite number of goods (), since Theorem 5.1 has no dependence on the cardinality of the space .
6 Open Problems
There are several interesting questions that remain open at this time. Can either the lower bound or upper bound be improved in general? If, instead of samples per task, we instead use samples, how does the minimax risk vary with ? Related to this, what is the optimal value of to optimize the rate of convergence as a function of , the total number of samples? More generally, if an estimator is permitted to use total samples, taken from however many tasks it wishes, what is the optimal rate of convergence as a function of ?
Appendix 0.A Proofs for Section 5
The proof of Theorem 5.1 is based on the following sequence of lemmas, which parallel those used by [12] for establishing the analogous result for consistent estimation of priors over binary functions. The last of these lemmas (namely, Lemma 3) requires substantial modifications to the original argument of [12]; the others use arguments more-directly based on those of [12].
Lemma 1
For any and ,
Proof
Fix , . Let , , and for let and . For , let .
For , define (if the limit exists), and . Note that since is a uniformly bounded VC subgraph class, so is the collection of functions , so that the uniform strong law of large numbers implies that with probability one, , exists and has [10].
Consider any , and any . Then any has , (by the metric assumption). Thus, if for all , then ,
This implies . Under these conditions,
and similarly for .
Any measurable set for the range of can be expressed as for some appropriate . Letting , we have
Likewise, this reasoning holds for . Then
Since and are independent, , . Analogous reasoning holds for . Thus, we have
Altogether, we have . ∎
Lemma 2
There exists a sequence such that, , ,
Proof
This proof follows identically to a proof of [12], but is included here for completeness. Since for all measurable , and similarly for , we have
which implies the left inequality when combined with Lemma 1.
Next, we focus on the right inequality. Fix and , and let be such that
Let . Note that is an algebra that generates . Thus, Carathéodory’s extension theorem (specifically, the version presented by [8]) implies that there exist disjoint sets in such that and
Since these sets are disjoint, each of these sums is bounded by a probability value, which implies that there exists some such that
which implies
As , there exists and measurable such that , and therefore
Combining the above, we have . By letting approach , we have
So there exists a sequence such that
Now let and let be a minimal -cover of . Define the quantity . Then for any , let and . Then a triangle inequality implies that ,
Triangle inequalities and the left inequality from the lemma statement (already established) imply
So in total we have
Since this holds for all , defining , we have the right inequality of the lemma statement. Furthermore, since each , and , we have for each , and thus we also have . ∎
Lemma 3
, there exists a monotone function such that, ,
Proof
Fix any , and let and , and for let and .
If , then , so that
and therefore the result trivially holds.
Now suppose . Fix any , and let be a measurable set such that
By Carathéodory’s extension theorem (specifically, the version presented by [8]), there exists a disjoint sequence of sets such that
and such that each is representable as follows; for some , and sets , for , where each , the set is representable as , where , each , and . Since the are disjoint, the above sums are bounded, so that there exists such that every has
Now define . Then for any , let be such that and , which implies and by Lemma 2. Then
Again, since the are disjoint, this equals
Thus, if we can show that each term is bounded by a function of , then the result will follow by substituting this relaxation into the above expression and defining by minimizing the resulting expression over .
Toward this end, let be as above from the definition of , and note that is representable as a function of the indicators, so that
Note that can be expressed as some , where each and , so that, for and , this last expression is at most
Next note that for any , letting and ,
For , let . Then note that, by definition of , for any given , the class is a VC class over with VC dimension at most . Furthermore, we have
Therefore, the results of [12] (in the proof of their Lemma 3) imply that
Thus, we have
Exchangeability implies this is at most
[12] argue that for all and ,
Noting that
completes the proof. ∎
We are now ready for the proof of Theorem 5.1.
Proof (Proof of Theorem 5.1)
The estimator we will use is precisely the minimum-distance skeleton estimate of [13, 5]. [13] proved that if is the -covering number of , then, for this estimator, for some , any has
Thus, taking , we have
Letting be any positive sequence with and , and letting , Markov’s inequality implies
(2)
Letting , since and , we have . Furthermore, composing (2) with Lemmas 1, 2, and 3, we have
∎
Remark:
Although the above proof makes use of the minimum-distance skeleton estimator, which is typically not computationally efficient, it is often possible to achieve this same result (for certain families of distributions) using a simpler estimator, such as the maximum likelihood estimator. All we require is that the risk of the estimator converges to at a known rate that is independent of . For instance, see [6] for conditions on the family of distributions sufficient for this to be true of the maximum likelihood estimator.
References
- [1] Bar-Yossef, Z.: Sampling lower bounds via information theory. In: Proceedings of the 35th Annual ACM Symposium on the Theory of Computing. pp. 335–344 (2003)
- [2] Baxter, J.: A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning 28, 7–39 (1997)
- [3] Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery 36(4), 929–965 (1989)
- [4] Cramton, P., Shoham, Y., Steinberg, R.: Combinatorial Auctions. The MIT Press (2006)
- [5] Devroye, L., Lugosi, G.: Combinatorial Methods in Density Estimation. Springer, New York, NY, USA (2001)
- [6] van de Geer, S.: Empirical Processes in M-Estimation. Cambridge University Press (2000)
- [7] Poland, J., Hutter, M.: MDL convergence speed for Bernoulli sequences. Statistics and Computing 16, 161–175 (2006)
- [8] Schervish, M.J.: Theory of Statistics. Springer, New York, NY, USA (1995)
- [9] Vapnik, V.: Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York (1982)
- [10] Vapnik, V., Chervonenkis, A.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16, 264–280 (1971)
- [11] Wald, A.: Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics 16(2), 117–186 (1945)
- [12] Yang, L., Hanneke, S., Carbonell, J.: A theory of transfer learning with applications to active learning. Machine Learning 90(2), 161–189 (2013)
- [13] Yatracos, Y.G.: Rates of convergence of minimum distance estimators and Kolmogorov’s entropy. The Annals of Statistics 13, 768–774 (1985)
- [14] Zinkevich, M., Blum, A., Sandholm, T.: On polynomial-time preference elicitation with value queries. In: Proceedings of the ACM Conference on Electronic Commerce. pp. 175–185 (2003)