On the Computability of Multiclass PAC Learning
Abstract
We study the problem of computable multiclass learnability within the Probably Approximately Correct (PAC) learning framework of Valiant (1984). In the recently introduced computable PAC (CPAC) learning framework of Agarwal et al. (2020), both learners and the functions they output are required to be computable. We focus on the case of finite label space and start by proposing a computable version of the Natarajan dimension and showing that it characterizes CPAC learnability in this setting. We further generalize this result by establishing a meta-characterization of CPAC learnability for a certain family of dimensions: computable distinguishers. Distinguishers were defined by Ben-David et al. (1992) as a certain family of embeddings of the label space, with each embedding giving rise to a dimension. It was shown that the finiteness of each such dimension characterizes multiclass PAC learnability for finite label space in the non-computable setting. We show that the corresponding computable dimensions for distinguishers characterize CPAC learning. We conclude our analysis by proving that the DS dimension, which characterizes PAC learnability for infinite label space, cannot be expressed as a distinguisher (even in the case of finite label space).
1 Introduction
One of the fundamental lines of inquiry in learning theory is to determine when learning is possible (and how to learn in such cases). Setting aside computational efficiency, which requires learners to run in polynomial time as a function of the learning parameters, the vast majority of the learning theory literature has introduced characterizations of learnability in a purely information-theoretic sense, qualifying or quantifying the amount of data needed to guarantee (or refute) generalization, without computational considerations. These works consider learners as functions rather than algorithms. Recently, Agarwal et al. (2020) introduced the Computable Probably Approximately Correct (CPAC) learning framework, adding computability requirements to the pioneering Probably Approximately Correct (PAC) framework of Valiant (1984), introduced for binary classification. In the CPAC setting, both learners and the functions they output are required to be computable. Perhaps surprisingly, adding such requirements substantially changes the learnability landscape. For example, in a departure from the standard PAC setting for binary classification, where a hypothesis class is learnable in the agnostic case whenever it is learnable in the realizable case (and in particular, always through empirical risk minimization (ERM)), there are binary classes that are CPAC learnable in the realizable setting, but not in the agnostic one. For the latter, it is in fact the effective VC dimension that characterizes CPAC learnability (Sterkenburg, 2022; Delle Rose et al., 2023). While the works of Agarwal et al. (2020); Sterkenburg (2022); Delle Rose et al. (2023) together provide an essentially complete picture of CPAC learnability for the binary case, delineating the CPAC learnability landscape in the multiclass setting, where the label space contains more than two labels and can in general be infinite, has not yet been attempted.
Multiclass PAC learning has been found to exhibit behaviours that depart from the binary case even without computability requirements. For example, not all ERMs are equally successful from a sample-complexity standpoint, and, in particular, for infinite label spaces, the ERM principle can fail (Daniely et al., 2011)! For finite label spaces, the finiteness of both the Natarajan and graph dimensions characterizes learnability (Natarajan and Tadepalli, 1988; Natarajan, 1989). As shown by Ben-David et al. (1992), it is actually possible to provide a meta-characterization through distinguishers, which are families of functions that map the label space to the set $\{0, 1, *\}$. In the infinite case, the finiteness of the Natarajan and graph dimensions fails to provide a characterization, and it is the finiteness of the DS dimension that is equivalent to learnability (Daniely and Shalev-Shwartz, 2014; Brukhim et al., 2022).
Ultimately, both the CPAC and multiclass settings exhibit a significant contrast with PAC learning for binary classification. In this work, we investigate these two settings in conjunction, and thus initiate the study of computable multiclass PAC learnability. We focus on finite label spaces, and provide a meta-characterization of learnability in the agnostic case: the finiteness of the computable dimensions of distinguishers, which we introduce in this work, is both necessary and sufficient for CPAC learnability in this setting. We also explicitly derive the result for the computable Natarajan dimension, also defined in this work, since the lower bound for distinguishers relies on a lower bound in terms of the computable Natarajan dimension. The significance of the meta-characterization is two-fold: first, it establishes that computable versions of other known dimensions, namely those that can be defined through a suitable embedding of the label space, also provide a characterization of multiclass CPAC learnability (this applies, for example, to the computable graph dimension); further, it allows us to extract high-level concepts and proof mechanics regarding computable dimensions, which may be of independent interest. We conclude our investigations by proving that the DS dimension cannot be expressed through the framework of distinguishers, opening the door to potentially more complex phenomena in the infinite label set case.
1.1 Related Work
Computable Learnability.
Following the work of Ben-David et al. (2019), who showed that the learnability of certain basic learning problems is undecidable within ZFC, Agarwal et al. (2020) formally integrated the notion of computability within the standard PAC learning framework of Valiant (1984). This novel set-up, called computable PAC (CPAC) learning, requires that both learners and the hypotheses they output be computable. With follow-up works by Sterkenburg (2022) and Delle Rose et al. (2023), CPAC learnability in the binary classification setting was shown to be fully characterized by the finiteness of the effective VC dimension, formally defined by Delle Rose et al. (2023). Computability has since been studied in the context of different learning problems: Ackerman et al. (2022) extended CPAC learning to continuous domains, Hasrati and Ben-David (2023) and Delle Rose et al. (2024) to online learning, and Gourdeau et al. (2024) to adversarially robust learning.
Multiclass Learnability.
While the study of computable learnability is a very recent field of study, multiclass learnability has been the subject of extensive research efforts in the past decades. In the binary classification setting, the VC dimension characterizes PAC learnability (Vapnik and Chervonenkis, 1971; Ehrenfeucht et al., 1989; Blumer et al., 1989). However, the landscape of learnability in the multiclass setting is much more complex. The PAC framework was extended to the multiclass setting in the works of Natarajan and Tadepalli (1988) and Natarajan (1989), which gave a lower bound with respect to the Natarajan dimension, and an upper bound with respect to the graph dimension. Later, Ben-David et al. (1992) generalized the notion of dimension for multiclass learning, and provided a meta-characterization of learnability: (only) dimensions that are distinguishers characterize learnability in the finite label space setting, with distinguishers encompassing both the Natarajan and graph dimensions. Haussler and Long (1995) later generalized the Sauer-Shelah-Perles Lemma for these families of functions. Daniely et al. (2015a), originally published as Daniely et al. (2011), identified “good” and “bad” ERMs with vastly different sample complexities, which, in the case of infinite label space, leads to learning scenarios where the ERM principle fails. Daniely and Shalev-Shwartz (2014) introduced a new dimension, the DS dimension, whose finiteness they proved to be a necessary condition for learnability. The breakthrough work of Brukhim et al. (2022) closed the problem by showing it is also sufficient for arbitrary label spaces. Different multiclass methods and learning paradigms have moreover been explored by Daniely et al. (2012); Rubinstein et al. (2006); Daniely et al. (2015b). Finally, multiclass PAC learning has also been studied in relation to boosting (Brukhim et al., 2021, 2023, 2024), universal learning rates (Kalavasis et al., 2022; Hanneke et al., 2023), sample compression (Pabbaraju, 2024) and regularization (Asilis et al., 2024).
2 Problem Set-up
Notation.
We denote by $\mathbb{N}$ the natural numbers. For $n \in \mathbb{N}$, let $[n]$ denote the set $\{1, \dots, n\}$. Given a finite alphabet $\Sigma$, we denote by $\Sigma^*$ the set of all finite words (strings) over $\Sigma$. For a given indexed set $S$, we denote by $S_{-i}$ the set resulting from removing the element with index $i$. We will always use the symbol $\subset$ (vs $\subseteq$) to mean proper inclusion. Let $\mathcal{Y}$ be an arbitrary label set containing $0$. For a function $h : \mathbb{N} \to \mathcal{Y}$, denote by $\max(h)$ the largest natural number that is not mapped to $0$ by $h$, with $\max(h) = 0$ if no such number exists.
Learnability.
Let $\mathcal{X}$ be the input space and $\mathcal{Y}$ the label space. We denote by $\mathcal{H}$ a hypothesis class on $\mathcal{X}$: $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$. A learner $\mathcal{A}$ is a mapping from finite samples $S \in (\mathcal{X} \times \mathcal{Y})^*$ to a function $h : \mathcal{X} \to \mathcal{Y}$. Given a distribution $D$ on $\mathcal{X} \times \mathcal{Y}$ and a hypothesis $h$, the risk of $h$ on $D$ is defined as $L_D(h) = \Pr_{(x,y) \sim D}\left[h(x) \neq y\right]$. The empirical risk of $h$ on a sample $S = ((x_1, y_1), \dots, (x_m, y_m))$ is defined as $L_S(h) = \frac{1}{m} \sum_{i=1}^m \mathbb{1}\left[h(x_i) \neq y_i\right]$. An empirical risk minimizer (ERM) for $\mathcal{H}$, denoted by $\mathrm{ERM}_{\mathcal{H}}$, is a learner that for an input sample $S$ outputs a function $h \in \operatorname{argmin}_{h' \in \mathcal{H}} L_S(h')$.
We will focus on the case $\mathcal{X} = \mathbb{N}$, i.e., where the domain is countable. We work in the multiclass classification setting, and thus let $\mathcal{Y}$ be arbitrary. Throughout the paper, whether we are working in the case $|\mathcal{Y}| < \infty$ or $|\mathcal{Y}| = \infty$ will be made explicit. The case $\mathcal{Y} = \{0,1\}$ is the binary classification setting for which the probably approximately correct (PAC) framework of Valiant (1984) was originally defined, though it can straightforwardly be extended to arbitrary label spaces $\mathcal{Y}$:
Definition 1 (Agnostic PAC learnability).
A hypothesis class $\mathcal{H}$ is PAC learnable in the agnostic setting if there exist a learner $\mathcal{A}$ and a function $m : (0,1)^2 \to \mathbb{N}$ such that for all $\epsilon, \delta \in (0,1)$ and for any distribution $D$ on $\mathcal{X} \times \mathcal{Y}$, if the input to $\mathcal{A}$ is an i.i.d. sample from $D$ of size at least $m(\epsilon, \delta)$, then, with probability at least $1 - \delta$ over the samples, the learner outputs a hypothesis $h$ with $L_D(h) \leq \inf_{h' \in \mathcal{H}} L_D(h') + \epsilon$. The class $\mathcal{H}$ is said to be PAC learnable in the realizable setting if the above holds under the condition that $\inf_{h' \in \mathcal{H}} L_D(h') = 0$.
Definition 2 (Proper vs improper learning).
Given a hypothesis class $\mathcal{H}$, a learner $\mathcal{A}$ is said to be proper if for all $m \in \mathbb{N}$ and samples $S \in (\mathcal{X} \times \mathcal{Y})^m$, $\mathcal{A}(S) \in \mathcal{H}$, and improper otherwise.
We note that by definition ERMs are proper learners.
Computable learnability.
We start with some computability basics. A function $f : \mathbb{N} \to \mathbb{N}$ is called total computable if there exists a program $P$ such that, for all inputs $x \in \mathbb{N}$, $P(x)$ halts and satisfies $P(x) = f(x)$. A set $A \subseteq \mathbb{N}$ is said to be decidable (or recursive) if there exists a program $P$ such that, for all $x \in \mathbb{N}$, $P(x)$ halts and outputs whether $x \in A$; $A$ is said to be semi-decidable (or recursively enumerable) if there exists a program $P$ such that $P(x)$ halts for all $x \in A$ and, whenever $P(x)$ halts, it correctly outputs whether $x \in A$. An equivalent formulation of semi-decidability for $A$ is the existence of a program that enumerates all the elements of $A$.
When studying CPAC learnability, we consider hypotheses with a mild requirement on their representation (note that otherwise negative results are trivialized, as argued by Agarwal et al. (2020)):
Definition 3 (Computable Representation (Agarwal et al., 2020)).
A hypothesis class is called decidably representable (DR) if there exists a decidable set of programs such that the set of all functions computed by programs in equals . The class is called recursively enumerably representable (RER) if there exists such a set of programs that is recursively enumerable.
Recall that PAC learnability only takes into account the sample size needed to guarantee generalization. It essentially views learners as functions. Computable learnability adds the basic requirement that learners be algorithms that halt on all inputs and output total computable functions.
Definition 4 (CPAC Learnability (Agarwal et al., 2020)).
A class $\mathcal{H}$ is (agnostic) CPAC learnable if there exists a computable (agnostic) PAC learner for $\mathcal{H}$ that outputs total computable functions as predictors and uses a decidable (recursively enumerable) representation for these.
Dimensions characterizing learnability.
A notion of dimension can provide a characterization of learnability for a learning problem in two different senses: first, in a qualitative sense, where the finiteness of the dimension is both a necessary and sufficient condition for learnability, and, second, in a quantitative sense, where the dimension explicitly appears in both lower and upper bounds on the sample complexity. See Lechner and Ben-David (2024) for a thorough treatment of dimensions in the context of learnability.
In the case of binary classification, the Vapnik-Chervonenkis (VC) dimension characterizes learnability in a quantitative sense (of course implying a qualitative characterization as well):
Definition 5 (VC dimension (Vapnik and Chervonenkis, 1971)).
Given a class $\mathcal{H}$ of functions from $\mathcal{X}$ to $\{0,1\}$, we say that a set $S \subseteq \mathcal{X}$ is shattered by $\mathcal{H}$ if the restriction of $\mathcal{H}$ to $S$ is the set of all functions from $S$ to $\{0,1\}$. The VC dimension of a hypothesis class $\mathcal{H}$, denoted $\mathrm{VC}(\mathcal{H})$, is the size of the largest set that can be shattered by $\mathcal{H}$. If no such largest set exists then $\mathrm{VC}(\mathcal{H}) = \infty$.
In the multiclass setting, the Natarajan dimension provides a lower bound on learnability, while the graph dimension provides an upper bound (Natarajan and Tadepalli, 1988; Natarajan, 1989). When the label space is finite, both dimensions characterize learnability, though they can be separated by a factor of $\log |\mathcal{Y}|$ (Ben-David et al., 1992; Daniely et al., 2011).
Definition 6 (Natarajan dimension (Natarajan, 1989)).
A set $S \subseteq \mathcal{X}$ is said to be N-shattered by $\mathcal{H}$ if there exist labelings $f_1, f_2 : S \to \mathcal{Y}$ such that $f_1(x) \neq f_2(x)$ for all $x \in S$, and for all subsets $T \subseteq S$ there exists $h \in \mathcal{H}$ with $h(x) = f_1(x)$ in case $x \in T$ and $h(x) = f_2(x)$ in case $x \in S \setminus T$. The Natarajan dimension of $\mathcal{H}$, denoted $d_N(\mathcal{H})$, is the size of the largest set that can be N-shattered by $\mathcal{H}$. If no such largest set exists then $d_N(\mathcal{H}) = \infty$.
Definition 7 (Graph dimension (Natarajan, 1989)).
A set $S \subseteq \mathcal{X}$ is said to be G-shattered by $\mathcal{H}$ if there exists a labeling $f : S \to \mathcal{Y}$ such that for every $T \subseteq S$ there exists $h \in \mathcal{H}$ such that $h(x) = f(x)$ for all $x \in T$, and $h(x) \neq f(x)$ for all $x \in S \setminus T$. The graph dimension of $\mathcal{H}$, denoted $d_G(\mathcal{H})$, is the size of the largest set that can be G-shattered by $\mathcal{H}$. If no such largest set exists then $d_G(\mathcal{H}) = \infty$.
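To make these shattering notions concrete, the following Python sketch brute-forces VC-, N-, and G-shattering for a finite hypothesis class given as an explicit list of label assignments; the toy class and labelings are illustrative and not taken from the paper.

```python
from itertools import product

def restrict(H, S):
    """Project each hypothesis (a dict from points to labels) onto the set S."""
    return {tuple(h[x] for x in S) for h in H}

def vc_shattered(H, S):
    """S is VC-shattered iff all 0-1 patterns appear in the projection (binary labels)."""
    return set(product([0, 1], repeat=len(S))) <= restrict(H, S)

def n_shattered(H, S, f1, f2):
    """N-shattering witnessed by labelings f1, f2 (which differ on every point of S)."""
    assert all(f1[x] != f2[x] for x in S)
    patterns = restrict(H, S)
    # every way of mixing f1 and f2 over S must be realized by some hypothesis
    return all(
        tuple(f1[x] if pick else f2[x] for x, pick in zip(S, choice)) in patterns
        for choice in product([0, 1], repeat=len(S))
    )

def g_shattered(H, S, f):
    """G-shattering witnessed by a single labeling f: agree with f exactly on each subset of S."""
    patterns = restrict(H, S)
    return all(
        any(all((p[i] == f[x]) == bool(pick) for i, (x, pick) in enumerate(zip(S, choice)))
            for p in patterns)
        for choice in product([0, 1], repeat=len(S))
    )

# toy illustration with three labels {0, 1, 2} on two points
H = [{0: 0, 1: 0}, {0: 1, 1: 2}, {0: 0, 1: 2}, {0: 1, 1: 0}]
S = [0, 1]
print(n_shattered(H, S, f1={0: 1, 1: 2}, f2={0: 0, 1: 0}))  # True
print(g_shattered(H, S, f={0: 1, 1: 2}))                     # True
```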
When $\mathcal{Y}$ is infinite, it is the DS dimension that characterizes learnability (Daniely and Shalev-Shwartz, 2014; Brukhim et al., 2022). Before defining it, we need to define pseudo-cubes.
Definition 8 (Pseudo-cube).
A set $B \subseteq \mathcal{Y}^d$ is called a pseudo-cube of dimension $d$ if $B$ is non-empty and finite, and for every $\bar{y} \in B$ and every index $i \in [d]$ there exists $\bar{y}' \in B$ such that $\bar{y}'_j \neq \bar{y}_j$ if and only if $j = i$.
Definition 9 (DS dimension (Daniely and Shalev-Shwartz, 2014)).
A set $S \subseteq \mathcal{X}$ is said to be DS-shattered by $\mathcal{H}$ if $\mathcal{H}|_S$ contains an $|S|$-dimensional pseudo-cube. The DS dimension of $\mathcal{H}$, denoted $d_{DS}(\mathcal{H})$, is the size of the largest set that can be DS-shattered by $\mathcal{H}$. If no such largest set exists then $d_{DS}(\mathcal{H}) = \infty$.
We refer the reader to Brukhim et al. (2022) for results separating the Natarajan and DS dimensions, as well as an example showing why the finiteness of the pseudo-cube is a necessary property in order for the DS dimension to characterize learnability.
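As an illustration of Definition 8, the following sketch checks the pseudo-cube property by brute force: every vector in the set must have, for each coordinate, a neighbour in the set differing from it in exactly that coordinate. The examples are hand-picked toy sets, not taken from the paper.

```python
def is_pseudo_cube(B, d):
    """B: a finite, non-empty set of length-d label tuples."""
    if not B:
        return False
    for b in B:
        for i in range(d):
            # need some b2 in B that differs from b exactly in coordinate i
            if not any(
                b2[i] != b[i] and all(b2[j] == b[j] for j in range(d) if j != i)
                for b2 in B
            ):
                return False
    return True

# the Boolean cube {0,1}^2 is a pseudo-cube of dimension 2 ...
print(is_pseudo_cube({(0, 0), (0, 1), (1, 0), (1, 1)}, 2))               # True
# ... and so is this six-element example over three labels
print(is_pseudo_cube({(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)}, 2))  # True
print(is_pseudo_cube({(0, 0), (1, 1)}, 2))                                # False
```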
3 Characterizing CPAC Learnability with the Computable Natarajan Dimension
In this section, we first recall results in the binary CPAC setting that have implications for multiclass CPAC learning, and give conditions that are sufficient for CPAC learnability. We then define the computable versions of the Natarajan and graph dimensions, in the spirit of the effective VC dimension, implicitly appearing in the work of Sterkenburg (2022) and formally defined by Delle Rose et al. (2023). We also show that the same gap as in the standard (non-computable) setting exists between the computable Natarajan and computable graph dimensions. In Section 3.1, we show that the finiteness of the computable Natarajan dimension is a necessary condition for multiclass CPAC learnability for arbitrary label spaces. We finish by showing that this finiteness is sufficient for finite label spaces in Section 3.2.
We note that there are several hardness results for binary CPAC learning that immediately imply hardness for multiclass CPAC learning. In particular, the results that show a separation between agnostic PAC and CPAC learnability (both for proper (Agarwal et al., 2020) and improper (Sterkenburg, 2022) learning) imply that there are decidably representable (DR) classes which are (information-theoretically) multiclass learnable, but which are not computably multiclass learnable. On the other hand, in the binary case, any PAC learnable class that is recursively enumerably representable (RER) is also CPAC learnable in the realizable case. For multiclass learning, we can similarly implement an ERM rule for the realizable case as outlined below.
Proposition 10.
Let be RER. If or , then is properly CPAC learnable in the realizable setting.
Proof.
Both conditions of the proposition are sufficient to guarantee generalization with an ERM. Upon drawing a sufficiently large sample from the underlying distribution $D$, it suffices to enumerate all $h \in \mathcal{H}$ and compute their empirical risks one by one until we obtain one with zero empirical risk. We thus have a computable ERM (recall that all $h \in \mathcal{H}$ are total computable). ∎
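The following is a minimal sketch of the enumeration argument in the proof above, under illustrative assumptions: the RER class is represented by a Python generator yielding total computable predictors, and the input sample is realizable by the class, so the search is guaranteed to halt. The threshold class used in the demo is ours, purely for illustration.

```python
def empirical_risk(h, sample):
    """Fraction of labelled examples (x, y) in the sample that h misclassifies."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def realizable_erm(enumerate_hypotheses, sample):
    """ERM for a recursively enumerably representable class in the realizable case:
    enumerate hypotheses and return the first one with zero empirical risk.
    Halts because some hypothesis in the class is consistent with the sample."""
    for h in enumerate_hypotheses():
        if empirical_risk(h, sample) == 0:
            return h

# illustrative RER class: all threshold functions x -> 1[x >= t] over the naturals
def enumerate_thresholds():
    t = 0
    while True:
        yield (lambda x, t=t: int(x >= t))
        t += 1

sample = [(0, 0), (2, 0), (5, 1), (7, 1)]
h = realizable_erm(enumerate_thresholds, sample)
print([h(x) for x, _ in sample])  # [0, 0, 1, 1]
```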
Thus, RER classes that satisfy the uniform convergence property can be CPAC learned in the realizable case. Moreover, having access to a computable ERM yields the following:
Fact 11.
Let have a computable ERM and suppose or . Then is CPAC learnable in the multiclass setting.
The Computable Natarajan and Graph Dimensions.
The general idea in defining computable versions of shattering-based dimensions, such as the effective VC dimension (Sterkenburg, 2022; Delle Rose et al., 2023) and the computable robust shattering dimension (Gourdeau et al., 2024), is to have a computable proof of the statement “$S$ cannot be shattered” for all sets $S$ of a certain size. We define the computable Natarajan and graph dimensions in this spirit as well:
Definition 12 (Computable Natarajan dimension).
A $k$-witness of Natarajan dimension for a hypothesis class $\mathcal{H}$ is a function that takes as input a set $S \subseteq \mathcal{X}$ of size $k+1$ and two labelings $f_1, f_2$ of $S$ satisfying $f_1(x) \neq f_2(x)$ for all $x \in S$, and outputs a subset $T \subseteq S$ such that for every $h \in \mathcal{H}$ there exists $x \in S$ such that $h(x) \neq f_1(x)$ if $x \in T$ and $h(x) \neq f_2(x)$ if $x \in S \setminus T$. The computable Natarajan dimension of $\mathcal{H}$, denoted $d^c_N(\mathcal{H})$, is the smallest integer $k$ such that there exists a computable $k$-witness of Natarajan dimension for $\mathcal{H}$.
Remark 13.
In the definition above, we make explicit the requirement that $f_1$ and $f_2$ differ on all indices, but this can be checked computably. Moreover, the usual manner to obtain computable dimensions is to negate the first-order formula for “$S$ is shattered by $\mathcal{H}$”. In the Natarajan dimension case, this would give a witness that, after finding $T$, outputs for any given hypothesis $h \in \mathcal{H}$ the point $x \in S$ satisfying the condition, but it is straightforward to computably find $x$ once $T$ is obtained. We thus simplify the Natarajan, graph and general dimensions (see Section 4) in this manner.
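For a finite class given as an explicit list, a witness in the sense of Definition 12 can be found by exhaustive search, as in the sketch below: it returns a subset T of S such that no hypothesis agrees with the first labeling on T and with the second on the rest of S. The toy class is illustrative only; the definition merely requires that some computable procedure produce such a subset.

```python
from itertools import product

def natarajan_witness(H, S, f1, f2):
    """Return a subset T of S such that no h in H satisfies
    h(x) = f1(x) for x in T and h(x) = f2(x) for x in S \\ T.
    Returns None if S is N-shattered by H with labelings f1, f2."""
    assert all(f1[x] != f2[x] for x in S)
    patterns = {tuple(h[x] for x in S) for h in H}
    for choice in product([False, True], repeat=len(S)):
        target = tuple(f1[x] if in_T else f2[x] for x, in_T in zip(S, choice))
        if target not in patterns:
            return {x for x, in_T in zip(S, choice) if in_T}
    return None

# toy class over labels {0, 1, 2} on two points: the mixed pattern (1, 0) is missing
H = [{0: 0, 1: 0}, {0: 1, 1: 2}, {0: 0, 1: 2}]
print(natarajan_witness(H, S=[0, 1], f1={0: 1, 1: 2}, f2={0: 0, 1: 0}))  # {0}
```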
Definition 14 (Computable graph dimension).
A $k$-witness of graph dimension for a hypothesis class $\mathcal{H}$ is a function that takes as input a set $S \subseteq \mathcal{X}$ of size $k+1$ and a labeling $f$ of $S$, and outputs a subset $T \subseteq S$ such that for every $h \in \mathcal{H}$ there exists $x \in S$ such that $h(x) \neq f(x)$ if $x \in T$ and $h(x) = f(x)$ if $x \in S \setminus T$. The computable graph dimension of $\mathcal{H}$, denoted $d^c_G(\mathcal{H})$, is the smallest integer $k$ such that there exists a computable $k$-witness of graph dimension for $\mathcal{H}$.
In the binary setting, the VC, Natarajan, and graph dimensions are all identical. This is also the case for the computable counterparts. In particular, this implies that and can be arbitrarily far apart, with and . The same separation holds for the graph dimension and computable graph dimension. It is also straightforward to check that . Now, as in the non-computable versions of the Natarajan and graph dimensions, we have an arbitrary gap between their computable counterparts (see Appendix A for the proof):
Proposition 15.
For any there exist and with , and .
3.1 The Finiteness of the Computable Natarajan Dimension as a Necessary Condition
In this section, we show that the finiteness of the computable Natarajan dimension is a necessary condition for CPAC learnability in the agnostic setting, even in the case .
Theorem 16.
Let $\mathcal{H}$ be improperly CPAC learnable. Then $d^c_N(\mathcal{H}) < \infty$, i.e., $\mathcal{H}$ admits a computable $k$-witness of Natarajan dimension for some $k \in \mathbb{N}$.
We will first show a multiclass analogue of the computable No-Free-Lunch theorem for binary classification (Lemma 19 in (Agarwal et al., 2020)), adapted with the Natarajan dimension in mind:
Lemma 17.
For any computable learner , for any , any instance space of size at least , any subset of size at least , and any two functions satisfying for all , we can computably find such that
-
1.
for all ,
-
2.
,
-
3.
With probability at least over , ,
where is the uniform distribution on .
Proof sketch.
We will first prove the existence of a pair satisfying the desired requirements. To this end, for , denote by the labelling of satisfying if and if . For each , define the following distribution on :
Note that there are possible such functions from to . Let denote the set of all such function-distribution pairs and note that for all .
Note that, by a simple application of Markov’s inequality, it is sufficient to show that there exists satisfying
$$\mathbb{E}_{S \sim D^m}\left[L_D(\mathcal{A}(S))\right] \geq \frac{1}{4}. \qquad (1)$$
The proof that the third requirement is satisfied is nearly identical to that of the No-Free-Lunch theorem (see, e.g., Shalev-Shwartz and Ben-David (2014), Theorem 5.1), and is omitted for brevity.
Now, to computably find a pair satisfying , it suffices to note that is computable (and outputs computably evaluable functions), that the set is finite, and that for each pair , we can use to compute the expected risk. Indeed, denote by the possible sequences of length from , and for some and , let be the sequence labeled by . Each distribution induces the equally likely sequences , implying
Since such a pair must exist, we will eventually stop for some for which Equation 1 holds. ∎
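The following sketch illustrates the computable search described in the proof: it iterates over the finitely many labelings that mix the two given functions, computes the learner's expected risk exactly by averaging over all equally likely training sequences, and stops at the first pair exceeding a fixed threshold. The constants (sample size, threshold 1/4) follow the classical No-Free-Lunch argument and are assumptions made for illustration, as is the toy "memorizing" learner.

```python
from itertools import product

def find_hard_pair(A, S, f1, f2, m, threshold=0.25):
    """Search over the finitely many labelings that mix f1 and f2 on S for one whose
    uniform distribution forces learner A (trained on samples of size m) to have
    expected risk above the threshold. Threshold and sizes are illustrative only."""
    for choice in product([0, 1], repeat=len(S)):
        f = {x: (f1[x] if pick else f2[x]) for x, pick in zip(S, choice)}
        # exact expected risk under the uniform distribution on {(x, f(x)) : x in S}:
        # average over all |S|^m equally likely training sequences
        total = 0.0
        for xs in product(S, repeat=m):
            h = A([(x, f[x]) for x in xs])
            total += sum(h(x) != f[x] for x in S) / len(S)
        if total / len(S) ** m > threshold:
            return f
    return None

# illustrative learner: memorize the sample, default to label 0 elsewhere
def memorizer(sample):
    table = dict(sample)
    return lambda x: table.get(x, 0)

S = [0, 1, 2, 3]
f1 = {x: 1 for x in S}
f2 = {x: 2 for x in S}
print(find_hard_pair(memorizer, S, f1, f2, m=2))
```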
Proof of Theorem 16.
Let be a computable (potentially improper) learner for with sample complexity function . Let . We will show that can be used to build a computable -witness of Natarajan dimension for .
To this end, suppose we are given an arbitrary set and labelings satisfying for all . By Lemma 17, we can computably find such that (i) for all , (ii) , and (iii) , where is the uniform distribution on . This implies that the labeling of induced by is not achievable by any : otherwise , and by the PAC guarantee , we would get , a contradiction. Now, let be the index set identifying the instances in labelled by in . Then clearly is the set that we seek: for every there exists such that , where if and if , as required. ∎
3.2 The Finiteness of the Computable Natarajan Dimension as a Sufficient Condition
We now state and show the main result of the section: finite computable Natarajan dimension is sufficient for CPAC learnability whenever $\mathcal{Y}$ is finite.
Theorem 18.
Let $|\mathcal{Y}| < \infty$ and $d^c_N(\mathcal{H}) < \infty$. Then $\mathcal{H}$ is (improperly) CPAC learnable.
In the binary classification setting, Delle Rose et al. (2023) showed that finite effective VC dimension is sufficient for CPAC learnability. We generalize this approach to the multiclass setting:
Proof of Theorem 18.
Let be such that . Let be a -witness of Natarajan dimension. We will embed into satisfying (i) and (ii) has a computable ERM. By Fact 11, this is sufficient to guarantee multiclass CPAC learnability. Before showing that properties (i) and (ii) hold, we introduce the following notation. Given and a subset , we denote by the function
(2)
When are fixed and clear from context, we will shorten the notation to for readability. We first show the following lemma, which will be invoked in Section 4 as well, when we give a more general necessary condition on multiclass CPAC learnability (Theorem 25).
Lemma 19.
For every with and , there exists a class with
-
•
-
•
There exists a computable function , that takes as input a finite domain subset and outputs a set of labelings .
Proof.
Constructing . Consider the class of “good” functions satisfying, for all
-
1.
, where .
-
2.
For any , and any labelings , let , where . Then .
Namely, “good” functions defined on are those that are eventually always 0 and do not encode the output of the witness function for any labelings.
Now, let . We will show that indeed satisfies the conditions above.
. Let and be arbitrary labelings that differ in each component, i.e., for all . WLOG, suppose and that , in particular . Let be the output of the -witness on without the -th entries, i.e., . Let , and, by a slight abuse of notation, let and , defined as per Equation 2 and where we omit in the subscript for readability. We claim that there exists no satisfying . First note that no can satisfy this, because is defined as the output of the -witness . Then must be in . We distinguish two cases:
-
1.
: then ,
-
2.
: then , which by definition implies .
Existence of the computable function . Let and be arbitrary. We will argue that (i) we can computably obtain all labellings and (ii) . We first argue that we can computably obtain all labelings in . Let . Note that in order to find all labellings in it suffices to consider functions with , and that by the finiteness of , there are a finite number of “good” functions in satisfying this. These functions can now be computably identified by first listing all patterns and then using the computable witness function on all inputs
(3)
to exclude those patterns that are not in , which is possible by the finiteness of . By definition of the remaining patterns match , thus showing that there is indeed an algorithm that for any outputs . We now argue that . Since it is sufficient to argue that for any labelling we have . Let be arbitrary. Now consider its “truncated” version , where if , and otherwise, . We will now show that . Suppose not. Then there must exist and such that for , , but that means that also satisfies this, a contradiction by the definition of . Thus and since , . Therefore , concluding our proof. ∎
It now remains to show that the ERM is computable to conclude the proof of Theorem 18. We note that the constructed class is recursively enumerable and thus we can iterate through all its elements. Furthermore, we have seen that for any sample we can computably find all behaviours, and thus have a stopping criterion, which also serves as an implementation of the ERM. ∎
We also note that the above proof goes through in case $\mathcal{Y}$ is infinite, but the range of possible labels for each initial segment of the domain is computably bounded:
Observation 20.
Let and . Furthermore, let be a hypothesis class with . If there is a computable function , such that for every , , then is agnostically CPAC learnable.
We note that this condition would capture many infinite-label settings, such as question-answering, with the requirement that the length of the answer be bounded as a function of the length of the question. The proof of this observation can be obtained by replacing Equation 3 by in the construction of in the proof of Lemma 19.
4 A General Method for CPAC Learnability in the Multiclass Setting with $|\mathcal{Y}| < \infty$
The Natarajan dimension is one of many ways to generalize the VC dimension to arbitrary label spaces: the graph and DS dimensions also generalize the VC dimension, the latter characterizing learnability even in the case of infinitely many labels (Daniely and Shalev-Shwartz, 2014; Brukhim et al., 2022). Ben-David et al. (1992) generalized the notion of shattering for finite label spaces by encoding the label space into the set $\{0, 1, *\}$, which subsumes Natarajan and graph shattering.
In this section, we formalize a new, more general notion of computable dimension, based on the dimensions presented by Ben-David et al. (1992). We show that the finiteness of these computable dimensions characterizes CPAC learnability for finite label spaces, notably generalizing the results we presented in Section 3. This general view also allows us to extract a more abstract and elegant relationship between computable learnability and computable dimensions.
Let $\Psi$ be a family of functions from $\mathcal{Y}$ to $\{0, 1, *\}$. Given $\bar{\psi} = (\psi_1, \dots, \psi_d) \in \Psi^d$ and a tuple of labels $\bar{y} = (y_1, \dots, y_d) \in \mathcal{Y}^d$, denote by $\bar{\psi}(\bar{y})$ the tuple $(\psi_1(y_1), \dots, \psi_d(y_d))$. Given a set of label sequences $L \subseteq \mathcal{Y}^d$, we overload this notation as follows: $\bar{\psi}(L) = \{\bar{\psi}(\bar{y}) : \bar{y} \in L\}$. We are now ready to define the $\Psi$-dimension.
Definition 21 ($\Psi$-shattering and $\Psi$-dimension (Ben-David et al., 1992)).
A set $S = \{x_1, \dots, x_d\} \subseteq \mathcal{X}$ is $\Psi$-shattered by $\mathcal{H}$ if there exists $\bar{\psi} \in \Psi^d$ such that $\{0,1\}^d \subseteq \bar{\psi}(\mathcal{H}|_S)$. The $\Psi$-dimension of $\mathcal{H}$, denoted $d_\Psi(\mathcal{H})$, is the size of the largest set that is $\Psi$-shattered by $\mathcal{H}$. If no largest such set exists, then $d_\Psi(\mathcal{H}) = \infty$.
Here the condition $\{0,1\}^d \subseteq \bar{\psi}(\mathcal{H}|_S)$ essentially means that any 0-1 encoding of the labels is captured by applying $\bar{\psi}$ to some $h$ in the projection of $\mathcal{H}$ onto $S$.
Examples.
The graph dimension corresponds to the $\Psi_G$-dimension, where $\Psi_G = \{\psi_y : y \in \mathcal{Y}\}$, with
$$\psi_y(y') = \begin{cases} 1 & \text{if } y' = y, \\ 0 & \text{otherwise,} \end{cases}$$
and the Natarajan dimension to the set $\Psi_N = \{\psi_{y_1, y_2} : y_1, y_2 \in \mathcal{Y},\ y_1 \neq y_2\}$, where
$$\psi_{y_1, y_2}(y) = \begin{cases} 1 & \text{if } y = y_1, \\ 0 & \text{if } y = y_2, \\ * & \text{otherwise.} \end{cases}$$
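To make the Ψ-dimension concrete, the sketch below spells out graph- and Natarajan-style embeddings and checks Ψ-shattering of a set by a finite class by searching over all tuples of embeddings. The function names and the toy class are ours, for illustration only.

```python
from itertools import product

STAR = "*"

def psi_graph(y0):
    """Graph-dimension embedding: 1 on the distinguished label y0, 0 elsewhere."""
    return lambda y: 1 if y == y0 else 0

def psi_natarajan(y1, y2):
    """Natarajan-dimension embedding: 1 on y1, 0 on y2, * elsewhere."""
    return lambda y: 1 if y == y1 else (0 if y == y2 else STAR)

def psi_shattered(H, S, Psi):
    """S is Psi-shattered by H if some tuple of embeddings maps the projection of H
    onto S to a superset of {0,1}^|S|."""
    patterns = [tuple(h[x] for x in S) for h in H]
    for psis in product(Psi, repeat=len(S)):
        images = {tuple(psi(y) for psi, y in zip(psis, p)) for p in patterns}
        if all(b in images for b in product([0, 1], repeat=len(S))):
            return True
    return False

# toy class over labels {0, 1, 2} on two points
labels = [0, 1, 2]
H = [{0: 1, 1: 2}, {0: 1, 1: 0}, {0: 2, 1: 2}, {0: 2, 1: 0}]
Psi_N = [psi_natarajan(a, b) for a in labels for b in labels if a != b]
Psi_G = [psi_graph(a) for a in labels]
print(psi_shattered(H, [0, 1], Psi_N))  # True: Natarajan-shattered
print(psi_shattered(H, [0, 1], Psi_G))  # True: graph-shattered
```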
Definition 22 (Distinguisher (Ben-David et al., 1992)).
A pair of distinct labels $y_1, y_2 \in \mathcal{Y}$ is said to be $\Psi$-distinguishable if there exists $\psi \in \Psi$ with $\psi(y_1) \neq \psi(y_2)$ and neither $\psi(y_1)$ nor $\psi(y_2)$ is equal to $*$. The family $\Psi$ is said to be a distinguisher if all pairs of distinct labels are $\Psi$-distinguishable.
The notion of being a distinguisher was shown by Ben-David et al. (1992) to be both necessary and sufficient for the $\Psi$-dimension to characterize learnability in the qualitative sense, i.e., through its finiteness. In essence, distinguishers provide a meta-characterization of learnability:
Theorem 23 (Theorem 14 in (Ben-David et al., 1992)).
A family $\Psi$ of functions from $\mathcal{Y}$ to $\{0, 1, *\}$ provides a characterization of proper learnability if and only if $\Psi$ is a distinguisher.
Ben-David et al. (1992) indeed implicitly define learnability as proper learnability. Note, however, that the argument showing that being a distinguisher is a necessary condition for characterizing learnability also goes through for improper learnability (see Lemma 13 therein).
Computable $\Psi$-Dimensions.
We can now straightforwardly define $d^c_\Psi(\mathcal{H})$ as the smallest integer $k$ for which there exists a computable proof of the statement “$S$ cannot be $\Psi$-shattered” for any set $S$ of size larger than $k$:
Definition 24 (Computable $\Psi$-dimension).
Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$. A $k$-witness of $\Psi$-dimension is a function $g$ such that for any sequence $S = (x_1, \dots, x_{k+1}) \in \mathcal{X}^{k+1}$ and any $\bar{\psi} \in \Psi^{k+1}$, we have that $g(S, \bar{\psi}) \in \{0,1\}^{k+1} \setminus \bar{\psi}(\mathcal{H}|_S)$. The computable $\Psi$-dimension of $\mathcal{H}$, denoted $d^c_\Psi(\mathcal{H})$, is the smallest integer $k$ for which there exists a computable $k$-witness of $\Psi$-dimension. If no such $k$ exists, then $d^c_\Psi(\mathcal{H}) = \infty$.
Here, one can view the witness function as returning a 0-1 encoding that no hypothesis in $\mathcal{H}$ projected onto $S$ can achieve when $\bar{\psi}$ is applied to it.
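For a finite class, a witness in the sense of Definition 24 can again be found by exhaustive search, as the following sketch (with an illustrative toy class and graph-style embeddings) shows: given the points and a tuple of embeddings, it returns a 0-1 pattern that no hypothesis achieves.

```python
from itertools import product

def psi_witness(H, S, psis):
    """Return a 0-1 pattern of length |S| that no hypothesis in H achieves on S after
    applying the embeddings psis coordinate-wise, or None if S is Psi-shattered."""
    achieved = {tuple(psi(h[x]) for psi, x in zip(psis, S)) for h in H}
    for b in product([0, 1], repeat=len(S)):
        if b not in achieved:
            return b
    return None

# toy example: graph-style embeddings psi_a(y) = 1[y == a] over labels {0, 1, 2}
H = [{0: 1, 1: 2}, {0: 1, 1: 0}, {0: 2, 1: 2}]
psis = [lambda y: int(y == 1), lambda y: int(y == 2)]
print(psi_witness(H, S=[0, 1], psis=psis))  # (0, 0): no hypothesis avoids both labels
```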
4.1 Necessary Conditions for Finite $d^c_\Psi(\mathcal{H})$
In this section, we show that, for a family $\Psi$ embedding the label space into $\{0, 1, *\}$, the finiteness of the computable $\Psi$-dimension is a necessary condition for CPAC learnability of a class $\mathcal{H}$. We start by stating a lower bound on the computable Natarajan dimension in terms of the computable $\Psi$-dimension and the size of the label space. The theorem is a computable version of Theorem 7 in Ben-David et al. (1992).
Theorem 25.
Let be a family of functions from to . For every RER hypothesis class over a finite label space , we have that
Corollary 26.
Let $\Psi$ be a distinguisher, and suppose $\mathcal{H}$ is improperly CPAC learnable. Then $d^c_\Psi(\mathcal{H}) < \infty$, i.e., $\mathcal{H}$ admits a computable $k$-witness of $\Psi$-dimension for some $k \in \mathbb{N}$.
The proof of Theorem 25 is based on our lemma below, which is a computable version of the generalization of the Sauer Lemma to finite label multiclass settings from (Natarajan, 1989).
Lemma 27.
Let and suppose satisfies . Then there is a computable function that takes as input a set and outputs a set of labellings with and
Proof.
From Lemma 19 we know that there is an embedding such that and such that there exists a computable function , such that for any input it outputs . We can now invoke a classical result from (Natarajan, 1989), which shows that the number of behaviours of any class with Natarajan dimension with finite label space on any set is upper bounded by . Lastly, we note that since we have and , thus proving the bound of the lemma. ∎
We now proceed with the proof of Theorem 25.
Proof of Theorem 25.
Let . Now let be some arbitrary number satisfying the inequality . That is, is an arbitrary number that exceeds the bound for in the theorem. We will now prove that bound for , by showing that for such there exists a computable -witness function for the -dimension. Let and be arbitrary. Now from Lemma 27 we know there is a computable function that for input outputs a set with . This also implies that . In particular, this implies that there is at least one -labelling . Furthermore, we can computably identify by checking which labelling is missing from the computably generated set . Furthermore, from , it also follows that . Thus , i.e. is a witness for the set with distinguisher . Thus we have shown that is at most , implying the bound of the theorem. ∎
4.2 Sufficient Conditions for Finite $d^c_\Psi(\mathcal{H})$
In this section, we show that, for a distinguisher $\Psi$, the finiteness of the computable $\Psi$-dimension, $d^c_\Psi(\mathcal{H})$, provides a sufficient condition for CPAC learnability.
Theorem 28.
Let $|\mathcal{Y}| < \infty$. Let $\Psi$ be a family of functions from $\mathcal{Y}$ to $\{0, 1, *\}$. Furthermore, suppose that finite $\Psi$-dimension implies uniform convergence under the 0-1 loss. Then $d^c_\Psi(\mathcal{H}) < \infty$ implies that $\mathcal{H}$ is CPAC learnable.
The proof follows from suitable generalizations of the arguments presented in Section 3.2:
Proof.
Let be such that . Let be a -witness of -dimension. Our goal is to embed into satisfying the following: (i) and (ii) has a computable ERM. By the conditions of the theorem statement, this is sufficient to guarantee multiclass CPAC learnability.
Constructing . Consider the class of “good” functions satisfying, for all
-
1.
, where and if no such exists.
-
2.
For any , and any , , where .
Now, let . We will show that indeed satisfies the conditions above.
. Let and . WLOG suppose and that and are both non empty. Let , be the minimal in their respective set, and WLOG let , in particular . Consider , the output of the -witness on and , but disregarding the -th entry. Let . We claim that no satisfies . First note that, by definition, no can satisfy this. So it remains to consider some “good” function . We distinguish two cases:
-
1.
: then , but since is “good”, by definition,
-
2.
: then by construction , thus , as required.
has a computable ERM. Let be arbitrary. Let , and note that it suffices to consider functions with , and that by the finiteness of , there are a finite number of “good” functions in satisfying this, and that these functions can be identified computably, by listing all patterns and using the computable witness function to exclude functions from , with a similar argument as in the proof of Lemma 19. If we can show that there always exists a function in that is an empirical risk minimizer, then we are done. Let , and consider its “truncated” version , where if , and otherwise, . , so it remains to show . Suppose not. Then there must exist and such that , but that means that the non-truncated also satisfies this, a contradiction by the definition of .
∎
As a corollary of Theorem 23, we get:
Corollary 29.
Let $|\mathcal{Y}| < \infty$. Let $\Psi$ be a family of functions from $\mathcal{Y}$ to $\{0, 1, *\}$. Furthermore, suppose that $\Psi$ is a distinguisher. Then $d^c_\Psi(\mathcal{H}) < \infty$ implies that $\mathcal{H}$ is CPAC learnable.
Combining these with Corollary 26, we obtain the following result:
Theorem 30.
Let $|\mathcal{Y}| < \infty$ and suppose $\Psi$ is a distinguisher. Then $d^c_\Psi(\mathcal{H}) < \infty$ if and only if $\mathcal{H}$ is CPAC learnable.
Remark 31.
Since the Natarajan and graph dimensions are both expressible as distinguishers (with the computable versions matching the corresponding computable $\Psi$-dimension), Theorem 30 holds for $d^c_N$ and $d^c_G$. Similarly, the result can be obtained for other families of distinguishers, such as the Pollard pseudo-dimension (Pollard, 1990; Haussler, 1992).
Finally, we note that, while the arguments of Section 3.1 (which give a necessary condition on CPAC learnability via the finiteness of the computable Natarajan dimension) hold for infinite label spaces, neither the arguments in Theorem 25 nor those in Theorem 28 can be extended to infinite $\mathcal{Y}$: in the former, a quantity depending on $|\mathcal{Y}|$ appears in the denominator of the lower bound; the latter relies on the finiteness of $\mathcal{Y}$ to implement a computable ERM.
4.3 A Meta-Characterization for CPAC Learnability
In the previous section, we showed that distinguishers give rise to computable dimensions that qualitatively characterize CPAC learnability for finite in the agnostic setting. But what happens if a family of functions fails to be a distinguisher?
Proposition 32.
Suppose $\Psi$ fails to be a distinguisher. Then there exists $\mathcal{H}$ with $d^c_\Psi(\mathcal{H}) < \infty$ such that $\mathcal{H}$ is not CPAC learnable.
Proof.
Suppose it is the case that $\Psi$ fails to be a distinguisher, and say it cannot distinguish the labels $y_1 \neq y_2$. Then, letting $\mathcal{H}$ be the hypothesis class of all functions from $\mathcal{X}$ to $\{y_1, y_2\}$, we note that $\mathcal{H}$ is not PAC learnable (it has infinite Natarajan dimension), and thus not CPAC learnable. Now, to see that $d^c_\Psi(\mathcal{H}) < \infty$, note that for an arbitrary point $x$ and any $\psi \in \Psi$, there is $b \in \{0, 1\}$ such that $b \notin \{\psi(y_1), \psi(y_2)\}$, and $b$ can be identified by computing $\psi(y_1)$ and $\psi(y_2)$, regardless of $x$. Thus, for any given $S$ and $\bar{\psi} = (\psi_1, \dots, \psi_{k+1})$, the witness function returns the tuple of corresponding unachievable bits, as required. ∎
Combining Proposition 32 with Theorem 30, we obtain the main result of this paper: a meta-characterization for CPAC learnability in the agnostic setting, in the sense that we precisely characterize which families of functions from $\mathcal{Y}$ to $\{0, 1, *\}$ give rise to computable dimensions characterizing multiclass CPAC learnability.
Theorem 33.
Let $\Psi$ be a family of functions from $\mathcal{Y}$ to $\{0, 1, *\}$ for finite $\mathcal{Y}$. Then $d^c_\Psi$ qualitatively characterizes CPAC learnability if and only if $\Psi$ is a distinguisher.
4.4 The DS Dimension
The bounds derived in Section 4 hold for families of functions from $\mathcal{Y}$ to $\{0, 1, *\}$ that satisfy certain properties, e.g., are distinguishers. This generalization of the Natarajan and graph dimensions predates the work of Daniely and Shalev-Shwartz (2014), which defined the DS dimension, a characterization of learnability for multiclass classification for arbitrary label spaces $\mathcal{Y}$. For infinite label spaces, Brukhim et al. (2022) exhibit an arbitrary gap between the Natarajan and DS dimensions. But even in the case of a finite label set, can we express the DS dimension as a family $\Psi$, in the sense that $d_\Psi(\mathcal{H}) = d_{DS}(\mathcal{H})$ for all $\mathcal{H}$? Unfortunately, the result below gives a negative answer to this question, which may be of independent interest.
Lemma 34.
The DS dimension cannot be expressed as a family of functions from $\mathcal{Y}$ to $\{0, 1, *\}$.
Proof.
We will show that any family with and induces with .
Let be such that and . WLOG, suppose there exists such that . Since , it must be that for all , we cannot have both and . Let witness the -shattering of on , i.e.,
(4)
We distinguish three cases:
-
1.
: impossible by Equation 4,
-
2.
and : WLOG let . Then, for Equation 4 to hold, we need as well as , all of which must be in . From this, it is clear that satisfies , despite . Thus we get an impossibility. Note that the cases (a) and , (b) and and (c) and ) follow an identical reasoning.
-
3.
: WLOG let . Then, for Equation 4 to hold, we need as well as , both in . But this implies that the class satisfies ,
as required. ∎
5 Conclusion
We initiated the study of multiclass CPAC learnability, focusing on finite label spaces, and have established a meta-characterization through the finiteness of the computable dimension of a vast family of functions: so-called distinguishers. Characterizations through the computable Natarajan and the computable graph dimensions appear as special cases of this result. Moreover, we showed that this result cannot readily be extended to the DS dimension, thus suggesting that characterizing CPAC learnability for infinite label spaces will potentially require significantly different techniques.
Acknowledgements
Pascale Gourdeau has been supported by a Vector Postdoctoral Fellowship and an NSERC Postdoctoral Fellowship. Tosca Lechner has been supported by a Vector Postdoctoral Fellowship. Ruth Urner is also an Affiliate Faculty Member at Toronto’s Vector Institute, and acknowledges funding through an NSERC Discovery grant.
References
- Ackerman et al. (2022) Nathanael Ackerman, Julian Asilis, Jieqi Di, Cameron Freer, and Jean-Baptiste Tristan. Computable PAC learning of continuous features. In Proceedings of the 37th Annual ACM/IEEE Symposium on Logic in Computer Science, pages 1–12, 2022.
- Agarwal et al. (2020) Sushant Agarwal, Nivasini Ananthakrishnan, Shai Ben-David, Tosca Lechner, and Ruth Urner. On learnability with computable learners. In Algorithmic Learning Theory, pages 48–60. PMLR, 2020.
- Asilis et al. (2024) Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, and Shang-Hua Teng. Regularization and optimal multiclass learning. In The Thirty Seventh Annual Conference on Learning Theory, pages 260–310. PMLR, 2024.
- Ben-David et al. (1992) Shai Ben-David, Nicolo Cesa-Bianchi, and Philip M Long. Characterizations of learnability for classes of {0, …, n}-valued functions. In Proceedings of the fifth annual workshop on Computational learning theory, pages 333–340, 1992.
- Ben-David et al. (2019) Shai Ben-David, Pavel Hrubes, Shay Moran, Amir Shpilka, and Amir Yehudayoff. Learnability can be undecidable. Nat. Mach. Intell., 1(1):44–48, 2019.
- Blumer et al. (1989) Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
- Brukhim et al. (2021) Nataly Brukhim, Elad Hazan, Shay Moran, Indraneel Mukherjee, and Robert E Schapire. Multiclass boosting and the cost of weak learning. Advances in Neural Information Processing Systems, 34:3057–3067, 2021.
- Brukhim et al. (2022) Nataly Brukhim, Daniel Carmon, Irit Dinur, Shay Moran, and Amir Yehudayoff. A characterization of multiclass learnability. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 943–955. IEEE, 2022.
- Brukhim et al. (2023) Nataly Brukhim, Steve Hanneke, and Shay Moran. Improper multiclass boosting. In The Thirty Sixth Annual Conference on Learning Theory, pages 5433–5452. PMLR, 2023.
- Brukhim et al. (2024) Nataly Brukhim, Amit Daniely, Yishay Mansour, and Shay Moran. Multiclass boosting: simple and intuitive weak learning criteria. Advances in Neural Information Processing Systems, 36, 2024.
- Daniely and Shalev-Shwartz (2014) Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Conference on Learning Theory, pages 287–316. PMLR, 2014.
- Daniely et al. (2011) Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. In Proceedings of the 24th Annual Conference on Learning Theory, pages 207–232. JMLR Workshop and Conference Proceedings, 2011.
- Daniely et al. (2012) Amit Daniely, Sivan Sabato, and Shai Shalev-Shwartz. Multiclass learning approaches: A theoretical comparison with implications. Advances in Neural Information Processing Systems, 25, 2012.
- Daniely et al. (2015a) Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. J. Mach. Learn. Res., 16(1):2377–2404, 2015a.
- Daniely et al. (2015b) Amit Daniely, Michael Schapira, and Gal Shahaf. Inapproximability of truthful mechanisms via generalizations of the VC dimension. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing, pages 401–408, 2015b.
- Delle Rose et al. (2023) Valentino Delle Rose, Alexander Kozachinskiy, Cristóbal Rojas, and Tomasz Steifer. Find a witness or shatter: the landscape of computable PAC learning. In The Thirty Sixth Annual Conference on Learning Theory, pages 511–524. PMLR, 2023.
- Delle Rose et al. (2024) Valentino Delle Rose, Alexander Kozachinskiy, and Tomasz Steifer. Effective littlestone dimension. arXiv preprint arXiv:2411.15109, 2024.
- Ehrenfeucht et al. (1989) Andrzej Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247–261, 1989.
- Gourdeau et al. (2024) Pascale Gourdeau, Tosca Lechner, and Ruth Urner. On the computability of robust PAC learning. In The Thirty Seventh Annual Conference on Learning Theory, pages 2092–2121. PMLR, 2024.
- Hanneke et al. (2023) Steve Hanneke, Shay Moran, and Qian Zhang. Universal rates for multiclass learning. In The Thirty Sixth Annual Conference on Learning Theory, pages 5615–5681. PMLR, 2023.
- Hasrati and Ben-David (2023) Niki Hasrati and Shai Ben-David. On computable online learning. In International Conference on Algorithmic Learning Theory, pages 707–725. PMLR, 2023.
- Haussler (1992) David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.
- Haussler and Long (1995) David Haussler and Philip M Long. A generalization of sauer’s lemma. Journal of Combinatorial Theory, Series A, 71(2):219–240, 1995.
- Kalavasis et al. (2022) Alkis Kalavasis, Grigoris Velegkas, and Amin Karbasi. Multiclass learnability beyond the PAC framework: Universal rates and partial concept classes. Advances in Neural Information Processing Systems, 35:20809–20822, 2022.
- Lechner and Ben-David (2024) Tosca Lechner and Shai Ben-David. Inherent limitations of dimensions for characterizing learnability of distribution classes. In Shipra Agrawal and Aaron Roth, editors, Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 3353–3374. PMLR, 30 Jun–03 Jul 2024. URL https://proceedings.mlr.press/v247/lechner24a.html.
- Natarajan (1989) Balas K Natarajan. On learning sets and functions. Machine Learning, 4:67–97, 1989.
- Natarajan and Tadepalli (1988) Balas K Natarajan and Prasad Tadepalli. Two new frameworks for learning. In Machine Learning Proceedings 1988, pages 402–415. Elsevier, 1988.
- Pabbaraju (2024) Chirag Pabbaraju. Multiclass learnability does not imply sample compression. In International Conference on Algorithmic Learning Theory, pages 930–944. PMLR, 2024.
- Pollard (1990) David Pollard. Empirical processes: theory and applications. Ims, 1990.
- Rubinstein et al. (2006) Benjamin Rubinstein, Peter Bartlett, and J Rubinstein. Shifting, one-inclusion mistake bounds and tight multiclass expected risk bounds. Advances in Neural Information Processing Systems, 19, 2006.
- Shalev-Shwartz and Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press, 2014.
- Sterkenburg (2022) Tom F Sterkenburg. On characterizations of learnability with computable learners. In Conference on Learning Theory, pages 3365–3379. PMLR, 2022.
- Valiant (1984) Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
- Vapnik and Chervonenkis (1971) Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Theory of Probability and Its Applications. 1971.
Appendix A Proof of Proposition 15
Proof of Proposition 15.
Let be finite or countable. Consider the hypothesis class in Daniely et al. (2015a) showing the same separation between the Natarajan and graph dimensions: let be a subset of the powerset of consisting only of finite or cofinite subsets of . Let . For any , let
Note that any has a finite representation (a special character for whether we are enumerating the set or its complement, as well as the finite set or ). Thus checking whether can be done computably for any , implying each is computably evaluable. Since , it follows that as well. We will now show that , namely we exhibit a computable -witness of Natarajan dimension. To this end, let and with and be arbitrary. First check whether contains more than one non- label, in which case we are done, as we can output the index set corresponding to labelling for some . Otherwise, WLOG let and for some . Check whether . If yes, output (corresponding to labelling ), and if not output (corresponding to labelling ). ∎