
On the Computability of Multiclass PAC Learning

Pascale Gourdeau Vector Institute & University of Toronto, Toronto, ON, Canada; pascale.gourdeau@vectorinstitute.ai Tosca Lechner Vector Institute & University of Toronto, Toronto, ON, Canada; tosca.lechner@vectorinstitute.ai and Ruth Urner Lassonde School of Engineering, EECS Department, York University, Toronto, ON, Canada; ruth@eecs.yorku.ca
Abstract

We study the problem of computable multiclass learnability within the Probably Approximately Correct (PAC) learning framework of Valiant (1984). In the recently introduced computable PAC (CPAC) learning framework of Agarwal et al. (2020), both learners and the functions they output are required to be computable. We focus on the case of finite label space and start by proposing a computable version of the Natarajan dimension and showing that it characterizes CPAC learnability in this setting. We further generalize this result by establishing a meta-characterization of CPAC learnability for a certain family of dimensions: computable distinguishers. Distinguishers were defined by Ben-David et al. (1992) as a certain family of embeddings of the label space, with each embedding giving rise to a dimension. It was shown that the finiteness of each such dimension characterizes multiclass PAC learnability for finite label space in the non-computable setting. We show that the corresponding computable dimensions for distinguishers characterize CPAC learning. We conclude our analysis by proving that the DS dimension, which characterizes PAC learnability for infinite label space, cannot be expressed as a distinguisher (even in the case of finite label space).

1 Introduction

One of the fundamental lines of inquiry in learning theory is to determine when learning is possible (and how to learn in such cases). Taking aside computational efficiency, which requires learners to run in polynomial time as a function of the learning parameters, the vast majority of the learning theory literature has introduced characterizations of learnability in a purely information-theoretic sense, qualifying or quantifying the amount of data needed to guarantee (or refute) generalization, without computational considerations. These works consider learners as functions rather than algorithms. Recently, Agarwal et al. (2020) introduced the Computable Probably Approximately Correct (CPAC) learning framework, adding computability requirements to the pioneering Probably Approximately Correct (PAC) framework of Valiant (1984), introduced for binary classification. In the CPAC setting, both learners and the functions they output are required to be computable functions. Perhaps surprisingly, adding such requirements substantially changes the learnability landscape. For example, in a departure from the standard PAC setting for binary classification, where a hypothesis class is learnable in the agnostic case whenever it is learnable in the realizable case (and in particular, always through empirical risk minimization (ERM)), there are binary classes that are CPAC learnable in the realizable setting, but not in the agnostic one. For the latter, it is in fact the effective VC dimension that characterizes CPAC learnability (Sterkenburg, 2022; Delle Rose et al., 2023). While the works of Agarwal et al. (2020); Sterkenburg (2022); Delle Rose et al. (2023) together provide a practically complete picture of CPAC learnability for the binary case, delineating the CPAC learnability landscape in the multiclass setting, where the label space is larger than two and can in general be infinite, has not yet been attempted.

Multiclass PAC learning has been found to exhibit behaviours that depart from the binary case even without computability requirements. For example, not all ERMs are equally successful from a sample-complexity standpoint, and, in particular, for infinite label space, the ERM principle can fail (Daniely et al., 2011)! For finite label space, the finiteness of both the Natarajan and graph dimensions characterizes learnability (Natarajan and Tadepalli, 1988; Natarajan, 1989). As shown by Ben-David et al. (1992), it is actually possible to provide a meta-characterization through distinguishers, which are families of functions that map the label space to the set $\{0,1,*\}$. In the infinite case, the finiteness of both the Natarajan and graph dimensions fails to provide a characterization, and it is the finiteness of the DS dimension that is equivalent to learnability (Daniely and Shalev-Shwartz, 2014; Brukhim et al., 2022).

Ultimately, both the CPAC and multiclass settings exhibit a significant contrast with PAC learning for binary classification. In this work, we investigate these two settings in conjunction, and thus initiate the study of computable multiclass PAC learnability. We focus on the finite label space, and provide a meta-characterization of learnability in the agnostic case: the finiteness of the computable dimensions of distinguishers, which we introduce in this work, is both necessary and sufficient for CPAC learnability here. We also explicitly derive the result in the case of the computable Natarajan dimension, also defined in this work, as the lower bound for distinguishers utilizes a lower bound in the computable Natarajan dimension. The significance of the meta-characterization is two-fold: first, it establishes that computable versions of other known dimensions, namely those that can be defined through a suitable embedding of the label space, also provide a characterization of multi-class CPAC learnability (this applies, for example, to the computable graph dimension); further, it allows us to extract high-level concepts and proof mechanics regarding computable dimensions, which may be of independent interest. We conclude our investigations by proving that the DS dimension cannot be expressed through the framework of distinguishers, opening the door to potentially more complex phenomena in the infinite label set case.

1.1 Related Work

Computable Learnability.

Following the work of Ben-David et al. (2019), who showed that the learnability of certain basic learning problems is undecidable within ZFC, Agarwal et al. (2020) formally integrated the notion of computability within the standard PAC learning framework of Valiant (1984). This novel set-up, called computable PAC (CPAC) learning, requires that both learners and the hypotheses they output be computable. With follow-up works by Sterkenburg (2022) and Delle Rose et al. (2023), CPAC learnability in the binary classification setting was shown to be fully characterized by the finiteness of the effective VC dimension, formally defined by Delle Rose et al. (2023). Computability has since been studied in the context of different learning problems: Ackerman et al. (2022) extended CPAC learning to continuous domains, Hasrati and Ben-David (2023) and Delle Rose et al. (2024) to online learning, and Gourdeau et al. (2024) to adversarially robust learning.

Multiclass Learnability.

While the study of computable learnability is a very recent field of study, multiclass learnability has been the subject of extensive research efforts in the past decades. In the binary classification setting, the VC dimension characterizes PAC learnability (Vapnik and Chervonenkis, 1971; Ehrenfeucht et al., 1989; Blumer et al., 1989). However, the landscape of learnability in the multiclass setting is much more complex. The PAC framework was extended to the multiclass setting in the works of Natarajan and Tadepalli (1988) and Natarajan (1989), which gave a lower bound with respect to the Natarajan dimension, and an upper bound with respect to the graph dimension. Later, Ben-David et al. (1992) generalized the notion of dimension for multiclass learning, and provided a meta-characterization of learnability: (only) dimensions that are distinguishers characterize learnability in the finite label space setting, with distinguishers encompassing both the Natarajan and graph dimensions. Haussler and Long (1995) later generalized the Sauer-Shelah-Perles Lemma for these families of functions. Daniely et al. (2015a), originally (Daniely et al., 2011), identified “good” and “bad” ERMs with vastly different sample complexities, which, in the case of infinite label space, leads to learning scenarios where the ERM principle fails. Daniely and Shalev-Shwartz (2014) introduced a new dimension, the DS dimension, which they proved was a necessary condition for learnability. The breakthrough work of Brukhim et al. (2022) closed the problem by showing it is also sufficient for arbitrary label space. Different multiclass methods and learning paradigms have moreover been explored by Daniely et al. (2012); Rubinstein et al. (2006); Daniely et al. (2015b). Finally, multiclass PAC learning has also been studied in relation to boosting (Brukhim et al., 2021, 2023, 2024), universal learning rates (Kalavasis et al., 2022; Hanneke et al., 2023), sample compression (Pabbaraju, 2024) and regularization (Asilis et al., 2024).

2 Problem Set-up

Notation.

We denote by $\mathbb{N}$ the natural numbers. For $n\in\mathbb{N}$, let $[n]$ denote the set $\{1,\dots,n\}$. Given a finite alphabet $\Sigma$, we denote by $\Sigma^{*}$ the set of all finite words (strings) over $\Sigma$. For a given set $X=\{x_{1},\dots,x_{n}\}$, we denote by $X_{-i}:=X\setminus\{x_{i}\}$ the set resulting from removing the element with index $i$. We will always use the symbol $\subset$ (vs $\subseteq$) to mean proper inclusion. Let $\mathcal{Y}$ be an arbitrary label set containing 0. For a function $h:\mathbb{N}\rightarrow\mathcal{Y}$, denote by $M(h):=\arg\max_{n\in\mathbb{N}}\{h(n)\neq 0\}$ the largest natural number that is not mapped to 0 by $h$, with $M(h)=\infty$ if no such $n$ exists.

Learnability.

Let $\mathcal{X}$ be the input space and $\mathcal{Y}$ the label space. We denote by $\mathcal{H}$ a hypothesis class on $\mathcal{X}$: $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$. A learner $\mathcal{A}:\bigcup_{i=1}^{\infty}(\mathcal{X}\times\mathcal{Y})^{i}\to\mathcal{Y}^{\mathcal{X}}$ is a mapping from finite samples $S=((x_{1},y_{1}),\dots,(x_{m},y_{m}))$ to a function $f:\mathcal{X}\to\mathcal{Y}$. Given a distribution $D$ on $\mathcal{X}\times\mathcal{Y}$ and a hypothesis $h\in\mathcal{H}$, the risk of $h$ on $D$ is defined as $\mathsf{R}_{D}(h):=\Pr_{(x,y)\sim D}\left(h(x)\neq y\right)$. The empirical risk of $h$ on a sample $S=\{(x_{i},y_{i})\}_{i=1}^{m}\in(\mathcal{X}\times\mathcal{Y})^{m}$ is defined as $\widehat{\mathsf{R}}_{S}(h):=\frac{1}{m}\sum_{i=1}^{m}\mathbf{1}[h(x_{i})\neq y_{i}]$. An empirical risk minimizer (ERM) for $\mathcal{H}$, denoted by $\mathrm{ERM}_{\mathcal{H}}$, is a learner that for an input sample $S$ outputs a function $h^{\prime}\in\arg\min_{h\in\mathcal{H}}\widehat{\mathsf{R}}_{S}(h)$.
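To make these definitions concrete, here is a minimal Python sketch (ours, not from the paper) of the empirical risk and of an ERM rule over a finite hypothesis class given as an explicit list of functions; the names `emp_risk` and `erm` are illustrative only.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[int], int]
Sample = Sequence[Tuple[int, int]]  # labeled sample: pairs (x, y)

def emp_risk(h: Hypothesis, S: Sample) -> float:
    """Empirical risk: fraction of the sample misclassified by h."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H: List[Hypothesis], S: Sample) -> Hypothesis:
    """An ERM rule over a finite class H (a proper learner)."""
    return min(H, key=lambda h: emp_risk(h, S))

# Toy usage: three constant predictors over the labels {0, 1, 2}.
H = [lambda x, c=c: c for c in range(3)]
S = [(0, 1), (1, 1), (2, 2)]
print(emp_risk(erm(H, S), S))  # 0.333..., achieved by the constant-1 predictor
```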

We will focus on the case $\mathcal{X}=\mathbb{N}$, i.e., where the domain is countable. We work in the multiclass classification setting, and thus let $\mathcal{Y}$ be arbitrary. Throughout the paper, whether we are working in the case $|\mathcal{Y}|<\infty$ or $|\mathcal{Y}|=\infty$ will be made explicit. The case $|\mathcal{Y}|=2$ is the binary classification setting for which the probably approximately correct (PAC) framework of Valiant (1984) was originally defined, though it can straightforwardly be extended to arbitrary label spaces $\mathcal{Y}$:

Definition 1 (Agnostic PAC learnability).

A hypothesis class $\mathcal{H}$ is PAC learnable in the agnostic setting if there exists a learner $\mathcal{A}$ and function $m(\cdot,\cdot)$ such that for all $\epsilon,\delta\in(0,1)$ and for any distribution $D$ on $\mathcal{X}\times\mathcal{Y}$, if the input to $\mathcal{A}$ is an i.i.d. sample $S$ from $D$ of size at least $m(\epsilon,\delta)$, then, with probability at least $1-\delta$ over the samples, the learner outputs a hypothesis $\mathcal{A}(S)$ with $\mathsf{R}_{D}(\mathcal{A}(S))\leq\inf_{h\in\mathcal{H}}\mathsf{R}_{D}(h)+\epsilon$. The class is said to be PAC learnable in the realizable setting if the above holds under the condition that $\inf_{h\in\mathcal{H}}\mathsf{R}_{D}(h)=0$.

Definition 2 (Proper vs improper learning).

Given a hypothesis class $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$, a learner $\mathcal{A}$ is said to be proper if for all $m\in\mathbb{N}$ and samples $S\in(\mathcal{X}\times\mathcal{Y})^{m}$, $\mathcal{A}(S)\in\mathcal{H}$, and improper otherwise.

We note that by definition ERMs are proper learners.

Computable learnability.

We start with some computability basics. A function $f:\Sigma^{*}\rightarrow\Sigma^{*}$ is called total computable if there exists a program $P$ such that, for all inputs $\sigma\in\Sigma^{*}$, $P$ halts and satisfies $P(\sigma)=f(\sigma)$. A set $S\subseteq\Sigma^{*}$ is said to be decidable (or recursive) if there exists a program $P$ such that, for all $\sigma\in\Sigma^{*}$, $P(\sigma)$ halts and outputs whether $\sigma\in S$; $S$ is said to be semi-decidable (or recursively enumerable) if there exists a program $P$ such that $P(\sigma)$ halts for all $\sigma\in S$ and, whenever $P$ halts, it correctly outputs whether $\sigma\in S$. An equivalent formulation of semi-decidability for $S$ is the existence of a program $P$ that enumerates all the elements of $S$.
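As a small illustration of the gap between decidability and semi-decidability (ours, using a toy enumerable set), the following procedure halts exactly on members of the enumerated set:

```python
from itertools import count
from typing import Callable, Iterator

def enumerator() -> Iterator[int]:
    """Enumerates a toy recursively enumerable set: the perfect squares."""
    for n in count():
        yield n * n

def semi_decide(x: int, enumerate_S: Callable[[], Iterator[int]]) -> bool:
    """Halts (returning True) iff x belongs to the enumerated set.

    On non-members the loop runs forever -- this is exactly the gap between
    semi-decidability and decidability."""
    for s in enumerate_S():
        if s == x:
            return True
    return False  # unreachable for an infinite enumeration

print(semi_decide(49, enumerator))  # halts with True
```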

When studying CPAC learnability, we consider hypotheses with a mild requirement on their representation (note that otherwise negative results are trivialized, as argued by Agarwal et al. (2020)):

Definition 3 (Computable Representation (Agarwal et al., 2020)).

A hypothesis class $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ is called decidably representable (DR) if there exists a decidable set of programs $\mathcal{P}$ such that the set of all functions computed by programs in $\mathcal{P}$ equals $\mathcal{H}$. The class $\mathcal{H}$ is called recursively enumerably representable (RER) if there exists such a set of programs that is recursively enumerable.

Recall that PAC learnability only takes into account the sample size needed to guarantee generalization. It essentially views learners as functions. Computable learnability adds the basic requirement that learners be algorithms that halt on all inputs and output total computable functions.

Definition 4 (CPAC Learnability (Agarwal et al., 2020)).

A class $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ is (agnostic) CPAC learnable if there exists a computable (agnostic) PAC learner for $\mathcal{H}$ that outputs total computable functions as predictors and uses a decidable (recursively enumerable) representation for these.

Dimensions characterizing learnability.

A notion of dimension can provide a characterization of learnability for a learning problem in two different senses: first, in a qualitative sense, where the finiteness of the dimension is both a necessary and sufficient condition for learnability, and, second, in a quantitative sense, where the dimension explicitly appears in both lower and upper bounds on the sample complexity. See Lechner and Ben-David (2024) for a thorough treatment of dimensions in the context of learnability.

In the case of binary classification, the Vapnik-Chervonenkis (VC) dimension characterizes learnability in a quantitative sense (of course implying a qualitative characterization as well):

Definition 5 (VC dimension (Vapnik and Chervonenkis, 1971)).

Given a class of functions $\mathcal{H}$ from $\mathcal{X}$ to $\{0,1\}$, we say that a set $S\subseteq\mathcal{X}$ is shattered by $\mathcal{H}$ if the restriction of $\mathcal{H}$ to $S$ is the set of all functions from $S$ to $\{0,1\}$. The VC dimension of a hypothesis class $\mathcal{H}$, denoted $\mathsf{VC}(\mathcal{H})$, is the size $d$ of the largest set that can be shattered by $\mathcal{H}$. If no such $d$ exists then $\mathsf{VC}(\mathcal{H})=\infty$.

In the multiclass setting, the Natarajan dimension provides a lower bound on learnability, while the graph dimension provides an upper bound (Natarajan and Tadepalli, 1988; Natarajan, 1989). When the label space is finite, both dimensions characterize learnability, though they can be separated by a factor of $\log(|\mathcal{Y}|)$ (Ben-David et al., 1992; Daniely et al., 2011).

Definition 6 (Natarajan dimension (Natarajan, 1989)).

A set $S=\{x_{1},\dots,x_{k}\}\in\mathcal{X}^{k}$ is said to be N-shattered by $\mathcal{H}$ if there exist labelings $g_{1},g_{2}\in\mathcal{Y}^{k}$ such that for all $i\in[k]$, $g_{1}(i)\neq g_{2}(i)$, and for all subsets $I\subseteq[k]$ there exists $h\in\mathcal{H}$ with $h(x_{i})=g_{1}(i)$ in case $i\in I$ and $h(x_{i})=g_{2}(i)$ in case $i\in[k]\setminus I$. The Natarajan dimension of $\mathcal{H}$, denoted $\mathsf{N}(\mathcal{H})$, is the size $d$ of the largest set that can be N-shattered by $\mathcal{H}$. If no such $d$ exists then $\mathsf{N}(\mathcal{H})=\infty$.

Definition 7 (Graph dimension (Natarajan, 1989)).

A set $S=\{x_{1},\dots,x_{k}\}\in\mathcal{X}^{k}$ is said to be G-shattered by $\mathcal{H}$ if there exists a labeling $f\in\mathcal{Y}^{k}$ such that for every $I\subseteq[k]$ there exists $h\in\mathcal{H}$ such that for all $i\in I$, $h(x_{i})=f(i)$, and for all $i\in[k]\setminus I$, $h(x_{i})\neq f(i)$. The graph dimension of $\mathcal{H}$, denoted $\mathsf{G}(\mathcal{H})$, is the size $d$ of the largest set that can be G-shattered by $\mathcal{H}$. If no such $d$ exists then $\mathsf{G}(\mathcal{H})=\infty$.
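For intuition on Definitions 6 and 7, the following brute-force sketch (ours; only feasible for tiny finite classes and sets) checks N-shattering for fixed witness labelings $g_{1},g_{2}$ and G-shattering for a fixed labeling $f$:

```python
from itertools import product
from typing import Callable, List, Sequence

def realizes(H, xs, target) -> bool:
    """Is there h in H satisfying the per-point predicates in `target` on xs?"""
    return any(all(pred(h(x)) for x, pred in zip(xs, target)) for h in H)

def n_shattered(H: List[Callable[[int], int]], xs: Sequence[int], labels) -> bool:
    """N-shattering check for fixed witnesses g1, g2, given as a list of pairs
    `labels[i] = (g1(i), g2(i))`; they must differ at every index."""
    if any(a == b for a, b in labels):
        return False
    for I in product([0, 1], repeat=len(xs)):
        target = [(lambda v, a=a, b=b, bit=bit: v == (a if bit else b))
                  for (a, b), bit in zip(labels, I)]
        if not realizes(H, xs, target):
            return False
    return True

def g_shattered(H: List[Callable[[int], int]], xs: Sequence[int], f) -> bool:
    """G-shattering check for a fixed labeling f (a list of labels)."""
    for I in product([0, 1], repeat=len(xs)):
        target = [(lambda v, y=y, bit=bit: (v == y) if bit else (v != y))
                  for y, bit in zip(f, I)]
        if not realizes(H, xs, target):
            return False
    return True

# Toy check: all functions from {0,1} to {0,1} N-shatter the set {0,1}.
H = [lambda x, a=a, b=b: a if x == 0 else b for a in (0, 1) for b in (0, 1)]
print(n_shattered(H, [0, 1], [(1, 0), (1, 0)]))  # True
```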

When $\mathcal{Y}$ is infinite, it is the DS dimension that characterizes learnability (Daniely and Shalev-Shwartz, 2014; Brukhim et al., 2022). Before defining it, we need to define pseudo-cubes.

Definition 8 (Pseudo-cube).

A set $H\subseteq\mathcal{Y}^{d}$ is called a pseudo-cube of dimension $d$ if $H$ is non-empty and finite, and for every $h\in H$ and every index $i\in[d]$ there exists $g\in H$ such that $h(j)=g(j)$ if and only if $j\neq i$.

Definition 9 (DS dimension (Daniely and Shalev-Shwartz, 2014)).

A set $S\in\mathcal{X}^{n}$ is said to be DS-shattered by $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ if $\mathcal{H}|_{S}$ contains an $n$-dimensional pseudo-cube. The DS dimension of $\mathcal{H}$, denoted $\mathsf{DS}(\mathcal{H})$, is the size $d$ of the largest set that can be DS-shattered by $\mathcal{H}$. If no such $d$ exists then $\mathsf{DS}(\mathcal{H})=\infty$.
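The pseudo-cube condition of Definition 8 can likewise be checked by brute force; below is a short sketch (ours) that verifies whether a finite set of label tuples is a $d$-dimensional pseudo-cube.

```python
from typing import Set, Tuple

def is_pseudo_cube(H: Set[Tuple[int, ...]], d: int) -> bool:
    """Checks Definition 8: H is non-empty, every tuple has length d, and every
    h in H has, for every coordinate i, a neighbour g in H differing from h
    exactly in coordinate i."""
    if not H or any(len(h) != d for h in H):
        return False
    def differ_exactly_at(h, g, i):
        return all((h[j] == g[j]) == (j != i) for j in range(d))
    return all(any(differ_exactly_at(h, g, i) for g in H) for h in H for i in range(d))

# The Boolean cube {0,1}^2 is a 2-dimensional pseudo-cube.
print(is_pseudo_cube({(0, 0), (0, 1), (1, 0), (1, 1)}, 2))  # True
```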

We refer the reader to Brukhim et al. (2022) for results separating the Natarajan and DS dimensions, as well as an example showing why the finiteness of the pseudo-cube is a necessary property in order for the DS dimension to characterize learnability.

3 Characterizing CPAC Learnability with the Computable Natarajan Dimension

In this section, we first recall results in the binary CPAC setting that have implications for multiclass CPAC learning, and give conditions under which CPAC learnability is guaranteed. We then define the computable versions of the Natarajan and graph dimensions, in the spirit of the effective VC dimension, which appears implicitly in the work of Sterkenburg (2022) and was formally defined by Delle Rose et al. (2023). We also show that the same gap as in the standard (non-computable) setting exists between the computable Natarajan and computable graph dimensions. In Section 3.1, we show that the finiteness of the computable Natarajan dimension is a necessary condition for multiclass CPAC learnability for arbitrary label spaces. We finish this section by showing that this finiteness is sufficient for finite label spaces in Section 3.2.

We note that there are several hardness results for binary CPAC learning that immediately imply hardness for multiclass CPAC learning. In particular, the results that show a separation between agnostic PAC and CPAC learnability (both for proper (Agarwal et al., 2020) and improper (Sterkenburg, 2022) learning) imply that there are decidably representable (DR) classes which are (information-theoretically) multiclass learnable, but which are not computably multiclass learnable. On the other hand, in the binary case, any PAC learnable class that is recursively enumerably representable (RER) is also CPAC learnable in the realizable case. For multiclass learning, we can similarly implement an ERM rule for the realizable case as outlined below.

Proposition 10.

Let $\mathcal{H}$ be RER. If $\mathsf{G}(\mathcal{H})<\infty$ or $\mathsf{N}(\mathcal{H})\log(|\mathcal{Y}|)<\infty$, then $\mathcal{H}$ is properly CPAC learnable in the realizable setting.

Proof.

The conditions $\mathsf{G}(\mathcal{H})<\infty$ and $\mathsf{N}(\mathcal{H})\log(|\mathcal{Y}|)<\infty$ are both sufficient to guarantee generalization with an ERM. Upon drawing a sufficiently large sample $S$ from the underlying distribution $D$, it suffices to enumerate all $h\in\mathcal{H}$ and compute $\widehat{\mathsf{R}}_{S}(h)$ one by one until we obtain one with zero empirical risk. We thus have a computable ERM (recall that all $h\in\mathcal{H}$ are total computable). ∎
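The enumeration argument above can be sketched as follows (a minimal illustration, assuming a hypothetical generator `enumerate_H` that yields the total computable functions of the RER class): in the realizable setting, we may stop as soon as a hypothesis consistent with the sample is found.

```python
from itertools import count
from typing import Callable, Iterator, Sequence, Tuple

def realizable_erm(enumerate_H: Callable[[], Iterator[Callable[[int], int]]],
                   S: Sequence[Tuple[int, int]]) -> Callable[[int], int]:
    """Enumerate the (recursively enumerable) class and return the first
    hypothesis with zero empirical risk; halts whenever S is realizable."""
    for h in enumerate_H():
        if all(h(x) == y for x, y in S):
            return h
    raise RuntimeError("unreachable when S is realizable by the class")

# Toy RER class: all constant functions over labels 0, 1, 2, ...
def enumerate_constants():
    for c in count():
        yield lambda x, c=c: c

S = [(3, 2), (7, 2)]
h = realizable_erm(enumerate_constants, S)
print(h(0))  # 2
```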

Thus, RER classes that satisfy the uniform convergence property can be CPAC learned in the realizable case. Moreover, having access to a computable ERM yields the following:

Fact 11.

Let $\mathcal{H}$ have a computable ERM and suppose $\mathsf{G}(\mathcal{H})<\infty$ or $\mathsf{N}(\mathcal{H})\log(|\mathcal{Y}|)<\infty$. Then $\mathcal{H}$ is CPAC learnable in the multiclass setting.

The Computable Natarajan and Graph Dimensions.

The general idea in defining computable versions of shattering-based dimensions, such as the effective VC dimension (Sterkenburg, 2022; Delle Rose et al., 2023) and the computable robust shattering dimension (Gourdeau et al., 2024), is to have a computable proof of the statement “$X$ cannot be shattered” for all sets of a certain size. We define the computable Natarajan and graph dimensions in this spirit as well:

Definition 12 (Computable Natarajan dimension).

A $k$-witness of Natarajan dimension for a hypothesis class $\mathcal{H}$ is a function $w_{N}:\mathcal{X}^{k+1}\times\mathcal{Y}^{k+1}\times\mathcal{Y}^{k+1}\rightarrow 2^{k+1}$ that takes as input a set $S=\{x_{1},\dots,x_{k+1}\}$ of size $k+1$ and two labelings $g_{1},g_{2}\in\mathcal{Y}^{k+1}$ of $S$ satisfying $g_{1}(i)\neq g_{2}(i)$ for all $i\in[k+1]$, and outputs a subset $I\subseteq[k+1]$ such that for every $h\in\mathcal{H}$ there exists $i\in[k+1]$ such that $h(x_{i})\neq g_{1}(i)$ if $i\in I$ and $h(x_{i})\neq g_{2}(i)$ if $i\in[k+1]\setminus I$. The computable Natarajan dimension of $\mathcal{H}$, denoted $c\text{-}\mathsf{N}(\mathcal{H})$, is the smallest integer $k$ such that there exists a computable $k$-witness of Natarajan dimension for $\mathcal{H}$.

Remark 13.

In the definition above, we make explicit the requirement that $g_{1}$ and $g_{2}$ differ on all indices, but this can be checked computably. Moreover, the usual manner to obtain computable dimensions is to negate the first-order formula for “$X$ is shattered by $\mathcal{H}$”. In the case of the Natarajan dimension, this would give a witness that, after finding $I$, whenever given a hypothesis $h$, outputs the $x\in S$ satisfying the condition; but such an $x$ is straightforward to find computably once $I$ is obtained. We thus simplify the Natarajan, graph and general dimensions (see Section 4) in this manner.

Definition 14 (Computable graph dimension).

A $k$-witness of graph dimension for a hypothesis class $\mathcal{H}$ is a function $w_{G}:\mathcal{X}^{k+1}\times\mathcal{Y}^{k+1}\rightarrow 2^{k+1}$ that takes as input a set $S=\{x_{1},\dots,x_{k+1}\}$ of size $k+1$ and a labeling $f\in\mathcal{Y}^{k+1}$ of $S$, and outputs a subset $I\subseteq[k+1]$ such that for every $h\in\mathcal{H}$ there exists $i\in[k+1]$ such that $h(x_{i})\neq f(i)$ if $i\in I$ and $h(x_{i})=f(i)$ if $i\in[k+1]\setminus I$. The computable graph dimension of $\mathcal{H}$, denoted $c\text{-}\mathsf{G}(\mathcal{H})$, is the smallest integer $k$ such that there exists a computable $k$-witness of graph dimension for $\mathcal{H}$.
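To illustrate Definitions 12 and 14, here is a brute-force sketch (ours, and only sensible when the restriction of the class to any finite set can be computed, e.g., for a finite class given explicitly) of a Natarajan witness search: given $S$, $g_{1}$ and $g_{2}$, it looks for a subset $I$ whose induced labeling no hypothesis realizes.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple

def natarajan_witness(H: List[Callable[[int], int]],
                      S: Sequence[int],
                      g1: Sequence[int],
                      g2: Sequence[int]) -> Optional[Tuple[int, ...]]:
    """Returns a subset I (as a 0/1 tuple) such that no h in H matches
    g1 on I and g2 off I; returns None if S is N-shattered via (g1, g2)."""
    assert all(a != b for a, b in zip(g1, g2)), "g1 and g2 must differ everywhere"
    for I in product([0, 1], repeat=len(S)):
        pattern = [g1[i] if bit else g2[i] for i, bit in enumerate(I)]
        if not any(all(h(x) == y for x, y in zip(S, pattern)) for h in H):
            return I  # witness found: this labeling of S is unrealized by H
    return None  # (g1, g2) N-shatters S
```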

In the binary setting, the VC, Natarajan, and graph dimensions are all identical. This is also the case for their computable counterparts. In particular, this implies that $c\text{-}\mathsf{N}(\mathcal{H})$ and $\mathsf{N}(\mathcal{H})$ can be arbitrarily far apart, with $\mathsf{N}(\mathcal{H})=1$ and $c\text{-}\mathsf{N}(\mathcal{H})=\infty$. The same separation holds for the graph dimension and the computable graph dimension. It is also straightforward to check that $c\text{-}\mathsf{N}(\mathcal{H})\leq c\text{-}\mathsf{G}(\mathcal{H})$. Now, as in the non-computable versions of the Natarajan and graph dimensions, we have an arbitrary gap between their computable counterparts (see Appendix A for the proof):

Proposition 15.

For any $m\in\mathbb{N}\cup\{\infty\}$ there exist $\mathcal{X},\mathcal{Y}$ and $\mathcal{H}$ with $|\mathcal{X}|=m$, $c\text{-}\mathsf{N}(\mathcal{H})=1$ and $c\text{-}\mathsf{G}(\mathcal{H})=m$.

3.1 The Finiteness of the Computable Natarajan Dimension as a Necessary Condition

In this section, we show that the finiteness of the computable Natarajan dimension is a necessary condition for CPAC learnability in the agnostic setting, even in the case $|\mathcal{Y}|=\infty$.

Theorem 16.

Let $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ be improperly CPAC learnable. Then $c\text{-}\mathsf{N}(\mathcal{H})<\infty$, i.e., $\mathcal{H}$ admits a computable $k$-witness of Natarajan dimension for some $k\in\mathbb{N}$.

We will first show a multiclass analogue of the computable No-Free-Lunch theorem for binary classification (Lemma 19 in (Agarwal et al., 2020)), adapted with the Natarajan dimension in mind:

Lemma 17.

For any computable learner $\mathcal{A}$, for any $m\in\mathbb{N}$, any instance space $\mathcal{X}$ of size at least $2m$, any subset $X=\{x_{1},\dots,x_{2m}\}$ of size $2m$, and any two functions $g_{1},g_{2}:X\rightarrow\mathcal{Y}$ satisfying $g_{1}(x)\neq g_{2}(x)$ for all $x\in X$, we can computably find $f:X\rightarrow\mathcal{Y}$ such that

  1. $f(x)\in\{g_{1}(x),g_{2}(x)\}$ for all $x\in X$,

  2. $\mathsf{R}_{D}(f)=0$,

  3. with probability at least $1/7$ over $S\sim D^{m}$, $\mathsf{R}_{D}(\mathcal{A}(S))\geq 1/8$,

where $D$ is the uniform distribution on $\{(x_{i},f(x_{i}))\}_{i=1}^{2m}$.

Proof sketch.

We will first prove the existence of a pair $(f,D)$ satisfying the desired requirements. To this end, for $I\subseteq[2m]$, denote by $f_{I}:X\rightarrow\mathcal{Y}$ the labelling of $X$ satisfying $f_{I}(x_{i})=g_{1}(x_{i})$ if $i\in I$ and $f_{I}(x_{i})=g_{2}(x_{i})$ if $i\in[2m]\setminus I$. For each $f_{I}$, define the following distribution $D_{I}$ on $X\times\mathcal{Y}$:

$$D_{I}((x,y))=\begin{cases}\frac{1}{2m}&\text{if }y=f_{I}(x)\\ 0&\text{otherwise}\end{cases}\enspace.$$

Note that there are $T=2^{2m}$ possible such functions from $X$ to $\mathcal{Y}$. Let $\{(f_{i},D_{i})\}_{i\in[T]}$ denote the set of all such function-distribution pairs and note that $\mathsf{R}_{D_{i}}(f_{i})=0$ for all $i\in[T]$.

Note that, by a simple application of Markov’s inequality, it is sufficient to show that there exists $i$ satisfying

$$\underset{S\sim D^{m}_{i}}{\mathbb{E}}\left[\mathsf{R}_{D_{i}}(\mathcal{A}(S))\right]\geq 1/4\enspace. \qquad (1)$$

The proof that the third requirement is satisfied is nearly identical to that of the No-Free-Lunch theorem (see, e.g., Shalev-Shwartz and Ben-David (2014), Theorem 5.1), and is omitted for brevity.

Now, to computably find a pair $(f,D)$ satisfying $\mathbb{E}_{S\sim D^{m}_{i}}\left[\mathsf{R}_{D_{i}}(\mathcal{A}(S))\right]\geq\frac{1}{4}$, it suffices to note that $\mathcal{A}$ is computable (and outputs computably evaluable functions), that the set $\{(f_{i},D_{i})\}_{i\in[T]}$ is finite, and that for each pair $(f_{i},D_{i})$, we can use $\mathcal{A}$ to compute the expected risk. Indeed, denote by $S_{1},\dots,S_{n}$ the $n=(2m)^{m}$ possible sequences of length $m$ from $X$, and for some $i\in[T]$ and $S_{j}=(x_{1},\dots,x_{m})$, let $S_{j}^{i}:=((x_{l},f_{i}(x_{l})))_{l=1}^{m}$ be the sequence $S_{j}$ labeled by $f_{i}$. Each distribution $D_{i}$ induces the equally likely sequences $S_{1}^{i},\dots,S_{n}^{i}$, implying

$$\underset{S\sim D^{m}_{i}}{\mathbb{E}}\left[\mathsf{R}_{D_{i}}(\mathcal{A}(S))\right]=\frac{1}{n}\sum_{j=1}^{n}\mathsf{R}_{D_{i}}(\mathcal{A}(S_{j}^{i}))=\frac{1}{n}\sum_{j=1}^{n}\frac{1}{2m}\sum_{l=1}^{2m}\mathbf{1}[\mathcal{A}(S_{j}^{i})(x_{l})\neq f_{i}(x_{l})]\enspace.$$

Since such a pair must exist, we will eventually stop for some $i$ for which Equation 1 holds. ∎
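The computable search described in this proof can be sketched as follows (ours, under the simplifying assumption that the learner $\mathcal{A}$ is available as a Python callable mapping a labeled sequence to a predictor): for each candidate labeling $f_{I}$ we evaluate the exact expected risk by enumerating all $(2m)^{m}$ samples, and stop at the first $f_{I}$ for which it is at least $1/4$.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Learner = Callable[[Sequence[Tuple[int, int]]], Callable[[int], int]]

def hard_labeling(A: Learner, X: List[int], g1: Sequence[int], g2: Sequence[int], m: int):
    """Search over all labelings f with f(x) in {g1(x), g2(x)} for one on which
    the exact expected risk of A, under the uniform distribution on the graph
    of f, is at least 1/4; the No-Free-Lunch argument guarantees one exists
    when |X| = 2m. (g1, g2 are sequences aligned with X.)"""
    sequences = list(product(X, repeat=m))        # all (2m)^m unlabeled samples
    for bits in product([0, 1], repeat=len(X)):   # all 2^(2m) candidate labelings
        f = {x: (g1[i] if bits[i] else g2[i]) for i, x in enumerate(X)}
        risk = 0.0
        for seq in sequences:                     # each sequence is equally likely
            h = A([(x, f[x]) for x in seq])
            risk += sum(h(x) != f[x] for x in X) / len(X)
        risk /= len(sequences)
        if risk >= 0.25:
            return f
    raise RuntimeError("unreachable: some labeling must satisfy the bound")
```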

We are now ready to prove Theorem 16, in the spirit of the binary case (Sterkenburg, 2022).

Proof of Theorem 16.

Let $\mathcal{A}$ be a computable (potentially improper) learner for $\mathcal{H}$ with sample complexity function $m(\epsilon,\delta)$. Let $m=m(1/8,1/7)$. We will show that $\mathcal{A}$ can be used to build a computable $(2m-1)$-witness of Natarajan dimension for $\mathcal{H}$.

To this end, suppose we are given an arbitrary set $X=\{x_{1},\dots,x_{2m}\}\in\mathcal{X}^{2m}$ and labelings $g_{1},g_{2}:X\rightarrow\mathcal{Y}$ satisfying $g_{1}(x_{i})\neq g_{2}(x_{i})$ for all $i\in[2m]$. By Lemma 17, we can computably find $f:X\rightarrow\mathcal{Y}$ such that (i) $f(x)\in\{g_{1}(x),g_{2}(x)\}$ for all $x\in X$, (ii) $\mathsf{R}_{D}(f)=0$, and (iii) $\Pr_{S\sim D^{m}}\left(\mathsf{R}_{D}(\mathcal{A}(S))\geq 1/8\right)\geq 1/7$, where $D$ is the uniform distribution on $\{(x_{i},f(x_{i}))\}_{i=1}^{2m}$. This implies that the labeling of $X$ induced by $f$ is not achievable by any $h\in\mathcal{H}$: otherwise $\min_{h\in\mathcal{H}}\mathsf{R}_{D}(h)=0$, and by the PAC guarantee $\Pr_{S\sim D^{m}}\left(\mathsf{R}_{D}(\mathcal{A}(S))\geq\min_{h\in\mathcal{H}}\mathsf{R}_{D}(h)+1/8\right)<1/7$, we would get $\Pr_{S\sim D^{m}}\left(\mathsf{R}_{D}(\mathcal{A}(S))\geq 1/8\right)<1/7$, a contradiction. Now, let $I\subseteq[2m]$ be the index set identifying the instances in $X$ labelled by $g_{1}$ in $f$. Then clearly $I$ is the set that we seek: for every $h\in\mathcal{H}$ there exists $x_{i}\in X$ such that $h(x_{i})\neq f(x_{i})$, where $f(x_{i})=g_{1}(x_{i})$ if $i\in I$ and $g_{2}(x_{i})$ if $i\in[2m]\setminus I$, as required. ∎

3.2 The Finiteness of the Computable Natarajan Dimension as a Sufficient Condition

We now state and show the main result of the section: finite computable Natarajan dimension is sufficient for CPAC learnability whenever $\mathcal{Y}$ is finite.

Theorem 18.

Let $c\text{-}\mathsf{N}(\mathcal{H})<\infty$ and $|\mathcal{Y}|<\infty$. Then $\mathcal{H}$ is (improperly) CPAC learnable.

In the binary classification setting, Delle Rose et al. (2023) showed that finite effective VC dimension is sufficient for CPAC learnability. We generalize this approach to the multiclass setting:

Proof of Theorem 18.

Let $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ be such that $c\text{-}\mathsf{N}(\mathcal{H})=k$. Let $w:\mathcal{X}^{k+1}\times\mathcal{Y}^{k+1}\times\mathcal{Y}^{k+1}\rightarrow 2^{k+1}$ be a $k$-witness of Natarajan dimension. We will embed $\mathcal{H}$ into $\mathcal{H}^{\prime}$ satisfying (i) $\mathsf{N}(\mathcal{H}^{\prime})\leq k+1$ and (ii) $\mathcal{H}^{\prime}$ has a computable ERM. By Fact 11, this is sufficient to guarantee multiclass CPAC learnability. Before showing that properties (i) and (ii) hold, we introduce the following notation. Given $y,y^{\prime}\in\mathcal{Y}^{k+1}$ and a subset $I\subseteq[k+1]$, we denote by $f_{I,y,y^{\prime}}\in\mathcal{Y}^{[k+1]}$ the function

$$f_{I,y,y^{\prime}}(i)=\begin{cases}y_{i}&i\in I\\ y_{i}^{\prime}&i\in[k+1]\setminus I\end{cases}\enspace. \qquad (2)$$

When $y,y^{\prime}$ are fixed and clear from context, we will shorten the notation to $f_{I}$ for readability. We first show the following lemma, which will be invoked in Section 4 as well, when we give a more general necessary condition on multiclass CPAC learnability (Theorem 25).

Lemma 19.

For every $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ with $|\mathcal{Y}|<\infty$ and $c\text{-}\mathsf{N}(\mathcal{H})=k_{N}$, there exists a class $\mathcal{H}^{\prime}\supset\mathcal{H}$ with

  • $\mathsf{N}(\mathcal{H}^{\prime})\leq k_{N}+1$;

  • there exists a computable function $v:\bigcup_{m=1}^{\infty}\mathcal{X}^{m}\to\bigcup_{m=1}^{\infty}2^{\mathcal{Y}^{m}}$ that takes as input a finite domain subset $T\subset\mathcal{X}$ and outputs the set of labelings $v(T)=\mathcal{H}^{\prime}|_{T}$.

Proof.

Constructing $\mathcal{H}^{\prime}$. Consider the class $\mathcal{G}\subseteq\mathcal{Y}^{\mathbb{N}}$ of “good” functions satisfying, for all $g\in\mathcal{G}$:

  1. $M(g)<\infty$, where $M(g)=\arg\max_{n\in\mathbb{N}}\{g(n)\neq 0\}$.

  2. For any $x_{1}<\dots<x_{k}<x_{k+1}\leq M(g)$ and any labelings $y,y^{\prime}\in\mathcal{Y}^{k+1}$ with $y_{i}\neq y^{\prime}_{i}$ for all $i\in[k+1]$, let $w(X,y,y^{\prime})=I\subseteq[k+1]$, where $X=\{x_{i}\}_{i\in[k+1]}$. Then $g|_{X}\neq f_{I,y,y^{\prime}}|_{X}$.

Namely, “good” functions defined on $\mathbb{N}$ are those that are eventually always 0 and do not encode the output of the witness function for any labelings.

Now, let $\mathcal{H}^{\prime}:=\mathcal{H}\cup\mathcal{G}$. We will show that $\mathcal{H}^{\prime}$ indeed satisfies the conditions above.

$\mathsf{N}(\mathcal{H}^{\prime})\leq k+1$. Let $X=\{x_{1},\dots,x_{k+2}\}\in\mathcal{X}^{k+2}$ and let $y,y^{\prime}\in\mathcal{Y}^{k+2}$ be arbitrary labelings that differ in each component, i.e., $y_{i}\neq y^{\prime}_{i}$ for all $i\in[k+2]$. WLOG, suppose $x_{1}<\dots<x_{k+1}<x_{k+2}$ and that $y_{k+2}>y_{k+2}^{\prime}$, in particular $y_{k+2}>0$. Let $J$ be the output of the $k$-witness $w$ on $(X,y,y^{\prime})$ without the $(k+2)$-th entries, i.e., $J:=w(X_{-(k+2)},y_{-(k+2)},y^{\prime}_{-(k+2)})$. Let $J^{+}=J\cup\{k+2\}$ and, by a slight abuse of notation, let $f_{J}\in\mathcal{Y}^{[k+1]}$ and $f_{J^{+}}\in\mathcal{Y}^{[k+2]}$ be defined as per Equation 2, where we omit $y,y^{\prime}$ in the subscript for readability. We claim that there exists no $h\in\mathcal{H}^{\prime}$ satisfying $h|_{X}=f_{J^{+}}|_{X}$. First note that no $h\in\mathcal{H}$ can satisfy this, because $J$ is defined as the output of the $k$-witness $w$. So any such $h$ would have to be in $\mathcal{G}$. We distinguish two cases:

  1. $h(x_{k+2})=0$: then $h(x_{k+2})\neq y_{k+2}=f_{J^{+}}(x_{k+2})$;

  2. $h(x_{k+2})\neq 0$: then $x_{k+2}\leq M(h)$, which by definition implies $h|_{X_{-(k+2)}}\neq f_{J}$.

Existence of the computable function $v$. Let $T\in\mathcal{X}^{m}$ be arbitrary. We will argue that (i) we can computably obtain all labellings in $\mathcal{G}|_{T}$ and (ii) $\mathcal{G}|_{T}=\mathcal{H}^{\prime}|_{T}$. We first argue that we can computably obtain all labelings in $\mathcal{G}|_{T}$. Let $M=\max_{x\in T}x$. Note that in order to find all labellings in $\mathcal{G}|_{T}$ it suffices to consider functions $h$ with $M(h)\leq M$, and that by the finiteness of $\mathcal{Y}$, there is a finite number of “good” functions in $\mathcal{G}$ satisfying this. These functions can now be computably identified by first listing all patterns in $\mathcal{Y}^{M}$ and then using the computable witness function $w_{N}$ on all inputs

$$\{(U,y,y^{\prime}):U\subseteq[M],\ y,y^{\prime}\in\mathcal{Y}^{M}\text{ and for all }i\in[M]\text{ we have }y_{i}\neq y^{\prime}_{i}\} \qquad (3)$$

to exclude those patterns that are not in $\mathcal{G}$, which is possible by the finiteness of $\mathcal{Y}^{M}$. By definition of $\mathcal{G}$, the remaining patterns match $\mathcal{G}|_{T}$, thus showing that there is indeed an algorithm that for any $T$ outputs $\mathcal{G}|_{T}$. We now argue that $\mathcal{H}^{\prime}|_{T}=\mathcal{G}|_{T}$. Since $\mathcal{G}\subseteq\mathcal{H}^{\prime}$, it is sufficient to argue that any labelling in $\mathcal{H}^{\prime}|_{T}$ is also in $\mathcal{G}|_{T}$. Let $h\in\mathcal{H}^{\prime}$ be arbitrary and consider its “truncated” version $h_{M}$, where $h_{M}(x)=h(x)$ if $x\leq M$ and $h_{M}(x)=0$ otherwise. We will now show that $h_{M}\in\mathcal{G}$. Suppose $h_{M}\notin\mathcal{G}$. Then there must exist $X=\{x_{1},\dots,x_{k+1}\}$ with $x_{1}<\dots<x_{k+1}\leq M(h_{M})$ and $y,y^{\prime}\in\mathcal{Y}^{k+1}$ such that for $I=w_{N}(X,y,y^{\prime})$ we have $h_{M}|_{X}=f_{I}|_{X}$. But $h$ and $h_{M}$ agree on $[M]$ and $M(h_{M})\leq M(h)$, so $h$ satisfies this as well, contradicting the definition of $w_{N}$ if $h\in\mathcal{H}$, and the goodness of $h$ if $h\in\mathcal{G}$. Thus $h_{M}\in\mathcal{G}$, and since $T\subseteq[M]$ and $h|_{T}=h_{M}|_{T}$, we get $h|_{T}\in\mathcal{G}|_{T}$. Therefore $\mathcal{H}^{\prime}|_{T}=\mathcal{G}|_{T}$, concluding our proof. ∎

It now remains to show that $\mathrm{ERM}_{\mathcal{H}^{\prime}}$ is computable in order to conclude the proof of Theorem 18. We note that $\mathcal{G}$ is recursively enumerable, and thus we can iterate through all elements of the class $\mathcal{G}$. Furthermore, we have seen that for any sample $S$ we can computably find all behaviours $\mathcal{H}^{\prime}|_{S}=\mathcal{G}|_{S}$, and we thus have a stopping criterion for $\mathrm{ERM}_{\mathcal{G}}$, which also serves as an implementation of $\mathrm{ERM}_{\mathcal{H}^{\prime}}$. ∎
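The stopping criterion above can be sketched in code (ours, for a toy setting where the witness $w_{N}$ is given as a Python callable and the domain points of the sample lie in $\{1,\dots,M\}$): list all label patterns on $[M]$, discard those that realize an output of the witness (a conservative brute-force stand-in for the class $\mathcal{G}$ restricted to $[M]$), and minimize the empirical risk over the survivors.

```python
from itertools import combinations, product
from typing import Callable, List, Sequence, Tuple

Witness = Callable[[Tuple[int, ...], Tuple[int, ...], Tuple[int, ...]], Tuple[int, ...]]

def good_patterns(M: int, labels: Sequence[int], k: int, w_N: Witness) -> List[Tuple[int, ...]]:
    """Labelings of {1,...,M} that never realize an output of the witness w_N
    on any (k+1)-subset (pattern[x-1] is the label of the point x)."""
    survivors = []
    for pattern in product(labels, repeat=M):
        bad = False
        for X in combinations(range(1, M + 1), k + 1):
            for g1 in product(labels, repeat=k + 1):
                for g2 in product(labels, repeat=k + 1):
                    if any(a == b for a, b in zip(g1, g2)):
                        continue  # witness is only defined for componentwise-distinct labelings
                    I = w_N(X, g1, g2)  # a 0/1 tuple of length k+1
                    target = [g1[i] if I[i] else g2[i] for i in range(k + 1)]
                    if all(pattern[x - 1] == t for x, t in zip(X, target)):
                        bad = True
                        break
                if bad:
                    break
            if bad:
                break
        if not bad:
            survivors.append(pattern)
    return survivors

def erm_over_good(S: List[Tuple[int, int]], patterns: List[Tuple[int, ...]]) -> Tuple[int, ...]:
    """Computable ERM on a sample S whose domain points lie in {1,...,M}."""
    return min(patterns, key=lambda p: sum(p[x - 1] != y for x, y in S))
```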

We also note that the above proof goes through in the case where $\mathcal{Y}$ is infinite but the range of possible labels for each initial segment of $\mathcal{H}$ is computably bounded:

Observation 20.

Let $\mathcal{X}=\mathbb{N}$ and $\mathcal{Y}=\mathbb{N}$. Furthermore, let $\mathcal{H}$ be a hypothesis class with $c\text{-}\mathsf{N}(\mathcal{H})=k$. If there is a computable function $c:\mathbb{N}\to\mathbb{N}$ such that for every $n\in\mathbb{N}$, $\mathcal{H}|_{[n]}\subseteq[c(n)]^{[n]}$, then $\mathcal{H}$ is agnostically CPAC learnable.

We note that this condition would capture many infinite-label settings, such as question-answering, with the requirement that the length of the answer be bounded as a function of the length of the question. The proof of this observation can be obtained by replacing Equation 3 with $\{(U,y,y^{\prime}):U\subseteq[M],\ y,y^{\prime}\in[c(M)]^{M}\text{ and for all }i\in[M]\text{ we have }y_{i}\neq y^{\prime}_{i}\}$ in the construction of $v(T)$ in the proof of Lemma 19.

4 A General Method for CPAC Learnability in the Multiclass Setting with $|\mathcal{Y}|<\infty$

The Natarajan dimension is one of many ways to generalize the VC dimension to arbitrary label spaces: the graph and DS dimensions also generalize the VC dimension, the latter characterizing learnability even in the case of infinitely many labels (Daniely and Shalev-Shwartz, 2014; Brukhim et al., 2022). Ben-David et al. (1992) generalized a notion of shattering for finite label spaces by encoding the label space into the set $\{0,1,*\}$, which subsumes Natarajan and graph shattering.

In this section, we formalize a new, more general notion of computable dimension, based on the dimensions presented by Ben-David et al. (1992). We show that the finiteness of these computable dimensions characterizes CPAC learnability for finite label space, notably generalizing the results we presented in Section 3. This general view also allows us to extract a more abstract and elegant relationship between computable learnability and computable dimensions.

Let $\Psi$ be a family of functions from $\mathcal{Y}=\{0,\dots,l\}$ to $\{0,1,*\}$. Given $n\in\mathbb{N}$, $\bar{\psi}:=(\psi_{1},\dots,\psi_{n})\in\Psi^{n}$ and a tuple of labels $y\in\mathcal{Y}^{n}$, denote by $\bar{\psi}(y)$ the tuple $(\psi_{1}(y_{1}),\dots,\psi_{n}(y_{n}))$. Given a set of label sequences $Y\subseteq\mathcal{Y}^{n}$, we overload $\bar{\psi}$ as follows: $\bar{\psi}(Y):=\{\bar{\psi}(y)\;|\;y\in Y\}$. We are now ready to define the $\Psi$-dimension.

Definition 21 ($\Psi$-shattering and $\Psi$-dimension (Ben-David et al., 1992)).

A set $X\in\mathcal{X}^{n}$ is $\Psi$-shattered by $\mathcal{H}$ if there exists $\bar{\psi}\in\Psi^{n}$ such that $\{0,1\}^{n}\subseteq\bar{\psi}(\mathcal{H}|_{X})$. The $\Psi$-dimension of $\mathcal{H}$, denoted $\Psi\text{-}\mathrm{dim}(\mathcal{H})$, is the size $d$ of the largest set $X$ that is $\Psi$-shattered by $\mathcal{H}$. If no largest such $d$ exists, then $\Psi\text{-}\mathrm{dim}(\mathcal{H})=\infty$.

Here the condition $\{0,1\}^{n}\subseteq\bar{\psi}(\mathcal{H}|_{X})$ essentially means that any 0-1 encoding of the labels is captured by applying $\bar{\psi}$ to some $h$ in the projection of $\mathcal{H}$ onto $X$.

Examples.

The graph dimension corresponds to the $\Psi_{\mathsf{G}}$-dimension, where $\Psi_{\mathsf{G}}:=\{\psi_{k}\;|\;k\in\{0,\dots,l\}\}$ with

$$\psi_{k}(y)=\begin{cases}1&y=k\\ 0&\text{otherwise}\end{cases}\enspace,$$

and the Natarajan dimension to the $\Psi_{\mathsf{N}}$-dimension, where $\Psi_{\mathsf{N}}:=\{\psi_{k,k^{\prime}}\;|\;k\neq k^{\prime}\in\{0,\dots,l\}\}$ with

$$\psi_{k,k^{\prime}}(y)=\begin{cases}1&y=k\\ 0&y=k^{\prime}\\ *&\text{otherwise}\end{cases}\enspace.$$
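These two families can be written down directly; the sketch below (ours) encodes each $\psi$ as a Python function returning 0, 1 or `None` (playing the role of $*$), together with the $\Psi$-shattering condition of Definition 21 for a class given by its restriction to $X$.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence

Psi = Callable[[int], Optional[int]]   # None plays the role of '*'

def psi_graph(k: int) -> Psi:
    """psi_k from the graph family: 1 on label k, 0 elsewhere."""
    return lambda y: 1 if y == k else 0

def psi_natarajan(k: int, kp: int) -> Psi:
    """psi_{k,k'} from the Natarajan family: 1 on k, 0 on k', '*' elsewhere."""
    return lambda y: 1 if y == k else (0 if y == kp else None)

def psi_shattered(behaviours: List[Sequence[int]], psis: Sequence[Psi]) -> bool:
    """Definition 21 on a finite set X: `behaviours` is H restricted to X and
    psis is a tuple psi-bar of the same length; X is shattered if every 0/1
    vector appears after applying psi-bar coordinate-wise."""
    images = {tuple(psi(y) for psi, y in zip(psis, h)) for h in behaviours}
    return all(bits in images for bits in product([0, 1], repeat=len(psis)))
```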
Definition 22 (Distinguisher (Ben-David et al., 1992)).

A pair $(y,y^{\prime})$ of distinct labels in $\{0,\dots,l\}$ is said to be $\Psi$-distinguishable if there exists $\psi\in\Psi$ with $\psi(y)\neq\psi(y^{\prime})$ and neither $\psi(y)$ nor $\psi(y^{\prime})$ equal to $*$. The family $\Psi$ is said to be a distinguisher if all pairs of distinct labels in $\{0,\dots,l\}$ are $\Psi$-distinguishable.

The notion of being a distinguisher in Ben-David et al. (1992) was shown to be both necessary and sufficient in order for the $\Psi$-dimension to characterize learnability in the qualitative sense, i.e., through its finiteness. In essence, distinguishers provide a meta-characterization of learnability:

Theorem 23 (Theorem 14 in (Ben-David et al., 1992)).

A family $\Psi$ of functions from $\{0,\dots,l\}$ to $\{0,1,*\}$ provides a characterization of proper learnability if and only if $\Psi$ is a distinguisher.

Ben-David et al. (1992) indeed implicitly define learnability as proper learnability. Note, however, that the argument showing that being a distinguisher is a necessary condition for characterizing learnability also goes through for improper learnability (see Lemma 13 therein).

Computable $\Psi$-Dimensions.

We can now straightforwardly define $c\text{-}\Psi\text{-}\mathrm{dim}$ as the smallest integer $k\in\mathbb{N}$ for which there exists a computable proof of the statement “$X$ cannot be $\Psi$-shattered” for any set $X$ of size larger than $k$:

Definition 24 (Computable $\Psi$-dimension).

Let $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$. A $k$-witness of $\Psi$-dimension is a function $w:\mathcal{X}^{k+1}\times\Psi^{k+1}\rightarrow\{0,1\}^{k+1}$ such that for any sequence $X\in\mathcal{X}^{k+1}$ and any $\bar{\psi}\in\Psi^{k+1}$, we have that $w(X,\bar{\psi})\notin\bar{\psi}(\mathcal{H}|_{X})$. The computable $\Psi$-dimension of $\mathcal{H}$, denoted $c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})$, is the smallest integer $k$ for which there exists a computable $k$-witness of $\Psi$-dimension. If no such $k$ exists, then $c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})=\infty$.

Here, one can view the witness function as returning a 0-1 encoding that no hypothesis in $\mathcal{H}$ projected onto $X$ can achieve when $\bar{\psi}$ is applied to it.

4.1 Necessary Conditions for Finite $\mathcal{Y}$

In this section, we show that, for a family $\Psi$ embedding the label space $\mathcal{Y}$ into $\{0,1,*\}$, the finiteness of the computable $\Psi$-dimension is a necessary condition for CPAC learnability of a class $\mathcal{H}$. We start by stating a lower bound on the computable Natarajan dimension in terms of the computable $\Psi$-dimension and the size of the label space. The theorem is a computable version of Theorem 7 in Ben-David et al. (1992).

Theorem 25.

Let $\Psi$ be a family of functions from $\mathcal{Y}$ to $\{0,1,*\}$. For every RER hypothesis class $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ over a finite label space $\mathcal{Y}$, we have that

$$\frac{c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})}{\log(c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H}))+2\log(|\mathcal{Y}|)}\leq c\text{-}\mathsf{N}(\mathcal{H})+1\enspace.$$

Combining Theorem 25 with Theorem 16, we obtain the following:

Corollary 26.

Let $\Psi$ be a distinguisher, and suppose $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ is improperly CPAC learnable. Then $c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})<\infty$, i.e., $\mathcal{H}$ admits a computable $k$-witness of $\Psi$-dimension for some $k\in\mathbb{N}$.

The proof of Theorem 25 is based on our lemma below, which is a computable version of the generalization of the Sauer Lemma to finite label multiclass settings from (Natarajan, 1989).

Lemma 27.

Let $|\mathcal{Y}|<\infty$ and suppose $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ satisfies $c\text{-}\mathsf{N}(\mathcal{H})=k_{N}$. Then there is a computable function $v$ that takes as input a set $T\in\mathcal{X}^{m}$ and outputs a set of labellings $v(T)\subseteq\mathcal{Y}^{m}$ with $v(T)\supseteq\mathcal{H}|_{T}$ and $|v(T)|\leq m^{k_{N}+1}|\mathcal{Y}|^{2(k_{N}+1)}$.

Proof.

From Lemma 19 we know that there is an embedding $\mathcal{H}^{\prime}\supseteq\mathcal{H}$ such that $\mathsf{N}(\mathcal{H}^{\prime})\leq k_{N}+1$ and such that there exists a computable function $v$ which, for any input $T\in\mathcal{X}^{m}$, outputs $\mathcal{H}^{\prime}|_{T}$. We can now invoke a classical result from (Natarajan, 1989), which shows that the number of behaviours of any class with Natarajan dimension $d_{N}$ over a finite label space $\mathcal{Y}$ on any set $T\in\mathcal{X}^{m}$ is upper bounded by $m^{d_{N}}|\mathcal{Y}|^{2d_{N}}$. Lastly, we note that since $\mathcal{H}\subset\mathcal{H}^{\prime}$ we have $\mathcal{H}|_{T}\subseteq v(T)$ and $|\mathcal{H}|_{T}|\leq|\mathcal{H}^{\prime}|_{T}|=|v(T)|\leq m^{k_{N}+1}|\mathcal{Y}|^{2(k_{N}+1)}$, thus proving the bound of the lemma. ∎

We now proceed with the proof of Theorem 25.

Proof of Theorem 25.

Let $c\text{-}\mathsf{N}(\mathcal{H})=k_{N}$. Now let $k_{B}$ be some arbitrary number satisfying the inequality $k_{B}^{k_{N}+1}|\mathcal{Y}|^{2(k_{N}+1)}<2^{k_{B}}$; that is, $k_{B}$ is an arbitrary number that exceeds the bound for $c\text{-}\Psi\text{-}\mathrm{dim}$ in the theorem. We will now prove that bound for $c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})$ by showing that for such $k_{B}$ there exists a computable $(k_{B}-1)$-witness function $w_{B}$ for the $\Psi$-dimension. Let $\bar{\psi}\in\Psi^{k_{B}}$ and $T\in\mathcal{X}^{k_{B}}$ be arbitrary. From Lemma 27 we know there is a computable function $v$ that for input $T$ outputs a set $v(T)\supseteq\mathcal{H}|_{T}$ with $|v(T)|\leq k_{B}^{k_{N}+1}|\mathcal{Y}|^{2(k_{N}+1)}<2^{k_{B}}$. This also implies that $|\bar{\psi}(v(T))|\leq|v(T)|<2^{k_{B}}$. In particular, there is at least one $\{0,1\}$-labelling $g^{\prime}\notin\bar{\psi}(v(T))$. Furthermore, we can computably identify $g^{\prime}$ by checking which labelling is missing from the computably generated set $\bar{\psi}(v(T))$. From $v(T)\supseteq\mathcal{H}|_{T}$, it also follows that $\bar{\psi}(v(T))\supseteq\bar{\psi}(\mathcal{H}|_{T})$. Thus $g^{\prime}\notin\bar{\psi}(\mathcal{H}|_{T})$, i.e., $g^{\prime}$ is a witness value for the set $T$ and the tuple $\bar{\psi}$. Thus we have shown that $c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})$ is at most $k_{B}-1$, implying the bound of the theorem. ∎
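The witness construction in this proof amounts to finding a missing $\{0,1\}$-pattern; a minimal sketch (ours, with the function $v$ of Lemma 27 passed in as a callable) is below.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple

def psi_dim_witness(T: Sequence[int],
                    psis: Sequence[Callable[[int], Optional[int]]],
                    v: Callable[[Sequence[int]], List[Sequence[int]]]
                    ) -> Tuple[int, ...]:
    """Given T of size k_B and psi-bar, return a 0/1 pattern missing from
    psi-bar(v(T)) -- it exists whenever |v(T)| < 2^{k_B} -- and hence missing
    from psi-bar(H|_T), since v(T) contains H|_T."""
    images = {tuple(psi(y) for psi, y in zip(psis, labeling)) for labeling in v(T)}
    for bits in product([0, 1], repeat=len(T)):
        if bits not in images:
            return bits
    raise ValueError("no missing pattern: |v(T)| is not smaller than 2^{|T|}")
```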

4.2 Sufficient Conditions for Finite $\mathcal{Y}$

In this section, we show that, for a distinguisher $\Psi$, the finiteness of the computable $\Psi$-dimension, $c\text{-}\Psi\text{-}\mathrm{dim}$, provides a sufficient condition for CPAC learnability.

Theorem 28.

Let $|\mathcal{Y}|<\infty$ and let $\Psi$ be a family of functions from $\mathcal{Y}$ to $\{0,1,*\}$. Furthermore, suppose that finite $\Psi$-dimension implies uniform convergence under the 0-1 loss. Then $c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})<\infty$ implies that $\mathcal{H}$ is CPAC learnable.

The proof follows from suitable generalizations of the arguments presented in Section 3.2:

Proof.

Let $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ be such that $c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})=k<\infty$, and let $w:\mathcal{X}^{k+1}\times\Psi^{k+1}\rightarrow\{0,1\}^{k+1}$ be a computable $k$-witness of $\Psi$-dimension. Our goal is to embed $\mathcal{H}$ into $\mathcal{H}^{\prime}$ satisfying the following: (i) $\Psi\text{-}\mathrm{dim}(\mathcal{H}^{\prime})\leq k+1$ and (ii) $\mathcal{H}^{\prime}$ has a computable ERM. By the conditions of the theorem statement, this is sufficient to guarantee multiclass CPAC learnability.

Constructing $\mathcal{H}^{\prime}$. Consider the class $\mathcal{G}\subseteq\mathcal{Y}^{\mathbb{N}}$ of “good” functions satisfying, for all $g\in\mathcal{G}$:

  1. $M(g)<\infty$, where $M(g)=\arg\max_{n\in\mathbb{N}}\{g(n)\neq 0\}$ and $M(g)=\infty$ if no such $n$ exists.

  2. For any $x_{1}<\dots<x_{k}<x_{k+1}\leq M(g)$ and any $\bar{\psi}\in\Psi^{k+1}$, $w(X,\bar{\psi})\neq\bar{\psi}(g|_{X})$, where $X=\{x_{i}\}_{i\in[k+1]}$.

Now, let $\mathcal{H}^{\prime}:=\mathcal{H}\cup\mathcal{G}$. We will show that $\mathcal{H}^{\prime}$ indeed satisfies the conditions above.

$\Psi\text{-}\mathrm{dim}(\mathcal{H}^{\prime})\leq k+1$. Let $X=\{x_{1},\dots,x_{k+2}\}\in\mathcal{X}^{k+2}$ and $\bar{\psi}\in\Psi^{k+2}$. WLOG suppose $x_{1}<\dots<x_{k+1}<x_{k+2}$ and that $\bar{\psi}_{k+2}^{-1}(0)$ and $\bar{\psi}_{k+2}^{-1}(1)$ are both non-empty. Let $y_{0}\in\bar{\psi}_{k+2}^{-1}(0)$ and $y_{1}\in\bar{\psi}_{k+2}^{-1}(1)$ be the minimal elements of their respective sets, and WLOG let $y_{0}<y_{1}$, in particular $y_{1}>0$. Consider $w(X_{-(k+2)},\bar{\psi}_{-(k+2)})$, the output of the $k$-witness on $X$ and $\bar{\psi}$, disregarding the $(k+2)$-th entry. Let $w^{\prime}=(w(X_{-(k+2)},\bar{\psi}_{-(k+2)}),1)\in\{0,1\}^{k+2}$. We claim that no $h\in\mathcal{H}^{\prime}$ satisfies $\bar{\psi}(h|_{X})=w^{\prime}$. First note that, by definition, no $h\in\mathcal{H}$ can satisfy this. So it remains to consider a “good” function $g\in\mathcal{G}$. We distinguish two cases:

  1. $g(x_{k+2})\neq 0$: then $x_{k+2}\leq M(g)$, but since $g$ is “good”, $w(X_{-(k+2)},\bar{\psi}_{-(k+2)})\neq\bar{\psi}_{-(k+2)}(g|_{X_{-(k+2)}})$ by definition;

  2. $g(x_{k+2})=0$: then by construction $0\notin\bar{\psi}_{k+2}^{-1}(1)$, thus $\bar{\psi}_{k+2}(g(x_{k+2}))\neq 1$, as required.

$\mathcal{H}^{\prime}$ has a computable ERM. Let $S\in(\mathbb{N}\times\mathcal{Y})^{m}$ be arbitrary. Let $M=\max_{(x,y)\in S}x$, and note that it suffices to consider functions $h$ with $M(h)\leq M$, and that by the finiteness of $\mathcal{Y}$, there is a finite number of “good” functions in $\mathcal{G}$ satisfying this. These functions can be identified computably, by listing all patterns and using the computable witness function to exclude functions that are not in $\mathcal{G}$, with a similar argument as in the proof of Lemma 19. If we can show that there always exists a function in $\mathcal{G}$ that is an empirical risk minimizer, then we are done. Let $h^{*}\in\arg\min_{h\in\mathcal{H}^{\prime}}\widehat{\mathsf{R}}_{S}(h)$, and consider its “truncated” version $h^{*}_{M}$, where $h^{*}_{M}(x)=h^{*}(x)$ if $x\leq M$ and $h^{*}_{M}(x)=0$ otherwise. Since $S$ only contains domain points at most $M$, $\widehat{\mathsf{R}}_{S}(h^{*})=\widehat{\mathsf{R}}_{S}(h^{*}_{M})$, so it remains to show $h^{*}_{M}\in\mathcal{G}$. Suppose not. Then there must exist $X=\{x_{1},\dots,x_{k+1}\}$ with $x_{1}<\dots<x_{k+1}\leq M(h^{*}_{M})$ and $\bar{\psi}\in\Psi^{k+1}$ such that $w(X,\bar{\psi})=\bar{\psi}(h^{*}_{M}|_{X})$. But $h^{*}$ and $h^{*}_{M}$ agree on $[M]$, so the non-truncated $h^{*}$ also satisfies this, a contradiction with the definition of $w$ if $h^{*}\in\mathcal{H}$, and with the goodness of $h^{*}$ if $h^{*}\in\mathcal{G}$ (since $M(h^{*}_{M})\leq M(h^{*})$ in that case). ∎

As a corollary of Theorem 23, we get:

Corollary 29.

Let $|\mathcal{Y}|<\infty$ and let $\Psi$ be a family of functions from $\mathcal{Y}$ to $\{0,1,*\}$. Furthermore, suppose that $\Psi$ is a distinguisher. Then $c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})<\infty$ implies that $\mathcal{H}$ is CPAC learnable.

Combining these with Corollary 26, we obtain the following result:

Theorem 30.

Let $|\mathcal{Y}|<\infty$ and suppose $\Psi$ is a distinguisher. Then $c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})<\infty$ if and only if $\mathcal{H}$ is CPAC learnable.

Remark 31.

Since the Natarajan and graph dimensions are both expressible as distinguishers (with the computable versions matching the corresponding computable $\Psi$-dimension), Theorem 30 holds for $c\text{-}\mathsf{N}$ and $c\text{-}\mathsf{G}$. Similarly, the result can be obtained for other families of distinguishers such as the Pollard pseudo-dimension (Pollard, 1990; Haussler, 1992).

Finally, we note that, while the arguments of Section 3.1 (which give a necessary condition on CPAC learnability via the finiteness of the computable Natarajan dimension) hold for infinite label spaces, neither the arguments in Theorem 25 nor those in Theorem 28 can be extended to infinite $\mathcal{Y}$: in the former, $|\mathcal{Y}|$ appears in the denominator of the lower bound; the latter relies on the finiteness of $\mathcal{Y}$ to implement a computable ERM.

4.3 A Meta-Characterization for CPAC Learnability

In the previous section, we showed that distinguishers give rise to computable dimensions that qualitatively characterize CPAC learnability for finite $\mathcal{Y}$ in the agnostic setting. But what happens if a family of functions fails to be a distinguisher?

Proposition 32.

Suppose $\Psi$ fails to be a distinguisher. Then there exists $\mathcal{H}$ with $c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})=1$ such that $\mathcal{H}$ is not CPAC learnable.

Proof.

Suppose it is the case that $\Psi$ fails to be a distinguisher, and say it cannot distinguish labels $y_{1},y_{2}$. Then, letting $\mathcal{H}$ be the hypothesis class of all functions from $\mathbb{N}$ to $\{y_{1},y_{2}\}$, we note that $\mathcal{H}$ is not PAC learnable, and thus not CPAC learnable. Now, to see that $c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H})=1$, note that for arbitrary $x\in\mathcal{X}$ and any $\psi\in\Psi$, there is $b\in\{0,1\}$ such that $\psi(\mathcal{H}|_{\{x\}})\subseteq\{b,*\}$, and $b$ can be identified by computing $\psi(y_{1}),\psi(y_{2})$, regardless of $x$. Thus for any given $x$ and $\psi$, the witness function returns $b\oplus 1$, as required. ∎

Combining Proposition 32 with Theorem 30, we conclude the main result of this paper: a meta-characterization for CPAC learnability in the agnostic setting, in the sense that we precisely characterize which families of functions from $\mathcal{Y}$ to $\{0,1,*\}$ give rise to computable dimensions characterizing multiclass CPAC learnability.

Theorem 33.

Let \Psi be a family of functions from \mathcal{Y} to \{0,1,*\} for finite \mathcal{Y}. Then c\text{-}\Psi\text{-}\mathrm{dim}(\mathcal{H}) qualitatively characterizes CPAC learnability if and only if \Psi is a distinguisher.

4.4 The DS Dimension

The bounds derived in Section 4 hold for families \Psi of functions from \mathcal{Y} to \{0,1,*\} that satisfy certain properties, e.g., are distinguishers. This generalization of the Natarajan and Graph dimensions predates the work of Daniely and Shalev-Shwartz (2014), which defined the DS dimension, a characterization of learnability for multiclass classification for arbitrary label spaces \mathcal{Y}. For infinite label space, Brukhim et al. (2022) exhibit an arbitrary gap between the Natarajan and DS dimensions. But even in the case of a finite label set, can we express the DS dimension as a family \Psi_{\mathsf{DS}}, in the sense that for all \mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}, \mathsf{DS}(\mathcal{H})=\Psi_{\mathsf{DS}}\text{-}\dim(\mathcal{H})? Unfortunately, the result below gives a negative answer to this question, which may be of independent interest.

Lemma 34.

The DS dimension cannot be expressed as a family \Psi_{\mathsf{DS}} of functions from \mathcal{Y} to \{0,1,*\}.

Proof.

We will show that any family \Psi with \Psi\text{-}\mathrm{dim}(\mathcal{H})=2 and \Psi\text{-}\mathrm{dim}(\mathcal{H}^{\prime})=1 induces \mathcal{H}_{*}\subset\mathcal{H} with \Psi\text{-}\mathrm{dim}(\mathcal{H}_{*})=2.

Let \Psi be such that \Psi\text{-}\mathrm{dim}(\mathcal{H})=2 and \Psi\text{-}\mathrm{dim}(\mathcal{H}^{\prime})=1. WLOG, suppose there exists \psi_{*}\in\Psi such that \{0,1\}\subseteq\psi_{*}(\mathcal{H}^{\prime}|_{\{0\}}). Since \mathcal{H}^{\prime}|_{\{1\}}=\{2,4\}, it must be that for all \psi\in\Psi, we cannot have both \psi(2)\neq\psi(4) and \psi(2),\psi(4)\in\{0,1\}. Let (\psi_{1},\psi_{2})\in\Psi^{2} witness the \Psi-shattering of \mathcal{H} on \mathcal{X}, i.e.,

\{0,1\}^{2}\subseteq(\psi_{1},\psi_{2})(\mathcal{H})\enspace. (4)

We distinguish three cases:

  1. \psi_{2}(2)=\psi_{2}(4)=*: impossible by Equation 4;

  2. \psi_{2}(2)\in\{0,1\} and \psi_{2}(4)=*: WLOG let \psi_{2}(2)=0. Then, for Equation 4 to hold, we need \psi_{2}(6)=1 as well as \psi_{1}(3)=\psi_{1}(5)\neq\psi_{1}(1), all of which must be in \{0,1\}. From this, it is clear that \mathcal{H}_{*}=\{12,32,56,16\} satisfies \Psi\text{-}\mathrm{dim}(\mathcal{H}_{*})=2, despite \mathsf{DS}(\mathcal{H}_{*})=1. Thus we get an impossibility. The symmetric case \psi_{2}(2)=* and \psi_{2}(4)\in\{0,1\} follows by an identical reasoning.

  3. \psi_{2}(2)=\psi_{2}(4)\in\{0,1\}: WLOG let \psi_{2}(2)=\psi_{2}(4)=0. Then, for Equation 4 to hold, we need \psi_{2}(6)=1 as well as \psi_{1}(1)\neq\psi_{1}(5), both in \{0,1\}. But this implies that the class \mathcal{H}_{*}=\{12,16,56,54\} satisfies \Psi\text{-}\mathrm{dim}(\mathcal{H}_{*})=2,

as required. ∎
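To double-check the combinatorics in the case analysis, one can verify Equation 4 on the exhibited subclasses by brute force. The short Python check below does this for case 3; hypotheses are encoded as two-character strings h(0)h(1), and the particular numeric values chosen for psi1 and psi2 are one admissible assignment consistent with the constraints derived in that case, not values prescribed by the proof.

from itertools import product

def psi_shatters(psi1, psi2, hypotheses):
    # Check Equation (4): does (psi1, psi2) realize all of {0,1}^2 on the class?
    # psi1 and psi2 are dictionaries from labels (as characters) to 0, 1 or "*".
    realized = {(psi1[h[0]], psi2[h[1]]) for h in hypotheses}
    return set(product((0, 1), repeat=2)) <= realized

# Case 3: psi2(2) = psi2(4) = 0, psi2(6) = 1, and psi1(1) = 0 != 1 = psi1(5).
psi1 = {"1": 0, "3": "*", "5": 1}
psi2 = {"2": 0, "4": 0, "6": 1}
print(psi_shatters(psi1, psi2, ["12", "16", "56", "54"]))  # True

Replacing the class with \{12,32,56,16\} and an assignment satisfying the constraints of case 2 gives the same answer.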

5 Conclusion

We initiated the study of multiclass CPAC learnability, focusing on finite label spaces, and established a meta-characterization through the finiteness of the computable dimensions arising from a broad family of functions: so-called distinguishers. Characterizations through the computable Natarajan and the computable graph dimensions appear as special cases of this result. Moreover, we showed that this result cannot readily be extended to the DS dimension, suggesting that characterizing CPAC learnability for infinite label spaces will potentially require significantly different techniques.

Acknowledgements

Pascale Gourdeau has been supported by a Vector Postdoctoral Fellowship and an NSERC Postdoctoral Fellowship. Tosca Lechner has been supported by a Vector Postdoctoral Fellowship. Ruth Urner is also an Affiliate Faculty Member at Toronto’s Vector Institute, and acknowledges funding through an NSERC Discovery grant.

References

  • Ackerman et al. (2022) Nathanael Ackerman, Julian Asilis, Jieqi Di, Cameron Freer, and Jean-Baptiste Tristan. Computable PAC learning of continuous features. In Proceedings of the 37th Annual ACM/IEEE Symposium on Logic in Computer Science, pages 1–12, 2022.
  • Agarwal et al. (2020) Sushant Agarwal, Nivasini Ananthakrishnan, Shai Ben-David, Tosca Lechner, and Ruth Urner. On learnability with computable learners. In Algorithmic Learning Theory, pages 48–60. PMLR, 2020.
  • Asilis et al. (2024) Julian Asilis, Siddartha Devic, Shaddin Dughmi, Vatsal Sharan, and Shang-Hua Teng. Regularization and optimal multiclass learning. In The Thirty Seventh Annual Conference on Learning Theory, pages 260–310. PMLR, 2024.
  • Ben-David et al. (1992) Shai Ben-David, Nicolo Cesa-Bianchi, and Philip M Long. Characterizations of learnability for classes of {0,…,n}-valued functions. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 333–340, 1992.
  • Ben-David et al. (2019) Shai Ben-David, Pavel Hrubes, Shay Moran, Amir Shpilka, and Amir Yehudayoff. Learnability can be undecidable. Nat. Mach. Intell., 1(1):44–48, 2019.
  • Blumer et al. (1989) Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
  • Brukhim et al. (2021) Nataly Brukhim, Elad Hazan, Shay Moran, Indraneel Mukherjee, and Robert E Schapire. Multiclass boosting and the cost of weak learning. Advances in Neural Information Processing Systems, 34:3057–3067, 2021.
  • Brukhim et al. (2022) Nataly Brukhim, Daniel Carmon, Irit Dinur, Shay Moran, and Amir Yehudayoff. A characterization of multiclass learnability. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 943–955. IEEE, 2022.
  • Brukhim et al. (2023) Nataly Brukhim, Steve Hanneke, and Shay Moran. Improper multiclass boosting. In The Thirty Sixth Annual Conference on Learning Theory, pages 5433–5452. PMLR, 2023.
  • Brukhim et al. (2024) Nataly Brukhim, Amit Daniely, Yishay Mansour, and Shay Moran. Multiclass boosting: simple and intuitive weak learning criteria. Advances in Neural Information Processing Systems, 36, 2024.
  • Daniely and Shalev-Shwartz (2014) Amit Daniely and Shai Shalev-Shwartz. Optimal learners for multiclass problems. In Conference on Learning Theory, pages 287–316. PMLR, 2014.
  • Daniely et al. (2011) Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. In Proceedings of the 24th Annual Conference on Learning Theory, pages 207–232. JMLR Workshop and Conference Proceedings, 2011.
  • Daniely et al. (2012) Amit Daniely, Sivan Sabato, and Shai Shalev-Shwartz. Multiclass learning approaches: A theoretical comparison with implications. Advances in Neural Information Processing Systems, 25, 2012.
  • Daniely et al. (2015a) Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. J. Mach. Learn. Res., 16(1):2377–2404, 2015a.
  • Daniely et al. (2015b) Amit Daniely, Michael Schapira, and Gal Shahaf. Inapproximability of truthful mechanisms via generalizations of the VC dimension. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 401–408, 2015b.
  • Delle Rose et al. (2023) Valentino Delle Rose, Alexander Kozachinskiy, Cristóbal Rojas, and Tomasz Steifer. Find a witness or shatter: the landscape of computable PAC learning. In The Thirty Sixth Annual Conference on Learning Theory, pages 511–524. PMLR, 2023.
  • Delle Rose et al. (2024) Valentino Delle Rose, Alexander Kozachinskiy, and Tomasz Steifer. Effective Littlestone dimension. arXiv preprint arXiv:2411.15109, 2024.
  • Ehrenfeucht et al. (1989) Andrzej Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3):247–261, 1989.
  • Gourdeau et al. (2024) Pascale Gourdeau, Tosca Lechner, and Ruth Urner. On the computability of robust PAC learning. In The Thirty Seventh Annual Conference on Learning Theory, pages 2092–2121. PMLR, 2024.
  • Hanneke et al. (2023) Steve Hanneke, Shay Moran, and Qian Zhang. Universal rates for multiclass learning. In The Thirty Sixth Annual Conference on Learning Theory, pages 5615–5681. PMLR, 2023.
  • Hasrati and Ben-David (2023) Niki Hasrati and Shai Ben-David. On computable online learning. In International Conference on Algorithmic Learning Theory, pages 707–725. PMLR, 2023.
  • Haussler (1992) David Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.
  • Haussler and Long (1995) David Haussler and Philip M Long. A generalization of Sauer’s lemma. Journal of Combinatorial Theory, Series A, 71(2):219–240, 1995.
  • Kalavasis et al. (2022) Alkis Kalavasis, Grigoris Velegkas, and Amin Karbasi. Multiclass learnability beyond the PAC framework: Universal rates and partial concept classes. Advances in Neural Information Processing Systems, 35:20809–20822, 2022.
  • Lechner and Ben-David (2024) Tosca Lechner and Shai Ben-David. Inherent limitations of dimensions for characterizing learnability of distribution classes. In Shipra Agrawal and Aaron Roth, editors, Proceedings of Thirty Seventh Conference on Learning Theory, volume 247 of Proceedings of Machine Learning Research, pages 3353–3374. PMLR, 30 Jun–03 Jul 2024. URL https://proceedings.mlr.press/v247/lechner24a.html.
  • Natarajan (1989) Balas K Natarajan. On learning sets and functions. Machine Learning, 4:67–97, 1989.
  • Natarajan and Tadepalli (1988) Balas K Natarajan and Prasad Tadepalli. Two new frameworks for learning. In Machine Learning Proceedings 1988, pages 402–415. Elsevier, 1988.
  • Pabbaraju (2024) Chirag Pabbaraju. Multiclass learnability does not imply sample compression. In International Conference on Algorithmic Learning Theory, pages 930–944. PMLR, 2024.
  • Pollard (1990) David Pollard. Empirical Processes: Theory and Applications. IMS, 1990.
  • Rubinstein et al. (2006) Benjamin Rubinstein, Peter Bartlett, and J Rubinstein. Shifting, one-inclusion mistake bounds and tight multiclass expected risk bounds. Advances in Neural Information Processing Systems, 19, 2006.
  • Shalev-Shwartz and Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning - From Theory to Algorithms. Cambridge University Press, 2014.
  • Sterkenburg (2022) Tom F Sterkenburg. On characterizations of learnability with computable learners. In Conference on Learning Theory, pages 3365–3379. PMLR, 2022.
  • Valiant (1984) Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  • Vapnik and Chervonenkis (1971) Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2):264–280, 1971.

Appendix A Proof of Proposition 15

Proof of Proposition 15.

Let \mathcal{X} be finite or countable. Consider the hypothesis class in Daniely et al. (2015a) showing the same separation between the Natarajan and graph dimensions: let \mathcal{P}_{\mathrm{f}}(\mathcal{X}) be a subset of the powerset of \mathcal{X} consisting only of finite or cofinite subsets of \mathcal{X}. Let \mathcal{Y}=\mathcal{P}_{\mathrm{f}}(\mathcal{X})\cup\{\star\}. For any A\in\mathcal{P}_{\mathrm{f}}(\mathcal{X}), let

h_{A}(x)=\begin{cases}A&x\in A\\ \star&x\notin A\end{cases}\enspace.

The hypothesis class is \mathcal{H}=\{h_{A}:A\in\mathcal{P}_{\mathrm{f}}(\mathcal{X})\}. Note that any A\in\mathcal{P}_{\mathrm{f}}(\mathcal{X}) has a finite representation (a special character indicating whether we are enumerating the set or its complement, together with the finite set A or \mathcal{X}\setminus A). Thus checking whether x\in A can be done computably for any x\in\mathcal{X}, implying each h_{A} is computably evaluable. Since \mathsf{G}(\mathcal{H})=|\mathcal{X}|, it follows that c\text{-}\mathsf{G}(\mathcal{H})=|\mathcal{X}| as well. We will now show that c\text{-}\mathsf{N}(\mathcal{H})=1, namely we exhibit a computable 1-witness of the Natarajan dimension. To this end, let X=\{x_{1},x_{2}\}\subseteq\mathcal{X} and y,y^{\prime}\in\mathcal{Y}^{2} with y_{1}\neq y_{1}^{\prime} and y_{2}\neq y_{2}^{\prime} be arbitrary. First check whether \{y_{1},y_{2},y_{1}^{\prime},y_{2}^{\prime}\} contains more than one non-\star label, in which case we are done, as we can output the index set corresponding to the labelling AB for some distinct A,B\in\mathcal{P}_{\mathrm{f}}(\mathcal{X}). Otherwise, WLOG let y=AA and y^{\prime}=\star\star for some A\in\mathcal{P}_{\mathrm{f}}(\mathcal{X}). Check whether x_{1}\in A. If yes, output I=2 (corresponding to the labelling \star A), and if not, output I=1 (corresponding to the labelling A\star). ∎
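The two computable ingredients of this proof, evaluating h_{A} from the finite/cofinite representation and the 1-witness just described, can be sketched in Python as follows. The representation format (a complement flag plus a tuple of listed elements), the use of None for the star label, and the convention of returning an unrealizable labelling of (x_{1},x_{2}) rather than its index set are illustrative assumptions, not the paper's definitions; the sketch also handles the general arrangement of labels directly rather than via the proof's WLOG step.

def in_set(x, A_repr):
    # Decide x in A from the finite representation of A: a flag saying whether
    # the listed elements form A or its complement, plus that finite list.
    is_complement, elements = A_repr
    return (x not in elements) if is_complement else (x in elements)

def make_h(A_repr):
    # The hypothesis h_A: outputs the label A (here, its representation) on
    # members of A and the star label (here, None) elsewhere.
    return lambda x: A_repr if in_set(x, A_repr) else None

def natarajan_1_witness(x1, x2, y, y_prime):
    # y and y_prime are label pairs with y[0] != y_prime[0] and y[1] != y_prime[1];
    # returns a labelling of (x1, x2), mixing y and y_prime coordinatewise,
    # that no hypothesis h_A realizes.
    c1 = {l for l in (y[0], y_prime[0]) if l is not None}
    c2 = {l for l in (y[1], y_prime[1]) if l is not None}
    for a in c1:
        for b in c2:
            if a != b:
                # Two distinct non-star labels: no single h_C can output both.
                return (a, b)
    # Otherwise the only non-star label is a single set A, and both
    # coordinates offer the labels A and star.
    A = next(iter(c1 | c2))
    # If x1 lies in A, the labelling (star, A) would force h_A yet label x1
    # with star; otherwise (A, star) would force h_A despite x1 not in A.
    return (None, A) if in_set(x1, A) else (A, None)

For instance, with A represented as (False, (3, 7)) and x_{1}=3, the witness rules out the labelling \star A, matching the case I=2 above.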