
Universal consistency of the $k$-NN rule in metric spaces and Nagata dimension

Benoît Collins, Department of Mathematics, Kyoto University, Kitashirakawa Oiwake-cho, Sakyo-ku, 606-8502, Japan, collins@math.kyoto-u.ac.jp

Sushma Kumari, Department of Mathematical Engineering, Musashino University, 1 Chome-1-20 Shinmachi, Nishitokyo, Tokyo 202-8585, Japan, kumari@musashino-u.ac.jp

Vladimir G. Pestov, Instituto de Matemática e Estatística, Universidade Federal da Bahia, Ondina, Salvador–BA, 40.170-115, Brazil; second address: Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON K1N 6N5, Canada, vpest283@uOttawa.ca

(Date: Version as of 12:57 BST, June 14, 2020)
Abstract.

The $k$ nearest neighbour learning rule (under the uniform distance tie-breaking) is universally consistent in every metric space $X$ that is sigma-finite dimensional in the sense of Nagata. This was pointed out by Cérou and Guyader (2006) as a consequence of the main result by those authors, combined with a theorem in real analysis sketched by D. Preiss (1971) (and elaborated in detail by Assouad and Quentin de Gromard (2006)). We show that it is possible to give a direct proof along the same lines as the original theorem of Charles J. Stone (1977) about the universal consistency of the $k$-NN classifier in the finite-dimensional Euclidean space. The generalization is non-trivial because distance ties are more prevalent in the non-Euclidean setting, and along the way we investigate the relevant geometric properties of the metrics and the limitations of Stone's argument, by constructing various examples.

Key words and phrases:
$k$-NN classifier, universal consistency, geometric Stone lemma, distance ties, Nagata dimension, sigma-finite dimensional metric spaces
1991 Mathematics Subject Classification:
62H30, 54F45
S.K. would like to thank the JICA-FRIENDSHIP scholarship (fellowship D-1590283/ J-1593019) for supporting her doctoral study at Kyoto University, and the Department of Mathematics, Kyoto University, for supporting the Brazil research trip under the KTGU project.
V.G.P. acknowledges support from CNPq (bolsa Pesquisador Visitante, processo 310012/2016) and CAPES (bolsa Professor Visitante Estrangeiro Sênior, processo 88881.117018/2016-01).
Résumé.

The $k$ nearest neighbour learning rule (under uniform breaking of distance ties) is universally consistent in every separable metric space that is sigma-finite dimensional in the sense of Nagata. As noted by Cérou and Guyader (2006), the result follows from a combination of the main theorem of those authors with a theorem of real analysis sketched by D. Preiss (1971) (and elaborated in detail by Assouad and Quentin de Gromard (2006)). We show that it is possible to give a direct proof in the same spirit as the original theorem of Charles J. Stone (1977) on the universal consistency of the $k$-NN classifier in the finite-dimensional Euclidean space. The generalization is non-trivial, since distance ties are more common in the non-Euclidean case, and in the course of our proof we study relevant geometric properties of the metrics and test the limits of Stone's argument by constructing several examples.

Introduction

The $k$-nearest neighbour classifier, in spite of being arguably the oldest supervised learning algorithm in existence, still retains its importance, both practical and theoretical. In particular, it was the first classification learning rule whose (weak) universal consistency (in the finite-dimensional Euclidean space ${\mathbb{R}}^{d}=\ell^{2}(d)$) was established, by Charles J. Stone in [20].

Stone's result is easily extended to all finite-dimensional normed spaces, see, e.g., [6]. However, the $k$-NN classifier is no longer universally consistent already in the infinite-dimensional Hilbert space $\ell^{2}$. A series of examples of this kind, obtained in the setting of real analysis, belongs to Preiss, and the first of them [17] is so simple that it can be described in a few lines. We will reproduce it in this article, since the example remains virtually unknown in the statistical machine learning community.

There is sufficient empirical evidence to support the view that the performance of the $k$-NN classifier greatly depends on the chosen metric on the domain (see, e.g., [9]). There is a supervised learning algorithm, the Large Margin Nearest Neighbour classifier (LMNN), based on the idea of optimizing the $k$-NN performance over all Euclidean metrics on a finite-dimensional vector space [21]. At the same time, it appears that a theoretical foundation for such an optimization over a set of distances is still lacking. The first question to address in this connection is, of course, to characterize those metrics (generating the original Borel structure of the domain) for which the $k$-NN classifier is (weakly) universally consistent.

While the problem in this generality still remains open, a great advance in this direction was made by Cérou and Guyader in [2]. They have shown that the $k$-NN classifier is consistent under the assumption that the regression function $\eta(x)$ satisfies the weak Lebesgue–Besicovitch differentiation property:

\frac{1}{\mu(B_{r}(x))}\int_{B_{r}(x)}\eta(x)\,d\mu(x)\to\eta(x), \qquad (1)

where the convergence is in measure, that is, for each ${\epsilon}>0$,

\mu\left\{x\in\Omega\colon\left|\frac{1}{\mu(B_{r}(x))}\int_{B_{r}(x)}\eta(x)\,d\mu(x)-\eta(x)\right|>{\epsilon}\right\}\to 0\mbox{ when }r\downarrow 0.

The probability measure $\mu$ above is the sample distribution law. The proof extended the ideas of the paper [4], in which it was previously observed that Stone's universal consistency result can be deduced from the classical Lebesgue–Besicovitch differentiation theorem: every $L^{1}(\mu)$-function $f$ on ${\mathbb{R}}^{d}$ satisfies Eq. (1), even in the strong sense (convergence almost everywhere). See also [8].

Those separable metric spaces in which the weak Lebesgue–Besicovitch differentiation property holds for every Borel probability measure (equivalently, for every sigma-finite locally finite Borel measure) have not yet been characterized. But the complete separable metric spaces in which the strong Lebesgue–Besicovitch differentiation property holds for every such measure have been described by Preiss [18]: they are exactly those spaces that are sigma-finite dimensional in the sense of Nagata [13, 16]. (For finite-dimensional spaces in the sense of Nagata, the sketch of a proof by Preiss, in the sufficiency direction, was elaborated by Assouad and Quentin de Gromard in [1]. The completeness assumption on the metric space is only essential for the necessity part of the result.) In particular, it follows that every sigma-finite dimensional separable metric space satisfies the weak Lebesgue–Besicovitch differentiation property for every probability measure.

Combining the result of Preiss with that of Cérou–Guyader, one concludes that the $k$-NN classifier is universally consistent in every sigma-finite dimensional separable metric space, as was noted in [2].

The authors of [2] mention in their paper that “[Stone's theorem] is based on a geometrical result, known as Stone's Lemma. This powerful and elegant argument can unfortunately not be generalized to infinite dimension.” The aim of this article is to show that at least Stone's original proof, including Stone's geometric lemma as its main tool, can be extended from the Euclidean case to sigma-finite dimensional metric spaces. In fact, as we will show, the geometry behind Stone's lemma, even if it appears to be essentially based on the Euclidean structure of the space, is captured by the notion of Nagata dimension, which is a purely metric concept. In this way, Stone's geometric lemma, and indeed Stone's original proof of the universal consistency of the $k$-NN classifier, become applicable to a wide range of metric spaces.

In the absence of distance ties (that is, in the case where every sphere is a $\mu$-negligible set with regard to the underlying measure $\mu$), the extension is quite straightforward, indeed almost literal. However, this is not so in the presence of distance ties: an example shows that the conclusion of Stone's geometric lemma may fail. Another example shows that even in compact metric spaces of Nagata dimension zero, distance ties may be ever-present. We also show that an attempt to reduce the general case to the situation without distance ties by learning in the product of $\Omega$ with the unit interval (an additional random variable used for tie-breaking) cannot work, because already the product of a zero-dimensional space in the sense of Nagata with the interval (which has dimension one) can have infinite Nagata dimension. Stone's geometric lemma has to be modified, to parallel the Hardy–Littlewood inequality in geometric measure theory.

We do not touch upon the subject of strong universal consistency in general metric spaces. The main open question left is whether every metric space in which the $k$-NN classifier is universally consistent is necessarily sigma-finite dimensional. A positive answer, modulo the work of [2] and [18], would also answer in the affirmative an open question in real analysis going back to Preiss: if a metric space $X$ satisfies the weak Lebesgue–Besicovitch differentiation property for every sigma-finite locally finite Borel measure, will it satisfy the strong Lebesgue–Besicovitch differentiation property for every such measure?

1. Setting for statistical learning

Here we will recall the standard probabilistic model of statistical learning theory. The domain, $\Omega$, is a standard Borel space, that is, a set equipped with a sigma-algebra which coincides with the sigma-algebra of Borel sets generated by a suitable separable complete metric. (Recall that the Borel structure generated by a metric $\rho$ on a set $\Omega$ is the smallest sigma-algebra containing all open subsets of the metric space $(\Omega,\rho)$.) The distribution laws of data points, both unlabelled and labelled, are Borel probability measures defined on the corresponding Borel sigma-algebra.

Since we will be dealing with the $k$-NN classifier, the domain $\Omega$ will actually be a metric space, which we also assume to be separable.

Labelled data pairs $(x,y)$, where $x\in\Omega$ and $y\in\{0,1\}$, will follow an unknown probability distribution $\tilde{\mu}$, that is, a Borel probability measure on $\Omega\times\{0,1\}$. We denote the corresponding random element $(X,Y)\sim\tilde{\mu}$. Define two Borel measures $\mu_{i}$, $i=0,1$, on $\Omega$ by $\mu_{i}(A)=\tilde{\mu}(A\times\{i\})$. In this way, $\mu_{0}$ governs the distribution of the elements labelled $0$, and similarly for $\mu_{1}$. The sum $\mu=\mu_{0}+\mu_{1}$ (the direct image of $\tilde{\mu}$ under the projection from $\Omega\times\{0,1\}$ onto $\Omega$) is a Borel probability measure on $\Omega$, the distribution law of unlabelled data points. Clearly, $\mu_{i}$ is absolutely continuous with regard to $\mu$: if $\mu(A)=0$, then $\mu_{i}(A)=0$ for $i=0,1$. The corresponding Radon–Nikodým derivative in the case $i=1$ is just the conditional probability for a point $x$ to be labelled $1$:

\eta(x)=\frac{d\mu_{1}}{d\mu}(x)=P[Y=1|X=x].

In statistical terminology, $\eta$ is the regression function.

Figure 1. Labelled domain and the projection $\pi\colon\Omega\times\{0,1\}\to\Omega$.

Together with the Borel probability measure $\mu$ on $\Omega$, the regression function allows for an alternative, and often more convenient, description of the joint law $\tilde{\mu}$. Namely, given $A\subseteq\Omega$,

\mu_{1}(A)=\int_{A}\eta(x)\,d\mu,

and

\mu_{0}(A)=\int_{A}(1-\eta(x))\,d\mu,

which allows us to reconstruct the measure $\tilde{\mu}$ on $\Omega\times\{0,1\}$.

Let ${\mathcal{B}}(\Omega,\{0,1\})$ denote the collection of all Borel measurable binary functions on the domain, that is, essentially, the family of all Borel subsets of $\Omega$. Given such a function $f\colon\Omega\to\{0,1\}$ (a classifier), the misclassification error is defined by

{\mathrm{err}}_{\tilde{\mu}}(f)=\tilde{\mu}\{(x,y)\colon f(x)\neq y\}=P[f(X)\neq Y].

The Bayes error is the infimal misclassification error taken over all possible classifiers:

\ell^{\ast}=\ell^{\ast}(\tilde{\mu})=\inf_{f}{\mathrm{err}}_{\tilde{\mu}}(f).

It is a simple exercise to verify that the Bayes error is attained by some classifier (and is thus a minimum); such a classifier is called a Bayes classifier. For instance, every classifier satisfying

T_{bayes}(x)=\begin{cases}1,&\mbox{ if }\eta(x)>\frac{1}{2},\\ 0,&\mbox{ if }\eta(x)<\frac{1}{2},\end{cases}

is a Bayes classifier.

The Bayes error is zero if and only if the learning problem is deterministic, that is, the regression function $\eta$ is equal almost everywhere to the indicator function, $\chi_{C}$, of a concept $C\subseteq\Omega$, a Borel subset of the domain.

A learning rule is a family of mappings ${\mathcal{L}}=\left({\mathcal{L}}_{n}\right)_{n=1}^{\infty}$, where

{\mathcal{L}}_{n}\colon\Omega^{n}\times\{0,1\}^{n}\to{\mathcal{B}}(\Omega,\{0,1\}),~~n=1,2,\ldots

and the functions ${\mathcal{L}}_{n}$ satisfy the following measurability assumption: the associated maps

\Omega^{n}\times\{0,1\}^{n}\times\Omega\ni(\sigma,x)\mapsto{\mathcal{L}}_{n}(\sigma)(x)\in\{0,1\}

are Borel (or just universally measurable). Here $\sigma=(x_{1},\ldots,x_{n},y_{1},\ldots,y_{n})$ is a labelled learning sample.

The data is modelled by a sequence of independent identically distributed random elements $(X_{n},Y_{n})$ of $\Omega\times\{0,1\}$, following the law $\tilde{\mu}$. Denote by $\varsigma$ an infinite sample path. In this context, ${\mathcal{L}}_{n}$ only gets to see the first $n$ labelled coordinates of $\varsigma$. A learning rule ${\mathcal{L}}$ is weakly consistent, or simply consistent, if ${\mathrm{err}}_{\tilde{\mu}}\,{\mathcal{L}}_{n}(\varsigma)\to\ell^{\ast}$ in probability as $n\to\infty$. If the convergence occurs almost surely (that is, along almost all sample paths $\varsigma\sim\tilde{\mu}^{\infty}$), then ${\mathcal{L}}$ is said to be strongly consistent. Finally, ${\mathcal{L}}$ is universally (weakly / strongly) consistent if it is weakly / strongly consistent under every Borel probability measure $\tilde{\mu}$ on the standard Borel space $\Omega\times\{0,1\}$.

The learning rule we study is the $k$-NN classifier, defined by selecting the label ${\mathcal{L}}_{n}(\sigma)(x)\in\{0,1\}$ by the majority vote among the values of $y$ corresponding to the $k=k_{n}$ nearest neighbours of $x$ in the learning sample $\sigma$.

If $k$ is even, then a voting tie may occur. This is of lesser importance, and can be broken in any way, for instance by always assigning the value $1$ in case of a voting tie, or by choosing the value randomly. The consistency results usually do not depend on it. Intuitively, if voting ties keep occurring asymptotically at a point $x$ along a sample path, it means that $\eta(x)=1/2$, and so any value of the classifier assigned to $x$ would do.

It may also happen that the smallest closed ball containing the $k$ nearest neighbours of a point $x$ contains more than $k$ elements of a sample (distance ties). This situation is more difficult to manage and requires a consistent tie-breaking strategy, whose choice may affect the consistency results.

Given $k$ and $n\geq k$, we define $r^{\varsigma_{n}}_{k\mbox{\tiny-NN}}(x)$ as the smallest radius of a closed ball around $x$ containing at least $k$ nearest neighbours of $x$ in the sample $\varsigma_{n}$:

r^{\varsigma_{n}}_{k\mbox{\tiny-NN}}(x)=\min\{r\geq 0\colon\sharp\{i=1,2,\ldots,n\colon x_{i}\in\bar{B}_{r}(x)\}\geq k\}. \qquad (2)

As the corresponding open ball around $x$ contains at most $k-1$ elements of the sample, the ties may only occur on the sphere.
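As an illustration (not part of the paper's formalism), the radius in Eq. (2) is simply the $k$-th smallest of the distances $d(x,x_{i})$; a minimal Python sketch with hypothetical helper names:

```python
import numpy as np

def knn_radius(dist_to_x, k):
    # smallest r >= 0 such that the closed ball of radius r around x contains
    # at least k sample points (Eq. (2)): the k-th smallest distance d(x, x_i)
    d = np.sort(np.asarray(dist_to_x, dtype=float))
    return d[k - 1]

# toy check in the Euclidean plane
rng = np.random.default_rng(0)
sample = rng.normal(size=(20, 2))
x = np.zeros(2)
dist = np.linalg.norm(sample - x, axis=1)
r = knn_radius(dist, k=3)
print(r, int(np.sum(dist <= r)))   # the count is >= 3 (equal to 3 if no ties)
```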

We adopt the combinatorial notation $[n]=\{1,2,\ldots,n\}$. If $\sigma\in\Omega^{n}$ and $\sigma^{\prime}\in\Omega^{k}$, $k\leq n$, the symbol

\sigma^{\prime}\sqsubset\sigma

means that there is an injection $f\colon[k]\to[n]$ such that

\forall i=1,2,\ldots,k,~\sigma^{\prime}_{i}=\sigma_{f(i)}.

A $k$ nearest neighbour map is a function

k\mbox{-NN}^{\sigma}\colon\Omega^{n}\times\Omega\to\Omega^{k}

with the properties

  1. $k\mbox{-NN}^{\sigma}(x)\sqsubset\sigma$, and

  2. all points $x_{i}$ in $\sigma$ that are at a distance strictly less than $r^{\varsigma_{n}}_{k\mbox{\tiny-NN}}(x)$ from $x$ are in $k\mbox{-NN}^{\sigma}(x)$.

The mapping $k\mbox{-NN}^{\sigma}$ can be deterministic or stochastic, in which case it will depend on an additional random variable, independent of the sample.

An example of the former kind is based on the natural order on the sample, $x_{1}<x_{2}<\ldots<x_{n}$. In this case, from among the points belonging to the sphere of radius $r_{k\mbox{\tiny-NN}^{\sigma}}(x)$ around $x$ we choose the points with the smallest indices: $k\mbox{-NN}^{\sigma}(x)$ contains all the points of $\sigma$ in the open ball $B_{r_{k\mbox{\tiny-NN}^{\sigma}}(x)}(x)$, plus the necessary number (at least one) of points of $\sigma\cap S_{r_{k\mbox{\tiny-NN}^{\sigma}}(x)}(x)$ having the smallest indices.

An example of the second kind is to use a similar procedure, after first applying a random permutation to the indices. A random learning input will consist of a pair $(W_{n},P_{n})$, where $W_{n}$ is a random $n$-sample and $P_{n}$ is a random element of the group of permutations of $[n]$. An equivalent (and more common) way is to use a sequence of i.i.d. random elements $Z_{n}$ of the unit interval or the real line, distributed according to the uniform (resp. Gaussian) law, and in case of a tie, give preference to a realization $x_{i}$ over $x_{j}$ provided the value $z_{i}$ is smaller than $z_{j}$.
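Both tie-breaking schemes admit a short illustration. The following sketch (the names `knn_indices`, `priority` are ours, not from the paper) selects a $k$-NN set either by smallest index or after a uniformly random permutation of the indices:

```python
import numpy as np

def knn_indices(dist_to_x, k, rng=None):
    # Select indices of a k-NN set consistent with the definition above:
    # every point strictly inside the k-NN radius is taken, and ties on the
    # sphere are broken by smallest index (rng=None) or uniformly at random.
    d = np.asarray(dist_to_x, dtype=float)
    n = len(d)
    priority = np.arange(n) if rng is None else rng.permutation(n)
    order = np.lexsort((priority, d))   # primary key: distance, secondary: priority
    return order[:k]

# toy usage: four points, two of them tied at distance 1 from x
d = np.array([1.0, 0.5, 1.0, 2.0])
print(knn_indices(d, k=2))                                 # deterministic: [1, 0]
print(knn_indices(d, k=2, rng=np.random.default_rng(7)))   # random: [1, 0] or [1, 2]
```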

Now, a formal definition of the $k$-NN learning rule can be given as follows:

{\mathcal{L}}^{k\mbox{\tiny-NN}}_{n}(\sigma,{\epsilon})(x)=\chi_{[0,\infty)}\left[\frac{1}{k}\sum_{x_{i}\in k\mbox{\tiny-NN}^{\sigma}(x)}{\epsilon}_{i}-\frac{1}{2}\right]=\chi_{[0,\infty)}\left[{\mathbb{E}}_{\mu_{k\mbox{\tiny-NN}^{\sigma}(x)}}{\epsilon}-\frac{1}{2}\right].

Here, $\chi_{[0,\infty)}$ is the Heaviside function, the indicator function of the half-line $[0,\infty)$:

\chi_{[0,\infty)}(t)=\begin{cases}1,&\mbox{ if }t\geq 0,\\ 0,&\mbox{ if }t<0.\end{cases}

The empirical measure $\mu_{k\mbox{\tiny-NN}^{\sigma}(x)}$ is the uniform measure supported on the set of $k$ nearest neighbours of $x$ within the sample $\sigma$, and the label ${\epsilon}$ is seen as a function ${\epsilon}\colon\{x_{1},x_{2},\ldots,x_{n}\}\to\{0,1\}$.
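Putting the pieces together, here is a compact sketch of the rule (uniform tie-breaking as above; the function and variable names are ours, purely illustrative):

```python
import numpy as np

def knn_classify(x, sample, labels, k, metric, rng):
    # k-NN rule with uniform distance tie-breaking; a voting tie is resolved
    # in favour of the label 1, matching the Heaviside convention chi_[0,inf).
    dist = np.array([metric(x, xi) for xi in sample])
    order = np.lexsort((rng.permutation(len(sample)), dist))
    eta_nk = np.mean(labels[order[:k]])      # empirical regression function, Eq. (3)
    return int(eta_nk >= 0.5)

# toy usage on the real line with the deterministic concept C = (0.5, 1]
rng = np.random.default_rng(1)
X = rng.uniform(size=200)
Y = (X > 0.5).astype(int)
print(knn_classify(0.7, X, Y, k=5, metric=lambda a, b: abs(a - b), rng=rng))  # 1
```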

The expression appearing in the argument of the Heaviside function above,

\eta_{n,k}=\frac{1}{k}\sum_{x_{i}\in k\mbox{\tiny-NN}^{\sigma}(x)}{\epsilon}_{i}, \qquad (3)

is the empirical regression function. In the presence of a law of labelled points, it is a random variable, and so we have the following immediate, yet important, observation.

Proposition.

Let $(\mu,\eta)$ be a learning problem in a separable metric space $(\Omega,d)$. If the values of the empirical regression function, $\eta_{n,k}$, converge to $\eta$ in probability (resp. almost surely) in the region

\Omega_{\eta}=\left\{x\in\Omega\colon\eta(x)\neq\frac{1}{2}\right\},

then the $k$-NN classifier is consistent (resp. strongly consistent) under $(\mu,\eta)$.

We conclude this section by recalling an important technical tool.

Theorem (Cover–Hart lemma [3]).

Let $\Omega$ be a separable metric space, and let $\mu$ be a Borel probability measure on $\Omega$. Almost surely, the function $r^{\varsigma_{n}}_{k\mbox{\tiny-NN}}$ (Eq. (2)) converges to zero uniformly over any precompact subset $K\subseteq\mbox{supp}\,\mu$.

Proof.

Let $A$ be a countable dense subset of $\mbox{supp}\,\mu$. A standard argument shows that, almost surely, for all $a\in A$ and each rational ${\epsilon}>0$, the open ball $B_{{\epsilon}}(a)$ contains an infinite number of elements of a sample path. Consequently, the functions $r^{\varsigma_{n}}_{k\mbox{\tiny-NN}}\colon\Omega\to{\mathbb{R}}$ almost surely converge to zero pointwise on $A$ as $n\to\infty$. Since these functions are easily seen to be $1$-Lipschitz, and in particular form a uniformly equicontinuous family, we conclude. ∎

2. Example of Preiss

Here we will discuss a 1979 example of Preiss [17]. Preiss's aim was to prove that the Lebesgue–Besicovitch differentiation property (Eq. (1)) fails in the infinite-dimensional Hilbert space $\ell^{2}$. However, as already suggested in [2], his example can be easily adapted to prove that the $k$-NN learning rule is not universally consistent in the infinite-dimensional separable Hilbert space $\ell^{2}$ either.

Recall the notation $[n]=\{1,2,\ldots,n\}$. Let $(N_{k})$ be a sequence of natural numbers $\geq 2$, to be selected later. Denote by

Q=\prod_{k=1}^{\infty}[N_{k}]

the Cartesian product of finite discrete spaces equipped with the product topology. It is a Cantor space (the unique, up to a homeomorphism, totally disconnected compact metrizable space without isolated points).

Let $\pi_{k}$ denote the canonical coordinate projection of $Q$ onto the $k$-dimensional cube $Q_{k}=\prod_{i=1}^{k}[N_{i}]$. Denote by $Q^{\ast}=\cup_{k=1}^{\infty}Q_{k}$ the disjoint union of the cubes $Q_{k}$, and let ${\mathcal{H}}=\ell^{2}(Q^{\ast})$ be the Hilbert space spanned by an orthonormal basis $(e_{\bar{n}})$ indexed by the elements $\bar{n}$ of this union.

For every $\bar{n}=(n_{1},\ldots,n_{i},\ldots)\in Q$ define

f(\bar{n})=\sum_{i=1}^{\infty}2^{-i}e_{(n_{1},\ldots,n_{i})}\in{\mathcal{H}}.

The map $f$ is continuous and injective, thus a homeomorphism onto its image. Denote by $\nu$ the Haar measure on $Q$ (the product of the uniform measures on all the $[N_{k}]$). Let $\mu_{1}=f_{\ast}(\nu)$ be the direct image of $\nu$, a compactly supported Borel probability measure on ${\mathcal{H}}$. If $r>0$ satisfies $2^{-k}\leq r^{2}<2^{-k+1}$, then for each $\bar{n}=(n_{1},n_{2},\ldots)\in Q$,

\mu_{1}(B_{r}(f(\bar{n})))=\nu(\pi_{k+1}^{-1}(\bar{n}))=(N_{1}N_{2}\ldots N_{k+1})^{-1}.

Now, for every $k$ and each $\bar{n}=(n_{1},\ldots,n_{k})\in Q_{k}\subseteq Q^{\ast}$ define in a similar way

f(\bar{n})=\sum_{i=1}^{k}2^{-i}e_{(n_{1},\ldots,n_{i})}\in{\mathcal{H}}.

Note that the closure of $f(Q^{\ast})$ contains $f(Q)$ (as a proper subset). Now define a purely atomic measure $\mu_{0}$, supported on the image of $Q^{\ast}$ under $f$, having the following special form:

\mu_{0}=\sum_{k=1}^{\infty}\sum_{\bar{n}\in Q_{k}}a_{k}\delta_{f(\bar{n})}.

The weights $a_{k}>0$ are chosen so that the measure is finite:

\sum_{k=1}^{\infty}a_{k}<\infty. \qquad (4)

Since for $r$ satisfying $2^{-k}\leq r^{2}<2^{-k+1}$ and $\bar{n}\in Q$ the ball $B_{r}(f(\bar{n}))$ contains, in particular, $f(n_{1},n_{2},\ldots,n_{k})$, we have

\mu_{0}(B_{r}(f(\bar{n})))\geq a_{k}.

Assuming in addition that

a_{k}N_{1}N_{2}\ldots N_{k}N_{k+1}\to\infty\mbox{ as }k\to\infty, \qquad (5)

we conclude:

\frac{\mu_{1}(B_{r}(f(\bar{n})))}{\mu_{0}(B_{r}(f(\bar{n})))}\leq\frac{(N_{1}N_{2}\ldots N_{k+1})^{-1}}{a_{k}}\to 0\mbox{ when }r\downarrow 0.

Clearly, the conditions (4) and (5) can be simultaneously satisfied by a recursive choice of $(N_{k})$ and $(a_{k})$.
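For concreteness (this particular choice is ours, not taken from [17] or from the construction above), one may take $N_{k}=4^{k}$ and $a_{k}=2^{-k}(N_{1}\cdots N_{k})^{-1}$: then $\sum_{k}a_{k}<\infty$, in fact the total mass $\sum_{k}a_{k}\,\sharp Q_{k}=\sum_{k}2^{-k}$ stays finite, while $a_{k}N_{1}\cdots N_{k+1}=4\cdot 2^{k}\to\infty$. A sketch checking this numerically:

```python
from fractions import Fraction

prod_N = 1                       # N_1 * ... * N_k, i.e. the number of atoms on level k
total_mass = Fraction(0)
for k in range(1, 13):
    N_k = 4 ** k                 # our (illustrative) choice of the cardinalities N_k >= 2
    prod_N *= N_k
    a_k = Fraction(1, 2 ** k * prod_N)     # per-atom weight on level k
    total_mass += a_k * prod_N             # level k contributes 2**(-k) in total
    growth = a_k * prod_N * 4 ** (k + 1)   # a_k * N_1 * ... * N_{k+1} = 4 * 2**k
    print(k, float(total_mass), int(growth))
# the cumulative mass stays below 1 (so condition (4) holds and mu_0 is finite),
# while a_k * N_1 ... N_{k+1} grows without bound (condition (5)).
```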

Now renormalize the measures $\mu_{0}$ and $\mu_{1}$ so that $\mu=\mu_{0}+\mu_{1}$ is a probability measure, and interpret $\mu_{i}$ as the distribution of the points labelled $i=0,1$. Thus, the regression function is deterministic, $\eta=\chi_{C}$, where we are learning the concept $C=f(Q)={\mathrm{supp}}\,\mu_{1}$, $\mu_{1}(C)>0$.

For a random element $X\in{\mathcal{H}}$, $X\sim\mu$, the distance $r_{k}(X)$ to the $k$-th nearest neighbour within an i.i.d. $n$-sample goes to zero almost surely when $k/n\to 0$, according to the lemma of Cover and Hart, and the convergence is uniform on the precompact support of $\mu$. It follows that the probability that one of the $k$ nearest neighbours of a random point $X\in{\mathcal{H}}$ is labelled one, conditionally on $r^{\varsigma_{n}}_{k\mbox{\tiny-NN}}=r$, converges to zero, uniformly in $r$. The $k$-NN learning rule will almost surely produce a sequence of classifiers converging to the identically zero classifier, and so it is not consistent.

3. Classical theorem of Charles J. Stone

3.1. The case of continuous regression function

Proposition 1 and the Cover–Hart lemma together imply that the $k$-NN classifier is universally consistent in a separable metric space whenever the regression function $\eta$ is continuous. In view of Proposition 1, it is enough to make the following observation.

Lemma.

Let $(\Omega,\mu)$ be a separable metric space equipped with a Borel probability measure, and let $\eta$ be a continuous regression function. Then

\eta_{n,k}\to\eta

in probability, when $n,k\to\infty$, $k/n\to 0$.

Proof.

It follows from the Cover–Hart lemma that the set $k\mbox{-NN}^{\varsigma_{n}}(x)$ of the $k$ nearest neighbours of $x$ almost surely converges to $x$, for almost all $x\in\mbox{supp}\,\mu$, and since $\eta$ is continuous, the set of values $\eta(k\mbox{-NN}(x))$ almost surely converges to $\eta(x)$ in the obvious sense: for every ${\varepsilon}>0$, there exists $N$ such that

\forall n\geq N,~~\eta(k\mbox{-NN}(x))\subseteq(\eta(x)-{\varepsilon},\eta(x)+{\varepsilon}), \qquad (6)

where $k$ depends on $n$. Let ${\varepsilon}>0$ and $N$ be fixed, and denote by $P_{{\varepsilon},N}$ the set of pairs $(\varsigma,x)$ consisting of a sample path $\varsigma$ and a point $x\in\Omega$ satisfying Eq. (6). Select $N$ with the property $\mu(P_{{\varepsilon},N})>1-{\varepsilon}$. Let ${\epsilon}=({\epsilon}_{i})_{i=1}^{\infty}$ denote the sequence of labels for $\varsigma$, which is a random variable with the joint law $\otimes_{n=1}^{\infty}\{\eta(x_{i}),1-\eta(x_{i})\}$. By the above, whenever $(\varsigma,x)\in P_{{\varepsilon},N}$ and $n\geq N$, if $x_{i}$ is one of the $k$ nearest neighbours of $x$ in $\varsigma_{n}$, we have ${\mathbb{E}}\,{\epsilon}_{i}=\eta(x_{i})\in(\eta(x)-{\varepsilon},\eta(x)+{\varepsilon})$. According to a version of the Law of Large Numbers with Chernoff's bounds, the probability of the event

\frac{\sum_{x_{i}\in k\mbox{\tiny-NN}(x)}{\epsilon}_{i}}{k}\notin(\eta(x)-2{\varepsilon},\eta(x)+2{\varepsilon}) \qquad (7)

is exponentially small, bounded above by $2\exp(-2{\varepsilon}^{2}k)$. Thus, when $n\geq N$, $P[\left|\eta_{n,k}-\eta\right|\geq 2{\varepsilon}]<{\varepsilon}+2\exp(-2{\varepsilon}^{2}k)$, and we conclude. ∎

Remark.

In the most general case (with the uniform tie-breaking) we can only infer the almost sure convergence if $k=k_{n}$ grows fast enough as a function of $n$, for otherwise the series $\sum_{n=1}^{\infty}2\exp(-2{\varepsilon}^{2}k_{n})$ may diverge.

3.2. Stone's geometric lemma for ${\mathbb{R}}^{d}$

In the case of a general Borel regression function $\eta$, which can be discontinuous $\mu$-almost everywhere, where $\mu$, as before, is the sample distribution on a separable metric space, a version of the classical Luzin theorem of real analysis says that for any ${\varepsilon}>0$ there is a closed precompact set $K$ of measure $\mu(K)>1-{\varepsilon}$ on which $\eta$ is continuous. (See the Appendix.) Now we have control over the behaviour of those $k$ nearest neighbours of a point $x$ that belong to $K$: the mean value of the regression function $\eta$ taken at those $k$ nearest neighbours will converge to $\eta(x)$. However, we have no control over the behaviour of the values of $\eta$ at the $k$ nearest neighbours of $x$ that belong to the open set $U=\Omega\setminus K$. The problem is therefore to limit the influence of the remaining $\approx{\varepsilon}n$ sample points belonging to $U$. Intuitively, as the example of Preiss shows, in infinite dimensions the influence of the few points outside of $K$ can become overwhelming, no matter how close the measure of $K$ is to one.

In the Euclidean case, this goal is achieved with the help of Stone’s geometric lemma, which uses the finite-dimensional Euclidean structure of the space in a beautiful way.

Lemma (Stone's geometric lemma for ${\mathbb{R}}^{d}$).

For every natural $d$, there is an absolute constant $C=C(d)$ with the following property. Let

\sigma=(x_{1},x_{2},\ldots,x_{n}),~x_{i}\in{\mathbb{R}}^{d},~i=1,2,\ldots,n,

be a finite sample in $\ell^{2}(d)$ (possibly with repetitions), and let $x\in\ell^{2}(d)$ be arbitrary. Given $k\in{\mathbb{N}}_{+}$, the number of indices $i$ such that $x\neq x_{i}$ and $x$ is among the $k$ nearest neighbours of $x_{i}$ inside the sample

x_{1},x_{2},\ldots,x_{i-1},x,x_{i+1},\dots,x_{n} \qquad (8)

is at most $Ck$.

Figure 2. To the proof of Stone's geometric lemma (the case $k=2$).
Proof.

Cover ${\mathbb{R}}^{d}$ with $C=C(d)$ cones of central angle $<\pi/3$ with vertices at $x$. Inside each cone, mark the maximal possible number $\leq k$ of nearest neighbours of $x$ that are set-theoretically different from $x$. (The strategy for possible distance tie-breaking is unimportant.) In this way, up to $Ck$ points are marked. Now let $i$ be any index such that $x_{i}\neq x$ as a point. If $x_{i}$ has not been marked, this means the cone containing $x_{i}$ has $k$ marked points different from $x$. Consider any of the marked points inside the same cone, say $y$. A simple planimetric argument, inside an affine plane passing through $x$, $x_{i}$, and $y$, shows that

{\left\|\,x_{i}-x\,\right\|}>{\left\|\,x_{i}-y\,\right\|}, \qquad (9)

and so the $k$ nearest neighbours of $x_{i}$ inside the sample in Eq. (8) will all be among the marked points, excluding $x$. ∎
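The planimetric step is a consequence of the law of cosines: if the angle at $x$ between $x_{i}$ and $y$ is $<\pi/3$ and ${\|y-x\|}\leq{\|x_{i}-x\|}$ with $y\neq x$, then ${\|x_{i}-y\|}^{2}={\|x_{i}-x\|}^{2}+{\|y-x\|}^{2}-2{\|x_{i}-x\|}{\|y-x\|}\cos\theta<{\|x_{i}-x\|}^{2}$. A quick numerical sanity check of inequality (9) on random configurations (our own illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(10000):
    # two points in a planar cone of central angle < pi/3 with apex at x
    theta = rng.uniform(0.0, np.pi / 3 - 1e-6, size=2)
    radii = rng.uniform(0.1, 1.0, size=2)
    pts = radii[:, None] * np.column_stack([np.cos(theta), np.sin(theta)])
    far, near = (pts[0], pts[1]) if radii[0] >= radii[1] else (pts[1], pts[0])
    # the farther point is strictly closer to the nearer one than to x (Eq. (9))
    assert np.linalg.norm(far - near) < np.linalg.norm(far - x)
print("inequality (9) verified on random configurations")
```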

Remark.

Note that in the statement of Stone's geometric lemma neither the order of the sample $x_{1},x_{2},\ldots,x_{i-1},x,x_{i+1},\dots,x_{n}$ nor the tie-breaking strategy is of any importance.

Remark.

If the cones have central angle exactly $\pi/3=60^{\circ}$, then the displayed inequality in the proof of the lemma (Eq. (9)) is no longer strict. This is less convenient in the case of distance ties.

3.3. Proof of Stone’s theorem

Theorem (Charles J. Stone, 1977).

Let $k,n\to\infty$, $k/n\to 0$. Then the $k$-NN classification rule in the finite-dimensional Euclidean space ${\mathbb{R}}^{d}$ is universally consistent.

Let us begin with the case where the domain $\Omega$ is an arbitrary separable metric space, $\mu$ is a probability measure on $\Omega$, and $\eta$ is a regression function. Let ${\varepsilon}>0$ be arbitrary. Using a variation of Luzin's theorem (see the Appendix), choose a closed precompact set $K\subseteq\Omega$ with $\mu(K)>1-{\varepsilon}$ and $\eta|_{K}$ continuous. Denote $U=\Omega\setminus K$. By the Tietze–Urysohn extension theorem (see, e.g., [7], 2.1.8), the function $\eta|_{K}$ extends to a continuous $[0,1]$-valued function $\psi$ on all of $\Omega$.

Denote by $\psi_{n,k}$ the empirical regression function (Eq. (3)) corresponding to $\psi$, viewed as a regression function in its own right. We have:

{\mathbb{E}}\left|\eta-\eta_{n,k}\right|\leq\underbrace{{\mathbb{E}}\left|\eta-\psi\right|}_{(I)}+\underbrace{{\mathbb{E}}\left|\psi-\psi_{n,k}\right|}_{(II)}+\underbrace{{\mathbb{E}}\left|\psi_{n,k}-\eta_{n,k}\right|}_{(III)},

where $(I)\leq\mu(U)<{\varepsilon}$ and $(II)\to 0$ in probability by virtue of Lemma 3.1. It only remains to estimate the term $(III)$.

With this purpose, let now $\Omega=\ell^{2}(d)$, let $K$ be a compact subset of ${\mathbb{R}}^{d}$, and let $U=\Omega\setminus K$. Let $X,X_{1},X_{2},\ldots,X_{n}$ be i.i.d. random elements of ${\mathbb{R}}^{d}$ following the law $\mu$. We will estimate the expected number of the random elements $X_{i}$ of an $n$-sample that (1) belong to $U$, and (2) are among the $k$ nearest neighbours of a random element $X$ belonging to $K$. Applying the symmetrization with a transposition of the coordinates $\tau_{i}\colon X\leftrightarrow X_{i}$, as well as Stone's geometric lemma, we obtain:

{\mathbb{E}}\,\frac{1}{k}\,\sharp\{i\leq n\colon X_{i}\in k\mbox{\tiny-NN}(X),\ X\in K,\ X_{i}\notin K\}
\quad={\mathbb{E}}\,\frac{1}{k}\sum_{i=1}^{n}\chi_{U}(X_{i})\,{\mathbbm{1}}_{X_{i}\in k\mbox{\tiny-NN}(X)}{\mathbbm{1}}_{X\in K}
\quad\leq{\mathbb{E}}\,\frac{1}{k}\sum_{i=1}^{n}\chi_{U}(X_{i})\,{\mathbbm{1}}_{X_{i}\in k\mbox{\tiny-NN}(X)}{\mathbbm{1}}_{X_{i}\neq X}
\quad\leq{\mathbb{E}}\,\frac{1}{k}\sum_{i=1}^{n}\chi_{U}(X)\,{\mathbbm{1}}_{X\in k\mbox{\tiny-NN}^{(X_{1},X_{2},\ldots,X_{i-1},X,X_{i+1},\ldots,X_{n})}(X_{i})}{\mathbbm{1}}_{X_{i}\neq X}
\quad\leq{\mathbb{E}}\,\chi_{U}(X)\,\frac{1}{k}\,(kC(d))
\quad=C(d)\,\mu(U).

Getting back to the term $(III)$, in the case $\Omega={\mathbb{R}}^{d}$ we obtain (using that $|\psi_{n,k}-\eta_{n,k}|\leq 1$ everywhere, and that for $X\in K$ the functions $\psi_{n,k}$ and $\eta_{n,k}$ can differ only through those of the $k$ nearest neighbours of $X$ that fall in $U$, each contributing at most $1/k$):

(III)\leq{\mathbb{E}}\left|\psi_{n,k}(X)-\eta_{n,k}(X)\right|\mathbbm{1}_{X\in K}+{\mathbb{E}}\left|\psi_{n,k}(X)-\eta_{n,k}(X)\right|\mathbbm{1}_{X\in U}
\quad\leq C(d)\,\mu(U)+\mu(U)
\quad<(C(d)+1)\,{\varepsilon}.

Since ${\varepsilon}>0$ is as small as desired, we conclude that $\eta_{n,k}(X)\to\eta(X)$ in probability, and so the $k$-NN classifier is universally (weakly) consistent in $\ell^{2}(d)$.

4. Nagata dimension of a metric space

Recall that a family $\gamma$ of subsets of a set $\Omega$ has multiplicity $\leq\delta$ if the intersection of more than $\delta$ different elements of $\gamma$ is always empty. In other words,

\forall x\in\Omega,~~\sum_{V\in\gamma}\chi_{V}(x)\leq\delta.
Definition.

Let $\delta\in{\mathbb{N}}$, $s\in(0,+\infty]$. We say that a metric space $(\Omega,d)$ has Nagata dimension $\leq\delta$ on the scale $s>0$ if every finite family $\gamma$ of closed balls of radii $<s$ admits a subfamily $\gamma^{\prime}$ of multiplicity $\leq\delta+1$ which covers the centres of all the balls in $\gamma$. The smallest $\delta$ with this property, if it exists, and $+\infty$ otherwise, is called the Nagata dimension of $\Omega$ on the scale $s>0$. A space $\Omega$ has Nagata dimension $\delta$ if it has Nagata dimension $\delta$ on a suitably small scale $s\in(0,+\infty]$. Notation: $\dim^{s}_{Nag}(\Omega)=\delta$, or simply $\dim_{Nag}(\Omega)=\delta$.

Sometimes the following reformulation is more convenient.

Proposition.

A metric space $(\Omega,d)$ has Nagata dimension $\leq\delta$ on the scale $s\in(0,+\infty]$ if and only if it satisfies the following property. For every $x\in\Omega$, $r<s$, and every sequence $x_{1},\ldots,x_{\delta+2}\in\bar{B}_{r}(x)$, there are $i,j$, $i\neq j$, such that $d(x_{i},x_{j})\leq\max\{d(x,x_{i}),d(x,x_{j})\}$.

Proof.

Necessity ($\Rightarrow$): from the family of closed balls $B_{d(x_{i},x)}(x_{i})$, $i=1,2,\ldots,\delta+2$, all having radii $<s$, extract a subfamily of multiplicity $\leq\delta+1$ covering all the centres; since all $\delta+2$ balls contain $x$, such a subfamily consists of at most $\delta+1$ balls. One of those balls, say the one with centre at $x_{i}$, must therefore contain some $x_{j}$ with $i\neq j$, which means $d(x_{i},x_{j})\leq d(x_{i},x)\leq\max\{d(x,x_{i}),d(x,x_{j})\}$.

Sufficiency ($\Leftarrow$): let $\gamma$ be a finite family of closed balls of radii $<s$. Suppose it has multiplicity $>\delta+1$. Then there exist a point $x\in\Omega$ and $\delta+2$ balls in $\gamma$ with centres, which we denote $x_{1},\ldots,x_{\delta+2}$, all containing $x$. Denote $r=\max_{i}\{d(x,x_{i})\}$. Then $r<s$, and by the hypothesis, there are $i,j$, $i\neq j$, with $d(x_{i},x_{j})\leq\max\{d(x,x_{i}),d(x,x_{j})\}$. Without loss of generality, assume $d(x_{i},x_{j})\leq d(x_{i},x)$, that is, $x_{j}$ belongs to the ball with centre $x_{i}$. Now the ball centred at $x_{j}$ can be removed from the family $\gamma$, with the remaining family still covering all the centres and having cardinality $\lvert\gamma\rvert-1$. After finitely many steps, we arrive at a subfamily of multiplicity $\leq\delta+1$ covering all the centres. ∎
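The reformulation in the Proposition is easy to test computationally on finite configurations. A sketch (our own illustration, with made-up helper names) that checks the condition and verifies by random search that $\delta=1$ suffices on the real line (cf. the example of the real line below):

```python
import itertools, random

def has_close_pair(x, pts, metric):
    # the reformulation above: some pair i != j satisfies
    # d(x_i, x_j) <= max(d(x, x_i), d(x, x_j))
    return any(metric(p, q) <= max(metric(x, p), metric(x, q))
               for p, q in itertools.combinations(pts, 2))

# brute-force check that delta = 1 works on the real line: any 3 = delta + 2
# points of an interval around x contain such a pair
random.seed(0)
for _ in range(10000):
    x = random.uniform(-1.0, 1.0)
    pts = [x + random.uniform(-1.0, 1.0) for _ in range(3)]
    assert has_close_pair(x, pts, lambda a, b: abs(a - b))
print("Nagata condition with delta = 1 verified for random triples in R")
```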

Example.

The property of a metric space $\Omega$ having Nagata dimension zero on the scale $+\infty$ is equivalent to $\Omega$ being a non-archimedean metric space, that is, a metric space satisfying the strong triangle inequality, $d(x,z)\leq\max\{d(x,y),d(y,z)\}$.

Indeed, $\dim_{Nag}^{+\infty}(\Omega)=0$ means exactly that for any sequence of $\delta+2=2$ points, $x_{1},x_{2}$, contained in a closed ball $B_{r}(x)$, we have $d(x_{1},x_{2})\leq\max\{d(x,x_{1}),d(x,x_{2})\}$.

Example.

It follows from Proposition 4 that $\dim_{Nag}({\mathbb{R}})=1$. Let $x_{1},x_{2},x_{3}$ be three points contained in a closed ball, that is, an interval $[x-r,x+r]$. Without loss of generality, assume $x_{1}<x_{2}<x_{3}$. If $x_{2}\leq x$, then $\lvert x_{1}-x_{2}\rvert\leq\lvert x_{1}-x\rvert$, and if $x_{2}\geq x$, then $\lvert x_{3}-x_{2}\rvert\leq\lvert x_{3}-x\rvert$.

The following example suggests that the Nagata dimension is relevant for the study of the $k$-NN classifier, as it captures in an abstract context the geometry behind Stone's lemma.

Example.

The Nagata dimension of the Euclidean space $\ell^{2}(d)$ is finite, and it is bounded by $C(d)-1$, where $C(d)$ is the value of the constant in Stone's geometric lemma.

Indeed, let $x_{1},\ldots,x_{C(d)+1}$ be points belonging to a ball with centre $x$. Using the argument in the proof of Stone's geometric lemma with $k=1$, mark $\leq C(d)$ points $x_{i}$ belonging to the $\leq C(d)$ cones with apex at $x$. At least one point, say $x_{j}$, has not been marked; it belongs to some cone, which therefore already contains a marked point, say $x_{i}$, different from $x_{j}$, and ${\left\|\,x_{i}-x_{j}\,\right\|}\leq{\left\|\,x_{j}-x\,\right\|}$.

Example.

A similar argument shows that every finite-dimensional normed space has finite Nagata dimension.

Remark.

In ${\mathbb{R}}^{2}={\mathbb{C}}$ the family of closed balls of radius one centred at the vectors $\exp(2\pi ki/5)$, $k=1,2,\ldots,5$, has multiplicity $5$ and admits no proper subfamily containing all the centres. Therefore, the Nagata dimension of $\ell^{2}(2)$ is at least $5$. Since the plane can be covered with $6$ cones having the central angle $\pi/3$, Example 4 implies that $\dim_{Nag}(\ell^{2}(2))=5$.

Remark.

The problem of calculating the Nagata dimension of the Euclidean space $\ell^{2}(d)$ is mentioned as “possibly open” by Nagata [15], p. 9 (where the value $\dim_{Nag}+2$ is called the “crowding number”). Nagata also remarks that $\dim_{Nag}{\mathbb{R}}^{1}=1$ and $\dim_{Nag}\ell^{2}(3)=5$ (without a proof).

Remark.

Notice that the property of the Euclidean space established in the proof of Stone's geometric lemma is strictly stronger than the finiteness of the Nagata dimension: there exists a finite $\delta$ (in general, higher than the Nagata dimension) such that, given a sequence $x_{1},\ldots,x_{\delta+2}\in\bar{B}_{r}(x)$, $r<s$, there are $i,j$, $i\neq j$, such that $d(x_{i},x_{j})<\max\{d(x,x_{i}),d(x,x_{j})\}$. The inequality here is strict, cf. Remark 2. This is exactly the property that removes the problem of distance ties in the Euclidean space. However, adopting this as a definition in the general case would be too restrictive, removing from consideration a large class of metric spaces in which the $k$-NN classifier is still universally consistent, such as all non-archimedean metric spaces.

Example.

Let $e_{n}$ denote the $n$-th standard basis vector in the separable Hilbert space $\ell^{2}$, that is, the sequence whose $n$-th coordinate is $1$ and the rest are zeros. The convergent sequence $(1/n)e_{n}$, $n\geq 1$, together with the limit $0$, viewed as a metric subspace of $\ell^{2}$, has infinite Nagata dimension on every scale $s>0$. This is witnessed by the family of closed balls $\bar{B}_{1/n}((1/n)e_{n})$, having zero as a common point, and having the property that every centre belongs to exactly one ball of the family. Realizing ${\mathbb{R}}$ as a continuous curve in $\ell^{2}$ without self-intersections passing through all elements of the sequence as well as the limit leads to an equivalent metric on ${\mathbb{R}}$ having infinite Nagata dimension on each scale.

Remark.

The Nagata–Ostrand theorem [13, 16] states that the Lebesgue covering dimension of a metrizable topological space is the smallest Nagata dimension of a compatible metric on the space (and in fact this is true on every scale $s>0$, [12]). This is the historical origin of the concept of the metric dimension.

There appears to be no single comprehensive reference to the concept of Nagata dimension. Various results are scattered in the journal papers [1, 12, 13, 15, 16, 18], see also the book [14], pages 151–154.

Metric spaces of finite Nagata dimension admit an almost literal version of Stone's geometric lemma in the case where the sample has no distance ties, that is, the values of the distances $d(x_{i},x_{j})$, $i\neq j$, are all pairwise distinct.

Lemma (Stone's geometric lemma, finite Nagata dimension, no ties).

Let $\Omega$ be a metric space of Nagata dimension $\delta<\infty$ on a scale $s>0$. Let

\sigma=(x_{1},x_{2},\ldots,x_{n}),~x_{i}\in\Omega,~i=1,2,\ldots,n,

be a finite sample in $\Omega$, and let $x\in\Omega$ be arbitrary. Suppose there are no distance ties inside the sample

x,x_{1},x_{2},\ldots,x_{i-1},x_{i},x_{i+1},\dots,x_{n}, \qquad (10)

and $k$ is such that, inside the above sample, $r_{k\mbox{\tiny-NN}}(x_{i})<s$ for all $i$. Then the number of indices $i$ with the property that $x\neq x_{i}$ and $x$ is among the $k$ nearest neighbours of $x_{i}$ inside the sample above is at most $(k+1)(\delta+1)$.

Proof.

Suppose that the indices $i=1,2,\ldots,m$ are such that $x_{i}\neq x$ and $x_{i}$ has $x$ among its $k$ nearest neighbours inside the sample as in Eq. (10). The family $\gamma$ of closed balls $B_{r_{k\mbox{\tiny-NN}}(x_{i})}(x_{i})$, $i\leq m$, admits a subfamily $\gamma^{\prime}$ of multiplicity $\leq\delta+1$ covering all the points $x_{i}$, $i\leq m$. Since there are no distance ties, every ball belonging to $\gamma$ contains $\leq k+1$ points of the sample. It follows that $\sharp\gamma^{\prime}\geq m/(k+1)$. All the balls in $\gamma^{\prime}$ contain $x$, and we conclude: $\sharp\gamma^{\prime}\leq\delta+1$. The result follows. ∎

Now the same argument as in the original proof of Stone (Subsection 3.3) shows that the $k$-NN classifier is consistent under each distribution $\tilde{\mu}$ on $\Omega\times\{0,1\}$ with the property that distance ties occur with zero probability. Since we are going to give a proof of a more general result, we will not repeat the argument here; we only mention that, due to the Cover–Hart lemma, if $n$ is sufficiently large, then with arbitrarily high probability the $k$ nearest neighbours of a random point inside a random sample will all lie at a distance $<s$.

5. Distance ties

In this section we will construct a series of examples to illustrate the difficulties arising in the presence of distance ties in general metric spaces, which are absent in the Euclidean case. The fundamental difference between the two situations is that the inequality in the equivalent definition of the Nagata dimension (Proposition 4) is, unlike in the Euclidean space, no longer strict.

As we have already noted (Remark 2), the conclusion of Stone's geometric lemma remains valid even if we allow an adversary to break the distance ties and pick the $k$ nearest neighbours. Our first example shows that this is no longer the case in a metric space of finite Nagata dimension.

Example.

Consider a finite set $\sigma=\{x_{1},x_{2},\ldots,x_{n}\}$ with $n\geq k$ points, and assume that in the metric space $\sigma\cup\{x\}$ all $n+1$ points are pairwise at distance one from each other. The Nagata dimension of the metric space $\sigma\cup\{x\}$ is equal to $\delta=0$. Indeed, if a family $\gamma$ of closed balls contains a ball of radius $\geq 1$, that ball already covers $\sigma\cup\{x\}$ on its own. Otherwise, we choose one ball of radius $<1$ (that is, a singleton) for each centre. The multiplicity of the selected subfamily is $1$ in each case.

Now let us discuss the distance ties. For any element $x_{i}$ of $\sigma$, the remaining $n$ points of $\sigma\cup\{x\}$ are tied between themselves as the possible $k$ nearest neighbours. The adversary may decide to always select $x$ among them, thus invalidating the conclusion of Stone's geometric lemma.

However, the problem is easily resolved if we break the distance ties using a uniform distribution on the nearest-neighbour candidates. In this case, the expected number of indices $i$ such that $x$ is chosen as one of the $k$ nearest neighbours of $x_{i}$ within the sample $\{x_{1},x_{2},\ldots,x,\ldots,x_{n}\}$ is obviously $k$.
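A quick simulation (our own illustration) confirming this count: each $x_{i}$ selects $x$ with probability $k/n$, so the total expected number of selections is $k$.

```python
import random

def expected_hits(n, k, trials=20000, seed=0):
    # all n + 1 points (x and the sample x_1, ..., x_n) are pairwise at
    # distance 1; each x_i draws its k nearest neighbours uniformly among
    # the other n points, and we count how often x is drawn
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        for _ in range(n):                       # one uniform draw per x_i
            if 0 in rng.sample(range(n), k):     # slot 0 stands for the point x
                total += 1
    return total / trials

print(expected_hits(n=40, k=4))   # close to k = 4
```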

Remark.

It is worth observing that in the Euclidean case $\Omega={\mathbb{R}}^{d}$ the size of a sample inheriting a $0$-$1$ distance is bounded above in terms of the dimension $d$.

The next example shows that Stone’s geometric lemma in finite dimensional metric spaces cannot be saved even with the uniform tie-breaking.

Example.

There exists a countable metric space $\sigma=\{x_{1},x_{2},\ldots\}$ of Nagata dimension $0$, having the following property. Given $N\in{\mathbb{N}}$, for a sufficiently large $n$ the expected number of points $x_{i}\neq x_{1}$ within the sample $\sigma_{n}=\{x_{1},\ldots,x_{n}\}$ having $x_{1}$ as the nearest neighbour under the uniform tie-breaking is $\geq N$.

We will construct $\sigma$ recursively. Let $\sigma_{1}=\{x_{1}\}$. Add $x_{2}$ at distance $1$ from $x_{1}$, and set $\sigma_{2}=\{x_{1},x_{2}\}$. If $\sigma_{n}$ has already been defined, add $x_{n+1}$ at distance $2^{n}$ from all the existing points $x_{i}$, $i\leq n$, and set $\sigma_{n+1}=\sigma_{n}\cup\{x_{n+1}\}$. It is clear that the distance so defined is a metric.

We will verify by induction on $n$ that $\dim_{Nag}(\sigma_{n})=0$ on the scale $s=+\infty$. For $n=1$ this is trivially true. Assume the statement holds for $\sigma_{n}$, and let $\gamma$ be a family of closed balls in $\sigma_{n+1}$. If one of those balls contains all the points, there is nothing to prove. Assume not, that is, all the balls in $\gamma$ have radii smaller than $2^{n+1}$. Choose a subfamily of multiplicity $1$ consisting of balls centred at elements of $\sigma_{n}$ and covering them all, and add one ball centred at $x_{n+1}$ (which is a singleton). It now follows that $\dim_{Nag}\sigma=0$ as well.

Finally, let us show that if $n$ is sufficiently large, then the expected number of indices $i$ such that $x=x_{1}$ is the nearest neighbour of $x_{i}$ under the uniform tie-breaking is as large as desired. With this purpose, for each $i\geq 2$ we will calculate the probability of the event $x_{1}\in NN(x_{i})$, where $NN(x_{i})$ denotes the set of nearest neighbours of $x_{i}$ in the rest of the finite sample $\sigma_{n}$.

For $x_{2}$, the unique nearest neighbour within $\sigma_{n}$ is $x_{1}$, therefore ${\mathbb{E}}[x_{1}\in NN(x_{2})]=1$. For $x_{3}$, there are two points in $\sigma_{n}$ at the same (smallest) distance from $x_{3}$, namely $x_{1}$ and $x_{2}$, each of which is chosen with probability $1/2$, therefore ${\mathbb{E}}[x_{1}\in NN(x_{3})]=1/2$. For arbitrary $i\geq 2$, in a similar way, ${\mathbb{E}}[x_{1}\in NN(x_{i})]=1/(i-1)$. We conclude:

{\mathbb{E}}\,\sharp\{i=2,\ldots,n\colon x_{1}\in NN(x_{i})\}=\sum_{i=2}^{n}\frac{1}{i-1}=\sum_{j=1}^{n-1}\frac{1}{j},

and the partial sums of the harmonic series tend to $+\infty$ as $n\to\infty$.
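The expectation can also be checked by a direct simulation of the uniform tie-breaking in the space constructed above (our own illustration; only the fact that the nearest neighbours of $x_{i}$ within $\sigma_{n}$ are exactly $x_{1},\ldots,x_{i-1}$, all tied at one distance, is used):

```python
import random

def simulate_hub_count(n, trials=2000, seed=0):
    # uniform tie-breaking picks x_1 as the nearest neighbour of x_i (i >= 2)
    # with probability 1/(i-1), since x_1, ..., x_{i-1} are all tied
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += sum(rng.randrange(i - 1) == 0 for i in range(2, n + 1))
    return total / trials

n = 200
harmonic = sum(1.0 / j for j in range(1, n))
print(simulate_hub_count(n), harmonic)    # both approximately H_{n-1} ~ 5.9
```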

Can it be that the distance ties are in some sense extremely rare? Even this expectation is unfounded.

Example.

Given a value $\delta>0$ (risk) and a sequence $n^{\prime}_{k}\uparrow+\infty$, there exist a compact metric space of Nagata dimension zero (a Cantor space with a suitable compatible metric), equipped with a non-atomic probability measure, and a sequence $n_{k}\geq n^{\prime}_{k}$, $k/n_{k}\to 0$, with the following property. With confidence $>1-\delta$, for every $k$, a random element $X$ has $\geq n_{k}$ distance ties among its $k$ nearest neighbours within a random $n_{k+1}$-sample $\sigma$.

The space $\Omega$, just like in the Preiss example (Sect. 2), is the direct product $\prod_{k=1}^{\infty}[N_{k}]$ of finite discrete spaces, whose cardinalities $N_{k}\geq 2$ will be chosen recursively, and $[N_{k}]=\{1,2,\ldots,N_{k}\}$. The metric is given by the rule

d(\sigma,\tau)=\begin{cases}0,&\mbox{ if }\sigma=\tau,\\ 2^{-\min\{i\colon\sigma_{i}\neq\tau_{i}\}},&\mbox{ otherwise.}\end{cases}

This metric induces the product topology and is non-archimedean, so the Nagata dimension of $\Omega$ is zero (Example 4). The measure $\mu$ is the product of the uniform measures $\mu_{N_{k}}$ on the spaces $[N_{k}]$. This measure is non-atomic, and in particular, $\mu$-almost all distance ties occur at a strictly positive distance from a random element $X$.

Choose a sequence $(\delta_{i})$ with $\delta_{i}>0$ and $2\sum_{i}\delta_{i}=\delta$. Choose $N_{1}$ so large that, with probability $>1-\delta_{1}$, $n_{1}=n^{\prime}_{1}$ independent random elements following the uniform distribution on the space $[N_{1}]$ are pairwise distinct. Now let $n_{2}\geq n^{\prime}_{2}$ be so large that, with probability $>1-\delta_{1}$, if $n_{2}$ independent random elements follow the uniform distribution on $[N_{1}]$, then each element of $[N_{1}]$ appears among them at least $n_{1}$ times.

Suppose that $n_{1},N_{1},n_{2},N_{2},\ldots,n_{k}$ have been chosen. Let $N_{k}$ be so large that, with probability $>1-\delta_{k}$, $n_{k}$ i.i.d. random elements uniformly distributed in $[N_{k}]$ are pairwise distinct. Choose $n_{k+1}\geq n^{\prime}_{k+1}$ so large that, with probability $>1-\delta_{k}$, if $n_{k+1}$ i.i.d. random elements are uniformly distributed in $\prod_{i=1}^{k}[N_{i}]$, then each element of $\prod_{i=1}^{k}[N_{i}]$ appears among them at least $n_{k}$ times.

Let $k$ be any positive natural number. Choose $n_{k+1}+1$ i.i.d. random elements $X,X_{1},\ldots,X_{n_{k+1}}$ of $\Omega$, following the distribution $\mu$. With probability $>1-2\delta_{k}$, the following occurs: there are $n_{k}$ elements of the sample $X_{1},X_{2},\ldots,X_{n_{k+1}}$ which have the same $i$-th coordinates as $X$ for $i=1,2,3,\ldots,k$, yet the $(k+1)$-st coordinates of $X$ and of these $n_{k}$ elements are all pairwise distinct. In this way, the distances between $X$ and all those $n_{k}$ elements are equal to $2^{-k-1}$. We have $n_{k}$ distance ties among the $k$ nearest neighbours of $X$ (which are all at the same distance as the nearest neighbour of $X$), and $n_{k}\geq n_{k}^{\prime}$, as desired.
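A sketch of this metric (our own illustration; the recursive choice of the $N_{k}$ and $n_{k}$ is not reproduced), showing how the distance ties arise:

```python
def cantor_distance(sigma, tau):
    # d(sigma, tau) = 2**(-min{i : sigma_i != tau_i}), positions counted from 1;
    # sequences are represented by equal-length tuples for illustration
    for i, (s, t) in enumerate(zip(sigma, tau), start=1):
        if s != t:
            return 2.0 ** (-i)
    return 0.0

# distance ties: all sequences agreeing with x in the first two coordinates and
# differing from it in the third sit at the same distance 2**(-3) from x
x  = (1, 1, 1, 1)
y1 = (1, 1, 2, 1)
y2 = (1, 1, 3, 2)
print(cantor_distance(x, y1), cantor_distance(x, y2))   # 0.125 0.125
```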

Now, it would be tempting to try and reduce the general case to the case of zero probability of ties, as follows. Recall that the $\ell^{1}$-type direct sum of two metric spaces, $(X,d_{X})$ and $(Y,d_{Y})$, is the direct product $X\times Y$ equipped with the coordinatewise sum of the two metrics:

d((x_{1},y_{1}),(x_{2},y_{2}))=d_{X}(x_{1},x_{2})+d_{Y}(y_{1},y_{2}).

Notation: $X\oplus_{1}Y$.

Let $\Omega$ be a domain, that is, a metric space equipped with a probability measure $\mu$ and a regression function $\eta$. Form the $\ell^{1}$-type direct sum $\Omega\oplus_{1}[0,{\varepsilon}]$, and equip it with the product measure $\mu\otimes\lambda$ (where $\lambda$ is the normalized Lebesgue measure on the interval) and the regression function $\eta\circ\pi_{1}$, where $\pi_{1}$ is the projection onto the first coordinate. It is easy to see that the probability of distance ties in the space $\Omega\oplus_{1}[0,{\varepsilon}]$ is zero, and every uniform distance tie-breaking within a given finite sample will occur for a suitably small ${\varepsilon}>0$. In this way, one could derive the consistency of the classifier by conditioning. However, we will now give an example of two metric spaces, of Nagata dimension $0$ and $1$ respectively, whose $\ell^{1}$-type sum has infinite Nagata dimension. This is again very different from what happens in the Euclidean case.

Example.

Fix $\alpha>0$. Let $\Omega=\{x_{n}\colon n\in{\mathbb{N}}\}$, equipped with the following distance:

d(x_{i},x_{j})=\begin{cases}0,&\mbox{ if }i=j,\\ \sum_{k=1}^{j}\alpha^{k},&\mbox{ if }i<j.\end{cases}

For $i<j<k$,

d(x_{i},x_{k})=\sum_{m=1}^{k}\alpha^{m}=d(x_{j},x_{k})>d(x_{i},x_{j})=\sum_{m=1}^{j}\alpha^{m},

from which it follows that $d$ is an ultrametric. Thus, $\Omega$ is a metric space of Nagata dimension $0$.

The interval ${\mathbb{I}}=[0,1]$ has Nagata dimension $1$ (Ex. 4). Now let us consider the $\ell^{1}$-type sum $\Omega\oplus_{1}{\mathbb{I}}$. Let $0<\beta<\alpha<1$ and $\beta<1/2$. Consider the infinite sequence

z_{i}=(x_{i},\beta^{i})\in\Omega\oplus_{1}{\mathbb{I}}

and the point

z=(x_{0},0).

Whenever $i<j$,

d(z_{i},z_{j})=d(x_{i},x_{j})+\beta^{i}-\beta^{j}\geq d(x_{i},x_{0})+\alpha^{j}+\beta^{i}-\beta^{j}>d(x_{i},x_{0})+\beta^{i}=d(z_{i},z),

and also

d(z_{i},z_{j})=d(x_{i},x_{j})+\beta^{i}-\beta^{j}>d(x_{j},x_{0})+\beta^{j}=d(z_{j},z).

Together, these properties imply: for all $i\neq j$,

d(z_{i},z_{j})>\max\{d(z_{i},z),d(z_{j},z)\}.

Thus, the Nagata dimension of the $\ell^{1}$-type sum $\Omega\oplus_{1}{\mathbb{I}}$ is infinite.
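The two displayed chains of inequalities can be checked numerically. The following sketch (our own illustration, with the arbitrary choice $\alpha=1/2$, $\beta=1/4$) verifies $d(z_{i},z_{j})>\max\{d(z_{i},z),d(z_{j},z)\}$ for the first few indices:

```python
alpha, beta = 0.5, 0.25    # any 0 < beta < alpha < 1 with beta < 1/2 will do

def d_omega(i, j):
    # the ultrametric on Omega: d(x_i, x_j) = sum_{k=1}^{max(i, j)} alpha**k
    return 0.0 if i == j else sum(alpha ** k for k in range(1, max(i, j) + 1))

def d_sum(p, q):
    # the l^1-type sum metric on Omega (+)_1 I
    (i, s), (j, t) = p, q
    return d_omega(i, j) + abs(s - t)

z = (0, 0.0)                                   # the point z = (x_0, 0)
zs = [(i, beta ** i) for i in range(1, 11)]    # z_i = (x_i, beta**i)
for a in range(len(zs)):
    for b in range(a + 1, len(zs)):
        zi, zj = zs[a], zs[b]
        assert d_sum(zi, zj) > max(d_sum(zi, z), d_sum(zj, z))
print("every pair z_i, z_j is farther from each other than from z")
```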

The above examples show that beyond the Euclidean setting, we have to put up with the possibility that some points in a sample will appear disproportionately often among the $k$ nearest neighbours of other points. In data science, such points are known as “hubs”, and the above (empirical) observation as the “hubness phenomenon”; see, e.g., [19] and further references therein. Stone's geometric lemma has to be generalized to allow for the possibility of a few of those “hubs”, whose number will nevertheless be limited. The lemma has to be reshaped in the spirit of the Hardy–Littlewood inequality in geometric measure theory.

To begin with, following Preiss [18], we will extend further our metric space dimension theory setting.

6. Sigma-finite dimensional metric spaces

Definition.

Say that a metric subspace $X$ of a metric space $\Omega$ has Nagata dimension $\leq\delta\in{\mathbb{N}}$ on the scale $s>0$ inside of $\Omega$ if every finite family of closed balls in $\Omega$ of radii $<s$ with centres in $X$ admits a subfamily of multiplicity $\leq\delta+1$ in $\Omega$ which covers all the centres of the original balls. The subspace $X$ has finite Nagata dimension in $\Omega$ if $X$ has finite dimension in $\Omega$ on some scale $s>0$. Notation: $\dim^{s}_{Nag}(X,\Omega)$, or sometimes simply $\dim_{Nag}(X,\Omega)$.

Following Preiss, let us call a family of balls disconnected if the centre of each ball does not belong to any other ball. Here is a mere reformulation of the above definition.

{prpstn}

For a subspace XX of a metric space Ω\Omega, one has

\dim^{s}_{Nag}(X,\Omega)\leq\delta

if and only if every disconnected family of closed balls in Ω\Omega of radii <s<s with centres in XX has multiplicity δ+1\leq\delta+1.

Proof.

Necessity. Let γ be a disconnected finite family of closed balls in Ω of radii <s with centres in X. Since by assumption \dim^{s}_{Nag}(X,\Omega)\leq\delta, the family γ admits a subfamily of multiplicity ≤δ+1 covering all the original centres. But, γ being disconnected, the only subfamily of γ that covers all the centres is γ itself.

Sufficiency. Let γ be a finite family of closed balls in Ω of radii <s with centres in X. Denote C the set of centres of those balls. Among all the disconnected subfamilies of γ (which exist: for instance, every family consisting of a single ball is disconnected) there is one, γ′, with the maximal cardinality of the set C∩∪γ′. We claim that C⊆∪γ′, which will finish the argument. Indeed, if this is not the case, there is a ball B∈γ whose centre, c∈C, does not belong to ∪γ′. Remove from γ′ all the balls with centres in B∩C and add B instead. The new family, γ′′, is disconnected and its union contains (C∩∪γ′)∪{c}, which contradicts the maximality of γ′. ∎

In definition 6, as well as in proposition 6, closed balls can be replaced with open ones. In fact, the statements remain valid if some balls in the families are closed and others open. We have the following.
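
To make the definitions concrete, here is a small Python sketch (ours, purely illustrative) that, for a family of closed balls given by centres and radii, tests whether the family is disconnected and computes its multiplicity over a finite set of test points:

    # Illustrative helpers (ours) for finite configurations of closed balls.
    def is_disconnected(centres, radii, metric):
        """True if the centre of each ball belongs to no ball other than its own."""
        return all(metric(centres[i], centres[j]) > radii[j]
                   for i in range(len(centres))
                   for j in range(len(centres)) if i != j)

    def multiplicity(points, centres, radii, metric):
        """Largest number of balls of the family containing one of the test points."""
        return max(sum(metric(p, c) <= r for c, r in zip(centres, radii))
                   for p in points)

    # On the real line a disconnected pair of closed balls (intervals) can still overlap,
    # so the multiplicity can reach 2 -- consistent with Nagata dimension 1 for R.
    d = lambda a, b: abs(a - b)
    centres, radii = [0.0, 1.0], [0.9, 0.9]
    print(is_disconnected(centres, radii, d))       # True
    print(multiplicity([0.5], centres, radii, d))   # 2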

{prpstn}

For a subspace XX of a metric space Ω\Omega, the following are equivalent.

  (1) \dim^{s}_{Nag}(X,\Omega)\leq\delta,

  (2) every finite family of balls (some open, others closed) in Ω with centres in X and radii <s admits a subfamily of multiplicity ≤δ+1 in Ω which covers all the centres of the original balls,

  (3) every finite family of open balls in Ω of radii <s with centres in X admits a subfamily of multiplicity ≤δ+1 in Ω which covers all the centres of the original balls,

  (4) every disconnected family of open balls in Ω of radii <s with centres in X has multiplicity ≤δ+1,

  (5) every disconnected family of balls (some open, others closed) in Ω of radii <s with centres in X has multiplicity ≤δ+1.

Proof.

(1) ⇒ (2): Let γ be a finite family of balls in Ω with centres in X, of radii <s, where some of the balls may be open and others closed. For every element B∈γ and each k≥2, form a closed ball B_k as follows: if B is closed, then B_k=B, and if B is open, then define B_k as the closed ball with the same centre and radius r(1−1/k), where r is the radius of B. Thus, we always have B=∪_{k=2}^{∞}B_k. Select recursively a chain of subfamilies

\gamma\supseteq\gamma_{1}\supseteq\gamma_{2}\supseteq\ldots\supseteq\gamma_{k}\supseteq\ldots

with the properties that for each kk, the family of closed balls BkB_{k}, BγkB\in\gamma_{k} has multiplicity δ+1\leq\delta+1 in Ω\Omega and covers all the centres of the balls in γ\gamma. Since γ\gamma is finite, starting with some kk, the subfamily γk\gamma_{k} stabilizes, and now it is easy to see that the subfamily γk\gamma_{k} itself has the desired multiplicity, and of course covers all the original centres.

(2) ⇒ (3): Trivially true.

(3) ⇒ (4): Same argument as in the proof of necessity in proposition 6.

(4) ⇒ (5): Let γ be a disconnected family of balls in Ω, some of which may be open and others closed, having radii <s and centred in X. For each B∈γ and ε>0, denote B_ε the open ball equal to B if B is open, and concentric with B of radius r+ε (where r is the radius of B) if B is closed. For a sufficiently small ε>0, the family {B_ε: B∈γ} is disconnected and its radii are all strictly less than s; therefore this family has multiplicity ≤δ+1 by assumption. Since B⊆B_ε for every B∈γ, the same follows for γ.

(5) ⇒ (1): condition (5) is formally even stronger than the equivalent condition for (1) established in proposition 6. ∎

{prpstn}

Let XX be a subspace of a metric space Ω\Omega, satisfying dimNags(X,Ω)δ\dim^{s}_{Nag}(X,\Omega)\leq\delta. Then dimNags(X¯,Ω)δ\dim^{s}_{Nag}(\bar{X},\Omega)\leq\delta, where X¯\bar{X} is the closure of XX in Ω\Omega.

Proof.

Let γ be a finite disconnected family of open balls in Ω of radii <s, centred in X̄. Let y∈Ω, and let γ′ consist of all balls in γ containing y. For every ball B∈γ′, denote y_B the centre and r_B the radius. Choose ε>0 so small that d(y,y_B)+3ε<r_B for every B∈γ′. For each B∈γ′, denote B′ the open ball of radius r_B−2ε>0 centred at a point x_B∈X satisfying d(x_B,y_B)<ε. If B,D∈γ′ are distinct, then d(x_B,x_D)>d(y_B,y_D)−2ε≥r_D−2ε, because y_B∉D; hence x_B∉D′, and the family {B′: B∈γ′} is disconnected, with centres in X and radii <s. Moreover, d(y,x_B)≤d(y,y_B)+ε<r_B−2ε, so y belongs to every ball B′, B∈γ′. Therefore, by proposition 6, y belongs to at most δ+1 of the balls B′, and consequently the cardinality of γ′ is bounded by δ+1. ∎

{prpstn}

If X and Y are two subspaces of a metric space Ω, having finite Nagata dimension in Ω on the scales s_1 and s_2 respectively, then X∪Y has finite Nagata dimension in Ω on the scale min{s_1,s_2}, with \dim_{Nag}(X\cup Y,\Omega)\leq\dim_{Nag}(X,\Omega)+\dim_{Nag}(Y,\Omega)+1.

Proof.

Given a finite family of balls γ in Ω of radii <min{s_1,s_2} centred at points of X∪Y, represent it as γ=γ_X∪γ_Y, where the balls in γ_X are centred in X and the balls in γ_Y are centred in Y. Choose a subfamily of γ_X of multiplicity ≤\dim_{Nag}(X,\Omega)+1 covering the centres of the balls of γ_X, and similarly for γ_Y. The union of the two subfamilies covers all the centres of γ and has multiplicity at most (\dim_{Nag}(X,\Omega)+1)+(\dim_{Nag}(Y,\Omega)+1). ∎

{dfntn}

A metric space Ω is said to be sigma-finite dimensional in the sense of Nagata if Ω=∪_{n=1}^{∞}X_n, where every subspace X_n has finite Nagata dimension in Ω on some scale s_n>0 (the scales s_n being possibly all different).

{rmrk}

Due to Proposition 6, in the above definition we can assume the subspaces XnX_{n} to be closed in Ω\Omega, in particular Borel subsets.

{rmrk}

A good reference for a great variety of metric dimensions, including the Nagata dimension, and their applications to measure differentiation theorems, is the article [1].

Now we will develop a version of Stone’s geometric lemma for general metric spaces of finite Nagata dimension.

7. From Stone to Hardy–Littlewood

{lmm}

Let σ={x_{1},x_{2},\ldots,x_{n}} be a finite sample in a metric space Ω, and let X be a subspace of finite Nagata dimension δ in Ω on a scale s>0. Let α∈(0,1] be arbitrary, and let σ′⊑σ be a sub-sample with m points. Assign to every x_i∈σ a ball B_i (which could be open or closed), centred at x_i, of radius <s. Then

\sharp\{i=1,2,\ldots,n\colon x_{i}\in X,~\sharp(B_{i}\cap\sigma^{\prime})\geq\alpha\,\sharp(B_{i}\cap\sigma)\}\leq\alpha^{-1}(\delta+1)m.
Proof.

Denote I the set of all i=1,2,…,n with x_i∈X such that

\sharp(B_{i}\cap\sigma^{\prime})\geq\alpha\,\sharp(B_{i}\cap\sigma).

According to the assumption \dim^{s}_{Nag}(X,\Omega)\leq\delta, there exists J⊆I such that the subfamily {B_j: j∈J} has multiplicity ≤δ+1 and covers all the centres x_i, i∈I. In particular, each point of σ′ belongs to at most δ+1 balls B_j, j∈J. Consequently, since the balls B_j, j∈J, cover the centres x_i, i∈I, and since J⊆I,

\alpha\,\sharp I = \alpha\,\sharp\{x_{i}\colon i\in I\}
 \leq \sum_{j\in J}\alpha\,\sharp(B_{j}\cap\sigma)
 \leq \sum_{j\in J}\sharp(B_{j}\cap\sigma^{\prime})
 \leq (\delta+1)\,\sharp\sigma^{\prime}
 = (\delta+1)m,

whence the conclusion follows. ∎

{rmrk}

In applications of the lemma, B_i will be the ball around the sample point x_i whose radius is the k-NN radius of x_i (sometimes the ball will need to be open, sometimes closed, depending on the presence of distance ties).
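
As an illustration (ours, not part of the paper), the bound of the geometric lemma can be tested numerically on the real line, where every disconnected family of balls has multiplicity at most 2, so δ=1; all names and parameter values below are illustrative.

    # Empirical illustration (ours) of the geometric lemma on the real line (delta = 1).
    import random
    random.seed(1)

    n, m, alpha, delta = 500, 40, 0.25, 1
    sigma = [random.random() for _ in range(n)]           # the sample
    sigma_prime = random.sample(range(n), m)              # indices of the sub-sample
    radii = [random.uniform(0.0, 0.1) for _ in range(n)]  # arbitrary radii < s

    def in_ball(j, i):
        """Is x_j in the closed ball B_i centred at x_i?"""
        return abs(sigma[j] - sigma[i]) <= radii[i]

    count = 0
    for i in range(n):
        total = sum(in_ball(j, i) for j in range(n))
        hits = sum(in_ball(j, i) for j in sigma_prime)
        if hits >= alpha * total:
            count += 1
    print(count, "<=", (delta + 1) * m / alpha)           # the bound of the lemma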

{lmm}

Let α,α_1,α_2≥0 and t_1,t_2∈[0,1] with t_1+t_2>0 and t_2≤1−t_1. Assume that α_1≤α and

t_{1}\alpha_{1}+(1-t_{1})\alpha_{2}\leq\alpha.

Then

\frac{t_{1}\alpha_{1}+t_{2}\alpha_{2}}{t_{1}+t_{2}}\leq\alpha.
Proof.

If α_2≤α, the conclusion is immediate. Otherwise α_2>α, and, using the assumption t_1α_1+(1−t_1)α_2≤α,

t_{1}\alpha_{1}+t_{2}\alpha_{2}=t_{1}\alpha_{1}+(1-t_{1})\alpha_{2}-(1-t_{1}-t_{2})\alpha_{2}\leq\alpha-(1-t_{1}-t_{2})\alpha_{2}\leq\alpha-(1-t_{1}-t_{2})\alpha=(t_{1}+t_{2})\alpha. ∎
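
The inequality is elementary but easy to get wrong; here is a quick random stress test (ours) of the hypotheses and the conclusion.

    # Random check (ours) of the elementary inequality above.
    import random
    random.seed(2)
    for _ in range(100000):
        alpha = random.uniform(0, 2)
        t1 = random.uniform(0, 1)
        t2 = random.uniform(0, 1 - t1)                  # t_2 <= 1 - t_1
        alpha1 = random.uniform(0, alpha)               # alpha_1 <= alpha
        # sample alpha_2 subject to t1*alpha1 + (1 - t1)*alpha2 <= alpha
        hi = (alpha - t1 * alpha1) / (1 - t1) if t1 < 1 else 10.0
        alpha2 = random.uniform(0, max(hi, 0.0))
        if t1 + t2 > 0:
            assert (t1 * alpha1 + t2 * alpha2) / (t1 + t2) <= alpha + 1e-9
    print("no counterexample found")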

{lmm}

Let x be a point and σ={x_{1},x_{2},\ldots,x_{n}} a finite sample (possibly with repetitions) in a metric space, and let σ′⊑σ be a subsample. Let α≥0, and let B be the closed ball around x of radius r_{k\mbox{\tiny-NN}}(x), which contains K elements of the sample,

\sharp\{i=1,2,\ldots,n\colon x_{i}\in B\}=K.

Suppose that the fraction of points of σ\sigma^{\prime} found in BB is no more than α\alpha,

\sharp\{i\colon x_{i}\in\sigma^{\prime},~x_{i}\in B\}\leq\alpha K,

and that the same holds for the corresponding open ball, BB^{\circ},

\sharp\{i\colon x_{i}\in\sigma^{\prime},~x_{i}\in B^{\circ}\}\leq\alpha\,\sharp\{i\colon x_{i}\in B^{\circ}\}.

Under the uniform tie-breaking of the kk nearest neighbours, the expected fraction of the points of σ\sigma^{\prime} found among the kk nearest neighbours of xx is less than or equal to α\alpha.

Proof.

We apply lemma 7 with α_1 and α_2 being the fractions of the points of σ′ among the sample points in the open ball B^∘ and on the sphere S=B∖B^∘ respectively, and with t_1=♯(σ∩B^∘)/K and t_2=(k−♯(σ∩B^∘))/K. Then t_2≤1−t_1 because k≤K; the hypothesis for the closed ball says precisely that t_1α_1+(1−t_1)α_2=♯(σ′∩B)/K≤α, and the hypothesis for the open ball says that α_1≤α. Under the uniform tie-breaking, the k nearest neighbours of x consist of all the sample points in B^∘ together with k−♯(σ∩B^∘) points chosen uniformly at random from the sphere S, and the expected fraction of points of σ′ among the randomly chosen ones equals α_2, because they are chosen following a uniform distribution. Hence the expected fraction of the points of σ′ among the k nearest neighbours of x equals (t_1α_1+t_2α_2)/(t_1+t_2), which is ≤α by lemma 7. ∎
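
For concreteness, the uniform distance tie-breaking used throughout can be implemented as follows (a sketch of ours, not the authors' code; all names are illustrative): all sample points strictly inside the k-NN radius are taken, and the remaining neighbours are drawn uniformly at random from the sphere of that radius.

    import random
    from collections import Counter

    def knn_predict_uniform_ties(x, sample, labels, k, metric, rng=random):
        """k-NN majority vote under the uniform distance tie-breaking (illustrative sketch)."""
        dists = [metric(x, y) for y in sample]
        r_k = sorted(dists)[k - 1]                              # the k-NN radius of x
        inside = [i for i, d in enumerate(dists) if d < r_k]    # open ball: always kept
        sphere = [i for i, d in enumerate(dists) if d == r_k]   # distance ties live here
        chosen = inside + rng.sample(sphere, k - len(inside))   # uniform tie-breaking
        votes = Counter(labels[i] for i in chosen)
        return max(votes, key=votes.get)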

Now we can give the promised alternative proof of the principal result, along the same lines as Stone’s original proof in the finite-dimensional Euclidean case.

{thrm}

The kk nearest neighbour classifier under the uniform distance tie-breaking is universally consistent in every metric space having sigma-finite Nagata dimension, when n,kn,k\to\infty and k/n0k/n\to 0.

Proof.

Represent Ω=∪_{n=1}^{∞}Y_n, where the Y_n have finite Nagata dimension in Ω. According to proposition 6, we can assume that the Y_n form an increasing chain, and proposition 6 allows us to assume that the Y_n are Borel sets. Let μ be an arbitrary Borel probability measure on Ω and η a measurable regression function. Given ϵ>0, there exists l such that μ(Y_l)≥1−ϵ/2. According to Luzin’s theorem (see the Appendix), there is a closed precompact subset K⊆Y_l such that η|_K is continuous and μ(K)≥1−ϵ. Applying the Tietze–Urysohn extension theorem ([7], 2.1.8), we extend the function η|_K to a continuous function g on Ω.

In the spirit of the proof of Stone’s theorem 3.3, it is enough to bound the term

(B) = {\mathbb{E}}\,\frac{1}{k}\sum_{i=1}^{n}\left|\eta(X_{i})-g(X_{i})\right|{\mathbbm{1}}_{X_{i}\in k\mbox{\tiny-NN}(X)}{\mathbbm{1}}_{X\in K}{\mathbbm{1}}_{X_{i}\notin K}
 = {\mathbb{E}}\,{\mathbb{E}}_{j\sim\mu_{\sharp}}\frac{1}{k}\sum_{i\in\{0,1,\ldots,n\}\setminus\{j\}}\left|\eta(X_{i})-g(X_{i})\right|{\mathbbm{1}}_{X_{i}\in k\mbox{\tiny-NN}(X_{j})}{\mathbbm{1}}_{X_{j}\in K}{\mathbbm{1}}_{X_{i}\notin K},

where μ\mu_{\sharp} is the uniform measure on the set {0,1,2,,n}\{0,1,2,\ldots,n\}, and we denote X0=XX_{0}=X.

Let s>0 denote a scale on which Y_l has finite Nagata dimension δ∈ℕ. Since, by the Cover–Hart lemma 1,

{\mathbb{E}}_{j\sim\mu_{\sharp}}r^{\{X_{0},\ldots,X_{n}\}\setminus\{X_{j}\}}_{k\mbox{\tiny-NN}}(X_{j})\to 0,

it suffices to estimate the term

(B^{\prime})={\mathbb{E}}\,{\mathbb{E}}_{j\sim\mu_{\sharp}}\frac{1}{k}\sum_{i\in\{0,1,\ldots,n\}\setminus\{j\}}\left|\eta(X_{i})-g(X_{i})\right|{\mathbbm{1}}_{X_{i}\in k\mbox{\tiny-NN}(X_{j})}{\mathbbm{1}}_{X_{j}\in K}{\mathbbm{1}}_{X_{i}\notin K}{\mathbbm{1}}_{r^{\{X_{0},\ldots,X_{n}\}\setminus\{X_{j}\}}_{k\mbox{\tiny-NN}}(X_{j})<s}.

We will treat the above as a sum of two expectations, (B_1) and (B_2), according to whether the k nearest neighbours of X_j inside the sample {X_0,\ldots,X_n}\setminus\{X_j\} contain at least k\sqrt{\epsilon} or fewer than k\sqrt{\epsilon} elements belonging to U=Ω∖K.

Applying the geometric lemma 7 to the closed balls of radius r^{\{X_{0},\ldots,X_{n}\}\setminus\{X_{j}\}}_{k\mbox{\tiny-NN}}(X_{j}) as well as to the corresponding open balls, together with lemma 7 on the uniform tie-breaking, we get in the first case

(B_{1}) = {\mathbb{E}}\,{\mathbb{E}}_{j\sim\mu_{\sharp}}\frac{1}{k}\sum_{i\in\{0,1,\ldots,n\}\setminus\{j\}}\left|\eta(X_{i})-g(X_{i})\right|{\mathbbm{1}}_{X_{i}\in k\mbox{\tiny-NN}(X_{j})}{\mathbbm{1}}_{X_{j}\in K}{\mathbbm{1}}_{X_{i}\notin K}{\mathbbm{1}}_{r^{\{X_{0},\ldots,X_{n}\}\setminus\{X_{j}\}}_{k\mbox{\tiny-NN}}(X_{j})<s}\times
 {\mathbbm{1}}_{\sharp\{i\colon X_{i}\in k\mbox{\tiny-NN}(X_{j}),~X_{i}\notin K\}\geq k\sqrt{\epsilon}}
 \leq {\mathbb{E}}\,\frac{1}{k}\,k\,{\epsilon}^{-1/2}(\delta+1)\frac{1}{n+1}\,\sharp\{i=0,1,\ldots,n\colon X_{i}\notin K\}
 \leq {\epsilon}^{-1/2}(\delta+1)\,{\epsilon}
 = \sqrt{{\epsilon}}\,(\delta+1),

where we have used the fact that the sum in the first line does not exceed kk. In the second case,

(B_{2}) = {\mathbb{E}}\,{\mathbb{E}}_{j\sim\mu_{\sharp}}\frac{1}{k}\sum_{i\in\{0,1,\ldots,n\}\setminus\{j\}}\left|\eta(X_{i})-g(X_{i})\right|{\mathbbm{1}}_{X_{i}\in k\mbox{\tiny-NN}(X_{j})}{\mathbbm{1}}_{X_{j}\in K}{\mathbbm{1}}_{X_{i}\notin K}{\mathbbm{1}}_{r^{\{X_{0},\ldots,X_{n}\}\setminus\{X_{j}\}}_{k\mbox{\tiny-NN}}(X_{j})<s}\times
 {\mathbbm{1}}_{\sharp\{i\colon X_{i}\in k\mbox{\tiny-NN}(X_{j}),~X_{i}\notin K\}<k\sqrt{\epsilon}}
 \leq \frac{1}{k}\,k\sqrt{\epsilon}
 = \sqrt{\epsilon}. ∎

Appendix: Luzin’s theorem

The classical Luzin theorem admits numerous variations, of which we need the following one.

{thrm}

[Luzin’s theorem] Let XX be a separable metric space (not necessarily complete), μ\mu a Borel probability measure on XX, and f:Xf\colon X\to{\mathbb{R}} a μ\mu-measurable function. Then for every ϵ>0{\epsilon}>0 there exists a closed precompact set KXK\subseteq X with μ(K)>1ϵ\mu(K)>1-{\epsilon} and such that f|Kf|_{K} is continuous.

As we could not find an exact reference to this specific version, we are including the proof.

{thrm}

Every Borel probability measure, μ\mu, on a separable metric space Ω\Omega (not necessarily complete) satisfies the following regularity condition. Let AA be a μ\mu-measurable subset of Ω\Omega. For every ε>0{\varepsilon}>0 there exist a closed subset, FF, and an open subset, UU, of Ω\Omega such that

F\subseteq A\subseteq U

and

\mu(U\setminus F)<{\varepsilon}.
Proof.

Denote 𝒜\mathcal{A} the family of all Borel subsets of Ω\Omega satisfying the conclusion of the theorem: given ε>0{\varepsilon}>0, there exist a closed set, FF, and an open set, UU, of Ω\Omega satisfying FAUF\subseteq A\subseteq U and μ(UF)<ε\mu(U\setminus F)<{\varepsilon}. It is easy to see that 𝒜\mathcal{A} forms a sigma-algebra which contains all closed subsets. Consequently, 𝒜\mathcal{A} contains all Borel sets. Since every μ\mu-measurable set differs from a suitable Borel set by a μ\mu-null set, we conclude. ∎

Proof of Luzin’s theorem.

Let (ε_n), ε_n>0, be a summable sequence with \sum_{n=1}^{\infty}ε_n=ε. Enumerate the family of all open intervals with rational endpoints: (a_n,b_n), n∈ℕ. For every n, use the regularity theorem above to select closed sets F_n⊆f^{-1}(a_n,b_n) and F′_n with F′_n∩f^{-1}(a_n,b_n)=∅ (for instance, F′_n can be taken to be the complement of the open set provided by that theorem), so that their union \tilde{F}_{n}=F_{n}\cup F^{\prime}_{n} satisfies μ(\tilde{F}_{n})>1−ε_n/2.

The measure μ, viewed as a Borel probability measure on the completion \hat{\Omega} of the metric space Ω, is tight (inner regular with respect to compact sets), so there exists a compact set Q⊆\hat{\Omega} with μ(Q)>1−ε/2.

The set

K=\bigcap_{n=1}^{\infty}\tilde{F}_{n}\cap Q

is closed and precompact in Ω, and satisfies μ(K)>1−ε. For each n, the set f^{-1}(a_n,b_n)∩K is relatively open in K, because its complement,

K\setminus f^{-1}(a_{n},b_{n})=F^{\prime}_{n}\cap K,

is closed. Since the intervals (a_n,b_n) form a base of the topology of ℝ, the restriction f|_K is continuous. ∎

This simple proof is borrowed from [11].

Concluding remarks

The following question remains open. Let Ω\Omega be a separable complete metric space in which the kk-NN classifier is universally consistent. Does it follow that Ω\Omega is sigma-finite dimensional in the sense of Nagata?

A positive answer would imply, modulo the results of Cérou and Guyader [2] and of Preiss [18], that a separable metric space Ω\Omega satisfies the weak Lebesgue–Besicovitch differentiation property for every Borel sigma-finite locally finite measure if and only if Ω\Omega satisfies the strong Lebesgue–Besicovitch differentiation property for every Borel sigma-finite locally finite measure, which would answer an old question asked by Preiss in [18].

Most of this investigation appears as a part of the Ph.D. thesis of one of the authors [10].

The authors are thankful to the two anonymous referees of the paper, whose comments have helped to considerably improve the presentation. The remaining mistakes are, of course, the authors’ own.

References

  • [1] P. Assouad, T. Quentin de Gromard, Recouvrements, derivation des mesures et dimensions, Rev. Mat. Iberoam. 22 (2006), 893–953.
  • [2] F. Cérou and A. Guyader, Nearest neighbor classification in infinite dimension, ESAIM Probab. Stat. 10 (2006), 340–355.
  • [3] T.M. Cover and P.E. Hart, Nearest neighbour pattern classification, IEEE Trans. Info. Theory 13 (1967), 21–27.
  • [4] L. Devroye, On the almost everywhere convergence of nonparametric regression function estimates, Ann. Statist. 9 (1981), 1310–1319.
  • [5] Luc Devroye, László Györfi and Gábor Lugosi, A Probabilistic Theory of Pattern Recognition, Springer–Verlag, New York, 1996.
  • [6] Hubert Haoyang Duan, Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease, M.Sc. thesis, University of Ottawa, 2014, 102 pp., arXiv:1402.0459 [cs.LG].
  • [7] R. Engelking, General Topology, Second ed., Sigma Series in Pure Mathematics, 6, Heldermann Verlag, Berlin, 1989.
  • [8] Liliana Forzani, Ricardo Fraiman, and Pamela Llop, Consistent nonparametric regression for functional data under the Stone–Besicovitch conditions, IEEE Transactions on Information Theory 58 (2012), 6697–6708.
  • [9] Stan Hatko, kk-Nearest Neighbour Classification of Datasets with a Family of Distances, M.Sc. thesis, University of Ottawa, 2015, 111 pp., arXiv:1512.00001 [stat.ML].
  • [10] Sushma Kumari, Topics in Random Matrices and Statistical Machine Learning, Ph.D. thesis, Kyoto University, 2018, 125 pp., arXiv:1807.09419 [stat.ML].
  • [11] P.A. Loeb and E. Talvila, Lusin’s theorem and Bochner integration, Sci. Math. Jpn. 60 (2004), 113–120.
  • [12] J. Nagata, On a special metric characterizing a metric space of dimn\dim\leq n, Proc. Japan Acad. 39 (1963), 278–282.
  • [13] J.I. Nagata, On a special metric and dimension, Fund. Math. 55 (1964), 181–194.
  • [14] Jun-iti Nagata, Modern dimension theory, Bibliotheca Mathematica, Vol. VI, North–Holland, Amsterdam, 1965.
  • [15] Jun-Iti Nagata, Open problems left in my wake of research, Topology Appl. 146/147 (2005), 5–13.
  • [16] Phillip A. Ostrand, A conjecture of J. Nagata on dimension and metrization, Bull. Amer. Math. Soc. 71 (1965), 623-625.
  • [17] D. Preiss, Invalid Vitali theorems, in: Abstracta. 7th Winter School on Abstract Analysis, pp. 58–60, Czechoslovak Academy of Sciences, 1979.
  • [18] D. Preiss, Dimension of metrics and differentiation of measures, General topology and its relations to modern analysis and algebra, V (Prague, 1981), 565–568, Sigma Ser. Pure Math., 3, Heldermann, Berlin, 1983.
  • [19] Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović, Hubs in space: popular nearest neighbors in high-dimensional data, Journal of Machine Learning Research 11 (2010), 2487–2531.
  • [20] C. Stone, Consistent nonparametric regression, Annals of Statistics 5 (1977), 595–645.
  • [21] K.Q. Weinberger and L.K. Saul, Distance metric learning for large margin classification, Journal of Machine Learning Research 10 (2009), 207–244.