
Intrinsic dimension estimation for discrete metrics

Iuri Macocco International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy    Aldo Glielmo International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy Banca d’Italia, Italy The views and opinions expressed in this paper are those of the authors and do not necessarily reflect the official policy or position of Banca d’Italia.    Jacopo Grilli The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34014 Trieste, Italy    Alessandro Laio laio@sissa.it International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy The Abdus Salam International Centre for Theoretical Physics (ICTP), Strada Costiera 11, 34014 Trieste, Italy
Abstract

Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensional reduction methods are designed for continuous spaces, and their use on discrete spaces can lead to errors and biases. In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting, finding a surprisingly small ID, of order 2. This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of the sequence space.


Data produced by experiments and observations are very often high-dimensional, with each data-point being defined by a sizeable number of features. To the delight of modelers, real-world datasets seldom occupy this high-dimensional space uniformly, as strong regularities and constraints emerge. This property is what allows for low-dimensional descriptions of high-dimensional data, ultimately making science possible.

In particular, data-points are often effectively contained in a manifold which can be described by a relatively small number of coordinates. The number of such coordinates is called the Intrinsic Dimension (ID). More formally, the ID is defined as the minimum number of variables needed to describe the data without significant information loss. Its knowledge is of paramount importance in unsupervised learning [1, 2, 3] and has found applications across disciplines. In solid state physics and statistical physics, the ID can be used as a proxy of an order parameter describing phase transitions [4, 5]; in molecular dynamics it can be used to quantify the complexity of a trajectory [6]; in deep learning theory the ID indicates how information is compressed throughout the various layers of a network [7, 8, 9]. During the last three decades much progress has been made in the development of sophisticated tools to estimate the ID [10, 11], and most estimators have been formulated (and are supposed to work) in spaces where distances can vary continuously. However, many datasets are characterised by discrete features and, consequently, discrete distances. For instance, categorical datasets like satisfaction questionnaires, clinical trials, unweighted networks, spin systems, protein, and DNA sequences fall into this category.

Two main methods are usually employed in these cases. The Box Counting (BC) estimator [12, 13, 14] — which measures the scaling between the number of boxes needed to cover a dataset and the box size — provides good results for 2-3-dimensional datasets but is computationally demanding for higher-dimensional ones. The second popular method is the Fractal Dimension (FD) estimator [12, 15, 16], which is based on the assumption of a power-law relationship $N\sim r^{d}$ for the number $N$ of neighbors within a sphere of radius $r$ from a given point, where $d$ is the fractal dimension of the data. This estimator has been successfully applied to discrete datasets to model the phenomena of dielectric breakdown [17] and Anderson localization [18]. For non-fractal objects, both methods are reliable only in the limit of small boxes and small radii, since the manifold containing the data can be curved and the data points can be distributed non-uniformly [19]. However, in discrete spaces such a limit is not well defined due to the minimum distance induced by any discrete lattice, and this can lead to systematic errors [20, 21].

In this letter, we introduce an ID estimator explicitly formulated for spaces with discrete features. In discrete spaces, the ID can be thought of as the dimension of a (hyper)cubic lattice onto which the original data-points can be (locally) projected without significant information loss. The key challenge in dealing with the discrete nature of the data lies in the proper definition of volumes on lattices. To this end, we introduce a novel method that makes use of Ehrhart's theory of polytopes [22], which allows one to enumerate the lattice points of a given region. By measuring a suitable statistic, which depends on the number of data-points observed within a given (discrete) distance, one can infer the dimension of the region, which we interpret as the ID of the dataset. The statistic we use is defined in such a way that the density of points is required to be constant only locally, and not in the whole dataset. Importantly, our estimator allows one to explicitly select the scale at which the ID is computed.

Methods - We assume data points to be uniformly distributed on a generic domain with density $\rho$. In such a domain, we consider a region $A$ with volume $V\!(A)$. Since we are assuming points to be independently generated, the probability of observing $n$ points in $A$ is given by the Poisson distribution [23]

P(n,A)=\frac{[\rho\,V\!(A)]^{n}}{n!}e^{-\rho\,V\!(A)} (1)

so that $\langle n\rangle=\rho\,V\!(A)$. Consider now a data-point $i$ and two regions $A$ and $B$, one containing the other, and both containing the data-point: $i\in A\subset B$. Then the numbers of points $n$ and $k-n$ falling, respectively, in $A$ and $B\setminus A$ are Poisson distributed with rates $\lambda_{1}=\rho\,V\!(A)$ and $\lambda_{2}=\rho\,V\!(B\setminus A)$. The conditional probability of having $n$ points in $A$, given that there are $k$ points in $B$, is

P(n\,|\,k)=\frac{P(n)P(k-n)}{P(k)}=\binom{k}{n}p^{n}(1-p)^{k-n} (2)

with

p=\frac{\lambda_{1}}{\lambda_{1}+\lambda_{2}}=\frac{\rho\,V\!(A)}{\rho\,V\!(B)}=\frac{V\!(A)}{V\!(B)}. (3)

Thus $n|k\sim\mathrm{Binomial}(n;k,p)$. As long as the density $\rho$ is constant within $A$ and $B$, $p$ is simply equal to the ratio of the volumes of the considered regions and is, remarkably, density independent. This is a key property which, as we will show, allows using the estimator even when the density is only approximately constant locally and varies, even substantially, across larger distance scales. One can then write a conditional probability of the observations $n_{i}$ (one for each data point), given the parameters $k_{i}$ and $p_{i}$, which can possibly be point-dependent:

\mathcal{L}(n_{i}|k_{i},p_{i})=\prod_{i=1}^{N}\mathrm{Binomial}(n_{i}|k_{i},p_{i}). (4)

Such a formulation assumes all the observations to be statistically independent. Strictly speaking this is typically not true, since the regions $A$ and $B$ of different points can overlap. We address this issue in the Supplementary Information (SI), demonstrating that neglecting correlations does not induce significant errors.
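As an illustration of Eqs. (2)-(3), the following toy Monte Carlo (an assumed snippet, not part of the original analysis) samples a homogeneous Poisson process on a one-dimensional region $B$ and checks that, conditionally on $k$, the count $n$ in $A\subset B$ follows a binomial law with $p=V(A)/V(B)$, irrespective of the density $\rho$.

```python
# Toy check of Eqs. (2)-(3): for a homogeneous Poisson process, n | k is
# Binomial(k, V(A)/V(B)) regardless of the density rho. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
rho, VA, VB = 7.0, 2.0, 5.0               # A = [0, 2) inside B = [0, 5)
ks = rng.poisson(rho * VB, size=50_000)   # number of points in B
ns = np.array([(rng.uniform(0.0, VB, k) < VA).sum() for k in ks])

k0 = ks[0]                                # condition on one observed value of k
sel = ns[ks == k0]
print(sel.mean(), k0 * VA / VB)           # both close to k0 * p, with p = 0.4
```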

The next step consists in defining the volumes in Eq. (3) according to the nature of the embedding manifold. We now assume our space to be a lattice, where the $L^{1}$ metric is a natural choice. In this space the volume $V(A)$ is the number of lattice points contained in $A$. According to the Ehrhart theory of polytopes [24], the number of lattice points within distance $t$ from a given point in dimension $d$ amounts to [25]

V_{\diamond}(t,d)=\binom{d+t}{d}\,{}_{2}F_{1}(-d,-t,-d-t,-1) (5)

where ${}_{2}F_{1}(a,b,c,z)$ is the ordinary hypergeometric function. At a given $t$, the above expression is a polynomial in $d$ of order $t$. As a consequence, the ratio of volumes defining the value of $p$ in Eq. (3) becomes a ratio of two polynomials in $d$. Given a dataset, the choice of $t_{1}$ and $t_{2}$ fixes the values of $n_{i}$ and $k_{i}$ in the likelihood (4). Its maximization with respect to $d$ allows one to infer the data-manifold's ID, which is simply given by the root of the equation (see SI for more details on the derivation)

\frac{V_{\diamond}(t_{1},d)}{V_{\diamond}(t_{2},d)}-\frac{\langle n\rangle}{\langle k\rangle}=0 (6)

where the mean values of $n$ and $k$ are taken over all the points of the dataset. The root can be easily found with standard optimization libraries. This procedure defines an ID estimator that, for brevity, we will call I3D (Intrinsic Dimension estimator for Discrete Datasets).
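A minimal sketch of this procedure (function names are ours for illustration, not the DADApy API) evaluates $V_{\diamond}(t,d)$ through the hypergeometric form of Eq. (5), continued to non-integer $d$, and finds the root of Eq. (6) numerically:

```python
# Sketch of the I3D maximum-likelihood estimate via the root of Eq. (6).
import numpy as np
from scipy.optimize import brentq
from scipy.special import gammaln, hyp2f1

def lattice_volume(t, d):
    """Number of Z^d lattice points within L1 distance t, Eq. (5),
    continued to non-integer d through the Gamma function."""
    log_binom = gammaln(d + t + 1) - gammaln(d + 1) - gammaln(t + 1)
    return np.exp(log_binom) * hyp2f1(-d, -t, -d - t, -1.0)

def i3d_estimate(n_mean, k_mean, t1, t2, d_max=100.0):
    """Root of V(t1,d)/V(t2,d) - <n>/<k> = 0 in the continuous variable d."""
    f = lambda d: lattice_volume(t1, d) / lattice_volume(t2, d) - n_mean / k_mean
    return brentq(f, 1e-4, d_max)
```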

Very importantly, the ID estimate is density independent, as the density cancels out (see Eq. (3)). The error of the estimator has a theoretical lower bound, given by the Cramer-Rao inequality, which has an explicit analytic expression. As an alternative, the ID can be estimated by a Bayesian approach as the mean value of its posterior distribution, with the error estimated via the posterior variance (details in SI).

The estimation of the ID depends on the choice of the volumes of the smaller and larger regions, which are parametrised by the “radii” $t_{1}$ and $t_{2}$. By varying $t_{2}$, the radius of the largest probe region, one can explore the behaviour of the ID at different scales. The proper range of $t_{2}$ is dataset dependent and should be chosen by plotting the value of the ID as a function of it, as we will illustrate in the following. If the dataset has a well-defined ID, one will observe an (approximate) plateau in this plot. This leaves the procedure with one free parameter, the ratio $r=t_{1}/t_{2}$, whose choice influences the statistical error. In continuous space the ratio between volumes in Eq. (3) is simply $p=r^{d}$ and the Cramer-Rao variance has a simple dependence on the parameter $r$. By minimising it with respect to $r$, one obtains that the optimal value for the ratio is $r_{opt}\sim 0.2032^{\frac{1}{d}}$ (see SI).

In order to check the goodness of the estimator, we test whether the numbers of points $n$ contained within the inner regions are actually distributed as a mixture of binomials, as our model assumes:

P(n)=\sum_{k}P(k)\,\mathrm{B}\!\left(n;k,\frac{V_{\diamond}(t_{1},d)}{V_{\diamond}(t_{2},d)}\right) (7)

where $P(k)$ is the empirical probability distribution of $k$ found by fixing $t_{2}$. In the following we will compare the empirical cumulative distribution of $n$ to the cumulative distribution of $P(n)$.
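A possible implementation of this check (a sketch under our own naming, not the authors' code) builds the theoretical cdf of $n$ as a binomial mixture weighted by the empirical distribution of $k$:

```python
# Sketch of the model-validation test of Eq. (7).
import numpy as np
from scipy.stats import binom

def mixture_cdf(n_grid, k_values, p):
    """Cdf of P(n) = sum_k P_emp(k) B(n; k, p), with P_emp from the data."""
    ks, counts = np.unique(k_values, return_counts=True)
    w = counts / counts.sum()
    return np.array([(w * binom.cdf(n, ks, p)).sum() for n in n_grid])

def empirical_cdf(n_grid, n_values):
    """Empirical cdf of the observed n on the same grid."""
    n_sorted = np.sort(n_values)
    return np.searchsorted(n_sorted, n_grid, side="right") / len(n_sorted)
```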

Results: Uniform distribution - We tested the I3D estimator on artificial datasets, and compared it against the two aforementioned methods: Box Counting (BC) and Fractal Dimension (FD). The BC estimate of the ID is obtained by a linear fit between the logarithm of the number of occupied covering boxes and the logarithm of the box side. Similarly, for the FD, the linear fit is computed between the logarithm of the average number of neighbours within a given radius and the logarithm of the radius. In both cases, the scale reported in the figures is given by the largest box or radius included in the fit. We started by analysing uniformly-distributed points in 2d and 6d square lattices. We adopted periodic boundary conditions in order to reduce boundary effects as much as possible. For the I3D estimator, in this and all following cases, we set $t_{1}/t_{2}=r=0.5$. Results are shown in Fig. 1.

Figure 1: Performance of I3D, BC and FD estimators for points uniformly distributed on a square lattice of side 50 in 2d and side 20 in 6d. Datasets were obtained by sampling, respectively, 20 realizations of 2500 and 100000 points. Error bars are given by the standard deviation over the different realizations. Lower panels: I3D model validation performed by comparing empirical and theoretical cdfs of the random variable $n$.

While BC and FD proved to be reliable in finding the fractal dimension of repeating, self-similar lattices [12, 17], they fail to assess the proper dimension of randomly distributed points, especially at small scales. The I3D estimator, instead, returns accurate values for the ID at all scales and, importantly, provides the correct estimate also on self-similar lattices (see SI). Remarkably, the I3D estimator allows one to select the scale explicitly by varying the radius $t_{2}$. In the lower panels of Fig. 1 we also report a first example of model validation for I3D. The two cdfs (empirical and theoretical, according to Eq. (7)) match perfectly, meaning that the ID estimate is reliable.

Gaussian distribution - Secondly, we tested the estimators on Gaussian-distributed points in 5 dimensions, analysing a case in which the data are uncorrelated and a case in which a correlation is induced by a non-diagonal covariance matrix. In both cases, we set the diagonal elements of the covariance matrix to $\sigma=5$ (implying an effective standard deviation of the distribution of $\sigma_{eff}=\sqrt{d}\sigma$), while the off-diagonal terms (for correlated data) were uniformly extracted in the interval (0,2). The values were chosen in order to keep the dimension of the dataset under control, as correlations of the same order as the diagonal would reduce the dimensionality of the dataset. The points were projected on a lattice by taking the nearest integer in each coordinate.

Figure 2: ID estimates of I3D, BC and FD on 20 realizations of 2500 points drawn from a Gaussian distribution in 5d and projected on a lattice (A). Solid lines with markers refer to the diagonal covariance matrix, dashed lines to the non-diagonal case. Panels B and C show I3D model validation at a small and a large scale, respectively.

As one can observe in panel A of Fig. 2, I3D is accurate as long as it explores a neighborhood where the density does not vary too much (namely, as long as $t_{2}/\sigma_{eff}\lesssim 1$). Correspondingly, the empirical and model cdfs in panel B are superimposed. Beyond such a distance, neighborhoods are characterised by non-constant density; consequently, the estimates get less precise and, accordingly, the two cdfs show inconsistencies (panel C: $t_{2}/\sigma_{eff}\sim 1.5$). On the other hand, the BC and FD estimates are far from the desired values at all scales, for both the correlated and uncorrelated cases.

Spin dataset - As a third test, we created synthetic Ising-like spin systems with a tunable ID, given by the number of independent parameters used to generate the dataset. The 1d ensemble is obtained by generating a set of points belonging to a line embedded in $\mathbb{R}^{D}$ through the process $\bm{\varphi}_{i}=\bm{\varphi}_{0}+\bm{\alpha}\epsilon(i)$. Here, $\bm{\alpha}$ is a fixed random vector of unitary norm with uniformly distributed components, and $\bm{\varphi}_{0}=-0.5$ is an intercept that, for simplicity, is equal for all the components; the $\epsilon(i)$ are Gaussian distributed, $\epsilon\sim\mathcal{N}(0,10)$, and independently drawn for each sample $i$. We then proceed to the discretization by extracting $\bm{z}_{i}=\mathrm{sign}(\bm{\varphi}_{i})$, an ensemble of $N$ states of $D$ discrete spins. The pipeline is summarized in Fig. 3. The role of $\bm{\varphi}_{0}$ is to introduce an offset that enhances the number of reachable discrete states. In fact, for $\bm{\varphi}_{0}=0$ we would obtain only two different states, given by $\bm{z}=\mathrm{sign}(\bm{\alpha}\epsilon)=\pm\,\mathrm{sign}(\bm{\alpha})$, since the spins would change sign synchronously. A non-zero offset allows the coordinates $\bm{\varphi}_{i}$ and the spins $\bm{z}_{i}$ to change sign asynchronously. The extension to higher dimensions is straightforward and consists in generating the initial points as $\bm{\varphi}_{i}=\bm{\varphi}_{0}+\sum_{j=1}^{id}\bm{\alpha}_{j}\epsilon_{j}(i)$, with $\bm{\alpha}_{j}\cdot\bm{\alpha}_{k}\sim\delta_{jk}$. Due to the nature of the data domain (a $D$-dimensional hypercube with side 1), BC cannot be applied, as boxes with side larger than 1 would include the whole dataset. The FD and I3D estimates for the 1d system are very close. This is not surprising, as both continuous and discrete volumes (and, consequently, the numbers of neighbors) scale linearly with the radius. In the 2d case, I3D clearly outperforms the other methods, although even the best estimate remains slightly lower than the true value. This effect, due to non-uniform density, is relatively small, and indeed the empirical and theoretical cdfs are rather consistent (panels B and C). Such an effect becomes more important as the dimension rises (see SI for examples in $d=3$ and $d=4$).

Figure 3: (A) The pipeline used to create an ensemble of binary spins with a low ID, together with the results of FD and I3D estimators on 1d and 2d datasets. I3D estimations were validated by comparing theoretical and empirical cdfs (panels B and C).
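A compact sketch of the generation pipeline of Fig. 3 (our own illustrative code, with the near-orthonormality $\bm{\alpha}_{j}\cdot\bm{\alpha}_{k}\sim\delta_{jk}$ enforced exactly by a QR decomposition) reads:

```python
# Illustrative generator of the Ising-like ensemble with tunable ID.
import numpy as np

def generate_spins(n_samples, n_spins, intrinsic_dim, sigma=10.0, seed=0):
    rng = np.random.default_rng(seed)
    # random generating directions alpha_j, then orthonormalized
    alpha = rng.uniform(-1.0, 1.0, size=(n_spins, intrinsic_dim))
    q, _ = np.linalg.qr(alpha)              # columns: orthonormal alpha_j
    eps = rng.normal(0.0, sigma, size=(n_samples, intrinsic_dim))
    phi = -0.5 + eps @ q.T                  # offset phi_0 = -0.5 on every component
    return np.sign(phi).astype(int)         # N states of D = n_spins discrete spins
```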

16S Genomics strands - Lastly, we present the application of our methodology to a real-world dataset in the field of genomics. The dataset consists of DNA sequences of $\sim$100-300 nucleotides, downloaded from the Qiita server (https://qiita.ucsd.edu/study/description/13596) [26]. In that study, the v4 region of the 16S ribosomal RNA of the microbiome associated with sponges and algal water blooms was sequenced. This small-subunit rRNA gene is widely used to study the composition of microbial communities [27, 28, 29, 30]. Distances between sequences were computed with the Hamming distance and the binary mapping A:11, T:00, C:10 and G:01. The canonical letter representation leads to almost identical results (see SI). To avoid dealing with isolated sequences, we kept only sequences having at least 10 neighbors within a distance of 10. Sequences come with an associated multiplicity, the number of times the same read has been found in the samples. We ignore such degeneracy and compute an ID which describes just the distribution of the points, regardless of their abundance.

To begin with, we estimated the ID on subsets of sequences that are similar to each other. In order to find such sets, we performed a k-means clustering and calculated the ID separately for each cluster. Panel A in Fig. 4 shows the ID at small to medium scales for one such cluster. The empirical and reconstructed cdfs, computed at $t_{2}=20$ (see inset), are fairly compatible. Panel B shows the average and the standard deviation of the ID over all clusters (weighted according to the respective populations). One can appreciate that the ID is always between 1 and 3 in a wide range of distances, showing a plateau around 2 for $15<t_{2}<40$.

Figure 4: Estimated ID at small to medium distances for one of the clusters of the genomics dataset (panel A). The inset reports the fair superposition of the empirical and modeled cdfs of $n$. Panel B shows the average and standard deviation of the IDs estimated separately for each cluster. Panel C shows the first and second PCA eigenvectors of the data-points within given distances $t_{2}$ (20 or 30) from the center of the cluster used for panel A.

Such a low value for the ID is an interesting and unexpected feature, as it suggests that, despite the high dimensionality of the sequence space, evolution effectively operates in a low-dimensional space. Qualitatively, an ID $\sim 2$ on a scale of $\sim 20$ means that, if one considers all the sequences differing by approximately 20 mutations from a given sequence, these mutations cannot be regarded as independent of each other, but are correlated in such a way that approximately 18 degrees of freedom are effectively forbidden. The “direction” of these correlated mutations can be, at least approximately, measured by performing a PCA in the space of sequences with the binary mapping. The first two dominant eigenvectors, shown in panel C, were estimated using all the sequences within a distance of 20 (top) and 30 (bottom) from the center of the cluster of panel A. Remarkably, the eigenvectors do not change significantly over this distance range, indicating that, consistently with the low value of the ID, the data manifold at this scale can be approximately described by a two-dimensional plane. In order to provide an interpretation of the vectors defining this plane, we repeated this same analysis on the previously mentioned spin model. In this case, if the generative model is defined by two vectors $\bm{\alpha}_{1}$ and $\bm{\alpha}_{2}$, the first two dominant eigenvectors of a PCA performed on $\sim$1000 points are contained in the span of the two generating vectors, with a residual of 0.04 (see SI for details). The components of a vector $\bm{\alpha}$ can then be qualitatively interpreted as proportional to the mutation probabilities of the associated nucleotides in a collective mutation process. In the genomics dataset this reasoning can be applied only locally: the direction of correlated mutation is significantly different in different clusters, indicating that the data manifold is highly curved.

Conclusions - We presented an ID estimator formulated to analyze discrete datasets. Our method relies on few mathematical hypotheses and is asymptotically correct if the density is constant within the probe radius $t_{2}$. In order to prove the estimator's effectiveness, we tested the algorithm on three different artificial datasets and compared it to the well-known Box Counting and Fractal Dimension estimators. While the latter two performed poorly, our estimator achieved good results in all cases, providing reliable ID estimates corroborated by the comparison of empirical and model cumulative distribution functions of one of the observables. We finally applied the estimator to a genomics dataset, finding an unexpectedly low ID which hints at strong constraints in the sequence space, and then exploited such information to give a qualitative interpretation of this ID. The newly developed method paves the way for pushing the investigation even further, towards the extension to discrete metrics of distance-based algorithms and routines that are by now consolidated in the continuum, such as density estimation methods and clustering algorithms.

The code implementing the algorithm is available in open source within the DADApy [31] software.

I Acknowledgements

The authors thank Antonietta Mira, Alex Rodriguez and Marcello Dalmonte for the fruitful discussions. AG and AL acknowledge support from the European Union’s Horizon 2020 research and innovation program (Grant No. 824143, MaX ‘Materials design at the eXascale’ Centre of Excellence).

IM, AG, AL designed and performed the research. All authors wrote the paper. JG designed the application on genomics sequences.

References

  • Solorio-Fernández et al. [2020] S. Solorio-Fernández, J. A. Carrasco-Ochoa,  and J. F. Martínez-Trinidad, Artificial Intelligence Review 53, 907 (2020).
  • Jović et al. [2015] A. Jović, K. Brkić, and N. Bogunović (IEEE, 2015) pp. 1200–1205.
  • Bengio et al. [2013] Y. Bengio, A. Courville,  and P. Vincent, IEEE transactions on pattern analysis and machine intelligence 35, 1798 (2013).
  • Mendes-Santos et al. [2021a] T. Mendes-Santos, X. Turkeshi, M. Dalmonte,  and A. Rodriguez, Physical Review X 11, 011040 (2021a).
  • Mendes-Santos et al. [2021b] T. Mendes-Santos, A. Angelone, A. Rodriguez, R. Fazio,  and M. Dalmonte, PRX Quantum 2, 030332 (2021b).
  • Glielmo et al. [2021] A. Glielmo, B. E. Husic, A. Rodriguez, C. Clementi, F. Noé, and A. Laio, Chemical Reviews 121, 9722 (2021), PMID: 33945269, https://doi.org/10.1021/acs.chemrev.0c01195.
  • Ansuini et al. [2019] A. Ansuini, A. Laio, J. H. Macke,  and D. Zoccolan, Advances in Neural Information Processing Systems 32 (2019).
  • Doimo et al. [2020] D. Doimo, A. Glielmo, A. Ansuini,  and A. Laio, Advances in Neural Information Processing Systems 33, 7526 (2020).
  • Recanatesi et al. [2019] S. Recanatesi, M. Farrell, M. Advani, T. Moore, G. Lajoie,  and E. Shea-Brown, arXiv preprint  (2019).
  • Campadelli et al. [2015] P. Campadelli, E. Casiraghi, C. Ceruti,  and A. Rozza, Math. Probl. Eng. 2015 (2015), 10.1155/2015/759567.
  • Camastra and Staiano [2016] F. Camastra and A. Staiano, Information Sciences 328, 26 (2016).
  • Falconer [2004] K. Falconer, Fractal geometry: mathematical foundations and applications (John Wiley & Sons, 2004).
  • Block et al. [1990] A. Block, W. von Bloh,  and H. J. Schellnhuber, Phys. Rev. A 42, 1869 (1990).
  • Grassberger [1993] P. Grassberger, International Journal of Modern Physics C 04, 515 (1993), https://doi.org/10.1142/S0129183193000525.
  • Grassberger and Procaccia [1983] P. Grassberger and I. Procaccia, Physical review letters 50, 346 (1983).
  • Christensen and Moloney [2005] K. Christensen and N. R. Moloney, Complexity and criticality, Vol. 1 (World Scientific Publishing Company, 2005).
  • Niemeyer et al. [1984] L. Niemeyer, L. Pietronero,  and H. J. Wiesmann, Physical Review Letters 52, 1033 (1984).
  • Kosior and Sacha [2017] A. Kosior and K. Sacha, Physical Review B 95, 104206 (2017).
  • Facco et al. [2017] E. Facco, M. D’Errico, A. Rodriguez,  and A. Laio, Scientific Reports 7, 1 (2017).
  • Theiler [1990] J. Theiler, J. Opt. Soc. Am. A 7, 1055 (1990).
  • Möller et al. [1989] M. Möller, W. Lange, F. Mitschke, N. Abraham,  and U. Hübner, Physics Letters A 138, 176 (1989).
  • Ehrhart [1977] E. Ehrhart, International Series of Numerical Mathematics, Vol.35  (1977).
  • Moltchanov [2012] D. Moltchanov, Ad Hoc Networks 10, 1146 (2012).
  • [24] “Eugène Ehrhart - publications 1947-1996,” http://icps.u-strasbg.fr/~clauss/Ehrhart_pub.html, accessed: 2022-03-25.
  • Beck and Robins [2007] M. Beck and S. Robins, Choice Reviews Online 45, 45 (2007).
  • Bolyen et al. [2019] E. Bolyen et al., Nature Biotechnology 37, 852 (2019).
  • Gray et al. [1984] M. W. Gray, D. Sankoff,  and R. J. Cedergren, Nucleic Acids Research 12, 5837 (1984).
  • Woese et al. [1990] C. R. Woese, O. Kandler,  and M. L. Wheelis, Proceedings of the National Academy of Sciences 87, 4576 (1990).
  • Weisburg et al. [1991] W. G. Weisburg, S. M. Barns, D. A. Pelletier,  and D. J. Lane, Journal of bacteriology 173, 697 (1991).
  • Jovel et al. [2016] J. Jovel, J. Patterson, W. Wang, N. Hotte, S. O’Keefe, T. Mitchel, T. Perry, D. Kao, A. L. Mason, K. L. Madsen, et al., Frontiers in microbiology 7, 459 (2016).
  • Glielmo et al. [2022] A. Glielmo, I. Macocco, D. Doimo, M. Carli, C. Zeni, R. Wild, M. d’Errico, A. Rodriguez,  and A. Laio, Patterns 3, 100589 (2022).

Intrinsic dimension estimation for discrete metrics
— Supplementary Information —

Figure 5: Schematic representation of hyperspheres for the typical $L^{2}$ metric in continuous spaces (left) and for lattices (right). In order to find the ID we exploit the binomial relationship between the $n$ (blue) points within region $A$ (of radius $t_{1}$) and the $k$ (red) points within region $B$ (of radius $t_{2}$). Points within the inner region $A$ count for both $n$ and $k$.

II Proof of Eq.(2)

We recall that

\begin{split}P(n,A)&=\frac{[\rho\,V\!(A)]^{n}}{n!}e^{-\rho\,V\!(A)}\\ P(k,B)&=\frac{[\rho\,V\!(B)]^{k}}{k!}e^{-\rho\,V\!(B)}\\ P(k-n,B\setminus A)&=\frac{[\rho\,V\!(B\setminus A)]^{k-n}}{(k-n)!}e^{-\rho\,V\!(B\setminus A)}\end{split} (8)

where we name $\lambda_{1}=\rho\,V\!(A)$ and $\lambda_{2}=\rho\,V\!(B\setminus A)$, so that $\lambda_{1}+\lambda_{2}=\rho\,(V\!(A)+V\!(B\setminus A))=\rho\,V\!(B)$ since $A\subset B$. Consequently, it holds that

\begin{split}P(n\,|\,k)&=\frac{P(n,k)}{P(k)}=\frac{P(n)P(k-n)}{P(k)}=\\ &=\binom{k}{n}\left(\frac{\lambda_{1}}{\lambda_{1}+\lambda_{2}}\right)^{n}\left(\frac{\lambda_{2}}{\lambda_{1}+\lambda_{2}}\right)^{k-n}=\\ &=\binom{k}{n}p^{n}(1-p)^{k-n}\end{split} (9)

where $p=\frac{V\!(A)}{V\!(B)}$.

III Ehrhart polynomial theory and Cross-polytope enumerating function

According to Ehrhart theory, the volume of a lattice hypersphere of radius $t$ in $d$ dimensions is given by the enumerating function [25]

V_{\diamond}(t,d)=\sum_{k=0}^{d}\binom{d}{k}\binom{t-k+d}{d}, (10)

where $d$ is assumed to be an integer. In order to make this expression suitable for likelihood maximization, it can be conveniently rewritten using the analytic continuation

V_{\diamond}(t,d)=\binom{d+t}{d}\,{}_{2}F_{1}(-d,-t,-d-t,-1) (11)

where the binomial coefficients are computed using the Gamma function, $n!=\Gamma(n+1)$, for non-integer $n$, and ${}_{2}F_{1}$ is the hypergeometric function. Here we report the first polynomials, for $t\leq 4$:

  • $t=0$: $1$

  • $t=1$: $1+2d$

  • $t=2$: $1+2d+2d^{2}$

  • $t=3$: $1+\frac{8}{3}d+2d^{2}+\frac{4}{3}d^{3}$

  • $t=4$: $1+\frac{8}{3}d+\frac{10}{3}d^{2}+\frac{4}{3}d^{3}+\frac{2}{3}d^{4}$

By substituting integer values of $d$, one recovers the volumes given by Eq. (10). Using this continuation we can treat $d$ as a continuous parameter in our inference procedures.
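As a quick consistency check (an assumed snippet, not part of the original SI), one can verify numerically that the hypergeometric form reproduces the tabulated polynomials, e.g. $V_{\diamond}(2,d)=1+2d+2d^{2}$:

```python
# Check that Eq. (11) matches the Ehrhart polynomial for t = 2.
import numpy as np
from scipy.special import gammaln, hyp2f1

def lattice_volume(t, d):
    """Eq. (11): binomial coefficient via Gamma, times 2F1(-d,-t,-d-t,-1)."""
    log_binom = gammaln(d + t + 1) - gammaln(d + 1) - gammaln(t + 1)
    return np.exp(log_binom) * hyp2f1(-d, -t, -d - t, -1.0)

for d in range(1, 7):
    assert np.isclose(lattice_volume(2, d), 1 + 2 * d + 2 * d**2)
```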

IV Maximum Likelihood Estimation of the ID

Once the two radii $t_{1}$ and $t_{2}$ are fixed and the corresponding values of $n_{i}$ and $k_{i}$ are computed, the likelihood for $N$ data points is

\mathcal{L}\{d|(n_{i},k_{i})\}=\prod_{i}^{N}\mathrm{Binomial}(n_{i};k_{i},p(d))=\prod_{i}^{N}\binom{k_{i}}{n_{i}}\,p(d)^{n_{i}}(1-p(d))^{k_{i}-n_{i}} (12)

where $p(d)=V_{\diamond}(t_{1},d)/V_{\diamond}(t_{2},d)$ depends explicitly only on $d$. In order to make the expressions easier to read, we will write just $p$ from now on. The optimal value of $d$ can be found by means of a maximum likelihood estimation (MLE), which consists in setting to zero the derivative of the log-likelihood:

\begin{split}0=&\frac{\partial}{\partial d}\ln(\mathcal{L})=\sum_{i}^{N}\frac{\partial}{\partial d}\ln(\mathrm{B}(n_{i};k_{i},p))=\\ =&\sum_{i}^{N}\frac{\partial}{\partial d}\left(n_{i}\ln(p)+(k_{i}-n_{i})\ln(1-p)\right)=\\ =&\sum_{i}^{N}\left(n_{i}\frac{p^{\prime}}{p}-(k_{i}-n_{i})\frac{p^{\prime}}{1-p}\right)=\\ =&\langle n\rangle\frac{p^{\prime}}{p}-(\langle k\rangle-\langle n\rangle)\frac{p^{\prime}}{1-p}\end{split} (13)

where the mean values are taken over all the points of the dataset and $p^{\prime}=\mathrm{d}p/\mathrm{d}d$ is the Jacobian of the transformation from $p$ to $d$, which reads

\frac{\mathrm{d}}{\mathrm{d}d}\frac{V_{\diamond}(t_{1};d)}{V_{\diamond}(t_{2};d)}=\frac{V_{\diamond}^{\prime}(t_{1})V_{\diamond}(t_{2})-V_{\diamond}(t_{1})V_{\diamond}^{\prime}(t_{2})}{V_{\diamond}(t_{2})^{2}}. (14)

The last line of Eq. (13) leads directly to equation (6).

IV.1 Cramer-Rao estimate of the variance of the ID

The Cramer-Rao inequality states that the variance of an unbiased estimator of an unknown parameter $\theta$ is bounded from below by the inverse of the Fisher information, namely

\text{Var}(\hat{\theta})\geq\frac{1}{\mathcal{I}(\theta)} (15)

where

\mathcal{I}(\theta)=N\,\mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\ln\mathcal{L}(x;\theta)\right)^{2}\right]=-N\,\mathbb{E}\left[\frac{\partial^{2}}{\partial\theta^{2}}\ln\mathcal{L}(x;\theta)\right], (16)

$\mathcal{L}$ is the likelihood for a single sample $x$, and $\mathbb{E}$ is the expected value over data points $x$. Given the likelihood in Eq. (12), by differentiating the third line of Eq. (13) a second time with respect to $d$, one obtains

\frac{\partial^{2}}{\partial d^{2}}\ln\mathcal{L}(d;n_{i})=\sum_{i}^{N}\left(p^{\prime\prime}\left(\frac{n_{i}}{p}-\frac{k_{i}-n_{i}}{1-p}\right)+p^{\prime}\left(-p^{\prime}\frac{n_{i}}{p^{2}}-p^{\prime}\frac{k_{i}-n_{i}}{(1-p)^{2}}\right)\right). (17)

By inserting the MLE solution $\langle n\rangle=p\langle k\rangle$ and performing the sum, one obtains

\frac{\partial^{2}}{\partial d^{2}}\ln\mathcal{L}(d;n_{i})=-N\langle k\rangle\frac{p^{\prime 2}}{p(1-p)}, (18)

leading to the final result

\text{Err}(d;t_{1},t_{2},N)\geq\sqrt{\frac{p(1-p)}{\langle k\rangle N p^{\prime 2}}}\,\Bigg|_{d=d_{MLE}}. (19)

This expression is evaluated at the $d$ found through the MLE procedure.
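In code, the bound of Eq. (19) can be evaluated as follows (a sketch under our own naming; here $p'(d)$ is obtained by a central finite difference in place of the analytic Jacobian of Eq. (14), and `volume` is any callable such as the `lattice_volume` defined above):

```python
# Sketch of the Cramer-Rao error bar of Eq. (19) at the MLE solution.
import numpy as np

def cramer_rao_error(d_mle, t1, t2, k_mean, n_points, volume, h=1e-5):
    p_of = lambda d: volume(t1, d) / volume(t2, d)
    p = p_of(d_mle)
    p_prime = (p_of(d_mle + h) - p_of(d_mle - h)) / (2.0 * h)  # dp/dd
    return np.sqrt(p * (1.0 - p) / (k_mean * n_points * p_prime**2))
```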

V Bayesian estimate of the ID

The ID can also be estimated through a Bayesian approach. The likelihood of the process is represented by a binomial distribution whose probability $p$ (hereafter named $x$ to avoid confusion) is given by the ratio of the shell volumes around any given point. The binomial pdf has a known conjugate prior, the Beta distribution, whose expression is

\text{Beta}(x;\alpha,\beta)=\frac{x^{\alpha-1}(1-x)^{\beta-1}}{\mathcal{B}(\alpha,\beta)}, (20)

where $\mathcal{B}$ is the Beta function. We make an agnostic assumption (since we do not have any information on the value of the ID) and set $\alpha=\beta=1$, so that $x$ is a priori uniformly distributed. The posterior distribution of the ratio of the volumes $x$ will still be a Beta distribution, with parameters updated as follows:

\alpha=\alpha_{0}+\sum_{i=1}^{N}n_{i} (21)
\beta=\beta_{0}+\sum_{i=1}^{N}(k_{i}-n_{i}) (22)

where $n_{i}$ and $k_{i}$ are the numbers of points falling within the inner and outer volumes around point $i$; the sums run over all the points in the dataset.

In order to compute the expected value and the variance of $d$, one has to extract its posterior distribution $P(d)$ from that of $x$. The posterior of $d$ is obtained from the posterior of $x$ by a simple change of variables:

P(d)=P(p(d))\left|\frac{\mathrm{d}p}{\mathrm{d}d}\right|=\text{Beta}(p;\alpha,\beta)\left|\frac{\mathrm{d}p}{\mathrm{d}d}\right| (23)

where the Jacobian is given by Eq. (14). By varying $d$, one can evaluate the posterior distribution of $d$. Its first and second moments provide the (Bayesian) estimates of the ID and of its confidence.

As far as we could observe, the ID found through MLE is always very close to the mean value of the posterior. The same holds for the error estimate, which is typically very close to the Cramer-Rao bound. Small differences ($<1\%$) have been observed in cases with few data points ($\sim 50$). The reason is that the posterior distribution, for low values of $\alpha$ and $\beta$, can be slightly asymmetric, leading to a discrepancy between the position of its maximum and its mean value. In most practical cases such an effect is negligible.
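A numerical sketch of this procedure (our own naming, with `lattice_volume` as defined above) evaluates the posterior of $d$ on a grid through the change of variables of Eq. (23):

```python
# Sketch of the Bayesian ID estimate: mean and spread of the posterior P(d).
import numpy as np
from scipy.stats import beta as beta_dist

def bayes_id(n_sum, k_sum, t1, t2, volume, d_grid=np.linspace(0.01, 50.0, 5000)):
    a, b = 1.0 + n_sum, 1.0 + (k_sum - n_sum)   # flat Beta(1,1) prior, Eqs. (21)-(22)
    p = volume(t1, d_grid) / volume(t2, d_grid)
    jac = np.abs(np.gradient(p, d_grid))        # |dp/dd| evaluated on the grid
    post = beta_dist.pdf(p, a, b) * jac         # Eq. (23)
    dd = d_grid[1] - d_grid[0]
    post /= post.sum() * dd                     # normalize numerically
    d_mean = (d_grid * post).sum() * dd
    d_var = ((d_grid - d_mean) ** 2 * post).sum() * dd
    return d_mean, np.sqrt(d_var)
```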

VI Statistical correlation in the numbers of data points in the probe regions.

The likelihood has the form of Eq. (12) under the assumption that the observations $n_{i}$ are statistically independent. However, if a point $j$ is close to another point $i$, their neighborhoods are likely to overlap. This implies that the values of $n_{i}$ and $n_{j}$ will be correlated to a certain degree. In particular, in order to have full statistical independence, one should consider only points with non-overlapping neighborhoods. Clearly, this would reduce the number of observations available to compute the ID, making the estimate less reliable. Here we assess the magnitude of such correlations or, equivalently, by how much the Bayesian and Cramer-Rao calculations underestimate the error.

To begin with, we generated 10000 realizations of 10000 uniformly distributed points on a 4-dimensional lattice. For each realization, we extracted the ID and its error using all available points; on the same dataset, we also computed the ID using only one random point, in order to gather statistically independent measurements. We then computed the distribution of the pool variable for the correlated ID measurements as

\chi_{corr}=\frac{d_{cor}-d_{gt}}{\sigma_{Bayes}}, (24)

where $\sigma_{Bayes}$ is the standard deviation of the posterior, and $d_{gt}$ is the ground-truth ID. We also computed the distribution of the pool variable for the ID estimates obtained using a single point,

\chi_{ind}=\frac{d_{ind}-d_{gt}}{\sigma_{stat}}, (25)

where $\sigma_{stat}$ is the standard deviation of the distribution of the single-point ID estimates. We expect $\chi_{ind}\sim\mathcal{N}(0,1)$, and this is indeed what we obtained, as shown by the blue histogram in Fig. 6. On the other hand, the distribution obtained for the pool of correlated measurements, $\chi_{corr}$ (orange histogram in Fig. 6), shows a larger spread, indicating that $\sigma_{Bayes}$ systematically underestimates the error by $\sim 30\%$. This was expected, as both the Bayesian and the likelihood formulations assume statistically independent observations. We can then conclude that the statistical dependence of the neighborhoods in the calculation of the ID leads to an error estimate which is slightly below the correct value but still very indicative.

Figure 6: The pool distributions (see text) show that the correlation between the neighbourhoods of the points leads to an error estimate which slightly underestimates the value that would be obtained using statistically independent samples.

VII Analytical results in continuum space

The ID estimation scheme proposed in our work can easily be extended and applied to metrics other than the lattice one (used to build I3D). In particular, for any $L^{p}$ metric in continuous space, the volume of a hypersphere scales as a canonical power law of the radius, multiplied by a prefactor that depends on both the dimension $d$ and the value of $p$:

V_{d}^{p}(R)=\frac{\left(2\Gamma(\frac{1}{p}+1)R\right)^{d}}{\Gamma(\frac{d}{p}+1)}=\Omega_{d}^{p}R^{d}.

As a consequence, the ratio of volumes that enters the binomial distribution through Eq. (3) becomes

p(r,d)=\frac{\Omega_{d}R_{1}^{d}}{\Omega_{d}R_{2}^{d}}=\left(\frac{R_{1}}{R_{2}}\right)^{d}=r^{d}. (26)

Because of the well-behaved scaling of the volume with the radius in continuous metrics, all the formulas presented so far for the discrete case simplify consistently, allowing for further analytical derivations concerning both the MLE and the Bayesian analyses, as shown in the two following sections.

VII.1 Maximum Likelihood Estimation and Cramer-Rao lower bound

In the continuum case, given the previous expression for the ratio of volumes, the MLE and Cramer-Rao relations of Eq. (6) simplify greatly, so that we can obtain an explicit form for the intrinsic dimension. Concretely, by substituting $p=r^{d}$ into Eq. (6), we find

d=\frac{\ln(\langle n\rangle/\langle k\rangle)}{\ln(r)} (27)

for the MLE, while the Cramer Rao inequality (19) becomes

\text{Var}(d;r,N,k)\geq\frac{1-r^{d}}{\langle k\rangle N\ln(r)^{2}\,r^{d}}. (28)

In order to have an estimate as precise as possible, we are interested in the value of $r$ that minimizes the variance. Since (28) is a convex function of $r$, we find a single minimum, which corresponds to

r_{opt}(d)=2^{-1/d}\left(-W\!\left(-\frac{2}{e^{2}}\right)\right)^{1/d}\approx 0.2032^{\frac{1}{d}} (29)

where $W$ is the Lambert $W$ function. Of course, in principle we do not know the intrinsic dimension of the system, and thus we have no direct way to select $r_{opt}$ other than through successive iterations. This relationship tells us that a higher $d$ requires a higher $r$ to provide a better estimate. The above relation also suggests that the optimal ratio between the numbers of points in the two shells should approach $\langle n\rangle/\langle k\rangle\sim r_{opt}^{d}=0.2032$. This implies that there is an optimal and precise fraction of points, and thus of volumes, for which the estimated ID is most accurate. Supposing that we are able to find such an $r_{opt}$, we might ask how the variance scales with the dimensionality of the system. We obtain

\text{Var}(d;r_{opt},N,k)=\frac{1-r_{opt}^{d}}{\ln(r_{opt})^{2}\,\langle k\rangle N\,r_{opt}^{d}}\propto\frac{d^{2}}{N\langle k\rangle} (30)

This implies that the variance of our estimator scales with the square of the dimension.
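Numerically (an assumed snippet, not part of the original SI), the constant in Eq. (29) follows directly from scipy's Lambert $W$:

```python
# Check of Eq. (29): the optimal volume fraction is r_opt^d = -W(-2/e^2)/2.
import numpy as np
from scipy.special import lambertw

p_opt = -lambertw(-2.0 * np.exp(-2.0)).real / 2.0   # principal branch, real-valued
print(round(p_opt, 4))                              # 0.2032
r_opt = lambda d: p_opt ** (1.0 / d)                # optimal ratio t1/t2 in dimension d
```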

VII.2 Bayes formulation

The Bayesian derivation also leads to analytical results in the continuum case. In particular, from Eq. (23), one obtains

P(d)=\text{Beta}(r^{d};\alpha,\beta)\,|r^{d}\ln(r)|. (31)

From this expression one can easily derive the first and second moments of the distribution. In particular, performing the change of variable $d=\ln x/\ln r$, we have

\langle d\rangle=\int_{0}^{\infty}\mathrm{d}d\;P(d)\,d=-\frac{1}{\ln(r)}\int_{0}^{1}\mathrm{d}x\;\text{Beta}(x;\alpha,\beta)\ln(x)=\frac{\psi_{0}(\alpha)-\psi_{0}(\alpha+\beta)}{\ln(r)}=\frac{\psi_{0}(1+\sum_{i=1}^{N}n_{i})-\psi_{0}(2+\sum_{i=1}^{N}k_{i})}{\ln(r)} (32)

where $\psi_{0}(z)=\frac{\mathrm{d}}{\mathrm{d}z}\ln\Gamma(z)$ is the digamma function; in the last step we inserted the definitions of $\alpha$ and $\beta$ from Eqs. (21) and (22). By exploiting the same change of variable, the variance also reduces to a simple expression:

\text{Var}(d)=\frac{\text{Var}(\ln(x))}{\ln(r)^{2}}=\frac{\psi_{1}(\alpha)-\psi_{1}(\alpha+\beta)}{\ln(r)^{2}}=\frac{\psi_{1}(1+\sum_{i=1}^{N}n_{i})-\psi_{1}(2+\sum_{i=1}^{N}k_{i})}{\ln(r)^{2}} (33)

where $\psi_{1}(z)=\frac{\mathrm{d}^{2}}{\mathrm{d}z^{2}}\ln\Gamma(z)$ is the trigamma function.
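These closed forms translate directly into code (a sketch using scipy's polygamma, with names of our choosing):

```python
# Closed-form Bayesian moments in the continuum case, Eqs. (32)-(33).
import numpy as np
from scipy.special import polygamma

def bayes_id_continuum(n_sum, k_sum, r):
    """Posterior mean and std of d for p = r^d and a flat Beta(1,1) prior."""
    log_r = np.log(r)
    d_mean = (polygamma(0, 1 + n_sum) - polygamma(0, 2 + k_sum)) / log_r
    d_std = np.sqrt((polygamma(1, 1 + n_sum) - polygamma(1, 2 + k_sum)) / log_r**2)
    return d_mean, d_std
```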

VIII Fractal lattices

As a further test of the soundness of I3D, we compared the performance of the estimators on discrete fractal lattices, where Box Counting (BC) and Fractal Dimension (FD) have already proved reliable [12, 17]. In Fig. 7, we report the ID as a function of the scale for the Koch curve (top), whose ID is $\log(4)/\log(3)\approx 1.26$, and for the Sierpinski gasket (bottom), with an ID of $\log(3)/\log(2)\approx 1.58$. As one can appreciate, the FD converges to the proper values only at large scales, where the discrete nature of the dataset is negligible, while BC and I3D quickly find the correct ID. However, I3D adds a piece of information: it clearly shows at which scale the “fractality” of the data comes into play. Indeed, by looking at the inset for the Sierpinski gasket, one notices that at scales smaller than five the 2-dimensional structure still prevails.

Figure 7: I3D, BC and FD are all capable of finding the correct theoretical ID for fractal lattices, even if they do so at different scales.

IX Spin systems

Figure 8: The I3D estimator underestimates the ground-truth ID of spin systems in the cases $d=3$ and $d=4$. The inconsistency is shown by the model validation plots, where the theoretical and empirical cdfs are not well superimposed.

IX.1 PCA on discrete spins

Here we address the possibility of performing PCA on discrete data points, justifying the results of the last section of the main text. We start by recalling that the continuous spins were generated using a linear embedding $\bm{\varphi}_{i}=\bm{\varphi}_{0}+\bm{\alpha}\epsilon(i)$, so that it is possible to retrieve the directions of the generating vectors $\bm{\alpha}_{i}$ using standard PCA and a number of points $N\geq\text{ID}+1$.

In the case of spin states, the retrieval of $\bm{\alpha}$ is not as straightforward. In particular, two spin states differ from each other by a finite (and possibly very small) number of spin flips. This means that we have information only on a fraction of the features, namely the varying spins. For this reason, we need many more points in order to gather statistics on the behaviour of the spins and on how often they flip across the dataset. The idea is that the PCA eigenvectors can capture the frequency of spin flips and give a proxy of the embedding directions $\bm{\alpha}_{i}$. The result for ID=1 is reported in Fig. 9, which compares the generating vector $\bm{\alpha}$ and the first PCA eigenvector $\bm{v}_{1}$. The overlap is almost perfect, as we have $\bm{\alpha}\cdot\bm{v}_{1}\sim 0.98$.

Figure 9: Comparison of the embedding vector $\bm{\alpha}$ and the first PCA eigenvector $\bm{v}_{1}$ obtained with $\sim 1000$ points in the 1-dimensional case. The overlap is almost perfect: $\bm{\alpha}\cdot\bm{v}_{1}\sim 0.98$.

In higher dimensions, the eigenvectors will be rotated with respect to the original embedding vectors, so that a direct visual comparison cannot be made as in the previous case. Hence, we estimate the residual of the overlap, defined as

\mathcal{R}=d-\sum_{i,j=1}^{d}(\bm{\alpha}_{i}\cdot\bm{v}_{j})^{2}. (34)

In the 2-dimensional case, we find $\mathcal{R}\sim 0.04$, meaning that, as in the 1-dimensional example, we are able to retrieve $\sim 98\%$ of the information of the generative process.
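The residual of Eq. (34) amounts to the squared norm of the generating vectors left outside the span of the PCA eigenvectors; a minimal sketch (assuming unit-norm row vectors) is:

```python
# Residual of Eq. (34) between generating vectors and PCA eigenvectors.
import numpy as np

def overlap_residual(alpha, v):
    """alpha, v: arrays of shape (d, D), rows = unit generating / PCA vectors."""
    return alpha.shape[0] - np.sum((alpha @ v.T) ** 2)
```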

X Results for different nucleotide sequence distances and choices of radii ratio

Figure 10: Different choices of the mapping (left) or of the parameter $r$ (right) only slightly affect the ID estimate.

One might wonder how the ID estimate depends on the metric used to compute the distance between sequences. As anticipated in the main text, we considered two possible mappings. The first one maps each letter to a two-spin state as follows: A:11, T:00, C:10 and G:01. The distance can then be measured through the Hamming or the Manhattan metric indifferently. As a consequence, complementary purines and pyrimidines are at distance 2, while all other pairs are at distance 1. The other possibility is to use the Hamming distance directly on the sequences as they are, meaning that all nucleotides are equidistant from one another. Fig. 10 (left) shows the ID as a function of the scale for the same cluster used in the main text. The behaviour of the ID depends only slightly on the chosen distance measure. For this reason, we decided to stick to the spin mapping, as it allows retrieving the local “directions” of the generating process by a PCA analysis. Similarly, different choices of the free parameter $r=t_{1}/t_{2}$ do not noticeably affect the ID estimate, especially in the plateau region ($15<t_{2}<40$), where an ID can thus be properly defined.