
Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality

Vudtiwat Ngampruetikorn,*  David J. Schwab
Initiative for the Theoretical Sciences, The Graduate Center, CUNY
*vngampruetikorn@gc.cuny.edu
Abstract

Avoiding overfitting is a central challenge in machine learning, yet many large neural networks readily achieve zero training loss. This puzzling contradiction necessitates new approaches to the study of overfitting. Here we quantify overfitting via residual information, defined as the bits in fitted models that encode noise in training data. Information efficient learning algorithms minimize residual information while maximizing the relevant bits, which are predictive of the unknown generative models. We solve this optimization to obtain the information content of optimal algorithms for a linear regression problem and compare it to that of randomized ridge regression. Our results demonstrate the fundamental trade-off between residual and relevant information and characterize the relative information efficiency of randomized regression with respect to optimal algorithms. Finally, using results from random matrix theory, we reveal the information complexity of learning a linear map in high dimensions and unveil information-theoretic analogs of double and multiple descent phenomena.

1 Information bottleneck

Conventional wisdom identifies overfitting as being detrimental to generalization performance, yet modern machine learning is dominated by models that perfectly fit training data. Recent attempts to resolve this dilemma have offered much needed insight into the generalization properties of perfectly fitted models [1, 2]. However investigations of overfitting beyond generalization error have received less attention. In this work we present a quantitative analysis of overfitting based on information theory and, in particular, the information bottleneck (IB) method [3].

The essence of learning is the ability to find useful and generalizable representations of training data. An example of such a representation is a fitted model which may capture statistical correlations between two variables (regression and pattern recognition) or the relative likelihood of random variables (density estimation). While what makes a representation useful is problem specific, a good model generalizes well—that is, it is consistent with test data even though they are not used at training.

Achieving good generalization requires information about the unknown data generating process. Maximizing this information is an intuitive strategy, yet extracting too many bits from the training data hurts generalization [4, 5]. This fundamental trade-off underpins the IB principle, which formalizes the notion of a maximally efficient representation as an optimization problem [3]. (Note that the minimization below is identical to that of the original IB method since $I(S;T\mid W)=I(S;T)-I(T;W)$ under the Markov constraint $T\leftrightarrow S\leftrightarrow W$.)

\min_{Q_{T|S}}\ I(S;T\mid W)-(\gamma-1)\,I(T;W). \quad (1)

Here $W$ denotes the data generating process. The conditional distribution $Q_{T|S}$ denotes a learning algorithm which defines a stochastic mapping from the training data $S$ to the hypothesis or fitted model $T$ (Fig 1a). The relevant information, $I(T;W)$, is the bits in $T$ that are informative of the generative model $W$. On the other hand, the residual information, $I(S;T\mid W)$, is the bits in $T$ that are specific to each realization of the training data $S$ and thus are not informative of $W$. In other words the residual bits measure the degree of overfitting. The parameter $\gamma$ controls the trade-off between these two informations.

The IB method has found success in a diverse array of applications, from neural coding [6, 7], developmental biology [8] and statistical physics [9, 10, 11] to clustering [12], deep learning [13, 14, 15] and reinforcement learning [16].

Indeed the IB principle has emerged as a potential candidate for a unifying framework for understanding learning phenomena [17, 18, 15, 19] and a number of recent works have explored deep connections between information-theoretic quantities and generalization properties [4, 20, 5, 21, 22, 23, 24, 25, 26, 27, 28]. However a direct application of information theory to practical learning algorithms is often limited by the difficulty in estimating information, especially in high dimensions. While recent advances in characterizing variational bounds of mutual information have enabled a great deal of scalable, information-theory inspired learning methods [13, 29, 30], these bounds are generally loose and may not reflect the true behaviors of information.

To this end we consider a tractable problem of learning a linear map. We show that the level of overfitting, as measured by the encoded residual bits, is nonmonotonic in sample size, exhibiting a maximum near the crossover between under- and overparametrized regimes. We also demonstrate that additional maxima can develop under anisotropic covariates. As the residual information bounds the generalization gap [4, 5], its nonmonotonicity can be viewed as an information-theoretic analog of (sample-wise) multiple descent—the existence of disjoint regions in which more data hurt generalization (see, e.g., Refs [31, 32]). Using an IB optimal representation as a baseline, we show that the information efficiency of a randomized least squares regression estimator exhibits sample-wise nonmonotonicity with a minimum near the residual information peak. Finally we discuss how redundant coding of relevant information in the data gives rise to the nonmonotonicity of the encoded residual bits and how additional maxima emerge from covariate anisotropy (Sec 4).

1.1 Generative model

Linear map—We consider training data of $N$ iid samples, $S=\{(x_1,y_1),\dots,(x_N,y_N)\}$, each of which is a pair of a $P$-dimensional input vector $x_i\in\mathbb{R}^P$ and a scalar response $y_i\in\mathbb{R}$ for $i\in\{1,\dots,N\}$. We assume a linear relation between the input and response,

y_i = W\cdot x_i + \epsilon_i \quad\text{and}\quad \epsilon_i\sim N(0,\sigma^2), \quad (2)

where $W\in\mathbb{R}^P$ denotes the unknown linear map and $\epsilon_i$ a scalar Gaussian noise with mean zero and variance $\sigma^2$. In other words the responses and the inputs are related via a Gaussian channel

Y\mid X,W \sim N(X^{\mathsf{T}}W,\ \sigma^2 I_N), \quad (3)

where we define $Y=(y_1,\dots,y_N)^{\mathsf{T}}\in\mathbb{R}^N$ and $X=(x_1,\dots,x_N)\in\mathbb{R}^{P\times N}$.

Fixed design—We adopt the fixed design setting in which the inputs (design matrix) $X$ are deterministic and only the responses $Y$ are random variables (see, e.g., Ref [33, Ch 3]). As a result, the mutual information between the training data $S$ and any random variable $A$ is given by $I(A;S)=I(A;X,Y)=I(A;Y)$. In the following analyses, we use $S$ and $Y$ interchangeably.

Random effects—In addition we work in the random effects setting (see, e.g., [34, 35] for recent applications of this setting) in which the true regression parameter $W$ is a Gaussian vector,

W \sim N\!\left(0,\ \tfrac{\omega^2}{P} I_P\right). \quad (4)

Here we define the covariance such that the signal strength, $\mathbb{E}\|W\|^2=\omega^2$, is independent of $P$.
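As a concrete reference point, the following minimal sketch (ours, not the authors' code) draws one realization of the generative model in Eqs (2)-(4) under a standard-normal fixed design; the dimensions and variances are illustrative assumptions.

```python
# A minimal sketch of the generative model in Eqs (2)-(4): fixed design X,
# random effects W, and Gaussian observation noise. Values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 100            # sample size and input dimension
sigma2, omega2 = 1.0, 1.0  # noise variance sigma^2 and signal strength omega^2

X = rng.standard_normal((P, N))              # fixed design matrix, P x N
W = rng.normal(0.0, np.sqrt(omega2 / P), P)  # true parameter, W ~ N(0, (omega^2/P) I_P)
eps = rng.normal(0.0, np.sqrt(sigma2), N)    # iid noise, Eq (2)
Y = X.T @ W + eps                            # responses, Eqs (2)-(3)
```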

1.2 Information optimal algorithm

The data generating process above results in training data $Y$ and true parameters $W$ that are Gaussian correlated (under the fixed design setting). In this case the IB optimization—minimizing residual information $I(T;Y\mid W)$ while maximizing relevant information $I(T;W)$—admits an exact solution [36], characterized by the eigenmodes of the normalized regression matrix,

\Sigma_{Y|W}\Sigma_Y^{-1} = \left(I_N + \frac{1}{\lambda^*}\frac{X^{\mathsf{T}}X}{N}\right)^{-1} \quad\text{with}\quad \lambda^* \equiv \frac{P}{N}\frac{\sigma^2}{\omega^2}, \quad (5)

where $\lambda^*$ denotes the scaled noise-to-signal ratio. The relevant and residual informations of an optimal representation $\tilde{T}$ read [36]

I(\tilde{T};W) = \tfrac{1}{2}\sum_{i=1}^{N} \max\bigl(0,\ \ln((1-\gamma^{-1})/\nu_i)\bigr) \quad (6)
I(\tilde{T};Y\mid W) = \tfrac{1}{2}\sum_{i=1}^{N} \max\bigl(0,\ \ln(\gamma(1-\nu_i))\bigr), \quad (7)

where $\nu_i$ denote the eigenvalues of $\Sigma_{Y|W}\Sigma_Y^{-1}$ and $\gamma$ parametrizes the IB trade-off [see Eq (1)]. (The arguments of the logarithms in Eqs (6-7) are always nonnegative since the data processing inequality means that the IB problem is well-defined only for $\gamma>1$, and the eigenvalues of a normalized regression matrix always range from zero to one [36].) In our setting it is convenient to recast the summations above as integrals (see Appendix A for derivation),

I(\tilde{T};W) = \frac{P}{2}\int_{\psi>\psi_c} dF^{\Psi}(\psi)\,\ln\left(1+\frac{\psi-\psi_c}{\psi_c+\lambda^*}\right) \quad (8)
I(\tilde{T};Y\mid W) = \frac{P}{2}\int_{\psi>\psi_c} dF^{\Psi}(\psi)\,\ln(\psi/\psi_c) - I(\tilde{T};W), \quad (9)

where $\Psi\equiv XX^{\mathsf{T}}/N$ and $F^{\Psi}$ denote the empirical covariance and its cumulative spectral distribution, respectively. (Note that the eigenvalues of $XX^{\mathsf{T}}$ and $X^{\mathsf{T}}X$ are identical except for the number of zero modes.) In addition we introduce the parameter $\psi_c=\lambda^*/(\gamma-1)$ which controls the number and the weights of eigenmodes used in constructing the optimal representation $\tilde{T}$. In the limit $\psi_c\to 0^+$, the residual information diverges logarithmically (Fig 1d) and the relevant information converges to the available relevant information in the data (Fig 1c),

I(\tilde{T};W) \ \overset{\psi_c\to 0^+}{\longrightarrow}\ I(Y;W) = \frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi)\,\ln(1+\psi/\lambda^*). \quad (10)

Increasing $\psi_c$ from zero decreases both residual and relevant informations, tracing out the optimal frontier until the lower spectral cutoff $\psi_c$ reaches the upper spectral edge, at which point both informations vanish (Fig 1b) and beyond which no informative solution exists [36, 37, 38].
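For readers who want a numerical handle on Eqs (8)-(9), the sketch below (our own, with an illustrative Gaussian design) evaluates the optimal relevant and residual bits as finite-$P$ sums over the eigenvalues of $\Psi$, following the equivalent forms derived in Appendix A [Eqs (43)-(44)].

```python
# A minimal sketch of the optimal relevant/residual bits in Eqs (8)-(9),
# written as sums over the eigenvalues psi_i of Psi = X X^T / N [Eqs (43)-(44)].
import numpy as np

def ib_information(psi, lam_star, psi_c):
    """Relevant and residual information (in nats) of the IB optimal representation."""
    kept = psi[psi > psi_c]                                  # eigenmodes above the cutoff
    relevant = 0.5 * np.sum(np.log((kept + lam_star) / (psi_c + lam_star)))
    residual = 0.5 * np.sum(np.log(kept / psi_c)) - relevant
    return relevant, residual

# Example: eigenvalues of Psi for a standard Gaussian design (illustrative values)
rng = np.random.default_rng(1)
N, P, sigma2, omega2 = 200, 100, 1.0, 1.0
X = rng.standard_normal((P, N))
psi = np.linalg.eigvalsh(X @ X.T / N)
lam_star = (P / N) * sigma2 / omega2
print(ib_information(psi, lam_star, psi_c=0.1))
```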

Figure 1: Information optimal algorithm. a A learning algorithm $Q_{T|Y}$ is a mapping from training data $Y$ to fitted models $T$. Information optimal algorithms minimize residual bits $I(T;Y\mid W)$—which are uninformative of the unknown generative model $W$—at fixed relevance level $\mu$, defined as the ratio between the encoded and available relevant bits, $I(T;W)$ and $I(Y;W)$. b-d The information content of optimal algorithms for learning a linear map (Sec 1.1) at various measurement densities $N/P$ (see color bar). b Optimal algorithms cannot increase the relevance level without encoding more residual bits. Increasing $N/P$ reduces the residual bits per sample but only when $N\lesssim P$. This results from the change in sample size dependence of relevant bits in the data from linear to logarithmic around $N\approx P$ (inset). That is, available relevant bits in each sample become redundant around $N\approx P$ and increasingly so as $N$ increases further. Learning algorithms use this redundancy to better distinguish signals from noise, thereby requiring fewer residual bits per sample. c-d The IB frontiers in (b) are parametrized by a spectral cutoff $\psi_c$ [see Eqs (8-9)]. Here we set $\omega^2/\sigma^2=1$ and let $P,N\to\infty$ at the same rate such that the ratio $N/P$ remains fixed and finite. The empirical spectral distribution $F^{\Psi}$ follows the standard Marchenko-Pastur law (see Sec 4).

1.3 Information efficiency

The exact characterization of the IB frontier provides a useful benchmark for information-theoretic analyses of learning algorithms, not least because it allows a precise definition of information efficiency. That is, we can now ask how many more residual bits a given algorithm needs to encode, compared to the IB optimal algorithm, in order to achieve the same level of relevant information. Here we define the information efficiency $\eta_\mu$ as the ratio between the residual bits encoded in the outputs of the IB optimal algorithm and the algorithm of interest—$\tilde{T}$ and $T$, respectively—at some fixed relevance level $\mu$, i.e.,

\eta_\mu \equiv \frac{I(\tilde{T};Y\mid W)}{I(T;Y\mid W)} \quad\text{subject to}\quad \mu=\frac{I(\tilde{T};W)}{I(Y;W)}=\frac{I(T;W)}{I(Y;W)}. \quad (11)

Since the optimal representation minimizes residual bits at fixed $\mu$ (Fig 1a), the information efficiency ranges from zero to one, $0\le\eta_\mu\le 1$. In addition we have $0<\mu\le 1$, resulting from the data processing inequality $I(T;W)\le I(Y;W)$ for the Markov constraint $T\leftrightarrow Y\leftrightarrow W$ (see, e.g., Ref [39]).
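In practice, Eq (11) can be evaluated by interpolating two information curves at a matched relevance level. The helper below is a hypothetical sketch (not from the paper) that assumes each curve is supplied as arrays sorted by increasing relevant information.

```python
# A hypothetical helper computing eta_mu of Eq (11) by interpolating two
# (relevant, residual) information curves at a common relevance level mu.
import numpy as np

def efficiency(mu, relevant_opt, residual_opt, relevant_alg, residual_alg, i_yw):
    """eta_mu = I(T_opt; Y|W) / I(T_alg; Y|W) at fixed mu = I(T;W)/I(Y;W).

    The curve arrays must be sorted by increasing relevant information so that
    np.interp is valid; i_yw is the available relevant information I(Y;W).
    """
    res_opt = np.interp(mu * i_yw, relevant_opt, residual_opt)  # IB frontier
    res_alg = np.interp(mu * i_yw, relevant_alg, residual_alg)  # algorithm of interest
    return res_opt / res_alg
```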

2 Gibbs-posterior least squares regression

We consider one of the best-known learning algorithms: least squares linear regression. Not only is this algorithm widely used in practice, it has also proved a particularly well-suited setting for analyzing learning in the overparametrized regime [40, 41, 42, 43, 44, 45, 46, 47]. Indeed it exhibits some of the most intriguing features of overparametrized learning, including benign overfitting and double descent which describe the surprisingly good generalization performance of overparametrized models and its nonmonotonic dependence on model complexity and sample size [48, 42, 31].

Inferring a model from data generally requires an assumption on a class of models, which defines the hypothesis space, as well as a learning algorithm, which outputs a hypothesis according to some criteria that rank how well each hypothesis explains the data. Linear regression restricts the model class to a linear map, parametrized by $T\in\mathbb{R}^P$, between an input $x_i$ and a predicted response $\hat{y}_i$,

\hat{y}_i = T\cdot x_i. \quad (12)

Minimizing the mean squared error $\frac{1}{N}\sum_{i=1}^N(\hat{y}_i-y_i)^2$ yields a closed form solution for the estimated regressor, $T^*=(XX^{\mathsf{T}})^{-1}XY$. However, this requires $XX^{\mathsf{T}}$ to be invertible and thus does not work in the overparametrized regime in which the sample covariance is not full rank and infinitely many models have vanishing mean squared error.

There are several approaches to break this degeneracy, but perhaps the simplest and most studied is ridge regularization, which adds to the mean squared error a preference for model parameters with small $L_2$ norm, resulting in the regularized loss function

L(T,X,Y) = \tfrac{1}{N}\|Y-X^{\mathsf{T}}T\|_2^2 + \lambda\|T\|_2^2, \quad (13)

where $\lambda>0$ controls the regularization strength. Minimizing this loss function leads to a unique solution $T^*_\lambda=(XX^{\mathsf{T}}+\lambda N I_P)^{-1}XY$ even when $N<P$.
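For concreteness, here is a minimal sketch (ours) of the ridge estimator $T^*_\lambda$ above; a linear solve is used instead of an explicit matrix inverse, which is an implementation choice rather than anything prescribed by the paper.

```python
# A minimal sketch of the ridge estimator T*_lambda = (X X^T + lambda N I_P)^{-1} X Y.
import numpy as np

def ridge_estimator(X, Y, lam):
    P, N = X.shape
    # Solve the regularized normal equations rather than forming the inverse.
    return np.linalg.solve(X @ X.T + lam * N * np.eye(P), X @ Y)
```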

Gibbs posterior—While ridge regression works in the overparametrized regime, it is a deterministic algorithm which does not readily lend itself to information-theoretic analyses because the mutual information between two deterministically related continuous random variables diverges. Instead we consider the Gibbs posterior (or Gibbs algorithm) which becomes a Gaussian channel when defined with the ridge regularized loss in Eq (13),

Q_{T\mid X,Y} \propto e^{-\beta L(T,Y,X)} \ \leadsto\ T\mid X,Y \sim N\left(\tfrac{1}{N}\,(\Psi+\lambda I_P)^{-1}XY,\ \tfrac{1}{2\beta}\,(\Psi+\lambda I_P)^{-1}\right). \quad (14)

Here $\beta$ denotes the inverse temperature. In the zero temperature limit $\beta\to\infty$, this algorithm returns the usual ridge regression estimate $T^*_\lambda$ (the mean of the above normal distribution) with probability approaching one. While randomized ridge regression need not take the form above, Gibbs posteriors are attractive, not least because they naturally emerge, for example, from information-regularized risk minimization [5] (see also Ref [27] for a recent discussion).
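A minimal sketch (our own, not the authors' implementation) of drawing a fitted model from the Gaussian channel in Eq (14); by construction the posterior mean is the ridge estimate $T^*_\lambda$, and the draw concentrates on it as $\beta\to\infty$.

```python
# A minimal sketch of sampling from the Gibbs posterior in Eq (14).
import numpy as np

def sample_gibbs_posterior(X, Y, lam, beta, rng):
    P, N = X.shape
    Psi = X @ X.T / N                             # sample covariance
    A_inv = np.linalg.inv(Psi + lam * np.eye(P))  # (Psi + lambda I_P)^{-1}
    mean = A_inv @ (X @ Y) / N                    # ridge estimate T*_lambda
    cov = A_inv / (2.0 * beta)                    # posterior covariance, Eq (14)
    return rng.multivariate_normal(mean, cov)
```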

Markov constraint—The generative model $P_{Y|W}$, true parameter distribution $P_W$ and learning algorithm $Q_{T|Y}$ [Eqs (3-4) & (14)] completely describe the relationship between all random variables of interest through the Markov factorization of their joint distribution (Fig 1a),

P_{T,Y,W} = P_W \otimes P_{Y\mid W} \otimes Q_{T\mid Y}. \quad (15)

Note that $P_{Y|W}=P_{Y|X,W}$ and $Q_{T|Y}=Q_{T|X,Y}$ in the fixed design setting (see Sec 1.1).

3 Information content of Gibbs regression

We now turn to the relevant and residual informations of the models that result from the Gibbs regression algorithm [Eq (14)]. Since all distributions appearing on the rhs of Eq (15) are Gaussian, the relevant and residual informations are given by

I(T;W) = \tfrac{1}{2}\ln\det \Sigma_T\Sigma_{T|W}^{-1} \quad\text{and}\quad I(T;Y\mid W) = \tfrac{1}{2}\ln\det \Sigma_{T|W}\Sigma_{T|Y}^{-1}. \quad (16)

Here we use the fact that $\Sigma_{T|W,Y}=\Sigma_{T|Y}$ due to the Markov constraint [Eq (15)]. The covariance $\Sigma_{T|Y}$ is defined by the learning algorithm in Eq (14). To obtain $\Sigma_{T|W}$ and $\Sigma_T$, we marginalize out $Y$ and $W$ in order from $P_{T,Y,W}$ [Eqs (3-4) & (14-15)] and obtain

T\mid W \sim N\left(\tfrac{\Psi}{\Psi+\lambda I_P}W,\ \tfrac{1}{2\beta}\tfrac{1}{\Psi+\lambda I_P}+\tfrac{\sigma^2}{N}\tfrac{\Psi}{(\Psi+\lambda I_P)^2}\right) \quad (17)
T \sim N\left(0,\ \tfrac{1}{2\beta}\tfrac{1}{\Psi+\lambda I_P}+\tfrac{\sigma^2}{N}\tfrac{\Psi}{(\Psi+\lambda I_P)^2}+\tfrac{\omega^2}{P}\tfrac{\Psi^2}{(\Psi+\lambda I_P)^2}\right). \quad (18)

Substituting the covariance matrices above into Eq (16) yields (see Appendix B for derivation)

I(T;W) = \frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi)\,\ln\left(1+\frac{\psi^2/\lambda^*}{\psi+\frac{N}{2\beta\sigma^2}(\psi+\lambda)}\right) \quad (19)
I(T;Y\mid W) = \frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi)\,\ln\left(1+\frac{2\beta\sigma^2}{N}\frac{\psi}{\psi+\lambda}\right). \quad (20)

The integration domains are restricted to positive real numbers since the eigenvalues of a covariance matrix are non-negative and the integrands vanish at $\psi=0$.
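For finite $P$, the integrals in Eqs (19)-(20) reduce to averages over the eigenvalues of $\Psi$; the sketch below (ours) evaluates them directly from an eigenvalue array, with the variable names chosen for illustration.

```python
# A minimal sketch of Eqs (19)-(20): relevant and residual bits of Gibbs regression
# as sums over the eigenvalues psi_i of Psi = X X^T / N.
import numpy as np

def gibbs_information(psi, lam, beta, sigma2, omega2, N):
    psi = np.clip(psi, 0.0, None)           # guard against tiny negative eigenvalues
    P = psi.size
    lam_star = (P / N) * sigma2 / omega2
    temp = N / (2.0 * beta * sigma2)        # effective temperature scale N / (2 beta sigma^2)
    relevant = 0.5 * np.sum(np.log1p((psi**2 / lam_star) / (psi + temp * (psi + lam))))
    residual = 0.5 * np.sum(np.log1p((2.0 * beta * sigma2 / N) * psi / (psi + lam)))
    return relevant, residual
```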

In the zero temperature limit $\beta\to\infty$, the residual information diverges (as expected from a deterministic algorithm [23, 24]; see also Fig 2c) whereas the relevant information approaches the mutual information between the data $Y$ and the true parameter $W$,

I(T;W) \ \overset{\beta\to\infty}{\longrightarrow}\ I(Y;W) = \frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi)\,\ln(1+\psi/\lambda^*). \quad (21)

Relevant and residual informations decrease as $\beta$ decreases until they vanish in the limit $\beta\to 0^+$, at which point Gibbs posteriors become completely random (Fig 2a-c).

Figure 2: Gibbs regression. a-c The information content of Gibbs regression [Eq (14)] at $N/P=1$ and various regularization strengths $\lambda$ (see color bar). a The information curves of Gibbs regression are bounded by the IB frontier (dotted curve). b-c The inverse temperature $\beta$ controls the stochasticity of Gibbs posteriors [Eqs (19-20)]. Both relevant and residual bits decrease with temperature and vanish in the limit $\beta\to 0$ where Gibbs posteriors become completely random. d-e Information efficiency [Eq (11)] and residual information of Gibbs regression with $\lambda=10^{-6}$ vs measurement ratio at various relevance levels $\mu$ (see legend). d Gibbs regression approaches optimality in the limits $N\gg P$ and $N\ll P$, and becomes least efficient at $N/P=1$. e Residual bits of Gibbs regression and optimal algorithm (dotted) grow linearly with $N$ when $N\lesssim P$. This growth is similar to that of the available relevant bits (Fig 1b inset). But while the available relevant bits always increase with $N$, the residual bits decrease as $N$ exceeds $P$. Here we set $\omega^2/\sigma^2=1$ and let $P,N\to\infty$ at the same rate such that the ratio $N/P$ remains fixed and finite. The eigenvalues of the sample covariance follow the standard Marchenko-Pastur law (see Sec 4).

3.1 Zero temperature limit

At first sight it appears that our analyses are not applicable in the high-information limit since the residual information diverges for both the optimal algorithm and Gibbs regression (see Figs 1d & 2c). However the rates of divergence differ. Here we use this difference to characterize the efficiency of Gibbs regression in the zero temperature limit $\beta\to\infty$.

We first analyze the limiting behaviors of relevant information. From Eq (10), we see that the relevant information ratio of the IB solution approaches one at $\psi_c=0$. Perturbing $\psi_c$ in Eq (8) away from zero results in a linear correction to the relevant information,

I(\tilde{T};W) = I(Y;W) - \frac{\psi_c}{\lambda^*}\frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi) + O(\psi_c^2). \quad (22)

Keeping only the leading correction and recalling that $\mu=I(\tilde{T};W)/I(Y;W)$ [Eq (11)], we obtain

\lim_{\mu\to 1}\frac{\psi_c}{\lambda^*} = \frac{\int_{\psi>0} dF^{\Psi}(\psi)\,\ln(1+\psi/\lambda^*)}{\int_{\psi>0} dF^{\Psi}(\psi)}\,(1-\mu). \quad (23)

Similarly expanding the relevant information of Gibbs regression [Eq (19)] around $\beta\to\infty$ yields

I(T;W) = I(Y;W) - \frac{N}{2\beta\sigma^2}\frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi)\,\frac{\psi+\lambda}{\psi+\lambda^*} + O(\beta^{-2}). \quad (24)

As a result, the correspondence between the low-temperature and high-information limits reads

\lim_{\mu\to 1}\frac{N}{2\beta\sigma^2} = \frac{\int_{\psi>0} dF^{\Psi}(\psi)\,\ln(1+\psi/\lambda^*)}{\int_{\psi>0} dF^{\Psi}(\psi)\,\frac{\psi+\lambda}{\psi+\lambda^*}}\,(1-\mu). \quad (25)

We turn to the residual bits. Expanding Eq (9) around $\psi_c=0$ and Eq (20) around $\beta^{-1}=0$ leads to

I(\tilde{T};Y\mid W) = -\frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi)\,\ln\frac{\psi_c}{\lambda^*} + \frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi)\,\ln\frac{\psi}{\psi+\lambda^*} + O(\psi_c) \quad (26)
I(T;Y\mid W) = -\frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi)\,\ln\frac{N}{2\beta\sigma^2} + \frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi)\,\ln\frac{\psi}{\psi+\lambda} + O(\beta^{-1}). \quad (27)

From Eqs (23) & (25), we see that the residual informations above have the same logarithmic singularity, $\ln(1-\mu)$, at $\mu=1$. Therefore their difference remains finite even as $\mu\to 1$. Combining Eqs (23) & (25-27) and recalling our definition of information efficiency $\eta_\mu=I(\tilde{T};Y\mid W)/I(T;Y\mid W)$ [Eq (11)] gives

\lim_{\mu\to 1}\eta_\mu = 1 - \frac{-1}{\ln(1-\mu)}\left(\ln\frac{\int_{\psi>0} dF^{\Psi}(\psi)\,\frac{\psi+\lambda}{\psi+\lambda^*}}{\int_{\psi>0} dF^{\Psi}(\psi)} - \frac{\int_{\psi>0} dF^{\Psi}(\psi)\,\ln\frac{\psi+\lambda}{\psi+\lambda^*}}{\int_{\psi>0} dF^{\Psi}(\psi)}\right). \quad (28)

Note that Jensen’s inequality guarantees that the terms in the parentheses sum to a non-negative value.

It is worth pointing out that, at $\lambda=\lambda^*$, the correction term in Eq (28) vanishes and the efficiency of deterministic Gibbs regression becomes minimally sensitive to algorithmic noise. Incidentally, this value of $\lambda$ also minimizes the $L_2$ prediction error of ridge regression in the asymptotic limit [40].
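The limiting efficiency in Eq (28) is straightforward to evaluate from a sample spectrum; the following sketch (our own) does so for a given relevance level $\mu$ close to one, restricting the averages to the nonzero eigenvalues as in the text.

```python
# A minimal sketch of the mu -> 1 efficiency limit in Eq (28). Jensen's inequality
# guarantees that the bracketed gap below is non-negative, so eta <= 1.
import numpy as np

def limiting_efficiency(psi, lam, lam_star, mu):
    psi = psi[psi > 1e-12]                                   # restrict to the nonzero spectrum
    ratio = (psi + lam) / (psi + lam_star)
    gap = np.log(np.mean(ratio)) - np.mean(np.log(ratio))    # >= 0 by Jensen; zero at lam = lam_star
    return 1.0 - gap / (-np.log(1.0 - mu))
```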

4 High dimensional limit

To place our results in the context of high dimensional learning, we specialize to the thermodynamic limit in which sample size and input dimension tend to infinity at a fixed ratio—that is, $N,P\to\infty$ at $N/P=n\in(0,\infty)$. While it is easy to grow the dimension of the true parameter $W$ (Sec 1.1), we have so far not specified how the design matrix $X$, and thus the training data $Y$, should scale in this limit.

To this end, we consider a setting in which the design matrix is generated from $X=\Sigma^{1/2}Z$, where $Z\in\mathbb{R}^{P\times N}$ is a matrix with iid entries drawn from a distribution with zero mean and unit variance, and $\Sigma\in\mathbb{R}^{P\times P}$ is a covariance matrix. (This prescription includes the case where input vectors are drawn iid from $x_i\sim N(0,\Sigma)$ for $i\in\{1,\dots,N\}$.) If $\Sigma$ admits a limiting spectral density as $P\to\infty$, then the empirical spectral distribution $F^{\Psi}$ becomes deterministic [49, 50].

To aid interpretation of our results, we frame all of the following discussions from the perspective that the input dimension $P$ is held fixed and a change in measurement density $n=N/P$ results only from a change in sample size $N$.

4.1 Isotropic covariates

For $\Sigma=I_P$, the empirical spectral distribution converges to the standard Marchenko-Pastur law [49]

dF^{\Psi}(\psi) = n\,\frac{\sqrt{(\psi_+-\psi)(\psi-\psi_-)}}{2\pi\psi}\,d\psi \quad\text{for}\quad \psi_-<\psi<\psi_+, \quad (29)

where $\psi_\pm=(1\pm 1/\sqrt{n})^2$ and $F^{\Psi}(0)=\max(0,1-n)$. We use this spectral distribution in Figs 1-2.
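As a sanity check (not part of the paper), one can compare the empirical spectrum of $\Psi$ for an isotropic Gaussian design against the Marchenko-Pastur density of Eq (29); the sketch below assumes matplotlib is available for the comparison plot.

```python
# Compare the eigenvalue histogram of Psi = X X^T / N (isotropic Gaussian design)
# with the Marchenko-Pastur density of Eq (29) at measurement density n = N/P.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n, P = 2.0, 1000
N = int(n * P)
X = rng.standard_normal((P, N))
psi = np.linalg.eigvalsh(X @ X.T / N)

lo, hi = (1 - 1/np.sqrt(n))**2, (1 + 1/np.sqrt(n))**2      # spectral edges psi_-, psi_+
grid = np.linspace(lo + 1e-6, hi - 1e-6, 400)
density = n * np.sqrt((hi - grid) * (grid - lo)) / (2 * np.pi * grid)

plt.hist(psi, bins=60, density=True, alpha=0.5, label="empirical spectrum")
plt.plot(grid, density, label="Marchenko-Pastur, Eq (29)")
plt.legend(); plt.show()
```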

Optimal algorithm—In Fig 1b, the IB optimal frontiers illustrate the fundamental trade-off; optimal algorithms cannot encode fewer residual bits without becoming less relevant. Figure 1c-d shows that encoded relevant and residual bits go down as $\psi_c$ increases and fewer eigenmodes contribute to the IB optimal representation [Eqs (8-9)]. However relevant and residual informations exhibit different behaviors at high information; as $\psi_c\to 0$, relevant information plateaus whereas residual information diverges logarithmically [see also Eq (26)].

Gibbs regression—Figure 2a depicts the information content of Gibbs regression at different regularization strengths [Eqs (19-20)] and illustrates the fundamental trade-off, similarly to the IB frontier (dotted) but at a lower relevance level. Here the information curves are parametrized by the inverse temperature $\beta$, which controls the algorithmic stochasticity; Gibbs posteriors become deterministic as $\beta\to\infty$ and completely random at $\beta=0$ [Eq (14)]. In Fig 2b-c, we see that Gibbs regression encodes fewer relevant and residual bits as temperature goes up; conversely, decreasing the Gibbs temperature increases the encoded information. In the zero-temperature limit, the relevant bits saturate while the residual bits grow logarithmically (cf. Fig 1c-d; see Sec 3.1 for a detailed analysis of this limit). The amount of encoded information depends also on the regularization strength $\lambda$. Figure 2b-c shows that, at a fixed temperature, an increase in $\lambda$ leads to less information extracted. However this does not necessarily mean that a larger $\lambda$ hurts information efficiency. Indeed a lower temperature can compensate for the decrease in information. In Fig 2a, we see that the information curves can move closer to the optimal frontier as $\lambda$ increases. In general the maximum efficiency occurs at an intermediate regularization strength that depends on data structure and measurement density (see also Sec 4.2).

Efficiency—Figure 2d displays the information efficiency of Gibbs regression at different relevant information levels (see Sec 1.3). We see that the efficiency approaches optimality ($\eta_\mu=1$) in the limits $n\to 0$ and $n\to\infty$. Away from these limits, Gibbs regression requires more residual bits than the optimal algorithm to achieve the same level of relevance, with an efficiency minimum around $n=1$. We also see that the efficiency of Gibbs regression decreases with relevance level (see also Supplementary Figure in Appendix D).

Extensivity—Learning is qualitatively different in the over- and underparametrized regimes. In Fig 2e we see that both optimal algorithms and Gibbs regression exhibit nonmonotonic dependence on sample size. In the overparametrized regime $n<1$, the residual information is extensive in sample size, i.e., it grows linearly with $N$. This scaling behavior mirrors that of the relevant bits in the data (Fig 1b inset). But unlike the available relevant bits (which continue to grow in the data-abundant regime, albeit sublinearly), the encoded residual bits decrease with sample size in this limit (see also Supplementary Figure in Appendix D). The resulting maximum is an information-theoretic analog of double descent—the decrease in overfitting level (test error) as the number of parameters exceeds sample size (decreasing $n$) [48, 31].

Redundancy—Indeed we could have anticipated the extensive behavior of the residual bits in the overparametrized regime (Fig 2e). In this limit, the extensivity of available relevant bits implies that the data encode relevant information with no redundancy. In other words, the relevant bits in one observation do not overlap with those in another. As a result, the dominant learning strategy is to treat each sample separately and extract the same amount of information from each of them, thus resulting in extensive residual information. In the data-abundant regime, on the other hand, the coding of relevant bits in the data becomes increasingly redundant (Fig 1b inset). Learning algorithms exploit this redundancy to better distinguish signals from noise, thereby encoding fewer residual bits.

Figure 3: Multiple descent under anisotropic covariates. a The relevant bits in the data decrease slightly as the anisotropy ratio $r$ departs from one (see legend). When $N/P\lesssim 1$, the available relevant information grows linearly with $N$. Strong anisotropy sees this growth start becoming sublinear at smaller $N/P$. b The IB frontiers at $N/P=1$. We see that while less relevant information is available in the anisotropic case, it takes fewer residual bits to achieve the same relevance level as the isotropic case (see legend in a). c The empirical spectral density of the sample covariance at different anisotropy ratios (see labels). Each vertical line is normalized by its maximum. We see that anisotropy splits the spectral continuum into two bands which merge into one as $N/P$ decreases. The solid line depicts the IB cutoff $\psi_c$ [Eqs (8-9)] for the relevance level $\mu=0.8$. d The residual information of optimal algorithms (dotted) and Gibbs regression at various regularization strengths (see color bar) for $\mu=0.8$ and different anisotropy ratios (same labels as in c). Here we set $\omega^2/\sigma^2=1$ and let $P,N\to\infty$ at the same rate such that the ratio $N/P$ remains fixed and finite. The eigenvalues of the sample covariance follow the general Marchenko-Pastur theorem (see Sec 4.2).

4.2 Anisotropic covariates

To explore the effects of anisotropy, we consider a two-scale model in which the population spectral distribution $F^{\Sigma}$ is an equal mixture of two point masses at $s_+$ and $s_-$. We normalize the trace of the population covariance such that the signal variance, and thus the signal-to-noise ratio, does not depend on $F^{\Sigma}$—i.e., we set $\operatorname{tr}\Sigma/P=(s_++s_-)/2=1$ such that $\mathbb{E}[(W\cdot x_i)^2]=\mathbb{E}\|W\|^2=\omega^2$. As a result the anisotropy in our two-scale model is parametrized completely by the eigenvalue ratio $r\equiv s_-/s_+$.

Unlike the isotropic case, the limiting empirical spectral distribution does not admit a closed form expression. We obtain $F^{\Psi}$ by solving the Silverstein equation and inverting the resulting Stieltjes transform [51] (see Appendix C). Figure 3c depicts the spectral density at various anisotropy ratios and measurement densities. At high measurement densities $n\gtrsim 1$, anisotropy splits the continuum part of the spectrum into two bands, corresponding to the two modes of the population covariance. These bands broaden as $n$ decreases and eventually merge into one in the overparametrized limit.

Available information—In Fig 3a, we see that anisotropy decreases the relevant information in the data but does not affect its qualitative behaviors: the available relevant bits are extensive in the overparametrized regime and subextensive in the data-abundant regime. Although fewer relevant bits are available, learning need not be less information efficient. Indeed the IB frontiers in Fig 3b illustrate that it takes fewer residual bits in the anisotropic case to reach the same level of relevance as in the isotropic case. This behavior is also apparent in Fig 3d (dotted) as we increase anisotropy levels (from right to left panels).

Anisotropy effects—Anisotropy affects optimal algorithms via the different scales in the population covariance. The signals along high-variance (easy) directions are stronger and, as a result, the coding of relevant bits in these directions becomes subextensive and redundant around $n\approx 1/2$ (the proportion of easy directions) instead of at one (Fig 3a). This earlier onset of redundancy allows for more effective signal-noise discrimination in the anisotropic case (Fig 3b & d). In the data-abundant limit, however, low-variance (hard) directions become important as learning algorithms already encode most of the relevant bits along the easy directions. Indeed the hard directions are harder for more anisotropic inputs and thus the required residual bits increase with anisotropy in the limit $n\to\infty$ (Fig 3d).

Triple descent—Perhaps the most striking effect of anisotropy is the emergence of an information-theoretic analog of (sample-wise) multiple descent, which describes disjoint regions where more data makes overfitting worse (larger test error) [32, 52]. In Fig 3d, we see that an additional residual information maximum emerges at large $n$. This behavior is a consequence of the separation of scales. The first maximum at $n\sim 1$ originates from easy directions and the other maximum at higher $n$ from hard directions. In fact the IB cutoff $\psi_c$ in Fig 3c demonstrates that the residual information maxima roughly coincide with the inclusion of all high-variance modes around $n\sim 1$ and low-variance modes at higher $n$. (Note that Fig 3c does not show the zero modes which are present at $n<1$. The fact that the spectral continuum appears to be above the IB cutoff at small $n$ does not mean all eigenmodes are used in the IB solution.) In addition we note that for optimal algorithms the first maximum shifts to a lower $n$ as the anisotropy level increases. This observation is consistent with the fact that the onset of redundancy of relevant bits in the data occurs at smaller $n$ in the anisotropic case (Fig 3a).

Gibbs regression—Anisotropy makes Gibbs regression depend more strongly on regularization strength, see Fig 3. In particular the information efficiency decreases with $\lambda$ near the first residual information maximum around $n\sim 1$, but this dependence reverses near the second maximum and at larger $n$. This behavior is expected. Inductive bias from strong regularization helps prevent noise from poisoning the models at small $n$. But when the data become abundant, regularization is unnecessary.

5 Conclusion & Outlook

We use the information bottleneck theory to analyze linear regression problems and illustrate the fundamental trade-off between relevant bits, which are informative of the unknown generative processes, and residual bits, which measure overfitting. We derive the information content of optimal algorithms and Gibbs posterior regression, thus enabling a quantitative investigation of information efficiency. In addition our analytical results on the zero temperature limit of the Gibbs posterior offer a glimpse of a connection between information efficiency and optimally tuned ridge regression. Finally, using results from random matrix theory, we reveal the information complexity of learning a linear map in high dimensions and unveil an information-theoretic analog of multiple descent phenomena. Since residual information is an upper bound on the generalization gap [4, 5], we believe that this information nonmonotonicity could be connected to the original double descent phenomena. But it remains to be seen how deep this connection is.

Our work paves the way for a number of different avenues for future research. While we only focus on isotropic regularization here, it would be interesting to understand how structured regularization affects information extraction. Information-efficiency analyses of different algorithms, such as Bayesian regression, and other classes of learning problems, e.g., classification and density estimation, are also in order. An investigation of information efficiency based on other $f$-divergences could lead to new insights into generalization. In particular an exact relationship exists between residual Jeffreys information and the generalization error of Gibbs posteriors [27]. Finally exploring how coding redundancy in training data quantitatively affects learning phenomena in general would make for an exciting research direction (see, e.g., Refs [17, 19]).

Acknowledgments and Disclosure of Funding

This work was supported in part by the National Institutes of Health BRAIN initiative (R01EB026943), the National Science Foundation, through the Center for the Physics of Biological Function (PHY-1734030), the Simons Foundation and the Sloan Foundation.

References


Appendix A Information content of maximally efficient algorithms

Consider an IB problem where we are interested in an information efficient representation of $Y$ that is predictive of $W$ (Fig 1a). When $Y$ and $W$ are Gaussian correlated, the central object in constructing an IB solution is the normalized regression matrix $\Sigma_{Y|W}\Sigma_Y^{-1}$; in particular, its eigenvalues $\nu_i[\Sigma_{Y|W}\Sigma_Y^{-1}]$ completely characterize the information content of the IB optimal representation $\tilde{T}$ via (see Ref [36] for a derivation)

I(\tilde{T};W) = \frac{1}{2}\sum_{i=1}^{N} \max\left(0,\ \ln\frac{1-\gamma^{-1}}{\nu_i[\Sigma_{Y|W}\Sigma_Y^{-1}]}\right) \quad (30)
I(\tilde{T};Y\mid W) = \frac{1}{2}\sum_{i=1}^{N} \max\bigl(0,\ \ln(\gamma(1-\nu_i[\Sigma_{Y|W}\Sigma_Y^{-1}]))\bigr), \quad (31)

where $N$ is the dimension of $Y$ and $\gamma$ parametrizes the IB trade-off [Eq (1)].

Our work focuses on the following generative model for $W$ and $Y$ (see Sec 1.1),

W \sim N(0,\tfrac{\omega^2}{P}I_P) \quad\text{and}\quad Y\mid W \sim N(X^{\mathsf{T}}W,\ \sigma^2 I_N). \quad (32)

Marginalizing out $W$ yields

Y \sim N\!\left(0,\ \sigma^2 I_N + \tfrac{\omega^2}{P}X^{\mathsf{T}}X\right). \quad (33)

As a result, the normalized regression matrix reads

\Sigma_{Y|W}\Sigma_Y^{-1} = \sigma^2 I_N\left(\sigma^2 I_N + \tfrac{\omega^2}{P}X^{\mathsf{T}}X\right)^{-1} = \left(I_N + \frac{1}{\lambda^*}\frac{X^{\mathsf{T}}X}{N}\right)^{-1} \quad\text{where}\quad \lambda^*\equiv\frac{P}{N}\frac{\sigma^2}{\omega^2}. \quad (34)

Substituting Eq (34) into Eqs (30-31) gives

I(\tilde{T};W) = \frac{1}{2}\sum_{i=1}^{N} \max\left(0,\ \ln\left((1-\gamma^{-1})(1+\phi_i[X^{\mathsf{T}}X/N]/\lambda^*)\right)\right) \quad (35)
I(\tilde{T};Y\mid W) = \frac{1}{2}\sum_{i=1}^{N} \max\left(0,\ \ln\frac{\gamma\,\phi_i[X^{\mathsf{T}}X/N]}{\lambda^*+\phi_i[X^{\mathsf{T}}X/N]}\right), \quad (36)

where $\phi_i[X^{\mathsf{T}}X/N]$ denote the eigenvalues of $X^{\mathsf{T}}X/N$. Since the eigenvalues of $X^{\mathsf{T}}X/N$ and the sample covariance $\Psi=XX^{\mathsf{T}}/N$ are identical except for the zero modes, which do not contribute to information, we can recast the above equations as

I(\tilde{T};W) = \frac{1}{2}\sum_{i=1}^{P} \max\left(0,\ \ln\left((1-\gamma^{-1})(1+\psi_i/\lambda^*)\right)\right) \quad (37)
I(\tilde{T};Y\mid W) = \frac{1}{2}\sum_{i=1}^{P} \max\left(0,\ \ln\frac{\gamma\psi_i}{\lambda^*+\psi_i}\right), \quad (38)

where $\psi_i$ are the eigenvalues of $\Psi$ and the summation limits change to $P$, the number of eigenvalues of $\Psi$. Introducing the cumulative spectral distribution $F^{\Psi}$ and replacing the summations with integrals results in

I(\tilde{T};W) = \frac{P}{2}\int dF^{\Psi}(\psi)\,\max\left(0,\ \ln\left((1-\gamma^{-1})(1+\psi/\lambda^*)\right)\right) \quad (39)
I(\tilde{T};Y\mid W) = \frac{P}{2}\int dF^{\Psi}(\psi)\,\max\left(0,\ \ln\frac{\gamma\psi}{\lambda^*+\psi}\right). \quad (40)

We see that the contributions to the integrals come from the logarithms but only when they are positive. This condition can be recast into integration limits (note that $\gamma>1$ and $\lambda^*>0$),

\ln\left((1-\gamma^{-1})(1+\psi/\lambda^*)\right) > 0 \implies \psi > \lambda^*/(\gamma-1) \quad (41)
\ln\frac{\gamma\psi}{\lambda^*+\psi} > 0 \implies \psi > \lambda^*/(\gamma-1). \quad (42)

Finally we define the lower cutoff $\psi_c\equiv\lambda^*/(\gamma-1)$ and use the above limits to rewrite the expressions for relevant and residual informations,

I(\tilde{T};W) = \frac{P}{2}\int_{\psi>\psi_c} dF^{\Psi}(\psi)\,\ln\frac{\psi+\lambda^*}{\psi_c+\lambda^*} = \frac{P}{2}\int_{\psi>\psi_c} dF^{\Psi}(\psi)\,\ln\left(1+\frac{\psi-\psi_c}{\psi_c+\lambda^*}\right) \quad (43)
I(\tilde{T};Y\mid W) = \frac{P}{2}\int_{\psi>\psi_c} dF^{\Psi}(\psi)\,\ln\frac{\psi}{\psi_c}\frac{\psi_c+\lambda^*}{\psi+\lambda^*} = \frac{P}{2}\int_{\psi>\psi_c} dF^{\Psi}(\psi)\,\ln\frac{\psi}{\psi_c} - I(\tilde{T};W). \quad (44)

These equations are identical to Eqs (8-9) in the main text.
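A quick numerical consistency check (ours) that the thresholded sums of Eqs (37)-(38) and the cutoff form of Eqs (43)-(44) agree, using a stand-in spectrum and illustrative values of $\gamma$ and $\lambda^*$.

```python
# Verify that Eqs (37)-(38) and Eqs (43)-(44) give the same relevant/residual bits.
import numpy as np

rng = np.random.default_rng(3)
psi = np.sort(rng.uniform(0.0, 4.0, 50))    # stand-in eigenvalues of Psi
lam_star, gamma = 0.5, 3.0
psi_c = lam_star / (gamma - 1.0)            # lower cutoff psi_c = lambda*/(gamma - 1)

# Eqs (37)-(38): thresholded sums over all eigenvalues
rel_a = 0.5 * np.sum(np.maximum(0.0, np.log((1 - 1/gamma) * (1 + psi / lam_star))))
res_a = 0.5 * np.sum(np.maximum(0.0, np.log(gamma * psi / (lam_star + psi))))

# Eqs (43)-(44): restrict to eigenvalues above the cutoff
kept = psi[psi > psi_c]
rel_b = 0.5 * np.sum(np.log((kept + lam_star) / (psi_c + lam_star)))
res_b = 0.5 * np.sum(np.log(kept / psi_c)) - rel_b

assert np.allclose([rel_a, res_a], [rel_b, res_b])
```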

Appendix B Information content of Gibbs-posterior regression

To compute the information content of Gibbs regression [Eq (14)], we first recall that the mutual information between two Gaussian correlated variables, $A$ and $B$, is given by

I(A;B) = \frac{1}{2}\ln\det \Sigma_A\Sigma_{A|B}^{-1}, \quad (45)

where $\Sigma_A$ is the covariance of $A$, and $\Sigma_{A|B}$ that of $A\mid B$.

We now write down the relevant information, using the covariances $\Sigma_{T|W}$ and $\Sigma_T$ from Eqs (17-18),

I(T;W) = \frac{1}{2}\ln\det\left(\Sigma_T\Sigma_{T|W}^{-1}\right) \quad (46)
= \frac{1}{2}\ln\det\frac{\frac{1}{2\beta}\frac{1}{\Psi+\lambda I_P}+\frac{\sigma^2}{N}\frac{\Psi}{(\Psi+\lambda I_P)^2}+\frac{\omega^2}{P}\frac{\Psi^2}{(\Psi+\lambda I_P)^2}}{\frac{1}{2\beta}\frac{1}{\Psi+\lambda I_P}+\frac{\sigma^2}{N}\frac{\Psi}{(\Psi+\lambda I_P)^2}} \quad (47)
= \frac{1}{2}\ln\det\left(I_P+\frac{\Psi^2/\lambda^*}{\Psi+\frac{N}{2\beta\sigma^2}(\Psi+\lambda I_P)}\right) \quad (48)
= \frac{1}{2}\operatorname{tr}\ln\left(I_P+\frac{\Psi^2/\lambda^*}{\Psi+\frac{N}{2\beta\sigma^2}(\Psi+\lambda I_P)}\right) \quad (49)
= \frac{1}{2}\sum_{i=1}^{P}\ln\left(1+\frac{\psi_i^2/\lambda^*}{\psi_i+\frac{N}{2\beta\sigma^2}(\psi_i+\lambda)}\right) \quad (50)
= \frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi)\,\ln\left(1+\frac{\psi^2/\lambda^*}{\psi+\frac{N}{2\beta\sigma^2}(\psi+\lambda)}\right), \quad (51)

where $\lambda^*=P\sigma^2/N\omega^2$. In the above, we use the identity $\ln\det H=\operatorname{tr}\ln H$, which holds for any positive-definite Hermitian matrix $H$, let $\psi_i$ denote the eigenvalues of the sample covariance $\Psi$ and introduce $F^{\Psi}$, the cumulative distribution of eigenvalues. We also assume that $\lambda$ and $\beta$ are finite and positive. Note that the integral is limited to positive real numbers because the eigenvalues of a covariance matrix are non-negative and the integrand vanishes for $\psi=0$.

Following the same logical steps as above and noting that the Markov constraint $W\leftrightarrow Y\leftrightarrow T$ implies $\Sigma_{T|Y,W}=\Sigma_{T|Y}$, we write down the residual information,

I(T;Y\mid W) = \frac{1}{2}\ln\det\left(\Sigma_{T|W}\Sigma_{T|Y,W}^{-1}\right) \quad (52)
= \frac{1}{2}\ln\det\left(\Sigma_{T|W}\Sigma_{T|Y}^{-1}\right) \quad (53)
= \frac{1}{2}\ln\det\left(\frac{\frac{1}{2\beta}\frac{1}{\Psi+\lambda I_P}+\frac{\sigma^2}{N}\frac{\Psi}{(\Psi+\lambda I_P)^2}}{\frac{1}{2\beta}\frac{1}{\Psi+\lambda I_P}}\right) \quad (54)
= \frac{P}{2}\int_{\psi>0} dF^{\Psi}(\psi)\,\ln\left(1+\frac{2\beta\sigma^2}{N}\frac{\psi}{\psi+\lambda}\right), \quad (55)

where we use the covariance matrices $\Sigma_{T|W}$ and $\Sigma_{T|Y}$ from Eqs (17) & (14).
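The spectral sums above can be checked against direct log-determinants of the covariances in Eqs (14) & (17)-(18); the sketch below (our own, with illustrative parameters) performs this comparison for a random finite-size instance.

```python
# Compare the spectral-sum forms of Eqs (50) & (55) with direct log-determinants
# of the covariances in Eqs (14) & (17)-(18) for one random instance.
import numpy as np

rng = np.random.default_rng(4)
N, P = 80, 50
sigma2, omega2, lam, beta = 1.0, 1.0, 0.1, 2.0
lam_star = (P / N) * sigma2 / omega2

X = rng.standard_normal((P, N))
Psi = X @ X.T / N
A = np.linalg.inv(Psi + lam * np.eye(P))                 # (Psi + lambda I_P)^{-1}

cov_TY = A / (2 * beta)                                  # Sigma_{T|Y}, Eq (14)
cov_TW = cov_TY + (sigma2 / N) * A @ Psi @ A             # Sigma_{T|W}, Eq (17)
cov_T = cov_TW + (omega2 / P) * A @ Psi @ Psi @ A        # Sigma_T, Eq (18)

relevant_direct = 0.5 * np.linalg.slogdet(cov_T @ np.linalg.inv(cov_TW))[1]
residual_direct = 0.5 * np.linalg.slogdet(cov_TW @ np.linalg.inv(cov_TY))[1]

psi = np.clip(np.linalg.eigvalsh(Psi), 0.0, None)
temp = N / (2 * beta * sigma2)
relevant_sum = 0.5 * np.sum(np.log1p((psi**2 / lam_star) / (psi + temp * (psi + lam))))
residual_sum = 0.5 * np.sum(np.log1p((2 * beta * sigma2 / N) * psi / (psi + lam)))

assert np.allclose([relevant_direct, residual_direct], [relevant_sum, residual_sum])
```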

Appendix C Marchenko-Pastur law

Consider $X=\Sigma^{1/2}Z$ where $Z\in\mathbb{R}^{P\times N}$ is a matrix with iid entries drawn from a distribution with zero mean and unit variance, and $\Sigma\in\mathbb{R}^{P\times P}$ is a covariance matrix. In addition we take the asymptotic limit $N\to\infty$, $P\to\infty$ and $P/N\to\alpha\in(0,\infty)$. If the population spectral distribution $F^{\Sigma}$ converges to a limiting distribution, the spectral distribution of the sample covariance $\Psi=XX^{\mathsf{T}}/N$ becomes deterministic [49]. The density, $f^{\Psi}(\psi)=dF^{\Psi}(\psi)/d\psi$, is related to its Stieltjes transform $m(z)$ via

f^{\Psi}(\psi) = \frac{1}{\pi}\operatorname{Im} m(\psi+i\,0^+), \quad \psi\in\mathbb{R}. \quad (56)

We can obtain $f^{\Psi}$ by solving the Silverstein equation for the companion Stieltjes transform $v(z)$ [51],

-\frac{1}{v(z)} = z - \alpha\int_{\mathbb{R}^+} dF^{\Sigma}(s)\,\frac{s}{1+s\,v(z)}, \quad z\in\mathbb{C}^+, \quad (57)

and using the relation

m(z) = \alpha^{-1}\left(v(z)+z^{-1}\right) - z^{-1}. \quad (58)

Here $\mathbb{C}^+$ denotes the upper half of the complex plane.
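A minimal numerical sketch (ours, not the authors' code) of this recipe for the two-scale model of Sec 4.2: Eq (57) is solved by damped fixed-point iteration slightly above the real axis, and the density then follows from Eqs (56) & (58). The imaginary offset, damping and iteration count are illustrative choices, and the iteration is assumed, rather than guaranteed, to converge near the spectral edges.

```python
# Solve the Silverstein equation (57) for the two-scale population spectrum and
# recover the sample-covariance spectral density via Eqs (56) & (58).
import numpy as np

def spectral_density_two_scale(psi_grid, alpha, r, eps=1e-3, iters=2000, damp=0.5):
    """alpha = P/N; r = s_-/s_+ with (s_+ + s_-)/2 = 1 (trace normalization)."""
    s_plus, s_minus = 2.0 / (1.0 + r), 2.0 * r / (1.0 + r)
    density = np.empty_like(psi_grid)
    for k, psi in enumerate(psi_grid):
        z = psi + 1j * eps                   # evaluate just above the real axis
        v = -1.0 / z                         # initial guess for the companion transform
        for _ in range(iters):
            integral = 0.5 * (s_plus / (1 + s_plus * v) + s_minus / (1 + s_minus * v))
            v_new = -1.0 / (z - alpha * integral)
            v = damp * v_new + (1 - damp) * v
        m = (v + 1.0 / z) / alpha - 1.0 / z  # Eq (58)
        density[k] = m.imag / np.pi          # Eq (56)
    return density

# Example: N/P = 2 (alpha = 0.5) and anisotropy ratio r = 0.1
grid = np.linspace(0.01, 4.0, 200)
f = spectral_density_two_scale(grid, alpha=0.5, r=0.1)
```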

Appendix D Supplementary figure

Figure 4: Gibbs ridge regression is least information efficient around $N/P=1$. a Residual information $I(T;Y\mid W)$ of the IB optimal algorithm over a range of sample densities $N/P$ (horizontal axis) and given extracted relevant bits $I(T;W)$ (vertical axis). The extracted relevant bits are bounded by the available relevant bits in the data (black curve), i.e., the data processing inequality implies $I(T;W)\le I(Y;W)$. b Same as (a) but for Gibbs regression with $\lambda=10^{-6}$. Holding other things equal, Gibbs regression estimators encode more residual bits than optimal representations. c Information efficiency, the ratio between residual bits in optimal representations (a) and the Gibbs estimator (b), is minimized around $N/P=1$. Here we set $\omega^2/\sigma^2=1$ and let $P,N\to\infty$ at the same rate such that the ratio $N/P$ remains fixed and finite. The eigenvalues of the sample covariance follow the standard Marchenko-Pastur law (see Sec 4).