
Epistemic Uncertainty and Observation Noise
with the Neural Tangent Kernel

Sergio Calvo-Ordoñez1,2∗ Konstantina Palla3 Kamil Ciosek3
1Mathematical Institute
University of Oxford
2Oxford-Man Institute of Quantitative Finance
University of Oxford
3Spotify
Abstract

Recent work has shown that training wide neural networks with gradient descent is formally equivalent to computing the mean of the posterior distribution in a Gaussian Process (GP) with the Neural Tangent Kernel (NTK) as the prior covariance and zero aleatoric noise [12]. In this paper, we extend this framework in two ways. First, we show how to deal with non-zero aleatoric noise. Second, we derive an estimator for the posterior covariance, giving us a handle on epistemic uncertainty. Our proposed approach integrates seamlessly with standard training pipelines, as it involves training a small number of additional predictors using gradient descent on a mean squared error loss. We demonstrate a proof of concept of our method through an empirical evaluation on synthetic regression tasks.

1 Introduction

Jacot et al. [12] studied the training of wide neural networks, showing that gradient descent on a standard loss is, in the limit of many iterations, formally equivalent to computing the posterior mean of a Gaussian Process (GP), with the prior covariance specified by the Neural Tangent Kernel (NTK) and with zero aleatoric noise. Crucially, this insight allows us to study complex behaviours of wide networks using Bayesian nonparametrics, which are much better understood.

We extend this analysis by asking two research questions. First, we ask if a similar equivalence exists in cases where we want to do inference for arbitrary values of aleatoric noise. This is crucial in many real-world settings, where measurement accuracy or other data-gathering errors mean that the information in our dataset is only approximate. Second, we ask if it is possible to obtain an estimate of the posterior covariance, not just the mean. Since the posterior covariance measures the epistemic uncertainty about predictions of a model, it is crucial for problems that involve out-of-distribution detection or training with bandit-style feedback.

We answer both of these research questions in the affirmative. Our posterior mean estimator takes the aleatoric noise into account by adding a simple squared norm penalty on the deviation of the network parameters from their initial values, shedding light on regularization in deep learning. Our covariance estimator can be understood as an alternative to existing methods of epistemic uncertainty estimation, such as dropout [7, 20], the Laplace approximation [6, 19], epistemic neural networks [18], deep ensembles [21, 14] and Bayesian Neural Networks [3, 13]. Unlike these approaches, our method has the advantage that it can approximate the NTK-GP posterior arbitrarily well.

Contributions

We derive estimators for the posterior mean and covariance of an NTK-GP with non-zero aleatoric noise, computable using gradient descent on a standard loss. We evaluate our results empirically on a toy regression problem.

2 Preliminaries

Gaussian Processes

Gaussian Processes (GPs) are a popular non-parametric approach for modeling distributions over functions [22]. Given a dataset of input-output pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, a GP represents uncertainty about function values by assuming they are jointly Gaussian with a covariance structure defined by a kernel function $k(\mathbf{x}, \mathbf{x}')$. The GP prior is specified as $f(\mathbf{x}) \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$, where $m(\mathbf{x})$ is the mean function and $k(\mathbf{x}, \mathbf{x}')$ is the kernel. Assuming $y_i \sim \mathcal{N}(f(\mathbf{x}_i), \sigma^2)$ and given new test points $\mathbf{x}'$, the posterior mean and covariance are given by:

$$\boldsymbol{\mu}_p(\mathbf{x}') = m(\mathbf{x}') + \mathbf{K}(\mathbf{x}', \mathbf{x})^\top(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}(\mathbf{y} - m(\mathbf{x})), \tag{1}$$
$$\boldsymbol{\Sigma}_p(\mathbf{x}') = \mathbf{K}(\mathbf{x}', \mathbf{x}') - \mathbf{K}(\mathbf{x}', \mathbf{x})^\top(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}', \mathbf{x}), \tag{2}$$

where $\mathbf{K}(\mathbf{x}, \mathbf{x})$ is the covariance matrix computed over the training inputs, $\mathbf{K}(\mathbf{x}', \mathbf{x})$ is the covariance matrix between the test and training points, and $\sigma^2$ represents the aleatoric (or observation) noise.
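To make equations (1) and (2) concrete, here is a minimal NumPy sketch of the posterior computation for a zero prior mean. The RBF kernel and the synthetic data are illustrative assumptions standing in for the NTK.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Illustrative stand-in kernel; the paper's method uses the NTK instead.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_test, sigma2, kernel=rbf_kernel):
    K = kernel(X, X)                      # K(x, x): train-train covariance
    K_star = kernel(X_test, X)            # K(x', x): test-train covariance
    K_ss = kernel(X_test, X_test)         # K(x', x'): test-test covariance
    A = np.linalg.inv(K + sigma2 * np.eye(len(X)))
    mu = K_star @ A @ y                   # equation (1) with m(x) = 0
    Sigma = K_ss - K_star @ A @ K_star.T  # equation (2)
    return mu, Sigma

X = np.linspace(-1, 1, 20).reshape(-1, 1)
y = np.sin(3 * X).ravel() + 0.1 * np.random.default_rng(0).normal(size=20)
mu, Sigma = gp_posterior(X, y, np.linspace(-2, 2, 50).reshape(-1, 1), sigma2=0.01)
```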

Neural Tangent Kernel.

The Neural Tangent Kernel (NTK) characterizes the evolution of wide neural network predictions as a linear model in function space. Given a neural network function $f(\mathbf{x}; \theta)$ parameterized by $\theta$, the NTK is defined through the Jacobian $J(\mathbf{x}) = \frac{\partial f(\mathbf{x}; \theta)}{\partial \theta} \in \mathbb{R}^{N \times p}$, where $N$ is the number of data points and $p$ is the number of parameters. The NTK at two sets of inputs $\mathbf{x}$ and $\mathbf{x}'$ is given by:

$$\mathbf{K}(\mathbf{x}, \mathbf{x}') = J(\mathbf{x})J(\mathbf{x}')^\top. \tag{3}$$

Interestingly, as shown by [12], the NTK converges to a deterministic kernel and remains constant during training in the infinite-width limit. We call a GP with the kernel (3) the NTK-GP.
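As an illustration, the following PyTorch sketch computes the empirical (finite-width) NTK of equation (3) by assembling the Jacobian row by row with autograd. The small MLP, its width, and the random inputs are illustrative assumptions; in the infinite-width limit this kernel becomes the deterministic NTK.

```python
import torch

def jacobian(model, X):
    # Rows of J: gradient of each scalar prediction wrt. all parameters.
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for i in range(X.shape[0]):
        out = model(X[i:i + 1]).squeeze()
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.stack(rows)  # shape (N, p)

def empirical_ntk(model, X, X_prime):
    return jacobian(model, X) @ jacobian(model, X_prime).T  # equation (3)

model = torch.nn.Sequential(
    torch.nn.Linear(1, 512), torch.nn.Softplus(), torch.nn.Linear(512, 1))
K = empirical_ntk(model, torch.randn(8, 1), torch.randn(8, 1))
```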

3 Method

We now describe our proposed process for doing inference in the NTK-GP. Our procedure for estimating the posterior mean is given in Algorithm 1, while the procedure for the covariance is given in Algorithm 2. Note that our process is scalable because both algorithms rely only on gradient descent, rather than on the matrix inverses in equations (1) and (2). While Algorithm 2 relies on the computation of a partial SVD of the Jacobian, we stress that efficient ways of doing so exist and do not require ever storing the full Jacobian. We defer the details of the partial SVD to Appendix E. We describe the theory that justifies our posterior computation in Sections 3.1 and 3.2. We defer the discussion of related literature to Appendix A.

Algorithm 1 Algorithm for Computing the Posterior Mean in the NTK-GP
procedure Train-Posterior-Mean($x_i, y_i$, $\theta_0$)
     $\hat{y}_i \leftarrow y_i + f(x_i; \theta_0)$ ▷ Shift the targets to get zero prior mean (Lemma 3.2).
     $L \leftarrow \frac{1}{N}\sum_{i=1}^N (\hat{y}_i - f(x_i; \theta))^2 + \beta_N \|\theta - \theta_0\|_2^2$ ▷ Equation (4)
     minimize $L$ with gradient descent wrt. $\theta$ until convergence to $\theta^\star$
     return $\theta^\star$ ▷ Return the trained weights.
end procedure
procedure Query-Posterior-Mean($x'_j$, $\theta^\star$, $\theta_0$) ▷ $j = 1, \dots, J$
     return $f(x'_1; \theta^\star) - f(x'_1; \theta_0), \dots, f(x'_J; \theta^\star) - f(x'_J; \theta_0)$
end procedure
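A minimal PyTorch sketch of Algorithm 1 might look as follows. The fixed step count (in place of training until convergence), the optimizer, and all hyperparameters are illustrative assumptions.

```python
import copy
import torch

def train_posterior_mean(model0, X, y, beta_N, lr=1e-4, steps=5000):
    # model0 holds theta_0; we train a copy of it on shifted targets.
    model = copy.deepcopy(model0)
    theta0 = [p.detach().clone() for p in model0.parameters()]
    y_hat = y + model0(X).detach()               # shifted targets (Lemma 3.2)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        reg = sum(((p - p0) ** 2).sum()          # ||theta - theta_0||^2
                  for p, p0 in zip(model.parameters(), theta0))
        loss = ((y_hat - model(X)) ** 2).mean() + beta_N * reg   # equation (4)
        loss.backward()
        opt.step()
    return model

def query_posterior_mean(model_star, model0, X_test):
    with torch.no_grad():
        return model_star(X_test) - model0(X_test)

model0 = torch.nn.Sequential(
    torch.nn.Linear(1, 512), torch.nn.Softplus(), torch.nn.Linear(512, 1))
X, y = torch.randn(32, 1), torch.randn(32, 1)
model_star = train_posterior_mean(model0, X, y, beta_N=1e-3)
mean = query_posterior_mean(model_star, model0, torch.linspace(-2, 2, 50).reshape(-1, 1))
```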

3.1 Aleatoric Noise

Gradient Descent Converges to the NTK-GP Posterior Mean

We build on the work of [12] by focusing on the computation of the posterior mean in the presence of non-zero aleatoric noise. We show that optimizing a regularized mean squared error loss in a neural network is equivalent to computing the posterior mean of an NTK-GP with non-zero aleatoric noise. In the following lemma, we prove that, for a sufficiently long training process, the predictions of the trained network converge to those of an NTK-GP with aleatoric noise characterized by $\sigma^2 = N\beta_N$. This is similar to the result of [11], but from a Bayesian perspective rather than a frequentist generalization bound. Furthermore, our proof (see Appendix B) focuses on explicitly solving the gradient flows for test and training data points in function space.

Lemma 3.1.

Consider a parametric model $f(x; \theta)$, where $x \in \mathcal{X} \subset \mathbb{R}^N$ and $\theta \in \mathbb{R}^p$, initialized with parameters $\theta_0$ (see Appendix F for the initialization assumptions). Minimizing the regularized mean squared error loss with respect to $\theta$ over a dataset $(\mathbf{x}, \mathbf{y})$ of size $N$, with sufficient training time ($t \rightarrow \infty$), to find the optimal set of parameters $\theta^*$:

$$\theta^* = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^p} \frac{1}{N}\sum_{i=1}^N (y_i - f(x_i; \theta))^2 + \beta_N \|\theta - \theta_0\|_2^2, \tag{4}$$

is equivalent to computing the posterior mean of a Gaussian process with non-zero aleatoric noise, $\sigma^2 = N\beta_N$, and the NTK as its kernel:

$$f(\mathbf{x}'; \theta_\infty) = f(\mathbf{x}'; \theta_0) + \mathbf{K}(\mathbf{x}', \mathbf{x})(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})^{-1}(\mathbf{y} - f(\mathbf{x}; \theta_0)). \tag{5}$$
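As a sanity check of Lemma 3.1, the following NumPy sketch uses a linear model $f(x; \theta) = J\theta$, for which the NTK linearization is exact: gradient descent on the loss (4) converges to the posterior mean (5) with $\mathbf{K} = JJ^\top$, here evaluated on the training inputs. All sizes and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 10, 50
J = rng.normal(size=(N, p))      # Jacobian doubles as the feature matrix
y = rng.normal(size=N)
theta0 = rng.normal(size=p)
beta_N, lr = 0.05, 5e-3

theta = theta0.copy()
for _ in range(50000):           # gradient descent on equation (4)
    grad = (2 / N) * J.T @ (J @ theta - y) + 2 * beta_N * (theta - theta0)
    theta -= lr * grad

K = J @ J.T
gp_mean = J @ theta0 + K @ np.linalg.solve(K + N * beta_N * np.eye(N), y - J @ theta0)
print(np.allclose(J @ theta, gp_mean, atol=1e-5))  # True
```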

Zero Prior Mean

In many practical scenarios, it is desirable to start with zero prior mean rather than with a prior mean that corresponds to the random network initialization. To accommodate this, we introduce a simple yet effective transformation of the data and the network outputs, to be applied together with Lemma 3.1. We summarize it in the following lemma (see Appendix B for the proof):

Lemma 3.2.

Consider the computational process derived in Lemma 3.1. Define shifted labels $\tilde{\mathbf{y}}$ and shifted predictions $\tilde{f}(\mathbf{x}'; \theta_\infty)$ as follows:

$$\tilde{\mathbf{y}} = \mathbf{y} + f(\mathbf{x}; \theta_0), \quad \tilde{f}(\mathbf{x}'; \theta_\infty) = f(\mathbf{x}'; \theta_\infty) - f(\mathbf{x}'; \theta_0).$$

Using these definitions, the posterior mean of a zero-mean Gaussian process can be computed as:

$$\tilde{f}(\mathbf{x}'; \theta_\infty) = \mathbf{K}(\mathbf{x}', \mathbf{x})(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})^{-1}\mathbf{y}. \tag{6}$$
Algorithm 2 Algorithm for Computing the Posterior Covariance in the NTK-GP
procedure Train-Posterior-Covariance($\mathbf{x}$, $K$, $\theta_0$) ▷ $K$ is the number of predictors
     $U, \Sigma \leftarrow \textsc{Partial-SVD}(J_{\theta_0}(\mathbf{x}), K)$ ▷ Partial SVD of the Jacobian; see Appendix E.
     for $i = 1, \dots, K$ do
          $\theta_i^\star \leftarrow \textsc{Train-Posterior-Mean}(\mathbf{x}, U_i, \theta_0)$ ▷ $U_i$ is the $i$-th column of $U$.
     end for
     for $i = 1, \dots, K'$ do ▷ Setting $K' = 0$ often works well (see Appendix D).
          ${\theta'}_i^\star \leftarrow \textsc{Train-Posterior-Mean}(\mathbf{x}, \epsilon_i, \theta_0)$ ▷ $\epsilon_i \sim \mathcal{N}(0, \sigma^2 I)$
     end for
     return $\Sigma, \theta_1^\star, \dots, \theta_K^\star, {\theta'}_1^\star, \dots, {\theta'}_{K'}^\star$
end procedure
procedure Query-Posterior-Covariance($x'_j$, $\Sigma$, $\theta_i^\star$, ${\theta'}_i^\star$, $\theta_0$) ▷ $j = 1, \dots, J$
     $P \leftarrow \begin{bmatrix} f(x'_1; \theta_1^\star) - f(x'_1; \theta_0) & \dots & f(x'_1; \theta_K^\star) - f(x'_1; \theta_0) \\ \vdots & \ddots & \vdots \\ f(x'_J; \theta_1^\star) - f(x'_J; \theta_0) & \dots & f(x'_J; \theta_K^\star) - f(x'_J; \theta_0) \end{bmatrix}$, $\quad P' \leftarrow \begin{bmatrix} f(x'_1; {\theta'}_1^\star) - f(x'_1; \theta_0) & \dots & f(x'_1; {\theta'}_{K'}^\star) - f(x'_1; \theta_0) \\ \vdots & \ddots & \vdots \\ f(x'_J; {\theta'}_1^\star) - f(x'_J; \theta_0) & \dots & f(x'_J; {\theta'}_{K'}^\star) - f(x'_J; \theta_0) \end{bmatrix}$
     return $J(\mathbf{x}')J(\mathbf{x}')^\top - P\Sigma^2 P^\top - P'(P')^\top / K'$ ▷ The last term vanishes for $K' = 0$.
end procedure
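To show what Algorithm 2 computes, the following NumPy sketch emulates its query step with exact kernel algebra: each trained predictor is replaced by the closed-form quantity $MU_i$ that gradient descent approximates (see Section 3.2), with $\sigma^2 = N\beta_N$, and we return the $K' = 0$ upper bound. The kernel matrices passed in are assumed precomputed; this shows what the algorithm computes, not how.

```python
import numpy as np

def posterior_covariance_estimate(K_train, K_test_train, K_test, sigma2, k):
    lam, U = np.linalg.eigh(K_train)
    idx = np.argsort(lam)[::-1][:k]        # top-k eigenpairs (the partial SVD)
    U_k, lam_k = U[:, idx], lam[idx]
    M = K_test_train @ np.linalg.inv(K_train + sigma2 * np.eye(len(K_train)))
    P = M @ U_k                            # column i emulates the i-th trained predictor
    return K_test - P @ np.diag(lam_k) @ P.T   # the K' = 0 upper bound
```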

3.2 Estimating the Covariance

We now justify Algorithm 2 for estimating the posterior covariance. The main observation that allows us to derive our estimator comes from examining the term $\mathbf{K}(\mathbf{x}', \mathbf{x})^\top(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}', \mathbf{x})$ in the posterior covariance formula (2). This is summarized in the following proposition.

Proposition 3.1.

Diagonalize $\mathbf{K}(\mathbf{x}, \mathbf{x})$ so that $\mathbf{K}(\mathbf{x}, \mathbf{x}) = U\Lambda U^\top$. We have

$$\mathbf{K}(\mathbf{x}', \mathbf{x})^\top(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}', \mathbf{x}) = (MU)\Lambda(MU)^\top + \sigma^2 MM^\top.$$

Here, $M = \mathbf{K}(\mathbf{x}', \mathbf{x})^\top(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}$.

Proof.

We can rewrite the term as:

$$\mathbf{K}(\mathbf{x}', \mathbf{x})^\top(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}', \mathbf{x}) = \underbrace{\mathbf{K}(\mathbf{x}', \mathbf{x})^\top(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}}_{M}(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})\underbrace{(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}', \mathbf{x})}_{M^\top}.$$

Denoting the term $\mathbf{K}(\mathbf{x}', \mathbf{x})^\top(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}$ by $M$ and substituting the diagonalization $\mathbf{K}(\mathbf{x}, \mathbf{x}) = U\Lambda U^\top$, the middle factor becomes $M(U\Lambda U^\top + \sigma^2\mathbf{I})M^\top$, so that:

$$\mathbf{K}(\mathbf{x}', \mathbf{x})^\top(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}', \mathbf{x}) = (MU)\Lambda(MU)^\top + \sigma^2 MM^\top. \qquad ∎$$

The proposition is useful because the matrix $M$ appears in equation (1). Hence, the matrix product $MU$ can be obtained by estimating the posterior mean with Algorithm 1, using the columns of the matrix $U$ as targets, and the term $(MU)\Lambda(MU)^\top$ can therefore be computed by gradient descent. To obtain a complete estimator of the covariance, we still need to deal with the term $\sigma^2 MM^\top$. We can either estimate this term by fitting random targets (which corresponds to setting $K' > 0$ in Algorithm 2) or accept an upper bound on the covariance by setting $K' = 0$. We describe this in detail in Appendix D.
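Proposition 3.1 is easy to verify numerically; the following NumPy snippet checks the identity on random matrices of illustrative sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
N, J, sigma2 = 6, 4, 0.3
A = rng.normal(size=(N, N))
K = A @ A.T                                 # K(x, x), positive semi-definite
Ks = rng.normal(size=(N, J))                # stand-in for K(x', x)
lam, U = np.linalg.eigh(K)
inv = np.linalg.inv(K + sigma2 * np.eye(N))
M = Ks.T @ inv
lhs = Ks.T @ inv @ Ks
rhs = (M @ U) @ np.diag(lam) @ (M @ U).T + sigma2 * M @ M.T
print(np.allclose(lhs, rhs))                # True
```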

4 Experiment

Figure 1: The NTK-GP posterior and its approximations: (top-left) analytic posterior; (top-right) analytic upper bound on the posterior (all eigenvectors); (bottom-left) analytic upper bound on the posterior (5 eigenvectors); (bottom-right) posterior obtained with gradient descent ($K = 5$ predictors, $K' = 0$).

We applied the method to the toy regression problem shown in Figure 1. The problem is a standard non-linear 1D regression task that requires both interpolation and extrapolation. The top-left panel was obtained by computing the kernel of the NTK-GP using formula (3) and computing the posterior mean and covariance using equations (1) and (2). The top-right panel was obtained by analytically computing the upper bound defined in Appendix D. The bottom-left panel was obtained by taking the first 5 eigenvectors of the kernel. Finally, the bottom-right panel was obtained by fitting a mean-prediction network and 5 predictor networks using the gradient-descent method described in Algorithm 2. The close agreement between the panels shows that the method works. Details of the network architecture are deferred to Appendix C.

5 Conclusions

This paper introduces a method for computing the posterior mean and covariance of NTK-Gaussian Processes with non-zero aleatoric noise. Our approach integrates seamlessly with standard training procedures using gradient descent, providing a practical tool for uncertainty estimation in contexts such as Bayesian optimization. The method has been validated empirically on a toy task, demonstrating its effectiveness in capturing uncertainty while maintaining computational efficiency. This work opens up opportunities for further research in applying NTK-GP frameworks to more complex scenarios and datasets.

References

  • [1] R. Bhatia “Positive Definite Matrices”, Princeton Series in Applied Mathematics Princeton University Press, 2015 URL: https://books.google.co.uk/books?id=Y22YDwAAQBAJ
  • [2] Avrim Blum, John Hopcroft and Ravindran Kannan “Foundations of data science” Cambridge University Press, 2020
  • [3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu and Daan Wierstra “Weight uncertainty in neural network” In International Conference on Machine Learning, 2015, pp. 1613–1622. PMLR
  • [4] Yuri Burda, Harrison Edwards, Amos Storkey and Oleg Klimov “Exploration by random network distillation” In arXiv preprint arXiv:1810.12894, 2018
  • [5] Kamil Ciosek et al. “Conservative uncertainty estimation by fitting prior networks” In International Conference on Learning Representations, 2019
  • [6] Erik Daxberger et al. “Laplace redux - effortless Bayesian deep learning” In Advances in Neural Information Processing Systems 34, 2021, pp. 20089–20103
  • [7] Yarin Gal and Zoubin Ghahramani “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning” In International Conference on Machine Learning, 2016, pp. 1050–1059. PMLR
  • [8] Eugene Golikov, Eduard Pokonechnyy and Vladimir Korviakov “Neural Tangent Kernel: A Survey”, 2022 arXiv: https://arxiv.org/abs/2208.13614
  • [9] Nathan Halko, Per-Gunnar Martinsson and Joel A Tropp “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions” In SIAM review 53.2 SIAM, 2011, pp. 217–288
  • [10] Bobby He, Balaji Lakshminarayanan and Yee Whye Teh “Bayesian Deep Ensembles via the Neural Tangent Kernel”, 2020 arXiv: https://arxiv.org/abs/2007.05864
  • [11] Wei Hu, Zhiyuan Li and Dingli Yu “Simple and Effective Regularization Methods for Training on Noisily Labeled Data with Generalization Guarantee”, 2020 arXiv: https://arxiv.org/abs/1905.11368
  • [12] Arthur Jacot, Franck Gabriel and Clément Hongler “Neural tangent kernel: Convergence and generalization in neural networks” In Advances in neural information processing systems 31, 2018
  • [13] Diederik P Kingma and Max Welling “Auto-encoding variational Bayes” In arXiv preprint arXiv:1312.6114, 2013
  • [14] Balaji Lakshminarayanan, Alexander Pritzel and Charles Blundell “Simple and scalable predictive uncertainty estimation using deep ensembles” In Advances in neural information processing systems 30, 2017
  • [15] Jaehoon Lee et al. “Wide neural networks of any depth evolve as linear models under gradient descent” In Advances in neural information processing systems 32, 2019
  • [16] Radford M Neal “Bayesian learning for neural networks” Springer Science & Business Media, 2012
  • [17] Ian Osband, Benjamin Van Roy, Daniel J Russo and Zheng Wen “Deep exploration via randomized value functions” In Journal of Machine Learning Research 20.124, 2019, pp. 1–62
  • [18] Ian Osband et al. “Epistemic neural networks” In Advances in Neural Information Processing Systems 36, 2023, pp. 2795–2823
  • [19] Hippolyt Ritter, Aleksandar Botev and David Barber “A scalable Laplace approximation for neural networks” In International Conference on Learning Representations (ICLR), 2018
  • [20] Nitish Srivastava et al. “Dropout: a simple way to prevent neural networks from overfitting” In Journal of Machine Learning Research 15.1, 2014, pp. 1929–1958
  • [21] Bradley Efron and Robert J Tibshirani “An Introduction to the Bootstrap” In Monographs on Statistics and Applied Probability 57, 1993
  • [22] Carl Edward Rasmussen and Christopher KI Williams “Gaussian Processes for Machine Learning” MIT Press, Cambridge, MA, 2006

Appendix A Related Work

Neural Tangent Kernel

The definition of the Neural Tangent Kernel (3), the proof that it stays constant during training and does not depend on initialization, as well as the link to Gaussian Processes with no aleatoric noise, are all due to the seminal paper [12]. Lee et al. [15] build on this, showing that wide neural networks can be understood as linear models for the purposes of studying their training dynamics, a fact we crucially rely on in the proof of our Lemma 3.1. Hu et al. [11] describe a regularizer for networks trained in the NTK regime that leads to the same optimization problem used in our Lemma 3.1. The difference lies in the fact that we rely on the Bayesian interpretation of the network obtained at the end of training, while they focus on a frequentist generalization bound.

Predictor Networks

Prior work [17, 4, 5] has considered epistemic uncertainty estimation by fitting functions generated by a process that includes some form of randomness. [4] applied a similar idea to reinforcement learning, obtaining exceptional results on Montezuma's Revenge, a problem where exploration is known to be very hard. [5] provided a link to Gaussian Processes, but did not leverage the NTK, instead describing an upper bound on a posterior relative to the kernel [16] for which sampling corresponds to sampling from the network initialization. [17] proposed (see Section 5.3.1 of their paper) a way of sampling from a Bayesian linear regression posterior by solving an optimization problem with a structure similar to ours. However, this approach differs in two crucial ways. First, [17] is interested in obtaining samples from the posterior, while we are interested in computing the posterior moments. Second, the sampling process in [17] depends on the true regression targets in a way that our posterior covariance estimate does not. Also, our method is framed differently, as we intend it to be used in the NTK regime, while [17] discusses vanilla linear regression.

Epistemic Uncertainty

Our method of fitting the posterior covariance of network outputs can be thought of as quantifying epistemic uncertainty. There are several established methods in this space. Dropout [7, 20] works by randomly disabling neurons in a network and has a Bayesian interpretation. The Laplace approximation [6, 19] works by replacing an arbitrary likelihood with a Gaussian one. Epistemic neural networks [18] are based on the idea of using an additional input (the epistemic index) when training the network. Deep ensembles [21, 14] work by training several copies of a network with different initializations and sometimes training sets that are only partially overlapping. While classic deep ensembles do not have a Bayesian interpretation, He et al. [10] have recently proposed a modification that approximates the posterior in the NTK-GP. Bayesian Neural Networks [3, 13] attempt to apply Bayes' rule in the space of neural network parameters, using various approximations. A full survey of methods for epistemic uncertainty estimation is beyond the scope of this paper.

Appendix B Proofs

Lemma 3.1 (restated).

Proof.

Consider a regression problem with the following regularized empirical loss:

$$\mathcal{L}(\mathbf{y}, f(\mathbf{x}; \theta)) = \frac{1}{N}\sum_{i=1}^N (y_i - f(x_i; \theta))^2 + \beta_N \|\theta - \theta_0\|_2^2. \tag{7}$$

Let us use $\theta_t$ to represent the parameters of the network evolving in time $t$ and let $\alpha$ be the learning rate. Assuming we train the network via continuous-time gradient flow, the evolution of the parameters $\theta_t$ can be expressed as:

$$\frac{d\theta_t}{dt} = -\alpha\left[\frac{2}{N}\nabla_\theta f(\mathbf{x}; \theta_t)(f(\mathbf{x}; \theta_t) - \mathbf{y}) + 2\beta_N(\theta_t - \theta_0)\right]. \tag{8}$$

Assuming that our neural network architecture operates in a sufficiently wide regime [15], where the first-order approximation remains valid throughout gradient descent, we obtain:

$$f(\mathbf{x}'; \theta_t) = f(\mathbf{x}'; \theta_0) + J_t(\mathbf{x}')(\theta_t - \theta_0), \quad \text{so that} \quad \nabla_\theta f(\mathbf{x}'; \theta_t)^\top = J_t(\mathbf{x}'). \tag{9}$$

The dynamics of the network outputs on the training data are then:

$$\begin{aligned}
\frac{df(\mathbf{x}; \theta_t)}{dt} &= J_t(\mathbf{x})\frac{d\theta_t}{dt} = -\frac{2\alpha}{N}J_t(\mathbf{x})\left[J_t(\mathbf{x})^\top(f(\mathbf{x}; \theta_t) - \mathbf{y}) + N\beta_N(\theta_t - \theta_0)\right] \\
&= -\frac{2\alpha}{N}\left(\mathbf{K}(\mathbf{x}, \mathbf{x})(f(\mathbf{x}; \theta_t) - \mathbf{y}) + N\beta_N J_t(\mathbf{x})(\theta_t - \theta_0)\right) \\
&= -\frac{2\alpha}{N}\left(\mathbf{K}(\mathbf{x}, \mathbf{x})(f(\mathbf{x}; \theta_t) - \mathbf{y}) + N\beta_N(f(\mathbf{x}; \theta_t) - f(\mathbf{x}; \theta_0))\right) \\
&= -\frac{2\alpha}{N}\left(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I}\right)f(\mathbf{x}; \theta_t) + \frac{2\alpha}{N}\left(\mathbf{K}(\mathbf{x}, \mathbf{x})\mathbf{y} + N\beta_N f(\mathbf{x}; \theta_0)\right),
\end{aligned}$$

where the factor $N\beta_N$ comes from pulling $\frac{2\alpha}{N}$ out of the regularization gradient $2\alpha\beta_N(\theta_t - \theta_0)$, and the third equality uses the linearization (9) on the training inputs.

This is a linear ODE, which we can solve in closed form:

$$f(\mathbf{x}; \theta_t) = \exp\left(-\frac{2\alpha}{N}t(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})\right)f(\mathbf{x}; \theta_0) + \left[\mathbf{I} - \exp\left(-\frac{2\alpha}{N}t(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})\right)\right](\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})^{-1}\left(\mathbf{K}(\mathbf{x}, \mathbf{x})\mathbf{y} + N\beta_N f(\mathbf{x}; \theta_0)\right).$$

Using $A^{-1}e^A = e^A A^{-1}$, and writing $\mathbf{K}(\mathbf{x}, \mathbf{x})\mathbf{y} + N\beta_N f(\mathbf{x}; \theta_0) = (\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})f(\mathbf{x}; \theta_0) + \mathbf{K}(\mathbf{x}, \mathbf{x})(\mathbf{y} - f(\mathbf{x}; \theta_0))$, we get:

$$\begin{aligned}
f(\mathbf{x}; \theta_t) ={}& \exp\left(-\frac{2\alpha}{N}t(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})\right)f(\mathbf{x}; \theta_0) + \left[\mathbf{I} - \exp\left(-\frac{2\alpha}{N}t(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})\right)\right]f(\mathbf{x}; \theta_0) \\
&+ \left[\mathbf{I} - \exp\left(-\frac{2\alpha}{N}t(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})\right)\right](\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}, \mathbf{x})(\mathbf{y} - f(\mathbf{x}; \theta_0)) \\
={}& f(\mathbf{x}; \theta_0) + \left[\mathbf{I} - \exp\left(-\frac{2\alpha}{N}t(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})\right)\right](\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}, \mathbf{x})(\mathbf{y} - f(\mathbf{x}; \theta_0)).
\end{aligned}$$

Now, we consider the dynamics of the network outputs at an arbitrary set of test points $\mathbf{x}'$:

$$\frac{df(\mathbf{x}'; \theta_t)}{dt} = -2\alpha\beta_N f(\mathbf{x}'; \theta_t) - \frac{2\alpha}{N}\left(\mathbf{K}(\mathbf{x}', \mathbf{x})(f(\mathbf{x}; \theta_t) - \mathbf{y}) - N\beta_N f(\mathbf{x}'; \theta_0)\right). \tag{10}$$

This is a linear ODE with a time-dependent inhomogeneous term, which we can solve with an integrating factor:

$$\begin{aligned}
f(\mathbf{x}'; \theta_t) ={}& e^{-2\alpha\beta_N t} f(\mathbf{x}'; \theta_0) - \frac{2\alpha}{N} e^{-2\alpha\beta_N t}\int_0^t e^{2\alpha\beta_N u}\left(\mathbf{K}(\mathbf{x}', \mathbf{x})(f(\mathbf{x}; \theta_u) - \mathbf{y}) - N\beta_N f(\mathbf{x}'; \theta_0)\right)du \\
={}& f(\mathbf{x}'; \theta_0) - \frac{2\alpha}{N} e^{-2\alpha\beta_N t}\,\mathbf{K}(\mathbf{x}', \mathbf{x})\int_0^t e^{2\alpha\beta_N u}\left(f(\mathbf{x}; \theta_u) - \mathbf{y}\right)du,
\end{aligned}$$

where the second equality integrates the constant term using $2\alpha\beta_N\int_0^t e^{2\alpha\beta_N u}\,du = e^{2\alpha\beta_N t} - 1$. From the training-set solution above,

$$f(\mathbf{x}; \theta_u) - \mathbf{y} = -\left[N\beta_N\mathbf{I} + \exp\left(-\frac{2\alpha}{N}u(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})\right)\mathbf{K}(\mathbf{x}, \mathbf{x})\right](\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})^{-1}(\mathbf{y} - f(\mathbf{x}; \theta_0)),$$

and since $e^{2\alpha\beta_N u}\exp\left(-\frac{2\alpha}{N}u(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})\right) = \exp\left(-\frac{2\alpha}{N}u\,\mathbf{K}(\mathbf{x}, \mathbf{x})\right)$, the integral can be evaluated in closed form, yielding

$$f(\mathbf{x}'; \theta_t) = f(\mathbf{x}'; \theta_0) + \mathbf{K}(\mathbf{x}', \mathbf{x})\left[\mathbf{I} - \exp\left(-\frac{2\alpha}{N}t(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})\right)\right](\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})^{-1}(\mathbf{y} - f(\mathbf{x}; \theta_0)).$$

Lastly, taking $t \to \infty$, the matrix exponential vanishes and we get

$$f(\mathbf{x}'; \theta_\infty) = f(\mathbf{x}'; \theta_0) + \mathbf{K}(\mathbf{x}', \mathbf{x})(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I})^{-1}(\mathbf{y} - f(\mathbf{x}; \theta_0)),$$

which is the desired result: regularized gradient flow in the infinite-width limit is equivalent to computing the posterior mean of an NTK-GP with non-zero aleatoric noise $\sigma^2 = N\beta_N$. ∎

Lemma 3.2 (restated).

Proof.

Firstly, substituting $\tilde{\mathbf{y}}$ for $\mathbf{y}$ in equation (5), and using $\tilde{\mathbf{y}} - f(\mathbf{x}; \theta_0) = \mathbf{y}$:

$$\begin{aligned}
f(\mathbf{x}'; \theta_\infty) &= f(\mathbf{x}'; \theta_0) + \mathbf{K}(\mathbf{x}', \mathbf{x})\left(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I}\right)^{-1}(\tilde{\mathbf{y}} - f(\mathbf{x}; \theta_0)) \\
&= f(\mathbf{x}'; \theta_0) + \mathbf{K}(\mathbf{x}', \mathbf{x})\left(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I}\right)^{-1}\mathbf{y}.
\end{aligned}$$

Now, shifting the output of this computational process to obtain $\tilde{f}(\mathbf{x}'; \theta_\infty)$:

$$\tilde{f}(\mathbf{x}'; \theta_\infty) = f(\mathbf{x}'; \theta_\infty) - f(\mathbf{x}'; \theta_0) = \mathbf{K}(\mathbf{x}', \mathbf{x})\left(\mathbf{K}(\mathbf{x}, \mathbf{x}) + N\beta_N\mathbf{I}\right)^{-1}\mathbf{y},

which is the posterior mean of the desired zero-mean Gaussian process. ∎

Appendix C Details of the Experimental Setup

The Adam optimizer was used whenever our experiments needed gradient descent. A patience-based stopping rule was used where training was stopped if there was no improvement in the loss for 500 epochs. The other hyperparameters are given in the table below.

| hyperparameter | value |
| --- | --- |
| no. of hidden layers | 2 |
| size of hidden layer | 512 |
| non-linearity | softplus |
| softplus beta | 87.09 |
| scaling multiplier in the output | 3.5 |
| learning rate for network predicting mean | 1e-4 |
| learning rate for covariance predictor networks | 5e-5 |

Moreover, we used trigonometric normalization, where an input point $x$ is first scaled and shifted to lie between $0$ and $\pi$, obtaining a normalized point $x'$. The point $x'$ is then represented with the vector $[\sin(x'), \cos(x')]$.
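A sketch of this normalization, where `x_min` and `x_max` are assumed to be the bounds of the training inputs:

```python
import numpy as np

def trig_normalize(x, x_min, x_max):
    x_norm = np.pi * (x - x_min) / (x_max - x_min)  # scale and shift to [0, pi]
    return np.stack([np.sin(x_norm), np.cos(x_norm)], axis=-1)
```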

Appendix D Details on Estimating The Covariance

We now describe two ways of dealing with the term $\sigma^2 MM^\top$ in the covariance formula. Upper bounding the covariance is described in Section D.1, while estimating the exact covariance by fitting noisy targets is described in Section D.2.

D.1 Upper Bounding the Covariance

First, we can simply ignore the term in our estimator, obtaining an upper bound on the covariance. We now characterize the tightness of the upper bound, i.e. the magnitude of the term

$$\sigma^2 MM^\top = \sigma^2\mathbf{K}(\mathbf{x}', \mathbf{x})(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}', \mathbf{x})^\top.$$

We do this in the following two lemmas.

Lemma D.1.

When $\mathbf{x} = \mathbf{x}'$, i.e. on the training set, we have

$$\sigma^2\mathbf{K}(\mathbf{x}', \mathbf{x})(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}', \mathbf{x})^\top \preccurlyeq \sigma^2\mathbf{I}.$$
Proof.

By assumption, $\mathbf{K}(\mathbf{x}', \mathbf{x}) = \mathbf{K}(\mathbf{x}, \mathbf{x}) = \mathbf{K}$. Denote the diagonalization of $\mathbf{K}$ by $\mathbf{K} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top$. We have

$$\begin{aligned}
\sigma^2\,&\mathbf{K}(\mathbf{x}', \mathbf{x})(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}', \mathbf{x})^\top \\
&= \sigma^2\,\mathbf{K}(\mathbf{K} + \sigma^2\mathbf{I})^{-2}\mathbf{K}^\top \\
&= \sigma^2\,\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top(\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top + \sigma^2\mathbf{I})^{-2}\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top \\
&= \sigma^2\,\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top\mathbf{U}(\boldsymbol{\Lambda} + \sigma^2\mathbf{I})^{-2}\mathbf{U}^\top\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top \\
&= \sigma^2\,\mathbf{U}\boldsymbol{\Lambda}(\boldsymbol{\Lambda} + \sigma^2\mathbf{I})^{-2}\boldsymbol{\Lambda}\mathbf{U}^\top.
\end{aligned}$$

It can be seen that the diagonal entries of $\boldsymbol{\Lambda}(\boldsymbol{\Lambda} + \sigma^2\mathbf{I})^{-2}\boldsymbol{\Lambda}$ are less than or equal to one. ∎

The lemma above, stated in words, implies that, on the training set, the variance estimates that come from using the upper bound (which does not require us to fit noisy targets as in Section D.2) are off by at most $\sigma^2$.
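This bound is straightforward to check numerically; the following NumPy snippet verifies Lemma D.1 on a random PSD kernel of illustrative size.

```python
import numpy as np

rng = np.random.default_rng(4)
N, sigma2 = 7, 0.5
A = rng.normal(size=(N, N))
K = A @ A.T
inv = np.linalg.inv(K + sigma2 * np.eye(N))
term = sigma2 * K @ inv @ inv @ K
eigs = np.linalg.eigvalsh(sigma2 * np.eye(N) - term)
print((eigs >= -1e-10).all())   # True: the term is bounded by sigma^2 I
```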

We now give another Lemma, which characterizes the upper bound on arbitrary test points, not just the training set.

Lemma D.2.

Denote by $\lambda_{\text{max}}$ the maximum singular value of $\mathbf{K}(\mathbf{x}', \mathbf{x}')$. Then we have

$$\left\|\sigma^2\mathbf{K}(\mathbf{x}', \mathbf{x})(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-1}\mathbf{K}(\mathbf{x}', \mathbf{x})^\top\right\|_2 \leq \frac{1}{4}\lambda_{\text{max}}.$$
Proof.

By Proposition 1.3.2 from the book by [1], we have that

$$\mathbf{K}(\mathbf{x}', \mathbf{x})^\top = \mathbf{K}(\mathbf{x}, \mathbf{x})^{1/2}\,\mathbf{C}\,\mathbf{K}(\mathbf{x}', \mathbf{x}')^{1/2},$$

where $\mathbf{C}$ is a contraction. Denote the diagonalization of $\mathbf{K}(\mathbf{x}, \mathbf{x})$ by $\mathbf{K}(\mathbf{x}, \mathbf{x}) = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^\top$. We have

$$\begin{aligned}
&\left\|\sigma^2\mathbf{K}(\mathbf{x}', \mathbf{x})(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-2}\mathbf{K}(\mathbf{x}', \mathbf{x})^\top\right\|_2 \\
&= \left\|\sigma^2\mathbf{K}(\mathbf{x}', \mathbf{x}')^{1/2}\mathbf{C}^\top\mathbf{K}(\mathbf{x}, \mathbf{x})^{1/2}(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-2}\mathbf{K}(\mathbf{x}, \mathbf{x})^{1/2}\mathbf{C}\mathbf{K}(\mathbf{x}', \mathbf{x}')^{1/2}\right\|_2 \\
&\leq \sigma^2\lambda_{\text{max}}\left\|\mathbf{K}(\mathbf{x}, \mathbf{x})^{1/2}(\mathbf{K}(\mathbf{x}, \mathbf{x}) + \sigma^2\mathbf{I})^{-2}\mathbf{K}(\mathbf{x}, \mathbf{x})^{1/2}\right\|_2 \\
&= \sigma^2\lambda_{\text{max}}\left\|\mathbf{U}\boldsymbol{\Lambda}^{1/2}\mathbf{U}^\top\mathbf{U}(\boldsymbol{\Lambda} + \sigma^2\mathbf{I})^{-2}\mathbf{U}^\top\mathbf{U}\boldsymbol{\Lambda}^{1/2}\mathbf{U}^\top\right\|_2 \\
&= \sigma^2\lambda_{\text{max}}\left\|\boldsymbol{\Lambda}^{1/2}(\boldsymbol{\Lambda} + \sigma^2\mathbf{I})^{-2}\boldsymbol{\Lambda}^{1/2}\right\|_2.
\end{aligned}$$

We can expand $\left\|\boldsymbol{\Lambda}^{1/2}(\boldsymbol{\Lambda} + \sigma^2\mathbf{I})^{-2}\boldsymbol{\Lambda}^{1/2}\right\|_2$ as $\max_i\left\{\frac{\lambda_i}{(\lambda_i + \sigma^2)^2}\right\} \leq \frac{1}{4\sigma^2}$, which gives the desired result. ∎

D.2 Exact Covariance by Fitting Noisy Targets

In certain cases, we might not be satisfied with an upper bound on the posterior covariance, even a reasonably tight one. We can address this scenario by fitting additional predictor networks, trained on targets sampled from a spherical normal distribution. Formally, we have

$$\sigma^2 MM^\top = M\,\mathbb{E}_\epsilon\left[\epsilon\epsilon^\top\right]M^\top,$$

where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. We can take $K'$ samples $\epsilon_1, \dots, \epsilon_{K'}$, obtaining

$$M\,\mathbb{E}_\epsilon\left[\epsilon\epsilon^\top\right]M^\top \approx \frac{1}{K'}\sum_i M\epsilon_i\epsilon_i^\top M^\top = \frac{1}{K'}\sum_i (M\epsilon_i)(M\epsilon_i)^\top, \tag{11}$$

where the approximation becomes exact as $K' \rightarrow \infty$ by the law of large numbers. Since the multiplication $M\epsilon_i$ is equivalent to estimating the posterior mean with Algorithm 1, we can perform the computation in equation (11) by gradient descent.
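The following NumPy sketch checks the Monte Carlo estimator (11). As in the sketch after Algorithm 2, the fits $M\epsilon_i$ are emulated in closed form rather than by gradient descent, and the test-train factor of $M$ is a randomly generated stand-in; the identity holds for any $M$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, J, sigma2, K_prime = 8, 5, 0.2, 10000
A = rng.normal(size=(N, N))
K = A @ A.T
M = rng.normal(size=(J, N)) @ np.linalg.inv(K + sigma2 * np.eye(N))

eps = rng.normal(scale=np.sqrt(sigma2), size=(N, K_prime))  # K' noisy targets
estimate = (M @ eps) @ (M @ eps).T / K_prime                # equation (11)
exact = sigma2 * M @ M.T
print(np.max(np.abs(estimate - exact)))                     # small for large K'
```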

Appendix E Computing The Partial SVD

Our Algorithm 2 includes the computation of the partial SVD of the Jacobian:

$$U, \Sigma \leftarrow \textsc{Partial-SVD}(J_{\theta_0}(\mathbf{x}), K).$$

We require an SVD which is partial in the sense that we only want to compute the first $K$ singular values. For the regression experiment in this submission, we simply called the full SVD on the Jacobian and took the first $K$ columns of $U$ and the first $K$ diagonal entries of $\Sigma$. This process is infeasible for larger problem instances.

This can be addressed by observing that the power method for SVD computation [2] only requires Jacobian-vector products and vector-Jacobian products, which can be computed efficiently in deep learning frameworks without access to the full Jacobian. Another approach that avoids constructing the full Jacobian is randomized SVD [9]. We leave the implementation of these ideas to future work.
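As a hedged sketch of the matrix-free route: SciPy's `svds` accepts a `LinearOperator`, which only needs matrix-vector products, so the Jacobian never has to be materialized. The `jvp`/`vjp` callables below are assumptions standing in for autodiff primitives of the host framework; the dense Jacobian is used only to make the example self-contained.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

J_dense = np.random.default_rng(3).normal(size=(100, 2000))  # stand-in Jacobian

jvp = lambda v: J_dense @ v       # Jacobian-vector product (autodiff in practice)
vjp = lambda u: J_dense.T @ u     # vector-Jacobian product (autodiff in practice)

op = LinearOperator(J_dense.shape, matvec=jvp, rmatvec=vjp, dtype=np.float64)
U, S, Vt = svds(op, k=5)          # top-5 singular triplets, J never formed
```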

Appendix F Network Initialization

We consider a neural network model $f(x; \theta)$, where $\theta \in \mathbb{R}^p$ denotes the set of parameters. The model consists of $L$ layers with dimensions $\{n_0, n_1, \dots, n_L\}$, where $n_0$ is the input dimension and $n_L$ is the output dimension. Note that, as we want to leverage the theory of wide networks, the number of neurons in the hidden layers, $\{n_1, \dots, n_{L-1}\}$, is large.

For each fully connected layer $l$, the weight matrix $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ and the bias vector $b^{(l)} \in \mathbb{R}^{n_l}$ are initialized from a Gaussian distribution with mean zero and standard deviations $\sigma_w$ and $\sigma_b$, respectively:

$$W^{(l)}_{ij} \sim \mathcal{N}(0, \sigma_w^2), \quad b^{(l)}_j \sim \mathcal{N}(0, \sigma_b^2),$$

where $\sigma_w$ and $\sigma_b$ are fixed values set as hyperparameters during initialization (we use $\sigma_w = \sigma_b = 1$).

The network uses a non-linear activation function $\sigma: \mathbb{R} \rightarrow \mathbb{R}$ with a bounded second derivative, ensuring Lipschitz continuity. The output of each layer $l$ is scaled by $1/\sqrt{n_l}$ to maintain the appropriate magnitude, particularly when considering the infinite-width limit:

$$a^{(l)} = \sigma\left(\frac{1}{\sqrt{n_l}}W^{(l)}a^{(l-1)} + b^{(l)}\right),$$

where $a^{(l)}$ is the output of layer $l$, and $a^{(0)} = x$ is the input to the network.

The final layer output is further scaled by a constant factor $c_{\text{out}}$ to ensure that the overall network output remains within the desired range. Specifically, the output $f(x; \theta)$ is given by:

$$f(x; \theta) = \frac{c_{\text{out}}}{\sqrt{n_L}}W^{(L)}a^{(L-1)},$$

where $c_{\text{out}}$ is a predefined constant that ensures the final output is of the appropriate scale. In our model, $c_{\text{out}}$ is set to 3.5. For the hidden layers, we choose $\sigma(\cdot)$ to be Softplus, a smoothed version of ReLU. In this case, an additional scaling factor $\beta$ is introduced to modulate the sharpness of the non-linearity:

$$a^{(l)} = \sigma\left(\frac{1}{\sqrt{n_l}}W^{(l)}a^{(l-1)} + b^{(l)}; \beta\right).$$

In our model, we set β=87.09\beta=87.09 for the Softplus activation to ensure the appropriate range of activation values. The process described above is standard. We followed closely the methodology provided in several works in the literature [12][15][8]. This initialization strategy ensures that the network’s activations and gradients do not explode or vanish as the number of neurons nln_{l} increases.