
Understanding Kernel Ridge Regression:
Common behaviors from simple functions to density functionals

Kevin Vu Department of Physics and Astronomy, University of California, Irvine, CA 92697    John Snyder Machine Learning Group, Technical University of Berlin, 10587 Berlin, Germany Max Planck Institute of Microstructure Physics, Weinberg 2, 06120 Halle (Saale), Germany    Li Li Department of Physics and Astronomy, University of California, Irvine, CA 92697    Matthias Rupp Department of Chemistry, University of Basel, Klingelbergstr. 80, 4056 Basel, Switzerland    Brandon F. Chen Department of Chemistry, University of California, Irvine, CA 92697    Tarek Khelif Department of Chemistry, University of California, Irvine, CA 92697    Klaus-Robert Müller Machine Learning Group, Technical University of Berlin, 10587 Berlin, Germany Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 136-713, Korea    Kieron Burke Department of Physics and Astronomy, University of California, Irvine, CA 92697 Department of Chemistry, University of California, Irvine, CA 92697
(August 6, 2025)
Abstract

Accurate approximations to density functionals have recently been obtained via machine learning (ML). By applying ML to a simple function of one variable without any random sampling, we extract the qualitative dependence of errors on hyperparameters. We find universal features of the behavior in extreme limits, including both very small and very large length scales, and the noise-free limit. We show how such features arise in ML models of density functionals.

I Introduction

Machine learning (ML) is a powerful data-driven method for learning patterns in high-dimensional spaces via induction, and has had widespread success in many fields including more recent applications in quantum chemistry and materials science Müller et al. (2001); Kononenko (2001); Sebastiani (2002); Ivanciuc (2007); Bartók et al. (2010); Rupp et al. (2012); Pozun et al. (2012); Hansen et al. (2013); Schütt et al. (2014). Here we are interested in applications of ML to construction of density functionals Snyder et al. (2012, 2013a); Li et al. (2014); Snyder et al. (2013b, 2015), which have focused so far on approximating the kinetic energy (KE) of non-interacting electrons. An accurate, general approximation to this could make orbital-free DFT a practical reality.

However, ML methods have been developed within the areas of statistics and computer science, and have been applied to a huge variety of data, including neuroscience, image and text processing, and robotics Burke (2006). Thus, they are quite general and have not been tailored to account for specific details of the quantum problem. For example, it was found that a standard method, kernel ridge regression, could yield excellent results for the KE functional, while never yielding accurate functional derivatives. The development of methods for bypassing this difficulty has been important for ML in general Snyder et al. (2013b).

ML provides a whole suite of tools for analyzing data, fitting highly non-linear functions, and dimensionality reduction Hastie et al. (2009). More importantly in this context, ML provides a completely different way of thinking about electronic structure. The traditional ab-initio approach Levine (2009) to electronic structure involves deriving carefully constructed approximations to solving the Schrödinger equation, based on physical intuition, exact conditions and asymptotic behaviors Dreizler and Gross (1990). On the other hand, ML learns by example. Given a set of training data, ML algorithms learn via induction to predict new data. ML provides limited interpolation over a specific class of systems for which training data is available.

A system of $N$ interacting electrons with some external potential is characterized by a $3N$-coordinate wavefunction, which becomes computationally demanding for large $N$. In the mid 1960s, Hohenberg and Kohn proved a one-to-one correspondence between the external potential of a quantum system and its one-electron ground-state density Hohenberg and Kohn (1964), showing that all properties are functionals of the ground-state density alone, which can in principle be found from a single Euler equation for the density. Although these fundamental theorems of DFT proved the existence of a universal functional, essentially all modern calculations use the KS scheme Kohn and Sham (1965), which is much more accurate, because the non-interacting KE is found exactly by using an orbital scheme Pribram-Jones et al. (2014). This is far faster than traditional approaches for large $N$, but remains a bottleneck. If a sufficiently accurate density functional for the non-interacting electrons could be found, it could increase the size of computationally tractable systems by orders of magnitude.

The Hohenberg-Kohn theorem guarantees that all properties of the system can be determined from the electronic density alone. The basic tenet of ML is that a pattern must exist in the data in order for learning to be possible. Thus, DFT seems an ideal case to apply ML. ML learns the underlying pattern in solutions to the Schrödinger equation, bypassing the need to directly solve it. The HK theorem is a statement concerning the minimal information needed to do this for an arbitrary one-body potential.

Some of us recently used ML to learn the non-interacting KE of fermions in a one-dimensional box subject to smooth external potentials Snyder et al. (2012) and of a one-dimensional model of diatomics, where we demonstrated the ability of ML to break multiple bonds self-consistently via an orbital-free density functional theory (DFT) Snyder et al. (2013a). Such KE data is effectively noise-free, since it is generated via deterministic reference calculations, by solving the Schrödinger equation or KS equations numerically exactly. (The limited precision of the calculation might be considered “noise,” as different implementations might yield answers differing on the order of machine precision, but this is negligible.) There is no noise in the traditional sense, such as that typically associated with experimental data. Note that what is considered “noise” depends on what is considered ground truth, i.e., the data to be learned. In particular, if a single reference method is used, its error with respect to a universal functional is not considered noise for the ML model. A perfect ML model should, at best, precisely reproduce the single-reference calculation.

As an example, in Fig. 1 we plot a measure of the error of ML for the KE of up to 4 noninteracting spinless fermions in a box under a potential with 9 parameters (given in detail in Ref. Snyder et al. (2012)), fitted for different numbers of evenly spaced training densities, as a function of the hyperparameter $\sigma$ (called the length scale), for fixed $\lambda$ (a hyperparameter called the regularization strength) and several different numbers of training points $N_T$. The scale is logarithmic (we use $\log$ to denote $\log_{10}$ here and throughout this work), so there are large variations in the fitted error. We will give a more in-depth analysis of the model performance on this data set in a later section, after we have formally defined the functions and hyperparameters involved, but for now it is useful to observe the qualitative behaviors that emerge in the figure. Note that the curves assume roughly the same shape for each $N_T$ over the range of $\sigma$, and that they all possess distinct features in different regimes of $\sigma$.

Figure 1: The error of the model, $\Delta T$ (Hartree), for the KE of particles in a box (Section IV) as a function of $\sigma$, for fixed $\lambda=10^{-10}$. $N_T$ values for each curve are given in the legend.

To better understand the behavior with respect to the hyperparameters seen in Fig. 1, we have chosen in this paper to apply the same ML methods to the prototypical regression problem, that of fitting a simple function of one coordinate. We also remove all stochastic elements of the procedure, by considering data points on uniform grids, defining errors in the continuum limit, etc. This is shown in Fig. 2, where we plot a measure of the error of ML for a simple function $\cos x$, fitted in the region between 0 and 1, inclusive, for several $N_T$ (represented as values on a grid) as a function of $\sigma$. Note the remarkable similarity between the features and characteristics of the curves of this figure and those of Fig. 1 (like Fig. 1 before it, we will analyze Fig. 2 in more depth later). We explore the behavior of the fitting error as a function of the number of training points and the hyperparameters that are used in kernel ridge regression with a Gaussian kernel. We find the landscape to be surprisingly rich, and we also find elegant simplicities in various limiting cases. After this, we will be able to characterize the behavior of ML for systems like the one shown in Fig. 1.

Looking at Fig. 2, we see that the best results (lowest error) are always obtained from the middle of the curves, which can become quite flat with enough training data. Thus, any method for determining hyperparameters should usually yield a length scale somewhere in this valley. For very small length scales, all curves converge to the same poor result, regardless of the number of training points. On the other hand, notice also the plateau structure that develops for very large length scales, again with all curves converging to a certain limit. We show for which ranges of hyperparameters these plateaus emerge and how they can be estimated. We also study and explain many of the features of these curves. To show the value of this study, we then apply the same reasoning to the problem that was tackled in Refs. Snyder et al. (2012); Li et al. (2014), which we showcased in Fig. 1. From the machine learning perspective, our study may appear unusual, as it considers properties of data and problems that are uncommon: there are only a few noise-free data points, and all are low-dimensional. Nevertheless, from the physics point of view, the toy data considered reflects very well the essential properties of a highly relevant problem in quantum physics: the machine learning of DFT.

Figure 2: Dependence of the model error (as a function of $\sigma$) when fitting $\cos x$ between 0 and 1 (Section III) for various $N_T$ (shown in the legend), for $\lambda=10^{-6}$.

II Background

In this work, we will first use ML to fit a very simple function of one variable,

f(x)=cosx,f(x)=\cos x, (1)

on the interval x[0,1]x\in[0,1]. We will focus on exploring the properties of ML for this simple function before proceeding to our DFT cases. We choose a set of xx-values and corresponding f(x)f(x) values as the “training data” for ML to learn from. In ML, the xx-values {xj}\{x_{j}\} for j=1,,NTj=1,\dots,N_{T} are known as features, and corresponding ff-values, {fj}\{f_{j}\}, are known as labels. Here NTN_{T} is the number of training samples. Usually, ML is applied with considerable random elements, such as in the choice of data points and selection of test data. In our case, we choose evenly spaced training points on the interval [0,1][0,1]: xj=(j1)/(NT1)x_{j}=(j-1)/(N_{T}-1) for j=1,,NTj=1,\dots,N_{T}.

Using this dataset, we apply kernel ridge regression (KRR), which is a non-linear form of regression with regularization to prevent overfitting Hastie et al. (2009), to fit $f(x)$. The general form of KRR is

f^{\rm ML}(x) = \sum_{j=1}^{N_T} \alpha_j k(x, x_j),  (2)

where $\alpha_j$ are the weights, $k$ is the kernel (which is a measure of similarity between features), and the hyperparameter $\lambda$ controls the strength of the regularization and is linked to the noise level of the learning problem. We use the Gaussian kernel

k(x, x') = \exp\left(-(x - x')^2/2\sigma^2\right),  (3)

a standard choice in ML that works well for a variety of problems. The hyperparameter $\sigma$ is the length scale of the Gaussian, which controls the degree of correlation between training points.

The weights $\alpha_j$ are obtained through the minimization of the cost function

\mathcal{C}(\bm{\alpha}) = \sum_{j=1}^{N_T}\left(f^{\rm ML}(x_j) - f_j\right)^2 + \lambda\,\bm{\alpha}^T\mathbf{K}\bm{\alpha},  (4)

where

\bm{\alpha} = (\alpha_1, \ldots, \alpha_{N_T})^T.  (5)

The exact solution is given by

\bm{\alpha} = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{f},  (6)

where $\mathbf{I}$ is the $N_T\times N_T$ identity matrix, $\mathbf{K}$ is the kernel matrix with elements $K_{ij} = k(x_i, x_j)$, and $\mathbf{f} = (f_1, \dots, f_{N_T})^T$.

The two parameters $\lambda$ and $\sigma$ not determined by Eq. (6) are called hyperparameters and must be determined from the data (see Section III.3). $\sigma$ can be viewed as the characteristic length scale of the problem being learned (the scale on which changes of $f$ take place), as discernible from the data (and thus dependent on, e.g., the number of training samples). $\lambda$ controls the leeway the model has to fit the training points. For small $\lambda$, the model has to fit the training points exactly, whereas for larger $\lambda$ some deviation is allowed. Larger values of $\lambda$ therefore cause the model to be smoother and vary less, i.e., to be less prone to overfitting. This can be seen directly in Gaussian process regression Rasmussen and Williams (2006), a related Bayesian ML model with predictions identical to those of KRR; there, $\lambda$ is formally the variance of the assumed additive Gaussian noise in the values of $f$.
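
To make Eqs. (2)-(6) concrete, here is a minimal sketch of KRR with a Gaussian kernel applied to $\cos x$ on evenly spaced training points (an illustrative script of ours, not code from the original study; the function names and the particular values of $N_T$, $\sigma$, and $\lambda$ are arbitrary choices):

import numpy as np

def gaussian_kernel(x, xp, sigma):
    # k(x, x') = exp(-(x - x')^2 / (2 sigma^2)), Eq. (3)
    return np.exp(-(x[:, None] - xp[None, :])**2 / (2.0 * sigma**2))

def krr_fit(x_train, f_train, sigma, lam):
    # alpha = (K + lambda I)^{-1} f, Eq. (6)
    K = gaussian_kernel(x_train, x_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(x_train)), f_train)

def krr_predict(x, x_train, alpha, sigma):
    # f_ML(x) = sum_j alpha_j k(x, x_j), Eq. (2)
    return gaussian_kernel(np.atleast_1d(x), x_train, sigma) @ alpha

NT, sigma, lam = 20, 0.2, 1e-6
x_train = np.linspace(0.0, 1.0, NT)          # evenly spaced points on [0, 1]
alpha = krr_fit(x_train, np.cos(x_train), sigma, lam)
print(krr_predict(0.5, x_train, alpha, sigma), np.cos(0.5))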

KRR is a method of interpolation. Here, we are mainly concerned with the error of the machine learning approximation (MLA) to $f(x)$ in the interpolation region, which in this case is the interval $x\in[0,1]$. As a measure of this error, we define

\Delta f = \int_0^1 dx\,(f(x) - f^{\rm ML}(x))^2.  (7)

In the case of the Gaussian kernel, we can expand this expression and evaluate the integrals that appear analytically. To simplify the analysis, we define

\Delta f_0 = \int_0^1 dx\, f^2(x),  (8)

as the benchmark error when $f^{\rm ML}(x)\equiv 0$. For the cosine function in Eq. (1),

\Delta f_0 = \int_0^1 dx\,\cos^2(x) = \frac{1}{2} + \frac{\sin(2)}{4} \approx 0.7273.  (9)
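
As a quick illustrative check (our own snippet, not from the paper), the closed form in Eq. (9) can be compared with numerical quadrature; $\Delta f$ itself can be evaluated the same way by integrating $(f - f^{\rm ML})^2$ once a model has been fitted:

import numpy as np
from scipy.integrate import quad

closed_form = 0.5 + np.sin(2.0) / 4.0                   # Eq. (9)
numerical, _ = quad(lambda x: np.cos(x)**2, 0.0, 1.0)   # Delta f_0 by quadrature
print(closed_form, numerical)                           # both ~0.7273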

Our goal is to characterize the dependence of the performance of the model on the size of the training data set ($N_T$) and the hyperparameters of the model ($\sigma$, $\lambda$). For this simple model, we discuss different regions of qualitative behavior and derive the dependence of $\Delta f$ for various limiting values of these hyperparameters; we do all of this in the next few sections. In Section IV, we discuss how these results can be qualitatively generalized to the problem of using ML to learn the KE functional for non-interacting fermions in a box for a limited class of potentials.

III Analysis

We begin by analyzing the structure of $\Delta f$ as a function of $\sigma$ for fixed $\lambda$ and $N_T$. Fig. 2 shows $\Delta f$ as a function of $\sigma$ for various $N_T$ while fixing $\lambda=10^{-6}$. The curves have roughly the same “valley” shape for all $N_T$. The bottom of the valley is an order of magnitude deeper than the walls and may have multiple local minima. These valleys are nearly identical in shape for sufficiently large $N_T$, which indicates that this particular feature arises in a systematic manner as $N_T$ increases. Moreover, this central valley opens up to the left (i.e., smaller $\sigma$) as $N_T$ increases: as the training samples become more densely packed, narrower Gaussians are better able to interpolate the function smoothly. Thus, with more training samples, a wider range of $\sigma$ values will yield an accurate model.

In addition, a “cusp” begins to appear in the region to the left of the valley, and its general shape remains the same for increasing $N_T$. This is another recurring feature that appears to develop systematically, like the valley. For a fixed $N_T$, and starting from the far left, the $\Delta f$ curve begins by decreasing monotonically to the right, i.e., as $\sigma$ increases. The cusps mark the first break in this monotonic behavior, as $\Delta f$ increases briefly after reaching this local minimum before resuming its monotonic decrease with increasing $\sigma$ (until this monotonicity is interrupted again in the valley region). The cusps shift to the left as $N_T$ increases, following the trend of the valleys. This indicates that they are a fundamental feature of the $\Delta f$ curves and that their appearance coincides with a particular behavior of the model as it approaches certain $\sigma$ values. Note that $\Delta f$ decreases nearly monotonically as $N_T$ increases for all $\sigma$. This is as expected, since each additional training sample adds another weighted Gaussian, which should improve the fit.

Fig. 3 again shows $\Delta f$ as a function of $\sigma$, but for various $\lambda$ with $N_T$ fixed at 33. As $\lambda$ decreases, $\Delta f$ again decreases nearly monotonically and the central region opens up to the right (i.e., larger $\sigma$). Note that the curves for each $\lambda$ coincide up to a certain $\sigma$ before they split off from the rest, with the lower-$\lambda$ curves breaking off further to the right than those with larger $\lambda$. This shows a well-known phenomenon, namely that the regularization strength $\lambda$ and the kernel length scale $\sigma$ both give rise to regularization and smoothing Smola et al. (1998). Additionally, we observe the emergence of “plateau”-like structures on the right. These will be explored in detail in Section III.1.2.

Figure 3: Dependence of the model error (as a function of $\sigma$) for various $\lambda$ with $N_T=33$. The labels give the value of $\lambda$ for each curve.

III.1 Regions of qualitative behavior

In Fig. 4, we plot $\Delta f$ as a function of $\sigma$ for fixed $\lambda$ and $N_T$. The three regions labeled I, II, and III denote areas of distinct qualitative behavior. They are delineated by two boundaries that we denote by $\sigma_s$ ($s$ for small, between I and II) and $\sigma_l$ ($l$ for large, between II and III). In region I, $\Delta f$ decreases significantly as $\sigma$ increases. The region ends when there is significant overlap between neighboring Gaussians (i.e., when $k(x_j, x_{j+1})$ is no longer small). Region II is a “valley” where the global minimum of $\Delta f$ resides. Region III begins where the valley ends and is populated by “plateaus” that correspond to $f^{\rm ML}(x)$ assuming a polynomial form (see Section III.1.2). In the following sections, we examine each region separately. In particular, we are interested in the asymptotic behavior of $\Delta f$ with respect to $N_T$, $\sigma$, and $\lambda$.

Figure 4: Model error $\Delta f$ as a function of $\sigma$ for $N_T=20$ and $\lambda=10^{-6}$. We divide the range of $\sigma$ into three qualitatively distinct regions I, II, and III. The boundaries between the regions are given by the vertical dashed lines.

III.1.1 Length scale too small

The ML model for $f(x)$, given in Eq. (2), is a sum of weighted Gaussians centered at the training points, where the weights $\alpha_j$ are chosen to best reproduce the unknown $f(x)$.

Figure 5: Comparison of the function $f(x)$ (black dot-dashed) and the ML model $f^{\rm ML}(x)$ (red dashed), for $N_T=5$, $\sigma=0.05\ll\sigma_s$ ($\sigma_s=0.2$), and $\lambda=10^{-6}$. When summed, the weighted Gaussians, $\alpha_j k(x, x_j)$ (blue solid), give $f^{\rm ML}(x)$. The blue dots show the location of the training points and the value of the corresponding weights. In this case, the model is in the “comb” region, where $\sigma\ll\sigma_s$. The width of the Gaussians is much smaller than the distance $\Delta x$ between adjacent training points, and so the model cannot reproduce the exact function.

Fig. 5 shows what happens when the width of the Gaussian kernel is too small: the model is incapable of learning $f(x)$. We call this the “comb” region, as the shape of $f^{\rm ML}(x)$ arising from the narrow Gaussians resembles a comb. In order for $f^{\rm ML}(x)$ to accurately fit $f(x)$, the weighted Gaussians must have significant overlap. This begins when $\sigma$ is on the order of the distance between adjacent training points. A corresponding general heuristic is to use a multiple (e.g., four times) of the median nearest neighbor distance over the training set Snyder et al. (2013a). For equally spaced training data in one dimension, this distance is $\Delta x\approx 1/N_T$, so we define

\sigma_s = 1/N_T  (10)

to be the boundary between regions I and II.
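
A small illustrative snippet (ours, with an assumed helper name) shows both length-scale estimates discussed above, the $1/N_T$ boundary of Eq. (10) and the median nearest-neighbor heuristic:

import numpy as np

def median_nn_distance(x_train):
    # median distance from each training point to its nearest neighbor
    d = np.abs(x_train[:, None] - x_train[None, :])
    np.fill_diagonal(d, np.inf)
    return np.median(d.min(axis=1))

NT = 20
x_train = np.linspace(0.0, 1.0, NT)
print(1.0 / NT)                            # sigma_s of Eq. (10)
print(4.0 * median_nn_distance(x_train))   # heuristic: ~4x median NN distance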

Figure 6: Same as Fig. 5, but for $\sigma=\sigma_s=0.2$. Here, the model is in region II, the optimum region for the model. The error in the model is very small for all $x$ in the interpolation region. The width of the Gaussians is comparable to the size of the interpolation region, and the weights are large.

In Fig. 6, as the overlap between neighboring Gaussians becomes significant, the model is able to reproduce the function well, but still with significant error. Note that the common heuristics for choosing the length scale in radial basis function networks Moody and Darken (1989) are very much in line with this finding. In the comb region, $\Delta f$ decreases as $\sigma$ increases in a characteristic way as the Gaussians begin to fill up the space between the training points. For $\lambda\to 0$, the weights are approximately given by the values of the function at the corresponding training points:

\alpha_j \approx f_j, \quad \sigma\ll\sigma_s.  (11)

Thus, for small $\sigma$, the weights are independent of $\sigma$. Let $\Delta f_{\sigma\ll\sigma_s}$ be the error of the model when $\alpha_j = f(x_j)$.

Figure 7: Same as Fig. 4, but comparing $\Delta f$ against our expansion of $\Delta f$ for $\sigma\ll\sigma_s$ (blue dashed), given in Eq. (52). The vertical dashed line shows the boundary $\sigma_s$ between regions I and II. The expansion breaks down before we reach $\sigma_s$, as the approximation $\alpha_j\approx f(x_j)$ is no longer valid.

This approximation, shown in Fig. 7, captures the initial decrease of $\Delta f$ as $\sigma$ increases, but breaks down before we reach $\sigma_s$. The qualitative nature of this decay is independent of the type of function $f(x)$, but its location and scale will depend on the specifics.

As $\sigma\to 0$ (for fixed $\lambda$ and $N_T$), $f^{\rm ML}(x)$ becomes a sum of infinitesimally narrow Gaussians. Thus, in this limit, the error in the model becomes

\lim_{\sigma\to 0}\Delta f = \Delta f_0.  (12)

Note that this limit is independent of $\lambda$ and $N_T$.

Figure 8: Same as Fig. 5, but for $\sigma=0.1$ and $\lambda=5$. In this case, known as over-regularization, $\lambda$ is too large, forcing the magnitude of the weights $\alpha_j$ to be small and preventing the model from fitting $f(x)$.

Fig. 8 shows what happens when the regularization becomes too strong. (Although shown for $\sigma$ in region I, the qualitative behavior is the same for any $\sigma$.) The regularization term in Eq. (4) forces the magnitude of the weights to be small, preventing $f^{\rm ML}(x)$ from fitting $f(x)$. As $\lambda\to\infty$, the weights are driven to zero, and so we obtain the same limit as in Eq. (12):

\lim_{\lambda\to\infty}\Delta f = \Delta f_0.  (13)

III.1.2 Length scale too large

We define the boundary $\sigma_l$ between regions II and III as the last local minimum of $\Delta f$ (with respect to $\sigma$). Thus, in region III (see Figs. 2 and 3) $\Delta f$ is monotonically increasing. As $\sigma$ becomes large, the kernel functions become wide and flat over the interpolation region, and in the limit $\sigma\to\infty$, $f^{\rm ML}(x)$ is approximately a constant over $x\in[0,1]$. For small $\lambda$, the optimal constant will be the average value over the training data,

\lim_{\sigma,1/\lambda\to\infty}\hat{f}(x) = \frac{1}{N_T}\sum_{j=1}^{N_T} f(x_j).  (14)

Note that the order of limits is important here: first $\sigma\to\infty$, then $\lambda\to 0$. If the order is reversed, $f^{\rm ML}(x)$ becomes the best polynomial fit of order $N_T$. We will show this explicitly for $N_T=2$. For smaller $\sigma$ in region III, as $\lambda$ decreases, the emergence of “plateau”-like structures can be seen (see Fig. 3). As will be shown, these flat areas correspond to the model behaving as polynomial fits of different orders. These can be derived by taking the limits $\sigma\to\infty$ and $\lambda\to 0$ while maintaining $\sigma$ in a certain proportion to $\lambda$.

$N_T=2$:

In this case, the ML function is

f^{\rm ML}(x) = \alpha_1\, e^{-x^2/2\sigma^2} + \alpha_2\, e^{-(x-1)^2/2\sigma^2},  (15)

and the weights are determined by solving

\begin{pmatrix}\alpha_1\\ \alpha_2\end{pmatrix} = \begin{pmatrix}1+\lambda & e^{-1/2\sigma^2}\\ e^{-1/2\sigma^2} & 1+\lambda\end{pmatrix}^{-1}\begin{pmatrix}f_1\\ f_2\end{pmatrix},  (16)

where $f_j = f(x_j)$. The solution is

\alpha_1 = (f_1(1+\lambda) - e^{-1/2\sigma^2} f_2)/D,  (17)
\alpha_2 = (f_2(1+\lambda) - e^{-1/2\sigma^2} f_1)/D,  (18)

where $D = \det(\mathbf{K}+\lambda\mathbf{I}) = (1+\lambda)^2 - e^{-1/\sigma^2} = 1 + 2\lambda + \lambda^2 - e^{-1/\sigma^2}$. First, we expand as $\sigma\to\infty$, keeping terms up to first order in $1/(2\sigma^2)$:

\alpha_1 \approx ((f_1 - f_2) + f_1\lambda + f_2/2\sigma^2)/D,  (19)
\alpha_2 \approx ((f_2 - f_1) + f_2\lambda + f_1/2\sigma^2)/D,  (20)

where

D \approx 2\lambda + \lambda^2 + 1/\sigma^2.  (21)

Finally

f^{\rm ML}(x) \approx \bar{\alpha} + (\alpha_2(2x-1) - \bar{\alpha}x^2)/2\sigma^2,  (22)

where $\bar{\alpha} = \alpha_1 + \alpha_2$. Next, we take $\lambda\to 0$. In this limit $D$ vanishes and the weights diverge. The relative rate at which the limits are taken will affect the asymptotic behavior of the weights. The form of $D$ suggests we take

\beta = \frac{1}{2\lambda\sigma^2},  (23)

where $\beta$ is a constant.

Taking $\sigma\to\infty$, we obtain a linear form:

f_\beta^{\rm ML}(x) = \frac{\beta f_1 + \overline{f} + \beta(f_2 - f_1)x}{\beta + 1},  (24)

where $\overline{f} = \frac{1}{2}(f_1 + f_2)$. The parameter $\beta$ smoothly connects the constant and linear plateaus. When $\beta\to 0$, we recover the constant form $f^{\rm ML}(x) = \overline{f}$; when $\beta\to\infty$, we recover the linear form $f^{\rm ML}(x) = f_1 + x(f_2 - f_1)$.
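
As a numerical illustration (our own check, not from the paper), one can verify that the exact $N_T=2$ KRR prediction approaches the form of Eq. (24) as $\sigma\to\infty$ with $\beta = 1/(2\lambda\sigma^2)$ held fixed:

import numpy as np

f1, f2 = np.cos(0.0), np.cos(1.0)
beta = 5.0
x = np.linspace(0.0, 1.0, 5)

for sigma in [1e2, 1e3, 1e4]:
    lam = 1.0 / (2.0 * beta * sigma**2)              # keep beta fixed as sigma grows
    K = np.exp(-np.array([[0.0, 1.0], [1.0, 0.0]]) / (2.0 * sigma**2))
    alpha = np.linalg.solve(K + lam * np.eye(2), np.array([f1, f2]))
    k_x = np.exp(-np.stack([x**2, (x - 1.0)**2], axis=1) / (2.0 * sigma**2))
    f_ml = k_x @ alpha                               # exact KRR prediction, Eq. (15)
    f_beta = (beta * f1 + 0.5 * (f1 + f2) + beta * (f2 - f1) * x) / (beta + 1.0)
    print(sigma, np.max(np.abs(f_ml - f_beta)))      # difference shrinks as sigma grows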

We can determine the shape of the transition between plateaus by substituting Eq. (24) for $f^{\rm ML}(x)$ into Eq. (7) for $\Delta f$. For simplicity’s sake, we first define

h_{ij} = \int_0^1 dx\,\, x^i f^j(x),  (25)

since expressions of this form will show up in subsequent derivations in this work. Finally, we obtain

\Delta f_\beta = \frac{-2(\overline{f} + f_1\beta)h_{01}}{1+\beta} + \frac{2\beta(f_1 - f_2)h_{11}}{1+\beta} + \frac{(3 + 6\beta + 4\beta^2)\overline{f}^2 - f_1 f_2\beta^2}{3(1+\beta)^2} + h_{02}.  (26)

In Fig. 9, we compare our numerical $\Delta f$ with the expansion Eq. (26), showing the transition between the linear and constant plateaus. In the case of $N_T=2$, only these two plateaus exist. In general, there will be at most $N_T$ plateaus, each corresponding to a successively higher order polynomial fit. However, not all of these plateaus will necessarily emerge for a given $N_T$; as we will show, the plateaus only become apparent when $\lambda$ is sufficiently small, i.e., when the asymptotic behavior is reached, and when $\sigma$ and $\lambda$ are proportional in a certain way, similar to how we defined $\beta$. This analysis reveals the origin of the plateaus: in the series expansion for $\sigma\to\infty$, $\lambda\to 0$, certain terms corresponding to polynomial forms become dominant when $\sigma$ and $\lambda$ remain proportional.

Figure 9: Comparing the numerical $\Delta f$ (black solid), for $N_T=2$ and $\lambda=10^{-6}$, with our asymptotic form $\Delta f_\beta$ (red dashed) given by Eq. (26). The asymptotic form accurately recovers the behavior of $\Delta f$ in the plateau regions, but fails for small $\sigma$, as expected.

$N_T=3$:

We proceed in the same manner for $N_T=3$, using $\beta$ and substituting into the analytical form of $f^{\rm ML}(x)$ for this case to obtain an expression for $\Delta f_\beta$ (shown in the appendix). This expression is plotted in Fig. 10.

Figure 10: Comparing the numerical $\Delta f$ (black solid), for $N_T=3$ and $\lambda=10^{-6}$, with our asymptotic form $\Delta f_\beta$ (red dashed). The asymptotic form accurately recovers the behavior of $\Delta f$ in the plateau regions, but fails for small $\sigma$, as expected.

To derive the limiting value of $\Delta f$ at each plateau for large $N_T$ and small $\lambda$, we minimize the cost function Eq. (4) (which is equivalent to Eq. (7) in this limit), assuming an $n$-th order polynomial form for $f^{\rm ML}(x)$. We define $c_n$ as the limiting value of $\Delta f$ for the $n$-th order plateau:

c_n = \lim_{\sigma\propto\lambda^{-1/2},\, N_T\to\infty}\Delta f = \min_{\omega_i}\int_0^1 dx\,\Big(f(x) - \sum_{i=0}^n \omega_i x^i\Big)^2.  (27)

For the constant plateau, $f^{\rm ML}(x)$ assumes the constant form $a$; to minimize Eq. (7) with respect to $a$, we solve

\frac{d}{da}\int_0^1 dx\,(f(x) - a)^2 = 0  (28)

for $a$, obtaining

a = \int_0^1 dx\, f(x),  (29)

so that

f^{\rm ML}(x) = h_{01}.  (30)

Thus, we obtain

c_0 = -h_{01}^2 + h_{02}.  (31)

For our case with $f(x)=\cos(x)$, $c_0 = 0.0193$.

For the linear plateau, $f^{\rm ML}(x)$ assumes the linear form $ax+b$; minimizing Eq. (7) with respect to $a$ and $b$, we find that

f^{\rm ML}(x) = (12\,h_{11} - 6\,h_{01})x + 4\,h_{01} - 6\,h_{11},  (32)

yielding, via Eq. (27),

c_1 = h_{02} - 4\left(h_{01}^2 - 3\,h_{01}h_{11} + 3\,h_{11}^2\right).  (33)

For our case with $f(x)=\cos(x)$, $c_1 = 1\times 10^{-3}$. The same procedure yields $c_2 = 2.25\times 10^{-6}$.
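
These plateau limits can be checked numerically (an illustrative script of ours, not part of the original work): the best degree-$n$ polynomial fit of $\cos x$ on $[0,1]$ in the continuum least-squares sense follows from the normal equations built from the moments $h_{i1}$, and its residual reproduces $c_n$ of Eq. (27):

import numpy as np
from scipy.integrate import quad

def plateau_limit(n):
    # normal equations: A_ij = int_0^1 x^(i+j) dx, b_i = h_{i1} = int_0^1 x^i cos(x) dx
    A = np.array([[1.0 / (i + j + 1) for j in range(n + 1)] for i in range(n + 1)])
    b = np.array([quad(lambda x, i=i: x**i * np.cos(x), 0.0, 1.0)[0] for i in range(n + 1)])
    h02 = quad(lambda x: np.cos(x)**2, 0.0, 1.0)[0]
    return h02 - b @ np.linalg.solve(A, b)       # residual of the optimal degree-n fit

for n in range(3):
    print(n, plateau_limit(n))                   # ~1.9e-2, ~1e-3, ~2.3e-6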

Next, we define

\epsilon = \frac{1}{2\lambda^{1/2}\sigma^2}  (34)

as another parameter relating $\sigma$ and $\lambda$. We define it with the same motivation as for $\beta$: we examined our analytical expression for $f^{\rm ML}(x)$ and chose this parameter so that $\sigma$ and $\lambda$ remain proportional in a specific way as they approach their limits, in order to see what values $\Delta f$ takes in these limits (in particular, we are interested in whether all three plateaus can be obtained for $N_T=3$). In doing this, we obtain an expansion analogous to that of Eq. (26) (shown as $\Delta f_\epsilon$ in the appendix).

We plot this expression in Fig. 11, alongside our numerical $\Delta f$ and the plateau limits, for $N_T=3$ and varying $\lambda$. Note that the curves of the expansions are contingent on the value of $\lambda$; we do not retrieve all three plateaus for all of the expansions. Only the expansion curves corresponding to the smallest $\lambda$ ($10^{-10}$, the blue curve) and second smallest $\lambda$ ($10^{-6}$, the yellow curve) show broad, definitive ranges of $\sigma$ where they take the value of each of the three plateaus (for the dashed blue curve, this is evident for $c_1$ and $c_2$; the curve approaches $c_0$ for larger $\sigma$ ranges not shown in the figure), suggesting that a specific proportion between $\sigma$ and $\lambda$ is needed for this to occur. For the solid numerical curves, only the blue curve manifests all three plateaus (like its expansion counterpart, it approaches $c_0$ for larger $\sigma$ ranges not shown); the other two do not reach all three plateaus, regardless of the range of $\sigma$ (the solid red curve does not even go down as far as $c_2$). However, there appears to be a singularity in each of the expansion curves (the sharp spikes in the dashed curves) at certain values of $\sigma$ ($\epsilon$) depending on $\lambda$. This singularity emerges because our substitution of $\epsilon$ leads to an expression with $\epsilon$ in the denominator of our analytical form for $\Delta f$, which naturally has a singularity at certain values of $\epsilon$ depending on $\lambda$.

Figure 11: The dependence of the model error on $\sigma$, for $N_T=3$ and varying $\lambda$. The solid curves are numerical; the dashed curves are expansions derived by using our expression for $\epsilon$ in Eq. (34) and substituting into $f^{\rm ML}(x)$ in Eq. (7). The legend gives the colors (for both the dashed and solid curves) corresponding to each $\lambda$.

Following the precedent set for $N_T=2$ and $N_T=3$, we can proceed in the same way for larger $N_T$ and perform the same analysis. We expect to find higher order plateaus and the same behavior in the limiting values of the parameters, including specific plateau values for $\Delta f$ when $\sigma$ and $\lambda$ are varied with respect to each other in ways analogous to the previous cases. We would like to remark that plateau-like behaviors are well-known in statistical (online) learning in neural networks Saad (1998). However, those plateaus are distinct from the plateau effects discussed here, since they correspond to limits in the (online) learning behavior due to symmetries D. and S. (1995); Biehl et al. (1996) or singularities Fukumizu and Amari (2000); Wei and Amari (2008) in the model.

III.1.3 Length scale just right

In the central region (see Fig. 4), $\Delta f$ as a function of $\sigma$ has the shape of a valley. The optimum model, i.e., the model which gives the lowest error $\Delta f$, is found in this region.

Figure 12: The dependence of $\Delta f(\tilde{\sigma})$ on $\lambda$ for various $N_T$. Here, $\tilde{\sigma}$ minimizes $\Delta f$ for fixed $N_T$ and $\lambda$. $N_T$ values for each curve are given in the legend. The dashed line shows a linear proportionality between $\Delta f(\tilde{\sigma})$ and $\lambda$.

For fixed $N_T$ and $\lambda$, we define the $\sigma$ that gives the global minimum of $\Delta f$ as $\tilde{\sigma}$. In Fig. 12, we plot the behavior of $\Delta f(\tilde{\sigma})$ as a function of $\lambda$. Again, we observe three regions of different qualitative behavior. For large $\lambda$, we over-regularize (as was shown in Fig. 8), giving the limiting value $\Delta f_0$ in Eq. (13). For moderate $\lambda$, we observe an approximately linear proportionality between $\Delta f(\tilde{\sigma})$ and $\lambda$:

\Delta f(\tilde{\sigma}) \propto \lambda.  (35)

However, for small enough $\lambda$, there is vanishing regularization,

\bm{\alpha}_{\rm NF} = \lim_{\lambda\to 0}(\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{f},  (36)

yielding the noise-free limit of the model:

f_{\rm NF}^{\rm ML}(x) = \sum_{j=1}^{N_T} {\alpha_{\rm NF}}_j\, k(x, x_j).  (37)

In this case (for the Gaussian kernel), this limit exists for all $\sigma$. The error of the noise-free model is

\Delta f_{\rm NF} = \lim_{\lambda\to 0}\Delta f.  (38)

III.2 Dependence on function scale

We now introduce the parameter $\gamma$ into our simple one-variable function, so that Eq. (1) becomes

f(x) = \cos(\gamma x).  (39)

For large values of $\gamma$, Eq. (39) becomes highly oscillatory; we extend our analysis here in order to observe the behavior of the model in this case.

Figure 13: Same as Fig. 2, except with $\sigma$ replaced by $\gamma\sigma$, where $\gamma=10$ for the solid curves and $\gamma=1$ for the dashed curves. The labels give the value of $N_T$ for each curve.

Fig. 13 shows $\Delta f$ as a function of $\gamma\sigma$ for various $N_T$, while fixing $\lambda=10^{-6}$ and taking $\gamma=10$ (solid lines) or $\gamma=1$ (dashed lines). This is the same setup as Fig. 2, except with the additional $\gamma$ parameter. The figure demonstrates that the qualitative behaviors we observed in Fig. 2 persist with the inclusion of the $\gamma$ parameter, complete with the characteristic “valley” shape emerging in the moderate $\sigma$ region for each $N_T$. Similarly, we see that $\Delta f$ decreases nearly monotonically with increasing $N_T$ for all $\gamma\sigma$, while the valley opens up to the left as the Gaussians become better able to interpolate the function. The cusps, though not as pronounced, are still present to the left of the valleys, and their general shapes remain the same for increasing $N_T$.

Fig. 14 shows $\Delta f$ as a function of $\gamma\sigma$ for various $\lambda$, while fixing $N_T=33$ and taking $\gamma=10$ (solid lines) or $\gamma=1$ (dashed lines). This is the same as Fig. 3, except with the $\gamma$ parameter included. As in Fig. 3, as $\lambda$ decreases, $\Delta f$ decreases nearly monotonically. The same qualitative features still hold, including the splitting-off of each lower-valued $\lambda$ curve further along $\sigma$.

Figure 14: Same as Fig. 3, except with $\sigma$ replaced by $\gamma\sigma$, where $\gamma=10$ for the solid curves and $\gamma=1$ for the dashed curves. The labels give the value of $\lambda$ for each curve.

Next, we look at how the optimal model depends on $N_T$. In Fig. 15, we plot $\Delta f(\tilde{\sigma})$ as a function of $N_T$, for various $\gamma$. For small $N_T$, there is little to no improvement in the model, depending on $\gamma$. For large $\gamma$, $f^{\rm ML}(x)$ is rapidly varying and the model requires more training samples before it can begin to accurately fit the function. At this point, $\Delta f(\tilde{\sigma})$ decreases as $N_T^{-c}$, where $c\approx 27$ is a constant independent of $\gamma$. This fast learning rate drops off considerably when $\Delta f$ is on the order of $\lambda$ (i.e., at the limit of machine precision), and $\Delta f(\tilde{\sigma})$ levels off (as $\lambda$ corresponds to the leeway the model has in fitting the training $f(x)$ values, i.e., to the accuracy with which the model can resolve errors during fitting, the error cannot improve much beyond this value). In fact, it is known that the learning rate in the asymptotic limit is $1/N_T$ for faithful models (i.e., models that capture the structure of the data) and $1/\sqrt{N_T}$ for unfaithful models Hastie et al. (2009); Müller et al. (1996). However, before the regularization kicks in, $\Delta f$ is approximately the noise-free limit $\Delta f_{\rm NF}$. If the noise-free limit were taken for all $N_T$, it appears that $\Delta f(\tilde{\sigma})$ would continue to decrease at the same learning rate:

\Delta f_{\rm NF} \propto N_T^{-c}.  (40)

Figure 15: The dependence of $\Delta f(\tilde{\sigma})$ on $N_T$ for various $\gamma$. Here, $\tilde{\sigma}$ minimizes $\Delta f$ for fixed $N_T$ and $\lambda$. The solid portion of each line represents the limit $\lambda\to 0$ (the noise-free curve), while the dot-dashed continuation shows the decay for finite $\lambda$ ($\lambda=10^{-14}$ is shown here). For large enough $N_T$ and $\lambda\to 0$, $\Delta f$ has the asymptotic form given approximately by the linear fit shown here (dashed line). Note that, although this asymptotic form is independent of $\gamma$, for larger $\gamma$ the asymptotic region is reached at larger $N_T$.

The learning rate here resembles the error decay of an integration rule, as our simple function is smooth and can always be approximated locally by a Taylor series expansion with enough points on the interval. However, the model here uses an expansion in Gaussian functions instead of polynomials of a particular order, and the error decays much faster than that of a standard integration rule such as Simpson’s, which decays as $N_T^{-4}$ in the asymptotic limit. Additionally, Eq. (40) is independent of $\gamma$ since, for large enough $N_T$, the functions appear no more complex locally. The larger $y$-intercepts for the larger $\gamma$ curves in Fig. 15 arise from the larger number of points needed to reach this asymptotic regime, so the errors are comparatively larger.

III.3 Cross-validation

In previous works applying ML to DFT (Refs. Snyder et al. (2012); Li et al. (2014)), the hyperparameters of the model were optimized in order to find the best model, i.e., the hyperparameters for which the error of the model is minimal on the entire test set, which has not been seen by the machine during training Hansen et al. (2013). We did this by using cross-validation, a technique whereby we minimize the error of the model with respect to the hyperparameters on a partitioned subset of the data called the validation set. Only after we have chosen the optimal hyperparameters through cross-validation do we test the accuracy of our model by determining the error on the test set. We focus our attention on leave-one-out cross-validation, where the training set is randomly partitioned into $N_T$ bins of size one (each bin consisting of a distinct training sample). A validation set is formed by taking the first of these bins, while a training set is formed from the rest. The model is trained on this training set, and optimal hyperparameters are determined by minimizing the error on the singleton validation set. This procedure is repeated for each bin, so $N_T$ pairs of optimal hyperparameters are obtained in this manner; we take as our final optimal hyperparameters the median of each hyperparameter over the entire set of obtained values. The generalization error of the model with optimal hyperparameters is finally tested on a test set, which is inaccessible to the machine during cross-validation.
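
The following minimal sketch (our own illustration of the procedure just described; the helper names are arbitrary) performs this leave-one-out selection of $\sigma$ for fixed $\lambda$ on the evenly spaced $\cos x$ data:

import numpy as np

def krr_weights(x, f, sigma, lam):
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2.0 * sigma**2))
    return np.linalg.solve(K + lam * np.eye(len(x)), f)

def krr_eval(xq, x, alpha, sigma):
    return np.exp(-(xq - x)**2 / (2.0 * sigma**2)) @ alpha

NT, lam = 9, 1e-6
x = np.linspace(0.0, 1.0, NT)
f = np.cos(x)
sigma_grid = np.logspace(-2, 1, 200)

best_sigmas = []
for i in range(NT):                        # hold out training point i
    mask = np.arange(NT) != i
    errs = [(krr_eval(x[i], x[mask], krr_weights(x[mask], f[mask], s, lam), s) - f[i])**2
            for s in sigma_grid]
    best_sigmas.append(sigma_grid[int(np.argmin(errs))])

print(np.median(best_sigmas))              # median of the N_T optimal sigma values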

Our previous works Snyder et al. (2012, 2013a); Li et al. (2014) demonstrated the efficacy of cross-validation in producing an optimal model. Our aim here is to show how this procedure optimizes the model for our simple function on evenly spaced training samples. We have thus far trained our model on evenly spaced points on the interval $[0,1]$: $x_j=(j-1)/(N_T-1)$ for $j=1,\dots,N_T$. We want to compare the model error determined in this way with the model errors obtained when leave-one-out cross-validation is used to find optimal hyperparameters. In Fig. 16, we plot the model error over a range of $\sigma$ values (we fix $\lambda=10^{-6}$ and use $N_T=9$ and $N_T=33$; compare this with Fig. 2). For each $N_T$, we perform leave-one-out cross-validation (using our fixed $\lambda$, so that we obtain the optimal $\sigma$), yielding $N_T$ optimal $\sigma$ values; we plot the model errors for each of these $\sigma$. We also include the global minimum error $\Delta f(\tilde{\sigma})$ for each $N_T$ to show how it compares to the errors for the optimal $\sigma$. Looking at Fig. 16, we see that the optimal $\sigma$ values all yield errors near $\Delta f(\tilde{\sigma})$ and within the characteristic “valley” region, demonstrating that leave-one-out cross-validation indeed optimizes our model. With this close proximity in error values established, we can reasonably estimate the error of the model for the optimal $\sigma$ (for a given $\lambda$) by using $\tilde{\sigma}$.

Figure 16: The dependence of $\Delta f$ on $\sigma$, for $\lambda=10^{-6}$ and $N_T=9$ (red dashed curve) and $N_T=33$ (blue solid curve). The crosses denote $\Delta f$ for the optimized $\sigma$ values found from performing leave-one-out cross-validation (some of these are degenerate, so fewer than $N_T$ distinct crosses are shown), while the dots denote $\Delta f(\tilde{\sigma})$, the global minimum of $\Delta f$ over $\sigma$ (the crosses and dots are matched in color with the curves for each $N_T$).

IV Application to density functionals

A canonical quantum system used frequently to explore basic quantum principles, and as a proving ground for approximate quantum methods, is the particle in a box. In this case, we confine one fermion to a 1d box with hard walls at $x=0,1$, with the addition of an external potential $v(x)$ in the interval $x\in[0,1]$. The equation that governs the quantum mechanics is the familiar one-body Schrödinger equation

\left(-\frac{1}{2}\frac{\partial^2}{\partial x^2} + v(x)\right)\phi(x) = \epsilon\phi(x).  (41)

A solution of this equation gives the orbitals $\phi_j(x)$ and energies $\epsilon_j$. For one fermion, only the lowest energy level is occupied. The total energy is $E=\epsilon_1$, the potential energy is

V = \int dx\, n(x)v(x)  (42)

(where $n(x)=|\phi(x)|^2$ is the electron density), and the KE is $T=E-V$. In the case of one particle, the KE can be expressed exactly in terms of the electron density, a form known as the von Weizsäcker functional Weizsäcker (1935),

T^{\rm W} = \int dx\,\frac{n'(x)^2}{8n(x)},  (43)

where $n'(x)=dn/dx$. Our goal in this section is not to demonstrate the efficacy of ML approximations for the KE in DFT (which is the subject of other works Snyder et al. (2012, 2013a)), but rather to study the properties of the ML approximations with respect to those applications.

We choose a simple potential inside the box,

v(x) = -D\sin^2{\pi x},  (44)

to model a well of depth $D$, which has also been used in the study of semiclassical methods Cangi et al. (2011). To generate reference data for ML to learn from, we solve Eq. (41) numerically by discretizing space onto a uniform grid, $x_j=(j-1)/(N_G-1)$ for $j=1,\dots,N_G$, where $N_G$ is the number of grid points. Numerov’s method is used to solve for the lowest energy orbital and its corresponding eigenvalue. We compute $T$ and $n(x)$, which is represented by its values on the grid. For a desired number of training samples $N_T$, we vary $D$ uniformly over the range $[0,100]$, inclusive, generating $N_T$ pairs of electron densities and exact KEs. Additionally, a test set with 500 pairs of electron densities and exact KEs is generated.
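
For illustration, reference data of this kind can be generated with a few lines of code (a sketch of ours; the paper uses Numerov's method, whereas here the Hamiltonian of Eq. (41) is simply diagonalized with a three-point finite difference for the kinetic term, and the function name is arbitrary):

import numpy as np

def ground_state(D, NG=500):
    x = np.linspace(0.0, 1.0, NG)
    dx = x[1] - x[0]
    xin = x[1:-1]                                   # interior points; hard walls at x = 0, 1
    # H = -1/2 d^2/dx^2 + v(x) on the grid
    T = (-0.5 / dx**2) * (np.diag(np.ones(NG - 3), -1)
                          - 2.0 * np.diag(np.ones(NG - 2))
                          + np.diag(np.ones(NG - 3), 1))
    V = np.diag(-D * np.sin(np.pi * xin)**2)        # v(x) = -D sin^2(pi x), Eq. (44)
    eps, phi = np.linalg.eigh(T + V)
    phi0 = np.zeros(NG)
    phi0[1:-1] = phi[:, 0]                          # lowest-energy orbital
    phi0 /= np.sqrt(np.trapz(phi0**2, x))           # normalize to one particle
    n = phi0**2                                     # electron density
    Vpot = np.trapz(n * (-D * np.sin(np.pi * x)**2), x)   # Eq. (42)
    return x, n, eps[0] - Vpot                      # density and T = E - V

x, n, T_kin = ground_state(D=50.0)
print(T_kin)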

As in the previous sections, we use KRR to learn the KE of this model system. The formulation is identical to that of Ref. Snyder et al. (2012):

T^{\rm ML}[n] = \sum_{j=1}^{N_T}\alpha_j k[n, n_j],  (45)

where $k$ is the Gaussian kernel

k[n, n'] = \exp(-\|n - n'\|^2/(2\sigma^2)),  (46)

and

\|n - n'\|^2 = \Delta x\sum_{j=1}^{N_G}(n(x_j) - n'(x_j))^2,  (47)

where $\Delta x = 1/(N_G-1)$ is the grid spacing. The weights are again given by Eq. (6), found by minimizing the cost function in Eq. (4).

In analogy to Eq. (7), we measure the error of the model as the total squared error integrated over the interpolation region

\Delta T = \int_0^{100} dD\,(T^{\rm ML}[n_D] - T[n_D])^2,  (48)

where $n_D$ is the exact density for the potential with well depth $D$, and $T[n_D]$ is the corresponding exact KE. We approximate the integral by Simpson’s rule evaluated on $D$ sampled over the test set (i.e., 500 values spaced uniformly over the interpolation region). This sampling is sufficiently dense over the interval to give an accurate approximation to $\Delta T$.
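
A sketch of this workflow (again our own illustration, reusing the ground_state() helper from the previous sketch; the training-set size, test-set size, and hyperparameter values below are arbitrary placeholders rather than the settings used in the paper) is:

import numpy as np
from scipy.integrate import simpson

def density_kernel(N1, N2, dx, sigma):
    # k[n, n'] = exp(-||n - n'||^2 / (2 sigma^2)) with the dx-weighted norm of Eq. (47)
    sq = dx * ((N1[:, None, :] - N2[None, :, :])**2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

NT, NG, sigma, lam = 20, 500, 10.0, 1e-6
D_train = np.linspace(0.0, 100.0, NT)
train = [ground_state(D, NG) for D in D_train]           # densities and exact KEs
x = train[0][0]
dx = x[1] - x[0]
N_train = np.array([t[1] for t in train])
T_train = np.array([t[2] for t in train])

K = density_kernel(N_train, N_train, dx, sigma)
alpha = np.linalg.solve(K + lam * np.eye(NT), T_train)   # Eq. (6)

D_test = np.linspace(0.0, 100.0, 101)                    # coarser test grid than the paper's 500
test = [ground_state(D, NG) for D in D_test]
N_test = np.array([t[1] for t in test])
T_test = np.array([t[2] for t in test])
T_pred = density_kernel(N_test, N_train, dx, sigma) @ alpha
print(simpson((T_pred - T_test)**2, x=D_test))           # Delta T of Eq. (48) via Simpson's rule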

In Fig. 17, we plot $\Delta T$ as a function of the length scale of the Gaussian kernel, $\sigma$, for various training set sizes $N_T$. Clearly, the trends are very similar to Fig. 2: the transition $\sigma_s$ between regions I and II becomes smaller as $N_T$ increases, the valley in region II widens, and region III on the right remains largely unchanged. The dependence of $\sigma_s$ on $N_T$ appears to follow the same type of power law $\sigma_s\propto N_T^{p}$, but the value of $p$ is different in this case. A rough estimate yields $p\approx -0.8$, which is similar to $p=-1$ for the simple cosine function explored in the previous sections, but the details will depend on the specifics of the data.

Figure 17: The error of the model, $\Delta T$ (Hartree), as a function of $\sigma$, for fixed $\lambda=10^{-6}$. $N_T$ values for each curve are given in the legend.

Similarly, Fig. 18 shows the same plot but with $N_T$ fixed and $\lambda$ varied. Again, the same features are present as in Fig. 3, i.e., three regions with different qualitative behaviors. In region I, $\Delta T$ has the same decay shape as the kernel functions (Gaussians) begin to overlap significantly, making it possible for the regression to function properly and fit the data. For large values of the regularization strength $\lambda$, the model over-regularizes, yielding high errors for any value of $\sigma$. As $\lambda$ decreases, the weights are given more flexibility to conform to the shape of the KE functional. Using the same definition for the estimation of $\sigma_s$ as in Eq. (10), the median nearest neighbor distance over this training set gives $\sigma_s=0.019$. We then have $\log\sigma_s=-1.72$, which matches the boundary between regions I and II in Fig. 18. In region III, the same plateau features emerge for small $\lambda$. Again, these plateaus occur when polynomial forms of the regression model become dominant for certain combinations of the parameters $\sigma$, $\lambda$, and $N_T$.

Figure 18: The error of the model, $\Delta T$ (Hartree), as a function of $\sigma$, for various $\lambda$ with $N_T=33$. The labels give the value of $\lambda$ for each curve.

In Eqs. (12) and (13), we showed that the model error tends to the benchmark error as $\sigma\to 0$ or $\lambda\to\infty$. Similarly to Eq. (8), we can also define the benchmark error when $T^{\rm ML}[n]\equiv 0$ for this data set as

\Delta T_0 = \int_0^{100} dD\,\, T^2[n_D].  (49)

Evaluating the above integral numerically on the test set, we obtain $\log\Delta T_0 = 3.7$. This matches the error when $\sigma\to 0$ in Figs. 17 and 18.

We define the $\sigma$ that gives the global minimum of $\Delta T$ as $\tilde{\sigma}$; similarly to Fig. 12, we plot the optimal model error $\Delta T(\tilde{\sigma})$ as a function of $\lambda$ in Fig. 19.

Figure 19: The dependence of $\Delta T(\tilde{\sigma})$ on $\lambda$ for various $N_T$. Here, $\tilde{\sigma}$ minimizes $\Delta T$ for fixed $N_T$ and $\lambda$. $N_T$ values for each curve are given in the legend.

For large $\lambda$, we over-regularize the model, and the model error tends to the benchmark error in Eq. (49). For moderate $\lambda$, we observe the same linear proportionality $\Delta T(\tilde{\sigma})\propto\lambda$ as in Fig. 12.

In this toy system, the prediction of the KE from KRR models shares properties similar to those that we explored in learning the 1d cosine function. Now we consider up to 4 noninteracting spinless fermions in a box, under a potential with 9 parameters, as reported in Ref. Snyder et al. (2012):

v(x) = -\sum_{i=1}^3 a_i\exp\left[-(x-b_i)^2/(2c_i^2)\right].  (50)

These densities are represented on $N_G=500$ evenly spaced grid points in $0\leq x\leq 1$. Here a model is built using $N_T/4$ pairs of electron densities and exact KEs for each particle number $N=1,2,3,4$, so the size of the training set is $N_T$. 1000 pairs of electron densities and exact KEs are generated for each $N$, so the size of the test set is $S=4000$. Since there are 9 parameters in the potential, we cannot define the error as an integral, so we use a summation instead. Thus, the error on the test set is defined as the mean square error (MSE) over the test densities,

\Delta T = \sum_{j=1}^S (T^{\rm ML}[n_j] - T[n_j])^2/S.  (51)

Fig. 1 shows the error of the model as a function of $\sigma$ for various $N_T$, with fixed $\lambda=10^{-10}$. Although this system is more complicated than the previous two systems discussed in this paper, the qualitative behaviors in Fig. 1 are similar to those in Fig. 2 and Fig. 17. Table I in Ref. Snyder et al. (2012) only shows the model error with optimized hyperparameters for $N_T=400$. In Fig. 20, model errors over a wide range of $\sigma$ values are shown for 4 different values of $\lambda$ (for large $\sigma$ and small $\lambda$, there is numerical instability in the computation of $(\mathbf{K}+\lambda\mathbf{I})^{-1}$; thus, the $\lambda=3.2\times 10^{-14}$ curve is not plotted for $\log\sigma > 5.5$). The qualitative properties in Fig. 20 are similar to those in Fig. 3 and Fig. 18. For example, the existence of three regions with distinctly different behavior of the model error can be ascertained just as before. In region I, error curves with different $\lambda$ all tend to the same benchmark error limit as $\sigma\to 0$. The median nearest neighbor distance over this training set gives $\sigma_s=0.022$. In Fig. 20, the boundary between region I and region II is well estimated by $\log\sigma_s=-1.66$. In region III, the familiar plateau features emerge. In region II, where $\sigma$ is such that the model is optimal or close to it, we find that the model with hyperparameters $\sigma=1.86$, $\lambda=3.2\times 10^{-14}$ performs the best. The MSE for this model is $1.43\times 10^{-7}$ Hartree. Another common measure of error is the mean absolute error (MAE), which is also used in Ref. Snyder et al. (2012). The MAE of this model is $1.99\times 10^{-4}$ Hartree $= 0.12$ kcal/mol. This result is consistent with the model performance reported in Ref. Snyder et al. (2012) (there, the densities were treated as vectors; here, we treat the densities as functions, so the length scale mentioned here is related to the length scale in Ref. Snyder et al. (2012) by a factor of $\sqrt{\Delta x}=0.045$).

Figure 20: Dependence of the model error (as a function of $\sigma$) for various $\lambda$. The labels give the value of $\lambda$ for each curve. The $\lambda=3.2\times 10^{-14}$ curve is not plotted for $\log\sigma > 5.5$ due to the numerical instability that occurs when $\lambda$ is small and $\sigma$ is large.

V Conclusion

In this work, we have analyzed the properties of KRR models with a Gaussian kernel applied to fitting a simple 1d function. In particular, we have explored regimes of distinct qualitative behavior and derived the asymptotic behavior in certain limits. Finally, we generalized our findings to the problem of learning the KE functional of a single particle confined to a box and subject to a well potential of variable depth. Considering the vast difference in nature between the two problems compared in this work, a 1d cosine function and the KE as a functional of the electron density (a very high-dimensional object), the similarities between the error measures $\Delta f$ and $\Delta T$ are remarkable. This analysis demonstrates that much of the behavior of the model can be rationalized by, and distilled down to, the properties of the kernel. Our goal in this work was to deepen our understanding of how the performance of KRR depends qualitatively on the parameters, in particular in the case that is relevant for MLAs in DFT, namely that of noise-free data and high-dimensional inputs, and of how one may determine a priori which regime the model lies in. From the ML perspective, the scenario analyzed in this work was an unusual one: small data, virtually no noise, low dimensions, and high complexity. The effects found are not only interesting from the physics perspective, but are also illuminating from a learning theory point of view. However, in ML practice the extremes that contain the plateaus or the “comb” region will not be observable, as practical data, with its noisy manifold structure, will confine learning to the favorable region II. Future work will focus on theory and practice in order to improve learning techniques for low noise problems.

Acknowledgements.
The authors acknowledge funding from NSF Grant No. CHE-1240252 (JS, LL, KB), the Alexander von Humboldt Foundation (JS), and the Undergraduate Research Opportunities Program (UROP) and Summer Undergraduate Research Program (SURP) at UC Irvine (KV). MR thanks O. Anatole von Lilienfeld for support via SNSF Grant No. PP00P2 138932.

Appendix A Expansion of Δf for σ ≪ σ_s

\Delta f = \int_{0}^{1}dx\,f(x)^{2} - 2\sum_{j=1}^{N_{T}}\alpha_{j}\int_{0}^{1}dx\,f(x)\,k(x,x_{j}) + \sum_{i,j=1}^{N_{T}}\alpha_{i}\alpha_{j}\int_{0}^{1}dx\,k(x,x_{i})\,k(x,x_{j}),   (52)

where the first integral is given in Eq. (9). Next

\int_{0}^{1}dx\,f(x)\,k(x,x_{j}) = \sqrt{\frac{\pi}{8}}\,\sigma\,e^{-(\gamma\sigma)^{2}/2}\,(C_{j}+C_{j}^{*}),   (53)

where

C_{j} = e^{{\rm i}\gamma x_{j}}\left(\operatorname{erf}\left(\frac{x_{j}-{\rm i}\gamma\sigma^{2}}{\sigma\sqrt{2}}\right)+\operatorname{erf}\left(\frac{1-x_{j}+{\rm i}\gamma\sigma^{2}}{\sigma\sqrt{2}}\right)\right),   (54)

erf is the error function, and C* denotes the complex conjugate of C. The last integral is

\int_{0}^{1}dx\,k(x,x_{i})\,k(x,x_{j}) = \frac{\sigma\sqrt{\pi}}{2}\,e^{-(x_{i}-x_{j})^{2}/(4\sigma^{2})}\left(\operatorname{erf}\left(\frac{x_{i}+x_{j}}{2\sigma}\right)-\operatorname{erf}\left(\frac{x_{i}+x_{j}-2}{2\sigma}\right)\right).   (55)
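These closed forms are easy to check numerically. The following sketch compares the right-hand side of Eq. (55) against direct quadrature of the kernel product over [0, 1]; it assumes the Gaussian kernel k(x, x') = exp(−(x − x')²/(2σ²)), with which Eq. (55) is consistent, and the values of x_i, x_j, and σ are arbitrary test inputs.

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import quad

def kernel(x, xc, sigma):
    # Gaussian kernel centered at xc.
    return np.exp(-(x - xc) ** 2 / (2.0 * sigma ** 2))

def overlap_closed_form(xi, xj, sigma):
    # Right-hand side of Eq. (55).
    pref = 0.5 * sigma * np.sqrt(np.pi) * np.exp(-(xi - xj) ** 2 / (4.0 * sigma ** 2))
    return pref * (erf((xi + xj) / (2.0 * sigma)) - erf((xi + xj - 2.0) / (2.0 * sigma)))

xi, xj, sigma = 0.3, 0.7, 0.1
numeric, _ = quad(lambda x: kernel(x, xi, sigma) * kernel(x, xj, sigma), 0.0, 1.0)
print(numeric, overlap_closed_form(xi, xj, sigma))  # the two values should agree
```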

Appendix B Δf_β for N_T = 3

\Delta f_{\beta} = h_{01}\left(\frac{1}{3}(-5f_{1}-2f_{2}+f_{3})+\frac{f_{1}-f_{3}}{\beta+1}\right) + \frac{2\beta\,h_{11}(f_{1}-f_{3})}{\beta+1} + h_{02} + C,   (56)

where

C = \frac{\beta^{2}\left(7f_{1}^{2}+2f_{1}(4f_{2}+f_{3})+4f_{2}^{2}+8f_{2}f_{3}+7f_{3}^{2}\right)+36(2\beta+1)\overline{f}^{2}}{36(\beta+1)^{2}},   (57)

and where \overline{f} = \frac{1}{3}(f_{1}+f_{2}+f_{3}).
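For convenience, Eqs. (56) and (57) can be transcribed literally into code for numerical evaluation. In the sketch below, h_{01}, h_{11}, h_{02}, and β are the quantities defined in the main text and are simply passed in as arguments; nothing here recomputes them.

```python
def delta_f_beta(f1, f2, f3, beta, h01, h11, h02):
    # Literal transcription of Eqs. (56)-(57) for N_T = 3.
    fbar = (f1 + f2 + f3) / 3.0
    # Eq. (57)
    C = (beta ** 2 * (7 * f1 ** 2 + 2 * f1 * (4 * f2 + f3) + 4 * f2 ** 2
                      + 8 * f2 * f3 + 7 * f3 ** 2)
         + 36 * (2 * beta + 1) * fbar ** 2) / (36 * (beta + 1) ** 2)
    # Eq. (56)
    return (h01 * ((-5 * f1 - 2 * f2 + f3) / 3.0 + (f1 - f3) / (beta + 1))
            + 2 * beta * h11 * (f1 - f3) / (beta + 1)
            + h02 + C)
```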

Appendix C Δf_ε for N_T = 3

\Delta f_{\epsilon} = C_{1}\,h_{01} + C_{2}\,h_{11} + C_{3}\,h_{21} + h_{02} + C_{4},   (58)

where

C_{1} = \frac{\sqrt{\lambda}\left(48\overline{f}-\epsilon^{2}(17f_{1}+4f_{2}+f_{3})\right)-2\epsilon\left(f_{1}(\epsilon^{2}-20)-8f_{2}+4f_{3}\right)-4\lambda\epsilon(f_{2}+4f_{3})}{(\sqrt{\lambda}+\epsilon)\left(-8(\lambda+3)+\epsilon^{2}+8\sqrt{\lambda}\epsilon\right)},   (59)
C_{2} = \frac{2\epsilon\left(2\sqrt{\lambda}\epsilon(9f_{1}+2f_{2}+f_{3})+3f_{1}\epsilon^{2}-24f_{1}+8\lambda(f_{2}+2f_{3})-4f_{2}\epsilon^{2}+f_{3}\epsilon^{2}+24f_{3}\right)}{(\sqrt{\lambda}+\epsilon)\left(-8(\lambda+3)+\epsilon^{2}+8\sqrt{\lambda}\epsilon\right)},   (60)
C_{3} = \frac{4\epsilon\left(\epsilon(-f_{1}+2f_{2}-f_{3})-12\overline{f}\sqrt{\lambda}\right)}{-8(\lambda+3)+\epsilon^{2}+8\sqrt{\lambda}\epsilon},   (61)
C_{4} = \Big[2\sqrt{\lambda}\,\epsilon\big(\epsilon^{4}(75f_{1}^{2}+2f_{1}(58f_{2}-5f_{3})+48f_{2}^{2}+116f_{2}f_{3}+75f_{3}^{2})-480\epsilon^{2}(5f_{1}^{2}+5f_{1}f_{2}+2f_{1}f_{3}+2f_{2}^{2}+5f_{2}f_{3}+5f_{3}^{2})+34560\overline{f}^{2}\big)
+3\lambda\big(\epsilon^{4}(273f_{1}^{2}+232f_{1}f_{2}+226f_{1}f_{3}+48f_{2}^{2}+232f_{2}f_{3}+273f_{3}^{2})-160\epsilon^{2}(7f_{1}^{2}+15f_{1}(f_{2}+2f_{3})+4f_{2}^{2}+15f_{2}f_{3}+7f_{3}^{2})+11520\overline{f}^{2}\big)
+8\lambda^{3/2}\epsilon\big(\epsilon^{2}(40f_{1}^{2}+91f_{1}f_{2}+400f_{1}f_{3}+16f_{2}^{2}+91f_{2}f_{3}+40f_{3}^{2})-240\overline{f}(4f_{1}+f_{2}+4f_{3})\big)
+4\epsilon^{6}(2f_{1}^{2}+2f_{1}f_{2}-f_{1}f_{3}+8f_{2}^{2}+2f_{2}f_{3}+2f_{3}^{2})
-80\epsilon^{4}(5f_{1}^{2}+10f_{1}f_{2}-2f_{1}f_{3}+8f_{2}^{2}+10f_{2}f_{3}+5f_{3}^{2})
+16\lambda^{2}\epsilon^{2}(48f_{1}^{2}+16f_{1}(f_{2}+f_{3})+3f_{2}^{2}+16f_{2}f_{3}+48f_{3}^{2})
+960\epsilon^{2}(7f_{1}^{2}+2f_{1}(4f_{2}+f_{3})+4f_{2}^{2}+8f_{2}f_{3}+7f_{3}^{2})\Big]\Big/\Big[60(\sqrt{\lambda}+\epsilon)^{2}\big(-8(\lambda+3)+\epsilon^{2}+8\sqrt{\lambda}\epsilon\big)^{2}\Big],   (62)

and where \overline{f} = \frac{1}{3}(f_{1}+f_{2}+f_{3}).
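Eqs. (58)–(62) can be transcribed in the same way. The sketch below groups the terms exactly as in the text; again, h_{01}, h_{11}, h_{21}, h_{02}, ε, and λ are taken as given inputs defined in the main text.

```python
import math

def delta_f_eps(f1, f2, f3, eps, lam, h01, h11, h21, h02):
    # Literal transcription of Eqs. (58)-(62) for N_T = 3.
    rl = math.sqrt(lam)
    fbar = (f1 + f2 + f3) / 3.0
    d = -8.0 * (lam + 3.0) + eps**2 + 8.0 * rl * eps  # common denominator factor
    # Eq. (59)
    C1 = (rl * (48*fbar - eps**2 * (17*f1 + 4*f2 + f3))
          - 2*eps * (f1*(eps**2 - 20) - 8*f2 + 4*f3)
          - 4*lam*eps*(f2 + 4*f3)) / ((rl + eps) * d)
    # Eq. (60)
    C2 = 2*eps * (2*rl*eps*(9*f1 + 2*f2 + f3) + 3*f1*eps**2 - 24*f1
                  + 8*lam*(f2 + 2*f3) - 4*f2*eps**2 + f3*eps**2 + 24*f3) / ((rl + eps) * d)
    # Eq. (61)
    C3 = 4*eps * (eps*(-f1 + 2*f2 - f3) - 12*fbar*rl) / d
    # Eq. (62)
    num = (2*rl*eps*(eps**4*(75*f1**2 + 2*f1*(58*f2 - 5*f3) + 48*f2**2 + 116*f2*f3 + 75*f3**2)
                     - 480*eps**2*(5*f1**2 + 5*f1*f2 + 2*f1*f3 + 2*f2**2 + 5*f2*f3 + 5*f3**2)
                     + 34560*fbar**2)
           + 3*lam*(eps**4*(273*f1**2 + 232*f1*f2 + 226*f1*f3 + 48*f2**2 + 232*f2*f3 + 273*f3**2)
                    - 160*eps**2*(7*f1**2 + 15*f1*(f2 + 2*f3) + 4*f2**2 + 15*f2*f3 + 7*f3**2)
                    + 11520*fbar**2)
           + 8*lam**1.5*eps*(eps**2*(40*f1**2 + 91*f1*f2 + 400*f1*f3 + 16*f2**2 + 91*f2*f3 + 40*f3**2)
                             - 240*fbar*(4*f1 + f2 + 4*f3))
           + 4*eps**6*(2*f1**2 + 2*f1*f2 - f1*f3 + 8*f2**2 + 2*f2*f3 + 2*f3**2)
           - 80*eps**4*(5*f1**2 + 10*f1*f2 - 2*f1*f3 + 8*f2**2 + 10*f2*f3 + 5*f3**2)
           + 16*lam**2*eps**2*(48*f1**2 + 16*f1*(f2 + f3) + 3*f2**2 + 16*f2*f3 + 48*f3**2)
           + 960*eps**2*(7*f1**2 + 2*f1*(4*f2 + f3) + 4*f2**2 + 8*f2*f3 + 7*f3**2))
    C4 = num / (60.0 * (rl + eps)**2 * d**2)
    # Eq. (58)
    return C1*h01 + C2*h11 + C3*h21 + h02 + C4
```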

References

  • Müller et al. (2001) Klaus-Robert Müller, Sebastian Mika, Gunnar Rätsch, Koji Tsuda, and Bernhard Schölkopf, “An Introduction to Kernel-Based Learning Algorithms,” IEEE Trans. Neural Networks 12, 181–201 (2001).
  • Kononenko (2001) Igor Kononenko, “Machine learning for medical diagnosis: history, state of the art and perspective,” Artificial Intelligence in Medicine 23, 89–109 (2001).
  • Sebastiani (2002) Fabrizio Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys 34, 1–47 (2002).
  • Ivanciuc (2007) O. Ivanciuc, “Applications of Support Vector Machines in Chemistry,” in Reviews in Computational Chemistry, Vol. 23, edited by Kenny Lipkowitz and Tom Cundari (Wiley, Hoboken, 2007) Chap. 6, pp. 291–400.
  • Bartók et al. (2010) Albert P. Bartók, Mike C. Payne, Risi Kondor,  and Gábor Csányi, “Gaussian Approximation Potentials: The Accuracy of Quantum Mechanics, without the Electrons,” Phys. Rev. Lett. 104, 136403 (2010).
  • Rupp et al. (2012) Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller,  and O. Anatole von Lilienfeld, “Fast and accurate modeling of molecular atomization energies with machine learning,” Phys. Rev. Lett. 108, 058301 (2012).
  • Pozun et al. (2012) Zachary D. Pozun, Katja Hansen, Daniel Sheppard, Matthias Rupp, Klaus-Robert Müller,  and Graeme Henkelman, “Optimizing transition states via kernel-based machine learning,” The Journal of Chemical Physics 136, 174101 (2012).
  • Hansen et al. (2013) Katja Hansen, Grégoire Montavon, Franziska Biegler, Siamac Fazli, Matthias Rupp, Matthias Scheffler, O. Anatole von Lilienfeld, Alexandre Tkatchenko, and Klaus-Robert Müller, “Assessment and Validation of Machine Learning Methods for Predicting Molecular Atomization Energies,” Journal of Chemical Theory and Computation 9, 3404–3419 (2013), http://pubs.acs.org/doi/pdf/10.1021/ct400195d.
  • Schütt et al. (2014) K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller,  and E. K. U. Gross, “How to represent crystal structures for machine learning: Towards fast prediction of electronic properties,” Phys. Rev. B 89, 205118 (2014).
  • Snyder et al. (2012) John C. Snyder, Matthias Rupp, Katja Hansen, Klaus-Robert Müller,  and Kieron Burke, “Finding Density Functionals with Machine Learning,” Phys. Rev. Lett. 108, 253002 (2012).
  • Snyder et al. (2013a) John C. Snyder, Matthias Rupp, Katja Hansen, Leo Blooston, Klaus-Robert Müller,  and Kieron Burke, “Orbital-free Bond Breaking via Machine Learning,” J. Chem. Phys. 139, 224104 (2013a).
  • Li et al. (2014) Li Li, John C. Snyder, Isabelle M. Pelaschier, Jessica Huang, Uma-Naresh Niranjan, Paul Duncan, Matthias Rupp, Klaus-Robert Müller, and Kieron Burke, “Understanding Machine-learned Density Functionals,” arXiv:1404.1333 (2014), submitted.
  • Snyder et al. (2013b) John Snyder, Sebastian Mika, Kieron Burke,  and Klaus-Robert Müller, “Kernels, Pre-Images and Optimization,” in Empirical Inference - Festschrift in Honor of Vladimir N. Vapnik, edited by Bernhard Schoelkopf, Zhiyuan Luo,  and Vladimir Vovk (Springer, Heidelberg, 2013).
  • Snyder et al. (2015) John C. Snyder, Matthias Rupp, Klaus-Robert Müller,  and Kieron Burke, “Non-linear gradient denoising: Finding accurate extrema from inaccurate functional derivatives,” International Journal of Quantum Chemistry in preparation (2015).
  • Burke (2006) K. Burke, “Exact conditions in TDDFT,” Lect. Notes Phys. 706, 181 (2006).
  • Hastie et al. (2009) Trevor Hastie, Robert Tibshirani,  and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. (Springer, New York, 2009).
  • Levine (2009) Ira N. Levine, Quantum Chemistry, 6th ed. (Pearson Prentice Hall, Upper Saddle River, NJ, 2009).
  • Dreizler and Gross (1990) R. M. Dreizler and E. K. U. Gross, Density Functional Theory: An Approach to the Quantum Many-Body Problem (Springer–Verlag, Berlin, 1990).
  • Hohenberg and Kohn (1964) P. Hohenberg and W. Kohn, “Inhomogeneous electron gas,” Phys. Rev. 136, B864–B871 (1964).
  • Kohn and Sham (1965) W. Kohn and L. J. Sham, “Self-consistent equations including exchange and correlation effects,” Phys. Rev. 140, A1133–A1138 (1965).
  • Pribram-Jones et al. (2014) Aurora Pribram-Jones, David A. Gross,  and Kieron Burke, “DFT: A Theory Full of Holes?” Annual Review of Physical Chemistry  (2014).
  • Note (1) We will use log to denote log₁₀ here and throughout this work.
  • Rasmussen and Williams (2006) Carl Rasmussen and Christopher Williams, Gaussian Processes for Machine Learning (MIT Press, Cambridge, 2006).
  • Smola et al. (1998) Alex J Smola, Bernhard Schölkopf,  and Klaus-Robert Müller, “The connection between regularization operators and support vector kernels,” Neural networks 11, 637–649 (1998).
  • Moody and Darken (1989) John Moody and Christian J. Darken, “Fast Learning in Networks of Locally-tuned Processing Units,” Neural Comput. 1, 281–294 (1989).
  • Saad (1998) D. Saad, On-line learning in neural networks (Cambridge University Press, New York, 1998).
  • Saad and Solla (1995) D. Saad and S. Solla, “On-line learning in soft committee machines,” Phys. Rev. E 52, 4225–4243 (1995).
  • Biehl et al. (1996) M. Biehl, P. Riegler,  and C. Wöhler, “Transient dynamics of on-line learning in two-layered neural networks,” Journal of Physics A: Mathematical and General 29, 4769–4780 (1996).
  • Fukumizu and Amari (2000) Kenji Fukumizu and Shun-ichi Amari, “Local minima and plateaus in hierarchical structures of multilayer perceptrons,” Neural Networks 13, 317–327 (2000).
  • Wei and Amari (2008) Haikun Wei and Shun-ichi Amari, “Dynamics of learning near singularities in radial basis function networks,” Neural Networks 21, 989–1005 (2008).
  • Müller et al. (1996) K-R Müller, M Finke, N Murata, K Schulten,  and S Amari, “A numerical study on learning curves in stochastic multilayer feedforward networks,” Neural Computation 8, 1085–1106 (1996).
  • Weizsäcker (1935) C. F. von Weizsäcker, “Zur Theorie der Kernmassen,” Zeitschrift für Physik A Hadrons and Nuclei 96, 431–458 (1935), doi:10.1007/BF01337700.
  • Cangi et al. (2011) Attila Cangi, Donghyung Lee, Peter Elliott, Kieron Burke,  and E. K. U. Gross, “Electronic structure via potential functional approximations,” Phys. Rev. Lett. 106, 236404 (2011).
  • Note (2) For large σ, if λ is small, there will be numerical instability in the computation of (K + λI)⁻¹. Thus, the λ = 3.2×10⁻¹⁴ curve is not plotted for log σ > 5.5.
  • Note (3) In Ref. Snyder et al. (2012), the densities were treated as vectors. Here, we treat the densities as functions, so the length scale used here is related to the length scale in Ref. Snyder et al. (2012) by a factor of √(Δx) = 0.045.