
Fitting Matérn smoothness parameters using automatic differentiation

Christopher J. Geoga
Department of Statistics, Rutgers University
Oana Marin
Mathematics and Computer Science Division, Argonne National Laboratory
Michel Schanen
Mathematics and Computer Science Division, Argonne National Laboratory
Michael L. Stein
Department of Statistics, Rutgers University
Abstract

The Matérn covariance function is ubiquitous in the application of Gaussian processes to spatial statistics and beyond. Perhaps the most important reason for this is that the smoothness parameter $\nu$ gives complete control over the mean-square differentiability of the process, which has significant implications for the behavior of estimated quantities such as interpolants and forecasts. Unfortunately, derivatives of the Matérn covariance function with respect to $\nu$ require derivatives of the modified second-kind Bessel function $\mathcal{K}_{\nu}$ with respect to $\nu$. While closed-form expressions of these derivatives do exist, they are prohibitively difficult and expensive to compute. For this reason, many software packages require fixing $\nu$ as opposed to estimating it, and all existing software packages that attempt to offer the functionality of estimating $\nu$ use finite difference estimates for $\partial_{\nu}\mathcal{K}_{\nu}$. In this work, we introduce a new implementation of $\mathcal{K}_{\nu}$ that has been designed to provide derivatives via automatic differentiation (AD), and whose resulting derivatives are significantly faster and more accurate than those computed using finite differences. We provide comprehensive testing for both speed and accuracy and show that our AD solution can be used to build accurate Hessian matrices for second-order maximum likelihood estimation in settings where Hessians built with finite difference approximations completely fail.


Keywords: Matérn covariance, Bessel functions, Gaussian processes, maximum likelihood, automatic differentiation

1 Introduction

Gaussian process modeling is a ubiquitous tool in a variety of disciplines. One attractive feature of a Gaussian process is that it is specified by its first two moments: for a Gaussian random field $Z$ with (often multivariate) index $\bm{x}$, one needs only to select a mean function $\mathbb{E}Z(\bm{x}) =: m(\bm{x})$ (for convenience we consider only mean-zero processes, although extensions to parametric mean functions are straightforward) and a positive-definite covariance function $K$ such that $\text{Cov}(Z(\bm{x}), Z(\bm{x}^{\prime})) =: K(\bm{x}, \bm{x}^{\prime})$ to completely specify the multivariate law. A classical application of Gaussian process modeling in the mean-zero setting with data $\bm{z}$ is to select a parametric family of positive-definite functions $\left\{K_{\bm{\theta}}\right\}_{\bm{\theta}\in\bm{\Theta}}$ and then optimize the negative log-likelihood over the parametric family to obtain the estimator

\widehat{\bm{\theta}}_{\text{MLE}} := \operatorname*{arg\,min}_{\bm{\theta}\in\bm{\Theta}} -\ell(\bm{\theta}\;|\;\bm{z}) = \operatorname*{arg\,min}_{\bm{\theta}\in\bm{\Theta}} \frac{1}{2}\left[\log\left|\bm{\Sigma}(\bm{\theta})\right| + \bm{z}^{\top}\bm{\Sigma}(\bm{\theta})^{-1}\bm{z}\right] .   (1.1)

It is common practice to then treat this MLE as a known true value and compute subsequent estimators like interpolants or forecasts using the Gaussian law specified by the MLE.
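To make (1.1) concrete, the following is a minimal Julia sketch of the negative log-likelihood evaluated via a Cholesky factorization. The names kfun and pts are placeholders for a user-supplied positive-definite covariance function and a container of locations; they are illustrative assumptions, not part of any library discussed in this paper.

  using LinearAlgebra

  # Minimal sketch of the negative log-likelihood (1.1). `kfun(x, xp, θ)` is a
  # placeholder covariance function and `pts` a vector of locations.
  function negloglik(kfun, θ, pts, z)
      Σ = [kfun(x, xp, θ) for x in pts, xp in pts]
      C = cholesky(Symmetric(Σ))
      (logdet(C) + dot(z, C \ z)) / 2
  end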

One of the most popular isotropic covariance functions, the Matérn covariance function [21], is given by

K_{\bm{\theta}=(\sigma,\rho,\nu)} = \mathcal{M}_{\nu}(\bm{x},\bm{x}^{\prime}) := \sigma^{2}\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\left\|\bm{x}-\bm{x}^{\prime}\right\|}{\rho}\right)^{\nu}\mathcal{K}_{\nu}\left(\frac{\sqrt{2\nu}\left\|\bm{x}-\bm{x}^{\prime}\right\|}{\rho}\right) ,   (1.2)

where $\Gamma$ is the gamma function, $\mathcal{K}_{\nu}$ is the second-kind modified Bessel function [8], and $\sigma$, $\rho$, and $\nu$ are scale, range, and smoothness parameters respectively. The Matérn class is distinguished from other standard covariance functions by the flexibility that the parameter $\nu$ gives, as it controls the mean-square differentiability of the process [30]. While the above functional form is complicated, the Matérn covariance is most naturally motivated by its Fourier transform pair (also called the spectral density), which for $d$-dimensional processes has the form $S(\bm{f}) = \tau^{2}(\zeta^{2} + \left\|\bm{f}\right\|^{2})^{-\nu-d/2}$, where $\tau$ and $\zeta$ are functions of $(\sigma,\rho,\nu)$ and $d$. As it turns out, $\nu$ has significant effects on the behavior of derived quantities. Interpolants and forecasts, for example, are fundamentally different when $\nu < 1$, corresponding to a process that is not mean-square differentiable, and when $\nu \geq 1$, corresponding to a process with at least one mean-square derivative. Special concern about whether processes are (mean-square) differentiable dates back to at least Matheron [22] in the geostatistical community. For a complete discussion on the importance of this parameter and the value in estimating it, see [30].
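As a point of reference, (1.2) can be transcribed directly in Julia using the besselk routine from SpecialFunctions.jl, which wraps AMOS. The sketch below is for illustration only; because it relies on a compiled library, it cannot be differentiated in $\nu$ by AD, which is exactly the limitation addressed in this paper.

  using LinearAlgebra: norm
  using SpecialFunctions: besselk, gamma

  # A direct transcription of (1.2); accurate for evaluations, but not AD-safe in ν.
  function matern(x, xp, (σ, ρ, ν))
      d = norm(x .- xp)
      iszero(d) && return σ^2
      arg = sqrt(2ν) * d / ρ
      σ^2 * (2^(1 - ν) / gamma(ν)) * arg^ν * besselk(ν, arg)
  end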

A widely acknowledged challenge in GP modeling is that the log-likelihood surface created by $\ell(\bm{\theta}\;|\;\bm{z})$ is non-convex and difficult to optimize over. Figure 1 provides an example of the log-likelihood surface (profiled in $\sigma$, see Appendix B) for a Matérn process that illustrates the non-quadratic shape of level surfaces and the difficulty of estimating both smoothness and range parameters. A similar but arguably worse challenge occurs for estimating the range and scale even when the smoothness parameter is known [36].

[Figure 1 image omitted: a heatmap of the profile log-likelihood surface over $\rho \in [0.5, 5]$ (horizontal axis) and $\nu \in [1.2, 1.5]$ (vertical axis), with a colorbar from 0 to 12.]
Figure 1: A centered profile log-likelihood surface for a Matérn process (so that 0 corresponds to the log-likelihood at the MLE). The colorbar range has been artificially truncated to emphasize the non-quadratic structure near the minimizer. Level contours are drawn at $\tfrac{1}{2}, \tfrac{3}{2}, \ldots, \tfrac{17}{2}$ log-likelihood units below the maximum.

An effective way to combat the log-likelihood's lack of convexity is to optimize using derivative information. Until recently the use of derivatives to aid Gaussian process optimization was not particularly common (see [31, 23, 11, 16] for examples of separate strategies for optimization at scale using approximated derivative information). The gradient of $\ell(\bm{\theta}\;|\;\bm{z})$ is computable in closed form and is given by

-2\frac{\partial}{\partial\theta_{j}}\ell(\bm{\theta}\;|\;\bm{z}) = \text{tr}\left(\bm{\Sigma}(\bm{\theta})^{-1}\bm{\Sigma}_{j}(\bm{\theta})\right) - \bm{z}^{\top}\bm{\Sigma}(\bm{\theta})^{-1}\bm{\Sigma}_{j}(\bm{\theta})\bm{\Sigma}(\bm{\theta})^{-1}\bm{z} ,   (1.3)

where we use the shorthand $\bm{\Sigma}_{j}(\bm{\theta}) := \frac{\partial}{\partial\theta_{j}}\bm{\Sigma}(\bm{\theta})$. Setting aside the computational challenges this presents for large matrices, computing this quantity requires derivatives of the covariance function with respect to all parameters in order to assemble the $\bm{\Sigma}_{j}(\bm{\theta})$ matrices. This is the step in the process where attempting to estimate $\nu$ from data becomes problematic. Until recently, no closed-form expressions for the derivative $\partial_{\nu}\mathcal{K}_{\nu}$ were available in the literature, and by extension accurate computation of the derivatives $\partial_{\nu}\mathcal{M}_{\nu}$ was not possible. While these derivatives have been studied [6], including very recent work providing completely closed-form derivatives [12], the expressions are cumbersome, for example involving higher-order generalized hypergeometric functions, and as such are difficult to implement reliably. Considering that these functions are significantly more general than $\mathcal{K}_{\nu}$, existing software implementations for them are understandably complex and in our experience come at significant speed costs on at least some parts of the domain, even if they are reliably accurate on the entire domain of $\mathcal{K}_{\nu}$.
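For concreteness, one entry of the gradient (1.3) can be assembled as in the sketch below, where Σj stands for the derivative matrix $\bm{\Sigma}_{j}(\bm{\theta})$ whose $\nu$-derivative entries are the problematic ones. This is an illustrative sketch under those assumptions, not the optimization code used later in the paper.

  using LinearAlgebra

  # Sketch of one entry of (1.3): returns ∂/∂θj of the negative log-likelihood,
  # given the covariance matrix Σ and its derivative matrix Σj = ∂Σ/∂θj.
  function negloglik_grad_entry(Σ, Σj, z)
      C = cholesky(Symmetric(Σ))
      w = C \ z
      (tr(C \ Σj) - dot(w, Σj * w)) / 2
  end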

The non-convex optimization problem posed by maximum likelihood for Matérn covariance functions may not be practically solvable using only gradient information. In an attempt to introduce higher-order information to further aid in optimization, several recent works have used the expected Fisher information matrix as a proxy for Hessians [11, 16]. True Hessians (also referred to as observed information matrices), however, while coming with an additional computational burden, are in our experience more effective than expected information matrices, as well as arguably preferable as proxies for second-order information of estimated parameters [7]. They can be computed as

-2[H\ell(\bm{\theta}\;|\;\bm{z})]_{j,k} = \text{tr}\left(\bm{\Sigma}(\bm{\theta})^{-1}\bm{\Sigma}_{j}(\bm{\theta})\bm{\Sigma}(\bm{\theta})^{-1}\bm{\Sigma}_{k}(\bm{\theta})\right) - \text{tr}\left(\bm{\Sigma}(\bm{\theta})^{-1}\bm{\Sigma}_{jk}(\bm{\theta})\right) + \bm{z}^{\top}\left(\frac{\partial}{\partial\theta_{k}}\left\{\bm{\Sigma}^{-1}\bm{\Sigma}_{j}(\bm{\theta})\bm{\Sigma}^{-1}\right\}\right)\bm{z} ,

where $\bm{\Sigma}_{jk}$ is the second derivative $\frac{\partial^{2}}{\partial\theta_{j}\partial\theta_{k}}\bm{\Sigma}(\bm{\theta})$. The derivative in the last quadratic form can be expanded using matrix calculus but is left in this form for clarity (see the appendix of [11] for the full expression). As can be seen, the Hessian of $\ell(\bm{\theta})$ requires second derivatives of $\mathcal{K}_{\nu}$, making second-order optimization for the Matérn class of covariance functions more challenging still. To our knowledge, no existing literature gives explicit forms for these second derivatives. These considerations indicate that a primary bottleneck in performing maximum likelihood estimation for Matérn models is the robust evaluation of the second-kind modified Bessel function $\mathcal{K}_{\nu}$ and its first two derivatives. While we have not directly verified that the inaccuracies of finite-difference derivatives pose a substantial problem in gradient-based Bayesian methods, such as Hamiltonian Monte Carlo [25], it is possible that the improvements we discuss here could also provide meaningful gains in those methods.

$\mathcal{K}_{\nu}$ is a special function defined as one of the two linearly independent solutions of a second-order differential equation [8]. Alternative expressions for $\mathcal{K}_{\nu}$, such as series expansions, integrals on the real line, path integrals, as well as confluent and generalized hypergeometric functions, have been used in existing special function libraries [8, 2, 32]. The primary difficulty of practical implementations is that no single strategy for accurate and efficient evaluation is valid on the entire real axis, and a domain partitioning is required to obtain an optimal method to evaluate $\mathcal{K}_{\nu}$. Current libraries focus entirely on evaluations of Bessel functions, neglecting their derivatives, and even if they provide access to the source implementation, they are not easily extended to the problem of computing derivatives. Due to this difficulty in computing the derivatives of $\mathcal{K}_{\nu}$ with respect to the order $\nu$, it is common practice to fix $\nu$ prior to performing the optimization instead of treating it as a parameter to be estimated (see, for example, the popular GPyTorch software library [10], or the RobustGaSP software [15, 14], both of which only offer a few special half-integer values of $\nu$). This often constitutes a significant sacrifice, as $\nu$ can have important practical implications for many tasks performed after maximum likelihood estimation, such as interpolation [30].

In an attempt to offer the functionality of estimating $\nu$ from data, several software libraries for fitting Matérn models provide simple finite difference approximations for the derivative, such as $\partial_{\nu}\mathcal{K}_{\nu}(x) \approx h^{-1}(\mathcal{K}_{\nu+h}(x) - \mathcal{K}_{\nu}(x))$, where a typical choice is $h = 10^{-6}$. For special function libraries that provide only function evaluations this is one of the few possible choices. And for a suitable choice of $h$, these approximations may be reasonably accurate when $x$ is not near the origin and $\nu$ is not overly large, although finite difference approximations are known to be prone to round-off errors. This is particularly problematic for second derivatives, as those calculations involve a divisor of $h^{2}$. Combined with the finite-precision representation of the numerator, this division can lead to severe floating point errors. To our knowledge no software library currently attempts to offer second derivative information for $\mathcal{K}_{\nu}$ with respect to $\nu$, or, by extension, for $\mathcal{M}_{\nu}$. Considering that the evaluation of $\mathcal{K}_{\nu}$ is the dominant cost in evaluating the entire Matérn covariance function, finite differences, which require at least twice as many evaluations as direct computations, also come with a burdensome computational cost. A more efficient approach, essentially free of round-off errors for first derivatives, is the complex step method [29, 9]. This was studied for modified Bessel function order derivatives in [20], however not for edge cases such as $\nu \in \mathbb{Z}$, since they are less problematic for complex step evaluations but are highly significant for the current work. Adaptive finite difference methods, uncommon in the statistical community, are more reliable since they select $h$ adaptively based on function values in the vicinity of differentiation points. But while they provide higher accuracy, it comes at a significant performance cost due to the many function evaluations required. And so while a sufficiently high-order adaptive finite difference method can be used to solve the problem we discuss here, the performance cost is so significant that we do not consider them further.

Arguably the most powerful differentiation approach when derivatives are not computable by hand or easily programmed manually is automatic differentiation (AD), which avoids issues such as round-off errors and allows highly efficient implementations. In this work, we develop a new implementation of $\mathcal{K}_{\nu}$ that is purposefully designed to be automatically differentiable with high accuracy. The negligible accumulation of errors in the first-order derivatives using AD computations further enables highly accurate computations of second-order derivatives of $\mathcal{K}_{\nu}$ with respect to $\nu$. We take this approach here, making use of the Julia [5] programming language, which we consider to be an ideal tool due to its type system and powerful AD tools such as ForwardDiff.jl [27].

In the next section, we provide an overview of AD, focusing on the challenges associated with an automatically differentiable implementation of $\mathcal{K}_{\nu}$. In Section 3, we present our choices for direct evaluations and their derivatives on the entire domain of definition of $\mathcal{K}_{\nu}$. Next, in Section 4, we verify the accuracy and efficiency of the derivatives as well as the Matérn covariance. Finally, we provide a demonstration of second-order optimization in Section 5, where we compare the results of optimizing using derivatives of the log-likelihood built using our AD implementation to optimization results building the same objects with finite difference derivatives. We conclude that optimization using the software we provide here yields correct MLEs, and that the AD-generated derivatives are fast and sufficiently accurate that even very complicated derived functions such as $H\ell(\bm{\theta}\;|\;\bm{z})$ are computed to high accuracy in settings where building the same objects using finite difference derivatives results in a complete loss of precision.

2 Derivatives via Automatic Differentiation

Automatic differentiation (AD) refers to the algorithmic transformation of a function implementation $f$ into code that computes the derivative $f^{\prime}$. In the forward-mode paradigm, which is the only one that will be discussed in this work, an implementation of $f$, representing a multivariate vector function $\mathbf{y} = f(\mathbf{x}), \mathbb{R}^{n} \mapsto \mathbb{R}^{m}$ with $n$ inputs and $m$ outputs, is transformed by an AD tool into an implementation of the product $\dot{\mathbf{y}} = \nabla f(\mathbf{x}) \cdot \dot{\mathbf{x}} = J_{v}(\mathbf{x}, \dot{\mathbf{x}})$, where $J_{v}$ is the Jacobian-vector product of the Jacobian $J(\mathbf{x})$ and the vector $\dot{\mathbf{x}}$. The tangent vectors $\dot{\mathbf{x}}$ and $\dot{\mathbf{y}}$ are the input directional derivatives and the output sensitivities, respectively. This technique can be employed recursively by applying AD to the Jacobian-vector product implementation $\dot{\mathbf{y}} = J_{v}(\mathbf{x}, \dot{\mathbf{x}})$ with inputs $\mathbf{x}$ and $\dot{\mathbf{x}}$. We refer to [13] for an extensive introduction to AD.

An AD tool applies differentiation at the programming language intrinsic function level, having hard-coded differentiation rules covering most algebraic operations, some special functions, and optionally higher-level constructs like linear solvers. In Julia, the programming language of choice in the current work, these rules are implemented by ChainRules.jl [34]. The tool then generates code that computes the function value and derivative value at each program statement. As an example, we illustrate as pseudocode a forward-mode AD tool which generates $f^{\prime}(x)$ from the implementation of $y = f(x_{1}, x_{2}) = e^{x_{1} x_{2}}$:

# original implementation of f(x1, x2) = exp(x1*x2)
function f(x1, x2)
    tmp = x1*x2
    return exp(tmp)
end

# AD-generated implementation: returns the value and the directional
# derivative along the input tangents (dx1, dx2)
function f_and_df(x1, dx1, x2, dx2)
    tmp  = x1*x2
    dtmp = dx1*x2 + x1*dx2           # product rule
    return exp(tmp), exp(tmp)*dtmp   # chain rule through exp
end

with the prefix ’d’ denoting the generated tangents. Together with the chain rule for more complex expressions, this purely mechanical work can be fully automated. Using compiler code analysis, like the LLVM backend of Julia, modern AD tools provide highly efficient implementations that rival any hand-optimized derivative code [24]. An immediate advantage of AD is that it enables a user to compute derivatives of a program seamlessly, without tediously differentiating by hand or contending with concerns such as tuning the step size $h$ for finite differences. Because AD applies basic calculus rules at the lowest computational level, as opposed to finite differences which operate at the top level once evaluations are available, it reduces the impact and propagation of round-off errors while compiling very efficiently, which provides even more significant advantages for higher-order derivatives. There are, however, challenges associated with using automatic differentiation in the setting of special functions, especially if applied to existing libraries, as will be briefly discussed.
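In Julia, the effect of the generated code above can be reproduced directly with ForwardDiff.jl; the snippet below is a minimal illustration on the same test function.

  using ForwardDiff

  f(x) = exp(x[1] * x[2])

  # Forward-mode AD pushes tangents through each statement, exactly as in the
  # generated pseudocode above.
  ForwardDiff.gradient(f, [1.0, 2.0])   # returns ≈ [2exp(2), exp(2)]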

Differentiation of truncated series approximations

One of the most common implementations of special functions for direct evaluations is via series or asymptotic expansions. Term-wise derivatives of convergent series expansions are, for the functions considered here, also convergent series, but their convergence properties may differ from those of the original series. Thus, more terms may be needed when calculating derivatives to assure no loss of accuracy. Similar difficulties can occur with term-wise differentiation of asymptotic series, whose convergence properties hold as the argument of the function tends to some limit rather than as the number of terms increases. From an AD perspective a similar numerical challenge is encountered in the application of AD to iterative methods, e.g. [3]. In this work, this problem occurs primarily in one special setting that will be discussed at length in the implementation section and is effectively remedied by adding additional terms in the series truncation.

Limit branch problems for special functions

AD-differentiated code where the control flow is dependent on the input $\mathbf{x}$ may yield discontinuous derivatives at code branches for limiting values. We refer to this problem as the limit branch problem, and refer curious readers to [4] (which calls it the if problem) for a thorough investigation and discussion.

As an example, consider a function pertinent in this work given by

f(x) := \frac{1}{2x}\left(\frac{1}{\Gamma(1-x)} - \frac{1}{\Gamma(1+x)}\right) ,

where at $x = \pm 2, \pm 3, \ldots$ one of the two gamma functions will evaluate to infinity. A code implementation that computes $\Gamma(x)$ using a special functions library may also fail at $x = 0$, because there the division of 0 by 0 will return NaN per the floating point standard. The function $f$ is analytic at $x = 0, \pm 2, \pm 3, \ldots$, however, and admits finite limits, so any implementation of $f$ should return the correct value for these inputs. Focusing for simplicity on the particular value of $x = 0$, one workaround is to specify the function value using a special branch, shown here as pseudocode given by

  function f_focus_on_0(x)
    if x == 0
      return 0.57721566... # Euler-Mascheroni constant
    else
      return (1/gamma(1-x) - 1/gamma(1+x))/(2*x)
    end
  end

Applying AD using a tool such as ForwardDiff.jl [27] will incorrectly return a derivative of 0 at $x = 0$. By manually supplying limits, the programmer has cut the dependency graph of $f$ with respect to $x$ at $x = 0$, and the program reduces to a constant function with derivative 0. To solve this issue, the code expression in the if branch has to be given in a form that preserves local derivative information in the vicinity of $x = 0$. Local expansions in the if code branch allow for the limit to be computed and differentiated numerically and ultimately preserve derivative information. In this example, due to the smoothness of $f$ at $x = 0$ there exists a polynomial $P$ such that $f(x) \approx P(x)$ for $x \approx 0$ (see Section 3.5). A (pseudo)code implementation given by

  function f_focus_on_0(x)
    if is_near(x, 0)
      return poly_eval(x, p_0, p_1, p_2, ...)
    else
      return (1/gamma(1-x) - 1/gamma(1+x))/(2*x)
    end
  end

will now provide very accurate derivative information at x=0x=0. This approach still of course technically gives a program implementation with discontinuities in the derivative due to the inexact local approximations. But if the polynomials are only substituted in a suitably small neighborhood of the expansion points then the inexactness can be reduced to the same order as the intrinsic inexactness of finite-precision computer arithmetic and is thus not a concern for the present work.
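The same pattern can be illustrated with a small runnable example; the function below is a stand-in chosen for this sketch (not the $f$ above), with a short Taylor polynomial whose coefficients are known exactly.

  using ForwardDiff

  # Toy version of the same fix, using g(x) = (exp(x) - 1)/x. Hard-coding the
  # limit value kills the derivative at x = 0; substituting a short Taylor
  # polynomial near 0 preserves it.
  g_bad(x)  = x == 0 ? 1.0 : expm1(x) / x
  g_good(x) = abs(x) < 1e-4 ? evalpoly(x, (1.0, 1/2, 1/6, 1/24)) : expm1(x) / x

  ForwardDiff.derivative(g_bad,  0.0)   # 0.0  (incorrect)
  ForwardDiff.derivative(g_good, 0.0)   # 0.5  (correct, since g'(0) = 1/2)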

For special function implementations this concern is far from purely academic. Many definitions of 𝒦ν\mathcal{K}_{\nu}, for example when ν\nu is an integer, are given as limiting forms whose implementations do not provide correct automatic derivatives. For this reason, applying AD transformations to existing special function libraries without additional scrutiny is unlikely to provide correct derivatives, which is further motivation for the current work. As indicated in [19], one alternative that would sidestep these concerns would be to use a neural network model instead of a series implementation, since automatic differentiation is an inherent feature of neural networks, and the accuracy could be controlled in the training phase.

3 A Fast and Accurate Implementation of $\mathcal{K}_{\nu}$ and $\partial_{\nu}^{k}\mathcal{K}_{\nu}$

To devise a robust implementation for $\mathcal{K}_{\nu}$ and its derivatives $\partial_{\nu}^{k}\mathcal{K}_{\nu}$, with $k = 1, 2$, we partition the domain of definition, $\mathbb{R}_{+}^{2}$ (this is sufficient since $\mathcal{K}_{\nu}(x) = \mathcal{K}_{-\nu}(x)$), into several intervals. This strategy of breaking up the domain and applying different strategies for different regions is ubiquitous in the implementation of all special functions, even ones as basic as trigonometric functions, and certainly has been applied in every commonly used implementation of Bessel functions (see, for example, the domain partitioning of the AMOS library [2]). A particular goal in our domain partitioning and method selection, however, is to obtain accurate and efficient automatic derivatives as well as accurate evaluations. In Figure 2, we summarize the domain partitioning based on intervals for pairs $(x, \nu)$, which splits the positive real axis into sub-intervals $\mathbb{R}_{+} = (0, a_{1}) \cup [a_{1}, a_{2}) \cup [a_{2}, a_{3}) \cup [a_{3}, \infty)$, with recommended selections for the parameters $a_{i}, i = 1, 2, 3$. Even in the most rigorous settings of error analysis, however, there is an element of personal judgment in the exact choices of domain partitioning parameters. While in broad strokes they are guided by when different representations that can be exactly evaluated by a computer are accurate, such as large-argument expansions being picked when the argument is "large enough", they are also guided by practical metrics like numerical tests. This is the case in this work, and as such the boundary values we recommend here can be altered slightly without major impacts on accuracy.

Each interval is studied in its associated subsection, accompanied by discussions of implementation details for the selected method in that interval (typically a suitable series-type representation) and the adjustments necessary to expedite and preserve the accuracy of automatic derivatives. Each separate expansion is truncated at different levels $\{t_{j}\}$, and these parameters have a meaningful impact on both efficiency and accuracy. In the present work we chose the $\{t_{j}\}$ levels that provide close to computer precision. A user willing to sacrifice a few digits of accuracy to increase efficiency may prefer to truncate the series at lower levels than the suggested values, but a comprehensive study of accuracy and efficiency impacts at such levels may be lengthy and is not considered in this work. Special values of the order $\nu$, such as integers and half-integers, must be considered separately. Since $\mathcal{K}_{\nu}$ for these values of $\nu$ is commonly expressed in a limiting form, substantial work is required to develop an implementation that avoids the limit branch problem. As the rescaled function $x^{\nu}\mathcal{K}_{\nu}(x)$ appears in the Matérn covariance function, we also provide discussion of special optimizations for this function in Section A of the appendix.

[Figure 2 diagram omitted. The decision tree routes evaluations of $\mathcal{K}_{\nu}(x)$ and $\partial_{\nu}^{k}\mathcal{K}_{\nu}(x)$ as follows: for $x \in (0, a_{1})$ (Sec. 3.1), use (i) Eqs. 3.7-3.9 with truncation $t_{1}$ when $\nu$ is approximately an integer and (ii) Eq. 3.2 with truncation $t_{1}$ otherwise; for $x \in [a_{1}, a_{3})$ (Sec. 3.2), use (iii) Eq. 3.4 with truncation $t_{2}$ when $x < a_{2}$ and $t_{3}$ when $x \geq a_{2}$; for $x > a_{3}$ (Sec. 3.3), use (iii) Eq. 3.4 with truncation $t_{3}$ when $\nu > \nu_{1}$ and (iv) Eq. 3.5 with truncation $t_{4}$ when $\nu \leq \nu_{1}$; and for $\nu + \tfrac{1}{2} \in \mathbb{Z}$ (Sec. 3.4), use (iv) Eq. 3.5 for direct evaluation and (v) Eq. 3.6 for AD evaluation.]
Figure 2: An overview of the branched implementation for $\mathcal{K}_{\nu}$ and its derivatives $\partial_{\nu}^{k}\mathcal{K}_{\nu}(x)$. The parameters $a_{i}, t_{i}$ can be tuned for performance by the user: we recommend $a_{1} = 8.5$, $a_{2} = 15$, $a_{3} = 30$, $t_{1} \approx 20$, $t_{2} = 12$, $t_{3} = 8$, $t_{4} = 5$, and $\nu_{1} = 1.5$. Roman numerals appear next to the equation numbers for easy cross-referencing with Table 1.

3.1 Small arguments and small-to-moderate orders

While truncated direct series implementations may pose numerical difficulties in certain regimes, they can nonetheless be fast and accurate when applicable. Here, we identify a numerically stable series representation that extends the small-argument interval up to $x \approx 8.5$ while controlling the accumulation of round-off errors at satisfactory levels. Consider the definition of $\mathcal{K}_{\nu}$, for non-integer orders $\nu$, given in terms of the modified Bessel function of the first kind, $\mathcal{I}_{\nu}$, as

\mathcal{K}_{\nu}(x) = \frac{\pi(\mathcal{I}_{-\nu}(x) - \mathcal{I}_{\nu}(x))}{2\sin\pi\nu}, \quad\text{with}\quad \mathcal{I}_{\nu}(x) = (\tfrac{1}{2}x)^{\nu}\sum_{k=0}^{\infty}\frac{\left(x^{2}/4\right)^{k}}{k!\,\Gamma(\nu+k+1)} .   (3.1)

The function $\mathcal{I}_{\nu}$ exhibits fast growth (asymptotically on the order of $e^{x}/\sqrt{2\pi x}$); thus for large arguments Equation 3.1 leads to significant floating-point errors. Although the expression is valid for all arguments, it has practical use only for sufficiently small $x$. The threshold for what constitutes "sufficiently small" is chosen based on accuracy and efficiency considerations, and in [32], values in the vicinity of $x = 4$ are chosen as the cutoff.

An improved series expansion, however, can be derived using basic properties of the gamma function $\Gamma(x)$. Using Euler's reflection property (which states that $\sin\pi x = \pi(\Gamma(1-x)\Gamma(x))^{-1}$) and regrouping terms, we obtain a particularly convenient form for the series expansion of $\mathcal{K}_{\nu}$ given by

\mathcal{K}_{\nu}(x) = \sum_{k=0}^{\infty}\left(\frac{x}{2}\right)^{2k}\frac{1}{2\,k!}\left(\Gamma(\nu)\left(\frac{x}{2}\right)^{-\nu}\frac{\Gamma(1-\nu)}{\Gamma(1+k-\nu)} + \Gamma(-\nu)\left(\frac{x}{2}\right)^{\nu}\frac{\Gamma(1+\nu)}{\Gamma(1+k+\nu)}\right) .   (3.2)

One advantage that the series in Equation 3.2 has over the expression in Equation 3.1 is that it is less prone to cancellation errors, since it avoids the subtraction of terms that grow rapidly with the series index or the argument $x$. The series as stated in Equation 3.2 is infinite; a computer implementation therefore requires a truncation level or stopping criterion. To promote accuracy we chose to stop the accumulation of terms when a tolerance of $\epsilon \leq 10^{-12}$ is met, which in practice corresponds to approximately $20$ or fewer terms in the sum. For consistency across all computational branches we specify in Figure 2 a truncation at $t_{1} \approx 20$, while reminding the user that in practice the stopping criterion is tolerance-based and can be adjusted. Furthermore, using the property $\Gamma(x+1) = x\Gamma(x)$, it is possible to evaluate this series (in fact its numerical truncation) using entirely algebraic operations after only evaluations of $\Gamma(\nu)$ and $\Gamma(-\nu)$. Applying AD to Equation 3.2 with the same stopping criterion as for the direct evaluation results in convergence in approximately as many terms for both first- and second-order derivatives. Although we do not derive a formal convergence proof, we will show in Section 4 that the performance of this expression is one of the best across the entire domain of definition.
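A minimal sketch of such a truncation of (3.2) with a tolerance-based stopping rule is given below. It uses only $\Gamma(\nu)$, $\Gamma(-\nu)$, and the recursion $\Gamma(x+1) = x\Gamma(x)$, which is what keeps the loop purely algebraic and friendly to forward-mode AD; it is an illustration rather than the tuned code in BesselK.jl.

  using SpecialFunctions: gamma

  # Illustrative truncation of (3.2) for non-integer ν and small-to-moderate x.
  function besselk_smallarg(ν, x; tol=1e-12, maxit=50)
      xh = x / 2
      a  = gamma(ν)  * xh^(-ν)   # carries Γ(1-ν)/Γ(1+k-ν), starting at k = 0
      b  = gamma(-ν) * xh^(ν)    # carries Γ(1+ν)/Γ(1+k+ν), starting at k = 0
      factor = 1 / 2             # (x/2)^(2k) / (2 k!)
      out = factor * (a + b)
      for k in 1:maxit
          a      /= (k - ν)
          b      /= (k + ν)
          factor *= xh^2 / k
          term    = factor * (a + b)
          out    += term
          abs(term) < tol && break
      end
      return out
  end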

For integer orders, the $\sin\pi\nu$ term in the denominator of Equation 3.1 leads to a singularity, and the expression is no longer valid in such cases. The function is defined at these orders, however, and by taking appropriate limits, $\mathcal{K}_{\nu}$ can be written at integer orders as

2(-1)^{n-1}\mathcal{K}_{n}(x) := \left(\frac{\partial\mathcal{I}_{\nu}(x)}{\partial\nu}\bigg|_{\nu=n} + \frac{\partial\mathcal{I}_{\nu}(x)}{\partial\nu}\bigg|_{\nu=-n}\right) .   (3.3)

While such integer-order expressions are valid in the limit and may be convenient to evaluate, they lead to limit branch problems that cause incorrect automatic derivatives, as described in Section 2. Therefore the integer case will be considered separately in Section 3.5. Additionally, we comment in the Appendix on implementation specializations to compute $x^{\nu}\mathcal{K}_{\nu}(x)$ directly, which in some cases can improve speed and accuracy.

3.2 Intermediate arguments

Asymptotic expansions resemble series expansions in that they are also cumulative sums requiring similar implementations, but they may diverge outside a prescribed range. Within the range of large orders $\nu$ and intermediate arguments $x \in [a_{1}, a_{2})$ we introduce the uniform asymptotic expansion (UAE) in $\nu$ for $\mathcal{K}_{\nu}$. As indicated by the word uniform, this approximation is not highly oscillatory in the argument $x$, and as such it can be applied in argument ranges where round-off errors can be problematic. The UAE is given by

\mathcal{K}_{\nu}(\nu x) \mathop{\sim}_{\nu\rightarrow\infty} \sqrt{\frac{\pi}{2\nu}}\,\frac{e^{-\nu\eta(x)}}{(1+x^{2})^{1/4}}\sum_{k=0}^{\infty}(-1)^{k}\nu^{-k}U_{k}(p(x))   (3.4)

where, as in [8, Eq. 10.41.4], we have

\eta(x) := \sqrt{1+x^{2}} + \log\left(\frac{x}{1+\sqrt{1+x^{2}}}\right), \quad p(x) := (1+x^{2})^{-1/2} ,

and the polynomials $U_{k}$ are given by $U_{0}(p) \equiv 1$ and

U_{k+1}(p) = \frac{1}{2}p^{2}(1-p^{2})U_{k}^{\prime}(p) + \frac{1}{8}\int_{0}^{p}(1-5t^{2})U_{k}(t)\,\text{d}t .

Tables of the coefficients for the $U_{k}$ polynomials exist (see [8, Sec. 10.41]), but they may be difficult to find digitally. Since they are not difficult to compute using symbolic operations, our software computes them directly (but ahead of time, to be re-used) so that the truncation order of the approximation can be arbitrarily chosen by the user.
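A sketch of this ahead-of-time computation is given below, with each $U_{k}$ stored as a vector of ascending-power coefficients; it reproduces, for example, $U_{1}(p) = (3p - 5p^{3})/24$, but it is illustrative and not the symbolic routine used in our software.

  # Illustrative ahead-of-time computation of the U_k coefficient vectors in (3.4).
  # These never depend on ν or x, so this step is run once and never differentiated.
  padd(a, b)  = [get(a, i, 0.0) + get(b, i, 0.0) for i in 1:max(length(a), length(b))]
  shift(a, m) = vcat(zeros(m), a)                 # multiply a polynomial by p^m

  function uae_polynomials(nterms)
      Us = [[1.0]]                                              # U_0(p) ≡ 1
      for k in 1:nterms
          u  = Us[end]
          du = [i * u[i + 1] for i in 1:length(u) - 1]          # coefficients of U_k'
          t1 = padd(0.5 .* shift(du, 2), -0.5 .* shift(du, 4))  # ½p²(1-p²)U_k'(p)
          g  = padd(u, -5.0 .* shift(u, 2))                     # (1 - 5t²)U_k(t)
          t2 = shift([g[i] / i for i in eachindex(g)], 1) ./ 8  # (1/8)∫₀ᵖ g(t)dt
          push!(Us, padd(t1, t2))
      end
      return Us
  end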

For reasonably large truncation orders (which in our case default to $t_{2} = 12$ or $t_{3} = 8$ depending on the magnitude of $x$), this approximation is very fast to compute. As briefly noted in [8, Sec. 10.41], due to the uniqueness property of asymptotic expansions, this expansion must agree with the more commonly encountered expansion for large arguments to be introduced in Section 3.3. From this perspective, it comes as little surprise that the approximation quality of Equation 3.4 for any fixed truncation order improves as the argument or order increases, although not necessarily at the same rate (see [8, 10.41(iv)] for some discussion and a precise error bound when viewing (3.4) as a generalized asymptotic expansion in $x$). This strategy for evaluating $\mathcal{K}_{\nu}$ is preferable in cases where small-argument methods start to lose accuracy but the large-argument asymptotic expansion is not yet of satisfactory accuracy (for example, $x = 10$ and $\nu = 3$). Moreover, since this expression reduces to polynomial computations, it is highly stable under automatic differentiation and less prone to round-off errors.

3.3 Large arguments

For large arguments it is preferable to use a reduced asymptotic expansion, equipped with precise error bounds derived in [26] (see the next section for more explicit details), given as in [8, Eq. 10.40.2] by

\mathcal{K}_{\nu}(x) \mathop{\sim}_{x\rightarrow\infty} \sqrt{\frac{\pi}{2x}}\,e^{-x}\sum_{k=0}^{\infty}x^{-k}a_{k}(\nu) ,   (3.5)

where $a_{k}(\nu)$ is generated via $a_{0}(\nu) \equiv 1$ and

a_{k}(\nu) := \frac{(4\nu^{2}-1^{2})(4\nu^{2}-3^{2})\cdots(4\nu^{2}-(2k-1)^{2})}{8^{k}\,\Gamma(k+1)} .

For arguments $x > 30$ and moderate truncation orders such as $t_{4} = 5$ (the default in our implementation), this approximation is accurate to computer precision (see [8, 10.40(iii)] for details on the error term, which without the exponential improvement introduced in the next section is challenging to work with). The automatically differentiated implementation is extremely fast since it involves differentiation of algebraic expressions, thus free of numerical artifacts, and is applied to a very low number of terms, up to $t_{4} = 5$, which leads to a very small operation count.
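Because (3.5) reduces to a short algebraic loop, a sketch of its truncated evaluation is easy to state; the version below builds $a_{k}(\nu)$ with a running product and is illustrative rather than the exponentially-improved form discussed next.

  # Illustrative truncation of the large-argument expansion (3.5); the default
  # order matches the recommended t_4 = 5. A plain algebraic loop like this is
  # handled by forward-mode AD without special care.
  function besselk_largearg(ν, x; order=5)
      fournu2 = 4 * ν^2
      ak  = one(ν)               # a_0(ν) = 1
      out = ak
      for k in 1:order
          ak  *= (fournu2 - (2k - 1)^2) / (8k)
          out += ak / x^k
      end
      return sqrt(π / (2x)) * exp(-x) * out
  end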

3.4 Half-integer orders

An interesting property of $\mathcal{K}_{\nu}$ expressed via the asymptotic expansion in Equation 3.5 is that, when $\nu + 1/2 \in \mathbb{Z}$, terms with $k > \nu$ drop out, and thus (3.5) is exact for all $x$. These cases reduce to simpler forms that can be manually included in a program (for example, returning the simplified $\mathcal{K}_{1/2}(x) = \sqrt{\pi/(2x)}\,e^{-x}$ for $\nu = 1/2$), but this type of program implementation would cause limit branch problems with respect to $\nu$ (as discussed in Section 2), and as such requires additional scrutiny in our setting to ensure accurate automatic derivatives with respect to order.

As noted in Section 2, a converged direct evaluation may not yield a differentiated series that has converged to the same degree. This can be remedied using what [8, Sec. 10.40(iv)] refers to as exponentially-improved asymptotic expansions. The asymptotic expansion introduced in Equation 3.5 can be re-written as

\mathcal{K}_{\nu}(x) = \sqrt{\frac{\pi}{2x}}\,e^{-x}\left(\sum_{k=0}^{l-1}x^{-k}a_{k}(\nu) + R_{l}(\nu,x)\right) ,   (3.6)

where the remainder term $R_{l}(\nu, x)$, itself reasonably complicated to bound neatly (see [8, 10.40(iii)]), can then be expanded as

R_{l}(\nu,x) = (-1)^{l}\,2\cos(\nu\pi)\left(\sum_{k=0}^{m-1}x^{-k}a_{k}(\nu)G_{l-k}(2x) + R_{m,l}(\nu,x)\right) .

Here $G_{p}(x) = \frac{e^{x}}{2\pi}\Gamma(p)\Gamma(1-p, x)$, where $\Gamma(s, x)$ is the upper incomplete gamma function [8, Eq. 10.17.16], and $R_{m,l}$ is a new and even more finely controlled remainder term that admits a much simpler controlling bound of $\mathcal{O}(e^{-2x}x^{-m})$ [26, 8]. While direct evaluations of $\mathcal{K}_{\nu}$ in the case where $\nu + \tfrac{1}{2} \in \mathbb{Z}$ are exact, the remainder term is required to preserve derivative information, as the above additional expansion makes clear that $\partial_{\nu}^{k}R_{l}(\nu, x)$ is not zero for any $k$. As such, additional terms from the $R_{l}$ approximation are necessary to preserve accurate derivative information.

3.5 Integer orders and small arguments with large orders

A common strategy for computer implementations of $\mathcal{K}_{\nu}$ is to develop an implementation for $\nu$ in a neighborhood of 0 and use a standard recursion with respect to order, valid for $\mathcal{K}_{\nu}$ at any $\nu \in \mathbb{R}$, given by

\mathcal{K}_{\nu+1}(x) = \frac{2\nu}{x}\mathcal{K}_{\nu}(x) + \mathcal{K}_{\nu-1}(x) .   (3.7)

For integer orders, several options are available to compute the initial recursion terms $\mathcal{K}_{0}$ and $\mathcal{K}_{1}$, for example direct polynomial or rational function approximations for $\mathcal{K}_{0}$ and $\mathcal{K}_{1}$ (as provided by [1]). However, this strategy will not be helpful in providing $\partial_{\nu}^{k}\mathcal{K}_{\nu}$ for $\nu \in \mathbb{Z}$ due to the limit branch problem at $\nu = 0$ and $\nu = 1$.
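A sketch of the upward recursion itself is given below; the starting values $\mathcal{K}_{\nu}$ and $\mathcal{K}_{\nu+1}$ for $\nu \in [-1/2, 1/2]$ are taken as inputs here and would be supplied by the Temme-type series described next.

  # Sketch of the upward recursion (3.7): given kν ≈ K_ν(x) and kνp1 ≈ K_{ν+1}(x),
  # step the order up n times to obtain K_{ν+n}(x).
  function besselk_up(kν, kνp1, ν, x, n)
      for j in 0:(n - 1)
          kν, kνp1 = kνp1, (2 * (ν + j + 1) / x) * kνp1 + kν
      end
      return kν   # ≈ K_{ν+n}(x)
  end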

A suitable approach, provided in [32], is to consider series expansions of the recursion terms $\mathcal{K}_{\nu}$ and $\mathcal{K}_{\nu+1}$, which due to (3.7) (and the property that $\mathcal{K}_{\nu}(x) = \mathcal{K}_{-\nu}(x)$) need only be considered on the interval $\nu \in [-1/2, 1/2]$. Reproducing the notation of [32], we consider the expressions

\mathcal{K}_{\nu}(x) = \sum_{j=0}^{\infty}c_{j}f_{j}(\nu,x) \quad\text{and}   (3.8)
\mathcal{K}_{\nu+1}(x) = \frac{2}{x}\sum_{j=0}^{\infty}c_{j}\left(p_{j}(\nu,x) - j f_{j}(\nu,x)\right) ,   (3.9)

where for $\mu(x) = \nu\log(2/x)$ we have

f_{0}(\nu,x) = \frac{\nu\pi}{\sin\pi\nu}\left[\Gamma_{1}(\nu)\cosh\mu(x) + \Gamma_{2}(\nu)\log(2/x)\sinh(\mu(x))/\mu(x)\right],   (3.10)
\Gamma_{1}(\nu) = \left(\Gamma(1-\nu)^{-1} - \Gamma(1+\nu)^{-1}\right)/(2\nu), \quad \Gamma_{2}(\nu) = \left(\Gamma(1-\nu)^{-1} + \Gamma(1+\nu)^{-1}\right)/2,

and all subsequent $f_{j}$ terms can be computed via a direct recursion using

p_{0}(\nu,x) = \frac{1}{2}\left(\frac{x}{2}\right)^{-\nu}\Gamma(1+\nu), \quad p_{j} = p_{j-1}/(j-\nu) ,
q_{0}(\nu,x) = \frac{1}{2}\left(\frac{x}{2}\right)^{\nu}\Gamma(1-\nu), \quad q_{j} = q_{j-1}/(j+\nu) , \quad\text{and}
f_{j} = (j f_{j-1} + p_{j-1} + q_{j-1})/(j^{2}-\nu^{2}) .

With these expressions, the problem of obtaining an automatically differentiable program for $\mathcal{K}_{n}(x)$ for all $n$ is reduced to obtaining an AD-compatible implementation of $f_{0}(\nu, x)$ for $\nu$ in a vicinity of the origin. As indicated in Section 2, the limit branch problem can be resolved using Taylor expansions in Equation 3.10 about $\nu = 0$:

\Gamma_{1}(\nu) \approx 1 + \frac{\gamma^{2} - \pi^{2}/6}{2}\nu^{2} + \frac{60\gamma^{4} - 60\gamma^{2}\pi^{2} + \pi^{4} - 240\gamma\psi^{(2)}(1)}{1440}\nu^{4} ,
\Gamma_{2}(\nu) \approx \gamma + \frac{2\gamma^{3} - \gamma\pi^{2} - 2\psi^{(2)}(1)}{12}\nu^{2} + \frac{12\gamma^{5} - 20\gamma^{3}\pi^{2} + \gamma\pi^{4} - 120\gamma^{2}\psi^{(2)}(1) + 20\pi^{2}\psi^{(2)}(1) - 12\psi^{(4)}(1)}{1440}\nu^{4} ,
\frac{\sinh\mu}{\mu} \approx 1 + \mu^{2}/6 + \mu^{4}/120, \quad\text{and}\quad \frac{\pi\nu}{\sin\pi\nu} \approx 1 + (\pi\nu)^{2}/6 + 7(\pi\nu)^{4}/360 .

With this strategy, we obtain an expression for $f_{0}$ which is exact at $\nu = 0$ and very accurate for $\nu \approx 0$. By precomputing the coefficients of the Taylor expansions, this entire apparatus yields function evaluations that are only approximately twice the cost of the direct series introduced above, and whose AD performance is quite satisfactory with respect to both speed and accuracy. For specific details on this recursion, such as accurate and efficient evaluations of $\Gamma_{1}$ and $\Gamma_{2}$ using Chebyshev expansions, we refer to [32]. As with Section 3.1, we additionally provide in the appendix a discussion of extending this method to the direct computation of $x^{\nu}\mathcal{K}_{\nu}(x)$ for slight additional gains in accuracy and speed.

4 Speed and Accuracy Diagnostics

This section will focus on the verification and efficiency of our implementation of $\mathcal{K}_{\nu}$ and its derivatives, which we will denote either $\partial_{\nu}^{1}\mathcal{K}_{\nu}$ or $\partial_{\nu}^{2}\mathcal{K}_{\nu}$. We compare our implementation, denoted $\tilde{\mathcal{K}_{\nu}}$, with the one provided by the AMOS library [2], $\mathcal{K}_{\nu}^{A}$, using a reference solution $\mathcal{K}_{\nu}^{R}$, considered to be the ground truth and computed using the arbitrary-precision Arb library [18]. Derivatives of up to second order computed in two ways will be compared, namely AD-generated derivatives of our implementation, $\partial_{\nu}^{k\,\text{AD}}\tilde{\mathcal{K}_{\nu}}$, and non-adaptive finite difference derivatives of the AMOS library, $\partial_{\nu}^{k\,\text{FD}}\mathcal{K}_{\nu}$. These two values will be compared against a reference derivative, denoted $\partial_{\nu}^{k\,\text{R}}\mathcal{K}_{\nu}$, computed by applying tenth-order adaptive finite difference methods to $\mathcal{K}_{\nu}^{A}$ using FiniteDifferences.jl [35].

To offer access to a wider audience, we have packaged the code in the Julia language as the library BesselK.jl (https://github.com/cgeoga/BesselK.jl). The source repository additionally contains example R language bindings as well as the scripts used to generate the results and figures in this work.

4.1 Pointwise accuracy

While BesselK.jl is not designed to be a competitor of existing special functions libraries like AMOS with respect to the accuracy of direct evaluations of $\mathcal{K}_{\nu}$, we show that our implementation exhibits competitive absolute accuracy on the entire testing domain, both for direct evaluations as well as derivatives. Our chosen testing region is (a dense grid on) $(\nu, x) \in [0.25, 10] \times [0.005, 30]$, which we pick for several reasons. For one, the $\nu$ range generously over-covers what we consider to be the range of sensible values of $\nu$ for fitting Matérn covariance parameters. With regard to the $x$ domain, we pick $30$ as an upper end point to verify accuracy in tail values that are small but numerically relevant. Moreover, beyond that point any software implementation for $\mathcal{K}_{\nu}$ will be using an asymptotic expansion, and so naturally there will be good agreement. Secondly, we illustrate in log10-scale the relative gain of BesselK.jl over finite difference methods with AMOS for derivative computations.

Direct evaluations

Figure 3 assesses direct evaluation error for both AMOS and BesselK.jl using the absolute differences $|\mathcal{K}_{\nu}^{A} - \mathcal{K}_{\nu}^{R}|$ (left) and $|\tilde{\mathcal{K}_{\nu}} - \mathcal{K}_{\nu}^{R}|$ (right). The error behavior is qualitatively similar for both implementations, with the intermediate argument range accurate to computer precision, and a small ridge for very small arguments $x$ and large orders $\nu$ showing increasing inaccuracy. The implementation in BesselK.jl was tailored to optimize AD computations for different argument ranges, and these ranges are clearly delimited in Figure 3. The loss of accuracy encountered for small to intermediate arguments, approximately $10^{-12}$ error, is attributed to the difference in numerical treatment and the choice of truncation levels tuned for accuracy in derivatives. Such a small accuracy loss in trailing digits will be shown to have a negligible impact on properties of Matérn covariance matrices relevant to numerical optimization.

[Figure 3 images omitted: two heatmaps of absolute error over the $(\nu, x)$ testing domain, with a logarithmic colorbar from $10^{-20}$ to $10^{-2}$.]
Figure 3: Absolute accuracy of direct evaluations with respect to a reference solution: AMOS library $|\mathcal{K}_{\nu}^{A} - \mathcal{K}_{\nu}^{R}|$ (left) and BesselK.jl $|\tilde{\mathcal{K}_{\nu}} - \mathcal{K}_{\nu}^{R}|$ (right).

Derivatives

For first derivative computations, illustrated in Figure 4, we see that BesselK.jl (right) is noticeably superior to finite difference estimates with AMOS (left) in the small argument range. The latter method incurs rather large errors well within the intermediate argument range $x > 8.5$, while BesselK.jl consistently outperforms it as the argument increases. As higher-order derivatives are taken, as illustrated in Figure 5 for second derivatives, this behavior becomes even more pronounced and automatic differentiation clearly overtakes finite difference computations by a significant margin.

[Figure 4 images omitted: two heatmaps of absolute error over the $(\nu, x)$ testing domain, with a logarithmic colorbar from $10^{-20}$ to $10^{5}$.]
Figure 4: Absolute accuracy of first derivatives with respect to a reference solution: AMOS using finite differences $|\partial_{\nu}^{1\text{FD}}\mathcal{K}_{\nu} - \partial_{\nu}^{1\text{R}}\mathcal{K}_{\nu}|$ (left) and BesselK.jl using AD $|\partial_{\nu}^{1\text{AD}}\tilde{\mathcal{K}_{\nu}} - \partial_{\nu}^{1\text{R}}\mathcal{K}_{\nu}|$ (right).
[Figure 5 images omitted: two heatmaps of absolute error over the $(\nu, x)$ testing domain, with a logarithmic colorbar from $10^{-20}$ to $10^{10}$.]
Figure 5: Absolute accuracy of second derivatives with respect to a reference solution: AMOS using finite differences $|\partial_{\nu}^{2\text{FD}}\mathcal{K}_{\nu} - \partial_{\nu}^{2\text{R}}\mathcal{K}_{\nu}|$ (left) and BesselK.jl using AD $|\partial_{\nu}^{2\text{AD}}\tilde{\mathcal{K}_{\nu}} - \partial_{\nu}^{2\text{R}}\mathcal{K}_{\nu}|$ (right).

To provide a relative comparison of derivative computations using finite differences with AMOS versus BesselK.jl, we introduce the quantity

\log_{10}\left|\partial_{\nu}^{k\text{R}}\mathcal{K}_{\nu}(x) - \partial_{\nu}^{k\text{AD}}\tilde{\mathcal{K}_{\nu}}(x)\right| - \log_{10}\left|\partial_{\nu}^{k\text{R}}\mathcal{K}_{\nu}(x) - \partial_{\nu}^{k\text{FD}}\mathcal{K}_{\nu}(x)\right|   (4.1)

for both first derivatives ($k = 1$) and second derivatives ($k = 2$). By assessing the error in log10-scale, we can identify correct digits, and by taking the difference between the errors of the two methods, we outline the regions where AD outperforms finite differences. In Figure 6, negative numbers (blue hues) indicate regions where $\partial_{\nu}^{\text{AD}}\tilde{\mathcal{K}_{\nu}}$ is more accurate than $\partial_{\nu}^{\text{FD}}\mathcal{K}_{\nu}$ (with a value of $-1$ indicating one additional digit of accuracy for the AD method), and we note there are almost no positive values (red hues) over the entire domain. Upon close scrutiny we observe that the branch at $x \approx 8.5$ may incur losses of one to two digits in a few discrete locations (visible as traces of red), which could be remedied by branching earlier than recommended at $a_{1} < 8.5$. This is a fine-tuning aspect which vanishes for second derivatives and was not explored extensively since it had no noticeable impact on our results. Finite differences, notoriously inaccurate for second-order derivatives since they involve a division by $h^{2}$, incur increasingly higher round-off errors. In this context AD is expected to exhibit superior accuracy, and in Figure 6 its advantage over finite differences is clearly visible, consistently yielding $5$ or more extra digits of accuracy.
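The quantity in (4.1) is simply a difference of base-10 logarithms of absolute errors, so a small helper suffices to reproduce the comparison for any triple of reference, AD, and finite-difference derivative values:

  # Metric (4.1): negative values mean the AD derivative carries more correct
  # digits than the finite-difference one.
  digit_gain(ref, ad, fd) = log10(abs(ref - ad)) - log10(abs(ref - fd))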

[Figure 6 images omitted: two heatmaps of the quantity (4.1) over the $(\nu, x)$ testing domain, with a colorbar ranging from $-10$ to $10$.]
Figure 6: A log10-scale comparison between $\partial_{\nu}^{k\text{AD}}\tilde{\mathcal{K}_{\nu}}$ and $\partial_{\nu}^{k\text{FD}}\mathcal{K}_{\nu}$ using Equation 4.1: first derivatives (left) and second derivatives (right).

4.2 Efficiency diagnostics

In order to compare the efficiency of $\partial_{\nu}^{k\text{AD}}\tilde{\mathcal{K}_{\nu}}$ with $\partial_{\nu}^{k\text{FD}}\mathcal{K}_{\nu}$, we select pairs $(\nu, x)$ that cover every code branch (outlined in Figure 2) in the implementation of $\mathcal{K}_{\nu}$ and provide evaluation and derivative timings, performed on an Intel Core i5-11600K CPU, in Table 1. As can be seen, the speedup of $\tilde{\mathcal{K}_{\nu}}$ is substantial, including in regions of the domain in which its accuracy rivals that of $\mathcal{K}_{\nu}^{A}$, as well as for all derivatives. The accuracy and efficiency considerations discussed here are summarized in Table 2 to serve as a convenient reference for the reader.

$(\nu, x)$   $\tilde{\mathcal{K}_{\nu}}$   $\mathcal{K}_{\nu}^{A}$   $\partial_{\nu}^{\text{AD}}\tilde{\mathcal{K}_{\nu}}$   $\partial_{\nu}^{\text{FD}}\mathcal{K}_{\nu}$   $\partial_{\nu}^{2\,\text{AD}}\tilde{\mathcal{K}_{\nu}}$   $\partial_{\nu}^{2\,\text{FD}}\mathcal{K}_{\nu}$   case
(0.500, 1) 9 51 67 455 82 510 half-integer (iv)
(1.000, 1) 136 159 178 447 328 605 whole integer (i)
(3.001, 1) 139 285 181 570 337 847 near-integer order (i)
(3.001, 8) 227 300 284 597 557 892 near-integer order, borderline arg (i)
(1.850, 1) 85 229 139 460 307 686 small order (ii)
(1.850, 8) 103 241 293 483 531 722 small order, borderline arg (ii)
(1.850, 14) 167 209 281 416 568 624 intermediate arg (iii)
(1.850, 29) 94 191 149 380 293 565 large intermediate arg (iii)
(1.850, 35) 92 183 148 368 293 548 large argument (iii)
Table 1: Timings for evaluating $\mathcal{K}_{\nu}^{A}$ and $\tilde{\mathcal{K}_{\nu}}$ and their derivative methods (in units of ns) at various pairs $(\nu, x)$. The case column additionally provides references to the branch labels used in Figure 2.
Comparison | Higher accuracy | Higher efficiency
$\mathcal{K}_{\nu}^{A}$ vs. $\tilde{\mathcal{K}_{\nu}}$ | $\mathcal{K}_{\nu}^{A}$ ($\approx$ 2-5 digits) | $\tilde{\mathcal{K}_{\nu}}$ ($\approx$ factor of 1-3)
$\partial_{\nu}^{\text{FD}}\mathcal{K}_{\nu}$ vs. $\partial_{\nu}^{\text{AD}}\tilde{\mathcal{K}_{\nu}}$ | $\partial_{\nu}^{\text{AD}}\tilde{\mathcal{K}_{\nu}}$ ($\approx$ 2-5 digits) | $\partial_{\nu}^{\text{AD}}\tilde{\mathcal{K}_{\nu}}$ ($\approx$ factor of 2-3)
$\partial_{\nu}^{2\text{FD}}\mathcal{K}_{\nu}$ vs. $\partial_{\nu}^{2\text{AD}}\tilde{\mathcal{K}_{\nu}}$ | $\partial_{\nu}^{2\text{AD}}\tilde{\mathcal{K}_{\nu}}$ ($\approx 5+$ digits) | $\partial_{\nu}^{2\text{AD}}\tilde{\mathcal{K}_{\nu}}$ ($\approx$ factor of 2-5)
Table 2: A summary of the accuracy and speed diagnostics performed in this section. The accuracy comparison is given in correct digits, while efficiency is given as speedup factor.

4.3 Selected diagnostics for Matérn covariance matrices

The differences in accuracy and efficiency between the AMOS library and BesselK.jl are most relevant in the context of Matérn covariance matrices. To this end we compare in Table 3 several key quantities for the Matérn covariance matrices $\bm{\Sigma}$ and $\tilde{\bm{\Sigma}}$, generated using $\mathcal{K}_{\nu}^{A}$ and $\tilde{\mathcal{K}_{\nu}}$ respectively. Points are chosen on a $24 \times 24$ grid on the domain $[0, 1]^{2}$, and pairs $(\rho, \nu)$ are chosen to cover various argument ranges and branches in $\tilde{\mathcal{K}_{\nu}}$ with respect to $\nu$, with both parameters having significant effects on the conditioning of the covariance matrices.

$(\rho, \nu)$   $\tilde{T}_{\text{assembly}}$   $T_{\text{assembly}}$   $\tilde{\lambda}_{\text{min}}$   $\lambda_{\text{min}}$   $|\lambda_{\text{dif}}|$   $\log|\tilde{\bm{\Sigma}}|$   $\log|\bm{\Sigma}|$   $\big|\log|\tilde{\bm{\Sigma}}| - \log|\bm{\Sigma}|\big|$
(0.010, 0.40) 4.3e-02 4.3e-02 9.52e-01 9.52e-01 1.22e-12 -2.60e-01 -2.60e-01 3.11e-13
(0.010, 1.25) 4.4e-02 4.3e-02 9.79e-01 9.79e-01 3.47e-12 -3.45e-02 -3.45e-02 9.23e-12
(0.010, 3.50) 3.5e-02 3.5e-02 9.93e-01 9.93e-01 0.00e+00 -3.14e-03 -3.14e-03 0.00e+00
(1.000, 0.40) 2.8e-02 4.8e-02 3.78e-02 3.78e-02 9.76e-15 -1.40e+03 -1.40e+03 1.84e-11
(1.000, 1.25) 2.6e-02 5.4e-02 1.03e-04 1.03e-04 6.56e-14 -4.04e+03 -4.04e+03 1.19e-09
(1.000, 3.50) 2.4e-02 3.5e-02 7.18e-11 7.18e-11 1.02e-14 -1.02e+04 -1.02e+04 2.76e-03
(100.000, 0.40) 2.7e-02 4.5e-02 9.50e-04 9.50e-04 1.15e-14 -3.51e+03 -3.51e+03 6.96e-10
(100.000, 1.25) 2.5e-02 4.6e-02 1.03e-09 1.03e-09 2.03e-15 -1.06e+04 -1.06e+04 3.87e-05
(100.000, 3.50) 1.7e-02 3.8e-02 -3.43e-13 -2.72e-13 7.15e-14 NaN NaN NaN
Table 3: Summary properties of 𝚺\bm{\Sigma} and 𝚺~\tilde{\bm{\Sigma}} for various pairs (ρ,ν)(\rho,\nu). Quantities with tildes, such as λ~\tilde{\lambda}, are derived from 𝚺~\tilde{\bm{\Sigma}}, which has been built using 𝒦ν~\tilde{\mathcal{K}_{\nu}}. Quantities with no tildes were derived from 𝚺\bm{\Sigma}, which was built with 𝒦νA\mathcal{K}_{\nu}^{A}. In the last row, values of NaN are a result of the Cholesky factorization failing, which occurred for both implementations.

As expected from the previous benchmarks, the matrix assembly time for BesselK.jl, T~assembly\tilde{T}_{\text{assembly}}, is superior to that of AMOS in regions where the predominant evaluation strategy for 𝒦ν\mathcal{K}_{\nu} differs between the two libraries (thus excluding, for example, cases like ρ=0.01\rho=0.01 in which both libraries use asymptotic expansions) and approximately equal otherwise. As a preliminary investigation of matrix similarity, we compare the smallest eigenvalues of each matrix, λ~min\tilde{\lambda}_{\text{min}} and λmin\lambda_{\text{min}}, and note that they agree to nearly machine precision in all cases. We also compare log-determinants, as they are directly used in maximum likelihood estimation; in most cases the log-determinants agree to high precision, although two situations stand out as slightly less accurate than the others. These two cases combine sufficiently large ρ\rho and ν\nu parameters, yielding especially ill-conditioned matrices that pose numerical challenges even when 𝒦ν\mathcal{K}_{\nu} is computed to double precision accuracy. With this in mind, we consider the numerical agreement in these two cases to be satisfactory.
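To illustrate how the quantities in Table 3 can be computed, the sketch below assembles a Matérn covariance matrix on a 24×2424\times 24 grid and evaluates its smallest eigenvalue and log-determinant. The kernel is written out explicitly with one common Matérn parametrization (an assumption, not necessarily the parametrization used elsewhere in this paper), using SpecialFunctions.besselk purely so that the example is self-contained; in practice one would substitute whichever 𝒦ν\mathcal{K}_{\nu} implementation is being tested.

using LinearAlgebra, SpecialFunctions

# One common Matérn parametrization, written out explicitly for illustration.
function matern_cov(x, y, sigma2, rho, nu)
    d = norm(x .- y)
    d == 0 && return sigma2
    arg = sqrt(2 * nu) * d / rho
    return sigma2 * (2^(1 - nu) / gamma(nu)) * arg^nu * besselk(nu, arg)
end

pts  = vec([(i / 24, j / 24) for i in 1:24, j in 1:24])     # 24 x 24 grid
S    = Symmetric([matern_cov(x, y, 1.0, 1.0, 1.25) for x in pts, y in pts])
lmin = eigmin(S)              # smallest eigenvalue (the lambda_min column)
ld   = logdet(cholesky(S))    # the log-determinant column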

5 A Demonstration with Simulated Data

In this section, we demonstrate that in some cases even the slightest numerical inaccuracy in finite difference derivatives can accumulate so catastrophically as to yield completely incorrect Hessian matrices of the Gaussian log-likelihood, which can in turn lead to failure in second-order maximum likelihood estimation.

To demonstrate this, we select 512512 random locations on the unit square [0,1]2[0,1]^{2} and simulate ten independent realizations from a Matérn Gaussian process with parameters (σ,ρ,ν)=(1.5,2.5,1.3)(\sigma,\rho,\nu)=(1.5,2.5,1.3). Subsequently, we perform maximum likelihood estimation with the Ipopt library [33] using manually computed gradients and Hessians via the formulae provided in the introduction. In all cases, for derivatives involving only σ\sigma and ρ\rho, we use analytical derivatives of the Matérn covariance function. To study the effect of using finite difference derivatives versus our proposed AD derivatives, we consider six different optimization problems. In the first setting, we compute the first two derivatives of ν(𝒙,𝒙)\mathcal{M}_{\nu}(\bm{x},\bm{x}^{\prime}) with respect to ν\nu using finite difference approximations and use them to build Hessian matrices. In the second setting, we compute gradient and Hessian information using AD-generated derivatives and again optimize with true Hessians. In the next two settings, we again use finite differences and AD for derivatives that pertain to ν\nu, but instead of Hessians we use expected Fisher matrices as proxies. Recalling that 𝚺j(𝜽)=θj𝚺(𝜽)\bm{\Sigma}_{j}(\bm{\theta})=\tfrac{\partial}{\partial\theta_{j}}\bm{\Sigma}(\bm{\theta}), the expected Fisher information, given by 𝐈j,k:=12tr(𝚺(𝜽)1𝚺j(𝜽)𝚺(𝜽)1𝚺k(𝜽)),\mathbf{I}_{j,k}:=\frac{1}{2}\text{tr}\left(\bm{\Sigma}(\bm{\theta})^{-1}\bm{\Sigma}_{j}(\bm{\theta})\bm{\Sigma}(\bm{\theta})^{-1}\bm{\Sigma}_{k}(\bm{\theta})\right), is, under appropriate regularity conditions, the asymptotic precision of maximum likelihood estimators. Unfortunately, these conditions are often not met for spatial processes under fixed-domain asymptotics [30]; nonetheless, it is a valuable proxy for Hessian information. Moreover, it requires only first derivatives of the covariance function, providing a natural way to separately investigate the effect of first- and second-order finite difference derivatives. The final two settings are in a similar spirit and again use only first derivative information, but now with the more general-purpose BFGS approximation for Hessian information.
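To make the role of the derivative matrices concrete, the minimal sketch below assembles the expected Fisher information from the covariance matrix and a vector of its parameter derivatives; whether those derivative matrices are filled in with analytical, AD-generated, or finite difference kernel derivatives is precisely the choice being compared across the settings above.

using LinearAlgebra

# Expected Fisher information I[j, k] = tr(S⁻¹ dS[j] S⁻¹ dS[k]) / 2, given the
# covariance matrix S and a vector dS of its derivatives in each parameter.
function expected_fisher(S::Symmetric, dS::Vector{<:AbstractMatrix})
    C = cholesky(S)
    W = [C \ D for D in dS]          # W[j] = S⁻¹ dS[j]
    p = length(dS)
    return [tr(W[j] * W[k]) / 2 for j in 1:p, k in 1:p]
end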

While the parameters for the simulated data have been chosen purposefully, they are not implausible for much environmental data. The range parameter ρ=2.5\rho=2.5 is large compared to the domain radius, which, even with 10 replicates, makes estimating the range difficult [36]. This level of dependence could occur in, for example, daily values of meteorological quantities such as temperature or pressure over a region of diameter on the order of 100 km. A smoothness of ν=1.3\nu=1.3 is likewise a very plausible value for a quantity like atmospheric pressure, which should be smooth but not too smooth. The primary challenge with these parameter values is that they require derivatives of 𝒦ν\mathcal{K}_{\nu} very near the origin, which we demonstrated in the previous section to be particularly problematic. With this said, computing the upper 2×22\times 2 block of the Hessian with analytical derivatives in both cases limits the scope of impact of the finite difference approximations as much as possible. Applying finite differences to ν\mathcal{M}_{\nu}, which is well-behaved when its argument is near 0, rather than to 𝒦ν\mathcal{K}_{\nu} directly, should also be favorable to the finite difference approach.

method convergence #iterations terminal log-likelihood MLE
FD (BFGS) No 100100 19192.36-19192.36 (1.574,2.642,1.060)(1.574,2.642,1.060)^{*}
FD (Fisher) No 100100 21548.71-21548.71 (1.576,2.642,1.293)(1.576,2.642,1.293)^{*}
FD (Hessian) No 100100 21473.38-21473.38 (47.428,24.628,1.380)(47.428,24.628,1.380)^{*}
AD (BFGS) No 100100 21548.71-21548.71 (1.576,2.642,1.293)(1.576,2.642,1.293)^{*}
AD (Fisher) Yes 5858 21548.71-21548.71 (1.576,2.642,1.293)(1.576,2.642,1.293)
AD (Hessian) Yes 2525 21548.71-21548.71 (1.576,2.642,1.293)(1.576,2.642,1.293)
Table 4: A comparison of optimization results using second-order optimization with finite difference derivatives of the Matérn covariance versus automatic differentiation of the Matérn covariance with BesselK.jl. In the MLE column, asterisks indicate terminal return parameters in the case of failed optimization by reaching the maximum allowed number of iterations, here chosen to be 100100.

Table 4 summarizes the results of the estimation procedures. In the case of true second-order optimization with finite difference derivatives used for Hessians, the optimization completely fails. Optimization using the expected Fisher matrix built with finite difference derivatives does successfully reach the MLE parameters to high accuracy, but, due to slight inaccuracies in the derivatives, fails to satisfy the termination criteria used by Ipopt. Using AD derivatives to assemble the expected Fisher matrix solves this problem, and the optimization successfully terminates after 5858 iterations. Finally, optimization with Hessians built from AD-generated derivatives is fast, accurate, and terminates successfully in just 2525 iterations. These results are consistent across various starting values and termination tolerances.

To better understand why the second-order optimization with FD-based Hessians fails, we inspect these Hessians at several parameter values. Starting at the initializer, we look at Hessian matrices computed using the two different sources of input derivatives. As a point of reference, we also provide a very high-order adaptive finite difference Hessian computed from the log-likelihood function itself. As above, the parameter order here is (σ,ρ,ν)(\sigma,\rho,\nu), so that the third column and row are the entries that pertain to derivatives of ν\mathcal{M}_{\nu} with respect to ν\nu. At the generic initializer with all parameters equal to 11, we inspect the third column, using the shorthand 𝟏3\bm{1}_{3} for (σ,ρ,ν)=(1,1,1)(\sigma,\rho,\nu)=(1,1,1):

[Href𝟏3],3=[1104.053712.605173.03][HAD𝟏3],3=[1104.053712.605173.03][HFD𝟏3],3=[1100.473716.864533.61].\displaystyle\left[H^{\text{ref}}\ell_{\bm{1}_{3}}\right]_{\cdot,3}=\begin{bmatrix}[r]-1104.05\\ 3712.60\\ 5173.03\end{bmatrix}\hskip 7.22743pt\left[H^{\text{AD}}\ell_{\bm{1}_{3}}\right]_{\cdot,3}=\begin{bmatrix}[r]-1104.05\\ 3712.60\\ 5173.03\end{bmatrix}\hskip 7.22743pt\left[H^{\text{FD}}\ell_{\bm{1}_{3}}\right]_{\cdot,3}=\begin{bmatrix}[r]-1100.47\\ 3716.86\\ 4533.61\end{bmatrix}\ .

What immediately stands out in this comparison is that the component pertaining to the second derivative ν2𝒦ν\partial_{\nu}^{2}\mathcal{K}_{\nu} in the finite difference-built Hessian is grossly inaccurate. For inaccuracies at this level, one has reason to be concerned that very important quantities for optimization, for example the unconstrained Newton direction (H𝜽)1𝜽-(H\ell_{\bm{\theta}})^{-1}\nabla\ell_{\bm{\theta}}, will be materially affected in a way that inhibits the progress of an optimizer.

Looking now at the third columns of these three matrices at the MLE, we see this problematic inaccuracy progressing further, using 𝜽^\hat{\bm{\theta}} as shorthand for the MLE:

[Href𝜽^],3=[23920.0018431.04145597.78][HAD𝜽^],3=[23920.0118431.04145597.79][HFD𝜽^],3=[23919.5418430.66135175.16].\displaystyle\left[H^{\text{ref}}\ell_{\hat{\bm{\theta}}}\right]_{\cdot,3}=\begin{bmatrix}[r]-23920.00\\ 18431.04\\ 145597.78\end{bmatrix}\hskip 7.22743pt\left[H^{\text{AD}}\ell_{\hat{\bm{\theta}}}\right]_{\cdot,3}=\begin{bmatrix}[r]-23920.01\\ 18431.04\\ 145597.79\end{bmatrix}\hskip 7.22743pt\left[H^{\text{FD}}\ell_{\hat{\bm{\theta}}}\right]_{\cdot,3}=\begin{bmatrix}[r]-23919.54\\ 18430.66\\ 135175.16\end{bmatrix}\ .

We observe once again that the AD-generated Hessians are very accurate. The FD-based Hessian is reasonably accurate save for the component pertaining to ν2𝒦ν\partial_{\nu}^{2}\mathcal{K}_{\nu}. But as before, this quantity is sufficiently inaccurate that it will necessarily change important optimization quantities. Moreover, the matrix HFD𝜽^H^{\text{FD}}\ell_{\hat{\bm{\theta}}} is not even close to being positive semi-definite (with a minimum eigenvalue of 192.1-192.1, compared to HAD𝜽^H^{\text{AD}}\ell_{\hat{\bm{\theta}}}’s minimum eigenvalue of 4.0364.036). Observing FD-generated Hessian matrices as the range parameter continues to increase, one sees a continual worsening of this specific component, which is arguably a particularly serious issue because it introduces structured inaccuracy to the optimization problem. This inaccuracy’s effect on derived quantities such as search directions is a very likely explanation for the complete failure of the optimization using Ipopt, which, unlike simpler optimization software, will pursue candidate directions in search of a feasible minimizer even if they do not immediately decrease the objective function value.
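The two diagnostics referenced above are simple to compute directly; the lines below sketch the minimum-eigenvalue check and the unconstrained Newton direction for a given Hessian H and gradient g of the objective.

using LinearAlgebra

min_eigenvalue(H)      = eigmin(Symmetric(H))   # negative value: H is not positive semi-definite
newton_direction(H, g) = -(Symmetric(H) \ g)    # unconstrained Newton step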

The conclusion from these results is clear: second-order optimization using finite difference methods for second derivatives of ν\mathcal{M}_{\nu}, and by extension 𝒦ν\mathcal{K}_{\nu}, presents a significant risk. As a final observation, we reiterate that the demonstration chosen here is the best possible case for finite differences in that only the third row and column of these Hessians involve any approximation, and only one element of the Hessian matrix was incorrect to a degree of obvious concern. Even with the numerical issues of finite differences restricted to this small setting, the optimization results were a complete failure. For even slightly more complicated parametric models, such as adding a geometric anisotropy, the number of Hessian entries that involve finite difference derivatives will be greater, further increasing the risk of accumulated inaccuracies and severely incorrect point estimates.

6 Summary and Discussion

In this work, we identified a set of numerical approaches whose regions of accuracy and efficiency cover the entire domain of 𝒦ν\mathcal{K}_{\nu} and that are well-suited to high-order automatic differentiation. After assessing the computational gains that our approach provides for first and second derivatives, we demonstrated the practical significance of the accuracy gains through an example of maximum likelihood estimation via second-order optimization that completely fails when finite difference derivatives are employed, despite those derivatives being computed with the highly accurate AMOS library.

As modern datasets have continued to grow in size and complexity, accurately modeling dependence structure is more important than ever, and Gaussian processes remain among the most popular tools for that purpose. We consider the full three-parameter isotropic Matérn model, in which one estimates ν\nu, to be a bare minimum level of flexibility, at least for many datasets in the physical sciences. We thus hope that this work will empower practitioners to fit ν\nu as easily as they do any other parameter. In order to make this as straightforward as possible, we provide the methods described in this paper in a convenient and freely available software package. While this work provides no discussion of scalable approximations to Gaussian likelihoods, such approximations are a popular and important area of contemporary research (see [17], for example, for a non-exhaustive but nonetheless expansive introduction to the topic). Particularly in the fixed-domain asymptotic regime, large datasets are especially informative about the smoothness parameter ν\nu, and so there is particular synergy between the kernel derivatives we discuss here and scalable likelihood approximations that can use the information of significantly more data than would traditionally be possible with direct dense linear algebra.

Additionally, we hope that this work will reduce the complexity barrier to using Hessian matrices of the log-likelihood. Beyond the demonstration that we provide here of the value of true second-order optimization for maximum likelihood estimation, Hessian matrices of the log-likelihood are crucial to other common methods, such as Laplace approximations, including the method of integrated nested Laplace approximations (INLA) [28]. It is thus our hope that this work will be helpful to a wide variety of scientists and will make the use of the full three-parameter Matérn covariance a more standard practice.

Acknowledgements

The authors would like to thank Paul Hovland (Argonne National Laboratory) for engaging discussions on automatic differentiation as well as suggestions of relevant literature on the topic. They are also grateful to Lydia Zoells for her careful copyediting. Finally, they are very grateful to the two anonymous referees, both of whom provided exceptionally helpful comments and suggestions.

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

Appendix A Modifications for rescaling

In certain cases a scaled modified Bessel function, ex𝒦ν(x)e^{x}\mathcal{K}_{\nu}(x), may be useful for extending the range of arguments that can be handled and avoiding numerical issues such as underflow at large arguments. In the Matérn covariance function, a different prefactor arises: xν𝒦ν(x)x^{\nu}\mathcal{K}_{\nu}(x). This function, while more sensitive to floating point round-off errors than its unscaled counterpart, is bounded at the origin for all orders ν>0\nu>0.

This rescaled modified Bessel function is primarily useful for small arguments, and the two routines that can benefit the most from this rescaling are the direct series from Section 3.1 and the Temme recursion in Section 3.5. In Equation 3.2, one can simply bring the xνx^{\nu} term inside the sum to cancel the xνxνx^{-\nu}x^{\nu} product, completely removing the singularity at the origin as well as lowering the operation count, for both direct evaluations and AD computations.

Modifying the expressions in Section 3.5 is slightly more complicated, since integer orders ν\nu are computed via a recursion starting from the 𝒦0\mathcal{K}_{0} and 𝒦1\mathcal{K}_{1} terms. However, the term x0𝒦0x^{0}\mathcal{K}_{0} contains a (logarithmically growing) singularity at the origin, which is problematic for all integer orders ν\nu when x=0x=0. This can be avoided by re-arranging the recursion as

x^{\nu+1}\mathcal{K}_{\nu+1}(x)=2\nu(x^{\nu}\mathcal{K}_{\nu}(x))+x^{2}(x^{\nu-1}\mathcal{K}_{\nu-1}(x)),

where at x=0x=0 the term involving 𝒦ν1\mathcal{K}_{\nu-1}, i.e. the term that begins the recursion at 𝒦0\mathcal{K}_{0}, vanishes entirely due to the x2x^{2} factor, simplifying the recursion for all integer orders. Introducing a branch in the code that checks for this condition yields AD-compatible differentiation of xν𝒦ν(x)x^{\nu}\mathcal{K}_{\nu}(x) for non-zero integer orders ν\nu.
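A minimal sketch of this rescaled upward recursion follows, assuming the rescaled starting values for orders 0 and 1 are already available (e.g., from the Temme series); the function name is illustrative and not part of any package API.

# Upward recursion on g_n(x) := x^n K_n(x), using
# x^{n+1} K_{n+1}(x) = 2 n (x^n K_n(x)) + x^2 (x^{n-1} K_{n-1}(x)).
function rescaled_up_recursion(g0, g1, x, n)
    gm1, g = g0, g1
    x2 = x * x
    for k in 1:(n - 1)
        # At x = 0 the x^2 factor annihilates the lower-order term, so we branch
        # to avoid the Inf * 0 = NaN that x2 * g0 would otherwise produce.
        lower = iszero(x2) ? zero(g) : x2 * gm1
        gm1, g = g, 2 * k * g + lower
    end
    return g    # approximately x^n K_n(x)
end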

When ν\nu\not\in\mathbb{Z} and x=0x=0, the terms xνsinh(μ)/μx^{\nu}\sinh(\mu)/\mu and xνcosh(μ)x^{\nu}\cosh(\mu) cannot be evaluated directly, and we have a secondary set of limiting cases to handle. In this case we observe that xνcosh(νlog(2/x))x^{\nu}\cosh(\nu\log(2/x)) can be re-written as

x^{\nu}\cosh(\nu\log(2/x))=\frac{x^{\nu}(2^{\nu}x^{-\nu}+2^{-\nu}x^{\nu})}{2}=\frac{2^{\nu}+2^{-\nu}x^{2\nu}}{2}\ .

By again manually performing the cancellations, we can preserve the derivative information for xνcosh(μ)x^{\nu}\cosh(\mu), and similarly for xνsinh(μ)x^{\nu}\sinh(\mu). This minor bookkeeping addition, which introduces special routines for xν𝒦ν(x)x^{\nu}\mathcal{K}_{\nu}(x) for all orders ν\nu, pays off in terms of accuracy, speed, and compatibility with AD.
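In code, the manual cancellation amounts to evaluating expressions such as the following (names illustrative; μ=νlog(2/x)\mu=\nu\log(2/x) as above), both of which remain finite at x=0x=0 for ν>0\nu>0 and differentiate cleanly in ν\nu under AD.

# x^v * cosh(v * log(2 / x)) and x^v * sinh(v * log(2 / x)) with the x^v factor
# cancelled analytically, so that nothing diverges as x -> 0.
xv_cosh_mu(v, x) = (2.0^v + 2.0^(-v) * x^(2 * v)) / 2
xv_sinh_mu(v, x) = (2.0^v - 2.0^(-v) * x^(2 * v)) / 2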

Appendix B Profile likelihoods

For a collection of parameters 𝜽=(σ2,ρ,ν,)\bm{\theta}=(\sigma^{2},\rho,\nu,...), let 𝜽1\bm{\theta}_{1} denote the collection (1,ρ,ν,)(1,\rho,\nu,...). Through a small amount of algebra, it can be observed that, given all parameters besides σ2\sigma^{2}, the value of σ2\sigma^{2} that maximizes the log-likelihood can be computed in closed form as n1𝒛T𝚺(𝜽1)1𝒛n^{-1}\bm{z}^{T}\bm{\Sigma}(\bm{\theta}_{1})^{-1}\bm{z}. Using the observation that 𝚺(𝜽)=σ2𝚺(𝜽1)\bm{\Sigma}(\bm{\theta})=\sigma^{2}\bm{\Sigma}(\bm{\theta}_{1}) and some subsequent algebraic manipulation, one can thus re-write the likelihood (𝜽)\ell(\bm{\theta}) as a function of all parameters besides σ2\sigma^{2} to obtain

2p(𝜽1|𝒛)=log|𝚺(𝜽1)|+nlog(𝒛T𝚺(𝜽1)1𝒛).-2\ell_{p}(\bm{\theta}_{1}\;|\;\bm{z})=\log\left|\bm{\Sigma}(\bm{\theta}_{1})\right|+n\log\left(\bm{z}^{T}\bm{\Sigma}(\bm{\theta}_{1})^{-1}\bm{z}\right).

The value of this reformulation is that the dimension of the parameter estimation problem (and thus the challenging nonconvex optimization problem) has been reduced by one. For this paper, however, we only utilize the profile log-likelihood in order to visualize likelihood surfaces that would otherwise be three-dimensional.
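For completeness, a small sketch of this profile negative log-likelihood is given below. The kernel function name and signature (kernel(x, y, ρ, ν), assumed to return the unit-variance Matérn covariance between two points) are illustrative assumptions rather than any particular package's API.

using LinearAlgebra

# Profile negative log-likelihood: (log|Σ(θ₁)| + n log(z' Σ(θ₁)⁻¹ z)) / 2,
# up to additive constants, with σ² maximized out in closed form.
function profile_nll(pts, z, rho, nu; kernel)
    n  = length(z)
    S1 = Symmetric([kernel(x, y, rho, nu) for x in pts, y in pts])
    C  = cholesky(S1)
    qf = dot(z, C \ z)            # z' Σ(θ₁)⁻¹ z
    return (logdet(C) + n * log(qf)) / 2
end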

References

  • [1] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. US Department of Commerce, National Bureau of Standards Applied Mathematics Series 55, 1965.
  • [2] D. Amos. Algorithm 644: A portable package for Bessel functions of a complex argument and nonnegative order. ACM T Math Software (TOMS), 12(3):265–273, 1986.
  • [3] T. Beck. Automatic differentiation of iterative processes. J Comput Appl Math, 50(1-3):109–118, 1994.
  • [4] T. Beck and H. Fischer. The if-problem in automatic differentiation. J Comput Appl Math, 50(1):119–131, 1994. URL: https://www.sciencedirect.com/science/article/pii/0377042794902941.
  • [5] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.
  • [6] Y. A. Brychkov and K. O. Geddes. On the derivatives of the Bessel and Struve functions with respect to the order. Integr Transf Spec F, 16(3):187–198, 2005.
  • [7] B. Efron and D. V. Hinkley. Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika, 65(3):457–483, 1978.
  • [8] F. W. J. Olver and A. B. Olde Daalhuis and D. W. Lozier and B. I. Schneider, R. F. Boisvert, C. W. Clark, B. R. Miller, B. V. Saunders, H. S. Cohl, and M. A. McClain. NIST Digital Library of Mathematical Functions, 2021. Release 1.1.2 of 2021-06-15.
  • [9] J. A. Fike and J. J Alonso. Automatic differentiation through the use of hyper-dual numbers for second derivatives. In Recent Advances in Algorithmic Differentiation, pages 163–173. Springer, 2012.
  • [10] J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. Advances in Neural Information Processing Systems, 31, 2018.
  • [11] C. J. Geoga, M. Anitescu, and M. L. Stein. Scalable Gaussian Process Computations Using Hierarchical Matrices. J Comput Graph Stat, 29(2):227–237, 2020.
  • [12] J. L. González-Santander. Closed-form expressions for derivatives of Bessel functions with respect to the order. J Math Anal Appl, 466(1):1060–1081, 2018.
  • [13] A. Griewank and A. Walther. Evaluating Derivatives. Other Titles in Applied Mathematics. SIAM, 2008.
  • [14] M. Gu, J. Palomo, and J. O. Berger. RobustGaSP: Robust Gaussian stochastic process emulation in R. arXiv preprint arXiv:1801.01874, 2018.
  • [15] M. Gu, X. Wang, and J. O. Berger. Robust Gaussian stochastic process emulation. The Annals of Statistics, 46(6A):3038–3066, 2018.
  • [16] J. Guinness. Gaussian process learning via Fisher scoring of Vecchia’s approximation. Stat and Comput, 31(3):25, 2021.
  • [17] M. J. Heaton, A. Datta, A. O. Finley, R. Furrer, J. Guinness, R. Guhaniyogi, F. Gerber, R. B. Gramacy, D. Hammerling, M. Katzfuss, et al. A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24(3):398–425, 2019.
  • [18] F. Johansson. Arb: efficient arbitrary-precision midpoint-radius interval arithmetic. IEEE T Comput, 66:1281–1292, 2017.
  • [19] Y. Liu and O. Marin. Special function neural network (SFNN) models. In 2021 IEEE International Conference on Cluster Computing (CLUSTER), pages 680–685, 2021.
  • [20] O. Marin, C. Geoga, and M. Schanen. On automatic differentiation for the Matérn covariance. In NeurIPS, 2021. URL: https://papers.nips.cc/paper/2021.
  • [21] B. Matérn. Spatial Variation. Reports of the Forest Research Institute of Sweden, 1960.
  • [22] G. Matheron. Estimating and choosing: an essay on probability in practice. Springer-Verlag, 1989.
  • [23] V. Minden, A. Damle, K. L. Ho, and L. Ying. Fast Spatial Gaussian Process Maximum Likelihood Estimation via Skeletonization Factorizations. Multiscale Model Sim, 15(4):1584–1611, 2017.
  • [24] W. S. Moses and V. Churavy. Instead of rewriting foreign code for machine learning, automatically synthesize fast gradients. arXiv preprint arXiv:2010.01709, 2020.
  • [25] R. M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
  • [26] G. Nemes. Error bounds for the large-argument asymptotic expansions of the Hankel and Bessel functions. Acta Appl Math, 150(1):141–177, 2017.
  • [27] J. Revels, M. Lubin, and T. Papamarkou. Forward-mode automatic differentiation in Julia. arXiv:1607.07892 [cs.MS], 2016.
  • [28] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J R Stat Soc Ser B, 71(2):319–392, 2009.
  • [29] William Squire and George Trapp. Using complex variables to estimate derivatives of real functions. SIAM review, 40(1):110–112, 1998.
  • [30] M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.
  • [31] M. L. Stein, J. Chen, and M. Anitescu. Stochastic approximation of score functions for Gaussian processes. Ann Appl Stat, 7(2):1162–1191, 2013.
  • [32] N. M. Temme. On the numerical evaluation of the modified Bessel function of the third kind. J Comput Phys, 19:324–337, 1975.
  • [33] A. Wächter and L. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math Program, 106:25–57, 2006.
  • [34] L. White, M. Abbott, M. Zgubic, J. Revels, A. Arslan, S. Axen, S. Schaub, N. Robinson, Y. Ma, G. Dhingra, W. Tebbutt, N. Heim, A. D. W. Rosemberg, D. Widmann, N. Schmitz, C. Rackauckas, R. Heintzmann, F. Schae, K. Fischer, A. Robson, M. Brzezinski, A. Zhabinski, M. Besançon, P. Vertechi, S. Gowda, A. Fitzgibbon, C. Lucibello, C. Vogt, D. Gandhi, and F. Chorney. JuliaDiff/ChainRules.jl: v1.15.0, December 2021. doi:10.5281/zenodo.5750776.
  • [35] L. White, W. Tebbutt, M. Zgubic, W. Bruinsma, R. Luo, N. Robinson, A. Arslan, S. Axen, S. Schaub, R. Finnegan, A. Robson, B. Richard, C. Vogt, E. Davies, and V. B. Shah. JuliaDiff/FiniteDifferences.jl: v0.12.20, November 2021. doi:10.5281/zenodo.5724087.
  • [36] H. Zhang. Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics. J Am Stat Assoc, 99(465):250–261, 2004.