
Ergodic and foliated kernel-differentiation method for linear responses of random systems

Angxiu Ni$^{1,2}$. $^{1}$Department of Mathematics, University of California, Irvine, USA. $^{2}$Yau Mathematical Sciences Center, Tsinghua University, Beijing, China. niangxiu@gmail.com
Abstract.

We extend the kernel-differentiation method (or likelihood-ratio method) for the linear response of random dynamical systems, after revisiting its derivation from a microscopic view via transfer operators. First, for the linear response of physical measures, we extend the method to an ergodic version, which is sampled by one infinitely-long sample path; moreover, the magnitude of the integrand is bounded by a finite decorrelation step number. Second, when the noise and perturbation are along a given foliation, we show that the method is still valid for both finite and infinite time. We numerically demonstrate the ergodic version on a tent map, whose linear response of the physical measure does not exist; but we can add some noise and compute an approximate linear response. We also demonstrate on a chaotic neural network with 51 layers $\times$ 9 neurons, whose perturbation is along a given foliation; we show that adding foliated noise incurs a smaller error than noise in all directions. We give a rough error analysis of the method and show that it can be expensive for small noise. We propose a potential future program unifying the three popular linear response methods.

Keywords. likelihood ratio method, chaos, linear response, random dynamical systems.

AMS subject classification numbers. 37M25, 65D25, 65P99, 65C05, 62D05, 60G10.

1. Introduction

1.1. Literature review

The averaged statistics of a dynamical system are of central interest in applied sciences. The average can be taken in two ways. When the system is random, we can average with respect to the randomness. On the other hand, if the system runs for a long time, we can average with respect to time. The long-time average can also be defined for deterministic systems, and the limit measure, when it exists, is called the physical measure or SRB measure [43, 38, 6]. If the system is both random and run for a long time, the existence of the limit measure (which we shall also call the physical measure) is discussed in textbooks such as [11].

We are interested in the linear response, which is the derivative of the averaged observable with respect to the parameters of the system. The linear response is a fundamental tool for many purposes. For example, in aerospace design, we want to know how small changes in the geometry would affect the average lift of an aircraft, which was answered for non-chaotic fluids [23] but only partially answered for not-so-chaotic fluids [27]. In climate studies, we want to ask what temperature change is caused by a small change of the CO$_2$ level [16, 13]. In machine learning, we want to extend the conventional backpropagation method to cases with gradient explosion; current practices have to avoid gradient explosion, but that is not always achievable [32, 22].

There are three basic methods for expressing and computing the linear response: the path perturbation method, the divergence method, and the kernel-differentiation method. The path perturbation method averages the path perturbation over many orbits; this is also known as the ensemble method or the stochastic gradient method (see for example [25, 12, 26]). However, when the system is chaotic, that is, when the pathwise perturbation grows exponentially fast with respect to time, the method becomes too expensive, since it requires many samples to average a huge integrand. Proofs of such formulas for the physical measure of hyperbolic systems were given in, for example, [39, 9, 24].

The divergence method is also known as the transfer operator method, since the perturbation of the measure transfer operator is some divergence. Traditionally, the divergence method is functional, and the formula is not pointwise defined when the invariant measure is singular, which is common for systems with a contracting direction (see for example [19, 5]). This difficulty cannot be circumvented by transformations such as the Livšic theorem in [19]. Hence, the traditional divergence formulas cannot be computed by a Monte-Carlo approach, so the computation is cursed by dimensionality (see for example [15, 42, 20, 2, 35]).

For chaotic deterministic systems, to have a formula without exponentially growing or distributional terms, we need to blend the two formulas, but the obstruction was that the split lacks smoothness. The ‘fast response’ linear response formulas solve this difficulty for deterministic hyperbolic systems. They are composed of two parts, the adjoint shadowing lemma [29] and the equivariant divergence formula [30]. The fast response formula is a pointwise defined function and has no exponentially growing terms. The continuous-time versions of fast response formulas can be found in [29, 31]. The fast response formula is the ergodic theorem for linear response: it computes $2u+2$ $M$-dimensional vectors on each point of an orbit, and then the average of some pointwise functions of these vectors. Here $M$ is the phase space dimension, and $u$ is the number of unstable directions.

The kernel-differentiation formula works only for random systems. This method is more popularly known as the likelihood ratio method or the Monte-Carlo gradient method in the probability context (see [37, 36, 17, 34]); the idea (dividing and multiplying by the local density) is also used in Langevin sampling and diffusion models. However, such names are no longer distinguishing in the larger context: for example, the likelihood ratio idea is also useful in the divergence method. We apologize for the inconvenience in naming. Proofs of the kernel-differentiation method were given in [21, 3, 1]. This method is more powerful than the previous two for random dynamics, since the annealed linear response can exist when the quenched linear response does not [10]. However, as we shall see, this method is inevitably expensive when the randomness is small, so it is hard to use this approach as a very good approximation to linear responses of deterministic systems.

In terms of numerics, the kernel-differentiation method naturally allows Monte-Carlo type sampling, so it is efficient in high dimensions. The formula does not involve Jacobian matrices, hence it does not incur propagation of vectors or covectors (parameter gradients), so it is not hindered by undesirable features of the Jacobian of the deterministic part, such as singularity or non-hyperbolicity. The work in [33] solves a Poisson equation for a term in the formula, which reduces the numerical error, but can be expensive in high dimensions.

Another issue in previous likelihood ratio method is that the scale of the estimator is proportional to the orbit length. This leads to two undesirable consequences:

  • The required number of sample orbits increases with the orbit length.

  • We have to restart a new orbit very often. For each orbit, the observable function is only evaluated at the last step, while the earlier steps are somewhat wasted waiting for the orbits to land onto the physical measure.

This paper shows that, for physical measures, we can resolve these issues by invoking the decay of correlation and the ergodic theorem.

Moreover, sometimes we are interested in the case where some directions in the phase space have special importance, such as the level sets in Hamiltonian system. For these cases, the perturbation in the dynamics and the added noise are sometimes aligned to a subspace. This paper extends the kernel-differentiation method to foliated spaces.

Finally, for pedagogical purposes, it is beneficial to derive the kernel-differentiation method from a local or microscopic perspective. This will be somewhat more intuitive, and the above extensions will be more natural.

1.2. Main results

We consider the random dynamical system

X_{n+1}=f(\gamma,X_{n})+Y_{n+1}.

Let $h_n$ and $p_n$ be the probability densities of $X_n$ and $Y_n$. We need the expressions of $p_n$, but we do not need expressions of $h_n$. We denote the density of the physical measure by $h$, which is the limit measure under the dynamics. Here $\gamma$ is the parameter controlling the dynamics and hence $h_n$ and $h$; by default $\gamma=0$. We denote the perturbation $\delta(\cdot):=\partial(\cdot)/\partial\gamma|_{\gamma=0}$, and when $f$ is invertible, define

{\tilde{f}}:=f_{\gamma}\circ f^{-1}.

$\Phi$ is a fixed $C^2$ observable function.

As a warm-up, through basic calculus, section 2.2 rederives the kernel-differentiation formula for $\delta\int\Phi(x_1)h(x_1)dx_1$, the linear response for one step. Section 2.3 gives a pictorial explanation of the main lemma. In this simplest setting, we can see more easily how the kernel-differentiation method relates to other methods, and why it can be generalized to foliated situations. Section 2.4 rederives the formula for finitely many steps, which was known as the likelihood ratio method and was given in [37, 36, 17]. Our derivation uses the measure-transfer-operator point of view, which is more convenient for our extensions.

The first contribution of this paper, given in section 2.5, is the ergodic kernel-differentiation formula for physical measures. We invoke decay of correlations to constrain the scale of the estimator, and the ergodic theorem to reduce the required data to one long orbit. Hence, we only need to spend one spin-up time to land onto the physical measure, and then all data are fully utilized. This improves the efficiency of the algorithm. More specifically, we prove

{theorem}

[orbitwise kernel-differentiation formula for $\delta h$] Under assumptions 1 and 2, we have

\begin{split}\delta\int\Phi(x)h(x)dx\,\overset{\mathrm{a.s.}}{=}\,\lim_{W\rightarrow\infty}\lim_{L\rightarrow\infty}-\frac{1}{L}\sum_{n=1}^{W}\sum_{l=1}^{L}\left(\Phi(X_{n+l})-\Phi_{avg}\right)\,\delta f_{\gamma}X_{l}\cdot\frac{dp}{p}(Y_{1+l})\end{split}

almost surely according to the measure obtained by starting the dynamics with $X_0,Y_1,Y_2,\cdots\sim h\times p\times p\times\cdots$. Here $\Phi_{avg}:=\int\Phi h$.

Here $W$ and $L$ are two separate and sequential limits, so we can typically pick $W\ll L$. This makes the estimator smaller than in previous likelihood ratio formulas, where $W=L$. Our formula runs on only one orbit; hence, it is faster than previous likelihood ratio methods. Section 2.6 generalizes the result to cases where the randomness depends on locations and parameters.
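To make the estimator concrete, here is a minimal sketch in Python/NumPy, assuming a one-dimensional map, additive isotropic Gaussian noise with standard deviation $\sigma$ (so $\frac{dp}{p}(y)=-y/\sigma^2$), and hypothetical choices of $f$, $\delta f_\gamma$, and $\Phi$; it is only an illustration of the formula above, not the algorithm of section 4.1.

```python
import numpy as np

def ergodic_kernel_diff(f, delta_f, Phi, sigma, x0, L, W, spinup=1000, seed=0):
    """Sketch of the one-orbit (ergodic) kernel-differentiation estimator.
    Assumes additive isotropic Gaussian noise, so dp/p(y) = -y / sigma**2."""
    rng = np.random.default_rng(seed)
    N = spinup + L + W + 1
    X = np.empty(N); Y = np.empty(N)
    X[0] = x0
    for n in range(N - 1):
        Y[n + 1] = sigma * rng.standard_normal()
        X[n + 1] = f(X[n]) + Y[n + 1]
    X, Y = X[spinup:], Y[spinup:]          # discard spin-up, keep L + W + 1 states
    PhiX = Phi(X)
    Phi_avg = PhiX[:L].mean()              # estimate of Phi_avg on the same orbit
    # weight at step l uses the noise Y_{l+1} added when leaving X_l
    weight = delta_f(X[:L]) * (-Y[1:L + 1] / sigma**2)
    est = 0.0
    for n in range(1, W + 1):              # decorrelation window
        est += -np.mean((PhiX[n:n + L] - Phi_avg) * weight)
    return est

# toy usage: f(x) = 2x mod 1 perturbed by gamma*sin(2*pi*x), Phi(x) = cos(2*pi*x)
f = lambda x: (2.0 * x) % 1.0
delta_f = lambda x: np.sin(2 * np.pi * x)
Phi = lambda x: np.cos(2 * np.pi * x)
print(ergodic_kernel_diff(f, delta_f, Phi, sigma=0.1, x0=0.3, L=100000, W=20))
```

The toy map, observable, and constants at the end are placeholders; $W$ plays the role of the decorrelation window and $L$ the role of the averaging length described above.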

The second contribution of this paper is that section 3 extends the kernel-differentiation method to the case where the phase space is partitioned by submanifolds, and the added noise and the perturbation are along the submanifolds (hence the noise is ‘singular’ with respect to the Lebesgue measure). More specifically, let $F=\{F_\alpha\}_{\alpha\in A}$ be a co-dimension-$c$ foliation on $\mathcal{M}$, which is basically a collection of $(M-c)$-dimensional submanifolds partitioning $\mathcal{M}$; each submanifold is called a ‘leaf’. Note that we do not require $f$ to respect the foliation, that is, $x$ and $fx$ can be on two different leaves. Let $p(z,x_{n+1})$ be the probability density of $X_{n+1}$, conditioned on $f(X_n)=z$, with respect to the Lebesgue measure on $F_\alpha(z)$. The structure of section 3 is similar to section 2, and the main theorem is

{theorem}

[centralized foliated kernel-differentiation formula for $\delta h_T$ of time-inhomogeneous systems, comparable to the Remark] Under assumptions 1 and 3 for all $n$, we have

\begin{split}\delta\int\Phi(x_{T})d\mu_{T}(x_{T})=\mathbb{E}\left[\left(\Phi_{T}(x_{T})-\Phi_{avg,T}\right)\sum_{m=1}^{T}\left(\delta{\tilde{f}}_{m}(z_{m})\cdot\frac{d_{z}p_{m}(z_{m},x_{m})}{p_{m}(z_{m},x_{m})}\right)\right]\\ :=\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F^{1}_{\alpha}(z_{1})}\cdots\int_{x_{T}\in F^{T}_{\alpha}(z_{T})}\left(\Phi_{T}(x_{T})-\Phi_{avg,T}\right)\\ \sum_{m=1}^{T}\left(\delta{\tilde{f}}_{m}(z_{m})\cdot\frac{d_{z}p_{m}(z_{m},x_{m})}{p_{m}(z_{m},x_{m})}\right)p_{T}(z_{T},x_{T})dx_{T}\cdots p_{1}(z_{1},x_{1})dx_{1}d\mu_{0}(x_{0}).\end{split}

Here $\Phi_{avg,T}:=\int\Phi_T d\mu_T$. Note that here $F,{\tilde{f}},f,p$ can be different for different steps.

Section 4 considers numerical realizations. Section 4.1 gives a detailed list of the algorithm. Note that the foliated case uses the same algorithm; the difference is in the set-up of the problem. Section 4.2 illustrates the ergodic version of the algorithm on an example where the deterministic part does not have a linear response, but adding noise and using the kernel-differentiation algorithm gives a reasonable reflection of the relation between the long-time averaged observable and the parameter. Section 4.3 illustrates the foliated version of the algorithm on an unstable neural network with 51 layers $\times$ 9 neurons, where the perturbation is along the foliation. We show that adding a low-dimensional noise along the foliation induces a much smaller error than adding noise in all directions. The foliated kernel-differentiation method can then give a good approximation of the true parameter-derivative.

In the discussion section, section 5.1 gives a rough cost-error estimation of the kernel-differentiation algorithm. This helps to set some constants in the algorithm; it also shows that the small-noise limit is still very expensive. Section 5.2 proposes a program which combines the three formulas to obtain a good approximation of linear responses of deterministic systems.

The paper is organized as follows. Section 2 derives the kernel-differentiation formula where the noise is in all directions. Section 3 derives the formula for foliated perturbations and noises. Section 4 gives the procedure list of the algorithm based on the formulas, and illustrates on two numerical examples. Section 5 discusses the relation to deterministic linear responses.

2. Deriving the kernel-differentiation linear response formula

2.1. Preparations: notations on probability densities

Let $f_\gamma$ be a family of $C^2$ maps on the Euclidean space $\mathbb{R}^M$ of dimension $M$, parameterized by $\gamma$. Assume that $\gamma\mapsto f_\gamma$ is $C^1$ from $\mathbb{R}$ to the space of $C^2$ maps. The random dynamical system in this paper is given by

(1) Xn=fγ(Xn1)+Yn,whereYniidp.\begin{split}X_{n}=f_{\gamma}(X_{n-1})+Y_{n},\quad\textnormal{where}\quad Y_{n}{\,\overset{\mathrm{iid}}{\sim}\,}p.\end{split}

The default value is γ=0\gamma=0, and we denote f:=fγ=0f:=f_{\gamma=0}. In this section, we consider the case where pp is any fixed C2C^{2} probability density function. The next section considers the case where pp is smooth along certain directions but singular along other directions.

Moreover, except for section 1.2, our results also apply to time-inhomogeneous cases, where $f$ and $p$ are different for each step. More specifically, the dynamics is given by

(2) Xn=fγ,n1(Xn1)+Yn,whereYnpn.\begin{split}X_{n}=f_{\gamma,n-1}(X_{n-1})+Y_{n},\quad\textnormal{where}\quad Y_{n}\sim p_{n}.\end{split}

Here the $X_n\in\mathbb{R}^{M_n}$ are not necessarily of the same dimension. We shall exhibit the time dependence in the Remark and also in the numerics section. On the other hand, for the infinite-time case in section 1.2, we want to sample by only one orbit; this requires that $f$ and $p$ be the same for all steps.

We define hTh_{T} as the density of pushing-forward the initial measure h0h_{0} for TT steps.

hT:=(LpLfγ)Th0.\begin{split}h_{T}:=(L_{p}L_{f_{\gamma}})^{T}h_{0}.\end{split}

Here $h_T=h_{T,\gamma}$ depends on $\gamma$ and also on $h_0$; we sometimes omit the subscript $\gamma$. Here $L_f$ is the measure transfer operator of $f$, which is defined by the integral equality

(3) Φ(x)(Lfh)(x)𝑑x:=Φ(fx)h(x)𝑑x.\begin{split}\int\Phi(x)\,(L_{f}h)(x)dx:=\int\Phi(fx)h(x)dx.\end{split}

Here ΦC2(M)\Phi\in C^{2}(\mathbb{R}^{M}) is any fixed observable function. When ff is bijective, there is an equivalent pointwise definition, (Lfh)(x):=h|Df|(f1x)(L_{f}h)(x):=\frac{h}{|Df|}(f^{-1}x).

For a density $q$, $L_pq$ is pointwise defined by convolution with the density $p$:

(4) Lpq(x)=q(xy)p(y)𝑑y=q(z)p(xz)𝑑z.\begin{split}L_{p}q(x)=\int q(x-y)p(y)dy=\int q(z)p(x-z)dz.\end{split}

We shall also use the integral equality

(5) Φ(x)Lpq(x)𝑑x=p(y)(Φ(x)q(xy)𝑑x)𝑑y=z=xyΦ(y+z)q(z)p(y)𝑑z𝑑y.\begin{split}\int\Phi(x)L_{p}q(x)dx=\int p(y)\left(\int\Phi(x)q(x-y)dx\right)dy\\ \,\overset{z=x-y}{=}\,\iint\Phi(y+z)q(z)p(y)dzdy.\end{split}

If we want to compute the above integration by the Monte-Carlo method, then we should generate i.i.d. $Y_l=y_l$ and $Z_l=z_l$ according to the densities $p$ and $q$, and then compute the average of $\Phi(y_l+z_l)$.
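A minimal sketch of this sampling procedure, assuming (purely for illustration) that both $q$ and $p$ are Gaussian, so the exact value of the integral is known for comparison:

```python
import numpy as np

# Monte-Carlo sampling of  int Phi(x) (L_p q)(x) dx = E[Phi(Z + Y)],  equation (5).
# Hypothetical choices: q = N(0,1), p = N(0, sigma^2), Phi(x) = x^2.
rng = np.random.default_rng(1)
L, sigma = 10**6, 0.5
Z = rng.standard_normal(L)            # Z_l ~ q
Y = sigma * rng.standard_normal(L)    # Y_l ~ p, independent of Z_l
Phi = lambda x: x**2
print(np.mean(Phi(Z + Y)))            # close to Var(Z) + Var(Y) = 1 + sigma^2 for this toy choice
```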

To consider differentiability of hh with respect to γ\gamma, we make the following simplifying assumptions on differentiability of the basic quantities.

Assumption 1.

The densities p,q,hp,q,h have bounded C2C^{2} norms; γfγ\gamma\mapsto f_{\gamma} is C1C^{1} from \mathbb{R} to the space of C2C^{2} maps; the observable function Φ\Phi is fixed and is C2C^{2}.

The linear response formula for finite time TT is an expression of δhγ,T\delta h_{\gamma,T} by δfγ\delta f_{\gamma}, and

δ():=()γ|γ=0.\begin{split}\delta(\cdot):=\left.\frac{\partial(\cdot)}{\partial\gamma}\right|_{\gamma=0}.\end{split}

Here δ\delta may as well be regarded as small perturbations. Note that by our notation,

δfγ(x)TfxM;\begin{split}\delta f_{\gamma}(x)\in T_{fx}\mathbb{R}^{M};\end{split}

that is, δfγ(x)\delta f_{\gamma}(x) is a vector at fxfx. The linear response formula for finite-time is given by the Leibniz rule,

(6) δhT=δ(LpLfγ)Th0=n=0T1(LpLf)nδ(LpLfγ)(LpLf)T1nh0.\begin{split}\delta h_{T}=\delta(L_{p}L_{f_{\gamma}})^{T}h_{0}=\sum_{n=0}^{T-1}(L_{p}L_{f})^{n}\,\delta(L_{p}L_{f_{\gamma}})\,(L_{p}L_{f})^{T-1-n}h_{0}.\end{split}

We shall give an expression of the perturbative transfer operator δ(LpLfγ)\delta(L_{p}L_{f_{\gamma}}) later.

For the case where TT\rightarrow\infty, we define the density hh of the (ergodic) physical measure μ\mu, which is the unique limit of hTh_{T} in the weak* topology for any smooth initial density h0h_{0}:

limT,T(LpLfγ)Th0=:hγ,\begin{split}\lim_{T\in\mathbb{N},\,T\rightarrow\infty}(L_{p}L_{f_{\gamma}})^{T}h_{0}=:h_{\gamma},\end{split}

note that hh depends on γ\gamma but not on h0h_{0}. The physical measure is also called the SRB measure or the stationary measure. We also define the correlation function, 𝔼[φ(Xn)ψ(X0,X1)]\mathbb{E}[\varphi(X_{n})\psi(X_{0},X_{1})], where X0X_{0} is distributed according to the physical measure μ\mu, and XnX_{n} is distributed according to the dynamics in equation 1. For this section,

𝔼γ[φ(Xn)ψ(X0,X1)]:=φ(xn)ψ(x0,x1)hγ(x0)𝑑x0p(y1)𝑑y1p(yn)𝑑yn.\mathbb{E}_{\gamma}[\varphi(X_{n})\psi(X_{0},X_{1})]:=\int\varphi(x_{n})\psi(x_{0},x_{1})h_{\gamma}(x_{0})dx_{0}p(y_{1})dy_{1}\cdots p(y_{n})dy_{n}.

We shall make the following assumptions for infinitely many steps:

Assumption 2.

For a small interval of γ\gamma containing zero,

  (1) The physical measure for $\gamma\neq 0$ is the limit of evolving the dynamics starting from the physical measure at $\gamma=0$.

  (2) For any observable functions $\varphi,\psi\in C^{2}$, the following sum of correlations

\sum_{n\geq 1}\left|\mathbb{E}_{\gamma}[\varphi(X_{n})\psi(X_{0},X_{1})]-\mathbb{E}_{\gamma}[\varphi(X_{0})]\mathbb{E}_{\gamma}[\psi(X_{0},X_{1})]\right|

converges uniformly in terms of $\gamma$.

Assumption 2 can be proved from more basic assumptions, which are typically more forgiving than the hyperbolicity assumption for deterministic cases. For example, under assumption 1, the proof of the unique existence of the physical measure, when the system has additive noise with a smooth density, can be found in probability textbooks such as [11]. [14] proves the decay of correlations when, roughly speaking, $f_\gamma$ has bounded variance, is mixing, and the transfer operator is bounded. The technical assumptions and a rigorous proof of the linear response can also be found in [21]. Our main purpose is to derive the ergodic kernel-differentiation formula with the detailed expression in section 2.2. The formula seems to be the same in most cases where the linear response exists.

2.2. One-step kernel-differentiation

We can give a pointwise formula for $\delta(L_pL_{f_\gamma}q)(x)$, which seems to be long-known wisdom; an intuitive explanation of the following lemma is given in section 2.3. We then give the integral version, which can be computed by Monte-Carlo algorithms.

{lemma}

[local one-step kernel-differentiation formula] Under assumption 1, then for any density h0h_{0}

δ(LpLfγh0)(x1)=δfγ(x0)dp(x1fx0)h0(x0)dx0.\begin{split}\delta(L_{p}L_{f_{\gamma}}h_{0})(x_{1})=\int-\delta f_{\gamma}(x_{0})\cdot dp(x_{1}-fx_{0})\,h_{0}(x_{0})dx_{0}.\end{split}

Here δ:=/γ|γ=0\delta:=\partial/\partial\gamma|_{\gamma=0}, and δfγ(x0)dp(x1fx0)\delta f_{\gamma}(x_{0})\cdot dp(x_{1}-fx_{0}) is the derivative of the function p()p(\cdot) at x1fx0x_{1}-fx_{0} in the direction δfγ(x0)\delta f_{\gamma}(x_{0}), which is a vector at fx0fx_{0}.

Remark.

Note that we do not compute $\delta L_{f_\gamma}$ separately. In fact, the main point here is that the convolution with $p$ allows us to differentiate $p$, which is typically much more forgiving than differentiating only $L_{f_\gamma}$. Although there is a divergence formula, $\delta L_{f_\gamma}h_0=\textnormal{div}(h_0\,\delta f_\gamma\circ f^{-1})$, we would still need to know the derivative of $h_0$. This is difficult when $h_0$ is the physical density $h$: we can quench a specific noise sequence $\{Y_n=y_n\}_{n\geq 0}$ and, for each sample path, apply the adjoint shadowing lemma and equivariant divergence formula in [30, 29], but this requires hyperbolicity, which can be too strict for real-life systems.

Proof.

First write a pointwise expression for LpLfγh0L_{p}L_{f_{\gamma}}h_{0}: by definition of LpL_{p} in equation 4,

(LpLfγh0)(x1)=(Lfγh0)(z1)p(x1z1)𝑑z1.\begin{split}(L_{p}L_{f_{\gamma}}h_{0})(x_{1})=\int(L_{f_{\gamma}}h_{0})(z_{1})p(x_{1}-z_{1})dz_{1}.\end{split}

Since $p\in C^2(\mathbb{R}^M)$, we can substitute $p$ for $\Phi$ in the definition of $L_f$ in equation 3, to get

=h0(x0)p(x1fγx0)𝑑x0.\begin{split}=\int h_{0}(x_{0})p(x_{1}-f_{\gamma}x_{0})dx_{0}.\end{split}

Differentiate with respect to γ\gamma, we have

δ(LpLfγh0)(x1)=h0(x0)δfγ(x0)dp(x1fx0)dx0.\begin{split}\delta(L_{p}L_{f_{\gamma}}h_{0})(x_{1})=\int-h_{0}(x_{0})\delta f_{\gamma}(x_{0})\cdot dp(x_{1}-fx_{0})dx_{0}.\end{split}

{lemma}

[integrated one-step kernel-differentiation formula] Under assumption 1, then

Φ(x1)δ(LpLfγh0)(x1)𝑑x1=Φ(x1)δfγ(x0)dp(y1)p(y1)h0(x0)dx0p(y1)dy1.\begin{split}\int\Phi(x_{1})\delta(L_{p}L_{f_{\gamma}}h_{0})(x_{1})dx_{1}=\iint-\Phi(x_{1})\delta f_{\gamma}(x_{0})\cdot\frac{dp(y_{1})}{p(y_{1})}\,h_{0}(x_{0})dx_{0}\,p(y_{1})dy_{1}.\end{split}

Here x1x_{1} in the right expression is a function of the dummy variables x0x_{0} and y1y_{1}, that is, x1=y1+fx0x_{1}=y_{1}+fx_{0}.

Remark.

The point is, if we want to integrate the right expression by Monte-Carlo, we just generate random pairs $\{x_{l,0},y_{l,1}\}$, compute $x_{l,1}$ accordingly, and average $I_l:=-\Phi(x_{l,1})\,\delta f_\gamma(x_{l,0})\cdot\frac{dp}{p}(y_{l,1})$ over many $l$'s.
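A minimal sketch of this one-step estimator, assuming one-dimensional dynamics, Gaussian $h_0$ and Gaussian noise with standard deviation $\sigma$ (so $\frac{dp}{p}(y)=-y/\sigma^2$), and hypothetical choices of $f_\gamma$, $\delta f_\gamma$, and $\Phi$; the finite-difference line at the end is only a sanity check for this toy setting:

```python
import numpy as np

# One-step kernel-differentiation estimator (the remark above), toy 1-D setting.
rng = np.random.default_rng(2)
L, sigma, gamma = 10**6, 0.3, 1e-3
f = lambda x, g=0.0: np.cos(x) + g * np.sin(x)   # hypothetical f_gamma
delta_f = lambda x: np.sin(x)                    # d f_gamma / d gamma at gamma = 0
Phi = lambda x: x**2

x0 = rng.standard_normal(L)                      # x_{l,0} ~ h_0
y1 = sigma * rng.standard_normal(L)              # y_{l,1} ~ p
x1 = f(x0) + y1
I = -Phi(x1) * delta_f(x0) * (-y1 / sigma**2)    # I_l, with dp/p(y) = -y/sigma^2
print("kernel-differentiation:", I.mean())

# crude check: finite difference of E[Phi(x_1)] with common random numbers
print("finite difference     :", (Phi(f(x0, gamma) + y1).mean() - Phi(x1).mean()) / gamma)
```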

Proof.

Substitute section 2.2 into the integration, we get

Φ(x1)δ(LpLfγh0)(x1)𝑑x1=Φ(x1)δfγ(x0)𝑑p(x1fx0)h0(x0)𝑑x0𝑑x1\begin{split}\int\Phi(x_{1})\delta(L_{p}L_{f_{\gamma}}h_{0})(x_{1})dx_{1}=-\iint\Phi(x_{1})\delta f_{\gamma}(x_{0})\cdot dp(x_{1}-fx_{0})\,h_{0}(x_{0})dx_{0}dx_{1}\end{split}

The problem with this expression is that, should we want Monte-Carlo, it is not obvious which measure we should generate x1x_{1}’s according to. To solve this issue, change the order of the double integration to get

=δfγ(x0)(Φ(x1)𝑑p(x1fx0)𝑑x1)h0(x0)𝑑x0.\begin{split}=-\int\delta f_{\gamma}(x_{0})\cdot\left(\int\Phi(x_{1})dp(x_{1}-fx_{0})dx_{1}\right)h_{0}(x_{0})dx_{0}.\end{split}

Change the dummy variable of the inner integration from x1x_{1} to y1=x1fx0y_{1}=x_{1}-fx_{0},

=δfγ(x0)(Φ(y1+fx0)𝑑p(y1)𝑑y1)h0(x0)𝑑x0=Φ(x1(x0,y1))δfγ(x0)dpp(y1)p(y1)𝑑y1h0(x0)𝑑x0.\begin{split}=-\int\delta f_{\gamma}(x_{0})\cdot\left(\int\Phi(y_{1}+fx_{0})dp(y_{1})dy_{1}\right)h_{0}(x_{0})dx_{0}\\ =-\iint\Phi(x_{1}(x_{0},y_{1}))\,\delta f_{\gamma}(x_{0})\cdot\frac{dp}{p}(y_{1})\,p(y_{1})dy_{1}h_{0}(x_{0})dx_{0}.\end{split}

Here x1x_{1} is a function as stated in the lemma. ∎

2.3. An intuitive explanation

We give an intuitive explanation to section 2.2 and Remark. This will help us to generalize the method to the foliated situation in section 3. First, we shall adopt a more intuitive but restrictive notation. Define f~(x,γ){\tilde{f}}(x,\gamma) such that

f~(f(x),γ)=fγ(x).\begin{split}{\tilde{f}}(f(x),\gamma)=f_{\gamma}(x).\end{split}

In other words, we write fγf_{\gamma} as appending a small perturbative map f~{\tilde{f}} to f:=fγ=0f:=f_{\gamma=0}. Here f~{\tilde{f}} is the identity map when γ=0\gamma=0. Note that f~{\tilde{f}} can be defined only if ff satisfies

fγ(x)=fγ(x)wheneverf(x)=f(x).\begin{split}f_{\gamma}(x)=f_{\gamma}(x^{\prime})\quad\textnormal{whenever}\quad f(x)=f(x^{\prime}).\end{split}

For example, when ff is bijective then we can well-define f~{\tilde{f}}. Hence, this new notation is more restrictive than what we used in other parts of this section, but it lets us see more clearly what happens during the perturbation.

With the new notation, the dynamics can be written now as

Xn=f~f(Xn1)+Yn,whereYniidp.\begin{split}X_{n}={\tilde{f}}f(X_{n-1})+Y_{n},\quad\textnormal{where}\quad Y_{n}{\,\overset{\mathrm{iid}}{\sim}\,}p.\end{split}

By this notation, only ff changes time step, but f~{\tilde{f}} and adding YY do not change time. In this subsection we shall start from after having applied the map ff, and only look at the effect of applying f~{\tilde{f}} and adding noise: this is enough to account for the essentials of section 2.2 and Remark. Roughly speaking, the qnq_{n} we use in the following is in fact qn:=Lfhn1q_{n}:=L_{f}h_{n-1}; qnq_{n} is the density of zn:=fxn1z_{n}:=fx_{n-1}, so zn+yn=xnz_{n}+y_{n}=x_{n}, and we omit the subscript for time 0.

With the new notation, section 2.2 is essentially equivalent to

(7) δ(LpLf~q)(x)=δf~zdp(xz)q(z)dz.\begin{split}\delta(L_{p}L_{{\tilde{f}}}q)(x)=\int-\delta{\tilde{f}}z\cdot dp(x-z)\,q(z)dz.\end{split}

We explain this intuitively by figure 1. Let ${\tilde{z}}:={\tilde{f}}z$ be distributed according to ${\tilde{q}}$; then $L_p{\tilde{q}}$ is obtained by first attaching a density $p$ to each ${\tilde{z}}$, and then integrating over all ${\tilde{z}}$. $L_pL_{\tilde{f}}q$ is obtained by first moving $z$ to ${\tilde{z}}$ and then performing the same procedure. Hence, $\delta L_pL_{\tilde{f}}q(x)$ is obtained by first computing $\delta p_{{\tilde{f}}z}(x)$ for each $z$, and then integrating over $z$. Let

pf~z(x):=p(xf~z)\begin{split}p_{{\tilde{f}}z}(x):=p(x-{\tilde{f}}z)\end{split}

be the density of the noise centered at f~z{\tilde{f}}z. So

δpf~z(x)=dpf~z(x)δf~z(x)=dp(xf~z)δf~z.\begin{split}\delta p_{{\tilde{f}}z}(x)=-dp_{{\tilde{f}}z}(x)\cdot\delta{\tilde{f}}z(x)=-dp(x-{\tilde{f}}z)\cdot\delta{\tilde{f}}z.\end{split}

Here $\delta{\tilde{f}}z(x)$ in the middle expression is the horizontal shift of the level set of $p_{{\tilde{f}}z}$ previously located at $x$. Since the entire Gaussian distribution is moved in parallel on the Euclidean space $\mathbb{R}^M$, $\delta{\tilde{f}}z(x)=\delta{\tilde{f}}z$ is constant for all $x$. Then we can integrate over $z$ to get equation 7.

Figure 1. Intuitions for section 2.2 and the Remark.

Using f~{\tilde{f}} notation, Remark is equivalent to

δΦ(x)(LpLf~q)(x)𝑑x=Φ(x)δf~(z)dp(y)p(y)q(z)dzp(y)dy,\begin{split}\delta\int\Phi(x)(L_{p}L_{\tilde{f}}q)(x)dx=\iint-\Phi(x)\delta{\tilde{f}}(z)\cdot\frac{dp(y)}{p(y)}\,q(z)dz\,p(y)dy,\end{split}

where $x$ on the right is the function $x=z+y$. Intuitively, this says that the left side equals first computing $\delta\int\Phi(z+y)p_{{\tilde{f}}z}(y)dy$ for each $z$, and then integrating over $z$. The integration on the left uses $x$ as the dummy variable, which is convenient for the transfer operator formula above; but it does not involve a density for $x$, so it is not immediately ready for Monte-Carlo. The integration on the right is over $q(z)$ and $p(y)$, which are easy to sample.

It is important that we differentiate only pp but not ff. In fact, our core intuitions are completely within one time step, and do not even involve ff. Hence, we can easily generalize to cases where ff is bad, for example, when ff is not bijective, or when ff is not hyperbolic: these are all difficult for deterministic systems.

2.4. Kernel-differentiation over finitely many steps

We use the Remark to get integrated formulas for $h_T:=(L_pL_{f_\gamma})^Th_0$ and its perturbation $\delta h_T$, which can be sampled by Monte-Carlo type algorithms. Note that here $h_0$ is fixed and does not depend on $\gamma$. This was previously known as the likelihood ratio method or the Monte-Carlo gradient method and was given in [37, 36, 17]. The magnitude of the integrand grows as $O(\sqrt{T})$ after the centralization, where $T$ is the number of time steps.

{lemma}

Under assumption 1, for any C2C^{2} (not necessarily positive) densities h0h_{0}, any nn\in\mathbb{N},

Φ(xn)((LpLf)nh0)(xn)𝑑xn=Φ(xn)h0(x0)𝑑x0p(y1)𝑑y1p(yn)𝑑yn.\begin{split}\int\Phi(x_{n})((L_{p}L_{f})^{n}h_{0})(x_{n})dx_{n}=\int\Phi(x_{n})h_{0}(x_{0})dx_{0}p(y_{1})dy_{1}\cdots p(y_{n})dy_{n}.\end{split}

Here the xnx_{n} on the left is a dummy variable; whereas xnx_{n} on the right is recursively defined by xm=f(xm1)+ymx_{m}=f(x_{m-1})+y_{m}, so xnx_{n} is a function of the dummy variables x0,y1,,ynx_{0},y_{1},\cdots,y_{n}.

Remark.

The integrated formula is straightforward to compute by the Monte-Carlo method. That is, for each $l$, we generate a random $x_{l,0}$ according to the density $h_0$, and $y_{l,1},\cdots,y_{l,n}$ i.i.d. according to $p$. Then we compute $\Phi(x_{l,n})$ for this particular sample $\{x_{l,0},y_{l,1},\cdots,y_{l,n}\}$. Note that the experiments for different $l$ should be either independent or decorrelated. Then the Monte-Carlo integration is simply

Φ(xn)((LpLf)nh0)(xn)𝑑xn=limL1Ll=1LΦ(xl,n).\begin{split}\int\Phi(x_{n})((L_{p}L_{f})^{n}h_{0})(x_{n})dx_{n}=\lim_{L\rightarrow\infty}\frac{1}{L}\sum_{l=1}^{L}\Phi(x_{l,n}).\end{split}

almost surely according to $h_0\times p_1\times\cdots\times p_n$. In this paper we shall use $n,m$ to indicate time steps, whereas $l$ labels different samples.
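A minimal sketch of this sample average, with hypothetical Gaussian choices of $h_0$ and $p$ and a toy $f$ and $\Phi$:

```python
import numpy as np

# Monte-Carlo average of Phi at step n over L independent noisy orbits (remark above).
rng = np.random.default_rng(3)
L, n, sigma = 10**5, 10, 0.3
f = lambda x: np.cos(x)                  # hypothetical deterministic part
Phi = lambda x: x**2

x = rng.standard_normal(L)               # x_{l,0} ~ h_0, one entry per sample l
for m in range(1, n + 1):
    y = sigma * rng.standard_normal(L)   # y_{l,m} ~ p, independent over l and m
    x = f(x) + y                         # x_{l,m} = f(x_{l,m-1}) + y_{l,m}
print(np.mean(Phi(x)))                   # estimates  int Phi(x_n) ((L_p L_f)^n h_0)(x_n) dx_n
```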

Proof.

Sequentially apply the definition of LpL_{p} and LfL_{f}, we get

(8) Φ(xn)((LpLf)nh0)(xn)𝑑xn=Φ(yn+zn)(Lf(LpLf)n1h0)(zn)p(yn)𝑑yn𝑑zn=Φ(xn(xn1,yn))((LpLf)n1h0)(xn1)p(yn)𝑑yn𝑑xn1.\begin{split}\int\Phi(x_{n})((L_{p}L_{f})^{n}h_{0})(x_{n})dx_{n}\\ =\iint\Phi(y_{n}+z_{n})(L_{f}(L_{p}L_{f})^{n-1}h_{0})(z_{n})p(y_{n})dy_{n}dz_{n}\\ =\iint\Phi(x_{n}(x_{n-1},y_{n}))((L_{p}L_{f})^{n-1}h_{0})(x_{n-1})p(y_{n})dy_{n}dx_{n-1}.\end{split}

Here xn:=yn+f(xn1)x_{n}:=y_{n}+f(x_{n-1}) in the last expression is a function of the dummy variables xn1x_{n-1} and yny_{n}. Roughly speaking, the dummy variable znz_{n} in the second expressions is f(xn1)f(x_{n-1}).

Recursively apply equation 8 once, we have

(Φ(xn(xn1,yn))((LpLf)n1h0)(xn1)𝑑xn1)p(yn)𝑑yn=Φ(xn(xn1(xn2,yn1),yn))((LpLf)n2h0)(xn2)p(yn1)p(yn)𝑑yn1𝑑yn𝑑xn2.\begin{split}\int\left(\int\Phi(x_{n}(x_{n-1},y_{n}))((L_{p}L_{f})^{n-1}h_{0})(x_{n-1})dx_{n-1}\right)p(y_{n})dy_{n}\\ =\int\Phi(x_{n}(x_{n-1}(x_{n-2},y_{n-1}),y_{n}))((L_{p}L_{f})^{n-2}h_{0})(x_{n-2})p(y_{n-1})p(y_{n})dy_{n-1}dy_{n}dx_{n-2}.\end{split}

Here xn1:=yn1+f(xn2)x_{n-1}:=y_{n-1}+f(x_{n-2}), so xnx_{n} is a function of the dummy variables xn2,yn1x_{n-2},y_{n-1}, and yny_{n}. Keep applying equation 8 recursively to prove the lemma. ∎

{theorem}

[kernel-differentiation formula for δhT\delta h_{T}] Under assumption 1, then

δΦ(xT)hT(xT)𝑑xT=Φ(xT)m=0T1(δfγxmdpp(ym+1))h0(x0)dx0(pdy)1T.\begin{split}\delta\int\Phi(x_{T})h_{T}(x_{T})dx_{T}=-\int\Phi(x_{T})\sum_{m=0}^{T-1}\left(\delta f_{\gamma}x_{m}\cdot\frac{dp}{p}(y_{m+1})\right)h_{0}(x_{0})dx_{0}(pdy)_{1\sim T}.\end{split}

Here dpdp is the differential of pp, (pdy)1T:=p(y1)dy1p(yT)dyT(pdy)_{1\sim T}:=p(y_{1})dy_{1}\cdots p(y_{T})dy_{T} is the independent distribution of YnY_{n}’s, and xmx_{m} is a function of dummy variables x0,y1,,ymx_{0},y_{1},\cdots,y_{m}.

Proof.

Note that Φ\Phi is fixed, use equation 6, and let m=Tn1m=T-n-1, we get

δΦ(xT)hT(xT)𝑑xT=Φ(xT)(δhT)(xT)𝑑xT=Φ(xT)m=0T1((LpLf)Tm1δ(LpLfγ)(LpLf)mh0)(xT)dxT.\begin{split}\delta\int\Phi(x_{T})h_{T}(x_{T})dx_{T}=\int\Phi(x_{T})(\delta h_{T})(x_{T})dx_{T}\\ =\int\Phi(x_{T})\sum_{m=0}^{T-1}\left((L_{p}L_{f})^{T-m-1}\,\delta(L_{p}L_{f_{\gamma}})\,(L_{p}L_{f})^{m}h_{0}\right)(x_{T})dx_{T}.\end{split}

For each mm, first apply section 2.4 several times,

Φ(xT)((LpLf)Tm1δ(LpLfγ)(LpLf)mh0)(xT)𝑑xT=Φ(xT)(δ(LpLfγ)(LpLf)mh0)(xm+1)𝑑xm+1(pdy)m+2T\begin{split}\int\Phi(x_{T})\left((L_{p}L_{f})^{T-m-1}\,\delta(L_{p}L_{f_{\gamma}})\,(L_{p}L_{f})^{m}h_{0}\right)(x_{T})dx_{T}\\ =\int\Phi(x_{T})\left(\delta(L_{p}L_{f_{\gamma}})\,(L_{p}L_{f})^{m}h_{0}\right)(x_{m+1})\,dx_{m+1}(pdy)_{m+2\sim T}\end{split}

Here xTx_{T} is a function of xm+1,ym+2,,yTx_{m+1},y_{m+2},\cdots,y_{T}. Then apply Remark once,

=Φ(xT)δfγxmdpp(ym+1)((LpLf)mh0)(xm)𝑑xm(pdy)m+1T\begin{split}=-\int\Phi(x_{T})\delta f_{\gamma}x_{m}\cdot\frac{dp}{p}(y_{m+1})\left((L_{p}L_{f})^{m}h_{0}\right)(x_{m})\,dx_{m}(pdy)_{m+1\sim T}\end{split}

Now xTx_{T} is a function of xm,ym+1,,yTx_{m},y_{m+1},\cdots,y_{T}. Then apply section 2.4 several times again,

=Φ(xT)δfγxmdpp(ym+1)h0(x0)𝑑x0(pdy)1T\begin{split}=-\int\Phi(x_{T})\delta f_{\gamma}x_{m}\cdot\frac{dp}{p}(y_{m+1})h_{0}(x_{0})\,dx_{0}(pdy)_{1\sim T}\end{split}

Then sum over mm to prove the theorem. ∎
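For concreteness (this specific choice of noise is ours, not required by the theorem), if $p$ is an isotropic Gaussian with standard deviation $\sigma$, then

\begin{split}p(y)=(2\pi\sigma^{2})^{-M/2}\exp\left(-\frac{|y|^{2}}{2\sigma^{2}}\right),\qquad dp(y)=-\frac{y}{\sigma^{2}}\,p(y),\qquad\frac{dp}{p}(y)=-\frac{y}{\sigma^{2}},\end{split}

so each term in the sum is $\delta f_\gamma x_m\cdot\frac{dp}{p}(y_{m+1})=-\frac{1}{\sigma^2}\,\delta f_\gamma(x_m)\cdot y_{m+1}$, which is cheap to accumulate along sampled orbits.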

Note that subtracting any constant from $\Phi$ does not change the linear response; we prove a more detailed statement in the Remark below. Hence, we can centralize $\Phi$, i.e. replace $\Phi(\cdot)$ by $\Phi(\cdot)-\Phi_{avg,T}$, where the constant

Φavg,T:=Φ(xT)hT(xT)𝑑xT.\begin{split}\Phi_{avg,T}:=\int\Phi(x_{T})h_{T}(x_{T})dx_{T}.\end{split}

We sometimes centralize by subtracting Φavg:=Φh𝑑x\Phi_{avg}:=\int\Phi hdx. The centralization reduces the amplitude of the integrand, so the Monte-Carlo method converges faster: this is also a known result and was discussed in for example [18].

{lemma}

[free centralization] Under assumption 1, for any m0m\geq 0, we have

δfγxmdpp(ym+1)h0(x0)𝑑x0(pdy)1m+1=0.\begin{split}\int\delta f_{\gamma}x_{m}\cdot\frac{dp}{p}(y_{m+1})h_{0}(x_{0})dx_{0}(pdy)_{1\sim m+1}=0.\end{split}
Proof.

First note that a density function $p$ on $\mathbb{R}^M$ with a bounded $C^2$ norm must have $\lim_{y\rightarrow\infty}p(y)=0$, since otherwise its integral could not be one.

Then notice that xmx_{m} is a function of dummy variables x0,y1,,ymx_{0},y_{1},\cdots,y_{m}, so we can first integrate

δfγxmdpp(ym+1)p(ym+1)𝑑ym+1=δfγxm𝑑p(ym+1)𝑑ym+1=0.\begin{split}\delta f_{\gamma}x_{m}\cdot\int\frac{dp}{p}(y_{m+1})p(y_{m+1})dy_{m+1}=\delta f_{\gamma}x_{m}\cdot\int dp(y_{m+1})dy_{m+1}=0.\end{split}

since $p(y)\rightarrow 0$ as $y\rightarrow\infty$. Here $dp$ is the differential of the function $p$, whereas $dy$ indicates the integration. ∎
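A quick numerical sanity check of this free centralization, for the hypothetical case of Gaussian noise where $\frac{dp}{p}(y)=-y/\sigma^2$:

```python
import numpy as np

# Check that E[dp/p(Y)] = 0 when Y ~ p; here p is Gaussian with std sigma.
rng = np.random.default_rng(4)
sigma = 0.3
Y = sigma * rng.standard_normal(10**7)
print(np.mean(-Y / sigma**2))   # ~ 0, up to Monte-Carlo error of order 1/(sigma*sqrt(10^7))
```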

For the convenience of computer coding, we explicitly rewrite this theorem into centralized and time-inhomogeneous form, where Φ\Phi, ff, and pp are different for each step. This is the setting for many applications, such as finitely-deep neural networks. The following proposition can be proved similarly to our previous proofs. Note that we can reuse a lot of data should we also care about the perturbation of the averaged Φn\Phi_{n} for other layers 1nT1\leq n\leq T.

In the following proposition, let ΦT:MT\Phi_{T}:\mathbb{R}^{M_{T}}\rightarrow\mathbb{R} be the observable function defined on the last layer of dynamics. Let hTh_{T} be the pushforward measure given by the dynamics in equation 2, that is, defined recursively by hn:=LpnLfγ,n1hn1h_{n}:=L_{p_{n}}L_{f_{\gamma,n-1}}h_{n-1}; Let (pdy)1T:=p1(y1)dy1pT(yT)dyT(pdy)_{1\sim T}:=p_{1}(y_{1})dy_{1}\cdots p_{T}(y_{T})dy_{T} be the independent but not necessarily identical distribution of YnY_{n}’s.

{proposition}

[centralized kernel-differentiation formula for δhT\delta h_{T} of time-inhomogeneous systems] If fγ,nf_{\gamma,n} and pnp_{n} satisfy assumption 1 for all nn, then

δΦT(xT)hT(xT)𝑑xT=(ΦT(xT)Φavg,T)m=0T1(δfγ,mxmdpp(ym+1))h0(x0)dx0(pdy)1T.\begin{split}&\delta\int\Phi_{T}(x_{T})h_{T}(x_{T})dx_{T}\\ =&-\int\left(\Phi_{T}(x_{T})-\Phi_{avg,T}\right)\sum_{m=0}^{T-1}\left(\delta f_{\gamma,m}x_{m}\cdot\frac{dp}{p}(y_{m+1})\right)h_{0}(x_{0})dx_{0}(pdy)_{1\sim T}.\end{split}

Here Φavg,T:=ΦThT\Phi_{avg,T}:=\int\Phi_{T}h_{T}.
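As a concrete but entirely hypothetical illustration of this proposition, the sketch below estimates $\delta\int\Phi_T h_T$ for a small noisy layered map $x_{n+1}=\tanh(W_nx_n)+\gamma b_n+Y_{n+1}$ with isotropic Gaussian noise, so that $\delta f_{\gamma,n}=b_n$ and $\frac{dp}{p}(y)=-y/\sigma^2$; the weights, widths, and observable are placeholders, not those of the network in section 4.3.

```python
import numpy as np

# Centralized time-inhomogeneous estimator (proposition above) for a toy layered map.
rng = np.random.default_rng(5)
M, T, L, sigma = 9, 5, 10**5, 0.1
W = [rng.standard_normal((M, M)) / np.sqrt(M) for _ in range(T)]   # hypothetical layer weights
b = [rng.standard_normal(M) for _ in range(T)]                     # delta f_{gamma,n} = b_n
Phi_T = lambda x: x[:, 0]                                          # observable on the last layer

x = rng.standard_normal((L, M))        # x_{l,0} ~ h_0 (hypothetical Gaussian initial density)
S = np.zeros(L)                        # accumulates sum_m delta f_{gamma,m} . dp/p(y_{m+1})
for n in range(T):
    y = sigma * rng.standard_normal((L, M))
    S += (b[n] * (-y / sigma**2)).sum(axis=1)
    x = np.tanh(x @ W[n].T) + y        # unperturbed (gamma = 0) orbit
phi = Phi_T(x)
print(-np.mean((phi - phi.mean()) * S))   # estimates the linear response of E[Phi_T]
```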

2.5. Ergodic kernel-differentiation formula over infinitely many steps

For the perturbation of physical measures, the kernel-differentiation formula for δh\delta h can further take the form of long-time average on an orbit. This reduces the magnitude of the integrand so the convergence is faster. Here we can apply the ergodic theorem, forget details of initial distribution, and sample hh by points from one long orbit. Note that this subsection requires that the dynamics is time-homogeneous (ff and pp are the same for each step) to invoke ergodic theorems.

{lemma}

[kernel-differentiation formula for physical measures] Under assumptions 1 and 2,

δΦ(x)h(x)𝑑x=Φ(x)δh(x)𝑑x=limWn=1WΦ(xn)(δfγx0dpp(y1))hγ=0(x0)𝑑x0(pdy)1n\begin{split}\delta\int\Phi(x)h(x)dx=\int\Phi(x)\delta h(x)dx\\ =-\lim_{W\rightarrow\infty}\sum_{n=1}^{W}\int\Phi(x_{n})\left(\delta f_{\gamma}x_{0}\cdot\frac{dp}{p}(y_{1})\right)h_{\gamma=0}(x_{0})dx_{0}(pdy)_{1\sim n}\end{split}
Proof.

By assumption 2, we can start from $h_{\gamma=0}$, apply the perturbed dynamics $f_{\gamma\neq 0}$ many times, and $h_T:=(L_pL_{f_{\gamma\neq 0}})^Th_{\gamma=0}$ converges to $h_\gamma$.

By Remark, for any TT,

δΦ(xT)hT(xT)𝑑xT=m=0T1Φ(xT)(δfγxmdpp(ym+1))hγ=0(x0)𝑑x0(pdy)1T.\begin{split}\delta\int\Phi(x_{T})h_{T}(x_{T})dx_{T}=-\sum_{m=0}^{T-1}\int\Phi(x_{T})\left(\delta f_{\gamma}x_{m}\cdot\frac{dp}{p}(y_{m+1})\right)h_{\gamma=0}(x_{0})dx_{0}(pdy)_{1\sim T}.\end{split}

Note that hγ=0h_{\gamma=0} is invariant measure, so the expression

=m=0T1Φ(xTm)(δfγx0dpp(y1))hγ=0(x0)𝑑x0(pdy)1Tm\begin{split}=-\sum_{m=0}^{T-1}\int\Phi(x_{T-m})\left(\delta f_{\gamma}x_{0}\cdot\frac{dp}{p}(y_{1})\right)h_{\gamma=0}(x_{0})dx_{0}(pdy)_{1\sim T-m}\end{split}

Passing TmT-m to nn,

=n=1TΦ(xn)(δfγx0dpp(y1))hγ=0(x0)𝑑x0(pdy)1n\begin{split}=-\sum_{n=1}^{T}\int\Phi(x_{n})\left(\delta f_{\gamma}x_{0}\cdot\frac{dp}{p}(y_{1})\right)h_{\gamma=0}(x_{0})dx_{0}(pdy)_{1\sim n}\end{split}

Note that this is the correlation function between Φ(xTm)\Phi(x_{T-m}) and δfγx0dpp(y1)\delta f_{\gamma}x_{0}\cdot\frac{dp}{p}(y_{1}). By Remark, 𝔼[δfγX0dpp(Y1)]=0\mathbb{E}[\delta f_{\gamma}X_{0}\cdot\frac{dp}{p}(Y_{1})]=0. By assumption 2, the derivatives expressed by the above summations uniformly converge to the expression in the lemma. Hence, the limit equals δh\delta h. ∎

We now prove the theorem stated in section 1.2.

Remark.

Recall that WW is determined by the rate of decay of correlations, whereas LL is the number of averaging samples. Typically WLW\ll L in numerics.

Proof.

Since hh is invariant for the dynamic Xn+1=f(Xn)+Yn+1X_{n+1}=f(X_{n})+Y_{n+1}, if X0hX_{0}\sim h, then {Xl,Yl+1,,Yl+n}l0\{X_{l},Y_{l+1},\cdots,Y_{l+n}\}_{l\geq 0} is a stationary sequence, so we can apply Birkhoff’s ergodic theorem (the version for stationary sequences),

Φ(xn)(δfγx0dpp(y1))hγ=0(x0)𝑑x0(pdy)1n=a.s.limL1Ll=1LΦ(Xn+l)δfγXldpp(Y1+l).\begin{split}\int\Phi(x_{n})\left(\delta f_{\gamma}x_{0}\cdot\frac{dp}{p}(y_{1})\right)h_{\gamma=0}(x_{0})dx_{0}(pdy)_{1\sim n}\\ \overset{\mathrm{a.s.}}{=}\lim_{L\rightarrow\infty}\frac{1}{L}\sum_{l=1}^{L}\Phi(X_{n+l})\,\delta f_{\gamma}X_{l}\cdot\frac{dp}{p}(Y_{1+l}).\end{split}

By substitution we have

δΦ(x)h(x)𝑑x=a.s.limWlimLn=1W1Ll=1LΦ(Xn+l)δfγXldpp(Y1+l)\begin{split}\delta\int\Phi(x)h(x)dx\,\overset{\mathrm{a.s.}}{=}\,\lim_{W\rightarrow\infty}\lim_{L\rightarrow\infty}-\sum_{n=1}^{W}\frac{1}{L}\sum_{l=1}^{L}\Phi(X_{n+l})\,\delta f_{\gamma}X_{l}\cdot\frac{dp}{p}(Y_{1+l})\end{split}

Then we can rerun the proof after centralizing Φ\Phi. ∎

2.6. Further generalizations

2.6.1. pp depends on zz and γ\gamma

Our pictorial intuition does not care whether pp depends on γ\gamma or zz. So we can generalize all of our results to the case

Xn+1=f(Xn)+Yn+1,where(Yn+1=y|f(Xn)=z)=pγ,z(y).\begin{split}X_{n+1}=f(X_{n})+Y_{n+1},\quad\textnormal{where}\quad\mathbb{P}(Y_{n+1}=y\,|\,f(X_{n})=z)=p_{\gamma,z}(y).\end{split}

We write down these formulas without repeating proofs.

Equation 4 and equation 5 become

Lpγq(x)=q(xy)pγ,xy(y)𝑑y=q(z)pγ,z(xz)𝑑z.Φ(x)Lpγq(x)𝑑x=Φ(y+z)q(z)pγ,z(y)𝑑z𝑑y.\begin{split}L_{p_{\gamma}}q(x)=\int q(x-y)p_{\gamma,x-y}(y)dy=\int q(z)p_{\gamma,z}(x-z)dz.\\ \int\Phi(x)L_{p_{\gamma}}q(x)dx=\iint\Phi(y+z)q(z)p_{\gamma,z}(y)dzdy.\end{split}

Section 2.4 becomes

Φ(xn)((LpγLf)nh0)(xn)𝑑xn=Φ(xn)h0(x0)𝑑x0pγ,z1(y1)𝑑y1pγ,zn(yn)𝑑yn.\begin{split}\int\Phi(x_{n})((L_{p_{\gamma}}L_{f})^{n}h_{0})(x_{n})dx_{n}=\int\Phi(x_{n})h_{0}(x_{0})dx_{0}p_{\gamma,z_{1}}(y_{1})dy_{1}\cdots p_{\gamma,z_{n}}(y_{n})dy_{n}.\end{split}

Here zmz_{m} and xmx_{m} on the right are recursively defined by zm=f(xm1)z_{m}=f(x_{m-1}), xm=zm+ymx_{m}=z_{m}+y_{m}. We also have the pointwise formula

(LpγLfγh)(x1)=h(x0)pγ,fγx0(x1fγx0)𝑑x0.\begin{split}(L_{p_{\gamma}}L_{f_{\gamma}}h)(x_{1})=\int h(x_{0})p_{\gamma,f_{\gamma}x_{0}}(x_{1}-f_{\gamma}x_{0})dx_{0}.\end{split}

Since now pp depends on γ\gamma via three ways, section 2.2 becomes

δ(LpγLfγh)(x1)=dpdγ(x0,y1)h(x0)𝑑x0,wherez1=fx0,y1=x1z1,dpdγ(x0,y1):=ddγpγ,fγx0(x1fγx0)=δp(z1,y1)+δfγ(x0)(pzpy)(z1,y1).\begin{split}\delta(L_{p_{\gamma}}L_{f_{\gamma}}h)(x_{1})=\int\frac{dp}{d\gamma}(x_{0},y_{1})h(x_{0})dx_{0},\quad\textnormal{where}\quad z_{1}=fx_{0},\quad y_{1}=x_{1}-z_{1},\\ \frac{dp}{d\gamma}(x_{0},y_{1}):=\frac{d}{d\gamma}p_{\gamma,f_{\gamma}x_{0}}(x_{1}-f_{\gamma}x_{0})=\delta p(z_{1},y_{1})+\delta f_{\gamma}(x_{0})\cdot\left(\frac{\partial p}{\partial z}-\frac{\partial p}{\partial y}\right)(z_{1},y_{1}).\end{split}

Here derivatives pz\frac{\partial p}{\partial z} and py\frac{\partial p}{\partial y} refer to writing the density as pγ,z(y)p_{\gamma,z}(y). If pp does not depend on γ\gamma and zz, then we recover section 2.2. Remark becomes

δΦ(x1)(LpLfγh)(x1)𝑑x1=Φ(x1)δ(LpLfγh)(x1)𝑑x1=Φ(x1)dppdγ(x0,y1)h(x0)𝑑x0p(y1)𝑑y1.\begin{split}\delta\int\Phi(x_{1})(L_{p}L_{f_{\gamma}}h)(x_{1})dx_{1}=\int\Phi(x_{1})\delta(L_{p}L_{f_{\gamma}}h)(x_{1})dx_{1}\\ =\iint\Phi(x_{1})\frac{dp}{pd\gamma}(x_{0},y_{1})h(x_{0})dx_{0}\,p(y_{1})dy_{1}.\end{split}

Here

dppdγ(xm,ym+1):=1pfxm(ym+1)ddγpγ,fγxm(xm+1fγxm)\begin{split}\frac{dp}{pd\gamma}(x_{m},y_{m+1}):=\frac{1}{p_{fx_{m}}(y_{m+1})}\frac{d}{d\gamma}p_{\gamma,f_{\gamma}x_{m}}(x_{m+1}-f_{\gamma}x_{m})\end{split}

Remark becomes

δΦ(xT)hT(xT)𝑑xT=Φ(xT)m=0T1dppdγ(xm,ym+1)h0(x0)dx0(pzdy)1T.\begin{split}\delta\int\Phi(x_{T})h_{T}(x_{T})dx_{T}=\int\Phi(x_{T})\sum_{m=0}^{T-1}\frac{dp}{pd\gamma}(x_{m},y_{m+1})h_{0}(x_{0})dx_{0}(p_{z}dy)_{1\sim T}.\end{split}

Here (pzdy)1T:=pz1(y1)dy1pzT(yT)dyT(p_{z}dy)_{1\sim T}:=p_{z_{1}}(y_{1})dy_{1}\cdots p_{z_{T}}(y_{T})dy_{T} is the distribution of YnY_{n}’s; zm+1=fxmz_{m+1}=fx_{m} is a function of dummy variables x0,y1,,ymx_{0},y_{1},\cdots,y_{m}.

We still have free centralization for Φ\Phi. So Remark and section 1.2 still hold accordingly, in particular, when the system is time-homogeneous and has physical measure hh, we still have

δΦ(x)h(x)𝑑x=a.s.limWlimL1Ln=1Wl=1L(Φ(Xn+l)Φavg)dppdγ(Xl,Y1+l).\begin{split}\delta\int\Phi(x)h(x)dx\,\overset{\mathrm{a.s.}}{=}\,\lim_{W\rightarrow\infty}\lim_{L\rightarrow\infty}-\frac{1}{L}\sum_{n=1}^{W}\sum_{l=1}^{L}\left(\Phi(X_{n+l})-\Phi_{avg}\right)\,\frac{dp}{pd\gamma}(X_{l},Y_{1+l}).\end{split}
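For example (an illustrative choice of ours), suppose $f$ does not depend on $\gamma$ and the noise is Gaussian with a $\gamma$-shifted mean and fixed covariance, $p_{\gamma,z}(y)\propto\exp(-|y-\gamma a|^{2}/(2\sigma^{2}))$ for a fixed vector $a$; then

\begin{split}\frac{dp}{pd\gamma}(x_{m},y_{m+1})=\frac{a\cdot y_{m+1}}{\sigma^{2}},\end{split}

which is exactly the weight $\frac{dp}{pd\gamma}(X_l,Y_{1+l})$ appearing in the formula above, and it is sampled along one orbit just as in section 2.5.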

2.6.2. General random dynamical systems

For general random dynamical systems, at each step $n$ we randomly select a map $g$ from a family of maps; denote this random map by $G$. The dynamics is now

Xn+1=G(Xn).\begin{split}X_{n+1}=G(X_{n}).\end{split}

The selections of $G$ are independent among different $n$'s. It is not hard to see that our pictorial intuition still applies, so our work generalizes to this case.

We can also formally transform this general case to the case in the previous subsubsection. To do so, define the deterministic map ff

f(x):=𝔼[G(x)],\begin{split}f(x):=\mathbb{E}[G(x)],\end{split}

where the expectation is with respect to the randomness of $G$. Then the dynamics equals in distribution

Xn+1=f(Xn)+Yn+1,whereYn+1=G(Xn)f(Xn).\begin{split}X_{n+1}=f(X_{n})+Y_{n+1},\quad\textnormal{where}\quad Y_{n+1}=G(X_{n})-f(X_{n}).\end{split}

Hence the distribution of Yn+1Y_{n+1} is completely determined by XnX_{n}. Note that we still need to compute derivatives of the distribution of YY, or GG.

3. Foliated noise and perturbation

This section shows that the kernel-differentiation formulas are still correct for the case where the noise and δF\delta F are along a given foliation.

3.1. Preparation: measure notations, foliations, conditional densities

We assume that a neighborhood of the support of the physical measure $\mu$ can be foliated by a single chart: this basically means that we can partition the neighborhood by co-dimension-$c$ submanifolds. More specifically, there is a family $F=\{F_\alpha\}_{\alpha\in A}$ of co-dimension-$c$ $C^3$ submanifolds of $\mathcal{M}$, and a $C^3$ (we are not careful about the degree of differentiability) diffeomorphism between the neighborhood and an open set in $\mathbb{R}^M$, such that the image of $F_\alpha$ is the $c$-dimensional horizontal plane, $\{x:x_1=\cdots=x_{M-c}=0\}$. Each submanifold is called a ‘leaf’. An extension to the case where the neighborhood admits not a single chart but a foliated atlas composed of many charts is possible, and we leave it to the future.

Under single-chart foliation, this section considers the system

Zn=f(Xn),Z~n=f~(Zn),Xn+1=Xn+1(γ,Z~n).Z_{n}=f(X_{n})\,,\quad{\tilde{Z}}_{n}={\tilde{f}}(Z_{n})\,,\quad X_{n+1}=X_{n+1}(\gamma,{\tilde{Z}}_{n}).

The last equation just means that Xn+1X_{n+1} is a random variable depending on γ\gamma and Z~n{\tilde{Z}}_{n}, with details to be explained in the following paragraphs. The notation in the first equation relates to our previous notation when ff is a fixed diffeomorphism and fγ=f~ff_{\gamma}={\tilde{f}}\circ f.

We assume that XX and f~{\tilde{f}} are parallel to FαF_{\alpha}. More specifically, for any γ\gamma in a small interval on \mathbb{R} and any zz\in\mathcal{M},

z~=f~(z)Fα(z);X(γ,z~)Fα(z).{\tilde{z}}={\tilde{f}}(z)\in F_{\alpha}(z)\;;\quad X(\gamma,{\tilde{z}})\in F_{\alpha}(z).

where $F_\alpha(z)$ is the leaf $F_\alpha$ passing through $z$. Note that we do not require $f$ to be constrained by the foliation, so $x$ and $fx$ can be on different leaves.

The measures considered in this section are not necessarily absolutely continuous with respect to Lebesgue, so we use another set of notations for measures. The transfer operator on measures is denoted by $f_*$; more specifically, for any measure $\nu$,

Φ(x)d(fν)(x):=Φ(fx)𝑑ν(x).\begin{split}\int\Phi(x)\,d(f_{*}\nu)(x):=\int\Phi(fx)d\nu(x).\end{split}

We can disintegrate a measure $\nu$ into conditional measures $[\nu]^F_\alpha$ on each leaf and the quotient measure $\nu'$ on the set of $\alpha$'s. Since we have only one fixed foliation, we shall just omit the superscript $F$. Here $\{[\nu]_\alpha\}_{\alpha\in A}$ is a family of probability measures such that each ‘lives on’ $F_\alpha$. Moreover, the integration of any smooth observable $\varphi$ can be written as two consecutive integrations,

φ(z)𝑑ν(z)=AFαφ(z)d[ν]α(z)𝑑ν(α).\int_{\mathcal{M}}\varphi(z)d\nu(z)=\int_{A}\int_{F_{\alpha}}\varphi(z)d[\nu]_{\alpha}(z)d\nu^{\prime}(\alpha).

We typically use ν\nu to denote the measure of ZZ.

We assume that the random variable X(γ,z~)X(\gamma,{\tilde{z}}) has a C3C^{3} density function p(γ,z~,)p(\gamma,{\tilde{z}},\cdot) with respect to the Lebesgue measure on Fα(z)F_{\alpha}(z). That is, pp is the conditional density of X|Z~X|{\tilde{Z}}. Then we can define the operator pp_{*}, which transfers ν~{\tilde{\nu}}, the measure of Z~{\tilde{Z}}, to μ\mu, the measure of XX. It transports measures only within each leaf, but not across different leaves, so the quotient measure ν~{\tilde{\nu}}^{\prime} is left unchanged. On the other hand, the density of the conditional measure is changed to

(9) [pν~]αdst(x):=d[pν~]αdm(x)=Fαp(γ,z~,x)d[ν~]α(z~),[p_{*}{\tilde{\nu}}]^{\textrm{dst}}_{\alpha}(x):=\frac{d[p_{*}{\tilde{\nu}}]_{\alpha}}{dm}(x)=\int_{F_{\alpha}}p(\gamma,{\tilde{z}},x)d[{\tilde{\nu}}]_{\alpha}({\tilde{z}}),

where d()/dmd(\cdot)/dm is the Radon–Nikodym derivative with respect to the Lebesgue measure on FαF_{\alpha}. Note that [ν~]α[{\tilde{\nu}}]_{\alpha} does not necessarily have a density on FαF_{\alpha}. We have the double and triple integration formulas for the expectation of joint distribution

(10) 𝔼[φ(X,Z~)]=z~xFα(z~)φ(x,z~)p(γ,z~,x)𝑑x𝑑ν~(z~)=αAxFαz~Fαφ(x,z~)p(γ,z~,x)d[ν~]α(z~)𝑑xd[ν~](α)\begin{split}\mathbb{E}[\varphi(X,{\tilde{Z}})]=\int_{{\tilde{z}}\in\mathcal{M}}\int_{x\in F_{\alpha}({\tilde{z}})}\varphi(x,{\tilde{z}})p(\gamma,{\tilde{z}},x)dxd{\tilde{\nu}}({\tilde{z}})\\ =\int_{\alpha\in A}\int_{x\in F_{\alpha}}\int_{{\tilde{z}}\in F_{\alpha}}\varphi(x,{\tilde{z}})p(\gamma,{\tilde{z}},x)d[{\tilde{\nu}}]_{\alpha}({\tilde{z}})dxd[{\tilde{\nu}}]^{\prime}(\alpha)\end{split}

To summarize, we shall assume the following in this section.

Assumption 3.

Assume that a neighborhood of the support of the measures of interest is foliated by a C3C^{3} family of C3C^{3} leaves {Fα}αA\{F_{\alpha}\}_{\alpha\in A}. For any zz\in\mathcal{M}, f~(z)Fα(z){\tilde{f}}(z)\in F_{\alpha}(z), X(γ,z~)Fα(z~)X(\gamma,{\tilde{z}})\in F_{\alpha}({\tilde{z}}). The density pp is C2C^{2} on FαF_{\alpha} (but is singular with respect to \mathcal{M}).

For the case of multiple steps, we shall denote the sequence of measures of XnX_{n} by

μn:=(pf~f)nμ0,μ:=limnμn,\mu_{n}:=(p_{*}{\tilde{f}}_{*}f_{*})^{n}\mu_{0}\;,\quad\mu:=\lim_{n\rightarrow\infty}\mu_{n},

where the limit is in the weak star sense, and μ\mu is called the physical measure. We shall still make assumption 2. However, since the physical measure can be singular, it is more difficult to prove assumption 2 from some more basic assumptions; we did not find accurate references for this purpose.

3.2. Foliated kernel-differentiation in one step

We can see that the pictorial intuition in section 2.3 still makes sense on individual leaves. So it is not surprising that the main theorems are still effective. We shall first prove the formula on one leaf over one-step, which is the foliated version of sections 2.2 and Remark. Note that here one-step means to transfer the measure by f~{\tilde{f}}, so the foliation is unchanged.

{lemma}

[local one-step kernel-differentiation on a leaf, comparable to section 2.2] Under assumptions 1 and 3, for any measure ν\nu,

δ[pf~ν]αdst(x)=Fαδf~(z)dzp(z,x)d[ν]α(z)\delta[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)=\int_{F_{\alpha}}\delta{\tilde{f}}(z)\cdot d_{z}p(z,x)d[\nu]_{\alpha}(z)

Here δ:=/γ|γ=0\delta:=\partial/\partial\gamma|_{\gamma=0}, and the integrand is the derivative of p(γ,z,x)p(\gamma,z,x) with respect to zz in the direction δf~(z)\delta{\tilde{f}}(z). Here p(z,x)=p(γ=0,z,x)p(z,x)=p(\gamma=0,z,x) by default.

Remark.

Note that this differs from section 2.2 by a minus sign, since $p(z,x)=p(x-z)$, so taking derivatives in $z$ gives a minus sign.

Proof.

By equation 9,

(11) [pf~ν]αdst(x)=Fαp(γ,z~,x)d[f~ν]α(z~),[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)=\int_{F_{\alpha}}p(\gamma,{\tilde{z}},x)d[{\tilde{f}}_{*}\nu]_{\alpha}({\tilde{z}}),

We give a formula for d[f~ν]αd[{\tilde{f}}_{*}\nu]_{\alpha}. By definition of pp_{*} and conditional densities, for any smooth observable function φ\varphi,

AFαφ(z~)d[f~ν]α(z~)d[f~ν](α)=φ(z~)𝑑f~ν(z~)=φ(f~z)𝑑ν(z)=AFαφ(f~z)d[ν]α(z)d[ν](α).\begin{split}\int_{A}\int_{F_{\alpha}}\varphi({\tilde{z}})d[{\tilde{f}}_{*}\nu]_{\alpha}({\tilde{z}})d[{\tilde{f}}_{*}\nu]^{\prime}(\alpha)=\int_{\mathcal{M}}\varphi({\tilde{z}})d{\tilde{f}}_{*}\nu({\tilde{z}})\\ =\int_{\mathcal{M}}\varphi({\tilde{f}}z)d\nu(z)=\int_{A}\int_{F_{\alpha}}\varphi({\tilde{f}}z)d[\nu]_{\alpha}(z)d[\nu]^{\prime}(\alpha).\end{split}

Since ${\tilde{f}}_*$ preserves the measure within each leaf, the quotient measure $\nu'=[{\tilde{f}}_*\nu]'$ is unchanged. Hence,

Fαφ(z~)d[f~ν]α(z~)=Fαφ(f~z)d[ν]α(z).\int_{F_{\alpha}}\varphi({\tilde{z}})d[{\tilde{f}}_{*}\nu]_{\alpha}({\tilde{z}})=\int_{F_{\alpha}}\varphi({\tilde{f}}z)d[\nu]_{\alpha}(z).

In short, we find that pushing forward by ${\tilde{f}}$ commutes with taking conditional measures.

Substituting back to equation 11, we get

[pf~ν]αdst(x)=Fαp(γ,z~,x)d[f~ν]α(z~)=Fαp(γ,f~z,x)d[ν]α(z)[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)=\int_{F_{\alpha}}p(\gamma,{\tilde{z}},x)d[{\tilde{f}}_{*}\nu]_{\alpha}({\tilde{z}})=\int_{F_{\alpha}}p(\gamma,{\tilde{f}}z,x)d[\nu]_{\alpha}(z)

Then we can differentiate by γ\gamma and evaluate at γ=0\gamma=0 to prove the lemma; note that z~=z{\tilde{z}}=z at γ=0\gamma=0. ∎

{lemma}

[integrated one-step foliated kernel-differentiation, comparable to Remark] Under assumptions 1 and 3, for any fixed measure ν\nu and any smooth observable Φ\Phi,

δΦ(x)d(pf~ν)(x)=zxFα(z)Φ(x)δf~(z)dzp(z,x)p(z,x)p(z,x)𝑑x𝑑ν(z).\begin{split}\delta\int\Phi(x)d(p_{*}{\tilde{f}}_{*}\nu)(x)=\int_{z\in\mathcal{M}}\int_{x\in F_{\alpha}(z)}\Phi(x)\delta{\tilde{f}}(z)\cdot\frac{d_{z}p(z,x)}{p(z,x)}p(z,x)dxd\nu(z).\end{split}
Remark.

See equation 10 for the probability notation which might help understanding.

Proof.

By the definition of conditional measure,

Φ(x)d(pf~ν)(x)=AFαΦ(x)d[pf~ν]α(x)d[pf~ν](α)\begin{split}\int\Phi(x)d(p_{*}{\tilde{f}}_{*}\nu)(x)=\int_{A}\int_{F_{\alpha}}\Phi(x)d[p_{*}{\tilde{f}}_{*}\nu]_{\alpha}(x)d[p_{*}{\tilde{f}}_{*}\nu]^{\prime}(\alpha)\end{split}

Since both pp_{*} and f~{\tilde{f}}_{*} preserve the measure within each leaf, [pf~ν]=[ν][p_{*}{\tilde{f}}_{*}\nu]^{\prime}=[\nu]^{\prime} is not affected by γ\gamma. Further express the conditional measure by integrating the density, we get

(12) Φ(x)d(pf~ν)(x)=AFαΦ(x)[pf~ν]αdst(x)𝑑xd[ν](α)\int\Phi(x)d(p_{*}{\tilde{f}}_{*}\nu)(x)=\int_{A}\int_{F_{\alpha}}\Phi(x)[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)dxd[\nu]^{\prime}(\alpha)

Differentiate on both sides

δΦ(x)d(pf~ν)(x)=δAFαΦ(x)[pf~ν]αdst(x)𝑑xd[ν](α)\begin{split}\delta\int\Phi(x)d(p_{*}{\tilde{f}}_{*}\nu)(x)=\delta\int_{A}\int_{F_{\alpha}}\Phi(x)[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)dxd[\nu]^{\prime}(\alpha)\\ \end{split}

Since every function in the integrand is bounded, we can move δ\delta inside,

=AxFαΦ(x)δ[pf~ν]αdst(x)𝑑xd[ν](α).\begin{split}=\int_{A}\int_{x\in F_{\alpha}}\Phi(x)\delta[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)dxd[\nu]^{\prime}(\alpha).\end{split}

Apply section 3.2,

=αAxFαzFαΦ(x)δf~(z)dzp(z,x)d[ν]α(z)𝑑xd[ν](α).=αAxFαzFαΦ(x)δf~(z)dzp(z,x)p(z,x)d[ν]α(z)p(z,x)𝑑xd[ν](α).\begin{split}=\int_{\alpha\in A}\int_{x\in F_{\alpha}}\int_{z\in F_{\alpha}}\Phi(x)\delta{\tilde{f}}(z)\cdot d_{z}p(z,x)d[\nu]_{\alpha}(z)dxd[\nu]^{\prime}(\alpha).\\ =\int_{\alpha\in A}\int_{x\in F_{\alpha}}\int_{z\in F_{\alpha}}\Phi(x)\delta{\tilde{f}}(z)\cdot\frac{d_{z}p(z,x)}{p(z,x)}d[\nu]_{\alpha}(z)p(z,x)dxd[\nu]^{\prime}(\alpha).\end{split}

Then use equation 10 to transform triple integration to double integration. ∎

3.3. Foliated kernel-differentiation over finitely many steps

This subsection proves results for finitely many steps. Most proofs are similar to their counterparts in section 2.4, so we shall skip the proofs that are repetitive.

{lemma}

[comparable to section 2.4] Under assumptions 1 and 3, let z~m:=f~zm:=f~fxm1{\tilde{z}}_{m}:={\tilde{f}}z_{m}:={\tilde{f}}fx_{m-1}, then for any μ0\mu_{0}, any nn\in\mathbb{N},

Φ(xn)d((pf~f)nμ0)(xn)=x0x1Fα(z~1)xnFα(z~n)Φ(xn)p(z~n,xn)𝑑xnp(z~1,x1)𝑑x1𝑑μ0(x0).\begin{split}\int\Phi(x_{n})d((p_{*}{\tilde{f}}_{*}f_{*})^{n}\mu_{0})(x_{n})\\ =\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F_{\alpha}({\tilde{z}}_{1})}\cdots\int_{x_{n}\in F_{\alpha}({\tilde{z}}_{n})}\Phi(x_{n})p({\tilde{z}}_{n},x_{n})dx_{n}\cdots p({\tilde{z}}_{1},x_{1})dx_{1}d\mu_{0}(x_{0}).\end{split}
Remark.

Note that for the foliated case, the integration domain of xnx_{n} depends on xn1x_{n-1}, so the multiple integration should first integrate xnx_{n} with xn1x_{n-1} fixed.

{theorem}

[foliated kernel-differentiation formula for δμT\delta\mu_{T}, comparable to Remark] Under assumptions 1 and 3,

δΦ(xT)𝑑μT(xT)=x0x1Fα(z1)xTFα(zT)Φ(xT)m=1T(δf~(zm)dzp(zm,xm)p(zm,xm))p(zT,xT)dxTp(z1,x1)dx1dμ0(x0).\begin{split}\delta\int\Phi(x_{T})d\mu_{T}(x_{T})=\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F_{\alpha}(z_{1})}\cdots\int_{x_{T}\in F_{\alpha}(z_{T})}\Phi(x_{T})\\ \sum_{m=1}^{T}\left(\delta{\tilde{f}}(z_{m})\cdot\frac{d_{z}p(z_{m},x_{m})}{p(z_{m},x_{m})}\right)p(z_{T},x_{T})dx_{T}\cdots p(z_{1},x_{1})dx_{1}d\mu_{0}(x_{0}).\end{split}
Proof.

We shall recursively apply Remark. Note that there ν\nu was fixed, but here it is also affected by γ\gamma. First, we differentiate the last step; by Remark,

δΦ(xT)d(pf~fμT1)(xT)=δxT1xTFα(zT)Φ(xT)p(γ,zT,xT)𝑑xT𝑑μ(xT1)=xT1xTFα(zT)Φ(xT)δf~(zT)dzp(zT,xT)p(zT,xT)p(zT,xT)𝑑xT𝑑μ(xT1)+xT1xTFα(zT)Φ(xT)p(zT,xT)𝑑xT𝑑δμ(xT1),\begin{split}\delta\int\Phi(x_{T})d(p_{*}{\tilde{f}}_{*}f_{*}\mu_{T-1})(x_{T})=\delta\int_{x_{T-1}\in\mathcal{M}}\int_{x_{T}\in F_{\alpha}(z_{T})}\Phi(x_{T})p(\gamma,z_{T},x_{T})dx_{T}d\mu(x_{T-1})\\ =\int_{x_{T-1}\in\mathcal{M}}\int_{x_{T}\in F_{\alpha}(z_{T})}\Phi(x_{T})\delta{\tilde{f}}(z_{T})\cdot\frac{d_{z}p(z_{T},x_{T})}{p(z_{T},x_{T})}p(z_{T},x_{T})dx_{T}d\mu(x_{T-1})\\ +\int_{x_{T-1}\in\mathcal{M}}\int_{x_{T}\in F_{\alpha}(z_{T})}\Phi(x_{T})p(z_{T},x_{T})dx_{T}d\delta\mu(x_{T-1}),\end{split}

where p(zT,xT)=p(γ=0,zT,xT)p(z_{T},x_{T})=p(\gamma=0,z_{T},x_{T}) by default. Then we can further apply Remark and 9 to the last term repeatedly until x0x_{0}. ∎

{lemma}

[free centralization, comparable to Remark] Under assumptions 1 and 3, for any 1mT1\leq m\leq T, we have

x0x1Fα(z1)xTFα(zT)(δf~(zm)dzp(zm,xm)p(zm,xm))p(zT,xT)𝑑xTp(z1,x1)𝑑x1𝑑μ0(x0)=0.\begin{split}\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F_{\alpha}(z_{1})}\cdots\int_{x_{T}\in F_{\alpha}(z_{T})}\left(\delta{\tilde{f}}(z_{m})\cdot\frac{d_{z}p(z_{m},x_{m})}{p(z_{m},x_{m})}\right)p(z_{T},x_{T})dx_{T}\cdots p(z_{1},x_{1})dx_{1}d\mu_{0}(x_{0})=0.\end{split}
Proof.

We first perform the inner integrations for the steps later than mm, each of which integrates to one. Hence, the left side of the equation,

=x0x1Fα(z1)xmFα(zm)δf~(zm)dzp(zm,xm)p(zm,xm)p(zm,xm)𝑑xmp(z1,x1)𝑑x1𝑑μ0(x0)=xm1xmFα(zm)δf~(zm)dzp(zm,xm)p(zm,xm)p(zm,xm)𝑑xm𝑑μm1(xm1).\begin{split}=\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F_{\alpha}(z_{1})}\cdots\int_{x_{m}\in F_{\alpha}(z_{m})}\delta{\tilde{f}}(z_{m})\cdot\frac{d_{z}p(z_{m},x_{m})}{p(z_{m},x_{m})}p(z_{m},x_{m})dx_{m}\cdots p(z_{1},x_{1})dx_{1}d\mu_{0}(x_{0})\\ =\int_{x_{m-1}\in\mathcal{M}}\int_{x_{m}\in F_{\alpha}(z_{m})}\delta{\tilde{f}}(z_{m})\cdot\frac{d_{z}p(z_{m},x_{m})}{p(z_{m},x_{m})}p(z_{m},x_{m})dx_{m}d\mu_{m-1}(x_{m-1}).\end{split}

Here μm1\mu_{m-1} is evaluated at γ=0\gamma=0 by default. Then we can use Remark with Φ\Phi set as a constant to prove the lemma. ∎

See 1.2

3.4. Foliated ergodic kernel-differentiation formula over infinitely many steps

This subsection is comparable to section 2.5 and states results for infinitely many steps. We shall skip the proofs since the ideas are the same and only the notations are a bit different.

{lemma}

[foliated kernel-differentiation formula for physical measures, comparable to section 2.5] Under assumptions 1, 2 and 3,

\begin{split}\delta\int\Phi(x)d\mu(x)=\lim_{W\rightarrow\infty}\sum_{n=1}^{W}\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F_{\alpha}(z_{1})}\cdots\int_{x_{n}\in F_{\alpha}(z_{n})}\Phi(x_{n})\\ \left(\delta{\tilde{f}}(z_{1})\cdot\frac{d_{z}p(z_{1},x_{1})}{p(z_{1},x_{1})}\right)p(z_{n},x_{n})dx_{n}\cdots p(z_{1},x_{1})dx_{1}d\mu(x_{0}).\end{split}
{theorem}

[orbitwise foliated kernel-differentiation formula for δh\delta h, comparable to section 1.2] Under assumptions 1, 2 and 3,

\begin{split}\delta\int\Phi(x)d\mu(x)\,\overset{\mathrm{a.s.}}{=}\,\lim_{W\rightarrow\infty}\lim_{L\rightarrow\infty}\frac{1}{L}\sum_{n=1}^{W}\sum_{l=1}^{L}\left(\Phi(X_{n+l})-\Phi_{avg}\right)\delta{\tilde{f}}(z_{l+1})\cdot\frac{d_{z}p(z_{l+1},x_{l+1})}{p(z_{l+1},x_{l+1})}.\end{split}

Here Φavg:=Φ𝑑μ\Phi_{avg}:=\int\Phi d\mu.

4. Kernel-differentiation algorithm

This section gives the procedure lists of the algorithm and demonstrates it on several examples. We no longer label the subscript γ\gamma, and δ\delta can be the derivative evaluated at γ0\gamma\neq 0; the dependence on γ\gamma should be clear from context. The code used in this section is at https://github.com/niangxiu/np.

4.1. Procedure lists

This subsection gives the procedure list of the kernel-differentiation algorithm for δhT\delta h_{T} of a finite-time system, whose formula was derived in Remark. Then we give the procedure list for an infinite-time system, whose formula was derived in section 1.2. The foliated cases use the same algorithm; we only need to change to the suitable definition of dp/pdp/p.

First, we give the algorithm for finite-time systems. Here LL is the number of sample paths, whose initial conditions are generated independently from h0h_{0} (or μ0\mu_{0}, if the measure is singular and does not have a density). This algorithm requires that we already have random number generators for h0h_{0} and pp: this is typically easy for pp since we tend to use a simple pp such as a Gaussian, but we might require a specific h0h_{0}, which calls for more advanced sampling tools.

 

LL, random number generators for densities h0h_{0} and pmp_{m}.
for l=1,,Ll=1,\dots,L do
     Independently generate sample x0x_{0} from X0h0X_{0}\sim h_{0},
     for m=0,,T1m=0,\dots,T-1 do
         Independently generate sample ym+1y_{m+1} from Ym+1pm+1Y_{m+1}\sim p_{m+1}
         Il,m+1δfxl,mdpp(ym+1)I_{l,m+1}\leftarrow\delta fx_{l,m}\cdot\frac{dp}{p}(y_{m+1})
         xl,m+1f(xl,m)+ym+1x_{l,m+1}\leftarrow f(x_{l,m})+y_{m+1}
     end for
     Φl,TΦT(xl,T)\Phi_{l,T}\leftarrow\Phi_{T}(x_{l,T}) \triangleright Φl,T\Phi_{l,T} takes less storage than xl,Tx_{l,T}.
     Slm=1TIl,mS_{l}\leftarrow-\sum_{m=1}^{T}I_{l,m} \triangleright No need to store xmx_{m} or ymy_{m}.
end for
Φavg,T:=ΦThγ,T𝑑x1Ll=1LΦl,T\Phi_{avg,T}:=\int\Phi_{T}h_{\gamma,T}dx\approx\frac{1}{L}\sum_{l=1}^{L}\Phi_{l,T} \triangleright Evaluated at γ=0\gamma=0.
δΦavg,T1Ll=1LSl(ΦT(xl,T)Φavg,T)\delta\Phi_{avg,T}\approx\frac{1}{L}\sum_{l=1}^{L}S_{l}\left(\Phi_{T}(x_{l,T})-\Phi_{avg,T}\right)

 

Then we give the procedure list for the infinite-time case. Here MpreM_{pre} is the number of preparation steps, during which the background measure is evolved, so that x0x_{0} is distributed according to the physical measure hh. Here WW is the decorrelation step number, typically WLW\ll L, where LL is the orbit length. Since here hh is given by the dynamic, we only need to sample the easier density pp.

 

Mpre,WLM_{pre},W\ll L, random number generator for pp.
Take any x0x_{0}.
for m=0,,Mprem=0,\cdots,M_{pre} do \triangleright To land x0x_{0} onto the physical measure.
     Independently generate sample yy from density pp
     x0f(x0)+yx_{0}\leftarrow f(x_{0})+y
end for
for m=0,,W+L1m=0,\cdots,W+L-1 do
     Independently generate sample ym+1y_{m+1} from Ym+1pY_{m+1}\sim p
     Im+1δf~(fxm)dpp(ym+1)I_{m+1}\leftarrow\delta{\tilde{f}}(fx_{m})\cdot\frac{dp}{p}(y_{m+1})
     xm+1f(xm)+ym+1x_{m+1}\leftarrow f(x_{m})+y_{m+1}
     Φm+1Φ(xm+1)\Phi_{m+1}\leftarrow\Phi(x_{m+1})
end for
Φavg:=Φhγ𝑑x1Ll=1LΦl\Phi_{avg}:=\int\Phi h_{\gamma}dx\approx\frac{1}{L}\sum_{l=1}^{L}\Phi_{l}
δΦavg1Ln=1Wl=1L(Φn+lΦavg)Il+1\delta\Phi_{avg}\approx-\frac{1}{L}\sum_{n=1}^{W}\sum_{l=1}^{L}\left(\Phi_{n+l}-\Phi_{avg}\right)I_{l+1}

 

From a utility point of view, kernel-differentiation is an ‘adjoint’ algorithm. Since we do not compute the Jacobian matrix at all, the most expensive operations per step are computing f(xm)f(x_{m}) and δf(xm)\delta f(x_{m}); hence the cost of computing the linear response for the first parameter (i.e. one δf\delta f) is lower than that of adjoint methods. On the other hand, the marginal cost for each new parameter in this algorithm is low, and it equals that of adjoint methods.
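For concreteness, below is a minimal Python sketch of the infinite-time procedure above (a hypothetical illustration, not the code in the linked repository). The user supplies step(x, y) for one step of the dynamic, score(x, y) for the per-step term δf⋅(dp/p)(y) with x the pre-step state (for instance using the Gaussian formula of equation 13 below), the observable Φ, and a sampler for the noise.

import numpy as np

def kernel_diff_ergodic(step, score, Phi, sample_noise, x0, L, W, M_pre=1000):
    # step(x, y): one step of the dynamic, e.g. f(x) + y (possibly with a wrap).
    # score(x, y): the per-step term delta_f(x) . (dp/p)(y), x being the pre-step state.
    # Phi(x): the observable; sample_noise(): one draw of Y ~ p.
    x = x0
    for _ in range(M_pre):              # preparation steps: land x0 on the physical measure
        x = step(x, sample_noise())
    I = np.empty(W + L)                 # I[m] stores I_{m+1} of the procedure list
    Ph = np.empty(W + L)                # Ph[m] stores Phi_{m+1}
    for m in range(W + L):
        y = sample_noise()
        I[m] = score(x, y)
        x = step(x, y)
        Ph[m] = Phi(x)
    Phi_avg = Ph[:L].mean()             # (1/L) sum_{l=1}^{L} Phi_l
    # delta Phi_avg ~= -(1/L) sum_{n=1}^{W} sum_{l=1}^{L} (Phi_{n+l} - Phi_avg) I_{l+1}
    dPhi = -sum(np.mean((Ph[n:n + L] - Phi_avg) * I[1:L + 1]) for n in range(1, W + 1))
    return Phi_avg, dPhi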

We remind readers of some useful formulas when we use Gaussian noise in M\mathbb{R}^{M}. Let the mean be μM\mu\in\mathbb{R}^{M}, the covariance matrix be Σ\Sigma, which is defined by Σi,j:=E[(Yiμi)(Yjμj)]=Cov[Yi,Yj]\Sigma_{i,j}:=\operatorname{E}[(Y_{i}-\mu_{i})(Y_{j}-\mu_{j})]=\operatorname{Cov}[Y_{i},Y_{j}]. Then the density is

\begin{split}p(y)=(2\pi)^{-M/2}\det(\Sigma)^{-1/2}\,\exp\left(-\frac{1}{2}(y-\mu)^{\mathsf{T}}\Sigma^{-1}(y-\mu)\right).\end{split}

Hence we have

dpp(y)=Σ1(yμ)M.\begin{split}\frac{dp}{p}(y)=-\Sigma^{-1}(y-\mu)\in\mathbb{R}^{M}.\end{split}
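Indeed, dp/p is just the gradient of the log-density:

\begin{split}\frac{dp}{p}(y)=d_{y}\log p(y)=-\frac{1}{2}d_{y}\left((y-\mu)^{\mathsf{T}}\Sigma^{-1}(y-\mu)\right)=-\Sigma^{-1}(y-\mu).\end{split}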

Typically we use an isotropic Gaussian, with μ=0\mu=0 and Σ=σ2IM×M\Sigma=\sigma^{2}I_{M\times M}, so we have

(13) dpp(y)=1σ2yM.\begin{split}\frac{dp}{p}(y)=-\frac{1}{\sigma^{2}}y\in\mathbb{R}^{M}.\end{split}

If we consider the co-dimension-cc foliation by planes Fα={x1=α1,,xc=αc}F_{\alpha}=\{x_{1}=\alpha_{1},\ldots,x_{c}=\alpha_{c}\}, the above formula is still valid. For geometers, it may be more natural to write δf~dp\delta{\tilde{f}}\cdot dp together, which is just the directional derivative of the function pp restricted to the leaf.

4.2. Numerical example: tent map with elevating center

We demonstrate the ergodic algorithm over infinitely many steps on the tent map. The dynamics is

Xn+1=fγ(Xn)+Yn+1mod 1,whereYniid𝒩(0,σ2),fγ(x)={γxif0x0.5,γ(1x)otherwise.\begin{split}X_{n+1}=f_{\gamma}(X_{n})+Y_{n+1}\quad\textnormal{mod 1,}\quad\quad\textnormal{where}\quad\\ Y_{n}{\,\overset{\mathrm{iid}}{\sim}\,}{\mathcal{N}}(0,\sigma^{2}),\quad\textnormal{}\quad f_{\gamma}(x)=\begin{cases}\gamma x\quad\text{if}\quad 0\leq x\leq 0.5,\\ \gamma(1-x)\quad\text{otherwise.}\end{cases}\end{split}

In this subsection, we fix the observable

Φ(x)=x\begin{split}\Phi(x)=x\end{split}

Previous linear response algorithms based on algorithms for deterministic systems (such as the fast response, path perturbation, and divergence algorithms) do not work on this example. In fact, Baladi proved that for the deterministic case (σ=0\sigma=0), the L1L^{1} linear response does not exist [4].

We shall demonstrate the kernel-differentiation algorithm on this example, and show that the noised case can give a useful approximate linear response for the deterministic case. In the following discussions, unless otherwise noted, the default values of the (hyper-)parameters are

γ=3,σ=0.1,W=7.\begin{split}\gamma=3,\quad\sigma=0.1,\quad W=7.\end{split}
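As a usage illustration of the hypothetical kernel_diff_ergodic sketch in section 4.1 (again not the author's code), the noisy tent map with the default parameters above can be set up as follows; the mod-1 wrap is applied to the state, while dp/p keeps the plain Gaussian form of equation 13, a good approximation for small σ, and the orbit length here is shorter than in the figures.

import numpy as np

gamma, sigma, W = 3.0, 0.1, 7
rng = np.random.default_rng(1)
f       = lambda x: gamma * x if x <= 0.5 else gamma * (1.0 - x)
delta_f = lambda x: x if x <= 0.5 else 1.0 - x          # d f_gamma / d gamma
step    = lambda x, y: (f(x) + y) % 1.0                 # X_{n+1} = f(X_n) + Y_{n+1} mod 1
score   = lambda x, y: delta_f(x) * (-y / sigma**2)     # delta_f(x) . (dp/p)(y), equation 13
Phi_avg, dPhi = kernel_diff_ergodic(step, score, Phi=lambda x: x,
                                    sample_noise=lambda: rng.normal(0.0, sigma),
                                    x0=0.3, L=10**5, W=W)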
Figure 2. Densities hh of different σ\sigma. Here L=107L=10^{7}.

We first test the effect of adding noise on the physical measure. As shown in figure 2, the density converges to the deterministic case as σ0\sigma\rightarrow 0. This and later numerical results indicate that we can hope to find some approximate ‘linear response’ for the deterministic system, which provides useful information about the trend between the averaged observable and γ\gamma.

Figure 3. Φavg\Phi_{avg} and δΦavg\delta\Phi_{avg} for different parameter γ\gamma. Here L=106L=10^{6} for all orbits. The dots are Φavg\Phi_{avg}, and the short lines are δΦavg\delta\Phi_{avg} computed by the kernel-differentiation response algorithm; they are computed from the same orbit. The long red line is Φavg\Phi_{avg} when there is no noise (σ=0\sigma=0).

Then we run the kernel-differentiation algorithm to compute linear responses for different γ\gamma. This reveals the relation between Φavg=Φh𝑑x\Phi_{avg}=\int\Phi hdx and γ\gamma. As shown in figure 3, the algorithm gives an accurate linear response for the noised case. Moreover, the linear response of the noised case is a reasonable reflection of the observable-parameter relation of the deterministic case. Most importantly, the local minima and maxima of the noised and deterministic cases tend to be located similarly. Hence, we can use the kernel-differentiation algorithm on the noised case to help optimize the deterministic case.

Figure 4. Effects of orbit length LL. Left: derivatives from 10 independent computations for each LL. Right: the standard deviation of the computed derivatives, where the dashed line is L0.5L^{-0.5}.

Then we show the convergence of the kernel-differentiation algorithm with respect to LL in figure 4. In particular, the standard deviation of the computed derivative is proportional to L0.5L^{-0.5}. This is the same as the classical Monte Carlo method for integrating probabilities, with LL being the number of samples. This is expected: since here we are given the dynamics, our samples are canonically the points on a long orbit.

Figure 5. Tent map, effects of WW. Here L=105L=10^{5}. Left: derivatives computed by different WW’s. Right: standard deviation of derivatives, where the dashed line is 0.001W0.50.001W^{0.5}.

Figure 5 shows that the bias in the average derivative decreases as WW increases, but the standard deviation increases roughly like O(W0.5)O(W^{0.5}). Note that if we do not centralize Φ\Phi, then the standard deviation would be like O(W)O(W).

4.3. A chaotic neural network with foliated perturbations

4.3.1. Basic model

This subsection considers a chaotic neural network whose dynamic is deterministic and whose exact linear response does not provide useful information. Since the perturbation is foliated, we can add only a low-dimensional noise, which incurs a smaller error. Then we use the kernel-differentiation algorithm to compute approximations of the linear response.

A main goal in designing neural networks is to suppress gradient explosion, which is essentially the definition of chaos. However, even with modern architectures, there is no guarantee that gradient explosion (or chaos) can be precluded. Unlike the stochastic backpropagation method, we do not perform propagation at all, so we are not hindered by gradient explosion. But we add extra noise at each layer, which introduces a (potentially large) systematic error.

The variables of interest are Xn9X^{\prime}_{n}\in\mathbb{R}^{9}, n=0,1,,Tn=0,1,\cdots,T, where T=50T=50, and the dynamic is

Xn+1=fγ(Xn),wherefγ(x)=Jtanh(x+γ𝟙),J=CJ0,J0=[0.541.190.331.660.51.31.520.51.951.61.551.450.611.920.590.161.141.270.590.651.321.460.820.951.470.080.380.780.260.871.990.070.870.790.441.110.81.280.521.011.491.491.650.450.211.770.031.390.280.441.270.610.010.020.180.290.730.530.821.581.410.071.840.640.860.730.960.060.041.11.220.281.181.950.370.011.240.320.430.061.28].\begin{split}X^{\prime}_{n+1}=f_{\gamma}(X^{\prime}_{n}),\quad\textnormal{where}\quad f_{\gamma}(x)=J\tanh(x+\gamma\mathbbm{1}),\quad J=CJ_{0},\quad\\ J_{0}=\left[\begin{array}[]{ccccccccc}-0.54&-1.19&-0.33&1.66&-0.5&-1.3&1.52&-0.5&1.95\\ -1.6&-1.55&-1.45&0.61&1.92&0.59&-0.16&-1.14&-1.27\\ -0.59&-0.65&-1.32&-1.46&-0.82&-0.95&-1.47&-0.08&-0.38\\ -0.78&-0.26&0.87&1.99&0.07&0.87&-0.79&-0.44&1.11\\ 0.8&-1.28&-0.52&-1.01&1.49&1.49&-1.65&-0.45&0.21\\ -1.77&0.03&-1.39&-0.28&0.44&1.27&0.61&0.01&-0.02\\ -0.18&-0.29&-0.73&0.53&-0.82&-1.58&-1.41&0.07&-1.84\\ 0.64&0.86&0.73&0.96&-0.06&0.04&1.1&1.22&-0.28\\ 1.18&-1.95&-0.37&0.01&1.24&-0.32&0.43&0.06&-1.28\end{array}\right].\end{split}

Here II is the identity matrix, 𝟙=[1,,1]\mathbbm{1}=[1,\cdots,1], and for x=[x1,,x9],xix=[x^{1},\cdots,x^{9}],x_{i}\in\mathbb{R},

tanh(x):=[tanh(x1),,tanh(x9)].\begin{split}\tanh(x):=[\tanh(x^{1}),\cdots,\tanh(x^{9})].\end{split}

We set X0𝒩(0,I)X^{\prime}_{0}\sim{\mathcal{N}}(0,I). The objective function is

Φ(XT)=i=19(XT)i.\Phi^{\prime}(X^{\prime}_{T})=\sum_{i=1}^{9}(X^{\prime}_{T})^{i}.

There is a somewhat tight region of CC in which the system is chaotic: when C<1C<1, JJ is small, so the Jacobian is small; when C>10C>10, the points tend to be far from zero, so the derivative of tanh\tanh is small and the Jacobian is again small. Our choice C=4C=4 gives roughly the most unstable network. We roughly estimate the cost of the ensemble, or stochastic gradient, formula of the linear response, which is

δ𝔼(Φ)=𝔼(n=150δf(X50n)fndΦ(X50))\begin{split}\delta\mathbb{E}(\Phi^{\prime})=\mathbb{E}\left(\sum_{n=1}^{50}\delta f(X^{\prime}_{50-n})\cdot f^{*n}d\Phi(X^{\prime}_{50})\right)\end{split}

Here ff^{*} is the backpropagation operator for covectors, which is the transpose of the Jacobian matrix DfDf. On average |f50|104105|f^{*50}|\approx 10^{4}\sim 10^{5}, dΦδf10d\Phi\cdot\delta f\approx 10, so the integrand's size is about 10510610^{5}\sim 10^{6}. This would require about 1010101210^{10}\sim 10^{12} samples (a sample is a realization of the 50-layer network) to reduce the sampling error to O(1)O(1). That is not affordable.
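As a rough illustration of this gradient explosion, the covector dΦ=𝟙 can be propagated backwards through the 50 layers; the sketch below is hypothetical and uses a random stand-in for CJ_0, so the exact magnitude may differ from the estimate above.

import numpy as np

rng = np.random.default_rng(0)
J = 4.0 * rng.uniform(-2.0, 2.0, size=(9, 9))   # random stand-in for C*J0 with C = 4
x, xs = rng.normal(size=9), []
for _ in range(50):                             # forward pass, storing the layer inputs
    xs.append(x)
    x = J @ np.tanh(x)
w = np.ones(9)                                  # covector dPhi at the last layer
for x in reversed(xs):                          # backward pass: w <- Df(x)^T w
    w = (1.0 - np.tanh(x)**2) * (J.T @ w)
print(np.linalg.norm(w))                        # compare with |f^{*50}| ~ 1e4 - 1e5 in the text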

Our model is modified from its original form in [7, 8]. In the original model, the entries of the weight matrix JJ were randomly generated according to certain laws; as discussed in section 2.6.2, we can rewrite the random maps using an additive noise and then obtain exact solutions to the original problem. We can also further generalize this example to time-inhomogeneous cases.

Finally, we acknowledge that our neural network’s architecture is outdated, but modern architectures are not good tests for our algorithm. Because the backpropagation method does not work in chaos, current architectures typically avoid chaos. The kernel-differentiation method might help us to have more freedom in choosing architectures beyond the current ones.

4.3.2. Explicit foliated chart and artificial noise

We say the perturbation is foliated, which means that there is a fixed foliation such that for a small interval of γ\gamma, δf\delta f is always parallel to that foliation. The perturbation in this particular example is foliated, since for any γ1,γ2\gamma_{1},\gamma_{2},

fγ|γ=γ1(x1)=fγ|γ=γ2(x2),wheneverfγ1(x1)=fγ2(x2).\left.\frac{\partial f}{\partial\gamma}\right|_{\gamma=\gamma_{1}}(x_{1})=\left.\frac{\partial f}{\partial\gamma}\right|_{\gamma=\gamma_{2}}(x_{2}),\quad\textnormal{whenever}\quad f_{\gamma_{1}}(x_{1})=f_{\gamma_{2}}(x_{2}).

Hence, the vector field δf~\delta{\tilde{f}} is invariant for an interval of γ\gamma, and the foliation is given by the streamlines of δf\delta f.

In fact, we can write down the explicit expression of a foliated chart: we change coordinates by X=X+γ𝟙X=X^{\prime}+\gamma\mathbbm{1}; then the dynamic and the observable in the new coordinates are

Xn+1=Jtanh(Xn)+γ𝟙,Φ(XT)=9γ+i=19XTi.X_{n+1}=J\tanh(X_{n})+\gamma\mathbbm{1},\quad\Phi(X_{T})=-9\gamma+\sum_{i=1}^{9}X_{T}^{i}.

Note that we can allow the chart to depend on γ\gamma. Also, the initial distribution is changed to

X0=Y0+γ𝟙,whereY0𝒩(0,I).X_{0}=Y_{0}+\gamma\mathbbm{1},\quad\textnormal{where}\quad Y_{0}\sim{\mathcal{N}}(0,I).

Now it is clear that δf\delta f is always in the direction of 𝟙\mathbbm{1}; hence, we only need to add noise in this direction. We compute the linear response of the system

(14) Xn+1=Jtanh(Xn)+γ𝟙+Yn+1𝟙/M,whereYniid𝒩(0,σ2).X_{n+1}=J\tanh(X_{n})+\gamma\mathbbm{1}+Y_{n+1}\mathbbm{1}/\sqrt{M},\quad\textnormal{where}\quad Y_{n}{\,\overset{\mathrm{iid}}{\sim}\,}{\mathcal{N}}(0,\sigma^{2}).

We shall compare with adding noise in all directions, with YMY^{\prime}\in\mathbb{R}^{M}

Xn+1=Jtanh(Xn)+γ𝟙+Yn+1,whereYniid𝒩(0,σ2I).X_{n+1}=J\tanh(X_{n})+\gamma\mathbbm{1}+Y^{\prime}_{n+1},\quad\textnormal{where}\quad Y^{\prime}_{n}{\,\overset{\mathrm{iid}}{\sim}\,}{\mathcal{N}}(0,\sigma^{2}I).

Here II is the M×MM\times M identity matrix.

The foliated case adds a smaller noise to the system, since YY^{\prime} amounts to summing MM i.i.d. copies of YY in orthogonal directions. We now show that the two noised cases have the same computational cost. As will be discussed in section 5.1, the computational cost is determined by the number of samples required, which is in turn determined by the magnitude of the integrand. To verify this, we check

𝔼[(1pδfdp)2]=𝔼[(1pδfdp)2],\mathbb{E}\left[\left(\frac{1}{p^{\prime}}\delta f\cdot dp^{\prime}\right)^{2}\right]=\mathbb{E}\left[\left(\frac{1}{p}\delta f\cdot dp\right)^{2}\right],

where pp is the density on the 1-dimensional subspace, and pp^{\prime} is the density in all directions. By equation 13, and noting that δf~=𝟙\delta{\tilde{f}}=\mathbbm{1} is constant, we want

𝔼[(1σ2𝟙Y)2]=𝔼[(1σ2𝟙𝟙Y/M)2].\mathbb{E}\left[\left(\frac{1}{\sigma^{2}}\mathbbm{1}\cdot Y^{\prime}\right)^{2}\right]=\mathbb{E}\left[\left(\frac{1}{\sigma^{2}}\mathbbm{1}\cdot\mathbbm{1}Y/\sqrt{M}\right)^{2}\right].

This is indeed true; just notice that 𝟙/M\mathbbm{1}/\sqrt{M} is a unit vector.
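A quick Monte Carlo sanity check of this identity (hypothetical, with M = 9 and σ = 1.5 as illustrative values; both sides should be close to the exact value M/σ²):

import numpy as np

rng, M, sigma = np.random.default_rng(2), 9, 1.5
Yp = rng.normal(0.0, sigma, size=(10**6, M))    # noise in all directions, Y' ~ N(0, sigma^2 I)
Y  = rng.normal(0.0, sigma, size=10**6)         # 1-dimensional noise, Y ~ N(0, sigma^2)
lhs = np.mean((Yp.sum(axis=1) / sigma**2)**2)   # E[((1/sigma^2) 1 . Y')^2]
rhs = np.mean((np.sqrt(M) * Y / sigma**2)**2)   # E[((1/sigma^2) 1 . 1 Y/sqrt(M))^2]
print(lhs, rhs, M / sigma**2)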

Note that now both Φ\Phi and h0h_{0} depend on γ\gamma due to the chart. Hence, the linear response has two more terms,

(15) δΦ(xT)𝑑μT(xT)=T+δΦ(xT)𝑑μT(xT)+𝔼[Φ(XT)δh0h0(X0)],\delta\int\Phi(x_{T})d\mu_{T}(x_{T})=T+\int\delta\Phi(x_{T})d\mu_{T}(x_{T})+\mathbb{E}\left[\Phi(X_{T})\frac{\delta h_{0}}{h_{0}}(X_{0})\right],

where the first term TT is given by section 1.2, and the other two terms are new. For the second term, since the integrand δΦ=9\delta\Phi=-9 is constant, the second term is 9-9; if the integrand were not constant we could still evaluate it easily by Monte Carlo sampling. For the third term, the expectation 𝔼\mathbb{E} is just an abbreviation for the multiple integration in section 1.2. By definition,

h_{0}(x_{0})=(2\pi)^{-M/2}\exp\left(-\frac{1}{2}(x_{0}-\gamma\mathbbm{1})^{\mathsf{T}}(x_{0}-\gamma\mathbbm{1})\right),

so δh0h0(x0)=x0𝟙γM\frac{\delta h_{0}}{h_{0}}(x_{0})=x_{0}\cdot\mathbbm{1}-\gamma M.
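Putting the pieces together, the following hypothetical Python sketch estimates all three terms of equation 15 with the finite-time procedure and the 1-dimensional foliated noise; J is again a random stand-in for CJ_0, and gamma, sigma, L are illustrative values.

import numpy as np

M, T, gamma, sigma, L = 9, 50, 0.0, 1.5, 10**4
rng = np.random.default_rng(0)
J = 4.0 * rng.uniform(-2.0, 2.0, size=(M, M))   # random stand-in for C*J0 with C = 4
one = np.ones(M)
Phi_T, S, corr0 = np.empty(L), np.empty(L), np.empty(L)
for l in range(L):
    x = rng.normal(size=M) + gamma * one        # X_0 = Y_0 + gamma*1
    corr0[l] = x @ one - gamma * M              # (delta h_0 / h_0)(X_0)
    s = 0.0
    for _ in range(T):
        y = rng.normal(0.0, sigma)              # scalar noise Y ~ N(0, sigma^2)
        s += np.sqrt(M) * (-y / sigma**2)       # delta_f . (dp/p) = 1 . (1/sqrt(M))(-y/sigma^2)
        x = J @ np.tanh(x) + gamma * one + y * one / np.sqrt(M)
    S[l] = -s                                   # S_l = - sum_m I_{l,m}
    Phi_T[l] = -9.0 * gamma + x.sum()
Phi_avg = Phi_T.mean()
main = np.mean(S * (Phi_T - Phi_avg))           # first term of equation 15
delta_Phi_avg = main - 9.0 + np.mean(Phi_T * corr0)   # add the two extra terms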

4.3.3. Numerical results

We use the kernel-differentiation method given in section 1.2 to compute the main term in equation 15, under two different noises: a noise in all directions, and a 1-dimensional noise along the direction of 𝟙\mathbbm{1}.

Figure 6. Φavg\Phi_{avg} and δΦavg\delta\Phi_{avg} for different parameter γ\gamma and different noises. For the two cases with noise, σ=1.5\sigma=1.5. Here the total number of samples is L=104L=10^{4}.
Figure 7. σ=0.5\sigma=0.5. Other settings are the same as figure 6.

Figure 6 and figure 7 show the results of the kernel-differentiation algorithm on this example. The algorithm correctly computes the derivative of the problem with additive noises. The total time cost for running the algorithm on L=104L=10^{4} orbits to get a linear response, using a 1 GHz computer thread, is 3 seconds.

As we can see by comparing the two figures, the case with 1-dimensional noise is a better approximation of the deterministic case, which is what we are actually interested in. In particular, when σ=1.5\sigma=1.5, the MM-dimensional noise essentially washes out any trend between the parameter γ\gamma and the averaged observable Φavg\Phi_{avg}. In contrast, the 1-dimensional noise has a much smaller impact on the γΦavg\gamma\sim\Phi_{avg} relation, and the derivative computed by the kernel-differentiation method is more useful.

Also note that decreasing σ\sigma to reduce the systematic error is not free. As we can see by comparing the two figures, and as we shall discuss in section 5.1, a small σ\sigma increases the sampling error, which requires more samples to suppress and hence increases the cost. So adding low-dimensional noise, when the perturbation is foliated, has the unique advantage of reducing the approximation error relative to the deterministic case.

5. Discussions

The kernel-differentiation algorithm is robust, has no systematic error, has a low cost per step, and is not cursed by dimensionality. But there is a caveat: when the noise is small, we need many samples for the Monte Carlo method to converge. Hence we cannot expect to use the small-σ\sigma limit to get an easy approximation of the linear response of deterministic systems.

In this section, we first give a rough cost-error estimation of the problem, and estimate the number of samples LL and desirable noise intensity σ\sigma. Then we discuss how to potentially reduce the cost by further combining with the fast response algorithm, which was developed for deterministic linear responses.

5.1. A rough cost-error estimation

When the problem is intrinsically random, the noise scale σ\sigma is fixed. In this case, there are two sources of error. The first is due to using a finite decorrelation step number WW; this error is O(θW)O(\theta^{W}) for some 0<θ<10<\theta<1. The second is the sampling error due to using a finite number LL of samples. Assume we use Gaussian noise; since we are averaging a large integrand to get a small number, we can approximate the standard deviation of the integrand by its typical absolute value. The integrand is the sum of WW copies of dpp=yσ21σ\frac{dp}{p}=-\frac{y}{\sigma^{2}}\sim\frac{1}{\sigma}, so its size is roughly W/σ\sqrt{W}/\sigma, and the sampling error is W/σL\sqrt{W}/\sigma\sqrt{L}, where LL is the number of samples. Together we have the total error ε\varepsilon

ε=O(θW)+O(WσL).\begin{split}\varepsilon=O(\theta^{W})+O\left(\frac{\sqrt{W}}{\sigma\sqrt{L}}\right).\end{split}

This gives us a relation among ε,W\varepsilon,W, and LL, where LL is proportional to the overall cost. In practice, we typically set the two errors to be roughly equal, which gives an extra relation that lets us eliminate WW and obtain the cost-error relation

θW=WσL,soL=O(Wσ2θ2W)=O(logθεσ2ε2).\begin{split}\theta^{W}=\frac{\sqrt{W}}{\sigma\sqrt{L}},\quad\textnormal{so}\quad L=O\left(\frac{W}{\sigma^{2}\theta^{2W}}\right)=O\left(\frac{\log_{\theta}\varepsilon}{\sigma^{2}\varepsilon^{2}}\right).\end{split}

This is rather typical for a Monte Carlo method. But the problem is that the cost can be large for small σ\sigma.
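For a rough sense of scale, with purely illustrative numbers:

\begin{split}\theta=0.9,\quad\sigma=0.1,\quad\varepsilon=0.01\quad\Rightarrow\quad W=\log_{\theta}\varepsilon\approx 44,\quad L=O\left(\frac{W}{\sigma^{2}\varepsilon^{2}}\right)\approx 4\times 10^{7}.\end{split}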

On the other hand, if we use a random system to approximate a deterministic system, then we can choose the noise scale σ\sigma. Now each step further incurs an approximation error O(σ)O(\sigma) on the measure. This error decays but accumulates, and the total error on the physical measure is O(σ/(1θ))O(\sigma/(1-\theta)), which can be quite large compared to the one-step error σ\sigma. Hence, if we are interested in the trend between Φavg\Phi_{avg} and γ\gamma over a certain stepsize Δγ\Delta\gamma (this is typically known from the practical problem), then the error in the linear response is O(σ/Δγ(1θ))O(\sigma/\Delta\gamma(1-\theta)). Together we have the total error ε\varepsilon

(16) ε=O(θW)+O(WσL)+O(σΔγ(1θ)).\begin{split}\varepsilon=O(\theta^{W})+O\left(\frac{\sqrt{W}}{\sigma\sqrt{L}}\right)+O\left(\frac{\sigma}{\Delta\gamma(1-\theta)}\right).\end{split}

Again, in practice we want the three errors to be roughly equal, which prescribes the size of σ\sigma,

σ=εΔγ(1θ),soL=O(logθεσ2ε2)=O(logθεε4(1θ)2).\begin{split}\sigma=\varepsilon\Delta\gamma(1-\theta),\quad\textnormal{so}\quad L=O\left(\frac{\log_{\theta}\varepsilon}{\sigma^{2}\varepsilon^{2}}\right)=O\left(\frac{\log_{\theta}\varepsilon}{\varepsilon^{4}(1-\theta)^{2}}\right).\end{split}

Since 1θ1-\theta can be small, this cost can be much larger than just ε4\varepsilon^{-4}, which is already a high cost.

Finally, we acknowledge that our estimate is very rough, but the point we make is solid: the small-noise limit is numerically expensive to achieve.

5.2. A potential program to unify three linear response formulas

We sketch a potential program for further reducing the cost and error of computing approximate linear responses of non-hyperbolic deterministic systems. As is known, non-hyperbolic systems typically do not have linear responses, so we must mollify; in this paper we choose to mollify by adding noise in the phase space at each time step. But as we saw in the last subsection, adding a large noise increases the noise error (the third term in equation 16), whereas a small noise increases the sampling error (the second term in equation 16).

There are cases with special structures for which we have specific tools that could give a good approximate linear response, in the sense of reflecting the parameter-observable relation. When the system has a foliated structure, this paper shows that we can add low-dimensional noise. When the unstable dimension is low, computing only the shadowing or stable part of the linear response may be a good approximation (see [28]).

For the more general situation, a plausible solution is to add a big but local noise, only at locations where the hyperbolicity is bad. The benefit of this program is that, if the singularity set is low-dimensional, then the area where we add big noise is small, so the noise error is small. For the continuous-time case there seems to be an easy choice: we can let the noise scale be inversely proportional to the length of the flow vector. This should at least solve the singular hyperbolic flow cases [40, 41], where the bad points coincide with zero velocity. But for the discrete-time case it can be difficult to find a natural criterion that is easy to compute.

In terms of the formulas, we use the kernel-differentiation formula (the generalized version in section 2.6) where the noise is big, so that the sampling error is small there. Where the noise is small, we use the fast response algorithm, which is efficient regardless of the noise scale but requires hyperbolicity.

This program hinges on the assumption that the singularity set is small or low-dimensional. It also requires us to invent more techniques. For example, we need a formula that transfers information from the kernel-differentiation formula to the fast response formulas. In particular, the equivariant divergence formula, in its original form, requires information from the infinite past and future [30]. But we can rerun the proof and restrict the time dependence to a finite window; this requires extra information on the interface, such as the derivative of the conditional measure and the divergence of the holonomy map. This information should and could be provided by the kernel-differentiation formula.

Historically, there are three famous linear response formulas. The path-perturbation formula does not work for chaos; the divergence formula is likely to be cursed by dimensionality; the kernel-differentiation formula is expensive in the small-noise limit. The fast response formula combines the path-perturbation and divergence formulas; it is a function (not a distribution), so it can be sampled; it is affected by neither chaos nor high dimension, but it is rigid about hyperbolicity. In the future, besides testing the above methods on specific tasks, we should also try to combine all three formulas into one, which might provide the best approximate linear responses with the highest efficiency.

Acknowledgements

The author is in great debt to Stefano Galatolo, Yang Liu, Caroline Wormell, Wael Bahsoun, and Gary Froyland for helpful discussions.

Data availability statement

The code used in this manuscript is at https://github.com/niangxiu/kd. There is no other associated data.

References

  • [1] F. Antown, G. Froyland, and S. Galatolo. Optimal linear response for markov hilbert–schmidt integral operators and stochastic dynamical systems. Journal of Nonlinear Science, 32, 12 2022.
  • [2] W. Bahsoun, S. Galatolo, I. Nisoli, and X. Niu. A rigorous computational approach to linear response. Nonlinearity, 31:1073–1109, 2018.
  • [3] W. Bahsoun, M. Ruziboev, and B. Saussol. Linear response for random dynamical systems. Advances in Mathematics, 364:107011, 4 2020.
  • [4] V. Baladi. On the susceptibility function of piecewise expanding interval maps. Communications in Mathematical Physics, 275:839–859, 8 2007.
  • [5] V. Baladi. The quest for the ultimate anisotropic banach space. Journal of Statistical Physics, 166:525–557, 2017.
  • [6] R. Bowen and D. Ruelle. The ergodic theory of axiom a flows. Inventiones Mathematicae, 29:181–202, 1975.
  • [7] B. Cessac and J. A. Sepulchre. Stable resonances and signal propagation in a chaotic network of coupled units. Physical Review E, 70:056111, 11 2004.
  • [8] B. Cessac and J. A. Sepulchre. Transmitting a signal by amplitude modulation in a chaotic network. Chaos: An Interdisciplinary Journal of Nonlinear Science, 16:013104, 3 2006.
  • [9] D. Dolgopyat. On differentiability of srb states for partially hyperbolic systems. Inventiones Mathematicae, 155:389–449, 2004.
  • [10] D. Dragičević, P. Giulietti, and J. Sedro. Quenched linear response for smooth expanding on average cocycles. Communications in Mathematical Physics, 399:423–452, 4 2023.
  • [11] R. Durrett. Probability: Theory and Examples. Cambridge University Press, 4th edition edition, 8 2010.
  • [12] G. L. Eyink, T. W. N. Haine, and D. J. Lea. Ruelle’s linear response formula, ensemble adjoint schemes and lévy flights. Nonlinearity, 17:1867–1889, 2004.
  • [13] G. Froyland, D. Giannakis, B. R. Lintner, M. Pike, and J. Slawinska. Spectral analysis of climate dynamics with operator-theoretic approaches. Nature Communications, 12:6570, 11 2021.
  • [14] S. Galatolo and P. Giulietti. A linear response for dynamical systems with additive noise. Nonlinearity, 32:2269–2301, 6 2019.
  • [15] S. Galatolo and I. Nisoli. An elementary approach to rigorous approximation of invariant measures. SIAM Journal on Applied Dynamical Systems, 13:958–985, 2014.
  • [16] M. Ghil and V. Lucarini. The physics of climate variability and climate change. Reviews of Modern Physics, 92:035002, 7 2020.
  • [17] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33:75–84, 10 1990.
  • [18] P. W. Glynn and M. Olvera-Cravioto. Likelihood ratio gradient estimation for steady-state parameters. Stochastic Systems, 9:83–100, 6 2019.
  • [19] S. Gouëzel and C. Liverani. Compact locally maximal hyperbolic sets for smooth maps: Fine statistical properties. Journal of Differential Geometry, 79:433–477, 2008.
  • [20] M. S. Gutiérrez and V. Lucarini. Response and sensitivity using markov chains. Journal of Statistical Physics, 179:1572–1593, 2020.
  • [21] M. Hairer and A. J. Majda. A simple framework to justify linear response theory. Nonlinearity, 23:909–922, 4 2010.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition, pages 1–12, 2016.
  • [23] A. Jameson. Aerodynamic design via control theory. Journal of Scientific Computing, 3:233–260, 1988.
  • [24] M. Jiang. Differentiating potential functions of srb measures on hyperbolic attractors. Ergodic Theory and Dynamical Systems, 32:1350–1369, 2012.
  • [25] D. J. Lea, M. R. Allen, and T. W. N. Haine. Sensitivity analysis of the climate of a chaotic system. Tellus A: Dynamic Meteorology and Oceanography, 52:523–532, 2000.
  • [26] V. Lucarini, F. Ragone, and F. Lunkeit. Predicting climate change using response theory: Global averages and spatial patterns. Journal of Statistical Physics, 166:1036–1064, 2017.
  • [27] A. Ni. Hyperbolicity, shadowing directions and sensitivity analysis of a turbulent three-dimensional flow. Journal of Fluid Mechanics, 863:644–669, 2019.
  • [28] A. Ni. Approximating linear response by nonintrusive shadowing algorithms. SIAM J. Numer. Anal., 59:2843–2865, 2021.
  • [29] A. Ni. Backpropagation in hyperbolic chaos via adjoint shadowing. Nonlinearity, 37:035009, 3 2024.
  • [30] A. Ni and Y. Tong. Recursive divergence formulas for perturbing unstable transfer operators and physical measures. Journal of Statistical Physics, 190:126, 7 2023.
  • [31] A. Ni and Y. Tong. Equivariant divergence formula for hyperbolic chaotic flows. Journal of Statistical Physics, 191:118, 9 2024.
  • [32] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. International conference on machine learning, pages 1310–1318, 2013.
  • [33] P. Plecháč, G. Stoltz, and T. Wang. Convergence of the likelihood ratio method for linear response of non-equilibrium stationary states. ESAIM: Mathematical Modelling and Numerical Analysis, 55:S593–S623, 2021.
  • [34] P. Plecháč, G. Stoltz, and T. Wang. Martingale product estimators for sensitivity analysis in computational statistical physics. IMA Journal of Numerical Analysis, 43:3430–3477, 11 2023.
  • [35] M. Pollicott and O. Jenkinson. Computing invariant densities and metric entropy. Communications in Mathematical Physics, 211:687–703, 2000.
  • [36] M. I. Reiman and A. Weiss. Sensitivity analysis for simulations via likelihood ratios. Operations Research, 37:830–844, 10 1989.
  • [37] R. Y. Rubinstein. Sensitivity analysis and performance extrapolation for computer simulation models. Operations Research, 37:72–81, 2 1989.
  • [38] D. Ruelle. A measure associated with axiom-a attractors. American Journal of Mathematics, 98:619, 1976.
  • [39] D. Ruelle. Differentiation of srb states. Commun. Math. Phys, 187:227–241, 1997.
  • [40] M. Viana. What’s new on lorenz strange attractors? The Mathematical Intelligencer, 22:6–19, 6 2000.
  • [41] X. Wen and L. Wen. No-shadowing for singular hyperbolic sets with a singularity. Discrete and Continuous Dynamical Systems- Series A, 40:6043–6059, 10 2020.
  • [42] C. L. Wormell and G. A. Gottwald. Linear response for macroscopic observables in high-dimensional systems. Chaos, 29, 2019.
  • [43] L.-S. Young. What are srb measures, and which dynamical systems have them? Journal of Statistical Physics, 108:733–754, 2002.