
Ergodic and foliated kernel-differentiation method for linear responses of random systems

Angxiu Ni$^{1,2}$. $^{1}$Department of Mathematics, University of California, Irvine, USA. $^{2}$Yau Mathematical Sciences Center, Tsinghua University, Beijing, China. niangxiu@gmail.com
Abstract.

We extend the kernel-differentiation method (or likelihood-ratio method) for the linear response of random dynamical systems, after revisiting its derivation from a microscopic view via transfer operators. First, for the linear response of physical measures, we extend the method to an ergodic version, which is sampled by one infinitely-long sample path; moreover, the magnitude of the integrand is bounded by a finite decorrelation step number. Second, when the noise and perturbation are along a given foliation, we show that the method is still valid for both finite and infinite time. We numerically demonstrate the ergodic version on a tent map, whose linear response of the physical measure does not exist; but we can add some noise and compute an approximate linear response. We also demonstrate on a chaotic neural network with 51 layers $\times$ 9 neurons, whose perturbation is along a given foliation; we show that adding foliated noise incurs a smaller error than noise in all directions. We give a rough error analysis of the method and show that it can be expensive for small noise. We propose a potential future program unifying the three popular linear response methods.

Keywords. likelihood ratio method, chaos, linear response, random dynamical systems.

AMS subject classification numbers. 37M25, 65D25, 65P99, 65C05, 62D05, 60G10.

1. Introduction

1.1. Literature review

The averaged statistics of a dynamical system are of central interest in applied sciences. The average can be taken in two ways. When the system is random, we can average with respect to the randomness. On the other hand, if the system runs for a long time, we can average with respect to time. The long-time average can also be defined for deterministic systems, and the limit measure, when it exists, is called the physical measure or SRB measure [43, 38, 6]. If the system is both random and run for a long time, the existence of the limit measure (which we shall also call the physical measure) is discussed in textbooks such as [11].

We are interested in the linear response, which is the derivative of the averaged observable with respect to the parameters of the system. The linear response is a fundamental tool for many purposes. For example, in aerospace design, we want to know how small changes in the geometry would affect the average lift of an aircraft, which was answered for non-chaotic fluids [23] but only partially answered for not-so-chaotic fluids [27]. In climate studies, we want to ask what temperature change is caused by a small change of the CO$_2$ level [16, 13]. In machine learning, we want to extend the conventional backpropagation method to cases with gradient explosion; current practices have to avoid gradient explosion, but that is not always achievable [32, 22].

There are three basic methods for expressing and computing the linear response: the path perturbation method, the divergence method, and the kernel-differentiation method. The path perturbation method averages the path perturbation over many orbits; this is also known as the ensemble method or the stochastic gradient method (see for example [25, 12, 26]). However, when the system is chaotic, that is, when the pathwise perturbation grows exponentially fast with respect to time, the method becomes too expensive, since it requires many samples to average a huge integrand. Proofs of such formulas for the physical measure of hyperbolic systems were given in, for example, [39, 9, 24].

The divergence method is also known as the transfer operator method, since the perturbation of the measure transfer operator is some divergence. Traditionally, the divergence method is functional, and the formula is not pointwise defined when the invariant measure is singular, which is common for systems with a contracting direction (see for example [19, 5]). This difficulty cannot be circumvented by transformations such as the Livšic theorem in [19]. Hence, the traditional divergence formulas cannot be computed by a Monte-Carlo approach, so the computation is cursed by dimensionality (see for example [15, 42, 20, 2, 35]).

For chaotic deterministic systems, to have a formula without exponentially growing or distributional terms, we need to blend the two formulas, but the obstruction was that the split lacks smoothness. The ‘fast response’ linear response formulas solve this difficulty for deterministic hyperbolic systems. They are composed of two parts, the adjoint shadowing lemma [29] and the equivariant divergence formula [30]. The fast response formula is a pointwise defined function and has no exponentially growing terms. The continuous-time versions of fast response formulas can be found in [29, 31]. The fast response formula is the ergodic theorem for linear response: it computes $2u+2$ $M$-dimensional vectors on each point of an orbit, and then the average of some pointwise functions of these vectors. Here $M$ is the phase space dimension, and $u$ is the number of unstable directions.

The kernel-differentiation formula works only for random systems. This method is more popularly known as the likelihood ratio method or the Monte-Carlo gradient method in the probability context (see [37, 36, 17, 34]); the idea (dividing and multiplying by the local density) is also used in Langevin sampling and diffusion models. However, such names are no longer distinguishing in the larger context: for example, the likelihood ratio idea is also useful in the divergence method. We apologize for the inconvenience in naming. Proofs of the kernel-differentiation method were given in [21, 3, 1]. This method is more powerful than the previous two for random dynamics, since the annealed linear response can exist when the quenched linear response does not [10]. However, as we shall see, this method is inevitably expensive when the randomness is small, so it is hard to use this approach as a very good approximation to linear responses of deterministic systems.

In terms of numerics, the kernel-differentiation method naturally allows Monte-Carlo type sampling, so it is efficient in high dimensions. The formula does not involve Jacobian matrices, hence it does not incur propagation of vectors or covectors (parameter gradients), so it is not hindered by undesirable features of the Jacobian of the deterministic part, such as singularity or non-hyperbolicity. The work in [33] solves a Poisson equation for a term in the formula, which reduces the numerical error, but can be expensive in high dimensions.

Another issue in previous likelihood ratio method is that the scale of the estimator is proportional to the orbit length. This leads to two undesirable consequences:

  • The required number of sample orbits increases with the orbit length.

  • We have to restart a new orbit very often. For each orbit, the observable function is only evaluated at the last step, while the earlier steps are somewhat wasted waiting for the orbits to land onto the physical measure.

This paper shows that, for physical measures, we can resolve these issues by invoking the decay of correlation and the ergodic theorem.

Moreover, sometimes we are interested in the case where some directions in the phase space have special importance, such as the level sets in Hamiltonian system. For these cases, the perturbation in the dynamics and the added noise are sometimes aligned to a subspace. This paper extends the kernel-differentiation method to foliated spaces.

Finally, for pedagogical purposes, it is beneficial to derive the kernel-differentiation method from a local or microscopic perspective. This will be somewhat more intuitive, and the above extensions will be more natural.

1.2. Main results

We consider the random dynamical system

X_{n+1}=f(\gamma,X_{n})+Y_{n+1}.

Let $h_n$ and $p_n$ be the probability densities of $X_n$ and $Y_n$. We need the expressions of $p_n$, but we do not need expressions of $h_n$. We denote the density of the physical measure by $h$, which is the limit measure under the dynamics. Here $\gamma$ is the parameter controlling the dynamics and hence $h_n$ and $h$; by default $\gamma=0$. We denote the perturbation $\delta(\cdot):=\partial(\cdot)/\partial\gamma|_{\gamma=0}$, and when $f$ is invertible, define

{\tilde{f}}:=f_{\gamma}\circ f^{-1}.

$\Phi$ is a fixed $C^2$ observable function.

As a warm-up, through basic calculus, section 2.2 rederives the kernel-differentiation formula for $\delta\int\Phi(x_1)h(x_1)dx_1$, the linear response for one step. Section 2.3 gives a pictorial explanation of the main lemma. In this simplest setting, we can see more easily how the kernel-differentiation method relates to other methods, and why it can be generalized to foliated situations. Section 2.4 rederives the formula for finitely many steps, which was known as the likelihood ratio method and was given in [37, 36, 17]. Our derivation uses the measure-transfer-operator point of view, which is more convenient for our extensions.

The first contribution of this paper, given in section 2.5, is the ergodic kernel-differentiation formula for physical measures. We invoke decay of correlations to constrain the scale of the estimator, and the ergodic theorem to reduce the required data to one long orbit. Hence, we only need to spend one spin-up time to land onto the physical measure, and then all data are fully utilized. This improves the efficiency of the algorithm. More specifically, we prove

{theorem}

[orbitwise kernel-differentiation formula for $\delta h$] Under assumptions 1 and 2, we have

\begin{split}\delta\int\Phi(x)h(x)dx\,\overset{\mathrm{a.s.}}{=}\,\lim_{W\rightarrow\infty}\lim_{L\rightarrow\infty}-\frac{1}{L}\sum_{n=1}^{W}\sum_{l=1}^{L}\left(\Phi(X_{n+l})-\Phi_{avg}\right)\,\delta f_{\gamma}X_{l}\cdot\frac{dp}{p}(Y_{1+l})\end{split}

almost surely according to the measure obtained by starting the dynamics with $X_0,Y_1,Y_2,\cdots\sim h\times p\times p\times\cdots$. Here $\Phi_{avg}:=\int\Phi h$.

Here $W$ and $L$ are two separate and sequential limits, so we can typically pick $W\ll L$. This makes the estimator smaller than in previous likelihood ratio formulas, where $W=L$. Our formula runs on only one orbit; hence, it is faster than previous likelihood ratio methods. Section 2.6 generalizes the result to cases where the randomness depends on locations and parameters.
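To make the estimator concrete, here is a minimal sketch in Python/NumPy, assuming a one-dimensional map, additive isotropic Gaussian noise with standard deviation $\sigma$ (so $\frac{dp}{p}(y)=-y/\sigma^2$), and hypothetical choices of $f$, $\delta f_\gamma$, and $\Phi$; it is only an illustration of the formula above, not the algorithm of section 4.1.

```python
import numpy as np

def ergodic_kernel_diff(f, delta_f, Phi, sigma, x0, L, W, spinup=1000, seed=0):
    """Sketch of the one-orbit (ergodic) kernel-differentiation estimator.
    Assumes additive isotropic Gaussian noise, so dp/p(y) = -y / sigma**2."""
    rng = np.random.default_rng(seed)
    N = spinup + L + W + 1
    X = np.empty(N); Y = np.empty(N)
    X[0] = x0
    for n in range(N - 1):
        Y[n + 1] = sigma * rng.standard_normal()
        X[n + 1] = f(X[n]) + Y[n + 1]
    X, Y = X[spinup:], Y[spinup:]          # discard spin-up, keep L + W + 1 states
    PhiX = Phi(X)
    Phi_avg = PhiX[:L].mean()              # estimate of Phi_avg on the same orbit
    # weight at step l uses the noise Y_{l+1} added when leaving X_l
    weight = delta_f(X[:L]) * (-Y[1:L + 1] / sigma**2)
    est = 0.0
    for n in range(1, W + 1):              # decorrelation window
        est += -np.mean((PhiX[n:n + L] - Phi_avg) * weight)
    return est

# toy usage: f(x) = 2x mod 1 perturbed by gamma*sin(2*pi*x), Phi(x) = cos(2*pi*x)
f = lambda x: (2.0 * x) % 1.0
delta_f = lambda x: np.sin(2 * np.pi * x)
Phi = lambda x: np.cos(2 * np.pi * x)
print(ergodic_kernel_diff(f, delta_f, Phi, sigma=0.1, x0=0.3, L=100000, W=20))
```

The toy map, observable, and constants at the end are placeholders; $W$ plays the role of the decorrelation window and $L$ the role of the averaging length described above.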

The second contribution of this paper is that section 3 extends the kernel-differentiation method to the case where the phase space is partitioned by submanifolds, and the added noise and the perturbation are along the submanifolds (hence the noise is ‘singular’ with respect to the Lebesgue measure). More specifically, let $F=\{F_\alpha\}_{\alpha\in A}$ be a co-dimension-$c$ foliation on $\mathcal{M}$, which is basically a collection of $(M-c)$-dimensional submanifolds partitioning $\mathcal{M}$; each submanifold is called a ‘leaf’. Note that we do not require $f$ to respect the foliation, that is, $x$ and $fx$ can be on two different leaves. Let $p(z,x_{n+1})$ be the probability density of $X_{n+1}$, conditioned on $f(X_n)=z$, with respect to the Lebesgue measure on $F_\alpha(z)$. The structure of section 3 is similar to section 2, and the main theorem is

{theorem}

[centralized foliated kernel-differentiation formula for $\delta h_T$ of time-inhomogeneous systems, comparable to the Remark] Under assumptions 1 and 3 for all $n$, we have

\begin{split}\delta\int\Phi(x_{T})d\mu_{T}(x_{T})=\mathbb{E}\left[\left(\Phi_{T}(x_{T})-\Phi_{avg,T}\right)\sum_{m=1}^{T}\left(\delta{\tilde{f}}_{m}(z_{m})\cdot\frac{d_{z}p_{m}(z_{m},x_{m})}{p_{m}(z_{m},x_{m})}\right)\right]\\ :=\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F^{1}_{\alpha}(z_{1})}\cdots\int_{x_{T}\in F^{T}_{\alpha}(z_{T})}\left(\Phi_{T}(x_{T})-\Phi_{avg,T}\right)\\ \sum_{m=1}^{T}\left(\delta{\tilde{f}}_{m}(z_{m})\cdot\frac{d_{z}p_{m}(z_{m},x_{m})}{p_{m}(z_{m},x_{m})}\right)p_{T}(z_{T},x_{T})dx_{T}\cdots p_{1}(z_{1},x_{1})dx_{1}d\mu_{0}(x_{0}).\end{split}

Here $\Phi_{avg,T}:=\int\Phi_T d\mu_T$. Note that here $F,{\tilde{f}},f,p$ can be different for different steps.

Section 4 considers numerical realizations. Section 4.1 gives a detailed list of the algorithm. Note that the foliated case uses the same algorithm; the difference is in the set-up of the problem. Section 4.2 illustrates the ergodic version of the algorithm on an example where the deterministic part does not have a linear response, but adding noise and using the kernel-differentiation algorithm gives a reasonable reflection of the relation between the long-time averaged observable and the parameter. Section 4.3 illustrates the foliated version of the algorithm on an unstable neural network with 51 layers $\times$ 9 neurons, where the perturbation is along the foliation. We show that adding a low-dimensional noise along the foliation induces a much smaller error than adding noise in all directions. The foliated kernel-differentiation method can then give a good approximation of the true parameter-derivative.

In the discussion section, section 5.1 gives a rough cost-error estimation of the kernel-differentiation algorithm. This helps to set some constants in the algorithm; it also shows that the small-noise limit is still very expensive. Section 5.2 proposes a program which combines the three formulas to obtain a good approximation of linear responses of deterministic systems.

The paper is organized as follows. Section 2 derives the kernel-differentiation formula where the noise is in all directions. Section 3 derives the formula for foliated perturbations and noises. Section 4 gives the procedure list of the algorithm based on the formulas, and illustrates on two numerical examples. Section 5 discusses the relation to deterministic linear responses.

2. Deriving the kernel-differentiation linear response formula

2.1. Preparations: notations on probability densities

Let $f_\gamma$ be a family of $C^2$ maps on the Euclidean space $\mathbb{R}^M$ of dimension $M$, parameterized by $\gamma$. Assume that $\gamma\mapsto f_\gamma$ is $C^1$ from $\mathbb{R}$ to the space of $C^2$ maps. The random dynamical system in this paper is given by

(1) Xn=fγ(Xn1)+Yn,whereYniidp.\begin{split}X_{n}=f_{\gamma}(X_{n-1})+Y_{n},\quad\textnormal{where}\quad Y_{n}{\,\overset{\mathrm{iid}}{\sim}\,}p.\end{split}

The default value is γ=0\gamma=0, and we denote f:=fγ=0f:=f_{\gamma=0}. In this section, we consider the case where pp is any fixed C2C^{2} probability density function. The next section considers the case where pp is smooth along certain directions but singular along other directions.

Moreover, except for section 1.2, our results also apply to time-inhomogeneous cases, where $f$ and $p$ are different for each step. More specifically, the dynamics is given by

(2) Xn=fγ,n1(Xn1)+Yn,whereYnpn.\begin{split}X_{n}=f_{\gamma,n-1}(X_{n-1})+Y_{n},\quad\textnormal{where}\quad Y_{n}\sim p_{n}.\end{split}

Here the $X_n\in\mathbb{R}^{M_n}$ are not necessarily of the same dimension. We shall exhibit the time dependence in the Remark and also in the numerics section. On the other hand, for the infinite-time case in section 1.2, we want to sample by only one orbit; this requires that $f$ and $p$ be the same for all steps.

We define hTh_{T} as the density of pushing-forward the initial measure h0h_{0} for TT steps.

hT:=(LpLfγ)Th0.\begin{split}h_{T}:=(L_{p}L_{f_{\gamma}})^{T}h_{0}.\end{split}

Here $h_T=h_{T,\gamma}$ depends on $\gamma$ and also on $h_0$; we sometimes omit the subscript $\gamma$. Here $L_f$ is the measure transfer operator of $f$, which is defined by the integral equality

(3) Φ(x)(Lfh)(x)𝑑x:=Φ(fx)h(x)𝑑x.\begin{split}\int\Phi(x)\,(L_{f}h)(x)dx:=\int\Phi(fx)h(x)dx.\end{split}

Here ΦC2(M)\Phi\in C^{2}(\mathbb{R}^{M}) is any fixed observable function. When ff is bijective, there is an equivalent pointwise definition, (Lfh)(x):=h|Df|(f1x)(L_{f}h)(x):=\frac{h}{|Df|}(f^{-1}x).

For a density $q$, $L_pq$ is pointwise defined by convolution with the density $p$:

(4) Lpq(x)=q(xy)p(y)𝑑y=q(z)p(xz)𝑑z.\begin{split}L_{p}q(x)=\int q(x-y)p(y)dy=\int q(z)p(x-z)dz.\end{split}

We shall also use the integral equality

(5) Φ(x)Lpq(x)𝑑x=p(y)(Φ(x)q(xy)𝑑x)𝑑y=z=xyΦ(y+z)q(z)p(y)𝑑z𝑑y.\begin{split}\int\Phi(x)L_{p}q(x)dx=\int p(y)\left(\int\Phi(x)q(x-y)dx\right)dy\\ \,\overset{z=x-y}{=}\,\iint\Phi(y+z)q(z)p(y)dzdy.\end{split}

If we want to compute the above integration by the Monte-Carlo method, then we should generate i.i.d. $Y_l=y_l$ and $Z_l=z_l$ according to the densities $p$ and $q$, and then compute the average of $\Phi(y_l+z_l)$.
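A minimal sketch of this sampling procedure, assuming (purely for illustration) that both $q$ and $p$ are Gaussian, so the exact value of the integral is known for comparison:

```python
import numpy as np

# Monte-Carlo sampling of  int Phi(x) (L_p q)(x) dx = E[Phi(Z + Y)],  equation (5).
# Hypothetical choices: q = N(0,1), p = N(0, sigma^2), Phi(x) = x^2.
rng = np.random.default_rng(1)
L, sigma = 10**6, 0.5
Z = rng.standard_normal(L)            # Z_l ~ q
Y = sigma * rng.standard_normal(L)    # Y_l ~ p, independent of Z_l
Phi = lambda x: x**2
print(np.mean(Phi(Z + Y)))            # close to Var(Z) + Var(Y) = 1 + sigma^2 for this toy choice
```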

To consider differentiability of hh with respect to γ\gamma, we make the following simplifying assumptions on differentiability of the basic quantities.

Assumption 1.

The densities p,q,hp,q,h have bounded C2C^{2} norms; γfγ\gamma\mapsto f_{\gamma} is C1C^{1} from \mathbb{R} to the space of C2C^{2} maps; the observable function Φ\Phi is fixed and is C2C^{2}.

The linear response formula for finite time TT is an expression of δhγ,T\delta h_{\gamma,T} by δfγ\delta f_{\gamma}, and

δ():=()γ|γ=0.\begin{split}\delta(\cdot):=\left.\frac{\partial(\cdot)}{\partial\gamma}\right|_{\gamma=0}.\end{split}

Here δ\delta may as well be regarded as small perturbations. Note that by our notation,

δfγ(x)TfxM;\begin{split}\delta f_{\gamma}(x)\in T_{fx}\mathbb{R}^{M};\end{split}

that is, δfγ(x)\delta f_{\gamma}(x) is a vector at fxfx. The linear response formula for finite-time is given by the Leibniz rule,

(6) δhT=δ(LpLfγ)Th0=n=0T1(LpLf)nδ(LpLfγ)(LpLf)T1nh0.\begin{split}\delta h_{T}=\delta(L_{p}L_{f_{\gamma}})^{T}h_{0}=\sum_{n=0}^{T-1}(L_{p}L_{f})^{n}\,\delta(L_{p}L_{f_{\gamma}})\,(L_{p}L_{f})^{T-1-n}h_{0}.\end{split}

We shall give an expression of the perturbative transfer operator δ(LpLfγ)\delta(L_{p}L_{f_{\gamma}}) later.

For the case where TT\rightarrow\infty, we define the density hh of the (ergodic) physical measure μ\mu, which is the unique limit of hTh_{T} in the weak* topology for any smooth initial density h0h_{0}:

limT,T(LpLfγ)Th0=:hγ,\begin{split}\lim_{T\in\mathbb{N},\,T\rightarrow\infty}(L_{p}L_{f_{\gamma}})^{T}h_{0}=:h_{\gamma},\end{split}

note that hh depends on γ\gamma but not on h0h_{0}. The physical measure is also called the SRB measure or the stationary measure. We also define the correlation function, 𝔼[φ(Xn)ψ(X0,X1)]\mathbb{E}[\varphi(X_{n})\psi(X_{0},X_{1})], where X0X_{0} is distributed according to the physical measure μ\mu, and XnX_{n} is distributed according to the dynamics in equation 1. For this section,

𝔼γ[φ(Xn)ψ(X0,X1)]:=φ(xn)ψ(x0,x1)hγ(x0)𝑑x0p(y1)𝑑y1p(yn)𝑑yn.\mathbb{E}_{\gamma}[\varphi(X_{n})\psi(X_{0},X_{1})]:=\int\varphi(x_{n})\psi(x_{0},x_{1})h_{\gamma}(x_{0})dx_{0}p(y_{1})dy_{1}\cdots p(y_{n})dy_{n}.

We shall make the following assumptions for infinitely many steps:

Assumption 2.

For a small interval of γ\gamma containing zero,

  (1) The physical measure for $\gamma\neq 0$ is the limit of evolving the dynamics starting from the physical measure at $\gamma=0$.

  (2) For any observable functions $\varphi,\psi\in C^{2}$, the following sum of correlations

\sum_{n\geq 1}\left|\mathbb{E}_{\gamma}[\varphi(X_{n})\psi(X_{0},X_{1})]-\mathbb{E}_{\gamma}[\varphi(X_{0})]\mathbb{E}_{\gamma}[\psi(X_{0},X_{1})]\right|

converges uniformly in terms of $\gamma$.

Assumption 2 can be proved from more basic assumptions, which are typically more forgiving than the hyperbolicity assumption for deterministic cases. For example, under assumption 1, the proof of the unique existence of the physical measure, when the system has additive noise with a smooth density, can be found in probability textbooks such as [11]. [14] proves the decay of correlations when, roughly speaking, $f_\gamma$ has bounded variance, is mixing, and the transfer operator is bounded. The technical assumptions and a rigorous proof of the linear response can also be found in [21]. Our main purpose is to derive the ergodic kernel-differentiation formula with the detailed expression in section 2.2. The formula seems to be the same in most cases where the linear response exists.

2.2. One-step kernel-differentiation

We can give a pointwise formula for $\delta(L_pL_{f_\gamma}q)(x)$, which seems to be long-known wisdom; an intuitive explanation of the following lemma is given in section 2.3. We then give the integral version, which can be computed by Monte-Carlo algorithms.

{lemma}

[local one-step kernel-differentiation formula] Under assumption 1, then for any density h0h_{0}

δ(LpLfγh0)(x1)=δfγ(x0)dp(x1fx0)h0(x0)dx0.\begin{split}\delta(L_{p}L_{f_{\gamma}}h_{0})(x_{1})=\int-\delta f_{\gamma}(x_{0})\cdot dp(x_{1}-fx_{0})\,h_{0}(x_{0})dx_{0}.\end{split}

Here δ:=/γ|γ=0\delta:=\partial/\partial\gamma|_{\gamma=0}, and δfγ(x0)dp(x1fx0)\delta f_{\gamma}(x_{0})\cdot dp(x_{1}-fx_{0}) is the derivative of the function p()p(\cdot) at x1fx0x_{1}-fx_{0} in the direction δfγ(x0)\delta f_{\gamma}(x_{0}), which is a vector at fx0fx_{0}.

Remark.

Note that we do not compute $\delta L_{f_\gamma}$ separately. In fact, the main point here is that the convolution with $p$ allows us to differentiate $p$, which is typically much more forgiving than differentiating only $L_{f_\gamma}$. Although there is a divergence formula, $\delta L_{f_\gamma}h_0=\textnormal{div}(h_0\,\delta f_\gamma\circ f^{-1})$, we would still need to know the derivative of $h_0$. This is difficult when $h_0$ is the physical density $h$: we can quench a specific noise sequence $\{Y_n=y_n\}_{n\geq 0}$ and, for each sample path, apply the adjoint shadowing lemma and equivariant divergence formula in [30, 29], but this requires hyperbolicity, which can be too strict for real-life systems.

Proof.

First write a pointwise expression for LpLfγh0L_{p}L_{f_{\gamma}}h_{0}: by definition of LpL_{p} in equation 4,

(LpLfγh0)(x1)=(Lfγh0)(z1)p(x1z1)𝑑z1.\begin{split}(L_{p}L_{f_{\gamma}}h_{0})(x_{1})=\int(L_{f_{\gamma}}h_{0})(z_{1})p(x_{1}-z_{1})dz_{1}.\end{split}

Since $p\in C^2(\mathbb{R}^M)$, we can substitute $p$ for $\Phi$ in the definition of $L_f$ in equation 3, to get

=h0(x0)p(x1fγx0)𝑑x0.\begin{split}=\int h_{0}(x_{0})p(x_{1}-f_{\gamma}x_{0})dx_{0}.\end{split}

Differentiate with respect to γ\gamma, we have

δ(LpLfγh0)(x1)=h0(x0)δfγ(x0)dp(x1fx0)dx0.\begin{split}\delta(L_{p}L_{f_{\gamma}}h_{0})(x_{1})=\int-h_{0}(x_{0})\delta f_{\gamma}(x_{0})\cdot dp(x_{1}-fx_{0})dx_{0}.\end{split}

{lemma}

[integrated one-step kernel-differentiation formula] Under assumption 1, then

Φ(x1)δ(LpLfγh0)(x1)𝑑x1=Φ(x1)δfγ(x0)dp(y1)p(y1)h0(x0)dx0p(y1)dy1.\begin{split}\int\Phi(x_{1})\delta(L_{p}L_{f_{\gamma}}h_{0})(x_{1})dx_{1}=\iint-\Phi(x_{1})\delta f_{\gamma}(x_{0})\cdot\frac{dp(y_{1})}{p(y_{1})}\,h_{0}(x_{0})dx_{0}\,p(y_{1})dy_{1}.\end{split}

Here x1x_{1} in the right expression is a function of the dummy variables x0x_{0} and y1y_{1}, that is, x1=y1+fx0x_{1}=y_{1}+fx_{0}.

Remark.

The point is, if we want to integrate the right expression by Monte-Carlo, we just generate random pairs $\{x_{l,0},y_{l,1}\}$, compute $x_{l,1}$ accordingly, and average $I_l:=-\Phi(x_{l,1})\,\delta f_\gamma(x_{l,0})\cdot\frac{dp}{p}(y_{l,1})$ over many $l$'s.
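A minimal sketch of this one-step estimator, assuming one-dimensional dynamics, Gaussian $h_0$ and Gaussian noise with standard deviation $\sigma$ (so $\frac{dp}{p}(y)=-y/\sigma^2$), and hypothetical choices of $f_\gamma$, $\delta f_\gamma$, and $\Phi$; the finite-difference line at the end is only a sanity check for this toy setting:

```python
import numpy as np

# One-step kernel-differentiation estimator (the remark above), toy 1-D setting.
rng = np.random.default_rng(2)
L, sigma, gamma = 10**6, 0.3, 1e-3
f = lambda x, g=0.0: np.cos(x) + g * np.sin(x)   # hypothetical f_gamma
delta_f = lambda x: np.sin(x)                    # d f_gamma / d gamma at gamma = 0
Phi = lambda x: x**2

x0 = rng.standard_normal(L)                      # x_{l,0} ~ h_0
y1 = sigma * rng.standard_normal(L)              # y_{l,1} ~ p
x1 = f(x0) + y1
I = -Phi(x1) * delta_f(x0) * (-y1 / sigma**2)    # I_l, with dp/p(y) = -y/sigma^2
print("kernel-differentiation:", I.mean())

# crude check: finite difference of E[Phi(x_1)] with common random numbers
print("finite difference     :", (Phi(f(x0, gamma) + y1).mean() - Phi(x1).mean()) / gamma)
```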

Proof.

Substitute section 2.2 into the integration, we get

Φ(x1)δ(LpLfγh0)(x1)𝑑x1=Φ(x1)δfγ(x0)𝑑p(x1fx0)h0(x0)𝑑x0𝑑x1\begin{split}\int\Phi(x_{1})\delta(L_{p}L_{f_{\gamma}}h_{0})(x_{1})dx_{1}=-\iint\Phi(x_{1})\delta f_{\gamma}(x_{0})\cdot dp(x_{1}-fx_{0})\,h_{0}(x_{0})dx_{0}dx_{1}\end{split}

The problem with this expression is that, should we want Monte-Carlo, it is not obvious which measure we should generate x1x_{1}’s according to. To solve this issue, change the order of the double integration to get

=δfγ(x0)(Φ(x1)𝑑p(x1fx0)𝑑x1)h0(x0)𝑑x0.\begin{split}=-\int\delta f_{\gamma}(x_{0})\cdot\left(\int\Phi(x_{1})dp(x_{1}-fx_{0})dx_{1}\right)h_{0}(x_{0})dx_{0}.\end{split}

Change the dummy variable of the inner integration from x1x_{1} to y1=x1fx0y_{1}=x_{1}-fx_{0},

=δfγ(x0)(Φ(y1+fx0)𝑑p(y1)𝑑y1)h0(x0)𝑑x0=Φ(x1(x0,y1))δfγ(x0)dpp(y1)p(y1)𝑑y1h0(x0)𝑑x0.\begin{split}=-\int\delta f_{\gamma}(x_{0})\cdot\left(\int\Phi(y_{1}+fx_{0})dp(y_{1})dy_{1}\right)h_{0}(x_{0})dx_{0}\\ =-\iint\Phi(x_{1}(x_{0},y_{1}))\,\delta f_{\gamma}(x_{0})\cdot\frac{dp}{p}(y_{1})\,p(y_{1})dy_{1}h_{0}(x_{0})dx_{0}.\end{split}

Here x1x_{1} is a function as stated in the lemma. ∎

2.3. An intuitive explanation

We give an intuitive explanation to section 2.2 and Remark. This will help us to generalize the method to the foliated situation in section 3. First, we shall adopt a more intuitive but restrictive notation. Define f~(x,γ){\tilde{f}}(x,\gamma) such that

f~(f(x),γ)=fγ(x).\begin{split}{\tilde{f}}(f(x),\gamma)=f_{\gamma}(x).\end{split}

In other words, we write fγf_{\gamma} as appending a small perturbative map f~{\tilde{f}} to f:=fγ=0f:=f_{\gamma=0}. Here f~{\tilde{f}} is the identity map when γ=0\gamma=0. Note that f~{\tilde{f}} can be defined only if ff satisfies

fγ(x)=fγ(x)wheneverf(x)=f(x).\begin{split}f_{\gamma}(x)=f_{\gamma}(x^{\prime})\quad\textnormal{whenever}\quad f(x)=f(x^{\prime}).\end{split}

For example, when ff is bijective then we can well-define f~{\tilde{f}}. Hence, this new notation is more restrictive than what we used in other parts of this section, but it lets us see more clearly what happens during the perturbation.

With the new notation, the dynamics can be written now as

Xn=f~f(Xn1)+Yn,whereYniidp.\begin{split}X_{n}={\tilde{f}}f(X_{n-1})+Y_{n},\quad\textnormal{where}\quad Y_{n}{\,\overset{\mathrm{iid}}{\sim}\,}p.\end{split}

By this notation, only ff changes time step, but f~{\tilde{f}} and adding YY do not change time. In this subsection we shall start from after having applied the map ff, and only look at the effect of applying f~{\tilde{f}} and adding noise: this is enough to account for the essentials of section 2.2 and Remark. Roughly speaking, the qnq_{n} we use in the following is in fact qn:=Lfhn1q_{n}:=L_{f}h_{n-1}; qnq_{n} is the density of zn:=fxn1z_{n}:=fx_{n-1}, so zn+yn=xnz_{n}+y_{n}=x_{n}, and we omit the subscript for time 0.

With the new notation, section 2.2 is essentially equivalent to

(7) δ(LpLf~q)(x)=δf~zdp(xz)q(z)dz.\begin{split}\delta(L_{p}L_{{\tilde{f}}}q)(x)=\int-\delta{\tilde{f}}z\cdot dp(x-z)\,q(z)dz.\end{split}

We explain this intuitively by figure 1. Let ${\tilde{z}}:={\tilde{f}}z$ be distributed according to ${\tilde{q}}$; then $L_p{\tilde{q}}$ is obtained by first attaching a density $p$ to each ${\tilde{z}}$, and then integrating over all ${\tilde{z}}$. $L_pL_{\tilde{f}}q$ is obtained by first moving $z$ to ${\tilde{z}}$ and then performing the same procedure. Hence, $\delta L_pL_{\tilde{f}}q(x)$ is obtained by first computing $\delta p_{{\tilde{f}}z}(x)$ for each $z$, and then integrating over $z$. Let

pf~z(x):=p(xf~z)\begin{split}p_{{\tilde{f}}z}(x):=p(x-{\tilde{f}}z)\end{split}

be the density of the noise centered at f~z{\tilde{f}}z. So

δpf~z(x)=dpf~z(x)δf~z(x)=dp(xf~z)δf~z.\begin{split}\delta p_{{\tilde{f}}z}(x)=-dp_{{\tilde{f}}z}(x)\cdot\delta{\tilde{f}}z(x)=-dp(x-{\tilde{f}}z)\cdot\delta{\tilde{f}}z.\end{split}

Here $\delta{\tilde{f}}z(x)$ in the middle expression is the horizontal shift of the level set of $p_{{\tilde{f}}z}$ previously located at $x$. Since the entire Gaussian distribution is moved in parallel on the Euclidean space $\mathbb{R}^M$, $\delta{\tilde{f}}z(x)=\delta{\tilde{f}}z$ is constant for all $x$. Then we can integrate over $z$ to get equation 7.

Figure 1. Intuitions for section 2.2 and the Remark.

Using f~{\tilde{f}} notation, Remark is equivalent to

δΦ(x)(LpLf~q)(x)𝑑x=Φ(x)δf~(z)dp(y)p(y)q(z)dzp(y)dy,\begin{split}\delta\int\Phi(x)(L_{p}L_{\tilde{f}}q)(x)dx=\iint-\Phi(x)\delta{\tilde{f}}(z)\cdot\frac{dp(y)}{p(y)}\,q(z)dz\,p(y)dy,\end{split}

where $x$ on the right is the function $x=z+y$. Intuitively, this says that the left side equals first computing $\delta\int\Phi(z+y)p_{{\tilde{f}}z}(y)dy$ for each $z$, and then integrating over $z$. The integration on the left uses $x$ as the dummy variable, which is convenient for the transfer operator formula above; but it does not involve a density for $x$, so it is not immediately ready for Monte-Carlo. The integration on the right is over $q(z)$ and $p(y)$, which are easy to sample.

It is important that we differentiate only pp but not ff. In fact, our core intuitions are completely within one time step, and do not even involve ff. Hence, we can easily generalize to cases where ff is bad, for example, when ff is not bijective, or when ff is not hyperbolic: these are all difficult for deterministic systems.

2.4. Kernel-differentiation over finitely many steps

We use the Remark to get integrated formulas for $h_T:=(L_pL_{f_\gamma})^Th_0$ and its perturbation $\delta h_T$, which can be sampled by Monte-Carlo type algorithms. Note that here $h_0$ is fixed and does not depend on $\gamma$. This was previously known as the likelihood ratio method or the Monte-Carlo gradient method and was given in [37, 36, 17]. The magnitude of the integrand grows as $O(\sqrt{T})$ after the centralization, where $T$ is the number of time steps.

{lemma}

Under assumption 1, for any C2C^{2} (not necessarily positive) densities h0h_{0}, any nn\in\mathbb{N},

Φ(xn)((LpLf)nh0)(xn)𝑑xn=Φ(xn)h0(x0)𝑑x0p(y1)𝑑y1p(yn)𝑑yn.\begin{split}\int\Phi(x_{n})((L_{p}L_{f})^{n}h_{0})(x_{n})dx_{n}=\int\Phi(x_{n})h_{0}(x_{0})dx_{0}p(y_{1})dy_{1}\cdots p(y_{n})dy_{n}.\end{split}

Here the xnx_{n} on the left is a dummy variable; whereas xnx_{n} on the right is recursively defined by xm=f(xm1)+ymx_{m}=f(x_{m-1})+y_{m}, so xnx_{n} is a function of the dummy variables x0,y1,,ynx_{0},y_{1},\cdots,y_{n}.

Remark.

The integrated formula is straightforward to compute by the Monte-Carlo method. That is, for each $l$, we generate a random $x_{l,0}$ according to the density $h_0$, and $y_{l,1},\cdots,y_{l,n}$ i.i.d. according to $p$. Then we compute $\Phi(x_{l,n})$ for this particular sample $\{x_{l,0},y_{l,1},\cdots,y_{l,n}\}$. Note that the experiments for different $l$ should be either independent or decorrelated. Then the Monte-Carlo integration is simply

Φ(xn)((LpLf)nh0)(xn)𝑑xn=limL1Ll=1LΦ(xl,n).\begin{split}\int\Phi(x_{n})((L_{p}L_{f})^{n}h_{0})(x_{n})dx_{n}=\lim_{L\rightarrow\infty}\frac{1}{L}\sum_{l=1}^{L}\Phi(x_{l,n}).\end{split}

almost surely according to $h_0\times p_1\times\cdots\times p_n$. In this paper we shall use $n,m$ to indicate time steps, whereas $l$ labels different samples.
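A minimal sketch of this sample average, with hypothetical Gaussian choices of $h_0$ and $p$ and a toy $f$ and $\Phi$:

```python
import numpy as np

# Monte-Carlo average of Phi at step n over L independent noisy orbits (remark above).
rng = np.random.default_rng(3)
L, n, sigma = 10**5, 10, 0.3
f = lambda x: np.cos(x)                  # hypothetical deterministic part
Phi = lambda x: x**2

x = rng.standard_normal(L)               # x_{l,0} ~ h_0, one entry per sample l
for m in range(1, n + 1):
    y = sigma * rng.standard_normal(L)   # y_{l,m} ~ p, independent over l and m
    x = f(x) + y                         # x_{l,m} = f(x_{l,m-1}) + y_{l,m}
print(np.mean(Phi(x)))                   # estimates  int Phi(x_n) ((L_p L_f)^n h_0)(x_n) dx_n
```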

Proof.

Sequentially apply the definition of LpL_{p} and LfL_{f}, we get

(8) Φ(xn)((LpLf)nh0)(xn)𝑑xn=Φ(yn+zn)(Lf(LpLf)n1h0)(zn)p(yn)𝑑yn𝑑zn=Φ(xn(xn1,yn))((LpLf)n1h0)(xn1)p(yn)𝑑yn𝑑xn1.\begin{split}\int\Phi(x_{n})((L_{p}L_{f})^{n}h_{0})(x_{n})dx_{n}\\ =\iint\Phi(y_{n}+z_{n})(L_{f}(L_{p}L_{f})^{n-1}h_{0})(z_{n})p(y_{n})dy_{n}dz_{n}\\ =\iint\Phi(x_{n}(x_{n-1},y_{n}))((L_{p}L_{f})^{n-1}h_{0})(x_{n-1})p(y_{n})dy_{n}dx_{n-1}.\end{split}

Here xn:=yn+f(xn1)x_{n}:=y_{n}+f(x_{n-1}) in the last expression is a function of the dummy variables xn1x_{n-1} and yny_{n}. Roughly speaking, the dummy variable znz_{n} in the second expressions is f(xn1)f(x_{n-1}).

Recursively apply equation 8 once, we have

(Φ(xn(xn1,yn))((LpLf)n1h0)(xn1)𝑑xn1)p(yn)𝑑yn=Φ(xn(xn1(xn2,yn1),yn))((LpLf)n2h0)(xn2)p(yn1)p(yn)𝑑yn1𝑑yn𝑑xn2.\begin{split}\int\left(\int\Phi(x_{n}(x_{n-1},y_{n}))((L_{p}L_{f})^{n-1}h_{0})(x_{n-1})dx_{n-1}\right)p(y_{n})dy_{n}\\ =\int\Phi(x_{n}(x_{n-1}(x_{n-2},y_{n-1}),y_{n}))((L_{p}L_{f})^{n-2}h_{0})(x_{n-2})p(y_{n-1})p(y_{n})dy_{n-1}dy_{n}dx_{n-2}.\end{split}

Here xn1:=yn1+f(xn2)x_{n-1}:=y_{n-1}+f(x_{n-2}), so xnx_{n} is a function of the dummy variables xn2,yn1x_{n-2},y_{n-1}, and yny_{n}. Keep applying equation 8 recursively to prove the lemma. ∎

{theorem}

[kernel-differentiation formula for δhT\delta h_{T}] Under assumption 1, then

δΦ(xT)hT(xT)𝑑xT=Φ(xT)m=0T1(δfγxmdpp(ym+1))h0(x0)dx0(pdy)1T.\begin{split}\delta\int\Phi(x_{T})h_{T}(x_{T})dx_{T}=-\int\Phi(x_{T})\sum_{m=0}^{T-1}\left(\delta f_{\gamma}x_{m}\cdot\frac{dp}{p}(y_{m+1})\right)h_{0}(x_{0})dx_{0}(pdy)_{1\sim T}.\end{split}

Here dpdp is the differential of pp, (pdy)1T:=p(y1)dy1p(yT)dyT(pdy)_{1\sim T}:=p(y_{1})dy_{1}\cdots p(y_{T})dy_{T} is the independent distribution of YnY_{n}’s, and xmx_{m} is a function of dummy variables x0,y1,,ymx_{0},y_{1},\cdots,y_{m}.

Proof.

Note that Φ\Phi is fixed, use equation 6, and let m=Tn1m=T-n-1, we get

δΦ(xT)hT(xT)𝑑xT=Φ(xT)(δhT)(xT)𝑑xT=Φ(xT)m=0T1((LpLf)Tm1δ(LpLfγ)(LpLf)mh0)(xT)dxT.\begin{split}\delta\int\Phi(x_{T})h_{T}(x_{T})dx_{T}=\int\Phi(x_{T})(\delta h_{T})(x_{T})dx_{T}\\ =\int\Phi(x_{T})\sum_{m=0}^{T-1}\left((L_{p}L_{f})^{T-m-1}\,\delta(L_{p}L_{f_{\gamma}})\,(L_{p}L_{f})^{m}h_{0}\right)(x_{T})dx_{T}.\end{split}

For each mm, first apply section 2.4 several times,

Φ(xT)((LpLf)Tm1δ(LpLfγ)(LpLf)mh0)(xT)𝑑xT=Φ(xT)(δ(LpLfγ)(LpLf)mh0)(xm+1)𝑑xm+1(pdy)m+2T\begin{split}\int\Phi(x_{T})\left((L_{p}L_{f})^{T-m-1}\,\delta(L_{p}L_{f_{\gamma}})\,(L_{p}L_{f})^{m}h_{0}\right)(x_{T})dx_{T}\\ =\int\Phi(x_{T})\left(\delta(L_{p}L_{f_{\gamma}})\,(L_{p}L_{f})^{m}h_{0}\right)(x_{m+1})\,dx_{m+1}(pdy)_{m+2\sim T}\end{split}

Here xTx_{T} is a function of xm+1,ym+2,,yTx_{m+1},y_{m+2},\cdots,y_{T}. Then apply Remark once,

=Φ(xT)δfγxmdpp(ym+1)((LpLf)mh0)(xm)𝑑xm(pdy)m+1T\begin{split}=-\int\Phi(x_{T})\delta f_{\gamma}x_{m}\cdot\frac{dp}{p}(y_{m+1})\left((L_{p}L_{f})^{m}h_{0}\right)(x_{m})\,dx_{m}(pdy)_{m+1\sim T}\end{split}

Now xTx_{T} is a function of xm,ym+1,,yTx_{m},y_{m+1},\cdots,y_{T}. Then apply section 2.4 several times again,

=Φ(xT)δfγxmdpp(ym+1)h0(x0)𝑑x0(pdy)1T\begin{split}=-\int\Phi(x_{T})\delta f_{\gamma}x_{m}\cdot\frac{dp}{p}(y_{m+1})h_{0}(x_{0})\,dx_{0}(pdy)_{1\sim T}\end{split}

Then sum over mm to prove the theorem. ∎
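For concreteness (this specific choice of noise is ours, not required by the theorem), if $p$ is an isotropic Gaussian with standard deviation $\sigma$, then

\begin{split}p(y)=(2\pi\sigma^{2})^{-M/2}\exp\left(-\frac{|y|^{2}}{2\sigma^{2}}\right),\qquad dp(y)=-\frac{y}{\sigma^{2}}\,p(y),\qquad\frac{dp}{p}(y)=-\frac{y}{\sigma^{2}},\end{split}

so each term in the sum is $\delta f_\gamma x_m\cdot\frac{dp}{p}(y_{m+1})=-\frac{1}{\sigma^2}\,\delta f_\gamma(x_m)\cdot y_{m+1}$, which is cheap to accumulate along sampled orbits.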

Note that subtracting any constant from $\Phi$ does not change the linear response; we prove a more detailed statement in the Remark below. Hence, we can centralize $\Phi$, i.e. replace $\Phi(\cdot)$ by $\Phi(\cdot)-\Phi_{avg,T}$, where the constant

Φavg,T:=Φ(xT)hT(xT)𝑑xT.\begin{split}\Phi_{avg,T}:=\int\Phi(x_{T})h_{T}(x_{T})dx_{T}.\end{split}

We sometimes centralize by subtracting Φavg:=Φh𝑑x\Phi_{avg}:=\int\Phi hdx. The centralization reduces the amplitude of the integrand, so the Monte-Carlo method converges faster: this is also a known result and was discussed in for example [18].

{lemma}

[free centralization] Under assumption 1, for any m0m\geq 0, we have

δfγxmdpp(ym+1)h0(x0)𝑑x0(pdy)1m+1=0.\begin{split}\int\delta f_{\gamma}x_{m}\cdot\frac{dp}{p}(y_{m+1})h_{0}(x_{0})dx_{0}(pdy)_{1\sim m+1}=0.\end{split}
Proof.

First note that a density function $p$ on $\mathbb{R}^M$ with a bounded $C^2$ norm must have $\lim_{y\rightarrow\infty}p(y)=0$, since otherwise its integral could not be one.

Then notice that xmx_{m} is a function of dummy variables x0,y1,,ymx_{0},y_{1},\cdots,y_{m}, so we can first integrate

δfγxmdpp(ym+1)p(ym+1)𝑑ym+1=δfγxm𝑑p(ym+1)𝑑ym+1=0.\begin{split}\delta f_{\gamma}x_{m}\cdot\int\frac{dp}{p}(y_{m+1})p(y_{m+1})dy_{m+1}=\delta f_{\gamma}x_{m}\cdot\int dp(y_{m+1})dy_{m+1}=0.\end{split}

since $p(y)\rightarrow 0$ as $y\rightarrow\infty$. Here $dp$ is the differential of the function $p$, whereas $dy$ indicates the integration. ∎
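A quick numerical sanity check of this free centralization, for the hypothetical case of Gaussian noise where $\frac{dp}{p}(y)=-y/\sigma^2$:

```python
import numpy as np

# Check that E[dp/p(Y)] = 0 when Y ~ p; here p is Gaussian with std sigma.
rng = np.random.default_rng(4)
sigma = 0.3
Y = sigma * rng.standard_normal(10**7)
print(np.mean(-Y / sigma**2))   # ~ 0, up to Monte-Carlo error of order 1/(sigma*sqrt(10^7))
```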

For the convenience of computer coding, we explicitly rewrite this theorem into centralized and time-inhomogeneous form, where Φ\Phi, ff, and pp are different for each step. This is the setting for many applications, such as finitely-deep neural networks. The following proposition can be proved similarly to our previous proofs. Note that we can reuse a lot of data should we also care about the perturbation of the averaged Φn\Phi_{n} for other layers 1nT1\leq n\leq T.

In the following proposition, let ΦT:MT\Phi_{T}:\mathbb{R}^{M_{T}}\rightarrow\mathbb{R} be the observable function defined on the last layer of dynamics. Let hTh_{T} be the pushforward measure given by the dynamics in equation 2, that is, defined recursively by hn:=LpnLfγ,n1hn1h_{n}:=L_{p_{n}}L_{f_{\gamma,n-1}}h_{n-1}; Let (pdy)1T:=p1(y1)dy1pT(yT)dyT(pdy)_{1\sim T}:=p_{1}(y_{1})dy_{1}\cdots p_{T}(y_{T})dy_{T} be the independent but not necessarily identical distribution of YnY_{n}’s.

{proposition}

[centralized kernel-differentiation formula for δhT\delta h_{T} of time-inhomogeneous systems] If fγ,nf_{\gamma,n} and pnp_{n} satisfy assumption 1 for all nn, then

δΦT(xT)hT(xT)𝑑xT=(ΦT(xT)Φavg,T)m=0T1(δfγ,mxmdpp(ym+1))h0(x0)dx0(pdy)1T.\begin{split}&\delta\int\Phi_{T}(x_{T})h_{T}(x_{T})dx_{T}\\ =&-\int\left(\Phi_{T}(x_{T})-\Phi_{avg,T}\right)\sum_{m=0}^{T-1}\left(\delta f_{\gamma,m}x_{m}\cdot\frac{dp}{p}(y_{m+1})\right)h_{0}(x_{0})dx_{0}(pdy)_{1\sim T}.\end{split}

Here Φavg,T:=ΦThT\Phi_{avg,T}:=\int\Phi_{T}h_{T}.
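As a concrete but entirely hypothetical illustration of this proposition, the sketch below estimates $\delta\int\Phi_T h_T$ for a small noisy layered map $x_{n+1}=\tanh(W_nx_n)+\gamma b_n+Y_{n+1}$ with isotropic Gaussian noise, so that $\delta f_{\gamma,n}=b_n$ and $\frac{dp}{p}(y)=-y/\sigma^2$; the weights, widths, and observable are placeholders, not those of the network in section 4.3.

```python
import numpy as np

# Centralized time-inhomogeneous estimator (proposition above) for a toy layered map.
rng = np.random.default_rng(5)
M, T, L, sigma = 9, 5, 10**5, 0.1
W = [rng.standard_normal((M, M)) / np.sqrt(M) for _ in range(T)]   # hypothetical layer weights
b = [rng.standard_normal(M) for _ in range(T)]                     # delta f_{gamma,n} = b_n
Phi_T = lambda x: x[:, 0]                                          # observable on the last layer

x = rng.standard_normal((L, M))        # x_{l,0} ~ h_0 (hypothetical Gaussian initial density)
S = np.zeros(L)                        # accumulates sum_m delta f_{gamma,m} . dp/p(y_{m+1})
for n in range(T):
    y = sigma * rng.standard_normal((L, M))
    S += (b[n] * (-y / sigma**2)).sum(axis=1)
    x = np.tanh(x @ W[n].T) + y        # unperturbed (gamma = 0) orbit
phi = Phi_T(x)
print(-np.mean((phi - phi.mean()) * S))   # estimates the linear response of E[Phi_T]
```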

2.5. Ergodic kernel-differentiation formula over infinitely many steps

For the perturbation of physical measures, the kernel-differentiation formula for δh\delta h can further take the form of long-time average on an orbit. This reduces the magnitude of the integrand so the convergence is faster. Here we can apply the ergodic theorem, forget details of initial distribution, and sample hh by points from one long orbit. Note that this subsection requires that the dynamics is time-homogeneous (ff and pp are the same for each step) to invoke ergodic theorems.

{lemma}

[kernel-differentiation formula for physical measures] Under assumptions 1 and 2,

δΦ(x)h(x)𝑑x=Φ(x)δh(x)𝑑x=limWn=1WΦ(xn)(δfγx0dpp(y1))hγ=0(x0)𝑑x0(pdy)1n\begin{split}\delta\int\Phi(x)h(x)dx=\int\Phi(x)\delta h(x)dx\\ =-\lim_{W\rightarrow\infty}\sum_{n=1}^{W}\int\Phi(x_{n})\left(\delta f_{\gamma}x_{0}\cdot\frac{dp}{p}(y_{1})\right)h_{\gamma=0}(x_{0})dx_{0}(pdy)_{1\sim n}\end{split}
Proof.

By assumption 2, we can start from $h_{\gamma=0}$, apply the perturbed dynamics $f_{\gamma\neq 0}$ many times, and $h_T:=(L_pL_{f_{\gamma\neq 0}})^Th_{\gamma=0}$ converges to $h_\gamma$.

By Remark, for any TT,

δΦ(xT)hT(xT)𝑑xT=m=0T1Φ(xT)(δfγxmdpp(ym+1))hγ=0(x0)𝑑x0(pdy)1T.\begin{split}\delta\int\Phi(x_{T})h_{T}(x_{T})dx_{T}=-\sum_{m=0}^{T-1}\int\Phi(x_{T})\left(\delta f_{\gamma}x_{m}\cdot\frac{dp}{p}(y_{m+1})\right)h_{\gamma=0}(x_{0})dx_{0}(pdy)_{1\sim T}.\end{split}

Note that hγ=0h_{\gamma=0} is invariant measure, so the expression

=m=0T1Φ(xTm)(δfγx0dpp(y1))hγ=0(x0)𝑑x0(pdy)1Tm\begin{split}=-\sum_{m=0}^{T-1}\int\Phi(x_{T-m})\left(\delta f_{\gamma}x_{0}\cdot\frac{dp}{p}(y_{1})\right)h_{\gamma=0}(x_{0})dx_{0}(pdy)_{1\sim T-m}\end{split}

Passing TmT-m to nn,

=n=1TΦ(xn)(δfγx0dpp(y1))hγ=0(x0)𝑑x0(pdy)1n\begin{split}=-\sum_{n=1}^{T}\int\Phi(x_{n})\left(\delta f_{\gamma}x_{0}\cdot\frac{dp}{p}(y_{1})\right)h_{\gamma=0}(x_{0})dx_{0}(pdy)_{1\sim n}\end{split}

Note that this is the correlation function between Φ(xTm)\Phi(x_{T-m}) and δfγx0dpp(y1)\delta f_{\gamma}x_{0}\cdot\frac{dp}{p}(y_{1}). By Remark, 𝔼[δfγX0dpp(Y1)]=0\mathbb{E}[\delta f_{\gamma}X_{0}\cdot\frac{dp}{p}(Y_{1})]=0. By assumption 2, the derivatives expressed by the above summations uniformly converge to the expression in the lemma. Hence, the limit equals δh\delta h. ∎

We now prove the theorem stated in section 1.2.

Remark.

Recall that WW is determined by the rate of decay of correlations, whereas LL is the number of averaging samples. Typically WLW\ll L in numerics.

Proof.

Since hh is invariant for the dynamic Xn+1=f(Xn)+Yn+1X_{n+1}=f(X_{n})+Y_{n+1}, if X0hX_{0}\sim h, then {Xl,Yl+1,,Yl+n}l0\{X_{l},Y_{l+1},\cdots,Y_{l+n}\}_{l\geq 0} is a stationary sequence, so we can apply Birkhoff’s ergodic theorem (the version for stationary sequences),

Φ(xn)(δfγx0dpp(y1))hγ=0(x0)𝑑x0(pdy)1n=a.s.limL1Ll=1LΦ(Xn+l)δfγXldpp(Y1+l).\begin{split}\int\Phi(x_{n})\left(\delta f_{\gamma}x_{0}\cdot\frac{dp}{p}(y_{1})\right)h_{\gamma=0}(x_{0})dx_{0}(pdy)_{1\sim n}\\ \overset{\mathrm{a.s.}}{=}\lim_{L\rightarrow\infty}\frac{1}{L}\sum_{l=1}^{L}\Phi(X_{n+l})\,\delta f_{\gamma}X_{l}\cdot\frac{dp}{p}(Y_{1+l}).\end{split}

By substitution we have

δΦ(x)h(x)𝑑x=a.s.limWlimLn=1W1Ll=1LΦ(Xn+l)δfγXldpp(Y1+l)\begin{split}\delta\int\Phi(x)h(x)dx\,\overset{\mathrm{a.s.}}{=}\,\lim_{W\rightarrow\infty}\lim_{L\rightarrow\infty}-\sum_{n=1}^{W}\frac{1}{L}\sum_{l=1}^{L}\Phi(X_{n+l})\,\delta f_{\gamma}X_{l}\cdot\frac{dp}{p}(Y_{1+l})\end{split}

Then we can rerun the proof after centralizing Φ\Phi. ∎

2.6. Further generalizations

2.6.1. pp depends on zz and γ\gamma

Our pictorial intuition does not care whether pp depends on γ\gamma or zz. So we can generalize all of our results to the case

Xn+1=f(Xn)+Yn+1,where(Yn+1=y|f(Xn)=z)=pγ,z(y).\begin{split}X_{n+1}=f(X_{n})+Y_{n+1},\quad\textnormal{where}\quad\mathbb{P}(Y_{n+1}=y\,|\,f(X_{n})=z)=p_{\gamma,z}(y).\end{split}

We write down these formulas without repeating proofs.

Equation 4 and equation 5 become

Lpγq(x)=q(xy)pγ,xy(y)𝑑y=q(z)pγ,z(xz)𝑑z.Φ(x)Lpγq(x)𝑑x=Φ(y+z)q(z)pγ,z(y)𝑑z𝑑y.\begin{split}L_{p_{\gamma}}q(x)=\int q(x-y)p_{\gamma,x-y}(y)dy=\int q(z)p_{\gamma,z}(x-z)dz.\\ \int\Phi(x)L_{p_{\gamma}}q(x)dx=\iint\Phi(y+z)q(z)p_{\gamma,z}(y)dzdy.\end{split}

Section 2.4 becomes

Φ(xn)((LpγLf)nh0)(xn)𝑑xn=Φ(xn)h0(x0)𝑑x0pγ,z1(y1)𝑑y1pγ,zn(yn)𝑑yn.\begin{split}\int\Phi(x_{n})((L_{p_{\gamma}}L_{f})^{n}h_{0})(x_{n})dx_{n}=\int\Phi(x_{n})h_{0}(x_{0})dx_{0}p_{\gamma,z_{1}}(y_{1})dy_{1}\cdots p_{\gamma,z_{n}}(y_{n})dy_{n}.\end{split}

Here zmz_{m} and xmx_{m} on the right are recursively defined by zm=f(xm1)z_{m}=f(x_{m-1}), xm=zm+ymx_{m}=z_{m}+y_{m}. We also have the pointwise formula

(LpγLfγh)(x1)=h(x0)pγ,fγx0(x1fγx0)𝑑x0.\begin{split}(L_{p_{\gamma}}L_{f_{\gamma}}h)(x_{1})=\int h(x_{0})p_{\gamma,f_{\gamma}x_{0}}(x_{1}-f_{\gamma}x_{0})dx_{0}.\end{split}

Since now pp depends on γ\gamma via three ways, section 2.2 becomes

δ(LpγLfγh)(x1)=dpdγ(x0,y1)h(x0)𝑑x0,wherez1=fx0,y1=x1z1,dpdγ(x0,y1):=ddγpγ,fγx0(x1fγx0)=δp(z1,y1)+δfγ(x0)(pzpy)(z1,y1).\begin{split}\delta(L_{p_{\gamma}}L_{f_{\gamma}}h)(x_{1})=\int\frac{dp}{d\gamma}(x_{0},y_{1})h(x_{0})dx_{0},\quad\textnormal{where}\quad z_{1}=fx_{0},\quad y_{1}=x_{1}-z_{1},\\ \frac{dp}{d\gamma}(x_{0},y_{1}):=\frac{d}{d\gamma}p_{\gamma,f_{\gamma}x_{0}}(x_{1}-f_{\gamma}x_{0})=\delta p(z_{1},y_{1})+\delta f_{\gamma}(x_{0})\cdot\left(\frac{\partial p}{\partial z}-\frac{\partial p}{\partial y}\right)(z_{1},y_{1}).\end{split}

Here derivatives pz\frac{\partial p}{\partial z} and py\frac{\partial p}{\partial y} refer to writing the density as pγ,z(y)p_{\gamma,z}(y). If pp does not depend on γ\gamma and zz, then we recover section 2.2. Remark becomes

δΦ(x1)(LpLfγh)(x1)𝑑x1=Φ(x1)δ(LpLfγh)(x1)𝑑x1=Φ(x1)dppdγ(x0,y1)h(x0)𝑑x0p(y1)𝑑y1.\begin{split}\delta\int\Phi(x_{1})(L_{p}L_{f_{\gamma}}h)(x_{1})dx_{1}=\int\Phi(x_{1})\delta(L_{p}L_{f_{\gamma}}h)(x_{1})dx_{1}\\ =\iint\Phi(x_{1})\frac{dp}{pd\gamma}(x_{0},y_{1})h(x_{0})dx_{0}\,p(y_{1})dy_{1}.\end{split}

Here

dppdγ(xm,ym+1):=1pfxm(ym+1)ddγpγ,fγxm(xm+1fγxm)\begin{split}\frac{dp}{pd\gamma}(x_{m},y_{m+1}):=\frac{1}{p_{fx_{m}}(y_{m+1})}\frac{d}{d\gamma}p_{\gamma,f_{\gamma}x_{m}}(x_{m+1}-f_{\gamma}x_{m})\end{split}

Remark becomes

δΦ(xT)hT(xT)𝑑xT=Φ(xT)m=0T1dppdγ(xm,ym+1)h0(x0)dx0(pzdy)1T.\begin{split}\delta\int\Phi(x_{T})h_{T}(x_{T})dx_{T}=\int\Phi(x_{T})\sum_{m=0}^{T-1}\frac{dp}{pd\gamma}(x_{m},y_{m+1})h_{0}(x_{0})dx_{0}(p_{z}dy)_{1\sim T}.\end{split}

Here (pzdy)1T:=pz1(y1)dy1pzT(yT)dyT(p_{z}dy)_{1\sim T}:=p_{z_{1}}(y_{1})dy_{1}\cdots p_{z_{T}}(y_{T})dy_{T} is the distribution of YnY_{n}’s; zm+1=fxmz_{m+1}=fx_{m} is a function of dummy variables x0,y1,,ymx_{0},y_{1},\cdots,y_{m}.

We still have free centralization for Φ\Phi. So Remark and section 1.2 still hold accordingly, in particular, when the system is time-homogeneous and has physical measure hh, we still have

δΦ(x)h(x)𝑑x=a.s.limWlimL1Ln=1Wl=1L(Φ(Xn+l)Φavg)dppdγ(Xl,Y1+l).\begin{split}\delta\int\Phi(x)h(x)dx\,\overset{\mathrm{a.s.}}{=}\,\lim_{W\rightarrow\infty}\lim_{L\rightarrow\infty}-\frac{1}{L}\sum_{n=1}^{W}\sum_{l=1}^{L}\left(\Phi(X_{n+l})-\Phi_{avg}\right)\,\frac{dp}{pd\gamma}(X_{l},Y_{1+l}).\end{split}
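For example (an illustrative choice of ours), suppose $f$ does not depend on $\gamma$ and the noise is Gaussian with a $\gamma$-shifted mean and fixed covariance, $p_{\gamma,z}(y)\propto\exp(-|y-\gamma a|^{2}/(2\sigma^{2}))$ for a fixed vector $a$; then

\begin{split}\frac{dp}{pd\gamma}(x_{m},y_{m+1})=\frac{a\cdot y_{m+1}}{\sigma^{2}},\end{split}

which is exactly the weight $\frac{dp}{pd\gamma}(X_l,Y_{1+l})$ appearing in the formula above, and it is sampled along one orbit just as in section 2.5.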

2.6.2. General random dynamical systems

For general random dynamical systems, at each step $n$ we randomly select a map $g$ from a family of maps; denote this random map by $G$. The dynamics is now

Xn+1=G(Xn).\begin{split}X_{n+1}=G(X_{n}).\end{split}

The selections of $G$ are independent among different $n$'s. It is not hard to see that our pictorial intuition still applies, so our work generalizes to this case.

We can also formally transform this general case to the case in the previous subsubsection. To do so, define the deterministic map ff

f(x):=𝔼[G(x)],\begin{split}f(x):=\mathbb{E}[G(x)],\end{split}

where the expectation is with respect to the randomness of $G$. Then the dynamics equals in distribution

Xn+1=f(Xn)+Yn+1,whereYn+1=G(Xn)f(Xn).\begin{split}X_{n+1}=f(X_{n})+Y_{n+1},\quad\textnormal{where}\quad Y_{n+1}=G(X_{n})-f(X_{n}).\end{split}

Hence the distribution of Yn+1Y_{n+1} is completely determined by XnX_{n}. Note that we still need to compute derivatives of the distribution of YY, or GG.

3. Foliated noise and perturbation

This section shows that the kernel-differentiation formulas are still correct for the case where the noise and δF\delta F are along a given foliation.

3.1. Preparation: measure notations, foliations, conditional densities

We assume that a neighborhood of the support of the physical measure $\mu$ can be foliated by a single chart: this basically means that we can partition the neighborhood by co-dimension-$c$ submanifolds. More specifically, there is a family $F=\{F_\alpha\}_{\alpha\in A}$ of co-dimension-$c$ $C^3$ submanifolds of $\mathcal{M}$, and a $C^3$ (we are not careful about the degree of differentiability) diffeomorphism between the neighborhood and an open set in $\mathbb{R}^M$, such that the image of $F_\alpha$ is the $c$-dimensional horizontal plane, $\{x:x_1=\cdots=x_{M-c}=0\}$. Each submanifold is called a ‘leaf’. An extension to the case where the neighborhood admits not a single chart but a foliated atlas composed of many charts is possible, and we leave it to the future.

Under single-chart foliation, this section considers the system

Zn=f(Xn),Z~n=f~(Zn),Xn+1=Xn+1(γ,Z~n).Z_{n}=f(X_{n})\,,\quad{\tilde{Z}}_{n}={\tilde{f}}(Z_{n})\,,\quad X_{n+1}=X_{n+1}(\gamma,{\tilde{Z}}_{n}).

The last equation just means that Xn+1X_{n+1} is a random variable depending on γ\gamma and Z~n{\tilde{Z}}_{n}, with details to be explained in the following paragraphs. The notation in the first equation relates to our previous notation when ff is a fixed diffeomorphism and fγ=f~ff_{\gamma}={\tilde{f}}\circ f.

We assume that XX and f~{\tilde{f}} are parallel to FαF_{\alpha}. More specifically, for any γ\gamma in a small interval on \mathbb{R} and any zz\in\mathcal{M},

z~=f~(z)Fα(z);X(γ,z~)Fα(z).{\tilde{z}}={\tilde{f}}(z)\in F_{\alpha}(z)\;;\quad X(\gamma,{\tilde{z}})\in F_{\alpha}(z).

where $F_\alpha(z)$ is the leaf $F_\alpha$ passing through $z$. Note that we do not require $f$ to be constrained by the foliation, so $x$ and $fx$ can be on different leaves.

The measures considered in this section are not necessarily absolutely continuous with respect to Lebesgue, so we use another set of notations for measures. The transfer operator on measures is denoted by $f_*$; more specifically, for any measure $\nu$,

Φ(x)d(fν)(x):=Φ(fx)𝑑ν(x).\begin{split}\int\Phi(x)\,d(f_{*}\nu)(x):=\int\Phi(fx)d\nu(x).\end{split}

We can disintegrate a measure $\nu$ into conditional measures $[\nu]^F_\alpha$ on each leaf and the quotient measure $\nu'$ on the set of $\alpha$'s. Since we have only one fixed foliation, we shall just omit the superscript $F$. Here $\{[\nu]_\alpha\}_{\alpha\in A}$ is a family of probability measures such that each ‘lives on’ $F_\alpha$. Moreover, the integration of any smooth observable $\varphi$ can be written as two consecutive integrations,

φ(z)𝑑ν(z)=AFαφ(z)d[ν]α(z)𝑑ν(α).\int_{\mathcal{M}}\varphi(z)d\nu(z)=\int_{A}\int_{F_{\alpha}}\varphi(z)d[\nu]_{\alpha}(z)d\nu^{\prime}(\alpha).

We typically use ν\nu to denote the measure of ZZ.

We assume that the random variable X(γ,z~)X(\gamma,{\tilde{z}}) has a C3C^{3} density function p(γ,z~,)p(\gamma,{\tilde{z}},\cdot) with respect to the Lebesgue measure on Fα(z)F_{\alpha}(z). That is, pp is the conditional density of X|Z~X|{\tilde{Z}}. Then we can define the operator pp_{*}, which transfers ν~{\tilde{\nu}}, the measure of Z~{\tilde{Z}}, to μ\mu, the measure of XX. It transports measures only within each leaf, but not across different leaves, so the quotient measure ν~{\tilde{\nu}}^{\prime} is left unchanged. On the other hand, the density of the conditional measure is changed to

(9) [pν~]αdst(x):=d[pν~]αdm(x)=Fαp(γ,z~,x)d[ν~]α(z~),[p_{*}{\tilde{\nu}}]^{\textrm{dst}}_{\alpha}(x):=\frac{d[p_{*}{\tilde{\nu}}]_{\alpha}}{dm}(x)=\int_{F_{\alpha}}p(\gamma,{\tilde{z}},x)d[{\tilde{\nu}}]_{\alpha}({\tilde{z}}),

where d()/dmd(\cdot)/dm is the Radon–Nikodym derivative with respect to the Lebesgue measure on FαF_{\alpha}. Note that [ν~]α[{\tilde{\nu}}]_{\alpha} does not necessarily have a density on FαF_{\alpha}. We have the double and triple integration formulas for the expectation of joint distribution

(10) 𝔼[φ(X,Z~)]=z~xFα(z~)φ(x,z~)p(γ,z~,x)𝑑x𝑑ν~(z~)=αAxFαz~Fαφ(x,z~)p(γ,z~,x)d[ν~]α(z~)𝑑xd[ν~](α)\begin{split}\mathbb{E}[\varphi(X,{\tilde{Z}})]=\int_{{\tilde{z}}\in\mathcal{M}}\int_{x\in F_{\alpha}({\tilde{z}})}\varphi(x,{\tilde{z}})p(\gamma,{\tilde{z}},x)dxd{\tilde{\nu}}({\tilde{z}})\\ =\int_{\alpha\in A}\int_{x\in F_{\alpha}}\int_{{\tilde{z}}\in F_{\alpha}}\varphi(x,{\tilde{z}})p(\gamma,{\tilde{z}},x)d[{\tilde{\nu}}]_{\alpha}({\tilde{z}})dxd[{\tilde{\nu}}]^{\prime}(\alpha)\end{split}

To summarize, we shall assume the following in this section.

Assumption 3.

Assume that a neighborhood of the support of the measures of interest is foliated by a C3C^{3} family of C3C^{3} leaves {Fα}αA\{F_{\alpha}\}_{\alpha\in A}. For any zz\in\mathcal{M}, f~(z)Fα(z){\tilde{f}}(z)\in F_{\alpha}(z), X(γ,z~)Fα(z~)X(\gamma,{\tilde{z}})\in F_{\alpha}({\tilde{z}}). The density pp is C2C^{2} on FαF_{\alpha} (but is singular with respect to \mathcal{M}).

For the case of multiple steps, we shall denote the sequence of measures of XnX_{n} by

μn:=(pf~f)nμ0,μ:=limnμn,\mu_{n}:=(p_{*}{\tilde{f}}_{*}f_{*})^{n}\mu_{0}\;,\quad\mu:=\lim_{n\rightarrow\infty}\mu_{n},

where the limit is in the weak star sense, and μ\mu is called the physical measure. We shall still make assumption 2. However, since the physical measure can be singular, it is more difficult to prove assumption 2 from some more basic assumptions; we did not find accurate references for this purpose.

3.2. Foliated kernel-differentiation in one step

We can see that the pictorial intuition in section 2.3 still makes sense on individual leaves. So it is not surprising that the main theorems are still effective. We shall first prove the formula on one leaf over one-step, which is the foliated version of sections 2.2 and Remark. Note that here one-step means to transfer the measure by f~{\tilde{f}}, so the foliation is unchanged.

{lemma}

[local one-step kernel-differentiation on a leaf, comparable to section 2.2] Under assumptions 1 and 3, for any measure ν\nu,

δ[pf~ν]αdst(x)=Fαδf~(z)dzp(z,x)d[ν]α(z)\delta[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)=\int_{F_{\alpha}}\delta{\tilde{f}}(z)\cdot d_{z}p(z,x)d[\nu]_{\alpha}(z)

Here δ:=/γ|γ=0\delta:=\partial/\partial\gamma|_{\gamma=0}, and the integrand is the derivative of p(γ,z,x)p(\gamma,z,x) with respect to zz in the direction δf~(z)\delta{\tilde{f}}(z). Here p(z,x)=p(γ=0,z,x)p(z,x)=p(\gamma=0,z,x) by default.

Remark.

Note that this differs from section 2.2 by a minus sign, since $p(z,x)=p(x-z)$, so taking derivatives in $z$ gives a minus sign.

Proof.

By equation 9,

(11) [pf~ν]αdst(x)=Fαp(γ,z~,x)d[f~ν]α(z~),[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)=\int_{F_{\alpha}}p(\gamma,{\tilde{z}},x)d[{\tilde{f}}_{*}\nu]_{\alpha}({\tilde{z}}),

We give a formula for d[f~ν]αd[{\tilde{f}}_{*}\nu]_{\alpha}. By definition of pp_{*} and conditional densities, for any smooth observable function φ\varphi,

AFαφ(z~)d[f~ν]α(z~)d[f~ν](α)=φ(z~)𝑑f~ν(z~)=φ(f~z)𝑑ν(z)=AFαφ(f~z)d[ν]α(z)d[ν](α).\begin{split}\int_{A}\int_{F_{\alpha}}\varphi({\tilde{z}})d[{\tilde{f}}_{*}\nu]_{\alpha}({\tilde{z}})d[{\tilde{f}}_{*}\nu]^{\prime}(\alpha)=\int_{\mathcal{M}}\varphi({\tilde{z}})d{\tilde{f}}_{*}\nu({\tilde{z}})\\ =\int_{\mathcal{M}}\varphi({\tilde{f}}z)d\nu(z)=\int_{A}\int_{F_{\alpha}}\varphi({\tilde{f}}z)d[\nu]_{\alpha}(z)d[\nu]^{\prime}(\alpha).\end{split}

Since ${\tilde{f}}_*$ preserves the measure within each leaf, the quotient measure $\nu'=[{\tilde{f}}_*\nu]'$ is unchanged. Hence,

Fαφ(z~)d[f~ν]α(z~)=Fαφ(f~z)d[ν]α(z).\int_{F_{\alpha}}\varphi({\tilde{z}})d[{\tilde{f}}_{*}\nu]_{\alpha}({\tilde{z}})=\int_{F_{\alpha}}\varphi({\tilde{f}}z)d[\nu]_{\alpha}(z).

In short, we find that pushing forward by ${\tilde{f}}$ commutes with taking conditional measures.

Substituting back to equation 11, we get

[pf~ν]αdst(x)=Fαp(γ,z~,x)d[f~ν]α(z~)=Fαp(γ,f~z,x)d[ν]α(z)[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)=\int_{F_{\alpha}}p(\gamma,{\tilde{z}},x)d[{\tilde{f}}_{*}\nu]_{\alpha}({\tilde{z}})=\int_{F_{\alpha}}p(\gamma,{\tilde{f}}z,x)d[\nu]_{\alpha}(z)

Then we can differentiate by γ\gamma and evaluate at γ=0\gamma=0 to prove the lemma; note that z~=z{\tilde{z}}=z at γ=0\gamma=0. ∎

{lemma}

[integrated one-step foliated kernel-differentiation, comparable to Remark] Under assumptions 1 and 3, for any fixed measure ν\nu and any smooth observable Φ\Phi,

δΦ(x)d(pf~ν)(x)=zxFα(z)Φ(x)δf~(z)dzp(z,x)p(z,x)p(z,x)𝑑x𝑑ν(z).\begin{split}\delta\int\Phi(x)d(p_{*}{\tilde{f}}_{*}\nu)(x)=\int_{z\in\mathcal{M}}\int_{x\in F_{\alpha}(z)}\Phi(x)\delta{\tilde{f}}(z)\cdot\frac{d_{z}p(z,x)}{p(z,x)}p(z,x)dxd\nu(z).\end{split}
Remark.

See equation 10 for the probability notation which might help understanding.

Proof.

By the definition of conditional measure,

Φ(x)d(pf~ν)(x)=AFαΦ(x)d[pf~ν]α(x)d[pf~ν](α)\begin{split}\int\Phi(x)d(p_{*}{\tilde{f}}_{*}\nu)(x)=\int_{A}\int_{F_{\alpha}}\Phi(x)d[p_{*}{\tilde{f}}_{*}\nu]_{\alpha}(x)d[p_{*}{\tilde{f}}_{*}\nu]^{\prime}(\alpha)\end{split}

Since both pp_{*} and f~{\tilde{f}}_{*} preserve the measure within each leaf, [pf~ν]=[ν][p_{*}{\tilde{f}}_{*}\nu]^{\prime}=[\nu]^{\prime} is not affected by γ\gamma. Further express the conditional measure by integrating the density, we get

(12) Φ(x)d(pf~ν)(x)=AFαΦ(x)[pf~ν]αdst(x)𝑑xd[ν](α)\int\Phi(x)d(p_{*}{\tilde{f}}_{*}\nu)(x)=\int_{A}\int_{F_{\alpha}}\Phi(x)[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)dxd[\nu]^{\prime}(\alpha)

Differentiate on both sides

δΦ(x)d(pf~ν)(x)=δAFαΦ(x)[pf~ν]αdst(x)𝑑xd[ν](α)\begin{split}\delta\int\Phi(x)d(p_{*}{\tilde{f}}_{*}\nu)(x)=\delta\int_{A}\int_{F_{\alpha}}\Phi(x)[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)dxd[\nu]^{\prime}(\alpha)\\ \end{split}

Since every function in the integrand is bounded, we can move δ\delta inside,

=AxFαΦ(x)δ[pf~ν]αdst(x)𝑑xd[ν](α).\begin{split}=\int_{A}\int_{x\in F_{\alpha}}\Phi(x)\delta[p_{*}{\tilde{f}}_{*}\nu]^{\textrm{dst}}_{\alpha}(x)dxd[\nu]^{\prime}(\alpha).\end{split}

Apply section 3.2,

=αAxFαzFαΦ(x)δf~(z)dzp(z,x)d[ν]α(z)𝑑xd[ν](α).=αAxFαzFαΦ(x)δf~(z)dzp(z,x)p(z,x)d[ν]α(z)p(z,x)𝑑xd[ν](α).\begin{split}=\int_{\alpha\in A}\int_{x\in F_{\alpha}}\int_{z\in F_{\alpha}}\Phi(x)\delta{\tilde{f}}(z)\cdot d_{z}p(z,x)d[\nu]_{\alpha}(z)dxd[\nu]^{\prime}(\alpha).\\ =\int_{\alpha\in A}\int_{x\in F_{\alpha}}\int_{z\in F_{\alpha}}\Phi(x)\delta{\tilde{f}}(z)\cdot\frac{d_{z}p(z,x)}{p(z,x)}d[\nu]_{\alpha}(z)p(z,x)dxd[\nu]^{\prime}(\alpha).\end{split}

Then use equation 10 to transform triple integration to double integration. ∎

3.3. Foliated kernel-differentiation over finitely many steps

This subsection proves results for finitely many steps. Most proofs are similar to their counterparts in section 2.4, so we shall skip the proofs that are repetitive.

{lemma}

[comparable to section 2.4] Under assumptions 1 and 3, let z~m:=f~zm:=f~fxm1{\tilde{z}}_{m}:={\tilde{f}}z_{m}:={\tilde{f}}fx_{m-1}, then for any μ0\mu_{0}, any nn\in\mathbb{N},

Φ(xn)d((pf~f)nμ0)(xn)=x0x1Fα(z~1)xnFα(z~n)Φ(xn)p(z~n,xn)𝑑xnp(z~1,x1)𝑑x1𝑑μ0(x0).\begin{split}\int\Phi(x_{n})d((p_{*}{\tilde{f}}_{*}f_{*})^{n}\mu_{0})(x_{n})\\ =\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F_{\alpha}({\tilde{z}}_{1})}\cdots\int_{x_{n}\in F_{\alpha}({\tilde{z}}_{n})}\Phi(x_{n})p({\tilde{z}}_{n},x_{n})dx_{n}\cdots p({\tilde{z}}_{1},x_{1})dx_{1}d\mu_{0}(x_{0}).\end{split}
Remark.

Note that for the foliated case, the integration domain of xnx_{n} depends on xn1x_{n-1}, so the multiple integration should first integrate xnx_{n} with xn1x_{n-1} fixed.

{theorem}

[foliated kernel-differentiation formula for δμT\delta\mu_{T}, comparable to Remark] Under assumptions 1 and 3,

δΦ(xT)𝑑μT(xT)=x0x1Fα(z1)xTFα(zT)Φ(xT)m=1T(δf~(zm)dzp(zm,xm)p(zm,xm))p(zT,xT)dxTp(z1,x1)dx1dμ0(x0).\begin{split}\delta\int\Phi(x_{T})d\mu_{T}(x_{T})=\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F_{\alpha}(z_{1})}\cdots\int_{x_{T}\in F_{\alpha}(z_{T})}\Phi(x_{T})\\ \sum_{m=1}^{T}\left(\delta{\tilde{f}}(z_{m})\cdot\frac{d_{z}p(z_{m},x_{m})}{p(z_{m},x_{m})}\right)p(z_{T},x_{T})dx_{T}\cdots p(z_{1},x_{1})dx_{1}d\mu_{0}(x_{0}).\end{split}
Proof.

We shall recursively apply Remark. Note that there ν\nu was fixed, but here it is also affected by γ\gamma. First, we differentiate the last step; by Remark,

δΦ(xT)d(pf~fμT1)(xT)=δxT1xTFα(zT)Φ(xT)p(γ,zT,xT)𝑑xT𝑑μ(xT1)=xT1xTFα(zT)Φ(xT)δf~(zT)dzp(zT,xT)p(zT,xT)p(zT,xT)𝑑xT𝑑μ(xT1)+xT1xTFα(zT)Φ(xT)p(zT,xT)𝑑xT𝑑δμ(xT1),\begin{split}\delta\int\Phi(x_{T})d(p_{*}{\tilde{f}}_{*}f_{*}\mu_{T-1})(x_{T})=\delta\int_{x_{T-1}\in\mathcal{M}}\int_{x_{T}\in F_{\alpha}(z_{T})}\Phi(x_{T})p(\gamma,z_{T},x_{T})dx_{T}d\mu(x_{T-1})\\ =\int_{x_{T-1}\in\mathcal{M}}\int_{x_{T}\in F_{\alpha}(z_{T})}\Phi(x_{T})\delta{\tilde{f}}(z_{T})\cdot\frac{d_{z}p(z_{T},x_{T})}{p(z_{T},x_{T})}p(z_{T},x_{T})dx_{T}d\mu(x_{T-1})\\ +\int_{x_{T-1}\in\mathcal{M}}\int_{x_{T}\in F_{\alpha}(z_{T})}\Phi(x_{T})p(z_{T},x_{T})dx_{T}d\delta\mu(x_{T-1}),\end{split}

where p(zT,xT)=p(γ=0,zT,xT)p(z_{T},x_{T})=p(\gamma=0,z_{T},x_{T}) by default. Then we can further apply Remark and 9 to the last term repeatedly until x0x_{0}. ∎

{lemma}

[free centralization, comparable to Remark] Under assumptions 1 and 3, for any 1mT1\leq m\leq T, we have

x0x1Fα(z1)xTFα(zT)(δf~(zm)dzp(zm,xm)p(zm,xm))p(zT,xT)𝑑xTp(z1,x1)𝑑x1𝑑μ0(x0)=0.\begin{split}\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F_{\alpha}(z_{1})}\cdots\int_{x_{T}\in F_{\alpha}(z_{T})}\left(\delta{\tilde{f}}(z_{m})\cdot\frac{d_{z}p(z_{m},x_{m})}{p(z_{m},x_{m})}\right)p(z_{T},x_{T})dx_{T}\cdots p(z_{1},x_{1})dx_{1}d\mu_{0}(x_{0})=0.\end{split}
Proof.

We first perform the inner integrations for the steps later than mm, each of which integrates to one. Hence, the left side of the equation,

=x0x1Fα(z1)xmFα(zm)δf~(zm)dzp(zm,xm)p(zm,xm)p(zm,xm)𝑑xmp(z1,x1)𝑑x1𝑑μ0(x0)=xm1xmFα(zm)δf~(zm)dzp(zm,xm)p(zm,xm)p(zm,xm)𝑑xm𝑑μm1(xm1).\begin{split}=\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F_{\alpha}(z_{1})}\cdots\int_{x_{m}\in F_{\alpha}(z_{m})}\delta{\tilde{f}}(z_{m})\cdot\frac{d_{z}p(z_{m},x_{m})}{p(z_{m},x_{m})}p(z_{m},x_{m})dx_{m}\cdots p(z_{1},x_{1})dx_{1}d\mu_{0}(x_{0})\\ =\int_{x_{m-1}\in\mathcal{M}}\int_{x_{m}\in F_{\alpha}(z_{m})}\delta{\tilde{f}}(z_{m})\cdot\frac{d_{z}p(z_{m},x_{m})}{p(z_{m},x_{m})}p(z_{m},x_{m})dx_{m}d\mu_{m-1}(x_{m-1}).\end{split}

Here μm1\mu_{m-1} is evaluated at γ=0\gamma=0 by default. Then we can use Remark with Φ\Phi set as a constant to prove the lemma. ∎

See 1.2

3.4. Foliated ergodic kernel-differentiation formula over infinitely many steps

This subsection is comparable to section 2.5 and states results for infinitely many steps. We shall skip the proofs since the ideas are the same and only the notations are a bit different.

{lemma}

[foliated kernel-differentiation formula for physical measures, comparable to section 2.5] Under assumptions 1, 2 and 3,

\begin{split}\delta\int\Phi(x)d\mu(x)=\lim_{W\rightarrow\infty}\sum_{n=1}^{W}\int_{x_{0}\in\mathcal{M}}\int_{x_{1}\in F_{\alpha}(z_{1})}\cdots\int_{x_{n}\in F_{\alpha}(z_{n})}\Phi(x_{n})\\ \left(\delta{\tilde{f}}(z_{1})\cdot\frac{d_{z}p(z_{1},x_{1})}{p(z_{1},x_{1})}\right)p(z_{n},x_{n})dx_{n}\cdots p(z_{1},x_{1})dx_{1}d\mu(x_{0}).\end{split}
{theorem}

[orbitwise foliated kernel-differentiation formula for δh\delta h, comparable to section 1.2] Under assumptions 1, 2 and 3,

\begin{split}\delta\int\Phi(x)d\mu(x)\,\overset{\mathrm{a.s.}}{=}\,\lim_{W\rightarrow\infty}\lim_{L\rightarrow\infty}\frac{1}{L}\sum_{n=1}^{W}\sum_{l=1}^{L}\left(\Phi(X_{n+l})-\Phi_{avg}\right)\delta{\tilde{f}}(z_{l+1})\cdot\frac{d_{z}p(z_{l+1},x_{l+1})}{p(z_{l+1},x_{l+1})}.\end{split}

Here Φavg:=Φ𝑑μ\Phi_{avg}:=\int\Phi d\mu.

4. Kernel-differentiation algorithm

This section gives the procedure lists of the algorithm and demonstrates it on several examples. We no longer label the subscript γ\gamma, and δ\delta can be the derivative evaluated at γ0\gamma\neq 0; the dependence on γ\gamma should be clear from context. The code used in this section is at https://github.com/niangxiu/np.

4.1. Procedure lists

This subsection gives the procedure list of the kernel-differentiation algorithm for δhT\delta h_{T} of a finite-time system, whose formula was derived in Remark. Then we give the procedure list for an infinite-time system, whose formula was derived in section 1.2. The foliated cases use the same algorithm; we only need to change to the suitable definition of dp/pdp/p.

First, we give the algorithm for finite-time systems. Here LL is the number of sample paths, whose initial conditions are generated independently from h0h_{0} (or μ0\mu_{0}, if the measure is singular and does not have a density). This algorithm requires that we already have random number generators for h0h_{0} and pp: this is typically easy for pp since we tend to use a simple pp such as a Gaussian, but we might require a specific h0h_{0}, which calls for more advanced sampling tools.

 

LL, random number generators for densities h0h_{0} and pmp_{m}.
for l=1,,Ll=1,\dots,L do
     Independently generate sample x0x_{0} from X0h0X_{0}\sim h_{0},
     for m=0,,T1m=0,\dots,T-1 do
         Independently generate sample ym+1y_{m+1} from Ym+1pm+1Y_{m+1}\sim p_{m+1}
         Il,m+1δfxl,mdpp(ym+1)I_{l,m+1}\leftarrow\delta fx_{l,m}\cdot\frac{dp}{p}(y_{m+1})
         xl,m+1f(xl,m)+ym+1x_{l,m+1}\leftarrow f(x_{l,m})+y_{m+1}
     end for
     Φl,TΦT(xl,T)\Phi_{l,T}\leftarrow\Phi_{T}(x_{l,T}) \triangleright Φl,T\Phi_{l,T} takes less storage than xl,Tx_{l,T}.
     Slm=1TIl,mS_{l}\leftarrow-\sum_{m=1}^{T}I_{l,m} \triangleright No need to store xmx_{m} or ymy_{m}.
end for
Φavg,T:=ΦThγ,T𝑑x1Ll=1LΦl,T\Phi_{avg,T}:=\int\Phi_{T}h_{\gamma,T}dx\approx\frac{1}{L}\sum_{l=1}^{L}\Phi_{l,T} \triangleright Evaluated at γ=0\gamma=0.
δΦavg,T1Ll=1LSl(ΦT(xl,T)Φavg,T)\delta\Phi_{avg,T}\approx\frac{1}{L}\sum_{l=1}^{L}S_{l}\left(\Phi_{T}(x_{l,T})-\Phi_{avg,T}\right)

 

Then we give the procedure list for the infinite-time case. Here MpreM_{pre} is the number of preparation steps, during which the background measure is evolved, so that x0x_{0} is distributed according to the physical measure hh. Here WW is the decorrelation step number, typically WLW\ll L, where LL is the orbit length. Since here hh is given by the dynamic, we only need to sample the easier density pp.

 

Mpre,WLM_{pre},W\ll L, random number generator for pp.
Take any x0x_{0}.
for m=0,,Mprem=0,\cdots,M_{pre} do \triangleright To land x0x_{0} onto the physical measure.
     Independently generate sample yy from density pp
     x0f(x0)+yx_{0}\leftarrow f(x_{0})+y
end for
for m=0,,W+L1m=0,\cdots,W+L-1 do
     Independently generate sample ym+1y_{m+1} from Ym+1pY_{m+1}\sim p
     Im+1δf~(fxm)dpp(ym+1)I_{m+1}\leftarrow\delta{\tilde{f}}(fx_{m})\cdot\frac{dp}{p}(y_{m+1})
     xm+1f(xm)+ym+1x_{m+1}\leftarrow f(x_{m})+y_{m+1}
     Φm+1Φ(xm+1)\Phi_{m+1}\leftarrow\Phi(x_{m+1})
end for
Φavg:=Φhγ𝑑x1Ll=1LΦl\Phi_{avg}:=\int\Phi h_{\gamma}dx\approx\frac{1}{L}\sum_{l=1}^{L}\Phi_{l}
δΦavg1Ln=1Wl=1L(Φn+lΦavg)Il+1\delta\Phi_{avg}\approx-\frac{1}{L}\sum_{n=1}^{W}\sum_{l=1}^{L}\left(\Phi_{n+l}-\Phi_{avg}\right)I_{l+1}

 

From a utility point of view, kernel-differentiation is an ‘adjoint’ algorithm. Since we do not compute the Jacobian matrix at all, the most expensive operations per step are computing f(xm)f(x_{m}) and δf(xm)\delta f(x_{m}); hence the cost of computing the linear response for the first parameter (i.e. one δf\delta f) is lower than that of adjoint methods. On the other hand, the marginal cost for each new parameter in this algorithm is low, and it equals that of adjoint methods.
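For concreteness, below is a minimal Python sketch of the infinite-time procedure above (a hypothetical illustration, not the code in the linked repository). The user supplies step(x, y) for one step of the dynamic, score(x, y) for the per-step term δf⋅(dp/p)(y) with x the pre-step state (for instance using the Gaussian formula of equation 13 below), the observable Φ, and a sampler for the noise.

import numpy as np

def kernel_diff_ergodic(step, score, Phi, sample_noise, x0, L, W, M_pre=1000):
    # step(x, y): one step of the dynamic, e.g. f(x) + y (possibly with a wrap).
    # score(x, y): the per-step term delta_f(x) . (dp/p)(y), x being the pre-step state.
    # Phi(x): the observable; sample_noise(): one draw of Y ~ p.
    x = x0
    for _ in range(M_pre):              # preparation steps: land x0 on the physical measure
        x = step(x, sample_noise())
    I = np.empty(W + L)                 # I[m] stores I_{m+1} of the procedure list
    Ph = np.empty(W + L)                # Ph[m] stores Phi_{m+1}
    for m in range(W + L):
        y = sample_noise()
        I[m] = score(x, y)
        x = step(x, y)
        Ph[m] = Phi(x)
    Phi_avg = Ph[:L].mean()             # (1/L) sum_{l=1}^{L} Phi_l
    # delta Phi_avg ~= -(1/L) sum_{n=1}^{W} sum_{l=1}^{L} (Phi_{n+l} - Phi_avg) I_{l+1}
    dPhi = -sum(np.mean((Ph[n:n + L] - Phi_avg) * I[1:L + 1]) for n in range(1, W + 1))
    return Phi_avg, dPhi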

We remind readers of some useful formulas when we use Gaussian noise in M\mathbb{R}^{M}. Let the mean be μM\mu\in\mathbb{R}^{M}, the covariance matrix be Σ\Sigma, which is defined by Σi,j:=E[(Yiμi)(Yjμj)]=Cov[Yi,Yj]\Sigma_{i,j}:=\operatorname{E}[(Y_{i}-\mu_{i})(Y_{j}-\mu_{j})]=\operatorname{Cov}[Y_{i},Y_{j}]. Then the density is

\begin{split}p(y)=(2\pi)^{-M/2}\det(\Sigma)^{-1/2}\,\exp\left(-\frac{1}{2}(y-\mu)^{\mathsf{T}}\Sigma^{-1}(y-\mu)\right).\end{split}

Hence we have

dpp(y)=Σ1(yμ)M.\begin{split}\frac{dp}{p}(y)=-\Sigma^{-1}(y-\mu)\in\mathbb{R}^{M}.\end{split}
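Indeed, dp/p is just the gradient of the log-density:

\begin{split}\frac{dp}{p}(y)=d_{y}\log p(y)=-\frac{1}{2}d_{y}\left((y-\mu)^{\mathsf{T}}\Sigma^{-1}(y-\mu)\right)=-\Sigma^{-1}(y-\mu).\end{split}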

Typically we use an isotropic Gaussian, with μ=0\mu=0 and Σ=σ2IM×M\Sigma=\sigma^{2}I_{M\times M}, so we have

(13) dpp(y)=1σ2yM.\begin{split}\frac{dp}{p}(y)=-\frac{1}{\sigma^{2}}y\in\mathbb{R}^{M}.\end{split}

If we consider the co-dimension-cc foliation by planes Fα={x1=α1,,xc=αc}F_{\alpha}=\{x_{1}=\alpha_{1},\ldots,x_{c}=\alpha_{c}\}, the above formula is still valid. For geometers, it may be more natural to write δf~dp\delta{\tilde{f}}\cdot dp together, which is just the directional derivative of the function pp restricted to the leaf.

4.2. Numerical example: tent map with elevating center

We demonstrate the ergodic algorithm over infinitely many steps on the tent map. The dynamics is

Xn+1=fγ(Xn)+Yn+1mod 1,whereYniid𝒩(0,σ2),fγ(x)={γxif0x0.5,γ(1x)otherwise.\begin{split}X_{n+1}=f_{\gamma}(X_{n})+Y_{n+1}\quad\textnormal{mod 1,}\quad\quad\textnormal{where}\quad\\ Y_{n}{\,\overset{\mathrm{iid}}{\sim}\,}{\mathcal{N}}(0,\sigma^{2}),\quad\textnormal{}\quad f_{\gamma}(x)=\begin{cases}\gamma x\quad\text{if}\quad 0\leq x\leq 0.5,\\ \gamma(1-x)\quad\text{otherwise.}\end{cases}\end{split}

In this subsection, we fix the observable

Φ(x)=x\begin{split}\Phi(x)=x\end{split}

Previous linear response algorithms based on algorithms for deterministic systems (such as the fast response, path perturbation, and divergence algorithms) do not work on this example. In fact, Baladi proved that for the deterministic case (σ=0\sigma=0), the L1L^{1} linear response does not exist [4].

We shall demonstrate the kernel-differentiation algorithm on this example, and show that the noised case can give a useful approximate linear response for the deterministic case. In the following discussions, unless otherwise noted, the default values of the (hyper-)parameters are

γ=3,σ=0.1,W=7.\begin{split}\gamma=3,\quad\sigma=0.1,\quad W=7.\end{split}
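As a usage illustration of the hypothetical kernel_diff_ergodic sketch in section 4.1 (again not the author's code), the noisy tent map with the default parameters above can be set up as follows; the mod-1 wrap is applied to the state, while dp/p keeps the plain Gaussian form of equation 13, a good approximation for small σ, and the orbit length here is shorter than in the figures.

import numpy as np

gamma, sigma, W = 3.0, 0.1, 7
rng = np.random.default_rng(1)
f       = lambda x: gamma * x if x <= 0.5 else gamma * (1.0 - x)
delta_f = lambda x: x if x <= 0.5 else 1.0 - x          # d f_gamma / d gamma
step    = lambda x, y: (f(x) + y) % 1.0                 # X_{n+1} = f(X_n) + Y_{n+1} mod 1
score   = lambda x, y: delta_f(x) * (-y / sigma**2)     # delta_f(x) . (dp/p)(y), equation 13
Phi_avg, dPhi = kernel_diff_ergodic(step, score, Phi=lambda x: x,
                                    sample_noise=lambda: rng.normal(0.0, sigma),
                                    x0=0.3, L=10**5, W=W)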
Figure 2. Densities hh of different σ\sigma. Here L=107L=10^{7}.

We first test the effect of adding noise on the physical measure. As shown in figure 2, the density converges to the deterministic case as σ0\sigma\rightarrow 0. This and later numerical results indicate that we can hope to find some approximate ‘linear response’ for the deterministic system, which provides useful information about the trend between the averaged observable and γ\gamma.

Figure 3. Φavg\Phi_{avg} and δΦavg\delta\Phi_{avg} for different parameter γ\gamma. Here L=106L=10^{6} for all orbits. The dots are Φavg\Phi_{avg}, and the short lines are δΦavg\delta\Phi_{avg} computed by the kernel-differentiation response algorithm; they are computed from the same orbit. The long red line is Φavg\Phi_{avg} when there is no noise (σ=0\sigma=0).

Then we run the kernel-differentiation algorithm to compute linear responses for different γ\gamma. This reveals the relation between Φavg=Φh𝑑x\Phi_{avg}=\int\Phi hdx and γ\gamma. As shown in figure 3, the algorithm gives an accurate linear response for the noised case. Moreover, the linear response of the noised case is a reasonable reflection of the observable-parameter relation of the deterministic case. Most importantly, the local minima and maxima of the noised and deterministic cases tend to be located similarly. Hence, we can use the kernel-differentiation algorithm on the noised case to help optimize the deterministic case.

Figure 4. Effects of orbit length LL. Left: derivatives from 10 independent computations for each LL. Right: the standard deviation of the computed derivatives, where the dashed line is L0.5L^{-0.5}.

Then we show the convergence of the kernel-differentiation algorithm with respect to LL in figure 4. In particular, the standard deviation of the computed derivative is proportional to L0.5L^{-0.5}. This is the same as the classical Monte Carlo method for integrating probabilities, with LL being the number of samples. This is expected: since here we are given the dynamics, our samples are canonically the points on a long orbit.

Figure 5. Tent map, effects of WW. Here L=105L=10^{5}. Left: derivatives computed by different WW’s. Right: standard deviation of derivatives, where the dashed line is 0.001W0.50.001W^{0.5}.

Figure 5 shows that the bias in the average derivative decreases as WW increases, but the standard deviation increases roughly like O(W0.5)O(W^{0.5}). Note that if we do not centralize Φ\Phi, then the standard deviation would be like O(W)O(W).

4.3. A chaotic neural network with foliated perturbations

4.3.1. Basic model

This subsection considers a chaotic neural network whose dynamic is deterministic and whose exact linear response does not provide useful information. Since the perturbation is foliated, we can add only a low-dimensional noise, which incurs a smaller error. Then we use the kernel-differentiation algorithm to compute approximations of the linear response.

A main goal in designing neural networks is to suppress gradient explosion, which is essentially the definition of chaos. However, even with modern architectures, there is no guarantee that gradient explosion (or chaos) can be precluded. Unlike the stochastic backpropagation method, we do not perform propagation at all, so we are not hindered by gradient explosion. But we add extra noise at each layer, which introduces a (potentially large) systematic error.

The variables of interest are Xn9X^{\prime}_{n}\in\mathbb{R}^{9}, n=0,1,,Tn=0,1,\cdots,T, where T=50T=50, and the dynamic is

Xn+1=fγ(Xn),wherefγ(x)=Jtanh(x+γ𝟙),J=CJ0,J0=[0.541.190.331.660.51.31.520.51.951.61.551.450.611.920.590.161.141.270.590.651.321.460.820.951.470.080.380.780.260.871.990.070.870.790.441.110.81.280.521.011.491.491.650.450.211.770.031.390.280.441.270.610.010.020.180.290.730.530.821.581.410.071.840.640.860.730.960.060.041.11.220.281.181.950.370.011.240.320.430.061.28].\begin{split}X^{\prime}_{n+1}=f_{\gamma}(X^{\prime}_{n}),\quad\textnormal{where}\quad f_{\gamma}(x)=J\tanh(x+\gamma\mathbbm{1}),\quad J=CJ_{0},\quad\\ J_{0}=\left[\begin{array}[]{ccccccccc}-0.54&-1.19&-0.33&1.66&-0.5&-1.3&1.52&-0.5&1.95\\ -1.6&-1.55&-1.45&0.61&1.92&0.59&-0.16&-1.14&-1.27\\ -0.59&-0.65&-1.32&-1.46&-0.82&-0.95&-1.47&-0.08&-0.38\\ -0.78&-0.26&0.87&1.99&0.07&0.87&-0.79&-0.44&1.11\\ 0.8&-1.28&-0.52&-1.01&1.49&1.49&-1.65&-0.45&0.21\\ -1.77&0.03&-1.39&-0.28&0.44&1.27&0.61&0.01&-0.02\\ -0.18&-0.29&-0.73&0.53&-0.82&-1.58&-1.41&0.07&-1.84\\ 0.64&0.86&0.73&0.96&-0.06&0.04&1.1&1.22&-0.28\\ 1.18&-1.95&-0.37&0.01&1.24&-0.32&0.43&0.06&-1.28\end{array}\right].\end{split}

Here II is the identity matrix, 𝟙=[1,,1]\mathbbm{1}=[1,\cdots,1], and for x=[x1,,x9],xix=[x^{1},\cdots,x^{9}],x_{i}\in\mathbb{R},

tanh(x):=[tanh(x1),,tanh(x9)].\begin{split}\tanh(x):=[\tanh(x^{1}),\cdots,\tanh(x^{9})].\end{split}

We set X0𝒩(0,I)X^{\prime}_{0}\sim{\mathcal{N}}(0,I). The objective function is

Φ(XT)=i=19(XT)i.\Phi^{\prime}(X^{\prime}_{T})=\sum_{i=1}^{9}(X^{\prime}_{T})^{i}.

There is a somewhat tight region of CC in which the system is chaotic: when C<1C<1, JJ is small, so the Jacobian is small; when C>10C>10, the points tend to be far from zero, so the derivative of tanh\tanh is small and the Jacobian is again small. Our choice C=4C=4 gives roughly the most unstable network. We roughly estimate the cost of the ensemble, or stochastic gradient, formula of the linear response, which is

δ𝔼(Φ)=𝔼(n=150δf(X50n)fndΦ(X50))\begin{split}\delta\mathbb{E}(\Phi^{\prime})=\mathbb{E}\left(\sum_{n=1}^{50}\delta f(X^{\prime}_{50-n})\cdot f^{*n}d\Phi(X^{\prime}_{50})\right)\end{split}

Here ff^{*} is the backpropagation operator for covectors, which is the transpose of the Jacobian matrix DfDf. On average |f50|104105|f^{*50}|\approx 10^{4}\sim 10^{5}, dΦδf10d\Phi\cdot\delta f\approx 10, so the integrand's size is about 10510610^{5}\sim 10^{6}. This would require about 1010101210^{10}\sim 10^{12} samples (a sample is a realization of the 50-layer network) to reduce the sampling error to O(1)O(1). That is not affordable.
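As a rough illustration of this gradient explosion, the covector dΦ=𝟙 can be propagated backwards through the 50 layers; the sketch below is hypothetical and uses a random stand-in for CJ_0, so the exact magnitude may differ from the estimate above.

import numpy as np

rng = np.random.default_rng(0)
J = 4.0 * rng.uniform(-2.0, 2.0, size=(9, 9))   # random stand-in for C*J0 with C = 4
x, xs = rng.normal(size=9), []
for _ in range(50):                             # forward pass, storing the layer inputs
    xs.append(x)
    x = J @ np.tanh(x)
w = np.ones(9)                                  # covector dPhi at the last layer
for x in reversed(xs):                          # backward pass: w <- Df(x)^T w
    w = (1.0 - np.tanh(x)**2) * (J.T @ w)
print(np.linalg.norm(w))                        # compare with |f^{*50}| ~ 1e4 - 1e5 in the text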

Our model is modified from its original form in [7, 8]. In the original model, the entries of the weight matrix JJ were randomly generated according to certain laws; as discussed in section 2.6.2, we can rewrite the random maps using an additive noise and then obtain exact solutions to the original problem. We can also further generalize this example to time-inhomogeneous cases.

Finally, we acknowledge that our neural network’s architecture is outdated, but modern architectures are not good tests for our algorithm. Because the backpropagation method does not work in chaos, current architectures typically avoid chaos. The kernel-differentiation method might help us to have more freedom in choosing architectures beyond the current ones.

4.3.2. Explicit foliated chart and artificial noise

We say the perturbation is foliated, which means that there is a fixed foliation such that for a small interval of γ\gamma, δf\delta f is always parallel to that foliation. The perturbation in this particular example is foliated, since for any γ1,γ2\gamma_{1},\gamma_{2},

fγ|γ=γ1(x1)=fγ|γ=γ2(x2),wheneverfγ1(x1)=fγ2(x2).\left.\frac{\partial f}{\partial\gamma}\right|_{\gamma=\gamma_{1}}(x_{1})=\left.\frac{\partial f}{\partial\gamma}\right|_{\gamma=\gamma_{2}}(x_{2}),\quad\textnormal{whenever}\quad f_{\gamma_{1}}(x_{1})=f_{\gamma_{2}}(x_{2}).

Hence, the vector field δf~\delta{\tilde{f}} is invariant for an interval of γ\gamma, and the foliation is given by the streamlines of δf\delta f.

In fact, we can write down the explicit expression of a foliated chart: we change coordinates by X=X+γ𝟙X=X^{\prime}+\gamma\mathbbm{1}; then the dynamic and the observable in the new coordinates are

Xn+1=Jtanh(Xn)+γ𝟙,Φ(XT)=9γ+i=19XTi.X_{n+1}=J\tanh(X_{n})+\gamma\mathbbm{1},\quad\Phi(X_{T})=-9\gamma+\sum_{i=1}^{9}X_{T}^{i}.

Note that we can allow the chart to depend on γ\gamma. Also, the initial distribution is changed to

X0=Y0+γ𝟙,whereY0𝒩(0,I).X_{0}=Y_{0}+\gamma\mathbbm{1},\quad\textnormal{where}\quad Y_{0}\sim{\mathcal{N}}(0,I).

Now it is clear that δf\delta f is always in the direction of 𝟙\mathbbm{1}; hence, we only need to add noise in this direction. We compute the linear response of the system

(14) Xn+1=Jtanh(Xn)+γ𝟙+Yn+1𝟙/M,whereYniid𝒩(0,σ2).X_{n+1}=J\tanh(X_{n})+\gamma\mathbbm{1}+Y_{n+1}\mathbbm{1}/\sqrt{M},\quad\textnormal{where}\quad Y_{n}{\,\overset{\mathrm{iid}}{\sim}\,}{\mathcal{N}}(0,\sigma^{2}).

We shall compare with adding noise in all directions, with YMY^{\prime}\in\mathbb{R}^{M}

Xn+1=Jtanh(Xn)+γ𝟙+Yn+1,whereYniid𝒩(0,σ2I).X_{n+1}=J\tanh(X_{n})+\gamma\mathbbm{1}+Y^{\prime}_{n+1},\quad\textnormal{where}\quad Y^{\prime}_{n}{\,\overset{\mathrm{iid}}{\sim}\,}{\mathcal{N}}(0,\sigma^{2}I).

Here II is the M×MM\times M identity matrix.

The foliated case adds a smaller noise to the system, since YY^{\prime} amounts to summing MM i.i.d. copies of YY in orthogonal directions. We now show that the two noised cases have the same computational cost. As will be discussed in section 5.1, the computational cost is determined by the number of samples required, which is in turn determined by the magnitude of the integrand. To verify this, we check

𝔼[(1pδfdp)2]=𝔼[(1pδfdp)2],\mathbb{E}\left[\left(\frac{1}{p^{\prime}}\delta f\cdot dp^{\prime}\right)^{2}\right]=\mathbb{E}\left[\left(\frac{1}{p}\delta f\cdot dp\right)^{2}\right],

where pp is the density on the 1-dimensional subspace, and pp^{\prime} is the density in all directions. By equation 13, and noting that δf~=𝟙\delta{\tilde{f}}=\mathbbm{1} is constant, we want

𝔼[(1σ2𝟙Y)2]=𝔼[(1σ2𝟙𝟙Y/M)2].\mathbb{E}\left[\left(\frac{1}{\sigma^{2}}\mathbbm{1}\cdot Y^{\prime}\right)^{2}\right]=\mathbb{E}\left[\left(\frac{1}{\sigma^{2}}\mathbbm{1}\cdot\mathbbm{1}Y/\sqrt{M}\right)^{2}\right].

This is indeed true; just notice that 𝟙/M\mathbbm{1}/\sqrt{M} is a unit vector.
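A quick Monte Carlo sanity check of this identity (hypothetical, with M = 9 and σ = 1.5 as illustrative values; both sides should be close to the exact value M/σ²):

import numpy as np

rng, M, sigma = np.random.default_rng(2), 9, 1.5
Yp = rng.normal(0.0, sigma, size=(10**6, M))    # noise in all directions, Y' ~ N(0, sigma^2 I)
Y  = rng.normal(0.0, sigma, size=10**6)         # 1-dimensional noise, Y ~ N(0, sigma^2)
lhs = np.mean((Yp.sum(axis=1) / sigma**2)**2)   # E[((1/sigma^2) 1 . Y')^2]
rhs = np.mean((np.sqrt(M) * Y / sigma**2)**2)   # E[((1/sigma^2) 1 . 1 Y/sqrt(M))^2]
print(lhs, rhs, M / sigma**2)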

Note that now both Φ\Phi and h0h_{0} depend on γ\gamma due to the chart. Hence, the linear response has two more terms,

(15) δΦ(xT)𝑑μT(xT)=T+δΦ(xT)𝑑μT(xT)+𝔼[Φ(XT)δh0h0(X0)],\delta\int\Phi(x_{T})d\mu_{T}(x_{T})=T+\int\delta\Phi(x_{T})d\mu_{T}(x_{T})+\mathbb{E}\left[\Phi(X_{T})\frac{\delta h_{0}}{h_{0}}(X_{0})\right],

where the first term TT is given by section 1.2, and the other two terms are new. For the second term, since the integrand δΦ=9\delta\Phi=-9 is constant, the second term is 9-9; if the integrand were not constant we could still evaluate it easily by Monte Carlo sampling. For the third term, the expectation 𝔼\mathbb{E} is just an abbreviation for the multiple integration in section 1.2. By definition,

h_{0}(x_{0})=(2\pi)^{-M/2}\exp\left(-\frac{1}{2}(x_{0}-\gamma\mathbbm{1})^{\mathsf{T}}(x_{0}-\gamma\mathbbm{1})\right),

so δh0h0(x0)=x0𝟙γM\frac{\delta h_{0}}{h_{0}}(x_{0})=x_{0}\cdot\mathbbm{1}-\gamma M.
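Putting the pieces together, the following hypothetical Python sketch estimates all three terms of equation 15 with the finite-time procedure and the 1-dimensional foliated noise; J is again a random stand-in for CJ_0, and gamma, sigma, L are illustrative values.

import numpy as np

M, T, gamma, sigma, L = 9, 50, 0.0, 1.5, 10**4
rng = np.random.default_rng(0)
J = 4.0 * rng.uniform(-2.0, 2.0, size=(M, M))   # random stand-in for C*J0 with C = 4
one = np.ones(M)
Phi_T, S, corr0 = np.empty(L), np.empty(L), np.empty(L)
for l in range(L):
    x = rng.normal(size=M) + gamma * one        # X_0 = Y_0 + gamma*1
    corr0[l] = x @ one - gamma * M              # (delta h_0 / h_0)(X_0)
    s = 0.0
    for _ in range(T):
        y = rng.normal(0.0, sigma)              # scalar noise Y ~ N(0, sigma^2)
        s += np.sqrt(M) * (-y / sigma**2)       # delta_f . (dp/p) = 1 . (1/sqrt(M))(-y/sigma^2)
        x = J @ np.tanh(x) + gamma * one + y * one / np.sqrt(M)
    S[l] = -s                                   # S_l = - sum_m I_{l,m}
    Phi_T[l] = -9.0 * gamma + x.sum()
Phi_avg = Phi_T.mean()
main = np.mean(S * (Phi_T - Phi_avg))           # first term of equation 15
delta_Phi_avg = main - 9.0 + np.mean(Phi_T * corr0)   # add the two extra terms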

4.3.3. Numerical results

We use the kernel-differentiation method given in section 1.2 to compute the main term in equation 15, under two different noises: a noise in all directions, and a 1-dimensional noise along the direction of 𝟙\mathbbm{1}.

Figure 6. Φavg\Phi_{avg} and δΦavg\delta\Phi_{avg} for different parameter γ\gamma and different noises. For the two cases with noise, σ=1.5\sigma=1.5. Here the total number of samples is L=104L=10^{4}.
Figure 7. σ=0.5\sigma=0.5. Other settings are the same as figure 6.

Figure 6 and figure 7 show the results of the kernel-differentiation algorithm on this example. The algorithm correctly computes the derivative of the problem with additive noises. The total time cost for running the algorithm on L=104L=10^{4} orbits to get a linear response, using a 1 GHz computer thread, is 3 seconds.

As we can see by comparing the two figures, the case with 1-dimensional noise is a better approximation of the deterministic case, which is what we are actually interested in. In particular, when σ=1.5\sigma=1.5, the MM-dimensional noise essentially washes out any trend between the parameter γ\gamma and the averaged observable Φavg\Phi_{avg}. In contrast, the 1-dimensional noise has a much smaller impact on the γΦavg\gamma\sim\Phi_{avg} relation, and the derivative computed by the kernel-differentiation method is more useful.

Also note that decreasing σ\sigma to reduce the systematic error is not free. As we can see by comparing the two figures, and as we shall discuss in section 5.1, a small σ\sigma increases the sampling error, which requires more samples to suppress and hence increases the cost. So adding low-dimensional noise, when the perturbation is foliated, has the unique advantage of reducing the approximation error relative to the deterministic case.

5. Discussions

The kernel-differentiation algorithm is robust, has no systematic error, has a low cost per step, and is not cursed by dimensionality. But there is a caveat: when the noise is small, we need many samples for the Monte Carlo method to converge. Hence we cannot expect to use the small-σ\sigma limit to get an easy approximation of the linear response of deterministic systems.

In this section, we first give a rough cost-error estimation of the problem, and estimate the number of samples LL and desirable noise intensity σ\sigma. Then we discuss how to potentially reduce the cost by further combining with the fast response algorithm, which was developed for deterministic linear responses.

5.1. A rough cost-error estimation

When the problem is intrinsically random, the noise scale σ\sigma is fixed. In this case, there are two sources of error. The first is due to using a finite decorrelation step number WW; this error is O(θW)O(\theta^{W}) for some 0<θ<10<\theta<1. The second is the sampling error due to using a finite number LL of samples. Assume we use Gaussian noise; since we are averaging a large integrand to get a small number, we can approximate the standard deviation of the integrand by its typical absolute value. The integrand is the sum of WW copies of dpp=yσ21σ\frac{dp}{p}=-\frac{y}{\sigma^{2}}\sim\frac{1}{\sigma}, so its size is roughly W/σ\sqrt{W}/\sigma, and the sampling error is W/σL\sqrt{W}/\sigma\sqrt{L}, where LL is the number of samples. Together we have the total error ε\varepsilon

ε=O(θW)+O(WσL).\begin{split}\varepsilon=O(\theta^{W})+O\left(\frac{\sqrt{W}}{\sigma\sqrt{L}}\right).\end{split}

This gives us a relation among ε,W\varepsilon,W, and LL, where LL is proportional to the overall cost. In practice, we typically set the two errors to be roughly equal, which gives an extra relation that lets us eliminate WW and obtain the cost-error relation

θW=WσL,soL=O(Wσ2θ2W)=O(logθεσ2ε2).\begin{split}\theta^{W}=\frac{\sqrt{W}}{\sigma\sqrt{L}},\quad\textnormal{so}\quad L=O\left(\frac{W}{\sigma^{2}\theta^{2W}}\right)=O\left(\frac{\log_{\theta}\varepsilon}{\sigma^{2}\varepsilon^{2}}\right).\end{split}

This is rather typical for a Monte Carlo method. But the problem is that the cost can be large for small σ\sigma.
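For a rough sense of scale, with purely illustrative numbers:

\begin{split}\theta=0.9,\quad\sigma=0.1,\quad\varepsilon=0.01\quad\Rightarrow\quad W=\log_{\theta}\varepsilon\approx 44,\quad L=O\left(\frac{W}{\sigma^{2}\varepsilon^{2}}\right)\approx 4\times 10^{7}.\end{split}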

On the other hand, if we use a random system to approximate a deterministic system, then we can choose the noise scale σ\sigma. Now each step further incurs an approximation error O(σ)O(\sigma) on the measure. This error decays but accumulates, and the total error on the physical measure is O(σ/(1θ))O(\sigma/(1-\theta)), which can be quite large compared to the one-step error σ\sigma. Hence, if we are interested in the trend between Φavg\Phi_{avg} and γ\gamma over a certain stepsize Δγ\Delta\gamma (this is typically known from the practical problem), then the error in the linear response is O(σ/Δγ(1θ))O(\sigma/\Delta\gamma(1-\theta)). Together we have the total error ε\varepsilon

(16) ε=O(θW)+O(WσL)+O(σΔγ(1θ)).\begin{split}\varepsilon=O(\theta^{W})+O\left(\frac{\sqrt{W}}{\sigma\sqrt{L}}\right)+O\left(\frac{\sigma}{\Delta\gamma(1-\theta)}\right).\end{split}

Again, in practice we want the three errors to be roughly equal, which prescribes the size of σ\sigma,

σ=εΔγ(1θ),soL=O(logθεσ2ε2)=O(logθεε4(1θ)2).\begin{split}\sigma=\varepsilon\Delta\gamma(1-\theta),\quad\textnormal{so}\quad L=O\left(\frac{\log_{\theta}\varepsilon}{\sigma^{2}\varepsilon^{2}}\right)=O\left(\frac{\log_{\theta}\varepsilon}{\varepsilon^{4}(1-\theta)^{2}}\right).\end{split}

Since 1θ1-\theta can be small, this cost can be much larger than just ε4\varepsilon^{-4}, which is already a high cost.

Finally, we acknowledge that our estimate is very rough, but the point we make is solid: the small-noise limit is numerically expensive to achieve.

5.2. A potential program to unify three linear response formulas

We sketch a potential program for further reducing the cost and error of computing approximate linear responses of non-hyperbolic deterministic systems. As is known, non-hyperbolic systems typically do not have linear responses, so we must mollify; in this paper we choose to mollify by adding noise in the phase space at each time step. But as we saw in the last subsection, adding a large noise increases the noise error (the third term in equation 16), whereas a small noise increases the sampling error (the second term in equation 16).

There are cases with special structures for which we have specific tools that could give a good approximate linear response, in the sense of reflecting the parameter-observable relation. When the system has a foliated structure, this paper shows that we can add low-dimensional noise. When the unstable dimension is low, computing only the shadowing or stable part of the linear response may be a good approximation (see [28]).

For the more general situation, a plausible solution is to add a big but local noise, only at locations where the hyperbolicity is bad. The benefit of this program is that, if the singularity set is low-dimensional, then the area where we add big noise is small, so the noise error is small. For the continuous-time case there seems to be an easy choice: we can let the noise scale be inversely proportional to the length of the flow vector. This should at least solve the singular hyperbolic flow cases [40, 41], where the bad points coincide with zero velocity. But for the discrete-time case it can be difficult to find a natural criterion that is easy to compute.

In terms of the formulas, we use the kernel-differentiation formula (the generalized version in section 2.6) where the noise is big, so that the sampling error is small there. Where the noise is small, we use the fast response algorithm, which is efficient regardless of the noise scale but requires hyperbolicity.

This program hinges on the assumption that the singularity set is small or low-dimensional. It also requires us to invent more techniques. For example, we need a formula that transfers information from the kernel-differentiation formula to the fast response formulas. In particular, the equivariant divergence formula, in its original form, requires information from the infinite past and future [30]. But we can rerun the proof and restrict the time dependence to a finite window; this requires extra information on the interface, such as the derivative of the conditional measure and the divergence of the holonomy map. This information should and could be provided by the kernel-differentiation formula.

Historically, there are three famous linear response formulas. The path-perturbation formula does not work for chaos; the divergence formula is likely to be cursed by dimensionality; the kernel-differentiation formula is expensive in the small-noise limit. The fast response formula combines the path-perturbation and divergence formulas; it is a function (not a distribution), so it can be sampled; it is affected by neither chaos nor high dimension, but it is rigid about hyperbolicity. In the future, besides testing the above methods on specific tasks, we should also try to combine all three formulas into one, which might provide the best approximate linear responses with the highest efficiency.

Acknowledgements

The author is in great debt to Stefano Galatolo, Yang Liu, Caroline Wormell, Wael Bahsoun, and Gary Froyland for helpful discussions.

Data availability statement

The code used in this manuscript is at https://github.com/niangxiu/kd. There is no other associated data.

References

  • [1] F. Antown, G. Froyland, and S. Galatolo. Optimal linear response for markov hilbert–schmidt integral operators and stochastic dynamical systems. Journal of Nonlinear Science, 32, 12 2022.
  • [2] W. Bahsoun, S. Galatolo, I. Nisoli, and X. Niu. A rigorous computational approach to linear response. Nonlinearity, 31:1073–1109, 2018.
  • [3] W. Bahsoun, M. Ruziboev, and B. Saussol. Linear response for random dynamical systems. Advances in Mathematics, 364:107011, 4 2020.
  • [4] V. Baladi. On the susceptibility function of piecewise expanding interval maps. Communications in Mathematical Physics, 275:839–859, 8 2007.
  • [5] V. Baladi. The quest for the ultimate anisotropic banach space. Journal of Statistical Physics, 166:525–557, 2017.
  • [6] R. Bowen and D. Ruelle. The ergodic theory of axiom a flows. Inventiones Mathematicae, 29:181–202, 1975.
  • [7] B. Cessac and J. A. Sepulchre. Stable resonances and signal propagation in a chaotic network of coupled units. Physical Review E, 70:056111, 11 2004.
  • [8] B. Cessac and J. A. Sepulchre. Transmitting a signal by amplitude modulation in a chaotic network. Chaos: An Interdisciplinary Journal of Nonlinear Science, 16:013104, 3 2006.
  • [9] D. Dolgopyat. On differentiability of srb states for partially hyperbolic systems. Inventiones Mathematicae, 155:389–449, 2004.
  • [10] D. Dragičević, P. Giulietti, and J. Sedro. Quenched linear response for smooth expanding on average cocycles. Communications in Mathematical Physics, 399:423–452, 4 2023.
  • [11] R. Durrett. Probability: Theory and Examples. Cambridge University Press, 4th edition edition, 8 2010.
  • [12] G. L. Eyink, T. W. N. Haine, and D. J. Lea. Ruelle’s linear response formula, ensemble adjoint schemes and lévy flights. Nonlinearity, 17:1867–1889, 2004.
  • [13] G. Froyland, D. Giannakis, B. R. Lintner, M. Pike, and J. Slawinska. Spectral analysis of climate dynamics with operator-theoretic approaches. Nature Communications, 12:6570, 11 2021.
  • [14] S. Galatolo and P. Giulietti. A linear response for dynamical systems with additive noise. Nonlinearity, 32:2269–2301, 6 2019.
  • [15] S. Galatolo and I. Nisoli. An elementary approach to rigorous approximation of invariant measures. SIAM Journal on Applied Dynamical Systems, 13:958–985, 2014.
  • [16] M. Ghil and V. Lucarini. The physics of climate variability and climate change. Reviews of Modern Physics, 92:035002, 7 2020.
  • [17] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33:75–84, 10 1990.
  • [18] P. W. Glynn and M. Olvera-Cravioto. Likelihood ratio gradient estimation for steady-state parameters. Stochastic Systems, 9:83–100, 6 2019.
  • [19] S. Gouëzel and C. Liverani. Compact locally maximal hyperbolic sets for smooth maps: Fine statistical properties. Journal of Differential Geometry, 79:433–477, 2008.
  • [20] M. S. Gutiérrez and V. Lucarini. Response and sensitivity using markov chains. Journal of Statistical Physics, 179:1572–1593, 2020.
  • [21] M. Hairer and A. J. Majda. A simple framework to justify linear response theory. Nonlinearity, 23:909–922, 4 2010.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition, pages 1–12, 2016.
  • [23] A. Jameson. Aerodynamic design via control theory. Journal of Scientific Computing, 3:233–260, 1988.
  • [24] M. Jiang. Differentiating potential functions of srb measures on hyperbolic attractors. Ergodic Theory and Dynamical Systems, 32:1350–1369, 2012.
  • [25] D. J. Lea, M. R. Allen, and T. W. N. Haine. Sensitivity analysis of the climate of a chaotic system. Tellus A: Dynamic Meteorology and Oceanography, 52:523–532, 2000.
  • [26] V. Lucarini, F. Ragone, and F. Lunkeit. Predicting climate change using response theory: Global averages and spatial patterns. Journal of Statistical Physics, 166:1036–1064, 2017.
  • [27] A. Ni. Hyperbolicity, shadowing directions and sensitivity analysis of a turbulent three-dimensional flow. Journal of Fluid Mechanics, 863:644–669, 2019.
  • [28] A. Ni. Approximating linear response by nonintrusive shadowing algorithms. SIAM J. Numer. Anal., 59:2843–2865, 2021.
  • [29] A. Ni. Backpropagation in hyperbolic chaos via adjoint shadowing. Nonlinearity, 37:035009, 3 2024.
  • [30] A. Ni and Y. Tong. Recursive divergence formulas for perturbing unstable transfer operators and physical measures. Journal of Statistical Physics, 190:126, 7 2023.
  • [31] A. Ni and Y. Tong. Equivariant divergence formula for hyperbolic chaotic flows. Journal of Statistical Physics, 191:118, 9 2024.
  • [32] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. International conference on machine learning, pages 1310–1318, 2013.
  • [33] P. Plecháč, G. Stoltz, and T. Wang. Convergence of the likelihood ratio method for linear response of non-equilibrium stationary states. ESAIM: Mathematical Modelling and Numerical Analysis, 55:S593–S623, 2021.
  • [34] P. Plecháč, G. Stoltz, and T. Wang. Martingale product estimators for sensitivity analysis in computational statistical physics. IMA Journal of Numerical Analysis, 43:3430–3477, 11 2023.
  • [35] M. Pollicott and O. Jenkinson. Computing invariant densities and metric entropy. Communications in Mathematical Physics, 211:687–703, 2000.
  • [36] M. I. Reiman and A. Weiss. Sensitivity analysis for simulations via likelihood ratios. Operations Research, 37:830–844, 10 1989.
  • [37] R. Y. Rubinstein. Sensitivity analysis and performance extrapolation for computer simulation models. Operations Research, 37:72–81, 2 1989.
  • [38] D. Ruelle. A measure associated with axiom-a attractors. American Journal of Mathematics, 98:619, 1976.
  • [39] D. Ruelle. Differentiation of srb states. Commun. Math. Phys, 187:227–241, 1997.
  • [40] M. Viana. What’s new on lorenz strange attractors? The Mathematical Intelligencer, 22:6–19, 6 2000.
  • [41] X. Wen and L. Wen. No-shadowing for singular hyperbolic sets with a singularity. Discrete and Continuous Dynamical Systems- Series A, 40:6043–6059, 10 2020.
  • [42] C. L. Wormell and G. A. Gottwald. Linear response for macroscopic observables in high-dimensional systems. Chaos, 29, 2019.
  • [43] L.-S. Young. What are srb measures, and which dynamical systems have them? Journal of Statistical Physics, 108:733–754, 2002.