
Machine Learning Enhanced Hankel Dynamic-Mode Decomposition

Christopher W. Curtis, Department of Mathematics and Statistics, San Diego State University, San Diego, CA, 92182, USA (Corresponding author: ccurtis@sdsu.edu)
D. Jay Alford-Lago, Department of Mathematics and Statistics, San Diego State University, San Diego, CA, 92182, USA; Naval Information Warfare Center Pacific, San Diego, CA, 92152, USA
Erik Bollt, Department of Electrical and Computer Engineering, Clarkson University, 8 Clarkson Ave., Potsdam, NY, 13699, USA; Clarkson Center for Complex Systems Science, Clarkson University, 8 Clarkson Ave., Potsdam, NY, 13699, USA
Andrew Tuma

Abstract

While the acquisition of time series has become more straightforward, developing dynamical models from time series is still a challenging and evolving problem domain. Within the last several years, to address this problem, there has been a merging of machine learning tools with what is called the dynamic mode decomposition (DMD). This general approach has been shown to be an especially promising avenue for accurate model development. Building on this prior body of work, we develop a deep learning DMD based method which makes use of the fundamental insight of Takens’ Embedding Theorem to build an adaptive learning scheme that better approximates higher dimensional and chaotic dynamics. We call this method the Deep Learning Hankel DMD (DLHDMD). We likewise explore how our method learns mappings which tend, after successful training, to significantly change the mutual information between dimensions in the dynamics. This appears to be a key feature in enhancing the DMD overall, and it should help provide further insight for developing other deep learning methods for time series analysis and model generation.

This work uses machine learning to develop an accurate method for generating models of chaotic dynamical systems using measurements alone. A number of challenging examples are examined which show the broad utility of the method and point towards its potential impacts in advancing data analysis and modeling in the physical sciences. Finally, we present quantitative studies of the information theoretic behavior of the machine learning tools used in our work, thereby allowing for a more detailed understanding of what can otherwise be an inscrutable method.

1 Introduction

The incorporation of modern machine learning methodology into dynamical systems is creating an ever expanding array of techniques pushing the boundaries of what is possible with regards to describing and predicting nonlinear multi-dimensional time series. Longstanding problems such as finding optimal Takens’ embeddings [1, 2] now have powerful and novel deep learning based algorithmic approaches [3] which would not have been feasible even ten years ago. Likewise, the field of equation free modeling using Koopman operator methods, broadly described by Dynamic Mode Decomposition (DMD), has seen several innovative deep learning based methods emerge over the last several years [4, 5, 6] which have been shown to greatly expand the accuracy and flexibility of DMD based approaches. There have also been related and significant advances in model identification and solving nonlinear partial differential equations via deep learning techniques [7, 8, 9, 10].

With this background in mind, in this work we focus on extending the methods in [6], which were called Deep Learning DMD (DLDMD). In that work, a relatively straightforward method merging auto-encoders with the extended DMD (EDMD) was developed. This was done by using an encoder to embed dynamics in a sufficiently high-dimensional space, which then generated a sufficiently large space of observables for the EDMD to generate accurate linear models of the embedded dynamics. Decoding then returned the embedded time series to the original variables in such a way as to guarantee the global stability of iterating the linear model to generate both reconstructions and forecasts of the dynamics. The DLDMD was shown to be very effective in finding equation-free models which were able to both reconstruct and then forecast from data coming from planar dynamical systems.

However, when chaotic time series from the Lorenz-63 system were examined, the performance of the DLDMD was found to degrade. While this clearly limits the scope of the DLDMD approach, we note that DMD based approaches which accurately reconstruct or forecast chaotic dynamics are not readily available. Other methods such as HAVOK [11] or SINDy [12] are more focused on the analysis of chaotic time series or the discovery of model equations which generate chaotic dynamics, though of course if one has an accurate model, then one should be able to generate accurate forecasts. In this vein, there are also methods using reservoir computing (RC) [13, 14], though again, nonlinear models are essentially first learned and then used to generate forecasts. However, both SINDy and RC rely on proposing libraries of terms to build models which are then fit to (or learned from) data.

While effective, such approaches do not allow for the spectral or modal analysis which has proven to be such an attractive and useful feature of DMD based methods. Likewise, they require a number of user decisions about how to construct the analytic models used in later regressive fitting that amount to a guess-and-check approach to generating accurate reconstructions and forecasts. Therefore in this work, using insights coming from the Takens' Embedding Theorem (TET) [15, 3], we expand the DLDMD framework so as to make it accurate in both generating reconstructions and forecasts of chaotic time series. This is done by first making the EDMD over embedded coordinates global as opposed to the local approach of [6]; see also [4, 5]. Second, we develop an adaptive Hankel matrix based ordering of the embedded coordinates which adds more expressive power for approximating dynamics to the deep learning framework. To study our method, we use data generated by the Lorenz-63 and Rossler systems as well as twelve-dimensional projections of data from the Kuramoto–Sivashinsky (KS) equation. In all of these cases, we show that by combining our proposed modifications to the DLDMD we are able to generate far more accurate reconstructions and forecasts for chaotic systems than with DLDMD alone. Moreover, we have built a method which still allows for the straightforward modal analysis which DMD affords and keeps user choices to a handful of real-valued hyperparameters while still producing results competitive with other approaches in the literature.

Further, motivated by the classic information theory (IT) studies of the TET [1], as well as modern insights into the role that information plays in deep learning [16, 17], we study how the fully trained encoder changes the information content of the dynamics coming from the Lorenz-63 and Rossler systems. For the Lorenz-63 system, the encoder tends to either slightly decrease the mutual information or cause strong phase shifts which decrease the coupling times across dimensions. However, the characteristic timescales corresponding to lobe switching in the Lorenz ‘butterfly’ are clearly seen to be preserved in the dynamics of the information for the Lorenz-63 system. In contrast then, for the Rossler system, the slow/fast dichotomy in the dynamics seen in the original coordinates is made more uniform so that rapid transients in the information coupling are removed by the encoder. Thus in either case, we see that the encoder generates significant differences in the information content between dimensions in the latent coordinates relative to the original ones, and that this strong change in information content is a critical feature in successful training.

Of course, the present work is ultimately preliminary, and there are a number of important questions left to be resolved. First, while we are able to easily display computed spectra, the affiliated global Koopman modes we find are not as straightforward to show. We generate our results from random initial conditions, so the most effective means of constructing the global Koopman modes would be via radial-basis functions, but the implementation would be nontrivial due to the infamous ill-conditioning issues which can plague the approach [18]. Second, there is a clear need for a comparison across SINDy, RC, and our DLHDMD methods. In particular, the present work generates excellent reconstructions, but so far the predictive horizon is relatively short and difficult to increase. How well other methods address this issue relative to their reconstruction and other diagnostic properties, and then how all of these methods compare in these several different ways is as yet unclear. While acknowledging then the limitations of the present work, we defer addressing the above issues till later works where each of the above issues can be dealt with in the detail that is needed.

The structure of this paper is as follows. In Section 2, we provide an introduction to the Extended DMD and then explain the extensions we develop which are critical to the success of the present work. In Section 3, we introduce the Hankel DMD and, incorporating the extensions introduced in Section 2, we show how well it does and does not perform on several examples. Then in Section 4 we introduce the Deep Learning Hankel DMD and provide results on its performance. Section 5 presents our analysis of how the mutual information changes in the latent variables. Section 6 provides conclusions and discussion.

2 Extended Dynamic Mode Decomposition

To begin, we suppose that we have the data set $\left\{{\bf y}_{j}\right\}_{j=1}^{N_{T}+1}$ where

\[ {\bf y}_{j}=\varphi(t_{j};{\bf x}),\quad t_{j+1}=t_{j}+\delta t,\quad {\bf x}\in\mathbb{R}^{N_{s}}, \]

where $\delta t$ is the time step at which data is sampled and $\varphi(t;{\bf x})$ is a flow map such that $\varphi(t_{1},{\bf x})={\bf x}$. From the flow map, we define the affiliated Koopman operator $\mathcal{K}^{t}$ such that for a given scalar observable $g({\bf x})$, one has

\[ \mathcal{K}^{t}g({\bf x})=g(\varphi(t,{\bf x})), \]

so that the Koopman operator linearly tracks the evolution of the observable along the flow. We likewise define the associated Hilbert space of observables, say $L_{2}\left(\mathbb{R}^{N_{s}},\mathbb{R},\mu\right)$, or more tersely $L_{2}\left(\mathcal{O}\right)$, so that $g\in L_{2}\left(\mathcal{O}\right)$ if

\[ \int_{\mathbb{R}^{N_{s}}}\left|g({\bf x})\right|^{2}\,d\mu\left({\bf x}\right)<\infty, \]

where $\mu$ is some appropriately chosen measure. This makes the infinite-dimensional Koopman operator $\mathcal{K}^{t}$ a map such that $\mathcal{K}^{t}:L_{2}\left(\mathcal{O}\right)\rightarrow L_{2}\left(\mathcal{O}\right)$.
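As a concrete, if elementary, illustration of the defining relation above, consider the scalar flow $\dot{x}=-x$ with observable $g(x)=x^{2}$; the following sketch (purely illustrative, not part of the original development) verifies numerically that $\mathcal{K}^{t}g(x)=g(\varphi(t,x))=x^{2}e^{-2t}$.

import numpy as np

# For dx/dt = -x the flow map is phi(t, x) = x * exp(-t); for g(x) = x**2 the
# Koopman operator acts as (K^t g)(x) = g(phi(t, x)) = x**2 * exp(-2 t).
phi = lambda t, x: x * np.exp(-t)
g = lambda x: x ** 2

t, x = 0.7, 1.3
assert np.isclose(g(phi(t, x)), x ** 2 * np.exp(-2 * t))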

Following [19, 20], given our time snapshots $\left\{{\bf y}_{j}\right\}_{j=1}^{N_{T}+1}$, we suppose that any observable $g({\bf x})$ of interest lives in a finite-dimensional subspace $\mathcal{F}_{D}\subset L_{2}\left(\mathcal{O}\right)$ described by a given basis of observables $\left\{\psi_{l}\right\}_{l=1}^{N_{ob}}$ so that

\[ g({\bf x})=\sum_{l=1}^{N_{ob}}a_{l}\psi_{l}\left({\bf x}\right). \]

Given this ansatz, we then suppose that

\[ \mathcal{K}^{\delta t}g({\bf x})=\sum_{l=1}^{N_{ob}}a_{l}\psi_{l}\left(\varphi\left(\delta t,{\bf x}\right)\right)=\sum_{l=1}^{N_{ob}}\psi_{l}({\bf x})\left(\mathbf{K}^{T}_{a}{\bf a}\right)_{l}+r({\bf x};\mathbf{K}_{a}), \]

where $r({\bf x};\mathbf{K}_{a})$ is the associated error which results from the introduction of the finite-dimensional approximation of the Koopman operator represented by $\mathbf{K}_{a}$. We can then find $\mathbf{K}_{a}$ by solving the following minimization problem

\[
\begin{aligned}
\mathbf{K}_{a}=&\ \underset{\mathbf{K}}{\mathrm{arg\,min}}\left|r({\bf x};\mathbf{K})\right|^{2}\\
=&\ \underset{\mathbf{K}}{\mathrm{arg\,min}}\sum_{j=1}^{N_{T}}\left|\sum_{l=1}^{N_{ob}}\left(a_{l}\psi_{l}({\bf y}_{j+1})-\psi_{l}({\bf y}_{j})\left(\mathbf{K}^{T}{\bf a}\right)_{l}\right)\right|^{2}\\
=&\ \underset{\mathbf{K}}{\mathrm{arg\,min}}\sum_{j=1}^{N_{T}}\left|\left<\mathbf{\Psi}_{j+1}-\mathbf{K}\mathbf{\Psi}_{j},{\bf a}^{\ast}\right>\right|^{2},
\end{aligned}
\tag{1} \]

where ${\bf a}=(a_{1}\cdots a_{N_{ob}})^{T}$, $\mathbf{\Psi}_{j}=\left(\psi_{1}({\bf y}_{j})\cdots\psi_{N_{ob}}({\bf y}_{j})\right)^{T}$, the inner product $\left<\cdot,\cdot\right>$ is the standard one over $\mathbb{C}^{N_{ob}}$, and the $\ast$ symbol denotes complex conjugation. It is straightforward to show that an equivalent and easier to solve form of this optimization problem is given by

\[ \mathbf{K}_{a}=\underset{\mathbf{K}}{\mathrm{arg\,min}}\left|\left|\mathbf{\Psi}_{+}-\mathbf{K}\mathbf{\Psi}_{-}\right|\right|_{F}^{2}, \tag{2} \]

where $\left|\left|\cdot\right|\right|_{F}$ is the Frobenius norm, and the $N_{ob}\times N_{T}$ matrices $\mathbf{\Psi}_{\pm}$ are given by

\[ \mathbf{\Psi}_{-}=\left\{\mathbf{\Psi}_{1}~\mathbf{\Psi}_{2}~\cdots~\mathbf{\Psi}_{N_{T}}\right\},\quad\mathbf{\Psi}_{+}=\left\{\mathbf{\Psi}_{2}~\mathbf{\Psi}_{3}~\cdots~\mathbf{\Psi}_{N_{T}+1}\right\}. \]

In practice, we solve this equation using the Singular-Value Decomposition (SVD) of $\mathbf{\Psi}_{-}$ so that

\[ \mathbf{\Psi}_{-}=\mathbf{U}\mathbf{\Sigma}\mathbf{W}^{\dagger}. \]

This then gives us

\[ \mathbf{K}_{a}=\mathbf{\Psi}_{+}\mathbf{W}\mathbf{\Sigma}^{-1}\mathbf{U}^{\dagger}. \]

To complete the algorithm, after diagonalizing $\mathbf{K}_{a}$ so that

\[ \mathbf{K}_{a}=\mathbf{V}\mathbf{L}\mathbf{V}^{-1},\quad\mathbf{L}_{ll}=\ell_{l}, \tag{3} \]

then one can show that the Koopman eigenfunctions $\phi_{l}({\bf y}_{j})$ are found via the equations

\[ \mathbf{\Phi}_{\pm}={\bf V}^{-1}\mathbf{\Psi}_{\pm}. \tag{4} \]

From here, one can, starting from the initial conditions, approximate the dynamics via the reconstruction formula

\[ {\bf y}(t;{\bf x})\approx\sum_{l=1}^{N_{ob}}{\bf k}_{l}e^{t\lambda_{l}}\phi_{l}({\bf x}), \tag{5} \]

where $\lambda_{l}=\ln(\ell_{l})/\delta t$ and the Koopman modes ${\bf k}_{l}\in\mathbb{C}^{N_{s}}$ solve the initial-value problem

\[ {\bf x}=\sum_{l=1}^{N_{ob}}{\bf k}_{l}\phi_{l}({\bf x}). \]

Again, in matrix/vector notation, keeping in mind that ${\bf x}\in\mathbb{R}^{N_{s}}$ and that in general $N_{s}\neq N_{ob}$, we have

\[ {\bf x}={\bf K}_{M}\begin{pmatrix}\phi_{1}({\bf x})\\ \vdots\\ \phi_{N_{ob}}({\bf x})\end{pmatrix}, \]

where ${\bf K}_{M}$ is the $N_{s}\times N_{ob}$ matrix whose columns are the Koopman modes ${\bf k}_{j}$. As can be seen then, generically, one can only find the Koopman modes through least-squares solutions of the non-square problem. In this regard, one would do well to have information from as many initial conditions as possible to over-determine the problem.
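For readers implementing the above, a minimal NumPy sketch of these EDMD steps (Equations (2)-(4)) might look as follows; the function and variable names are ours and purely illustrative, not taken from any released code.

import numpy as np

def edmd_operator(Psi_minus, Psi_plus):
    # Psi_minus, Psi_plus: (N_ob, N_T) matrices of observables as in Equation (2).
    # K_a = Psi_+ W Sigma^{-1} U^dagger from the SVD Psi_- = U Sigma W^dagger;
    # in practice one may also truncate small singular values here.
    U, S, Wh = np.linalg.svd(Psi_minus, full_matrices=False)
    K_a = Psi_plus @ Wh.conj().T @ np.diag(1.0 / S) @ U.conj().T
    # Diagonalize K_a (Equation (3)) and form the eigenfunction time series (Equation (4)).
    evals, V = np.linalg.eig(K_a)
    Phi_minus = np.linalg.solve(V, Psi_minus)
    Phi_plus = np.linalg.solve(V, Psi_plus)
    return K_a, evals, V, Phi_minus, Phi_plus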

2.1 Global EDMD

To wit, if we had a collection of initial conditions $\left\{{\bf x}_{k}\right\}_{k=1}^{N_{C}}$ with corresponding path data $\left\{{\bf y}_{j,k}\right\}_{j,k=1}^{N_{T}+1,N_{C}}$, we can extend the optimization problem in Equation (1) to be

\[ \mathbf{K}_{a}=\underset{\mathbf{K}}{\mathrm{arg\,min}}\sum_{k=1}^{N_{C}}\left|r({\bf x}_{k};\mathbf{K})\right|^{2}, \]

so that now the problem of finding $\mathbf{K}_{a}$ is no longer strictly localized to a particular path labeled by the initial condition ${\bf x}$. Following the same optimization argument as above leads one to concatenate the observable vectors column-wise across trajectories when generating the $\mathbf{\Psi}_{\pm}$ matrices so that

\[ \mathbf{\Psi}_{-}=\left\{\mathbf{\Psi}_{1,1}~\mathbf{\Psi}_{2,1}~\cdots~\mathbf{\Psi}_{N_{T},1}~\cdots~\mathbf{\Psi}_{1,N_{C}}~\mathbf{\Psi}_{2,N_{C}}~\cdots~\mathbf{\Psi}_{N_{T},N_{C}}\right\}, \]

where

\[ \mathbf{\Psi}_{j,k}=\left(\psi_{1}(\varphi(t_{j};{\bf x}_{k}))\cdots\psi_{N_{ob}}(\varphi(t_{j};{\bf x}_{k}))\right)^{T}. \]

The matrix $\mathbf{\Psi}_{+}$ is defined similarly. Using then the EDMD algorithm outlined above, we arrive at the following matrix problem for determining ${\bf K}_{M}$:

\[ {\bf X}={\bf K}_{M}\mathbf{\Phi}_{0}, \]

where

\[ {\bf X}=\left({\bf x}_{1}\cdots{\bf x}_{N_{C}}\right),\quad\mathbf{\Phi}_{0}=\begin{pmatrix}\phi_{1}({\bf x}_{1})&\cdots&\phi_{1}({\bf x}_{N_{C}})\\ \phi_{2}({\bf x}_{1})&\cdots&\phi_{2}({\bf x}_{N_{C}})\\ \vdots&\vdots&\vdots\\ \phi_{N_{ob}}({\bf x}_{1})&\cdots&\phi_{N_{ob}}({\bf x}_{N_{C}})\end{pmatrix}. \]

Likewise, given that Equation (4) gives us time series of the Koopman eigenfunctions, which necessarily must satisfy, assuming sufficient accuracy of the approximation implied by Equation (3), the identity

\[ \phi_{l}\left(\varphi(t_{j};{\bf x}_{k})\right)=\mathcal{K}^{j}\phi_{l}({\bf x}_{k})=\ell_{l}^{j}\phi_{l}\left({\bf x}_{k}\right), \]

we can generalize Equation (5) via the model

\[
\begin{aligned}
{\bf Y}_{N_{st}}\approx&\ {\bf K}_{M}{\bf L}^{N_{st}}\mathbf{\Phi}_{-},\\
\approx&\ \bar{{\bf K}}_{M}{\bf K}_{a}^{N_{st}}\mathbf{\Psi}_{-},
\end{aligned}
\tag{6} \]

where $N_{st}\in\mathbb{N}\cup\left\{0\right\}$, $\bar{{\bf K}}_{M}={\bf K}_{M}{\bf V}$, and

\[ {\bf Y}_{N_{st}}\approx\left\{{\bf y}_{N_{st},1}\cdots{\bf y}_{N_{st}+N_{T},1}~\cdots~{\bf y}_{N_{st},N_{C}}\cdots{\bf y}_{N_{st}+N_{T},N_{C}}\right\}, \]

which generates a reconstruction of the data for time steps $N_{st}\leq j\leq N_{T}+1$ and a forecast for steps with index $N_{T}+1\leq j\leq N_{T}+N_{st}$. Using this formula allows for far greater flexibility in employing the EDMD since we can control how many time steps for which we wish to generate reconstructions. This is in contrast to generating forecasts through the iteration of the diagonal matrix ${\bf L}$, which is a process that is generally sensitive to small variations in the position of the eigenvalues $\ell_{l}$, especially for those near the unit circle in the complex plane. We will make great use of this generalization in the later sections of this paper. We also note that in order to avoid introducing complex values, which can cause significant complications when using current machine learning libraries, we will do all later computations in terms of $\bar{{\bf K}}_{M}$ and ${\bf K}_{a}$ as in Equation (6).
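A minimal sketch of how Equation (6) can be used in practice is given below; here $\bar{{\bf K}}_{M}$ is fit by least squares against the data with $N_{st}=0$, which avoids any explicit use of the eigenvector matrix ${\bf V}$. The array layouts and function names are assumptions made for illustration.

import numpy as np

def fit_linear_readout(Y0, Psi_minus):
    # Fit Kbar_M in Y_0 ~ Kbar_M Psi_- (Equation (6) with N_st = 0) by least squares;
    # Y0 is (N_s, N_cols) of state snapshots aligned with the columns of Psi_minus.
    return np.linalg.lstsq(Psi_minus.T, Y0.T, rcond=None)[0].T

def reconstruct(Kbar_M, K_a, Psi_minus, N_st):
    # Evaluate the right-hand side of Equation (6): Kbar_M K_a^{N_st} Psi_-.
    return Kbar_M @ np.linalg.matrix_power(K_a, N_st) @ Psi_minus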

3 Hankel DMD

When implementing EDMD, the most natural observables are the projections along the canonical Cartesian axes, i.e.

\[ \psi_{l}({\bf x})=x_{l},\quad l=1,\cdots,N_{s}. \]

If we stick to this space of observables, the EDMD method reduces to the standard DMD method. Thus the idea with EDMD is to include more nonlinear observables in the hope of representing a richer subspace of dynamics and thereby making the approximation of the corresponding Koopman operator more accurate and sophisticated.

With this in mind, [15] built upon the classic idea of Takens embeddings [21] and explored using affiliated Hankel matrices to generate natural spaces of observables for EDMD, an approach we describe as Hankel DMD (HDMD). Also of note in this direction is the HAVOK method developed in [11], though in some ways HAVOK is more akin to the embedology methods explored in such classic works as [22, 2].

HDMD thus begins with an affiliated scalar measurement of our time series, say $\left\{g({\bf y}_{j})\right\}_{j=1}^{N_{T}+1}$. From this, by introducing a window size $N_{w}$, one builds the affiliated Hankel matrix ${\bf H}_{g}\left({\bf x}\right)$ where

\[ {\bf H}_{g}\left({\bf x}\right)=\begin{pmatrix}g({\bf y}_{1})&\cdots&g({\bf y}_{N_{w}})\\ g({\bf y}_{2})&\cdots&g({\bf y}_{N_{w}+1})\\ \vdots&\vdots&\vdots\\ g({\bf y}_{N_{ob}})&\cdots&g({\bf y}_{N_{T}+1})\end{pmatrix}, \]

where the number of observables is $N_{ob}=N_{T}+1-(N_{w}-1)$.

What one sees then is that each row of ${\bf H}_{g}({\bf x})$ is some iteration of the Koopman operator $\mathcal{K}^{\delta t}$. From here then, each row of $N_{w}$ time steps is defined to be its own separate observable $\psi_{l}({\bf x})$, i.e.

\[ \psi_{l}({\bf x})=\mathcal{K}^{l\delta t}g({\bf x}),\quad l=1,\cdots,N_{ob}. \]

One then proceeds as above with the EDMD algorithm, where we emphasize that $N_{T}$ is replaced by $N_{w}-1$. This is an interesting feature, or arguably limitation, of the HDMD method in which we generate matrices $\mathbf{\Phi}_{\pm}$ (see Equation (4)) up to the time index $N_{w}-1\leq N_{T}$. Thus later times are used to build approximations at prior times. This makes the issue of forecasting data more difficult since one must iterate the EDMD results, as is done via Equation (6), from time index $N_{w}-1$ up to $N_{T}$ to reconstruct the original data that was used in the first place. Throughout the remainder of the paper then, we take care to distinguish between iterated reconstructions and actual forecasts which make novel predictions beyond the given data.

Finishing our explanation of HDMD, if one has data along multiple initial conditions, say $\left\{{\bf x}_{k}\right\}_{k=1}^{N_{C}}$, we can extend the above algorithm by concatenating Hankel matrices, so that we perform EDMD on the combined matrix

\[ {\bf H}_{C}=\left({\bf H}_{g}\left({\bf x}_{1}\right)~\cdots~{\bf H}_{g}\left({\bf x}_{N_{C}}\right)\right). \]

The inclusion of other observables can be done in a similar fashion.
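A short sketch of the Hankel construction just described, assuming the scalar measurements are stored in a NumPy array, is given below; for multiple initial conditions the per-trajectory matrices can then be concatenated column-wise (e.g. with np.hstack) to form ${\bf H}_{C}$.

import numpy as np

def hankel_observables(g_vals, N_w):
    # g_vals holds g(y_1), ..., g(y_{N_T + 1}); N_w is the window size, so that the
    # number of rows (observables) is N_ob = N_T + 1 - (N_w - 1).
    N_ob = len(g_vals) - (N_w - 1)
    # Row l holds the measurement advanced by l time steps over the window,
    # i.e. a sampling of (K^{l dt} g)(x).
    return np.stack([g_vals[l:l + N_w] for l in range(N_ob)])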

3.1 Results for HDMD

The ultimate promise of the HDMD is that it should facilitate an adaptable implementation of the EDMD framework which allows for the number of observables to simply be adjusted by the window size. To see this, in all of the following results we let $t_{f}=20$, $dt=.05$, and we use $N_{C}=128$ random initial conditions which are then stacked together. For HDMD, observables along each dimension of the dynamical system are used so that

\[ N_{ob}=N_{s}\bar{N}_{ob},\quad\bar{N}_{ob}=N_{T}+1-(N_{w}-1). \]

Reconstructions and forecasts are generated using Equation (6) for $N_{st}=20$, which for a time step of $dt=.05$ means forecasts are produced up to a unit of non-dimensional time. We note though that the choice of $\bar{N}_{ob}$ sets the window size value $N_{w}=N_{T}+1-(\bar{N}_{ob}-1)$, so that instead of using EDMD on data from $0\leq t\leq t_{f}$, we now use data from $0\leq t\leq t_{f,w}$ where $t_{f,w}=N_{w}\delta t$.

If we then take data from the Van der Pol oscillator, where

\[ \dot{y}_{1}=y_{2},\quad\dot{y}_{2}=-y_{1}+\mu(1-y_{1}^{2})y_{2},\quad\mu=1.5, \]

we find, as seen in Figure 1, that $\bar{N}_{ob}=10$ does not produce especially accurate reconstructions or forecasts. By increasing $\bar{N}_{ob}$ to $20$ though, we are able to generate far more accurate results, though at the cost of being able to forecast beyond the given time series. We further note that by fixing $\bar{N}_{ob}=20$ and letting $N_{st}=30$, we get essentially the same degree of degradation in the forecast as when we chose $\bar{N}_{ob}=10$ and $N_{st}=20$.

Figure 1: HDMD results for the Van der Pol equation with $N_{st}=20$ and $\bar{N}_{ob}=10$ (left) and $\bar{N}_{ob}=20$ (right), so that the reconstruction is generated for $1\leq t\leq t_{f,w}$ and iterated reconstruction and forecasting is done for times $t_{f,w}\leq t\leq t_{f,w}+1$, where $t_{f,w}=19.5$ for $\bar{N}_{ob}=10$ and $t_{f,w}=19$ for $\bar{N}_{ob}=20$. For $\bar{N}_{ob}=20$ the maximum error over trajectories is $.0214\%$ while for $\bar{N}_{ob}=10$ it is $25.9\%$. The enhanced accuracy comes at the expense of generating novel forecasts of the time series. As can be seen, relative to the choice of $N_{st}$, doubling the number of observables greatly enhances the accuracy of the reconstructions and forecast.
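For reference, data of the sort used in this experiment could be generated along the following lines; the initial-condition sampling range and solver tolerances are our assumptions rather than specifications from the text.

import numpy as np
from scipy.integrate import solve_ivp

mu, dt, t_f, N_C = 1.5, 0.05, 20.0, 128
t_eval = np.linspace(0.0, t_f, int(t_f / dt) + 1)

def vdp(t, y):
    # Van der Pol vector field with mu = 1.5, as above.
    return [y[1], -y[0] + mu * (1.0 - y[0] ** 2) * y[1]]

rng = np.random.default_rng(0)
trajectories = []
for _ in range(N_C):
    y0 = rng.uniform(-2.0, 2.0, size=2)   # sampling box is an assumption
    sol = solve_ivp(vdp, (0.0, t_f), y0, t_eval=t_eval, rtol=1e-10, atol=1e-10)
    trajectories.append(sol.y)            # shape (2, N_T + 1)
data = np.stack(trajectories)             # shape (N_C, 2, N_T + 1)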

In contrast to these results, we find that the Lorenz Equations

\[
\begin{aligned}
\dot{y}_{1}=&\ \sigma(y_{2}-y_{1}),\\
\dot{y}_{2}=&\ \rho y_{1}-y_{2}-y_{1}y_{3},\\
\dot{y}_{3}=&\ -by_{3}+y_{1}y_{2},
\end{aligned}
\tag{7} \]

where

\[ \sigma=10,\quad\rho=28,\quad b=\frac{8}{3}, \]

provide a case in which the HDMD is not able to adequately capture the dynamics for any reasonable choices of $\bar{N}_{ob}$. This is not necessarily surprising since for the parameter choices made, we know that the dynamics trace out the famous Butterfly strange attractor as seen in the top row of Figure 2. Given that we are trying to approximate dynamics on a strange attractor, we would reasonably anticipate the HDMD to struggle.

However, as seen in the bottom row of Figure 2, the method essentially fails completely for parameter choices identical to those used above for the Van der Pol oscillator. That all said, further adaptation of the HDMD method might produce more desirable results. We will see how to realize this through the use of neural networks in the following section.

Figure 2: In the top row, the Lorenz Butterfly. In the bottom row, the HDMD results for the Lorenz system with $N_{st}=20$, so that the reconstruction is generated for times $1\leq t\leq t_{f,w}$ and iterated reconstruction and forecasting is done for times $t_{f,w}\leq t\leq t_{f,w}+1$, where $t_{f,w}=19.5$ for $\bar{N}_{ob}=10$ (left) and $t_{f,w}=19$ for $\bar{N}_{ob}=20$ (right). As can be seen, doubling the number of observables does little to enhance the accuracy of the reconstructions, and the errors in the bottom row plots are well over $100\%$.

4 Deep Learning HDMD

To improve the HDMD such that it is able to deal with chaotic systems such as the Lorenz equations, we now turn to and adapt the framework of the deep learning DMD (DLDMD) developed in [6]. Our deep learning enhanced HDMD begins with an autoencoder composed of neural networks $\mathcal{E}$ (the encoder) and $\mathcal{D}$ (the decoder) such that

\[ \mathcal{E}:\mathbb{R}^{N_{s}}\rightarrow\mathbb{R}^{N_{e}},\quad\mathcal{D}:\mathbb{R}^{N_{e}}\rightarrow\mathbb{R}^{N_{s}}, \]

and such that our auto-encoder is a near identity, i.e.

\[ \tilde{{\bf y}}=\mathcal{E}\left({\bf y}\right),\quad\mathcal{D}\left(\tilde{{\bf y}}\right)\approx{\bf y}. \]

Note, we call the encoded coordinates latent variables or latent dimensions in line with the larger literature on machine learning. Motivated by our results in [6], we take $N_{e}\geq N_{s}$. We then have in the later EDMD step that $N_{ob}=N_{e}\bar{N}_{ob}$, so that larger values of $N_{e}$ correspond to more observables to represent the dynamics through an approximation to the Koopman operator.

The encoded coordinates should represent a set of observables which enhance the overall accuracy of HDMD approximations of the dynamics. To train for this, after making reasonable choices for how to initialize the weights of the auto-encoder and fixing a choice for $N_{st}$, given the training data $\left\{{\bf y}_{j,k}\right\}_{j,k=1}^{N_{T}+1,N_{C}}$, shuffled into $\lfloor N_{C}/N_{B}\rfloor$ batches of batch size $N_{B}$, we use over a given batch ${\bf Y}=\left\{{\bf y}_{j,k}\right\}_{j,k=1}^{N_{T}+1,N_{B}}$ the following loss function

\[ \mathcal{L}_{tot}=\mathcal{L}_{recon}+\mathcal{L}_{pred}+\mathcal{L}_{dmd}+\alpha\mathcal{L}_{reg}, \]

where

\[
\begin{aligned}
\mathcal{L}_{recon}=&\ \left[\left[\left|\left|{\bf Y}-\mathcal{D}\circ\mathcal{E}\left({\bf Y}\right)\right|\right|_{2}\right]_{k=1}^{N_{B}}\right]_{j=1}^{N_{T}+1},\\
\mathcal{L}_{dmd}=&\ \left[\left[\left|\left|\tilde{{\bf Y}}-\bar{{\bf K}}_{M}{\bf K}_{a}^{N_{st}}\mathbf{\Psi}_{-}\right|\right|_{2}\right]_{k=1}^{N_{B}}\right]_{j=N_{st}}^{N_{w}-1},\\
\mathcal{L}_{pred}=&\ \left[\left[\left|\left|{\bf Y}-\mathcal{D}\left(\bar{{\bf K}}_{M}{\bf K}_{a}^{N_{st}}\mathbf{\Psi}_{-}\right)\right|\right|_{2}\right]_{k=1}^{N_{B}}\right]_{j=N_{st}}^{N_{w}-1},
\end{aligned}
\]

with $[\cdot]_{k=1}^{N_{B}}$ denoting averaging over a given batch, and $[\cdot]_{j=j_{1}}^{j_{2}}$ denoting averaging over the $j_{1}$ to $j_{2}$ timesteps. We also emphasize here that the matrices $\bar{{\bf K}}_{M}$, ${\bf K}_{a}$, and $\mathbf{\Psi}_{-}$ are defined globally over a given batch of trajectories. This is in contrast to the methods in [6] or [4], which compute equivalent matrices over each trajectory. Arguably then, our approach is learning more global representations of the Koopman operator, the consequences of which we will explore in future work, though see the Appendix for a brief exploration on this point.
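To make the loss terms concrete, a hedged PyTorch sketch is given below; the batch layout, the hdmd() and align_windows() helpers, and the choice of regularizer $\mathcal{L}_{reg}$ (which is not specified above) are assumptions made for illustration, and this is not the authors' implementation.

import torch

def dlhdmd_losses(Y, encoder, decoder, hdmd, align_windows, N_st, alpha):
    # Y: (N_B, N_T + 1, N_s) batch of trajectories.
    # hdmd: assumed helper returning the global (Kbar_M, K_a, Psi_minus) for the encoded batch.
    # align_windows: assumed helper that reshapes the advanced observables so that they line
    #                up with the latent and original trajectories over steps N_st, ..., N_w - 1.
    Y_tilde = encoder(Y)
    L_recon = torch.mean(torch.linalg.norm(Y - decoder(Y_tilde), dim=-1))

    Kbar_M, K_a, Psi_minus = hdmd(Y_tilde)
    advanced = Kbar_M @ torch.linalg.matrix_power(K_a, N_st) @ Psi_minus
    Y_tilde_w, Y_w, advanced = align_windows(Y_tilde, Y, advanced, N_st)
    L_dmd = torch.mean(torch.linalg.norm(Y_tilde_w - advanced, dim=-1))
    L_pred = torch.mean(torch.linalg.norm(Y_w - decoder(advanced), dim=-1))

    # The paper does not spell out L_reg here; simple weight decay is shown as a placeholder.
    L_reg = sum(p.pow(2).sum() for p in encoder.parameters()) + \
            sum(p.pow(2).sum() for p in decoder.parameters())
    return L_recon + L_pred + L_dmd + alpha * L_reg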

4.0.1 Updating $\bar{N}_{ob}$

One of the most compelling reasons to use Hankel-DMD in our learning algorithm is that it easily allows us to change the model without having to change the underlying neural networks represented by $\mathcal{E}$ and $\mathcal{D}$. We note that this is in marked contrast to the method of [6], which, while allowing for varying models by changing the dimensionality of the range of $\mathcal{E}$, must then fix that output dimension for the length of training. This can make finding good models a challenging task. Thus, we anticipate that by allowing $\bar{N}_{ob}$ to change we can develop far more flexible models which are ultimately easier to train and use.

To update $\bar{N}_{ob}$, we first train the networks $\mathcal{E}$ and $\mathcal{D}$ by optimizing $\mathcal{L}_{tot}$ over a given epoch. Then, fixing the weights in the networks, we take a subset of the training data and compute $\mathcal{L}_{tot}(\bar{N}_{ob})$. We then compute $\mathcal{L}_{tot}(\bar{N}_{ob}\pm 1)$ and compare to $\mathcal{L}_{tot}(\bar{N}_{ob})$. If either of the variant models gives a lower value of $\mathcal{L}_{tot}$ and causes a relative change greater than some chosen tolerance $f_{r}$, then we update the model by changing $\bar{N}_{ob}$. Note, the choice of the range of values of $\bar{N}_{ob}$ comes from trial and error with a particular emphasis on computational efficiency and cost. Likewise, introducing the relative change threshold $f_{r}$ was found to be useful in ignoring what amounted to spurious changes in the model parameters. We collect the details of our learning method in Algorithm 1, which we call the Deep Learning HDMD (DLHDMD); a brief sketch of the update step is given after the algorithm.

Data: Choose parameters $N_{e}$, $N_{C}$, $N_{B}$, $\alpha$, $f_{r}$, $N_{st}$
Data: Choose initial value of $\bar{N}_{ob}$
for $l\leftarrow 1$ to $E_{max}$ do
    Shuffle the $N_{C}$ trajectories into $\lfloor N_{C}/N_{B}\rfloor$ batches, each of size $N_{B}$;
    for $m\leftarrow 1$ to $\lfloor N_{C}/N_{B}\rfloor$ do
        $\tilde{{\bf Y}}\leftarrow\mathcal{E}({\bf Y})$;
        $\bar{{\bf K}}_{M},~{\bf K}_{a},~\mathbf{\Psi}_{-}\leftarrow \mathrm{HDMD}\left(\tilde{{\bf Y}}\right)$;
        Using $\bar{{\bf K}}_{M},~{\bf K}_{a},~\mathbf{\Psi}_{-}$, compute $\mathcal{L}_{pred}$ and $\mathcal{L}_{dmd}$;
        Compute $\mathcal{D}(\tilde{{\bf Y}})$ and then find $\mathcal{L}_{recon}$;
        $\mathcal{L}_{tot}\leftarrow\mathcal{L}_{recon}+\mathcal{L}_{pred}+\mathcal{L}_{dmd}+\alpha\mathcal{L}_{reg}$;
        Using the optimizer, attempt to reduce $\mathcal{L}_{tot}$;
    Update for $\bar{N}_{ob}$: all quantities computed for fixed network weights over a subset of the training data.
    Compute (as above) $\mathcal{L}_{tot}(\bar{N}_{ob})$;
    $\mathcal{L}_{min}\leftarrow\mathcal{L}_{tot}(\bar{N}_{ob})$;
    for $n\leftarrow\bar{N}_{ob}-1$ to $\bar{N}_{ob}+1$ do
        Compute $\mathcal{L}_{tot}(n)$;
        if $\mathcal{L}_{tot}(n)<\mathcal{L}_{min}$ and $|1-\mathcal{L}_{tot}(n)/\mathcal{L}_{min}|>f_{r}$ then
            $\mathcal{L}_{min}\leftarrow\mathcal{L}_{tot}(n)$;
            $\bar{N}_{ob}\leftarrow n$;
Algorithm 1: The DLHDMD Algorithm
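A minimal Python sketch of the $\bar{N}_{ob}$ update step in Algorithm 1 is as follows, where loss_for is a hypothetical helper that evaluates $\mathcal{L}_{tot}$ on a subset of the training data with the network weights held fixed and the Hankel window set by its argument.

def update_n_ob(n_ob, loss_for, f_r):
    # Accept a neighboring window count only if it both lowers the loss and
    # changes it by a relative amount greater than f_r.
    L_min, best = loss_for(n_ob), n_ob
    for n in (n_ob - 1, n_ob + 1):
        L_n = loss_for(n)
        if L_n < L_min and abs(1.0 - L_n / L_min) > f_r:
            L_min, best = L_n, n
    return best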

4.1 Results for DLHDMD

We now show how the DLHDMD performs on several dynamical systems. We take as our training set $N_{C}=7000$ randomly chosen initial conditions with their affiliated trajectories. For the lower dimensional systems, we use 1000 randomly chosen initial conditions for testing purposes. For the higher dimensional systems, we use 500 randomly chosen trajectories for testing. The networks $\mathcal{E}$ and $\mathcal{D}$ consist of 5 layers each. Two of the layers are for input and output, while the remaining three are hidden nonlinear layers using the ReLU activation function. Alternate architectures with more or fewer hidden layers were studied, with the present choice found to lead to the most efficient training. The training is done in PyTorch [23] over $E_{max}=200$ epochs for the lower dimensional systems and $E_{max}=100$ epochs for the KS system. In all cases, the training used an ADAM optimizer with learning rate $\gamma$.

4.1.1 Hyperparameter Tuning

Before runtime then, the user must specify $N_{e}$, $N_{st}$, $N_{B}$, $\alpha$, $\gamma$, $f_{r}$, and an initial choice for $\bar{N}_{ob}$. $N_{st}$ was chosen relative to considerations of the underlying dynamical systems themselves, as explained later. For our tuning results, we found that setting $N_{e}=N_{s}$ and letting our adaptive scheme update $\bar{N}_{ob}$ was generally sufficient to produce useful suggestions for parameter choices. This also helped manage the computational cost of tuning. To determine optimal choices of $N_{B}$, $\alpha$, and $\gamma$, we used the Ray tuner [24] set to perform hyperband optimization across six samples at a time. The batch size $N_{B}$ was chosen from the values $\left\{64,128,256\right\}$, $\alpha$ was chosen from a log-uniform sampling of values between $10^{-12}$ and $10^{-8}$, and $\gamma$ was chosen from a log-uniform sampling of values between $10^{-5}$ and $10^{-2}$. For Lorenz-63 and Rossler, we started with $\bar{N}_{ob}=10$, while for the KS system, we started from $\bar{N}_{ob}=5$. Significantly smaller or larger choices for the initial value of $\bar{N}_{ob}$ in all cases resulted in less than desirable performance, as determined through trial and error. We also note that when tuning, we chose a fraction of $E_{max}$, say $E_{tst}$, so as to facilitate a more rapid tuning process. While this did not need to be especially large for the Lorenz or KS systems, Rossler was a different matter, and small choices of $E_{tst}$ led to very unstable training.
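For concreteness, the tuning setup described above could be expressed with the classic Ray Tune interface roughly as follows; train_dlhdmd is a hypothetical trainable standing in for $E_{tst}$ epochs of Algorithm 1, and the exact API details depend on the Ray version.

from ray import tune
from ray.tune.schedulers import HyperBandScheduler

def train_dlhdmd(config):
    # Hypothetical trainable: run E_tst epochs of Algorithm 1 with the sampled
    # N_B, alpha, gamma, then report the resulting validation loss to Tune.
    loss = run_dlhdmd_epochs(config)   # placeholder for the actual training loop
    tune.report(loss=loss)

search_space = {
    "N_B": tune.choice([64, 128, 256]),
    "alpha": tune.loguniform(1e-12, 1e-8),
    "gamma": tune.loguniform(1e-5, 1e-2),
}

analysis = tune.run(
    train_dlhdmd,
    config=search_space,
    num_samples=6,                      # six samples at a time, as described above
    scheduler=HyperBandScheduler(),     # hyperband early stopping
    metric="loss",
    mode="min",
)
print(analysis.best_config)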

Determining good choices of $f_{r}$ turned out to be the most difficult task. Generally speaking, after trial and error, we found that one wanted $f_{r}$ large enough so that the model did not spuriously increase $\bar{N}_{ob}$ in place of forcing the encoder and decoder to better learn the dynamics. Thankfully, we consistently observed that if $\bar{N}_{ob}=N_{st}$ during training, which corresponds to the model losing any predictive capacity, then $\mathcal{L}_{dmd}$ would drop several orders of magnitude, thereby becoming essentially irrelevant. Thus if we observe during training that $\bar{N}_{ob}\geq N_{st}$, this clearly indicates that we should increase $f_{r}$. That said, we also need to keep $f_{r}$ small enough so that the model can take advantage of increasing $\bar{N}_{ob}$ so as to stabilize training and improve model performance. A strategy to address this issue was to auto-tune for relatively small values of $E_{tst}$ at low choices of $f_{r}$, which then let us get a reasonable estimate of good choices for $\gamma$, $\alpha$, and $N_{B}$. We would then run these models and see if our choice of $f_{r}$ allowed $\bar{N}_{ob}=N_{st}$ during longer training runs. If so, we would raise the value of $f_{r}$ and rerun our model, and if this produced a model with a low and converging $\mathcal{L}_{tot}$ while keeping $\bar{N}_{ob}<N_{st}$, we kept those parameter choices. Otherwise, we would retune across another six samples at a higher value of $f_{r}$ and possibly $E_{tst}$. For each system, this amounted to about three to four tuning experiments overall to produce good results. This approach did allow us to ultimately converge on parameter choices which produced excellent results, but we also acknowledge that this is a time consuming process. Clearly, developing a loss function which helps manage choosing good $f_{r}$ values is desirable and will be a subject of future research.

The results for all of our experiments are presented in Table 1. As can be seen, the underlying differences in dynamics manifest as significant changes to the batch size and learning rate. Most notably, Rossler's multiscale features demand a larger number of testing epochs, a larger batch size, and a lower learning rate, reflecting the greater difficulty that the machine has in learning a reasonable optimization path. From this, as a general rule of thumb, when all else fails, raising $N_{B}$ and lowering $\gamma$ gives the optimization routine more chance to find training paths.

                        Lorenz-63              Rossler                KS
$\bar{N}_{ob}$ (te)     10                     10                     5
$f_{r}$ (te)            .05                    .25                    .15
$E_{tst}$ (te)          20                     40                     10
$N_{B}$ (tn)            64                     256                    64
$\alpha$ (tn)           $3.46\times10^{-12}$   $2.52\times10^{-12}$   $4.85\times10^{-12}$
$\gamma$ (tn)           .000507                .0001                  .00081

Table 1: Hyperparameter choices and tuning results for the several systems studied. (te) denotes a parameter chosen through trial-and-error while (tn) denotes one determined via tuning relative to the trial-and-error choices made. We emphasize that $E_{tst}$ is not used in the final learning algorithm, but it is necessary to choose it in order to get useful tuning results.

Finally, we briefly address the performance of our algorithm with respect to the system size $N_{s}$. For Lorenz-63 and Rossler, $N_{s}=3$, while for the KS system, $N_{s}=12$. This is a significant difference, but it does raise the question of whether our method has an upper limit beyond which it can no longer readily train. To find this, we studied the Lorenz-96 equations [25] for $N_{s}=5$, $10$, and $20$. We were able to get successful training for all but $N_{s}=20$. Even by following our previous advice of increasing $N_{B}$ and decreasing $\gamma$, we could not find a combination of parameters which allowed for reasonable training. Why this is the case, and possible remedies, will be a subject of future research.

4.1.2 DLHDMD for the Lorenz-63 System

The results of running the DLHDMD for the Lorenz-63 system are found in Figures 3 and 4. The maximum positive Lyapunov exponent, say $\lambda_{L}$, for the Lorenz-63 system can be numerically computed using the methods of [26], and we find that $\lambda_{L}\approx.8875$. In this case then, our prediction window is only slightly less than $1/\lambda_{L}\approx 1.127$, so that we are making predictions up to the point where the strange attractor would tend to induce significant separations in what were initially nearby trajectories. First taking $N_{e}=3$ and $f_{r}=.05$, as can be seen in Figure 3, the overall reconstruction and forecast, plotted for times $t$ such that $1\leq t\leq t_{f,w}+1$, shows excellent agreement with the plot of the Lorenz Butterfly in the top row of Figure 2. This degree of accuracy is quantified by the graph of $\mathcal{L}_{pred}$ in Figure 4, which shows a relative accuracy of about $.1\%$ by the $200^{th}$ epoch. To achieve this, we see that the DLHDMD progressively raises the value of $\bar{N}_{ob}$. As seen in Figure 3, this process continues until about the $25^{th}$ epoch, at which point $\bar{N}_{ob}=14$, and it then stays at this value for the remaining epochs. What is particularly striking is that $\bar{N}_{ob}<N_{st}$ for the length of training, which means that our trained model is able to make novel forecasts while still providing excellent reconstructions. We note that larger choices of $f_{r}$ were examined, and generally they prevented the model from updating $\bar{N}_{ob}$, thereby degrading model performance.

Figure 3: Results of the DLHDMD on the Lorenz-63 system after 200 epochs of training for $N_{e}=3$ and $f_{r}=.05$. From left to right, we see the reconstruction generated by DLHDMD, i.e. $\mathcal{D}\left(\bar{{\bf K}}_{M}{\bf K}_{a}^{N_{st}}\mathbf{\Psi}_{-}\right)$, and the plots of $\bar{N}_{ob}$ and $\mathcal{L}_{tot}$ over epochs. The reconstruction is generated for times $1\leq t\leq t_{f,w}$ and forecasting is done for times $t_{f,w}\leq t\leq t_{f,w}+1$. Error plots are over both training (blue/dash) and testing (black/solid) data. The reconstruction in the leftmost figure is generated over testing data.
Figure 4: Results of the DLHDMD on the Lorenz-63 system after 200 epochs of training for $N_{e}=3$ and $f_{r}=.05$. Moving from left to right, we plot $\mathcal{L}_{recon}$, $\mathcal{L}_{pred}$, and $\mathcal{L}_{dmd}$. Error plots are over both training (blue/dash) and testing (black/solid) data.

We now examine the effect of increasing the embedding dimension so that $N_{e}=6$. We first note that we had to increase $f_{r}$ to $.1$ in order to keep $\bar{N}_{ob}$ from reaching and surpassing $N_{st}$, which it rapidly did when $f_{r}=.05$. Otherwise, we keep all of the other parameters the same for the sake of comparison. That said, as can be seen in Figures 5 and 6, while the training exhibits larger variance around the net trend in the loss, the model is able to converge to an even lower overall loss while still only needing $\bar{N}_{ob}=14$ observables. However, the improvement is less than an order of magnitude, and this generally reflects that the performance of our models is not drastically affected by changing the embedding dimension. We also note that the average training time per epoch when $N_{e}=3$ is 22 s while for $N_{e}=6$ it is 44 s. We then see that raising the embedding dimension is a fraught affair. While it can improve model performance, it becomes much more computationally costly for what appear to be relatively nominal returns.

Figure 5: Results of the DLHDMD on the Lorenz-63 system after 200 epochs of training for $N_{e}=6$ and $f_{r}=.1$. From left to right, we see the reconstruction generated by DLHDMD, i.e. $\mathcal{D}\left(\bar{{\bf K}}_{M}{\bf K}_{a}^{N_{st}}\mathbf{\Psi}_{-}\right)$, and the plots of $\bar{N}_{ob}$ and $\mathcal{L}_{tot}$ over epochs. The reconstruction is generated for times $1\leq t\leq t_{f,w}$ and forecasting is done for times $t_{f,w}\leq t\leq t_{f,w}+1$. Error plots are over both training (blue/dash) and testing (black/solid) data. The reconstruction in the leftmost figure is generated over testing data.
Figure 6: Results of the DLHDMD on the Lorenz-63 system after 200 epochs of training for $N_{e}=6$ and $f_{r}=.1$. Moving from left to right, we plot $\mathcal{L}_{recon}$, $\mathcal{L}_{pred}$, and $\mathcal{L}_{dmd}$. Error plots are over both training (blue/dash) and testing (black/solid) data.

Of course, one might wonder to what extent increasing $N_{e}$ can be exchanged for increasing $\bar{N}_{ob}$. Thus, we set $f_{r}=\infty$ so that we do not perform any update of $\bar{N}_{ob}$ during training. The results of doing this for $N_{e}=6$ can be seen in Figure 7. Comparing $\mathcal{L}_{tot}$ across Figures 3, 5, and 7, we see that increasing $N_{e}$ but not updating $\bar{N}_{ob}$ gets us a model of intermediate accuracy, though again the overall differences are relatively nominal compared to the orders of magnitude present in the original data sets. Ultimately then, we see that the updating scheme controlled by $f_{r}$ improves model performance, though the update threshold needs to be chosen carefully: large enough to make sure that changing $\bar{N}_{ob}$ does not preclude the encoder/decoder networks from doing the harder work of learning the dynamics, while at the same time not so large that no update moves are made. Finally, we refer the reader to the Appendix, where we examine not only setting $f_{r}=\infty$, but also fixing $\bar{N}_{ob}=1$, thereby “turning off” Hankel DMD in our algorithm. We also compare using the global EDMD method with a local one, which then essentially reduces to the DLDMD method of [6]. In all cases, we find the results significantly worse than what is presented in this section, thereby showing that our additions to the DLDMD method have markedly improved model development.

Figure 7: Results of the DLHDMD on the Lorenz-63 system after 200 epochs of training without updating the initial choice of $\bar{N}_{ob}=10$ for $N_{e}=6$ and $f_{r}=\infty$. From left to right, we see the reconstruction generated by DLHDMD, i.e. $\mathcal{D}\left(\bar{{\bf K}}_{M}{\bf K}_{a}^{N_{st}}\mathbf{\Psi}_{-}\right)$, and the plot of $\mathcal{L}_{tot}$ over epochs. The reconstruction is generated for times $1\leq t\leq t_{f,w}$ and forecasting is done for times $t_{f,w}\leq t\leq t_{f,w}+1$, where $t_{f,w}=19.5$. Error plots are over both training (blue/dash) and testing (black/solid) data. The reconstruction in the leftmost figure is generated over testing data.

4.1.3 DLHDMD for the Rossler System

We now study the Rossler system given by

\[
\begin{aligned}
\dot{y}_{1}=&\ -y_{2}-y_{3},\\
\dot{y}_{2}=&\ y_{1}+ay_{2},\\
\dot{y}_{3}=&\ b+y_{3}\left(y_{1}-c\right),
\end{aligned}
\]

where

\[ a=.1,\quad b=.1,\quad c=14. \]

Aside from the dynamics coalescing onto a strange attractor, the disparity in parameter values gives rise to multiscale phenomena so that there are slow and fast regimes of the dynamics. The slow portions are approximated by harmonic motion in the $(y_{1},y_{2})$ plane, and the fast portion moves along the $y_{3}$ coordinate. This strong disparity in time scales also appears by way of $\lambda_{L}\approx 1.989$, which is more than double the maximal Lyapunov exponent for the Lorenz-63 system. Thus dynamics separate along the strange attractor twice as fast.

Using the parameter choices in Table 1 and taking $N_{e}=3$, in Figures 8 and 9 we present the training and testing results of the DLHDMD. We immediately observe that, relative to the Lorenz-63 system, the much larger choice of $N_{B}$ and much smaller choice of learning rate $\gamma$ produce a far less noisy training/validation curve; see in particular Figure 9. Somewhat surprisingly, we are able to train to overall lower loss values even with the significantly larger choice for $f_{r}$. We note that even nominally smaller choices of $f_{r}$ resulted in the collapse of the $\mathcal{L}_{dmd}$ term, so the slow/fast dynamics create very concrete differences between the Rossler and Lorenz-63 models. That said, we are still able to keep the final value of $\bar{N}_{ob}<N_{st}$ and still get excellent reconstruction, so it is clear that our overall approach is working very well. We also note that we examined the impact of letting $N_{e}=6$ and generally found no significant improvement in training or testing. In this case then, we see that there is no hard and fast rule about using higher embedding dimensions to improve model performance, and so $N_{e}$ must be treated as a general hyperparameter which can induce significant computational cost.

Figure 8: Results of the DLHDMD on the Rossler system after 200 epochs of training for $N_{e}=3$ and $f_{r}=.25$. From left to right, we see the reconstruction generated by DLHDMD, i.e. $\mathcal{D}\left(\bar{{\bf K}}_{M}{\bf K}_{a}^{N_{st}}\mathbf{\Psi}_{-}\right)$, and the plots of $\bar{N}_{ob}$ and $\mathcal{L}_{tot}$ over epochs. The reconstruction is generated for times $1\leq t\leq t_{f,w}$ and forecasting is done for times $t_{f,w}\leq t\leq t_{f,w}+1$. Error plots are over both training (blue/dash) and testing (black/solid) data. The reconstruction in the leftmost figure is generated over testing data.
Figure 9: Results of the DLHDMD on the Rossler system after 200 epochs of training for $N_{e}=3$ and $f_{r}=.25$. Moving from left to right, we plot $\mathcal{L}_{recon}$, $\mathcal{L}_{pred}$, and $\mathcal{L}_{dmd}$. Error plots are over both training (blue/dash) and testing (black/solid) data.

4.1.4 DLHDMD for the KS Equation

To see the edges of our method, we now examine spatio-temporal chaos generated by the KS equation with periodic-boundary conditions in the form

\[ u_{t}+u_{xx}+u_{xxxx}+uu_{x}=0,\quad u(x+2L,t)=u(x,t). \]

Note, given the vast size of the literature around the KS equation, we refer the reader to [27] for an extensive bibliography with regards to details and pertinent proofs of facts used in this section. Introducing the rescalings

\[ \tilde{t}=\frac{t}{T},\quad\tilde{x}=\frac{\pi}{L}x,\quad u=A\tilde{u}, \]

and taking the balances

\[ A=\frac{L}{\pi T},\quad T=\left(\frac{L}{\pi}\right)^{2}, \]

we get the equivalent KS equation (dropping tildes for ease of reading)

\[ u_{t}+u_{xx}+\nu u_{xxxx}+uu_{x}=0,\quad\nu=\left(\frac{\pi}{L}\right)^{2}. \]

Looking at the linearized dispersion relationship $\omega(k)=k^{2}-\nu k^{4}$, we see that the $\nu$ parameter acts as a viscous damping term. Thus, as the system size $L$ is increased, the effective viscosity is decreased, thereby allowing for more complex dynamics to emerge. As is now well known, for $L$ sufficiently large, a fractional-dimensional strange attractor forms which both produces intricate spatio-temporal dynamics while also allowing for a far simpler representation of said dynamics. It has been shown in many different works (see for example [28]) that $L=11$ generates a strange attractor with dimension between eight and nine, and that this is about the smallest value of $L$ which is guaranteed to generate chaotic dynamics. We therefore set $L=11$ throughout the remainder of this section.

To study the DLHDMD on the KS equation, we use KS data numerically generated via a method-of-lines approach which is pseudo-spectral in space and uses fourth-order exponential-differencing Runge-Kutta in time [29]. For the pseudo-spectral method, $K=128$ total modes are used, giving an effective spatial mesh width of $2L/K=.172$, while the time step for the Runge-Kutta scheme is set to $\delta t=.25$. These particular choices were made with regards to practical memory and simulation time length constraints. After a burn-in time of $t_{b}=\left(L/\pi\right)^{4}=150.3$, which is the time scale affiliated with the fourth-order spatial derivative for the chosen value of $L$, 8000 trajectories of total simulation time length $t_{f}=\left(L/\pi\right)^{4}$ were used with gaps of $L/\pi$ in between to allow for nonlinear effects to make each sample significantly different from its neighbors. Each of the 8000 space/time trajectories was then separated via a POD into space and time modes; see [30]. Taking $N_{s}=12$ modes captured between 97.8% and 99.4% of the total energy. The choice of the total time scale $t_{f}$ also ensured that the ratio of the smallest and largest singular values affiliated with the POD was between $10^{-1.1}$ and $10^{-1.9}$, so that the relative importance of each of the modes was roughly the same across all samples. We take this as an indication that each 12-dimensional affiliated time series is accurately tracing dynamics along a common finite-dimensional strange attractor, as expected in the KS equation. Again, using the methods of [26], we find across batches that typically the largest positive Lyapunov exponent is $\lambda_{L}\approx.3930$, so that $1/\lambda_{L}\approx 2.545$ is the time after which we anticipate that the strange attractor fully pulls trajectories apart.
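The per-sample POD reduction described above can be sketched as follows, with the SVD providing the space/time separation; the array shapes and the omission of mean subtraction are our assumptions.

import numpy as np

def pod_time_series(U_xt, N_s=12):
    # U_xt: (K, N_t) space/time field for one KS sample (K spatial points, N_t times).
    # The SVD U_xt = Phi_space diag(s) Psi_time^T separates space and time modes.
    Phi_space, s, Psi_timeT = np.linalg.svd(U_xt, full_matrices=False)
    coeffs = (np.diag(s[:N_s]) @ Psi_timeT[:N_s]).T          # (N_t, N_s) time series of POD coefficients
    energy_fraction = np.sum(s[:N_s] ** 2) / np.sum(s ** 2)  # captured energy, cf. 97.8%-99.4% above
    return coeffs, energy_fraction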

With regards to the details of the DLHDMD, we again use $N_{C}=7000$ samples for training, and 500 for testing. The best results with regards to window size were found when we initially set $\bar{N}_{ob}=5$ and $f_{r}=.15$. The iterated reconstruction/forecast horizon determined by the choice of $N_{st}$ was chosen so that $N_{st}=(L/\pi)/\delta t\approx 14$, corresponding to the time scale over which nonlinear advection acts. Thus, reconstruction is done on each sample for values of $t$ such that $L/\pi\leq t\leq t_{f,w}$, and iterated reconstruction/forecasting is done for $t$ such that $t_{f,w}\leq t\leq t_{f,w}+L/\pi$. Note, for our initial choice of $N_{w}$, we have that initially $t_{f,w}=(L/\pi)^{4}-1.25$. The results of DLHDMD training on the $N_{s}=12$ dimensional POD reduction of the KS dynamics are shown in Figures 10, 11, and 12. Likewise, our prediction window is longer than the timescale set by $\lambda_{L}$, so we argue our forecasts are over time scales for which chaotic effects are significant. We see that the reconstructions and predictions appear accurate, as seen in Figure 11; see in particular the comparisons in Figure 12. Likewise, by choosing $f_{r}=.15$, we prevent collapse in $\mathcal{L}_{dmd}$ and keep $\bar{N}_{ob}<N_{st}$.

Figure 10: Results of the DLHDMD on the KS system after 100 epochs of training. From left to right, we see the first three dimensions of the reconstruction generated by DLHDMD, i.e. $\mathcal{D}\left(\bar{{\bf K}}_{M}{\bf K}_{a}^{N_{st}}\mathbf{\Psi}_{-}\right)$, and the plots of $\bar{N}_{ob}$ and $\mathcal{L}_{tot}$ over epochs. Again, the reconstruction is generated for times $L/\pi\leq t\leq t_{f,w}$ and forecasting is done for times $t_{f,w}\leq t\leq t_{f,w}+L/\pi$, where $t_{f,w}$ is initially $(L/\pi)^{4}-1.25$. Error plots are over both training (blue/dash) and testing (black/solid) data.
Figure 11: Results of the DLHDMD on the KS system after 100 epochs of training. The reconstruction is generated from $N_{B}=64$ testing trajectories. Moving from left to right, we plot $\mathcal{L}_{recon}$, $\mathcal{L}_{pred}$, and $\mathcal{L}_{dmd}$. Error plots are over both training (blue/dash) and testing (black/solid) data.
Figure 12: Comparison of DLHDMD results and original KS data with the log-scaled error. The error is overall quite small with peaks in the underlying solution and approximation being the source of any real measurable difference.

Overall then, we find that through careful parameter tuning, especially with regards to how one chooses $f_{r}$, we are able to generate models exhibiting high reconstruction accuracy while also providing nontrivial time horizons for the prediction of future dynamics, as measured by the difference between $\bar{N}_{ob}$ and $N_{st}$. That said, the parameter tuning process can be time consuming, and more automatic methods would need to be developed before attempting to improve the results presented in this work.

5 Mutual Information for Characterizing Embeddings

Given the success of the DLHDMD in reconstructing and forecasting dynamics along a strange attractor, especially when compared to the relative failure of trying to do the same using just the HDMD or DLDMD alone, it is of further interest to try to assess exactly what role the auto-encoder plays in improving the outcome of the HDMD. While we can certainly point to the performance of the components of the loss function $\mathcal{L}_{tot}$ to explain the impact of the encoder, this does not provide us with any more explanatory power. In [6], it was empirically shown that the role of the encoder was to generally transform time series to nearly monochromatic periodic signals, which is to say, the effect of encoding was to generate far more localized Fourier spectral representations of the original time series. This does not turn out to be the case though for the DLHDMD. Instead, inspired both by the evolving understanding of how mutual information better explains results in dynamical systems [1, 31] and machine learning [16, 17], we assess the impact of the encoder on the DLHDMD by tracking how the information across dimensions and time lags changes in the original and latent variables.

For two random variables ${\bf X}$ and ${\bf Y}$ with joint density $p({\bf X},{\bf Y})$, the mutual information (MI) between them, $I({\bf X},{\bf Y})$, is defined to be

$$I({\bf X},{\bf Y}) = \int p({\bf x},{\bf y}) \log\left(\frac{p({\bf x},{\bf y})}{p({\bf x})\,p({\bf y})}\right) d{\bf x}\, d{\bf y},$$

where $p({\bf X})$ and $p({\bf Y})$ are the affiliated marginals. One can readily show that $I({\bf X},{\bf Y}) \geq 0$, with $I({\bf X},{\bf Y}) = 0$ if and only if ${\bf X}$ and ${\bf Y}$ are independent. Thus MI gives us a stronger measure of statistical coupling between random variables than more traditional tools in time series analysis such as correlation measurements. We also note that MI is symmetric, i.e. $I({\bf X},{\bf Y}) = I({\bf Y},{\bf X})$, and that it is invariant under the action of diffeomorphisms of the variables. This invariance means we cannot expect to learn much from computing the full multidimensional MI of the original and latent variables, since no meaningful differences would appear between the two computations.
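As a concrete illustration of the quantity above, the following is a minimal sketch of a binned, plug-in estimate of $I({\bf X},{\bf Y})$ for scalar samples, written in Python with NumPy. The paper does not specify which MI estimator was actually used, so the histogram approach, the helper name mutual_information, and the bin count are all assumptions made for illustration.

```python
import numpy as np

def mutual_information(x, y, bins=32):
    """Binned (plug-in) estimate of the mutual information I(X, Y), in nats,
    for one-dimensional samples x and y. Illustrative only; the estimator and
    bin count are assumptions, not necessarily those used in the paper."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                 # joint probabilities p(x, y)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, bins)
    nz = pxy > 0                          # skip empty bins to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```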

Instead, using the $N_C = 1000$ trajectories in the test data, we define the $m$-step averaged lagged self-information (ALSI) between the $n^{th}$ and $v^{th}$ dimensions, $I_{nv}(m)$, to be

$$I_{nv}(m) = \frac{1}{N_C} \sum_{k=1}^{N_C} I\left(y_{n,\cdot,k},\, y_{v,\cdot+m,k}\right).$$

We refer to the parameter $m$ as a lag. In words then, after averaging over the ensemble of initial conditions in the test data, we compute the degree to which the signal becomes statistically independent from itself across all of the dimensions along which the dynamics evolve. We emphasize that, due to the strong nonlinearities in our dynamics, we compute the lagged MI as opposed to the more traditional auto-correlation so as to get a more accurate measure of the degree of self-dependence across dimensions in our dynamics. Further, by measuring the lagged MI across isolated dimensions, we break the invariance of MI with respect to diffeomorphisms.
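Under the same assumptions as the sketch above, the ALSI can then be computed by averaging the lagged MI estimates over the ensemble of test trajectories. The array layout below, with shape (dimension, time step, trajectory) mirroring the indexing $y_{n,\cdot,k}$, and the helper name alsi are assumptions made for illustration.

```python
def alsi(Y, n, v, m, bins=32):
    """m-step averaged lagged self-information I_nv(m) for an ensemble Y of
    trajectories, assumed stored with shape (N_dim, N_time, N_C)."""
    _, n_time, n_traj = Y.shape
    vals = [
        mutual_information(Y[n, : n_time - m, k], Y[v, m:, k], bins=bins)
        for k in range(n_traj)
    ]
    return float(np.mean(vals))

# Example: sweep lags for the (n, v) = (1, 3) pair (zero-based indices 0 and 2).
# curve = [alsi(Y_test, 0, 2, m) for m in range(1, 200)]
```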

5.1 MI for the Lorenz-63 System

The results of computing the ALSI for the Lorenz-63 system are plotted in Figure 13. As can be seen, the impact of the encoder is to either weakly attenuate the dependency between dimensions, as for $I_{33}$, or to leave the ALSI essentially unchanged, as for $I_{11}$ and $I_{22}$. We also see significant phase shifts in the lag count; see $I_{13}$ and $I_{23}$. These shifts are always to the left, towards shorter lags, so that the dependence in the latent variables decays more rapidly than in the original variables. In this sense then, the overall tendency of the encoder is to either reduce MI or cause time series to become independent more rapidly. Otherwise, the timescales of oscillation in the latent variables are essentially identical to those seen in the original variables. In terms of the DLHDMD, we might then say that the encoder assists the HDMD by generally making the rows of the affiliated Hankel matrices more independent, especially over longer time scales, and therefore more useful for generating accurate approximations of the underlying Koopman operator.

Figure 13: For the Lorenz-63 system, plots of the ALSI $I_{nv}(m)$ for $(n,v)=(1,1)$ (a), $(n,v)=(2,2)$ (b), $(n,v)=(1,2)$ (c), $(n,v)=(3,3)$ (d), $(n,v)=(1,3)$ (e), and $(n,v)=(2,3)$ (f) for both the original (red/dash) and latent (black/solid) coordinates. As can be seen, the encoder tends to reduce the ALSI along each physical dimension aside from those involving the third physical dimension, for which the ALSI is enhanced for shorter lags and decreased for longer ones.

5.2 MI for the Rossler System

When we examine the evolution of the ALSI over lags, we see in Figure 14 that the encoder is causing large and significant changes to the dynamics. In particular, when we look at the plots of $I_{12}$ and $I_{23}$, we see that the sharp transients in the ALSI for the original coordinates are removed and the overall ALSI is relatively flattened in the latent coordinates. This would seem to indicate that the slow/fast dichotomy in the Rossler dynamics is removed, so that the latent dynamics are made more uniform. Also of note is $I_{13}$, which shows that the dependency between the $\tilde{y}_1$ and $\tilde{y}_3$ axes is enhanced relative to the coupling between $y_1$ and $y_3$, and that said dependency increases with the lag.

Figure 14: For the Rossler system, plots of the ALSI $I_{nv}(m)$ for $(n,v)=(1,1)$ (a), $(n,v)=(2,2)$ (b), $(n,v)=(1,2)$ (c), $(n,v)=(3,3)$ (d), $(n,v)=(1,3)$ (e), and $(n,v)=(2,3)$ (f) for both the original (red/dash) and latent (black/solid) coordinates.

6 Conclusion and Discussion

In this work, we have developed a machine learning enhanced version of the HDMD and DLDMD which we call the DLHDMD. We have shown that its performance is significantly better than that of the HDMD alone, and when comparing against existing results in [6] we see a radical improvement over the DLDMD for the Lorenz-63 system. Likewise, we find that our method is successful across several challenging chaotic dynamical systems varying in dynamical features and size. Thus, we have an accurate, parallel approach fitting within the larger framework of Koopman operator based methods, and one which computes Koopman modes globally. Finally, our analysis of the relative information dynamics across physical dimensions in the original and latent variables provides a means of understanding the impact of the encoder network that is in line with modern thinking in machine learning; it also points towards an understanding that the HDMD is enhanced by decreasing the relative statistical dependence across physical dimensions. As explained in detail in the Introduction, a number of questions remain to be addressed, and they will certainly be the subject of future research.

7 Acknowledgements

C.W. Curtis would like to acknowledge the generous support of the Office of Naval Research and their Summer Research Faculty Program for providing the support of this project. D.J. Alford-Lago acknowledges the support of the Naval Information Warfare Center. E. Bollt was funded in part by the U.S. Army Research Office grant W911NF-16-1-0081, by the U.S. Naval Research Office, the Defense Advanced Research Projects Agency, the U.S. Air Force Research Office STTR program, and the National Institutes of Health through the CRCNS. We also wish to thank both anonymous reviewers for a number of insightful comments and questions which helped dramatically improve the paper.

8 Appendix

8.1 Models without HDMD for the Lorenz-63 System

Here, for the Lorenz-63 system, we explore the impact of fixing $\bar{N}_{ob} = 1$, which is equivalent to setting $f_r = \infty$ and thereby removing any Hankel observables from our computations. We also examine the difference between using a global EDMD versus a local one, at which point we are essentially using the DLDMD, i.e. local EDMD, algorithm presented in [6]. Parameter choices are kept the same for the sake of ready comparison. For the DLDMD model, we also explored a range of embedding dimensions $N_e = 4, 5, 6, 7$, and $12$. We found that $N_e = 6$ gave the best results in tuning experiments; that said, the worst results were found for $N_e = 12$, and the differences between the choices of embedding dimension were not striking. We can see the results of our final training and testing in Figure 15. In either the global or local case, the results are far from desirable, especially compared to what we have thus far obtained through increasing $\bar{N}_{ob}$. While the rate of training is improved for the global EDMD model, in the end there is little difference in the final outcome between the local and global methods, and both clearly struggle to learn the dynamics, especially in comparison to our results using larger values of $\bar{N}_{ob}$, which is to say using Hankel DMD.

$N_e = 6$, $\bar{N}_{ob} = 1$, Global EDMD
$N_e = 6$, $\bar{N}_{ob} = 1$, Local EDMD
Figure 15: Results of the DLHDMD on the Lorenz-63 system after 200 epochs of training for $N_e = 6$ and without updating the initial choice of $\bar{N}_{ob} = 1$, using a global EDMD (top row) and a local EDMD (bottom row). From left to right, we see the reconstruction generated by DLHDMD, i.e. $\mathcal{D}\left(\bar{{\bf K}}_{M}{\bf K}_{a}^{N_{st}}\mathbf{\Psi}_{-}\right)$, and the plot of $\mathcal{L}_{tot}$ over epochs. The reconstruction is generated for times $1 \leq t \leq t_f$ and forecasting is done for times $t_f \leq t \leq t_f + 1$, where $t_f = 20$. Error plots are over both training (blue/dash) and testing (black/solid) data. The reconstruction in the leftmost figure is generated over testing data.

References

  • [1] A. M. Fraser and H.L. Swinney. Independent coordinates for strange attractors from mutual information. Phys. Rev. A, 33:1134–1140, 1986.
  • [2] T. Sauer, J. A. Yorke, and M. Casdagli. Embedology. J. Stat. Phys., 65:579–616, 1991.
  • [3] W. Gilpin. Deep reconstruction of strange attractors from time series. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 204–216, 2020.
  • [4] B. Lusch, J. N. Kutz, and S. L. Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. Nature Comm., 9:4950, 2018.
  • [5] O. Azencot, N. B. Erichson, V. Lin, and M. Mahoney. Forecasting sequential data using consistent koopman autoencoders. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 475–485, 2020.
  • [6] D. J. Alford-Lago, C. W. Curtis, A. T. Ihler, and O. Issan. Deep learning enhanced dynamic mode decomposition. Chaos, 32:033116, 2022.
  • [7] K. Champion, B. Lusch, J. N. Kutz, and S. L. Brunton. Data-driven discovery of coordinates and governing equations. Proceedings of the National Academy of Sciences, 116(45):22445–22451, 2019.
  • [8] K. Kadierdan, J.N. Kutz, and S.L. Brunton. SINDy-PI: a robust algorithm for parallel implicit sparse identification of nonlinear dynamics. Proc. R. Soc. A, 476:20200279, 2020.
  • [9] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comp. Phys., 378:686–707, February 2019.
  • [10] Z. Li, H. Zheng, N. B. Kovachki, D. Jin, H. Chen, B. Liu, A. Stuart, K. Azizzadenesheli, and A. Anandkumar. Physics-informed neural operator for learning partial differential equations, 2021.
  • [11] S.L. Brunton, B.W. Brunton, J.L. Proctor, E. Kaiser, and J.N. Kutz. Chaos as an intermittently forced linear system. Nature Comm., 8, 2017.
  • [12] S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. PNAS, 113(15):3932–3937, 2016.
  • [13] E. Bollt. On explaining the surprising success of reservoir computing forecaster of chaos? The universal machine learning dynamical system with contrast to VAR and DMD. Chaos, 31:013108, 2021.
  • [14] D.J. Gauthier, E. Bollt, A. Griffith, and W.A.S. Barbosa. Next generation reservoir computing. Nat. Comm., 12:55674, 2021.
  • [15] H. Arbabi and I. Mezic. Ergodic theory, dynamic mode decomposition, and computation of spectral properties of the Koopman operator. SIAM Appl. Dyn. Sys., 16:2096–2126, 2017.
  • [16] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5, 2015.
  • [17] O. Calin. Deep Learning Architectures: A Mathematical Approach. Springer, Cham, 2020.
  • [18] G.F. Fasshauer. Meshfree Approximation Methods with Matlab. World Scientific, Hackensack, NJ, 2007.
  • [19] M.O. Williams, I. G. Kevrekidis, and C. W. Rowley. A data-driven approximation of the Koopman operator: extending dynamic mode decomposition. J. Nonlin. Sci., 25:1307–1346, 2015.
  • [20] M.O. Williams, C. W. Rowley, and I. G. Kevrekidis. A kernel-based method for data driven Koopman spectral analysis. J. Comp. Dyn., 2:247–265, 2015.
  • [21] F. Takens. Detecting strange attractors in turbulence. In Dynamical Systems and Turbulence, Lecture Notes in Mathematics, pages 366–381. 1981.
  • [22] D.S. Broomhead and G.P. King. Extracting qualitative dynamics from experimental data. Physica D, 20:217–236, 1986.
  • [23] A. Paszke, S. Gross, F. Massa, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • [24] R. Liaw, E. Liang, R. Nishihara, P. Moritz, et al. Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118, 2018.
  • [25] E. N. Lorenz. Predictability – a problem partly solved, pages 40–58. Cambridge University Press, 2006.
  • [26] H.D.I. Abarbanel, R. Brown, and M.B. Kennel. Local Lyapunov exponents computed from observed data. J. Nonlinear Sci., 2:343–365, 1992.
  • [27] J.C. Robinson. Infinite Dimensional Dynamical Systems. Cambridge University Press, Cambridge, UK, 2001.
  • [28] N. B. Budanur, P. Cvitanović, R. L. Davidchack, and E. Siminos. Reduction of SO(2) symmetry for spatially extended dynamical systems. Phys. Rev. Lett., 114:084102, Feb 2015.
  • [29] A.K. Kassam and L.N. Trefethen. Fourth-order time-stepping for stiff PDEs. SIAM J. Sci. Comp., 26:1214–1233, 2005.
  • [30] G. Berkooz, P. Holmes, J.L. Lumley, and C.W. Rowley. Turbulence, Coherent Structures, Dynamical Systems, and Symmetry. Cambridge University Press, Cambridge, UK, 2012.
  • [31] E.M. Bollt and N. Santitissadeekorn. Applied and Computational Measurable Dynamics. SIAM, Philadelphia, PA, 2013.