
Universal Approximation Property of
Neural Ordinary Differential Equations

Takeshi Teshima
The University of Tokyo, RIKEN
teshima@ms.k.u-tokyo.ac.jp

Koichi Tojo
RIKEN
koichi.tojo@riken.jp

Masahiro Ikeda
RIKEN
masahiro.ikeda@riken.jp

Isao Ishikawa
Ehime University, RIKEN
ishikawa.isao.zx@ehime-u.ac.jp

Kenta Oono
The University of Tokyo
kenta_oono@mist.i.u-tokyo.ac.jp
Abstract

Neural ordinary differential equations (NODEs) are an invertible neural network architecture, promising for their free-form Jacobian and the availability of a tractable Jacobian determinant estimator. Recently, the representation power of NODEs has been partly uncovered: they form an $L^p$-universal approximator for continuous maps under certain conditions. However, $L^p$-universality may fail to guarantee an approximation over the entire input domain, as it can hold even when the approximator differs largely from the target function on a small region of the input space. To further uncover the potential of NODEs, we show a stronger approximation property, namely $\sup$-universality for approximating a large class of diffeomorphisms. It is shown by leveraging a structure theorem of the diffeomorphism group, and the result complements the existing literature by establishing a fairly large set of mappings that NODEs can approximate with a stronger guarantee.

1 Introduction

Neural ordinary differential equations (NODEs) [1] are a family of deep neural networks that indirectly model functions by transforming an input vector through an ordinary differential equation (ODE). When viewed as an invertible neural network (INN) architecture, NODEs have the advantage of a free-form Jacobian, i.e., they are invertible without restricting the Jacobian's structure, unlike other INN architectures [2]. Owing to this out-of-the-box invertibility and the availability of a tractable unbiased estimator of the Jacobian determinant [3], NODEs have been used to construct continuous normalizing flows for generative modeling and density estimation [1, 3, 4].

Recently, the representation power of NODEs has been partly uncovered in [5]: namely, a sufficient condition for a family of NODEs to be an $L^p$-universal approximator (see Definition 4) for continuous maps has been established. However, the universal approximation property with respect to the $L^p$-norm can be insufficient, as it does not guarantee an approximation over the entire input domain: $L^p$-approximation may hold even when the approximator differs largely from the target function on a small region of the input space.

In this work, we elucidate that NODEs are a $\sup$-universal approximator (Definition 4) for a fairly large class of diffeomorphisms, i.e., smooth invertible maps with smooth inverses. Our result establishes a function class that NODEs can approximate with a stronger guarantee than in the existing literature [5]. We prove the result by using a structure theorem of differential geometry to represent a diffeomorphism as a finite composition of flow endpoints, i.e., diffeomorphisms that are smooth transformations of the identity map. NODEs are themselves examples of flow endpoints, and we derive the main result by approximating the flow endpoints by NODEs.

2 Preliminaries and goal

In this section, we define the family of NODEs considered in the present paper as well as the notion of universality.

2.1 Neural ordinary differential equations (NODEs)

Let $\mathbb{R}$ (resp. $\mathbb{N}$) denote the set of all real numbers (resp. all positive integers). Throughout the paper, we fix $d \in \mathbb{N}$. Let $\operatorname{Lip}(\mathbb{R}^d) := \{f \colon \mathbb{R}^d \to \mathbb{R}^d \mid f \text{ is Lipschitz continuous}\}$. It is known that any autonomous ODE (i.e., one defined by a time-invariant vector field) with a Lipschitz continuous vector field has a solution and that the solution is unique:

Fact 1 (Existence and uniqueness of a global solution to an ODE [6]).

Let $f \in \operatorname{Lip}(\mathbb{R}^d)$. Then, a solution $z \colon \mathbb{R} \to \mathbb{R}^d$ to the following ordinary differential equation exists and is unique:

\[ z(0) = \bm{x}, \quad \dot{z}(t) = f(z(t)), \quad t \in \mathbb{R}, \tag{1} \]

where $\bm{x} \in \mathbb{R}^d$, and $\dot{z}$ denotes the derivative of $z$.

In view of Fact 1, we use the following notation.

Definition 1.

For $f \in \operatorname{Lip}(\mathbb{R}^d)$, $\bm{x} \in \mathbb{R}^d$, and $t \in \mathbb{R}$, we define

\[ \mathrm{IVP}[f](\bm{x}, t) := z(t), \]

where $z \colon \mathbb{R} \to \mathbb{R}^d$ is the unique solution to Equation (1).
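
As a concrete illustration of Definition 1, the following minimal Python sketch (our own, not part of the paper; the vector field and the use of SciPy's solver are illustrative choices) numerically evaluates $\mathrm{IVP}[f](\bm{x}, 1)$, i.e., a single NODE block.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(z):
    # An example Lipschitz vector field on R^2 (tanh is 1-Lipschitz in each coordinate).
    return np.tanh(z[::-1])

def ivp(vector_field, x, t):
    """Numerically approximate IVP[f](x, t) by integrating dz/ds = f(z), z(0) = x."""
    sol = solve_ivp(lambda s, z: vector_field(z), (0.0, t), x, rtol=1e-9, atol=1e-9)
    return sol.y[:, -1]

x = np.array([0.5, -1.0])
print(ivp(f, x, 1.0))  # IVP[f](x, 1): the endpoint of the flow, i.e., one NODE block
```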

Definition 2 (Autonomous-ODE flow endpoints; [5]).

For $\mathcal{F} \subset \operatorname{Lip}(\mathbb{R}^d)$, we define

\[ \Psi(\mathcal{F}) := \{\mathrm{IVP}[f](\cdot, 1) \mid f \in \mathcal{F}\}. \]
Definition 3 ($\mathrm{INN}_{\mathcal{H}\text{-}\mathrm{NODE}}$).

Let $\mathrm{Aff}$ denote the group of all invertible affine maps on $\mathbb{R}^d$, and let $\mathcal{H} \subset \operatorname{Lip}(\mathbb{R}^d)$. Define the invertible neural network architecture based on NODEs as

\[ \mathrm{INN}_{\mathcal{H}\text{-}\mathrm{NODE}} := \{W \circ \psi_k \circ \cdots \circ \psi_1 \mid \psi_1, \ldots, \psi_k \in \Psi(\mathcal{H}),\ W \in \mathrm{Aff},\ k \in \mathbb{N}\}. \]
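
The following hedged sketch (again our own illustration; the vector fields and the affine map are arbitrary stand-ins for elements of $\mathcal{H}$ and $\mathrm{Aff}$) composes two numerically integrated flow endpoints with an invertible affine map, mirroring the structure of $\mathrm{INN}_{\mathcal{H}\text{-}\mathrm{NODE}}$.

```python
import numpy as np
from scipy.integrate import solve_ivp

def node_block(vector_field):
    """Return psi = IVP[vector_field](., 1), realized by a numerical ODE solver."""
    def psi(x):
        sol = solve_ivp(lambda t, z: vector_field(z), (0.0, 1.0), x, rtol=1e-9, atol=1e-9)
        return sol.y[:, -1]
    return psi

# Two stand-in Lipschitz vector fields playing the role of elements of H.
f1 = lambda z: np.tanh(z)
f2 = lambda z: 0.5 * np.sin(z)

# An invertible affine map W(x) = A x + b (det A != 0).
A = np.array([[1.0, 0.3], [0.0, 2.0]])
b = np.array([0.1, -0.2])

def inn_node(x):
    psi1, psi2 = node_block(f1), node_block(f2)
    return A @ psi2(psi1(x)) + b   # W o psi_2 o psi_1

print(inn_node(np.array([0.5, -1.0])))
```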

2.2 Goal: the notions of universality and their relations

Here, we define the notions of universality. Let $m, n \in \mathbb{N}$. For a subset $K \subset \mathbb{R}^m$ and a map $f \colon K \to \mathbb{R}^n$, we define $\|f\|_{\sup, K} := \sup_{x \in K} \|f(x)\|$, where $\|\cdot\|$ denotes the Euclidean norm. Also, for a measurable map $f \colon \mathbb{R}^m \to \mathbb{R}^n$, a subset $K \subset \mathbb{R}^m$, and $p \in [1, \infty)$, we define $\|f\|_{p, K} := \left(\int_K \|f(x)\|^p \, dx\right)^{1/p}$.

Definition 4 ($\sup$-universality and $L^p$-universality).

Let $\mathcal{M}$ be a model, i.e., a set of measurable mappings from $\mathbb{R}^m$ to $\mathbb{R}^n$. Let $\mathcal{F}$ be a set of measurable mappings $f \colon U_f \to \mathbb{R}^n$, where $U_f$ is a measurable subset of $\mathbb{R}^m$ that may depend on $f$. We say that $\mathcal{M}$ is a $\sup$-universal approximator, or has the $\sup$-universal approximation property, for $\mathcal{F}$ if for any $f \in \mathcal{F}$, any $\varepsilon > 0$, and any compact subset $K \subset U_f$, there exists $g \in \mathcal{M}$ such that $\|f - g\|_{\sup, K} < \varepsilon$. The $L^p$-universal approximation property is defined by replacing $\|\cdot\|_{\sup, K}$ with $\|\cdot\|_{p, K}$ in the above.
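
To illustrate the gap between the two notions, the following toy computation (our own example, not from the paper) uses an approximator that fails only on a narrow spike: its $L^1$ error on $K = [0, 1]$ is tiny, while its $\sup$ error remains of order one.

```python
import numpy as np

f = lambda x: x                              # target: the identity on [0, 1]

def g(x, width=1e-3):
    # Approximator that agrees with f except on a spike of height 1 around x = 0.5.
    return x + np.maximum(0.0, 1.0 - np.abs(x - 0.5) / width)

xs = np.linspace(0.0, 1.0, 2_000_001)        # fine uniform grid on K = [0, 1]
diff = np.abs(f(xs) - g(xs))
print("sup-norm on K :", diff.max())         # ~ 1.0  (sup-approximation fails)
print("L^1-norm on K :", diff.mean())        # ~ 1e-3 (Riemann-sum estimate; L^1 error is tiny)
```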

Our goal.

Our goal is to elucidate the representation power of INNs composed of NODEs by proving the $\sup$-universality of $\mathrm{INN}_{\mathcal{H}\text{-}\mathrm{NODE}}$ for a fairly large class of diffeomorphisms, i.e., smooth invertible maps with smooth inverses.

3 Main result

In this section, we present our main result, Theorem 1.

First, we define the following class of invertible maps, which will be our target to be approximated.

Definition 5 ($C^2$-diffeomorphisms: $\mathcal{D}^2$).

We define $\mathcal{D}^2$ as the set of all $C^2$-diffeomorphisms $f \colon U_f \to \mathrm{Im}(f) \subset \mathbb{R}^d$, where $U_f \subset \mathbb{R}^d$ is open and $C^2$-diffeomorphic to $\mathbb{R}^d$ and may depend on $f$.

The set $\mathcal{D}^2$ is a fairly large class: it contains any $C^2$-diffeomorphism defined on the entire space $\mathbb{R}^d$, on an open convex set, or, more generally, on a star-shaped open set.

Now, we state our main result, which establishes a class of maps that invertible neural networks based on NODEs can approximate with respect to the $\sup$-norm.

Theorem 1 (Universality of NODEs).

Assume $\mathcal{H} \subset \operatorname{Lip}(\mathbb{R}^d)$ is a $\sup$-universal approximator for $\operatorname{Lip}(\mathbb{R}^d)$. Then, $\mathrm{INN}_{\mathcal{H}\text{-}\mathrm{NODE}}$ is a $\sup$-universal approximator for $\mathcal{D}^2$.

Examples of $\mathcal{H}$ include multi-layer perceptrons with finite weights and Lipschitz-continuous activation functions such as the rectified linear unit (ReLU) [7, 1], as well as the Lipschitz networks of [8, Theorem 3].

Proof outline.

To prove Theorem 1, we take a similar strategy to that of Theorem 1 of [9], but with a major modification to adapt it to our problem. First, the approximation target is reduced from $\mathcal{D}^2$ to the set of compactly supported diffeomorphisms from $\mathbb{R}^d$ to $\mathbb{R}^d$, denoted by $\mathrm{Diff}^2_{\mathrm{c}}$, by applying Fact 2 in Appendix A.1. Then, we show that each $f \in \mathrm{Diff}^2_{\mathrm{c}}$ can be represented as a finite composition of flow endpoints (Definition 7 in Appendix A.1), each of which can be approximated by a NODE. The decomposition of $f$ into flow endpoints relies on a structure theorem for $\mathrm{Diff}^2_{\mathrm{c}}$ (Fact 4 in Appendix A.1) attributed to Herman, Thurston [10], Epstein [11], and Mather [12, 13]. Note that we require a definition of flow endpoints (Definition 7 in Appendix A.1) different from that employed in [9, Corollary 2] in order to ensure sufficient smoothness of the underlying flows.

4 Related work and discussion

In this section, we overview the existing literature on the representation power of NODEs to provide the context of the present paper.

$L^p$-universal approximation property of NODEs.

[5] considered NODEs capped with a terminal family mapping the output of the NODE to a vector of the desired output dimension, and its Proposition 3.8 showed that this model class is $L^p$-universal for the set of all continuous maps from $\mathbb{R}^d$ to $\mathbb{R}^n$ ($n \in \mathbb{N}$) under a certain sufficient condition. Compared with our result, the result of [5] established the universality of NODEs for a larger target function class (namely, continuous maps) with a weaker notion of approximation (namely, $L^p$-universality).

Limitations on the representation power of NODEs.

[14] formulated its Theorem 1 to show that NODEs are not universal approximators by presenting a function that a NODE cannot approximate. The existence of this counterexample does not contradict our result because our approximation target $\mathcal{D}^2$ differs from the function class considered in [14]: the class in [14] can contain discontinuous maps, whereas the elements of $\mathcal{D}^2$ are smooth and invertible.

Universality of augmented NODEs.

As a device to enhance the representation power of NODEs, increasing the dimensionality and zero-padding the inputs/outputs has been explored [15, 14]. [14] showed that augmented NODEs (ANODEs) are universal approximators for homeomorphisms. This approach has the limitation that it can undermine the invertibility of the model: unless the model is ideally trained so that it always outputs zeros in the zero-padded dimensions, it can no longer represent an invertible map on the original dimensionality. In contrast, the present work explores the universal approximation property of NODEs achieved without the complications arising from dimensionality augmentation.

Relation between $\mathrm{INN}_{\mathcal{H}\text{-}\mathrm{NODE}}$ and time-dependent NODEs.

Our result can be readily extended to the design of NODEs in which the time index is an argument of $f$. This is done by restricting attention to the time-invariant elements of the considered class of $f$, as follows. Let $a \in (0, \infty]$ and let $\tilde{f} \colon \mathbb{R}^d \times (-a, a) \to \mathbb{R}^d$ be such that there exists a continuous function $\ell \colon (-a, a) \to \mathbb{R}_{\geq 0}$ satisfying

\[ \|\tilde{f}(\bm{x}_1, t) - \tilde{f}(\bm{x}_2, t)\| \leq \ell(t) \|\bm{x}_1 - \bm{x}_2\|. \]

Then, the initial value problem

\[ z(0) = \bm{x}, \quad \dot{z}(t) = \tilde{f}(z(t), t), \quad t \in (-a, a) \]

has a solution $z \colon (-a, a) \to \mathbb{R}^d$ and it is unique [6], analogously to Fact 1. Given a set $\tilde{\mathcal{H}}$ of such mappings $\tilde{f}$, we can consider its subset $\mathcal{H}$ containing only the time-invariant elements, i.e., $\mathcal{H} \subset \tilde{\mathcal{H}}$ such that for any $f \in \mathcal{H}$ and any $\bm{x} \in \mathbb{R}^d$, $f(\bm{x}, \cdot)$ is a constant mapping. Such an $f$, identified with a map on $\mathbb{R}^d$, is an element of $\operatorname{Lip}(\mathbb{R}^d)$ with $\inf_{t \in (-a, a)} \ell(t) \geq 0$ serving as a Lipschitz constant. Then, we can apply Theorem 1 to $\mathcal{H}$ and the induced $\mathrm{INN}_{\mathcal{H}\text{-}\mathrm{NODE}}$.

5 Conclusion

In this paper, we uncovered the $\sup$-universality of INNs composed of NODEs for approximating a large class of diffeomorphisms. This result complements the existing literature, which showed a weaker approximation property of NODEs, namely $L^p$-universality, for general continuous maps. Whether the $\sup$-universality holds for a larger class of maps than $\mathcal{D}^2$ is an important question for future work. It is also important to quantitatively evaluate how many layers of NODEs are required to approximate a given diffeomorphism with a specified smoothness, such as a bi-Lipschitz constant, in order to assess the efficiency of the approximation.

Acknowledgments and Disclosure of Funding

The authors would like to thank the anonymous reviewers for the insightful discussions. This work was supported by RIKEN Junior Research Associate Program. TT was supported by Masason Foundation. II and MI were supported by CREST:JPMJCR1913.

References

  • [1] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt and David K. Duvenaud “Neural Ordinary Differential Equations” In Advances in Neural Information Processing Systems 31, Curran Associates, Inc., 2018, pp. 6571–6583 URL: http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf
  • [2] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed and Balaji Lakshminarayanan “Normalizing Flows for Probabilistic Modeling and Inference” In arXiv:1912.02762 [cs, stat], 2019 arXiv: http://arxiv.org/abs/1912.02762
  • [3] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever and David Duvenaud “FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models” In 7th International Conference on Learning Representations, New Orleans, LA, USA: OpenReview.net, 2019 URL: https://openreview.net/forum?id=rJxgknCcK7
  • [4] Chris Finlay, Jörn-Henrik Jacobsen, Levon Nurbekyan and Adam M. Oberman “How to Train Your Neural ODE: The World of Jacobian and Kinetic Regularization” In Proceedings of the 37th International Conference on Machine Learning, 2020
  • [5] Qianxiao Li, Ting Lin and Zuowei Shen “Deep Learning via Dynamical Systems: An Approximation Perspective” In arXiv:1912.10382 [cs, math, stat], 2020 arXiv: http://arxiv.org/abs/1912.10382
  • [6] W. Derrick and L. Janos “A Global Existence and Uniqueness Theorem for Ordinary Differential Equations” In Canadian Mathematical Bulletin 19.1, 1976, pp. 105–107 DOI: 10.4153/CMB-1976-015-7
  • [7] Yann LeCun, Yoshua Bengio and Geoffrey Hinton “Deep Learning” In Nature 521.7553, 2015, pp. 436–444 URL: http://www.nature.com/articles/nature14539
  • [8] Cem Anil, James Lucas and Roger Grosse “Sorting out Lipschitz Function Approximation” In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2019, pp. 291–301 URL: http://proceedings.mlr.press/v97/anil19a.html
  • [9] Takeshi Teshima, Isao Ishikawa, Koichi Tojo, Kenta Oono, Masahiro Ikeda and Masashi Sugiyama “Coupling-Based Invertible Neural Networks Are Universal Diffeomorphism Approximators” In Advances in Neural Information Processing Systems 33, in press
  • [10] William Thurston “Foliations and Groups of Diffeomorphisms” In Bulletin of the American Mathematical Society 80.2, 1974, pp. 304–307 URL: https://projecteuclid.org:443/euclid.bams/1183535407
  • [11] D. B. A. Epstein “The Simplicity of Certain Groups of Homeomorphisms” In Compositio Mathematica 22.2, 1970, pp. 165–173
  • [12] John N. Mather “Commutators of diffeomorphisms” In Commentarii mathematici Helvetici 49.1, 1974, pp. 512–528 URL: https://eudml.org/doc/139598
  • [13] John N. Mather “Commutators of Diffeomorphisms: II” In Commentarii Mathematici Helvetici 50.1, 1975, pp. 33–40 DOI: 10.1007/BF02565731
  • [14] Han Zhang, Xi Gao, Jacob Unterman and Tomasz Arodz “Approximation Capabilities of Neural ODEs and Invertible Residual Networks” In Proceedings of the 37th International Conference on Machine Learning 119 Vienna, Austria: PMLR, 2020 URL: https://proceedings.icml.cc/paper/2020/hash/c32d9bf27a3da7ec8163957080c8628e
  • [15] Emilien Dupont, Arnaud Doucet and Yee Whye Teh “Augmented Neural ODEs” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019, pp. 3140–3150 URL: http://papers.nips.cc/paper/8577-augmented-neural-odes.pdf
  • [16] Serge Lang “Differential Manifolds” New York: Springer-Verlag, 1985 DOI: 10.1007/978-1-4684-0265-0
  • [17] Philip Hartman “Ordinary Differential Equations” 38, Classics in Applied Mathematics Society for Industrial and Applied Mathematics, 2002
  • [18] T. H. Gronwall “Note on the Derivatives with Respect to a Parameter of the Solutions of a System of Differential Equations” In Annals of Mathematics 20.4, 1919, pp. 292–296 DOI: 10.2307/1967124

This is the Supplementary Material for “Universal approximation property of neural ordinary differential equations.” Table 1 summarizes the abbreviations and the symbols used in the paper.

Table 1: Abbreviation and notation table.
Abbreviation/Notation Meaning
INN Invertible neural networks
NODE Neural ordinary differential equations
$\mathrm{Aff}$ Set of invertible affine transformations on $\mathbb{R}^d$
$\mathrm{IVP}[f](\bm{x}, t)$ The (unique) solution to the initial value problem defined by $f$, evaluated at $t$
$\Psi(\mathcal{F})$ Set of NODEs obtained from the Lipschitz continuous vector fields $\mathcal{F}$
$\operatorname{Lip}(\mathbb{R}^d)$ Set of all Lipschitz continuous maps from $\mathbb{R}^d$ to $\mathbb{R}^d$
$\mathrm{INN}_{\mathcal{H}\text{-}\mathrm{NODE}}$ INNs composed of $\mathrm{Aff}$ and NODEs parametrized by $\mathcal{H} \subset \operatorname{Lip}(\mathbb{R}^d)$
$d \in \mathbb{N}$ Dimensionality of the Euclidean space under consideration
$\mathcal{D}^2$ Set of all $C^2$-diffeomorphisms with $C^2$-diffeomorphic domains
$\mathrm{Diff}^r_{\mathrm{c}}$ Group of compactly supported $C^r$-diffeomorphisms on $\mathbb{R}^d$ ($1 \leq r \leq \infty$)
$\|\cdot\|$ Euclidean norm
$\|\cdot\|_{\mathrm{op}}$ Operator norm
$\|\cdot\|_{\sup, K}$ Supremum norm on a subset $K \subset \mathbb{R}^d$
$\|\cdot\|_{p, K}$ $L^p$-norm on a subset $K \subset \mathbb{R}^d$
$\mathrm{Id}$ Identity map
$\mathrm{supp}$ Support of a map

Appendix A Proof of Theorem 1

Here, we provide a proof of Theorem 1. In Section A.1, we state the known facts and prove the lemmas used in the proof. In Section A.2, we prove Theorem 1.

A.1 Lemmas and known facts

We use the following definition and facts from [9].

Definition 6 (Compactly supported diffeomorphism).

We use $\mathrm{Diff}^r_{\mathrm{c}}$ to denote the set of all compactly supported $C^r$-diffeomorphisms ($1 \leq r \leq \infty$) from $\mathbb{R}^d$ to $\mathbb{R}^d$. Here, we say a diffeomorphism $f$ on $\mathbb{R}^d$ is compactly supported if there exists a compact subset $K \subset \mathbb{R}^d$ such that $f(x) = x$ for any $x \notin K$. We regard $\mathrm{Diff}^r_{\mathrm{c}}$ as a group whose group operation is function composition.

The following fact enables us to reduce the approximation problem for $\mathcal{D}^2$ to that for $\mathrm{Diff}^2_{\mathrm{c}}$.

Fact 2 (Lemma 5 of [9]).

Let $f \colon U \to \mathbb{R}^d$ be an element of $\mathcal{D}^2$, and let $K \subset U$ be a compact set. Then, there exist $h \in \mathrm{Diff}^2_{\mathrm{c}}$ and an affine transform $W \in \mathrm{Aff}$ such that

\[ W \circ h|_K = f|_K. \]

The following fact enables component-wise approximation: given a transformation represented as a composition of several transformations, we can approximate it by approximating each constituent and composing the approximations.

Fact 3 (Compatibility of composition and approximation; Proposition 6 of [9]).

Let $\mathcal{M}$ be a set of locally bounded maps from $\mathbb{R}^d$ to $\mathbb{R}^d$, and let $F_1, \dots, F_k$ be continuous maps from $\mathbb{R}^d$ to $\mathbb{R}^d$. Assume that for any $\varepsilon > 0$ and any compact set $K \subset \mathbb{R}^d$, there exist $\widetilde{G}_1, \dots, \widetilde{G}_k \in \mathcal{M}$ such that $\|F_i - \widetilde{G}_i\|_{\sup, K} < \varepsilon$ for $1 \leq i \leq k$. Then, for any $\varepsilon > 0$ and any compact set $K \subset \mathbb{R}^d$, there exist $G_1, \dots, G_k \in \mathcal{M}$ such that

\[ \|F_k \circ \cdots \circ F_1 - G_k \circ \cdots \circ G_1\|_{\sup, K} < \varepsilon. \]

The following fact is attributed to Herman, Thurston [10], Epstein [11], and Mather [12, 13]. See Fact 2 of [9] and the remarks therein for details. Let $\mathrm{Id}$ denote the identity map.

Fact 4 (Fact 2 of [9]).

If $r \neq d + 1$, the group $\mathrm{Diff}^r_{\mathrm{c}}$ is simple, i.e., any normal subgroup $H \subset \mathrm{Diff}^r_{\mathrm{c}}$ is either $\{\mathrm{Id}\}$ or $\mathrm{Diff}^r_{\mathrm{c}}$.

Next, we define a subset of $\mathrm{Diff}^r_{\mathrm{c}}$ called the flow endpoints. In Lemma 1, it is shown that the set of flow endpoints generates a non-trivial normal subgroup of $\mathrm{Diff}^r_{\mathrm{c}}$. By Fact 4, this subgroup must be all of $\mathrm{Diff}^r_{\mathrm{c}}$ (for $r \neq d+1$), so any element of $\mathrm{Diff}^r_{\mathrm{c}}$ can be represented as a finite composition of flow endpoints; each flow endpoint can be approximated by a NODE, and the compositions are then handled by Fact 3.

While Corollary 2 of [9] also defined a set of flow endpoints in $\mathrm{Diff}^2_{\mathrm{c}}$, it differs from the one defined here, which is tailored to our purpose. The two definitions can be interpreted as describing two different generating sets of the same group $\mathrm{Diff}^2_{\mathrm{c}}$. Let $\mathrm{supp}$ denote the support of a map.

Definition 7 (Flow endpoints $S^r$ in $\mathrm{Diff}^r_{\mathrm{c}}$).

Let $1 \leq r \leq \infty$. Let $S^r \subset \mathrm{Diff}^r_{\mathrm{c}}$ be the set of diffeomorphisms $g$ of the form $g(\bm{x}) = \Phi(\bm{x}, 1)$ for some map $\Phi \colon \mathbb{R}^d \times U \to \mathbb{R}^d$ such that

  • $U \subset \mathbb{R}$ is an open interval containing $[0, 1]$,

  • $\Phi(\bm{x}, 0) = \bm{x}$,

  • $\Phi(\cdot, t) \in \mathrm{Diff}^r_{\mathrm{c}}$ for any $t \in U$,

  • $\Phi(\bm{x}, s + t) = \Phi(\Phi(\bm{x}, s), t)$ for any $s, t \in U$ with $s + t \in U$,

  • $\Phi$ is $C^r$ on $\mathbb{R}^d \times U$,

  • there exists a compact subset $K_\Phi \subset \mathbb{R}^d$ such that $\bigcup_{t \in U} \mathrm{supp}\,\Phi(\cdot, t) \subset K_\Phi$.

The difference between Definition 7 and the one in Corollary 2 of [9] mainly lies in the last two conditions. Technically, these two conditions are used in Section A.2 to show that the partial derivative of $\Phi$ in $t$ at $t = 0$ is Lipschitz continuous.

Lemma 1 (Modified Corollary 2 of [9]).

Let $1 \leq r \leq \infty$ and let $S^r \subset \mathrm{Diff}^r_{\mathrm{c}}$ be the set of all flow endpoints. Then, the subset $H^r$ of $\mathrm{Diff}^r_{\mathrm{c}}$ defined by

\[ H^r := \{g_1 \circ \cdots \circ g_n \mid n \geq 1,\ g_1, \dots, g_n \in S^r\} \]

forms a subgroup of $\mathrm{Diff}^r_{\mathrm{c}}$, and it is a non-trivial normal subgroup.

Proof of Lemma 1.

First, we prove that $H^r$ forms a subgroup of $\mathrm{Diff}^r_{\mathrm{c}}$. By definition, for any $g, h \in H^r$, it holds that $g \circ h \in H^r$. Also, $H^r$ is closed under inversion; to see this, it suffices to show that $S^r$ is closed under inversion. Let $g = \Phi(\cdot, 1) \in S^r$. Consider the map $\phi \colon \mathbb{R}^d \times U \to \mathbb{R}^d$ defined by $\phi(\cdot, t) := \Phi^{-1}(\cdot, t)$. It is easy to confirm that $\phi$ satisfies the conditions of Definition 7; hence $g^{-1} = \phi(\cdot, 1)$ is an element of $S^r$. Note that $\phi$ is confirmed to be $C^r$ on $\mathbb{R}^d \times U$ by applying the inverse function theorem to $(t, \bm{x}) \mapsto (t, \Phi(\bm{x}, t))$.

Next, we prove that $H^r$ is normal. To show that the subgroup generated by $S^r$ is normal, it suffices to show that $S^r$ is closed under conjugation. Take any $g \in S^r$ and $h \in \mathrm{Diff}^r_{\mathrm{c}}$, and let $\Phi$ be a flow associated with $g$. Then, the map $\Phi' \colon \mathbb{R}^d \times U \to \mathbb{R}^d$ defined by $\Phi'(\cdot, s) := h^{-1} \circ \Phi(\cdot, s) \circ h$ is a flow associated with $h^{-1} \circ g \circ h$ satisfying the conditions in Definition 7, which implies $h^{-1} \circ g \circ h \in S^r$, i.e., $S^r$ is closed under conjugation.

Next, we prove that $H^r$ is non-trivial by constructing an element of $S^r$ that is not the identity. First, consider the case $d = 1$. Let $\tilde{v} \colon \mathbb{R} \to \mathbb{R}_{\geq 0}$ be a non-constant $C^\infty$-function such that $\mathrm{supp}\,\tilde{v} \subset [0, 1]$ and $\tilde{v}^{(k)}(0) = 0$ for any $k \in \mathbb{N}$. Then define $v \colon \mathbb{R} \to \mathbb{R}$ by

\[ v(x) = \begin{cases} \tilde{v}(|x|)\dfrac{x}{|x|} & \text{if } x \neq 0, \\ 0 & \text{if } x = 0, \end{cases} \]

which is a $C^\infty$-function on $\mathbb{R}$ with compact support. Since $v$ is Lipschitz continuous and $C^\infty$, $\mathrm{IVP}[v]$ exists and is a $C^\infty$-function on $\mathbb{R} \times \mathbb{R}$; see Fact 1 and [17, Chapter V, Corollary 4.1]. Let $K_v \subset \mathbb{R}$ be a compact subset that contains $\mathrm{supp}\,v$. Then, by considering the ordinary differential equation by which $\mathrm{IVP}[v]$ is defined, we see that $\bigcup_{t \in \mathbb{R}} \mathrm{supp}\,\mathrm{IVP}[v](\cdot, t) \subset K_v$ and that $\mathrm{IVP}[v](x, 0) = x$. We also have $\mathrm{IVP}[v](x, s + t) = \mathrm{IVP}[v](\mathrm{IVP}[v](x, s), t)$ for any $s, t \in \mathbb{R}$; in particular, $\mathrm{IVP}[v](\cdot, s)^{-1} = \mathrm{IVP}[v](\cdot, -s)$ for any $s \in \mathbb{R}$. Therefore, $\mathrm{IVP}[v](\cdot, 1) \in S^r$. Since $v \not\equiv 0$, $\mathrm{IVP}[v](\cdot, 1)$ is not the identity map, and thus $S^r$ is non-trivial. Next, we consider the case $d \geq 2$. Take a $C^\infty$-function $\phi \colon \mathbb{R} \to \mathbb{R}$ with $\mathrm{supp}\,\phi = [1, 2]$ and a nonzero skew-symmetric matrix $A$ (i.e., $A^\top = -A$) of size $d$, and let $X(x) := \phi(\|x\|) A$. We define a $C^\infty$-map $\Phi \colon \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$ by

\[ \Phi(x, t) := \exp(t X(x)) x. \]

Since $\exp(t X(x))$ is an orthogonal matrix for any $t \in \mathbb{R}$ and $x \in \mathbb{R}^d$, $\Phi$ is a $C^\infty$-flow on $\mathbb{R}^d$. Now, it is enough to show that there exists a compact set $K_\Phi \subset \mathbb{R}^d$ satisfying $\bigcup_{t \in \mathbb{R}} \mathrm{supp}\,\Phi(\cdot, t) \subset K_\Phi$. Let $K_\Phi := \{x \in \mathbb{R}^d \mid \|x\| \leq 2\}$. Then, the inclusion $\mathrm{supp}\,\Phi(\cdot, t) \subset K_\Phi$ holds for any $t \in \mathbb{R}$ since $X(x) = 0$ for $x \in \mathbb{R}^d \setminus K_\Phi$. ∎
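
The construction for $d \geq 2$ can be checked numerically. The sketch below (our own; it uses a standard bump function for $\phi$ and a $2 \times 2$ rotation generator for $A$, both illustrative choices) verifies the flow property $\Phi(\Phi(x, s), t) = \Phi(x, s + t)$ and that points outside $K_\Phi$ are fixed.

```python
import numpy as np
from scipy.linalg import expm

def bump(r):
    # A C^infinity bump with support [1, 2]; any such bump works for phi.
    return np.exp(-1.0 / ((r - 1.0) * (2.0 - r))) if 1.0 < r < 2.0 else 0.0

A = np.array([[0.0, 1.0], [-1.0, 0.0]])      # nonzero skew-symmetric matrix (d = 2)

def Phi(x, t):
    X = bump(np.linalg.norm(x)) * A          # X(x) = phi(||x||) A
    return expm(t * X) @ x

x, s, t = np.array([1.3, 0.4]), 0.7, -0.4
print(np.allclose(Phi(Phi(x, s), t), Phi(x, s + t)))             # True: flow additivity
print(np.allclose(Phi(np.array([2.5, 0.0]), 1.0), [2.5, 0.0]))   # True: fixed outside K_Phi
```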

The following lemma allows us to approximate an autonomous-ODE flow endpoint by approximating its defining vector field. See Definition 2 for the definition of $\Psi(\cdot)$.

Lemma 2 (Approximation of Autonomous-ODE flow endpoints).

Assume $\mathcal{H} \subset \operatorname{Lip}(\mathbb{R}^d)$ is a $\sup$-universal approximator for $\operatorname{Lip}(\mathbb{R}^d)$. Then, $\Psi(\mathcal{H})$ is a $\sup$-universal approximator for $\Psi(\operatorname{Lip}(\mathbb{R}^d))$.

Proof.

Let $\phi \in \Psi(\operatorname{Lip}(\mathbb{R}^d))$. Then, by definition, there exists $F \in \operatorname{Lip}(\mathbb{R}^d)$ such that $\phi = \mathrm{IVP}[F](\cdot, 1)$. Let $L_F$ denote the Lipschitz constant of $F$. In the following, we approximate $\mathrm{IVP}[F](\cdot, 1)$ by approximating $F$ with an element of $\mathcal{H}$.

Let $\varepsilon > 0$, and let $K \subset \mathbb{R}^d$ be a compact subset. We show that there exists $f \in \mathcal{H}$ such that $\|\mathrm{IVP}[F](\cdot, 1) - \mathrm{IVP}[f](\cdot, 1)\|_{\sup, K} < \varepsilon$. Note that $\mathrm{IVP}[f](\cdot, \cdot)$ is well-defined because $\mathcal{H} \subset \operatorname{Lip}(\mathbb{R}^d)$. Define

\[ K' := \left\{\bm{x} \in \mathbb{R}^d \ \middle|\ \inf_{\bm{y} \in \mathrm{IVP}[F](K, [0,1])} \|\bm{x} - \bm{y}\| \leq 2 e^{L_F}\right\}. \]

Then, $K'$ is compact. This follows from the compactness of $\mathrm{IVP}[F](K, [0,1])$: (i) $K'$ is bounded since $\mathrm{IVP}[F](K, [0,1])$ is bounded, and (ii) it is closed since the function $\bm{x} \mapsto \min_{\bm{y} \in \mathrm{IVP}[F](K, [0,1])} \|\bm{x} - \bm{y}\|$ is continuous, and hence $K'$ is the preimage of the closed interval $[0, 2 e^{L_F}]$ under a continuous map.

Since $\mathcal{H}$ is assumed to be a $\sup$-universal approximator for $\operatorname{Lip}(\mathbb{R}^d)$, for any $\delta > 0$ we can take $f \in \mathcal{H}$ such that $\|f - F\|_{\sup, K'} < \delta$. Let $\delta$ be such that $0 < \delta < \min\{\varepsilon / (2 e^{L_F}), 1\}$, and take such an $f$.

Fix $\bm{x}_0 \in K$ and define $\Delta_{\bm{x}_0}(t) := \|\mathrm{IVP}[F](\bm{x}_0, t) - \mathrm{IVP}[f](\bm{x}_0, t)\|$. Let $B := \delta e^{L_F}$. We show that

\[ \Delta_{\bm{x}_0}(t) < 2B \]

holds for all $t \in [0, 1]$. We prove this by contradiction. Suppose that there exists $t'$ for which the inequality does not hold. Then, the set $\mathcal{T} := \{t \in [0, 1] \mid \Delta_{\bm{x}_0}(t) \geq 2B\}$ is non-empty, and thus $\tau := \inf \mathcal{T} \in [0, 1]$. For this $\tau$, we show both $\Delta_{\bm{x}_0}(\tau) \leq B$ and $\Delta_{\bm{x}_0}(\tau) \geq 2B$, which is a contradiction. First, we have

\begin{align*}
\Delta_{\bm{x}_0}(\tau) &= \left\|\mathrm{IVP}[F](\bm{x}_0, \tau) - \mathrm{IVP}[f](\bm{x}_0, \tau)\right\| \\
&= \left\|\bm{x}_0 + \int_0^\tau F(\mathrm{IVP}[F](\bm{x}_0, t))\,dt - \bm{x}_0 - \int_0^\tau f(\mathrm{IVP}[f](\bm{x}_0, t))\,dt\right\| \\
&\leq \left\|\int_0^\tau \bigl(F(\mathrm{IVP}[F](\bm{x}_0, t)) - F(\mathrm{IVP}[f](\bm{x}_0, t))\bigr)\,dt\right\| \\
&\qquad + \left\|\int_0^\tau \bigl(F(\mathrm{IVP}[f](\bm{x}_0, t)) - f(\mathrm{IVP}[f](\bm{x}_0, t))\bigr)\,dt\right\|.
\end{align*}

The last term can be bounded as

\[ \left\|\int_0^\tau \bigl(F(\mathrm{IVP}[f](\bm{x}_0, t)) - f(\mathrm{IVP}[f](\bm{x}_0, t))\bigr)\,dt\right\| \leq \int_0^\tau \delta\,dt \]

because of the following argument. If $\tau = 0$, then both sides equal zero, hence the inequality holds with equality. If $\tau > 0$, then for any $t < \tau$, we have $\Delta_{\bm{x}_0}(t) < 2B \leq 2 e^{L_F}$ (recall that $\delta < 1$), and hence $\mathrm{IVP}[f](\bm{x}_0, t) \in K'$. In this case, $\|F - f\|_{\sup, K'} < \delta$ implies the inequality. Therefore, using the Lipschitz continuity of $F$ to bound the first term, we have

\[ \Delta_{\bm{x}_0}(\tau) \leq L_F \int_0^\tau \Delta_{\bm{x}_0}(t)\,dt + \int_0^\tau \delta\,dt. \]

Now, by applying Grönwall’s inequality [18], we obtain

\[ \Delta_{\bm{x}_0}(\tau) \leq \delta \tau e^{L_F \tau} \leq B. \]

On the other hand, by the definition of $\mathcal{T}$ and the continuity of $\Delta_{\bm{x}_0}(\cdot)$, we have $\Delta_{\bm{x}_0}(\tau) \geq 2B$. These two inequalities contradict each other.

Therefore, $\|\mathrm{IVP}[F](\cdot, 1) - \mathrm{IVP}[f](\cdot, 1)\|_{\sup, K} = \sup_{\bm{x}_0 \in K} \Delta_{\bm{x}_0}(1) \leq 2B = 2\delta e^{L_F}$ holds. Since $\delta < \varepsilon / (2 e^{L_F})$, the right-hand side is smaller than $\varepsilon$. ∎
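
The following numerical sketch (our own illustration; the vector fields are arbitrary choices) mirrors the bound of Lemma 2: perturbing the vector field by at most $\delta$ in $\sup$-norm moves the time-one endpoint by no more than roughly $2\delta e^{L_F}$.

```python
import numpy as np
from scipy.integrate import solve_ivp

L_F = 1.0                                       # tanh is 1-Lipschitz, so L_F = 1
delta = 1e-3
F = lambda z: np.tanh(z)
f = lambda z: np.tanh(z) + delta * np.cos(z)    # ||F - f||_sup <= delta everywhere

def endpoint(vector_field, x):
    sol = solve_ivp(lambda t, z: vector_field(z), (0.0, 1.0), x, rtol=1e-10, atol=1e-10)
    return sol.y[:, -1]

x0 = np.array([0.4, -0.8])
gap = np.linalg.norm(endpoint(F, x0) - endpoint(f, x0))
print(gap, "<=", 2 * delta * np.exp(L_F))       # observed endpoint gap vs. the 2*delta*e^{L_F} bound
```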

Finally, we state a lemma that is useful in the case $d = 1$. It is proved by convolving with a smooth bump-like function.

Fact 5 (Lemma 11 of [9]).

Let $\tau \colon \mathbb{R} \to \mathbb{R}$ be a strictly increasing continuous function. Then, for any compact subset $K \subset \mathbb{R}$ and any $\varepsilon > 0$, there exists a strictly increasing $C^\infty$-function $\tilde{\tau}$ such that

\[ \|\tau - \tilde{\tau}\|_{\sup, K} < \varepsilon. \]

A.2 Proof of Theorem 1

Proof of Theorem 1.

Let $F \colon U \to \mathbb{R}^d$ be an element of $\mathcal{D}^2$. Take any compact set $K \subset U$ and any $\varepsilon > 0$. First, thanks to Fact 2, there exist $G \in \mathrm{Diff}^2_{\mathrm{c}}$ and an affine transform $W \in \mathrm{Aff}$ such that

\[ W \circ G|_K = F|_K. \]

Now, if $d \geq 2$, then $2 \neq d + 1$; hence we can immediately use Fact 4 and Lemma 1 to show that there exists a finite set of flow endpoints (Definition 7) $g_1, \ldots, g_k \in S^2$ such that

\[ G = g_k \circ \cdots \circ g_1. \]

On the other hand, if $d = 1$, by Fact 5, for any $\delta > 0$ we can find a $C^\infty$-diffeomorphism $\tilde{G}$ on $\mathbb{R}$ such that $\|G - \tilde{G}\|_{\sup, K} < \delta$. Without loss of generality, we may assume that $\tilde{G}$ is compactly supported, so that $\tilde{G} \in \mathrm{Diff}^\infty_{\mathrm{c}}$. Then, we can use Fact 4 and Lemma 1 to show that there exists a finite set of flow endpoints (Definition 7) $g_1, \ldots, g_k \in S^\infty$ such that

\[ \tilde{G} = g_k \circ \cdots \circ g_1. \]

We now construct $f_j \in \operatorname{Lip}(\mathbb{R}^d)$ such that $g_j = \mathrm{IVP}[f_j](\cdot, 1)$. By Definition 7, for each $g_j$ ($1 \leq j \leq k$), there exists an associated flow $\Phi_j$. Now, define

\[ f_j(\cdot) := \left.\frac{\partial \Phi_j(\cdot, t)}{\partial t}\right|_{t=0}. \]

Then, $f_j \in \operatorname{Lip}(\mathbb{R}^d)$ because it is a compactly supported $C^1$-map: it is compactly supported since there exists a compact subset $K_j \subset \mathbb{R}^d$ containing the support of $\Phi_j(\cdot, t)$ for all $t$, and hence $\Phi_j(\cdot, t) - \Phi_j(\cdot, 0)$ vanishes on the complement of $K_j$, so its $t$-derivative $f_j$ does as well.

Now, $\Phi_j(\bm{x}, t) = \mathrm{IVP}[f_j](\bm{x}, t)$ holds since, by the additivity of the flow,

\begin{align*}
\frac{\partial \Phi_j}{\partial t}(\bm{x}, t) &= \lim_{s \to 0} \frac{\Phi_j(\bm{x}, t+s) - \Phi_j(\bm{x}, t)}{s} = \lim_{s \to 0} \frac{\Phi_j(\Phi_j(\bm{x}, t), s) - \Phi_j(\Phi_j(\bm{x}, t), 0)}{s} \\
&= \left.\frac{\partial \Phi_j(\Phi_j(\bm{x}, t), s)}{\partial s}\right|_{s=0} = f_j(\Phi_j(\bm{x}, t)),
\end{align*}

and hence $\Phi_j(\bm{x}, \cdot)$ solves the initial value problem defining $\mathrm{IVP}[f_j](\bm{x}, \cdot)$; by uniqueness, the two coincide. As a result, we have $g_j = \Phi_j(\cdot, 1) = \mathrm{IVP}[f_j](\cdot, 1)$.

By combining Fact 3 and Lemma 2, there exist $\phi_1, \ldots, \phi_k \in \Psi(\mathcal{H})$ such that

\[ \|g_k \circ \cdots \circ g_1 - \phi_k \circ \cdots \circ \phi_1\|_{\sup, K} < \frac{\varepsilon}{\|W\|_{\mathrm{op}}}, \]

where $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm. Therefore, $W \circ \phi_k \circ \cdots \circ \phi_1 \in \mathrm{INN}_{\mathcal{H}\text{-}\mathrm{NODE}}$ satisfies

\begin{align*}
\|F - W \circ \phi_k \circ \cdots \circ \phi_1\|_{\sup, K} &= \|W \circ G - W \circ \phi_k \circ \cdots \circ \phi_1\|_{\sup, K} \\
&\leq \|W\|_{\mathrm{op}} \|g_k \circ \cdots \circ g_1 - \phi_k \circ \cdots \circ \phi_1\|_{\sup, K} \\
&< \varepsilon
\end{align*}

if $d \geq 2$. For $d = 1$, it can be shown in a similar manner that there exists $W \circ \phi_k \circ \cdots \circ \phi_1 \in \mathrm{INN}_{\mathcal{H}\text{-}\mathrm{NODE}}$ satisfying $\|F - W \circ \phi_k \circ \cdots \circ \phi_1\|_{\sup, K} < \varepsilon$. ∎

Appendix B Terminal time of autonomous-ODE flow endpoints

In Definition 2, the choice of the terminal value of the time variable, $t = 1$, is purely technical. To see this, let $T > 0$. If we consider the solution $w \colon \mathbb{R} \to \mathbb{R}^d$ of the initial value problem $w(0) = \bm{x}$, $\dot{w}(t) = (Tf)(w(t))$ ($t \in \mathbb{R}$), as well as the unique solution $z \colon \mathbb{R} \to \mathbb{R}^d$ to $z(0) = \bm{x}$, $\dot{z}(t) = f(z(t))$ ($t \in \mathbb{R}$), then $w(t) = z(Tt)$ holds. Therefore, $\mathrm{IVP}[f](\bm{x}, Tt) = \mathrm{IVP}[Tf](\bm{x}, t)$.

As a result, $\mathrm{IVP}[f](\bm{x}, T) = \mathrm{IVP}[Tf](\bm{x}, 1)$ holds. Therefore, it holds that

\[ \{\mathrm{IVP}[f](\cdot, T) \mid f \in \mathcal{F}\} = \{\mathrm{IVP}[Tf](\cdot, 1) \mid f \in \mathcal{F}\} = \Psi(T\mathcal{F}). \]

Thus, even if we consider $T \neq 1$, as long as the set $\mathcal{F}$ is a cone (i.e., closed under multiplication by positive scalars), the set of autonomous-ODE flow endpoints remains the same.
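
The rescaling identity $\mathrm{IVP}[f](\bm{x}, T) = \mathrm{IVP}[Tf](\bm{x}, 1)$ can be checked numerically, as in the following sketch (our own; the vector field and $T$ are arbitrary choices).

```python
import numpy as np
from scipy.integrate import solve_ivp

f = lambda z: np.tanh(z)
T = 2.5
x = np.array([0.3, -1.2])

def ivp(vector_field, x0, t_end):
    sol = solve_ivp(lambda t, z: vector_field(z), (0.0, t_end), x0, rtol=1e-10, atol=1e-10)
    return sol.y[:, -1]

lhs = ivp(f, x, T)                       # IVP[f](x, T)
rhs = ivp(lambda z: T * f(z), x, 1.0)    # IVP[Tf](x, 1)
print(np.allclose(lhs, rhs))             # True up to solver tolerance
```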

Appendix C Comparison between $L^p$-universality and $\sup$-universality

In this section, we discuss the advantage of having a representation power guarantee in terms of the $\sup$-norm instead of the $L^p$-norm in function approximation tasks.

Roughly speaking, function approximation should be robust under a slight change of norms, but the $L^p$-universal approximation property can be sensitive to the choice of $p$. To make this point, we construct an example: even if a model $g$ approximates a target $f$ well in the norm $\|\cdot\|_{1,K}$, the model $g$ may fail to approximate $f$ in $\|\cdot\|_{p,K}$ for any $p > 1$, even if $p$ is very close to $1$.

Let $h \colon (0, 1) \to \mathbb{R}$ be a strictly increasing function such that

\[ \|h\|_{p', [0,1]} < \infty \ \text{ if } p' = 1, \qquad \|h\|_{p', [0,1]} = \infty \ \text{ if } p' > 1. \]

For example, $h(x) = -\sum_{k=1}^{\infty} x^{1/k - 1}/k^3$ satisfies this condition: since $\int_0^1 x^{1/k - 1}\,dx = k$, we have $\|h\|_{1,[0,1]} \leq \sum_{k=1}^{\infty} 1/k^2 < \infty$, whereas for any $p' > 1$, each term with $k > p'/(p'-1)$ already has an infinite $L^{p'}$-integral near $x = 0$. Then, we define

\[ g_n(x) = x + \frac{h(x)}{n}. \]

Now, the sequence $\{g_n\}_{n=1}^{\infty}$ approximates $\mathrm{Id}$ in the $L^1$-norm in the sense that for any small $\varepsilon > 0$ and sufficiently large $N$, it holds that

\[ \|g_N - \mathrm{Id}\|_{1, [0,1]} < \varepsilon. \]

However, the same $g_N$ fails to approximate $\mathrm{Id}$ in the $L^p$-norm ($p > 1$), since for sufficiently small $\delta \in (0, 1/2)$ it always holds that

\[ \|g_N - \mathrm{Id}\|_{p, [\delta, 1-\delta]} \geq 1. \]

This example highlights that fixing $p$ first and guaranteeing approximation in the $L^p$-norm may not suffice to guarantee approximation in the $L^{p'}$-norm for $p' > p$. On the other hand, a guarantee in the $\sup$-norm suffices to provide approximation guarantees in the $L^p$-norm for all $p \geq 1$ simultaneously (on a compact set).
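
For completeness, the following sketch (our own; the truncation of the series and the choice $p = 1.3$ are for illustration only) checks the two properties of $h$ term by term, using the closed-form integral $\int_0^1 x^{1/k-1}\,dx = k$.

```python
# Term-by-term check of h(x) = -sum_{k>=1} x^(1/k - 1) / k^3 on (0, 1).
def l1_contribution(k):
    # integral_0^1 x^(1/k - 1) dx = k (closed form), so the k-th term contributes k / k^3 = 1 / k^2.
    return 1.0 / k**2

def lp_term_is_finite(k, p):
    # integral_0^1 x^(p (1/k - 1)) dx is finite iff the exponent p (1/k - 1) exceeds -1.
    return p * (1.0 / k - 1.0) > -1.0

print("bound on ||h||_1:", sum(l1_contribution(k) for k in range(1, 100_000)))  # ~ pi^2 / 6
p = 1.3
divergent = [k for k in range(1, 20) if not lp_term_is_finite(k, p)]
print("terms with infinite L^p integral (p = 1.3):", divergent)                 # k = 5, 6, ...
```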