Universal Approximation Property of
Neural Ordinary Differential Equations
Abstract
Neural ordinary differential equations (NODEs) are an invertible neural network architecture that is promising for its free-form Jacobian and the availability of a tractable Jacobian determinant estimator. Recently, the representation power of NODEs has been partly uncovered: they form an $L^p$-universal approximator for continuous maps under certain conditions. However, the $L^p$-universality may fail to guarantee an approximation for the entire input domain, as it may hold even if the approximator largely differs from the target function on a small region of the input space. To further uncover the potential of NODEs, we show their stronger approximation property, namely the $\sup$-universality for approximating a large class of diffeomorphisms. It is shown by leveraging a structure theorem of the diffeomorphism group, and the result complements the existing literature by establishing a fairly large set of mappings that NODEs can approximate with a stronger guarantee.
1 Introduction
Neural ordinary differential equations (NODEs) [1] are a family of deep neural networks that indirectly model functions by transforming an input vector through an ordinary differential equation (ODE). When viewed as an invertible neural network (INN) architecture, NODEs have the advantage of a free-form Jacobian, i.e., they are invertible without restricting the Jacobian's structure, unlike other INN architectures [2]. Owing to this out-of-the-box invertibility and the availability of a tractable unbiased estimator of the Jacobian determinant [3], NODEs have been used for constructing continuous normalizing flows for generative modeling and density estimation [1, 3, 4].
Recently, the representation power of NODEs has been partly uncovered in [5]: namely, a sufficient condition has been established for a family of NODEs to be an $L^p$-universal approximator (see Definition 4) for continuous maps. However, the universal approximation property with respect to the $L^p$-norm can be insufficient, as it does not guarantee an approximation for the entire input domain: the $L^p$-approximation may still hold even if the approximator largely differs from the target function on a small region of the input space.
In this work, we elucidate that the NODEs are a $\sup$-universal approximator (Definition 4) for a fairly large class of diffeomorphisms, i.e., smooth invertible maps with smooth inverse. Our result establishes a function class that can be approximated using NODEs with a stronger guarantee than in the existing literature [5]. We prove the result by using a structure theorem from differential geometry to represent a diffeomorphism as a finite composition of flow endpoints, i.e., diffeomorphisms that are smooth transformations of the identity map. The NODEs are themselves examples of flow endpoints, and we derive the main result by approximating the flow endpoints by the NODEs.
2 Preliminaries and goal
In this section, we define the family of NODEs considered in the present paper as well as the notion of universality.
2.1 Neural ordinary differential equations (NODEs)
Let $\mathbb{R}$ (resp. $\mathbb{N}$) denote the set of all real values (resp. all positive integers). Throughout the paper, we fix $d \in \mathbb{N}$. Let $\mathrm{Lip}$ denote the set of all Lipschitz continuous maps from $\mathbb{R}^d$ to $\mathbb{R}^d$. It is known that any autonomous ODE (i.e., one that is defined by a time-invariant vector field) with a Lipschitz continuous vector field has a solution and that the solution is unique:
Fact 1 (Existence and uniqueness of a global solution to an ODE [6]).
Let $f \in \mathrm{Lip}$. Then, a solution to the following ordinary differential equation exists and it is unique:
$$\dot{z}(t) = f(z(t)), \quad z(0) = x, \quad t \in [0, 1], \tag{1}$$
where $x \in \mathbb{R}^d$, and $\dot{z}$ denotes the derivative of $z$.
In view of Fact 1, we use the following notation.
Definition 1.
For $f \in \mathrm{Lip}$, $t \in [0, 1]$, and $x \in \mathbb{R}^d$, we define $\Phi_t(f)(x) := z_x(t)$, where $z_x$ is the unique solution to Eq. (1) with the initial value $z_x(0) = x$, i.e., the solution to the initial value problem evaluated at time $t$.
Definition 2 (Autonomous-ODE flow endpoints; [5]).
For $\mathcal{F} \subset \mathrm{Lip}$, we define
$$\mathcal{NODE}_{\mathcal{F}} := \{\Phi_1(f) : f \in \mathcal{F}\}.$$
Definition 3 ($\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$).
Let $\mathrm{Aff}$ denote the group of all invertible affine maps on $\mathbb{R}^d$, and let $\mathcal{F} \subset \mathrm{Lip}$. Define the invertible neural network architecture based on NODEs as
$$\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}} := \{W_k \circ g_k \circ \cdots \circ W_1 \circ g_1 : k \in \mathbb{N},\ g_i \in \mathcal{NODE}_{\mathcal{F}},\ W_i \in \mathrm{Aff}\}.$$
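For concreteness, a NODE layer $\Phi_1(f)$ can be realized numerically by integrating Eq. (1) up to $t = 1$. The following Python sketch (ours; the vector field, integrator, and tolerances are illustrative choices rather than part of the formal setup) composes such layers with invertible affine maps as in Definition 3, and checks invertibility by integrating the reversed vector field, using the fact that $\Phi_1(f)^{-1} = \Phi_1(-f)$ for autonomous ODEs:

```python
import numpy as np
from scipy.integrate import solve_ivp

def node_layer(f):
    """Return Phi_1(f): integrate dz/dt = f(z) from t = 0 to t = 1."""
    def phi(x):
        sol = solve_ivp(lambda t, z: f(z), (0.0, 1.0), x, rtol=1e-8, atol=1e-10)
        return sol.y[:, -1]
    return phi

def affine_layer(W, b):
    """Invertible affine map x -> W x + b (W must be nonsingular)."""
    assert np.linalg.det(W) != 0
    return lambda x: W @ x + b

def f(z):
    # A Lipschitz vector field; any element of Lip would do here.
    return np.tanh(z) - 0.5 * z

# An element of INN_{NODE_F}: W2 o Phi_1(f) o W1.
W1 = affine_layer(np.array([[1.0, 0.5], [0.0, 1.0]]), np.zeros(2))
W2 = affine_layer(np.array([[2.0, 0.0], [0.0, 1.0]]), np.ones(2))
g = node_layer(f)

x = np.array([0.3, -0.7])
y = W2(g(W1(x)))

# Invertibility of the NODE layer: Phi_1(f)^{-1} = Phi_1(-f).
x_rec = node_layer(lambda z: -f(z))(g(W1(x)))
print(np.allclose(x_rec, W1(x), atol=1e-6))  # True up to integration error
```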
2.2 Goal: the notions of universality and their relations
Here, we define the notions of universality. Let $p \in [1, \infty)$. For a subset $K \subset \mathbb{R}^d$ and a map $f : K \to \mathbb{R}^d$, we define $\|f\|_{\sup, K} := \sup_{x \in K} \|f(x)\|$, where $\|\cdot\|$ denotes the Euclidean norm. Also, for a measurable map $f : K \to \mathbb{R}^d$ on a measurable subset $K \subset \mathbb{R}^d$, we define $\|f\|_{p, K} := \left( \int_K \|f(x)\|^p \,\mathrm{d}x \right)^{1/p}$.
Definition 4 ($L^p$-universality and $\sup$-universality).
Let $\mathcal{M}$ be a model, which is a set of measurable mappings from $\mathbb{R}^d$ to $\mathbb{R}^d$. Let $\mathcal{F}$ be a set of measurable mappings $f : U_f \to \mathbb{R}^d$, where $U_f$ is a measurable subset of $\mathbb{R}^d$ which may depend on $f$. We say that $\mathcal{M}$ is an $L^p$-universal approximator or has the $L^p$-universal approximation property for $\mathcal{F}$ if for any $f \in \mathcal{F}$, any $\varepsilon > 0$, and any compact subset $K \subset U_f$, there exists $g \in \mathcal{M}$ such that $\|f - g\|_{p, K} < \varepsilon$. The $\sup$-universal approximation property is defined by replacing $\|\cdot\|_{p, K}$ with $\|\cdot\|_{\sup, K}$ in the above.
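Note that on a compact set, the $\sup$-norm dominates every $L^p$-norm: for a measurable $f$ and a compact $K$ with Lebesgue measure $\lambda(K)$,
$$\|f\|_{p, K} = \left( \int_K \|f(x)\|^p \,\mathrm{d}x \right)^{1/p} \le \lambda(K)^{1/p}\, \|f\|_{\sup, K}.$$
Hence, the $\sup$-universal approximation property implies the $L^p$-universal approximation property for every $p \in [1, \infty)$ simultaneously, while the converse does not hold (see Appendix C).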
Our goal.
Our goal is to elucidate the representation power of INNs composed of NODEs by proving the $\sup$-universality of $\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ for a fairly large class of diffeomorphisms, i.e., smooth invertible functions with smooth inverse.
3 Main result
In this section, we present our main result, Theorem 1.
First, we define the following class of invertible maps, which will be our target to be approximated.
Definition 5 ($C^2$-diffeomorphisms: $\mathcal{D}^2$).
We define $\mathcal{D}^2$ as the set of all $C^2$-diffeomorphisms $f : U_f \to \mathbb{R}^d$, where $U_f \subset \mathbb{R}^d$ is an open set that is $C^2$-diffeomorphic to $\mathbb{R}^d$ and may depend on $f$.
The set $\mathcal{D}^2$ is a fairly large class: it contains any $C^2$-diffeomorphism defined on the entire $\mathbb{R}^d$, on an open convex set, or more generally, on a star-shaped open set.
Now, we state our main result, which establishes a class of maps that the invertible neural networks based on NODEs can approximate with respect to the $\sup$-norm.
Theorem 1 (Universality of NODEs).
Assume $\mathcal{F} \subset \mathrm{Lip}$ is a $\sup$-universal approximator for $\mathrm{Lip}$. Then, $\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ is a $\sup$-universal approximator for $\mathcal{D}^2$.
Examples of such $\mathcal{F}$ include the multi-layer perceptrons with finite weights and Lipschitz-continuous activation functions such as the rectified linear unit (ReLU) [7, 1], as well as the Lipschitz networks [8, Theorem 3]; a minimal illustration follows.
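As a concrete sketch of such an $\mathcal{F}$ (ours; a minimal illustration, not the exact constructions of [7, 8]): a one-hidden-layer perceptron with finite weights and the 1-Lipschitz ReLU activation is Lipschitz, with constant at most the product of the weight matrices' operator norms:

```python
import numpy as np

class MLPVectorField:
    """x -> W2 relu(W1 x + b1) + b2; an element of Lip."""
    def __init__(self, W1, b1, W2, b2):
        self.W1, self.b1, self.W2, self.b2 = W1, b1, W2, b2

    def __call__(self, x):
        return self.W2 @ np.maximum(self.W1 @ x + self.b1, 0.0) + self.b2

    def lipschitz_bound(self):
        # ReLU is 1-Lipschitz, so the spectral norms of the weights multiply.
        return np.linalg.norm(self.W2, 2) * np.linalg.norm(self.W1, 2)

rng = np.random.default_rng(0)
f = MLPVectorField(rng.normal(size=(8, 2)), rng.normal(size=8),
                   rng.normal(size=(2, 8)), rng.normal(size=2))
print(f.lipschitz_bound())  # finite, so Phi_1(f) is well-defined by Fact 1
```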
Proof outline.
To prove Theorem 1, we take a similar strategy to that of Theorem 1 of [9], but with a major modification to adapt it to our problem. First, the approximation target is reduced from $\mathcal{D}^2$ to the set of compactly-supported diffeomorphisms from $\mathbb{R}^d$ to $\mathbb{R}^d$, denoted by $\mathrm{Diff}^2_c(\mathbb{R}^d)$, by applying Fact 2 in Appendix A.1. Then, it is shown that we can represent each element of $\mathrm{Diff}^2_c(\mathbb{R}^d)$ as a finite composition of flow endpoints (Definition 7 in Appendix A.1), each of which can be approximated by a NODE. The decomposition into flow endpoints is realized by relying on a structure theorem of the diffeomorphism group (Fact 4 in Appendix A.1) attributed to Herman, Thurston [10], Epstein [11], and Mather [12, 13]. Note that we require a different definition of flow endpoints (Definition 7 in Appendix A.1) from that employed in [9, Corollary 2] in order to incorporate sufficient smoothness of the underlying flows.
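Schematically, the proof chains the following representations and approximations (notation as in Appendix A):
$$f\big|_K = (W \circ h)\big|_K, \qquad h = h_k \circ \cdots \circ h_1, \qquad h_i = \Phi_1(f_i) \approx \Phi_1(g_i) \ \text{ with } g_i \in \mathcal{F},$$
so that $W \circ \Phi_1(g_k) \circ \cdots \circ \Phi_1(g_1) \in \mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ approximates $f$ on $K$ in the $\sup$-norm (by Fact 3 in Appendix A.1).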
4 Related work and Discussion
In this section, we overview the existing literature on the representation power of NODEs to provide the context of the present paper.
$L^p$-universal approximation property of NODEs.
[5] considered NODEs capped with a terminal family to map the output of NODEs to a vector of the desired output dimension, and its Proposition 3.8 showed that the model class has the $L^p$-universality for the set of all continuous maps from $\mathbb{R}^d$ to $\mathbb{R}^m$ ($d \geq 2$), under a certain sufficient condition. In comparison to our result here, the result of [5] established the universality of NODEs for a larger target function class (namely, continuous maps) with a weaker notion of approximation (namely, $L^p$-universality).
Limitations on the representation power of NODEs.
[14] formulated its Theorem 1 to show that NODEs are not universal approximators by presenting a function that a NODE cannot approximate. The existence of this counterexample does not contradict our result because our approximation target is different from the function class considered in [14]: the class in [14] can contain discontinuous maps, whereas the elements of $\mathcal{D}^2$ are smooth and invertible.
Universality of augmented NODEs.
As a device to enhance the representation power of NODEs, increasing the dimensionality and padding zeros to the inputs/outputs has been explored [15, 14]. [14] showed that the augmented NODEs (ANODEs) are universal approximators for homeomorphisms. However, the approach has the limitation that it can undermine the invertibility of the model: unless the model is ideally trained so that it always outputs zeros in the zero-padded dimensions, the model can no longer represent an invertible map operating on the original dimensionality, as illustrated below. In contrast, the present work explores the universal approximation property of NODEs that is achieved without introducing the complications arising from dimensionality augmentation.
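To illustrate the invertibility issue with a minimal example of ours (not drawn from [14, 15]): let $d = 1$, pad one zero dimension, and suppose the augmented model realizes the 90-degree rotation
$$F\begin{pmatrix} x \\ 0 \end{pmatrix} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} x \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ x \end{pmatrix},$$
which is perfectly invertible on $\mathbb{R}^2$. The induced map on the original coordinate, $x \mapsto 0$, is constant and hence non-invertible: all the information has moved into the padded dimension, whose output is discarded. Invertibility on the original dimensionality is retained only if the model outputs exactly zero in the padded dimensions.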
Relation between $\mathcal{NODE}_{\mathrm{Lip}}$ and time-dependent NODEs.
Our result can be readily extended to the design choice of NODEs that includes the time index $t$ as an argument of the vector field. It can be done by limiting our attention to the subset of the considered class of vector fields consisting of all time-invariant ones, as follows. Let $f : \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d$ be continuous and such that there exists a continuous function $L : [0, 1] \to [0, \infty)$ satisfying
$$\|f(x, t) - f(y, t)\| \le L(t)\, \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d \text{ and } t \in [0, 1].$$
Then, the initial value problem
$$\dot{z}(t) = f(z(t), t), \quad z(0) = x$$
has a solution and it is unique [6], analogously to Fact 1. Then, given a set $\mathcal{G}$ of such mappings, we can consider its subset $\mathcal{G}_0 \subset \mathcal{G}$ that contains only the time-invariant elements, i.e., those $f$ such that for any $x \in \mathbb{R}^d$, the map $t \mapsto f(x, t)$ is a constant mapping. Such an $f$ is identified with an element of $\mathrm{Lip}$ with $\sup_{t \in [0, 1]} L(t)$ being a Lipschitz constant. Then, we can apply Theorem 1 to $\mathcal{F} := \mathcal{G}_0$ and its induced $\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$.
5 Conclusion
In this paper, we uncovered the $\sup$-universality of the INNs composed of NODEs for approximating a large class of diffeomorphisms, $\mathcal{D}^2$. This result complements the existing literature that showed the weaker approximation property of NODEs, namely the $L^p$-universality, for general continuous maps. Whether the $\sup$-universality holds for a larger class of maps than $\mathcal{D}^2$ is an important research question for future work. Also, it is important for future work to quantitatively evaluate how many layers of NODEs are required to approximate a given diffeomorphism with a specified smoothness, such as a bi-Lipschitz constant, in order to evaluate the efficiency of the approximation.
Acknowledgments and Disclosure of Funding
The authors would like to thank the anonymous reviewers for the insightful discussions. This work was supported by RIKEN Junior Research Associate Program. TT was supported by Masason Foundation. II and MI were supported by CREST:JPMJCR1913.
References
- [1] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt and David K. Duvenaud “Neural Ordinary Differential Equations” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018, pp. 6571–6583 URL: http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf
- [2] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed and Balaji Lakshminarayanan “Normalizing Flows for Probabilistic Modeling and Inference” In arXiv:1912.02762 [cs, stat], 2019 arXiv: http://arxiv.org/abs/1912.02762
- [3] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever and David Duvenaud “FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models” In 7th International Conference on Learning Representations New Orleans, LA, USA: OpenReview.net, 2019 URL: https://openreview.net/forum?id=rJxgknCcK7
- [4] Chris Finlay, Jorn-Henrik Jacobsen, Levon Nurbekyan and Adam M Oberman “How to Train Your Neural ODE: The World of Jacobian and Kinetic Regularization” In Proceedings of the 37th International Conference on Machine Learning, 2020
- [5] Qianxiao Li, Ting Lin and Zuowei Shen “Deep Learning via Dynamical Systems: An Approximation Perspective” In arXiv:1912.10382 [cs, math, stat], 2020 arXiv: http://arxiv.org/abs/1912.10382
- [6] W. Derrick and L. Janos “A Global Existence and Uniqueness Theorem for Ordinary Differential Equations” In Canadian Mathematical Bulletin 19.1, 1976, pp. 105–107 DOI: 10.4153/CMB-1976-015-7
- [7] Yann LeCun, Yoshua Bengio and Geoffrey Hinton “Deep Learning” In Nature 521.7553, 2015, pp. 436–444 URL: http://www.nature.com/articles/nature14539
- [8] Cem Anil, James Lucas and Roger Grosse “Sorting out Lipschitz Function Approximation” In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2019, pp. 291–301 URL: http://proceedings.mlr.press/v97/anil19a.html
- [9] Takeshi Teshima, Isao Ishikawa, Koichi Tojo, Kenta Oono, Masahiro Ikeda and Masashi Sugiyama “Coupling-Based Invertible Neural Networks Are Universal Diffeomorphism Approximators” In Advances in Neural Information Processing Systems 33, in press
- [10] William Thurston “Foliations and Groups of Diffeomorphisms” In Bulletin of the American Mathematical Society 80.2, 1974, pp. 304–307 URL: https://projecteuclid.org:443/euclid.bams/1183535407
- [11] D. B. A. Epstein “The Simplicity of Certain Groups of Homeomorphisms” In Compositio Mathematica 22.2, 1970, pp. 165–173
- [12] John N. Mather “Commutators of Diffeomorphisms” In Commentarii Mathematici Helvetici 49.1, 1974, pp. 512–528 URL: https://eudml.org/doc/139598
- [13] John N. Mather “Commutators of Diffeomorphisms: II” In Commentarii Mathematici Helvetici 50.1, 1975, pp. 33–40 DOI: 10.1007/BF02565731
- [14] Han Zhang, Xi Gao, Jacob Unterman and Tomasz Arodz “Approximation Capabilities of Neural ODEs and Invertible Residual Networks” In Proceedings of the 37th International Conference on Machine Learning 119 Vienna, Austria: PMLR, 2020 URL: https://proceedings.icml.cc/paper/2020/hash/c32d9bf27a3da7ec8163957080c8628e
- [15] Emilien Dupont, Arnaud Doucet and Yee Whye Teh “Augmented Neural ODEs” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019, pp. 3140–3150 URL: http://papers.nips.cc/paper/8577-augmented-neural-odes.pdf
- [16] Serge Lang “Differential Manifolds” New York: Springer-Verlag, 1985 DOI: 10.1007/978-1-4684-0265-0
- [17] Philip Hartman “Ordinary Differential Equations” 38, Classics in Applied Mathematics Society for Industrial and Applied Mathematics, 2002
- [18] T. H. Gronwall “Note on the Derivatives with Respect to a Parameter of the Solutions of a System of Differential Equations” In Annals of Mathematics 20.4 Annals of Mathematics, 1919, pp. 292–296 DOI: 10.2307/1967124
This is the Supplementary Material for “Universal approximation property of neural ordinary differential equations.” Table 1 summarizes the abbreviations and the symbols used in the paper.
Abbreviation/Notation | Meaning
---|---
INN | Invertible neural networks
NODE | Neural ordinary differential equations
$\mathrm{Aff}$ | Set of invertible affine transformations
$\Phi_t(f)(x)$ | The (unique) solution to an initial value problem evaluated at time $t$
$\mathcal{NODE}_{\mathcal{F}}$ | Set of NODEs obtained from the Lipschitz continuous vector fields in $\mathcal{F}$
$\mathrm{Lip}$ | The set of all Lipschitz continuous maps from $\mathbb{R}^d$ to $\mathbb{R}^d$
$\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ | INNs composed of $\mathrm{Aff}$ and NODEs parametrized by $\mathcal{F}$
$d$ | Dimensionality of the Euclidean space under consideration
$\mathcal{D}^2$ | Set of all $C^2$-diffeomorphisms with $C^2$-diffeomorphic domains
$\mathrm{Diff}^r_c(\mathbb{R}^d)$ | Group of compactly-supported $C^r$-diffeomorphisms on $\mathbb{R}^d$ ($r \in \mathbb{N} \cup \{\infty\}$)
$\|\cdot\|$ | Euclidean norm
$\|\cdot\|_{\mathrm{op}}$ | Operator norm
$\|\cdot\|_{\sup, K}$ | Supremum norm on a subset $K$
$\|\cdot\|_{p, K}$ | $L^p$-norm on a subset $K$
$\mathrm{Id}$ | Identity map
$\operatorname{supp}$ | Support of a map
Appendix A Proof of Theorem 1
Here, we provide a proof of Theorem 1. In Section A.1, we display the known facts and show the lemmas used for the proof. In Section A.2, we prove Theorem 1.
A.1 Lemmas and known facts
We use the following definition and facts from [9].
Definition 6 (Compactly supported diffeomorphism).
We use $\mathrm{Diff}^r_c(\mathbb{R}^d)$ to denote the set of all compactly supported $C^r$-diffeomorphisms ($r \in \mathbb{N} \cup \{\infty\}$) from $\mathbb{R}^d$ to $\mathbb{R}^d$. Here, we say a diffeomorphism $f$ on $\mathbb{R}^d$ is compactly supported if there exists a compact subset $K \subset \mathbb{R}^d$ such that for any $x \notin K$, $f(x) = x$. We regard $\mathrm{Diff}^r_c(\mathbb{R}^d)$ as a group whose group operation is function composition.
The following fact enables us to reduce the approximation problem for $\mathcal{D}^2$ to that for $\mathrm{Diff}^2_c(\mathbb{R}^d)$.
Fact 2 (Lemma 5 of [9]).
Let $f$ be an element of $\mathcal{D}^2$, and let $K \subset U_f$ be a compact set. Then, there exist $h \in \mathrm{Diff}^2_c(\mathbb{R}^d)$ and an affine transform $W \in \mathrm{Aff}$ such that
$$f|_K = (W \circ h)|_K.$$
The following fact enables the component-wise approximation, i.e., given a transformation that is represented by a composition of some transformations, we can approximate it by approximating each constituent and composing them.
Fact 3 (Compatibility of composition and approximation; Proposition 6 of [9]).
Let $\mathcal{M}$ be a set of locally bounded maps from $\mathbb{R}^d$ to $\mathbb{R}^d$, and let $f_1, \ldots, f_k$ be continuous maps from $\mathbb{R}^d$ to $\mathbb{R}^d$. Assume that for any $\varepsilon > 0$ and any compact set $K \subset \mathbb{R}^d$, there exist $g_1, \ldots, g_k \in \mathcal{M}$ such that, for $i = 1, \ldots, k$, $\|f_i - g_i\|_{\sup, K} < \varepsilon$. Then, for any $\varepsilon > 0$ and any compact set $K \subset \mathbb{R}^d$, there exist $g_1, \ldots, g_k \in \mathcal{M}$ such that
$$\|f_k \circ \cdots \circ f_1 - g_k \circ \cdots \circ g_1\|_{\sup, K} < \varepsilon.$$
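Informally, for $k = 2$, the key step is the triangle inequality (a proof sketch of ours; the full proof in [9] carefully handles the domains involved):
$$\|f_2 \circ f_1 - g_2 \circ g_1\|_{\sup, K} \le \|f_2 \circ f_1 - f_2 \circ g_1\|_{\sup, K} + \|f_2 \circ g_1 - g_2 \circ g_1\|_{\sup, K}.$$
The first term is small by the uniform continuity of $f_2$ on a compact neighborhood of $f_1(K)$ once $\|f_1 - g_1\|_{\sup, K}$ is small, and the second term is at most $\|f_2 - g_2\|_{\sup, g_1(K)}$, which is small once $g_1(K)$ is contained in a fixed compact set.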
The following fact is attributed to Herman, Thurston [10], Epstein [11], and Mather [12, 13]. See Fact 2 of [9] and the remarks therein for details. Let $\mathrm{Id}$ denote the identity map.
Fact 4 (Fact 2 of [9]).
If $r \neq d + 1$, the group $\mathrm{Diff}^r_c(\mathbb{R}^d)$ is simple, i.e., any normal subgroup $H \subset \mathrm{Diff}^r_c(\mathbb{R}^d)$ is either $\{\mathrm{Id}\}$ or $\mathrm{Diff}^r_c(\mathbb{R}^d)$.
Next, we define a subset of $\mathrm{Diff}^r_c(\mathbb{R}^d)$ called the flow endpoints. In Lemma 1, it is shown that the set of flow endpoints generates a non-trivial normal subgroup of $\mathrm{Diff}^r_c(\mathbb{R}^d)$. Therefore, by combining it with Fact 4, we can represent any element of $\mathrm{Diff}^r_c(\mathbb{R}^d)$ ($r \neq d + 1$) as a finite composition of flow endpoints, each of which can be approximated by the NODEs.
While Corollary 2 of [9] also defined a set of flow endpoints in $\mathrm{Diff}^r_c(\mathbb{R}^d)$, it differs from the one defined here, which is tailored for our purpose. The two definitions can be interpreted as describing two different generators of the same group $\mathrm{Diff}^r_c(\mathbb{R}^d)$. Let $\operatorname{supp}$ denote the support of a map (for a diffeomorphism $h$, $\operatorname{supp} h$ is the closure of $\{x \in \mathbb{R}^d : h(x) \neq x\}$).
Definition 7 (Flow endpoints in $\mathrm{Diff}^r_c(\mathbb{R}^d)$).
Let $r \in \mathbb{N} \cup \{\infty\}$. Let $\mathcal{S}^r$ be the set of diffeomorphisms of the form $\phi(1, \cdot)$ for some map $\phi : I \times \mathbb{R}^d \to \mathbb{R}^d$ such that

- $I$ is an open interval containing $[0, 1]$,
- $\phi(0, \cdot) = \mathrm{Id}$,
- $\phi(t, \cdot) \in \mathrm{Diff}^r_c(\mathbb{R}^d)$ for any $t \in I$,
- $\phi(s, \cdot) \circ \phi(t, \cdot) = \phi(s + t, \cdot)$ for any $s, t \in I$ with $s + t \in I$,
- $\phi$ is $C^2$ on $I \times \mathbb{R}^d$,
- There exists a compact subset $K \subset \mathbb{R}^d$ such that $\bigcup_{t \in I} \operatorname{supp} \phi(t, \cdot) \subset K$.
The difference between Definition 7 and the one in Corollary 2 of [9] mainly lies in the last two conditions. Technically, these two conditions are used in Section A.2 for showing that the partial derivative of $\phi$ in $t$ at $t = 0$ is Lipschitz continuous.
Lemma 1 (Modified Corollary 2 of [9]).
Let $r \in \mathbb{N} \cup \{\infty\}$ and let $\mathcal{S}^r$ be the set of all flow endpoints. Then, the subset of $\mathrm{Diff}^r_c(\mathbb{R}^d)$ defined by
$$H := \{h_1 \circ \cdots \circ h_k : k \in \mathbb{N},\ h_i \in \mathcal{S}^r\}$$
forms a subgroup of $\mathrm{Diff}^r_c(\mathbb{R}^d)$ and it is a non-trivial normal subgroup.
Proof of Lemma 1.
First, we prove that $H$ forms a subgroup of $\mathrm{Diff}^r_c(\mathbb{R}^d)$. By definition, for any $g, h \in H$, it holds that $g \circ h \in H$. Also, $H$ is closed under inversion; to see this, it suffices to show that $\mathcal{S}^r$ is closed under inversion. Let $h = \phi(1, \cdot) \in \mathcal{S}^r$. Consider the map $\psi$ defined by $\psi(t, x) := \phi(t, \cdot)^{-1}(x)$. It is easy to confirm that $\psi$ satisfies the conditions of Definition 7, hence $h^{-1} = \psi(1, \cdot)$ is an element of $\mathcal{S}^r$. Note that $\psi$ is confirmed to be $C^2$ on $I \times \mathbb{R}^d$ by applying the inverse function theorem to $(t, x) \mapsto (t, \phi(t, x))$.
Next, we prove that $H$ is normal. To show that the subgroup $H$ generated by $\mathcal{S}^r$ is normal, it suffices to show that $\mathcal{S}^r$ is closed under conjugation. Take any $h \in \mathcal{S}^r$ and $g \in \mathrm{Diff}^r_c(\mathbb{R}^d)$, and let $\phi$ be a flow associated with $h$. Then, the function $\psi$ defined by $\psi(t, x) := (g \circ \phi(t, \cdot) \circ g^{-1})(x)$ is a flow associated with $g \circ h \circ g^{-1}$ satisfying the conditions in Definition 7, which implies $g \circ h \circ g^{-1} \in \mathcal{S}^r$, i.e., $\mathcal{S}^r$ is closed under conjugation.
Next, we prove that $H$ is non-trivial by constructing an element of $\mathcal{S}^r$ that is not the identity element. First, consider the case $d = 1$. Let $f : \mathbb{R} \to \mathbb{R}$ be a non-constant $C^\infty$-function with compact support, for example,
$$f(x) := \begin{cases} \exp\left( - \dfrac{1}{1 - x^2} \right) & \text{if } |x| < 1, \\ 0 & \text{otherwise}, \end{cases}$$
which is a $C^\infty$-function on $\mathbb{R}$ with a compact support. Since $f$ is Lipschitz continuous and $C^\infty$, there exists a flow map $\phi$ for the ODE $\dot{z}(t) = f(z(t))$ that is a $C^\infty$-function over $\mathbb{R} \times \mathbb{R}$; see Fact 1 and [17, Chapter V, Corollary 4.1]. Let $K \subset \mathbb{R}$ be a compact subset that contains $\operatorname{supp} f$. Then, by considering the ordinary differential equation by which $\phi$ is defined, we see that $\phi(t, x) = x$ for any $x \notin \operatorname{supp} f$, and also that $\operatorname{supp} \phi(t, \cdot) \subset K$. We also have $\phi(t, \cdot) \in \mathrm{Diff}^\infty_c(\mathbb{R})$ for any $t \in \mathbb{R}$. In particular, we have $\phi(1, \cdot) \in \mathcal{S}^r$. Since $f \not\equiv 0$, $\phi(1, \cdot)$ is not an identity map and thus $H$ is not trivial. Next, we consider the case $d \geq 2$. Take a $C^\infty$-function $a : \mathbb{R} \to \mathbb{R}$ with compact support and $a \not\equiv 0$, and a nonzero skew-symmetric matrix $A \in \mathbb{R}^{d \times d}$ (i.e., $A^\top = -A$), and let $b(x) := a(\|x\|^2)$. We define a $C^\infty$-map $\psi : \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}^d$ by
$$\psi(t, x) := e^{t\, b(x) A} x.$$
Since $e^{t b(x) A}$ is an orthogonal matrix for any $t \in \mathbb{R}$ and $x \in \mathbb{R}^d$, we have $\|\psi(t, x)\| = \|x\|$; hence $b(\psi(t, x)) = b(x)$, and $\psi$ is a $C^\infty$-flow on $\mathbb{R}^d$. Now, it is enough to show that there exists a compact set $K$ satisfying $\bigcup_{t \in \mathbb{R}} \operatorname{supp} \psi(t, \cdot) \subset K$. Let $K := \{x \in \mathbb{R}^d : \|x\|^2 \in \operatorname{supp} a\}$, which is compact. Then the inclusion $\operatorname{supp} \psi(t, \cdot) \subset K$ holds for any $t \in \mathbb{R}$ since $\psi(t, x) = x$ for $x \notin K$. ∎
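The flow property and the compact support of $\psi$ in the case $d \geq 2$ can be checked numerically. Below is a quick sanity check of ours; the bump function $a$, the matrix $A$, and the tolerances are arbitrary illustrative choices:

```python
import numpy as np
from scipy.linalg import expm

def a(r2):
    # Smooth bump in the squared radius; supported on [0, 1].
    return np.where(r2 < 1.0, np.exp(-1.0 / np.maximum(1.0 - r2, 1e-300)), 0.0)

A = np.array([[0.0, -1.0], [1.0, 0.0]])  # nonzero skew-symmetric matrix (d = 2)

def psi(t, x):
    # psi(t, x) = exp(t * a(||x||^2) * A) x
    return expm(t * a(x @ x) * A) @ x

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=2)
    s, t = rng.normal(size=2)
    assert np.allclose(psi(s, psi(t, x)), psi(s + t, x))             # additivity
    assert np.isclose(np.linalg.norm(psi(t, x)), np.linalg.norm(x))  # isometry
    z = np.array([2.0, 0.0])           # ||z||^2 = 4 lies outside supp a
    assert np.allclose(psi(t, z), z)   # psi is the identity outside K
print("flow property, norm preservation, and compact support verified")
```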
The following lemma allows us to approximate an autonomous-ODE flow endpoint by approximating the vector field of the differential equation. See Definition 2 for the definition of $\mathcal{NODE}_{\mathcal{F}}$.
Lemma 2 (Approximation of Autonomous-ODE flow endpoints).
Assume $\mathcal{F} \subset \mathrm{Lip}$ is a $\sup$-universal approximator for $\mathrm{Lip}$. Then, $\mathcal{NODE}_{\mathcal{F}}$ is a $\sup$-universal approximator for $\mathcal{NODE}_{\mathrm{Lip}}$.
Proof.
Let $h \in \mathcal{NODE}_{\mathrm{Lip}}$. Then, by definition, there exists $f \in \mathrm{Lip}$ such that $h = \Phi_1(f)$. Let $L > 0$ denote a Lipschitz constant of $f$. In the following, we approximate $h$ by approximating $f$ using an element of $\mathcal{F}$.

Let $\varepsilon > 0$, and let $K$ be a compact subset of $\mathbb{R}^d$. We show that there exists $g \in \mathcal{F}$ such that $\|\Phi_1(f) - \Phi_1(g)\|_{\sup, K} < \varepsilon$. Note that $\Phi_1(g)$ is well-defined because $g \in \mathcal{F} \subset \mathrm{Lip}$. Define
$$\tilde K := \{\Phi_t(f)(x) : t \in [0, 1],\ x \in K\}, \qquad K' := \{x \in \mathbb{R}^d : \operatorname{dist}(x, \tilde K) \le 1\},$$
where $\operatorname{dist}(x, \tilde K) := \inf_{y \in \tilde K} \|x - y\|$. Then, $K'$ is compact. This follows from the compactness of $\tilde K$: (i) $K'$ is bounded since $\tilde K$ is bounded, and (ii) it is closed since the function $x \mapsto \operatorname{dist}(x, \tilde K)$ is continuous and hence $K'$ is the inverse image of a closed interval by a continuous map.

Since $\mathcal{F}$ is assumed to be a $\sup$-universal approximator for $\mathrm{Lip}$, for any $\varepsilon' > 0$, we can take $g \in \mathcal{F}$ such that $\|f - g\|_{\sup, K'} < \varepsilon'$. Let $\varepsilon' > 0$ be such that $\varepsilon' e^{L} < \min\{\varepsilon, 1\}$, and take such a $g$.

Fix $x \in K$ and define $z(t) := \Phi_t(f)(x)$ and $\tilde z(t) := \Phi_t(g)(x)$. Let $u(t) := \|\tilde z(t) - z(t)\|$, and we show that
$$u(t) \le \varepsilon' e^{L t}$$
holds for all $t \in [0, 1]$. We prove this by contradiction. Suppose that there exists $t \in [0, 1]$ for which the inequality does not hold. Then, the set $T := \{t \in [0, 1] : u(t) > \varepsilon' e^{L t}\}$ is not empty, and thus $t_0 := \inf T \in [0, 1]$ exists. For this $t_0$, we show both $u(t_0) \ge \varepsilon' e^{L t_0}$ and $u(t_0) \le \varepsilon' t_0 e^{L t_0}$. First, we have
$$u(t_0) = \left\| \int_0^{t_0} \big( g(\tilde z(s)) - f(z(s)) \big)\, \mathrm{d}s \right\| \le \int_0^{t_0} \|g(\tilde z(s)) - f(\tilde z(s))\|\, \mathrm{d}s + L \int_0^{t_0} u(s)\, \mathrm{d}s.$$
The first term on the right-hand side can be bounded as
$$\int_0^{t_0} \|g(\tilde z(s)) - f(\tilde z(s))\|\, \mathrm{d}s \le \varepsilon' t_0$$
because of the following argument. If $t_0 = 0$, then both sides equal zero, hence it holds with equality. If $t_0 > 0$, then for any $s \in [0, t_0)$, we have $\tilde z(s) \in K'$ because $u(s) \le \varepsilon' e^{L s} \le \varepsilon' e^{L} < 1$ implies $\operatorname{dist}(\tilde z(s), \tilde K) < 1$. In this case, $\|f - g\|_{\sup, K'} < \varepsilon'$ implies the inequality. Therefore, we have
$$u(t_0) \le \varepsilon' t_0 + L \int_0^{t_0} u(s)\, \mathrm{d}s.$$
Now, by applying Grönwall's inequality [18], we obtain
$$u(t_0) \le \varepsilon' t_0\, e^{L t_0}.$$
On the other hand, by the definition of $t_0$ and the continuity of $u$, we have $u(t_0) \ge \varepsilon' e^{L t_0}$. These two inequalities contradict: if $t_0 < 1$, then $\varepsilon' t_0 e^{L t_0} < \varepsilon' e^{L t_0}$, and if $t_0 = 1$, then $T = \{1\}$ and hence the strict inequality $u(t_0) > \varepsilon' e^{L t_0}$ holds.

Therefore, $u(1) = \|\Phi_1(f)(x) - \Phi_1(g)(x)\| \le \varepsilon' e^{L}$ holds for every $x \in K$. Since $\varepsilon' e^{L} < \varepsilon$, the right-hand side is smaller than $\varepsilon$. ∎
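The quantitative content of this proof, namely that an $\varepsilon'$-error in the vector field inflates to at most $\varepsilon' e^{L}$ at the flow endpoint, can be illustrated numerically. The following sketch of ours uses arbitrary vector fields satisfying the assumptions ($f$ is $L$-Lipschitz and $\|f - g\|_{\sup} \le \varepsilon'$):

```python
import numpy as np
from scipy.integrate import solve_ivp

L = 1.0           # Lipschitz constant of f
eps_prime = 1e-3  # sup-norm error of the vector field

def f(t, z):      # target vector field, Lipschitz with constant L
    return -L * z

def g(t, z):      # perturbation of f, within eps_prime in sup norm
    return -L * z + eps_prime * np.sin(z)

rng = np.random.default_rng(0)
worst = 0.0
for x in rng.uniform(-2, 2, size=(20, 2)):
    zf = solve_ivp(f, (0, 1), x, rtol=1e-10, atol=1e-12).y[:, -1]
    zg = solve_ivp(g, (0, 1), x, rtol=1e-10, atol=1e-12).y[:, -1]
    worst = max(worst, np.linalg.norm(zf - zg))

print(worst, "<=", eps_prime * np.exp(L))  # Gronwall bound eps' * e^L
assert worst <= eps_prime * np.exp(L)
```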
Finally, we display a lemma that is useful in the case of $d = 1$. It is proved by convolving a smooth bump-like function.
Fact 5 (Lemma 11 of [9]).
Let $f : \mathbb{R} \to \mathbb{R}$ be a strictly increasing continuous function. Then, for any compact subset $K \subset \mathbb{R}$ and any $\varepsilon > 0$, there exists a strictly increasing $C^\infty$-function $\tilde f : \mathbb{R} \to \mathbb{R}$ such that $\|f - \tilde f\|_{\sup, K} < \varepsilon$.
A.2 Proof of Theorem 1
Proof of Theorem 1.
Let $f$ be an element of $\mathcal{D}^2$. Take any compact set $K \subset U_f$ and any $\varepsilon > 0$. First, thanks to Fact 2, there exist $h \in \mathrm{Diff}^2_c(\mathbb{R}^d)$ and an affine transform $W \in \mathrm{Aff}$ such that
$$f|_K = (W \circ h)|_K.$$
Now, if $d \geq 2$, then $2 \neq d + 1$, hence we can immediately use Fact 4 and Lemma 1 to show that there exists a finite set of flow endpoints $h_1, \ldots, h_k \in \mathcal{S}^2$ (Definition 7) such that
$$h = h_k \circ \cdots \circ h_1.$$
On the other hand, if $d = 1$, by Fact 5, for any $\tilde\varepsilon > 0$, we can find $\tilde h$ that is a $C^\infty$-diffeomorphism on $\mathbb{R}$ such that $\|h - \tilde h\|_{\sup, K} < \tilde\varepsilon$. Without loss of generality, we may assume that $\tilde h$ is compactly supported so that $\tilde h \in \mathrm{Diff}^\infty_c(\mathbb{R})$. Since $\infty \neq d + 1$, we can use Fact 4 and Lemma 1 to show that there exists a finite set of flow endpoints $h_1, \ldots, h_k \in \mathcal{S}^\infty$ (Definition 7) such that
$$\tilde h = h_k \circ \cdots \circ h_1.$$
We now construct $f_i \in \mathrm{Lip}$ such that $h_i = \Phi_1(f_i)$. By Definition 7, for each $h_i$ ($i = 1, \ldots, k$), there exists an associated flow $\phi_i$. Now, define
$$f_i(x) := \left. \frac{\partial \phi_i(t, x)}{\partial t} \right|_{t = 0}.$$
Then, $f_i \in \mathrm{Lip}$ because it is a compactly-supported $C^1$-map: it is compactly supported since there exists a compact subset $K_i \subset \mathbb{R}^d$ containing the support of $\phi_i(t, \cdot)$ for all $t$, and hence $f_i$ is zero in the complement of $K_i$.
Now, since, by the additivity of the flows,
$$\frac{\partial \phi_i(t, x)}{\partial t} = \lim_{s \to 0} \frac{\phi_i(s + t, x) - \phi_i(t, x)}{s} = \lim_{s \to 0} \frac{\phi_i(s, \phi_i(t, x)) - \phi_i(t, x)}{s} = f_i(\phi_i(t, x)),$$
the curve $t \mapsto \phi_i(t, x)$ is a solution to the initial value problem $\dot{z}(t) = f_i(z(t))$, $z(0) = x$, and the solution is unique. As a result, we have $h_i = \phi_i(1, \cdot) = \Phi_1(f_i) \in \mathcal{NODE}_{\mathrm{Lip}}$. Finally, by Lemma 2, each $h_i$ can be approximated by elements of $\mathcal{NODE}_{\mathcal{F}}$ in the $\sup$-norm on compact sets, and hence, by Fact 3, the composition $W \circ h_k \circ \cdots \circ h_1$ can be approximated by elements of $\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ in the $\sup$-norm on $K$. Taking the approximation errors (and, in the case $d = 1$, also $\tilde\varepsilon$) small enough yields $g \in \mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ with $\|f - g\|_{\sup, K} < \varepsilon$. ∎
Appendix B Terminal time of autonomous-ODE flow endpoints
In Definition 2, the choice of the terminal value of the time variable, $t = 1$, is only technical. To see this, let $T > 0$ and $f \in \mathrm{Lip}$. If we consider $z$ that is the solution of the initial value problem $\dot{z}(t) = f(z(t))$, $z(0) = x$, as well as $\tilde z$ that is the unique solution to $\dot{\tilde z}(t) = T f(\tilde z(t))$, $\tilde z(0) = x$, then $\tilde z(t) = z(T t)$ holds. Therefore, $\tilde z(1) = z(T)$.
As a result, $\Phi_T(f) = \Phi_1(T f)$ holds, where $\Phi_T(f)(x) := z_x(T)$ denotes the flow endpoint with terminal time $T$. Therefore, it holds that
$$\{\Phi_T(f) : f \in \mathcal{F}\} = \{\Phi_1(T f) : f \in \mathcal{F}\}.$$
Thus, even if we consider $\Phi_T$ instead of $\Phi_1$, if the set $\mathcal{F}$ is a cone (i.e., $f \in \mathcal{F}$ implies $c f \in \mathcal{F}$ for any $c > 0$), the set of the autonomous-ODE flow endpoints remains the same.
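The time-rescaling identity $\Phi_T(f) = \Phi_1(T f)$ is easy to confirm numerically (a sketch of ours with an arbitrary Lipschitz vector field):

```python
import numpy as np
from scipy.integrate import solve_ivp

def endpoint(f, T, x):
    # Phi_T(f)(x): integrate dz/dt = f(z) from t = 0 to t = T.
    sol = solve_ivp(lambda t, z: f(z), (0.0, T), x, rtol=1e-10, atol=1e-12)
    return sol.y[:, -1]

f = lambda z: np.tanh(z)
x = np.array([0.4, -1.2])
T = 2.5
print(np.allclose(endpoint(f, T, x),                      # Phi_T(f)(x)
                  endpoint(lambda z: T * f(z), 1.0, x)))  # Phi_1(T f)(x): True
```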
Appendix C Comparison between -universality and -universality
In this section, we discuss the advantage of having a representation power guarantee in terms of the $\sup$-norm instead of the $L^p$-norm in function approximation tasks.
Roughly speaking, function approximation should be robust under a slight change of norms, but the $L^p$-universal approximation property can be sensitive to the choice of $p$. To make this point, we construct an example: even if a model sufficiently approximates a target $f$ with respect to the $L^p$-norm, the model may fail to approximate $f$ with respect to the $L^{p'}$-norm for any $p' > p$, even if $p'$ is very close to $p$.
Let $u : \mathbb{R} \to \mathbb{R}$ be a strictly increasing function such that
$$\lim_{x \to -\infty} u(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} u(x) = 1.$$
For example, the logistic sigmoid $u(x) = 1 / (1 + e^{-x})$ satisfies this condition, together with the tail bound $u(-s) \le e^{-s}$ for $s \ge 0$, which we use below. Taking this $u$ and fixing $p \in [1, \infty)$, we define the target $f := 0$ on $K := [0, 1]$ and the sequence of smoothed spikes
$$f_n(x) := e^{n}\, u\big( M_n (b_n - x) \big), \qquad b_n := \frac{e^{-np}}{n}, \quad M_n := n e^{np},$$
i.e., $f_n$ is approximately $e^n$ on $[0, b_n]$ and decays rapidly beyond $b_n$. Now, the sequence $(f_n)_{n \in \mathbb{N}}$ approximates $f$ in the $L^p$-norm in the sense that for any small $\varepsilon > 0$ and all sufficiently large $n$, it holds that
$$\|f_n - f\|_{p, K}^p \le e^{np}\left( b_n + \frac{1}{M_n} \right) = \frac{2}{n} \le \varepsilon^p. \tag{2}$$
However, the same sequence fails to approximate $f$ in the $L^{p'}$-norm for any $p' > p$, since $u(M_n(b_n - x)) \ge u(1/2)$ for $x \in [0, b_n / 2]$ and hence, for all $n$,
$$\|f_n - f\|_{p', K} \ge u(1/2) \left( \frac{b_n}{2} \right)^{1/p'} e^{n} = \frac{u(1/2)}{(2n)^{1/p'}}\, e^{n (p' - p)/p'} \xrightarrow{\ n \to \infty\ } \infty. \tag{3}$$
This example highlights that fixing $p$ first and guaranteeing the approximation in the $L^p$-norm may not suffice for guaranteeing the approximation in the $L^{p'}$-norm ($p' > p$). On the other hand, having a guarantee in the $\sup$-norm suffices for providing an approximation guarantee in the $L^p$-norm for all $p \in [1, \infty)$ simultaneously, since $\|f\|_{p, K} \le \lambda(K)^{1/p} \|f\|_{\sup, K}$ on a compact set $K$ with Lebesgue measure $\lambda(K)$.
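The divergence of the $L^{p'}$-error alongside the vanishing $L^p$-error can be observed numerically. Below is a quick check of ours for the concrete spike sequence above (with $p = 1$ and $p' = 1.5$; the quadrature settings are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500.0, 500.0)))

def make_fn(n, p):
    b = np.exp(-n * p) / n   # width b_n of the spike
    M = n * np.exp(n * p)    # sharpness M_n of the smoothed step
    return (lambda x: np.exp(n) * sigmoid(M * (b - x))), b

def lp_err(f, p, b):
    # L^p norm of f on [0, 1]; `points` helps quad resolve the narrow spike.
    val, _ = quad(lambda x: f(x) ** p, 0.0, 1.0, points=[b, 2 * b], limit=200)
    return val ** (1.0 / p)

p, p_prime = 1.0, 1.5
for n in [2, 4, 6, 8]:
    f, b = make_fn(n, p)
    # The L^p error decreases while the L^{p'} error grows with n.
    print(n, lp_err(f, p, b), lp_err(f, p_prime, b))
```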