Universal Approximation Property of
Neural Ordinary Differential Equations
Abstract
Neural ordinary differential equations (NODEs) are an invertible neural network architecture that is promising for its free-form Jacobian and the availability of a tractable Jacobian determinant estimator. Recently, the representation power of NODEs has been partly uncovered: they form an $L^p$-universal approximator for continuous maps under certain conditions. However, the $L^p$-universality may fail to guarantee an approximation for the entire input domain, as it may hold even if the approximator largely differs from the target function on a small region of the input space. To further uncover the potential of NODEs, we show their stronger approximation property, namely the $\sup$-universality for approximating a large class of diffeomorphisms. It is shown by leveraging a structure theorem of the diffeomorphism group, and the result complements the existing literature by establishing a fairly large set of mappings that NODEs can approximate with a stronger guarantee.
1 Introduction
Neural ordinary differential equations (NODEs) [1] are a family of deep neural networks that indirectly model functions by transforming an input vector through an ordinary differential equation (ODE). When viewed as an invertible neural network (INN) architecture, NODEs have the advantage of a free-form Jacobian, i.e., they are invertible without restricting the Jacobian's structure, unlike other INN architectures [2]. Owing to this out-of-the-box invertibility and the availability of a tractable unbiased estimator of the Jacobian determinant [3], NODEs have been used for constructing continuous normalizing flows for generative modeling and density estimation [1, 3, 4].
Recently, the representation power of NODEs has been partly uncovered in [5]: namely, a sufficient condition has been established for a family of NODEs to be an $L^p$-universal approximator (see Definition 4) for continuous maps. However, the universal approximation property with respect to the $L^p$-norm can be insufficient, as it does not guarantee an approximation for the entire input domain: the $L^p$-approximation may still hold even if the approximator largely differs from the target function on a small region of the input space.
In this work, we elucidate that the NODEs are a $\sup$-universal approximator (Definition 4) for a fairly large class of diffeomorphisms, i.e., smooth invertible maps with smooth inverse. Our result establishes a function class that can be approximated using NODEs with a stronger guarantee than in the existing literature [5]. We prove the result by using a structure theorem from differential geometry to represent a diffeomorphism as a finite composition of flow endpoints, i.e., diffeomorphisms that are smooth transformations of the identity map. The NODEs are themselves examples of flow endpoints, and we derive the main result by approximating the flow endpoints by the NODEs.
2 Preliminaries and goal
In this section, we define the family of NODEs considered in the present paper as well as the notion of universality.
2.1 Neural ordinary differential equations (NODEs)
Let $\mathbb{R}$ (resp. $\mathbb{N}$) denote the set of all real values (resp. all positive integers). Throughout the paper, we fix $d \in \mathbb{N}$. Let $\mathrm{Lip}$ denote the set of all Lipschitz continuous maps from $\mathbb{R}^d$ to $\mathbb{R}^d$. It is known that any autonomous ODE (i.e., one that is defined by a time-invariant vector field) with a Lipschitz continuous vector field has a solution and that the solution is unique:
Fact 1 (Existence and uniqueness of a global solution to an ODE [6]).
Let $f \in \mathrm{Lip}$. Then, a solution to the following ordinary differential equation exists and it is unique:
$$\dot{z}(t) = f(z(t)), \quad z(0) = x, \quad t \in [0, 1], \tag{1}$$
where $x \in \mathbb{R}^d$, and $\dot{z}$ denotes the derivative of $z$.
In view of Fact 1, we use the following notation.
Definition 1.
For $f \in \mathrm{Lip}$, $t \in [0, 1]$, and $x \in \mathbb{R}^d$, we define $\Phi_t(f)(x) := z_x(t)$, where $z_x$ is the unique solution to Eq. (1) with the initial value $z_x(0) = x$, i.e., the solution to the initial value problem evaluated at time $t$.
Definition 2 (Autonomous-ODE flow endpoints; [5]).
For $\mathcal{F} \subset \mathrm{Lip}$, we define
$$\mathcal{NODE}_{\mathcal{F}} := \{\Phi_1(f) : f \in \mathcal{F}\}.$$
Definition 3 ($\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$).
Let $\mathrm{Aff}$ denote the group of all invertible affine maps on $\mathbb{R}^d$, and let $\mathcal{F} \subset \mathrm{Lip}$. Define the invertible neural network architecture based on NODEs as
$$\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}} := \{W_k \circ g_k \circ \cdots \circ W_1 \circ g_1 : k \in \mathbb{N},\ g_i \in \mathcal{NODE}_{\mathcal{F}},\ W_i \in \mathrm{Aff}\}.$$
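For concreteness, a NODE layer $\Phi_1(f)$ can be realized numerically by integrating Eq. (1) up to $t = 1$. The following Python sketch (ours; the vector field, integrator, and tolerances are illustrative choices rather than part of the formal setup) composes such layers with invertible affine maps as in Definition 3, and checks invertibility by integrating the reversed vector field, using the fact that $\Phi_1(f)^{-1} = \Phi_1(-f)$ for autonomous ODEs:

```python
import numpy as np
from scipy.integrate import solve_ivp

def node_layer(f):
    """Return Phi_1(f): integrate dz/dt = f(z) from t = 0 to t = 1."""
    def phi(x):
        sol = solve_ivp(lambda t, z: f(z), (0.0, 1.0), x, rtol=1e-8, atol=1e-10)
        return sol.y[:, -1]
    return phi

def affine_layer(W, b):
    """Invertible affine map x -> W x + b (W must be nonsingular)."""
    assert np.linalg.det(W) != 0
    return lambda x: W @ x + b

def f(z):
    # A Lipschitz vector field; any element of Lip would do here.
    return np.tanh(z) - 0.5 * z

# An element of INN_{NODE_F}: W2 o Phi_1(f) o W1.
W1 = affine_layer(np.array([[1.0, 0.5], [0.0, 1.0]]), np.zeros(2))
W2 = affine_layer(np.array([[2.0, 0.0], [0.0, 1.0]]), np.ones(2))
g = node_layer(f)

x = np.array([0.3, -0.7])
y = W2(g(W1(x)))

# Invertibility of the NODE layer: Phi_1(f)^{-1} = Phi_1(-f).
x_rec = node_layer(lambda z: -f(z))(g(W1(x)))
print(np.allclose(x_rec, W1(x), atol=1e-6))  # True up to integration error
```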
2.2 Goal: the notions of universality and their relations
Here, we define the notions of universality. Let $p \in [1, \infty)$. For a subset $K \subset \mathbb{R}^d$ and a map $f : K \to \mathbb{R}^d$, we define $\|f\|_{\sup, K} := \sup_{x \in K} \|f(x)\|$, where $\|\cdot\|$ denotes the Euclidean norm. Also, for a measurable map $f : K \to \mathbb{R}^d$ on a measurable subset $K \subset \mathbb{R}^d$, we define $\|f\|_{p, K} := \left( \int_K \|f(x)\|^p \,\mathrm{d}x \right)^{1/p}$.
Definition 4 ($L^p$-universality and $\sup$-universality).
Let $\mathcal{M}$ be a model, which is a set of measurable mappings from $\mathbb{R}^d$ to $\mathbb{R}^d$. Let $\mathcal{F}$ be a set of measurable mappings $f : U_f \to \mathbb{R}^d$, where $U_f$ is a measurable subset of $\mathbb{R}^d$ which may depend on $f$. We say that $\mathcal{M}$ is an $L^p$-universal approximator or has the $L^p$-universal approximation property for $\mathcal{F}$ if for any $f \in \mathcal{F}$, any $\varepsilon > 0$, and any compact subset $K \subset U_f$, there exists $g \in \mathcal{M}$ such that $\|f - g\|_{p, K} < \varepsilon$. The $\sup$-universal approximation property is defined by replacing $\|\cdot\|_{p, K}$ with $\|\cdot\|_{\sup, K}$ in the above.
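Note that on a compact set, the $\sup$-norm dominates every $L^p$-norm: for a measurable $f$ and a compact $K$ with Lebesgue measure $\lambda(K)$,
$$\|f\|_{p, K} = \left( \int_K \|f(x)\|^p \,\mathrm{d}x \right)^{1/p} \le \lambda(K)^{1/p}\, \|f\|_{\sup, K}.$$
Hence, the $\sup$-universal approximation property implies the $L^p$-universal approximation property for every $p \in [1, \infty)$ simultaneously, while the converse does not hold (see Appendix C).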
Our goal.
Our goal is to elucidate the representation power of INNs composed of NODEs by proving the $\sup$-universality of $\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ for a fairly large class of diffeomorphisms, i.e., smooth invertible functions with smooth inverse.
3 Main result
In this section, we present our main result, Theorem 1.
First, we define the following class of invertible maps, which will be our target to be approximated.
Definition 5 ($C^2$-diffeomorphisms: $\mathcal{D}^2$).
We define $\mathcal{D}^2$ as the set of all $C^2$-diffeomorphisms $f : U_f \to \mathbb{R}^d$, where $U_f \subset \mathbb{R}^d$ is an open set that is $C^2$-diffeomorphic to $\mathbb{R}^d$ and may depend on $f$.
The set $\mathcal{D}^2$ is a fairly large class: it contains any $C^2$-diffeomorphism defined on the entire $\mathbb{R}^d$, on an open convex set, or more generally, on a star-shaped open set.
Now, we state our main result, which establishes a class of maps that the invertible neural networks based on NODEs can approximate with respect to the $\sup$-norm.
Theorem 1 (Universality of NODEs).
Assume $\mathcal{F} \subset \mathrm{Lip}$ is a $\sup$-universal approximator for $\mathrm{Lip}$. Then, $\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ is a $\sup$-universal approximator for $\mathcal{D}^2$.
Examples of such $\mathcal{F}$ include the multi-layer perceptrons with finite weights and Lipschitz-continuous activation functions such as the rectified linear unit (ReLU) [7, 1], as well as the Lipschitz networks [8, Theorem 3]; a minimal illustration follows.
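As a concrete sketch of such an $\mathcal{F}$ (ours; a minimal illustration, not the exact constructions of [7, 8]): a one-hidden-layer perceptron with finite weights and the 1-Lipschitz ReLU activation is Lipschitz, with constant at most the product of the weight matrices' operator norms:

```python
import numpy as np

class MLPVectorField:
    """x -> W2 relu(W1 x + b1) + b2; an element of Lip."""
    def __init__(self, W1, b1, W2, b2):
        self.W1, self.b1, self.W2, self.b2 = W1, b1, W2, b2

    def __call__(self, x):
        return self.W2 @ np.maximum(self.W1 @ x + self.b1, 0.0) + self.b2

    def lipschitz_bound(self):
        # ReLU is 1-Lipschitz, so the spectral norms of the weights multiply.
        return np.linalg.norm(self.W2, 2) * np.linalg.norm(self.W1, 2)

rng = np.random.default_rng(0)
f = MLPVectorField(rng.normal(size=(8, 2)), rng.normal(size=8),
                   rng.normal(size=(2, 8)), rng.normal(size=2))
print(f.lipschitz_bound())  # finite, so Phi_1(f) is well-defined by Fact 1
```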
Proof outline.
To prove Theorem 1, we take a similar strategy to that of Theorem 1 of [9], but with a major modification to adapt it to our problem. First, the approximation target is reduced from $\mathcal{D}^2$ to the set of compactly-supported diffeomorphisms from $\mathbb{R}^d$ to $\mathbb{R}^d$, denoted by $\mathrm{Diff}^2_c(\mathbb{R}^d)$, by applying Fact 2 in Appendix A.1. Then, it is shown that we can represent each element of $\mathrm{Diff}^2_c(\mathbb{R}^d)$ as a finite composition of flow endpoints (Definition 7 in Appendix A.1), each of which can be approximated by a NODE. The decomposition into flow endpoints is realized by relying on a structure theorem of the diffeomorphism group (Fact 4 in Appendix A.1) attributed to Herman, Thurston [10], Epstein [11], and Mather [12, 13]. Note that we require a different definition of flow endpoints (Definition 7 in Appendix A.1) from that employed in [9, Corollary 2] in order to incorporate sufficient smoothness of the underlying flows.
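Schematically, the proof chains the following representations and approximations (notation as in Appendix A):
$$f\big|_K = (W \circ h)\big|_K, \qquad h = h_k \circ \cdots \circ h_1, \qquad h_i = \Phi_1(f_i) \approx \Phi_1(g_i) \ \text{ with } g_i \in \mathcal{F},$$
so that $W \circ \Phi_1(g_k) \circ \cdots \circ \Phi_1(g_1) \in \mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ approximates $f$ on $K$ in the $\sup$-norm (by Fact 3 in Appendix A.1).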
4 Related work and Discussion
In this section, we overview the existing literature on the representation power of NODEs to provide the context of the present paper.
$L^p$-universal approximation property of NODEs.
[5] considered NODEs capped with a terminal family to map the output of NODEs to a vector of the desired output dimension, and its Proposition 3.8 showed that the model class has the $L^p$-universality for the set of all continuous maps from $\mathbb{R}^d$ to $\mathbb{R}^m$ ($d \geq 2$), under a certain sufficient condition. In comparison to our result here, the result of [5] established the universality of NODEs for a larger target function class (namely, continuous maps) with a weaker notion of approximation (namely, $L^p$-universality).
Limitations on the representation power of NODEs.
[14] formulated its Theorem 1 to show that NODEs are not universal approximators by presenting a function that a NODE cannot approximate. The existence of this counterexample does not contradict our result because our approximation target is different from the function class considered in [14]: the class in [14] can contain discontinuous maps, whereas the elements of $\mathcal{D}^2$ are smooth and invertible.
Universality of augmented NODEs.
As a device to enhance the representation power of NODEs, increasing the dimensionality and padding zeros to the inputs/outputs has been explored [15, 14]. [14] showed that the augmented NODEs (ANODEs) are universal approximators for homeomorphisms. However, the approach has the limitation that it can undermine the invertibility of the model: unless the model is ideally trained so that it always outputs zeros in the zero-padded dimensions, the model can no longer represent an invertible map operating on the original dimensionality, as illustrated below. In contrast, the present work explores the universal approximation property of NODEs that is achieved without introducing the complications arising from dimensionality augmentation.
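To illustrate the invertibility issue with a minimal example of ours (not drawn from [14, 15]): let $d = 1$, pad one zero dimension, and suppose the augmented model realizes the 90-degree rotation
$$F\begin{pmatrix} x \\ 0 \end{pmatrix} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} x \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ x \end{pmatrix},$$
which is perfectly invertible on $\mathbb{R}^2$. The induced map on the original coordinate, $x \mapsto 0$, is constant and hence non-invertible: all the information has moved into the padded dimension, whose output is discarded. Invertibility on the original dimensionality is retained only if the model outputs exactly zero in the padded dimensions.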
Relation between $\mathcal{NODE}_{\mathrm{Lip}}$ and time-dependent NODEs.
Our result can be readily extended to the design choice of NODEs that includes the time index $t$ as an argument of the vector field. It can be done by limiting our attention to the subset of the considered class of vector fields consisting of all time-invariant ones, as follows. Let $f : \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d$ be continuous and such that there exists a continuous function $L : [0, 1] \to [0, \infty)$ satisfying
$$\|f(x, t) - f(y, t)\| \le L(t)\, \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d \text{ and } t \in [0, 1].$$
Then, the initial value problem
$$\dot{z}(t) = f(z(t), t), \quad z(0) = x$$
has a solution and it is unique [6], analogously to Fact 1. Then, given a set $\mathcal{G}$ of such mappings, we can consider its subset $\mathcal{G}_0 \subset \mathcal{G}$ that contains only the time-invariant elements, i.e., those $f$ such that for any $x \in \mathbb{R}^d$, the map $t \mapsto f(x, t)$ is a constant mapping. Such an $f$ is identified with an element of $\mathrm{Lip}$ with $\sup_{t \in [0, 1]} L(t)$ being a Lipschitz constant. Then, we can apply Theorem 1 to $\mathcal{F} := \mathcal{G}_0$ and its induced $\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$.
5 Conclusion
In this paper, we uncovered the $\sup$-universality of the INNs composed of NODEs for approximating a large class of diffeomorphisms, $\mathcal{D}^2$. This result complements the existing literature that showed the weaker approximation property of NODEs, namely the $L^p$-universality, for general continuous maps. Whether the $\sup$-universality holds for a larger class of maps than $\mathcal{D}^2$ is an important research question for future work. Also, it is important for future work to quantitatively evaluate how many layers of NODEs are required to approximate a given diffeomorphism with a specified smoothness, such as a bi-Lipschitz constant, in order to evaluate the efficiency of the approximation.
Acknowledgments and Disclosure of Funding
The authors would like to thank the anonymous reviewers for the insightful discussions. This work was supported by RIKEN Junior Research Associate Program. TT was supported by Masason Foundation. II and MI were supported by CREST:JPMJCR1913.
References
- [1] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt and David K. Duvenaud “Neural Ordinary Differential Equations” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018, pp. 6571–6583 URL: http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf
- [2] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed and Balaji Lakshminarayanan “Normalizing Flows for Probabilistic Modeling and Inference” In arXiv:1912.02762 [cs, stat], 2019 arXiv: http://arxiv.org/abs/1912.02762
- [3] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever and David Duvenaud “FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models” In 7th International Conference on Learning Representations New Orleans, LA, USA: OpenReview.net, 2019 URL: https://openreview.net/forum?id=rJxgknCcK7
- [4] Chris Finlay, Jorn-Henrik Jacobsen, Levon Nurbekyan and Adam M Oberman “How to Train Your Neural ODE: The World of Jacobian and Kinetic Regularization” In Proceedings of the 37th International Conference on Machine Learning, 2020
- [5] Qianxiao Li, Ting Lin and Zuowei Shen “Deep Learning via Dynamical Systems: An Approximation Perspective” In arXiv:1912.10382 [cs, math, stat], 2020 arXiv: http://arxiv.org/abs/1912.10382
- [6] W. Derrick and L. Janos “A Global Existence and Uniqueness Theorem for Ordinary Differential Equations” In Canadian Mathematical Bulletin 19.1, 1976, pp. 105–107 DOI: 10.4153/CMB-1976-015-7
- [7] Yann LeCun, Yoshua Bengio and Geoffrey Hinton “Deep Learning” In Nature 521.7553, 2015, pp. 436–444 URL: http://www.nature.com/articles/nature14539
- [8] Cem Anil, James Lucas and Roger Grosse “Sorting out Lipschitz Function Approximation” In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2019, pp. 291–301 URL: http://proceedings.mlr.press/v97/anil19a.html
- [9] Takeshi Teshima, Isao Ishikawa, Koichi Tojo, Kenta Oono, Masahiro Ikeda and Masashi Sugiyama “Coupling-Based Invertible Neural Networks Are Universal Diffeomorphism Approximators” In Advances in Neural Information Processing Systems 33, in press
- [10] William Thurston “Foliations and Groups of Diffeomorphisms” In Bulletin of the American Mathematical Society 80.2, 1974, pp. 304–307 URL: https://projecteuclid.org:443/euclid.bams/1183535407
- [11] D. B. A. Epstein “The Simplicity of Certain Groups of Homeomorphisms” In Compositio Mathematica 22.2, 1970, pp. 165–173
- [12] John N. Mather “Commutators of Diffeomorphisms” In Commentarii Mathematici Helvetici 49.1, 1974, pp. 512–528 URL: https://eudml.org/doc/139598
- [13] John N. Mather “Commutators of Diffeomorphisms: II” In Commentarii Mathematici Helvetici 50.1, 1975, pp. 33–40 DOI: 10.1007/BF02565731
- [14] Han Zhang, Xi Gao, Jacob Unterman and Tomasz Arodz “Approximation Capabilities of Neural ODEs and Invertible Residual Networks” In Proceedings of the 37th International Conference on Machine Learning 119 Vienna, Austria: PMLR, 2020 URL: https://proceedings.icml.cc/paper/2020/hash/c32d9bf27a3da7ec8163957080c8628e
- [15] Emilien Dupont, Arnaud Doucet and Yee Whye Teh “Augmented Neural ODEs” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019, pp. 3140–3150 URL: http://papers.nips.cc/paper/8577-augmented-neural-odes.pdf
- [16] Serge Lang “Differential Manifolds” New York: Springer-Verlag, 1985 DOI: 10.1007/978-1-4684-0265-0
- [17] Philip Hartman “Ordinary Differential Equations” 38, Classics in Applied Mathematics Society for Industrial and Applied Mathematics, 2002
- [18] T. H. Gronwall “Note on the Derivatives with Respect to a Parameter of the Solutions of a System of Differential Equations” In Annals of Mathematics 20.4 Annals of Mathematics, 1919, pp. 292–296 DOI: 10.2307/1967124
This is the Supplementary Material for “Universal approximation property of neural ordinary differential equations.” Table 1 summarizes the abbreviations and the symbols used in the paper.
Abbreviation/Notation | Meaning
---|---
INN | Invertible neural networks
NODE | Neural ordinary differential equations
$\mathrm{Aff}$ | Set of invertible affine transformations
$\Phi_t(f)(x)$ | The (unique) solution to an initial value problem evaluated at time $t$
$\mathcal{NODE}_{\mathcal{F}}$ | Set of NODEs obtained from the Lipschitz continuous vector fields in $\mathcal{F}$
$\mathrm{Lip}$ | The set of all Lipschitz continuous maps from $\mathbb{R}^d$ to $\mathbb{R}^d$
$\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ | INNs composed of $\mathrm{Aff}$ and NODEs parametrized by $\mathcal{F}$
$d$ | Dimensionality of the Euclidean space under consideration
$\mathcal{D}^2$ | Set of all $C^2$-diffeomorphisms with $C^2$-diffeomorphic domains
$\mathrm{Diff}^r_c(\mathbb{R}^d)$ | Group of compactly-supported $C^r$-diffeomorphisms on $\mathbb{R}^d$ ($r \in \mathbb{N} \cup \{\infty\}$)
$\|\cdot\|$ | Euclidean norm
$\|\cdot\|_{\mathrm{op}}$ | Operator norm
$\|\cdot\|_{\sup, K}$ | Supremum norm on a subset $K$
$\|\cdot\|_{p, K}$ | $L^p$-norm on a subset $K$
$\mathrm{Id}$ | Identity map
$\operatorname{supp}$ | Support of a map
Appendix A Proof of Theorem 1
Here, we provide a proof of Theorem 1. In Section A.1, we display the known facts and show the lemmas used for the proof. In Section A.2, we prove Theorem 1.
A.1 Lemmas and known facts
We use the following definition and facts from [9].
Definition 6 (Compactly supported diffeomorphism).
We use $\mathrm{Diff}^r_c(\mathbb{R}^d)$ to denote the set of all compactly supported $C^r$-diffeomorphisms ($r \in \mathbb{N} \cup \{\infty\}$) from $\mathbb{R}^d$ to $\mathbb{R}^d$. Here, we say a diffeomorphism $f$ on $\mathbb{R}^d$ is compactly supported if there exists a compact subset $K \subset \mathbb{R}^d$ such that for any $x \notin K$, $f(x) = x$. We regard $\mathrm{Diff}^r_c(\mathbb{R}^d)$ as a group whose group operation is function composition.
The following fact enables us to reduce the approximation problem for $\mathcal{D}^2$ to that for $\mathrm{Diff}^2_c(\mathbb{R}^d)$.
Fact 2 (Lemma 5 of [9]).
Let $f$ be an element of $\mathcal{D}^2$, and let $K \subset U_f$ be a compact set. Then, there exist $h \in \mathrm{Diff}^2_c(\mathbb{R}^d)$ and an affine transform $W \in \mathrm{Aff}$ such that
$$f|_K = (W \circ h)|_K.$$
The following fact enables the component-wise approximation, i.e., given a transformation that is represented by a composition of some transformations, we can approximate it by approximating each constituent and composing them.
Fact 3 (Compatibility of composition and approximation; Proposition 6 of [9]).
Let $\mathcal{M}$ be a set of locally bounded maps from $\mathbb{R}^d$ to $\mathbb{R}^d$, and let $f_1, \ldots, f_k$ be continuous maps from $\mathbb{R}^d$ to $\mathbb{R}^d$. Assume that for any $\varepsilon > 0$ and any compact set $K \subset \mathbb{R}^d$, there exist $g_1, \ldots, g_k \in \mathcal{M}$ such that, for $i = 1, \ldots, k$, $\|f_i - g_i\|_{\sup, K} < \varepsilon$. Then, for any $\varepsilon > 0$ and any compact set $K \subset \mathbb{R}^d$, there exist $g_1, \ldots, g_k \in \mathcal{M}$ such that
$$\|f_k \circ \cdots \circ f_1 - g_k \circ \cdots \circ g_1\|_{\sup, K} < \varepsilon.$$
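Informally, for $k = 2$, the key step is the triangle inequality (a proof sketch of ours; the full proof in [9] carefully handles the domains involved):
$$\|f_2 \circ f_1 - g_2 \circ g_1\|_{\sup, K} \le \|f_2 \circ f_1 - f_2 \circ g_1\|_{\sup, K} + \|f_2 \circ g_1 - g_2 \circ g_1\|_{\sup, K}.$$
The first term is small by the uniform continuity of $f_2$ on a compact neighborhood of $f_1(K)$ once $\|f_1 - g_1\|_{\sup, K}$ is small, and the second term is at most $\|f_2 - g_2\|_{\sup, g_1(K)}$, which is small once $g_1(K)$ is contained in a fixed compact set.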
The following fact is attributed to Herman, Thurston [10], Epstein [11], and Mather [12, 13]. See Fact 2 of [9] and the remarks therein for details. Let $\mathrm{Id}$ denote the identity map.
Fact 4 (Fact 2 of [9]).
If $r \neq d + 1$, the group $\mathrm{Diff}^r_c(\mathbb{R}^d)$ is simple, i.e., any normal subgroup $H \subset \mathrm{Diff}^r_c(\mathbb{R}^d)$ is either $\{\mathrm{Id}\}$ or $\mathrm{Diff}^r_c(\mathbb{R}^d)$.
Next, we define a subset of $\mathrm{Diff}^r_c(\mathbb{R}^d)$ called the flow endpoints. In Lemma 1, it is shown that the set of flow endpoints generates a non-trivial normal subgroup of $\mathrm{Diff}^r_c(\mathbb{R}^d)$. Therefore, by combining it with Fact 4, we can represent any element of $\mathrm{Diff}^r_c(\mathbb{R}^d)$ ($r \neq d + 1$) as a finite composition of flow endpoints, each of which can be approximated by the NODEs.
While Corollary 2 of [9] also defined a set of flow endpoints in $\mathrm{Diff}^r_c(\mathbb{R}^d)$, it differs from the one defined here, which is tailored for our purpose. The two definitions can be interpreted as describing two different generators of the same group $\mathrm{Diff}^r_c(\mathbb{R}^d)$. Let $\operatorname{supp}$ denote the support of a map (for a diffeomorphism $h$, $\operatorname{supp} h$ is the closure of $\{x \in \mathbb{R}^d : h(x) \neq x\}$).
Definition 7 (Flow endpoints in $\mathrm{Diff}^r_c(\mathbb{R}^d)$).
Let $r \in \mathbb{N} \cup \{\infty\}$. Let $\mathcal{S}^r$ be the set of diffeomorphisms of the form $\phi(1, \cdot)$ for some map $\phi : I \times \mathbb{R}^d \to \mathbb{R}^d$ such that

- $I$ is an open interval containing $[0, 1]$,
- $\phi(0, \cdot) = \mathrm{Id}$,
- $\phi(t, \cdot) \in \mathrm{Diff}^r_c(\mathbb{R}^d)$ for any $t \in I$,
- $\phi(s, \cdot) \circ \phi(t, \cdot) = \phi(s + t, \cdot)$ for any $s, t \in I$ with $s + t \in I$,
- $\phi$ is $C^2$ on $I \times \mathbb{R}^d$,
- There exists a compact subset $K \subset \mathbb{R}^d$ such that $\bigcup_{t \in I} \operatorname{supp} \phi(t, \cdot) \subset K$.
The difference between Definition 7 and the one in Corollary 2 of [9] mainly lies in the last two conditions. Technically, these two conditions are used in Section A.2 for showing that the partial derivative of $\phi$ in $t$ at $t = 0$ is Lipschitz continuous.
Lemma 1 (Modified Corollary 2 of [9]).
Let $r \in \mathbb{N} \cup \{\infty\}$ and let $\mathcal{S}^r$ be the set of all flow endpoints. Then, the subset of $\mathrm{Diff}^r_c(\mathbb{R}^d)$ defined by
$$H := \{h_1 \circ \cdots \circ h_k : k \in \mathbb{N},\ h_i \in \mathcal{S}^r\}$$
forms a subgroup of $\mathrm{Diff}^r_c(\mathbb{R}^d)$ and it is a non-trivial normal subgroup.
Proof of Lemma 1.
First, we prove that $H$ forms a subgroup of $\mathrm{Diff}^r_c(\mathbb{R}^d)$. By definition, for any $g, h \in H$, it holds that $g \circ h \in H$. Also, $H$ is closed under inversion; to see this, it suffices to show that $\mathcal{S}^r$ is closed under inversion. Let $h = \phi(1, \cdot) \in \mathcal{S}^r$. Consider the map $\psi$ defined by $\psi(t, x) := \phi(t, \cdot)^{-1}(x)$. It is easy to confirm that $\psi$ satisfies the conditions of Definition 7, hence $h^{-1} = \psi(1, \cdot)$ is an element of $\mathcal{S}^r$. Note that $\psi$ is confirmed to be $C^2$ on $I \times \mathbb{R}^d$ by applying the inverse function theorem to $(t, x) \mapsto (t, \phi(t, x))$.
Next, we prove that $H$ is normal. To show that the subgroup $H$ generated by $\mathcal{S}^r$ is normal, it suffices to show that $\mathcal{S}^r$ is closed under conjugation. Take any $h \in \mathcal{S}^r$ and $g \in \mathrm{Diff}^r_c(\mathbb{R}^d)$, and let $\phi$ be a flow associated with $h$. Then, the function $\psi$ defined by $\psi(t, x) := (g \circ \phi(t, \cdot) \circ g^{-1})(x)$ is a flow associated with $g \circ h \circ g^{-1}$ satisfying the conditions in Definition 7, which implies $g \circ h \circ g^{-1} \in \mathcal{S}^r$, i.e., $\mathcal{S}^r$ is closed under conjugation.
Next, we prove that $H$ is non-trivial by constructing an element of $\mathcal{S}^r$ that is not the identity element. First, consider the case $d = 1$. Let $f : \mathbb{R} \to \mathbb{R}$ be a non-constant $C^\infty$-function with compact support, for example,
$$f(x) := \begin{cases} \exp\left( - \dfrac{1}{1 - x^2} \right) & \text{if } |x| < 1, \\ 0 & \text{otherwise}, \end{cases}$$
which is a $C^\infty$-function on $\mathbb{R}$ with a compact support. Since $f$ is Lipschitz continuous and $C^\infty$, there exists a flow map $\phi$ for the ODE $\dot{z}(t) = f(z(t))$ that is a $C^\infty$-function over $\mathbb{R} \times \mathbb{R}$; see Fact 1 and [17, Chapter V, Corollary 4.1]. Let $K \subset \mathbb{R}$ be a compact subset that contains $\operatorname{supp} f$. Then, by considering the ordinary differential equation by which $\phi$ is defined, we see that $\phi(t, x) = x$ for any $x \notin \operatorname{supp} f$, and also that $\operatorname{supp} \phi(t, \cdot) \subset K$. We also have $\phi(t, \cdot) \in \mathrm{Diff}^\infty_c(\mathbb{R})$ for any $t \in \mathbb{R}$. In particular, we have $\phi(1, \cdot) \in \mathcal{S}^r$. Since $f \not\equiv 0$, $\phi(1, \cdot)$ is not an identity map and thus $H$ is not trivial. Next, we consider the case $d \geq 2$. Take a $C^\infty$-function $a : \mathbb{R} \to \mathbb{R}$ with compact support and $a \not\equiv 0$, and a nonzero skew-symmetric matrix $A \in \mathbb{R}^{d \times d}$ (i.e., $A^\top = -A$), and let $b(x) := a(\|x\|^2)$. We define a $C^\infty$-map $\psi : \mathbb{R} \times \mathbb{R}^d \to \mathbb{R}^d$ by
$$\psi(t, x) := e^{t\, b(x) A} x.$$
Since $e^{t b(x) A}$ is an orthogonal matrix for any $t \in \mathbb{R}$ and $x \in \mathbb{R}^d$, we have $\|\psi(t, x)\| = \|x\|$; hence $b(\psi(t, x)) = b(x)$, and $\psi$ is a $C^\infty$-flow on $\mathbb{R}^d$. Now, it is enough to show that there exists a compact set $K$ satisfying $\bigcup_{t \in \mathbb{R}} \operatorname{supp} \psi(t, \cdot) \subset K$. Let $K := \{x \in \mathbb{R}^d : \|x\|^2 \in \operatorname{supp} a\}$, which is compact. Then the inclusion $\operatorname{supp} \psi(t, \cdot) \subset K$ holds for any $t \in \mathbb{R}$ since $\psi(t, x) = x$ for $x \notin K$. ∎
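The flow property and the compact support of $\psi$ in the case $d \geq 2$ can be checked numerically. Below is a quick sanity check of ours; the bump function $a$, the matrix $A$, and the tolerances are arbitrary illustrative choices:

```python
import numpy as np
from scipy.linalg import expm

def a(r2):
    # Smooth bump in the squared radius; supported on [0, 1].
    return np.where(r2 < 1.0, np.exp(-1.0 / np.maximum(1.0 - r2, 1e-300)), 0.0)

A = np.array([[0.0, -1.0], [1.0, 0.0]])  # nonzero skew-symmetric matrix (d = 2)

def psi(t, x):
    # psi(t, x) = exp(t * a(||x||^2) * A) x
    return expm(t * a(x @ x) * A) @ x

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=2)
    s, t = rng.normal(size=2)
    assert np.allclose(psi(s, psi(t, x)), psi(s + t, x))             # additivity
    assert np.isclose(np.linalg.norm(psi(t, x)), np.linalg.norm(x))  # isometry
    z = np.array([2.0, 0.0])           # ||z||^2 = 4 lies outside supp a
    assert np.allclose(psi(t, z), z)   # psi is the identity outside K
print("flow property, norm preservation, and compact support verified")
```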
The following lemma allows us to approximate an autonomous-ODE flow endpoint by approximating the vector field of the differential equation. See Definition 2 for the definition of $\mathcal{NODE}_{\mathcal{F}}$.
Lemma 2 (Approximation of Autonomous-ODE flow endpoints).
Assume $\mathcal{F} \subset \mathrm{Lip}$ is a $\sup$-universal approximator for $\mathrm{Lip}$. Then, $\mathcal{NODE}_{\mathcal{F}}$ is a $\sup$-universal approximator for $\mathcal{NODE}_{\mathrm{Lip}}$.
Proof.
Let $h \in \mathcal{NODE}_{\mathrm{Lip}}$. Then, by definition, there exists $f \in \mathrm{Lip}$ such that $h = \Phi_1(f)$. Let $L > 0$ denote a Lipschitz constant of $f$. In the following, we approximate $h$ by approximating $f$ using an element of $\mathcal{F}$.

Let $\varepsilon > 0$, and let $K$ be a compact subset of $\mathbb{R}^d$. We show that there exists $g \in \mathcal{F}$ such that $\|\Phi_1(f) - \Phi_1(g)\|_{\sup, K} < \varepsilon$. Note that $\Phi_1(g)$ is well-defined because $g \in \mathcal{F} \subset \mathrm{Lip}$. Define
$$\tilde K := \{\Phi_t(f)(x) : t \in [0, 1],\ x \in K\}, \qquad K' := \{x \in \mathbb{R}^d : \operatorname{dist}(x, \tilde K) \le 1\},$$
where $\operatorname{dist}(x, \tilde K) := \inf_{y \in \tilde K} \|x - y\|$. Then, $K'$ is compact. This follows from the compactness of $\tilde K$: (i) $K'$ is bounded since $\tilde K$ is bounded, and (ii) it is closed since the function $x \mapsto \operatorname{dist}(x, \tilde K)$ is continuous and hence $K'$ is the inverse image of a closed interval by a continuous map.

Since $\mathcal{F}$ is assumed to be a $\sup$-universal approximator for $\mathrm{Lip}$, for any $\varepsilon' > 0$, we can take $g \in \mathcal{F}$ such that $\|f - g\|_{\sup, K'} < \varepsilon'$. Let $\varepsilon' > 0$ be such that $\varepsilon' e^{L} < \min\{\varepsilon, 1\}$, and take such a $g$.

Fix $x \in K$ and define $z(t) := \Phi_t(f)(x)$ and $\tilde z(t) := \Phi_t(g)(x)$. Let $u(t) := \|\tilde z(t) - z(t)\|$, and we show that
$$u(t) \le \varepsilon' e^{L t}$$
holds for all $t \in [0, 1]$. We prove this by contradiction. Suppose that there exists $t \in [0, 1]$ for which the inequality does not hold. Then, the set $T := \{t \in [0, 1] : u(t) > \varepsilon' e^{L t}\}$ is not empty, and thus $t_0 := \inf T \in [0, 1]$ exists. For this $t_0$, we show both $u(t_0) \ge \varepsilon' e^{L t_0}$ and $u(t_0) \le \varepsilon' t_0 e^{L t_0}$. First, we have
$$u(t_0) = \left\| \int_0^{t_0} \big( g(\tilde z(s)) - f(z(s)) \big)\, \mathrm{d}s \right\| \le \int_0^{t_0} \|g(\tilde z(s)) - f(\tilde z(s))\|\, \mathrm{d}s + L \int_0^{t_0} u(s)\, \mathrm{d}s.$$
The first term on the right-hand side can be bounded as
$$\int_0^{t_0} \|g(\tilde z(s)) - f(\tilde z(s))\|\, \mathrm{d}s \le \varepsilon' t_0$$
because of the following argument. If $t_0 = 0$, then both sides equal zero, hence it holds with equality. If $t_0 > 0$, then for any $s \in [0, t_0)$, we have $\tilde z(s) \in K'$ because $u(s) \le \varepsilon' e^{L s} \le \varepsilon' e^{L} < 1$ implies $\operatorname{dist}(\tilde z(s), \tilde K) < 1$. In this case, $\|f - g\|_{\sup, K'} < \varepsilon'$ implies the inequality. Therefore, we have
$$u(t_0) \le \varepsilon' t_0 + L \int_0^{t_0} u(s)\, \mathrm{d}s.$$
Now, by applying Grönwall's inequality [18], we obtain
$$u(t_0) \le \varepsilon' t_0\, e^{L t_0}.$$
On the other hand, by the definition of $t_0$ and the continuity of $u$, we have $u(t_0) \ge \varepsilon' e^{L t_0}$. These two inequalities contradict: if $t_0 < 1$, then $\varepsilon' t_0 e^{L t_0} < \varepsilon' e^{L t_0}$, and if $t_0 = 1$, then $T = \{1\}$ and hence the strict inequality $u(t_0) > \varepsilon' e^{L t_0}$ holds.

Therefore, $u(1) = \|\Phi_1(f)(x) - \Phi_1(g)(x)\| \le \varepsilon' e^{L}$ holds for every $x \in K$. Since $\varepsilon' e^{L} < \varepsilon$, the right-hand side is smaller than $\varepsilon$. ∎
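The quantitative content of this proof, namely that an $\varepsilon'$-error in the vector field inflates to at most $\varepsilon' e^{L}$ at the flow endpoint, can be illustrated numerically. The following sketch of ours uses arbitrary vector fields satisfying the assumptions ($f$ is $L$-Lipschitz and $\|f - g\|_{\sup} \le \varepsilon'$):

```python
import numpy as np
from scipy.integrate import solve_ivp

L = 1.0           # Lipschitz constant of f
eps_prime = 1e-3  # sup-norm error of the vector field

def f(t, z):      # target vector field, Lipschitz with constant L
    return -L * z

def g(t, z):      # perturbation of f, within eps_prime in sup norm
    return -L * z + eps_prime * np.sin(z)

rng = np.random.default_rng(0)
worst = 0.0
for x in rng.uniform(-2, 2, size=(20, 2)):
    zf = solve_ivp(f, (0, 1), x, rtol=1e-10, atol=1e-12).y[:, -1]
    zg = solve_ivp(g, (0, 1), x, rtol=1e-10, atol=1e-12).y[:, -1]
    worst = max(worst, np.linalg.norm(zf - zg))

print(worst, "<=", eps_prime * np.exp(L))  # Gronwall bound eps' * e^L
assert worst <= eps_prime * np.exp(L)
```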
Finally, we display a lemma that is useful in the case of $d = 1$. It is proved by convolving a smooth bump-like function.
Fact 5 (Lemma 11 of [9]).
Let $f : \mathbb{R} \to \mathbb{R}$ be a strictly increasing continuous function. Then, for any compact subset $K \subset \mathbb{R}$ and any $\varepsilon > 0$, there exists a strictly increasing $C^\infty$-function $\tilde f : \mathbb{R} \to \mathbb{R}$ such that $\|f - \tilde f\|_{\sup, K} < \varepsilon$.
A.2 Proof of Theorem 1
Proof of Theorem 1.
Let $f$ be an element of $\mathcal{D}^2$. Take any compact set $K \subset U_f$ and any $\varepsilon > 0$. First, thanks to Fact 2, there exist $h \in \mathrm{Diff}^2_c(\mathbb{R}^d)$ and an affine transform $W \in \mathrm{Aff}$ such that
$$f|_K = (W \circ h)|_K.$$
Now, if $d \geq 2$, then $2 \neq d + 1$, hence we can immediately use Fact 4 and Lemma 1 to show that there exists a finite set of flow endpoints $h_1, \ldots, h_k \in \mathcal{S}^2$ (Definition 7) such that
$$h = h_k \circ \cdots \circ h_1.$$
On the other hand, if $d = 1$, by Fact 5, for any $\tilde\varepsilon > 0$, we can find $\tilde h$ that is a $C^\infty$-diffeomorphism on $\mathbb{R}$ such that $\|h - \tilde h\|_{\sup, K} < \tilde\varepsilon$. Without loss of generality, we may assume that $\tilde h$ is compactly supported so that $\tilde h \in \mathrm{Diff}^\infty_c(\mathbb{R})$. Since $\infty \neq d + 1$, we can use Fact 4 and Lemma 1 to show that there exists a finite set of flow endpoints $h_1, \ldots, h_k \in \mathcal{S}^\infty$ (Definition 7) such that
$$\tilde h = h_k \circ \cdots \circ h_1.$$
We now construct $f_i \in \mathrm{Lip}$ such that $h_i = \Phi_1(f_i)$. By Definition 7, for each $h_i$ ($i = 1, \ldots, k$), there exists an associated flow $\phi_i$. Now, define
$$f_i(x) := \left. \frac{\partial \phi_i(t, x)}{\partial t} \right|_{t = 0}.$$
Then, $f_i \in \mathrm{Lip}$ because it is a compactly-supported $C^1$-map: it is compactly supported since there exists a compact subset $K_i \subset \mathbb{R}^d$ containing the support of $\phi_i(t, \cdot)$ for all $t$, and hence $f_i$ is zero in the complement of $K_i$.
Now, since, by the additivity of the flows,
$$\frac{\partial \phi_i(t, x)}{\partial t} = \lim_{s \to 0} \frac{\phi_i(s + t, x) - \phi_i(t, x)}{s} = \lim_{s \to 0} \frac{\phi_i(s, \phi_i(t, x)) - \phi_i(t, x)}{s} = f_i(\phi_i(t, x)),$$
the curve $t \mapsto \phi_i(t, x)$ is a solution to the initial value problem $\dot{z}(t) = f_i(z(t))$, $z(0) = x$, and the solution is unique. As a result, we have $h_i = \phi_i(1, \cdot) = \Phi_1(f_i) \in \mathcal{NODE}_{\mathrm{Lip}}$. Finally, by Lemma 2, each $h_i$ can be approximated by elements of $\mathcal{NODE}_{\mathcal{F}}$ in the $\sup$-norm on compact sets, and hence, by Fact 3, the composition $W \circ h_k \circ \cdots \circ h_1$ can be approximated by elements of $\mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ in the $\sup$-norm on $K$. Taking the approximation errors (and, in the case $d = 1$, also $\tilde\varepsilon$) small enough yields $g \in \mathcal{INN}_{\mathcal{NODE}_{\mathcal{F}}}$ with $\|f - g\|_{\sup, K} < \varepsilon$. ∎
Appendix B Terminal time of autonomous-ODE flow endpoints
In Definition 2, the choice of the terminal value of the time variable, $t = 1$, is only technical. To see this, let $T > 0$ and $f \in \mathrm{Lip}$. If we consider $z$ that is the solution of the initial value problem $\dot{z}(t) = f(z(t))$, $z(0) = x$, as well as $\tilde z$ that is the unique solution to $\dot{\tilde z}(t) = T f(\tilde z(t))$, $\tilde z(0) = x$, then $\tilde z(t) = z(T t)$ holds. Therefore, $\tilde z(1) = z(T)$.
As a result, $\Phi_T(f) = \Phi_1(T f)$ holds, where $\Phi_T(f)(x) := z_x(T)$ denotes the flow endpoint with terminal time $T$. Therefore, it holds that
$$\{\Phi_T(f) : f \in \mathcal{F}\} = \{\Phi_1(T f) : f \in \mathcal{F}\}.$$
Thus, even if we consider $\Phi_T$ instead of $\Phi_1$, if the set $\mathcal{F}$ is a cone (i.e., $f \in \mathcal{F}$ implies $c f \in \mathcal{F}$ for any $c > 0$), the set of the autonomous-ODE flow endpoints remains the same.
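The time-rescaling identity $\Phi_T(f) = \Phi_1(T f)$ is easy to confirm numerically (a sketch of ours with an arbitrary Lipschitz vector field):

```python
import numpy as np
from scipy.integrate import solve_ivp

def endpoint(f, T, x):
    # Phi_T(f)(x): integrate dz/dt = f(z) from t = 0 to t = T.
    sol = solve_ivp(lambda t, z: f(z), (0.0, T), x, rtol=1e-10, atol=1e-12)
    return sol.y[:, -1]

f = lambda z: np.tanh(z)
x = np.array([0.4, -1.2])
T = 2.5
print(np.allclose(endpoint(f, T, x),                      # Phi_T(f)(x)
                  endpoint(lambda z: T * f(z), 1.0, x)))  # Phi_1(T f)(x): True
```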
Appendix C Comparison between -universality and -universality
In this section, we discuss the advantage of having a representation power guarantee in terms of the $\sup$-norm instead of the $L^p$-norm in function approximation tasks.
Roughly speaking, function approximation should be robust under a slight change of norms, but the $L^p$-universal approximation property can be sensitive to the choice of $p$. To make this point, we construct an example: even if a model sufficiently approximates a target $f$ with respect to the $L^p$-norm, the model may fail to approximate $f$ with respect to the $L^{p'}$-norm for any $p' > p$, even if $p'$ is very close to $p$.
Let $u : \mathbb{R} \to \mathbb{R}$ be a strictly increasing function such that
$$\lim_{x \to -\infty} u(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} u(x) = 1.$$
For example, the logistic sigmoid $u(x) = 1 / (1 + e^{-x})$ satisfies this condition, together with the tail bound $u(-s) \le e^{-s}$ for $s \ge 0$, which we use below. Taking this $u$ and fixing $p \in [1, \infty)$, we define the target $f := 0$ on $K := [0, 1]$ and the sequence of smoothed spikes
$$f_n(x) := e^{n}\, u\big( M_n (b_n - x) \big), \qquad b_n := \frac{e^{-np}}{n}, \quad M_n := n e^{np},$$
i.e., $f_n$ is approximately $e^n$ on $[0, b_n]$ and decays rapidly beyond $b_n$. Now, the sequence $(f_n)_{n \in \mathbb{N}}$ approximates $f$ in the $L^p$-norm in the sense that for any small $\varepsilon > 0$ and all sufficiently large $n$, it holds that
$$\|f_n - f\|_{p, K}^p \le e^{np}\left( b_n + \frac{1}{M_n} \right) = \frac{2}{n} \le \varepsilon^p. \tag{2}$$
However, the same sequence fails to approximate $f$ in the $L^{p'}$-norm for any $p' > p$, since $u(M_n(b_n - x)) \ge u(1/2)$ for $x \in [0, b_n / 2]$ and hence, for all $n$,
$$\|f_n - f\|_{p', K} \ge u(1/2) \left( \frac{b_n}{2} \right)^{1/p'} e^{n} = \frac{u(1/2)}{(2n)^{1/p'}}\, e^{n (p' - p)/p'} \xrightarrow{\ n \to \infty\ } \infty. \tag{3}$$
This example highlights that fixing $p$ first and guaranteeing the approximation in the $L^p$-norm may not suffice for guaranteeing the approximation in the $L^{p'}$-norm ($p' > p$). On the other hand, having a guarantee in the $\sup$-norm suffices for providing an approximation guarantee in the $L^p$-norm for all $p \in [1, \infty)$ simultaneously, since $\|f\|_{p, K} \le \lambda(K)^{1/p} \|f\|_{\sup, K}$ on a compact set $K$ with Lebesgue measure $\lambda(K)$.
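The divergence of the $L^{p'}$-error alongside the vanishing $L^p$-error can be observed numerically. Below is a quick check of ours for the concrete spike sequence above (with $p = 1$ and $p' = 1.5$; the quadrature settings are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500.0, 500.0)))

def make_fn(n, p):
    b = np.exp(-n * p) / n   # width b_n of the spike
    M = n * np.exp(n * p)    # sharpness M_n of the smoothed step
    return (lambda x: np.exp(n) * sigmoid(M * (b - x))), b

def lp_err(f, p, b):
    # L^p norm of f on [0, 1]; `points` helps quad resolve the narrow spike.
    val, _ = quad(lambda x: f(x) ** p, 0.0, 1.0, points=[b, 2 * b], limit=200)
    return val ** (1.0 / p)

p, p_prime = 1.0, 1.5
for n in [2, 4, 6, 8]:
    f, b = make_fn(n, p)
    # The L^p error decreases while the L^{p'} error grows with n.
    print(n, lp_err(f, p, b), lp_err(f, p_prime, b))
```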