
The Proof of Kolmogorov-Arnold May Illuminate NN Learning

Michael H. Freedman, Center of Mathematical Sciences and Applications, Harvard University. freedmanm@google.com, mfreedman@cmsa.fas.harvard.edu
Abstract.

Kolmogorov and Arnold, in answering Hilbert's 13th problem (in the context of continuous functions), laid the foundations for the modern theory of Neural Networks (NNs). Their proof divides the representation of a multivariate function into two steps: the first (non-linear) inter-layer map gives a universal embedding of the data manifold into a single hidden layer whose image is patterned in such a way that a subsequent dynamic can then be defined to solve for the second inter-layer map. I interpret this pattern as "minor concentration" of the almost-everywhere-defined Jacobians of the inter-layer map. Minor concentration amounts to sparsity for higher exterior powers of the Jacobians. We present a conceptual argument for how such sparsity may set the stage for the emergence of successively higher-order concepts in today's deep NNs and suggest two classes of experiments to test this hypothesis.

The Kolmogorov-Arnold theorem (KA) [kol56, arn57] resolved Hilbert’s 13th problem in the context of continuous functions (Hilbert’s literal statement). It took some time for applied mathematicians and computer scientists to notice that it addresses the representation power of shallow, but highly non-linear, neural nets [hed71]. The relevance of KA to machine learning (ML) has been much debated [gp89, v91].

The primary criticism is that the activation functions are not even once differentiable (even when the function $f$ to be represented is); indeed, they must be quite wild, and appear impossible to train. I agree with this criticism, but will argue here that it misses a more important point. The discussion of KA's relevance has revolved around its statement. But in mathematics, proofs are generally more revealing of power than statements; I believe this is particularly true in the case of KA. The purpose of this note is to call attention to some wisdom embedded within the proof which I believe will be useful for the training of neural nets (NNs). As far as resurrecting KA from the dustbin of ML history, this has already been done earlier this year in [ZWV24], where the original single hidden layer of KA has been deepened to many layers while the inter-layer maps are somewhat tamed (the authors use cubic B-splines).

Before explaining KA's insight and its potential applicability, let me give a modern, optimized statement of KA, quoted from [mor21]. This statement is perhaps unnecessarily succinct from the NN perspective: it gets by with a single outer function $g$, not the $2n+1$ outer functions which might appear more natural. Also, in the NN context one would expect the inner functions to carry a second index and not merely be rationally independent scalars times a single-index function $\phi_{k}$. But we take this statement as representative and refer the reader to Morris for more historical information on the individual contributions.

Theorem 1 (Kolmogorov, Arnold, Kahane, Lorentz, and Sprecher).

For any $n\in\mathbb{N}$, $n\geq 2$, there exist real numbers $\lambda_{1},\lambda_{2},\dots,\lambda_{n}$ and continuous functions $\phi_{k}:\mathbb{I}\rightarrow\mathbb{R}$, for $k=1,\dots,m$, $m\geq 2n+1$, with the property that for every continuous function $f:\mathbb{I}^{n}\rightarrow\mathbb{R}$ there exists a continuous function $g:\mathbb{R}\rightarrow\mathbb{R}$ such that for each $(x_{1},x_{2},\dots,x_{n})\in\mathbb{I}^{n}$,

$$f(x_{1},\dots,x_{n})=\sum_{k=1}^{m}g\big(\lambda_{1}\phi_{k}(x_{1})+\cdots+\lambda_{n}\phi_{k}(x_{n})\big).$$
Figure 1. Schematic of a KA-style network: inputs $x_{1}$ and $x_{2}$ are fed through inner functions $\phi_{p,q}$, whose outputs pass through the outer function $g$ and are summed to produce $f(x_{1},x_{2})$.
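To make the shape of this representation concrete as a computation graph, here is a minimal Python sketch. The particular $\phi_{k}$, $\lambda_{i}$, and $g$ below are arbitrary stand-ins chosen only to exhibit the wiring; they are not the (much wilder) functions produced by the KA construction.

```python
import numpy as np

# Minimal sketch of the KA "wiring" for n = 2, m = 2n + 1 = 5.
# The inner functions phi_k, the scalars lambda_i, and the outer function g below
# are placeholders; the KA proof supplies very different choices.
n, m = 2, 5
lambdas = np.array([1.0, np.sqrt(2.0)])                            # stand-ins for the scalars lambda_i
phis = [lambda t, k=k: np.tanh((k + 1) * t) for k in range(m)]     # stand-ins for the inner functions
g = np.sin                                                         # stand-in for the single outer function

def ka_shaped(x):
    """Evaluate sum_k g(lambda_1*phi_k(x_1) + ... + lambda_n*phi_k(x_n))."""
    x = np.asarray(x)
    return sum(g(np.dot(lambdas, [phis[k](x[i]) for i in range(n)])) for k in range(m))

print(ka_shaped([0.3, 0.7]))
```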

The proof of KA is divided into the construction of inner functions $\phi_{i}$ and an outer function $g$. The inner functions taken together constitute an embedding, which we now denote $\Phi:\mathbb{I}^{n}\rightarrow\mathbb{R}^{m}$, $m\geq 2n+1$. We will append a subscript, $\Phi_{i}$, when referring to a neuron coordinate of the embedding. $\Phi$ is universal; it can be chosen once and for all, independent of the function $f$ to be represented. Our chief lesson regards $\Phi$, although I will also make a comment on the non-linear dynamic used to converge to $g$ (given $f$).

In the mathematically idealized case of KA, feasible $\Phi$ are actually dense in the space of continuous maps $\mathcal{C}(\mathbb{I}^{n},\mathbb{R}^{m})$; they also have a crucial local feature, about to be described. This fact is reminiscent of the observation (for example see [gmrm23, skmf24]) that when NNs are randomly initialized, training is often seen to merely perturb the values of the early layers. These layers seem to be more in the business of assuming a form conducive to the learning of later layers, rather than attending to the particular data themselves.

The local property crucial to $\Phi$ is a kind of irregular staircase structure, as might be used to approximate a general continuous function by one which is piecewise constant. More specifically, $\Phi$ may be taken to be Lipschitz [ak17], and so by Rademacher's theorem will be differentiable a.e. The $\Phi$ constructed in the proof of KA will have the property that at every point $x\in\mathbb{I}^{n}$ where the Jacobian is defined it will, as an $n\times m$ matrix in the "neuron basis," have $m-n$ columns consisting entirely of zeros. So only one of its $\binom{m}{n}$ $n\times n$ minors can be nonzero. (Of course, which minor is active will vary with $x$.) To picture what this means, imagine an irregular stack of sugar cubes of different sizes but with all faces perpendicular to one of the three coordinate axes $x$, $y$, or $z$. Another condition says that the "inactive" coordinate on the $w$-perpendicular faces ($w=x$, $y$, or $z$) must take distinct values. The actual situation is more like an irregular structure of $n$-D cubes wafting through $m$-D space, now with $m-n$ inactive directions at any generic point, and all these inactive coordinate values distinct. Being distinct is what sets the stage for finding at least an approximate outer function $g_{0}$. The $m-n$ inactive directions are crucial to the construction of $g_{0}$, and it is important that $(m-n)>n$, so that the inactive coordinates are in the majority at all generic points of $\Phi(x)$.
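A tiny numerical illustration of this zero-column structure (numpy assumed; $n=2$, $m=5$, with arbitrary nonzero entries):

```python
import numpy as np
from itertools import combinations

# A 2 x 5 "KA-style" Jacobian: three of the five columns (here columns 1, 3, 4) vanish,
# as at a generic point of the embedding Phi when n = 2, m = 5.
M = np.array([[0.7, 0.0, -1.2, 0.0, 0.0],
              [0.3, 0.0,  0.5, 0.0, 0.0]])

# All ten 2 x 2 minors; exactly one (the one using columns 0 and 2) is nonzero.
minors = [np.linalg.det(M[:, list(cols)]) for cols in combinations(range(5), 2)]
print(np.round(minors, 3))
```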

The very rough idea for how to choose $g_{0}$ is to use the $m-n$ inactive coordinate values on each "sugar cube face" as a hint: if the value $f(x)$ is uniformly positive (negative) on the face with inactive coordinate $\Phi_{i}(x)$, give $g_{0}(\Phi_{i}(x))$ a small constant positive (negative) value and extend in a convex-PL manner. In the regime $(m-n)>n$, it may be checked that such a strategy produces a $g_{0}$ leading to a useful approximation:

$$\Big\lVert f(x_{1},\dots,x_{n})-\sum_{i=1}^{m}g_{0}(\Phi_{i}(x_{1},\dots,x_{n}))\Big\rVert<\mathrm{const}\,\lVert f(x_{1},\dots,x_{n})\rVert,\qquad \mathrm{const}<1.$$

Now iterating on the approximation error (in a manner reminiscent of ResNets), and relying on additivity at the output node, one produces a series $g_{0},g_{1},\dots$ which converges exponentially fast to the desired outer function $g$, for which the approximation error has been driven to zero. Many inactive directions at each point are essential since they provide a stationary coordinate value on which a guess for $g_{0}$ can be made: a small positive constant if $f(x_{1},\dots,x_{n})$ is positive throughout the face and a small negative constant if $f$ is negative throughout the face.
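A sketch of the bookkeeping behind this exponential convergence (my notation; $f_{j}$ denotes the residual after $j$ steps, and $g_{j}$ the approximate outer function built for $f_{j}$ exactly as $g_{0}$ was built for $f$):

$$f_{0}:=f,\qquad f_{j+1}:=f_{j}-\sum_{i=1}^{m}g_{j}\circ\Phi_{i},\qquad \lVert f_{j+1}\rVert<\mathrm{const}\,\lVert f_{j}\rVert,$$

so that $\lVert f_{N}\rVert<\mathrm{const}^{N}\lVert f\rVert\rightarrow 0$, and by additivity at the output node the telescoping sum gives $f=\sum_{i=1}^{m}g(\Phi_{i}(x_{1},\dots,x_{n}))$ with $g:=\sum_{j=0}^{\infty}g_{j}$.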

This sketch of the construction of the outer function $g$ reveals the importance of the stationarity condition and its implication of vanishing Jacobian minors for the embedding $\Phi$. Remembering that admissible $\Phi$ are dense, we see that the job of the first layer is simply to impress a certain microscopic texture on $\Phi$ which makes $g$ constructible (if not learnable). This is the division of labor that feels so striking: $\Phi$ does not try to learn anything, but sets the task of finding $g$ up for success.

I wonder if a similar division of labor occurs between early and later stages of a NN as it is trained, and propose:

Proposal 1.

Search the natural Jacobian maps for minor concentration.

First, what are these natural Jacobians, and second, what is minor concentration? One may consider a conventional feedforward DNN, the recently proposed KANets [ZWV24], or more elaborate architectures with residual connections and self-attention. The idea is to watch the data manifold flow through the NN. Initially, at $t=0$, the data manifold is $\mathbb{I}^{n}$, the values stored in the $n$ input neurons; at time $j$ it is the image of the input in the $j$th layer (say, after application of the activation functions). Let $k_{ij}$, $0\leq i,j<\ell_{\text{max}}$, be the map between layers $i$ and $j$, with $i=0$ being a very interesting special case. At any point $y$ in the evolving data manifold ($y=x$, the previously defined coordinate on $\mathbb{I}^{n}$, if $i=0$) we may consider $J_{y}(k_{ij}):=J_{ij,y}$, the Jacobian matrix (in neuron coordinates) of the forward map $k_{ij}$ at the point $y$. The proposal is to find some context where it is possible to search (or at least sample) this huge family of Jacobians and look for minor concentration. Based on an understanding of KA one may expect, particularly in some initial segment of the NN (perhaps all the way up to the penultimate layer), that training will cause these Jacobians to have their minors concentrated far beyond what would be seen in a random collection of linear maps (e.g. with Gaussian distributed entries). It seems best at first to start from the beginning and set $i=0$. That is, one should look "in the wild" to see if a KA-style data preparation is discovered, at least for certain tasks and certain architectures, as a result of training.
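As a concrete starting point for such a search, here is a minimal sketch (PyTorch assumed; the toy MLP and the layer split are hypothetical) of how one might sample $J_{y}(k_{0j})$ in neuron coordinates:

```python
import torch

# Minimal sketch (PyTorch assumed): sample the Jacobian J_y(k_{0j}) of the map from the
# input layer to layer j, evaluated at a data point y, in neuron coordinates.
mlp = torch.nn.Sequential(
    torch.nn.Linear(4, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 3),
)

# k_{0j}: truncate the network after the activations of an intermediate layer j.
k_0j = torch.nn.Sequential(*list(mlp.children())[:4])

y = torch.randn(4)                                # a sample point on the data manifold
J = torch.autograd.functional.jacobian(k_0j, y)   # shape (16, 4): layer-j neurons x inputs
# Note: PyTorch returns outputs x inputs, the transpose of the n x m convention used above;
# the set of minors is unaffected up to reindexing.
print(J.shape)
```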

What is minor concentration?

Given a $p\times q$ matrix $M$ and $h$, $0<h\leq\min\{p,q\}$, minor concentration should be some quantity that measures how far from uniform the distribution of the absolute values of the $\binom{p}{h}\binom{q}{h}$ $h\times h$ minors of $M$ is. The easiest formula is:

$$\operatorname{MC}(M)=\frac{\lVert\text{absolute value of minors}\rVert_{2}}{\lVert\text{absolute value of minors}\rVert_{1}}.$$

That is, the ratio of the $L_{2}$ and $L_{1}$ norms.
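A minimal sketch of this quantity in code (numpy assumed; the function name is mine). The intended use is to compare its value on sampled Jacobians of trained NNs against its value on random matrices of the same shape.

```python
import numpy as np
from itertools import combinations

def minor_concentration(M: np.ndarray, h: int) -> float:
    """MC(M): ratio of the L2 and L1 norms of the absolute values of all h x h minors of M.

    Values range from 1/sqrt(#minors) (all minors equal in magnitude) up to 1
    (a single nonzero minor), so larger values mean more concentration.
    """
    p, q = M.shape
    assert 0 < h <= min(p, q)
    minors = np.array([
        abs(np.linalg.det(M[np.ix_(rows, cols)]))
        for rows in combinations(range(p), h)
        for cols in combinations(range(q), h)
    ])
    l1 = minors.sum()
    return float(np.linalg.norm(minors) / l1) if l1 > 0 else 0.0
```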

As will become clear in the thought experiment below, one might also wish to design special-purpose minor concentration functionals which reward not just a large, isolated minor but respond to a row or column of large minors inside $\Lambda^{h}M$, where the superscript $h$ indicates the $h$th exterior power. This type of pattern of large and small minors could plausibly arise through training.

Proposal 2.

Evaluate the effect of forced minor concentration on learning.

Independent of the results of Proposal 1, one could study the effect of encouraging or discouraging minor concentration during training. One way to do this would be to alternate traditional training protocols with a novel step in which a layer map is updated along the gradient of a new objective function, one measuring MC. If MC is defined as above, these interleaved steps could be $\delta^{\prime}\cdot\nabla\mathrm{MC}_{y}$: with MC-learning rate $\delta^{\prime}$, move a layer map in the direction of increased minor concentration. The gradient might be taken over the last layer-map variables, although there are many choices with which to experiment. One might expect to find some portions of the NN, and some stages of training where "concepts are gelling" (see below), in which a positive $\delta^{\prime}$ would accelerate learning and a negative $\delta^{\prime}$ delay it. Interestingly, there could also be regimes where the reverse applies: at an early stage of learning it could be harmful to have a concept gel prematurely; the NN may be better off keeping an open (Zen-like) mind. Experiments hindering or enhancing MC could get to the heart of "how learning works" [gmrm23].
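A minimal sketch of such an interleaved step (PyTorch assumed; for simplicity the MC functional is applied here to a linear layer's weight matrix rather than to a pointwise Jacobian, an additional simplification of mine, and the helper names are hypothetical):

```python
import torch
from itertools import combinations

def mc(W: torch.Tensor, h: int = 2) -> torch.Tensor:
    """Differentiable MC(W): L2/L1 ratio over the absolute values of all h x h minors.
    Enumerating minors is feasible only for small layers and small h."""
    p, q = W.shape
    minors = torch.stack([
        torch.linalg.det(W[list(rows)][:, list(cols)]).abs()
        for rows in combinations(range(p), h)
        for cols in combinations(range(q), h)
    ])
    return torch.linalg.norm(minors) / minors.sum()

def mc_step(layer: torch.nn.Linear, delta_prime: float, h: int = 2) -> None:
    """One interleaved update: move the layer along +grad MC with MC-learning rate delta_prime.
    Use a negative delta_prime to discourage minor concentration instead."""
    layer.weight.grad = None
    mc(layer.weight, h).backward()
    with torch.no_grad():
        layer.weight += delta_prime * layer.weight.grad
```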

Thought experiment/example

In the KA proof for $n=2$, $m=2n+1=5$, the differential $J_{x}$ at a generic point $x$ of $\Phi$ will be a $2\times 5$ matrix $M$, and the salient feature of $J_{x}$ is that for any given $x\in\mathbb{I}^{2}$, $M$ will have three of its five columns entirely zero (which three are zero will vary as $x$ varies), so in particular $\Lambda^{2}M$, the second exterior power, will be a $1\times 10$ matrix with at most a single nonzero entry. This is the prototypical example of minor concentration. Let's consider another.

Say that between layers $j$ and $j+1$, the Jacobian of the non-linear map is $J_{y}=M$. In this thought experiment, let's imagine we have a convolutional NN analyzing a picture of a face. Suppose the $j$th layer has learned some lines encoded in four neurons, determining a vector in $\mathbb{R}^{4}$, and say the $(j+1)$th layer has three neurons spanning $\mathbb{R}^{3}$ and may be looking to define a feature. Now $M$ is a $4\times 3$ matrix. Arbitrarily pick $h=2$, meaning we will be inquiring into how two "line" neurons stimulate two "feature" neurons. Now $\Lambda^{2}M$ is a $6\times 3$ matrix. Suppose we see just one of its three columns with large values while the other 12 values are close to zero. Here is a story that we might tell: let the feature neurons $a$, $b$, and $c$ label the columns of $M$, so $ab$, $bc$, and $ca$ label the columns of $\Lambda^{2}M$. Suppose it is the $bc$ column with large minors. This means that as we consider the six pairs of "line" neurons, as their information varies to first order, we see the information in the $bc$ pair vary robustly, but not so with the $ab$ and $ca$ pairs. That would be consistent with each of these three pairs being responsible for detecting a feature, say eye, nose, and mouth respectively. The most parsimonious explanation for the muted response of the eye and mouth pairs would be that the lines are not suggesting either of these features, so varying the incoming line data leaves the $ab$ and $ca$ neurons in their "not an eye" and "not a mouth" base-point states respectively. There is little or no variation in the eye- or mouth-recognizing neuron pairs because they are not finding what they have been trained to look for. On the other hand, at least some of the six pairwise line characteristics seem to be describing a variety of nose shapes as they move about. Because training involves both forward- and back-propagation, similar scenarios can be constructed where a row of $\Lambda^{h}M$ would be heavy. Thus, in addition to the most naïve measure of minor concentration $\operatorname{MC}(M)$ (above), one should also look for row-wise and column-wise concentrations in $\Lambda^{h}M$; a concrete sketch follows below. Looking for such concentrations would always come down to comparing the appropriate functional on Jacobians from trained NNs with the same functional on random matrices.
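A minimal sketch of such a row-wise/column-wise functional (numpy assumed; the helpers minor_matrix and line_concentration, and the example matrix, are mine):

```python
import numpy as np
from itertools import combinations

def minor_matrix(M: np.ndarray, h: int) -> np.ndarray:
    """The matrix of h x h minors of M, i.e. Lambda^h M in the neuron basis:
    rows indexed by h-subsets of M's rows, columns by h-subsets of M's columns."""
    p, q = M.shape
    row_sets = list(combinations(range(p), h))
    col_sets = list(combinations(range(q), h))
    return np.array([[np.linalg.det(M[np.ix_(r, c)]) for c in col_sets] for r in row_sets])

def line_concentration(M: np.ndarray, h: int = 2) -> tuple:
    """Fraction of the total minor mass carried by the heaviest row / heaviest column of
    Lambda^h M. Compare against the same statistic on random matrices of the same shape."""
    A = np.abs(minor_matrix(M, h))
    total = A.sum()
    return A.sum(axis=1).max() / total, A.sum(axis=0).max() / total

# Example: a 4 x 3 Jacobian in which the "a" feature neuron barely responds to the lines,
# so the bc column of Lambda^2 M carries nearly all the minor mass, as in the story above.
M = np.array([[0.01,  0.9, -0.3],
              [0.00,  0.2,  1.1],
              [0.02, -0.7,  0.8],
              [0.01,  0.4,  0.5]])
print(line_concentration(M, h=2))
```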

It is worth pointing out that when $h=1$, minor concentration is quite close to the notion of "sparsity"; $1$-minors concentrate when a few matrix entries are large (in absolute value) and the rest small. So minor concentration for general $h$ is close to looking for sparsity in the exterior powers $\Lambda^{h}M$. It is worth commenting that the old notion of sparsity and its generalization to $h$-minor concentration only make sense in the context of linear maps between based vector spaces. Without preferred bases, neither the individual entries of matrices nor their exterior powers have any invariant meaning.

A caveat

I'd like to mention a final point, which arose in conversation with Boris Hanin, relating to the phenomenon of neural collapse. Observations presented in [koth22] suggest, for example, that if a convolutional network is trained to recognize farm animals, say cow, sheep, pig, and horse, the penultimate layer may have the data compressed to some regular tetrahedron, with the four animal types at the four vertices; but these tetrahedron vertices will not generally be oriented along neuron axes. To put them in this form a rotation may be required. This suggests that information midway through the net may often be held diffusely and be difficult to detect within a small subset of neurons, and hence difficult to detect through minor concentration, at least for small $h$. However, sparsification, if applied during training, may already be emphasizing the neuron basis, and so facilitate concentration of $h$-minors for $h>1$. If this is so, it could present an interesting avenue toward understanding how sparsification at the level of the original matrix $M$ could enhance minor concentration and facilitate learning.

References