
Emergence of Globally Attracting Fixed Points in
Deep Neural Networks With Nonlinear Activations

Amir Joudaki
ETH Zürich
amir.joudaki@inf.ethz.ch
Thomas Hofmann
ETH Zürich
thomas.hofmann@inf.ethz.ch
Abstract

Understanding how neural networks transform input data across layers is fundamental to unraveling their learning and generalization capabilities. Although prior work has used insights from kernel methods to study neural networks, a global analysis of how the similarity between hidden representations evolves across layers remains underexplored. In this paper, we introduce a theoretical framework for the evolution of the kernel sequence, which measures the similarity between the hidden representations of two different inputs. Operating in the mean-field regime, we show that the kernel sequence evolves deterministically via a kernel map, which depends only on the activation function. By expanding the activation in Hermite polynomials and using their algebraic properties, we derive an explicit form for the kernel map and fully characterize its fixed points. Our analysis reveals that for nonlinear activations, the kernel sequence converges globally to a unique fixed point, which can correspond to orthogonal or similar representations depending on the activation and network architecture. We further extend our results to networks with residual connections and normalization layers, demonstrating similar convergence behaviors. This work provides new insights into the implicit biases of deep neural networks and how architectural choices influence the evolution of representations across layers.

Code available at: https://github.com/ajoudaki/kernel-global-dynamics

1 Introduction

Deep neural networks have revolutionized various fields, from computer vision to natural language processing, due to their remarkable ability to learn complex patterns from data. Understanding the internal mechanisms that govern their learning and generalization capabilities remains a fundamental challenge.

One approach to studying these transformations is through the lens of kernel methods. Kernel methods have a long history in machine learning for analyzing relationships between data points in high-dimensional spaces (Schölkopf and Smola, 2002; Smola and Schölkopf, 2004). They provide a framework for understanding the similarity measures that underpin many learning algorithms. Recent theoretical studies have increasingly focused on analyzing neural networks from the perspective of kernels. The Neural Tangent Kernel (NTK) (Jacot et al., 2018) is a seminal work that provided a way to analyze the training dynamics of infinitely wide neural networks using kernel methods. This perspective has been further explored in various contexts, leading to significant advances in our understanding of neural networks (Lee et al., 2019; Arora et al., 2019; Yang, 2019).

Despite these advances, an important question remains unexplored: How does the similarity between hidden layer representations evolve across layers, and how is that affected by particular choices of nonlinear functions? Previous work has mainly focused on local behaviors or specific initialization conditions (Saxe et al., 2013; Schoenholz et al., 2017; Pennington et al., 2017). A comprehensive global analysis of neural kernel sequence fixed points and convergence properties, particularly in the presence of nonlinear activations, is still incomplete.

This paper addresses this gap by introducing and analyzing the evolution of kernel sequences in deep neural networks. Specifically, we consider the kernel sequence $k(h_{\ell}(x),h_{\ell}(y))$, where $k$ denotes a similarity measure, and $h_{\ell}(x)$ and $h_{\ell}(y)$ are the representations of the inputs $x$ and $y$ at layer $\ell$. Understanding whether and how this sequence converges to a fixed point as the depth of the network increases is crucial for uncovering the inherent implicit biases of deep networks.

Our analysis builds upon foundational work in neural network theory and leverages mean-field theory to simplify the analysis. By considering the infinite-width limit, stochastic sequences become deterministic, allowing us to focus on the underlying dynamics without the interference of random fluctuations (Poole et al., 2016; Yang et al., 2019; Mei et al., 2019).

Contributions:

  • By employing algebraic properties of Hermite polynomials, we derive explicit forms of the neural kernel and identify its fixed points, leading to many elegant results.

  • We demonstrate that the kernel sequence converges globally to a unique fixed point for a wide class of functions that cover all practically used activations, revealing inherent biases in deep representations.

  • We extend our analysis to networks with normalization layers and residual connections, highlighting their impact on the convergence behavior.

Understanding these dynamics contributes to a deeper understanding of how depth and nonlinearity interact with one another at initialization.

2 Related works

The study of deep neural networks through the lens of kernel methods and mean-field theory has garnered significant interest in recent years. The Neural Tangent Kernel (NTK) introduced by Jacot et al. (2018) provided a framework to analyze the training dynamics of infinitely wide neural networks using kernel methods. This perspective was further expanded by Lee et al. (2019) and Arora et al. (2019), who explored the connections between neural networks and Gaussian processes.

The propagation of signals in deep networks has been studied in the works of Schoenholz et al. (2017) and Pennington et al. (2017). Previous studies have also explored the critical role of activation functions in maintaining signal propagation and stable gradient behavior in deep networks (Hayou et al., 2019). These studies focused on understanding the conditions required for the stable propagation of information and the avoidance of signal amplification or attenuation. However, these analyses often concentrated on local behaviors or specific conditions, leaving a gap in understanding the global evolution of representations across layers.

Hermite polynomials have been used in probability theory and statistics, particularly in the context of Gaussian processes (Williams and Rasmussen, 2006). Although Poole et al. (2016) and Daniely et al. (2016) have utilized polynomial expansions to analyze neural networks, they do not study global dynamics in neural networks, as presented in this paper. To the best of our knowledge, the only existing work that uses Hermite polynomials in the mean-field regime to study the global dynamics of the kernel is by Joudaki et al. (2023b). However, this study only covers centered activations, which fail to capture several striking global dynamics covered in this study.

Our work extends these foundational studies by providing an explicit algebraic framework to analyze the global convergence of kernel sequences in deep networks with nonlinear activations. Using Hermite polynomial expansions, we offer a precise characterization of the kernel map and its fixed points, contributing new insights into the implicit biases of nonlinear activations.

3 Preliminaries

This section introduces the fundamental concepts, notations, and definitions used throughout this paper. If $X$ and $Y$ are vectors in $\mathbb{R}^{n}$, we use the notation $\langle X,Y\rangle_{\text{avg}}=\frac{1}{n}\sum_{i=1}^{n}X_{i}Y_{i}$ to denote their average inner product. We consider a feedforward neural network with $L$ layers and constant layer width $d$. The network takes an input vector $x\in\mathbb{R}^{d}$ and maps it to an output vector $h^{L}(x)\in\mathbb{R}^{d}$ through a series of transformations. The hidden representation at layer $\ell$ is denoted by $h^{\ell}(x)$. Each layer applies a linear transformation followed by a nonlinear activation function $\phi$. For the multilayer perceptron (MLP), the hidden representation at layer $\ell$ is given by:

$$h^{\ell}(x)=\phi\left(\frac{1}{\sqrt{d}}W^{\ell}h^{\ell-1}(x)\right),\qquad W^{\ell}\in\mathbb{R}^{d\times d},\qquad(1)$$

where $h^{0}(x)=x$ is identified with the input. We assume that the elements of $W^{\ell}$ are drawn i.i.d. from a zero-mean, unit-variance distribution. Two prominent choices for the weight distribution, the Gaussian $\mathcal{N}(0,1)$ and the uniform distribution $\mathrm{Uniform}[-\sqrt{3},\sqrt{3}]$, satisfy these conditions. In some variations of the MLP, we will use normalization layers and residual connections. Finally, the neural kernel between two inputs $x$ and $y$ at layer $\ell$ is defined as:

$$\rho_{\ell}=\langle h^{\ell}(x),h^{\ell}(y)\rangle_{\text{avg}}\,.\qquad(2)$$

The main goal of this paper is to analyze the sequence $\{\rho_{\ell}\}$ at random initialization. The motivation for this analysis is to understand if architectural choices, specifically the activation function and the model depth, lead to specific biases of the kernel towards a certain fixed point. Namely, if there is a bias towards zero, it would imply that at initialization, the representations become more orthogonal. In contrast, a positive fixed point would mean the representations become more similar.

In the current setup, the sequence $\{\rho_{\ell}\}$ is a stochastic sequence, or a Markov chain, due to the random weights at each layer. However, we will show that under the mean-field regime, the sequence $\{\rho_{\ell}\}$ converges to a deterministic sequence, and we will analyze its properties.
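As a minimal numerical illustration of Eq. (1)-(2) (a sketch with hypothetical helper names, not code from the repository above), the following snippet draws a random finite-width MLP with i.i.d. Gaussian weights and tracks the kernel sequence $\rho_{\ell}$ for two inputs; the activation is a tanh rescaled so that $\mathbb{E}\,\phi(X)^{2}\approx 1$. At moderate width the sequence is still stochastic; the next section shows that it becomes deterministic as $d\to\infty$.

```python
import numpy as np

def kernel_sequence(x, y, phi, depth, d, seed=0):
    """One random draw of the stochastic kernel sequence rho_0, ..., rho_depth."""
    rng = np.random.default_rng(seed)
    hx, hy = x.copy(), y.copy()
    rhos = [float(hx @ hy / d)]                    # rho_0 = <x, y>_avg
    for _ in range(depth):
        W = rng.standard_normal((d, d))            # i.i.d. zero-mean, unit-variance weights
        hx = phi(W @ hx / np.sqrt(d))              # Eq. (1)
        hy = phi(W @ hy / np.sqrt(d))
        rhos.append(float(hx @ hy / d))            # Eq. (2)
    return rhos

if __name__ == "__main__":
    d, rng = 2048, np.random.default_rng(1)
    # Inputs with ||x||^2 = ||y||^2 = d and similarity <x, y>_avg close to 0.5.
    x = rng.standard_normal(d)
    x *= np.sqrt(d) / np.linalg.norm(x)
    y = 0.5 * x + np.sqrt(0.75) * rng.standard_normal(d)
    y *= np.sqrt(d) / np.linalg.norm(y)
    # tanh rescaled so that E[phi(X)^2] is (numerically) 1 for X ~ N(0,1).
    z = rng.standard_normal(10**6)
    scale = 1.0 / np.sqrt(np.mean(np.tanh(z) ** 2))
    print(kernel_sequence(x, y, lambda t: scale * np.tanh(t), depth=10, d=d))
```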

4 The mean-field regime

In this section, we conduct a mean-field analysis of the MLP to explore the neural kernel's fixed-point behavior as the network depth increases. This approach allows us to gain insight into the global dynamics of neural networks, namely how the similarity between two input samples evolves as they pass through successive layers. We now state the mean-field result for the kernel sequence, namely that in this regime the sequence becomes deterministic.

As will be proven later, this sequence’s deterministic transition follows a scalar function defined below.

Definition 1.

Given two random variables $X,Y$ with covariance $\rho$ and an activation $\phi$, define the kernel map $\kappa$ as the mapping from the covariance of the preactivations to the covariance of the post-activations:

$$\kappa(\rho):=\mathbb{E}\,\phi(X)\phi(Y),\qquad\begin{pmatrix}X\\ Y\end{pmatrix}\sim\mathcal{N}\left(0,\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}\right).\qquad(3)$$

This definition has appeared in previous studies, namely the definition of dual kernel by Daniely et al. (2016), and has been adopted by Joudaki et al. (2023b) to study the effects of nonlinearity in neural networks.

The following proposition formally states that in the mean-field regime, the kernel sequence follows iterative applications of the kernel map.

Proposition 1.

In the mean-field regime with $d\to\infty$, let $\rho_{\ell}$ denote the kernel sequence of an MLP with activation function $\phi$ obeying $\mathbb{E}\,\phi(X)^{2}=1$ for $X\sim\mathcal{N}(0,1)$. If each element of the weights is drawn i.i.d. from a distribution with zero mean and unit variance, the kernel sequence evolves deterministically according to the kernel map

$$\rho_{\ell+1}=\kappa(\rho_{\ell}),$$

where the initial value $\rho_{0}$ corresponds to the input similarity, and $\kappa$ is defined in Definition 1.

Historically, conditions similar to $\mathbb{E}\,\phi(X)^{2}=1$ have been used to prevent quantities in the forward pass from vanishing or exploding. For example, with a ReLU, half of the activations are zeroed out, which leads to a forward-representation norm that shrinks with depth. The initialization of He et al. (2016) addresses this by scaling the weights to maintain consistent forward-pass norms across layers. This principle is further refined by Klambauer et al. (2017), who propose self-normalizing activation functions, which ensure consistent means and variances between pre- and post-activations.

An interesting observation is that the kernel dynamics are governed by the kernel map $\kappa$, defined on the basis of Gaussian preactivations, even if the weights are not Gaussian. The main insight behind this equivalence is that as long as the elements are drawn i.i.d. from a zero-mean, unit-variance distribution, we can apply the Central Limit Theorem (CLT) to conclude that the preactivations are asymptotically Gaussian. This simple yet elegant observation gives us a powerful analytic tool: by leveraging algebraic properties of Gaussian preactivations, we automatically obtain the same results for a wide class of weight distributions.

Proposition 1 tells us that the kernel sequence $\{\rho_{\ell}\}$ is precisely given by $\rho,\kappa(\rho),\kappa(\kappa(\rho)),\dots$ Thus, we can analyze its convergence by studying the fixed points of the kernel map $\kappa$, which are the values $\rho^{*}$ that satisfy $\kappa(\rho^{*})=\rho^{*}$. Under the assumption that $\mathbb{E}\,\phi(X)^{2}=1$, the kernel map $\kappa$ maps $[-1,1]$ to itself. Thus, Brouwer's fixed point theorem implies that $\kappa$ has at least one fixed point $\rho^{*}$. However, as we will show, there is potentially more than one fixed point, and it will be interesting to understand which ones are locally or globally attractive.
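To make the fixed-point iteration concrete, the following sketch (helper names are ours, not from the paper) estimates the kernel map of Definition 1 by Monte Carlo over correlated standard Gaussians and iterates it as in Proposition 1.

```python
import numpy as np

def kernel_map(phi, rho, n=10**6, seed=0):
    """Monte Carlo estimate of kappa(rho) = E[phi(X) phi(Y)] from Definition 1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)   # Cov(X, Y) = rho
    return float(np.mean(phi(x) * phi(y)))

def iterate_kernel(phi, rho0, depth):
    """Deterministic mean-field kernel sequence rho_{l+1} = kappa(rho_l)."""
    rhos = [rho0]
    for _ in range(depth):
        rhos.append(kernel_map(phi, rhos[-1]))
    return rhos

if __name__ == "__main__":
    # Normalize tanh so that E[phi(X)^2] = kappa(1) = 1, as Proposition 1 assumes.
    z = np.random.default_rng(1).standard_normal(10**6)
    scale = 1.0 / np.sqrt(np.mean(np.tanh(z) ** 2))
    phi = lambda t: scale * np.tanh(t)
    # tanh is odd, hence centered: the sequence slowly decays towards rho* = 0.
    print(iterate_kernel(phi, rho0=0.5, depth=30))
```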

5 Hermite expansion of activation functions

Hermite polynomials are complete and orthogonal under the Gaussian measure. Therefore, any function that is square-integrable with respect to the Gaussian measure can be expressed as a linear combination of Hermite polynomials (see below). Square integrability rules out heavy-tailed post-activations without second moments and holds for all activations used in practice. We use normalized Hermite polynomials and their coefficients.

Definition 2.

Normalized Hermite polynomials $\mathrm{he}_{k}(x)$ are defined as follows:

$$\mathrm{he}_{k}(x):=\frac{1}{\sqrt{k!}}(-1)^{k}e^{\frac{x^{2}}{2}}\frac{d^{k}}{dx^{k}}e^{-\frac{x^{2}}{2}}.$$

Although Hermite polynomials have been used in probability theory and statistics, particularly in the context of Gaussian processes (Williams and Rasmussen, 2006), their application in analyzing neural network dynamics provides a novel methodological tool. Previous works, such as Daniely et al. (2016), have utilized polynomial expansions to study neural networks, but our explicit use of Hermite polynomial expansions to derive the kernel map and analyze convergence is a new contribution.

The crucial property of Hermite polynomials is their orthogonality:

$$\mathbb{E}\,\mathrm{he}_{k}(X)\mathrm{he}_{l}(X)=\delta_{kl},\qquad X\sim\mathcal{N}(0,1).\qquad(4)$$

Scaling by $1/\sqrt{k!}$ ensures that the polynomials form an orthonormal basis. Based on this property, we can define:

Definition 3.

Let $\phi$ be a function that is square-integrable with respect to the Gaussian measure. Its Hermite expansion is given by:

$$\phi=\sum_{k=0}^{\infty}c_{k}\mathrm{he}_{k},\qquad c_{k}=\mathbb{E}\,\phi(X)\mathrm{he}_{k}(X),\qquad X\sim\mathcal{N}(0,1)\,.$$

Here, $c_{k}$ are called the Hermite coefficients of $\phi$.

In addition to orthogonality, Hermite polynomials have another “magical” property that is crucial for our later analysis.

Lemma 1 (Mehler’s lemma).

For standard normal random variables $X,Y$ with covariance $\rho$, it holds that

$$\mathbb{E}\,\mathrm{he}_{m}(X)\mathrm{he}_{n}(Y)=\rho^{n}\delta_{mn},\qquad\begin{pmatrix}X\\ Y\end{pmatrix}\sim\mathcal{N}\left(0,\begin{pmatrix}1&\rho\\ \rho&1\end{pmatrix}\right).$$

This lemma states that for two Gaussian random variables $X,Y$ with covariance $\rho$, the expectation of the product of Hermite polynomials vanishes unless the indices are equal (see Section A for the proof). Lemma 1 is crucial for our theory: based on it, we can express the kernel map $\kappa$ in terms of the Hermite coefficients, revealing a particular structure of the kernel map.

Corollary 1.

We have the following explicit form for the kernel map:

$$\kappa(\rho)=\sum_{k=0}^{\infty}c_{k}^{2}\rho^{k}.$$

The proof follows directly from the definition of the kernel map: expand $\phi$ in the Hermite basis and apply Lemma 1.

Daniely et al. (2016) show a similar analytic form for the dual activation, which aligns with our definition of the kernel map. However, they did not use it to study global dynamics. To our knowledge, the only study that uses Hermite polynomials to study global dynamics is Joudaki et al. (2023b), which is limited to centered activations, i.e., those that satisfy $\mathbb{E}\,\phi(X)=0$, or equivalently $\kappa(0)=c_{0}^{2}=0$.
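The following sketch (our own helpers, relying on NumPy's probabilists' Hermite routines) computes the normalized Hermite coefficients $c_{k}=\mathbb{E}\,\phi(X)\mathrm{he}_{k}(X)$ by Gauss-Hermite quadrature and checks the explicit form of Corollary 1, $\kappa(\rho)=\sum_{k}c_{k}^{2}\rho^{k}$, against a direct Monte Carlo estimate of Definition 1; the small residual gap comes from truncating the expansion and from quadrature and sampling error.

```python
import numpy as np
from numpy.polynomial import hermite_e as H
from math import factorial, sqrt, pi

def hermite_coeffs(phi, kmax=20, quad=200):
    """Normalized Hermite coefficients c_k = E[phi(X) he_k(X)] via quadrature."""
    x, w = H.hermegauss(quad)              # nodes/weights for the weight e^{-x^2/2}
    w = w / sqrt(2 * pi)                   # turn the quadrature into E[.] under N(0,1)
    coeffs = []
    for k in range(kmax + 1):
        he_k = H.hermeval(x, [0] * k + [1]) / sqrt(factorial(k))   # normalized he_k
        coeffs.append(float(np.sum(w * phi(x) * he_k)))
    return np.array(coeffs)

def kappa_from_coeffs(c, rho):
    """Corollary 1: kappa(rho) = sum_k c_k^2 rho^k (truncated at len(c) - 1)."""
    return float(np.sum(c**2 * rho ** np.arange(len(c))))

if __name__ == "__main__":
    relu = lambda t: np.sqrt(2.0) * np.maximum(t, 0.0)    # rescaled so E[phi(X)^2] = 1
    c = hermite_coeffs(relu)
    rho = 0.3
    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(10**6)
    print(kappa_from_coeffs(c, rho), float(np.mean(relu(x) * relu(y))))   # should agree closely
```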

6 Convergence of the kernel with general activations

In this section, we will show that for any nonlinear activation function, there is a unique fixed point $\rho^{*}$ that is globally attractive.

Theorem 1.

Let $\phi$ be a nonlinear activation function with kernel map $\kappa$ such that $\kappa(1)=1$. Define the kernel sequence $\rho_{\ell+1}=\kappa(\rho_{\ell})$ with $\rho_{0}\in(-1,1)$. Then there is a unique globally attracting fixed point $\rho^{*}$, which is necessarily non-negative, $\rho^{*}\in[0,1]$. The only other fixed points distinct from $\rho^{*}$ can be $\pm 1$, neither of which is stable. Furthermore, we have the following contraction rates towards $\rho^{*}$:

  1. If $\kappa(0)=0$, then $\rho^{*}=0$ is attracting with rate

     $$\frac{|\rho_{\ell}|}{1-|\rho_{\ell}|}\leq\frac{|\rho_{0}|}{1-|\rho_{0}|}\alpha^{\ell},\qquad\alpha:=\frac{1}{2-\kappa^{\prime}(0)}.$$
  2. If $\kappa(0)>0$ and $\kappa^{\prime}(1)<1$, then $\rho^{*}=1$ is attracting, with rate

     $$|\rho_{\ell}-1|\leq|\rho_{0}-1|\,\alpha^{\ell},\qquad\alpha:=\kappa^{\prime}(1).$$
  3. If $\kappa(0)>0$ and $\kappa^{\prime}(1)=1$, then $\rho^{*}=1$ is attracting with rate

     $$|\rho_{\ell}-1|\leq\frac{|\rho_{0}-1|}{\ell\alpha|\rho_{0}-1|+1},\qquad\alpha:=1-\kappa(0)-\kappa^{\prime}(0).$$
  4. If $\kappa(0)>0$ and $\kappa^{\prime}(1)>1$, then the attracting fixed point is necessarily in the range $\rho^{*}\in(0,1)$ and satisfies $\kappa^{\prime}(\rho^{*})<1$, for which we have

     $$|\rho_{\ell}-\rho^{*}|\leq\frac{|\rho_{0}-\rho^{*}|}{1-|\rho_{0}|}\,\alpha^{\ell},\qquad\alpha=\max\left\{1-\kappa(0),\ \kappa^{\prime}(\rho^{*}),\ \frac{1-\rho^{*}}{2-\kappa^{\prime}(\rho^{*})}\right\},$$

     where $\alpha<1$.

Implications.

Let us take a step back and review the main takeaway of Theorem 1. Omitting the constants, for sufficiently large depth $\ell$ we have

$$\langle h^{\ell}(x),h^{\ell}(y)\rangle_{\text{avg}}=\begin{cases}0+O(\alpha^{\ell}\langle x,y\rangle_{\text{avg}})&\text{case 1}\\ 1+O(\alpha^{\ell}\langle x,y\rangle_{\text{avg}})&\text{case 2}\\ 1+O\left(\langle x,y\rangle_{\text{avg}}/\ell\alpha\right)&\text{case 3}\\ \rho^{*}+O(\alpha^{\ell}\langle x,y\rangle_{\text{avg}})&\text{case 4},\end{cases}$$

where $\langle x,y\rangle_{\text{avg}}$ denotes the input similarity.

Broadly speaking, we can think of three categories of bias:

  • Orthogonality bias $\rho^{*}=0$: Implies that as the network becomes deeper, representations are pushed towards orthogonality exponentially fast. We can think of this case as a bias towards independence.

  • Weak similarity bias $\rho^{*}\in(0,1)$: Implies that as the network becomes deeper, the representations form angles between $0$ and $\pi/2$. Thus, in this case, the representations are neither completely aligned nor completely independent.

  • Strong similarity bias $\rho^{*}=1$: Implies that as the network becomes deeper, representations become more similar or aligned, as indicated by inner products converging to one: exponentially fast in case 2, polynomially in case 3.

See section B for a review of commonly used activations, and their biases according to Theorem 1.
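As a rough numerical aid (a sketch with our own helper names; borderline values such as $\kappa^{\prime}(1)=1$ should be settled analytically, e.g., as in Section B), the case of Theorem 1 can be read off from $\kappa(0)=c_{0}^{2}=(\mathbb{E}\,\phi(X))^{2}$ and $\kappa^{\prime}(1)=\sum_{k}kc_{k}^{2}=\mathbb{E}\,\phi^{\prime}(X)^{2}$, the latter identity being a standard consequence of the Hermite expansion for (weakly) differentiable activations and assumed here.

```python
import numpy as np

def theorem1_case(phi, phi_prime, n=10**6, tol=1e-3, seed=0):
    """Classify an activation with E[phi(X)^2] = 1 into the cases of Theorem 1."""
    x = np.random.default_rng(seed).standard_normal(n)
    kappa0 = np.mean(phi(x)) ** 2            # kappa(0) = c_0^2
    kprime1 = np.mean(phi_prime(x) ** 2)     # kappa'(1) = sum_k k c_k^2
    if kappa0 < tol:
        return "case 1: orthogonality bias, rho* = 0"
    if kprime1 < 1 - tol:
        return "case 2: rho* = 1, exponential rate"
    if kprime1 <= 1 + tol:
        return "case 3: rho* = 1, polynomial rate"
    return "case 4: weak similarity bias, rho* in (0, 1)"

if __name__ == "__main__":
    z = np.random.default_rng(1).standard_normal(10**6)
    scale = 1.0 / np.sqrt(np.mean(np.tanh(z) ** 2))       # enforce E[phi(X)^2] = 1
    phi = lambda t: scale * np.tanh(t)
    phi_prime = lambda t: scale / np.cosh(t) ** 2
    print(theorem1_case(phi, phi_prime))                   # tanh is centered: case 1
```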

The bias of activation and normalization layers has been extensively studied in the literature. For example, Daneshmand et al. (2021) show that batch normalization with linear activations makes representations more orthogonal, relying on a technical assumption that is left unproven (see their assumption $\mathcal{A}_{1}$). Similarly, Joudaki et al. (2023a) extend this to odd activations, yet introduce another technical assumption about the ergodicity of the chain that is hard to verify or prove (see their Assumption 1). In a similar vein, Meterez et al. (2024) show global convergence towards orthogonality in a network with batch normalization and orthogonal random weights, but do not theoretically analyze nonlinear activations. Yang et al. (2019) prove global convergence of the Gram matrix of a network with linear activation and batch normalization towards a fixed point (see their Corollary F.3). However, the authors explain that because they cannot establish such global convergence for general activations (see page 20, under the paragraph titled Main Technical Results), they resort to finding locally stable fixed points, meaning the analysis only applies when the preactivations already have such a Gram matrix structure and are perturbed infinitesimally. Recent studies have also examined the evolution of covariance structures in deep and wide networks using stochastic differential equations (Li et al., 2022), which are similar to the kernel dynamics and kernel ODE introduced here. However, the covariance SDE approach does not theoretically show global convergence towards these solutions. Finally, Joudaki et al. (2023b) establish a global bias towards orthogonality for centered activations (see their Theorem A2), which aligns with case 1 of Theorem 1; to extend to other activations, they add layer normalization after the activation to make it centered. In contrast, Theorem 1 covers all existing activations.

One alternative way of interpreting Theorem 1 is that as two inputs pass through nonlinear activations, the network 'forgets' the similarity between the inputs and converges to a fixed-point value that is independent of the inputs and depends only on the activation. Taking the view that the network ought to remember some similarity of the input for training, we can leverage Theorem 1 to strike a balance between depth and the nonlinearity of the activation, as argued in the following.

Corollary 2.

Let $\epsilon>0$ be the numerical precision, and let $\phi$ be an activation function such that $\mathbb{E}\,\phi(X)^{2}=1$, with its kernel map obeying $\kappa^{\prime}(1)<1$. Then, for any initial pair of inputs $x$ and $y$, after $\ell\geq L:=\frac{\ln(1/\epsilon)}{\ln(1/\kappa^{\prime}(1))}$ layers the representations of $x$ and $y$ become numerically indistinguishable.

Approximately, $\ln(1/\epsilon)$ represents the number of bits of numerical precision available in floating-point representations. For example, for float32 numbers, $\ln(1/\epsilon)$ is approximately $\ln(1/2^{-128})\approx 88$. Therefore, when the network depth $\ell$ exceeds the threshold $88/\ln(1/\kappa^{\prime}(1))$, the representations of any two inputs become numerically indistinguishable from one another. This effect propagates through subsequent layers, potentially making training ineffective or impossible. Notably, ReLU and sigmoid satisfy the conditions of this corollary, implying that this phenomenon can occur in networks using these activations. Previous works on rank collapse (Daneshmand et al., 2020; Noci et al., 2022) and covariance degeneracy (Li et al., 2022) have observed and studied this phenomenon in various settings. However, the explicit, closed-form relationship in Corollary 2 provides a novel and precise quantification of this effect.
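A back-of-the-envelope sketch of the threshold in Corollary 2 (our own code, using the standard identity $\kappa^{\prime}(1)=\mathbb{E}\,\phi^{\prime}(X)^{2}$ as an assumption, and sigmoid as an example of an activation the corollary's condition $\kappa^{\prime}(1)<1$ covers):

```python
import numpy as np

def depth_threshold(phi_prime, eps=2.0**-128, n=10**6, seed=0):
    """Corollary 2: L = ln(1/eps) / ln(1/kappa'(1)), with kappa'(1) = E[phi'(X)^2]."""
    x = np.random.default_rng(seed).standard_normal(n)
    kprime1 = float(np.mean(phi_prime(x) ** 2))
    assert kprime1 < 1.0, "Corollary 2 requires kappa'(1) < 1"
    return np.log(1.0 / eps) / np.log(1.0 / kprime1)

if __name__ == "__main__":
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    z = np.random.default_rng(1).standard_normal(10**6)
    scale = 1.0 / np.sqrt(np.mean(sigmoid(z) ** 2))        # enforce E[phi(X)^2] = 1
    phi_prime = lambda t: scale * sigmoid(t) * (1.0 - sigmoid(t))
    # Depth beyond which any two inputs become numerically indistinguishable (float32).
    print(depth_threshold(phi_prime))
```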

One of the most striking results of this theorem is that for any nonlinear activation, there is exactly one fixed point $\rho^{*}$ that is globally attractive, while the other fixed points are not stable. Furthermore, the fixed point is necessarily non-negative. This implies that for any MLP with a nonlinear activation, the covariance of the preactivations converges towards a fixed point that is always non-negative. It turns out that there is a geometric reason why no globally attracting negative fixed point can exist (see Remark S.2).

Theorem 1 has several aspects that may seem counterintuitive. For instance, case 1 demonstrates that the quantity $|\rho_{\ell}|/(1-|\rho_{\ell}|)$ decays exponentially as depth increases. To gain a better understanding of the theory behind Theorem 1, we can cast our layer-wise kernel dynamics as a continuous-time dynamical system, which is discussed next. For a detailed proof of Theorem 1, please refer to Section A.

6.1 A continuous time differential equation view to the kernel dynamics

One of the most helpful ways to gain insight into discrete fixed-point iterations is to cast them as a continuous problem. More specifically, consider our fixed-point iteration:

$$\Delta\rho_{\ell}=\rho_{\ell+1}-\rho_{\ell}=-\rho_{\ell}+\sum_{k=0}^{\infty}c_{k}^{2}\rho^{k}_{\ell}.$$

We can replace this discrete iteration with a continuous time differential equation, which we will refer to as the kernel ODE:

$$d\rho/dt=-\rho+\sum_{k=0}^{\infty}c_{k}^{2}\rho^{k}.\qquad\text{(kernel ODE)}$$

Recent studies have introduced stochastic models such as the neural covariance SDE by Li et al. (2022), which can be viewed as the stochastic analog of the kernel ODE. However, the kernel ODE captures the most important parts of the discrete kernel dynamics for our main purpose of characterizing the attracting fixed points. The continuous analog of a fixed point is a $\rho^{*}$ at which the right-hand side vanishes, i.e., $\kappa(\rho^{*})-\rho^{*}=0$. Furthermore, the fixed point is globally attracting if $d\rho/dt$ is negative for $\rho>\rho^{*}$ and positive for $\rho<\rho^{*}$; it is locally attracting if this property holds only in some small neighborhood of $\rho^{*}$.
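The following toy sketch compares the discrete iteration $\rho_{\ell+1}=\kappa(\rho_{\ell})$ with a forward-Euler integration of the kernel ODE; the quadratic $\kappa$ used here is a synthetic example (Hermite weights $c_{0}^{2}=0.2$, $c_{1}^{2}=0.5$, $c_{2}^{2}=0.3$, so $\kappa(1)=1$), not an activation from the paper, and both dynamics settle at its attracting fixed point $\rho^{*}=2/3$.

```python
import numpy as np

def kappa(rho):
    # Synthetic kernel map with fixed points 2/3 (attracting) and 1.
    return 0.2 + 0.5 * rho + 0.3 * rho**2

def discrete(rho0, depth):
    rho = rho0
    for _ in range(depth):
        rho = kappa(rho)
    return rho

def euler_ode(rho0, t_end, dt=1e-3):
    rho, t = rho0, 0.0
    while t < t_end:
        rho += dt * (-rho + kappa(rho))    # d rho / dt = -rho + kappa(rho)
        t += dt
    return rho

if __name__ == "__main__":
    # Both converge to the attracting fixed point rho* = 2/3 of this kappa.
    print(discrete(0.1, depth=50), euler_ode(0.1, t_end=50.0))
```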

While the kernel ODE is an interesting transformation of the problem, it does not necessarily have a closed-form solution in general. However, our main strategy is to find worst-case (slowest) scenarios for convergence that correspond to tractable ODEs. In many cases, the worst-case scenario depends on boundary conditions, so we design worst-case ODEs for the different cases and combine them into a joint bound. Let us use the centered-activation equation as a case study.

6.2 Kernel ODE for centered activations

For centered activations, $\mathbb{E}\,\phi(X)=0$, corresponding to case 1 of Theorem 1, the kernel ODE is given by

$$d\rho/dt=-(1-c_{1}^{2})\rho+\sum_{k=2}^{\infty}c_{k}^{2}\rho^{k}\,,$$

where the $k=0$ term vanishes due to the centering assumption $\kappa(0)=c_{0}^{2}=0$.

Observe that $d\rho/dt<0$ when $\rho>0$, and $d\rho/dt>0$ when $\rho<0$. This implies that $\rho(t)\to 0$ for sufficiently large $t$. Intuitively, the terms in $d\rho/dt$ that have the opposite sign of $\rho$ contribute to faster convergence, while terms with the same sign slow it down. Thus, we can ask which distribution of Hermite coefficients $\{c_{k}\}$ corresponds to the worst-case, slowest convergence. If the worst case is a tractable ODE, we can use its solution to bound the convergence of the original ODE. It turns out that the worst case depends on the sign of $\rho$, which is why we study the two ranges separately.

Positive range. Let us first consider the dynamics when $\rho\geq 0$. In this case, the $k=1$ term contributes to faster convergence, while the terms with $k\geq 2$ slow it down. In light of this observation, the worst-case (slowest possible) convergence occurs when the slowing terms are maximized, which happens when all of the weight $\sum_{k=2}^{\infty}c_{k}^{2}$ is concentrated on the $k=2$ term, leading to the kernel ODE

$$d\rho/dt=-(1-c_{1}^{2})\rho+(1-c_{1}^{2})\rho^{2},$$

where we used the fact that $\sum_{k=1}^{\infty}c_{k}^{2}=1$. Finally, we can solve this ODE:

$$\int\frac{d\rho}{\rho(1-\rho)}=-(1-c_{1}^{2})\int dt\implies\frac{|\rho(t)|}{|1-\rho(t)|}=C\exp(-t(1-c_{1}^{2})),$$

where $C$ corresponds to the initial value.

Negative range. Now, let us assume $\rho<0$. In this case, only the odd terms in $\sum_{k=2}^{\infty}c_{k}^{2}\rho^{k}$ contribute to a slowdown in convergence. Thus, the worst case occurs when all the weight of the coefficients is concentrated in $k=3$, leading to the kernel ODE:

$$d\rho/dt=-(1-c_{1}^{2})\rho+(1-c_{1}^{2})\rho^{3}\implies\frac{|\rho(t)|}{\sqrt{1-\rho^{2}(t)}}=C^{\prime}\exp(-t(1-c_{1}^{2})),$$

where $C^{\prime}$ corresponds to the initial value.

To summarize, we have obtained:

$$\begin{cases}\frac{|\rho(t)|}{1-\rho(t)}\leq C\exp(-t(1-c_{1}^{2}))&\rho_{0}\in\mathbb{R}^{+}\\ \frac{|\rho(t)|}{\sqrt{1-\rho(t)^{2}}}\leq C^{\prime}\exp(-t(1-c_{1}^{2}))&\rho_{0}\in\mathbb{R}^{-}.\end{cases}$$

Now, we can use the numerical inequality $\sqrt{1-x^{2}}\geq 1-|x|$, valid for all $x\in(-1,1)$, and by solving for the constant, we can construct a joint bound for both cases:

$$\frac{|\rho(t)|}{1-|\rho(t)|}\leq\frac{|\rho_{0}|}{1-|\rho_{0}|}\exp(-t(1-\kappa^{\prime}(0))),$$

where we have also used $c_{1}^{2}=\kappa^{\prime}(0)$.

Key insights.

The kernel ODE perspective reveals two important insights. First, the appearance of the $|\rho|/(1-|\rho|)$ formula in our bound is due to the fact that $x/(1-x)=C\exp(-ct)$ is the fundamental solution of the differential equation $x^{\prime}=-cx(1-x)$. Second, it reveals the importance of the positivity of the coefficients in $\sum_{k=0}^{\infty}c_{k}^{2}\rho^{k}$, which allowed us to identify a worst-case pattern for the coefficients. This further highlights the value of the algebraic properties of Hermite polynomials, which cancel out the cross terms in $\kappa$.

The rate derived from the kernel ODE is largely aligned with the exponential convergence of this term in Theorem 1. The slight discrepancy between the exponential rate of Theorem 1 and the kernel ODE rate $\exp(-t(1-\kappa^{\prime}(0)))$ is due to the discrete-to-continuous approximation. Despite this small discrepancy, the solution of the kernel ODE helps us arrive at the right form for the solution: we can leverage it to obtain a bound for the discrete problem, namely by induction over the steps.

Since Theorem 1 only provides an upper bound and not a lower bound on convergence, a natural question is: how big is the gap between this worst case and the exact convergence rate? We can construct activations for which the convergence is substantially faster, such as doubly exponential convergence (see Corollary S.1). However, for all practically used activations, the gap between our bound and the actual convergence rate is small (see Section B).

7 Extension to normalization layers and residual connections

In this section, we extend our analysis to MLPs that incorporate normalization layers and residual (skip) connections and examine how they affect convergence.

7.1 Residual connections

Residual connections, inspired by ResNets (He et al., 2016), help mitigate the vanishing gradient problem in deep networks by allowing gradients to flow directly through skip connections. We consider the MLP with residual strength $r$ given by

$$h^{\ell}=\sqrt{1-r^{2}}\,\phi\left(\frac{W^{\ell}}{\sqrt{d}}h^{\ell-1}\right)+r\,\frac{P^{\ell}}{\sqrt{d}}h^{\ell-1},$$

where $W^{\ell}$ and $P^{\ell}$ are independent weight matrices of dimension $d\times d$ with entries drawn i.i.d. from a zero-mean, unit-variance distribution, and $r\in(0,1)$ modulates the strength of the residual connection (the larger $r$, the stronger the residual branch).

Proposition 2.

For an MLP with activation $\phi$ satisfying $\mathbb{E}\,\phi(X)^{2}=1$ for $X\sim\mathcal{N}(0,1)$, and residual parameter $r$, we have the residual kernel map

$$\kappa_{\text{res}}(\rho)=(1-r^{2})\kappa_{\phi}(\rho)+r^{2}\rho,\qquad(5)$$

where $\kappa_{\phi}$ denotes the kernel map of $\phi$.

Furthermore, we have: 1) the fixed points of $\kappa_{\text{res}}$ are the same as those of $\kappa_{\phi}$; 2) if $\rho^{*}$ is a globally attracting fixed point of $\kappa_{\phi}$, then it is also a globally attracting fixed point of $\kappa_{\text{res}}$; 3) the convergence rate of the residual kernel map $\kappa_{\text{res}}$ is monotonically decreasing in $r$ (the stronger the residuals, the slower the convergence).

Implications.

Proposition 2 reveals that residual connections modify the kernel map by blending the original kernel map with the identity function, weighted by the residual strength $r$. This adjustment has several interesting implications, and our analysis gives a quantitative way of balancing depth against the nonlinearity of the activation. For example, Li et al. (2022) show that as we 'shape' the negative slope of a leaky ReLU towards the identity, it prevents degenerate covariance with depth. In light of Proposition 2 and Corollary 2, we can write leaky ReLU as a linear combination of ReLU and the identity and conclude that as the residual strength increases ($r\to 1$), the convergence rate becomes slower and thus degeneracy happens at deeper layers, which aligns with the prior study. Another interesting byproduct of Proposition 2 is that when $r\to 1$ (very strong residuals), the kernel ODE becomes an exact model of the kernel dynamics (see Remark S.1).
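A small sketch of Proposition 2 (again with the synthetic quadratic $\kappa_{\phi}$ from the kernel ODE example above, not a real activation): the residual map in Eq. (5) keeps the fixed point $\rho^{*}=2/3$, but the number of layers needed to reach it grows as $r\to 1$.

```python
import numpy as np

def kappa_phi(rho):
    return 0.2 + 0.5 * rho + 0.3 * rho**2      # synthetic kappa, fixed points 2/3 and 1

def kappa_res(rho, r):
    return (1 - r**2) * kappa_phi(rho) + r**2 * rho      # Eq. (5)

def depth_to_converge(r, rho0=0.0, tol=1e-6, max_depth=10**6):
    rho, rho_star = rho0, 2.0 / 3.0
    for depth in range(1, max_depth + 1):
        rho = kappa_res(rho, r)
        if abs(rho - rho_star) < tol:
            return depth
    return max_depth

if __name__ == "__main__":
    for r in [0.0, 0.5, 0.9, 0.99]:
        print(r, depth_to_converge(r))   # depth grows monotonically with r
```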

7.2 Normalization layers

Normalization layers are widely used in deep learning to stabilize and accelerate training. In this section, we focus on two common normalization layers:

  • Layer Normalization (LN) (Ba et al., 2016):

    $$\operatorname{LN}(z)=\frac{z-\mu(z)}{\sigma(z)},\qquad(6)$$

    where $\mu(z)$ is the mean and $\sigma(z)$ is the standard deviation of the vector $z\in\mathbb{R}^{d}$.

  • Root Mean Square (RMS) Normalization:

    $$\operatorname{RN}(z)=\frac{z}{\sqrt{\frac{1}{d}\sum_{i=1}^{d}z_{i}^{2}}}.\qquad(7)$$

The following proposition characterizes the joint activation and normalization kernel.

Proposition 3.

Let us assume $\phi$ satisfies $\mathbb{E}\,\phi(X)^{2}=1$ for $X\sim\mathcal{N}(0,1)$. Consider an MLP with layers

$$h^{\ell}=\psi\left(\frac{1}{\sqrt{d}}W^{\ell}h^{\ell-1}\right),\qquad\psi\in\{\phi\circ\mathrm{RN},\ \phi\circ\mathrm{LN},\ \mathrm{RN}\circ\phi,\ \mathrm{LN}\circ\phi\},$$

where $\psi$ denotes the joint activation and normalization layer. We have the following joint normalization and activation kernel map:

$$\kappa_{\psi}(\rho)=\begin{cases}\frac{\kappa_{\phi}(\rho)-\kappa_{\phi}(0)}{1-\kappa_{\phi}(0)}&\text{if }\psi=\mathrm{LN}\circ\phi\\ \kappa_{\phi}(\rho)&\text{otherwise},\end{cases}$$

where $\kappa_{\phi}$ denotes the kernel map of $\phi$.

Implications.

Proposition 3 implies that normalization layers adjust the kernel map by scaling and, in some cases, shifting it. When layer normalization is applied after the activation, the kernel map becomes centered. This means that for activations such as ReLU or sigmoid, the fixed point of the kernel sequence changes from $\rho^{*}=1$ to $\rho^{*}=0$, indicating a shift from a bias towards aligning representations to a bias towards orthogonalizing them. This result supports the previous findings of Joudaki et al. (2023b) on the centering effect of layer normalization, and extends them to other configurations (the order of normalization and activation) as well as to RMS normalization. Furthermore, our findings align with studies showing that batch normalization induces orthogonal representations in deep networks (Yang et al., 2019; Daneshmand et al., 2021).
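A minimal sketch of Proposition 3 for the $\mathrm{LN}\circ\phi$ case (using the same synthetic $\kappa$ as above rather than a real activation): centering the kernel map moves the attracting fixed point from $\rho^{*}=2/3$ to $\rho^{*}=0$.

```python
def kappa(rho):
    return 0.2 + 0.5 * rho + 0.3 * rho**2      # synthetic kappa: kappa(0) = 0.2 > 0, rho* = 2/3

def kappa_ln(rho):
    # Proposition 3, LN-after-activation case: center and rescale the kernel map.
    return (kappa(rho) - kappa(0.0)) / (1.0 - kappa(0.0))

def iterate(f, rho0, depth=100):
    rho = rho0
    for _ in range(depth):
        rho = f(rho)
    return rho

if __name__ == "__main__":
    print(iterate(kappa, 0.5))      # converges towards 2/3 (similarity bias)
    print(iterate(kappa_ln, 0.5))   # converges towards 0 (orthogonality bias)
```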

8 Discussion

The power of deep neural networks fundamentally arises from their depth and nonlinearity. Despite substantial empirical evidence demonstrating the importance of deep, nonlinear models, a rigorous theoretical understanding of how they operate and learn remains a significant theoretical challenge. Much of classical mathematical and statistical machinery was developed for linear systems or shallow models (Hastie et al., 2009), and extending these tools to deep, nonlinear architectures poses substantial difficulties. Our work takes a natural first step into this theoretical mystery by providing a rigorous analysis of feedforward networks with nonlinear activations.

We showed that viewing activations in the Hermite basis uncovers several strikingly elegant properties. Many architectural choices, such as normalization layers, initialization techniques, and CLT-type convergences, make Gaussian preactivations central to understanding the role of activations, and the Hermite expansion can be viewed as the Fourier analog for the Gaussian kernel. These facts, together with our theoretical results, suggest that viewing activations in the Hermite basis is more natural than in the raw signal, analogous to viewing convolutions in the Fourier basis. For example, the fact that the kernel map $\kappa$ is analytic and has a positive power series expansion, i.e., is infinitely smooth, does not require the activation $\phi$ to be smooth or even continuous. Thus, as opposed to many existing analyses that consider smooth and non-smooth cases separately, our theoretical framework gives a unified perspective. As an interesting example, smoothness, which has been observed to facilitate better training dynamics (Hayou et al., 2019), appears as a faster decay of the Hermite coefficients. Thus, similar to leveraging the Fourier transform for understanding and designing filters, one can aspire to use Hermite analysis for designing and analyzing activation functions.

One of the key contributions of our work is the global perspective in analyzing neural networks. Traditional analyses often focus on local structure, such as the Jacobian eigenspectrum studied by Pennington et al. (2018). While these local studies provide valuable insights, they do not capture the global kernel dynamics. More concretely, real-world inputs to neural networks, such as images of a dog and a cat, are not close enough to each other to be captured by local changes. Our global approach allows us to study the evolution of representations across layers for any pair of inputs, providing further insights into how deep networks transform inputs with varying degrees of similarity.

One of the central assumptions of our approach is operating in the mean-field, i.e., infinite-width, regime. While this approach led us to establish numerous elegant properties, it would be worth exploring the differences between finite-width and infinite-width results, which remains an important avenue for future work. Similarly, in the CLT application of Proposition 1, quantifying the exact discrepancy between finite width and infinite width would deepen our understanding of practical neural networks. Another intriguing endeavor is the potential to extend our framework to develop a theory for self-normalizing activation functions in a vein similar to Klambauer et al. (2017). Finally, building on the mathematical foundations laid here to study first- and second-order gradients and their impact on training dynamics remains a highly interesting topic for future research.

References

  • Arora et al. [2019] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. Advances in neural information processing systems, 32, 2019.
  • Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Daneshmand et al. [2020] Hadi Daneshmand, Jonas Kohler, Francis Bach, Thomas Hofmann, and Aurelien Lucchi. Batch normalization provably avoids ranks collapse for randomly initialised deep networks. Advances in Neural Information Processing Systems, 33:18387–18398, 2020.
  • Daneshmand et al. [2021] Hadi Daneshmand, Amir Joudaki, and Francis Bach. Batch normalization orthogonalizes representations in deep random networks. Advances in Neural Information Processing Systems, 34:4896–4906, 2021.
  • Daniely et al. [2016] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. Advances in Neural Information Processing Systems, 29, 2016.
  • Hastie et al. [2009] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • Hayou et al. [2019] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. On the impact of the activation function on deep neural networks training. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2672–2680. PMLR, 2019.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
  • Joudaki et al. [2023a] Amir Joudaki, Hadi Daneshmand, and Francis Bach. On bridging the gap between mean field and finite width in deep random neural networks with batch normalization. International Conference on Machine Learning, 2023a.
  • Joudaki et al. [2023b] Amir Joudaki, Hadi Daneshmand, and Francis Bach. On the impact of activation and normalization in obtaining isometric embeddings at initialization. Advances in Neural Information Processing Systems, 36:39855–39875, 2023b.
  • Klambauer et al. [2017] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. Advances in Neural Information Processing Systems, 30, 2017.
  • Lee et al. [2019] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in Neural Information Processing Systems, 32, 2019.
  • Li et al. [2022] Mufan Bill Li, Mihai Nica, and Daniel M. Roy. The neural covariance sde: Shaped infinite depth-and-width networks at initialization. Advances in Neural Information Processing Systems, 2022.
  • Mei et al. [2019] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on learning theory, pages 2388–2464. PMLR, 2019.
  • Meterez et al. [2024] Alexandru Meterez, Amir Joudaki, Francesco Orabona, Alexander Immer, Gunnar Ratsch, and Hadi Daneshmand. Towards training without depth limits: Batch normalization without gradient explosion. In The Twelfth International Conference on Learning Representations, 2024.
  • Noci et al. [2022] Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, and Aurelien Lucchi. Signal propagation in Transformers: Theoretical perspectives and the role of rank collapse. Advances in Neural Information Processing Systems, 35:27198–27211, 2022.
  • Pennington et al. [2017] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in neural information processing systems, 30, 2017.
  • Pennington et al. [2018] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, pages 1924–1932, 2018.
  • Poole et al. [2016] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. Advances in Neural Information Processing Systems, 29, 2016.
  • Saxe et al. [2013] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
  • Schoenholz et al. [2017] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. International Conference on Learning Representations, 2017.
  • Schölkopf and Smola [2002] Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
  • Smola and Schölkopf [2004] Alex J Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statistics and computing, 14:199–222, 2004.
  • Williams and Rasmussen [2006] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006.
  • Yang [2019] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
  • Yang et al. [2019] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. A mean field theory of batch normalization. In International Conference on Learning Representations, 2019.

Appendix A Supplementary theorems and proofs

This section is dedicated to the proofs of the main theoretical results presented in the paper.

Theorem S.2 is closely aligned with Theorem A2 of Joudaki et al. [2023b] and is kept here for historical reasons, while Theorem 1, which is the general version, cites Theorem S.2 as one specific case. Likewise, the statement of Lemma 1 is a close replica of Lemma A8 in Joudaki et al. [2023b]. Thus, in both cases, we skip the proofs here and refer the interested reader to Joudaki et al. [2023b].

Proof of Proposition 1.

The proof is a straightforward application of the law of large numbers, as the mean-field regime implies that sample means converge to population means. Let us inductively assume that at layer $\ell$ it holds that $\frac{1}{d}\|h^{\ell}(x)\|^{2}=1$, $\frac{1}{d}\|h^{\ell}(y)\|^{2}=1$, and $\frac{1}{d}\langle h^{\ell}(x),h^{\ell}(y)\rangle=\rho_{\ell}$. Thus, defining $a=\frac{1}{\sqrt{d}}W^{\ell+1}h^{\ell}(x)$ and $b=\frac{1}{\sqrt{d}}W^{\ell+1}h^{\ell}(y)$, their elements $a_{i}$ and $b_{i}$ follow $\mathcal{N}(0,1)$ and have covariance $\mathbb{E}\,a_{i}b_{i}=\rho_{\ell}$. Thus, again by construction, we have $\mathbb{E}\,\phi(a_{i})^{2}=\mathbb{E}\,\phi(b_{i})^{2}=1$ and $\mathbb{E}\,\phi(a_{i})\phi(b_{i})=\rho_{\ell+1}$. Finally, because the $a_{i}$'s and $b_{i}$'s are i.i.d. copies of their respective distributions, by the law of large numbers the sample means converge to their expectations:

$$\frac{1}{d}\|h^{\ell+1}(x)\|^{2}=\frac{1}{d}\|\phi(a)\|^{2}=\frac{1}{d}\sum_{i=1}^{d}\phi(a_{i})^{2}\xrightarrow{d\to\infty}\mathbb{E}\,\phi(a_{1})^{2}=1$$
$$\frac{1}{d}\|h^{\ell+1}(y)\|^{2}=\frac{1}{d}\|\phi(b)\|^{2}=\frac{1}{d}\sum_{i=1}^{d}\phi(b_{i})^{2}\xrightarrow{d\to\infty}\mathbb{E}\,\phi(b_{1})^{2}=1$$
$$\frac{1}{d}\langle h^{\ell+1}(x),h^{\ell+1}(y)\rangle=\frac{1}{d}\langle\phi(a),\phi(b)\rangle=\frac{1}{d}\sum_{i=1}^{d}\phi(a_{i})\phi(b_{i})\xrightarrow{d\to\infty}\mathbb{E}\,\phi(a_{1})\phi(b_{1})=\rho_{\ell+1}$$

Thus, we have proved the induction hypothesis for step $\ell+1$. This concludes the proof. ∎

Theorem S.2.

Let $\phi$ be a nonlinear activation with kernel map $\kappa$ that is centered, $\kappa(0)=0$, and satisfies $\kappa(1)=1$. Let $\rho_{\ell}$ be the kernel sequence generated by $\kappa$ from the initial value $\rho_{0}\in(-1,1)$. Then, the sequence contracts towards the fixed point at zero, $\rho^{*}=0$, with rate $\alpha$:

$$\frac{|\rho_{\ell}|}{1-|\rho_{\ell}|}\leq\frac{|\rho_{0}|}{1-|\rho_{0}|}\alpha^{\ell},\qquad\alpha:=\frac{1}{2-\kappa^{\prime}(0)},\qquad(8)$$

where $\alpha<1$. The only other possible fixed points are $\rho^{*}=\pm 1$, and neither of them is locally or globally attractive.

Proof of Theorem 1.

We will prove all cases individually, starting with the first case, which falls directly under a previous theorem.

Case 1: $\kappa(0)=0$

We can observe that the case $\kappa(0)=0$ falls directly under Theorem S.2, and thus there is no need to prove it again.

Cases 2, 3: $\kappa(0)>0$ and $\kappa^{\prime}(1)\leq 1$

In this part, we jointly consider the two cases where $\kappa(0)>0$ and either $\kappa^{\prime}(1)<1$ or $\kappa^{\prime}(1)=1$. Let us consider the ratio between the distances $|\rho-1|$ before and after applying $\kappa$:

$$\begin{aligned}\frac{|\kappa(\rho)-1|}{|\rho-1|}&=\frac{1-\kappa(\rho)}{1-\rho}=\frac{\kappa(1)-\kappa(\rho)}{1-\rho}\\ &=\frac{\sum_{k=1}^{\infty}c_{k}^{2}(1-\rho^{k})}{1-\rho}=\sum_{k=1}^{\infty}c_{k}^{2}\sum_{i=0}^{k-1}\rho^{i}\\ &=\kappa^{\prime}(1)-\kappa^{\prime}(1)+\sum_{k=1}^{\infty}c_{k}^{2}\sum_{i=0}^{k-1}\rho^{i}\\ &=\kappa^{\prime}(1)-\sum_{k=1}^{\infty}c_{k}^{2}\Big(k-\sum_{i=0}^{k-1}\rho^{i}\Big).\end{aligned}$$

Clearly, the term $k-\sum_{i=0}^{k-1}\rho^{i}$ is always non-negative. Thus, if $\kappa^{\prime}(1)<1$, then we have the contraction

$$\frac{|\kappa(\rho)-1|}{|\rho-1|}\leq\kappa^{\prime}(1)\implies|\rho_{\ell}-1|\leq|\rho_{0}-1|\,\kappa^{\prime}(1)^{\ell}.$$

Otherwise, if $\kappa^{\prime}(1)=1$, we have

$$\frac{|\kappa(\rho)-1|}{|\rho-1|}=1-\sum_{k=1}^{\infty}c_{k}^{2}\Big(k-\sum_{i=0}^{k-1}\rho^{i}\Big).$$

Now, observe that the $k=1$ term is zero. Furthermore, the sequence $k-\sum_{i=0}^{k-1}\rho^{i}$ is monotonically increasing in $k$. Thus, the smallest value the weighted sum can achieve occurs when all of the weight of the terms with $k\geq 2$ is concentrated at $k=2$, which leads to the contraction

$$\frac{|\kappa(\rho)-1|}{|\rho-1|}\leq 1-(1-c_{0}^{2}-c_{1}^{2})(2-1-\rho)=1-(1-c_{0}^{2}-c_{1}^{2})(1-\rho).$$

Now, define the sequence $x_{\ell}:=1-\rho_{\ell}$ and observe that we have

$$x_{\ell+1}\leq x_{\ell}(1-\alpha x_{\ell}),\qquad\alpha=1-c_{0}^{2}-c_{1}^{2},$$

where $\alpha>0$ if the activation is nonlinear. We can prove inductively that

$$x_{\ell}\leq\frac{x_{0}}{\ell\alpha x_{0}+1}.$$

Plugging in the definition of $x_{\ell}$, we have proven

$$|\rho_{\ell}-1|\leq\frac{|\rho_{0}-1|}{\ell\alpha|\rho_{0}-1|+1}.$$

Case 4: $\kappa(0)>0$ and $\kappa^{\prime}(1)>1$

The main strategy is to prove a contraction of $\kappa(\rho)$ towards $\rho^{*}$ under the kernel map $\kappa$. In other words, we need to show that $|\kappa(\rho)-\rho^{*}|$ is smaller than $|\rho-\rho^{*}|$ under some potential. First, we assume there is a $\rho^{*}$ such that $\kappa^{\prime}(\rho^{*})<1$ and show this contraction; later, we prove its existence and uniqueness.

To prove contraction towards $\rho^{*}$ when $\kappa^{\prime}(\rho^{*})<1$, we consider three cases: 1) $\rho>\rho^{*}$, 2) $\rho\in[0,\rho^{*}]$, and 3) $\rho<0$. However, the bounds will be of different potential forms and will have to be combined later. Let $\kappa(\rho)=\sum_{k=0}^{\infty}c_{k}^{2}\rho^{k}$ be the kernel map with $\kappa(1)=1$ and fixed point $\rho^{*}$ satisfying $\kappa^{\prime}(\rho^{*})<1$.

  • $\rho\geq\rho^{*}$. We will prove:

    $$\frac{|\kappa(\rho)-\rho^{*}|}{1-\kappa(\rho)}\leq\frac{|\rho-\rho^{*}|}{1-\rho}\cdot\frac{1-\rho^{*}}{2-\kappa^{\prime}(\rho^{*})}.$$

    We have the series expansion around $\rho^{*}$: $\kappa(\rho)=\rho^{*}+\sum_{k=1}^{\infty}a_{k}(\rho-\rho^{*})^{k}$. For points $\rho\geq\rho^{*}$, we have $\kappa(\rho)\geq\rho^{*}$, and thus we can write

    $$\begin{aligned}\frac{\kappa(\rho)-\rho^{*}}{1-\kappa(\rho)}&=\frac{\sum_{k=1}^{\infty}a_{k}(\rho-\rho^{*})^{k}}{\kappa(1)-\kappa(\rho)}\\ &=\frac{\sum_{k=1}^{\infty}a_{k}(\rho-\rho^{*})^{k}}{\sum_{k=1}^{\infty}a_{k}(1-\rho^{*})^{k}-\sum_{k=1}^{\infty}a_{k}(\rho-\rho^{*})^{k}}\\ &=\frac{\rho-\rho^{*}}{1-\rho}\cdot\frac{\sum_{k=1}^{\infty}a_{k}(\rho-\rho^{*})^{k-1}}{\sum_{k=1}^{\infty}a_{k}\big(\sum_{i=0}^{k-1}(1-\rho^{*})^{i}(\rho-\rho^{*})^{k-1-i}\big)}\\ &\leq\frac{\rho-\rho^{*}}{1-\rho}\cdot\frac{\sum_{k=1}^{\infty}a_{k}(\rho-\rho^{*})^{k-1}}{\sum_{k=2}^{\infty}a_{k}(1-\rho^{*})^{k-1}+\sum_{k=1}^{\infty}a_{k}(\rho-\rho^{*})^{k-1}},\end{aligned}$$

    where the inequality in the last line is due to only retaining the terms corresponding to $i=0$ and $i=k-1$ in the denominator. Now, note that the right-hand side is maximized when $\rho\in[\rho^{*},1]$ is largest, which is attained at $\rho=1$:

    $$\frac{\kappa(\rho)-\rho^{*}}{1-\kappa(\rho)}\leq\frac{\rho-\rho^{*}}{1-\rho}\cdot\frac{\sum_{k=1}^{\infty}a_{k}(1-\rho^{*})^{k-1}}{\sum_{k=2}^{\infty}a_{k}(1-\rho^{*})^{k-1}+\sum_{k=1}^{\infty}a_{k}(1-\rho^{*})^{k-1}}.$$

    Now we can observe that the numerator and denominator correspond to

    $$=\frac{\rho-\rho^{*}}{1-\rho}\cdot\frac{\kappa(1)-\rho^{*}}{2\kappa(1)-\kappa^{\prime}(\rho^{*})}=\frac{\rho-\rho^{*}}{1-\rho}\cdot\frac{1-\rho^{*}}{2-\kappa^{\prime}(\rho^{*})}.$$

    Thus, we have proven that

    $$\rho\geq\rho^{*}\implies\frac{|\kappa(\rho)-\rho^{*}|}{1-\kappa(\rho)}\leq\frac{|\rho-\rho^{*}|}{1-\rho}\cdot\frac{1-\rho^{*}}{2-\kappa^{\prime}(\rho^{*})}.$$
  • $0\leq\rho\leq\rho^{*}$. In this range $\kappa^{\prime}(\rho)$ is monotonically increasing, implying that $\kappa^{\prime}(\rho)\leq\kappa^{\prime}(\rho^{*})<1$. Thus, $|\kappa(\rho)-\rho^{*}|\leq\kappa^{\prime}(\rho^{*})\,|\rho-\rho^{*}|$, and by the Banach fixed point theorem $\kappa$ is a contraction in this range with rate $\kappa^{\prime}(\rho^{*})$:

    $$0\leq\rho\leq\rho^{*}\implies|\kappa(\rho)-\rho^{*}|\leq\kappa^{\prime}(\rho^{*})\,|\rho-\rho^{*}|.$$
  • $-1\leq\rho\leq 0$. Finally, let us consider $\rho\leq 0$. Recall that $\kappa(1)=1$. Thus, we can express $\kappa(\rho)-1$ as the product of $(\rho-1)$ with some power series $q(\rho)$:

    $$\kappa(\rho)-1=(\rho-1)q(\rho),\qquad q(\rho)=\sum_{k=0}^{\infty}b_{k}\rho^{k}.$$

    In fact, we can expand $\kappa(\rho)$ in terms of these new coefficients:

    \displaystyle\kappa(\rho)=1-b_{0}+\sum_{k=1}^{\infty}(b_{k-1}-b_{k})\rho^{k}

    Matching coefficients with \kappa(\rho)=\sum_{k}c_{k}^{2}\rho^{k} gives c_{0}^{2}=1-b_{0} and c_{k}^{2}=b_{k-1}-b_{k} for k\geq 1. Since all c_{k}^{2} are non-negative, we conclude 1\geq b_{0}\geq b_{1}\geq\dots\geq 0. Based on this observation, for 0<\rho<1 we can conclude q(-\rho)=b_{0}-b_{1}\rho+b_{2}\rho^{2}-\dots\leq b_{0}: pairing the terms -b_{k}\rho^{k}+b_{k+1}\rho^{k+1} for odd k, and using b_{k}\geq b_{k+1} and \rho^{k}\geq\rho^{k+1} for \rho\in[0,1], each pair is non-positive, hence q(-\rho)\leq b_{0}=1-c_{0}^{2}. Plugging this bound into the kernel map for 0<\rho<1, we have:

    \displaystyle\kappa(-\rho)=1-(1+\rho)q(-\rho)
    \displaystyle\geq 1-(1+\rho)(1-c_{0}^{2})
    \displaystyle=-\rho+c_{0}^{2}(1+\rho)
    \displaystyle\implies\kappa(-\rho)+\rho\geq c_{0}^{2}(1+\rho)
    \displaystyle\implies-\rho+c_{0}^{2}(1+\rho)\leq\kappa(-\rho)\leq\kappa(\rho)

    Now, if we assume κ(ρ)ρ\kappa(-\rho)\leq\rho^{*} then

    |κ(ρ)ρ||ρρ|\displaystyle\frac{|\kappa(-\rho)-\rho^{*}|}{|-\rho-\rho^{*}|} =ρκ(ρ)ρ+ρ\displaystyle=\frac{\rho^{*}-\kappa(-\rho)}{\rho^{*}+\rho}
    ρ+ρc02(1+ρ)ρ+ρ\displaystyle\leq\frac{\rho^{*}+\rho-c_{0}^{2}(1+\rho)}{\rho+\rho^{*}}
    =1c02(1+ρ)ρ+ρ\displaystyle=1-\frac{c_{0}^{2}(1+\rho)}{\rho+\rho^{*}}
    1c02\displaystyle\leq 1-c_{0}^{2}
    =1κ(0).\displaystyle=1-\kappa(0).

    Now, if we instead assume \kappa(-\rho)\geq\rho^{*}: since \kappa(-\rho)\leq\kappa(\rho), this forces \kappa(\rho)\geq\rho^{*}, and hence \rho\geq\rho^{*} because \kappa is increasing on [0,1]; by convexity of \kappa on [0,1], together with \kappa(\rho^{*})=\rho^{*} and \kappa(1)=1, this implies \kappa(\rho)\leq\rho. Thus, we have

    \displaystyle\frac{|\kappa(-\rho)-\rho^{*}|}{|-\rho-\rho^{*}|}=\frac{\kappa(-\rho)-\rho^{*}}{\rho+\rho^{*}}
    \displaystyle\leq\frac{\kappa(\rho)-\rho^{*}}{\rho+\rho^{*}}
    \displaystyle\leq\frac{\rho-\rho^{*}}{\rho+\rho^{*}}
    \displaystyle\leq\frac{1-\rho^{*}}{1+\rho^{*}}
    \displaystyle\leq 1-\rho^{*}

    Combining both cases we have

    \displaystyle\rho\leq 0\implies\frac{|\kappa(\rho)-\rho^{*}|}{|\rho-\rho^{*}|}\leq 1-\min(\kappa(0),\rho^{*}).

    We can further note that \rho^{*}=\kappa(\rho^{*})=\kappa(0)+\text{non-negative terms}, which implies \rho^{*}\geq\kappa(0). Thus, we can conclude that

    \displaystyle\rho\leq 0\implies|\kappa(\rho)-\rho^{*}|\leq(1-\kappa(0))|\rho-\rho^{*}|

Combining the cases. Let us summarize the results so far. Assuming a fixed point \rho^{*}\in[0,1] with \kappa^{\prime}(\rho^{*})<1 (its existence and uniqueness are established at the end of this proof), we have proven the following contraction rates for the three cases.

\displaystyle\begin{cases}\frac{|\kappa(\rho)-\rho^{*}|}{1-\kappa(\rho)}\leq\frac{|\rho-\rho^{*}|}{1-\rho}\frac{1-\rho^{*}}{2-\kappa^{\prime}(\rho^{*})}&\text{if }\rho\geq\rho^{*}\\ |\kappa(\rho)-\rho^{*}|\leq\kappa^{\prime}(\rho^{*})|\rho-\rho^{*}|&\text{if }0\leq\rho\leq\rho^{*}\\ |\kappa(\rho)-\rho^{*}|\leq(1-\kappa(0))|\rho-\rho^{*}|&\text{if }\rho\leq 0\end{cases}

Let us now define the joint decay rate:

\displaystyle\alpha=\max\left\{1-\kappa(0),\;\kappa^{\prime}(\rho^{*}),\;\frac{1-\rho^{*}}{2-\kappa^{\prime}(\rho^{*})}\right\}

In other words, this is the worst-case rate for any of the above cases.

Now, let us assume we start from an initial value \rho_{0} and define \rho_{\ell}=\kappa(\rho_{\ell-1}). One important observation is that if \rho_{0}\geq\rho^{*}, then by monotonicity of \kappa on [0,1] the sequence remains in that range, and similarly if \rho_{0}\in[0,\rho^{*}] it remains in that range. Thus, from that index onwards we can apply the contraction rate of the respective case. The only case in which a transition can occur is \rho_{0}<0.

Assuming \rho_{0}<0, let \ell be the first index for which \rho_{\ell}\geq 0. From \rho_{0} up to \rho_{\ell} we can apply the contraction rate of the third case:

|ρρ||ρ0ρ|α\displaystyle|\rho_{\ell}-\rho^{*}|\leq|\rho_{0}-\rho^{*}|\alpha^{\ell}

Now, we have two possibilities, either ρρ\rho_{\ell}\geq\rho^{*} or ρρ\rho_{\ell}\leq\rho^{*}. If ρρ\rho_{\ell}\geq\rho^{*}, we can apply the contraction rate of the first case, and if ρρ\rho_{\ell}\leq\rho^{*} we can apply the contraction rate of the second case:

{|ρLρ||ρρ|αL0ρρ|ρLρ|1ρL|ρρ|1ραLρρ\displaystyle\begin{cases}|\rho_{L}-\rho^{*}|\leq|\rho_{\ell}-\rho^{*}|\alpha^{L-\ell}&0\leq\rho_{\ell}\leq\rho^{*}\\ \frac{|\rho_{L}-\rho^{*}|}{1-\rho_{L}}\leq\frac{|\rho_{\ell}-\rho^{*}|}{1-\rho_{\ell}}\alpha^{L-\ell}&\rho_{\ell}\geq\rho^{*}\\ \end{cases}

If we plug in our contraction bound up to step \ell and use the fact that the absolute value of the sequence is non-increasing, |\rho_{0}|\geq\rho_{\ell}, we have

{|ρLρ||ρ0ρ|αL0ρρ|ρLρ|1ρL|ρ0ρ|1|ρ0|αLρρ\displaystyle\begin{cases}|\rho_{L}-\rho^{*}|\leq|\rho_{0}-\rho^{*}|\alpha^{L}&0\leq\rho_{\ell}\leq\rho^{*}\\ \frac{|\rho_{L}-\rho^{*}|}{1-\rho_{L}}\leq\frac{|\rho_{0}-\rho^{*}|}{1-|\rho_{0}|}\alpha^{L}&\rho_{\ell}\geq\rho^{*}\\ \end{cases}

We can now take the worst case of these two and conclude that

|ρLρ||ρ0ρ|1|ρ0|αL\displaystyle|\rho_{L}-\rho^{*}|\leq\frac{|\rho_{0}-\rho^{*}|}{1-|\rho_{0}|}\alpha^{L}

So far in the proof, we have assumed the existence of a fixed point \rho^{*} satisfying \kappa^{\prime}(\rho^{*})<1. We now prove that such a fixed point exists, that it is unique, and that it necessarily lies in the range \rho^{*}\in[0,1].

Positivity, uniqueness, and existence of a globally attractive fixed point. Here, the goal is to prove that there is exactly one point \rho^{*}\in[0,1] such that \kappa(\rho^{*})=\rho^{*} and \kappa^{\prime}(\rho^{*})<1. We prove positivity, uniqueness, and existence separately.

Positivity: Suppose \rho^{*}\leq 0 is a fixed point. Then we can apply the bound proven for Case 3, which shows that \kappa(\rho^{*})\geq\rho^{*}+\kappa(0)>\rho^{*}, which is a contradiction.

Uniqueness: Assume that there are two fixed points \rho_{1} and \rho_{2} that satisfy \kappa^{\prime}(\rho_{1}),\kappa^{\prime}(\rho_{2})<1, and assume without loss of generality that \rho_{1}<\rho_{2}. Then the contraction rates proven above imply that all points in (-1,1), including every \rho\in(\rho_{1},\rho_{2}), are attracted towards both \rho_{1} and \rho_{2}, which is a contradiction. Thus, there can be at most one such fixed point.

Existence of \rho^{*}: Because \kappa(1)=1, the set of fixed points is non-empty. Let \rho^{*} be the smallest fixed point of \kappa(\rho)=\rho, which by the positivity result satisfies \rho^{*}>0. If \kappa^{\prime}(\rho^{*})>1, then in a small \epsilon-neighborhood below it, \rho_{1}\in(\rho^{*}-\epsilon,\rho^{*}), we have \kappa(\rho_{1})<\rho_{1}. Because \kappa is continuous, lies above the identity line at \rho=0, and lies below it at \rho=\rho_{1}, there must be a point 0<\rho_{2}<\rho_{1} with \kappa(\rho_{2})=\rho_{2}, contradicting the assumption that \rho^{*} is the smallest fixed point. Thus, we must have \kappa^{\prime}(\rho^{*})\leq 1. If \kappa^{\prime}(\rho^{*})=1, then since \kappa^{\prime} is non-decreasing and \kappa(1)=1, \kappa must coincide with the identity line from \rho^{*} to 1; being analytic, this forces all higher-order coefficients c_{k}, k\geq 2, to vanish, so \kappa would be linear, which contradicts the assumption that the activation is nonlinear. Thus, we must have \kappa^{\prime}(\rho^{*})<1, which proves the desired existence. ∎
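To make the quantities appearing in the proof concrete, here is a minimal numerical sketch (illustrative only; the coefficient profile is hypothetical and this is not the paper's released code). It iterates the toy kernel map \kappa(\rho)=0.2+0.3\rho+0.5\rho^{2}, whose unique attracting fixed point is \rho^{*}=0.4 with \kappa^{\prime}(\rho^{*})=0.7, and compares |\rho_{L}-\rho^{*}| against the geometric bound derived above.

```python
import numpy as np

# Minimal numerical sketch (illustrative only, not the paper's code): a toy
# kernel map kappa(rho) = 0.2 + 0.3*rho + 0.5*rho**2, i.e. hypothetical squared
# Hermite coefficients c_0^2 = 0.2, c_1^2 = 0.3, c_2^2 = 0.5 with kappa(1) = 1.
c2 = np.array([0.2, 0.3, 0.5])

def kappa(rho):
    return sum(ck2 * rho**k for k, ck2 in enumerate(c2))

def dkappa(rho):
    return sum(k * ck2 * rho ** (k - 1) for k, ck2 in enumerate(c2) if k >= 1)

# Locate the fixed point in (0, 1) by bisection on kappa(rho) - rho.
lo, hi = 0.0, 1.0 - 1e-9
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if kappa(mid) > mid else (lo, mid)
rho_star = 0.5 * (lo + hi)     # here rho_star = 0.4 and kappa'(rho_star) = 0.7 < 1

# Worst-case rate alpha combining the three cases above.
alpha = max(1 - kappa(0.0), dkappa(rho_star),
            (1 - rho_star) / (2 - dkappa(rho_star)))

# Iterate the kernel sequence and compare with the claimed geometric bound.
L = 60
for rho0 in (-0.9, -0.3, 0.2, 0.7, 0.99):
    rho = rho0
    for _ in range(L):
        rho = kappa(rho)
    bound = abs(rho0 - rho_star) / (1 - abs(rho0)) * alpha**L
    print(f"rho0={rho0:+.2f}  |rho_L - rho*|={abs(rho - rho_star):.2e}  bound={bound:.2e}")
```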

Proof of Proposition 2.

First, we ought to prove that in the mean-field regime the kernel map is transformed according to the stated equation. To do so, consider a one-layer update. Assume X,Y\sim N(0,1) with covariance \mathbb{E}\,XY=\rho, and let (X^{\prime},Y^{\prime}) be an independent copy of (X,Y) with the same variance and covariance structure; the independence is due to the (independent) skip-connection weights P. Now, define the joint layer update

\displaystyle\psi(X)=\sqrt{1-r^{2}}\,\phi(X)+rX^{\prime},
\displaystyle\psi(Y)=\sqrt{1-r^{2}}\,\phi(Y)+rY^{\prime},
\displaystyle\implies\kappa_{\psi}(\rho):=\mathbb{E}\,\psi(X)\psi(Y)
\displaystyle=(1-r^{2})\,\mathbb{E}\,\phi(X)\phi(Y)+r^{2}\,\mathbb{E}\,X^{\prime}Y^{\prime}
\displaystyle=(1-r^{2})\kappa(\rho)+r^{2}\rho,

where in the third line we use the independence of (X^{\prime},Y^{\prime}) from (X,Y) and the fact that X^{\prime} and Y^{\prime} have zero mean, so the cross terms vanish.

Observe that \kappa_{\psi}(0)=(1-r^{2})\kappa(0) and \kappa_{\psi}^{\prime}(\rho)=(1-r^{2})\kappa^{\prime}(\rho)+r^{2}. Now, let us consider the conditions for the four cases of Theorem 1. The decision boundaries are whether \kappa(0) is positive, and whether \kappa^{\prime}(1) is above, equal to, or below 1. For r\in(0,1), strict positivity of \kappa_{\psi}(0)=(1-r^{2})\kappa(0) is equivalent to that of \kappa(0). Furthermore, \kappa_{\psi}^{\prime}(1) is a weighted average of \kappa^{\prime}(1) and 1; thus, for all r\in(0,1), \kappa_{\psi}^{\prime}(1) is above, equal to, or below 1 if and only if \kappa^{\prime}(1) has the corresponding property. We can now turn our focus to the convergence rates, examining \alpha in each of the four cases.

  • If \kappa(0)=0, then \kappa_{\psi}(0)=(1-r^{2})\kappa(0)=0 and we have \alpha=1/(2-\kappa_{\psi}^{\prime}(0)). Note that \kappa_{\psi}^{\prime}(0)=(1-r^{2})\kappa^{\prime}(0)+r^{2} is a weighted average between \kappa^{\prime}(0)<1 and 1; thus, the larger the value of r, the larger \alpha becomes, and the slower the convergence.

  • If \kappa(0)>0 and \kappa^{\prime}(1)<1, then \kappa_{\psi}(0)>0 and \kappa_{\psi}^{\prime}(1)<1, and \alpha=\kappa_{\psi}^{\prime}(1)=(1-r^{2})\kappa^{\prime}(1)+r^{2}. Thus, the larger the residual weight, the closer \alpha is to 1, and the slower the convergence.

  • If \kappa(0)>0 and \kappa^{\prime}(1)=1, then the same holds for \kappa_{\psi}, and

    α\displaystyle\alpha =1κψ(0)κψ(0)\displaystyle=1-\kappa_{\psi}(0)-\kappa_{\psi}^{\prime}(0)
    =1(1r2)κ(0)(1r2)κ(0)r2\displaystyle=1-(1-r^{2})\kappa(0)-(1-r^{2})\kappa^{\prime}(0)-r^{2}
    =(1r2)(1κ(0)κ(0))\displaystyle=(1-r^{2})(1-\kappa(0)-\kappa^{\prime}(0))

    Now, recall that \kappa(0)=c_{0}^{2} and \kappa^{\prime}(0)=c_{1}^{2}, and thus 1-\kappa(0)-\kappa^{\prime}(0)=\sum_{k=2}^{\infty}c_{k}^{2}, which is necessarily positive for a nonlinear activation:

    α=(1r2)k=2ck2.\displaystyle\alpha=(1-r^{2})\sum_{k=2}^{\infty}c_{k}^{2}.

    We can now see that for larger r, \alpha becomes smaller, which in this polynomial-convergence case implies slower convergence.

  • If \kappa(0)>0 and \kappa^{\prime}(1)>1, then the same holds for \kappa_{\psi}. The fixed point is unchanged, since \kappa_{\psi}(\rho)=\rho if and only if \kappa(\rho)=\rho, while \kappa_{\psi}^{\prime}(\rho^{*})=(1-r^{2})\kappa^{\prime}(\rho^{*})+r^{2} moves toward 1 as r grows; hence the convergence rate \alpha again increases with the residual weight and convergence slows down, as the sketch below also illustrates.
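The following short sketch (again using the hypothetical toy kernel map from the sketch above, not the paper's code) illustrates the effect described in these cases: the residual map \kappa_{\psi}(\rho)=(1-r^{2})\kappa(\rho)+r^{2}\rho keeps the fixed point unchanged, while its slope at \rho^{*} moves toward 1 as r grows, so more layers are needed to approach the fixed point.

```python
# Sketch (illustrative, reusing the hypothetical toy map from the sketch above):
# the residual kernel map kappa_psi(rho) = (1 - r^2)*kappa(rho) + r^2*rho keeps
# the fixed point rho* = 0.4 of kappa, while its slope at rho* moves toward 1
# as the residual weight r grows, which slows down convergence.
kappa = lambda rho: 0.2 + 0.3 * rho + 0.5 * rho**2
dkappa = lambda rho: 0.3 + 1.0 * rho
rho_star = 0.4

for r in (0.0, 0.3, 0.6, 0.9):
    kappa_psi = lambda rho, r=r: (1 - r**2) * kappa(rho) + r**2 * rho
    slope = (1 - r**2) * dkappa(rho_star) + r**2
    assert abs(kappa_psi(rho_star) - rho_star) < 1e-12   # fixed point is unchanged
    # Count layers needed to get within 1e-3 of rho*, starting from rho0 = 0.95.
    rho, depth = 0.95, 0
    while abs(rho - rho_star) > 1e-3 and depth < 10_000:
        rho, depth = kappa_psi(rho), depth + 1
    print(f"r={r:.1f}  kappa_psi'(rho*)={slope:.3f}  layers to reach 1e-3: {depth}")
```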

Proof of Proposition 3.

Recall the assumption that \phi obeys \mathbb{E}\,\phi(X)^{2}=1 for X\sim N(0,1). Let Z\sim N(0,I_{d}) represent the Gaussian pre-activations from the previous layer. We consider the joint normalization-and-activation layer \psi in the mean-field regime:

  • \psi=\phi\circ RN and \psi=\phi\circ LN: Because each element of Z has zero mean and unit variance, in the mean-field regime the sample mean and variance equal their population counterparts, implying that LN(Z)=Z and RN(Z)=Z. In other words, both act as the identity. Thus, in both cases we have \psi=\phi in the mean-field regime, and the kernel map remains unchanged.

  • \psi=RN\circ\phi: In this case, because of the assumption on the activation, for Gaussian pre-activations we have \mathbb{E}\,\phi(Z_{i})^{2}=1 for all i=1,\dots,d. Because the elements of Z are i.i.d., by the law of large numbers \frac{1}{d}\|\phi(Z)\|^{2} converges to its expected value 1. Thus, the normalization step in RN again becomes ineffective, implying \psi=\phi in the mean-field regime, and the kernel map remains unchanged.

  • \psi=LN\circ\phi: By the definition of the kernel map of the activation, we have

    \displaystyle\mathbb{E}\,\phi(Z_{i})=\sqrt{\kappa_{\phi}(0)},\qquad\mathrm{Var}(\phi(Z_{i}))=\mathbb{E}\,\phi(Z_{i})^{2}-(\mathbb{E}\,\phi(Z_{i}))^{2}=1-\kappa_{\phi}(0)

    Thus, again by the law of large numbers, in the mean-field regime we have

    ψ=(ϕκϕ(0))/1κϕ(0).\psi=(\phi-\sqrt{\kappa_{\phi}(0)})/\sqrt{1-\kappa_{\phi}(0)}.

    Now, if we look at the kernel map of this affine transformation, we have

    \displaystyle\kappa_{\psi}(\rho)=\mathbb{E}\,\psi(X)\psi(Y)
    \displaystyle=\mathbb{E}\,\frac{\phi(X)-\sqrt{\kappa_{\phi}(0)}}{\sqrt{1-\kappa_{\phi}(0)}}\cdot\frac{\phi(Y)-\sqrt{\kappa_{\phi}(0)}}{\sqrt{1-\kappa_{\phi}(0)}}
    \displaystyle=\frac{\mathbb{E}\,\phi(X)\phi(Y)-\kappa_{\phi}(0)}{1-\kappa_{\phi}(0)}
    \displaystyle=\frac{\kappa_{\phi}(\rho)-\kappa_{\phi}(0)}{1-\kappa_{\phi}(0)},

    which concludes the proof.
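As a numerical illustration of this last case (a Monte Carlo sketch under our own assumptions: the width d, the random seed, and the choice of \phi as \sqrt{2}\,\mathrm{ReLU}, which satisfies \mathbb{E}\,\phi(X)^{2}=1, are not taken from the paper), one can check that layer normalization after the activation indeed produces the centered-and-rescaled kernel map derived above.

```python
import numpy as np

# Monte Carlo sketch (our own assumptions: width d, seed, and the choice of
# phi(x) = sqrt(2)*relu(x), which satisfies E[phi(X)^2] = 1 for X ~ N(0,1)).
# It checks kappa_psi(rho) = (kappa_phi(rho) - kappa_phi(0)) / (1 - kappa_phi(0))
# for psi = LN o phi, using the known closed-form (arc-cosine) ReLU kernel map.
rng = np.random.default_rng(0)
d, rho = 200_000, 0.6

phi = lambda x: np.sqrt(2.0) * np.maximum(x, 0.0)
layer_norm = lambda v: (v - v.mean()) / v.std()          # normalize across features
kappa_phi = lambda t: (np.sqrt(1 - t**2) + t * (np.pi - np.arccos(t))) / np.pi

# Correlated Gaussian pre-activations with correlation rho.
x = rng.standard_normal(d)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(d)

empirical = np.mean(layer_norm(phi(x)) * layer_norm(phi(y)))
predicted = (kappa_phi(rho) - kappa_phi(0.0)) / (1 - kappa_phi(0.0))
print(f"empirical={empirical:.4f}  predicted={predicted:.4f}")  # agree up to MC error
```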

Remark S.1.

Another interesting implication is that if we look at the discrete and continuous kernel dynamics in the limit of a very strong residual branch (r\to 1), the kernel ODE becomes exact. Let us denote the Hermite coefficients of the residual kernel by \tilde{c}_{k}. From \kappa_{\psi}(\rho)=(1-r^{2})\kappa(\rho)+r^{2}\rho, we have \tilde{c}_{1}^{2}=(1-r^{2})c_{1}^{2}+r^{2} and \tilde{c}_{k}^{2}=(1-r^{2})c_{k}^{2} for all k\neq 1. As r\to 1, we have \tilde{c}_{1}^{2}\to 1 while all other \tilde{c}_{k}^{2}\to 0, so the per-layer change \kappa_{\psi}(\rho)-\rho=(1-r^{2})(\kappa(\rho)-\rho) vanishes. This implies that in the limit of very strong residual connections, the kernel ODE gives the exact solution to the kernel dynamics.

Remark S.2.

Suppose there were an activation \phi whose kernel map has a negative fixed point \rho^{*}<0. Let n be an integer such that \rho^{*}<-1/(n-1), and let x_{1},\dots,x_{n}\in\mathbb{R}^{d} be inputs with non-zero pairwise inner products. Construct an MLP with activation \phi and let y_{1},\dots,y_{n} be its outputs. If the depth is sufficiently large, the pairwise similarity of each pair converges to \rho^{*}, so the output Gram matrix G=[\langle y_{i},y_{j}\rangle]_{i,j\leq n} has unit diagonal and off-diagonal entries (arbitrarily close to) \rho^{*}. The eigenvalue of G corresponding to the all-ones eigenvector is \lambda=1+(n-1)\rho^{*}. Because we assumed \rho^{*}<-1/(n-1), we have 1+(n-1)\rho^{*}<0, which is a contradiction, because G is a Gram matrix and must be positive semi-definite.
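A quick numerical illustration of this eigenvalue argument (a sketch; the values of n and \rho are chosen only for illustration):

```python
import numpy as np

# Sketch: the Gram matrix described above, with unit diagonal and common
# off-diagonal value rho, has eigenvalues 1 + (n-1)*rho (all-ones eigenvector)
# and 1 - rho (multiplicity n-1); rho < -1/(n-1) forces a negative eigenvalue.
n, rho = 5, -0.3                                  # note -0.3 < -1/(n-1) = -0.25
G = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
print(np.round(np.linalg.eigvalsh(G), 3))         # smallest eigenvalue: -0.2 < 0
```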

Corollary S.1.

If \phi has the Hermite expansion \phi=\sum_{k=m}^{\infty}c_{k}\mathrm{he}_{k} with m\geq 2, i.e., its lowest-order non-zero Hermite coefficient has index m, then the kernel sequence of an MLP with activation \phi converges to zero at a doubly exponential rate,

|ρ||ρ0|m.\displaystyle|\rho_{\ell}|\leq|\rho_{0}|^{m^{\ell}}.

The proof of the corollary follows by observing that the slowest convergence occurs when all of the coefficient mass is concentrated on the k=m term; in that case |\kappa(\rho)|\leq|\rho|^{m}, and the claim follows by induction over \ell.
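As a small sanity check (an illustrative sketch for the extreme case mentioned in the proof, where all Hermite mass sits at degree m; here m=2, corresponding for example to \phi=\mathrm{he}_{2}/\sqrt{2}):

```python
# Sketch: the extreme case where all Hermite mass sits at degree m, so that
# kappa(rho) = rho**m (here m = 2, e.g. phi = he_2 / sqrt(2)); the iterate
# rho_ell then matches the bound |rho_0|**(m**ell) exactly.
m, rho0 = 2, 0.9
rho = rho0
for ell in range(1, 6):
    rho = rho**m
    print(ell, rho, rho0 ** (m**ell))
```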

Appendix B Validation of the global convergence theorem

Here we will provide some numerical validation of the global convergence theorem.

Remark S.3.

First off, note that Theorem 1 requires the activation function to preserve (neither increase nor decrease) the energy of the pre-activations, \mathbb{E}\,[\phi(X)^{2}]=1 for X\sim N(0,1). While some activation functions, notably SeLU (Klambauer et al., 2017), have this property, we can enforce it for any activation by rescaling it by the inverse of the constant C=\sqrt{\mathbb{E}\,[\phi(X)^{2}]}. After this step, we can quantify the various quantities relevant to Theorem 1. This normalization is applied both to the results in the table and to Figures S.1 and S.2. The scaling constant C for each activation function is shown in Table S.1.
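As an illustration of this rescaling step (our own sketch based on Gauss-Hermite quadrature; this is not necessarily how the paper's code computes these constants), C=\sqrt{\mathbb{E}[\phi(X)^{2}]} can be estimated as follows. For instance, ReLU gives C=1/\sqrt{2}\approx 0.71, consistent with Table S.1.

```python
import numpy as np

# Sketch (our own quadrature-based estimate, not necessarily the paper's code):
# C = sqrt(E[phi(X)^2]) for X ~ N(0,1), via Gauss-Hermite quadrature.
nodes, weights = np.polynomial.hermite_e.hermegauss(80)   # probabilists' Hermite
weights = weights / weights.sum()                          # expectation under N(0,1)

activations = {
    "relu":    lambda x: np.maximum(x, 0.0),
    "tanh":    np.tanh,
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
}
for name, phi in activations.items():
    C = np.sqrt(np.sum(weights * phi(nodes) ** 2))
    print(f"{name:8s} C = {C:.2f}")     # relu gives 1/sqrt(2), i.e. approximately 0.71
```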

Recall the conditions for each convergence from Theorem 1:

  • κ(0)=0\kappa(0)=0: Case 1, exponential convergence towards ρ=0.\rho^{*}=0.

  • κ(0)>0\kappa(0)>0:

    • κ(1)<1\kappa^{\prime}(1)<1: Case 2, exponential convergence towards ρ=1.\rho^{*}=1.

    • κ(1)=1\kappa^{\prime}(1)=1: Case 3, polynomial convergence towards ρ=1.\rho^{*}=1.

    • κ(1)>1\kappa^{\prime}(1)>1: Case 4: exponential convergence towards some ρ(0,1).\rho^{*}\in(0,1).

In Table S.1, we have quantified the fixed points, the relevant quantities, and the convergence rates for each activation function. The table shows that for the most popular activation functions used in practice, we can identify the fixed point their kernel sequence converges to, together with an explicit convergence rate.

It is worth noting that because Theorem 1 provides an upper bound, the quantified values of \alpha cannot be compared directly across activations, particularly when they correspond to different cases. In other words, because this is a worst-case bound, an activation function may converge considerably faster than predicted. We explore the gaps between the empirical values and the upper bound in the subsequent figures.

\phi | C | \alpha | \rho^{*} | \kappa(\rho^{*}) | \kappa(0) | \kappa^{\prime}(0) | \kappa^{\prime}(1) | \kappa^{\prime}(\rho^{*}) | Convergence
tanh | 0.63 | 0.93 | 0.00 | 0.00 | 0.00 | 0.93 | 1.18 | 0.93 | Case 1 / Exp.
SeLU | 1.00 | 0.97 | 0.00 | 0.00 | 0.00 | 0.97 | 1.06 | 0.97 | Case 1 / Exp.
ReLU | 0.71 | 0.95 | 1.00 | 1.00 | 0.32 | 0.50 | 0.95 | 0.95 | Case 2 / Exp.
sigmoid | 0.54 | 0.15 | 1.00 | 1.00 | 0.85 | 0.15 | 0.15 | 0.15 | Case 2 / Exp.
exp | 2.72 | 0.74 | 1.00 | 1.00 | 0.37 | 0.37 | 1.00 | 1.00 | Case 3 / Poly.
GELU | 0.65 | 0.93 | 0.76 | 0.76 | 0.19 | 0.59 | 1.07 | 0.93 | Case 4 / Exp.
CELU | 0.80 | 0.97 | 0.60 | 0.60 | 0.04 | 0.90 | 1.04 | 0.97 | Case 4 / Exp.
ELU | 0.80 | 0.97 | 0.60 | 0.60 | 0.04 | 0.90 | 1.04 | 0.97 | Case 4 / Exp.
Table S.1: Activation functions: a review of the most commonly used activation functions (normalized; see Remark S.3) according to Theorem 1. The last column notes the convergence case according to Theorem 1, as well as the speed of convergence in depth (Poly.: polynomial, Exp.: exponential; not to be confused with the activation \phi=\texttt{exp}). Note that only the exp activation function has polynomial convergence.

B.1 Figures S.1 and S.2

Each row of the figures shows one activation function, and each column is dedicated to the following (from left to right). The first column shows the activation function itself, up to scaling (see Remark S.3). Below, we describe the remaining three columns.

Kernel map and fixed point iterations

The second column shows the kernel map \kappa of each activation function (blue), as defined in Definition 1, and the fixed-point iterations over this kernel map (red), defined by the pairs of points

(ρ+1,ρ),ρ+1:=κ(ρ)(\rho_{\ell+1},\rho_{\ell}),\qquad\qquad\rho_{\ell+1}:=\kappa(\rho_{\ell})

where ρ0\rho_{0} indicates the initial value that is arbitrarily chosen (shown as a red dot marker). As the iterations progress, they converge to the fixed point (ρ\rho^{*}, red star marker). The identity map is shown to demonstrate visually how the iterations lead to the fixed point.

It is worth noting that the kernel maps of all activation functions are analytic, i.e., smooth to arbitrary order, even when the activation function itself is non-smooth, as is the case for ReLU and SeLU at x=0. We can also see that despite the non-monotonicity of some activation functions, such as GELU, the kernel map retains its key properties, such as having a unique globally attracting fixed point.

These kernel map plots give important insight into the global convergence of the kernel sequence and the predictions of Theorem 1. Intuitively, the fixed point is where the kernel map intersects the identity line, and the speed of convergence towards it is inversely related to the slope at \rho^{*}. For example, tanh and SeLU stay close to the identity line and exhibit very slow convergence, while sigmoid deviates the most from the identity and shows the fastest convergence. Overall, we can see why the deviation from the identity, as captured by \kappa^{\prime}(\rho^{*}) and related quantities, plays an important role in the convergence of the kernel sequence.

It is worth noting that while sigmoid and tanh are tightly related by the formula \texttt{tanh}(x)=2\cdot\texttt{sigmoid}(2x)-1, their convergence properties are drastically different: tanh converges exponentially towards zero (an orthogonality, or independence, bias), while sigmoid converges exponentially towards 1 (a strong similarity bias). We can explain this by observing that the shift and rescaling give tanh zero-mean post-activations, so that \kappa(0)=0 and \rho^{*}=0 becomes the fixed point of its kernel map.
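This observation can be checked numerically (a small sketch of our own; the quadrature order is an arbitrary choice): after normalization, \kappa(0)=(\mathbb{E}\,\phi(X))^{2}/\mathbb{E}\,\phi(X)^{2}, which vanishes for the odd function tanh but is close to one for sigmoid.

```python
import numpy as np

# Sketch (our own quadrature-based check): after normalization,
# kappa(0) = (E[phi(X)])^2 / E[phi(X)^2]; it vanishes for the odd function tanh
# and is close to one for sigmoid, explaining their opposite biases.
nodes, weights = np.polynomial.hermite_e.hermegauss(80)
weights = weights / weights.sum()                  # expectation under N(0,1)

for name, phi in {"tanh": np.tanh,
                  "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x))}.items():
    m1 = np.sum(weights * phi(nodes))              # E[phi(X)]
    m2 = np.sum(weights * phi(nodes) ** 2)         # E[phi(X)^2]
    print(f"{name:8s} kappa(0) = {m1**2 / m2:.2f}")  # tanh: 0.00, sigmoid: about 0.85
```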

Kernel convergence theory vs empirical results.

The third and fourth columns show the kernel sequence (empirical: blue, theory: red) as a function of depth \ell. The theoretical bound corresponds to Theorem 1. The fourth column plots the upper bound provided by the theorem, and the third column uses the upper bound on the distance to \rho^{*} to give a lower or upper bound on the kernel sequence itself. One of the most important takeaways is that for all activation functions except exp, the distances |\rho_{\ell}-\rho^{*}|, shown on a log scale, decay linearly with depth \ell. This is perfectly aligned with the prediction of Theorem 1, because exp is the only activation function with polynomial convergence (see Table S.1). The gap between theory and empirical values reflects the worst-case nature of the global convergence rate.

Figure S.1: Validation of Theorem 1. Each row corresponds to an activation, scaled down by the factor C to obey \mathbb{E}\,\phi(X)^{2}=1. From top to bottom: relu, exp, gelu, tanh. From left to right: the first column shows the activation, the second column the kernel map, the third column the kernel sequence versus depth together with the theoretical prediction, and the fourth column the distance to the fixed point for the theoretical and empirical kernels. (Remainder on the next page.)
Figure S.2: Continuation of Figure S.1, for the activations elu, celu, sigmoid, and selu. For the particular case of sigmoid, the errors fall below numerical precision and cannot be computed.
Figure S.3: Continuation of Figure S.1, for LeakyReLU with various negative slopes.