
Complex fractal trainability boundary can arise from trivial non-convexity

Yizhou Liu liuyz@mit.edu Physics of Living Systems, Department of Physics, MIT, Cambridge, MA 02139, USA. Department of Mechanical Engineering, MIT, Cambridge, MA 02139, USA.
Abstract

Training neural networks involves optimizing parameters to minimize a loss function, where the nature of the loss function and the optimization strategy are crucial for effective training. Hyperparameter choices, such as the learning rate in gradient descent (GD), significantly affect the success and speed of convergence. Recent studies indicate that the boundary between bounded and divergent hyperparameters can be fractal, complicating reliable hyperparameter selection. However, the nature of this fractal boundary and methods to avoid it remain unclear. In this study, we focus on GD to investigate the loss landscape properties that might lead to fractal trainability boundaries. We discovered that fractal boundaries can emerge from simple non-convex perturbations, i.e., adding or multiplying cosine-type perturbations to quadratic functions. The observed fractal dimensions are influenced by factors like parameter dimension, type of non-convexity, perturbation wavelength, and perturbation amplitude. Our analysis identifies “roughness of perturbation”, which measures the gradient’s sensitivity to parameter changes, as the factor controlling fractal dimensions of trainability boundaries. We observed a clear transition from non-fractal to fractal trainability boundaries as roughness increases, with the critical roughness coinciding with the perturbed loss function becoming non-convex. Thus, we conclude that fractal trainability boundaries can arise from very simple non-convexity. We anticipate that our findings will enhance the understanding of complex behaviors during neural network training, leading to more consistent and predictable training strategies.

Introduction

Machine learning has become a cornerstone of modern technology. When training a machine learning model, we need to update the model parameters towards optimizing a loss function (usually based on the loss gradient) to attain desired performance. To better understand and achieve successful training, researchers have tried to capture shapes of the loss landscapes [1, 2, 3] and dynamics that arise from optimization algorithms [4, 5, 6, 7, 8]. In general, despite the considerable empirical success and broad application of these optimization techniques in training models, our theoretical understanding of the training processes remains limited.

Recent empirical evidence suggests that whether a model is trainable can be extremely sensitive to choices in optimization. A model is said to be untrainable here if the applied optimization leads to a divergent loss value. By convention, we refer to the model parameters to be optimized as parameters, and to the quantities in the optimizer controlling the optimization process as hyperparameters. One of the most important hyperparameters is the learning rate, which sets the size of the steps taken during optimization. On a simple two-layer neural network, gradient descent (GD) was recently found to have a fractal boundary between learning rates that lead to bounded and divergent loss (fractal trainability boundary for short) [9]. Consequently, a slight change in hyperparameters can change the training result qualitatively, leaving little hope of choosing good hyperparameters in advance.

Here, we aim to explore the mechanisms underlying the fractal trainability boundary and the key factors influencing its fractal dimension. Guided by the intuition that non-convexity renders the gradient sensitive to the parameters, which in turn makes GD sensitive to the learning rate, we set out to quantify the relation between non-convexity and fractal trainability. Given the difficulties in describing and controlling the loss functions of real neural networks, our approach involves constructing simple non-convex loss functions and testing GD on them to examine the boundary between learning rates that lead to bounded versus divergent losses. We discover that even a simple quadratic function, when perturbed by adding or multiplying a regular perturbation function (specifically a cosine function), exhibits a fractal trainability boundary. A parameter specific to the form of the perturbation, which we define as roughness, appears to govern the fractal dimension of the trainability boundary; it describes the gradient's sensitivity to the parameter. A notable difference emerges between the fractal behaviors of trainability in quadratic functions perturbed by additive versus multiplicative perturbations: fractal behavior disappears at a finite roughness (when the perturbed loss becomes convex) for additive cases but persists for multiplicative perturbations (where the perturbed loss is always non-convex). We therefore offer a perspective that explains the fractal trainability boundaries observed in real neural networks as a result of non-convexity, emphasizing “roughness” as the factor controlling fractal dimensions.

Results

We first elaborate on the key intuition for why certain non-convexity may lead to a fractal trainability boundary. Common ways to generate fractals involve iterating a function that is sensitive to hyperparameters in the function (e.g., Mandelbrot and quadratic Julia sets [10, 9]). We denote the loss function as f(x), where x is the parameter optimized to minimize f(x). GD can be described as iterating the function

x^{(k+1)} = x^{(k+1)}(x^{(k)}; s) = x^{(k)} - s\nabla f(x^{(k)}). (1)

Here, x^{(k)} is the parameter obtained at the kth step and s is the learning rate (hyperparameter). If non-convexity can make the gradient \nabla f(x) sensitive to the parameter x, then after we shift the learning rate s a little at the kth step (this defines a second training process to be compared with the original one), x^{(k+1)} will be slightly different from the value obtained with the unchanged learning rate. However, the gradient at the new x^{(k+1)} can be very different, leading to very different subsequent iterations. The sensitivity of the gradient to the parameter is thereby translated into a sensitive dependence of the optimization process on the learning rate (hyperparameter), which is the key ingredient for generating fractals. Notably, this sensitive dependence on the learning rate is sufficient to generate chaos, while it is not obvious that it yields divergent training.
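To make this intuition concrete, the sketch below iterates Eq. (1) for the additive-perturbation loss introduced below (Eq. (2)) and compares two nearby learning rates. This is a minimal illustration with our own function names and arbitrarily chosen learning rates, not the exact code used for the experiments.

```python
import numpy as np

def gd_trajectory(grad, x0, s, n_steps=50):
    """Iterate x <- x - s * grad(x), Eq. (1), and return all visited points."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] - s * grad(xs[-1]))
    return np.array(xs)

# Gradient of f_+(x) = x^2 + eps*cos(2*pi*x/lam) with eps = 0.2, lam = 0.1.
eps, lam = 0.2, 0.1
grad = lambda x: 2 * x - eps * (2 * np.pi / lam) * np.sin(2 * np.pi * x / lam)

# Two nearby learning rates typically separate quickly on this rough landscape.
traj_a = gd_trajectory(grad, x0=1.0, s=0.300)
traj_b = gd_trajectory(grad, x0=1.0, s=0.301)
print(np.abs(traj_a - traj_b)[-1])  # usually order one after a few dozen steps
```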

Figure 1: On constructed loss functions, we conduct numerical experiments to study the trainability boundaries. (a) Illustration of a loss landscape with additive perturbation (f_{+} with \epsilon=0.2 and \lambda=0.1). (b) An example of a constructed loss with multiplicative perturbation (f_{\times} with \epsilon=0.2 and \lambda=0.1). (c) On a fixed range of learning rates, we can place N small segments and evaluate whether training diverges at each endpoint of the segments. We can therefore generate a set of boundary segments, B_{N}, and count the number of boundary segments. (d) An example showing that, as we use more segments (finer graining), the number of boundary segments (black segments) increases (figure obtained for the multiplicative perturbation case f_{\times} with parameters \epsilon=0.2 and \lambda=0.1). The colored bar at the bottom visualizes losses for bounded (blue) and divergent (red) training.

We thus need to run experiments on specific functions to test whether non-convexity with sensitive gradients can lead to fractal trainability boundaries. We started by looking at one-dimensional parameters (x \in \mathbb{R}^{d}, d=1) and chose to construct our loss landscape by perturbing the convex quadratic function f_{0}(x) = x^{2}. One way to perturb it is to add a non-convex function (Fig. 1a). We used the simplest regular perturbation \epsilon f_{1} = \epsilon\cos(2\pi x/\lambda), where \lambda is the wavelength of the perturbation and \epsilon its amplitude. We therefore define the additive perturbation case in our context as

f_{+}(x) = f_{0} + \epsilon f_{1} = x^{2} + \epsilon\cos(2\pi x/\lambda). (2)

An alternative way to introduce non-convexity is to multiply by a perturbation function (Fig. 1b), which will be referred to as the multiplicative perturbation case:

f_{\times}(x) = f_{0}(1 + \epsilon f_{1}) = x^{2}\big(1 + \epsilon\cos(2\pi x/\lambda)\big). (3)

The two cases are qualitatively different: as x \to \infty, the additive perturbation becomes small compared to f_{0} = x^{2}, while the multiplicative perturbation always remains comparable to the unperturbed f_{0} = x^{2}. Our test functions have simple analytic forms and represent different forms of non-convexity.
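For concreteness, a minimal sketch of the two test losses in Eqs. (2) and (3) and their gradients is given below (one-dimensional case). The function names are ours, and the default \epsilon and \lambda simply match the values used in Fig. 1.

```python
import numpy as np

def f_add(x, eps=0.2, lam=0.1):
    """Additive perturbation case, Eq. (2)."""
    return x**2 + eps * np.cos(2 * np.pi * x / lam)

def grad_add(x, eps=0.2, lam=0.1):
    return 2 * x - eps * (2 * np.pi / lam) * np.sin(2 * np.pi * x / lam)

def f_mul(x, eps=0.2, lam=0.1):
    """Multiplicative perturbation case, Eq. (3)."""
    return x**2 * (1 + eps * np.cos(2 * np.pi * x / lam))

def grad_mul(x, eps=0.2, lam=0.1):
    w = 2 * np.pi / lam
    return 2 * x * (1 + eps * np.cos(w * x)) - eps * w * x**2 * np.sin(w * x)
```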

We next explain how we investigate fractal trainability boundaries numerically. The boundary points separating learning rates that lead to finite versus divergent loss are not accessible directly, so we use a finite but large number of grid points to locate the boundary learning rates. On a given range of learning rates (s \in [s_{\rm min}, s_{\rm max}]), we place N+1 evenly spaced grid points, at each of which GD can be tested for divergence. If two neighboring grid values of the learning rate lead to the same divergent/bounded behavior, then at this coarse-grained level (quantified by N) we say there is no boundary between the two grid points. Otherwise, we say the segment between these two grid points is a boundary segment. We define B_{N} as the set containing all boundary segments when we have N+1 grid points (Fig. 1c). Heuristically, we expect each boundary segment to cover one boundary point accurately as the segment length goes to zero (i.e., N \to \infty), so that B_{N} becomes the set of boundary learning rates as N \to \infty. If the number of boundary segments, denoted by |B_{N}|, increases with N asymptotically according to a scaling law,

|B_{N}| \propto N^{\alpha}, \quad (N \to \infty), (4)

we say the trainability boundary has a fractal dimension \alpha (by convention, the fractal dimension defined in this way is called the box dimension [11]). We observed that the number of boundary segments (black segments in Fig. 1d) indeed increases when we test the multiplicative case and place more and more grid points in a fixed learning rate range (Fig. 1d). The colored bar at the bottom visualizes loss values for bounded (blue) and divergent (red) training evaluated at 2^{20} grid points in [0, 1.5] (Fig. 1d). The color intensity is determined by \sum_{i} f_{i} for bounded training and by \sum_{i} f_{i}^{-1} for divergent training [9], where f_{i} is the loss value at the ith step (1000 steps in total). Once we zoom in, we see more boundaries between blue and red (the original vector image can be found online). We therefore have a numerical tool for identifying fractals.
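A sketch of the counting procedure behind Eq. (4) follows: each grid learning rate is classified as bounded or divergent, and |B_{N}| is the number of neighboring pairs whose labels differ. The divergence test mirrors the protocol described in the next paragraph (1000 GD steps, a cap on the cumulative loss); the helper names are ours.

```python
import numpy as np

def diverges(grad, loss, s, x0=1.0, n_steps=1000, loss_cap=1e16):
    """Run GD with learning rate s; report divergence via the cumulative loss."""
    x, total = x0, 0.0
    for _ in range(n_steps):
        total += loss(x)
        if not np.isfinite(total) or total > loss_cap:
            return True
        x = x - s * grad(x)
    return False

def count_boundary_segments(grad, loss, n_segments, s_min=0.0, s_max=1.5):
    """|B_N|: number of neighboring grid points with different labels."""
    s_values = np.linspace(s_min, s_max, n_segments + 1)
    labels = np.array([diverges(grad, loss, s) for s in s_values])
    return int(np.sum(labels[1:] != labels[:-1]))
```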

Figure 2: Simple non-convexity can lead to fractal trainability boundaries, whose fractal dimensions depend on the perturbation form, wavelength, and amplitude. (a) For the additive perturbation case f_{+} with \epsilon=0.2 and \lambda=0.1, we studied learning rates in [0, 1.5], where the number of boundary segments increases as a scaling law with respect to the number of segments used, suggesting a fractal trainability boundary. The fractal dimension, i.e., the slope of \log|B_{N}| against \log N, is fitted as 0.996 \pm 0.005. (b) For the multiplicative perturbation case f_{\times} with \epsilon=0.2 and \lambda=0.1, a fractal trainability boundary is also observed, with fractal dimension 0.837 \pm 0.004. (c, d) The fractal dimension of the trainability boundary varies with the perturbation amplitude \epsilon and wavelength \lambda. In particular, for the additive perturbation case (c), the fractal dimension increases with larger amplitude and smaller wavelength. For the multiplicative case (d), the fractal dimension has no clear dependence on the two function parameters.

We investigated the trainability boundaries using GD on the constructed loss functions. For our experiments, we applied GD to the loss function with additive perturbation, f_{+}, starting from x^{(0)} = 1.0 with parameters \epsilon=0.2 and \lambda=0.1. We conducted 1000 steps of GD and defined a training session as divergent (untrainable) if the sum of losses over the 1000 steps exceeded 10^{16}. This upper loss threshold can misclassify slowly diverging training as bounded; however, varying it between 10^{12} and 10^{20} did not alter the observed fractal dimension. We adjusted the learning rate within the range [0, 1.5] and increased the number of intervals, N, up to 2^{32} (see Methods). Our findings reveal that the trainability boundary for this simple function displays fractal behavior: the number of boundary segments, |B_{N}|, increases following a scaling law with N at large values (Fig. 2a). The fractal dimension, \alpha, calculated via least squares as the slope of \log|B_{N}| against \log N (the \log base is 2 throughout this paper), is approximately 0.996 \pm 0.005 (the error is the standard deviation). A similar analysis on the loss function with multiplicative perturbation, f_{\times}, yielded a fractal dimension of \alpha = 0.837 \pm 0.004 (Fig. 2b). This suggests that the fractal boundaries are more densely packed in a narrower range for the additive perturbation scenario, indicating a potentially less erratic behavior. Nonetheless, the emergence of fractal trainability boundaries in these trivially simple loss functions is remarkable.
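A small-scale version of this fit can be sketched as follows, reusing the helper sketches above; the range of N here is far below the 2^{32} used in the paper and is purely illustrative (the pure-Python loop is slow).

```python
import numpy as np

# |B_N| for N = 2^n over the learning-rate range [0, 1.5], additive case.
ns = np.arange(6, 14)                  # N = 2^6 ... 2^13, illustrative only
counts = [count_boundary_segments(grad_add, f_add, 2**n) for n in ns]

# Fractal (box) dimension: least-squares slope of log2|B_N| against log2(N).
log_b = np.log2(np.maximum(counts, 1))
alpha, intercept = np.polyfit(ns.astype(float), log_b, 1)
print(f"estimated fractal dimension: {alpha:.3f}")
```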

We next sought to examine the factors affecting the fractal dimension of trainability boundaries. We evaluated the fractal dimension of the trainability boundary for ten different amplitudes \epsilon evenly picked from [0.01, 0.2] and ten different wavelengths \lambda evenly picked from [0.01, 1.0]. Least-squares fitting was used to obtain the fractal dimensions. We found that for the additive perturbation case, the fractal dimension increases with decreasing wavelength and increasing amplitude (Fig. 2c), while for the multiplicative perturbation case, the fractal dimension has no clear dependence on amplitude or wavelength (Fig. 2d). The fractal dimension therefore depends on the type, wavelength, and amplitude of the non-convex perturbation in a complicated manner.

We next analyze how the perturbation wavelength and amplitude change the fractal dimension of the trainability boundary. We can rescale the parameter, \tilde{x} = x/b, and renormalize the loss, \tilde{f}(\tilde{x}) = f(x)/\zeta, such that we map a GD process starting at x^{(0)} to another one starting at \tilde{x}^{(0)} = x^{(0)}/b and updating according to

\tilde{x}^{(k+1)} = \tilde{x}^{(k)} - \tilde{s}\nabla\tilde{f}(\tilde{x}^{(k)}). (5)

By choosing \zeta = b^{2}, we can show that \tilde{s} = s and that \tilde{f} has the same functional form (i.e., with additive or multiplicative perturbation) as f, but with different function parameters \tilde{\epsilon} and \tilde{\lambda} (see Methods). This means a GD process on our constructed loss with given \epsilon, \lambda, and x^{(0)} has the same divergence property at a given learning rate as a GD process on the constructed loss with \tilde{\epsilon}, \tilde{\lambda}, and \tilde{x}^{(0)}. Therefore, the trainability boundaries are the same for the two sets of conditions, \{\epsilon, \lambda, x^{(0)}\} and \{\tilde{\epsilon}, \tilde{\lambda}, \tilde{x}^{(0)}\}. If we further assume that the fractal dimension \alpha depends mainly on loss properties rather than on the initial condition (which appears to hold; see SI), we conclude that the fractal dimensions of the trainability boundaries for \{\epsilon, \lambda\} and \{\tilde{\epsilon}, \tilde{\lambda}\} are the same.

More specifically, for the additive perturbation case, the renormalization flow that leaves the fractal dimension unchanged is given by (see Methods)

\tilde{\epsilon} = \epsilon/b^{2}, \quad \tilde{\lambda} = \lambda/b. (6)

Note that only one independent combination of \epsilon and \lambda, namely

\theta_{+} = \epsilon/\lambda^{2}, (7)

is invariant under the renormalization transformation. Since the fractal dimension depends only on \{\epsilon, \lambda\} and is invariant under the renormalization transformations, we claim that it can only depend on \theta_{+}. We call the quantity \theta_{+} the “roughness”, as it appears in the prefactor of the second derivative of the additive perturbation and measures the sensitivity of the gradient's dependence on the parameter. We plotted the fractal dimension as a function of the roughness \theta_{+} (the same data as Fig. 2c) and found that the fractal dimension shows a clear and sharp transition from zero (i.e., no fractal behavior) to non-zero as the roughness \theta_{+} increases (Fig. 3a). We found that the critical roughness \theta_{+} is near 1/2\pi^{2} (dashed line in Fig. 3a), which corresponds to the critical situation where f_{+} begins to be non-convex (\exists x such that \nabla^{2}f_{+}(x) = 0). The simple renormalization analysis thus yields the roughness of the additive perturbation, which determines the fractal dimension and shows that fractal behaviors appear once the perturbed loss is non-convex.
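The convexity criterion behind this transition can be written down directly: for f_{+}, the minimum of the second derivative over x is 2 - \epsilon(2\pi/\lambda)^{2}, so the loss stays convex exactly when \theta_{+} = \epsilon/\lambda^{2} \leq 1/2\pi^{2}. A small sketch (with our own helper names) makes the check explicit.

```python
import numpy as np

def roughness_add(eps, lam):
    """theta_+ = eps / lam^2, Eq. (7)."""
    return eps / lam**2

def is_convex_add(eps, lam):
    """f_+ is convex iff min d^2f/dx^2 = 2 - eps*(2*pi/lam)^2 >= 0."""
    return roughness_add(eps, lam) <= 1.0 / (2 * np.pi**2)

print(is_convex_add(0.2, 0.1))     # False: roughness 20 >> 1/(2 pi^2) ~ 0.0507
print(is_convex_add(0.004, 0.3))   # True: roughness ~ 0.044 < 0.0507
```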

Figure 3: Roughness determines the fractal dimension of trainability boundaries and captures the transition to a fractal trainability boundary when the landscape becomes non-convex. (a) For the additive perturbation case, the roughness \theta_{+} organizes well the fractal dimensions \alpha obtained with different amplitudes and wavelengths (data in Fig. 2c); we see a clear transition to non-zero fractal dimensions near \theta_{+} = 1/2\pi^{2} (dashed line, corresponding to the emergence of non-convexity). (b) For the multiplicative perturbation case, the roughness \theta_{\times} determines the fractal dimensions \alpha (data from Fig. 2d). Error bars are standard deviations of the fits.

Following the same renormalization procedure, we find that the roughness determining the fractal dimension for the multiplicative perturbation case is

\theta_{\times} = \epsilon. (8)

This quantity \theta_{\times} = \epsilon also appears in the second derivative of f_{\times} and contributes to the sensitive dependence of the gradient on the parameter x, but it is not the only combination that appears there (there are also \epsilon/\lambda and \epsilon/\lambda^{2}), so looking at the second derivative alone does not suffice. For the multiplicative perturbation case, we found that the fractal dimension decreases slightly with increasing roughness (Fig. 3b; same data as Fig. 2d). The multiplicative case is always non-convex and always has a fractal trainability boundary, which is consistent with the previous finding that fractal behaviors emerge when the perturbed loss becomes non-convex. Note that the non-convex part of the multiplicative case (where the second derivative begins to be non-positive) has loss values around and above the order of magnitude 1/\epsilon; if \epsilon is too small while our numerical upper bound for classifying bounded versus divergent training is not large enough, the classification depends solely on the convex part of the loss. In this case, if non-convexity is necessary for fractal behaviors, we would expect no fractal behaviors as a numerical artifact, which indeed turns out to be the case (see SI). We therefore conclude that the roughness found through a simple renormalization determines the fractal dimension of trainability boundaries, and that the transition to non-zero fractal dimensions corresponds to the loss function becoming non-convex.
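As a quick symbolic cross-check of this statement (a sketch using sympy, not part of the paper's pipeline), one can expand the second derivative of f_{\times} and see the combinations \epsilon, \epsilon/\lambda, and \epsilon/\lambda^{2} appear, as written out in Eq. (S1) of the SI.

```python
import sympy as sp

x, eps, lam = sp.symbols('x epsilon lambda', positive=True)
f_mul = x**2 * (1 + eps * sp.cos(2 * sp.pi * x / lam))

# Second derivative; equivalent, up to rearrangement, to Eq. (S1) in the SI.
print(sp.expand(sp.diff(f_mul, x, 2)))
```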

Beyond the simple cases we can analyze, we next studied slightly more complicated loss landscapes. As a first step towards perturbations with multiple length scales, we considered additive perturbations with two cosine functions,

f_{+}(x) = x^{2} + \epsilon_{1}\cos(2\pi x/\lambda_{1}) + \epsilon_{2}\cos(2\pi x/\lambda_{2}). (9)

The renormalization argument can only say that \epsilon_{1}/\lambda_{1}^{2} and \epsilon_{2}/\lambda_{2}^{2} determine the fractal dimension; it cannot yield a single variable controlling the fractal behavior. In numerical tests, we fixed \lambda_{1} = 0.3 and \lambda_{2} = 0.5 while changing the amplitudes \epsilon_{1} and \epsilon_{2}. We found that the fractal dimension depends non-monotonically on each of the amplitudes (Fig. 4a). However, the claim that fractal behaviors arise when the loss becomes non-convex remains valid, as the boundary between convex and non-convex losses (red curve in Fig. 4a, solved numerically) also separates zero and non-zero fractal dimensions.
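The convexity boundary in Fig. 4a can be located by a simple numerical scan of the second derivative; a sketch is given below (the grid resolution and scan window are our choices, not the paper's exact procedure).

```python
import numpy as np

def min_second_derivative(eps1, eps2, lam1=0.3, lam2=0.5):
    """Minimum of d^2f_+/dx^2 for the two-cosine loss of Eq. (9), scanned on a grid."""
    x = np.linspace(0.0, 15.0, 200001)   # several periods of both cosines
    d2 = (2
          - eps1 * (2 * np.pi / lam1)**2 * np.cos(2 * np.pi * x / lam1)
          - eps2 * (2 * np.pi / lam2)**2 * np.cos(2 * np.pi * x / lam2))
    return d2.min()

print(min_second_derivative(0.001, 0.001) > 0)   # True: convex for tiny amplitudes
print(min_second_derivative(0.05, 0.05) > 0)     # False: non-convex for larger ones
```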

We next explored how the dimension of the parameter x affects the trainability boundary. For the additive perturbation case, we generalize the function for x \in \mathbb{R}^{d} as

f_{+}(x) = \sum_{i} x_{i}^{2} + \epsilon\sum_{i}\cos(2\pi x_{i}/\lambda). (10)

Following similar numerical methods as before (see detailed parameter settings in the SI), we studied the fractal dimension of the trainability boundary for f_{+}, varying d from 1 to 100. The results suggest that the fractal dimension does not change much with d in the additive perturbation scenario (Fig. 4b). For the multiplicative perturbation case, we can define a class of functions

f_{\times}(x) = \Big(1 + \epsilon\sum_{i}\cos(2\pi x_{i}/\lambda)\Big)\sum_{i} x_{i}^{2}. (11)

We found that the fractal dimension \alpha slowly increases with increasing d in this case (Fig. 4c), which makes sense, as high-dimensional optimization should be more complicated. Our renormalization procedure cannot connect two functions with different dimensions d, and therefore roughness values for functions with different dimensions d are not comparable. Future work is needed to analyze the impact of the parameter dimension d, e.g., by defining a generalized roughness that determines the fractal dimension of trainability boundaries across different d.
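A minimal sketch of GD on the d-dimensional losses in Eqs. (10) and (11) is shown below; the initial condition (all components set to 1.0) is an illustrative assumption, since the exact high-dimensional settings are given in the SI.

```python
import numpy as np

def grad_add_nd(x, eps=0.2, lam=0.1):
    """Gradient of Eq. (10); acts elementwise on a vector x of shape (d,)."""
    w = 2 * np.pi / lam
    return 2 * x - eps * w * np.sin(w * x)

def grad_mul_nd(x, eps=0.2, lam=0.1):
    """Gradient of Eq. (11)."""
    w = 2 * np.pi / lam
    c = 1 + eps * np.sum(np.cos(w * x))
    return 2 * x * c - eps * w * np.sin(w * x) * np.sum(x**2)

def gd_nd(grad, d, s, n_steps=1000, x0_value=1.0):
    x = np.full(d, x0_value)
    for _ in range(n_steps):
        x = x - s * grad(x)
    return x

print(np.linalg.norm(gd_nd(grad_add_nd, d=10, s=0.3)))  # bounded run stays finite
```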

Figure 4: Beyond the simple cases we can analyze, the fractal dimension of the trainability boundary depends on other parameters that determine the loss function, while it appears to be general that non-convexity leads to fractal behaviors. (a) For the additive case with two cosine perturbations, the fractal dimension depends in a complicated way on the amplitudes, but it remains true that fractal behaviors show up once the loss is non-convex (the red line is the boundary of convexity). (b, c) For high-dimensional optimization, the fractal dimension can depend on the parameter dimension. (b) The fractal dimension is robust to increasing the parameter dimension d for the additive perturbation case. (c) The fractal dimension increases with the parameter dimension d for the multiplicative perturbation case. Error bars are standard deviations of the fits.

Discussion

In this study, we have demonstrated that fractal trainability boundaries can arise from relatively simple non-convex modifications to loss functions. Specifically, our results show that the sensitivity of the loss gradient to parameter changes (a consequence of non-convexity introduced either through additive or multiplicative cosine perturbations) plays a crucial role in the emergence of fractal trainability. The fractal dimensions we observed are influenced by several factors, including the parameter dimension, the type of non-convexity, the perturbation wavelength, and the amplitude. Notably, our use of renormalization techniques in one-dimensional optimization cases has linked various loss functions to corresponding fractal dimensions of their trainability boundaries. Therefore, we have identified “roughness of perturbation” as a key property that quantifies this sensitivity and dictates the fractal behavior. We observed a clear transition from non-fractal to fractal trainability boundaries as roughness increases, with the critical roughness coinciding with the perturbed loss becoming non-convex. These findings not only validate our hypothesis about the impact of non-convexity on trainability but also open up new avenues for understanding the dynamics of learning in complex models.

While our method effectively characterizes fractal behaviors, it may not fully capture the complexity inherent in the trainability boundary. We computed the box dimension of these boundaries as a more feasible alternative to direct, uniform sampling from the trainability boundary set, which remains impractical. However, the box dimension is not without its limitations; for example, it is known that the set of rational numbers between 0 and 1 technically has a box dimension of 1 [11, 12]. Consequently, while the relative magnitudes of our computed box dimensions can be informative in assessing the degree of complexity, the absolute values themselves may not be entirely reliable.

Beyond mathematical limitations, constraints in our numerical implementation also impact the accuracy of the fractal dimensions we obtained. For instance, computational resources cap the largest feasible NN, limiting the number of data points available for accurately fitting the fractal dimension. If a fractal boundary is densely packed within a very narrow range, a significantly large NN is required to discern its fractal nature, potentially causing us to overlook certain fractal behaviors when NN is limited. Interestingly, the practical significance of these fractal boundaries also comes into question; narrowly distributed boundaries are unlikely to be encountered in most applications, thus posing minimal risk. This observation led to a new insight: fractal dimension alone may not suffice to assess the risk posed by fractal boundaries. It also becomes essential to understand the distribution breadth of these boundary points. In our experiments, the maximum NN tested did not vary widely, suggesting that we may have consistently overlooked very narrow fractal boundaries. However, this might not be detrimental, as such boundaries are less likely to impact practical applications.

Our renormalization analysis, while effective in identifying roughness as a key parameter, exhibits limited generalizability. This analysis is restricted to simple functions with explicitly defined parameters, making our conclusions highly specific to the cases studied. Additionally, this analysis cannot answer why non-convexity of the loss function leads to the emergence of fractal behaviors. My initial concept was to establish a mapping that links different loss landscapes and learning rates while preserving trainability. This mapping would ideally define an updating flow for function parameters and the learning rate. If successful, we could potentially transform the question of trainability into an investigation of where this updating flow stabilizes, using familiar functions such as the quadratic function as endpoints. However, the renormalization flow falls short in achieving this, as it cannot eliminate the perturbations. The first time I viewed the figures in [9], they reminded me of images of Jupiter, whose fractal-like surface patterns arise from fluid dynamics. This analogy suggests a future possibility where we might develop an updating flow for hyperparameters that mirrors principles from fluid dynamics.

In conclusion, substantial future research is necessary to more accurately capture the fractal behaviors of trainability boundaries. As we have discussed, developing a theory that can predict both the critical emergence of fractal trainability boundaries and their fractal dimensions is essential. Moreover, establishing connections to realistic loss functions from contemporary machine learning models is needed, particularly finding methods to characterize roughness in general loss functions lacking simple explicit formulas. Further exploration into the mechanisms that contribute to rough non-convexity is also required. With a deeper understanding of these phenomena, we could potentially develop strategies associated with model construction, dataset management, and optimizers that mitigate the risks associated with dangerous fractal trainability boundaries. By continuing to build on this foundation, we pave the way for more robust and predictable machine learning methodologies.

Methods

We conducted large-scale numerical experiments with Julia 1.8.4 on CPUs or with Python 3.9 on GPUs of the MIT Supercloud [13]. The results show no notable difference. We ran small-scale tests and analyzed data with Python 3.10.9 on a laptop. All code is available online. In practice, we set the number of segments N = 2^{n} for integer n and ran tests on N_{\rm max}+1 = 2^{n_{\rm max}}+1 learning rates evenly distributed in [0, 1.5]. Given the hyperparameters and the loss function, we ran GD for 1000 steps and classified training as bounded or divergent based on the sum of the loss values over the 1000 steps. We also classified training as bounded or divergent based on whether GD hits an upper bound; this latter classification may misjudge cases where GD first diverges but then converges. However, in the tests reported in the main text, the two classification methods show no notable difference. When analyzing data, we can choose 2^{n}+1 (n \leq n_{\rm max}) evenly spaced points from the 2^{n_{\rm max}}+1 points to analyze boundary segments at a coarse-grained level, which gives |B_{N}| with N = 2^{n}. The largest n_{\rm max} we tested is 32. In most cases, n_{\rm max} = 20 is sufficient to yield a good fit of the fractal dimension. We ran all numerical tests with data type float64, which is accurate enough for our choices of n_{\rm max}.
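The coarse-graining step can be implemented by classifying divergence once on the finest grid and then subsampling, as sketched below (our own helper; it assumes the finest grid has 2^{n_max}+1 points).

```python
import numpy as np

def boundary_counts(labels_finest):
    """labels_finest: bool array of length 2**n_max + 1 (True = divergent).

    Returns a dict mapping N = 2**n to |B_N| for every n <= n_max,
    obtained by subsampling 2**n + 1 evenly spaced labels.
    """
    n_max = int(np.log2(len(labels_finest) - 1))
    counts = {}
    for n in range(1, n_max + 1):
        stride = 2 ** (n_max - n)
        sub = labels_finest[::stride]
        counts[2 ** n] = int(np.sum(sub[1:] != sub[:-1]))
    return counts
```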

The choice of the tested learning rate range [0, 1.5] relies on the fact that f_{0} = x^{2} has one trainability boundary at s = 1.0 and on the observation that many trainability boundaries appear for s < 1.0 in practice (Fig. 1d). We prove that f_{0} = x^{2} has one trainability boundary as follows. By the definition of GD, on f_{0} we have

x^{(k+1)} = x^{(k)} - 2sx^{(k)} = (1-2s)x^{(k)}. (12)

Convergence requires |1-2s| < 1, which gives 0 < s < 1 and thus completes the proof. We applied our numerical method to f_{0} as a sanity check and indeed found no fractal trainability boundary for f_{0} (SI).

Details of the renormalization procedure are given as follows. For both the additive and multiplicative perturbation cases, we can write the loss function in the form

f(x) = x^{2} + \phi. (13)

By substituting \tilde{x} = x/b and \tilde{f}(\tilde{x}) = f(x)/\zeta into the original GD, we have

\tilde{x}^{(k+1)} = \tilde{x}^{(k)} - \tilde{s}\nabla\tilde{f}(\tilde{x}^{(k)}), (14)

with \tilde{s} = s\zeta/b^{2} and

\tilde{f}(\tilde{x}) = \frac{b^{2}}{\zeta}\tilde{x}^{2} + \frac{1}{\zeta}\phi(b\tilde{x}). (15)

Since we want the new function \tilde{f} to have the same functional form as f, we need the prefactor of \tilde{x}^{2}, i.e., b^{2}/\zeta, to be one. Consequently, we have \zeta = b^{2} and the learning rate \tilde{s} = s unchanged. For the additive perturbation case, where \phi = \epsilon\cos(2\pi x/\lambda), writing the transformed \tilde{\phi} = \frac{1}{\zeta}\phi(b\tilde{x}) = \tilde{\epsilon}\cos(2\pi\tilde{x}/\tilde{\lambda}) gives \tilde{\epsilon} = \epsilon/b^{2} and \tilde{\lambda} = \lambda/b. Similarly, for the multiplicative perturbation case, since \phi = x^{2}\epsilon\cos(2\pi x/\lambda) and \tilde{\phi} = \frac{1}{\zeta}\phi(b\tilde{x}) = \tilde{x}^{2}\tilde{\epsilon}\cos(2\pi\tilde{x}/\tilde{\lambda}), we have \tilde{\epsilon} = \epsilon and \tilde{\lambda} = \lambda/b. Since \tilde{x} = x/b is a one-to-one mapping, changing the set of conditions \{s, \epsilon, \lambda, x^{(0)}\} to \{s, \tilde{\epsilon}, \tilde{\lambda}, x^{(0)}/b\} only yields a rescaled GD trajectory and does not change whether the trajectory diverges. In other words, the conditions \{\epsilon, \lambda, x^{(0)}\} have the same trainability boundary as \{\tilde{\epsilon}, \tilde{\lambda}, x^{(0)}/b\}.
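This rescaling can be spot-checked numerically: GD on the additive loss with \{\epsilon, \lambda, x^{(0)}\} and with \{\epsilon/b^{2}, \lambda/b, x^{(0)}/b\} should produce trajectories that differ exactly by the factor b. A sketch follows; mild perturbation parameters are chosen so that floating-point rounding is not chaotically amplified.

```python
import numpy as np

def gd_add_traj(eps, lam, x0, s=0.3, n_steps=50):
    """GD on f_+(x) = x^2 + eps*cos(2*pi*x/lam); returns the full trajectory."""
    w = 2 * np.pi / lam
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        xs.append(x - s * (2 * x - eps * w * np.sin(w * x)))
    return np.array(xs)

b = 3.0
orig = gd_add_traj(eps=0.01, lam=0.5, x0=1.0)
resc = gd_add_traj(eps=0.01 / b**2, lam=0.5 / b, x0=1.0 / b)
print(np.allclose(orig, b * resc))   # True: the trajectories are rescaled copies
```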

Acknowledgments

It is a pleasure to thank Weijie Su for introducing this problem and highlighting the importance, Ziming Liu for discussions on the intuitions, Jörn Dunkel for valuable discussions on aspects to explore, Jeff Gore for the support and inspiring discussions, and Jascha Sohl-Dickstein for valuable suggestions.

References

  • Choromanska et al. [2015] A. Choromanska, M. Henaff, M. Mathieu, G. Ben Arous, and Y. LeCun, The Loss Surfaces of Multilayer Networks, in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 38, edited by G. Lebanon and S. V. N. Vishwanathan (PMLR, San Diego, California, USA, 2015) pp. 192–204.
  • Jin et al. [2018] C. Jin, L. T. Liu, R. Ge, and M. I. Jordan, On the local minima of the empirical risk, in Neural Information Processing Systems (2018).
  • Zhang et al. [2022] Y. Zhang, Q. Qu, and J. Wright, From symmetry to geometry: Tractable nonconvex problems (2022), arXiv:2007.06753 [cs.LG] .
  • Su et al. [2016] W. Su, S. Boyd, and E. J. Candès, A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights, Journal of Machine Learning Research 17, 1 (2016).
  • Wibisono et al. [2016] A. Wibisono, A. C. Wilson, and M. I. Jordan, A variational perspective on accelerated methods in optimization, Proceedings of the National Academy of Sciences 113, E7351 (2016), https://www.pnas.org/doi/pdf/10.1073/pnas.1614734113.
  • Kong and Tao [2020] L. Kong and M. Tao, Stochasticity of deterministic gradient descent: Large learning rate for multiscale objective function, in Advances in Neural Information Processing Systems, Vol. 33, edited by H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Curran Associates, Inc., 2020) pp. 2625–2638.
  • Shi et al. [2023] B. Shi, W. Su, and M. I. Jordan, On Learning Rates and Schrödinger Operators, Journal of Machine Learning Research 24, 1 (2023).
  • Chen et al. [2023] X. Chen, K. Balasubramanian, P. Ghosal, and B. Agrawalla, From stability to chaos: Analyzing gradient descent dynamics in quadratic regression (2023), arXiv:2310.01687 [cs.LG] .
  • Sohl-Dickstein [2024] J. Sohl-Dickstein, The boundary of neural network trainability is fractal (2024), arXiv:2402.06184 [cs.LG] .
  • Mandelbrot [1982] B. B. Mandelbrot, The fractal geometry of nature, Vol. 1 (W. H. Freeman, New York, 1982).
  • Strogatz [2018] S. H. Strogatz, Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering (CRC press, 2018).
  • Falconer [2007] K. Falconer, Fractal geometry: mathematical foundations and applications (John Wiley & Sons, 2007).
  • Reuther et al. [2018] A. Reuther, J. Kepner, C. Byun, S. Samsi, W. Arcand, D. Bestor, B. Bergeron, V. Gadepally, M. Houle, M. Hubbell, M. Jones, A. Klein, L. Milechin, J. Mullen, A. Prout, A. Rosa, C. Yee, and P. Michaleas, Interactive Supercomputing on 40,000 Cores for Machine Learning and Data Analysis, in 2018 IEEE High Performance Extreme Computing Conference (HPEC) (2018) pp. 1–6.

Supplementary Information

All the raw data and comparison between fitted line and raw data points are available online.

We show two examples of the training loss evolution on our constructed loss landscapes in Fig. S1. The sanity check of our method on the quadratic function f_{0} = x^{2} is presented in Fig. S2.

The multiplicative case is always non-convex and should always display a fractal trainability boundary. However, if we study the second derivative,

\nabla^{2}f_{\times}(x) = 2 + 2\epsilon\cos\left(\frac{2\pi x}{\lambda}\right) - 4\epsilon\frac{2\pi x}{\lambda}\sin\left(\frac{2\pi x}{\lambda}\right) - \epsilon\left(\frac{2\pi x}{\lambda}\right)^{2}\cos\left(\frac{2\pi x}{\lambda}\right), (S1)

we find that \nabla^{2}f_{\times}(x) > 0 in a region where |x| is small enough. If we set an upper bound f_{\rm max} to decide whether training is classified as bounded or divergent and do not care about the dynamics once the loss goes beyond f_{\rm max}, GD effectively only sees the loss within the region |x| < \sqrt{f_{\rm max}}, roughly. If \epsilon is too small, the loss may be convex within the region |x| < \sqrt{f_{\rm max}}, and we may not see any fractal behaviors, which is a numerical artifact. To test this idea, we set different f_{\rm max} values and stop GD, regarding the run as divergent, once the loss reaches f_{\rm max}. We found that too small an \epsilon indeed makes the fractal behaviors vanish and that the transition to fractal behaviors differs for different upper bounds f_{\rm max} (Fig. S3). We next analyze when the loss within |x| < \sqrt{f_{\rm max}} can be non-convex and check whether this corresponds to the emergence of fractal behaviors. The critical situation is that \nabla^{2}f_{\times} becomes zero near |x| = \sqrt{f_{\rm max}}. Since \sqrt{f_{\rm max}} is very large, the quadratic term dominates among the terms containing \epsilon, and the minimum second derivative near |x| = \sqrt{f_{\rm max}} is

\min\nabla^{2}f_{\times} \approx 2 - \epsilon f_{\rm max}\left(\frac{2\pi}{\lambda}\right)^{2}. (S2)

By setting the above estimate to zero, we obtain the boundary of non-convexity for the loss within |x| < \sqrt{f_{\rm max}} as

\epsilon = \frac{1}{f_{\rm max}}\frac{\lambda^{2}}{2\pi^{2}}. (S3)

These estimated boundaries are plotted in the (\epsilon, \lambda) plane for different f_{\rm max} values (red curves in Fig. S3). Since greater \epsilon means non-convexity, we indeed find that fractal behaviors only show up when the part of the loss function GD can see becomes non-convex (non-zero fractal dimensions are all above the red curves). We also note that with increasing upper bound f_{\rm max}, the boundary of non-convexity (red curves) tends to fall lower and lower relative to the boundary of fractal behaviors. This might be due to the fact that the region |x| < \sqrt{f_{\rm max}} expands with larger f_{\rm max}, and near the critical (\epsilon, \lambda) for non-convexity it is more difficult to reach the non-convex part near |x| \approx \sqrt{f_{\rm max}}. In conclusion, this numerical artifact helps to further support the idea that the loss becoming non-convex leads to the emergence of fractal behaviors in training.

We checked the dependence of the fractal dimension on the initial parameter x^{(0)} in Figs. S4 and S5, which suggest that the initial parameter x^{(0)} may not affect the fractal dimension.

Figure S1: Training loss can be bounded or divergent. (a) An example based on the multiplicative perturbation case with amplitude \epsilon=0.2, wavelength \lambda=0.1, and learning rate s=0.01, where the loss decays and remains bounded. (b) An example of divergent training based on the multiplicative perturbation case with amplitude \epsilon=0.2, wavelength \lambda=0.1, and learning rate s=0.2.
Figure S2: A sanity check on the quadratic loss function indicates that our numerical method behaves as expected. (a) The quadratic function f_{0} = x^{2}, for which we know there is only one trainability boundary, at s = 1.0. (b) With our numerical method, |B_{N}| is always 1, consistent with the theory.
Figure S3: A numerical artifact supports that fractal behaviors emerge when the landscape is non-convex. When we increase the loss upper bound for classifying bounded and divergent training, GD can run over larger regions. On different parameter (x) regions, the function parameters (\epsilon, \lambda) for the multiplicative case f_{\times}(x) to be non-convex are different (the red curves in the figure are the boundaries between convexity and non-convexity). Changing the upper bound from (a) 1e+3, (b) 1e+4, (c) 1e+5, to (d) 1e+6, the fractal behaviors all show up after the region GD runs over becomes non-convex (above the red curves).
Figure S4: Tests suggest that the initial parameter x^{(0)} may not affect the fractal dimension. For the additive perturbation case we set \epsilon=0.2 and \lambda=0.1 and sampled 100 different initial conditions x^{(0)} uniformly from [-5, 5]. The fractal dimension averaged over initial conditions is 0.98 with a standard deviation of 0.08.
Figure S5: Tests suggest that the initial parameter x^{(0)} may not affect the fractal dimension. For the multiplicative perturbation case we set \epsilon=0.2 and \lambda=0.1 and sampled 100 different initial conditions x^{(0)} uniformly from [-5, 5]. The fractal dimension averaged over initial conditions is 1.00 with a standard deviation of 0.01.