
Emergence of hierarchical modes from deep learning

Chan Li^{1} and Haiping Huang^{1,2} (huanghp7@mail.sysu.edu.cn)
^{1}PMI Lab, School of Physics, Sun Yat-sen University, Guangzhou 510275, People’s Republic of China
^{2}Guangdong Provincial Key Laboratory of Magnetoelectric Physics and Devices, Sun Yat-sen University, Guangzhou 510275, People’s Republic of China
Abstract

Training large-scale deep neural networks is expensive, and the trained weight matrices that constitute the networks are hard to interpret. Here, we propose a mode decomposition learning that interprets the weight matrices as a hierarchy of latent modes. These modes are akin to patterns in physics studies of memory networks, but the least number of modes increases only logarithmically with the network width and even becomes a constant when the width grows further. Mode decomposition learning not only saves a significant amount of training cost, but also explains the network performance in terms of the leading modes, which display a striking piecewise power-law behavior. The modes specify a progressively compact latent space across the network hierarchy, yielding more disentangled subspaces than standard training. We also study mode decomposition learning in an analytic on-line learning setting, which reveals a multi-stage learning dynamics with a continuous specialization of hidden nodes. The proposed mode decomposition learning therefore points to a cheap and interpretable route towards the magic of deep learning.

Introduction.— Deep neural networks are dominant tools with a broad range of applications, not only in image and language processing but also in scientific research Goodfellow et al. (2016); Carleo et al. (2019). These networks are parameterized by a huge number of trainable weight matrices, which makes training expensive. However, these weight matrices are hard to interpret, and thus the mechanisms underlying the macroscopic performance of the networks remain a big mystery in theoretical studies of neural networks Huang (2022); Roberts et al. (2022).

To save the computational cost, previous studies of deep networks applied singular value decomposition to the weight matrices Jaderberg et al. (2014); Yang et al. (2020); Giambagli et al. (2021); Chicchi et al. (2021). This decomposition requires the orthogonality condition for the singular vectors and positive singular values. The training also involves a carefully-designed structure for the trainable decomposition scheme Giambagli et al. (2021); Chicchi et al. (2021). These constraints and designs make the training process complicated, and thus a concise physics interpretation is still lacking. In addition, previous studies of recurrent memory networks showed that the network weight can be decomposed into separate random orthogonal patterns with corresponding importance scores Jiang et al. (2021); Zhou et al. (2021). Inspired by these studies, we conjecture that the learning in deep networks is shaped by a hierarchy of latent modes, which are not necessarily orthogonal, and the weight matrix can be expressed by these modes.

The mode decomposition learning (MDL) leads to a progressively compact latent mode space across the network hierarchy, and meanwhile the subspaces corresponding to different types of input become strongly disentangled, facilitating discrimination. The least number of latent modes needed to achieve performance comparable to that of costly standard methods grows only logarithmically with the network width and can even become a constant, thereby significantly reducing the training cost. The mode spectrum exhibits an intriguing piecewise power-law behavior. In particular, these properties do not depend on details of the training setting. Therefore, our proposed MDL calls for a rethinking of conventional weight-based deep learning through the lens of cheap and interpretable mode-based learning.

Model.— To show the effectiveness of the MDL scheme, we train a deep network to implement a classification task of handwritten digits mni . The deep network has $L$ layers ($L-2$ hidden layers) with $N_{l}$ neurons in the $l$-th layer. The weight value of the connection from neuron $i$ at the upstream layer $l$ to neuron $j$ at the downstream layer $l+1$ is specified by $w_{ij}^{l}$. The activation of neuron $j$ at the downstream layer is $h_{j}^{l+1}=f(z_{j}^{l+1})=\max(0,z_{j}^{l+1})$, where the pre-activation $z_{j}^{l+1}=\sum_{i}w_{ij}^{l}h_{i}^{l}$. For the output layer, the softmax function $h_{k}=e^{z_{k}}/\sum_{i}e^{z_{i}}$ is chosen to specify the probability over all classes of the input images. The cross entropy $\mathcal{C}=-\sum_{i}\hat{h}_{i}\ln h_{i}$ is used as the cost function for the supervised learning, where $\hat{h}_{i}$ is the target label (one-hot form). After training (the cross entropy is repeatedly averaged over mini-batches of training examples), we evaluate the generalization performance of the network on an unseen test dataset.

Single weight values are not interpretable. According to our hypothesis, latent patterns would emerge from training in each layer. We call these patterns hierarchical modes for deep learning. Therefore, the relationship between the modes and weight values is expressed by the following mode decomposition,

$$\mathbf{w}^{l}=\hat{\bm{\xi}}^{l}\bm{\Sigma}^{l}(\bm{\xi}^{l+1})^{\operatorname{T}}, \qquad (1)$$

where there are $p^{l}$ upstream modes $\hat{\bm{\xi}}^{l}\in\mathbb{R}^{N_{l}\times p^{l}}$ and the same number of downstream modes $\bm{\xi}^{l+1}\in\mathbb{R}^{N_{l+1}\times p^{l}}$. The importance of each pair of adjacent modes is specified by the diagonal of the importance matrix $\bm{\Sigma}^{l}\in\mathbb{R}^{p^{l}\times p^{l}}$. These modes need not be orthogonal to each other, and the importance scores can take real values. This setting allows more degrees of freedom for learning features of input-output mappings. We will detail their geometric and physical interpretations below.
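As a concrete illustration, the decomposition in Eq. (1) can be sketched in a few lines of numpy. This is a minimal sketch under our own naming conventions; the scaling factor anticipates the initialization discussed later in the text and is our reading of it, not a prescription from the paper.

```python
import numpy as np

def init_mode_layer(N_in, N_out, p, rng):
    """Sample one MDL layer: upstream modes, importance scores, downstream modes.
    The 1/sqrt(p*N_in) factor keeps each reconstructed weight of order 1/sqrt(N_in)."""
    xi_hat = rng.standard_normal((N_in, p))    # upstream modes,  shape (N_l, p)
    sigma = rng.standard_normal(p)             # diagonal of the importance matrix
    xi = rng.standard_normal((N_out, p))       # downstream modes, shape (N_{l+1}, p)
    scale = 1.0 / np.sqrt(p * N_in)
    return xi_hat, sigma, xi, scale

def weight_from_modes(xi_hat, sigma, xi, scale=1.0):
    """w^l = xi_hat diag(Sigma) xi^T, Eq. (1); returns shape (N_l, N_{l+1})."""
    return scale * (xi_hat * sigma) @ xi.T

rng = np.random.default_rng(0)
xi_hat, sigma, xi, scale = init_mode_layer(784, 100, 30, rng)
w = weight_from_modes(xi_hat, sigma, xi, scale)
print(w.shape, w.std())   # (784, 100), entries of order 1/sqrt(784)
```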

Figure 1: A simple illustration of the mode decomposition learning. (a) A deep neural network of three layers, including one hidden layer with three hidden nodes, for a classification task of non-linearly separable data. The weight matrix $w_{ij}^{l}=\sum_{\alpha=1}^{p}\hat{\xi}_{i,\alpha}^{l}\Sigma_{\alpha}^{l}\xi_{j,\alpha}^{l}$, where $p=3$. The distribution of input data is modeled as a Gaussian mixture (see the main text) from which samples are assigned to labels $t=\pm 1$ based on the corresponding mixture component. The training performance is measured by the mean-squared-error loss function $\ell_{\operatorname{MSE}}(y,t)=\|y-t\|^{2}/2$. (b) The representation of hidden neurons $\mathbf{h}$ plotted in the 3D space, displaying the geometric separation. (c) The successive mappings from input sample $\mathbf{x}$ (grey) to $(\hat{\bm{\xi}}^{1})^{\operatorname{T}}\mathbf{x}$ (dark red), followed by $\bm{\Sigma}^{1}(\hat{\bm{\xi}}^{1})^{\operatorname{T}}\mathbf{x}$ (green), and finally $\bm{\xi}^{2}\bm{\Sigma}^{1}(\hat{\bm{\xi}}^{1})^{\operatorname{T}}\mathbf{x}$ (blue).

A geometric interpretation of Eq. (1) in a simple learning task is shown in Fig. 1. We use a three-layer network with three hidden neurons. The input data is sampled from a four-component Gaussian mixture Fischer et al. (2022),

$$\mathbb{P}(\mathbf{x},t)=P(t)\sum_{\pm}P_{\pm}\,\mathcal{N}\left(\mathbf{x}\,|\,\mu_{x}^{t,\pm},\Sigma_{x}^{t,\pm}\right), \qquad (2)$$

where $\mathcal{N}(\mathbf{x}|\mu_{x}^{t,\pm},\Sigma_{x}^{t,\pm})$ denotes a Gaussian distribution with mean $\mu_{x}^{t,\pm}$ and covariance $\Sigma_{x}^{t,\pm}$, and $P(t)=P_{\pm}=\frac{1}{2}$. For the label $t=+1$, $\mu_{x}^{t=+1,\pm}=\pm(0.5,0.5)^{\operatorname{T}}$, while for $t=-1$, $\mu_{x}^{t=-1,\pm}=\pm(-0.5,0.5)^{\operatorname{T}}$. Covariances are isotropic throughout with $\Sigma_{x}^{t,\pm}=0.05\,\mathds{1}$. The input samples $\mathbf{x}\in\mathbb{R}^{2}$ are first projected to the input pattern space spanned by $(\hat{\bm{\xi}}^{1})^{\operatorname{T}}_{i}$ ($i=1,2,3$). Then all three directions of this projection get expanded or contracted via $\bm{\Sigma}^{1}(\hat{\bm{\xi}}^{1})^{\operatorname{T}}\mathbf{x}$. Finally the geometrically modified representation is re-mapped to the downstream representation space of a higher dimensionality, as $\bm{\xi}^{2}\bm{\Sigma}^{1}(\hat{\bm{\xi}}^{1})^{\operatorname{T}}\mathbf{x}$ [Fig. 1(c)]. The non-linearity of the transfer function is then applied to the last linear transformation, leading to the geometric separation [Fig. 1(b)]. We conclude that the MDL provides rich angles from which to view the geometric transformation of the input information along the hierarchy of deep networks.
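A minimal sketch of the toy data set of Eq. (2), assuming the four component means and the isotropic covariance stated above; the sample count and function name are our own choices.

```python
import numpy as np

def sample_xor_mixture(n, rng):
    """Draw n samples from the four-component Gaussian mixture of Eq. (2).
    Label t=+1 uses means +-(0.5, 0.5); t=-1 uses means +-(-0.5, 0.5); covariance 0.05*I."""
    t = rng.choice([+1, -1], size=n)                 # P(t) = 1/2
    s = rng.choice([+1, -1], size=n)                 # mixture component, P_+- = 1/2
    mu_plus = np.array([0.5, 0.5])
    mu_minus = np.array([-0.5, 0.5])
    means = np.where((t == 1)[:, None], mu_plus, mu_minus) * s[:, None]
    x = means + np.sqrt(0.05) * rng.standard_normal((n, 2))
    return x, t

rng = np.random.default_rng(1)
x, t = sample_xor_mixture(1000, rng)
print(x.shape, t[:5])
```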

Rather than the conventional weight values in standard backpropagation (BP) algorithms Goodfellow et al. (2016), the trainable parameters in the MDL are latent patterns. The training is implemented by stochastic gradient descent in the mode space $\bm{\theta}^{l}=(\hat{\bm{\xi}}^{l},\bm{\Sigma}^{l},\bm{\xi}^{l+1})$ SM ,

$$
\begin{aligned}
\Delta\xi^{l+1}_{j\alpha} &\equiv -\eta\frac{\partial\mathcal{L}}{\partial\xi^{l+1}_{j\alpha}} = -\eta\,\mathcal{K}_{j}^{l+1}\Sigma_{\alpha}^{l}\sum_{i}\hat{\xi}_{i\alpha}^{l}h_{i}^{l},\\
\Delta\Sigma_{\alpha}^{l} &\equiv -\eta\frac{\partial\mathcal{L}}{\partial\Sigma^{l}_{\alpha}} = -\eta\sum_{j}\mathcal{K}_{j}^{l+1}\xi_{j\alpha}^{l+1}\sum_{i}\hat{\xi}_{i\alpha}^{l}h_{i}^{l},\\
\Delta\hat{\xi}^{l}_{i\alpha} &\equiv -\eta\frac{\partial\mathcal{L}}{\partial\hat{\xi}_{i\alpha}^{l}} = -\eta\,\Sigma_{\alpha}^{l}h_{i}^{l}\sum_{j}\mathcal{K}_{j}^{l+1}\xi_{j\alpha}^{l+1},
\end{aligned}
\qquad (3)
$$

where $\mathcal{L}$ denotes the cost function (e.g., cross-entropy or mean-squared error) over a mini-batch of training data, $\eta$ denotes the learning rate, and $\mathcal{K}_{j}^{l+1}\equiv\partial\mathcal{L}/\partial z_{j}^{l+1}$ denotes the error term, which back-propagates from the top layer where $\mathcal{K}_{j}^{L}=-\hat{h}_{j}^{L}\left(1-h_{j}^{L}\right)$ for $\mathcal{L}=\mathcal{C}$ (cross entropy). Based on the chain rule, the error backpropagation equation can be derived as $\mathcal{K}_{i}^{l}=\sum_{j}\mathcal{K}_{j}^{l+1}\sum_{\alpha}\hat{\xi}_{i\alpha}^{l}\Sigma_{\alpha}^{l}\xi_{j\alpha}^{l+1}f^{\prime}(z_{i}^{l})$ SM . To ensure the pre-activation is independent of the upstream-layer width, we take the initialization scheme $[\bm{\xi}^{l+1}\bm{\Sigma}^{l}(\hat{\bm{\xi}}^{l})^{\operatorname{T}}]_{ij}\sim\mathcal{O}(\frac{1}{\sqrt{N_{l}}})$ Jiang et al. (2021). To avoid the ambiguity of choosing patterns (e.g., scaled by a factor), we impose an identical regularization with strength $10^{-4}$ for all trainable parameters. However, our result does not change qualitatively with the specific value of the regularization SM .
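The updates in Eq. (3) translate directly into array operations. Below is a minimal sketch of the gradients for one layer (1L2P), assuming the error signal K of the downstream layer has already been back-propagated; all function and variable names are ours.

```python
import numpy as np

def mdl_layer_grads(xi_hat, sigma, xi, h, K):
    """Gradients of Eq. (3) for one layer.
    h: upstream activations, shape (N_l,); K: error terms dL/dz downstream, shape (N_{l+1},)."""
    kappa = xi_hat.T @ h                         # kappa_alpha = sum_i xi_hat_{i,alpha} h_i
    g_xi = np.outer(K, sigma * kappa)            # dL/dxi_{j,alpha}     = K_j Sigma_alpha kappa_alpha
    g_sigma = (xi.T @ K) * kappa                 # dL/dSigma_alpha      = (sum_j K_j xi_{j,alpha}) kappa_alpha
    g_xi_hat = np.outer(h, sigma * (xi.T @ K))   # dL/dxi_hat_{i,alpha} = Sigma_alpha h_i sum_j K_j xi_{j,alpha}
    return g_xi_hat, g_sigma, g_xi

def backprop_error(xi_hat, sigma, xi, z, K_next):
    """K_i = f'(z_i) sum_j K_j sum_alpha xi_hat_{i,alpha} Sigma_alpha xi_{j,alpha}, for ReLU f."""
    w = (xi_hat * sigma) @ xi.T                  # reconstructed weights, shape (N_l, N_{l+1})
    return (w @ K_next) * (z > 0)

# one SGD step with learning rate eta (toy shapes)
rng = np.random.default_rng(2)
N_in, N_out, p, eta = 100, 50, 20, 0.05
xi_hat, sigma, xi = (rng.standard_normal(s) for s in [(N_in, p), (p,), (N_out, p)])
h, K = rng.standard_normal(N_in), rng.standard_normal(N_out)
g_xi_hat, g_sigma, g_xi = mdl_layer_grads(xi_hat, sigma, xi, h, K)
xi_hat, sigma, xi = xi_hat - eta * g_xi_hat, sigma - eta * g_sigma, xi - eta * g_xi
```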

We remark that for each hidden layer, there exist two types of pattern ($\bm{\xi}^{l}\neq\hat{\bm{\xi}}^{l}$). Equation (3) is used to learn these patterns, and we call this case 1L2P. If we instead assume $\bm{\xi}^{l}=\hat{\bm{\xi}}^{l}$, the training can be further simplified as in SM , and we call this case 1L1P. The nature of this mode-based computation can be understood as an expanded linear-nonlinear layered computation, $f(z_{j}^{l+1})=f(\sum_{\alpha}c_{\alpha j}\kappa_{\alpha})$, where the linear field $\kappa_{\alpha}=\sum_{i}\hat{\xi}_{i\alpha}^{l}h_{i}^{l}$ and the equivalent weight $c_{\alpha j}=\xi_{j\alpha}^{l+1}\Sigma_{\alpha}^{l}$. Therefore, the number of modes acts as the width of the linear layer. We leave a systematic statistical-mechanics exploration of this linear-nonlinear structure to forthcoming works.
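The equivalent linear-nonlinear view can be written as a two-step forward pass: a narrow linear layer of width $p$ followed by the usual nonlinearity. A sketch in our own notation:

```python
import numpy as np

def mdl_forward(h, xi_hat, sigma, xi):
    """One MDL layer viewed as a linear-nonlinear computation:
    kappa_alpha = sum_i xi_hat_{i,alpha} h_i   (projection onto the p modes),
    z_j = sum_alpha c_{alpha j} kappa_alpha    with c_{alpha j} = xi_{j,alpha} Sigma_alpha,
    followed by the ReLU nonlinearity."""
    kappa = xi_hat.T @ h          # (p,)       linear fields of the hidden linear layer
    c = xi * sigma                # (N_out, p) equivalent weights c_{alpha j}
    z = c @ kappa                 # (N_out,)   identical to (xi_hat diag(Sigma) xi^T)^T @ h
    return np.maximum(z, 0.0)
```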

On-line learning dynamics in a shallow network.— The MDL can be analytically understood in an on-line learning setting, where we consider a one-hidden-layer architecture. On-line learning can be considered as a special case of the above mini-batch learning (i.e., the batch size is set to one, and each sample is visited only once). The training dataset consists of $n$ pairs $\{\mathbf{x}^{\nu},y^{\nu}\}_{\nu=1}^{n}$. Each training example is independently sampled from a probability distribution $\mathbb{P}(\mathbf{x},y)=\mathbb{P}(y|\mathbf{x})\mathbb{P}(\mathbf{x})$, where $\mathbb{P}(\mathbf{x})$ is a standard Gaussian distribution, and the scalar label $y^{\nu}$ is generated by a neural network of $k$ hidden neurons (i.e., the teacher, indicated by the symbol $*$ below). Given an input $\mathbf{x}^{\nu}\in\mathbb{R}^{d}$, the corresponding label is created by

$$y^{\nu}=\frac{1}{k}\sum_{r=1}^{k}\sigma\left(\frac{[\bm{\xi}^{*}\bm{\Sigma}^{*}(\hat{\bm{\xi}}^{*})^{\operatorname{T}}]_{r}\,\mathbf{x}^{\nu}}{\sqrt{d}}\right)=\frac{1}{k}\sum_{r=1}^{k}\sigma\left(\lambda_{r}^{*\nu}\right), \qquad (4)$$

where $[\bm{\xi}^{*}\bm{\Sigma}^{*}(\hat{\bm{\xi}}^{*})^{\operatorname{T}}]_{r}$ denotes the $r$-th row of the matrix $\bm{\xi}^{*}\bm{\Sigma}^{*}(\hat{\bm{\xi}}^{*})^{\operatorname{T}}$, and $\lambda_{r}^{*\nu}=[\bm{\xi}^{*}\bm{\Sigma}^{*}(\hat{\bm{\xi}}^{*})^{\operatorname{T}}]_{r}\mathbf{x}^{\nu}/\sqrt{d}$ represents the $r$-th element of the teacher local field vector $\bm{\lambda}^{*\nu}\in\mathbb{R}^{k}$. The teacher network is quenched with $[\bm{\xi}^{*}\bm{\Sigma}^{*}(\hat{\bm{\xi}}^{*})^{\operatorname{T}}]_{ij}\sim\mathcal{O}(1)$. Here, we focus on the non-linear transfer function $\sigma(x)=\operatorname{erf}(x/\sqrt{2})$. In addition, we train another shallow network, the student network, by minimizing the loss function $\mathcal{L}(y,\hat{f}(\mathbf{x},\Theta))$ over the training data (labels are given by the teacher network), where $\Theta$ denotes the trainable parameters. The student’s prediction for a fresh sample $\mathbf{x}$ is given by

$$\hat{f}(\mathbf{x},\hat{\bm{\xi}},\bm{\Sigma},\bm{\xi})=\frac{1}{m}\sum_{r=1}^{m}\sigma\left(\frac{[\bm{\xi}\bm{\Sigma}(\hat{\bm{\xi}})^{\operatorname{T}}]_{r}\,\mathbf{x}}{\sqrt{d}}\right)=\frac{1}{m}\sum_{r=1}^{m}\sigma\left(\lambda_{r}\right), \qquad (5)$$

where $\lambda_{r}$ denotes the $r$-th component of the student local field $\bm{\lambda}=\bm{\xi}\bm{\Sigma}(\hat{\bm{\xi}})^{\operatorname{T}}\mathbf{x}/\sqrt{d}$, and the student has $m$ hidden neurons. The student is supplied with data samples in sequence (one sample per time step). We next use $\nu$ to indicate the time step as well.
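A sketch of the teacher-student data generation of Eqs. (4)-(5), with erf activation and the $\mathcal{O}(1)$ teacher scaling described in the Supplemental Material. The naming is ours, and we apply the $1/\sqrt{\ln d}$ rescaling to one pattern factor, which is equivalent for the reconstructed weights.

```python
import numpy as np
from scipy.special import erf

def make_teacher(d=100, k=8, p=None, rng=None):
    """Quenched teacher patterns with [xi* Sigma* xi_hat*^T]_ij ~ O(1); p* ~ ln d modes."""
    rng = rng or np.random.default_rng(0)
    p = p or max(1, int(np.log(d)))
    xi_s = rng.standard_normal((k, p))                          # teacher downstream modes
    sig_s = rng.standard_normal(p)                              # teacher importance scores
    xih_s = rng.standard_normal((d, p)) / np.sqrt(np.log(d))    # teacher upstream modes (rescaled)
    return xi_s, sig_s, xih_s

def network_output(x, xi, sig, xih):
    """y = (1/k) sum_r erf(lambda_r / sqrt(2)),  lambda_r = [xi Sigma xih^T]_r x / sqrt(d)."""
    d = x.shape[0]
    lam = (xi * sig) @ (xih.T @ x) / np.sqrt(d)   # local fields, shape (k,)
    return erf(lam / np.sqrt(2)).mean()

rng = np.random.default_rng(3)
xi_s, sig_s, xih_s = make_teacher(d=100, k=8, rng=rng)
x = rng.standard_normal(100)
y = network_output(x, xi_s, sig_s, xih_s)         # scalar teacher label
```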

The mean-squared error can be evaluated as

$$\ell_{\mathrm{MSE}}(\bm{\Omega})=\frac{1}{2}\,\mathbb{E}_{\bm{\lambda},\bm{\lambda}^{*}\sim\mathcal{N}\left(\bm{\lambda},\bm{\lambda}^{*}\mid 0,\bm{\Omega}\right)}\left[\left(\hat{f}(\bm{\lambda})-f\left(\bm{\lambda}^{*}\right)\right)^{2}\right], \qquad (6)$$

where $f(\cdot)$ indicates the teacher’s output, and we have replaced the expectation $\mathbb{E}_{\mathbf{x},y\sim\mathbb{P}(\mathbf{x},y)}[\cdot]$ by $\mathbb{E}_{\bm{\lambda},\bm{\lambda}^{*}\sim\mathcal{N}(\bm{\lambda},\bm{\lambda}^{*}\mid 0,\bm{\Omega})}[\cdot]$, because of the central-limit theorem and the i.i.d. setting we consider Biehl and Schwarze (1995); Saad and Solla (1995); Goldt et al. (2019). The covariance of the local fields $\bm{\Omega}^{\nu}\in\mathbb{R}^{(k+m)\times(k+m)}$ can be specified as follows,

$$\bm{\Omega}^{\nu}\equiv\begin{bmatrix}\mathbf{Q}^{\nu}&\mathbf{M}^{\nu}\\ (\mathbf{M}^{\nu})^{\operatorname{T}}&\mathbf{P}\end{bmatrix}, \qquad (7)$$

where $\mathbf{Q}^{\nu}\equiv\mathbb{E}_{\mathbf{x},y\sim\mathbb{P}(\mathbf{x},y)}\left[\bm{\lambda}^{\nu}(\bm{\lambda}^{\nu})^{\operatorname{T}}\right]$, $\mathbf{M}^{\nu}\equiv\mathbb{E}_{\mathbf{x},y\sim\mathbb{P}(\mathbf{x},y)}\left[\bm{\lambda}^{\nu}(\bm{\lambda}^{*\nu})^{\operatorname{T}}\right]$, and $\mathbf{P}^{\nu}\equiv\mathbb{E}_{\mathbf{x},y\sim\mathbb{P}(\mathbf{x},y)}\left[\bm{\lambda}^{*\nu}(\bm{\lambda}^{*\nu})^{\operatorname{T}}\right]$. By definition, $\mathbf{P}$ is fixed, while $\mathbf{Q}^{\nu}$ and $\mathbf{M}^{\nu}$ evolve according to the gradient updates, following a set of deterministic ordinary differential equations (ODEs) as the input dimension $d\to\infty$ SM . These matrices are exactly the order parameters in physics. For simplicity, we consider $\bm{\xi}=\bm{\xi}^{*}$ and $\bm{\Sigma}=\bm{\Sigma}^{*}$, i.e., only the upstream patterns are learned.
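Given the patterns, the order parameters are plain matrix products. A sketch in our own notation, assuming student arrays of shape $(m,p)$, $(p,)$, $(d,p)$ and teacher arrays (suffix `_s` for "star") of shape $(k,p^{*})$, $(p^{*},)$, $(d,p^{*})$:

```python
import numpy as np

def order_parameters(xi, sig, xih, xi_s, sig_s, xih_s):
    """Q = W W^T / d, M = W W*^T / d, P = W* W*^T / d, with W = xi diag(Sigma) xih^T."""
    d = xih.shape[0]
    W = (xi * sig) @ xih.T          # student weight matrix, shape (m, d)
    Ws = (xi_s * sig_s) @ xih_s.T   # teacher weight matrix, shape (k, d)
    Q = W @ W.T / d
    M = W @ Ws.T / d
    P = Ws @ Ws.T / d
    return Q, M, P
```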

Figure 2: Test performance and mode hierarchy of MDL in deep neural networks. (a) Training trajectories of a four-layer network, indicated by $784$-$100$-$100$-$10$, where each number indicates the corresponding layer width. The number of modes $p^{l}=p$ for layer $l$, where $l=1,\ldots,L$; $p=20$ or $p=30$. Networks are trained on the full MNIST dataset ($6\times 10^{4}$ images) and tested on an unseen dataset containing $10^{4}$ images. The fluctuation is computed over five independent runs. (b) Testing accuracy versus $p$ (the number of modes is the same for all layers). The same architecture as (a) is used. The error bar characterizes the fluctuation across five independently trained networks, and each marker denotes the average result. The least number of modes is indicated by the dash-dot line. (c) The performance changes with the network width. The inset shows the least number of modes versus the layer width $N$ (in the logarithmic scale). The network architecture is given by $784$-$N$-$N$-$10$. The dash-dot line in the inset separates the piecewise logarithmic-increase ($\propto\ln N$) regions. The result is obtained from five independent runs. (d) The averaged Euclidean distance (dispersion) from the pattern-cloud center ($\frac{1}{p}\sum_{\alpha}\bm{\xi}^{l}_{\alpha}$) as a function of layer index. The network architecture is specified by $784$-$100$-$100$-$100$-$10$ ($p=30$). (e-f) Subspace overlap (principal angle) versus layer. The overlap is averaged over five independent runs, and seven-layer networks with hidden-layer width $100$ are trained ($p=30$).

Results.— MDL can reach a test accuracy similar to that of BP performed in the weight space, when $p$ is sufficiently large [Fig. 2(a)]. The computational cost of BP scales with $N_{l}^{2}$. In contrast, MDL works in the mode space, requiring a training cost of only the order of $pN_{l}$. Note that $p$ is much smaller than $N_{l}$ (or $\lim_{N_{l}\to\infty}p^{l}/N_{l}=0$), and our MDL does not need any additional training constraints (compared to other matrix factorization algorithms SM ). Remarkably, when $p=30$, the performance of MDL already matches that of BP [Fig. 2(b)], while utilizing only $40\%$ of the full set of parameters consumed by BP. In fact, each hidden layer can have two different types of latent pattern (1L2P) due to the mode decomposition. But if we assume that $\bm{\xi}^{l}=\hat{\bm{\xi}}^{l}$, i.e., each layer shares a single type of pattern (1L1P), we can further reduce the computational cost by an amount of $\sum_{l}p^{l}N_{l}$, without sacrificing the test accuracy [Fig. 2(b)]. Varying the network width, we reveal a logarithmic increase of the least number of modes [Fig. 2(c)], which is a novel property of deep learning in the mode space, in stark contrast to the linear number of memory patterns in previous studies Jiang et al. (2021). When the network width grows further, the least number can even become a constant. We argue that this manifests three separated phases of poor-good-saturated performance with increasing layer width (see Fig. S9 in SM ).

Figure 3: The robustness properties of well-trained four-layer MDL models with the architecture $784$-$100$-$100$-$10$. The case of 1L1P is considered with $p=70$ in the hidden layers. (a) Effects of removing modes through two protocols: removing modes with weak measure $\tau$ first (solid line) and removing modes randomly (dashed line). The fluctuation is computed over ten independent runs. (b) The rescaled $\ell_{2}$ norms $\gamma\|\xi\|_{2}$, $\gamma\|\hat{\xi}\|_{2}$ and the absolute values of $\Sigma$ versus their rank (in descending order) in the hidden layers, where $\gamma=\sum_{\alpha}|\Sigma_{\alpha}|/\sum_{\alpha}(\|\xi_{\alpha}\|_{2}+\|\hat{\xi}_{\alpha}\|_{2})$. The inset shows a log-log plot of the $\tau$ measure, displaying a piecewise power-law behavior. The error bar is computed over five independent runs. The marked percentage indicates the generalization accuracy after removing the corresponding side of modes.

To see how the latent patterns are transformed geometrically along the network hierarchy, we first calculate the center of the pattern space. Then the Euclidean distance from this center to each pattern is analyzed. We find that the pattern space becomes progressively compact when going to deep layers [Fig. 2(d)]. To further characterize the geometric details, we define the subspace spanned by the principal eigenvectors of the layer neural responses to one type of inputs. Then the subspace overlap is calculated as the cosine of the principal angle between two subspaces corresponding to two types of inputs Bjoerck and Golub (1973); SM . We find that the hidden-layer representation becomes more disentangled with layer in comparison with BP [Fig. 2(e,f)]. MDL shows great computational benefits of representation disentanglement, thereby facilitating discrimination. A slight increase of the overlap is observed for deeper layers, which is caused by the saturation of the test performance (see more analyses in SM ).

Compared to other matrix factorization methods, MDL imposes no additional constraints on the modes and importance scores, and is therefore flexible for feature extraction. We find that the interlayer patterns are less orthogonal than the intralayer ones. The geometric transformation carried out by these latent pattern matrices is not strictly a rotation, for which the $\ell_{2}$ norm would be preserved. This flexibility may be the key to making our method better than other matrix factorization methods in both training cost and learning performance (see details in SM ).

Figure 4: Mean-squared error dynamics in terms of $t=\frac{\nu}{d}$, where $\nu$ denotes the on-line sample index and $d$ is the input dimension. The teacher and student networks share the same number of hidden neurons ($m=k=8$). Markers represent results of the simulation, while the solid lines denote the theoretical predictions from solving the mean-field ODEs. The number of modes $p^{*}=p=\alpha\ln d$ ($\alpha$ denotes the mode load here). (a) Fixed $\alpha=1$. (b) Fixed $d=100$. The color deepens as $\alpha$ or $d$ increases. The insets display the evolving $\mathbf{M}$ matrix for $d=30$ and $\alpha=1.0$, respectively.

We next ask whether some modes are more important than others. Therefore, we rank the modes according to the measure $\tau_{\alpha}=\gamma\|\xi_{\alpha}\|_{2}+\gamma\|\hat{\xi}_{\alpha}\|_{2}+|\Sigma_{\alpha}|$, where $\gamma=\sum_{\alpha}|\Sigma_{\alpha}|/\sum_{\alpha}(\|\xi_{\alpha}\|_{2}+\|\hat{\xi}_{\alpha}\|_{2})$ makes the magnitudes of the pattern and importance ($\bm{\Sigma}$) scores comparable. Removing modes with weak values of $\tau$ first yields much higher accuracy than the random-removal protocol [Fig. 3(a)], suggesting the existence of leading modes. Moreover, deeper layers are more robust. Figure 3(b) shows the measure as a function of rank in descending order, which can be approximately captured by a piecewise power-law behavior (with a transition point at rank $10$). Ranking with only the importance scores yields similar behavior SM . A small exponent is observed for the leading measures, while the remaining measures bear a large exponent, thereby revealing the coding hierarchy of latent modes in the deep networks. This intriguing behavior does not change with the regularization strength or the hidden-layer width SM .
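The pruning experiment in Fig. 3(a) amounts to computing $\tau$ per mode and keeping only the leading ones. A sketch for one layer in our own notation; the accuracy evaluation after pruning is left out.

```python
import numpy as np

def tau_measure(xi_hat, sigma, xi):
    """tau_alpha = gamma*||xi_alpha|| + gamma*||xi_hat_alpha|| + |Sigma_alpha|,
    with gamma = sum|Sigma| / sum(||xi_alpha|| + ||xi_hat_alpha||)."""
    n_xi = np.linalg.norm(xi, axis=0)
    n_xih = np.linalg.norm(xi_hat, axis=0)
    gamma = np.abs(sigma).sum() / (n_xi + n_xih).sum()
    return gamma * n_xi + gamma * n_xih + np.abs(sigma)

def prune_weak_modes(xi_hat, sigma, xi, keep_fraction=0.2):
    """Keep only the leading modes ranked by tau (descending); the rest are removed."""
    tau = tau_measure(xi_hat, sigma, xi)
    order = np.argsort(tau)[::-1]
    keep = order[: max(1, int(keep_fraction * len(tau)))]
    return xi_hat[:, keep], sigma[keep], xi[:, keep]
```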

Finally, the on-line mean-squared error dynamics of our model can be predicted perfectly in a teacher-student setting. The number of modes strongly affects the shape of the learning dynamics, and a large mode load can make the plateaus disappear (Fig. 4). Moreover, during learning, the alignment between the receptive fields of the student’s hidden nodes and those of the teacher emerges continuously, which is called the specialization transition Schwarze (1993); Goldt et al. (2019).

Conclusion.— In this Letter, we propose a mode decomposition learning that works in the mode space rather than the conventional weight space. This learning scheme brings several technical and conceptual advances. First, the learning achieves performance comparable to standard methods, with a significant reduction of training costs. We also find that the least number of modes grows only logarithmically with the network width and even becomes independent of the width when it grows further, in stark contrast to the linear number of patterns in recurrent memory networks. Second, the learning leads to progressively compact pattern spaces, which promotes highly disentangled hierarchical representations. The upstream pattern maps the activity into a low-dimensional space, and the resulting embedding is then expanded or contracted. After that, the modified embedding is re-mapped into the high-dimensional activity space. This sequence of geometric transformations can be understood as a linear-nonlinear hidden structure. Third, not all modes are equally important for the generalization ability of the network, showing an intriguing piecewise power-law behavior. Finally, the mode learning dynamics can be predicted by the mean-field ODEs, revealing the mode specialization transition. Therefore, the MDL inspires a rethinking of conventional deep learning, offering a faster and more interpretable training framework. Future work along this direction includes the impact of other structured datasets, mode dynamics in over-parameterized or recurrent networks, and the origin of adversarial vulnerability of deep networks in terms of the geometry of the mode space.

I Acknowledgments

This research was supported by the National Natural Science Foundation of China for Grant number 12122515, and Guangdong Provincial Key Laboratory of Magnetoelectric Physics and Devices (No. 2022B1212010008), and Guangdong Basic and Applied Basic Research Foundation (Grant No. 2023B1515040023).

Supplemental Material

Appendix A Derivation of learning equations

In this section, we show how to derive the update equations for the mode parameters $\bm{\theta}^{l}=(\hat{\bm{\xi}}^{l},\bm{\Sigma}^{l},\bm{\xi}^{l+1})$, where the superscript $l$ indicates the layer index ranging from $1$ to $L$. The loss function is the cross entropy $\mathcal{C}=-\sum_{i}\hat{h}_{i}\ln h_{i}$ averaged over all training examples (divided into mini-batches in stochastic gradient descent), where $\hat{h}_{i}$ is defined as the target label (one-hot representation as common in machine learning). After training the network on the training dataset of size $T$, we evaluate the generalization performance of the network on an unseen dataset of size $V$.

In our framework of mode decomposition learning, the weight is decomposed into the form as follows,

$$\mathbf{w}^{l}=\hat{\bm{\xi}}^{l}\bm{\Sigma}^{l}(\bm{\xi}^{l+1})^{\operatorname{T}}. \qquad (S1)$$

The mode parameters are updated according to gradient descent of the loss function,

$$\Delta\bm{\theta}_{ij}^{l}=-\eta\,\mathcal{K}_{j}^{l+1}\frac{\partial z_{j}^{l+1}}{\partial\bm{\theta}_{ij}^{l}}, \qquad (S2)$$

where $\eta$ denotes the learning rate, and the error propagation term $\mathcal{K}_{j}^{l+1}\equiv\partial\mathcal{C}/\partial z_{j}^{l+1}$. On the top layer, $\mathcal{K}_{j}^{L}$ can be computed with the result $\mathcal{K}_{j}^{L}=-\hat{h}_{j}^{L}\left(1-h_{j}^{L}\right)$. For lower layers, the term $\mathcal{K}_{i}^{l}$ can be iteratively computed using the chain rule. More precisely,

$$\mathcal{K}_{i}^{l}=\partial\mathcal{C}/\partial z_{i}^{l}=\sum_{j}\frac{\partial\mathcal{C}}{\partial z_{j}^{l+1}}\frac{\partial z_{j}^{l+1}}{\partial z_{i}^{l}}=\sum_{j}\mathcal{K}_{j}^{l+1}\sum_{\alpha}\hat{\xi}_{i\alpha}^{l}\Sigma^{l}_{\alpha}\xi^{l+1}_{j\alpha}f^{\prime}(z_{i}^{l}). \qquad (S3)$$

The explicit expressions of gradient steps for the three sets of mode parameters are given as follows,

$$
\begin{aligned}
\Delta\xi^{l+1}_{j\alpha} &= -\eta\frac{\partial\mathcal{C}}{\partial\xi^{l+1}_{j\alpha}} = -\eta\,\mathcal{K}_{j}^{l+1}\sum_{i}\Sigma_{\alpha}^{l}\hat{\xi}_{i\alpha}^{l}h_{i}^{l},\\
\Delta\Sigma_{\alpha}^{l} &= -\eta\frac{\partial\mathcal{C}}{\partial\Sigma^{l}_{\alpha}} = -\eta\sum_{j}\mathcal{K}_{j}^{l+1}\sum_{i}\xi_{j\alpha}^{l+1}\hat{\xi}_{i\alpha}^{l}h_{i}^{l},\\
\Delta\hat{\xi}^{l}_{i\alpha} &= -\eta\frac{\partial\mathcal{C}}{\partial\hat{\xi}_{i\alpha}^{l}} = -\eta\sum_{j}\mathcal{K}_{j}^{l+1}\xi_{j\alpha}^{l+1}\Sigma_{\alpha}^{l}h_{i}^{l}.
\end{aligned}
\qquad (S4)
$$

The above learning equations apply to the 1L2P case.

Next, we consider the 1L1P case ($\bm{\xi}^{l}=\hat{\bm{\xi}}^{l}$). Apart from the single input pattern for the first layer $\hat{\bm{\xi}}^{1}$ and the single output pattern for the last layer $\bm{\xi}^{L}$, the two types of pattern in each hidden layer $[\bm{\xi}^{l},\hat{\bm{\xi}}^{l}]$ take the same form, and we denote $\bm{\xi}^{l}=\hat{\bm{\xi}}^{l}=\bm{\Xi}^{l}$. The expression of $\mathcal{K}_{i}^{l}$ remains unchanged, and we can then update $[\hat{\bm{\xi}}^{1},\bm{\Sigma}^{l},\bm{\xi}^{L}]$ according to Eq. (S4). Next, we give the gradient descent equation for $\bm{\Xi}^{l}$ with $l=2,\ldots,L-1$ as follows,

$$\Delta\Xi^{l}_{j\alpha}=-\eta\frac{\partial\mathcal{C}}{\partial\Xi^{l}_{j\alpha}}=-\eta\,\mathcal{K}_{j}^{l}\sum_{i}\Sigma_{\alpha}^{l-1}\hat{\xi}_{i\alpha}^{l-1}h_{i}^{l-1}-\eta\sum_{i}\mathcal{K}_{i}^{l+1}\xi_{i\alpha}^{l+1}\Sigma_{\alpha}^{l}h_{j}^{l}, \qquad (S5)$$

where two terms contribute to the gradient: the first comes from the contribution of $\bm{\xi}^{l}$, while the second originates from the fact that the same pattern also acts as $\hat{\bm{\xi}}^{l}$.

To ensure that the weighted sum in the pre-activation is independent of the upstream layer width and the number of modes $p^{l}$, we choose the initialization scheme such that $[\bm{\xi}^{l+1}\bm{\Sigma}^{l}(\hat{\bm{\xi}}^{l})^{\operatorname{T}}]_{ij}\sim\mathcal{O}(\frac{1}{\sqrt{N_{l}}})$. This scaling is inspired by studies of Hopfield models Jiang et al. (2021). In practice, we independently and identically sample the initial elements $\xi^{l+1}_{i\alpha},\Sigma^{l}_{\alpha},\hat{\xi}^{l}_{j\alpha}$ from the standard Gaussian distribution, and then the weight values are multiplied by a factor of $1/\sqrt{N_{l}\ln N_{l}}$. Note that the number of modes is assumed to be proportional to $\ln N_{l}$. But if the number is a constant denoted by $P^{l}$, then the factor becomes $1/\sqrt{P^{l}N_{l}}$.
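A quick numerical check of this scaling, with our own toy numbers; the upstream activations are stood in by rectified Gaussians to mimic $\mathcal{O}(1)$ ReLU outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N_l, N_next = 1000, 1000
p = int(round(np.log(N_l)))                       # number of modes ~ ln N_l
xi_hat = rng.standard_normal((N_l, p))
sigma = rng.standard_normal(p)
xi = rng.standard_normal((N_next, p))
w = (xi_hat * sigma) @ xi.T / np.sqrt(N_l * np.log(N_l))   # entries ~ O(1/sqrt(N_l))

h = np.abs(rng.standard_normal(N_l))              # stand-in for O(1) upstream activations
z = w.T @ h                                       # pre-activations of the downstream layer
print(w.std(), z.std())                           # ~1/sqrt(N_l) and ~O(1), independent of N_l
```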

Figure S1: Comparison among the SVD training, MDL (1L2P) and traditional BP in learning performance. The network structure is specified by $[784,100,100,10]$ in all cases. The full MDL indicates the MDL with the same number of parameters as that of the SVD training, while the blue dot (pruned SVD) indicates pruning $60\%$ of the modes (those with small $|s_{i}|$, ranked in descending order) off each layer (except the output layer) of the full SVD model, to make the consumed parameter amount comparable with that of MDL with $p=30$.

Figure S2: Comparison among low-rank decomposition (LRD), MDL (1L2P) and traditional BP. Both decomposition methods use $p=30$. The network structure is $[784,100,100,10]$ in all cases.

Appendix B Comparison to other matrix factorization methods

Here, we compare our MDL method to other matrix factorization methods in learning performance. These other methods include singular value decomposition (SVD), low-rank decomposition (LRD) and spectral training Yang et al. (2020); Chicchi et al. (2021).

First, the SVD learning scheme is implemented by decomposing the weight of each layer as

$$\bm{W}^{l}=\bm{U}^{l}\operatorname{diag}(\bm{s}^{l})(\bm{V}^{l})^{\top}, \qquad (S6)$$

where the diagonal matrix contains $\min(N_{l},N_{l+1})$ non-zero elements on the diagonal, and the elements of $\bm{s}^{l}$ are constrained to be positive. The orthogonality is enforced by two regularization terms as

$$L(\bm{U},\bm{s},\bm{V})=L_{T}+\lambda_{o}\sum_{l=1}^{D}L_{o}\left(\bm{U}_{l},\bm{V}_{l}\right)+\lambda_{s}\sum_{l=1}^{D}L_{s}\left(\bm{s}_{l}\right), \qquad (S7)$$

where $L_{T}$ is the original training loss, $L_{o}(\bm{U},\bm{V})=\frac{1}{r^{2}}\left(\|\bm{U}^{\operatorname{T}}\bm{U}-\bm{I}\|_{F}^{2}+\|\bm{V}^{\operatorname{T}}\bm{V}-\bm{I}\|_{F}^{2}\right)$, and $L_{s}(\bm{s})=\frac{\|\bm{s}\|_{1}}{\|\bm{s}\|_{2}}=\frac{\sum_{i}|s_{i}|}{\sqrt{\sum_{i}s_{i}^{2}}}$. Here $r$ is the rank of $\bm{U}$ and $\bm{V}$, and $\|\bullet\|_{F}$ denotes the Frobenius norm of a matrix. The regularization term $L_{o}$ forces $\bm{U}$ and $\bm{V}$ to be orthogonal, while $L_{s}$ adjusts the sparsity level of $\bm{s}$. The gradients for each set of parameters are derived below,

$$
\begin{aligned}
\frac{\partial L_{o}}{\partial\bm{U}} &=\frac{4}{r^{2}}\left(\bm{U}^{\top}\bm{U}-\bm{I}\right)^{\top}\bm{U}^{\top},\\
\frac{\partial L_{o}}{\partial\bm{V}} &=\frac{4}{r^{2}}\left(\bm{V}^{\top}\bm{V}-\bm{I}\right)^{\top}\bm{V}^{\top},\\
\frac{\partial L_{s}}{\partial s_{i}} &=\frac{\operatorname{sign}(s_{i})\sqrt{\sum_{j}s_{j}^{2}}-\sum_{j}|s_{j}|\left(\sum_{j}s_{j}^{2}\right)^{-\frac{1}{2}}s_{i}}{\sum_{j}s_{j}^{2}}.
\end{aligned}
\qquad (S8)
$$

For comparison, we carried out the SVD learning with regularization strengths $\lambda_{o}=100$, $\lambda_{s}=0.0$ and $\lambda_{o}=100$, $\lambda_{s}=5.0$, as shown in Fig. S1. We remark that the training cost is larger for SVD models, which can be calculated as $\sum_{l}[N_{l}\times N_{l+1}+\min(N_{l},N_{l+1})^{2}+\min(N_{l},N_{l+1})]$. Taking $[784,100,100,10]$ as an example, the learning needs $109710$ parameters in total. However, for our MDL with $p=30$, which already reaches the traditional BP performance, the learning only needs $35910$ parameters (while traditional BP needs $89400$ parameters). In simulations, we prune $60\%$ of the modes (those with small $|s_{i}|$, ranked in descending order) off each layer (except the output layer) of the full SVD model to make the number of trainable parameters comparable with that of MDL with $p=30$. We conclude that the MDL consumes fewer parameters, yet produces rapid learning with even better performance.
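The parameter counts quoted above can be reproduced with a short calculation. The count formulas below are our reading of the architectures: BP stores $N_{l}\times N_{l+1}$ weights per layer, and 1L2P MDL stores $p(N_{l}+N_{l+1})+p$ parameters per layer.

```python
widths = [784, 100, 100, 10]
layers = list(zip(widths[:-1], widths[1:]))

bp = sum(nl * nn for nl, nn in layers)                                    # 89400
svd = sum(nl * nn + min(nl, nn) ** 2 + min(nl, nn) for nl, nn in layers)  # 109710
p = 30
mdl_1l2p = sum(p * (nl + nn) + p for nl, nn in layers)                    # 35910
print(bp, svd, mdl_1l2p)
```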

Figure S3: Comparison among the spectral learning, MDL (1L2P) and traditional BP. The network structure is $[784,100,100,10]$ in all cases.

Next, we fix $\bm{\Sigma}=\mathbb{I}$ in our MDL, and this reduced form is called the low-rank decomposition as follows,

$$\mathbf{W}^{l}=\hat{\bm{\xi}}^{l}(\bm{\xi}^{l+1})^{\top}. \qquad (S9)$$

In the simulation, we set $p=30$. We can see in Fig. S2 that the performance of the LRD is much worse than that of MDL and traditional BP.

For the recently proposed spectral learning Chicchi et al. (2021), a carefully-designed transformation matrix $\mathbf{A}^{k}$ (an $N\times N$ matrix, where $N$ is the total number of units in the network and $k$ is a layer index) is used with a spectral decomposition. The eigenvalues and the associated basis are optimized. However, this training performs worse compared to our MDL in the examples shown in Fig. S3.

Appendix C Ranking the modes according to the importance matrix

Here, we rank the modes according to the diagonal of the importance matrix, rather than the $\tau$ measure. We find that these two ranking schemes lead to qualitatively identical results. Removing the most important modes (according to either the $\tau$ measure or the importance score) significantly impairs the generalization ability of the network. Details are illustrated in Fig. S4. The non-smooth behavior can be attributed to the existence of a mode-contribution gap, i.e., the most important modes ($<15\%$ for the $\tau$ measure; $<30\%$ for the $\Sigma$ measure) dominate the generalization capability of the network, while the other modes capture irrelevant noise in the data.

Figure S4: Ranking modes. The network structure is $[784,100,100,10]$, and we analyze the 1L1P case here with $p=70$. The marked percentage in the inset indicates the generalization accuracy after removing the corresponding side of modes in the hidden layer. The piecewise power-law behavior is retained for both types of ranking.

Appendix D The qualitative behavior of the MDL does not change with the regularization strength or the hidden-layer width

Furthermore, our MDL is in essence a matrix factorization. Therefore, the pattern and importance matrices are not unique. However, in practice, we impose an $\ell_{2}$-norm regularization on these patterns and importance scores. In fact, we find that the intriguing properties of the MDL in deep learning do not change with the regularization strength of the $\ell_{2}$ norm (denoted as $\lambda$, see Fig. S8). Figure S5 shows an example of the behavior of the optimal number of modes versus the hidden-layer width, while Fig. S6 shows that the piecewise power-law behavior of the $\tau$ measure does not change with the regularization strength. In addition, the piecewise power-law behavior of the $\tau$ measure does not change with the hidden-layer width either (Fig. S7).

Figure S5: The piecewise increasing behavior of the least $p$ with the hidden-layer width. The network has structure $[784,N,N,10]$, and we vary $N$ to get the corresponding least $p$, defined as the least number of modes that MDL needs to reach the performance of the traditional BP. The 1L2P case is considered. (a) and (b) are obtained under different regularization strengths: (a) $\lambda=10^{-3}$; (b) $\lambda=10^{-4}$.

Figure S6: The piecewise power-law behavior of the $\tau$ measure does not change with the regularization strength $\lambda$ in the 1L1P case. The network structure is specified by $[784,100,100,10]$. (a) $\lambda=0.01$. (b) $\lambda=0.001$. (c) $\lambda=0.0001$.

Figure S7: The piecewise power-law behavior of the $\tau$ measure does not change with the hidden-layer width in the 1L1P case. The network structure is specified by $[784,N,N,10]$. (a) $N=100$. (b) $N=150$. (c) $N=200$. (d) $N=300$.

Figure S8: The $\ell_{2}$ norm of the parameters $[\bm{\xi},\hat{\bm{\xi}},\bm{\Sigma}]$ under three regularization strengths $\lambda=10^{-2},10^{-3},10^{-4}$. The network structure is specified by $[784,100,100,10]$ and $p=70$ for all layers, where the 1L1P case is considered (the 1L2P case yields qualitatively the same results). (a, b, c) are plotted for the first layer, the second layer, and the output layer, respectively.

Appendix E Subspace overlap of layered response to pairs of stimuli

In this section, we provide details of estimating the average degree of correlation between neural responses $\mathbf{h}^{\ell}$ to pairs of different input stimuli (e.g., one stimulus contains images of the same class). The covariance of the neural response in each layer to a stimulus can be diagonalized to specify a low-dimensional subspace. The subspace is spanned by the first $K$ principal components. The subspace overlap can then be evaluated via the cosine of the principal angle between the two subspaces corresponding to two different stimuli. In practice, for neural responses in each layer to the stimulus $s_{1}$ (e.g., many images of digit 0), we first identify the first $K$ principal components of the covariance of $\mathbf{h}_{1}^{\ell}$, which explain over $80\%$ of the total variance, and then reorganize the eigenvectors into an $N_{\ell}\times K$ matrix, namely $\mathbf{Q}^{\ell}(s_{1})$. We repeat this procedure for another stimulus $s_{2}$ and get another matrix $\mathbf{Q}^{\ell}(s_{2})$. Therefore, the columns of $\mathbf{Q}^{\ell}(s_{1})$ and $\mathbf{Q}^{\ell}(s_{2})$ span two subspaces corresponding to the neural responses to $s_{1}$ and $s_{2}$, respectively. The cosine of the principal angle between these two subspaces is calculated as follows Bjoerck and Golub (1973)

$$\cos\theta_{p}\left(s_{1},s_{2}\right)=\sigma_{\max}\left(\mathbf{Q}^{\ell}(s_{1})^{\operatorname{T}}\mathbf{Q}^{\ell}(s_{2})\right), \qquad (S10)$$

where $\sigma_{\max}(\mathbf{B})$ denotes the largest singular value of the matrix $\mathbf{B}$. In simulations, we consider the classification task of the MNIST dataset, where ten classes of digits are fed into a seven-layer neural network. Specifically, we choose the $K^{\ell}$ that explains over $80\%$ of the total variance for each stimulus and each layer, and therefore the value of $K^{\ell}$ varies with layer and input stimulus.
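A sketch of the subspace-overlap estimate of Eq. (S10): PCA per stimulus, keep enough components to explain $80\%$ of the variance, then take the largest singular value of $\mathbf{Q}_{1}^{\operatorname{T}}\mathbf{Q}_{2}$. The implementation choices (centering, eigen-solver) are ours.

```python
import numpy as np

def principal_subspace(H, var_threshold=0.80):
    """H: responses of one layer to one stimulus, shape (n_samples, N_l).
    Returns the top-K eigenvectors of the covariance explaining > var_threshold of the variance."""
    Hc = H - H.mean(axis=0)
    cov = Hc.T @ Hc / (len(H) - 1)
    evals, evecs = np.linalg.eigh(cov)              # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]      # descending order
    K = np.searchsorted(np.cumsum(evals) / evals.sum(), var_threshold) + 1
    return evecs[:, :K]                             # orthonormal basis, shape (N_l, K)

def subspace_overlap(H1, H2):
    """Cosine of the principal angle between the response subspaces to two stimuli, Eq. (S10)."""
    Q1, Q2 = principal_subspace(H1), principal_subspace(H2)
    return np.linalg.svd(Q1.T @ Q2, compute_uv=False)[0]   # largest singular value
```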

In the main text, we observe a mild increase of the subspace overlap in deep layers. Here, as shown in Fig. S9, we link this behavior to the saturation of the test performance with increasing number of layers and network width. In addition, the task we consider is relatively simple, and thus three hidden layers (five layers in total) are sufficient to classify the digits with a high accuracy. The subspace overlap under the MDL setting thus suggests a consistent way to determine the optimal number of layers and the network width in practical training.

Figure S9: The averaged subspace overlap versus layer. Different numbers of layers with different hidden-layer widths are considered. The results are averaged over five independent trainings. The inset shows the corresponding test accuracy changing with the hidden-layer width.

Figure S10: A simple illustration of the teacher-student setup with i.i.d. standard Gaussian input. The teacher network has $k$ hidden nodes and $p^{*}$ modes, while the student network has $m$ hidden nodes and $p$ modes. The goal of the student network is to predict the labels generated by the teacher network, minimizing the mean-squared error. The weights to the output layer are set to one in the linear readout for both teacher and student networks ($\mathbf{v}=\mathbf{I}_{m}$, $\mathbf{v}^{*}=\mathbf{I}_{k}$, where $\mathbf{I}_{x}$ is an all-one vector of length $x$).

Appendix F Mean-field predictions of on-line learning dynamics

In this section, we give a detailed derivation of the mean-field ordinary differential equations for the on-line dynamics. A sketch of the toy model setting is shown in Fig. S10. The label for each sample $\mathbf{x}^{\nu}$ (i.i.d. standard Gaussian variables) is generated by the teacher network,

$$y^{\nu}=f(\bm{\lambda}^{*\nu})=\frac{1}{k}\sum_{r=1}^{k}\sigma\left(\frac{[\bm{\xi}^{*}\bm{\Sigma}^{*}(\hat{\bm{\xi}}^{*})^{\operatorname{T}}]_{r}\,\mathbf{x}^{\nu}}{\sqrt{d}}\right)=\frac{1}{k}\sum_{r=1}^{k}\sigma\left(\lambda_{r}^{*\nu}\right), \qquad (S11)$$

where $[\mathbf{A}]_{r}$ denotes the $r$-th row of the matrix $\mathbf{A}$, and $\lambda_{r}^{*\nu}=[\bm{\xi}^{*}\bm{\Sigma}^{*}(\hat{\bm{\xi}}^{*})^{\operatorname{T}}]_{r}\mathbf{x}^{\nu}/\sqrt{d}$ represents the $r$-th element of the teacher local field vector $\bm{\lambda}^{*\nu}\in\mathbb{R}^{k}$. To ensure the local field is independent of the input dimension, we choose the initialization scheme for the teacher network such that $[\bm{\xi}^{*}\bm{\Sigma}^{*}(\hat{\bm{\xi}}^{*})^{\operatorname{T}}]_{ij}\sim\mathcal{O}(1)$. More precisely, we set the elements $\xi^{*}_{ik},\Sigma^{*}_{k},\hat{\xi}^{*}_{jk}$ to be independent standard Gaussian variables, and then multiply the weight values by a factor of $\frac{1}{\sqrt{\ln d}}$ for a logarithmically increasing number of modes. This scaling ensures that the magnitude of the weight values is of order one. Different forms of the transfer function $\sigma(\cdot)$ can be considered, but we choose the error function for the simplicity of the following theoretical analysis. The prediction of the label by the student network for a new sample $\mathbf{x}$ is given by

$$\hat{f}(\mathbf{x},\hat{\bm{\xi}},\bm{\Sigma},\bm{\xi})=\frac{1}{m}\sum_{r=1}^{m}\sigma\left(\frac{[\bm{\xi}\bm{\Sigma}(\hat{\bm{\xi}})^{\operatorname{T}}]_{r}\,\mathbf{x}}{\sqrt{d}}\right)=\frac{1}{m}\sum_{r=1}^{m}\sigma\left(\lambda_{r}\right), \qquad (S12)$$

where $\lambda_{r}$ denotes the $r$-th component of the student local field $\bm{\lambda}=\bm{\xi}\bm{\Sigma}\hat{\bm{\xi}}^{\operatorname{T}}\mathbf{x}/\sqrt{d}$. The student network has $m$ hidden nodes and $p$ patterns. For simplicity, we assume $m=k$, $p=p^{*}$, and only the pattern $\hat{\bm{\xi}}$ is learned.

Training the student network with the one-pass gradient descent (on-line learning) directly minimizes the following mean-squared error (MSE):

$$\ell_{\mathrm{MSE}}(\bm{\lambda},\bm{\lambda}^{*})=\frac{1}{2}\left\langle\left(\hat{f}(\bm{\lambda})-f\left(\bm{\lambda}^{*}\right)\right)^{2}\right\rangle, \qquad (S13)$$

where $\langle\cdot\rangle$ indicates the average over $\{\mathbf{x},y\}$, which can be replaced by the average over local fields. For the Gaussian data $\mathbb{P}(\mathbf{x})=\mathcal{N}(\mathbf{x}|0,\mathds{1})$, the dynamics of $\ell_{\mathrm{MSE}}$ can be completely determined by the following order parameters: $\mathbf{Q}^{\nu}\equiv\mathbb{E}_{\mathbf{x},y\sim\mathbb{P}(\mathbf{x},y)}\left[\bm{\lambda}^{\nu}(\bm{\lambda}^{\nu})^{\operatorname{T}}\right]=\frac{1}{d}\bm{\xi}^{\nu}\bm{\Sigma}^{\nu}(\hat{\bm{\xi}}^{\nu})^{\operatorname{T}}\hat{\bm{\xi}}^{\nu}\bm{\Sigma}^{\nu}(\bm{\xi}^{\nu})^{\operatorname{T}}$, $\mathbf{M}^{\nu}\equiv\mathbb{E}_{\mathbf{x},y\sim\mathbb{P}(\mathbf{x},y)}\left[\bm{\lambda}^{\nu}(\bm{\lambda}^{*\nu})^{\operatorname{T}}\right]=\frac{1}{d}\bm{\xi}^{\nu}\bm{\Sigma}^{\nu}(\hat{\bm{\xi}}^{\nu})^{\operatorname{T}}\hat{\bm{\xi}}^{*\nu}\bm{\Sigma}^{*\nu}(\bm{\xi}^{*\nu})^{\operatorname{T}}$, and $\mathbf{P}^{\nu}\equiv\mathbb{E}_{\mathbf{x},y\sim\mathbb{P}(\mathbf{x},y)}\left[\bm{\lambda}^{*\nu}(\bm{\lambda}^{*\nu})^{\operatorname{T}}\right]=\frac{1}{d}\bm{\xi}^{*\nu}\bm{\Sigma}^{*\nu}(\hat{\bm{\xi}}^{*\nu})^{\operatorname{T}}\hat{\bm{\xi}}^{*\nu}\bm{\Sigma}^{*\nu}(\bm{\xi}^{*\nu})^{\operatorname{T}}$. The corresponding matrix elements are denoted as $q_{jl}^{\nu}\equiv\left[\mathbf{Q}^{\nu}\right]_{jl}$, $m_{jr}^{\nu}\equiv\left[\mathbf{M}^{\nu}\right]_{jr}$ and $\rho_{rs}\equiv[\mathbf{P}]_{rs}$. Then we can define the local-field covariance matrix $\bm{\Omega}^{\nu}\in\mathbb{R}^{(k+m)\times(k+m)}$ at time step $\nu$ as follows,

$$\bm{\Omega}^{\nu}\equiv\begin{bmatrix}\mathbf{Q}^{\nu}&\mathbf{M}^{\nu}\\ (\mathbf{M}^{\nu})^{\operatorname{T}}&\mathbf{P}\end{bmatrix}, \qquad (S14)$$

where $\mathbf{P}$ is fixed by definition (the parameters of the teacher network are quenched), the sample index $\nu$ is also the time step in the on-line learning setting, and the evolution of the other order parameters $\mathbf{Q}^{\nu}$ and $\mathbf{M}^{\nu}$ is driven by the gradient flow of the mode parameters $\bm{\theta}^{l}=(\hat{\bm{\xi}}^{l},\bm{\Sigma}^{l},\bm{\xi}^{l+1})$. The loss is completely determined by the evolving order parameters,

$$\ell_{\mathrm{MSE}}(\bm{\Omega})=\frac{1}{2}\,\mathbb{E}_{\bm{\lambda},\bm{\lambda}^{*}\sim\mathcal{N}\left(\bm{\lambda},\bm{\lambda}^{*}\mid 0,\bm{\Omega}\right)}\left[\left(\hat{f}(\bm{\lambda})-f\left(\bm{\lambda}^{*}\right)\right)^{2}\right]={\ell}_{\mathrm{t}}(\mathbf{P})+{\ell}_{\mathrm{s}}(\mathbf{Q})+\ell_{\mathrm{st}}(\mathbf{P},\mathbf{Q},\mathbf{M}), \qquad (S15)$$

where

$$
\begin{aligned}
{\ell}_{\mathrm{t}}(\mathbf{P}) &\equiv\frac{1}{2}\,\mathbb{E}_{\bm{\lambda}^{*}\sim\mathcal{N}\left(\bm{\lambda}^{*}\mid 0,\mathbf{P}\right)}\left[f\left(\bm{\lambda}^{*}\right)^{2}\right]=\frac{1}{2k^{2}}\sum_{r,s=1}^{k}\mathbb{E}_{\bm{\lambda}^{*}\sim\mathcal{N}\left(\bm{\lambda}^{*}\mid 0,\mathbf{P}\right)}\left[\sigma\left(\lambda_{r}^{*}\right)\sigma\left(\lambda_{s}^{*}\right)\right],\\
{\ell}_{\mathrm{s}}(\mathbf{Q}) &\equiv\frac{1}{2}\,\mathbb{E}_{\bm{\lambda}\sim\mathcal{N}(\bm{\lambda}\mid 0,\mathbf{Q})}\left[\hat{f}(\bm{\lambda})^{2}\right]=\frac{1}{2m^{2}}\sum_{j,l=1}^{m}\mathbb{E}_{\bm{\lambda}\sim\mathcal{N}(\bm{\lambda}\mid 0,\mathbf{Q})}\left[\sigma\left(\lambda_{j}\right)\sigma\left(\lambda_{l}\right)\right],\\
\ell_{\mathrm{st}}(\mathbf{P},\mathbf{Q},\mathbf{M}) &\equiv-\mathbb{E}_{\bm{\lambda},\bm{\lambda}^{*}\sim\mathcal{N}\left(\bm{\lambda},\bm{\lambda}^{*}\mid 0,\bm{\Omega}\right)}\left[\hat{f}(\bm{\lambda})f\left(\bm{\lambda}^{*}\right)\right]=-\frac{1}{mk}\sum_{j=1}^{m}\sum_{r=1}^{k}\mathbb{E}_{\bm{\lambda},\bm{\lambda}^{*}\sim\mathcal{N}\left(\bm{\lambda},\bm{\lambda}^{*}\mid 0,\bm{\Omega}\right)}\left[\sigma\left(\lambda_{j}\right)\sigma\left(\lambda_{r}^{*}\right)\right].
\end{aligned}
\qquad (S16)
$$

To proceed, we define the integral $I_{2}=\mathbb{E}_{\bm{\lambda},\bm{\lambda}^{*}\sim\mathcal{N}(\bm{\lambda},\bm{\lambda}^{*}\mid 0,\bm{\Omega})}\left[\sigma\left(\bm{\lambda}^{\alpha}\right)\sigma\left(\bm{\lambda}^{\beta}\right)\right]$, which has an analytic form for $\sigma(x)=\operatorname{erf}(x/\sqrt{2})$ as follows,

$$\mathbb{E}_{\bm{\lambda},\bm{\lambda}^{*}\sim\mathcal{N}\left(\bm{\lambda},\bm{\lambda}^{*}\mid 0,\bm{\Omega}\right)}\left[\sigma\left(\bm{\lambda}^{\alpha}\right)\sigma\left(\bm{\lambda}^{\beta}\right)\right]=\frac{2}{\pi}\arcsin\left(\frac{\Omega_{12}^{\alpha\beta}}{\sqrt{\left(1+\Omega_{11}^{\alpha\beta}\right)\left(1+\Omega_{22}^{\alpha\beta}\right)}}\right), \qquad (S17)$$

where $\Omega_{ij}^{\alpha\beta}$ denotes the element of the overlap matrix for $\bm{\lambda}^{\alpha}$ and $\bm{\lambda}^{\beta}$, in which $\alpha$ and $\beta$ indicate the attribute of the network (teacher or student). Therefore, the generalization error can be estimated as follows,

$$
\begin{aligned}
\ell_{\mathrm{MSE}}(\bm{\Omega})&=\frac{1}{k^{2}}\sum_{r,s=1}^{k}\frac{1}{\pi}\arcsin\left(\frac{\rho_{rs}}{\sqrt{\left(1+\rho_{rr}\right)\left(1+\rho_{ss}\right)}}\right)+\frac{1}{m^{2}}\sum_{j,l=1}^{m}\frac{1}{\pi}\arcsin\left(\frac{q_{jl}}{\sqrt{\left(1+q_{jj}\right)\left(1+q_{ll}\right)}}\right)\\
&\quad-\frac{2}{mk}\sum_{j=1}^{m}\sum_{r=1}^{k}\frac{1}{\pi}\arcsin\left(\frac{m_{jr}}{\sqrt{\left(1+q_{jj}\right)\left(1+\rho_{rr}\right)}}\right).
\end{aligned}
\qquad (S18)
$$
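Eq. (S18) is straightforward to evaluate numerically once the order parameters are known. A minimal sketch in our notation, assuming Q is $m\times m$, P is $k\times k$, and M is $m\times k$:

```python
import numpy as np

def mse_from_order_params(Q, M, P):
    """Generalization error of Eq. (S18) for sigma(x) = erf(x/sqrt(2))."""
    m, k = Q.shape[0], P.shape[0]

    def term(C, A, B):
        # sum over (1/pi) * arcsin( C_ab / sqrt((1 + A_aa)(1 + B_bb)) )
        den = np.sqrt(np.outer(1 + np.diag(A), 1 + np.diag(B)))
        return np.arcsin(C / den).sum() / np.pi

    return term(P, P, P) / k**2 + term(Q, Q, Q) / m**2 - 2 * term(M, Q, P) / (m * k)
```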

We next consider the evolution of the order parameters, which involves only the update of $\hat{\bm{\xi}}$ in our toy model setting. Therefore, we derive the evolution of the order parameters $q^{\nu}_{jl}$ and $m^{\nu}_{jr}$ based on the gradient of $\hat{\bm{\xi}}$: $\Delta\hat{\xi}_{j\alpha}=-\eta\frac{\hat{f}(\bm{\lambda})-f(\bm{\lambda}^{*})}{m\sqrt{d}}\sum_{i}\sigma^{\prime}(\lambda_{i})\xi_{i\alpha}\Sigma_{\alpha}x_{j}^{\nu}$. In the high-dimensional limit ($d\to\infty$), we use the self-averaging property of the order parameters, considering the disorder average over the input data distribution Biehl and Schwarze (1995); Saad and Solla (1995); Goldt et al. (2019). Then we have the following expressions,

$$
\begin{aligned}
q_{jl}^{\nu+1}-q_{jl}^{\nu} &=\frac{1}{d}\,\mathbb{E}\left[\sum_{n,\alpha,\beta}\xi_{j\alpha}\Sigma_{\alpha}(\hat{\xi}_{n\alpha}+\Delta\hat{\xi}_{n\alpha})(\hat{\xi}_{n\beta}+\Delta\hat{\xi}_{n\beta})\Sigma_{\beta}\xi_{l\beta}\right]-q_{jl}^{\nu},\\
m_{jr}^{\nu+1}-m_{jr}^{\nu} &=\frac{1}{d}\,\mathbb{E}\left[\sum_{n,\alpha,\beta}\xi_{j\alpha}\Sigma_{\alpha}(\hat{\xi}_{n\alpha}+\Delta\hat{\xi}_{n\alpha})\hat{\xi}^{*}_{n\beta}\Sigma^{*}_{\beta}\xi^{*}_{r\beta}\right]-m_{jr}^{\nu},
\end{aligned}
\qquad (S19)
$$

where the expectation is carried out with respect to the data distribution.

Inserting the update equation of $\hat{\bm{\xi}}$ into the equation of the order parameter $q_{jl}$, we get

$$
\begin{aligned}
q_{jl}^{\nu+1}-q_{jl}^{\nu} &=\frac{1}{d}\,\mathbb{E}\left[\sum_{n,\alpha,\beta}\xi_{j\alpha}\Sigma_{\alpha}\left(\hat{\xi}_{n\alpha}+\eta\frac{f(\bm{\lambda}^{*})-\hat{f}(\bm{\lambda})}{m\sqrt{d}}\sum_{i}\sigma^{\prime}(\lambda_{i})\xi_{i\alpha}\Sigma_{\alpha}x_{n}^{\nu}\right)\left(\hat{\xi}_{n\beta}+\eta\frac{f(\bm{\lambda}^{*})-\hat{f}(\bm{\lambda})}{m\sqrt{d}}\sum_{i}\sigma^{\prime}(\lambda_{i})\xi_{i\beta}\Sigma_{\beta}x_{n}^{\nu}\right)\Sigma_{\beta}\xi_{l\beta}\right]-q_{jl}^{\nu}\\
&=\frac{1}{d}\,\mathbb{E}\left[\sum_{n,\alpha,\beta}\xi_{j\alpha}\Sigma_{\alpha}\hat{\xi}_{n\alpha}\,\eta\frac{f(\bm{\lambda}^{*})-\hat{f}(\bm{\lambda})}{m\sqrt{d}}\sum_{i}\sigma^{\prime}(\lambda_{i})\xi_{i\beta}\Sigma_{\beta}x_{n}^{\nu}\,\Sigma_{\beta}\xi_{l\beta}\right]\\
&\quad+\frac{1}{d}\,\mathbb{E}\left[\sum_{n,\alpha,\beta}\xi_{j\alpha}\Sigma_{\alpha}\,\eta\frac{f(\bm{\lambda}^{*})-\hat{f}(\bm{\lambda})}{m\sqrt{d}}\sum_{i}\sigma^{\prime}(\lambda_{i})\xi_{i\alpha}\Sigma_{\alpha}x_{n}^{\nu}\,\hat{\xi}_{n\beta}\Sigma_{\beta}\xi_{l\beta}\right]\\
&\quad+\frac{1}{d}\,\mathbb{E}\left[\sum_{n,\alpha,\beta}\xi_{j\alpha}\Sigma_{\alpha}\,\eta\frac{f(\bm{\lambda}^{*})-\hat{f}(\bm{\lambda})}{m\sqrt{d}}\sum_{i}\sigma^{\prime}(\lambda_{i})\xi_{i\alpha}\Sigma_{\alpha}x_{n}^{\nu}\left(\eta\frac{f(\bm{\lambda}^{*})-\hat{f}(\bm{\lambda})}{m\sqrt{d}}\sum_{i}\sigma^{\prime}(\lambda_{i})\xi_{i\beta}\Sigma_{\beta}x_{n}^{\nu}\right)\Sigma_{\beta}\xi_{l\beta}\right],
\end{aligned}
\qquad (S20)
$$

where we have applied the definition $q_{jl}^{\nu}=\frac{1}{d}\sum_{n,\alpha,\beta}\xi_{j\alpha}\Sigma_{\alpha}\hat{\xi}_{n\alpha}\hat{\xi}_{n\beta}\Sigma_{\beta}\xi_{l\beta}$ to derive the second equality. Considering the definitions of $\hat{f}(\bm{\lambda})$, $f(\bm{\lambda}^{*})$, $\bm{\lambda}$, and $\bm{\lambda}^{*}$, we recast Eq. (S20) as follows,

$$
\begin{aligned}
q_{jl}^{\nu+1}-q_{jl}^{\nu} &=\frac{\eta}{dkm}\,\mathbb{E}\left[\sum_{\beta,r,i}\lambda_{j}\sigma\left(\lambda_{r}^{*\nu}\right)\sigma^{\prime}(\lambda_{i})\xi_{i\beta}\Sigma_{\beta}^{2}\xi_{l\beta}\right]-\frac{\eta}{dm^{2}}\,\mathbb{E}\left[\sum_{\beta,\hat{r},i}\lambda_{j}\sigma\left(\lambda_{\hat{r}}^{\nu}\right)\sigma^{\prime}(\lambda_{i})\xi_{i\beta}\Sigma_{\beta}^{2}\xi_{l\beta}\right]\\
&\quad+\frac{\eta}{dkm}\,\mathbb{E}\left[\sum_{\alpha,r,i}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}\lambda_{l}\sigma\left(\lambda_{r}^{*\nu}\right)\sigma^{\prime}(\lambda_{i})\right]-\frac{\eta}{dm^{2}}\,\mathbb{E}\left[\sum_{\alpha,\hat{r},i}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}\lambda_{l}\sigma\left(\lambda_{\hat{r}}^{\nu}\right)\sigma^{\prime}(\lambda_{i})\right]\\
&\quad+\frac{\eta^{2}}{dm^{2}k^{2}}\,\mathbb{E}\left[\sum_{\alpha,\beta,i,a,r,\hat{r}}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}\sigma^{\prime}(\lambda_{i})\sigma^{\prime}(\lambda_{a})\sigma\left(\lambda_{r}^{*\nu}\right)\sigma\left(\lambda_{\hat{r}}^{*\nu}\right)\xi_{a\beta}\Sigma_{\beta}^{2}\xi_{l\beta}\right]\\
&\quad+\frac{\eta^{2}}{dm^{4}}\,\mathbb{E}\left[\sum_{\alpha,\beta,i,a,r,\hat{r}}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}\sigma^{\prime}(\lambda_{i})\sigma^{\prime}(\lambda_{a})\sigma\left(\lambda_{r}^{\nu}\right)\sigma\left(\lambda_{\hat{r}}^{\nu}\right)\xi_{a\beta}\Sigma_{\beta}^{2}\xi_{l\beta}\right]\\
&\quad-\frac{2\eta^{2}}{dm^{3}k}\,\mathbb{E}\left[\sum_{\alpha,\beta,i,a,r,\hat{r}}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}\sigma^{\prime}(\lambda_{i})\sigma^{\prime}(\lambda_{a})\sigma\left(\lambda_{r}^{*\nu}\right)\sigma\left(\lambda_{\hat{r}}^{\nu}\right)\xi_{a\beta}\Sigma_{\beta}^{2}\xi_{l\beta}\right].
\end{aligned}
\qquad (S21)
$$

To proceed, we have to evaluate the integral I_{3}(\alpha,\beta,\eta)=\mathbb{E}_{\bm{\lambda},\bm{\lambda}^{*}\sim\mathcal{N}\left(\bm{\lambda},\bm{\lambda}^{*}\mid 0,\mathbf{\Omega}\right)}\left[\sigma^{\prime}\left(\lambda^{\alpha}\right)\lambda^{\beta}\sigma\left(\lambda^{\eta}\right)\right]=\frac{2}{\pi}\frac{\Omega_{23}^{\alpha\beta\eta}\left(1+\Omega_{11}^{\alpha\beta\eta}\right)-\Omega_{12}^{\alpha\beta\eta}\Omega_{13}^{\alpha\beta\eta}}{\left(1+\Omega_{11}^{\alpha\beta\eta}\right)\sqrt{\left(1+\Omega_{11}^{\alpha\beta\eta}\right)\left(1+\Omega_{33}^{\alpha\beta\eta}\right)-\left(\Omega_{13}^{\alpha\beta\eta}\right)^{2}}} for our transfer function \sigma(x)=\operatorname{erf}(x/\sqrt{2}), where \Omega_{ij}^{\alpha\beta\eta} denotes the element of the field-covariance matrix \mathbf{\Omega}^{\alpha\beta\eta}, and the arguments (\alpha,\beta,\eta) of I_{3} are dummy labels of local fields (not to be confused with the learning rate \eta or the mode indices used above). We also have to evaluate the second integral I_{4}(i,j,k,l)=\mathbb{E}_{\bm{\lambda},\bm{\lambda}^{*}\sim\mathcal{N}\left(\bm{\lambda},\bm{\lambda}^{*}\mid 0,\mathbf{\Omega}\right)}\left[\sigma^{\prime}(\lambda_{i})\sigma^{\prime}(\lambda_{j})\sigma(\lambda_{k})\sigma(\lambda_{l})\right], which has the closed form

I_{4}(i,j,k,l)=\frac{4}{\pi^{2}}\frac{1}{\sqrt{\bar{\Omega}_{0}^{ijkl}}}\arcsin\left(\frac{\bar{\Omega}_{1}^{ijkl}}{\sqrt{\bar{\Omega}_{2}^{ijkl}\bar{\Omega}_{3}^{ijkl}}}\right), (S22)

where

\bar{\Omega}_{0}^{ijkl}\equiv\left(1+\Omega_{11}^{ijkl}\right)\left(1+\Omega_{22}^{ijkl}\right)-\left(\Omega_{12}^{ijkl}\right)^{2}, (S23)
\bar{\Omega}_{1}^{ijkl}\equiv\bar{\Omega}_{0}^{ijkl}\Omega_{34}^{ijkl}-\Omega_{23}^{ijkl}\Omega_{24}^{ijkl}\left(1+\Omega_{11}^{ijkl}\right)-\Omega_{13}^{ijkl}\Omega_{14}^{ijkl}\left(1+\Omega_{22}^{ijkl}\right)+\Omega_{12}^{ijkl}\Omega_{13}^{ijkl}\Omega_{24}^{ijkl}+\Omega_{12}^{ijkl}\Omega_{14}^{ijkl}\Omega_{23}^{ijkl},
\bar{\Omega}_{2}^{ijkl}\equiv\bar{\Omega}_{0}^{ijkl}\left(1+\Omega_{33}^{ijkl}\right)-\left(\Omega_{23}^{ijkl}\right)^{2}\left(1+\Omega_{11}^{ijkl}\right)-\left(\Omega_{13}^{ijkl}\right)^{2}\left(1+\Omega_{22}^{ijkl}\right)+2\Omega_{12}^{ijkl}\Omega_{13}^{ijkl}\Omega_{23}^{ijkl},
\bar{\Omega}_{3}^{ijkl}\equiv\bar{\Omega}_{0}^{ijkl}\left(1+\Omega_{44}^{ijkl}\right)-\left(\Omega_{24}^{ijkl}\right)^{2}\left(1+\Omega_{11}^{ijkl}\right)-\left(\Omega_{14}^{ijkl}\right)^{2}\left(1+\Omega_{22}^{ijkl}\right)+2\Omega_{12}^{ijkl}\Omega_{14}^{ijkl}\Omega_{24}^{ijkl}.
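
The closed forms above can be checked directly by Monte Carlo sampling. The sketch below is only a numerical sanity check of Eqs. (S22)-(S23) and of the expression for I_{3}; the covariance matrix Omega used here is an arbitrary positive-definite example standing in for the field-covariance matrix.

```python
import numpy as np
from scipy.special import erf

# Monte-Carlo check of the closed forms of I_3 and I_4 for sigma(x) = erf(x/sqrt(2)).
sigma  = lambda x: erf(x / np.sqrt(2))
dsigma = lambda x: np.sqrt(2 / np.pi) * np.exp(-x ** 2 / 2)

def I3_closed(O):
    # O: 3x3 covariance of the three local fields (lambda_a, lambda_b, lambda_c)
    num = O[1, 2] * (1 + O[0, 0]) - O[0, 1] * O[0, 2]
    den = (1 + O[0, 0]) * np.sqrt((1 + O[0, 0]) * (1 + O[2, 2]) - O[0, 2] ** 2)
    return 2 / np.pi * num / den

def I4_closed(O):
    # O: 4x4 covariance of (lambda_a, lambda_b, lambda_c, lambda_d), cf. Eqs. (S22)-(S23)
    L0 = (1 + O[0, 0]) * (1 + O[1, 1]) - O[0, 1] ** 2
    L1 = (L0 * O[2, 3] - O[1, 2] * O[1, 3] * (1 + O[0, 0])
          - O[0, 2] * O[0, 3] * (1 + O[1, 1])
          + O[0, 1] * O[0, 2] * O[1, 3] + O[0, 1] * O[0, 3] * O[1, 2])
    L2 = (L0 * (1 + O[2, 2]) - O[1, 2] ** 2 * (1 + O[0, 0])
          - O[0, 2] ** 2 * (1 + O[1, 1]) + 2 * O[0, 1] * O[0, 2] * O[1, 2])
    L3 = (L0 * (1 + O[3, 3]) - O[1, 3] ** 2 * (1 + O[0, 0])
          - O[0, 3] ** 2 * (1 + O[1, 1]) + 2 * O[0, 1] * O[0, 3] * O[1, 3])
    return 4 / np.pi ** 2 / np.sqrt(L0) * np.arcsin(L1 / np.sqrt(L2 * L3))

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Omega = A @ A.T / 4                       # arbitrary positive-definite covariance
z = rng.multivariate_normal(np.zeros(4), Omega, size=1_000_000)

mc_I3 = np.mean(dsigma(z[:, 0]) * z[:, 1] * sigma(z[:, 2]))
mc_I4 = np.mean(dsigma(z[:, 0]) * dsigma(z[:, 1]) * sigma(z[:, 2]) * sigma(z[:, 3]))
print(mc_I3, I3_closed(Omega[:3, :3]))    # the two numbers should agree up to sampling noise
print(mc_I4, I4_closed(Omega))
```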

In an analogous way, we can derive the mean-field evolution of m_{jr}^{\nu} as follows,

m_{jr}^{\nu+1}-m_{jr}^{\nu} = \frac{1}{d}\mathbb{E}\left[\sum_{n,\alpha,\beta}\xi_{j\alpha}\Sigma_{\alpha}(\hat{\xi}_{n\alpha}+\Delta\hat{\xi}_{n\alpha})\hat{\xi}^{*}_{n\beta}\Sigma^{*}_{\beta}\xi^{*}_{r\beta}\right]-m_{jr}^{\nu} (S24)
= \frac{\eta}{kmd}\mathbb{E}\left[\sum_{\alpha,i,a}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}\sigma(\lambda_{a}^{*})\sigma^{\prime}(\lambda_{i})\lambda_{r}^{*}\right]-\frac{\eta}{m^{2}d}\mathbb{E}\left[\sum_{\alpha,i,a}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}\sigma(\lambda_{a})\sigma^{\prime}(\lambda_{i})\lambda_{r}^{*}\right],

where the definition of m_{jr} has been used. If we define \tau\equiv\nu/d and take the thermodynamic limit d\to\infty, the time step becomes continuous, and we can thus write down the following ODEs,

\frac{\mathrm{d}q_{jl}}{\mathrm{d}\tau} = \frac{\eta}{km}\left[\sum_{\beta,r^{*},i}I_{3}(i,j,r^{*})\xi_{i\beta}\Sigma_{\beta}^{2}\xi_{l\beta}\right]-\frac{\eta}{m^{2}}\left[\sum_{\beta,\hat{r},i}I_{3}(i,j,\hat{r})\xi_{i\beta}\Sigma_{\beta}^{2}\xi_{l\beta}\right] (S25)
+\frac{\eta}{km}\left[\sum_{\alpha,r^{*},i}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}I_{3}(i,l,r^{*})\right]-\frac{\eta}{m^{2}}\left[\sum_{\alpha,\hat{r},i}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}I_{3}(i,l,\hat{r})\right]
+\frac{\eta^{2}}{m^{4}}\left[\sum_{\alpha,\beta,i,a,r^{*},\hat{r}^{*}}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}I_{4}(i,a,r^{*},\hat{r}^{*})\xi_{a\beta}\Sigma_{\beta}^{2}\xi_{l\beta}\right]
+\frac{\eta^{2}}{m^{2}k^{2}}\left[\sum_{\alpha,\beta,i,a,r,\hat{r}}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}I_{4}(i,a,r,\hat{r})\xi_{a\beta}\Sigma_{\beta}^{2}\xi_{l\beta}\right]
-\frac{2\eta^{2}}{m^{3}k}\left[\sum_{\alpha,\beta,i,a,r^{*},\hat{r}}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}I_{4}(i,a,r^{*},\hat{r})\xi_{a\beta}\Sigma_{\beta}^{2}\xi_{l\beta}\right],
\frac{\mathrm{d}m_{jr}}{\mathrm{d}\tau} = \frac{\eta}{km}\left[\sum_{\alpha,i,a}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}I_{3}(i,r^{*},a^{*})\right]-\frac{\eta}{m^{2}}\left[\sum_{\alpha,i,a}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}I_{3}(i,r^{*},a)\right],

where an index of I_{3} or I_{4} carrying the symbol * labels the corresponding teacher local field.
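
As an illustration of how these ODEs can be integrated numerically, the sketch below performs a forward-Euler step for \mathrm{d}m_{jr}/\mathrm{d}\tau. The 3\times 3 field covariances entering I_{3} are assembled from the student-student overlap q, the student-teacher overlap m, and a teacher self-overlap T that is taken here as a fixed input; the helper names, the constant matrix C_{ji}=\sum_{\alpha}\xi_{j\alpha}\Sigma_{\alpha}^{2}\xi_{i\alpha}, and the Euler discretization are illustrative choices rather than the exact numerical routine used in the paper. The equation for \mathrm{d}q_{jl}/\mathrm{d}\tau is handled in the same way with I_{4}.

```python
import numpy as np

# Forward-Euler sketch for dm_{jr}/dtau in Eq. (S25); hypothetical helper names.
# q:    (m x m) student-student overlap
# m_st: (m x k) student-teacher overlap
# T:    (k x k) teacher self-overlap (assumed fixed during training)
# C:    (m x m) matrix C[j, i] = sum_alpha xi_{j,alpha} Sigma_alpha^2 xi_{i,alpha}

def I3_closed(O):
    # closed form of I_3 for sigma(x) = erf(x/sqrt(2)); O is a 3x3 field covariance
    num = O[1, 2] * (1 + O[0, 0]) - O[0, 1] * O[0, 2]
    den = (1 + O[0, 0]) * np.sqrt((1 + O[0, 0]) * (1 + O[2, 2]) - O[0, 2] ** 2)
    return 2 / np.pi * num / den

def cov3(idx, starred, q, m_st, T):
    """3x3 covariance of the three local fields selected by idx;
    starred[t] = True means the t-th field belongs to the teacher."""
    def block(s1, i1, s2, i2):
        if s1 and s2:
            return T[i1, i2]
        if not s1 and not s2:
            return q[i1, i2]
        return m_st[i2, i1] if s1 else m_st[i1, i2]
    return np.array([[block(starred[s], idx[s], starred[t], idx[t])
                      for t in range(3)] for s in range(3)])

def dm_dtau(q, m_st, T, C, eta):
    m, k = m_st.shape
    out = np.zeros_like(m_st)
    for j in range(m):
        for r in range(k):
            # teacher term: I_3(i, r*, a*), summed over student node i and teacher node a
            t1 = sum(C[j, i] * I3_closed(cov3((i, r, a), (False, True, True), q, m_st, T))
                     for i in range(m) for a in range(k))
            # student term: I_3(i, r*, a), summed over student nodes i and a
            t2 = sum(C[j, i] * I3_closed(cov3((i, r, a), (False, True, False), q, m_st, T))
                     for i in range(m) for a in range(m))
            out[j, r] = eta / (k * m) * t1 - eta / m ** 2 * t2
    return out

# one Euler step of size dtau (dq/dtau is advanced analogously with I_4):
# m_st = m_st + dtau * dm_dtau(q, m_st, T, C, eta)
```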

References

  • Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, Cambridge, MA, 2016).
  • Carleo et al. (2019) G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, Rev. Mod. Phys. 91, 045002 (2019).
  • Huang (2022) H. Huang, Statistical Mechanics of Neural Networks (Springer, Singapore, 2022).
  • Roberts et al. (2022) D. A. Roberts, S. Yaida, and B. Hanin, The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks (Cambridge University Press, Cambridge, 2022).
  • Jaderberg et al. (2014) M. Jaderberg, A. Vedaldi, and A. Zisserman, arXiv:1405.3866 (2014).
  • Yang et al. (2020) H. Yang, M. Tang, W. Wen, F. Yan, D. Hu, A. Li, H. Li, and Y. Chen, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2020).
  • Giambagli et al. (2021) L. Giambagli, L. Buffoni, T. Carletti, W. Nocentini, and D. Fanelli, Nature Communications 12, 1330 (2021).
  • Chicchi et al. (2021) L. Chicchi, L. Giambagli, L. Buffoni, T. Carletti, M. Ciavarella, and D. Fanelli, Phys. Rev. E 104, 054312 (2021).
  • Jiang et al. (2021) Z. Jiang, J. Zhou, T. Hou, K. Y. M. Wong, and H. Huang, Phys. Rev. E 104, 064306 (2021).
  • Zhou et al. (2021) J. Zhou, Z. Jiang, T. Hou, Z. Chen, K. Y. M. Wong, and H. Huang, Phys. Rev. E 104, 064307 (2021).
  • Y. LeCun, The MNIST database of handwritten digits, retrieved from http://yann.lecun.com/exdb/mnist.
  • Fischer et al. (2022) K. Fischer, A. René, C. Keup, M. Layer, D. Dahmen, and M. Helias, arXiv:2202.04925 (2022).
  • See the Supplemental Material at http://… for technical and experimental details.
  • Biehl and Schwarze (1995) M. Biehl and H. Schwarze, Journal of Physics A: Mathematical and General 28, 643 (1995).
  • Saad and Solla (1995) D. Saad and S. A. Solla, Phys. Rev. Lett. 74, 4337 (1995).
  • Goldt et al. (2019) S. Goldt, M. S. Advani, A. M. Saxe, F. Krzakala, and L. Zdeborová, arXiv:1901.09085 (2019).
  • Bjoerck and Golub (1973) A. Bjoerck and G. H. Golub, Mathematics of Computation 27, 579 (1973).
  • Schwarze (1993) H. Schwarze, J. Phys. A: Math. Gen. 26, 5781 (1993).