The Curse of Depth in Large Language Models

Wenfang Sun Xinyuan Song Pengxiang Li Lu Yin Yefeng Zheng Shiwei Liu

Abstract

In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling, which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Our experimental results, spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling significantly enhances LLM pre-training performance compared to Pre-LN. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at LayerNorm-Scaling.

Machine Learning, ICML

1 Introduction

Refer to caption — Figure 1: Layerwise output variance. This figure compares the output variance across various layers for different setups: (1) Pre-LN; (2) Pre-LN with Scaled Initialization; and (3) LayerNorm Scaling. The experiments are conducted on the LLaM-130M model trained for 10,000 steps. The proposed LayerNorm Scaling effectively controls the variance across layers.

Recent studies reveal that the deeper layers (Transformer blocks) in modern LLMs tend to be less effective than the earlier ones (Yin et al., 2024; Gromov et al., 2024; Men et al., 2024). On the one hand, this interesting observation provides an effective indicator for LLM compression. For instance, we can compress deeper layers significantly more (Yin et al., 2024; Lu et al., 2024; Dumitru et al., 2024) to achieve high compression ratios. Even more aggressively, entire deep layers can be pruned completely without compromising performance for the sake of more affordable LLMs (Muralidharan et al., 2024; Siddiqui et al., 2024).

On the other hand, having many layers ineffective is undesirable as modern LLMs are extremely resource-intensive to train, often requiring thousands of GPUs trained for multiple months, let alone the labor used for data curation and administration (Achiam et al., 2023; Touvron et al., 2023). Ideally, we want all layers in a model to be well-trained, with sufficient diversity in features from layer to layer, to maximize the utility of resources (Li et al., 2024b). The existence of ill-trained layers suggests that there must be something off with current LLM paradigms. Addressing such limitations is a pressing need for the community to avoid the waste of valuable resources, as new versions of LLMs are usually trained with their previous computing paradigm which results in ineffective layers.

To seek the immediate attention of the community, we introduce the concept of the Curse of Depth (CoD) to systematically present the phenomenon of ineffective deep layers in various LLM families, to identify the underlying reason behind it, and to rectify it by proposing LayerNorm Scaling. We first state the Curse of Depth below.

The Curse of Depth. The Curse of Depth refers to the observed phenomenon where deeper layers in modern large language models (LLMs) contribute significantly less to learning and representation compared to earlier layers. These deeper layers often exhibit remarkable robustness to pruning and perturbations, implying they fail to perform meaningful transformations. This behavior prevents these layers from effectively contributing to training and representation learning, resulting in resource inefficiency.

Empirical Evidence of CoD. The ineffectiveness of deep layers in LLMs has been previously reported. Yin et al. (2024) found that deeper layers of LLMs can tolerate significantly higher levels of pruning compared to shallower layers, achieving high sparsity. Similarly, Gromov et al. (2024) and Men et al. (2024) demonstrated that removing early layers causes a dramatic decline in model performance, whereas removing deep layers does not. Lad et al. (2024) showed that the middle and deep layers of GPT-2 and Pythia exhibit remarkable robustness to perturbations such as layer swapping and layer dropping. Recently, Li et al. (2024a) highlighted that early layers contain more outliers and are therefore more critical for fine-tuning. While these studies effectively highlight the limitations of deep layers in LLMs, they stop short of identifying the root cause of this issue or proposing viable solutions to address it.

To demonstrate that the Curse of Depths is prevalent across popular families of LLMs, we conduct layer pruning experiments on various models, including LLaMA2-7/13B, Mistral-7B, DeepSeek-7B, and Qwen-7B. We measure performance degradation on the Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2021) by pruning entire layers of each model, one at a time, and directly evaluating the resulting pruned models on MMLU without any fine-tuning in Figure 2.

Results: 1). Most LLMs utilizing Pre-LN exhibit remarkable robustness to the removal of deeper layers, whereas BERT with Post-LN shows the opposite trend. 2). The number of layers that can be pruned without significant performance degradation increases with model size.

Identifying the Root Cause of CoD. We theoretically and empirically identify the root cause of CoD as the use of Pre-Layer Normalization (Pre-LN) (Baevski and Auli, 2019; Dai et al., 2019), which normalizes layer inputs before applying the main computations, such as attention or feedforward operations, rather than after. Specifically, while stabilizing training, we observe that the output variance of Pre-LN accumulates significantly with layer depth (see Appendix C), causing the derivatives of deep Pre-LN layers to approach an identity matrix. This behavior prevents these layers from introducing meaningful transformations, leading to diminished representation learning.

Mitigating CoD through LayerNorm Scaling. We propose LayerNorm Scaling, which scales the output of Layer Normalization by the square root of the depth $\frac{1}{\sqrt{l}}$ . LayerNorm Scaling effectively scales down the output variance across layers of Pre-LN, leading to considerably lower training loss and achieving the same loss as Pre-LN using only half tokens. Figure 1 compares the layerwise output variance across different setups: (1) Pre-LN, (2) Pre-LN with Scaled Initialization (Takase et al., 2023b), and (3) LayerNorm Scaling. As shown, Pre-LN exhibits significant variance explosion in deeper layers. In contrast, LayerNorm Scaling effectively reduces output variance across layers, enhancing the contribution of deeper layers during training. This adjustment leads to significantly lower training loss compared to Pre-LN. Unlike previous LayerNorm variants (Li et al., 2024b; Liu et al., 2020), LayerNorm Scaling is simple to implement, requires no hyperparameter tuning, and introduces no additional parameters during training. Furthermore, we further show that the model pre-trained with LayerNorm Scaling achieves better performance on downstream tasks in self-supervised fine-tuning, all thanks to the more effective deep layers learned.

Contributions.

•

We introduce the Curse of Depth to highlight, understand, and rectify the phenomenon in LLMs that is commonly overlooked—deep layers fail to contribute as effectively as they should.
•

We identify the root cause as Pre-LN, which causes output variance to grow exponentially with model depth. This leads to deep Transformer blocks having derivatives close to the identity matrix, rendering them ineffective during training. While scaled initialization (Shoeybi et al., 2020) helps mitigate variance at initialization, it does not prevent explosion during training.
•

To mitigate this issue, we propose LayerNorm Scaling, which inversely scales the output of Pre-LN by the square root of the depth. This adjustment ensures that all layers contribute effectively to learning, thereby improving LLM performance.
•

We hope this work brings greater attention to this issue, contributes to the improvement of LLMs, and maximizes the utilization of computational resources dedicated to training large models.

2 Empirical Evidence of the Curse of Depth

To empirically analyze the impact of layer normalization on the Curse of Depth in LLMs, we conduct a series of evaluations inspired by Li et al. (2024b), extending their methodology to compare Pre-LN and Post-LN models.

2.1 Experimental Setup

Methods: We evaluate Pre-LN and Post-LN models by assessing the impact of layer pruning at different depths. Our hypothesis is that Pre-LN models exhibit diminishing effectiveness in deeper layers, whereas Post-LN has less effective early layers. To verify this, we empirically quantify the contribution of individual layers to overall model performance across a diverse set of LLMs.

LLMs: We conduct experiments on multiple widely adopted LLMs: BERT-Large (Devlin, 2019), Mistral-7B (Jiang et al., 2023), LLaMA2-7B/13B (Touvron et al., 2023), DeepSeek-7B (Bi et al., 2024), and Qwen-7B (Bai et al., 2023). These models were chosen to ensure architectural and application diversity. BERT-Large represents a Post-LN model, whereas the rest are Pre-LN-based. This selection enables a comprehensive evaluation of the effects of layer normalization across varying architectures and model scales.

Evaluation Metric: To empirically assess the impact of deeper layers in LLMs, we adopt the Performance Drop metric $\Delta P^{(\ell)}$ , inspired by Li et al. (2024b). This metric quantifies the contribution of each layer by measuring the degradation in model performance following its removal. Specifically, it is defined as:

\Delta P^{(\ell)}=P_{\text{pruned}}^{(\ell)}-P_{\text{original}},

(1)

where $P_{\text{original}}$ represents the performance of the unpruned model, and $P_{\text{pruned}}^{(\ell)}$ denotes the performance after removing layer $\ell$ . A lower $\Delta P^{(\ell)}$ suggests that the pruned layer plays a minor role in the model’s overall effectiveness. For BERT-Large, we evaluate performance on the SQuAD v1.1 dataset (Rajpurkar, 2016), which measures reading comprehension. For Mistral-7B, LLaMA2-13B, and Qwen-7B, we assess model performance on the MMLU benchmark (Hendrycks et al., 2021), a widely-used dataset for multi-task language understanding.

2.2 Layer Pruning Analysis

Figure 2 presents the performance drop ( $\Delta P^{(\ell)}$ ) across different layers for six LLMs, including one Post-LN model (BERT-Large) and five Pre-LN models (Mistral-7B, LLaMA2-13B, Qwen-7B, DeepSeek-7B and LLaMA2-7B).

As shown in Figure 2 (a), pruning deeper layers in BERT-Large leads to a significant decline in accuracy on SQuAD v1.1, while pruning earlier layers has minimal impact. The performance drop $\Delta P^{(\ell)}$ becomes particularly severe beyond the 10th layer, highlighting the crucial role of deeper layers in maintaining overall performance in Post-LN models. In contrast, removing layers in the first half of the network results in negligible changes, indicating their limited contribution to the final output.

However, as shown in Figure 2 (b)-(f), Pre-LN models exhibit a contrast pattern, where deeper layers contribute significantly less to the overall model performance. For instance, as shown in Figure 2 (b) and (c), pruning layers in the last third of Mistral-7B and Qwen-7B results in a minimal performance drop on MMLU, indicating their limited contribution to overall accuracy. In contrast, pruning the first few layers leads to a substantial accuracy degradation, highlighting their crucial role in feature extraction. Similarly, Figure 2 (d) and (e) show that DeepSeek-7B and LLaMA2-7B follow a similar pattern, where deeper layers have little impact on performance, while earlier layers play a more significant role. Finally, as shown in Figure 2 (f), more than half of the layers in LLaMA2-13B can be safely removed. This suggests that as model size increases, the contrast between shallow and deep layers becomes more pronounced, with earlier layers playing a dominant role in representation learning. This observation underscores the need for the community to address the Curse of Depth to prevent resource waste.

3 Analysis of the Curse of Depth

3.1 Preliminaries

This paper primarily focuses on Pre-LN Transformer (Baevski and Auli, 2019; Dai et al., 2019). Let $x_{\ell}\in\mathbb{R}^{d}$ be the input vector at the $\ell$ -th layer of Transformer, where $d$ denotes the feature dimension of each layer. For simplicity, we assume all layers to have the same dimension $d$ . The layer output $y$ is calculated as follows:

y=x_{\ell+1}=x^{\prime}_{\ell}+\mathrm{FFN}(\mathrm{LN}(x^{\prime}_{\ell})),

(2)

x^{\prime}_{\ell}=x_{\ell}+\mathrm{Attn}(\mathrm{LN}(x_{\ell})),

(3)

where LN denotes the layer normalization function. In addition, the feed-forward network (FFN) and the multi-head self-attention (Attn) sub-layers are defined as follows:

$\displaystyle\mathrm{FFN}(x)$	$\displaystyle=W_{2}\mathcal{F}(W_{1}x),$	(4)
$\displaystyle\mathrm{Attn}(x)$	$\displaystyle=W_{O}(\mathrm{concat}(\mathrm{head}_{1}(x),\dots,\mathrm{head}_{h}(x))),$
$\displaystyle\mathrm{head}_{i}(x)$	$\displaystyle=\mathrm{softmax}\left(\frac{(W_{Qi}x)^{\top}(W_{Ki}X)}{\sqrt{d_{\mathrm{head}}}}\right)(W_{Vi}X)^{\top},$

where $\mathcal{F}$ is an activation function, $\mathrm{concat}$ concatenates input vectors, $\mathrm{softmax}$ applies the softmax function, and $W_{1}\in\mathbb{R}^{d_{\mathrm{ffn}}\times d}$ , $W_{2}\in\mathbb{R}^{d\times d_{\mathrm{ffn}}}$ , $W_{Qi}\in\mathbb{R}^{d_{\mathrm{head}}\times d}$ , $W_{Ki}\in\mathbb{R}^{d_{\mathrm{head}}\times d}$ , $W_{Vi}\in\mathbb{R}^{d_{\mathrm{head}}\times d}$ , and $W_{O}\in\mathbb{R}^{d\times d}$ are parameter matrices, and $d_{\mathrm{FFN}}$ and $d_{\mathrm{head}}$ are the internal dimensions of FFN and multi-head self-attention sub-layers, respectively. $X\in\mathbb{R}^{d\times s}$ , where $s$ is the input sequence length.

The derivatives of Pre-Ln Transformers are:

\frac{\partial\text{Pre-LN}(x)}{\partial x}=I+\frac{\partial f(\text{LN}(x))}{\partial\text{LN}(x)}\frac{\partial\text{LN}(x)}{\partial x},

(5)

where $f$ here represents either the multi-head attention function or the FFN function. If the term $\frac{\partial f(\text{LN}(x))}{\partial\text{LN}(x)}\frac{\partial\text{LN}(x)}{\partial x}$ becomes too small, the Pre-LN layer $\frac{\partial\text{Pre-LN}(x)}{\partial x}$ behaves like an identity map. Our main objective is to prevent identity map behavior for very deep Transformer networks. The first step in this process is to compute the variance $\sigma^{2}_{x_{\ell}}$ of vector $x_{\ell}$ .

3.2 Pre-LN Transformers

Assumption 1.

Let $x_{\ell}$ and $x^{\prime}_{\ell}$ denote the input and intermediate vectors of the $\ell$ -th layer. Moreover, let $W_{\ell}$ denote the model parameter matrix at the $\ell$ -th layer. We assume that, for all layers, $x_{\ell}$ , $x^{\prime}_{\ell}$ , and $W_{\ell}$ follow normal and independent distributions with mean $\mu=0$ .

Lemma 1.

Let $\sigma^{2}_{x^{\prime}_{\ell}}$ and $\sigma^{2}_{x_{\ell}}$ denote the variances of $x^{\prime}_{\ell}$ and $x_{\ell}$ , respectively. These two variances exhibit the same overall growth trend, which is:

\sigma^{2}_{x_{\ell}}=\sigma_{x_{1}}^{2}\Theta\Bigl{(}\prod_{k=1}^{\ell-1}\left(1+\frac{1}{\sigma_{x_{k}}}\right)\Bigr{)},

(6)

where the growth of $\sigma^{2}_{x_{l}}$ is sub-exponential, as shown by the following bounds:

\Theta(L)\leq\sigma^{2}_{x_{l}}\;\leq\;\Theta(\exp(L)).

(7)

Here, the notation $\Theta$ means: if $f(x)\in\Theta\bigl{(}g(x)\bigr{)}$ , then there exist constants $C_{1},C_{2}$ such that $C_{1}\,|g(x)|\leq|f(x)|\leq C_{2}\,|g(x)|$ as $x\to\infty$ . The lower bound $\Theta(L)\leq\sigma^{2}_{x_{\ell}}$ indicates that $\sigma^{2}_{x_{\ell}}$ grows at least linearly, while the upper bound $\sigma^{2}_{x_{\ell}}\leq\Theta(\exp(L))$ implies that its growth does not exceed an exponential function of $L$ .

Based on Assumption 1 and the work of (Takase et al., 2023b), we obtain the following:

Theorem 1.

For a Pre-LN Transformer with $L$ layers, using Equations (2) and (3), the partial derivative $\frac{\partial y_{L}}{\partial x_{1}}$ can be written as:

\frac{\partial y_{L}}{\partial x_{1}}=\prod_{\ell=1}^{L-1}\left(\frac{\partial y_{\ell}}{\partial x^{\prime}_{\ell}}\cdot\frac{\partial x^{\prime}_{\ell}}{\partial x_{\ell}}\right).

(8)

The Euclidean norm of $\frac{\partial y_{L}}{\partial x_{1}}$ is given by:

\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\prod_{l=1}^{L-1}\left(1+\frac{1}{\sigma_{x_{\ell}}}A+\frac{1}{\sigma_{x_{\ell}}^{2}}B\right),

(9)

where $A$ and $B$ are constants for the Transformer network. Then the upper bound for this norm is given as follows: when $\sigma^{2}_{x_{\ell}}$ grows exponentially, (i.e., at its upper bound), we have:

\sigma^{2}_{x_{\ell}}\sim\exp(\ell),\quad\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq M,

(10)

where the gradient norm converges to a constant $M$ . Conversely, when $\sigma^{2}_{x_{\ell}}$ grows linearly (i.e., at its lower bound), we have

\sigma^{2}_{x_{\ell}}\sim\ell,\quad\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\Theta(L),

(11)

which means that the gradient norm grows linearly in $L$ .

The detailed description of $A$ and $B$ , as well as the complete proof, are provided in Appendix A.2.

From Theorem 1, we observe that when the variance grows exponentially, as the number of layers $L\to\infty$ , the norm $\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}$ is bounded above by a fixed constant $M$ . This result implies that even an infinitely deep Transformer remains stable, and by the Weierstrass Theorem, the network is guaranteed to converge. Consequently, this implies that for very large $L$ , deeper layers behave nearly as an identity map from $x_{\ell}$ to $y_{\ell}$ , thereby limiting the model’s expressivity and hindering its ability to learn meaningful transformations.

This outcome is undesirable, therefore, we would instead prefer the variance to increase more gradually—e.g., linearly—so that $\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}$ exhibits linear growth. This observation highlights the necessity of appropriate variance control mechanisms, such as scaling strategies, to prevent excessive identity mappings and enhance network depth utilization.

3.3 Post-LN Transformers

For Post-LN Transformers, we continue to adopt Assumption 1. In this setting, each layer is followed by a layer normalization (LN) step, ensuring that the variances $\sigma^{2}_{x_{\ell}}$ and $\sigma^{2}_{x^{\prime}_{\ell}}$ remain fixed at 1 across all layers. Consequently, the norm $\left\|\frac{\partial y_{\ell}}{\partial x_{\ell}}\right\|_{2}$ exhibits minimal variation from one layer to the next, indicating stable gradient propagation.

Since the variance is effectively controlled by LN in Post-LN Transformers, an explicit variance‐based analysis becomes less critical. Nonetheless, there remain other important aspects to investigate in deeper Post-LN architectures, such as the evolution of feature mappings and the behavior of covariance kernels over deep layers. These directions will be pursued in future work.

4 LayerNorm Scaling

Our theoretical and empirical analyses indicate that Pre-LN amplifies output variance, leading to the Curse of Depth and reducing the effectiveness of deeper layers. To mitigate this issue, we propose LayerNorm Scaling, a simple yet effective normalization strategy. The core idea of LayerNorm Scaling is to control the exponential growth of output variance in Pre-LN by scaling the normalized outputs according to layer depth. Specifically, we apply a scaling factor inversely proportional to the square root of the layer index to scale down the output of LN layers, stabilizing gradient flow and enhancing the contribution of deeper Transformer layers during training. LayerNorm Scaling is illustrated in Figure 3.

Formally, for a Transformer model with $L$ layers, the output of Layer Normalization in each layer $\ell$ is scaled by a factor of $\frac{1}{\sqrt{\ell}}$ . Let $\mathbf{h}^{(\ell)}$ denote the input to Layer Normalization at layer $\ell$ . The modified output is computed as:

\mathbf{h}^{(l)}=\text{LayerNorm}(\mathbf{h}^{(\ell)})\times\frac{1}{\sqrt{\ell}},

(12)

where $\ell\in\{1,2,\dots,L\}$ . This scaling prevents excessive variance growth with depth, addressing a key limitation of Pre-LN. Unlike Mix-LN, which stabilizes gradients in deeper layers but suffers from training instability caused by Post-LN (Nguyen and Salazar, 2019; Wang et al., 2024), LayerNorm Scaling preserves the stability advantages of Pre-LN while enhancing the contribution of deeper layers to representation learning. Applying LayerNorm Scaling leads to a notable reduction of layerwise output variance, resulting in lower training loss and faster convergence than vanilla Pre-LN. Moreover, compared with previous LayerNorm variants (Li et al., 2024b; Liu et al., 2020), LayerNorm Scaling is hyperparameter-free, easy to implement, and does not introduce additional learnable parameters, making it computationally efficient and readily applicable to existing Transformer architectures.

4.1 Theoretical Analysis of LayerNorm Scaling

Lemma 2.

After applying our scaling method, the variances of $x^{\prime}_{\ell}$ and $x_{\ell}$ , denoted as $\sigma^{2}_{x^{\prime}_{\ell}}$ and $\sigma^{2}_{x_{\ell}}$ , respectively, exhibit the same growth trend, which is:

\displaystyle\sigma^{2}_{x_{\ell+1}}=\sigma_{x_{\ell}}^{2}\Theta(1+\frac{1}{\sqrt{\ell}\sigma_{x_{\ell}}}),

(13)

with the following growth rate bounds:

\Theta(L)\leq\sigma^{2}_{x_{L}}\;\leq\;\Theta(L^{(2-\epsilon)}).

(14)

where $\epsilon$ is a small number with $0<\epsilon\leq 1/4$ .

From Lemma 2, we can conclude that our scaling method effectively slows the growth of the variance upper bound, reducing it from exponential to polynomial growth. Specifically, it limits the upper bound to a quadratic rate instead of an exponential one. Based on Theorem 1, after scaling, we obtain the following:

Theorem 2.

For the scaled Pre-LN Transformers, the Euclidean norm of $\frac{\partial y_{L}}{\partial x_{1}}$ is given by:

\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\prod_{\ell=1}^{L-1}\left(1+\frac{1}{\ell\sigma_{x_{\ell}}}A+\frac{1}{\ell^{2}\sigma_{x_{\ell}}^{2}}B\right),

(15)

where $A$ and $B$ are dependent on the scaled neural network parameters. Then the upper bound for the norm is given as follows: when $\sigma^{2}_{x_{\ell}}$ grows at $\ell^{(2-\epsilon)}$ , (i.e., at its upper bound), we obtain:

\sigma^{2}_{x_{\ell}}\sim\ell^{(2-\epsilon)},\quad\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\omega(1),

(16)

where $\omega$ denotes that if $f(x)=\omega(g(x))$ , then $\lim_{x\rightarrow\infty}\frac{f(x)}{g(x)}=\infty$ . Meanwhile, when $\sigma^{2}_{x_{\ell}}$ grows linearly (i.e., at its lower bound), we obtain:

\sigma^{2}_{x_{\ell}}\sim\ell,\quad\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\Theta(L).

(17)

The detailed descriptions of $A$ and $B$ , and $\epsilon$ , along with the full proof, are provided in Appendices A.3 and A.4.

By comparing Theorem 1 (before scaling) with Theorem 17 (after scaling), we observe a substantial reduction in the upper bound of variance. Specifically, it decreases from exponential growth $\Theta(\exp(L))$ to at most quadratic growth $\Theta(L^{2})$ . In fact, this growth is even slower than quadratic expansion, as it follows $\Theta(L^{(2-\epsilon)})$ for some small $\epsilon>0$ .

When we select a reasonable upper bound for this expansion, we find that $\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}$ no longer possesses a strict upper bound. That is, as the depth increases, $\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}$ continues to grow gradually. Consequently, fewer layers act as identity mappings compared to the original Pre-LN where nearly all deep layers collapsed into identity transformations. Instead, the after-scaled network effectively utilizes more layers, even as the depth approaches infinity, leading to improved expressivity and trainability.

5 Experiments

5.1 LLM Pre-training

To evaluate the effectiveness of LayerNorm Scaling, we follow the experimental setup of Li et al. (2024b), using the same model configurations and training conditions to compare it with widely used normalization techniques, including Post-LN (Nguyen and Salazar, 2019), DeepNorm (Wang et al., 2024), and Pre-LN (Dai et al., 2019). In line with Lialin et al. (2023) and Zhao et al. (2024), we conduct experiments using LLaMA-based architectures with model sizes of 130M, 250M, 350M, and 1B parameters, ensuring consistency in architecture and training settings.

Table 1: Perplexity (↓) comparison of various layer normalization methods across various LLaMA sizes.

	LLaMA-130M	LLaMA-250M	LLaMA-350M	LLaMA-1B
Training Tokens	2.2B	3.9B	6.0B	8.9B
Post-LN (Ba, 2016)	26.95	1409.79	1368.33	1390.75
DeepNorm (Wang et al., 2024)	27.17	22.77	1362.59	1409.08
Mix-LN (Li et al., 2024b)	26.07	21.39	1363.21	1414.78
Pre-LN (Baevski and Auli, 2019)	26.73	21.92	19.58	17.02
Pre-LN + LayerNorm Scaling	25.76	20.35	18.20	15.71

The architecture incorporates RMSNorm (Shazeer, 2020) and SwiGLU activations (Zhang and Sennrich, 2019), which are applied consistently across all model sizes and normalization methods. For optimization, we use the Adam optimizer (Kingma, 2015) and adopt size-specific learning rates: $1\times 10^{-3}$ for models up to 350M parameters, and $5\times 10^{-4}$ for the 1B parameter model. All models share the same architecture, hyperparameters, and training schedule, with the only difference being the choice of normalization method. Unlike Mix-LN (Li et al., 2024b), which introduces an additional hyperparameter $\alpha$ manually set to 0.25, LayerNorm Scaling requires no extra hyperparameters, making it simpler to implement. Table 1 shows that LayerNorm Scaling consistently outperforms other normalization methods across different model sizes. While DeepNorm performs comparably to Pre-LN on smaller models, it struggles with larger architectures like LLaMA-1B, showing signs of instability and divergence in loss values. Similarly, Mix-LN outperforms Pre-LN in smaller models but faces convergence issues with LLaMA-350M, indicating its sensitivity to architecture design and hyperparameter tuning due to the introduction of Post-LN. Notably, Mix-LN was originally evaluated on LLaMA-1B with 50,000 steps (Li et al., 2024b), while our setting extends training to 100,000 steps, where Mix-LN fails to converge, highlighting its instability in large-scale settings caused by the usage of Post-LN.

In contrast, LayerNorm Scaling solves the Curse of Depth without compromising the training stability thanks to its simplicity. LayerNorm Scaling achieves the lowest perplexity across all tested model sizes, showing stable performance improvements over existing methods. For instance, on LLaMA-130M and LLaMA-1B, LayerNorm Scaling reduces perplexity by 0.97 and 1.31, respectively, compared to Pre-LN. Notably, LayerNorm Scaling maintains stable training dynamics for LLaMA-1B, a model size where Mix-LN fails to converge. These findings demonstrate that LayerNorm Scaling provides a robust and computationally efficient normalization strategy, enhancing large-scale language model training without additional implementation complexity.

Comparison with Other Layer Normalization. In addition, we conducted comparisons using LLaMA-130M to evaluate LayerNorm Scaling against recently proposed normalization methods, including Admin (Liu et al., 2020), Sandwich-LN (Ding et al., 2021), and Group-LN (Wu and He, 2018; Ma et al., 2024). Table 2 shows that Admin and Group-LN degrade performance. Sandwich-LN slightly outperforms Pre-LN. Both Mix-LN and LayerNorm Scaling improve over Pre-LN by good margins. However, Mix-LN fails to reduce perplexity under 26, falling short of LayerNorm Scaling.

Table 2: Comparison against other normalization methods on LLaMA-130M. Perplexity (↓) is reported.

Pre-LN	Admin	Group-LN	Sandwich-LN	Mix-LN	LayerNorm Scaling
26.73	27.91	28.01	26.51	26.07	25.76

5.2 Supervised Fine-tuning

We believe that LayerNorm Scaling allows deeper layers in LLMs to contribute more effectively during supervised fine-tuning by alleviating gradient vanishing associated with increasing depth. Compared to models trained with Pre-LN, the deeper layers with LayerNorm Scaling maintain stable output variance, preventing uncontrolled growth and ensuring effective feature representation. As a result, deeper layers contribute more effectively to feature transformation, enhancing representation learning and improving generalization on complex downstream tasks.

Table 3: Fine-tuning performance (

\uparrow

) of LLaMA with various normalizations.

Method	MMLU	BoolQ	ARC-e	PIQA	Hellaswag	OBQA	Winogrande	Average
LLaMA-250M
Post-LN (Ba, 2016)	22.95	37.83	26.94	52.72	26.17	11.60	49.56	32.54
DeepNorm (Wang et al., 2024)	23.60	37.86	36.62	61.10	25.69	15.00	49.57	35.63
Mix-LN (Li et al., 2024b)	26.53	56.12	41.68	66.34	30.16	18.00	50.56	41.34
Pre-LN (Baevski and Auli, 2019)	24.93	38.35	40.15	63.55	26.34	16.20	49.01	36.93
Pre-LN + LayerNorm Scaling	27.08	58.17	45.24	67.38	32.81	18.80	52.49	43.14
LLaMA-1B
Post-LN (Ba, 2016)	22.95	37.82	25.08	49.51	25.04	13.80	49.57	31.96
DeepNorm (Wang et al., 2024)	23.35	37.83	27.06	52.94	26.19	11.80	49.49	32.67
Mix-LN (Li et al., 2024b)	23.19	37.83	25.08	49.51	25.04	11.80	49.57	31.72
Pre-LN (Baevski and Auli, 2019)	26.54	62.20	45.70	67.79	30.96	17.40	50.51	43.01
Pre-LN + LayerNorm Scaling	28.69	61.80	48.85	67.92	33.94	18.60	54.30	44.87

To verify this, we follow the fine-tuning methodologies in Li et al. (2024b) and Li et al. (2024a), applying the same optimization settings as pre-training. We fine-tune models from Section 5.1 on the Commonsense170K dataset (Hu et al., (Hu et al., 2023)) across eight downstream tasks. The results, presented in Table 3, demonstrate that LayerNorm Scaling consistently surpasses other normalization techniques in all evaluated datasets. For the LLaMA-250M model, LayerNorm Scaling improves average performance by1.80% and achieves a 3.56% gain on ARC-e compared to Mix-LN. Similar trends are observed with the LLaMA-1B model, where LayerNorm Scaling outperforms Pre-LN, Post-LN, Mix-LN, and DeepNorm on seven out of eight tasks, with an average gain of 1.86% over the best baseline. These results confirm that LayerNorm Scaling, by improving gradient flow and deep-layer representation quality, achieves better fine-tuning performance, demonstrating robustness and enhanced generalization on diverse downstream tasks.

5.3 LayerNorm Scaling Reduces Output Variance

As LayerNorm Scaling aims to reduce output variance, we validate this by comparing it with two scaling approaches: LayerScale (Touvron et al., 2021) and Scaled Initialization (Shoeybi et al., 2020). LayerScale applies per-channel weighting using a diagonal matrix, diag $(\lambda_{1},\ldots,\lambda_{d})$ , where each weight $\lambda_{i}$ is initialized to a small value (e.g., $\lambda_{i}=\epsilon$ ). Unlike LayerNorm Scaling, LayerScale learns the scaling factors automatically, which does not necessarily induce a down-scaling effect. Scaled Initialization scales the initialization of $W_{0}$ and $W_{2}$ to small values by $\frac{1}{\sqrt{2L}}$ where $L$ is the total number of transformer layers. Since scaling is applied only at initialization, we argue that Scaled Initialization may not effectively reduce variance throughout training. We further verify this in Figure 1, where we can see the output variance of Scaled Initialization is as large as Pre-LN. Table 4 presents the results of LLaMA-130M and LLaMA-250M. First, we can see that LayerScale degrades performance. While Scaled Initialization slightly improves over Pre-LN, it falls short of LayerNorm Scaling and the gap becomes larger for the larger model.

Table 4: Comparison against other scaling methods.

Perplexity ( $\downarrow$ )	LLaMA-130M	LLaMA-250M
Training Tokens	2.2B	3.9B
Pre-LN	26.73	21.92
+ LayerScale	27.93	23.45
+ Scaled Initialization	26.04	20.98
+ LayerNorm Scaling	25.76	20.35

5.4 Enhancing Deep Layers with LayerNorm Scaling

To evaluate how LayerNorm Scaling improves deep layer effectiveness, we conduct a layer pruning experiment on LLaMA-130M, systematically removing individual layers and measuring the performance drop ( $\Delta P^{(\ell)}$ ) on the ARC-e benchmark (Clark et al., 2018). Figure 4 compares the pruning effects between standard Pre-LN and LayerNorm Scaling. In the Pre-LN, removing deep layers results in minimal performance degradation, indicating their limited role in representation learning. In contrast, with LayerNorm Scaling, pruning deeper layers leads to a more pronounced drop, suggesting they now play a more active role in learning. While early layers remain critical in both models, the performance degradation in the LayerNorm Scaling is more evenly distributed across layers, reflecting a more balanced learning process. These findings confirm that LayerNorm Scaling mitigates the Curse of Depth by ensuring deeper layers contribute effectively to training.

6 Related Work

Layer Normalization in Language Models. LN (Ba, 2016) was initially applied after the residual connection in the original Transformer (Vaswani, 2017), which is known as Post-LN. Later on, Pre-LN (Baevski and Auli, 2019; Dai et al., 2019; Nguyen and Salazar, 2019) dominated LLMs, due to its compelling performance and stability (Brown et al., 2020; Touvron et al., 2023; Jiang et al., 2023; Bi et al., 2024). Prior works have studied the effect of Pre-LN and Post-LN. Xiong et al. (2020) proves that Post-LN tends to have larger gradients near the output layer, which necessitates smaller learning rates to stabilize training, whereas Pre-LN scales down gradients with the depth of the model, working better for deep Transformers. Wang et al. (2019) empirically confirmed that Pre-LN facilitates stacking more layers and Post-LN suffers from gradient vanishing. The idea of connecting multiple layers was proposed in previous works (Bapna et al., 2018; Dou et al., 2018; Wang et al., 2019). Adaptive Model Initialization (Admin) was introduced to use additional parameters to control residual dependencies, stabilizing Post-LN. DeepNorm (Wang et al., 2024) enables stacking 1000-layer Transformer by upscaling the residual connection before applying LN. Additionally, Ding et al. (2021) proposed Sandwich LayerNorm, normalizing both the input and output of each transformer sub-layer. Takase et al. (2023a) introduced B2T to bypass all LN except the final one in each layer. Li et al. (2024b) recently combines Post-LN and Pre-LN to enhance the middle layers.

7 Conclusion

In this paper, we introduce the concept of the Curse of Depth in LLMs, highlighting an urgent yet often overlooked phenomenon: nearly half of the deep layers in modern LLMs are less effective than expected. We discover the root cause of this phenomenon is Pre-LN which is widely used in almost all modern LLMs. To tackle this issue, we introduce LayerNorm Scaling. By scaling the output variance inversely with the layer depth, LayerNorm Scaling ensures that all layers, including deeper ones, contribute meaningfully to training. Our experiments show that this simple modification improves performance, reduces resource usage, and stabilizes training across various model sizes. LayerNorm Scaling is easy to implement, hyperparameter-free, and provides a robust solution to enhance the efficiency and effectiveness of LLMs.

8 Impact Statement

This paper introduces the Curse of Depth in LLMs to call attention to the AI community an urgent but often overlooked phenomenon that nearly half of layers in modern LLMs are not as effective as we expect. The impact of this phenomenon is large that a significant amount of resources used to train LLMs are somehow wasted. We further introduce LayerNorm Scaling to ensure that all layers contribute meaningfully to model training. The result is a significant improvement in model efficiency, enabling better performance with fewer computational resources and training tokens. This innovation not only enhances LLM effectiveness across a variety of tasks but also reduces the environmental and financial costs of training large-scale models, making LLM development more sustainable and accessible. LayerNorm Scaling presents a simple, hyperparameter-free solution that can be easily adopted, offering immediate practical benefits to the AI research community.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Ba (2016) Jimmy Lei Ba. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Baevski and Auli (2019) Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. ICLR, 2019.
Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
Bapna et al. (2018) Ankur Bapna, Mia Xu Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training deeper neural machine translation models with transparent attention. EMNLP, 2018.
Bi et al. (2024) Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020.
Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. ACL, 2019.
Devlin (2019) Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019.
Ding et al. (2021) Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. NeurIPS, 34:19822–19835, 2021.
Dou et al. (2018) Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Shuming Shi, and Tong Zhang. Exploiting deep representations for neural machine translation. EMNLP, 2018.
Dumitru et al. (2024) Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, and Mihai Surdeanu. Layer-wise quantization: A pragmatic and effective method for quantizing llms beyond integer bit-levels. arXiv preprint arXiv:2406.17415, 2024.
Gromov et al. (2024) Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A Roberts. The unreasonable ineffectiveness of the deeper layers. arXiv preprint arXiv:2403.17887, 2024.
Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021.
Hu et al. (2023) Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Ka-Wei Lee. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. EMNLP, 2023.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Kingma (2015) Diederik P Kingma. Adam: A method for stochastic optimization. ICLR, 2015.
Lad et al. (2024) Vedang Lad, Wes Gurnee, and Max Tegmark. The remarkable robustness of llms: Stages of inference? arXiv preprint arXiv:2406.19384, 2024.
Li et al. (2024a) Pengxiang Li, Lu Yin, Xiaowei Gao, and Shiwei Liu. Owlore: Outlier-weighed layerwise sampled low-rank projection for memory-efficient llm fine-tuning. arXiv preprint arXiv:2405.18380, 2024a.
Li et al. (2024b) Pengxiang Li, Lu Yin, and Shiwei Liu. Mix-ln: Unleashing the power of deeper layers by combining pre-ln and post-ln. arXiv preprint arXiv:2412.13795, 2024b.
Lialin et al. (2023) Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. Relora: High-rank training through low-rank updates. In ICLR, 2023.
Liu et al. (2020) Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. Understanding the difficulty of training transformers. EMNLP, 2020.
Lu et al. (2024) Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W Mahoney, and Yaoqing Yang. Alphapruning: Using heavy-tailed self regularization theory for improved layer-wise pruning of large language models. NeurIPS, 2024.
Ma et al. (2024) Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, and Chunting Zhou. Megalodon: Efficient llm pretraining and inference with unlimited context length. NeurIPS, 2024.
Men et al. (2024) Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. arXiv preprint arXiv:2403.03853, 2024.
Muralidharan et al. (2024) Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Bhuminand Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, and Pavlo Molchanov. Compact language models via pruning and knowledge distillation. In NeurIPS, 2024.
Nguyen and Salazar (2019) Toan Q Nguyen and Julian Salazar. Transformers without tears: Improving the normalization of self-attention. IWSLT, 2019.
Rajpurkar (2016) P Rajpurkar. Squad: 100,000+ questions for machine comprehension of text. EMNLP, 2016.
Shazeer (2020) Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
Shoeybi et al. (2020) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. ICML, 2020.
Siddiqui et al. (2024) Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, and Pavlo Molchanov. A deeper look at depth pruning of llms. ICML, 2024.
Takase et al. (2023a) Sho Takase, Shun Kiyono, Sosuke Kobayashi, and Jun Suzuki. B2t connection: Serving stability and performance in deep transformers. ACL, 2023a.
Takase et al. (2023b) Sho Takase, Shun Kiyono, Sosuke Kobayashi, and Jun Suzuki. Spike no more: Stabilizing the pre-training of large language models. arXiv preprint arXiv:2312.16903, 2023b.
Touvron et al. (2021) Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In ICCV, pages 32–42, 2021.
Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Vaswani (2017) A Vaswani. Attention is all you need. NeurIPS, 2017.
Vershynin (2018) Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
Wang et al. (2024) Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. Deepnet: Scaling transformers to 1,000 layers. TPAMI, 2024.
Wang et al. (2019) Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao. Learning deep transformer models for machine translation. ACL, 2019.
Whittaker and Watson (1996) E. T. Whittaker and G. N. Watson. A Course of Modern Analysis. Cambridge Mathematical Library. Cambridge University Press, 4 edition, 1996.
Wu and He (2018) Yuxin Wu and Kaiming He. Group normalization. In ECCV, pages 3–19, 2018.
Xiong et al. (2020) Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In ICML, pages 10524–10533. PMLR, 2020.
Yin et al. (2024) Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pechenizkiy, Yi Liang, Zhangyang Wang, and Shiwei Liu. Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity. ICML, 2024.
Zhang and Sennrich (2019) Biao Zhang and Rico Sennrich. Root mean square layer normalization. NeurIPS, 32, 2019.
Zhao et al. (2024) Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. Galore: Memory-efficient llm training by gradient low-rank projection. ICML, 2024.

Appendix A Proofs of the Theorems

A.1 Proof of Lemma 1

Proof.

Given Equation (2) from [Takase et al., 2023b], we have:

	$\displaystyle y$	$\displaystyle=x_{\ell+1}=x^{\prime}_{\ell}+\mathrm{FFN}(\mathrm{LN}(x^{\prime}_{\ell})),$		(18)
	$\displaystyle x^{\prime}_{\ell}$	$\displaystyle=x_{\ell}+\mathrm{Attn}(\mathrm{LN}(x_{\ell})).$		(18)

Based on our Assumption 1, let $\mathrm{Var}(\mathrm{Attn}(\mathrm{LN}(x_{\ell})))=\sigma_{\text{Attn}}^{2}$ . Then we can write:

	$\displaystyle\mathrm{Var}(x^{\prime}_{\ell})$	$\displaystyle=\mathrm{Var}(x_{\ell})+\mathrm{Var}(\mathrm{Attn}(\mathrm{LN}(x_{\ell})))+\mathrm{Cov}(\mathrm{Attn}(\mathrm{LN}(x_{\ell})),\mathrm{Var}(x_{\ell}))$		(19)
		$\displaystyle=\sigma_{x_{\ell}}^{2}+\sigma_{\text{Attn}}^{2}+\rho_{1}\cdot\sigma_{x_{\ell}}\cdot\sigma_{\text{Attn}},$		(19)

where $\rho_{1}$ is the correlation factor. Similarly, let $\mathrm{Var}(\mathrm{FFN}(\mathrm{LN}(x^{\prime}_{\ell})))=\sigma_{\mathrm{FFN}}^{2}$ . Then we have:

\sigma^{2}_{x_{\ell+1}}=\sigma_{(}x^{\prime}_{\ell})^{2}+\sigma_{\mathrm{FFN}}^{2}+\rho_{2}\cdot\sigma_{x^{\prime}_{\ell}}\cdot\sigma_{\mathrm{FFN}},

(20)

where $\rho_{2}$ is the correlation factor. Thus, the relationship between $\mathrm{Var}(x_{\ell+1})$ and $\mathrm{Var}(x_{\ell})$ becomes:

\sigma^{2}_{x_{\ell+1}}=\sigma^{2}_{x_{\ell}}+\sigma_{\text{Attn}}^{2}+\sigma_{\mathrm{FFN}}^{2}+\rho_{1}\cdot\sigma_{x_{\ell}}\cdot\sigma_{\text{Attn}}+\rho_{2}\cdot\sigma_{x^{\prime}_{\ell}}\cdot\sigma_{\mathrm{FFN}}.

(21)

A.1.1 Variance of the Attention

The scaled dot-product attention mechanism is defined as:

\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V.

The softmax function outputs a probability distribution over the keys. Let the softmax output be $A=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)$ , where $A$ is a matrix with each row summing to 1. The final attention output is obtained by multiplying the softmax output $A$ with the value matrix $V$ :

\mathrm{Attn}(Q,K,V)=AV.

To simplify the analysis, we make the following additional assumptions: The softmax output $A$ is approximately uniform, meaning each element of $A$ is roughly $1/n$ , where $n$ is the number of keys/values. Given this assumption, the variance of the attention is:

\mathrm{Var}(\mathrm{Attn}(Q,K,V))\sim\mathrm{Var}(AV)=\frac{1}{n}\sum_{i=1}^{n}\mathrm{Var}(V_{i})=\frac{1}{n}\cdot n\cdot\sigma_{V}^{2}=\sigma_{V}^{2}=\sigma_{W}^{2}.

(22)

where $W$ is the universal weight matrix defined as before.

A.1.2 Variance of the Feed-Forward Network

The feed-forward network (FFN) in transformers typically consists of two linear transformations with a ReLU activation in between. The FFN can be written as:

\mathrm{FFN}(x)=W_{2}\cdot\mathrm{ReLU}(W_{1}\cdot x+b_{1})+b_{2}.

(23)

where $W_{1}$ and $W_{2}$ are weight matrices, and $b_{1}$ and $b_{2}$ are bias vectors.

Using the result obtained by Wang et al. [2024], we get:

\sigma_{\mathrm{FFN}}^{2}\sim\sigma_{W_{1}}^{2}\cdot\sigma_{W_{2}}^{2}=\sigma_{W}^{4}.

(24)

In conclusion:

$\displaystyle\sigma^{2}_{x^{\prime}_{\ell}}$	$\displaystyle=\sigma_{x_{\ell}}^{2}+\sigma_{W}^{2}+\rho_{2}\cdot\sigma_{x_{\ell}}\cdot\sigma_{W}$	(25)
	$\displaystyle=\sigma_{x_{\ell}}^{2}(1+\frac{\sigma_{W}}{\sigma_{x_{\ell}}}+\frac{\sigma_{W}^{2}}{\sigma^{2}_{x_{\ell}}})$
	$\displaystyle=\sigma_{x_{\ell}}^{2}\Theta(1+\frac{1}{\sigma_{x_{\ell}}}).$

For simplicity, we set the numerator part to 1. Substitute $\sigma_{x^{\prime}_{\ell}}=\sigma_{x_{\ell}}\sqrt{1+\frac{\sigma_{W}^{2}}{\sigma_{x_{\ell}}^{2}}+\rho_{2}\cdot\frac{\sigma_{W}}{\sigma_{x_{\ell}}}}.$ into Equation (21) we get:

$\displaystyle\sigma^{2}_{x_{\ell+1}}$	$\displaystyle=\sigma^{2}_{x_{\ell}}+\sigma_{W}^{2}+\sigma_{W}^{4}+\rho_{1}\cdot\sigma_{x_{\ell}}\cdot\sigma_{W}+\rho_{2}\cdot\sigma_{x^{\prime}_{\ell}}\cdot\sigma_{W}^{2}$	(26)
	$\displaystyle=\sigma^{2}_{x_{\ell}}+\sigma_{W}^{2}+\sigma_{W}^{4}+\rho_{1}\cdot\sigma_{x_{\ell}}\cdot\sigma_{W}+\rho_{2}\cdot\sigma_{x_{\ell}}\cdot\sigma_{W}^{2}+\frac{\rho_{2}\sigma_{W}^{4}}{2\sigma_{x_{\ell}}}+\frac{\rho_{2}^{2}\sigma_{W}^{3}\sigma_{x_{\ell}}}{2}$
	$\displaystyle=\sigma_{x_{\ell}}^{2}\Theta(1+\frac{1}{\sigma_{x_{\ell}}}).$

From the result we can generally infer that the variance accumulates layer by layer. The variance with regard to $\sigma_{x_{1}}$ :

\sigma^{2}_{x_{\ell}}=\sigma_{x_{1}}^{2}\Theta\Bigl{(}\prod_{k=1}^{\ell-1}\left(1+\frac{1}{\sigma_{x_{k}}}\right)\Bigr{)}.

(27)

We can also obtain a similar result for $\sigma^{2}_{x^{\prime}_{\ell}}$ .

We observe that for any ${\sigma^{2}_{x_{k}}}\geq 1$ , the sequence is increasing, meaning each term in the product is bounded. Consequently, the entire product is bounded above by:

{\sigma^{2}_{x_{\ell}}}\leq{\sigma^{2}_{x_{1}}}\prod_{k=1}^{\ell-1}\Bigl{(}1+\sqrt{\frac{1}{\sigma_{x_{1}}}}\Bigr{)}=\sigma^{2}_{x_{1}}\bigl{(}1+\sqrt{\frac{1}{\sigma_{x_{1}}}}\Bigr{)}^{\ell-1}=\exp{\Theta(L)}.

(28)

Taking the natural logarithm of both sides:

	$\displaystyle\log({\sigma^{2}_{x_{\ell}}})$	$\displaystyle=\log\left(\sigma_{x_{1}}^{2}\prod_{k=1}^{\ell-1}\left(1+\sqrt{\frac{1}{{\sigma^{2}_{x_{k}}}}}\right)\right)=\sum_{k=1}^{\ell-1}\log\left(1+\sqrt{\frac{1}{{\sigma^{2}_{x_{k}}}}}\right)+\log(\sigma_{x_{1}}^{2})$		(29)
		$\displaystyle\geq\sum_{k=1}^{\ell-1}\Bigl{(}\sqrt{\frac{1}{{\sigma^{2}_{x_{k}}}}}-\frac{1}{2}\left(\sqrt{\frac{1}{{\sigma^{2}_{x_{k}}}}}\right)^{2}\Bigr{)}+\log(\sigma_{x_{1}}^{2}).$		(29)

Exponentiating both sides to find the lower bound for $\sigma^{2}_{x_{\ell}}$ , we obtain:

{\sigma^{2}_{x_{\ell}}}\geq\sigma_{x_{1}}^{2}\exp\left(\sum_{k=1}^{\ell-1}\left(\sqrt{\frac{1}{{\sigma^{2}_{x_{k}}}}}-\frac{1}{2{\sigma^{2}_{x_{k}}}}\right)\right).

This provides a tighter lower bound for ${\sigma^{2}_{x_{\ell}}}$ compared to the upper bound of Equation (28). Since we know the upper bound of variance grows exponentially, the lower bound must be sub-exponential. Therefore, for ${\sigma^{2}_{x_{\ell}}}=\ell$ , we must have:

\sigma^{2}_{x_{\ell}}\geq\sigma_{x_{1}}^{2}\exp\left(\sum_{k=1}^{\ell-1}\left(\frac{1}{k}-\frac{1}{2k}\right)\right)=\Theta(\exp(\sqrt{L}))\geq\Theta(L).

Therefore, the increasing lower bound for ${\sigma^{2}_{x_{\ell}}}$ must grows faster than a linear function. So, the increasing of variance is sub-exponential.

∎

A.2 Proof of Theorem 1

In this proof, we will divide the argument into two parts: first, the calculation of the Lemma 31, and second, the analysis of $\frac{\partial y_{\ell}}{\partial x_{1}}$ .

Lemma 3.

For an $L$ -layered Pre-LN Transformer, $\frac{\partial y_{L}}{\partial x_{1}}$ using Equations (2) and (3) is given by:

\frac{\partial y_{L}}{\partial x_{1}}=\prod_{n=1}^{L-1}\left(\frac{\partial y_{\ell}}{\partial x^{\prime}_{\ell}}\cdot\frac{\partial x^{\prime}_{\ell}}{\partial x_{\ell}}\right).

(30)

The upper bound for the norm of $\frac{\partial y_{L}}{\partial x_{1}}$ is:

\displaystyle\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\prod_{l=1}^{L-1}\Big{(}\Big{(}1+\frac{\sigma^{2}}{\sigma_{x^{\prime}_{\ell}}(\sqrt{d}+\sqrt{d_{\mathrm{FFN}}})^{2}}\Big{)}\times\Big{(}1+2dh\left(\sqrt{s}+2+\frac{1}{\sqrt{s}}\right)\frac{\sigma^{2}}{\sigma_{x_{\ell}}}\Big{(}\sigma^{2}d\sqrt{d_{\mathrm{head}}}+\left(1+\sqrt{d_{\mathrm{head}}/d}\right)\Big{)}\Big{)}.

(31)

Here, $h$ denotes the number of heads, $s$ is the sequence length, and $d$ , $d_{\mathrm{FFN}}$ , and $d_{\mathrm{head}}$ are the dimension of the embedding, FFN layer and multi-head attention layer, respectively. The standard deviation of $W_{Q}$ , $W_{K}$ , $W_{V}$ , and $W_{\mathrm{FFN}}$ at layer $\ell$ is $\sigma$ based on Assumption 1.

A.2.1 Proof of Lemma 31

Proof.

Our derivation follows results in [Takase et al., 2023b], specifically Equation (7), which provides an upper bound on the norm of $\frac{\partial y_{\ell}}{\partial x_{1}}$ as:

\displaystyle\left\|\frac{\partial y_{\ell}}{\partial x_{1}}\right\|_{2}=\left\|\prod_{l=1}^{L-1}\frac{\partial y_{\ell}}{\partial x^{\prime}_{\ell}}\frac{\partial x^{\prime}_{\ell}}{\partial x_{\ell}}\right\|_{2}.

(32)

Thus, we can estimate the upper bound of the gradient norm of $\frac{\partial y_{\ell}}{\partial x_{1}}$ by analyzing the spectral norms of the Jacobian matrices for the FFN layer and the self-attention layer, namely,

\text{FFN:}\left\|\frac{\partial y_{\ell}}{\partial x^{\prime}_{\ell}}\right\|_{2}\quad\text{Attention:}\left\|\frac{\partial x^{\prime}_{\ell}}{\partial x_{\ell}}\right\|_{2}.

(33)

We now derive an upper bound of $\|\frac{\partial y_{\ell}}{\partial x^{\prime}_{\ell}}\|_{2}$ as follows:

\left\|\frac{\partial y_{\ell}}{\partial x^{\prime}_{\ell}}\right\|_{2}\leq 1+\left\|\frac{\partial\mathrm{FFN}(\mathrm{LN}(x^{\prime}_{\ell}))}{\partial\mathrm{LN}(x^{\prime}_{\ell})}\right\|_{2}\left\|\frac{\partial\mathrm{LN}(x^{\prime}_{\ell})}{\partial x^{\prime}_{\ell}}\right\|_{2}.

(34)

Let $\sigma_{w1_{\ell}}$ and $\sigma_{w2_{\ell}}$ be the standard deviations of $W^{1}_{\ell}$ and $W^{2}_{\ell}$ , respectively. From Assumption 1, the spectral norms of $W^{1}_{\ell}$ and $W^{2}_{\ell}$ are given by their standard deviations and dimensions [Vershynin, 2018], so wo have:

\|W_{1}\|_{2}\sim\sigma_{1}\sqrt{d+\sqrt{d_{\mathrm{FFN}}}}.

For simplicity, we assume that $d$ , and $d_{\mathrm{FFN}}$ are equal, thus,

\left\|\frac{\partial\mathrm{FFN}(\mathrm{LN}(x^{\prime}_{\ell}))}{\partial\mathrm{LN}(x^{\prime}_{\ell})}\right\|_{2}=\|W^{1}_{\ell}W^{2}_{\ell}\|_{2}\leq\sigma_{1}\sigma_{2}(\sqrt{d}+\sqrt{d_{\mathrm{ffn}}})^{2}.

(35)

Finally, we have the following bound:

\left\|\frac{\partial y_{\ell}}{\partial x^{\prime}_{\ell}}\right\|_{2}\leq 1+\frac{\sigma_{w1_{\ell}}\sigma_{w2_{\ell}}}{\sigma_{x^{\prime}_{\ell}}(\sqrt{d}+\sqrt{d_{\mathrm{FFN}}})^{2}}=1+\frac{\sigma^{2}_{\ell}}{\sigma_{x^{\prime}_{\ell}}(\sqrt{d}+\sqrt{d_{\mathrm{FFN}}})^{2}}.

(36)

Following a similar procedure for the FFN, we rewrite $\|\frac{\partial x^{\prime}}{\partial x}\|_{2}$ in Equation (33) as:

\left\|\frac{\partial x^{\prime}}{\partial x}\right\|_{2}\leq 1+\left\|\frac{\partial\text{Attn}(\mathrm{LN}(x))}{\partial\mathrm{LN}(x)}\right\|_{2}\left\|\frac{\partial\mathrm{LN}(x)}{\partial x}\right\|_{2}.

(37)

Let $Z(\cdot)=\mathrm{concat}(\mathrm{head}_{1}(\cdot),\dots,\mathrm{head}_{h}(\cdot))$ and $J^{Z}$ denote the Jacobian of the $Z(\cdot)$ . We can now express the spectral norm of the Jacobian matrix of attntion as:

\displaystyle\left\|\frac{\partial\text{Attn}(\mathrm{LN}(x_{\ell}))}{\partial\mathrm{LN}(x_{\ell})}\right\|_{2}=\left\|W^{O}_{\ell}Z(\mathrm{LN}(x_{\ell}))\frac{\partial Z(\mathrm{LN}(x_{\ell}))}{\partial\mathrm{LN}(x_{\ell})}\right\|_{2}=\|W^{O}_{\ell}J^{Z}_{\ell}\|_{2}.

(38)

From [Vershynin, 2018], we know that:

\displaystyle\|J^{Z}_{\ell}\|_{2}\leq h\Big{(}\left(\sqrt{s}+2+\frac{1}{\sqrt{s}}\right)\sigma^{3}\sqrt{d^{3}d_{\mathrm{head}}}+\sigma^{\ell}_{x}\left(\sqrt{d}+\sqrt{d_{\mathrm{head}}}\right)\Big{)}.

(39)

Here $h$ is the number of heads, $s$ is the sequence length, and the standard deviation of $W_{Q}$ , $W_{K}$ , and $W_{V}$ is $\sigma$ .

By combining the inequalities (36), (39) and (37), and assuming that all $\sigma$ values are the same for simplicity. we obtain:

\displaystyle\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\prod_{l=1}^{L-1}\Big{(}\Big{(}1+\frac{\sigma^{2}}{\sigma_{x^{\prime}_{\ell}}(\sqrt{d}+\sqrt{d_{\mathrm{FFN}}})^{2}}\Big{)}\times\Big{(}1+2dh\left(\sqrt{s}+2+\frac{1}{\sqrt{s}}\right)\frac{\sigma^{2}}{\sigma_{x_{\ell}}}\Big{(}\sigma^{2}d\sqrt{d_{\mathrm{head}}}+\left(1+\sqrt{d_{\mathrm{head}}/d}\right)\Big{)}\Big{)}.

(40)

∎

A.2.2 Analysis of the upper bound

As discussed in [Takase et al., 2023b], $\sigma$ should be sufficiently small, and the standard deviation, $\sigma_{x^{\prime}_{\ell}}$ or $\sigma_{x_{\ell}}$ should satisfy the condition $\sigma^{2}\ll\sigma_{x^{\prime}_{\ell}}$ to maintain the lazy training scheme. Thus, we obtain the following bound for the product over $\ell$ from 1 to $L$ :

To find the bound for $\left\|\frac{\partial y_{\ell}}{\partial x_{1}}\right\|_{2}$ with respect to $\ell$ , we simplify the given inequality by approximating $\sigma_{x_{\ell}}$ and $\sigma_{x^{\prime}_{\ell}}$ . Based on Equation (25), $\sigma_{x_{\ell}}$ is only one layer ahead of $\sigma_{x^{\prime}_{\ell}}$ , and this layer does not significantly affect the overall performance of deep Transformer networks. Furthermore, based on Lemma 7, we assume that $\sigma_{x^{\prime}_{\ell}}=\sigma_{x_{\ell}}$ .

Equation (31) can be expressed in a traditional product form [Whittaker and Watson, 1996] for $\sigma_{x_{\ell}}$ :

\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\prod_{l=1}^{L-1}\left(1+\frac{1}{\sigma_{x_{\ell}}}A+\frac{1}{\sigma_{x_{\ell}}^{2}}B\right),

(41)

where

A=\frac{\sigma^{2}}{(\sqrt{d}+\sqrt{d_{\mathrm{FFN}}})^{2}}+2dh\left(\sqrt{s}+2+\frac{1}{\sqrt{s}}\right)\sigma^{2}\left(d\sqrt{d_{\mathrm{head}}}+1+\sqrt{d_{\mathrm{head}}/d}\right),

(42)

and

B=2dh\left(\sqrt{s}+2+\frac{1}{\sqrt{s}}\right)\sigma^{4}d\sqrt{d_{\mathrm{head}}},

(43)

where $A$ and $B$ are independent of $\sigma_{x_{\ell}}$ , and under our assumption, are treated as constants.

From classical infinite series analysis, it is known that as $\sigma_{x_{\ell}}$ grows at a faster rate, the upper bound of the product decreases. The proof is omitted here for brevity. For the upper bound on the convergence rate of $\sigma^{2}_{x_{\ell}}$ , we assume $\sigma^{2}_{x_{\ell}}=\exp(\ell)$ without loss of generality. Under this condition, we can derive the following result:

Taking the natural logarithm of the product:

\log\left(\prod_{k=1}^{L-1}\left(1+\frac{A}{e^{k}}+\frac{B}{e^{2k}}\right)\right)=\sum_{k=1}^{L-1}\log\left(1+\frac{A}{e^{k}}+\frac{B}{e^{2k}}\right).

Using the Taylor series expansion for $\log(1+x)$ , and applying this to our sum, we get:

\sum_{k=1}^{\infty}\log\left(1+\frac{A}{e^{k}}+\frac{B}{e^{2k}}\right)=\sum_{k=1}^{\infty}\left(\frac{A}{e^{k}}+\frac{B}{e^{2k}}-\frac{1}{2}\left(\frac{A}{e^{k}}+\frac{B}{e^{2k}}\right)^{2}+\frac{1}{3}\left(\frac{A}{e^{k}}+\frac{B}{e^{2k}}\right)^{3}-\cdots\right).

By evaluating the sums for each order of terms, we find that the result is a constant. Carrying this out for each term, we obtain:

\log\left(\prod_{k=1}^{L-1}\left(1+\frac{A}{e^{k}}+\frac{B}{e^{2k}}\right)\right)\sim\frac{A}{e-1}+\frac{B}{e^{2}-1}-\frac{1}{2}\left(\frac{A^{2}}{e^{2}-1}+2\frac{A\cdot B}{e^{3}-1}+\frac{B^{2}}{e^{4}-1}\right).

Thus, the product is approximately:

\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\exp\left(\frac{A}{e-1}+\frac{B}{e^{2}-1}-\frac{1}{2}\left(\frac{A^{2}}{e^{2}-1}+2\frac{A\cdot B}{e^{3}-1}+\frac{B^{2}}{e^{4}-1}\right)\right)=M,

(44)

where $M$ is a constant.

For the lower bound on the convergence rate of $\sigma^{2}_{x_{\ell}}$ , we assume $\sigma^{2}_{x_{\ell}}=\ell$ without loss of generality. Under this condition, we derive the following result. Taking the logarithm of the product, applying the Taylor series expansion for $\log(1+x)$ , and applying this to our sum:

\sum_{k=1}^{\infty}\log\left(1+\frac{A}{k}+\frac{B}{e^{k^{2}}}\right)=\sum_{k=1}^{\infty}\left(\frac{A}{k}+\frac{B}{e^{k^{2}}}-\frac{1}{2}\left(\frac{A}{k}+\frac{B}{e^{k^{2}}}\right)^{2}+\frac{1}{3}\left(\frac{A}{k}+\frac{B}{e^{k^{2}}}\right)^{3}-\cdots\right).

For the first-order terms:

\sum_{k=1}^{\infty}\left(\frac{A}{k}+\frac{B}{e^{k^{2}}}\right)=A\sum_{k=1}^{\infty}\frac{1}{k}+B\sum_{k=1}^{\infty}\frac{1}{e^{k^{2}}}.

The series $\sum_{k=1}^{\infty}\frac{1}{k}$ is the harmonic series, which diverges. However, we approximate it using the Euler-Mascheroni constant $\gamma$ and the fact recognize that the harmonic series grows logarithmically:

\sum_{k=1}^{\infty}\frac{1}{k}\sim\log n+\gamma\quad\text{(for large }n\text{)}.

The other series such as $\sum_{k=1}^{\infty}\frac{1}{e^{k^{2}}}$ converge because $e^{k^{2}}$ grows very rapidly.

For higher-order terms, they converge to constant, involving the series $\sum_{k=1}^{\infty}\frac{1}{k^{2}}$ converges to $\frac{\pi^{2}}{6}$ , so they contribute a constant. Exponentiating both sides, we get:

\prod_{k=1}^{\infty}\left(1+\frac{A}{k}+\frac{B}{e^{k^{2}}}\right)\sim\exp\left(A(\log n+\gamma)+const\right).

Thus, the growth rate of the upper bound for $\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}$ is:

\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\Theta(L).

(45)

A.3 Proof of Lemma 2

Proof.

After scaling, the equation becomes:

	$\displaystyle y$	$\displaystyle=x_{\ell+1}=x^{\prime}_{\ell}+\mathrm{FFN}(\frac{1}{\sqrt{\ell}}\mathrm{LN}(x^{\prime}_{\ell})),$		(46)
	$\displaystyle x^{\prime}_{\ell}$	$\displaystyle=x_{\ell}+\mathrm{Attn}(\frac{1}{\sqrt{\ell}}\mathrm{LN}(x_{\ell})).$		(46)

Folloing the same analysis as before, we scale the Attention and FFN sub-layers, yielding:

\sigma_{\mathrm{Attn}}^{2}=\frac{1}{n\ell}\cdot n\cdot\sigma_{V}^{2}=\frac{1}{\ell}\sigma_{V}^{2}=\frac{\sigma_{W}^{2}}{\ell},\quad\sigma_{\mathrm{FFN}}^{2}\sim\frac{\sigma_{W_{1}}^{2}}{\ell}\cdot\frac{\sigma_{W_{2}}^{2}}{\ell}=\frac{\sigma_{W}^{4}}{\ell^{2}}.

(47)

In conclusion:

\displaystyle\sigma^{2}_{x^{\prime}_{\ell}}=\sigma_{x_{\ell}}^{2}+\sigma_{W}^{2}+\rho_{2}\cdot\sigma_{x_{\ell}}\cdot\frac{\sigma_{W}}{\sqrt{\ell}}=\sigma_{x_{\ell}}^{2}\Theta(1+\frac{1}{\sqrt{\ell}\sigma_{x_{\ell}}}).

(48)

Similarly, we obtain:

\displaystyle\sigma^{2}_{x_{\ell+1}}=\sigma_{x_{\ell}}^{2}\Theta(1+\frac{1}{\sqrt{\ell}\sigma_{x_{\ell}}}).

(49)

Taking the natural logarithm of both sides:

	$\displaystyle\log({\sigma^{2}_{x_{\ell}}})$	$\displaystyle=\log\left(\sigma_{x_{1}}^{2}\prod_{k=1}^{\ell-1}\left(1+\sqrt{\frac{1}{{\ell\sigma^{2}_{x_{k}}}}}\right)\right)=\sum_{k=1}^{\ell-1}\log\left(1+\sqrt{\frac{1}{{\ell\sigma^{2}_{x_{k}}}}}\right)+\log(\sigma_{x_{1}}^{2})$		(50)
		$\displaystyle\geq\sum_{k=1}^{\ell-1}\Bigl{(}\sqrt{\frac{1}{{\ell\sigma^{2}_{x_{k}}}}}-\frac{1}{2}\left(\sqrt{\frac{1}{{\ell\sigma^{2}_{x_{k}}}}}\right)^{2}\Bigr{)}+\log(\sigma_{x_{1}}^{2}).$		(50)

To establish a lower bound for ${\sigma^{2}_{x_{\ell}}}$ , we exponentiate both sides. Setting ${\sigma^{2}_{x_{\ell}}}=\ell$ , we must have:

\sigma^{2}_{x_{\ell}}\geq\sigma_{x_{1}}^{2}\exp\left(\sum_{k=1}^{\ell-1}\left(\frac{1}{k}-\frac{1}{2k}\right)\right)=\Theta(\exp(\log{L}))\geq\Theta(L).

(51)

Therefore, the increasing lower bound ${\sigma^{2}_{x_{\ell}}}$ is greater than a linear function.

Similarly, assuming ${\sigma^{2}_{x_{\ell}}}=\ell^{(2-\epsilon)}$ , we have:

\sigma^{2}_{x_{\ell}}=\sigma_{x_{1}}^{2}\prod_{k=1}^{\ell-1}\left(1+\frac{1}{\ell^{2-\epsilon/2}}\right)\sim\exp\left(\sum_{k=1}^{\ell-1}\frac{1}{k^{2-\epsilon/2}}\right)\sim\exp\left(\frac{\ell^{\epsilon/2-1}-1}{\epsilon/2-1}\right)\leq\Theta(\ell^{(2-\epsilon)})\leq\Theta(\ell^{2}).

(52)

Here $\epsilon$ is a small constant with $0<\epsilon\leq 1/4$ . Therefore, the increasing upper bound of ${\sigma^{2}_{x_{\ell}}}$ is slower than the $\ell^{3}$ function, leading to:

{\sigma^{2}_{x_{\ell}}}\leq\Theta(L^{2})

∎

A.4 Proof of Theorem 17

Proof.

Similarly, after applying the scaling transformation, we derive an upper bound for $\|\frac{\partial y_{\ell}}{\partial x^{\prime}_{\ell}}\|_{2}$ as follows:

	$\displaystyle\left\\|\frac{\partial y_{\ell}}{\partial x^{\prime}_{\ell}}\right\\|_{2}$	$\displaystyle\leq 1+\left\\|\frac{\partial\mathrm{FFN}(\mathrm{LN}(x^{\prime}_{\ell}))}{\partial\mathrm{LN}(x^{\prime}_{\ell})}\right\\|_{2}\left\\|\frac{1}{\sqrt{\ell}}\right\\|_{2}\left\\|\frac{\partial\mathrm{LN}(x^{\prime}_{\ell})}{\partial x^{\prime}_{\ell}}\right\\|_{2}$		(53)
		$\displaystyle=1+\frac{\sigma^{2}_{\ell}}{\ell\sigma_{x^{\prime}_{\ell}}(\sqrt{d}+\sqrt{d_{\mathrm{FFN}}})^{2}}.$		(53)

Similarly, rewriting Equation (33) after scaling, we have

\left\|\frac{\partial x^{\prime}}{\partial x}\right\|_{2}\leq 1+\left\|\frac{\partial\text{Attn}(\mathrm{LN}(x))}{\partial\mathrm{LN}(x)}\right\|_{2}\left\|\frac{1}{\sqrt{\ell}}\right\|_{2}\left\|\frac{\partial\mathrm{LN}(x)}{\partial x}\right\|_{2}.

(54)

By combining the bound (53), and inequality (54), and assuming all $\sigma$ are equal for simplicity, we obtain:

\displaystyle\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\prod_{l=1}^{L-1}\Big{(}\Big{(}1+\frac{\sigma^{2}}{\ell\sigma_{x^{\prime}_{\ell}}(\sqrt{d}+\sqrt{d_{\mathrm{FFN}}})^{2}}\Big{)}\times\Big{(}1+2dh\left(\sqrt{s}+2+\frac{1}{\sqrt{s}}\right)\frac{\sigma^{2}}{\ell\sigma_{x_{\ell}}}\Big{(}\sigma^{2}d\sqrt{d_{\mathrm{head}}}+\left(1+\sqrt{d_{\mathrm{head}}/d}\right)\Big{)}\Big{)}.

(55)

Equation (55) is a traditional product form [Whittaker and Watson, 1996] for $\sigma_{x_{\ell}}$ . After scaling, it becomes:

\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\prod_{l=1}^{L-1}\left(1+\frac{1}{\ell\sigma_{x_{\ell}}}A+\frac{1}{\ell^{2}\sigma_{x_{\ell}}^{2}}B\right),

(56)

where $A$ and $B$ retain their forms from Equation (42) and Equation (43) and are treated as constants.

Regarding the upper bound on the convergence rate of $\sigma^{2}_{x_{\ell}}$ , we assume $\sigma^{2}_{x_{\ell}}=\ell^{(2-\epsilon)}$ without loss of generality. For large $L$ , the product can be approximated using the properties of infinite products:

\prod_{\ell=1}^{L-1}\left(1+\frac{A}{\ell^{2-\epsilon/2}}+\frac{B}{\ell^{4-\epsilon}}\right)\sim\exp\left(\sum_{\ell=1}^{L-1}\left(\frac{A}{\ell^{2-\epsilon/2}}+\frac{B}{\ell^{4-\epsilon}}\right)\right).

(57)

Then, by evaluating the sum in the exponent, we obtain:

\prod_{\ell=1}^{L-1}\left(1+\frac{A}{\ell^{2-\epsilon/2}}+\frac{B}{\ell^{4-\epsilon}}\right)\sim\exp\left(A\cdot\frac{\ell^{\epsilon/2-1}-1}{\epsilon/2-1}+B\cdot\frac{\ell^{\epsilon-3}-1}{\epsilon-3}\right).

(58)

Therefore, we establish the upper bound:

\left\|\frac{\partial y_{L}}{\partial x_{1}}\right\|_{2}\leq\Theta\left(\exp\left(A\cdot\frac{\ell^{\epsilon/2-1}-1}{\epsilon/2-1}+B\cdot\frac{\ell^{\epsilon-3}-1}{\epsilon-3}\right)\right)=\omega(1),

(59)

where $\omega(1)$ denotes a growth strictly greater than a constant as defined before. ∎

Appendix B Training Loss Curve

We report the training loss curve of Pre-LN and LayerNorm Scaling in Figure 5.

Appendix C Variance Growth in Pre-LN Training

To analyze the impact of Pre-LN on variance propagation, we track the variance of layer outputs across different depths during training.

Figure 6 illustrates the layer-wise variance in LLaMA-130M with Pre-LN at 1000, 3000, and 6000 epochs. Across all stages, variance remains low in shallow layers but grows exponentially in deeper layers, confirming that this issue persists throughout training rather than being a temporary effect. This highlights the necessity of stabilization techniques like LayerNorm Scaling to control variance and ensure effective deep-layer learning.