
DCIS: Efficient Length Extrapolation of LLMs via Divide-and-Conquer Scaling Factor Search


Lei Yang Shaoyang Xu  Deyi Xiong
Tianjin University
{yanglei_9, syxu, dyxiong}@tju.edu.cn
Abstract

Large language models (LLMs) based on the Transformer architecture usually have their context length limited due to the high training cost. Recent advancements extend the context window by adjusting the scaling factors of RoPE and fine-tuning. However, suboptimal initialization of these factors results in increased fine-tuning costs and reduced performance at the target length. To address these challenges, we propose an innovative RoPE-based fine-tuning framework that diverges from conventional scaling factor search. Specifically, we present a Divide-and-Conquer Incremental Search (DCIS) algorithm that strategically determines better scaling factors. Further fine-tuning with the identified scaling factors effectively extends the context window of LLMs. Empirical results demonstrate that our methodology not only mitigates performance decay at extended target lengths but also allows the model to fine-tune on short contexts and generalize to long contexts, thereby reducing the cost of fine-tuning. The scaling factors obtained through DCIS can even perform effectively without fine-tuning. Further analysis of the search space reveals that DCIS achieves twice the search efficiency of other methods. We also examine the impact of the non-strictly increasing scaling factors utilized in DCIS and evaluate the general capabilities of LLMs across various context lengths.



Figure 1: The Llama2-7B model extended to a 64k context window. We test PPL using 10 Proof-pile samples with a minimum length of 128k tokens. Fine-tuning is performed using the default method described in Section 5. "-64k/16k" indicates that fine-tuning on a 64k/16k length generalizes to a target length of 64k.

1 Introduction

Transformer Vaswani et al. (2017) has emerged as the preferred architecture for large language models due to its scaling capability Brown et al. (2020); Achiam et al. (2023). However, the inherent quadratic complexity of self-attention necessitates limiting the context window during pre-training, exemplified by the 4096-token limit in Llama2 Touvron et al. (2023). When models encounter sequences beyond this limit during inference, a significant loss in performance occurs Press et al. (2021). To extend the operational context window of LLMs, previous studies have explored a wide variety of methods, such as sequence truncation Rae et al. (2019); Dai et al. (2019); Wu et al. (2022) and sparse attention Han et al. (2023); Ding et al. (2023); Xiao et al. (2023), though these methods often result in the loss of critical contextual information.

Recent advances in positional encoding techniques have facilitated length generalization capabilities in LLMs, evolving from the initial sinusoidal positional encodings of Transformers to learnable and relative positional encodings Gehring et al. (2017); Shaw et al. (2018). The introduction of Rotary Position Embedding (RoPE) Su et al. (2024) has catalyzed a new wave of research that aims to increase the extrapolation length of LLMs by modifying the rotation frequency of the embedding dimensions through scaling factors Peng et al. (2023), complemented by simultaneous fine-tuning to sustain long-context performance.

However, direct fine-tuning on target lengths Chen et al. (2023b); Peng et al. (2023) incurs substantial computational costs. While fine-tuning on shorter contexts to generalize to target lengths Chen et al. (2023a) is more efficient, it suffers from significant performance degradation at the target lengths. As shown in Figures 1 and 3, these existing methods fail to achieve satisfactory results in perplexity (PPL) and passkey performance metrics at target lengths.

To address these challenges, we explore better scaling factors to unlock the potential of LLMs for length generalization. To this end, we propose a novel RoPE-based fine-tuning framework that departs from traditional approaches of linearly increasing scaling factors Peng et al. (2023); Ding et al. (2024). Instead, our framework introduces a Divide-and-Conquer Incremental Search (DCIS) algorithm, leveraging a principle of refinement from broad to specific, to efficiently determine better scaling factors through iterative target-length inference. The non-strictly increasing nature of DCIS expands the search space considerably. The scaling factors identified via DCIS are then utilized for fine-tuning, significantly extending the model's context window to the target length.

We conduct a comprehensive evaluation of DCIS on Llama2-7B and Mistral-7B-v0.1 using PPL and Passkey. Experimental results show that DCIS effectively mitigates the performance degradation at target lengths that is commonly observed in baseline methods. Moreover, our approach enables the model to be fine-tuned on short contexts (such as 4k/16k) and generalize to long contexts (64k), thereby reducing fine-tuning costs and memory requirements (Section 5.2). Additionally, the results in Section 5.3 demonstrate that DCIS is effective due to the improved initialization of scaling factors, which enhances performance even without fine-tuning. We also conduct a comprehensive analysis of DCIS. Specifically, in Section 6.1, we analyze the search space of DCIS and demonstrate its superior efficiency. In Section 6.2, we investigate the impact of non-strictly increasing scaling factors, namely Adaptive Scaling Factors (ASF). We find that ASF leads to stable and efficient performance across various sequence lengths. Finally, in Section 6.3, we evaluate the general ability of our model on LongBench and the Open LLM Leaderboard under different context lengths.

Our contributions can be summarized as follows:

  • We propose a novel framework involving a Divide-and-Conquer Incremental Search (DCIS) algorithm for scaling factor search, followed by fine-tuning to extend the context window of LLMs.

  • Extensive experiments demonstrate that DCIS overcomes the challenge of performance degradation at target lengths and exhibits strong generalization ability, leading to further reductions in fine-tuning costs.

  • We conduct an in-depth analysis of DCIS, encompassing the impact of scaling factor initialization, the search space of DCIS, the role of Adaptive Scaling Factors (ASF), and the effect of DCIS on general ability, demonstrating the method’s effectiveness, efficiency, and robustness.

2 Related Work

Positional Embedding Scaling. Recent advancements in positional embedding scaling, particularly involving Rotary Positional Embedding (RoPE), have significantly improved the capability of LLMs to manage extended context windows. Notable methods such as PI Chen et al. (2023b) and YaRN Peng et al. (2023) manually adjust scaling factors, whereas CLEX Chen et al. (2023a) optimizes rotation frequencies through training. LongRoPE Ding et al. (2024) employs a search mechanism to fine-tune scaling factors, enhancing the model's extrapolation abilities. While these approaches demonstrate improved performance through better use of scaling factors, they are still hindered by high fine-tuning costs and notable performance declines at target lengths, issues that our proposed method addresses more efficiently.

Sequence Compression. Models leveraging RoPE have shown intrinsic capabilities for length extrapolation even without fine-tuning. Approaches like ReRoPE Su (2023) and Self-Extend Jin et al. (2024) extend sequence lengths by compressing sequence indices, although they necessitate double attention computations, raising computational demands and limiting extrapolation potential. In contrast, our approach facilitates unrestricted extension of the model's context length through the standard inference process, eliminating the need for repeated attention mechanisms.

Chunking/Sparse Attention. Investigative efforts have revealed that models predominantly focus on information at the sequence extremes Liu et al. (2023), suggesting that removing mid-sequence data while preserving initial and proximate tokens minimally impacts overall information integrity Han et al. (2023); Xiao et al. (2023). Alternative strategies involve segmenting sequences into chunks corresponding to pre-training lengths and employing external memory modules for storing and recalling past contexts during current chunk inference Rae et al. (2019); Dai et al. (2019); Wu et al. (2022). While these methods enable some degree of length extension, they inherently sacrifice a portion of the contextual data. Our method, however, maintains the integrity of the entire text, thereby maximizing the utility of the available context information.

3 Preliminary

This section elucidates the principles underpinning our approach and delineates the problem concerning positional embedding scaling, with a focus on enhancing the Rotary Positional Embedding (RoPE) Su et al. (2024) technique widely adopted in LLMs.

3.1 Rotary Position Embedding

RoPE has garnered significant attention in the realm of LLMs due to its robust performance and superior extrapolation effectiveness. Consider a sequence of embedding vectors $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_L \in \mathbb{R}^d$, where $L$ represents the length of the input sequence and $d$ denotes the dimensionality of the hidden states for each head. Let $m$ indicate the position index. RoPE incorporates positional information into the embedding vectors through a rotational transformation defined as follows:

f_{\{Q,K\}}(\mathbf{x}_m, m, \mathbf{\Theta}) = \mathbf{R}^d_{\mathbf{\Theta},m} \mathbf{W}_{\{Q,K\}} \mathbf{x}_m,   (1)

where $\mathbf{W}_Q$ and $\mathbf{W}_K$ denote the weight matrices of the Transformer. $\mathbf{R}^d_{\mathbf{\Theta},m}$ is the rotation matrix parameterized by $\mathbf{\Theta} = \{\theta_i = 10000^{-2(i-1)/d},\ i \in [1, 2, \dots, d/2]\}$:

\mathbf{R}^d_{\mathbf{\Theta},m} = \left( \begin{bmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{bmatrix} \right)_{i = 1, 2, \dots, \lfloor d/2 \rfloor},   (2)

i.e., a block-diagonal matrix of $2 \times 2$ rotations.

This transformation ensures that the relative positional information $|m-n|$ is implicitly encoded in the attention scores:

\mathbf{Q}_m^T \mathbf{K}_n = (\mathbf{R}^d_{\mathbf{\Theta},m}\mathbf{W}_Q\mathbf{x}_m)^T (\mathbf{R}^d_{\mathbf{\Theta},n}\mathbf{W}_K\mathbf{x}_n) = \mathbf{x}_m^T \mathbf{W}_Q^T \mathbf{R}^d_{\mathbf{\Theta},n-m} \mathbf{W}_K \mathbf{x}_n.   (3)
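
To make this concrete, the following NumPy sketch (our own illustrative code, not from the original paper) applies the rotation of Eq. 2 to toy query/key vectors and checks the relative-position property of Eq. 3:

```python
import numpy as np

def rope_rotate(x, m, d, base=10000.0):
    # theta_i = base^{-2i/d} for i = 0, ..., d/2 - 1 (0-indexed form of Eq. 2)
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angles = m * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]          # consecutive pairs form each 2-D block
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin    # 2x2 rotation applied per block
    out[1::2] = x1 * sin + x2 * cos
    return out

d = 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

# Attention scores depend only on the relative offset n - m (Eq. 3):
s1 = rope_rotate(q, m=3, d=d) @ rope_rotate(k, m=7, d=d)    # offset 4
s2 = rope_rotate(q, m=10, d=d) @ rope_rotate(k, m=14, d=d)  # same offset 4
assert np.isclose(s1, s2)
```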

3.2 Frequency Scaling

Frequency Scaling. Despite RoPE's effectiveness in incorporating positional information, the model's capacity to handle sequences exceeding pre-training lengths remains limited, primarily due to inadequate training of low-frequency dimensions LocalLLaMA (2023) within the conventional context window size $L$. To address this, several scaling methods have been proposed to adjust RoPE's rotation frequency. We denote $\mathbf{R}^d_{\mathbf{\Theta},m}$ succinctly as:

\left[ \left( \cos\left(\frac{m}{\lambda_i \beta_i}\right), \sin\left(\frac{m}{\lambda_i \beta_i}\right) \right),\ i \in \left[0, \frac{d-2}{2}\right] \right],   (4)

where $\beta_i = 10000^{2i/d}$ represents the base frequency, and $\lambda_i$ are the scaling factors for each frequency dimension.

Scaling Factor Methods. Both the NTK-aware approach LocalLLaMA (2023) and YaRN Peng et al. (2023) apply theories from the Neural Tangent Kernel (NTK) and vary the scaling factors $\lambda_i$ according to each dimension's training needs, achieving better performance with less fine-tuning data. LongRoPE Ding et al. (2024), on the other hand, utilizes a search-based method to identify better scaling factors $\lambda_i$, employing an evolutionary search algorithm first at shorter lengths (128k/256k) with fine-tuning, and searching again at the target length (2M). This method yields a substantial increase in processing length over the base model, highlighting the potential of scaling adjustments in extending model capacity.
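
As a rough illustration of how such methods choose $\lambda_i$ in Eq. 4, the sketch below contrasts PI's uniform factor with the NTK-aware base change; the helper and its signature are our own simplification (YaRN additionally ramps between these two regimes per dimension):

```python
import numpy as np

def scaling_factors(d, scale, method, base=10000.0):
    """Illustrative per-dimension scaling factors lambda_i for Eq. 4."""
    i = np.arange(d // 2)
    if method == "pi":
        # Positional Interpolation: one uniform factor for all dimensions.
        return np.full(d // 2, float(scale))
    if method == "ntk":
        # NTK-aware: enlarge the base so that lambda_i = scale^{2i/(d-2)},
        # i.e., ~1 for high frequencies, rising to `scale` at the lowest one.
        new_base = base * scale ** (d / (d - 2))
        return (new_base / base) ** (2.0 * i / d)
    raise ValueError(method)

# Extending a 4k window to 64k (scale = 16) with 128-dim heads:
print(scaling_factors(128, 16, "pi")[:3])   # [16. 16. 16.]
print(scaling_factors(128, 16, "ntk")[:3])  # ~[1.00 1.04 1.09]
```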

Figure 2: Diagram of the proposed DCIS framework. We illustrate the search procedure for $d=16$, with 3 incremental values evaluated at each step. Since the scaling factors are divided into high-frequency and low-frequency parts, they are initially processed in two segments. First, DCIS searches the scaling factors for the last 4 positions ($\lambda_4$–$\lambda_7$) and draws 3 incremental values $v_i$ from the range $[l_0, r_0]$. It then computes the PPL $p_i$ for each incremental value and selects the incremental value with the lowest PPL to update these 4 scaling factors. DCIS processes the first 4 scaling factors of the first segment in the same manner. At the second layer, DCIS processes 2 scaling factors at a time; at the third layer, 1 at a time; and so forth. The process ends with the scaling factors obtained from the search.

4 Methodology

Although existing methods demonstrate generalization capabilities, they require significant fine-tuning data and suffer from marked performance degradation at extended target lengths. To address these challenges, we propose the Divide-and-Conquer Incremental Search (DCIS) algorithm for scaling factor search. Figure 2 outlines our framework, beginning with an exploratory search phase using the perplexity (PPL) metric as a guide, followed by fine-tuning with the identified scaling factors to effectively extend the model's context modeling. In this section, we introduce our framework (Section 4.1), followed by a detailed description of the DCIS algorithm (Section 4.2).

4.1 Framework

Searching Scaling Factors during Inference. Recent research Wu et al. (2024) reveals that specific retrieval heads carry long-form textual information, suggesting an intrinsic capacity of LLMs to process extended texts. To exploit this potential within LLMs, our framework first searches for optimal scaling factors during a preliminary inference phase. More specifically, our approach encourages the model to autonomously select suitable scaling factors through low-cost inference, thereby enhancing its extrapolation capability with low-cost training.

Fine-Tuning with Searched Factors. Directly applying the searched scaling factors to the original LLM leads to suboptimal performance, indicating a lack of sufficient adaptation within the model Ding et al. (2024). Following YaRN, our framework therefore includes a fine-tuning phase with the searched scaling factors. Because our search algorithm (Section 4.2) finds better initial scaling factors, the number of fine-tuning steps is further reduced.

4.2 Divide-and-Conquer Incremental Search

Here, we introduce our Divide-and-Conquer Incremental Search (DCIS) algorithm, which integrates the divide-and-conquer strategy to efficiently approximate better scaling factors. Beyond Figure 2, Algorithm 1 presents the detailed procedure of DCIS. We also demonstrate that the search space of DCIS is half that of conventional search methods, in Section 6.1.

Algorithmic Strategy. The scaling factor sequence, designated as $[\lambda_0, \lambda_1, \cdots, \lambda_{d/2-1}]$ (as formulated in Eq. 4), is methodically processed using a divide-and-conquer strategy. Specifically, we divide the sequence into a set of segments, and focus on one segment (as highlighted in red in Figure 2) at a time while keeping the others constant. For each segment, we adopt an incremental search strategy, where a new sequence of scaling factors is generated by adding a predetermined value from a set range to the current sequence. The most effective increment is then selected based on the lowest perplexity (PPL), which is also used to narrow the range for subsequent searches in the divided sub-segments. This iterative process, shown for the first iteration with $d=16$ and three values evaluated per segment in Figure 2, incrementally refines the scaling factors.

As shown in Algorithm 1, the scaling factors are divided into high-frequency and low-frequency components. Hence we initially process them in two segments, with each segment handling $N = \text{head\_dim}/2$ scaling factors. In each iteration, the Segment function is first employed to obtain the current segment Seg that needs to be processed:

\text{Seg} = (\lambda_i,\ \lambda_{i+1},\ \cdots,\ \lambda_{i+N-1}),   (5)
\text{where } i = d - N \times j,\ j \in [1,\ d/N].

Subsequently, the GetIncrementalValues function is used to uniformly draw $C$ incremental values from the range of the current segment:

\text{Values}(v_k) = R_{j,l} + \text{step} \times k,   (6)
\text{where } \text{step} = \frac{R_{j,r} - R_{j,l}}{C-1},\ k \in [0,\ C-1].

The ComputePPL function is then utilized to add these incremental values to the current scaling factors, yielding new scaling factors and thereby calculating PPLs:

\text{PPLs}(p_k) = \text{ComputePPL}(\text{Seg} + v_k).   (7)

Finally, the PPLs are used to update the scaling factors and the value range for the next iteration. Specifically, the incremental value with the lowest PPL is used to update the scaling factors, while the $C/3$ incremental values with the lowest PPLs define the range for the next iteration:

\text{PPLs},\ \text{Values} = \text{sort}((p_k, v_k)),
\mathbf{F}_{\text{Seg}} = \text{Seg} + v_1,
R^n_{2\times j-1} = R^n_{2\times j} = [l, r],   (8)
\text{where } l = \min(v_1, v_2, \cdots, v_{C/3}),
r = \max(v_1, v_2, \cdots, v_{C/3}).

This process continues until the scaling factors of the last segment are returned as the result of this search.

Algorithm 1 DCIS

Input: the target LLM, input samples $\mathbf{X}$, initial scaling factors $\mathbf{F}$, initial range $R$, the number of increments processed each time $C$, the number of dimensions for each head $d$.

1:  $N = d/2$
2:  while $N \geq 1$ do
3:    for Seg = Segment($N$) do
4:      Values = GetIncrementalValues($R$)
5:      PPLs = ComputePPL(LLM, $\mathbf{X}$, $\mathbf{F}$, Seg, Values)
6:      $\mathbf{F}$, $R$ = Update($\mathbf{F}$, $R$, PPLs)
7:    end for
8:    $N = N/2$
9:  end while
10: Return the searched scaling factors $\mathbf{F}$
Figure 3: The recall rate of passkey at different lengths. The darker the color, the better the model performance.
Figure 4: PPL of models without fine-tuning. Left: PPL of Llama2-7B at 64k length. Center: PPL of Llama2-7B at 128k length. Right: PPL of Mistral-7B-v0.1 on the PG19 test set. We compare DCIS with other baseline methods.

Optimizations Incorporated. Based on empirical insights, several optimizations have been incorporated into the DCIS algorithm:

  1. Initial Scaling Factors: Drawing from the success of YaRN Peng et al. (2023), its scaling factors are used as the starting points from which our search begins.

  2. Priority to High-Dimensional Scaling Factors: Inspired by the NTK-aware approach LocalLLaMA (2023), which suggests that higher dimensions might require more extensive interpolation, our algorithm prioritizes these dimensions for updates, thereby speeding up convergence to better scaling factors.

  3. Discarding Non-Guiding Increments: Increments resulting in a PPL greater than 100 are considered ineffective and are thus excluded from the search to maintain focus on potentially successful modifications.

  4. Avoiding Local Optima: To prevent falling into local optima, when updating the range of values for the next layer, in addition to using the top-3 lowest-PPL incremental values, we also expand the upper and lower bounds outward by one step, where one step is defined as the difference between adjacent incremental values (see the sketch after this list).
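
To tie Algorithm 1 and the optimizations above together, here is a minimal Python sketch of the search loop. The ppl_fn callable stands in for ComputePPL (target-length inference with a candidate factor vector); every name is illustrative, and this simplified sketch should not be read as the exact released implementation:

```python
import numpy as np

def dcis(factors, ppl_fn, init_range=(-5.0, 5.0), C=10):
    """Sketch of Algorithm 1. `factors` holds the d/2 initial scaling
    factors (YaRN's, per Optimization 1); `ppl_fn` maps a full candidate
    factor vector to a PPL measured at the target length."""
    F = np.asarray(factors, dtype=float).copy()
    half = len(F)                       # d/2 scaling factors in total
    N = half // 2                       # layer 1: two segments (high/low freq.)
    ranges = [init_range, init_range]   # one search range per segment
    while N >= 1:
        next_ranges = []
        for j in range(half // N):
            # Low-frequency (high-dimension) segments first (Optimization 2).
            start = half - N * (j + 1)
            lo, hi = ranges[j]
            values = np.linspace(lo, hi, C)          # C candidate increments
            ppls = []
            for v in values:
                cand = F.copy()
                cand[start:start + N] += v           # shift the whole segment
                ppls.append(ppl_fn(cand))
            ppls = np.array(ppls)
            keep = ppls < 100                        # Optimization 3
            if not keep.any():                       # nothing informative:
                next_ranges += [(lo, hi), (lo, hi)]  # keep segment and range
                continue
            values, ppls = values[keep], ppls[keep]
            order = np.argsort(ppls)
            F[start:start + N] += values[order[0]]   # apply the best increment
            # Top C/3 increments bound the children's range, widened by one
            # step on each side (Optimization 4, cf. Eq. 8).
            top = values[order[: max(C // 3, 1)]]
            step = (hi - lo) / (C - 1)
            next_ranges += [(top.min() - step, top.max() + step)] * 2
        ranges, N = next_ranges, N // 2
    return F
```

In an actual run, ppl_fn would rebuild the model's RoPE cache from the candidate factors and evaluate perplexity on a few calibration samples at the target length; each call corresponds to one of the $(d-2) \times C$ inference passes counted in Section 6.1.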

Adaptive Scaling Factors (ASF). Unlike methods such as YaRN and LongRoPE, which prescribe a strictly increasing order for scaling factors as frequency decreases, our approach does not confine the model to predetermined scaling paths. Considering the complex and often opaque internal mechanisms of models, we posit that different dimensions within each head may require distinct treatment; some might even need interpolation despite being high-frequency components. Thus, our framework allows flexible scaling factor adjustments, tailored to the specific needs of each dimension.

5 Experiments

We conducted a comprehensive evaluation of our proposed framework, focusing on its performance across various metrics and conditions.

Figure 5: Left: scaling factor distributions. Right: PPL under two strategies; the context length for fine-tuning is 16k.
Figure 6: Recall rate under two strategies; the context length for fine-tuning is 16k.

5.1 Setup

Model and Evaluation Tasks. We carried out our experiments using Llama2-7B Touvron et al. (2023) and Mistral-7B-v0.1 Jiang et al. (2023), aiming to extend the models' context windows to 64k tokens. We adopted YaRN's methodology for perplexity (PPL) evaluation, utilizing ten samples from the Proof-pile dataset with lengths of at least 128k tokens. Additionally, we assessed model performance using 50 passkey tests Mohtashami and Jaggi (2023) at each length.

Fine-tuning Parameters. We used Llama2-7B as the base model. Initially, our DCIS algorithm was employed to identify scaling factors at the target length of 64k, setting the initial range to $[-5, 5]$ with $C=10$ incremental values per segment. Following this, we segmented the PG19 dataset Rae et al. (2019) into 4k, 16k, and 64k context lengths, and then performed fine-tuning on each segment. The fine-tuning process closely mirrors YaRN's protocol Peng et al. (2023) with a learning rate of $2 \times 10^{-5}$. For context lengths of {4k, 16k, 64k}, we employed total batch sizes of {512, 64, 32}, and all models were fine-tuned for 400 steps.
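
For reference, the setup above can be summarized in a configuration sketch like the following (a hypothetical summary of the stated hyperparameters, not the authors' training scripts):

```python
# Hypothetical summary of the Section 5.1 hyperparameters; names are ours.
FINETUNE_CONFIG = {
    "base_model": "Llama2-7B",
    "target_context": 64 * 1024,                      # 64k-token target window
    "dcis_search": {"init_range": (-5, 5), "C": 10},  # DCIS settings
    "dataset": "PG19",                                # chunked to 4k/16k/64k
    "learning_rate": 2e-5,
    "total_batch_size": {"4k": 512, "16k": 64, "64k": 32},
    "steps": 400,
}
```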

Baselines. By fine-tuning with the aforementioned hyperparameters, we obtained YaRN-{16k, 64k} Peng et al. (2023), LongRoPE-16k Ding et al. (2024), and our proposed DCIS-{4k, 16k, 64k} models. For CLEX-16k Chen et al. (2023a), we used the originally released model. Additionally, to compare against other families of methods, we included InfLLM Xiao et al. (2024), which supports effectively unbounded context windows by chunking and storing text sequences and retrieving the top-k most relevant chunks during inference.

Method        Fine-tuning Length   S-doc QA   M-doc QA   Sum     Few-shot   Syn    Code
YaRN          16k                  9.38       5.13       15.85   58.25      0.33   62.25
CLEX          16k                  6.93       8.30       13.63   57.68      0.69   44.06
LongRoPE      16k                  9.90       5.25       16.93   59.11      0.31   61.49
InfLLM        -                    6.40       5.10       6.17    51.05      0.78   62.27
DCIS (Ours)   16k                  7.58       3.51       16.11   59.10      0.42   62.05
DCIS (Ours)   4k                   8.17       4.09       14.55   58.07      0.25   61.67

Table 1: Evaluation of different methods on the LongBench benchmark, grouped by task category (single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code).

5.2 Main Results

We assessed the PPL of our proposed model and baseline models on the Proof-pile dataset, with results visualized in Figure 1. Figure 3 depicts performance on the passkey evaluation. Notably, models from other methods that were fine-tuned at a 16k length and subsequently generalized to a 64k context window experienced a marked increase in PPL at the target length (64k). Furthermore, these models entirely failed the passkey test at the target length. In stark contrast, our approach exhibited consistent performance across various context lengths, including the target length, and even outperformed other methods at shorter fine-tuning lengths (4k). Additionally, when fine-tuned on a 64k-length context, DCIS consistently outperforms YaRN on both PPL and passkey metrics.

All results underscore the significance of scaling factors. DCIS identifies superior scaling factors, leading to improved performance across various sequence lengths and demonstrating strong generalization capability, thereby reducing the costs associated with model fine-tuning. The superior initial scaling factors also reduce the number of fine-tuning steps required; for example, in Appendix A.1 we report that our method needs fewer fine-tuning steps to match YaRN's performance.

5.3 DCIS without Fine-Tuning

Figure 4 illustrates the outcomes of our experiments wherein inference was conducted solely by adjusting scaling factors without fine-tuning. A comprehensive search for scaling factors was performed at target lengths of 64k and 128k, followed by direct PPL evaluation. The left and middle plots clearly indicate a consistent decline in our model’s PPL values at both target lengths, in stark contrast to the upward trends observed in other methods. To further validate the efficacy of our method, we extended our experiments to the Mistral model, employing the DCIS algorithm for scaling factors search, and evaluated the resulting models on the PG19 test set. The results unequivocally demonstrate the superior performance of our method across diverse model architectures.

In Appendix A.2, we compare the PPL of scaling factors for various methods at shorter lengths, without fine-tuning.

6 Analysis

In this section, we present a comprehensive analysis of DCIS. We begin by contrasting its search space with that of other search methods. Subsequently, we examine the implications of non-strictly increasing scaling factors. Finally, to evaluate the model's general ability on real-world tasks, we employ LongBench Bai et al. (2023) and the Open LLM Leaderboard from Hugging Face to assess performance on long and short contexts.

6.1 Search Space Analysis

The efficiency of our DCIS algorithm is highlighted by comparing search spaces. Specifically, our search space is the product of the total number of processed segments, $d-2$, and the number of increments per segment, $C$. The search space of the evolutionary search utilized by LongRoPE is the product of the number of iterations $T$ and the population size $P$. For example, for the Llama2-7B model with default parameters, our search space equates to $(d-2) \times C = (128-2) \times 10 = 1260$. In contrast, the evolutionary search space is $T \times P = 40 \times 64 = 2560$, meaning our algorithm searches roughly twice as fast.
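
The comparison reduces to the following back-of-the-envelope computation:

```python
d, C = 128, 10            # head dimension and increments per segment (DCIS)
T, P = 40, 64             # iterations and population size (LongRoPE's search)
dcis_space = (d - 2) * C  # 1260 target-length inference passes
evo_space = T * P         # 2560
print(evo_space / dcis_space)  # ~2.03: roughly half the search cost
```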

6.2 Non-Strictly Increasing Scaling Factor

DCIS employs Adaptive Scaling Factors (ASF) (Section 4.2), and in this section we investigate their impact. Since LongRoPE utilizes strictly monotonically increasing scaling factors, we performed ablation studies on it to examine the effects of our proposed ASF. Specifically, we adjusted the evolutionary search algorithm in LongRoPE to a non-strictly increasing one, applied it to fine-tune the model on a 16k-length context, and subsequently generalized it to a 64k context window. Figures 5 and 6 depict the experimental outcomes. The left plot of Figure 5 visualizes the scaling factor distributions employed by various methods. Notably, both DCIS and LongRoPE + ASF exhibit irregular, sawtooth-like scaling factors, while YaRN and the original LongRoPE demonstrate smoother scaling factors. Furthermore, as shown in the right plot of Figure 5 and in Figure 6, ASF improves LongRoPE in terms of both PPL and passkey scores, suggesting that imposing fewer constraints on scaling factors may enhance model performance.

Recent research may provide an explanation for our findings. Barbero et al. (2024) suggest that RoPE might not exhibit the long-range decay property originally anticipated, indicating that distant tokens may have a larger influence than previously thought. Figure 2 in Barbero et al. (2024) shows that this decay property does not hold for Gaussian random vectors, further supporting our hypothesis that the complex and opaque internal mechanisms of models are better suited to learning appropriate scaling factors without constraints.

6.3 General Ability Evaluation

In addition to the previous assessments of PPL and passkey, we employed LongBench and the Open LLM Leaderboard to comprehensively assess the models' general abilities in both long- and short-context scenarios. The empirical results, depicted in Tables 1 and 3, indicate no significant difference in performance among the various methods, with different models performing best on different subsets. While models utilizing full attention generally achieved slightly better results, InfLLM, which leverages a retrieval-based approach, demonstrated a notable advantage in memory efficiency. These findings suggest that the optimal choice of model is contingent upon specific application requirements.

7 Conclusion

We have presented a novel framework to unlock the potential for length extrapolation in large language models. We have developed a rapid search algorithm, the Divide-and-Conquer Incremental Search (DCIS), specifically tailored to adjusting Rotary Positional Embedding (RoPE) scaling factors. This framework not only mitigates performance degradation when scaling to target lengths but also enables models fine-tuned on short texts to generalize to long texts, thereby reducing the cost of fine-tuning. Moreover, the searched scaling factors significantly improve the model's performance even without fine-tuning. Building on these findings, we examined the impact of non-strictly increasing scaling factors and found that unrestricted scaling factors are more conducive to improving model performance. Comparative analyses of search spaces and experiments on various tasks demonstrate the effectiveness, efficiency, and robustness of our proposed method.

Limitations

While this framework effectively leverages inference to enhance model capabilities, the exploration of more efficient mechanisms to activate this potential remains an open area for further research. Furthermore, although we have optimized the algorithm to keep the scaling factors from falling into local optima, the divide-and-conquer search still cannot definitively guarantee that the searched scaling factors are optimal. However, as discussed in Section 6.2, LLMs are opaque black boxes without a well-defined optimal solution. Therefore, searching for scaling factors can only yield a better, rather than an optimal, solution.

In addition, we searched for scaling factors at a target length of 128k on a weaker model, Phi-3-mini-4k-instruct Abdin et al. (2024), which has an actual context window size of 2k. We observed that the model's PPL remained around 70, with no significant decrease. Thus, such search algorithms appear to place certain requirements on the model's inherent capabilities and cannot span too large a scaling ratio at once. This also demonstrates that, relative to YaRN's smaller scaling factors, the factors identified through search are larger, thereby more fully leveraging the model's extrapolative potential.

References

  • Abdin et al. (2024) Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219.
  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Bai et al. (2023) Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. 2023. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
  • Barbero et al. (2024) Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, and Petar Veličković. 2024. Round and round we go! what makes rotary positional encodings useful? arXiv preprint arXiv:2410.06205.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Chen et al. (2023a) Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, and Lidong Bing. 2023a. Clex: Continuous length extrapolation for large language models. arXiv preprint arXiv:2310.16450.
  • Chen et al. (2023b) Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023b. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  • Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
  • Ding et al. (2023) Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei. 2023. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.
  • Ding et al. (2024) Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. 2024. Longrope: Extending llm context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753.
  • Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
  • Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In International conference on machine learning, pages 1243–1252. PMLR.
  • Han et al. (2023) Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. 2023. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
  • Jin et al. (2024) Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. 2024. Llm maybe longlm: Self-extend llm context window without tuning. arXiv preprint arXiv:2401.01325.
  • Liu et al. (2023) Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.
  • LocalLLaMA (2023) LocalLLaMA. 2023. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
  • Mohtashami and Jaggi (2023) Amirkeivan Mohtashami and Martin Jaggi. 2023. Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300.
  • Peng et al. (2023) Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.
  • Press et al. (2021) Ofir Press, Noah A Smith, and Mike Lewis. 2021. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.
  • Rae et al. (2019) Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. 2019. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507.
  • Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
  • Su (2023) Jianlin Su. 2023. Rectified rotary position embeddings. https://github.com/bojone/rerope.
  • Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Wu et al. (2024) Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. 2024. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574.
  • Wu et al. (2022) Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing transformers. arXiv preprint arXiv:2203.08913.
  • Xiao et al. (2024) Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Song Han, and Maosong Sun. 2024. Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory. arXiv preprint arXiv:2402.04617.
  • Xiao et al. (2023) Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2023. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.

Appendix A Additional Results

We present supplementary results obtained during our experimental process in this section.

A.1 Fewer Fine-Tuning Steps

During the fine-tuning of YaRN and DCIS, we observed that our method achieved PPL and passkey results consistent with, or even superior to, those of YaRN after only 100 steps of fine-tuning, as depicted in Table 2 and Figure 7. This demonstrates that initially superior scaling factors require fewer fine-tuning steps.

Method        Fine-tuning Steps   4k     8k     16k    24k    32k    40k    48k    56k    64k
YaRN          100                 3.71   3.53   3.01   2.78   2.68   2.60   2.54   2.48   2.45
              200                 3.71   3.53   3.01   2.78   2.68   2.60   2.54   2.48   2.44
              400                 3.71   3.52   3.00   2.78   2.67   2.59   2.53   2.47   2.44
DCIS (Ours)   100                 3.72   3.53   3.00   2.77   2.67   2.59   2.53   2.47   2.43
              200                 3.72   3.53   3.01   2.77   2.67   2.59   2.53   2.47   2.43
              400                 3.71   3.52   3.00   2.77   2.66   2.58   2.52   2.46   2.42

Table 2: Comparison of PPL across evaluation context window sizes (4k–64k) at different numbers of fine-tuning steps on the 64k length.
Figure 7: Comparison of passkey results at different numbers of fine-tuning steps on the 64k length.

A.2 Shorter Length of PPL without Fine-Tuning

Even at shorter lengths of 16k and 32k, our scaling factors consistently achieved the lowest PPL, as illustrated in Figure 8.

Figure 8: PPL of scaling factors without fine-tuning. Left: 16k length. Right: 32k length.
Method        Fine-tuning Length   ARC-c   Hellaswag   MMLU   TruthfulQA
Original      -                    52.6    79.0        46.4   39.0
YaRN          64k                  52.8    78.8        42.1   39.0
              16k                  52.6    78.4        42.4   38.4
CLEX          16k                  52.3    78.3        42.1   41.3
DCIS (Ours)   64k                  52.3    78.4        41.8   39.2
              16k                  53.0    78.2        41.4   38.8
              4k                   52.6    78.4        43.7   38.0

Table 3: Evaluation of different methods on benchmarks from the Hugging Face Open LLM Leaderboard.

A.3 Benchmarks

As shown in Table 3 of the Open LLM Leaderboard, there is no clear superiority among different methods. Furthermore, the models with extended context windows do not exhibit significant performance degradation on short texts compared to their original counterparts.