
A Survey on Memory-Efficient Large-Scale Model Training in AI for Science

Kaiyuan Tian, Linbo Qiao, Baihui Liu, Gongqingjian Jiang, Dongsheng Li
Abstract

Scientific research faces high costs and inefficiencies with traditional methods, but the rise of deep learning and large language models (LLMs) offers innovative solutions. This survey reviews LLM applications across scientific fields such as biology, medicine, chemistry, and meteorology, underscoring their role in advancing research. However, the continuous expansion of model size has led to significant memory demands, hindering further development and application of LLMs for science. To address this, we review memory-efficient training techniques for LLMs based on the transformer architecture, including distributed training, mixed precision training, and gradient checkpointing. Using AlphaFold 2 as an example, we demonstrate how tailored memory optimization methods can reduce storage needs while preserving prediction accuracy. We also discuss the challenges of memory optimization in practice and potential future directions, hoping to provide valuable insights for researchers and engineers.

Index Terms:
AI for science, memory optimization, large-scale deep learning model, distributed training.

I Introduction

The rapid advancement of artificial intelligence, particularly large language models (LLMs), has positioned deep learning (DL) as a vital tool for addressing various scientific challenges. DL offers a new approach to scientific research due to its strengths in automatic feature extraction and complex pattern recognition. According to the universal approximation theorem, deep neural networks (DNNs) can effectively fit any continuous function. Additionally, the high-performance parallel implementation of the backpropagation algorithm on GPUs enables DL to efficiently solve scientific and engineering problems [1]. In disciplines such as biology, medicine, chemistry, and climate science, DL models demonstrate great potential to assist researchers in accelerating scientific discoveries, enhancing prediction accuracy, and providing innovative solutions to complex challenges.

In Biological Sciences, it has proven effective in areas such as protein structure and function prediction, as well as genetic engineering. One notable example is AlphaFold 2 [2], which uses DL to predict the three-dimensional structures of proteins. In the medical field, LLMs have been applied to various tasks, including medical Q&A, disease prediction, and clinical decision support. For instance, the Med-PaLM series of models [3], [4], [5] can answer questions posed by medical professionals and assist doctors in diagnosing diseases and determining treatment options. In chemistry, DL is employed for tasks such as molecular generation [6], [7], molecular property prediction [8], and chemical reaction prediction [9]. In meteorology, by processing vast amounts of meteorological data, DL models can reveal the complex laws of weather change, improve the accuracy of weather prediction, simulate and predict a wide range of climate patterns [10], [11], [12], and predict the occurrence and impact of extreme weather events [13], [14]. With the improvement of computing power and the continuous progress of DL algorithms, the application of DL in scientific fields will be more extensive and in-depth, further expanding the boundaries of scientific exploration.

Figure 1: AI memory wall. General LLMs are highlighted in red, while LLMs in scientific fields are highlighted in blue. The growth of model parameters has surpassed the increase in memory capacity of accelerators.

Meanwhile, studies have shown that model performance improves as the number of parameters increases [15], [16]. Consequently, the scale of these models has continued to grow, leading to memory usage issues in DL models. In recent years, the widespread development and application of transformer models [17] have led to a significant increase in model size, averaging a 240-fold growth every two years. The expansion of parameters in models like GPT-3 [18] and Llama 2 [19] has outpaced the growth of GPU memory. The training of large-scale models therefore inevitably encounters the “memory wall” problem [20] (see Fig. 1).

A few related surveys have summarized memory-efficient LLM training methods. Zhuang et al. [21] provided an overview of efficient training technologies for transformer-based models; they discussed several memory-efficient training methods and summarized the key factors that affect training efficiency. To gain a deeper understanding of the latest advancements in efficient LLMs, Wan et al. [22] conducted a thorough literature review and proposed a taxonomy to organize the relevant research. In contrast to these studies, this survey focuses on the application of LLMs across different scientific fields, systematically introducing general memory-efficient training methods for transformer-based large-scale scientific models. Furthermore, we use a case study to demonstrate efforts aimed at optimizing memory overhead in models based on transformer variants. To the best of our knowledge, there are no existing reviews specifically addressing memory-efficient training methods for LLMs in scientific fields.

Our investigation indicates that despite the widespread application of LLMs in scientific fields, there is a notable lack of memory efficiency in their training processes. A judicious selection of memory optimization techniques is essential, as it can significantly improve the efficiency of DL models and broaden their applicability in resource-constrained settings. It is of considerable theoretical and practical importance for facilitating the broader adoption of DL technologies in scientific research.

The remainder of this paper is organized as follows. In Section II, we begin by providing a brief introduction to deep learning and the associated knowledge of LLMs to enhance understanding for researchers and practitioners from various fields. We then highlight the accomplishments of LLMs across different scientific domains and summarize the memory optimization techniques employed during model training. Section III offers a comprehensive overview of various strategies aimed at reducing memory overhead during LLM training. Section IV focuses on AlphaFold 2 as a representative example, detailing some customized memory optimization methods. Section V addresses the trends and challenges of applying memory optimization techniques in large-scale models within scientific contexts. Finally, Section VI reviews and summarizes the key points discussed throughout the paper.

II Large Language Models and AI for Science

II-A A Brief Introduction to Deep Learning and LLMs

Deep learning is a machine learning approach that automatically extracts features from complex data through multiple layers of nonlinear transformations. In contrast to traditional machine learning techniques, deep learning offers several advantages, including no need for manual feature engineering, the ability to adapt to complex high-dimensional data, and superior generalization capabilities. In recent years, the emergence of transformer models has further enriched this field.

LLMs are DNN-based language models characterized by a massive number of parameters. These models enhance performance by increasing both capacity and the volume of training data. Before the rise of deep learning, language modeling primarily relied on statistical methods, such as n-gram models and hidden Markov models. However, they struggled with data sparsity, making it challenging to handle long sequences and complex dependencies effectively. In 2003, Bengio et al. [23] introduced a neural network-based language model that sparked a wave of research into neural language models. Subsequently, Mikolov et al. proposed Word2Vec, which enhanced the representation of words through Skip-gram [24] and CBOW [25] methods. This innovation became a vital tool for feature representation in later natural language processing (NLP) research and applications. During this period, models like recurrent neural network (RNN) [26], long short-term memory network (LSTM) [27], and gated recurrent unit (GRU) [28] gained popularity for language modeling because they could better capture long-distance dependencies. Despite this progress, these approaches still encounter efficiency limitations when processing long sequences, and their training processes are challenging to parallelize.

The proposal of the Transformer architecture [17] in 2017 marked a revolutionary breakthrough in the field of language modeling. Transformers use a multi-head self-attention mechanism rather than a recurrent structure, enabling them to efficiently capture global dependencies in long sequences and to speed up model training. With these advancements, transformers quickly became the foundation of modern NLP models, driving the rapid evolution of LLMs. Models like BERT [29] and GPT [30] emerged one after another, establishing pre-training followed by fine-tuning as an NLP paradigm. Subsequently, GPT-2 [31] and GPT-3 [18] expanded their parameter counts to 1.5B and 175B, demonstrating the potential of large-scale pre-training, particularly for zero-shot and few-shot learning. The scaling laws [15], [16] reveal the relationship between the growth of model parameters, data size, computational complexity, and model performance, providing theoretical guidance for designing and training large-scale models. To further improve model performance on downstream tasks, researchers have also proposed methods such as instruction fine-tuning [32] and human-feedback alignment, as in InstructGPT [33]. At the end of 2022, ChatGPT (https://openai.com/index/chatgpt) showcased the tremendous potential of LLMs to the world and sparked a research boom in LLMs.

In recent years, companies and research institutions have introduced a range of LLMs, advancing their scale and capabilities. Notable examples include, but are not limited to, Google’s PaLM series [34], [35], Meta’s Llama series [36], [19], [37], OpenAI’s GPT-4 [38], DeepSeek-AI’s DeepSeek series [39], [40], [41], Tsinghua’s GLM-4 [42], and Alibaba’s Qwen series [43], [44]. So far, LLMs have exhibited exceptional proficiency in language generation, multimodal integration, and specialized tasks, providing substantial support for scientific research, industrial applications, and everyday life. As technological advancements continue, LLMs are expected to exert significant influence across a wider range of domains.

II-B Application of LLMs in Scientific Fields

With the success of LLMs in NLP, an increasing amount of research is exploring their application to scientific fields. As powerful tools for knowledge representation and reasoning, LLMs are capable of managing complex high-dimensional data, uncovering deep patterns, and enhancing the speed and accuracy of scientific discoveries. This section focuses on the use of LLMs in common tasks across different scientific disciplines. By reviewing these applications, we aim to illustrate the broad applicability and impressive capabilities of LLMs in diverse areas of science, as well as the necessity of memory-efficient training techniques. Table I summarizes relevant studies utilizing LLMs across multiple fields and outlines the memory optimization techniques employed during model training. Note that because activation memory usage varies with experimental settings, the rough memory-cost estimates in Table I exclude activations, even though activations constitute a predominant portion of the memory overhead when training certain models.

TABLE I: Applications of LLMs in Scientific Fields and Corresponding Training Optimization Strategies
Field | Work | Backbone | Main Building Block | #Parameters | Memory Cost (est.) | Optimizations | Tasks
Biology | AlphaFold 2 [2] | - | Evoformer [2] | 93M | 1.45 GB | DP, mixed-precision, GC | protein structure prediction
Biology | RosettaFold [45] | - | SE(3)-Transformer [46] | 130M | 2.03 GB | DP, GA | protein structure prediction
Biology | AlphaFold 3 [47] | - | Pairformer [47] | Unreported | - | Unreported | biomolecular complex structure prediction
Biology | OpenFold [48] | - | Evoformer | 93M | 1.45 GB | DP (ZeRO-2), mixed-precision, GC, offloading, GA | protein structure prediction
Biology | FastFold [49] | - | Evoformer | 93M | 1.45 GB | DP, DAP | protein structure prediction
Biology | ScaleFold [50] | - | Evoformer | 97M | 1.52 GB | DP, DAP | protein structure prediction
Biology | ESMFold [51] | ESM-2 15B | Transformer [17] | 15B | 240 GB | DP (FSDP) | protein structure prediction
Biology | xTrimoPGLM [52] | xTrimoPGLM-100B | Transformer | 100B | 1.56 TB | DP (ZeRO-1), PP (1F1B), TP (Megatron-LM), mixed-precision, GC | protein understanding, protein generation
Biology | ESM3 [53] | - | Transformer | 1.4B / 7B / 98B | 1.53 TB | DP (FSDP), mixed-precision | protein reasoning, protein generation
Medicine | BioGPT [54] | GPT-2 Medium [31] | Transformer | 347M | 5.42 GB | DP, GA | biomedical text understanding, biomedical text generation
Medicine | Med-PaLM [3] | PaLM 540B [34], Flan-PaLM 540B [32] | Transformer | 540B | 8.44 TB | DP (ZeRO-3), TP, GC | medical question answering, medical reasoning
Medicine | Med-PaLM 2 [4] | PaLM 2 340B [35] | Transformer | 340B | 5.31 TB | Unreported | medical question answering, medical reasoning
Medicine | Med-PaLM M [5] | PaLM-E [55] | Transformer | 12B / 84B / 562B | 8.78 TB | DP (ZeRO-3), TP, GC | biomedical generalist
Medicine | BiomedGPT [56] | OFA [57] | Transformer | 33M / 93M / 182M | 2.84 GB | DP (PyTorch DDP), mixed-precision | biomedical generalist
Medicine | Meditron [58] | Llama-2 [19] | Transformer | 7B / 70B | 1.09 TB | DP, PP, TP (Megatron-LM) | medical reasoning
Medicine | HuatuoGPT [59] | Baichuan-7B, Ziya-LLaMA-13B [60] | Transformer | 7B / 13B | 208 GB | DP (ZeRO) | medical consultation
Medicine | HuatuoGPT-II [61] | Baichuan2 [62], Yi-34B [63] | Transformer | 7B / 13B / 34B | 544 GB | DP (ZeRO) | medical consultation
Biomedicine | PharmBERT [64] | BERT-Base [29] | Transformer | 110M | 1.72 GB | Unreported | drug labeling
Biomedicine | PharmGPT [65] | Llama-2 | Transformer | 3B / 13B / 70B | 1.09 TB | DP+PP+TP | text understanding, text generation
Chemistry | ChemBERT [66] | BERT-Base | Transformer | 110M | 1.72 GB | Unreported | product extraction, reaction role labeling
Chemistry | CatBERTa [8] | RoBERTa [67] | Transformer | 355M | 5.55 GB | Unreported | catalyst property prediction
Chemistry | Chemformer [9] | BART [68] | Transformer | 45M / 230M | 3.59 GB | DP (ZeRO-2), mixed-precision | reaction prediction, molecular optimization, molecular property prediction
Chemistry | MolGen [7] | BART | Transformer | 355M | 5.55 GB | DP (ZeRO-2), mixed-precision, GA | molecule generation
Chemistry | ChemGPT [6] | GPT-Neo [69] | Transformer | 1.2B | 19.20 GB | DP (PyTorch DDP) | molecule generation
Meteorology | Pangu-Weather [11] | - | Swin transformer [70] | 256M | 4 GB | DP | weather forecast
Meteorology | FuXi [12] | - | Swin transformer [71] | 4.5B | 72 GB | DP (FSDP), mixed-precision, GC | weather forecast
Meteorology | ClimaX [13] | - | ViT [72] | Unreported | - | DP, mixed-precision | weather forecast, climate projection
Meteorology | Aurora [14] | - | Swin transformer | 1.3B | 20.80 GB | DP, mixed-precision, GC | atmospheric prediction
  • GA: gradient accumulation

  • GC: gradient checkpointing

II-B1 Biology

Protein structure prediction stands as one of the most significant challenges in biochemistry, as high-precision predictions are essential for drug discovery and protein design. AlphaFold 2 [2], developed by DeepMind, is a DL-based tool for predicting protein structures. It can predict the three-dimensional structure of proteins based on their amino acid sequences, representing a major breakthrough in addressing long-standing biological challenges.

The architecture of AlphaFold 2 is built upon variants of the Transformer. Its main structure features an input module that receives the amino acid sequence of the target protein and searches a sequence database for homologs to perform multiple sequence alignment (MSA). The Evoformer module then processes the MSA representation and the pair representation, facilitating the exchange and updating of information through transformer layers. Finally, the structure module converts the abstract protein representation into 3D atomic coordinates [73]. AlphaFold 2 has significantly influenced the advancement of biotechnology. However, its training process and datasets are not fully open source, which restricts researchers from applying it to tasks beyond protein structure prediction. Furthermore, despite its relatively small parameter count (about 93M), the memory consumed by intermediate activations is substantial due to its architecture, and peak memory usage increases cubically with the length of the input sequence. Even with mixed precision and gradient checkpointing, the activation memory of the attention module exceeds 20 GB [49]. To facilitate further research, Ahdritz et al. developed OpenFold [48]. As an open-source implementation of AlphaFold 2, it achieves comparable prediction accuracy while running faster and consuming less memory. It is built on PyTorch [74], which makes it easier for the community to use and improve.

Although AlphaFold 2 and RoseTTAFold [45] achieved breakthroughs in atomic-resolution structure prediction, they rely on MSAs and structural templates of similar proteins to reach their best performance. Building on the research of OpenFold, Lin et al. [51] further explored the capabilities of language models and proposed ESMFold. ESMFold uses the internal representations of a protein language model to replace the expensive MSA-processing modules with a transformer module that operates directly on sequences. This simplification makes ESMFold substantially faster than MSA-based models. To enable its training, ZeRO stage 3 [75] is used to shard the model parameters, gradients, and optimizer states across GPUs.

Protein language models (PLMs) have achieved remarkable success in learning biological information from protein sequences. However, most existing models are constrained by autoencoding or autoregressive pre-training objectives, making it challenging to address protein understanding and generation tasks at the same time. Chen et al. [52] explored the possibility of jointly optimizing these two training objectives within an innovative framework and trained a 100B-parameter PLM, xTrimoPGLM. Inspired by ESMFold, they combined a folding module with xTrimoPGLM and developed xT-Fold, a high-performance 3D structure prediction tool that outperforms other PLM-based models on benchmarks such as CAMEO and CASP15. Because both the model size and the scale of the training data are greatly increased, xTrimoPGLM employs various memory optimization methods during pre-training, including 3D parallelism [76], [77], [75], FP16 mixed precision training [78], and gradient checkpointing [79].

In May 2024, Isomorphic Labs and DeepMind jointly released AlphaFold 3 [47]. The overall architecture of AlphaFold 3 is similar to that of AlphaFold 2, but its prediction accuracy is higher and the model is more efficient. In addition, its application scope has been expanded: beyond single proteins, it can predict the structures of complexes involving proteins, nucleic acids, small molecules, and other biomolecules.

II-B2 Medicine

LLMs have demonstrated significant potential in the medical and clinical fields. When trained on extensive medical datasets, these models can produce high-quality outputs and support decision-making. The applications of such medical models are numerous, including medical knowledge retrieval, assistance in clinical diagnosis, consultations, and personalized medical services.

Singhal et al. [3] found that existing benchmarks for evaluating the clinical knowledge of models are limited. To address this, they proposed MultiMedQA, a benchmark composed of seven medical Q&A datasets. They then applied instruction fine-tuning to PaLM [34] to obtain Flan-PaLM [32], which shows excellent performance on multiple-choice questions. Moreover, they proposed instruction prompt tuning, and the resulting Med-PaLM model adapts further to the medical field. PaLM, the pre-trained backbone of Med-PaLM, uses ZeRO stage 3 and TP to shard model parameters across all TPUs during training and uses gradient checkpointing to reduce the memory overhead of activations.

Tu et al. [5] aimed to further explore the potential of generalist models (those capable of performing multiple tasks without fine-tuning, such as GPT-3, PaLM, and GPT-4) in the medical field. Utilizing the language and multimodal foundation models PaLM 540B and PaLM-E 562B [55], they developed Med-PaLM M. They also designed a multimodal biomedical benchmark called MultiMedBench. As a large-scale biomedical generalist model, Med-PaLM M demonstrates strong performance across all tasks in MultiMedBench, matching or exceeding the capabilities of task-specific models.

Currently, most LLMs used in the medical field are either closed source or relatively small. Chen et al. [58] argue that while generalist models typically perform well across tasks, their performance on certain tasks falls short of models tailored for those specific purposes. They developed Meditron-70B through continued pretraining of Llama-2 and fine-tuned it for particular tasks, achieving superior performance compared to all open-source generalist and medical LLMs across all medical benchmarks. Furthermore, to enable training, they extended NVIDIA’s Megatron-LM [76] and developed Megatron-LLM (https://github.com/epfLLM/megatron-LLM), a distributed training library that accommodates various model architectures, and they employed 3D parallelism to optimize the training efficiency of models at this scale.

HuatuoGPT [59] aims to assist doctors with diagnosis and treatment in Chinese-language scenarios; it combines the dialogue capabilities of ChatGPT with the expertise of real-world physicians. By employing instruction fine-tuning and reinforcement learning, it exploits distilled instruction data generated by ChatGPT alongside actual response data from doctors. The upgraded version, HuatuoGPT-II [61], passed the Chinese National Pharmacist Licensure Examination in October 2023 and nearly all medical qualification exams, highlighting its robust capabilities in the Chinese medical landscape. ZeRO data parallelism is enabled during the training of HuatuoGPT and HuatuoGPT-II.

II-B3 Biomedicine

The emergence and rapid development of LLMs has heralded a new era in the fields of biopharmaceuticals and chemical sciences, providing innovative methods for drug discovery, chemical synthesis and optimization, and the elucidation of complex biological pathways.

Drug labels possess distinct characteristics that differentiate them from texts in other domains, making it ineffective to apply generic models directly to this field. ValizadehAslani et al. [64] leveraged drug label text corpora and continued pre-training from BERT-Base checkpoints using masked language modeling (MLM) as the pre-training task. This effort resulted in PharmBERT, a domain-specific BERT model for drug labeling. They demonstrated its effectiveness by evaluating PharmBERT on three downstream tasks.

To tackle the issue of limited knowledge among general-purpose language models in the domains of biopharmaceuticals and chemistry, Chen et al. introduced PharmaGPT [65], a large-scale multilingual language model tailored for these fields. PharmaGPT is pre-trained on high-quality, large-scale, domain-specific datasets and optimized for enhanced performance in biopharmaceuticals and chemistry through fine-tuning and reinforcement learning. Experiments indicate that PharmaGPT outperforms mainstream general models on professional benchmark tests, such as NAPLEX, showcasing its exceptional capability in understanding and generating specialized knowledge.

II-B4 Chemistry

The application of LLMs in chemistry covers multiple aspects, including molecular design, molecular property prediction, and chemical reaction prediction. Additionally, LLMs can be used to analyze and summarize chemical literature, providing researchers with efficient knowledge retrieval and decision-making support. Their powerful sequence processing capabilities enable the handling of complex chemical terminology and diverse data formats, thereby fostering advancements in automation and intelligence in chemistry.

Extracting structured data from chemical literature is a challenging task due to the complexity and heterogeneity of chemical language. Guo et al. [66] broke this task into two cascaded subtasks: product extraction and reaction role labeling. They developed two independent BERT-based models, ChemBERT and ChemRxnBERT. By pre-training and fine-tuning these models on domain-specific and task-oriented corpora, they enhanced the models’ understanding of semantic relationships and contextual features in chemical texts.

In the field of molecular property prediction, Ock et al. introduced CatBERTa [8], which bypasses the requirement for precise atomic coordinates in previous GNN-based methods and predicts catalyst properties solely from textual molecular representations. Compared with graph-based methods, using text representations to describe adsorbate-catalyst systems offers better interpretability. The prediction accuracy of CatBERTa is comparable to that of GNN-based methods, highlighting its potential in property prediction tasks.

Irwin et al. pre-trained BART [68] on a dataset containing a large number of simplified molecular-input line-entry system (SMILES) strings to capture the structural relationships and latent properties of chemical molecules, resulting in the Chemformer model [9]. Chemformer handles both sequence-to-sequence tasks (e.g., reaction prediction, molecular optimization) and discriminative tasks (e.g., molecular property prediction, biological activity analysis). It matches or exceeds the performance of state-of-the-art models on multiple benchmarks while significantly reducing convergence time, effectively addressing the challenges posed by limited computing resources and time constraints.

The objective of molecular design and generation tasks is to create new molecules with specific functions or properties for applications such as drug development, materials science, and catalyst development. Traditional methodologies depend heavily on the expertise of specialists and costly computational simulations. However, the introduction of LLMs has opened new possibilities in this field. Fang et al. introduced a molecular generative pre-trained language model called MolGen [7], which enhances its comprehension of chemical structures and semantics through a two-stage pre-training strategy encompassing molecular syntax and semantic learning, using self-referencing embedded strings (SELFIES) together with domain-agnostic prefix tuning. Furthermore, the study proposed a chemical feedback paradigm that aligns generated molecules with real-world chemical preferences by adjusting their probability distribution, effectively reducing the issue of “molecular hallucination” (i.e., generated molecules lacking the expected functionality in practical applications). Frey et al. [6] conducted an in-depth study of the neural scaling behavior of deep chemical models with respect to dataset and model size, aiming to explore how model performance improves with increased resource investment. They designed and trained a generative pre-trained transformer model named ChemGPT with over 1B parameters for small-molecule generation. By varying both the model and dataset sizes over several orders of magnitude, they demonstrated that pre-training loss improves as the scale increases.

II-B5 Meteorology

Weather forecasting represents a crucial application scenario in meteorology, providing the ability to predict future weather changes, which benefits daily life, agriculture, transportation, and other fields. Over the past decade, the swift advancement of high-performance computing devices has led to significant progress in numerical weather prediction (NWP), enhancing the accuracy of forecasts for daily weather, extreme weather events, and climate change. However, as the growth of computing power slows and the complexity of physical models increases, the limitations of traditional methods have become more pronounced. Recently, the rapid progress in deep learning has emerged as a promising avenue for addressing these challenges in the field.

The Pangu-Weather model [10], [11], proposed by Bi et al., employs the 3D Earth-Specific Transformer (3DEST), a variant of the Swin Transformer [70], [71], to process 3D climate data. It is the first AI-driven approach to surpass the forecast accuracy of NWP methods. Notably, Pangu-Weather achieves an inference time of just 1.4 seconds on a single GPU, making it 10,000× faster than the previously most accurate NWP method (operational IFS [80]). While the accuracy of 3D models can surpass that of 2D models, this enhancement comes with a considerable increase in memory requirements, limiting the achievable network depth. The study employs DP to train on 192 GPUs, with a batch size of 1 assigned to each GPU.

The uncertainty inherent in weather forecasting is inevitable, and this uncertainty, along with cumulative errors, increases as the forecast period lengthens. In response to this challenge, Chen et al. introduced FuXi [12], a cascaded model architecture based on the U-Transformer, which effectively reduces cumulative errors and improves both forecast accuracy and lead time. The model was pre-trained on a cluster of 8 NVIDIA A100 GPUs, using techniques such as FSDP [81], BF16 mixed precision training, and gradient checkpointing to optimize memory usage.

Inspired by the concept of foundation models [82], Nguyen et al. designed and trained a foundation model named ClimaX [13], aimed at adapting effectively to general tasks related to the Earth’s atmosphere. ClimaX can be trained on heterogeneous datasets to solve diverse climate and weather tasks. Remarkably, even with pre-training conducted at lower resolutions and with limited computational resources, ClimaX surpasses previous data-driven models on weather and climate prediction benchmarks. Its pre-training used 80 NVIDIA V100 32GB GPUs with FP16 mixed precision training enabled. Subsequently, Bodnar et al. introduced Aurora [14], a large-scale atmospheric foundation model designed to excel in various prediction tasks, particularly in regions with limited data or under extreme weather conditions. After training on diverse weather and climate datasets, Aurora can effectively adapt to new atmospheric prediction tasks by learning a universal representation of atmospheric dynamics. The model was trained on 32 GPUs with a batch size of 1 per GPU; to meet memory constraints, BF16 mixed precision training and gradient checkpointing were also used.

Despite the progress achieved in applying large-scale models across various domains, the continuous increase in memory overhead during model training remains a pressing concern. We attribute this issue to the following key factors:

  • The number of model parameters continues to increase. In recent years, there has been a notable increase in the scale of models across various fields, as shown in Fig. 1. This trend is driven by the demand for processing complex and large-scale data, advancements in computing resources, and the pursuit of enhanced performance and accuracy in downstream tasks. Consequently, this growth has also resulted in higher memory requirements.

  • The necessity to process long sequence inputs. Long sequence data is prevalent across various scientific domains, such as DNA and protein sequences in biology, molecular structures in chemistry, and time-series weather data in meteorology. As the length of input data increases, the memory required to store activations also grows significantly. This challenge is exacerbated by the quadratic computational complexity inherent in transformer models. The situation may become worse when employing a more complex architecture.

  • The demand for handling diverse data structures. In scientific fields, the use of high-resolution images and multimodal data significantly contributes to increased memory consumption during large model training. High-resolution images contain a substantial amount of pixel information, thereby amplifying the demands of data processing and storage. Additionally, multimodal data requires the handling of diverse data types, such as images, text, and audio. Each modality necessitates an independent processing pathway and fusion mechanism, which further elevates the complexity and memory requirements of the model.

III Memory-Efficient Training Techniques of Transformers

LLMs are evolving rapidly and find extensive applications across various domains. However, training these large-scale models imposes significant demands on the computing power and memory of computing devices. In this context, it becomes essential to explore memory-efficient strategies for training large models. These strategies focus on optimizing memory utilization and making more efficient use of computing resources. While LLMs have gained considerable popularity in scientific fields and have produced noteworthy results, many works adopt few or no optimizations to improve training efficiency and reduce memory overhead. In the rest of this section, we delve into various memory-efficient training techniques for transformer-based LLMs.

III-A Distributed Training

As deep learning models become larger in parameter and data size, it is becoming more challenging for a single computing device to fulfill training needs. Consequently, distributed training has emerged as one of the foundational technologies in deep learning. It involves distributing the training tasks across multiple devices (such as GPUs or nodes) for parallel processing. By employing effective task decomposition and data partitioning strategies, distributed training significantly enhances the efficiency and scalability of model training.

As the scale of models continues to increase, a single GPU’s memory can no longer accommodate complete model parameters, optimizer states, and intermediate activations. Traditional training strategies for large-scale models may take weeks or even months. However, distributed training allows the model to be divided across multiple devices, enabling the training of extremely large models with limited single GPU memory and significantly reducing training time.

III-A1 Data Parallelism

Data parallelism (DP) is a commonly used distributed training technique, particularly in environments with multiple GPUs and nodes. DP can significantly accelerate the training process and improve overall training throughput. It involves replicating the model across computing devices. During training, each device is assigned a subset of the batched training data, known as a mini-batch, which is used as input to its model replica. Because the input data differ across model replicas, the gradients from all computing devices must be aggregated before the model parameters are updated. In terms of implementation, DP architectures can be categorized into the parameter server architecture (Fig. 2a) and the decentralized architecture (Fig. 2b). In the parameter server architecture [83], devices are divided into parameter servers and workers. The parameter server stores the latest model parameters and aggregates the gradients collected from workers to perform parameter updates. Each worker communicates with the parameter server during training to retrieve the latest parameters or to upload gradients computed on its mini-batches.

(a) Parameter Server
(b) Decentralized
Figure 2: Different architectures for data parallelism.

Parameter servers are likely to become a communication bottleneck, as all workers need to communicate with servers frequently, especially in large-scale training scenarios. In contrast, decentralized architecture avoids this issue. In such a setup, computing device peers perform gradient aggregation through collective communications such as all-reduce, which enhances scalability. Moreover, the implementation of PyTorch DDP [84] leverages the overlap between computation and communication to further increase training throughput.
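To make the decentralized setup concrete, below is a minimal sketch of data parallelism with PyTorch DDP. The linear layer, optimizer settings, and random batch are placeholders, and the script assumes a single-node, one-process-per-GPU launch (e.g., via torchrun); it is an illustrative sketch rather than a complete training recipe.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU, launched e.g. with `torchrun --nproc_per_node=4 ddp_sketch.py`.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank()                 # assumes a single-node launch for simplicity
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # placeholder for a real model
    model = DDP(model, device_ids=[local_rank])  # gradient all-reduce is handled by DDP
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Each rank consumes its own mini-batch; DDP overlaps gradient
    # communication with the backward computation.
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```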

DP divides the input training data along the batch dimension, which reduces the size of intermediate activations on each device and greatly improves training efficiency. However, its limitation is also obvious: DP requires each device to hold a full model replica, which is infeasible for large models that cannot fit on a single device.

The memory overhead during model training primarily comes from model states and intermediate activations. The model states can be further divided into optimizer states, gradients, and parameters. Assuming the number of model parameters is $P$ and training uses the Adam optimizer with FP16 mixed precision, the memory overhead of model parameters is $2P + 4P = 6P$ bytes (FP16 parameters plus an FP32 master copy), gradients consume $2P$ bytes, and optimizer states consume $4P + 4P = 8P$ bytes, giving a total of $16P$ bytes. Training GPT-3 (175B) would therefore require approximately 2.8 TB of memory for model states alone. To address the redundancy of model states, Rajbhandari et al. introduced the Zero Redundancy Optimizer (ZeRO) [75], which shards redundant model states across devices and reduces the per-device memory overhead of model states to as little as $\frac{1}{N_d}$ of the original, where $N_d$ denotes the number of devices. The core principle of ZeRO is that each device maintains only a fraction of the model states instead of a complete copy. The implementation of ZeRO has been integrated into Microsoft’s open-source distributed training framework, DeepSpeed [85], making it easily accessible for researchers and developers.
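As a quick back-of-the-envelope check of the 16-bytes-per-parameter estimate above, the breakdown can be tabulated as follows; the helper function is purely illustrative.

```python
def model_state_bytes(num_params: int) -> dict:
    """Model-state memory under Adam with FP16 mixed precision (activations excluded)."""
    return {
        "fp16 parameters":    2 * num_params,
        "fp32 master copy":   4 * num_params,
        "fp16 gradients":     2 * num_params,
        "fp32 first moment":  4 * num_params,
        "fp32 second moment": 4 * num_params,
    }

states = model_state_bytes(175_000_000_000)      # GPT-3, 175B parameters
total = sum(states.values())                     # 16 bytes per parameter
print(f"model states: {total / 1e12:.1f} TB")    # ~2.8 TB, matching the estimate above
# With ZeRO, these states are sharded so that each of the N_d devices holds roughly total / N_d.
```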

Inspired by ZeRO, PyTorch developed Fully Sharded Data Parallel (FSDP) [81]. A report from Hugging Face (https://huggingface.co/blog/deepspeed-to-fsdp-and-back) indicates that the performance of PyTorch FSDP is comparable to that of DeepSpeed ZeRO. Since PyTorch v2.4 (https://github.com/pytorch/pytorch/releases/tag/v2.4.0), the PyTorch team has restructured FSDP and introduced FSDP2, adding further optimizations to enhance performance.
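Using the same one-process-per-GPU launch as the DDP sketch above, wrapping the model with FSDP instead shards the parameters, gradients, and optimizer states across ranks. The encoder stack and batch below are placeholders; in practice an auto-wrap policy is usually supplied so that each transformer block becomes its own sharded unit.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

# Placeholder model; a real setup would pass an auto-wrap policy to FSDP.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
    num_layers=12,
).cuda()
model = FSDP(model)   # ZeRO-3-style sharding of parameters, gradients, and optimizer states
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(4, 512, 1024, device="cuda")
model(x).pow(2).mean().backward()   # parameters are gathered and resharded around each pass
optimizer.step()
dist.destroy_process_group()
```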

Figure 3: Splitting of matrix multiplication.

III-A2 Tensor Parallelism

The rapid increase in model parameters has rendered the memory and computing resources of a single device insufficient to support the training of the entire model. Although ZeRO can partition the model across computing devices, it does not reduce computational workloads on each device. For large models with vast parameters and high computational demands, relying solely on DP becomes inadequate for meeting the training needs. Model parallelism (MP) has emerged as a crucial technique for further improving the scale of deep learning models. MP primarily includes two techniques: tensor parallelism (TP) and pipeline parallelism (PP), which will be discussed in this section and the following section. TP, also referred to as intra-layer parallelism, finely divides the computation within each model layer into smaller tensor operations that can be executed concurrently across multiple devices, thereby effectively distributing both memory overhead and computational workload.

Matrix multiplication can be divided into smaller operations by row or column, allowing partial results to be computed separately and then aggregated to obtain the final outcome, as shown in Fig. 3. In line with this idea, Megatron-LM [76] developed a tensor parallelism approach specifically designed for the transformer architecture. Taking the MLP layer as an example, it consists of two parameter matrices: $W_1$ with shape $(h, 4h)$ and $W_2$ with shape $(4h, h)$, where $h$ denotes the hidden size. $W_1$ is partitioned column-wise, while $W_2$ is partitioned row-wise, so the parameter matrices stored on each device have shapes $W_1'$ of $(h, 4h/N_d)$ and $W_2'$ of $(4h/N_d, h)$. After the forward pass on each device is completed, an all-reduce is performed to obtain the final result.
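The column/row split described above can be verified numerically. The sketch below simulates a tensor-parallel degree of 4 on a single device by slicing the two MLP weight matrices; a plain summation over the simulated shards stands in for the all-reduce, and all sizes are arbitrary toy values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
h, tokens, Nd = 64, 16, 4                  # hidden size, sequence length, simulated TP degree
X  = torch.randn(tokens, h)
W1 = torch.randn(h, 4 * h)                 # split column-wise across ranks
W2 = torch.randn(4 * h, h)                 # split row-wise across ranks

Y_ref = F.gelu(X @ W1) @ W2                # reference: unpartitioned MLP forward pass

block = 4 * h // Nd
partials = []
for i in range(Nd):                        # each loop iteration plays the role of one rank
    W1_i = W1[:, i * block:(i + 1) * block]    # shape (h, 4h / Nd)
    W2_i = W2[i * block:(i + 1) * block, :]    # shape (4h / Nd, h)
    partials.append(F.gelu(X @ W1_i) @ W2_i)   # GeLU is elementwise, so the split is exact

Y_tp = torch.stack(partials).sum(dim=0)    # stands in for the all-reduce across ranks
print(torch.allclose(Y_ref, Y_tp, atol=1e-4))  # True
```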

Megatron-LM TP divides tensors into one-dimensional segments. While this approach reduces the model parameters stored on each device to $\frac{1}{N_d}$ of the total, each device still stores complete intermediate activations. Additionally, the forward and backward computations of each transformer layer involve four all-reduce operations, causing significant communication overhead. To address this, researchers have proposed multidimensional tensor partitioning methods [86], [87], [88], which further reduce the memory overhead on each device and mitigate the frequent communication of 1D TP [89]. In practice, TP is typically applied within nodes rather than across nodes to prevent communication from becoming a bottleneck for training efficiency [90].

III-A3 Pipeline Parallelism

Pipeline parallelism, also known as inter-layer parallelism, partitions the model layers horizontally into multiple stages. Each stage consists of consecutive layers, which are assigned to different devices for sequential execution. Unlike TP, which divides the tensor computations within a layer, PP partitions the model into stages, and each device transfers activations and gradients between adjacent stages during the forward and backward passes. PP not only reduces the memory demands and computational workload on each device but also incurs lower communication overhead. However, a major challenge of PP is device idle time (often referred to as pipeline bubbles). While asynchronous PP schemes [91], [92] can eliminate pipeline bubbles, they offer no guarantee of model convergence [93]. As synchronous PP schemes continue to improve, their performance and efficiency have become comparable to those of asynchronous PP, making them the more favorable option. In the following paragraphs, we discuss several common PP schemes in detail.

In vanilla PP, data can be passed to the next device only after the preceding device completes its calculation, due to the sequential dependence in data flow. This leads to pipeline bubbles, resulting in a waste of resources and diminishing overall efficiency of the system.

GPipe [94] splits each input batch into several micro-batches, which are fed to the model sequentially. After processing each micro-batch, a device immediately passes its results to the next device and begins processing the next micro-batch. This approach significantly reduces pipeline bubbles; at the end of each batch, every device uses the accumulated gradients to perform a synchronous parameter update. To further enhance memory efficiency, GPipe incorporates gradient checkpointing [79], which retains only a subset of intermediate activations and recomputes the rest during backpropagation. However, GPipe still suffers from high memory consumption for intermediate activations.

The GPipe schedule completes the forward passes of all micro-batches before executing the backward passes (F-then-B). Let $N_s$ denote the number of stages in the model, $N_d$ the number of devices, and $N_m$ the number of micro-batches. Under the GPipe schedule, each device must retain the intermediate activations of $N_m$ micro-batches. To reduce memory consumption, researchers introduced the one-forward-one-backward pattern (1F1B) [95], [96]. This scheduling strategy effectively reduces the number of in-flight micro-batches by executing backward passes earlier. With the 1F1B schedule, the pipeline bubble is equal to that of GPipe, while the number of micro-batch activations that must be stored on each device decreases from $N_m$ to $N_d$ (where $N_m$ is typically greater than $N_d$).
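As a rough illustration of this trade-off (assuming uniform per-stage compute, one stage per device, and ignoring communication), the bubble fraction and the peak number of in-flight micro-batch activations can be estimated as below; the helper function is illustrative and not taken from the cited papers.

```python
def pipeline_stats(num_devices: int, num_microbatches: int) -> dict:
    """Bubble fraction of a synchronous pipeline and peak micro-batch activations per device."""
    Nd, Nm = num_devices, num_microbatches
    return {
        "bubble_fraction": (Nd - 1) / (Nm + Nd - 1),   # idle share of total pipeline time
        "gpipe_in_flight_microbatches": Nm,            # F-then-B keeps all micro-batches alive
        "1f1b_in_flight_microbatches": min(Nm, Nd),    # 1F1B keeps at most Nd in flight
    }

print(pipeline_stats(num_devices=4, num_microbatches=16))
# bubble_fraction ~= 0.158; GPipe keeps 16 micro-batches in flight, 1F1B only 4
```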

The aforementioned PP schemes assume that the number of pipeline stages $N_s$ equals the number of devices $N_d$, with each device holding one stage. Narayanan et al. [95] split the stages based on the 1F1B schedule, allowing each device to hold multiple stages. For instance, in Fig. 4, an 8-layer model is partitioned into 4 stages, where each stage consists of two consecutive layers, and each device handles one stage (device 1 holds L1 and L2, device 2 holds L3 and L4, etc.). If the model is instead divided into 8 stages, each device is responsible for the computation of 2 stages. This interleaved 1F1B schedule performs finer-grained scheduling of forward and backward computations, reducing pipeline bubbles and enhancing efficiency. However, the trade-off for this reduction in pipeline bubbles is increased communication and higher activation memory usage.

Figure 4: Device placements for 1F1B and interleaved 1F1B. Interleaved 1F1B partitions the model into finer-grained stages and assigns multiple stages to each device to reduce pipeline bubbles.

In a training iteration, the computational cost of backpropagation is greater than that of forward propagation (generally estimated at about twice as much), resulting in a longer computation time for backpropagation. This is the primary contributor to pipeline bubbles. Qi et al. [97] suggest splitting the backward pass into the computation of activation gradients (denoted as $B$) and the computation of parameter gradients (denoted as $W$). They found that the $B$ pass of the current layer is independent of its $W$ pass and depends only on the $B$ pass of the previous layer. In theory, one could therefore complete the $B$ passes of all layers before performing the $W$ pass of each layer, which allows for the design of more efficient pipeline schedules. One of the proposed schedules (ZB-H1) is similar to 1F1B but offers the flexibility to schedule $W$ after the corresponding $B$ within the same stage, effectively filling bubbles in the schedule. In terms of activation memory usage, ZB-H1 ensures that peak memory usage on all devices does not exceed that of 1F1B. If the memory limit is relaxed to allow peak memory usage beyond 1F1B, and the optimizer post-validation strategy proposed by the authors is applied, zero-bubble scheduling (ZB-H2) can be achieved. ZB-H2 inserts additional $F$ passes during the warm-up phase to fill bubbles before the first $B$, and then rearranges the $W$ passes at the tail to completely eliminate all bubbles in the pipeline schedule.

PP typically suffers from high activation memory usage, and the 1F1B pipeline schedules mentioned above additionally exhibit memory imbalance across devices. To address this, Qi et al. [93] proposed an analytic framework that decomposes existing pipeline schedules into building blocks and found that the peak memory usage of a pipeline schedule is closely related to the lifespan of its building blocks. Based on this, the paper proposes a series of V-shape building blocks, which achieve more balanced peak memory across devices compared with interleaved 1F1B.

In the distributed training of large-scale deep learning models, PP is a commonly used technique. Compared to TP, it only communicates between adjacent stages and has lower communication overhead. Therefore, it is usually applied between nodes to scale up model size [95].

III-A4 Sequence Parallelism

Supporting long sequence lengths in large-scale models is crucial for applying AI to scientific endeavors. Many scientific challenges inherently involve complex analysis of large-scale, high-dimensional data, frequently presented as long sequences. Long-sequence issues are prevalent in fields including structural biology, chemistry, drug development, and atmospheric science. For instance, many proteins consist of hundreds or even thousands of amino acid residues, where the sequence context of each residue can be essential to its 3D structure. Long-sequence support enhances the ability to capture non-local interactions among residues, thereby improving the quality of predictions. Additionally, complex molecules or chemical reactions often necessitate long sequence representations, including molecular formulas, SMILES, reaction pathways, etc. Climate prediction models require analyzing long-term historical and global spatial data, represented in time series or grid formats. Despite the importance of long-sequence support, its implementation presents challenges. Processing long sequences demands significant increases in computational power and memory, with the memory usage of attention mechanisms rising quadratically with sequence length.
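The quadratic term comes from the attention score matrix of shape (batch, heads, seq_len, seq_len); a quick estimate of its size in FP16 is given below (the head count and batch size are chosen purely for illustration).

```python
def attention_score_bytes(batch: int, heads: int, seq_len: int, bytes_per_elem: int = 2) -> int:
    """Memory of the attention score matrix alone, shape (batch, heads, seq_len, seq_len)."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

for seq_len in (2_048, 8_192, 32_768):
    gib = attention_score_bytes(batch=1, heads=32, seq_len=seq_len) / 2**30
    print(f"seq_len={seq_len:>6}: {gib:7.2f} GiB")
# seq_len=  2048:    0.25 GiB
# seq_len=  8192:    4.00 GiB
# seq_len= 32768:   64.00 GiB   (16x more memory for every 4x increase in sequence length)
```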

In traditional distributed training techniques such as DP and TP, model parameters or input data are partitioned and assigned to multiple devices for processing. However, when the model must handle long inputs, these methods can lose efficiency due to memory bottlenecks or computational overhead. Sequence parallelism (SP) is relatively new and differs from DP and MP in that it aims to reduce the memory overhead caused by activations during model training. Although SP strategies all divide the input along the sequence dimension, their specific implementations vary.

Li et al. [98] considered solving the problem from a system perspective by combining ring-style communication with self-attention, allowing the input sequence to be distributed to each device and calculating attention scores across devices. Compared with Megatron-LM TP [76], the proposed method is not limited by the number of attention heads, as long as the sequence length is divisible by the degree of SP.

In Megatron-LM TP, some activations within the transformer layer are not distributed across devices, leading to increased memory overhead. The SP method proposed by Li et al. [98] can help mitigate this issue; however, it requires replicating model parameters and optimizer states across all devices, making it impractical for large-scale model training. Korthikanti et al. [90] found that the LayerNorm and Dropout operations within the transformer layer operate independently along the sequence dimension. Consequently, they modified Megatron-LM to distribute the computational workloads and activations associated with LayerNorm and Dropout along the sequence dimension, without incurring additional communication overhead.

Jacobs et al. proposed Ulysses [99], a method that partitions the input along the sequence dimension across all devices and performs all-to-all communication on the query (Q), key (K), and value (V) matrices before the attention computation. This ensures that each device operates on disjoint attention heads, allowing attention to be computed in a distributed manner. Once the calculations are complete, another all-to-all gathers the results and restores the layout in which inputs are partitioned along the sequence dimension. Ulysses SP and Megatron-LM TP are similar in how they distribute the self-attention computation, and both are constrained by the number of attention heads. A key advantage of Ulysses is that the communication volume of a single device decreases linearly as the SP degree increases, whereas in Megatron-LM it is independent of the TP degree.

Liu et al. [100] developed a ring-style distributed attention computation method in which each device keeps only a portion of the Q, K, and V blocks. During the computation, each device exchanges K and V blocks with its neighboring devices, enabling it to perform computations using blockwise self-attention and feedforward networks (FFN) [101]. In contrast to the approach taken by Li et al. [98], this method can eliminate communication overhead by choosing a proper block size and overlapping communication with computation.

III-B Mixed Precision Training

Mixed precision training is a deep learning technique that leverages data types of varying precision to increase training speed, reduce memory consumption, and decrease computational costs, all while preserving comparable accuracy. In the mixed precision training scheme introduced by Micikevicius et al. [78], model parameters, activations, and gradients are stored in FP16, while an FP32 copy of the parameters is maintained. The FP16 data type uses half the memory of FP32 and offers higher computation speed. Nevertheless, FP16 has a narrower numerical range and lower precision than FP32, which may lead to numerical stability issues during training. To cope with this, the FP16 parameters are used for the forward and backward computations, while the FP32 parameters are used for model updates. Additionally, to mitigate the risk of gradient underflow during backpropagation, loss scaling is commonly applied, which helps to minimize model performance degradation. The mixed precision training process is illustrated in Fig. 5.

Figure 5: FP16 mixed precision training process.

While an additional FP32 copy of the model parameters must be kept, the memory savings achieved by storing FP16 activations in mixed precision training outweigh the cost of these extra FP32 parameters, leading to an overall reduction in memory usage.
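A minimal PyTorch sketch of FP16 mixed precision training with dynamic loss scaling is shown below; the model, optimizer settings, and data are placeholders, and a CUDA device is assumed.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()       # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()             # dynamic loss scaling

for _ in range(3):                               # toy training loop
    x = torch.randn(8, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()            # forward pass runs in FP16 where safe
    scaler.scale(loss).backward()                # scale the loss to avoid gradient underflow
    scaler.step(optimizer)                       # unscale gradients; skip the step on inf/nan
    scaler.update()                              # adjust the loss scale dynamically
    optimizer.zero_grad()
```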

To address the challenges posed by FP16, Google introduced a new floating-point format known as BF16. Although BF16 offers lower precision than FP16, it has a larger numerical range, effectively mitigating numerical overflow during training. This not only enhances computation speed but also accelerates model convergence [102]. Although some large models are trained with FP16 mixed precision, the training report of Bloom-176B [103] suggests that BF16 is a superior alternative that helps avoid numerical issues.

Modern GPUs, such as NVIDIA’s Volta, Turing, and Ampere architectures, feature specialized Tensor Cores designed to accelerate mixed-precision matrix operations. However, BF16 support is limited to certain hardware platforms, such as the Ampere and Hopper architectures. Starting with the Hopper architecture, FP8 Tensor Cores have been added, enabling FP8 mixed precision training, which can further reduce memory consumption and accelerate computation [104], [105].

Mixed precision training is a crucial technique for minimizing memory usage in training LLMs; however, it requires careful management to prevent numerical issues. As hardware and software continue to evolve, the methods and tools for mixed precision training are also advancing, facilitating the training of larger and more intricate models.

III-C Gradient Checkpointing

In the training process of DNNs, the output of each layer (activations) needs to be stored in memory to calculate gradients during backpropagation. When dealing with large-scale models, long input sequences, or large batch sizes, the memory overhead caused by activations can become substantial.

The core idea behind gradient checkpointing is to trade computation for memory. This approach reduces memory usage by retaining only the intermediate activations of certain layers during forward propagation while recalculating the discarded activations when needed during backpropagation. The preserved intermediate activations are referred to as “checkpoints” [79].

(a) w/o GC
(b) w/ GC
Figure 6: Comparison of training w/ and w/o gradient checkpointing (GC).

Gradient checkpointing offers an effective solution to alleviate memory bottlenecks in the training of deep learning models, particularly in memory-constrained environments. It is important to choose checkpoints carefully to achieve an optimal balance between memory usage and computation time. Taking the transformer architecture as an example, it is generally advisable to set a checkpoint every 1 to 2 transformer layers [95].
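As a minimal sketch of this idea in PyTorch (the toy encoder stack, layer sizes, and the choice of checkpointing every other layer are illustrative only):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Toy transformer encoder stack that recomputes every other layer during backward."""
    def __init__(self, num_layers: int = 8, d_model: int = 512):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            torch.nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            if i % 2 == 0:
                # Checkpointed layer: its activations are discarded after the forward pass
                # and recomputed during backpropagation (compute traded for memory).
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x

model = CheckpointedStack()
out = model(torch.randn(2, 128, 512))
out.mean().backward()
```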

The number of parameters and the computational complexity of different components in DNNs often vary. Korthikanti et al. [90] suggest that for transformer models it is unnecessary to checkpoint the entire transformer layer for recomputation. Instead, selective recomputation can be performed by identifying components of the transformer layer whose activations occupy substantial memory yet require relatively little computation to recompute. By setting checkpoints and recomputing selectively, it is possible to reduce memory overhead while minimizing the extra computation.

III-D Offloading

Offloading refers to the process of transferring specific storage tasks from GPU/TPU memory to main memory (CPU memory) or lower-cost storage devices, such as NVMe SSD. This technique alleviates memory overhead through increased data communication and enhances the utilization of hardware resources.
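A minimal sketch of activation offloading using PyTorch’s saved-tensor hooks is shown below (the model and sizes are placeholders; a CUDA device is assumed). Tensors saved for backward are parked in pinned CPU memory during the forward pass and copied back to the GPU when backpropagation needs them, trading PCIe traffic for GPU memory.

```python
import torch
from torch.autograd.graph import save_on_cpu

model = torch.nn.Sequential(                 # placeholder model
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()
x = torch.randn(64, 4096, device="cuda")

# Activations saved for backward are offloaded to pinned host memory and
# fetched back on demand during the backward pass.
with save_on_cpu(pin_memory=True):
    loss = model(x).pow(2).mean()
loss.backward()
```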

SwapAdvisor [106] takes the dataflow graph of a model as input and selects legitimate memory allocations and operator schedules based on the graph. By precisely planning which tensors to swap in and out, and when, SwapAdvisor achieves optimal overlap between computation and communication. The integration of genetic algorithms enables the exploration of the vast search space to find the optimal combination of memory allocation and operator scheduling. In contrast to previous heuristic-based swapping methods [107], [108], SwapAdvisor significantly enhances memory efficiency and computational performance through this dual optimization. Notably, because it does not account for inter-GPU communication, SwapAdvisor has limitations in multi-GPU training scenarios.

Ren et al. [109] introduced ZeRO-Offload, an offloading strategy designed for mixed precision training with the Adam optimizer that integrates seamlessly with MP. It offloads all FP32 model states and FP16 gradients to CPU memory and uses the CPU to compute parameter updates, while the FP16 model parameters remain on the GPU for the forward and backward passes. Swapping is overlapped with computation, hiding most of the communication overhead, and the approach can be combined with ZeRO-2 DP to achieve good scalability.

ZeRO-Infinity [110] extends ZeRO-3 with support for heterogeneous memory, leveraging the full memory hierarchy of the system (GPU, CPU, and NVMe SSD) to enable large-scale model training. By offloading model states to NVMe storage through its Infinity Offload Engine and applying efficient communication optimizations, ZeRO-Infinity achieves superlinear scaling of training throughput. On 32 NVIDIA V100 DGX-2 nodes, it supports Transformer models with up to 32T parameters, roughly 50× the maximum size (around 650B parameters) accommodated by 3D parallelism.
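For illustration, the following is a sketch of a DeepSpeed configuration, written as a Python dictionary, that enables ZeRO stage 3 with CPU optimizer-state offloading and NVMe parameter offloading; the batch sizes and the /local_nvme path are placeholder values that would need to be tuned for a real system.

# Sketch of a DeepSpeed config enabling ZeRO stage 3 with CPU/NVMe offloading.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},                      # FP16 mixed precision training
    "zero_optimization": {
        "stage": 3,                                 # shard parameters, gradients, and optimizer states
        "offload_optimizer": {"device": "cpu",      # keep optimizer states and updates on the CPU
                              "pin_memory": True},
        "offload_param": {"device": "nvme",         # spill parameters to NVMe when not in use
                          "nvme_path": "/local_nvme"},
    },
}

# The dictionary would then be passed to DeepSpeed, e.g.:
# model_engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)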

III-E Gradient Accumulation

Gradient accumulation is an effective technique for training scenarios with limited memory and computing resources, as it allows models to be trained with a larger effective batch size than GPU memory alone would permit. The fundamental idea is to accumulate gradients over multiple small batches until the desired effective batch size is reached, and only then perform a weight update. For instance, if the target batch size of 32 exceeds GPU memory capacity, we can split it into micro-batches of size 8 and accumulate gradients over 4 iterations before performing a single weight update.

Algorithm 1 presents the pseudocode of gradient accumulation, outlining the implementation of this technique in the training process. Gradient accumulation is frequently utilized in hardware environments with limited resources and serves as an important optimization method in deep learning training.

Algorithm 1 Gradient Accumulation.
0:  Samples X, Labels Y, Number of epochs E, Iterations per epoch I, Gradient accumulation steps S
1:  for i from 1 to E do
2:     for j from 1 to I do
3:        output ← model(X_j)
4:        loss ← loss_fn(output, Y_j) / S  // Normalize the loss by the accumulation steps
5:        loss.backward()
6:        if j mod S = 0 then
7:           optimizer.step()
8:           optimizer.zero_grad()
9:        end if
10:     end for
11:  end for
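A minimal PyTorch realization of Algorithm 1 might look as follows; the small linear model, random data, and the assumption that the number of micro-batches is divisible by the accumulation steps are illustrative simplifications.

import torch
import torch.nn as nn

model = nn.Linear(256, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4                                   # S in Algorithm 1

optimizer.zero_grad()
for j in range(1, 33):                            # 32 micro-batches of size 8
    x = torch.randn(8, 256)
    y = torch.randint(0, 10, (8,))
    loss = loss_fn(model(x), y) / accum_steps     # normalize so the summed gradient matches a large batch
    loss.backward()                               # gradients accumulate in .grad across micro-batches
    if j % accum_steps == 0:
        optimizer.step()                          # update with the accumulated gradient
        optimizer.zero_grad()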

III-F Memory Efficient Optimizers

Currently, adaptive optimizers such as Adam [111] and AdamW [112] have become the de facto default for training LLMs. While Adam delivers impressive performance, it incurs a high memory cost from storing its optimizer states, namely the first- and second-moment estimates of the gradients. This results in a memory overhead at least twice the size of the model itself, which poses a significant challenge during pre-training. To support such memory-intensive algorithms, techniques like offloading and sharding are often enabled in practice; however, these methods can increase training latency and decrease efficiency. Consequently, there is strong interest in developing optimizers that use less memory while remaining effective.

Shazeer et al. introduced a memory-efficient adaptive optimizer known as Adafactor [113], which significantly lowers memory requirements while delivering training performance comparable to Adam. For a parameter matrix W ∈ R^{m×n}, Adafactor applies a low-rank factorization to the second-moment estimates of the gradients (storing only row and column statistics), reducing the memory consumption for these estimates from O(m·n) to O(m+n). Meanwhile, Anil et al. proposed SM3 [114], which reduces the memory required for optimizer states through a parameter cover mechanism: parameters are divided into multiple subsets, and each subset maintains only a single variable that approximates the second-order statistics of all parameters within it. SM3 adapts the learning rate in a data-driven manner similar to Adagrad, while avoiding the memory overhead of maintaining parameter-level statistics as Adagrad does. Experiments show that SM3’s training performance is on par with Adagrad and Adam, while its memory overhead is slightly lower than that of Adafactor.
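The following NumPy sketch illustrates the factored second-moment estimate at the heart of Adafactor for a single m×n parameter matrix; it keeps only row and column statistics (O(m+n) memory) and omits Adafactor’s bias correction, update clipping, and relative step sizes.

import numpy as np

def factored_second_moment(G, R, C, beta2=0.999, eps=1e-30):
    """One update of the factored second-moment statistics for gradient G (m x n).

    R (length m) and C (length n) are exponential moving averages of the row and
    column sums of G^2; their outer product divided by sum(R) approximates the
    full m x n second-moment matrix that Adam would store explicitly.
    """
    G2 = np.square(G) + eps
    R = beta2 * R + (1.0 - beta2) * G2.sum(axis=1)   # O(m) memory
    C = beta2 * C + (1.0 - beta2) * G2.sum(axis=0)   # O(n) memory
    V_hat = np.outer(R, C) / R.sum()                 # rank-1 reconstruction, formed only transiently
    return R, C, V_hat

m, n = 1024, 4096
R, C = np.zeros(m), np.zeros(n)
G = np.random.randn(m, n)
R, C, V_hat = factored_second_moment(G, R, C)
update = G / (np.sqrt(V_hat) + 1e-8)                 # simplified preconditioned update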

Adafactor uses non-negative matrix factorization to approximate the second-order statistics of gradients. While this saves memory, it can introduce approximation errors that lead to instability during training; SM3 faces similar challenges. To tackle this issue, Luo et al. introduced the CAME optimizer [115]. It uses the residual between the exponential moving average (EMA) of the updates and the current update to compute a confidence matrix, which then modulates the parameter updates. By applying non-negative matrix factorization to the confidence matrix as well, CAME further decreases memory overhead. Through this confidence-guided strategy combined with non-negative matrix factorization, CAME significantly reduces memory usage while ensuring convergence. In large-scale language model training tasks, CAME has shown superior performance to Adafactor while remaining comparable in memory efficiency.

Chen et al. framed algorithm discovery as a program search and successfully applied it to find optimization algorithms for training deep neural networks, yielding the Lion optimizer [116]. Unlike most adaptive optimizers, Lion tracks only a momentum term and uses a sign operation to compute updates, which reduces memory overhead and yields updates of consistent magnitude. Compared with Adam, AdamW, and Adafactor, Lion performs strongly across multiple tasks and models; notably, in language modeling it surpasses Adafactor, particularly in large-batch training.
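A simplified sketch of the Lion update rule for a single parameter tensor is given below, following the sign-of-momentum form described above; the hyperparameter values are illustrative, and a real optimizer would of course iterate over all parameters.

import torch

def lion_update(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    """One Lion step: only the momentum buffer m is stored per parameter."""
    update = torch.sign(beta1 * m + (1.0 - beta1) * grad)    # sign of interpolated momentum
    param -= lr * (update + weight_decay * param)            # decoupled weight decay
    m.mul_(beta2).add_(grad, alpha=1.0 - beta2)              # update the momentum buffer in place
    return param, m

param = torch.randn(1024, 1024)
m = torch.zeros_like(param)
grad = torch.randn_like(param)
param, m = lion_update(param, grad, m)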

Zhang et al. discovered that the Hessian matrix of the Transformer exhibits an approximate block diagonal structure. Based on this finding, they proposed Adam-mini [117], which partitions the parameters of components such as Q, K, V, and MLP into blocks and assigns a learning rate to each block, reducing the resource requirement by 90% to 99%. In the pre-training tasks of TinyLlama-1B, Llama2-7B, and the GPT series, Adam-mini demonstrated a significant reduction in memory overhead compared to AdamW, achieving savings of 45% to 50%. Additionally, it offered faster convergence speed while maintaining comparable or improved performance, which surpassed that of Adafactor, SM3, and CAME. Adam-mini effectively lowers the memory demands of the optimizer without compromising performance, providing a more efficient solution for training large-scale models.

IV Tailored Memory-Efficient Training Techniques: A Case Study

OpenFold [48] is an open-source reimplementation of AlphaFold 2 that matches its prediction quality while running faster and using less memory on most proteins, improving the ease and performance of training new models and performing large-scale predictions. OpenFold applies various optimizations to AlphaFold 2’s training schedule. It was trained on a cluster of 44 NVIDIA A100 GPUs, each with 40 GB of memory, and enables FP16 mixed precision training to increase training speed while minimizing memory usage. To reproduce the effective batch size used during AlphaFold 2 training as closely as possible, OpenFold implements 3-way gradient accumulation. For distributed training, DeepSpeed [85] with ZeRO stage 2 [75] is employed to distribute the optimizer states across GPUs, reducing redundancy. Additionally, the model was refactored to replace element-wise operations with in-place equivalents wherever possible, minimizing unnecessary memory allocation. Memory-efficient self-attention implementations [118], [119] were also adopted to lower memory usage and speed up attention computation. Furthermore, CUDA kernels for certain model modules were optimized, further decreasing memory overhead.
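As a simple illustration of the in-place rewriting mentioned above (not OpenFold’s actual code), replacing out-of-place element-wise operations with their in-place PyTorch equivalents avoids allocating a new tensor for each intermediate result:

import torch

x = torch.randn(4, 128, 256)
bias = torch.randn(256)

# Out-of-place: each operation allocates a fresh tensor of the same size as x.
y = torch.relu(x + bias)

# In-place: the addition and activation reuse x's storage. This is safe only
# when x is not needed elsewhere, e.g. not required by autograd for backprop.
x.add_(bias).relu_()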

Despite its impressive success in prediction accuracy, AlphaFold 2’s computational and memory costs during training are much higher than those of vanilla transformers, and its architecture is computationally inefficient on GPUs. To address these issues, Cheng et al. proposed FastFold [49], an efficient implementation of AlphaFold 2. FastFold introduces diverse LLM training techniques, significantly reducing the cost of training and inference for AlphaFold 2 models.

Unlike many existing LLMs, AlphaFold 2’s main memory overhead comes from intermediate activations, which require more than 20 GB of memory in each attention module and cannot be distributed well using common methods such as ZeRO DP, PP, or TP. FastFold therefore introduces a model parallel approach focused on reducing activations in AlphaFold 2 training. Dynamic Axial Parallelism (DAP) is a technique tailored to the Evoformer modules in AlphaFold 2: it keeps complete model parameters on each device while partitioning inputs and intermediate activations across devices. The MSA representation and Pair representation within the Evoformer module contain two distinct sequence dimensions, and computation proceeds along only one of these dimensions at a time; DAP therefore partitions along the sequence dimension that is not currently in use. Because the computation alternates between the two sequence dimensions, all-to-all communication is needed to ensure each device holds the data required for its computation. This design distributes activations among devices, significantly reducing the training memory overhead of AlphaFold 2 and enhancing both efficiency and scalability. When the model is trained with DAP across a sufficient number of devices, throughput can be improved further by disabling gradient checkpointing, since DAP already saves enough memory. Compared with the original AlphaFold 2 implementation, FastFold improves training efficiency by 3.91×, and by 2.98× compared with OpenFold.

Although OpenFold applies advanced system-level approaches to performance and memory optimization, the pre-training cost of AlphaFold 2 remains high. While AlphaFold 2 is relatively small, with only 93M parameters, its activation memory usage during training is extremely high. Song et al. [120] designed customized attention kernels for the four attention variants in the Evoformer module to save memory, lowering peak memory consumption by a factor of 13 compared to the OpenFold implementation. When combined with DAP, memory overhead can be reduced further.

Zhu et al. [50] conducted a comprehensive analysis of AlphaFold 2’s training performance and identified that its poor scalability is primarily due to communication imbalance and inefficient computation. To address these issues, they combined DAP with various optimization methods, such as designing non-blocking data pipelines and performing operator fusion, to construct ScaleFold. ScaleFold allows for the training of AlphaFold 2 on 2080 NVIDIA H100 GPUs and shortens the pre-training time to merely 10 hours.

V Future Trends and Challenges

The number of parameters and the data processing capabilities of large-scale models play a crucial role in their ability to address complex scientific problems. However, the memory requirements for training these models often surpass current hardware limitations, creating a significant bottleneck in scaling. Optimizing memory usage during training not only improves efficiency but also opens the door to exploring even larger models, which may further improve performance across tasks. This optimization is vital for scientific models: tasks such as protein structure prediction, molecular generation, and climate prediction demand models that can handle high-dimensional and long-sequence data, and scaling up models can significantly enhance the accuracy and effectiveness of these applications. Moreover, effective memory optimization strategies allow practitioners to train models of the required size using fewer GPUs, yielding substantial savings in cost and energy, lowering the barriers to training large language models, and promoting broader participation from researchers in scientific fields.

For training transformer-based scientific large models, there exist diverse memory optimization methods and effective training libraries, such as DeepSpeed, Megatron-LM, Colossal-AI, FairScale, and Hugging Face Accelerate, all of which support distributed training, mixed precision training, gradient checkpointing, and other advanced techniques. However, our investigation indicates that many studies employing LLMs for scientific research have not fully leveraged these memory-efficient training approaches, so considerable room for optimization remains in large-scale model training for scientific tasks. This survey systematically summarizes the memory optimization techniques used in LLM training, hoping to promote their application in scientific domains and to help researchers fully harness the capabilities of LLMs to accelerate scientific discovery and innovation.

However, for large-scale models based on transformer variants, the memory optimization methods described in this paper, which target standard Transformers, may not be fully applicable. Modules such as AlphaFold 2’s Evoformer, RoseTTAFold’s SE(3)-Transformer, and AlphaFold 3’s Pairformer exhibit notable differences in both memory usage and computational complexity compared to the vanilla Transformer architecture. Optimizing these models often requires domain expertise to develop tailored memory optimization strategies that align with the specific structure of the model. This demands not only a deep understanding of the characteristics of the corresponding scientific tasks but also co-optimization of software and hardware to achieve the best performance. Designing effective memory optimization methods for these models remains an urgent, unsolved challenge.

As model size and task complexity continue to grow, memory-efficient training methods have become a key factor for advancing large-scale models for science. It is crucial to integrate memory optimization techniques with advanced training frameworks, develop customized optimization strategies, and further lower the resource requirements for model training. This represents a significant research direction. Furthermore, future research could focus on creating more versatile guidelines for designing optimization strategies and promoting the widespread use of high-performance computing resources in scientific fields through collaborative software and hardware design.

VI Conclusion

The performance and complexity of LLMs have continuously evolved as their applications expand into scientific fields including biology, medicine, chemistry, and meteorology. However, training large-scale models often incurs significant memory demands and computational overhead, posing tough challenges for existing hardware resources. Memory-efficient training methods not only enhance training efficiency but also support the exploration of larger-scale models by optimizing memory utilization, minimizing redundant computation, and improving resource allocation. This survey provides a systematic summary of the memory optimization techniques that have gained popularity in recent years for large-scale model training. In addition, we use AlphaFold 2 as an illustrative example of customized memory-efficient training approaches tailored to specific task requirements, serving as a useful reference for researchers interested in applying large-scale models to scientific tasks.

While existing memory optimization techniques and training frameworks for transformers have made significant strides in the field of NLP, their application in scientific fields has yet to achieve widespread adoption. Scientific models often encounter specific task requirements, including long-sequence modeling, high-precision numerical calculations, and multimodal data fusion. Therefore, it is essential to choose appropriate methods for memory optimization. Furthermore, future research should aim to develop more flexible and efficient optimization schemes tailored to these unique requirements. It is particularly crucial for large-scale models based on transformer variants, where memory optimization strategies must align with task characteristics and the model’s architecture. This will require researchers to make trade-offs between task demands, model design, and hardware implementation.

References

  • [1] S. Tang and Y. Yang, “Why neural networks apply to scientific computing?” Theoretical and Applied Mechanics Letters, vol. 11, no. 3, p. 100242, 2021.
  • [2] J. Jumper et al., “Highly accurate protein structure prediction with AlphaFold,” Nature, vol. 596, no. 7873, pp. 583–589, 2021.
  • [3] K. Singhal et al., “Large language models encode clinical knowledge,” Nature, vol. 620, no. 7972, pp. 172–180, 2023.
  • [4] K. Singhal et al., “Towards expert-level medical question answering with large language models,” 2023, arXiv:2305.09617.
  • [5] T. Tu et al., “Towards generalist biomedical AI,” NEJM AI, vol. 1, no. 3, p. AIoa2300138, 2024. [Online]. Available: https://ai.nejm.org/doi/full/10.1056/AIoa2300138
  • [6] N. C. Frey et al., “Neural scaling of deep chemical models,” Nature Machine Intelligence, vol. 5, no. 11, pp. 1297–1305, 2023.
  • [7] Y. Fang et al., “Domain-agnostic molecular generation with chemical feedback,” in The Twelfth International Conference on Learning Representations, 2023.
  • [8] J. Ock, C. Guntuboina, and A. Barati Farimani, “Catalyst energy prediction with CatBERTa: Unveiling feature exploration strategies through large language models,” ACS Catalysis, vol. 13, no. 24, pp. 16 032–16 044, 2023.
  • [9] R. Irwin et al., “Chemformer: A pre-trained transformer for computational chemistry,” Machine Learning: Science and Technology, vol. 3, no. 1, p. 015022, 2022.
  • [10] K. Bi et al., “Pangu-Weather: A 3D high-resolution model for fast and accurate global weather forecast,” 2022, arXiv:2211.02556.
  • [11] K. Bi et al., “Accurate medium-range global weather forecasting with 3D neural networks,” Nature, vol. 619, no. 7970, pp. 533–538, 2023.
  • [12] L. Chen et al., “FuXi: A cascade machine learning forecasting system for 15-Day global weather forecast,” npj Climate and Atmospheric Science, vol. 6, no. 1, pp. 1–11, 2023.
  • [13] T. Nguyen et al., “ClimaX: A foundation model for weather and climate,” in International Conference on Machine Learning.   PMLR, 2023, pp. 25 904–25 938.
  • [14] C. Bodnar et al., “Aurora: A foundation model of the atmosphere,” 2024, arXiv:2405.13063.
  • [15] J. Kaplan et al., “Scaling laws for neural language models,” 2020, arXiv:2001.08361.
  • [16] J. Hoffmann et al., “An empirical analysis of compute-optimal large language model training,” in Advances in Neural Information Processing Systems, S. Koyejo et al., Eds., vol. 35.   Curran Associates, Inc., 2022, pp. 30 016–30 030.
  • [17] A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, 2017.
  • [18] T. Brown et al., “Language models are few-shot learners,” in Advances in Neural Information Processing Systems, vol. 33.   Curran Associates, Inc., 2020, pp. 1877–1901.
  • [19] H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023, arXiv:2307.09288.
  • [20] A. Gholami et al., “Ai and memory wall,” IEEE Micro, vol. 44, no. 3, pp. 33–39, 2024.
  • [21] B. Zhuang et al., “A survey on efficient training of transformers,” in International Joint Conference on Artificial Intelligence 2023.   Association for the Advancement of Artificial Intelligence (AAAI), 2023, pp. 6823–6831.
  • [22] Z. Wan et al., “Efficient large language models: A survey,” 2024, arXiv:2312.03863.
  • [23] Y. Bengio et al., “A neural probabilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155, 2003.
  • [24] T. Mikolov et al., “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, C. Burges et al., Eds., vol. 26.   Curran Associates, Inc., 2013.
  • [25] T. Mikolov et al., “Efficient estimation of word representations in vector space,” 2013, arXiv:1301.3781.
  • [26] T. Mikolov et al., “Recurrent neural network based language model,” in Proc. Interspeech 2010, 2010, pp. 1045–1048.
  • [27] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
  • [28] K. Cho et al., “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in EMNLP 2014.   Association for Computational Linguistics, 2014, pp. 1724–1734.
  • [29] J. Devlin et al., “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT 2019, J. Burstein, C. Doran, and T. Solorio, Eds.   Association for Computational Linguistics, 2019, pp. 4171–4186.
  • [30] A. Radford et al., “Improving language understanding by generative pre-training,” 2018.
  • [31] A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • [32] H. W. Chung et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024.
  • [33] L. Ouyang et al., “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems, S. Koyejo et al., Eds., vol. 35.   Curran Associates, Inc., 2022, pp. 27 730–27 744.
  • [34] A. Chowdhery et al., “PaLM: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
  • [35] R. Anil et al., “PaLM 2 technical report,” 2023, arXiv:2305.10403.
  • [36] H. Touvron et al., “LLaMA: Open and efficient foundation language models,” 2023, arXiv:2302.13971.
  • [37] A. Grattafiori et al., “The Llama 3 herd of models,” 2024, arXiv:2407.21783.
  • [38] OpenAI et al., “GPT-4 technical report,” 2024, arXiv:2303.08774.
  • [39] DeepSeek-AI et al., “DeepSeek LLM: Scaling open-source language models with longtermism,” 2024, arXiv:2401.02954.
  • [40] DeepSeek-AI et al., “DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model,” 2024, arXiv:2405.04434.
  • [41] DeepSeek-AI et al., “DeepSeek-V3 technical report,” 2024, arXiv:2412.19437.
  • [42] Team GLM et al., “ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools,” 2024, arXiv:2406.12793.
  • [43] J. Bai et al., “Qwen technical report,” 2023, arXiv:2309.16609.
  • [44] A. Yang et al., “Qwen2 technical report,” 2024, arXiv:2407.10671.
  • [45] M. Baek et al., “Accurate prediction of protein structures and interactions using a three-track neural network,” Science, vol. 373, no. 6557, pp. 871–876, 2021.
  • [46] F. Fuchs et al., “SE(3)-Transformers: 3D roto-translation equivariant attention networks,” in Advances in Neural Information Processing Systems, vol. 33.   Curran Associates, Inc., 2020, pp. 1970–1981.
  • [47] J. Abramson et al., “Accurate structure prediction of biomolecular interactions with AlphaFold 3,” Nature, vol. 630, no. 8016, pp. 493–500, 2024.
  • [48] G. Ahdritz et al., “OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization,” Nature Methods, vol. 21, no. 8, pp. 1514–1524, 2024.
  • [49] S. Cheng et al., “FastFold: Optimizing AlphaFold training and inference on GPU clusters,” in Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’24.   Association for Computing Machinery, 2024, pp. 417–430.
  • [50] F. Zhu et al., “ScaleFold: Reducing AlphaFold initial training time to 10 hours,” in Proceedings of the 61st ACM/IEEE Design Automation Conference, ser. DAC ’24.   Association for Computing Machinery, 2024, pp. 1–6.
  • [51] Z. Lin et al., “Evolutionary-scale prediction of atomic-level protein structure with a language model,” Science, vol. 379, no. 6637, pp. 1123–1130, 2023.
  • [52] B. Chen et al., “xTrimoPGLM: Unified 100B-scale pre-trained transformer for deciphering the language of protein,” 2024, arXiv:2401.06199.
  • [53] T. Hayes et al., “Simulating 500 million years of evolution with a language model,” bioRxiv, 2024.
  • [54] R. Luo et al., “BioGPT: Generative pre-trained transformer for biomedical text generation and mining,” Briefings in Bioinformatics, vol. 23, no. 6, p. bbac409, 2022.
  • [55] D. Driess et al., “PaLM-E: An embodied multimodal language model,” in International Conference on Machine Learning.   PMLR, 2023, pp. 8469–8488.
  • [56] K. Zhang et al., “A generalist vision–language foundation model for diverse biomedical tasks,” Nature Medicine, vol. 30, no. 11, pp. 3129–3141, 2024.
  • [57] P. Wang et al., “OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” in International Conference on Machine Learning.   PMLR, 2022, pp. 23 318–23 340.
  • [58] Z. Chen et al., “MEDITRON-70B: Scaling medical pretraining for large language models,” 2023, arXiv:2311.16079.
  • [59] H. Zhang et al., “HuatuoGPT, towards taming language model to be a doctor,” in Findings 2023.   Association for Computational Linguistics, 2023, pp. 10 859–10 885.
  • [60] J. Zhang et al., “Fengshenbang 1.0: Being the foundation of chinese cognitive intelligence,” 2023, arXiv:2209.02970.
  • [61] J. Chen et al., “HuatuoGPT-II, one-stage training for medical adaption of LLMs,” 2024, arXiv:2311.09774.
  • [62] A. Yang et al., “Baichuan 2: Open large-scale language models,” 2023, arXiv:2309.10305.
  • [63] 01.AI et al., “Yi: Open foundation models by 01.AI,” 2024, arXiv:2403.04652.
  • [64] T. ValizadehAslani et al., “PharmBERT: A domain-specific BERT model for drug labels,” Briefings in Bioinformatics, vol. 24, no. 4, p. bbad226, 2023.
  • [65] L. Chen et al., “PharmaGPT: Domain-specific large language models for bio-pharmaceutical and chemistry,” 2024, arXiv:2406.18045.
  • [66] J. Guo et al., “Automated chemical reaction extraction from scientific literature,” Journal of Chemical Information and Modeling, vol. 62, no. 9, pp. 2035–2045, 2022.
  • [67] Y. Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach,” 2019, arXiv:1907.11692.
  • [68] M. Lewis et al., “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in ACL 2020.   Association for Computational Linguistics, 2020, pp. 7871–7880.
  • [69] S. Black et al., “GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow,” Zenodo, 2021.
  • [70] Z. Liu et al., “Swin Transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
  • [71] Z. Liu et al., “Swin Transformer V2: Scaling up capacity and resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 009–12 019.
  • [72] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021, arXiv:2010.11929.
  • [73] Z. Yang et al., “AlphaFold2 and its applications in the fields of biology and medicine,” Signal Transduction and Targeted Therapy, vol. 8, no. 1, p. 115, 2023.
  • [74] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, H. Wallach et al., Eds., vol. 32.   Curran Associates, Inc., 2019.
  • [75] S. Rajbhandari et al., “ZeRO: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, pp. 1–16.
  • [76] M. Shoeybi et al., “Megatron-LM: Training multi-billion parameter language models using model parallelism,” 2020, arXiv:1909.08053.
  • [77] D. Narayanan et al., “Memory-efficient pipeline-parallel DNN training,” in International Conference on Machine Learning.   PMLR, 2021, pp. 7937–7947.
  • [78] P. Micikevicius et al., “Mixed precision training,” 2018, arXiv:1710.03740.
  • [79] T. Chen et al., “Training deep nets with sublinear memory cost,” 2016, arXiv:1604.06174.
  • [80] N. P. Wedi et al., “The modelling infrastructure of the integrated forecasting system: Recent advances and future challenges,” 2015.
  • [81] Y. Zhao et al., “PyTorch FSDP: Experiences on scaling fully sharded data parallel,” 2023, arXiv:2304.11277.
  • [82] R. Bommasani et al., “On the opportunities and risks of foundation models,” 2022, arXiv:2108.07258.
  • [83] M. Li et al., “Scaling distributed machine learning with the parameter server,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014, pp. 583–598.
  • [84] S. Li et al., “PyTorch Distributed: Experiences on accelerating data parallel training,” Proceedings of the VLDB Endowment, vol. 13, no. 12, pp. 3005–3018, 2020.
  • [85] J. Rasley et al., “DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD ’20.   Association for Computing Machinery, 2020, pp. 3505–3506.
  • [86] Q. Xu and Y. You, “An efficient 2D method for training super-large deep learning models,” in 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2023, pp. 222–232.
  • [87] B. Wang et al., “Tesseract: Parallelize the tensor parallelism efficiently,” in Proceedings of the 51st International Conference on Parallel Processing, 2022, pp. 1–11.
  • [88] Z. Bian et al., “Maximizing parallelism in distributed training for huge neural networks,” 2021, arXiv:2105.14450.
  • [89] S. Li et al., “Colossal-AI: A unified deep learning system for large-scale parallel training,” in Proceedings of the 52nd International Conference on Parallel Processing, ser. ICPP ’23.   Association for Computing Machinery, 2023, pp. 766–775.
  • [90] V. A. Korthikanti et al., “Reducing activation recomputation in large transformer models,” in Proceedings of Machine Learning and Systems, D. Song, M. Carbin, and T. Chen, Eds., vol. 5.   Curan, 2023, pp. 341–353.
  • [91] A. Harlap et al., “PipeDream: Fast and efficient pipeline parallel DNN training,” 2018, arXiv:1806.03377.
  • [92] B. Yang et al., “PipeMare: Asynchronous pipeline parallel DNN training,” in Proceedings of Machine Learning and Systems, A. Smola, A. Dimakis, and I. Stoica, Eds., vol. 3, 2021, pp. 269–296.
  • [93] P. Qi et al., “Pipeline parallelism with controllable memory,” 2024, arXiv:2405.15362.
  • [94] Y. Huang et al., “GPipe: Efficient training of giant neural networks using pipeline parallelism,” in Advances in Neural Information Processing Systems, H. Wallach et al., Eds., vol. 32.   Curran Associates, Inc., 2019.
  • [95] D. Narayanan et al., “Efficient large-scale language model training on GPU clusters using Megatron-LM,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’21.   Association for Computing Machinery, 2021, pp. 1–15.
  • [96] S. Fan et al., “DAPPLE: A pipelined data parallel approach for training large models,” in Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ser. PPoPP ’21.   Association for Computing Machinery, 2021, pp. 431–445.
  • [97] P. Qi et al., “Zero bubble (almost) pipeline parallelism,” in The Twelfth International Conference on Learning Representations, 2024.
  • [98] S. Li et al., “Sequence parallelism: Long sequence training from system perspective,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).   Association for Computational Linguistics, 2023, pp. 2391–2404.
  • [99] S. A. Jacobs et al., “System optimizations for enabling training of extreme long sequence transformer models,” in Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, ser. PODC ’24.   Association for Computing Machinery, 2024, pp. 121–130.
  • [100] H. Liu, M. Zaharia, and P. Abbeel, “Ring attention with blockwise transformers for near-infinite context,” 2023, arXiv:2310.01889.
  • [101] H. Liu and P. Abbeel, “Blockwise parallel transformers for large context models,” in Advances in Neural Information Processing Systems, A. Oh et al., Eds., vol. 36.   Curran Associates, Inc., 2023, pp. 8828–8844.
  • [102] D. Kalamkar et al., “A study of BFLOAT16 for deep learning training,” 2019, arXiv:1905.12322.
  • [103] T. L. Scao et al., “BLOOM: A 176B-parameter open-access multilingual language model,” 2023, arXiv:2211.05100.
  • [104] P. Micikevicius et al., “FP8 formats for deep learning,” 2022, arXiv:2209.05433.
  • [105] H. Peng et al., “FP8-LM: Training fp8 large language models,” 2023, arXiv:2310.18313.
  • [106] C.-C. Huang, G. Jin, and J. Li, “SwapAdvisor: Pushing deep learning beyond the gpu memory limit via smart swapping,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20.   Association for Computing Machinery, 2020, pp. 1341–1355.
  • [107] M. Rhu et al., “vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–13.
  • [108] T. D. Le et al., “TFLMS: Large model support in tensorflow by graph rewriting,” 2019, arXiv:1807.02037.
  • [109] J. Ren et al., “ZeRO-Offload: Democratizing billion-scale model training,” in 2021 USENIX Annual Technical Conference (USENIX ATC 21), 2021, pp. 551–564.
  • [110] S. Rajbhandari et al., “ZeRO-Infinity: Breaking the gpu memory wall for extreme scale deep learning,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’21.   Association for Computing Machinery, 2021, pp. 1–14.
  • [111] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2017, arXiv:1412.6980.
  • [112] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” 2018, arXiv:1711.05101.
  • [113] N. Shazeer and M. Stern, “Adafactor: Adaptive learning rates with sublinear memory cost,” in International Conference on Machine Learning.   PMLR, 2018, pp. 4596–4604.
  • [114] R. Anil et al., “Memory efficient adaptive optimization,” in Advances in Neural Information Processing Systems, H. Wallach et al., Eds., vol. 32.   Curran Associates, Inc., 2019.
  • [115] Y. Luo et al., “CAME: Confidence-guided adaptive memory efficient optimization,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).   Association for Computational Linguistics, 2023, pp. 4442–4453.
  • [116] X. Chen et al., “Symbolic discovery of optimization algorithms,” in Advances in Neural Information Processing Systems, A. Oh et al., Eds., vol. 36.   Curran Associates, Inc., 2023, pp. 49 205–49 233.
  • [117] Y. Zhang et al., “Adam-mini: Use fewer learning rates to gain more,” 2024, arXiv:2406.16793.
  • [118] M. N. Rabe and C. Staats, “Self-attention does not need O(n^2) memory,” 2022, arXiv:2112.05682.
  • [119] T. Dao et al., “FlashAttention: Fast and memory-efficient exact attention with io-awareness,” in Advances in Neural Information Processing Systems, S. Koyejo et al., Eds., vol. 35.   Curran Associates, Inc., 2022, pp. 16 344–16 359.
  • [120] S. L. Song et al., “DeepSpeed4Science Initiative: Enabling large-scale scientific discovery through sophisticated ai system technologies,” 2023, arXiv:2310.04610.