SortedNet: A Scalable and Generalized Framework for Training Modular Deep Neural Networks
Abstract
Deep neural networks (DNNs) must cater to a variety of users with different performance needs and budgets, leading to the costly practice of training, storing, and maintaining numerous user/task-specific models. There are solutions in the literature to deal with single dynamic or many-in-one models instead of many individual networks; however, they suffer from significant drops in performance, lack of generalization across different model architectures or dimensions (e.g. depth, width, attention blocks), heavy model-search requirements during training, and support for only a limited number of sub-models. To address these limitations, we propose SortedNet, a generalized and scalable training solution that harnesses the inherent modularity of DNNs. Thanks to a generalized nested architecture (which we refer to as a sorted architecture in this paper) with shared parameters and its novel update scheme combining random sub-model sampling and a new gradient accumulation mechanism, SortedNet enables training sub-models simultaneously with the main model (without any significant extra training or inference overhead), simplifies dynamic model selection, customizes deployment during inference, and reduces the model storage requirement significantly. The versatility and scalability of SortedNet are validated through various architectures and tasks, including LLaMA, BERT, and RoBERTa (NLP tasks) as well as ResNet and MobileNet (image classification), demonstrating its superiority over existing dynamic training methods. For example, we introduce a novel adaptive self-speculative approach based on sorted training to accelerate large language model decoding. Moreover, SortedNet is able to train up to 160 sub-models at once, achieving at least 96% of the original model’s performance.
1 Introduction

Deep neural networks (DNNs) are increasingly gaining interest and becoming more popular Sarker (2021). This popularity translates into increasing demands and requirements from users that these models must meet. Users pre-train or fine-tune multiple models of various sizes to address the performance and computational needs of their tasks and target devices (with different memory and computational power, whether deployed in the cloud or on edge devices). However, developing, storing, maintaining, and deploying many individual models for a diverse set of users can be very difficult and costly Devvrit et al. (2023). Moreover, in the era of gigantic pre-trained models Devlin et al. (2018); Liu et al. (2019) and large language models Brown et al. (2020); Chowdhery et al. (2022), the computational demands can vary significantly from task to task. Therefore, there is a growing demand for models which can adapt themselves to dynamic conditions, whereas conventional neural networks fail to address such cases Xin et al. (2020); Yu et al. (2018a).
On the other hand, DNNs demonstrate modular architectures along various dimensions, like layers and blocks across depth, and neurons, channels and attention heads along width. This inherent modularity enables the extraction of sub-models with similar shapes to the original model. However, this modularity has not been deployed in regular training methods, and consequently, the performance of the sub-models falls short compared to the main model. Hence, the challenge lies in harnessing the full potential of modularity in deep neural networks, allowing for the efficient utilization of sub-models to enhance their performance and enable their practical deployment in real-world scenarios.
Instead of training individual models, we can leverage the sub-models of DNNs and train them together with the main models to obtain many-in-one networks with sub-models that can be used for different tasks. There are a variety of approaches in the literature for training sub-models Cai et al. (2020); Xin et al. (2020); Hou et al. (2020). These techniques, while effective, have certain shortcomings: they often use a sophisticated training process combined with knowledge distillation (which needs a separately trained teacher model) Hou et al. (2020), require architecture modification Nunez et al. (2023), work for specific architectures only Cai et al. (2019), cannot handle more than a very small number of sub-models, need heavy model search (e.g. neural architecture search) during training or inference Cai et al. (2020), involve redundant sub-model optimization Fan et al. (2019), or show poor performance for the main model or sub-models Xin et al. (2020).
To address these problems, we propose SortedNet, a generalized and scalable training solution to harness the inherent modularity of DNNs across various dimensions. As the name of our method implies, it chooses the sub-models in a sorted manner (a generalized version of nested architectures) within the main model to avoid heavy search during or after training. In contrast to nested models in which smaller sub-models are always totally encapsulated by larger sub-models, our generalized sorted version relaxes the nested constraint but ties the origin of sub-models to the origin of the main model across any target dimension (for more details see section 3.1).
This sorted configuration with shared parameters enforces regular order and consistency in the knowledge learned by sub-models. One option to sort the sub-models is based on their computation and accuracy requirements which will enable us to extract our desired sub-models without requiring extensive search at the test time. The use of a predefined sorting order ensures that each targeted sub-model possesses a unique computation overhead, effectively removing optimization of redundant sub-models from training.
To train the sorted sub-models, we propose a novel updating scheme that combines random sampling of sub-models with gradient accumulation. We applied the SortedNet solution successfully to various architectures and tasks such as the decoder-based LLaMA (13B) large language model Touvron et al. (2023) on the GSM8K Cobbe et al. (2021) mathematical reasoning task, encoder-based BERT Devlin et al. (2018) and RoBERTa Liu et al. (2019) on the GLUE Wang et al. (2018) language understanding tasks, and ResNet He et al. (2015) and MobileNet Sandler et al. (2018) on the CIFAR-10 image classification task. Our comprehensive empirical studies across different architectures, tasks, and dynamicity along various dimensions, ranging from width and depth to attention heads and embedding layers, show the superiority and generalizability of our proposed method over state-of-the-art dynamic training methods. Moreover, SortedNet offers several benefits, including minimal storage requirements and dynamic inference capability (i.e. switching between various computation budgets) during inference.
To summarize, the main contributions of this paper are:
• Introducing a many-in-one solution that configures sub-models in a sorted manner and trains them simultaneously, with some unique aspects such as scalability (training many sub-models), generality (CNNs, Transformers, depth, width), and being search-free (no need to search among sub-models during training or inference), while maintaining competitive performance for the main model.
• To the best of our knowledge, this work is the first attempt towards training many-in-one models along various dimensions simultaneously (rather than at different stages).
• We deploy our sorted training to accelerate the decoding of large language models using a self-speculative approach, which can lead to about 2x inference acceleration for LLaMA 13B.
• Demonstrating theoretical justifications and empirical evidence for the effectiveness of the proposed method. Outperforming state-of-the-art methods in dynamic training on CIFAR10 Krizhevsky et al. (2009) while scaling the number of sub-models to 160 and achieving at least 96% of the original model’s performance. Moreover, showing successful results on dynamic training of the BERT, RoBERTa, and LLaMA models.
2 Related Work
Method | Sub-Models Config | Performance | Anytime | Search-Free | # of Sub-Models | Target Dim. | Model |
Early Exit Xin et al. (2020) | Sorted | Low | ✓ | ✓ | Few | Depth | Transformer |
Layer Drop Fan et al. (2019) | Random | Low | ✗ | ✗ | Many | Depth | Transformer |
DynaBERT Hou et al. (2020) | Sorted | High | ✗ | ✗ | Few | Depth & Width | Transformer |
Once for All Cai et al. (2019) | Nested | High | ✗ | ✗ | Many | Width & Depth & Kernel | CNN |
LCS Nunez et al. (2023) | Arbitrary | High | ✓ | ✓ | Few | Width | CNN |
Slimmable Yu et al. (2018a) | Sorted | Moderate | ✓ | ✓ | Few | Width | CNN |
MatFormer Devvrit et al. (2023) | Sorted | High | ✗ | ✓ | Few | Width | Transformer |
SortedNet (Ours) | Sorted | High | ✓ | ✓ | Many | General | CNN & Transformer |
In this section, we briefly review the most relevant existing works to our SortedNet idea. A summary of these solutions and how they are different from each other can be found in Table 1. For more details, please refer to appendix A.
Slimmable Networks Yu et al. (2018b)
Slimmable networks is a width adjustable training method. It was proposed particularly for CNN architectures and thus, careful consideration of the batch normalization module for various width sizes is necessary. In contrast to slimmable networks, our SortedNet covers more architectures and works in both depth and width dimensions.
Early Exit Xin et al. (2020)
is one of the most popular baseline techniques which adds a classifier to intermediate layers of an already trained model. The parameters of the main model are frozen and the classifiers are updated in a separate fine-tuning process. While this solution is relatively straightforward, the performance of the sub-models lags significantly behind that of the main model.
DynaBERT Hou et al. (2020)
presents a dynamic compression method for pre-trained BERT models, enabling flexible adjustments in model size, both in depth and width, during inference. DynaBERT differs from our work in the following aspects: first, in DynaBERT, only a very few sub-models are functional; second, DynaBERT requires an already-trained teacher model and utilizes knowledge distillation (KD); third, DynaBERT needs a search to find an optimal sub-model; last, DynaBERT is architecture dependent.
Layer-drop Fan et al. (2019)
is a structured dropout training which allows layer pruning at the inference time. Similar to DynaBERT, it is applied to pre-trained language models; however, in contrast to DynaBERT, Layer-drop only targets the depth of neural networks and not their width.
Once-for-All (OFA) Cai et al. (2020)
targets efficient inference across different devices. It first trains a network which supports many sub-models with varying latency/accuracy characteristics; it then searches among the feasible sub-models according to the accuracy and latency requirements of the target device. OFA differs from our solution in several ways: first, it has a progressive training nature in contrast to our stochastic or summation loss; second, it needs a teacher and KD; third, it requires a separate neural architecture search (NAS) at inference time; fourth, OFA is designed for CNN-based models; last, it does not have any particular assumption for configuring sub-models (see Figure 4 for more details).
Learning Compressible Subspace (LCS) Nunez et al. (2023)
is an adaptive compression technique based on training a compressible subspace of neural networks. While LCS does not require any re-training at inference time, this solution has some other limitations: first, it needs double the memory at training time; second, the choices of initial weights and the compression function are unclear and arbitrary (left as a hyper-parameter); third, it is only tried on CNNs; fourth, similar to Layer-drop, the search space of sub-models is huge, which makes the training sub-optimal.
MatFormer Devvrit et al. (2023)
is a pre-training-only many-in-one solution based on a summation loss for Transformer-based models. MatFormer works only along the width dimension of the FFN block in Transformers and cannot handle more than a few sub-models.
3 Proposed Method
3.1 A Generalized and Scalable View
In the related work section, we have discussed several approaches concerning the training of many-in-one networks. These approaches differ in terms of their target architecture, training loss, number of training parameters, the configuration of the sub-models (random, nested, or sorted), the number of trained sub-models, and reliance on search or re-training before deployment. Our SortedNet method can be viewed as a simple, general, and scalable version of these existing solutions. These benefits have mostly resulted from the sorted configuration of sub-models with their shared parameters and our stochastic training. In order to train many-in-one networks, we need to specify a few design choices: first, how to form the sub-models and their configurations; second, what are the target architectures; and third, how to train the sub-models along with the main model.
Designing the sub-models
SortedNet imposes an inductive bias on training based on the assumption that the parameters of sub-models have a concentric architecture tied to the origin along each dimension (which we refer to as a sorted architecture). This sorted configuration with shared parameters enforces a regular order and consistency in the knowledge learned by sub-models (see Figure 1).
Sorted vs. Nested Architectures
In this work, we introduce the term sorted architecture to extend and generalize the concept of nested architectures. In contrast to nested models, in which smaller sub-models are always totally encapsulated by larger sub-models, our sorted sub-models are tied to the origin (starting index) of each dimension independently.
Let’s consider a many-in-one neural network $f(x; \theta)$ with parameters $\theta$ and input $x$, which is comprised of $n$ sub-models $f(x; \theta^{\ast}_i)$, where $\theta^{\ast}_i \subseteq \theta$ represents the weights of the $i$-th sub-model. We define a universal set which contains all unique sub-models: $\Theta = \{\theta^{\ast}_1, \theta^{\ast}_2, \dots, \theta^{\ast}_n\}$.
Setting up an order
Suppose that we would like to target $D$ many-in-one dimensions in our model. Then, let’s start with the full set of parameters $\theta$ and build the sub-models iteratively. In this regard, at iteration $t$ during training, we have sampling and truncation procedures along any of the targeted dimensions:

$\theta^{\ast}_{t} = \bigcap_{d=1}^{D} \theta^{d}_{[1:b^{t}_{d}]}, \qquad b^{t}_{d} \sim P_d, \quad \mathrm{supp}(P_d) \subseteq \{1, 2, \dots, B_d\}$    (1)

where $\theta^{d}_{[1:b^{t}_{d}]}$ indicates that we have truncated $\theta$ along the dimension $d$ from the index 1 up to the index $b^{t}_{d}$ at the iteration $t$. $b^{t}_{d}$ is sampled from a distribution $P_d$ with the support set $\{1, 2, \dots, B_d\}$ to form the sub-model. $B_d$ refers to the maximum index of the dimension $d$. This iterative process is repeated during training and the set of unique sub-models $\Theta$ is built.
To illustrate the process better, let’s consider a simple case where we want to make a many-in-one network across the width and depth dimensions, i.e. $D = 2$. In this case, we have 12 layers and a hidden dimension size of 768. Suppose that $d = 1$ corresponds to depth and $d = 2$ corresponds to width in Eq. 1. For simplicity, let’s use a discrete uniform distribution for sampling indices across these two dimensions. To create the first sub-model ($t = 1$), we need to sample $b^{1}_{1}$ uniformly from the set of natural numbers in the range of 1 to 12: $b^{1}_{1} \sim \mathcal{U}\{1, 12\}$; and we need to sample $b^{1}_{2}$ from the range of 1 to 768: $b^{1}_{2} \sim \mathcal{U}\{1, 768\}$. Bear in mind that we can even choose a subset of these ranges as the support set for the sampling probability distribution. After these two samplings, we will have two truncated sets of parameters: $\theta^{1}_{[1:b^{1}_{1}]}$ and $\theta^{2}_{[1:b^{1}_{2}]}$. The intersection of these two truncated parameters will give us the first sub-model: $\theta^{\ast}_{1} = \theta^{1}_{[1:b^{1}_{1}]} \cap \theta^{2}_{[1:b^{1}_{2}]}$.
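To make the sampling and truncation in Eq. 1 concrete, the following sketch implements the two-dimensional (depth and width) example above. The helper names, the 0-based layer indexing, and the dictionary-of-weight-matrices representation are illustrative assumptions rather than the exact implementation used in our experiments.

```python
import random

# Illustrative sketch of the sampling and truncation in Eq. 1 for D = 2
# (depth and width). Names and data layout are assumptions for illustration.
MAX_DEPTH = 12    # B_1: number of layers
MAX_WIDTH = 768   # B_2: hidden dimension

def sample_truncation_indices():
    """Sample one truncation index per target dimension (Eq. 1)."""
    b_depth = random.randint(1, MAX_DEPTH)   # b_1 ~ U{1, ..., 12}
    b_width = random.randint(1, MAX_WIDTH)   # b_2 ~ U{1, ..., 768}
    return b_depth, b_width

def extract_sub_model(params, b_depth, b_width):
    """Intersect the per-dimension truncations to form one sub-model.

    `params` maps a 0-based layer index to that layer's weight matrix
    (e.g. a torch.Tensor of shape [hidden, hidden]). The sub-model keeps
    layers 0 .. b_depth-1 and, inside each kept layer, only the first
    b_width rows/columns.
    """
    return {idx: w[:b_width, :b_width]
            for idx, w in params.items() if idx < b_depth}
```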
Training Paradigm
Regular training of neural networks concerns improving the performance of the whole model, and usually this training is not aware of the performance of the sub-models. In fact, in this scenario, if we extract and deploy the sub-models of the trained large model on a target task, we would experience a significant drop in the performance of these sub-models compared with the main model. However, in SortedNet, we propose a training method that allows for training sub-models together with the main model in a stochastic way. The SortedNet paradigm leads to the following benefits:
• Search-free sub-model extraction: after training, thanks to the importance sorting of sub-models, the best sub-model for a given budget can be selected without the need for search.
• Anytime: each smaller sub-model is a subset of a larger one, which makes switching between different sub-models efficient. This leads to an important feature of SortedNet, the so-called anytime property: the network can produce a valid output at any stage of its computation.
• Memory efficient: we train a many-in-one network where all sub-models are part of a single checkpoint, which minimizes the storage requirement.
For efficiency purposes, in our training the last layer, e.g. the classification layer, is shared between all sub-models; alternatively, we could add a separate classification layer to each sub-model. For simplicity and efficiency, we chose the former, i.e. a shared classification layer.
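Sharing the classification layer across sub-models of different widths only requires the head to accept truncated features. A minimal sketch of one way to do this is given below; the slicing scheme is an assumption for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SharedClassifier(nn.Module):
    """One shared classification head reused by every width sub-model.

    A sub-model that keeps only the first `active_width` hidden features
    uses the matching slice of the classifier's input weights, so all
    sub-models reuse the same parameters (illustrative assumption).
    """
    def __init__(self, hidden=768, num_classes=2):
        super().__init__()
        self.linear = nn.Linear(hidden, num_classes)

    def forward(self, features, active_width=None):
        if active_width is None:                     # full-width (main) model
            return self.linear(features)
        w = self.linear.weight[:, :active_width]     # [num_classes, active_width]
        return features @ w.t() + self.linear.bias
```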

3.2 SortedNet Algorithm
In this subsection, we describe our proposed training algorithm. For training a SortedNet with $n$ sub-models, at each training iteration a random index $j$ is sampled from a pre-defined distribution: $j \sim P(\cdot)$ with $\mathrm{supp}(P) = \{1, \dots, n\}$. After identifying the target sub-model $\theta^{\ast}_j$ at each iteration, we can use one of the following objectives to update the parameters of the selected sub-model:
• (Stochastic Loss) Only train the selected sub-model $\theta^{\ast}_j$: $\mathcal{L}_t = \mathcal{L}\big(f(x;\theta^{\ast}_j),\, y\big)$, where $\mathcal{L}$ is the loss function for training the model on a given task (e.g. a regular cross-entropy loss) and $y$ refers to the ground-truth labels.
• (Stochastic Summation) Train the sub-model $\theta^{\ast}_j$ and all its targeted sub-models along each dimension. Let’s assume that $\Theta^{\ast}_j$ is the universal set of all targeted sub-models of $\theta^{\ast}_j$. Then the loss function can be defined as: $\mathcal{L}_t = \sum_{\theta \in \Theta^{\ast}_j} \mathcal{L}\big(f(x;\theta),\, y\big)$.
This way, one sub-model or a subset of sub-models is updated in each iteration. Alternatively, one can choose to train all the sub-models at each iteration, which is costly at large scale.
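The following sketch illustrates one possible implementation of a single SortedNet update with either objective. The `model(x, sub_model=j)` interface, which runs only the j-th sorted sub-model over shared parameters, is an assumed hook for illustration rather than a prescribed API.

```python
import random
import torch.nn.functional as F

def sortednet_step(model, batch, optimizer, n_sub_models, use_summation=False):
    """One SortedNet update with either the stochastic or the summation loss.

    `model(x, sub_model=j)` is an assumed hook that runs only the j-th
    sorted sub-model (shared parameters, shared classifier head).
    """
    x, y = batch
    optimizer.zero_grad()

    j = random.randint(1, n_sub_models)       # j sampled uniformly

    if not use_summation:
        # Stochastic loss: update only the sampled sub-model.
        loss = F.cross_entropy(model(x, sub_model=j), y)
    else:
        # Stochastic summation: also include every smaller sorted sub-model.
        loss = sum(F.cross_entropy(model(x, sub_model=k), y)
                   for k in range(1, j + 1))

    loss.backward()
    optimizer.step()
    return loss.item()
```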
3.3 Why Does SortedNet Work?
In Appendix B, we provide theoretical justification for parameter convergence of the sub-models in identically trained scenarios and also provide the performance bound between the trained sub-models and their similar corresponding network trained independently.
Convergence
Suppose $f$ is a sub-model of a larger network, and $g$ is an identical model architecture trained independently. We aim to understand the relationship between the parameters of these two networks, $\theta_f$ for $f$ and $\theta_g$ for $g$, as they are trained under identical conditions. Assuming that the gradients of the loss functions for $f$ and $g$ are $L$-Lipschitz continuous, and the learning rate is $\eta$, we show that

$\|\theta_f^{t+1} - \theta_g^{t+1}\| \;\le\; (1 + \eta L)\,\|\theta_f^{t} - \theta_g^{t}\|$    (2)

This indicates that the difference in the parameters of $f$ and $g$ is governed by the Lipschitz constant $L$ and the learning rate $\eta$, suggesting that the parameters should remain close throughout the training process, especially when the difference between the gradients of the loss functions of the two networks is negligible.
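For intuition, unrolling Eq. 2 over $T$ training steps under the same negligible-gradient-difference assumption gives

$\|\theta_f^{T} - \theta_g^{T}\| \;\le\; (1 + \eta L)^{T}\,\|\theta_f^{0} - \theta_g^{0}\|,$

so with identical initialization ($\theta_f^{0} = \theta_g^{0}$) the two parameter trajectories coincide for as long as the gradient difference remains negligible.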
Performance Bound
Moreover, we would like to find a performance bound between a trained sub-model (with optimized parameters $\theta^{\ast}$) and its corresponding individual model (with optimized parameters $\theta^{\ast\prime}$). Let’s assume that $\theta^{\ast} = \theta^{\ast\prime} + \delta$. We show in Appendix B that the deviation of the function value from its optimal value due to the parameter perturbation $\delta$ is bounded as follows, under the assumption of $L$-Lipschitz continuity of the gradient:

$|\mathcal{L}(\theta^{\ast\prime} + \delta) - \mathcal{L}(\theta^{\ast\prime})| \;\le\; \frac{L}{2}\,\|\delta\|^{2}$    (3)
This result implies that the function value’s deviation grows at most quadratically with the size of the parameter perturbation.
4 Experiments
In this section, we conduct a set of experiments to show the effectiveness and importance of our solution. The details of the hyper-parameters for each experiment can be found in Appendix C.3.
4.1 Is SortedNet Scalable?
To show that our proposed method is scalable, we designed an experiment that trains 160 different models across multiple dimensions (width and depth) all at once. As the baseline, we trained the largest network (a MobileNetV2) and reported the best performance of the model. Because the baseline's performance was poor for all the other sub-models (less than 12%), we trained the classifier layer for 5 more epochs before evaluating each sub-model of the baseline and reported the best performance. As the results in Figure 2-a suggest, our method was able to capture the maximum performance for many of these sub-models in a zero-shot manner. In each cell, we report the performance of the sub-model on top and the recovery percentage with respect to the largest model (in this example, 95.45). Despite sharing the weights across all models, sharing the classifier, and evaluating zero-shot, the proposed method preserved up to 96% of the performance of the largest model, which is highly encouraging. Further training of the classifier for our proposed method leads to even better performance, as shown in Appendix C.5 (between 2 and 15% improvement for different sub-models). In addition, we also sorted the depth and width individually using the proposed method, reported in Figure 2-a as D. Only and W. Only, respectively. Across width, SortedNet successfully preserved up to 99% of the largest network's performance.
4.2 Can we find the best sub-models using SortedNet?
As shown in Figure 2-b, based on the performance of the models in the previous experiment (Figure 2-a), we selected a subset of best-performing networks (width of at least 60% and depth of at least 13 blocks) and retrained the network from scratch using SortedNet to show the success rate of our proposed method. As shown, SortedNet successfully preserved up to 99% of the performance of ordinary training of the largest network. We can also make this selection process fully automated by sorting the performance of all sub-models after evaluation and filtering out the subset of best-performing models that exceed a desired threshold. As can be seen in Figure 2-c, there is a set of sub-models which perform better than 80%. To better understand the pattern, we annotated some of the points with a template that shows, for each model, the corresponding width and depth.
4.3 Can we generalize SortedNet?
Network | Width | FLOPs | NS-IN | LCS-p-IN | SortedNet-IN | NS-BN | LCS-p-BN (aka US) | SortedNet-BN |
cpreresnet20 He et al. (2015) (CIFAR10) | 100% | 301M | 88.67 ± 2.1 | 87.61 ± 2.3 | 89.14 ± 2.1 | 79.84 ± 5.0 | 65.87 ± 2.1 | 85.24 ± 2.3 |
 | 75% | 209M | 87.86 ± 1.6 | 85.73 ± 2.2 | 88.46 ± 2.1 | 78.59 ± 3.4 | 85.67 ± 1.4 | 85.29 ± 3.2 |
 | 50% | 97M | 84.46 ± 2.1 | 81.54 ± 5.3 | 85.51 ± 2.2 | 69.44 ± 4.0 | 65.58 ± 3.1 | 70.98 ± 4.3 |
 | 25% | 59M | 75.42 ± 2.2 | 76.17 ± 1.2 | 75.10 ± 2.6 | 10.96 ± 3.4 | 15.78 ± 3.5 | 12.59 ± 3.2 |
avg. | - | - | 84.10 | 82.76 | 84.55 | 59.70 | 58.22 | 63.52 |
In another experiment, shown in Table 2, we demonstrate the superiority of our stochastic approach compared to state-of-the-art methods such as LCS (shown as LCS-p in the table) Nunez et al. (2023), Slimmable Neural Networks (NS) Yu et al. (2018a), and Universally Slimmable Networks (US) Yu and Huang (2019). To make the comparisons fair, we equalized the number of gradient updates for all models. We also tried to remove the impact of architecture design choices such as the normalization layer. Therefore, we compared the methods under different normalization techniques, namely BatchNorm Ioffe and Szegedy (2015) and InstanceNorm Ulyanov et al. (2016). In addition, we ensured that complementary methods such as knowledge distillation have no impact on the results, as such methods can be applied and improve the results independently of the training method. As shown in the table, SortedNet demonstrates a superior average performance compared to the other methods, indicating its generalization across various settings such as different norms. It is worth noting the unexpected nature of the LCS-p-BN results in Table 2; however, these results are in line with the original LCS paper's observations Nunez et al. (2023) (see Figure 3 of the LCS paper). The LCS authors also hypothesized that this drop is caused by inaccurate batch norm statistics. To address this, they suggested an architectural adjustment to GroupNorm. Our SortedNet approach, on the other hand, remains unaffected by this issue, thus requiring no such modifications.
4.4 Extending Sorted Net to Pre-trained Language Models
Model | MNLI (Acc.) | SST-2 (Acc.) | MRPC (F1) | CoLA (Matthews Corr.) | QNLI (Acc.) | QQP (Acc.) | RTE (Acc.) | STS-B (Pearson) | AVG (Ours) | AVG w/o ours
Sorted-RoBERTa (1L) | 60.07 | 70.76 | 81.22 | 0.00 | 59.64 | 77.80 | 47.65 | 9.36 | 50.81 | 40.33 |
Sorted-RoBERTa (2L) | 71.98 | 80.28 | 81.22 | 0.00 | 81.49 | 87.09 | 47.29 | 70.37 | 64.97 | 40.86 |
Sorted-RoBERTa (3L) | 76.74 | 80.50 | 81.22 | 0.00 | 85.21 | 88.82 | 46.93 | 75.07 | 66.81 | 41.06
Sorted-RoBERTa (4L) | 79.13 | 84.75 | 81.22 | 44.51 | 86.60 | 90.11 | 49.10 | 84.94 | 75.04 | 42.95 |
Sorted-RoBERTa (5L) | 81.14 | 89.91 | 81.22 | 48.41 | 87.88 | 90.86 | 55.96 | 88.22 | 77.95 | 43.80 |
Sorted-RoBERTa (6L) | 82.21 | 92.09 | 86.67 | 53.41 | 88.83 | 91.12 | 67.87 | 89.09 | 81.41 | 46.13 |
Sorted-RoBERTa (7L) | 82.99 | 92.78 | 89.13 | 56.42 | 89.29 | 91.29 | 73.29 | 89.58 | 83.10 | 44.80 |
Sorted-RoBERTa (8L) | 83.33 | 93.23 | 89.78 | 57.22 | 89.40 | 91.29 | 75.09 | 89.67 | 83.63 | 55.17 |
Sorted-RoBERTa (9L) | 83.39 | 92.66 | 89.66 | 58.69 | 89.40 | 91.25 | 77.26 | 89.72 | 84.00 | 61.36 |
Sorted-RoBERTa (10L) | 87.42 | 93.12 | 91.64 | 61.21 | 91.87 | 91.19 | 74.01 | 89.74 | 85.02 | 54.30 |
Sorted-RoBERTa (11L) | 87.34 | 93.35 | 91.45 | 60.72 | 91.74 | 91.17 | 74.01 | 89.72 | 84.94 | 77.48 |
Sorted-RoBERTa (12L) | 83.35 | 92.89 | 90.81 | 59.20 | 89.44 | 91.28 | 76.53 | 89.77 | 84.16 | 86.13 |
avg. | 79.26 | 87.93 | 86.09 | 41.25 | 85.50 | 89.45 | 64.26 | 79.61 | 76.67 | 52.86 |
In this experiment, the goal is to apply SortedNet to a pre-trained transformer model and evaluate the performance on the GLUE benchmark Wang et al. (2018). As the baseline, we chose RoBERTa Liu et al. (2019) to demonstrate the flexibility of our algorithm. In Table 3, we sorted all the layers of RoBERTa-base. As the results demonstrate, our proposed method on average performs better than the baseline by a significant margin (about 23%). However, the largest model has a small drop in performance (less than 2%). It is interesting that the transformer architecture can preserve the performance of sub-models to some extent without additional training; however, our algorithm improves the performance of these sub-models by approximately 10 to 40%. A more complex setting (sorting across BERT models) is investigated in Appendix C.6.
SortedNet Efficient Decoding | ||||||
Stochastic Loss | Summation Loss | |||||
Auto-regressive Decoding | ||||||
Model | Time per Token (ms) | Accuracy | Rejection Ratio | Time per Token (ms) | Accuracy | Rejection Ratio |
Layer 40 (full) | 96.41 | 23.95 | - | 86.10 | 25.24 | - |
Speculative Decoding | ||||||
Draft Model | Time per Token (ms) | Accuracy | Rejection Ratio | Time per Token (ms) | Accuracy | Rejection Ratio |
Layer 12 | 58.86 () | 22.28 | 0.35 | 63.68 () | 16.52 | 0.40 |
Layer 16 | 60.25 () | 23.42 | 0.24 | 63.89 () | 20.92 | 0.30 |
Layer 20 | 65.92 () | 25.09 | 0.16 | 69.61 () | 21.98 | 0.22 |
Confidence-based Early-Exit Varshney et al. (2023) | ||||||
Model | Time per Token (ms) | Accuracy | Rejection Ratio | Time per Token (ms) | Accuracy | Rejection Ratio |
Layer 12:40 | 46.02 () | 20.69 | - | 50.89 () | 24.63 | - |
Sorted Self-Speculative Decoding (Ours) | ||||||
Layer 12:40 | 47.93 () | 23.27 | 0.06 | 65.19 () | 24.79 | 0.07 |
Layer 12:24 | 48.66 () | 24.33 | 0.14 | 63.58 () | 25.17 | 0.19 |
4.5 Accelerating the inference of Decoder-based Large Language Models Using SortedNet


To further show the scalability and generalizability of SortedNet in more practical scenarios, we fine-tuned LLaMA-13B Touvron et al. (2023) on GSM8K Cobbe et al. (2021), one of the challenging mathematical reasoning tasks. We chose the first 12, 16, 20, 24, 28, 32, 36, and 40 layers of LLaMA to build our sub-models. To equalize the number of updates, we trained the model with the stochastic loss 8 times longer than with the summation loss, as we have 8 sub-models and each forward pass of the stochastic loss costs 1/8 of the summation loss. In Table 4, we report the performance of a subset of sub-models and the speed-up gains that can be achieved using different decoding techniques: auto-regressive decoding, speculative decoding Leviathan et al. (2023), confidence-based early exit, an early-exiting approach based on the confidence of the intermediate sub-models of the Sorted model Varshney et al. (2023), and Sorted Self-Speculative Decoding, which adaptively utilizes intermediate sub-models to generate draft tokens in the speculative decoding algorithm (Figure 3). As shown, combining SortedNet and speculative decoding can improve the time per token by up to 2.09 times compared to auto-regressive decoding with the full-size model. In addition, we detail the hyperparameters of each experiment in Appendix C.3, and further analysis is provided in Appendix C.7 to better understand the behavior of the SortedNet methodology.
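To clarify the decoding procedure, the sketch below shows a greedy variant of the sorted self-speculative loop: a shallow sorted sub-model drafts a few tokens and the full-depth model verifies them in a single pass. The `model(ids, num_layers=k)` interface and the fixed draft depth are simplifying assumptions for illustration; the adaptive choice of draft depth and the rejection-sampling acceptance rule of Leviathan et al. (2023) are omitted.

```python
import torch

@torch.no_grad()
def sorted_self_speculative_decode(model, input_ids, draft_layers=12,
                                   full_layers=40, num_draft_tokens=4,
                                   max_new_tokens=128):
    """Greedy sketch of self-speculative decoding with a sorted sub-model.

    Assumed interface (illustration only): `model(ids, num_layers=k)` returns
    next-token logits of shape [batch, seq, vocab] using only the first k
    sorted layers, so the draft and the verifier share one set of weights.
    """
    tokens = input_ids
    while tokens.shape[-1] - input_ids.shape[-1] < max_new_tokens:
        # 1) Draft: cheaply propose a few tokens with the shallow sub-model.
        draft, ctx = [], tokens
        for _ in range(num_draft_tokens):
            logits = model(ctx, num_layers=draft_layers)
            nxt = logits[:, -1:].argmax(dim=-1)          # [batch, 1]
            draft.append(nxt)
            ctx = torch.cat([ctx, nxt], dim=-1)

        # 2) Verify: one full-depth pass over the drafted span; keep the
        #    longest prefix the full model agrees with, then append the
        #    full model's own token at the first disagreement.
        full_logits = model(ctx, num_layers=full_layers)
        accepted = []
        for i, d in enumerate(draft):
            pos = tokens.shape[-1] + i - 1               # logits predicting token pos+1
            verified = full_logits[:, pos].argmax(dim=-1, keepdim=True)
            if torch.equal(verified, d):
                accepted.append(d)
            else:
                accepted.append(verified)
                break
        tokens = torch.cat([tokens] + accepted, dim=-1)
    return tokens
```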
5 Conclusion
In summary, this paper proposes a new approach for training dynamic neural networks that leverages the modularity of deep neural networks to efficiently switch between sub-models during inference. Our method sorts sub-models based on their computation/accuracy characteristics and trains them using an efficient updating scheme that randomly samples sub-models while accumulating gradients. The stochastic nature of our proposed method helps the algorithm generalize better and avoid greedy choices, robustly optimizing many networks at once. We demonstrate through extensive experiments that our method outperforms previous dynamic training methods and yields more accurate sub-models across various architectures and tasks. The sorted architecture of the dynamic model proposed in this work also aligns with sample-efficient inference by allowing easier samples to exit the inference process at intermediate layers; exploring this direction could be an interesting area for future work.
References
- Sarker [2021] Iqbal H Sarker. Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions. SN Computer Science, 2(6):420, 2021.
- Devvrit et al. [2023] Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, and Prateek Jain. Matformer: Nested transformer for elastic inference, 2023.
- Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Liu et al. [2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019.
- Brown et al. [2020] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
- Chowdhery et al. [2022] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways, 2022.
- Xin et al. [2020] Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. Deebert: Dynamic early exiting for accelerating bert inference. arXiv preprint arXiv:2004.12993, 2020.
- Yu et al. [2018a] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable neural networks. arXiv preprint arXiv:1812.08928, 2018a.
- Cai et al. [2020] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-All: Train One Network and Specialize it for Efficient Deployment, April 2020. URL http://arxiv.org/abs/1908.09791. arXiv:1908.09791 [cs, stat].
- Hou et al. [2020] Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: Dynamic bert with adaptive width and depth, 2020.
- Nunez et al. [2023] Elvis Nunez, Maxwell Horton, Anish Prabhu, Anurag Ranjan, Ali Farhadi, and Mohammad Rastegari. Lcs: Learning compressible subspaces for efficient, adaptive, real-time network compression at inference time. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3818–3827, January 2023.
- Cai et al. [2019] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791, 2019.
- Fan et al. [2019] Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556, 2019.
- Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Wang et al. [2018] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
- He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
- Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
- Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- Yu et al. [2018b] Jiahui Yu, Linjie Yang, Ning Xu, Jianchao Yang, and Thomas Huang. Slimmable Neural Networks, December 2018b. URL http://arxiv.org/abs/1812.08928. arXiv:1812.08928 [cs].
- Yu and Huang [2019] Jiahui Yu and Thomas S Huang. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1803–1811, 2019.
- Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. pmlr, 2015.
- Ulyanov et al. [2016] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
- Varshney et al. [2023] Neeraj Varshney, Agneet Chatterjee, Mihir Parmar, and Chitta Baral. Accelerating llama inference by enabling intermediate layer decoding via instruction tuning with lite, 2023.
- Leviathan et al. [2023] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023.
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
- Kingma and Ba [2017] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
Appendix
Appendix A Related Work
Slimmable Networks Yu et al. [2018b]
Slimmable networks is a method for training a single neural network in a way that it can be deployed with adjustable width at inference time. This solution was proposed particularly for CNN architectures and thus careful consideration of the batch normalization module for various width sizes is necessary. In this regard, slimmable networks use switchable batch normalization, which leads to additional trainable parameters. In contrast to slimmable networks, our SortedNet is architecture agnostic and works in both the depth and width dimensions.
Early Exit Xin et al. [2020]
Early exit refers to a technique which adds a classifier to intermediate layers of an already trained neural network. While the parameters of the main model are frozen, the parameters of the classifiers are updated in a separate fine-tuning process. In this approach, each of the classifiers and their subsequent network can be treated as an independent sub-model. While this solution is relatively straightforward, the performance of the sub-models lags significantly behind that of the main model. Also dedicating a separate classification head to each sub-model can significantly increase the memory demand at inference.
DynaBERT Hou et al. [2020]
DynaBERT presents a dynamic compression method for pre-trained BERT models, enabling flexible adjustments in model size, both in depth and width, during inference. While the objective introduced in the DynaBERT paper shares some similarities with our approach, there are several key distinctions. Firstly, in DynaBERT, only a few subsets of the model are functional, whereas our SortedNet does not rely on such an assumption. Secondly, DynaBERT requires an already trained teacher model and utilizes knowledge distillation, whereas our technique operates independently of knowledge distillation. Thirdly, DynaBERT necessitates a search for an optimal sub-model, whereas our solution is inherently search-free. Lastly, DynaBERT's applicability is dependent on the architecture, whereas our approach is architecture-agnostic.
Layer-drop Fan et al. [2019]
Layer-drop is a structured dropout solution at the training time which allows layer pruning at the inference time. Similar to DynaBERT, this solution is applied to pre-trained language models; however, in contrast to DynaBERT, Layer-drop only targets the depth of neural networks and not their width. In Layer-drop, there is no fixed training pattern and any layer can be dropped with a certain probability, which is referred to as drop rate. At the inference time, the number of active layers can be adjusted by the drop-rates that are seen during the training time of that network (i.e. to achieve the best performance on any other drop-rate value, the network needs to be re-trained.). Layer-drop works only in depth while our solution works for both depth and width. Moreover, Layer-Drop requires specific search patterns for dropping layers at the inference time and training time, whereas our solution is search free.
Once-for-All Cai et al. [2020]
Once-for-All (OFA) targets efficient inference across different devices by first training an OFA network which supports many sub-models with varying latency/accuracy characteristics; it then searches among the feasible sub-models according to the accuracy and latency requirements of the target device. OFA has a progressive training nature, i.e. it goes from the largest model to the smaller sub-models. OFA is different from our solution in the following aspects: first, it needs a teacher and knowledge distillation; second, OFA requires a separate Neural Architecture Search (NAS) at inference time; third, OFA is not architecture agnostic (their solution is for CNN-based neural networks while our SortedNet works for both CNNs and Transformers). Moreover, OFA differs from our solution in terms of the sub-model selection strategy: while our SortedNet selects sub-models in a sorted manner, OFA does not have any particular assumption for sorting sub-models (see Figure 4 for more details).

Learning Compressible Subspace Nunez et al. [2023]
Learning Compressible Subspace (LCS) is an adaptive compression technique based on training a compressible subspace of neural networks (using a linear convex combination of two sets of weights for the network). While LCS does not require any re-training at inference time, this solution has some other limitations: first, it needs double the memory at training time; second, the choices of initial weights and the compression function are unclear and arbitrary (left as a hyper-parameter); third, it is only tried on CNNs; fourth, similar to Layer-drop, intermediate sub-models are trained randomly, which makes the performance of the target model sub-optimal.
Appendix B Theoretical Analysis
B.1 Parameter Convergence in Identically Trained Sub-networks
Suppose $f$ is a sub-network within a larger neural network architecture, and $g$ represents an identical network architecture trained independently. We aim to understand the relationship between the parameters of these two networks, $\theta_f$ for $f$ and $\theta_g$ for $g$, as they are trained under identical conditions.
B.2 Assumption of Lipschitz Continuity of Gradients
We assume that the gradients of the loss functions for $f$ and $g$, denoted as $\nabla \mathcal{L}_f$ and $\nabla \mathcal{L}_g$ respectively, are $L$-Lipschitz continuous. This implies that:

$\|\nabla \mathcal{L}_f(\theta_1) - \nabla \mathcal{L}_f(\theta_2)\| \le L\,\|\theta_1 - \theta_2\|, \qquad \|\nabla \mathcal{L}_g(\theta_1) - \nabla \mathcal{L}_g(\theta_2)\| \le L\,\|\theta_1 - \theta_2\|$

for all $\theta_1$ and $\theta_2$ in the parameter space.
B.3 Parameter Update Rule
The parameters of the networks are updated via gradient descent with learning rate $\eta$ as follows:
• For $f$: $\theta_f^{t+1} = \theta_f^{t} - \eta\, \nabla \mathcal{L}_f(\theta_f^{t})$
• For $g$: $\theta_g^{t+1} = \theta_g^{t} - \eta\, \nabla \mathcal{L}_g(\theta_g^{t})$
B.4 Derivation of the Bound
We derive a bound on the difference in parameters between $f$ and $g$ after each training iteration:

$\|\theta_f^{t+1} - \theta_g^{t+1}\| = \big\|\theta_f^{t} - \eta\,\nabla \mathcal{L}_f(\theta_f^{t}) - \theta_g^{t} + \eta\,\nabla \mathcal{L}_g(\theta_g^{t})\big\|$

Applying the triangle inequality and the Lipschitz continuity of the gradients, we obtain:

$\|\theta_f^{t+1} - \theta_g^{t+1}\| \;\le\; (1 + \eta L)\,\|\theta_f^{t} - \theta_g^{t}\| + \eta C,$

where $C$ is a constant that bounds the difference between the gradients of the loss functions of the two networks.
This bound quantifies the evolution of the difference in parameters between $f$ and $g$ across training iterations, incorporating the impact of the Lipschitz constant $L$, the learning rate $\eta$, and the constant $C$ that bounds the inherent difference in gradients.
B.5 Negligible C under Identical Training Conditions
Given that $f$ and $g$ are trained under perfectly identical conditions (same data, initialization, and hyperparameters), the difference in their gradients can be considered negligible, leading us to conclude that $C$ is practically zero. Under this assumption, the bound simplifies significantly:

$\|\theta_f^{t+1} - \theta_g^{t+1}\| \;\le\; (1 + \eta L)\,\|\theta_f^{t} - \theta_g^{t}\|$

This indicates that the difference in the parameters of $f$ and $g$ is governed by the Lipschitz constant $L$ and the learning rate $\eta$, suggesting that the parameters should remain close throughout the training process, especially when $C$ is negligible.
B.6 Performance Bound
We would like to find a performance bound between a trained sub-model (with optimized parameters $\theta^{\ast}$) and its corresponding individual model (with optimized parameters $\theta^{\ast\prime}$). Let’s assume that $\theta^{\ast} = \theta^{\ast\prime} + \delta$. Then the performance bound can be calculated as the deviation in the function value $\mathcal{L}$ from its optimal value due to the parameter perturbation $\delta$.
Step 1: Second-Order Taylor Expansion. Applying the second-order Taylor expansion to the function $\mathcal{L}$ around the optimum $\theta^{\ast\prime}$, we get:

$\mathcal{L}(\theta^{\ast\prime} + \delta) \approx \mathcal{L}(\theta^{\ast\prime}) + \nabla \mathcal{L}(\theta^{\ast\prime})^{\top} \delta + \tfrac{1}{2}\, \delta^{\top} H(\theta^{\ast\prime})\, \delta$

Bear in mind that $\delta = \theta^{\ast} - \theta^{\ast\prime}$ and $H(\theta^{\ast\prime})$ refers to the Hessian matrix of $\mathcal{L}$ at $\theta^{\ast\prime}$.
Step 2: Optimum Condition. Since $\theta^{\ast\prime}$ is an optimum, $\nabla \mathcal{L}(\theta^{\ast\prime}) = 0$, so the first-order term vanishes.
Step 3: Lipschitz Continuity of Gradient. The $L$-Lipschitz continuity of the gradient implies that the curvature is bounded: $\|H(\theta^{\ast\prime})\| \le L$.
Step 4: Bounding the Hessian Term. Therefore, $\tfrac{1}{2}\, \delta^{\top} H(\theta^{\ast\prime})\, \delta \le \tfrac{L}{2}\, \|\delta\|^{2}$.
Step 5: Estimating the Deviation in Function Value. The deviation in the function value from its optimal value due to a parameter perturbation $\delta$ is bounded by

$|\mathcal{L}(\theta^{\ast\prime} + \delta) - \mathcal{L}(\theta^{\ast\prime})| \;\le\; \tfrac{L}{2}\, \|\delta\|^{2}$

under the assumption of $L$-Lipschitz continuity of the gradient. This result implies that the function value’s deviation grows at most quadratically with the size of the parameter perturbation.
Appendix C More Experimental Details
C.1 Ablation Study
Convergence (Training Time) Analysis





Being sorted and randomly selecting one sub-model at a time from a predefined set of sub-models empowers SortedNet with a higher convergence rate and a faster training time. Figure 5 empirically supports this claim and compares the training convergence of SortedNet against LCS_p, which, to the best of our knowledge, stands as the most recent state-of-the-art method. Since LCS_p uses a summation loss over four sub-models at every training step, to have a fair comparison we report the performance of SortedNet for different gradient accumulation values (G), where G=4 provides a fair comparison with LCS_p. As shown in the figure, SortedNet with G=4 converges either faster than or competitively with LCS_p across the different sub-models. Moreover, SortedNet does not require any for-loop in its implementation, which allows parallel computation and results in a faster running time. We empirically investigated this feature and found that, in the same setting, SortedNet requires roughly one-third less training time than LCS_p (see Appendix C.4 for details).
The impact of gradient accumulation
The goal of this experiment is to examine the impact of gradient accumulation (G) on the performance of SortedNet within an equal number of parameter updates. Table 5 presents the accuracies obtained for four different gradient accumulation values. To ensure an equal number of updates, the maximum number of epochs is adjusted for each scenario, e.g. G=4 receives four times more epochs than G=1. As the results show, increasing the gradient accumulation value results in higher performance for SortedNet. This observation can be attributed to the increase in training stochasticity when gradient accumulation is raised. Consequently, each sub-model in SortedNet contributes more equally to the updating of weight parameters, leading to a faster convergence rate. More details are provided in Appendix C.2.
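A sketch of the accumulation schedule studied here is given below; as before, the `model(x, sub_model=j)` interface is an assumption for illustration, and the batching details are illustrative.

```python
import random
import torch.nn.functional as F

def sortednet_accumulated_step(model, micro_batches, optimizer, n_sub_models, G=4):
    """Accumulate gradients over G independently sampled sub-models before a
    single optimizer step (the schedule ablated in Table 5).

    `model(x, sub_model=j)` is the same assumed interface as in Section 3.2;
    each micro-step may use a different batch and a different sub-model.
    """
    optimizer.zero_grad()
    total = 0.0
    for g in range(G):
        x, y = micro_batches[g % len(micro_batches)]
        j = random.randint(1, n_sub_models)             # fresh sub-model per micro-step
        loss = F.cross_entropy(model(x, sub_model=j), y) / G  # average over G
        loss.backward()                                 # gradients accumulate in .grad
        total += loss.item()
    optimizer.step()
    return total
```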
C.2 Effect of gradient accumulation on SortedNet performance
Grad. Accum. | Num. Updates | Epochs | Acc. @ 100% | Acc. @ 75% | Acc. @ 50% | Acc. @ 25% | Avg.
G=1 | 200 | 200 | 84.94 | 84.92 | 82.54 | 71.03 | 80.85
G=2 | 200 | 400 | 86.69 | 86.68 | 84.40 | 72.36 | 82.53
G=3 | 200 | 600 | 87.37 | 87.50 | 84.57 | 73.00 | 83.11
G=4 | 200 | 800 | 87.93 | 87.40 | 84.27 | 76.23 | 83.95
It is of interest to explore whether limiting the number of parameter updates is a suitable approach for investigating the influence of gradient accumulation on SortedNet. One possible way to verify this factor is by running SortedNet with different gradient accumulation values while keeping the number of updates fixed. To that end, we consider the same settings as in Table 5 and repeat the experiment while fixing the maximum number of training epochs. By fixing this value and increasing the gradient accumulation value, we implicitly decrease the number of parameter updates. Table 6 reports the results. Comparing the results of these two tables, it is clear that the number of updates plays a significant role in the model's performance. For instance, when considering G=4, an average performance drop of approximately 2% is observed across all sub-models. This reduction indicates that the underlying model needs more training time for higher values of G.
Grad. Accum. | Num. Updates | Epochs | Acc. @ 100% | Acc. @ 75% | Acc. @ 50% | Acc. @ 25% | Avg.
G=1 | 200 | 200 | 84.94 | 84.92 | 82.54 | 71.03 | 80.85
G=2 | 100 | 200 | 85.01 | 85.12 | 82.24 | 70.65 | 80.75
G=3 | 66 | 200 | 85.09 | 85.06 | 82.64 | 73.74 | 81.63
G=4 | 50 | 200 | 86.05 | 86.06 | 83.66 | 73.00 | 82.19
C.3 Hyperparameters
This section provides an overview of the hyperparameters and experimental configurations, detailed in Table 7.
Model | Parameter | Value |
BERT-Base | ||
Optimizer | AdamW | |
Warmup Ratio | 0.06 | |
Dropout | 0.1 | |
LR Scheduler | Linear | |
Batch Size | 32 (RoBERTa) / 8 (BERT) |
Epochs | 30 (RoBERTa) / 3, 6 (BERT) |
Learning Rate (LR) | 2e-5 (RoBERTa) / 6e-6 (BERT) |
Weight Decay | 0.1 | |
Max Sequence Length | 512 | |
Seeds | [10, 110, 1010, 42, 4242] | |
GPU | Tesla V100-PCIE-32GB | |
MobileNetV2 | ||
Model | "google/mobilenet_v2_1.4_224" | |
Optimizer | AdamW | |
LR Scheduler | Linear | |
Batch Size | 128 | |
Seeds | 4242 | |
Epochs | 60 Models | |
GPU | 8 Tesla V100-PCIE-32GB | |
cPreResNet20 | ||
Optimizer | SGD | |
Criterion | Cross Entropy | |
LR Scheduler | cosine_lr | |
Batch Size | 128 | |
Seeds | [40,42,1010,4242] | |
Momentum | 0.9 | |
Weight Decay | 0.0005 | |
LR | 0.1 | |
Epochs | [200,400,600,800] | |
Gradient Accumulation | [1,2,3,4] |
C.4 Details of training time comparison
To empirically compare the training time between SortedNet and LCS_p, the elapsed time per epoch for five epochs is recorded independently for each method. We then ignore the first epoch to reduce the impact of first-time loading and initialization. Next, for each method we take the average of the remaining elapsed times (in seconds). As mentioned in Subsection C.1, SortedNet with G=4 can be considered a fair comparison with LCS_p. As a result, each epoch in LCS_p holds four times the significance of a SortedNet epoch in terms of the total number of parameter updates. Therefore, we simply multiply the average SortedNet epoch time by a factor of four to equalize their impact in terms of the total number of parameter updates. After doing so, the equalized SortedNet time is still almost one-third less than the LCS_p time.
C.5 Can we improve the performance of SortedNet by adjusting the classifier layer?
In Figure 2-a, as mentioned before, we adjusted the classifiers for the baseline but not for SortedNet. Therefore, as an additional experiment, we analyze the impact of adjusting the classifier on the performance of our proposed method as well. As in the previous experiment, we trained the classifier layer for 5 epochs for each sub-model and reported the performance. As shown in Figure 6, the gain is much higher for the smaller networks than for the large ones. The shared SortedNet classifier already does a good job for all sub-models without additional computation overhead, but further adjustment can be beneficial, as shown.

C.6 Can we extend SortedNet to complex dimensions?
Acc. | F1 | Acc. | Acc. | Matthews Corr. | Spearman Corr. | F1 | Acc. | ||||
Model | Flops | Weights | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | avg. |
Random Initialized Networks | |||||||||||
78.96 G | 62.13 ± 0.27 | 68.74 ± 0.70 | 61.24 ± 0.45 | 79.89 ± 0.59 | 0.00 ± 0.00 | 12.92 ± 0.63 | 78.67 ± 0.41 | 54.51 ± 1.11 | 52.26 |
pre-trained Baselines
* | 22.36 G | 84.0 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.5 |
* | 78.96 G | 86.3 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 81.55 |
Paper Setting
22.36 G | 84.22 ± 0.32 | 87.37 ± 0.08 | 91.47 ± 0.21 | 92.61 ± 0.15 | 54.90 ± 0.79 | 88.08 ± 0.49 | 86.91 ± 0.82 | 62.96 ± 2.36 | 81.07 |
78.96 G | 86.32 ± 0.09 | 88.36 ± 0.07 | 92.01 ± 0.29 | 93.21 ± 0.42 | 59.39 ± 1.45 | 88.65 ± 0.33 | 88.67 ± 0.75 | 68.23 ± 1.59 | 83.11 |
Extracted Networks
22.36 G | 77.43 ± 0.08 | 84.88 ± 0.15 | 84.74 ± 0.34 | 84.98 ± 0.47 | 12.17 ± 1.62 | 78.33 ± 4.11 | 79.44 ± 0.93 | 55.23 ± 1.08 | 69.65 |
Proposed Methods
Sorted | 22.36 G | 76.20 ± 0.02 | 83.58 ± 0.16 | 83.91 ± 0.18 | 83.26 ± 0.69 | 0.08 ± 0.18 | 70.75 ± 9.25 | 80.75 ± 1.29 | 52.85 ± 2.53 | 66.42 |
Sorted | 78.96 G | 85.93 ± 0.33 | 87.28 ± 0.14 | 91.58 ± 0.33 | 93.17 ± 0.26 | 57.08 ± 1.91 | 88.18 ± 0.68 | 87.06 ± 1.02 | 65.56 ± 1.41 | 81.98 |
Sorted | 22.36 G | 77.48 | 85.16 ± 0.02 | 84.96 ± 0.23 | 86.01 ± 0.62 | 12.58 ± 2.04 | 79.29 ± 2.80 | 78.96 ± 0.44 | 55.81 ± 1.37 | 70.03 |
Sorted | 78.96 G | 86.12 | 88.26 ± 0.01 | 92.18 ± 0.28 | 93.49 ± 0.21 | 59.84 ± 1.35 | 88.85 ± 0.44 | 88.88 ± 1.10 | 68.45 ± 2.11 | 83.26 |
In this section, we are interested in investigating whether SortedNet is applicable to more complex dimensions beyond width and depth. For example, can we utilize SortedNet for sorting the attention heads Vaswani et al. [2017]? To achieve this, we conducted an experiment on BERT-large Devlin et al. [2019] in which we tried to sort the information across multiple dimensions at once, including the number of layers, the hidden dimension, and the number of attention heads. In other words, we tried to sort information over BERT-large and BERT-base, as BERT-base can be seen as a subset of BERT-large and therefore respects the nested property. In Table 8, in addition to the performance of BERT-base and BERT-large reported in the original paper Devlin et al. [2019], we report the performance of these models in our experimental setting. The performance of a randomly initialized BERT-base is reported as well. We also extracted a BERT-base from a BERT-large model and report the performance of this model in the same table. Additionally, we indicate the number of training updates spent on each objective function next to each model. For example, in the last row, we trained our Sorted model approximately half of the time on the objective function of BERT-base and the other half on the objective function of BERT-large, in an iterative random manner as introduced in Section 3. The learned BERT-base performance with these methods is still around 10% behind a pre-trained BERT-base, but we argue that this is the value of pre-training; to investigate this impact, one should apply SortedNet during pre-training, which we leave for future research. However, the performance of the learned BERT-large is on par with an individually trained BERT-large, which suggests that sharing the weights does not necessarily have a negative impact on learning. It seems, however, that the key to achieving similar performance is keeping the number of updates for each objective the same as in the individual training of BERT-large and BERT-base.
C.7 What is the impact of Sorting?
In order to better understand the impact of sorting information, we designed an experiment that compares the dependency order of all the neurons in a sorted network. To keep the experiment simple, we designed a one-layer neural network with a hidden layer of 10 neurons (hidden dimension) over 2-dimensional inputs, and a classifier layer on top that maps the hidden representation to the probabilities of a 4-class problem. The task is to predict whether a 2D point belongs to a specific class on a synthetically generated dataset. We trained both the Sorted network and the ordinary one for 10 epochs and optimized the networks using the Adam optimizer Kingma and Ba [2017] with a learning rate of 0.01 and a batch size of 16.
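A minimal sketch of this toy setup is shown below; the data loader and the masking-based way of activating only the first k neurons are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """10-neuron hidden layer over 2-D inputs with a shared 4-class head."""
    def __init__(self, hidden=10, num_classes=4):
        super().__init__()
        self.hidden = nn.Linear(2, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x, active=None):
        h = torch.relu(self.hidden(x))
        if active is not None:          # keep only the first `active` neurons
            mask = torch.zeros_like(h)
            mask[:, :active] = 1.0
            h = h * mask
        return self.classifier(h)

def sorted_train(model, loader, epochs=10, lr=0.01):
    """Sorted training of the toy network: each step samples a prefix length
    k in 1..10 and updates the network as if only the first k hidden neurons
    were active, so earlier neurons learn to be useful on their own."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            k = int(torch.randint(1, 11, (1,)))   # active prefix length
            loss = F.cross_entropy(model(x, active=k), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```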

Active Neurons | Network Accuracy | SortedNet Accuracy
Each cell shows the mean accuracy over five random seeds (individual seed runs in parentheses).
Baseline | |
XXXXXXXXXX | 94.48 (94.40, 94.50, 94.50, 94.30, 94.70) | 93.70 (93.40, 93.20, 94.30, 93.70, 93.90)
OOOOOOOOOO | 25.00 | 25.00
Target Order | |
XXXXXXXXXO | 93.86 (94.20, 93.60, 94.10, 93.40, 94.00) | 93.64 (93.30, 93.00, 94.40, 93.60, 93.90)
XXXXXXXXOO | 93.70 (93.60, 94.60, 94.30, 92.40, 93.60) | 93.78 (93.30, 93.50, 94.30, 93.70, 94.10)
XXXXXXXOOO | 93.10 (92.30, 94.00, 93.50, 93.50, 92.20) | 93.62 (93.30, 92.80, 94.30, 93.70, 94.00)
XXXXXXOOOO | 92.34 (91.60, 92.60, 94.10, 92.10, 91.30) | 93.70 (94.60, 92.30, 94.10, 93.70, 93.80)
XXXXXOOOOO | 90.38 (91.90, 91.90, 91.20, 89.70, 87.20) | 93.86 (94.50, 92.60, 94.10, 94.00, 94.10)
XXXXOOOOOO | 86.84 (92.80, 91.90, 85.50, 92.40, 71.60) | 93.66 (94.40, 92.10, 93.70, 94.00, 94.10)
XXXOOOOOOO | 80.14 (93.60, 82.40, 69.40, 89.20, 66.10) | 93.14 (94.10, 90.20, 94.50, 93.20, 93.70)
XXOOOOOOOO | 60.96 (50.60, 78.40, 43.00, 56.50, 76.30) | 85.26 (93.50, 57.80, 91.40, 90.60, 93.00)
XOOOOOOOOO | 48.66 (49.20, 50.90, 46.30, 48.10, 48.80) | 62.58 (48.30, 64.40, 62.10, 66.70, 71.40)
avg. | 82.22 | 89.25
Reverse Order | |
OOOOOOOOOX | 51.70 (49.50, 58.80, 48.80, 48.20, 53.20) | 29.84 (25.00, 50.50, 23.70, 25.00, 25.00)
OOOOOOOOXX | 85.52 (86.00, 92.00, 89.90, 66.60, 93.10) | 35.18 (25.00, 85.30, 24.90, 15.70, 25.00)
OOOOOOOXXX | 90.82 (91.90, 88.10, 93.20, 91.30, 89.60) | 41.52 (25.00, 85.80, 33.50, 16.00, 47.30)
OOOOOOXXXX | 86.70 (89.20, 79.90, 86.40, 86.90, 91.10) | 59.88 (48.40, 85.40, 61.50, 56.80, 47.30)
OOOOOXXXXX | 90.92 (93.10, 91.40, 93.50, 86.50, 90.10) | 66.06 (48.50, 89.40, 89.50, 53.80, 49.10)
OOOOXXXXXX | 92.94 (93.60, 93.20, 93.30, 92.40, 92.20) | 65.52 (48.40, 87.50, 90.80, 51.60, 49.30)
OOOXXXXXXX | 93.52 (93.70, 93.70, 93.20, 93.80, 93.20) | 72.14 (57.00, 90.30, 93.00, 68.80, 51.60)
OOXXXXXXXX | 93.84 (93.70, 94.00, 93.70, 93.30, 94.50) | 82.06 (90.60, 91.50, 91.60, 68.10, 68.50)
OXXXXXXXXX | 94.04 (93.70, 94.00, 94.50, 94.20, 93.80) | 82.20 (93.20, 92.80, 92.70, 62.60, 69.70)
avg. | 86.67 | 59.38
As can be seen in Table 9, the performance of different orders under the ordinary neural network training paradigm varies, and unfortunately there is no specific pattern to it. Therefore, if one searches the whole space of different orders (from neuron 1 to neuron n, from neuron n to neuron 1, or even selecting a subset of neurons with other strategies, e.g. activating every other neuron, XOXOXOXOXO, for a half-size network), one might find alternatives that work even better than the desired target order. In this example, the reverse order on average performs better than the target order under ordinary training (86.67% versus 82.22%). However, with the proposed method, the target-order performance is consistently much better than the reverse order (89.25% versus 59.38%). This means we have been able to enforce the desired target order using our proposed method. For example, neuron 2 is more dependent on neuron 1 in SortedNet than in ordinary training. As another example, the last 5 neurons are more dependent on the first 5 neurons than the other way around: the performance of the first five neurons is 93.86%, while the performance of the last five neurons is only 66.06% in SortedNet. In other words, the gain of adding the last five neurons is quite marginal (and they are most probably prunable), while the first 5 neurons contain most of the valuable information. It is of interest to further investigate the dependency of neurons on one another with other metrics, which we leave for future research.