
Greedy Network Enlarging

Chuanjian Liu
Huawei Noah’s Ark Lab
liuchuanjian@huawei.com
Kai Han
Huawei Noah’s Ark Lab
kai.han@huawei.com
An Xiao
Huawei Noah’s Ark Lab
an.xiao@huawei.com
Yiping Deng
Huawei Technologies Co., Ltd.
yiping.deng@huawei.com
Wei Zhang
Huawei Noah’s Ark Lab
wz.zhang@huawei.com
Chunjing Xu
Huawei Noah’s Ark Lab
xuchunjing@huawei.com
Yunhe Wang
Huawei Noah’s Ark Lab
yunhe.wang@huawei.com
Abstract

Recent studies on deep convolutional neural networks present a simple paradigm of architecture design, i.e., models with more MACs typically achieve better accuracy, such as EfficientNet and RegNet. These works enlarge all the stages in a model with one unified rule derived by sampling and statistical methods. However, we observe that some network architectures have similar MACs and accuracies, but their allocations of computations across stages are quite different. In this paper, we propose to enlarge the capacity of CNN models by improving their width, depth and resolution at the stage level. Under the assumption that the top-performing smaller CNNs are a proper subcomponent of the top-performing larger CNNs, we propose a greedy network enlarging method based on the reallocation of computations. By modifying the computations of different stages step by step, the enlarged network is equipped with an optimal allocation and utilization of MACs. On EfficientNet, our method consistently outperforms the original scaling method. In particular, by applying our method to GhostNet, we achieve state-of-the-art 80.9% and 84.3% ImageNet top-1 accuracies under the settings of 600M and 4.4B MACs, respectively.

1 Introduction

Convolutional neural networks (CNNs) deliver state-of-the-art accuracy in many computer vision tasks such as image classification [18, 10], object detection [26] and image super-resolution [16]. Most deep CNNs are designed with a predefined number of parameters and computational complexity. For example, ResNet [10] mainly consists of 5 versions with 18, 34, 50, 101 and 152 layers. These CNNs have provided strong baselines for visual applications.

To further improve accuracy, the most common way is to scale up the base CNN model. Three factors, namely depth, width and input resolution, heavily affect the model size. A number of works propose to scale models by depth [10, 29], width [37] or input image resolution [14]. These works consider only one dimension among depth, width and resolution, which leads to an imbalanced utilization of computations, measured in multiply-accumulate operations (MACs). Simultaneously enlarging the width, depth and resolution provides a more flexible design space for finding high-performance models. Recently, several works have focused on how to efficiently scale the three factors. EfficientNet [31] constructed one compound scaling formula to constrain the network width, depth and resolution. RegNet [23] studied the relationship between width and depth by exploring network design spaces. These methods utilize a unified principle to scale the whole model, but ignore stage-wise differences.

Figure 1: ImageNet classification results of our method. The black dashed line is the original EfficientNet series. The blue dashed line is the searched S-EfficientNet and the blue solid line is S-EfficientNet with the relabel trick. The red circle and triangle show the performance of the GhostNet-based architectures.
Figure 2: MACs of different stages of CNN models. The left panel shows the MACs of the ResNet series; the right panel shows the MACs of the EfficientNet series.

Here we rethink the procedure of enlarging CNN models from the viewpoint of stage-wise computation resource allocation. Modern CNNs usually consist of several stages, where one stage contains all layers with the same spatial size of feature maps. In Figure 2, we present the computations of different stages for ResNet [10] and EfficientNet [31]. The left panel of Figure 2 demonstrates the discrepancy within the ResNet series: ResNet18 has balanced MACs across stages, while ResNet50 and ResNet101 spend more MACs in the intermediate stages but few MACs in the head and tail stages. The right panel of Figure 2 presents the allocation of MACs for EfficientNet-B0, EfficientNet-B2 and EfficientNet-B4. EfficientNet utilizes one unified model scaling principle for network width, depth and resolution, so different configurations of EfficientNet have a similar tendency of MACs across stages. The later stages have far more MACs (>10 times) than the earlier and intermediate stages. A universal rule of computation allocation for different models is impractical: neither a manually designed nor a unified rule yields the optimal allocation of computations.

In this paper, we propose a network enlarging method based on a greedy search of computations for each stage. In contrast to a conventional unified principle, the method performs a fine-grained search over the reallocation of computations. Given a baseline network, our goal is to enlarge it to the target MACs with the best configuration of depth, width and resolution in each stage. Under the assumption that the top-performing smaller CNNs are a proper subcomponent of the top-performing larger CNNs, we are able to enlarge CNNs step by step using a greedy network enlarging algorithm. For each iteration of the proposed algorithm, 1) a series of candidate networks is constructed by searching the width, depth and resolution of each stage under constrained MACs; 2) with a fast performance evaluation method, the architecture with the best performance in this iteration is appended to the baseline model pool for the next iteration. By gradually adding MACs at each iteration, we find the optimal architecture until reaching the target MACs. Experimental results on the ImageNet classification task demonstrate the superiority of the proposed method. The searched network configurations can largely boost the performance of existing base models. For example, EfficientNet models searched by the proposed method outperform the original EfficientNets by a large margin.

2 Related Work

Manual Network Design.

In the early days after AlexNet [18], a large number of manually designed network architectures emerged. VGG [29] is a typical CNN architecture without any special connections, and deeper VGG-nets achieve higher accuracies; however, convergence problems emerged for very deep networks. ResNet [10] with shortcut connections was proposed, achieving higher accuracy with more layers. Besides going deeper, widening the network is another direction: WideResNet [37] obtains higher accuracy by adding channels to each layer of ResNet. In addition, a number of light-weight networks have been proposed to meet the demands of mobile devices, such as GoogLeNet [30], MobileNets [12, 27], ShuffleNets [38, 22] and GhostNet [9]. By adjusting a width scaling factor, MobileNets and GhostNet can trade accuracy against MACs. The design pattern behind these networks was largely man-powered and focused on discovering new design choices that improve accuracy, e.g., the use of deeper or wider models or shortcuts.

Automatic Network Design.

Manually designing architectures by experts is time-consuming. Because of this, there is a growing interest in automated neural architecture search (NAS) methods [6]. By now, NAS methods have outperformed manually designed architectures on tasks such as image classification [40, 24, 20, 21, 11], object detection [8, 35, 15, 33] and semantic segmentation [19, 28]. Generally, more MACs means higher accuracy. Traditionally, researchers change the depth, width or resolution of models, but usually only one dimension is considered. EfficientNet [31] showed that it is critical to balance all dimensions of network width/depth/resolution and proposed a simple yet effective compound scaling method based on results obtained by random sampling. RegNet [23] derived several patterns from a large number of experiments: 1. good networks have widths that increase with stages; 2. the stage depths likewise tend to increase for the best models, although not necessarily in the last stage. These methods construct principles from small networks and use the rules to obtain models of various sizes, even very large ones. In this paper, we use greedy allocation of MACs to enlarge a model and obtain a specific model architecture under constrained MACs. During the expansion, the width, depth and input resolution are considered for each stage. Our intention is to maximize the utilization of MACs for the network.

3 Approach

In this section, we describe the proposed network enlarging method based on greedy allocation of MACs. First, we define the goal of our method: to find the optimal depth and width of each stage together with the input resolution. Second, we introduce the main algorithm of greedy network enlarging. Finally, we introduce how to efficiently evaluate the performance of candidate models.

Figure 3: The framework for adjusting the input resolution, width and depth of each stage. The surrounding box outside the input image indicates the candidate input resolutions.

3.1 Problem Definition

Modern CNN backbone architectures usually consist of a stem layer, a network body and a head [23, 10, 31]. The main MACs and parameter burdens lie in the network body, as the stem is typically a convolutional layer and the head is a fully-connected layer. Thus, in this paper we focus on the scaling strategy of the network body. The network body consists of several stages [23], which are defined as sequences of layers or blocks with the same spatial size. For example, the ResNet50 [10] body is composed of 4 stages with $56\times 56$, $28\times 28$, $14\times 14$, and $7\times 7$ output sizes, respectively.

Scaling up convolutional neural networks is widely used to achieve better accuracy. Network depth, width and input resolution are three key factors of model scaling. Deeper convolutional neural networks capture richer and more complex features, and usually achieve higher performance than shallow networks. With the help of shortcuts, very deep networks can be trained to convergence; however, the improvement in accuracy becomes smaller as depth increases. Another direction is scaling the width of the network. More kernels mean that more fine-grained features can be learned, but the MACs grow quadratically with width; as a result, the network depth is constrained and high-level features may be lost. EfficientNet [31] showed that accuracy quickly saturates when networks become wider. Higher resolution provides rich fine-grained information, and a more powerful network is needed to exploit it: a deeper and wider network can acquire a larger receptive field and capture finer-grained features.

As a result, the network depth, width and resolution are not independent. These three dimensions admit a huge number of combinations, and one unified principle cannot yield the best configuration for all tasks. In this paper, we decompose the network depth, width and resolution into stage-wise depth, stage-wise width and input resolution. This maximizes the utilization of computations for each stage and for the whole network.

Given a base network $\mathbb{N}$ with $L$ stages, whose widths and depths are $\mathbf{w},\mathbf{d}=(w_{1},w_{2},\dots,w_{L},d_{1},d_{2},\dots,d_{L})$ and whose input resolution is $\mathbf{r}$, the objective is to acquire the network architecture with the best performance by optimizing the allocation of the target MACs $T$:

$\mathbf{r}^{\star},\mathbf{d}^{\star},\mathbf{w}^{\star}=\arg\max_{\mathbf{r},\mathbf{d},\mathbf{w}}\mathrm{ACC}_{\mathrm{val}}(\mathbb{N}(\mathbf{r},\mathbf{d},\mathbf{w},\boldsymbol{\theta}^{\star}))$   (1)
$\mathrm{s.t.}\quad|\mathrm{MACs}(\mathbb{N}(\mathbf{r},\mathbf{d},\mathbf{w}))-T|\leq\delta\cdot T$   (2)
$\boldsymbol{\theta}^{\star}=\arg\min_{\boldsymbol{\theta}}\mathrm{LOSS}_{\mathrm{train}}(\mathbb{N}(\mathbf{r},\mathbf{d},\mathbf{w},\boldsymbol{\theta}))$   (3)

where $\boldsymbol{\theta}$ denotes the trainable parameters of the network, $\mathrm{ACC}_{\mathrm{val}}$ denotes the validation accuracy, $T$ is the target MACs, and $\delta$ controls the allowed difference between the MACs of the searched model and the target MACs.

Search space. We consider the combinations of input resolution and the width and depth of each stage. Suppose a base network has $L$ stages with width and depth configuration $\mathbf{w},\mathbf{d}=(w_{1},w_{2},\dots,w_{L},d_{1},d_{2},\dots,d_{L})$ and input resolution $\mathbf{r}$. For each iteration, the growth rates of width, depth and resolution are $s_{w}$, $s_{d}$ and $s_{r}$, respectively. Under the constrained target MACs, we enlarge the width and depth of each stage and the network input resolution step by step. For example, ResNet18 contains 4 stages; if we constrain the search upper bound to 3 times the original depth and width of each stage, with a growth rate of 1 for both, the total number of combinations is $128\times 256\times 512\times 1024\times 4\times 4\times 4\times 4\approx 4.4\times 10^{12}$, even without considering the variation of input size.
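As a quick sanity check of the search-space size quoted above, the following sketch reproduces the count under the stated assumptions (ResNet18 stage widths 64/128/256/512, two residual blocks per stage, a 3x upper bound and a growth rate of 1); the figure only counts width/depth combinations and ignores resolution.

```python
# Per stage, widths may grow from w to 3w in steps of 1 (2w extra choices)
# and depths from d to 3d in steps of 1 (2d extra choices).
base_widths = [64, 128, 256, 512]
base_depths = [2, 2, 2, 2]          # two residual blocks per ResNet18 stage

combinations = 1
for w in base_widths:
    combinations *= 2 * w           # 128, 256, 512, 1024 width choices
for d in base_depths:
    combinations *= 2 * d           # 4 depth choices per stage

print(f"{combinations:.1e}")        # -> 4.4e+12, matching the number in the text
```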

3.2 Algorithm of Greedy Network Enlarging

Figure 3 presents our framework. Our intention is to find the optimal allocation of computations by enlarging the depth, width and input resolution of each stage under constrained computations, so as to maximize the utilization of MACs, as shown in Eq. 1. For each stage $i$ in the network, we try to find its optimal depth $d_{i}^{\star}$ and width $w_{i}^{\star}$; the optimal input resolution $r^{\star}$ is searched to match the specialized widths and depths. In problem (1), $\mathbf{r},\mathbf{d},\mathbf{w}$ take discrete values and admit massive numbers of combinations. Deep learning is both time and resource consuming; due to this extreme complexity, traditional mathematical optimization methods are impracticable, so we turn to an efficient neural architecture search method.

To simplify the search, we first introduce an assumption. Finding the globally optimal model is difficult in such a massive search space, so we relax the target to finding a top-performing configuration at the target MACs. We assume that the top-performing smaller CNNs are a proper subcomponent of the top-performing larger CNNs, as shown in Assumption 1. ResNet [10], VGG [29], EfficientNet [31], etc., fit this assumption well. This assumption enables an efficient search algorithm via greedy network enlarging.

Assumption 1

Given an optimal network with MACs $B_{0}$, depth $(d^{0}_{1},d^{0}_{2},\cdots,d^{0}_{L})$, width $(w^{0}_{1},w^{0}_{2},\cdots,w^{0}_{L})$ and resolution $r^{0}$, there exists at least one top-performing network with MACs $B_{0}+\delta$, depth $(d^{1}_{1},d^{1}_{2},\cdots,d^{1}_{L})$, width $(w^{1}_{1},w^{1}_{2},\cdots,w^{1}_{L})$ and resolution $r^{1}$ that satisfies

$d^{0}_{i}\leq d^{1}_{i},\quad\forall i=1,2,\cdots,L;$   (4)
$w^{0}_{i}\leq w^{1}_{i},\quad\forall i=1,2,\cdots,L;$
$r^{0}\leq r^{1}.$

With the above assumption, we transform the optimal network architecture search problem into a series of interrelated single-stage optimal sub-network architecture search problems, and solve them one by one. A decision is made at each stage of the process, and it depends only on the current state (here, the current resolution, widths and depths). When the decision of each stage is determined, a decision sequence is formed, which determines the final solution. The overall algorithm is illustrated in Algorithm 1.

Result: Configuration $\mathbf{c}^{\star}=(\mathbf{r}^{\star},\mathbf{d}^{\star},\mathbf{w}^{\star})$ with target MACs $T$.
Initialization: Base network $\mathbb{N}$ with $B_{0}$ MACs and $L$ stages: width $\mathbf{w}^{0}=(w^{0}_{1},w^{0}_{2},\dots,w^{0}_{L})$, depth $\mathbf{d}^{0}=(d^{0}_{1},d^{0}_{2},\dots,d^{0}_{L})$ and input resolution $\mathbf{r}^{0}$. The total dimension of the search space is $(2L+1)$. The target MACs of the output network is $T$ and the error rate is $\delta$. The number of search iterations is $N$. Initialize the set of optimal sub-configurations as $\bm{S}=\{(\mathbf{r}^{0},\mathbf{d}^{0},\mathbf{w}^{0})\}$ and the growth rate of resolution $s_{r}$;
for $i=1$ to $N$ do
       current target MACs: $T_{i}=B_{0}\cdot(T/B_{0})^{i/N}$;
       current candidates $\bm{C}=[\ ]$;
       for $(\mathbf{r},\mathbf{d},\mathbf{w})$ in $\bm{S}$ do
             while $\mathrm{MACs}(\mathbb{N}(\mathbf{r},\mathbf{d},\mathbf{w}))<T_{i}$ do
                    $\mathbf{r}=\mathbf{r}+s_{r}$;
             end while
             if $|\mathrm{MACs}(\mathbb{N}(\mathbf{r},\mathbf{d},\mathbf{w}))-T_{i}|\leq\delta\cdot T$ then
                    $\bm{C}$ append $(\mathbf{r},\mathbf{d},\mathbf{w})$
             end if
       end for
       for $(\mathbf{r},\mathbf{d},\mathbf{w})$ in $\bm{S}$ do
             for $j$ in $\mathit{range}(L)$ do
                    while $\mathrm{MACs}(\mathbb{N}(\mathbf{r},\mathbf{d},\mathbf{w}))<T_{i}$ do
                          $\bm{C}^{\prime}=\mathit{proportional\ collection}(\mathbf{d},\mathbf{w})$ as in Algorithm 2;
                    end while
                    $\bm{C}$ extend $\bm{C}^{\prime}$
             end for
       end for
       $\mathit{ACC}=[\ ]$;
       for $c$ in $\bm{C}$ do
             $acc=\mathit{performance\ evaluation}(c)$ as stated in Sec. 3.3;
             $\mathit{ACC}$ append $acc$;
       end for
       $\mathit{index}=\arg\max\mathit{ACC}$;
       $\bm{S}$ append $\bm{C}_{\mathit{index}}$;
       if $i==N$ then
             $\mathbf{c}^{\star}=\bm{C}_{\mathit{index}}$;
       end if
end for
Algorithm 1: Greedy network enlarging.

In the algorithm, we use an exponential increment of MACs during the search. This makes the changes to the network gentler than a uniform increment. In each iteration of Algorithm 1, in order to find the locally optimal architecture configuration, we search and evaluate candidate architectures. This step has two targets: the first is to find candidate architectures under a limited increase of MACs; the second is to find the locally optimal architecture with the maximum validation accuracy. When generating candidates, we consider the increase of resolution separately, which reduces the number of candidates; the increase of width and depth of each stage is based on the corresponding resolution.
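A minimal sketch of the exponential schedule and the outer greedy loop is given below. The `macs_of`-style callables `make_candidates` and `evaluate` are hypothetical placeholders for the MACs counter, the candidate generation of Algorithms 1-2 and the proxy evaluation of Sec. 3.3; the actual procedure keeps a pool $\bm{S}$ of configurations rather than a single best one.

```python
def macs_schedule(b0, target, num_iters):
    """Exponential MACs targets T_i = B0 * (T / B0)^(i / N) used in Algorithm 1."""
    return [b0 * (target / b0) ** (i / num_iters) for i in range(1, num_iters + 1)]

def greedy_enlarge(base_cfg, b0, target, num_iters, make_candidates, evaluate):
    """Simplified greedy loop: at every MACs target keep the best-scoring candidate."""
    best = base_cfg
    for t_i in macs_schedule(b0, target, num_iters):
        candidates = make_candidates(best, t_i)   # grow resolution, or one stage's depth/width
        best = max(candidates, key=evaluate)      # proxy-task accuracy as the score
    return best

# e.g. macs_schedule(0.39e9, 4.4e9, 8) yields eight intermediate budgets between
# the MACs of EfficientNet-B0 and a B4-level target.
```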

Result: Candidates $\bm{C}$ with collected width $w$ and depth $d$.
Initialization: Target MACs $T_{i}$, configuration $(\mathbf{r},\mathbf{d},\mathbf{w})$, ratio set $\bm{P}$, growth rates of depth and width $s_{d}$ and $s_{w}$, respectively, and current stage $j$;
$d_{j}\in\mathbf{d}$; $w_{j}\in\mathbf{w}$;
for $p\in\bm{P}$ do
       $T_{d}=p\cdot T_{i}$;
       while $\mathrm{MACs}(\mathbb{N}(\mathbf{r},\mathbf{d},\mathbf{w}))\leq T_{d}$ do
             $d_{j}=d_{j}+s_{d}$
       end while
       while $\mathrm{MACs}(\mathbb{N}(\mathbf{r},\mathbf{d},\mathbf{w}))\leq T_{i}$ do
             $w_{j}=w_{j}+s_{w}$
       end while
       if $|\mathrm{MACs}(\mathbb{N}(\mathbf{r},\mathbf{d},\mathbf{w}))-T_{i}|\leq\delta\cdot T$ then
             $\bm{C}$ append $(\mathbf{r},\mathbf{d},\mathbf{w})$
       end if
end for
Algorithm 2: Proportional collection of width and depth.

To reduce the number of searched candidates, we use a proportional control factor to assign the MACs between depth and width for each stage. Specifically, the ratios of MACs assigned to depth versus width are taken from a set $\bm{P}$. Under this setting, we grow the depth first and then the width of each stage. The procedure is illustrated in Algorithm 2.
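A compact sketch of this proportional collection step is shown below; the configuration dict and the `macs_of` counter are illustrative assumptions, not the paper's actual implementation.

```python
def proportional_collection(cfg, stage_j, t_i, macs_of, ratios,
                            s_d=1, s_w=2, delta=0.01):
    """Sketch of Algorithm 2 for a single stage j.

    Assumes cfg is a dict {"r": int, "d": [..], "w": [..]} and macs_of(cfg)
    is a placeholder MACs counter. For each ratio p, roughly a fraction p of
    the budget T_i goes to depth and the remainder to width.
    """
    candidates = []
    for p in ratios:
        c = {"r": cfg["r"], "d": list(cfg["d"]), "w": list(cfg["w"])}
        t_d = p * t_i
        while macs_of(c) <= t_d:                  # spend the depth share first
            c["d"][stage_j] += s_d
        while macs_of(c) <= t_i:                  # then fill the rest with width
            c["w"][stage_j] += s_w
        if abs(macs_of(c) - t_i) <= delta * t_i:  # keep only near-budget configs
            candidates.append(c)
    return candidates

# e.g. ratios = [i / 10 for i in range(11)] corresponds to the set P used in Sec. 4.
```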

3.3 Performance Estimation

To guide the search process, we have to estimate the performance of a given architecture. The most accurate method is to train the candidates on the whole training data and evaluate their performance on the validation data. However, this requires enormous computation, on the order of thousands of GPU days. Developing methods to speed up performance estimation is therefore crucial.

We turn to proxy tasks to estimate performance, including shorter training times [23], training on a subset of the data [17], training on proxy data [40], or using fewer filters per layer and fewer cells [25]. While these low-fidelity approximations reduce the cost, they also introduce bias into the performance estimation. Proxy data and simplified architectures can deviate so much from the original task that rank preservation becomes poor.

In this section, we determine the optimal proxy task for performance estimation with empirical experiments. First, we select the proxy sub-dataset by evaluating the performance of different sub-datasets. Second, the training hyper-parameters are acquired by parameter search. Spearman's rank correlation coefficient $\rho$, a non-parametric measure of rank correlation, is used to measure the quality of a proxy task.
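For reference, a minimal example of computing this rank correlation between a proxy task and full training, assuming SciPy is available; the accuracy values below are made-up placeholders.

```python
from scipy.stats import spearmanr

full_training_acc = [77.1, 76.4, 78.0, 75.2, 77.6]   # Top-1 after full ImageNet training
proxy_task_acc    = [62.3, 61.0, 63.1, 60.2, 62.0]   # Top-1 after the cheap proxy task

rho, _ = spearmanr(full_training_acc, proxy_task_acc)
print(f"Spearman rho = {rho:.2f}")  # close to 1 means the proxy preserves the ranking
```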

For the proxy sub-dataset, we create two sub-datasets, ImageNet1000-100 and ImageNet100-500, by randomly selecting images from ImageNet. To evaluate these datasets, 12 network architectures with different widths, depths and input sizes are generated on the basis of EfficientNet-B0. We train all 12 networks and EfficientNet-B0 on the whole ImageNet training set for 150 epochs, and the Top-1 accuracies on the validation set are used as the reference. We then finetune the 12 networks for different numbers of epochs, and also train them from scratch for a few epochs. On ImageNet100-500, the average Spearman value is $\rho=0.16$; on ImageNet1000-100, it is $\rho=0.23$. We therefore choose ImageNet1000-100 as the proxy sub-dataset. More details are presented in the supplementary materials.

After determining the proxy dataset, we improve the correlation between the proxy task and the original task by searching the hyper-parameters. 24 network architectures with different widths, depths and input sizes are generated on the basis of EfficientNet-B0. We train all 24 networks on the whole ImageNet training set for 150 epochs, and the Top-1 accuracies on the validation set are used as the reference. Two pretrained EfficientNet-B0 models, on ImageNet and on ImageNet1000-100, are provided. The learning rate, the mode of learning rate decay and the number of training epochs are considered. Among these hyper-parameter combinations, the top-2 Spearman values are $\rho=0.57$ and $\rho=0.54$, which indicate moderate positive correlation. Both use cosine decay with an initial learning rate of 0.01 for 20 training epochs; the difference is that the first uses the ImageNet1000-100 pretrained model while the second uses the ImageNet pretrained model. More details are presented in the supplementary materials. Figure 4 presents the consistency of different networks. In the next section, we finetune for 20 epochs with an initial learning rate of 0.01 and cosine decay, starting from the ImageNet1000-100 pretrained model.

Figure 4: Correlation between the proxy task and the original task. The blue line is the target. The red line has a Spearman value of 0.57 and is trained on the basis of ImageNet1000-100. The cyan line has a Spearman value of 0.54 and is trained on the basis of ImageNet. For comparison purposes, we manually shift the accuracy up by 27 and 10 points for the red and cyan lines, respectively.

4 Experiments

In this section, we evaluate the greedy network enlarging method on the general image classification dataset ImageNet [5], and demonstrate that the method achieves state-of-the-art accuracy with similar MACs.

4.1 Datasets, Networks and Experimental Settings

We extensively evaluate our method on the popular classification dataset ImageNet (ILSVRC2012) [5], which contains 1.3M training images from 1000 categories; the validation set contains 50K images. To speed up the search process on ImageNet, we create the proxy ImageNet1000-100 dataset, which contains 100K training images and 50K validation images randomly sampled from the ImageNet training set. Two baseline networks are considered: EfficientNet [31] and an improved GhostNet (https://gitee.com/mindspore/models/tree/master/research/cv/ghostnet) [9].
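A hypothetical sketch of how such a proxy split could be drawn is given below, assuming the standard ImageNet folder layout and roughly 100 sampled images per class for the 100K training split; the paths and the per-class count are illustrative assumptions, not the paper's exact construction.

```python
import os
import random
import shutil

SRC, DST, PER_CLASS = "imagenet/train", "imagenet1000-100/train", 100

for cls in os.listdir(SRC):                       # one folder per class
    images = os.listdir(os.path.join(SRC, cls))
    os.makedirs(os.path.join(DST, cls), exist_ok=True)
    for name in random.sample(images, min(PER_CLASS, len(images))):
        shutil.copy(os.path.join(SRC, cls, name), os.path.join(DST, cls, name))
```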

To accelerate the search, we set the growth rates of resolution and depth to $s_{r}=8$ and $s_{d}=1$, respectively. For the growth rate of width, we use $s_{w}=2$ for small models and $s_{w}=4$ for large models. The ratios of MACs between depth and width are taken from the set $\bm{P}=\{0.0,0.1,0.2,\cdots,1.0\}$. The error rate of MACs is $\delta=0.01$. We use exponential growth of MACs and set different numbers of search iterations $N$ for small and large models. The finetuning method comes from the function-preserving algorithm [3].

After the search is completed, we retrain the acquired network architecture on the whole ImageNet from scratch. The training settings follow timm [34] (used under its license) and EfficientNet [31]: RMSProp optimizer with momentum 0.9; weight decay 1e-5; multi-step learning rate with warmup, initial learning rate 0.064 decayed by 0.97 every 2.4 epochs. An exponential moving average of weights, dropblock [7], random erasing [39] and random augmentation [4] are used.
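The retraining learning-rate schedule described above (0.064 decayed by 0.97 every 2.4 epochs, with warmup) can be sketched as follows; the 3-epoch linear warmup length is an assumption for illustration.

```python
def step_lr(epoch, base_lr=0.064, decay=0.97, every=2.4, warmup_epochs=3):
    """Step-decay schedule: linear warmup, then decay by 0.97 every 2.4 epochs."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * decay ** ((epoch - warmup_epochs) // every)

# e.g. step_lr(0) -> 0.021, step_lr(3) -> 0.064, step_lr(10) -> 0.064 * 0.97**2
```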

ImageNet has noisy labels, and random-crop augmentation introduces further noise into the inputs and labels. To mitigate this, we use the relabel method [36] to obtain higher accuracy.

4.2 ImageNet Results and Analysis

For EfficientNet, we take EfficientNet-B0 as the baseline and search models with MACs similar to EfficientNet-B1 to B4. Besides, we enlarge GhostNet both with the scaling principle of EfficientNet and with our greedy search method. For GhostNet, we add a Squeeze-and-Excitation [13] module to each block. Table 1 shows the main results and comparisons with other networks. The searched models are marked with 'S-'.

GhostNet-B1 and GhostNet-B4 in Table 1 are obtained with the compound scaling rule of EfficientNet. Their performance is lower than that of their greedily searched counterparts, which suggests that the rule derived on EfficientNet is not suitable for GhostNet: one would need to resample and re-optimize for a new network to obtain a suitable rule. Besides, the compound scaling principle ignores the differences between stages, which precludes elaborate stage-wise adjustment.

Model Top-1 Acc. #Params #MACs Ratio-to-EfficientNet
EfficientNet-B0 [31] 77.1% 5.3M 0.39B 1x
GhostNet 1.0× [9] 73.9% 5.2M 0.14B 0.36x
EfficientNet-B1 [31] 79.1% 7.8M 0.69B 1x
ResNet-RS-50 [1] 78.8% 36M 4.6B 6.7x
REGNETY-800MF [23] 76.3% 6.3M 0.8B 1.16x
S-EfficientNet-B1 79.91% 8.8M 0.68B 1x
S-EfficientNet-B1-re 80.71% 8.8M 0.68B 1x
GhostNet-B1 79.13% 13.3M 0.59B 0.85x
S-GhostNet-B1 80.08% 16.2M 0.67B 1x
S-GhostNet-B1-re 80.87% 16.2M 0.67B 1x
EfficientNet-B2 [31] 80.1% 9.1M 0.99B 1x
REGNETY-1.6GF [23] 78.0% 11.2M 1.6B 1.6x
S-EfficientNet-B2 80.92% 9.3M 1.0B 1x
S-EfficientNet-B2-re 81.58% 9.3M 1.0B 1x
EfficientNet-B3 [31] 81.6% 12.2M 1.83B 1x
ResNet-RS-101 [1] 81.2% 64M 12B 6.6x
REGNETY-4.0GF [23] 79.4% 20.6M 4.0B 2.18x
S-EfficientNet-B3 81.98% 12.3M 1.88B 1x
S-EfficientNet-B3-re 82.87% 12.3M 1.88B 1x
EfficientNet-B4 [31] 82.9% 19.3M 4.39B 1x
REGNETY-8.0GF [23] 79.9% 39.2M 8.0B 1.82x
NFNet-F0 [2] 83.6% 71.5M 12.4B 2.8x
ResNet-RS-152 [1] 83.0% 87M 31B 7.1x
EfficientNetV2-S [32] 83.9% 24M 8.8B 2.0x
S-EfficientNet-B4 83.0% 17.0M 4.34B 1x
S-EfficientNet-B4-re 84.0% 17.0M 4.34B 1x
GhostNet-B4 82.78% 36.1M 4.39B 1x
S-GhostNet-B4 83.2% 32.9M 4.37B 1x
S-GhostNet-B4-re 84.3% 32.9M 4.37B 1x
Table 1: Searched architecture performance on ImageNet. The suffix '-re' means the model is trained with the relabel trick [36]. Our results are in bold.

In Table 1, the Top-1 accuracies of all searched architectures outperform the compound scaling of EfficientNet [31] and RegNet [23]. At 600M MACs, our searched architectures reach 79.91% and 80.08%, improving performance by 0.81% and 0.97%, respectively. At the EfficientNet-B2 and B3 levels, our searched EfficientNet architectures achieve 80.92% and 81.98%. We search 2 networks at the 4B-MACs level: S-EfficientNet-B4 reaches 83.0% and S-GhostNet-B4 reaches 83.2%.

The relabel training trick improves the accuracy further: the Top-1 accuracy improves by 0.6% to 1.1% on all searched architectures. We achieve new state-of-the-art 80.87% and 84.3% ImageNet top-1 accuracies under the settings of 600M and 4.4B MACs, respectively. All searched network architectures are presented in the supplementary materials.

4.3 Process of Greedy Search

Figure 5 shows the changes of accuracy and input resolution during the search process. As the MACs increase, the resolution rises in a wave-like fashion, which verifies the role of the dynamic search, while the accuracy increases slowly and steadily.

Figure 5: The search process for the 686M-MACs target of EfficientNet-B1. The blue and black lines show the changes of accuracy and input size as the MACs increase, respectively.

Further, the schematic diagram of the greedy search for EfficientNet-B1 is shown in Figure 6. Under constrained MACs, we show the candidate network architectures: the green boxes mark the best architecture in the current iteration, and the gray boxes are discarded. The best architecture of each iteration is passed on to the later iterations.

Figure 6: The schematic diagram of greedy search for EfficientNet-B1.

5 Conclusion

Network enlarging is an effective scheme for generating deep neural networks with excellent performance from a smaller baseline. Different from the conventional approach that directly enlarges the given network using a unified strategy, we present a novel greedy network enlarging algorithm. The entire network enlarging task is divided into several iterations that search for the best computational allocation in a step-by-step fashion. In the enlarging process of the base model, the added MACs are assigned to the most appropriate locations. Experimental results on several benchmark models and datasets show that the proposed method surpasses the original unified enlarging schemes and achieves state-of-the-art network performance in terms of both accuracy and computational cost. Beyond the allocation of MACs at the stage level, more fine-grained allocation of MACs is left for future work.

References

  • [1] Irwan Bello, William Fedus, Xianzhi Du, Ekin D Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting resnets: Improved training and scaling strategies. arXiv preprint arXiv:2103.07579, 2021.
  • [2] Andrew Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. arXiv preprint arXiv:2102.06171, 2021.
  • [3] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
  • [4] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
  • [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [6] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey, 2019.
  • [7] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regularization method for convolutional networks. arXiv preprint arXiv:1810.12890, 2018.
  • [8] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7036–7045, 2019.
  • [9] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1580–1589, 2020.
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [11] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, pages 1314–1324, 2019.
  • [12] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [13] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • [14] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems, pages 103–112, 2019.
  • [15] Chenhan Jiang, Hang Xu, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Sp-nas: Serial-to-parallel backbone search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11863–11872, 2020.
  • [16] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
  • [17] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 528–536, 2017.
  • [18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [19] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 82–92, 2019.
  • [20] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
  • [21] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
  • [22] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018.
  • [23] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10428–10436, 2020.
  • [24] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the aaai conference on artificial intelligence, volume 33, pages 4780–4789, 2019.
  • [25] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4780–4789, Jul. 2019.
  • [26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [27] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
  • [28] Albert Shaw, Daniel Hunter, Forrest Iandola, and Sammy Sidhu. Squeezenas: Fast neural architecture search for faster semantic segmentation. In Proceedings of the IEEE international conference on computer vision workshops, 2019.
  • [29] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [30] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [31] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
  • [32] Mingxing Tan and Quoc V Le. Efficientnetv2: Smaller models and faster training. arXiv preprint arXiv:2104.00298, 2021.
  • [33] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10790, 2020.
  • [34] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  • [35] Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Auto-fpn: Automatic network architecture adaptation for object detection beyond classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 6649–6658, 2019.
  • [36] Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Junsuk Choe, and Sanghyuk Chun. Re-labeling imagenet: from single to multi-labels, from global to localized labels. arXiv preprint arXiv:2101.05022, 2021.
  • [37] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • [38] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
  • [39] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13001–13008, 2020.
  • [40] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.