
BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization

Miloš Nikolić University of Toronto Toronto, Canada Ghouthi Boukli Hacene Mila Montreal, Canada Ciaran Bannon University of Toronto Toronto, Canada Alberto Delmas Lascorz University of Toronto Toronto, Canada Matthieu Courbariaux Mila Montreal, Canada Yoshua Bengio Mila Montreal, Canada Vincent Gripon IMT-Atlantique Brest, France Andreas Moshovos University of Toronto Toronto, Canada
Abstract

Neural networks have demonstrably achieved state-of-the-art accuracy using low-bitlength integer quantization, yielding both execution time and energy benefits on existing hardware designs that support short bitlengths. However, the question of finding the minimum bitlength for a desired accuracy remains open. We introduce a training method for minimizing inference bitlength at any granularity while maintaining accuracy. Namely, we propose a regularizer that penalizes large bitlength representations throughout the architecture and show how it can be modified to minimize other quantifiable criteria, such as number of operations or memory footprint. We demonstrate that our method learns thrifty representations while maintaining accuracy. With ImageNet, the method produces an average per layer bitlength of 4.13, 3.76 and 4.36 bits on AlexNet, ResNet18 and MobileNet V2 respectively, remaining within 2.0%, 0.5% and 0.5% of the base TOP-1 accuracy.

Index Terms:
Heterogeneous quantization, learned datatype, neural networks. Correspondence to: Miloš Nikolić <milos.nikolic@mail.utoronto.ca>.

I Introduction

Over the past two decades, energy and power (energy over time) have emerged as the primary constraints for computing devices [1], dictating execution time performance, operating costs, or up-time for virtually all device segments, from data centers to the edge. Accordingly, reducing Deep Neural Network (DNN) energy needs can yield a multitude of benefits, such as improving latency for existing models, enabling the deployment of more powerful models, or boosting throughput and reducing operating costs for data centers. A key choice that dictates DNN energy consumption, and thus execution time during inference, is the choice of datatype, that is, the number of bits used per value (activations or weights) and their numerical interpretation (e.g., floating-point or fixed-point). For example, using 8b fixed-point vs. 32b floating-point reduces the energy of a multiply-accumulate by more than 23× [2], whereas using 4b fixed-point more than halves energy compared to 8b and doubles compute bandwidth even on existing hardware [3]. Choosing a datatype may at first appear a simple task. However, doing so naively can leave a lot of the potential benefits untapped, or worse, can yield a network that fails to converge.

While 32-bit floating point is definitely sufficient, many DNNs can use other lower cost floating- or fixed-point representations without sacrificing accuracy. For this reason, and while earlier graphics processors (GPUs) and general-purpose processors (CPUs) supported only a few datatypes, newer generations of GPUs and CPUs have been extended to support additional more energy efficient datatypes specifically targeting DNNs [3, 4, 5]. Support for multiple datatypes is a key feature of specialized hardware accelerators [6, 7, 8, 9].

Energy efficiency improves when using a narrower datatype (e.g., 16b instead of 32b floating-point) or one that is simpler and less expensive to implement (e.g., fixed-point vs. floating-point). More energy efficient datatypes yield benefits in two ways (see Table I for example accelerators):
Memory: Using narrower datatypes reduces the number of bits that need to be stored and transferred. This greatly reduces energy, as off-chip memory accesses are today one order of magnitude slower and two orders of magnitude more energy consuming than on-chip operations (accesses or computation). Doing so enables us to run larger and more powerful networks, to deploy devices with smaller on- and off-chip memories, and to enjoy higher performance especially when memory bandwidth is limited, as is often the case today. Besides directly benefiting CPUs and GPUs today, this observation also motivated several low-cost memory compression techniques that directly take advantage of narrower datatypes at layer [10, 7, 8] or finer granularity [11, 12].
Computation: The operations-per-cycle throughput of many hardware platforms scales inversely proportionally with datatype bitlength. Most commodity general-purpose or graphics processors (CPUs or GPUs) support several bitlengths, e.g., 16b at 1× and 8b at 2× throughput, whereas spatially composable and bit-serial accelerators [13, 7, 14, 15, 8, 16, 17] can support a full range of bitlengths. Moreover, using simpler datatypes such as fixed-point vs. floating-point can also amplify compute bandwidth. For example, many commodity CPUs and GPUs implement more fixed-point units than floating-point ones. In accelerators, using simpler arithmetic units allows us to pack more of them for the same area and for the same energy budget. For example, we can fit 32 8b fixed-point MAC units in the same area as 5 floating-point ones.

Multiple techniques for determining optimal datatypes have been proposed, given the impact this choice has even on existing hardware. However, today we still usually choose a target datatype prior to training a network, without help on how to navigate its complex relationship with accuracy, energy, and execution time performance. Once we have a datatype, typically a fixed-point representation of a desired bitlength, we can use one of the many methods that maximize accuracy for a fixed bitlength [18, 19, 20, 21]. This approach deprives us of the benefits of tighter datatype selection or, worse, forces us to search the datatype selection space via trial-and-error at the great cost of training the model anew multiple times.

Profiling-based techniques can find per-layer bitlengths delivering further benefits [22, 23]. However, beyond a certain bitlength choice, such post-training methods start incurring accuracy loss. Further reduction in bitlengths is possible if fine-tuning can recover some of the accuracy losses. Similarly, some hand-crafted quantization techniques can treat value groups differently using prior knowledge of the expected value distribution [24, 25, 26]. However, none of these methods attempt to learn bitlengths, leaving much potential untapped.

One way of learning bitlengths is to use reinforcement learning (RL) to reduce the search space [27, 28]. However, after each new bitlength selection, RL algorithms need to estimate the accuracy to properly apply rewards. This step often requires a lengthy fine-tuning process for each selection. Consequently, this approach is not scalable (for finer granularities and larger networks) since the search space increases exponentially with the number of different datatype groups. A better approach is to map bitlengths from the discrete to the continuous domain so that gradient descent can learn them. One successful approach is the recently proposed Mixed Precision DNN (MPDNN) that learns per-layer bitlengths [29]. Given a hard memory constraint (total memory footprint needed to store all weights and the largest activation layer), MPDNN excels at finding the best way to utilize the allotted memory space to maximize accuracy. However, MPDNN is not ideal for jointly optimizing bitlength and accuracy since, without a memory constraint, it produces bitlengths that are significantly larger than necessary. In essence, to find an ideal bitlength distribution, the algorithm must be given in advance its resulting memory footprint. Additionally, MPDNN does not allow for optimizing other criteria such as operation count, focusing solely on memory footprint. Our goal is precisely to fill these gaps. Ultimately, our method and MPDNN attempt to solve two very different problems: rather than best fitting within a memory capacity, we minimize bitlengths and error concurrently, with benefits for both compute and memory.

We present BitPruning, a method that frees users from having to guess bitlengths or memory footprints, and that jointly optimizes bitlength and accuracy, targeting both memory and compute benefits. By leveraging the power of training, BitPruning automatically learns reduced-bitlength (number of bits) integer representations at any granularity (e.g., network, layer, or block of any shape or size). We demonstrate our extended training technique for integer representations, as they have been shown to be sufficient for many DNN applications and because integer operations and functional units (multipliers and adders) are much less expensive area- and energy-wise than those for floating point.

TABLE I: Hardware accelerators that support variable bitlength activations “A” and weights “W”.
Accelerator Bitlength Target Granularity
Stripes [7] Any A Layer
Dynamic Stripes [30] Any A Group
Dpred [13] Any A Group
ShapeShifter [11] Any A Group
BISMO [17] Any W+A Layer
Bit-Slicing [16] Any W+A Layer
BitFusion [8] Powers of 2 W+A Layer
Loom [14] Any W+A Layer
Bit Tactical [31] Any A Layer/Group
Outlier-Aware [12] Any + outliers W+A Group
UNPU [32] Any W Network
Proteus [15] Any W+A Group
GPUs [3] 1, 4, 8 W+A Group
CPUs w/ multimedia/vector 1, 4, 8 W+A Group

BitPruning’s goal is to squeeze out every possible benefit from reducing the bitlength used, at any desired granularity and with any priority. Our method relies on an interpolation of bitlength to non-integer values, allowing accuracy and bitlength to be traded off by gradient descent, and it can be applied at any granularity: e.g., per network, layer, value, or any group of values in between, as long as it is statically determined. This allows us to target a variety of hardware platforms via a unified approach. Additionally, our approach allows groups to be emphasized arbitrarily to prioritize a selected criterion, such as multiply-accumulate (MAC) operation count or memory footprint. This makes it possible to target different costs depending on the deployment scenario and workload, e.g., target computation for convolutional layers or memory for fully-connected ones. Finally, contrary to most approaches that ignore the first and last layers, ours is an end-to-end method that optimizes the whole network from input to output.

Our contributions are that we develop 1) an interpolation of integer bitlengths to non-integer ones, enabling bitlengths to be learned, and 2) a regularizer that can, during training, reduce the number of bits used whilst minimizing the effect on inference accuracy.

II Method

BitPruning involves procedures for both the forward and backward passes used in training. We begin by defining a conventional, linear quantization scheme with integer bitlengths in the forward pass. This definition is then expanded to non-integer bitlengths, and we describe how this interpretation allows bitlengths to be learned using gradient descent. Subsequently, we introduce a parameterizable loss function, which enables BitPruning to penalize larger bitlengths. Finally, we describe the selection of the final integer bitlengths.

II-A Quantization

The greatest challenge in learning bitlengths is that they are discrete values over which there is no obvious differentiation. We overcome this by defining a quantization method based on non-integer bitlengths which is used during training. We start with a uniform integer quantization between the minimum and maximum of each layer and expand it to non-integer bitlengths.

For an integer bitlength $n$ we use a simple uniform fixed-point quantization between the minimum and maximum. That is, each float value $V$ is represented by an integer:

$Int(V,n) = Round((V - L_{min}) / Scale(n))$

where $Int(V,n)$ is the integer value with bitlength $n$, $L_{min}$ the minimum value in the layer (across the whole batch), and $Scale$ is the smallest representable difference:

$Scale(n) = (L_{max} - L_{min}) / (2^{n} - 1)$

where $L_{max}$ is the maximum value in the layer (batch). Consequently, this scheme quantizes an input float value $V$ to the following float value:

$Q_{i}(V,n) = L_{min} + Int(V,n) \cdot Scale(n)$

Throughout training, we represent the integer quantization as $Q(n)$. This quantization scheme does not allow the learning of bitlengths with gradient descent due to its discontinuous and non-differentiable nature. To expand the definition to real-valued $n$, the values used in inference during training are interpreted as an interpolation between the values represented by the nearest two integers:

$Q_{r}(V, b+\alpha) = (1-\alpha)\, Q_{i}(V,b) + \alpha\, Q_{i}(V,b+1)$

where $n = b + \alpha$, with $0 \leq \alpha < 1$, and $Q_{i}(V,b)$ is the integer bitlength quantization with $b$ bits.

The scheme can be, and in this work is, applied to activations and weights separately. Since the minimum bitlength per value is 1, $n$ is clipped at 1.0. This presents a reasonable extension of the meaning of bitlength to continuous space and allows the loss to be differentiable with respect to bitlength. The final bitlength of each group for inference is then selected as the smallest integer greater than or equal to the bitlength parameter learned during training.
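To make the interpolation concrete, the following PyTorch sketch implements the quantizers above under our own naming; it assumes the per-layer minimum and maximum are computed over the current batch and passed in, that the bitlength is a learnable scalar tensor, and it uses plain rounding (the straight-through estimator used during training is sketched next).

```python
import torch

def quantize_integer(v, n, v_min, v_max):
    # Uniform fixed-point quantization of v with an integer bitlength n,
    # between the per-layer minimum and maximum (Q_i above).
    scale = (v_max - v_min) / (2.0 ** n - 1.0)   # smallest representable difference
    code = torch.round((v - v_min) / scale)      # Int(V, n)
    return v_min + code * scale                  # back to a representable float

def quantize_interpolated(v, n, v_min, v_max):
    # Non-integer bitlength n = b + alpha is interpreted as a linear
    # interpolation between the b-bit and (b+1)-bit quantizations (Q_r above).
    # n: learnable scalar tensor (the bitlength parameter for this value group).
    n = torch.clamp(n, min=1.0)                  # at least 1 bit per value
    b = torch.floor(n)
    alpha = n - b                                # carries the gradient w.r.t. n
    return (1.0 - alpha) * quantize_integer(v, b, v_min, v_max) \
        + alpha * quantize_integer(v, b + 1.0, v_min, v_max)
```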

During the forward pass the above formulae are applied to both activations and weights, in the order shown. The values are converted to floating-point values representable by the integer quantization, and if the bitlength is non-integer, the two nearest integer representations are interpolated. During the backward pass we use the straight-through estimator [33, 34] to prevent propagating the zero gradients that result from the discreteness of the rounding operation.
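A common way to realize the straight-through estimator in PyTorch is the detach trick shown below; this hypothetical helper can stand in for the plain rounding in the sketch above so that gradients with respect to the quantized values pass through the rounding step as if it were the identity.

```python
def round_ste(x):
    # Forward: round(x).  Backward: gradient of the identity, so the
    # zero gradient of the rounding step does not block training.
    return x + (torch.round(x) - x).detach()
```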

While training, we assume that our model adequately represents the expressiveness of the network and has a monotonic relationship with accuracy. An exception to this assumption is observed when jumping from non-integer to integer bitlengths once the initial training phase is complete. This issue is resolved by extending the fine-tuning phase, fixing the bitlengths and allowing accuracy to recover. Since this work targets inference, extending training is not considered a significant issue, but an alternative solution may be the topic of future work.

II-B Loss Function

The loss function penalizes bitlength by adding a weighted average (with weights $\lambda_{i}$) of the bits $n_{i}$ required for the weights and activations of all layers. We define the total loss $L$ as:

$L = L_{l} + \gamma \sum_{i} (\lambda_{i} \times n_{i})$

where $L_{l}$ is the original loss function, $\gamma$ is the regularization coefficient used for selecting how aggressive the quantization should be, $\lambda_{i}$ is the weight corresponding to the importance of the $i^{th}$ group of values, and $n_{i}$ is the bitlength of the activations or weights in that group.

This loss function can be used to target any quantifiable criterion. For the majority of this paper, we select $\lambda_{i}$ such that the loss function weighs all layers equally and produces a loss of 1.0 for an 8-bit network. This is equivalent to stating that the benefit of reducing bitlengths is equal across all value groups. In Section III we explore how $\lambda_{i}$ can be used to target a reduction in memory footprint for activation-heavy and weight-heavy cases, as well as to minimize the number of MAC operations.
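As an illustration only, the bit penalty can be computed as below; the flat list of per-group bitlength parameters and the helper names are our own assumptions about the bookkeeping, and the weights are normalized, as in the text, so that an all-8-bit network yields a penalty of exactly 1.0.

```python
import torch

def bit_loss(bitlengths, lambdas, gamma):
    # Weighted bitlength penalty added to the task loss: gamma * sum_i lambda_i * n_i
    return gamma * sum(l * n for l, n in zip(lambdas, bitlengths))

# Example: four value groups (weights and activations of two layers),
# all weighted equally and normalized so that sum_i lambda_i * 8 == 1.0.
bitlength_params = [torch.tensor(8.0, requires_grad=True) for _ in range(4)]
lambdas = [1.0 / (8.0 * len(bitlength_params))] * len(bitlength_params)
penalty = bit_loss(bitlength_params, lambdas, gamma=1.0)   # tensor(1.0) for this example
# total_loss = task_loss + penalty
```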

II-C Final Bitlength Selection

The above training method produces non-integer bitlengths, which are meaningless for practical hardware. Hence, we adjust each learned non-integer bitlength to the nearest greater integer. Tables II and VI show that while this initially affects accuracy, continuing training recovers the accuracy loss. The final phase keeps bitlengths constant and only updates the weights.
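A minimal sketch of this final step, assuming the learned bitlengths are kept as a list of scalar parameters as in the earlier sketches: each is rounded up to the nearest integer and frozen so that only the weights continue to train.

```python
import math

for p in bitlength_params:
    p.data.fill_(math.ceil(p.item()))   # smallest integer >= learned bitlength
    p.requires_grad_(False)             # keep bitlengths constant during fine-tuning
```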

III Evaluation and Results

Without loss of generality we report experimental results for per-layer granularity, and for weights and activations separately. However, this choice is arbitrary and it is straightforward to adapt to any other granularity (e.g., per group of values that would be transferred or processed together). Similarly, it is a simple extension of the loss function to change the coefficients in the weighted sum to prioritize groups according to other quantifiable criteria.

III-A CIFAR10

TABLE II: Activation/weight bitlengths and achieved accuracy of aggressive quantization with regularizers of different strengths on CIFAR10, for non-integer and integer bitlengths.
Non-Integer Bitlengths Rounded Integer Bitlengths
Network Regularizer Accuracy Weights # of bits Activations # of bits Final Accuracy Weights # of bits Activations # of bits
AlexNet Baseline 78.8 32 float 32 float 78.8 32 float 32 float
γ = 0.5 78.3 3.78 4.34 77.9 4.33 4.83
        1.0 78.5 3.03 3.89 77.9 3.50 4.33
        2.5 78.3 2.45 3.18 75.0 3.00 3.67
        5.0 78.2 2.06 2.72 75.4 2.50 3.17
        10.0 Does not converge Does not converge
ResNet18 Baseline 94.9 32 float 32 float 94.9 32 float 32 float
γ = 0.5 94.2 1.67 2.73 94.4 1.90 3.38
        1.0 93.5 1.30 2.26 94.1 1.43 2.90
        2.5 93.1 1.15 2.01 93.4 1.24 2.43
        5.0 92.8 1.14 1.99 93.3 1.24 2.48
        10.0 94.1 1.61 2.35 94.2 1.90 2.90
Figure 1: CIFAR10 accuracy (solid) and bitlength (dotted) during training for different regularizers. (a) AlexNet; (b) ResNet18.

This section discusses the BitPruning results for AlexNet [35] and ResNet18 [36] on CIFAR10 [37].

III-A1 Learning Bitlength

The networks were trained over 300 epochs with the default fast.ai [38] parameters and the one-cycle policy in PyTorch [39]. Table II shows TOP-1 validation accuracy for a 32b float baseline, as well as for quantized models trained with our redefined loss. The bitlength weights ($\lambda_{i}$) are set to normalize all bitlengths to 8 bits and to emphasize all layers equally: if all layers use bitlength 8, the bit loss will be $\gamma$. We change $\gamma$ to produce regularizers of progressively higher strength. Table II reports the resulting average bitlengths over all layers. Accuracies comparable to the baseline can be achieved with less than 3 bits on average for AlexNet, and 2 bits for ResNet. Progressively stronger regularizers achieve smaller bitlengths, albeit at a slight degradation in accuracy. Weights consistently achieve smaller bitlengths than activations, while activations tend to benefit from more aggressive regularizers. This is expected, as the weights are directly set by the quantization scheme, while activations are only indirectly affected at run-time.

Figure 1 shows the validation accuracy and average bitlengths of activations and weights over the 300 training epochs. The bitlengths converge quickly and concurrently (all layers, weights and activations). While not shown in detail, within 30-40 epochs the bitlengths across all groups drop near to their final values; only slight changes are observed during the remaining training epochs. While all versions of ResNet18 closely track the baseline, the AlexNet versions tend to follow the baseline validation accuracy closely only for the first part of training. As bitlengths plateau, the accuracy drops for the quantized networks, with the degradation being greater for more aggressive quantizations. However, as training approaches the 300-epoch mark, all versions of AlexNet, except the most aggressive one, converge towards the baseline accuracy. For AlexNet, more aggressive regularizers arrive at smaller bitlengths. The most aggressive attempt crashes at epoch 69. Similarly, all versions of ResNet18 reach average bitlengths corresponding to the aggressiveness of quantization, except for the most aggressive one. The most aggressive one dips the fastest, however it bounces off the minimum and converges to a bitlength that is larger than some of the less aggressive ones. This shows that the phase at which the loss and the regularizer conflict matters and affects both accuracy and bitlength.

Bitlengths vary noticeably per layer, creating more opportunities for specialized hardware; uniform per-network bitlengths would leave a lot of potential untapped. A finer quantization granularity would presumably offer even more potential. Bitlengths show a slight descending trend towards later layers.

III-A2 Selecting Bitlengths

After this initial training, bitlengths are set to the ceiling of their learned value, resulting in a noticeable accuracy drop. There are two reasons why this occurs: our interpolation is imperfect (larger bitlengths often, but not always, give better accuracy) and the network is partially trained for the smaller bitlength. Crucially, fine-tuning the networks regains this lost accuracy within 60 epochs. Table II shows the drop in accuracy of the integer bitlength networks as well as the effects of fine-tuning. In all cases, selecting the integer bitlengths increases the bitlengths by about 0.5 bits. Finally, Table II shows the validation accuracy and average bitlengths of these fine-tuned integer bitlength versions in comparison with the baseline and non-integer bitlength versions. While the final integer bitlength ResNet18 versions slightly outperform the corresponding non-integer versions, the converse is true for AlexNet. Generally, the integer and non-integer cases produce consistent accuracies.

III-A3 Other Architectures

We demonstrate BitPruning on a diverse set of architectures in Table III, using the same training approach. It learns thrifty bitlengths for all architectures given a good selection of hyper-parameters.

TABLE III: Learning bitlengths for different architectures
Network Base Accuracy Quantized Accuracy Weights # of bits Activations # of bits Regularizer
MobileNet V2 [40] 94.6 93.9 2.41 3.86 0.5
DenseNet 121 [41] 95.7 94.4 1.47 1.97 1.0
DPN 92 [42] 95.7 95.5 2.65 3.02 0.025
ResNext29(2x64d) [43] 95.3 94.0 1.11 2.11 1.0
PreAct ResNet18 [44] 94.8 93.4 1.78 2.64 1.0
TABLE IV: Influence of loss-function weighting on CIFAR10. Average number of bits when the loss weights target different criteria. BS: batch size; MAC: multiply-accumulate.
Non-Integer bitlengths (Avg) Rounded Integer bitlengths (Avg)
Network target Accuracy BS of 1 footprint BS of 128 footprint MAC operations BS of 1 footprint BS of 128 footprint MAC operations
AlexNet regular 78.5 2.97 3.22 3.82 3.40 3.64 4.25
BS 1 80.0 2.35 3.53 5.03 2.57 3.8 5.45
BS 128 78.7 2.42 2.90 3.91 2.80 3.29 4.33
MAC ops 79.1 2.89 3.39 3.39 3.53 3.96 3.90
ResNet18 regular 93.5 1.23 2.85 2.26 1.35 2.70 2.15
BS 1 94.7 1.12 2.85 1.78 1.34 2.70 2.16
BS 128 93.6 1.67 1.83 2.55 1.86 2.32 3.10
MAC ops 94.1 1.53 2.35 1.74 1.64 2.89 2.07

III-A4 Network Structure

TABLE V: Number of bits for activations/weights and achieved accuracy for expanded/compressed layers. Bold numbers indicate a change in the integer bitlength; italic ones indicate a change in the non-integer bitlength larger than 0.25.
accuracy | per-layer weight bitlengths, avg | per-layer activation bitlengths, avg
base 78.70 3.99 3.49 3.48 2.51 2.32 2.38 3.03 3.54 4.61 4.54 3.58 3.51 3.16 3.82
1.00 81.00 3.17 3.33 3.42 2.52 2.26 3.22 2.99 3.43 4.49 4.51 3.53 3.48 2.85 3.72
2.00 79.60 3.60 3.42 2.55 2.40 2.18 2.17 2.72 3.46 4.54 4.40 3.50 3.44 2.65 3.67
x4 3.00 78.10 3.78 3.44 3.44 2.35 2.32 2.33 2.94 3.48 4.56 4.52 3.54 3.50 3.11 3.78
4.00 78.40 3.99 3.50 3.48 2.45 2.32 2.35 3.01 3.52 4.60 4.54 3.58 3.49 3.19 3.82
5.00 78.20 3.99 3.51 3.47 2.54 2.39 2.36 3.04 3.54 4.60 4.55 3.59 3.56 3.24 3.85
1.00 73.50 4.23 4.47 3.50 2.44 2.47 2.44 3.26 3.62 5.51 4.58 3.63 2.63 3.40 3.90
2.00 77.20 4.09 3.54 3.54 2.52 2.42 2.44 3.09 3.57 4.63 4.61 3.61 3.53 3.36 3.89
x0.25 3.00 79.10 4.06 3.52 3.45 2.54 2.39 2.49 3.07 3.56 4.64 4.55 3.59 3.54 3.16 3.84
4.00 78.50 3.95 3.49 2.60 2.54 2.32 2.33 2.87 3.52 4.61 4.53 3.59 3.52 2.98 3.79
5.00 78.00 3.90 3.49 3.44 2.44 1.82 1.54 2.77 3.51 4.60 4.52 3.57 3.44 2.97 3.77

In this subsection we discuss the interplay of channel depth and bitlength. Prior work in ternary quantization has reported that increasing channel depth was essential in successfully training models with reduced datatypes [45]. Presumably, deeper layers would lead to smaller bitlengths in the corresponding layers, providing an additional knob for model designers to meet execution time and energy constraints (e.g., doubling the number of channels while using 1/4 the bitlength halves the footprint). The experiments change the channel depth of each layer by x0.25 and x4 whilst all other layers are kept as-is, and all networks are trained with the same parameters as the original one. Table V shows the validation accuracy as well as the bitlengths learned. On average the bitlengths are slightly smaller for the expanded networks and slightly larger for the reduced ones. In some cases the change in bitlength alters the final integer bitlength. This change is usually, but not always, in the modified or the following layer. However, in most cases the differences are negligible compared to the cost of increasing the number of channels. The opposite can occasionally be observed when reducing channel depth. However, accuracy can also vary significantly with these changes, and a set target could possibly be achieved with more aggressive regularization. Some of the benefits or costs of expanding/shrinking a layer appear to be absorbed by an increase or decrease in accuracy. As a result, it is not a straightforward task to balance channels and bitlengths whilst keeping accuracy constant.

III-A5 Weighted Bit Loss

Finally, we train AlexNet and ResNet18 with a modified loss function to minimize memory footprint or operation count. The bit loss is weighted according to the number of elements or operations in each layer, respectively, and for activations and weights separately. All other hyper-parameters are kept the same. We individually consider memory footprint for inference with batch sizes 1 and 128 (representing weight-heavy and activation-heavy cases, respectively), operation count, and the original loss-function weights. Table IV shows the effects of the resulting loss functions for a range of criteria. The weighted bitlength regularizer allows us to successfully target different criteria: on each criterion the targeted version of the network outperforms the generic case. However, differently weighted loss functions may affect accuracy, and the final bitlength also depends on the final cutoff.
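The sketch below illustrates one way such criterion-specific weights could be derived; the per-group statistics (value counts for memory, MAC counts for compute) and their names are assumptions about how the layer bookkeeping is collected, and the weights are again normalized so that an all-8-bit network produces a penalty of 1.0.

```python
def normalized_lambdas(costs):
    # Weigh each value group by its share of the chosen cost, normalized
    # so that an all-8-bit network gives sum_i lambda_i * 8 == 1.0.
    total = float(sum(costs))
    return [c / (8.0 * total) for c in costs]

# Hypothetical per-group statistics for a small two-layer example
# (weights and activations of each layer form separate groups).
sizes = [4608, 8192, 73728, 2048]     # number of values  -> memory footprint target
macs = [1.8e6, 1.8e6, 3.6e6, 3.6e6]   # multiply-accumulates -> compute target

lambdas_memory = normalized_lambdas(sizes)  # plug into the bit_loss sketch to minimize footprint
lambdas_macs = normalized_lambdas(macs)     # plug into the bit_loss sketch to minimize MAC work
```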

III-B ILSVRC2012

The networks were trained on ImageNet [46] over 180 epochs with default fast.ai [38] parameters and the one-cycle policy in PyTorch [39]. Maximum learning rates were 0.01, 0.1 and 0.1 for AlexNet [35], ResNet18 [36] and MobileNet V2 [40], respectively. The 32-bit float baseline was trained over 90 epochs.

Table VI shows TOP-1 validation accuracy for the 32-bit float baseline, and for quantized models trained with our loss. The loss parameters, $\lambda_{i}$ and $\gamma$, are set to weigh all layers equally and to normalize all bitlengths to 8. The table shows that accuracies comparable to the baseline can be achieved for all networks with around 3.5 bits on average for weights and around 4 bits for activations. As with CIFAR10, the method is better able to reduce the bitlengths of weights than of activations.

Figure 2: ImageNet validation accuracy and average bitlength during training. (a) AlexNet; (b) ResNet18; (c) MobileNetV2.
TABLE VI: ImageNet bitlength and validation accuracy.
Non-Integer bitlengths Rounded Integer bitlengths
Network Regularizer Accuracy Weights # of bits Activations # of bits Final Accuracy Weights # of bits Activations # of bits
AlexNet Baseline 57.12 32 32 57.12 32 float 32 float
AlexNet 1.0 56.20 3.35 4.07 55.07 3.88 4.38
ResNet18 Baseline 69.54 32 32 69.54 32 float 32 float
ResNet18 1.0 69.26 2.86 3.80 69.19 3.38 4.14
MobileNet V2 Baseline 70.44 32 32 70.44 32 float 32 float
MobileNet V2 1.0 70.99 3.59 4.15 70.09 4.15 4.57

III-B1 Learning Bitlengths

Figure 2 shows the validation accuracy and average bitlengths of activations and weights over the 180 training epochs. After sufficient training the non-integer quantized networks reach near-baseline accuracy; however, during training the validation accuracy of the quantized networks underperforms the baseline. Per-layer bitlengths drop quickly and concurrently, within 10 epochs, to near-final values. At this point the accuracy of AlexNet diverges from the baseline, while the accuracy of ResNet starts diverging a bit later. While the bitlengths do not change much until the end of training, they slowly and noticeably increase for AlexNet, whilst for ResNet they continue to slowly decrease. Although the changes are small, they may tip the final integer bitlength.

III-B2 Selecting Bitlengths

Final bitlengths are selected as the ceiling of their learned values, and the network is fine-tuned for 90 epochs (35 for MobileNet) with 1/10th of the learning rate. Table VI shows that this fine-tuning phase recovers the validation accuracy lost due to selecting integer bitlengths.

Figure 3: Number of bits for activations/weights for each layer, listed in a breadth-first manner. (a) AlexNet; (b) ResNet18; (c) MobileNet V2.

III-B3 Bitlength vs. Layer Position

Figure 3 shows that the bitlengths vary per layer for weights and activations, justifying the approach of using finer granularities. AlexNet and ResNet show a slight descending trend towards later layers. Generally, the first and last layers require more bits. Similarly, activations typically require larger bitlengths than weights.

III-B4 Selecting Bitlength Early

We then explore how early the final bitlengths can be selected. This is inspired by the fact that the non-integer bitlengths converge quickly to near-final values, yet still change slightly throughout most of training. We test an early-selection approach on AlexNet by training both the bitlengths and the model for the first 30 epochs, and then fixing the bitlengths to integers for the next 150 epochs. This approach closely tracks the non-integer version, with a final accuracy loss of about 1%.

III-B5 Use as fine-tuning

We also investigated using our method as a means of fine-tuning pre-trained models.

We applied the bitlength training to an already trained 8-bit AlexNet with the same learning rate over 90 epochs. Just as with the original BitPruning version, the drop in bitlengths significantly reduces the accuracy of the fine-tuned network. After 90 epochs of training for bitlengths, the network starts to approach the baseline accuracy.

However, the bitlengths of the fine-tuned version fall more slowly than those of the original. At the end of training the fine-tuned version requires 4.67 non-integer or 5.38 integer bits. This is significantly worse than the network trained with our regularizer from the beginning. Nevertheless, the results demonstrate that our method can deliver benefits when fine-tuning pre-trained models.

III-B6 Comparison with Other Quantization Techniques

Table VII clearly shows the advantage of our approach against uniform 4-bit (PACT) quantization and a profiled per-layer quantization in both validation accuracy and bitlength.

Additionally, we compare to MPDNN, a recent work which learns quantization parameters during training [29], briefly discussed in Section I. MPDNN starts with a pre-trained network and then learns quantization scale and range to fit a given memory limit. Both techniques achieve a validation accuracy within 0.4% of the full-precision baseline for both ResNet18 and MobileNet V2 on ImageNet. When tasked with optimizing accuracy and memory footprint, MPDNN assigns 10.50MB/1.05MB and 3.14MB/1.58MB of weight/activation memory for ResNet18 and MobileNet V2 respectively. As per the original MPDNN study, these memory requirements are for all weights and for the largest activation layer, respectively. When MPDNN is tasked with optimizing accuracy while meeting a pre-determined memory constraint, the requirements drop to 5.4MB/0.38MB and 1.55MB/0.57MB. Note that these weight memory constraints must be expertly selected to ensure that near-baseline accuracy can still be achieved. With BitPruning, weights and activations respectively require 5.5MB/0.67MB and 2.2MB/0.72MB. BitPruning achieves this low memory footprint despite: 1) using a loss function that does not explicitly target memory footprint, and 2) not requiring the user to specify a target memory budget that is known to work well. Further, Table IV shows that by adjusting the loss function BitPruning can explicitly target memory footprint. Finally, BitPruning is better suited to accelerators that benefit from reduced bitlengths, even after the network fits on chip.

TABLE VII: Comparison with other quantization techniques
AlexNet ResNet18 MobileNet V2
Method Accuracy Weights Activations Accuracy Weights Activations Accuracy Weights Activations
PACT 55.7 5.0 5.0 69.2 4.38 4.38
Profiled 55.78 7.63 5.75 65.56 6.41 6.34 69.9 7.33 7.02
BitPruning 55.07 3.875 4.375 69.19 3.38 4.14 70.09 4.15 4.57

III-B7 Hardware Benefits

Finally, in Table VIII we show the benefits of our approach on existing and proposed hardware designs. BitPruning significantly outperforms past profiling approaches [23, 22], as well as the 8 bit baseline.

TABLE VIII: Trained vs profiled quantization on select accelerators. Perf - Speedup, Mem - Total Storage
AlexNet ResNet18 MobileNet V2
Trained Profiled Trained Profiled Trained Profiled
Accelerator Perf Mem Perf Mem Perf Mem Perf Mem Perf Mem Perf Mem
Stripes 1.69× 0.95× 1.26× 0.98× 1.72× 0.94× 1.23× 0.98× 1.76× 0.53× 1.09× 0.95×
Dpred 3.97× 0.91× 3.35× 0.91× 3.98× 0.89× 3.88× 0.90× 3.36× 0.35× 3.09× 0.38×
BitFusion 1.63× 0.60× 1.00× 1.00× 2.47× 0.53× 1.00× 1.00× 1.50× 0.67× 1.00× 1.00×
Loom 3.74× 0.27× 2.81× 0.33× 4.11× 0.29× 3.44× 0.35× 3.70× 0.34× 2.67× 0.40×
Proteus 0.47× 0.98× 0.50× 0.81× 0.53× 0.95×

IV Training Costs

BitPruning incurs costs during training to learn the bitlengths. On ImageNet, BitPruning does not achieve baseline accuracy within the base 90 epochs (1.5%-2.5% loss), and requires 2.3×-2.7× more time than the baseline due to the extra computations. To match the baseline accuracy, BitPruning required 180 (2×) epochs and 4.6×-5.4× more time. Additionally, BitPruning requires twice as much memory. Neither of these is a major concern since BitPruning targets aggressive quantization for efficient inference.

V Conclusion

BitPruning is capable of learning the bitlengths for accurate inference with a controlled increase of the loss function. On CIFAR10 we obtain 3.92 and 2.17 bits on average on AlexNet and ResNet18 respectively, with accuracies within 0.9% of the baseline for the weakest regularizer. On ImageNet we obtain 4.13, 3.76 and 4.36 average bitlengths with AlexNet, ResNet18 and MobileNet V2 respectively, whilst remaining within 2.0%, 0.5% and 0.5% of baseline accuracy. BitPruning can be applied at an arbitrary granularity with any selected weighted criteria. We demonstrate how this method can be used to minimize the compute workload and memory footprint in weight-heavy (small batch) or activation-heavy (large batch) cases. With CIFAR10, for AlexNet and ResNet18 we reduce footprint by 10%-24% and computation workload by 4%-8%. Similarly, a modification of the weights in the regularizer enables optimizing any other quantifiable criterion. BitPruning can be used to quantize all layers, including the first and last, resulting in a simple end-to-end approach to quantization. Additionally, it naturally benefits existing hardware designs that can exploit different datatypes to significantly boost performance, and it reduces memory traffic and footprint for all devices.

Broader Impact

At the most fundamental level, all compute requires a hardware device that performs two functions: 1) data storage and transfers (e.g., memory), and 2) data manipulation/compute (e.g., addition, multiplication, etc.). Developing techniques that reduce energy and execution time by targeting these fundamental operations is bound to impact virtually all segments of computing, from the edge to the data center. The environmental costs of computing are substantial, and the potential benefits from more capable computing hardware are tremendous and have been reiterated time after time; without more powerful machines many of the innovations that we take for granted today would never have materialized, and we are certain that many innovations to come will not materialize unless we succeed in continuing to improve computing hardware capabilities. The datatype used to represent and operate upon data can greatly impact silicon chip area, energy consumption, and as a result computing performance. It is for this reason that this work, even though it targets a seemingly “trivial” and “simple” parameter, the datatype, can have an immediate, broad, and long-lasting impact. Moreover, choosing the right datatype is deceptively simple, as the tradeoffs and potential usage scenarios are numerous (e.g., are we optimizing for a specific device, for a device to be built, for the edge, for the server, for memory on- and/or off-chip, for compute, etc.).

In more detail, the ability to learn the number of bits required for minimal uniform quantization, at a finer granularity than the whole network, will improve the execution time and energy costs of the many commodity hardware platforms and specialized hardware accelerators that exploit this property. As a result, the same work can be done with less power, decreasing the energy footprint of inference and reducing the climate impact of machine learning. Furthermore, the reduction in energy cost will enable us to push deep learning further towards the edge, reducing the reliance on, and communication with, data servers, and therefore improving user experience and privacy as well as reducing energy cost on mobile devices.

The main drawback of the method is the extra time and energy required to train the model, which can be exacerbated for models that are continuously trained. These costs should be eclipsed by the benefits in most cases; however, a more detailed analysis is needed on a case-by-case basis.

References

  • [1] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA ’11.   New York, NY, USA: ACM, 2011, pp. 365–376.
  • [2] M. Horowitz, “Computing’s energy problem (and what we can do about it),” IEEE Intl’ Solid-State Circuits Conf., vol. 57, pp. 10–14, 02 2014.
  • [3] Nvidia, “Nvidia AI inference platform performance study,” Nvidia, Tech. Rep., 2018.
  • [4] A. Rodriguez, E. Segal, E. Meiri, E. Fomenko, Y. Jim Kim, H. Shen, and B. Ziv, “Lower numerical precision deep learning inference and training,” Intel, Tech. Rep., 2018.
  • [5] H. Wu, “Low precision inference on gpu,” Nvidia, Tech. Rep., 2019.
  • [6] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA ’17, 2017, pp. 1–12.
  • [7] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, “Stripes: Bit-serial Deep Neural Network Computing ,” in Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-49, 2016.
  • [8] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, J. K. Kim, V. Chandra, and H. Esmaeilzadeh, “Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural networks,” CoRR, vol. abs/1712.01507, 2017. [Online]. Available: http://arxiv.org/abs/1712.01507
  • [9] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O’Leary, R. Genov, and A. Moshovos, “Bit-pragmatic deep neural network computing,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-50 ’17, 2017, pp. 382–394.
  • [10] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, N. E. Jerger, and A. Moshovos, “Proteus: Exploiting numerical precision variability in deep neural networks,” in Proceedings of the 2016 International Conference on Supercomputing, ser. ICS ’16.   New York, NY, USA: ACM, 2016, pp. 23:1–23:12. [Online]. Available: http://doi.acm.org/10.1145/2925426.2926294
  • [11] A. D. Lascorz, S. Sharify, I. Edo, D. M. Stuart, O. M. Awad, P. Judd, M. Mahmoud, M. Nikolic, K. Siu, Z. Poulos, and A. Moshovos, “Shapeshifter: Enabling fine-grain data width adaptation in deep learning,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52.   New York, NY, USA: Association for Computing Machinery, 2019, p. 28–41. [Online]. Available: https://doi.org/10.1145/3352460.3358295
  • [12] E. Park, D. Kim, and S. Yoo, “Energy-efficient neural network accelerator based on outlier-aware low-precision computation,” 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 688–698, 2018.
  • [13] A. D. Lascorz, S. Sharify, P. Judd, K. Siu, M. Nikolic, and A. Moshovos, “Dpred: Making typical activation values matter in deep learning computing,” CoRR, vol. abs/1804.06732, 2017. [Online]. Available: http://arxiv.org/abs/1804.06732
  • [14] S. Sharify, A. D. Lascorz, P. Judd, and A. Moshovos, “Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks,” CoRR, vol. abs/1706.07853, 2017. [Online]. Available: http://arxiv.org/abs/1706.07853
  • [15] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, N. E. Jerger, and A. Moshovos, “Proteus: Exploiting numerical precision variability in deep neural networks,” in Proceedings of the 2016 International Conference on Supercomputing, ser. ICS ’16.   New York, NY, USA: ACM, 2016, pp. 23:1–23:12. [Online]. Available: http://doi.acm.org/10.1145/2925426.2926294
  • [16] O. Bilaniuk, S. Wagner, Y. Savaria, and J.-P. David, “Bit-slicing fpga accelerator for quantized neural networks,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS).   IEEE, 2019, pp. 1–5.
  • [17] Y. Umuroglu, L. Rasnayake, and M. Själander, “BISMO: A scalable bit-serial matrix multiplication overlay for reconfigurable computing,” CoRR, vol. abs/1806.08862, 2018. [Online]. Available: http://arxiv.org/abs/1806.08862
  • [18] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural Networks with binary weights during propagations,” ArXiv e-prints, Nov. 2015.
  • [19] J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan, “PACT: parameterized clipping activation for quantized neural networks,” CoRR, vol. abs/1805.06085, 2018. [Online]. Available: http://arxiv.org/abs/1805.06085
  • [20] S. R. Jain, A. Gural, M. Wu, and C. Dick, “Trained uniform quantization for accurate and efficient neural network inference on fixed-point hardware,” CoRR, vol. abs/1903.08066, 2019. [Online]. Available: http://arxiv.org/abs/1903.08066
  • [21] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, “Learned step size quantization,” CoRR, vol. abs/1902.08153, 2019. [Online]. Available: http://arxiv.org/abs/1902.08153
  • [22] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. Enright Jerger, R. Urtasun, and A. Moshovos, “Reduced-precision strategies for bounded memory in deep neural nets,” arXiv:1511.05236 [cs.LG], 2015.
  • [23] M. Nikolic, M. Mahmoud, Y. Zhao, R. Mullins, and A. Moshovos, “Characterizing sources of ineffectual computations in deep learning networks,” in International Symposium on Performance Analysis of Systems and Software, 03 2019.
  • [24] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless cnns with low-precision weights,” 02 2017.
  • [25] J. Fromm, S. Patel, and M. Philipose, “Heterogeneous bitwidth binarization in convolutional neural networks,” CoRR, vol. abs/1805.10368, 2018. [Online]. Available: http://arxiv.org/abs/1805.10368
  • [26] E. Park, S. Yoo, and P. Vajda, “Value-aware quantization for training and inference of neural networks,” CoRR, vol. abs/1804.07802, 2018. [Online]. Available: http://arxiv.org/abs/1804.07802
  • [27] A. T. Elthakeb, P. Pilligundla, A. Yazdanbakhsh, S. Kinzer, and H. Esmaeilzadeh, “Releq: A reinforcement learning approach for deep quantization of neural networks,” CoRR, vol. abs/1811.01704, 2018. [Online]. Available: http://arxiv.org/abs/1811.01704
  • [28] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, “HAQ: hardware-aware automated quantization,” CoRR, vol. abs/1811.08886, 2018. [Online]. Available: http://arxiv.org/abs/1811.08886
  • [29] S. Uhlich, L. Mauch, K. Yoshiyama, F. Cardinaux, J. A. García, S. Tiedemann, T. Kemp, and A. Nakamura, “Differentiable quantization of deep neural networks,” CoRR, vol. abs/1905.11452, 2019. [Online]. Available: http://arxiv.org/abs/1905.11452
  • [30] A. Delmas, P. Judd, S. Sharify, and A. Moshovos, “Dynamic stripes: Exploiting the dynamic precision requirements of activation values in neural networks,” CoRR, vol. abs/1706.00504, 2017. [Online]. Available: http://arxiv.org/abs/1706.00504
  • [31] A. Delmas Lascorz, P. Judd, D. M. Stuart, Z. Poulos, M. Mahmoud, S. Sharify, M. Nikolic, K. Siu, and A. Moshovos, “Bit-tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’19.   New York, NY, USA: ACM, 2019, pp. 749–763. [Online]. Available: http://doi.acm.org/10.1145/3297858.3304041
  • [32] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, “Unpu: A 50.6tops/w unified deep neural network accelerator with 1b-to-16b fully-variable weight bit-precision,” 2018 IEEE International Solid - State Circuits Conference - (ISSCC), pp. 218–220, 2018.
  • [33] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
  • [34] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Advances in neural information processing systems, 2016, pp. 4107–4115.
  • [35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep cnns,” in NIPS 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds.   Curran Associates, Inc., 2012, pp. 1097–1105.
  • [36] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [37] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
  • [38] J. Howard et al., “fastai,” https://github.com/fastai/fastai, 2018.
  • [39] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
  • [40] M. Sandler, A. F. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
  • [41] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,” CoRR, vol. abs/1608.06993, 2016. [Online]. Available: http://arxiv.org/abs/1608.06993
  • [42] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” CoRR, vol. abs/1707.01629, 2017. [Online]. Available: http://arxiv.org/abs/1707.01629
  • [43] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” CoRR, vol. abs/1611.05431, 2016. [Online]. Available: http://arxiv.org/abs/1611.05431
  • [44] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” CoRR, vol. abs/1603.05027, 2016. [Online]. Available: http://arxiv.org/abs/1603.05027
  • [45] A. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr, “WRPN: Wide reduced-precision networks,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=B1ZvaaeAZ
  • [46] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” arXiv:1409.0575 [cs], Sep. 2014, arXiv: 1409.0575.