Competition-based Adaptive ReLU
for Deep Neural Networks
Abstract
Activation functions introduce nonlinearity into deep neural networks. Most popular activation functions allow positive values to pass through while blocking or suppressing negative values. Starting from the idea that positive and negative values are equally important and must compete for activation, we propose a new Competition-based Adaptive ReLU (CAReLU). CAReLU scales the input values based on the results of a competition between positive and negative values. It defines two parameters that adjust the scaling strategy and can be trained uniformly with the other network parameters. We verify the effectiveness of CAReLU on image classification, super-resolution, and natural language processing tasks, where it performs better than other widely used activation functions. In particular, replacing ReLU in ResNet-18 with CAReLU improves classification accuracy on the CIFAR-100 dataset. Its effectiveness and the new perspective on utilizing the competition between positive and negative values make CAReLU a promising activation function.
Index Terms:
Adaptive activation function, ReLU, deep learning, competition-based
I Introduction
Deep learning is one of the most popular techniques, demonstrating superior performance across many applications, and has been the focus of the academic and engineering communities for years. Deep learning techniques are evolving fast, with hundreds of new models proposed each year, and their applications have quickly come to dominate many areas such as computer vision (CV) [1, 2] and natural language processing (NLP) [3, 4].
y = f(w^\top x + b)    (1)
A deep neural network contains thousands of neural units. Every neural unit gathers input signals from other neurons and generates a signal for other neurons to process. The structure of a deep neural network is a directed acyclic graph in which every vertex represents a mapping from the signals of incoming arcs to new signals on outgoing arcs. Eq. (1) is one of the most common vertices in deep neural networks; it applies an affine transformation and a non-linear mapping f to the input, where x is the n-dimensional input gathered from incoming arcs, w is the weight vector, b is the bias, and y is the vertex's output. w and b are obtained through training. The non-linear mapping f is called the activation function, without which a deep neural network degenerates into a linear regression model.
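As a concrete illustration of Eq. (1), the following minimal PyTorch snippet computes a single vertex's output; the tensor sizes and the choice of ReLU as the non-linear mapping f are illustrative only.

```python
import torch
import torch.nn.functional as F

# A single vertex of Eq. (1): affine transformation followed by a nonlinearity f.
x = torch.randn(4)        # n-dimensional input gathered from incoming arcs (here n = 4)
w = torch.randn(4)        # weight vector, learned during training
b = torch.tensor(0.1)     # bias, learned during training
y = F.relu(w @ x + b)     # vertex output; ReLU stands in for the activation f
```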
\mathrm{ReLU}(x) = \max(0, x)    (2)
In the early stage of deep learning research, S-shaped functions such as the sigmoid and the hyperbolic tangent (tanh) were proposed and became popular, but they were later found to be ineffective for training deep networks. The Rectified Linear Unit (ReLU) [5, 6], shown in Eq. (2), was then proposed and demonstrates fast convergence, better generalization, and ease of implementation. However, ReLU completely shuts down negative values, which led to the development of LeakyReLU [7], which allows negative values to pass through with a small slope. Parametric ReLU (PReLU) [8] improves model fitting by making the slope of LeakyReLU's negative part a trainable parameter. Multi-phase ReLU [9] utilizes six different phases of the input, and sparse regularization [10] increases the sparsity of ReLU's input.
Activation functions with more complicated formulas have also been explored. Hendrycks and Gimpel proposed the Gaussian Error Linear Unit (GELU) and the Sigmoid Linear Unit (SiLU) [11], which are widely used in natural language processing. SiLU was also discovered and named Swish [12] by Ramachandran et al. through an automated search over activation functions. Inspired by SiLU's self-gating property, Mish [13] and Serf [14] were proposed to further improve performance. Yuen et al. developed the Universal Activation Function (UAF) [15], which generalizes several popular activation functions using five trainable parameters. Shen et al. proposed tanhLU [16] by integrating tanh into a linear unit, and Noel et al. proposed the Growing Cosine Unit (GCU) [17], which can improve gradient flow and reduce network size.
Most activation functions process the positive part and the negative part of the input tensor differently. For example, ReLU completely blocks negative values, while LeakyReLU, PReLU, and SiLU scale them down. This design implies that positive values are far more important than negative values. From the viewpoint of signal processing, however, positive and negative values are equally important: they should compete for activation instead of negative values simply being shut down or suppressed.
In this paper, we propose a new Competition-based Adaptive ReLU (CAReLU). 1) CAReLU utilizes the results of a competition between the positive and negative values in the input tensor. 2) CAReLU has two parameters that can be trained uniformly with the other network parameters. 3) Different competition indicators can be combined with CAReLU to generate different activation functions for different tasks. We evaluate our method and find that it consistently performs better than the most popular activation functions.
II Proposed Method
II-A Construction of Competition-based Adaptive ReLU
Our proposed method stems from the idea that positive values and negative values are equally important and must compete for activation. We choose energy as the indicator of the competition: the side with the higher energy wins the qualification for activation. Let x = (x_1, …, x_n) be the output of the previous affine transformation. The percentage of the positive values' energy is defined as follows:
p(x) = \frac{\sum_i \max(x_i, 0)^2}{\sum_i x_i^2 + \epsilon}    (3)
where a small positive constant ε is added to the denominator to prevent division by zero. We define the scale factor k(x) as follows:
k(x) = \operatorname{sgn}\big(2\,p(x) - 1\big)    (4)
where sgn(·) is the sign function. If p(x) < 1/2, then k(x) = −1, which means the negative values win the competition; if p(x) > 1/2, then k(x) = 1, which means the positive values win the competition. By multiplying the input by k(x) before passing it to ReLU, we can flip the input whenever the negative values have higher energy:
y = \mathrm{ReLU}\big(k(x)\,x\big)    (5)
However, this vanilla version of our idea does not perform well. First, it cannot degenerate into a regular ReLU in cases where prioritizing the higher-energy side impairs performance. Second, the sign function creates discontinuities in the model's parameter space, which harms gradient-based optimization. To address the first issue, we replace the fixed scaling factor of 2 and the bias of −1 in Eq. (4) with trainable per-layer parameters α and β:
k(x) = \operatorname{sgn}\big(\alpha\,p(x) + \beta\big)    (6)
If α is close or equal to 0 after training, this mapping degenerates into ReLU applied to the input scaled by a constant factor of 1 or −1. Only 2 extra parameters are introduced per layer, which is negligible compared with the total number of weights.
To address the second issue, we replace the sign function with the tanh function:
k(x) = \tanh\big(\alpha\,p(x) + \beta\big)    (7)
The tanh function has a codomain ranging from −1 to 1 and is continuous with a non-zero gradient, so it can be viewed as a smooth version of the sign function.
There are indicators of the competition between positive and negative values other than energy. The other two indicators we experiment with are:
p(x) = \frac{\sum_i \max(x_i, 0)}{\sum_i |x_i| + \epsilon}    (8)
p(x) = \frac{\sum_i H(x_i)}{n + \epsilon}    (9)
where H(·) is the Heaviside step function and n is the number of elements in x; Eq. (8) is the percentage of the positive values' L1-norm, and Eq. (9) is the percentage of the number of positive values.
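For concreteness, below is a minimal PyTorch sketch of the three competition indicators in Eqs. (3), (8), and (9); the function names and the value of ε are illustrative assumptions.

```python
import torch

def energy_share(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Eq. (3): fraction of the tensor's energy (sum of squares) held by positive values.
    return x.clamp(min=0).pow(2).sum() / (x.pow(2).sum() + eps)

def l1_share(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Eq. (8): fraction of the L1-norm held by positive values.
    return x.clamp(min=0).sum() / (x.abs().sum() + eps)

def count_share(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Eq. (9): fraction of entries that are positive (Heaviside step function).
    return (x > 0).float().sum() / (x.numel() + eps)

x = torch.tensor([1.0, -2.0, 3.0, -0.5])
print(energy_share(x), l1_share(x), count_share(x))  # ~0.70, ~0.62, 0.50
```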
Competition-based Adaptive Scaling (CAS) can then be defined as follows to adaptively scale the input based on the competition results:
\mathrm{CAS}(x) = c\,\tanh\big(\alpha\,p(x) + \beta\big)\,x    (10)
where c is a constant scaling factor. Since the magnitude of the tanh function is less than 1, the magnitude of the feature vectors would shrink layer by layer in a sequential model; the constant c scales up the result of the tanh function to counteract this effect. By composing ReLU with CAS, we construct the Competition-based Adaptive ReLU as follows:
\mathrm{CAReLU}(x) = \mathrm{ReLU}\big(\mathrm{CAS}(x)\big)    (11)
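A minimal PyTorch sketch of CAS (Eq. (10)) and CAReLU (Eq. (11)) with the energy indicator of Eq. (3) follows; the per-sample reduction, the module names, and the initial value β₀ = 0.5 are our assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAS(nn.Module):
    """Competition-based Adaptive Scaling, Eq. (10): c * tanh(alpha * p(x) + beta) * x."""

    def __init__(self, beta0: float = 0.5, eps: float = 1e-8):
        super().__init__()
        # Per-layer trainable parameters, initialized per Eq. (13): alpha starts at 0
        # (competition ignored at first) and c = 1/tanh(beta0) makes CAS an identity map.
        self.alpha = nn.Parameter(torch.zeros(1))
        self.beta = nn.Parameter(torch.full((1,), beta0))
        self.register_buffer("c", 1.0 / torch.tanh(torch.tensor(beta0)))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Energy share of positive values (Eq. (3)), reduced per sample over feature dims;
        # expects an input with a leading batch dimension.
        dims = tuple(range(1, x.dim()))
        p = x.clamp(min=0).pow(2).sum(dim=dims, keepdim=True) / (
            x.pow(2).sum(dim=dims, keepdim=True) + self.eps)
        return self.c * torch.tanh(self.alpha * p + self.beta) * x

class CAReLU(nn.Module):
    """Eq. (11): CAReLU(x) = ReLU(CAS(x))."""

    def __init__(self, **kwargs):
        super().__init__()
        self.cas = CAS(**kwargs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(self.cas(x))
```

In use, such a module would simply take the place of an nn.ReLU instance wherever an activation is needed.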
II-B Working with Batch Normalization
Batch Normalization (BN) [18] is widely used in convolutional neural networks. BN normalizes the input over a mini-batch and rescales it with 2 trainable parameters before the activation. Since CAReLU relies on the competition between positive and negative values, BN's normalization might impair this process. To improve our method's compatibility with BN, we place Eq. (10) before BN so that the network obtains the competition results before normalization:
y = \mathrm{ReLU}\Big(\mathrm{BN}\big(\mathrm{CAS}(x)\big)\Big)    (12)
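A sketch of the Eq. (12) ordering for a convolutional block, reusing the CAS module sketched above so that the competition is measured on the pre-normalization values; the layer sizes are placeholders.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvCASBNBlock(nn.Module):
    """Conv -> CAS -> BN -> ReLU, following Eq. (12)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False)
        self.cas = CAS()                     # scaling only; ReLU is applied after BN
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        # Competition results are computed before BN re-normalizes the activations.
        return F.relu(self.bn(self.cas(self.conv(x))))
```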
II-C Gradients of the Competition-based Adaptive Scaling
Let α₀ be the initial value of α and β₀ be the initial value of β. Instead of running a grid search for initial values, we simply set them as follows:
\alpha_0 = 0, \qquad c = \frac{1}{\tanh(\beta_0)}    (13)
where α₀ = 0 means that the network does not utilize competition results at the beginning of training, and β₀ is chosen so that the tanh function starts in its non-saturating region, where the gradient is large enough to initiate updates. The constant c is set to 1/tanh(β₀) so that CAS is initialized as an identity mapping when training begins.
To update the parameters, we need the gradients of the loss function with respect to α, β, and the input x. Let y be the output of CAS:
y = \mathrm{CAS}(x) = c\,\tanh\big(\alpha\,p(x) + \beta\big)\,x    (14)
Gradients can be derived from the chain rule as follows:
\frac{\partial L}{\partial \alpha} = \sum_i \frac{\partial L}{\partial y_i}\, c\,\big(1 - \tanh^2(\alpha\,p(x) + \beta)\big)\, p(x)\, x_i
\frac{\partial L}{\partial \beta} = \sum_i \frac{\partial L}{\partial y_i}\, c\,\big(1 - \tanh^2(\alpha\,p(x) + \beta)\big)\, x_i
\frac{\partial L}{\partial x_j} = c\,\tanh\big(\alpha\,p(x) + \beta\big)\,\frac{\partial L}{\partial y_j} + c\,\alpha\,\big(1 - \tanh^2(\alpha\,p(x) + \beta)\big)\,\frac{\partial p}{\partial x_j} \sum_i \frac{\partial L}{\partial y_i}\, x_i    (15)
where L is the loss function, ∂L/∂y_i is obtained from backpropagation, and ∂p/∂x_j is the partial derivative of the competition indicator with respect to the input. For the energy, L1-norm, and counting indicators, we ignore the small positive constant ε, and the gradients are derived as follows:
\frac{\partial p}{\partial x_j} = \frac{2\,x_j\,\big(H(x_j) - p(x)\big)}{\sum_i x_i^2} \quad \text{for Eq. (3)}
\frac{\partial p}{\partial x_j} = \frac{H(x_j) - \operatorname{sgn}(x_j)\,p(x)}{\sum_i |x_i|} \quad \text{for Eq. (8)}
\frac{\partial p}{\partial x_j} = 0 \;\text{(almost everywhere)} \quad \text{for Eq. (9)}    (16)
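In practice these gradients are produced automatically by reverse-mode differentiation; the sketch below checks the α-gradient of Eq. (15) against PyTorch autograd for the energy indicator (the constants and the mean-squared dummy loss are arbitrary choices of ours).

```python
import torch

torch.manual_seed(0)
x = torch.randn(8)
alpha = torch.tensor(0.3, requires_grad=True)
beta = torch.tensor(0.5, requires_grad=True)
c = 1.0 / torch.tanh(torch.tensor(0.5))                      # c = 1/tanh(beta0), Eq. (13)

p = x.clamp(min=0).pow(2).sum() / (x.pow(2).sum() + 1e-8)    # energy indicator, Eq. (3)
t = torch.tanh(alpha * p + beta)
y = c * t * x                                                # CAS output, Eq. (14)

loss = y.pow(2).mean()                                       # dummy loss: dL/dy_i = 2*y_i/n
loss.backward()

# Eq. (15): dL/dalpha = sum_i dL/dy_i * c * (1 - tanh^2(.)) * p(x) * x_i
dL_dy = 2.0 * y.detach() / y.numel()
analytic = (dL_dy * c * (1.0 - t.detach() ** 2) * p.detach() * x).sum()
print(torch.allclose(analytic, alpha.grad, atol=1e-6))       # expected: True
```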
III Experiments
In this section, we evaluate CAReLU on different applications. For models without BN, we directly replace the existing activation function in every layer with Eq. (11). For models with BN, we also replace the BN–ReLU combination in every layer with Eq. (12). We refer to CAReLU implemented with the energy indicator in Eq. (3), the L1-norm indicator in Eq. (8), and the counting indicator in Eq. (9) as the energy-based, L1-norm-based, and counting-based implementations, respectively. The following experiments show that our method outperforms other popular activation functions in multiple applications.
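As an illustration of how such a replacement can be carried out, the sketch below swaps every nn.ReLU in a torchvision ResNet-18 for the CAReLU module sketched earlier; note that torchvision's residual blocks reuse a single ReLU module, so this gives one CAReLU per block rather than strictly one per activation site.

```python
import torch.nn as nn
from torchvision.models import resnet18

def replace_relu(module: nn.Module, act_factory=CAReLU) -> None:
    """Recursively replace every nn.ReLU child with a freshly constructed activation."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, act_factory())
        else:
            replace_relu(child, act_factory)

model = resnet18(num_classes=100)   # CIFAR-100 has 100 classes
replace_relu(model)
```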
III-A CIFAR-100 Image Classification
We compare our method to the most popular activation functions on the CIFAR-100 image classification task [19]. CIFAR-100 is a dataset with 100 classes, containing 500 training images and 100 test images per class. We use the ResNet-18 [20], GoogLeNet [21], and VGG-13 [22] networks to evaluate our activation functions. A stochastic gradient descent (SGD) optimizer with momentum and weight decay is used to train all networks; the learning rate starts from a fixed initial value and is divided by a constant factor at three scheduled epochs before training ends.
TABLE I: Classification accuracy of each activation function on CIFAR-100 with ResNet-18, GoogLeNet, and VGG-13.
Table I shows the results of the classification experiment. Most implementations of our method perform better overall than the baseline methods, and our method achieves the highest accuracy on ResNet-18, GoogLeNet, and VGG-13. When comparing different implementations of our method, placing CAS before BN as in Eq. (12) generally performs better, with a few exceptions on ResNet-18 and GoogLeNet, which supports the earlier assumption that BN's normalization impairs the effectiveness of our method. Energy-based implementations generally perform better than the other two: the gradient of the energy indicator in Eq. (3) is continuous and smoother than that of the L1-norm indicator in Eq. (8), while the counting indicator in Eq. (9) contains Heaviside step functions, which leads to a bumpy loss landscape [23] and thus impairs performance.
TABLE II: Means and standard deviations of the CAS scale factor in the first 10 CAS layers (#1–#10) of the best trained ResNet-18, GoogLeNet, and VGG-13 models, measured on the test set.
Fig. 1: Histograms of the CAS scale factor values obtained from the best models on the test set.
We also examine the values of the CAS scale factor tanh(α·p(x) + β) in each layer. Table II shows the means and standard deviations of this scale factor for the first 10 CAS layers of the best models on the test set. The small standard deviations indicate that the CAS layers of trained models do not generate significantly different scale factors for different samples; instead, input values are scaled roughly uniformly and then fine-tuned for each sample according to the input's competition results. Some CAS layers degenerate into constant scaling, such as CAS #6 of VGG-13 in Table II. Allowing degeneracy into ReLU enables our method to utilize competition results without compromising performance whenever the original activation is already optimal. Histograms of the scale factor obtained from the best models on the test set are shown in Fig. 1. Although our initial design in Eq. (4) was a binary scale factor, the trained values mostly land in the non-saturating region of tanh.
III-B BSD-300 Image Super Resolution
We compare our method to the most popular activation functions on an image super-resolution task using the Berkeley Segmentation Dataset (BSD300), which contains 200 training images and 100 test images [24]. The network used in this experiment is the efficient sub-pixel convolutional neural network (ESPCN), which comprises several convolutional layers followed by a pixel-shuffle layer [25]. Training images are cropped and scaled down by the upscale factor r, and the ESPCN network upscales the down-scaled images back to the original crop size. An Adam optimizer [26] is used to train the network for 200 epochs. We run every setting 5 times and report the results in Table III.
The results show that the energy-based and L1-norm-based implementations surpass the other activation functions in PSNR. The counting-based implementation outperforms PReLU, Swish-1, and Swish, but does not perform as well as the energy-based and L1-norm-based implementations because of the discontinuity introduced by the Heaviside step function.
TABLE III: PSNR (mean ± std over 5 runs) on the BSD300 test set for each activation function.
III-C Natural Language Inference on SNLI
In this section, we evaluate our method on the Stanford Natural Language Inference (SNLI) corpus [27], a collection of 570k human-written English sentence pairs labeled as entailment, contradiction, or neutral; each pair consists of a premise and a hypothesis. The model we use for this task comprises an embedding layer, a long short-term memory (LSTM) [28] encoder, and a sequence of fully connected layers. The premise and the hypothesis go through the embedding layer and the encoder independently; we then concatenate the encoded premise and hypothesis features and send them through the fully connected layers for classification. We use the Adam optimizer to train the parameters for 50 epochs on the SNLI training set. We run every setting 5 times and report the results in Table IV.
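A rough sketch of the SNLI model described above: a shared embedding and LSTM encoder applied to the premise and hypothesis independently, followed by concatenation and fully connected layers. The embedding and hidden sizes are illustrative assumptions, and CAReLU refers to the module sketched earlier.

```python
import torch
import torch.nn as nn

class SNLIClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 300,
                 hidden_dim: int = 512, num_classes: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            CAReLU(),                                  # activation under evaluation
            nn.Linear(hidden_dim, num_classes),
        )

    def encode(self, tokens: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.encoder(self.embed(tokens))
        return h[-1]                                   # final hidden state as sentence feature

    def forward(self, premise: torch.Tensor, hypothesis: torch.Tensor) -> torch.Tensor:
        features = torch.cat([self.encode(premise), self.encode(hypothesis)], dim=-1)
        return self.classifier(features)
```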
TABLE IV: Classification accuracy (mean ± std over 5 runs) on the SNLI test set for each activation function; activation functions whose training did not converge are marked as such.
The data suggest that our methods perform better than the other activation functions. Swish and Swish-1 fail to converge in this task. Our method achieves the highest classification accuracy in this experiment; two of our implementations reach approximately identical mean accuracies, with one being more robust because of its smaller standard deviation. All three implementations of our method outperform the other baseline activation functions.
IV Conclusion
Stemming from the idea that positive values and negative values are equally important and must compete for activation, we developed the Competition-based Adaptive ReLU (CAReLU) activation function, which introduces a competition between the positive and negative values of the input. CAReLU has 2 parameters that can be trained uniformly with the other network parameters using gradient descent. By implementing CAReLU with each of the 3 competition indicators, we obtained 3 different activation functions, based respectively on energy, the L1-norm, and the number of positive values. We also developed a technique for working with Batch Normalization that provides an extra performance gain.
We evaluated our method on different tasks and achieved consistent performance improvements. The energy-based implementation is generally the most effective, but the L1-norm-based and counting-based implementations can also achieve strong performance in certain scenarios. The effectiveness and the new perspective on the competition between positive and negative values make CAReLU promising for deep learning tasks.
References
- [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
- [2] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
- [4] D. W. Otter, J. R. Medina, and J. K. Kalita, “A survey of the usages of deep learning for natural language processing,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 604–624, 2021.
- [5] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 2146–2153.
- [6] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
- [7] A. L. Maas, A. Y. Hannun, A. Y. Ng et al., “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML, vol. 30, no. 1. Atlanta, Georgia, USA, 2013, p. 3.
- [8] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
- [9] C. Banerjee, T. Mukherjee, and E. Pasiliao Jr, “The multi-phase relu activation function,” in Proceedings of the 2020 ACM Southeast Conference, 2020, pp. 239–242.
- [10] H. Ide and T. Kurita, “Improvement of learning for CNN with ReLU activation by sparse regularization,” in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 2684–2691.
- [11] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.
- [12] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” arXiv preprint arXiv:1710.05941, 2017.
- [13] D. Misra, “Mish: A self regularized non-monotonic neural activation function,” arXiv preprint arXiv:1908.08681, 2019.
- [14] S. Nag, M. Bhattacharyya, A. Mukherjee, and R. Kundu, “Serf: Towards better training of deep neural networks using log-softplus error activation function,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5324–5333.
- [15] B. Yuen, M. T. Hoang, X. Dong, and T. Lu, “Universal activation function for machine learning,” Scientific Reports, vol. 11, no. 1, pp. 1–11, 2021.
- [16] S.-L. Shen, N. Zhang, A. Zhou, and Z.-Y. Yin, “Enhancement of neural networks with an alternative activation function tanhLU,” Expert Syst. Appl., vol. 199, no. C, Aug. 2022.
- [17] M. M. Noel, A. Trivedi, P. Dutta et al., “Growing cosine unit: A novel oscillatory activation function that can speedup training and reduce parameters in convolutional neural networks,” arXiv preprint arXiv:2108.12943, 2021.
- [18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
- [19] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Univ. of Toronto, Tech. Rep., 2009.
- [20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
- [22] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [23] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018.
- [24] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, vol. 2, 2001, pp. 416–423 vol.2.
- [25] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883.
- [26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [27] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015.
- [28] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.