
Binary Stochastic Filtering: feature selection and beyond

Andrii Trelin
trelinn@vscht.cz
   Aleš Procházka
a.prochazka@ieee.org
Abstract

Feature selection is one of the most decisive tools in understanding data and machine learning models. Among other methods, sparsity induced by the $L^{1}$ penalty is one of the simplest and best studied approaches to this problem. Although such regularization is frequently used in neural networks to achieve sparsity of weights or unit activations, it is unclear how it can be employed in the feature selection problem. This work aims to introduce the ability to automatically select features into neural networks by rethinking how sparsity regularization can be used, namely, by stochastically penalizing feature involvement instead of the layer weights. The proposed method demonstrates superior efficiency compared to several classical methods, achieved with minimal or no computational overhead, and can be applied directly to any existing architecture. Moreover, the method generalizes easily to neuron pruning and to the selection of regions of importance in spectral data.

Keywords— neural network, feature selection, neuron pruning

1 Introduction

Feature selection is of great interest in all machine learning tasks, since it reduces the computational complexity of models, frequently improves generalization, and helps in understanding the data. In general, feature selection methods are divided into the following categories [1]:

Filter methods

use feature metrics, such as correlation or information gain, to distinguish between useful and useless features.

Wrapper methods

use the feedback of model metrics to optimize the selected feature subset. This problem can be solved exactly only by brute force, which makes it intractable in the majority of cases. Numerous heuristics have been suggested (recent research focuses mainly on swarm intelligence optimization [2, 3, 4]), but they cannot guarantee optimality.

Embedded methods

exist for certain algorithms that create a feature importance score during training. Classical examples are decision tree-based algorithms and $L^{1}$-penalized linear models.

Models that automatically find optimal features are clearly the most desirable type of feature selector, since they provide both the trained model and the subset of important features simultaneously. Unfortunately, that is usually possible only for very simple models, while deep neural networks (NN), one of the most crucial state-of-the-art algorithms, are unable to perform feature selection during training. This paper is devoted to the development of a method that resolves this issue by augmenting the network with a stochastic variant of $L^{1}$ penalization, which can be interpreted as a stochastic search in the feature space.

2 $L^{1}$ penalization for neural networks

The most straightforward way to achieve sparsity in neural networks is to add an $L^{1}$ penalty. This method is widely used to achieve representation sparsity [5, 6] by penalizing neuron activations, or sparsity of convolutional kernels [7, 8], which improves the performance of convolutional models. Although $L^{1}$ penalization efficiently sparsifies networks, the structure of the obtained sparse representation is unpredictable and thus cannot be used for feature selection or neuron pruning. The work of Wen et al. [9] handles this issue by explicitly introducing structure, penalizing individual components of the network such as channels, layers, etc. At the same time, $L^{1}$ penalization for feature selection has not yet been applied to neural networks.

We suggest how the well-known sparsity constraints can be applied to the input of a neural network with the aim of feature selection. The proposed method is highly universal and can be applied to the selection of input features, convolutional kernels, regions of importance, etc. It should not be confused with the widely used regularization of weights or activations.

3 Related works

Sparsification of neural networks is a popular research subject of significant importance, since it makes large and computationally demanding neural networks small and efficient enough to run on mobile devices. Application of a structured $L^{1}$ penalty for the optimization of neural network architecture was suggested by Wen et al. [9] and Scardapane et al. [10]. Both approaches are deterministic.

Since the proposed method is stochastic, it shares common properties with a wide variety of stochastic regularization techniques derived from the original Dropout [11]. Energy-based dropout [12] regularizes and prunes the network by optimizing a scalar energy with a differential evolution algorithm. The work of Srinivas et al. [13] defines a family of Dropout-like techniques; one of them, Dropout++, uses stochastic neuron dropping with trainable parameters derived through a Bayesian NN, which leads to a similar, although not identical, formulation of the filtering units. Adaptive Dropout [14] tunes the dropping probabilities by augmenting the neural network with a binary belief network.

4 Binary stochastic filtering

The main idea of the proposed method (BSF) is the application of an $L^{1}$ penalty on the involvement of each variable in the training/prediction process. This is done by element-wise multiplication of the input datum $\mathbf{x}$ by a random vector $\mathbf{r}$ such that $r_{i}\sim\text{Bernoulli}(p_{i})$, where the vector $\mathbf{p}$ defines a tunable set of parameters. This is similar to the Dropout technique, which performs the same multiplication but with predefined constant probabilities. The vector $\mathbf{p}$ is penalized with the $L^{1}$ norm, which stochastically forces the model to use only the most important features. Another way to view it is as a stochastic exploration of the feature space that simultaneously penalizes the number of involved features.
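As a minimal illustration of the forward pass (a sketch with assumed names such as bsf_forward and lam, not the reference implementation), the filtering and the associated penalty can be written as:

import numpy as np

rng = np.random.default_rng(0)

def bsf_forward(x, p, lam=1e-3):
    """Binary stochastic filtering of a batch x with shape (n_samples, n_features).

    p   -- tunable per-feature probabilities, the vector p from the text
    lam -- L1 penalty coefficient (assumed name)
    """
    r = (rng.random(x.shape) < p).astype(x.dtype)   # r_i ~ Bernoulli(p_i)
    penalty = lam * np.abs(p).sum()                 # L1 norm of p
    return x * r, penalty

x = rng.normal(size=(4, 3))
p = np.array([0.9, 0.5, 0.05])    # the third feature is almost switched off
filtered, penalty = bsf_forward(x, p)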

Gradients

To make the layer weights $\mathbf{p}$ trainable, it is necessary to define two gradients for backpropagation to work, namely $\nabla_{x}\text{bsf}(\mathbf{x})$ and $\nabla_{p}\text{bsf}(\mathbf{x})$, where $\text{bsf}(\mathbf{x})=\mathbf{x}\circ\mathbf{r}$. We define the first gradient as

\frac{\partial\,\text{bsf}(x_{i})}{\partial x_{i}}=\frac{\partial x_{i}r_{i}}{\partial x_{i}}=r_{i},

which is a natural way to describe a variable being passed or dropped, similarly to Dropout. It is trickier to define $\nabla_{p}\text{bsf}(\mathbf{x})$ due to its randomness. Instead, we can differentiate the expected value

\frac{\partial\,\mathbb{E}\,\text{bsf}(x_{i})}{\partial p_{i}}=\frac{\partial\,\mathbb{E}\,x_{i}r_{i}}{\partial p_{i}}=\frac{\partial x_{i}\mathbb{E}r_{i}}{\partial p_{i}}=\frac{\partial x_{i}p_{i}}{\partial p_{i}}=x_{i}

and use it as a gradient estimate. Moreover, it was empirically found useful to scale the gradient by the weight value, i.e. to redefine the gradient as $\frac{\partial\,\text{bsf}(x_{i})}{\partial p_{i}}=x_{i}p_{i}$. This modification has a clear interpretation: the lower the weight $p_{i}$, the lower the involvement of the feature in the training process, and thus the weights of this feature must change more slowly. This modification stabilizes training and prevents already disabled features from being re-enabled.
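These definitions map naturally onto TensorFlow's custom-gradient mechanism. The sketch below uses our own naming (bsf_op), not the authors' code, which is linked in the next paragraph: the forward pass applies the random mask, while the backward pass returns the estimates defined above.

import tensorflow as tf

@tf.custom_gradient
def bsf_op(x, p):
    """Forward: bsf(x) = x * r with r_i ~ Bernoulli(p_i).
    Backward: d bsf(x_i)/d x_i = r_i and the scaled estimate d bsf(x_i)/d p_i = x_i p_i."""
    r = tf.cast(tf.random.uniform(tf.shape(x)) < p, x.dtype)

    def grad(upstream):
        grad_x = upstream * r                              # feature passed or dropped, as in Dropout
        grad_p = tf.reduce_sum(upstream * x * p, axis=0)   # x_i p_i, summed over the batch
        return grad_x, grad_p

    return x * r, grad

When p is stored as a tf.Variable, passing tf.convert_to_tensor(p) to bsf_op lets the gradient flow back to the variable through the tape.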

The behavior of the filtering layer during the inference phase is altered by setting a threshold $\tau$ and deterministically passing features whose weights are above the threshold, while features corresponding to weights below the threshold are dropped. This replacement makes the layer deterministic at inference, which stabilizes validation metrics. An implementation of the BSF layer in the TensorFlow 2 framework can be found in the repository at https://github.com/Trel725/BSFilter.
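Putting the pieces together, a minimal Keras-style sketch of such a layer could look as follows, reusing bsf_op and the import from the sketch above (the initializer, clipping constraint, and default coefficients are our assumptions; the reference implementation is in the repository above):

class BinaryStochasticFiltering(tf.keras.layers.Layer):
    """Sketch of a BSF layer: stochastic during training, thresholded at inference."""

    def __init__(self, l1=1e-3, threshold=0.25, **kwargs):
        super().__init__(**kwargs)
        self.l1 = l1                  # penalty coefficient (assumed default)
        self.threshold = threshold    # inference threshold tau

    def build(self, input_shape):
        # one tunable probability per input feature, kept inside [0, 1]
        self.p = self.add_weight(
            name="p", shape=(input_shape[-1],), initializer="ones",
            constraint=lambda w: tf.clip_by_value(w, 0.0, 1.0))

    def call(self, inputs, training=None):
        self.add_loss(self.l1 * tf.reduce_sum(tf.abs(self.p)))   # L1 penalty on p
        if training:
            return bsf_op(inputs, tf.convert_to_tensor(self.p))  # stochastic filtering
        # deterministic at inference: pass only features with p above the threshold
        return inputs * tf.cast(self.p > self.threshold, inputs.dtype)

Placed as the first layer of a model, the features surviving after training are those with $p_{i}>\tau$.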

Analysis

To gain some understanding of how this method works, we investigate its behavior on the simplest possible model: linear regression. We start with the general formulation of linear regression

\min_{\mathbf{w}}||\mathbf{y}-\mathbb{X}\mathbf{w}||^{2},

where $\mathbf{y}$ is a vector of target values, $\mathbf{w}$ is a vector of model weights, and $\mathbb{X}$ is a matrix of input data such that each row of the matrix is a single observation vector $\mathbf{x}_{i}$. Our goal is now to investigate how the optimization objective changes if we multiply each $\mathbf{x}_{i}$ by a random vector $\mathbf{r}$. Since the objective is now random, we minimize its expected value, i.e.

\min_{\mathbf{w}}\mathbb{E}\,||\mathbf{y}-(\mathbb{R}\circ\mathbb{X})\mathbf{w}||^{2},

where $\mathbb{R}$ is a matrix such that $\mathbb{R}_{ij}\sim\text{Bernoulli}(p_{j})$. It can be shown (the derivation of the equation below is given in the supporting information; its key step is sketched after the list below) that the optimization objective is equivalent to

\min_{\mathbf{w}}||\mathbf{y}-\mathbb{X}(\mathbf{w}\circ\mathbf{p})||^{2}+||\Gamma(\mathbf{w}\circ\sqrt{\mathbf{p}\circ(1-\mathbf{p})})||^{2}+\lambda||\mathbf{p}||,

where $\Gamma=\text{diag}(\mathbb{X}^{T}\mathbb{X})^{1/2}$, i.e. its diagonal elements correspond to the standard deviations of the features in $\mathbb{X}$ (assuming they are centered), and $\circ$ denotes the Hadamard product. We can see that if $p_{i}=p$, the $p(1-p)$ factor can be taken out of the norm expression, which gives an expression identical to the one derived in [11] (when $\lambda=0$). From this objective we can gain some insight into the model's behavior:

  1. For $p_{i}\in(0,1)$ the $i$th feature is effectively penalized with the $L^{2}$ norm, where the penalty is additionally scaled by the standard deviation of that feature. Thus, weights of strongly varying features are penalized more, similarly to classical Dropout.

  2. If $p_{i}=0$, which is forced by the $L^{1}$ penalty, or $p_{i}=1$, the middle term vanishes and the weight of the $i$th feature is not penalized.
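A brief sketch of the key expectation step behind the objective above (our paraphrase; the full derivation is in the supporting information). Using $\mathbb{E}[Z^{2}]=(\mathbb{E}Z)^{2}+\text{Var}\,Z$ and the independence of the $r_{j}$, for a single observation $\mathbf{x}$ with target $y$:

\mathbb{E}\Big(y-\sum_{j}r_{j}x_{j}w_{j}\Big)^{2}=\Big(y-\sum_{j}p_{j}x_{j}w_{j}\Big)^{2}+\sum_{j}x_{j}^{2}w_{j}^{2}\,p_{j}(1-p_{j}).

Summing over the rows of $\mathbb{X}$ turns the first term into $||\mathbf{y}-\mathbb{X}(\mathbf{w}\circ\mathbf{p})||^{2}$ and the second into $||\Gamma(\mathbf{w}\circ\sqrt{\mathbf{p}\circ(1-\mathbf{p})})||^{2}$, since $\Gamma_{jj}^{2}=\sum_{i}\mathbb{X}_{ij}^{2}$; the $\lambda||\mathbf{p}||$ term is the penalty added explicitly.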

Stochastic vs deterministic

It is not immediately clear why stochastic regularization should be preferred to deterministic regularization. Firstly, penalizing the weights is clearly sufficient to achieve sparsity in simple shallow models such as Lasso regression, but deep models can efficiently rescale near-zero features back up in the hidden layers. Stochastic regularization is free from this issue, since it has only two possible states: a feature is either passed unchanged or set to zero. Moreover, it is well known in the machine learning literature that adding noise to the network has positive effects on model generalization and convergence [15, 16, 17]. It was observed in our experiments that stochastic models are more stable during the training phase and produce better separated (into important and unimportant) features. An example of the model convergence curves is given in Fig. 1.

5 Experiments

The binary stochastic filtering layer was implemented in the TensorFlow 2 framework [18] according to the definition above. A collection of datasets from the OpenML-CC18 benchmark suite [19] was used in the experiments. It contains 72 classification datasets that satisfy a number of desired properties, including class balance, a reasonable number of features and observations, moderate classification difficulty, etc. Moreover, the authors provide reference preprocessing and cross-validation splits, which facilitates replication of experiments. NN models typically require hyperparameter tuning to obtain fair results, thus a subset of 10 datasets was selected from OpenML-CC18 and used in the further experiments. The threshold $\tau$ was set to 0.25 and the $\text{F}_{1}$ score was used as the main evaluation metric in all experiments.

5.1 Feature selection

Figure 1: Change in $\text{F}_{1}$ score after feature selection with different methods, sorted in ascending order according to the group means (left). Examples of validation loss evolution for the reference and $L^{1}$-penalized models; early stopping after 20 epochs without loss improvement was used (right).
ID      BSF      DT       KB-F     KB-MI    RFE      SVC      Features
16      -0.0995  -0.1490  -0.1285  -0.1280  -0.4835  -0.2950  6/64
32      -0.0014   0.0007  -0.0058  -0.0063  -0.0019  -0.0055  13/16
45       0.0169   0.0169   0.0191   0.0185  -0.0031  -0.0053  6/60
219      0.0371   0.0271   0.0217   0.0230   0.0218   0.0375  7/8
3481    -0.0213  -0.0303  -0.1445  -0.1468  -0.0182  -0.0355  56/617
9910     0.0192   0.0080   0.0056   0.0061   0.0357   0.0075  166/1776
9957     0.0057   0.0048   0.0009  -0.0010   0.0114   0.0048  23/41
9977    -0.0333  -0.0187  -0.0606  -0.0607  -0.0218  -0.0180  7/118
14952   -0.0131  -0.0024  -0.0116  -0.0111  -0.0194  -0.0149  15/30
146825  -0.0244  -0.0304  -0.1025  -0.1027  n/a      -0.1425  102/784
167140  -0.0053  -0.0050  -0.0822  -0.0813  -0.0031  -0.0057  10/180
Table 1: Mean differences between metrics for models trained on the full and on the feature-selected datasets.

For the main experiment, features were selected from each experimental dataset by training a penalized model. The penalization coefficient was manually tuned to achieve the maximal reduction in the number of features while keeping the metrics reasonable. Other popular methods (implemented in scikit-learn [20]) were selected for comparison; the corresponding abbreviations are given in parentheses, and a scikit-learn sketch of these baselines follows the list:

  • Filtering features based on mutual information (KB-MI) and ANOVA F-value (KB-F)

  • Recursive feature elimination with SVM as a base classifier (RFE) [21]

  • Embedded methods: $L^{1}$-penalized SVM (SVC) and decision tree (CART algorithm, DT)
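The baselines above can be assembled in scikit-learn roughly as follows (a sketch: k denotes the number of features kept by BSF, and the LinearSVC settings and other hyperparameters are our assumptions, not the exact configuration used):

from sklearn.feature_selection import (RFE, SelectFromModel, SelectKBest,
                                       f_classif, mutual_info_classif)
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def baseline_selectors(k):
    """Comparison selectors keyed by the abbreviations used in the text."""
    return {
        "KB-F": SelectKBest(f_classif, k=k),
        "KB-MI": SelectKBest(mutual_info_classif, k=k),
        "RFE": RFE(LinearSVC(dual=False), n_features_to_select=k),
        "SVC": SelectFromModel(LinearSVC(penalty="l1", dual=False),
                               max_features=k, threshold=-float("inf")),
        "DT": SelectFromModel(DecisionTreeClassifier(),
                              max_features=k, threshold=-float("inf")),
    }

# each selector exposes fit(X, y) and transform(X) to produce the reduced dataset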

The same number of features was selected with each of these methods, and an NN model was trained on each of the selected feature subsets. Metrics for each cross-validation split were collected, and the differences between the reference full-featured score and the feature-selected one were used as a measure of feature selection efficiency. The cross-validation splits were the same in all experiments. The results are provided in Fig. 1, which visualizes the distribution of $\Delta\text{F}_{1}=\text{F}_{1,\text{fs}}-\text{F}_{1,\text{ref}}$, i.e. positive values correspond to a feature-selected $\text{F}_{1}$ score higher than the original one. It follows from the data that BSF leads to the lowest decrease of the classification score. Although the difference from its closest rival (DT) is small, it is statistically significant, with a Wilcoxon test p-value of $0.0233$. Exact values are tabulated in Tab. 1 (RFE feature selection for dataset 146825 was intractable, thus this value is missing from the table). It is important to note that augmenting a model with the BSF layer has only a minor impact on its convergence (Fig. 1, right), thus the filtering layer can be added to any model almost without overhead.
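A sketch of the evaluation protocol described above (the macro averaging of the $\text{F}_{1}$ score and the variable names are our assumptions):

from scipy.stats import wilcoxon
from sklearn.metrics import f1_score

def delta_f1(y_true, y_pred_selected, y_pred_full):
    """Per-split difference F1_fs - F1_ref from the text (macro averaging assumed)."""
    return (f1_score(y_true, y_pred_selected, average="macro")
            - f1_score(y_true, y_pred_full, average="macro"))

# with deltas_bsf and deltas_dt holding the per-split deltas of two methods,
# the paired comparison reads:
# stat, p_value = wilcoxon(deltas_bsf, deltas_dt)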

5.2 Neuron pruning

Figure 2: Visualization of pruning with BSF. Neurons and BSF units are drawn as circles and squares, respectively. The weights of the BSF units are shown as the saturation of the square fill.
Figure 3: Change in classification metrics after pruning vs fraction of kept weights. Values for the optimized regularization coefficient for all datasets (left); trade-off between model accuracy and complexity for different regularization coefficients for two selected datasets (right). Dataset IDs are represented by colors.

For the second experiment, every dropout layer was replaced with a penalized BSF layer. The regularization coefficient was shared among layers, but normalized by the initial number of neurons in each layer to achieve equal penalization. Every model was trained on the same selected datasets, the BSF layers were then removed, and the neurons corresponding to low BSF weights were pruned, which was achieved by removing the corresponding columns and/or rows from the weight matrix of each layer (Fig. 2), as sketched below. The differences in $\text{F}_{1}$ score for the obtained pruned models are plotted against the relative amount of kept weights in Fig. 3. The same figure demonstrates how the number of weights can be decreased further at the price of a reduction in classification metrics.
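A sketch of the pruning step for a stack of Dense layers (assuming a BSF layer precedes every Dense layer, so that keep_masks[l] holds the boolean mask p > tau seen by layer l; all names are ours):

def prune_dense_weights(kernels, biases, keep_masks):
    """Remove rows/columns of Dense weight matrices according to the BSF masks.

    kernels[l]    -- weight matrix of layer l, shape (n_in, n_out)
    biases[l]     -- bias vector of layer l, shape (n_out,)
    keep_masks[l] -- boolean mask of the inputs kept for layer l, shape (n_in,)
    """
    pruned_kernels, pruned_biases = [], []
    for l, (W, b) in enumerate(zip(kernels, biases)):
        W = W[keep_masks[l], :]              # drop rows of removed input neurons
        if l + 1 < len(keep_masks):          # not the output layer:
            W = W[:, keep_masks[l + 1]]      # drop columns of removed output neurons
            b = b[keep_masks[l + 1]]
        pruned_kernels.append(W)
        pruned_biases.append(b)
    return pruned_kernels, pruned_biases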

5.3 Region selection in spectra

Figure 4: Regions of importance selected with the Grad-CAM and SHAP methods (left); regions of importance selected by BSF (right). For visualization, two Raman spectra (of human glycoproteins) from the dataset are plotted above, with the selected regions highlighted in red.

Spectra are one of the most common types of data in the natural sciences. Automated recognition of spectra is highly useful in all branches of chemistry [22, 23, 24] as well as in biology and medicine [25, 26, 27]. Such signals share an important property: the existence of importance regions, i.e. areas that are crucial for their interpretation. While for images the relative positions of features matter (and are usually extracted with convolutional layers), spectra are recognized based on the global positions of peaks or other features. Although this may seem like a problem for which a fully-connected network is more suitable, convolutional layers are still advantageous for processing spectral information, since they learn the preprocessing of the data, such as background subtraction, noise filtering, etc. Extraction of the importance regions from spectral data is exceptionally useful, since it sheds light on the processes that generate the data. Numerous approaches have been proposed to highlight the most salient regions with the aim of explaining model decisions, including Grad-CAM [28], LIME [29] and SHAP [30]. Unfortunately, these methods, developed to explain individual predictions, frequently produce an overly complicated picture, highlighting noise and clearly useless regions. Combining individual explanations into a dataset-wise explanation is nontrivial, and its interpretation is frequently unclear.

Although this problem can be formulated as classical feature selection, such an approach is poor, since it disrupts the continuity of the spectra and breaks the convolutional preprocessing. The desired selection of importance regions can instead be accomplished by selecting features at the output of the convolutional part of the network, which can be performed with a BSF layer that shares its weights along the channel axis. For the experiment, a custom Raman spectra dataset of glycoproteins was classified with a simple convolutional classifier, and the obtained importance regions were analyzed with Grad-CAM (adapted for the analysis of 1D convolutional networks), the SHAP explainer, and BSF. The obtained results are presented in Fig. 4. As mentioned above, the SHAP and Grad-CAM region-importance maps are cumbersome and practically useless, while BSF clearly selected the most informative regions, which have a clear chemical interpretation. This approach was successfully used in two analytical projects [31, 32].
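A hedged sketch of the weight-sharing variant: placed after the convolutional part of the network, the layer keeps a single probability per spectral position and applies the same mask to every channel (shapes, names and defaults are assumptions, not the exact model used):

import tensorflow as tf

@tf.custom_gradient
def region_bsf_op(x, p):
    """x: (batch, length, channels); p: (length, 1), shared along the channel axis."""
    r = tf.cast(tf.random.uniform(tf.shape(p)) < p, x.dtype)    # one draw per position

    def grad(upstream):
        grad_x = upstream * r
        grad_p = tf.reduce_sum(upstream * x * p, axis=[0, 2])[:, None]
        return grad_x, grad_p

    return x * r, grad


class RegionBSF(tf.keras.layers.Layer):
    """BSF with weights shared along the channel axis, for region-of-importance selection."""

    def __init__(self, l1=1e-3, threshold=0.25, **kwargs):
        super().__init__(**kwargs)
        self.l1, self.threshold = l1, threshold

    def build(self, input_shape):
        # one keep-probability per spectral position, shared by all channels
        self.p = self.add_weight(
            name="p", shape=(input_shape[1], 1), initializer="ones",
            constraint=lambda w: tf.clip_by_value(w, 0.0, 1.0))

    def call(self, inputs, training=None):
        self.add_loss(self.l1 * tf.reduce_sum(tf.abs(self.p)))
        if training:
            return region_bsf_op(inputs, tf.convert_to_tensor(self.p))
        return inputs * tf.cast(self.p > self.threshold, inputs.dtype)

The positions surviving after training (those with $p$ above $\tau$) form the highlighted regions in Fig. 4.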

6 Conclusion

The conducted experiments demonstrated that BSF selects features at least as efficiently as the best of the classical methods. At the same time, it can be embedded directly into the NN model, eliminating the need for an external feature selector. Moreover, thanks to its differentiability, it can be used not only to drop nodes from the input layer (i.e. features) but also in the middle of the model, which enables neuron pruning. The approach is also applicable to filtering convolutional channels by simply sharing the weights of the BSF layer along all axes except the channel axis; conversely, if selection of regions of importance is the aim, BSF can be applied by sharing the weights along the channel axis. It was shown that for some datasets this method allows the network size to be reduced to approximately 1% of the original without a significant reduction in classification metrics. BSF has the potential to become an indispensable tool for the processing of spectral data, particularly valuable in the natural sciences.

References

  • [1] Girish Chandrashekar and Ferat Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28, 2014.
  • [2] Shenkai Gu, Ran Cheng, and Yaochu Jin. Feature selection for high-dimensional classification using a competitive swarm optimizer. Soft Computing, 22(3):811–822, 2018.
  • [3] Emrah Hancer, Bing Xue, Mengjie Zhang, Dervis Karaboga, and Bahriye Akay. Pareto front feature selection based on artificial bee colony optimization. Information Sciences, 422:462–479, 2018.
  • [4] Majdi Mafarja, Ibrahim Aljarah, Ali Asghar Heidari, Abdelaziz I Hammouri, Hossam Faris, Al-Zoubi Ala’M, and Seyedali Mirjalili. Evolutionary population dynamics and grasshopper optimization approaches for feature selection problems. Knowledge-Based Systems, 145:25–45, 2018.
  • [5] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011.
  • [6] Andrew Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.
  • [7] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 806–814, 2015.
  • [8] Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1355–1361. IEEE, 2017.
  • [9] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pages 2074–2082, 2016.
  • [10] Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.
  • [11] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • [12] Hojjat Salehinejad and Shahrokh Valaee. Edropout: Energy-based dropout and pruning of deep neural networks. arXiv preprint arXiv:2006.04270, 2020.
  • [13] Suraj Srinivas and R Venkatesh Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
  • [14] Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In Advances in neural information processing systems, pages 3084–3092, 2013.
  • [15] Salah Rifai, Xavier Glorot, Yoshua Bengio, and Pascal Vincent. Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250, 2011.
  • [16] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • [17] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
  • [18] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  • [19] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. Openml benchmarking suites, 2017.
  • [20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [21] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine learning, 46(1-3):389–422, 2002.
  • [22] Kunal Ghosh, Annika Stuke, Milica Todorović, Peter Bjørn Jørgensen, Mikkel N Schmidt, Aki Vehtari, and Patrick Rinke. Deep learning spectroscopy: Neural networks for molecular excitation spectra. Advanced science, 6(9):1801367, 2019.
  • [23] Chenhao Cui and Tom Fearn. Modern practical convolutional neural networks for multivariate regression: Applications to nir calibration. Chemometrics and Intelligent Laboratory Systems, 182:9–20, 2018.
  • [24] Xiaolei Zhang, Tao Lin, Jinfan Xu, Xuan Luo, and Yibin Ying. Deepspectra: An end-to-end deep learning approach for quantitative spectral analysis. Analytica chimica acta, 1058:48–57, 2019.
  • [25] Sigurdur Sigurdsson, Peter Alshede Philipsen, Lars Kai Hansen, Jan Larsen, Monika Gniadecka, and Hans-Christian Wulf. Detection of skin cancer by classification of raman spectra. IEEE transactions on biomedical engineering, 51(10):1784–1793, 2004.
  • [26] Yi-ding Chen, Shu Zheng, Jie-kai Yu, and Xun Hu. Artificial neural networks analysis of surface-enhanced laser desorption/ionization mass spectra of serum protein pattern distinguishes colorectal cancer from healthy population. Clinical Cancer Research, 10(24):8380–8385, 2004.
  • [27] Jindřich Charvát, Aleš Procházka, Matěj Fričl, Oldřich Vyšata, and Lucie Himmlová. Diffuse reflectance spectroscopy in dental caries detection and classification. Signal, Image and Video Processing, pages 1–8, 2020.
  • [28] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • [29] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144, 2016.
  • [30] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pages 4765–4774, 2017.
  • [31] O Guselnikova, A Trelin, A Skvortsova, P Ulbrich, P Postnikov, A Pershina, D Sykora, V Svorcik, and O Lyutakov. Label-free surface-enhanced raman spectroscopy with artificial neural network technique for recognition photoinduced dna damage. Biosensors and Bioelectronics, 145:111718, 2019.
  • [32] M Erzina, A Trelin, O Guselnikova, B Dvorankova, K Strnadova, A Perminova, P Ulbrich, D Mares, V Jerabek, R Elashnikov, et al. Precise cancer detection via the combination of functionalized sers surfaces and convolutional neural network with independent inputs. Sensors and Actuators B: Chemical, 308:127660, 2020.