
Binary Stochastic Filtering: feature selection and beyond

Andrii Trelin
trelinn@vscht.cz
   Aleš Procházka
a.prochazka@ieee.org
Abstract

Feature selection is one of the most decisive tools in understanding data and machine learning models. Among other methods, sparsity induced by the $L^{1}$ penalty is one of the simplest and best studied approaches to this problem. Although such regularization is frequently used in neural networks to achieve sparsity of weights or unit activations, it is unclear how it can be employed in the feature selection problem. This work aims to introduce the ability to automatically select features into neural networks by rethinking how sparsity regularization can be used, namely, by stochastically penalizing feature involvement instead of the layer weights. The proposed method demonstrates superior efficiency compared to several classical methods, achieved with minimal or no computational overhead, and can be applied directly to any existing architecture. Moreover, the method generalizes easily to neuron pruning and to the selection of regions of importance in spectral data.

Keywords— neural network, feature selection, neuron pruning

1 Introduction

Feature selection is of great interest in all machine learning tasks, since it reduces the computational complexity of models, frequently improves generalization, and helps in understanding the data. In general, feature selection methods are divided into the following categories [1]:

Filter methods

use feature metrics, such as correlation or information gain, to distinguish between useful and useless features.

Wrapper methods

use the feedback of model metrics to optimize the selected feature subset. This problem can be solved exactly only by brute force, which makes it intractable in the majority of cases. Numerous heuristics have been suggested (recent research focuses mainly on swarm intelligence optimization [2, 3, 4]), but they cannot guarantee optimality.

Embedded methods

exist for certain algorithms that create a feature importance score during training. Classical examples are decision tree-based algorithms and $L^{1}$-penalized linear models.

Models that automatically find optimal features are clearly the most desirable type of feature selector, since they provide both the trained model and the subset of important features simultaneously. Unfortunately, that is usually possible only for very simple models, while deep neural networks (NN), one of the most crucial state-of-the-art algorithms, are unable to perform feature selection during training. This paper is devoted to the development of a method that resolves this issue by augmenting the network with a stochastic variant of $L^{1}$ penalization, which can be interpreted as a stochastic search in the feature space.

2 $L^{1}$ penalization for neural networks

The most straightforward way to achieve sparsity in neural networks is to add an $L^{1}$ penalty. This method is widely used to achieve representation sparsity [5, 6] by penalizing neuron activations, or sparsity of convolutional kernels [7, 8], which improves the performance of convolutional models. Although $L^{1}$ penalization efficiently sparsifies networks, the structure of the obtained sparse representation is unpredictable and thus cannot be used for feature selection or neuron pruning. The work of Wen et al. [9] handles this issue by explicitly introducing structure, penalizing individual components of the network such as channels, layers, etc. At the same time, $L^{1}$ penalization for feature selection has not yet been applied to neural networks.

We suggest how the well-known sparsity constraints can be applied to the input of a neural network with the aim of feature selection. The proposed method is highly universal and can be applied to the selection of input features, convolutional kernels, regions of importance, etc. It should not be confused with the widely used regularization of weights or activations.

3 Related works

Sparsification of neural networks is a popular research subject of significant importance, since it makes large and computationally demanding neural networks small and efficient enough to run on mobile devices. Application of a structured $L^{1}$ penalty for the optimization of neural network architecture was suggested by Wen et al. [9] and Scardapane et al. [10]. Both approaches are deterministic.

Since the proposed method is stochastic, it shares common properties with a wide variety of stochastic regularization techniques derived from the original Dropout [11]. Energy-based dropout [12] regularizes and prunes the network by optimizing a scalar energy with a differential evolution algorithm. The work of Srinivas et al. [13] defines a family of Dropout-like techniques; one of them, Dropout++, uses stochastic neuron dropping with trainable parameters derived through a Bayesian NN, which leads to a similar, although not identical, formulation of the filtering units. Adaptive Dropout [14] tunes the dropping probabilities by augmenting the neural network with a binary belief network.

4 Binary stochastic filtering

The main idea of the proposed method (BSF) is the application of an $L^{1}$ penalty on the involvement of each variable in the training/prediction process. This is done by element-wise multiplication of the input datum $\mathbf{x}$ by a random vector $\mathbf{r}$ such that $r_{i}\sim\text{Bernoulli}(p_{i})$, where the vector $\mathbf{p}$ defines a tunable set of parameters. This is similar to the Dropout technique, which performs the same multiplication but with predefined constant probabilities. The vector $\mathbf{p}$ is penalized with the $L^{1}$ norm, which stochastically forces the model to use only the most important features. Another way to view it is as a stochastic exploration of the feature space that simultaneously penalizes the number of involved features.
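As a minimal illustration of the forward pass (a sketch with assumed names such as bsf_forward and lam, not the reference implementation), the filtering and the associated penalty can be written as:

import numpy as np

rng = np.random.default_rng(0)

def bsf_forward(x, p, lam=1e-3):
    """Binary stochastic filtering of a batch x with shape (n_samples, n_features).

    p   -- tunable per-feature probabilities, the vector p from the text
    lam -- L1 penalty coefficient (assumed name)
    """
    r = (rng.random(x.shape) < p).astype(x.dtype)   # r_i ~ Bernoulli(p_i)
    penalty = lam * np.abs(p).sum()                 # L1 norm of p
    return x * r, penalty

x = rng.normal(size=(4, 3))
p = np.array([0.9, 0.5, 0.05])    # the third feature is almost switched off
filtered, penalty = bsf_forward(x, p)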

Gradients

To make the layer weights $\mathbf{p}$ trainable, it is necessary to define two gradients for backpropagation to work, namely $\nabla_{x}\text{bsf}(\mathbf{x})$ and $\nabla_{p}\text{bsf}(\mathbf{x})$, where $\text{bsf}(\mathbf{x})=\mathbf{x}\circ\mathbf{r}$. We define the first gradient as

\frac{\partial\,\text{bsf}(x_{i})}{\partial x_{i}}=\frac{\partial x_{i}r_{i}}{\partial x_{i}}=r_{i},

which is a natural way to describe a variable being passed or dropped, similarly to Dropout. It is trickier to define $\nabla_{p}\text{bsf}(\mathbf{x})$ due to its randomness. Instead, we can differentiate the expected value

\frac{\partial\,\mathbb{E}\,\text{bsf}(x_{i})}{\partial p_{i}}=\frac{\partial\,\mathbb{E}\,x_{i}r_{i}}{\partial p_{i}}=\frac{\partial x_{i}\mathbb{E}r_{i}}{\partial p_{i}}=\frac{\partial x_{i}p_{i}}{\partial p_{i}}=x_{i}

and use it as a gradient estimate. Moreover, it was empirically found useful to scale the gradient by the weight value, i.e. to redefine the gradient as $\frac{\partial\,\text{bsf}(x_{i})}{\partial p_{i}}=x_{i}p_{i}$. This modification has a clear interpretation: the lower the weight $p_{i}$, the lower the involvement of the feature in the training process, and thus the weights of this feature must change more slowly. This modification stabilizes training and prevents already disabled features from being re-enabled.
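These definitions map naturally onto TensorFlow's custom-gradient mechanism. The sketch below uses our own naming (bsf_op), not the authors' code, which is linked in the next paragraph: the forward pass applies the random mask, while the backward pass returns the estimates defined above.

import tensorflow as tf

@tf.custom_gradient
def bsf_op(x, p):
    """Forward: bsf(x) = x * r with r_i ~ Bernoulli(p_i).
    Backward: d bsf(x_i)/d x_i = r_i and the scaled estimate d bsf(x_i)/d p_i = x_i p_i."""
    r = tf.cast(tf.random.uniform(tf.shape(x)) < p, x.dtype)

    def grad(upstream):
        grad_x = upstream * r                              # feature passed or dropped, as in Dropout
        grad_p = tf.reduce_sum(upstream * x * p, axis=0)   # x_i p_i, summed over the batch
        return grad_x, grad_p

    return x * r, grad

When p is stored as a tf.Variable, passing tf.convert_to_tensor(p) to bsf_op lets the gradient flow back to the variable through the tape.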

The behavior of the filtering layer during the inference phase is altered by setting a threshold $\tau$ and deterministically passing features whose weights are above the threshold, while features corresponding to weights below the threshold are dropped. This replacement makes the layer deterministic at inference, which stabilizes validation metrics. An implementation of the BSF layer in the TensorFlow 2 framework can be found in the repository at https://github.com/Trel725/BSFilter.
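Putting the pieces together, a minimal Keras-style sketch of such a layer could look as follows, reusing bsf_op and the import from the sketch above (the initializer, clipping constraint, and default coefficients are our assumptions; the reference implementation is in the repository above):

class BinaryStochasticFiltering(tf.keras.layers.Layer):
    """Sketch of a BSF layer: stochastic during training, thresholded at inference."""

    def __init__(self, l1=1e-3, threshold=0.25, **kwargs):
        super().__init__(**kwargs)
        self.l1 = l1                  # penalty coefficient (assumed default)
        self.threshold = threshold    # inference threshold tau

    def build(self, input_shape):
        # one tunable probability per input feature, kept inside [0, 1]
        self.p = self.add_weight(
            name="p", shape=(input_shape[-1],), initializer="ones",
            constraint=lambda w: tf.clip_by_value(w, 0.0, 1.0))

    def call(self, inputs, training=None):
        self.add_loss(self.l1 * tf.reduce_sum(tf.abs(self.p)))   # L1 penalty on p
        if training:
            return bsf_op(inputs, tf.convert_to_tensor(self.p))  # stochastic filtering
        # deterministic at inference: pass only features with p above the threshold
        return inputs * tf.cast(self.p > self.threshold, inputs.dtype)

Placed as the first layer of a model, the features surviving after training are those with $p_{i}>\tau$.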

Analysis

To gain some understanding of how this method works, we investigate its behavior on the simplest possible model: linear regression. We start with the general formulation of linear regression

\min_{\mathbf{w}}||\mathbf{y}-\mathbb{X}\mathbf{w}||^{2},

where $\mathbf{y}$ is a vector of target values, $\mathbf{w}$ is a vector of model weights, and $\mathbb{X}$ is a matrix of input data such that each row of the matrix is a single observation vector $\mathbf{x}_{i}$. Our goal is now to investigate how the optimization objective changes if we multiply each $\mathbf{x}_{i}$ by a random vector $\mathbf{r}$. Since the objective is now random, we minimize its expected value, i.e.

\min_{\mathbf{w}}\mathbb{E}\,||\mathbf{y}-(\mathbb{R}\circ\mathbb{X})\mathbf{w}||^{2},

where $\mathbb{R}$ is a matrix such that $\mathbb{R}_{ij}\sim\text{Bernoulli}(p_{j})$. It can be shown (the derivation of the equation below is given in the supporting information; its key step is sketched after the list below) that the optimization objective is equivalent to

\min_{\mathbf{w}}||\mathbf{y}-\mathbb{X}(\mathbf{w}\circ\mathbf{p})||^{2}+||\Gamma(\mathbf{w}\circ\sqrt{\mathbf{p}\circ(1-\mathbf{p})})||^{2}+\lambda||\mathbf{p}||,

where $\Gamma=\text{diag}(\mathbb{X}^{T}\mathbb{X})^{1/2}$, i.e. its diagonal elements correspond to the standard deviations of the features in $\mathbb{X}$ (assuming they are centered), and $\circ$ denotes the Hadamard product. We can see that if $p_{i}=p$, the $p(1-p)$ factor can be taken out of the norm expression, which gives an expression identical to the one derived in [11] (when $\lambda=0$). From this objective we can gain some insight into the model's behavior:

  1. For $p_{i}\in(0,1)$ the $i$th feature is effectively penalized with the $L^{2}$ norm, where the penalty is additionally scaled by the standard deviation of that feature. Thus, weights of strongly varying features are penalized more, similarly to classical Dropout.

  2. If $p_{i}=0$, which is forced by the $L^{1}$ penalty, or $p_{i}=1$, the middle term vanishes and the weight of the $i$th feature is not penalized.
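A brief sketch of the key expectation step behind the objective above (our paraphrase; the full derivation is in the supporting information). Using $\mathbb{E}[Z^{2}]=(\mathbb{E}Z)^{2}+\text{Var}\,Z$ and the independence of the $r_{j}$, for a single observation $\mathbf{x}$ with target $y$:

\mathbb{E}\Big(y-\sum_{j}r_{j}x_{j}w_{j}\Big)^{2}=\Big(y-\sum_{j}p_{j}x_{j}w_{j}\Big)^{2}+\sum_{j}x_{j}^{2}w_{j}^{2}\,p_{j}(1-p_{j}).

Summing over the rows of $\mathbb{X}$ turns the first term into $||\mathbf{y}-\mathbb{X}(\mathbf{w}\circ\mathbf{p})||^{2}$ and the second into $||\Gamma(\mathbf{w}\circ\sqrt{\mathbf{p}\circ(1-\mathbf{p})})||^{2}$, since $\Gamma_{jj}^{2}=\sum_{i}\mathbb{X}_{ij}^{2}$; the $\lambda||\mathbf{p}||$ term is the penalty added explicitly.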

Stochastic vs deterministic

It is not immediately clear why stochastic regularization should be preferred to deterministic regularization. Firstly, penalizing the weights is clearly sufficient to achieve sparsity in simple shallow models such as Lasso regression, but deep models can efficiently rescale near-zero features back up in the hidden layers. Stochastic regularization is free from this issue, since it has only two possible states: a feature is either passed unchanged or set to zero. Moreover, it is well known in the machine learning literature that adding noise to the network has positive effects on model generalization and convergence [15, 16, 17]. It was observed in our experiments that stochastic models are more stable during the training phase and produce better separated (into important and unimportant) features. An example of the model convergence curves is given in Fig. 1.

5 Experiments

The binary stochastic filtering layer was implemented in the TensorFlow 2 framework [18] according to the definition above. A collection of datasets from the OpenML-CC18 benchmark suite [19] was used in the experiments. It contains 72 classification datasets that satisfy a number of desired properties, including class balance, a reasonable number of features and observations, moderate classification difficulty, etc. Moreover, the authors provide reference preprocessing and cross-validation splits, which facilitates replication of experiments. NN models typically require hyperparameter tuning to obtain fair results, thus a subset of 10 datasets was selected from OpenML-CC18 and used in the further experiments. The threshold $\tau$ was set to 0.25 and the $\text{F}_{1}$ score was used as the main evaluation metric in all experiments.

5.1 Feature selection

Figure 1: Change in $\text{F}_{1}$ score after feature selection with different methods, sorted in ascending order according to the group means (left). Examples of validation loss evolution for the reference and $L^{1}$-penalized models; early stopping after 20 epochs without loss improvement was used (right).
ID      BSF      DT       KB-F     KB-MI    RFE      SVC      Features
16      -0.0995  -0.1490  -0.1285  -0.1280  -0.4835  -0.2950  6/64
32      -0.0014   0.0007  -0.0058  -0.0063  -0.0019  -0.0055  13/16
45       0.0169   0.0169   0.0191   0.0185  -0.0031  -0.0053  6/60
219      0.0371   0.0271   0.0217   0.0230   0.0218   0.0375  7/8
3481    -0.0213  -0.0303  -0.1445  -0.1468  -0.0182  -0.0355  56/617
9910     0.0192   0.0080   0.0056   0.0061   0.0357   0.0075  166/1776
9957     0.0057   0.0048   0.0009  -0.0010   0.0114   0.0048  23/41
9977    -0.0333  -0.0187  -0.0606  -0.0607  -0.0218  -0.0180  7/118
14952   -0.0131  -0.0024  -0.0116  -0.0111  -0.0194  -0.0149  15/30
146825  -0.0244  -0.0304  -0.1025  -0.1027  n/a      -0.1425  102/784
167140  -0.0053  -0.0050  -0.0822  -0.0813  -0.0031  -0.0057  10/180
Table 1: Mean differences between metrics for models trained on the full and on the feature-selected datasets.

For the main experiment, features were selected from each experimental dataset by training a penalized model. The penalization coefficient was manually tuned to achieve the maximal reduction in the number of features while keeping the metrics reasonable. Other popular methods (implemented in scikit-learn [20]) were selected for comparison; the corresponding abbreviations are given in parentheses, and a scikit-learn sketch of these baselines follows the list:

  • Filtering features based on mutual information (KB-MI) and ANOVA F-value (KB-F)

  • Recursive feature elimination with SVM as a base classifier (RFE) [21]

  • Embedded methods: $L^{1}$-penalized SVM (SVC) and decision tree (CART algorithm, DT)
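The baselines above can be assembled in scikit-learn roughly as follows (a sketch: k denotes the number of features kept by BSF, and the LinearSVC settings and other hyperparameters are our assumptions, not the exact configuration used):

from sklearn.feature_selection import (RFE, SelectFromModel, SelectKBest,
                                       f_classif, mutual_info_classif)
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def baseline_selectors(k):
    """Comparison selectors keyed by the abbreviations used in the text."""
    return {
        "KB-F": SelectKBest(f_classif, k=k),
        "KB-MI": SelectKBest(mutual_info_classif, k=k),
        "RFE": RFE(LinearSVC(dual=False), n_features_to_select=k),
        "SVC": SelectFromModel(LinearSVC(penalty="l1", dual=False),
                               max_features=k, threshold=-float("inf")),
        "DT": SelectFromModel(DecisionTreeClassifier(),
                              max_features=k, threshold=-float("inf")),
    }

# each selector exposes fit(X, y) and transform(X) to produce the reduced dataset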

The same number of features was selected with each of these methods, and an NN model was trained on each of the selected feature subsets. Metrics for each cross-validation split were collected, and the differences between the reference full-featured score and the feature-selected one were used as a measure of feature selection efficiency. The cross-validation splits were the same in all experiments. The results are provided in Fig. 1, which visualizes the distribution of $\Delta\text{F}_{1}=\text{F}_{1,\text{fs}}-\text{F}_{1,\text{ref}}$, i.e. positive values correspond to a feature-selected $\text{F}_{1}$ score higher than the original one. It follows from the data that BSF leads to the lowest decrease of the classification score. Although the difference from its closest rival (DT) is small, it is statistically significant, with a Wilcoxon test p-value of $0.0233$. Exact values are tabulated in Tab. 1 (RFE feature selection for dataset 146825 was intractable, thus this value is missing from the table). It is important to note that augmenting a model with the BSF layer has only a minor impact on its convergence (Fig. 1, right), thus the filtering layer can be added to any model almost without overhead.
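A sketch of the evaluation protocol described above (the macro averaging of the $\text{F}_{1}$ score and the variable names are our assumptions):

from scipy.stats import wilcoxon
from sklearn.metrics import f1_score

def delta_f1(y_true, y_pred_selected, y_pred_full):
    """Per-split difference F1_fs - F1_ref from the text (macro averaging assumed)."""
    return (f1_score(y_true, y_pred_selected, average="macro")
            - f1_score(y_true, y_pred_full, average="macro"))

# with deltas_bsf and deltas_dt holding the per-split deltas of two methods,
# the paired comparison reads:
# stat, p_value = wilcoxon(deltas_bsf, deltas_dt)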

5.2 Neuron pruning

Figure 2: Visualization of pruning with BSF. Neurons and BSF units are drawn as circles and squares, respectively. The weights of the BSF units are shown as the saturation of the square fill.
Figure 3: Change in classification metrics after pruning vs fraction of kept weights. Values for the optimized regularization coefficient for all datasets (left); trade-off between model accuracy and complexity for different regularization coefficients for two selected datasets (right). Dataset IDs are represented by colors.

For the second experiment, every dropout layer was replaced with a penalized BSF layer. The regularization coefficient was shared among layers, but normalized by the initial number of neurons in each layer to achieve equal penalization. Every model was trained on the same selected datasets, the BSF layers were then removed, and the neurons corresponding to low BSF weights were pruned, which was achieved by removing the corresponding columns and/or rows from the weight matrix of each layer (Fig. 2), as sketched below. The differences in $\text{F}_{1}$ score for the obtained pruned models are plotted against the relative amount of kept weights in Fig. 3. The same figure demonstrates how the number of weights can be decreased further at the price of a reduction in classification metrics.
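A sketch of the pruning step for a stack of Dense layers (assuming a BSF layer precedes every Dense layer, so that keep_masks[l] holds the boolean mask p > tau seen by layer l; all names are ours):

def prune_dense_weights(kernels, biases, keep_masks):
    """Remove rows/columns of Dense weight matrices according to the BSF masks.

    kernels[l]    -- weight matrix of layer l, shape (n_in, n_out)
    biases[l]     -- bias vector of layer l, shape (n_out,)
    keep_masks[l] -- boolean mask of the inputs kept for layer l, shape (n_in,)
    """
    pruned_kernels, pruned_biases = [], []
    for l, (W, b) in enumerate(zip(kernels, biases)):
        W = W[keep_masks[l], :]              # drop rows of removed input neurons
        if l + 1 < len(keep_masks):          # not the output layer:
            W = W[:, keep_masks[l + 1]]      # drop columns of removed output neurons
            b = b[keep_masks[l + 1]]
        pruned_kernels.append(W)
        pruned_biases.append(b)
    return pruned_kernels, pruned_biases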

5.3 Region selection in spectra

Figure 4: Regions of importance selected with the Grad-CAM and SHAP methods (left); regions of importance selected by BSF (right). For visualization, two Raman spectra (of human glycoproteins) from the dataset are plotted above, with the selected regions highlighted in red.

Spectra are one of the most common types of data in the natural sciences. Automated recognition of spectra is highly useful in all branches of chemistry [22, 23, 24] as well as in biology and medicine [25, 26, 27]. Such signals share an important property: the existence of importance regions, i.e. areas that are crucial for their interpretation. While for images the relative positions of features matter (and are usually extracted with convolutional layers), spectra are recognized based on the global positions of peaks or other features. Although this may seem like a problem for which a fully-connected network is more suitable, convolutional layers are still advantageous for processing spectral information, since they learn the preprocessing of the data, such as background subtraction, noise filtering, etc. Extraction of the importance regions from spectral data is exceptionally useful, since it sheds light on the processes that generate the data. Numerous approaches have been proposed to highlight the most salient regions with the aim of explaining model decisions, including Grad-CAM [28], LIME [29] and SHAP [30]. Unfortunately, these methods, developed to explain individual predictions, frequently produce an overly complicated picture, highlighting noise and clearly useless regions. Combining individual explanations into a dataset-wise explanation is nontrivial, and its interpretation is frequently unclear.

Although this problem can be formulated as classical feature selection, such an approach is poor, since it disrupts the continuity of the spectra and breaks the convolutional preprocessing. The desired selection of importance regions can instead be accomplished by selecting features at the output of the convolutional part of the network, which can be performed with a BSF layer that shares its weights along the channel axis. For the experiment, a custom Raman spectra dataset of glycoproteins was classified with a simple convolutional classifier, and the obtained importance regions were analyzed with Grad-CAM (adapted for the analysis of 1D convolutional networks), the SHAP explainer, and BSF. The obtained results are presented in Fig. 4. As mentioned above, the SHAP and Grad-CAM region-importance maps are cumbersome and practically useless, while BSF clearly selected the most informative regions, which have a clear chemical interpretation. This approach was successfully used in two analytical projects [31, 32].
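A hedged sketch of the weight-sharing variant: placed after the convolutional part of the network, the layer keeps a single probability per spectral position and applies the same mask to every channel (shapes, names and defaults are assumptions, not the exact model used):

import tensorflow as tf

@tf.custom_gradient
def region_bsf_op(x, p):
    """x: (batch, length, channels); p: (length, 1), shared along the channel axis."""
    r = tf.cast(tf.random.uniform(tf.shape(p)) < p, x.dtype)    # one draw per position

    def grad(upstream):
        grad_x = upstream * r
        grad_p = tf.reduce_sum(upstream * x * p, axis=[0, 2])[:, None]
        return grad_x, grad_p

    return x * r, grad


class RegionBSF(tf.keras.layers.Layer):
    """BSF with weights shared along the channel axis, for region-of-importance selection."""

    def __init__(self, l1=1e-3, threshold=0.25, **kwargs):
        super().__init__(**kwargs)
        self.l1, self.threshold = l1, threshold

    def build(self, input_shape):
        # one keep-probability per spectral position, shared by all channels
        self.p = self.add_weight(
            name="p", shape=(input_shape[1], 1), initializer="ones",
            constraint=lambda w: tf.clip_by_value(w, 0.0, 1.0))

    def call(self, inputs, training=None):
        self.add_loss(self.l1 * tf.reduce_sum(tf.abs(self.p)))
        if training:
            return region_bsf_op(inputs, tf.convert_to_tensor(self.p))
        return inputs * tf.cast(self.p > self.threshold, inputs.dtype)

The positions surviving after training (those with $p$ above $\tau$) form the highlighted regions in Fig. 4.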

6 Conclusion

The conducted experiments demonstrated that BSF selects features at least as efficiently as the best of the classical methods. At the same time, it can be embedded directly into the NN model, eliminating the need for an external feature selector. Moreover, thanks to its differentiability, it can be used not only to drop nodes from the input layer (i.e. features) but also in the middle of the model, which enables neuron pruning. The approach is also applicable to filtering convolutional channels by simply sharing the weights of the BSF layer along all axes except the channel axis; conversely, if selection of regions of importance is the aim, BSF can be applied by sharing the weights along the channel axis. It was shown that for some datasets this method allows the network size to be reduced to approximately 1% of the original without a significant reduction in classification metrics. BSF has the potential to become an indispensable tool for the processing of spectral data, particularly valuable in the natural sciences.

References

  • [1] Girish Chandrashekar and Ferat Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28, 2014.
  • [2] Shenkai Gu, Ran Cheng, and Yaochu Jin. Feature selection for high-dimensional classification using a competitive swarm optimizer. Soft Computing, 22(3):811–822, 2018.
  • [3] Emrah Hancer, Bing Xue, Mengjie Zhang, Dervis Karaboga, and Bahriye Akay. Pareto front feature selection based on artificial bee colony optimization. Information Sciences, 422:462–479, 2018.
  • [4] Majdi Mafarja, Ibrahim Aljarah, Ali Asghar Heidari, Abdelaziz I Hammouri, Hossam Faris, Al-Zoubi Ala’M, and Seyedali Mirjalili. Evolutionary population dynamics and grasshopper optimization approaches for feature selection problems. Knowledge-Based Systems, 145:25–45, 2018.
  • [5] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 315–323, 2011.
  • [6] Andrew Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.
  • [7] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 806–814, 2015.
  • [8] Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1355–1361. IEEE, 2017.
  • [9] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in neural information processing systems, pages 2074–2082, 2016.
  • [10] Simone Scardapane, Danilo Comminiello, Amir Hussain, and Aurelio Uncini. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.
  • [11] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • [12] Hojjat Salehinejad and Shahrokh Valaee. Edropout: Energy-based dropout and pruning of deep neural networks. arXiv preprint arXiv:2006.04270, 2020.
  • [13] Suraj Srinivas and R Venkatesh Babu. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.
  • [14] Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In Advances in neural information processing systems, pages 3084–3092, 2013.
  • [15] Salah Rifai, Xavier Glorot, Yoshua Bengio, and Pascal Vincent. Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250, 2011.
  • [16] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • [17] Arvind Neelakantan, Luke Vilnis, Quoc V Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015.
  • [18] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  • [19] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. Openml benchmarking suites, 2017.
  • [20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [21] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine learning, 46(1-3):389–422, 2002.
  • [22] Kunal Ghosh, Annika Stuke, Milica Todorović, Peter Bjørn Jørgensen, Mikkel N Schmidt, Aki Vehtari, and Patrick Rinke. Deep learning spectroscopy: Neural networks for molecular excitation spectra. Advanced science, 6(9):1801367, 2019.
  • [23] Chenhao Cui and Tom Fearn. Modern practical convolutional neural networks for multivariate regression: Applications to nir calibration. Chemometrics and Intelligent Laboratory Systems, 182:9–20, 2018.
  • [24] Xiaolei Zhang, Tao Lin, Jinfan Xu, Xuan Luo, and Yibin Ying. Deepspectra: An end-to-end deep learning approach for quantitative spectral analysis. Analytica chimica acta, 1058:48–57, 2019.
  • [25] Sigurdur Sigurdsson, Peter Alshede Philipsen, Lars Kai Hansen, Jan Larsen, Monika Gniadecka, and Hans-Christian Wulf. Detection of skin cancer by classification of raman spectra. IEEE transactions on biomedical engineering, 51(10):1784–1793, 2004.
  • [26] Yi-ding Chen, Shu Zheng, Jie-kai Yu, and Xun Hu. Artificial neural networks analysis of surface-enhanced laser desorption/ionization mass spectra of serum protein pattern distinguishes colorectal cancer from healthy population. Clinical Cancer Research, 10(24):8380–8385, 2004.
  • [27] Jindřich Charvát, Aleš Procházka, Matěj Fričl, Oldřich Vyšata, and Lucie Himmlová. Diffuse reflectance spectroscopy in dental caries detection and classification. Signal, Image and Video Processing, pages 1–8, 2020.
  • [28] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • [29] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144, 2016.
  • [30] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pages 4765–4774, 2017.
  • [31] O Guselnikova, A Trelin, A Skvortsova, P Ulbrich, P Postnikov, A Pershina, D Sykora, V Svorcik, and O Lyutakov. Label-free surface-enhanced raman spectroscopy with artificial neural network technique for recognition photoinduced dna damage. Biosensors and Bioelectronics, 145:111718, 2019.
  • [32] M Erzina, A Trelin, O Guselnikova, B Dvorankova, K Strnadova, A Perminova, P Ulbrich, D Mares, V Jerabek, R Elashnikov, et al. Precise cancer detection via the combination of functionalized sers surfaces and convolutional neural network with independent inputs. Sensors and Actuators B: Chemical, 308:127660, 2020.