
Scalable Bayesian Deep Learning with Kernel Seed Networks

Sam Maksoud (s.maksoud@uqconnect.edu.au)
Kun Zhao (k.zhao1@uq.edu.au)
Can Peng (can.peng@uqconnect.edu.au)
Brian C. Lovell (lovell@itee.uq.edu.au)

School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD 4072, Australia
Abstract

This paper addresses the scalability problem of Bayesian deep neural networks. The performance of deep neural networks is undermined by the fact that these algorithms have poorly calibrated measures of uncertainty. This restricts their application in high-risk domains such as computer-aided diagnosis and autonomous vehicle navigation. Bayesian Deep Learning (BDL) offers a promising method for representing uncertainty in neural networks. However, BDL requires a separate set of parameters to store the mean and standard deviation of model weights in order to learn a distribution. This results in a prohibitive 2-fold increase in the number of model parameters. To address this problem we present a method for performing BDL, namely Kernel Seed Networks (KSNs), which does not require a 2-fold increase in the number of parameters. KSNs use $1\times 1$ convolution operations to learn a compressed latent-space representation of the parameter distribution. In this paper we show how this allows KSNs to outperform conventional BDL methods while reducing the number of required parameters by up to a factor of 6.6.

Keywords: Bayesian Networks, Weight Uncertainty, Optimization and learning methods, Deep Learning, Neural Networks

1 Introduction

Modern deep neural networks (DNNs) are capable of learning complex representations of high dimensional data (He et al., 2016; Simonyan and Zisserman, 2014). This has enabled DNNs to exceed human performance in a growing number of decision making tasks (He et al., 2015; Mnih et al., 2015; Silver et al., 2017). However, these algorithms are notoriously difficult to implement in real world decision making systems because conventional DNNs are prone to overfitting and thus have poorly calibrated measures of uncertainty (Blundell et al., 2015). A well-calibrated measure of uncertainty provides insight into the reliability of model predictions (Guo et al., 2017). This is crucial for applications such as autonomous vehicle navigation (Kendall and Gal, 2017) and computer-assisted diagnosis (Jiang et al., 2012), where confidently incorrect decisions can have fatal consequences. Hence, there has been renewed interest in Bayesian methods for deep learning, as they naturally provide well-calibrated representations of uncertainty.

Unlike conventional DNNs, which learn a fixed point estimate of model weights, Bayesian Neural Networks (BNNs) learn a weight distribution, parameterized by the mean and standard deviation of the weights at each layer. The parameter distributions are optimized using variational sampling during training, which forces BNNs to be robust to perturbations of model weights.

Effectively, this makes BNN optimization equivalent to learning an infinite ensemble of models (Blundell et al., 2015). With an ensemble, the variance of predictions among constituent models can be used to approximate the epistemic uncertainty (Lakshminarayanan et al., 2017). Similarly, epistemic uncertainty in BNNs can be estimated using the variance of predictions among weights that are randomly sampled from the learned distributions.
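
To make this concrete, the sketch below estimates epistemic uncertainty from the spread of sampled predictions; `sample_fn` is a hypothetical callable (not from the paper) that draws one set of weights and returns class probabilities for a batch.

```python
import torch

@torch.no_grad()
def epistemic_uncertainty(sample_fn, x, n_samples=10):
    """Approximate epistemic uncertainty as the variance of predictions made
    by weights drawn from the learned distribution (sketch).

    sample_fn: illustrative callable that draws one set of weights and
    returns class probabilities for the batch x.
    """
    preds = torch.stack([sample_fn(x) for _ in range(n_samples)])  # (S, B, C)
    mean_pred = preds.mean(dim=0)   # predictive mean over the S samples
    variance = preds.var(dim=0)     # spread across samples ~ epistemic uncertainty
    return mean_pred, variance
```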

While BNNs have proven to be effective at inducing uncertainty into the weights of a model, they require significantly more parameters than their DNN counterparts (Shridhar et al., 2019). This makes it impractical to scale variational Bayesian learning to the depth of modern DNN architectures. To overcome this problem, Gal and Ghahramani (2016) proposed the use of Monte-Carlo Dropout (MC-Drop); a method for approximating Bayesian inference by applying Dropout (Srivastava et al., 2014) during both model training and inference.

A major advantage of MC-Drop is that it requires no additional parameters to approximate Bayesian inference. MC-Drop represents model uncertainty by sampling sub-networks with randomly dropped-out units and connections to perform variational inference (Gal and Ghahramani, 2016). Although this approach captures the uncertainty of model activations, it is unable to represent the uncertainty of model weights, since all sub-network perturbations sampled using the Dropout method share the same parameters (Srivastava et al., 2014). This limits the scope in which MC-Drop can be applied, since many applications, such as continual learning (Kirkpatrick et al., 2017) and model pruning (Graves, 2011; Blundell et al., 2015), require representations of uncertainty over the entire parameter space. To address these limitations, Khan et al. (2018) propose the Variational Online Gauss-Newton (VOGN) algorithm. VOGN parameterizes a distribution of weights by modelling the weights of a DNN as the mean, and the second raw moment (uncentered variance) of the computed gradients as the variance.
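
As a rough illustration of the MC-Drop procedure, the following sketch keeps dropout layers active at test time and averages several stochastic forward passes; the function name and the default of 10 samples are our choices, not details taken from Gal and Ghahramani (2016).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 10):
    """MC-Drop sketch: keep dropout active at test time and average the
    predictions of several stochastic forward passes."""
    model.eval()
    for m in model.modules():
        # Re-enable only the dropout layers; batch-norm etc. stay in eval mode.
        if isinstance(m, nn.Dropout):
            m.train()
    preds = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)
```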

Similar to BNNs, VOGN optimization involves randomly sampling a set of parameters from a weight distribution to compute variational gradients and to perform variational inference. However, the key advantage of VOGN optimization is that it does not require learning a separate set of mean and standard deviation parameters, since it models the variance with a vector that is already maintained by adaptive optimization methods such as Adam (Kingma and Ba, 2014). It is important to note, however, that VOGN only finds an approximate solution to the BNN objective (Blundell et al., 2015). Tseran et al. (2018) demonstrate how this approximation can result in a trade-off between accuracy and ease of implementation when applied to Variational Continual Learning (VCL) methods (Nguyen et al., 2017). It was shown that using the VOGN approximation for VCL results in worse performance on certain tasks compared to conventional VCL, which optimizes the variance of the weight distributions directly (Nguyen et al., 2017) rather than employing an online estimate of the diagonal of the Fisher Information Matrix to update the variance of the parameters (Tseran et al., 2018; Khan et al., 2018). Hence, despite the high computational cost of BNNs, there are benefits to directly optimizing the mean and standard deviation parameters of a weight distribution during training.

In this paper, we introduce a novel class of BNNs, namely Kernel Seed Networks (KSNs), which significantly reduce the number of parameters required for Bayesian deep learning (BDL). In contrast with conventional BNNs, which require doubling the size of the network in order to learn a separate set of mean and standard deviation weights for each of the parameter distributions, our method applies $1\times 1$ convolutional kernels to trainable "seed" vectors to decode the mean and standard deviation weight vectors. The ability of $1\times 1$ convolutional layers to map inputs to a higher-dimensional space allows for a further reduction of parameters, since they can be used to upsample downsized seed kernels to the required filter size. In our experiments on the MNIST (LeCun et al., 2010), FMNIST (Xiao et al., 2017) and CIFAR (Krizhevsky et al., 2009) datasets, we demonstrate how our proposed method can be used to conduct BDL with a 44–70% reduction in the number of parameters (compared to conventional BNNs) without compromising on model performance.

Figure 1: Framework of the proposed self-germinating kernel filter. The $N$-dimensional kernel seed vector is passed through the germination layer to decode the mean $\mu$ and standard deviation $\sigma$ of the layer weight distribution. In the variational layer, $\mu$ and $\sigma$ are used to respectively shift and scale a randomly sampled unit Gaussian vector $\epsilon$. This process returns the final $N$-dimensional kernel filter (purple) that is applied to the inputs of the layer.

2 Kernel Seed Networks

As described in Section 1, the purpose of KSNs is to reduce the number of parameters required to construct the BNNs first described by Blundell et al. (2015). To this end, we propose to replace the conventional $N$-dimensional kernel filters in DNNs with self-germinating kernel filters (Figure 1) comprising three main components: (1) an $N$-dimensional kernel seed, (2) a kernel germination layer, and (3) a variational kernel layer. We describe the details of these components and our optimization protocol below.

2.1 The Kernel Seed

The purpose of the kernel seed is to store a latent-space representation $\psi$ of the layer weight distribution $W$. To this end, we construct seed kernels for all linear and convolutional layers in a neural network as follows.

Linear Kernel Seeds. A linear kernel seed $\psi_{FC}$ stores the compressed weight distributions for a given linear transformation layer $K_{FC}\in\mathbb{R}^{C_{in}\times C_{out}}$, where $C_{in}$ and $C_{out}$ are the number of input and output channels respectively. Since the kernel germination method is capable of mapping kernel seeds to a higher-dimensional space, we can reduce the dimensions of $\psi_{FC}$ by applying the scaling parameter $\delta$ to $C_{f}=\min(C_{in},C_{out})$. Specifically, the linear kernel seed can be expressed as $\psi_{FC}\in\mathbb{R}^{C_{pip}\times C_{F}}$, where $C_{pip}=\delta C_{f}$ and $C_{F}=\max(C_{in},C_{out})$.

Convolutional Kernel Seeds. A convolutional kernel seed $\psi_{C}$ stores the compressed weight distributions for a given convolutional layer $K_{C}\in\mathbb{R}^{C_{in}\times C_{out}\times k\times k}$, where $k\times k$ are the dimensions of the convolving kernel filter. As with linear kernel seeds, we can reduce the dimensions of $\psi_{C}$ by applying the scaling parameter $\delta$ to $C_{f}$, yielding $C_{pip}$. Specifically, the convolutional kernel seed can be expressed as $\psi_{C}\in\mathbb{R}^{C_{pip}\times C_{F}\times k\times k}$.
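
For reference, the seed dimensions above can be computed with a small helper; this is a sketch under the definitions in this section, and rounding $\delta C_{f}$ to an integer is our assumption.

```python
def seed_shape(c_in, c_out, k=None, delta=1.0):
    """Return the kernel-seed dimensions for a layer with c_in input and
    c_out output channels; k is the conv kernel size (None for linear layers).
    Rounding delta * C_f to an integer is an assumption of this sketch."""
    c_f = min(c_in, c_out)                 # C_f
    c_F = max(c_in, c_out)                 # C_F
    c_pip = max(1, round(delta * c_f))     # C_pip = delta * C_f
    if k is None:
        return (c_pip, c_F)                # psi_FC in R^{C_pip x C_F}
    return (c_pip, c_F, k, k)              # psi_C in R^{C_pip x C_F x k x k}

# Example: a 3x3 convolution with 64 -> 128 channels and delta = 0.5
print(seed_shape(64, 128, k=3, delta=0.5))   # (32, 128, 3, 3)
```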

Figure 2: Illustration of the germination procedure. Two $1\times 1$ convolutional upsampling operations are applied to a compressed seed vector to decode the mean and standard deviation parameters of the weight distribution. In effect, the kernel seed is a latent-space representation of the weight distribution from which the parameters of a given layer are derived.

2.2 Germination Layer

The germination layer comprises two convolutional decoders (germinators): (1) a $\mu$ decoder, and (2) a $\rho$ decoder. It receives the kernel seed $\psi$ as input and decodes the $\mu$ and $\rho$ parameters (Figure 2), which are then used to reconstruct $W$.

Linear Seed Germinators. To decode the $\mu$ and $\rho$ parameters for a linear transformation layer $K_{FC}$, we apply two convolutional germinating kernels, $G_{FC_{\mu}}$ and $G_{FC_{\rho}}$, to the linear kernel seed $\psi_{FC}$. $G_{FC_{\mu}}$ and $G_{FC_{\rho}}$ are essentially 1D convolutional kernels where $G_{FC}\in\mathbb{R}^{C_{f}\times C_{pip}\times 1}$. When applied to $\psi_{FC}$, $G_{FC}$ outputs a matrix of size $C_{F}\times C_{f}$. When $C_{F}=C_{out}$, a transformation operation is applied to the outputs of $G_{FC_{\mu}}$ and $G_{FC_{\rho}}$, yielding $W_{\mu}\in\mathbb{R}^{C_{in}\times C_{out}}$ and $W_{\rho}\in\mathbb{R}^{C_{in}\times C_{out}}$ respectively.

Convolutional Seed Germinators. To decode the $\mu$ and $\rho$ parameters for a convolutional layer $K_{C}$, we apply two convolutional germinating kernels, $G_{C_{\mu}}$ and $G_{C_{\rho}}$, to the convolutional kernel seed $\psi_{C}$. $G_{C_{\mu}}$ and $G_{C_{\rho}}$ are essentially 2D convolutional kernels where $G_{C}\in\mathbb{R}^{C_{f}\times C_{pip}\times 1\times 1}$. When applied to $\psi_{C}$, $G_{C}$ outputs a tensor of size $C_{F}\times C_{f}\times k\times k$. As with linear kernel seeds, when $C_{F}=C_{out}$, a transformation operation is applied to the outputs of $G_{C_{\mu}}$ and $G_{C_{\rho}}$, yielding $W_{\mu}\in\mathbb{R}^{C_{in}\times C_{out}\times k\times k}$ and $W_{\rho}\in\mathbb{R}^{C_{in}\times C_{out}\times k\times k}$ respectively.
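
The following PyTorch-style sketch combines a convolutional kernel seed with its germination layer. It follows the shapes described above, but the storage layout of $\psi_{C}$, the rearrangement into the final weight tensor, and the class name are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvSeedGerminator(nn.Module):
    """Sketch of a self-germinating convolutional kernel: a trainable seed
    psi_C plus two 1x1 convolutional germinators decoding mu and rho.
    The seed layout and the final rearrangement into a conv-weight tensor
    are assumptions, not the authors' reference code."""

    def __init__(self, c_in, c_out, k, delta=1.0):
        super().__init__()
        c_f, c_F = min(c_in, c_out), max(c_in, c_out)
        c_pip = max(1, round(delta * c_f))
        # Seed stored as (1, C_pip, C_F, k*k) so a 1x1 Conv2d maps C_pip -> C_f.
        self.seed = nn.Parameter(torch.empty(1, c_pip, c_F, k * k))
        nn.init.xavier_uniform_(self.seed)
        self.g_mu = nn.Conv2d(c_pip, c_f, kernel_size=1)   # mu germinator
        self.g_rho = nn.Conv2d(c_pip, c_f, kernel_size=1)  # rho germinator
        self.weight_shape = (c_out, c_in, k, k)

    def germinate(self):
        mu = self.g_mu(self.seed).squeeze(0)    # (C_f, C_F, k*k)
        rho = self.g_rho(self.seed).squeeze(0)
        # Rearrange the decoded tensors into the conv-weight layout (assumed).
        return mu.reshape(self.weight_shape), rho.reshape(self.weight_shape)
```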

2.3 Variational Kernel Layer

The purpose of the variational kernel layer is to randomly sample a subset of weights $w\in W$. As described by Blundell et al. (2015), the subset of weights can be obtained by using the mean $\mu$ and standard deviation $\sigma=\log(1+\exp(\rho))$ parameters to respectively shift and scale a randomly sampled noise vector $\epsilon\sim\mathcal{N}(0,1)$, yielding $w=\mu+\sigma\epsilon$. The subset of weights $w$ is then applied to the inputs of the layer as illustrated in Figure 1.
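
A minimal sketch of this sampling step, assuming $\mu$ and $\rho$ have already been germinated as described above:

```python
import torch
import torch.nn.functional as F

def sample_weights(mu: torch.Tensor, rho: torch.Tensor) -> torch.Tensor:
    """Reparameterized sample w = mu + sigma * eps with sigma = log(1 + exp(rho)),
    i.e. the softplus transform used by Blundell et al. (2015)."""
    sigma = F.softplus(rho)        # log(1 + exp(rho)) > 0
    eps = torch.randn_like(mu)     # eps ~ N(0, 1)
    return mu + sigma * eps

# Hypothetical usage with the germinator sketched above:
# mu, rho = germinator.germinate()
# w = sample_weights(mu, rho)
# y = F.conv2d(x, w, stride=1, padding=1)
```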

2.4 Optimization Protocol

While KSNs differ from BNNs in the way the $\mu$ and $\sigma$ parameters are stored, the process of optimizing the distributions of model weights is the same. Thus, for KSN model training, we follow the BNN optimization protocol described by Blundell et al. (2015), using a scale mixture Gaussian prior:

$$P(w)=\prod_{i}\pi\mathcal{N}\left(w_{i}\mid 0,\sigma_{1}^{2}\right)+(1-\pi)\mathcal{N}\left(w_{i}\mid 0,\sigma_{2}^{2}\right) \qquad (1)$$

where $w_{i}$ is the $i^{\text{th}}$ weight, $\pi=\frac{1}{4}$, and $-\log\sigma_{1}^{2}=0$ and $-\log\sigma_{2}^{2}=6$ set the variances of the mixture components in the Gaussian prior. We use a fixed learning rate of $1\times 10^{-3}$ for our experiments and initialize the weights of all seed kernels using Glorot uniform initialization (Glorot and Bengio, 2010).
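
For completeness, a sketch of the log-density of the prior in Eq. (1) under these hyperparameter settings is given below; numerical stabilisation (e.g. computing the mixture in log-space) is omitted for brevity.

```python
import math
import torch

def log_scale_mixture_prior(w, pi=0.25, sigma1=1.0, sigma2=math.exp(-3.0)):
    """Log-density of the scale-mixture prior in Eq. (1), summed over all
    weights. sigma1 and sigma2 follow -log(sigma^2) = 0 and 6 respectively;
    this is a sketch only."""
    def normal_pdf(x, sigma):
        return torch.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    mixture = pi * normal_pdf(w, sigma1) + (1.0 - pi) * normal_pdf(w, sigma2)
    return torch.log(mixture).sum()
```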

Model  $\delta$  DR  Params  RS  ACC $\uparrow$  NLL $\downarrow$  ECE $\downarrow$  ACE $\downarrow$  MCE $\downarrow$
Baseline  -  -  11.2M  1.00  0.9894  0.0353  0.0021  0.0767  0.2432
Baseline  -  0.10  11.2M  1.00  0.9910  0.0268  0.0018  0.1392  0.6413
MC-Drop  -  0.10  11.2M  1.00  0.9910  0.0269  0.0021  0.0688  0.2845
VOGN  -  -  11.2M  1.00  0.9921  1.4763  0.7626  0.6688  0.7673
KSN-E  1.00  -  13.6M  1.22  0.9903  0.0417  0.0174  0.1306  0.2004
KSN-P  1.00  -  13.6M  1.22  0.9926  0.0203  0.0018  0.1199  0.3479
FKSN  1.00  -  12.4M  1.11  0.9925  0.0250  0.0016  0.1014  0.3775
KSN-E  0.75  -  10.2M  0.91  0.9904  0.0373  0.0108  0.1342  0.2639
KSN-P  0.75  -  10.2M  0.91  0.9909  0.0313  0.0030  0.1894  0.7315
FKSN  0.75  -  9.3M  0.83  0.9887  0.0355  0.0018  0.0604  0.2253
KSN-E  0.50  -  6.8M  0.61  0.9858  0.0591  0.0199  0.1377  0.2750
KSN-P  0.50  -  6.8M  0.61  0.9851  0.0506  0.0039  0.1153  0.3541
FKSN  0.50  -  6.2M  0.56  0.9692  0.1042  0.0124  0.0866  0.2254
KSN-E  0.25  -  3.4M  0.31  0.9845  0.0779  0.0286  0.1401  0.3097
KSN-P  0.25  -  3.4M  0.31  0.9873  0.0654  0.0082  0.1083  0.1871
FKSN  0.25  -  3.1M  0.28  0.9679  0.1828  0.0054  0.0407  0.0823
BNN-E  -  -  22.3M  2.00  0.9732  0.2132  0.1299  0.2021  0.3227
BNN-P  -  -  22.3M  2.00  0.9898  0.0359  0.0032  0.1167  0.3767
Table 1: Model performance on the MNIST dataset (LeCun et al., 2010).
Model  $\delta$  DR  Params  RS  ACC $\uparrow$  NLL $\downarrow$  ECE $\downarrow$  ACE $\downarrow$  MCE $\downarrow$
Baseline  -  -  11.2M  1.00  0.9046  0.2641  0.0222  0.0792  0.2981
Baseline  -  0.10  11.2M  1.00  0.9082  0.2601  0.0198  0.0655  0.2762
MC-Drop  -  0.10  11.2M  1.00  0.9067  0.2570  0.0140  0.0373  0.1129
VOGN  -  -  11.2M  1.00  0.8688  1.6317  0.6596  0.5971  0.7439
KSN-E  1.00  -  13.6M  1.22  0.9048  0.2708  0.0232  0.0338  0.0708
KSN-P  1.00  -  13.6M  1.22  0.9058  0.2695  0.0209  0.0656  0.2144
FKSN  1.00  -  12.4M  1.11  0.9122  0.2463  0.0206  0.0733  0.2804
KSN-E  0.75  -  10.2M  0.91  0.9094  0.2583  0.0190  0.0536  0.1610
KSN-P  0.75  -  10.2M  0.91  0.9175  0.2388  0.0205  0.0847  0.2768
FKSN  0.75  -  9.3M  0.83  0.8938  0.2864  0.0078  0.0378  0.1433
KSN-E  0.50  -  6.8M  0.61  0.8785  0.3560  0.0248  0.1332  0.8095
KSN-P  0.50  -  6.8M  0.61  0.8893  0.3115  0.0256  0.0535  0.1354
FKSN  0.50  -  6.2M  0.56  0.8944  0.2851  0.0174  0.0448  0.1234
KSN-E  0.25  -  3.4M  0.31  0.8724  0.3910  0.0530  0.0908  0.1846
KSN-P  0.25  -  3.4M  0.31  0.8777  0.3967  0.0218  0.0557  0.1842
FKSN  0.25  -  3.1M  0.28  0.8744  0.4068  0.0134  0.0586  0.1982
BNN-E  -  -  22.3M  2.00  0.8374  0.5023  0.0659  0.0920  0.1897
BNN-P  -  -  22.3M  2.00  0.8934  0.3149  0.0361  0.0744  0.1167
Table 2: Model performance on the FMNIST dataset (Xiao et al., 2017).
Model  $\delta$  DR  Params  RS  ACC $\uparrow$  NLL $\downarrow$  ECE $\downarrow$  ACE $\downarrow$  MCE $\downarrow$
Baseline  -  -  11.2M  1.00  0.9030  0.6497  0.0768  0.1884  0.3399
Baseline  -  0.10  11.2M  1.00  0.9188  0.4685  0.0577  0.1720  0.2903
MC-Drop  -  0.10  11.2M  1.00  0.9165  0.3458  0.0346  0.0623  0.1285
VOGN  -  -  11.2M  1.00  0.7615  1.7079  0.5463  0.4468  0.6208
KSN-E  1.00  -  13.6M  1.22  0.6143  1.0430  0.0178  0.0208  0.0590
KSN-P  1.00  -  13.6M  1.22  0.8531  0.6044  0.0913  0.1235  0.3179
FKSN  1.00  -  12.4M  1.11  0.9058  0.5861  0.0694  0.1843  0.2875
KSN-E  0.75  -  10.2M  0.91  0.7844  0.7708  0.1830  0.1633  0.2861
KSN-P  0.75  -  10.2M  0.91  0.8821  0.4120  0.0507  0.1127  0.2077
FKSN  0.75  -  9.3M  0.83  0.8912  0.6348  0.0792  0.1876  0.3479
KSN-E  0.50  -  6.8M  0.61  0.7985  0.7071  0.1475  0.1418  0.2237
KSN-P  0.50  -  6.8M  0.61  0.8860  0.5057  0.0661  0.1300  0.2123
FKSN  0.50  -  6.2M  0.56  0.8689  0.7621  0.0941  0.1965  0.3579
KSN-E  0.25  -  3.4M  0.31  0.8219  0.5765  0.0869  0.1326  0.4865
KSN-P  0.25  -  3.4M  0.31  0.8382  0.7951  0.0637  0.1355  0.1973
FKSN  0.25  -  3.1M  0.28  0.8476  0.9671  0.0879  0.1830  0.2998
BNN-E  -  -  22.3M  2.00  0.3905  1.6020  0.0935  0.0841  0.1840
BNN-P  -  -  22.3M  2.00  0.5596  1.6792  0.2432  0.1928  0.3575
Table 3: Model performance on the CIFAR10 dataset (Krizhevsky et al., 2009).
Model  $\delta$  DR  Params  RS  ACC $\uparrow$  NLL $\downarrow$  ECE $\downarrow$  ACE $\downarrow$  MCE $\downarrow$
Baseline  -  -  11.2M  1.00  0.6541  2.9428  0.2609  0.3131  0.4993
Baseline  -  0.10  11.2M  1.00  0.6700  2.4747  0.2356  0.2786  0.4587
MC-Drop  -  0.10  11.2M  1.00  0.6675  1.8902  0.1509  0.1759  0.2460
VOGN  -  -  11.2M  1.00  0.4665  4.1917  0.4458  0.4458  0.4458
KSN-E  1.00  -  13.7M  1.22  0.4947  2.0648  0.1536  0.1529  0.2418
KSN-P  1.00  -  13.7M  1.22  0.6235  1.9624  0.2115  0.2202  0.3861
FKSN  1.00  -  12.6M  1.11  0.654  2.6108  0.2534  0.3021  0.4760
KSN-E  0.75  -  10.3M  0.91  0.5659  1.7881  0.1632  0.1497  0.2697
KSN-P  0.75  -  10.3M  0.91  0.6223  1.9163  0.2045  0.2204  0.3710
FKSN  0.75  -  9.3M  0.83  0.6390  2.6004  0.2543  0.2937  0.4748
KSN-E  0.50  -  6.8M  0.61  0.5130  1.9858  0.0864  0.0926  0.1713
KSN-P  0.50  -  6.8M  0.61  0.5607  2.0893  0.1941  0.1923  0.3208
FKSN  0.50  -  6.2M  0.56  0.5720  3.2770  0.3046  0.3149  0.5168
KSN-E  0.25  -  3.4M  0.31  0.5075  1.9962  0.0358  0.0450  0.0931
KSN-P  0.25  -  3.4M  0.31  0.5426  2.2156  0.2064  0.1996  0.3763
FKSN  0.25  -  3.1M  0.28  0.5492  3.4902  0.3151  0.3168  0.5109
BNN-E  -  -  22.4M  2.00  0.2714  3.0345  0.0667  0.1519  0.3081
BNN-P  -  -  22.5M  2.00  0.4315  2.2201  0.1199  0.1216  0.1898
Table 4: Model performance on the CIFAR100 dataset (Krizhevsky et al., 2009).
Figure 3: Model accuracy vs. number of parameters on MNIST, FMNIST, CIFAR10 and CIFAR100 classification tasks. Compared with BNNs, KSN-E and KSN-P methods consistently achieve high classification accuracy using significantly fewer parameters.

3 Experiments

The aim of the KSN is to directly optimize a set of $\mu$ and $\rho$ parameters for a weight distribution without the 2-fold increase in the number of parameters required by conventional BNNs (Blundell et al., 2015; Shridhar et al., 2019). Thus, to validate the effectiveness of the KSN, we evaluate the relationship between model size and classification accuracy. Specifically, in Figure 3 we observe how changes in the number of model parameters affect model performance on the MNIST, FMNIST, CIFAR10 and CIFAR100 classification tasks. Furthermore, in Tables 1, 2, 3, and 4, we evaluate how the number of parameters (Params) and the model size relative to a ResNet18 baseline (RS) affect the accuracy (ACC), Negative Log-Likelihood (NLL), Expected Calibration Error (ECE), Average Calibration Error (ACE) and Maximum Calibration Error (MCE) when performing variational inference. A well-calibrated model should have low NLL, ECE, ACE and MCE rates on unseen test data (Osawa et al., 2019). Hence, these metrics provide a means of comparing each method's ability to produce well-calibrated models with Bayesian inference.
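
For reference, the sketch below shows one common way to compute ECE by binning predictions by confidence; the number of bins and the handling of the lowest bin edge are our assumptions rather than details taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, accuracies, n_bins=15):
    """ECE sketch: bin predictions by confidence and take the sample-weighted
    average of |accuracy - confidence| per bin. ACE would average the gaps
    over non-empty bins; MCE would take their maximum."""
    confidences = np.asarray(confidences, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)   # 1.0 if correct else 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(accuracies[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in the bin
    return ece
```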

3.1 Method Comparison

The details of each of the methods used in our experiments are described below. For the simple MNIST and FMNIST classification tasks, accuracy was evaluated after training for 5 epochs. For the more complex CIFAR10 and CIFAR100 classification tasks, accuracy was evaluated after training for 100 epochs.

Baseline Fixed-Point Estimate. All models in our experiments use a ResNet18 backbone (He et al., 2016). In order to observe the effects of inference using distributions of model weights, we require a baseline fixed point estimate for comparison. Thus, we establish a performance benchmark by modelling fixed point estimates of model weights for each classification task with ResNet18.

Monte-Carlo Dropout. We implement the MC-Drop method as introduced by Gal and Ghahramani (2016). Following the work of Osawa et al. (2019), we configure MC-Drop with a dropout rate (DR) of 0.1. In line with the other methods used in our study, we use a ResNet18 backbone (He et al., 2016) and also report the posterior (Baseline with DR = 0.1), where we perform static inference on a model trained with dropout.

Variational Online Gauss-Newton. To compare BNN optimization with variational natural-gradient optimization, we implement the VOGN optimization protocol introduced by Khan et al. (2018). Specifically, we train a ResNet18 backbone (He et al., 2016) with Vadam optimization (Khan et al., 2018) and randomly sample 10 weight perturbations to compute the variational gradients during training and to perform variational inference.

Bayesian Neural Networks. The BNNs used in these experiments are constructed by essentially cloning the ResNet18 backbone such that one set of parameters stores the $\mu$ and the other stores the $\rho$ of the distribution of model weights. During training, the BNNs are optimized using Bayes by Backprop (BBB) (Blundell et al., 2015). The original implementation of BBB could only be applied to linear transformation layers; however, recent work by Shridhar et al. (2019) provides a generalization of BBB that can be applied to convolutional layers. During inference, we sample the posterior mean of the learned distribution (BNN-P), as well as an ensemble of 10 randomly sampled network weights (BNN-E).
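
The two inference modes used for BNNs (and for KSNs below) can be sketched as follows; `sample_forward` is a hypothetical callable that performs one forward pass with freshly sampled weights.

```python
import torch

@torch.no_grad()
def ensemble_predict(sample_forward, x, n_samples=10):
    """Ensemble inference (BNN-E / KSN-E): average the softmax outputs of
    n_samples forward passes, each with freshly sampled weights.
    Posterior-mean inference (BNN-P / KSN-P) would instead run a single
    pass with w = mu."""
    probs = torch.stack([torch.softmax(sample_forward(x), dim=-1)
                         for _ in range(n_samples)])
    return probs.mean(dim=0)
```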

Fixed-Point Kernel Seed Networks. While the Baseline ResNet18 method is useful for studying the effects of BDL on conventional DNNs, the KSN method requires its own fixed-point implementation for comparison. This is because the process of reconstructing a layer from a kernel seed matrix is unique to KSNs: when comparing the performance of KSNs to the Baseline ResNet18 method, we cannot distinguish whether differences in model performance are due to the effects of BDL on KSNs or to the fundamental differences between DNN and KSN architectures. For this reason, we also evaluate a Fixed-Point Kernel Seed Network (FKSN). The FKSN is similar to the Variational Kernel Seed Network described in Section 2; however, it does not model distributions of weights, but rather simply reconstructs the $\mu$ weights for each layer. Specifically, $G_{C_{\mu}}$ germinates $\psi_{C}$ directly to $K_{C}$ for convolutional layers, and $G_{FC_{\mu}}$ germinates $\psi_{FC}$ directly to $K_{FC}$ for linear transformation layers, without variational sampling. This enables us to observe the effects of applying BDL to KSNs, as FKSNs are a fixed-point implementation of our KSN model.

Variational Kernel Seed Networks. We implement the Variational Kernel Seed Network as described in Section 2 to germinate a neural network with a ResNet18 architecture (He et al., 2016). As with BNNs, we evaluate inference with the posterior mean of the learned distribution (KSN-P), as well as with an ensemble of 10 randomly sampled network weights (KSN-E).

4 Discussion

The results in Figure 3 demonstrate how KSNs can approximate a distribution of network weights without a 2-fold increase in the number of parameters. From the results tabulated in Section 3, our KSN model achieves a higher classification accuracy than the conventional BNN in all classification tasks. Most notably, in the CIFAR10 (Table 3) and CIFAR100 (Table 4) classification tasks, a $\delta$ of 0.25 was sufficient to outperform the standard BNN. This means that the proposed KSN can achieve higher classification accuracy than BNNs with only 15% of the required parameters. Moreover, KSNs consistently produce better-calibrated models overall compared to BNNs, as evidenced by the lower NLL, ECE, ACE and MCE values in Tables 1, 2, 3, and 4.

The VOGN method could also outperform the standard BNNs with significantly fewer parameters. However, VOGN often failed to match the performance of MC-Drop and our proposed KSN method, particularly on the FMNIST (Table 2), CIFAR10 (Table 3), and CIFAR100 (Table 4) classification tasks.

Overall, MC-Drop consistently achieves high classification accuracy and can produce well calibrated models. The performance improvements with MC-Drop are most prominent on CIFAR10 (Table 3), and CIFAR100 (Table 4) classification tasks, where the other methods, including the proposed KSN method, suffer significant performance degradation compared to the baseline. However, since the MC-Drop method is unable to quantify uncertainty over the entire weight space, the benefits of MC-Drop for performing variational inference are limited to certain tasks (Section 1).

5 Conclusions

The proposed KSN method can significantly reduce the number of parameters required for BDL (compared to conventional BNNs) without compromising on performance. While conventional BNNs require a 2-fold increase in the number of parameters compared to their DNN equivalents, BDL can be applied using KSNs with up to 39% fewer parameters than the equivalent DNN. This enables BDL to scale to the depth of modern DNN architectures.


References

  • Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 1613–1622. JMLR.org, 2015.
  • Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
  • Graves (2011) Alex Graves. Practical variational inference for neural networks. In Advances in neural information processing systems, pages 2348–2356. Citeseer, 2011.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1321–1330. JMLR. org, 2017.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Jiang et al. (2012) Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274, 2012.
  • Kendall and Gal (2017) Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017.
  • Khan et al. (2018) Mohammad Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable bayesian deep learning by weight-perturbation in adam. In International Conference on Machine Learning, pages 2611–2620. PMLR, 2018.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pages 6402–6413, 2017.
  • LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. 2010.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Nguyen et al. (2017) Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
  • Osawa et al. (2019) Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E Turner, Rio Yokota, and Mohammad Emtiyaz Khan. Practical deep learning with bayesian principles. arXiv preprint arXiv:1906.02506, 2019.
  • Shridhar et al. (2019) Kumar Shridhar, Felix Laumann, and Marcus Liwicki. A comprehensive guide to bayesian convolutional neural network with variational inference. arXiv preprint arXiv:1901.02731, 2019.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • Tseran et al. (2018) Hanna Tseran, Mohammad Emtiyaz Khan, Tatsuya Harada, and Thang D Bui. Natural variational continual learning. In NeurIPS Workshop on Continual Learning, 2018.
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.