
Scalable Bayesian Deep Learning with Kernel Seed Networks

Sam Maksoud (s.maksoud@uqconnect.edu.au)
Kun Zhao (k.zhao1@uq.edu.au)
Can Peng (can.peng@uqconnect.edu.au)
Brian C. Lovell (lovell@itee.uq.edu.au)

School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, QLD 4072, Australia
Abstract

This paper addresses the scalability problem of Bayesian deep neural networks. The performance of deep neural networks is undermined by the fact that these algorithms have poorly calibrated measures of uncertainty. This restricts their application in high-risk domains such as computer-aided diagnosis and autonomous vehicle navigation. Bayesian Deep Learning (BDL) offers a promising method for representing uncertainty in neural networks. However, BDL requires a separate set of parameters to store the mean and standard deviation of model weights in order to learn a distribution. This results in a prohibitive 2-fold increase in the number of model parameters. To address this problem we present a method for performing BDL, namely Kernel Seed Networks (KSNs), which does not require a 2-fold increase in the number of parameters. KSNs use $1\times 1$ convolution operations to learn a compressed latent-space representation of the parameter distribution. In this paper we show how this allows KSNs to outperform conventional BDL methods while reducing the number of required parameters by up to a factor of 6.6.

Keywords: Bayesian Networks, Weight Uncertainty, Optimization and learning methods, Deep Learning, Neural Networks

1 Introduction

Modern deep neural networks (DNNs) are capable of learning complex representations of high dimensional data (He et al., 2016; Simonyan and Zisserman, 2014). This has enabled DNNs to exceed human performance in a growing number of decision making tasks (He et al., 2015; Mnih et al., 2015; Silver et al., 2017). However, these algorithms are notoriously difficult to implement in real world decision making systems because conventional DNNs are prone to overfitting and thus have poorly calibrated measures of uncertainty (Blundell et al., 2015). A well-calibrated measure of uncertainty provides insight into the reliability of model predictions (Guo et al., 2017). This is crucial for applications such as autonomous vehicle navigation (Kendall and Gal, 2017) and computer-assisted diagnosis (Jiang et al., 2012), where confidently incorrect decisions can have fatal consequences. Hence, there has been renewed interest in Bayesian methods for deep learning, as they naturally provide well-calibrated representations of uncertainty.

Unlike conventional DNNs, which learn a fixed point estimate of model weights, Bayesian Neural Networks (BNNs) learn a weight distribution, parameterized by the mean and standard deviation of the weights at each layer. The parameter distributions are optimized using variational sampling during training, which forces BNNs to be robust to perturbations of model weights.

Effectively, this makes BNN optimization equivalent to learning an infinite ensemble of models (Blundell et al., 2015). With an ensemble, the variance of predictions among constituent models can be used to approximate the epistemic uncertainty (Lakshminarayanan et al., 2017). Similarly, epistemic uncertainty in BNNs can be estimated using the variance of predictions among weights that are randomly sampled from the learned distributions.
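
To make this concrete, the sketch below estimates epistemic uncertainty from the spread of sampled predictions; `sample_fn` is a hypothetical callable (not from the paper) that draws one set of weights and returns class probabilities for a batch.

```python
import torch

@torch.no_grad()
def epistemic_uncertainty(sample_fn, x, n_samples=10):
    """Approximate epistemic uncertainty as the variance of predictions made
    by weights drawn from the learned distribution (sketch).

    sample_fn: illustrative callable that draws one set of weights and
    returns class probabilities for the batch x.
    """
    preds = torch.stack([sample_fn(x) for _ in range(n_samples)])  # (S, B, C)
    mean_pred = preds.mean(dim=0)   # predictive mean over the S samples
    variance = preds.var(dim=0)     # spread across samples ~ epistemic uncertainty
    return mean_pred, variance
```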

While BNNs have proven to be effective at inducing uncertainty into the weights of a model, they require significantly more parameters than their DNN counterparts (Shridhar et al., 2019). This makes it impractical to scale variational Bayesian learning to the depth of modern DNN architectures. To overcome this problem, Gal and Ghahramani (2016) proposed the use of Monte-Carlo Dropout (MC-Drop); a method for approximating Bayesian inference by applying Dropout (Srivastava et al., 2014) during both model training and inference.

A major advantage of MC-Drop is that it requires no additional parameters to approximate Bayesian inference. MC-Drop represents model uncertainty by sampling sub-networks with randomly dropped-out units and connections to perform variational inference (Gal and Ghahramani, 2016). Although this approach captures the uncertainty of model activations, it is unable to represent the uncertainty of model weights, since all sub-network perturbations sampled using the Dropout method share the same parameters (Srivastava et al., 2014). This limits the scope in which MC-Drop can be applied, since many applications, such as continual learning (Kirkpatrick et al., 2017) and model pruning (Graves, 2011; Blundell et al., 2015), require representations of uncertainty over the entire parameter space. To address these limitations, Khan et al. (2018) propose the Variational Online Gauss-Newton (VOGN) algorithm. VOGN parameterizes a distribution of weights by modelling the weights of a DNN as the mean, and the second raw moment (uncentered variance) of the computed gradients as the variance.
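
As a rough illustration of the MC-Drop procedure, the following sketch keeps dropout layers active at test time and averages several stochastic forward passes; the function name and the default of 10 samples are our choices, not details taken from Gal and Ghahramani (2016).

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 10):
    """MC-Drop sketch: keep dropout active at test time and average the
    predictions of several stochastic forward passes."""
    model.eval()
    for m in model.modules():
        # Re-enable only the dropout layers; batch-norm etc. stay in eval mode.
        if isinstance(m, nn.Dropout):
            m.train()
    preds = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)
```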

Similar to BNNs, VOGN optimization involves randomly sampling a set of parameters from a weight distribution to compute variational gradients and to perform variational inference. However, the key advantage of VOGN optimization is that it does not require learning a separate set of mean and standard deviation parameters, since it models the variance with a vector that is already maintained by adaptive optimization methods such as Adam (Kingma and Ba, 2014). It is important to note, however, that VOGN only finds an approximate solution to the BNN objective (Blundell et al., 2015). Tseran et al. (2018) demonstrate how this approximation can result in a trade-off between accuracy and ease of implementation when applied to Variational Continual Learning (VCL) methods (Nguyen et al., 2017). It was shown that using the VOGN approximation for VCL results in worse performance on certain tasks compared to conventional VCL, which optimizes the variance of the weight distributions directly (Nguyen et al., 2017) rather than employing an online estimate of the diagonal of the Fisher Information Matrix to update the variance of the parameters (Tseran et al., 2018; Khan et al., 2018). Hence, despite the high computational cost of BNNs, there are benefits to directly optimizing the mean and standard deviation parameters of a weight distribution during training.

In this paper, we introduce a novel class of BNNs, namely Kernel Seed Networks (KSNs), which significantly reduce the number of parameters required for Bayesian deep learning (BDL). In contrast with conventional BNNs, which require doubling the size of the network in order to learn a separate set of mean and standard deviation weights for each of the parameter distributions, our method applies $1\times 1$ convolutional kernels to trainable "seed" vectors to decode the mean and standard deviation weight vectors. The ability of $1\times 1$ convolutional layers to map inputs to a higher-dimensional space allows for a further reduction of parameters, since they can be used to upsample downsized seed kernels to the required filter size. In our experiments on the MNIST (LeCun et al., 2010), FMNIST (Xiao et al., 2017) and CIFAR (Krizhevsky et al., 2009) datasets, we demonstrate how our proposed method can be used to conduct BDL with a 44–70% reduction in the number of parameters (compared to conventional BNNs) without compromising on model performance.

Figure 1: Framework of the proposed self-germinating kernel filter. The $N$-dimensional kernel seed vector is passed through the germination layer to decode the mean $\mu$ and standard deviation $\sigma$ of the layer weight distribution. In the variational layer, $\mu$ and $\sigma$ are used to respectively shift and scale a randomly sampled unit Gaussian vector $\epsilon$. This process returns the final $N$-dimensional kernel filter (purple) that is applied to the inputs of the layer.

2 Kernel Seed Networks

As described in Section 1, the purpose of KSNs is to reduce the number of parameters required to construct the BNNs first described by Blundell et al. (2015). To this end, we propose to replace the conventional $N$-dimensional kernel filters in DNNs with self-germinating kernel filters (Figure 1) comprising three main components: (1) an $N$-dimensional kernel seed, (2) a kernel germination layer, and (3) a variational kernel layer. We describe the details of these components and our optimization protocol below.

2.1 The Kernel Seed

The purpose of the kernel seed is to store a latent-space representation $\psi$ of the layer weight distribution $W$. To this end, we construct seed kernels for all linear and convolutional layers in a neural network as follows.

Linear Kernel Seeds. A linear kernel seed $\psi_{FC}$ stores the compressed weight distributions for a given linear transformation layer $K_{FC}\in\mathbb{R}^{C_{in}\times C_{out}}$, where $C_{in}$ and $C_{out}$ are the number of input and output channels respectively. Since the kernel germination method is capable of mapping kernel seeds to a higher-dimensional space, we can reduce the dimensions of $\psi_{FC}$ by applying the scaling parameter $\delta$ to $C_{f}=\min(C_{in},C_{out})$. Specifically, the linear kernel seed can be expressed as $\psi_{FC}\in\mathbb{R}^{C_{pip}\times C_{F}}$, where $C_{pip}=\delta C_{f}$ and $C_{F}=\max(C_{in},C_{out})$.

Convolutional Kernel Seeds. A convolutional kernel seed $\psi_{C}$ stores the compressed weight distributions for a given convolutional layer $K_{C}\in\mathbb{R}^{C_{in}\times C_{out}\times k\times k}$, where $k\times k$ are the dimensions of the convolving kernel filter. As with linear kernel seeds, we can reduce the dimensions of $\psi_{C}$ by applying the scaling parameter $\delta$ to $C_{f}$, yielding $C_{pip}$. Specifically, the convolutional kernel seed can be expressed as $\psi_{C}\in\mathbb{R}^{C_{pip}\times C_{F}\times k\times k}$.
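
For reference, the seed dimensions above can be computed with a small helper; this is a sketch under the definitions in this section, and rounding $\delta C_{f}$ to an integer is our assumption.

```python
def seed_shape(c_in, c_out, k=None, delta=1.0):
    """Return the kernel-seed dimensions for a layer with c_in input and
    c_out output channels; k is the conv kernel size (None for linear layers).
    Rounding delta * C_f to an integer is an assumption of this sketch."""
    c_f = min(c_in, c_out)                 # C_f
    c_F = max(c_in, c_out)                 # C_F
    c_pip = max(1, round(delta * c_f))     # C_pip = delta * C_f
    if k is None:
        return (c_pip, c_F)                # psi_FC in R^{C_pip x C_F}
    return (c_pip, c_F, k, k)              # psi_C in R^{C_pip x C_F x k x k}

# Example: a 3x3 convolution with 64 -> 128 channels and delta = 0.5
print(seed_shape(64, 128, k=3, delta=0.5))   # (32, 128, 3, 3)
```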

Figure 2: Illustration of the germination procedure. Two $1\times 1$ convolutional upsampling operations are applied to a compressed seed vector to decode the mean and standard deviation parameters of the weight distribution. In effect, the kernel seed is a latent-space representation of the weight distribution from which the parameters of a given layer are derived.

2.2 Germination Layer

The germination layer comprises two convolutional decoders (germinators): (1) a $\mu$ decoder, and (2) a $\rho$ decoder. It receives the kernel seed $\psi$ as input and decodes the $\mu$ and $\rho$ parameters (Figure 2), which are then used to reconstruct $W$.

Linear Seed Germinators. To decode the $\mu$ and $\rho$ parameters for a linear transformation layer $K_{FC}$, we apply two convolutional germinating kernels, $G_{FC_{\mu}}$ and $G_{FC_{\rho}}$, to the linear kernel seed $\psi_{FC}$. $G_{FC_{\mu}}$ and $G_{FC_{\rho}}$ are essentially 1D convolutional kernels where $G_{FC}\in\mathbb{R}^{C_{f}\times C_{pip}\times 1}$. When applied to $\psi_{FC}$, $G_{FC}$ outputs a matrix of size $C_{F}\times C_{f}$. When $C_{F}=C_{out}$, a transformation operation is applied to the outputs of $G_{FC_{\mu}}$ and $G_{FC_{\rho}}$, yielding $W_{\mu}\in\mathbb{R}^{C_{in}\times C_{out}}$ and $W_{\rho}\in\mathbb{R}^{C_{in}\times C_{out}}$ respectively.

Convolutional Seed Germinators. To decode the $\mu$ and $\rho$ parameters for a convolutional layer $K_{C}$, we apply two convolutional germinating kernels, $G_{C_{\mu}}$ and $G_{C_{\rho}}$, to the convolutional kernel seed $\psi_{C}$. $G_{C_{\mu}}$ and $G_{C_{\rho}}$ are essentially 2D convolutional kernels where $G_{C}\in\mathbb{R}^{C_{f}\times C_{pip}\times 1\times 1}$. When applied to $\psi_{C}$, $G_{C}$ outputs a tensor of size $C_{F}\times C_{f}\times k\times k$. As with linear kernel seeds, when $C_{F}=C_{out}$, a transformation operation is applied to the outputs of $G_{C_{\mu}}$ and $G_{C_{\rho}}$, yielding $W_{\mu}\in\mathbb{R}^{C_{in}\times C_{out}\times k\times k}$ and $W_{\rho}\in\mathbb{R}^{C_{in}\times C_{out}\times k\times k}$ respectively.
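
The following PyTorch-style sketch combines a convolutional kernel seed with its germination layer. It follows the shapes described above, but the storage layout of $\psi_{C}$, the rearrangement into the final weight tensor, and the class name are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvSeedGerminator(nn.Module):
    """Sketch of a self-germinating convolutional kernel: a trainable seed
    psi_C plus two 1x1 convolutional germinators decoding mu and rho.
    The seed layout and the final rearrangement into a conv-weight tensor
    are assumptions, not the authors' reference code."""

    def __init__(self, c_in, c_out, k, delta=1.0):
        super().__init__()
        c_f, c_F = min(c_in, c_out), max(c_in, c_out)
        c_pip = max(1, round(delta * c_f))
        # Seed stored as (1, C_pip, C_F, k*k) so a 1x1 Conv2d maps C_pip -> C_f.
        self.seed = nn.Parameter(torch.empty(1, c_pip, c_F, k * k))
        nn.init.xavier_uniform_(self.seed)
        self.g_mu = nn.Conv2d(c_pip, c_f, kernel_size=1)   # mu germinator
        self.g_rho = nn.Conv2d(c_pip, c_f, kernel_size=1)  # rho germinator
        self.weight_shape = (c_out, c_in, k, k)

    def germinate(self):
        mu = self.g_mu(self.seed).squeeze(0)    # (C_f, C_F, k*k)
        rho = self.g_rho(self.seed).squeeze(0)
        # Rearrange the decoded tensors into the conv-weight layout (assumed).
        return mu.reshape(self.weight_shape), rho.reshape(self.weight_shape)
```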

2.3 Variational Kernel Layer

The purpose of the variational kernel layer is to randomly sample a subset of weights $w\in W$. As described by Blundell et al. (2015), the subset of weights can be obtained by using the mean $\mu$ and standard deviation $\sigma=\log(1+\exp(\rho))$ parameters to respectively shift and scale a randomly sampled noise vector $\epsilon\sim\mathcal{N}(0,1)$, yielding $w=\mu+\sigma\epsilon$. The subset of weights $w$ is then applied to the inputs of the layer as illustrated in Figure 1.
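
A minimal sketch of this sampling step, assuming $\mu$ and $\rho$ have already been germinated as described above:

```python
import torch
import torch.nn.functional as F

def sample_weights(mu: torch.Tensor, rho: torch.Tensor) -> torch.Tensor:
    """Reparameterized sample w = mu + sigma * eps with sigma = log(1 + exp(rho)),
    i.e. the softplus transform used by Blundell et al. (2015)."""
    sigma = F.softplus(rho)        # log(1 + exp(rho)) > 0
    eps = torch.randn_like(mu)     # eps ~ N(0, 1)
    return mu + sigma * eps

# Hypothetical usage with the germinator sketched above:
# mu, rho = germinator.germinate()
# w = sample_weights(mu, rho)
# y = F.conv2d(x, w, stride=1, padding=1)
```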

2.4 Optimization Protocol

While KSNs differ from BNNs in the way the $\mu$ and $\sigma$ parameters are stored, the process of optimizing the distributions of model weights is the same. Thus, for KSN model training, we follow the BNN optimization protocol described by Blundell et al. (2015), using a scale mixture Gaussian prior:

$$P(w)=\prod_{i}\pi\mathcal{N}\left(w_{i}\mid 0,\sigma_{1}^{2}\right)+(1-\pi)\mathcal{N}\left(w_{i}\mid 0,\sigma_{2}^{2}\right) \qquad (1)$$

where $w_{i}$ is the $i^{\text{th}}$ weight, $\pi=\frac{1}{4}$, and $-\log\sigma_{1}^{2}=0$ and $-\log\sigma_{2}^{2}=6$ set the variances of the mixture components in the Gaussian prior. We use a fixed learning rate of $1\times 10^{-3}$ for our experiments and initialize the weights of all seed kernels using Glorot uniform initialization (Glorot and Bengio, 2010).
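
For completeness, a sketch of the log-density of the prior in Eq. (1) under these hyperparameter settings is given below; numerical stabilisation (e.g. computing the mixture in log-space) is omitted for brevity.

```python
import math
import torch

def log_scale_mixture_prior(w, pi=0.25, sigma1=1.0, sigma2=math.exp(-3.0)):
    """Log-density of the scale-mixture prior in Eq. (1), summed over all
    weights. sigma1 and sigma2 follow -log(sigma^2) = 0 and 6 respectively;
    this is a sketch only."""
    def normal_pdf(x, sigma):
        return torch.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    mixture = pi * normal_pdf(w, sigma1) + (1.0 - pi) * normal_pdf(w, sigma2)
    return torch.log(mixture).sum()
```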

Model  $\delta$  DR  Params  RS  ACC $\uparrow$  NLL $\downarrow$  ECE $\downarrow$  ACE $\downarrow$  MCE $\downarrow$
Baseline  -  -  11.2M  1.00  0.9894  0.0353  0.0021  0.0767  0.2432
Baseline  -  0.10  11.2M  1.00  0.9910  0.0268  0.0018  0.1392  0.6413
MC-Drop  -  0.10  11.2M  1.00  0.9910  0.0269  0.0021  0.0688  0.2845
VOGN  -  -  11.2M  1.00  0.9921  1.4763  0.7626  0.6688  0.7673
KSN-E  1.00  -  13.6M  1.22  0.9903  0.0417  0.0174  0.1306  0.2004
KSN-P  1.00  -  13.6M  1.22  0.9926  0.0203  0.0018  0.1199  0.3479
FKSN  1.00  -  12.4M  1.11  0.9925  0.0250  0.0016  0.1014  0.3775
KSN-E  0.75  -  10.2M  0.91  0.9904  0.0373  0.0108  0.1342  0.2639
KSN-P  0.75  -  10.2M  0.91  0.9909  0.0313  0.0030  0.1894  0.7315
FKSN  0.75  -  9.3M  0.83  0.9887  0.0355  0.0018  0.0604  0.2253
KSN-E  0.50  -  6.8M  0.61  0.9858  0.0591  0.0199  0.1377  0.2750
KSN-P  0.50  -  6.8M  0.61  0.9851  0.0506  0.0039  0.1153  0.3541
FKSN  0.50  -  6.2M  0.56  0.9692  0.1042  0.0124  0.0866  0.2254
KSN-E  0.25  -  3.4M  0.31  0.9845  0.0779  0.0286  0.1401  0.3097
KSN-P  0.25  -  3.4M  0.31  0.9873  0.0654  0.0082  0.1083  0.1871
FKSN  0.25  -  3.1M  0.28  0.9679  0.1828  0.0054  0.0407  0.0823
BNN-E  -  -  22.3M  2.00  0.9732  0.2132  0.1299  0.2021  0.3227
BNN-P  -  -  22.3M  2.00  0.9898  0.0359  0.0032  0.1167  0.3767
Table 1: Model performance on the MNIST dataset (LeCun et al., 2010).
Model  $\delta$  DR  Params  RS  ACC $\uparrow$  NLL $\downarrow$  ECE $\downarrow$  ACE $\downarrow$  MCE $\downarrow$
Baseline  -  -  11.2M  1.00  0.9046  0.2641  0.0222  0.0792  0.2981
Baseline  -  0.10  11.2M  1.00  0.9082  0.2601  0.0198  0.0655  0.2762
MC-Drop  -  0.10  11.2M  1.00  0.9067  0.2570  0.0140  0.0373  0.1129
VOGN  -  -  11.2M  1.00  0.8688  1.6317  0.6596  0.5971  0.7439
KSN-E  1.00  -  13.6M  1.22  0.9048  0.2708  0.0232  0.0338  0.0708
KSN-P  1.00  -  13.6M  1.22  0.9058  0.2695  0.0209  0.0656  0.2144
FKSN  1.00  -  12.4M  1.11  0.9122  0.2463  0.0206  0.0733  0.2804
KSN-E  0.75  -  10.2M  0.91  0.9094  0.2583  0.0190  0.0536  0.1610
KSN-P  0.75  -  10.2M  0.91  0.9175  0.2388  0.0205  0.0847  0.2768
FKSN  0.75  -  9.3M  0.83  0.8938  0.2864  0.0078  0.0378  0.1433
KSN-E  0.50  -  6.8M  0.61  0.8785  0.3560  0.0248  0.1332  0.8095
KSN-P  0.50  -  6.8M  0.61  0.8893  0.3115  0.0256  0.0535  0.1354
FKSN  0.50  -  6.2M  0.56  0.8944  0.2851  0.0174  0.0448  0.1234
KSN-E  0.25  -  3.4M  0.31  0.8724  0.3910  0.0530  0.0908  0.1846
KSN-P  0.25  -  3.4M  0.31  0.8777  0.3967  0.0218  0.0557  0.1842
FKSN  0.25  -  3.1M  0.28  0.8744  0.4068  0.0134  0.0586  0.1982
BNN-E  -  -  22.3M  2.00  0.8374  0.5023  0.0659  0.0920  0.1897
BNN-P  -  -  22.3M  2.00  0.8934  0.3149  0.0361  0.0744  0.1167
Table 2: Model performance on the FMNIST dataset (Xiao et al., 2017).
Model  $\delta$  DR  Params  RS  ACC $\uparrow$  NLL $\downarrow$  ECE $\downarrow$  ACE $\downarrow$  MCE $\downarrow$
Baseline  -  -  11.2M  1.00  0.9030  0.6497  0.0768  0.1884  0.3399
Baseline  -  0.10  11.2M  1.00  0.9188  0.4685  0.0577  0.1720  0.2903
MC-Drop  -  0.10  11.2M  1.00  0.9165  0.3458  0.0346  0.0623  0.1285
VOGN  -  -  11.2M  1.00  0.7615  1.7079  0.5463  0.4468  0.6208
KSN-E  1.00  -  13.6M  1.22  0.6143  1.0430  0.0178  0.0208  0.0590
KSN-P  1.00  -  13.6M  1.22  0.8531  0.6044  0.0913  0.1235  0.3179
FKSN  1.00  -  12.4M  1.11  0.9058  0.5861  0.0694  0.1843  0.2875
KSN-E  0.75  -  10.2M  0.91  0.7844  0.7708  0.1830  0.1633  0.2861
KSN-P  0.75  -  10.2M  0.91  0.8821  0.4120  0.0507  0.1127  0.2077
FKSN  0.75  -  9.3M  0.83  0.8912  0.6348  0.0792  0.1876  0.3479
KSN-E  0.50  -  6.8M  0.61  0.7985  0.7071  0.1475  0.1418  0.2237
KSN-P  0.50  -  6.8M  0.61  0.8860  0.5057  0.0661  0.1300  0.2123
FKSN  0.50  -  6.2M  0.56  0.8689  0.7621  0.0941  0.1965  0.3579
KSN-E  0.25  -  3.4M  0.31  0.8219  0.5765  0.0869  0.1326  0.4865
KSN-P  0.25  -  3.4M  0.31  0.8382  0.7951  0.0637  0.1355  0.1973
FKSN  0.25  -  3.1M  0.28  0.8476  0.9671  0.0879  0.1830  0.2998
BNN-E  -  -  22.3M  2.00  0.3905  1.6020  0.0935  0.0841  0.1840
BNN-P  -  -  22.3M  2.00  0.5596  1.6792  0.2432  0.1928  0.3575
Table 3: Model performance on the CIFAR10 dataset (Krizhevsky et al., 2009).
Model  $\delta$  DR  Params  RS  ACC $\uparrow$  NLL $\downarrow$  ECE $\downarrow$  ACE $\downarrow$  MCE $\downarrow$
Baseline  -  -  11.2M  1.00  0.6541  2.9428  0.2609  0.3131  0.4993
Baseline  -  0.10  11.2M  1.00  0.6700  2.4747  0.2356  0.2786  0.4587
MC-Drop  -  0.10  11.2M  1.00  0.6675  1.8902  0.1509  0.1759  0.2460
VOGN  -  -  11.2M  1.00  0.4665  4.1917  0.4458  0.4458  0.4458
KSN-E  1.00  -  13.7M  1.22  0.4947  2.0648  0.1536  0.1529  0.2418
KSN-P  1.00  -  13.7M  1.22  0.6235  1.9624  0.2115  0.2202  0.3861
FKSN  1.00  -  12.6M  1.11  0.654  2.6108  0.2534  0.3021  0.4760
KSN-E  0.75  -  10.3M  0.91  0.5659  1.7881  0.1632  0.1497  0.2697
KSN-P  0.75  -  10.3M  0.91  0.6223  1.9163  0.2045  0.2204  0.3710
FKSN  0.75  -  9.3M  0.83  0.6390  2.6004  0.2543  0.2937  0.4748
KSN-E  0.50  -  6.8M  0.61  0.5130  1.9858  0.0864  0.0926  0.1713
KSN-P  0.50  -  6.8M  0.61  0.5607  2.0893  0.1941  0.1923  0.3208
FKSN  0.50  -  6.2M  0.56  0.5720  3.2770  0.3046  0.3149  0.5168
KSN-E  0.25  -  3.4M  0.31  0.5075  1.9962  0.0358  0.0450  0.0931
KSN-P  0.25  -  3.4M  0.31  0.5426  2.2156  0.2064  0.1996  0.3763
FKSN  0.25  -  3.1M  0.28  0.5492  3.4902  0.3151  0.3168  0.5109
BNN-E  -  -  22.4M  2.00  0.2714  3.0345  0.0667  0.1519  0.3081
BNN-P  -  -  22.5M  2.00  0.4315  2.2201  0.1199  0.1216  0.1898
Table 4: Model performance on the CIFAR100 dataset (Krizhevsky et al., 2009).
Figure 3: Model accuracy vs. number of parameters on MNIST, FMNIST, CIFAR10 and CIFAR100 classification tasks. Compared with BNNs, KSN-E and KSN-P methods consistently achieve high classification accuracy using significantly fewer parameters.

3 Experiments

The aim of the KSN is to directly optimize a set of $\mu$ and $\rho$ parameters for a weight distribution without the 2-fold increase in the number of parameters required by conventional BNNs (Blundell et al., 2015; Shridhar et al., 2019). Thus, to validate the effectiveness of the KSN, we evaluate the relationship between model size and classification accuracy. Specifically, in Figure 3 we observe how changes in the number of model parameters affect model performance on the MNIST, FMNIST, CIFAR10 and CIFAR100 classification tasks. Furthermore, in Tables 1, 2, 3, and 4, we evaluate how the number of parameters (Params) and the model size relative to a ResNet18 baseline (RS) affect the accuracy (ACC), Negative Log-Likelihood (NLL), Expected Calibration Error (ECE), Average Calibration Error (ACE) and Maximum Calibration Error (MCE) when performing variational inference. A well-calibrated model should have low NLL, ECE, ACE and MCE rates on unseen test data (Osawa et al., 2019). Hence, these metrics provide a means of comparing each method's ability to produce well-calibrated models with Bayesian inference.
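
For reference, the sketch below shows one common way to compute ECE by binning predictions by confidence; the number of bins and the handling of the lowest bin edge are our assumptions rather than details taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, accuracies, n_bins=15):
    """ECE sketch: bin predictions by confidence and take the sample-weighted
    average of |accuracy - confidence| per bin. ACE would average the gaps
    over non-empty bins; MCE would take their maximum."""
    confidences = np.asarray(confidences, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)   # 1.0 if correct else 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(accuracies[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in the bin
    return ece
```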

3.1 Method Comparison

The details of each of the methods used in our experiments are described below. For the simple MNIST and FMNIST classification tasks, accuracy was evaluated after training for 5 epochs. For the more complex CIFAR10 and CIFAR100 classification tasks, accuracy was evaluated after training for 100 epochs.

Baseline Fixed-Point Estimate. All models in our experiments use a ResNet18 backbone (He et al., 2016). In order to observe the effects of inference using distributions of model weights, we require a baseline fixed point estimate for comparison. Thus, we establish a performance benchmark by modelling fixed point estimates of model weights for each classification task with ResNet18.

Monte-Carlo Dropout. We implement the MC-Drop method as introduced by Gal and Ghahramani (2016). Following the work of Osawa et al. (2019), we configure MC-Drop with a dropout rate (DR) of 0.1. In line with the other methods used in our study, we use a ResNet18 backbone (He et al., 2016) and also report the posterior (Baseline with DR = 0.1), where we perform static inference on a model trained with dropout.

Variational Online Gauss-Newton. To compare BNN optimization with variational natural-gradient optimization, we implement the VOGN optimization protocol introduced by Khan et al. (2018). Specifically, we train a ResNet18 backbone (He et al., 2016) with Vadam optimization (Khan et al., 2018) and randomly sample 10 weight perturbations to compute the variational gradients during training and to perform variational inference.

Bayesian Neural Networks. The BNNs used in these experiments are constructed by essentially cloning the ResNet18 backbone such that one set of parameters stores the $\mu$ and the other stores the $\rho$ of the distribution of model weights. During training, the BNNs are optimized using Bayes by Backprop (BBB) (Blundell et al., 2015). The original implementation of BBB could only be applied to linear transformation layers; however, recent work by Shridhar et al. (2019) provides a generalization of BBB that can be applied to convolutional layers. During inference, we sample the posterior mean of the learned distribution (BNN-P), as well as an ensemble of 10 randomly sampled network weights (BNN-E).
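
The two inference modes used for BNNs (and for KSNs below) can be sketched as follows; `sample_forward` is a hypothetical callable that performs one forward pass with freshly sampled weights.

```python
import torch

@torch.no_grad()
def ensemble_predict(sample_forward, x, n_samples=10):
    """Ensemble inference (BNN-E / KSN-E): average the softmax outputs of
    n_samples forward passes, each with freshly sampled weights.
    Posterior-mean inference (BNN-P / KSN-P) would instead run a single
    pass with w = mu."""
    probs = torch.stack([torch.softmax(sample_forward(x), dim=-1)
                         for _ in range(n_samples)])
    return probs.mean(dim=0)
```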

Fixed-Point Kernel Seed Networks. While the Baseline ResNet18 method is useful for studying the effects of BDL on conventional DNNs, the KSN method requires its own fixed-point implementation for comparison. This is because the process of reconstructing a layer from a kernel seed matrix is unique to KSNs: when comparing the performance of KSNs to the Baseline ResNet18 method, we cannot distinguish whether differences in model performance are due to the effects of BDL on KSNs or to the fundamental differences between DNN and KSN architectures. For this reason, we also evaluate a Fixed-Point Kernel Seed Network (FKSN). The FKSN is similar to the Variational Kernel Seed Network described in Section 2; however, it does not model distributions of weights, but rather simply reconstructs the $\mu$ weights for each layer. Specifically, $G_{C_{\mu}}$ germinates $\psi_{C}$ directly to $K_{C}$ for convolutional layers, and $G_{FC_{\mu}}$ germinates $\psi_{FC}$ directly to $K_{FC}$ for linear transformation layers, without variational sampling. This enables us to observe the effects of applying BDL to KSNs, as FKSNs are a fixed-point implementation of our KSN model.

Variational Kernel Seed Networks. We implement the Variational Kernel Seed Network as described in Section 2 to germinate a neural network with a ResNet18 architecture (He et al., 2016). As with BNNs, we evaluate inference with the posterior mean of the learned distribution (KSN-P), as well as with an ensemble of 10 randomly sampled network weights (KSN-E).

4 Discussion

The results in Figure 3 demonstrate how KSNs can approximate a distribution of network weights without a 2-fold increase in the number of parameters. From the results tabulated in Section 3, our KSN model achieves a higher classification accuracy than the conventional BNN in all classification tasks. Most notably, in the CIFAR10 (Table 3) and CIFAR100 (Table 4) classification tasks, a $\delta$ of 0.25 was sufficient to outperform the standard BNN. This means that the proposed KSN can achieve higher classification accuracy than BNNs with only 15% of the required parameters. Moreover, KSNs consistently produce better-calibrated models overall compared to BNNs, as evidenced by the lower NLL, ECE, ACE and MCE values in Tables 1, 2, 3, and 4.

The VOGN method could also outperform the standard BNNs with significantly fewer parameters. However, VOGN often failed to match the performance of MC-Drop and our proposed KSN method, particularly on the FMNIST (Table 2), CIFAR10 (Table 3), and CIFAR100 (Table 4) classification tasks.

Overall, MC-Drop consistently achieves high classification accuracy and can produce well calibrated models. The performance improvements with MC-Drop are most prominent on CIFAR10 (Table 3), and CIFAR100 (Table 4) classification tasks, where the other methods, including the proposed KSN method, suffer significant performance degradation compared to the baseline. However, since the MC-Drop method is unable to quantify uncertainty over the entire weight space, the benefits of MC-Drop for performing variational inference are limited to certain tasks (Section 1).

5 Conclusions

The proposed KSN method can significantly reduce the number of parameters required for BDL (compared to conventional BNNs) without compromising on performance. While conventional BNNs require a 2-fold increase in the number of parameters compared to their DNN equivalents, BDL can be applied using KSNs with up to 39% fewer parameters than the equivalent DNN. This enables BDL to scale to the depth of modern DNN architectures.


References

  • Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, page 1613–1622. JMLR.org, 2015.
  • Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256, 2010.
  • Graves (2011) Alex Graves. Practical variational inference for neural networks. In Advances in neural information processing systems, pages 2348–2356. Citeseer, 2011.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1321–1330. JMLR. org, 2017.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Jiang et al. (2012) Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274, 2012.
  • Kendall and Gal (2017) Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584, 2017.
  • Khan et al. (2018) Mohammad Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and scalable bayesian deep learning by weight-perturbation in adam. In International Conference on Machine Learning, pages 2611–2620. PMLR, 2018.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pages 6402–6413, 2017.
  • LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. Mnist handwritten digit database. 2010.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Nguyen et al. (2017) Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv preprint arXiv:1710.10628, 2017.
  • Osawa et al. (2019) Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E Turner, Rio Yokota, and Mohammad Emtiyaz Khan. Practical deep learning with bayesian principles. arXiv preprint arXiv:1906.02506, 2019.
  • Shridhar et al. (2019) Kumar Shridhar, Felix Laumann, and Marcus Liwicki. A comprehensive guide to bayesian convolutional neural network with variational inference. arXiv preprint arXiv:1901.02731, 2019.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • Tseran et al. (2018) Hanna Tseran, Mohammad Emtiyaz Khan, Tatsuya Harada, and Thang D Bui. Natural variational continual learning. In NeurIPS Workshop on Continual Learning, 2018.
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.