B-SCST: Bayesian Self-Critical Sequence Training for Image Captioning
Abstract
Bayesian deep neural networks (DNNs) can provide a mathematically grounded framework to quantify uncertainty in predictions from image captioning models. We propose a Bayesian variant of the policy-gradient based reinforcement learning training technique for image captioning models to directly optimize non-differentiable image captioning quality metrics such as CIDEr-D. We extend the well-known Self-Critical Sequence Training (SCST) approach for image captioning models by incorporating Bayesian inference, and refer to it as B-SCST. The “baseline” for the policy gradients in B-SCST is generated by averaging the predictive quality metric (CIDEr-D) of the captions drawn from the distribution obtained using a Bayesian DNN model. We infer this predictive distribution using Monte Carlo (MC) dropout approximate variational inference. We show that B-SCST improves CIDEr-D scores on the Flickr30k, MS COCO and VizWiz image captioning datasets, compared to the SCST approach. We also provide a study of uncertainty quantification for the predicted captions, and demonstrate that it correlates well with the CIDEr-D scores. To our knowledge, this is the first such analysis, and it can improve the interpretability of image captioning model outputs, which is critical for practical applications.
1 Introduction
Deep neural network (DNN) based image captioning approaches generate natural language descriptions of an image by transforming the image features into a sequence of output words from a predefined vocabulary. State-of-the-art image captioning models [1, 2, 3, 4, 5] use an encoder-decoder architecture [6, 7, 5, 1], and follow a two-step training process. In the first step, a cross-entropy loss is optimized to generate a caption with words in the same order as the ground-truth caption. In the second step, policy-gradient based reinforcement learning (RL) [8] is used to minimize the negative expected value of the generated caption quality metric scores (typically CIDEr-D) [9, 10, 11, 12].
Several recent works have shown that using a bias correction, i.e., a learned “baseline”, to normalize the RL rewards reduces the variance of the policy gradients and is effective during training. In Self-Critical Sequence Training (SCST) [8], the CIDEr-D metric is directly optimized. The model chooses the word with the highest SoftMax probability at each timestep to generate a greedy caption, i.e., the caption produced by the inference algorithm. The CIDEr-D score of this greedy caption is used as the “baseline”: the probability of generating captions that score higher than the “baseline” is increased, while the probability of captions that score lower is decreased.

Image captioning DNN models can generate incorrect descriptions of a given image, hence it is important to study the inherent ambiguity or uncertainty estimates of the generated captions. Bayesian DNNs [13, 14] provide a principled way to gain insight into the data and capture reliable uncertainty estimates, leading to interpretable models. In this work, we use MC dropout approximate inference [15] to study the correlation between uncertainty estimates and the CIDEr-D scores of the generated captions. This analysis is critical to enable practical image captioning applications that require interpretability of the model outputs.
In Figure 1 (a), we show two sample images from the Flickr30k [16] dataset along with a few of their ground-truth captions. We also show the greedy captions generated from a trained model using MC dropout forward passes, and their corresponding CIDEr-D scores. During training, the SCST approach uses the CIDEr-D score of a single greedy caption, obtained from the standard DNN inference algorithm, as the “baseline”. We instead propose a “baseline” obtained using a Bayesian DNN model based on MC dropout approximate inference, which infers the distribution of predicted captions induced by the distribution over model parameters. In Figure 1 (b), we show the distribution of CIDEr-D scores for 50 randomly selected images from the Flickr30k dataset, obtained using MC dropout forward passes. For each image, we plot the range of CIDEr-D scores around their predictive mean (shown in red). The predictive mean CIDEr-D score of the captions sampled from this distribution is a better representation of the “baseline”. We refer to this approach as Bayesian SCST (B-SCST).
In summary, our main contributions in this work are:
- We propose B-SCST, a Bayesian variant of the SCST approach, and demonstrate that it improves the caption quality scores compared to the SCST approach.
- We present uncertainty quantification of the model-generated image captions and demonstrate a good correlation between CIDEr-D scores and uncertainty. To our knowledge, this is the first work that provides this kind of Bayesian analysis for image captioning, which can improve the interpretability of generated captions.
2 Related work
2.1 Attention Mechanism
State-of-the-art image captioning DNNs use an attention mechanism [5] so that the encoder and decoder in the model attend to appropriate features in the image to generate the words in the caption. A bottom-up attention mechanism (Up-Down) was proposed in [1], where the model attends to Faster R-CNN [17] proposals from the image. The Attention-on-Attention network (AoANet) [3] uses an extra learned attention on top of self-attention [18] to avoid attentions that are irrelevant to the decoder. In this work, we demonstrate our approach on the AoANet model architecture, although it can be applied to other architectures that benefit from SCST.
2.2 Training techniques
Image captioning DNNs are usually trained with a word-level cross-entropy loss between the ground truth and the model-generated caption. SCST [8] uses policy-gradient based RL to directly optimize the non-differentiable natural language processing (NLP) metrics that are used for caption evaluation. Specifically, it uses the score of the caption obtained by the model with its inference algorithm as the “baseline” to reduce the variance of the gradients. A few variants of SCST have been used to improve the image captioning quality metrics. One variant [1] performs beam search and restricts the search space to only the top-k captions in the decoded beam. Another variant [2] similarly restricts the search space to only the top-k captions in the decoded beam, while also using the mean CIDEr-D score of these top-k captions as the “baseline”. We discuss the differences between our approach and these works in Section 3.
2.3 Bayesian approaches
Standard DNNs do not capture uncertainty estimates [13] associated with the data and the model parameters. SoftMax probabilities obtained from DNNs can often provide overly confident results for incorrect predictions. Hence, Bayesian DNNs have been proposed to capture data and model uncertainties [13], resulting in more robust models. In [15], dropout training in DNNs is cast as approximate Bayesian inference in deep Gaussian processes. This work has been extensively used in many applications [19, 20] to model uncertainty estimates. Among image captioning related works, an LSTM trained using Bayesian back-propagation was proposed in [21] to improve the perplexity of image captioning results. Uncertainty measures were explored in [22] to improve the caption embedding and retrieval task. In [23], MC dropout was used along with explicit outputs that predict the model uncertainty. Our approach is different from these works. We focus on improving image captioning metrics by using MC dropout to cast a state-of-the-art model as a Bayesian DNN, without making any changes to the model architecture. We also perform uncertainty quantification of the generated captions to improve their interpretability, which to our knowledge has not been done in earlier works.

3 Bayesian Self-Critical Sequence Training
In this section, we present details of our approach after discussing the relevant background.
3.1 Background
3.1.1 Model Architecture
We use the AoANet architecture [3] for our experiments since it recently provided state-of-the-art results. AoANet is based on an encoder-decoder architecture with an attention mechanism. Given an image $I$, Faster R-CNN is used to extract feature vectors $A$. The AoANet encoder, which includes AoA modules, generates re-weighted feature vectors $A'$. At each timestep $t$ (i.e., word in the caption), the AoANet decoder, which includes an Attention LSTM and its own AoA modules, uses $A'$ and the previous word in the caption to generate the hidden state $h_t$ and context vector $c_t$. The context vector $c_t$ is used to compute the conditional probabilities $p_\theta(w_t \mid w_{1:t-1}, I)$ of the words in the word vocabulary $V$, where $\theta$ are the AoANet model parameters. Additional AoANet architecture details can be found in [3].
To simplify notation in the rest of this paper, we denote the model-generated caption as $w = (w_1, \ldots, w_T)$, and the conditional probability vector of its words as $\big(p_\theta(w_1), \ldots, p_\theta(w_T)\big)$, where $T$ is the maximum length of the caption.
3.1.2 Bayesian Inference
In Bayesian DNNs, model parameters $\omega$ are treated as random variables with a prior distribution $p(\omega)$, instead of point estimates. Given input-output pairs $(X, Y)$ and the model likelihood $p(Y \mid X, \omega)$, Bayes rule can be used to obtain the posterior distribution of the model parameters, $p(\omega \mid X, Y)$:
$$p(\omega \mid X, Y) = \frac{p(Y \mid X, \omega)\, p(\omega)}{p(Y \mid X)} \qquad (1)$$
Since computing this posterior distribution is often intractable, Bayesian approximate inference techniques [24, 25, 13, 15] are used to infer a tractable approximate posterior distribution $q(\omega)$. Given a new input $x^*$, the predictive distribution of the output $y^*$ during the inference phase is obtained using multiple stochastic forward passes through the model, while sampling from the approximate posterior distribution of model parameters using Monte Carlo estimators:
$$p(y^* \mid x^*, X, Y) \approx \frac{1}{K} \sum_{k=1}^{K} p(y^* \mid x^*, \hat{\omega}_k), \qquad \hat{\omega}_k \sim q(\omega) \qquad (2)$$
where $K$ is the number of Monte Carlo samples.
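As a minimal sketch of this Monte Carlo estimator, assuming a PyTorch model whose dropout layers can be kept active at test time, the predictive distribution of Equation 2 can be approximated by averaging SoftMax outputs over repeated stochastic forward passes; the `mc_dropout_predictive` helper and `num_mc_samples` argument below are illustrative names, not part of the AoANet code.

```python
import torch
import torch.nn as nn

def mc_dropout_predictive(model: nn.Module, x: torch.Tensor, num_mc_samples: int = 10) -> torch.Tensor:
    """Approximate the predictive distribution of Eq. (2) with MC dropout.

    Dropout layers are kept in training mode so that each forward pass
    samples a different dropout mask, i.e., a different weight sample
    from the approximate posterior q(w).
    """
    # Put only the dropout layers in training mode; other layers (e.g., batch norm) stay in eval mode.
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d)):
            m.train()

    probs = []
    with torch.no_grad():
        for _ in range(num_mc_samples):
            logits = model(x)                          # one stochastic forward pass
            probs.append(torch.softmax(logits, dim=-1))
    # Predictive distribution: Monte Carlo average over the K passes.
    return torch.stack(probs, dim=0).mean(dim=0)
```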
In this work, we use MC dropout approximate inference to obtain the “baseline” for the RL step during the training phase, and also to perform uncertainty quantification analysis during the inference phase. More details are presented in the next two sections.
3.2 Our Approach
3.2.1 Training
We train our image captioning model using the popular two-step approach. In the first step, we minimize a word-level cross-entropy loss function [1, 3]. In the second step, we optimize CIDEr-D directly, and refer to this step as CIDEr-D optimization. In this work, we use our proposed B-SCST for CIDEr-D optimization. We first briefly describe the well-known SCST approach [8], which works as follows. The decoder (agent) interacts with the image features and the current word in the caption (environment) using the model parameters $\theta$ (policy $p_\theta$) to generate the next word (action) in the caption. After the complete caption is generated, the CIDEr-D score (reward) is calculated using the ground-truth sentences. The goal of CIDEr-D optimization is to minimize the negative expected CIDEr-D reward $L(\theta)$:
$$L(\theta) = -\,\mathbb{E}_{w^s \sim p_\theta}\!\left[ r(w^s) \right] \qquad (3)$$

where $r(\cdot)$ denotes the CIDEr-D score of a caption and $w^s$ is a caption sampled from the policy $p_\theta$.
The gradient of this loss is approximated [8] as:
$$\nabla_\theta L(\theta) \approx -\left( r(w^s) - r(\hat{w}) \right) \nabla_\theta \log p_\theta(w^s) \qquad (4)$$
Here, $w^s$ is the sampled caption, generated by sampling words from the decoder's output SoftMax probability distribution over the word vocabulary at each timestep, and $\hat{w}$ is the greedy caption, generated using the inference algorithm, i.e., by choosing the word with the highest SoftMax probability at each timestep. The SCST approach uses the reward $r(\hat{w})$ of the greedy caption as the “baseline” to normalize the reward of the sampled caption [8], which reduces the variance of the gradient. This gradient formulation increases the probability of captions with CIDEr-D scores higher than that of the caption generated by the current model at inference phase, and decreases the probability of captions with lower CIDEr-D scores.
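For concreteness, a hedged sketch of the SCST surrogate loss corresponding to Equation 4 is shown below; `cider_d`, `sample_caption`, and `greedy_caption` are hypothetical helpers (not from the released SCST code) assumed to return a caption's CIDEr-D score, a sampled caption together with the sum of its per-word log-probabilities, and the greedy caption, respectively.

```python
import torch

def scst_loss(model, image_feats, ground_truths, cider_d, sample_caption, greedy_caption):
    """SCST surrogate loss: -(r(w^s) - r(w_hat)) * log p_theta(w^s), as in Eq. (4)."""
    # Sampled caption and the (differentiable) sum of log-probabilities of its words.
    sampled_words, sampled_logprob = sample_caption(model, image_feats)
    # Greedy caption from the inference algorithm (argmax at every timestep).
    with torch.no_grad():
        greedy_words = greedy_caption(model, image_feats)

    reward = cider_d(sampled_words, ground_truths)    # r(w^s)
    baseline = cider_d(greedy_words, ground_truths)   # r(w_hat), the SCST "baseline"

    advantage = reward - baseline                     # positive if sampling beat greedy decoding
    # Minimizing this loss increases log p(w^s) when advantage > 0, and decreases it otherwise.
    return -(advantage * sampled_logprob)
```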
The choice of “baseline” is important here, and using the CIDEr-D score of a single greedy caption as the “baseline” may be undesirable when there is uncertainty in the model predictions. We therefore propose to use Bayesian inference to estimate the “baseline”, and refer to this approach as B-SCST. In B-SCST, we use MC dropout approximate inference, i.e., we run multiple MC dropout forward passes through the model to infer the distribution of captions under the approximate posterior over model parameters, and estimate their predictive mean CIDEr-D score (an example is shown in Figure 1 (b)). The dropout layers are modeled using a Bernoulli distribution [26] with the dropout rate as its parameter. We use this predictive mean CIDEr-D score, which accounts for the uncertainty, as the “baseline” during CIDEr-D optimization.
During each MC dropout forward pass of the training phase, we sample words from the decoder's output SoftMax probability distribution in order to generate a caption. We do not choose the word with the highest SoftMax probability here; instead we sample from the SoftMax distribution, in order to allow the model to explore a larger search space. We observe improved results with this sampling approach, as shown later in our ablation study in Section 4.5. The “baseline” and the gradient of the loss in our proposed approach are obtained by modifying Equation 4 as:
$$\nabla_\theta L(\theta) \approx -\,\frac{1}{K} \sum_{k=1}^{K} \left( r(w^{s_k}) - \frac{1}{K} \sum_{j=1}^{K} r(w^{s_j}) \right) \nabla_\theta \log p_\theta(w^{s_k}) \qquad (5)$$
where $K$ is the total number of MC dropout forward passes, and $w^{s_k}$ is the sampled caption generated during the $k$-th forward pass. We illustrate the B-SCST approach in Figure 2. While performing MC dropout during the training phase, we enable dropout in both the encoder and the decoder in order to capture the model uncertainty. We allow different dropout masks across the timesteps of the decoder, since we observed better captioning metric results compared to using the same dropout mask across all decoder timesteps as proposed in [27].
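Under the same hypothetical helpers as the SCST sketch above, the B-SCST update of Equation 5 can be sketched as follows; the key differences are that dropout stays enabled for the $K$ forward passes and that the “baseline” is the mean CIDEr-D score of the $K$ sampled captions rather than the greedy caption's score.

```python
import torch

def b_scst_loss(model, image_feats, ground_truths, cider_d, sample_caption, num_mc_passes=5):
    """B-SCST surrogate loss (Eq. 5): the baseline is the predictive mean
    CIDEr-D score over K MC dropout forward passes."""
    model.train()                       # keep dropout active: each pass uses a new dropout mask
    rewards, logprobs = [], []
    for _ in range(num_mc_passes):
        # Words are sampled from the SoftMax distribution (not argmax) to widen the search space.
        words, logprob = sample_caption(model, image_feats)
        rewards.append(cider_d(words, ground_truths))
        logprobs.append(logprob)

    rewards = torch.tensor(rewards)
    baseline = rewards.mean()           # Bayesian "baseline": predictive mean CIDEr-D score
    advantages = rewards - baseline

    # Average the per-pass REINFORCE terms, as in Eq. (5).
    return -(advantages * torch.stack(logprobs)).mean()
```

A practical consequence of this formulation is that no separate greedy decoding pass is needed: the same $K$ sampled captions provide both the rewards and their “baseline”.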
We want to point out that our approach is different from some of the other works [1, 2]. In [1], beam search is used to restrict the search space to the top-k captions in terms of SoftMax probability, and the greedy caption is used to estimate the “baseline”, similar to SCST. Similarly, in [2], beam search is used to select the top-5 captions in terms of SoftMax probability, but the mean of the five caption scores in the decoded beam is used as the “baseline”. We do not use beam search during the training phase, and instead use Bayesian inference by performing multiple stochastic MC dropout forward passes through the model. The captions sampled from beam search would all share the same dropout mask, which is not the case with our approach. We also do not restrict the search space to the top-k candidates, as mentioned earlier.
3.2.2 Uncertainty Quantification
Bayesian modeling allows capturing estimates of both aleatoric uncertainty, i.e., noise inherent in the input observations, and epistemic uncertainty, i.e., uncertainty related to the model parameters [14, 28]. Predictive entropy is a measure of both aleatoric and epistemic uncertainties, whereas the mutual information (MI) between the model parameters (posterior distribution) and the data (predictive distribution) is a measure of epistemic uncertainty [28].
In order to estimate uncertainty and perform Bayesian analysis during the inference phase, we use MC dropout approximate inference by enabling dropout in the final fully connected layer, and greedily generate the caption with the highest SoftMax probability. For a given image, we perform $K$ MC dropout forward passes to generate $K$ captions. For each word $w_t$ (i.e., timestep $t$) in the caption of length $T$ that is generated during the $k$-th MC dropout forward pass, we obtain the SoftMax probability of each class $c$ in the word vocabulary $V$, denoted as $p_k(w_t = c)$. We calculate the entropy $H_k$ of the caption generated during the $k$-th MC dropout forward pass, and the mean entropy $\bar{H}$ of all the captions, using:
$$H_k = -\,\frac{1}{T} \sum_{t=1}^{T} \sum_{c \in V} p_k(w_t = c) \log p_k(w_t = c), \qquad \bar{H} = \frac{1}{K} \sum_{k=1}^{K} H_k \qquad (6)$$
We calculate the predictive entropy of these MC dropout captions using:
$$\hat{H} = -\,\frac{1}{T} \sum_{t=1}^{T} \sum_{c \in V} \bar{p}(w_t = c) \log \bar{p}(w_t = c), \qquad \bar{p}(w_t = c) = \frac{1}{K} \sum_{k=1}^{K} p_k(w_t = c) \qquad (7)$$
Mutual information (MI) is then given by the difference between the predictive entropy and the mean entropy, $\mathrm{MI} = \hat{H} - \bar{H}$.
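A possible implementation of these uncertainty measures is sketched below, assuming the per-timestep SoftMax probabilities of the $K$ MC dropout captions are stacked into a single tensor (which in practice requires padding or truncating the captions to a common length $T$); this follows the standard MC dropout formulation [28] and is not the authors' released code.

```python
import torch

def caption_uncertainty(probs: torch.Tensor, eps: float = 1e-12):
    """Uncertainty measures from MC dropout SoftMax outputs.

    probs: tensor of shape (K, T, V) -- K MC passes, T timesteps, V vocabulary size.
    Returns (predictive_entropy, mutual_information).
    """
    # Mean entropy over the K captions (Eq. 6): average per-word entropy of each pass.
    per_pass_entropy = -(probs * (probs + eps).log()).sum(dim=-1)   # shape (K, T)
    mean_entropy = per_pass_entropy.mean()                          # average over K passes and T words

    # Predictive entropy (Eq. 7): entropy of the MC-averaged predictive distribution.
    mean_probs = probs.mean(dim=0)                                  # shape (T, V)
    predictive_entropy = -(mean_probs * (mean_probs + eps).log()).sum(dim=-1).mean()

    # Mutual information between predictions and model parameters (epistemic uncertainty).
    mutual_information = predictive_entropy - mean_entropy
    return predictive_entropy, mutual_information
```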
Table 1: SCST vs. B-SCST image captioning metrics after cross-entropy loss training (columns 4–9) and after CIDEr-D optimization (columns 10–15). B@n = BLEU-n, M = METEOR, R = Rouge-L, C = CIDEr-D, S = SPICE.

| Dataset | Split | Model | B@1 | B@4 | M | R | C | S | B@1 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flickr30k | Test | SCST | 69.6 | 28.0 | 22.2 | 48.8 | 58.5 | 16.4 | 72.2 | 30.0 | 22.1 | 50.0 | 64.6 | 16.3 |
| Flickr30k | Test | B-SCST | 69.6 | 28.0 | 22.2 | 48.8 | 58.5 | 16.4 | 71.9 | 29.6 | 22.6 | 50.2 | 66.9 | 16.7 |
| Flickr30k | Val | SCST | 69.5 | 27.8 | 22.1 | 49.1 | 59.9 | 16.0 | 72.6 | 29.9 | 22.0 | 50.3 | 64.5 | 15.7 |
| Flickr30k | Val | B-SCST | 69.5 | 27.8 | 22.1 | 49.1 | 59.9 | 16.0 | 72.4 | 29.1 | 22.5 | 50.3 | 67.0 | 16.2 |
| MS COCO | Test | SCST* [3] | 77.4 | 37.2 | 28.4 | 57.5 | 119.8 | 21.3 | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 |
| MS COCO | Test | SCST | 77.3 | 36.9 | 28.5 | 57.3 | 118.4 | 21.7 | 80.5 | 39.1 | 29.0 | 58.9 | 128.9 | 22.7 |
| MS COCO | Test | B-SCST | 77.3 | 36.9 | 28.5 | 57.3 | 118.4 | 21.7 | 80.8 | 39.0 | 29.2 | 59.0 | 131.0 | 22.9 |
| MS COCO | Val | SCST | 77.3 | 37.3 | 28.3 | 57.4 | 117.4 | 21.4 | 80.4 | 39.1 | 28.9 | 58.9 | 127.7 | 22.5 |
| MS COCO | Val | B-SCST | 77.3 | 37.3 | 28.3 | 57.4 | 117.4 | 21.4 | 80.8 | 39.0 | 29.0 | 58.9 | 129.4 | 22.7 |
| VizWiz | Test | SCST* [29] | - | - | - | - | - | - | 66.0 | 23.7 | 20.1 | 46.8 | 60.9 | 15.3 |
| VizWiz | Test | B-SCST | 64.7 | 22.7 | 19.4 | 45.0 | 59.0 | 14.7 | 66.3 | 24.0 | 20.3 | 46.9 | 63.7 | 15.7 |
4 Experiments
4.1 Datasets
We present the image captioning results on Flickr30k [16], MS COCO [30] and VizWiz [29] image captioning datasets. We compare the standard image captioning evaluation metrics, including BLEU [9], METEOR [10], Rouge-L [31], CIDEr-D [11] and SPICE [12] scores.
Flickr30k: We use the Flickr30k data splits from [7], which contain 31014 training, 1014 validation and 1000 test images, each of which has 5 ground-truth captions as labels. All captions are converted to lower case [3, 32], and only the words occurring at least 5 times are used to build a vocabulary of size 7000. We use the bottom-up image features from [4] as input to our encoder stage.
MS COCO: We use the MS COCO data splits from [7], which contain 113287 training, 5000 validation and 5000 test images, each with 5 ground-truth captions as labels. We perform text preprocessing similar to the Flickr30k dataset, but use only the words occurring at least 4 times in order to build a vocabulary of size 10369 that matches the AoANet [3] vocabulary. We use the bottom-up image features from [1] that were generated using a Faster R-CNN model pretrained on the ImageNet and Visual Genome datasets.
VizWiz: We use the VizWiz data splits from [29], which contain 22866 training, 7542 validation and 8000 test images, each having up to 5 ground-truth captions as labels. Following the approach in [29], we combine the training and validation images to obtain 30408 images for training, and use the VizWiz 2020 challenge leaderboard [33] to evaluate performance on the test set. We perform text preprocessing similar to the MS COCO dataset to build a vocabulary of size 7279 that matches the baseline [29] provided by the challenge organizers.
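A simplified sketch of the vocabulary construction described above is given below; the actual tokenization and special tokens used in [3, 29] may differ.

```python
from collections import Counter

def build_vocab(captions, min_count):
    """Lowercase the captions and keep only words occurring at least `min_count` times."""
    counts = Counter(w for caption in captions for w in caption.lower().split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    # Reserve index 0 for an unknown-word token; real pipelines also add start/end tokens.
    return {w: i + 1 for i, w in enumerate(words)}
```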
4.2 Training and Inference
We train the model using a minibatch size of 10 images and the ADAM [34] optimizer. We first run 25 epochs of cross-entropy loss training, with an optimizer learning rate of 2e-4 and a decay factor of 0.8 every 3 epochs. The scheduled sampling [35] probability is increased at a rate of 0.05 every 5 epochs, and label smoothing [36] is applied. Since each image contains 5 ground-truth labels, we replicate each image feature 5 times and pass it through the model to calculate the cross-entropy loss for each caption. We then run 30 epochs of CIDEr-D optimization with an optimizer learning rate of 2e-5 and a reduce-on-plateau factor of 0.5 when CIDEr-D degrades for more than one epoch. For the VizWiz dataset, we run only 25 epochs of CIDEr-D optimization with a starting learning rate of 1e-5 to avoid overfitting. During inference, we use beam search and pick the caption with the highest SoftMax probability. We use a beam size of 2 for a fair comparison of our results with the published AoANet and VizWiz results. During B-SCST training, we use $K$=5 MC dropout forward passes for the VizWiz dataset and $K$=10 for the MS COCO and Flickr30k datasets. During inference for uncertainty quantification (Section 4.4), we use $K$=30 MC forward passes with no beam search.
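A minimal sketch of these optimizer and learning-rate schedules, assuming standard PyTorch schedulers and a stand-in model, is shown below; the exact scheduler implementation used in our experiments may differ.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau

model = nn.Linear(8, 8)  # stand-in for the captioning model

# Cross-entropy stage: lr 2e-4, decayed by a factor of 0.8 every 3 epochs.
xe_optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
xe_scheduler = StepLR(xe_optimizer, step_size=3, gamma=0.8)

# CIDEr-D optimization stage: lr 2e-5, halved when the validation CIDEr-D
# score stops improving for more than one epoch.
rl_optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
rl_scheduler = ReduceLROnPlateau(rl_optimizer, mode='max', factor=0.5, patience=1)

# After each CIDEr-D optimization epoch, call: rl_scheduler.step(val_cider_d)
```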
[Figure 3: Predictive entropy (first column) and mutual information (second column) versus predictive mean CIDEr-D quantiles, and mean SoftMax probability per word versus CIDEr-D (last column), for SCST and B-SCST on the Flickr30k, MS COCO and VizWiz datasets.]
4.3 Image Captioning Results
In Table 1, we present a comparison of the SCST and B-SCST approaches on the three datasets considered in our experiments (Section 4.1). The SCST and B-SCST approaches for CIDEr-D optimization start from the same checkpoint obtained from cross-entropy loss training. For the MS COCO [3] and VizWiz [29] datasets, the presented SCST* results are taken directly from the published papers. For the MS COCO dataset, the SCST results were obtained using the model checkpoints provided by the authors of [3]; we observe these to be slightly lower than the published results. The results in Table 1 show that B-SCST consistently improves CIDEr-D scores, along with most other image captioning metrics, on all three datasets.
4.4 Uncertainty Quantification Results
During the inference phase, we perform MC dropout approximate inference (Section 3.2.2) to obtain uncertainty estimates for the captions generated using the SCST and B-SCST approaches on the three datasets considered in our experiments (Section 4.1). In the first two columns of Figure 3, we plot the uncertainty estimates, i.e., predictive entropy and mutual information (MI) (Section 3.2.2), against the predictive mean CIDEr-D scores across MC dropout forward passes. We map the CIDEr-D scores into five quantiles and plot the average uncertainty score for each quantile. We observe that lower CIDEr-D scores indicate higher uncertainty in the predictions, whereas higher CIDEr-D scores indicate lower uncertainty. Both uncertainty measures show good correlation with the CIDEr-D scores, which is critical for the interpretability of the captions generated by the model. We also observe that the uncertainty estimates decrease with the B-SCST approach as compared to SCST for every CIDEr-D quantile, indicating higher confidence in the generated captions. We notice an exception on the Flickr30k dataset, where MI is higher with the B-SCST approach as compared to SCST, indicating higher model uncertainty even though B-SCST results in a higher CIDEr-D score (Table 1). In practical applications, since the ground-truth captions of an image, and therefore its CIDEr-D score, are not available during the inference phase, these uncertainty measures can give an indication of the predictive confidence of the caption generated by the model.
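The quantile-binned averages plotted in Figure 3 can be reproduced along the lines of the NumPy sketch below, assuming the per-image predictive mean CIDEr-D scores and uncertainty values have already been computed; the function name is illustrative.

```python
import numpy as np

def uncertainty_per_cider_quantile(cider_scores, uncertainties, num_bins=5):
    """Average uncertainty in each CIDEr-D quantile bin (as in Figure 3)."""
    cider_scores = np.asarray(cider_scores)
    uncertainties = np.asarray(uncertainties)
    # Quantile edges split the images into equally populated bins.
    edges = np.quantile(cider_scores, np.linspace(0, 1, num_bins + 1))
    bins = np.clip(np.digitize(cider_scores, edges[1:-1]), 0, num_bins - 1)
    return np.array([uncertainties[bins == b].mean() for b in range(num_bins)])
```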
In the last column of Figure 3, we plot the mean SoftMax probability per word in the caption against the caption's CIDEr-D score using standard DNN inference, i.e., greedily choosing the word with the highest SoftMax probability at each timestep to generate the caption. We observe that the SoftMax probabilities stay roughly uniform across the different levels of CIDEr-D scores, further indicating that SoftMax probabilities can be overly confident and are not a good measure of the predictive confidence of the model. This uncertainty quantification analysis demonstrates that Bayesian approaches provide more robust predictive confidence scores than the SoftMax probabilities obtained from standard DNN image captioning models. This also justifies the use of a Bayesian “baseline” for the policy-gradient based RL in our B-SCST approach.
4.5 Ablation Studies
We perform ablation studies on the Flickr30k dataset to confirm the benefits of the B-SCST approach and the selection of hyper-parameters. We compare the captioning results on the Karpathy validation split with no beam search. In Table 2 (a), we change the model architecture from AoANet to Up-Down [1] and observe that the B-SCST approach improves the CIDEr-D score compared to the SCST approach. In Table 2 (b), we use different sampling mechanisms to generate the words in the sampled caption during B-SCST training (a code sketch of these mechanisms follows Table 2). These include sampling the word randomly from the word vocabulary (Random), choosing the word with the highest SoftMax probability (Top1), sampling the word from the k highest SoftMax probabilities (Topk, with k=5 or 10), and sampling the word from the SoftMax probability distribution (Distr.). We observe that the sampling approach we use in B-SCST (Distr.) gives the highest CIDEr-D score. In Table 2 (c), we vary the number of MC dropout forward passes performed during B-SCST training. We observe that increasing the number of passes improves the CIDEr-D score, with diminishing returns beyond 5 to 10 passes.
Table 2 (a): SCST vs. B-SCST with the Up-Down [1] architecture on the Flickr30k validation split, after cross-entropy loss training (columns 2–7) and after CIDEr-D optimization (columns 8–13).

| Model | B@1 | B@4 | M | R | C | S | B@1 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SCST | 69.1 | 27.2 | 21.8 | 49.0 | 57.3 | 15.6 | 72.9 | 29.5 | 21.7 | 49.9 | 59.0 | 15.3 |
| B-SCST | 69.1 | 27.2 | 21.8 | 49.0 | 57.3 | 15.6 | 72.3 | 29.4 | 21.8 | 49.9 | 61.1 | 15.2 |

Table 2 (b): Effect of the word sampling mechanism used during B-SCST training.

| Sampling | B@1 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|
| Random | 68.0 | 25.6 | 21.7 | 48.4 | 58.0 | 15.5 |
| Top1 | 69.3 | 27.3 | 21.6 | 48.5 | 61.6 | 15.3 |
| Top5 | 72.5 | 30.1 | 21.8 | 50.2 | 65.6 | 15.8 |
| Top10 | 72.6 | 29.9 | 21.6 | 50.3 | 64.4 | 15.4 |
| Distr. | 72.4 | 29.6 | 22.6 | 50.5 | 66.8 | 16.2 |

Table 2 (c): Effect of the number of MC dropout forward passes used during B-SCST training.

| #MC | B@1 | B@4 | M | R | C | S |
|---|---|---|---|---|---|---|
| 1 | 68.1 | 25.7 | 21.7 | 48.4 | 58.3 | 15.5 |
| 3 | 70.7 | 28.3 | 22.3 | 49.8 | 64.2 | 16.1 |
| 5 | 72.4 | 29.6 | 22.6 | 50.5 | 66.8 | 16.2 |
| 10 | 72.4 | 29.1 | 22.5 | 50.3 | 67.1 | 16.2 |
| 15 | 71.8 | 28.5 | 22.4 | 49.9 | 66.5 | 16.1 |
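The sampling mechanisms compared in Table 2 (b) can be sketched as follows for a single timestep, assuming a SoftMax probability vector over the vocabulary; the `sample_next_word` helper and `mechanism` argument are illustrative, not part of the AoANet code.

```python
import torch

def sample_next_word(probs: torch.Tensor, mechanism: str = "distr", k: int = 5) -> torch.Tensor:
    """Pick the next word id from a SoftMax probability vector `probs` of shape (V,)."""
    if mechanism == "random":                       # uniform over the vocabulary
        return torch.randint(probs.numel(), (1,))
    if mechanism == "top1":                         # greedy / argmax
        return probs.argmax().unsqueeze(0)
    if mechanism.startswith("top"):                 # sample among the k most likely words
        top_probs, top_ids = probs.topk(k)
        return top_ids[torch.multinomial(top_probs, 1)]
    return torch.multinomial(probs, 1)              # "distr": sample from the full distribution
```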
5 Conclusions
We presented B-SCST, a Bayesian variant of the SCST approach for image captioning models that directly optimizes the CIDEr-D metric. In B-SCST, we estimate the “baseline” for the policy gradients by averaging the CIDEr-D scores of captions sampled from the distribution inferred using a Bayesian DNN model, and we demonstrated improved CIDEr-D scores on the Flickr30k, MS COCO and VizWiz datasets compared to the SCST approach. B-SCST can be applied to other image captioning architectures that benefit from the SCST approach. We also performed an uncertainty quantification analysis of the captions generated using a state-of-the-art captioning model, and demonstrated that these uncertainties correlate well with the CIDEr-D scores, which can improve the interpretability of model-generated captions.
References
- [1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086, 2018.
- [2] Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. M2: Meshed-memory transformer for image captioning. arXiv preprint arXiv:1912.08226, 2019.
- [3] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In Proceedings of the IEEE International Conference on Computer Vision, pages 4634–4643, 2019.
- [4] Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J Corso, and Marcus Rohrbach. Grounded video description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6578–6587, 2019.
- [5] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
- [6] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
- [7] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
- [8] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024, 2017.
- [9] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
- [10] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380, 2014.
- [11] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
- [12] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer, 2016.
- [13] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
- [14] Yarin Gal. Uncertainty in deep learning. University of Cambridge, 2016.
- [15] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
- [16] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
- [17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
- [18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- [19] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115, 3, 2017.
- [20] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.
- [21] Meire Fortunato, Charles Blundell, and Oriol Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.
- [22] Kenta Hama, Takashi Matsubara, Kuniaki Uehara, and Jianfei Cai. Exploring uncertainty measures for image-caption embedding-and-retrieval task. arXiv preprint arXiv:1904.08504, 2019.
- [23] Yijun Xiao and William Yang Wang. Quantifying uncertainties in natural language processing tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7322–7329, 2019.
- [24] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
- [25] Alex Graves. Practical variational inference for neural networks. In Advances in neural information processing systems, pages 2348–2356, 2011.
- [26] Yarin Gal and Zoubin Ghahramani. Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158, 2015.
- [27] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027, 2016.
- [28] Lewis Smith and Yarin Gal. Understanding measures of uncertainty for adversarial example detection. arXiv preprint arXiv:1803.08533, 2018.
- [29] Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. arXiv preprint arXiv:2002.08565, 2020.
- [30] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
- [31] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
- [32] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Neural baby talk. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7219–7228, 2018.
- [33] 2020 VizWiz Grand Challenge Workshop. Accessed: 2020-06-01.
- [34] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [35] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
- [36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.