
Discrete Auto-regressive Variational Attention Models for Text Modeling

Xianghong Fang¹,², Haoli Bai¹,², Jian Li², Zenglin Xu³, Michael Lyu², Irwin King²
²Department of Computer Science and Engineering, The Chinese University of Hong Kong
³School of Computer Science and Engineering, Harbin Institute of Technology, Shenzhen
Email: xianghong_fang@163.com, {hlbai, jianli, lyu, king}@cse.cuhk.edu.hk, xuzenglin@hit.edu.cn
1: Equal Contribution
Abstract

Variational autoencoders (VAEs) have been widely applied for text modeling. In practice, however, they are troubled by two challenges: information underrepresentation and posterior collapse. The former arises because only the last hidden state of the LSTM encoder is transformed into the latent space, which is generally insufficient to summarize the data. The latter is a long-standing problem during the training of VAEs, as the optimization becomes trapped in a poor local optimum. In this paper, we propose the Discrete Auto-regressive Variational Attention Model (DAVAM) to address these challenges. Specifically, we introduce an auto-regressive variational attention approach to enrich the latent space by effectively capturing the semantic dependency of the input. We further design a discrete latent space for the variational attention and mathematically show that our model is free from posterior collapse. Extensive experiments on language modeling tasks demonstrate the superiority of DAVAM over several VAE counterparts. Code will be released.

Index Terms:
Text Modeling, Information Underrepresentation, Posterior Collapse

I Introduction

As one of the representative deep generative models, variational autoencoders (VAEs) [1] have been widely applied in text modeling [2, 3, 4, 5, 6, 7, 8, 9]. Given input text $\mathbf{x}\in\mathcal{X}$, VAEs learn the variational posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$ through the encoder and reconstruct the output $\hat{\mathbf{x}}$ from latent variables $\mathbf{z}$ via the decoder $p_{\theta}(\mathbf{x}|\mathbf{z})$. In text modeling, both the encoder and the decoder are usually implemented with deep recurrent networks such as LSTMs [10]. Despite the success of VAEs, two long-standing challenges remain for such variational models: information underrepresentation and posterior collapse.

The challenge of information underrepresentation refers to the limited expressiveness of the latent space $\mathbf{z}$. As shown in the left of Figure 1, current VAEs build a single latent variable $\mathbf{z}=z_{T}$ based on the last hidden state of the LSTM encoder [11, 12, 5, 6]. However, this is generally insufficient to summarize the input sentence [13], so the sentences generated by the decoder are often poorly correlated. Notably, the sequence of encoder hidden states reflects the semantic dependency of the input sentence, and the whole hidden context may benefit the generation. Therefore, a potential solution is to enhance the representation power of VAEs via the attention mechanism [14, 15], a superior component in discriminative models. However, the attention module cannot be directly deployed in generative models like VAEs, as the attentional context vectors are hard to compute from randomly sampled latent variables during the generation phase.

Posterior collapse is another well-known problem during the training of VAEs [16]. It occurs when the variational posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$ converges to the prior distribution $p(\mathbf{z})$, so that the decoder receives no supervision from the input $\mathbf{x}$. Previous efforts alleviate this issue by annealing the KL divergence term [16, 17, 11], revising the model [18, 19, 20], or modifying the training procedure [12, 6]. Nevertheless, they primarily focus on a single latent variable for language modeling, which still suffers from the information underrepresentation mentioned above. To derive a more powerful latent space, the challenge of posterior collapse should be carefully handled.

Figure 1: Illustration of conventional VAEs (left) and our proposed auto-regressive variational attention models (right).

In this paper, we propose the Discrete Auto-regressive Variational Attention Model (DAVAM) to address the aforementioned challenges. First, to mitigate the information underrepresentation of VAEs, we introduce a variational attention mechanism together with an auto-regressive prior (dubbed auto-regressive variational attention). The variational attention assigns a latent sequence $\mathbf{z}=z_{1:T}$ over the encoder hidden states to capture the semantic dependency of the input, as shown in the right of Figure 1. During the generation phase, the auto-regressive prior generates a well-correlated latent sequence for computing the attentional context vectors. Second, we utilize a discrete latent space to tackle posterior collapse in VAEs. We show that the proposed auto-regressive variational attention model, when armed with the conventional Gaussian distribution, faces a high risk of posterior collapse. Inspired by the recently proposed Vector Quantized Variational Autoencoder (VQ-VAE) [21, 22], we design a discrete latent distribution over the variational attention mechanism. By analyzing the intrinsic merits of discreteness, we demonstrate that our design is free from posterior collapse regardless of the latent sequence length. Consequently, the representation power of DAVAM can be significantly enhanced without posterior collapse.

We evaluate DAVAM on several benchmark datasets on language modeling. The experimental results demonstrate the superiority of our proposed method in text generation over its counterparts.

Our contributions can thus be summarized as:

  1. To the best of our knowledge, this is the first work to propose auto-regressive variational attention to improve VAEs for text modeling, which significantly enriches the information representation of the latent space.

  2. We further design a discrete latent space for the proposed variational attention, which effectively addresses the posterior collapse issue during optimization.

II Background

II-A Variational Autoencoders for Text Modeling

Variational Autoencoders (VAEs) [1] are a well-known class of generative models. Given sentences $\mathbf{x}=x_{1:T}$ of length $T$, we seek to infer latent variables $\mathbf{z}$ that explain the observation. To achieve this, we need to maximize the marginal log-likelihood $\log p_{\theta}(\mathbf{x})$, which is usually intractable due to the complex posterior $p(\mathbf{z}|\mathbf{x})$. Consequently, an approximate posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$ (i.e., the encoder) is introduced, and the evidence lower bound (ELBO) of the marginal likelihood is maximized as follows:

\log p_{\theta}(\mathbf{x}) \geq \underbrace{\mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]}_{\textrm{reconstruction loss}} - \underbrace{D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z}))}_{\textrm{KL divergence}}, \qquad (1)

where $p_{\theta}(\mathbf{x}|\mathbf{z})$ represents the likelihood function conditioned on $\mathbf{z}$, also known as the decoder. In the context of text modeling, both the encoder and the decoder are usually implemented by deep recurrent models such as LSTMs [10], parameterized by $\phi$ and $\theta$ respectively.
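As a concrete reference, the following is a minimal PyTorch sketch of the negative ELBO in Equation (1) for a Gaussian posterior with a standard normal prior; the tensor shapes and function name are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def negative_elbo(logits, targets, mu, logvar):
    """Negative ELBO of Eq. (1) with q_phi(z|x) = N(mu, diag(exp(logvar))) and p(z) = N(0, I).

    logits:     (batch, T, vocab) decoder predictions p_theta(x|z)
    targets:    (batch, T)        token ids of the input sentence x
    mu, logvar: (batch, D)        posterior parameters from the encoder
    """
    # Reconstruction term: token-level cross-entropy summed over the sentence.
    rec = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1), reduction="sum")
    # Closed-form KL(N(mu, sigma^2) || N(0, I)).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```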

II-B Challenges

Information Underrepresentation

Information underrepresentation is a common issue in applying VAEs to text modeling. Conventional VAEs build latent variables based on the last hidden state of the LSTM encoder, i.e., $\mathbf{z}=z_{T}$. During the decoding process, we first sample $z_{T}$, from which new sentences $\hat{\mathbf{x}}=\hat{x}_{1:\hat{T}}$ can be generated:

p(\hat{\mathbf{x}}|\mathbf{z})=p_{\theta}(\hat{x}_{1}|z_{T})\prod_{t=2}^{\hat{T}}p_{\theta}(\hat{x}_{t}|\hat{x}_{t-1},z_{T}), \qquad (2)

where $\hat{T}$ is the length of the reconstructed sentence $\hat{\mathbf{x}}$. However, the representation of $z_{T}$ is generally insufficient to summarize the semantic dependencies in $\mathbf{x}$, which deteriorates the reconstruction.

Posterior Collapse

Posterior collapse usually arises as $D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z}))$ diminishes to zero, where the local optimum gives $q_{\phi}(\mathbf{z}|\mathbf{x})=p(\mathbf{z})$. Posterior collapse can easily happen because the ELBO contains both the reconstruction loss $\mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]$ and the KL divergence $D_{KL}(q_{\phi}(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z}))$, as shown in Equation (1). When posterior collapse happens, $\mathbf{x}$ becomes independent of $\mathbf{z}$, since $p(\mathbf{x})p(\mathbf{z})=p(\mathbf{x})q_{\phi}(\mathbf{z}|\mathbf{x})=p(\mathbf{x})\frac{p(\mathbf{x},\mathbf{z})}{p(\mathbf{x})}=p(\mathbf{x},\mathbf{z})$. Therefore, the encoder learns a data-agnostic posterior that carries no information about $\mathbf{x}$, while the decoder fails to perform valid generation and relies purely on random noise.

III Methodology

Figure 2: The overall architecture of the proposed DAVAM. Given observations $\mathbf{x}=x_{1:T}$, the encoder hidden states $h^{e}_{1:T}$ are quantized to the code book $\{e_{k}\}_{k=1}^{K}$ based on the index sequence $z_{1:T}$ from the posterior. The quantized hidden states $e_{z_{1:T}}$ are then forwarded to the attention module together with the decoder hidden states $h^{d}_{1:\hat{T}}$. During back-propagation, the gradients of $e_{z_{1:T}}$ are directly copied to $h^{e}_{1:T}$ with the STE. To generate new sentences from DAVAM, we start from the auto-regressive prior to sample a new latent sequence $z_{1:T}$, which is then used to index the code book for attention computation during decoding.

We now present our solutions to the aforementioned challenges. To enrich the latent space, we propose an auto-regressive variational attention model that captures the semantic dependencies in the input space. We first instantiate the variational attention with the Gaussian distribution and show that it suffers from posterior collapse. To solve this challenge, we further discretize the latent space with a one-hot categorical distribution, leading to the discrete auto-regressive variational attention model (DAVAM), as illustrated in Figure 2. We carefully analyze how DAVAM avoids posterior collapse.

III-A Gaussian Auto-regressive Variational Attention Models

To enrich the representation of the latent space $\mathbf{z}$, we seek to incorporate the attention mechanism into VAEs. Specifically, we denote the encoder hidden states as $h^{e}_{1:T}$ and the decoder hidden states as $h^{d}_{1:\hat{T}}$. We build a latent sequence $\mathbf{z}=z_{1:T}$ upon the encoder hidden states $h^{e}_{1:T}$. To facilitate such a variational attention model, one can choose the conventional Gaussian distribution [1] for the variational posteriors, i.e., $q_{\phi}(\mathbf{z}|\mathbf{x})=\prod_{t=1}^{T}q_{\phi}(z_{t}|\mathbf{x})$ where $q_{\phi}(z_{t}|\mathbf{x})=\mathcal{N}(\mu_{t},\sigma_{t}I)$. We name the resulting model the Gaussian Auto-regressive Variational Attention Model (GAVAM).

Given $z_{1:T}$, similar to attention-based sequence-to-sequence (seq2seq) models [14], the attentional context vector $c_{i}$ and attention scores at the $i$-th decoding step can be computed by

c_{i}=\sum_{t=1}^{T}\alpha_{i,t}z_{t}, \qquad \alpha_{i,t}=\frac{\exp(\tilde{\alpha}_{i,t})}{\sum_{j=1}^{T}\exp(\tilde{\alpha}_{i,j})}, \qquad (3)

where $\tilde{\alpha}_{i,t}=v^{\top}\tanh(W_{e}z_{t}+W_{d}h^{d}_{i-1}+b)$ is the unnormalized score, $t\in\{1,2,\dots,T\}$ is the encoder time step, and $W_{e},W_{d}$ are the corresponding parameters. By taking $c_{i}$ as an extra input to the decoder, the generation process is reformulated as

p(\hat{\mathbf{x}}|\mathbf{c},\mathbf{z})=p(\hat{x}_{1}|c_{1},z_{1:T})\prod_{\hat{t}=2}^{\hat{T}}p(\hat{x}_{\hat{t}}|\hat{x}_{\hat{t}-1},c_{\hat{t}},z_{1:T}).

Unlike Equation 2, at each time step, the decoder receives supervision from the context vector, which is a weighted sum of the latent sequence $z_{1:T}$. Consequently, the variational posterior $q_{\phi}(z_{1:T}|\mathbf{x})$ encodes the semantic dependency from the observations, such that the issue of information underrepresentation can be effectively mitigated.
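For concreteness, the following is a minimal sketch of the additive attention step of Equation (3) at one decoding position; the tensor shapes and parameter names are assumptions for illustration.

```python
import torch

def attention_step(z, h_dec_prev, W_e, W_d, b, v):
    """Context vector c_i of Eq. (3) at decoding step i.

    z:          (T, D) latent sequence z_{1:T} (or quantized codes e_{z_{1:T}} in DAVAM)
    h_dec_prev: (D,)   previous decoder hidden state h^d_{i-1}
    W_e, W_d:   (D, D) projections; b, v: (D,)
    """
    # Unnormalized scores: alpha~_{i,t} = v^T tanh(W_e z_t + W_d h^d_{i-1} + b)
    scores = torch.tanh(z @ W_e.T + h_dec_prev @ W_d.T + b) @ v   # (T,)
    alpha = torch.softmax(scores, dim=0)                          # normalized weights alpha_{i,t}
    c = alpha @ z                                                  # weighted sum, c_i
    return c, alpha
```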

Auto-regressive Prior

A key difference between auto-regressive variational attention models and conventional VAEs is the choice of the prior distribution. During generation, the latent sequence $z_{1:T}$ is sampled from the prior unconditionally, and is then fed to the attention module together with $h^{d}_{1:\hat{T}}$. The most commonly adopted prior, $\mathcal{N}(0,I)$, however, is too uninformative to generate the kind of well-correlated latent sequences that the attention module observes during training. As a result, the decoder receives no informative supervision and cannot produce reasonable generations.

To solve this, we deploy an auto-regressive prior $p_{\psi}(z_{1:T})=p_{\psi}(z_{1})\prod_{t=2}^{T}p_{\psi}(z_{t}|z_{1:t-1})$ parameterized by $\psi$ to capture the underlying semantic dependencies. Specifically, we take $p_{\psi}(z_{t}|z_{1:t-1})=\mathcal{N}(\hat{\mu}_{t},\hat{\sigma}_{t}I)$, where $(\hat{\mu}_{t},\hat{\sigma}_{t})$ is produced by a PixelCNN, a strong model for learning sequential data [23].

Posterior Collapse in GAVAM

The training of GAVAM can easily be troubled by posterior collapse, for two reasons. To see this, similar to Equation (1), the negative ELBO to be minimized can now be written as:

\min_{\phi,\theta,\psi}\; -\mathbb{E}_{z_{1:T}\sim q_{\phi}}[\log p_{\theta}(\mathbf{x}|z_{1:T})] + \sum_{t=1}^{T}D_{KL}(q_{\phi}(z_{t}|\mathbf{x})\,\|\,p_{\psi}(z_{t}|z_{1:t-1})). \qquad (4)

On the one hand, the KL divergence scales linearly with the sequence length $T$, which makes the training unstable across different input lengths. On the other hand, and more seriously, both $\phi$ and $\psi$ are used to minimize the KL divergence, which can easily trap the learned posteriors. To see this, the KL divergence between the two Gaussian distributions can be written as:

\sum_{t=1}^{T}D_{KL}(q_{\phi}(z_{t}|\mathbf{x})\,\|\,p_{\psi}(z_{t}|z_{1:t-1})) = \sum_{t=1}^{T}\sum_{d=1}^{D}\frac{1}{2}\Big(\log\frac{\hat{\sigma}_{td}^{2}}{\sigma_{td}^{2}}-1+\frac{\sigma_{td}^{2}+(\hat{\mu}_{td}-\mu_{td})^{2}}{\hat{\sigma}_{td}^{2}}\Big), \qquad (5)

where $D$ is the latent dimension of $z_{t}$. Whenever $\sigma^{2}_{td}\rightarrow\hat{\sigma}^{2}_{td}$ and $\mu_{td}\rightarrow\hat{\mu}_{td}$ before $q_{\phi}(z_{1:T}|\mathbf{x})$ encodes anything from $\mathbf{x}$, both $q_{\phi}(z_{1:T}|\mathbf{x})$ and $p_{\psi}(z_{1:T})$ get stuck in a local optimum and learn no semantic dependency for reconstruction.
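For reference, Equation (5) can be computed directly from the posterior and prior parameters, as in the following sketch (shapes and names are assumptions):

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Sum over t and d of KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)), Eq. (5).

    All arguments have shape (T, D): posterior (q) and prior (p) parameters.
    """
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q - 1.0
                + (var_q + (mu_p - mu_q).pow(2)) / var_p)
    return kl.sum()
```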

III-B Discrete Auto-regressive Variational Attention Models

Inspired by recent studies [21, 24] that demonstrate the promising effects of discrete latent spaces, we explore their potential for handling posterior collapse in the variational attention, leading to the discrete auto-regressive variational attention model (DAVAM).

Specifically, we introduce a code book $\{e_{k}\}_{k=1}^{K}$ of size $K$, where each $e_{k}$ is a vector in the latent space. We expect combinations of code book entries to represent the semantic dependency of the observed sentence $\mathbf{x}$. We now substitute the Gaussian-distributed $z_{1:T}$ with discrete indices over the code book that follow a one-hot categorical distribution:

q_{\phi}(z_{t}=k|\mathbf{x})=\begin{cases}1 & k=\arg\min_{j}\|h^{e}_{t}-e_{j}\|_{2}\\ 0 & \text{otherwise}\end{cases}. \qquad (6)

Given index $z_{t}$, we map the encoder hidden state $h_{t}^{e}$ to the nearest code $e_{z_{t}}$. Then we use $e_{z_{t}}$ instead of $z_{t}$ in Equation (3) to compute the attention scores $\alpha_{i,t}$ and the context vectors $c_{i}$.
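The quantization in Equation (6) amounts to a nearest-neighbour lookup in the code book, as in the following minimal sketch (shapes are assumptions):

```python
import torch

def quantize(h_enc, codebook):
    """Eq. (6): map each encoder state to its nearest code.

    h_enc:    (T, D) encoder hidden states h^e_{1:T}
    codebook: (K, D) code book {e_k}
    """
    dists = torch.cdist(h_enc, codebook) ** 2   # squared Euclidean distances, (T, K)
    z = dists.argmin(dim=1)                     # index sequence z_{1:T}
    e_z = codebook[z]                           # quantized states e_{z_{1:T}}
    return z, e_z
```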

Correspondingly, as $z_{1:T}$ are discrete indices, we assign a categorical distribution to the auto-regressive prior, i.e., $p_{\psi}(z_{t}|z_{1:t-1})=\mathrm{Cat}(\gamma_{t})$. The categorical parameter is obtained from the PixelCNN model given the history $z_{1:t-1}$, i.e., $\gamma_{t}=\mathrm{PixelCNN}_{\psi}(z_{1:t-1})\in[0,1]^{K}$.

Advantages of Discreteness

Thanks to the nice properties of discreteness, the optimization of DAVAM does not suffer from posterior collapse. Specifically, the KL divergence of DAVAM can be written as:

\sum_{t=1}^{T}D_{KL}(q_{\phi}(z_{t}|\mathbf{x})\,\|\,p_{\psi}(z_{t}|z_{1:t-1})) = -\sum_{t=1}^{T}\Big[H(q_{\phi}(z_{t}))+\sum_{k=1}^{K}1_{(z_{t}=k)}\log\gamma_{t,k}\Big] = -\sum_{t=1}^{T}\Big[0+\log\gamma_{t,z_{t}}\Big], \qquad (7)

where the last equality uses the entropy $H(q_{\phi}(z_{t}))=-1\log 1-0\log 0=0$. It can be seen that $D_{KL}(q_{\phi}(z_{1:T}|\mathbf{x})\,\|\,p_{\psi}(z_{1:T}))$ no longer depends on the posterior parameters $\phi$. Consequently, the update of the variational posterior $q_{\phi}(z_{1:T}|\mathbf{x})$ does not rely on the prior but is determined purely by the reconstruction term. Therefore, minimization of the KL divergence will not lead to posterior collapse.
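Because the posterior is one-hot, this KL term reduces to a cross-entropy between the index sequence and the prior's categorical parameters; a minimal sketch (assuming the PixelCNN returns unnormalized logits) is:

```python
import torch.nn.functional as F

def discrete_kl(z, gamma_logits):
    """KL term of Eq. (7): -sum_t log gamma_{t, z_t}.

    z:            (T,)   index sequence from the one-hot posterior
    gamma_logits: (T, K) unnormalized prior logits whose softmax gives gamma_t
    """
    return F.cross_entropy(gamma_logits, z, reduction="sum")
```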

III-C Model Training

We first train the variational posterior $q_{\phi}(\mathbf{z}|\mathbf{x})$ to convergence, so that the latent sequence $z_{1:T}$ effectively captures the semantic dependency of the input $\mathbf{x}$. Then we train the auto-regressive prior $p_{\psi}(\mathbf{z})$ to mimic the learned posterior, so as to produce well-correlated sequences during generation. The training of the proposed DAVAM therefore involves two stages, described in detail as follows:

Stage One

We follow the standard paradigm and minimize the negative ELBO of DAVAM. As shown in Equation (7), the KL divergence depends on neither $\phi$ nor $\theta$, so only the reconstruction term needs to be considered. Meanwhile, as the latent variables $z_{1:T}$ are determined by the Euclidean distances between $h^{e}_{1:T}$ and the code book $\{e_{k}\}_{k=1}^{K}$, we further regularize them to stay close via a Frobenius norm. The training objective for stage one is

\min_{\theta,\phi}\; -\mathbb{E}_{q_{\phi}}[\log p_{\theta}(\mathbf{x}|z_{1:T})]+\beta\sum_{t=1}^{T}\|h_{t}^{e}-\mathrm{sg}(e_{z_{t}})\|_{F}^{2}, \qquad (8)

where $\beta$ is the regularization coefficient, and $\mathrm{sg}(\cdot)$ stands for the stop-gradient operation. Note that the quantization in Equation (6) is non-differentiable. To allow back-propagation to proceed, we adopt the widely used straight-through estimator (STE) [25] to copy gradients from $e_{z_{t}}$ to $h_{t}^{e}$, as shown in Figure 2.
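A minimal sketch of the straight-through estimator and the commitment regularizer in Equation (8) follows; the tensor and function names are assumptions.

```python
import torch

def straight_through(h_enc, e_z):
    """Forward with e_{z_t}, backward through h^e_t: h + (e - h).detach()."""
    return h_enc + (e_z - h_enc).detach()

def commitment_loss(h_enc, e_z, beta):
    """beta * sum_t ||h^e_t - sg(e_{z_t})||^2, the second term of Eq. (8)."""
    return beta * ((h_enc - e_z.detach()) ** 2).sum()
```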

To update the code book $\{e_{k}\}_{k=1}^{K}$, we first apply a K-means-style step that averages all encoder hidden states $h_{1:T}^{e}$ assigned to their nearest code $e_{k}$, and then take an exponential moving average over the mini-batch updates of the code book so as to stabilize the optimization.
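The update can be realized with the standard VQ-VAE exponential-moving-average scheme that this paragraph refers to; the sketch below illustrates one mini-batch step, where the buffer names, decay rate, and smoothing constant are assumptions.

```python
import torch

@torch.no_grad()
def ema_codebook_update(codebook, cluster_size, embed_sum, h_enc, z,
                        decay=0.99, eps=1e-5):
    """One EMA code-book update (codebook, cluster_size, embed_sum are persistent buffers).

    codebook:     (K, D) current codes
    cluster_size: (K,)   running count of vectors assigned to each code
    embed_sum:    (K, D) running sum of vectors assigned to each code
    h_enc:        (T, D) encoder hidden states in the mini-batch
    z:            (T,)   their code assignments from Eq. (6)
    """
    K = codebook.size(0)
    one_hot = torch.nn.functional.one_hot(z, K).float()       # (T, K)
    batch_size = one_hot.sum(dim=0)                            # vectors per code
    batch_sum = one_hot.t() @ h_enc                            # per-code sum, (K, D)
    # Exponential moving averages over mini-batches.
    cluster_size.mul_(decay).add_(batch_size, alpha=1 - decay)
    embed_sum.mul_(decay).add_(batch_sum, alpha=1 - decay)
    # Laplace smoothing keeps rarely used codes numerically stable.
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n
    codebook.copy_(embed_sum / smoothed.unsqueeze(1))
```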

Stage Two

After the convergence of stage one, we turn to updating the auto-regressive prior $p_{\psi}(z_{t}|z_{1:t-1})$. To mimic the semantic dependency in the learned posterior $q_{\phi}(z_{1:T}|\mathbf{x})$, the prior should fit the latent sequences $z_{1:T}\sim q_{\phi}(z_{1:T}|\mathbf{x})$. This can be realized by minimizing their KL divergence w.r.t. $\psi$:

\min_{\psi}\sum_{t}D_{KL}(q_{\phi}(z_{t}|\mathbf{x})\,\|\,p_{\psi}(z_{t}|z_{1:t-1})), \qquad (9)

which can be simplified to the cross-entropy loss between $z_{1:T}$ and $\gamma_{1:T}$ according to Equation (7).
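Since Equation (9) reduces to a cross-entropy, stage two amounts to maximum-likelihood training of the PixelCNN prior on index sequences produced by the frozen posterior. A minimal sketch of one update step, assuming a causally masked `pixelcnn` module, is:

```python
import torch.nn.functional as F

def prior_step(pixelcnn, optimizer, z):
    """One stage-two update. z: (B, T) index sequences from the frozen posterior.

    The PixelCNN is assumed to be causally masked, so its output at position t
    (a distribution over the K codes) depends only on z_{1:t-1}.
    """
    gamma_logits = pixelcnn(z)                                   # (B, T, K)
    loss = F.cross_entropy(gamma_logits.reshape(-1, gamma_logits.size(-1)),
                           z.reshape(-1))                        # Eq. (9) as cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```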

IV Experiments

We verify the advantages of the proposed DAVAM on language modeling tasks and test how well it can generate sentences from random noise. Finally, we conduct a set of further analyses to shed more light on DAVAM. Code implemented in PyTorch is available at https://github.com/sunset-clouds/DAVAM.

IV-A Experimental Setup

We use three benchmark datasets for language modeling: Yahoo Answers [20], Penn Treebank (PTB) [26], and a down-sampled version of SNLI [27]. A summary of dataset statistics is shown in Table I.

TABLE I: Dataset statistics.
Datasets Train Size Val Size Test Size Avg Len
Yahoo 100,000 10,000 10,000 78.7
PTB 42,068 3,370 3,761 23.1
SNLI 100,000 10,000 10,000 9.7
TABLE II: Results of language modeling on the Yahoo, PTB, and SNLI datasets. For both Rec and PPL, lower is better. For KL, a small value indicates posterior collapse, but this is not an issue for DAVAM.
# Methods Yahoo PTB SNLI
Rec\downarrow PPL\downarrow KL Rec\downarrow PPL\downarrow KL Rec\downarrow PPL\downarrow KL
1 LSTM-LM - 60.75 - - 100.47 - - 21.44 -
2 VAE 329.10 61.52 0.00 101.27 101.39 0.00 33.08 21.67 0.04
3 +anneal 328.80 61.21 0.00 101.28 101.40 0.00 31.66 21.50 1.42
4 +cyclic 333.80 66.93 2.83 101.85 108.97 1.37 30.69 23.67 3.63
5 +aggressive 322.70 59.77 5.70 100.26 99.83 0.93 31.53 21.16 1.42
6 +FBP 322.91 62.59 9.08 98.52 99.62 2.95 25.26 22.05 8.99
7 +pretraining+FBP 315.09 59.60 15.49 96.91 96.17 4.99 22.30 22.33 13.40
8 GAVAM 350.14 79.28 0.00 102.20 105.94 0.00 30.90 17.68 0.38
9 DAVAM-q (K=512) 323.10 57.14 0.33 95.83 79.24 0.27 28.16 13.71 0.12
10 DAVAM (K=128) 303.65 45.07 1.88 83.57 50.15 2.23 16.11 5.58 2.38
11 DAVAM (K=512) 259.68 26.61 2.60 60.16 17.94 3.12 10.85 3.52 2.69
Baselines

We compare the proposed DAVAM against a number of baselines, including the classical LSTM language model (LSTM-LM), the vanilla VAE [1], and its advanced variants: annealing VAE [16], cyclic annealing VAE (https://github.com/haofuml/cyclical_annealing) [11], lagging VAE (https://github.com/jxhe/vae-lagging-encoder) [12], Free Bits (FB) [17], and pretraining+FBP VAE (https://github.com/bohanli/vae-pretraining-encoder) [6]. None of these baselines uses an attention module in its architecture.

For ablation studies, we further compare to: 1) GAVAM, which uses a Gaussian distribution instead of the one-hot categorical distribution over $z_{1:T}$, to verify the advantages of discreteness; 2) DAVAM-q, which removes the attention mechanism, to test the effect of discreteness on the last latent variable $z_{T}$; and 3) a variant that replaces the auto-regressive prior with uninformative Gaussian priors to check the choice of prior, which leads to the variational attention model (VAE+Attn) first proposed by [13].

Evaluation Metrics

We evaluate language modeling using three metrics: 1) the reconstruction loss (Rec), $-\mathbb{E}_{\mathbf{z}\sim q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})]$, which measures the ability to recover data from the latent space; 2) perplexity (PPL), which measures the capacity of language modeling (lower Rec and PPL generally indicate better models); and 3) the KL divergence (KL), which indicates whether posterior collapse occurs.
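For reference, PPL is derived from the per-token negative log-likelihood, as in the sketch below; the paper's exact protocol (e.g., whether the KL term is included for the VAE baselines) is not specified here and may differ.

```python
import math

def perplexity(total_nll, total_tokens):
    """PPL = exp(NLL / number of tokens), accumulated over the evaluation corpus."""
    return math.exp(total_nll / total_tokens)
```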

Implementation

For the baselines, we keep the same hyper-parameter settings as pretraining+FBP VAE [6], e.g., the dimensions of the latent space, word embeddings, and LSTM hidden states. Since our latent variables are discrete, we cannot use importance-weighted samples to approximate the reconstruction loss as done in lagging VAE and pretraining+FBP VAE.

For DAVAM and its ablation counterparts, we keep the same set of hyper-parameters. By default, we set the code-book size $K$ to 512. We first warm up the training for 10 epochs, and then gradually increase $\beta$ in Equation (8) from 0.1 to $\beta_{\max}=5.0$, in a similar spirit to annealing VAE [16]. For all experiments, we use the SGD optimizer with an initial learning rate of 1.0, and decay it (at most five times) whenever the validation loss does not decrease for 2 epochs. For the auto-regressive prior, we use a 16-layer PixelCNN with one-dimensional convolutions followed by residual connections.
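One possible reading of this $\beta$ schedule is sketched below; the length of the annealing ramp is an assumption, since the text only specifies the warm-up length and the range of $\beta$.

```python
def beta_schedule(epoch, warmup_epochs=10, beta_start=0.1, beta_max=5.0, ramp_epochs=20):
    """Keep beta at beta_start during warm-up, then anneal linearly to beta_max.

    ramp_epochs is an assumed hyper-parameter not stated in the paper.
    """
    if epoch < warmup_epochs:
        return beta_start
    progress = min(1.0, (epoch - warmup_epochs) / ramp_epochs)
    return beta_start + progress * (beta_max - beta_start)
```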

IV-B Experimental Results

TABLE III: Sampled short, medium and long sentences as well as their GPT-2 PPL scores for measuring fluency.
pretraining+FBP VAE:
• [s] i can say what can be the least in terms also in any form of power stream [/s] (PPL 5.97)
• [s] how you define yourself ? birth control. you will find yourself a dead because of your periods. [/s] (PPL 5.43)
• [s] are they allowed to join (francisco) in _UNK. giants in the first place.? check out other answers. do you miss the economy and not taking risks in the merchant form, what would you tell? go to the yahoo home page and ask what restaurants follow this one. [/s] (PPL 5.21)
VAE+Attn:
• [s] explain some make coming you think and represents middle line girl coming. [/s] (PPL 8.83)
• [s] who just live this in you to get usual usual help the idea for out to use guess guess thats UNK (PPL 7.16)
• [s] is masterbating masterbating anyone hi fact fact forgive forgive forgive virgin chlorine does 're hydrogen download 're whats 're does solve 're whats solve 2y germany germany monde pourquoi 'm does fun pourquoi 'm 're 'm 're solve 'm does solve pourquoi 'm 'm 'm 'm solve 'm 'm [/s] (PPL 5.01)
GAVAM:
• [s] generally do problems problems, do you have problems [/s] (PPL 6.57)
• [s] plz can you get a UNK envelope in e ? for me for my switched for eachother i would switched to i i think thats . i need to assume . [/s] (PPL 6.50)
• [s] what to do, yoga is there to place and pa? i can definately, but that, you will the best, but the the only amount right? the best range, it though, to do n't do to be to be out, and [/s] (PPL 5.18)
DAVAM (K=512):
• [s] what is the meaning of time management? [/s] (PPL 4.52)
• [s] what should you be thankful for thanksgiving dinner and how to get some money with a thanksgiving dinner? [/s] (PPL 4.25)
• [s] is anyone willing to donate plasma if you are allergic to cancer or anything else? probably you can. i've never done any thing but it is only that dangerous to kill bacteria. i have heard that it doesn't have any effect on your immune system. [/s] (PPL 3.87)

IV-B1 Language Modeling

To compare the representations of the latent space, we first perform language modeling over the test corpora of the benchmark datasets, as shown in Table II. Generally, the better the representation, the lower the Rec and PPL on the observations. For DAVAM and GAVAM, we average the KL divergence along the latent sequence to make them comparable to baselines that only have one latent variable.

Main Results

(Rows 1-7, 10-11) Compared to baselines without variational attention, our DAVAM achieves significantly better results on all three datasets, especially with a larger code book size $K$. For example, compared to pretraining+FBP in row 7 on the Yahoo Answers dataset, DAVAM with $K=512$ reduces the reconstruction loss by 55.41, and the PPL is decreased by more than half. Therefore, our DAVAM is more expressive in summarizing observations than baselines without attention modules. This verifies that DAVAM can significantly enrich the latent representation for language modeling.

In terms of posterior collapse, both the vanilla VAE and some of its variants suffer from this issue severely, as the KL of the Gaussian distribution diminishes to nearly 0. The KL of DAVAM, however, does not indicate posterior collapse, but only reflects how well the auto-regressive prior mimics the posterior.

Ablation Studies

(Rows 8-11) GAVAM performs less competitively on language modeling compared to DAVAM. Moreover, its KL divergences are near or equal to 0, indicating posterior collapse, which explains its sub-optimal performance once the one-hot categorical distribution is replaced with a Gaussian. As for DAVAM-q, it has no attention module and only learns the variational posterior over the last latent variable $z_{T}$, which naturally yields less competitive results than the attention-based models. However, DAVAM-q still outperforms a number of VAE variants, as its posterior is free from collapse thanks to discreteness.

IV-C Language Generation From Scratch

In this section, we examine the ability of language generation from scratch, i.e., generating sentences directly from random noise. We study the generated language from two perspectives: quality and diversity. We then apply this input-free generation approach to data augmentation, which can improve language models trained on a limited corpus.

TABLE IV: GPT-2 perplexity scores ($\downarrow$) with standard deviations ($\pm$) for generation quality (rows 1-5), and the reconstruction loss ($\downarrow$) on the Yahoo dataset for language modeling (last row).
Length VAE+Attn GAVAM DAVAM
10 7.75±1.71 7.32±1.64 6.58±1.17
20 7.04±1.46 6.94±1.40 6.49±1.09
30 6.70±1.31 6.75±1.38 5.97±0.87
40 6.54±1.50 6.28±1.36 5.79±0.81
50 6.42±1.61 6.10±1.33 5.55±0.97
Rec 10.85 350.14 259.68

IV-C1 Generation Quality

We first visualize the sentences generated by pretraining+FBP VAE, VAE+Attn, GAVAM, and DAVAM, along with their fluency scores. We adopt a pre-trained GPT-2 [28] (https://huggingface.co/transformers/model_doc/gpt2.html) as the fluency evaluator, which takes the generated sentences as input and returns the corresponding perplexity scores (PPL).
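A minimal sketch of such a fluency evaluator with the Hugging Face transformers library is shown below; the exact scoring protocol (tokenization, handling of the [s]/[/s] markers) is an assumption rather than the paper's setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def gpt2_perplexity(sentence: str) -> float:
    """Perplexity of a generated sentence under pre-trained GPT-2."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss   # mean token-level negative log-likelihood
    return torch.exp(loss).item()
```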

From Table III, we find that VAE+Attn can hardly generate well-correlated latent sequences from uninformative Gaussian distributions, as a result of its non-autoregressive prior. On the other hand, GAVAM is armed with the auto-regressive prior and produces more readable sentences. Nevertheless, it suffers from poor semantic meaning due to posterior collapse, as previously shown in Table II. Finally, DAVAM can produce sentences with interpretable meaning and better fluency scores, even when the sequence length is long. This suggests that the discrete latent sequence combined with the auto-regressive prior enjoys unique advantages in language generation from scratch.

Figure 3: The perplexity scores under different sizes of augmented training sentences on SNLI dataset.
TABLE V: The generation diversity scores evaluated by entropy (Ent.), distinct unigrams (Dist-1) and bigrams (Dist-2).
Length pretraining+FBP VAE VAE+Attn GAVAM DAVAM
Ent. \uparrow Dist-1\uparrow Dist-2\uparrow Ent. \uparrow Dist-1\uparrow Dist-2\uparrow Ent. \uparrow Dist-1\uparrow Dist-2\uparrow Ent. \uparrow Dist-1\uparrow Dist-2\uparrow
10 5.41 0.412 0.853 5.09 0.366 0.792 4.78 0.288 0.724 5.00 0.366 0.844
20 5.48 0.355 0.836 5.12 0.243 0.649 4.80 0.212 0.636 5.10 0.249 0.702
30 5.57 0.301 0.798 4.70 0.166 0.487 4.60 0.150 0.503 5.18 0.210 0.655
40 5.55 0.252 0.756 4.10 0.110 0.372 4.19 0.113 0.401 5.34 0.188 0.646
50 5.69 0.249 0.765 3.92 0.089 0.326 4.02 0.093 0.354 5.33 0.173 0.611
Figure 4: Further analysis of DAVAM: (a) code book size $K$; (b) maximum regularizer $\beta_{\max}$; (c) latent dimension of $e_{z_t}$; (d) training dynamics.

To study the quality more rigorously, we further sample 100 sentences of different lengths from VAE+Attn, GAVAM, and DAVAM, and compare their average perplexity scores under the GPT-2 evaluator. As there is an inevitable trade-off between language modeling and generation, we also report the reconstruction loss on Yahoo. The results are listed in Table IV. It can be found that while VAE+Attn has a clear advantage in language modeling, it has the worst GPT-2 perplexity scores, since i.i.d. Gaussian noise contains no sequential information. GAVAM shows a minor improvement over VAE+Attn on GPT-2 scores thanks to the auto-regressive prior, but it performs poorly on language modeling due to posterior collapse. Finally, DAVAM generally achieves the lowest perplexity scores with a reasonable ability in language modeling. This indicates the superiority of the auto-regressive prior for generation from scratch, and the power of discreteness to avoid posterior collapse when fitting observations.

IV-C2 Generation Diversity

Diversity is another dimension for measuring the success of language generation. We follow [29] to compute the entropy and the percentage of distinct unigrams and bigrams, denoted as Ent., Dist-1, and Dist-2 respectively. From Table V, it can be observed that pretraining+FBP VAE achieves the highest diversity scores. However, VAE+Attn achieves the poorest diversity scores due to repeated words, as shown in Table III. GAVAM shows only a minor improvement over VAE+Attn, and repeated words frequently occur as well. Finally, our DAVAM can generate diverse sentences, although the scores are slightly lower than those of pretraining+FBP VAE. This is because training the auto-regressive prior yields over-confident choices of latent codes, which better capture sequential dependency and improve generation quality, at the cost of some generation diversity.
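The diversity metrics can be computed as in the sketch below; whitespace tokenization is an assumption, and the exact definitions in the cited work [29] may differ slightly.

```python
import math
from collections import Counter

def diversity_scores(sentences):
    """Entropy of the unigram distribution plus Dist-1/Dist-2 ratios."""
    tokens = [tok for s in sentences for tok in s.split()]
    bigrams = list(zip(tokens, tokens[1:]))
    dist1 = len(set(tokens)) / max(len(tokens), 1)
    dist2 = len(set(bigrams)) / max(len(bigrams), 1)
    counts = Counter(tokens)
    total = sum(counts.values())
    ent = -sum((c / total) * math.log(c / total) for c in counts.values())
    return ent, dist1, dist2
```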

IV-C3 Generation from Scratch as Data Augmentation

Given the superior generation quality and diversity of DAVAM, we now apply it to data augmentation to improve language models trained on a limited corpus. By amortizing the training instances into model parameters, DAVAM is able to generate sentences directly from random noise (i.e., generation from scratch). Specifically, we choose corpus sizes in {1k, 2k, 4k} on the SNLI dataset, and use a pre-trained DAVAM model to augment the corpus by $\{0.5\times,1\times,2\times,4\times\}$ of its size. Figure 3 shows the corresponding improvement of perplexity scores. It can be found that the perplexity decreases as the amount of augmented training data increases. Therefore, our DAVAM can be applied to language models trained on a limited corpus for further improvement.

IV-D Further Analysis

To gain a better understanding of our proposed DAVAM, we first conduct a set of sensitivity analyses on the hyper-parameter settings of the model. By default, all sensitivity analyses are conducted on the Yahoo dataset with the default parameter settings, except for the parameter under discussion. We then analyze the training dynamics of DAVAM, which explain why it avoids posterior collapse.

IV-D1 Code Book Size $K$

We begin with the effect of the code book size $K$ on the reconstruction loss and KL divergence for language modeling. We vary $K\in\{128,256,512,1024\}$, and the results are shown in Figure 4(a). It can be observed that as $K$ increases, the Rec loss decreases whereas KL increases, both monotonically. The results are also consistent with Table II when increasing $K$ from 128 to 512. This behavior is intuitive, since a larger $K$ improves model capacity but poses more challenges for training the auto-regressive prior. Consequently, one should choose the code book size properly, such that the prior can approximate the posterior well and yet the posterior is representative enough for the semantic dependency.

IV-D2 Maximum Regularizer $\beta_{\max}$

Then we tune the maximum regularizer $\beta_{\max}$, which controls the distance of the continuous hidden states $h_{1:T}^{e}$ to the code book $\{e_{k}\}_{k=1}^{K}$. Recall that a small $\beta_{\max}$ only loosely ties the continuous space $h_{1:T}^{e}$ to the code book, making the quantization hard to converge. On the other hand, if $\beta_{\max}$ is too large, $h_{1:T}^{e}$ can easily get stuck in a local minimum during training. Therefore, it is necessary to find a proper trade-off between the two situations. We vary $\beta_{\max}\in\{0.1,0.2,0.5,1,5,10,20\}$, and the results are shown in Figure 4(b). We find that when $\beta_{\max}=5$, DAVAM achieves the lowest Rec, while smaller or larger $\beta_{\max}$ both lead to higher Rec values.

IV-D3 Dimension of Code Book Vectors

Finally, we change the dimension of $\{e_{k}\}_{k=1}^{K}$ within $\{8,16,32,64,128,256\}$, and the results are shown in Figure 4(c). The performance of language modeling is relatively robust to the choice of the latent dimension. This differs from the continuous case, where the dimension of the latent variables is closely related to the model capacity. In the discrete scenario, the model capacity is largely determined by the code book size $K$ rather than the dimension of the code book, which is also verified in Table II and Figure 4(a).

IV-D4 Training Dynamics

To empirically understand how DAVAM addresses posterior collapse, we investigate the training dynamics. We plot the curves of Rec and KL on the validation set of PTB in Figure 4(d). We find that the KL of GAVAM rises at the beginning to explain the observations but diminishes quickly afterward. Meanwhile, Rec does not decrease sufficiently. This shows that the collapsed posterior fails to explain the observations. For DAVAM, on the other hand, since the optimization of the reconstruction is not affected by the KL divergence, Rec is minimized sufficiently in the first place. Then we minimize the KL, which converges quickly without oscillation. In other words, the posterior and prior are updated separately in two stages, which avoids posterior collapse.

V Related Work

V-A Variational Attention Models

The attention mechanism is commonly adopted to address the issue of under-fitting in various deep generative models [30, 31, 13]. Both [31] and [13] consider generation from some source input, where new latent sequences are generated conditioned on observations. Nevertheless, these methods can hardly be applied when no source information is available. Instead, our work focuses on the ability of generation from scratch, i.e., generating from the latent space directly without external sources [32]. Generation from scratch has various applications, such as data augmentation, where new training instances can be generated directly from random noise to enlarge a limited training set. To enable this ability, an auto-regressive prior should be deployed to generate semantically dependent latent sequences. This explains the core idea of auto-regressive variational attention in our approach.

V-B Discrete Latent Variables

Aside from the commonly used Gaussian distribution in VAEs, recent works also explore discrete latent spaces, such as DVAE [33], DVAE++ [34], and DVAE# [35]. Nevertheless, these works have different motivations for discreteness: they introduce binary latent variables to improve model capacity. In DAVAM, instead of enhancing model capacity, we assign a one-hot distribution to the latent variables with the aim of resolving posterior collapse, which is not addressed in these previous efforts [33, 34, 35].

VI Conclusion

In this paper, we propose the discrete auto-regressive variational attention model, a new deep generative model for text modeling. The proposed approach addresses two important issues: information underrepresentation and posterior collapse. Empirical results on benchmark datasets demonstrate the superiority of our approach in both language modeling and auto-regressive generation. While the proposed method focuses on fundamental text modeling, it is also promising to extend it to more applications such as machine translation [36, 37, 38, 39, 40], question answering [41], log generation [42, 43], and code generation [44, 45].

Acknowledgement

The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (CUHK 2410021, Research Impact Fund (RIF), R5034-18).

References

  • [1] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2013.
  • [2] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in NeurIPS, 2015.
  • [3] B. Zhang, D. Xiong, J. Su, H. Duan, and M. Zhang, “Variational neural machine translation,” Preprint arXiv:1605.07869, 2016.
  • [4] J. Su, S. Wu, D. Xiong, Y. Lu, X. Han, and B. Zhang, “Variational recurrent neural machine translation,” in AAAI, 2018.
  • [5] P. Z. Wang and W. Y. Wang, “Neural gaussian copula for variational autoencoder,” in EMNLP, 2019.
  • [6] B. Li, J. He, G. Neubig, T. Berg-Kirkpatrick, and Y. Yang, “A surprisingly effective fix for deep latent variable modeling of text,” in EMNLP, 2019.
  • [7] H. Bai, Z. Chen, M. R. Lyu, I. King, and Z. Xu, “Neural relational topic models for scientific article analysis,” in CIKM, 2018, pp. 27–36.
  • [8] H. Liu, L. He, H. Bai, B. Dai, K. Bai, and Z. Xu, “Structured inference for recurrent hidden semi-markov model.” in IJCAI, 2018, pp. 2447–2453.
  • [9] H. Liu, H. Bai, L. He, and Z. Xu, “Stochastic sequential neural networks with structured inference,” Tech. Rep. arXiv:1705.08695, 2017.
  • [10] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, 1997.
  • [11] H. Fu, C. Li, X. Liu, J. Gao, A. Çelikyilmaz, and L. Carin, “Cyclical annealing schedule: A simple approach to mitigating kl vanishing,” in NAACL, 2019.
  • [12] J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick, “Lagging inference networks and posterior collapse in variational autoencoders,” Preprint arXiv:1811.00135, 2019.
  • [13] H. Bahuleyan, L. Mou, O. Vechtomova, and P. Poupart, “Variational attention for sequence-to-sequence models,” in COLING, 2018.
  • [14] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
  • [15] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in EMNLP, 2015.
  • [16] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, and S. Bengio, “Generating sentences from a continuous space,” in CoNLL, 2015.
  • [17] D. P. Kingma, T. Salimans, and M. Welling, "Improved variational inference with inverse autoregressive flow," in NeurIPS, 2017.
  • [18] Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick, “Improved variational autoencoders for text modeling using dilated convolutions,” in ICML, 2017.
  • [19] S. Semeniuta, A. Severyn, and E. Barth, “A hybrid convolutional variational autoencoder for text generation,” in EMNLP, 2017.
  • [20] J. Xu and G. Durrett, “Spherical latent spaces for stable variational autoencoders,” in EMNLP, 2018.
  • [21] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in NeurIPS, 2017.
  • [22] A. Razavi, A. van den Oord, and O. Vinyals, “Generating diverse high-fidelity images with vq-vae-2,” in NeurIPS, 2019.
  • [23] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, O. Vinyals, and A. Graves, “Conditional image generation with pixelcnn decoders,” in NeurIPS, 2016.
  • [24] A. Roy, A. Vaswani, A. Neelakantan, and N. Parmar, “Theory and experiments on vector quantized autoencoders,” Preprint arXiv:1805.11063, 2018.
  • [25] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” Preprint arXiv:1308.3432, 2013.
  • [26] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a large annotated corpus of english: The penn treebank,” Computational Linguistics, 1993.
  • [27] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” EMNLP, 2015.
  • [28] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, 2019.
  • [29] J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan, “A diversity-promoting objective function for neural conversation models,” Preprint arXiv:1510.03055, 2015.
  • [30] H. Kim, A. Mnih, J. Schwarz, M. Garnelo, A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh, “Attentive neural processes,” in ICLR, 2019.
  • [31] Y. Deng, Y. Kim, J. Chiu, D. Guo, and A. Rush, “Latent alignment and variational attention,” in NeurIPS, 2018.
  • [32] N. Subramani, S. Bowman, and K. Cho, “Can unconditional language models recover arbitrary sentences?” in NeurIPS, 2019.
  • [33] J. T. Rolfe, “Discrete variational autoencoders,” in ICLR, 2017.
  • [34] A. Vahdat, W. G. Macready, Z. Bian, and A. Khoshaman, “Dvae++: Discrete variational autoencoders with overlapping transformations,” in ICML, 2018.
  • [35] A. Vahdat, E. Andriyash, and W. G. Macready, “Dvae#: Discrete variational autoencoders with relaxed boltzmann priors,” in NeurIPS, 2018.
  • [36] J. Li, Z. Tu, B. Yang, M. R. Lyu, and T. Zhang, “Multi-head attention with disagreement regularization,” in EMNLP, 2018.
  • [37] B. Yang, J. Li, D. F. Wong, L. S. Chao, X. Wang, and Z. Tu, “Context-aware self-attention networks,” in AAAI, 2019.
  • [38] J. Li, B. Yang, Z.-Y. Dou, X. Wang, M. R. Lyu, and Z. Tu, “Information aggregation for multi-head attention with routing-by-agreement,” in NAACL-HLT, 2019.
  • [39] J. Li, X. Wang, B. Yang, S. Shi, M. R. Lyu, and Z. Tu, “Neuron interaction based representation composition for neural machine translation,” in AAAI, 2020.
  • [40] J. Li, X. Wang, Z. Tu, and M. R. Lyu, “On the diversity of multi-head attention,” Neurocomputing, 2021.
  • [41] S. Xu, G. Campagna, J. Li, and M. S. Lam, “Schema2qa: High-quality and low-cost q&a agents for the structured web,” in CIKM, 2020.
  • [42] P. He, J. Zhu, S. He, J. Li, and M. R. Lyu, “Towards automated log parsing for large-scale log data analysis,” TDSC, 2017.
  • [43] P. He, S. He, J. Li, and M. R. Lyu, “An evaluation study on log parsing and its use in log mining,” in DSN, 2016.
  • [44] J. Li, P. He, J. Zhu, and M. R. Lyu, “Software defect prediction via convolutional neural network,” in QRS, 2017.
  • [45] J. Li, Y. Wang, M. R. Lyu, and I. King, “Code completion with neural attention and pointer networks,” in IJCAI, 2018.