
Retrofitting Structure-aware Transformer Language Model for End Tasks

Hao Fei1, Yafeng Ren2 (Corresponding author), Donghong Ji1
1. Key Laboratory of Aerospace Information Security and Trusted Computing,
Ministry of Education, School of Cyber Science and Engineering, Wuhan University, China
2. Guangdong University of Foreign Studies, China
{hao.fei,renyafeng,dhji}@whu.edu.cn
Abstract

We consider retrofitting a structure-aware Transformer language model for facilitating end tasks, proposing to exploit syntactic distance to encode both phrasal constituency and dependency connections into the language model. A middle-layer structural learning strategy is leveraged for structure integration, accomplished jointly with the main semantic task training under a multi-task learning scheme. Experimental results show that the retrofitted structure-aware Transformer language model achieves improved perplexity while inducing accurate syntactic phrases. With structure-aware fine-tuning, our model achieves significant improvements on both semantic- and syntactic-dependent tasks.

1 Introduction

Natural language models (LMs) can generate fluent text and encode factual knowledge Mikolov et al. (2013); Pennington et al. (2014); Merity et al. (2017). Recently, pre-trained contextualized language models have brought remarkable improvements on various NLP tasks Peters et al. (2018); Radford et al. (2018); Howard and Ruder (2018); Yang et al. (2019); Devlin et al. (2019); Dai et al. (2019). Among such methods, the Transformer-based Vaswani et al. (2017) BERT has become one of the most popular encoders for obtaining state-of-the-art NLP task performance. It has been shown Conneau et al. (2018); Tenney et al. (2019) that besides rich semantic information, implicit language structure knowledge can be captured by a deep BERT Vig and Belinkov (2019); Jawahar et al. (2019); Goldberg (2019). However, the structure features learnt by the vanilla Transformer LM are insufficient for NLP tasks that heavily rely on syntactic or linguistic knowledge Hao et al. (2019). Some efforts have been devoted to improving the structure-learning ability of Transformer LMs by installing novel syntax-attention mechanisms Ahmed et al. (2019); Wang et al. (2019). Nevertheless, several limitations can be observed.

First, according to recent findings from probing tasks Conneau et al. (2018); Tenney et al. (2019); Goldberg (2019), syntactic structure representations are best retained right at the middle layers Vig and Belinkov (2019); Jawahar et al. (2019). Nevertheless, existing tree Transformers employ traditional full-scale training over the whole deep Transformer architecture (as shown in Figure 1(a)), consequently weakening the upper-layer semantic learning that can be crucial for end tasks. Second, these tree Transformer methods encode either standalone constituency or dependency structure, while different tasks can depend on varying types of structural knowledge. The constituency and dependency representations of syntactic structure share underlying linguistic characteristics, but the former focuses on disclosing phrasal continuity while the latter indicates dependency relations among elements. For example, semantic parsing tasks depend more on dependency features Rabinovich et al. (2017); Xia et al. (2019), while constituency information is much needed for sentiment classification Socher et al. (2013).

Figure 1: Full-layer multi-task learning for structural training (left), and the middle-layer training for deep structure-aware Transformer LM (right).

In this paper, we aim to retrofit a structure-aware Transformer LM for facilitating end tasks. • On the one hand, we propose a structure learning module for the Transformer LM, exploiting syntactic distance as the measurement for encoding both phrasal constituency and dependency connections. • On the other hand, as illustrated in Figure 1, to better coordinate structural learning and semantic learning, we employ a middle-layer structural training strategy that integrates syntactic structures into the main language modeling task under a multi-task scheme, encouraging the induction of structural information to take place at the most suitable layer. • Last but not least, we perform structure-aware fine-tuning with end-task training, so that the learned syntactic knowledge accords most closely with the end task needs.

We conduct experiments on language modeling and a wide range of NLP tasks. Results show that the structure-aware Transformer retrofitted via our proposed middle-layer training strategy achieves better language perplexity while inducing high-quality syntactic phrases. Besides, the LM after structure-aware fine-tuning gives significantly improved performance on various end tasks, including semantic-dependent and syntactic-dependent tasks. We also find that supervised structured pre-training brings more benefits to syntactic-dependent tasks, while unsupervised LM pre-training brings more benefits to semantic-dependent tasks. Further experimental results on unsupervised structure induction demonstrate that different NLP tasks rely on varying types of structural knowledge as well as distinct granularities of phrases, and our retrofitting method helps to induce structural phrases that are most adapted to the needs of end tasks.

2 Related Work

Contextual language modeling.

Contextual language models pre-trained on large-scale corpora have witnessed significant advances Peters et al. (2018); Radford et al. (2018); Howard and Ruder (2018); Yang et al. (2019); Devlin et al. (2019); Dai et al. (2019). In contrast to traditional static, context-independent word embeddings, contextual language models strengthen word representations by dynamically encoding the contextual sentence of each word during pre-training. By further fine-tuning on end tasks, the contextualized word representations from language models can provide the most task-related, context-sensitive features Peters et al. (2018). In this work, we follow the line of Transformer-based Vaswani et al. (2017) LMs (e.g., BERT), considering their prominence.

Structure induction.

The idea of introducing tree structures into deep models for structure-aware language modeling has long been explored by supervised structure learning, which generally relies on annotated parse trees during training and maximizes the joint likelihood of sentence-tree pairs Socher et al. (2010, 2013); Tai et al. (2015); Yazdani and Henderson (2015); Dyer et al. (2016); Alvarez-Melis and Jaakkola (2017); Aharoni and Goldberg (2017); Eriguchi et al. (2017); Wang et al. (2018); Gū et al. (2018).

There has been much attention paid to the unsupervised grammar induction task Williams et al. (2017); Shen et al. (2018a, b); Kuncoro et al. (2018); Kim et al. (2019a); Luo et al. (2019); Drozdov et al. (2019); Kim et al. (2019b). For example, PRPN Shen et al. (2018a) computes the syntactic distance of word pairs. On-LSTM Shen et al. (2018b) allows hidden neurons to learn long-term or short-term information via a gate mechanism. URNNG Kim et al. (2019b) applies amortized variational inference, encouraging the decoder to generate reasonable tree structures. DIORA Drozdov et al. (2019) uses inside-outside dynamic programming to compose latent representations from all possible binary trees. PCFG Kim et al. (2019a) achieves grammar induction with a probabilistic context-free grammar. Unlike these recurrent-network-based structure-aware LMs, our work focuses on structure learning for a deep Transformer LM.

Structure-aware Transformer language model.

Some efforts have been made to analyze Transformer-based pre-trained language models (e.g., BERT) by visualizing the attention Vig and Belinkov (2019); Kovaleva et al. (2019); Hao et al. (2019) or via probing tasks Jawahar et al. (2019); Goldberg (2019). They find that latent language structure knowledge is best retained at the middle layers of BERT Vig and Belinkov (2019); Jawahar et al. (2019); Goldberg (2019). Ahmed et al. (2019) employ a decomposable attention mechanism to recursively learn the tree structure for the Transformer. Wang et al. (2019) integrate tree structures into the Transformer via constituency attention. However, these Transformer LMs suffer from full-scale structural training and a monotonous type of structure, limiting the performance of structured LMs on end tasks. Our work is partially inspired by Shen et al. (2018a) and Luo et al. (2019) on employing syntax distance measurements, while their works focus on syntax learning with recurrent LMs.

3 Model

Figure 2: Overall framework of the retrofitted structure-aware Transformer language model.

The proposed structure-aware Transformer language model mainly consists of two components, the Transformer encoder and the structure learning module, as illustrated in Figure 2.

3.1 Transformer Encoder

The language model is built on $N$-layer Transformer blocks. Each Transformer layer applies multi-head self-attention in combination with a feed-forward network, layer normalization and residual connections. Specifically, the attention weights are computed in parallel via:

\bm{E} = \text{softmax}\left(\frac{\bm{Q}\bm{K}^{T}}{\sqrt{d}}\right)\bm{V} = \text{softmax}\left(\frac{(t\cdot\bm{x})(t\cdot\bm{x})^{T}}{\sqrt{d}}\right)(t\cdot\bm{x})   (1)

where $\bm{Q}$ (query), $\bm{K}$ (key) and $\bm{V}$ (value) in the multi-head setting process the input $\bm{x}=\{x_{1},\cdots,x_{n}\}$ $t$ times.

Given an input sentence $\bm{x}$, the output contextual representation of the $l$-th Transformer layer can be formulated as:

\{\bm{h}^{l}_{1},\cdots,\bm{h}^{l}_{n}\} = \text{Trm}(\{x_{1},\cdots,x_{n}\}) = \eta(\Phi(\eta(\bm{E}^{l}))+\bm{E}^{l})   (2)

where $\eta$ is the layer normalization operation and $\Phi$ is a feed-forward network. In this work, the output contextual representations $\bm{h}^{l}=\{\bm{h}^{l}_{1},\cdots,\bm{h}^{l}_{n}\}$ of the middle layers can be used to learn the structure $y_{struc}$, and the one at the final layer will be used for language modeling or end-task training $y_{task}$.
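For concreteness, the following is a minimal PyTorch sketch of one such Transformer block implementing Eqs. (1)-(2). The hidden size, head count and feed-forward width mirror the BERT-base configuration of Section 5.1, while the GELU activation and the exact placement of the residual connection are implementation assumptions rather than details given in the paper.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(                      # Phi in Eq. (2)
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)             # eta in Eq. (2)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                              # x: (batch, n, d_model)
        # E = softmax(Q K^T / sqrt(d)) V  (Eq. 1), plus a residual connection
        e, _ = self.attn(x, x, x)
        e = e + x
        # h = eta(Phi(eta(E)) + E)  (Eq. 2)
        return self.norm2(self.ffn(self.norm1(e)) + e)
```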

3.2 Unsupervised Syntax Learning Module

The structure learning module is responsible for unsupervisedly generating phrases, providing structure-aware language modeling for the host LM.

Syntactic context.

We extract context representations from the Transformer middle layers for the subsequent syntax learning. We optimize the structure-aware Transformer LM by focusing the structure knowledge injection on three middle layers: the $(l-1)$-th, $l$-th and $(l+1)$-th. Note that although we only apply structural attending to the selected layers, structure learning can still enhance lower layers via back-propagation.

Specifically, we take the first of the three chosen layers as the word context $\bm{C}^{\Psi}=\bm{h}^{l-1}$. For the phrasal context $\bm{C}^{\Omega}=\{\bm{c}^{\Omega}_{1},\cdots,\bm{c}^{\Omega}_{n}\}$, we make use of the contextual representations from the three chosen layers via a weighted sum:

\bm{C}^{\Omega}=\alpha_{l-1}\cdot\bm{h}^{l-1}+\alpha_{l}\cdot\bm{h}^{l}+\alpha_{l+1}\cdot\bm{h}^{l+1}   (3)

where $\alpha_{l-1}$, $\alpha_{l}$ and $\alpha_{l+1}$ are sum-to-one trainable coefficients. Rich syntactic representations are expected to be captured in $\bm{C}^{\Omega}$ by the LM.
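A minimal sketch of Eq. (3) is given below. The initial coefficient values follow Section 5.1; keeping them sum-to-one through a softmax over trainable logits is an implementation assumption, not a detail stated in the paper.

```python
import torch
import torch.nn as nn

class PhrasalContext(nn.Module):
    """Weighted sum of the three chosen middle-layer outputs (Eq. 3)."""
    def __init__(self):
        super().__init__()
        # Initial values 0.35 / 0.40 / 0.25 follow Section 5.1; the sum-to-one
        # constraint is enforced here via a softmax over trainable logits (assumption).
        self.logits = nn.Parameter(torch.log(torch.tensor([0.35, 0.40, 0.25])))

    def forward(self, h_prev, h_mid, h_next):          # each: (batch, n, d_model)
        alpha = torch.softmax(self.logits, dim=0)
        return alpha[0] * h_prev + alpha[1] * h_mid + alpha[2] * h_next
```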

Structure measuring.

Figure 3: Simultaneously measuring dependency relations (1) and phrasal constituency (3) based on the example sentence (2) by employing syntax distance (4).

In this study, we measure syntax by employing syntax distance. The general concept of the syntax distance $d_{i}$ can be reckoned as a metric (i.e., distance) from a certain word $x_{i}$ to the root node within the dependency tree Shen et al. (2018a). For instance in Figure 3, the head word ‘remembered’ $x_{i}$ and its dependent word ‘James’ $x_{j}$ follow $d_{i}<d_{j}$. In this work, to maintain both the dependency and phrasal constituents simultaneously, we add additional constraints on words and phrases. Given two words $x_{i}$ and $x_{j}$ ($0\leq i<j\leq n$) in one phrase, we define $d_{i}<d_{j}$. This can be demonstrated by the word pair ‘the’ and ‘story’. If they are in different phrases (note that we cannot explicitly define the granularity, i.e., width, of every phrase in the constituency tree; instead, it is decided heuristically by the structure learning module), e.g., $S_{u}$ and $S_{v}$, the corresponding inner-phrasal head words follow $d_{i}$ (in $S_{u}$) $>$ $d_{j}$ (in $S_{v}$), e.g., ‘story’ and ‘party’.

In the structure learning module, we first compute the syntactic distances $\bm{d}=\{d_{1},\cdots,d_{n}\}$ for each word from the word context via a convolutional network:

\{d_{1},\cdots,d_{n}\}=\Phi(\text{CNN}(\{\bm{c}^{\Psi}_{1},\cdots,\bm{c}^{\Psi}_{n}\}))   (4)

where $d_{i}$ is a scalar and $\Phi$ is a linearization layer. With such syntactic distances, we expect both dependency and constituency syntax to be well captured by the LM.
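The following sketch illustrates Eq. (4). The convolution kernel size, padding and the ReLU non-linearity are illustrative assumptions, since the paper only specifies a CNN followed by a linearization $\Phi$.

```python
import torch
import torch.nn as nn

class SyntaxDistance(nn.Module):
    """Sketch of Eq. (4): one scalar syntactic distance d_i per word."""
    def __init__(self, d_model=768, kernel_size=3):
        super().__init__()
        # Kernel size, padding and activation are illustrative assumptions.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.linear = nn.Linear(d_model, 1)            # Phi: linearization to a scalar

    def forward(self, c_word):                         # c_word: (batch, n, d_model)
        h = self.conv(c_word.transpose(1, 2)).transpose(1, 2)
        return self.linear(torch.relu(h)).squeeze(-1)  # distances: (batch, n)
```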

Syntactic phrase generating.

Consider a word $x_{i}$ opening an induced phrase $S_{m}=[x_{i},\cdots,x_{i+w}]$ in a sentence, where $w$ is the phrase width. We need to decide the probability $p^{*}(x_{j})$ that the word $x_{j}$ ($j=i+w+1$, i.e., the first word outside phrase $S_{m}$) belongs to $S_{m}$:

p^{*}(x_{j})=\prod_{k=i}^{i+w}\text{sigmoid}(d_{j}-d_{k})   (5)

We set the initial width $w=1$. If $p^{*}(x_{j})$ is above the window threshold $\lambda$, $x_{j}$ is considered inside the phrase; otherwise, the phrase $S_{m}$ is closed and a new phrase starts at $x_{j}$. We conduct this phrasal searching procedure incrementally to segment all the phrases in a sentence. Given an induced phrase $S_{m}=[x_{i},\cdots,x_{i+w}]$, we obtain its embedding $\bm{s}_{m}$ via a phrasal attention:

u_{i} = \text{softmax}(d_{i}\cdot p^{*}(x_{i}))   (6)
\bm{s}_{m} = \sum_{i}^{i+w}u_{i}\cdot\bm{c}^{\Psi}_{i}   (7)
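To make the segmentation procedure concrete, here is a minimal sketch. It assumes `d` is a 1-D tensor of distances for a single sentence, `c_word` holds the per-word contexts $\bm{c}^{\Psi}$ of that sentence, and `p_star` gives each word's membership probability from Eq. (5); the greedy left-to-right loop and the helper names are illustrative, not the authors' code.

```python
import torch

def segment_phrases(d, lam=0.5):
    """Greedy phrase segmentation from per-word syntactic distances d (Eq. 5).
    Returns a list of (start, end) index pairs, end inclusive."""
    phrases, start = [], 0
    for j in range(1, len(d)):
        # p*(x_j) = prod_k sigmoid(d_j - d_k), over the currently open phrase
        p = torch.sigmoid(d[j] - d[start:j]).prod()
        if p <= lam:                 # below threshold: close the phrase, restart at x_j
            phrases.append((start, j - 1))
            start = j
    phrases.append((start, len(d) - 1))
    return phrases

def phrase_embedding(d, p_star, c_word, start, end):
    """Phrasal attention over one induced phrase (Eqs. 6-7)."""
    u = torch.softmax(d[start:end + 1] * p_star[start:end + 1], dim=0)
    return (u.unsqueeze(-1) * c_word[start:end + 1]).sum(dim=0)
```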

4 Structure-aware Learning

Multi-task training for language modeling and structure induction.

Different from traditional language models, a Transformer-based LM employs masked language modeling (MLM), which can capture larger contexts. Likewise, we predict a masked word using the corresponding context representation at the top layer:

p^{\text{W}}(y_{i}|\bm{x}) = \text{softmax}(\bm{c}_{i}|\bm{x})   (8)
\mathcal{L}_{\text{W}} = \sum_{i}^{k}\log p^{\text{W}}(y_{i}|\bm{x})   (9)

On the other hand, the purpose of unsupervised syntactic induction is to encourage the model to induce the $\bm{s}_{m}$ that is most likely entailed by the phrasal context $\bm{c}^{\Omega}_{i}$. The underlying logic is that, if the initial Transformer LM can capture linguistic syntax knowledge, then after iterations of learning with the structure learning module, the induced structure can be greatly amplified and enhanced Luo et al. (2019). We thus define the following probability:

p^{\text{G}}(\bm{s}_{m}|\bm{c}^{\Omega}_{i})=\frac{1}{1+\exp(-\bm{s}^{T}_{m}\cdot\bm{c}^{\Omega}_{i})}   (10)

Additionally, to enhance the syntax learning, we employ negative sampling:

\mathcal{L}_{\text{Neg}}=\frac{1}{n}\sum_{j}^{n}p^{\text{G}}(\hat{\bm{s}}^{T}_{j}|\bm{c}^{\Omega}_{i})   (11)

where $\hat{\bm{s}}_{j}$ is a randomly selected negative phrase. The final objective for structure learning is:

\mathcal{L}_{\text{G}}=\sum_{i}^{K}\Big(\sum^{M}_{m}\big(1-p^{\text{G}}(\bm{s}_{m}|\bm{c}^{\Omega}_{i})\big)+\mathcal{L}_{\text{Neg}}\Big)   (12)

We employ multi-task learning to simultaneously train our LM for both word prediction and structure induction. Thus, the overall target is to minimize the following multi-task loss:

\mathcal{L}_{\text{pre}}=\mathcal{L}_{\text{W}}+\gamma^{\text{pre}}\cdot\mathcal{L}_{\text{G}}   (13)

where $\gamma^{\text{pre}}$ is a regulating coefficient.
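As a compact reference, the sketch below combines Eqs. (9)-(13) into a single quantity to minimize. The tensor inputs (per-token MLM log-probabilities, phrase probabilities $p^{\text{G}}$ for induced and negative phrases) and the sign convention for the log-likelihood term are assumptions about how the losses would be wired up in practice.

```python
def pretrain_loss(mlm_log_probs, struct_probs, neg_probs, gamma_pre=0.5):
    """Multi-task pre-training objective (Eqs. 9-13), written as a quantity
    to minimize; the sign of the MLM term is an implementation assumption."""
    l_w = -mlm_log_probs.sum()                 # masked-word log-likelihood, Eq. (9)
    l_neg = neg_probs.mean()                   # negative-sampling term, Eq. (11)
    l_g = (1.0 - struct_probs).sum() + l_neg   # structure objective, Eq. (12)
    return l_w + gamma_pre * l_g               # Eq. (13), gamma_pre = 0.5 (Sec. 5.1)
```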

Supervised syntax injection.

Our default structure-aware LM induces syntax unsupervisedly at the pre-training stage, as elaborated above. Alternatively, if we leverage gold (or a priori) syntax distance information for phrases in Eq. (7), we achieve supervised structure injection.

Unsupervised structure fine-tuning.

We aim to improve the learnt structural information to better facilitate end tasks. Therefore, during the fine-tuning stage on end tasks, we keep the structure learning module trainable:

\mathcal{L}_{\text{fine}}=\mathcal{L}_{\text{task}}+\gamma^{\text{fine}}\cdot\mathcal{L}_{\text{G}}   (14)

where $\mathcal{L}_{\text{task}}$ refers to the loss function of the end task and $\gamma^{\text{fine}}$ is a regulating coefficient. Note that to achieve the best structural fine-tuning, supervised structure injection is unnecessary, and we do not allow supervised structure aggregation at the fine-tuning stage.
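A minimal sketch of Eq. (14), with the default $\gamma^{\text{fine}}=0.23$ taken from Section 5.1:

```python
def finetune_loss(task_loss, struct_loss, gamma_fine=0.23):
    """Structure-aware fine-tuning objective (Eq. 14): the end-task loss plus
    the structure-learning loss L_G, weighted by gamma_fine (0.23, Sec. 5.1)."""
    return task_loss + gamma_fine * struct_loss
```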

Our approach is model-agnostic, as we realize syntax induction via a standalone structure learning module that is disentangled from the host LM. Thus the method can be applied to various Transformer-based LM architectures.

5 Experiments

5.1 Experimental Setups

We employ the same architecture as the BERT base model (https://github.com/google-research/bert), a 12-layer Transformer with 12 attention heads and a 768-dimensional hidden size. To enrich our experiments, we also consider the Google pre-trained weights as initialization. We use Adam as our optimizer with an initial learning rate in [8e-6, 1e-5, 2e-5, 3e-5] and an L2 weight decay of 0.01. The batch size is selected from [16, 24, 32]. We set the initial values of the coefficients $\alpha_{l-1}$, $\alpha_{l}$ and $\alpha_{l+1}$ to 0.35, 0.4 and 0.25, respectively. The pre-training coefficient $\gamma^{\text{pre}}$ is set to 0.5, and the fine-tuning coefficient $\gamma^{\text{fine}}$ to 0.23. These values gave the best results in our development experiments. Our implementation is based on the PyTorch library (https://pytorch.org/).

Besides, for supervised structure learning in our experiments, we use the state-of-the-art BiAffine dependency parser Dozat and Manning (2017) to parse sentences for all relevant datasets, and the Self-Attentive parser Kitaev and Klein (2018) to obtain constituency structures. Trained on the English Penn Treebank (PTB) corpus Marcus et al. (1993), the dependency parser achieves 95.2% UAS and 93.4% LAS, and the constituency parser achieves a 92.6% F1 score. With the auto-parsed annotations, we can calculate the syntax distances (substituting the ones in Eq. 4) and obtain the corresponding phrasal embeddings (in Eq. 7).

Figure 4: Development experiments on syntactic probing tasks at varying Transformer layer.

5.2 Development Experiments

Structural learning layers.

We first validate at which layer depth the structure-aware Transformer LM achieves the best performance when integrating our retrofitting method. We thus design probing experiments on the following two syntactic tasks. 1) Constituency phrase parsing seeks to generate grammar phrases on the PTB dataset and evaluates whether induced constituent spans also exist in the gold treebank. 2) Dependency alignment computes the proportion of Transformer attention connecting tokens in a dependency relation Vig and Belinkov (2019):

\text{Score}=\frac{\sum_{x\in X}\sum_{i=1}^{x}\sum_{j=1}^{x}\alpha_{i,j}(x)\cdot\text{dep}(x_{i},x_{j})}{\sum_{x\in X}\sum_{i=1}^{x}\sum_{j=1}^{x}\alpha_{i,j}(x)}   (15)

where $\alpha_{i,j}(x)$ is the attention weight, and $\text{dep}(x_{i},x_{j})$ is an indicator function (1 if $x_{i}$ and $x_{j}$ are in a dependency relation and 0 otherwise). The experiments are based on English Wikipedia, following Vig and Belinkov (2019).
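A possible implementation of Eq. (15) is sketched below; the per-sentence attention matrices and binary dependency indicators are assumed to be precomputed elsewhere (e.g., attention extracted from the probed layer and dependencies from an external parser).

```python
def dependency_alignment_score(attn_list, dep_list):
    """Eq. (15): share of attention mass linking word pairs in a dependency
    relation. attn_list / dep_list hold one (n, n) attention matrix and one
    binary (n, n) dependency-indicator matrix per sentence."""
    num = den = 0.0
    for attn, dep in zip(attn_list, dep_list):
        num += float((attn * dep).sum())
        den += float(attn.sum())
    return num / den
```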

Figure 5: Constituency parsing under different $\lambda$.

As shown in Figure 4, both unsupervised and supervised phrase parsing results are best at layer 6. The attention also aligns with dependency relations most strongly in the middle layers (5-6), consistent with findings from previous work Tenney et al. (2019); Vig and Belinkov (2019). Both probing tasks indicate that our proposed middle-layer structure training is practical. We thus inject the structure via the structure learning module at the 6-th layer ($l=6$).

System | TreeDepth | TopConst | Tense | SOMO | NER | SST | Rel | SRL | Avg.
(Syntactic tasks: TreeDepth, TopConst, Tense, SOMO; semantic tasks: NER, SST, Rel, SRL)
• w/o Initial Weight:
Trm | 25.31 | 40.32 | 61.06 | 50.11 | 89.22 | 86.21 | 84.70 | 88.30 | 65.65
RvTrm | 29.52 | 45.01 | 63.83 | 51.42 | 89.98 | 86.66 | 85.02 | 88.94 | 67.55
Tree+Trm | 30.37 | 46.58 | 65.83 | 53.08 | 90.62 | 87.25 | 84.97 | 88.70 | 68.43
PI+TrmXL | 31.28 | 47.06 | 63.78 | 52.36 | 90.34 | 87.09 | 85.22 | 89.02 | 68.27
Ours+Trm +usp. | 33.98 | 49.69 | 66.39 | 57.04 | 92.24 | 90.48 | 87.05 | 90.87 | 70.74
Ours+Trm +sp. | 37.35 | 57.68 | 72.04 | 56.41 | 91.86 | 90.06 | 86.34 | 90.54 | 73.12
Ours+Trm +syn-embed. | 36.28 | 54.30 | 67.61 | 55.68 | 91.87 | 87.10 | 86.87 | 89.41 | 71.14
• Initial Weight:
BERT | 38.61 | 79.37 | 90.61 | 65.31 | 92.40 | 93.50 | 89.25 | 92.20 | 80.16
Ours+BERT(usp.) | 45.82 | 88.64 | 94.68 | 67.84 | 94.28 | 94.67 | 90.41 | 93.12 | 83.68
Table 1: Structure-aware Transformer LM for end tasks.
System | Const. (F1) | Ppl.
PRPN | 42.8 | -
On-LSTM | 49.4 | -
URNNG | 52.4 | -
DIORA | 56.2 | -
PCFG | 60.1 | -
Trm | 22.7 | 78.6
RvTrm | 47.0 | 50.3
Tree+Trm | 52.0 | 45.7
PI+TrmXL | 56.2 | 43.4
Ours+Trm +usp. | 60.3 | 37.0
Ours+Trm +sp. | 68.8 | 29.2
BERT | 31.3 | 21.5
Ours+BERT(usp.) | 65.2 | 16.2
Table 2: Performance on constituency parsing and language modeling.

Phrase generation threshold.

We introduce a hyper-parameter $\lambda$ as a threshold to decide whether a word belongs to a given phrase during the phrasal generation step. We explore the best $\lambda$ value based on the same parsing tasks. As shown in Figure 5, with $\lambda=0.5$ for unsupervised induction and $\lambda=0.7$ for supervised induction, the induced phrasal quality is highest. We therefore use these $\lambda$ values for all remaining experiments.

5.3 Structure-aware Language Modeling

We evaluate the effectiveness of our proposed retrofitted structure-aware LM after pre-training. We first compare the performance on language modeling (since the Transformer sees its subsequent words bidirectionally, we measure the perplexity on masked words and thus avoid directly comparing with recurrent LMs). From the results shown in Table 2, our retrofitted Transformer yields better language perplexity in both the unsupervised (37.0) and supervised (29.2) settings. This proves that our middle-layer structure training strategy can effectively relieve the negative mutual influence of structure learning on semantic learning, while inducing high-quality structural phrases. We can also conclude that language models with more successful structural knowledge can better encode effective intrinsic language patterns, which is consistent with prior studies Kim et al. (2019b); Wang et al. (2019); Drozdov et al. (2019).

We also compare constituency parsing with state-of-the-art structure-aware models, including 1) recurrent-based models described in §2: PRPN Shen et al. (2018a), On-LSTM Shen et al. (2018b), URNNG Kim et al. (2019b), DIORA Drozdov et al. (2019), PCFG Kim et al. (2019a), and 2) Transformer-based methods: Tree+Trm Wang et al. (2019), RvTrm Ahmed et al. (2019), PI+TrmXL Luo et al. (2019), and the BERT model initialized with pre-trained weights. As shown in Table 2, all the structure-aware models give good parsing results compared with non-structured models. Our retrofitted Transformer LM gives the best performance (60.3% F1) in unsupervised induction. Combined with the supervised auto-labeled parses, it gives the highest F1 score (68.8%).

Figure 6: Visualization of attention heads (heatmap) and the corresponding syntax distances (bar chart).

5.4 Fine-tuning for End Tasks

We validate the effectiveness of our method on end tasks with structure-aware fine-tuning. All systems are first pre-trained for structure learning and then fine-tuned with end-task training. The evaluation is performed on eight tasks, involving syntactic tasks and semantic tasks. Among the standard probing tasks, TreeDepth predicts the depth of the syntactic tree, TopConst tests the sequence of top-level constituents in the syntax tree, Tense detects the tense of the main-clause verb, and SOMO checks sensitivity to random replacement of words. We follow the same datasets and settings as previous work Conneau et al. (2018); Jawahar et al. (2019).

We also evaluate semantic tasks, including 1) NER, named entity recognition on CoNLL03 Tjong Kim Sang and De Meulder (2003), 2) SST, binary sentiment classification on the Stanford Sentiment Treebank Socher et al. (2013), 3) Rel, relation classification on SemEval10 Hendrickx et al. (2010), and 4) SRL, semantic role labeling on the CoNLL09 WSJ dataset Hajič et al. (2009). Performance is reported with the F1 score.

The results are summarized in Table 1. First, we find that structure-aware LMs bring improved performance on all tasks compared with the vanilla Transformer encoder. Second, the Transformer with our structure-aware fine-tuning achieves better results (70.74% on average) on all end tasks compared with the baseline tree Transformer LMs. This proves that our proposed middle-layer strategy benefits structural fine-tuning more than the full-layer structure training of the baselines. Third, with supervised structure learning, significant improvements can be found across all tasks.

For comparison in the supervised setting, we also replace the supervised syntax fusion in the structure learning module with auto-labeled syntactic dependency embeddings concatenated with the other input embeddings (+syn-embed.). The results are not as prominent as the supervised syntax fusion, which reflects the advantage of our proposed structure learning module. Besides, based on the task improvements from the Transformer retrofitted by our method, we can further infer that the supervised structure benefits syntactic-dependent tasks more, while the unsupervised structure benefits semantic-dependent tasks the most. Finally, the BERT model integrated with our method gives improved results (we note that the direct comparison with the BERT model is not entirely fair, because its large number of well pre-trained parameters brings an overwhelming advantage).

System | Mean | Median
RvTrm | 0.68 | 0.69
Tree+Trm | 0.60 | 0.64
PI+TrmXL | 0.54 | 0.58
Ours+Trm(usp.) | 0.50 | 0.52
Ours+Trm(sp.) | 0.32 | 0.37
Table 3: Fine-grained parsing (phrase deviation; lower is better).

6 Analysis

6.1 Induced Phrase after Pre-training

We take a further step, evaluating the fine-grained quality of phrasal structure induction after pre-training. Instead of checking whether the induced constituent spans are identical to their gold counterparts, we measure the deviation ${PhrDev}(\hat{y},y)=\sqrt{\frac{1}{N}\sum_{i}[\Delta(\hat{y}_{i},y_{i})-\overline{\Delta}]^{2}}$, where $\Delta(\hat{y}_{i},y_{i})$ is the phrasal editing distance between the induced phrase length and the gold length within a sentence, and $\overline{\Delta}$ is the averaged editing distance. If all the predicted phrases are the same as the ground truth, or all deviate from it equally, ${PhrDev}(\hat{y},y)=0$, meaning that the phrases are induced with the maximum consistency, and vice versa. We report statistics over all sentences in Table 3. Our method unsupervisedly generates higher-quality structural phrases, while the best injection of constituency knowledge into the LM is achieved in the supervised manner.
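A small sketch of the PhrDev statistic for one sentence, assuming the per-phrase editing distances $\Delta(\hat{y}_{i},y_{i})$ have already been computed:

```python
import math

def phrase_deviation(edit_distances):
    """PhrDev for one sentence: standard deviation of the per-phrase editing
    distances between induced and gold phrase lengths (Section 6.1)."""
    n = len(edit_distances)
    mean = sum(edit_distances) / n
    return math.sqrt(sum((d - mean) ** 2 for d in edit_distances) / n)
```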

6.2 Fine-tuned Structures with End Tasks

Interpreting fine-tuned syntax.

To interpret the fine-tuned structures, we visualize the Transformer attention heads from the chosen $l$-th layer and the syntax distances of the sentence. We exhibit three examples from SST, Rel and SRL, respectively, as shown in Figure 6. Overall, our method can help to induce clear structures of both dependency and constituency. Interestingly, different types of tasks rely on different granularities of phrases. Comparing the heat maps and syntax distances, the induced phrasal constituents on SST are longer than those on SRL. This is because the sentiment classification task demands more phrasal composition features, while the SRL task requires more fine-grained phrases. In addition, we find that the syntax distances in SRL and Rel have higher variance than those in SST. Intuitively, a larger deviation of syntax distances in a sentence indicates more demand for the interdependent information between elements, while a smaller deviation points to phrasal constituency. This reveals that SRL and Rel rely more on dependency syntax, while SST is more relevant to constituents, consistent with previous studies Socher et al. (2013); Rabinovich et al. (2017); Xia et al. (2019); Fei et al. (2020).

Figure 7: Distributions of dependency and constituency syntax in different tasks. Blue indicates the predominance of dependency, and red that of constituency.

Distributions of heterogeneous syntax for different tasks.

Based on the above analysis, we further analyze the distributions of dependency and constituency structures after fine-tuning on different tasks. Technically, we calculate the mean absolute difference of syntax distances between each element $x_{i}$ and the sub-root node $x_{r}$ in a sentence: $Diff=\frac{1}{N}\sum_{i}^{N}|d_{i}-d_{r}|$. We then linearly normalize these values into [0,1] over all sentences in the corpus of each task and report the statistics in Figure 7. Intuitively, the larger the value is, the more the task depends on dependency syntax; otherwise, it depends on constituency structure. Overall, the distributions of dependency structures and phrasal constituents in the fine-tuned LM vary among tasks, verifying that different tasks depend on distinct types of structural knowledge. For example, TreeDepth, Rel and SRL are most supported by dependency structure, while TopConst and SST benefit from constituency the most. SOMO and NER can gain from both types.
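A sketch of the Diff statistic for one sentence, assuming the syntax distances and the index of the sub-root word are given:

```python
def syntax_distance_diff(distances, root_index):
    """Diff = (1/N) * sum_i |d_i - d_r|: mean absolute difference between each
    word's syntax distance and that of the sub-root word (Section 6.2)."""
    d_r = distances[root_index]
    return sum(abs(d - d_r) for d in distances) / len(distances)
```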

Phrase type | SST: Ours+Trm | SST: Tree+Trm | SRL: Ours+Trm | SRL: Tree+Trm
NP | 0.48 | 0.45 | 0.37 | 0.53
VP | 0.21 | 0.28 | 0.36 | 0.21
PP | 0.08 | 0.14 | 0.17 | 0.06
ADJP | 0.10 | 0.05 | 0.05 | 0.12
ADVP | 0.07 | 0.02 | 0.03 | 0.02
Other | 0.06 | 0.06 | 0.02 | 0.06
Avg. Len. | 3.88 | 3.22 | 2.69 | 3.36
Table 4: Proportion of each type of induced phrase.

Phrase types.

Finally, we explore the diversity of phrasal syntax required by two representative end tasks, SST and SRL. We first look at the statistical proportions of different types of induced phrases (five main types are considered: noun phrase (NP), verb phrase (VP), prepositional phrase (PP), adjective phrase (ADJP) and adverb phrase (ADVP)). As shown in Table 4, our method tends to induce more task-relevant phrases, and the lengths of the induced phrases adapt to the task. Concretely, for SST the fine-tuned structure-aware Transformer generates more NPs and longer phrases, and for SRL it yields roughly equal numbers of NPs and VPs with shorter phrases. This evidently gives rise to the better task performance. In contrast, the average lengths of the phrases induced by the Tree+Trm model remain nearly unchanged between SST (3.22) and SRL (3.36).

7 Conclusion

We presented a retrofitting method for structure-aware Transformer-based language models. We adopted the syntax distance to encode both constituency and dependency structure. To relieve the conflict between structure learning and semantic learning in the Transformer LM, we proposed a middle-layer structure learning strategy under a multi-task scheme. Results showed that the structure-aware Transformer retrofitted via our proposed method achieved better language perplexity while inducing high-quality syntactic phrases. Furthermore, our LM after structure-aware fine-tuning gave significantly improved performance on both semantic-dependent and syntactic-dependent tasks, while yielding the most task-related and interpretable syntactic structures.

8 Acknowledgments

We thank the anonymous reviewers for their valuable and detailed comments. This work is supported by the National Natural Science Foundation of China (No. 61772378, No. 61702121), the National Key Research and Development Program of China (No. 2017YFC1200500), the Research Foundation of Ministry of Education of China (No. 18JZD015), the Major Projects of the National Social Science Foundation of China (No. 11&ZD189), the Key Project of State Language Commission of China (No. ZDI135-112) and Guangdong Basic and Applied Basic Research Foundation of China (No. 2020A151501705).

References

  • Aharoni and Goldberg (2017) Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. CoRR, abs/1704.04743.
  • Ahmed et al. (2019) Mahtab Ahmed, Muhammad Rifayat Samee, and Robert E. Mercer. 2019. You only need attention to traverse trees. In Proceedings of the ACL, pages 316–322.
  • Alvarez-Melis and Jaakkola (2017) David Alvarez-Melis and Tommi S. Jaakkola. 2017. Tree-structured decoding with doubly-recurrent neural networks. In Proceedings of the ICLR.
  • Conneau et al. (2018) Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the ACL, pages 2126–2136.
  • Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the ACL, pages 2978–2988.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL, pages 4171–4186.
  • Dozat and Manning (2017) Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the ICLR.
  • Drozdov et al. (2019) Andrew Drozdov, Patrick Verga, Mohit Yadav, Mohit Iyyer, and Andrew McCallum. 2019. Unsupervised latent tree induction with deep inside-outside recursive autoencoders. CoRR, abs/1904.02142.
  • Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the NAACL, pages 199–209.
  • Eriguchi et al. (2017) Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. 2017. Learning to parse and translate improves neural machine translation. In Proceedings of the ACL, pages 72–78.
  • Fei et al. (2020) Hao Fei, Meishan Zhang, Fei Li, and Donghong Ji. 2020. Cross-lingual semantic role labeling with model transfer. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2427–2437.
  • Goldberg (2019) Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. CoRR, abs/1901.05287.
  • Gū et al. (2018) Jetic Gū, Hassan S. Shavarani, and Anoop Sarkar. 2018. Top-down tree structured decoding with syntactic connections for neural machine translation and parsing. In Proceedings of the EMNLP, pages 401–413.
  • Hajič et al. (2009) Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the CoNLL, pages 1–18.
  • Hao et al. (2019) Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2019. Visualizing and understanding the effectiveness of BERT. In Proceedings of the EMNLP, pages 4141–4150.
  • Hendrickx et al. (2010) Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38.
  • Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
  • Jawahar et al. (2019) Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the ACL, pages 3651–3657.
  • Kim et al. (2019a) Yoon Kim, Chris Dyer, and Alexander Rush. 2019a. Compound probabilistic context-free grammars for grammar induction. In Proceedings of the ACL, pages 2369–2385.
  • Kim et al. (2019b) Yoon Kim, Alexander Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. 2019b. Unsupervised recurrent neural network grammars. In Proceedings of the NAACL, pages 1105–1117.
  • Kitaev and Klein (2018) Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the ACL, pages 2676–2686.
  • Kovaleva et al. (2019) Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. CoRR, abs/1908.08593.
  • Kuncoro et al. (2018) Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the ACL, pages 1426–1436.
  • Luo et al. (2019) Hongyin Luo, Lan Jiang, Yonatan Belinkov, and James Glass. 2019. Improving neural language models by segmenting, attending, and predicting the future. In Proceedings of the ACL, pages 1483–1493.
  • Marcus et al. (1993) Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
  • Merity et al. (2017) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer sentinel mixture models. In Proceedings of the ICLR.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the NIPS, pages 3111–3119.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the EMNLP, pages 1532–1543.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the NAACL, pages 2227–2237.
  • Rabinovich et al. (2017) Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the ACL, pages 1139–1149.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical Report.
  • Shen et al. (2018a) Yikang Shen, Zhouhan Lin, Chin-Wei Huang, and Aaron C. Courville. 2018a. Neural language modeling by jointly learning syntax and lexicon. In Proceedings of the ICLR.
  • Shen et al. (2018b) Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron C. Courville. 2018b. Ordered neurons: Integrating tree structures into recurrent neural networks. CoRR, abs/1810.09536.
  • Socher et al. (2010) Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2010. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS-2010 Deep Learning and Unsupervised Feature Learning Workshop.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the EMNLP, pages 1631–1642.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the ACL, pages 1556–1566.
  • Tenney et al. (2019) Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? probing for sentence structure in contextualized word representations. In Proceedings of the ICLR.
  • Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the CoNLL, pages 142–147.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Vig and Belinkov (2019) Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. CoRR, abs/1906.04284.
  • Wang et al. (2018) Xinyi Wang, Hieu Pham, Pengcheng Yin, and Graham Neubig. 2018. A tree-based decoder for neural machine translation. In Proceedings of the EMNLP, pages 4772–4777.
  • Wang et al. (2019) Yaushian Wang, Hung-Yi Lee, and Yun-Nung Chen. 2019. Tree transformer: Integrating tree structures into self-attention. In Proceedings of the EMNLP, pages 1061–1070.
  • Williams et al. (2017) Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2017. Learning to parse from a semantic objective: It works. is it syntax? CoRR, abs/1709.01121.
  • Xia et al. (2019) Qingrong Xia, Zhenghua Li, Min Zhang, Meishan Zhang, Guohong Fu, Rui Wang, and Luo Si. 2019. Syntax-aware neural semantic role labeling. In Proceedings of the AAAI, pages 7305–7313.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.
  • Yazdani and Henderson (2015) Majid Yazdani and James Henderson. 2015. Incremental recurrent neural network dependency parser with search-based discriminative training. In Proceedings of the CoNLL, pages 142–152.