High Recall Data-to-text Generation with Progressive Edit

ChoongHan Kim
Graduate School of Artificial Intelligence
POSTECH, Pohang, South Korea

choonghankim@postech.ac.kr
Gary Geunbae Lee
Graduate School of Artificial Intelligence
POSTECH, Pohang, South Korea

gblee@postech.ac.kr

Abstract

Data-to-text (D2T) generation is the task of generating text from structured inputs. We observed that when the same target sentence is repeated twice, a Transformer-based model (T5) generates an output made up of asymmetric sentences; that is, the two sentences differ in length and quality. We call this phenomenon "Asymmetric Generation" and exploit it in D2T generation. Once asymmetric sentences are generated, we concatenate the first part of the output with a no-repeated-target and train a new model on these lengthened targets. As the output goes through this progressive edit (ProEdit), recall increases, so the edited output covers the structured inputs better than before editing. ProEdit is a simple but effective way to improve performance in D2T generation, and it achieves a new state-of-the-art result on the ToTTo dataset.

1 Introduction

Data-to-text (D2T) generation is the task of generating text from structured inputs (Reiter and Dale, 1997). Previous approaches to this task can be classified according to whether or not they adopt separate stages. One approach is to first generate the structured inputs in the correct order and then realize the whole sentence (Puduppully et al., 2019; Wang et al., 2021; Su et al., 2021). Another is to generate the whole sentence in an end-to-end (E2E) manner, using a copy mechanism (Gu et al., 2016; See et al., 2017) or only a pre-trained model (Kale and Rastogi, 2020).

The two approaches each have advantages and disadvantages. Methods with separate stages generate more confident texts than E2E models, since the first stage explicitly produces entities from the structured inputs. However, the output sentence can be awkward, or the overall performance may be worse than that of E2E models. This occurs because the output of the first stage differs from the gold label that the second stage expects: if the first stage produces a slightly incorrect result, the second stage builds on it and amplifies the error in subsequent sentences. This is often referred to as error propagation. E2E models are free from this vulnerability, but they may omit important entities from the structured inputs. To include these entities, a separate module (e.g. a copy mechanism) can be added to an E2E model, but this may produce awkward sentences and degrade system integrity. For this reason, we consider using a pre-trained model without additional modules. A pre-trained E2E Transformer model (e.g. T5 (Raffel et al., 2019)) shows competitive performance on D2T tasks (Kale and Rastogi, 2020).

Figure 1: An example of generating asymmetric sentences.

Omitting important entities from structured inputs is related to recall values. In D2T, the recall value is a metric that considers not only the target sentence, but also the structured inputs. A high recall indicates that more structured inputs are included. This metric is described in detail in Section 2.

When the same target sentence is repeated, with the two copies divided by a special token (i.e. "target_1 <SEP> target_2"), we made two observations. First, a Transformer-based model (T5) generates asymmetric sentences (Figure 1); i.e., the first part of the output, which corresponds to target_1, is longer than the second part, which corresponds to target_2. Second, the first part of the output covers the structured inputs better than the second part. We call this phenomenon Asymmetric Generation.
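The repeated-target format described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and how the `<SEP>` token is registered with the tokenizer is not specified in the text.

```python
# Illustrative sketch only: the helper name is hypothetical, and the
# separator is treated as a plain string.
SEP = "<SEP>"


def make_repeated_target(target: str, sep: str = SEP) -> str:
    """Build the 'target_1 <SEP> target_2' training target,
    where both halves are the same reference sentence."""
    return f"{target} {sep} {target}"


example = make_repeated_target("In 723, Jamri was the King of Sunda.")
print(example)
```

A model trained on such targets is expected to emit two sentences separated by the token; the asymmetry is observed between those two generated halves.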

Asymmetric Generation can be exploited to improve the recall mentioned earlier. In our experiments on the ToTTo corpus (Parikh et al., 2020) and WIKITABLET (Chen et al., 2020), the first part of the asymmetric output shows a higher recall than the second part. Its recall is even higher than that of the output of a model trained on no-repeated-targets.

We concatenate the first part with a no-repeated-target ("the first part <SEP> no-repeated-target"), then train a new model on these lengthened targets (Figure 2). This process can be repeated. We call it Progressive Edit (ProEdit); the name is derived from ProGen (Tan et al., 2020), because our method progressively edits the initial output. Our experimental results on the ToTTo corpus demonstrate the benefit of ProEdit, which achieves a new state of the art on the PARENT metric (Dhingra et al., 2019).

	ToTTo (input=418.43)					WIKITABLET (input=412.34)
T5-large	BLEU $\uparrow$	P $\uparrow$	R $\uparrow$	F1 $\uparrow$	Length (ref=86.4)	BLEU $\uparrow$	P $\uparrow$	R $\uparrow$	F1 $\uparrow$	Length (ref=627)
No-Repeated-Target	49.3	80.21	50.80	58.53	80.39	20.05	56.44	23.78	32.61	391.30
Asymmetric Generation-First	44.2	77.91	52.56	59.23	94.18	23.7	52.22	25.21	33.05	532.33
Asymmetric Generation-Second	29.2	66.62	34.88	41.72	92.11	14.47	45.39	17.78	24.50	457.72
ProEdit-1-First	43.6	78.13	54.00	60.43	97.62	23.47	49.82	25.77	32.93	617.76
ProEdit-1-Second	35.6	74.96	38.34	46.33	76.19	14.47	45.39	17.78	24.50	457.72
ProEdit-2-First	42.1	77.61	54.24	60.39	101.174	-	-	-	-	-
ProEdit-2-Second	29.2	66.62	34.88	41.72	92.11	-	-	-	-	-

Table 1: Results on the ToTTo and WIKITABLET validation sets evaluated by BLEU and PARENT. Asymmetric Generation is trained with repeated target sentences that are divided by a special token. Asymmetric Generation-First is the output before the <SEP> token. Asymmetric Generation-Second is the output after the <SEP> token. ProEdit-1 is trained with Asymmetric Generation-First concatenated to a no-repeated-target. ProEdit-1-First is the output before the <SEP> token. ProEdit-2 is trained in the same way using the output from ProEdit-1. P, R, and F1 are PARENT precision, recall, and F1 score, respectively. Scores were computed with the official scripts of each dataset (https://github.com/google-research-datasets/ToTTo and https://github.com/mingdachen/WikiTableT).

2 Related Work

PARENT Metric. The PARENT metric is introduced to evaluate generated texts from structured inputs automatically.

Its precision for n-grams of order $n$, denoted by $E_{p}^{n}$, is given by:

$$E_{p}^{n}=\frac{\sum_{g\in G_{n}^{i}}\left[Pr(g\in R_{n}^{i})+Pr(g\notin R_{n}^{i})\,w(g)\right]\#G_{n}^{i}(g)}{\sum_{g\in G_{n}^{i}}\#G_{n}^{i}(g)}\tag{1}$$

where $R_{n}^{i},G_{n}^{i}$ denote the collections of n-grams of order n in $R^{i}$ and $G^{i}$, which are the i-th target and generated text, respectively. $Pr(g\in R_{n}^{i})$ is given by $\frac{\#G_{n}^{i},R_{n}^{i}(g)}{\#G_{n}^{i}(g)}$, where $\#G_{n}^{i}(g)$ is the count of n-gram $g$ in $G_{n}^{i}$ and $\#G_{n}^{i},R_{n}^{i}(g)$ denotes the minimum of its counts in $G_{n}^{i}$ and $R_{n}^{i}$. The entailment probability $w(g)$ is the key component: it estimates whether the presence of an n-gram $g$ in a generated text is "correct" given the structured inputs. Two models have been introduced to calculate $w(g)$: the word overlap model and the co-occurrence model. In most cases the word overlap model is used, so we use it as well.
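To make Eq. 1 concrete, the following is a minimal Python sketch of the entailed precision for a single example and a single n-gram order. It is not the official PARENT script. The word overlap model is taken here to be the fraction of tokens of $g$ that appear among the table's tokens, which is our reading of Dhingra et al. (2019); all function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_overlap_w(gram, table_tokens):
    """Assumed word overlap model: share of the n-gram's tokens found in the table."""
    return sum(tok in table_tokens for tok in gram) / len(gram)


def entailed_precision(generated, reference, table_tokens, n):
    """Eq. 1 for one example: n-grams found in the reference count fully,
    the remainder is weighted by the entailment probability w(g)."""
    gen_counts = Counter(ngrams(generated, n))
    ref_counts = Counter(ngrams(reference, n))
    if not gen_counts:
        return 0.0
    numerator = 0.0
    for gram, count in gen_counts.items():
        p_in_ref = min(count, ref_counts[gram]) / count
        w = word_overlap_w(gram, table_tokens)
        numerator += (p_in_ref + (1.0 - p_in_ref) * w) * count
    return numerator / sum(gen_counts.values())


generated = "sanjaya ruled sunda".split()
reference = "jamri was king of sunda".split()
table_tokens = {"sanjaya", "jamri", "sunda"}
print(entailed_precision(generated, reference, table_tokens, n=1))
```

In this toy case "sunda" is matched by the reference, "sanjaya" is absent from the reference but supported by the table, and "ruled" is supported by neither, so two of the three unigrams receive credit.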

The recall of PARENT is computed against both the target, $E_{r}(R^{i})$, and the table, $E_{r}(T^{i})$. These are combined using a geometric average:

$$E_{r}=E_{r}(R^{i})^{1-\lambda}E_{r}(T^{i})^{\lambda}\tag{2}$$

$E_{r}(R^{i})$ is the recall of the generated sentence against the target sentence. $E_{r}(T^{i})$ is computed as follows:

$$E_{r}(T^{i})=\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\bar{r_{k}}|}LCS(\bar{r_{k}},G^{i})\tag{3}$$

$K$ denotes the number of records in the table, $\bar{r_{k}}$ is the value string of the $k$-th record, and $|\bar{r_{k}}|$ is its number of tokens. $LCS(x,y)$ denotes the length of the longest common subsequence between $x$ and $y$. The hyperparameter $\lambda$ can be set heuristically using Eq. 3, applied to the target instead of the generated text. The key idea is that if the recall of the target against the table is high, the target already covers most of the information in the structured inputs, so it can be assigned a large weight ($1-\lambda$). We use this method.
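The recall computation of Eq. 2 and Eq. 3 can be sketched as follows. This is an illustrative Python version, not the official implementation: the function names are hypothetical, record values are assumed to be pre-tokenized, and the heuristic for $\lambda$ follows our reading of the description above (the weight $1-\lambda$ is the recall of the target against the table).

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def table_recall(record_values, text_tokens):
    """Eq. 3: mean over records of LCS(value, text) / |value|."""
    values = [v for v in record_values if v]  # skip empty value strings
    if not values:
        return 0.0
    return sum(lcs_length(v, text_tokens) / len(v) for v in values) / len(values)


def heuristic_lambda(record_values, reference_tokens):
    """Assumed heuristic: 1 - lambda = recall of the target against the table."""
    return 1.0 - table_recall(record_values, reference_tokens)


def combined_recall(reference_recall, tab_recall, lam):
    """Eq. 2: geometric average of the two recall terms."""
    return (reference_recall ** (1.0 - lam)) * (tab_recall ** lam)


records = [["723", "–", "732"], ["sunda", "kingdom"]]
generated = "sanjaya ruled the sunda kingdom from 723".split()
print(table_recall(records, generated))
```

Here the generated text covers one of the three tokens of the first record and both tokens of the second, so the table recall is the mean of 1/3 and 1.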

3 Progressive Edit of Text

We add the first part of the asymmetric generation output to an existing target and train a new model on the resulting dataset. This process is iterated: the output of the previous model is added to the existing target, and a new model is trained on this dataset. As a result, the output sentence goes through progressive editing (Figure 2). ProEdit was repeated as long as the PARENT metric increased.
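The data construction behind this iteration can be sketched as below. This is a simplified illustration under several assumptions: `train` is a hypothetical callable that fits a model on (input, target) pairs and returns a `generate` function, first parts are obtained by running the previous model on the training inputs, and the number of rounds is fixed instead of being chosen by monitoring PARENT.

```python
SEP = "<SEP>"


def first_part(output, sep=SEP):
    """Text before the first separator (the whole output if none occurs)."""
    return output.split(sep)[0].strip()


def build_lengthened_targets(inputs, targets, generate, sep=SEP):
    """'first part <SEP> no-repeated-target' for every training example."""
    return [f"{first_part(generate(x), sep)} {sep} {t}" for x, t in zip(inputs, targets)]


def proedit(inputs, targets, train, rounds):
    """Round 0 trains on repeated targets (Asymmetric Generation);
    each later round trains on targets lengthened by the previous model."""
    train_targets = [f"{t} {SEP} {t}" for t in targets]
    models = []
    for _ in range(rounds + 1):
        generate = train(inputs, train_targets)  # hypothetical training step
        models.append(generate)
        train_targets = build_lengthened_targets(inputs, targets, generate)
    return models


# Toy usage with a stand-in "model" that returns a fixed string.
toy = build_lengthened_targets(
    ["table_1"], ["gold sentence"], lambda x: "longer first part <SEP> short part"
)
print(toy)
```

With `rounds=1` the returned list would hold the Asymmetric Generation model followed by a ProEdit-1 model; in practice the loop would stop once the validation PARENT score no longer improves.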

4 Experiment

4.1 Dataset and Implementation details

We used a pre-trained T5-large model with 737.66M parameters. To test Asymmetric Generation, we used two D2T datasets: the entire ToTTo dataset and a subset of WIKITABLET. ToTTo is composed of Wikipedia tables paired with human-written descriptions. WIKITABLET, which pairs Wikipedia table data with the corresponding Wikipedia sections, is similar to ToTTo but has longer target texts. These datasets were selected because their lengthened targets do not exceed the maximum length of the T5 decoder.

On ToTTo, the training set consisted of 120k examples and the validation set of 7.7k examples. For WIKITABLET, we used 100k and 2.7k samples for training and validation, respectively. The maximum encoder input length was 512, so structured inputs exceeding this length were truncated to 512. We set the batch size to 2 and the learning rate to 5e-5. In the generation phase, we used beam search with a beam size of 5 and early stopping, with a no-repeat n-gram size of 7. Experiments were conducted on two A100 GPUs.
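For reference, the settings listed above can be gathered in one place, as in the Python sketch below. The field names are hypothetical (the text does not name the training toolkit), and truncation is assumed to keep the first 512 tokens of the linearized input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    max_encoder_tokens: int = 512
    batch_size: int = 2
    learning_rate: float = 5e-5
    num_beams: int = 5
    early_stopping: bool = True
    no_repeat_ngram_size: int = 7


def truncate_input(token_ids, cfg=ExperimentConfig()):
    """Keep at most `max_encoder_tokens` tokens (assumed: the leading ones)."""
    return list(token_ids[:cfg.max_encoder_tokens])


cfg = ExperimentConfig()
print(len(truncate_input(range(600), cfg)))
```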

4.2 Results

When the target sentence was repeated twice (divided by a <SEP> token), the two generated parts were not the same (Table 1); this is the Asymmetric Generation phenomenon. Although the target sentences are simply repeated, the resulting sentences differed in length and in metric scores. On both datasets, the first generated part was longer than the second, and its scores were higher than those of the second. In addition, the PARENT recall and F1 of the first part were even higher than those obtained with no-repeated-targets on both datasets.

In the first ProEdit iteration, the first part of the output of the model trained on repeated targets is placed before the no-repeated-target. ProEdit is repeated until the overall F1 score stops improving. The results are shown in Table 1. For ToTTo, the overall F1 score was highest with ProEdit-1-First. For WIKITABLET, the F1 score of ProEdit-1-First was higher than that of the no-repeated-target model, but lower than that of Asymmetric Generation-First. On both datasets, the recall of the first part increased steadily as ProEdit was repeated.

Our method was also evaluated on the test set of ToTTo (Table 2). Using ProEdit-1-First, we achieved a state-of-the-art PARENT score on the test set.

Model	BLEU $\uparrow$	PARENT $\uparrow$
Pointer-Generator (See et al., 2017)	41.6	51.6
T5-based (Kale and Rastogi, 2020)	49.5	58.4
PlanGen (Su et al., 2021)	49.2	58.7
Ours (ProEdit-1-First)	48.6	59.18

Table 2: Evaluation results of BLEU and PARENT on the test set of ToTTo. All results are taken from the official ToTTo leaderboard (https://github.com/google-research-datasets/ToTTo).

5 Analysis

Recall and Length. In general, a longer sentence yields a higher recall. The first parts of the Asymmetric Generation and ProEdit outputs are longer than the reference. For a fair comparison, we generated outputs of similar length using beam search and minimum-length settings (Table 3). For beam search, the longest sentence in the beam is selected; however, this reduced recall rather than improving it. With a minimum-length setting, the probability of the EOS token is set to zero until the output reaches a given length. Setting the minimum length increases both length and recall but sharply reduces precision, which leads to a steady decrease in the overall F1 score. In contrast, our proposed method improves the overall F1 score because its precision decreases only slightly.
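The minimum-length constraint mentioned above can be illustrated with a small Python sketch. It shows only the masking step on one decoding step's log-probabilities; the function name, the toy vocabulary, and the EOS index are assumptions for illustration.

```python
import math


def apply_min_length(log_probs, num_generated, min_length, eos_id):
    """Return a copy of the next-token log-probabilities in which EOS is
    impossible (log-prob -inf, i.e. probability zero) until `min_length`
    tokens have been generated."""
    masked = list(log_probs)
    if num_generated < min_length:
        masked[eos_id] = -math.inf
    return masked


# Toy vocabulary of three tokens; index 0 plays the role of EOS.
step_log_probs = [-0.1, -2.0, -3.0]
early = apply_min_length(step_log_probs, num_generated=3, min_length=20, eos_id=0)
late = apply_min_length(step_log_probs, num_generated=20, min_length=20, eos_id=0)
print(early.index(max(early)), late.index(max(late)))
```

Before the minimum length is reached the decoder is forced to pick a non-EOS token even when EOS is the most likely continuation, which is consistent with the observed gain in length at the cost of precision.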

Asymmetric Generation. We repeated the repeated-target experiment on ToTTo with another model, GPT-2 (Radford et al., 2019). Asymmetric Generation also occurs there (Table 4), so the phenomenon is not specific to T5.

T5-large	BLEU $\uparrow$	P $\uparrow$	R $\uparrow$	F1 $\uparrow$	Length (ref=86.4)
No-Repeated-Target	49.3	80.21	50.80	58.53	80.39
Beam size 5	45.8	77.78	49.57	56.86	88.24
Beam size 11	44.3	77.02	49.24	56.48	90.65
min length 20	45.9	77.78	51.32	58.17	89.27
min length 25		74.48	52.13	57.61	101.98
min length 30	34.6	71.22	52.68	56.74	116.79

Table 3: Results on ToTTo validation set evaluated by BLEU and PARENT with official scripts.

GPT2	BLEU $\uparrow$	P $\uparrow$	R $\uparrow$	F1 $\uparrow$	Length (ref=86.4)
No-Repeated-Target	42.8	79.13	43.49	51.74	76.26
Asymmetric Generation-First	33.2	74.97	47.57	53.89	110.43
Asymmetric Generation-Second	23.4	64.45	30.82	37.16	103.29

Table 4: Results on ToTTo validation set evaluated by BLEU and PARENT with official scripts using GPT2.

6 Conclusion

In this paper, we proposed the Progressive Edit (ProEdit) process for D2T generation. It utilizes Asymmetric Generation to improve recall. We obtained a new state-of-the-art result on ToTTo.

References

Chen et al. (2020) Mingda Chen, Sam Wiseman, and Kevin Gimpel. 2020. WikiTableT: A large-scale data-to-text dataset for generating Wikipedia article sections. arXiv preprint arXiv:2012.14919.
Dhingra et al. (2019) Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William W Cohen. 2019. Handling divergent reference texts when evaluating table-to-text generation. arXiv preprint arXiv:1906.01081.
Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
Kale and Rastogi (2020) Mihir Kale and Abhinav Rastogi. 2020. Text-to-text pre-training for data-to-text tasks. arXiv preprint arXiv:2005.10433.
Parikh et al. (2020) Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373.
Puduppully et al. (2019) Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 6908–6915.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
Reiter and Dale (1997) Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87.
See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
Su et al. (2021) Yixuan Su, David Vandyke, Sihui Wang, Yimai Fang, and Nigel Collier. 2021. Plan-then-generate: Controlled data-to-text generation via planning. arXiv preprint arXiv:2108.13740.
Tan et al. (2020) Bowen Tan, Zichao Yang, Maruan Al-Shedivat, Eric P Xing, and Zhiting Hu. 2020. Progressive generation of long text with pretrained language models. arXiv preprint arXiv:2006.15720.
Wang et al. (2021) Peng Wang, Junyang Lin, An Yang, Chang Zhou, Yichang Zhang, Jingren Zhou, and Hongxia Yang. 2021. Sketch and refine: Towards faithful and informative table-to-text generation. arXiv preprint arXiv:2105.14778.

Appendix A Examples of generated results

Input Table

Page_Title[Hudson Line (Metro-North)] Section_Title[Stations] Zone[Harlem–125th Street Handicapped/disabled access] Zone[Harlem / New Haven Lines diverge] Station Miles (km) from GCT Date opened Date closed Manhattan / Bronx border Zone[Yankees–East 153rd Street Handicapped/disabled access]

Target Sentence

Once past 125th Street and the Harlem, the Hudson Line departs from the Harlem and New Haven Lines, passing first Yankees–East 153rd Street.

No-Repeated-Target

The Harlem / New Haven Lines diverge at 125th Street and Yankees–East 153rd Street.

Asymmetric Generation-First

The Harlem–125th Street and Yankees–East 153rd Street stations are in the Hudson Line (Metro-North)

ProEdit-1-First

The Harlem–125th Street and the New Haven Lines diverge from the Harlem / New Haven Lines to the Yankees–East 153rd Street Handicapped/disabled access stations of the Hudson Line (Metro-North) line.

Input Table

Page_Title[Sunda Kingdom] Section_Title[List of monarchs] Period[723 – 732] King’s name[Sanjaya/Harisdarma/ Rakeyan Jamri] Ruler[Sunda, Galuh, and Mataram]

Target Sentence

In 723, Jamri was the King of Sunda.

No-Repeated-Target

Sanjaya (r. 723–732) was the ruler of the Sunda Kingdom.

Asymmetric Generation-First

In 723, Sanjaya, Harisdarma and Rakeyan Jamri were the rulers of the Sunda Kingdom.

ProEdit-1-First

Sanjaya/Harisdarma/Rakeyan Jamri was the king of the Sunda Kingdom from 723 to 732, ruling from the Sunda, Galuh, and Mataram dynasty.

Input Table

Page_Title[Herculaneum, Missouri] Section_Title[Demographics] Historical population[2010] Census Historical population[3,468] Pop

Target Sentence

As of the census of 2010, there were 3,468 people residing in Herculaneum, Missouri.

No-Repeated-Target

The population of Herculaneum was 3,468 at the 2010 census.

Asymmetric Generation-First

As of the census of 2010, there were 3,468 people residing in the Herculaneum.

ProEdit-1-First

As of the census of 2010, there were 3,468 people residing in Herculaneum, Missouri.

Table 5: Examples of generated results on the ToTTo dataset.

Appendix B Rules for post-processing

When lengthened targets divided by a <SEP> token were used for training, the model output sometimes contained no <SEP> token or several. We split the output into the first part and the second part according to the following rules.

$\circ$ If <SEP> does not occur:

generated sentence

the first part = generated sentence

the second part = generated sentence

$\circ$ If <SEP> occurs once:

generated sentence_1 <SEP> generated sentence_2

the first part = generated sentence_1

the second part = generated sentence_2

$\circ$ If <SEP> occurs several times:

generated sentence_1 <SEP> generated sentence_2 <SEP> generated sentence_3 <SEP> …

the first part = generated sentence_1

the second part = generated sentence_1
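The rules above translate directly into a small helper. The Python sketch below is illustrative: the function name is hypothetical, the separator string is a parameter with `<SEP>` as the default following the main text, and the multi-separator case assigns generated sentence_1 to both parts exactly as the rules are stated.

```python
def split_parts(output, sep="<SEP>"):
    """Split a model output into (first part, second part)."""
    pieces = [piece.strip() for piece in output.split(sep)]
    if len(pieces) == 1:
        # No separator: both parts are the whole generated sentence.
        return pieces[0], pieces[0]
    if len(pieces) == 2:
        # Exactly one separator: the two sides of it.
        return pieces[0], pieces[1]
    # Several separators: both parts are generated sentence_1, as stated above.
    return pieces[0], pieces[0]


print(split_parts("first sentence <SEP> second sentence"))
```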