
High Recall Data-to-text Generation with Progressive Edit

ChoongHan Kim
Graduate School of Artificial Intelligence
POSTECH, Pohang, South Korea

choonghankim@postech.ac.kr
Gary Geunbae Lee
Graduate School of Artificial Intelligence
POSTECH, Pohang, South Korea

gblee@postech.ac.kr
Abstract

Data-to-text (D2T) generation is the task of generating texts from structured inputs. We observed that when the same target sentence is repeated twice, a Transformer-based (T5) model generates an output made up of asymmetric sentences from structured inputs; that is, the two sentences differ in length and quality. We call this phenomenon "Asymmetric Generation" and exploit it in D2T generation. Once asymmetric sentences are generated, we concatenate the first part of the output with a no-repeated-target. As the output goes through this progressive edit (ProEdit), its recall increases, so the edited output covers the structured inputs better than before editing. ProEdit is a simple but effective way to improve performance in D2T generation, and it achieves a new state-of-the-art result on the ToTTo dataset.

1 Introduction

Data-to-text (D2T) generation is the task of generating texts from structured inputs (Reiter and Dale, 1997). Previous attempts to solve this task can be classified according to whether separate stages are adopted or not. For example, one method is to first generate the structured inputs in the correct order and then realize the whole sentence (Puduppully et al., 2019; Wang et al., 2021; Su et al., 2021). Another is to generate the whole sentence in an end-to-end (E2E) manner using a copy mechanism (Gu et al., 2016; See et al., 2017) or just pre-trained models (Kale and Rastogi, 2020).

The two methods each have their advantages and disadvantages. Methods that have separate stages generate more confident texts than E2E models, since the first stage produces entities directly from the structured inputs. However, the output sentence can be awkward, or the overall performance may be worse than that of E2E models. This occurs because the generated output of the first stage differs from the gold label that the second stage expects. If the first stage produces a slightly incorrect result, the second stage inherits it and the error grows in the subsequent sentences. This is often referred to as error propagation. E2E models are free from this vulnerability, but they can omit important entities from the structured inputs. To include these entities, a separate module (e.g. a copy mechanism module) may be added to E2E models, but this can generate awkward sentences and degrade system integrity. For this reason, we consider the usage of a pre-trained model without additional modules. A pre-trained E2E Transformer model (e.g. T5 (Raffel et al., 2019)) shows competitive performance on D2T tasks (Kale and Rastogi, 2020).

Figure 1: An example of generating asymmetric sentences.

Omitting important entities from structured inputs is related to recall values. In D2T, the recall value is a metric that considers not only the target sentence, but also the structured inputs. A high recall indicates that more structured inputs are included. This metric is described in detail in Section 2.

Figure 2: Progressive edit (ProEdit) of the output sentence. Only the first part of the asymmetric generation output is used for the next stage.

When the same target sentence is repeated, but divided by a special token (i.e. "target_1 <SEP> target_2"), we were able to make two observations. First, we find that a Transformer-based (T5) model generates asymmetric sentences (Figure 1); i.e., the first part of the output, which is related to target_1, is longer than the second part, generated from target_2. Second, the first part of the output covers structured inputs better than the second part. We call this phenomenon Asymmetric Generation.
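As an illustration of the repeated-target format described above, here is a minimal sketch that builds "target_1 <SEP> target_2" strings from ordinary targets. The function name `make_repeated_target` and the exact spacing around the separator are assumptions made for this example; the paper only specifies that the two copies are divided by a special token.

```python
SEP = "<SEP>"  # separator as written in the text; its tokenizer form is an assumption


def make_repeated_target(target: str, sep: str = SEP) -> str:
    """Return 'target <SEP> target', i.e. the same sentence repeated twice."""
    return f"{target} {sep} {target}"


def make_repeated_dataset(pairs):
    """Map (structured_input, target) pairs to (structured_input, repeated_target)."""
    return [(source, make_repeated_target(target)) for source, target in pairs]
```

For example, `make_repeated_target("In 723, Jamri was the King of Sunda.")` yields the sentence, then `<SEP>`, then the same sentence again.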

Asymmetric Generation can be exploited to improve the recall mentioned earlier. Based on our experiments on the ToTTo corpus (Parikh et al., 2020) and WIKITABLET (Chen et al., 2020), the first part of the asymmetric output shows a higher recall than the second part. It is even higher than that of the output of the model trained with no-repeated-targets.

We concatenate the first part with a no-repeated-target ("the first part <SEP> no-repeated-target"), then train a new model with lengthened targets (Figure 2). This process can be conducted repeatedly. We call this process Progressive Edit (ProEdit)¹. Our experimental results on the ToTTo corpus demonstrate the benefit of ProEdit in achieving the new state-of-the-art on the PARENT (Dhingra et al., 2019) metric.

¹ The name of our model comes from ProGen (Tan et al., 2020) because it progressively edits the initial output.

                              ToTTo (input=418.43)                            WIKITABLET (input=412.34)
T5-large                      BLEU↑  P↑     R↑     F1↑²   Length (ref=86.4)   BLEU↑  P↑     R↑     F1↑³   Length (ref=627)
No-Repeated-Target            49.3   80.21  50.80  58.53  80.39               20.05  56.44  23.78  32.61  391.30
Asymmetric Generation-First   44.2   77.91  52.56  59.23  94.18               23.7   52.22  25.21  33.05  532.33
Asymmetric Generation-Second  29.2   66.62  34.88  41.72  92.11               14.47  45.39  17.78  24.50  457.72
ProEdit-1-First               43.6   78.13  54.00  60.43  97.62               23.47  49.82  25.77  32.93  617.76
ProEdit-1-Second              35.6   74.96  38.34  46.33  76.19               14.47  45.39  17.78  24.50  457.72
ProEdit-2-First               42.1   77.61  54.24  60.39  101.174             -      -      -      -      -
ProEdit-2-Second              29.2   66.62  34.88  41.72  92.11               -      -      -      -      -
Table 1: Results on ToTTo and WIKITABLET validation set evaluated by BLEU and PARENT. Asymmetric Generation is trained with repeated target sentences that are divided by a special token. Asymmetric Generation-First is the output before <SEP> token. Asymmetric Generation-Second is the output after <SEP> token. ProEdit-1 is trained with Asymmetric Generation-First concatenated to a no-repeated-target. ProEdit-1-First is the output before <SEP> token. ProEdit-2 is trained in the same process using the output from ProEdit-1. P, R, F1 are PARENT precision, recall, and F1 score, respectively.

² We used the official scripts: https://github.com/google-research-datasets/ToTTo
³ We used the official scripts: https://github.com/mingdachen/WikiTableT

2 Related Work

PARENT Metric. The PARENT metric (Dhingra et al., 2019) was introduced to automatically evaluate texts generated from structured inputs.

Its precision for n-grams of order n, denoted by $E_{p}^{n}$, is given by:

$$E_{p}^{n}=\frac{\sum_{g\in G_{n}^{i}}[Pr(g\in R_{n}^{i})+Pr(g\notin R_{n}^{i})w(g)]\#G_{n}^{i}(g)}{\sum_{g\in G_{n}^{i}}\#G_{n}^{i}(g)} \tag{1}$$

where $R_{n}^{i}$ and $G_{n}^{i}$ denote the collections of n-grams of order n in $R^{i}$ and $G^{i}$, which are the i-th target and generated text, respectively. $Pr(g\in R_{n}^{i})$ is given by $\frac{\#G_{n}^{i},R_{n}^{i}(g)}{\#G_{n}^{i}(g)}$, where $\#G_{n}^{i}(g)$ is the count of n-gram $g$ in $G_{n}^{i}$ and $\#G_{n}^{i},R_{n}^{i}(g)$ denotes the minimum of its counts in $G_{n}^{i}$ and $R_{n}^{i}$; accordingly, $Pr(g\notin R_{n}^{i})=1-Pr(g\in R_{n}^{i})$. The entailment probability, denoted as $w(g)$, is the most important term: it checks whether the presence of an n-gram $g$ in a generated text is "correct" given the structured inputs. Two models have been introduced to calculate $w(g)$: the word overlap model and the co-occurrence model. The word overlap model is used in most cases, so we use it as well.
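To make Eq. 1 concrete, the following is a minimal sketch of the n-gram precision for a single example. It assumes the word overlap model takes $w(g)$ to be the fraction of tokens in $g$ that appear among the table tokens, which is our reading of Dhingra et al. (2019); the function names are illustrative and this is not the official evaluation script.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_overlap_w(gram, table_tokens):
    """Assumed word overlap model: share of n-gram tokens found in the table."""
    return sum(tok in table_tokens for tok in gram) / len(gram)


def parent_precision(generated, reference, table_tokens, n):
    """Entailed precision of order n (Eq. 1) for one generated/reference pair."""
    g_counts = Counter(ngrams(generated, n))
    r_counts = Counter(ngrams(reference, n))
    numerator, denominator = 0.0, 0
    for gram, count in g_counts.items():
        # Pr(g in R): clipped count in the reference divided by count in the output.
        p_in_ref = min(count, r_counts[gram]) / count
        w = word_overlap_w(gram, table_tokens)
        numerator += (p_in_ref + (1.0 - p_in_ref) * w) * count
        denominator += count
    return numerator / denominator if denominator else 0.0
```

An n-gram that is missing from the reference still earns credit in proportion to $w(g)$, which is how the metric rewards content taken from the table rather than from the target sentence.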

The recall of PARENT is computed against both the target, $E_{r}(R^{i})$, and the table, $E_{r}(T^{i})$. These are combined using a weighted geometric average:

$$E_{r}=E_{r}(R^{i})^{1-\lambda}E_{r}(T^{i})^{\lambda} \tag{2}$$

$E_{r}(R^{i})$ is the recall of the generated sentences against the target sentences. $E_{r}(T^{i})$ is computed as follows:

$$E_{r}(T^{i})=\frac{1}{K}\sum_{k=1}^{K}\frac{1}{|\bar{r_{k}}|}LCS(\bar{r_{k}},G^{i}) \tag{3}$$

$K$ denotes the number of records in the table, $\bar{r_{k}}$ is the value string of the k-th record, and $|\bar{r_{k}}|$ is its number of tokens. $LCS(x,y)$ denotes the length of the longest common subsequence between $x$ and $y$. The hyperparameter $\lambda$ can be obtained heuristically using Eq. 3. The key idea is that if the recall of the target against the table is high, the target already covers most of the information in the structured inputs, so we can assign it a large weight $(1-\lambda)$. We use this method.
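The table recall of Eq. 3 and the combination in Eq. 2 can be sketched as follows. This is a minimal illustration on pre-tokenized inputs, assuming every record value is non-empty; the function names are ours and the code is not the official PARENT implementation.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def table_recall(record_values, generated):
    """Eq. 3: mean over records of LCS(value, generated) / |value|."""
    if not record_values:
        return 0.0
    total = sum(lcs_length(value, generated) / len(value) for value in record_values)
    return total / len(record_values)


def combined_recall(reference_recall, tab_recall, lam):
    """Eq. 2: weighted geometric average of reference recall and table recall."""
    return reference_recall ** (1.0 - lam) * tab_recall ** lam
```

With record values `[["723"], ["Sunda", "Kingdom"]]` and the tokenized sentence "In 723 , Jamri was the King of Sunda .", the first record is fully covered and the second is half covered, so `table_recall` returns 0.75.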

3 Progressive Edit of Text

We concatenate the first part of the asymmetric generation output with the existing target and train a new model on the resulting dataset. This process is iterated: the output of the previous model is concatenated with the existing target, and a new model is trained on this dataset. As a result, the output sentence goes through progressive editing (Figure 2). ProEdit is repeated as long as the PARENT metric increases.
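The iteration described above can be sketched as a short loop. This is only an outline under stated assumptions: `train_and_generate` and `score` are hypothetical callables standing in for fine-tuning plus decoding and for the PARENT evaluation, the split between training and validation data is glossed over, and `max_rounds` is an illustrative safety limit that the paper does not mention.

```python
SEP = "<SEP>"  # separator as written in the text


def build_lengthened_targets(first_parts, targets, sep=SEP):
    """Form 'first part <SEP> no-repeated-target' for every example."""
    return [f"{first} {sep} {target}" for first, target in zip(first_parts, targets)]


def progressive_edit(inputs, targets, train_and_generate, score, max_rounds=3):
    """Repeat the edit step while the score keeps increasing.

    train_and_generate(inputs, lengthened_targets) -> first parts of the outputs
    score(first_parts) -> float (for example, PARENT F1 on a validation set)
    """
    # Round 0 pairs each target with itself: the repeated-target setting.
    first_parts = list(targets)
    best_score, best_outputs = float("-inf"), None
    for _ in range(max_rounds):
        lengthened = build_lengthened_targets(first_parts, targets)
        outputs = train_and_generate(inputs, lengthened)
        current = score(outputs)
        if current <= best_score:
            break  # the metric stopped increasing
        best_score, best_outputs = current, outputs
        first_parts = outputs  # the edited first parts seed the next round
    return best_outputs, best_score
```

Starting from the targets themselves means the first round reproduces the Asymmetric Generation model, and each later round corresponds to one ProEdit stage.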

4 Experiment

4.1 Dataset and Implementation details

We used a pre-trained T5-large model with 737.66M parameters. To test Asymmetric Generation, we used two D2T datasets: the entire ToTTo dataset and a part of WIKITABLET. The ToTTo dataset is composed of Wikipedia tables paired with human-written descriptions. WIKITABLET, which combines Wikipedia table data with its corresponding Wikipedia sections, is similar to ToTTo but has longer target texts. These datasets were selected since their lengthened targets do not exceed the maximum length of the T5 decoder.

On ToTTo, the training set was made up of 120k examples, and the validation set had 7.7k examples. In the case of WIKITABLET, we used 100k and 2.7k samples for training and validation, respectively. As the length of the encoder input was 512, structured inputs exceeding this length were truncated to 512. We set the batch size to 2 and the learning rate to 5e-5. In the generation phase, we used beam search with a beam size of 5 and early stopping, with a no-repeat n-gram size of 7. Experiments were conducted with two A100 GPUs.
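The settings above can be collected in a small configuration sketch. The keyword names `num_beams`, `early_stopping`, and `no_repeat_ngram_size` follow the Hugging Face `generate` interface, which is an assumption on our part because the paper does not name its toolkit; `truncate_input` is a hypothetical helper that operates on a list of token ids.

```python
MAX_ENCODER_LENGTH = 512  # encoder input length reported in the text

# Training hyperparameters reported in the text.
TRAINING_CONFIG = {"batch_size": 2, "learning_rate": 5e-5}

# Decoding settings reported in the text; key names are an assumption
# (they mirror Hugging Face `generate` keyword arguments).
GENERATION_KWARGS = {
    "num_beams": 5,
    "early_stopping": True,
    "no_repeat_ngram_size": 7,
}


def truncate_input(token_ids, max_length=MAX_ENCODER_LENGTH):
    """Cut a linearized structured input down to the encoder length."""
    return token_ids[:max_length]
```

Truncation simply drops everything after the first 512 positions, so any record that falls beyond that point cannot be covered by the output.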

4.2 Results

When there were two repeated target sentences (divided by a <SEP> token), the two generated parts were not the same (Table 1); this is the Asymmetric Generation phenomenon. Although the target sentences are simply repeated, the resulting sentences differed in length and in their scores on the metrics. On both datasets, the first generated part was longer than the second, and the scores were higher for the former than for the latter. In addition, the PARENT recall and F1 of the first part were even higher than those obtained with no-repeated-targets on both datasets.

For the first iteration of ProEdit, we use the first part of the output of the model trained with repeated targets. ProEdit is repeated as long as the overall F1 score goes up. The results are summarized in Table 1. The overall F1 score was highest with ProEdit-1-First for ToTTo. For WIKITABLET, the F1 score of ProEdit-1-First was higher than that of the no-repeated-target model but lower than that of Asymmetric Generation-First. On both datasets, the recall of the first part increased steadily as ProEdit was repeated.

Our method was evaluated on the test set of ToTTo (Table 2). Using ProEdit-1-First, we achieved the state-of-the-art PARENT score on the test set.

Model                                  BLEU↑  PARENT↑
Pointer-Generator (See et al., 2017)   41.6   51.6
T5-based (Kale and Rastogi, 2020)      49.5   58.4
PlanGen (Su et al., 2021)              49.2   58.7
Ours (ProEdit-1-First)                 48.6   59.18
Table 2: Evaluation results of BLEU and PARENT on the test set of ToTTo. All the results are cited from the official leaderboard (https://github.com/google-research-datasets/ToTTo) of ToTTo.

5 Analysis

Recall and Length. In general, a longer sentence increases the recall value. The first parts of Asymmetric Generation and ProEdit are longer than the reference. For a fair comparison, we generated outputs of similar length using beam search and minimum length settings (Table 3). For beam search, the longest sentence in the beam is selected. However, selecting the longest sentence in the beam reduced the recall value. In the minimum length settings, the probability of the EOS token is set to zero until the output reaches a certain length. Setting the minimum length increases both length and recall but sharply reduces precision, which leads to a steady decrease in the overall F1 score. Our proposed method can improve the overall F1 score because its precision decreases only slightly.
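The minimum length setting described above can be illustrated with a small decoding sketch. It uses greedy decoding instead of beam search to stay short, and `step_fn` is a hypothetical stand-in for the model that returns one log-probability per vocabulary id; none of these names come from the paper.

```python
import math


def apply_min_length(log_probs, eos_id, current_length, min_length):
    """Mask EOS (probability zero, log-probability -inf) below the minimum length."""
    masked = list(log_probs)
    if current_length < min_length:
        masked[eos_id] = -math.inf
    return masked


def greedy_decode(step_fn, eos_id, min_length, max_length):
    """Greedy decoding that cannot stop before min_length tokens are emitted."""
    tokens = []
    while len(tokens) < max_length:
        log_probs = apply_min_length(step_fn(tokens), eos_id, len(tokens), min_length)
        next_id = max(range(len(log_probs)), key=log_probs.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens
```

When the model would rather stop, the mask forces it to emit its next-best token instead, which is consistent with the longer but less precise outputs in Table 3.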

Asymmetric Generation. We also conducted the repeated-target experiment on ToTTo with another model, GPT-2 (Radford et al., 2019). Asymmetric Generation also occurs (Table 4), so the phenomenon is not specific to the T5 model.

T5-large             BLEU↑  P↑     R↑     F1↑    Length (ref=86.4)
No-Repeated-Target   49.3   80.21  50.80  58.53  80.39
Beam size 5          45.8   77.78  49.57  56.86  88.24
Beam size 11         44.3   77.02  49.24  56.48  90.65
min length 20        45.9   77.78  51.32  58.17  89.27
min length 25        40     74.48  52.13  57.61  101.98
min length 30        34.6   71.22  52.68  56.74  116.79
Table 3: Results on ToTTo validation set evaluated by BLEU and PARENT with official scripts.

GPT2                           BLEU↑  P↑     R↑     F1↑    Length (ref=86.4)
No-Repeated-Target             42.8   79.13  43.49  51.74  76.26
Asymmetric Generation-First    33.2   74.97  47.57  53.89  110.43
Asymmetric Generation-Second   23.4   64.45  30.82  37.16  103.29
Table 4: Results on ToTTo validation set evaluated by BLEU and PARENT with official scripts using GPT2.

6 Conclusion

In this paper, we proposed the Progressive Edit (ProEdit) process for D2T generation. It utilizes Asymmetric Generation to improve recall. We obtained a new state-of-the-art result on ToTTo.

References

  • Chen et al. (2020) Mingda Chen, Sam Wiseman, and Kevin Gimpel. 2020. Wikitablet: A large-scale data-to-text dataset for generating wikipedia article sections. arXiv preprint arXiv:2012.14919.
  • Dhingra et al. (2019) Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William W Cohen. 2019. Handling divergent reference texts when evaluating table-to-text generation. arXiv preprint arXiv:1906.01081.
  • Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
  • Kale and Rastogi (2020) Mihir Kale and Abhinav Rastogi. 2020. Text-to-text pre-training for data-to-text tasks. arXiv preprint arXiv:2005.10433.
  • Parikh et al. (2020) Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. Totto: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373.
  • Puduppully et al. (2019) Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 6908–6915.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  • Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
  • Reiter and Dale (1997) Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87.
  • See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
  • Su et al. (2021) Yixuan Su, David Vandyke, Sihui Wang, Yimai Fang, and Nigel Collier. 2021. Plan-then-generate: Controlled data-to-text generation via planning. arXiv preprint arXiv:2108.13740.
  • Tan et al. (2020) Bowen Tan, Zichao Yang, Maruan Al-Shedivat, Eric P Xing, and Zhiting Hu. 2020. Progressive generation of long text with pretrained language models. arXiv preprint arXiv:2006.15720.
  • Wang et al. (2021) Peng Wang, Junyang Lin, An Yang, Chang Zhou, Yichang Zhang, Jingren Zhou, and Hongxia Yang. 2021. Sketch and refine: Towards faithful and informative table-to-text generation. arXiv preprint arXiv:2105.14778.

Appendix A Examples of generated results

Input Table: Page_Title[Hudson Line (Metro-North)] Section_Title[Stations] Zone[Harlem–125th Street Handicapped/disabled access] Zone[Harlem / New Haven Lines diverge] Station Miles (km) from GCT Date opened Date closed Manhattan / Bronx border Zone[Yankees–East 153rd Street Handicapped/disabled access]
Target Sentence: Once past 125th Street and the Harlem, the Hudson Line departs from the Harlem and New Haven Lines, passing first Yankees–East 153rd Street.
No-Repeated-Target: The Harlem / New Haven Lines diverge at 125th Street and Yankees–East 153rd Street.
Asymmetric Generation-First: The Harlem–125th Street and Yankees–East 153rd Street stations are in the Hudson Line (Metro-North)
ProEdit-1-First: The Harlem–125th Street and the New Haven Lines diverge from the Harlem / New Haven Lines to the Yankees–East 153rd Street Handicapped/disabled access stations of the Hudson Line (Metro-North) line.

Input Table: Page_Title[Sunda Kingdom] Section_Title[List of monarchs] Period[723 – 732] King’s name[Sanjaya/Harisdarma/ Rakeyan Jamri] Ruler[Sunda, Galuh, and Mataram]
Target Sentence: In 723, Jamri was the King of Sunda.
No-Repeated-Target: Sanjaya (r. 723–732) was the ruler of the Sunda Kingdom.
Asymmetric Generation-First: In 723, Sanjaya, Harisdarma and Rakeyan Jamri were the rulers of the Sunda Kingdom.
ProEdit-1-First: Sanjaya/Harisdarma/Rakeyan Jamri was the king of the Sunda Kingdom from 723 to 732, ruling from the Sunda, Galuh, and Mataram dynasty.

Input Table: Page_Title[Herculaneum, Missouri] Section_Title[Demographics] Historical population[2010] Census Historical population[3,468] Pop
Target Sentence: As of the census of 2010, there were 3,468 people residing in Herculaneum, Missouri.
No-Repeated-Target: The population of Herculaneum was 3,468 at the 2010 census.
Asymmetric Generation-First: As of the census of 2010, there were 3,468 people residing in the Herculaneum.
ProEdit-1-First: As of the census of 2010, there were 3,468 people residing in Herculaneum, Missouri.

Table 5: Examples of generated results on the ToTTo dataset.

Appendix B Rules for post-processing

When the model was trained with lengthened targets divided by a <SEP> token, its output sometimes contained no <SEP> token or contained several. We split the output into the first part and the second part according to the following rules.

∘ If <SEP> does not occur:

generated sentence

the first part = generated sentence

the second part = generated sentence

∘ If <SEP> occurs once:

generated sentence_1 <SEP> generated sentence_2

the first part = generated sentence_1

the second part = generated sentence_2

∘ If <SEP> occurs several times:

generated sentence_1 <SEP> generated sentence_2 <SEP> generated sentence_3 <SEP>

the first part = generated sentence_1

the second part = generated sentence_1
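The rules above can be sketched as a small function. It follows the rules exactly as listed, including the last case, where both parts are set to generated sentence_1. The function name `split_parts` is ours, and the separator is a parameter because its exact written form is an assumption.

```python
SEP = "<SEP>"  # separator as written in the main text


def split_parts(output: str, sep: str = SEP):
    """Return (first part, second part) of a generated output."""
    pieces = [piece.strip() for piece in output.split(sep)]
    occurrences = output.count(sep)
    if occurrences == 0:
        # No separator: the whole sentence serves as both parts.
        return pieces[0], pieces[0]
    if occurrences == 1:
        return pieces[0], pieces[1]
    # Several separators: both parts are generated sentence_1, as listed.
    return pieces[0], pieces[0]
```

For example, `split_parts("s1 <SEP> s2")` returns `("s1", "s2")`, while an output with three separators returns its first sentence for both parts.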