High Recall Data-to-text Generation with Progressive Edit
Abstract
Data-to-text (D2T) generation is the task of generating text from structured inputs. We observed that when the same target sentence is repeated twice, a Transformer (T5) based model generates an output made up of asymmetric sentences from the structured inputs; in other words, the two sentences differ in length and quality. We call this phenomenon "Asymmetric Generation" and exploit it for D2T generation. Once asymmetric sentences are generated, we concatenate the first part of the output with a no-repeated-target. As this goes through progressive edits (ProEdit), recall increases; hence, the method covers the structured inputs better than before editing. ProEdit is a simple but effective way to improve performance in D2T generation, and it achieves a new state-of-the-art result on the ToTTo dataset.
1 Introduction
Data-to-text (D2T) generation is the task of generating text from structured inputs (Reiter and Dale, 1997). Previous attempts to solve this task can be classified according to whether separate stages are adopted. One method first generates the structured inputs in the correct order and then realizes the whole sentence (Puduppully et al., 2019; Wang et al., 2021; Su et al., 2021). Another generates the whole sentence in an end-to-end (E2E) manner, using a copy mechanism (Gu et al., 2016; See et al., 2017) or simply a pre-trained model (Kale and Rastogi, 2020).
The two methods each have advantages and disadvantages. Methods with separate stages produce more reliable texts than E2E models, since the first stage explicitly produces entities from the structured inputs. However, the output sentence can be awkward, or the overall performance may be worse than that of E2E models. This occurs because the generated output of the first stage differs from the gold label that the second stage expects: if the first stage produces a slightly incorrect result, the second stage takes it over and amplifies the error in subsequent sentences. This is often referred to as error propagation. E2E models are free from this vulnerability, but they can omit important entities from the structured inputs. To include these important entities, a separate module may be added to E2E models (e.g., a copy mechanism module), but this can generate awkward sentences and degrade system integrity. For this reason, we consider using a pre-trained model without additional modules. A pre-trained E2E Transformer model (e.g., T5 (Raffel et al., 2019)) shows competitive performance on D2T tasks (Kale and Rastogi, 2020).

Omitting important entities from structured inputs is related to recall values. In D2T, the recall value is a metric that considers not only the target sentence, but also the structured inputs. A high recall indicates that more structured inputs are included. This metric is described in detail in Section 2.

When the same target sentence is repeated, divided by a special token (i.e., "target_1 <SEP> target_2"), we made two observations. First, a Transformer (T5) based model generates asymmetric sentences (Figure 1); i.e., the first part of the output, which is related to target_1, is longer than the second part generated from target_2. Second, the first part of the output covers the structured inputs better than the second part. We call this phenomenon Asymmetric Generation.
Asymmetric Generation can be exploited to improve the recall mentioned earlier. In our experiments on the ToTTo corpus (Parikh et al., 2020) and WIKITABLET (Chen et al., 2020), the first part of the asymmetric output shows a higher recall than the second part. It is even higher than the recall of the output of a model trained with no-repeated-targets.
We concatenate the first part with a no-repeated-target ("the first part <SEP> no-repeated-target"), then train a new model on the lengthened targets (Figure 2). This process can be conducted repeatedly. We call it Progressive Edit (ProEdit); the name comes from ProGen (Tan et al., 2020) because our model progressively edits the initial output. Our experimental results on the ToTTo corpus demonstrate the benefit of ProEdit, which achieves a new state-of-the-art on the PARENT (Dhingra et al., 2019) metric.
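To make the repeated-target setup concrete, the following minimal sketch shows how such a training pair can be built; the separator string, the dictionary fields, and the linearization shown are illustrative assumptions rather than our exact preprocessing code.

```python
# Illustrative sketch: building a repeated-target pair to probe Asymmetric Generation.
# The <SEP> string, the dictionary fields, and the linearized table below are
# assumptions for illustration, not the exact preprocessing used in the experiments.
SEP = "<SEP>"

def make_repeated_target(linearized_table: str, target: str) -> dict:
    """Keep the structured input unchanged and duplicate the target around <SEP>."""
    return {
        "source": linearized_table,            # linearized structured input
        "target": f"{target} {SEP} {target}",  # "target_1 <SEP> target_2"
    }

example = make_repeated_target(
    "<page_title> Herculaneum </page_title> <cell> 3,468 </cell>",
    "The population of Herculaneum was 3,468 at the 2010 census.",
)
print(example["target"])
```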
Table 1: Results on ToTTo (input=418.43) and WIKITABLET (input=412.34). The left five metric columns are for ToTTo and the right five for WIKITABLET; P, R, and F1 are PARENT scores computed with the official scripts (https://github.com/google-research-datasets/ToTTo, https://github.com/mingdachen/WikiTableT).

| T5-large | BLEU | P | R | F1 | Length (ref=86.4) | BLEU | P | R | F1 | Length (ref=627) |
|---|---|---|---|---|---|---|---|---|---|---|
| No-Repeated-Target | 49.3 | 80.21 | 50.80 | 58.53 | 80.39 | 20.05 | 56.44 | 23.78 | 32.61 | 391.30 |
| Asymmetric Generation-First | 44.2 | 77.91 | 52.56 | 59.23 | 94.18 | 23.7 | 52.22 | 25.21 | 33.05 | 532.33 |
| Asymmetric Generation-Second | 29.2 | 66.62 | 34.88 | 41.72 | 92.11 | 14.47 | 45.39 | 17.78 | 24.50 | 457.72 |
| ProEdit-1-First | 43.6 | 78.13 | 54.00 | 60.43 | 97.62 | 23.47 | 49.82 | 25.77 | 32.93 | 617.76 |
| ProEdit-1-Second | 35.6 | 74.96 | 38.34 | 46.33 | 76.19 | 14.47 | 45.39 | 17.78 | 24.50 | 457.72 |
| ProEdit-2-First | 42.1 | 77.61 | 54.24 | 60.39 | 101.17 | - | - | - | - | - |
| ProEdit-2-Second | 29.2 | 66.62 | 34.88 | 41.72 | 92.11 | - | - | - | - | - |
2 Related Work
PARENT Metric. The PARENT metric was introduced to automatically evaluate texts generated from structured inputs.
Its entailed precision for n-grams of order n, denoted $E_p^n$, is given by:

$$E_p^n = \frac{\sum_{g \in G_i^n} \big[\Pr(g \in R_i^n) + \Pr(g \notin R_i^n)\Pr(g \Leftarrow T_i)\big]\, \#_{G_i^n}(g)}{\sum_{g \in G_i^n} \#_{G_i^n}(g)} \qquad (1)$$

where $R_i^n$ and $G_i^n$ denote the collections of n-grams of order n in $R_i$ and $G_i$, the i-th target and generated text, respectively. $\Pr(g \in R_i^n)$ is given by $\#_{G_i^n, R_i^n}(g) / \#_{G_i^n}(g)$, where $\#_{G_i^n}(g)$ is the count of n-gram $g$ in $G_i^n$ and $\#_{G_i^n, R_i^n}(g)$ denotes the minimum of its counts in $G_i^n$ and $R_i^n$. The entailment probability, denoted $\Pr(g \Leftarrow T_i)$, is the most important part; it is introduced to check whether the presence of an n-gram in a generated text is "correct" given the structured inputs $T_i$. Two models have been introduced to calculate $\Pr(g \Leftarrow T_i)$: the word overlap model and the co-occurrence model. In most cases the word overlap model is used, so we use it as well.

The recall of PARENT is computed against both the target $R_i$ and the table $T_i$. The two are combined with a geometric average:

$$E_r = E_r(R_i)^{1-\lambda} \cdot E_r(T_i)^{\lambda} \qquad (2)$$

$E_r(R_i)$ computes the recall of the generated sentence against the target sentence. $E_r(T_i)$ is computed as follows:

$$E_r(T_i) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|\bar{r}_k|}\, \mathrm{LCS}(\bar{r}_k, G_i) \qquad (3)$$

$K$ denotes the number of records in the table, and $|\bar{r}_k|$ is the number of tokens in the value string of record $\bar{r}_k$. $\mathrm{LCS}(x, y)$ denotes the length of the longest common subsequence between $x$ and $y$. The hyperparameter $\lambda$ can be obtained heuristically using Eq. 3: the key idea is that if the recall of the target against the table is high, the target already covers most of the information in the structured inputs, so we can assign it a big weight $(1-\lambda)$. We use this heuristic.
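As an illustration of Eqs. (2)-(3), the sketch below computes the table-side recall and the geometric combination for a single example, assuming whitespace tokenization and pre-extracted record value strings; reported scores are computed with the official PARENT scripts.

```python
# Simplified sketch of the PARENT recall combination in Eqs. (2)-(3) for one example.
# Assumes whitespace tokenization and record value strings given as plain text;
# the official scripts are used for all reported numbers.

def lcs_len(x, y):
    """Length of the longest common subsequence of token lists x and y."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def table_recall(records, generation):
    """Eq. (3): average LCS coverage of each record's value string by the generation."""
    g = generation.split()
    return sum(lcs_len(r.split(), g) / len(r.split()) for r in records) / len(records)

def combined_recall(target_recall, records, generation, lam):
    """Eq. (2): geometric average of target-side and table-side recall."""
    return (target_recall ** (1 - lam)) * (table_recall(records, generation) ** lam)

# Toy example: both record value strings appear in the generation, so the
# table-side recall is 1.0 and the combined recall is sqrt(0.6), about 0.77.
print(combined_recall(0.6, ["Herculaneum", "3,468"],
                      "The population of Herculaneum was 3,468 at the 2010 census", 0.5))
```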
3 Progressive Edit of Text
We add the first part of the asymmetric generation output to the existing target and train a new model on this dataset. This process is reiterated: after the output of the previous model is added to the existing target, a new model is trained on the resulting dataset. As a result, the output sentence goes through progressive editing (Figure 2). ProEdit was repeated until the PARENT metric stopped increasing.
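The dataset construction for one ProEdit round can be sketched as follows; examples are assumed to be stored as source/target dictionaries, and the `generate` callable stands in for decoding with the previously trained model, so this is an illustration of the procedure rather than our exact training script.

```python
# Sketch of one ProEdit round of dataset construction: the first part of the
# previous model's output is prepended to the original target, separated by <SEP>.
# The source/target fields and the `generate` callable are assumptions used for
# illustration; the fine-tuning step itself is omitted.
from typing import Callable, Dict, List

SEP = "<SEP>"

def build_proedit_dataset(
    train_set: List[Dict[str, str]],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    lengthened = []
    for ex in train_set:
        output = generate(ex["source"])            # decode with the current model
        first_part = output.split(SEP)[0].strip()  # keep only the first part
        lengthened.append({
            "source": ex["source"],
            "target": f"{first_part} {SEP} {ex['target']}",
        })
    return lengthened

# A new T5 model is then fine-tuned on the lengthened targets, and the round is
# repeated while the PARENT score on the validation set keeps improving.
```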
4 Experiment
4.1 Dataset and Implementation details
We used a pre-trained T5-large model with 737.66M parameters. To test Asymmetric Generation, we used two D2T datasets: the entire ToTTo dataset and part of WIKITABLET. The ToTTo dataset is composed of Wikipedia tables paired with human-written descriptions. WIKITABLET, which combines Wikipedia table data with its corresponding Wikipedia sections, is similar to ToTTo but has longer target texts. These datasets were selected because their lengthened targets do not exceed the maximum length of the T5 decoder.
On ToTTo, the training set was made up of 120k examples and the validation set of 7.7k examples. For WIKITABLET, we used 100k and 2.7k samples for training and validation, respectively. As the encoder input length was 512, structured inputs exceeding this limit were truncated to 512 tokens. We set the batch size to 2 and the learning rate to 5e-5. In the generation phase, we used beam search with a beam size of 5, early stopping, and a no-repeat n-gram size of 7. Experiments were conducted on two A100 GPUs.
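For reference, a minimal decoding sketch with these settings, using the transformers library, is given below; the linearized-table string and the decoder length cap are placeholder assumptions, and the fine-tuning code and checkpoints are not shown.

```python
# Sketch of the decoding configuration described above (transformers library).
# The linearized input string and max_length=512 for the decoder are placeholder
# assumptions; in practice the model would be the fine-tuned T5-large checkpoint.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

source = "<page_title> ... </page_title> <cell> ... </cell>"   # linearized structured input
inputs = tokenizer(source, truncation=True, max_length=512, return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=5,              # beam search of size 5
    early_stopping=True,
    no_repeat_ngram_size=7,
    max_length=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```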
4.2 Results
When the target sentence was repeated twice (divided by a <SEP> token), the two generated parts were not the same (Table 1); this is the Asymmetric Generation phenomenon. Although the target sentences were simply repeated, the resulting parts differed in length and in metric scores. On both datasets, the first generated part was longer than the second, and its metric scores were higher. In addition, the PARENT results of the first part were even better than those obtained with no-repeated-targets on both datasets.
We used the first part of the output generated with repeated targets as the starting point of the iteration; this is ProEdit. ProEdit was repeated until the overall F1 score no longer improved. The results are organized in Table 1. For ToTTo, the overall F1 score was highest with ProEdit-1-First. For WIKITABLET, the ProEdit-1-First F1 score was higher than that of No-Repeated-Target but lower than that of Asymmetric Generation-First. On both datasets, the recall of the first part increased steadily as ProEdit was repeated.
Our method was also evaluated on the ToTTo test set (Table 2). With ProEdit-1-First, we achieved a new state-of-the-art PARENT score on the test set.
Table 2: Results on the ToTTo test set.

| Model | BLEU | PARENT |
|---|---|---|
| Pointer-Generator (See et al., 2017) | 41.6 | 51.6 |
| T5-based (Kale and Rastogi, 2020) | 49.5 | 58.4 |
| PlanGen (Su et al., 2021) | 49.2 | 58.7 |
| Ours (ProEdit-1-First) | 48.6 | 59.18 |
5 Analysis
Recall and Length. In general, a longer sentence increases the recall value, and the first parts of Asymmetric Generation and ProEdit outputs are longer than the reference. For a fair comparison, we generated outputs of similar length using beam search and minimum-length settings (Table 3). For beam search, the longest candidate was selected; however, selecting the longest candidate in beam search actually reduced the recall value. With a minimum-length setting, the probability of the EOS token is set to zero until the output reaches a certain length. Setting a minimum length improves both length and recall, but dramatically reduces precision, which leads to a steady decrease in the overall F1 score. Our proposed method can improve the overall F1 score because its precision decreases only slightly.
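The two length controls can be reproduced with standard generation arguments; the sketch below reuses the setup assumed in the Section 4.1 sketch and is illustrative rather than our exact evaluation script.

```python
# Sketch of the two length-control baselines in Table 3: a minimum output length,
# and selecting the longest candidate from the beam. The setup mirrors the
# Section 4.1 sketch and is an illustrative assumption, not the exact script used.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")  # fine-tuned weights in practice
inputs = tokenizer("<page_title> ... </page_title>", truncation=True,
                   max_length=512, return_tensors="pt")

# (a) Minimum-length decoding: EOS is suppressed until `min_length` tokens are produced.
out_min = model.generate(**inputs, num_beams=5, min_length=25, max_length=512)

# (b) Beam search: return every beam candidate and keep the longest decoded one.
out_beams = model.generate(**inputs, num_beams=11, num_return_sequences=11, max_length=512)
candidates = tokenizer.batch_decode(out_beams, skip_special_tokens=True)
longest = max(candidates, key=lambda s: len(s.split()))
```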
Asymmetric Generation. We conducted the repeated-target experiment on ToTTo with another model, GPT-2 (Radford et al., 2019). Asymmetric Generation also occurs (Table 4), so it is not a phenomenon specific to the T5 model.
Table 3: Length-control settings on ToTTo (P, R, and F1 are PARENT scores).

| T5-large | BLEU | P | R | F1 | Length |
|---|---|---|---|---|---|
| No-Repeated-Target | 49.3 | 80.21 | 50.80 | 58.53 | 80.39 |
| Beam size 5 | 45.8 | 77.78 | 49.57 | 56.86 | 88.24 |
| Beam size 11 | 44.3 | 77.02 | 49.24 | 56.48 | 90.65 |
| min length 20 | 45.9 | 77.78 | 51.32 | 58.17 | 89.27 |
| min length 25 | 40 | 74.48 | 52.13 | 57.61 | 101.98 |
| min length 30 | 34.6 | 71.22 | 52.68 | 56.74 | 116.79 |
Table 4: Asymmetric Generation with GPT-2 on ToTTo (P, R, and F1 are PARENT scores).

| GPT-2 | BLEU | P | R | F1 | Length |
|---|---|---|---|---|---|
| No-Repeated-Target | 42.8 | 79.13 | 43.49 | 51.74 | 76.26 |
| Asymmetric Generation-First | 33.2 | 74.97 | 47.57 | 53.89 | 110.43 |
| Asymmetric Generation-Second | 23.4 | 64.45 | 30.82 | 37.16 | 103.29 |
6 Conclusion
In this paper, we proposed the Progressive Edit (ProEdit) process for D2T generation. It exploits Asymmetric Generation to improve recall. We obtained a new state-of-the-art result on ToTTo.
References
- Chen et al. (2020) Mingda Chen, Sam Wiseman, and Kevin Gimpel. 2020. WikiTableT: A large-scale data-to-text dataset for generating Wikipedia article sections. arXiv preprint arXiv:2012.14919.
- Dhingra et al. (2019) Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William W Cohen. 2019. Handling divergent reference texts when evaluating table-to-text generation. arXiv preprint arXiv:1906.01081.
- Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
- Kale and Rastogi (2020) Mihir Kale and Abhinav Rastogi. 2020. Text-to-text pre-training for data-to-text tasks. arXiv preprint arXiv:2005.10433.
- Parikh et al. (2020) Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373.
- Puduppully et al. (2019) Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Data-to-text generation with content selection and planning. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 6908–6915.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Reiter and Dale (1997) Ehud Reiter and Robert Dale. 1997. Building applied natural language generation systems. Natural Language Engineering, 3(1):57–87.
- See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
- Su et al. (2021) Yixuan Su, David Vandyke, Sihui Wang, Yimai Fang, and Nigel Collier. 2021. Plan-then-generate: Controlled data-to-text generation via planning. arXiv preprint arXiv:2108.13740.
- Tan et al. (2020) Bowen Tan, Zichao Yang, Maruan Al-Shedivat, Eric P Xing, and Zhiting Hu. 2020. Progressive generation of long text with pretrained language models. arXiv preprint arXiv:2006.15720.
- Wang et al. (2021) Peng Wang, Junyang Lin, An Yang, Chang Zhou, Yichang Zhang, Jingren Zhou, and Hongxia Yang. 2021. Sketch and refine: Towards faithful and informative table-to-text generation. arXiv preprint arXiv:2105.14778.
Appendix A Examples of generated results
| Field | Text |
|---|---|
| Target Sentence | In 723, Jamri was the King of Sunda. |
| No-Repeated-Target | Sanjaya (r. 723–732) was the ruler of the Sunda Kingdom. |
| Asymmetric Generation-First | In 723, Sanjaya, Harisdarma and Rakeyan Jamri were the rulers of the Sunda Kingdom. |
| Field | Text |
|---|---|
| Target Sentence | As of the census of 2010, there were 3,468 people residing in Herculaneum, Missouri. |
| No-Repeated-Target | The population of Herculaneum was 3,468 at the 2010 census. |
| Asymmetric Generation-First | As of the census of 2010, there were 3,468 people residing in the Herculaneum. |
| ProEdit-1-First | As of the census of 2010, there were 3,468 people residing in Herculaneum, Missouri. |
Appendix B Rules for post-processing
When lengthened targets (divided by a <SEP> token) were given, the model output sometimes contained no <SEP> token or contained several. We split the output into the first part and the second part according to the following rules (a minimal sketch of this splitting follows the list):

- If <SEP> does not occur (output: "generated sentence"):
  the first part = generated sentence
  the second part = generated sentence
- If <SEP> occurs once (output: "generated sentence_1 <SEP> generated sentence_2"):
  the first part = generated sentence_1
  the second part = generated sentence_2
- If <SEP> occurs several times (output: "generated sentence_1 <SEP> generated sentence_2 <SEP> ..."):
  the first part = generated sentence_1
  the second part = generated sentence_1
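The rules above can be written compactly as follows; this is an illustrative implementation that mirrors the rules as stated, not the exact post-processing script used for the reported numbers.

```python
# Post-processing sketch: split a generated output around <SEP> following the
# rules above. Mirrors the stated rules; an illustrative implementation only.
SEP = "<SEP>"

def split_output(generated: str):
    parts = [p.strip() for p in generated.split(SEP)]
    if len(parts) == 1:        # <SEP> does not occur
        return parts[0], parts[0]
    if len(parts) == 2:        # <SEP> occurs once
        return parts[0], parts[1]
    return parts[0], parts[0]  # <SEP> occurs several times

first, second = split_output("sentence one. <SEP> sentence two.")
print(first, "|", second)
```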