A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets
Abstract
Low-quality data can cause downstream problems in high-stakes applications. The data-centric approach emphasizes improving dataset quality in order to improve model performance. High-quality datasets are needed for training general-purpose Large Language Models (LLMs) as well as domain-specific models. Domain-specific datasets are usually small in size because it is costly to engage the large number of experts needed for their creation, so it is vital to ensure high-quality domain-specific training data. In this paper, we propose a framework for enhancing the data quality of original datasets (code and dataset are available at https://github.com/IvaBojic/framework). We applied the proposed framework to four biomedical datasets and showed relative improvements of up to 33%/40% when fine-tuning retrieval/reader models on the BioASQ dataset with back translation used to enhance the original dataset quality.
1 Introduction
The data-centric approach emphasizes the collection of high-quality data as a centrally important step in model development [jarrahi2022principles]. While model-centric approaches were more prominent in the past, data-centric approaches have recently been gaining importance [xu2021dataclue, liu2021data]. This trend has been especially pronounced since 2021, when Andrew Ng launched his campaign for a more data-centric approach to AI by starting the data-centric competition (https://https-deeplearning-ai.github.io/data-centric-comp), which encouraged participants to increase accuracy solely by improving the datasets while keeping the model fixed.
Large Language Models (LLMs), such as Generative Pre-trained Transformer 3 (GPT-3) [floridi2020gpt], generate text that is grammatically correct, fluent, and informative. However, there is little to no control over the data which was used for their training. Consequently, LLMs are prone to hallucinating and providing untruthful outputs [evans2021truthful]. Ironically, this reflects LLMs' ability to learn the training distribution well and consequently follow the inverse scaling law [lin2021truthfulqa]. And while some recent research efforts focus on providing explanations of where an LLM's outputs came from [menick2022teaching], this research is in its infancy.
In this work, we focus on language models with a Transformer encoder architecture, such as BERT [devlin2018bert], that extract relevant outputs from a domain-specific evidence-based text corpus. Deep neural networks trained on domain-specific datasets, including those used in Natural Language Processing (NLP), depend heavily on the quality of the training dataset, which is usually small in size [zarcone2021small] as it is costly to engage the large number of experts needed for its creation. It is thus important to create high-quality training data for language models to perform better. In this paper, we propose a data-centric framework for Machine Reading Comprehension (MRC) datasets that improves the original dataset quality both by (i) enhancing the dataset while keeping its size fixed, and (ii) augmenting the dataset by adding new training samples.
MRC is a Natural Language Understanding (NLU) task. Its goal is to answer questions based on the information provided in a passage [zhang2020machine]. Training datasets for MRC models come in the form of triplets: passage (i.e., positive context), question, and answer. Typically, the MRC pipeline works in two phases, where a passage retriever is followed by a passage reader [chen2017reading]. For a given question, the retriever first extracts a set of relevant passages from a knowledge base (i.e., text corpus), and then the reader selects an answer (e.g., a text span) from one of the retrieved passages [zhu2021retrieving].
2 Related Work
Data-centric approaches can be divided into (i) data quality enhancement methods that keep the original size of the dataset fixed (e.g., data filtering or label consistency checking), and (ii) data augmentation methods that increase the original dataset size (i.e., adding more training samples). Results from the literature on using data-centric approaches to improve model performance in MRC are inconclusive.
Several studies have reported that data filtering can lead to significant model improvements [dou2020dynamic, sanyal2021fix, molla2022query]. However, this might not hold if data is filtered in a random way [victoria2021advantages]. Additionally, while increasing labelling consistency and excluding or cleaning noisy data points were shown to improve model performance on the BioASQ dataset [yoon2022data], shortening answers in AASDQA led to a decrease of the F1-score by 4% [victoria2021advantages].
Adaptation of data augmentation is still comparatively less explored in NLP [feng2021survey], with a body of work presenting positive results [kaushik2019learning, khashabi2020more, qin2020cosda, pappas2022data] as well as papers showing little or no improvement for the given task [huang2020counterfactually, chopard2021learning, okimura2022impact].
To the best of our knowledge, our paper is the first to propose a framework for data quality enhancement of domain-specific MRC datasets that both (i) keeps the original dataset size fixed and (ii) increases the original dataset size using augmentation methods. Our framework includes methods for (i) a better selection of negative passages for retriever training, and (ii) reformulating questions using paraphrasing, word substitution and back translation.
Paraphrasing, word substitution and back translation were previously used as data augmentation methods in various NLP tasks [liu2021backtranslation, pappas2022data, ishii2022can]. However, those papers did not look at how each of these methods enhances the original dataset without increasing its size. Keeping the size of the dataset fixed is necessary in resource-constrained scenarios, as the resources (e.g., time) needed for fine-tuning are linear in the train set size. Moreover, previous studies did not present a cost-benefit analysis comparing the resources needed to generate extended training sets and to run fine-tuning against the resulting performance increase.
3 A Data-centric Framework for MRC
In our framework, we first generate new training sets using four data quality enhancement methods and fine-tune retrieval and reader models on each new training set individually. Secondly, we fine-tune retrieval/reader models continually, starting from the best individual checkpoint and using the enhanced training sets that showed improvements in the first step. Finally, we create new augmented datasets by concatenating all training sets that showed fine-tuning improvements in the first step.
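The three stages can be summarised with the following orchestration sketch. It is not the released code: run_framework, fine_tune and evaluate are hypothetical names, and the two callables are assumed to wrap the retrieval (recall@1) or reader (EM) training and evaluation pipelines.

```python
def run_framework(base_checkpoint, original_set, enhanced_sets, dev_set,
                  fine_tune, evaluate):
    """Three-stage data-centric procedure (sketch, not the released code)."""
    baseline = evaluate(fine_tune(base_checkpoint, original_set), dev_set)

    # Stage 1: fine-tune on every enhanced training set individually.
    individual = {name: evaluate(fine_tune(base_checkpoint, train_set), dev_set)
                  for name, train_set in enhanced_sets.items()}
    helpful = [name for name, score in individual.items() if score > baseline]

    # Stage 2: continual fine-tuning; training on the best set first reproduces
    # the best individual checkpoint, then we continue with the next best, etc.
    checkpoint = base_checkpoint
    for name in sorted(helpful, key=individual.get, reverse=True):
        checkpoint = fine_tune(checkpoint, enhanced_sets[name])
    continual = evaluate(checkpoint, dev_set)

    # Stage 3: augmentation by concatenating all sets that helped individually.
    concatenated = [example for name in helpful for example in enhanced_sets[name]]
    augmentation = evaluate(fine_tune(base_checkpoint, concatenated), dev_set)

    return baseline, individual, continual, augmentation
```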
Labels in MRC datasets are triplets which include a passage, a question, and an answer. The answer is part of the passage, which is also called a positive context. To fine-tune a retrieval model as proposed in [karpukhin-etal-2020-dense], it is necessary to provide not only the positive context, i.e., the passage that contains the answer to a given question, but also negative contexts. Some previous work employed a method of randomly selecting negative contexts from a text corpus [bojic2022sleepqa]. Here we propose a method to improve on the random selection of negative contexts.
One of the problems with manually collected labels for MRC datasets is that the questions are often too similar to their answers [rajpurkar2018know]. To address this, we investigate three different methods applied to the original set of questions: (i) paraphrasing - we use two different language models fine-tuned for paraphrasing; (ii) word substitution - we use two libraries, one to extract a keyword from a given question and another to obtain a list of synonyms for the chosen keyword; and (iii) back translation - we use 25 different machine translation language models to translate a source text into another language and back into the original language.
3.1 Negative Contexts
To enhance the quality of the negative contexts for each passage, we implemented the following procedure. For each positive context, the remaining passages were sorted in ascending order of their BERTScore [zhang2019bertscore] similarity with the positive context, and the ones with the lowest scores were kept as negative contexts. A global counting dictionary was maintained to prevent the replication of negative contexts across different training examples, ensuring that each negative context did not exceed a threshold number of occurrences in the whole dataset.
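A minimal sketch of this selection step, assuming the bert_score Python package; the function name, the default usage cap and the candidate pool (all other passages in the corpus) are illustrative choices, not the released implementation.

```python
from collections import defaultdict

from bert_score import score  # pip install bert-score


def select_negative_contexts(passages, n_negatives=1, max_uses=10):
    """For every positive context, keep the least BERTScore-similar passages
    as negatives, capping how often any passage is reused across examples."""
    usage = defaultdict(int)  # global counting dictionary
    negatives = {}
    for i, positive in enumerate(passages):
        candidates = [p for j, p in enumerate(passages) if j != i]
        # BERTScore F1 between every candidate and the positive context
        _, _, f1 = score(candidates, [positive] * len(candidates), lang="en")
        ranked = sorted(zip(f1.tolist(), candidates))  # ascending similarity
        chosen = []
        for _, candidate in ranked:
            if usage[candidate] < max_uses:
                chosen.append(candidate)
                usage[candidate] += 1
            if len(chosen) == n_negatives:
                break
        negatives[positive] = chosen
    return negatives
```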
3.2 Questions
In this section, we describe the various techniques used to augment the questions from MRC datasets.
For question paraphrasing, we used two models: T5 (https://huggingface.co/Vamsi/T5_Paraphrase_Paws) and Pegasus (https://huggingface.co/tuner007/pegasus_paraphrase). To enhance the data quality of an original dataset, for each original question we used the two aforementioned models to generate up to five paraphrased questions. We then created five different training sets in which we grouped, respectively, the most, second most, down to the least similar paraphrase of each original question. Word similarity was calculated using a word vector model from spaCy (https://spacy.io/models/en#en_core_web_lg). We also generated a sixth set comprising a randomly selected question from the list of five unique paraphrases.
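The following sketch shows how such a paraphrase pool could be generated and ranked, here only for the Pegasus model (T5 is used analogously); the generation parameters are illustrative assumptions rather than the exact values used in our experiments.

```python
import spacy
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

nlp = spacy.load("en_core_web_lg")  # word-vector model used for similarity
model_name = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)


def paraphrase(question, n=5):
    """Generate up to n unique paraphrases of a question."""
    batch = tokenizer([question], truncation=True, padding="longest",
                      max_length=64, return_tensors="pt")
    outputs = model.generate(**batch, max_length=64, num_beams=10,
                             num_return_sequences=n)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return list(dict.fromkeys(decoded))[:n]  # keep order, drop duplicates


def rank_by_similarity(question, paraphrases):
    """Order paraphrases from most to least similar to the original question."""
    reference = nlp(question)
    return sorted(paraphrases, key=lambda p: reference.similarity(nlp(p)),
                  reverse=True)

# Training set k (k = 1..5) takes the k-th most similar paraphrase of every
# question; set 6 takes one randomly chosen paraphrase per question.
```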
In the word substitution process, we extracted a keyword from each question with the help of the spaCy library and then obtained a list of synonyms for each keyword using the Natural Language Toolkit (NLTK) English dictionary, WordNet (https://www.nltk.org/howto/wordnet.html). The top five synonyms were extracted from this list in decreasing order of word similarity, calculated using the aforementioned word vector model from spaCy. We then generated five versions of the training data for each dataset such that in set 1 the keyword of each question was replaced by its most similar synonym, in set 2 by its second most similar synonym, and so forth, with set 5 containing the questions with the least similar synonyms as substitutes. For keywords with n < 5 synonyms, we kept the question unchanged in the first (5 - n) versions and used the synonyms as substitutes in the remaining n versions. We also created a sixth set in which we randomly selected one of the top five (or n) synonyms to substitute the keyword of each question.
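A sketch of the substitution procedure, assuming spaCy and NLTK's WordNet; the keyword heuristic (preferring nouns) and the function names are our illustrative choices.

```python
import spacy
from nltk.corpus import wordnet  # first run: nltk.download("wordnet")

nlp = spacy.load("en_core_web_lg")


def extract_keyword(question):
    """Pick a content word as the keyword (noun preferred; heuristic choice)."""
    doc = nlp(question)
    nouns = [token for token in doc if token.pos_ in ("NOUN", "PROPN")]
    return (nouns[0] if nouns else max(doc, key=lambda t: len(t.text))).text


def ranked_synonyms(keyword, top_k=5):
    """Collect WordNet synonyms, sorted by spaCy vector similarity (descending)."""
    reference = nlp(keyword)
    candidates = {lemma.name().replace("_", " ")
                  for synset in wordnet.synsets(keyword)
                  for lemma in synset.lemmas()}
    candidates.discard(keyword)
    return sorted(candidates, key=lambda w: reference.similarity(nlp(w)),
                  reverse=True)[:top_k]


def substitute(question, rank, top_k=5):
    """Build the question for training set `rank` (1 = most similar synonym).
    With n < top_k synonyms, the first (top_k - n) sets keep the question."""
    keyword = extract_keyword(question)
    synonyms = ranked_synonyms(keyword, top_k)
    if rank <= top_k - len(synonyms):
        return question
    return question.replace(keyword, synonyms[rank - (top_k - len(synonyms)) - 1], 1)
```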
We used the Hugging Face Helsinki-NLP models (https://huggingface.co/Helsinki-NLP) for back translation. In total, we generated 25 different training sets, selecting pivot languages based on the number of downloads of the translation models from English to the respective languages, followed by the availability of translation models from the respective languages back to English; we took the 25 most downloaded checkpoints.
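One round trip can be sketched as follows; the model identifiers follow the Helsinki-NLP opus-mt naming scheme, and not every pivot in our list maps onto exactly this pattern (e.g., the multilingual checkpoints), so the IDs should be treated as illustrative.

```python
from transformers import pipeline


def back_translate(questions, pivot="ca"):
    """Translate English -> pivot -> English with Helsinki-NLP OPUS-MT models."""
    to_pivot = pipeline("translation", model=f"Helsinki-NLP/opus-mt-en-{pivot}")
    back_to_en = pipeline("translation", model=f"Helsinki-NLP/opus-mt-{pivot}-en")
    pivoted = [result["translation_text"] for result in to_pivot(questions)]
    return [result["translation_text"] for result in back_to_en(pivoted)]


# Example: the English-Catalan-English round trip that gave the largest
# retrieval improvement on BioASQ in our experiments.
# new_questions = back_translate(original_questions, pivot="ca")
```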
To understand how different the questions obtained from each of the enhancement methods are, we ran pairwise comparisons between questions from each method using ROUGE-1. Results are shown in Section 8.7. Back translation overall yields the questions that are most different from the baseline and from the other enhancement methods.
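The pairwise comparison can be computed, for example, with the rouge-score package; averaging the F1 component (rather than recall) is an assumption.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)


def mean_rouge1(questions_a, questions_b):
    """Average ROUGE-1 F1 (in %) between two aligned lists of questions."""
    scores = [scorer.score(a, b)["rouge1"].fmeasure
              for a, b in zip(questions_a, questions_b)]
    return 100 * sum(scores) / len(scores)
```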
3.3 Answers
Since MRC relies on extracting the exact answer (i.e., a text span) from a passage, we could not apply any of the automatic data quality enhancement methods used for questions (as explained in the previous section). However, we created new training datasets in which we manually shortened the original answers where we found this appropriate. We refer to Section 7.3 for more details.
4 Datasets
To test our framework, we made adjustments (see Appendix 7) to four biomedical datasets: BioASQ [lamurias2020generating], COVID-QA [moller2020covid], cpgQA [mahbub2023cpgqa] and SleepQA [bojic2022sleepqa]. Table 1 reports statistics for the final version of the datasets used in all experiments: the original/final size of the text corpus, the original/final number of labels, and the train/dev/test split.
The original BioASQ dataset contained over 3k manually annotated biomedical labels. Questions came in four different types: factoid, list, yes/no, and summary. We extracted only factoid questions for which the exact answer can be found in the positive context. The original COVID-QA dataset was annotated by biomedical experts and contained 2k labels on COVID-19 pandemic-related topics. The original cpgQA dataset contained 1k manually annotated labels in the domain of clinical practice guidelines. The original SleepQA dataset was manually annotated in the sleep domain and contained 5k labels.
Dataset | Original corpus | Final corpus | Original labels | Final labels | Train/dev/test |
---|---|---|---|---|---|
BioASQ | 4265 | 957 | 5821 | 957 | 765/96/96 |
COVID-QA | 2079 | 1121 | 1327 | 1102 | 896/112/113 |
cpgQA | 190 | 235 | 1097 | 1097 | 877/110/110 |
SleepQA | 7000 | 7000 | 5000 | 5000 | 4000/500/500 |

Original/final text corpus size (number of passages), original/final number of labels, and train/dev/test split for each dataset.
5 Evaluation
We evaluated our framework by fine-tuning retrieval and reader models using BioLinkBERT [yasunaga2022linkbert] and BioBERT BioASQ (https://huggingface.co/gdario/biobert_bioasq), respectively. We used BioLinkBERT for retrieval model fine-tuning as it was recently shown to achieve state-of-the-art performance on low-resource bio-MRC tasks [mahbub2023cpgqa]. BioBERT BioASQ was used for fine-tuning the reader model as proposed in [bojic2022sleepqa]. Intrinsic evaluation of the fine-tuned models was done using automatic metrics on the test sets: recall@1 for retrieval models and Exact Match (EM) for reader models.
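For reference, both metrics can be implemented as shown below; the SQuAD-style answer normalization is an assumption, as the exact normalization details are not specified here.

```python
import re
import string


def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation, articles, extra spaces."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(predictions, gold_answers):
    """EM: percentage of predictions equal to the gold answer after normalization."""
    hits = sum(normalize(p) == normalize(g)
               for p, g in zip(predictions, gold_answers))
    return 100 * hits / len(gold_answers)


def recall_at_1(top_ranked_passages, gold_passages):
    """recall@1: percentage of questions whose top-ranked passage is the gold one."""
    hits = sum(top == gold
               for top, gold in zip(top_ranked_passages, gold_passages))
    return 100 * hits / len(gold_passages)
```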
5.1 Fine-tuning on Enhanced Training Sets
Table 2 and Table 3 show recall@1/EM scores, respectively, for fine-tuned retrieval/reader models after enhancing the method of selecting negative contexts (i.e., using BERTScore) for the retrieval training datasets, as well as after reformulating questions using paraphrasing, word substitution and back translation, and shortening answers, for the training datasets of both models. More specifically:

- The first row (baseline) in each table shows the results of BioLinkBERT/BioBERT BioASQ models fine-tuned on the original datasets (i.e., baselines).
- Each subsequent row shows the best results for each dataset using the aforementioned methods for enhancing negative contexts (only for the retrieval models) and questions (for both models).
- The answer shortening row shows recall@1/EM scores for fine-tuning on the training datasets in which the original answers were manually shortened where needed.
- The continual row shows the results of continual fine-tuning: starting from the best individual checkpoint, we fine-tune on the second-best training set, and so on. For example, for reader fine-tuning on the BioASQ dataset, we first took the checkpoint of fine-tuning on the training set created using paraphrasing, then continued fine-tuning on the training set created using back translation, and finally took the newest checkpoint and continued fine-tuning on the training set created using word substitution.
- The last row (augmentation) shows recall@1/EM scores for fine-tuning on training datasets created by concatenating all enhanced training sets that showed fine-tuning improvements when used individually (i.e., rows 2-6 for retrieval models and rows 2-5 for reader models).
For retrieval fine-tuning (Table 2), the most significant improvement over baseline, +8.3 (+33%), was achieved for the BioASQ dataset when using back translation via Catalan. The enhanced methods of selecting negative contexts and word substitution improved all four datasets, while paraphrasing and back translation caused a decrease in recall@1 scores for the SleepQA dataset. Continual retrieval fine-tuning showed improvements over the baselines for all datasets; however, only for the COVID-QA and cpgQA datasets was it better than the best results of individual fine-tuning.
Methods | BioASQ | COVID-QA | cpgQA | SleepQA |
---|---|---|---|---|
baseline | 25.0 | 42.5 | 66.4 | 46.8 |
negatives | 32.3 | 48.7 | 67.3 | 48.4 |
paraphrasing | 31.2 | 54.0 | 66.4 | 46.6 |
word substitution | 30.2 | 50.4 | 69.1 | 48.4 |
back translation | 33.3 | 49.6 | 66.4 | 45.8 |
answer shortening | 29.2 | 45.1 | 66.4 | 44.8 |
continual | 29.2 | 62.8 | 70.9 | 47.2 |
augmentation | 31.2 | 60.2 | 65.5 | 45.0 |
For fine-tuned reader models (Table 3), the most significant improvement over baseline, +2.1 (+40%), was achieved for the BioASQ dataset when using back translation via Dutch, as well as paraphrasing. Continual reader fine-tuning increased the EM score only for the COVID-QA dataset compared with the corresponding baseline. Lastly, augmentation was better than the best results of individual fine-tuning only for the SleepQA dataset, with a total increase of 2.6 (+4%).
Methods | BioASQ | COVID-QA | cpgQA | SleepQA |
---|---|---|---|---|
baseline | 5.2 | 22.1 | 50.9 | 58.6 |
paraphrasing | 7.3 | 23.9 | 50.9 | 59.0 |
word substitution | 6.3 | 22.1 | 50.9 | 59.4 |
back translation | 7.3 | 23.0 | 46.4 | 59.4 |
answer shortening | 5.2 | 23.0 | 49.1 | 60.8 |
continual | 5.2 | 23.9 | N/A | 58.0 |
augmentation | 5.2 | 23.9 | N/A | 61.2 |
Evaluation of fine-tuned reader models.

The greater relative improvements achieved with back translation compared to other methods could be explained by this method creating more diverse questions (Section 8.7). However, back translation gains are inconsistent from one dataset to another. Moreover, we noticed that back translation and paraphrasing with Pegasus produced questions noticeably more different than the other data enhancement techniques.
5.2 Cost-benefit Analysis
In total, the data-centric methods described in the previous section enabled us to generate 28 enhanced training sets for retrieval fine-tuning and 24 for reader fine-tuning. Subsequently, we fine-tuned all retrieval/reader models on a single NVIDIA A40 GPU with 46GB of GPU RAM.
We show fine-tuning times in Table 4 and Table 5. For example, we used one GPU for five hours to fine-tune the retriever model on the BioASQ dataset to achieve a 33% improvement in recall@1 score. At the same time, we used one GPU for 22 hours to fine-tune the retriever model on the SleepQA dataset only to see a 2% decrease in recall@1 score.
The last two rows in the tables show the time needed for continual/augmentation fine-tuning only. However, in order to determine the order in which to fine-tune for continual learning, or which datasets to use for concatenation, all individual checkpoints need to be created. Hence, to obtain the total time for continual learning/augmentation, one needs to add up the times from all previous rows as well.
Methods | BioASQ | COVID-QA | cpgQA | SleepQA |
---|---|---|---|---|
baseline | 0.2 | 0.2 | 0.2 | 0.9 |
negatives | 0.9 (29%) | 1.1 (15%) | 1.0 (1%) | 9.9 (3%) |
paraphrasing | 4.3 (25%) | 3.7 (27%) | 3.6 (0%) | 25.4 (1%) |
substitution | 2.5 (21%) | 1.4 (19%) | 1.8 (4%) | 6.1 (3%) |
translation | 4.9 (33%) | 6.3 (17%) | 4.9 (0%) | 22.0 (2%) |
answer shortening | 0.4 (17%) | 0.4 (6%) | 0.4 (0%) | 1.6 (4%) |
continual | 1.6 (17%) | 1.7 (48%) | 0.7 (7%) | 1.1 (1%) |
augmentation | 0.9 (25%) | 1.0 (42%) | 0.6 (1%) | 2.6 (4%) |
Methods | BioASQ | COVID-QA | cpgQA | SleepQA |
---|---|---|---|---|
baseline | 0.1 | 0.1 | 0.1 | 0.3 |
paraphrasing | 1.0 (40%) | 0.9 (8%) | 0.9 (0%) | 5.0 (1%) |
substitution | 0.3 (21%) | 0.5 (0%) | 0.3 (0%) | 1.1 (1%) |
translation | 1.0 (40%) | 1.2 (4%) | 1.7 (9%) | 4.0 (1%) |
answer shortening | 0.1 (0%) | 0.1 (4%) | 0.1 (4%) | 0.2 (4%) |
continual | 0.2 (0%) | 0.3 (8%) | N/A | 1.4 (1%) |
augmentation | 0.1 (0%) | 0.1 (8%) | N/A | 0.5 (4%) |
6 Discussion and Conclusions
It is estimated that over 92% of data scientists who work in the Artificial Intelligence field have encountered the "data cascades" phenomenon, which denotes downstream problems resulting from low-quality data [sambasivan2021everyone]. One way to improve the original dataset quality is to use data-centric approaches. In this paper, we showed that by enhancing the quality of original datasets, one can achieve fine-tuning performance improvements for small datasets (e.g., biomedical datasets). However, the results suggest that the effects of data quality enhancement methods on performance are small, and performance on the test data deteriorates in many cases.
Despite the inconsistency of the data-centric methods used in this paper in yielding improvement, two positive conclusions can be drawn. First, when taking into consideration the time needed to create enhanced training sets as well as the fine-tuning performance improvements, the word substitution method is the best, supporting previous findings [feng2019keep, pappas2022data]. Unlike the other methods, word substitution is not model-based and thus runs in a few minutes, rather than a few hours like back translation and paraphrasing. Second, the best relative improvements were achieved for the BioASQ dataset, which has the smallest number of labels, a finding similar to that presented in [okimura2022impact].
In addition to the data-centric methods discussed in this work, there are other techniques such as pseudo-labelling [abney2007semisupervised, ruder2018strong, cui2019self, zhu2022introduction], data selection [axelrod2011domain, plank2011effective, ruder2017learning], or pre-training methods [han2019unsupervised, guo2020multi]. In future work, we will investigate whether those techniques produce more reliable and consistent results across different datasets. Moreover, we will also consider approaches that incorporate aspects of multiple techniques, resulting in hybrid data-centric techniques as proposed in [ramponi2020neural].
Acknowledgements
The authors would like to acknowledge the funding support from Nanyang Technological University, Singapore. Josip Car’s post at Imperial College London is supported by the NIHR NW London Applied Research Collaboration.
7 Datasets
7.1 Dataset Construction
In this subsection, we describe how we built the final version of the datasets from Table 1. Where necessary, we divided passages from the original text corpus into one or more parts so that their length was less than 300 words. This step was done so that all passages were of a similar length across different datasets and the same model hyperparameters could be used for fine-tuning retrieval and reader models. Then we removed those labels for which the answer could not be found in the corresponding positive context. Finally, we divided each original dataset into three parts (in the ratio 80:10:10) to create training, development, and test sets. Table 1 shows the original number of passages in each text corpus, the original number of labels, and the final numbers after the aforementioned adjustments.
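These adjustments can be sketched as follows; the naive word-boundary split and the fixed shuffle seed are illustrative assumptions, and answers that straddle chunk boundaries (handled manually for cpgQA, as described below) would simply be dropped by this check.

```python
import random


def split_passage(passage, max_words=300):
    """Split a passage into consecutive chunks of at most max_words words."""
    words = passage.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def build_dataset(labels, seed=42):
    """Keep only labels whose answer appears in a (split) positive context,
    then create an 80:10:10 train/dev/test split."""
    cleaned = []
    for passage, question, answer in labels:
        for chunk in split_passage(passage):
            if answer.lower() in chunk.lower():
                cleaned.append((chunk, question, answer))
                break
    random.Random(seed).shuffle(cleaned)
    n = len(cleaned)
    return (cleaned[: int(0.8 * n)],              # train
            cleaned[int(0.8 * n): int(0.9 * n)],  # dev
            cleaned[int(0.9 * n):])               # test
```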
7.2 Data Cleaning
BioASQ: The original dataset did not include positive passages, but instead contained links to the journal articles where the answers could be found. To obtain positive passages, we first retrieved them from the individual links provided in the dataset, and then divided them into passages no longer than 300 words. Only triplets that contained the exact answers in the retrieved passages were included in the final dataset. One challenge was that, of the 5,821 triplets of the factoid type identified, only 16% had exact answers that could be found in the provided passages.
COVID-QA: We first divided the original corpus into passages containing no more than 300 words. We also removed redundant keywords, such as 'introduction:', 'introductions:', 'objective:', 'objectives:', 'conclusion:', 'conclusions:', 'method:', 'methods:', 'background:', 'backgrounds:', 'result:', 'results:', 'result(s):', and 'aim:'. Additionally, we eliminated leading and trailing spaces and changed all letters to lowercase. To ensure dataset accuracy, further manual cleaning was carried out. This included filling in incomplete words, removing medical abbreviations, and correcting missing brackets such as "()" and "[]".
cpgQA: To prepare the text corpus, we partitioned passages into segments of no more than 300 words, resulting in a corpus of 235 passages. Unfortunately, this division caused some answers to be separated from their corresponding positive contexts due to issues such as inaccurate sentence tokenization and answer fragmentation between two adjacent passages. These discrepancies were addressed through manual intervention. It should be noted that no labels were excluded from the original dataset as a result of this cleaning procedure.
SleepQA: The original dataset already contained passages shorter than 300 words, and all answers were found in their provided passages. We eliminated leading and trailing spaces and changed all letters to lowercase.
7.3 Shortening Answers
BioASQ: The original answers varied from two to more than 120 words in length. Our focus was on shortening the answers that were excessively long, and thus all answers longer than 30 words were manually reviewed. The primary adjustment was to isolate the main response to the corresponding question, so that lengthy sentences were truncated into shorter phrases. This effectively reduced the answer length for both the test and training sets by a significant degree. The mean answer length for the training set decreased from 30.9 to 17.6 words (Figure 1), while the mean answer length for the test set decreased from 26.1 to 18.4 words (Figure 2).
COVID-QA: In the original dataset, the length of the answers was not more than 120 words. However, some answers contained incomplete words at the beginning and/or end of sentences. To improve the dataset's accuracy, these words were either manually removed or completed. Moreover, scientific abbreviations were eliminated manually to improve the accuracy of exact matches. Unfortunately, this had no significant effect on the mean length of answers in either the training or the test set. This can be attributed to the prevalence in the training set of sentences with only one or two abbreviations. In other cases, completing the incomplete words also had no effect on the mean word count of the sentences.
cpgQA: In both the training and test sets, answers were shortened manually by removing extraneous phrases and articles (such as "a/an/the") from the beginning of the responses. Before shortening, the mean answer length in the training set was 12.7 words, which was reduced to 12.4 words. In the test set, the mean answer length before shortening was 12.1 words, which was reduced to 11.6 words. The minimal difference in the mean number of words is due to the fact that most answers in the original dataset were already clear and concise.
SleepQA: The initial average answer lengths for the SleepQA dataset were 10.15 and 9.13 words for the train and test set respectively, making it the dataset with the shortest average answer length among all datasets studied. We focused on cutting down answers longer than 15 words, which ranged up to 40 words. The shortening was done by extracting the main phrases of the answers that directly respond to the associated questions, so the resulting answers are shorter, more concise phrases instead of wordy full sentences. The final average answer lengths after the cleaning process are 9.11 and 8.01 words for the train and test set respectively.
8 Evaluation
8.1 Model Hyperparameters
Hyperparameters of retrieval model fine-tuning are shown in Table 6, and those of reader models in Table 7. When fine-tuning retrieval models on training sets in which the method of selecting negative contexts for each passage was enhanced, we changed the other negatives hyperparameter to reflect the number of negative contexts in the corresponding training set (e.g., 1 to 5). Additionally, when fine-tuning reader models on different datasets, we set the eval step to 50 for the BioASQ, COVID-QA and cpgQA datasets and to 500 for the SleepQA dataset. The reason is that the SleepQA dataset has 4,000 labels in the train set, while the other datasets have fewer than 1,000 labels. For continual retrieval fine-tuning, we set num train epochs to 60, and for the reader to 30. Other parameters were left the same.
Hyperparameter | Value |
---|---|
batch size | 32 |
dev batch size | 32 |
adam eps | |
adam betas | (0.9, 0.999) |
max grad norm | 1.0 |
log batch step | 100 |
train rolling loss step | 100 |
weight decay | 0.0 |
learning rate | |
warmup steps | 100 |
gradient accumulation steps | 1 |
num train epochs | 30/60* |
eval per epoch | 1 |
hard negatives | 0 |
other negatives | 1(2,3,4,5)* |
val av rank hard neg | 0 |
val av rank other neg | 10 |
val av rank bsz | 128 |
val av rank max qs | 10000 |
Hyperparameter | Value |
---|---|
eval step | 50/500* |
batch size | 32 |
dev batch size | 32 |
adam eps | |
adam betas | (0.9, 0.999) |
max grad norm | 1.0 |
log batch step | 100 |
train rolling loss step | 100 |
weight decay | 0.0 |
learning rate | |
warmup steps | 0 |
gradient accumulation steps | 1 |
num train epochs | 10/30* |
8.2 Negative Contexts
Using the enhanced method of selecting negative contexts, we produced five different training sets for each dataset (see Table 8). Although this method generally produced enhanced training sets for each dataset, it is not possible to conclude which number of negatives improved the fine-tuning process the most, as this is very much dataset-specific. The last row in Table 8 shows the time (in hours) needed to generate all five training sets for each dataset using an A100 GPU 40GB. While for most of the datasets the generation process took around one hour, for SleepQA it took more than one day.
Methods | BioASQ | COVID-QA | cpgQA | SleepQA |
---|---|---|---|---|
baseline | 25.0 | 42.5 | 66.4 | 46.8 |
BertScore (1 neg) | 31.2 | 41.6 | 66.4 | 47.2 |
BertScore (2 neg) | 28.1 | 48.7 | 67.3 | 45.8 |
BertScore (3 neg) | 32.3 | 45.1 | 67.3 | 47.4 |
BertScore (4 neg) | 29.2 | 45.1 | 63.6 | 46.6 |
BertScore (5 neg) | 30.2 | 48.7 | 61.8 | 48.4 |
generation time | 1.3 | 1.3 | 0.7 | 28.3 |
8.3 Paraphrasing
For question paraphrasing, we used T5 and Pegasus as they are based on the Transformer architecture and utilize transfer learning, in which resource-rich sources can be efficiently adapted to resource-poor target fields, such as domain-specific datasets [yu2018modelling]. T5 was trained on the Google PAWS dataset. Previous research showed that the Pegasus method produces paraphrases that are semantically more different, while the T5 method keeps more of the original meaning [martin2023using]. We found that Pegasus consistently produces the same set of paraphrased questions, regardless of the number generated. For T5, we generated paraphrased questions up to 50 times, after which we took the first five unique paraphrases. For several questions (between 3% for the cpgQA dataset and 12% for the COVID-QA dataset), T5 failed to produce the required number of unique paraphrases, in which case we added the original question to the set of five paraphrased questions. Although we used two different libraries, question paraphrasing failed to enhance the training set quality for the cpgQA dataset altogether. Generating the training sets took around 15 hours for the SleepQA dataset and 3 hours for the other datasets on one NVIDIA TESLA P100 GPU 16GB (Kaggle).
Methods | set 1 | set 2 | set 3 | set 4 | set 5 | set 6 |
---|---|---|---|---|---|---|
BioASQ (T5) | 0.997 | 0.991 | 0.979 | 0.962 | 0.927 | 0.970 |
BioASQ (Pegasus) | 0.953 | 0.932 | 0.917 | 0.886 | 0.846 | 0.903 |
COVID-QA (T5) | 0.996 | 0.987 | 0.970 | 0.949 | 0.904 | 0.959 |
COVID-QA (Pegasus) | 0.959 | 0.940 | 0.918 | 0.890 | 0.849 | 0.909 |
cpgQA (T5) | 0.995 | 0.987 | 0.973 | 0.954 | 0.920 | 0.967 |
cpgQA (Pegasus) | 0.960 | 0.946 | 0.930 | 0.910 | 0.883 | 0.925 |
SleepQA (T5) | 0.996 | 0.985 | 0.969 | 0.947 | 0.906 | 0.960 |
SleepQA (Pegasus) | 0.974 | 0.957 | 0.938 | 0.915 | 0.880 | 0.933 |
Methods | set 1 | set 2 | set 3 | set 4 | set 5 | set 6 |
---|---|---|---|---|---|---|
BioASQ (T5) | 25.0 | 29.2 | 26.0 | 26.0 | 24.0 | 24.0 |
BioASQ (Pegasus) | 28.1 | 31.2 | 31.2 | 29.2 | 31.2 | 30.2 |
COVID-QA (T5) | 49.6 | 48.7 | 44.2 | 47.8 | 46.0 | 54.0 |
COVID-QA (Pegasus) | 45.1 | 44.2 | 43.4 | 43.4 | 46.9 | 46.9 |
cpgQA (T5) | 65.5 | 65.5 | 65.5 | 66.4 | 65.5 | 66.4 |
cpgQA (Pegasus) | 63.6 | 62.7 | 60.0 | 62.7 | 65.5 | 69.0 |
SleepQA (T5) | 43.6 | 46.6 | 42.4 | 46.4 | 44.2 | 43.6 |
SleepQA (Pegasus) | 43.2 | 39.8 | 45.0 | 39.0 | 38.0 | 41.0 |
Methods | set 1 | set 2 | set 3 | set 4 | set 5 | set 6 |
---|---|---|---|---|---|---|
BioASQ (T5) | 4.2 | 6.2 | 4.2 | 3.1 | 6.2 | 4.2 |
BioASQ (Pegasus) | 6.2 | 7.3 | 7.3 | 6.2 | 6.2 | 6.2 |
COVID-QA (T5) | 21.2 | 19.5 | 20.4 | 23.9 | 20.4 | 19.5 |
COVID-QA (Pegasus) | 22.1 | 18.6 | 18.6 | 20.4 | 23.0 | 19.5 |
cpgQA (T5) | 50.9 | 49.1 | 48.2 | 50.9 | 48.2 | 50.0 |
cpgQA (Pegasus) | 46.4 | 46.4 | 47.3 | 44.5 | 46.4 | 49.1 |
SleepQA (T5) | 57.4 | 57.6 | 58.2 | 58.4 | 58.8 | 58.2 |
SleepQA (Pegasus) | 58.2 | 57.8 | 58.0 | 58.2 | 57.2 | 59.0 |
8.4 Word Substitution
Word substitution is the process of substituting similar words (such as synonyms or words with similar embeddings) for words in the original data [pappas2022data]. This method of enhancing the original training sets increased almost all recall@1/EM scores across datasets for both retrieval and reader fine-tuning, except for the reader models on the cpgQA and COVID-QA datasets. In cases where applying word substitution to the original dataset did not increase the EM scores for reader fine-tuning, the scores stayed the same as the corresponding baselines (i.e., this method did not worsen them). Moreover, the generation of training sets took only 11 minutes for the SleepQA dataset and around two minutes for the other datasets on one NVIDIA TESLA P100 GPU 16GB (Kaggle).
Datasets | set 1 | set 2 | set 3 | set 4 | set 5 | set 6 |
---|---|---|---|---|---|---|
BioASQ | 0.999 | 0.998 | 0.997 | 0.996 | 0.994 | 0.997 |
COVID-QA | 0.997 | 0.996 | 0.995 | 0.993 | 0.988 | 0.993 |
cpgQA | 0.998 | 0.997 | 0.996 | 0.994 | 0.989 | 0.995 |
SleepQA | 0.996 | 0.993 | 0.992 | 0.990 | 0.986 | 0.991 |
Datasets | set 1 | set 2 | set 3 | set 4 | set 5 | set 6 |
---|---|---|---|---|---|---|
BioASQ | 28.1 | 24.0 | 28.1 | 27.1 | 30.2 | 21.9 |
COVID-QA | 49.6 | 49.6 | 50.4 | 46.9 | 48.7 | 48.7 |
cpgQA | 63.6 | 68.2 | 67.3 | 69.1 | 67.3 | 66.4 |
SleepQA | 45.8 | 48.4 | 46.4 | 46.8 | 43.0 | 46.0 |
Datasets | set 1 | set 2 | set 3 | set 4 | set 5 | set 6 |
---|---|---|---|---|---|---|
BioASQ | 5.2 | 6.2 | 5.2 | 5.2 | 6.2 | 6.2 |
COVID-QA | 21.2 | 21.2 | 21.2 | 22.1 | 21.2 | 19.5 |
cpgQA | 50.0 | 50.0 | 50.9 | 50.0 | 50.9 | 50.9 |
SleepQA | 57.8 | 58.6 | 58.8 | 59.4 | 58.0 | 58.0 |
8.5 Back Translation
The main idea behind the back translation method is to use machine translation from a source to a pivot language and back, obtaining paraphrases.
In total, we generated 25 different training sets using Spanish (es), French (fr), German (de), Russian (ru), Chinese (zh), Arabic (ar), Dutch (nl), Finnish (fi), Hungarian (hu), Multiple Languages (mul), Ukrainian (uk), Hindi (hi), Danish (da), Czech (cs), Romance Languages (roa), Bulgarian (bg), Catalan (ca), Afrikaans (af), Estonian (et), Turkic Languages (trk), Slavic Languages (sla), Indonesian (id), Slovak (sk), Tagalog (tl), and Kinyarwanda (rw) as pivot languages. Back translation has been used as a data augmentation method for several different NLP tasks [feng2021survey, shorten2021text]. Generally, it produced the best results for the BioASQ dataset. The generation of training sets took 10 hours for the SleepQA dataset and around two hours for the other datasets on one NVIDIA TESLA P100 GPU 16GB (Kaggle).
Methods | BioASQ | COVID-QA | cpgQA | SleepQA |
---|---|---|---|---|
en-es-en | 31.2 | 48.7 | 62.7 | 45.4 |
en-fr-en | 29.2 | 47.8 | 60.9 | 44.8 |
en-de-en | 27.1 | 45.1 | 61.8 | 41.6 |
en-ru-en | 31.2 | 40.7 | 54.5 | 39.8 |
en-zh-en | 30.2 | 46.9 | 61.8 | 42.2 |
en-ar-en | 30.2 | 49.6 | 56.4 | 41.2 |
en-nl-en | 31.2 | 40.7 | 64.5 | 44.8 |
en-fi-en | 27.1 | 48.7 | 61.8 | 40.6 |
en-hu-en | 29.2 | 49.6 | 66.4 | 41.6 |
en-mul-en | 25.0 | 43.4 | 57.3 | 39.4 |
en-uk-en | 28.1 | 45.1 | 64.5 | 40.8 |
en-hi-en | 27.1 | 44.2 | 59.1 | 38.4 |
en-da-en | 29.2 | 44.2 | 60.0 | 43.8 |
en-cs-en | 27.1 | 43.4 | 63.6 | 45.8 |
en-roa-en | 29.2 | 47.8 | 60.9 | 42.0 |
en-bg-en | 29.2 | 43.4 | 58.2 | 40.0 |
en-ca-en | 33.3 | 41.6 | 60.0 | 41.2 |
en-af-en | 30.2 | 46.9 | 61.8 | 37.2 |
en-et-en | 29.2 | 46.0 | 58.2 | 40.2 |
en-trk-en | 18.8 | 23.9 | 35.5 | 19.6 |
en-sla-en | 25.0 | 45.1 | 63.6 | 43.6 |
en-id-en | 30.2 | 47.8 | 63.6 | 40.4 |
en-sk-en | 30.2 | 48.7 | 57.3 | 44.2 |
en-tl-en | 30.2 | 41.6 | 64.5 | 40.8 |
en-rw-en | 28.1 | 29.2 | 50.0 | 34.4 |
Methods | BioASQ | COVID-QA | cpgQA | SleepQA |
---|---|---|---|---|
en-es-en | 4.2 | 21.2 | 40.0 | 58.2 |
en-fr-en | 6.2 | 20.4 | 45.5 | 58.4 |
en-de-en | 7.3 | 21.2 | 46.4 | 57.4 |
en-ru-en | 3.1 | 18.6 | 45.5 | 58.4 |
en-zh-en | 6.2 | 21.2 | 43.6 | 58.8 |
en-ar-en | 5.2 | 23.0 | 44.5 | 58.2 |
en-nl-en | 7.3 | 21.2 | 45.5 | 57.6 |
en-fi-en | 6.2 | 20.4 | 44.5 | 58.0 |
en-hu-en | 6.2 | 19.5 | 43.6 | 58.2 |
en-mul-en | 3.1 | 19.5 | 43.6 | 57.0 |
en-uk-en | 6.2 | 18.6 | 40.9 | 59.4 |
en-hi-en | 5.2 | 20.4 | 40.9 | 57.4 |
en-da-en | 6.2 | 23.0 | 43.6 | 59.4 |
en-cs-en | 4.2 | 19.5 | 43.6 | 58.0 |
en-roa-en | 6.2 | 18.6 | 43.6 | 57.6 |
en-bg-en | 6.2 | 21.2 | 43.6 | 59.2 |
en-ca-en | 5.2 | 18.6 | 43.6 | 58.2 |
en-af-en | 7.3 | 20.4 | 44.5 | 59.0 |
en-et-en | 6.2 | 20.4 | 43.6 | 58.0 |
en-trk-en | 4.2 | 15.9 | 39.1 | 56.4 |
en-sla-en | 6.2 | 18.6 | 44.5 | 57.6 |
en-id-en | 3.1 | 17.7 | 44.5 | 57.2 |
en-sk-en | 5.2 | 21.2 | 44.5 | 58.6 |
en-tl-en | 4.2 | 22.1 | 46.4 | 58.4 |
en-rw-en | 5.2 | 17.7 | 40.0 | 56.2 |
8.6 Mean and Standard Deviation
Table 17 shows the mean and standard deviation for the different data quality enhancement methods for retrieval fine-tuning. Table 18 shows the mean and standard deviation for the different data quality enhancement methods for reader fine-tuning.
Methods | BioASQ | COVID-QA | cpgQA | SleepQA |
---|---|---|---|---|
negatives | ||||
paraphrasing (T5) | ||||
paraphrasing (Pegasus) | ||||
substitution | ||||
translation |
Methods | BioASQ | COVID-QA | cpgQA | SleepQA |
---|---|---|---|---|
paraphrasing (T5) | ||||
paraphrasing (Pegasus) | ||||
substitution | ||||
translation |
8.7 Similarity Between Enhancement Methods
In the following tables, we show the average similarity, computed with the ROUGE-1 metric, between questions obtained with each of the enhancement techniques, over all four datasets (BioASQ, COVID-QA, cpgQA, SleepQA), for the retrieval training sets (first four tables) and the reader training sets (next four).
Retrieval, BioASQ | Base. | Para/PG | Para/T5 | Subst. | Transl. |
---|---|---|---|---|---|
Base. | 100.0 | | | | |
Para/PG | 80.44 | 100.0 | | | |
Para/T5 | 91.49 | 76.73 | 100.0 | | |
Subst. | 95.11 | 76.25 | 86.98 | 100.0 | |
Transl. | 57.68 | 51.10 | 56.98 | 55.01 | 100.0 |

Retrieval, COVID-QA | Base. | Para/PG | Para/T5 | Subst. | Transl. |
---|---|---|---|---|---|
Base. | 100.0 | | | | |
Para/PG | 74.63 | 100.0 | | | |
Para/T5 | 83.33 | 66.06 | 100.0 | | |
Subst. | 95.33 | 70.89 | 79.73 | 100.0 | |
Transl. | 76.44 | 62.60 | 69.50 | 72.97 | 100.0 |

Retrieval, cpgQA | Base. | Para/PG | Para/T5 | Subst. | Transl. |
---|---|---|---|---|---|
Base. | 100.0 | | | | |
Para/PG | 71.08 | 100.0 | | | |
Para/T5 | 80.96 | 62.79 | 100.0 | | |
Subst. | 94.62 | 67.01 | 76.93 | 100.0 | |
Transl. | 71.06 | 58.85 | 64.82 | 67.36 | 100.0 |

Retrieval, SleepQA | Base. | Para/PG | Para/T5 | Subst. | Transl. |
---|---|---|---|---|---|
Base. | 100.0 | | | | |
Para/PG | 77.43 | 100.0 | | | |
Para/T5 | 86.30 | 70.98 | 100.0 | | |
Subst. | 92.95 | 71.09 | 79.79 | 100.0 | |
Transl. | 79.05 | 65.98 | 73.53 | 73.20 | 100.0 |

Reader, BioASQ | Base. | Para/PG | Para/T5 | Subst. | Transl. |
---|---|---|---|---|---|
Base. | 100.0 | | | | |
Para/PG | 80.44 | 100.0 | | | |
Para/T5 | 91.49 | 76.73 | 100.0 | | |
Subst. | 97.71 | 78.51 | 89.46 | 100.0 | |
Transl. | 86.72 | 72.32 | 82.66 | 84.86 | 100.0 |

Reader, COVID-QA | Base. | Para/PG | Para/T5 | Subst. | Transl. |
---|---|---|---|---|---|
Base. | 100.0 | | | | |
Para/PG | 72.11 | 100.0 | | | |
Para/T5 | 83.33 | 64.65 | 100.0 | | |
Subst. | 93.84 | 67.50 | 78.59 | 100.0 | |
Transl. | 66.01 | 54.99 | 60.87 | 61.55 | 100.0 |

Reader, cpgQA | Base. | Para/PG | Para/T5 | Subst. | Transl. |
---|---|---|---|---|---|
Base. | 100.0 | | | | |
Para/PG | 79.46 | 100.0 | | | |
Para/T5 | 86.09 | 72.62 | 100.0 | | |
Subst. | 95.58 | 75.34 | 82.15 | 100.0 | |
Transl. | 80.67 | 68.91 | 75.40 | 77.41 | 100.0 |

Reader, SleepQA | Base. | Para/PG | Para/T5 | Subst. | Transl. |
---|---|---|---|---|---|
Base. | 100.0 | | | | |
Para/PG | 68.15 | 100.0 | | | |
Para/T5 | 85.57 | 62.76 | 100.0 | | |
Subst. | 90.92 | 61.59 | 77.38 | 100.0 | |
Transl. | 63.06 | 51.40 | 59.00 | 57.15 | 100.0 |