
Ladder: A Model-Agnostic Framework Boosting LLM-based Machine Translation to the Next Level

Zhaopeng Feng1  Ruizhe Chen1  Yan Zhang2  Zijie Meng1  Zuozhu Liu1
1ZJU-Angelalign R&D Center for Intelligence Healthcare, Zhejiang University  
2National University of Singapore 
{zhaopeng.23, ruizhec.21, zijie.22, zuozhuliu}@intl.zju.edu.cn
eleyanz@nus.edu.sg
Equally Contributed. Corresponding author.
Abstract

General-purpose Large Language Models (LLMs) like GPT-4 have achieved remarkable advancements in machine translation (MT) by leveraging extensive web content. On the other hand, translation-specific LLMs are built by pre-training on domain-specific monolingual corpora and fine-tuning with human-annotated translation data. Despite the superior performance, these methods either demand an unprecedented scale of computing and data or substantial human editing and annotation efforts. In this paper, we develop MT-Ladder, a novel model-agnostic and cost-effective tool to refine the performance of general LLMs for MT. MT-Ladder is trained on pseudo-refinement triplets which can be easily obtained from existing LLMs without additional human cost. During training, we propose a hierarchical fine-tuning strategy with an easy-to-hard schema, improving MT-Ladder’s refining performance progressively. The trained MT-Ladder can be seamlessly integrated with any general-purpose LLMs to boost their translation performance. By utilizing Gemma-2B/7B as the backbone, MT-Ladder-2B can elevate raw translations to the level of top-tier open-source models (e.g., refining BigTranslate-13B with +6.91 BLEU and +3.52 COMET for XX→En), and MT-Ladder-7B can further enhance model performance to be on par with the state-of-the-art GPT-4. Extensive ablation and analysis corroborate the effectiveness of MT-Ladder in diverse settings. Our code is available at https://github.com/fzp0424/MT-Ladder.





Figure 1: The average translation quality improvements across 8 translation directions on the WMT22 test set (Zh\leftrightarrowEn, De\leftrightarrowEn, En\leftrightarrowRu, En\leftrightarrowCs) using MT-Ladder-2B or 7B. The metric scores are calculated with COMET-22 (wmt22-comet-da) (Rei et al., 2020).

1 Introduction

General-purpose Large Language Models (LLMs) like GPT-4 (Achiam et al., 2023) have exhibited strong translation abilities (Hendy et al., 2023; Zhu et al., 2023; Jiao et al., 2023b), but achieving this performance requires enormous model scale, infrastructure, and deployment costs. On the other hand, translation-specific LLMs like ALMA (Xu et al., 2023a) and Aya 23 (Aryabumi et al., 2024) have reached top-tier levels through continued pretraining on large monolingual corpora (e.g., 20B tokens from Common Crawl (Suárez et al., 2019)) and fine-tuning on high-quality translation data (e.g., 10.5M translation examples from Aya Dataset (Singh et al., 2024)), which is also time-consuming and costly. These observations raise a question: can we enhance the MT performance of existing LLMs in a model-agnostic manner, achieving results comparable to translation-specific LLMs or even GPT-4, without incurring the significant costs associated with human annotations or extensive training?

There are two potential approaches to achieving this goal. The first is the prompt-based method, which develops effective prompting strategies to better stimulate LLMs’ translation capabilities, such as using in-context translation examples (Agrawal et al., 2023; Garcia et al., 2023; Peng et al., 2023; Chen et al., 2023; Feng et al., 2024). However, Zhang et al. (2023a) indicate that prompting methods rely heavily on the underlying language model, often under-translating the input and generating hallucinations. Additionally, Moslem et al. (2023) demonstrate that the same prompting strategy can lead to different performance across different models. Furthermore, most of these prompting strategies, such as agent debating or self-correction (Liang et al., 2023; Feng et al., 2024), cannot be applied to popular neural machine translation models like NLLB (Costa-jussà et al., 2022). These limitations make learning-free methods neither model-agnostic nor stable.

Another line of work employs learning-based paradigms, fine-tuning LLMs on Quality Estimation (QE, Specia et al., 2010) and Automatic Post-Editing (APE, Simard et al., 2007) tasks to refine raw translations. QE involves automatically predicting translation quality, typically using Multidimensional Quality Metrics (MQM) datasets (Freitag et al., 2021), where human experts annotate error spans and assign quality scores. APE aims to address systematic errors of a black-box MT system and tailor the output to the lexicon and style required in a specific application domain. APE datasets are manually collected from real-world post-editing triplets like QT21 (Specia et al., 2017). Built on these well-defined tasks and annotated datasets, prior works (Zeng et al., 2023; Xu et al., 2023b; Alves et al., 2024) have shown the promising utility and generalization of the learning-based method. Xu et al. (2023b) trained PaLM2 (Anil et al., 2023) on MQM datasets to refine translations, and Alves et al. (2024) trained TowerInstruct on 637k translation examples, integrating APE datasets, outperforming all open models and GPT-3.5-turbo on APE tasks. However, these works heavily rely on human-annotated evaluation data and lack extensive validation in model-agnostic and multilingual scenarios. Additionally, the overall refinement in translation quality, particularly for translation-specific models, remains limited.

In this paper, we introduce MT-Ladder, a model-agnostic and cost-effective tool for multilingual translation refinement. Instead of directly fine-tuning a translation-target LLM, we train an LLM to refine translations using refinement datasets that require no human evaluation or post-edits, through an instruction-following refinement task (Section 2.1). We notice that the reference in an existing parallel corpus can serve as a natural refined translation. By sampling a translation for the source sentence from an existing LLM as the intermediate translation, we create a pseudo-refinement translation triplet [source, intermediate translation, reference], allowing us to construct training data without extra labor costs. During training, we split the training triplets into three hierarchies (Easy, Medium, Hard) based on their COMET (Rei et al., 2020) scores and propose a hierarchical fine-tuning (HFT) strategy to improve MT-Ladder’s refining performance step by step. Comprehensive experiments demonstrate the effectiveness of MT-Ladder across various LLMs on multiple translation tasks.

Figure 2: Obtain MT-Ladder in two steps: a) Sample from an LLM using the parallel corpus to create pseudo-refinement triplet training data. b) Use a hierarchical fine-tuning method with an easy-to-hard schema to tune the base model and obtain MT-Ladder. MT-Ladder can refine models with significantly higher parameter counts than the sampling LLM and base model. It can enhance original translations from various sources to the next level.

2 MT-Ladder

2.1 Problem Formulation and Overview

Previous works (Zhang et al., 2023b; Xu et al., 2023a) adapt LLMs to translation tasks by fine-tuning on a parallel corpus [source, reference] using direct translation (\mathcal{P}_D), as shown in Figure 3. In contrast, we define our task as refinement-target translation (\mathcal{P}_R), also shown in Figure 3: we teach the pre-trained base model to refine an existing LLM translation toward the reference, rather than translating directly to the reference. Specifically, we introduce the concept of the intermediate translation, which denotes a translation sampled from an existing LLM. We then add the intermediate translation to the pair [source, reference] to form a pseudo-refinement triplet [source, intermediate translation, reference], taking the reference as the pseudo-refined translation. The focus on translation refinement rather than direct translation is a key distinction of our work from previous translation-specific LLM approaches.

MT-Ladder models are created in two steps: 1) Sampling; and 2) Hierarchical Fine-tuning (HFT). First, given an existing LLM \mathcal{M}_S and a parallel corpus \mathcal{C}, we use \mathcal{M}_S to generate intermediate translations i \sim \mathcal{M}_S(s, \mathcal{P}_D) for each source sentence s in the pair (s, r) \in \mathcal{C}, where r is the reference. We then combine i with (s, r) to create pseudo-refinement triplets (s, i, r), forming our training triplets \mathcal{T}. Second, we apply a hierarchical fine-tuning method with an easy-to-hard schema to fine-tune the base model on our instruction-following refinement task with the triplet training data to obtain MT-Ladder \mathcal{L}_a. When applying \mathcal{L}_a to refine the target LLM \mathcal{M}_T, \mathcal{M}_T first generates the translation i_{test} \sim \mathcal{M}_T(s_{test}, \mathcal{P}_D). \mathcal{L}_a then refines i_{test} into the final translation y_{final} \sim \mathcal{L}_a(s_{test}, i_{test}, \mathcal{P}_R). Figure 2 shows the pipeline.

2.2 Pseudo-refinement Triplet Construction

Our pseudo-refinement triplet [source, intermediate translation, reference] is similar in format to the APE triplet [source, translation with errors, post-edits]. However, the APE annotation procedure involves significant human costs for evaluation, error marking, and post-editing, focusing on word- or phrase-level corrections rather than overall translation quality improvement (Specia et al., 2017). In contrast, our work uses the reference r as the supervised label, focusing on overall quality. Given the sampling LLM \mathcal{M}_S with parameters \theta_S, the parallel corpus \mathcal{C}, and the prompt \mathcal{P}_D, the intermediate translation i for each pair (s, r) \in \mathcal{C} can be generated auto-regressively as i_t \sim p_{\theta_S}(i_t \mid s, \mathcal{P}_D, i_{<t}). Naturally, the quality of i is inferior to r, so we treat r as the refined translation and construct our pseudo-refinement triplet training data (s, i, r) \in \mathcal{T} without additional human costs.
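To make this step concrete, the following minimal Python sketch shows how such triplets could be assembled with the Hugging Face transformers API, assuming Vicuna-7B-v1.5 as the sampling model; the checkpoint id, prompt wording, and helper names are illustrative assumptions and may differ from the released code.

```python
# Minimal sketch: build pseudo-refinement triplets (s, i, r) by sampling
# intermediate translations from an existing LLM. Prompt wording and helper
# names here are illustrative assumptions.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.5"  # sampling model M_S (assumed checkpoint id)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def direct_translation_prompt(src_lang: str, tgt_lang: str, source: str) -> str:
    # P_D: plain translation instruction (wording is an assumption)
    return f"Translate this from {src_lang} to {tgt_lang}:\n{src_lang}: {source}\n{tgt_lang}:"

def sample_intermediate(source: str, src_lang: str, tgt_lang: str) -> str:
    prompt = direct_translation_prompt(src_lang, tgt_lang, source)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # greedy decoding for simplicity; sampling could be used instead
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()

def build_triplets(parallel_pairs, src_lang, tgt_lang, out_path="triplets.jsonl"):
    # parallel_pairs: iterable of (source, reference) from the parallel corpus C
    with open(out_path, "w", encoding="utf-8") as f:
        for source, reference in parallel_pairs:
            intermediate = sample_intermediate(source, src_lang, tgt_lang)
            f.write(json.dumps({"source": source,
                                "intermediate": intermediate,
                                "reference": reference}, ensure_ascii=False) + "\n")
```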


Figure 3: Prompts used: [source language] and [target language] represent the full names of the languages. [source sentence] is the sentence to be translated. [intermediate translation] is the sampled translation. For Direct Translation, we follow Xu et al. (2023a).
Models Zh-En De-En Ru-En Cs-En Avg.
BLEU COMET BLEU COMET BLEU COMET BLEU COMET BLEU COMET
Open
Alpaca-7B 11.80 73.36 24.52 81.37 30.49 80.68 27.31 77.99 23.53 78.35
BigTranslate-13B 14.32 74.63 23.17 81.04 28.05 78.38 34.49 81.99 25.01 79.01
BayLing-13B 20.12 77.72 27.36 83.03 33.95 82.07 33.87 81.64 28.83 81.12
Vicuna-7B-v1.5 19.99 78.97 28.96 83.38 35.06 82.54 34.56 81.71 29.64 81.65
NLLB-3.3B 21.07 76.93 29.55 83.43 40.08 83.95 49.06 85.92 34.94 82.56
ALMA-7B-LoRA 24.00 80.18 29.98 84.16 38.43 84.80 43.96 86.00 34.09 83.79
ALMA-13B-LoRA 25.48 80.21 31.26 84.56 40.26 85.27 45.36 86.47 35.59 84.13
Closed
text-davinci-003 25.00 81.62 30.88 84.79 38.47 84.80 44.52 86.16 34.72 84.34
GPT-4 23.80 82.46 32.46 85.35 40.98 85.87 46.77 87.26 36.00 85.24
MT-Ladder-2B Refinement
Alpaca-7B 22.73 (+10.93) 78.98 (+5.62) 28.53 (+4.01) 83.34 (+1.97) 36.05 (+5.56) 83.34 (+2.66) 37.08 (+9.77) 83.08 (+5.09) 31.10 (+7.57) 82.19 (+3.84)
BigTranslate-13B 22.58 (+8.26) 79.28 (+4.65) 28.48 (+5.31) 83.45 (+2.41) 36.31 (+8.26) 83.22 (+4.84) 40.32 (+5.83) 84.15 (+2.16) 31.92 (+6.91) 82.53 (+3.52)
BayLing-13B 23.84 (+3.72) 79.55 (+1.83) 29.05 (+1.69) 83.64 (+0.61) 36.92 (+2.97) 83.69 (+1.62) 38.85 (+4.98) 83.59 (+1.95) 32.17 (+3.34) 82.61 (+1.49)
Vicuna-7B-v1.5 24.11 (+4.12) 80.05 (+1.08) 29.85 (+0.89) 83.76 (+0.38) 37.72 (+2.66) 83.85 (+1.31) 38.81 (+4.25) 83.60 (+1.89) 32.62 (+2.98) 82.82 (+1.17)
NLLB-3.3B 23.97 (+2.90) 79.34 (+2.41) 29.83 (+0.28) 83.89 (+0.46) 39.02 (-1.06) 84.27 (+0.32) 45.10 (-3.96) 85.30 (-0.62) 34.48 (-0.46) 83.20 (+0.64)
MT-Ladder-7B Refinement
BigTranslate-13B 26.49 (+12.17) 81.08 (+6.45) 31.13 (+7.96) 84.58 (+3.54) 39.22 (+11.17) 85.25 (+6.87) 45.87 (+11.38) 86.43 (+4.44) 35.68 (+10.67) 84.34 (+4.83)
NLLB-3.3B 26.91 (+5.84) 81.25 (+4.32) 32.37 (+2.82) 84.88 (+1.45) 41.97 (+1.89) 85.65 (+1.70) 50.11 (+1.05) 87.09 (+1.17) 37.84 (+2.90) 84.72 (+2.16)
ALMA-7B-LoRA 26.91 (+2.91) 81.39 (+1.21) 31.61 (+1.63) 84.65 (+0.49) 39.42 (+0.99) 85.33 (+0.53) 46.15 (+2.19) 86.63 (+0.63) 36.02 (+1.93) 84.50 (+0.71)
ALMA-13B-LoRA 27.19 (+1.71) 81.23 (+1.02) 31.71 (+0.45) 84.68 (+0.12) 40.00 (-0.26) 85.43 (+0.16) 46.45 (+1.09) 86.59 (+0.12) 36.34 (+0.75) 84.48 (+0.36)
text-davinci-003 27.10 (+2.10) 81.67 (+0.05) 31.61 (+0.73) 84.67 (-0.12) 39.51 (+1.04) 85.52 (+0.72) 46.71 (+2.19) 86.73 (+0.57) 36.23 (+1.52) 84.65 (+0.31)
GPT-4 27.20 (+3.40) 81.86 (-0.60) 32.71 (+0.25) 85.08 (-0.27) 42.17 (+1.19) 85.80 (-0.07) 49.83 (+3.06) 87.25 (-0.01) 37.73 (+1.98) 85.24 (-0.24)
Table 1: Performance of MT-Ladder on the WMT22 XX\rightarrowEn test set. The original translations obtained with the \mathcal{P}_D prompt are at the top. The middle block shows scores after MT-Ladder-2B refinement, and the bottom block shows scores after MT-Ladder-7B refinement. Values in parentheses denote the change relative to the original translation; positive values indicate improved scores after refinement, and negative values indicate decreased scores.
Models En-Zh En-De En-Ru En-Cs Avg.
BLEU COMET BLEU COMET BLEU COMET BLEU COMET BLEU COMET
Open
Alpaca-7B 7.85 51.79 18.22 78.22 14.10 74.87 13.13 73.51 13.33 69.60
Vicuna-7B-v1.5 31.42 82.68 22.65 80.82 19.60 81.07 16.37 77.25 22.51 80.46
BayLing-13B 37.93 84.63 25.62 82.70 12.77 71.01 16.43 78.22 23.19 79.14
BigTranslate-13B 29.89 81.83 22.99 80.54 19.52 81.56 22.68 84.50 23.77 82.11
NLLB-3.3B 32.53 81.57 33.97 86.24 30.11 87.51 36.30 89.90 33.23 86.31
ALMA-7B-LoRA 36.26 85.16 29.43 85.41 26.49 87.05 29.28 89.01 30.37 86.66
ALMA-13B-LoRA 39.87 85.96 31.49 85.62 29.03 87.53 32.47 89.79 33.22 87.23
Closed
text-davinci-003 38.34 85.76 31.85 85.61 27.55 86.74 31.28 88.57 32.26 86.67
GPT-4 42.78 87.19 34.49 87.29 28.67 88.70 33.66 90.81 34.90 88.50
MT-Ladder-2B Refinement
Alpaca-7B 34.66 (+26.81) 83.56 (+31.77) 24.81 (+6.59) 81.55 (+3.33) 21.51 (+7.41) 83.71 (+8.84) 20.62 (+7.49) 82.57 (+9.06) 25.40 (+12.07) 82.85 (+13.25)
Vicuna-7B-v1.5 36.47 (+5.05) 84.62 (+1.94) 25.73 (+3.08) 81.86 (+1.04) 22.59 (+2.99) 83.84 (+2.77) 21.51 (+5.14) 83.19 (+5.94) 26.58 (+4.07) 83.38 (+2.92)
BayLing-13B 38.54 (+0.61) 85.03 (+0.40) 26.71 (+1.09) 82.32 (-0.38) 21.67 (+8.90) 83.22 (+12.21) 21.74 (+5.31) 82.93 (+4.71) 27.17 (+3.98) 83.38 (+4.24)
BigTranslate-13B 37.65 (+7.76) 84.74 (+2.91) 26.82 (+3.83) 82.62 (+2.08) 23.04 (+3.52) 84.03 (+2.47) 24.39 (+1.71) 84.82 (+0.32) 27.98 (+4.21) 84.05 (+1.94)
NLLB-3.3B 39.06 (+6.53) 84.79 (+3.22) 29.97 (-3.97) 83.59 (-2.65) 25.03 (-5.08) 85.19 (-2.32) 28.34 (-7.96) 86.06 (-3.84) 30.60 (-2.63) 84.91 (-1.40)
MT-Ladder-7B Refinement
BigTranslate-13B 42.10 (+12.21) 86.56 (+4.73) 32.00 (+9.01) 85.92 (+5.38) 28.11 (+8.59) 87.38 (+5.82) 30.49 (+7.81) 89.00 (+4.50) 33.18 (+9.41) 87.22 (+5.11)
NLLB-3.3B 43.40 (+10.87) 86.65 (+5.08) 33.33 (-0.64) 86.34 (+0.10) 29.55 (-0.56) 87.71 (+0.20) 33.74 (-2.56) 89.37 (-0.53) 35.01 (+1.78) 87.52 (+1.21)
ALMA-7B-LoRA 42.17 (+5.91) 86.73 (+1.57) 32.33 (+2.90) 86.20 (+0.79) 28.58 (+2.09) 87.65 (+0.60) 30.90 (+1.62) 89.30 (+0.29) 33.50 (+3.13) 87.47 (+0.81)
ALMA-13B-LoRA 42.72 (+2.85) 86.83 (+0.87) 32.54 (+1.05) 85.93 (+0.31) 29.04 (+0.01) 87.65 (+0.12) 31.70 (-0.77) 89.43 (-0.36) 34.00 (+0.79) 87.46 (+0.24)
text-davinci-003 43.62 (+5.28) 86.75 (+0.99) 32.90 (+1.05) 86.12 (+0.51) 28.58 (+1.03) 87.92 (+1.18) 32.57 (+1.29) 89.25 (+0.68) 34.42 (+2.16) 87.51 (+0.84)
GPT-4 44.35 (+1.57) 87.02 (-0.17) 33.81 (-0.68) 86.55 (-0.74) 29.32 (+0.65) 88.15 (-0.55) 32.65 (-1.01) 89.69 (-1.12) 35.03 (+0.13) 87.85 (-0.65)
Table 2: Results of MT-Ladder on the WMT22 En\rightarrowXX test set. MT-Ladder-2B refines LLMs with higher parameter counts than itself. MT-Ladder-7B refines all translators except for GPT-4. Markers are the same as in Table 1.

2.3 Hierarchical Fine-tuning

Before fine-tuning, we use COMET (Rei et al., 2020) to categorize the pseudo-refinement triplet training data \mathcal{T} into three levels: Easy, Medium, and Hard, and propose a hierarchical fine-tuning (HFT) strategy to achieve better refinement performance by learning from Easy to Hard examples. Easy translations differ significantly from the reference, offering the most room for refinement. Hard translations are nearly perfect, with minimal differences, making them the hardest to refine. Medium translations fall between these two poles. Translations with COMET scores below \mu are classified as Easy, scores between \mu and \nu as Medium, and scores above \nu as Hard. We set the thresholds \mu and \nu to 0.75 and 0.85, respectively, and analyze the effects of HFT and its robustness against these thresholds in Section 3.3.
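The partitioning step can be sketched as follows, scoring each intermediate translation against its reference with COMET-22 via the unbabel-comet package and bucketing the triplets with the thresholds above; the batch size and checkpoint id are illustrative assumptions.

```python
# Sketch: split triplets into Easy / Medium / Hard by scoring the intermediate
# translation against the reference with COMET-22. Thresholds follow the paper
# (mu = 0.75, nu = 0.85); everything else is illustrative.
from comet import download_model, load_from_checkpoint

comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

def partition_by_comet(triplets, mu=0.75, nu=0.85):
    # triplets: list of dicts with "source", "intermediate", "reference"
    batch = [{"src": t["source"], "mt": t["intermediate"], "ref": t["reference"]}
             for t in triplets]
    scores = comet_model.predict(batch, batch_size=32, gpus=1).scores
    splits = {"easy": [], "medium": [], "hard": []}
    for triplet, score in zip(triplets, scores):
        if score < mu:
            splits["easy"].append(triplet)    # far from the reference: most room to refine
        elif score < nu:
            splits["medium"].append(triplet)
        else:
            splits["hard"].append(triplet)    # nearly perfect: hardest to improve further
    return splits
```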

We fine-tune the pre-trained base model using instruction tuning (IT), aiming to obtain the model \mathcal{L}_a(\theta) on the pseudo-refinement triplet training data \mathcal{T} = \{s^{(k)}, i^{(k)}, r^{(k)}\}_{k=1}^{N} by minimizing the following objective:

\mathcal{L}(\boldsymbol{\theta}; \mathcal{T}) = -\mathbb{E}_{(\boldsymbol{s}, \boldsymbol{i}, \boldsymbol{r}) \sim \mathcal{T}}\left[\log \mathcal{L}_a(\boldsymbol{r} \mid \boldsymbol{s}, \boldsymbol{i}, \mathcal{P}_R; \theta)\right]   (1)

We start with Easy examples to help the base model capture detectable differences, then progressively fine-tune with the next level of examples, building on the previous stage.
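A minimal sketch of this easy-to-hard schedule is shown below: the same model instance is fine-tuned sequentially on the Easy, Medium, and Hard splits, so each stage starts from the parameters learned in the previous one. The dataloaders are assumed to yield tokenized refinement examples (the \mathcal{P}_R prompt with source and intermediate translation as input and the reference as the label); scheduling details such as warmup and gradient accumulation are omitted.

```python
# Sketch of hierarchical fine-tuning (HFT) with an easy-to-hard schedule.
# `dataloaders` is assumed to map each level to a DataLoader of tokenized
# refinement examples with input_ids / attention_mask / labels.
import torch

def hierarchical_fine_tune(model, dataloaders, lr=1e-4, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for level in ["easy", "medium", "hard"]:      # easy-to-hard order
        for batch in dataloaders[level]:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss            # Eq. (1): NLL of the reference tokens
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```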

2.4 Translation Refinement

When using MT-Ladder \mathcal{L}_a with parameters \theta_{\mathcal{L}_a} for refinement, given any target LLM \mathcal{M}_T capable of translation, we first utilize \mathcal{M}_T to generate the intermediate translation i_{test} \sim \mathcal{M}_T(s_{test}, \mathcal{P}_D). MT-Ladder then refines i_{test} into the final translation y_{final} in an auto-regressive manner: y_{final,t} \sim p_{\theta_{\mathcal{L}_a}}(y_{final,t} \mid s_{test}, i_{test}, \mathcal{P}_R, y_{final,<t}).
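In code, the refinement step amounts to a second generation pass conditioned on both the source and the target model's output, as sketched below; the prompt wording follows the spirit of Figure 3 but is an illustrative assumption, as are the function names.

```python
# Sketch of inference-time refinement with a trained MT-Ladder model.
# The refinement prompt P_R below is an assumed, simplified template.
def refinement_prompt(src_lang, tgt_lang, source, intermediate):
    return (f"Improve the following {src_lang}-to-{tgt_lang} translation.\n"
            f"{src_lang}: {source}\n"
            f"Intermediate translation: {intermediate}\n"
            f"Refined {tgt_lang} translation:")

def refine(ladder_model, ladder_tokenizer, source, intermediate, src_lang, tgt_lang):
    prompt = refinement_prompt(src_lang, tgt_lang, source, intermediate)
    inputs = ladder_tokenizer(prompt, return_tensors="pt").to(ladder_model.device)
    out = ladder_model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return ladder_tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
```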

3 Experiments

3.1 Experimental Setup

Datasets. For training, we choose Vicuna-7B-v1.5 (Chiang et al., 2023) as the sampling model. Vicuna-7B-v1.5, fine-tuned from LLaMA2 (Touvron et al., 2023), possesses a certain level of translation ability (see Tables 1 and 2). For the parallel corpus, we collect test datasets from WMT’17 to WMT’20, along with Flores-200 (Costa-jussà et al., 2022), covering 8 translation directions (En \leftrightarrow XX) and 5 languages: English (En), German (De), Czech (Cs), Chinese (Zh), and Russian (Ru). The trained MT-Ladder is evaluated on the same translation directions using data from WMT22 (https://github.com/wmt-conference). Detailed statistics are in Table 5.

Models Zh-En En-Zh De-En En-De (COMET)
Palm2 74.70 - - 81.80
+LLMRefine 75.90 - - 82.30
BigTranslate-13B 74.63 81.83 81.04 80.54
+Vicuna-13B-v1.5 (Re) 72.53 80.91 77.26 78.79
+Vicuna-13B-v1.5 (CT) 76.53 83.67 81.72 81.05
+TowerInstruct-7B 76.17 85.62 82.03 84.89
+TowerInstruct-13B 77.92 85.91 82.26 85.86
+MT-Ladder-2B 79.28 84.74 83.45 82.62
+MT-Ladder-7B 81.08 86.56 84.58 85.92
+GPT-4o mini 81.34 86.57 84.70 86.30
Table 3: Comparison with baselines on WMT22 test set. Palm2 and LLMRefine results are from Xu et al. (2023b). Contrast Translation (CT) and Rephrase (Re) are two prompt-based strategies from Chen et al. (2023). Bold font and underline indicate the best and second best performance, respectively.

We evaluate MT-Ladder under two scenarios. 1) We examine the effectiveness of MT-Ladder in refining both translation-specific LLMs, such as BigTranslate (Yang et al., 2023), BayLing (Zhang et al., 2023b), NLLB (Costa-jussà et al., 2022), and ALMA (Xu et al., 2023a), and general LLMs, such as Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), GPT-3.5-text-davinci-003 (Ouyang et al., 2022; results sourced from Xu et al. (2023a)), and GPT-4 (Achiam et al., 2023; results sourced from Xu et al. (2024)). 2) We compare MT-Ladder with SoTA translation refinement or APE methods and models, i.e., the Contrast Translation (CT) and Rephrase (Re) prompting strategies (Chen et al., 2023), LLMRefine (Xu et al., 2023b), TowerInstruct (Alves et al., 2024), and the API-based model GPT-4o mini. Details are in Appendix B.

Figure 4: Comparison of original translation quality (x-axis) with MT-Ladder-7B refined quality (y-axis). Each dot is a WMT22 En-Zh translation. The percentages attached next to the markers represent the proportion of each part.



Figure 5: Trends in BLEU and COMET during training. HFT represents our hierarchical fine-tuning from Easy to Hard examples, while Mixed denotes using mixed data shuffling without hierarchical fine-tuning. Anti-HFT refers to reversing the HFT process.

Metrics. Following Xu et al. (2023a) and Alves et al. (2024), we use the lexical metric BLEU (Post, 2018) and the reference-based metric COMET-22 (Rei et al., 2020) as the main metrics to evaluate the translation quality. We further employ the reference-free QE model COMETKiwi (Rei et al., 2022) to evaluate the overall translation quality.
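For reference, the sketch below shows one way to compute these metrics with the sacrebleu and unbabel-comet packages; the public checkpoint ids (wmt22-comet-da, wmt22-cometkiwi-da) are assumed to correspond to the metrics described above.

```python
# Sketch of scoring translations with BLEU, COMET-22, and COMETKiwi.
import sacrebleu
from comet import download_model, load_from_checkpoint

def evaluate(sources, hypotheses, references):
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    comet22 = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
    kiwi = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
    data = [{"src": s, "mt": h, "ref": r}
            for s, h, r in zip(sources, hypotheses, references)]
    comet_score = comet22.predict(data, batch_size=32, gpus=1).system_score
    # COMETKiwi is reference-free: only src and mt are used
    kiwi_score = kiwi.predict([{"src": s, "mt": h} for s, h in zip(sources, hypotheses)],
                              batch_size=32, gpus=1).system_score
    return {"BLEU": bleu, "COMET-22": comet_score, "COMETKiwi": kiwi_score}
```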

Backbones. MT-Ladder uses Gemma-2B and Gemma-7B as the backbones (both use a 256k-token vocabulary, ensuring effective applicability in multilingual scenarios), which are further fine-tuned using LoRA (Hu et al., 2021) with a rank of 16. We update 0.9% of the parameters for the 2B model and 0.6% for the 7B model. The training details are presented in Appendix C.
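A minimal sketch of this LoRA setup with the peft library is given below; only the rank is specified above, so the target modules and lora_alpha are illustrative assumptions, and print_trainable_parameters reports the trainable share (roughly 0.9% for the 2B backbone and 0.6% for the 7B backbone).

```python
# Sketch: wrap a Gemma backbone with rank-16 LoRA adapters using peft.
# target_modules and lora_alpha are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
ladder_model = get_peft_model(base, lora_config)
ladder_model.print_trainable_parameters()  # share of trainable parameters
```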

3.2 Main Results

Refinement Performance over LLMs. Tables 1 and 2 show that MT-Ladder can significantly improve the overall translation quality in all 8 translation directions across most translation-specific and general-purpose LLMs. Specifically, MT-Ladder-2B improves Alpaca-7B by +12.07 BLEU and +13.25 COMET for En\rightarrowXX on average, and refines BigTranslate-13B by +6.91 BLEU and +3.52 COMET for XX\rightarrowEn. As for MT-Ladder-7B, it shows improvements over all open-source models on average. Notably, it even enhances 7 out of 8 translation directions for GPT-3.5-text-davinci-003 and improves the BLEU score of GPT-4 by +1.05 on average. We also find that while MT-Ladder-2B shows inferior performance on the strong NLLB-3.3B, our MT-Ladder-7B exhibits significant translation refinements on average. This aligns with our intuition that different base models might exhibit varying levels of refinement performance across different LLMs; see the detailed analysis in Figure 4.

Comparison with SoTA Baselines. We compare MT-Ladder with SoTA baselines on four translation directions from WMT22, as reported in Table 3. We report the performance of LLMRefine on PaLM2 (Xu et al., 2023b) as it is not available for refining BigTranslate. MT-Ladder-7B significantly outperforms all open-source baselines and can even match the performance of GPT-4o mini. MT-Ladder-2B exhibits performance on par with TowerInstruct-13B, which is superior to GPT-3.5-turbo (Alves et al., 2024). The results also highlight the instability of prompt-based methods. In contrast, MT-Ladder consistently demonstrates lightweight and superior performance.



Figure 6: Robustness against thresholds \mu and \nu. HFT1: (\mu, \nu) = (0.7, 0.8), HFT2: (\mu, \nu) = (0.75, 0.85), and HFT3: (\mu, \nu) = (0.8, 0.9). Mixed denotes mixed training. ALMA-7B-LoRA is the model to refine.
Figure 7: Weak-to-strong potential. We fine-tune Gemma-7B using different references as the labels to refine the development set. Origin denotes ALMA-7B-LoRA translation. Blue represents using ALMA-7B-LoRA as the weak reference to fine-tune MT-Ladder. Red represents using the gold label as the reference.
Figure 8: Self-translation and Self-refinement. MT-Ladder-2B represents performing direct translation with prompt \mathcal{P}_D, demonstrating translation capabilities comparable to 7B and 13B LLM-based translators. Iter1 denotes MT-Ladder-2B refining its original translation. Iter2 denotes MT-Ladder-2B refining the translation from Iter1.

3.3 Ablation and Analysis

Analysis of HFT. As depicted in Figure 5 (we examine the effectiveness of HFT with Gemma-7B on the development set (see Appendix A), automatically saving 10 checkpoints to calculate metric scores), our HFT exhibits stable improvements and the best BLEU and COMET performance across all ten checkpoints, while the traditional mixed training strategy fluctuates with inferior performance. We also conduct an "Anti-HFT" experiment by reversing the order of the corpora employed during HFT, i.e., MT-Ladder is trained following a hard-to-easy schema. Results in Figure 5 show that "Anti-HFT" initially achieves its best performance and then gradually declines.

We further scrutinize the model performance during HFT to verify its effectiveness. We report two metrics, the average improvement \Delta and its standard deviation \sigma, for the above three strategies during the training process, where a larger \Delta and a smaller \sigma indicate better and more stable refinement improvements. The results are in Figure 9.

We notice that HFT results in a gradual increase of \Delta and a decrease of \sigma. In contrast, "Anti-HFT" shows the opposite trend, and mixed training fluctuates in both \Delta and \sigma. The increasing \sigma under "Anti-HFT" suggests that learning on Easy triplets last might affect the stability of refinements. These results align with our hypothesis that refining Hard samples requires fewer adjustments, while Easy samples, which deviate substantially from the reference, demand more corrections and can cause significant fluctuations if used for fine-tuning in the final stage. See the samples in Tables 6 and 7 for an intuitive understanding. Our findings suggest that the way triplet data is partitioned and ordered for HFT can affect model performance on instruction-following refinement, and more robust fine-tuning strategies remain an important direction for future work.

We also investigate the sensitivity of the thresholds \mu and \nu used for splitting hierarchies and conduct HFT with three different threshold settings on the En-Zh training set, as shown in Figure 6. The results indicate that HFT consistently outperforms mixed training, with similar performance across different thresholds.

Refinements Degrade as the Original LLM Becomes Stronger. We analyze the quality score changes between the original translations and the MT-Ladder-refined versions as shown in Figure 4. We observe that MT-Ladder consistently improves a higher proportion of translations than it degrades, even for GPT-4. The trend in the proportion of improved translations aligns with the average score improvement trend. Specifically, as the model’s translation capability increases, the proportion of improvements decreases, and the average improvement score also decreases. Our findings suggest that stronger translations have fewer and more complex errors that are harder to refine, consistent with our assumption in Section 2.3.

MT-Ladder Pipeline (WMT22 En-Zh)
Sampling Model Base Model Refine Model BLEU COMET
Gemma-2B-it Gemma-2B Gemma-2B-it 35.46 84.41
Gemma-2B-it Gemma-2B Gemma-7B-it 35.86 84.60
Vicuna-7B-v1.5 LLaMA-2-7B Vicuna-7B-v1.5 34.31 84.12
Vicuna-7B-v1.5 LLaMA-2-7B Vicuna-13B-v1.5 36.19 84.74
Baseline
Gemma-2B-it 21.07 78.67
Gemma-7B-it 30.55 81.50
Vicuna-7B-v1.5 31.42 82.68
Vicuna-13B-v1.5 35.14 83.38
Table 4: Ablation of different sampling models and backbones. We evaluate Gemma- and LLaMA-suite models on En-Zh.

Ablation Study of Different Sampling and Backbones. As shown in Table 4, MT-Ladder trained with different sampling models and backbones consistently improves translation quality across instruction-tuned models of various sizes, demonstrating the effectiveness of our instruction-following refinement strategy. Notably, Gemma-2B (Vicuna-7B) with MT-Ladder even surpasses Gemma-7B (Vicuna-13B), highlighting the potential to elevate the capabilities of smaller models to the next level.

Instruction-following Refinement Enables Weak-to-Strong Generalization. Typically, the capabilities after fine-tuning are upper-bounded by the supervised label, i.e., the reference in our task. Here, we explore using translations sampled from ALMA-7B-LoRA as weak references and translations sampled from Vicuna-7B as intermediate translations to create pseudo-refinement training triplets [source, intermediate translation, weak reference]. Figures 7 and 11 show that MT-Ladder trained under this weak supervision can refine translations from the weak label annotator ALMA-7B-LoRA, surpassing it in both BLEU and COMET scores. Remarkably, it even outperforms gold label supervision in three translation directions. This demonstrates the potential of our instruction-following refinement method to exceed the current limits of supervision.

MT-Ladder Can Act as a Good Translator and Execute Self-refinement. We evaluate the translation capability of MT-Ladder and explore its self-refinement potential. Figure 8 shows that MT-Ladder-2B can also execute the direct translation task and can improve its own initial translations across 8 translation directions, with increased COMET scores. However, the refinement effect becomes less pronounced with each iteration. More metrics are in Appendix E.

4 Related Work

Automatic Post-Editing and Refinement

APE aims to cope with systematic errors of an MT system and adapt the output to the lexicon/style requested in a specific application domain. Correia and Martins (2019) proposed a BERT-based method for APE using transfer learning. Other studies (Negri et al., 2018; Vu and Haffari, 2018; Chatterjee, 2019; Shterionov et al., 2020; Voita et al., 2019; Góis et al., 2020; Chollampatt et al., 2020; do Carmo et al., 2020) investigated dataset construction, model architectures, and context integration to improve post-edited translations.

With the development of LLMs, learning-based approaches have trained LLMs for refining translations to improve the overall translation segment quality (Xu et al., 2023b; Alves et al., 2024; Koneru et al., 2023). Recent works (Chen et al., 2023; Raunak et al., 2023; Feng et al., 2024) have also explored using powerful LLMs, such as ChatGPT, to refine translations through prompting strategies like in-context learning and self-correction.

LLMs for Machine Translation

LLM-based machine translation falls into two main categories. The first focuses on strategies like prompt design, in-context example selection, and evaluation in various contexts such as low-resource, document-level, and multilingual translation (Vilar et al., 2022; Zhang et al., 2023a; Peng et al., 2023; Wang et al., 2023; Liang et al., 2023; He et al., 2024a). The second category focuses on training translation-specific LLMs. Prior studies (Zeng et al., 2023; Jiao et al., 2023a; Kudugunta et al., 2024; Zan et al., 2024; Li et al., 2024; Guo et al., 2024; He et al., 2024b; Wu et al., 2024; Xu et al., 2024) have explored aspects such as dataset construction, training paradigms, and different contexts to achieve better translation performance.

5 Conclusion

In this paper, we introduce MT-Ladder, a model-agnostic and cost-effective tool for multilingual translation refinement that bridges the gap between off-the-shelf models and top-tier translation models. We sample translations from existing models to create pseudo-refinement training triplets without human annotations, which makes training cost-efficient. The proposed hierarchical fine-tuning strategy improves MT-Ladder’s refining performance step by step, following an easy-to-hard schema. Our exploration of training paradigms demonstrates good performance in effectiveness and robustness, as well as promising results in weak-to-strong generalization and self-refinement, providing valuable insights to the MT area.

Limitations

Although MT-Ladder has shown promising results in bridging the gap between the translation performance of different models, it has some limitations. We have validated MT-Ladder’s support for sentence-level translations, but document-level support still needs exploration. Expanding MT-Ladder’s usage to support more languages, especially low-resource languages, is also crucial for future work. Additionally, deploying this approach to larger models (e.g., 70B) or smaller models (e.g., less than 1B) is worth exploring in future research. Leveraging the principles of MT-Ladder to explore instruction-following refinement in more generation tasks is also an interesting direction for future work.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant No. 62106222), the Natural Science Foundation of Zhejiang Province, China (Grant No. LZ23F020008), and the Zhejiang University-Angelalign Inc. R&D Center for Intelligent Healthcare.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Agrawal et al. (2023) Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, and Marjan Ghazvininejad. 2023. In-context examples selection for machine translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8857–8873, Toronto, Canada. Association for Computational Linguistics.
  • Alves et al. (2024) Duarte M Alves, José Pombal, Nuno M Guerreiro, Pedro H Martins, João Alves, Amin Farajian, Ben Peters, Ricardo Rei, Patrick Fernandes, Sweta Agrawal, et al. 2024. Tower: An open multilingual large language model for translation-related tasks. arXiv preprint arXiv:2402.17733.
  • Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
  • Aryabumi et al. (2024) Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, et al. 2024. Aya 23: Open weight releases to further multilingual progress. arXiv preprint arXiv:2405.15032.
  • Chatterjee (2019) Rajen Chatterjee. 2019. Automatic post-editing for machine translation. arXiv preprint arXiv:1910.08592.
  • Chen et al. (2023) Pinzhen Chen, Zhicheng Guo, Barry Haddow, and Kenneth Heafield. 2023. Iterative translation refinement with large language models. arXiv preprint arXiv:2306.03856.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  • Chollampatt et al. (2020) Shamil Chollampatt, Raymond Susanto, Liling Tan, and Ewa Szymanska. 2020. Can automatic post-editing improve nmt? In Proceedings of EMNLP.
  • Correia and Martins (2019) Gonçalo M. Correia and André F. T. Martins. 2019. A Simple and Effective Approach to Automatic Post-Editing with Transfer Learning. In Proceedings of ACL.
  • Costa-jussà et al. (2022) Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.
  • do Carmo et al. (2020) Félix do Carmo, D. Shterionov, Joss Moorkens, Joachim Wagner, Murhaf Hossari, Eric Paquin, Dag Schmidtke, Declan Groves, and Andy Way. 2020. A review of the state-of-the-art in automatic post-editing. Machine Translation, 35:101 – 143.
  • Feng et al. (2024) Zhaopeng Feng, Yan Zhang, Hao Li, Wenqiang Liu, Jun Lang, Yang Feng, Jian Wu, and Zuozhu Liu. 2024. Improving llm-based machine translation with systematic self-correction. Preprint, arXiv:2402.16379.
  • Freitag et al. (2021) Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. 2021. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.
  • Garcia et al. (2023) Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot learning for machine translation. In Proceedings of the 40th International Conference on Machine Learning, pages 10867–10878. PMLR.
  • Góis et al. (2020) António Góis, Kyunghyun Cho, and André Martins. 2020. Learning non-monotonic automatic post-editing of translations from human orderings. arXiv preprint arXiv:2004.14120.
  • Guo et al. (2024) Jiaxin Guo, Hao Yang, Zongyao Li, Daimeng Wei, Hengchao Shang, and Xiaoyu Chen. 2024. A novel paradigm boosting translation capabilities of large language models. arXiv preprint arXiv:2403.11430.
  • He et al. (2024a) Zhiwei He, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Yujiu Yang, Rui Wang, Zhaopeng Tu, Shuming Shi, and Xing Wang. 2024a. Exploring human-like translation strategy with large language models. Transactions of the Association for Computational Linguistics, 12:229–246.
  • He et al. (2024b) Zhiwei He, Xing Wang, Wenxiang Jiao, Zhuosheng Zhang, Rui Wang, Shuming Shi, and Zhaopeng Tu. 2024b. Improving machine translation with human feedback: An exploration of quality estimation as a reward model. arXiv preprint arXiv:2401.12873.
  • Hendy et al. (2023) Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation. arXiv preprint arXiv:2302.09210.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Jiao et al. (2023a) Wenxiang Jiao, Jen-tse Huang, Wenxuan Wang, Zhiwei He, Tian Liang, Xing Wang, Shuming Shi, and Zhaopeng Tu. 2023a. ParroT: Translating during chat using large language models tuned with human translation and feedback. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15009–15020, Singapore. Association for Computational Linguistics.
  • Jiao et al. (2023b) Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, and Zhaopeng Tu. 2023b. Is chatgpt a good translator? a preliminary study. arXiv preprint arXiv:2301.08745, 1(10).
  • Koneru et al. (2023) Sai Koneru, Miriam Exel, Matthias Huck, and Jan Niehues. 2023. Contextual refinement of translations: Large language models for sentence and document-level post-editing. arXiv preprint arXiv:2310.14855.
  • Kudugunta et al. (2024) Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. 2024. Madlad-400: A multilingual and document-level large audited dataset. Advances in Neural Information Processing Systems, 36.
  • Li et al. (2024) Jiahuan Li, Hao Zhou, Shujian Huang, Shanbo Cheng, and Jiajun Chen. 2024. Eliciting the translation ability of large language models via multilingual finetuning with translation instructions. Transactions of the Association for Computational Linguistics, 12:576–592.
  • Liang et al. (2023) Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. 2023. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118.
  • Moslem et al. (2023) Yasmin Moslem, Rejwanul Haque, John D. Kelleher, and Andy Way. 2023. Adaptive machine translation with large language models. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 227–237.
  • Negri et al. (2018) Matteo Negri, Marco Turchi, Rajen Chatterjee, and Nicola Bertoldi. 2018. Escape: a large-scale synthetic corpus for automatic post-editing. arXiv preprint arXiv:1803.07274.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
  • Peng et al. (2023) Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. 2023. Towards making the most of ChatGPT for machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5622–5633, Singapore. Association for Computational Linguistics.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191.
  • Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 3505–3506, New York, NY, USA. Association for Computing Machinery.
  • Raunak et al. (2023) Vikas Raunak, Amr Sharaf, Hany Hassan Awadallah, and Arul Menezes. 2023. Leveraging gpt-4 for automatic translation post-editing. arXiv preprint arXiv:2305.14878.
  • Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. Comet: A neural framework for mt evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702.
  • Rei et al. (2022) Ricardo Rei, Marcos Treviso, Nuno M Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José GC De Souza, Taisiya Glushkova, Duarte Alves, Luísa Coheur, et al. 2022. Cometkiwi: Ist-unbabel 2022 submission for the quality estimation shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–645.
  • Shterionov et al. (2020) Dimitar Shterionov, Félix do Carmo, Joss Moorkens, Murhaf Hossari, Joachim Wagner, Eric Paquin, Dag Schmidtke, Declan Groves, and Andy Way. 2020. A roadmap to neural automatic post-editing: an empirical approach. Machine Translation, 34(2–3):67–96.
  • Simard et al. (2007) Michel Simard, Nicola Ueffing, Pierre Isabelle, and Roland Kuhn. 2007. Rule-based translation with statistical phrase-based post-editing. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 203–206, Prague, Czech Republic. Association for Computational Linguistics.
  • Singh et al. (2024) Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, et al. 2024. Aya dataset: An open-access collection for multilingual instruction tuning. arXiv preprint arXiv:2402.06619.
  • Specia et al. (2017) Lucia Specia, Kim Harris, Frédéric Blain, Aljoscha Burchardt, Viviven Macketanz, Inguna Skadin, Matteo Negri, and Marco Turchi. 2017. Translation quality and productivity: A study on rich morphology languages. In Proceedings of Machine Translation Summit XVI: Research Track, pages 55–71, Nagoya Japan.
  • Specia et al. (2010) Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine Translation, 24(1):39–50.
  • Suárez et al. (2019) Pedro Javier Ortiz Suárez, Benoit Sagot, and Laurent Romary. 2019. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7), Cardiff, 22nd July 2019, pages 9–16, Mannheim. Leibniz-Institut für Deutsche Sprache.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Vilar et al. (2022) David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster. 2022. Prompting palm for translation: Assessing strategies and performance. arXiv preprint arXiv:2211.09102.
  • Voita et al. (2019) Elena Voita, Rico Sennrich, and Ivan Titov. 2019. Context-aware monolingual repair for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 877–886, Hong Kong, China. Association for Computational Linguistics.
  • Vu and Haffari (2018) Thuy-Trang Vu and Gholamreza Haffari. 2018. Automatic post-editing of machine translation: A neural programmer-interpreter approach. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3048–3053, Brussels, Belgium. Association for Computational Linguistics.
  • Wang et al. (2023) Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. 2023. Document-level machine translation with large language models. arXiv preprint arXiv:2304.02210.
  • Wu et al. (2024) Minghao Wu, Thuy-Trang Vu, Lizhen Qu, George Foster, and Gholamreza Haffari. 2024. Adapting large language models for document-level machine translation. arXiv preprint arXiv:2401.06468.
  • Xu et al. (2023a) Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. 2023a. A paradigm shift in machine translation: Boosting translation performance of large language models. Preprint, arXiv:2309.11674.
  • Xu et al. (2024) Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. 2024. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. Preprint, arXiv:2401.08417.
  • Xu et al. (2023b) Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhongtao Liu, William Yang Wang, Lei Li, and Markus Freitag. 2023b. Pinpoint, not criticize: Refining large language models via fine-grained actionable feedback. arXiv preprint arXiv:2311.09336.
  • Yang et al. (2023) Wen Yang, Chong Li, Jiajun Zhang, and Chengqing Zong. 2023. Bigtranslate: Augmenting large language models with multilingual translation capability over 100 languages. arXiv preprint arXiv:2305.18098.
  • Zan et al. (2024) Changtong Zan, Liang Ding, Li Shen, Yibing Zhen, Weifeng Liu, and Dacheng Tao. 2024. Building accurate translation-tailored llms with language aware instruction tuning. arXiv preprint arXiv:2403.14399.
  • Zeng et al. (2023) Jiali Zeng, Fandong Meng, Yongjing Yin, and Jie Zhou. 2023. Tim: Teaching large language models to translate with comparison. arXiv preprint arXiv:2307.04408.
  • Zhang et al. (2023a) Biao Zhang, Barry Haddow, and Alexandra Birch. 2023a. Prompting large language model for machine translation: A case study. In Proceedings of the 40th International Conference on Machine Learning, ICML’23.
  • Zhang et al. (2023b) Shaolei Zhang, Qingkai Fang, Zhuocheng Zhang, Zhengrui Ma, Yan Zhou, Langlin Huang, Mengyu Bu, Shangtong Gui, Yunji Chen, Xilin Chen, et al. 2023b. Bayling: Bridging cross-lingual alignment and instruction following through interactive translation for large language models. arXiv preprint arXiv:2306.10968.
  • Zhu et al. (2023) Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Lingpeng Kong, Jiajun Chen, Lei Li, and Shujian Huang. 2023. Multilingual machine translation with large language models: Empirical results and analysis. arXiv preprint arXiv:2304.04675.

Appendix A Dataset Statistics

Table 5 presents the statistics of the data we used. For the development set, we randomly sampled 100 examples from the development parallel data and used ALMA-7B-LoRA to generate intermediate translations, yielding 800 development triplets in total. A hedged sketch of this triplet construction is shown below.
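The development triplets can be reproduced, in spirit, with a short script. The sketch below is only an illustration under stated assumptions: the checkpoint name, prompt template, and decoding settings stand in for the actual pipeline used in the paper.

```python
# Minimal sketch of development-triplet construction (illustrative; the model id,
# prompt wording, and generation settings are assumptions, not the paper's exact setup).
import random
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "haoranxu/ALMA-7B-R"  # hypothetical checkpoint name standing in for ALMA-7B-LoRA

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def translate(source, src_lang="German", tgt_lang="English"):
    # Prompt format is an assumption; ALMA-style translation prompting is used here for concreteness.
    prompt = f"Translate this from {src_lang} to {tgt_lang}:\n{src_lang}: {source}\n{tgt_lang}:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, num_beams=5)
    # Strip the prompt tokens and return only the generated continuation.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

def build_dev_triplets(parallel_pairs, n_samples=100, seed=0):
    """parallel_pairs: list of (source, reference) tuples for one translation direction."""
    random.seed(seed)
    sampled = random.sample(parallel_pairs, n_samples)
    # Each triplet pairs the source with an intermediate (pseudo) translation and the gold reference.
    return [(src, translate(src), ref) for src, ref in sampled]
```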

Appendix B Baseline Models

Translation Models

  • BigTranslate (Yang et al., 2023) extends LLaMA with multilingual translation capability over more than 100 languages.

  • BayLing (Zhang et al., 2023b) is an instruction-following large language model equipped with advanced language alignment.

  • NLLB (Costa-jussà et al., 2022) is a multilingual translation model with an encoder-decoder architecture.

  • ALMA (Xu et al., 2023a) is a many-to-many LLM-based translation model that represents the top tier of open-source translators.

Non-translation Models

  • Alpaca (Taori et al., 2023) is a LLaMA model fine-tuned on 52K instruction-following examples.

  • Vicuna-v1.5 (Chiang et al., 2023) is fine-tuned from LLaMA-2 with supervised instruction fine-tuning. The training data consists of around 125K conversations collected from ShareGPT (https://sharegpt.com).

  • text-davinci-003 is a GPT-3.5 model with 175B parameters (Ouyang et al., 2022).

  • GPT-4 (Achiam et al., 2023) is the latest and most powerful model in the GPT series. We use the OpenAI API with gpt-4-1106-preview.

SoTA APE Models

  • Contrast Translation (CT) and Rephrase (Re) (Chen et al., 2023) are two prompt-based translation refinement methods. CT inserts the word "bad" into the prompt so that the instruction-following LLM produces a contrastive (i.e., improved) translation. Re asks the LLM to rephrase the original translation. A hedged sketch of both prompt templates is given after this list.

  • LLMRefine (Xu et al., 2023b) is fine-tuned from PaLM 2 (Bison) to iteratively refine an LLM's output using fine-grained, actionable feedback.

  • TowerInstruct (Alves et al., 2024) is an effective translation post-editor. It is fine-tuned on high-quality parallel translation data totaling 637K examples. The APE-related tasks include MQM evaluation data (WMT20 to WMT22) annotated with Multidimensional Quality Metrics (Freitag et al., 2021), accounting for 20.9% of the data. Translation data with post-edits from QT21 (Specia et al., 2017) and ApeQuest (https://apequest.wordpress.com/) are used for automatic post-editing, making up 3.1% and 3.3% of the data, respectively. TowerInstruct outperforms open models and GPT-3.5-turbo on APE.

  • GPT-4o mini scores 82% on MMLU and currently outperforms GPT-4-turbo-0125 on chat preferences on the LMSYS leaderboard. It surpasses GPT-3.5 Turbo and other small models on academic benchmarks and supports the same range of languages as GPT-4o (https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/).
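For concreteness, the following is a minimal sketch of how the CT and Re prompts could be written; the exact wording is an assumption and not the original templates of Chen et al. (2023).

```python
# Hedged sketch of the two prompt-based refinement baselines (Chen et al., 2023).
# The wording below is illustrative; the authors' exact templates may differ.
def contrast_translation_prompt(source, draft, src_lang="German", tgt_lang="English"):
    # CT: label the existing draft as a "bad" translation and ask for a better one.
    return (
        f"Here is a bad {tgt_lang} translation of a {src_lang} sentence.\n"
        f"{src_lang}: {source}\n"
        f"Bad translation: {draft}\n"
        f"Please provide a better {tgt_lang} translation:"
    )

def rephrase_prompt(draft, tgt_lang="English"):
    # Re: simply ask the LLM to rephrase the original translation.
    return f"Please rephrase the following {tgt_lang} translation:\n{draft}"
```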

Appendix C Training Details

We fine-tune our models using LoRA with a rank of 16 and a learning rate of 1e-4. All models are fine-tuned for 1 epoch with a batch size of 16 and a maximum sequence length of 512. We adopt DeepSpeed (Rasley et al., 2020) to accelerate training. A hedged configuration sketch is shown below.
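The sketch below illustrates how such a setup might be expressed with Hugging Face peft and transformers; the target modules, LoRA alpha, precision, and DeepSpeed config path are illustrative assumptions rather than the exact configuration used.

```python
# Hedged sketch of the LoRA fine-tuning configuration (rank 16, lr 1e-4, 1 epoch,
# batch size 16, max length 512). Unstated options below are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

BASE_MODEL = "google/gemma-7b"  # MT-Ladder-7B backbone; google/gemma-2b for MT-Ladder-2B

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

lora_config = LoraConfig(
    r=16,                                                      # LoRA rank reported in the paper
    lora_alpha=32,                                             # assumption: not reported
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumption: typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="mt-ladder-lora",
    learning_rate=1e-4,
    num_train_epochs=1,
    per_device_train_batch_size=16,   # effective batch size of 16
    deepspeed="ds_config.json",       # DeepSpeed config file; path is illustrative
    bf16=True,                        # assumption: mixed-precision training
)
# Training triplets (source, intermediate translation, reference) are tokenized with a
# maximum length of 512 before being passed to a Trainer.
```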

Appendix D Base Model Effect

We also fine-tuned LLaMA-3-8B (https://github.com/meta-llama/llama3) with the proposed HFT method on the same training set as MT-Ladder-7B and evaluated it on the development set when refining ALMA-7B-LoRA. The results in Figure 10 show that in the XX-En direction, LLaMA-3-8B achieved higher scores on Zh-En and Cs-En but lagged behind Gemma-7B in the other two translation directions, resulting in comparable average scores between the two models. In the En-XX direction, LLaMA-3-8B outperformed Gemma-7B in three out of four translation directions, with a higher average score than the current MT-Ladder-7B. This suggests that LLaMA-3-8B's stronger multilingual capabilities, acquired from exposure to a broader multilingual corpus during pre-training, carry over to the refinement task. We consider the choice of base model a crucial direction for future improvements to MT-Ladder.

Appendix E Self-translation and Self-refinement

For Section 4, we supplement the BLEU and COMETKiwi scores of MT-Ladder-2B (see Figures 12 and 13) and all metrics of MT-Ladder-7B (see Figures 14, 15, and 16). A hedged sketch of the self-refinement procedure is shown below.
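As a rough illustration of the procedure behind Figures 12-16, the sketch below shows one way the self-translation and iterative self-refinement steps could be wired together; the translate and refine helpers are hypothetical stand-ins for MT-Ladder's actual prompting interface.

```python
# Hedged sketch of self-translation followed by iterative self-refinement.
# `ladder.translate` and `ladder.refine` are hypothetical wrappers around
# MT-Ladder's prompts, not the paper's exact interface.
def self_translate_and_refine(ladder, source, n_iters=2):
    draft = ladder.translate(source)          # Self-translation: the model produces its own draft
    refined = draft
    for _ in range(n_iters):                  # Iter1 refines the draft; Iter2 refines Iter1's output
        refined = ladder.refine(source, refined)
    return refined
```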

Refer to caption
Figure 9: Comparison of original quality (x-axis) with refined quality (y-axis) at different fine-tuning stages. Each dot is a WMT22 De-En translation in our development set. We select the checkpoints at 2, 6, and 10 from Figure 5 (referred to as Stage 1, Stage 2, and Stage 3 here). Δ denotes the average improvement, and σ refers to the standard deviation of Δ. The percentages next to the markers represent the proportion of each part.
Language        Train    Development    Test (from English)    Test (to English)
Chinese (Zh)    15406    1002           2037                   1875
German (De)     14211    1002           2037                   1984
Russian (Ru)    15000    1002           2037                   2016
Czech (Cs)      12076    1002           2037                   1448
Table 5: The statistics of the parallel data we used.
Refer to caption
Figure 10: COMET scores when using different models as the base model of MT-Ladder. Origin denotes the ALMA-7B-LoRA translation.
Refer to caption
Figure 11: Weak-to-strong BLEU scores. We fine-tune Gemma-7B using different references as labels to refine the development set. Origin denotes the ALMA-7B-LoRA translation. Blue represents using ALMA-7B-LoRA translations as references. Red represents using the gold translations as references.
Refer to caption
Figure 12: BLEU scores for Self-translation and Self-refinement. Iter1 denotes MT-Ladder-2B refining its original translation. Iter2 denotes MT-Ladder-2B refining the translation edited by MT-Ladder-2B in Iter1.
Refer to caption
Figure 13: COMETKiwi scores for Self-translation and Self-refinement. Iter1 denotes MT-Ladder-2B refining its original translation. Iter2 denotes MT-Ladder-2B refining the translation edited by MT-Ladder-2B in Iter1.
Refer to caption
Figure 14: BLEU scores for Self-translation and Self-refinement with MT-Ladder-7B. Self-Refinement denotes MT-Ladder-7B refining its original translation.
Refer to caption
Figure 15: COMET scores for Self-translation and Self-refinement with MT-Ladder-7B. Self-Refinement denotes MT-Ladder-7B refining its original translation.
Refer to caption
Figure 16: COMETKiwi scores for Self-translation and Self-refinement with MT-Ladder-7B. Self-Refinement denotes MT-Ladder-7B post-editing its original translation.
Anti-HFT Case | COMET
German Source: So jedenfalls macht die grandiose F1-Saison wesentlich weniger Spaß als es mit einem vernünftigen Sender möglich wäre. | -
English Reference: At any rate, it really makes the grand F1 season considerably less fun as would be the case with a reasonable broadcaster. | 95.61
Intermediate Translation: So, in any case, the grandiose F1 season is much less fun than it would be with a reasonable broadcaster. | 87.50
Anti-HFT Stage 1 (Hard): So, at any rate, the grandiose F1 season is much less fun than it would be with a reasonable broadcaster. | 87.55
Anti-HFT Stage 2 (Hard+Medium): So, at least, the grandiose F1 season is much less fun than it would be with a reasonable broadcaster. | 83.32
Anti-HFT Stage 3 (Hard+Medium+Easy): So the great F1 season is much less fun than it would be with a decent broadcaster. | 81.57
HFT Case | COMET
German Source: Es ist schade, dass wir den Flow nicht mitnehmen konnten. | -
English Reference: It is a shame that we were not able to get into the flow. | 96.32
Intermediate Translation: It is a shame that we couldn’t take the flow with us. | 83.21
HFT Stage 1 (Easy): It’s a shame we couldn’t keep the momentum going. | 79.54
HFT Stage 2 (Easy+Medium): It’s a shame that we couldn’t take the flow with us. | 81.18
HFT Stage 3 (Easy+Medium+Hard): It’s a shame that we couldn’t keep the flow going. | 84.10
Table 6: Case study. Stages correspond to Figure 9.
COMET: 69.73
Chinese Source: 但八年前濒临倒闭,不得不接受救助从那时开始便放弃了那样的追求。
Intermediate Translation: But eight years ago, it was on the verge of bankruptcy and had to accept help. From that time on, I gave up such pursuits.
English Reference: It has retreated from them since it nearly collapsed eight years ago and had to be bailed out.
COMET: 83.37
English Source: Representatives of junior doctors have called on their union to authorise fresh industrial action in their dispute about a new contract.
Intermediate Translation: 低级医生代表呼吁他们的工会授权新的工业行动,因为他们对新合同的争议仍未得到解决。
Chinese Reference: 初级医生代表号召联盟批准其针对新合同纠纷采取新的劳工行动。
COMET: 91.84
German Source: Ich hätte mich gefreut, wenn Mesut Özil weiter für Deutschland gespielt hätte.
Intermediate Translation: I would have been delighted if Mesut Özil had continued to play for Germany.
English Reference: I would be happy if Mesut Özil continued to play for Germany.
Table 7: Cases of triplets with different COMET scores.