¹¹institutetext: School of Artificial Intelligence, University of Chinese Academy of Sciences,
Beijing 100049, P.R. China ²²institutetext: State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, P.R. China ³³institutetext: Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd,
Beijing 100190, P.R. China ⁴⁴institutetext: Samsung Research China - Beijing (SRC-B)
⁴⁴email: {cong.ma, yaping.zhang, yang.zhao, yzhou, cqzong}@nlpr.ia.ac.cn, mei.tu@samsung.com

E2TIMT: Efficient and Effective Modal Adapter for Text Image Machine Translation

Cong Ma 1122 Yaping Zhang Corresponding author.1122 Mei Tu 44 Yang Zhao 1122 Yu Zhou 2233 Chengqing Zong 1122

Abstract

Text image machine translation (TIMT) aims to translate texts embedded in images from one source language to another target language. Existing methods, both two-stage cascade and one-stage end-to-end architectures, suffer from different issues. The cascade models can benefit from the large-scale optical character recognition (OCR) and MT datasets but the two-stage architecture is redundant. The end-to-end models are efficient but suffer from training data deficiency. To this end, in our paper, we propose an end-to-end TIMT model fully making use of the knowledge from existing OCR and MT datasets to pursue both an effective and efficient framework. More specifically, we build a novel modal adapter effectively bridging the OCR encoder and MT decoder. End-to-end TIMT loss and cross-modal contrastive loss are utilized jointly to align the feature distribution of the OCR and MT tasks. Extensive experiments show that the proposed method outperforms the existing two-stage cascade models and one-stage end-to-end models with a lighter and faster architecture. Furthermore, the ablation studies verify the generalization of our method, where the proposed modal adapter is effective to bridge various OCR and MT models. ¹¹1Our codes are available at: https://github.com/EriCongMa/E2TIMT

Keywords:

Text image machine translation Modal adapter Cross modal contrastive learning

1 Introduction

Text image machine translation (TIMT) is the core research of many applications, such as scene text translation, document image translation, and photo translation. Approaches to TIMT are mainly divided into two categories: two-stage cascade method [10, 26, 1, 3, 7] and one-stage end-to-end method [29, 5, 20]. The cascade model deploys recognition and translation models sequentially, which benefits from training with existing large-scale optical character recognition (OCR) and machine translation (MT) datasets. However, the task gap between OCR and MT models might hurt the performance because translation models are vulnerable to recognition errors. Furthermore, the cascade model is two-stage, i.e. the sequential integration of OCR and MT models, thus is redundant in parameters and has a slow decoding speed. To alleviate the error propagation problem, some studies turn to exploring one-stage end-to-end architecture with fewer parameters and faster decoding speed [20]. However, the scarcity of end-to-end TIMT data limits the performance of end-to-end models. Although the multi-task learning enhanced end-to-end TIMT model incorporates external OCR datasets [29, 5] or MT datasets [19], the huge potential of fully benefiting from the knowledge of existing OCR and MT datasets or their corresponding pre-trained models is seldom explored. RTNet [29] is proposed to link the OCR encoder and MT decoder, but it ignores the task gap between recognition and translation tasks, causing limited performance. In summary, the following three major challenges are usually faced in the TIMT study:

$\bullet$

Task Gap. There is a large domain gap between the OCR/MT tasks, which indicates the direct connection of the recognition and translation models is not optimal.
$\bullet$

Cascade Redundancy. It leads to model/complexity redundancy when directly cascading existing OCR and MT models without any optimization.
$\bullet$

End-to-end Data Scarcity. The dataset for end-to-end TIMT is scarce. It is critical to transfer knowledge from existing OCR and MT datasets or pre-trained models, which is seldom explored by previous methods.

In this paper, we propose a novel modal adapter architecture to improve the end-to-end TIMT model by eliminating task gaps and making full of the knowledge from pre-trained OCR and MT models. Furthermore, the modal adapter can is a parameter efficient fine-tuning method, which just optimizes parameters of modal adapter by frozen pre-trained encoder and decoder. Thus, modal adapter based TIMT model has much fewer parameters to update compared with end-to-end models and has a faster inference speed than cascade models. In detail, a self-attention based modal adapter is incorporated between the pre-trained OCR encoder and MT decoder. Different from vanilla adapter tuning [24], which is just fine-tuned on downstream tasks, the task gap is bridged in our framework by a cross-modal contrastive loss that aligns the distributions between the OCR and MT features of the same sentence content. Two types of modal adapters are studied to validate the effectiveness of bridging various OCR and MT modules. Embedding modal adapter (EmbMA) is proposed to bridge OCR image encoder and MT sequential encoder, while sequential modal adapter (SeqMA) is inserted between OCR Sequential encoder and MT decoder. Finally, the MT decoder generates the translation from the features transformed by the modal adapter. Our contributions are summarized as follows:

$\bullet$

We propose a modal adapter based TIMT model to unify cascade and end-to-end models by bridging the pre-trained recognition encoder and translation decoder.
$\bullet$

Cross-modal contrastive learning is incorporated to align the distribution of image features and text features encoded by an OCR encoder and an MT encoder respectively, which alleviates the OCR-MT task gap and improves the performance of text image machine translation.
$\bullet$

Extensive experiments show our method outperforms both the existing cascade models and end-to-end models with a lighter and faster architecture. Furthermore, the modal adapter has a good generalization when bridging various recognition encoders and translation decoders.

Refer to caption — Figure 1: Architectures of OCR, MT, and Modal Adapter for TIMT Model. The solid arrow lines represent the data flow in the model. The dotted arrow lines denote the parameters of encoder and decoder in modal adapter based TIMT are inherited from pre-trained OCR/MT models.

2 Preliminary

To unify the processing progress of recognition and translation models, we divide both the OCR and MT encoders into two submodules: image/text encoder for embedding encoding, and sequential encoder for contextual feature extraction. We will introduce the processing flow of OCR and MT models individually.

2.1 OCR Model

As shown in Figure 1 (a), given an input image $I$ , a convolutional neural network (CNN) based image encoder extracts the image embedding $E_{I}$ by transforming image pixels into feature vectors:

E_{I}=\text{CNN}(I)

(1)

where $I\in\mathbb{R}^{H\times W\times C}$ and $E_{I}\in\mathbb{R}^{l_{I}\times c}$ . $H$ , $W$ , and $C$ denote the height, width, and channel of the input image respectively. $l_{I}$ represents the length of image embedding, which is calculated as $l_{I}=h\times w$ , where $h$ , $w$ , and $c$ denote the height, width, and channel of the encoded feature map separately.

The image encoder mainly extracts the local features of the input images, while the image sequential encoder aims to model contextual information by considering the whole input sequence:

S_{I}=\text{Seq}_{I}(E_{I})

(2)

where $\text{Seq}_{I}(\cdot)$ represents the image sequential encoder and transformer encoder [32] is utilized in our implementation. $S_{I}\in\mathbb{R}^{l_{S}\times d_{S}}$ denotes the sequential features in OCR model. $l_{S}$ and $d_{S}$ denote the length and dimension of sequential features respectively.

Finally, the OCR decoder generates recognized tokens auto-regressively given sequential features:

	$\displaystyle D_{I}$	$\displaystyle=\text{Dec}_{I}(S_{I});$		(3)
	$\displaystyle P(X\|I)$	$\displaystyle=\text{Softmax}(W_{I}D_{I})$		(3)

where $\text{Dec}_{I}(\cdot)$ represents the OCR decoder, and transformer decoder [32] is utilized in our implementation. $D_{I}$ denotes the outputs of the decoder. $W_{I}\in\mathbb{R}^{|\mathcal{V}_{X}|\times d_{I}}$ represents the linear transformation that maps the decoder features into corresponding recognized tokens, $\mathcal{V}_{X}$ is the recognition vocabulary, and $d_{I}$ is the dimension of decoder hidden states.

2.2 MT Model

MT model translates the source language into the target language as shown in Figure 1 (b). Given a source language sentence $T$ , the text encoder first maps the input words into a sequence of word embeddings:

E_{T}=\text{Embedding}(T)

(4)

where $E_{T}\in\mathbb{R}^{l_{E}\times d_{E}}$ denotes the text embedding. $l_{E}$ and $d_{E}$ represent the sequence length and the dimension of text embedding respectively.

Text sequential encoder further extracts contextual features based on text embeddings:

S_{T}=\text{Seq}_{T}(E_{T})

(5)

where $\text{Seq}_{T}(\cdot)$ represents the text sequential encoder, which is a transformer encoder in our implementation. $S_{T}$ denotes the encoded text sequential features.

MT decoder finally generates the target tokens auto-regressively given sequential features:

	$\displaystyle D_{T}$	$\displaystyle=\text{Dec}_{T}(S_{T});$		(6)
	$\displaystyle P(Y\|T)$	$\displaystyle=\text{Softmax}(W_{T}D_{T})$		(6)

where $\text{Dec}_{T}(\cdot)$ represents the MT decoder, and the transformer decoder is utilized in our implementation. $D_{T}$ is the output of the decoder and $W_{T}\in\mathbb{R}^{|\mathcal{V}_{Y}|\times d_{T}}$ is the linear transformation. ${\mathcal{V}_{Y}}$ represents the target language vocabulary and $d_{T}$ denotes the dimension of the hidden states.

3 Methodology

To bridge the pre-trained OCR encoder and the MT decoder, the modal adapter is proposed to transform the OCR features into the MT feature space as shown in Figure 1 (c).

Specifically, features of the OCR encoder are transformed by the stacked modal adapter layer:

	$\displaystyle\hat{H}_{\text{MA}}^{n}$	$\displaystyle=\text{LN}(\text{MSA}(H_{\text{MA}}^{n-1}))+H_{\text{MA}}^{n-1}$		(7)
	$\displaystyle{H}_{\text{MA}}^{n}$	$\displaystyle=\text{LN}(\text{FFN}(\hat{H}_{\text{MA}}^{n}))+\hat{H}_{\text{MA}}^{n}$		(7)

where $H_{\text{MA}}^{n}$ denotes the output of the $n$ -th modal adapter layer, and $H_{\text{MA}}^{0}$ is the feature from the OCR encoder. $\text{MSA}(\cdot)$ , $\text{FFN}(\cdot)$ , and $\text{LN}(\cdot)$ represent multi-head self-attention, feed-forward, and layer norm modules respectively. After transformation by the modal adapter, features encoded by the OCR encoder are further fed into the MT decoder to generate translation results.

Since there are two submodules in the OCR encoder (image encoder and sequential encoder) as introduced in Section 2.1, we propose two types of modal adapters. The first one is the embedding modal adapter (EmbMA), which aims at aligning the image embedding and text embedding. The second one is the sequential modal adapter (SeqMA), which transforms the sequential features encoded by the image sequential encoder to the sequential feature space of the MT task. We will introduce our proposed EmbMA and SeqMA in detail.

3.1 Embedding Modal Adapter

The embedding modal adapter is placed in the middle of the OCR image encoder and the text sequential encoder as shown in Figure 2 (a). First, the EmbMA transforms the image embedding into the text embedding space. Second, to better meet the feature distribution of the MT processing flow, the output of EmbMA is constrained by the text embedding through a cross-modal contrastive loss $\mathcal{L}_{\text{CMC}}^{\text{EmbMA}}$ . As so, the output of EmbMA given $i$ -th image embedding should be similar to the $i$ -th text embedding, and apart from the other text embeddings in the mini-batch:

	$\displaystyle H_{\text{EmbMA}}^{(i)}$	$\displaystyle=\text{EmbMA}(E_{I}^{(i)})$		(8)
	$\displaystyle\mathcal{L}_{\text{CMC}}^{\text{EmbMA}}$	$\displaystyle=-\sum_{i=1}^{K}\text{log}\frac{\text{exp}(d(H_{\text{EmbMA}}^{(i)},E_{T}^{(i)})/\tau)}{\sum_{j=1}^{K}\text{exp}(d(H_{\text{EmbMA}}^{(i)},E_{T}^{(j)})/\tau)}$		(8)

where $\text{EmbMA}(\cdot)$ utilizes the same modal adapter architecture as in Equation 7. $H_{\text{EmbMA}}^{(i)}$ represents the output of the EmbMA. $E_{I}^{(i)}$ and $E_{T}^{(i)}$ denote the image and text embedding of $i$ -th sample respectively. $K$ denotes the size of the mini-batch. $\tau$ stands for the temperature parameter and $d(q,k)$ represents the similarity metric which we utilize cosine similarity in our implementation.

Aligned with text embedding, the outputs of EmbMA are further fed into the text sequential encoder to obtain the contextual feature $S_{\text{EmbMA}}^{(i)}$ . Through EmbMA, the image embeddings are transformed into the MT processing flow, and MT decoder finally generates target translation:

$\displaystyle S_{\text{EmbMA}}^{(i)}$	$\displaystyle=\text{Seq}_{T}(H_{\text{EmbMA}}^{(i)})$	(9)
$\displaystyle D_{\text{EmbMA}}^{(i)}$	$\displaystyle=\text{Dec}_{T}(S_{\text{EmbMA}}^{(i)})$
$\displaystyle P(Y^{(i)}\|I^{(i)})$	$\displaystyle=\text{Softmax}(W_{T}D_{\text{EmbMA}}^{(i)})$

3.2 Sequential Modal Adapter

Different from EmbMA, SeqMA is designed to align the sequential features of the OCR and MT models. As shown in Figure 2 (b), SeqMA first transforms the image sequential features into text sequential feature space. Then, the MT decoder generates target language tokens given transformed image sequential features:

$\displaystyle H_{\text{SeqMA}}^{(i)}$	$\displaystyle=\text{SeqMA}(S_{I}^{(i)})$	(10)
$\displaystyle D_{\text{SeqMA}}^{(i)}$	$\displaystyle=\text{Dec}_{T}(H_{\text{SeqMA}}^{(i)})$
$\displaystyle P(Y^{(i)}\|I^{(i)})$	$\displaystyle=\text{Softmax}(W_{T}D_{\text{SeqMA}}^{(i)})$

where $\text{SeqMA}(\cdot)$ uses the same structure as in Equation 7. $H_{\text{SeqMA}}^{(i)}$ denotes the output of the sequential modal adapter and $S_{I}^{(i)}$ represents the output of the image sequential encoder of the $i$ -th sample in the mini-batch. $D_{\text{SeqMA}}^{(i)}$ denotes the output of text decoder given the hidden states from SeqMA.

Since the hidden states of the SeqMA are further fed into the MT decoder, the feature distribution of $H_{\text{SeqMA}}^{(i)}$ should be similar to the hidden states of $S_{T}^{(i)}$ . To bridge the feature gap between the OCR and MT tasks, a cross-modal contrastive loss is utilized to align the feature distribution of the transformed image sequential feature and text sequential feature:

\displaystyle\mathcal{L}_{\text{CMC}}^{\text{SeqMA}}=-\sum_{i=1}^{K}\text{log}\frac{\text{exp}(d(H_{\text{SeqMA}}^{(i)},S_{T}^{(i)})/\tau)}{\sum_{j=1}^{K}\text{exp}(d(H_{\text{SeqMA}}^{(i)},S_{T}^{(j)})/\tau)}

(11)

where $d(\cdot)$ and $\tau$ are the same similarity metric and temperature parameter as introduced in Equation 8.

3.3 Training of Modal Adapter

During model training, only parameters in modal adapters are updated, while the parameters in the OCR and MT models are all fixed. Through parameter-efficient modal adapter tuning, the pre-trained OCR encoder and MT decoder are able to transfer to the TIMT task with ease. Specifically, multi-task learning is utilized by optimizing end-to-end text image translation loss and cross-modal contrastive loss. Formally, the end-to-end text image translation loss and the overall loss functions are:

	$\displaystyle\mathcal{L}_{\text{TIMT}}$	$\displaystyle=-\sum_{i=1}^{\|D_{\text{TIMT}}\|}\text{log}P(Y^{(i)}\|I^{(i)})$		(12)
	$\displaystyle\mathcal{L}_{\text{All}}$	$\displaystyle=(1-\lambda_{\text{CMC}})\mathcal{L}_{\text{TIMT}}+\lambda_{\text{CMC}}\mathcal{L}_{\text{CMC}}$		(12)

where $\mathcal{L}_{\text{CMC}}$ is introduced as in Equation 8 and Equation 11. $\lambda_{\text{CMC}}$ denotes the hyper-parameter, which balances the weight of end-to-end text image translation loss and cross-modal contrastive loss. Note that the $P(Y^{(i)}|I^{(i)})$ in end-to-end text image translation loss $\mathcal{L}_{\text{TIMT}}$ and cross-modal contrastive loss $\mathcal{L}_{\text{CMC}}$ are calculated based on the corresponding training workflow of SeqMA and EmbMA.

3.4 Inference

During model inference, as the blue arrow lines shown in Figure 2, the input images are first fed into the OCR encoder to obtain the image features. Second, the modal adapter transforms the image features into the MT feature space, and the MT decoder finally generates translation results. Note that the OCR decoder and the MT encoder are not utilized during inference resulting in a fast decoding speed with the end-to-end processing architecture as shown in Figure 1 (c). By bridging OCR encoder and MT decoder, modal adapter based method can take full advantage of pre-trained OCR and MT models.

4 Experiments

4.1 Datasets

OCR, MT, and end-to-end TIMT datasets are utilized in our experiments. OCR and MT datasets are used to train the OCR and MT models respectively. While the TIMT dataset is used to train the parameters in the modal adapter.

OCR Datasets.

OCR datasets are composed of text images and corresponding text pairs $\{(I_{i},T_{i})\}_{i=1}^{|D_{\text{OCR}}|}$ . Three OCR datasets are considered in our experiments. MJSynth (MJ) [12]²²2https://www.robots.ox.ac.uk/vgg/data/text/ is a synthetic word box image recognition dataset designed for English scene text recognition containing 8.9M synthetic word box images. SynthText (ST) [9]³³3https://www.robots.ox.ac.uk/vgg/data/scenetext/ is another synthetic dataset containing 5.5M word box images, which renders the texts onto real-world scene images. Synthetic Text Line Dataset is a customized text line recognition dataset that is constructed with the rule-based synthetic method⁴⁴4https://github.com/Belval/TextRecognitionDataGenerator. 1M English and 1M Chinese synthetic text line recognition pairs are synthesized in our experiments.

MT Datasets.

Parallel sentences from the Workshop of Machine Translation 2018 ⁵⁵5http://www.statmt.org/wmt18/ are utilized to train the text machine translation models. Specifically, three translation directions are considered in our experiments: English-to-Chinese (En $\Rightarrow$ Zh), English-to-German (En $\Rightarrow$ De), and Chinese-to-English (Zh $\Rightarrow$ En). After pre-processing and filtering, 5,984,287 En $\Leftrightarrow$ Zh and 20,895,771 En $\Rightarrow$ De translation pairs are finally obtained to train MT models.

End-to-End TIMT Datasets.

A public end-to-end TIMT dataset proposed by [19] is utilized to train end-to-end TIMT models. This dataset is a synthetic text image translation corpus by synthesizing the text image through a rule-based toolkit given randomly selected background images, font types, and other rendering effects, which is similar to the synthesis method as synthetic text line recognition dataset. The parallel sentences of the end-to-end text image translation datasets are extracted from the text translation corpus. In summary, one million end-to-end TIMT pairs are utilized for each translation direction.

Evaluation Datasets.

Evaluation sets constructed by [19] are used to measure the performance of various models. Three domains are considered, including synthetic, subtitle, and street-view evaluation domains. The synthetic evaluation dataset contains 2,502 En $\Leftrightarrow$ Zh and 2,000 En $\Rightarrow$ De translation pairs, which are synthesized as the synthetic training dataset. For real-world evaluation datasets, the En $\Leftrightarrow$ Zh subtitle dataset contains 1,040 translation pairs, while the En $\Rightarrow$ Zh street-view dataset contains 1,198 translation pairs.

Table 1: Comparison of end-to-end, cascade and modal adapter tuning based text image machine translation models.

Architecture	Synthetic			Subtitle		Street
Architecture	En $\Rightarrow$ Zh	En $\Rightarrow$ De	Zh $\Rightarrow$ En	En $\Rightarrow$ Zh	Zh $\Rightarrow$ En	Zh $\Rightarrow$ En
End-to-End Models
TRBA [2]	9.61	7.36	4.77	12.12	5.18	0.36
CLTIR [5]	18.02	15.55	10.74	16.47	9.04	0.43
CLTIR+OCR [5]	19.44	16.31	13.52	17.96	11.25	1.74
RTNet [29]	18.91	15.82	12.54	17.63	10.63	1.07
RTNet+OCR [29]	19.63	16.78	14.01	18.82	11.50	1.93
MTETIMT[19]	19.25	16.27	13.16	17.73	10.79	1.69
MTETIMT+MT[19]	21.96	18.84	15.62	19.17	12.11	5.84
MHCMM[4]	22.08	18.97	15.66	19.24	12.12	5.87
Cascade Models
CRNN + Transformer	14.43	11.27	10.52	17.88	10.06	3.25
TRBA + Transformer	17.59	13.86	12.79	18.22	10.53	4.08
TRT + Transformer	20.46	16.48	15.12	19.12	12.08	5.78
Modal Adapter Tuning Models
Sequential Modal Adapter	20.90	19.02	15.22	19.31	12.03	5.81
Embedding Modal Adapter	22.53	19.67	16.25	19.46	12.39	6.24

4.2 Experimental Settings

We implement the image encoder based on the code release by [2]. The MT model is utilized the same architecture proposed in [32]. The OCR and MT models are firstly trained with OCR and MT datasets respectively. Parameters of OCR and MT models are then frozen during fine-tuning. The implementation of the modal adapter is utilized a similar architecture as the transformer encoder with the hidden dimensions of 512, 8 attention heads, and a dropout rate of 0.1. The initial learning rate is set to 2e-3, the batch size is 64, and the training step is set to 300,000. Parameters of the modal adapter are initialized with Xavier initiation method [8] and optimized with Adam optimizer [15] on single NVIDIA V100 GPU. Detokenized BLEU [21] calculated by sacre-BLEU ⁶⁶6https://github.com/mjpost/sacrebleu is utilized as the metric to evaluate the performance of text image translation models.

4.3 Comparison of Various Text Image Translation Models

Table 1 shows the BLEU scores of text image translation models on various evaluation datasets. Three OCR models are utilized in the cascade models: CRNN [27], TPS+ResNet+BiLSTM+Attention (TRBA) [2], and TPS+ResNet+
Transformer (TRT). While transformer-base [32] is utilized for MT model. The performance of the OCR and the MT models are shown in Section 4.4. Five architectures are compared in end-to-end TIMT setting. TRBA [2] represents the OCR architecture trained with end-to-end TIMT dataset. CLTIR [5] model trains end-to-end TIMT with auxiliary OCR task. RTNet [29] utilizes a feature transformer to link OCR encoder and decoder but ignores the task gap modeling. MTETIMT [19] represents the machine translation enhanced end-to-end TIMT model, which utilizes multi-task learning with auxiliary translation task. While MHCMM[4] proposes a multi-hierarchy cross-modal mimic framework for the end-to-end text image translation, which incorporates external text translation corpus and utilizes text MT model as teacher guidance for TIMT model. The modal adapter in Table 1 bridges the pre-trained TRT OCR encoder and transformer MT decoder. Experimental results show that our proposed sequential and embedding modal adapter outperforms two-stage cascade models on three translation domains with an average improvement of 1.01 BLEU scores. Meanwhile, modal adapter improves the TIMT performance on various language directions (En $\Rightarrow$ Zh and En $\Rightarrow$ De), revealing the method is robust to different language settings. For Zh $\Rightarrow$ En translation direction, modal adapter based method achieves similar results as the previous machine translation enhanced multi-task training model, indicating modal adapter method can take full advantage of the pre-trained MT model without multi-task training.

Furthermore, the embedding modal adapter performs better than the sequential modal adapter, and we attribute that EmbMA retains the cross-attention flow between the original text sequential encoder and decoder. This shows it is vital not only to eliminate the gap between the OCR and MT tasks but also to maintain the consistency of structures within each task.

Table 2: Performance of text image recognition models. Metric of scene text recognition (Rec.) is word accuracy and character error rate is utilized for text line recognition evaluation. Tr.E and Tr.D represent transformer encoder and decoder respectively.

Architecture	Image Encoder	Image Sequential Encoder	Decoder	Scene Text Rec.			Text Line Recognition
				IIIT	SVT	SP	Synthetic	Subtitle	Street
				3000	647	645	2502	1040	1198
CRNN [27]	VGG	BiLSTM	CTC	81.3	79.0	66.7	13.90	4.95	56.82
TRBA [2]	ResNet	BiLSTM	Attention	86.6	87.8	76.9	12.29	3.01	51.67
TRT	ResNet	Tr.E	Tr.D	87.9	87.2	78.6	10.89	2.33	49.83

Table 3: Performance of text translation models. BLEU score is utilized as the metric of text translation task.

Architecture	Synthetic			Subtitle		Street
Architecture	En $\Rightarrow$ Zh	En $\Rightarrow$ De	Zh $\Rightarrow$ En	En $\Rightarrow$ Zh	Zh $\Rightarrow$ En	Zh $\Rightarrow$ En
Transformer-Base [32]	25.38	20.97	17.56	19.64	13.78	15.17
Transformer-Big [32]	26.41	22.15	19.04	20.39	14.66	16.93

4.4 Performance of OCR and MT Models

OCR and MT models in cascade models are firstly trained with corresponding OCR and MT datasets. Parameters in pre-trained OCR and MT models are then frozen during modal adapter training. Three OCR models CRNN, TRBA, and TRT are all trained with the same scene text recognition and synthetic text line recognition datasets introduced in section 4.1. Table 2 shows the performance of various OCR models. Transformer based TRT model achieves the best recognition performance, indicating the strong sequential encoder is essential for optical character recognition. For MT models, the transformer-base and the transformer-big [32] are utilized to translate the source language into the target language. Table 3 shows the performance of text translation, and the transformer-big achieves better translation BLEU.

4.5 Generalization of Modal Adapter on Various OCR and MT Combinations

To evaluate the generalization of our proposed method, the modal adapter is studied by bridging various OCR encoders and MT decoders. As shown in Figure 3, modal adapter tuning outperforms the cascade models on different OCR and MT combinations, revealing the good generalization of modal adapter tuning methods. Figure 3 (a) shows the text image translation results by combining different OCR models and transformer base MT model. Better OCR image encoder can extract more information into image features, leading to better text image translation performance.

Figure 3 (b) depicts various OCR models with transformer big MT models. Similar to Figure 3 (a), better OCR models achieve better results with transformer big MT models. Furthermore, stronger MT decoders can further improve the translation performance in Figure 3 (b) compared with Figure 3 (a). As a result, our proposed modal adapter tuning method has strong scalability by bridging better OCR and MT models.

4.6 Analysis on Model Size and Decoding Speed of TIMT Models

Cascade models have redundant parameters and slow decoding speed. By removing the OCR decoder and the MT encoder, the modal adapter tuning method has fewer parameters and a faster decoding speed. As shown in Table 4, the end-to-end model, which is trained from the scratch, has fewer parameters and a faster decoding speed compared with the cascade model. Fine-tuning model is also an end-to-end model, which is initialized with the OCR encoder and MT decoder. Then the fine-tuning model is trained with the end-to-end text image translation dataset. Since the modal adapter bridges the OCR encoder and the MT decoder directly, it has a faster decoding speed than the cascade model. Meanwhile, after removing the OCR decoder and MT encoder, modal adapter models have fewer parameters than the cascade model. For the comparison of fine-tuning methods, modal adapter tuning outperforms fine-tuning model, because modal adapter models the task consistency between the OCR encoder and MT decoder, which alleviates the gap between OCR and MT tasks.

Table 4: Comparison of model size and decoding speed among various models on English-to-Chinese translation direction. The unit of parameters is million (

\times 10^{6}

), while the unit for speed is sentence per second. BLEU score is utilized to show the performance of synthetic and subtitle text image translation.

Architecture	Finetuned Params.	Total Params.	Speed	Synthetic	Subtitle
Cascade	-	195.1M	3.07	20.46	19.12
End-to-End	-	121.9M ( $\downarrow$ 37.52%)	5.21 ( $\uparrow$ 1.70x)	19.63	18.82
Fine-tuning	121.9M	121.9M ( $\downarrow$ 37.52%)	5.21 ( $\uparrow$ 1.70x)	20.18	19.04
SeqMA	13.2M	135.1M ( $\downarrow$ 30.75%)	5.12 ( $\uparrow$ 1.67x)	20.90	19.31
EmbMA	13.2M	135.1M ( $\downarrow$ 30.75%)	5.12 ( $\uparrow$ 1.67x)	22.53	19.46

Table 5: Comparison of adapter tuning and modal adapter tuning on English-to-Chinese translation.

Architecture	Synthetic	Subtitle
Adapter Tuning	16.72	15.87
SeqMA (Bottleneck)	18.25	16.80
EmbMA (Bottleneck)	21.82	19.35
SeqMA	20.90	19.31
EmbMA	22.53	19.46

4.7 Comparison with Adapter Tuning

Adapter tuning [24] is an effective parameter-efficient fine-tuning method. Different from adapter tuning, which inserted bottleneck modules inside the pre-trained transformer layers, the modal adapter is designed outside the pre-trained models by bridging the separated OCR encoder and the MT decoder. As shown in Table 5, the modal adapter significantly outperforms adapter tuning with 5.81 BLEU for the synthetic domain and 3.59 BLEU for the subtitle domain. To offer a more similar architecture, we also put the bottleneck-based adapter outside the pre-trained models, which is similar to our proposed modal adapter tuning. Bottleneck-based modal adapter tuning also outperforms the vanilla adapter tuning, revealing the effectiveness of explicitly modeling the transformation mapping from the OCR feature space to the MT feature space. Finally, self-attention based modal adapter outperforms the bottleneck-based modal adapter, which we attribute to the strong encoding ability of stacked self-attention layers.

4.8 Analysis on the Amount of End-to-End TIMT Dataset

Parameters of the modal adapter are trained on the end-to-end TIMT dataset and the amount of end-to-end data has a great impact on performance. Figure 4 shows the performance of modal adapter tuning with different amounts of end-to-end TIMT datasets. When the end-to-end data is low-resource (around 200 thousand image-text pairs), the performance of modal adapter tuning is limited. We attribute the reason to the non-convergence of modal adapter given low-resource end-to-end data. As the amount of end-to-end TIMT data increases, the modal adapter achieves better results, revealing the modal adapter needs enough data to learn the transformation from the OCR feature space to the MT feature space. When the end-to-end image-text translation data achieves more than 800 thousand pairs, the TIMT results tend to be stable and perform the best translation results. Thus, one million end-to-end text image translation pairs are suitable to train a good end-to-end TIMT model.

4.9 Hyper-parameter Analysis

Hyper-parameter $\lambda_{\text{CMC}}$ is an important parameter to balance the end-to-end TIMT optimization object and cross-modal contrastive learning object. Figure 5 shows the evaluation of hyper-parameter $\lambda_{\text{CMC}}$ . From this hyper-parameter evaluation, the optimal value of $\lambda_{\text{CMC}}$ is 0.4 for both embedding modal adapter and sequential modal adapter. When $\lambda_{\text{CMC}}=0$ , parameters in the modal adapter are only optimized by the end-to-end TIMT loss, which ignores the task gap between OCR and MT, leading to performance drop. Specifically, without cross-modal contrastive learning, SeqMA drops 2.35 BLEU scores and EmbMA drops 2.68 BLEU scores, indicating that cross-modal contrastive learning can effectively alleviate the feature gaps between the OCR and MT tasks. When $\lambda_{\text{CMC}}=1$ , the overall loss function becomes $\mathcal{L}_{\text{All}}=\mathcal{L}_{\text{CMC}}$ , and the performance drops, indicating end-to-end loss is also vital to modal adapter tuning. Thus, the optimization of the modal adapter should be guided both from direct translation object $\mathcal{L}_{\text{TIMT}}$ and cross-modal contrastive learning object $\mathcal{L}_{\text{CMC}}$ .

5 Related Work

5.1 Text Image Translation.

TIMT models are mainly divided into the cascade and end-to-end models. Cascade models deploy OCR and MT models respectively [26, 10, 1, 3, 7]. Specifically, the source language text images are first fed into OCR models to obtain the recognized source language sentences [28, 27, 2, 35, 36, 13, 14]. Second, the source language sentences are translated into the target language with the MT model [31, 32, 38, 37]. Cascade directly connects separated OCR and MT models leading to model redundancy and slow decoding speed. Furthermore, recognition errors made by OCR models are further propagated through MT models, causing severe translation mistakes.

For end-to-end models, the naive approach is to take the OCR model to translate source language text images by training with source language images and corresponding target sentences like TRBA [2]. Furthermore, multi-task learning is proposed to incorporate external OCR datasets [29, 5] or MT datasets [19] to enhance the performance of end-to-end models. MHCMM[4] further improves the feature representation through cross-modal mimic learning on the basis of incorporating external MT data.

However, existing methods still have limitations in fusing cascade and end-to-end models. In this paper, our proposed modal adapter bridges OCR encoder and MT decoder in cascade method through an end-to-end framework, which can take advantage of both cascade and end-to-end methods. Experimental results show modal adapter based TIMT effectively improves translation performance with efficient architecture and fast decoding speed.

5.2 Methods of Bridging Encoder and Decoder.

Pre-trained models have been explored to achieve good performance after fine-tuning on down-stream tasks [6, 22, 17, 23]. To simplify and speed up the fine-tuning process, efficiency tuning methods are proposed by just updating partial parameters of the model [34]. Another parameter-efficient tuning research keeps the parameters of pre-trained models unchanged and incorporates external modules to meet the downstream tasks like adapter tuning [24], LoRA [11], BitFit [33], prefix tuning [18], and so on. These fine-tuning methods just optimize the parameters of external modules, which makes the fine-tuning process more efficient.

Except for fine-tuning unified pre-trained models, existing research also tried to bridge pre-trained encoder and decoder [25]. [30] proposed to bridge pre-trained mBERT and mGPT through a Graft module to achieve text machine translation. While [16] explores combining ASR encoder and MT decoder with vanilla adapter for end-to-end speech translation. Inspired by recent research on bridging encoder and decoder, we propose a modal adapter to bridge the OCR encoder and the MT decoder.

6 Conclusion

In this paper, we propose a faster and better modal adapter tuning method for the TIMT task, bridging the pre-trained OCR encoder and MT decoder. The sequential modal adapter and embedding adapter are evaluated to verify the effectiveness of bridging different OCR and MT modules. Extensive experiments show embedding modal adapter has better performance because it retains the cross-attention flow between the original MT sequential encoder and decoder. Meanwhile, with an end-to-end architecture, the modal adapter based method outperforms the cascade method with faster decoding speed and lightweight architecture. Furthermore, the modal adapter is effective to bridge various OCR and MT frameworks, revealing the good generalization of the modal adapter tuning method. In the next step, we will design more bridge modules for text image machine translation.

Acknowledgement

This work has been supported by the National Natural Science Foundation of China (NSFC) grants 62106265.

References

[1] Afli, H., Way, A.: Integrating optical character recognition and machine translation of historical documents. In: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities, LT4DH@COLING, Osaka, Japan, December 2016. pp. 109–116 (2016)
[2] Baek, J., Kim, G., Lee, J., Park, S., Han, D., Yun, S., Oh, S.J., Lee, H.: What is wrong with scene text recognition model comparisons? dataset and model analysis. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. pp. 4714–4722 (2019)
[3] Chen, J., Cao, H., Natarajan, P.: Integrating natural language processing with image document analysis: what we learned from two real-world applications. Int. J. Document Anal. Recognit. 18(3), 235–247 (2015)
[4] Chen, Z., Yin, F., Yang, Q., Liu, C.L.: Cross-lingual text image recognition via multi-hierarchy cross-modal mimic. IEEE Transactions on Multimedia (TMM) pp. 1–13 (2022)
[5] Chen, Z., Yin, F., Zhang, X., Yang, Q., Liu, C.: Cross-lingual text image recognition via multi-task sequence to sequence learning. In: 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event / Milan, Italy, January 10-15, 2021. pp. 3122–3129 (2020)
[6] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics (2019)
[7] Du, J., Huo, Q., Sun, L., Sun, J.: Snap and translate using windows phone. In: 2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, September 18-21, 2011. pp. 809–813. IEEE Computer Society (2011)
[8] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010. JMLR Proceedings, vol. 9, pp. 249–256. JMLR.org (2010)
[9] Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in natural images. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 2315–2324. IEEE Computer Society (2016)
[10] Hinami, R., Ishiwatari, S., Yasuda, K., Matsui, Y.: Towards fully automated manga translation. In: The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, February 2-9, 2021. (2021)
[11] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net (2022)
[12] Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. CoRR abs/1406.2227 (2014)
[13] Kaur, H., Kumar, M.: Offline handwritten gurumukhi word recognition using extreme gradient boosting methodology. Soft Comput. 25(6), 4451–4464 (2021)
[14] Kaur, H., Kumar, M.: On the recognition of offline handwritten word using holistic approach and adaboost methodology. Multim. Tools Appl. 80(7), 11155–11175 (2021)
[15] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
[16] Le, H., Pino, J.M., Wang, C., Gu, J., Schwab, D., Besacier, L.: Lightweight adapter tuning for multilingual speech translation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 2: Short Papers), Virtual Event, August 1-6, 2021. pp. 817–824. Association for Computational Linguistics (2021)
[17] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., Zettlemoyer, L.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. pp. 7871–7880. Association for Computational Linguistics (2020)
[18] Li, X.L., Liang, P.: Prefix-tuning: Optimizing continuous prompts for generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021. pp. 4582–4597. Association for Computational Linguistics (2021)
[19] Ma, C., Zhang, Y., Tu, M., Han, X., Wu, L., Zhao, Y., Zhou, Y.: Improving end-to-end text image translation from the auxiliary text translation task. In: 26th International Conference on Pattern Recognition, ICPR 2022, Montreal, QC, Canada, August 21-25, 2022. pp. 1664–1670. IEEE (2022)
[20] Mansimov, E., Stern, M., Chen, M., Firat, O., Uszkoreit, J., Jain, P.: Towards end-to-end in-image neural machine translation. In: Proceedings of the First International Workshop on Natural Language Processing Beyond Text. pp. 70–74. Association for Computational Linguistics, Online (Nov 2020)
[21] Papineni, K., Roukos, S., Ward, T., Zhu, W.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. pp. 311–318 (2002)
[22] Radford, A., Narasimhan, K.: Improving language understanding by generative pre-training. Open AI Blog (2018)
[23] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140:1–140:67 (2020)
[24] Rebuffi, S., Bilen, H., Vedaldi, A.: Learning multiple visual domains with residual adapters. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. pp. 506–516 (2017)
[25] Rothe, S., Narayan, S., Severyn, A.: Leveraging pre-trained checkpoints for sequence generation tasks. Trans. Assoc. Comput. Linguistics 8, 264–280 (2020)
[26] Shekar, K.C., Cross, M., Vasudevan, V.: Optical character recognition and neural machine translation using deep learning techniques. Innovations in Computer Science and Engineering (2021)
[27] Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2298–2304 (2017)
[28] Shi, B., Wang, X., Lyu, P., Yao, C., Bai, X.: Robust scene text recognition with automatic rectification. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 4168–4176 (2016)
[29] Su, T., Liu, S., Zhou, S.: Rtnet: An end-to-end method for handwritten text image translation. In: 16th International Conference on Document Analysis and Recognition, ICDAR 2021, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part II. Lecture Notes in Computer Science, vol. 12822, pp. 99–113. Springer (2021)
[30] Sun, Z., Wang, M., Li, L.: Multilingual translation via grafting pre-trained language models. In: Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021. pp. 2735–2747. Association for Computational Linguistics (2021)
[31] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada. pp. 3104–3112 (2014)
[32] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA. pp. 5998–6008 (2017)
[33] Zaken, E.B., Goldberg, Y., Ravfogel, S.: Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022. pp. 1–9. Association for Computational Linguistics (2022)
[34] Zhang, H., Li, G., Li, J., Zhang, Z., Zhu, Y., Jin, Z.: Fine-tuning pre-trained language models effectively by optimizing subnetworks adaptively. CoRR abs/2211.01642 (2022)
[35] Zhang, Y., Nie, S., Liang, S., Liu, W.: Bidirectional adversarial domain adaptation with semantic consistency. In: Pattern Recognition and Computer Vision - Second Chinese Conference, PRCV 2019, Xi’an, China, November 8-11, 2019, Proceedings, Part III. Lecture Notes in Computer Science, vol. 11859, pp. 184–198. Springer (2019)
[36] Zhang, Y., Nie, S., Liang, S., Liu, W.: Robust text image recognition via adversarial sequence-to-sequence domain adaptation. IEEE Trans. Image Process. 30, 3922–3933 (2021)
[37] Zhao, Y., Xiang, L., Zhu, J., Zhang, J., Zhou, Y., Zong, C.: Knowledge graph enhanced neural machine translation via multi-task learning on sub-entity granularity. In: Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020. pp. 4495–4505. International Committee on Computational Linguistics (2020)
[38] Zhao, Y., Zhang, J., Zhou, Y., Zong, C.: Knowledge graphs enhanced neural machine translation. In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020. pp. 4039–4045. ijcai.org (2020)