
ViNMT: Neural Machine Translation Toolkit

Nguyen Hoang Quan1*, Nguyen Thanh Dat1*, Nguyen Hoang Minh Cong1*, Nguyen Van Vinh1,
Ngo Thi Vinh2, Nguyen Phuong Thai1, and Tran Hong Viet3
1University of Engineering and Technology - Hanoi VNU
{quan94fm,17020049,congnhm,vinhnv,thainp}@vnu.edu.vn
2University of Information and Communication Technology - Thai Nguyen University
ntvinh@ictu.edu.vn
3University of Economics-Technical For Industries
thviet@uneti.edu.vn

(Hanoi, 29/07/2022)
Abstract

We present an open-source toolkit for neural machine translation (NMT). The new toolkit is mainly based on the vaunted Transformer (Vaswani et al., 2017), along with many other improvements detailed below, in order to create a self-contained, simple-to-use, consistent, and comprehensive framework for machine translation tasks across various domains. It supports both bilingual and multilingual translation tasks, from building a model from the respective corpora, to inferring new predictions, to packaging the model in a serving-capable JIT format. The source code and data are available at https://github.com/KCDichDaNgu/MultilingualMT-UET-KC4.0.

* Authors with equal contributions
† Corresponding author

1 Introduction

With the emergence of neural-network-based machine learning models and their wide application, Natural Language Processing tasks in general, and the Machine Translation (MT) task in particular, have benefited from these new architectures, receiving massive gains in both translation quality and fluency.

In particular, the current state-of-the-art architecture for the MT task is the Transformer (Vaswani et al., 2017), which features various extensions of the attention concept adapted from older neural network models (Bahdanau et al., 2015; Luong et al., 2015). Variants of the Transformer architecture have been investigated, such as changes to its positional encoding (Raffel et al., 2019), changes to its attention to accommodate very long sequences (Tay et al., 2021), or even adaptations that only employ the autoregressive decoder (Brown et al., 2020), but so far the base Transformer model remains robust, competitive, and straightforward for the vast majority of translation cases.

In a similar vein, production-level translation models currently often leverage multiple translation corpora to enable greater proficiency, achieving increasingly better translation compared to bilingual models (Aharoni et al., 2019) and even enabling zero-shot translation for rare language pairs that have no prior training data (Johnson et al., 2016). Most known Transformer frameworks do not explicitly support this, so we designed our engine to support this goal to the best of our ability.

Amongst open-source engines capable of supporting machine translation tasks, there are OpenNMT (Klein et al., 2020), which boasts a well-defined structure and has different versions supporting different underlying neural frameworks; Marian (Junczys-Dowmunt et al., 2018), which is written in C++ for absolute performance and minimal dependency; and Fairseq (Ott et al., 2019), which provides many state-of-the-art models and improvements. However, in OpenNMT's and Fairseq's case, the breadth of their work makes it hard for new users to experiment with modifying the structure, while Marian's compactness is offset by the lower readability of C++.

Our engine is designed with two qualities in mind: modularity and coherency. First, excluding running scripts and minor support functions, the engine is built around replaceable modules with known interfaces, thus enabling simple inheritance, replacement, or upgrading of parts of the neural network models. Second, modules are strictly designed for one purpose and one purpose only, thus preventing unnecessary bloat in a single class and the consequent pain of modifying it.

2 Technical Details

2.1 Neural Machine Translation

Neural machine translation is fully automated translation based on neural networks. Cho et al. (2014) proposed a new architecture called the Sequence-to-Sequence (Seq2Seq) model. It applies memory units such as Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRU) to surmount the exploding or vanishing gradient problem in recurrent networks. The architecture includes an encoder and a decoder: the encoder represents the source-language sentence with n tokens X = (x_1, x_2, ..., x_n), and the decoder generates the predicted target-language sentence with m tokens Y = (y_1, y_2, ..., y_m). The model evaluates the conditional probability P(y_j | Y_<j, X) of generating the output sentence Y when the input sequence X is given:

P(Y\mid X)=\prod_{j=1}^{m+1}P(y_{j}\mid Y_{<j},X)

In the Seq2Seq model, the global attention mechanism (Luong et al., 2015) is used to compute alignment attention from source sentences to the corresponding target sentences. However, because the Seq2Seq model computes the probability sequentially, it has limited parallelization in the training process.

2.2 Transformer-based NMT

To solve the parallelization problem, the Transformer architecture for machine translation was first introduced by Vaswani et al. (2017). It is highly parallelizable and better at translating long sentences. Instead of using GRU or LSTM units to encode source sentences sequentially, it uses encoder layers that let the encoder look at other words in the input sentence as it encodes a specific word. This architecture allows much faster training and gives better quality compared to RNN architectures. The self-attention mechanism is computed as follows:

\text{Attention}(Q,K,V)=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V

Here, Q (query), K (key), and V (value) are the outputs of the encoder or decoder representing the tokens of the input sentence, and d is the size of the input. The Transformer is then trained to optimize its parameters θ by maximizing the log-likelihood of the sentence pairs:

L(\theta)=\frac{1}{T}\sum_{k=1}^{T}\log P(Y^{(k)}\mid X^{(k)},\theta)

where T is the number of sentence pairs in the training corpus. We use this Transformer-based architecture for all our experiments.
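
To make the computation concrete, the following is a minimal PyTorch sketch of the scaled dot-product attention in the equation above; it is illustrative only and not necessarily the toolkit's exact implementation.

    import math
    import torch

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q, k, v: tensors of shape (batch, heads, length, d_head).
        d = q.size(-1)
        # Attention scores: QK^T / sqrt(d).
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)
        if mask is not None:
            # Hide padded or future positions before the softmax.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)
        # Weighted sum of the value vectors.
        return torch.matmul(weights, v)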

2.3 Multilingual Neural Machine System

A multilingual neural machine translation (MNMT) system can translate between many language pairs, which helps improve translation quality through translation knowledge transfer (transfer learning), even in low-resource and zero-shot settings. MNMT has several strategies for translating between many languages:

  • (1) Many to many (Ha et al., 2016): from many source languages to many target languages;

  • (2) Many to one (Gu et al., 2019): from many source languages to one target language;

  • (3) One to many (Wang et al., 2018): from one source language to many target languages.

Our motivation is to improve low-resource translation tasks, and we focus on translating from many languages into one language, so our system corresponds to case (2). In a many-to-one MNMT system, the objective function uses maximum likelihood estimation over all parallel sentence pairs:

L(\theta)=\frac{1}{K}\sum_{m=1}^{M}\sum_{k=1}^{K}\log P(Y^{(m,k)}\mid X^{(m,k)},\theta)

where M is the number of source languages and K is the total number of sentence pairs of language m. Naturally, the source-side vocabulary is mixed from all source languages:

V=\sum_{m=1}^{M}V_{m}

Code-mixed language: The benefit of a shared vocabulary in MNMT has been shown by Gu et al. (2019): if the languages share the same alphabet and have many similar words, an MNMT system can exploit the common features of those languages. To do that, all language pairs may share the same subwords, which significantly reduces the number of rare words in the MT system. In our experiments, we choose English, Chinese, Khmer, and Lao as source languages and hope that they share many tokens with each other when translating into Vietnamese.
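
As a small illustration (not the toolkit's actual code; the helper and file names below are hypothetical), a mixed source vocabulary can be built as the union of the per-language subword vocabularies:

    def build_shared_vocab(vocab_files):
        # Collect subword types from each source language's vocabulary file
        # (one token per line; frequency columns, if any, are ignored).
        shared = set()
        for path in vocab_files:
            with open(path, encoding="utf-8") as f:
                shared.update(line.split()[0] for line in f if line.strip())
        # Assign contiguous indices; special tokens would normally be added first.
        return {token: idx for idx, token in enumerate(sorted(shared))}

    # Hypothetical file names for the four source languages.
    src_vocab = build_shared_vocab(["vocab.en", "vocab.zh", "vocab.km", "vocab.lo"])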

3 Implementation

This section first details the overall structure of our engine, which adheres to the qualities outlined above. In general, most components are built as interlinked modules that can be replaced or modified in their parent modules with minimal effort using Python's optional arguments.

The subsequent subsections describe various techniques we use to improve the efficiency and accuracy of our translation system in both the training and inference phases. In addition, we build a module that packages the model to simplify the serving process. We rely heavily on the PyTorch framework to implement all of these techniques.

3.1 System Design

In order to support the technical system above, our code base is designed from an object-oriented point of view, with interchangeable base classes following PyTorch's nn.Module class. This modular approach allows easy and simple modification when experimenting with or upgrading specific parts of the system, since in most cases we can simply inherit from the base class that needs to be changed and use the new class as an initialization argument.

Our system revolves around three main classes:

  • the Layer class: layers that are directly involved in the neural network's training and processing. As such, they are exclusively contained by the neural network modules in the system, such as the Encoder and Decoder modules.

  • the Module class, which makes up the bulk of our code base. A Module represents a specific duty to be filled; for example, our Transformer base model currently contains a Loader object for converting text data to numeric form and vice versa, a pair of Encoder-Decoder objects for the main translation process, and an additional DecoderStrategy object to support the model during autoregressive inference.

  • the Model class, which contains all the modules required for training and inference. This packaging allows (1) the highest level of user intervention in the translation process without the need to understand supporting code such as the \bin or \utils code, and (2) simple usage during TorchScript serving, since the converted object is a stand-alone module that does not have to worry about re-implementing specific non-neural modules.

As an example of the modularity of our system, assume that we have a new Decoder object to put into our original model. All we have to do is replace the corresponding decoder_cls argument within the constructor of the Model class, which can easily be accomplished by writing a lambda function in the model name list specified in \models\__init__.py.
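
The sketch below illustrates this pattern; the import paths and exact class names are assumptions and may differ from the actual toolkit layout.

    # Assumed import paths; the real toolkit layout may differ.
    from models.transformer import Transformer
    from modules.decoder import Decoder

    class MyDecoder(Decoder):
        """A custom decoder that keeps the base Decoder interface."""
        def forward(self, inputs, memory, **kwargs):
            # ... custom decoding logic would go here ...
            return super().forward(inputs, memory, **kwargs)

    # In \models\__init__.py, register a variant that routes the new class
    # through the Model constructor's decoder_cls argument via a lambda.
    MODELS = {
        "transformer": Transformer,
        "transformer_custom": lambda **kwargs: Transformer(decoder_cls=MyDecoder, **kwargs),
    }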

In addition, our code base contains several supporting directories and resources: \bin contains command-line scripts to run the system, \config contains basic configuration files holding model hyperparameters that are preserved across runs, \web contains serving scripts and resources, and \utils contains extra support functions that do not fit the other folders.

3.2 Training

Training paradigm

Our system allows optionally training models by epochs or steps:

  • If epoch mode is chosen, the model is trained in a specified number of epochs. For each epoch, the training data is shuffled and partitioned into several batches based on which batching algorithm is used. After that, these batches are fed into the model for training.

  • If sampling mode is chosen, the model is trained for a specified number of steps. For each step, several training examples (sentence pairs) are selected according to a specified distribution to form a training batch. This mode is more suitable when we want to train a multilingual model and there is a significant difference in training data size between language pairs (see the sketch after this list).
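
As an illustration of the sampling mode, the sketch below draws a language pair per step from a size-dependent distribution; the temperature-based scaling is an assumed example rather than the toolkit's fixed behavior.

    import random

    def sampling_weights(corpus_sizes, temperature=5.0):
        # Up-sample low-resource pairs: weight proportional to size^(1/T).
        scaled = {pair: size ** (1.0 / temperature) for pair, size in corpus_sizes.items()}
        total = sum(scaled.values())
        return {pair: w / total for pair, w in scaled.items()}

    def sample_batch(corpora, weights, batch_size=64):
        # Pick a language pair according to the distribution, then draw a batch.
        pairs = list(weights)
        pair = random.choices(pairs, weights=[weights[p] for p in pairs], k=1)[0]
        batch = random.sample(corpora[pair], k=min(batch_size, len(corpora[pair])))
        return pair, batch

    # With our corpus sizes, low-resource pairs are sampled far more often than
    # their raw share of the data would suggest.
    weights = sampling_weights({"en-vi": 3_000_000, "zh-vi": 675_000,
                                "lo-vi": 83_000, "km-vi": 70_000})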

Automatic Mixed Precision

The PyTorch framework also provides methods for mixed precision. When this method is applied, some operations use the float (32-bit) datatype and other operations use the half-float (16-bit) datatype. Some operations, like linear layers, are much faster in 16-bit floats, while other operations, like reductions, often require the dynamic range of 32-bit floats. Mixed precision tries to match each operation to its appropriate datatype, which can reduce the network's runtime and memory usage.
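
A minimal sketch of how automatic mixed precision is typically enabled in a PyTorch training loop; model, optimizer, and train_loader are assumed to exist already, so this is illustrative rather than the toolkit's exact training code.

    import torch

    scaler = torch.cuda.amp.GradScaler()

    for src, tgt in train_loader:          # model, optimizer, train_loader assumed
        optimizer.zero_grad()
        # Run the forward pass with eligible ops in float16 and the rest in float32.
        with torch.cuda.amp.autocast():
            loss = model(src, tgt)
        # Scale the loss to avoid float16 gradient underflow, then unscale and step.
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()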

3.3 Inference

We use the standard beam search algorithm to generate text on the target side. The baseline configuration uses beam size k = 4, length penalty α = 0.6, and a maximum of 128 decoding steps. There are several techniques we use to speed up the inference process and reduce memory usage:

Attention caching

The Transformer allows us to train much faster than an RNN because inputs are processed in parallel. However, this benefit does not carry over to the inference phase, because the model has to generate tokens sequentially, one per decoding step. The Transformer model needs all target tokens generated in previous decoding steps as decoder input to compute the next target token, so all query (Q), key (K), and value (V) vectors from previous steps are re-calculated.

Instead of repeatedly computing Q, K, V in every decoding step, we store them in memory so that they can be reused in subsequent calculations. This works because the Q, K, V vectors derived from previous target tokens do not change after each decoding step. Caching these Q, K, V vectors changes how the self-attention and cross-attention matrices are calculated (a code sketch follows the list below). In detail:

  • To calculate the self-attention matrix, we concatenate all the previous K, V vectors to form complete K, V tensors and perform matrix multiplication with the current Q vector. The result is an attention tensor that attends from the current target token to the previous target tokens. This calculation is consistent with the causal masking applied to the decoder-side self-attention tensor during training.

  • To calculate the cross-attention matrix, we simply perform matrix multiplication between the current Q vector and the K, V tensors from the encoder output. The result is an attention tensor that attends from the current target token to all source tokens.
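
The sketch below shows one possible shape of this caching logic for a single decoding step; attention() stands for scaled dot-product attention and self_proj is a hypothetical projection module, so this is not the toolkit's exact code.

    import torch

    def decode_step(x_t, cache, self_proj, cross_k, cross_v):
        # x_t: representation of the current target token, shape (batch, 1, d_model).
        q, k, v = self_proj(x_t)                     # projections for the new token only
        # Append the new K, V to the tensors cached from previous decoding steps.
        cache["k"] = k if cache["k"] is None else torch.cat([cache["k"], k], dim=1)
        cache["v"] = v if cache["v"] is None else torch.cat([cache["v"], v], dim=1)
        # Self-attention: the current query attends to all cached target positions,
        # so no explicit causal mask is needed at inference time.
        self_out = attention(q, cache["k"], cache["v"])
        # Cross-attention: the current query attends to the fixed encoder output.
        cross_out = attention(self_out, cross_k, cross_v)
        return cross_out, cache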

Sentences sorting

We sort source sentences by their length (the number of tokens) and batch them dynamically so that the total number of tokens in a batch does not exceed 4069. With this technique the model uses less memory compared to the regular batching algorithm (batching by the number of sentences), because we do not have to add as many padding tokens to the sentences in a batch.
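
A possible implementation of this token-budget batching is sketched below; it is an illustrative assumption, not the toolkit's exact code.

    def dynamic_batches(sentences, max_tokens=4069):
        # Sort by length so sentences in the same batch need little padding.
        ordered = sorted(sentences, key=len)
        batch, longest = [], 0
        for sent in ordered:
            longest = max(longest, len(sent))
            # Estimate the padded size if this sentence joins the current batch.
            if batch and (len(batch) + 1) * longest > max_tokens:
                yield batch
                batch, longest = [], len(sent)
            batch.append(sent)
        if batch:
            yield batch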

3.4 Serving

We want the deployment and training processes to be as independent as possible, so that models can be deployed in environments other than Python. Therefore, we use TorchScript, a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process with no Python dependency. We transition a model from a pure Python program to a TorchScript program that can run independently of Python, and then export the model via TorchScript to a production environment where Python programs may be disadvantageous for performance and multi-threading reasons.
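
A minimal sketch of the TorchScript export and load flow; the model variable, the example input, and the file name are placeholders.

    import torch

    # `model` is assumed to be a trained, scriptable nn.Module from the pipeline.
    model.eval()
    scripted = torch.jit.script(model)      # compile the PyTorch module to TorchScript
    torch.jit.save(scripted, "vinmt.pt")    # serialized artifact with no Python dependency

    # At serving time (even from a non-Python process via libtorch), load and run it.
    loaded = torch.jit.load("vinmt.pt")
    with torch.no_grad():
        output = loaded(example_input)      # example_input must match the model's forward signature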

4 Experiments

This section describes our experiments using our own data.

4.1 Dataset

We crawled data from the news domain for four language pairs: English-Vietnamese, Chinese-Vietnamese, Lao-Vietnamese, and Khmer-Vietnamese. The details of these datasets are described in Table 1.

Table 1: The bilingual datasets in our experiments
Language pairs Train Valid Test
En - Vi 3M 1553 1268
Cn - Vi 675K 1000 1000
Lo - Vi 83K 1000 1000
Km - Vi 70K 1000 1000

4.2 Preprocessing

All parallel texts were tokenized and truncated using SentencePiece scripts, and then Sennrich's BPE (Sennrich et al., 2016) was applied. We learn 32K BPE merge operations to generate BPE codes for all languages.

For Vietnamese, we only apply Moses’s scripts for tokenization and true-casing.
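
For illustration, a joint 32K BPE model could be learned with the SentencePiece Python API as sketched below; our actual preprocessing uses the command-line scripts, and the file names here are placeholders.

    import sentencepiece as spm

    # Learn a joint BPE model over the source-side corpora (placeholder file names).
    spm.SentencePieceTrainer.train(
        input="train.en,train.zh,train.km,train.lo",
        model_prefix="bpe.src",
        vocab_size=32000,
        model_type="bpe",
        character_coverage=1.0,
    )

    # Apply the learned model to segment a sentence into subword pieces.
    sp = spm.SentencePieceProcessor(model_file="bpe.src.model")
    pieces = sp.encode("ViNMT supports multilingual translation .", out_type=str)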

4.3 Systems and Training

We implement our NMT systems from scratch for all our experiments. The same settings are used in all experiments.

We trained our Transformer model with 12 encoder layers, 6 decoder layers, 8 attention heads, d_model = 512, a dropout of 0.1, a batch size of 64, and a learning rate of 0.4 with the Adam optimizer. The learning rate is warmed up over 8000 steps, and the label smoothing value is 0.1.
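
For reference, these hyperparameters can be summarized as a configuration dictionary; the key names below are hypothetical and may not match the actual files in \config.

    TRANSFORMER_CONFIG = {
        "encoder_layers": 12,
        "decoder_layers": 6,
        "attention_heads": 8,
        "d_model": 512,
        "dropout": 0.1,
        "batch_size": 64,
        "optimizer": "adam",
        "learning_rate": 0.4,
        "warmup_steps": 8000,
        "label_smoothing": 0.1,
    }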

We evaluate the quality of two systems: (1) a bilingual system and (2) a multilingual system.

(1) Bilingual system

We train systems on the separate bilingual data of each language pair and select the best model to decode the test data for comparison purposes. The English-Vietnamese model is trained for 30 epochs; the Chinese-Vietnamese, Lao-Vietnamese, and Khmer-Vietnamese models are trained for 20 epochs. These bilingual systems serve as the baselines for comparison with the multilingual system.

(2) Multilingual system

We concatenate the training sets of all language pairs to construct the new training set: English, Chinese, Khmer, Lao → Vietnamese. We train the system on this data for the same number of epochs. Compared with the bilingual translation results, we see improvements of +4.06 BLEU points on the English → Vietnamese translation task, +0.56 BLEU points on Chinese → Vietnamese, +4.19 BLEU points on Lao → Vietnamese, and +3.18 BLEU points on Khmer → Vietnamese.

4.4 Results

The results of our experiments are shown in Table 2 and Table 3.

Table 2: BLEU scores for the bilingual systems
Language pairs BLEU
En - Vi 31.77
Cn - Vi 27.96
Lo - Vi 16.29
Km - Vi 20.78
Table 3: BLEU scores for the multilingual system
Language pairs BLEU
En - Vi 34.98
Cn - Vi 28.62
Lo - Vi 18.94
Km - Vi 23.44

5 Conclusion

We have built a research toolkit for NMT designed for efficiency and modularity, and we release all code to the NLP community, especially for Vietnamese machine translation. We find that large bilingual corpora can further enhance a multilingual NMT system, and that the MNMT system significantly reduces the number of rare words. Nevertheless, the rare-word issue remains a challenge in NMT.

In the future, we will continue to develop ViNMT to achieve strong MT results in line with up-to-date research.

Acknowledgments

This work has been supported by the Ministry of Science and Technology of Vietnam under Program KC 4.0, No. KC-4.0.12/19-25.

References