Unsupervised Bilingual Lexicon Induction for Low Resource Languages
Abstract.
Bilingual lexicons play a crucial role in various Natural Language Processing tasks. However, many low-resource languages (LRLs) do not have such lexicons, and for the same reason cannot benefit from supervised Bilingual Lexicon Induction (BLI) techniques. To address this, unsupervised BLI (UBLI) techniques were introduced. A prominent technique in this line is structure-based UBLI, an iterative method in which a seed lexicon, initially learned from monolingual embeddings, is progressively refined. There have been numerous improvements to this core idea; however, they have been evaluated independently of each other. In this paper, we investigate whether these techniques remain effective when used simultaneously. We use the unsupervised version of VecMap, a commonly used structure-based UBLI framework, and carry out a comprehensive set of experiments using the LRL pairs English-Sinhala, English-Tamil, and English-Punjabi. These experiments helped us identify the best combination of the extensions. We also release bilingual dictionaries for English-Sinhala and English-Punjabi.
1. Introduction
A bilingual lexicon (aka bilingual dictionary) consists of a list of words in one language, along with the corresponding translations in another language. Bilingual lexicons have long been used to improve the performance of other Natural Language Processing (NLP) tasks such as Machine Translation (Artetxe et al., 2018a), Information Retrieval (Vulić and Moens, 2015), and cross-lingual Named Entity Recognition (Mayhew et al., 2017). They are specifically useful in the context of Low-Resource Languages (LRLs) (Miao et al., 2024; Yong et al., 2024; Mohammed and Prasad, 2023; Keersmaekers et al., 2023; Chaudhary et al., 2020; Fernando et al., 2023; Farhath et al., 2018).
However, due to the data scarcity of LRLs, many of them may not have manually developed bilingual lexicons. As a solution, researchers have proposed ways to automatically induce bilingual lexicons, which is known as Bilingual Lexicon Induction (BLI) (Shi et al., 2021). Early research used techniques such as those that exploit the similarity of word co-occurrences and the similarity of morphological structures of words (Haghighi et al., 2008). However, starting from the pioneering work of Mikolov et al. (2013b), techniques that make use of word embeddings have been at the forefront of BLI.
Techniques such as the one presented by Mikolov et al. (2013b) are supervised, meaning that they require a bilingual lexicon (seed lexicon) to train the BLI model. However, many LRLs may not have such initial bilingual lexicons. As a solution, unsupervised BLI (UBLI) techniques have been introduced. These techniques rely solely on monolingual data. UBLI techniques can be broadly categorized into two: joint learning techniques and post-alignment techniques. In joint learning techniques, a neural network model is trained using monolingual data belonging to the two languages such that the resulting model captures the cross-lingual representations (Cao et al., 2016).
Post-alignment techniques, which have been more commonly used, start from the monolingual embeddings of the two languages and learn a mapping across their distributions. Post-alignment techniques can be further categorized into adversarial techniques and structure-based methods (Ren et al., 2020). Adversarial methods suffer from the limitations of the underlying adversarial models they use. In contrast, structure-based methods have been shown to be more flexible and robust, and have been the state of the art for UBLI (Ren et al., 2020). Most structure-based methods are iterative (Aldarmaki et al., 2018; Hoshen and Wolf, 2018): a seed lexicon is initially created and used to learn an optimal mapping between the embedding spaces of the two languages. A new lexicon is induced from the aligned embeddings, which is again used to further improve the alignment. This process is repeated until convergence.
It is possible to improve structure-based UBLI through improved embedding generation techniques (Ormazabal et al., 2021; Nishikawa et al., 2021; Cao et al., 2023a; Zhang et al., 2021), embedding pre-processing techniques (Vulić et al., 2020; Artetxe et al., 2018c; Cao et al., 2023b), and embedding initialization techniques (Artetxe et al., 2018b; Li et al., 2020; Ren et al., 2020; Cao and Zhao, 2021; Feng et al., 2022). However, these techniques have been experimented with independently of each other, and it is not clear whether they would still be effective if used simultaneously.
In order to answer this question, we select the commonly used structure-based UBLI framework introduced by Artetxe et al. (2018b) and extend it with a combination of the aforementioned techniques. This unsupervised framework was designed by extending their supervised self-learning framework VecMap (Artetxe et al., 2017). We carry out an extensive set of experiments for three LRL pairs: English-Sinhala (EnSi), English-Tamil (EnTa), and English-Punjabi (EnPa).
The contributions of this paper are as follows:
• We carry out an extensive set of experiments on combinations of different extensions to structure-based UBLI systems using the unsupervised VecMap framework (UVecMap), and identify the most effective combination.
• We release new human-curated bilingual lexicons for English-Sinhala and English-Punjabi.
The rest of the paper is organized as follows. Section 2 discusses the UVecMap Framework, and its subsequent extensions, as well as BLI for LRLs. Section 3 discusses how the various extensions on UVecMap were implemented. Section 4 discusses the experiment setup and Section 5 discusses the results. Finally, Section 6 concludes the paper with a look into the future.
2. Related Work
2.1. Structure-based UBLI with VecMap
The UVecMap framework is shown in Figure 1. Note that UVecMap assumes that pre-trained word embeddings are already available for the considered languages. During the normalization step, the embeddings are length normalized, mean centered across each dimension, and again length normalized.
In the second step, embeddings are initialized in an unsupervised manner. To do this, an alternative representation of the normalized source and target embeddings is derived as follows: given that the normalized embeddings for the source and target are $X$ and $Z$ respectively, the similarity matrix $M_X = XX^T$ corresponding to the source and the similarity matrix $M_Z = ZZ^T$ corresponding to the target are first derived. Ideally, if the two embedding spaces were perfectly isometric (two graphs that contain the same number of vertices connected in the same way are said to be isomorphic (Søgaard et al., 2018); isometry is isomorphism over metric spaces), $M_X$ and $M_Z$ would be equivalent, apart from a permutation of their rows and columns. However, this is rarely the case in practice. To approximate alignment, the square root of each similarity matrix is taken, and each row of $\sqrt{M_X}$ and $\sqrt{M_Z}$ is independently sorted. Nearest-neighbor matching over these sorted rows provides the initial dictionary.
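The initialization step can be sketched as follows (a minimal NumPy sketch, assuming `X` and `Z` are the normalized monolingual embedding matrices; the function and variable names are ours, not VecMap's, and the truncation of the sorted profiles is a simplification of VecMap's frequency cutoff):

```python
import numpy as np

def initial_dictionary(X, Z):
    """Seed dictionary from sorted square-root similarity matrices.

    X: (n_x, d) and Z: (n_z, d) length-normalized, mean-centered embeddings.
    """
    # Intra-lingual similarity matrices M_X = XX^T and M_Z = ZZ^T.
    Mx, Mz = X @ X.T, Z @ Z.T
    # Square root (clipping negatives for numerical safety) and sort each row,
    # so the representation no longer depends on the unknown word order.
    Sx = np.sort(np.sqrt(np.clip(Mx, 0, None)), axis=1)
    Sz = np.sort(np.sqrt(np.clip(Mz, 0, None)), axis=1)
    # The sorted profiles are comparable only if they have equal length;
    # here we simply keep the largest k values of each row.
    k = min(Sx.shape[1], Sz.shape[1])
    sim = Sx[:, -k:] @ Sz[:, -k:].T
    # Nearest-neighbor matching over the profiles gives the seed dictionary.
    return sim.argmax(axis=1)
```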
During the self-learning step, the algorithm iteratively refines the cross-lingual embeddings until convergence by performing a sequence of optimization tasks. First, optimal orthogonal mappings $W_X$ and $W_Z$ that maximize the similarities for the current dictionary $D$, i.e. $\sum_i \sum_j D_{ij}\,((X_{i*}W_X)\cdot(Z_{j*}W_Z))$, where $X_{i*}$ is the $i$-th vocabulary item in the source language and $Z_{j*}$ is the $j$-th vocabulary item in the target language, are derived. The optimal solution is given by the singular value decomposition (SVD) $USV^T = X^T D Z$, and the resulting mappings are $W_X = U$ and $W_Z = V$. Then, a new optimal dictionary is computed over the similarity matrix of the mapped embeddings, defined as $P = XW_X W_Z^T Z^T$. This dictionary is typically generated through nearest-neighbor retrieval from source to target embeddings (Artetxe et al., 2018b). Finally, both source and target embeddings are re-weighted.
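A compact sketch of one self-learning iteration under these definitions is given below (NumPy; `D` is a dense 0/1 matrix encoding the current dictionary, and the plain nearest-neighbor retrieval shown here is the simplest variant):

```python
import numpy as np

def self_learning_step(X, Z, D):
    """One iteration: orthogonal mappings via SVD, then dictionary re-induction.

    X: (n_x, d) source embeddings, Z: (n_z, d) target embeddings,
    D: (n_x, n_z) 0/1 matrix with D[i, j] = 1 if word j translates word i.
    """
    # Optimal orthogonal mappings W_X = U, W_Z = V with U S V^T = X^T D Z.
    U, _, Vt = np.linalg.svd(X.T @ D @ Z)
    Wx, Wz = U, Vt.T
    # Similarity matrix of the mapped embeddings: P = X W_X W_Z^T Z^T.
    P = (X @ Wx) @ (Z @ Wz).T
    # Re-induce the dictionary by source-to-target nearest-neighbor retrieval.
    new_D = np.zeros_like(D)
    new_D[np.arange(P.shape[0]), P.argmax(axis=1)] = 1
    return Wx, Wz, new_D
```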
Artetxe et al. (2018b) also introduced some additional improvements to this framework. First, dictionary induction is treated as a stochastic process, meaning that elements of the similarity matrices are kept only with probability $p$. Secondly, the dictionary size is limited to the top $k$ most frequent words (Artetxe et al. (2018b) used $k$ = 20,000). Thirdly, to derive the dictionary from the aligned spaces, Cross-domain Similarity Local Scaling (CSLS) (Lample et al., 2018) is used instead of the typically used nearest-neighbor retrieval. Finally, the dictionary is induced in both directions - from source to target and from target to source.
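Since CSLS plays a central role in the dictionary-induction step, a short sketch of source-to-target CSLS retrieval as defined by Lample et al. (2018) follows (the implementation details and names are ours):

```python
import numpy as np

def csls_retrieve(sim, k=10):
    """CSLS retrieval over a (n_src, n_trg) cosine similarity matrix `sim`."""
    # Mean similarity of each source word to its k nearest target neighbors.
    r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # Mean similarity of each target word to its k nearest source neighbors.
    r_trg = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    # CSLS penalizes hub words by subtracting both local densities.
    csls = 2 * sim - r_src - r_trg
    return csls.argmax(axis=1)  # best target index per source word
```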

Figure 1. The UVecMap model.
2.2. Improvements to Structure-based BLI
Improvements to structure-based BLI have been applied to one of the following components: word embedding creation, embedding pre-processing, and dictionary initialization.
Word Embedding Creation: Ormazabal et al. (2021) relaxed the assumption that pre-trained word embeddings are available for both source and target languages. They fixed the target embeddings and learned the aligned embeddings of the source from scratch. They used UVecMap to build the initial dictionary, which is used to constrain the source language embeddings to be aligned with the target language embeddings. Nishikawa et al. (2021) showed that rather than deriving word embeddings from monolingual corpora, using a synthetic parallel corpus generated by unsupervised Machine Translation improves the performance of UVecMap. During the creation of word embeddings such as Word2Vec (Mikolov et al., 2013a), two types of embeddings are created: word embeddings and context embeddings. All the techniques discussed above used the word embedding part only. In contrast, Cao et al. (2023a) used both word and context embeddings. The UVecMap model was originally designed to generate mappings using static word embedding models and does not inherently support embeddings with contextual representations. To address this limitation, Zhang et al. (2021) proposed a method that combines static word embeddings with contextual representations to improve alignments and enhance Bilingual Lexicon Induction (BLI) results. Their approach employs FastText (Joulin et al., 2017) embeddings for static representations and incorporates XLM (Lample and Conneau, 2019) and mBART (Liu et al., 2020) for contextual embeddings.
Embedding Pre-processing: Vulić et al. (2020) improved the quality of the input monolingual spaces before using them in cross-lingual alignment. Their method incorporates a linear transformation approach proposed by Artetxe et al. (2018c), which is controlled by a single parameter that adjusts the similarity order of the input embedding spaces. This linear transformation is applied to both monolingual spaces, enhancing the model’s ability to capture different aspects of language. Cao et al. (2023b) integrated the structural features from the source language embeddings into their corresponding features in the target language embeddings and vice versa. In other words, the value of a source-side feature gets modified based on the corresponding one from the target side, and vice versa. They term this process ‘embedding fusion’. The objective of embedding fusion is to increase the isomorphism of the source and target spaces. Cao and Zhao (2021) introduced a transformation-based approach that applies rotation and scaling operations to monolingual embeddings, aiming to improve isomorphism between the embedding spaces of different languages.
Initialization: While it is always possible to use the generated word embeddings as they are, these embeddings often contain noise, which can negatively affect accuracy. To address this issue, Li et al. (2020) performed dimensionality reduction of the original word embeddings in the initialization step. This method iteratively reduces the dimensionality of the embeddings using Principal Component Analysis (PCA), starting from the original dimension down to the required target dimension. During each iteration, the algorithm generates a dictionary by computing the nearest neighbors of the k most frequently occurring words. The generated dictionaries are then compared to ensure that the final dictionary is the most accurate across all evaluated dimensionalities. Ren et al. (2020) introduced a graph-based solution for the initialization step. They constructed a graph using the monolingual embeddings of each language, where vertices represent words. Next, subsets of these vertices (called cliques) are extracted such that every two distinct vertices in a clique are adjacent. Then, embeddings of these cliques are calculated and the clique embeddings are mapped across the source and target graphs. Central words of the aligned cliques are used to form the seed dictionary. Feng et al. (2022) proposed a novel cross-lingual feature extraction (CFE) approach. They defined semantic features of words based on their relevance to contextual words, quantified using character-level distances within monolingual corpora. Semantic vectors were constructed by selecting the most relevant contextual words, capturing language-independent textual features. These cross-lingual features were then combined with pre-trained embeddings.
2.3. UBLI for Low-resource Languages
Besacier et al. (2014) defined an LRL as a “language that lacks a unique writing system, lacks (or has) a limited presence on the World Wide Web, lacks linguistic expertise specific to that language, and/or lacks electronic resources such as corpora (monolingual and parallel), vocabulary lists”, and so on. Given that the success of NLP tools for a language depends on the availability of language resources (raw as well as annotated text data), NLP researchers identify LRLs considering the availability of text data and NLP tools as the criteria (Joshi et al., 2020; Ranathunga and De Silva, 2022). Accordingly, the languages of the world have been categorized into 6 classes, with Class 0 being the least resourced (we use Ranathunga and De Silva (2022)’s language categorization; the language list can be found at https://github.com/NisansaDdS/Some-Languages-are-More-Equal-than-Others/tree/main/Language_List/Language_Classes_According_To/DataSet_Availability). In this work, we consider languages belonging to classes 0-3 as LRLs.
While BLI techniques can be applied to any language pair, not many have been tested on LRLs. We believe this is due to the lack of evaluation datasets. The most commonly used dataset is the MUSE dataset (Lample et al., 2018), which has lexicons for over a hundred languages, including some LRLs. However, this dataset was created automatically without human validation and has been reported to be of poor quality (Yang et al., 2019). PanLex (Baldwin et al., 2010) is another dataset that covers a large number of languages; however, it has also been automatically created. Some other datasets created for LRLs have likewise been automatically created (Glavaš et al., 2019; Bafna et al., 2023; Wickramasinghe and De Silva, 2023). For Indic languages, the IndoWordNet (https://www.cfilt.iitb.ac.in/indowordnet/) has been a good source from which to extract evaluation dictionaries. Pavlick et al. (2014) present human-created bilingual dictionaries for 100 languages; however, each language pair has only 10 entries. Some other research that created evaluation lexicons for LRLs has not publicly released them (Liyanage et al., 2021), thus impeding progress in the field (Ranathunga et al., 2024a).
3. Methodology
As discussed in Section 2.2, there have been many improvements to the core idea of structure-based UBLI. However, all of those techniques have been experimented with individually. In order to verify their effectiveness when used simultaneously, we carried out a series of experiments. We use the UVecMap framework as the baseline structure-based UBLI system. Figure 2 shows the UVecMap framework extended with some of the techniques discussed in Section 2.2. Note that we considered only techniques for which the source code has been released. Components related to these improvements are shown in red. Note that applying dimensionality reduction during the pre-processing step is newly introduced by us. Below we discuss the technical details of these newly added components.

3.1. Dimensionality Reduction
Li et al. (2020)’s technique discussed above integrates dimensionality reduction into the self-initialization step of the UVecMap framework. It is also possible to carry out this dimensionality reduction at the embedding pre-processing step. In our implementation, we experimented with both approaches independently of each other. We utilized Principal Component Analysis (PCA) to reduce the dimensionality of the embeddings.
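A minimal sketch of the pre-processing variant is shown below, using scikit-learn's PCA; the target dimensionality follows the halving strategy described later in Section 4.4.2, and the function name is ours:

```python
from sklearn.decomposition import PCA

def reduce_dimensions(src_emb, trg_emb, target_dim=150):
    """Independently project each monolingual space onto its top principal components.

    src_emb, trg_emb: (vocab, d) embedding matrices; target_dim=150 halves the
    300-dimensional Word2Vec/FastText spaces (512 would be used for XLM-R).
    """
    src_red = PCA(n_components=target_dim).fit_transform(src_emb)
    trg_red = PCA(n_components=target_dim).fit_transform(trg_emb)
    return src_red, trg_red
```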
3.2. Embedding pre-processing
As discussed in Section 2.2, related work has discussed several ways of pre-processing word embeddings. Out of those, we experimented with linear transformation and embedding fusion techniques, on top of the pre-processing techniques already used in UVecMap. In addition to these, dimensionality reduction was also carried out as discussed earlier.
Linear Transformation (Vulić et al., 2020):
Projection-based cross-lingual word embedding (CLWE) models typically learn a linear projection between two independently trained monolingual embedding spaces $X$ and $Z$ for the source and target languages, respectively. This alignment is guided by a word translation dictionary $D$. This approach extends the traditional notion of similarity from first- and second-order measures to higher, $n$-th order similarities, enabling more robust alignment between the embedding spaces.
The first-order similarity matrix of the source language space is calculated as $M_1(X) = XX^T$. Similarly, the first-order similarity matrix $M_1(Z) = ZZ^T$ for the target language space is determined. The second-order similarity can be expressed as $M_2(X) = M_1(X)\,M_1(X)^T = (XX^T)^2$. Extending this, the $n$-th order similarity is defined as $M_n(X) = (XX^T)^n$.
Artetxe et al. (2018c) proved that the $n$-th order similarity transformation can be obtained by applying the linear map $W_\alpha = Q\Gamma^{\alpha}$, where $\alpha = (n-1)/2$ is a hyperparameter, and $Q$ and $\Gamma$ are obtained by the eigen-decomposition $X^T X = Q\Gamma Q^T$; $Q$ is the orthogonal matrix of eigenvectors and $\Gamma$ is the diagonal matrix containing the eigenvalues. Separate hyperparameter values, $\alpha_{src}$ and $\alpha_{trg}$, are defined for the source and target languages, respectively, to optimize the transformation for each language. The source embedding space is linearly transformed to $X' = XQ_X\Gamma_X^{\alpha_{src}}$, where $\alpha_{src}$ is the selected hyperparameter for the source language, and the target embedding space is similarly transformed as $Z' = ZQ_Z\Gamma_Z^{\alpha_{trg}}$, using $\alpha_{trg}$ for the target language. These transformations refine the original source embeddings, $X$, and target embeddings, $Z$, by incorporating the modified embeddings $X'$ and $Z'$, respectively.
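A sketch of this α-parameterized transformation, applied independently to each monolingual space (NumPy; the function name and the numerical clipping are ours):

```python
import numpy as np

def similarity_transform(X, alpha):
    """Apply the similarity-order transformation X Q Gamma^alpha.

    X: (vocab, d) embedding matrix. alpha = 0 leaves the space unchanged up to
    a rotation; alpha = 0.5 corresponds to second-order similarity.
    """
    # Eigen-decomposition of the symmetric Gram matrix X^T X = Q Gamma Q^T.
    eigvals, Q = np.linalg.eigh(X.T @ X)
    eigvals = np.clip(eigvals, 1e-12, None)  # guard against tiny negative eigenvalues
    W = Q * (eigvals ** alpha)               # Q Gamma^alpha (column scaling)
    return X @ W

# For example, the EnSi Word2Vec setting selected in Table 3 corresponds to
# similarity_transform(X, 0.15) for English and similarity_transform(Z, 0.25)
# for Sinhala.
```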
Embedding Fusion (Cao et al., 2023b):
The fusion method proposed by Cao et al. (2023b) addresses the challenge of misaligned embedding spaces by improving the isomorphism between source embeddings (X) and target embeddings (Z) through rotation and joint scaling. It begins with the assumption of perfect isometry, where X and Z are related through a row permutation and an orthogonal rotation. Mathematically, this relationship is defined as $Z = PXO$, where P is a permutation matrix containing 1s and 0s, and O is an orthogonal matrix. Using this assumption, the Singular Value Decompositions (SVDs) $X = U_X\Sigma_X V_X^T$ and $Z = U_Z\Sigma_Z V_Z^T$ reveal the relation $Z = PXO = (PU_X)\,\Sigma_X\,(O^T V_X)^T$, where $U_Z = PU_X$ and $V_Z = O^T V_X$. This implies $U_Z\Sigma_Z = PU_X\Sigma_X$, and consequently $ZV_Z = PXV_X$.
Here, $XV_X$ and $ZV_Z$ are the rotated embeddings, which are aligned across their dimensions up to a row permutation. Although the matrices P and O are unknown, this transformation places the row vectors of $XV_X$ and $ZV_Z$ in the same d-dimensional cross-lingual space.
Usually, embedding spaces are not perfectly isometric but are assumed to be approximately so. Although $XV_X$ and $ZV_Z$ may be roughly aligned, their singular values ($\Sigma_X$ and $\Sigma_Z$) often differ, especially for distant language pairs. Since $XV_X = U_X\Sigma_X$ and $ZV_Z = U_Z\Sigma_Z$, the embeddings are jointly scaled to align their singular value distributions and enhance their isomorphism. The scaling is applied element-wise to the diagonal elements of $\Sigma_X$ and $\Sigma_Z$, equalizing the singular values of the two spaces. This alignment ensures that the rotated and scaled embeddings are better suited for cross-lingual tasks, as their singular values are equalized and their geometric structures are aligned.
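The rotation step and a joint scaling of the singular values can be sketched as follows (NumPy; the element-wise geometric mean used here is our illustrative choice for the equalization, not necessarily the exact formula of Cao et al. (2023b)):

```python
import numpy as np

def rotate_and_scale(X, Z):
    """Rotate both spaces via SVD and jointly rescale their singular values.

    X: (n_x, d) source embeddings, Z: (n_z, d) target embeddings.
    """
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)  # X = U_X Sigma_X V_X^T
    Uz, Sz, _ = np.linalg.svd(Z, full_matrices=False)  # Z = U_Z Sigma_Z V_Z^T
    # X V_X = U_X Sigma_X and Z V_Z = U_Z Sigma_Z are the rotated embeddings.
    # Illustrative joint scaling: share an element-wise geometric mean of the
    # singular values so that both spaces have the same spectrum.
    S = np.sqrt(Sx * Sz)
    return Ux * S, Uz * S
```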
3.3. Combining Static and Contextual Embeddings for Bilingual Lexicon Induction (CSCBLI)
Zhang et al. (2021) considered only FastText alongside XLM and mBART embeddings in their methodology. In our work, we extended their approach by incorporating both Word2Vec and FastText as static embeddings, alongside XLM-R (Conneau et al., 2020) as the contextual embeddings. The proposed model consists of two primary steps: Unified Word Representation Space and Similarity Interpolation.
In the first step, the model constructs a unified word representation space that combines static word embeddings and contextual representations. Since the embedding spaces of the two languages are not perfectly isometric, some translation pairs remain distant even after mapping. To address this, a spring network adjusts the mapped embeddings, pulling translation pairs closer together to ensure that they become nearest neighbors.
The unified word representations, $U_x$ and $U_z$, are defined as $U_x = X' + w_x \odot S_x(C_x)$ and $U_z = Z' + w_z \odot S_z(C_z)$, where $X'$ and $Z'$ are the mapped word embeddings, $S_x(C_x)$ and $S_z(C_z)$ are the weighted offsets produced by the spring networks $S_x$ and $S_z$ based on the contextual representation matrices $C_x$ and $C_z$, and $w_x$ and $w_z$ are weight vectors scaling the offsets. These weights are initialized as zero vectors, and during the training process these parameters are updated by backpropagation to optimize the performance of the model.
The spring network comprises two layers. The first layer transforms the dimensionality of the contextual representation ($C$) to match that of the static word embedding ($X$), as shown in the following equation:

$H = \tanh(\mathrm{FFN}(C)),$

where $\tanh$ denotes the Tanh activation and $\mathrm{FFN}$ represents the feedforward layers. The second layer refines the representation by mapping the output of the first layer into the final offset values. The outputs $S_x(C_x)$ and $S_z(C_z)$ serve as offsets to correct deviations in the mapped word embedding space.
To optimize the spring network, contrastive training is used. Initially, the bilingual dictionary is generated using the bilingual word mappings based on static word embeddings. The model iteratively refines this dictionary by finding new translations for source words and updating the dictionary after each iteration. This process continues until the dictionary stabilizes, ensuring improved alignment between the source and target embeddings.
During inference, the model performs similarity interpolation between the unified representation space and the contextual space to compute the final translation similarities. Both the unified word representation space and the mapped contextual representation space, as well as the lambda value, are given as inputs to the inference. Given a source word $x_i$ and a target candidate $z_j$, the similarities are interpolated as follows:

$\mathrm{sim}(x_i, z_j) = (1-\lambda)\,\cos(U_{x_i}, U_{z_j}) + \lambda\,\cos(C'_{x_i}, C'_{z_j}),$

where $\lambda$ is the interpolation weight, and $C'_x$ and $C'_z$ are the mapped contextual representations.
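A sketch of the inference-time interpolation, using the notation above (NumPy; cosine similarity; the names and the example λ value are ours):

```python
import numpy as np

def interpolated_retrieval(U_x, U_z, C_x, C_z, lam=0.2):
    """Interpolate unified-space and contextual-space similarities.

    U_x, U_z: unified representations of the source/target vocabularies.
    C_x, C_z: mapped contextual representations of the same words.
    lam: interpolation weight lambda (0.2 is an arbitrary example value).
    """
    def cosine(A, B):
        A = A / np.linalg.norm(A, axis=1, keepdims=True)
        B = B / np.linalg.norm(B, axis=1, keepdims=True)
        return A @ B.T

    sim = (1 - lam) * cosine(U_x, U_z) + lam * cosine(C_x, C_z)
    return sim.argmax(axis=1)  # best target candidate per source word
```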
4. Experiment Setup
4.1. Monolingual Data
Unsupervised BLI techniques require monolingual corpora to generate the initial language-specific embeddings. While web-crawled corpora exist, those for low-resource languages are noisy (Ranathunga et al., 2024b). Therefore, for the Sinhala-English and Tamil-English pairs, we used the monolingual versions of the SiTa parallel corpus (Fernando et al., 2020). This corpus has been meticulously cleaned by humans. Similarly, the Bharat Parallel Corpus Collection (BPCC) (Gala et al., 2023) was utilized for English-Punjabi. These corpora were tokenized to derive word lists.
Previous research suggested that keeping the full word list is not effective, due to the existence of rare words (Artetxe et al., 2018b; Feng et al., 2022). Therefore, the resulting word lists were filtered based on predefined minimum frequency threshold values. The thresholds were carefully selected via an ablation study (see Section 4.4) to balance coverage (the range of vocabulary captured) and accuracy (the quality of the embeddings). Increasing the frequency threshold typically reduces the number of words (lower coverage) but improves the quality of embeddings by focusing on more representative terms (higher accuracy). Conversely, lowering the threshold increases coverage by including more words, but it may dilute the quality of embeddings due to the inclusion of infrequent and less meaningful terms.
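The frequency-based filtering amounts to the following (pure Python; whitespace tokenization is a simplification of the actual pre-processing):

```python
from collections import Counter

def build_word_list(sentences, min_freq):
    """Return the vocabulary of `sentences` whose frequency meets the threshold."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {w for w, c in counts.items() if c >= min_freq}

# The thresholds selected via the ablation study in Section 4.4 were 8 for
# EnSi and 6 for EnTa and EnPa, e.g.:
# sinhala_vocab = build_word_list(sinhala_sentences, min_freq=8)
```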
4.2. Evaluation Dictionary
Even though the BLI technique is unsupervised, a human-curated dataset is needed to evaluate the performance of the models. Depending on resource availability, we prepared the evaluation dictionaries as follows.
4.2.1. Sinhala-English
Using a Machine Translation Tool: We followed the method outlined by Zhang et al. (2017) to create the evaluation dictionary. Initially, word sets for English and Sinhala were extracted from the SiTa corpus as described in Section 4.1. The English word list was then sorted by frequency, starting from the most frequent word and proceeding to the least frequent. Each selected English word was translated to Sinhala using a Machine Translation tool (we used Google Translate, https://translate.google.com/m). The resulting Sinhala word was then back-translated to English using the same tool. From this output, the following word pairs were discarded:
• Discrepancies: words with any inconsistency between the original English word and its back-translated counterpart.
• Out-of-Vocabulary: words whose translations did not match the target-language vocabulary derived from the target-language corpus.
• Multi-Word Translations: words whose translations resulted in multi-word phrases.
• Proper Nouns.
All the challenges we encountered in creating this dictionary and the corresponding steps taken to address them are detailed in Table 1. This process continued until we were able to compile a minimum of 1,500 word pairs for the evaluation set, aligning with the typical size of evaluation sets used in most previous research (Zhang et al., 2021; Feng et al., 2022). The final dictionary consists of 1,562 word pairs.
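The round-trip filtering logic can be sketched as follows (the `translate` and `is_proper_noun` helpers are hypothetical placeholders for the MT tool and the manual proper-noun check; names are ours):

```python
def round_trip_filter(english_words, sinhala_vocab, translate, is_proper_noun,
                      target_size=1500):
    """Collect EnSi pairs that survive the filters described above.

    english_words: English words sorted by descending frequency.
    translate(word, src, tgt) and is_proper_noun(word) are hypothetical helpers.
    """
    pairs = []
    for en in english_words:
        si = translate(en, "en", "si")
        back = translate(si, "si", "en")
        if back.lower() != en.lower():   # discrepancy after back-translation
            continue
        if si not in sinhala_vocab:      # out-of-vocabulary translation
            continue
        if len(si.split()) > 1:          # multi-word translation
            continue
        if is_proper_noun(en):           # proper nouns are excluded
            continue
        pairs.append((en, si))
        if len(pairs) >= target_size:
            break
    return pairs
```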
4.2.2. Tamil-English
Using an existing Lexicon: As described in Section 4.1, we constructed separate monolingual word sets for English and Tamil. Using an existing English-Tamil dictionary (https://education.nsw.gov.au/content/dam/main-education/teaching-and-learning/curriculum/multicultural-education/eald/eald-bilingual-dictionary-tamil.pdf), we checked each word pair to verify whether both the English word and the corresponding Tamil word were present in our respective word lists. If both words were found in the lists, the pair was included in our evaluation dictionary. The final evaluation dataset comprises 1,155 word pairs extracted from the complete dictionary.
4.2.3. Punjabi-English
Using IndoWordNet: We used the existing English-Punjabi word pair lists available from IndoWordNet (https://github.com/cfiltnlp/IWN-WordLists/tree/main/bilingual/English-Punjabi). Initially, we removed multi-word Punjabi entries from the lists. We further removed the English-Punjabi word pairs where either the English or the Punjabi word was absent from the BPCC Samanantar parallel corpus (Gala et al., 2023). Next, we calculated the frequency of occurrence of each word in both the English and Punjabi lists in the BPCC dataset. Words with a frequency of occurrence less than nine were excluded. Additionally, we removed words with multiple Punjabi translations to ensure a one-to-one correspondence between English and Punjabi words. During manual review, we identified certain translations that were not the most commonly used terms. These translations were also removed, along with the proper nouns. Following these pre-processing steps, the final dataset consisted of 1,391 English-Punjabi word pairs.
4.3. Computing Environment
Initial experiments using Word2Vec and FastText embeddings, each with 300 dimensions, were conducted on Google Colab. Subsequently, experiments with XLM-R embeddings, which have 1024 dimensions, were conducted on a single NVIDIA Quadro M6000 GPU.
4.4. Experiment Setup
Embedding Creation: We carried out an ablation study to select threshold values for minimum frequencies, which resulted in minimum frequencies of 8 for EnSi and 6 for both EnTa and EnPa. The detailed results of the ablation study can be found in Tables 12-14 of Appendix B. Based on these thresholds, Word2Vec and FastText static word embeddings, along with XLM-R contextualized embeddings, were generated for the selected words. The resulting counts are shown in Table 2. (Although the same word sets were provided, Word2Vec, FastText, and XLM-R generated embeddings only for words present in their respective vocabularies. Since Word2Vec and FastText were trained on the same corpus with different pre-processing steps, their SRC embedding counts are similar, while XLM-R differs due to its unique tokenization and coverage.)
Table 2. Number of words for which embeddings were generated, per language pair.

Language pair | W2V SRC | W2V TRG | FastText SRC | FastText TRG | XLM-R SRC | XLM-R TRG
---|---|---|---|---|---|---
EnSi | 5101 | 6303 | 5101 | 6303 | 4845 | 6117 |
EnTa | 6034 | 10587 | 6034 | 10587 | 5758 | 10413 |
EnPa | 47783 | 68300 | 47783 | 68300 | 44296 | 65702 |
In addition, we needed to determine some parameters specific to the individual techniques we used. We integrated these techniques one at a time with UVecMap and then combined them in an iterative manner to determine the relevant hyper-parameters.
4.4.1. Linear Transformation Method:
As discussed in Section 3.2, tuning the hyperparameters $\alpha_{src}$ and $\alpha_{trg}$ for the source and target languages, respectively, is crucial. However, due to resource constraints, rather than exploring the full parameter space, we manually defined a subset of the hyperparameter space based on the experiments conducted by Vulić et al. (2020). Specifically, we applied post-processing to the embedding spaces of both the source and target languages using hyperparameter values $\alpha \in \{-0.5, -0.25, -0.15, 0, 0.15, 0.25, 0.5\}$. Subsequently, we ran the UVecMap model for each pair of $\alpha_{src}$ and $\alpha_{trg}$ values to generate cross-lingual word embeddings. These embeddings were evaluated against the evaluation dictionaries we created. This experiment was performed using both FastText and Word2Vec embeddings. For the XLM-R embeddings, separate $\alpha$ values were required when combined with Word2Vec and FastText. To generate the cross-lingual word embeddings for XLM-R, we used the CSCBLI model and evaluated these embeddings against the evaluation sets we developed. The selected $\alpha$ values for the source and target embeddings, identified after hyperparameter tuning, are summarized in Table 3. Tables 6-11, which present the hyperparameter tuning results for $\alpha$ selection with FastText and Word2Vec, are included in Appendix A.
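The grid search over α pairs can be summarized as below, reusing the `similarity_transform` sketch from Section 3.2; the `align_and_evaluate` helper is a hypothetical stand-in for running UVecMap (or CSCBLI) on the transformed embeddings and scoring against the evaluation dictionary:

```python
from itertools import product

ALPHAS = [-0.5, -0.25, -0.15, 0, 0.15, 0.25, 0.5]

def grid_search_alphas(X, Z, align_and_evaluate):
    """Return the (alpha_src, alpha_trg) pair with the highest evaluation score."""
    best = (None, None, -1.0)
    for a_src, a_trg in product(ALPHAS, ALPHAS):
        # similarity_transform is defined in the Section 3.2 sketch.
        score = align_and_evaluate(similarity_transform(X, a_src),
                                   similarity_transform(Z, a_trg))
        if score > best[2]:
            best = (a_src, a_trg, score)
    return best
```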
Table 3. Selected α values for the source (SRC) and target (TRG) embeddings.

Language pair | W2V SRC | W2V TRG | FastText SRC | FastText TRG | XLM-R+W2V SRC | XLM-R+W2V TRG | XLM-R+FastText SRC | XLM-R+FastText TRG
---|---|---|---|---|---|---|---|---
EnSi | 0.15 | 0.25 | 0 | 0.25 | 0 | -0.5 | -0.15 | 0.25 |
EnTa | 0.15 | 0 | 0.15 | 0.15 | 0.15 | 0 | 0.15 | 0.15 |
EnPa | -0.25 | 0 | -0.15 | 0.25 | 0.25 | 0 | 0.25 | 0 |
4.4.2. Dimension Reduction:
As outlined in Section 3.1, dimensionality reduction was applied in both the preprocessing and self-initialization stages of our implementation. Here, we specifically discuss its application during the preprocessing phase. The original dimensionalities of the embeddings are 300 for Word2Vec and FastText, and 1024 for XLM-R. To enhance computational efficiency and minimize redundancy, we applied dimensionality reduction using PCA, halving the original dimensions. This resulted in embeddings with 150 dimensions for Word2Vec and FastText, and 512 dimensions for XLM-R. These reduced embeddings formed a more compact and efficient representation while retaining key semantic features for subsequent tasks.
5. Results and Discussion
Following standard practice, we report precision@k (Pr@k) with k = 1 as the evaluation metric. Pr@k represents the proportion of source words for which a correct translation is found among the top k retrieved results. All the experimented method combinations are listed in Table 4, along with a code for easy reference thereafter. Experiment results are in Table 5.
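The Pr@k computation itself is straightforward (names are ours; `candidates` is assumed to hold a ranked list of retrieved translations per source word):

```python
def precision_at_k(gold, candidates, k=1):
    """gold: source word -> reference translation(s);
    candidates: source word -> ranked list of predicted translations."""
    correct = 0
    for src, refs in gold.items():
        refs = refs if isinstance(refs, (list, set, tuple)) else {refs}
        if any(pred in refs for pred in candidates.get(src, [])[:k]):
            correct += 1
    return 100.0 * correct / len(gold)
```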
Table 4. Method combinations experimented with, and their reference codes.

Experiment | Code
---|---
UVecMap (baseline) | M1 |
Effective Dim. Reduction + UVecMap | M2 |
Linear Transformation + UVecMap | M3 |
Linear Transformation + Effective Dim. Reduction + UVecMap | M4 |
Effective Dim. Reduction + Linear Transformation + UVecMap | M5 |
Iterative Dim. reduction + UVecMap | M6 |
Linear Transformation + Iterative Dim. Reduction + UVecMap | M7 |
UVecMap + Fusion | M8 |
Effective Dim. Reduction + UVecMap + Fusion | M9 |
Linear Transformation + UVecMap + Fusion | M10 |
Linear Transformation + Effective Dim. Reduction + UVecMap + Fusion | M11 |
Effective Dim. Reduction + Linear Transformation + UVecMap + Fusion | M12 |
Iterative Dim. reduction + UVecMap + Fusion | M13 |
Linear Transformation + Iterative Dim. Reduction + UVecMap + Fusion | M14 |
CSCBLI + UVecMap | M15 |
CSCBLI + Effective Dim.Reduction + UVecMap | M16 |
CSCBLI + Linear Transformation + UVecMap | M17 |
CSCBLI + Linear Transformation + Effective Dim. Reduction + UVecMap | M18 |
CSCBLI + Iterative Dim. reduction + UVecMap | M19 |
CSCBLI + Linear Transformation + Iterative Dim. Reduction + UVecMap | M20 |
Table 5. Pr@1 results for each experiment (differences from the M1 baseline in parentheses).

Experiment | EnSi W2V | EnSi FastText | EnTa W2V | EnTa FastText | EnPa W2V | EnPa FastText
---|---|---|---|---|---|---
M1 | 31.49 | 27.48 | 16.74 | 11.69 | 15.13 | 15.05 |
M2 | 31.22(-0.27) | 27.74(+0.26) | 16.07(-0.67) | 0.11(-11.58) | 13.48(-1.65) | 14.81(-0.24) |
M3 | 33.18(+1.69) | 28.81(+1.33) | 16.4(-0.34) | 14.04(+2.35) | 15.36(+0.23) | 16.22(+1.17) |
M4 | 32.11(+0.62) | 27.74(+0.26) | 14.49(-2.25) | 0(-11.69) | 14.03(-1.1) | 13.87(-1.18) |
M5 | 31.58(+0.09) | 27.21(-0.27) | 15.84(-0.9) | 13.48(+1.79) | 14.26(-0.87) | 14.5(-0.55) |
M6 | 32.83(+1.34) | 26.67(-0.81) | 16.52(-0.22) | 12.58(+0.89) | 15.13(0) | 15.91(+0.86) |
M7 | 32.11(+0.62) | 27.3(-0.18) | 8.99(-7.75) | 14.16(+2.47) | 15.20(+0.07) | 15.36(+0.31) |
M8 | 11.24(-20.25) | 23.19(-4.29) | 2.81(-13.93) | 0.11(-11.58) | 14.97(-0.16) | 15.28(+0.23) |
M9 | 20.25(-11.24) | 26.32(-1.16) | 14.72(-2.02) | 13.37(+1.68) | 12.77(-2.36) | 13.71(-1.34) |
M10 | 31.31(-0.18) | 27.03(-0.45) | 4.49(-12.25) | 0(-11.69) | 14.81(-0.32) | 15.36(+0.31) |
M11 | 30.6(-0.89) | 19.27(-8.21) | 1.46(-15.28) | 0(-11.69) | 14.18(-0.95) | 14.26(-0.79) |
M12 | 31.85(+0.36) | 26.58(-0.9) | 12.47(-4.27) | 0.11(-11.58) | 14.26(-0.87) | 13.32(-1.73) |
M13 | 12.67(-18.82) | 23.28(-4.2) | 2.92(-13.82) | 0.67(-11.02) | 15.13(0) | 15.2(+0.15) |
M14 | 32.29(+0.8) | 27.83(+0.35) | 5.17(-11.57) | 0(-11.69) | 14.97(-0.16) | 15.67(+0.62) |
M15 | 31.07(-0.42) | 28.37(+0.89) | 16.13(-0.61) | 13.36(+1.67) | 15.50(+0.37) | 15.24(+0.19) |
M16 | 30.42(-1.07) | 28.47(+0.99) | 16.73(-0.01) | 0.12(-11.57) | 14.29(-0.84) | 15.15(+0.1) |
M17 | 32.84(+1.35) | 29.49(+2.01) | 16.85(+0.11) | 15.28(+3.59) | 15.58(+0.45) | 16.36(+1.31) |
M18 | 32.09(+0.6) | 28.93(+1.45) | 15.64(-1.1) | 0(-11.69) | 14.55(-0.58) | 14.98(-0.07) |
M19 | 31.91(+0.42) | 28.37(+0.89) | 16.25(-0.49) | 12.76(+1.07) | 15.58(+0.45) | 15.76(+0.71) |
M20 | 32.19(+0.7) | 28.65(+1.17) | 9.38(-7.36) | 15.28(+3.59) | 15.41(+0.28) | 15.67(+0.62) |
The performance of the two embedding types is not consistent. At the baseline, Word2Vec clearly outperforms FastText for both EnSi and EnTa, and this pattern holds for most of the extended configurations, whereas Word2Vec and FastText results remain on par for EnPa.
It appears that CSCBLI with Linear Transformation and UVecMap (M17) yielded the best results for both Word2Vec and FastText. The combinations of CSCBLI with Iterative Dimension Reduction and UVecMap (M19), as well as CSCBLI with Linear Transformation, Iterative Dimension Reduction, and UVecMap (M20), also produced results similar to those of M17 for certain language pairs and embedding types. These combinations demonstrate the effectiveness of integrating multiple techniques to improve the quality of the embeddings and their alignment. The only exception is the result for EnSi with Word2Vec embeddings, where the best result was achieved by combining Linear Transformation with UVecMap (M3).
Note that with the EnTa FastText embeddings, some combinations cause performance to drop to (almost) zero. This occurs because the mapping relies heavily on a few key embeddings; when these are modified by the different techniques, the mapping can fail, leading to poor results.
Although we experimented with the fusion technique, the obtained accuracies, as detailed in Table 5 (M8-M14), were not satisfactory and fell short of the desired performance metrics.
6. Conclusion
The majority of research on BLI has predominantly focused on high-resource languages, leaving low-resource languages under-explored. Challenges such as the scarcity of parallel data, the nature of available datasets, and limitations in BLI and embedding techniques have resulted in suboptimal accuracy in previous studies. To address these issues, our research focused on enhancing the UVecMap model, a prominent unsupervised BLI framework. To achieve this, we employed a combination of techniques proposed by past research, in order to improve embedding creation, embedding pre-processing, and dictionary initialization. Our results show that combining multiple techniques on top of UVecMap provides the best BLI performance across the three low-resource language pairs. The new bilingual lexicons we created for EnSi and EnPa can be found in our GitHub repository.
However, despite these improvements, our research has several limitations. One such limitation is the constraint of GPU memory when running the CSCBLI method. This limitation restricted the maximum count of embeddings that could be processed, which in turn affected the scope of the language pairs and the volume of data used. The approach we used also requires manually setting hyper-parameters, which can limit its scalability and applicability to other language pairs with different linguistic characteristics. Future work will focus on optimizing memory usage, automating the hyper-parameter tuning process, and applying our approach to a broader range of LRLs.
Acknowledgements.
We thank Anushika Liyanage for sharing the findings and insights from her prior work with us. We acknowledge the contribution of Gautam for the information provided in preparing the Punjabi lexicon.

References
- Aldarmaki et al. (2018) Hanan Aldarmaki, Mahesh Mohan, and Mona Diab. 2018. Unsupervised word mapping using structural similarities in monolingual embeddings. Transactions of the Association for Computational Linguistics 6 (2018), 185–196.
- Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 451–462.
- Artetxe et al. (2018b) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018b. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 789–798.
- Artetxe et al. (2018a) Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018a. Unsupervised neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018.
- Artetxe et al. (2018c) Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, and Eneko Agirre. 2018c. Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation. In Proceedings of the 22nd Conference on Computational Natural Language Learning. 282–291.
- Bafna et al. (2023) Niyati Bafna, Cristina España-Bonet, Josef van Genabith, Benoît Sagot, and Rachel Bawden. 2023. A Simple Method for Unsupervised Bilingual Lexicon Induction for Data-Imbalanced, Closely Related Language Pairs. CoRR (2023).
- Baldwin et al. (2010) Timothy Baldwin, Jonathan Pool, and Susan Colowick. 2010. PanLex and LEXTRACT: Translating all words of all languages of the world. In Coling 2010: Demonstrations. 37–40.
- Besacier et al. (2014) Laurent Besacier, Etienne Barnard, Alexey Karpov, and Tanja Schultz. 2014. Automatic speech recognition for under-resourced languages: A survey. Speech communication 56 (2014), 85–100.
- Cao et al. (2023a) Hailong Cao, Liguo Li, Conghui Zhu, Muyun Yang, and Tiejun Zhao. 2023a. Dual Word Embedding for Robust Unsupervised Bilingual Lexicon Induction. IEEE/ACM Transactions on Audio, Speech, and Language Processing (2023).
- Cao and Zhao (2021) Hailong Cao and Tiejun Zhao. 2021. Word Embedding Transformation for Robust Unsupervised Bilingual Lexicon Induction. arXiv:2105.12297v1 (2021).
- Cao et al. (2023b) Hailong Cao, Tiejun Zhao, Weixuan Wang, and Wei Peng. 2023b. Bilingual word embedding fusion for robust unsupervised bilingual lexicon induction. Information Fusion 97 (2023), 101818.
- Cao et al. (2016) Hailong Cao, Tiejun Zhao, Shu Zhang, and Yao Meng. 2016. A distribution-based model to learn bilingual word embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 1818–1827.
- Chaudhary et al. (2020) Aditi Chaudhary, Karthik Raman, Krishna Srinivasan, and Jiecao Chen. 2020. Dict-mlm: Improved multilingual pre-training using bilingual dictionaries. arXiv preprint arXiv:2010.12566 (2020).
- Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 8440–8451.
- Farhath et al. (2018) Fathima Farhath, Surangika Ranathunga, Sanath Jayasena, and Gihan Dias. 2018. Integration of bilingual lists for domain-specific statistical machine translation for sinhala-tamil. In 2018 Moratuwa Engineering Research Conference (MERCon). IEEE, 538–543.
- Feng et al. (2022) Zihao Feng, Hailong Cao, Tiejun Zhao, Weixuan Wang, and Wei Peng. 2022. Cross-lingual Feature Extraction from Monolingual Corpora for Low-resource Unsupervised Bilingual Lexicon Induction. In Proceedings of the 29th International Conference on Computational Linguistics. 5278–5287.
- Fernando et al. (2020) Aloka Fernando, Surangika Ranathunga, and Gihan Dias. 2020. Data augmentation and terminology integration for domain-specific sinhala-english-tamil statistical machine translation. arXiv preprint arXiv:2011.02821 (2020).
- Fernando et al. (2023) Aloka Fernando, Surangika Ranathunga, Dilan Sachintha, Lakmali Piyarathna, and Charith Rajitha. 2023. Exploiting bilingual lexicons to improve multilingual embedding-based document and sentence alignment for low-resource languages. Knowledge and Information Systems 65, 2 (2023), 571–612.
- Gala et al. (2023) Jay Gala, Pranjal A. Chitale, Raghavan AK, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, and Anoop Kunchukuttan. 2023. IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages. Transactions on Machine Learning Research (2023).
- Glavaš et al. (2019) Goran Glavaš, Robert Litschko, Sebastian Ruder, and Ivan Vulić. 2019. How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 710–721. https://doi.org/10.18653/v1/P19-1070
- Haghighi et al. (2008) Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL-08: Hlt. 771–779.
- Hoshen and Wolf (2018) Yedid Hoshen and Lior Wolf. 2018. Non-Adversarial Unsupervised Word Translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 469–478.
- Joshi et al. (2020) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6282–6293.
- Joulin et al. (2017) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 427–431.
- Keersmaekers et al. (2023) Alek Keersmaekers, Wouter Mercelis, and Toon Van Hal. 2023. Word Sense Disambiguation for Ancient Greek: Sourcing a training corpus through translation alignment. In Proceedings of the Ancient Language Processing Workshop. 148–159.
- Lample and Conneau (2019) Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. In Proceedings of the Advances in Neural Information Processing Systems. 7059–7069.
- Lample et al. (2018) Guillaume Lample, Alexis Conneau, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International conference on learning representations.
- Li et al. (2020) Yanyang Li, Yingfeng Luo, Ye Lin, Quan Du, Huizhen Wang, Shujian Huang, Tong Xiao, and Jingbo Zhu. 2020. A Simple and Effective Approach to Robust Unsupervised Bilingual Dictionary Induction. In Proceedings of the 28th International Conference on Computational Linguistics. 5990–6001.
- Liu et al. (2020) Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual Denoising Pre-training for Neural Machine Translation. In Transactions of the Association for Computational Linguistics. 726–742.
- Liyanage et al. (2021) Anushika Liyanage, Surangika Ranathunga, and Sanath Jayasena. 2021. Bilingual lexical induction for sinhala-english using cross lingual embedding spaces. In 2021 Moratuwa Engineering Research Conference (MERCon). IEEE, 579–584.
- Mayhew et al. (2017) Stephen Mayhew, Chen-Tse Tsai, and Dan Roth. 2017. Cheap translation for cross-lingual named entity recognition. In Proceedings of the 2017 conference on empirical methods in natural language processing. 2536–2545.
- Miao et al. (2024) Zhongtao Miao, Qiyu Wu, Kaiyan Zhao, Zilong Wu, and Yoshimasa Tsuruoka. 2024. Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment. In Findings of the Association for Computational Linguistics: NAACL 2024. 3225–3236.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781v3 (2013).
- Mikolov et al. (2013b) Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 (2013).
- Mohammed and Prasad (2023) Idi Mohammed and Rajesh Prasad. 2023. Building lexicon-based sentiment analysis model for low-resource languages. Methods X 11 (2023).
- Nishikawa et al. (2021) Sosuke Nishikawa, Ryokan Ri, and Yoshimasa Tsuruoka. 2021. Data Augmentation with Unsupervised Machine Translation Improves the Structural Similarity of Cross-lingual Word Embeddings. ACL-IJCNLP 2021 (2021), 163.
- Ormazabal et al. (2021) Aitor Ormazabal, Mikel Artetxe, Aitor Soroa, Gorka Labaka, and Eneko Agirre. 2021. Beyond Offline Mapping: Learning Cross-lingual Word Embeddings through Context Anchoring. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 6479–6489.
- Pavlick et al. (2014) Ellie Pavlick, Matt Post, Ann Irvine, Dmitry Kachaev, and Chris Callison-Burch. 2014. The language demographics of amazon mechanical turk. Transactions of the Association for Computational Linguistics 2 (2014), 79–92.
- Ranathunga and De Silva (2022) Surangika Ranathunga and Nisansa De Silva. 2022. Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 823–848.
- Ranathunga et al. (2024a) Surangika Ranathunga, Nisansa de Silva, Dilith Jayakody, and Aloka Fernando. 2024a. Shoulders of Giants: A Look at the Degree and Utility of Openness in NLP Research. arXiv preprint arXiv:2406.06021 (2024).
- Ranathunga et al. (2024b) Surangika Ranathunga, Nisansa De Silva, Velayuthan Menan, Aloka Fernando, and Charitha Rathnayake. 2024b. Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 860–880.
- Ren et al. (2020) Shuo Ren, Shujie Liu, Ming Zhou, and Shuai Ma. 2020. A graph-based coarse-to-fine method for unsupervised bilingual lexicon induction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 3476–3485.
- Shi et al. (2021) Haoyue Shi, Luke Zettlemoyer, and Sida I Wang. 2021. Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 813–826.
- Søgaard et al. (2018) Anders Søgaard, Sebastian Ruder, and Ivan Vulić. 2018. On the Limitations of Unsupervised Bilingual Dictionary Induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 778–788.
- Vulić et al. (2020) Ivan Vulić, Anna Korhonen, and Goran Glavaš. 2020. Improving bilingual lexicon induction with unsupervised post processing of monolingual word vector spaces. In Proceedings of the 5th Workshop on Representation Learning for NLP. 45–54.
- Vulić and Moens (2015) Ivan Vulić and Marie-Francine Moens. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. 363–372.
- Wickramasinghe and De Silva (2023) Kasun Wickramasinghe and Nisansa De Silva. 2023. Sinhala-English Parallel Word Dictionary Dataset. In 2023 IEEE 17th International Conference on Industrial and Information Systems (ICIIS). IEEE, 61–66.
- Yang et al. (2019) Pengcheng Yang, Fuli Luo, Peng Chen, Tianyu Liu, and Xu Sun. 2019. MAAM: A morphology-aware alignment model for unsupervised bilingual lexicon induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3190–3196.
- Yong et al. (2024) Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. 2024. LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons. arXiv preprint arXiv:2402.14086 (2024).
- Zhang et al. (2021) Jinpeng Zhang, Baijun Ji, Nini Xiao, Xiangyu Duan, Min Zhang, Yangbin Shi, and Weihua Luo. 2021. Combining Static Word Embeddings and Contextual Representations for Bilingual Lexicon Induction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2943–2955.
- Zhang et al. (2017) Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial Training for Unsupervised Bilingual Lexicon Induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 1959–1970.
Appendix A Linear Transformation Method
Tables 6-11 present the results obtained during the hyperparameter tuning process, as described in Section 4.4.1.
Table 6. α tuning results (Pr@1) for EnSi with FastText embeddings; columns correspond to the source α, rows to the target α.

TRG \ SRC | -0.5 | -0.25 | -0.15 | 0 | 0.15 | 0.25 | 0.5
---|---|---|---|---|---|---|---
-0.5 | 0.09 | 0 | 0 | 0.09 | 0 | 0.18 | 0.09 |
-0.25 | 0 | 0.45 | 0 | 0.27 | 20.07 | 20.79 | 0.27 |
-0.15 | 0 | 0.36 | 22.39 | 23.82 | 24.8 | 22.84 | 0.09 |
0 | 0 | 4.46 | 23.64 | 25.69 | 27.48 | 26.05 | 6.07 |
0.15 | 0.09 | 0.62 | 25.87 | 27.74 | 28.37 | 27.48 | 9.81 |
0.25 | 0 | 23.91 | 25.16 | 28.81 | 27.48 | 27.65 | 22.75 |
0.5 | 0.18 | 20.79 | 22.66 | 24.62 | 25.51 | 25.78 | 22.48 |
Table 7. α tuning results (Pr@1) for EnSi with Word2Vec embeddings; columns correspond to the source α, rows to the target α.

TRG \ SRC | -0.5 | -0.25 | -0.15 | 0 | 0.15 | 0.25 | 0.5
---|---|---|---|---|---|---|---
-0.5 | 0.18 | 0.09 | 0 | 0 | 0.09 | 0 | 0 |
-0.25 | 0 | 1.52 | 26.67 | 28.55 | 12.58 | 0.18 | 16.77 |
-0.15 | 0.09 | 0.54 | 28.72 | 30.15 | 4.82 | 0.18 | 19.18 |
0 | 0.45 | 29.44 | 32.11 | 32.56 | 31.67 | 30.51 | 0.54 |
0.15 | 0.27 | 30.78 | 21.14 | 33.1 | 32.92 | 21.5 | 0.09 |
0.25 | 0.09 | 29.62 | 32.02 | 32.83 | 33.18 | 31.58 | 26.76 |
0.5 | 0 | 24 | 27.3 | 29.08 | 29.62 | 29.53 | 27.12 |
Table 8. α tuning results (Pr@1) for EnTa with FastText embeddings; columns correspond to the source α, rows to the target α.

TRG \ SRC | -0.5 | -0.25 | -0.15 | 0 | 0.15 | 0.25 | 0.5
---|---|---|---|---|---|---|---
-0.5 | 0 | 0.22 | 0 | 0 | 0 | 0.11 | 0 |
-0.25 | 0 | 0.11 | 0 | 0 | 0 | 0 | 0.11 |
-0.15 | 0.11 | 0.11 | 0 | 0 | 0.11 | 0.34 | 0.11 |
0 | 0 | 0 | 0 | 13.15 | 0.34 | 0.11 | 0.11 |
0.15 | 0 | 0.11 | 0.11 | 0.12 | 14.04 | 0.11 | 0.11 |
0.25 | 0.11 | 0.11 | 0.9 | 0.56 | 13.03 | 0 | 0.11 |
0.5 | 0 | 0 | 4.16 | 0.9 | 8.09 | 0 | 0 |
Table 9. α tuning results (Pr@1) for EnTa with Word2Vec embeddings; columns correspond to the source α, rows to the target α.

TRG \ SRC | -0.5 | -0.25 | -0.15 | 0 | 0.15 | 0.25 | 0.5
---|---|---|---|---|---|---|---
-0.5 | 0.11 | 0.11 | 0 | 0 | 0 | 0 | 0 |
-0.25 | 0 | 0.22 | 12.92 | 13.71 | 0 | 0 | 6.85 |
-0.15 | 0 | 11.8 | 2.36 | 15.17 | 15.39 | 12.81 | 9.21 |
0 | 0.22 | 13.6 | 0.67 | 16.18 | 16.4 | 1.91 | 5.39 |
0.15 | 0.11 | 0.11 | 0 | 15.39 | 2.92 | 14.27 | 11.24 |
0.25 | 0 | 0.9 | 12.7 | 0.22 | 15.17 | 14.72 | 11.35 |
0.5 | 0 | 5.62 | 7.87 | 9.44 | 3.6 | 10.67 | 9.78 |
Table 10. α tuning results (Pr@1) for EnPa with FastText embeddings; columns correspond to the source α, rows to the target α.

TRG \ SRC | -0.5 | -0.25 | -0.15 | 0 | 0.15 | 0.25 | 0.5
---|---|---|---|---|---|---|---
-0.5 | 0 | 0.08 | 0 | 0 | 0 | 0 | 0 |
-0.25 | 0 | 14.66 | 15.05 | 15.44 | 15.52 | 14.58 | 0 |
-0.15 | 0.16 | 14.81 | 15.36 | 15.75 | 15.44 | 14.73 | 0.16 |
0 | 0 | 15.13 | 15.28 | 15.6 | 15.67 | 15.05 | 12.23 |
0.15 | 0 | 15.05 | 15.99 | 15.67 | 15.44 | 15.44 | 12.38 |
0.25 | 0 | 15.75 | 16.22 | 15.83 | 14.81 | 14.26 | 12.46 |
0.5 | 0 | 0.08 | 0.24 | 14.34 | 14.03 | 13.64 | 11.44 |
Table 11. α tuning results (Pr@1) for EnPa with Word2Vec embeddings; columns correspond to the source α, rows to the target α.

TRG \ SRC | -0.5 | -0.25 | -0.15 | 0 | 0.15 | 0.25 | 0.5
---|---|---|---|---|---|---|---
-0.5 | 0 | 0 | 0.16 | 0.08 | 0 | 0 | 0 |
-0.25 | 0 | 14.42 | 14.34 | 14.34 | 14.18 | 13.95 | 0 |
-0.15 | 0 | 14.97 | 14.97 | 15.2 | 14.58 | 14.34 | 0 |
0 | 0 | 15.36 | 14.97 | 14.81 | 14.81 | 14.81 | 12.54 |
0.15 | 0 | 14.97 | 15.28 | 15.2 | 14.66 | 14.5 | 12.54 |
0.25 | 0 | 0.08 | 0.31 | 15.2 | 14.26 | 14.18 | 12.46 |
0.5 | 0 | 0 | 0 | 0 | 13.48 | 13.01 | 11.36 |
Appendix B Frequency Thresholds
Tables 12 to 14 illustrate the impact of minimum frequency thresholds on accuracy for the EnTa, EnPa, and EnSi datasets.
Table 12. Effect of the minimum frequency threshold on accuracy and vocabulary coverage for EnTa.

Min. Freq. | Accuracy (W2V) | Accuracy (FastText) | Coverage
---|---|---|---
2 | 1.14 | 1.23 | 98.87% |
4 | 12.78 | 0.1 | 86.14% |
6 | 15.39 | 12.81 | 77.12% |
8 | 16.54 | 15.2 | 70.71% |
10 | 20.16 | 17.28 | 66.2% |
Table 13. Effect of the minimum frequency threshold on accuracy and vocabulary coverage for EnPa.

Min. Freq. | Accuracy (W2V) | Accuracy (FastText) | Coverage
---|---|---|---
2 | 12.53 | 15.22 | 98.78% |
4 | 13.94 | 14.62 | 95.47% |
6 | 15.13 | 15.05 | 91.8% |
8 | 15.87 | 15.13 | 88.42% |
10 | 16.46 | 16.37 | 85.25% |
Table 14. Effect of the minimum frequency threshold on accuracy and vocabulary coverage for EnSi.

Min. Freq. | Accuracy (W2V) | Accuracy (FastText) | Coverage
---|---|---|---
2 | 19.42 | 8.82 | 93.02% |
4 | 20.03 | 20.63 | 85.07% |
6 | 24.25 | 25 | 77.39% |
8 | 27.48 | 31.49 | 71.81% |
10 | 30.91 | 32.53 | 63.42% |