Reference:
Skacheva N.V. Analysis of Idioms in Neural Machine Translation: A Data Set // Software Systems and Computational Methods. 2024. No. 3. P. 55-63. DOI: 10.7256/2454-0714.2024.3.71518 EDN: JLJDSL URL: https://en.nbpublish.com/library_read_article.php?id=71518
Analysis of Idioms in Neural Machine Translation: A Data Set
DOI: 10.7256/2454-0714.2024.3.71518
EDN: JLJDSL
Received: 19-08-2024
Published: 05-10-2024

Abstract: For decades there has been public debate about whether a "machine can replace a person," and the field of translation is no exception. While some keep arguing, others keep working to make the dream a reality, so more and more research is aimed at improving machine translation (MT) systems. To understand the advantages and disadvantages of MT systems, one must first understand their algorithms. At present, the main open problem of neural machine translation (NMT) is the translation of idiomatic expressions. The meaning of such expressions is not composed of the meanings of their constituent words, and NMT models tend to translate them literally, which leads to confusing and meaningless translations. Research on idioms in NMT is limited and difficult due to the lack of automatic methods. Thus, although modern NMT systems generate increasingly high-quality translations, the translation of idioms remains one of the unsolved tasks in this area. This is because idioms, as a category of multiword expressions, represent an interesting linguistic phenomenon in which the overall meaning of an expression cannot be composed from the meanings of its parts. The first important problem is the lack of dedicated datasets for training and evaluating the translation of idioms. In this paper, we address this problem by creating the first large-scale dataset for idiom translation. This dataset is automatically extracted from a German-Russian parallel corpus and includes a test set in which all sentences contain idioms, and a regular training corpus in which sentences containing idioms are marked. We have released this dataset and use it to conduct preliminary NMT experiments as a first step towards improving the translation of idioms.

Keywords: multiword expression, idioms, bilingual corpora, machine translation, neural machine translation, German, Russian, language pairs, systems, dataset
Introduction

In recent years, neural machine translation (NMT) has significantly improved translation quality compared to traditional statistical machine translation (SMT), phrase-based machine translation (PBMT), and rule-based machine translation (RBMT). To understand the issue, let us look at how SMT and PBMT work.

The field of statistical machine translation began to develop as more parallel texts became available in linguistics. Such parallel texts can be found in various linguistic corpora. One of the earliest and largest is the Europarl corpus, a collection of proceedings of the European Parliament compiled since 1996 and covering, at the time, 11 languages of the European Union. This corpus was used to build 110 machine translation systems [1]. For the Russian language, an important resource is the Russian National Corpus (hereinafter RNC), namely its parallel corpora, which include languages such as English, Armenian, Belarusian, Bulgarian, Spanish, Italian, Chinese, and German. At present, the English parallel corpus of the RNC contains 1,322 texts and 45,235,028 words [2]; it is still under development. Most SMT systems are largely language-independent, and building an SMT system for a new language pair depends mainly on the availability of parallel texts. That is why creating parallel texts paired with Russian is so important.

Initially, translation in SMT was based on the IBM word-based models [3]. Modern models are phrase-based [4]: when translating a sentence, source-language phrases, as arbitrary sequences of words, are converted into target-language phrases. The core of such a model is a probabilistic phrase table derived from a parallel corpus. Decoding is a beam search over all possible segmentations of the input into phrases, all translations of each phrase, and all reorderings. Using the Bayes formula, the problem of translating a sentence can be represented by the following equation:

arg max_ϕe P(ϕe | ϕr) = arg max_ϕe P(ϕe) · P(ϕr | ϕe),

where ϕe is the target (translation) phrase and ϕr is the source phrase. The language model P(ϕe) is therefore estimated from a corpus of target-language text, and the translation model P(ϕr | ϕe) from a parallel corpus of the two languages.
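To make the noisy-channel equation above concrete, here is a minimal sketch, not taken from the article: the phrase tables, probabilities, and candidate list are invented for illustration, and a real decoder searches over segmentations and reorderings rather than a fixed set of candidates.

```python
import math

# Toy noisy-channel scoring for arg max P(e) * P(r|e).
# All phrases and probabilities below are invented for illustration.

lm = {  # language model P(e): target-language (Russian) phrases
    "я вижу машину": 0.004,
    "я вижу автомобиль": 0.002,
}
tm = {  # translation model P(r|e): source phrase given target phrase
    ("ich sehe ein Auto", "я вижу машину"): 0.3,
    ("ich sehe ein Auto", "я вижу автомобиль"): 0.5,
}

def score(source: str, candidate: str) -> float:
    """log P(e) + log P(r|e), the quantity maximized in the equation above."""
    return math.log(lm[candidate]) + math.log(tm[(source, candidate)])

source = "ich sehe ein Auto"
best = max(lm, key=lambda candidate: score(source, candidate))
print(best)  # "я вижу машину", since 0.004 * 0.3 > 0.002 * 0.5
```

Log-probabilities are used, as in real decoders, to avoid numeric underflow when many small probabilities are multiplied.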
In such MT systems, n-grams are used as the language model [5]. Google has its own collection of n-grams, currently the largest in the world [6]; the largest collection of Russian n-grams is, of course, the RNC. An n-gram model predicts the next word from the preceding n-1 words: there are bigrams consisting of two words, trigrams of three, and so on. As n grows, the probability of the next word is conditioned on a longer history. For a sequence of words, the chain rule gives:

P(n1, n2, n3, n4) = P(n4 | n1, n2, n3) · P(n3 | n1, n2) · P(n2 | n1) · P(n1)

Consider the probability of the sentence "Ich sehe ein Auto auf der Strasse":

P(Ich) · P(sehe | Ich) · P(ein | Ich sehe) · P(Auto | Ich sehe ein) · P(auf | Ich sehe ein Auto) · P(der | Ich sehe ein Auto auf) · P(Strasse | Ich sehe ein Auto auf der)

Since "auf der Strasse" occurs more often than just "der Strasse", under the Markov assumption [7] the last factor is approximated as P(Strasse | auf der). That is, a Markov model of order n looks like:

P(Ai | A1, A2, ..., Ai-1) ≈ P(Ai | Ai-n, Ai-n+1, ..., Ai-1)

Thus, splitting a text into n-grams during translation makes it possible to find the most plausible translation. The problem with such models is that the system does not always capture the dependencies between words, especially when the words are far apart. These MT systems became widespread because they require little human effort, provided that a parallel corpus for the two languages exists. Such systems are trainable, and the more texts the parallel corpora contain, the more adequate the translation of a new text in the given language pair will be. The translation of idioms is a problem for such systems, since they do not capture the dependencies between an idiom's words and are not sensitive to context.
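As an illustration of the n-gram scoring just described, the following sketch builds a trigram (second-order Markov) model from a toy corpus and scores a sentence; the two-sentence corpus and the add-one smoothing are our own simplifications, not part of the article's experiments.

```python
from collections import Counter

# Trigram language model: P(A_i | A_{i-2}, A_{i-1}) estimated from counts,
# with add-one smoothing. Deliberately simplistic, for illustration only.

corpus = [
    "ich sehe ein Auto auf der Strasse".split(),
    "das Auto steht auf der Strasse".split(),
]

trigrams, bigrams = Counter(), Counter()
for sent in corpus:
    padded = ["<s>", "<s>"] + sent
    for i in range(2, len(padded)):
        trigrams[tuple(padded[i - 2 : i + 1])] += 1
        bigrams[tuple(padded[i - 2 : i])] += 1

vocab_size = len({w for s in corpus for w in s}) + 1  # +1 for <s>

def sentence_prob(sentence: list) -> float:
    """Product of P(A_i | A_{i-2}, A_{i-1}) over the sentence."""
    padded = ["<s>", "<s>"] + sentence
    p = 1.0
    for i in range(2, len(padded)):
        tri, ctx = tuple(padded[i - 2 : i + 1]), tuple(padded[i - 2 : i])
        p *= (trigrams[tri] + 1) / (bigrams[ctx] + vocab_size)
    return p

print(sentence_prob("ich sehe ein Auto auf der Strasse".split()))
```

Even this toy model shows the limitation noted above: dependencies longer than the n-gram window, such as the words of a discontinuous idiom, are invisible to it.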
RBMT is an MT paradigm in which linguistic knowledge is encoded by an expert in the form of rules for translating from the source language into the target language [8]. This approach gives full control over the behavior of the system, but the cost of formalizing the necessary linguistic knowledge is much higher than that of training a corpus-based system, where the process is automatic. Still, this kind of MT has its strengths, even under low-resource conditions. In RBMT, a linguist formalizes linguistic knowledge as grammatical rules, and the system uses this knowledge to analyze sentences in the source and target languages. The big advantage of such a system is that it requires no corpora at all, but encoding the linguistic knowledge takes a large amount of expert time. The largest translation project built on RBMT is Systran [9]. RBMT is the opposite of the corpus-trained MT systems described earlier; consequently, it suits language pairs with few parallel corpora and can therefore cover more language pairs. The basic approach of an RBMT system is to relate the structure of the input sentence in the source language to the structure of the output sentence in the target language, preserving the meaning of the sentence. But RBMT, too, copes poorly with idioms.

What is the difficulty of translating idioms? The difficulty of translating idiomatic phrases is explained partly by the difficulty of identifying a phrase as idiomatic and producing a correct translation for it, and partly by the fact that idioms are rare in the standard datasets used for training neural machine translation (NMT) systems. To illustrate the problem, we also present the output of two NMT systems, Google and DeepL, for one such sentence (see Table 1). The problem is especially pronounced when the original idiom differs greatly from its equivalent in the target language, as in this case.
Table 1. Translation of idioms in NMT systems

Although a number of monolingual datasets exist for identifying idiomatic expressions, work on creating a parallel corpus annotated with idioms, which is necessary for a more systematic study of this problem, is limited. For example, American researchers selected a small subset of 17 English idioms, collected 10 sample sentences for each idiom from the Internet, and manually translated them into Brazilian Portuguese for use in a translation task [10]. Creating a dataset for idiom translation manually is an expensive and time-consuming task. In this article, we automatically create a new bilingual dataset for idiom translation, extracted from an existing parallel corpus of general-purpose German-Russian texts. The first part of our dataset consists of 1,500 parallel sentences whose German side contains an idiom, together with their Russian translations. In addition, we provide corresponding training datasets for German-Russian and Russian-German translation, in which the source sentences that include an idiomatic phrase are marked. We believe that a large dataset for training and evaluation is the first step towards improving the translation of idioms.

Data collection

In this paper, we focus on the German-Russian translation of idioms. Automatically identifying idiomatic phrases in a parallel corpus requires a dataset manually annotated by linguists. We use a manually compiled dictionary of idiomatic and colloquial phrases as a reference for extracting idiomatic phrase pairs. At the same time, we found that the standard parallel corpora available for training contain a number of such sentence pairs. We therefore automatically select, from the training corpus, sentence pairs in which the source sentence contains an idiomatic phrase, in order to create a new test set. Note that we focus only on idioms on the source side and keep two separate lists of idioms for German and Russian, so we independently create two test sets (for translating German idioms and for translating Russian idioms) with different sentence pairs selected from the parallel corpora.

Identifying occurrences of idioms is not trivial. For example, in German the subject may stand between the verb and the prepositional phrase that make up an idiom, and German also allows several permutations of a phrase. To generalize the identification of idiom occurrences, we normalize the phrases and treat various permutations of the words of a phrase as acceptable matches. We also allow a fixed number of words to occur between the words of an idiomatic phrase. Following this set of rules, we extract sentence pairs containing idiomatic phrases and build a set of sentence pairs for each unique idiomatic phrase. In the next step, we sample from these sets without replacement, selecting individual sentence pairs to create the test set. To create the new training data, we use the remaining sentence pairs from each idiom set, as well as the sentence pairs from the original parallel corpora that contain no idiomatic phrase. In this process, we make sure that each idiomatic expression occurs in at least one form in both the training and the test data, and that no sentence is included in both. Note that for some idioms the literal translation into the target language is close to the actual meaning; sentence pairs in which an idiomatic expression is used as a literal phrase will still be identified as idiomatic sentences. A minimal sketch of this matching-and-marking procedure is given below.
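The following sketch illustrates the matching rules just described: every word of an idiom must occur in the sentence, in any order, with at most a fixed number of intervening words, and matched source sentences are marked. The gap size, the <idiom> marker token, and the sample idiom are illustrative assumptions, not the article's exact settings.

```python
GAP = 2  # assumed maximum number of extra words between idiom words

def contains_idiom(tokens: list, idiom: list) -> bool:
    """True if all idiom words occur, in any order, within a bounded window."""
    idiom_words = set(idiom)
    positions = [i for i, tok in enumerate(tokens) if tok in idiom_words]
    if len({tokens[i] for i in positions}) < len(idiom_words):
        return False  # at least one idiom word is missing from the sentence
    # window: the idiom words plus up to GAP extra words between each pair
    window = len(idiom) + GAP * (len(idiom) - 1)
    return positions[-1] - positions[0] + 1 <= window

def mark(sentence: str, idioms: list) -> str:
    """Prepend a marker token when the source sentence contains any idiom."""
    tokens = sentence.split()
    if any(contains_idiom(tokens, idiom) for idiom in idioms):
        return "<idiom> " + sentence
    return sentence

idioms = [["ins", "Gras", "beissen"]]  # "to bite the dust"
print(mark("er wird bald ins Gras beissen", idioms))
# -> <idiom> er wird bald ins Gras beissen
```

The same marker can then serve as the extra input token used in the translation experiments described next.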
Translation experiments

Although the focus of this work is on creating datasets for training and evaluating idiom translation, we also conduct a number of preliminary NMT experiments with our dataset to assess the problem of idiom translation on large datasets. In the first experiment, we do not use any labels in the data when training the translation model. In the second experiment, we use the labels in the training data as an additional feature, to signal the presence of an idiomatic phrase in a sentence during training. We run this experiment for German and Russian, providing additional input features to the model. The additional feature indicates whether the source sentence contains an idiom and is implemented as a special extra token added to each source sentence containing an idiom. This is a simple approach that can be applied to any sequence-to-sequence model. Most NMT systems are sequence-to-sequence models in which an encoder builds a representation of the source sentence, and a decoder, using the previous hidden LSTM states and an attention mechanism, generates the target translation. We use a 4-layer attention-based encoder-decoder model, as described in the work of Thang Luong, Hieu Pham, and Christopher D. Manning [11]. In all experiments, the NMT vocabulary is limited to the 30 thousand most frequent words in each language, and we preprocess the source- and target-language data using 30 thousand subword merge operations. We also use a phrase-based translation system similar to Moses [12] as a baseline, to examine how PBMT fares on idiom translation.

Evaluation of idiom translation

Ideally, the translation of idioms should be evaluated manually, but this is a very expensive process. Automatic metrics, on the other hand, can be applied to large amounts of data at little cost and have the advantage of reproducibility. To assess translation quality, we use the following metrics, paying special attention to the accuracy of idiom translation.

BLEU. The traditional BLEU score [13] is a good indicator of overall translation quality. However, this measure takes into account the precision of all n-grams in a sentence and by itself does not focus on the quality of translation of idiomatic expressions.

Modified unigram precision. To focus on the quality of translation of idiomatic expressions, we also look at localized precision: we translate an idiomatic expression in the context of a sentence but evaluate only the quality of the translation of the idiomatic phrase. To isolate the translation of the idiom in a sentence, we look at the word-level alignment between the idiom in the source sentence and the generated translation in the target sentence. For word alignment we use fast_align [14]. Since idiomatic phrases and their translations are in many cases not contiguous, we compare only the unigrams of the two phrases. Note that for this metric we have two references: the translation of the idiom as an independent expression, and the human translation of the idiom in the target sentence.

Word-level idiom translation accuracy. We also use another metric to evaluate the accuracy of translating the idiomatic phrase at the word level, using the word alignment between the source and target sentences to determine the number of correctly translated words. The accuracy is calculated with the following equation:

WIAcc = (H - I) / N,

where H is the number of correctly translated words, I is the number of extra words in the translation of the idiom, and N is the number of words in the reference translation of the idiom.
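As a minimal sketch of this computation, assuming the idiom's translation has already been extracted from the MT output via the word alignment (e.g., with fast_align), and with invented example word lists:

```python
# WIAcc = (H - I) / N over the aligned idiom phrases.
# H: correctly translated words, I: extra words, N: reference length.

def wi_acc(hyp_idiom: list, ref_idiom: list) -> float:
    """Word-level idiom accuracy for one extracted idiom translation."""
    remaining = list(ref_idiom)
    hits = 0
    for word in hyp_idiom:
        if word in remaining:
            hits += 1
            remaining.remove(word)  # each reference word may match only once
    inserted = len(hyp_idiom) - hits  # hypothesis words with no reference match
    return (hits - inserted) / len(ref_idiom)

# An exact match scores 1.0; a literal mistranslation scores negatively:
print(wi_acc(["бить", "баклуши"], ["бить", "баклуши"]))     # 1.0
print(wi_acc(["бить", "в", "ведро"], ["бить", "баклуши"]))  # (1 - 2) / 2 = -0.5
```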
Table 2 shows the results for the translation task using the metrics described above.
Table 2. Translation performance on the test set of German idioms. Word-level idiom accuracy and unigram precision are computed only for the idiomatic phrase and its corresponding translation in the sentence.

The NMT experiment with a special input token indicating the presence of an idiom in a sentence still outperforms PBMT, but is slightly worse than the baseline NMT according to the BLEU score. Despite this drop in BLEU, an examination of the unigram precision and the word-level idiom accuracy shows that this model generates more accurate translations of idioms. These preliminary experiments confirm the difficulty of translating idioms with neural models and, moreover, show that given a labeled dataset we can develop simple models to address this problem.

Conclusions

Translating idioms is one of the most difficult tasks in machine translation. In particular, neural MT has been shown to handle idioms poorly, despite its overall advantage over previous MT paradigms. As a first step towards a better understanding of this problem, we have presented a parallel dataset for training and testing idiom translation for the German-Russian and Russian-German directions. The test sets consist of sentences with at least one idiom on the source side, and the training data is a mixture of idiomatic and non-idiomatic sentences with labels to distinguish them. We also conducted preliminary translation experiments and proposed several metrics for evaluating idiom translation. The datasets we are creating can be used to further study and improve the performance of NMT systems in translating idioms.

References
1. Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of MT Summit X, Phuket, Thailand, 79-86.
2. National Corpus of the Russian Language. Retrieved from https://ruscorpora.ru/search?search=CgkyBwgFEgNlbmcwAQ%3D%3D
3. Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., & Mercer, R. L. (1993). The mathematics of statistical machine translation. Computational Linguistics, 19(2), 263-313.
4. Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical Phrase-Based Translation. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 127-133. Retrieved from https://aclanthology.org/N03-1017.pdf
5. Gudkov, V. Yu., & Gudkova, E. F. (2011). N-grams in linguistics. Bulletin of ChelSU, 24.
6. Linguistic Data Consortium. Retrieved from https://catalog.ldc.upenn.edu/byyear
7. Zhdanov, A. E., & Dorosinsky, L. G. (2017). Voice lock. Ural Radio Engineering Journal, 1, 80-90.
8. Torregrosa, D., Pasricha, N., Chakravarthi, B. R., Masoud, M., & Arcan, M. (2019). Leveraging Rule-Based Machine Translation Knowledge for Under-Resourced Neural Machine Translation Models. In Proceedings of MT Summit XVII, Volume 2, Dublin. Retrieved from https://aclanthology.org/W19-6725.pdf
9. Toma, P. (1977). Systran as a multilingual machine translation system. In Overcoming the language barrier, 3-6 May 1977, Vol. 1. Retrieved from https://www.mt-archive.net/70/CEC-1977-Toma.pdf
10. Salton, G., Ross, R., & Kelleher, J. (2014). An empirical study of the impact of idioms on phrase based statistical machine translation of English to Brazilian-Portuguese. In 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra), Gothenburg, Sweden, 36-41.
11. Luong, T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. In 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 1412-1421.
12. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., et al. (2007). Moses: Open source toolkit for statistical machine translation. In 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, 177-180.
13. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311-318.
14. Dyer, C., Chahuneau, V., & Smith, N. A. (2013). A simple, fast, and effective reparameterization of IBM Model 2. In 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, 644-649.