MAchine Translation for Open Science

Here you can find the publications produced in the context of the project, which can also be found on the HAL website.

Paul Lerner, François Yvon (2024). Vers la traduction automatique des néologismes scientifiques. Proceedings of the 31st Conférence sur le Traitement Automatique des Langues Naturelles, pages 245-261, Toulouse, France, ATALA.

Scientific research continually discovers and invents new concepts, which are then referred to by new terms, neologisms, or neonyms in this context. As the vast majority of publications are written in English, disseminating this new knowledge in French often requires translating these terms, to avoid multiplying anglicisms that are less easily understood by the general public. We propose to explore this task using two thesauri, exploiting the definition of the term to translate it more accurately. To this end, we explore the capabilities of two large multilingual models, BLOOM and CroissantLLM, which can translate scientific terms to some extent. In particular, we show that they often use appropriate morphological procedures, but are limited by the segmentation into sub-lexical units. They are also biased by the frequency of term occurrences and surface similarities between English and French.

Ziqian Peng, Rachel Bawden and François Yvon (2024). Document Level Machine Translation: does length matter? Proceedings of the 31st Conférence sur le Traitement Automatique des Langues Naturelles, pages 2-21, Toulouse, France, ATALA.

Today’s machine translation architectures can process long segments and go beyond the translation of isolated sentences, opening up the possibility of translating full documents. To achieve this goal, it is necessary to overcome several difficulties related to the length of source documents. In this work, we discuss document-level machine translation from an evaluation perspective, trying to answer a simple question: how can we measure whether translation performance degrades with document length? Our analysis, which compares encoder-decoder systems and a large language model using multiple metrics on a scientific document translation task, suggests that translating long documents holistically remains a challenging problem.

Rachel Bawden, Hatim Bourfoune, Bertrand Cabot, Nathan Cassereau, Pierre Cornette, Marco Naguib, Aurélie Névéol and François Yvon. Les modèles Bloom pour le traitement automatique de la langue française. 2024. Technical report.

The development of very large language models, capable of performing a wide range of automatic language processing tasks, also requires developing the infrastructure needed to evaluate these models, ideally covering as many tasks as possible. Numerous benchmarks have already been compiled for the English language, making it possible to evaluate these large models from multiple angles. Several multilingual test sets are also available, with much more limited coverage, and are used to measure the ability of these models to handle multiple languages. In this paper, we present our efforts to assemble a multi-task evaluation set for French, which is then used to evaluate models from the BLOOM family. Our results confirm and complement the main evaluation results for BLOOM in English; they allow us to conclude that the performance obtained in French and English is very similar, and even better when the prompts used at inference are written in the same language as the texts to analyze.

Rachel Bawden, Ziqian Peng, Maud Bénard, Eric Villemonte de La Clergerie, Raphaël Esamotunu, Mathilde Huguin, Natalie Kübler, Alexandra Mestivier, Mona Michelot, Laurent Romary, Lichao Zhu and François Yvon (2024). Translate your Own: a Post-Editing Experiment in the NLP domain. In Proceedings of the 25th Annual Conference of the European Association for Machine Translation, pages 431–443, Sheffield, UK, European Association for Machine Translation.

The improvements in neural machine translation make translation and post-editing pipelines ever more effective for a wider range of applications. In this paper, we evaluate the effectiveness of such a pipeline for the translation of scientific documents (limited here to article abstracts). Using a dedicated interface, we collect, then analyse the post-edits of approximately 350 abstracts (English→French) in the Natural Language Processing domain for two groups of post-editors: domain experts (academics encouraged to post-edit their own articles) on the one hand and trained translators on the other. Our results confirm that such pipelines can be effective, at least for high-resource language pairs. They also highlight the difference in the post-editing strategy of the two subgroups. Finally, they suggest that working on term translation is the most pressing issue to improve fully automatic translations, but that in a post-editing setup, other error types can be equally annoying for post-editors.

Sadaf Abdul Rauf, François Yvon (2024). Translating scientific abstracts in the bio-medical domain with structure-aware models. Computer Speech & Language, vol. 87.

Machine Translation (MT) technologies have improved in many ways and generate usable outputs for a growing number of domains and language pairs. Yet, most sentence-based MT systems struggle with contextual dependencies, processing small chunks of texts, typically sentences, in isolation from their textual context. This is likely to cause systematic errors or inconsistencies when processing long documents. While various attempts are made to handle extended contexts in translation, the relevance of these contextual cues, especially those related to the structural organization, and the extent to which they affect translation quality remain an underexplored area. In this work, we explore ways to take these structural aspects into account, by integrating document structure as an extra conditioning context. Our experiments on biomedical abstracts, which are usually structured in a rigid way, suggest that this type of structural information can be useful for MT and document structure prediction. We also present in detail the impact of structural information on MT output and assess the degree to which structural information can be learned from the data.

Ziqian Peng (2023). Document-level Machine Translation for scientific texts. Mémoire de Master, Université Paris-Saclay.

While neural machine translation has seen significant progress in recent years at the sentence level, translating full documents remains a challenge because of the difficulty of efficiently incorporating document-level context. Various approaches have been proposed, but most of them consider only one to three previous source and/or target sentences as the context. This is not sufficient to faithfully translate some linguistic phenomena, like lexical consistency and document coherence, especially in some scientific texts. In this work, we conducted experiments to include the full document context and investigate the impact of all the past / future sentences on the source side with a context ablation study, on some abstracts from scientific publications. Our results show that future context is more influential than the past source context, and in our experiments, the Transformer architecture performs much better when translating the beginning of a long document than the end.

Maud Bénard, Alexandra Mestivier, Natalie Kübler, Lichao Zhu, Rachel Bawden, Éric De La Clergerie, Laurent Romary, Mathilde Huguin, Jean-François Nominé, Ziqian Peng, François Yvon (2023). MaTOS Traduction automatique pour la science ouverte. Actes de l'Atelier sur l'Analyse et la Recherche de Textes Scientifiques, CORIA-TALN 2023. 5 juin 2023 Paris (France).

This contribution presents the MaTOS (Machine Translation for Open Science) project, which aims to develop new methods for the complete machine translation (MT) of scientific documents between English and French, as well as automatic metrics to evaluate the translation quality. To this end, MaTOS is interested in (a) the collection of open resources for specialised MT; (b) the description of textual coherence markers for scientific articles; (c) the development of new multilingual processing methods for documents; and (d) metrics to measure progress in document-level machine translation.