MAchine Translation for Open Science

Here you can find the publications produced in the context of the project, which can also be found on the HAL website.

Ziqian Peng (2023). Document-level Machine Translation for scientific texts. Master's thesis, Université Paris-Saclay.

While neural machine translation has made significant progress at the sentence level in recent years, translating full documents remains a challenge because document-level context is difficult to incorporate efficiently. Various approaches have been proposed, but most of them consider only one to three previous source and/or target sentences as context. This is not sufficient to faithfully handle some linguistic phenomena, such as lexical consistency and document coherence, especially in scientific texts. In this work, we conducted experiments that include the full document context and investigated the impact of all past and future source-side sentences through a context ablation study on abstracts from scientific publications. Our results show that future context is more influential than past source context and that, in our experiments, the Transformer architecture translates the beginning of a long document much better than the end.

Maud Bénard, Alexandra Mestivier, Natalie Kübler, Lichao Zhu, Rachel Bawden, Éric De La Clergerie, Laurent Romary, Mathilde Huguin, Jean-François Nominé, Ziqian Peng, François Yvon (2023). MaTOS: Traduction automatique pour la science ouverte. Actes de l'Atelier sur l'Analyse et la Recherche de Textes Scientifiques, CORIA-TALN 2023. 5 June 2023, Paris (France).

This contribution presents the MaTOS (Machine Translation for Open Science) project, which aims to develop new methods for the complete machine translation (MT) of scientific documents between English and French, as well as automatic metrics to evaluate translation quality. To this end, MaTOS focuses on (a) the collection of open resources for specialised MT; (b) the description of textual coherence markers in scientific articles; (c) the development of new multilingual processing methods for documents; and (d) metrics to measure progress in document-level machine translation.