Translation alignment

From The Digital Classicist Wiki
Revision as of 15:06, 25 May 2023 by ChiaraPalladino (talk | contribs)


Translation Alignment (TA) is a task derived from Natural Language Processing. Fundamentally, it consists of establishing correspondences between texts in different languages, in order to determine which parts of a source text correspond to which parts of a target text. When performed across two languages, it is defined as bilingual alignment; when performed across more than two languages, it is defined as multilingual alignment.

The task of TA can be performed at various levels of granularity: from entire books to single chapters or sections, down to sentences and individual words. A set of texts aligned at some level is called a parallel text or parallel corpus. The output of a TA pipeline is a list of paired items (words, sentences, or larger chunks), which are often called Translation Pairs (TPs).
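As a minimal illustration of what such output looks like, the sketch below represents one sentence-level pair and its refinement to word-level Translation Pairs (the example text is the opening of the Iliad; the data structures and names are invented for illustration, not taken from any particular tool):

```python
# One sentence-level Translation Pair: (source sentence, target sentence).
sentence_pairs = [
    ("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος",
     "Sing, goddess, the wrath of Achilles son of Peleus"),
]

# The same pair refined to word-level granularity. Note that alignments
# need not be one-to-one: a single Greek word can map to several English
# words, which is one reason word-level TA is harder than sentence-level.
word_pairs = [
    ("μῆνιν", "the wrath"),
    ("ἄειδε", "sing"),
    ("θεὰ", "goddess"),
    ("Πηληϊάδεω", "son of Peleus"),
    ("Ἀχιλῆος", "of Achilles"),
]

print(len(sentence_pairs), len(word_pairs))  # 1 5
```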

Parallel corpora in modern and ancient languages are used for a variety of purposes: training machine translation models, automatic bilingual lexicon extraction, corpus-linguistic analysis, translation history research, language learning, and cross-lingual annotation projection.


TA can be performed automatically, semi-automatically, or entirely manually through annotation. Several computational methods exist for performing TA. The earliest, such as the IBM models developed by Brown et al. (1993), date from the 1990s and were based on statistical lexical models. Later, Och and Ney (2003) introduced GIZA++, which was considered the state of the art in the field until the advent of transformer-based and neural models.
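To make the statistical approach concrete, the sketch below implements the expectation-maximization loop of IBM Model 1, the simplest of the IBM models, on a tiny invented English-German corpus (the corpus and variable names are illustrative only):

```python
from collections import defaultdict

# Toy parallel corpus of (source, target) sentence pairs (invented data).
corpus = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a book", "ein buch"),
]
pairs = [(e.split(), f.split()) for e, f in corpus]

# Initialize the lexical translation probabilities t(f|e) uniformly.
f_vocab = {f for _, fs in pairs for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):  # EM iterations
    count = defaultdict(float)  # expected counts c(f, e)
    total = defaultdict(float)  # expected counts c(e)
    # E-step: distribute each target word's probability mass over
    # the source words it could align to.
    for es, fs in pairs:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                frac = t[(f, e)] / norm
                count[(f, e)] += frac
                total[e] += frac
    # M-step: re-estimate t(f|e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

# After a few iterations the model learns from co-occurrence alone
# that "das" is the most likely translation of "the".
best = max(f_vocab, key=lambda f: t[(f, "the")])
print(best)  # das
```

Statistical aligners like GIZA++ extend this idea with more sophisticated models (distortion, fertility), but the core principle of learning lexical translation probabilities from co-occurrence is the same.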

Transformer-based models for TA exploit multilingual contextualized language models to create accurate alignments, using various types of data for fine-tuning. The most recent automatic model for TA in ancient languages was developed by Yousef et al. (2022) for Ancient Greek and its modern translations; it is based on two multilingual contextualized language models, mBERT and XLM-R.
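A common way such models extract word alignments is to compare the contextual embeddings of source and target tokens and keep the pairs that select each other (mutual argmax). The sketch below shows only that extraction step, with invented three-dimensional vectors standing in for real mBERT or XLM-R embeddings; it is a simplified illustration of the general technique, not Yousef et al.'s pipeline:

```python
import math

# Invented "contextual embeddings" for a Greek source line and an
# English target line (real systems would obtain these from mBERT
# or XLM-R; the vectors here are made up for illustration).
src = {"μῆνιν": [0.9, 0.1, 0.0], "ἄειδε": [0.1, 0.8, 0.1], "θεά": [0.0, 0.2, 0.9]}
tgt = {"goddess": [0.0, 0.1, 0.95], "sing": [0.2, 0.9, 0.0], "wrath": [0.85, 0.0, 0.1]}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.dist(u, [0.0] * len(u)) * math.dist(v, [0.0] * len(v)))

src_toks, tgt_toks = list(src), list(tgt)
# Similarity matrix: one score for every source/target token pair.
sim = [[cosine(src[s], tgt[t]) for t in tgt_toks] for s in src_toks]

def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

# Keep (i, j) only when token i's best match is j AND token j's best
# match is i (mutual argmax), a common alignment-extraction heuristic.
aligned = [(src_toks[i], tgt_toks[j])
           for i in range(len(src_toks))
           for j in range(len(tgt_toks))
           if argmax(sim[i]) == j
           and argmax([sim[k][j] for k in range(len(src_toks))]) == i]

print(aligned)  # [('μῆνιν', 'wrath'), ('ἄειδε', 'sing'), ('θεά', 'goddess')]
```

The quality of the result depends entirely on the embeddings; fine-tuning the underlying multilingual model on aligned data is what makes this approach accurate for low-resource ancient languages.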


Alignment Guidelines and Gold Standards