TransAlign: Machine Translation Encoders are Strong Word Aligners, Too

📅 2025-10-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
In cross-lingual transfer (XLT), token classification for low-resource languages often relies on translation-based strategies (e.g., translate-test/train), where label projection—mapping source-language word-level labels to corresponding tokens in the translation—constitutes a critical bottleneck. Existing approaches typically employ multilingual encoders (e.g., mBERT, LaBSE) for word alignment (WA), but achieve suboptimal accuracy. Although machine translation (MT) and WA are intrinsically related, prior work exploits only decoder-side cross-attention in MT models, overlooking the strong alignment signals encoded in the MT encoder. This paper proposes TransAlign, the first method to systematically leverage the encoder of large-scale multilingual MT models for zero-shot, high-precision word alignment—without additional training. Experiments demonstrate that TransAlign significantly outperforms state-of-the-art alignment and non-alignment baselines on multilingual WA benchmarks and substantially improves XLT performance for token classification in low-resource settings.

📝 Abstract
In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test -- evaluating on noisy source language data translated from the target language -- and translate-train -- training on noisy target language data translated from the source language -- have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.
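The abstract describes extracting word alignments from the contextual representations of an MT model's encoder rather than from decoder cross-attention. The paper does not spell out its exact alignment procedure here, but a common similarity-based approach (in the spirit of encoder-based aligners like SimAlign) keeps token pairs that are mutual nearest neighbours under cosine similarity. The sketch below illustrates that idea on toy embeddings; the embedding values and function name are illustrative assumptions, not the TransAlign implementation.

```python
import numpy as np

def align_mutual_argmax(src_emb: np.ndarray, tgt_emb: np.ndarray):
    """Similarity-based word alignment: keep (i, j) pairs that are
    mutual nearest neighbours under cosine similarity.

    src_emb: (S, d) contextual embeddings of the source tokens
    tgt_emb: (T, d) contextual embeddings of the target tokens
    """
    # L2-normalise rows so the dot product equals cosine similarity
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # (S, T) similarity matrix

    fwd = sim.argmax(axis=1)  # best target token for each source token
    bwd = sim.argmax(axis=0)  # best source token for each target token
    # keep only mutual-argmax pairs (the intersection heuristic)
    return [(i, int(fwd[i])) for i in range(sim.shape[0]) if bwd[fwd[i]] == i]

# Toy embeddings: source token 0 matches target token 1, and vice versa.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.1, 1.0], [1.0, 0.1]])
print(align_mutual_argmax(src, tgt))  # [(0, 1), (1, 0)]
```

In this framing, the claim of the paper is that embeddings taken from a massively multilingual MT encoder yield more accurate alignments under such similarity heuristics than embeddings from mBERT or LaBSE, with no additional training.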
Problem

Research questions and friction points this paper is trying to address.

Improving word alignment for cross-lingual transfer in NLP tasks
Developing translation-based label projection without relying on multilingual encoder aligners (e.g., mBERT, LaBSE)
Enhancing token classification via machine translation encoder alignments
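Label projection, the bottleneck named above, means copying each source token's label to the target token(s) it aligns to. A minimal sketch of this step, assuming BIO-style tags and a list of alignment pairs (the example sentence, translation, and function name are illustrative, not taken from the paper):

```python
def project_labels(src_labels, alignments, tgt_len, default="O"):
    """Label projection: copy each source token's label to its aligned
    target token(s); unaligned target tokens receive the default label."""
    tgt_labels = [default] * tgt_len
    for src_i, tgt_j in alignments:
        tgt_labels[tgt_j] = src_labels[src_i]
    return tgt_labels

# "John lives in Berlin" -> hypothetical translation with the same word order
src_labels = ["B-PER", "O", "O", "B-LOC"]
alignments = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(project_labels(src_labels, alignments, tgt_len=4))
# ['B-PER', 'O', 'O', 'B-LOC']
```

The quality of `alignments` is exactly what a better word aligner improves: a wrong pair projects an entity label onto the wrong target word, which directly corrupts the translate-train or translate-test data.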
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses the encoder of a massively multilingual MT model for word alignment, without additional training
Outperforms established WA methods (e.g., mBERT- and LaBSE-based aligners) and non-WA label projection baselines
Improves MT-based cross-lingual transfer for token classification