LANGALIGN: Enhancing Non-English Language Models via Cross-Lingual Embedding Alignment

📅 2025-03-24

📈 Citations: 0

✨ Influential: 0

career value

176K/year

🤖 AI Summary

To address the performance limitations of non-English language models stemming from scarce high-quality labeled data, this paper proposes LANGALIGN—a novel method for cross-lingual embedding alignment at the interface between the language model and the task head. Its core innovation lies in the first-ever unsupervised, bidirectional (English ↔ target language) alignment of intermediate-layer representations, requiring no labeled data in the target language. LANGALIGN employs contrastive learning to project multilingual embeddings into a shared semantic space and adopts a lightweight architecture featuring a frozen backbone with a trainable adapter head. Evaluated on retrieval and classification tasks in Korean, Japanese, and Chinese, it achieves an average accuracy improvement of 12.7%. Moreover, it enables zero-shot reverse transfer: non-English inputs are automatically mapped into representation spaces compatible with English-language models, facilitating seamless integration with existing English-centric downstream systems.

Technology Category

Application Category

📝 Abstract

While Large Language Models have gained attention, many service developers still rely on embedding-based models due to practical constraints. In such cases, the quality of fine-tuning data directly impacts performance, and English datasets are often used as seed data for training non-English models. In this study, we propose LANGALIGN, which enhances target language processing by aligning English embedding vectors with those of the target language at the interface between the language model and the task header. Experiments on Korean, Japanese, and Chinese demonstrate that LANGALIGN significantly improves performance across all three languages. Additionally, we show that LANGALIGN can be applied in reverse to convert target language data into a format that an English-based model can process.

Problem

Research questions and friction points this paper is trying to address.

Enhancing non-English language models via cross-lingual embedding alignment

Improving target language performance using English embedding vectors

Enabling English models to process non-English language data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns English and target language embeddings

Improves non-English language model performance

Reversible for English-based model processing

🔎 Similar Papers

No similar papers found.