LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models

📅 2025-10-16
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Low-resource languages suffer significant performance degradation in cross-lingual large language models due to data scarcity, translation noise, and unstable semantic alignment. To address this, we propose LiRA, a novel framework for robust cross-lingual understanding. Methodologically, LiRA introduces: (1) the Anchored Representation Composition Architecture (Arca), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding while preserving geometric stability in a shared embedding space; and (2) the Language-coupled Semantic Reasoner (LaSR), a language-aware lightweight reasoning head with consistency regularization that unifies the training objective and strengthens generalization under few-shot and high-noise conditions. Evaluated on multiple low-resource cross-lingual benchmarks, LiRA consistently outperforms state-of-the-art methods and remains robust in few-shot and high-noise settings. We additionally release a multilingual product retrieval dataset covering seven Asian languages, filling a critical gap in low-resource multilingual evaluation.
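Neither the summary nor the abstract spells out the alignment objective. Purely as an illustration, here is a minimal PyTorch sketch of what anchoring low-resource sentence embeddings to English anchors could look like, assuming paired sentences and an InfoNCE-style contrastive loss (the function name, loss choice, and temperature are assumptions, not the paper's method):

```python
import torch
import torch.nn.functional as F

def anchor_alignment_loss(low_res_emb: torch.Tensor,
                          anchor_emb: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style alignment: pull each low-resource sentence embedding
    toward its paired English anchor, push it away from the other anchors
    in the batch. Both inputs have shape (batch, dim)."""
    z = F.normalize(low_res_emb, dim=-1)
    a = F.normalize(anchor_emb, dim=-1)
    logits = z @ a.t() / temperature                    # (batch, batch) cosine sims
    targets = torch.arange(z.size(0), device=z.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 8 low-resource sentences paired with their English anchors.
loss = anchor_alignment_loss(torch.randn(8, 768), torch.randn(8, 768))
```

Under this reading, the diagonal of the similarity matrix holds the true low-resource/English pairs, so minimizing the cross-entropy pulls each sentence toward its own anchor while keeping the shared space discriminative.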

📝 Abstract
As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca's multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.
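The abstract's "geometric stability constraints" and "consistency regularization" are stated only at a high level. One plausible concretization, again as a hedged PyTorch sketch: a symmetric-KL consistency term between two views of an input, and a Gram-matrix term that penalizes drift in the pairwise structure of the shared space. The specific terms and the weights lam_c/lam_g are assumptions, not the paper's actual objective:

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between predictions on two views of the same input
    (e.g., the original sentence and a translated or noised version)."""
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    return 0.5 * (F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
                  + F.kl_div(log_p, log_q, log_target=True, reduction="batchmean"))

def geometric_stability_loss(emb_ref: torch.Tensor, emb_cur: torch.Tensor) -> torch.Tensor:
    """Penalize drift in the pairwise cosine structure of the shared
    embedding space relative to a reference (e.g., pre-update) snapshot."""
    def gram(e: torch.Tensor) -> torch.Tensor:
        e = F.normalize(e, dim=-1)
        return e @ e.t()
    return F.mse_loss(gram(emb_cur), gram(emb_ref))

def total_loss(task, align, cons, geo, lam_c=1.0, lam_g=0.1):
    """One possible unified objective; the weighting scheme is hypothetical."""
    return task + align + lam_c * cons + lam_g * geo

# Toy usage with random stand-ins for model outputs:
cons = consistency_loss(torch.randn(4, 3), torch.randn(4, 3))
geo = geometric_stability_loss(torch.randn(4, 768), torch.randn(4, 768))
```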
Problem

Research questions and friction points this paper is trying to address.

Improving cross-lingual LLM performance for low-resource languages
Addressing limited training data and unstable cross-lingual alignment
Enhancing multilingual retrieval and reasoning robustness simultaneously
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anchors low-resource languages to English semantic space
Uses anchor-based alignment and multi-agent collaborative encoding
Adds language-aware reasoning head with consistency regularization (sketched below)
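To give a rough shape to the "language-aware reasoning head" in the last bullet, a hypothetical PyTorch sketch: a small MLP conditioned on a learned language code, applied on top of fixed multilingual sentence embeddings. The conditioning-by-concatenation scheme and all sizes are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class LanguageAwareHead(nn.Module):
    """Lightweight reasoning head conditioned on a learned language
    embedding; sizes and names are illustrative only."""

    def __init__(self, dim: int = 768, n_langs: int = 7,
                 n_classes: int = 3, lang_dim: int = 32):
        super().__init__()
        self.lang_emb = nn.Embedding(n_langs, lang_dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim + lang_dim, dim),
            nn.GELU(),
            nn.Linear(dim, n_classes),
        )

    def forward(self, sent_emb: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # Concatenate the sentence representation with its language code.
        lang = self.lang_emb(lang_id)                        # (batch, lang_dim)
        return self.mlp(torch.cat([sent_emb, lang], dim=-1))

# Toy usage: a batch of 4 sentences, all from language id 2.
head = LanguageAwareHead()
logits = head(torch.randn(4, 768), torch.full((4,), 2, dtype=torch.long))
```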
Haolin Li
Department of Automation, Tsinghua University
Haipeng Zhang
Alibaba Group
Mang Li
Alibaba Group
Yaohua Wang
National University of Defense Technology, Computer Architecture
Lijie Wen
School of Software, Tsinghua University
Yu Zhang
Alibaba Group
Biqing Huang
Tsinghua University