🤖 AI Summary
Low-resource languages suffer significant performance degradation in cross-lingual large language models due to data scarcity, translation noise, and unstable semantic alignment. To address this, we propose LiRA, a novel framework for robust cross-lingual understanding. Methodologically, LiRA introduces: (1) the Anchored Representation Composition Architecture (Arca), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding; (2) the Language-coupled Semantic Reasoner (LaSR), a language-aware lightweight reasoning head built on Arca's representations; and (3) geometric stability constraints on the shared embedding space coupled with consistency regularization, enhancing generalization under few-shot and high-noise conditions. Evaluated on multiple low-resource cross-lingual benchmarks, LiRA consistently outperforms state-of-the-art methods and remains robust in few-shot and high-noise settings. Additionally, we release the first product retrieval dataset covering seven Asian languages, filling a critical gap in low-resource multilingual evaluation.
📝 Abstract
As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca's multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.
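The two training signals named above, anchor-based alignment to an English semantic space and consistency regularization on the reasoner's outputs, can be sketched as simple losses. This is a minimal illustration under assumptions of ours: the paper does not specify its exact loss forms, so we use cosine distance to English anchor embeddings and a symmetric KL divergence between two prediction views as stand-ins.

```python
import numpy as np

def _softmax(x):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def anchor_alignment_loss(low_res_emb, anchor_emb):
    """Mean cosine distance between low-resource sentence embeddings
    and their paired English anchor embeddings (illustrative form)."""
    ln = low_res_emb / np.linalg.norm(low_res_emb, axis=1, keepdims=True)
    an = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(ln * an, axis=1)))

def consistency_regularizer(logits_a, logits_b):
    """Symmetric KL between the reasoning head's predictions on two views
    of the same input, e.g. source text vs. its translation (illustrative)."""
    pa, pb = _softmax(logits_a), _softmax(logits_b)
    kl_ab = np.sum(pa * (np.log(pa) - np.log(pb)), axis=1)
    kl_ba = np.sum(pb * (np.log(pb) - np.log(pa)), axis=1)
    return float(np.mean(0.5 * (kl_ab + kl_ba)))

# Perfectly aligned pairs and identical predictions incur zero loss;
# the total training objective would combine both terms with weights.
```

Both terms vanish when the low-resource embedding coincides with its anchor and the two prediction views agree, so the combined objective only penalizes geometric drift and cross-view inconsistency.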