🤖 AI Summary
Low-resource languages suffer significant performance degradation in cross-lingual large language models due to data scarcity, translation noise, and unstable semantic alignment. To address this, we propose LiRA, a novel framework for robust cross-lingual understanding. Methodologically, LiRA introduces: (1) the Anchored Representation Composition Architecture (Arca), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding; (2) the Language-coupled Semantic Reasoner (LaSR), a language-aware lightweight reasoning head built on Arca's representations; and (3) geometric stability constraints on the shared embedding space coupled with consistency regularization, enhancing generalization under few-shot and high-noise conditions. Evaluated on multiple low-resource cross-lingual benchmarks, LiRA consistently outperforms state-of-the-art methods and remains robust in few-shot and high-noise settings. Additionally, we release the first product retrieval dataset covering seven Asian languages, filling a critical gap in low-resource multilingual evaluation.
📝 Abstract
As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca's multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.
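The two training signals named above, anchor-based alignment to an English semantic space and consistency regularization on the reasoner's outputs, can be sketched as simple losses. This is a minimal illustration under assumptions of ours: the paper does not specify its exact loss forms, so we use cosine distance to English anchor embeddings and a symmetric KL divergence between two prediction views as stand-ins.

```python
import numpy as np

def _softmax(x):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def anchor_alignment_loss(low_res_emb, anchor_emb):
    """Mean cosine distance between low-resource sentence embeddings
    and their paired English anchor embeddings (illustrative form)."""
    ln = low_res_emb / np.linalg.norm(low_res_emb, axis=1, keepdims=True)
    an = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(ln * an, axis=1)))

def consistency_regularizer(logits_a, logits_b):
    """Symmetric KL between the reasoning head's predictions on two views
    of the same input, e.g. source text vs. its translation (illustrative)."""
    pa, pb = _softmax(logits_a), _softmax(logits_b)
    kl_ab = np.sum(pa * (np.log(pa) - np.log(pb)), axis=1)
    kl_ba = np.sum(pb * (np.log(pb) - np.log(pa)), axis=1)
    return float(np.mean(0.5 * (kl_ab + kl_ba)))

# Perfectly aligned pairs and identical predictions incur zero loss;
# the total training objective would combine both terms with weights.
```

Both terms vanish when the low-resource embedding coincides with its anchor and the two prediction views agree, so the combined objective only penalizes geometric drift and cross-view inconsistency.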