🤖 AI Summary
Current multilingual information retrieval evaluation mainly rewards language-agnostic semantic relevance, overlooking users’ preference for results in the query language and thus failing to reflect real-world utility. This work addresses the limitation by disentangling semantic relevance from query-language preference and introducing MLAIRE, a language-aware evaluation protocol. The protocol constructs controlled test pools from parallel passages across languages, defines new metrics, including Language Preference Rate (LPR) and Lang-nDCG, and adds a 4-way decomposition that separates semantic failures from query-language preference failures. Experiments on 31 dense, sparse, and late-interaction retrievers show that conventional metrics obscure distinct behaviors, such as “semantically accurate but linguistically mismatched” versus “language-matched but semantically weak” results, which the proposed protocol exposes as fine-grained differences among models.
📝 Abstract
Multilingual Information Retrieval is increasingly important in real-world search settings, where users issue queries over mixed-language corpora. Existing evaluations mainly reward language-agnostic semantic relevance, treating relevant passages equally regardless of language. Yet retrieval utility also depends on the language of the retrieved passages: users may prefer results they can read and verify in the query language, and query-passage language mismatch can complicate downstream grounding and answer verification in Retrieval-Augmented Generation systems. To evaluate this language-aware dimension, we introduce MLAIRE, a Multilingual Language-Aware Information Retrieval Evaluation protocol that disentangles cross-lingual semantic retrieval from query-language preference. MLAIRE constructs controlled pools with parallel passages across languages, enabling measurement of semantic retrieval accuracy and query-language preference when equivalent translations are available. We propose language-aware metrics, including Language Preference Rate (LPR) and Lang-nDCG, together with a 4-way decomposition separating semantic and query-language preference failures. Evaluating 31 dense, sparse, and late-interaction retrievers, we show that standard metrics obscure distinct behaviors: semantically strong retrievers may return correct content in a non-query language, while retrievers with stronger query-language preference may retrieve less semantically relevant passages.
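To make the idea of separating semantic failures from query-language preference failures concrete, here is a minimal Python sketch. It is not the paper's implementation. The abstract does not give formal definitions, so everything here is an assumption for illustration: the `Hit` record, the top-1 evaluation, the four bucket names, and in particular the reading of LPR as "among queries whose top hit is semantically correct, the fraction returned in the query language".

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A retrieved passage (hypothetical record, not from the paper)."""
    doc_id: str  # shared by all parallel translations of one passage
    lang: str    # language code of this particular translation


def classify_top1(query_lang: str, gold_doc_id: str, top_hit: Hit) -> str:
    """Assign the top-ranked hit to one of four assumed buckets."""
    semantic_ok = top_hit.doc_id == gold_doc_id
    lang_ok = top_hit.lang == query_lang
    if semantic_ok and lang_ok:
        return "correct_content_query_lang"
    if semantic_ok:
        return "correct_content_other_lang"
    if lang_ok:
        return "wrong_content_query_lang"
    return "wrong_content_other_lang"


def decompose(results):
    """Count buckets over (query_lang, gold_doc_id, top_hit) triples."""
    return Counter(classify_top1(q, g, h) for q, g, h in results)


def assumed_lpr(results) -> float:
    """Assumed LPR: share of semantically correct top hits in the query language."""
    counts = decompose(results)
    correct = counts["correct_content_query_lang"] + counts["correct_content_other_lang"]
    if correct == 0:
        return 0.0  # no semantically correct hit to condition on
    return counts["correct_content_query_lang"] / correct


# Toy data: four queries, one per bucket.
results = [
    ("en", "d1", Hit("d1", "en")),  # right passage, query language
    ("en", "d2", Hit("d2", "de")),  # right passage, other language
    ("de", "d3", Hit("d9", "de")),  # wrong passage, query language
    ("de", "d4", Hit("d7", "fr")),  # wrong passage, other language
]
buckets = decompose(results)
lpr = assumed_lpr(results)
```

In this toy data a plain top-1 accuracy would count the first two queries as equally correct, while the decomposition distinguishes the retriever that answered in the query language from the one that returned a translation.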