🤖 AI Summary
Existing Learned Sparse Retrieval (LSR) methods struggle to generalize beyond English, limiting multilingual retrieval effectiveness and interpretability. To address this, we propose MILCO: a multilingual sparse retriever that maps non-English queries and documents into a shared English lexical space via a multilingual connector, enabling efficient cross-lingual sparse retrieval. We introduce the LexEcho head, which uses a special [ECHO] token to augment the English representation with a source-language view, improving robustness to uncommon entities and mitigating semantic collapse. MILCO is trained in two stages, Sparse Alignment Pretraining followed by contrastive learning, and supports dynamic efficiency via post-hoc pruning. On standard multilingual benchmarks, MILCO 560M achieves state-of-the-art performance: with document representations pruned to only 30 active dimensions on average, it outperforms the similarly-sized Qwen3-Embed 0.6B with 1024 dimensions.
📝 Abstract
Learned Sparse Retrieval (LSR) combines the efficiency of bi-encoders with the transparency of lexical matching, but existing approaches struggle to scale beyond English. We introduce MILCO, an LSR architecture that maps queries and documents from different languages into a shared English lexical space via a multilingual connector. MILCO is trained with a specialized two-stage regime that combines Sparse Alignment Pretraining with contrastive training to provide representation transparency and effectiveness while mitigating semantic collapse. Motivated by the observation that uncommon entities are often lost when projected into English, we propose a new LexEcho head, which enhances robustness by augmenting the English lexical representation with a source-language view obtained through a special [ECHO] token. MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic efficiency through post-hoc pruning. Notably, when using mass-based pruning to reduce document representations to only 30 active dimensions on average, MILCO 560M outperforms the similarly-sized Qwen3-Embed 0.6B with 1024 dimensions.
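To make the mass-based pruning idea concrete, here is a minimal sketch of one plausible reading: keep the highest-weight vocabulary dimensions of a sparse lexical vector until they account for a target fraction of its total weight, and zero out the rest. The `mass_prune` function name, the 0.9 threshold, and the exact pruning rule are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def mass_prune(weights, mass=0.9):
    """Keep the top-weighted dimensions of a sparse lexical vector until
    they cover `mass` of the total weight; zero the rest.
    Illustrative sketch only -- the paper's exact rule is an assumption."""
    w = np.asarray(weights, dtype=float)
    total = w.sum()
    if total == 0:
        return w
    order = np.argsort(w)[::-1]            # dimensions sorted by weight, descending
    cum = np.cumsum(w[order])              # running weight of the top-k prefix
    k = int(np.searchsorted(cum, mass * total)) + 1  # smallest prefix covering the mass
    pruned = np.zeros_like(w)
    keep = order[:k]
    pruned[keep] = w[keep]
    return pruned

# Toy sparse representation over a 10-term vocabulary
vec = np.array([0.0, 4.0, 0.5, 0.0, 3.0, 0.1, 0.0, 2.0, 0.0, 0.4])
pruned = mass_prune(vec, mass=0.9)
print(np.count_nonzero(vec), "->", np.count_nonzero(pruned), "active dimensions")
# → 6 -> 3 active dimensions
```

Because the threshold adapts to each document's weight distribution, heavy-tailed representations shrink to a handful of dimensions while flatter ones retain more, which is what makes the efficiency "dynamic" at query time.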