SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges

📅 2026-05-25
🤖 AI Summary
English-centric sparse encoders generalize poorly to non-English languages, which limits high-precision multilingual retrieval. To address this, the authors propose SemBridge, an embedding initialization method that uses multilingual dense embeddings as a semantic bridge to identify a small set of semantically related source-language tokens for each target-language token. Each target token's representation is then initialized as a linear combination of these selected source tokens, which filters out semantic noise while preserving core synonymous information. Evaluated across five languages and four sparse architectures, SemBridge achieves superior zero-shot retrieval performance and consistently improves on existing baselines after fine-tuning.
📝 Abstract
Sparse encoders offer high-precision retrieval by representing term importance within a vocabulary space, yet their English-centric structures pose a critical impediment to language transfer for non-English languages. To overcome this structural limitation, we propose SemBridge, a novel embedding initialization method designed for cross-lingual adaptation in sparse encoders by leveraging multilingual bridge models. SemBridge establishes semantic alignments between source and target vocabularies using multilingual dense embeddings as a bridge. Rather than directly relying on all source tokens, SemBridge selects a small set of semantically related source-language tokens and uses them to initialize each target-language token, effectively filtering out semantic noise and reconstructing target tokens as precise linear combinations of core synonyms. This accelerates convergence during fine-tuning and improves training efficiency. Extensive experiments across five languages and four sparse architectures demonstrate that SemBridge achieves superior zero-shot retrieval performance and consistently improves retrieval performance after fine-tuning compared to existing baselines. These results validate SemBridge as a practical solution for deploying high-performance sparse retrieval systems in diverse linguistic environments.
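The initialization step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `init_target_embeddings`, the top-`k` selection by cosine similarity, the softmax weighting, and the `temperature` parameter are all assumptions made for illustration; the abstract only states that each target token is built as a linear combination of a small set of semantically related source tokens found through multilingual dense embeddings.

```python
import numpy as np

def init_target_embeddings(src_bridge, tgt_bridge, src_encoder_emb, k=3, temperature=0.1):
    """Hypothetical sketch: initialize target-token embeddings from source tokens.

    src_bridge:      (S, d) bridge-model embeddings of source-vocabulary tokens
    tgt_bridge:      (T, d) bridge-model embeddings of target-vocabulary tokens
    src_encoder_emb: (S, D) sparse-encoder embeddings of the source tokens
    k, temperature:  assumed hyperparameters, not taken from the paper
    """
    # Cosine similarity between every target token and every source token.
    src_n = src_bridge / np.linalg.norm(src_bridge, axis=1, keepdims=True)
    tgt_n = tgt_bridge / np.linalg.norm(tgt_bridge, axis=1, keepdims=True)
    sim = tgt_n @ src_n.T  # shape (T, S)

    # Keep only the k most similar source tokens per target token.
    topk = np.argpartition(-sim, k - 1, axis=1)[:, :k]   # shape (T, k)
    topk_sim = np.take_along_axis(sim, topk, axis=1)     # shape (T, k)

    # Assumed weighting scheme: softmax over the selected similarities.
    logits = topk_sim / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)

    # Each target embedding is a linear combination of its k source embeddings.
    init = np.einsum("tk,tkd->td", weights, src_encoder_emb[topk])
    return init, topk, weights

# Toy usage with random data standing in for real model embeddings.
rng = np.random.default_rng(0)
src_bridge = rng.normal(size=(6, 4))
tgt_bridge = rng.normal(size=(2, 4))
src_encoder_emb = rng.normal(size=(6, 5))
init, topk, weights = init_target_embeddings(src_bridge, tgt_bridge, src_encoder_emb, k=3)
print(init.shape)  # one initialized vector per target token
```

Restricting the combination to `k` neighbors, rather than weighting the full source vocabulary, is what the abstract refers to as filtering out semantic noise; the specific value of `k` and the weighting function here are placeholders.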
Problem

Research questions and friction points this paper is trying to address.

sparse encoders
language transfer
multilingual
semantic alignment
cross-lingual adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse encoders
language transfer
multilingual semantic bridges
embedding initialization
cross-lingual retrieval