LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models

📅 2025-01-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit weak multilingual embedding capabilities and rely heavily on supervision data from high-resource languages. Method: This paper proposes LUSIFER, a zero-shot multilingual embedding adaptation framework that integrates a multilingual encoder (e.g., XLM-R) with an LLM-based embedding model via a lightweight trainable connector, enabling cross-lingual semantic alignment without any multilingual labeled data. Contributions/Results: (1) it introduces the first language-agnostic embedding-space integration mechanism; (2) it achieves zero-shot transfer of multilingual understanding capability; and (3) it establishes MEBench, the first large-scale multilingual embedding evaluation benchmark, covering 14 languages and 123 datasets. Experiments demonstrate consistent gains over existing state-of-the-art methods across five primary task categories, including retrieval and classification, with particularly significant improvements for medium- and low-resource languages.

📝 Abstract
Recent advancements in large language model (LLM)-based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances multilingual performance across various embedding tasks, particularly for medium- and low-resource languages, without requiring explicit multilingual training data.
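The encoder-connector-LLM pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions (768 for an XLM-R-style encoder, 4096 for an LLM embedding model), the single linear-projection connector, and the mean-pooling step are all assumptions chosen to show the data flow; random arrays stand in for the frozen models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: XLM-R-style hidden size 768, LLM input size 4096.
ENC_DIM, LLM_DIM, SEQ_LEN = 768, 4096, 16

# Stand-in for the frozen multilingual encoder's token representations
# for one input sequence.
encoder_hidden = rng.standard_normal((SEQ_LEN, ENC_DIM))

# Trainable connector: a single linear projection is the simplest version
# of the "minimal set of trainable parameters" bridging the two spaces.
W_connector = rng.standard_normal((ENC_DIM, LLM_DIM)) * 0.02

# Project encoder states into the LLM embedding model's input space.
llm_inputs = encoder_hidden @ W_connector  # shape: (SEQ_LEN, LLM_DIM)

# Stand-in for the frozen LLM embedding model: mean-pool the projected
# states and L2-normalize to obtain a fixed-size sentence embedding.
sentence_emb = llm_inputs.mean(axis=0)
sentence_emb /= np.linalg.norm(sentence_emb)

print(sentence_emb.shape)  # (4096,)
```

In this setup only `W_connector` would receive gradients during training; because the multilingual encoder supplies language-universal representations, the connector can in principle be learned from English data alone, which is what makes the adaptation zero-shot for other languages.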
Problem

Research questions and friction points this paper is trying to address.

Multilingual Text Processing
Large Language Models
Cross-lingual Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Processing
Parameter Bridge
Comprehensive Multilingual Testing