Granite Embedding Multilingual R2 Models

📅 2026-05-13

🤖 AI Summary
This work addresses the demand for enterprise-grade multilingual dense retrieval with two bi-encoder embedding models based on the ModernBERT architecture. The models cover 200+ languages, with enhanced support for 52 languages and programming code, and accept contexts of up to 32,768 tokens. The 97M-parameter compact model, built via model pruning and vocabulary selection, achieves the highest retrieval score of any open multilingual embedding model under 100M parameters; the 311M-parameter full-size model additionally supports Matryoshka Representation Learning for flexible embedding dimensions. The authors report state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval. Both models are released under the Apache 2.0 license.
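As a rough illustration of the dense retrieval setup summarized above, the sketch below ranks documents by cosine similarity between independently computed query and document vectors. The vectors are hand-written toy values standing in for encoder outputs, and the helper names `cosine` and `rank` are hypothetical; nothing here calls the actual Granite models.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors (assumed values, not real model outputs).
query = [0.9, 0.1, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # mostly unrelated to the query
    [1.0, 0.0, 0.0],  # closest to the query
    [0.5, 0.5, 0.0],  # partially related
]

ranking = rank(query, docs)
for index, score in ranking:
    print(f"doc {index}: {score:.3f}")
```

In a bi-encoder, documents can be embedded once offline and only the query is embedded at search time, which is what makes this scoring step cheap at scale.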
📝 Abstract
We introduce the multilingual Granite Embedding R2 models, a family of encoder-based embedding models for enterprise-scale dense retrieval across 200+ languages. Extending our English-focused R2 release, these models add enhanced support for 52 languages and programming code, a 32,768-token context window (a 64x expansion over R1), and state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. The release consists of two bi-encoder models based on the ModernBERT architecture with an expanded multilingual vocabulary: a 311M-parameter full-size model, and a 97M-parameter compact model built via model pruning and vocabulary selection that achieves the highest retrieval score of any open multilingual embedding model under 100M parameters. The full-size model also supports Matryoshka Representation Learning for flexible embedding dimensionality. Both models are trained on enterprise-appropriate data with governance oversight and are released under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite. The release is designed to support responsible use and enable unrestricted research and enterprise adoption.
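Flexible embedding dimensionality with Matryoshka Representation Learning is commonly used by keeping only the leading components of a full embedding and re-normalizing the result. The sketch below shows that general pattern under stated assumptions: the 4-dimensional vector is a toy value, `truncate_embedding` is a hypothetical helper, and the dimensions the released model actually supports are not given in this listing.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]

# Toy full-size embedding (assumed values, not a real model output).
full = [0.5, 0.5, 0.5, 0.5]

# Shorter vector for cheaper storage and faster similarity search.
small = truncate_embedding(full, 2)
print(len(small), small)
```

Truncated vectors trade some retrieval quality for a smaller index; queries and documents should be truncated to the same dimension before they are compared.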
Problem

Research questions and friction points this paper is trying to address.

multilingual embedding
dense retrieval
cross-lingual search
code retrieval
enterprise-scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual embedding
dense retrieval
long-context modeling
model pruning
Matryoshka Representation Learning