Granite Embedding Multilingual R2 Models

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses the demand for enterprise-grade multilingual dense retrieval with two bi-encoder embedding models based on the ModernBERT architecture. The models cover 200+ languages, add enhanced support for 52 languages and programming code, and accept inputs of up to 32,768 tokens. The 97M-parameter compact model, built via model pruning and vocabulary selection, achieves the highest retrieval score of any open multilingual embedding model under 100M parameters; the 311M-parameter full-size model additionally supports Matryoshka Representation Learning for flexible embedding dimensions. The authors report state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. Both models are released under the Apache 2.0 license.

📝 Abstract

We introduce the multilingual Granite Embedding R2 models, a family of encoder-based embedding models for enterprise-scale dense retrieval across 200+ languages. Extending our English-focused R2 release, these models add enhanced support for 52 languages and programming code, a 32,768-token context window (a 64x expansion over R1), and state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. The release consists of two bi-encoder models based on the ModernBERT architecture with an expanded multilingual vocabulary: a 311M-parameter full-size model and a 97M-parameter compact model, built via model pruning and vocabulary selection, that achieves the highest retrieval score of any open multilingual embedding model under 100M parameters. The full-size model also supports Matryoshka Representation Learning for flexible embedding dimensionality. Both models are trained on enterprise-appropriate data with governance oversight and released under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, to support responsible use and enable unrestricted research and enterprise adoption.
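
To make the "flexible embedding dimensionality" point concrete, here is a minimal sketch of how Matryoshka-style embeddings are commonly used at retrieval time: keep the leading dimensions, re-normalize, and rank by cosine similarity. The abstract does not describe the authors' exact procedure, so the truncate-then-renormalize step is an assumption based on common practice, the function names are hypothetical, and the vectors are toy values rather than outputs of the Granite models.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and re-normalize.

    Assumption: the leading dimensions carry the coarse representation,
    as is typical for Matryoshka-trained embeddings.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(vec[:dim])


def rank_documents(query_vec, doc_vecs, dim):
    """Return (doc_index, cosine_score) pairs, best match first."""
    query = truncate_embedding(query_vec, dim)
    scored = []
    for index, doc_vec in enumerate(doc_vecs):
        doc = truncate_embedding(doc_vec, dim)
        # Both vectors are unit length, so the dot product is the cosine.
        score = sum(a * b for a, b in zip(query, doc))
        scored.append((index, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 4-dimensional vectors standing in for real model embeddings.
query_embedding = [0.9, 0.1, 0.3, 0.2]
document_embeddings = [
    [0.8, 0.2, 0.1, 0.4],
    [0.1, 0.9, 0.5, 0.0],
    [-0.7, 0.1, 0.2, 0.3],
]

full_ranking = rank_documents(query_embedding, document_embeddings, dim=4)
short_ranking = rank_documents(query_embedding, document_embeddings, dim=2)
```

In a real deployment the trade-off is storage and search cost against retrieval quality: shorter vectors are cheaper to index, and how much ranking quality they retain has to be measured on the target data.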

Problem

Research questions and friction points this paper is trying to address.

multilingual embedding

dense retrieval

cross-lingual search

code retrieval

enterprise-scale

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual embedding

dense retrieval

long-context modeling