🤖 AI Summary
Existing pre-trained embedding models have fixed architectures and embedding dimensions, which makes it hard to meet the varied resource budgets and accuracy requirements of industrial retrieval. To overcome this limitation, the authors propose m3BERT, which learns multi-granularity Matryoshka embedding representations through a three-stage pre-training strategy (monolingual pre-training, multilingual adaptation, and domain-specific continual pre-training), so that a single model supports flexible dimension-wise truncation and efficient deployment. The design keeps downstream usage consistent with pre-training while improving both efficiency and effectiveness. In experiments, m3BERT substantially outperforms state-of-the-art models on the industrial-scale Bing-Click dataset and generalizes well across multiple public benchmarks.
📝 Abstract
Embedding models are pivotal in industrial information retrieval systems such as search and advertising. However, existing pretrained models typically have fixed architectures and embedding dimensionalities, which poses significant challenges when adapting them to diverse deployment scenarios with varying business-driven constraints. A common practice for resource-constrained tasks is to fine-tune a smaller model whose parameters are partially initialized from a larger pretrained model. This method is often suboptimal, because the misalignment between pretraining and downstream usage prevents the benefits of pretraining from being fully realized. To address this limitation, we introduce m3BERT: a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, which features a novel pretraining strategy that jointly optimizes representations across both transformer layers and multiple embedding dimensions. This enables a single model to be tailored to varied resource and accuracy targets while maintaining consistency with pretraining. Incorporating recent architectural improvements, m3BERT is pretrained in three stages: monolingual pretraining, multilingual adaptation to serve diverse user bases, and continual pretraining on a massive web-domain corpus, which is crucial for utility in commercial retrieval. m3BERT significantly outperforms state-of-the-art embedding models on Bing-Click, a large-scale industrial retrieval dataset, showcasing its practical versatility as an efficient foundation for resource-aware industrial retrieval systems. Further experiments on public datasets also confirm the general effectiveness of our multigranular Matryoshka pretraining strategy.
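To make the idea of tailoring one model to different budgets concrete, the following is a minimal sketch of how a Matryoshka-style embedding might be consumed at serving time: pick the output of a given layer, keep only a prefix of its coordinates, and renormalize before scoring. The helper names (`truncate_and_normalize`, `select_granularity`, `dot`) and the toy vectors are hypothetical and chosen for illustration; the abstract does not describe m3BERT's actual interface, pooling, or training loss.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates of `vec` and rescale to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)  # an all-zero prefix cannot be normalized
    return [x / norm for x in prefix]


def select_granularity(layer_outputs, layer, dim):
    """Pick one (layer, dim) granularity.

    `layer_outputs` is assumed to hold one pooled embedding per transformer
    layer; `layer` is a 1-indexed depth. Both are illustrative assumptions.
    """
    return truncate_and_normalize(layer_outputs[layer - 1], dim)


def dot(a, b):
    """Dot product; equals cosine similarity for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))


# Toy per-layer embeddings (2 layers, 4 dimensions); values are made up.
query_layers = [[0.5, 0.5, 0.1, 0.1], [0.9, 0.1, 0.3, 0.2]]
doc_layers = [[0.4, 0.6, 0.0, 0.2], [0.8, 0.2, 0.1, 0.4]]

# Full budget: last layer, all dimensions.
full_score = dot(select_granularity(query_layers, 2, 4),
                 select_granularity(doc_layers, 2, 4))

# Tight budget: first layer only, first two dimensions.
small_score = dot(select_granularity(query_layers, 1, 2),
                  select_granularity(doc_layers, 1, 2))
```

Under this reading, a deployment with a tight latency or memory budget would use a shallow layer and a short prefix, while a high-accuracy deployment would use the full depth and width of the same pretrained model.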