Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost and low inference efficiency of multimodal large language models (MLLMs) in general-purpose multimodal retrieval, primarily caused by processing a large number of visual tokens. To overcome these challenges, the authors propose Magic-MM-Embedding, which integrates an efficient visual token compression architecture with a three-stage progressive training strategy—comprising continual pretraining, contrastive pretraining, and MLLM-guided fine-grained fine-tuning. The approach further incorporates hard negative mining and an MLLM-as-a-Judge data filtering mechanism to enhance embedding quality. Experimental results demonstrate that Magic-MM-Embedding significantly improves multimodal embedding performance while substantially reducing inference latency and memory consumption, outperforming existing methods on general multimodal retrieval benchmarks.
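The summary's core efficiency idea is to shrink the number of visual tokens the MLLM must process. As a rough illustration of what such a compression module does, here is a minimal sketch that merges groups of adjacent patch tokens by average pooling; the function name, ratio, and pooling choice are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def compress_visual_tokens(tokens: np.ndarray, ratio: int = 4) -> np.ndarray:
    """Merge each group of `ratio` adjacent visual tokens into one by
    average pooling. This is a generic compression scheme used only to
    illustrate the token-count reduction; the paper's module is unspecified
    in this summary."""
    n, d = tokens.shape
    assert n % ratio == 0, "token count must be divisible by the ratio"
    return tokens.reshape(n // ratio, ratio, d).mean(axis=1)

# A ViT-style image grid of 576 patch tokens with a toy hidden size of 8.
patches = np.random.rand(576, 8)
compressed = compress_visual_tokens(patches, ratio=4)
print(compressed.shape)  # -> (144, 8)
```

A 4x reduction like this shortens the LLM's input sequence proportionally, which is where the claimed latency and memory savings come from.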

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. However, their practical application is often hindered by the substantial computational cost of processing a large number of tokens from visual inputs. In this paper, we propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding. Our approach is built on two synergistic pillars: (1) a highly efficient MLLM architecture incorporating visual token compression to drastically reduce inference latency and memory footprint, and (2) a multi-stage progressive training strategy designed not only to recover but to significantly boost performance. This coarse-to-fine training paradigm begins with extensive continual pretraining to restore multimodal understanding and generation capabilities, progresses to large-scale contrastive pretraining with hard negative mining to enhance discriminative power, and culminates in a task-aware fine-tuning stage guided by an MLLM-as-a-Judge for precise data curation. Comprehensive experiments show that our model outperforms existing methods by a large margin while being more inference-efficient.
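The contrastive pretraining stage the abstract describes is typically an InfoNCE-style objective, and hard negative mining matters because near-duplicate negatives produce a stronger training signal than random ones. The sketch below illustrates that effect for a single query; the function name, temperature, and similarity choice are assumptions for illustration, not details from the paper.

```python
import numpy as np

def info_nce_with_hard_negatives(q, pos, negatives, temperature=0.05):
    """InfoNCE-style contrastive loss for one query embedding `q`:
    the positive `pos` competes against the given negatives under a
    temperature-scaled cosine-similarity softmax. Names and the
    temperature value are illustrative, not taken from the paper."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(q, pos)] + [cos(q, n) for n in negatives])
    logits /= temperature
    # numerically stable softmax cross-entropy, positive at index 0
    logits -= logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

rng = np.random.default_rng(0)
q = rng.normal(size=16)
easy = [rng.normal(size=16) for _ in range(4)]            # random negatives
hard = [q + 0.1 * rng.normal(size=16) for _ in range(4)]  # mined near-duplicates
loss_easy = info_nce_with_hard_negatives(q, q, easy)
loss_hard = info_nce_with_hard_negatives(q, q, hard)
assert loss_hard > loss_easy  # hard negatives yield a larger, more informative loss
```

Because random negatives are nearly orthogonal to the query, they contribute almost no gradient once training has warmed up; mining negatives that sit close to the positive keeps the loss informative, which is the motivation for the mining step in the abstract.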
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
visual token efficiency
universal multimodal retrieval
computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual token compression
multimodal embedding
progressive training
MLLM-as-a-Judge
efficient MLLM
Qi Li
Honor Device Co., Ltd
Yanzhe Zhao
Honor Device Co., Ltd
Yongxin Zhou
Honor Device Co., Ltd
Yameng Wang
Honor Device Co., Ltd
Yandong Yang
Honor Device Co., Ltd
Yuanjia Zhou
Honor Device Co., Ltd
Jue Wang
Honor Device Co., Ltd
Zuojian Wang
Honor Device Co., Ltd
Jinxiang Liu
Shanghai Jiao Tong University
machine learning · computer vision · deep learning