🤖 AI Summary
This work addresses the high computational cost and low inference efficiency of multimodal large language models (MLLMs) in general-purpose multimodal retrieval, primarily caused by processing a large number of visual tokens. To overcome these challenges, the authors propose Magic-MM-Embedding, which integrates an efficient visual token compression architecture with a three-stage progressive training strategy—comprising continual pretraining, contrastive pretraining, and MLLM-guided fine-grained fine-tuning. The approach further incorporates hard negative mining and an MLLM-as-a-Judge data filtering mechanism to enhance embedding quality. Experimental results demonstrate that Magic-MM-Embedding significantly improves multimodal embedding performance while substantially reducing inference latency and memory consumption, outperforming existing methods on general multimodal retrieval benchmarks.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. However, their practical application is often hindered by the substantial computational cost of processing the large number of tokens produced by visual inputs. In this paper, we propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding. Our approach is built on two synergistic pillars: (1) a highly efficient MLLM architecture that incorporates visual token compression to drastically reduce inference latency and memory footprint, and (2) a multi-stage progressive training strategy designed not only to recover but to significantly boost performance. This coarse-to-fine training paradigm begins with extensive continual pretraining to restore multimodal understanding and generation capabilities, progresses to large-scale contrastive pretraining with hard negative mining to enhance discriminative power, and culminates in a task-aware fine-tuning stage guided by an MLLM-as-a-Judge for precise data curation. Comprehensive experiments show that our models outperform existing methods by a large margin while being more inference-efficient.
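The abstract does not spell out the training objective, but contrastive pretraining with mined hard negatives is typically implemented as an InfoNCE-style loss over in-batch negatives plus per-query hard negatives. The sketch below is illustrative only (the function name, temperature value, and tensor shapes are assumptions, not details from the paper):

```python
import numpy as np

def info_nce_loss(q, pos, hard_negs, temperature=0.05):
    """InfoNCE-style contrastive loss with in-batch and mined hard negatives.

    q:         (B, D) query embeddings (e.g., the text side)
    pos:       (B, D) positive target embeddings (e.g., the image side)
    hard_negs: (B, K, D) mined hard-negative embeddings per query
    """
    # L2-normalize so dot products become cosine similarities.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    pos = pos / np.linalg.norm(pos, axis=1, keepdims=True)
    hard_negs = hard_negs / np.linalg.norm(hard_negs, axis=2, keepdims=True)

    # In-batch similarities: every other sample's positive acts as a negative.
    sim_batch = q @ pos.T / temperature                              # (B, B)
    # Similarities to each query's own mined hard negatives.
    sim_hard = np.einsum('bd,bkd->bk', q, hard_negs) / temperature   # (B, K)

    logits = np.concatenate([sim_batch, sim_hard], axis=1)           # (B, B+K)
    # Cross-entropy with the diagonal (the true pair) as the target class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(q)), np.arange(len(q))].mean()
```

Appending hard negatives to the in-batch negatives is what sharpens discriminative power: the loss penalizes the model whenever a mined distractor scores close to the true pair.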