Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) embedding approaches struggle to simultaneously capture global semantics and fine-grained perceptual details. To address this limitation, this work proposes a multidimensional embedding generation and adaptive fusion framework that leverages prompt-guided MLLM queries to produce embeddings focused on distinct semantic dimensions. By integrating an adaptive fusion mechanism with Explicit Gradient Amplification (EGA), the method enhances hard negative sample learning without requiring fine-grained data annotations. Evaluated on the MMEB and MMVP-VLM benchmarks, the proposed approach achieves state-of-the-art performance in both global and fine-grained multimodal understanding, significantly outperforming current mainstream multimodal embedding models.
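The summary describes generating several prompt-conditioned embeddings and aggregating them with an adaptive fusion mechanism. A minimal sketch of that idea is below; the gating network, the number of views, and the example prompts are illustrative assumptions, not the paper's actual architecture.

```python
# Hedged sketch: fuse K prompt-conditioned embeddings with input-dependent
# weights. The gating design and prompt choices are assumptions for
# illustration only (e.g. one prompt for global scene semantics, one for
# fine-grained details), not AGFF-Embed's exact mechanism.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveFusion(nn.Module):
    """Smoothly aggregate multiple embedding 'views' of the same input."""

    def __init__(self, dim: int, num_views: int):
        super().__init__()
        # A simple linear gate scores one fusion weight per view.
        self.gate = nn.Linear(dim * num_views, num_views)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, dim) -- one embedding per MLLM prompt.
        weights = torch.softmax(self.gate(views.flatten(1)), dim=-1)  # (batch, num_views)
        fused = (weights.unsqueeze(-1) * views).sum(dim=1)            # (batch, dim)
        return F.normalize(fused, dim=-1)                             # unit-norm embedding


fusion = AdaptiveFusion(dim=64, num_views=3)
out = fusion(torch.randn(2, 3, 64))
print(out.shape)  # torch.Size([2, 64])
```

The softmax gate keeps the aggregation smooth and differentiable, so the model can learn per-input how much weight to place on global versus fine-grained views.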

📝 Abstract
Multimodal embeddings serve as a bridge for aligning vision and language, yet the two primary implementations -- CLIP-based and MLLM-based embedding models -- are both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that the complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, thus necessitating a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we combine AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negative enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed achieves state-of-the-art performance in both general and fine-grained understanding compared to other multimodal embedding models.
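The abstract's "in-batch hard negative enhancement" via Explicit Gradient Amplification can be illustrated with a contrastive loss in which the hardest in-batch negative receives an amplified gradient while the forward loss value is left unchanged. This is a generic sketch of the idea, not the paper's exact EGA formulation; the amplification factor and the single-hardest-negative selection are assumptions.

```python
# Hedged sketch: InfoNCE with explicit gradient amplification on the hardest
# in-batch negative. The `amp` factor and picking exactly one hardest negative
# per query are illustrative assumptions, not the paper's exact EGA rule.
import torch
import torch.nn.functional as F


def info_nce_with_hard_negative_amplification(q, k, tau=0.07, amp=2.0):
    # q, k: (B, D) L2-normalized query/key embeddings; diagonal pairs are positives.
    logits = q @ k.t() / tau                                  # (B, B) similarity logits
    B = logits.size(0)
    neg_only = logits.masked_fill(torch.eye(B, dtype=torch.bool), float("-inf"))
    hard_idx = neg_only.argmax(dim=1)                         # hardest negative per query
    # Scale the gradient through the hardest negative's logit by `amp`
    # without changing the forward value of the loss:
    scale = torch.ones_like(logits)
    scale[torch.arange(B), hard_idx] = amp
    amplified = logits * scale + logits.detach() * (1 - scale)
    return F.cross_entropy(amplified, torch.arange(B))
```

Because `amplified` equals `logits` in the forward pass but routes an `amp`-times larger gradient through the hardest negative, the model is pushed harder to separate confusable pairs without any dataset-level hard negative mining or fine-grained annotation.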
Problem

Research questions and friction points this paper is trying to address.

multimodal embeddings
global perception
fine-grained perception
MLLM
perceptual fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Fusion
Fine-Grained Perception
MLLM Embeddings
Hard Negative Amplification
Explicit Gradient Amplification