F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model

📅 2025-08-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address fine-grained multimodal fragment retrieval and weak semantic-coherence modeling in long-form multimodal dialogues, this paper proposes F2RVLM. Its contributions are: (1) a formal definition of the Fine-grained Fragment Retrieval (FFR) task; (2) MLDR, described as the longest-turn multimodal dialogue retrieval dataset to date; and (3) a two-stage training paradigm that combines supervised fine-tuning with GRPO-based reinforcement learning under multi-objective rewards, together with difficulty-aware curriculum sampling, to improve cross-modal semantic consistency. Experiments show that F2RVLM outperforms popular vision-language models on both the in-domain benchmark and a real-world WeChat-based test set, with gains in retrieval precision, relevance, and contextual coherence.

📝 Abstract

Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, such methods often fail to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, which requires models to locate query-relevant fragments, comprising both utterances and images, in multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each dialogue naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we evaluate existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fragments. Although optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To this end, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards that promote semantic precision, relevance, and contextual coherence. To handle varying intra-fragment complexity, from locally dense to sparsely distributed, we introduce difficulty-aware curriculum sampling, which ranks training instances by model-predicted difficulty and gradually exposes the model to harder samples. This improves reasoning ability in long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain and real-domain settings, demonstrating superior retrieval performance.
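The curriculum idea in the abstract (rank instances by difficulty, then expose harder ones gradually) can be illustrated with a minimal sketch. This is not the authors' implementation: the `difficulty` callable stands in for the model-predicted difficulty score, and the staged schedule, `num_stages`, and `batch_size` are assumptions made for illustration.

```python
import random


def curriculum_batches(samples, difficulty, num_stages=3, batch_size=2, seed=0):
    """Yield (stage, batch) pairs; each stage unlocks a harder slice of the data.

    `difficulty` is a hypothetical scoring function (lower = easier). The
    source only says difficulty is model-predicted; how it is computed is
    not specified there.
    """
    rng = random.Random(seed)
    ranked = sorted(samples, key=difficulty)      # easiest first
    stage_size = -(-len(ranked) // num_stages)    # ceiling division
    for stage in range(1, num_stages + 1):
        # The pool grows with each stage, so easy samples are revisited
        # while harder ones are introduced gradually.
        pool = ranked[: stage * stage_size]
        rng.shuffle(pool)
        for i in range(0, len(pool), batch_size):
            yield stage, pool[i : i + batch_size]


if __name__ == "__main__":
    # Toy data: the integer itself plays the role of the difficulty score.
    for stage, batch in curriculum_batches(list(range(6)), difficulty=lambda x: x):
        print(stage, batch)
```

A growing pool is only one possible schedule; a sliding window that drops the easiest samples in later stages would also fit the description in the abstract.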

Problem

Research questions and friction points this paper is trying to address.

Retrieving semantically coherent multimodal fragments from long dialogues

Addressing incoherent utterance-image retrieval in vision-language models

Handling varying complexity in long multi-turn multimodal contexts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training with SFT and GRPO reinforcement

Difficulty-aware curriculum sampling for complex fragments

Generative retrieval model for multimodal dialogue coherence
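As a rough illustration of the GRPO stage listed above, the sketch below combines three per-objective scores into one scalar reward and then normalizes rewards within a group of sampled outputs for the same query, which is the group-relative step GRPO uses in place of a learned value baseline. The weighted sum, the weight values, and the use of population standard deviation are assumptions; the source does not specify how the precision, relevance, and coherence rewards are computed or combined.

```python
from statistics import mean, pstdev


def combined_reward(scores, weights):
    """Weighted sum of per-objective scores (the weighting scheme is hypothetical)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical weights and scores for three candidate fragments
    # sampled for the same query.
    weights = {"precision": 0.5, "relevance": 0.3, "coherence": 0.2}
    group = [
        {"precision": 1.0, "relevance": 0.8, "coherence": 0.9},
        {"precision": 0.5, "relevance": 0.6, "coherence": 0.2},
        {"precision": 0.0, "relevance": 0.1, "coherence": 0.0},
    ]
    rewards = [combined_reward(s, weights) for s in group]
    print(group_relative_advantages(rewards))
```

Candidates scoring above the group mean receive positive advantages and are reinforced; those below the mean are penalized. When all candidates in a group score the same, every advantage is zero and the group contributes no learning signal.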
