F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model

📅 2025-08-25
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the challenges of retrieving fine-grained multimodal fragments and modeling semantic coherence in long-form multimodal dialogues, this paper proposes F2RVLM. The work (1) defines the Fine-grained Fragment Retrieval (FFR) task; (2) introduces MLDR, a long-turn multimodal dialogue dataset built explicitly for retrieval; and (3) devises a two-stage training paradigm, supervised fine-tuning followed by GRPO-based reinforcement learning with multi-objective rewards, paired with difficulty-aware curriculum sampling to improve cross-modal semantic consistency. Experiments show that F2RVLM outperforms popular vision-language models on both the in-domain benchmark and a real-world test set, with gains in retrieval precision, relevance, and contextual coherence.

πŸ“ Abstract
Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, it often fails to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, requiring models to locate query-relevant fragments, comprising both utterances and images, from multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each dialogue naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fragments. While optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To this end, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards promoting semantic precision, relevance, and contextual coherence. To handle varying intra-fragment complexity, from locally dense to sparsely distributed, we introduce difficulty-aware curriculum sampling that ranks training instances by model-predicted difficulty and gradually exposes the model to harder samples. This boosts reasoning ability in long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain and real-domain settings, demonstrating superior retrieval performance.
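The difficulty-aware curriculum sampling described in the abstract can be illustrated with a minimal sketch. The abstract only states that instances are ranked by model-predicted difficulty and that harder samples are introduced gradually; the linear pacing schedule, the `start_frac` parameter, and the function names below are assumptions made for illustration, not the authors' implementation.

```python
import random
from typing import List, Sequence


def curriculum_pool(difficulties: Sequence[float], step: int, total_steps: int,
                    start_frac: float = 0.3) -> List[int]:
    """Return indices of the training instances available at `step`.

    Instances are ranked from easiest to hardest. The available fraction
    grows linearly from `start_frac` to 1.0 (an assumed pacing schedule).
    """
    # Rank instance indices by ascending difficulty score.
    order = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    # Training progress clamped to [0, 1].
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    frac = start_frac + (1.0 - start_frac) * progress
    # Always expose at least one instance.
    cutoff = max(1, round(frac * len(order)))
    return order[:cutoff]


def sample_batch(difficulties: Sequence[float], step: int, total_steps: int,
                 batch_size: int, rng: random.Random,
                 start_frac: float = 0.3) -> List[int]:
    """Draw a batch (with replacement) from the currently available pool."""
    pool = curriculum_pool(difficulties, step, total_steps, start_frac)
    return [rng.choice(pool) for _ in range(batch_size)]
```

Early in training only the easiest instances are eligible; by the final step the whole dataset is in the pool. In practice the difficulty scores would come from the model itself, for example from its reward on each instance, which is also an assumption here.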
Problem

Research questions and friction points this paper is trying to address.

Retrieving semantically coherent multimodal fragments from long dialogues
Addressing incoherent utterance-image retrieval in vision-language models
Handling varying complexity in long multi-turn multimodal contexts
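To make the retrieval target concrete, the sketch below models a dialogue as a list of turns (utterances or images) and a fragment as a set of turn IDs, then scores a predicted fragment against a gold one with set-overlap F1. The `Turn` fields and the choice of F1 are hypothetical; this page does not specify the paper's data schema or evaluation metrics.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Turn:
    """One dialogue turn (hypothetical schema, not the MLDR format)."""
    turn_id: int
    speaker: str
    kind: str      # "utterance" or "image"
    content: str   # the text, or an image path/identifier


def fragment_f1(predicted: Set[int], gold: Set[int]) -> float:
    """Set-overlap F1 between predicted and gold turn IDs."""
    if not predicted or not gold:
        return 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A fragment that mixes utterance and image turns is scored the same way, since only turn IDs are compared.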
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training with SFT and GRPO reinforcement
Difficulty-aware curriculum sampling for complex fragments
Generative retrieval model for multimodal dialogue coherence
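The reinforcement stage can be sketched roughly as follows. GRPO samples a group of candidate outputs per query and normalizes each reward against the group's mean and standard deviation. The reward names follow the three objectives listed in the abstract, but the weighted-sum combination, the weights, and the use of population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev
from typing import Dict, List


def combined_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-objective rewards (assumed combination rule)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group, as in GRPO.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every candidate in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the ordering of training instances by difficulty can matter.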
Authors

Hanbo Bi
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Zhiqiang Yuan
Fudan University

Zexi Jia
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Jiapei Zhang
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Chongyang Li
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Peixiang Luo
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Ying Deng
Pattern Recognition Center, WeChat AI, Tencent Inc, China

Xiaoyue Duan
Beihang University
Research interests: image/video generation, music generation

Jinchao Zhang
WeChat AI - Pattern Recognition Center
Research interests: Deep Learning, Natural Language Processing, Machine Translation, Dialogue System