AI Summary
To address the challenges of fine-grained multimodal fragment retrieval and weak semantic coherence modeling in long-horizon multimodal dialogues, this paper proposes F2RVLM. Methodologically: (1) it formally defines the fine-grained fragment retrieval task for the first time; (2) it introduces MLDR, the first long-turn multimodal dialogue dataset explicitly designed for retrieval; and (3) it devises a two-stage training paradigm integrating supervised fine-tuning, GRPO-based reinforcement learning, multi-objective reward optimization, and difficulty-aware curriculum sampling to enhance cross-modal semantic consistency. Experiments demonstrate that F2RVLM significantly outperforms state-of-the-art vision-language models on both domain-specific benchmarks and real-world test sets, achieving new SOTA performance in retrieval accuracy, relevance, and contextual coherence.
Abstract
Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, existing methods often fail to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, requiring models to locate query-relevant fragments, comprising both utterances and images, from multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fragments. While optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To this end, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards promoting semantic precision, relevance, and contextual coherence. To handle varying intra-fragment complexity, from locally dense to sparsely distributed, we introduce difficulty-aware curriculum sampling that ranks training instances by model-predicted difficulty and gradually exposes the model to harder samples. This boosts reasoning ability in long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain and real-domain settings, demonstrating superior retrieval performance.
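The difficulty-aware curriculum sampling described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: it assumes each training instance already has a scalar model-predicted difficulty score, and it uses a linear pacing schedule. The names `curriculum_pool`, `sample_batch`, and `start_frac` are hypothetical.

```python
import random
from typing import List, Sequence


def curriculum_pool(difficulties: Sequence[float], step: int,
                    total_steps: int, start_frac: float = 0.3) -> List[int]:
    """Return indices of the instances available for sampling at `step`.

    Instances are ranked from easiest to hardest by their difficulty score.
    The available fraction grows linearly from `start_frac` to 1.0 over
    `total_steps` (the linear schedule is an assumption of this sketch).
    """
    if not difficulties:
        raise ValueError("difficulties must be non-empty")
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Rank instance indices by ascending difficulty.
    ranked = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    # Clamp training progress to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    frac = start_frac + (1.0 - start_frac) * progress
    # Always keep at least one instance in the pool.
    cutoff = max(1, int(round(frac * len(ranked))))
    return ranked[:cutoff]


def sample_batch(difficulties: Sequence[float], step: int, total_steps: int,
                 batch_size: int, rng: random.Random) -> List[int]:
    """Sample a batch (with replacement) from the current curriculum pool."""
    pool = curriculum_pool(difficulties, step, total_steps)
    return [rng.choice(pool) for _ in range(batch_size)]
```

Early in training only the easiest instances are eligible, and the pool widens until every instance can be drawn. In practice the difficulty scores would likely be refreshed as the model improves, which this sketch leaves out.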