🤖 AI Summary
Existing MRAG benchmarks suffer from narrow modality coverage and coarse-grained annotation, limiting their utility for fine-grained multimodal temporal reasoning. To address this, we introduce CFVBench—the first video-centric, fine-grained MRAG benchmark—comprising 599 publicly available videos and 5,360 open-ended question-answer pairs, curated from high-information-density domains including chart-based reports, news broadcasts, and software tutorials. To tackle the challenge of modeling transient critical information in videos, we propose the Adaptive Visual Refinement (AVR) framework, which integrates dynamic frame sampling and on-demand tool invocation to enhance fine-grained visual detail capture. Leveraging a human-validated data curation pipeline, we conduct systematic evaluation across seven retrieval methods and fourteen state-of-the-art multimodal large language models (MLLMs). Results demonstrate that AVR effectively alleviates the current bottleneck in modeling brief multimodal signals, thereby advancing research in fine-grained MRAG.
📝 Abstract
Multimodal Retrieval-Augmented Generation (MRAG) enables Multimodal Large Language Models (MLLMs) to generate responses with external multimodal evidence, and numerous video-based MRAG benchmarks have been proposed to evaluate model capabilities across retrieval and generation stages. However, existing benchmarks remain limited in modality coverage and format diversity, often focusing on single- or limited-modality tasks, or coarse-grained scene understanding. To address these gaps, we introduce CFVBench, a large-scale, manually verified benchmark constructed from 599 publicly available videos, yielding 5,360 open-ended QA pairs. CFVBench spans high-information-density formats and domains such as chart-heavy reports, news broadcasts, and software tutorials, requiring models to retrieve and reason over long temporal video spans while maintaining fine-grained multimodal information. Using CFVBench, we systematically evaluate 7 retrieval methods and 14 widely used MLLMs, revealing a critical bottleneck: current models (even GPT-5 and Gemini) struggle to capture transient yet essential fine-grained multimodal details. To mitigate this, we propose Adaptive Visual Refinement (AVR), a simple yet effective framework that adaptively increases frame sampling density and selectively invokes external tools when necessary. Experiments show that AVR consistently enhances fine-grained multimodal comprehension and improves performance across all evaluated MLLMs.
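The adaptive loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the relevance scorer, and all thresholds (`base_fps`, `dense_fps`, `threshold`, `window`) are assumptions chosen for demonstration.

```python
# Hypothetical sketch of an AVR-style loop: sample frames coarsely,
# densify sampling around segments a relevance scorer flags as critical,
# and schedule on-demand tool calls (e.g., OCR) for those segments.
# All names and numeric defaults here are illustrative assumptions.

def sample_timestamps(start, end, fps):
    """Uniformly sample timestamps in [start, end) at the given rate."""
    step = 1.0 / fps
    ts, t = [], start
    while t < end:
        ts.append(round(t, 3))
        t += step
    return ts

def adaptive_visual_refinement(duration, score_fn, base_fps=0.5,
                               dense_fps=4.0, threshold=0.7, window=2.0):
    """Return (timestamps, tool_calls).

    Coarse frames whose relevance score meets `threshold` trigger dense
    resampling in a +/- `window` second neighborhood, and that segment is
    queued for an external tool to capture transient fine-grained detail.
    """
    coarse = sample_timestamps(0.0, duration, base_fps)
    refined, tool_calls = list(coarse), []
    for t in coarse:
        if score_fn(t) >= threshold:          # transient critical segment
            lo, hi = max(0.0, t - window), min(duration, t + window)
            refined.extend(sample_timestamps(lo, hi, dense_fps))
            tool_calls.append((lo, hi))       # schedule on-demand tool use
    return sorted(set(refined)), tool_calls

# Toy scorer: pretend only the segment near t = 10 s is critical.
score = lambda t: 1.0 if 9.0 <= t <= 11.0 else 0.1
timestamps, tool_calls = adaptive_visual_refinement(20.0, score)
```

In this toy run, only the coarse frame at 10 s crosses the threshold, so the 8–12 s window is resampled at the dense rate and flagged for a tool call, while the rest of the video keeps its cheap coarse sampling.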