Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

📅 2026-04-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing agentic search systems are largely text-centric and struggle to produce factually grounded long-form reports in which images and text cohere. This work introduces the task of multimodal long-form generation and proposes Deep-Reporter, a unified agentic framework for it. Deep-Reporter combines agentic multimodal search and filtering, checklist-guided incremental synthesis, and recurrent context management to generate grounded reports with integrated images and citations. The contributions include the agent framework itself; a curation pipeline that yields 8K high-quality agentic traces for model optimization; and M2LongBench, a benchmark of 247 research tasks across nine domains with a stable multimodal sandbox. Experiments show that the task remains challenging, particularly in multimodal selection and integration, and that effective post-training can narrow the gap.

📝 Abstract

Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further introduce M2LongBench, a comprehensive testbed comprising 247 research tasks across 9 domains and a stable multimodal sandbox. Extensive experiments demonstrate that long-form multimodal generation is a challenging task, especially in multimodal selection and integration, and effective post-training can bridge the gap.
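
The three stages named in the abstract can be pictured with a toy sketch. This is not the authors' implementation: every name here (`Evidence`, `filter_evidence`, `Context`, `write_section`, `generate_report`), the relevance threshold, the keyword matching, and the citation format are assumptions made for illustration, and the writer is a stub where a real system would call a model.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    source_id: str
    kind: str       # "text" or "image" (hypothetical labels)
    content: str    # passage text, or a caption for an image
    score: float    # hypothetical relevance score in [0, 1]


def filter_evidence(items, min_score=0.5):
    """Stage (i), simplified: keep evidence above a threshold, best first."""
    kept = [e for e in items if e.score >= min_score]
    return sorted(kept, key=lambda e: e.score, reverse=True)


@dataclass
class Context:
    """Stage (iii), simplified: recent sections verbatim, older ones as titles."""
    window: int = 2
    summary: list[str] = field(default_factory=list)
    recent: list[tuple[str, str]] = field(default_factory=list)

    def push(self, title, section):
        self.recent.append((title, section))
        if len(self.recent) > self.window:
            old_title, _ = self.recent.pop(0)
            self.summary.append(old_title)  # compress the oldest section to its title

    def render(self):
        earlier = "Earlier sections: " + ", ".join(self.summary)
        return "\n\n".join([earlier] + [text for _, text in self.recent])


def write_section(topic, evidence, context):
    """Stub writer: a real writer would be a model call conditioned on `context`.

    This stub ignores `context` and only places one citation per evidence item.
    """
    lines = [topic]
    for e in evidence:
        if e.kind == "image":
            lines.append(f"[Figure: {e.content}] [{e.source_id}]")
        else:
            lines.append(f"{e.content} [{e.source_id}]")
    return "\n".join(lines)


def generate_report(checklist, pool, window=2):
    """Stage (ii), simplified: write one section per checklist item, in order."""
    ctx = Context(window=window)
    usable = filter_evidence(pool)
    sections = []
    for topic in checklist:
        relevant = [e for e in usable if topic.lower() in e.content.lower()]
        section = write_section(topic, relevant, ctx.render())
        sections.append(section)
        ctx.push(topic, section)
    return "\n\n".join(sections), ctx


pool = [
    Evidence("s1", "text", "Solar capacity grew in 2023.", 0.9),
    Evidence("s2", "image", "Chart of solar capacity by year", 0.8),
    Evidence("s3", "text", "Unverified solar rumor.", 0.2),
    Evidence("s4", "text", "Wind turbines are getting taller.", 0.7),
]
report, ctx = generate_report(["Solar", "Wind", "Storage"], pool)
print(report)
```

The rolling window is the point of the `Context` class in this sketch: the writer sees the latest sections in full for local fluency and only a compressed trace of earlier ones for long-range coherence, which keeps the prompt bounded as the report grows.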

Problem

Research questions and friction points this paper is trying to address.

multimodal long-form generation

agentic search

factual grounding

image-text integration

multimodal evidence

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal long-form generation

agentic search

grounded generation