🤖 AI Summary
Existing video summarization methods struggle to comprehend the high-level semantic structure of real-world events (e.g., natural disasters, elections), remaining confined to low-level visual features.
Method: We propose Collaborative Article Generation (CAG), the first framework to automatically generate Wikipedia-style structured event overviews from heterogeneous, multi-source videos. CAG employs iterative, multi-turn collaboration between a reasoning model and a VideoLLM, integrating cross-video evidence alignment, factual grounding, and retrieval-augmented generation (RAG) to achieve event-level semantic aggregation.
Contribution/Results: Evaluated on our newly constructed WikiVideo benchmark, CAG significantly outperforms state-of-the-art VideoLLMs under both oracle and RAG settings. It demonstrates the feasibility of high-level video semantic understanding and structured article generation, establishing a novel video-driven RAG paradigm for open-domain event synthesis.
📝 Abstract
We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text, and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for the articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.