WikiVideo: Article Generation from Multiple Videos

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video summarization methods struggle to comprehend high-level semantic structures of real-world events (e.g., natural disasters, elections), remaining confined to low-level visual features. Method: We propose Collaborative Article Generation (CAG), the first framework to automatically generate Wikipedia-style structured event overviews from heterogeneous, multi-source videos. CAG employs iterative, multi-turn collaboration between a reasoning model and a VideoLLM, integrating cross-video evidence alignment, factual grounding, and retrieval-augmented generation (RAG) to achieve event-level semantic aggregation. Contribution/Results: Evaluated on our newly constructed WikiVideo benchmark, CAG significantly outperforms state-of-the-art VideoLLMs under both oracle and RAG settings. It demonstrates the feasibility of high-level video semantic understanding and structured article generation, establishing a novel video-driven RAG paradigm for open-domain event synthesis.

📝 Abstract
We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
Problem

Research questions and friction points this paper is trying to address.

Generating Wikipedia articles from multiple diverse videos
Integrating video evidence into retrieval-augmented generation pipelines
Capturing high-level event semantics beyond low-level visual features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrate video into RAG pipelines
Propose Collaborative Article Generation (CAG)
Combine reasoning model with VideoLLM
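
The iterative collaboration described above can be sketched as a simple loop: a reasoning model proposes questions, a VideoLLM answers them against each retrieved video, and the accumulated evidence is synthesized into an article. This is a minimal illustration of the interaction pattern only; every function body below is a hypothetical stand-in, not the authors' implementation.

```python
# Hedged sketch of a CAG-style loop. All model calls are placeholder stubs
# (hypothetical), standing in for a real VideoLLM and an r1-style reasoner.

def video_llm_answer(video: str, question: str) -> str:
    # Placeholder for a VideoLLM call (e.g., QA over a single video).
    return f"[{video}] answer to: {question}"

def reasoning_model_plan(event: str, evidence: list[str]) -> list[str]:
    # Placeholder for a reasoning model proposing follow-up questions
    # based on the evidence collected so far.
    if not evidence:
        return [f"What happened during {event}?"]
    return [f"What caused {event}?", f"What was the impact of {event}?"]

def draft_article(event: str, evidence: list[str]) -> str:
    # Placeholder: synthesize a Wikipedia-style overview from grounded evidence.
    body = "\n".join(f"- {e}" for e in evidence)
    return f"# {event}\n\n{body}"

def collaborative_article_generation(event: str, videos: list[str],
                                     rounds: int = 2) -> str:
    """Multi-turn reasoner/VideoLLM collaboration over multiple videos."""
    evidence: list[str] = []
    for _ in range(rounds):
        questions = reasoning_model_plan(event, evidence)
        for video in videos:          # cross-video evidence gathering
            for q in questions:
                evidence.append(video_llm_answer(video, q))
    return draft_article(event, evidence)

article = collaborative_article_generation("2023 Turkey earthquake",
                                           ["vid_a", "vid_b"])
print(article.splitlines()[0])  # → "# 2023 Turkey earthquake"
```

The key design point mirrored here is that the reasoner, not the VideoLLM, drives the question schedule, so each round can probe higher-level event structure than a single captioning pass would surface.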