Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing video understanding models struggle with ultra-long or infinite videos because of high computational complexity, fragmented memory, and weak causal modeling. This work proposes a lightweight retrieval-augmented generation framework that indexes semantically coherent events rather than fixed-length clips. Each event is represented as a State-Event-State (SES) graph, and these graphs are merged into a global Event Knowledge Graph. A dual-store memory and a bidirectional causal retrieval strategy let the method capture long-range causal dependencies and combine information from multiple events. On long-video understanding benchmarks, the approach consistently outperforms clip-based retrieval baselines and long-context video models, particularly on questions that require reasoning across long temporal gaps, while also improving memory efficiency and robustness in streaming settings.
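To make the State-Event-State idea concrete, here is a minimal sketch of how SES triples might be merged into one global graph. It is an illustration under stated assumptions, not the authors' implementation: the names `SES`, `EventKnowledgeGraph`, and `merge` are hypothetical, and states are matched by exact string equality, whereas a real system would presumably need some form of semantic state matching.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SES:
    """Hypothetical State-Event-State triple: pre-state -> event -> post-state."""
    pre_state: str
    event: str
    post_state: str


@dataclass
class EventKnowledgeGraph:
    """Toy global graph stored as an adjacency map over node labels."""
    edges: dict[str, set[str]] = field(default_factory=dict)

    def add(self, ses: SES) -> None:
        # Link pre-state -> event and event -> post-state.
        self.edges.setdefault(ses.pre_state, set()).add(ses.event)
        self.edges.setdefault(ses.event, set()).add(ses.post_state)
        # Make sure the post-state exists as a node even with no outgoing edge.
        self.edges.setdefault(ses.post_state, set())


def merge(triples: list[SES]) -> EventKnowledgeGraph:
    """Merge SES triples; triples that share a state label become connected."""
    graph = EventKnowledgeGraph()
    for triple in triples:
        graph.add(triple)
    return graph


# Assumed example events: the post-state of the first equals the pre-state
# of the second, so the two events end up on one causal path.
graph = merge([
    SES("door closed", "person opens door", "door open"),
    SES("door open", "dog runs outside", "dog in yard"),
])
```

Because the two example triples share the state "door open", the merged graph contains a single path from "door closed" to "dog in yard", which is the kind of cross-event link that fixed-length clip indexing does not record.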
📝 Abstract
Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.
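The dual-store memory and bidirectional retrieval described in the abstract can be sketched roughly as follows. This is a toy under loud assumptions: the class `DualStoreMemory` and its methods are hypothetical, word overlap (Jaccard similarity) stands in for whatever semantic matching the paper actually uses, and the causal store is a pair of plain adjacency maps. The sketch picks a seed event by semantic match, then walks the causal links backward to causes and forward to effects.

```python
from collections import defaultdict


class DualStoreMemory:
    """Toy memory with a semantic store and a causal-topological store."""

    def __init__(self):
        self.text = {}                    # semantic store: event id -> description
        self.causes = defaultdict(list)   # causal store: effect -> list of causes
        self.effects = defaultdict(list)  # causal store: cause -> list of effects

    def add_event(self, eid, description):
        self.text[eid] = description

    def add_causal_link(self, cause, effect):
        self.effects[cause].append(effect)
        self.causes[effect].append(cause)

    def semantic_seed(self, query):
        # Assumption: Jaccard word overlap replaces a learned embedding search.
        q = set(query.lower().split())

        def score(eid):
            words = set(self.text[eid].lower().split())
            return len(q & words) / len(q | words)

        return max(self.text, key=score)

    def _walk(self, start, links, max_hops):
        # Breadth-first walk along one direction of the causal store.
        visited, frontier = [], [start]
        for _ in range(max_hops):
            next_frontier = []
            for node in frontier:
                for neighbor in links.get(node, []):
                    if neighbor != start and neighbor not in visited:
                        visited.append(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return visited

    def retrieve(self, query, max_hops=2):
        # Bidirectional retrieval: causes first, then the seed, then effects.
        seed = self.semantic_seed(query)
        backward = self._walk(seed, self.causes, max_hops)
        forward = self._walk(seed, self.effects, max_hops)
        return list(reversed(backward)) + [seed] + forward


memory = DualStoreMemory()
memory.add_event("e1", "a pot is left on the stove")
memory.add_event("e2", "smoke fills the kitchen")
memory.add_event("e3", "the fire alarm rings")
memory.add_event("e4", "people leave the building")
memory.add_causal_link("e1", "e2")
memory.add_causal_link("e2", "e3")
memory.add_causal_link("e3", "e4")

chain = memory.retrieve("why does the fire alarm ring")
```

In this example the query matches the alarm event most closely, and the two-hop walk in each direction recovers the whole chain from the pot on the stove to people leaving the building. In the framework described above, the event ids in such a chain would then be mapped back to their video evidence and passed to the backbone video model; that step is omitted here.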
Problem

Research questions and friction points this paper is trying to address.

long video reasoning
causal inference
temporal dependencies
coherent memory
infinite video
Innovation

Methods, ideas, or system contributions that make the work stand out.

Event-Causal RAG
State-Event-State graph
causal reasoning
long-video understanding
retrieval-augmented generation