🤖 AI Summary
Existing video understanding models struggle with ultra-long or infinite videos because of high computational cost, fragmented memory, and weak causal modeling. This work proposes Event-Causal RAG, a lightweight retrieval-augmented generation framework that indexes semantically coherent events rather than fixed-length clips. Each event is represented as a State-Event-State (SES) graph, and these graphs are merged into a global Event Knowledge Graph. A dual-store memory supports both semantic matching and causal-topological retrieval, and a bidirectional retrieval strategy identifies relevant event causal chains so that information from multiple events can be combined. On long-video understanding benchmarks, the approach outperforms clip-based retrieval baselines and long-context video models, particularly on questions that require reasoning across long temporal gaps, while also improving memory efficiency and robustness in streaming settings.
📝 Abstract
Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.
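The pipeline described in the abstract (SES graphs, a merged event graph, a dual-store memory, and bidirectional retrieval) can be illustrated with a toy sketch. This is not the authors' implementation. The names `SESGraph`, `EventMemory`, and `causal_chain` are hypothetical, token overlap stands in for whatever embedding-based semantic matching the method uses, and linking events through identical state strings is an assumption made only to keep the example small.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class SESGraph:
    """Hypothetical State-Event-State record: an event plus the states around it."""
    event_id: str
    description: str
    pre_state: str
    post_state: str


class EventMemory:
    """Toy dual-store memory: event descriptions plus state-based causal links."""

    def __init__(self):
        self.events = {}                     # semantic store: event_id -> SESGraph
        self.produced_by = defaultdict(set)  # state -> events that end in that state
        self.consumed_by = defaultdict(set)  # state -> events that start from that state

    def add(self, ses):
        # Merging into the global graph: events become linked via shared states.
        self.events[ses.event_id] = ses
        self.produced_by[ses.post_state].add(ses.event_id)
        self.consumed_by[ses.pre_state].add(ses.event_id)

    def semantic_match(self, query):
        # Assumption: Jaccard overlap of lowercase tokens replaces embedding similarity.
        q = set(query.lower().split())

        def score(ses):
            d = set(ses.description.lower().split())
            return len(q & d) / len(q | d) if (q | d) else 0.0

        return max(self.events.values(), key=score).event_id

    def _walk(self, start, forward, max_hops):
        # Breadth-first walk along causal links, in one direction only.
        seen, order = {start}, []
        queue = deque([(start, 0)])
        while queue:
            event_id, hops = queue.popleft()
            if hops == max_hops:
                continue
            ses = self.events[event_id]
            if forward:
                neighbors = self.consumed_by.get(ses.post_state, set())
            else:
                neighbors = self.produced_by.get(ses.pre_state, set())
            for nxt in sorted(neighbors):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append((nxt, hops + 1))
        return order

    def causal_chain(self, query, max_hops=2):
        # Bidirectional retrieval: find a seed event, then expand to causes and effects.
        seed = self.semantic_match(query)
        causes = self._walk(seed, forward=False, max_hops=max_hops)
        effects = self._walk(seed, forward=True, max_hops=max_hops)
        return causes[::-1] + [seed] + effects


memory = EventMemory()
memory.add(SESGraph("e1", "chef turns on the stove", "stove off", "stove on"))
memory.add(SESGraph("e2", "water boils in the pot", "stove on", "water boiling"))
memory.add(SESGraph("e3", "chef adds pasta to the pot", "water boiling", "pasta cooking"))

chain = memory.causal_chain("why is the water boiling")
print(chain)  # ['e1', 'e2', 'e3']
```

In this sketch the query first selects `e2` by text overlap, and the walk then recovers the upstream cause `e1` and the downstream effect `e3`, even though neither shares distinctive words with the question. That is the kind of multi-event evidence a clip-level index keyed only on similarity would tend to miss. In the framework described above, the retrieved chain and its associated video evidence would then be passed to the backbone video model.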