PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

📅 2026-05-16

🤖 AI Summary
This work addresses three challenges of multimodal memory construction in long-form video reasoning: heterogeneous modality fusion, human-centric information alignment, and cross-granularity evidence aggregation. Inspired by Event Segmentation Theory from cognitive science, it proposes PyraVid, a hierarchical multimodal memory architecture. The framework organizes video content in a coarse-to-fine pyramid and combines a multimodal alignment mechanism with structure-guided memory expansion and pruning. This design captures events that have low semantic similarity but strong causal relationships, while suppressing noise. Experiments on multiple long-video understanding benchmarks show consistent improvements across datasets, model scales, and question types.
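The coarse-to-fine pyramid described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `MemoryNode` class, the `coarse_to_fine` function, the toy two-dimensional embeddings, and the beam-style descent are all assumptions made for illustration. The idea shown is only that a query is matched against coarse summaries first, and finer events are inspected only inside the best-matching coarse segments.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryNode:
    """Hypothetical pyramid node: a video span with a summary embedding."""
    label: str
    start: float  # span start, in seconds
    end: float    # span end, in seconds
    embedding: List[float]
    children: List["MemoryNode"] = field(default_factory=list)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_to_fine(root: MemoryNode, query: List[float], beam: int = 1) -> List[MemoryNode]:
    """Descend level by level, keeping the `beam` best-matching nodes each time."""
    frontier = [root]
    while any(node.children for node in frontier):
        candidates = []
        for node in frontier:
            # Expand nodes that have finer-grained children; keep leaves as they are.
            candidates.extend(node.children if node.children else [node])
        candidates.sort(key=lambda n: cosine(n.embedding, query), reverse=True)
        frontier = candidates[:beam]
    return frontier


# Toy pyramid: whole video -> scenes -> events (all values are made up).
chop = MemoryNode("chop vegetables", 0, 60, [1.0, 0.0])
boil = MemoryNode("boil pasta", 60, 120, [0.8, 0.6])
call = MemoryNode("phone call", 120, 180, [0.0, 1.0])
kitchen = MemoryNode("kitchen scene", 0, 120, [0.9, 0.3], [chop, boil])
living = MemoryNode("living room scene", 120, 180, [0.0, 1.0], [call])
video = MemoryNode("whole video", 0, 180, [0.5, 0.5], [kitchen, living])

hits = coarse_to_fine(video, query=[1.0, 0.1])
print([h.label for h in hits])  # ['chop vegetables']
```

With `beam=1` the search commits to a single scene at each level. A wider beam trades extra comparisons for a lower chance of discarding the right scene early.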
📝 Abstract
Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in real-world applications. Compared with unimodal settings, multimodal memory introduces additional challenges, including heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities. We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related events with strong causal connectivity but low semantic similarity while reducing noise. Experiments on multiple long-video understanding benchmarks show that PyraVid consistently improves performance across datasets, model scales, and question types, highlighting the effectiveness of hierarchical multimodal memory for long-horizon reasoning.
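Structure-guided expansion with pruning can be sketched as a bounded graph traversal. The abstract does not specify the algorithm, so everything here is an assumption: the `expand_with_pruning` function, the `links` dictionary of weighted structural edges, and the `max_hops`, `min_weight`, and `budget` parameters are hypothetical. The sketch shows only the general idea that events are reached through structural links rather than embedding similarity, and that weak links are dropped to limit noise.

```python
from collections import deque
from typing import Dict, List, Tuple


def expand_with_pruning(
    seeds: List[str],
    links: Dict[str, List[Tuple[str, float]]],
    max_hops: int = 2,
    min_weight: float = 0.5,
    budget: int = 5,
) -> List[str]:
    """Breadth-first expansion from seed events over structural links.

    `links` maps an event to (neighbor, weight) pairs, where weight is a
    hypothetical link strength. Links weaker than `min_weight` are pruned,
    and expansion stops after `max_hops` hops or `budget` selected events.
    """
    selected = list(dict.fromkeys(seeds))[:budget]  # de-duplicate, keep order
    seen = set(selected)
    queue = deque((event, 0) for event in selected)
    while queue and len(selected) < budget:
        event, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        # Visit the strongest links first so the budget goes to them.
        neighbors = sorted(links.get(event, []), key=lambda pair: pair[1], reverse=True)
        for neighbor, weight in neighbors:
            if weight < min_weight or neighbor in seen:
                continue  # prune weak links and already-selected events
            seen.add(neighbor)
            selected.append(neighbor)
            queue.append((neighbor, hops + 1))
            if len(selected) >= budget:
                break
    return selected


# Toy event graph (made-up events and weights).
links = {
    "glass shatters": [("ball is thrown", 0.9), ("dog barks", 0.2)],
    "ball is thrown": [("kids play outside", 0.7)],
    "kids play outside": [("sun rises", 0.6)],
}

expanded = expand_with_pruning(["glass shatters"], links)
print(expanded)  # ['glass shatters', 'ball is thrown', 'kids play outside']
```

In this toy graph, "ball is thrown" is retrieved through its link to the seed event even though nothing in the sketch compares their descriptions. "dog barks" is pruned by the weight threshold and "sun rises" by the hop limit.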
Problem

Research questions and friction points this paper is trying to address.

multimodal memory
long-horizon video reasoning
heterogeneous input integration
evidence aggregation
person-centric alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical multimodal memory
long-horizon video reasoning
event segmentation theory
structure-guided memory expansion
evidence aggregation