PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

📅 2026-05-16

🤖 AI Summary

This work addresses three challenges of building multimodal memory for long-form video reasoning: heterogeneous modality fusion, person-centric information alignment, and evidence aggregation across granularities. Drawing on Event Segmentation Theory from cognitive science, it proposes PyraVid, a hierarchical multimodal memory framework that organizes video content into a coarse-to-fine pyramid. The framework combines multimodal alignment with structure-guided memory expansion and pruning, which lets it retrieve events that have strong causal connectivity but low semantic similarity while suppressing noise. Experiments on multiple long-video understanding benchmarks show consistent improvements across datasets, model scales, and question types.

📝 Abstract

Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in real-world applications. Compared with unimodal settings, multimodal memory introduces additional challenges, including heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities. We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related events with strong causal connectivity but low semantic similarity while reducing noise. Experiments on multiple long-video understanding benchmarks show that PyraVid consistently improves performance across datasets, model scales, and question types, highlighting the effectiveness of hierarchical multimodal memory for long-horizon reasoning.
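
The coarse-to-fine access and structure-guided expansion described in the abstract can be illustrated with a toy sketch. This is not PyraVid's implementation: the `MemoryNode` class, the weighted `links` field, the cosine scoring, and the `min_link` pruning threshold are all hypothetical choices made for illustration, and the two-dimensional embeddings are made-up values.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass(eq=False)  # identity comparison, so linked nodes can reference each other safely
class MemoryNode:
    """One pyramid node: a coarse event or a finer sub-event (hypothetical structure)."""
    name: str
    embedding: list[float]
    children: list["MemoryNode"] = field(default_factory=list)
    # Hypothetical structural edges: (target node, connectivity strength in [0, 1]).
    links: list[tuple["MemoryNode", float]] = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(roots, query, top_k=1, min_link=0.5):
    """Descend the pyramid by similarity, then expand along strong links."""
    # Coarse-to-fine: keep the top_k most similar nodes per level and descend.
    frontier, hits = list(roots), []
    while frontier:
        hits = sorted(frontier, key=lambda n: cosine(n.embedding, query),
                      reverse=True)[:top_k]
        frontier = [child for node in hits for child in node.children]

    # Expansion with pruning: follow structural links regardless of semantic
    # similarity, but drop links weaker than min_link to limit noise.
    expanded = list(hits)
    for node in hits:
        for target, strength in node.links:
            if strength >= min_link and target not in expanded:
                expanded.append(target)
    return [node.name for node in expanded]


# Toy memory: two coarse scenes, each with finer events underneath.
spill = MemoryNode("glass knocked over", [0.0, 1.0])
chat = MemoryNode("people chat", [0.6, 0.8])
mop = MemoryNode("person mops floor", [1.0, 0.0])
kitchen = MemoryNode("kitchen scene", [0.9, 0.1], children=[mop])
living = MemoryNode("living room scene", [0.1, 0.9], children=[spill, chat])
mop.links = [(spill, 0.9), (chat, 0.1)]  # strong causal link vs. weak incidental link

result = retrieve([kitchen, living], query=[1.0, 0.0])
print(result)
```

In this toy setup the spill event has zero cosine similarity to the query, so similarity search alone would not surface it. It is returned only because of its strong link to the retrieved mopping event, while the weakly linked chat event is pruned.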

Problem

Research questions and friction points this paper is trying to address.

multimodal memory

long-horizon video reasoning

heterogeneous input integration

evidence aggregation

person-centric alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical multimodal memory

long-horizon video reasoning

event segmentation theory