From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing multimodal large language models in long-horizon video understanding, which are constrained by fixed context windows and static memory mechanisms that struggle to balance fine-grained detail retention with efficient reasoning. Inspired by fuzzy-trace theory from cognitive science, the authors propose MM-Mem, a pyramid-style multimodal memory architecture comprising three tiers (sensory buffer, episodic stream, and symbolic schema) that progressively distills fine-grained perceptual inputs into high-level semantics. They introduce a novel Semantic Information Bottleneck (SIB) objective and an SIB-guided GRPO optimization algorithm to enable dynamic memory distillation, complemented by an entropy-driven top-down retrieval mechanism. Evaluated across four benchmarks, the approach significantly improves both offline and streaming long-video understanding, demonstrating the efficacy and generalizability of cognitively inspired memory modeling.
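The Semantic Information Bottleneck objective is not spelled out on this page, but it presumably builds on the classical Information Bottleneck of Tishby, Pereira, and Bialek. A hedged sketch of that standard form, in notation of our own choosing rather than the paper's, is:

```latex
\min_{p(z \mid x)} \; \underbrace{I(X; Z)}_{\text{compression}} \;-\; \beta \, \underbrace{I(Z; Y)}_{\text{task relevance}}
```

where $X$ denotes the fine-grained perceptual input (verbatim), $Z$ the distilled memory representation (gist), $Y$ the task-relevant target, and $\beta$ controls the trade-off, matching the compression-versus-retention trade-off the summary attributes to the SIB objective.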

📝 Abstract
While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into one of two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. At inference, we design an entropy-driven top-down memory retrieval strategy, which first queries the abstract Symbolic Schema and progressively "drills down" to the Episodic Stream and Sensory Buffer under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating cognition-inspired memory organization. Code is available at https://github.com/EliSpectre/MM-Mem.
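The entropy-driven top-down retrieval described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the tier contents, the `toy_model` answer function, and the `max_entropy` threshold are all invented for the sketch.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of an answer distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def retrieve(query, tiers, answer_model, max_entropy=0.8):
    """Query tiers from most abstract to most detailed ("top-down");
    stop as soon as the answer distribution is confident (low entropy),
    otherwise drill down to the next, finer-grained tier."""
    answer = None
    for name, memory in tiers:  # ordered: schema -> episodic -> sensory
        answer, probs = answer_model(query, memory)
        if entropy(probs) <= max_entropy:
            return answer, name  # confident enough: no need to drill down
    return answer, tiers[-1][0]  # fall back to the finest-grained tier

# Toy stand-in for the agent: confidence grows with the detail a tier holds.
def toy_model(query, memory):
    p = memory["confidence"]
    return memory["answer"], [p, 1.0 - p]

tiers = [
    ("symbolic_schema", {"answer": "a person cooking",    "confidence": 0.55}),
    ("episodic_stream", {"answer": "chopping onions",     "confidence": 0.80}),
    ("sensory_buffer",  {"answer": "chopping red onions", "confidence": 0.97}),
]
answer, used_tier = retrieve("What is happening?", tiers, toy_model)
# The schema's answer is too uncertain (~0.99 bits), so retrieval drills
# down one tier and stops at the episodic stream (~0.72 bits).
```

The design choice mirrored here is that detail is paid for lazily: most queries never touch the expensive fine-grained tiers, which is how such a hierarchy could keep latency low on long videos.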
Problem

Research questions and friction points this paper is trying to address.

long-horizon video understanding
multimodal memory
context window limitation
cognitive efficiency
detail loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Information Bottleneck
Pyramidal Multimodal Memory
Fuzzy-Trace Theory
Long-Horizon Video Understanding
Memory Distillation
Niu Lian
Harbin Institute of Technology, Shenzhen; Tsinghua Shenzhen International Graduate School, Tsinghua University
Yuting Wang
Hanshu Yao
Harbin Institute of Technology, Shenzhen
Jinpeng Wang
Harbin Institute of Technology, Shenzhen
Bin Chen
Harbin Institute of Technology, Shenzhen
Yaowei Wang
The Hong Kong Polytechnic University
Min Zhang
Harbin Institute of Technology, Shenzhen
Shu-Tao Xia
SIGS, Tsinghua University
coding and information theory, machine learning, computer vision, AI security