From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing multimodal large language models in long-horizon video understanding, which are constrained by fixed context windows and static memory mechanisms that struggle to balance fine-grained detail retention with efficient reasoning. Inspired by fuzzy-trace theory from cognitive science, the authors propose MM-Mem, a pyramid-style multimodal memory architecture comprising three tiers (sensory buffer, episodic stream, and symbolic schema) that progressively distills fine-grained perceptual inputs into high-level semantics. They introduce a novel Semantic Information Bottleneck (SIB) objective and an SIB-guided GRPO optimization algorithm to enable dynamic memory distillation, complemented by an entropy-driven top-down retrieval mechanism. Evaluated across four benchmarks, the approach significantly improves both offline and streaming long-video understanding, demonstrating the efficacy and generalizability of cognitively inspired memory modeling.
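The Semantic Information Bottleneck objective is not spelled out on this page, but it presumably builds on the classical Information Bottleneck of Tishby, Pereira, and Bialek. A hedged sketch of that standard form, in notation of our own choosing rather than the paper's, is:

```latex
\min_{p(z \mid x)} \; \underbrace{I(X; Z)}_{\text{compression}} \;-\; \beta \, \underbrace{I(Z; Y)}_{\text{task relevance}}
```

where $X$ denotes the fine-grained perceptual input (verbatim), $Z$ the distilled memory representation (gist), $Y$ the task-relevant target, and $\beta$ controls the trade-off, matching the compression-versus-retention trade-off the summary attributes to the SIB objective.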

📝 Abstract
While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into one of two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. At inference, we design an entropy-driven top-down memory retrieval strategy, which first queries the abstract Symbolic Schema and progressively "drills down" to the Episodic Stream and Sensory Buffer under high uncertainty. Extensive experiments across 4 benchmarks confirm the effectiveness of MM-Mem on both offline and streaming tasks, demonstrating robust generalization and validating cognition-inspired memory organization. Code is available at https://github.com/EliSpectre/MM-Mem.
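The entropy-driven top-down retrieval described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the tier contents, the `toy_model` answer function, and the `max_entropy` threshold are all invented for the sketch.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of an answer distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def retrieve(query, tiers, answer_model, max_entropy=0.8):
    """Query tiers from most abstract to most detailed ("top-down");
    stop as soon as the answer distribution is confident (low entropy),
    otherwise drill down to the next, finer-grained tier."""
    answer = None
    for name, memory in tiers:  # ordered: schema -> episodic -> sensory
        answer, probs = answer_model(query, memory)
        if entropy(probs) <= max_entropy:
            return answer, name  # confident enough: no need to drill down
    return answer, tiers[-1][0]  # fall back to the finest-grained tier

# Toy stand-in for the agent: confidence grows with the detail a tier holds.
def toy_model(query, memory):
    p = memory["confidence"]
    return memory["answer"], [p, 1.0 - p]

tiers = [
    ("symbolic_schema", {"answer": "a person cooking",    "confidence": 0.55}),
    ("episodic_stream", {"answer": "chopping onions",     "confidence": 0.80}),
    ("sensory_buffer",  {"answer": "chopping red onions", "confidence": 0.97}),
]
answer, used_tier = retrieve("What is happening?", tiers, toy_model)
# The schema's answer is too uncertain (~0.99 bits), so retrieval drills
# down one tier and stops at the episodic stream (~0.72 bits).
```

The design choice mirrored here is that detail is paid for lazily: most queries never touch the expensive fine-grained tiers, which is how such a hierarchy could keep latency low on long videos.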
Problem

Research questions and friction points this paper is trying to address.

long-horizon video understanding
multimodal memory
context window limitation
cognitive efficiency
detail loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic Information Bottleneck
Pyramidal Multimodal Memory
Fuzzy-Trace Theory
Long-Horizon Video Understanding
Memory Distillation
Niu Lian
Harbin Institute of Technology, Shenzhen; Tsinghua Shenzhen International Graduate School, Tsinghua University
Yuting Wang
Hanshu Yao
Harbin Institute of Technology, Shenzhen
Jinpeng Wang
Harbin Institute of Technology, Shenzhen
Bin Chen
Harbin Institute of Technology, Shenzhen
Yaowei Wang
The Hong Kong Polytechnic University
Min Zhang
Harbin Institute of Technology, Shenzhen
Shu-Tao Xia
SIGS, Tsinghua University
coding and information theory, machine learning, computer vision, AI security