HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-term audiovisual event understanding faces two core challenges: temporal integration and cross-modal association. To address these, we propose a biologically inspired dynamic multimodal memory architecture, grounded in hippocampal neurocomputational principles. Our method introduces (1) a novel hippocampal-style pattern separation and completion mechanism tailored for continuous audiovisual streams; (2) a short-to-long-term memory consolidation paradigm spanning perceptual details to semantic abstractions; and (3) a bidirectional cross-modal associative pathway enabling reciprocal retrieval. Technically, it integrates adaptive temporal segmentation, dual-process memory encoding, neuroscience-informed representation learning, and dynamic index construction. Evaluated on the HippoVlog benchmark, our approach achieves 78.2% accuracy—surpassing prior state-of-the-art by 14 percentage points—and reduces inference latency to 20.4 seconds, a 5.5× speedup. This work advances brain-inspired multimodal memory modeling for long-horizon event understanding.
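The paper does not spell out its segmentation algorithm here, but the "adaptive temporal segmentation" it mentions can be illustrated with a minimal sketch: given one embedding per audiovisual window, start a new event segment whenever consecutive windows become dissimilar. The `threshold` parameter and the cosine-similarity criterion are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def segment_stream(embeddings: np.ndarray, threshold: float = 0.5):
    """Split a stream of per-window embeddings into event segments.

    A new segment begins whenever the cosine similarity between
    consecutive windows drops below `threshold` (a stand-in for
    HippoMM's adaptive boundary criterion, whose details are not
    given in this summary). Returns a list of index arrays, one
    per segment.
    """
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = np.sum(norms[:-1] * norms[1:], axis=1)  # consecutive cosine similarity
    boundaries = np.where(sims < threshold)[0] + 1  # first index of each new segment
    return np.split(np.arange(len(embeddings)), boundaries)
```

With two visually distinct halves of a stream, the function yields two segments; in practice the embeddings would come from a pretrained audiovisual encoder.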

📝 Abstract
Comprehending extended audiovisual experiences remains a fundamental challenge for computational systems. Current approaches struggle with temporal integration and cross-modal associations that humans accomplish effortlessly through hippocampal-cortical networks. We introduce HippoMM, a biologically-inspired architecture that transforms hippocampal mechanisms into computational advantages for multimodal understanding. HippoMM implements three key innovations: (i) hippocampus-inspired pattern separation and completion specifically designed for continuous audiovisual streams, (ii) short-to-long term memory consolidation that transforms perceptual details into semantic abstractions, and (iii) cross-modal associative retrieval pathways enabling modality-crossing queries. Unlike existing retrieval systems with static indexing schemes, HippoMM dynamically forms integrated episodic representations through adaptive temporal segmentation and dual-process memory encoding. Evaluations on our challenging HippoVlog benchmark demonstrate that HippoMM significantly outperforms state-of-the-art approaches (78.2% vs. 64.2% accuracy) while providing substantially faster response times (20.4s vs. 112.5s). Our results demonstrate that translating neuroscientific memory principles into computational architectures provides a promising foundation for next-generation multimodal understanding systems. The code and benchmark dataset are publicly available at https://github.com/linyueqian/HippoMM.
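The "cross-modal associative retrieval pathways" described above can be pictured with a toy bidirectional index: paired audio and visual embeddings (assumed to live in a shared space, e.g. from a contrastive encoder) are stored together, and a query from either modality retrieves the counterpart memory. The class name, storage layout, and nearest-neighbor scoring below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class CrossModalIndex:
    """Toy bidirectional associative index (illustrative sketch only).

    Assumes audio and visual embeddings share one space, so an
    audio query can be scored directly against stored visual
    memories, and vice versa.
    """

    def __init__(self):
        self.audio, self.visual, self.meta = [], [], []

    def add(self, audio_emb, visual_emb, label):
        # Store the paired embeddings plus a label for the episode.
        self.audio.append(audio_emb)
        self.visual.append(visual_emb)
        self.meta.append(label)

    def query(self, emb, from_modality: str):
        # Search the *opposite* modality's bank: an audio query
        # retrieves visual memories, and vice versa.
        bank = np.array(self.visual if from_modality == "audio" else self.audio)
        q = emb / np.linalg.norm(emb)
        bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        best = int(np.argmax(bank @ q))
        return self.meta[best]
```

This is what "modality-crossing queries" means operationally: the query and the retrieved memory come from different modalities but meet in a shared representation.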
Problem

Research questions and friction points this paper is trying to address.

Understanding long audiovisual events remains a core challenge for computational systems
Integrating information over time and forming cross-modal associations in multimodal data
Building dynamic episodic representations that support adaptive memory encoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hippocampus-inspired pattern separation and completion for continuous audiovisual streams
Short-to-long-term memory consolidation that abstracts perceptual detail into semantics
Cross-modal associative retrieval pathways supporting modality-crossing queries
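The first innovation above has a classic computational reading: pattern separation makes similar inputs more distinct (as the dentate gyrus is thought to do via sparse coding), and pattern completion recovers a full stored memory from a partial cue (as in CA3 attractor dynamics). The top-k sparsification and overlap-based recall below are a standard textbook sketch of these ideas, not HippoMM's specific mechanism.

```python
import numpy as np

def pattern_separate(x: np.ndarray, k: int = 3) -> np.ndarray:
    """Sparsify an input: keep only the k strongest units (DG-style separation)."""
    code = np.zeros_like(x)
    code[np.argsort(x)[-k:]] = 1.0
    return code

def pattern_complete(cue: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Recall the stored code with greatest overlap to a partial cue
    (CA3-style completion). `memory` holds one stored code per row."""
    overlaps = memory @ cue
    return memory[int(np.argmax(overlaps))]
```

Even with one active unit deleted from the cue, completion still recovers the full stored code, which is the behavior the hippocampal analogy is meant to buy for degraded or partial audiovisual queries.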