🤖 AI Summary
Long-term audiovisual event understanding faces two core challenges: temporal integration and cross-modal association. To address these, we propose a biologically inspired dynamic multimodal memory architecture, grounded in hippocampal neurocomputational principles. Our method introduces (1) a novel hippocampal-style pattern separation and completion mechanism tailored for continuous audiovisual streams; (2) a short-to-long-term memory consolidation paradigm spanning perceptual details to semantic abstractions; and (3) a bidirectional cross-modal associative pathway enabling reciprocal retrieval. Technically, it integrates adaptive temporal segmentation, dual-process memory encoding, neuroscience-informed representation learning, and dynamic index construction. Evaluated on the HippoVlog benchmark, our approach achieves 78.2% accuracy—surpassing prior state-of-the-art by 14 percentage points—and reduces inference latency to 20.4 seconds, a 5.5× speedup. This work advances brain-inspired multimodal memory modeling for long-horizon event understanding.
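The adaptive temporal segmentation mentioned above can be illustrated with a minimal sketch: split a continuous stream into event segments wherever the similarity between consecutive frame embeddings drops below a threshold. This is a hypothetical toy implementation for intuition only, not the paper's actual algorithm; the function names and the fixed-threshold rule are assumptions.

```python
import math

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def segment_stream(embeddings, threshold=0.7):
    """Toy boundary detector (hypothetical, not HippoMM's method):
    start a new segment whenever similarity to the previous frame
    embedding falls below `threshold`."""
    if not embeddings:
        return []
    segments, start = [], 0
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            segments.append((start, i - 1))  # close current segment
            start = i                        # open a new one
    segments.append((start, len(embeddings) - 1))
    return segments

# e.g. an abrupt feature change between frames 1 and 2 yields two segments
print(segment_stream([[1, 0], [1, 0.1], [0, 1], [0, 1.1]]))
# → [(0, 1), (2, 3)]
```

A real system would operate on learned audiovisual embeddings and likely use an adaptive rather than fixed threshold, but the boundary-by-dissimilarity idea is the same.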
📝 Abstract
Comprehending extended audiovisual experiences remains a fundamental challenge for computational systems. Current approaches struggle with the temporal integration and cross-modal association that humans accomplish effortlessly through hippocampal-cortical networks. We introduce HippoMM, a biologically inspired architecture that transforms hippocampal mechanisms into computational advantages for multimodal understanding. HippoMM implements three key innovations: (i) hippocampus-inspired pattern separation and completion specifically designed for continuous audiovisual streams, (ii) short-to-long-term memory consolidation that transforms perceptual details into semantic abstractions, and (iii) cross-modal associative retrieval pathways enabling modality-crossing queries. Unlike existing retrieval systems with static indexing schemes, HippoMM dynamically forms integrated episodic representations through adaptive temporal segmentation and dual-process memory encoding. Evaluations on our challenging HippoVlog benchmark demonstrate that HippoMM significantly outperforms state-of-the-art approaches (78.2% vs. 64.2% accuracy) while providing substantially faster response times (20.4s vs. 112.5s). Our results demonstrate that translating neuroscientific memory principles into computational architectures provides a promising foundation for next-generation multimodal understanding systems. The code and benchmark dataset are publicly available at https://github.com/linyueqian/HippoMM.
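The bidirectional cross-modal retrieval pathway described in the abstract can be sketched as a toy index over segments that carry both an audio and a visual embedding in a shared space: a query in one modality finds the best-matching segment, from which the paired content of the other modality can then be recovered. All class and method names here are hypothetical illustrations, not the paper's API.

```python
import math

def _cos(a, b):
    # cosine similarity between two vectors in the shared space
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class CrossModalIndex:
    """Toy bidirectional index (hypothetical sketch): each segment
    stores an audio and a visual embedding; querying with either
    modality returns the id of the best-matching segment."""

    def __init__(self):
        self.segments = []  # list of (seg_id, audio_vec, visual_vec)

    def add(self, seg_id, audio_vec, visual_vec):
        self.segments.append((seg_id, audio_vec, visual_vec))

    def query(self, vec, source="audio"):
        # match against the queried modality's embeddings
        idx = 1 if source == "audio" else 2
        best = max(self.segments, key=lambda s: _cos(vec, s[idx]))
        return best[0]

index = CrossModalIndex()
index.add("seg_a", [1, 0], [0, 1])
index.add("seg_b", [0, 1], [1, 0])
# an audio query close to seg_b's audio embedding retrieves seg_b,
# whose paired visual content can then be returned
print(index.query([0.1, 0.9], source="audio"))  # → seg_b
```

In the actual system the shared space would come from learned encoders and the index would be built dynamically during consolidation; this sketch only shows the retrieval-by-either-modality pattern.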