Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address confabulation and the resulting performance degradation in streaming video understanding, caused by multimodal large language models' (MLLMs) reliance on predicted historical memory, this paper first identifies and formalizes the memory-driven confabulation phenomenon. The authors propose a confabulation-aware memory correction framework comprising streaming visual event modeling, predictive memory injection, online confabulation detection, and bias-mitigating memory refinement. Evaluated on multiple streaming video understanding benchmarks, the method significantly improves event reasoning accuracy while reducing memory-induced confabulation rates by over 40%, enabling more robust temporal event understanding. The core contributions are threefold: (1) the first formal definition of memory-induced confabulation in MLLMs; (2) a learnable, end-to-end memory correction paradigm; and (3) empirical validation of its effectiveness and generalizability in realistic streaming scenarios.

📝 Abstract
Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos, where a video is treated as a sequence of visual events, remains underexplored. Intuitively, leveraging past events as memory can enrich contextual and temporal understanding of the current event. In this paper, we show that leveraging memories as contexts helps MLLMs better understand video events. However, because such memories rely on predictions of preceding events, they may contain misinformation, leading to confabulation and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates confabulated memory for memory-enhanced event understanding.
Problem

Research questions and friction points this paper is trying to address.

Enhance video event understanding using memory.
Address misinformation in memory for MLLMs.
Mitigate confabulation in streaming video analysis.
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLMs for video event understanding
Memory contextualizes streaming events
Confabulation-aware memory modification method
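The idea above can be illustrated with a toy sketch. This is not the paper's actual method; it is a minimal, hypothetical Python illustration of the core loop: past event predictions are stored in a memory bank, and low-confidence (possibly confabulated) entries are filtered out before being reused as context for the next event. All class and function names here are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    event_label: str   # predicted label for a past event
    confidence: float  # model confidence in that prediction


class ConfabulationAwareMemory:
    """Toy memory bank: stores predicted past events and drops
    low-confidence (possibly confabulated) entries before they
    are surfaced as context."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.entries = []

    def add(self, label: str, confidence: float) -> None:
        self.entries.append(MemoryEntry(label, confidence))

    def context(self):
        # Only confident memories are reused as context.
        return [e.event_label for e in self.entries
                if e.confidence >= self.threshold]


def process_stream(events, memory):
    """Process (predicted_label, confidence) pairs in streaming order.
    Each step stands in for an MLLM call that would condition on
    memory.context()."""
    outputs = []
    for label, conf in events:
        ctx = memory.context()   # confident past events as context
        outputs.append((label, ctx))
        memory.add(label, conf)  # store the new, possibly wrong, prediction
    return outputs
```

For example, with a stream `[("open_door", 0.9), ("pour_water", 0.2), ("drink", 0.8)]`, the low-confidence `"pour_water"` prediction is stored but never surfaced as context for later events, which is the intuition behind preventing confabulated memories from misleading subsequent reasoning.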
👥 Authors
Gengyuan Zhang (LMU Munich, MCML): Multimodal Learning, Video Understanding, Vision-Language Models
Mingcong Ding (Ludwig-Maximilians-Universität München, Germany; Munich Center for Machine Learning, Germany)
Tong Liu (Ludwig-Maximilians-Universität München, Germany; Munich Center for Machine Learning, Germany)
Yao Zhang (Ludwig-Maximilians-Universität München, Germany; Munich Center for Machine Learning, Germany)
Volker Tresp (Ludwig-Maximilians-Universität München, LMU Munich): Machine Learning, Artificial Intelligence, Computational Cognitive Neuroscience, Knowledge Graphs