🤖 AI Summary
This paper introduces Memory-QA, the task of answering recall questions about visual content from previously stored multimodal memories, and proposes the Pensieve framework to address its core challenges: constructing task-oriented memories, exploiting the temporal and spatial structure within those memories, and drawing on multiple memories to answer a single recall question. Methodologically, Pensieve combines a task-oriented memory construction mechanism, a time- and location-aware multi-signal retrieval module, and an end-to-end QA fine-tuning strategy that jointly conditions on multiple retrieved memories. The core contribution lies in unifying the spatiotemporal structure and semantic associations of memories to enable cross-memory reasoning. Evaluated on a newly constructed multimodal Memory-QA benchmark, Pensieve significantly outperforms existing methods, improving answer accuracy by up to 14% and demonstrating its effectiveness and generalizability for long-horizon, fine-grained visual recall in real-world scenarios.
📝 Abstract
We introduce Memory-QA, a novel real-world task that involves answering recall questions about visual content from previously stored multimodal memories. This task poses unique challenges, including the creation of task-oriented memories, the effective utilization of temporal and location information within memories, and the ability to draw upon multiple memories to answer a recall question. To address these challenges, we propose a comprehensive pipeline, Pensieve, integrating memory-specific augmentation, time- and location-aware multi-signal retrieval, and multi-memory QA fine-tuning. We created a multimodal benchmark to illustrate various real challenges in this task, and show the superior performance of Pensieve over state-of-the-art solutions (up to 14% in QA accuracy).
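The abstract names a time- and location-aware multi-signal retrieval module but gives no detail. The sketch below is a minimal, hypothetical illustration of how such multi-signal scoring might combine semantic similarity with temporal and location cues; the `Memory` fields, signal weights, and exponential time decay are all assumptions for illustration, not the paper's actual design.

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    embedding: list   # semantic embedding of the stored content (assumed)
    timestamp: float  # capture time, seconds since epoch (assumed)
    location: str     # coarse place label, e.g. "kitchen" (assumed)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def score_memory(query_emb, query_time, query_loc, mem,
                 w_sem=0.6, w_time=0.3, w_loc=0.1, tau=86400.0):
    """Fuse semantic, temporal, and location signals into one retrieval score.

    Weights and the one-day decay constant tau are arbitrary placeholders.
    """
    sem = cosine(query_emb, mem.embedding)
    # Temporal signal decays exponentially with the gap to the queried time.
    time_sig = (math.exp(-abs(query_time - mem.timestamp) / tau)
                if query_time is not None else 0.0)
    # Location signal: exact match on the coarse place label.
    loc_sig = 1.0 if query_loc and query_loc == mem.location else 0.0
    return w_sem * sem + w_time * time_sig + w_loc * loc_sig

def retrieve(memories, query_emb, query_time=None, query_loc=None, k=3):
    """Return the top-k memories under the fused score."""
    ranked = sorted(memories,
                    key=lambda m: score_memory(query_emb, query_time,
                                               query_loc, m),
                    reverse=True)
    return ranked[:k]
```

With two semantically identical memories, the one matching the queried time and place ranks first, which is the behavior a multi-signal retriever is meant to add over purely semantic search.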