🤖 AI Summary
This paper introduces Memory-QA, the task of answering recall questions about visual content from previously stored multimodal memories, and proposes the Pensieve framework to address its core challenges: constructing task-oriented memories, exploiting the temporal and spatial structure within those memories, and drawing on multiple memories to answer a single recall question. Methodologically, Pensieve combines a task-oriented memory construction mechanism, a time- and location-aware multi-signal retrieval module, and an end-to-end QA fine-tuning strategy that jointly conditions on multiple retrieved memories. The core contribution lies in unifying the spatiotemporal structure and semantic associations of memories to enable cross-memory reasoning. Evaluated on a newly constructed multimodal Memory-QA benchmark, Pensieve significantly outperforms existing methods, improving answer accuracy by up to 14% and demonstrating its effectiveness and generalizability for long-horizon, fine-grained visual recall in real-world scenarios.
📝 Abstract
We introduce Memory-QA, a novel real-world task that involves answering recall questions about visual content from previously stored multimodal memories. This task poses unique challenges, including the creation of task-oriented memories, the effective utilization of temporal and location information within memories, and the ability to draw upon multiple memories to answer a recall question. To address these challenges, we propose a comprehensive pipeline, Pensieve, integrating memory-specific augmentation, time- and location-aware multi-signal retrieval, and multi-memory QA fine-tuning. We created a multimodal benchmark to illustrate various real challenges in this task, and show the superior performance of Pensieve over state-of-the-art solutions (up to 14% in QA accuracy).
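The abstract names a time- and location-aware multi-signal retrieval module but gives no detail. The sketch below is a minimal, hypothetical illustration of how such multi-signal scoring might combine semantic similarity with temporal and location cues; the `Memory` fields, signal weights, and exponential time decay are all assumptions for illustration, not the paper's actual design.

```python
import math
from dataclasses import dataclass

@dataclass
class Memory:
    embedding: list   # semantic embedding of the stored content (assumed)
    timestamp: float  # capture time, seconds since epoch (assumed)
    location: str     # coarse place label, e.g. "kitchen" (assumed)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def score_memory(query_emb, query_time, query_loc, mem,
                 w_sem=0.6, w_time=0.3, w_loc=0.1, tau=86400.0):
    """Fuse semantic, temporal, and location signals into one retrieval score.

    Weights and the one-day decay constant tau are arbitrary placeholders.
    """
    sem = cosine(query_emb, mem.embedding)
    # Temporal signal decays exponentially with the gap to the queried time.
    time_sig = (math.exp(-abs(query_time - mem.timestamp) / tau)
                if query_time is not None else 0.0)
    # Location signal: exact match on the coarse place label.
    loc_sig = 1.0 if query_loc and query_loc == mem.location else 0.0
    return w_sem * sem + w_time * time_sig + w_loc * loc_sig

def retrieve(memories, query_emb, query_time=None, query_loc=None, k=3):
    """Return the top-k memories under the fused score."""
    ranked = sorted(memories,
                    key=lambda m: score_memory(query_emb, query_time,
                                               query_loc, m),
                    reverse=True)
    return ranked[:k]
```

With two semantically identical memories, the one matching the queried time and place ranks first, which is the behavior a multi-signal retriever is meant to add over purely semantic search.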