Memory-QA: Answering Recall Questions Based on Multimodal Memories

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses Memory-QA, the task of answering recall questions about visual content stored in multimodal memories, and proposes the Pensieve framework to leverage temporally and spatially structured memories for complex recall questions. Methodologically, it introduces a task-oriented memory construction mechanism, a time- and location-aware multi-signal retrieval module, and an end-to-end QA fine-tuning strategy that jointly conditions on multiple retrieved memories. The core contribution lies in unifying the spatiotemporal structure and semantic associations of memories to enable reasoning across memory fragments. Evaluated on a newly constructed multimodal recall-QA benchmark, Pensieve significantly outperforms existing methods, achieving up to a 14% absolute improvement in answer accuracy and demonstrating its effectiveness for long-horizon, fine-grained visual recall in real-world scenarios.

📝 Abstract
We introduce Memory-QA, a novel real-world task that involves answering recall questions about visual content from previously stored multimodal memories. This task poses unique challenges, including the creation of task-oriented memories, the effective utilization of temporal and location information within memories, and the ability to draw upon multiple memories to answer a recall question. To address these challenges, we propose a comprehensive pipeline, Pensieve, integrating memory-specific augmentation, time- and location-aware multi-signal retrieval, and multi-memory QA fine-tuning. We created a multimodal benchmark to illustrate various real challenges in this task, and show the superior performance of Pensieve over state-of-the-art solutions (up to 14% on QA accuracy).
Problem

Research questions and friction points this paper is trying to address.

Answering recall questions using stored multimodal visual memories
Creating task-oriented memories with temporal and location information
Leveraging multiple memories to respond to complex recall queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-specific augmentation for task-oriented memories
Time- and location-aware multi-signal retrieval
Multi-memory QA fine-tuning for recall questions
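To make the retrieval idea concrete, here is a minimal sketch of time- and location-aware multi-signal retrieval: a semantic similarity score is combined with a temporal-decay signal and a location-match signal before selecting the top-k memories for multi-memory QA. All names, weights, and the scoring formula are illustrative assumptions, not Pensieve's actual design.

```python
from dataclasses import dataclass
from datetime import datetime
import math

@dataclass
class Memory:
    # One stored multimodal memory fragment (illustrative schema,
    # not the paper's actual memory format).
    caption_embedding: list  # embedding of the task-oriented caption
    timestamp: datetime
    location: str

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def score(memory, query_embedding, query_time=None, query_location=None,
          w_sem=1.0, w_time=0.5, w_loc=0.5, tau_days=30.0):
    """Combine semantic, temporal, and location signals into a single
    retrieval score (hypothetical weighting scheme)."""
    s = w_sem * cosine(memory.caption_embedding, query_embedding)
    if query_time is not None:
        age_days = abs((memory.timestamp - query_time).total_seconds()) / 86400
        s += w_time * math.exp(-age_days / tau_days)  # decays with time gap
    if query_location is not None:
        s += w_loc * (1.0 if memory.location == query_location else 0.0)
    return s

def retrieve(memories, query_embedding, k=3, **signals):
    # Return the top-k memories; a multi-memory QA model would then
    # condition on all k retrieved fragments jointly when answering.
    return sorted(memories,
                  key=lambda m: score(m, query_embedding, **signals),
                  reverse=True)[:k]
```

In this sketch, a recall question like "Where did I leave my keys last Tuesday?" would supply both a query embedding and a query time, so memories that are semantically relevant and temporally close are ranked highest.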
Authors

Hongda Jiang (Meta Reality Labs)
Xinyuan Zhang (Meta Reality Labs)
Siddhant Garg (Meta Reality Labs)
Rishab Arora (Meta Reality Labs)
Shiun-Zu Kuo (Meta Reality Labs)
Jiayang Xu (University of Michigan, Aerospace Engineering; Reduced Order Modeling in CFD)
Christopher Brossman (Meta Reality Labs)
Yue Liu (Meta Reality Labs)
Aaron Colak (Meta Reality Labs)
Ahmed Aly (MBZUAI; Computer Vision)
Anuj Kumar (Meta Reality Labs)
Xin Luna Dong (ACM / IEEE Fellow, Principal Scientist at Meta; Knowledge graph, Data quality, NLP, Search)