MMA: Multimodal Memory Agent

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses overconfidence errors in retrieval-augmented generation (RAG)-based multimodal agents, which often arise from reliance on outdated, low-credibility, or conflicting external memory. To mitigate this, the authors propose the Multimodal Memory Agent (MMA), which incorporates a dynamic reliability scoring mechanism that reweights retrieved evidence by jointly considering source credibility, temporal decay, and conflict-aware consensus, and abstains from answering when support is insufficient. The study further uncovers a previously unreported "visual placebo effect" in RAG agents and introduces MMA-Bench, a controllable evaluation benchmark. Experiments show that MMA reduces prediction variance by 35.2% on FEVER without sacrificing accuracy, improves actionable accuracy under the LoCoMo safety configuration, and achieves 41.18% Type-B accuracy in the visual modality of MMA-Bench, substantially outperforming baseline methods, which score 0.0%.
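The summary describes reliability scoring at a high level only. A minimal sketch of what such a mechanism could look like, assuming an illustrative weighted combination of source credibility, exponential temporal decay, and agreement among retrieved items (the weights, the decay form, and the abstention threshold are all assumptions, not the paper's actual formulation):

```python
import math
import time

def reliability_score(item, now=None, half_life_days=30.0,
                      w_cred=0.4, w_time=0.3, w_consensus=0.3):
    """Score one retrieved memory item in [0, 1].

    `item` is a dict with:
      - 'credibility': source credibility in [0, 1]
      - 'timestamp':   UNIX time the item was written
      - 'agree' / 'disagree': counts of other retrieved items that
        support or contradict this one (conflict-aware consensus)
    The weights and exponential decay are illustrative choices.
    """
    now = time.time() if now is None else now
    age_days = max(0.0, (now - item['timestamp']) / 86400.0)
    decay = math.exp(-math.log(2) * age_days / half_life_days)
    total = item['agree'] + item['disagree']
    consensus = item['agree'] / total if total else 0.5  # neutral if no overlap
    return (w_cred * item['credibility']
            + w_time * decay
            + w_consensus * consensus)

def answer_or_abstain(items, threshold=0.5, now=None):
    """Reweight retrieved evidence by reliability; abstain when support is weak."""
    if not items:
        return None, []
    weighted = [(reliability_score(it, now=now), it) for it in items]
    if max(score for score, _ in weighted) < threshold:
        return None, weighted  # abstain: no item is reliable enough
    return max(weighted, key=lambda pair: pair[0])[1], weighted
```

For example, a fresh item from a fully credible, uncontradicted source scores 1.0 and is answered on, while a year-old, zero-credibility, contradicted item scores near 0 and triggers abstention.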

📝 Abstract
Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
Problem

Research questions and friction points this paper is trying to address.

multimodal agents
external memory
reliability
conflicting information
retrieval errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Memory Agent
dynamic reliability scoring
conflict-aware consensus
Visual Placebo Effect
MMA-Bench
Authors
Yihao Lu, School of Computer Science, Peking University
Wanru Cheng, School of Computer Science, Peking University
Zeyu Zhang, Gaoling School of Artificial Intelligence, Renmin University of China (LLM-based Agent, Responsible RecSys, Causal Learning)
Hao Tang, Peking University (computer vision)