🤖 AI Summary
This work addresses overconfidence errors in retrieval-augmented generation (RAG)-based multimodal agents, which often arise from reliance on outdated, low-credibility, or conflicting external memory. To mitigate this, the authors propose the Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by jointly considering source credibility, temporal decay, and conflict-aware consensus, reweights evidence accordingly, and abstains from answering when support is insufficient. The study further uncovers a previously unreported "Visual Placebo Effect" in RAG agents and introduces MMA-Bench, a controllable evaluation benchmark. Experiments show that MMA reduces prediction variance by 35.2% on FEVER without sacrificing accuracy, improves actionable accuracy under the LoCoMo safety configuration, and reaches 41.18% Type-B accuracy in MMA-Bench's Vision mode, where baseline methods score 0.0%.
📝 Abstract
Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: https://github.com/AIGeeksGroup/MMA.
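The score-reweight-abstain loop described in the abstract can be sketched as follows. This is a minimal illustration only: the multiplicative combination, the exponential decay form, the field names, and the abstention threshold are all assumptions for exposition, not MMA's actual formulation.

```python
import math

def reliability_score(credibility, age_days, consensus, decay_rate=0.05):
    """Combine source credibility, temporal decay, and conflict-aware
    consensus into a single reliability weight.

    The multiplicative form and the exponential decay are illustrative
    assumptions, not the paper's published formula.
    """
    temporal = math.exp(-decay_rate * age_days)  # older memory weighs less
    return credibility * temporal * consensus

def answer_or_abstain(items, threshold=0.5):
    """Reweight retrieved memory items by reliability and abstain when
    total support falls below a threshold (hypothetical decision rule)."""
    scored = [(reliability_score(i["cred"], i["age"], i["consensus"]), i)
              for i in items]
    support = sum(score for score, _ in scored)
    if support < threshold:
        return None  # abstain: insufficient reliable support
    # answer from the single most reliable item (simplification; a real
    # agent would condition generation on all reweighted evidence)
    return max(scored, key=lambda pair: pair[0])[1]["claim"]
```

For example, a fresh, high-credibility item with strong consensus dominates a stale, low-credibility one, while a retrieval set containing only stale, low-credibility items drops below the threshold and triggers abstention.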