Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

📅 2026-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a gap in evaluating personal AI agents, which must answer reliably, or abstain, when their memories come from multiple sources that conflict or are incomplete. The authors introduce a diagnostic testbed designed to separate errors caused by the evidence a method receives from errors in its conflict-resolution step. The benchmark has controlled source distortions and deterministic ground truth, and comprises 34,560 instances built from 18 question templates across 8 reasoning types, 480 personas, and 4 random seeds. Among the no-source baselines, single-source baselines, structured fusion methods, and frontier large language models (LLMs) evaluated, the best trained fusion resolver reaches 80.3% accuracy, against 70.0% for the strongest prompt-only LLM baseline. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage, while the best LLM reaches 71.0% selective accuracy at 95.4% coverage.
📝 Abstract
Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence, and cannot simply retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate no-source baselines, single-source baselines, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage, and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.
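The abstract reports selective accuracy alongside coverage without defining either. The sketch below assumes the common selective-prediction reading: coverage is the fraction of questions a system answers rather than abstains on, and selective accuracy is accuracy over the answered questions only. The `Prediction` class, `selective_metrics` function, and toy data are hypothetical illustrations, not the authors' released code. The final lines check that the reported instance count matches 18 templates x 480 personas x 4 seeds, assuming each template is instantiated once per persona and seed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Prediction:
    """One system output (hypothetical structure, for illustration only)."""
    answer: Optional[str]  # None means the system abstained
    gold: str              # ground-truth answer


def selective_metrics(preds: Sequence[Prediction]) -> tuple[float, float]:
    """Return (coverage, selective_accuracy) under the assumed definitions."""
    if not preds:
        raise ValueError("need at least one prediction")
    # Keep only the questions the system chose to answer.
    answered = [p for p in preds if p.answer is not None]
    coverage = len(answered) / len(preds)
    if not answered:
        # Selective accuracy is undefined when everything is abstained on.
        return coverage, float("nan")
    correct = sum(p.answer == p.gold for p in answered)
    return coverage, correct / len(answered)


# Toy data: three answered questions (two correct) and one abstention.
preds = [
    Prediction("Berlin", "Berlin"),
    Prediction(None, "Tuesday"),
    Prediction("blue", "green"),
    Prediction("42", "42"),
]
coverage, sel_acc = selective_metrics(preds)
print(f"coverage={coverage:.3f} selective_accuracy={sel_acc:.3f}")

# Size check: templates x personas x seeds (assumed full cross product).
n_instances = 18 * 480 * 4
print(n_instances)
```

Under this reading, two systems are only directly comparable when both numbers are considered together: a method can raise selective accuracy by abstaining more often, which lowers coverage.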
Problem

Research questions and friction points this paper is trying to address.

selective QA
conflicting memory
multi-source memory
personal AI agents
abstention
Innovation

Methods, ideas, or system contributions that make the work stand out.

selective QA
multi-source memory
conflict resolution
abstention
diagnostic benchmark