🤖 AI Summary
This work addresses long-term memory question answering, where LLM agents struggle to extract answer-supporting evidence that is scattered across long conversational histories and buried in irrelevant content. The authors propose DeferMem, a framework that defers evidence distillation to query time, decoupling memory processing into high-recall candidate retrieval followed by query-conditioned evidence distillation. DeferMem combines a lightweight segment-link structure over the raw history with a memory distiller trained using DistillPO, a reinforcement learning algorithm that relies on a decomposed-and-gated reward pipeline and structure-aligned advantage assignment to produce faithful, self-contained evidence. On the LoCoMo and LongMemEval-S benchmarks, DeferMem surpasses strong baselines, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.
📝 Abstract
Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.
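The abstract does not give the details of the reward pipeline, so the following is only a rough sketch of how a gated, decomposed reward and span-level advantage assignment could fit together. Everything in it is an assumption made for illustration: the names `RewardParts`, `gated_rewards`, and `span_advantages`, the two components (`selection`, `rewrite`), the penalty value, and the group-normalized advantage are hypothetical and are not taken from DistillPO itself.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RewardParts:
    """Hypothetical per-rollout signals; not the paper's actual reward terms."""
    valid_format: bool      # output parses into a selection part and a rewrite part
    selection_score: float  # quality of the selected messages, in [0, 1]
    rewrite_score: float    # quality of the rewritten evidence, in [0, 1]
    answer_correct: bool    # task-level QA correctness


def gated_rewards(p: RewardParts) -> dict:
    """Gate quality rewards behind the validity check (assumed scheme)."""
    if not p.valid_format:
        # Assumed penalty: an invalid output earns no quality reward at all.
        return {"selection": -1.0, "rewrite": -1.0}
    # Assumed: task correctness is added as soon as the output is valid.
    task = 1.0 if p.answer_correct else 0.0
    return {"selection": p.selection_score + task,
            "rewrite": p.rewrite_score + task}


def group_normalize(values):
    """Center and scale rewards within a group of rollouts (assumed baseline)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + 1e-6) for v in values]


def span_advantages(rewards, spans, lengths):
    """Give each token the advantage of the component whose span contains it.

    rewards: one dict per rollout, as returned by gated_rewards
    spans:   one dict per rollout, component name -> (start, end) token range
    lengths: number of output tokens per rollout
    """
    names = ("selection", "rewrite")
    norm = {n: group_normalize([r[n] for r in rewards]) for n in names}
    result = []
    for i, (span, length) in enumerate(zip(spans, lengths)):
        adv = [0.0] * length  # tokens outside every span keep zero advantage
        for n in names:
            start, end = span[n]
            for t in range(start, end):
                adv[t] = norm[n][i]
        result.append(adv)
    return result


rollouts = [
    RewardParts(True, 1.0, 0.5, True),
    RewardParts(True, 0.2, 0.9, False),
    RewardParts(False, 0.0, 0.0, False),
]
rewards = [gated_rewards(p) for p in rollouts]
spans = [{"selection": (0, 2), "rewrite": (3, 6)}] * 3
advantages = span_advantages(rewards, spans, lengths=[6, 6, 6])
```

In this toy setup, the second rollout selects messages poorly but rewrites well, so its selection tokens and rewrite tokens receive different advantages rather than one shared sequence-level value. That is the intuition behind assigning each reward to its responsible output span.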