What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Date: 2026-05-21

AI Summary
This study investigates how training-data composition affects the ability of reinforcement learning-based memory-augmented agents to use external memory banks for question answering over multi-session dialogue. Holding the model architecture, the RL algorithm (GRPO), and all hyperparameters constant, the authors compare three curricula: training only on LoCoMo (in-domain), only on LongMemEval (out-of-domain), or on a mixture of both. The results show that the curriculum acts as a fine-grained control over which skills are acquired rather than as a uniform boost to overall performance. The mixed curriculum achieves the highest overall F1 on both benchmarks, and training only on out-of-domain data still transfers temporal reasoning despite weak aggregate performance. Per-question-type differences substantially exceed the differences in aggregate metrics, so single composite scores can understate the impact of the training curriculum. Two practical lessons from running GRPO on a single GPU are also reported: cross-benchmark mixing requires filtering format-specific noise from the memory banks, and at small group sizes (G = 4) a binary exact-match reward yields no learning signal, which motivates continuous reward functions.
Abstract
Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi-session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in-domain (LoCoMo), mixed-benchmark (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill, temporal reasoning, despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single-GPU regime: cross-benchmark mixing requires filtering format-specific noise from memory banks to preserve training signal, and binary exact-match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.
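The last point, that a binary reward can leave a small group with nothing to learn from, can be illustrated with a minimal sketch. It assumes the commonly used GRPO group-relative advantage (each reward minus the group mean, divided by the group standard deviation) and a SQuAD-style token-level F1 as the continuous reward. The helper names (`normalize`, `exact_match`, `token_f1`, `group_advantages`), the gold answer, and the four sampled answers are all hypothetical and are not taken from the paper, whose exact reward and normalization may differ.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()


def exact_match(pred, gold):
    """Binary reward: 1.0 only if the normalized token sequences are identical."""
    return 1.0 if normalize(pred) == normalize(gold) else 0.0


def token_f1(pred, gold):
    """Continuous reward: token-overlap F1 between prediction and gold answer."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage (assumed form): (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of G = 4 sampled answers for one question.
gold = "7 May 2023"
group = ["May 2023", "in May", "7 May 2023 at the cafe", "June 2022"]

em_rewards = [exact_match(c, gold) for c in group]
f1_rewards = [token_f1(c, gold) for c in group]

em_adv = group_advantages(em_rewards)  # identical rewards, so every advantage is zero
f1_adv = group_advantages(f1_rewards)  # partial credit separates the candidates

print("EM rewards:", em_rewards, "advantages:", em_adv)
print("F1 rewards:", f1_rewards, "advantages:", f1_adv)
```

In this toy group no candidate matches the gold answer exactly, so every exact-match reward is the same and every advantage is zero: the group contributes no policy-gradient signal. The same happens when all four candidates are correct. With only four samples per question, such uniform groups are more likely than with larger groups, whereas a graded reward such as token F1 still ranks near misses against each other.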
Problem

Research questions and friction points this paper is trying to address.

training data composition
curriculum effects
memory-augmented QA
reinforcement learning
skill acquisition
Innovation

Methods, ideas, or system contributions that make the work stand out.

curriculum design
memory-augmented QA
reinforcement learning
cross-benchmark transfer
continuous reward