AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing rubric-based reinforcement learning approaches discard the diagnostic information produced during evaluation, which hinders knowledge accumulation, recognition of recurrent suboptimal behaviors, and curriculum-like progression. This work proposes AMARIS, a system that integrates a persistent memory mechanism into rubric refinement. AMARIS analyzes the rollouts at each training step, aggregates the findings into step-level summaries stored in an evaluation memory, and then uses a hybrid retrieval strategy, combining static (recent-step) and dynamic (semantically matched) retrieval, to extract relevant historical context for continual refinement of the rubrics. This turns reward shaping from a stateless, per-step heuristic into a closed-loop process grounded in historical evidence. Experiments show that AMARIS consistently outperforms baselines in both closed and open-ended domains. Ablation studies indicate that static and dynamic retrieval each contribute and that their combination works best, and the pipeline adds only about 5% time overhead because it runs asynchronously.
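The hybrid retrieval idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `StepSummary` record, the `hybrid_retrieve` function, the use of cosine similarity over precomputed embeddings, and the `n_recent` / `n_semantic` budgets are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class StepSummary:
    """Hypothetical memory entry: one step-level summary plus its embedding."""
    step: int
    text: str
    embedding: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_retrieve(memory: List[StepSummary], query_embedding: List[float],
                    n_recent: int = 2, n_semantic: int = 2) -> List[StepSummary]:
    """Return the most recent summaries (static) plus the most similar
    older summaries (dynamic), without duplicates."""
    ordered = sorted(memory, key=lambda s: s.step)
    # Static retrieval: the last n_recent steps, regardless of content.
    static = ordered[-n_recent:] if n_recent > 0 else []
    static_steps = {s.step for s in static}
    # Dynamic retrieval: rank the remaining entries by similarity to the query.
    candidates = [s for s in ordered if s.step not in static_steps]
    candidates.sort(key=lambda s: cosine(s.embedding, query_embedding), reverse=True)
    return static + candidates[:n_semantic]
```

In this sketch the two budgets are separate parameters, so the static-only and dynamic-only variants mentioned in the ablations correspond to setting one of them to zero.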

📝 Abstract

Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubrics decompose standard outcome rewards into multiple dimensions to provide richer reward signals. Recent work makes the rubrics adaptive based on local signals, such as the rollouts from the current step or pairwise comparisons. However, these methods discard the diagnostics produced during evaluation after immediate use, which prevents the long-term accumulation and strategic reuse of evaluation knowledge. This forces the system to re-derive evaluation principles from scratch, limits its ability to detect recurring suboptimal behaviors, and forfeits the curriculum-like progression that a persistent training history would naturally support. To address these limitations, we introduce AMARIS, which grounds rubric modifications in long-term training history. At each training step, AMARIS analyzes individual rollouts, aggregates findings into step-level summaries, retrieves relevant historical context from a persistent evaluation memory through both static (recent steps) and dynamic (semantically matched) retrieval, and updates rubrics based on these accumulated analyses. This procedure runs asynchronously alongside the normal RL loop with minimal overhead. Experiments across both closed and open-ended domains show that AMARIS consistently outperforms the baselines. Ablation studies show that static and dynamic memory retrieval each contribute to the performance gain, that their combination provides the strongest results, and that moderate retrieval budgets are sufficient to provide most of the gain. The entire pipeline adds only ~5% time overhead thanks to asynchronous execution. These results show that persistent evaluation memory can transform rubric-based reward shaping from a stateless, per-step heuristic into an evidence-driven loop for RL training.
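As a rough illustration of how rubric refinement might overlap with the normal RL step, here is a minimal Python sketch. The abstract does not say how asynchrony is implemented, so everything here is an assumption: the function names (`refine_rubric`, `rl_step`, `train`), the rollout format with an `issues` list, and the rule that an issue becomes a rubric criterion once it recurs in memory are all hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def refine_rubric(rubric, rollouts, memory, step):
    """Hypothetical refinement: summarize this step's issues and add a
    criterion for every issue that also appears in earlier summaries."""
    # Per-rollout analysis aggregated into one step-level summary.
    issues = {issue for r in rollouts for issue in r["issues"]}
    summary = {"step": step, "issues": issues}
    # Historical context: every issue recorded in earlier steps.
    seen_before = set().union(*(m["issues"] for m in memory))
    new_rubric = list(rubric)
    for issue in sorted(issues & seen_before):
        criterion = f"penalize: {issue}"
        if criterion not in new_rubric:
            new_rubric.append(criterion)
    return new_rubric, summary


def rl_step(rollouts, rubric):
    """Placeholder for the usual policy update under the current rubric."""
    return len(rollouts)


def train(batches, rubric):
    """Run refinement in a background thread while the RL step proceeds."""
    memory = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        for step, rollouts in enumerate(batches):
            # Start refinement on a snapshot of the memory, then train.
            pending = pool.submit(refine_rubric, rubric, rollouts, list(memory), step)
            rl_step(rollouts, rubric)
            # Collect the refined rubric for use in the next step.
            rubric, summary = pending.result()
            memory.append(summary)
    return rubric, memory
```

Here the refined rubric only takes effect on the following step, which is one simple way to keep the refinement off the critical path of the policy update.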

Problem

Research questions and friction points this paper is trying to address.

rubric-based reinforcement learning

evaluation memory

reward shaping

long-term training history

adaptive rubrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

memory-augmented reinforcement learning

rubric-based reward shaping

persistent evaluation memory