MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of sparse reward signals and high annotation costs in complex reasoning tasks such as mathematical theorem proving for large language models. The authors propose the first reward prediction mechanism that integrates episodic memory with graph-structured modeling, constructing a heterogeneous graph from queries, reasoning trajectories, and answers. By leveraging graph neural networks, reward signals are effectively propagated even under extremely low annotation rates, enabling efficient policy optimization. With only 20% labeled data, the method achieves 97.3% (for a 3B-parameter model) and 96.6% (for a 1.5B-parameter model) of the oracle performance; at 70% labeling, it reaches 99.4%. Notably, it also outperforms fully supervised baselines on out-of-domain tasks.

📝 Abstract
Training large language models (LLMs) for complex reasoning via reinforcement learning requires reward labels that specify whether the generated rollouts are correct. However, obtaining reward labels at scale often requires expensive human labeling or time-consuming verification procedures; for instance, evaluating mathematical proofs demands expert review, while open-ended question answering lacks definitive ground truth. When reward labels are scarce, the effectiveness of reinforcement learning fine-tuning is correspondingly constrained. We introduce MemReward, a graph-based experience memory framework: an initial LLM policy generates rollouts for each query, each comprising a thinking process and a final answer, and these rollouts are stored as experience memory. Queries, thinking processes, and answers form nodes in a heterogeneous graph with similarity and structural edges; a GNN trained on labeled nodes propagates rewards to unlabeled rollouts during online optimization. Experiments on Qwen2.5-3B and 1.5B across mathematics, question answering, and code generation demonstrate that MemReward, with only 20% labels, achieves 97.3% of Oracle performance on 3B and 96.6% on 1.5B, surpassing Oracle on out-of-domain tasks. Performance scales smoothly with label budget, reaching 99.4% of Oracle at 70% labels.
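
The core idea in the abstract — a heterogeneous graph over queries, thinking processes, and answers, with rewards flowing from labeled to unlabeled rollouts — can be illustrated with a minimal sketch. The paper uses a trained GNN; the snippet below substitutes simple iterative label propagation over the same graph structure, and all names, edge choices, and hyperparameters here are assumptions for illustration, not the authors' implementation.

```python
# Sketch: reward propagation over a heterogeneous rollout graph.
# Labeled answer nodes keep their reward; unlabeled nodes take a
# damped average of their neighbors' scores (a stand-in for the GNN).
from collections import defaultdict

def propagate_rewards(edges, labeled, n_iters=10, alpha=0.8):
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # 0.5 = uninformative prior for unlabeled nodes
    scores = {n: labeled.get(n, 0.5) for n in neighbors}
    for _ in range(n_iters):
        new = {}
        for n in scores:
            if n in labeled:
                new[n] = labeled[n]  # clamp labeled rollouts
            else:
                nb = neighbors[n]
                avg = sum(scores[m] for m in nb) / len(nb)
                new[n] = alpha * avg + (1 - alpha) * scores[n]
        scores = new
    return scores

# Toy graph: one query q with two rollouts (thinking t1/t2, answers
# a1/a2); structural edges q-t-a, plus a similarity edge a1-a2
# standing in for "these two answers look alike".
edges = [("q", "t1"), ("t1", "a1"),
         ("q", "t2"), ("t2", "a2"),
         ("a1", "a2")]
rewards = propagate_rewards(edges, labeled={"a1": 1.0})
assert rewards["a1"] == 1.0
assert rewards["a2"] > 0.5  # pulled toward its labeled, similar neighbor
```

The similarity edge is what lets the single label on `a1` inform the unlabeled rollout `a2`; with no labels in a connected component, every node would stay at the prior, which mirrors why some minimum label budget is needed.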
Problem

Research questions and friction points this paper is trying to address.

reward prediction
limited labels
reinforcement learning
large language models
experience memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

graph-based memory
reward propagation
limited-label reinforcement learning
heterogeneous graph
GNN for LLMs