MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

📅 2026-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of sparse reward signals and high annotation costs in complex reasoning tasks such as mathematical theorem proving for large language models. The authors propose the first reward prediction mechanism that integrates episodic memory with graph-structured modeling, constructing a heterogeneous graph from queries, reasoning trajectories, and answers. By leveraging graph neural networks, reward signals are effectively propagated even under extremely low annotation rates, enabling efficient policy optimization. With only 20% labeled data, the method achieves 97.3% (for a 3B-parameter model) and 96.6% (for a 1.5B-parameter model) of the oracle performance; at 70% labeling, it reaches 99.4%. Notably, it also outperforms fully supervised baselines on out-of-domain tasks.

📝 Abstract
Training large language models (LLMs) for complex reasoning via reinforcement learning requires reward labels that specify whether the generated rollouts are correct. However, obtaining reward labels at scale often requires expensive human labeling or time-consuming verification procedures; for instance, evaluating mathematical proofs demands expert review, while open-ended question answering lacks definitive ground truth. When reward labels are scarce, the effectiveness of reinforcement learning fine-tuning is correspondingly constrained. We introduce MemReward, a graph-based experience memory framework: an initial LLM policy generates rollouts for each query, each comprising a thinking process and a final answer, and these rollouts are stored as experience memory. Queries, thinking processes, and answers form nodes in a heterogeneous graph with similarity and structural edges; a GNN trained on labeled nodes propagates rewards to unlabeled rollouts during online optimization. Experiments on Qwen2.5-3B and 1.5B across mathematics, question answering, and code generation demonstrate that MemReward, with only 20% labels, achieves 97.3% of Oracle performance on 3B and 96.6% on 1.5B, surpassing Oracle on out-of-domain tasks. Performance scales smoothly with label budget, reaching 99.4% of Oracle at 70% labels.
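
The core idea in the abstract — a heterogeneous graph over queries, thinking processes, and answers, with rewards flowing from labeled to unlabeled rollouts — can be illustrated with a minimal sketch. The paper uses a trained GNN; the snippet below substitutes simple iterative label propagation over the same graph structure, and all names, edge choices, and hyperparameters here are assumptions for illustration, not the authors' implementation.

```python
# Sketch: reward propagation over a heterogeneous rollout graph.
# Labeled answer nodes keep their reward; unlabeled nodes take a
# damped average of their neighbors' scores (a stand-in for the GNN).
from collections import defaultdict

def propagate_rewards(edges, labeled, n_iters=10, alpha=0.8):
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # 0.5 = uninformative prior for unlabeled nodes
    scores = {n: labeled.get(n, 0.5) for n in neighbors}
    for _ in range(n_iters):
        new = {}
        for n in scores:
            if n in labeled:
                new[n] = labeled[n]  # clamp labeled rollouts
            else:
                nb = neighbors[n]
                avg = sum(scores[m] for m in nb) / len(nb)
                new[n] = alpha * avg + (1 - alpha) * scores[n]
        scores = new
    return scores

# Toy graph: one query q with two rollouts (thinking t1/t2, answers
# a1/a2); structural edges q-t-a, plus a similarity edge a1-a2
# standing in for "these two answers look alike".
edges = [("q", "t1"), ("t1", "a1"),
         ("q", "t2"), ("t2", "a2"),
         ("a1", "a2")]
rewards = propagate_rewards(edges, labeled={"a1": 1.0})
assert rewards["a1"] == 1.0
assert rewards["a2"] > 0.5  # pulled toward its labeled, similar neighbor
```

The similarity edge is what lets the single label on `a1` inform the unlabeled rollout `a2`; with no labels in a connected component, every node would stay at the prior, which mirrors why some minimum label budget is needed.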
Problem

Research questions and friction points this paper is trying to address.

reward prediction
limited labels
reinforcement learning
large language models
experience memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

graph-based memory
reward propagation
limited-label reinforcement learning
heterogeneous graph
GNN for LLMs