When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

📅 2026-05-27
🤖 AI Summary
This work addresses the tendency of reward functions generated by large language models (LLMs) to fail in sparse, structured reinforcement learning tasks, mainly through reward flooding or misunderstanding of task semantics and APIs. The authors propose a diagnostic-driven iterative refinement framework that treats LLM-based reward design as a debugging process: training diagnostics and a failure-mode taxonomy guide targeted revisions of the reward function. Agents are trained with PPO, with MiniGrid as the core evaluation and MuJoCo as a boundary stress test. Refinement raises success rates from 2.3% to 97.6% on DoorKey-8x8 and from 31.2% to 86.7% on KeyCorridor, though with high seed-to-seed variance, and the continuous-control results show no robust gains.
📝 Abstract
For sparse, structured reinforcement-learning tasks with semantic reward-function interfaces, LLM-generated reward shaping is better framed as debugging than one-shot generation. We study PPO-trained agents using MiniGrid as core evaluation and MuJoCo as boundary stress test. Our audit finds two dominant one-shot failure modes -- reward flooding and semantic/API misunderstanding -- plus a rarer weak-shaping case. We propose diagnostic-driven iterative refinement, where training diagnostics and a failure-mode taxonomy guide targeted reward-function revision. Refinement improves DoorKey-8x8 from 2.3% to 97.6% and KeyCorridor from 31.2% to 86.7% with high seed-to-seed variance. Controls show these gains are not from retrying or extra training: metrics-only re-prompting yields large drops, while a static-vocabulary control recovers much of the gap (87.6%; 70.7%), showing the taxonomy prompt is a major mechanism and dynamic labels provide only partially isolated incremental evidence. Budget-matched and Best-of-3 comparisons separate refinement from selection and training-time effects. Component-removal tests, sensitivity analyses, and an audit against author labels provide converging evidence for the debugging interpretation while revealing calibration limits. Continuous-control results show the boundary: success-based diagnostics can misfire in dense-reward locomotion, and return-trend feedback removes one false-positive mechanism without robust gains. The low-call protocol is a cost contrast with population-based reward search, not a benchmark comparison. In four crossed-variance-design environments, point estimates suggest larger gains when LLM reward-function variance dominates but bootstrap intervals are wide. The method is bounded to sparse structured tasks with reliable interfaces under PPO; fields like event_text may help, hurt, or be neutral.
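The debugging loop described in the abstract can be illustrated with a minimal Python sketch. The three failure-mode labels come from the abstract; everything else is an assumption made for illustration, including the `Diagnostics` field names, the threshold values, the rule ordering, and the prompt wording. The paper's actual taxonomy definitions and diagnostics are not given in this listing.

```python
from dataclasses import dataclass


@dataclass
class Diagnostics:
    """Hypothetical per-run training summary (all field names are assumptions)."""
    success_rate: float          # fraction of evaluation episodes solved, in [0, 1]
    shaped_return: float         # mean per-episode sum of the LLM-written shaping reward
    task_return: float           # mean per-episode sum of the sparse environment reward
    nonzero_shaping_frac: float  # fraction of steps where the shaping reward was nonzero
    reward_fn_errors: int        # exceptions raised while calling the reward function


def classify_failure(d, success_target=0.9, flood_ratio=10.0):
    """Map diagnostics to a failure-mode label using illustrative rules."""
    if d.success_rate >= success_target:
        return "ok"
    # Assumed signature of an interface problem: the function crashes or never fires.
    if d.reward_fn_errors > 0 or d.nonzero_shaping_frac == 0.0:
        return "semantic_api_misunderstanding"
    # Assumed signature of flooding: shaping fires on most steps and dwarfs the task reward.
    dominates = abs(d.shaped_return) > flood_ratio * max(abs(d.task_return), 1e-8)
    if d.nonzero_shaping_frac > 0.5 and dominates:
        return "reward_flooding"
    # Otherwise the shaping signal is present but too faint to guide learning.
    return "weak_shaping"


def build_reprompt(label, d):
    """Compose a diagnosis-informed revision request (wording is hypothetical)."""
    hints = {
        "ok": "No revision needed.",
        "reward_flooding": "Reward only progress events, not every step.",
        "semantic_api_misunderstanding": "Re-check the fields the interface actually exposes.",
        "weak_shaping": "Increase the shaping signal on key subgoals.",
    }
    return (f"Diagnosed failure mode: {label}. "
            f"Success rate: {d.success_rate:.1%}. {hints[label]}")


flooded = Diagnostics(success_rate=0.05, shaped_return=180.0, task_return=0.02,
                      nonzero_shaping_frac=0.95, reward_fn_errors=0)
print(build_reprompt(classify_failure(flooded), flooded))
```

Passing the label into the re-prompt, rather than raw metrics alone, mirrors the abstract's contrast between taxonomy-guided refinement and the metrics-only control.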
Problem

Research questions and friction points this paper is trying to address.

sparse structured RL
LLM reward design
reward shaping failure
semantic reward-function interfaces
reinforcement learning diagnostics
Innovation

Methods, ideas, or system contributions that make the work stand out.

diagnostic-driven refinement
LLM reward shaping
sparse structured RL
reward debugging
failure-mode taxonomy