🤖 AI Summary
This work addresses the problem that reward functions generated by large language models (LLMs) often fail in sparse, structured reinforcement learning tasks, either through reward flooding or through misreading task semantics and environment APIs. The authors propose a diagnostic-driven iterative refinement framework that treats LLM-based reward design as a debugging process: a failure-mode taxonomy and dynamic feedback from training diagnostics guide targeted revisions of the reward function, delivered through diagnosis-informed re-prompting. Agents are trained with PPO and evaluated primarily on MiniGrid, with MuJoCo serving as a boundary stress test. Success rates rise from 2.3% to 97.6% on the DoorKey-8x8 task and from 31.2% to 86.7% on the KeyCorridor task.
📝 Abstract
For sparse, structured reinforcement-learning tasks with semantic reward-function interfaces, LLM-generated reward shaping is better framed as debugging than as one-shot generation. We study PPO-trained agents, using MiniGrid as the core evaluation and MuJoCo as a boundary stress test. Our audit finds two dominant one-shot failure modes -- reward flooding and semantic/API misunderstanding -- plus a rarer weak-shaping case. We propose diagnostic-driven iterative refinement, where training diagnostics and a failure-mode taxonomy guide targeted reward-function revision. Refinement improves success on DoorKey-8x8 from 2.3% to 97.6% and on KeyCorridor from 31.2% to 86.7%, with high seed-to-seed variance. Controls show that these gains are not from retrying or extra training: metrics-only re-prompting yields large drops, while a static-vocabulary control recovers much of the gap (87.6%; 70.7%), showing that the taxonomy prompt is a major mechanism and that dynamic labels provide only partially isolated incremental evidence. Budget-matched and Best-of-3 comparisons separate refinement from selection and training-time effects. Component-removal tests, sensitivity analyses, and an audit against author labels provide converging evidence for the debugging interpretation while revealing calibration limits. Continuous-control results show the boundary: success-based diagnostics can misfire in dense-reward locomotion, and return-trend feedback removes one false-positive mechanism without robust gains. The low-call protocol is a cost contrast with population-based reward search, not a benchmark comparison. In four crossed-variance-design environments, point estimates suggest larger gains when LLM reward-function variance dominates, but bootstrap intervals are wide. The method is bounded to sparse structured tasks with reliable interfaces under PPO; fields such as event_text may help, hurt, or be neutral.
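The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Diagnostics` fields, the thresholds, the rule order in `diagnose`, and the `generate`/`train` callables are all hypothetical. In particular, counting runtime errors is only a crude stand-in for detecting semantic/API misunderstanding. Only the three failure-mode names come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class FailureMode(Enum):
    # The three failure modes named in the abstract.
    REWARD_FLOODING = "reward_flooding"
    SEMANTIC_API_MISUNDERSTANDING = "semantic_api_misunderstanding"
    WEAK_SHAPING = "weak_shaping"


@dataclass
class Diagnostics:
    # Hypothetical training diagnostics; the field names are assumptions.
    success_rate: float            # fraction of evaluation episodes solved
    shaped_reward_per_step: float  # mean shaping bonus paid per step
    runtime_errors: int            # exceptions raised by the reward function


def diagnose(d: Diagnostics,
             success_target: float = 0.9,
             flood_threshold: float = 0.5) -> Optional[FailureMode]:
    """Map diagnostics to a failure mode, or None if the run succeeded.

    The thresholds and the rule order are illustrative assumptions.
    """
    if d.success_rate >= success_target:
        return None
    if d.runtime_errors > 0:
        return FailureMode.SEMANTIC_API_MISUNDERSTANDING
    if d.shaped_reward_per_step >= flood_threshold:
        return FailureMode.REWARD_FLOODING
    return FailureMode.WEAK_SHAPING


def refine(generate: Callable[[Optional[str]], str],
           train: Callable[[str], Diagnostics],
           max_rounds: int = 3) -> Tuple[str, List[Optional[FailureMode]]]:
    """Alternate reward generation and training until no failure is diagnosed.

    `generate` stands in for an LLM call that takes feedback text (or None on
    the first round) and returns reward-function source; `train` stands in
    for a PPO run that returns diagnostics for that reward function.
    """
    feedback: Optional[str] = None
    history: List[Optional[FailureMode]] = []
    code = ""
    for _ in range(max_rounds):
        code = generate(feedback)
        diag = train(code)
        mode = diagnose(diag)
        history.append(mode)
        if mode is None:
            break
        # Feed the taxonomy label back alongside the raw metric.
        feedback = (f"Diagnosed failure mode: {mode.value}; "
                    f"success_rate={diag.success_rate:.3f}")
    return code, history
```

In practice `generate` would wrap a prompt to the LLM and `train` would wrap a full PPO training and evaluation run. Keeping both behind plain callables makes the control flow easy to exercise with stubs before any training budget is spent.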