🤖 AI Summary
Manually designing reward functions for complex reasoning tasks is challenging, and existing black-box automated reward optimization methods fail to model the causal relationship between reward structure and task performance. Method: This paper proposes DERL, a differentiable meta-optimization framework for bi-level reward evolution. Its inner loop validates reward effectiveness via policy training; its outer loop performs a differentiable evolutionary search over structured atomic reward primitives, approximating meta-gradients by backpropagating RL signals. Contribution/Results: DERL is the first method to formulate reward function evolution as a differentiable meta-learning problem. It achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming heuristic rewards. It shows superior robustness under out-of-distribution (OOD) generalization and, through interpretable evolutionary trajectories, autonomously discovers intrinsic task structure, enabling agent alignment without human intervention.
📝 Abstract
The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bi-level framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., a Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolutionary approaches, DERL is differentiable in its meta-optimization: it treats inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agent tasks (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods that rely on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
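The bi-level scheme the abstract describes can be sketched in miniature. Everything below is an illustrative assumption, not the paper's implementation: a toy chain environment stands in for the agent tasks, three hand-picked primitives (sparse success, progress shaping, step penalty) stand in for the structured atomic reward primitives, and an evolution-strategies perturbation with a score-weighted mean update plays the role of the REINFORCE-style meta-gradient on inner-loop validation performance.

```python
import random

random.seed(0)

GOAL, MAX_STEPS = 5, 12

# Hypothetical atomic reward primitives (illustrative, not from the paper):
PRIMITIVES = [
    lambda s, s2: 1.0 if s2 == GOAL else 0.0,  # sparse task success
    lambda s, s2: (s2 - s) / GOAL,             # dense progress shaping
    lambda s, s2: -0.05,                       # per-step cost
]

def step(s, a):
    """Toy chain environment: move left (a=0) or right (a=1) on 0..GOAL."""
    return max(0, min(GOAL, s + (1 if a == 1 else -1)))

def inner_loop(weights, episodes=60, alpha=0.5, eps=0.2):
    """Inner loop: train a tabular Q-learning policy under the composed
    Meta-Reward (a weighted sum of the primitives)."""
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]
    for _ in range(episodes):
        s = 0
        for _ in range(MAX_STEPS):
            if random.random() < eps:
                a = random.randrange(2)
            else:
                a = max(range(2), key=lambda x: Q[s][x])
            s2 = step(s, a)
            r = sum(w * p(s, s2) for w, p in zip(weights, PRIMITIVES))
            Q[s][a] += alpha * (r + max(Q[s2]) - Q[s][a])
            s = s2
            if s == GOAL:
                break
    return Q

def validate(Q):
    """Validation signal: success of the greedy policy on the sparse task
    objective only (reward structure is *not* visible here)."""
    s = 0
    for _ in range(MAX_STEPS):
        s = step(s, max(range(2), key=lambda a: Q[s][a]))
        if s == GOAL:
            return 1.0
    return 0.0

def outer_loop(generations=20, pop=8, sigma=0.3, lr=0.5):
    """Outer loop: perturb the primitive weights, score each candidate by
    inner-loop validation success, and move the mean along score-weighted
    perturbations -- a REINFORCE/ES-style approximation of the meta-gradient."""
    mean = [0.0, 0.0, 0.0]
    for _ in range(generations):
        noises, scores = [], []
        for _ in range(pop):
            n = [random.gauss(0, sigma) for _ in mean]
            candidate = [m + e for m, e in zip(mean, n)]
            scores.append(validate(inner_loop(candidate)))
            noises.append(n)
        baseline = sum(scores) / pop
        for i in range(len(mean)):
            mean[i] += lr / (pop * sigma) * sum(
                (sc - baseline) * n[i] for sc, n in zip(scores, noises))
    return mean

meta_weights = outer_loop()
final_success = validate(inner_loop(meta_weights))
```

The key structural point the sketch preserves is the asymmetry between the two loops: the inner policy only ever sees the composed Meta-Reward, while the outer meta-update only ever sees sparse validation success, so the outer loop must learn which reward compositions cause good downstream policies.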