Differentiable Evolutionary Reinforcement Learning

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manually designing reward functions for complex reasoning tasks is difficult, and existing black-box automated reward optimization methods fail to model the causal relationship between reward structure and task performance. Method: This paper proposes DERL, a differentiable meta-optimization framework for bi-level reward evolution. Its inner loop validates reward effectiveness by training a policy; its outer loop performs differentiable evolutionary search over structured atomic reward primitives, approximating the meta-gradient by updating the Meta-Optimizer with reinforcement learning on inner-loop validation performance. Contribution/Results: DERL is the first method to formulate reward function evolution as a differentiable meta-learning problem. It achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming heuristic rewards; it shows superior robustness under out-of-distribution (OOD) generalization and, through interpretable evolutionary trajectories, autonomously discovers intrinsic task structure, enabling agent alignment without human intervention.
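The bilevel scheme described above can be sketched with toy stand-ins. Everything concrete here is an illustrative assumption, not the paper's implementation: the primitive set, the fixed-batch "validation" score, and the accept-if-better mutation rule merely mimic the shape of the outer evolutionary search over an inner training loop.

```python
import random

random.seed(0)

# Hypothetical atomic reward primitives; the paper's actual primitive
# set is not specified in this summary.
PRIMITIVES = {
    "progress":  lambda s: float(s["subgoals_done"]),
    "success":   lambda s: 10.0 if s["task_done"] else 0.0,
    "step_cost": lambda s: -0.1,
}

def compose_reward(weights):
    """Compose a Meta-Reward as a weighted sum of atomic primitives."""
    return lambda s: sum(w * PRIMITIVES[k](s) for k, w in weights.items())

def inner_loop_train(reward_fn):
    """Stand-in for inner-loop policy training.

    Scores a fixed batch of simulated states; a real inner loop would
    train a policy with RL under reward_fn and evaluate on held-out tasks.
    """
    states = [{"subgoals_done": i, "task_done": i >= 3} for i in range(5)]
    return sum(reward_fn(s) for s in states)

def outer_loop_evolve(generations=20, lr=0.1):
    """Outer loop: perturb primitive weights and keep mutations that
    improve inner-loop validation performance (a crude stand-in for
    DERL's RL-based meta-gradient update)."""
    weights = {k: 1.0 for k in PRIMITIVES}
    best = inner_loop_train(compose_reward(weights))
    for _ in range(generations):
        candidate = {k: w + lr * random.uniform(-1, 1) for k, w in weights.items()}
        score = inner_loop_train(compose_reward(candidate))
        if score > best:  # accept only improving mutations
            weights, best = candidate, score
    return weights, best

weights, best = outer_loop_evolve()
```

The outer loop never needs an analytic gradient through policy training; it only observes the inner loop's validation score, which is the key structural point the summary makes.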

📝 Abstract
The design of effective reward functions presents a central and often arduous challenge in reinforcement learning (RL), particularly when developing autonomous agents for complex reasoning tasks. While automated reward optimization approaches exist, they typically rely on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between reward structure and task performance. To bridge this gap, we propose Differentiable Evolutionary Reinforcement Learning (DERL), a bilevel framework that enables the autonomous discovery of optimal reward signals. In DERL, a Meta-Optimizer evolves a reward function (i.e., Meta-Reward) by composing structured atomic primitives, guiding the training of an inner-loop policy. Crucially, unlike previous evolutionary approaches, DERL's meta-optimization is differentiable: it treats the inner-loop validation performance as a signal to update the Meta-Optimizer via reinforcement learning. This allows DERL to approximate the "meta-gradient" of task success, progressively learning to generate denser and more actionable feedback. We validate DERL across three distinct domains: robotic agents (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8k, MATH). Experimental results show that DERL achieves state-of-the-art performance on ALFWorld and ScienceWorld, significantly outperforming methods relying on heuristic rewards, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
Problem

Research questions and friction points this paper is trying to address.

Automates reward function design for reinforcement learning
Enables differentiable meta-optimization of reward signals
Improves agent performance in complex reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable bilevel framework for reward optimization
Meta-Optimizer evolves reward via structured atomic primitives
Approximates meta-gradient of task success via reinforcement learning
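The last bullet, approximating the meta-gradient via reinforcement learning, can be illustrated with a hypothetical REINFORCE-style Meta-Optimizer that samples which reward primitives to include. The primitive names, the toy validation score, and the Bernoulli meta-policy are all assumptions for illustration; the point is that inner-loop validation performance serves as the meta-policy's reward.

```python
import math
import random

random.seed(1)

PRIMITIVE_NAMES = ["progress", "success", "step_cost"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def validation_score(mask):
    # Toy stand-in for inner-loop validation: pretend "progress" and
    # "success" help task performance while "step_cost" hurts it.
    gains = {"progress": 1.0, "success": 2.0, "step_cost": -0.5}
    return sum(gains[n] for n, on in zip(PRIMITIVE_NAMES, mask) if on)

def reinforce_meta(steps=500, lr=0.2):
    """Update a Bernoulli meta-policy over primitive inclusion with
    REINFORCE, using validation performance as the meta-reward."""
    logits = [0.0] * len(PRIMITIVE_NAMES)
    baseline = 0.0
    for _ in range(steps):
        probs = [sigmoid(l) for l in logits]
        mask = [random.random() < p for p in probs]
        r = validation_score(mask)
        advantage = r - baseline
        baseline = 0.9 * baseline + 0.1 * r  # moving-average baseline
        # REINFORCE: grad of log p(mask) for a Bernoulli is (mask - prob)
        for i in range(len(logits)):
            logits[i] += lr * advantage * ((1.0 if mask[i] else 0.0) - probs[i])
    return [sigmoid(l) for l in logits]

probs = reinforce_meta()
```

Over training, the inclusion probabilities for helpful primitives rise and the harmful one falls, which is the sense in which a score-based RL update approximates the meta-gradient of task success without differentiating through the inner loop.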