Zeroth-Order Actor-Critic: An Evolutionary Framework for Sequential Decision Problems

📅 2022-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low sample efficiency of evolutionary algorithms and the susceptibility of reinforcement learning to local optima in sequential decision-making, this paper proposes a Zeroth-Order Actor-Critic (ZOAC) framework. It performs step-wise exploration in parameter space, theoretically derives the zeroth-order policy gradient, and unifies evolutionary sampling with the actor-critic architecture, jointly enabling optimization of non-differentiable policies and modeling of Markov dynamics. Key contributions are: (i) a formal derivation of the zeroth-order policy gradient; and (ii) an alternating policy evaluation and policy improvement mechanism that significantly reduces gradient estimation variance. Empirical results demonstrate that the method substantially outperforms conventional evolutionary algorithms and matches state-of-the-art gradient-based RL approaches across multiple benchmarks, including a multi-lane autonomous driving task with a rule-based, non-differentiable policy and three Gymnasium environments.
📝 Abstract
Evolutionary algorithms (EAs) have shown promise in solving sequential decision problems (SDPs) by simplifying them to static optimization problems and searching for the optimal policy parameters in a zeroth-order way. While these methods are highly versatile, they often suffer from high sample complexity due to their ignorance of the underlying temporal structures. In contrast, reinforcement learning (RL) methods typically formulate SDPs as Markov Decision Processes (MDPs). Although more sample efficient than EAs, RL methods are restricted to differentiable policies and prone to getting stuck in local optima. To address these issues, we propose a novel evolutionary framework, Zeroth-Order Actor-Critic (ZOAC). We propose to use step-wise exploration in parameter space and theoretically derive the zeroth-order policy gradient. We further utilize the actor-critic architecture to effectively leverage the Markov property of SDPs and reduce the variance of gradient estimators. In each iteration, ZOAC employs samplers to collect trajectories with parameter-space exploration, and alternates between first-order policy evaluation (PEV) and zeroth-order policy improvement (PIM). To evaluate the effectiveness of ZOAC, we apply it to a challenging multi-lane driving task, optimizing the parameters of a rule-based, non-differentiable driving policy that consists of three sub-modules: behavior selection, path planning, and trajectory tracking. We also compare it with gradient-based RL methods on three Gymnasium tasks, optimizing neural network policies with thousands of parameters. Experimental results demonstrate the strong capability of ZOAC in solving SDPs. ZOAC significantly outperforms EAs that treat the problem as static optimization and matches the performance of gradient-based RL methods even without first-order information, in terms of total average return across all tasks.
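The paper derives the full zeroth-order policy gradient with step-wise exploration and a critic baseline; those details are in the paper itself. As a minimal illustration of the general idea only, the sketch below estimates a gradient purely from perturbed rollouts using antithetic Gaussian perturbations in parameter space (the classic evolution-strategies-style estimator, not ZOAC's exact algorithm). The names `zoac_style_gradient` and `episode_return` are hypothetical, and a toy quadratic stands in for an episode return so the sketch runs without a simulator; ZOAC's trajectory-level sampling and PEV/PIM alternation are omitted.

```python
import numpy as np

def zoac_style_gradient(theta, rollout_return, sigma=0.1, n_pairs=8, rng=None):
    """Zeroth-order gradient estimate from function evaluations only.

    Each antithetic pair (theta + sigma*eps, theta - sigma*eps) yields one
    finite-difference term; averaging over pairs approximates the gradient
    without any backpropagation through the policy.
    """
    rng = np.random.default_rng(rng)
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        ret_plus = rollout_return(theta + sigma * eps)
        ret_minus = rollout_return(theta - sigma * eps)
        grad += (ret_plus - ret_minus) / (2.0 * sigma) * eps
    return grad / n_pairs

# Toy stand-in for an episode return: a smooth function whose maximizer
# is known, so the estimator's behavior is easy to check.
target = np.array([1.0, -2.0])
episode_return = lambda th: -float(np.sum((th - target) ** 2))

rng = np.random.default_rng(0)
theta = np.zeros(2)
for _ in range(300):
    theta += 0.05 * zoac_style_gradient(theta, episode_return, rng=rng)
```

Because the estimator touches the policy only through rollout returns, it applies equally to the paper's rule-based, non-differentiable driving policy and to neural network policies.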
Problem

Research questions and friction points this paper is trying to address.

Evolutionary Algorithms
Reinforcement Learning
Multi-lane Driving Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zeroth-order Actor-Critic
Parameter Space Exploration
Gradient-free Optimization
Yuheng Lei
The University of Hong Kong
Embodied AI · Machine Learning · Robotics · Autonomous Driving
Yao Lyu
School of Vehicle and Mobility, Tsinghua University, Beijing, China
Guojian Zhan
School of Vehicle and Mobility, Tsinghua University, Beijing, China
Tao Zhang
SunRising AI Ltd., Beijing, China
Jiangtao Li
SunRising AI Ltd., Beijing, China
Jianyu Chen
Assistant Professor, Tsinghua University
AI · Robotics
Shengbo Eben Li
School of Vehicle and Mobility, Tsinghua University, Beijing, China
Sifa Zheng
School of Vehicle and Mobility, Tsinghua University, Beijing, China