Beyond Self-Play and Scale: A Behavior Benchmark for Generalization in Autonomous Driving

📅 2026-05-11

🤖 AI Summary

This work addresses the lack of rigorous evaluation, on standard benchmarks, of how well large-scale reinforcement learning–based autonomous driving policies generalize to complex and diverse traffic behaviors. To this end, the authors introduce BehaviorBench, an evaluation framework that connects PufferDrive to the nuPlan benchmark, constructs highly interactive test scenarios from the Waymo Open Motion Dataset, and provides a suite of heterogeneous traffic agents. They also propose a hybrid planner that combines a proximal policy optimization (PPO) policy with a rule-based planner. Systematic evaluation shows that policies trained purely via self-play overfit to their training opponents and fail to generalize to other traffic agent behaviors; the hybrid planner is proposed to address this weakness.

📝 Abstract

Recent Autonomous Driving (AD) works such as GigaFlow and PufferDrive have unlocked Reinforcement Learning (RL) at scale as a training strategy for driving policies. Yet such policies remain disconnected from established benchmarks, leaving the performance of large-scale RL for driving on standardized evaluations unknown. We present BehaviorBench, a comprehensive test suite that closes this gap along three axes: Evaluation, Complexity, and Behavior Diversity. In terms of Evaluation, we provide an interface connecting PufferDrive to nuPlan, which, for the first time, enables policies trained via RL at scale to be evaluated on an established planning benchmark for autonomous driving. Complementarily, we offer an evaluation framework that allows planners to be benchmarked directly inside the PufferDrive simulation, in a fraction of the time. Regarding Complexity, we observe that today's standardized benchmarks are so simple that near-perfect scores are achievable by straight lane following with collision checking. We extract a meaningful, interaction-rich split from the Waymo Open Motion Dataset (WOMD) on which strong performance is impossible without multi-agent reasoning. Lastly, we address Behavior Diversity. Existing benchmarks commonly evaluate planners against a single rule-based traffic model, the Intelligent Driver Model (IDM). We provide a diverse suite of interactive traffic agents to stress-test policies under heterogeneous behaviors, going beyond IDM alone. Overall, our benchmarking analysis uncovers the following insight: although they learn interactive behaviors in an emergent manner, policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors. Building on this observation, we propose a hybrid planner that combines a PPO policy with a rule-based planner.
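
For readers unfamiliar with the Intelligent Driver Model mentioned in the abstract, the following is a minimal sketch of its standard textbook acceleration law. It is not taken from BehaviorBench, PufferDrive, or nuPlan: the names `IDMParams` and `idm_acceleration` and all default parameter values are illustrative assumptions, and clamping the dynamic part of the desired gap at zero is a common implementation safeguard rather than part of the original formulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class IDMParams:
    # All defaults are illustrative assumptions, not values from the paper.
    desired_speed: float = 15.0   # v0, free-road target speed [m/s]
    time_headway: float = 1.5     # T, desired time gap to the leader [s]
    min_gap: float = 2.0          # s0, standstill bumper-to-bumper gap [m]
    max_accel: float = 1.0        # a, maximum acceleration [m/s^2]
    comfort_decel: float = 2.0    # b, comfortable braking deceleration [m/s^2]
    exponent: float = 4.0         # delta, free-road acceleration exponent


def idm_acceleration(speed: float, gap: float, lead_speed: float,
                     params: IDMParams = IDMParams()) -> float:
    """Return the longitudinal acceleration of a follower vehicle.

    speed:      follower speed [m/s]
    gap:        bumper-to-bumper distance to the leader [m]
    lead_speed: leader speed [m/s]
    """
    closing_speed = speed - lead_speed
    # Dynamic part of the desired gap: headway term plus braking term.
    dynamic_gap = (speed * params.time_headway
                   + speed * closing_speed
                   / (2.0 * math.sqrt(params.max_accel * params.comfort_decel)))
    # Clamp at zero so a fast-receding leader never shrinks the gap below s0.
    desired_gap = params.min_gap + max(0.0, dynamic_gap)

    free_road_term = 1.0 - (speed / params.desired_speed) ** params.exponent
    interaction_term = (desired_gap / max(gap, 1e-6)) ** 2
    return params.max_accel * (free_road_term - interaction_term)
```

Because the output depends only on the follower's speed, the gap, and the leader's speed, every IDM agent responds to the ego vehicle in the same fixed way, which illustrates why evaluating against IDM alone exposes a planner to a narrow range of traffic behavior.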

Problem

Research questions and friction points this paper is trying to address.

Autonomous Driving

Reinforcement Learning

Generalization

Behavior Diversity

Benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

BehaviorBench

Generalization

Autonomous Driving