AI Summary
Autonomous driving simulation requires balancing human behavior fidelity with scalable multi-agent interaction. Existing generative imitation learning methods achieve high realism but suffer from slow inference and large parameter counts; conversely, self-play reinforcement learning (RL) is computationally efficient yet prone to divergence from the human driving distribution. To address this trade-off, we propose SPACeR, a novel framework that pioneers the use of a pre-trained, tokenized autoregressive motion model as a centralized reference policy. This reference guides decentralized self-play training via likelihood-based rewards and KL-divergence regularization, rigorously anchoring learned policies within the human driving distribution. SPACeR enables real-time closed-loop inference and efficient training. In the Waymo Sim Agents Challenge, it matches the performance of state-of-the-art generative baselines while achieving 10x faster inference with 50x fewer parameters. Moreover, it supports high-fidelity evaluation of autonomous driving planners.
Abstract
Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable. Achieving this requires sim agent policies that are human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data, producing realistic policies. However, these models are computationally expensive, slow during inference, and struggle to adapt in reactive, closed-loop scenarios. In contrast, self-play reinforcement learning (RL) scales efficiently and naturally captures multi-agent interactions, but it often relies on heuristics and reward shaping, and the resulting policies can diverge from human norms. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a centralized reference policy to guide decentralized self-play. The reference model provides likelihood-based rewards and KL-divergence regularization, anchoring policies to the human driving distribution while preserving RL scalability. Evaluated on the Waymo Sim Agents Challenge, our method achieves competitive performance with imitation-learned policies while being up to 10x faster at inference and 50x smaller in parameter size than large generative models. In addition, we demonstrate in closed-loop ego planning evaluation tasks that our sim agents can effectively measure planner quality with fast and scalable traffic simulation, establishing a new paradigm for testing autonomous driving policies.
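The abstract describes the core training signal: a task reward augmented with a likelihood bonus from a frozen reference motion model, plus a KL penalty that keeps the learned policy close to the human driving distribution. The following is a minimal sketch of that reward shaping, assuming a discrete (tokenized) action space; the function name, weighting coefficients `alpha`/`beta`, and the numpy formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def kl_regularized_reward(task_reward, policy_logits, ref_logits, action,
                          alpha=0.1, beta=0.01):
    """Hypothetical sketch of a KL-regularized self-play reward.

    task_reward   : scalar reward from the environment (e.g. progress, no collision)
    policy_logits : logits of the learned policy over the action vocabulary
    ref_logits    : logits of the frozen reference motion model (human prior)
    action        : index of the sampled motion token
    alpha, beta   : illustrative weights for the likelihood bonus and KL penalty
    """
    def log_softmax(x):
        x = x - x.max()                     # numerical stability
        return x - np.log(np.exp(x).sum())

    logp = log_softmax(policy_logits)       # learned policy log-probs
    logq = log_softmax(ref_logits)          # reference (human prior) log-probs

    # Bonus for actions the reference model considers likely.
    likelihood_bonus = logq[action]

    # KL(pi_theta || pi_ref): penalizes drifting away from the human prior.
    kl = float(np.sum(np.exp(logp) * (logp - logq)))

    return task_reward + alpha * likelihood_bonus - beta * kl
```

When the learned policy matches the reference exactly, the KL term vanishes and the shaped reward reduces to the task reward plus the likelihood bonus; as the policy drifts, the penalty grows, which is the anchoring effect the abstract attributes to the reference model.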