Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion

📅 2026-05-24

🤖 AI Summary

Soft Actor-Critic (SAC) scales poorly in massively parallel training, which has kept it behind Proximal Policy Optimization (PPO) for high-performance legged robot control. This work introduces three modifications (policy initialization, timeout-aware critic targets, and multi-step return estimation) that improve SAC’s training stability and sample efficiency in massively parallel environments. With these changes, the authors report that SAC performs on par with PPO across multiple legged robot platforms and a range of locomotion tasks. The approach also supports simulation-to-reality (sim-to-real) transfer, addressing a barrier that has limited off-policy algorithms in large-scale simulation and in online learning on real legged robots.
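For orientation, a timeout-aware critic target is usually written as below. This is the standard SAC target with clipped double-Q and partial-episode bootstrapping, not necessarily the exact form used in the paper. The symbols ($d_t$, $\bar\theta_i$, $\alpha$, $\phi$) are assumptions introduced here for illustration.

```latex
% Standard soft Bellman target, with a' sampled from the current policy.
y_t = r_t + \gamma \,(1 - d_t)\Big[\min_{i=1,2} Q_{\bar\theta_i}(s_{t+1}, a')
      - \alpha \log \pi_\phi(a' \mid s_{t+1})\Big],
\qquad a' \sim \pi_\phi(\cdot \mid s_{t+1})

% Timeout-aware convention:
%   d_t = 1 only when the episode ends in a true terminal state (e.g. a fall);
%   d_t = 0 when the episode is cut off by a time limit, with s_{t+1} taken as
%   the final observation before the reset, so the critic still bootstraps.
```

Setting $d_t = 1$ on a timeout would tell the critic that no further reward is available from a state the robot could in fact have continued from, which biases value estimates downward near the time limit.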

📝 Abstract

Proximal Policy Optimization (PPO) has become the de facto standard for training legged robots, thanks to its robustness and scalability in massively parallel simulation environments like IsaacLab. However, its on-policy nature makes it inherently sample-inefficient, preventing its use for continuous adaptation and fine-tuning on real hardware. Soft Actor-Critic (SAC), by contrast, is an off-policy algorithm that can reuse past experience, making it a natural candidate for sim-to-real transfer workflows where the same algorithm can be used both in simulation and for online learning on the real robot. Despite these advantages, SAC has consistently failed to match PPO's empirical performance in massively parallel training settings. This work identifies the root causes of this gap and introduces targeted modifications, covering policy initialization, timeout-aware critic targets, and multi-step return estimation, that enable SAC to train stably at scale. Evaluated across multiple legged robot platforms and diverse locomotion tasks, our approach closes the performance gap with PPO entirely.
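A minimal sketch of how multi-step returns and timeout handling can be combined when building critic targets. It assumes a single trajectory stored as NumPy arrays and a precomputed soft value estimate for each next state. The function name `n_step_targets` and all its parameters are hypothetical, and the paper's actual implementation may differ.

```python
import numpy as np

def n_step_targets(rewards, terminated, truncated, next_values, gamma=0.99, n=3):
    """Compute n-step bootstrapped targets for one trajectory (illustrative only).

    rewards[t]     : reward received at step t
    terminated[t]  : True if step t ended in a true terminal state (e.g. a fall)
    truncated[t]   : True if the episode was cut off by a time limit at step t
    next_values[t] : assumed soft value estimate of the state reached after
                     step t (for a truncated step, the final pre-reset observation)
    """
    T = len(rewards)
    targets = np.zeros(T)
    for t in range(T):
        ret, discount = 0.0, 1.0
        for k in range(t, min(t + n, T)):
            ret += discount * rewards[k]
            discount *= gamma
            if terminated[k]:
                discount = 0.0  # true termination: no bootstrap term
                break
            if truncated[k]:
                break  # timeout: stop accumulating, but still bootstrap
        # Bootstrap from the value of the last state reached in the window.
        targets[t] = ret + discount * next_values[k]
    return targets

# Usage: same rewards, but the last step ends by timeout vs. by termination.
rewards = np.ones(4)
values = np.full(4, 10.0)
no_flag = np.zeros(4, dtype=bool)
last_step = np.array([False, False, False, True])

timeout_targets = n_step_targets(rewards, no_flag, last_step, values, gamma=0.5, n=2)
terminal_targets = n_step_targets(rewards, last_step, no_flag, values, gamma=0.5, n=2)
```

In this toy case the timeout targets keep the bootstrap term at the final step, while the termination targets drop it, so the two differ only for windows that reach the last step.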

Problem

Research questions and friction points this paper is trying to address.

Soft Actor-Critic

legged locomotion

sample efficiency

sim-to-real transfer

massively parallel training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Soft Actor-Critic

sample efficiency

sim-to-real transfer