🤖 AI Summary
This work addresses high-fidelity, general-purpose multi-skill execution on real humanoid robots, where increasing motion diversity often degrades tracking accuracy and physical feasibility. The authors propose OmniXtreme, a framework that decouples general motor skill learning from physics-aware execution refinement. First, a high-capacity flow-matching policy learns a diverse repertoire of motions; an actuation-aware sim-to-real refinement stage then improves real-world deployability. This approach breaks the long-standing trade-off between fidelity and scalability, enabling, for the first time, a single policy to stably execute multiple extreme, high-difficulty motions on a real humanoid. The method also generalizes well, maintaining high tracking fidelity across a challenging motion dataset.
📝 Abstract
High-fidelity motion tracking serves as the ultimate litmus test for generalizable, human-level motor skills. However, current policies often hit a "generality barrier": as motion libraries scale in diversity, tracking fidelity inevitably collapses, especially for real-world deployment of high-dynamic motions. We identify this failure as the result of two compounding factors: the learning bottleneck in scaling multi-motion optimization and the physical executability constraints that arise in real-world actuation. To overcome these challenges, we introduce OmniXtreme, a scalable framework that decouples general motor skill learning from sim-to-real physical skill refinement. Our approach uses a flow-matching policy with high-capacity architectures to scale representation capacity without interference-intensive multi-motion RL optimization, followed by an actuation-aware refinement phase that ensures robust performance on physical hardware. Extensive experiments demonstrate that OmniXtreme maintains high-fidelity tracking across diverse, high-difficulty datasets. On real robots, the unified policy successfully executes multiple extreme motions, effectively breaking the long-standing fidelity-scalability trade-off in high-dynamic humanoid control.
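The abstract does not specify the flow-matching objective or sampler, so the following is only a generic sketch of how a flow-matching policy is commonly trained and queried, not OmniXtreme's implementation. It assumes a linear interpolation path between Gaussian noise and the target action; the names `flow_matching_pair`, `flow_matching_loss`, `sample_action`, and `velocity_fn` are hypothetical, and `velocity_fn` stands in for whatever high-capacity network predicts the velocity field.

```python
import numpy as np

def flow_matching_pair(action, noise, t):
    """Build one training pair on a linear noise-to-action path (an assumption).

    x_t = (1 - t) * noise + t * action, with target velocity action - noise.
    """
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    x_t = (1.0 - t) * noise + t * action
    target_velocity = action - noise
    return x_t, target_velocity

def flow_matching_loss(velocity_fn, obs, action, rng):
    """Mean squared error between predicted and target velocities."""
    noise = rng.standard_normal(action.shape)
    t = rng.uniform(size=action.shape[0])  # one time in [0, 1) per sample
    x_t, target = flow_matching_pair(action, noise, t)
    pred = velocity_fn(x_t, t, obs)  # hypothetical network call
    return float(np.mean((pred - target) ** 2))

def sample_action(velocity_fn, obs, action_dim, rng, steps=10):
    """Integrate the learned velocity field from noise with forward Euler."""
    x = rng.standard_normal((obs.shape[0], action_dim))
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full(obs.shape[0], k * dt)
        x = x + dt * velocity_fn(x, t, obs)
    return x
```

In this sketch, training is plain regression on velocities rather than reinforcement learning, which is one way to read the abstract's point about avoiding interference-intensive multi-motion RL optimization. The actuation-aware refinement stage is not modeled here.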