AI Summary
This work addresses the silent failure of simulation-trained policies during real-world deployment on humanoid robots caused by out-of-distribution (OOD) states. The authors propose RAPT, a lightweight, self-supervised runtime monitor that learns a probabilistic spatio-temporal manifold of nominal execution in simulation and detects predictive deviations in real time at 50 Hz, enabling OOD detection under strict false-positive constraints and an interpretable measure of Sim-to-Real drift. RAPT's novelty lies in combining gradient-based temporal saliency with zero-shot reasoning from large language models to form an automated post-hoc root-cause analysis pipeline that produces interpretable fault diagnoses using only proprioceptive data. Evaluated on the Unitree G1 platform, RAPT improves true positive rate (TPR) by 37% over the strongest baseline in simulation (at a fixed episode-level false positive rate of 0.5%) and by 12.5% in physical deployment, and reaches 75% root-cause classification accuracy across 16 real-world failures.
Abstract
Deploying learned control policies on humanoid robots is challenging: policies that appear robust in simulation can execute confidently in out-of-distribution (OOD) states after Sim-to-Real transfer, leading to silent failures that risk hardware damage. Although anomaly detection can mitigate these failures, prior methods are often incompatible with high-rate control, poorly calibrated at the extremely low false-positive rates required for practical deployment, or operate as black boxes that provide a binary stop signal without explaining why the robot drifted from nominal behavior. We present RAPT, a lightweight, self-supervised deployment-time monitor for 50 Hz humanoid control. RAPT learns a probabilistic spatio-temporal manifold of nominal execution from simulation and evaluates execution-time predictive deviation as a calibrated, per-dimension signal. This yields (i) reliable online OOD detection under strict false-positive constraints and (ii) a continuous, interpretable measure of Sim-to-Real mismatch that can be tracked over time to quantify how far deployment has drifted from training. Beyond detection, we introduce an automated post-hoc root-cause analysis pipeline that combines gradient-based temporal saliency derived from RAPT's reconstruction objective with LLM-based reasoning conditioned on saliency and joint kinematics to produce semantic failure diagnoses in a zero-shot setting. We evaluate RAPT on a Unitree G1 humanoid across four complex tasks in simulation and on physical hardware. In large-scale simulation, RAPT improves True Positive Rate (TPR) by 37% over the strongest baseline at a fixed episode-level false positive rate of 0.5%. On real-world deployments, RAPT achieves a 12.5% TPR improvement and provides actionable interpretability, reaching 75% root-cause classification accuracy across 16 real-world failures using only proprioceptive data.
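The evaluation protocol mentioned above, TPR at a fixed episode-level false positive rate, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`episode_scores`, `calibrate_threshold`, `flag_rate`) are hypothetical, the anomaly scores are synthetic Gaussian stand-ins rather than RAPT outputs, and the sketch assumes an episode is flagged as soon as any single step exceeds the threshold.

```python
import math
import random


def episode_scores(step_scores_per_episode):
    """Reduce each episode to its maximum per-step anomaly score.

    Assumption: an episode is flagged if any step exceeds the threshold,
    so the episode-level score is the max over its time steps.
    """
    return [max(steps) for steps in step_scores_per_episode]


def calibrate_threshold(nominal_episode_scores, target_fpr=0.005):
    """Return a threshold whose empirical episode-level FPR is <= target_fpr.

    An episode counts as flagged when its score is strictly greater than
    the threshold. Requires a non-empty list of nominal scores.
    """
    ordered = sorted(nominal_episode_scores)
    n = len(ordered)
    # Number of nominal episodes we may flag without exceeding the budget
    # (small epsilon guards against floating-point round-off in the product).
    allowed = min(n - 1, math.floor(target_fpr * n + 1e-9))
    # At most `allowed` scores are strictly greater than this order statistic.
    return ordered[n - allowed - 1]


def flag_rate(scores, threshold):
    """Fraction of episodes whose score strictly exceeds the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Synthetic stand-ins: 1000 nominal and 200 failure episodes of 50 steps each.
rng = random.Random(0)
nominal = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(1000)]
failure = [
    [rng.gauss(0.0, 1.0) + (4.0 if t >= 25 else 0.0) for t in range(50)]
    for _ in range(200)
]

threshold = calibrate_threshold(episode_scores(nominal), target_fpr=0.005)
fpr = flag_rate(episode_scores(nominal), threshold)
tpr = flag_rate(episode_scores(failure), threshold)
print(f"threshold={threshold:.3f}  FPR={fpr:.4f}  TPR={tpr:.4f}")
```

Calibrating on the per-episode maximum rather than on individual steps is what makes the false-positive budget episode-level: at 50 Hz a single episode contains thousands of steps, so a per-step rate of 0.5% would flag nearly every nominal episode.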