🤖 AI Summary
Existing approaches to language-driven control for humanoid robots suffer from a fundamental disconnect: teleoperation relies on manual human input, while modular pipelines lack end-to-end alignment between linguistic instructions and physical execution. This work introduces SENTINEL, a fully end-to-end, embodiment-aware language-to-action model that directly maps natural language commands and real-time robot state (e.g., joint angles, base pose) to low-level joint torques or positions, bypassing intermediate symbolic or latent representations. The model generates action trajectories with flow matching and adds a residual action head for robustness under real-world deployment conditions. Using a pre-trained whole-body controller, the authors construct a large-scale, text-annotated simulated motion dataset. Evaluation in simulation and on physical humanoid platforms, including the Unitree H1, shows substantial improvements in semantic grounding accuracy, task execution stability, multi-turn interaction, and complex instruction parsing, achieving tight coupling between linguistic understanding and embodied control.
📝 Abstract
Existing humanoid control systems often rely on teleoperation or on modular generation pipelines that separate language understanding from physical execution. The former is entirely human-driven, while the latter lacks tight alignment between language commands and physical behavior. In this paper, we present SENTINEL, a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation with a pretrained whole-body controller and pairing them with text annotations. The model maps language commands and proprioceptive inputs directly to low-level actions without any intermediate representation, generating action chunks with flow matching that a residual action head can subsequently refine for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and the real world, and also supports multi-modal extensions by converting other input modalities into text.
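The flow-matching step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the chunk shape, the stand-in velocity field, and all names here are illustrative assumptions. A trained model would predict the velocity with a neural network conditioned on the language command and proprioception; here a closed-form field for a linear noise-to-target path stands in, and an Euler ODE solver integrates from Gaussian noise at t=0 to an action chunk at t=1.

```python
import numpy as np

CHUNK_LEN, ACT_DIM = 8, 4  # hypothetical action-chunk shape (timesteps x joints)

def make_velocity_field(target_chunk):
    """Stand-in for a learned velocity network v_theta(x_t, t, cond).

    For the linear path x_t = (1 - t) * noise + t * target, the marginal
    velocity pointing at the target is (target - x_t) / (1 - t).
    """
    def v(x_t, t):
        return (target_chunk - x_t) / max(1.0 - t, 1e-3)
    return v

def sample_action_chunk(v, steps=50, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (actions) with Euler."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((CHUNK_LEN, ACT_DIM))  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * v(x, i * dt)
    return x

# Pretend these are the actions the conditioning implies; in SENTINEL a
# residual action head would further refine the sampled chunk for deployment.
target = np.full((CHUNK_LEN, ACT_DIM), 0.5)
chunk = sample_action_chunk(make_velocity_field(target))
print(np.abs(chunk - target).max())
```

With this analytic field the final Euler step lands exactly on the target, so the printed residual is essentially zero; with a learned network the same integration loop produces the generated action chunk.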