SENTINEL: A Fully End-to-End Language-Action Model for Humanoid Whole Body Control

📅 2025-11-24
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Existing approaches to language-action control for humanoid robots suffer from a fundamental disconnect: teleoperation relies on manual human input, while modular pipelines lack end-to-end alignment between linguistic instructions and physical execution. This work introduces the first fully end-to-end, embodiment-aware language-to-action generation model that directly maps natural language commands and real-time robot state (e.g., joint angles, base pose) to low-level joint torques or positions—bypassing intermediate symbolic or latent representations. The model employs flow matching to generate action trajectories and incorporates a residual action head to enhance robustness under real-world deployment conditions. Leveraging a pre-trained whole-body controller, we construct a large-scale, text-annotated simulated motion dataset. Extensive evaluation in simulation and on physical humanoid platforms—including the Unitree H1—demonstrates substantial improvements in semantic grounding accuracy, task execution stability, multi-turn interaction capability, and complex instruction parsing, thereby achieving tight coupling between linguistic understanding and embodied control.
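The flow-matching step described above can be sketched as follows: starting from Gaussian noise, a learned velocity field conditioned on the language command and robot state is integrated to produce an action chunk. This is a minimal illustration under assumed shapes, not the paper's implementation; `velocity_field`, the chunk length, the action dimension, and the toy contracting field are all placeholders.

```python
import numpy as np

def sample_action_chunk(velocity_field, command_emb, proprio,
                        chunk_len=16, act_dim=19, steps=10, rng=None):
    """Generate an action chunk by integrating a learned velocity field
    from Gaussian noise toward actions (Euler steps along the flow)."""
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal((chunk_len, act_dim))  # start from noise
    for i in range(steps):
        t = i / steps
        x = x + velocity_field(x, t, command_emb, proprio) / steps
    return x

# Toy velocity field that simply contracts toward zero (illustration only;
# the real model conditions on the language command and proprioception).
def toy_field(x, t, command_emb, proprio):
    return -x

chunk = sample_action_chunk(toy_field, np.zeros(8), np.zeros(32))
print(chunk.shape)  # (16, 19)
```

In a trained model the velocity field would be a neural network and the integration schedule a tuned hyperparameter; the Euler loop above is the generic sampling recipe for flow matching.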

📝 Abstract
Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole body controller, combined with their text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. The model generates action chunks using flow matching, which can be subsequently refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into texts.
Problem

Research questions and friction points this paper is trying to address.

Existing systems separate language understanding from physical execution
Current approaches lack tight alignment between commands and behaviors
Teleoperation methods are entirely human-driven without autonomy
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end model mapping language to robot actions
Flow matching generates action chunks for control
Residual action head refines actions for real deployment
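The residual-refinement idea in the list above can be sketched as a small head that predicts per-step corrections added to the generated chunk. The head, its inputs, and the damping behavior here are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def refine_actions(base_chunk, residual_head, proprio):
    """Add a learned per-step correction to the flow-matching output;
    the residual head is meant to absorb sim-to-real discrepancies."""
    return base_chunk + residual_head(base_chunk, proprio)

# Hypothetical residual head that mildly damps the commanded actions.
def toy_residual_head(chunk, proprio):
    return -0.1 * chunk

base = np.ones((16, 19))  # stand-in for a generated action chunk
refined = refine_actions(base, toy_residual_head, np.zeros(32))
print(refined[0, 0])  # 0.9
```

Keeping the correction additive means the base generator can be trained in simulation and the small residual head adapted for deployment without retraining the whole model.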