Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing language-driven humanoid control methods do not explicitly model contact transitions, support shifts, or balance preparation, which often leads to delayed or unstable motions. This work proposes DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent representation as the interface between high-level language commands and low-level closed-loop control, so that upcoming motion states can be predicted and prepared for. DAJI has two components: DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, and DAJI-Flow autoregressively generates future intent chunks from language and intent history. Together they support stable whole-body control under streaming instructions. Experiments report a 94.42% rollout success rate on HumanML3D-style generation and a subsequence FID of 0.152 on BABEL.

📝 Abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.
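The chunked, history-conditioned generation described for DAJI-Flow can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the paper's implementation: the names `toy_intent_generator` and `stream_intents`, the chunk length, the intent dimension, and the history length are all hypothetical, and the learned generator is replaced by a deterministic placeholder. It shows only the control flow: each new chunk is produced from the current instruction plus a bounded window of previously generated chunks, and the instruction may change mid-stream.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Sequence

# One chunk is a short sequence of intent vectors: chunk_len x intent_dim.
Chunk = List[List[float]]
Generator = Callable[[str, Sequence[Chunk], int, int], Chunk]


def toy_intent_generator(
    instruction: str, history: Sequence[Chunk], chunk_len: int, intent_dim: int
) -> Chunk:
    """Hypothetical stand-in for a learned chunk generator (assumption).

    It continues from the last intent vector in the history and drifts by a
    step that depends on the instruction text, so output depends on both inputs.
    """
    last = history[-1][-1] if history else [0.0] * intent_dim
    step = (len(instruction) % 5 + 1) * 0.01  # placeholder for language conditioning
    chunk: Chunk = []
    for _ in range(chunk_len):
        last = [value + step for value in last]
        chunk.append(last)
    return chunk


def stream_intents(
    instructions: Iterable[str],
    generator: Generator,
    chunks_per_instruction: int = 2,
    chunk_len: int = 4,
    intent_dim: int = 3,
    history_len: int = 2,
) -> List[Chunk]:
    """Autoregressively generate intent chunks while instructions stream in."""
    # Bounded history: only the most recent chunks condition the next one.
    history: Deque[Chunk] = deque(maxlen=history_len)
    generated: List[Chunk] = []
    for instruction in instructions:
        for _ in range(chunks_per_instruction):
            chunk = generator(instruction, list(history), chunk_len, intent_dim)
            history.append(chunk)  # feed the new chunk back as context
            generated.append(chunk)
    return generated


if __name__ == "__main__":
    for chunk in stream_intents(["walk forward", "sit down"], toy_intent_generator):
        print([round(vector[0], 2) for vector in chunk])
```

In a real system each generated chunk would be handed to a low-level policy (DAJI-Act in the paper's terminology) rather than printed. That hand-off is omitted here because the source does not specify its interface.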

Problem

Research questions and friction points this paper is trying to address.

humanoid control

language-conditioned

anticipatory intent

whole-body control

contact transitions

Innovation

Methods, ideas, or system contributions that make the work stand out.

anticipatory control

language-conditioned humanoid control

joint intent