Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing language-driven humanoid control methods do not explicitly model contact transitions, support transfers, or balance preparation, which often leads to delayed or unstable motions. This work proposes DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between high-level language generation and low-level closed-loop control. DAJI has two components: DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, and DAJI-Flow autoregressively generates future intent chunks from language and intent history, supporting whole-body control under streaming instructions. Experiments report 94.42% rollout success on HumanML3D-style generation and a subsequence FID of 0.152 on BABEL.
📝 Abstract
Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.
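The idea of distilling a future-aware teacher through student-driven rollouts can be illustrated with a toy sketch. This is not the DAJI implementation: the abstract describes a diffusion action policy, whereas the sketch below swaps in a two-parameter linear student fitted by ridge regression, on a one-dimensional state. The names `teacher_action`, `rollout`, `fit_ridge`, and `distill`, the dynamics, and all constants are assumptions made purely for illustration. The only point it shows is the data-collection pattern: the student drives the rollout, and the teacher, which can see the next target, labels the states the student actually visits.

```python
import random


def teacher_action(state, next_target):
    # Hypothetical future-aware teacher: it sees the NEXT target,
    # which the student never observes directly.
    return 0.5 * (next_target - state)


def rollout(w, targets, rng, noise=0.01):
    """Roll out the student and record the teacher's label at each visited state."""
    state, data = 0.0, []
    for t in range(len(targets) - 1):
        feats = (state, targets[t])                 # student sees only the current target
        label = teacher_action(state, targets[t + 1])
        data.append((feats, label))
        action = w[0] * feats[0] + w[1] * feats[1]  # the student's action drives the rollout
        state += action + noise * rng.gauss(0.0, 1.0)
    return data


def fit_ridge(data, lam=1e-6):
    """Solve (X^T X + lam * I) w = X^T y for a two-feature linear student."""
    a = sum(f[0] * f[0] for f, _ in data) + lam
    b = sum(f[0] * f[1] for f, _ in data)
    d = sum(f[1] * f[1] for f, _ in data) + lam
    p = sum(f[0] * y for f, y in data)
    q = sum(f[1] * y for f, y in data)
    det = a * d - b * b
    return ((d * p - b * q) / det, (a * q - b * p) / det)


def distill(targets, iterations=5, seed=0):
    """Alternate student rollouts with refitting on all teacher-labelled data so far."""
    rng = random.Random(seed)
    w, dataset = (0.0, 0.0), []
    for _ in range(iterations):
        dataset += rollout(w, targets, rng)
        w = fit_ridge(dataset)
    return w


if __name__ == "__main__":
    print(distill([1.0] * 20))
```

With a constant target sequence the teacher's action is exactly linear in the student's features, so the toy student can match it closely; in a realistic setting the student would only approximate the teacher on the states its own rollouts reach.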
Problem

Research questions and friction points this paper is trying to address.

humanoid control
language-conditioned
anticipatory intent
whole-body control
contact transitions
Innovation

Methods, ideas, or system contributions that make the work stand out.

anticipatory control
language-conditioned humanoid control
joint intent
diffusion policy
hierarchical robot learning
Haozhe Jia
The Hong Kong University of Science and Technology (Guangzhou)
Honglei Jin
The Hong Kong University of Science and Technology (Guangzhou)
Yuan Zhang
Shandong University
Youcheng Fan
The Hong Kong University of Science and Technology (Guangzhou)
Shaofeng Liang
The Hong Kong University of Science and Technology (Guangzhou)
Lei Wang
Griffith University, Data61/CSIRO
Action Recognition, Computer Vision, Machine Learning, Deep Learning, Pattern Recognition
Shuxu Jin
Shandong University
Kuimou Yu
The Hong Kong University of Science and Technology (Guangzhou)
Zinuo Zhang
Shandong University
Jianfei Song
LimX Dynamics Technology Co., Ltd.
Wenshuo Chen
Shandong University undergraduate student
Generative Models, XAI
Yutao Yue
Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI)