ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an efficient, safe, and real-time method for translating natural language instructions into full-body motions for humanoid robots. Leveraging an edge-cloud collaborative architecture, the cloud generates language-conditioned motion sequences using a diffusion model (1D UNet + CLIP + DDIM), while the edge executes them in closed loop via a reinforcement learning policy. The approach introduces a novel 38-dimensional native robot motion representation that eliminates runtime retargeting. Efficient sim-to-real transfer is achieved through teacher-student distillation, evidence-aware adaptation, and morphological symmetry constraints. Integrated IMU-based fall detection and autonomous recovery further enhance robustness. Evaluated on HumanML3D, the system achieves an FID of 0.029 and R-Precision Top-1 of 0.686, and demonstrates stable, diverse instruction-following on the Unitree G1 robot without any hardware-specific fine-tuning.
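The 38-dimensional robot-native representation mentioned above can be sketched as a flat per-frame vector. A minimal sketch follows, assuming the partition is 29 joint angles + 2 planar root-velocity components + 1 root height + a 6D continuous root orientation (29 + 2 + 1 + 6 = 38); the 29-joint count is an assumption inferred from the total, not a detail stated on this page.

```python
import numpy as np

# Hypothetical layout of the 38-D per-frame motion vector.
# NUM_JOINTS = 29 is an assumption (29 + 2 + 1 + 6 = 38).
NUM_JOINTS = 29

def pack_frame(joint_angles, root_vel_xy, root_height, root_rot_6d):
    """Concatenate per-frame quantities into one 38-D vector."""
    assert joint_angles.shape == (NUM_JOINTS,)
    assert root_vel_xy.shape == (2,)
    assert root_rot_6d.shape == (6,)
    return np.concatenate([joint_angles, root_vel_xy, [root_height], root_rot_6d])

def unpack_frame(frame):
    """Split a 38-D vector back into its components."""
    assert frame.shape == (38,)
    joints = frame[:NUM_JOINTS]
    vel_xy = frame[NUM_JOINTS:NUM_JOINTS + 2]
    height = frame[NUM_JOINTS + 2]
    rot_6d = frame[NUM_JOINTS + 3:]
    return joints, vel_xy, height, rot_6d
```

Keeping the representation flat like this is what lets the cloud generator emit sequences the edge-side PD-controlled tracker can consume directly, with no human-body retargeting step at inference time.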

📝 Abstract
We present ECHO, an edge-cloud framework for language-driven whole-body control of humanoid robots. A cloud-hosted diffusion-based text-to-motion generator synthesizes motion references from natural language instructions, while an edge-deployed reinforcement-learning tracker executes them in closed loop on the robot. The two modules are bridged by a compact, robot-native 38-dimensional motion representation that encodes joint angles, root planar velocity, root height, and a continuous 6D root orientation per frame, eliminating inference-time retargeting from human body models and remaining directly compatible with low-level PD control. The generator adopts a 1D convolutional UNet with cross-attention conditioned on CLIP-encoded text features; at inference, DDIM sampling with 10 denoising steps and classifier-free guidance produces motion sequences in approximately one second on a cloud GPU. The tracker follows a Teacher-Student paradigm: a privileged teacher policy is distilled into a lightweight student equipped with an evidential adaptation module for sim-to-real transfer, further strengthened by morphological symmetry constraints and domain randomization. An autonomous fall recovery mechanism detects falls via onboard IMU readings and retrieves recovery trajectories from a pre-built motion library. We evaluate ECHO on a retargeted HumanML3D benchmark, where it achieves strong generation quality (FID 0.029, R-Precision Top-1 0.686) under a unified robot-domain evaluator, while maintaining high motion safety and trajectory consistency. Real-world experiments on a Unitree G1 humanoid demonstrate stable execution of diverse text commands with zero hardware fine-tuning.
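The abstract's inference recipe, DDIM sampling with 10 denoising steps plus classifier-free guidance, can be sketched generically. This is a minimal deterministic-DDIM (eta = 0) sketch, not the paper's implementation: `denoiser` stands in for the 1D-UNet epsilon predictor, `text_emb` for the CLIP text conditioning, and the guidance weight of 2.5 is an illustrative placeholder.

```python
import numpy as np

def ddim_sample(denoiser, text_emb, shape, alpha_bar, steps=10, guidance=2.5, seed=0):
    """Deterministic DDIM (eta = 0) with classifier-free guidance.

    denoiser(x, t, cond) predicts the noise epsilon; cond=None selects the
    unconditional branch. alpha_bar[t] is the cumulative noise schedule
    (close to 1 at t=0, close to 0 at t=T-1).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    T = len(alpha_bar)
    timesteps = np.linspace(T - 1, 0, steps).astype(int)
    for i, t in enumerate(timesteps):
        eps_cond = denoiser(x, t, text_emb)   # text-conditioned prediction
        eps_unc = denoiser(x, t, None)        # unconditional prediction
        # Classifier-free guidance: push the prediction toward the condition.
        eps = eps_unc + guidance * (eps_cond - eps_unc)
        ab_t = alpha_bar[t]
        x0 = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)  # predicted clean motion
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < steps else 1.0
        x = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev) * eps  # DDIM step
    return x
```

With only 10 steps instead of the full schedule, this style of sampler is what makes the roughly one-second cloud-side generation latency plausible.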
Problem

Research questions and friction points this paper is trying to address.

language-to-motion
humanoid control
edge-cloud orchestration
zero-shot execution
motion safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

edge-cloud orchestration
diffusion-based text-to-motion
robot-native motion representation
sim-to-real transfer
autonomous fall recovery
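The last innovation above, autonomous fall recovery driven by onboard IMU readings and a pre-built motion library, can be sketched as a simple trigger-and-lookup loop. The tilt-from-gravity heuristic and the 60-degree threshold below are illustrative assumptions, not values from the paper.

```python
import math

# Assumed torso-tilt threshold for declaring a fall (illustrative only).
TILT_LIMIT_RAD = math.radians(60)

def is_fallen(accel_xyz):
    """Estimate torso tilt from the gravity direction in accelerometer data."""
    ax, ay, az = accel_xyz
    g = math.sqrt(ax * ax + ay * ay + az * az)
    if g < 1e-6:
        return False  # free fall / invalid reading; defer to other checks
    tilt = math.acos(max(-1.0, min(1.0, az / g)))  # angle away from upright
    return tilt > TILT_LIMIT_RAD

def select_recovery(library, pose_key):
    """Retrieve a recovery trajectory from a pre-built motion library."""
    return library.get(pose_key, library["default"])
```

A runtime loop would call `is_fallen` on each IMU sample and, on a positive detection, hand the retrieved trajectory to the tracking policy instead of the current language-conditioned reference.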
Haozhe Jia
The Hong Kong University of Science and Technology (Guangzhou)
Jianfei Song
LimX Dynamics Technology Co., Ltd.
Yuan Zhang
Shandong University
Honglei Jin
The Hong Kong University of Science and Technology (Guangzhou); Shandong University
Youcheng Fan
The Hong Kong University of Science and Technology (Guangzhou)
Wenshuo Chen
Shandong University undergraduate student
Generative Models, XAI
Wei Zhang
LimX Dynamics Technology Co., Ltd.
Yutao Yue
Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI)