🤖 AI Summary
Humanoid robots carrying spill-prone objects, such as a full beer mug, face a fundamental conflict between gait control, which requires slow, robust dynamics, and end-effector stabilization, which demands fast, high-precision regulation, because the two tasks operate on different time scales and optimize different objectives. This work proposes SoFTA (Slow-Fast Two-Agent), a two-agent control architecture in which the lower body generates a robust gait at 50 Hz while the upper body performs high-fidelity end-effector stabilization at 100 Hz, effectively decoupling locomotion from manipulation. The method combines reinforcement learning–based dual-frequency coordinated control, task-decoupled reward functions, whole-body dynamic modeling, and real-time closed-loop feedback. Experiments show a 2–5× reduction in end-effector acceleration and successful execution of delicate, human-like tasks, including walking with a full mug, filming steadily while moving, and holding objects stably under disturbances, thereby offering a systematic approach to whole-body coordinated stabilization under multi-timescale task coupling.
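To make the slow-fast split concrete, here is a minimal Python sketch of a dual-rate control loop, assuming a 100 Hz base tick: the upper-body agent is queried every tick, the lower-body agent every second tick, and each agent's most recent action is held between queries (a zero-order hold). Each agent is also scored by its own reward. The class `SlowFastScheduler`, the reward functions `upper_reward` and `lower_reward`, and the stand-in policies are hypothetical illustrations, not SoFTA's actual implementation or reward terms.

```python
import math
from dataclasses import dataclass
from typing import Callable, Tuple

CONTROL_HZ = 100  # base tick rate of the control loop (assumed)
UPPER_HZ = 100    # upper-body agent: end-effector stabilization
LOWER_HZ = 50     # lower-body agent: gait


@dataclass
class SlowFastScheduler:
    """Hypothetical dual-rate loop: each agent is queried at its own rate,
    and its last action is held between queries (zero-order hold)."""
    upper_policy: Callable[[int], float]
    lower_policy: Callable[[int], float]
    upper_action: float = 0.0
    lower_action: float = 0.0
    upper_calls: int = 0
    lower_calls: int = 0

    def step(self, tick: int) -> Tuple[float, float]:
        if tick % (CONTROL_HZ // UPPER_HZ) == 0:  # true on every tick
            self.upper_action = self.upper_policy(tick)
            self.upper_calls += 1
        if tick % (CONTROL_HZ // LOWER_HZ) == 0:  # true on every 2nd tick
            self.lower_action = self.lower_policy(tick)
            self.lower_calls += 1
        # Between lower-body queries, the previous lower action is reused.
        return self.upper_action, self.lower_action


def upper_reward(ee_acc_norm: float) -> float:
    """Hypothetical EE-stability reward: larger when EE acceleration is small."""
    return math.exp(-ee_acc_norm)


def lower_reward(base_vel_err: float) -> float:
    """Hypothetical gait reward: larger when base velocity tracks the command."""
    return math.exp(-base_vel_err ** 2)


if __name__ == "__main__":
    sched = SlowFastScheduler(
        upper_policy=lambda t: 0.01 * t,   # stand-in policies
        lower_policy=lambda t: -0.01 * t,
    )
    held = [sched.step(tick) for tick in range(CONTROL_HZ)]  # one second
    print(sched.upper_calls, sched.lower_calls)  # 100 50
    # Each agent would be trained only on its own reward signal.
    print(round(upper_reward(0.0), 3), round(lower_reward(0.0), 3))  # 1.0 1.0
```

Holding the lower-body action for two ticks is one simple way to realize a 50 Hz / 100 Hz split; the sketch does not capture how SoFTA interleaves the two agents in simulation or on hardware.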
📝 Abstract
Can your humanoid walk up and hand you a full cup of beer without spilling a drop? While humanoids are increasingly featured in flashy demos such as dancing, delivering packages, and traversing rough terrain, fine-grained control during locomotion remains a significant challenge. In particular, stabilizing a filled end-effector (EE) while walking is far from solved because of a fundamental mismatch in task dynamics: locomotion demands slow-timescale, robust control, whereas EE stabilization requires rapid, high-precision corrections. To address this, we propose SoFTA, a Slow-Fast Two-Agent framework that decouples upper-body and lower-body control into separate agents operating at different frequencies and with distinct rewards. This temporal and objective separation mitigates policy interference and enables coordinated whole-body behavior. SoFTA executes upper-body actions at 100 Hz for precise EE control and lower-body actions at 50 Hz for a robust gait. It reduces EE acceleration by 2–5× relative to baselines and achieves stability much closer to human level, enabling delicate tasks such as carrying nearly full cups, capturing steady video during locomotion, and rejecting disturbances while keeping the EE stable.
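The headline metric, a 2–5× reduction in EE acceleration, is a ratio of acceleration magnitudes. The Python sketch below shows one way such a ratio could be computed, assuming EE positions sampled at a fixed interval `dt` (for example from motion capture) and using central second differences, a_t ≈ (p[t+1] − 2·p[t] + p[t−1]) / dt². The helper names (`ee_accelerations`, `mean_ee_acceleration`, `reduction_factor`, `synthetic_walk`) and the synthetic trajectories are hypothetical; they are not the paper's evaluation protocol or data.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def ee_accelerations(positions: Sequence[Vec3], dt: float) -> List[float]:
    """EE linear-acceleration magnitudes via central second differences:
    a_t ~= (p[t+1] - 2 * p[t] + p[t-1]) / dt**2, one per interior sample."""
    if len(positions) < 3:
        raise ValueError("need at least 3 samples")
    mags = []
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        acc = [(n - 2.0 * c + p) / dt ** 2 for p, c, n in zip(prev, cur, nxt)]
        mags.append(math.sqrt(sum(a * a for a in acc)))
    return mags


def mean_ee_acceleration(positions: Sequence[Vec3], dt: float) -> float:
    """Average acceleration magnitude over the trajectory (hypothetical metric)."""
    mags = ee_accelerations(positions, dt)
    return sum(mags) / len(mags)


def reduction_factor(baseline: Sequence[Vec3], method: Sequence[Vec3], dt: float) -> float:
    """How many times smaller the method's mean EE acceleration is than the baseline's."""
    return mean_ee_acceleration(baseline, dt) / mean_ee_acceleration(method, dt)


if __name__ == "__main__":
    dt = 0.01                          # 100 Hz sampling (assumed)
    ts = [i * dt for i in range(200)]  # 2 s of synthetic data

    def synthetic_walk(bob_amp: float) -> List[Vec3]:
        # Forward at 0.5 m/s with a 2 Hz vertical bob of amplitude bob_amp (m).
        return [(0.5 * t, 0.0, 1.0 + bob_amp * math.sin(2 * math.pi * 2.0 * t)) for t in ts]

    baseline = synthetic_walk(0.02)
    stabilized = synthetic_walk(0.005)
    print(round(reduction_factor(baseline, stabilized, dt), 2))  # 4.0
```

Because the synthetic baseline bobs with four times the vertical amplitude of the synthetic "stabilized" trajectory, and constant forward velocity contributes no acceleration, the ratio comes out at 4.0. Real measurements would also need filtering, since second differences amplify sensor noise.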