AI Summary
Existing physics-driven whole-body dexterous manipulation approaches rely on precise trajectory tracking or VR-based teleoperation, which makes them ill-suited for long-horizon, weakly constrained loco-manipulation tasks (e.g., "grasp cup → transport → insert into slot") and unable to respond naturally to sparse high-level objectives (e.g., target object poses, key body configurations). This work introduces the first unified generative whole-body control policy. It employs a two-stage learning framework: first, a physics-aware tracking controller is trained on motion-capture data; second, that controller is distilled with masked action generation, decoupling high-level goals from low-level motor execution. By combining rigid-body dynamics simulation with goal-conditioned sequence modeling, the method achieves high-fidelity, stable, real-time control in complex loco-manipulation scenarios, improves generalization to unseen targets, and produces more natural interactions, enabling partial-goal-driven animation synthesis.
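The summary does not spell out how "masked action generation" conditions the policy, so the sketch below illustrates one plausible reading: unspecified goal dimensions are zeroed out and a binary mask tells the network which targets are active. All names, dimensions, and the network architecture (`MaskedGoalPolicy`, `goal_mask`, etc.) are illustrative assumptions, not the paper's actual interface.

```python
import torch
import torch.nn as nn

class MaskedGoalPolicy(nn.Module):
    """Hypothetical goal-conditioned policy: goal entries the user leaves
    unspecified are zeroed, and a binary mask marks which targets are active."""

    def __init__(self, obs_dim: int, goal_dim: int, act_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2 * goal_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor, goal: torch.Tensor, goal_mask: torch.Tensor):
        # Zero out unspecified goal dimensions and append the mask so the network
        # can distinguish "goal value is 0" from "no goal was given".
        masked_goal = goal * goal_mask
        return self.net(torch.cat([obs, masked_goal, goal_mask], dim=-1))

# Example: condition only on a target object pose, leaving body goals free.
obs = torch.randn(1, 128)        # proprioception + object state (sizes assumed)
goal = torch.randn(1, 32)        # [object pose | key body targets] (layout assumed)
mask = torch.zeros(1, 32)
mask[:, :7] = 1.0                # first 7 dims: object position + orientation quaternion
policy = MaskedGoalPolicy(obs_dim=128, goal_dim=32, act_dim=30)
action = policy(obs, goal, mask)
```

In this reading, masking at the input lets a single policy handle any subset of goals (object only, body only, or both) without retraining.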
Abstract
Humans interact with their world while leveraging precise full-body control to achieve versatile goals. This versatility allows them to solve long-horizon, underspecified problems, such as placing a cup in a sink, by seamlessly sequencing actions like approaching the cup, grasping it, transporting it, and finally placing it in the sink. Such goal-driven control can enable new procedural tools for animation systems, allowing users to define partial objectives while the system naturally "fills in" the intermediate motions. However, while current methods for whole-body dexterous manipulation in physics-based animation achieve success in specific interaction tasks, they typically employ control paradigms (e.g., detailed kinematic motion tracking, continuous object trajectory following, or direct VR teleoperation) that offer limited versatility for high-level goal specification across the entire coupled human-object system. To bridge this gap, we present MaskedManipulator, a unified and generative policy developed through a two-stage learning approach. First, our system trains a tracking controller to physically reconstruct complex human-object interactions from large-scale human mocap datasets. This tracking controller is then distilled into MaskedManipulator, which provides users with intuitive control over both the character's body and the manipulated object. As a result, MaskedManipulator enables users to specify complex loco-manipulation tasks through intuitive high-level objectives (e.g., target object poses, key character stances), and then synthesizes the necessary full-body actions for a physically simulated humanoid to achieve these goals, paving the way for more interactive and life-like virtual characters.
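To make the two-stage approach more concrete, here is a minimal sketch of one possible distillation update, in which a privileged tracking teacher (which sees the full reference motion) supervises the masked-goal student via behavior cloning on shared states. The function name, loss, and inputs are assumptions for illustration, not the authors' implementation; in practice such distillation often samples states from student rollouts (DAgger-style) rather than from the teacher alone.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, obs, full_reference, goal, goal_mask, optimizer):
    """One hypothetical distillation update: the privileged tracking teacher
    produces target actions, and the masked-goal student is regressed onto them."""
    with torch.no_grad():
        teacher_action = teacher(obs, full_reference)   # teacher sees the full mocap reference
    student_action = student(obs, goal, goal_mask)       # student sees only sparse, masked goals
    loss = F.mse_loss(student_action, teacher_action)    # imitate the teacher's actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```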