🤖 AI Summary
This work addresses key challenges in physics-based human-object interaction (HOI) simulation: inaccurate contact annotations in motion-capture (MoCap) data, insufficient hand-level detail, and limited object geometric diversity. The authors propose a curriculum-based teacher-student distillation framework: first, subject-specific teacher policies mimic and refine raw MoCap sequences with contact-aware corrections; second, knowledge is distilled into a single generalizable student policy, with the teachers acting as online experts providing direct supervision; third, reinforcement-learning fine-tuning pushes the student beyond pure imitation toward more physically plausible, higher-quality motion. The method requires only a few hours of imperfect MoCap data to learn diverse, high-fidelity whole-body HOI skills, generalizes zero-shot across multiple HOI benchmarks while producing physically consistent and visually realistic motions, and integrates seamlessly into existing motion generation pipelines, advancing HOI modeling from imitation learning toward controllable, physics-aware generation.
📝 Abstract
Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.
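The three-stage curriculum described above (per-subject teachers → online distillation → RL fine-tuning) can be illustrated with a heavily simplified toy sketch. Everything here is a stand-in for exposition only, not the paper's implementation: the "teachers" and "student" are linear maps rather than simulation-trained policies, the distillation loop is a DAgger-style online imitation step, and the fine-tuning reward is a hypothetical surrogate mixing imitation with a small action-magnitude penalty in place of real physics terms.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, ACT, N_SUBJ = 4, 2, 3

# Stage 1 (toy stand-in): one "teacher" per MoCap subject. A shared linear
# map plus a per-subject action offset stands in for subject-specific
# policies that have already mimicked and refined that subject's clips.
W_true = rng.standard_normal((OBS, ACT))
subj_bias = rng.standard_normal((N_SUBJ, ACT))

def teacher_act(sid, obs):
    return obs @ W_true + subj_bias[sid]

def student_act(sid, obs, W):
    # The student conditions on the subject via a one-hot, so a single
    # policy can absorb all teachers.
    x = np.concatenate([obs, np.eye(N_SUBJ)[sid]])
    return x @ W

# Stage 2: DAgger-style online distillation -- the student queries the
# relevant teacher for action labels on states it visits itself.
W_s = np.zeros((OBS + N_SUBJ, ACT))
lr = 0.05
for _ in range(4000):
    sid = int(rng.integers(N_SUBJ))
    obs = rng.standard_normal(OBS)
    x = np.concatenate([obs, np.eye(N_SUBJ)[sid]])
    err = x @ W_s - teacher_act(sid, obs)
    W_s -= lr * np.outer(x, err)  # SGD step on 0.5 * ||err||^2

# Stage 3 (toy): "RL fine-tuning" by random search on a surrogate reward
# mixing imitation with an extra penalty the teachers ignore (a small
# action-magnitude cost standing in for physics/energy objectives).
eval_obs = rng.standard_normal((60, OBS))
eval_sid = rng.integers(N_SUBJ, size=60)

def reward(W):
    r = 0.0
    for obs, sid in zip(eval_obs, eval_sid):
        a = student_act(sid, obs, W)
        t = teacher_act(sid, obs)
        r += -np.sum((a - t) ** 2) - 0.01 * np.sum(a ** 2)
    return r

r0 = reward(W_s)   # reward after distillation, before fine-tuning
best = r0
for _ in range(100):
    cand = W_s + 0.01 * rng.standard_normal(W_s.shape)
    rc = reward(cand)
    if rc > best:  # keep perturbations that improve the surrogate reward
        W_s, best = cand, rc
```

The sketch captures the key structural idea: distillation alone can only replicate the teachers, while the fine-tuning stage optimizes an objective the demonstrations do not encode, which is what lets the actual method surpass mere demonstration replication.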