🤖 AI Summary
Existing 3D human-object interaction (HOI) generation methods treat human motion and object dynamics independently, which leads to physically implausible and causally inconsistent interactions. To address this, we propose HOI-Dyn, a driver-responder framework in which human motion acts as the driver and object motion is modeled explicitly as the response. A lightweight Transformer serves as the interaction dynamics model and is integrated with a diffusion-based human motion generator. A residual-based dynamics loss limits the influence of the dynamics model's prediction errors on optimization. Because the dynamics model is used only during training, inference efficiency is preserved. Extensive qualitative and quantitative experiments show improved interaction realism. The approach also provides a quantitative metric for evaluating the physical consistency of generated interactions, offering a physics-aware route to HOI synthesis.
📝 Abstract
Generating realistic 3D human-object interactions (HOIs) remains a challenging task due to the difficulty of modeling detailed interaction dynamics. Existing methods treat human and object motions independently, resulting in physically implausible and causally inconsistent behaviors. In this work, we present HOI-Dyn, a novel framework that formulates HOI generation as a driver-responder system, where human actions drive object responses. At the core of our method is a lightweight transformer-based interaction dynamics model that explicitly predicts how objects should react to human motion. To further enforce consistency, we introduce a residual-based dynamics loss that mitigates the impact of dynamics prediction errors and prevents misleading optimization signals. The dynamics model is used only during training, preserving inference efficiency. Through extensive qualitative and quantitative experiments, we demonstrate that our approach not only enhances the quality of HOI generation but also establishes a feasible metric for evaluating the quality of generated interactions.
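The abstract does not give the exact form of the residual-based dynamics loss, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the loss compares the dynamics model's error on the generated pair with its error on the ground-truth pair, so that a systematic prediction error cancels instead of being passed to the generator as a misleading signal. The names `toy_dynamics` and `residual_dynamics_loss`, the three-feature frame layout, and the constant bias are all hypothetical.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]   # one frame of features (hypothetical layout)
Motion = Sequence[Frame]  # a motion sequence, one frame per time step


def toy_dynamics(human: Motion) -> List[List[float]]:
    """Hypothetical stand-in for a learned dynamics model.

    Predicts that the object follows the first three human features,
    with a constant bias of 0.05 that imitates a systematic model error.
    """
    return [[value + 0.05 for value in frame[:3]] for frame in human]


def residual_dynamics_loss(
    dynamics: Callable[[Motion], List[List[float]]],
    human_gen: Motion,
    object_gen: Motion,
    human_gt: Motion,
    object_gt: Motion,
) -> float:
    """Mean squared difference between two residuals (assumed form).

    generated residual: object_gen - dynamics(human_gen)
    reference residual: object_gt  - dynamics(human_gt)
    Subtracting the reference residual cancels errors the dynamics model
    makes on both pairs alike.
    """
    pred_gen = dynamics(human_gen)
    pred_gt = dynamics(human_gt)
    total, count = 0.0, 0
    for obj_g, prd_g, obj_t, prd_t in zip(object_gen, pred_gen, object_gt, pred_gt):
        for a, b, c, d in zip(obj_g, prd_g, obj_t, prd_t):
            total += ((a - b) - (c - d)) ** 2
            count += 1
    return total / count


# Toy data: in the "true" world the object tracks the human exactly.
human = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
obj = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

# A perfect generation gets zero loss despite the biased dynamics model.
perfect = residual_dynamics_loss(toy_dynamics, human, obj, human, obj)

# An object that drifts by 0.5 in the last frame is penalized.
drifted_obj = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
drifted = residual_dynamics_loss(toy_dynamics, human, drifted_obj, human, obj)
```

In this toy setup a plain loss of the form `object_gen - dynamics(human_gen)` would penalize even the perfect generation because of the 0.05 bias, whereas the residual form does not. In an actual training loop the same quantity would be computed on tensors with the dynamics model frozen, and the dynamics model would not be called at inference time.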