🤖 AI Summary
This work addresses the challenge of generating controllable and physically consistent human-object interaction (HOI) videos. Existing methods often rely on dense control signals, template videos, or highly detailed text prompts, which limits flexibility and generalization. To overcome these constraints, the authors propose DISPLAY, a framework built on a Sparse Motion Guidance mechanism that requires only wrist joint coordinates and an object bounding box, substantially reducing control complexity. They further introduce an Object-Stressed Attention mechanism and a Multi-Task Auxiliary Training strategy to improve generation quality and generalization to novel objects. Coupled with a dedicated HOI data curation and construction pipeline, the proposed approach achieves high-fidelity, intuitively controllable HOI video synthesis across diverse tasks, outperforming current state-of-the-art methods.
📝 Abstract
Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Prior work relies on dense control signals, template videos, or carefully crafted text prompts, which limits flexibility and generalization to novel objects. We introduce DISPLAY, a framework driven by Sparse Motion Guidance, which consists only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page is available at https://mumuwei.github.io/DISPLAY/.
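To make the idea of sparse guidance concrete, the sketch below shows one way a per-frame control signal made of two wrist points and one bounding box could be stored and rasterized into a coarse mask. The abstract does not describe how DISPLAY actually encodes its guidance, so every name here (`FrameGuidance`, `rasterize`), the normalized coordinate convention, and the two-channel mask layout are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


@dataclass
class FrameGuidance:
    """Hypothetical sparse control signal for a single video frame."""
    left_wrist: Point   # (x, y), assumed normalized to [0, 1]
    right_wrist: Point  # (x, y), assumed normalized to [0, 1]
    object_box: Box     # (x_min, y_min, x_max, y_max), assumed normalized


def rasterize(g: FrameGuidance, height: int, width: int) -> List[List[List[int]]]:
    """Render the guidance into a binary mask indexed as [channel][row][col].

    Channel 0 marks the two wrist pixels; channel 1 fills the object box.
    The box carries no shape information, only position and extent.
    """
    mask = [[[0] * width for _ in range(height)] for _ in range(2)]

    def to_pixel(x: float, y: float) -> Tuple[int, int]:
        # Clamp so that coordinates equal to 1.0 stay inside the grid.
        col = min(width - 1, max(0, int(x * width)))
        row = min(height - 1, max(0, int(y * height)))
        return row, col

    for wrist in (g.left_wrist, g.right_wrist):
        row, col = to_pixel(*wrist)
        mask[0][row][col] = 1

    x0, y0, x1, y1 = g.object_box
    r0, c0 = to_pixel(x0, y0)
    r1, c1 = to_pixel(x1, y1)
    for row in range(r0, r1 + 1):
        for col in range(c0, c1 + 1):
            mask[1][row][col] = 1
    return mask


# Example: two wrists on the same row and an object box in the lower right.
guidance = FrameGuidance((0.25, 0.5), (0.75, 0.5), (0.5, 0.5, 0.75, 0.75))
mask = rasterize(guidance, height=4, width=4)
```

Under these assumptions, a whole frame is described by eight numbers (two wrist points and one box), which illustrates why such guidance is much lighter than dense pose maps or template videos and easy for a user to specify by hand.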