DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating controllable and physically consistent human-object interaction (HOI) videos, a task where existing methods often rely on dense control signals, template videos, or highly detailed textual prompts, thereby limiting flexibility and generalization. To overcome these constraints, the authors propose a sparse motion guidance mechanism that requires only wrist joint coordinates and object bounding boxes, substantially reducing control complexity. They further introduce an object-enhanced attention module and a multi-task auxiliary training strategy to improve generation quality and generalization to novel objects. Coupled with a dedicated HOI data curation and construction pipeline, the proposed approach achieves high-fidelity, intuitively controllable HOI video synthesis across diverse tasks, outperforming current state-of-the-art methods.
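The paper states only that the guidance consists of wrist joint coordinates and a shape-agnostic object bounding box; the concrete encoding is not given. As a minimal illustrative sketch, one way to turn such sparse signals into a per-frame conditioning map is to rasterize each wrist as a point and the box as a filled mask (the channel layout and normalization here are assumptions, not the authors' implementation):

```python
import numpy as np

def sparse_motion_guidance(wrists, boxes, h=64, w=64):
    """Rasterize per-frame wrist points and an object bounding box
    into a 3-channel guidance map (hypothetical encoding).

    wrists: (T, 2, 2) array of normalized (x, y) for left/right wrist
    boxes:  (T, 4) array of normalized (x1, y1, x2, y2) object boxes
    returns: (T, 3, h, w) float32 guidance maps
    """
    T = wrists.shape[0]
    maps = np.zeros((T, 3, h, w), dtype=np.float32)
    for t in range(T):
        # Channels 0/1: left/right wrist, each a single active pixel.
        for c in range(2):
            x = int(np.clip(wrists[t, c, 0], 0, 1) * (w - 1))
            y = int(np.clip(wrists[t, c, 1], 0, 1) * (h - 1))
            maps[t, c, y, x] = 1.0
        # Channel 2: shape-agnostic object box as a filled mask.
        x1, y1, x2, y2 = boxes[t]
        x1, x2 = (int(np.clip(v, 0, 1) * (w - 1)) for v in (x1, x2))
        y1, y2 = (int(np.clip(v, 0, 1) * (h - 1)) for v in (y1, y2))
        maps[t, 2, y1:y2 + 1, x1:x2 + 1] = 1.0
    return maps
```

Such a map could then be concatenated with the video latent or injected through cross-attention; either choice is a design assumption beyond what the summary specifies.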

📝 Abstract
Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. They rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce DISPLAY, a framework guided by Sparse Motion Guidance composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at https://mumuwei.github.io/DISPLAY/.
Problem

Research questions and friction points this paper is trying to address.

Human-Object Interaction
Video Generation
Controllable Generation
Physical Consistency
Sparse Guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Motion Guidance
Object-Stressed Attention
Multi-Task Auxiliary Training
Human-Object Interaction
Controllable Video Generation