DISPLAY: Directable Human-Object Interaction Video Generation via Sparse Motion Guidance and Multi-Task Auxiliary

📅 2026-03-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of generating controllable and physically consistent human-object interaction (HOI) videos. Existing methods often rely on dense control signals, template videos, or highly detailed textual prompts, which limits their flexibility and generalization. To overcome these constraints, the authors propose Sparse Motion Guidance, which requires only wrist joint coordinates and a shape-agnostic object bounding box, substantially reducing control complexity. They further introduce an Object-Stressed Attention mechanism to improve object robustness under such sparse conditions, and a Multi-Task Auxiliary Training strategy, backed by a dedicated HOI data curation pipeline, to compensate for the scarcity of high-quality HOI data. The resulting framework, DISPLAY, achieves high-fidelity, intuitively controllable HOI video synthesis across diverse tasks, outperforming current state-of-the-art methods.

📝 Abstract

Human-centric video generation has advanced rapidly, yet existing methods struggle to produce controllable and physically consistent Human-Object Interaction (HOI) videos. Existing works rely on dense control signals, template videos, or carefully crafted text prompts, which limit flexibility and generalization to novel objects. We introduce a framework, namely DISPLAY, guided by Sparse Motion Guidance, composed only of wrist joint coordinates and a shape-agnostic object bounding box. This lightweight guidance alleviates the imbalance between human and object representations and enables intuitive user control. To enhance fidelity under such sparse conditions, we propose an Object-Stressed Attention mechanism that improves object robustness. To address the scarcity of high-quality HOI data, we further develop a Multi-Task Auxiliary Training strategy with a dedicated data curation pipeline, allowing the model to benefit from both reliable HOI samples and auxiliary tasks. Comprehensive experiments show that our method achieves high-fidelity, controllable HOI generation across diverse tasks. The project page can be found at https://mumuwei.github.io/DISPLAY/.
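
To make the notion of sparse guidance concrete, the sketch below turns one frame's worth of control input (two wrist points and one object box) into three image-sized control channels. It is only an illustration under stated assumptions: the abstract does not say how DISPLAY encodes its guidance, so the function name `rasterize_sparse_guidance`, the Gaussian wrist heatmaps, the `sigma` parameter, and the inclusive `(x0, y0, x1, y1)` box layout are all hypothetical choices, not the authors' implementation.

```python
import math


def rasterize_sparse_guidance(wrists, box, height, width, sigma=2.0):
    """Turn one frame of sparse guidance into three H x W control channels.

    wrists: [(x, y), (x, y)] pixel coordinates of the two wrists
            (assumed layout, not taken from the paper).
    box:    (x0, y0, x1, y1) inclusive pixel bounds of the object
            (assumed layout, not taken from the paper).
    Returns a list [wrist_0_heatmap, wrist_1_heatmap, box_mask], each a
    list of `height` rows with `width` floats.
    """
    channels = []

    # One Gaussian heatmap per wrist, peaking at 1.0 on the wrist pixel.
    for wx, wy in wrists:
        heatmap = [
            [
                math.exp(-((x - wx) ** 2 + (y - wy) ** 2) / (2.0 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        channels.append(heatmap)

    # A filled rectangle for the object: it carries position and extent
    # only, so it stays agnostic to the object's actual shape.
    x0, y0, x1, y1 = box
    mask = [
        [1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0 for x in range(width)]
        for y in range(height)
    ]
    channels.append(mask)
    return channels


# Usage: a 16 x 16 frame with wrists at (4, 5) and (10, 3) and a 5 x 6 box.
maps = rasterize_sparse_guidance([(4, 5), (10, 3)], (2, 2, 6, 7), 16, 16)
```

Calling this once per frame would yield a control sequence that a user can author by dragging two points and a rectangle, which is the kind of lightweight input the abstract contrasts with dense signals such as full pose skeletons or template videos.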

Problem

Research questions and friction points this paper is trying to address.

Human-Object Interaction

Video Generation

Controllable Generation

Physical Consistency

Sparse Guidance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Motion Guidance

Object-Stressed Attention

Multi-Task Auxiliary Training