🤖 AI Summary
This work addresses the challenging problem of controllable human-object interaction (HOI) video generation from sparse trajectory inputs. To this end, we propose a two-stage framework: First, we design an HOI-aware motion representation that encodes body-part and object dynamics via color coding and incorporates human anatomical priors to enhance physical plausibility. Second, we introduce a motion densification network that transforms sparse keypoint trajectories into temporally coherent HOI mask sequences, which then serve as conditional guidance for a video diffusion model. Crucially, our approach avoids expensive dense signals, such as optical flow, depth maps, or 3D meshes, enabling end-to-end synthesis of complete, physically grounded interaction videos. On controllable HOI video generation benchmarks, our method achieves state-of-the-art performance, significantly improving motion realism and structural consistency across frames.
📝 Abstract
Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability into video generation further adds to the complexity. Existing controllable video generation approaches face a trade-off: sparse controls like keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depth maps, or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences, and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion, but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.
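To illustrate the kind of color-coded conditioning signal described above, the sketch below rasterizes sparse keypoint trajectories into per-frame RGB maps, with a distinct color per body part and for the object. This is a minimal illustration, not the paper's implementation: the part names, color assignments, and disk rasterization are assumptions for demonstration only (the actual method uses a learned motion densification network to produce full HOI mask sequences).

```python
import numpy as np

# Hypothetical color coding: one RGB color per body part, one for the object.
# The actual palette and part granularity in VHOI may differ.
PART_COLORS = {
    "head": (255, 0, 0),
    "torso": (0, 255, 0),
    "left_arm": (0, 0, 255),
    "right_arm": (255, 255, 0),
    "object": (255, 0, 255),  # drawn last, so it overlays body parts on contact
}

def rasterize_motion(trajectories, frames, height, width, radius=6):
    """Render sparse keypoint trajectories into per-frame RGB conditioning maps.

    trajectories: {part_name: array of shape (frames, 2)} with (x, y) pixel coords.
    Returns a uint8 array of shape (frames, height, width, 3).
    """
    maps = np.zeros((frames, height, width, 3), dtype=np.uint8)
    yy, xx = np.mgrid[0:height, 0:width]
    for part, color in PART_COLORS.items():
        if part not in trajectories:
            continue
        for t, (x, y) in enumerate(trajectories[part]):
            # Paint a small disk at the keypoint in this part's color.
            disk = (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2
            maps[t][disk] = color
    return maps

# Example: a hand keypoint approaching a static object over 8 frames.
traj = {
    "right_arm": np.stack([np.linspace(10, 50, 8), np.full(8, 32)], axis=1),
    "object": np.tile([[50, 32]], (8, 1)),
}
cond = rasterize_motion(traj, frames=8, height=64, width=64)
```

A sequence like `cond` (here, shape `(8, 64, 64, 3)`) plays the role of the dense conditioning input: a video diffusion model fine-tuned on such maps can associate each color channel pattern with part-specific or object-specific motion.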