Mask2IV: Interaction-Centric Video Generation via Mask Trajectories

📅 2025-10-03

📈 Citations: 0

✨ Influential: 0

career value

193K/year

🤖 AI Summary

Existing interactive video generation methods struggle to model complex human-object and robot-object dynamic interactions and rely heavily on dense pixel-level mask annotations, limiting practical applicability. To address this, we propose a decoupled two-stage framework: Stage I predicts mask trajectories of agents and objects—without requiring pixel-level mask inputs; Stage II synthesizes high-fidelity videos conditioned on these trajectories, action descriptions, and spatial coordinates, enabling fine-grained control over interacting entities and motion paths. Our contributions are: (1) the first dual-benchmark dataset dedicated to human-object and robot-object interaction video generation; and (2) the first trajectory-conditioned controllable video generation framework that eliminates the need for dense annotations. Extensive experiments demonstrate that our method significantly outperforms prior approaches in both visual realism and interaction controllability.

Technology Category

Application Category

📝 Abstract

Generating interaction-centric videos, such as those depicting humans or robots interacting with objects, is crucial for embodied intelligence, as they provide rich and diverse visual priors for robot learning, manipulation policy training, and affordance reasoning. However, existing methods often struggle to model such complex and dynamic interactions. While recent studies show that masks can serve as effective control signals and enhance generation quality, obtaining dense and precise mask annotations remains a major challenge for real-world use. To overcome this limitation, we introduce Mask2IV, a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. This design eliminates the need for dense mask inputs from users while preserving the flexibility to manipulate the interaction process. Furthermore, Mask2IV supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues. To support systematic training and evaluation, we curate two benchmarks covering diverse action and object categories across both human-object interaction and robotic manipulation scenarios. Extensive experiments demonstrate that our method achieves superior visual realism and controllability compared to existing baselines.

Problem

Research questions and friction points this paper is trying to address.

Generating interaction-centric videos without dense mask annotations

Modeling complex dynamic interactions between actors and objects

Providing flexible control over interaction targets and motion trajectories

Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage pipeline decouples motion prediction from video generation

Generates videos using predicted actor and object trajectories

Enables control via object specification and motion guidance

🔎 Similar Papers

No similar papers found.