Precise Action-to-Video Generation Through Visual Action Prompts

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Action-driven video generation faces a fundamental trade-off between precision and generalizability: textual or raw motion representations offer strong cross-domain generalization but only coarse control, whereas agent-centric signals enable high-precision pose manipulation yet transfer poorly across domains. To resolve this, we propose Visual Action Prompts (VAP), a unified, transferable action representation based on visual skeletal sequences that enables both fine-grained pose control and dynamic generalization across diverse scenarios. Our method introduces a robust skeleton preprocessing pipeline and integrates into pretrained video diffusion models via lightweight fine-tuning. Experiments on EgoVid, RT-1, and DROID demonstrate that VAP improves motion fidelity in complex human-object interaction scenes and enhances cross-domain adaptability, establishing a practical route to highly controllable, open-domain video generation.
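
The summary mentions a robust skeleton preprocessing pipeline but gives no details. As a rough illustration of what such a step might involve, here is a minimal sketch assuming per-joint detector confidences: low-confidence joints are masked out, then filled by temporal interpolation. The function name, threshold, and interpolation scheme are illustrative assumptions, not the paper's pipeline.

```python
# Hypothetical sketch of skeleton preprocessing, assuming per-joint
# confidence scores from a 2D pose detector. Not the paper's pipeline.
import numpy as np

def clean_skeleton_sequence(joints, conf, thresh=0.3):
    """joints: (T, J, 2) pixel coords; conf: (T, J) detector confidences.

    Joints below the confidence threshold are treated as missing and
    filled by linear interpolation over time (endpoints are held
    constant outside the valid range, since np.interp clamps).
    """
    joints = joints.copy()
    T, J, _ = joints.shape
    t = np.arange(T)
    for j in range(J):
        valid = conf[:, j] >= thresh
        if valid.sum() < 2:  # too few detections to interpolate from
            continue
        for d in range(2):   # interpolate x and y independently
            joints[~valid, j, d] = np.interp(
                t[~valid], t[valid], joints[valid, j, d]
            )
    return joints
```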

📝 Abstract
We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to "render" actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object interactions (HOI) and dexterous robotic manipulation - enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interaction while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach. Project page: https://zju3dv.github.io/VAP/.
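
The abstract's core idea, "rendering" actions into visual prompts, can be pictured with a minimal sketch: project each skeleton into the image plane and draw its joints and bones as a per-frame conditioning image. The joint/edge layout, colors, and OpenCV-based rendering below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: rendering a per-frame 2D skeleton into a visual
# action prompt image. Joint layout, edge list, and colors are assumed.
import numpy as np
import cv2

# Assumed kinematic tree: pairs of joint indices to connect with bones.
EDGES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)]

def render_skeleton_frame(joints_2d, height, width):
    """Draw one skeleton (J x 2 pixel coordinates) onto a blank canvas."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    for a, b in EDGES:
        pa = tuple(int(v) for v in joints_2d[a])
        pb = tuple(int(v) for v in joints_2d[b])
        cv2.line(canvas, pa, pb, (0, 255, 0), 2)       # bone
    for x, y in joints_2d:
        cv2.circle(canvas, (int(x), int(y)), 3, (0, 0, 255), -1)  # joint
    return canvas

def render_action_prompt(joint_sequence, height=256, width=256):
    """Render a (T, J, 2) skeleton sequence into T prompt images."""
    return np.stack(
        [render_skeleton_frame(j, height, width) for j in joint_sequence]
    )
```

The resulting image stack is domain-agnostic in the sense the abstract describes: the same rendered skeletons can condition generation of a human hand or a robot arm, since the model only ever sees the drawing.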
Problem

Research questions and friction points this paper is trying to address.

Balancing precision and transferability in action-to-video generation
Representing complex actions with domain-agnostic visual skeletons
Enabling cross-domain training for action-driven generative models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual action prompts for precise video generation
Cross-domain training with visual skeletons
Lightweight fine-tuning of pretrained models (see the sketch below)
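
A minimal sketch of what the lightweight fine-tuning could look like, assuming a ControlNet-style adapter: the pretrained diffusion backbone stays frozen while rendered skeleton frames enter through a small encoder ending in a zero-initialized convolution, so training starts as an identity mapping. The module names, channel sizes, and injection point are assumptions; the paper does not specify this architecture here.

```python
# Hypothetical adapter for conditioning a frozen video diffusion backbone
# on rendered skeleton frames. ControlNet-style zero-init is assumed.
import torch
import torch.nn as nn

class SkeletonPromptAdapter(nn.Module):
    def __init__(self, in_channels=3, feat_channels=320):
        super().__init__()
        # Small encoder mapping prompt images to backbone feature shape
        # (two stride-2 convs downsample by 4x, matching an assumed
        # early backbone stage).
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, feat_channels, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Zero-init so the adapter contributes nothing at step 0 and the
        # pretrained model's behavior is undisturbed at the start.
        self.zero_conv = nn.Conv2d(feat_channels, feat_channels, 1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, prompt_frames, backbone_features):
        # prompt_frames: (B*T, 3, H, W) rendered skeletons;
        # backbone_features: matching (B*T, C, H/4, W/4) activations.
        return backbone_features + self.zero_conv(self.encoder(prompt_frames))

# During fine-tuning only the adapter receives gradients, e.g.:
# optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```

Zero-initialization is a common trick (popularized by ControlNet) for grafting new conditioning onto a pretrained model without destabilizing it early in training.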