Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of physics-driven video generation: synthesizing videos conditioned on physical forces (e.g., localized point forces or global wind fields) without 3D assets or a physics simulator at inference. To overcome the scarcity of paired force-video data, we introduce *force prompting*: a conditioning mechanism that combines a synthetically generated Blender dataset, joint text-force embeddings, and lightweight fine-tuning (~15k samples) of a pre-trained video diffusion model, relying on the model's inherent visual and motion priors rather than simulation. Our method generalizes across diverse objects and materials, and ablations identify visual diversity and specific textual keywords as critical factors for force-conditioned generalization. Experiments demonstrate significant improvements over prior methods in force adherence and physical plausibility, enabling realistic force-response videos featuring complex geometries, materials, and scenes. Code, dataset, model weights, and an interactive demo are publicly released.
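To make the "joint text-force embedding" idea concrete, here is a minimal sketch (not the paper's actual code) of how a localized point-force prompt might be rasterized into a spatial map and fused with text conditioning. The tensor layout, channel counts, and fusion-by-concatenation are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def rasterize_point_force(x, y, angle, magnitude, H=32, W=32):
    """Rasterize a point force into a 3-channel spatial map: (dx, dy)
    direction components plus a Gaussian bump marking the poke location,
    scaled by magnitude. Coordinates assumed normalized to [0, 1]."""
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij"
    )
    bump = torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * 0.05 ** 2))
    dx = magnitude * torch.cos(torch.tensor(angle))
    dy = magnitude * torch.sin(torch.tensor(angle))
    return torch.stack([bump * dx, bump * dy, bump])  # (3, H, W)

class ForceTextFusion(nn.Module):
    """Project the rasterized force map into tokens and concatenate them
    with text tokens, so cross-attention sees a joint text-force prompt."""
    def __init__(self, d_model=512):
        super().__init__()
        self.force_proj = nn.Conv2d(3, d_model, kernel_size=4, stride=4)

    def forward(self, text_tokens, force_map):
        # text_tokens: (B, T, d_model); force_map: (B, 3, H, W)
        f = self.force_proj(force_map)             # (B, d, H/4, W/4)
        f = f.flatten(2).transpose(1, 2)           # (B, HW/16, d)
        return torch.cat([text_tokens, f], dim=1)  # joint conditioning

# Example: poke at the image center, pointing right, moderate strength.
force_map = rasterize_point_force(0.5, 0.5, angle=0.0, magnitude=0.7)
fusion = ForceTextFusion()
tokens = fusion(torch.randn(1, 77, 512), force_map.unsqueeze(0))
print(tokens.shape)  # torch.Size([1, 77 + 64, 512])
```

A global wind field would be the degenerate case of the same encoding: a direction/magnitude vector broadcast uniformly over the spatial grid instead of localized by a Gaussian bump.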

📝 Abstract
Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well-explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts, which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion priors of the original pretrained model, without using any 3D assets or a physics simulator at inference. The primary challenge of force prompting is the difficulty of obtaining high-quality paired force-video training data, both in the real world, due to the difficulty of measuring force signals, and in synthetic data, due to limitations in the visual quality and domain diversity of physics simulators. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of only a few objects. Our method can generate videos that simulate forces across diverse geometries, settings, and materials. To understand the source of this generalization, we perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k examples in a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos at our project page.
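The abstract describes lightweight fine-tuning of a pretrained video diffusion model on ~15k force-video pairs. Below is a minimal sketch of what one such training step could look like under a standard epsilon-prediction diffusion objective; the model interface and noise schedule are our assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def training_step(model, alphas_cumprod, video_latents, cond_tokens):
    """One denoising-score-matching step: add noise at a random timestep,
    then predict it back conditioned on the joint text-force tokens.
    `alphas_cumprod` is an assumed 1-D tensor of cumulative noise-schedule
    products; `model` is assumed to accept (latents, t, conditioning)."""
    B = video_latents.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))
    alpha_bar = alphas_cumprod[t].view(B, 1, 1, 1, 1)  # (B,C,T,H,W) latents
    noise = torch.randn_like(video_latents)
    noisy = alpha_bar.sqrt() * video_latents + (1 - alpha_bar).sqrt() * noise
    pred = model(noisy, t, cond_tokens)  # epsilon prediction
    return F.mse_loss(pred, noise)
```

The only departure from vanilla diffusion fine-tuning is that `cond_tokens` carries the force signal alongside the text prompt, so the denoiser learns to associate force direction and magnitude with plausible motion.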
Problem

Research questions and friction points this paper is trying to address.

Exploring physical forces as control signals for video generation
Overcoming the scarcity of high-quality paired force-video training data
Achieving realistic, physics-adherent video without 3D assets or simulators at inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Force prompts enable physical interaction control
Leverages pretrained models without 3D assets
Generalizes from Blender-synthesized training data (see the sketch after this list)
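As a rough illustration of the Blender-synthesized training data, here is a hedged bpy script that produces one wind-on-cloth clip and logs its force label. The scene setup, parameter ranges, and output paths are our own assumptions, not the paper's actual data pipeline.

```python
import math
import random
import bpy

scene = bpy.context.scene
scene.frame_end = 120  # ~5 s at 24 fps

# A plane with a cloth modifier stands in for fabric
# (a real pipeline would subdivide it for finer deformation).
bpy.ops.mesh.primitive_plane_add(size=2)
bpy.ops.object.modifier_add(type='CLOTH')

# Wind force field: it blows along the field object's local +Z axis,
# so we tilt it horizontal and sample a random heading and strength.
bpy.ops.object.effector_add(type='WIND')
wind = bpy.context.object
wind.rotation_euler = (math.radians(90), 0,
                       math.radians(random.uniform(0, 360)))
wind.field.strength = random.uniform(100, 1000)

# Render the clip; the (heading, strength) pair becomes the force label
# paired with the video for training.
scene.render.image_settings.file_format = 'FFMPEG'
scene.render.filepath = '/tmp/wind_clip.mp4'
bpy.ops.render.render(animation=True)
print('force label:', wind.rotation_euler.z, wind.field.strength)
```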