🤖 AI Summary
This work addresses the frequent violation of physical laws in existing text-to-video (T2V) generation models, which the authors attribute to prompts lacking physical constraints. To remedy this, they propose PhyPrompt, a two-stage reinforcement learning framework that first fine-tunes a 7B large language model on physics-oriented chain-of-thought data to inject physical commonsense, then applies Group Relative Policy Optimization (GRPO) with a dynamic multi-objective reward curriculum. This approach uniquely integrates physics-guided chain-of-thought fine-tuning with an adaptive reward-scheduling mechanism, overcoming the usual trade-offs of multi-objective optimization. Experiments show that PhyPrompt improves the joint success rate by 8.6 percentage points (pp), boosts physical commonsense accuracy by 11 pp, and raises semantic consistency by 4.4 pp on VideoPhy2. Moreover, it transfers zero-shot across diverse T2V models with gains of up to 16.8%, while also outperforming GPT-4o and substantially larger models.
📝 Abstract
State-of-the-art text-to-video (T2V) generators frequently violate physical laws despite high visual quality. We show this stems from insufficient physical constraints in prompts rather than from model limitations: manually adding physics details reliably produces physically plausible videos, but requires expertise and does not scale. We present PhyPrompt, a two-stage reinforcement learning framework that automatically refines prompts for physically realistic generation. First, we fine-tune a large language model on a physics-focused Chain-of-Thought dataset to integrate principles such as object motion and force interactions while preserving user intent. Second, we apply Group Relative Policy Optimization with a dynamic reward curriculum that initially prioritizes semantic fidelity, then progressively shifts toward physical commonsense. This curriculum achieves synergistic optimization: PhyPrompt-7B reaches 40.8% joint success on VideoPhy2 (an 8.6 pp gain), improving physical commonsense by 11 pp (55.8% to 66.8%) while simultaneously increasing semantic adherence by 4.4 pp (43.4% to 47.8%). Remarkably, our curriculum exceeds single-objective training on both metrics, demonstrating compositional prompt discovery beyond conventional multi-objective trade-offs. PhyPrompt outperforms GPT-4o (+3.8% joint) and DeepSeek-V3 (+2.2%, 100× larger) using only 7B parameters. The approach transfers zero-shot across diverse T2V architectures (Lavie, VideoCrafter2, CogVideoX-5B) with up to 16.8% improvement, establishing that domain-specialized reinforcement learning with compositional curricula surpasses general-purpose scaling for physics-aware generation.
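To make the second stage concrete, the core idea of GRPO with a dynamic reward curriculum can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear weight schedule, the example reward values, and the function names (`curriculum_weight`, `blended_reward`, `grpo_advantages`) are all assumptions; the abstract does not specify the exact schedule.

```python
import statistics

def curriculum_weight(step, total_steps):
    """Hypothetical linear schedule: the weight on the physics reward
    ramps from 0 to 1 over training (exact schedule is an assumption)."""
    return min(1.0, step / total_steps)

def blended_reward(r_semantic, r_physics, step, total_steps):
    """Early training prioritizes semantic fidelity; later steps
    progressively shift weight toward physical commonsense."""
    w = curriculum_weight(step, total_steps)
    return (1.0 - w) * r_semantic + w * r_physics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    refinement's reward against the mean/std of its own group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mu) / sigma for r in rewards]

# Example: a group of 4 candidate prompt refinements scored at
# mid-training (w = 0.5), with illustrative (semantic, physics) rewards.
scores = [(0.9, 0.2), (0.6, 0.7), (0.4, 0.9), (0.8, 0.5)]
rewards = [blended_reward(s, p, step=500, total_steps=1000) for s, p in scores]
advantages = grpo_advantages(rewards)
```

The curriculum matters because a fixed blend tends to trade one objective against the other; shifting the weight lets the policy first lock in semantic adherence, then discover physics-compliant phrasings that preserve it.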