MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video diffusion models struggle to ensure physical consistency because they rely solely on pixel-level reconstruction. This work proposes MMPhysVideo, a framework that jointly models multimodal perceptual cues (semantics, geometry, and spatio-temporal trajectories) and unifies them into a pseudo-RGB representation, enabling diffusion models to directly learn complex physical dynamics. The approach introduces a bidirectionally controlled teacher-student distillation architecture that decouples RGB generation from perceptual modeling, complemented by MMPhysPipe, a data pipeline that uses vision-language-model-guided chain-of-visual-evidence annotation to construct high-quality multimodal training data. MMPhysVideo achieves the first scalable modeling of physical plausibility in video generation, improving both physical consistency and visual fidelity across multiple benchmarks and attaining state-of-the-art performance, all without additional inference overhead.
📝 Abstract
Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher's physical prior is distilled into a single-stream student model via representation alignment. Furthermore, we present MMPhysPipe, a scalable data curation and annotation pipeline tailored for constructing physics-rich multimodal datasets. MMPhysPipe employs a vision-language model (VLM) guided by a chain-of-visual-evidence rule to pinpoint physical subjects, enabling expert models to extract multi-granular perceptual information. Without additional inference costs, MMPhysVideo consistently improves physical plausibility and visual quality over advanced models across various benchmarks and achieves state-of-the-art performance compared to existing methods.
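The "zero-initialized control links" in the abstract follow a familiar pattern: a projection whose weights start at zero, so at initialization the perception branch contributes nothing and the RGB branch behaves exactly as before, then the link gradually learns pixel-wise consistency during training. A minimal sketch of that idea (class and variable names are illustrative, not from the paper):

```python
class ZeroInitControlLink:
    """Zero-initialized linear link: at init, adding the control branch's
    projected features leaves the base (RGB) branch unchanged."""

    def __init__(self, dim):
        # Zero weights -> the control contribution starts at exactly zero
        # and only grows as the weights are trained.
        self.W = [[0.0] * dim for _ in range(dim)]

    def __call__(self, base_feat, control_feat):
        dim = len(self.W)
        # Project the control features through the (initially zero) weights.
        proj = [sum(control_feat[j] * self.W[j][i] for j in range(dim))
                for i in range(dim)]
        # Residual addition onto the base branch's features.
        return [b + p for b, p in zip(base_feat, proj)]


link = ZeroInitControlLink(4)
rgb = [0.5, -1.2, 0.3, 2.0]     # features from the RGB branch
perc = [1.0, 1.0, 1.0, 1.0]     # features from the perception branch
assert link(rgb, perc) == rgb   # identity to the RGB branch at initialization
```

Because the link is exactly the identity at the start of training, attaching the perception branch cannot degrade the pretrained RGB generator, which is the usual motivation for zero initialization in controllable-generation setups.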
Problem

Research questions and friction points this paper is trying to address.

video generation
physical plausibility
video diffusion models
multimodal modeling
physical inconsistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal modeling
physical plausibility
video diffusion models
knowledge distillation
pseudo-RGB representation
Shubo Lin
State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; StepFun; Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information
Xuanyang Zhang
StepFun AI Researcher
Neural Architecture Design, AIGC, 3D Generation, Multi-modal
Wei Cheng
StepFun
AIGC, 3D Vision, Computer Graphics
Weiming Hu
Shanghai Jiao Tong University
Computer Architecture
Gang Yu
StepFun
Jin Gao
State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information