FinePhys: Fine-grained Human Action Generation by Explicitly Incorporating Physical Laws for Effective Skeletal Guidance

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generating fine-grained, physically constrained human motions, such as a gymnastics “switch leap with 0.5 turn”, remains challenging. This paper proposes FinePhys, a physics-informed video generation framework that addresses this problem. Methodologically, it introduces (1) a physics-based motion re-estimation module governed by the Euler–Lagrange equations, which computes joint accelerations through bidirectional temporal updating and makes the motion model more interpretable; and (2) a fusion step that combines the physically predicted 3D poses with the data-driven ones to provide multi-scale 2D heatmap guidance for the diffusion process. The pipeline first estimates 2D poses online, then lifts them to 3D via in-context learning. Evaluated on three fine-grained FineGym subsets (FX-JUMP, FX-TURN, and FX-SALTO), FinePhys significantly outperforms competitive baselines and produces more natural and physically plausible motions.
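
For reference, the Euler–Lagrange equations mentioned in the summary have the standard textbook form below. The symbols are generic notation and are not taken from the paper: q denotes generalized coordinates (for example, joint positions or angles), Q the non-conservative generalized forces, M the mass matrix, C the Coriolis and centrifugal term, and g the gravity term. How FinePhys parameterizes or learns these quantities is not stated in this summary.

```latex
% Euler-Lagrange equations with Lagrangian L = T - V (kinetic minus potential energy)
\frac{d}{dt}\frac{\partial \mathcal{L}}{\partial \dot{q}}
  - \frac{\partial \mathcal{L}}{\partial q} = Q,
\qquad \mathcal{L}(q, \dot{q}) = T(q, \dot{q}) - V(q)

% For an articulated rigid-body system this expands to
M(q)\,\ddot{q} + C(q, \dot{q})\,\dot{q} + g(q) = Q
\quad\Longrightarrow\quad
\ddot{q} = M(q)^{-1}\bigl(Q - C(q, \dot{q})\,\dot{q} - g(q)\bigr)
```

The second line is what makes an acceleration-based model interpretable: each term of the joint acceleration corresponds to a named physical effect.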

📝 Abstract
Despite significant advances in video generation, synthesizing physically plausible human actions remains a persistent challenge, particularly in modeling fine-grained semantics and complex temporal dynamics. For instance, generating gymnastics routines such as "switch leap with 0.5 turn" poses substantial difficulties for current methods, often yielding unsatisfactory results. To bridge this gap, we propose FinePhys, a Fine-grained human action generation framework that incorporates Physics to obtain effective skeletal guidance. Specifically, FinePhys first estimates 2D poses in an online manner and then performs 2D-to-3D dimension lifting via in-context learning. To mitigate the instability and limited interpretability of purely data-driven 3D poses, we further introduce a physics-based motion re-estimation module governed by Euler-Lagrange equations, calculating joint accelerations via bidirectional temporal updating. The physically predicted 3D poses are then fused with data-driven ones, offering multi-scale 2D heatmap guidance for the diffusion process. Evaluated on three fine-grained action subsets from FineGym (FX-JUMP, FX-TURN, and FX-SALTO), FinePhys significantly outperforms competitive baselines. Comprehensive qualitative results further demonstrate FinePhys's ability to generate more natural and plausible fine-grained human actions.
Problem

Research questions and friction points this paper is trying to address.

Synthesizing physically plausible human actions in videos
Modeling fine-grained semantics and complex temporal dynamics
Generating accurate gymnastics routines with skeletal guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incorporates physics for skeletal guidance
Uses 2D-to-3D lifting via in-context learning
Fuses physics-based and data-driven 3D poses
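
The three ideas listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the central-difference update rule, the fixed fusion weight `w_phys`, the orthographic projection (dropping z), and the heatmap sizes are all assumptions chosen for illustration, and the joint accelerations are taken as given rather than derived from Euler–Lagrange dynamics.

```python
import numpy as np

def bidirectional_physics_poses(poses, accel, dt=1.0):
    """Re-estimate frames from both temporal directions (illustrative only).

    poses: (T, J, 3) data-driven 3D joints; accel: (T, J, 3) joint accelerations.
    Uses the central-difference identity x[t+1] - 2*x[t] + x[t-1] = a[t] * dt**2.
    """
    T = poses.shape[0]
    est = np.zeros_like(poses, dtype=float)
    cnt = np.zeros((T, 1, 1))
    # Forward update: frames t-1 and t predict frame t+1 (covers frames 2..T-1).
    fwd = 2 * poses[1:-1] - poses[:-2] + accel[1:-1] * dt**2
    est[2:] += fwd
    cnt[2:] += 1
    # Backward update: frames t+1 and t predict frame t-1 (covers frames 0..T-3).
    bwd = 2 * poses[1:-1] - poses[2:] + accel[1:-1] * dt**2
    est[:-2] += bwd
    cnt[:-2] += 1
    # Average where both directions exist; keep the input where neither does.
    return np.where(cnt > 0, est / np.maximum(cnt, 1), poses)

def fuse_poses(data_poses, phys_poses, w_phys=0.5):
    """Convex combination of data-driven and physics-based poses (assumed scheme)."""
    return w_phys * phys_poses + (1.0 - w_phys) * data_poses

def joint_heatmaps(xy, size, sigma=0.05):
    """Render one Gaussian heatmap per joint.

    xy: (J, 2) joint coordinates normalized to [0, 1]; returns (J, size, size).
    """
    centers = (np.arange(size) + 0.5) / size      # pixel centers in [0, 1]
    gx, gy = np.meshgrid(centers, centers)        # gx varies by column, gy by row
    d2 = (gx[None] - xy[:, 0, None, None]) ** 2 + (gy[None] - xy[:, 1, None, None]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def multiscale_guidance(pose3d, sizes=(16, 32, 64)):
    """Project one (J, 3) pose orthographically and render heatmaps at several scales."""
    xy = pose3d[:, :2]                            # drop depth; assumes x, y in [0, 1]
    return [joint_heatmaps(xy, s) for s in sizes]
```

In this sketch, a trajectory with constant acceleration is reproduced exactly by `bidirectional_physics_poses`, so the physics branch only changes frames where the data-driven poses disagree with the supplied accelerations. A learned or per-joint fusion weight would be a natural replacement for the scalar `w_phys`.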
Authors
Dian Shao
Associate Professor, Northwestern Polytechnical University, Xi'an
computer vision, deep learning, UAV
Mingfei Shi
Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an, China
Shengda Xu
School of Software, Northwestern Polytechnical University, Xi’an, China
Haodong Chen
School of Automation, Northwestern Polytechnical University, Xi’an, China
Yongle Huang
Undergrad., Northwestern Polytechnical University
Multi-model, Video Understanding
Binglu Wang
School of Astronautics, Northwestern Polytechnical University
Computer Vision, AI4Science