H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle to jointly model skeletal motion and non-rigid surface deformation, and real-world scenes lack pixel-level scene flow annotations. To address these challenges, this work proposes H-Flow, which for the first time integrates physics-inspired geometric, structural, and biomechanical priors into a self-supervised learning framework. H-Flow employs a unified multi-head Transformer to jointly predict human pose, depth, and dense scene flow from monocular video. The method establishes a cohesive framework capable of capturing both skeletal articulation and soft-tissue dynamics, and introduces DynAct4D, a high-fidelity synthetic dataset. Experiments demonstrate that H-Flow outperforms both general-purpose scene flow methods and parametric human body models on standard benchmarks, while also exhibiting strong zero-shot generalization to in-the-wild videos.

📝 Abstract

Parametric human models capture global pose but cannot represent the non-rigid surface dynamics of clothing and soft tissue. Generic scene flow estimates dense motion but breaks down on articulated bodies, where pixel-level supervision is also intractable to acquire. We introduce H-Flow, a dense human scene flow that captures both skeletal kinematics and surface deformation. A unified multi-head transformer estimates flow from monocular video, jointly predicting pose and depth as companion outputs. The challenge lies in the lack of supervision. In place of unattainable labels, we anchor the network in the physics of human motion, encoding geometric, structural, and biomechanical priors as cross-modal training objectives. We further introduce DynAct4D, a high-fidelity synthetic benchmark providing dense flow annotations across diverse subjects, garments, and motions. On standard benchmarks, H-Flow outperforms scene-flow and parametric baselines, and generalizes zero-shot to in-the-wild video. Code, models, and the DynAct4D benchmark will be released upon publication.
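To make the idea of priors acting as label-free training objectives more concrete, here is a minimal sketch of two such terms. It is an illustration under stated assumptions, not the paper's formulation: the abstract does not give the actual losses, and the function names `bone_length_loss` and `pose_flow_consistency`, the joint layout, and the choice of an L1/L2 penalty are all hypothetical. The first term stands in for a structural prior (bone lengths should not change between frames); the second stands in for a cross-modal term (scene flow sampled at the joints should carry the pose at frame t to the pose at frame t+1).

```python
import math

def bone_lengths(joints, bones):
    """Euclidean length of each bone; joints is a list of (x, y, z) points."""
    return [math.dist(joints[a], joints[b]) for a, b in bones]

def bone_length_loss(joints_t, joints_t1, bones):
    """Hypothetical structural prior: mean absolute change in bone length
    between two consecutive frames. Zero for any rigid skeleton motion."""
    len_t = bone_lengths(joints_t, bones)
    len_t1 = bone_lengths(joints_t1, bones)
    return sum(abs(a - b) for a, b in zip(len_t, len_t1)) / len(bones)

def pose_flow_consistency(joints_t, joints_t1, flow_at_joints):
    """Hypothetical cross-modal term: mean distance between the joints at
    frame t displaced by the predicted 3D flow and the joints at frame t+1."""
    total = 0.0
    for p, q, f in zip(joints_t, joints_t1, flow_at_joints):
        moved = [pi + fi for pi, fi in zip(p, f)]
        total += math.dist(moved, q)
    return total / len(joints_t)

# Toy example: a three-joint chain translated rigidly between frames.
bones = [(0, 1), (1, 2)]  # assumed (parent, child) index pairs
joints_t = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
shift = (0.1, 0.0, 0.2)
joints_t1 = [tuple(c + s for c, s in zip(j, shift)) for j in joints_t]
flow = [shift] * len(joints_t)

rigid_bone_loss = bone_length_loss(joints_t, joints_t1, bones)
rigid_flow_loss = pose_flow_consistency(joints_t, joints_t1, flow)
```

Both terms need only the network's own pose and flow predictions on adjacent frames, which is the sense in which such objectives can replace dense flow labels. In a real training setup they would be written with differentiable tensor operations rather than plain Python lists.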

Problem

Research questions and friction points this paper is trying to address.

human scene flow

non-rigid deformation

articulated bodies

self-supervised learning

dense motion estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-supervised learning

human scene flow

physics-inspired priors