PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reconstructing physically plausible human motion from monocular video remains challenging because pose estimation imposes no physical constraints, often yielding physically implausible motions. This paper proposes an end-to-end vision-to-control-policy learning framework that directly generates 3D motion trajectories that are aligned with the 2D observations and dynamically consistent within a physics simulator. 2D keypoints are modeled as rays in 3D space ("pixel-as-ray") to provide global pose guidance; local visual features are extracted with a pretrained vision encoder, and the control policy is optimized jointly through distillation from a motion-capture expert policy and physics-driven reinforcement learning, bypassing conventional two-stage optimization. Experiments on multiple benchmarks demonstrate significant improvements in both visual fidelity and physical plausibility, achieving state-of-the-art performance.

📝 Abstract
Reconstructing physically plausible human motion from monocular videos remains a challenging problem in computer vision and graphics. Existing methods primarily focus on kinematics-based pose estimation, often leading to unrealistic results due to the lack of physical constraints. To address such artifacts, prior methods have typically relied on physics-based post-processing following the initial kinematics-based motion estimation. However, this two-stage design introduces error accumulation, ultimately limiting the overall reconstruction quality. In this paper, we present PhysHMR, a unified framework that directly learns a visual-to-action policy for humanoid control in a physics-based simulator, enabling motion reconstruction that is both physically grounded and visually aligned with the input video. A key component of our approach is the pixel-as-ray strategy, which lifts 2D keypoints into 3D spatial rays and transforms them into global space. These rays are incorporated as policy inputs, providing robust global pose guidance without depending on noisy 3D root predictions. This soft global grounding, combined with local visual features from a pretrained encoder, allows the policy to reason over both detailed pose and global positioning. To overcome the sample inefficiency of reinforcement learning, we further introduce a distillation scheme that transfers motion knowledge from a mocap-trained expert to the vision-conditioned policy, which is then refined using physically motivated reinforcement learning rewards. Extensive experiments demonstrate that PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.
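The pixel-as-ray strategy described in the abstract amounts to standard pinhole-camera unprojection: each 2D keypoint is lifted to a ray through the camera center and rotated into world coordinates. A minimal sketch of that geometry, assuming a pinhole intrinsic matrix `K` and a known camera-to-world pose (the function name, argument shapes, and conventions here are illustrative, not the paper's actual interface):

```python
import numpy as np

def pixels_to_global_rays(keypoints_2d, K, R_wc, t_wc):
    """Lift 2D keypoints to unit-direction rays in world space.

    keypoints_2d: (N, 2) pixel coordinates (u, v)
    K:            (3, 3) pinhole camera intrinsics
    R_wc:         (3, 3) camera-to-world rotation
    t_wc:         (3,)   camera center in world coordinates (ray origins)
    """
    n = keypoints_2d.shape[0]
    # Homogeneous pixel coordinates (u, v, 1).
    pix_h = np.concatenate([keypoints_2d, np.ones((n, 1))], axis=1)
    # Unproject to camera-frame directions, then rotate into world frame.
    dirs_cam = (np.linalg.inv(K) @ pix_h.T).T
    dirs_world = (R_wc @ dirs_cam.T).T
    dirs_world /= np.linalg.norm(dirs_world, axis=1, keepdims=True)
    origins = np.tile(t_wc, (n, 1))
    return origins, dirs_world
```

A keypoint at the principal point maps to the camera's optical axis; off-center keypoints fan out accordingly. Feeding the policy these rays rather than a predicted 3D root position is what the abstract calls "soft global grounding": the ray constrains where the body can be without committing to a noisy depth estimate.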
Problem

Research questions and friction points this paper is trying to address.

Reconstructing physically plausible human motion from monocular videos
Addressing unrealistic results from kinematics-based pose estimation methods
Overcoming error accumulation in two-stage motion reconstruction approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns visual-to-action policy for humanoid control
Uses pixel-as-ray strategy for global pose guidance
Employs distillation scheme with reinforcement learning rewards
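The distillation scheme in the list above transfers knowledge from a mocap-trained expert to the vision-conditioned student before RL fine-tuning. At its core this is supervised regression of student actions onto expert actions. A toy sketch under strong simplifying assumptions (linear policies, synthetic observations, hand-picked learning rate; the paper's actual policies are neural networks trained inside a physics simulator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: visual features -> joint actuation targets.
obs_dim, act_dim, n = 8, 4, 256
W_expert = rng.normal(size=(obs_dim, act_dim))  # stand-in for the mocap expert

obs = rng.normal(size=(n, obs_dim))
expert_actions = obs @ W_expert  # expert labels the student's observations

# Student policy distilled by gradient descent on the action-matching MSE.
W_student = np.zeros((obs_dim, act_dim))
lr = 0.05
for _ in range(500):
    pred = obs @ W_student
    grad = obs.T @ (pred - expert_actions) / n  # gradient of mean-squared error
    W_student -= lr * grad

mse = np.mean((obs @ W_student - expert_actions) ** 2)
```

After distillation the student closely imitates the expert on the training observations; the paper then refines this initialization with physically motivated reinforcement-learning rewards, which sidesteps the sample inefficiency of training the vision-conditioned policy by RL from scratch.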