🤖 AI Summary
To address motion jitter and physical implausibility in 3D human pose estimation from monocular video, this paper proposes an online, physics-consistent framework for 3D human motion estimation. The method uses a neural Kalman filter that adaptively fuses the observed kinematics with physics-simulated dynamics, together with a meta proportional-derivative (meta-PD) controller that predicts internal joint torques and external reaction forces to drive the physics simulation. The key contribution is the combination of physics simulation, recursive state estimation, and adaptive control in a single monocular online estimation pipeline, which mitigates jitter and physically invalid poses. The evaluation reports strong results on the physics-based human pose estimation task compared to the state of the art, with accurate global trajectories and physically plausible motion.
📝 Abstract
Human motion capture from monocular videos has made significant progress in recent years. However, modern approaches often produce temporal artifacts, e.g., in the form of jittery motion, and struggle to achieve smooth and physically plausible motions. Explicitly integrating physics, in the form of internal joint torques and external reaction forces, helps alleviate these artifacts. Current state-of-the-art approaches use an automatic PD controller to predict torques and reaction forces in order to re-simulate the input kinematics, i.e., the joint angles of a predefined skeleton. However, due to imperfect physical models, these methods often require simplifying assumptions and extensive preprocessing of the input kinematics to achieve good performance. To address this, we propose a novel method that selectively combines the physics model with the kinematic observations in an online setting, inspired by neural Kalman filtering. We develop a control loop in the form of a meta-PD controller that predicts internal joint torques and external reaction forces, followed by a physics-based motion simulation. A recurrent neural network is introduced to realize a Kalman filter that attentively balances the kinematics input and the simulated motion, resulting in an optimal-state dynamics prediction. We show that this filtering step is crucial: it provides online supervision that helps balance the shortcomings of the respective input motions, and is thus important not only for capturing accurate global motion trajectories but also for producing physically plausible human poses. The proposed approach excels in the physics-based human pose estimation task and demonstrates the physical plausibility of the predicted dynamics compared to the state of the art. The code is available at https://github.com/cuongle1206/OSDCap
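
To make the control loop described in the abstract more concrete, the following is a minimal sketch of the three stages (PD control, physics step, Kalman-style fusion) for a single 1-DoF joint. It is not the authors' implementation: every function name, the fixed gains `kp`, `kd`, and `gain`, and the noisy sine input are assumptions chosen for illustration. In the paper the PD parameters and the filter gain are predicted by learned networks, and the simulation covers a full skeleton with contact forces.

```python
import math
import random


def pd_torque(q_target, q, qd, kp=100.0, kd=20.0):
    """PD control law: torque pulling the joint toward the target angle."""
    return kp * (q_target - q) - kd * qd


def simulate_step(q, qd, torque, inertia=1.0, dt=1.0 / 30.0):
    """Semi-implicit Euler step for one rigid joint (no gravity, no contact)."""
    qd_next = qd + (torque / inertia) * dt
    q_next = q + qd_next * dt
    return q_next, qd_next


def fuse(q_sim, q_obs, gain):
    """Kalman-style update: move the simulated angle toward the observation.

    gain = 0 trusts the simulation only, gain = 1 trusts the observation only.
    A fixed scalar gain is an assumption; the paper learns this balance.
    """
    return q_sim + gain * (q_obs - q_sim)


def track(observations, gain=0.2, dt=1.0 / 30.0):
    """Online loop: PD control -> physics step -> fusion, one frame at a time."""
    q, qd = observations[0], 0.0
    fused = []
    for q_obs in observations:
        tau = pd_torque(q_obs, q, qd)
        q_sim, qd = simulate_step(q, qd, tau, dt=dt)
        q = fuse(q_sim, q_obs, gain)  # the fused angle becomes the next state
        fused.append(q)
    return fused


def jitter(seq):
    """Mean absolute second difference, a simple proxy for frame-to-frame jitter."""
    d2 = [seq[i + 1] - 2.0 * seq[i] + seq[i - 1] for i in range(1, len(seq) - 1)]
    return sum(abs(v) for v in d2) / len(d2)


def noisy_sine(n=300, dt=1.0 / 30.0, sigma=0.05, seed=0):
    """Hypothetical jittery kinematics input: a slow sine plus Gaussian noise."""
    rng = random.Random(seed)
    return [0.5 * math.sin(math.pi * i * dt) + rng.gauss(0.0, sigma) for i in range(n)]


if __name__ == "__main__":
    obs = noisy_sine()
    est = track(obs)
    print("jitter of observations:", jitter(obs))
    print("jitter of fused output:", jitter(est))
```

With a small `gain`, the output follows the smooth simulated motion and suppresses frame-to-frame noise, at the cost of lagging behind fast changes in the input. With a `gain` close to 1, it follows the raw kinematics and inherits their jitter. This trade-off is what a learned, per-frame gain is meant to resolve.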