🤖 AI Summary
Existing motion capture methods prioritize visual similarity while neglecting physical plausibility, leading to drift, sliding, interpenetration, and trajectory inaccuracies in virtual human animation and robot control. This work introduces, for the first time, plantar pressure sensing to explicitly model human–environment interaction, enabling physically grounded motion estimation. We construct MotionPRO—a large-scale dataset comprising 70 subjects, 400 motions, and 12.4 million frames—and propose a novel pressure-driven, sensor-only paradigm for joint pose and global trajectory estimation. Our method innovatively integrates a vertical-axis whole-body contact constraint and a camera-axis orthogonal similarity constraint to enable cross-modal pressure–RGB fusion. Leveraging a small-kernel decoder, long-short-term attention, and physics-aware feature fusion, it supports SMPL-based reconstruction and robot closed-loop control. Experiments show that pressure-only input achieves high-accuracy lower-body pose and trajectory estimation; pressure–RGB fusion reduces MPJPE by 21.3% and ACCEL by 38.7%. The framework enables slip-free virtual human locomotion and stable, precisely localized humanoid robot motion.
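The vertical-axis whole-body contact constraint mentioned above can be illustrated with a small sketch. This is a hypothetical, minimal version (the function name `contact_consistency_loss`, the joint-level contact flags, and the z-up convention are all assumptions, not the paper's actual formulation): joints that the pressure mat flags as in ground contact should sit at floor height and should not translate between frames, which is exactly what suppresses sliding and drift.

```python
import numpy as np

def contact_consistency_loss(joint_pos, pressure_active, dt=1.0 / 30):
    """Hypothetical vertical-axis contact penalty.

    joint_pos:       (T, J, 3) predicted joint positions, z-up
    pressure_active: (T, J) bool, True where the pressure signal
                     says the joint touches the ground
    """
    # Height term: contacting joints should be on the floor (z ~ 0).
    height_err = np.abs(joint_pos[..., 2]) * pressure_active

    # Sliding term: contacting joints should have ~zero velocity.
    vel = np.diff(joint_pos, axis=0) / dt                # (T-1, J, 3)
    slide_err = np.linalg.norm(vel, axis=-1) * pressure_active[1:]

    n_contact = max(pressure_active.sum(), 1)
    n_slide = max(pressure_active[1:].sum(), 1)
    return height_err.sum() / n_contact + slide_err.sum() / n_slide
```

A perfectly grounded, static stance yields zero loss, while any in-contact sliding or floating raises it; in training such a term would be added to the pose-reconstruction objective.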
📝 Abstract
Existing human Motion Capture (MoCap) methods mostly focus on visual similarity while neglecting physical plausibility. As a result, downstream tasks such as driving virtual humans in 3D scenes or humanoid robots in the real world suffer from temporal issues such as drift and jitter, spatial problems such as sliding and penetration, and poor global trajectory accuracy. In this paper, we revisit human MoCap from the perspective of the interaction between the human body and the physical world by exploring the role of pressure. First, we construct a large-scale human Motion capture dataset with Pressure, RGB and Optical sensors (named MotionPRO), which comprises 70 volunteers performing 400 types of motion, totaling 12.4M pose frames. Second, we examine both the necessity and the effectiveness of the pressure signal through two challenging tasks: (1) pose and trajectory estimation from pressure alone: we propose a network that incorporates a small-kernel decoder and a long-short-term attention module, and prove that pressure can provide an accurate global trajectory and a plausible lower-body pose; (2) pose and trajectory estimation by fusing pressure and RGB: we impose constraints on orthographic similarity along the camera axis and whole-body contact along the vertical axis to enhance the cross-attention strategy used to fuse pressure and RGB feature maps. Experiments demonstrate that fusing pressure with RGB features not only significantly improves objective metrics but also plausibly drives virtual humans (SMPL) in 3D scenes. Furthermore, we show that incorporating physical perception enables humanoid robots to perform more precise and stable actions, which is highly beneficial for the development of embodied artificial intelligence. Project page: https://nju-cite-mocaphumanoid.github.io/MotionPRO/
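The cross-attention fusion described in task (2) can be sketched in a few lines. This is a minimal single-head illustration under stated assumptions (the function names, shapes, and the choice of RGB tokens as queries against pressure tokens as keys/values are ours, not taken from the paper's implementation): image features are refined by attending to physically grounded contact evidence from the pressure map.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(rgb_feat, pressure_feat, Wq, Wk, Wv):
    """Hypothetical single-head cross-attention fusion.

    rgb_feat:      (N_rgb, D) flattened RGB feature-map tokens
    pressure_feat: (N_prs, D) flattened pressure feature-map tokens
    Wq, Wk, Wv:    (D, D) learned projections (random in this sketch)
    """
    q = rgb_feat @ Wq                   # queries from the RGB stream
    k = pressure_feat @ Wk              # keys from the pressure stream
    v = pressure_feat @ Wv              # values from the pressure stream
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N_rgb, N_prs)
    fused = rgb_feat + attn @ v         # residual fusion into RGB tokens
    return fused, attn
```

The residual connection keeps the RGB pathway intact when the pressure signal is uninformative; in the actual system, the orthographic-similarity and contact constraints would further shape which pressure tokens each image token attends to.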