Stereo-Inertial Poser: Towards Metric-Accurate Shape-Aware Motion Capture Using Sparse IMUs and a Single Stereo Camera

📅 2026-03-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of global translation inaccuracies caused by depth ambiguity and the neglect of inter-subject body shape variations in monocular visual-inertial motion capture. To overcome these limitations, we propose an end-to-end real-time system that fuses a stereo camera with six sparse IMUs. By leveraging stereo vision to resolve depth ambiguity, our method directly regresses 3D keypoints and estimates body shape parameters. We further introduce a shape-aware fusion module that dynamically balances individual morphological differences with global motion estimation. Evaluated across multiple datasets, the proposed approach achieves state-of-the-art performance, operates at over 200 FPS, exhibits no drift during long-term capture, and significantly suppresses foot sliding artifacts.

📝 Abstract
Recent advancements in visual-inertial motion capture systems have demonstrated the potential of combining monocular cameras with sparse inertial measurement units (IMUs) as cost-effective solutions, which effectively mitigate occlusion and drift issues inherent in single-modality systems. However, they are still limited by metric inaccuracies in global translations stemming from monocular depth ambiguity, and by shape-agnostic local motion estimations that ignore anthropometric variations. We present Stereo-Inertial Poser, a real-time motion capture system that leverages a single stereo camera and six IMUs to estimate metric-accurate and shape-aware 3D human motion. By replacing monocular RGB with stereo vision, our system resolves depth ambiguity through calibrated baseline geometry, enabling direct 3D keypoint extraction and body shape parameter estimation. IMU data and visual cues are fused to predict drift-compensated joint positions and root movements, while a novel shape-aware fusion module dynamically harmonizes anthropometric variations with global translations. Our end-to-end pipeline achieves over 200 FPS without optimization-based post-processing, enabling real-time deployment. Quantitative evaluations across various datasets demonstrate state-of-the-art performance. Qualitative results show our method produces drift-free global translations over long recording sessions and reduces foot-skating artifacts.
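The abstract's point that a calibrated stereo baseline resolves monocular depth ambiguity rests on standard disparity geometry: for rectified cameras, depth follows from Z = f·B/d. A minimal sketch (all numbers illustrative, not taken from the paper):

```python
# Depth from stereo disparity for a rectified, calibrated stereo pair:
#   Z = f * B / d
# where f is the focal length in pixels, B the baseline in meters,
# and d the disparity in pixels. A monocular camera has no B, hence
# the metric depth ambiguity the paper addresses.
def depth_from_disparity(f_px: float, baseline_m: float, disparity_px: float) -> float:
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return f_px * baseline_m / disparity_px

# Example: 800 px focal length, 12 cm baseline, 40 px disparity -> ~2.4 m depth.
print(depth_from_disparity(800.0, 0.12, 40.0))
```

Note how depth error grows as disparity shrinks, which is why stereo systems trade baseline width against the usable capture volume.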
Problem

Research questions and friction points this paper is trying to address.

metric accuracy
shape-awareness
monocular depth ambiguity
anthropometric variations
visual-inertial motion capture
Innovation

Methods, ideas, or system contributions that make the work stand out.

stereo-inertial fusion
shape-aware motion capture
metric-accurate 3D pose
sparse IMUs
real-time human pose estimation
👥 Authors
Tutian Tang
Shanghai Jiao Tong University
Robotics
Xingyu Ji
School of Computer Science, Shanghai Jiao Tong University; Meta Robotics Institute, SJTU
Yutong Li
Shanghai Jiao Tong University
MingHao Liu
School of Computer Science, Shanghai Jiao Tong University
Wenqiang Xu
Shanghai Jiao Tong University
Computer Vision, Robotics
Cewu Lu
School of Computer Science, Shanghai Jiao Tong University; Shanghai Innovation Institute