🤖 AI Summary
To address incomplete full-body motion reconstruction in AR/VR caused by reliance on head-mounted displays (HMDs) and handheld controllers alone, this paper proposes a lightweight, real-time full-body pose estimation method. The approach makes three key contributions: (1) a learnable Memory-Block module that implicitly models the temporal dynamics of occluded or unobserved joints via a discrete codebook; (2) an MLP-based backbone with residual connections and a multi-task learning framework that jointly optimizes pose prediction, velocity consistency, and joint-angle constraints; and (3) end-to-end training that achieves 72 FPS on mobile HMDs. Experiments demonstrate that the method achieves the best trade-off between accuracy (reducing MPJPE by 18.3%) and computational efficiency, significantly outperforming existing sparse-input-driven approaches to full-body pose estimation.
📝 Abstract
Realistic and smooth full-body tracking is crucial for immersive AR/VR applications. Existing systems primarily track the head and hands via head-mounted displays (HMDs) and handheld controllers, leaving the 3D full-body reconstruction incomplete. One potential approach is to generate full-body motion from the sparse inputs collected by these limited sensors using a neural network (NN) model. In this paper, we propose a novel method based on a multi-layer perceptron (MLP) backbone enhanced with residual connections and a novel NN component called the Memory-Block. In particular, the Memory-Block represents missing sensor data with trainable code-vectors, which are combined with the sparse signals from previous time instances to improve temporal consistency. Furthermore, we formulate our solution as a multi-task learning problem, allowing the MLP backbone to learn robust representations that boost accuracy. Our experiments show that our method outperforms state-of-the-art baselines, substantially reducing prediction errors. Moreover, it runs at 72 FPS on mobile HMDs, improving the accuracy–runtime trade-off.
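The core idea of the Memory-Block (filling unobserved joint channels with trainable code-vectors before a residual MLP) can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the joint count, feature sizes, sensor joint indices, and function names are all assumptions, and fixed random vectors stand in for the code-vectors that would actually be learned end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_JOINTS = 22   # e.g. a SMPL-style body skeleton (assumption)
FEAT_DIM = 16     # per-joint feature size (assumption)

# "Codebook": one code-vector per joint. In the paper these are
# trainable; here they are fixed random stand-ins.
codebook = rng.normal(size=(NUM_JOINTS, FEAT_DIM))

def memory_block(features, observed_mask):
    """Replace features of unobserved joints with code-vectors."""
    mask = observed_mask[:, None]                # (J, 1)
    return mask * features + (1.0 - mask) * codebook

def mlp_residual(x, w1, w2):
    """Two-layer MLP with a residual connection."""
    h = np.maximum(x @ w1, 0.0)                  # ReLU hidden layer
    return x + h @ w2                            # residual add

# Sparse input: only head and two hands observed (3 of 22 joints).
features = np.zeros((NUM_JOINTS, FEAT_DIM))
observed = np.zeros(NUM_JOINTS)
for j in (0, 1, 2):                              # hypothetical sensor joints
    features[j] = rng.normal(size=FEAT_DIM)
    observed[j] = 1.0

filled = memory_block(features, observed)
w1 = 0.1 * rng.normal(size=(FEAT_DIM, 32))
w2 = 0.1 * rng.normal(size=(32, FEAT_DIM))
pose_features = mlp_residual(filled, w1, w2)

# Unobserved joints now carry codebook content instead of zeros,
# while observed joints keep their sensor-derived features.
print(np.allclose(filled[3], codebook[3]))       # True
print(pose_features.shape)                       # (22, 16)
```

In a trained model the codebook entries, updated by backpropagation, would encode plausible motion priors for the joints the sensors never see, which is what lets the network hallucinate a temporally consistent full-body pose from head-and-hands input only.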