🤖 AI Summary
When only sparse head and hand trajectories are available from VR devices, full-body motion estimation, particularly for the lower limbs, suffers from ambiguity and physically implausible environmental interactions. To address this, we propose a two-stage environment-aware framework: (1) a multi-hypothesis probabilistic model explicitly captures motion uncertainty, and (2) an environment encoder that integrates scene semantic segmentation and collision detection, coupled with physics-based regularization, enforces geometric and semantic constraints during motion reconstruction. This work is the first to explicitly incorporate pre-scanned scene priors, encoding both semantic and geometric knowledge, into sparse-input motion estimation. Evaluated on two public benchmarks, our method achieves state-of-the-art performance, reducing lower-limb joint error by 32% and significantly improving the plausibility of environmental interactions. Both quantitative metrics and qualitative analysis confirm enhanced realism and robustness.
📝 Abstract
Estimating full-body motion from the head and hand tracking signals of VR devices holds great potential for various applications. However, the sparsity and unique distribution of these observations pose a significant challenge, resulting in an ill-posed problem with multiple feasible solutions (i.e., hypotheses). This amplifies uncertainty and ambiguity in full-body motion estimation, especially for the lower-body joints. We therefore propose a new method, EnvPoser, that employs a two-stage framework to estimate full-body motion from sparse tracking signals and a pre-scanned environment captured by VR devices. In the first stage, EnvPoser models the multi-hypothesis nature of human motion through an uncertainty-aware estimation module. In the second stage, we refine these multi-hypothesis estimates by integrating semantic and geometric environmental constraints, ensuring that the final motion estimate aligns realistically with both the environmental context and physical interactions. Qualitative and quantitative experiments on two public datasets demonstrate that our method achieves state-of-the-art performance, with significant improvements in human motion estimation for motion-environment interaction scenarios.
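To make the two-stage idea concrete, here is a minimal, heavily simplified sketch in NumPy. It is not the authors' implementation: the hypothesis sampler, joint count, and floor-plane constraint are all stand-in assumptions. Stage 1 emits several candidate full-body poses (mimicking the uncertainty-aware module), and stage 2 scores each candidate against a simple geometric environment constraint (floor penetration) and projects the best one to be physically consistent.

```python
import numpy as np

rng = np.random.default_rng(0)

N_JOINTS = 22   # SMPL-style joint count (assumption)
K = 8           # number of motion hypotheses per frame
FLOOR_Z = 0.0   # floor height taken from a pre-scanned scene (assumption)

def stage1_hypotheses(k=K):
    """Stand-in for the uncertainty-aware module: k noisy full-body poses.

    Each pose is an (N_JOINTS, 3) array of joint positions. A real model
    would condition these samples on the sparse head/hand tracking signals.
    """
    canonical = np.zeros((N_JOINTS, 3))
    canonical[:, 2] = np.linspace(0.1, 1.7, N_JOINTS)  # ankles up to head
    return canonical[None] + rng.normal(0.0, 0.15, size=(k, N_JOINTS, 3))

def environment_penalty(pose, floor_z=FLOOR_Z):
    """Geometric constraint: total penetration of joints below the floor."""
    penetration = np.clip(floor_z - pose[:, 2], 0.0, None)
    return penetration.sum()

def stage2_refine(hypotheses):
    """Select the hypothesis most consistent with the environment, then
    project any remaining penetrating joints back above the floor."""
    scores = np.array([environment_penalty(h) for h in hypotheses])
    best = hypotheses[scores.argmin()].copy()
    best[:, 2] = np.maximum(best[:, 2], FLOOR_Z)  # hard geometric projection
    return best

pose = stage2_refine(stage1_hypotheses())
assert (pose[:, 2] >= FLOOR_Z).all()  # no floor penetration after refinement
```

In the paper, the second stage additionally uses scene semantics (e.g., which surfaces are sittable) and physics-based regularization rather than a single floor-plane test; this sketch only illustrates the hypothesize-then-constrain structure.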