Masked Modeling for Human Motion Recovery Under Occlusions

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of human motion reconstruction from monocular videos in real-world scenarios, where frequent occlusions lead to poor robustness, slow inference, or heavy reliance on large-scale paired data. We propose MoRo, an end-to-end generative framework that introduces masked modeling to this task for the first time. Leveraging a video-conditioned masked Transformer, MoRo efficiently recovers full-body motion in global coordinates by integrating trajectory-aware motion priors with image-conditioned pose priors. To mitigate the scarcity of paired training data, we design a cross-modal learning strategy. Experiments demonstrate that MoRo significantly outperforms existing methods on the EgoBody and RICH datasets, achieving notably higher accuracy and motion realism under occlusion while maintaining comparable performance in non-occluded settings. The model also enables real-time inference at 70 FPS on a single H200 GPU.

📝 Abstract
Human motion reconstruction from monocular videos is a fundamental problem in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but it remains challenging under the frequent occlusions of real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference and heavy preprocessing. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. Through masked modeling, MoRo naturally handles occlusions while enabling efficient, end-to-end inference. To overcome the scarcity of paired video-motion data, we design a cross-modality learning scheme that learns multi-modal priors from heterogeneous datasets: (i) a trajectory-aware motion prior trained on MoCap datasets, (ii) an image-conditioned pose prior trained on image-pose datasets, capturing diverse per-frame poses, and (iii) a video-conditioned masked transformer that fuses the motion and pose priors and is finetuned on video-motion datasets to integrate visual cues with motion dynamics for robust inference. Extensive experiments on EgoBody and RICH demonstrate that MoRo substantially outperforms state-of-the-art methods in accuracy and motion realism under occlusions, while performing on par in non-occluded scenarios. MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
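As a rough illustration of the masked-modeling idea only (not the paper's actual architecture), occluded frames can be treated as masked tokens and iteratively re-predicted from the observed ones. In this toy sketch, `recover_occluded` and `neighbor_average` are hypothetical names, and a simple temporal-smoothing function stands in for the learned video-conditioned masked transformer:

```python
import numpy as np

def recover_occluded(motion, observed, predictor, n_iters=3):
    """Fill occluded frames of a (T, D) pose sequence by masked prediction.

    motion:    (T, D) per-frame pose features; occluded rows may hold garbage.
    observed:  (T,) boolean mask, True where the frame was reliably seen.
    predictor: maps a (T, D) sequence to a re-predicted one; stands in for
               a learned video-conditioned masked transformer prior.
    """
    x = motion.astype(float).copy()
    x[~observed] = x[observed].mean(axis=0)   # initialize masked frames with the mean observed pose
    for _ in range(n_iters):
        pred = predictor(x)
        x[~observed] = pred[~observed]        # observed frames stay fixed; masked ones are re-predicted
    return x

def neighbor_average(x):
    """Toy motion prior: predict each frame from its temporal neighbors."""
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    return 0.5 * (padded[:-2] + padded[2:])
```

For example, occluding the interior of a linear trajectory and iterating `neighbor_average` converges to interpolation between the observed endpoints; a real system would replace the toy prior with the learned transformer and schedule which masked tokens to commit at each step.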
Problem

Research questions and friction points this paper is trying to address.

human motion recovery
occlusions
monocular video
motion reconstruction
masked modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked modeling
human motion recovery
occlusion robustness
cross-modality learning
video-conditioned generation