🤖 AI Summary
Existing 3D human pose and shape (HPS) estimation methods exhibit insufficient robustness under realistic, complex occlusions, while mainstream benchmarks predominantly employ synthetic occlusions (e.g., random patches) that poorly reflect real-world scenarios. To bridge this gap, we introduce VOccl3D—the first video-level benchmark explicitly designed for natural occlusions—featuring diverse human motions, clothing styles, and authentic occlusion patterns, along with fine-grained frame-wise 3D human annotations. Leveraging VOccl3D, we fine-tune state-of-the-art models (e.g., CLIFF and BEDLAM-CLIFF) and integrate a YOLO11 detector fine-tuned for occluded person localization. Extensive experiments demonstrate substantial improvements in 3D HPS accuracy under occlusion, both on VOccl3D and on multiple public benchmarks. Our approach establishes a more reliable, end-to-end HPS system capable of handling real-world occlusion challenges.
📝 Abstract
Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating strong zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions; for example, most existing datasets introduce occlusions as random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark, VOccl3D, a Video-based human Occlusion dataset with 3D body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets as well as on the test split of our dataset, and comparing their performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to improve human detection under occlusion by fine-tuning an existing object detector, YOLO11, leading to a robust end-to-end HPS estimation system under occlusions. Overall, this dataset serves as a valuable resource for future research aimed at benchmarking methods designed to handle occlusions, offering a more realistic alternative to existing occlusion datasets. See the project page for code and dataset: https://yashgarg98.github.io/VOccl3D-dataset/