🤖 AI Summary
Existing 6D pose estimation benchmarks mostly target household objects in domestic scenes or industrial objects manually arranged on tables, so they miss the challenges of real-world robotic manipulation, such as severe occlusion, fine-grained distractors, and differences between sensors. To address this gap, the authors introduce CHIP, the first dataset for 6D pose estimation of chairs manipulated by a robotic arm in a real industrial environment. CHIP comprises 77,811 RGBD images of seven distinct chairs, captured with three different RGBD sensing technologies and averaging 11,115 annotations per chair. Ground-truth 6D poses are derived automatically from the robot's kinematics rather than annotated by hand. The benchmark evaluates three zero-shot 6D pose estimation methods across sensor types, localization priors, and occlusion levels. The results leave substantial room for improvement, which indicates that current zero-shot methods struggle under industrial conditions.
📝 Abstract
Accurate 6D pose estimation of complex objects in 3D environments is essential for effective robotic manipulation. Yet, existing benchmarks fall short in evaluating 6D pose estimation methods under realistic industrial conditions, as most datasets focus on household objects in domestic settings, while the few available industrial datasets are limited to artificial setups with objects placed on tables. To bridge this gap, we introduce CHIP, the first dataset designed for 6D pose estimation of chairs manipulated by a robotic arm in a real-world industrial environment. CHIP includes seven distinct chairs captured using three different RGBD sensing technologies and presents unique challenges, such as distractor objects with fine-grained differences and severe occlusions caused by the robotic arm and human operators. CHIP comprises 77,811 RGBD images annotated with ground-truth 6D poses automatically derived from the robot's kinematics, averaging 11,115 annotations per chair. We benchmark CHIP using three zero-shot 6D pose estimation methods, assessing performance across different sensor types, localization priors, and occlusion levels. Results show substantial room for improvement, highlighting the unique challenges posed by the dataset. CHIP will be publicly released.
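The abstract states that ground-truth poses are derived from the robot's kinematics but does not give the procedure. As a rough illustration of how such a derivation can work in general, the sketch below chains homogeneous transforms. It assumes the chair is rigidly held by the gripper, the camera extrinsics relative to the robot base are known, and the gripper-to-object offset has been calibrated once. All names (`T_base_cam`, `T_base_ee`, `T_ee_obj`, `object_pose_in_camera`) and all numeric values are hypothetical and are not taken from CHIP.

```python
import numpy as np


def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def invert_transform(T):
    """Invert a rigid transform using R^T and -R^T t (no general matrix inverse)."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def object_pose_in_camera(T_base_cam, T_base_ee, T_ee_obj):
    """Object pose in the camera frame: T_cam_obj = inv(T_base_cam) @ T_base_ee @ T_ee_obj."""
    return invert_transform(T_base_cam) @ T_base_ee @ T_ee_obj


# Hypothetical values, chosen only to make the example easy to follow.
# Camera pose in the robot base frame (assumed known from extrinsic calibration).
T_base_cam = make_transform(np.eye(3), [1.0, 0.0, 0.0])
# End-effector pose in the base frame (assumed to come from the robot's kinematics).
T_base_ee = make_transform(rot_z(np.pi / 2), [1.0, 0.0, 2.0])
# Fixed offset from the gripper to the grasped object (assumed calibrated once).
T_ee_obj = make_transform(np.eye(3), [0.1, 0.0, 0.0])

T_cam_obj = object_pose_in_camera(T_base_cam, T_base_ee, T_ee_obj)
print(np.round(T_cam_obj, 3))
```

Under these assumptions, every frame captured while the arm moves receives a pose label without manual annotation, since only `T_base_ee` changes from frame to frame.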