🤖 AI Summary
Existing 6D pose estimation and tracking methods rely on manually annotated segmentation masks for the first frame, limiting robustness against occlusion, fast motion, and industrial scene complexity. This paper proposes a real-time, annotation-free three-stage framework: (1) context-guided automatic object localization and segmentation leveraging vision-language priors and self-cross-attention; (2) robust tracking via self-supervised feature matching; and (3) an automatic re-registration mechanism that detects tracking failures using feature similarity and enables rapid re-localization after occlusion or motion. Evaluated on an industrial dataset featuring multi-object occlusion, high-speed motion, and illumination variation, our method achieves state-of-the-art accuracy without requiring any additional training—enabling zero-shot, plug-and-play deployment. It significantly reduces practical deployment barriers and operational costs.
📝 Abstract
Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically rely on a manually annotated segmentation mask of the target in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single 3D Model), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with self-supervised feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.
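To make the three-stage control flow concrete, here is a minimal sketch of a track-then-recover loop in which re-registration fires when feature similarity to the reference drops below a threshold. This is an illustration only, not the paper's implementation: the `localize`, `extract_features`, and `match_pose` callables, the cosine-similarity criterion, and the `SIM_THRESHOLD` value are all assumptions standing in for the actual vision-language localization, self-supervised feature matching, and failure-detection components.

```python
import numpy as np

# Hypothetical threshold; the paper's actual failure-detection criterion
# and its value are not specified here.
SIM_THRESHOLD = 0.5

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def track_with_reregistration(frames, localize, extract_features, match_pose):
    """Sketch of the pipeline: localize once, track by feature matching,
    and automatically re-register when similarity to the reference drops.

    localize(frame)            -> segmentation mask   (Stage 1, assumed)
    extract_features(frame, m) -> feature vector      (assumed)
    match_pose(frame, feat)    -> 6D pose             (Stage 2, assumed)
    """
    ref_feat = None
    poses = []
    for frame in frames:
        if ref_feat is None:
            # Stage 1: context-guided localization + segmentation
            mask = localize(frame)
            ref_feat = extract_features(frame, mask)
        feat = extract_features(frame, None)
        if cosine_similarity(ref_feat, feat) < SIM_THRESHOLD:
            # Stage 3: tracking failure detected -> automatic re-registration
            mask = localize(frame)
            ref_feat = extract_features(frame, mask)
            feat = ref_feat
        # Stage 2: pose from feature matching against the reference
        poses.append(match_pose(frame, feat))
    return poses
```

The key design point this loop illustrates is that recovery requires no human in the loop: a similarity drop simply re-runs the annotation-free localization stage to obtain a fresh mask and reference features.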