🤖 AI Summary
Existing 6D pose estimation and tracking methods rely on manually annotated segmentation masks for the first frame, limiting robustness against occlusion, fast motion, and industrial scene complexity. This paper proposes a real-time, annotation-free three-stage framework: (1) context-guided automatic object localization and segmentation leveraging vision-language priors and self-cross-attention; (2) robust tracking via self-supervised feature matching; and (3) an automatic re-registration mechanism that detects tracking failures using feature similarity and enables rapid re-localization after occlusion or motion. Evaluated on an industrial dataset featuring multi-object occlusion, high-speed motion, and illumination variation, our method achieves state-of-the-art accuracy without requiring any additional training—enabling zero-shot, plug-and-play deployment. It significantly reduces practical deployment barriers and operational costs.
📝 Abstract
Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically rely on a manually annotated segmentation mask of the target in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single 3D Model), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with self-supervised feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.
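To make the three-stage control flow concrete, here is a minimal sketch of a track-then-recover loop in which re-registration fires when feature similarity to the reference drops below a threshold. This is an illustration only, not the paper's implementation: the `localize`, `extract_features`, and `match_pose` callables, the cosine-similarity criterion, and the `SIM_THRESHOLD` value are all assumptions standing in for the actual vision-language localization, self-supervised feature matching, and failure-detection components.

```python
import numpy as np

# Hypothetical threshold; the paper's actual failure-detection criterion
# and its value are not specified here.
SIM_THRESHOLD = 0.5

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def track_with_reregistration(frames, localize, extract_features, match_pose):
    """Sketch of the pipeline: localize once, track by feature matching,
    and automatically re-register when similarity to the reference drops.

    localize(frame)            -> segmentation mask   (Stage 1, assumed)
    extract_features(frame, m) -> feature vector      (assumed)
    match_pose(frame, feat)    -> 6D pose             (Stage 2, assumed)
    """
    ref_feat = None
    poses = []
    for frame in frames:
        if ref_feat is None:
            # Stage 1: context-guided localization + segmentation
            mask = localize(frame)
            ref_feat = extract_features(frame, mask)
        feat = extract_features(frame, None)
        if cosine_similarity(ref_feat, feat) < SIM_THRESHOLD:
            # Stage 3: tracking failure detected -> automatic re-registration
            mask = localize(frame)
            ref_feat = extract_features(frame, mask)
            feat = ref_feat
        # Stage 2: pose from feature matching against the reference
        poses.append(match_pose(frame, feat))
    return poses
```

The key design point this loop illustrates is that recovery requires no human in the loop: a similarity drop simply re-runs the annotation-free localization stage to obtain a fresh mask and reference features.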