🤖 AI Summary
To address grasp-planning failures caused by occlusion in unstructured bin-picking scenes, this paper proposes ViTA-Seg, a real-time, class-agnostic amodal segmentation method based on Vision Transformers (ViT). Leveraging global self-attention to model long-range dependencies, the approach recovers complete object masks end to end, including occluded, non-visible regions. Two ViT architectures are introduced: a Single-Head variant that predicts only the amodal mask, and a Dual-Head variant that jointly predicts the amodal mask and the occluded region. The paper also constructs ViTA-SimData, a photorealistic synthetic dataset designed for industrial bin-picking scenes. Evaluated on the COCOA and KINS benchmarks, the method achieves accurate, low-latency joint segmentation of amodal and occluded regions while maintaining real-time performance, significantly improving the robustness of robotic grasping.
📝 Abstract
Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We propose two architectures: a) Single-Head, for amodal mask prediction; b) Dual-Head, for joint amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenarios. Extensive experiments on two amodal benchmarks, COCOA and KINS, demonstrate that ViTA-Seg Dual-Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.
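To make the Dual-Head idea concrete, below is a minimal PyTorch sketch of one plausible reading of the architecture: a shared ViT encoder whose tokens feed two parallel mask heads, one producing the amodal (full-object) mask and one the occluder mask. This is not the authors' implementation; all module names (`PatchEmbed`, `MaskHead`, `DualHeadViTSeg`), layer sizes, and the token-decoding strategy are illustrative assumptions.

```python
# Hedged sketch of a dual-head amodal-segmentation ViT, NOT the ViTA-Seg code.
# Assumptions: 224x224 RGB input, 16x16 patches, a small torch.nn Transformer
# encoder, and simple conv+upsample mask heads. All hyperparameters are made up.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split the image into non-overlapping patches and project them to tokens."""
    def __init__(self, img_size=224, patch=16, dim=384):
        super().__init__()
        self.grid = img_size // patch
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = self.proj(x)                      # (B, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, N, dim) token sequence


class MaskHead(nn.Module):
    """Decode ViT tokens back into a full-resolution mask logit map."""
    def __init__(self, dim=384, grid=14, scale=16):
        super().__init__()
        self.grid = grid
        self.decode = nn.Sequential(
            nn.Conv2d(dim, 128, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
            nn.Conv2d(128, 1, kernel_size=1),
        )

    def forward(self, tokens):
        B, N, C = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(B, C, self.grid, self.grid)
        return self.decode(fmap)              # (B, 1, H, W) mask logits


class DualHeadViTSeg(nn.Module):
    """Shared ViT encoder (global self-attention) with two segmentation heads."""
    def __init__(self, dim=384, depth=6, heads=6):
        super().__init__()
        self.embed = PatchEmbed(dim=dim)
        self.pos = nn.Parameter(torch.zeros(1, self.embed.grid ** 2, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.amodal_head = MaskHead(dim, grid=self.embed.grid)
        self.occluder_head = MaskHead(dim, grid=self.embed.grid)

    def forward(self, x):
        tokens = self.encoder(self.embed(x) + self.pos)
        # Two parallel predictions from the same global-attention features.
        return self.amodal_head(tokens), self.occluder_head(tokens)


if __name__ == "__main__":
    model = DualHeadViTSeg()
    amodal, occluder = model(torch.randn(2, 3, 224, 224))
    print(amodal.shape, occluder.shape)  # both torch.Size([2, 1, 224, 224])
```

Sharing one encoder between the two heads is what keeps the joint amodal-plus-occluder prediction cheap relative to running two separate networks; a Single-Head variant under the same assumptions would simply drop `occluder_head`.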