ViTA-Seg: Vision Transformer for Amodal Segmentation in Robotics

📅 2025-12-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the failure of robotic grasp planning caused by occlusion in unstructured bin-picking scenarios, this paper proposes a real-time, class-agnostic invisible-region segmentation method based on Vision Transformers (ViT). Leveraging global self-attention to model long-range dependencies, the approach end-to-end recovers complete object masks—including occluded and non-visible regions. We introduce novel single-head and dual-head ViT architectures: the former performs invisible-region segmentation alone, while the latter jointly segments both invisible and occluding regions. Furthermore, we construct ViTA-SimData—the first photorealistic synthetic dataset specifically designed for industrial bin-picking scenes. Evaluated on COOCA and KINS benchmarks, our method achieves high-accuracy, low-latency joint segmentation of invisible and occluding regions, maintaining real-time performance while significantly improving robotic grasping robustness.

Technology Category

Application Category

📝 Abstract
Occlusions in robotic bin picking compromise accurate and reliable grasp planning. We present ViTA-Seg, a class-agnostic Vision Transformer framework for real-time amodal segmentation that leverages global attention to recover complete object masks, including hidden regions. We proposte two architectures: a) Single-Head for amodal mask prediction; b) Dual-Head for amodal and occluded mask prediction. We also introduce ViTA-SimData, a photo-realistic synthetic dataset tailored to industrial bin-picking scenario. Extensive experiments on two amodal benchmarks, COOCA and KINS, demonstrate that ViTA-Seg Dual Head achieves strong amodal and occlusion segmentation accuracy with computational efficiency, enabling robust, real-time robotic manipulation.
Problem

Research questions and friction points this paper is trying to address.

Recovering complete object masks despite occlusions in robotics
Providing real-time amodal segmentation for grasp planning
Enhancing robotic manipulation with computationally efficient vision transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer for real-time amodal segmentation
Dual-Head architecture for amodal and occluded mask prediction
Photo-realistic synthetic dataset for industrial bin-picking
🔎 Similar Papers
No similar papers found.
D
Donato Caramia
Department of Electrical and Information Engineering (DEI), Polytechnic University of Bari, 70126, Bari, Italy
Florian T. Pokorny
Florian T. Pokorny
Associate Professor, KTH Royal Institute of Technology
Machine LearningRobotics
G
Giuseppe Triggiani
AROL S.p.A., 14053, Canelli, Italy
D
Denis Ruffino
AROL S.p.A., 14053, Canelli, Italy
David Naso
David Naso
Department of Electrical and Information Engineering (DEI), Polytechnic University of Bari, 70126, Bari, Italy
P
Paolo Roberto Massenio
Department of Electrical and Information Engineering (DEI), Polytechnic University of Bari, 70126, Bari, Italy