🤖 AI Summary
Existing CNN- and ViT-based visual grasp detectors struggle to capture fine-grained local detail and global context at the same time in cluttered scenes, which limits their generalization. To address this, we propose a Vision Mamba-based architecture integrated with parallel convolutional and lightweight Transformer modules. Vision Mamba models long-range dependencies efficiently, multi-scale convolutions sharpen local texture perception, and the Transformer module refines global semantic understanding; together, these components improve grasp detection. We embed this design into an end-to-end grasp pose regression network. Our method achieves state-of-the-art performance on three major benchmarks (Cornell, Jacquard, and OCID-Grasp) and demonstrates superior accuracy (average grasp success rate improved by 3.2–5.7%) and robustness in both simulation and real-world robotic arm experiments. This work pioneers the application of state-space models to visual grasp detection, establishing a new paradigm for vision-action joint modeling.
📝 Abstract
Robotic grasping, whether of isolated, cluttered, or stacked objects, plays a critical role in industrial and service applications. However, current visual grasp detection methods based on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) often struggle to adapt to diverse scenarios, because they tend to emphasize either local or global features and neglect the complementary cues. In this paper, we propose a novel hybrid Mamba-Transformer approach to address these challenges. Our method captures both global and local information by integrating Vision Mamba with parallel convolutional-transformer blocks. This hybrid architecture improves adaptability, precision, and flexibility across various robotic tasks. To ensure a fair evaluation, we conducted extensive experiments on the Cornell, Jacquard, and OCID-Grasp datasets, which range from simple to complex scenarios. Additionally, we performed both simulated and real-world robotic experiments. The results demonstrate that our method not only surpasses state-of-the-art techniques on standard grasping datasets but also delivers strong performance in both simulation and real-world robot applications.
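The idea of a parallel convolutional-transformer block can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the abstract does not specify layer sizes, kernels, or the fusion rule, so everything below is an assumption. The function names (`local_branch`, `global_branch`, `parallel_block`), the fixed smoothing kernel, the identity attention projections, and the additive fusion are all hypothetical. The Vision Mamba component is omitted entirely. The sketch only shows how a local branch and a global branch can process the same tokens side by side and then be merged.

```python
import math

def local_branch(tokens, kernel=(0.25, 0.5, 0.25)):
    """Local path: depthwise 1D convolution along the token axis (zero padding).

    The fixed smoothing kernel is an assumption; a real model would learn it.
    """
    n, dim, half = len(tokens), len(tokens[0]), len(kernel) // 2
    out = []
    for i in range(n):
        acc = [0.0] * dim
        for k, w in enumerate(kernel):
            j = i + k - half  # neighbouring token index
            if 0 <= j < n:    # positions outside the sequence count as zeros
                for d in range(dim):
                    acc[d] += w * tokens[j][d]
        out.append(acc)
    return out

def global_branch(tokens):
    """Global path: single-head self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out

def parallel_block(tokens):
    """Run both branches on the same input and fuse by addition with a residual.

    Additive fusion is an assumption made for illustration only.
    """
    local = local_branch(tokens)
    glob = global_branch(tokens)
    return [[x + l + g for x, l, g in zip(x_row, l_row, g_row)]
            for x_row, l_row, g_row in zip(tokens, local, glob)]

# Four toy tokens with two features each (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
fused = parallel_block(tokens)
```

The local branch only sees each token's immediate neighbours, while the attention branch mixes information from every token, which is the complementary behaviour the abstract attributes to convolutional and transformer components.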