🤖 AI Summary
To address the limited adaptability of robotic grasping to objects with diverse shapes and sizes, this paper proposes a voxel-based multi-scale contrastive learning framework for grasp planning. The method employs a dual-Transformer architecture, comprising an Insight Transformer and an Empower Transformer, that enables query-driven interaction between high-level semantic and low-level geometric features, facilitating cross-scale feature fusion. It further integrates multi-scale voxel convolutions with a contrastive learning objective to jointly optimize fine-grained geometric detail perception and holistic structural modeling. This design significantly enhances feature discriminability and cross-scale consistency. Evaluated on both simulated and real-world tabletop clutter clearing tasks, the approach achieves substantially higher grasp success rates and greater robustness than state-of-the-art baselines and ablation variants, demonstrating its effectiveness and strong generalization capability.
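The multi-scale contrastive objective described above can be illustrated with a minimal InfoNCE-style sketch. This is an illustrative assumption, not the paper's implementation: the function names, the cosine-similarity/temperature form of the loss, and the per-scale averaging are all our own choices. The idea it shows is the one in the summary: at each voxel scale, features of positive grasp samples are pulled together while negatives are pushed away, which encourages consistency of positive features across scales.

```python
import numpy as np

def info_nce(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style term (illustrative): raise the anchor's similarity
    to positive samples relative to its similarity to negatives."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, x) for x in list(positives) + list(negatives)])
    logits /= temperature
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax over all pairs
    return -log_probs[: len(positives)].mean()         # -log p(positive)

def multi_scale_contrastive_loss(features_by_scale, labels, temperature=0.1):
    """Apply the contrastive term independently at every voxel scale, so
    positive grasp features (label 1) stay mutually consistent at each scale."""
    pos_idx = np.flatnonzero(labels == 1)
    neg_idx = np.flatnonzero(labels == 0)
    total = 0.0
    for feats in features_by_scale:          # one (N, D) feature array per scale
        for i in pos_idx:                    # each positive sample acts as anchor
            others = [feats[j] for j in pos_idx if j != i]
            negs = [feats[j] for j in neg_idx]
            total += info_nce(feats[i], others, negs, temperature)
    return total / (len(features_by_scale) * len(pos_idx))
```

When positive features are mutually similar and far from negatives, the loss is small; when positives are scattered among the negatives, it grows, which is the behavior the contrastive objective is meant to optimize.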
📝 Abstract
Robotic grasping faces challenges in adapting to objects with varying shapes and sizes. In this paper, we introduce MISCGrasp, a volumetric grasping method that integrates multi-scale feature extraction with contrastive feature enhancement for self-adaptive grasping. We propose a query-based interaction between high-level and low-level features through the Insight Transformer, while the Empower Transformer selectively attends to the highest-level features; together, the two strike a balance between attention to fine geometric details and overall geometric structure. Furthermore, MISCGrasp utilizes multi-scale contrastive learning to exploit similarities among positive grasp samples, ensuring consistency across multi-scale features. Extensive experiments in both simulated and real-world environments demonstrate that MISCGrasp outperforms baseline and variant methods in tabletop decluttering tasks. More details are available at https://miscgrasp.github.io/.
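The query-based interaction between feature levels can be sketched as generic scaled dot-product cross-attention, where high-level features act as queries over a set of low-level geometric features. This is a simplified stand-in, not the authors' Insight Transformer: it omits the learned query/key/value projections and multi-head structure a real Transformer layer would have, and the function names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention (no learned projections):
    each high-level query attends over the low-level feature set and
    returns a fused representation as a convex combination of them."""
    d_k = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d_k)  # (Nq, Nkv) similarity scores
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ keys_values                     # (Nq, D) fused features
```

Because the attention weights form a convex combination, each fused feature lies within the span of the low-level features it attends to; in the full model, learned projections would let the network reshape this fusion rather than merely average.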