VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility

📅 2025-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: 6-DoF active grasping under severe occlusion and complete target invisibility. Method: We propose a vision-language-driven, instance-level spatial-modeling and uncertainty-aware grasping framework. For the first time, we integrate vision-language foundation models into a closed-loop system unifying target-perceptive active viewpoint planning and grasp decision-making. We introduce an instance-centric spatial relation representation together with a multi-view uncertainty modeling and real-time fusion mechanism that jointly optimize Next-Best-View selection and 6-DoF grasp pose estimation. Contribution/Results: Our method achieves an 87.5% goal-directed grasping success rate in realistic cluttered scenes with minimal trial attempts, significantly outperforming existing baselines. The core contribution is an end-to-end pipeline spanning language-guided prompting, spatial understanding, active observation, uncertainty-aware fusion, and robust grasping, introducing a novel paradigm for manipulating invisible objects.

📝 Abstract
We propose VISO-Grasp, a novel vision-language-informed system designed to systematically address visibility constraints for grasping in severely occluded environments. By leveraging Foundation Models (FMs) for spatial reasoning and active view planning, our framework constructs and updates an instance-centric representation of spatial relationships, enhancing grasp success under challenging occlusions. Furthermore, this representation facilitates active Next-Best-View (NBV) planning and optimizes sequential grasping strategies when direct grasping is infeasible. Additionally, we introduce a multi-view uncertainty-driven grasp fusion mechanism that refines grasp confidence and directional uncertainty in real time, ensuring robust and stable grasp execution. Extensive real-world experiments demonstrate that VISO-Grasp achieves a success rate of 87.5% in target-oriented grasping with the fewest grasp attempts, outperforming baselines. To the best of our knowledge, VISO-Grasp is the first unified framework integrating FMs into target-aware active view planning and 6-DoF grasping in environments with severe occlusions and complete invisibility constraints.
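The abstract's multi-view uncertainty-driven grasp fusion can be illustrated with a toy sketch. The paper does not publish this computation; the confidence weighting, the resultant-length uncertainty proxy, and the function name `fuse_grasps` below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fuse_grasps(candidates):
    """Fuse per-view grasp hypotheses for one target into a single grasp axis.

    candidates: list of (direction, confidence) pairs, where `direction` is a
    unit 3-vector for the approach axis and `confidence` in (0, 1] is the
    grasp detector's score from that viewpoint.
    Returns a confidence-weighted mean direction (renormalized) and a simple
    directional-uncertainty proxy: 1 minus the resultant length, which is 0
    when all views agree and approaches 1 as they cancel out.
    """
    dirs = np.array([d for d, _ in candidates], dtype=float)
    conf = np.array([c for _, c in candidates], dtype=float)
    weights = conf / conf.sum()
    mean = (weights[:, None] * dirs).sum(axis=0)
    resultant = np.linalg.norm(mean)
    if resultant < 1e-12:              # views cancel: direction undecidable
        return dirs[0], 1.0
    return mean / resultant, 1.0 - resultant
```

The resultant-length statistic is a standard measure of dispersion for directional data; views that agree reinforce the mean vector, while conflicting views shrink it, which is one plausible way to realize the "directional uncertainty" the abstract describes.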
Problem

Research questions and friction points this paper is trying to address.

Addresses visibility constraints in occluded grasping environments.
Enhances grasp success using spatial reasoning and active view planning.
Optimizes sequential grasping strategies and real-time grasp confidence.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages Foundation Models for spatial reasoning.
Implements multi-view uncertainty-driven grasp fusion.
Integrates an active Next-Best-View planning strategy.
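To make the Next-Best-View idea concrete, here is a minimal sketch of greedy view selection by expected uncertainty reduction. The visibility-proportional uncertainty model and the function `next_best_view` are hypothetical simplifications for illustration, not the paper's actual planner.

```python
def next_best_view(candidate_views, visibility, current_uncertainty):
    """Pick the candidate viewpoint expected to reduce target uncertainty most.

    candidate_views: list of view identifiers.
    visibility: dict mapping view -> estimated fraction of the target visible
        from that pose (0..1), e.g. predicted from a spatial-relation map.
    current_uncertainty: scalar uncertainty about the target in (0, 1].
    Assumes (simplistically) that observing a fraction v of the target
    shrinks uncertainty to current_uncertainty * (1 - v); lower is better.
    """
    def expected_uncertainty(view):
        return current_uncertainty * (1.0 - visibility[view])
    return min(candidate_views, key=expected_uncertainty)
```

In a full system this greedy score would be traded off against motion cost and re-estimated after each observation, closing the loop between viewpoint planning and grasp decision-making.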