🤖 AI Summary
To address the limitations of robotic grasping methods that rely on large-scale expert annotations and task-specific training, this paper proposes a zero-shot, annotation-free, and training-free grasp detection framework. The method leverages vision-language models (VLMs) to generate semantically guided RGB goal images of target objects; integrates depth estimation and instance segmentation to lift 2D features into 3D point clouds; initializes grasp orientation via principal component analysis; and refines executable 6-DoF grasp poses using correspondence-free point cloud alignment. To the authors' knowledge, this is the first work to incorporate large-scale VLMs into grasp detection, enabling cross-object and cross-scene zero-shot generalization. Quantitative evaluation on the Cornell and Jacquard datasets shows performance competitive with or superior to state-of-the-art supervised methods. Furthermore, real-world validation on a Franka Research 3 robot demonstrates successful zero-shot grasping of previously unseen objects in unstructured environments.
📝 Abstract
Robotic grasping is a fundamental capability for autonomous manipulation; however, most existing methods rely on large-scale expert annotations and necessitate retraining to handle new objects. We present VLAD-Grasp, a Vision-Language model Assisted zero-shot approach for Detecting grasps. From a single RGB-D image, our method (1) prompts a large vision-language model to generate a goal image where a straight rod "impales" the object, representing an antipodal grasp, (2) predicts depth and segmentation to lift this generated image into 3D, and (3) aligns generated and observed object point clouds via principal component analysis and correspondence-free optimization to recover an executable grasp pose. Unlike prior work, our approach is training-free and does not rely on curated grasp datasets. Despite this, VLAD-Grasp achieves performance that is competitive with or superior to that of state-of-the-art supervised models on the Cornell and Jacquard datasets. We further demonstrate zero-shot generalization to novel real-world objects on a Franka Research 3 robot, highlighting vision-language foundation models as powerful priors for robotic manipulation.
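The abstract's step (3) initializes the alignment between the generated and observed point clouds with principal component analysis. The paper does not publish its implementation here, but the standard PCA initialization it refers to can be sketched as follows: compute the principal axes of an object point cloud and use them as an initial grasp/alignment frame. Function and variable names below are illustrative, not taken from the paper.

```python
import numpy as np

def pca_frame(points: np.ndarray):
    """Estimate an initial object frame from a point cloud via PCA.

    points: (N, 3) array of 3D points.
    Returns (centroid, R) where the columns of R are the principal
    axes sorted by descending variance, forming a right-handed frame.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid
    # Eigen-decomposition of the 3x3 covariance gives the principal axes.
    cov = centered.T @ centered / len(points)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    R = eigvecs[:, ::-1]                    # reorder: major axis first
    # Flip one axis if needed so the frame is right-handed (det = +1).
    if np.linalg.det(R) < 0:
        R[:, -1] *= -1
    return centroid, R
```

Aligning the two clouds' PCA frames (translation from the centroids, rotation from the two `R` matrices) gives the coarse pose that the correspondence-free optimization then refines.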