🤖 AI Summary
Task-oriented, zero-shot, open-vocabulary part-level grasping in unstructured environments—particularly under occlusion and clutter, and with unseen objects—remains highly challenging due to the absence of geometric priors and part-level annotations.
Method: We propose a language-vision collaborative reasoning framework: a large language model (LLM) parses task semantics from natural language instructions; a vision-language model (VLM) performs zero-shot part localization and segmentation without fine-tuning; and operability-aware modeling generates 2D grasp heatmaps.
Contribution/Results: Our method eliminates reliance on geometry-based assumptions or part supervision, enabling execution of arbitrary natural-language tasks. Evaluated on 20 household object categories across 60 tasks, it achieves 95% part identification accuracy, 78.3% real-robot grasping success rate, and 80% critical-part selection accuracy under occlusion—demonstrating substantial improvements in generalization and robustness for open-world, task-driven robotic grasping.
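The three-stage pipeline above (LLM task parsing → VLM part segmentation → 2D grasp heatmap) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: `parse_task` and `segment_parts` are hypothetical stand-ins for the LLM and VLM calls (here replaced by toy stubs), and the heatmap simply scores grasp-part pixels positively and avoid-part pixels negatively.

```python
import numpy as np

def parse_task(task: str) -> dict:
    """Stub for the LLM stage: map a natural-language task to parts
    to grasp and parts to avoid. A real system would query an LLM."""
    if "pour" in task:
        return {"grasp": ["handle"], "avoid": ["opening"]}
    return {"grasp": ["body"], "avoid": []}

def segment_parts(image: np.ndarray, part_names: list) -> dict:
    """Stub for the VLM stage: return one binary mask per named part.
    Here each part is a dummy horizontal strip; a real system would
    use zero-shot open-vocabulary segmentation."""
    h, w = image.shape[:2]
    strip = h // max(len(part_names), 1)
    masks = {}
    for i, name in enumerate(part_names):
        m = np.zeros((h, w), dtype=bool)
        m[i * strip:(i + 1) * strip, :] = True
        masks[name] = m
    return masks

def grasp_heatmap(masks: dict, plan: dict) -> np.ndarray:
    """Combine part masks into a 2D heatmap of actionable regions:
    grasp parts add score, avoid parts subtract it."""
    shape = next(iter(masks.values())).shape
    heat = np.zeros(shape, dtype=float)
    for name in plan["grasp"]:
        heat += masks[name]
    for name in plan["avoid"]:
        heat -= masks[name]
    return np.clip(heat, 0.0, 1.0)

# End-to-end toy run on a blank RGB image.
image = np.zeros((64, 64, 3), dtype=np.uint8)
plan = parse_task("pour water from the kettle")
masks = segment_parts(image, plan["grasp"] + plan["avoid"])
heat = grasp_heatmap(masks, plan)
```

The robot would then select a grasp point at a heatmap maximum, so that task-relevant parts (the handle) score high while task-critical parts to avoid (the opening) score zero.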
📝 Abstract
To manipulate objects in novel, unstructured environments, robots need task-oriented grasps that target object parts based on the given task. Geometry-based methods often struggle with visually defined parts, occlusions, and unseen objects. We introduce OVAL-Grasp, a zero-shot, open-vocabulary approach to task-oriented, affordance-based grasping that uses large language models and vision-language models to allow a robot to grasp objects at the correct part according to a given task. Given an RGB image and a task, OVAL-Grasp identifies parts to grasp or avoid with an LLM, segments them with a VLM, and generates a 2D heatmap of actionable regions on the object. During our evaluations, we found that our method outperformed two task-oriented grasping baselines on experiments with 20 household objects with 3 unique tasks for each. OVAL-Grasp successfully identifies and segments the correct object part 95% of the time and grasps the correct actionable area 78.3% of the time in real-world experiments with the Fetch mobile manipulator. Additionally, OVAL-Grasp finds correct object parts under partial occlusions, demonstrating a part selection success rate of 80% in cluttered scenes. We also demonstrate OVAL-Grasp's efficacy in scenarios that rely on visual features for part selection, and show the benefit of a modular design through our ablation experiments. Our project webpage is available at https://ekjt.github.io/OVAL-Grasp/