🤖 AI Summary
Robotic grasping in cluttered scenes is severely hindered by occlusions, which impede reliable target localization and obstacle-aware manipulation planning.
Method: We propose UNOGrasp, the first model to explicitly incorporate occlusion-aware path modeling into a multi-step vision-language reasoning framework. It is fine-tuned via joint supervised and reinforcement learning on UNOBench, a large-scale, self-constructed occlusion-aware benchmark featuring diverse occlusion ratios, contact points, and natural-language instructions. A verifiable reasoning reward mechanism enables end-to-end co-optimization of target localization, occluder identification, and clearance path planning.
Results: Experiments demonstrate that UNOGrasp significantly outperforms general-purpose vision-language models and state-of-the-art grasping methods in both synthetic and real-world settings. It achieves finer-grained understanding of occlusion relationships and yields substantial improvements in grasp success rate.
📝 Abstract
Successful robotic grasping in cluttered environments requires a model not only to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originating from the target object, anchoring each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives. Project website: https://tev-fbk.github.io/UnoGrasp/.