🤖 AI Summary
Robotic grasping in cluttered scenes is severely hindered by occlusions, which impede reliable target localization and obstacle-aware manipulation planning.
Method: We propose UNOGrasp, the first model to explicitly incorporate occlusion-aware path modeling into a multi-step vision-language reasoning framework. It is fine-tuned via joint supervised and reinforcement learning on UNOBench, a large-scale, self-constructed occlusion-aware benchmark featuring diverse occlusion ratios, contact points, and natural-language instructions. A verifiable reasoning reward mechanism enables end-to-end co-optimization of target localization, occluder identification, and clearance path planning.
Results: Experiments demonstrate that UNOGrasp significantly outperforms general-purpose vision-language models and state-of-the-art grasping methods in both synthetic and real-world settings. It achieves finer-grained understanding of occlusion relationships and yields substantial improvements in grasp success rate.
📝 Abstract
Successful robotic grasping in cluttered environments requires a model not only to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originating from the target object, anchoring each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives. Project website: https://tev-fbk.github.io/UnoGrasp/.