🤖 AI Summary
This work addresses the open-vocabulary referring localization problem—characterized by multiple instances and rich attributes—in complex, cluttered scenes, specifically for natural-language-driven robotic grasping. The proposed method introduces: (1) a hierarchical FiLM-based feature modulation mechanism that integrates frozen vision-language model (VLM) embeddings (e.g., CLIP); (2) a lightweight Transformer decoder that outperforms baselines on closed-vocabulary localization while being 100× smaller in parameter count; and (3) an effective synergy between the model and an open-set detector (GroundedSAM) to improve open-vocabulary performance. Evaluated across 15 realistic tabletop scenes, the approach achieves 90.33% visual grounding accuracy and is successfully deployed on a 7-DOF robotic arm, enabling end-to-end Referring Grasp Synthesis (RGS) in closed-loop execution. The code is publicly available.
📝 Abstract
Robots interacting with humans through natural language can unlock numerous applications such as Referring Grasp Synthesis (RGS). Given a text query, RGS determines a stable grasp pose to manipulate the referred object in the robot's workspace. RGS comprises two steps: visual grounding and grasp pose estimation. Recent studies leverage powerful Vision-Language Models (VLMs) for visually grounding free-flowing natural language in real-world robotic execution. However, comparisons in complex, cluttered environments with multiple instances of the same object are lacking. This paper introduces HiFi-CS, featuring hierarchical application of Feature-wise Linear Modulation (FiLM) to fuse image and text embeddings, enhancing visual grounding for the complex, attribute-rich text queries encountered in robotic grasping. Visual grounding associates an object in 2D/3D space with natural language input and is studied in two scenarios: Closed and Open Vocabulary. HiFi-CS features a lightweight decoder combined with a frozen VLM and outperforms competitive baselines in closed-vocabulary settings while being 100x smaller in size. Our model can effectively guide open-set object detectors like GroundedSAM to enhance open-vocabulary performance. We validate our approach through real-world RGS experiments using a 7-DOF robotic arm, achieving 90.33% visual grounding accuracy in 15 tabletop scenes. Our codebase is provided here: https://github.com/vineet2104/hifics
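To make the fusion mechanism concrete: FiLM conditions a visual feature map on a text embedding by predicting a per-channel scale (gamma) and shift (beta) from the text, and "hierarchical" application means repeating this modulation at successive feature levels. The sketch below is a minimal numpy illustration of that idea, not the paper's implementation — all shapes, weight matrices, and the two-level stacking are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, text_emb, w_gamma, w_beta):
    """Feature-wise Linear Modulation: each channel of the visual
    feature map is scaled and shifted by parameters predicted
    (here, via a plain linear map) from the text embedding."""
    gamma = text_emb @ w_gamma  # (C,) per-channel scale
    beta = text_emb @ w_beta    # (C,) per-channel shift
    return gamma[:, None, None] * features + beta[:, None, None]

# Toy shapes (assumptions): C channels, HxW spatial grid, D-dim text embedding
C, H, W, D = 8, 4, 4, 16
visual_feats = rng.standard_normal((C, H, W))   # stand-in for a VLM image feature map
text_emb = rng.standard_normal(D)               # stand-in for a frozen CLIP text embedding

# Hierarchical application: modulate at two successive levels,
# each with its own (hypothetical) projection weights.
w_g1, w_b1 = rng.standard_normal((D, C)), rng.standard_normal((D, C))
w_g2, w_b2 = rng.standard_normal((D, C)), rng.standard_normal((D, C))

level1 = film(visual_feats, text_emb, w_g1, w_b1)
level2 = film(level1, text_emb, w_g2, w_b2)
print(level2.shape)  # (8, 4, 4): spatial layout preserved, channels text-conditioned
```

Note that FiLM leaves the spatial resolution untouched, so the language-conditioned features can be handed directly to a lightweight decoder that predicts the grounding mask.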