DexVLG: Dexterous Vision-Language-Grasp Model at Scale

📅 2025-07-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Prior work predominantly focuses on simple fixture control, lacking semantic-driven dexterous grasping methods tailored for anthropomorphic hands. Method: We propose the first end-to-end framework for dexterous grasp pose prediction from natural language instructions and single-view RGB-D input. To enable fine-grained alignment across modalities, we introduce DexGraspNet 3.0—a large-scale dataset containing 170 million samples—featuring semantic part-level synchronization of language, vision, and multi-finger grasp poses. Our architecture integrates a vision-language model with a flow-matching-based grasp pose head to learn text-to-high-dimensional-grasp mappings trained exclusively on simulated data. Results: The method achieves zero-shot simulation success rates exceeding 76%, sets new state-of-the-art performance in part-level grasp accuracy, and demonstrates robust real-world execution of instruction-driven dexterous manipulation across diverse object categories.

Technology Category

Application Category

📝 Abstract
As large models gain traction, vision-language-action (VLA) systems are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. To accomplish this, we generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions. This large-scale dataset, named DexGraspNet 3.0, is used to train a VLM and flow-matching-based pose head capable of producing instruction-aligned grasp poses for tabletop objects. To assess DexVLG's performance, we create benchmarks in physics-based simulations and conduct real-world experiments. Extensive testing demonstrates DexVLG's strong zero-shot generalization capabilities-achieving over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation-and successful part-aligned grasps on physical objects in real-world scenarios.
Problem

Research questions and friction points this paper is trying to address.

Predict dexterous grasp poses with language instructions
Overcome data scarcity for human-like hand grasping
Achieve zero-shot generalization in grasp execution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Vision-Language-Grasp model for dexterous hands
170M grasp poses dataset with part-level captions
Flow-matching-based pose head for instruction alignment
🔎 Similar Papers
No similar papers found.