DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping

📅 2025-02-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited generalization capability of robotic dexterous grasping across diverse objects, lighting conditions, and backgrounds. We propose a hierarchical vision-language-action framework: a high-level module leverages pretrained vision-language models (e.g., CLIP) for semantic perception and task planning, while a low-level diffusion-based policy network executes fine-grained motor control. To mitigate domain shift, we introduce multimodal alignment and domain-invariant representation learning. Our key contributions are (i) the first hierarchical architecture unifying vision, language, and action for robotic manipulation, and (ii) an end-to-end imitation learning paradigm grounded in domain-invariant representations, eliminating restrictive assumptions of single-object or controlled-environment settings. Evaluated on thousands of unseen object–lighting–background combinations, our method achieves over 90% zero-shot grasping success rate, significantly enhancing generalization and robustness in real-world scenarios.

Technology Category

Application Category

📝 Abstract
Dexterous grasping remains a fundamental yet challenging problem in robotics. A general-purpose robot must be capable of grasping diverse objects in arbitrary scenarios. However, existing research typically relies on specific assumptions, such as single-object settings or limited environments, leading to constrained generalization. Our solution is DexGraspVLA, a hierarchical framework that utilizes a pre-trained Vision-Language model as the high-level task planner and learns a diffusion-based policy as the low-level Action controller. The key insight lies in iteratively transforming diverse language and visual inputs into domain-invariant representations, where imitation learning can be effectively applied due to the alleviation of domain shift. Thus, it enables robust generalization across a wide range of real-world scenarios. Notably, our method achieves a 90+% success rate under thousands of unseen object, lighting, and background combinations in a ``zero-shot'' environment. Empirical analysis further confirms the consistency of internal model behavior across environmental variations, thereby validating our design and explaining its generalization performance. We hope our work can be a step forward in achieving general dexterous grasping. Our demo and code can be found at https://dexgraspvla.github.io/.
Problem

Research questions and friction points this paper is trying to address.

General dexterous grasping in diverse scenarios
Overcoming domain shift with vision-language-action framework
Achieving high success rate in zero-shot environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision-Language model for task planning
Employs diffusion-based policy for action control
Transforms inputs into domain-invariant representations
🔎 Similar Papers
No similar papers found.