🤖 AI Summary
This work addresses the challenge that existing dexterous grasping methods, which rely on predefined functional labels, struggle to achieve tight coupling between semantic understanding and pose estimation in unstructured environments. To overcome this limitation, the paper introduces BLaDA, a novel framework that establishes, for the first time, an end-to-end reasoning chain from open-vocabulary instructions to interpretable dexterous manipulation within 3D Gaussian Splatting scene representations. BLaDA integrates Knowledge-guided Language Parsing (KLP), Triangular Functional Point Localization (TriLocation) over 3D Gaussian Splatting, and 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) to jointly model semantics, geometry, and action. Experiments demonstrate that BLaDA significantly outperforms state-of-the-art methods on complex benchmarks, achieving leading performance in both functional-region localization accuracy and cross-category manipulation success rates.
📝 Abstract
In unstructured environments, functional dexterous grasping calls for the tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end vision-language-action (VLA) approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic–pose coupling needed for functional dexterous manipulation. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA establishes an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints via a Knowledge-guided Language Parsing (KLP) module. To achieve pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which uses 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks demonstrate that BLaDA significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation across diverse categories and tasks. Code will be publicly available at https://github.com/PopeyePxx/BLaDA.