🤖 AI Summary
This work addresses language-driven dexterous grasping, which requires jointly reasoning about semantic intent, 3D geometry, and hand-object physical interaction. Existing approaches lack this capability because they do not model physical contact as an intermediate step. The authors propose a contact-based embodied reasoning framework that uses the contact points between hand links and the object surface as an intermediate representation. The framework first autoregressively generates embodied contact tokens and then synthesizes high-dimensional grasp poses, bridging linguistic intent and physical constraints. By combining a vision-language model, autoregressive generation, and multi-finger hand kinematics, the method enables controllable grasp synthesis. On the DexGYS dataset, it achieves a success rate of 67.14%, outperforming the previous state of the art by 3.83 percentage points, and improves intention alignment by 96.4%.
📝 Abstract
Language-driven dexterous grasp generation requires models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches map observations directly to grasp parameters without intermediate reasoning about physical interactions. We present DextER (Dexterous Grasp Generation with Embodied Reasoning), which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links contact the object surface, and where, provides an embodiment-aware intermediate representation that bridges task semantics and physical constraints. DextER autoregressively generates embodied contact tokens that encode these link-to-surface contacts, followed by grasp tokens that encode the hand configuration. On DexGYS, DextER achieves a 67.14% success rate, outperforming the state of the art by 3.83 percentage points, with a 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, which provides fine-grained control over grasp synthesis.
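The two-stage ordering described in the abstract (contact tokens first, grasp tokens second, with optional user-specified contacts) can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary sizes, the `(link, position_bin)` token layout, the function names, and the uniform random sampler standing in for the learned model are all assumptions made purely for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple

# All sizes below are hypothetical; the abstract does not specify them.
NUM_LINKS = 16          # assumed number of hand links
NUM_POS_BINS = 64       # assumed number of quantized surface-position bins
NUM_GRASP_BINS = 256    # assumed number of bins per grasp parameter
NUM_CONTACTS = 4        # assumed number of contacts per grasp
NUM_GRASP_PARAMS = 6    # assumed number of grasp tokens

Contact = Tuple[int, int]  # (hand link id, surface position bin)


def toy_next_token(prefix: Sequence[int], vocab_size: int, rng: random.Random) -> int:
    """Stand-in for a learned next-token model.

    A real model would condition on the instruction, the observation and
    the token prefix; this toy ignores them and samples uniformly.
    """
    return rng.randrange(vocab_size)


def generate(
    instruction: str,
    partial_contacts: Optional[Sequence[Contact]] = None,
    seed: int = 0,
) -> Tuple[List[Contact], List[int]]:
    """Generate contact tokens first, then grasp tokens.

    `instruction` is accepted only to show where language conditioning
    would enter; the toy sampler does not use it.
    """
    rng = random.Random(seed)
    tokens: List[int] = []

    # Steering: user-specified contacts are copied into the prefix verbatim.
    contacts: List[Contact] = list(partial_contacts or [])
    for link, pos in contacts:
        tokens.extend([link, pos])

    # Stage 1: complete the remaining contacts autoregressively.
    while len(contacts) < NUM_CONTACTS:
        link = toy_next_token(tokens, NUM_LINKS, rng)
        tokens.append(link)
        pos = toy_next_token(tokens, NUM_POS_BINS, rng)
        tokens.append(pos)
        contacts.append((link, pos))

    # Stage 2: grasp tokens are generated after, and conditioned on, the contacts.
    grasp: List[int] = []
    for _ in range(NUM_GRASP_PARAMS):
        g = toy_next_token(tokens, NUM_GRASP_BINS, rng)
        tokens.append(g)
        grasp.append(g)

    return contacts, grasp


if __name__ == "__main__":
    # Fix one contact (hypothetical link 3 at position bin 10) and let the
    # sampler fill in the rest.
    contacts, grasp = generate("hold the mug by its handle", partial_contacts=[(3, 10)])
    print(contacts, grasp)
```

The point of the sketch is the ordering: because the grasp tokens come after the contact tokens in the same sequence, any contact fixed in the prefix is visible to every later prediction, which is one plausible reading of how partial contact specification could steer the result.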