🤖 AI Summary
This work addresses the challenge of tool use in bimanual dexterous manipulation, where high-dimensional state spaces and complex contact dynamics hinder effective policy learning. To this end, the authors propose a physics-guided Graph Transformer policy that models the bimanual system as a kinematic graph, preserving local state through per-link tokenization. The attention mechanism explicitly integrates multidimensional physical priors—including kinematic spatial distances, contact states, geometric proximity, and anatomical properties—to enable explicit reasoning about physical interactions. Compared to the ManipTrans baseline, the proposed method achieves significantly higher task success rates and manipulation accuracy while reducing model parameters by 49%. Furthermore, it demonstrates zero-shot transfer to unseen tool and object geometries and is general enough to be trained on three distinct dexterous hands: Shadow, Allegro, and Inspire.
📝 Abstract
Bimanual dexterous manipulation for tool use remains a formidable challenge in robotics due to the high-dimensional state space and complicated contact dynamics. Existing methods naively represent the entire system state as a single configuration vector, disregarding the rich structural and topological information inherent to articulated hands. We present PhysGraph, a physically-grounded graph transformer policy designed explicitly for challenging bimanual hand-tool-object manipulation. Unlike prior works, we represent the bimanual system as a kinematic graph and introduce per-link tokenization to preserve fine-grained local state information. We propose a physically-grounded bias generator that injects structural priors directly into the attention mechanism, including kinematic spatial distance, dynamic contact states, geometric proximity, and anatomical properties. This allows the policy to reason explicitly about physical interactions rather than learning them implicitly from sparse rewards. Extensive experiments show that PhysGraph significantly outperforms the ManipTrans baseline in manipulation precision and task success rate while using only 51% of its parameters. Furthermore, the topological flexibility of our architecture enables qualitative zero-shot transfer to unseen tool/object geometries, and the method is general enough to be trained on three robotic hands (Shadow, Allegro, Inspire).
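The core idea of the bias generator can be sketched as additive terms on the attention logits, in the style of structure-aware graph transformers. The sketch below is illustrative only: the function name, the identity Q/K/V projections, and the bias weights `w_dist` and `w_contact` are assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biased_attention(tokens, kin_dist, contact, w_dist=-0.5, w_contact=2.0):
    """Single-head attention over per-link tokens with additive physical biases.

    tokens:   (N, d) one embedding per kinematic link (per-link tokenization)
    kin_dist: (N, N) hop distance between links in the kinematic graph
    contact:  (N, N) binary indicator of contact between links/objects
    """
    d = tokens.shape[1]
    q, k, v = tokens, tokens, tokens  # identity projections for brevity
    scores = q @ k.T / np.sqrt(d)
    # Physical priors enter as additive logit biases: kinematically
    # nearby links and links in contact attend to each other more.
    scores = scores + w_dist * kin_dist + w_contact * contact
    return softmax(scores, axis=-1) @ v
```

Because the biases are per-pair scalars rather than learned from scratch, the attention pattern reflects the kinematic topology even before training; additional priors (geometric proximity, anatomical properties) would enter the same way as further additive terms.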