🤖 AI Summary
Existing visuo-tactile learning methods struggle to achieve sub-millimeter dexterous manipulation, primarily because tactile representations are not explicitly aligned with the hand's kinematic coordinate system and therefore fail to exploit the rich spatial information inherent in tactile signals. To address this, we propose Spatially-anchored Tactile Awareness (SaTA), the first framework to explicitly anchor tactile measurements to the hand's coordinate frame, yielding a geometrically interpretable, sensor-faithful tactile representation that supports not only contact detection but also precise local geometric reconstruction of objects. SaTA integrates forward-kinematics-driven tactile feature alignment, end-to-end policy learning, and a multimodal tactile-visual fusion network. Evaluated on USB-C insertion/removal, light bulb installation, and card sliding tasks, SaTA improves success rates by up to 30 percentage points and reduces task completion time by 27 percent, significantly advancing model-free, learning-based approaches to high-precision dexterous manipulation.
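The summary does not spell out how tactile measurements are anchored to the hand frame. A minimal sketch of the general idea follows, assuming taxel positions reported in a fingertip sensor frame and a fingertip pose from forward kinematics; the function name, shapes, and interface are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def anchor_taxels_to_hand_frame(taxels_sensor: np.ndarray,
                                T_hand_fingertip: np.ndarray) -> np.ndarray:
    """Express taxel positions in the hand's kinematic (base) frame.

    taxels_sensor:    (N, 3) taxel positions in the fingertip sensor frame.
    T_hand_fingertip: (4, 4) homogeneous transform giving the fingertip
                      sensor frame's pose in the hand frame (from forward
                      kinematics). Hypothetical interface, for illustration.
    Returns:          (N, 3) taxel positions anchored in the hand frame.
    """
    n = taxels_sensor.shape[0]
    # Append a homogeneous coordinate, apply the FK transform, project back.
    homogeneous = np.hstack([taxels_sensor, np.ones((n, 1))])  # (N, 4)
    anchored = homogeneous @ T_hand_fingertip.T                # (N, 4)
    return anchored[:, :3]

# Example: a fingertip 10 cm above the hand origin, sensor frame axis-aligned.
T = np.eye(4)
T[:3, 3] = [0.0, 0.0, 0.10]
taxels = np.array([[0.001, 0.0, 0.0],    # millimeter-scale offsets on the pad
                   [0.0, 0.002, 0.0]])
print(anchor_taxels_to_hand_frame(taxels, T))
```

Expressing every contact point in one stable frame is what lets a policy reason about object geometry across fingers, rather than treating each sensor image in isolation.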
📝 Abstract
Dexterous manipulation requires precise geometric reasoning, yet existing visuo-tactile learning methods struggle with sub-millimeter precision tasks that are routine for traditional model-based approaches. We identify a key limitation: while tactile sensors provide rich contact information, current learning frameworks fail to effectively leverage both the perceptual richness of tactile signals and their spatial relationship with hand kinematics. We believe an ideal tactile representation should explicitly ground contact measurements in a stable reference frame while preserving detailed sensory information. We introduce SaTA (Spatially-anchored Tactile Awareness for dexterous manipulation), an end-to-end policy framework that explicitly anchors tactile features to the hand's kinematic frame through forward kinematics, enabling accurate geometric reasoning without requiring object models or explicit pose estimation. Our key insight is that spatially grounded tactile representations allow policies not only to detect contact occurrence but also to precisely infer object geometry in the hand's coordinate system. We validate SaTA on challenging dexterous manipulation tasks: bimanual USB-C mating in free space, which demands sub-millimeter alignment precision; light bulb installation, which requires precise thread engagement and rotational control; and card sliding, which demands delicate force modulation and angular precision. These tasks pose significant challenges for learning-based methods due to their stringent precision requirements. Across multiple benchmarks, SaTA significantly outperforms strong visuo-tactile baselines, improving success rates by up to 30 percentage points while reducing task completion times by 27 percent.
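The abstract describes an end-to-end policy that fuses the anchored tactile features with visual input, but gives no architecture details. The sketch below shows one common way such a fusion could look, assuming per-modality encoders joined by concatenation before a policy head; all layer sizes, the class name, and the action dimension are placeholder assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class TactileVisualFusion(nn.Module):
    """Illustrative late-fusion policy: encode each modality separately,
    concatenate, then map to actions. Dimensions are assumed, not SaTA's."""

    def __init__(self, tactile_dim: int = 64, visual_dim: int = 256,
                 action_dim: int = 22):
        super().__init__()
        self.tactile_enc = nn.Sequential(nn.Linear(tactile_dim, 128), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, 128), nn.ReLU())
        self.policy_head = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, action_dim))

    def forward(self, tactile_feat: torch.Tensor,
                visual_feat: torch.Tensor) -> torch.Tensor:
        # Fuse the two embeddings and predict an action vector.
        fused = torch.cat([self.tactile_enc(tactile_feat),
                           self.visual_enc(visual_feat)], dim=-1)
        return self.policy_head(fused)

# Example: a batch of 8 observations.
policy = TactileVisualFusion()
actions = policy(torch.randn(8, 64), torch.randn(8, 256))
print(actions.shape)  # torch.Size([8, 22])
```

The point of anchoring before fusion is that the tactile embedding already carries hand-frame geometry, so the policy head does not have to learn the sensor-to-hand mapping implicitly from data.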