🤖 AI Summary
Existing robotic tool manipulation approaches struggle to achieve semantic-level planning and high-fidelity contact control simultaneously, and often exhibit limited generalization. This work proposes Semantic-Contact Fields (SCFields), a unified 3D representation that jointly encodes semantic and contact information. By integrating visual semantics with dense contact estimation through a two-stage sim-to-real contact learning pipeline, SCFields provide dense, contact-aware observations for diffusion-based policies. The method combines geometric heuristics, force optimization, and few-shot real-world data alignment to enable robust tool use. Evaluated on scraping, crayon-drawing, and peeling tasks, the method significantly outperforms vision-only and raw-tactile baselines, demonstrating category-level cross-tool generalization and robust manipulation of previously unseen tools.
📝 Abstract
Generalizing tool manipulation requires both semantic planning and precise physical control. Modern generalist robot policies, such as Vision-Language-Action (VLA) models, often lack the high-fidelity physical grounding required for contact-rich tool manipulation. Conversely, existing contact-aware policies that leverage tactile or haptic sensing are typically instance-specific and fail to generalize across diverse tool geometries. Bridging this gap requires learning unified contact representations from diverse data, yet a fundamental barrier remains: collecting diverse real-world tactile data at scale is prohibitively expensive, while direct zero-shot sim-to-real transfer is challenging due to the complex, nonlinear deformation dynamics of soft sensors. To address this, we propose Semantic-Contact Fields (SCFields), a unified 3D representation fusing visual semantics with dense contact estimates. We enable this via a two-stage Sim-to-Real Contact Learning Pipeline: first, we pre-train on a large-scale simulation dataset to learn general contact physics; second, we fine-tune on a small set of real data, pseudo-labeled via geometric heuristics and force optimization, to align sensor characteristics. This allows physical generalization to unseen tools. We leverage SCFields as the dense observation input for a diffusion policy to enable robust execution of contact-rich tool manipulation tasks. Experiments on scraping, crayon drawing, and peeling demonstrate robust category-level generalization, significantly outperforming vision-only and raw-tactile baselines.
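The two-stage pipeline in the abstract (pre-train a contact estimator on abundant simulated data, then fine-tune on a few pseudo-labeled real samples to align sensor characteristics) can be sketched in miniature as below. This is an illustrative toy, not the paper's implementation: the linear contact model, the `fit` helper, and the sim/real offset are all assumptions standing in for the actual learned contact fields and sensor gap.

```python
# Toy sketch of the two-stage sim-to-real contact learning pipeline.
# All names and the linear model are illustrative assumptions, not the
# paper's architecture.
import numpy as np

rng = np.random.default_rng(0)

def fit(W, X, Y, lr, steps):
    # Gradient-descent least-squares fit of a linear contact estimator.
    for _ in range(steps):
        grad = X.T @ (X @ W - Y) / len(X)
        W = W - lr * grad
    return W

# Stage 1: pre-train on abundant simulated tactile data to capture
# general contact physics.
X_sim = rng.normal(size=(1000, 8))     # simulated sensor features
W_true = rng.normal(size=(8, 3))       # latent feature-to-contact mapping
Y_sim = X_sim @ W_true                 # dense contact labels from simulation
W = fit(np.zeros((8, 3)), X_sim, Y_sim, lr=0.1, steps=200)
W_pretrained = W.copy()

# Stage 2: fine-tune on a few pseudo-labeled real samples to align sensor
# characteristics (modeled here as a small systematic offset).
X_real = rng.normal(size=(20, 8))
Y_real = X_real @ (W_true + 0.1)       # real sensor responds slightly differently
W = fit(W, X_real, Y_real, lr=0.05, steps=100)
```

The design point the sketch captures is that stage 1 does the heavy lifting from cheap simulated data, so stage 2 only needs a handful of real samples to close the remaining sim-to-real gap.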