Touch begins where vision ends: Generalizable policies for contact-rich manipulation

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Data-driven dexterous manipulation faces dual challenges: imitation learning requires extensive expert demonstrations, while reinforcement learning (RL) policies generalize poorly. This paper proposes ViTaL, the first framework to decouple localization from execution: in the localization phase, a vision-language model (VLM) performs cross-scene, semantic target localization; in the execution phase, high-resolution tactile and egocentric visual inputs drive a reusable local policy network for contact-rich manipulation. The authors identify three key principles: (i) foundation-model-based segmentation significantly improves the robustness of behavior-cloning encoders; (ii) residual RL enhances policy generalization; and (iii) tactile feedback delivers critical performance gains. Evaluated in unseen environments, ViTaL achieves roughly 90% success on contact-rich tasks, is robust to distractors, and integrates seamlessly with high-level VLMs, demonstrating strong reusability of low-level manipulation skills.
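The two-phase decomposition described above can be sketched as a minimal pipeline. All names here (`locate_target`, `LocalPolicy`, `run_task`) are hypothetical stand-ins for illustration, not the authors' API; the VLM and policy are stubbed out.

```python
from dataclasses import dataclass

@dataclass
class Pose:
    """Coarse 3D target pose returned by the localization phase."""
    x: float
    y: float
    z: float

def locate_target(scene_image, instruction: str) -> Pose:
    """Localization phase: a VLM grounds the instruction in the scene
    and returns a coarse target pose (stubbed with a fixed value here)."""
    return Pose(0.4, 0.1, 0.2)

class LocalPolicy:
    """Execution phase: a reusable, scene-agnostic policy that consumes
    egocentric vision and tactile input (stubbed)."""
    def act(self, ego_image, tactile) -> list:
        # Return a 6-DoF end-effector delta; a real policy would run a
        # learned network over the visuotactile observation.
        return [0.0] * 6

def run_task(scene_image, instruction, ego_image, tactile):
    target = locate_target(scene_image, instruction)  # scene-level reasoning
    policy = LocalPolicy()                            # reusable local skill
    action = policy.act(ego_image, tactile)           # contact-rich control
    return target, action
```

The key design point is that only `locate_target` sees scene-level context; the local policy is trained once in a canonical setting and reused across scenes.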

📝 Abstract
Data-driven approaches struggle with precise manipulation; imitation learning requires many hard-to-obtain demonstrations, while reinforcement learning yields brittle, non-generalizable policies. We introduce VisuoTactile Local (ViTaL) policy learning, a framework that solves fine-grained manipulation tasks by decomposing them into two phases: a reaching phase, where a vision-language model (VLM) enables scene-level reasoning to localize the object of interest, and a local interaction phase, where a reusable, scene-agnostic ViTaL policy performs contact-rich manipulation using egocentric vision and tactile sensing. This approach is motivated by the observation that while scene context varies, the low-level interaction remains consistent across task instances. By training local policies once in a canonical setting, they can generalize via a localize-then-execute strategy. ViTaL achieves around 90% success on contact-rich tasks in unseen environments and is robust to distractors. ViTaL's effectiveness stems from three key insights: (1) foundation models for segmentation enable training robust visual encoders via behavior cloning; (2) these encoders improve the generalizability of policies learned using residual RL; and (3) tactile sensing significantly boosts performance in contact-rich tasks. Ablation studies validate each of these insights, and we demonstrate that ViTaL integrates well with high-level VLMs, enabling robust, reusable low-level skills. Results and videos are available at https://vitalprecise.github.io.
Problem

Research questions and friction points this paper is trying to address.

Data-driven approaches struggle with precise manipulation tasks
Imitation learning requires many hard-to-obtain demonstrations
Reinforcement learning yields brittle, non-generalizable policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

ViTaL policy learning framework for manipulation tasks
Vision-language model for scene-level object localization
Tactile sensing enhances contact-rich task performance