CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding

📅 2025-05-13
📈 Citations: 0 (influential: 0)
🤖 AI Summary
Existing tactile perception research predominantly focuses on surface attributes (e.g., texture), neglecting critical contact-state information, such as contact location, shape, and force, that is essential for robotic manipulation, thereby limiting its utility in multimodal perception. To address this, we propose the first language-tactile alignment framework explicitly designed for contact-state understanding. We introduce a large-scale, manually annotated 3D point cloud-language dataset comprising over 50,000 paired samples, establishing the first mapping between tactile signals and natural language grounded in contact-state semantics. Methodologically, we adopt a cross-modal contrastive pretraining paradigm that keeps the pre-aligned vision-language model (VLM) feature space frozen while integrating multi-granularity contact-state annotations with tactile point cloud encoding. Experiments demonstrate substantial improvements over baselines in zero-shot 3D classification, contact-state recognition, and tactile-driven large language model interaction. Both code and dataset are fully open-sourced.
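As a rough illustration of the contrastive paradigm summarized above, here is a minimal sketch of a CLIP-style symmetric InfoNCE objective that aligns tactile point-cloud embeddings with text embeddings from a frozen encoder. The encoder names, the training-step comments, and the temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(tactile_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (tactile, text) embeddings."""
    # L2-normalize so the dot product is cosine similarity.
    tactile_emb = F.normalize(tactile_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = tactile_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    loss_t2l = F.cross_entropy(logits, targets)
    loss_l2t = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_t2l + loss_l2t)

# Training step (sketch, hypothetical encoder names): only the tactile
# encoder is updated; the language side stays frozen so tactile features
# land in the pre-aligned vision-language space.
#   tactile_emb = tactile_encoder(point_clouds)         # trainable
#   with torch.no_grad():
#       text_emb = frozen_text_encoder(contact_texts)   # frozen VLM text tower
#   loss = contrastive_loss(tactile_emb, text_emb)
```

Freezing the text tower is what anchors tactile features in the pre-aligned vision-language space, enabling the zero-shot behavior reported below.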

📝 Abstract
Recent advancements in integrating tactile sensing with vision-language models (VLMs) have demonstrated remarkable potential for robotic multimodal perception. However, existing tactile descriptions remain limited to superficial attributes like texture, neglecting critical contact states essential for robotic manipulation. To bridge this gap, we propose CLTP, an intuitive and effective language-tactile pretraining framework that aligns tactile 3D point clouds with natural language in various contact scenarios, thus enabling contact-state-aware tactile language understanding for contact-rich manipulation tasks. We first collect a novel dataset of 50k+ tactile 3D point cloud-language pairs, where descriptions explicitly capture multidimensional contact states (e.g., contact location, shape, and force) from the tactile sensor's perspective. CLTP leverages a pre-aligned and frozen vision-language feature space to bridge holistic textual and tactile modalities. Experiments validate its superiority in three downstream tasks: zero-shot 3D classification, contact state classification, and tactile 3D large language model (LLM) interaction. To the best of our knowledge, this is the first study to align tactile and language representations from the contact state perspective for manipulation tasks, providing great potential for tactile-language-action model learning. Code and datasets are open-sourced at https://sites.google.com/view/cltp/.
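To make the zero-shot 3D classification task concrete, here is a hedged sketch of how an aligned embedding space is typically queried at test time: each candidate class is described by a text prompt, and a tactile point cloud is assigned to the best-matching prompt. The function name and the example prompts are illustrative assumptions, not the paper's protocol.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(tactile_emb: torch.Tensor, prompt_embs: torch.Tensor,
                       labels: list[str]) -> str:
    """Pick the class whose text prompt best matches one tactile embedding.

    tactile_emb: (D,) embedding of a single tactile point cloud.
    prompt_embs: (C, D) embeddings of one text prompt per class,
                 produced by the frozen VLM text encoder.
    """
    sims = F.normalize(tactile_emb, dim=-1) @ F.normalize(prompt_embs, dim=-1).t()
    return labels[int(sims.argmax())]  # (C,) scores -> best class

# Illustrative contact-state prompts (not from the paper):
#   labels  = ["point contact", "line contact", "surface contact"]
#   prompts = [f"a tactile point cloud showing {c}" for c in labels]
```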
Problem

Research questions and friction points this paper is trying to address.

Existing tactile descriptions remain limited to superficial attributes such as texture
Critical contact states (location, shape, force) needed for manipulation are neglected
No prior framework aligns tactile signals with language at the contact-state level
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive alignment of tactile 3D point clouds with natural language
Bridges modalities through a pre-aligned, frozen vision-language feature space
Contributes 50k+ manually annotated tactile point cloud-language pairs (see the record sketch below)
First study to align tactile and language representations from the contact-state perspective
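For illustration, one paired sample carrying the multidimensional contact-state annotations described above might be structured as follows. The field names and label vocabularies are hypothetical, not the released dataset's schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TactilePair:
    """One tactile point cloud-language pair (illustrative schema)."""
    points: np.ndarray   # (N, 3) contact geometry from the tactile sensor
    caption: str         # natural-language contact description
    location: str        # contact location label, e.g. "upper-left region"
    shape: str           # contact shape label, e.g. "line"
    force: str           # force level label, e.g. "light"
```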
👥 Authors

Wenxuan Ma
Zhejiang University
Smart Grid · Quantum Computing · Quantum Machine Learning

Xiaoge Cao
Institute of Automation, Chinese Academy of Sciences

Yixiang Zhang
Beihang University

Chaofan Zhang
Institute of Automation, Chinese Academy of Sciences
Tactile perception and robot dexterous manipulation

Shaobo Yang
Beijing University of Posts and Telecommunications

Peng Hao
Samsung Research China

Bin Fang
Beijing University of Posts and Telecommunications

Yinghao Cai
Institute of Automation, Chinese Academy of Sciences

Shaowei Cui
Institute of Automation, Chinese Academy of Sciences

Shuo Wang
Institute of Automation, Chinese Academy of Sciences