FG-CLTP: Fine-Grained Contrastive Language Tactile Pretraining for Robotic Manipulation

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing tactile representations, which rely predominantly on qualitative descriptions and struggle to model quantitative contact states such as force magnitude, contact geometry, and principal axis orientation, thereby hindering fine manipulation. To overcome this, the authors propose FG-CLTP, a fine-grained contrastive language-tactile pretraining framework that, for the first time, explicitly incorporates quantitative contact states into tactile-language alignment. They construct a large-scale dataset of over 100,000 tactile 3D point cloud-language pairs and introduce a discretized numerical tokenization mechanism that unifies physical quantities with semantic representations. The pretrained FG-CLTP model achieves 95.9% classification accuracy and reduces regression mean absolute error by 52.6% over prior methods, while the 3D point cloud representation provides a sensor-agnostic foundation with a simulation-to-reality gap of only 3.5%. Building on this representation, a 3D tactile-language-action (3D-TLA) architecture driven by a flow matching policy significantly outperforms existing approaches in contact-intensive manipulation.
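The discretized numerical tokenization described above can be illustrated with a minimal sketch: a continuous physical quantity (e.g., normal force in newtons) is uniformly binned over a fixed range and mapped to a discrete token, so the language model can treat it like any other vocabulary item. The bin count, range, and token format here are illustrative assumptions, not the paper's actual scheme.

```python
import numpy as np

def tokenize_quantity(value, vmin, vmax, n_bins=256, prefix="FORCE"):
    """Map a continuous quantity to a discrete token by uniform binning
    (illustrative; the paper's exact tokenization is not specified here)."""
    clipped = float(np.clip(value, vmin, vmax))          # stay in range
    bin_idx = int((clipped - vmin) / (vmax - vmin) * (n_bins - 1))
    return f"<{prefix}_{bin_idx}>"

def detokenize_quantity(token, vmin, vmax, n_bins=256):
    """Invert the mapping: recover the representative value of a bin."""
    bin_idx = int(token.split("_")[-1].rstrip(">"))
    step = (vmax - vmin) / (n_bins - 1)
    return vmin + bin_idx * step

# Example: a 2.37 N contact force on a hypothetical 0-10 N scale.
tok = tokenize_quantity(2.37, vmin=0.0, vmax=10.0)       # "<FORCE_60>"
val = detokenize_quantity(tok, vmin=0.0, vmax=10.0)      # ~2.35 N
```

The round-trip error is bounded by the bin width, which is the usual trade-off of discrete numerical tokens: finer bins give better precision at the cost of a larger vocabulary.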

📝 Abstract
Recent advancements in integrating tactile sensing into vision-language-action (VLA) models have demonstrated transformative potential for robotic perception. However, existing tactile representations predominantly rely on qualitative descriptors (e.g., texture), neglecting quantitative contact states such as force magnitude, contact geometry, and principal axis orientation, which are indispensable for fine-grained manipulation. To bridge this gap, we propose FG-CLTP, a fine-grained contrastive language tactile pretraining framework. We first introduce a novel dataset comprising over 100k tactile 3D point cloud-language pairs that explicitly capture multidimensional contact states from the sensor's perspective. We then implement a discretized numerical tokenization mechanism to achieve quantitative-semantic alignment, effectively injecting explicit physical metrics into the multimodal feature space. The proposed FG-CLTP model yields a 95.9% classification accuracy and reduces the regression error (MAE) by 52.6% compared to state-of-the-art methods. Furthermore, the integration of 3D point cloud representations establishes a sensor-agnostic foundation with a minimal sim-to-real gap of 3.5%. Building upon this fine-grained representation, we develop a 3D tactile-language-action (3D-TLA) architecture driven by a flow matching policy to enable multimodal reasoning and control. Extensive experiments demonstrate that our framework significantly outperforms strong baselines in contact-rich manipulation tasks, providing a robust and generalizable foundation for tactile-language-action models.
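The contrastive language-tactile pretraining in the abstract follows the general pattern of CLIP-style alignment: paired tactile and text embeddings are pulled together while mismatched pairs in the batch are pushed apart via a symmetric InfoNCE loss. The sketch below shows that objective in NumPy under assumed inputs (precomputed embedding matrices); it is a generic illustration of the technique, not the authors' implementation.

```python
import numpy as np

def _log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def contrastive_alignment_loss(tactile_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of (tactile, text) pairs.

    tactile_emb, text_emb: (B, D) arrays; row i of each forms a positive
    pair, and all other rows in the batch act as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    t = tactile_emb / np.linalg.norm(tactile_emb, axis=1, keepdims=True)
    x = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (t @ x.T) / temperature                  # (B, B) similarities
    diag = np.arange(logits.shape[0])
    # Cross-entropy with matching pairs on the diagonal, both directions.
    loss_t2x = -_log_softmax(logits, axis=1)[diag, diag].mean()
    loss_x2t = -_log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_t2x + loss_x2t)
```

With correctly paired embeddings the loss approaches zero; with shuffled pairs it grows toward log(B), which is what drives the tactile and language encoders into a shared feature space during pretraining.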
Problem

Research questions and friction points this paper is trying to address.

tactile sensing
quantitative contact states
fine-grained manipulation
vision-language-action models
robotic perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained tactile representation
contrastive language tactile pretraining
3D point cloud-language alignment
quantitative-semantic tokenization
sensor-agnostic manipulation