VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge in smart manufacturing where vision-only systems struggle to accurately identify intrinsic material properties—such as hardness and surface roughness—and detect surface defects, particularly under occlusion and reflective interference. To overcome this limitation, the work proposes the first vision–touch–language tri-modal alignment framework. It employs modality-specific encoders and dual Q-Formers to extract language-aligned features, integrates explicit vision–touch contrastive learning, and injects compressed prefix tokens into a large language model to enable property inference and natural language description. The contributions include VitaSet, a large-scale human-verified instruction dataset, on which the method achieves high-accuracy recognition of hardness (88.89%) and roughness (75.13%), a semantic similarity of 0.9009, perfect defect detection accuracy (100%), and a 94% success rate in robotic closed-loop sorting tasks.
📝 Abstract
Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods remain vulnerable to occlusion and reflection. We propose VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description. VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model. We align each modality with text and explicitly couple vision and touch through contrastive learning. We also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP. On VitaSet, it reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall; the material-description task further achieves a peak semantic similarity of 0.9009. With LoRA-based fine-tuning, VitaTouch attains 100.0%, 96.0%, and 92.0% accuracy for 2-, 3-, and 5-category defect recognition, respectively, and delivers 94.0% closed-loop recognition accuracy and 94.0% end-to-end sorting success in 100 laboratory robotic trials. More details are available at the project page: https://vitatouch.github.io/
Problem

Research questions and friction points this paper is trying to address.

quality inspection
material properties
vision-tactile perception
smart manufacturing
occlusion and reflection
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-tactile-language model
multimodal fusion
property-aware perception
contrastive learning
robotic quality inspection
Junyi Zong
School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China; also with Zhongguancun Academy, Beijing 100094, China
Qingxuan Jia
School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
Meixian Shi
School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
Tong Li
Associate Professor, Renmin University of China. Chief Engineer, Huawei. PhD, Tsinghua University.
Computer Networking, Network Security, Distributed Systems, Data Space
Jiayuan Li
Wuhan University
remote sensing, image processing, computer vision
Zihang Lv
School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
Gang Chen
School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
Fang Deng
Beijing Institute of Technology
New Energy, Intelligent Information Processing, Intelligent Wearable System