🤖 AI Summary
This study targets a core challenge in smart manufacturing: vision-only systems struggle to identify intrinsic material properties, such as hardness and surface roughness, and to detect surface defects, especially under occlusion and reflective interference. To overcome this limitation, the work proposes the first vision–touch–language tri-modal alignment framework. It employs modality-specific encoders and dual Q-Formers to extract language-aligned features, couples vision and touch through explicit contrastive learning, and injects compressed prefix tokens into a large language model to enable property inference and natural-language attribute description. The contributions include VitaSet, a large-scale human-verified instruction dataset, on which the method achieves 88.89% hardness accuracy, 75.13% roughness accuracy, a peak semantic similarity of 0.9009, 100% accuracy on two-category defect recognition, and a 94% end-to-end success rate in closed-loop robotic sorting.
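The "compressed prefix tokens" idea can be pictured as projecting a small set of Q-Former query outputs into the LLM's embedding space and prepending them to the text-token embeddings. The sketch below is a minimal illustration of that mechanism only; the dimensions, names (`W_proj`, `to_prefix_tokens`), and random weights are assumptions for demonstration, not the paper's actual architecture or API.

```python
import numpy as np

rng = np.random.default_rng(0)
D_QFORMER, D_LLM, N_PREFIX = 256, 512, 8  # assumed dimensions

# Hypothetical linear projection mapping Q-Former query outputs
# into the LLM embedding space (in practice this would be learned).
W_proj = rng.normal(scale=0.02, size=(D_QFORMER, D_LLM))

def to_prefix_tokens(qformer_out):
    """qformer_out: (N_PREFIX, D_QFORMER) compressed multimodal queries."""
    return qformer_out @ W_proj            # -> (N_PREFIX, D_LLM)

def prepend_prefix(prefix, text_embeds):
    """Concatenate prefix tokens ahead of the text-token embeddings,
    so the LLM conditions on visual/tactile context before the prompt."""
    return np.concatenate([prefix, text_embeds], axis=0)

qformer_out = rng.normal(size=(N_PREFIX, D_QFORMER))
text_embeds = rng.normal(size=(20, D_LLM))      # 20 prompt tokens
llm_input = prepend_prefix(to_prefix_tokens(qformer_out), text_embeds)
# llm_input has shape (N_PREFIX + 20, D_LLM)
```

The key design point this illustrates is that the LLM itself is unchanged: the multimodal signal enters purely through extra embedding-space tokens at the front of the sequence.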
📝 Abstract
Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods remain vulnerable to occlusion and reflection. We propose VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description. VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model. We align each modality with text and explicitly couple vision and touch through contrastive learning. We also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP. On VitaSet, it reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall; the material-description task further achieves a peak semantic similarity of 0.9009. With LoRA-based fine-tuning, VitaTouch attains 100.0%, 96.0%, and 92.0% accuracy for 2-, 3-, and 5-category defect recognition, respectively, and delivers 94.0% closed-loop recognition accuracy and 94.0% end-to-end sorting success in 100 laboratory robotic trials. More details are available at the project page: https://vitatouch.github.io/
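The explicit vision–touch coupling described above is typically realized as a symmetric contrastive (InfoNCE-style) objective over paired embeddings, where matched vision/touch samples sit on the diagonal of a similarity matrix. The following is a minimal sketch of such a loss, assuming L2-normalized embeddings and a fixed temperature; function names and the temperature value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def infonce_loss(vision, touch, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (vision, touch) embeddings.
    vision, touch: (B, D) arrays; row i of each is a matched pair."""
    v = l2_normalize(vision)
    t = l2_normalize(touch)
    logits = v @ t.T / temperature          # (B, B) cosine similarities
    idx = np.arange(len(v))                 # matched pairs on the diagonal

    def xent(l):
        # cross-entropy of the diagonal entries against all candidates
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average both directions: vision->touch and touch->vision
    return 0.5 * (xent(logits) + xent(logits.T))
```

As expected of a contrastive objective, the loss is lower when paired rows actually correspond than when the pairing is scrambled, which is what drives the two modalities into a shared space.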