🤖 AI Summary
This work addresses the challenge of achieving human-level hardness perception and interpretable tactile decision-making in contact-rich robotic manipulation. The authors propose a multimodal framework that integrates tactile sensing (GelSight-Mini), visual input (RGB), and natural language instructions. By coupling a pretrained tactile-visual model with a lightweight large language model (LLM), the system supports interpretable hardness estimation and interaction guidance. A ResNet50 backbone combined with an LSTM processes temporal tactile signals, while YOLO and Grounded-SAM refine contact-region segmentation, and a compact LLM generates natural-language explanations that aid cross-modal alignment. Evaluated on fruit-ripeness assessment, the method achieves statistically significant hardness discrimination across all category pairs (p < 0.01) and a 90% end-to-end task success rate, demonstrating strong generalization to new tasks without extensive fine-tuning.
📝 Abstract
Accurate perception of object hardness is essential for safe and dexterous contact-rich robotic manipulation. Here, we present TactEx, an explainable multimodal robotic interaction framework that unifies vision, touch, and language for human-like hardness estimation and interactive guidance. We evaluate TactEx on fruit-ripeness assessment, a representative task that requires both tactile sensing and contextual understanding. The system fuses GelSight-Mini tactile streams with RGB observations and language prompts. A ResNet50+LSTM model estimates hardness from sequential tactile data, while a cross-modal alignment module combines visual cues with guidance from a large language model (LLM). This explainable multimodal interface distinguishes ripeness levels with statistically significant class separation (p < 0.01 for all fruit pairs). For touch placement, we compare YOLO with Grounded-SAM (GSAM) and find GSAM more robust for fine-grained segmentation and contact-site selection. A lightweight LLM parses user instructions and produces grounded natural-language explanations linked to the tactile outputs. In end-to-end evaluations, TactEx attains 90% task success on simple user queries and generalises to novel tasks without large-scale tuning. These results highlight the promise of combining pretrained visual and tactile models with language grounding to advance explainable, human-like touch perception and decision-making in robotics.
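The tactile branch described above (a ResNet50 backbone whose per-frame features feed an LSTM over the tactile sequence) can be sketched schematically. The following is a minimal NumPy stand-in, illustrative only and not the authors' implementation: a random linear projection plays the role of the ResNet50 backbone, the LSTM cell is hand-rolled, and all dimensions and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_features(frame, W):
    # Stand-in for the ResNet50 backbone: project one flattened tactile
    # frame to a feature vector (shapes are hypothetical).
    return np.tanh(W @ frame.ravel())

def lstm_step(x, h, c, params):
    # One LSTM cell update over one frame's features: gates i/f/o and
    # candidate g are computed from a single stacked affine transform.
    Wx, Wh, b = params
    z = Wx @ x + Wh @ h + b
    H = h.size
    i = sigmoid(z[:H])
    f = sigmoid(z[H:2 * H])
    o = sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

# Hypothetical dimensions for illustration only.
T, Hpix, Wpix = 8, 16, 16      # 8 tactile frames of 16x16
D, H = 32, 24                  # feature dim, LSTM hidden dim

W_cnn = rng.normal(scale=0.1, size=(D, Hpix * Wpix))
params = (rng.normal(scale=0.1, size=(4 * H, D)),
          rng.normal(scale=0.1, size=(4 * H, H)),
          np.zeros(4 * H))
W_out = rng.normal(scale=0.1, size=(1, H))

# Mock GelSight-Mini sequence: random frames in place of real tactile images.
h, c = np.zeros(H), np.zeros(H)
for frame in rng.normal(size=(T, Hpix, Wpix)):
    h, c = lstm_step(frame_features(frame, W_cnn), h, c, params)

# Scalar hardness estimate read off the final hidden state.
hardness = (W_out @ h).item()
```

The design point this illustrates is that hardness is a temporal property: a single tactile frame shows indentation depth, but the trajectory of deformation across frames is what separates soft from firm contact, hence the recurrent aggregation before the final readout.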