TactEx: An Explainable Multimodal Robotic Interaction Framework for Human-Like Touch and Hardness Estimation

📅 2026-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving human-level hardness perception and interpretable tactile decision-making in contact-rich robotic manipulation. The authors propose the first multimodal framework that integrates tactile sensing (GelSight-Mini), visual input (RGB), and natural language instructions. By synergizing a pretrained tactile-visual model with a lightweight large language model (LLM), the system enables interpretable hardness estimation and interaction guidance. The approach employs ResNet50 combined with LSTM to process temporal tactile signals, while YOLO and Grounded-SAM refine contact region segmentation. A compact LLM generates natural language explanations to facilitate cross-modal alignment. Evaluated on fruit ripeness assessment, the method achieves statistically significant hardness discrimination across all category pairs (p < 0.01) and a 90% end-to-end task success rate, demonstrating strong generalization to new tasks without extensive fine-tuning.

📝 Abstract
Accurate perception of object hardness is essential for safe and dexterous contact-rich robotic manipulation. Here, we present TactEx, an explainable multimodal robotic interaction framework that unifies vision, touch, and language for human-like hardness estimation and interactive guidance. We evaluate TactEx on fruit-ripeness assessment, a representative task that requires both tactile sensing and contextual understanding. The system fuses GelSight-Mini tactile streams with RGB observations and language prompts. A ResNet50+LSTM model estimates hardness from sequential tactile data, while a cross-modal alignment module combines visual cues with guidance from a large language model (LLM). This explainable multimodal interface allows users to distinguish ripeness levels with statistically significant class separation (p < 0.01 for all fruit pairs). For touch placement, we compare YOLO with Grounded-SAM (GSAM) and find GSAM to be more robust for fine-grained segmentation and contact-site selection. A lightweight LLM parses user instructions and produces grounded natural-language explanations linked to the tactile outputs. In end-to-end evaluations, TactEx attains 90% task success on simple user queries and generalises to novel tasks without large-scale tuning. These results highlight the promise of combining pretrained visual and tactile models with language grounding to advance explainable, human-like touch perception and decision-making in robotics.
Problem

Research questions and friction points this paper is trying to address.

hardness estimation
tactile perception
multimodal interaction
explainable robotics
human-like touch
Innovation

Methods, ideas, or system contributions that make the work stand out.

explainable AI
multimodal fusion
tactile sensing
language grounding
hardness estimation
Felix Verstraete
Department of Bioengineering, Imperial-X Initiative, Imperial College London, London, United Kingdom
Lan Wei
Department of Bioengineering, Imperial-X Initiative, Imperial College London, London, United Kingdom
Wen Fan
University of California, Berkeley
Dandan Zhang
Imperial College London