🤖 AI Summary
This work addresses the challenge of achieving human-level hardness perception and interpretable tactile decision-making in contact-rich robotic manipulation. The authors propose a multimodal framework that integrates tactile sensing (GelSight-Mini), visual input (RGB), and natural language instructions. By coupling a pretrained tactile-visual model with a lightweight large language model (LLM), the system supports interpretable hardness estimation and interaction guidance. A ResNet50 backbone combined with an LSTM processes temporal tactile signals, while YOLO and Grounded-SAM refine contact-region segmentation, and a compact LLM generates natural-language explanations that aid cross-modal alignment. Evaluated on fruit-ripeness assessment, the method achieves statistically significant hardness discrimination across all category pairs (p < 0.01) and a 90% end-to-end task success rate, demonstrating strong generalization to new tasks without extensive fine-tuning.
📝 Abstract
Accurate perception of object hardness is essential for safe and dexterous contact-rich robotic manipulation. Here, we present TactEx, an explainable multimodal robotic interaction framework that unifies vision, touch, and language for human-like hardness estimation and interactive guidance. We evaluate TactEx on fruit-ripeness assessment, a representative task that requires both tactile sensing and contextual understanding. The system fuses GelSight-Mini tactile streams with RGB observations and language prompts. A ResNet50+LSTM model estimates hardness from sequential tactile data, while a cross-modal alignment module combines visual cues with guidance from a large language model (LLM). This explainable multimodal interface distinguishes ripeness levels with statistically significant class separation (p < 0.01 for all fruit pairs). For touch placement, we compare YOLO with Grounded-SAM (GSAM) and find GSAM more robust for fine-grained segmentation and contact-site selection. A lightweight LLM parses user instructions and produces grounded natural-language explanations linked to the tactile outputs. In end-to-end evaluations, TactEx attains 90% task success on simple user queries and generalises to novel tasks without large-scale tuning. These results highlight the promise of combining pretrained visual and tactile models with language grounding to advance explainable, human-like touch perception and decision-making in robotics.
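The tactile branch described above (a ResNet50 backbone whose per-frame features feed an LSTM over the tactile sequence) can be sketched schematically. The following is a minimal NumPy stand-in, illustrative only and not the authors' implementation: a random linear projection plays the role of the ResNet50 backbone, the LSTM cell is hand-rolled, and all dimensions and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_features(frame, W):
    # Stand-in for the ResNet50 backbone: project one flattened tactile
    # frame to a feature vector (shapes are hypothetical).
    return np.tanh(W @ frame.ravel())

def lstm_step(x, h, c, params):
    # One LSTM cell update over one frame's features: gates i/f/o and
    # candidate g are computed from a single stacked affine transform.
    Wx, Wh, b = params
    z = Wx @ x + Wh @ h + b
    H = h.size
    i = sigmoid(z[:H])
    f = sigmoid(z[H:2 * H])
    o = sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

# Hypothetical dimensions for illustration only.
T, Hpix, Wpix = 8, 16, 16      # 8 tactile frames of 16x16
D, H = 32, 24                  # feature dim, LSTM hidden dim

W_cnn = rng.normal(scale=0.1, size=(D, Hpix * Wpix))
params = (rng.normal(scale=0.1, size=(4 * H, D)),
          rng.normal(scale=0.1, size=(4 * H, H)),
          np.zeros(4 * H))
W_out = rng.normal(scale=0.1, size=(1, H))

# Mock GelSight-Mini sequence: random frames in place of real tactile images.
h, c = np.zeros(H), np.zeros(H)
for frame in rng.normal(size=(T, Hpix, Wpix)):
    h, c = lstm_step(frame_features(frame, W_cnn), h, c, params)

# Scalar hardness estimate read off the final hidden state.
hardness = (W_out @ h).item()
```

The design point this illustrates is that hardness is a temporal property: a single tactile frame shows indentation depth, but the trajectory of deformation across frames is what separates soft from firm contact, hence the recurrent aggregation before the final readout.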