Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

📅 2026-05-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Tactile perception in embodied intelligence is hindered by spatial sparsity and the absence of global semantic context, while research on multimodal tactile fusion lacks a unified framework. This work systematically reviews relevant literature up to Q1 2026 and introduces, for the first time, a hierarchical taxonomy encompassing data modalities—such as tactile–visual and tactile–language—and three methodological pillars: perceptual recognition, cross-modal generation, and multimodal interaction. By integrating advances in deep learning and large language models, the study comprehensively surveys multimodal datasets, core algorithms, sensing hardware, and evaluation benchmarks, thereby clarifying the field’s developmental trajectory and offering a coherent theoretical foundation and systematic reference for future research.
📝 Abstract
Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent rapid advances in deep learning and large language models, integrating tactile sensing with vision and language has become essential for bridging physical interaction and semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, existing research remains fragmented across disparate datasets, sensing modalities, and tasks, and lacks a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field along two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources into Tactile-Vision, Tactile-Language, Tactile-Vision-Language, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, which focuses on bidirectional translation among tactile, vision, and text modalities; and (3) Multimodal Interaction, which emphasizes feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.
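The taxonomy described in the abstract can be pictured as a small nested data structure. The following is only an illustrative sketch: the category names come from the abstract, while the dictionary layout, the variable name `TAXONOMY`, and the helper `leaf_count` are assumptions made here for illustration and are not taken from the paper.

```python
# Illustrative sketch only: category names follow the abstract; the nesting,
# the name TAXONOMY, and the helper leaf_count are hypothetical.
TAXONOMY = {
    "multimodal datasets": [
        "Tactile-Vision",
        "Tactile-Language",
        "Tactile-Vision-Language",
        "Tactile-Vision-Other",
    ],
    "multimodal methods": {
        "Multimodal Perception and Recognition": [
            "object understanding",
            "grasp prediction",
        ],
        "Cross-Modal Generation": [
            "bidirectional translation between tactile, vision, and text",
        ],
        "Multimodal Interaction": [
            "feedback control",
            "language-guided manipulation",
        ],
    },
}


def leaf_count(node):
    """Count the leaf entries (plain strings) underneath a taxonomy node."""
    if isinstance(node, str):
        return 1
    children = node.values() if isinstance(node, dict) else node
    return sum(leaf_count(child) for child in children)


if __name__ == "__main__":
    # Print each top-level dimension with the number of leaves beneath it.
    for dimension, subtree in TAXONOMY.items():
        print(dimension, leaf_count(subtree))
```

A nested mapping is used here simply because the abstract presents two dimensions, one with a flat list of dataset categories and one with pillars that each have named focus areas.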
Problem

Research questions and friction points this paper is trying to address.

Tactile sensing
Multimodal fusion
Embodied intelligence
Vision-language-tactile integration
Unified framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Tactile Fusion
Embodied Intelligence
Tactile-Vision-Language Integration
Cross-Modal Generation
Hierarchical Taxonomy
Zhixiang Cao
Xi'an Jiaotong University
AI4SE
Di Tian
Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), China
Runwei Guan
Hong Kong University of Science and Technology (Guangzhou) / Founder of FertiTech AI
Multi-Modal Learning, Unmanned Surface Vessel, Radar Perception, AI Medicine
Yanzhou Mu
Nanjing University
deep learning testing, SE4AI, concurrency testing, software defect prediction
Xiaolou Sun
Purple Mountain Laboratory, China
Shaofeng Liang
Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), China
Daizong Liu
Wuhan University
Computer Vision, Vision and Language, 3D Understanding, Adversarial Robustness, LVLM
Tao Huang
Centre for AI and Data Science Innovation and the School of Science and Engineering, James Cook University, Australia
Yutao Yue
Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), China
Henghui Ding
Fudan University
Computer Vision, Machine Learning, Segmentation, AIGC
Bin Fang
Beijing University of Posts and Telecommunications / Tsinghua University
Robotics and AI
Alex Zhou
Linkerbot (Beijing) Technology Co., Ltd, China
Qing-Long Han
School of Engineering, Swinburne University of Technology, Melbourne
Hui Xiong
Senior Scientist, Candela Corporation
Ultrafast dynamics, atomic molecular physics, free electron laser