🤖 AI Summary
Tactile perception in embodied intelligence is hindered by sparse spatial coverage and the absence of global semantic context, while research on multimodal tactile fusion lacks a unified framework. This work systematically reviews the relevant literature up to Q1 2026 and introduces, for the first time, a hierarchical taxonomy that covers data modalities (such as tactile-vision and tactile-language) and three methodological pillars: perception and recognition, cross-modal generation, and multimodal interaction. Drawing on advances in deep learning and large language models, the study surveys multimodal datasets, core algorithms, sensing hardware, and evaluation benchmarks, clarifying the field’s developmental trajectory and offering a coherent theoretical foundation and systematic reference for future research.
📝 Abstract
Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With recent rapid advances in deep learning and large language models, integrating tactile sensing with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, existing research remains fragmented across disparate datasets, sensing modalities, and tasks, and lacks a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field along two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources into Tactile-Vision datasets, Tactile-Language datasets, Tactile-Vision-Language datasets, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, which focuses on bidirectional translation among touch, vision, and text; and (3) Multimodal Interaction, which emphasizes feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.