🤖 AI Summary
This work addresses the absence of semantic descriptions for vibrotactile signals by introducing the task of vibrotactile captioning for the first time and presenting LMT108-CAP, the first paired vibrotactile-text dataset. To tackle this task, the authors propose ViPAC, a dual-branch neural network that disentangles the periodic and aperiodic components of tactile signals. The method imposes an orthogonality constraint to keep the two branches' features complementary and employs a dynamic fusion mechanism to adaptively integrate multi-scale information. Experiments on LMT108-CAP show that ViPAC significantly outperforms baselines adapted from audio and image captioning, achieving superior lexical fidelity and semantic alignment.
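To make the summary concrete, below is a minimal PyTorch sketch of the dual-branch idea as described above: a periodic branch with wide kernels, an aperiodic branch with narrow kernels, a learned per-sample fusion gate, and an orthogonality penalty on the two feature vectors. All layer shapes, kernel sizes, and the `DualBranchEncoder` / `orthogonality_loss` names are illustrative assumptions, not the authors' actual architecture or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchEncoder(nn.Module):
    """Hypothetical sketch of a ViPAC-style encoder: one branch for periodic
    structure, one for aperiodic transients, fused by a learned per-sample
    gate. Sizes and ops are illustrative assumptions."""

    def __init__(self, in_channels=1, feat_dim=256):
        super().__init__()
        # Periodic branch: wide kernels to capture repeating vibration patterns.
        self.periodic = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=63, padding=31),
            nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=63, padding=31),
            nn.AdaptiveAvgPool1d(1),
        )
        # Aperiodic branch: narrow kernels for transient, non-repeating events.
        self.aperiodic = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(64, feat_dim, kernel_size=7, padding=3),
            nn.AdaptiveAvgPool1d(1),
        )
        # Dynamic fusion: predict a convex mixing weight from both features.
        self.gate = nn.Linear(2 * feat_dim, 1)

    def forward(self, x):  # x: (batch, 1, time)
        zp = self.periodic(x).squeeze(-1)   # (batch, feat_dim)
        za = self.aperiodic(x).squeeze(-1)  # (batch, feat_dim)
        w = torch.sigmoid(self.gate(torch.cat([zp, za], dim=-1)))  # (batch, 1)
        fused = w * zp + (1.0 - w) * za     # adaptive per-sample fusion
        return fused, zp, za, w

def orthogonality_loss(zp, za):
    """Penalize cosine overlap between branch features so they stay complementary."""
    zp = F.normalize(zp, dim=-1)
    za = F.normalize(za, dim=-1)
    return (zp * za).sum(dim=-1).pow(2).mean()

if __name__ == "__main__":
    enc = DualBranchEncoder()
    signal = torch.randn(8, 1, 4096)        # a batch of raw vibrotactile traces
    fused, zp, za, w = enc(signal)
    loss = orthogonality_loss(zp, za)       # would be added to the captioning loss
    print(fused.shape, loss.item())
```

The fused feature vector would then feed a caption decoder; the gate lets each sample lean on the branch that better explains its signal.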
📝 Abstract
The standardization of vibrotactile data by the IEEE P1918.1 working group has greatly advanced its applications in virtual reality, human-computer interaction, and embodied artificial intelligence. Despite these efforts, the semantic interpretation and understanding of vibrotactile signals remain an unresolved challenge. In this paper, we make the first attempt at vibrotactile captioning, i.e., generating natural language descriptions from vibrotactile signals. We propose Vibrotactile Periodic-Aperiodic Captioning (ViPAC), a method designed to handle the intrinsic properties of vibrotactile data, namely their hybrid periodic-aperiodic structure and lack of spatial semantics. Specifically, ViPAC employs a dual-branch strategy to disentangle periodic and aperiodic components, combined with a dynamic fusion mechanism that adaptively integrates the resulting signal features. It further introduces an orthogonality constraint and a weighting regularization to ensure feature complementarity and fusion consistency. Additionally, we construct LMT108-CAP, the first vibrotactile-text paired dataset, by using GPT-4o to generate five constrained captions per surface image from the popular LMT-108 dataset. Experiments show that ViPAC significantly outperforms baseline methods adapted from audio and image captioning, achieving superior lexical fidelity and semantic alignment.
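The abstract names an orthogonality constraint and a weighting regularization but does not give their form. One plausible instantiation, assuming branch features $z_p, z_a$ and a fusion weight $w \in (0,1)$ as in the sketch above (the notation and both penalty terms are our assumptions, not the paper's definitions), is:

```latex
\mathcal{L}
  = \mathcal{L}_{\text{caption}}
  + \lambda_{1}
    \underbrace{\left(\frac{z_p^{\top} z_a}{\lVert z_p \rVert_2 \, \lVert z_a \rVert_2}\right)^{2}}_{\text{orthogonality constraint}}
  + \lambda_{2}
    \underbrace{\left(w - \tfrac{1}{2}\right)^{2}}_{\text{weighting regularization}}
```

Here $\lambda_1, \lambda_2$ are trade-off hyperparameters; the squared-cosine term vanishes exactly when the two branch features are orthogonal, and the weighting term discourages the fusion gate from collapsing onto a single branch.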