HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current haptic vibration signal design faces two key bottlenecks: the absence of large-scale, text-annotated haptic datasets and insufficient cross-modal representation alignment. To address these, we introduce HapticCap—the first fully human-annotated haptic–text dataset, comprising 92,070 haptic–text pairs describing the sensory, emotional, and associative attributes of vibration signals—and formally define the haptic–caption retrieval task. We propose a supervised contrastive learning framework for multimodal alignment that integrates the T5 language model and the Audio Spectrogram Transformer (AST), with training stratified by description category. Experiments show that the T5+AST combination outperforms baseline model pairings on cross-modal retrieval, particularly when trained separately for each description category. This work provides both foundational data infrastructure and a methodological basis for interpretable, data-driven haptic design.
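The summary's core method is a supervised contrastive framework that pulls matched caption and vibration embeddings together in a shared space. Below is a minimal, hypothetical sketch of a symmetric InfoNCE-style alignment loss over paired text and vibration embeddings; the function name, projection dimension, and CLIP-style formulation are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch: symmetric contrastive alignment of text and
# vibration embeddings. Row i of each tensor describes the same signal.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, vib_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    vib_emb = F.normalize(vib_emb, dim=-1)
    logits = text_emb @ vib_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0))         # matched pairs on the diagonal
    # Pull matched pairs together and push mismatches apart, in both directions.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2v + loss_v2t) / 2

# Stand-in embeddings: a batch of 8 pairs in a 512-d shared space,
# e.g. projected T5 sentence embeddings and AST spectrogram embeddings.
text_emb, vib_emb = torch.randn(8, 512), torch.randn(8, 512)
print(contrastive_alignment_loss(text_emb, vib_emb))
```

Training such a loss separately per description category (sensory, emotional, associative) would match the summary's note that category-stratified training performs best.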

📝 Abstract
Haptic signals, from smartphone vibrations to virtual reality touch feedback, can effectively convey information and enhance realism, but designing signals that resonate meaningfully with users is challenging. To facilitate this, we introduce a multimodal dataset and a task of matching user descriptions to vibration haptic signals, and we highlight two primary challenges: (1) the lack of large haptic vibration datasets annotated with textual descriptions, since collecting haptic descriptions is time-consuming, and (2) the limited capability of existing tasks and models to describe vibration signals in text. To advance this area, we create HapticCap, the first fully human-annotated haptic-captioned dataset, containing 92,070 haptic-text pairs of user descriptions covering the sensory, emotional, and associative attributes of vibrations. Based on HapticCap, we propose the haptic-caption retrieval task and report results on it from a supervised contrastive learning framework that brings vibration representations together with text representations within specific description categories. Overall, the combination of the language model T5 and the audio model AST yields the best performance on the haptic-caption retrieval task, especially when trained separately for each description category.
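To make the retrieval task concrete: a caption embedding queries the pool of vibration embeddings, candidates are ranked by cosine similarity, and a retrieval counts as correct if the true signal appears in the top k. The sketch below assumes exactly that protocol; the embeddings, similarity metric, and value of k are illustrative stand-ins, not the paper's evaluation code.

```python
# Hypothetical sketch: top-k accuracy for haptic-caption retrieval,
# where text_emb[i] and vib_emb[i] form a matched caption/signal pair.
import torch
import torch.nn.functional as F

def retrieval_topk_accuracy(text_emb, vib_emb, k=5):
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(vib_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                # (n, k) ranked candidates
    targets = torch.arange(sims.size(0)).unsqueeze(1)  # ground-truth indices
    return (topk == targets).any(dim=-1).float().mean().item()

# Stand-in embeddings for 100 caption/vibration pairs in a 512-d space.
emb_t, emb_v = torch.randn(100, 512), torch.randn(100, 512)
print(f"top-5 retrieval accuracy: {retrieval_topk_accuracy(emb_t, emb_v):.3f}")
```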
Problem

Research questions and friction points this paper is trying to address.

Lack of large haptic vibration datasets annotated with text descriptions
Limited capability of existing models to describe vibration signals in text
No established task for matching user descriptions to haptic signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

HapticCap, the first fully human-annotated haptic-captioned dataset (92,070 haptic-text pairs)
Introduction of the haptic-caption retrieval task
T5 + AST encoder combination yields the best retrieval performance (see the loading sketch below)
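To illustrate the T5 + AST pairing, the sketch below loads publicly available checkpoints via Hugging Face transformers and produces one embedding per modality. The checkpoint names, mean-pooling on the text side, and the choice to feed the vibration waveform through the audio feature extractor are assumptions for illustration; the paper may use different checkpoints, pooling, or projection heads.

```python
# Hypothetical sketch: one text embedding from T5 and one vibration
# embedding from AST, using common public checkpoints.
import torch
from transformers import AutoTokenizer, T5EncoderModel, ASTFeatureExtractor, ASTModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
text_encoder = T5EncoderModel.from_pretrained("t5-base")
extractor = ASTFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
vib_encoder = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

# Text side: mean-pool the T5 encoder states into a sentence embedding.
tokens = tokenizer(["a sharp, rapid buzzing pulse"], return_tensors="pt")
text_emb = text_encoder(**tokens).last_hidden_state.mean(dim=1)  # (1, 768)

# Vibration side: treat a 1 s stand-in waveform like audio input for AST.
waveform = torch.randn(16000).numpy()
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
vib_emb = vib_encoder(**inputs).pooler_output                    # (1, 768)
```

A small learned projection head on each side would map both embeddings into the shared space used by the contrastive loss sketched earlier.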