HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current haptic vibration signal design faces two key bottlenecks: the absence of large-scale, text-annotated haptic datasets and insufficient cross-modal representation alignment. To address these, we introduce HapticCap—the first fully human-annotated haptic–text dataset, comprising 92,070 haptic–text pairs describing the sensory, emotional, and associative attributes of vibration signals—and formally define the haptic–caption retrieval task. We propose a supervised contrastive learning framework for multimodal alignment that integrates the T5 language model and the Audio Spectrogram Transformer (AST), with training stratified by description category. Experiments show that the T5+AST combination outperforms baseline model pairings on cross-modal retrieval, particularly when trained separately for each description category. This work provides both foundational data infrastructure and a methodological basis for interpretable, data-driven haptic design.
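The summary's core method is a supervised contrastive framework that pulls matched caption and vibration embeddings together in a shared space. Below is a minimal, hypothetical sketch of a symmetric InfoNCE-style alignment loss over paired text and vibration embeddings; the function name, projection dimension, and CLIP-style formulation are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch: symmetric contrastive alignment of text and
# vibration embeddings. Row i of each tensor describes the same signal.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, vib_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    vib_emb = F.normalize(vib_emb, dim=-1)
    logits = text_emb @ vib_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0))         # matched pairs on the diagonal
    # Pull matched pairs together and push mismatches apart, in both directions.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2v + loss_v2t) / 2

# Stand-in embeddings: a batch of 8 pairs in a 512-d shared space,
# e.g. projected T5 sentence embeddings and AST spectrogram embeddings.
text_emb, vib_emb = torch.randn(8, 512), torch.randn(8, 512)
print(contrastive_alignment_loss(text_emb, vib_emb))
```

Training such a loss separately per description category (sensory, emotional, associative) would match the summary's note that category-stratified training performs best.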

📝 Abstract
Haptic signals, from smartphone vibrations to virtual reality touch feedback, can effectively convey information and enhance realism, but designing signals that resonate meaningfully with users is challenging. To facilitate this, we introduce a multimodal dataset and a task of matching user descriptions to vibration haptic signals, and we highlight two primary challenges: (1) the lack of large haptic vibration datasets annotated with textual descriptions, since collecting haptic descriptions is time-consuming, and (2) the limited capability of existing tasks and models to describe vibration signals in text. To advance this area, we create HapticCap, the first fully human-annotated haptic-captioned dataset, containing 92,070 haptic-text pairs of user descriptions covering the sensory, emotional, and associative attributes of vibrations. Based on HapticCap, we propose the haptic-caption retrieval task and report results on it from a supervised contrastive learning framework that brings vibration representations together with text representations within specific description categories. Overall, the combination of the language model T5 and the audio model AST yields the best performance on the haptic-caption retrieval task, especially when trained separately for each description category.
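To make the retrieval task concrete: a caption embedding queries the pool of vibration embeddings, candidates are ranked by cosine similarity, and a retrieval counts as correct if the true signal appears in the top k. The sketch below assumes exactly that protocol; the embeddings, similarity metric, and value of k are illustrative stand-ins, not the paper's evaluation code.

```python
# Hypothetical sketch: top-k accuracy for haptic-caption retrieval,
# where text_emb[i] and vib_emb[i] form a matched caption/signal pair.
import torch
import torch.nn.functional as F

def retrieval_topk_accuracy(text_emb, vib_emb, k=5):
    sims = F.normalize(text_emb, dim=-1) @ F.normalize(vib_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                # (n, k) ranked candidates
    targets = torch.arange(sims.size(0)).unsqueeze(1)  # ground-truth indices
    return (topk == targets).any(dim=-1).float().mean().item()

# Stand-in embeddings for 100 caption/vibration pairs in a 512-d space.
emb_t, emb_v = torch.randn(100, 512), torch.randn(100, 512)
print(f"top-5 retrieval accuracy: {retrieval_topk_accuracy(emb_t, emb_v):.3f}")
```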
Problem

Research questions and friction points this paper is trying to address.

Lack of large haptic vibration datasets annotated with text descriptions
Limited capability of existing models to describe vibration signals in text
No established task for matching user descriptions to haptic signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

HapticCap, the first fully human-annotated haptic-captioned dataset (92,070 haptic-text pairs)
Introduction of the haptic-caption retrieval task
T5 + AST encoder combination yields the best retrieval performance (see the loading sketch below)
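To illustrate the T5 + AST pairing, the sketch below loads publicly available checkpoints via Hugging Face transformers and produces one embedding per modality. The checkpoint names, mean-pooling on the text side, and the choice to feed the vibration waveform through the audio feature extractor are assumptions for illustration; the paper may use different checkpoints, pooling, or projection heads.

```python
# Hypothetical sketch: one text embedding from T5 and one vibration
# embedding from AST, using common public checkpoints.
import torch
from transformers import AutoTokenizer, T5EncoderModel, ASTFeatureExtractor, ASTModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
text_encoder = T5EncoderModel.from_pretrained("t5-base")
extractor = ASTFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
vib_encoder = ASTModel.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

# Text side: mean-pool the T5 encoder states into a sentence embedding.
tokens = tokenizer(["a sharp, rapid buzzing pulse"], return_tensors="pt")
text_emb = text_encoder(**tokens).last_hidden_state.mean(dim=1)  # (1, 768)

# Vibration side: treat a 1 s stand-in waveform like audio input for AST.
waveform = torch.randn(16000).numpy()
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
vib_emb = vib_encoder(**inputs).pooler_output                    # (1, 768)
```

A small learned projection head on each side would map both embeddings into the shared space used by the contrastive loss sketched earlier.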