TextToucher: Fine-Grained Text-to-Touch Generation

📅 2024-09-09
🏛️ arXiv.org
📈 Citations: 3 · Influential: 0
🤖 AI Summary
Existing tactile image generation methods rely heavily on visual inputs, which leads to inaccurate modeling of fine-grained tactile features such as surface texture, object geometry, and the deformation state of the sensor gel. This work introduces the first text-to-tactile-image generation framework that operates without visual priors. Methodologically, the authors (1) propose a dual-granularity textual modeling and fusion mechanism covering object-level information (texture, shape) and sensor-level information (gel deformation); (2) inject the dual-granularity conditions via a multimodal large language model, a set of learnable text prompts, and a diffusion Transformer architecture; and (3) design a Contrastive Text-Touch Pre-training (CTTP) metric for evaluation. Extensive experiments across multiple tactile datasets demonstrate significant improvements over vision-driven baselines, with generated tactile images exhibiting higher fidelity to human tactile perception. The code is publicly available.
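
The conditioning scheme lends itself to a compact sketch. The snippet below shows one plausible reading of the dual-granularity injection: object-level caption tokens (a text encoder's output over the MLLM-generated description) are combined with learnable sensor-level prompt tokens before entering the diffusion Transformer's conditioning path. All module names, tensor shapes, and the choice of a discrete gel-state index are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of dual-granularity text conditioning.
# Names, shapes, and the discrete gel-state index are assumptions.
import torch
import torch.nn as nn

class DualGrainCondition(nn.Module):
    """Fuses object-level caption tokens with learnable sensor-level prompts."""
    def __init__(self, embed_dim=768, num_sensor_prompts=8, num_gel_states=4):
        super().__init__()
        # One learnable prompt set per discrete gel-deformation state (assumed).
        self.sensor_prompts = nn.Parameter(
            torch.randn(num_gel_states, num_sensor_prompts, embed_dim) * 0.02
        )
        self.fuse = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, object_tokens, gel_state):
        # object_tokens: (B, L, D) text-encoder output for the MLLM caption
        # gel_state:     (B,) integer id of the sensor gel status
        sensor_tokens = self.sensor_prompts[gel_state]          # (B, P, D)
        # Sensor prompts attend to the object-level tokens, then concatenate,
        # yielding one conditioning sequence for the diffusion Transformer.
        fused, _ = self.fuse(sensor_tokens, object_tokens, object_tokens)
        return torch.cat([object_tokens, fused], dim=1)         # (B, L+P, D)

cond = DualGrainCondition()
tokens = cond(torch.randn(2, 77, 768), torch.tensor([0, 3]))
print(tokens.shape)  # torch.Size([2, 85, 768])
```

The concatenated sequence can then be consumed by any standard DiT conditioning mechanism (e.g., cross-attention); the paper reports exploring several such injection variants.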

📝 Abstract
Tactile sensation plays a crucial role in the development of multi-modal large models and embodied intelligence. To collect tactile data at minimal cost, a series of studies have attempted to generate tactile images via vision-to-touch image translation. However, compared with the text modality, visually driven tactile generation cannot accurately depict human tactile sensation. In this work, we analyze the characteristics of tactile images in detail at two granularities: object-level (tactile texture, tactile shape) and sensor-level (gel status). We model these granularities of information through text descriptions and propose a fine-grained Text-to-Touch generation method (TextToucher) to generate high-quality tactile samples. Specifically, we introduce a multimodal large language model to build the text sentences about object-level tactile information and employ a set of learnable text prompts to represent the sensor-level tactile information. To better guide the tactile generation process with the built text information, we fuse the two granularities of text information and explore various dual-grain text conditioning methods within the diffusion Transformer architecture. Furthermore, we propose a Contrastive Text-Touch Pre-training (CTTP) metric to precisely evaluate the quality of text-driven generated tactile data. Extensive experiments demonstrate the superiority of our TextToucher method. The source code will be available at https://github.com/TtuHamg/TextToucher.
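
For context, CTTP is described as a contrastive text-touch pre-training metric; a standard way to realize such an objective is a symmetric, CLIP-style InfoNCE loss over paired text and tactile embeddings. The sketch below assumes that formulation; the function name, temperature, and embedding dimensions are illustrative and not taken from the paper.

```python
# Minimal sketch of a CLIP-style contrastive objective, assuming CTTP
# follows the usual symmetric InfoNCE formulation; encoders are omitted.
import torch
import torch.nn.functional as F

def cttp_contrastive_loss(text_emb, touch_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text / tactile embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)
    touch_emb = F.normalize(touch_emb, dim=-1)
    logits = text_emb @ touch_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Matched pairs lie on the diagonal; push them above all mismatched pairs.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = cttp_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

Once such encoders are trained, an evaluation score for a generated tactile image can be derived from its cosine similarity to the conditioning text in the shared embedding space.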
Problem

Research questions and friction points this paper is trying to address.

Visual-to-Tactile Sensing
Texture Recognition
Shape Perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

TextToucher
CTTP metric
Tactile data generation
👥 Authors

Jiahang Tu
PhD Student, Zhejiang University
Computer Vision · Generative Models

Hao Fu
College of Computer Science and Technology, Zhejiang University

Fengyu Yang
Yale University

Han Zhao
College of Computer Science and Technology, Zhejiang University; Advanced Technology Institute, Zhejiang University

Chao Zhang
College of Computer Science and Technology, Zhejiang University; Advanced Technology Institute, Zhejiang University

Hui Qian
College of Computer Science and Technology, Zhejiang University
Artificial Intelligence