🤖 AI Summary
This work addresses the high cost of acquiring aligned visuo-tactile data and the limitations of existing synthetic approaches, which are typically confined to a single modality and therefore insufficient for cross-modal learning. The authors propose MultiDiffSense, a unified diffusion model that enables controllable generation of images for multiple vision-based tactile sensors (ViTac, TacTip, and ViTacTip) within a single architecture. This is achieved through dual conditioning on pose-aligned depth maps derived from object CAD models and structured prompts encoding sensor type and 4-DoF contact pose. The method improves the physical consistency and cross-modal generalization of synthetic data, outperforming a Pix2Pix baseline in SSIM across all three sensor types. Applied to downstream 3-DoF pose estimation, it achieves competitive performance using only 50% of the real data, alleviating the data-acquisition bottleneck.
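The summary does not specify the prompt schema, so as a rough illustration only, a structured prompt encoding sensor type and 4-DoF contact pose might be assembled as in the sketch below. The `ContactPose` fields, units, and template string are all hypothetical, not the paper's actual format.

```python
from dataclasses import dataclass


@dataclass
class ContactPose:
    """4-DoF contact pose: planar position, press depth, and in-plane rotation.

    Units (mm, degrees) are assumptions for illustration.
    """
    x_mm: float
    y_mm: float
    z_mm: float
    theta_deg: float


def build_prompt(sensor: str, obj: str, pose: ContactPose) -> str:
    """Serialise sensor type and contact pose into a structured text prompt.

    The exact schema used by MultiDiffSense is not published here; this
    template is purely illustrative.
    """
    assert sensor in {"ViTac", "TacTip", "ViTacTip"}, "unknown sensor type"
    return (
        f"sensor: {sensor}; object: {obj}; "
        f"contact pose: x={pose.x_mm:.1f}mm, y={pose.y_mm:.1f}mm, "
        f"z={pose.z_mm:.1f}mm, theta={pose.theta_deg:.1f}deg"
    )


# Example: prompt for a TacTip press on a hypothetical CAD object.
print(build_prompt("TacTip", "hex_bolt", ContactPose(1.5, -0.8, 2.0, 30.0)))
```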
📝 Abstract
Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluated on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real data halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.
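As a hedged sketch of how the reported SSIM comparison against real sensor images could be computed, the snippet below uses `skimage.metrics.structural_similarity`; the pairing, resolution, and colour handling are assumptions, not the paper's exact evaluation pipeline.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim


def mean_ssim(real_batch: np.ndarray, synth_batch: np.ndarray) -> float:
    """Average SSIM over paired real and synthetic tactile images.

    Expects arrays of shape (N, H, W, 3) with values in [0, 1]; the
    pairing and preprocessing protocol are illustrative assumptions.
    """
    scores = [
        ssim(r, s, channel_axis=-1, data_range=1.0)
        for r, s in zip(real_batch, synth_batch)
    ]
    return float(np.mean(scores))


# Example with random stand-in images; a real evaluation would load
# captured sensor frames and their pose-matched synthetic counterparts.
rng = np.random.default_rng(0)
real = rng.random((4, 128, 128, 3))
synth = rng.random((4, 128, 128, 3))
print(f"mean SSIM: {mean_ssim(real, synth):.3f}")
```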