MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost of acquiring aligned visuo-tactile data and the limitations of existing synthetic approaches, which are typically confined to a single modality and thus insufficient for cross-modal learning. The authors propose a unified diffusion model that enables controllable generation of images for multiple vision-based tactile sensors (ViTac, TacTip, and ViTacTip) within a single architecture. This is achieved through dual conditioning on pose-aligned depth maps derived from object CAD models and on structured prompts encoding sensor type and 4-DoF contact pose. The method improves the physical consistency and cross-modal generalisation of synthetic data, outperforming a Pix2Pix baseline in SSIM. When applied to downstream 3-DoF pose estimation, it achieves comparable performance using only 50% of the real data, alleviating the data-acquisition bottleneck.

📝 Abstract
Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluated on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.
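The abstract describes structured prompts that encode sensor type and 4-DoF contact pose as one of the model's two conditioning signals. A minimal sketch of how such a prompt string might be assembled follows; the field names, units, and template format here are illustrative assumptions, not the paper's actual prompt scheme:

```python
def build_prompt(sensor: str, pose: tuple[float, float, float, float]) -> str:
    """Encode sensor type and a 4-DoF contact pose as a structured text
    prompt for conditioning a diffusion model.

    The (x, y, z, yaw) interpretation, units, and field names are
    hypothetical; the paper does not publish its exact prompt template.
    """
    # The three sensor modalities named in the paper.
    allowed = {"ViTac", "TacTip", "ViTacTip"}
    if sensor not in allowed:
        raise ValueError(f"unknown sensor type: {sensor!r}")
    x, y, z, yaw = pose
    return (f"sensor:{sensor}; "
            f"contact_pose: x={x:.1f}mm, y={y:.1f}mm, "
            f"depth={z:.1f}mm, yaw={yaw:.1f}deg")

print(build_prompt("TacTip", (1.5, -2.0, 0.8, 45.0)))
# → sensor:TacTip; contact_pose: x=1.5mm, y=-2.0mm, depth=0.8mm, yaw=45.0deg
```

In the paper's setup, a text encoding along these lines would be paired with the pose-aligned depth map rendered from the object's CAD model to form the dual conditioning input.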
Problem

Research questions and friction points this paper is trying to address.

visuo-tactile
multi-modal
synthetic data generation
data collection bottleneck
tactile sensing
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion model
multi-modal synthesis
visuo-tactile sensing
pose-conditioned generation
synthetic data generation
Sirine Bhouri
Department of Bioengineering, Imperial-X Initiative, Imperial College London, London, United Kingdom
Lan Wei
Department of Bioengineering, Imperial-X Initiative, Imperial College London, London, United Kingdom
Jian-Qing Zheng
University of Oxford
Biomedical Data Analysis · Medical Image Computing · Image-Guided Interventions · AI for Biomedicine
Dandan Zhang
Imperial College London
Robotics · AI