🤖 AI Summary
This work addresses the high cost of acquiring aligned visuo-tactile data and the limitations of existing synthetic approaches, which are typically confined to a single modality and therefore insufficient for cross-modal learning. The authors propose MultiDiffSense, a unified diffusion model that enables controllable generation of images for multiple vision-based tactile sensors (ViTac, TacTip, and ViTacTip) within a single architecture. This is achieved through dual conditioning on pose-aligned depth maps derived from object CAD models and structured prompts encoding sensor type and 4-DoF contact pose. The method improves the physical consistency and cross-modal generalization of synthetic data, outperforming a Pix2Pix baseline in SSIM across all three sensor types. Applied to downstream 3-DoF pose estimation, it achieves competitive performance using only 50% of the real data, alleviating the data-acquisition bottleneck.
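The summary does not specify the prompt schema, so as a rough illustration only, a structured prompt encoding sensor type and 4-DoF contact pose might be assembled as in the sketch below. The `ContactPose` fields, units, and template string are all hypothetical, not the paper's actual format.

```python
from dataclasses import dataclass


@dataclass
class ContactPose:
    """4-DoF contact pose: planar position, press depth, and in-plane rotation.

    Units (mm, degrees) are assumptions for illustration.
    """
    x_mm: float
    y_mm: float
    z_mm: float
    theta_deg: float


def build_prompt(sensor: str, obj: str, pose: ContactPose) -> str:
    """Serialise sensor type and contact pose into a structured text prompt.

    The exact schema used by MultiDiffSense is not published here; this
    template is purely illustrative.
    """
    assert sensor in {"ViTac", "TacTip", "ViTacTip"}, "unknown sensor type"
    return (
        f"sensor: {sensor}; object: {obj}; "
        f"contact pose: x={pose.x_mm:.1f}mm, y={pose.y_mm:.1f}mm, "
        f"z={pose.z_mm:.1f}mm, theta={pose.theta_deg:.1f}deg"
    )


# Example: prompt for a TacTip press on a hypothetical CAD object.
print(build_prompt("TacTip", "hex_bolt", ContactPose(1.5, -0.8, 2.0, 30.0)))
```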
📝 Abstract
Acquiring aligned visuo-tactile datasets is slow and costly, requiring specialised hardware and large-scale data collection. Synthetic generation is promising, but prior methods are typically single-modality, limiting cross-modal learning. We present MultiDiffSense, a unified diffusion model that synthesises images for multiple vision-based tactile sensors (ViTac, TacTip, ViTacTip) within a single architecture. Our approach uses dual conditioning on CAD-derived, pose-aligned depth maps and structured prompts that encode sensor type and 4-DoF contact pose, enabling controllable, physically consistent multi-modal synthesis. Evaluated on 8 objects (5 seen, 3 novel) and unseen poses, MultiDiffSense outperforms a Pix2Pix cGAN baseline in SSIM by +36.3% (ViTac), +134.6% (ViTacTip), and +64.7% (TacTip). For downstream 3-DoF pose estimation, mixing 50% synthetic with 50% real data halves the required real data while maintaining competitive performance. MultiDiffSense alleviates the data-collection bottleneck in tactile sensing and enables scalable, controllable multi-modal dataset generation for robotic applications.
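As a hedged sketch of how the reported SSIM comparison against real sensor images could be computed, the snippet below uses `skimage.metrics.structural_similarity`; the pairing, resolution, and colour handling are assumptions, not the paper's exact evaluation pipeline.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim


def mean_ssim(real_batch: np.ndarray, synth_batch: np.ndarray) -> float:
    """Average SSIM over paired real and synthetic tactile images.

    Expects arrays of shape (N, H, W, 3) with values in [0, 1]; the
    pairing and preprocessing protocol are illustrative assumptions.
    """
    scores = [
        ssim(r, s, channel_axis=-1, data_range=1.0)
        for r, s in zip(real_batch, synth_batch)
    ]
    return float(np.mean(scores))


# Example with random stand-in images; a real evaluation would load
# captured sensor frames and their pose-matched synthetic counterparts.
rng = np.random.default_rng(0)
real = rng.random((4, 128, 128, 3))
synth = rng.random((4, 128, 128, 3))
print(f"mean SSIM: {mean_ssim(real, synth):.3f}")
```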