🤖 AI Summary
Existing cross-modal datasets are largely confined to simulated or controlled environments and lack diverse, tightly time-aligned multimodal data from real-world scenarios, limiting cross-sensory understanding in AI and robotic systems. To address this, we propose X-Capture, an open-source, portable multimodal sensing device built on a low-cost (<$1,000), reproducible hardware architecture that enables synchronized acquisition of RGB-D images, flexible-array tactile signals, and impact audio with microsecond-precision timestamp alignment. The system integrates consumer-grade sensors and an embedded triggering mechanism, with cross-modal calibration and synchronization implemented via ROS/Python. Leveraging this platform, we curate a large-scale, real-world multisensory dataset of everyday objects, comprising 500 objects and 3,000 data points. Experiments demonstrate substantial improvements in both pretraining and fine-tuning performance on cross-modal retrieval and reconstruction tasks.
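The summary mentions timestamp-based synchronization of the RGB-D, tactile, and audio streams, but includes no code. As a rough illustration only, the sketch below shows one common way such alignment can be done: matching each reference-stream timestamp to the nearest timestamp in another stream within a tolerance. The function name, timestamp format (integer microseconds), and tolerance value are hypothetical and not taken from the paper.

```python
import bisect

def align_nearest(ref_stamps, other_stamps, tol_us=1000):
    """Pair each reference timestamp with the nearest timestamp in
    another (sorted) stream, keeping only pairs within tol_us microseconds.
    Illustrative helper; not from the X-Capture codebase."""
    pairs = []
    for t in ref_stamps:
        i = bisect.bisect_left(other_stamps, t)
        # Candidates are the neighbors straddling the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other_stamps)]
        best = min(candidates, key=lambda j: abs(other_stamps[j] - t))
        if abs(other_stamps[best] - t) <= tol_us:
            pairs.append((t, other_stamps[best]))
    return pairs

# Example: ~30 Hz camera frames vs. audio chunk timestamps, in microseconds.
cam = [0, 33_333, 66_667, 100_000]
aud = [10, 33_300, 66_700, 99_950, 133_000]
print(align_nearest(cam, aud, tol_us=500))
# → [(0, 10), (33333, 33300), (66667, 66700), (100000, 99950)]
```

In practice a ROS-based system would typically use `message_filters.ApproximateTimeSynchronizer` for this; the sketch just makes the nearest-neighbor matching idea concrete.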
📝 Abstract
Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGB-D images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.