X-Capture: An Open-Source Portable Device for Multi-Sensory Learning

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing cross-modal datasets are largely confined to simulated or controlled environments and lack high-diversity, strongly time-aligned multimodal data from real-world scenarios, which limits cross-sensory understanding in AI and robotic systems. To address this, we propose an open-source, portable multimodal sensing device built on a low-cost (under $1,000), reproducible hardware architecture that enables synchronized acquisition of RGB-D images, flexible-array tactile signals, and impact audio with microsecond-precision timestamp alignment. The system integrates consumer-grade sensors and an embedded triggering mechanism, with cross-modal calibration and synchronization implemented via ROS/Python. Leveraging this platform, we introduce the first large-scale, real-world multisensory dataset targeting everyday objects, comprising 500 object categories and 3,000 annotated samples. Experiments demonstrate substantial improvements in both pretraining and fine-tuning performance on cross-modal retrieval and reconstruction tasks.
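The summary above notes that the streams are paired via microsecond-precision timestamp alignment. The paper's actual synchronization code is not reproduced here; as an illustrative sketch, a common approach is to match each sample in a reference stream to the nearest-in-time sample of another stream, within a tolerance (the function name `align_streams` and the `tol_us` parameter are hypothetical, not from the paper):

```python
import bisect

def align_streams(ref_stamps, other_stamps, tol_us=500):
    """Pair each reference timestamp (microseconds, sorted ascending)
    with the nearest timestamp in another sorted stream, keeping only
    pairs that fall within the tolerance.
    Returns a list of (ref_index, other_index) pairs."""
    pairs = []
    for i, t in enumerate(ref_stamps):
        j = bisect.bisect_left(other_stamps, t)
        # The nearest stamp is one of the two neighbors of the insertion point.
        best = None
        for k in (j - 1, j):
            if 0 <= k < len(other_stamps):
                if best is None or abs(other_stamps[k] - t) < abs(other_stamps[best] - t):
                    best = k
        if best is not None and abs(other_stamps[best] - t) <= tol_us:
            pairs.append((i, best))
    return pairs

# Example: camera frames at ~30 Hz matched against audio block stamps.
cam = [0, 33_333, 66_666, 100_000]
aud = [120, 33_200, 66_900, 99_950, 133_300]
print(align_streams(cam, aud))  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

In practice a hardware trigger (as the device uses) bounds the clock offset between sensors, so a tight tolerance like this suffices; purely software-timestamped streams would need a larger tolerance or explicit clock-offset estimation.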

📝 Abstract
Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset of 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and variety. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for both pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.
Problem

Research questions and friction points this paper is trying to address.

Lack of diverse real-world multi-sensory datasets for AI
High cost and complexity of current data collection methods
Limited modality pairings in existing sensory datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source portable multi-sensory data collector
Captures RGBD images, tactile readings, impact audio
Cost-effective under $1,000 with consumer-grade tools
Samuel Clarke
Stanford University
Suzannah Wistreich
Stanford University
Yanjie Ze
Stanford University
Robotics · Embodied AI · Humanoid Robots
Jiajun Wu
Stanford University