🤖 AI Summary
Robots struggle to visually perceive extrinsic contacts, i.e., physical interactions between a grasped object and its environment, due to occlusion, low visual resolution, and ambiguity in near-contact configurations. To address this, we propose the first vision-audio collaborative paradigm for extrinsic contact perception: it jointly leverages global visual scene understanding and locally acquired contact-specific audio signals via active acoustic sensing, enabling precise contact localization and size estimation under occlusion and clutter. A key innovation is real-to-sim audio hallucination, a technique that synthesizes physically grounded contact audio in simulation from real-world recordings, enabling zero-shot sim-to-real transfer without fine-tuning on real data. Our hardware integrates contact microphones and conduction-based speakers, coupled with a multimodal perception network and physics-informed acoustic simulation. Experiments demonstrate substantial improvements in task success rates and policy learning efficiency for contact-intensive manipulation tasks, including pushing, sliding, and insertion.
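To make the vision-audio collaboration concrete, below is a minimal sketch of how a multimodal perception network could fuse a global scene image with a spectrogram of the actively emitted and received acoustic signal to regress contact location and size. The encoder layouts, feature dimensions, and output heads are illustrative assumptions for this sketch, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class VisionAudioContactNet(nn.Module):
    """Sketch of a vision-audio fusion model for extrinsic contact estimation.
    Encoders, feature sizes, and heads are illustrative assumptions."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Vision branch: encodes a global scene image (e.g., RGB view of the gripper and environment).
        self.vision_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Audio branch: encodes a spectrogram of the actively emitted/received signal.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Fusion head: predicts contact location (x, y, z) and a scalar contact size.
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 4),
        )

    def forward(self, image, spectrogram):
        v = self.vision_enc(image)           # (B, feat_dim) global visual features
        a = self.audio_enc(spectrogram)      # (B, feat_dim) local contact-specific audio features
        out = self.head(torch.cat([v, a], dim=-1))
        loc, size = out[:, :3], out[:, 3:]   # contact position and contact extent
        return loc, size


# Usage with dummy tensors: a batch of 2 RGB images and log-mel spectrograms.
model = VisionAudioContactNet()
loc, size = model(torch.randn(2, 3, 128, 128), torch.randn(2, 1, 64, 64))
print(loc.shape, size.shape)  # torch.Size([2, 3]) torch.Size([2, 1])
```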
📝 Abstract
Robust manipulation often hinges on a robot's ability to perceive extrinsic contacts, i.e., contacts between a grasped object and its surrounding environment. However, these contacts are difficult to observe through vision alone due to occlusions, limited resolution, and ambiguous near-contact states. In this paper, we propose a visual-auditory method for extrinsic contact estimation that integrates global scene information from vision with local contact cues obtained through active audio sensing. Our approach equips a robotic gripper with contact microphones and conduction speakers, enabling the system to emit and receive acoustic signals through the grasped object to detect external contacts. We train our perception pipeline entirely in simulation and transfer it zero-shot to the real world. To bridge the sim-to-real gap, we introduce a real-to-sim audio hallucination technique that injects real-world audio samples into simulated scenes with ground-truth contact labels. The resulting multimodal model accurately estimates both the location and size of extrinsic contacts across a range of cluttered and occluded scenarios. We further demonstrate that explicit contact prediction significantly improves policy learning for downstream contact-rich manipulation tasks.
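As a rough illustration of the real-to-sim audio hallucination idea, the sketch below pairs real contact (and no-contact) recordings with simulated scenes that carry ground-truth contact labels, applying light gain and time-shift perturbations. The data layout, clip bank, and perturbations are assumptions made for this example and are not the authors' implementation.

```python
import numpy as np

def hallucinate_audio_sample(real_clips, sim_scene, rng=None):
    """Sketch of real-to-sim audio hallucination: attach a real recording to a
    simulated scene with ground-truth contact labels. The clip bank, scene
    dictionary keys, and perturbations are illustrative assumptions."""
    rng = rng or np.random.default_rng()

    if sim_scene["in_contact"]:
        # Pick a real recording of a contact event and perturb it lightly
        # (gain and small time shift) so the labels see varied audio.
        clip = real_clips["contact"][rng.integers(len(real_clips["contact"]))]
        audio = np.roll(clip * rng.uniform(0.8, 1.2), rng.integers(-100, 100))
    else:
        # No extrinsic contact: use a real free-space / background recording.
        clip = real_clips["no_contact"][rng.integers(len(real_clips["no_contact"]))]
        audio = clip * rng.uniform(0.8, 1.2)

    return {
        "audio": audio.astype(np.float32),        # hallucinated waveform
        "image": sim_scene["image"],              # rendered observation
        "contact_pos": sim_scene["contact_pos"],  # ground-truth location
        "contact_size": sim_scene["contact_size"] # ground-truth extent
    }


# Usage with dummy data: two banks of 1-second recordings at 16 kHz.
real_clips = {
    "contact": [np.random.randn(16000) for _ in range(4)],
    "no_contact": [np.random.randn(16000) for _ in range(4)],
}
sim_scene = {
    "in_contact": True,
    "image": np.zeros((128, 128, 3), dtype=np.float32),
    "contact_pos": np.array([0.02, -0.01, 0.0]),
    "contact_size": 0.01,
}
sample = hallucinate_audio_sample(real_clips, sim_scene)
print(sample["audio"].shape, sample["contact_pos"])
```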