CAVER: Curious Audiovisual Exploring Robot

📅 2025-11-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses two core challenges in robotic multimodal perception: material identification and audio-driven manipulation imitation (e.g., playing melodies from auditory cues). To this end, we propose a 3D-printed end-effector capable of eliciting controllable acoustic responses; design a self-supervised audio-visual representation learning framework integrating local and global features; and introduce a curiosity-driven, uncertainty-guided active exploration strategy implemented on parallel-jaw gripper hardware for efficient multimodal interaction. Experimental results demonstrate significant improvements in representation learning efficiency across diverse scenarios: material classification accuracy increases by 12.7% on average over baselines, and audio-instruction imitation success rate improves by 23.4%. To our knowledge, this work is the first to achieve robot audio-visual synesthesia modeling and cross-modal behavioral transfer via active exploration.

📝 Abstract
Multimodal audiovisual perception can enable new avenues for robotic manipulation, from better material classification to the imitation of demonstrations for which only audio signals are available (e.g., playing a tune by ear). However, to unlock such multimodal potential, robots need to learn the correlations between an object's visual appearance and the sound it generates when they interact with it. Such an active sensorimotor experience requires new interaction capabilities, representations, and exploration methods to guide the robot in efficiently building increasingly rich audiovisual knowledge. In this work, we present CAVER, a novel robot that builds and utilizes rich audiovisual representations of objects. CAVER includes three novel contributions: 1) a novel 3D-printed end-effector, attachable to parallel grippers, that excites objects' audio responses, 2) an audiovisual representation that combines local and global appearance information with sound features, and 3) an exploration algorithm that uses and builds the audiovisual representation in a curiosity-driven manner, prioritizing interactions with high-uncertainty objects to obtain good coverage of surprising audio with fewer interactions. We demonstrate that CAVER builds rich representations in different scenarios more efficiently than several exploration baselines, and that the learned audiovisual representation leads to significant improvements in material classification and the imitation of audio-only human demonstrations. https://caver-bot.github.io/
Problem

Research questions and friction points this paper is trying to address.

Robots need to learn correlations between visual appearance and interaction sounds
Requires new interaction capabilities and exploration methods for audiovisual knowledge
Efficiently building rich multimodal representations for material classification tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel 3D printed end-effector for audio excitation
Audiovisual representation combining appearance and sound features
Curiosity-driven exploration algorithm prioritizing high uncertainty objects
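The curiosity-driven exploration idea can be sketched as uncertainty-guided object selection: the robot interacts first with the object whose predicted audio response its current model is least sure about. The snippet below is a minimal illustrative sketch, not the paper's implementation; it assumes uncertainty is measured as disagreement (variance) across a small ensemble of predictors, and the object names and scalar predictions are made up for the example.

```python
from statistics import pvariance

def select_next_object(ensemble_preds):
    """Pick the next object to interact with, curiosity-style.

    ensemble_preds: {object_id: [scalar audio-response predictions,
                                 one per ensemble member]}.
    Returns the object whose ensemble members disagree most
    (highest population variance), i.e., the most "surprising" one.
    """
    return max(ensemble_preds, key=lambda obj: pvariance(ensemble_preds[obj]))

# Hypothetical predictions: the ensemble agrees on the mug,
# but disagrees strongly on the bowl, so the bowl is explored first.
preds = {
    "mug":   [0.90, 0.91, 0.89],
    "bowl":  [0.20, 0.80, 0.50],
    "plate": [0.40, 0.45, 0.50],
}
print(select_next_object(preds))  # -> "bowl"
```

After each interaction, the recorded audio would be used to update the predictors, shrinking the uncertainty for that object and steering subsequent interactions elsewhere; this is how such a strategy covers surprising audio with fewer interactions.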
Authors
Luca Macesanu
The University of Texas at Austin
Boueny Folefack
The University of Texas at Austin
Samik Singh
The University of Texas at Austin
Ruchira Ray
The University of Texas at Austin
Ben Abbatematteo
The University of Texas at Austin
Roberto Martín-Martín
The University of Texas at Austin

Artificial Intelligence · Machine Learning · Robotics