🤖 AI Summary
This work addresses the challenge of accurately inferring contact states in partially observable environments, where existing robotic manipulation methods often fail to leverage the rich interaction dynamics embedded in acoustic signals. To overcome this limitation, we propose a hierarchical multimodal fusion framework that departs from the conventional assumption of symmetric fusion across modalities. Our approach uses sparse, contact-triggered audio cues to guide the representation learning of the visual and proprioceptive modalities, explicitly modeling higher-order cross-modal interactions, and a diffusion-based policy then generates smooth, continuous actions from the fused representation. The framework is trained end to end to integrate complementary information across modalities, and a mutual information analysis interprets the contribution of the acoustic cues. Experiments on real-world tasks such as liquid pouring and cabinet door opening demonstrate significant performance gains over state-of-the-art methods, particularly in scenarios where visual input is limited and acoustic signals provide critical perceptual cues.
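The summary does not include implementation details, but the asymmetric, audio-first fusion it describes can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration: the module names, feature dimensions, the choice of sigmoid gating for the audio-conditioning stage, and the bilinear layer standing in for the higher-order cross-modal interactions.

```python
import torch
import torch.nn as nn

class AudioConditionedFusion(nn.Module):
    """Hypothetical sketch: sparse, contact-triggered audio features gate the
    visual and proprioceptive streams, then a bilinear stage models
    higher-order interactions between the conditioned streams."""

    def __init__(self, d_audio=128, d_vision=256, d_proprio=64, d_model=256):
        super().__init__()
        # Project each modality into a shared latent space.
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.vision_proj = nn.Linear(d_vision, d_model)
        self.proprio_proj = nn.Linear(d_proprio, d_model)
        # Stage 1: audio-derived gates condition the other two modalities.
        self.vision_gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        self.proprio_gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        # Stage 2: a bilinear map captures pairwise (higher-order) interactions.
        self.cross = nn.Bilinear(d_model, d_model, d_model)
        self.out = nn.Linear(3 * d_model, d_model)

    def forward(self, audio, vision, proprio):
        a = self.audio_proj(audio)      # (B, d_model)
        v = self.vision_proj(vision)
        p = self.proprio_proj(proprio)
        # Contact-triggered audio modulates what vision/proprioception expose.
        v_cond = v * self.vision_gate(a)
        p_cond = p * self.proprio_gate(a)
        # Explicit cross-modal interaction term between conditioned streams.
        vp = self.cross(v_cond, p_cond)
        return self.out(torch.cat([v_cond, p_cond, vp], dim=-1))
```

The structural point mirrored from the abstract is that fusion is hierarchical rather than flat: audio is applied as a condition before the vision-proprioception interaction is modeled, instead of concatenating all three modalities symmetrically.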
📝 Abstract
Existing robotic manipulation methods rely primarily on visual and proprioceptive observations, which may struggle to infer contact-related interaction states in partially observable real-world environments. Acoustic cues, by contrast, naturally encode rich interaction dynamics during contact, yet remain underexploited in the multimodal fusion literature. Most multimodal fusion approaches implicitly assume homogeneous roles across modalities and therefore adopt flat, symmetric fusion structures. This assumption, however, is ill-suited to acoustic signals, which are inherently sparse and contact-driven. To achieve precise robotic manipulation through acoustic-informed perception, we propose a hierarchical representation fusion framework that progressively integrates audio, vision, and proprioception. Our approach first conditions the visual and proprioceptive representations on acoustic cues, and then explicitly models higher-order cross-modal interactions to capture complementary dependencies among modalities. The fused representation is consumed by a diffusion-based policy that directly generates continuous robot actions from multimodal observations. The combination of end-to-end learning and a hierarchical fusion structure enables the policy to exploit task-relevant acoustic information while mitigating interference from less informative modalities. The proposed method is evaluated on real-world robotic manipulation tasks, including liquid pouring and cabinet door opening. Extensive experimental results demonstrate that our approach consistently outperforms state-of-the-art multimodal fusion frameworks, particularly in scenarios where acoustic cues provide task-relevant information not readily available from visual observations alone. Furthermore, a mutual information analysis is conducted to interpret the effect of acoustic cues on multimodal robotic manipulation.
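For the action-generation stage, the abstract specifies a diffusion-based policy that denoises continuous action trajectories conditioned on the fused observation. Below is a minimal DDPM-style sampling sketch in the spirit of diffusion policies; the epsilon-network architecture, noise schedule, horizon, and action dimensionality are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Illustrative epsilon-network: predicts the noise on a flattened action
    chunk, conditioned on the fused multimodal embedding and the timestep."""

    def __init__(self, action_dim=7, horizon=16, d_obs=256, d_hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + d_obs + 1, d_hidden),
            nn.Mish(),
            nn.Linear(d_hidden, d_hidden),
            nn.Mish(),
            nn.Linear(d_hidden, action_dim * horizon),
        )

    def forward(self, noisy_actions, t, obs):
        # The timestep is normalized to [0, 1] and appended as a scalar.
        x = torch.cat([noisy_actions, obs, t[:, None]], dim=-1)
        return self.net(x)

@torch.no_grad()
def ddpm_sample(eps_net, obs, action_dim=7, horizon=16, steps=100):
    """DDPM reverse process: start from Gaussian noise and iteratively
    denoise into a smooth, continuous action trajectory."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    a = torch.randn(obs.shape[0], action_dim * horizon)
    for i in reversed(range(steps)):
        t = torch.full((obs.shape[0],), i / steps)
        eps = eps_net(a, t, obs)
        # Posterior mean of the reverse step; noise is omitted at i == 0.
        a = (a - betas[i] / torch.sqrt(1.0 - alpha_bar[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            a = a + torch.sqrt(betas[i]) * torch.randn_like(a)
    return a.view(-1, horizon, action_dim)
```

Because the network denoises a whole action chunk rather than a single step, the sampled trajectories tend to be temporally smooth, which matches the abstract's emphasis on continuous action generation.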
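The abstract also mentions a mutual information analysis used to interpret the contribution of audio. The authors' estimator is not specified; one common choice for measuring how much audio information survives in a learned representation is the InfoNCE lower bound sketched below, which is an assumption on our part rather than the paper's method.

```python
import math
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(z_audio, z_fused, temperature=0.1):
    """InfoNCE lower bound on I(audio; fused representation).
    Row i of z_audio and z_fused are paired samples from the same timestep;
    all off-diagonal pairings act as negatives. Hypothetical analysis code."""
    z_a = F.normalize(z_audio, dim=-1)
    z_f = F.normalize(z_fused, dim=-1)
    logits = z_a @ z_f.t() / temperature      # (B, B) similarity matrix
    labels = torch.arange(z_a.shape[0])
    ce = F.cross_entropy(logits, labels)
    # I(A; Z) >= log(B) - CE; the bound saturates at log(batch size).
    return math.log(z_a.shape[0]) - ce
```

A rising bound during training would indicate that the fused representation retains more audio-specific information, which is the kind of evidence the abstract's interpretability claim would rest on.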