Clink! Chop! Thud! -- Learning Object Sounds from Real-World Interactions

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces the task of *sounding object detection*: identifying the objects in an egocentric video that are directly involved in a sound-producing interaction (e.g., a spoon dropping onto a hardwood floor versus a carpeted one). To handle the fine-grained audio-object alignment this requires, the authors build an automatic mask-generation pipeline that steers training toward the image regions most informative about the interaction, and pair it with a slot attention visual encoder that enforces an object-centric prior. The resulting multimodal, object-aware framework learns from in-the-wild egocentric videos and links each sound to the specific objects that produced it. Evaluated on the new sounding object detection task and on existing multimodal action understanding benchmarks, the approach achieves state-of-the-art performance, supporting the task formulation, the proposed architecture, and generalization across diverse audio-visual scenarios.
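
To make the alignment idea concrete, here is a minimal sketch (not the authors' released code) of how segmentation masks can steer a contrastive audio-object objective: frame features are pooled inside the object mask and matched to the clip's audio embedding with a symmetric InfoNCE loss. The feature shapes, the pooling scheme, and the temperature are assumptions for illustration.

```python
# Minimal sketch (not the paper's code): mask-guided pooling of frame features
# plus a symmetric InfoNCE loss aligning audio clips with the objects involved.
# Feature extractors, tensor shapes, and the temperature are assumptions.
import torch
import torch.nn.functional as F

def masked_pool(frame_feats, obj_mask):
    """Average spatial features inside the (downsampled) object mask.

    frame_feats: (B, C, H, W) visual feature map
    obj_mask:    (B, 1, H, W) binary mask of the interacting object(s)
    returns:     (B, C) object-centric visual embedding
    """
    weights = obj_mask / obj_mask.sum(dim=(2, 3), keepdim=True).clamp(min=1e-6)
    return (frame_feats * weights).sum(dim=(2, 3))

def audio_object_nce(audio_emb, obj_emb, temperature=0.07):
    """Symmetric contrastive loss: each audio clip should match the visual
    embedding of the object it was produced by, and vice versa."""
    a = F.normalize(audio_emb, dim=-1)          # (B, D)
    v = F.normalize(obj_emb, dim=-1)            # (B, D)
    logits = a @ v.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```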

📝 Abstract
Can a model distinguish between the sound of a spoon hitting a hardwood floor and a carpeted one? Everyday object interactions produce sounds unique to the objects involved. We introduce the sounding object detection task to evaluate a model's ability to link these sounds to the objects directly involved. Inspired by human perception, our multimodal object-aware framework learns from in-the-wild egocentric videos. To encourage an object-centric approach, we first develop an automatic pipeline to compute segmentation masks of the objects involved to guide the model's focus during training towards the most informative regions of the interaction. A slot attention visual encoder is used to further enforce an object prior. We demonstrate state-of-the-art performance on our new task as well as on existing multimodal action understanding tasks.
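
The abstract mentions an automatic pipeline that computes masks of the interacting objects but does not spell out its components here; the sketch below assumes a generic two-stage recipe in which a detector localizes the active object and a promptable segmenter turns the box into a mask. `detect_active_object` and `segment_from_box` are hypothetical placeholders, not APIs from the paper or any specific library.

```python
# Hypothetical sketch of an automatic mask-generation pipeline. The helpers
# detect_active_object / segment_from_box are placeholders standing in for
# whatever off-the-shelf detector and segmenter a real implementation would use.
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class FrameMask:
    frame_idx: int
    mask: Optional[np.ndarray]   # (H, W) boolean mask, or None if nothing detected

def generate_object_masks(frames: List[np.ndarray],
                          detect_active_object,
                          segment_from_box) -> List[FrameMask]:
    """For each frame, localize the object currently being interacted with and
    convert its box into a pixel mask used to guide training."""
    masks = []
    for i, frame in enumerate(frames):
        box = detect_active_object(frame)        # e.g. a hand-object interaction detector
        mask = segment_from_box(frame, box) if box is not None else None
        masks.append(FrameMask(frame_idx=i, mask=mask))
    return masks
```
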
Problem

Research questions and friction points this paper is trying to address.

Distinguishing the sounds produced by different objects and surfaces in real-world interactions
Linking interaction sounds to the specific objects involved
Developing object-aware models for multimodal sound understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal object-aware framework learns from in-the-wild egocentric videos
Automatic pipeline computes segmentation masks of the interacting objects to guide training
Slot attention visual encoder enforces an object-centric prior (see the sketch below)
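
For reference, a minimal slot attention module in the style of Locatello et al. (2020) is sketched below; the slot count, feature dimension, and the way it plugs into the paper's visual encoder are assumptions rather than the authors' configuration.

```python
# Minimal slot attention module (after Locatello et al., 2020), shown only to
# illustrate the "object prior" idea above; hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    def __init__(self, num_slots=7, dim=64, iters=3, eps=1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        # Learned Gaussian initialization for the slots
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_log_sigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)

    def forward(self, inputs):                     # inputs: (B, N, dim) image tokens
        b = inputs.size(0)
        inputs = self.norm_in(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        slots = self.slots_mu + self.slots_log_sigma.exp() * torch.randn(
            b, self.num_slots, inputs.size(-1), device=inputs.device)
        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)  # slots compete for inputs
            attn = attn / (attn.sum(dim=-1, keepdim=True) + self.eps)    # weighted mean over inputs
            updates = attn @ v                                           # (B, S, dim)
            slots = self.gru(updates.reshape(-1, updates.size(-1)),
                             slots_prev.reshape(-1, updates.size(-1))).view_as(slots)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots  # (B, num_slots, dim) object-centric slot features
```
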
🔎 Similar Papers
No similar papers found.