🤖 AI Summary
This paper introduces “collision sound source segmentation”: localizing and segmenting the object responsible for a collision sound in first-person videos, conditioned on the audio. The task is challenging due to multi-object interactions, cluttered scenes, and the transient nature of collisions. To address it, the authors propose the first audio-guided, weakly supervised framework, integrating CLIP for cross-modal alignment, SAM2 for precise mask generation, and explicit modeling of first-person priors, particularly hand-object interactions. Two new benchmarks, EPIC-CS3 and Ego4D-CS3, are introduced for evaluation, on which the method achieves mIoU scores 3× and 4.7× higher than strong baselines, significantly improving both localization and segmentation of collision sound sources. This work is the first to formally define and address the audio-driven visual segmentation of collision-involved objects, establishing a new paradigm for embodied perception and multimodal interaction understanding.
📝 Abstract
Humans excel at multisensory perception and can often recognise object properties from the sound of their interactions. Inspired by this, we propose the novel task of Collision Sound Source Segmentation (CS3), where we aim to segment the objects responsible for a collision sound in visual input (i.e. video frames from the collision clip), conditioned on the audio. This task presents unique challenges. Unlike isolated sound events, a collision sound arises from interactions between two objects, and the acoustic signature of the collision depends on both. We focus on egocentric video, where sounds are often clear, but the visual scene is cluttered, objects are small, and interactions are brief.
To address these challenges, we propose a weakly-supervised method for audio-conditioned segmentation, utilising foundation models (CLIP and SAM2). We also incorporate egocentric cues, i.e. objects in hands, to find acting objects that can potentially be collision sound sources. Our approach outperforms competitive baselines by $3\times$ and $4.7\times$ in mIoU on two benchmarks we introduce for the CS3 task: EPIC-CS3 and Ego4D-CS3.
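The abstract does not spell out the selection step, but one way to picture audio-conditioned segmentation with egocentric priors is: embed the collision sound and each candidate object mask (e.g. from SAM2 proposals) into a shared space (e.g. CLIP-style), then score candidates by similarity, optionally re-weighted by an in-hand prior. The sketch below is purely illustrative; all names, shapes, and the weighting scheme are assumptions, not the paper's actual method.

```python
# Illustrative sketch (NOT the paper's implementation): pick the candidate
# object mask whose embedding best matches the collision-sound embedding,
# optionally boosting objects detected in the wearer's hands.
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_collision_source(audio_emb, mask_embs, hand_prior=None):
    """Return the index of the candidate mask best matching the audio.

    audio_emb : (D,) embedding of the collision sound
    mask_embs : list of (D,) embeddings, one per candidate object mask
    hand_prior: optional per-candidate weights favouring in-hand objects
    """
    scores = np.array([cosine(audio_emb, m) for m in mask_embs])
    if hand_prior is not None:
        # Egocentric cue: up-weight candidates held in hand.
        scores = scores * np.asarray(hand_prior)
    return int(np.argmax(scores))
```

In this toy view, the hand-object prior narrows the cluttered egocentric scene to the few "acting" objects that could plausibly have produced the sound.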