Epic-Sounds: A Large-Scale Dataset of Actions that Sound

📅 2023-02-01
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 43
Influential: 8
📄 PDF
🤖 AI Summary
This work addresses key challenges in annotating audio within egocentric video: audio events overlap temporally, audio and visual labels can be misaligned in both timing and semantics, and audio-only annotations can be ambiguous. The authors introduce EPIC-SOUNDS, a large-scale dataset of audio annotations built on the 100 hours of egocentric video in EPIC-KITCHENS-100. Annotators temporally label distinguishable audio segments and describe the action that could have caused each sound; the free-form descriptions are then grouped into 44 audio-discriminable classes, yielding 78.4k categorised and 39.2k non-categorised segments (117.6k in total). For sounds caused by objects colliding, material annotations (e.g. a glass object placed on a wooden surface) are collected and verified against visual labels, with ambiguous cases discarded. Two state-of-the-art audio recognition models are trained and evaluated on the dataset, exposing the limitations of current models in recognising actions that sound. The dataset and baseline code are publicly released to advance audio-driven action understanding in egocentric video.
📝 Abstract
We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos from EPIC-KITCHENS-100. We propose an annotation pipeline where annotators temporally label distinguishable audio segments and describe the action that could have caused this sound. We identify actions that can be discriminated purely from audio, through grouping free-form descriptions into classes. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g. a glass object being placed on a wooden surface), which we verify from visual labels, discarding ambiguities. Overall, EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes, as well as 39.2k non-categorised segments, totalling 117.6k segments spanning 100 hours of audio, capturing diverse actions that sound in home kitchens. We train and evaluate two state-of-the-art audio recognition models on our dataset, highlighting the importance of audio-only labels and the limitations of current models to recognise actions that sound. EPIC-SOUNDS and baseline source code are available from: https://epic-kitchens.github.io/epic-sounds
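The abstract describes segments with temporal extents, some carrying one of 44 class labels and some left uncategorised. As a minimal sketch of working with such annotations, the snippet below parses a small hypothetical CSV (the field names `video_id`, `start_sec`, `stop_sec`, `class` are assumptions for illustration, not the dataset's actual schema) and tallies categorised versus non-categorised segments:

```python
import csv
from collections import Counter
from io import StringIO

# Toy rows mirroring the fields the abstract describes: a segment's video,
# its temporal extent, and (if categorised) its class label. An empty
# class column marks a non-categorised segment.
ANNOTATIONS_CSV = """video_id,start_sec,stop_sec,class
P01_01,12.4,13.1,cut / chop
P01_01,15.0,16.2,water
P02_03,3.7,4.0,
"""

def load_segments(text):
    """Parse annotation rows, converting temporal extents to floats."""
    rows = list(csv.DictReader(StringIO(text)))
    for row in rows:
        row["start_sec"] = float(row["start_sec"])
        row["stop_sec"] = float(row["stop_sec"])
    return rows

segments = load_segments(ANNOTATIONS_CSV)
categorised = [s for s in segments if s["class"]]
class_counts = Counter(s["class"] for s in categorised)
total_audio = sum(s["stop_sec"] - s["start_sec"] for s in segments)
```

On the real dataset the same tallies would recover the headline numbers (78.4k categorised, 39.2k non-categorised, 100 hours of audio).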
Problem

Research questions and friction points this paper is trying to address.

Identify actions from audio segments in egocentric videos
Classify audio events into 44 distinct action categories
Evaluate audio and audio-visual recognition models' performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale audio dataset with temporal annotations
Annotation pipeline for sound-action classification
Audio-visual model training and evaluation
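The evaluation contribution above amounts to scoring per-segment class predictions against the 44-class labels. A minimal sketch of segment-level top-1 accuracy, one standard metric for this kind of audio recognition benchmark (the scores and labels here are toy values, not results from the paper):

```python
def top1_accuracy(scores, labels):
    """scores: per-segment lists of per-class scores; labels: gold class indices."""
    correct = sum(
        1
        for seg_scores, gold in zip(scores, labels)
        # argmax over the class scores for this segment
        if max(range(len(seg_scores)), key=seg_scores.__getitem__) == gold
    )
    return correct / len(labels)

# Two toy segments: both predicted correctly.
acc = top1_accuracy([[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]], [1, 0])
```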
Jaesung Huh, Visual Geometry Group, Department of Engineering Science, University of Oxford, UK
Jacob Chalk, PhD Researcher, University of Bristol (Computer Vision)
E. Kazakos, Department of Computer Science, University of Bristol, UK
D. Damen, Department of Computer Science, University of Bristol, UK
Andrew Zisserman, University of Oxford (Computer Vision, Machine Learning)