Language-based Audio Moment Retrieval

📅 2024-09-24

🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

This paper addresses language-based Audio Moment Retrieval (AMR): given a natural language query, localize the relevant temporal segments within untrimmed long audio, rather than retrieving short clips from a database as in clip-level audio retrieval. The authors define the AMR task and build Clotho-Moment, a large-scale dataset of simulated audio recordings with moment annotations. They then propose AM-DETR, a DETR-based model that captures temporal dependencies within audio features, an approach inspired by video moment retrieval. Evaluated on manually annotated real data, AM-DETR trained on Clotho-Moment outperforms a sliding-window clip-level retrieval baseline on all metrics, improving Recall1@0.7 by 9.00 points. The datasets and code are publicly available.

📝 Abstract

In this paper, we propose and design a new task called audio moment retrieval (AMR). Unlike conventional language-based audio retrieval tasks that search for short audio clips from an audio database, AMR aims to predict relevant moments in untrimmed long audio based on a text query. Given the lack of prior work in AMR, we first build a dedicated dataset, Clotho-Moment, consisting of large-scale simulated audio recordings with moment annotations. We then propose a DETR-based model, named Audio Moment DETR (AM-DETR), as a fundamental framework for AMR tasks. This model captures temporal dependencies within audio features, inspired by similar video moment retrieval tasks, thus surpassing conventional clip-level audio retrieval methods. Additionally, we provide manually annotated datasets to properly measure the effectiveness and robustness of our methods on real data. Experimental results show that AM-DETR, trained with Clotho-Moment, outperforms a baseline model that applies a clip-level audio retrieval method with a sliding window on all metrics, particularly improving Recall1@0.7 by 9.00 points. Our datasets and code are publicly available at https://h-munakata.github.io/Language-based-Audio-Moment-Retrieval.
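
The abstract reports Recall1@0.7 without defining it. Assuming it follows the usual video moment retrieval convention (the share of queries whose top-1 predicted moment overlaps the ground truth with a temporal IoU of at least 0.7), it could be computed roughly as below. The function names and the one-ground-truth-moment-per-query simplification are assumptions made for illustration and are not taken from the paper's code.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall1_at_iou(
    top1_preds: List[Moment], gts: List[Moment], threshold: float = 0.7
) -> float:
    """Percentage of queries whose top-1 moment reaches the IoU threshold.

    Assumes exactly one ground-truth moment per query (a simplification).
    """
    if len(top1_preds) != len(gts):
        raise ValueError("need one prediction per ground-truth moment")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)
```

Under this reading, a gain of 9.00 points means that nine more queries out of every hundred have a top-ranked prediction that clears the 0.7 IoU threshold.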

Problem

Research questions and friction points this paper is trying to address.

Predict relevant moments in untrimmed long audio using text queries

Develop a dedicated dataset (Clotho-Moment) for audio moment retrieval

Propose Audio Moment DETR (AM-DETR) to improve retrieval accuracy
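
For context, the baseline that AM-DETR is compared against applies clip-level retrieval with a sliding window. A minimal sketch of that idea follows, assuming per-frame audio embeddings and a text embedding that already live in a shared space (for example from a pretrained audio-text model). The function names, mean pooling, and the window and hop parameters are hypothetical choices for illustration, not the paper's exact baseline.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame embeddings into a single clip-level embedding."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def sliding_window_retrieve(
    frame_embs: Sequence[Sequence[float]],
    text_emb: Sequence[float],
    window: int,
    hop: int,
    sec_per_frame: float = 1.0,
) -> Tuple[float, float, float]:
    """Return (start_sec, end_sec, score) of the best fixed-length window."""
    if window <= 0 or hop <= 0 or not frame_embs:
        raise ValueError("window and hop must be positive; frames non-empty")
    best = None
    last_start = max(len(frame_embs) - window, 0)
    for start in range(0, last_start + 1, hop):
        end = min(start + window, len(frame_embs))
        # Score each window as if it were an independent short clip.
        score = cosine(mean_pool(frame_embs[start:end]), text_emb)
        if best is None or score > best[2]:
            best = (start * sec_per_frame, end * sec_per_frame, score)
    return best
```

In this sketch every window is scored independently and has a fixed length, so the predicted moment cannot adapt its duration to the query. This is one plausible reason a model that captures temporal dependencies across the whole recording could do better.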

Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Audio Moment DETR (AM-DETR) framework

Builds Clotho-Moment dataset with moment annotations

Captures temporal dependencies in audio features
