CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries

📅 2025-11-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current audio moment retrieval (AMR) research is hindered by the absence of large-scale, real-world benchmarks with precise temporal annotations, leading to unreliable evaluation and limited practical deployment. To address this, the authors introduce CASTELLA, the first large-scale, human-annotated AMR benchmark built from realistic long-audio recordings, 24× larger than the previous largest dataset. They also propose a transfer-learning paradigm that combines pre-training on synthetic data with fine-tuning on real-world data, establishing a strong baseline model. This approach improves Recall1@0.7 by 10.4 points over a purely synthetic-data baseline, substantially enhancing temporal localization on complex, natural audio. The work provides both a reliable evaluation standard and a practical technical pathway for advancing AMR research and applications.

📝 Abstract
We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various promising applications, no established benchmark with real-world data yet exists. Early work on AMR trained models solely on synthetic datasets and evaluated them on an annotated set of fewer than 100 samples, making the reported performance unreliable. To ensure performance in real-world environments, we present CASTELLA, a large-scale, manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for the train, valid, and test splits, respectively, which is 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on synthetic data outperforms a model trained solely on synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at https://h-munakata.github.io/CASTELLA-demo/.
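The Recall1@0.7 metric reported in the abstract is the standard Recall@1 at a temporal IoU threshold of 0.7: the fraction of queries whose top-ranked predicted moment overlaps the ground-truth moment with intersection-over-union of at least 0.7. A minimal sketch of that computation follows; the function and variable names are illustrative, not taken from the paper's code.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) moments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, ground_truths, iou_threshold=0.7):
    """Fraction of queries whose top-1 moment reaches the IoU threshold."""
    hits = sum(
        1 for pred, gt in zip(top1_preds, ground_truths)
        if temporal_iou(pred, gt) >= iou_threshold
    )
    return hits / len(top1_preds)

# Example: two of three top-1 predictions overlap their ground truth
# with IoU >= 0.7, so Recall1@0.7 is 2/3.
preds = [(10.0, 20.0), (5.0, 9.0), (30.0, 40.0)]
gts   = [(11.0, 20.0), (50.0, 60.0), (30.0, 41.0)]
print(recall_at_1(preds, gts))  # → 0.6666666666666666
```

Raising the threshold (e.g. 0.7 vs 0.5) demands tighter temporal localization, which is why gains at Recall1@0.7 indicate improved boundary precision rather than just coarse retrieval.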
Problem

Research questions and friction points this paper is trying to address.

Addressing the lack of real-world annotated benchmarks for audio moment retrieval
Providing large-scale human-annotated audio data with temporal boundaries
Establishing reliable performance evaluation for real-world AMR applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-annotated audio benchmark for retrieval
Large-scale manually annotated AMR dataset
Fine-tuning on real data after synthetic pre-training