TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current contrastive language–audio pretraining models (e.g., CLAP) rely on global, clip-level audio descriptions and lack fine-grained temporal supervision, which limits their frame-level alignment capability. To address this, the paper contributes: (1) a large-scale temporally aligned audio–text dataset of approximately 12,000 Freesound clips, in which each free-text caption is linked to a specific temporal segment of the recording; (2) an LLM-based annotation-cleaning pipeline that removes references to non-audible events, transcribed speech, typos, and annotator language bias; and (3) a frame-wise contrastive training strategy that extends CLAP-style models with temporally aware supervision. On the AudioSet Strong benchmark, the resulting model shows better temporal text–audio alignment than models trained only on global captions, demonstrating the value of strong temporal supervision in language–audio joint modeling. The dataset and source code are publicly released.

📝 Abstract
Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-like language-audio models, particularly if they are expected to produce frame-level embeddings, can benefit from stronger temporal supervision. To confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio recordings from Freesound, each annotated with single-sentence free-text descriptions linked to a specific temporal segment in the recording. We use large language models to clean these annotations by removing references to non-audible events, transcribed speech, typos, and annotator language bias. We further propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording, and we demonstrate on the AudioSet Strong benchmark that our model has better temporal text-audio alignment abilities than models trained only on global captions. The dataset and our source code are available on Zenodo and GitHub, respectively.
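To make the frame-wise contrastive idea concrete, here is a minimal NumPy sketch of one plausible form of such a loss: frames inside the annotated segment act as positives for the caption embedding, and the softmax runs over all frames of the clip. This is an illustrative reconstruction, not the paper's actual implementation; all function and variable names are invented for this example.

```python
import numpy as np

def frame_contrastive_loss(frame_emb, text_emb, segment_mask, temperature=0.07):
    """Toy frame-wise contrastive loss (illustrative, not the TACOS code).

    frame_emb: (T, d) per-frame audio embeddings
    text_emb: (d,) caption embedding
    segment_mask: (T,) 1 for frames inside the annotated segment, else 0
    """
    # L2-normalise so dot products are cosine similarities
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    sims = f @ t / temperature                      # (T,) similarity per frame
    log_probs = sims - np.log(np.exp(sims).sum())   # log-softmax over frames
    pos = segment_mask.astype(bool)
    # Maximise probability mass on the frames the caption describes
    return -log_probs[pos].mean()

rng = np.random.default_rng(0)
T, d = 10, 16
frames = rng.normal(size=(T, d))
text = rng.normal(size=d)
mask = np.zeros(T)
mask[3:6] = 1                                       # caption covers frames 3-5
loss = frame_contrastive_loss(frames, text, mask)
```

Minimising this pulls the annotated frames toward the caption in embedding space while pushing the remaining frames away, which is the kind of temporal supervision a global clip-level caption cannot provide.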
Problem

Research questions and friction points this paper is trying to address.

Improving temporal alignment in audio-text models
Enhancing frame-level embeddings with temporal supervision
Addressing weak supervision in global clip-level descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporally-aligned audio-text dataset for precise training
LLM-cleaned annotations to remove irrelevant content
Frame-wise contrastive training for better alignment
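As a concrete illustration of the segment-level annotation paradigm, a temporally aligned record might look like the following. The field names and values are hypothetical, not the released TACOS schema.

```python
# Hypothetical temporally aligned caption record; field names are
# illustrative, not the released TACOS schema.
record = {
    "clip_id": "freesound_example",   # made-up identifier
    "duration_s": 30.0,               # length of the full recording
    "caption": "A dog barks twice in the distance.",
    "onset_s": 4.2,                   # start of the segment the caption describes
    "offset_s": 7.8,                  # end of that segment
}

# A clip-level (CLAP-style) caption would omit onset_s/offset_s and
# describe the whole recording rather than a specific segment.
assert 0.0 <= record["onset_s"] < record["offset_s"] <= record["duration_s"]
```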