FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-language pretraining models struggle to effectively integrate segment-level and frame-level supervision signals, limiting their performance on fine-grained tasks. This work proposes FineLAP, a framework for synergistic training with both coarse- and fine-grained alignment. It jointly optimizes alignment objectives at both granularities through a dual-stream sigmoid loss, employs a cluster-based sampling strategy, and introduces a decoupled audio projector to separately capture global semantics and local temporal details. To address the scarcity of frame-level annotations, the authors construct FineLAP-100k, a large-scale synthetic dataset. The proposed method achieves state-of-the-art performance across multiple benchmarks, including audio retrieval, classification, sound event detection, and text-to-audio grounding, demonstrating the mutual benefits of multi-granularity alignment.
📝 Abstract
Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
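The abstract's "dual-stream sigmoid loss" is not specified here, but sigmoid-based contrastive objectives generally replace the softmax over a batch with an independent binary decision per audio-text pair. The sketch below is a minimal single-stream illustration of that pairwise idea, not the paper's actual method; the function name is ours, and the temperature `t` and bias `b` are fixed here although such parameters are typically learned.

```python
import numpy as np

def sigmoid_contrastive_loss(audio_emb, text_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of audio/text embeddings.

    Each (audio_i, text_j) pair is scored independently: matched pairs
    (i == j) get label +1, all other pairs get label -1. Illustrative
    sketch only; t and b are fixed here but usually learnable.
    """
    # L2-normalize embeddings so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    x = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = t * (a @ x.T) + b          # scaled pairwise similarities
    n = a.shape[0]
    labels = 2.0 * np.eye(n) - 1.0      # +1 on diagonal, -1 elsewhere
    # binary log-sigmoid loss, averaged over all n*n pairs
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-labels * logits)))))
```

Under this objective, a batch whose audio and text embeddings are aligned (matched rows identical) scores a lower loss than one where the pairing is shuffled, which is the signal that drives alignment training.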
Problem

Research questions and friction points this paper is trying to address.

Audio-language pretraining
Heterogeneous supervision
Fine-grained alignment
Frame-level tasks
Clip-level understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained alignment
Heterogeneous supervision
Audio-language pretraining
Sound event detection
Dual-stream loss