🤖 AI Summary
Current audio-language models excel at text-audio retrieval but lack frame-level sound event localization capability; conventional sound event detection models are constrained by closed-vocabulary assumptions and predefined categories, limiting generalization to out-of-distribution events. To address this, we propose the first frame-level audio-text contrastive model for open-vocabulary sound event localization. Methodologically, we introduce a calibrated frame-level contrastive learning objective with logit adjustment to mitigate spurious correlations induced by event dependency and label imbalance; use large language models to generate descriptive captions and audio simulation to synthesize realistic samples with weakly supervised frame-level annotations; and adopt a memory-efficient temporal modeling architecture. Experiments demonstrate that our model achieves significant improvements over baselines on open-vocabulary event localization while preserving strong performance on global retrieval and downstream tasks.
📝 Abstract
Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack the fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations induced by event dependencies and label imbalance during training. To enable frame-wise supervision, we leverage a large-scale dataset of diverse audio events, LLM-generated captions, and simulated audio. Experimental results and case studies demonstrate that FLAM significantly improves open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
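To make the calibrated frame-wise objective concrete, the following is a minimal sketch of what a frame-level contrastive loss with logit adjustment might look like. This is an illustrative reconstruction, not FLAM's actual implementation: the function name, tensor shapes, the binary cross-entropy formulation, and the per-event `logit_bias` term (used here to compensate for label imbalance) are all assumptions.

```python
import torch
import torch.nn.functional as F

def frame_wise_contrastive_loss(frame_embeds, text_embeds, targets,
                                logit_bias, temperature=0.07):
    """Hypothetical frame-level audio-text contrastive loss with logit adjustment.

    frame_embeds: (B, T, D) per-frame audio embeddings
    text_embeds:  (E, D)    text embeddings, one per event description
    targets:      (B, T, E) binary frame-level event labels
    logit_bias:   (E,)      per-event adjustment, e.g. log prior frequency,
                            to counteract label imbalance
    """
    # Cosine similarity between every frame and every event text.
    frame_embeds = F.normalize(frame_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = torch.einsum('btd,ed->bte', frame_embeds, text_embeds) / temperature

    # Logit adjustment: shift each event's logits by its prior so that
    # rare events are not systematically under-predicted.
    logits = logits + logit_bias

    # Treat each (frame, event) pair as an independent binary decision,
    # which avoids forcing mutual exclusivity between co-occurring events.
    return F.binary_cross_entropy_with_logits(logits, targets)
```

Framing each (frame, event) pair as an independent binary decision, rather than a softmax over events, is one plausible way to handle the event co-occurrence the paper describes, since multiple sound events can be active in the same frame.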