🤖 AI Summary
Current audio-language models excel at text-audio retrieval but lack frame-level sound event localization capability; conventional sound event detection models are constrained by closed-vocabulary assumptions and predefined categories, limiting generalization to out-of-distribution events. To address this, we propose the first frame-level audio-text contrastive model for open-vocabulary sound event localization. Methodologically, we introduce a calibrated frame-level contrastive learning objective with logit adjustment to mitigate spurious correlations induced by event dependency and label imbalance; use large language models to generate descriptive captions and audio simulation to synthesize realistic samples with weakly supervised frame-level annotations; and adopt a memory-efficient temporal modeling architecture. Experiments demonstrate that our model achieves significant improvements over baselines on open-vocabulary event localization while preserving strong performance on global retrieval and downstream tasks.
📝 Abstract
Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack the fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations induced by event dependencies and label imbalance during training. To enable frame-wise supervision, we leverage a large-scale dataset of diverse audio events, LLM-generated captions, and simulated audio. Experimental results and case studies demonstrate that FLAM significantly improves open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
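To make the calibrated frame-wise objective concrete, the following is a minimal sketch of what a frame-level contrastive loss with logit adjustment might look like. This is an illustrative reconstruction, not FLAM's actual implementation: the function name, tensor shapes, the binary cross-entropy formulation, and the per-event `logit_bias` term (used here to compensate for label imbalance) are all assumptions.

```python
import torch
import torch.nn.functional as F

def frame_wise_contrastive_loss(frame_embeds, text_embeds, targets,
                                logit_bias, temperature=0.07):
    """Hypothetical frame-level audio-text contrastive loss with logit adjustment.

    frame_embeds: (B, T, D) per-frame audio embeddings
    text_embeds:  (E, D)    text embeddings, one per event description
    targets:      (B, T, E) binary frame-level event labels
    logit_bias:   (E,)      per-event adjustment, e.g. log prior frequency,
                            to counteract label imbalance
    """
    # Cosine similarity between every frame and every event text.
    frame_embeds = F.normalize(frame_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = torch.einsum('btd,ed->bte', frame_embeds, text_embeds) / temperature

    # Logit adjustment: shift each event's logits by its prior so that
    # rare events are not systematically under-predicted.
    logits = logits + logit_bias

    # Treat each (frame, event) pair as an independent binary decision,
    # which avoids forcing mutual exclusivity between co-occurring events.
    return F.binary_cross_entropy_with_logits(logits, targets)
```

Framing each (frame, event) pair as an independent binary decision, rather than a softmax over events, is one plausible way to handle the event co-occurrence the paper describes, since multiple sound events can be active in the same frame.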