🤖 AI Summary
Existing audio-language models (ALMs) achieve strong cross-modal alignment but lack effective joint modeling of natural language semantics and audio temporal structure. To address this, we propose TeminAL, the first temporal injection framework for ALMs. It employs a two-stage self-supervised post-training scheme (TeminAL A & B) that explicitly enhances temporal reasoning without degrading the original cross-modal alignment capabilities. We further introduce Zero-shot Temporal Evaluation (ZSTE), the first benchmark protocol designed specifically for zero-shot temporal understanding in contrastive ALMs, filling a critical gap in ALM evaluation. Experiments show that TeminAL yields an average gain of 5.28% in temporal understanding on the ESC-50 dataset, remains competitive on zero-shot audio-text retrieval and classification (AudioCaps/Clotho), and outperforms current models on most downstream tasks.
📝 Abstract
Research on multi-modal contrastive learning strategies for audio and text has rapidly gained interest. Contrastively trained Audio-Language Models (ALMs) such as CLAP establish a unified representation across the audio and language modalities, and they have improved performance on various downstream tasks by providing well text-aligned audio encoders and vice versa. These improvements are evident in areas such as zero-shot audio classification and audio retrieval. However, the ability of these models to understand natural language and temporal relations remains a largely unexplored, open area of research. In this paper, we propose TeminAL, a temporal instillation method that equips multi-modal ALMs with temporal understanding without losing their existing capabilities on audio-language tasks. We implement a two-stage training scheme, TeminAL A & B: in TeminAL A the model first learns to differentiate between multiple sounds, and in TeminAL B it is instilled with a sense of time, enhancing its temporal understanding. This approach yields an average performance gain of 5.28% in temporal understanding on the ESC-50 dataset, while the model remains competitive in zero-shot retrieval and classification tasks on the AudioCaps/Clotho datasets. We also note the lack of proper evaluation techniques for contrastive ALMs and propose ZSTE, a general-purpose strategy for evaluating ALMs in zero-shot settings. We use ZSTE to evaluate various prior models, and it provides a general way to evaluate all zero-shot contrastive models. The model trained with TeminAL outperforms current models on most downstream tasks.