🤖 AI Summary
Existing automatic audio captioning (AAC) methods, such as EnCLAP, rely on discrete tokens generated by EnCodec—a codec optimized for waveform reconstruction rather than semantic representation—resulting in weak semantic expressivity and limiting caption quality. To address this, we propose the first semantic-aware discrete tokenization paradigm for AAC: leveraging the CLAP pre-trained audio encoder as the feature backbone, we integrate a differentiable vector quantization (VQ) module to construct a semantically rich audio tokenizer, and jointly fine-tune it end-to-end with a BART language model. This framework significantly improves audio–text semantic alignment. Empirical evaluation on two mainstream AAC benchmarks demonstrates consistent and substantial gains over the EnCLAP baseline, validating that semantics-driven discretization is critical for advancing AAC performance.
📝 Abstract
Automated Audio Captioning (AAC) aims to describe the semantic contexts of general sounds, including acoustic events and scenes, by leveraging effective acoustic features. To enhance performance, the AAC method EnCLAP employed discrete tokens from EnCodec as an effective input for fine-tuning the BART language model. However, EnCodec is designed to reconstruct waveforms rather than capture the semantic contexts of general sounds that AAC should describe. To address this issue, we propose CLAP-ART, an AAC method that utilizes "semantic-rich and discrete" tokens as input. CLAP-ART computes semantic-rich discrete tokens from pre-trained audio representations through vector quantization. We experimentally confirmed that CLAP-ART outperforms the baseline EnCLAP on two AAC benchmarks, indicating that semantic-rich discrete tokens derived from semantically rich audio representations are beneficial for AAC.
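The core idea of the tokenization step described above can be sketched with a minimal, hypothetical example: continuous frame embeddings (standing in for the output of a pre-trained CLAP audio encoder) are mapped to discrete token IDs by nearest-neighbour lookup against a learned codebook. All names, sizes, and the random data below are illustrative assumptions, not the paper's actual configuration or a differentiable training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy codebook: 8 entries of dimension 4 (the paper's real codebook size
# and embedding dimension would come from the trained VQ module).
codebook_size, embed_dim = 8, 4
codebook = rng.normal(size=(codebook_size, embed_dim))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map each frame embedding to the index of its nearest codebook entry."""
    # Squared L2 distance between every frame and every codebook vector.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # one discrete token ID per frame

# Stand-in for CLAP frame-level features extracted from an audio clip.
frames = rng.normal(size=(5, embed_dim))
tokens = quantize(frames)
print(tokens.shape)  # (5,): one token per frame, ready to feed a language model
```

The resulting integer token sequence is what a language model such as BART can consume in place of EnCodec's reconstruction-oriented codes; in the actual method the quantizer is differentiable and fine-tuned jointly with the language model.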