🤖 AI Summary
Existing automatic audio captioning (AAC) methods, such as EnCLAP, rely on discrete tokens generated by EnCodec—a codec optimized for waveform reconstruction rather than semantic representation—resulting in weak semantic expressivity and limiting caption quality. To address this, we propose the first semantic-aware discrete tokenization paradigm for AAC: leveraging the CLAP pre-trained audio encoder as the feature backbone, we integrate a differentiable vector quantization (VQ) module to construct a semantically rich audio tokenizer, and jointly fine-tune it end-to-end with a BART language model. This framework significantly improves audio–text semantic alignment. Empirical evaluation on two mainstream AAC benchmarks demonstrates consistent and substantial gains over the EnCLAP baseline, validating that semantics-driven discretization is critical for advancing AAC performance.
📝 Abstract
Automated Audio Captioning (AAC) aims to describe the semantic contexts of general sounds, including acoustic events and scenes, by leveraging effective acoustic features. To enhance performance, the AAC method EnCLAP employed discrete tokens from EnCodec as an effective input for fine-tuning the BART language model. However, EnCodec is designed to reconstruct waveforms rather than capture the semantic contexts of general sounds that AAC should describe. To address this issue, we propose CLAP-ART, an AAC method that utilizes "semantic-rich and discrete" tokens as input. CLAP-ART computes semantic-rich discrete tokens from pre-trained audio representations through vector quantization. We experimentally confirmed that CLAP-ART outperforms the baseline EnCLAP on two AAC benchmarks, indicating that semantic-rich discrete tokens derived from semantically rich audio representations are beneficial for AAC.
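The core idea of the tokenization step described above can be sketched with a minimal, hypothetical example: continuous frame embeddings (standing in for the output of a pre-trained CLAP audio encoder) are mapped to discrete token IDs by nearest-neighbour lookup against a learned codebook. All names, sizes, and the random data below are illustrative assumptions, not the paper's actual configuration or a differentiable training loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy codebook: 8 entries of dimension 4 (the paper's real codebook size
# and embedding dimension would come from the trained VQ module).
codebook_size, embed_dim = 8, 4
codebook = rng.normal(size=(codebook_size, embed_dim))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map each frame embedding to the index of its nearest codebook entry."""
    # Squared L2 distance between every frame and every codebook vector.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # one discrete token ID per frame

# Stand-in for CLAP frame-level features extracted from an audio clip.
frames = rng.normal(size=(5, embed_dim))
tokens = quantize(frames)
print(tokens.shape)  # (5,): one token per frame, ready to feed a language model
```

The resulting integer token sequence is what a language model such as BART can consume in place of EnCodec's reconstruction-oriented codes; in the actual method the quantizer is differentiable and fine-tuned jointly with the language model.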