ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio captioning datasets are limited in scale and granularity, hindering the development of general-purpose audio-language models. To address this, the authors propose ACAVCaps, a large-scale, fine-grained, and multi-perspective audio captioning dataset. ACAVCaps uses a multi-expert analysis pipeline that captures vocal, musical, and acoustic characteristics to generate structured annotations, which are then synthesized into high-quality, diverse natural-language descriptions by a large language model. Built on the ACAV100M corpus, ACAVCaps substantially improves semantic richness and task generalization. Experiments show that models pretrained on ACAVCaps significantly outperform existing approaches across multiple downstream audio understanding tasks.

📝 Abstract
General audio understanding is a fundamental goal for large audio-language models, with audio captioning serving as a cornerstone task for their development. However, progress in this domain is hindered by existing datasets, which lack the scale and descriptive granularity required to train truly versatile models. To address this gap, we introduce ACAVCaps, a new large-scale, fine-grained, and multi-faceted audio captioning dataset. Derived from the ACAV100M collection, ACAVCaps is constructed using a multi-expert pipeline that analyzes audio from diverse perspectives, including speech, music, and acoustic properties, which are then synthesized into rich, detailed descriptions by a large language model. Experimental results demonstrate that models pre-trained on ACAVCaps exhibit substantially stronger generalization capabilities on various downstream tasks compared to those trained on other leading captioning datasets. The dataset is available at https://github.com/xiaomi-research/acavcaps.
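The multi-expert pipeline described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of the general pattern (per-perspective experts emitting structured annotations that are merged into an LLM synthesis prompt); all function names, fields, and values are assumptions for illustration, not the paper's actual schema or models.

```python
from dataclasses import dataclass

# Hypothetical sketch: each "expert" analyzes one perspective of the audio
# (speech, music, acoustics) and emits structured annotations; the results
# are merged into a prompt for an LLM to synthesize one fluent caption.

@dataclass
class ExpertAnnotation:
    perspective: str  # e.g. "speech", "music", "acoustics"
    findings: dict    # structured attributes from this expert

def speech_expert(audio_path: str) -> ExpertAnnotation:
    # Placeholder: a real system would run ASR / speaker analysis here.
    return ExpertAnnotation("speech", {"transcript": "...", "speakers": 1})

def music_expert(audio_path: str) -> ExpertAnnotation:
    # Placeholder: a real system would run genre / instrument tagging here.
    return ExpertAnnotation("music", {"genre": "ambient", "instruments": ["piano"]})

def acoustics_expert(audio_path: str) -> ExpertAnnotation:
    # Placeholder: a real system would run event / scene classification here.
    return ExpertAnnotation("acoustics", {"events": ["rain"], "scene": "indoor"})

def build_synthesis_prompt(annotations: list[ExpertAnnotation]) -> str:
    """Merge per-expert annotations into a single prompt asking an LLM for
    one fine-grained natural-language caption of the clip."""
    lines = ["Write one detailed caption describing this audio clip."]
    for ann in annotations:
        lines.append(f"[{ann.perspective}] {ann.findings}")
    return "\n".join(lines)

annotations = [
    speech_expert("clip.wav"),
    music_expert("clip.wav"),
    acoustics_expert("clip.wav"),
]
prompt = build_synthesis_prompt(annotations)
```

In the actual pipeline the prompt would be sent to a large language model, which returns the final caption; the sketch stops at prompt construction so it stays self-contained.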
Problem

Research questions and friction points this paper is trying to address.

audio captioning
large-scale dataset
fine-grained audio understanding
audio-language models
descriptive granularity
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio captioning
large-scale dataset
fine-grained audio understanding
multi-expert pipeline
audio-language models