ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio captioning datasets are limited in scale and granularity, hindering the development of general-purpose audio-language models. To address this, the authors propose ACAVCaps, a large-scale, fine-grained, and multi-perspective audio captioning dataset. ACAVCaps uses a multi-expert analysis pipeline that captures vocal, musical, and acoustic characteristics to generate structured annotations, which are then synthesized into high-quality, diverse natural-language descriptions by a large language model. Built on the ACAV100M corpus, ACAVCaps substantially improves semantic richness and task generalization. Experiments show that models pretrained on ACAVCaps significantly outperform existing approaches across multiple downstream audio understanding tasks.

📝 Abstract
General audio understanding is a fundamental goal for large audio-language models, with audio captioning serving as a cornerstone task for their development. However, progress in this domain is hindered by existing datasets, which lack the scale and descriptive granularity required to train truly versatile models. To address this gap, we introduce ACAVCaps, a new large-scale, fine-grained, and multi-faceted audio captioning dataset. Derived from the ACAV100M collection, ACAVCaps is constructed using a multi-expert pipeline that analyzes audio from diverse perspectives, including speech, music, and acoustic properties, which are then synthesized into rich, detailed descriptions by a large language model. Experimental results demonstrate that models pre-trained on ACAVCaps exhibit substantially stronger generalization capabilities on various downstream tasks compared to those trained on other leading captioning datasets. The dataset is available at https://github.com/xiaomi-research/acavcaps.
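The multi-expert pipeline described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of the general pattern (per-perspective experts emitting structured annotations that are merged into an LLM synthesis prompt); all function names, fields, and values are assumptions for illustration, not the paper's actual schema or models.

```python
from dataclasses import dataclass

# Hypothetical sketch: each "expert" analyzes one perspective of the audio
# (speech, music, acoustics) and emits structured annotations; the results
# are merged into a prompt for an LLM to synthesize one fluent caption.

@dataclass
class ExpertAnnotation:
    perspective: str  # e.g. "speech", "music", "acoustics"
    findings: dict    # structured attributes from this expert

def speech_expert(audio_path: str) -> ExpertAnnotation:
    # Placeholder: a real system would run ASR / speaker analysis here.
    return ExpertAnnotation("speech", {"transcript": "...", "speakers": 1})

def music_expert(audio_path: str) -> ExpertAnnotation:
    # Placeholder: a real system would run genre / instrument tagging here.
    return ExpertAnnotation("music", {"genre": "ambient", "instruments": ["piano"]})

def acoustics_expert(audio_path: str) -> ExpertAnnotation:
    # Placeholder: a real system would run event / scene classification here.
    return ExpertAnnotation("acoustics", {"events": ["rain"], "scene": "indoor"})

def build_synthesis_prompt(annotations: list[ExpertAnnotation]) -> str:
    """Merge per-expert annotations into a single prompt asking an LLM for
    one fine-grained natural-language caption of the clip."""
    lines = ["Write one detailed caption describing this audio clip."]
    for ann in annotations:
        lines.append(f"[{ann.perspective}] {ann.findings}")
    return "\n".join(lines)

annotations = [
    speech_expert("clip.wav"),
    music_expert("clip.wav"),
    acoustics_expert("clip.wav"),
]
prompt = build_synthesis_prompt(annotations)
```

In the actual pipeline the prompt would be sent to a large language model, which returns the final caption; the sketch stops at prompt construction so it stays self-contained.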
Problem

Research questions and friction points this paper is trying to address.

audio captioning
large-scale dataset
fine-grained audio understanding
audio-language models
descriptive granularity
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio captioning
large-scale dataset
fine-grained audio understanding
multi-expert pipeline
audio-language models