🤖 AI Summary
Existing audio captioning methods struggle to produce fine-grained, contextually accurate descriptions because they rely on unimodal or shallow multimodal features. To address this, we propose a two-stage generation pipeline that fuses heterogeneous signals—speech, music, environmental sounds, and the associated video—to guide a large language model in producing scene-aware, high-fidelity captions. Our key contributions are: (1) a scalable pipeline for fine-grained audio caption generation; (2) FusionAudio-1.2M, a large-scale dataset comprising 1.2 million detailed captions and 6 million QA pairs; and (3) audio models trained on this data, notably a CLAP-based audio encoder with improved audio–text alignment and instruction following. Experiments show strong performance across multiple audio understanding benchmarks. All code, models, and data are publicly released to advance fine-grained, context-aware audio understanding.
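To make the audio–text alignment behind the CLAP-based encoder concrete, the sketch below scores audio clips against candidate captions by cosine similarity in a shared embedding space and shows the standard symmetric contrastive objective used to train such encoders. The random embeddings, embedding size, and temperature are placeholder assumptions; they do not reproduce the actual FusionAudio-trained encoder.

```python
import torch
import torch.nn.functional as F

# Random tensors stand in for the outputs of a CLAP-style audio encoder and
# text encoder; the real FusionAudio-trained encoder is not reproduced here.
audio_emb = torch.randn(4, 512)   # 4 audio clips, 512-d embeddings (assumed size)
text_emb = torch.randn(4, 512)    # 4 candidate captions, same embedding space

# Audio-text alignment is scored by cosine similarity in the shared space.
audio_emb = F.normalize(audio_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)
similarity = audio_emb @ text_emb.T            # (4, 4) clip-vs-caption scores

# Retrieval check: a well-aligned encoder ranks each clip's own caption highest.
print(similarity.argmax(dim=-1))

# Symmetric contrastive (CLIP/CLAP-style) training objective; the temperature
# 0.07 is a conventional default, not a value taken from the paper.
logits = similarity / 0.07
targets = torch.arange(logits.size(0))
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss)
```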
📝 Abstract
High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, together with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments. Code and data are available at https://github.com/satsuki2486441738/FusionAudio.
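To make the two-stage design concrete, here is a minimal sketch of how such a pipeline can be organized: Stage 1 collects contextual cues from specialized expert models, and Stage 2 prompts an LLM to fuse them into a single fine-grained caption. The dataclass fields, prompt wording, and stand-in `llm_generate` callable are illustrative assumptions, not the exact components or prompts used to build FusionAudio.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AudioCues:
    """Stage 1 output: contextual cues gathered by specialized pretrained models."""
    speech_transcript: str   # e.g. from an ASR model (assumed)
    music_description: str   # e.g. from a music understanding model (assumed)
    sound_events: List[str]  # e.g. from a general audio-tagging model (assumed)
    visual_context: str      # e.g. from a captioner run on the paired video (assumed)

def fuse_cues_into_caption(cues: AudioCues, llm_generate: Callable[[str], str]) -> str:
    """Stage 2: prompt an LLM to synthesize the cues into one detailed caption."""
    prompt = (
        "Write a detailed, scene-aware caption for an audio clip, "
        "grounded only in the cues below.\n"
        f"Speech: {cues.speech_transcript}\n"
        f"Music: {cues.music_description}\n"
        f"Sound events: {', '.join(cues.sound_events)}\n"
        f"Visual context: {cues.visual_context}\n"
    )
    return llm_generate(prompt)

# Toy usage with a stand-in LLM; in a real pipeline each cue would come from
# its own expert model and llm_generate would call an actual LLM.
cues = AudioCues(
    speech_transcript="A tour guide describes a waterfall.",
    music_description="No music detected.",
    sound_events=["rushing water", "birdsong", "footsteps on gravel"],
    visual_context="People walk along a forest trail toward a waterfall.",
)
print(fuse_cues_into_caption(cues, lambda p: "(caption from the LLM would appear here)"))
```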