AudioStory: Generating Long-Form Narrative Audio with Large Language Models

📅 2025-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-audio (TTA) models excel at generating short audio clips but struggle with temporal coherence and cross-event emotional consistency in long-form narrative audio. To address this, the paper proposes a decoupled bridging framework that couples large language models (LLMs) with diffusion-based audio generators: the collaboration is decomposed into two modules, one for intra-event semantic alignment and one for cross-event consistency preservation, trained end to end. The LLM performs narrative structure parsing, contextual modeling, and tone control to guide the diffusion model in synthesizing high-fidelity, instruction-following long audio. Evaluated on the newly constructed benchmark AudioStory-10K, the method significantly outperforms state-of-the-art approaches in audio quality, temporal coherence, and instruction/emotion controllability, enabling robust generation of diverse, extended narrative audio sequences.

📝 Abstract
Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following and reasoning-driven generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for cross-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory
Problem

Research questions and friction points this paper is trying to address.

Generating long-form narrative audio with temporal coherence
Maintaining compositional reasoning across extended audio sequences
Integrating LLMs with TTA systems for structured audio generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM decomposition for structured audio narratives
Decoupled bridging mechanism for coherence
End-to-end training framework integration
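The data flow the abstract describes can be sketched as a toy pipeline: an LLM decomposes the narrative into ordered sub-events, each event carries a bridging query for its own semantics, and a residual query threads context from one event to the next. This is a minimal illustrative sketch, not the paper's implementation: the function names, the character-length "embeddings", and the residual update rule are all invented stand-ins for the real LLM and diffusion components.

```python
# Illustrative sketch of AudioStory-style decoupled bridging.
# All names and the toy vector math are hypothetical stand-ins.

from dataclasses import dataclass


@dataclass
class SubEvent:
    caption: str           # natural-language cue for this event
    bridging: list[float]  # intra-event semantic query (toy embedding)


def decompose(narrative: str) -> list[SubEvent]:
    """Stand-in for the LLM: split the story into ordered sub-events."""
    parts = [p.strip() for p in narrative.split(",") if p.strip()]
    # Toy embedding built from text length and position, purely illustrative.
    return [SubEvent(p, [float(len(p)), float(i)]) for i, p in enumerate(parts)]


def generate_audio(event: SubEvent, residual: list[float]) -> list[float]:
    """Stand-in for the diffusion TTA model, conditioned on both queries."""
    return [b + 0.5 * r for b, r in zip(event.bridging, residual)]


def audiostory_pipeline(narrative: str) -> list[list[float]]:
    events = decompose(narrative)
    residual = [0.0, 0.0]  # cross-event coherence state, starts empty
    clips = []
    for ev in events:
        clip = generate_audio(ev, residual)
        # Update the residual query from the event just rendered, so the
        # next event inherits context from everything generated so far.
        residual = [0.5 * r + 0.5 * c for r, c in zip(residual, clip)]
        clips.append(clip)
    return clips


clips = audiostory_pipeline("a storm gathers, rain falls, thunder fades")
print(clips)
```

The point of the sketch is the separation of concerns: `generate_audio` only ever sees its own event's bridging query plus the running residual, so per-event semantics and cross-event coherence are handled by distinct signals rather than one entangled conditioning vector.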