Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers

πŸ“… 2025-10-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
In text-to-audio (T2A) generation, autoregressive language models (LMs) jointly trained with residual vector quantization (RVQ) tokenizers face two key bottlenecks: (i) strong inter-layer independence among RVQ codes impedes effective LM training, and (ii) semantic degradation in deeper RVQ layers exacerbates exposure bias during autoregressive decoding. To address these challenges, we propose Siren, a novel framework featuring: (1) a multi-isolation Transformer architecture that decouples hierarchical RVQ modeling; (2) a causal/anti-causal alignment mechanism that explicitly enforces bidirectional consistency between audio temporal structure and linguistic semantics; and (3) reinforcement-learning-based optimization of hierarchical semantic decoding. Evaluated across multiple T2A benchmarks, Siren is the first LM-based approach to consistently outperform diffusion-based methods, achieving new state-of-the-art performance.
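The RVQ dynamics behind both bottlenecks can be illustrated with a minimal sketch (illustrative only; the function name, codebook shapes, and nearest-neighbor lookup are assumptions, not the paper's tokenizer): each layer quantizes the residual left over by the shallower layers, so deeper layers carry progressively finer, and semantically thinner, detail.

```python
import numpy as np

def residual_vector_quantize(x, codebooks):
    """Hypothetical RVQ sketch: quantize vector `x` with a stack of codebooks.

    Layer l quantizes the residual left by layers 0..l-1, so each
    deeper layer refines the reconstruction of the one before it.
    """
    residual = x.astype(np.float64)
    codes, recon = [], np.zeros_like(residual)
    for cb in codebooks:                          # cb: (K, D) codebook
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))               # nearest codeword
        codes.append(idx)
        recon = recon + cb[idx]
        residual = residual - cb[idx]             # pass residual to next layer
    return codes, recon
```

Because each layer only ever sees what the previous layers failed to capture, the code streams are nearly independent across layers, which is exactly the property the summary identifies as hostile to joint LM training.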

πŸ“ Abstract
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LM training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren facilitates a promising pathway toward unified multi-modal generation frameworks.
Problem

Research questions and friction points this paper is trying to address.

Improving text-to-audio generation quality using language models
Addressing limitations of residual vector quantization in audio reconstruction
Aligning audio representations with linguistic structures for multimodal generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anti-causally aligned collaborative residual transformers framework
Multiple isolated transformers with causal conditioning
Reinforcement learning for anti-causal alignment enhancement
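The "multiple isolated transformers with causal conditioning" idea can be sketched at the interface level. This is a toy stand-in under stated assumptions, not the paper's implementation: one small model per RVQ layer, where layer l conditions on the text prompt, all earlier time steps, and the shallower codes of the current step. `ToyLayerLM` and `decode` are hypothetical names.

```python
import random

class ToyLayerLM:
    """Stand-in for one isolated transformer (hypothetical interface).

    The model for RVQ layer l sees the text prompt, the codes of all
    earlier time steps (causal conditioning), and the codes already
    emitted by shallower layers at the current step.
    """
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def predict(self, text, history, shallower):
        # A real transformer would attend over these inputs; here we
        # derive a deterministic pseudo-token so the loop is runnable.
        seed = sum(map(ord, text)) + 31 * len(history) + sum(shallower)
        return random.Random(seed).randrange(self.vocab_size)

def decode(text, layer_models, num_steps):
    """Outer loop: causal over time, shallow-to-deep over RVQ layers."""
    frames = []                                   # per step: [code_l0, code_l1, ...]
    for _ in range(num_steps):
        codes = []
        for model in layer_models:                # one isolated model per layer
            codes.append(model.predict(text, frames, codes))
        frames.append(codes)
    return frames
```

Keeping the per-layer models isolated means each one only has to fit the statistics of its own residual stream, which matches the framework's goal of decoupling hierarchical RVQ modeling; the anti-causal alignment and RL components would sit on top of this loop and are not sketched here.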
πŸ”Ž Similar Papers
No similar papers found.