From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing end-to-end spoken dialogue models typically rely on multi-stage training and unified autoregressive modeling, overlooking a fundamental asymmetry between the two modalities: text exhibits strong inter-token causal dependencies, while speech audio is dominated by source-target relationships with weak temporal dependencies. Treating both the same way leads to high computational overhead and inference latency. Method: We propose TtT, a unified Transformer framework initialized from a single large language model that jointly optimizes autoregressive text generation and non-autoregressive audio diffusion. By adopting a source-driven, non-autoregressive joint training paradigm, TtT challenges the conventional assumption that audio tokens must be generated autoregressively. Contribution/Results: TtT achieves comparable or superior speech quality while significantly reducing training complexity and inference latency. Experiments demonstrate state-of-the-art performance in speech naturalness, interactive real-time capability, and training efficiency over multi-stage baselines.

📝 Abstract
Recent advances in large language models have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-in, speech-out conversational systems. However, existing multimodal models handling interleaved audio and text, such as MOSHI, require complex multi-stage training pipelines, incurring substantial computational costs. Moreover, these models uniformly apply autoregressive generation to both text and audio tokens, overlooking a fundamental asymmetry in their dependency structures: while text tokens exhibit strong target-target dependencies requiring causal ordering, audio tokens are predominantly driven by source-target dependencies, where audio outputs primarily condition on source text rather than preceding audio tokens. In this work, we propose TtT, a unified audio-text modeling framework that integrates AR text generation with non-autoregressive audio diffusion within a single Transformer architecture initialized from a pretrained LLM.
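The hybrid objective the abstract describes can be illustrated with a toy sketch: text tokens receive a next-token cross-entropy loss (causal ordering matters), while audio frames receive a diffusion-style denoising loss applied to all frames in parallel, conditioned on the text. This is a hypothetical minimal illustration of the idea, not the paper's actual TtT implementation; the function name `joint_loss` and the weighting term `w_audio` are assumptions for exposition.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def joint_loss(text_logits, text_targets, pred_noise, true_noise, w_audio=1.0):
    """Toy joint objective (illustrative only, not the paper's code):
    AR cross-entropy on text tokens + non-AR denoising loss on audio frames.
    """
    # AR text loss: next-token cross-entropy; each position depends on its
    # left context, so generation must proceed token by token.
    probs = softmax(text_logits)
    ce = -np.log(probs[np.arange(len(text_targets)), text_targets]).mean()
    # Non-AR audio loss: predict the injected noise for ALL frames at once,
    # conditioned on the source text rather than on preceding audio frames.
    mse = ((pred_noise - true_noise) ** 2).mean()
    return ce + w_audio * mse

# Usage with random toy tensors:
rng = np.random.default_rng(0)
text_logits = rng.normal(size=(5, 10))      # 5 text positions, vocab of 10
text_targets = np.array([1, 2, 3, 4, 0])    # next-token targets
pred_noise = rng.normal(size=(8, 4))        # 8 audio frames, 4-dim each
true_noise = rng.normal(size=(8, 4))
loss = joint_loss(text_logits, text_targets, pred_noise, true_noise)
```

Because the audio term has no left-to-right dependency between frames, the whole audio sequence can be denoised in parallel at inference time, which is where the latency savings over fully autoregressive decoding would come from.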
Problem

Research questions and friction points this paper is trying to address.

Existing multimodal models require complex multi-stage training pipelines
Autoregressive generation is inefficiently applied to both text and audio tokens
Unlike text tokens, audio tokens lack strong sequential dependencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Non-autoregressive audio diffusion generation
Joint training in single Transformer architecture
Initialization from pretrained large language model