🤖 AI Summary
Large audio-language models (LALMs) rely on text-based outputs, which prevents them from generating natural speech directly in response to audio inputs.
Method: We propose Step-Audio-AQAA, a fully end-to-end LALM for Audio Query-Audio Answer (AQAA) tasks. The approach combines a dual-codebook audio tokenizer, a 130B-parameter backbone LLM, and a neural vocoder. Post-training uses interleaved text/audio token output and combines Direct Preference Optimization (DPO) with model merging.
Contribution/Results: Step-Audio-AQAA replaces the conventional “audio → text → speech” pipeline with direct audio-to-audio generation. On the StepEval-Audio-360 benchmark, it excels especially in speech control, outperforming state-of-the-art LALMs in key areas.
📝 Abstract
Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved text and audio token output to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of token-based vocoders in enhancing overall performance for AQAA tasks.
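The two post-training ingredients named above, DPO and model merging, can be illustrated with a small sketch. This is not the authors' implementation: the DPO loss follows the standard published formulation for a single preference pair, and the merge is assumed here to be a plain weighted average of checkpoint parameters, which the abstract does not specify. The function names, the `beta` default, and the toy checkpoint layout are all hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the policy and a frozen reference model.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def merge_weights(checkpoints, coefficients):
    """Weighted average of parameter dictionaries (assumed merge rule).

    checkpoints: list of dicts mapping parameter name -> list of floats.
    coefficients: one weight per checkpoint, expected to sum to 1.
    """
    if len(checkpoints) != len(coefficients):
        raise ValueError("need one coefficient per checkpoint")
    merged = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(c * ckpt[name][i] for c, ckpt in zip(coefficients, checkpoints))
            for i in range(size)
        ]
    return merged


if __name__ == "__main__":
    # Policy prefers the chosen response more than the reference does,
    # so the loss falls below log(2), the value at zero margin.
    print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
    # Toy checkpoints standing in for, e.g., an SFT and a DPO model.
    print(merge_weights([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}], [0.5, 0.5]))
```

When the policy and reference assign the same log-probabilities, the margin is zero and the loss equals log 2; the loss decreases as the policy shifts probability toward the chosen response relative to the reference.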