Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

📅 2025-06-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large audio-language models (LALMs) rely on text-based outputs, which prevents them from generating natural speech directly in response to audio input. Method: We propose Step-Audio-AQAA, a fully end-to-end large model for Audio Query-Audio Answer (AQAA) tasks. It combines a dual-codebook audio tokenizer, a 130-billion-parameter backbone LLM, and a neural vocoder, and is post-trained with interleaved text and audio token output plus a combination of Direct Preference Optimization (DPO) and model merging. Contribution/Results: Step-Audio-AQAA replaces the conventional “audio → text → speech” pipeline with direct audio-to-audio generation. On the StepEval-Audio-360 benchmark it performs especially well in speech control, outperforming state-of-the-art LALMs in key areas.

📝 Abstract

Large Audio-Language Models (LALMs) have significantly advanced intelligent human-computer interaction, yet their reliance on text-based outputs limits their ability to generate natural speech responses directly, hindering seamless audio interactions. To address this, we introduce Step-Audio-AQAA, a fully end-to-end LALM designed for Audio Query-Audio Answer (AQAA) tasks. The model integrates a dual-codebook audio tokenizer for linguistic and semantic feature extraction, a 130-billion-parameter backbone LLM, and a neural vocoder for high-fidelity speech synthesis. Our post-training approach employs interleaved token output of text and audio to enhance semantic coherence and combines Direct Preference Optimization (DPO) with model merging to improve performance. Evaluations on the StepEval-Audio-360 benchmark demonstrate that Step-Audio-AQAA excels especially in speech control, outperforming state-of-the-art LALMs in key areas. This work contributes a promising solution for end-to-end LALMs and highlights the critical role of a token-based vocoder in enhancing overall performance for AQAA tasks.

Problem

Research questions and friction points this paper is trying to address.

Generating natural speech responses directly for audio interactions, without relying on text-based outputs

Extracting both linguistic and semantic features from audio with a dual-codebook tokenizer

Keeping spoken responses semantically coherent through interleaved text-audio token output
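The interleaved text-audio token output mentioned above can be pictured with a minimal sketch. The function name `interleave`, the `(modality, id)` token representation, and the chunk sizes are illustrative assumptions; the abstract does not state the actual interleaving ratio or token format.

```python
from typing import List, Tuple

# Hypothetical token representation: (modality, token id).
Token = Tuple[str, int]


def interleave(text_ids: List[int], audio_ids: List[int],
               text_chunk: int = 2, audio_chunk: int = 3) -> List[Token]:
    """Alternate fixed-size chunks of text and audio tokens in one stream.

    The chunk sizes are placeholders chosen for illustration only.
    """
    out: List[Token] = []
    t = a = 0
    while t < len(text_ids) or a < len(audio_ids):
        # Emit the next chunk of text tokens (may be empty once text runs out).
        out.extend(("text", i) for i in text_ids[t:t + text_chunk])
        t += text_chunk
        # Emit the next chunk of audio tokens.
        out.extend(("audio", i) for i in audio_ids[a:a + audio_chunk])
        a += audio_chunk
    return out


if __name__ == "__main__":
    stream = interleave([1, 2, 3], [10, 11, 12, 13, 14, 15, 16])
    print(stream)
```

In a layout like this, each audio chunk is generated right after the text it corresponds to, which is one plausible reading of how interleaving could help keep the speech aligned with the intended wording.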

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-codebook audio tokenizer for feature extraction

130-billion-parameter LLM with neural vocoder

Direct Preference Optimization (DPO) combined with model merging to improve performance
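As a rough illustration of the last item, the sketch below computes the standard DPO loss for a single preference pair and merges checkpoints by weighted parameter averaging. Weighted averaging is only one common form of model merging and is an assumption here; the abstract does not say which merging scheme is used. The names `dpo_loss` and `merge_weights`, the `beta` default, and the plain-list weight format are all hypothetical.

```python
import math
from typing import Dict, List


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def merge_weights(checkpoints: List[Dict[str, List[float]]],
                  coeffs: List[float]) -> Dict[str, List[float]]:
    """Merge checkpoints by a normalized weighted average of each parameter.

    Assumed merging scheme; real checkpoints would hold tensors, not lists.
    """
    if len(checkpoints) != len(coeffs):
        raise ValueError("need one coefficient per checkpoint")
    total = sum(coeffs)
    merged: Dict[str, List[float]] = {}
    for name, values in checkpoints[0].items():
        merged[name] = [
            sum(c * ckpt[name][i] for c, ckpt in zip(coeffs, checkpoints)) / total
            for i in range(len(values))
        ]
    return merged


if __name__ == "__main__":
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
    print(merge_weights([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}], [1.0, 1.0]))
```

The loss falls below log 2 once the policy prefers the chosen response more strongly than the reference model does, which is the signal DPO optimizes for.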
