QuarkAudio Technical Report

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio models are predominantly task-specific, which limits scalability and fragments development. This report introduces QuarkAudio, a unified generative framework built on a decoder-only autoregressive language model that handles speech restoration, target speaker extraction, speech separation, voice conversion, language-queried audio source separation, and natural-language-driven free-form audio editing through task-conditioned token sequences. Its main contributions are (i) H-Codec, a discrete audio tokenizer that integrates self-supervised representations and adds a dynamic frame-rate mechanism and 48 kHz audio support; and (ii) free-form audio editing guided by natural-language instructions, covering speech semantic editing and audio event editing. Experiments show that H-Codec reconstructs audio with high quality at a low frame rate and that QuarkAudio performs competitively with state-of-the-art task-specific and multi-task systems across multiple tasks.

📝 Abstract
Many existing audio processing and generation models rely on task-specific architectures, resulting in fragmented development efforts and limited extensibility. It is therefore promising to design a unified framework that handles multiple tasks while providing robust understanding of instructions and audio, along with high-quality audio generation. This requires a compatible paradigm design, a powerful backbone, and a high-fidelity audio reconstruction module. To meet these requirements, this technical report introduces QuarkAudio, a decoder-only autoregressive (AR) LM-based generative framework that unifies multiple tasks. The framework includes a unified discrete audio tokenizer, H-Codec, which incorporates self-supervised learning (SSL) representations into the tokenization and reconstruction process. We further propose several improvements to H-Codec, such as a dynamic frame-rate mechanism and an extension of the audio sampling rate to 48 kHz. QuarkAudio unifies tasks by using task-specific conditional information as the conditioning sequence of the decoder-only LM and predicting discrete target audio tokens in an AR manner. The framework supports a wide range of audio processing and generation tasks, including speech restoration (SR), target speaker extraction (TSE), speech separation (SS), voice conversion (VC), and language-queried audio source separation (LASS). In addition, we extend downstream tasks to universal free-form audio editing guided by natural language instructions (including speech semantic editing and audio event editing). Experimental results show that H-Codec achieves high-quality audio reconstruction at a low frame rate, improving both the efficiency and performance of downstream audio generation, and that QuarkAudio delivers performance competitive with or comparable to state-of-the-art task-specific or multi-task systems across multiple tasks.
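The link between a low frame rate and downstream efficiency can be made concrete with a little arithmetic. The sketch below is illustrative only: the frame rates (50 Hz and 12.5 Hz) and the one-token-per-frame assumption are hypothetical choices, not figures from the report, and the function names are invented for this example.

```python
import math


def samples_per_token(sample_rate_hz: int, frame_rate_hz: float) -> float:
    """Number of waveform samples each codec frame has to cover."""
    return sample_rate_hz / frame_rate_hz


def ar_steps(duration_s: float, frame_rate_hz: float) -> int:
    """Autoregressive decoding steps for a clip, assuming one token per frame.

    Assumption: a single token stream; a multi-codebook tokenizer would
    multiply this count by the number of codebooks decoded sequentially.
    """
    return math.ceil(duration_s * frame_rate_hz)


if __name__ == "__main__":
    # Hypothetical frame rates, chosen only to show the scaling.
    for rate in (50.0, 12.5):
        print(rate, samples_per_token(48_000, rate), ar_steps(10.0, rate))
```

Under these assumptions, quartering the frame rate quarters the number of tokens the LM must predict for the same clip, while each token has to account for four times as many 48 kHz samples, which is why reconstruction quality at a low frame rate matters.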
Problem

Research questions and friction points this paper is trying to address.

Unifies multiple audio tasks with a single framework
Improves audio tokenization for high-fidelity reconstruction
Enables universal audio editing via natural language instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified autoregressive LM framework for multiple audio tasks
H-Codec tokenizer with SSL and dynamic frame-rate
Conditional decoding for free-form audio editing via instructions
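The unification idea (task-specific conditioning sequence in, discrete target audio tokens out) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the special-token IDs, the `AUDIO_OFFSET` layout, and the names `build_prompt`, `generate`, and `toy_model` are all hypothetical, and `toy_model` is a stand-in that simply echoes its condition instead of a trained LM.

```python
from typing import Callable, List, Sequence

# Hypothetical vocabulary layout; the report does not specify one.
TASK_TOKENS = {"SR": 0, "TSE": 1, "SS": 2, "VC": 3, "LASS": 4}
SEP, EOS = 5, 6
AUDIO_OFFSET = 7  # codec token IDs are shifted past the special tokens


def build_prompt(task: str, condition_tokens: Sequence[int]) -> List[int]:
    """Conditioning sequence: [task token] + shifted condition tokens + [SEP]."""
    shifted = [t + AUDIO_OFFSET for t in condition_tokens]
    return [TASK_TOKENS[task]] + shifted + [SEP]


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int], max_new: int = 16) -> List[int]:
    """Greedy autoregressive decoding: one token per step until EOS or max_new."""
    seq = list(prompt)
    out: List[int] = []
    for _ in range(max_new):
        tok = next_token(seq)
        if tok == EOS:
            break
        seq.append(tok)
        out.append(tok - AUDIO_OFFSET)  # map back to codec token IDs
    return out


def toy_model(seq: List[int]) -> int:
    """Stand-in for the LM: copies the condition tokens, then emits EOS."""
    sep = seq.index(SEP)
    cond = seq[1:sep]
    n_generated = len(seq) - sep - 1
    return cond[n_generated] if n_generated < len(cond) else EOS


if __name__ == "__main__":
    prompt = build_prompt("VC", [10, 11, 12])
    print(prompt, generate(toy_model, prompt))
```

In a real system the condition would typically combine several streams (for example, tokenized input audio plus a reference speaker or a text instruction), and the generated token IDs would be passed to the codec decoder to reconstruct a waveform; both steps are omitted here.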
Chengwei Liu
Research Assistant Professor, Nanyang Technological University
Open Source Security, Software Supply Chain Security, Program Analysis, Software Maintenance
Haoyin Yan
Tongyi AI Lab, Alibaba Group
Shaofei Xue
Intelligent Connectivity, Alibaba Group; Tongyi AI Lab, Alibaba Group
Xiaotao Liang
Intelligent Connectivity, Alibaba Group
Xiaofu Chen
MBZUAI
Vision and Language, MLLM, Multimodal Learning
Bin Gong
Intelligent Connectivity, Alibaba Group; Zhejiang University
Zheng Xue
Intelligent Connectivity, Alibaba Group
Gang Song
Intelligent Connectivity, Alibaba Group