Make Some Noise: Towards LLM audio reasoning and generation using sound tokens

๐Ÿ“… 2025-03-28
๐Ÿ›๏ธ IEEE International Conference on Acoustics, Speech, and Signal Processing
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the lack of native audio understanding and generation capabilities in large language models (LLMs), this paper proposes an efficient audio discretization framework that compresses high-sample-rate continuous audio into ultra-low-bitrate (0.23 kbps) discrete acoustic tokens, jointly modeled with text tokens. Methodologically, it introduces the first audio representation learning paradigm integrating vector quantization (VQ) with conditional flow matching (CFM), and achieves evaluable audio understanding by fine-tuning a text-only LLM exclusively via LoRA. Experiments demonstrate that the proposed acoustic tokens outperform VQ-VAE on multiple acoustic event classification tasks and match state-of-the-art (SOTA) models in understanding performance. However, generative quality remains limited by dataset scale and evaluation protocols, highlighting key bottlenecks in current audioโ€“language joint modeling.

Technology Category

Application Category

๐Ÿ“ Abstract
Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low bitrate discrete tokens of 0.23kpbs, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained details through audio tokenization, our multimodal LLM trained with discrete tokens achieves competitive results in audio comprehension with state-of-the-art methods, though audio generation is poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.
Problem

Research questions and friction points this paper is trying to address.

Integrating audio comprehension and generation into LLMs
Converting audio into ultra-low bitrate discrete tokens
Achieving competitive audio comprehension with discrete tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational Quantization with Conditional Flow Matching
Ultra-low bitrate discrete audio tokens
LoRA fine-tuned multimodal LLM
๐Ÿ”Ž Similar Papers
No similar papers found.