Make Some Noise: Towards LLM audio reasoning and generation using sound tokens

📅 2025-03-28

🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing

📈 Citations: 0

✨ Influential: 0

career value

194K/year

🤖 AI Summary

To address the lack of native audio understanding and generation capabilities in large language models (LLMs), this paper proposes an efficient audio discretization framework that compresses high-sample-rate continuous audio into ultra-low-bitrate (0.23 kbps) discrete acoustic tokens, jointly modeled with text tokens. Methodologically, it introduces the first audio representation learning paradigm integrating vector quantization (VQ) with conditional flow matching (CFM), and achieves evaluable audio understanding by fine-tuning a text-only LLM exclusively via LoRA. Experiments demonstrate that the proposed acoustic tokens outperform VQ-VAE on multiple acoustic event classification tasks and match state-of-the-art (SOTA) models in understanding performance. However, generative quality remains limited by dataset scale and evaluation protocols, highlighting key bottlenecks in current audio–language joint modeling.

Technology Category

Application Category

📝 Abstract

Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low bitrate discrete tokens of 0.23kpbs, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained details through audio tokenization, our multimodal LLM trained with discrete tokens achieves competitive results in audio comprehension with state-of-the-art methods, though audio generation is poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.

Problem

Research questions and friction points this paper is trying to address.

Integrating audio comprehension and generation into LLMs

Converting audio into ultra-low bitrate discrete tokens

Achieving competitive audio comprehension with discrete tokens

Innovation

Methods, ideas, or system contributions that make the work stand out.

Variational Quantization with Conditional Flow Matching

Ultra-low bitrate discrete audio tokens

LoRA fine-tuned multimodal LLM

🔎 Similar Papers

No similar papers found.