Towards Audio Token Compression in Large Audio Language Models

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large audio-language models (LALMs) face scalability and deployment challenges due to the quadratic computational complexity of self-attention and high audio tokenization rates. To address this, we propose a lightweight audio token compression framework that inserts unsupervised segmentation followed by uniform average pooling between the audio encoder and LLM decoder, drastically reducing token count. We further employ low-rank adaptation (LoRA) for parameter-efficient fine-tuning to mitigate performance degradation induced by compression. Evaluated on automatic speech recognition and speech-to-speech translation tasks, our method reduces input tokens by up to 67% while maintaining performance close to frame-level baseline models. Our key contribution is the first integration of unsupervised segmentation, structured pooling, and parameter-efficient fine-tuning for audio token compression in LALMs—significantly enhancing model scalability and feasibility for edge-device deployment.

📝 Abstract
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation and uniform average pooling to reduce the number of audio tokens after they are generated by the LALM's audio encoder but before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to fine-tune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation, that depend on effectively uncovering the underlying lexical content of the input signal, and we study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance close to that of frame-level LALMs while reducing the input audio token count by up to a factor of three before the LLM backbone.
Problem

Research questions and friction points this paper is trying to address.

Compress audio tokens to reduce computational complexity
Address scalability limitations for long-form audio processing
Enable deployment on resource-constrained edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compress audio tokens using unsupervised segmentation
Apply uniform average pooling to reduce token count
Fine-tune model with low-rank adapters for performance
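The compression pipeline described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes frame-level encoder outputs as a NumPy array, uses a simple cosine-similarity drop between adjacent frames as the unsupervised segment-boundary criterion (the paper does not specify this exact detector), and mean-pools each segment into a single token. The function name `compress_audio_tokens` and the `threshold` parameter are hypothetical.

```python
import numpy as np


def compress_audio_tokens(frames: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Pool frame-level embeddings into segment-level tokens.

    frames: (T, D) array of audio-encoder outputs.
    A new segment starts wherever the cosine similarity between
    adjacent frames drops below `threshold`; each segment is then
    averaged into one token (uniform average pooling).
    """
    # Unit-normalize frames so dot products give cosine similarity
    normed = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    sims = np.sum(normed[:-1] * normed[1:], axis=1)

    # Segment boundaries open where similarity drops below the threshold
    boundaries = np.where(sims < threshold)[0] + 1
    segments = np.split(frames, boundaries)

    # Uniform average pooling within each segment
    return np.stack([seg.mean(axis=0) for seg in segments])
```

With acoustically stable stretches of speech, many consecutive frames collapse into one token, which is how the token count before the LLM backbone shrinks; the compressed sequence would then be fine-tuned against with low-rank adapters on the decoder side.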