🤖 AI Summary
Large language models (LLMs) face significant challenges in general audio-language tasks, including substantial acoustic variability, poor cross-task generalization, and heavy reliance on large-scale ASR or captioning data. Method: We propose a dynamic soft prompt selection mechanism that employs a learnable key-value memory module to adaptively balance generic and task-specific knowledge; injects the selected soft token embeddings into the LLM input; and adopts single-stage end-to-end training, eliminating multi-stage fine-tuning. Contribution/Results: Our approach substantially reduces data dependency while enhancing cross-task interpretability and prompt discriminability. It achieves competitive performance across multiple speech-language understanding benchmarks, with fewer trainable parameters and a more streamlined training process.
📝 Abstract
Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.
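To make the key-value prompt selection concrete, here is a minimal numpy sketch of the general idea: a memory of learnable keys is matched against a query (e.g. a pooled audio/task representation), and the resulting attention weights mix the stored soft prompt embeddings into one prompt that is prepended to the LLM input. All sizes and names below are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): a memory of M key-value slots,
# where each value is a soft prompt of P tokens with embedding dim D.
M, P, D = 8, 4, 16

keys = rng.normal(size=(M, D))         # learnable keys, one per memory slot
values = rng.normal(size=(M, P, D))    # learnable soft prompt embeddings

def select_soft_prompt(query):
    """Attention-style selection: weight each stored prompt by the
    scaled dot-product similarity between the query and its key."""
    scores = keys @ query / np.sqrt(D)      # (M,) similarity per slot
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over memory slots
    # Weighted combination of the stored prompts -> one (P, D) soft prompt
    return np.tensordot(weights, values, axes=1)

query = rng.normal(size=D)  # stand-in for a pooled audio/task feature
prompt = select_soft_prompt(query)
print(prompt.shape)  # (4, 16), prepended to the LLM's token embeddings
```

In training, `keys` and `values` would be parameters updated end-to-end with the rest of the adapter, which is what lets the memory specialize some slots per task while sharing others across tasks.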