🤖 AI Summary
Large language models (LLMs) face significant challenges in general audio-language tasks, including substantial acoustic variability, poor cross-task generalization, and heavy reliance on large-scale ASR or captioning data. Method: We propose a dynamic soft prompt selection mechanism that employs a learnable key-value memory module to adaptively balance generic and task-specific knowledge; injects the selected soft token embeddings into the LLM input; and adopts single-stage end-to-end training, eliminating multi-stage fine-tuning. Contribution/Results: Our approach substantially reduces data dependency while enhancing cross-task interpretability and prompt discriminability. It achieves competitive performance across multiple speech-language understanding benchmarks, with fewer trainable parameters and a more streamlined training process.
📝 Abstract
Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.
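To make the key-value prompt selection concrete, here is a minimal numpy sketch of the general idea: a memory of learnable keys is matched against a query (e.g. a pooled audio/task representation), and the resulting attention weights mix the stored soft prompt embeddings into one prompt that is prepended to the LLM input. All sizes and names below are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): a memory of M key-value slots,
# where each value is a soft prompt of P tokens with embedding dim D.
M, P, D = 8, 4, 16

keys = rng.normal(size=(M, D))         # learnable keys, one per memory slot
values = rng.normal(size=(M, P, D))    # learnable soft prompt embeddings

def select_soft_prompt(query):
    """Attention-style selection: weight each stored prompt by the
    scaled dot-product similarity between the query and its key."""
    scores = keys @ query / np.sqrt(D)      # (M,) similarity per slot
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over memory slots
    # Weighted combination of the stored prompts -> one (P, D) soft prompt
    return np.tensordot(weights, values, axes=1)

query = rng.normal(size=D)  # stand-in for a pooled audio/task feature
prompt = select_soft_prompt(query)
print(prompt.shape)  # (4, 16), prepended to the LLM's token embeddings
```

In training, `keys` and `values` would be parameters updated end-to-end with the rest of the adapter, which is what lets the memory specialize some slots per task while sharing others across tasks.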