🤖 AI Summary
Large Audio-Language Models (LALMs) suffer from short audio context windows that hinder long-form audio understanding, even though their text backbones already support extended contexts. This work introduces the first context-extension techniques for LALMs: Partial YaRN and Virtual Longform Audio Training (VLAT). Partial YaRN modifies only the audio tokens' RoPE-based positional encodings, enabling audio context expansion without fine-tuning and without touching text positions; VLAT extends this idea into a training-time positional augmentation that simulates diverse audio lengths, further improving positional generalization. Evaluated on SALMONN and Qwen2-Audio, the approach significantly improves long-audio comprehension, supports inference on audio sequences substantially longer than those seen during training, and maintains computational efficiency. Crucially, it achieves marked gains in robustness and generalization across diverse long-audio benchmarks without requiring architectural changes to the language decoder.
📝 Abstract
Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g., YaRN) for unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and the VLAT training strategy provides substantial further improvement, achieving strong performance on long audio of unseen lengths.
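To make the two ideas concrete, here is a minimal numpy sketch of the general mechanism. It uses plain position interpolation on the audio span as a stand-in for YaRN's full frequency-dependent scaling, and all function names (`partial_interpolate`, `vlat_augment`) are illustrative, not the paper's API: audio-token positions are compressed by an extension factor while text-token positions are left untouched, and a training-time variant samples random factors so the model sees many effective audio lengths.

```python
import numpy as np

def rope_angles(positions, dim=8, base=10000.0):
    """Standard RoPE rotation angles: angle[p, i] = p / base**(2i/dim)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)          # shape (seq_len, dim/2)

def partial_interpolate(positions, is_audio, scale):
    """Audio-only position interpolation (simplified stand-in for
    Partial YaRN): compress positions inside the audio span by
    `scale`, leaving text positions exactly as they were."""
    pos = np.asarray(positions, dtype=float)
    out = pos.copy()
    audio = pos[is_audio]
    start = audio.min()
    out[is_audio] = start + (audio - start) / scale
    return out

def vlat_augment(positions, is_audio, max_scale, rng):
    """VLAT-style training-time augmentation (hypothetical sketch):
    sample a virtual extension factor per example so training covers
    many effective audio context lengths."""
    scale = rng.uniform(1.0, max_scale)
    return partial_interpolate(positions, is_audio, scale)

# Example: 4 text tokens followed by 8 audio tokens, 2x audio extension.
positions = np.arange(12)
is_audio = np.array([False] * 4 + [True] * 8)
scaled = partial_interpolate(positions, is_audio, scale=2.0)
# Text positions 0..3 are unchanged; audio positions 4..11 map to 4.0..7.5,
# so twice as many audio tokens fit inside the trained position range.
angles = rope_angles(scaled)
```

The key property is that only the audio span is rescaled: the text backbone continues to see the position indices it was trained on, which is what lets the method avoid degrading text capabilities.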