Extending Audio Context for Long-Form Understanding in Large Audio-Language Models

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large audio-language models (LALMs) suffer from short audio context windows that hinder long-form audio understanding, even though their text backbones already support extended contexts. This work introduces the first context-extension technique for LALMs, proposing Partial YaRN and Virtual Longform Audio Training (VLAT). Partial YaRN modifies only the audio-specific RoPE positional encodings, enabling audio context expansion without fine-tuning, while VLAT further improves positional generalization by simulating diverse audio lengths during training. Evaluated on SALMONN and Qwen2-Audio, the approach significantly improves long-audio comprehension, supports inference on audio sequences substantially longer than those seen during training, and achieves these gains in robustness and generalization across diverse long-audio benchmarks without sacrificing throughput or requiring architectural changes to the language decoder.

📝 Abstract
Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g., YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across a wide range of settings, and the VLAT training strategy provides substantial improvement, achieving strong performance on long audio of unseen lengths.
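The core idea of Partial YaRN can be sketched in a few lines: YaRN-style "NTK-by-parts" interpolation rescales the low-frequency RoPE dimensions, and the "partial" variant applies the rescaled frequencies only when rotating audio-token positions, leaving text tokens on the unmodified frequencies. The sketch below is illustrative, not the paper's implementation; the function names, the simplified linear ramp over wavelengths, and the `beta_fast`/`beta_slow` cut-offs are assumptions.

```python
import numpy as np

def yarn_inv_freq(dim, base=10000.0, scale=4.0, orig_len=2048,
                  beta_fast=32, beta_slow=1):
    """YaRN-style frequency interpolation (simplified sketch).
    High-frequency dims (short wavelength) are kept as-is, low-frequency
    dims (long wavelength) are divided by `scale`, with a linear ramp
    blending between the two regimes. Cut-offs are illustrative."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    wavelen = 2 * np.pi / inv_freq            # wavelength of each dim pair
    low = orig_len / beta_fast                 # below this: keep original
    high = orig_len / beta_slow                # above this: fully interpolate
    ramp = np.clip((wavelen - low) / (high - low), 0.0, 1.0)
    return inv_freq * (1 - ramp) + (inv_freq / scale) * ramp

def partial_rope_angles(positions, is_audio, dim, scale=4.0):
    """RoPE rotation angles per position: audio tokens use the
    YaRN-interpolated frequencies, text tokens the unmodified ones."""
    base_freq = 1.0 / (10000.0 ** (np.arange(0, dim, 2) / dim))
    audio_freq = yarn_inv_freq(dim, scale=scale)
    freqs = np.where(is_audio[:, None], audio_freq[None, :], base_freq[None, :])
    return positions[:, None] * freqs          # shape: (num_tokens, dim // 2)
```

Because only the audio rows of the frequency table change, a text token at a given position receives exactly the rotation the base LLM was trained with, which is what preserves the backbone's text capabilities.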
Problem

Research questions and friction points this paper is trying to address.

Extending short audio context windows in large audio-language models
Preserving text capabilities while enabling long-form audio understanding
Generalizing to unseen audio lengths through training strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends audio context via the training-free Partial YaRN method
Modifies only audio token positions, preserving the backbone's text capabilities
Uses VLAT positional augmentation to generalize to unseen audio lengths
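The VLAT idea in the abstract, simulating diverse audio lengths during training so the model sees position ids beyond the real clip length, might be sketched as a positional augmentation like the following. The paper's exact sampling scheme is not given here; the function name, the uniform length sampling, and the even spreading of tokens are all assumptions.

```python
import numpy as np

def virtual_audio_positions(num_audio_tokens, max_virtual_len, rng):
    """VLAT-style positional augmentation (illustrative sketch):
    sample a virtual clip length >= the real one and spread the real
    audio tokens evenly over that virtual position range, so training
    exposes the model to position ids far beyond the actual audio."""
    virtual_len = rng.integers(num_audio_tokens, max_virtual_len + 1)
    return np.linspace(0, virtual_len - 1, num_audio_tokens).round().astype(int)
```

At inference no augmentation is applied; the point is that positions drawn this way during training cover the range a genuinely long input would occupy, which is one plausible mechanism for the reported generalization to unseen lengths.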