🤖 AI Summary
This work addresses the challenge of efficiently processing long-form audio with large speech language models, which are typically constrained by limited context length and high memory consumption. To overcome these limitations, the authors propose the Speech Summarization Token (SST), a mechanism that leverages the intrinsic key-value (KV) sparsification capacity of large language models to generate compact representations of speech segments. A curriculum learning strategy for progressive instruction tuning incrementally increases the compression ratio during training, moving from easy (low-ratio) to hard (high-ratio) compression. This approach substantially reduces both the computational and memory requirements of long speech modeling while achieving highly competitive performance on established benchmarks such as LongSpeech and AUDIOMARATHON. Notably, it attains these results using significantly less training data than existing methods.
📝 Abstract
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This bottleneck stems from limited context length and the exorbitant memory footprint of long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio compression of speech inputs. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning with a curriculum learning strategy, so that the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite utilizing significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing these long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
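The interval-level compression and curriculum schedule described above can be sketched in a few lines. This is a schematic illustration, not the paper's implementation: the function names (`compress_intervals`, `curriculum_ratios`) are invented for this sketch, and simple mean-pooling stands in for the learned SST that, in the actual model, absorbs each interval into its KV pairs.

```python
# Schematic sketch (assumptions labeled): mean-pooling stands in for the
# learned Speech Summarization Token (SST); the ratio schedule mimics the
# low-to-high-ratio curriculum described in the abstract.

def compress_intervals(frames, ratio):
    """Replace each block of `ratio` frame embeddings with one summary vector.

    In the paper, a learned SST encapsulates each interval's information in
    its KV pairs; here a mean over the block plays that role for illustration.
    """
    summaries = []
    for start in range(0, len(frames), ratio):
        block = frames[start:start + ratio]
        dim = len(block[0])
        summaries.append(
            [sum(vec[d] for vec in block) / len(block) for d in range(dim)]
        )
    return summaries

def curriculum_ratios(num_stages, start=2, factor=2):
    """Progressive curriculum: begin with easy (low) compression ratios,
    then advance to harder (high) ones."""
    return [start * factor ** i for i in range(num_stages)]

# Toy run: 16 frame embeddings of dimension 4, compressed at ratios 2, 4, 8.
frames = [[float(t)] * 4 for t in range(16)]
for r in curriculum_ratios(3):
    print(r, len(compress_intervals(frames, r)))  # sequence shrinks by 1/r
```

The key property the sketch preserves is that the downstream sequence length (and hence the KV cache) shrinks by the compression ratio, which is what makes long-form inference tractable.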