🤖 AI Summary
Current large speech language models (LSLMs) face a dual bottleneck in long-speech processing: scarce long-speech training data and prohibitive computational overhead. To address this, we propose an efficient long-speech processing framework that requires no long-speech supervision, leveraging dynamic compression training and iterative fusion to progressively compress sequences while preserving semantic fidelity. Our approach performs transfer learning atop existing LSLMs, enabling unified modeling of both short- and long-speech tasks. We introduce LongSpeech-Eval, the first comprehensive benchmark for long-speech understanding, and empirically demonstrate that our method significantly outperforms baselines across multiple long-speech comprehension tasks. Crucially, it maintains competitive performance on short-speech benchmarks and sustains high inference efficiency. This work establishes a scalable, low-dependency paradigm for extending LSLMs to long-duration speech scenarios.
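The summary names iterative fusion as the compression mechanism but does not spell out the fusion rule here. A minimal sketch of one plausible reading, assuming adjacent speech frames are greedily merged by cosine similarity until a target length is reached (the function name, similarity measure, and averaging rule are illustrative assumptions, not the paper's exact method):

```python
import numpy as np

def iterative_fusion(frames: np.ndarray, target_len: int) -> np.ndarray:
    """Iteratively merge the most similar adjacent frame pair until the
    sequence is at most target_len frames long.

    frames: (T, D) array of speech features.
    NOTE: hypothetical sketch; the actual fusion rule may differ.
    """
    seq = [f.astype(float) for f in frames]
    while len(seq) > target_len:
        # cosine similarity between each adjacent pair of frames
        sims = []
        for a, b in zip(seq[:-1], seq[1:]):
            denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-8
            sims.append(float(a @ b) / denom)
        # fuse the most redundant (most similar) pair by averaging
        i = int(np.argmax(sims))
        seq[i:i + 2] = [(seq[i] + seq[i + 1]) / 2.0]
    return np.stack(seq)
```

Merging one pair per step keeps the sketch simple but is quadratic in sequence length; a practical implementation would likely fuse many pairs per pass to stay efficient on long inputs.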
📝 Abstract
The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational cost of long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities to efficient long-speech processing without requiring dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that compresses excessively long speech sequences to manageable lengths. To adapt LSLMs to long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance on both long-speech and short-speech tasks, while greatly improving inference efficiency.
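The dynamic compression training idea above amounts to showing the model short-speech inputs at randomly varied compression levels. A minimal sketch, assuming the ratio is sampled per sequence and compression is simple average pooling over consecutive frames (the ratio set and pooling choice are assumptions for illustration, not the paper's exact recipe):

```python
import random
import numpy as np

def dynamic_compression(features: np.ndarray,
                        ratios=(1, 2, 4, 8)) -> np.ndarray:
    """Compress a short-speech feature sequence by a randomly sampled
    ratio, so training exposes the model to varying compression levels.

    features: (T, D) array of speech features.
    NOTE: hypothetical sketch of dynamic compression training.
    """
    r = random.choice(ratios)
    T, D = features.shape
    # zero-pad so the length divides evenly by the sampled ratio
    pad = (-T) % r
    if pad:
        features = np.concatenate([features, np.zeros((pad, D))], axis=0)
    # average-pool every consecutive group of r frames into one frame
    return features.reshape(-1, r, D).mean(axis=1)
```

Sampling the ratio anew for each training sequence is what lets a model trained only on short speech later handle the heavier compression that long-speech inputs require.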