FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing

📅 2025-07-20
🤖 AI Summary
Current large speech language models (LSLMs) face dual bottlenecks in long-speech processing: scarcity of long-speech training data and prohibitive computational overhead. To address this, we propose an efficient long-speech processing framework that requires no long-speech supervision—leveraging dynamic compression training and iterative fusion to achieve progressive sequence compression while preserving semantic fidelity. Our approach performs transfer learning atop existing LSLMs, enabling unified modeling of both short- and long-speech tasks. We introduce LongSpeech-Eval, the first comprehensive benchmark for long-speech understanding, and empirically demonstrate that our method significantly outperforms baselines across multiple long-speech comprehension tasks. Crucially, it maintains competitive performance on short-speech benchmarks and sustains high inference efficiency. This work establishes a scalable, low-dependency paradigm for extending LSLMs to long-duration speech scenarios.

📝 Abstract
The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.
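The iterative fusion strategy described in the abstract can be sketched as greedy merging of adjacent speech frames until the sequence fits a target length. The cosine-similarity merge criterion and mean-pooling of the fused pair are illustrative assumptions, not necessarily the paper's exact formulation:

```python
def iterative_fusion(frames, target_len):
    """Greedily merge the most similar adjacent frame pair until the
    sequence is no longer than target_len. Sketch only: the similarity
    measure (cosine) and merge rule (mean) are assumptions."""
    frames = [list(f) for f in frames]
    while len(frames) > target_len:
        best_i, best_sim = 0, float("-inf")
        for i in range(len(frames) - 1):
            a, b = frames[i], frames[i + 1]
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(x * x for x in b) ** 0.5
            sim = dot / (na * nb + 1e-9)  # cosine similarity of the pair
            if sim > best_sim:
                best_i, best_sim = i, sim
        # fuse the most redundant adjacent pair into one averaged frame
        merged = [(x + y) / 2 for x, y in zip(frames[best_i], frames[best_i + 1])]
        frames[best_i : best_i + 2] = [merged]
    return frames
```

Because each iteration removes the most redundant adjacent pair, semantically distinct frames tend to survive the compression, which matches the paper's stated goal of preserving semantic fidelity while shortening the sequence.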
Problem

Research questions and friction points this paper is trying to address.

Efficient processing of long-form speech remains underexplored
Scarcity of long-speech training datasets limits model performance
High computational costs hinder long-sequence speech processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative fusion strategy compresses long-speech sequences
Dynamic compression training adapts LSLMs for long-speech
LongSpeech-Eval benchmark assesses long-speech understanding
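Dynamic compression training, as described above, exposes the model to short-speech sequences at varying compression ratios. A minimal sketch of one such training step is below; the specific ratio set and the use of average pooling are assumptions for illustration:

```python
import random

def avg_pool(features, ratio):
    """Average-pool consecutive frames in groups of `ratio`."""
    return [
        [sum(col) / len(col) for col in zip(*features[i:i + ratio])]
        for i in range(0, len(features), ratio)
    ]

def dynamic_compression_step(features, ratios=(1, 2, 4, 8)):
    """Sample a compression ratio per step so the model sees short-speech
    inputs at varying compression levels (the ratio set is hypothetical)."""
    r = random.choice(ratios)
    return r, avg_pool(features, r)
```

Training on short speech at many compression levels is what lets the model later accept heavily fused long-speech sequences without any long-speech supervision.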
Shoutao Guo
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); University of Chinese Academy of Sciences, Beijing, China
Shaolei Zhang
Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
Natural Language Processing, Large Language Model, Multimodal LLMs, Simultaneous Translation
Qingkai Fang
Institute of Computing Technology, Chinese Academy of Sciences
Large Language Models, Speech Language Models, Multimodal LLMs, Speech Translation
Zhengrui Ma
Institute of Computing Technology, Chinese Academy of Sciences
Language Modeling
Min Zhang
School of Future Science and Engineering, Soochow University
Yang Feng
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); Key Laboratory of AI Safety, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China