OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

📅 2025-01-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address resource constraints in academia, opaque large-model training practices, and the high barriers to developing Speech Understanding Language Models (SULMs), this paper introduces OSUM, an open-source, fully reproducible SULM. Methodologically, it proposes an ASR+X multi-task training paradigm that enables efficient joint modeling of a Whisper speech encoder and a Qwen2 language model under limited computational resources, further incorporating instruction tuning and cross-task prompt alignment. OSUM supports eight speech understanding tasks, including automatic speech recognition (ASR), vocal event detection, and speech emotion recognition. Experiments demonstrate that OSUM achieves state-of-the-art (SOTA) performance across all evaluated tasks among academically trained models. The authors fully open-source the training code, data-construction scripts, and configuration files, significantly lowering the barrier to SULM research and reproduction.
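The summary's ASR+X paradigm pairs every auxiliary task X with the ASR objective, so each training target contains the transcript first and the task label second. A minimal sketch of how such paired text targets might be constructed is below; the task tags, separator, and function name are illustrative placeholders, not details taken from the paper.

```python
# Sketch of ASR+X target construction: the model is asked to emit the ASR
# transcript first, then the label for the auxiliary task X.
# Tags and the "|" separator are hypothetical, not the paper's actual format.
from typing import Optional

TASK_TAGS = {
    "ASR": "<ASR>",
    "SER": "<SER>",   # speech emotion recognition
    "VED": "<VED>",   # vocal event detection
}

def build_asr_x_target(transcript: str, task: str,
                       label: Optional[str] = None) -> str:
    """Compose the text target for one utterance under ASR+X training."""
    tag = TASK_TAGS[task]
    if task == "ASR" or label is None:
        return f"{tag} {transcript}"
    # For any auxiliary task X, the transcript is decoded before the label,
    # which keeps ASR supervision active during multi-task training.
    return f"{tag} {transcript} | {label}"

# Example: an emotion-recognition sample still supervises the transcript.
print(build_asr_x_target("I really enjoyed the talk", "SER", "happy"))
# <SER> I really enjoyed the talk | happy
```

Decoding the transcript alongside every task label is what the abstract credits with stabilizing multi-task training.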

📝 Abstract
Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.
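The abstract describes the architecture as a Whisper encoder feeding a Qwen2 LLM. The following PyTorch sketch shows the general shape of such a speech-encoder-to-LLM bridge with toy stand-in modules; the dimensions, module choices, and class name are placeholders for illustration, not the paper's actual configuration.

```python
# Minimal sketch of a speech-encoder-to-LLM pipeline in the style OSUM
# describes: encoder features are projected into the LLM embedding space,
# then decoded into text tokens. All modules here are toy stand-ins.
import torch
import torch.nn as nn

class SpeechToLLMBridge(nn.Module):
    def __init__(self, enc_dim=64, llm_dim=128, vocab=1000):
        super().__init__()
        # Stand-in for the Whisper encoder: a small conv stack over
        # 80-bin log-mel features with 2x temporal downsampling.
        self.encoder = nn.Sequential(
            nn.Conv1d(80, enc_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        # Projector aligns speech features with the LLM embedding space.
        self.projector = nn.Linear(enc_dim, llm_dim)
        # Stand-in for the Qwen2 decoder: one transformer layer + LM head.
        self.lm_layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, mel):                       # mel: (batch, 80, frames)
        feats = self.encoder(mel).transpose(1, 2) # (batch, frames//2, enc_dim)
        hidden = self.lm_layer(self.projector(feats))
        return self.lm_head(hidden)               # per-frame vocab logits

model = SpeechToLLMBridge()
logits = model(torch.randn(2, 80, 100))
print(logits.shape)  # torch.Size([2, 50, 1000])
```

In the real system the projected speech features would be interleaved with prompt-token embeddings before entering the pretrained LLM; this sketch only shows the feature path.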
Problem

Research questions and friction points this paper is trying to address.

Efficient Speech Understanding Models
Resource Constraints
Transparency in Large-scale Model Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

OSUM
multitask learning
open-source model
👥 Authors

Xuelong Geng
School of Computer Science, Northwestern Polytechnical University
ASR, LLM, speech

Kun Wei
School of Computer Science, Northwestern Polytechnical University
deep learning, computer science, speech

Qijie Shao
Northwestern Polytechnical University
Speech Recognition, Accent/Dialect Recognition

Shuiyun Liu
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Zhennan Lin
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Zhixian Zhao
Northwestern Polytechnical University
Emotion Speech Recognition, Understanding and Generation

Guojian Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Wenjie Tian
Northwestern Polytechnical University
speech generation

Peikun Chen
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Yangze Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Pengcheng Guo
Northwestern Polytechnical University
Speech Recognition, Machine Learning, Deep Learning

Mingchen Shao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Shuiyuan Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Yuang Cao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Chengyou Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Tianyi Xu
Tulane University
Reinforcement Learning, Network Optimization, Statistics, NLP (LLM), Operations Research

Yuhang Dai
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Xinfa Zhu
Northwestern Polytechnical University
speech generation

Yue Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Li Zhang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University