OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

📅 2025-01-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address resource constraints in academia, opaque large-model training practices, and the high barriers to developing Speech Understanding Language Models (SULMs), this paper introduces OSUM, an open-source, fully reproducible SULM. Methodologically, it proposes an ASR+X multi-task training paradigm that enables efficient joint modeling of a Whisper speech encoder and a Qwen2 language model under limited computational resources, further incorporating instruction tuning and cross-task prompt alignment. OSUM supports eight speech understanding tasks, including automatic speech recognition (ASR), vocal event detection, and speech emotion recognition. Experiments demonstrate that OSUM achieves state-of-the-art (SOTA) performance across all evaluated tasks among academically trained models. The authors fully open-source the training code, data-construction scripts, and configuration files, significantly lowering the barrier to SULM research and reproduction.
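The summary's ASR+X paradigm pairs every auxiliary task X with the ASR objective, so each training target contains the transcript first and the task label second. A minimal sketch of how such paired text targets might be constructed is below; the task tags, separator, and function name are illustrative placeholders, not details taken from the paper.

```python
# Sketch of ASR+X target construction: the model is asked to emit the ASR
# transcript first, then the label for the auxiliary task X.
# Tags and the "|" separator are hypothetical, not the paper's actual format.
from typing import Optional

TASK_TAGS = {
    "ASR": "<ASR>",
    "SER": "<SER>",   # speech emotion recognition
    "VED": "<VED>",   # vocal event detection
}

def build_asr_x_target(transcript: str, task: str,
                       label: Optional[str] = None) -> str:
    """Compose the text target for one utterance under ASR+X training."""
    tag = TASK_TAGS[task]
    if task == "ASR" or label is None:
        return f"{tag} {transcript}"
    # For any auxiliary task X, the transcript is decoded before the label,
    # which keeps ASR supervision active during multi-task training.
    return f"{tag} {transcript} | {label}"

# Example: an emotion-recognition sample still supervises the transcript.
print(build_asr_x_target("I really enjoyed the talk", "SER", "happy"))
# <SER> I really enjoyed the talk | happy
```

Decoding the transcript alongside every task label is what the abstract credits with stabilizing multi-task training.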

📝 Abstract
Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.
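The abstract describes the architecture as a Whisper encoder feeding a Qwen2 LLM. The following PyTorch sketch shows the general shape of such a speech-encoder-to-LLM bridge with toy stand-in modules; the dimensions, module choices, and class name are placeholders for illustration, not the paper's actual configuration.

```python
# Minimal sketch of a speech-encoder-to-LLM pipeline in the style OSUM
# describes: encoder features are projected into the LLM embedding space,
# then decoded into text tokens. All modules here are toy stand-ins.
import torch
import torch.nn as nn

class SpeechToLLMBridge(nn.Module):
    def __init__(self, enc_dim=64, llm_dim=128, vocab=1000):
        super().__init__()
        # Stand-in for the Whisper encoder: a small conv stack over
        # 80-bin log-mel features with 2x temporal downsampling.
        self.encoder = nn.Sequential(
            nn.Conv1d(80, enc_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        # Projector aligns speech features with the LLM embedding space.
        self.projector = nn.Linear(enc_dim, llm_dim)
        # Stand-in for the Qwen2 decoder: one transformer layer + LM head.
        self.lm_layer = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, mel):                       # mel: (batch, 80, frames)
        feats = self.encoder(mel).transpose(1, 2) # (batch, frames//2, enc_dim)
        hidden = self.lm_layer(self.projector(feats))
        return self.lm_head(hidden)               # per-frame vocab logits

model = SpeechToLLMBridge()
logits = model(torch.randn(2, 80, 100))
print(logits.shape)  # torch.Size([2, 50, 1000])
```

In the real system the projected speech features would be interleaved with prompt-token embeddings before entering the pretrained LLM; this sketch only shows the feature path.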
Problem

Research questions and friction points this paper is trying to address.

Efficient Speech Understanding Models
Resource Constraints
Transparency in Large-scale Model Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

OSUM
multitask learning
open-source model
👥 Authors

Xuelong Geng
School of Computer Science, Northwestern Polytechnical University
ASR, LLM, speech

Kun Wei
School of Computer Science, Northwestern Polytechnical University
deep learning, computer science, speech

Qijie Shao
Northwestern Polytechnical University
Speech Recognition, Accent/Dialect Recognition

Shuiyun Liu
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Zhennan Lin
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Zhixian Zhao
Northwestern Polytechnical University
Emotion Speech Recognition, Understanding and Generation

Guojian Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Wenjie Tian
Northwestern Polytechnical University
speech generation

Peikun Chen
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Yangze Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Pengcheng Guo
Northwestern Polytechnical University
Speech Recognition, Machine Learning, Deep Learning

Mingchen Shao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Shuiyuan Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Yuang Cao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Chengyou Wang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Tianyi Xu
Tulane University
Reinforcement Learning, Network Optimization, Statistics, NLP (LLM), Operations Research

Yuhang Dai
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Xinfa Zhu
Northwestern Polytechnical University
speech generation

Yue Li
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Li Zhang
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University

Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University