OpusLM: A Family of Open Unified Speech Language Models

📅 2025-06-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited openness and poor multi-task compatibility of existing speech-language models, this paper introduces the open and unified OpusLM model family. Methodologically, OpusLM initializes with a decoder-only large language model and incorporates speech-text joint tokenization, a multi-stream input architecture, and a staged progressive training strategy, trained uniformly on 213K hours of speech-text pairs and 292B text tokens. Our key contributions are threefold: (1) the first fully open-source, end-to-end reproducible framework unifying automatic speech recognition, text-to-speech synthesis, and pure text understanding; (2) state-of-the-art performance across multiple benchmarks—matching or surpassing leading closed-source models; and (3) complete public release of all code, data, model checkpoints, and training logs, significantly advancing open research in speech-language modeling.

📝 Abstract
This paper presents Open Unified Speech Language Models (OpusLMs), a family of open foundational speech language models (SpeechLMs) up to 7B parameters. Initialized from decoder-only text language models, the OpusLMs are continuously pre-trained on 213K hours of speech-text pairs and 292B text-only tokens. We demonstrate that our OpusLMs achieve comparable (or even superior) performance to existing SpeechLMs in speech recognition, speech synthesis, and text-only capabilities. Technically, this paper articulates our SpeechLM designs on tokenization, multi-stream language models, and multi-stage training strategies. We experimentally demonstrate the importance of model size scaling and the effect of annealing data selection. The OpusLMs are built entirely from publicly available materials and are fully transparent models. We release our code, data, checkpoints, and training logs to facilitate open SpeechLM research.
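The abstract highlights speech-text joint tokenization as a core design choice. A minimal sketch of one common realization, assuming an offset scheme where discrete speech codec IDs are shifted to sit above the text vocabulary (the vocabulary sizes and function names below are illustrative assumptions, not taken from the paper):

```python
# Hypothetical joint speech-text vocabulary: codec tokens are mapped above
# the text BPE range so a single decoder-only LM can consume both modalities.
TEXT_VOCAB_SIZE = 50_000   # assumed text BPE vocabulary size
CODEC_VOCAB_SIZE = 1024    # assumed per-stream codec codebook size

def speech_to_joint(codec_ids):
    """Shift raw codec token IDs into the shared joint vocabulary."""
    assert all(0 <= c < CODEC_VOCAB_SIZE for c in codec_ids)
    return [TEXT_VOCAB_SIZE + c for c in codec_ids]

def joint_to_speech(joint_ids):
    """Recover raw codec IDs from joint-vocabulary IDs."""
    return [j - TEXT_VOCAB_SIZE for j in joint_ids]

# Codec IDs 0, 17, 1023 land at 50000, 50017, 51023 in the joint space.
joint = speech_to_joint([0, 17, 1023])
```

The payoff of such a scheme is that ASR and TTS reduce to ordinary next-token prediction over one vocabulary, with modality determined purely by the ID range.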
Problem

Research questions and friction points this paper is trying to address.

Develops open unified speech-language models (OpusLMs) for diverse tasks
Explores scaling and data selection for SpeechLM performance enhancement
Provides transparent models with released resources for open research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Initializes from decoder-only text models
Uses multi-stream language model designs
Implements multi-stage training strategies
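The multi-stream design listed above can be sketched as follows. A common pattern (assumed here, as the paper's exact architecture is not detailed in this summary) is to embed each codec stream separately and sum the per-stream embeddings at every timestep, producing one input vector per position for the decoder-only LM:

```python
import numpy as np

def multi_stream_embed(streams, embed_tables):
    """Sum per-stream token embeddings into one LM input per timestep.

    streams: list of K integer arrays, each of shape (T,) -- one token ID
        per timestep per codec stream.
    embed_tables: list of K embedding matrices, each of shape (V, D).
    Returns an array of shape (T, D): the summed embeddings fed to the LM.
    """
    T = len(streams[0])
    D = embed_tables[0].shape[1]
    x = np.zeros((T, D))
    for ids, table in zip(streams, embed_tables):
        x += table[ids]  # fancy indexing gathers one row per timestep
    return x
```

Summation (rather than concatenation) keeps the LM's hidden width fixed regardless of how many codec streams are used, which is one reason this pattern appears in multi-stream SpeechLMs.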