LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model

📅 2025-08-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-standing challenges in speech-language modeling—including architectural fragmentation, non-public training data, and undocumented configurations—have severely hindered reproducibility and systematic evaluation. To address these issues, this work introduces LLaSO, a fully open-source end-to-end framework. Methodologically, it unifies large-scale speech-text alignment, multi-task instruction tuning, end-to-end modeling, and complete training configuration reproduction. Crucially, LLaSO releases, for the first time, 12 million speech-text aligned pairs, 13.5 million instruction-tuning samples, and a standardized benchmark suite. Leveraging only publicly available data, we train LLaSO-Base, a 3.8B-parameter foundation model. Under the unified benchmark, LLaSO-Base achieves a normalized score of 0.72—significantly outperforming prior comparable models. All data, code, and models are openly released, establishing a strong, reproducible baseline and a standardized foundation for collaborative research in speech-language modeling.

📝 Abstract
The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results at https://github.com/EIT-NLP/LLaSO.
Problem

Research questions and friction points this paper is trying to address.

Addressing fragmented architectures in large speech-language models
Solving lack of transparency and reproducibility in LSLM research
Providing standardized benchmarks for systematic model comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open end-to-end framework for speech-language modeling
Provides speech-text alignment and instruction-tuning datasets
Includes reproducible benchmark for standardized evaluation
👥 Authors
Yirong Sun (Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin)
Yizhong Geng (Beijing University of Posts and Telecommunications)
Peidong Wei (Institute of Digital Twin, Xiamen University)
Yanjun Chen (University of Illinois Urbana-Champaign)
Jinghan Yang (EIT)
Rongfei Chen (Institute of Digital Twin)
Wei Zhang (Institute of Digital Twin)
Xiaoyu Shen (Eastern Institute of Technology, Ningbo)