Data-Centric Lessons To Improve Speech-Language Pretraining

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses three key challenges in speech-language pretraining: low-quality web-crawled audio, missing annotations, and difficulties in multimodal sequence modeling. To tackle these, the authors propose a systematic data-centric paradigm: (1) a robust pipeline for cleaning raw audio and forced alignment; (2) generation of high-fidelity synthetic speech-text pairs to augment weak supervision signals; and (3) a fine-grained interleaved splicing strategy that alternates speech and text segments to improve cross-modal alignment. Applying this framework, they train SpeLangy, a 3.8B-parameter speech-language model that outperforms models up to three times larger on Spoken Question-Answering (SQA), with an absolute accuracy gain of 10.2%. The result suggests that carefully engineered data construction, not merely scale, is pivotal for unlocking substantial performance gains in compact speech-language models.
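The audio-cleaning step described above can be sketched as a confidence-based filter over forced-alignment output. This is a minimal illustration only: the `Segment` fields, the scoring convention, and the threshold are assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: filter web-crawled audio segments by forced-alignment
# confidence, in the spirit of the cleaning pipeline described above.
from dataclasses import dataclass


@dataclass
class Segment:
    audio_path: str
    transcript: str
    align_score: float  # assumed mean per-word alignment confidence in [0, 1]


def clean_corpus(segments, min_score=0.8):
    """Keep only segments whose transcript aligns confidently to the audio."""
    return [s for s in segments if s.align_score >= min_score]


raw = [
    Segment("a.wav", "hello world", 0.95),
    Segment("b.wav", "garbled noise", 0.42),
]
print([s.audio_path for s in clean_corpus(raw)])  # → ['a.wav']
```

In a real pipeline the confidence would come from an actual forced aligner rather than a stored field, but the filtering logic would look much the same.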

📝 Abstract
Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs. We focus on three research questions fundamental to speech-language pretraining data: (1) how to process raw web-crawled audio content for speech-text pretraining, (2) how to construct synthetic pretraining datasets to augment web-crawled data and (3) how to interleave (text, audio) segments into training sequences. We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models that are up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.
Problem

Research questions and friction points this paper is trying to address.

Improving speech-language pretraining through data-centric methods
Addressing data processing and curation gaps in SpeechLMs
Enhancing spoken question-answering via optimized pretraining data strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Processing raw web-crawled audio for speech-text pretraining
Constructing synthetic datasets to augment web-crawled data
Interleaving text and audio segments into training sequences
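The last point above, splicing (text, audio) segments into a single training sequence, can be sketched as follows. This is an illustrative assumption of what fine-grained interleaving might look like; the modality markers and token representations are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch: alternate aligned text and audio spans into one
# pretraining sequence, separated by assumed modality-switch markers.
BOS_AUDIO = "<audio>"   # assumed marker: switch into audio modality
EOS_AUDIO = "</audio>"  # assumed marker: switch back to text


def interleave_segments(segments):
    """Splice aligned (text_tokens, audio_tokens) pairs into one sequence,
    alternating modalities segment by segment."""
    sequence = []
    for text_tokens, audio_tokens in segments:
        sequence.extend(text_tokens)    # text span first
        sequence.append(BOS_AUDIO)
        sequence.extend(audio_tokens)   # discrete audio tokens for the same span
        sequence.append(EOS_AUDIO)
    return sequence


segments = [
    (["the", "cat"], ["a17", "a203"]),
    (["sat", "down"], ["a55", "a9", "a80"]),
]
print(interleave_segments(segments))
```

Finer-grained splicing (shorter alternating spans) presumably gives the model more frequent cross-modal transitions to learn from, which is the alignment-efficiency argument the summary makes.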