🤖 AI Summary
Closed-source speech foundation models (e.g., Whisper) hinder reproducibility and fair evaluation because their training data and code are inaccessible. To address this, we introduce FAMA, the first fully open-stack speech foundation model family, supporting English and Italian. Our models are trained on over 150,000 hours of publicly available speech data; we further release a curated bilingual dataset of 16,000 hours of cleaned and pseudo-labeled audio under an OS-compliant license. This constitutes the first end-to-end open release (code, data, models) in speech foundation modeling, filling a critical gap in open science for speech AI. Methodologically, we employ Transformer-based self-supervised pretraining, multi-task joint fine-tuning, and an efficient data cleaning and pseudo-labeling pipeline. Empirically, our models match Whisper's performance across diverse speech understanding and generation benchmarks while running up to 8× faster at inference.
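To make the cleaning and pseudo-labeling step more concrete, here is a minimal sketch of how such a pipeline might filter raw segments. The `Segment` type, the thresholds, and the confidence-based filtering are illustrative assumptions, not the actual FAMA pipeline; the real system transcribes audio with an existing ASR model to produce the pseudo-labels.

```python
# Hypothetical sketch of a cleaning + pseudo-labeling filter.
# All field names and thresholds are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Segment:
    audio_id: str
    duration_s: float   # segment length in seconds
    transcript: str     # pseudo-label produced by an existing ASR model
    confidence: float   # ASR confidence score in [0, 1]

def clean_and_filter(segments, min_dur=1.0, max_dur=30.0, min_conf=0.8):
    """Keep segments with plausible duration and confident, non-empty pseudo-labels."""
    kept = []
    for seg in segments:
        if not (min_dur <= seg.duration_s <= max_dur):
            continue  # drop implausibly short or long audio
        if seg.confidence < min_conf:
            continue  # drop low-confidence pseudo-labels
        if not seg.transcript.strip():
            continue  # drop empty transcripts
        kept.append(seg)
    return kept

corpus = [
    Segment("a", 5.0, "hello world", 0.95),
    Segment("b", 0.3, "uh", 0.90),           # too short
    Segment("c", 12.0, "buongiorno", 0.50),  # low confidence
]
print([s.audio_id for s in clean_and_filter(corpus)])  # -> ['a']
```

In practice, such filters are applied at scale before training so that only reliable pseudo-labeled audio enters the corpus.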
📝 Abstract
The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature, with inaccessible training data and code, poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.