FAMA: The First Large-Scale Open-Science Speech Foundation Model for English and Italian

📅 2025-05-28
🤖 AI Summary
Closed-source speech foundation models (e.g., Whisper) hinder reproducibility and fair evaluation due to inaccessible training data and code. To address this, we introduce the first fully open-stack speech foundation model family supporting English and Italian. Our models are trained on over 150,000 hours of publicly available speech data; we further release a curated, bilingual dataset comprising 16,000 hours of cleaned and pseudo-labeled audio—publicly available under an OS-compliant license. This constitutes the first end-to-end open release (code, data, models) in speech foundation modeling, filling a critical gap in open science for speech AI. Methodologically, we employ Transformer-based self-supervised pretraining, multi-task joint fine-tuning, and an efficient data cleaning and pseudo-labeling pipeline. Empirically, our models match Whisper’s performance across diverse speech understanding and generation benchmarks while achieving up to 8× faster inference speed.
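The summary mentions an efficient data cleaning and pseudo-labeling pipeline but gives no details. A common approach, and purely an illustrative sketch here rather than the paper's actual method, is to transcribe unlabeled audio with two independent ASR systems and keep only utterances where the hypotheses agree within a word-error-rate threshold; all names and thresholds below are hypothetical.

```python
from dataclasses import dataclass

def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    if not r:
        return 0.0 if not h else 1.0
    # Dynamic-programming edit distance over words.
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cost = 0 if rw == hw else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1] / len(r)

@dataclass
class Utterance:
    audio_path: str
    hyp_a: str  # transcript hypothesis from ASR system A
    hyp_b: str  # transcript hypothesis from ASR system B

def filter_pseudo_labels(utts, max_wer: float = 0.2):
    """Keep utterances whose two hypotheses agree closely; use hyp_a as the label."""
    return [(u.audio_path, u.hyp_a) for u in utts
            if word_error_rate(u.hyp_a, u.hyp_b) <= max_wer]
```

Under this scheme, high cross-system disagreement flags likely transcription errors, so noisy segments are dropped before training rather than propagated into the pseudo-labeled corpus.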

📝 Abstract
The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature, with inaccessible training data and code, poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.
Problem

Research questions and friction points this paper is trying to address.

Closed nature of existing speech foundation models limits reproducibility
Lack of open science efforts in speech processing research
Need for transparent models trained on open-source data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-science SFMs for English and Italian
Trained on 150k+ hours of OS speech data
Includes a 16k-hour cleaned, pseudo-labeled dataset
👥 Authors
Sara Papi, Researcher at FBK (Speech Processing, Speech Translation, Multimodal LLM)
Marco Gaido, Fondazione Bruno Kessler (Artificial Intelligence, NLP, Speech Translation)
L. Bentivogli, Fondazione Bruno Kessler (FBK), Italy
A. Brutti, Fondazione Bruno Kessler (FBK), Italy
Mauro Cettolo, Researcher at Fondazione Bruno Kessler, Trento, Italy (Natural Language Processing, Statistical Machine Translation, Automatic Speech Recognition)
Roberto Gretter, Fondazione Bruno Kessler (FBK), Italy
M. Matassoni, Fondazione Bruno Kessler (FBK), Italy
Mohamed Nabih, Fondazione Bruno Kessler (FBK), Italy
Matteo Negri, Fondazione Bruno Kessler (FBK), Italy