🤖 AI Summary
Closed-source speech foundation models (e.g., Whisper) hinder reproducibility and fair evaluation because their training data and code are inaccessible. To address this, we introduce FAMA, the first fully open-stack speech foundation model family, supporting English and Italian. Our models are trained on over 150,000 hours of publicly available speech data; we further release a curated bilingual dataset of 16,000 hours of cleaned and pseudo-labeled audio under an OS-compliant license. This constitutes the first end-to-end open release (code, data, models) in speech foundation modeling, filling a critical gap in open science for speech AI. Methodologically, we employ Transformer-based self-supervised pretraining, multi-task joint fine-tuning, and an efficient data cleaning and pseudo-labeling pipeline. Empirically, our models match Whisper's performance across diverse speech understanding and generation benchmarks while running up to 8× faster at inference.
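To make the cleaning and pseudo-labeling step more concrete, here is a minimal sketch of how such a pipeline might filter raw segments. The `Segment` type, the thresholds, and the confidence-based filtering are illustrative assumptions, not the actual FAMA pipeline; the real system transcribes audio with an existing ASR model to produce the pseudo-labels.

```python
# Hypothetical sketch of a cleaning + pseudo-labeling filter.
# All field names and thresholds are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Segment:
    audio_id: str
    duration_s: float   # segment length in seconds
    transcript: str     # pseudo-label produced by an existing ASR model
    confidence: float   # ASR confidence score in [0, 1]

def clean_and_filter(segments, min_dur=1.0, max_dur=30.0, min_conf=0.8):
    """Keep segments with plausible duration and confident, non-empty pseudo-labels."""
    kept = []
    for seg in segments:
        if not (min_dur <= seg.duration_s <= max_dur):
            continue  # drop implausibly short or long audio
        if seg.confidence < min_conf:
            continue  # drop low-confidence pseudo-labels
        if not seg.transcript.strip():
            continue  # drop empty transcripts
        kept.append(seg)
    return kept

corpus = [
    Segment("a", 5.0, "hello world", 0.95),
    Segment("b", 0.3, "uh", 0.90),           # too short
    Segment("c", 12.0, "buongiorno", 0.50),  # low confidence
]
print([s.audio_id for s in clean_and_filter(corpus)])  # -> ['a']
```

In practice, such filters are applied at scale before training so that only reliable pseudo-labeled audio enters the corpus.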
📝 Abstract
The development of speech foundation models (SFMs) like Whisper and SeamlessM4T has significantly advanced the field of speech processing. However, their closed nature, with inaccessible training data and code, poses major reproducibility and fair evaluation challenges. While other domains have made substantial progress toward open science by developing fully transparent models trained on open-source (OS) code and data, similar efforts in speech remain limited. To fill this gap, we introduce FAMA, the first family of open science SFMs for English and Italian, trained on 150k+ hours of OS speech data. Moreover, we present a new dataset containing 16k hours of cleaned and pseudo-labeled speech for both languages. Results show that FAMA achieves competitive performance compared to existing SFMs while being up to 8 times faster. All artifacts, including code, datasets, and models, are released under OS-compliant licenses, promoting openness in speech technology research.