The Limits of Data Scaling: Sub-token Utilization and Acoustic Saturation in Multilingual ASR

📅 2025-10-25

📈 Citations: 0

✨ Influential: 0

career value

197K/year

🤖 AI Summary

This study investigates how training data scale relates to subword unit coverage in multilingual automatic speech recognition (ASR) models, particularly whether data imbalance constrains cross-lingual lexical diversity. Method: We propose the Acoustic Saturation Time (AST) metric—derived from Whisper’s decoding behavior—to quantify the cumulative subword discovery process across 49 languages; we further apply Zipf–Mandelbrot modeling and statistical correlation analysis. Contribution/Results: We uncover a cross-linguistically consistent exponential saturation pattern in subword discovery. Crucially, data scale differences exert negligible influence on vocabulary coverage; instead, subword utilization is predominantly governed by acoustic statistics, linguistic typology, and orthographic structure—not training data volume. Subword activation rates are significantly higher in Latin-script languages than in Cyrillic-, CJK-, or Semitic-script languages. This work provides the first systematic characterization of the acoustic saturation mechanism underlying subword usage in multilingual ASR, offering theoretical foundations for data allocation optimization and equitable language design.

Technology Category

Application Category

📝 Abstract

How much audio is needed to fully observe a multilingual ASR model's learned sub-token inventory across languages, and does data disparity in multilingual pre-training affect how these tokens are utilized during inference? We address this question by analyzing Whisper's decoding behavior during inference across 49 languages. By logging decoding candidate sub-tokens and tracking their cumulative discovery over time, we study the utilization pattern of the model's sub-token space. Results show that the total number of discovered tokens remains largely independent of a language's pre-training hours, indicating that data disparity does not strongly influence lexical diversity in the model's hypothesis space. Sub-token discovery rates follow a consistent exponential saturation pattern across languages, suggesting a stable time window after which additional audio yields minimal new sub-token activation. We refer to this convergence threshold as acoustic saturation time (AST). Further analyses of rank-frequency distributions reveal Zipf-like patterns better modeled by a Zipf-Mandelbrot law, and mean sub-token length shows a positive correlation with resource level. Additionally, those metrics show more favorable patterns for languages in the Latin script than those in scripts such as Cyrillic, CJK, and Semitic. Together, our study suggests that sub-token utilization during multilingual ASR inference is constrained more by the statistical, typological, and orthographic structure of the speech than by training data scale, providing an empirical basis for more equitable corpus construction and cross-lingual evaluation.

Problem

Research questions and friction points this paper is trying to address.

Analyzing sub-token utilization patterns across 49 multilingual ASR languages

Investigating acoustic saturation thresholds for sub-token discovery in speech

Examining how linguistic structure affects token utilization beyond training data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed Whisper's decoding behavior across 49 languages

Introduced acoustic saturation time for sub-token discovery

Linked sub-token utilization to linguistic structure not data scale

🔎 Similar Papers

No similar papers found.