🤖 AI Summary
This study investigates whether large language model (LLM) decoders amplify demographic bias in automatic speech recognition through their text-derived priors. Using the Common Voice 24 and Fair-Speech datasets, the authors evaluate nine models spanning three architectural paradigms (CTC, encoder-decoder, and explicit LLM decoders) across five demographic axes, first on clean audio and then under 12 acoustic degradation conditions, for 216 inference runs in total. The results suggest that audio encoder design, rather than LLM scale, is the primary lever for fairness and robustness. On clean audio, Whisper large-v3 reaches a 9.62% insertion rate on Indian-accented speech, and Granite-8B has the best ethnicity fairness (max/min WER ratio of 2.28). Under degradation, silence injection amplifies Whisper's accent bias by up to 4.64×, and high-compression audio encoding (Q-former) reintroduces repetition errors even in LLM decoders.
📝 Abstract
As pretrained large language models replace task-specific decoders in speech recognition, a critical question arises: do their text-derived priors make recognition fairer or more biased across demographic groups? We evaluate nine models spanning three architectural generations (CTC with no language model, encoder-decoder with an implicit LM, and LLM-based with an explicit pretrained decoder) on about 43,000 utterances across five demographic axes (ethnicity, accent, gender, age, first language) using Common Voice 24 and Meta's Fair-Speech, a controlled-prompt dataset that eliminates vocabulary confounds. On clean audio, three findings challenge assumptions: LLM decoders do not amplify racial bias (Granite-8B has the best ethnicity fairness, max/min WER = 2.28); Whisper exhibits pathological hallucination on Indian-accented speech with a non-monotonic insertion-rate spike to 9.62% at large-v3; and audio compression predicts accent fairness more than LLM scale. We then stress-test these findings under 12 acoustic degradation conditions (noise, reverberation, silence injection, chunk masking) across both datasets, totaling 216 inference runs. Severe degradation paradoxically compresses fairness gaps as all groups converge to high WER, but silence injection amplifies Whisper's accent bias up to 4.64x by triggering demographic-selective hallucination. Under masking, Whisper enters catastrophic repetition loops (86% of 51,797 insertions) while explicit-LLM decoders produce 38x fewer insertions with near-zero repetition; high-compression audio encoding (Q-former) reintroduces repetition pathology even in LLM decoders. These results suggest that audio encoder design, not LLM scaling, is the primary lever for equitable and robust speech recognition.
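The fairness figures quoted above (max/min WER and insertion rate) can be illustrated with a minimal sketch. It assumes that WER is pooled per demographic group (total errors over total reference words) and that insertion rate is insertions divided by reference words; the abstract does not spell out either convention, and the function names and sample format below are hypothetical, not the authors' code.

```python
from collections import defaultdict


def edit_counts(ref, hyp):
    """Word-level Levenshtein alignment.

    Returns (substitutions, deletions, insertions, reference_length).
    """
    r, h = ref.split(), hyp.split()
    # dp[i][j] = (cost, subs, dels, ins) for aligning r[:i] with h[:j]
    dp = [[None] * (len(h) + 1) for _ in range(len(r) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(r) + 1):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(h) + 1):
        dp[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            c, s, d, n = dp[i - 1][j - 1]
            if r[i - 1] == h[j - 1]:
                diag = (c, s, d, n)          # match, no cost
            else:
                diag = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = dp[i - 1][j]
            dele = (c + 1, s, d + 1, n)      # deletion
            c, s, d, n = dp[i][j - 1]
            ins = (c + 1, s, d, n + 1)       # insertion
            dp[i][j] = min(diag, dele, ins, key=lambda t: t[0])
    _, s, d, n = dp[len(r)][len(h)]
    return s, d, n, len(r)


def group_metrics(samples):
    """samples: iterable of (group, reference, hypothesis) triples.

    Returns {group: {"wer": float, "insertion_rate": float}},
    pooling counts over all utterances in each group (an assumption).
    """
    totals = defaultdict(lambda: [0, 0, 0])  # errors, insertions, ref words
    for group, ref, hyp in samples:
        s, d, n, ref_len = edit_counts(ref, hyp)
        totals[group][0] += s + d + n
        totals[group][1] += n
        totals[group][2] += ref_len
    return {
        g: {"wer": e / w, "insertion_rate": i / w}
        for g, (e, i, w) in totals.items()
        if w > 0
    }


def fairness_ratio(metrics):
    """Max/min WER across groups; 1.0 means perfectly equal error rates."""
    wers = [m["wer"] for m in metrics.values()]
    return float("inf") if min(wers) == 0 else max(wers) / min(wers)


# Toy data, invented purely for illustration.
samples = [
    ("group_a", "turn on the lights", "turn on the lights"),
    ("group_a", "play some music", "play some musics"),
    ("group_b", "turn on the lights", "turn on on the light"),
    ("group_b", "play some music", "play music"),
]
metrics = group_metrics(samples)
print(metrics, fairness_ratio(metrics))
```

Under this reading, a ratio of 2.28 means the worst-served group has 2.28 times the WER of the best-served group. A ratio near 1 can also arise when every group performs badly, which is consistent with the abstract's observation that severe degradation compresses fairness gaps.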