Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition

📅 2025-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study reveals severe test-set contamination in the LibriSpeech and Common Voice evaluation sets: a substantial fraction of test samples appears in public pretraining corpora for large language models (LLMs), making automatic speech recognition (ASR) performance estimates on these sets unreliable. To quantify the effect, the authors follow four steps: (1) data provenance tracing to identify contaminated test samples; (2) training LLMs with and without the contaminated data; (3) detecting shifts in the probabilities the models assign to test sentences; and (4) comparing word error rate (WER) with these probability-based contamination signals. Experiments show that even minimal contamination (<1% of test utterances) significantly increases generation probabilities for affected sentences, while WER remains nearly unchanged, which indicates that WER alone is largely insensitive to contamination. The work provides empirical evidence for the necessity of strict test-set isolation in trustworthy ASR evaluation.
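The first step, finding test sentences that also occur in a pretraining corpus, can be illustrated with a minimal sketch. The summary does not say how the authors match text, so the rule used here (lowercase, strip punctuation, then look for a verbatim word-aligned match) is an assumption, and the function names `normalize`, `find_contaminated`, and `contamination_rate` are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def find_contaminated(test_sentences, corpus_documents):
    """Return test sentences whose normalized form occurs verbatim in a document.

    Assumed matching rule (not taken from the paper): exact match on
    normalized text, aligned to word boundaries via space padding.
    """
    padded_docs = [f" {normalize(doc)} " for doc in corpus_documents]
    hits = []
    for sentence in test_sentences:
        core = normalize(sentence)
        if not core:
            continue  # skip sentences that are empty after normalization
        needle = f" {core} "
        if any(needle in doc for doc in padded_docs):
            hits.append(sentence)
    return hits


def contamination_rate(test_sentences, corpus_documents):
    """Fraction of test sentences found in the corpus (0.0 for an empty test set)."""
    if not test_sentences:
        return 0.0
    return len(find_contaminated(test_sentences, corpus_documents)) / len(test_sentences)
```

A linear scan like this only suits small experiments; searching a full pretraining corpus would need an index (for example, hashed n-grams), which is outside the scope of this sketch.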

📝 Abstract

Recent work suggests that large language models (LLMs) can improve performance of speech tasks compared to existing systems. To support their claims, results on LibriSpeech and Common Voice are often quoted. However, this work finds that a substantial amount of the LibriSpeech and Common Voice evaluation sets appears in public LLM pretraining corpora. This calls into question the reliability of findings drawn from these two datasets. To measure the impact of contamination, LLMs trained with or without contamination are compared, showing that a contaminated LLM is more likely to generate test sentences it has seen during training. Speech recognisers using contaminated LLMs show only subtle differences in error rates, but assign significantly higher probabilities to transcriptions seen during training. Results show that LLM outputs can be biased by tiny amounts of data contamination, highlighting the importance of evaluating LLM-based speech systems with held-out data.
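One way to see why error rates can hide contamination is to look at how word error rate is computed: it depends only on the single best transcript a recogniser emits, not on how much probability the model assigns to it. The sketch below is a standard word-level edit-distance implementation, written for illustration; the function name `word_error_rate` is hypothetical and this is not the authors' scoring code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

If a clean and a contaminated system both output the same transcript, this function returns the same value for both, even when the contaminated system is far more confident in a sentence it saw during training. Detecting that difference requires comparing the probabilities themselves.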

Problem

Research questions and friction points this paper is trying to address.

Test set contamination in LLMs for speech recognition

Impact of contamination on LLM performance evaluation

Bias in LLM outputs due to data contamination

Innovation

Methods, ideas, or system contributions that make the work stand out.

Detects test set contamination in LLM pretraining

Compares contaminated vs uncontaminated LLM performance

Highlights need for held-out data evaluation

🔎 Similar Papers

A Comprehensive Survey of Contamination Detection Methods in Large Language Models