On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

📅 2026-01-09
🏛️ arXiv.org
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Current evaluations of speech language models often adopt text-based perplexity metrics directly, overlooking fundamental modality differences between speech and text and potentially yielding misleading assessments of generation quality. This work systematically examines the limitations of global token-level perplexity in speech modeling and introduces a suite of evaluation metrics grounded in likelihood estimation and perceptual generation quality. The proposed metrics correlate significantly more strongly with human mean opinion scores (MOS), reshape model performance rankings, and reveal a much smaller gap between state-of-the-art systems and the human performance ceiling than global perplexity suggests. The findings underscore the critical role of modality-appropriate evaluation in advancing spoken language modeling research.
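For context, the "global token perplexity" being critiqued is the standard text perplexity formula applied unchanged to discrete speech tokens: the exponential of the mean negative log-likelihood over the whole token sequence. A minimal sketch of that baseline computation, assuming per-token log-probabilities from a spoken LM are already available (the function name and values below are illustrative, not from the paper):

```python
import math

def global_token_perplexity(token_log_probs):
    """Text-style perplexity applied to a flat speech-token sequence:
    exp of the mean negative log-likelihood over all tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Illustrative per-token log-probabilities (natural log) that a spoken
# LM might assign to discrete speech units in a prompt continuation;
# real values would come from the model itself.
log_probs = [-1.2, -0.8, -2.1, -0.5, -1.7]
print(global_token_perplexity(log_probs))  # ~3.53
```

The paper's point is that this single global average treats all speech tokens alike, ignoring modality-specific structure that text perplexity was never designed to capture.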

📝 Abstract
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes like speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are often evaluated using "global token perplexity", which directly applies the text perplexity formulation to speech tokens. However, this practice overlooks fundamental differences between speech and text modalities, possibly leading to an underestimation of speech-specific characteristics. In this work, we propose a variety of likelihood-based and generation-based evaluation methods to serve in place of naive global token perplexity. We demonstrate that the proposed evaluations more faithfully reflect perceived generation quality, as evidenced by stronger correlations with human-rated mean opinion scores (MOS). When assessed under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human topline. Together, these results suggest that appropriate evaluation is critical for accurately assessing progress in spoken language modeling.
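The metrics are validated by how strongly they correlate with human MOS ratings. A hedged sketch of that kind of check, assuming per-system scores have been collected separately (all numbers below are hypothetical, and SciPy's `spearmanr` is used as one standard rank-correlation choice; the paper's exact statistic may differ):

```python
from scipy.stats import spearmanr

# Hypothetical per-system scores for five spoken LMs; none of these
# numbers come from the paper.
mos        = [3.1, 3.8, 4.2, 2.5, 3.9]       # human mean opinion scores
global_ppl = [48.0, 41.0, 44.0, 55.0, 39.0]  # naive global token perplexity
new_metric = [0.62, 0.74, 0.81, 0.48, 0.77]  # a proposed likelihood-based score

# Rank correlation with MOS; perplexity should correlate negatively
# (lower is better), so compare absolute magnitudes across metrics.
rho_ppl, _ = spearmanr(mos, global_ppl)
rho_new, _ = spearmanr(mos, new_metric)
print(f"global PPL vs MOS: rho={rho_ppl:.2f}")  # -0.70 on this toy data
print(f"new metric vs MOS: rho={rho_new:.2f}")  #  1.00 on this toy data
```

Under this kind of check, a faithful metric yields a large absolute correlation; the paper reports that naive global token perplexity falls short of its likelihood- and generation-based alternatives on exactly this criterion.
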
Problem

Research questions and friction points this paper is trying to address.

spoken language modeling
global token perplexity
model evaluation
speech generation
modality difference
Innovation

Methods, ideas, or system contributions that make the work stand out.

spoken language modeling
evaluation metrics
perplexity fallacy
speech generation
human correlation