Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms

📅 2025-11-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work systematically evaluates the capability of vision-language models (VLMs) to interpret speech acoustic visualizations, specifically spectrograms and waveforms, with phoneme-level linguistic understanding and transcription ability, akin to human phoneticians.

Method: We introduce the first large-scale, stylistically consistent English word–acoustic visualization dataset (>4,000 paired samples), design a phoneme-edit-distance-based multiple-choice task with distractors, and benchmark state-of-the-art VLMs under both zero-shot and fine-tuned settings.

Contribution/Results: Empirical results show that current VLMs perform near chance level, indicating that end-to-end cross-modal learning alone is insufficient for acquiring structured phonological knowledge. This study provides the first systematic evidence of a fundamental limitation in VLMs’ comprehension of speech visualizations and argues that integrating domain-specific priors, such as phonotactic constraints or dedicated speech decoding modules, is essential to enable expert-level cross-modal reasoning.

📝 Abstract

With the rise of Large Language Models (LLMs) and their vision-enabled counterparts (VLMs), numerous works have investigated their capabilities in tasks that fuse the modalities of vision and language. In this work, we benchmark the extent to which VLMs are able to act as highly-trained phoneticians, interpreting spectrograms and waveforms of speech. To do this, we synthesise a novel dataset containing 4k+ English words spoken in isolation alongside stylistically consistent spectrogram and waveform figures. We test the ability of VLMs to understand these representations of speech through a multiple-choice task whereby models must predict the correct phonemic or graphemic transcription of a spoken word when presented amongst 3 distractor transcriptions that have been selected based on their phonemic edit distance to the ground truth. We observe that both zero-shot and finetuned models rarely perform above chance, demonstrating the requirement for specific parametric knowledge of how to interpret such figures, rather than paired samples alone.
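The distractor selection described above can be illustrated with a minimal sketch. It assumes that distractors are the words whose phoneme sequences are closest to the target under plain Levenshtein distance; the abstract only says distractors are "selected based on their phonemic edit distance", so the nearest-first rule, the tie-breaking, the function names, and the toy ARPABET-style lexicon are all illustrative assumptions rather than the authors' procedure.

```python
from typing import Dict, List, Sequence


def phoneme_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where each symbol is a whole phoneme."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pick_distractors(target: str, lexicon: Dict[str, List[str]], k: int = 3) -> List[str]:
    """Return the k words closest to `target` in phoneme edit distance.

    Assumption: nearest-first selection with alphabetical tie-breaking.
    Homophones of the target are skipped so every option is distinct.
    """
    target_phones = lexicon[target]
    candidates = [w for w in lexicon
                  if w != target and lexicon[w] != target_phones]
    candidates.sort(key=lambda w: (phoneme_edit_distance(target_phones, lexicon[w]), w))
    return candidates[:k]


# Hypothetical toy lexicon (not taken from the paper's dataset).
LEXICON = {
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "bat": ["B", "AE", "T"],
    "cut": ["K", "AH", "T"],
    "dog": ["D", "AO", "G"],
    "cats": ["K", "AE", "T", "S"],
}

if __name__ == "__main__":
    print(pick_distractors("cat", LEXICON))
```

Choosing close neighbours makes the options differ by only a phoneme or so, which means a model cannot rely on coarse cues such as word length and has to read fine acoustic detail from the figure.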

Problem

Research questions and friction points this paper is trying to address.

Benchmarking VLMs' ability to interpret speech spectrograms

Testing phonetic transcription accuracy from visual speech representations

Evaluating multimodal models' understanding of phonemic and graphemic transcriptions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking VLMs on speech spectrogram interpretation

Synthesizing a dataset of isolated English words with stylistically consistent spectrogram and waveform figures

Testing phonemic transcription accuracy via multiple-choice task
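With one correct transcription and 3 distractors, random guessing gives an expected accuracy of 25%. A minimal sketch of how a result could be compared against that baseline follows; the exact one-sided binomial test and the function names are assumptions for illustration, not the evaluation protocol reported in the paper.

```python
from math import comb
from typing import Sequence

CHANCE = 1 / 4  # one correct option among four (3 distractors)


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of items where the predicted option matches the answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one item is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def binomial_tail(n: int, k: int, p: float = CHANCE) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the probability that pure
    guessing yields at least k correct answers out of n."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical predictions, for illustration only.
    preds = ["A", "B", "C", "D", "A", "A", "C", "B"]
    gold = ["A", "C", "C", "B", "D", "A", "B", "B"]
    n_correct = sum(p == g for p, g in zip(preds, gold))
    print(f"accuracy = {accuracy(preds, gold):.2f}, chance = {CHANCE:.2f}")
    print(f"P(at least {n_correct} correct by guessing) = "
          f"{binomial_tail(len(gold), n_correct):.3f}")
```

A tail probability that is not small means the observed accuracy is consistent with guessing, which is the sense in which a model "rarely performs above chance".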
