Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

📅 2026-03-11
🤖 AI Summary
It remains unclear whether current speech-aware large language models (LLMs) can tell speakers apart, and methods for evaluating and improving this ability are lacking. This work presents the first systematic assessment of their speaker discrimination ability and introduces a model-agnostic scoring protocol. It also proposes a lightweight enhancement strategy that injects frozen ECAPA-TDNN speaker embeddings into the LLM and trains only LoRA adapters, leaving the LLM backbone untouched. This approach preserves the natural-language interface while achieving an equal error rate (EER) of 1.03% on VoxCeleb1-E, far below the EERs above 20% measured for the benchmarked speech-aware LLMs and close to the performance of dedicated speaker verification systems.
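
For readers unfamiliar with the metric, the EER is the operating point at which the false-acceptance rate equals the false-rejection rate. The following is a minimal sketch of how it can be estimated from a list of trial scores. It is not the paper's evaluation code: the function name `equal_error_rate` and the simple threshold sweep are illustrative assumptions.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER from verification scores.

    scores: higher means "more likely the same speaker".
    labels: 1 for same-speaker (target) trials, 0 for different-speaker trials.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = None, None
    # Sweep every observed score as a threshold, plus +inf (reject everything).
    for threshold in sorted(set(scores)) + [float("inf")]:
        # False acceptance: a different-speaker trial scored at or above threshold.
        far = sum(s >= threshold for s in nontargets) / len(nontargets)
        # False rejection: a same-speaker trial scored below threshold.
        frr = sum(s < threshold for s in targets) / len(targets)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With perfectly separated scores the estimate is 0. With scores that overlap, it rises toward 0.5, which is chance level for a balanced trial list.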

📝 Abstract
Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific attributes such as emotion or speaker gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with automatic speaker verification (ASV) capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
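
The log-likelihood-ratio variant of the scoring protocol can be illustrated with a short sketch. It assumes the model has been asked a yes/no question about whether two utterances share a speaker, and that the next-token logits at the answer position are available. The function names, the token-id arguments, and the prompt setup are hypothetical, since the abstract does not specify them.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of next-token logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def yes_no_llr(logits, yes_id, no_id):
    """Continuous verification score: log P("Yes") - log P("No").

    yes_id and no_id are the vocabulary indices of the two answer tokens
    (hypothetical arguments; real tokenizers may split or case them differently).
    The softmax normalizer cancels in the difference, so this equals the raw
    logit difference; log-softmax is kept here to make the probabilities explicit.
    """
    log_probs = log_softmax(logits)
    return log_probs[yes_id] - log_probs[no_id]


# Toy vocabulary of three tokens: index 0 = "Yes", index 1 = "No", index 2 = other.
score = yes_no_llr([2.0, 0.5, -1.0], yes_id=0, no_id=1)
```

A positive score favours "same speaker" and a negative score favours "different speakers". Because the score is continuous, the decision threshold can be swept to compute an EER instead of relying on the model's hard Yes/No answer.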
Problem

Research questions and friction points this paper is trying to address.

speaker verification
speech-aware LLMs
speaker identity
large language models
voice biometrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

speech-aware LLMs
speaker verification
model-agnostic scoring
ECAPA-TDNN
LoRA adaptation
Thomas Thebaud
Assistant Research Scientist, ECE Dept., Johns Hopkins University, Baltimore
Adversarial and Backdoor attacks, Speech Emotion Recognition, Audio LLMs, Speaker Characterisation
Yuzhe Wang
Electrical and Computer Engineering Department, Johns Hopkins University, Baltimore, MD, USA
Laureano Moro-Velazquez
Electrical and Computer Engineering Department, Johns Hopkins University, Baltimore, MD, USA
Jesus Villalba-Lopez
Electrical and Computer Engineering Department, Johns Hopkins University, Baltimore, MD, USA; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA
Najim Dehak
Associate Professor at ECE department, Johns Hopkins University.
Machine learning, speech processing, speaker recognition, language recognition, emotion recognition