Can Audio Large Language Models Verify Speaker Identity?

📅 2025-09-24
🤖 AI Summary
Current audio large language models (ALLMs) generalize poorly in zero-shot speaker verification (SV), with performance degrading sharply under variable acoustic conditions. This work reformulates SV as an audio question-answering task, introduces a rule-driven hard sample pair sampling strategy, and applies supervised fine-tuning with lightweight parameter updates. Key contributions: (i) the first demonstration of ALLMs jointly discriminating speaker identity and linguistic content; and (ii) unified modeling of both text-dependent and zero-shot SV within a single framework. Experiments show substantial improvements in zero-shot SV performance after fine-tuning; in text-dependent settings, accuracy matches that of cascaded ASR-SV systems. These results validate ALLMs as robust, multi-task foundation models for speaker verification.

📝 Abstract
This paper investigates adapting Audio Large Language Models (ALLMs) for speaker verification (SV). We reformulate SV as an audio question-answering task and conduct comprehensive zero-shot evaluations on public benchmarks, showing that current ALLMs have limited zero-shot SV capability and often struggle in diverse acoustic conditions. To address this challenge, we perform supervised fine-tuning on speaker verification data. A rule-based hard pair sampling strategy is proposed to construct more challenging training pairs. Lightweight fine-tuning substantially improves performance, though a gap remains between ALLMs and conventional models. We then extend to text-dependent SV by jointly querying ALLMs to verify speaker identity and spoken content, yielding results competitive with cascaded ASR-SV systems. Our findings demonstrate that, with proper adaptation, ALLMs hold substantial potential as a unified model for robust speaker verification systems while maintaining general audio understanding capabilities.
Problem

Research questions and friction points this paper is trying to address.

Adapting Audio LLMs for speaker verification via audio question-answering
Addressing limited zero-shot capability in diverse acoustic conditions
Enabling unified speaker and content verification through joint querying
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulated speaker verification as audio question-answering
Proposed rule-based hard pair sampling for training
Jointly queried model for text-dependent speaker verification
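The hard pair sampling idea can be sketched as a simple rule-driven filter over candidate pairs. The specific rules below (cross-session positives, same-gender negatives) are plausible assumptions for what makes a pair "hard", not the paper's exact recipe:

```python
import random

# Hypothetical sketch of rule-based hard pair sampling for SV fine-tuning.
# The rules are illustrative assumptions: hard positives come from the same
# speaker in different sessions (acoustic mismatch), hard negatives from
# different same-gender speakers (similar voices).

def sample_hard_pairs(utts, n_pairs, seed=0):
    """utts: list of dicts with 'id', 'speaker', 'gender', 'session' keys.

    Returns (id_a, id_b, label) triples with label 1 = same speaker.
    Assumes the pool actually contains qualifying pairs.
    """
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        a, b = rng.sample(utts, 2)
        if a["speaker"] == b["speaker"] and a["session"] != b["session"]:
            # Hard positive: same speaker under different recording conditions.
            pairs.append((a["id"], b["id"], 1))
        elif a["speaker"] != b["speaker"] and a["gender"] == b["gender"]:
            # Hard negative: different speakers with similar voice profiles.
            pairs.append((a["id"], b["id"], 0))
    return pairs
```

Rejection sampling like this is wasteful on large pools; pre-indexing utterances by speaker and gender would make the same rules efficient, but the filter logic is the point here.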
Yiming Ren
Tsinghua University
Object Detection, Multimodal Large Language Model
Xuenan Xu
Shanghai Jiao Tong University
audio generation, audio understanding, speech synthesis
Baoxiang Li
Shanghai Artificial Intelligence Laboratory
Shuai Wang
Nanjing University
Chao Zhang
Shanghai Artificial Intelligence Laboratory