🤖 AI Summary
Existing audio large language models struggle to achieve fine-grained understanding of speaker identity, vocal characteristics, and recording conditions, limiting their capacity for personalized and context-aware interaction. This work proposes SpeakerLLM, a unified framework that integrates a hierarchical speaker tokenizer (combining utterance-level embeddings with frame-level acoustic features) with verification-reasoning training targets and a natural-language interface. The model supports single-utterance profiling, pairwise speaker comparison, and evidence-based explainable reasoning: it separates profile-level evidence from the final same-or-different judgment and organizes both, together with the recording condition, into a structured decision trace. Experiments show that SpeakerLLM-Base outperforms general-purpose audio-LLMs in speaker-profile and recording-condition understanding, while SpeakerLLM-VR maintains high generated-verdict accuracy and produces decision traces that follow the supervised verification-reasoning schema.
📝 Abstract
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.
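To make the idea of a structured trace concrete, the following is a minimal sketch of how recording condition, profile evidence, and the final decision might be held as separate fields and rendered as text. It is an illustration only, not the authors' schema: the class names `ProfileEvidence` and `VerificationTrace`, the attribute names, and the output format are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProfileEvidence:
    """One profile-level observation for an utterance pair (hypothetical structure)."""
    attribute: str    # e.g. "pitch"; attribute names are illustrative assumptions
    utterance_a: str  # description for the first utterance
    utterance_b: str  # description for the second utterance

    @property
    def consistent(self) -> bool:
        # Evidence is consistent when both utterances share the same description.
        return self.utterance_a == self.utterance_b


@dataclass
class VerificationTrace:
    """Keeps the evidence and the verdict as separate fields (hypothetical structure)."""
    recording_condition: str
    evidence: List[ProfileEvidence]
    decision: str  # "same" or "different"; stored separately, not derived from evidence

    def render(self) -> str:
        # Order of the trace: recording condition, then evidence, then decision.
        lines = [f"Recording condition: {self.recording_condition}"]
        for item in self.evidence:
            tag = "consistent" if item.consistent else "inconsistent"
            lines.append(
                f"Evidence [{item.attribute}]: "
                f"A={item.utterance_a}, B={item.utterance_b} ({tag})"
            )
        lines.append(f"Decision: {self.decision}")
        return "\n".join(lines)


trace = VerificationTrace(
    recording_condition="both clips clean, close microphone",
    evidence=[
        ProfileEvidence("pitch", "low", "low"),
        ProfileEvidence("speaking rate", "fast", "moderate"),
    ],
    decision="same",
)
print(trace.render())
```

Storing the decision as its own field, rather than computing it from the evidence, mirrors the separation the abstract describes: a single inconsistent profile cue (here, speaking rate) does not by itself force a "different" verdict.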