Speaker Targeting via Self-Speaker Adaptation for Multi-talker ASR

📅 2025-06-27

📈 Citations: 0

✨ Influential: 0

career value

217K/year

🤖 AI Summary

Target-speaker automatic speech recognition (ASR) in streaming multi-speaker scenarios—particularly under full speaker overlap—remains challenging without explicit speaker queries (e.g., enrollment utterances or embeddings). Method: This paper proposes an adaptive speaker-directed framework featuring a speaker-aware voice activity detection (VAD) module that generates dynamic kernels, which are injected into the ASR encoder layers to enable fine-grained, temporally consistent speaker adaptation—without external speaker priors. Contribution/Results: To our knowledge, this is the first work to tightly integrate dynamic kernel injection with streaming ASR, enabling fully query-free online speaker tracking and recognition. Experiments demonstrate state-of-the-art performance in both offline and streaming settings, with substantial improvements in robustness and word error rate reduction under extreme speaker overlap conditions.

Technology Category

Application Category

📝 Abstract

We propose a self-speaker adaptation method for streaming multi-talker automatic speech recognition (ASR) that eliminates the need for explicit speaker queries. Unlike conventional approaches requiring target speaker embeddings or enrollment audio, our technique dynamically adapts individual ASR instances through speaker-wise speech activity prediction. The key innovation involves injecting speaker-specific kernels generated via speaker supervision activations into selected ASR encoder layers. This enables instantaneous speaker adaptation to target speakers while handling fully overlapped speech even in a streaming scenario. Experiments show state-of-the-art performance in both offline and streaming scenarios, demonstrating that our self-adaptive method effectively addresses severe speech overlap through streamlined speaker-focused recognition. The results validate the proposed self-speaker adaptation approach as a robust solution for multi-talker ASR under severe overlapping speech conditions.

Problem

Research questions and friction points this paper is trying to address.

Eliminates need for explicit speaker queries in ASR

Handles fully overlapped speech in streaming scenarios

Improves multi-talker ASR under severe overlapping conditions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-speaker adaptation without explicit queries

Dynamic ASR adaptation via activity prediction

Speaker-specific kernels in encoder layers

🔎 Similar Papers

No similar papers found.