Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams

📅 2025-10-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the linear growth of inference cost with speaker count in overlapping multi-speaker ASR, this paper proposes speaker-agnostic dual-stream activity modeling: speaker-specific activity signals are merged into two speaker-independent temporal streams, decoupling inference complexity from the number of speakers. Heuristic continuity constraints preserve conversational coherence and allow seamless integration with existing single-speaker ASR models (e.g., Whisper) without architectural modification. Integrated into the DiCoW framework, the method efficiently handles discontinuous speech segments; evaluated on AMI and ICSI, it achieves up to a 3.2× runtime speedup while maintaining competitive word error rates (WER), overcoming the poor scalability of conventional activity-conditioned systems.

📝 Abstract
An increasingly common training paradigm for multi-talker automatic speech recognition (ASR) is to use speaker activity signals to adapt single-speaker ASR models for overlapping speech. Although effective, these systems require running the ASR model once per speaker, resulting in inference costs that scale with the number of speakers and limiting their practicality. In this work, we propose a method that decouples the inference cost of activity-conditioned ASR systems from the number of speakers by converting speaker-specific activity outputs into two speaker-agnostic streams. A central challenge is that naïvely merging speaker activities into streams significantly degrades recognition, since pretrained ASR models assume contiguous, single-speaker inputs. To address this, we design new heuristics aimed at preserving conversational continuity and maintaining compatibility with existing systems. We show that our approach is compatible with Diarization-Conditioned Whisper (DiCoW) to greatly reduce runtimes on the AMI and ICSI meeting datasets while retaining competitive performance.
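The core idea in the abstract can be illustrated with a toy sketch. The snippet below greedily assigns per-speaker activity intervals to two streams so that overlapping speakers land in different streams, while trying to keep each speaker in the stream they last used (a crude stand-in for the paper's continuity heuristics). The function name, data layout, and the greedy rule are all illustrative assumptions, not the paper's actual algorithm:

```python
# Hypothetical sketch of merging speaker-specific activities into two
# speaker-agnostic streams. The paper's heuristics are more elaborate;
# this greedy version only illustrates the core constraints: overlapping
# speakers must go to different streams, and a speaker should stay in
# the same stream across turns to preserve conversational continuity.

def merge_to_two_streams(activities):
    """activities: list of (speaker_id, start, end) tuples, times in seconds."""
    streams = {0: [], 1: []}   # stream index -> list of assigned segments
    last_stream = {}           # speaker -> stream last used (continuity bias)

    for spk, start, end in sorted(activities, key=lambda a: a[1]):
        def is_free(s):
            segs = streams[s]
            # stream is free if its last segment ends before this one starts
            return not segs or segs[-1][2] <= start

        # prefer the stream this speaker used before, if it is free
        prefer = last_stream.get(spk, 0)
        for s in (prefer, 1 - prefer):
            if is_free(s):
                streams[s].append((spk, start, end))
                last_stream[spk] = s
                break
        else:
            # more than two simultaneous speakers: skip (a real system
            # would need a tie-breaking or truncation policy here)
            continue
    return streams[0], streams[1]
```

With two streams fixed, the downstream ASR model runs exactly twice per recording regardless of how many speakers appear, which is the source of the claimed decoupling of cost from speaker count.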
Problem

Research questions and friction points this paper is trying to address.

Reducing multi-talker ASR inference cost scaling with speaker count
Converting speaker-specific activities into speaker-agnostic streams
Maintaining recognition performance while preserving conversational continuity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker-agnostic activity streams reduce computational cost
Heuristics preserve conversational continuity in merged inputs
Compatible with existing Diarization-Conditioned Whisper systems