Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of applying autoregressive generative models to low-latency streaming target speaker extraction, a task hindered by the models' inherent reliance on global context. The study proposes a chunk-wise interleaved splicing paradigm to enable efficient real-time inference and introduces a historical context refinement mechanism to mitigate discontinuities at the boundaries between extracted speech segments. Evaluated on the Libri2Mix dataset, the streaming system maintains 100% inference stability and high speech intelligibility, with results comparable to or better than offline baselines. It attains a real-time factor (RTF) of 0.248 on consumer-level GPUs, meaning it processes audio about four times faster than real time, which supports its feasibility for real-world deployment.
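To make the RTF figure concrete, the short sketch below computes a real-time factor under the common definition (processing time divided by the duration of the audio processed). The function name and the timing numbers are hypothetical, chosen only so that the result matches the reported 0.248; they are not measurements from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time / duration of the audio that was processed.

    RTF < 1 means the system runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings (assumption, not from the paper): 2.48 s of compute
# for 10 s of audio gives an RTF close to the reported 0.248.
rtf = real_time_factor(processing_seconds=2.48, audio_seconds=10.0)
speedup = 1.0 / rtf  # how many times faster than real time
print(f"RTF = {rtf:.3f}, about {speedup:.1f}x faster than real time")
```

An RTF below 1 is necessary for streaming use but not sufficient: end-to-end latency also depends on the chunk length, since a chunk has to arrive before it can be processed.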

📝 Abstract

While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic performance degradation due to the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing Paradigm that ensures highly efficient and stable streaming inference. To keep the extracted speech segments coherent, we design a historical context refinement mechanism that mitigates boundary discontinuities by leveraging historical information. Experiments on Libri2Mix show that while the AR generative baseline exhibits performance degradation at low latencies, our approach maintains 100% stability and superior intelligibility. Furthermore, our streaming results are comparable to or even surpass offline baselines. Additionally, our model achieves a Real-Time Factor (RTF) of 0.248 on consumer-level GPUs. This work provides empirical evidence that AR generative backbones are viable for latency-sensitive applications through the Chunk-wise Interleaved Splicing Paradigm.
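The abstract does not spell out the sequence layout, so the following is only an illustrative sketch of one plausible reading of chunk-wise interleaved splicing: mixture chunks and target chunks alternate in the AR model's context, so each target chunk can be generated as soon as its mixture chunk arrives. Everything here is an assumption for illustration: audio is treated as already discretized into integer tokens, `interleave_chunks`, `stream_extract`, and the `generate_chunk` callable are hypothetical names, and the speaker enrollment input and the historical context refinement step are omitted.

```python
from typing import Callable, List, Sequence


def split_chunks(tokens: Sequence[int], size: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def interleave_chunks(mixture: Sequence[int], target: Sequence[int],
                      size: int) -> List[int]:
    """Assumed training-time layout: [mix_1, tgt_1, mix_2, tgt_2, ...]."""
    mix_chunks = split_chunks(mixture, size)
    tgt_chunks = split_chunks(target, size)
    if len(mix_chunks) != len(tgt_chunks):
        raise ValueError("mixture and target must yield the same number of chunks")
    spliced: List[int] = []
    for mix, tgt in zip(mix_chunks, tgt_chunks):
        spliced.extend(mix)  # mixture chunk comes first
        spliced.extend(tgt)  # then the target chunk it conditions
    return spliced


def stream_extract(mixture_chunks: Sequence[Sequence[int]],
                   generate_chunk: Callable[[List[int], int], List[int]]) -> List[int]:
    """Assumed inference loop: the context grows in the same interleaved order.

    `generate_chunk(context, n)` stands in for the AR model and must return
    n target tokens given the interleaved context seen so far.
    """
    context: List[int] = []
    output: List[int] = []
    for mix in mixture_chunks:
        context.extend(mix)                      # newly arrived mixture chunk
        tgt = generate_chunk(context, len(mix))  # generate only this chunk
        context.extend(tgt)                      # splice the result back in
        output.extend(tgt)
    return output


# Toy usage with a stand-in "model" that adds 100 to the latest mixture tokens.
spliced = interleave_chunks([1, 2, 3, 4, 5, 6], [101, 102, 103, 104, 105, 106], size=2)
streamed = stream_extract([[1, 2], [3, 4]], lambda ctx, n: [t + 100 for t in ctx[-n:]])
```

Under this reading, the inference-time context has the same interleaved order as the training sequences, which is one way the training/streaming mismatch mentioned in the abstract could be reduced; the paper's actual construction may differ.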

Problem

Research questions and friction points this paper is trying to address.

Target Speaker Extraction

Streaming

Autoregressive Language Model

Real-time Inference

Latency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive Generative Model

Streaming Target Speaker Extraction

Chunk-wise Interleaved Splicing