LSZone: A Lightweight Spatial Information Modeling Architecture for Real-time In-car Multi-zone Speech Separation

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost that keeps multi-zone in-car speech separation from running in real time, this paper proposes LSZone, a lightweight spatial information modeling architecture. LSZone combines Mel-spectrogram and interaural phase difference (IPD) features to encode spatial cues, uses a spatial information extraction-compression (SpaIEC) module to reduce the computational burden while maintaining performance, and introduces an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module to model spatial information efficiently. The resulting model has a complexity of 0.56 G MACs and a real-time factor (RTF) of 0.37, and it performs well in complex-noise and multi-speaker scenarios. The contribution is a spatial modeling design that balances separation quality against the compute budget of in-vehicle deployment.
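As a rough illustration of the IPD feature mentioned above, the sketch below computes the phase difference between two microphone channels at a single time-frequency bin. It assumes the common definition IPD = angle(X_other) - angle(X_ref) and a cosine/sine encoding; the function names are hypothetical, and the paper's exact feature definition is not given here.

```python
import cmath
import math


def ipd(ref_bin: complex, other_bin: complex) -> float:
    """Phase of other_bin relative to ref_bin, in radians within [-pi, pi].

    Multiplying by the conjugate subtracts the reference phase, so the
    result is already wrapped and needs no explicit modulo.
    """
    return cmath.phase(other_bin * ref_bin.conjugate())


def ipd_features(ref_bin: complex, other_bin: complex) -> tuple[float, float]:
    """Encode the IPD as (cos, sin) so the feature has no jump at +/- pi."""
    delta = ipd(ref_bin, other_bin)
    return math.cos(delta), math.sin(delta)


def delayed_copy(ref_bin: complex, freq_hz: float, delay_s: float) -> complex:
    """STFT bin of the same source arriving delay_s seconds later.

    Assumption: an idealized plane wave with no reverberation, where a
    pure delay multiplies the bin by exp(-j * 2 * pi * f * delay).
    """
    return ref_bin * cmath.exp(-2j * math.pi * freq_hz * delay_s)


# A 500 Hz component that reaches the second microphone 0.2 ms later
# lags the reference by 2 * pi * 500 * 0.0002 = 0.2 * pi radians.
reference = cmath.rect(1.0, 0.3)
delayed = delayed_copy(reference, freq_hz=500.0, delay_s=2e-4)
phase_lag = ipd(reference, delayed)
```

Because the delay between microphones depends on where the talker sits, the IPD pattern across frequency differs from zone to zone, which is what makes it a useful spatial cue.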

📝 Abstract
In-car multi-zone speech separation, which captures voices from different speech zones, plays a crucial role in human-vehicle interaction. Although the earlier SpatialNet has achieved notable results, its high computational cost still hinders real-time applications in vehicles. To this end, this paper proposes LSZone, a lightweight spatial information modeling architecture for real-time in-car multi-zone speech separation. We design a spatial information extraction-compression (SpaIEC) module that combines Mel spectrogram and Interaural Phase Difference (IPD) to reduce computational burden while maintaining performance. Additionally, to efficiently model spatial information, we introduce an extremely lightweight Conv-GRU crossband-narrowband processing (CNP) module. Experimental results demonstrate that LSZone, with a complexity of 0.56 G MACs and a real-time factor (RTF) of 0.37, delivers impressive performance in complex noise and multi-speaker scenarios.
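For readers unfamiliar with the metric, the real-time factor is conventionally the processing time divided by the duration of the audio processed, so any value below 1 means the system keeps up with the incoming signal. The sketch below shows that arithmetic and one way to measure it; the helper names are hypothetical, and the hardware on which the reported 0.37 was measured is not stated here.

```python
import time
from typing import Callable


def processing_seconds(rtf: float, audio_seconds: float) -> float:
    """Time needed to process audio_seconds of audio at a given RTF."""
    return rtf * audio_seconds


def real_time_factor(process: Callable[[], None], audio_seconds: float) -> float:
    """Measure RTF as wall-clock processing time over audio duration.

    `process` is a placeholder for one full pass of a separation model
    over a clip that lasts audio_seconds.
    """
    start = time.perf_counter()
    process()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


# At the reported RTF of 0.37, ten seconds of audio would take about
# 3.7 seconds to process on the (unspecified) measurement hardware.
budget = processing_seconds(0.37, 10.0)
```

A measured RTF depends on the processor, thread count, and chunk size, so figures from different papers are only comparable when those conditions match.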
Problem

Research questions and friction points this paper is trying to address.

Lightweight architecture for real-time in-car speech separation
Reduces computational cost while maintaining separation performance
Handles complex noise and multi-speaker scenarios efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight spatial modeling for real-time speech separation
Extraction-compression module combines Mel spectrogram and IPD
Conv-GRU crossband-narrowband processing reduces computational complexity
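One plausible reason a Mel representation lowers cost is that it shrinks the frequency axis before any per-band processing runs. The sketch below builds a triangular Mel filterbank and applies it to one power-spectrum frame. The sizes (512-point FFT, 16 kHz, 64 Mel bands) and the HTK-style Mel formula are assumptions chosen for illustration; they do not describe the SpaIEC module's actual design.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style Hz to Mel conversion."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels: int, n_fft: int, sample_rate: int) -> list[list[float]]:
    """Return n_mels triangular filters over the n_fft // 2 + 1 STFT bins."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge frequencies, equally spaced on the Mel scale.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        low, center, high = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for freq in bin_hz:
            rising = (freq - low) / (center - low)
            falling = (high - freq) / (high - center)
            # Triangle: zero outside [low, high], peak of 1 at center.
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def compress(power_frame: list[float], bank: list[list[float]]) -> list[float]:
    """Project one frame of STFT power values onto the Mel bands."""
    return [sum(w * p for w, p in zip(row, power_frame)) for row in bank]


bank = mel_filterbank(n_mels=64, n_fft=512, sample_rate=16000)
frame = [1.0] * 257           # dummy power spectrum, one value per STFT bin
mel_frame = compress(frame, bank)
```

With these illustrative sizes the frequency axis drops from 257 bins to 64 bands, so a module that runs once per band would do roughly a quarter of the work.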
Jun Chen, Huawei Technologies Co., Ltd., Shanghai, China
Shichao Hu, Huawei Technologies Co., Ltd., Shanghai, China
Jiuxin Lin, Huawei Technologies Co., Ltd., Shanghai, China
Wenjie Li, Huawei Technologies Co., Ltd., Shanghai, China
Zihan Zhang, School of Software, Northwestern Polytechnical University, Xi'an, China
Xingchen Li, School of Software, Northwestern Polytechnical University, Xi'an, China
JinJiang Liu, Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Longshuai Xiao, Huawei Technologies Co., Ltd., Shanghai, China
Chao Weng, Anuttacon (Audio LLMs, Multimodal LLMs)
Lei Xie, School of Software, Northwestern Polytechnical University, Xi'an, China
Zhiyong Wu, Shenzhen International Graduate School, Tsinghua University, Shenzhen, China