🤖 AI Summary
This work addresses the computational overhead, latency, and privacy concerns associated with conventional on-device speech enhancement systems that rely on separate auxiliary models for tasks such as voice activity detection, noise classification, and fundamental frequency (F0) estimation. The study reveals, for the first time, that the internal binary masks generated by dynamic channel pruning (DynCP)—originally designed for model compression—encode rich semantic information. By repurposing these masks as a multifunctional signal-analysis tool and applying lightweight, interpretable predictors directly to them (which, for binary masks, reduce to weighted sums), the proposed method simultaneously estimates multiple speech attributes without requiring additional dedicated models. Experimental results demonstrate strong performance in voice activity detection (93% accuracy), noise classification (84% accuracy), and F0 estimation (R² = 0.86), all with negligible inference overhead.
📝 Abstract
Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification to ensure robust context-aware behavior and a seamless user experience. Just like SE, these tasks often employ deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% on noise classification, and an R² of 0.86 on F0 estimation. With binary masks, predictions reduce to weighted sums, inducing negligible overhead. Our contribution is twofold: on one hand, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, to reveal what they are learning; on the other, we repurpose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.
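To illustrate why the overhead is negligible, the sketch below shows how a linear predictor over binary pruning masks collapses to a weighted sum: the prediction is simply the sum of the learned weights of the currently active channels. All shapes, names, and (random) weights here are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame binary pruning masks from a DynCP speech-enhancement
# model: 1 = channel active, 0 = channel pruned for the current input.
# Dimensions are illustrative, not taken from the paper.
n_frames, n_channels = 4, 16
masks = rng.integers(0, 2, size=(n_frames, n_channels)).astype(float)

# A lightweight, interpretable predictor on the masks: a linear model.
# Weights are random stand-ins; in practice they would be fit to a
# downstream task such as VAD.
w = rng.normal(size=n_channels)
b = 0.1

logits = masks @ w + b                        # linear prediction per frame
vad = (1.0 / (1.0 + np.exp(-logits))) > 0.5   # e.g. frame-level VAD decision

# Equivalent "weighted sum" view: because each mask entry is 0 or 1,
# the dot product just sums the weights of the active channels.
logits_ws = np.array([w[m.astype(bool)].sum() + b for m in masks])
assert np.allclose(logits, logits_ws)
```

Since the masks are computed anyway as part of DynCP inference, the only extra cost of such a predictor is one dot product per frame.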