🤖 AI Summary
This work addresses component-level audio deepfake detection in real-world scenarios, where speech and background audio may be manipulated independently. The authors propose EnvTriCascade, an environment-aware, three-stage cascaded framework. A mix-consistency detector first produces a binary prior that separates original recordings from manipulated mixtures. Two five-class detection branches, built on SSLAM+XLS-R and EAT-large+XLS-R representations, are then fused by a cross-branch attention-gated classifier. RawBoost data augmentation is used to improve robustness to diverse mixing conditions. On the CompSpoofV2 test set, the method achieves a Macro-F1 score of 0.8266, outperforming the official baseline and ranking second in the ESDD2 challenge.
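For reference, the reported metric can be written out as follows. This assumes the challenge uses the standard unweighted Macro-F1 over the C = 5 classes; the symbols TP_c, FP_c, and FN_c (per-class true positives, false positives, and false negatives) are notation introduced here, not taken from the paper.

```latex
\mathrm{F1}_c = \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c},
\qquad
\text{Macro-F1} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{F1}_c,
\qquad C = 5
```

Because every class contributes equally to the average, a system cannot reach a high score by doing well only on the most frequent class.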
📝 Abstract
Audio deepfake detection (ADD) in real-world scenarios has evolved from speech-only spoofing to more challenging component-level settings, where speech and environmental sounds may be independently manipulated. To tackle this, we propose EnvTriCascade, an Environment-Aware Tri-Stage Cascaded framework for the ESDD2 Challenge. First, a mix-consistency detector provides a binary prior that distinguishes original recordings from manipulated mixtures and is used to calibrate the final decisions. Next, two complementary five-class detectors, built on SSLAM+XLS-R and EAT-large+XLS-R representations, extract robust multi-branch features that are integrated by a cross-branch attention-gated classifier. To enhance robustness against diverse mixing conditions, we incorporate RawBoost augmentation. Trained exclusively on the official CompSpoofV2 dataset, our system achieves a Macro-F1 score of 0.8266 on the test set, significantly outperforming the official baseline and ranking second in the challenge.
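The cascade described above can be illustrated with a toy sketch. This is not the authors' implementation: it is heavily simplified and assumes a single scalar gate applied to branch logits (the paper's attention gate presumably operates on features), a product-style calibration rule, and a hypothetical class layout in which index 0 is the "original" class. All function names and numbers are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(logits_a, logits_b, gate_logit):
    """Blend two branches' five-class logits with a scalar gate in (0, 1).

    Assumption: a real cross-branch attention gate would be learned and
    feature-dependent; here the gate is a single hand-set number.
    """
    g = sigmoid(gate_logit)
    return [g * a + (1.0 - g) * b for a, b in zip(logits_a, logits_b)]


def calibrate_with_prior(probs, p_original, original_index=0):
    """Reweight class probabilities by a binary 'original vs. manipulated' prior.

    Assumption: the 'original' class is scaled by p_original, every other
    class by (1 - p_original), and the result is renormalized.
    """
    weighted = [
        p * (p_original if i == original_index else 1.0 - p_original)
        for i, p in enumerate(probs)
    ]
    total = sum(weighted)
    if total == 0.0:
        return list(probs)  # degenerate case: fall back to the uncalibrated scores
    return [w / total for w in weighted]


# Hypothetical logits from the two five-class branches.
branch_a = [2.0, 0.5, 0.1, -1.0, -0.5]
branch_b = [1.0, 1.5, 0.0, -0.5, -1.0]

fused = gated_fusion(branch_a, branch_b, gate_logit=0.0)  # gate = 0.5: plain average
probs = softmax(fused)

# Stage-one detector says the clip is probably a manipulated mixture.
calibrated = calibrate_with_prior(probs, p_original=0.1)
prediction = max(range(len(calibrated)), key=calibrated.__getitem__)
```

In this toy example the uncalibrated scores favor the "original" class, while a low stage-one prior shifts the decision to a manipulated class, which is the kind of correction a binary prior is meant to provide.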