EnvTriCascade: An Environment-Aware Tri-Stage Cascaded Framework for ESDD2 2026 Challenge

📅 2026-05-18

🤖 AI Summary
This work addresses component-level speech spoofing detection in real-world scenarios, where speech and background audio may be manipulated independently. The authors propose an environment-aware, three-stage cascaded framework. A mix-consistency detector first produces a binary prior. Two five-class classification branches, built on SSLAM+XLS-R and EAT-large+XLS-R representations, are then fused through a cross-branch attention gating mechanism. RawBoost data augmentation improves robustness to varying mixing conditions. On the CompSpoofV2 test set, the method achieves a Macro-F1 score of 0.8266, significantly outperforming the official baseline and placing second in the ESDD2 challenge.
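The reported metric, Macro-F1, is the unweighted mean of per-class F1 scores, so each of the five classes counts equally regardless of how many samples it has. The sketch below is a minimal pure-Python illustration; the function name `macro_f1`, the integer label encoding 0 to 4, and the convention of scoring an empty class as 0.0 are assumptions, not details taken from the challenge's scoring script.

```python
def macro_f1(y_true, y_pred, num_classes=5):
    """Unweighted mean of per-class F1 scores (illustrative, not the official scorer)."""
    f1_scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        denom = 2 * tp + fp + fn
        # Assumption: a class with no true or predicted samples scores 0.0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Toy labels: one class-0 sample is misclassified as class 1.
y_true = [0, 1, 2, 3, 4, 0]
y_pred = [0, 1, 2, 3, 4, 1]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, a single weak class lowers the score as much as a weak majority class would.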
📝 Abstract
Audio deepfake detection (ADD) in real-world scenarios has evolved from speech-only spoofing to more challenging component-level settings, where speech and environmental sounds may be independently manipulated. To tackle this, we propose EnvTriCascade, an Environment-Aware Tri-Stage Cascaded framework for the ESDD2 Challenge. First, a mix-consistency detector provides a binary prior that distinguishes original recordings from manipulated mixtures and calibrates the final decisions. Next, two complementary five-class detectors, leveraging SSLAM+XLS-R and EAT-large+XLS-R representations, extract robust multi-branch features that are integrated via a cross-branch attention-gated classifier. To enhance robustness against diverse mixing conditions, we incorporate RawBoost augmentation. Trained exclusively on the official CompSpoofV2 dataset, our system achieves a Macro-F1 score of 0.8266 on the test set, significantly outperforming the official baseline and ranking second in the challenge.
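The abstract does not spell out how the branches are fused or how the binary prior calibrates the final decision, so the following is only a rough sketch of one way such a cascade could be wired. Everything in it is an assumption for illustration: the scalar sigmoid gate (a simplification of attention gating), the multiplicative reweighting of posteriors, the choice of index 0 as the "original" class, and all function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(out_a, out_b, gate_w, gate_b=0.0):
    """Blend two branch outputs with a scalar gate g in (0, 1).

    Hypothetical gate: a linear layer over the concatenated outputs.
    """
    concat = out_a + out_b
    g = sigmoid(sum(w * x for w, x in zip(gate_w, concat)) + gate_b)
    return [g * a + (1.0 - g) * b for a, b in zip(out_a, out_b)]


def calibrate_with_prior(class_probs, p_original, original_idx=0):
    """Reweight five-class posteriors by an original-vs-manipulated prior.

    Assumption: class `original_idx` means "unmanipulated recording".
    """
    weighted = [
        p * (p_original if i == original_idx else 1.0 - p_original)
        for i, p in enumerate(class_probs)
    ]
    total = sum(weighted)
    if total == 0.0:  # degenerate prior: fall back to the uncalibrated posteriors
        return list(class_probs)
    return [w / total for w in weighted]


# Toy five-class outputs from the two branches (made-up numbers).
branch_a = [2.0, 0.5, 0.1, -1.0, 0.0]
branch_b = [1.0, 1.5, 0.0, -0.5, 0.2]
fused = gated_fusion(branch_a, branch_b, gate_w=[0.0] * 10)  # zero weights give g = 0.5
probs = softmax(fused)
calibrated = calibrate_with_prior(probs, p_original=0.1)  # prior says "likely manipulated"
```

With a prior of 0.5 the calibration step leaves the posteriors unchanged, which makes it easy to check that the binary detector only shifts decisions when it is confident.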
Problem

Research questions and friction points this paper is trying to address.

audio spoofing detection
component-level manipulation
environmental sounds
speech forensics
real-world scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Environment-Aware
Tri-Stage Cascaded Framework
Mix-Consistency Detection
Cross-Branch Attention
RawBoost Augmentation
Authors
Hengyan Huang
Pursuing a degree in Intelligent Science and Technology at the Communication University of China.
AIGC, MLLM, Audio-Visual Processing
Xiaoxuan Guo
State Key Lab. of Media Convergence and Communication, Communication University of China, Beijing, China; Machine Intelligence, Ant Group, Shanghai, China
Jiayi Zhou
Machine Intelligence, Ant Group, Shanghai, China
Yuankun Xie
PhD Candidate, Communication University of China
Audio Deepfake Detection, Domain Generalization, Out-of-Distribution Detection, Neural Audio Codec
Jian Liu
Machine Intelligence, Ant Group, Shanghai, China
Haonan Cheng
State Key Lab. of Media Convergence and Communication, Communication University of China, Beijing, China
Long Ye
Communication University of China
Multimedia Signal Processing, Artificial Intelligence
Qin Zhang
Communication University of China
Information Technology, Artificial Intelligence