🤖 AI Summary
Existing deepfake detection methods struggle to effectively integrate spatial- and frequency-domain forgery cues, resulting in insufficient robustness against complex artifacts. To address this, we propose a dual-domain fusion and feature-stacking framework: (1) a dual-domain collaborative attention fusion architecture that jointly models spatial features (via a CNN-Transformer hybrid backbone) and DCT-domain frequency features; and (2) an inter-domain gated fusion mechanism coupled with a gradient-aware learnable feature-stacking module that overcomes the limitations of single-domain representations. Evaluated on FaceForensics++ and Celeb-DF, the method achieves a mean accuracy of 98.7%, outperforming state-of-the-art approaches by a significant margin, and demonstrates strong generalization to unseen datasets as well as robustness against common video compression distortions.
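The inter-domain gated fusion described above can be sketched as a learned convex combination of the two feature streams. The snippet below is a minimal illustration, not the paper's implementation: the names `gated_fusion`, `W`, and `b` are hypothetical, the gate is assumed to be a sigmoid over the concatenated spatial and frequency features, and the gradient-aware stacking module is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_spatial, f_freq, W, b):
    """Inter-domain gated fusion (illustrative sketch).

    A gate g in (0, 1), computed from both streams, weights the
    spatial features against the frequency features per dimension:
        g = sigmoid([f_spatial; f_freq] @ W + b)
        fused = g * f_spatial + (1 - g) * f_freq
    """
    concat = np.concatenate([f_spatial, f_freq], axis=-1)
    g = sigmoid(concat @ W + b)
    return g * f_spatial + (1.0 - g) * f_freq

# Toy feature vectors standing in for the backbone outputs.
d = 8
f_s = rng.standard_normal(d)          # spatial-stream features
f_f = rng.standard_normal(d)          # DCT frequency-stream features
W = rng.standard_normal((2 * d, d)) * 0.1  # learned gate weights (random here)
b = np.zeros(d)

fused = gated_fusion(f_s, f_f, W, b)
print(fused.shape)
```

Because the gate lies in (0, 1), each fused component is guaranteed to stay between the corresponding spatial and frequency values, so neither domain can be entirely discarded; in the actual model `W` and `b` would be trained end-to-end with the rest of the network.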