Consistent and Invariant Generalization Learning for Short-video Misinformation Detection

πŸ“… 2025-07-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Short-video misinformation detection suffers from poor cross-domain generalization: models trained on a source domain exhibit significant performance degradation on unseen target domains. To address this, we propose a novel framework that jointly enforces modality consistency and domain invariance. Specifically, we introduce cross-modal interpolation distillation and diffusion-guided domain-invariant feature learningβ€”the first such approach in this setting. By interpolating features across modalities, we construct a shared semantic space; further, we leverage the denoising process of diffusion models to enable cross-modal co-guidance during noise injection and removal, thereby mitigating both domain shift and modality bias. Our method effectively suppresses dependency discrepancies and bias accumulation inherent in multimodal fusion. Extensive experiments on multiple real-world short-video datasets demonstrate substantial improvements over state-of-the-art domain generalization methods, with strong generalization robustness. The code is publicly available.

πŸ“ Abstract
Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify misinformation in videos accompanied by corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To realize domain generalization for short-video misinformation detection, we offer deep insights into the characteristics of different domains: (1) Detection in different domains may rely mainly on different modalities (i.e., focusing primarily on video or audio). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) Some domains involve cross-modal joint fraud and require comprehensive analysis via cross-modal fusion. However, domain biases located in each modality (especially in individual video frames) accumulate during this fusion process, which can seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) cross-modal feature interpolation to map multiple modalities into a shared space, with interpolation distillation to synchronize multi-modal learning; (2) a diffusion model that adds noise while retaining core multi-modal features and enhances domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at https://github.com/ghh1125/DOCTOR.
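The first module (cross-modal feature interpolation plus interpolation distillation) might be sketched as follows. This is a minimal illustrative simplification, not the authors' implementation: the mixup-style interpolation, the function names, and the use of a standard softened-KL distillation loss are all assumptions.

```python
import numpy as np

def interpolate_features(video_feat, audio_feat, lam=0.5):
    """Mixup-style interpolation of two modality features into a shared
    space (hypothetical stand-in for the paper's cross-modal interpolation)."""
    return lam * video_feat + (1.0 - lam) * audio_feat

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened distributions, as in
    standard knowledge distillation; sketched here as one way interpolation
    distillation could synchronize learning across modalities."""
    def soft(x):
        z = x / temperature
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    p_t, p_s = soft(teacher_logits), soft(student_logits)
    return float(np.sum(p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9))))
```

As a usage sketch, interpolating a video feature and an audio feature with `lam=0.5` yields their midpoint in the shared space, and the distillation loss is zero when the two modality branches already agree.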
Problem

Research questions and friction points this paper is trying to address.

Detect misinformation in short videos across unseen domains
Address domain gaps affecting multi-modal model performance
Reduce cross-modal bias accumulation in misinformation identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal feature interpolation for shared space mapping
Interpolation distillation to synchronize multi-modal learning
Diffusion model for core feature retention and denoising
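The diffusion-based module listed above could be sketched as a DDPM-style forward noising step followed by a denoising step nudged toward the other modality. The guidance-by-blending formulation and all names here are illustrative assumptions, not the DOCTOR architecture itself.

```python
import numpy as np

def add_noise(x0, alpha_bar, rng):
    """DDPM-style forward process: keep a sqrt(alpha_bar) fraction of the
    clean feature and inject Gaussian noise, washing out modality-specific
    (potentially domain-biased) detail while retaining core structure."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

def guided_denoise(x_t, eps_pred, alpha_bar, guide, guide_weight=0.1):
    """Recover an estimate of the clean feature from the noised one, then
    blend it toward the other modality's representation (a hypothetical
    stand-in for cross-modal guided denoising)."""
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)
    return (1.0 - guide_weight) * x0_hat + guide_weight * guide
```

With a perfect noise prediction and `guide_weight=0`, the denoising step recovers the original feature exactly; a nonzero `guide_weight` pulls the reconstruction toward the guiding modality, which is the intuition behind enhancing domain-invariant features via cross-modal guidance.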