OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

📅 2026-02-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing forgery detection methods, which are often confined to single or dual modalities and struggle with real-world scenarios involving intertwined textual, visual, and video misinformation. Moreover, joint detection and localization tasks are prone to optimization bias due to task difficulty imbalance. To overcome these challenges, we propose a unified vision-language framework for forgery detection and localization, featuring a novel self-evolving chain-of-thought generation mechanism and an Adaptive Reward Scaling Policy Optimization (ARSPO) strategy. By harmonizing reinforcement learning to jointly optimize binary authenticity classification and fine-grained forgery localization, our approach effectively mitigates inter-task imbalance, enabling high-quality reasoning path synthesis and dynamic task balancing. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches across multiple benchmarks and exhibits strong zero-shot out-of-domain generalization capabilities.

📝 Abstract

Existing forgery detection methods are often limited to uni-modal or bi-modal settings and fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirement of simultaneous detection and localization pose a critical "difficulty bias" problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios.
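The abstract does not give the ARSPO formulation, so the following is only a toy sketch of the general idea of rebalancing two task rewards so that the easier task does not dominate. It is not the paper's algorithm. Everything in it is an assumption made for illustration: the names `RunningMean`, `task_weights`, and `combined_reward`, the use of an exponential moving average, and the choice of "1 minus mean reward" as a difficulty proxy.

```python
from dataclasses import dataclass


@dataclass
class RunningMean:
    """Exponential moving average of one task's reward (hypothetical helper)."""
    momentum: float = 0.9
    value: float = 0.5

    def update(self, reward: float) -> float:
        # Blend the previous estimate with the newly observed reward.
        self.value = self.momentum * self.value + (1.0 - self.momentum) * reward
        return self.value


def task_weights(mean_rewards: dict[str, float], eps: float = 1e-6) -> dict[str, float]:
    """Assumed heuristic: weight each task by (1 - mean reward), normalized to sum to 1.

    A task whose rewards are already high (e.g. binary detection) gets a small
    weight; a task that is still hard (e.g. grounding) gets a large one.
    """
    difficulty = {name: max(1.0 - m, 0.0) + eps for name, m in mean_rewards.items()}
    total = sum(difficulty.values())
    return {name: d / total for name, d in difficulty.items()}


def combined_reward(rewards: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-task rewards for a single sampled response."""
    return sum(weights[name] * rewards[name] for name in rewards)


if __name__ == "__main__":
    # Illustrative numbers only: detection is nearly solved, grounding is not.
    weights = task_weights({"detection": 0.9, "grounding": 0.3})
    print(weights)
    print(combined_reward({"detection": 1.0, "grounding": 0.0}, weights))
```

Under this assumed heuristic, a response that only gets the easy detection label right earns a small combined reward, which is one simple way to keep the policy gradient from being driven mostly by the classification task.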

Problem

Research questions and friction points this paper is trying to address.

vision-language forgery detection

multimodal misinformation

forgery grounding

difficulty bias

unified framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Vision-Language Forgery Detection

Balanced Reinforcement Learning

Self-Evolving CoT Generation