OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

📅 2026-05-16
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the limitations of existing vision-language forgery detection methods, which rely on closed-world assumptions and struggle with real-time event verification and fine-grained tampering localization in open-world scenarios that require external evidence. To overcome these limitations, we propose a tool-augmented agent framework that unifies multimodal forensic reasoning by integrating real-time web search, local zooming, edge anomaly detection, facial analysis, video frame extraction, and SAM3-based segmentation. We introduce a tree-structured self-evolving tool trajectory generation strategy to construct a comprehensive reasoning dataset, and design a Checker-Guided Agentic Reinforcement Learning (CGARL) mechanism for process-level supervision and error correction. The method achieves state-of-the-art performance across multiple vision-language forensic tasks and demonstrates strong zero-shot generalization. The authors state that the FSTR dataset and code will be publicly released.
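As a rough illustration of how an agent might dispatch calls across such a tool environment, the following is a minimal sketch only. The tool names, the `ToolCall` type, the stub tools, and `run_trajectory` are all hypothetical; the source does not describe the actual tool interface or how observations are returned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical tool names mirroring the capabilities listed in the summary.
# The real tool interface is not specified in the source.
TOOL_NAMES = [
    "web_search", "zoom", "edge_anomaly",
    "face_analysis", "extract_frames", "segment",
]


@dataclass
class ToolCall:
    name: str       # which tool the agent wants to invoke
    argument: str   # free-form argument, e.g. a query or a region description


def make_stub(name: str) -> Callable[[str], str]:
    """Build a placeholder tool that only echoes its input (no real analysis)."""
    def stub(argument: str) -> str:
        return f"[{name}] observation for {argument!r}"
    return stub


# Registry mapping tool names to callables; stubs stand in for real tools.
REGISTRY: Dict[str, Callable[[str], str]] = {n: make_stub(n) for n in TOOL_NAMES}


def run_trajectory(calls: List[ToolCall]) -> List[Tuple[str, str]]:
    """Execute tool calls in order and record (tool name, observation) pairs."""
    steps: List[Tuple[str, str]] = []
    for call in calls:
        tool = REGISTRY.get(call.name)
        if tool is None:
            # Unknown tools are recorded as errors instead of raising.
            steps.append((call.name, "error: unknown tool"))
            continue
        steps.append((call.name, tool(call.argument)))
    return steps
```

Recording unknown-tool errors as observations, rather than raising, is one possible design choice: it lets a trajectory continue so that a later step can correct the mistake.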
πŸ“ Abstract
Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics, particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose OmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clue-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce Tree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly hinted hard-sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose Checker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at https://github.com/shen8424/OmniVL-Guard-Pro.
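The idea of penalizing a correct answer that rests on distorted reasoning can be illustrated with a toy reward function. This is a sketch under stated assumptions, not the CGARL objective: the abstract gives no reward formula, so the `Rollout` type, the boolean checker verdict, and the `penalty` value of 0.5 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    answer_correct: bool    # does the final verdict match the ground truth?
    reasoning_sound: bool   # verdict of a hypothetical process-level checker


def checker_guided_reward(rollout: Rollout, penalty: float = 0.5) -> float:
    """Toy reward: full credit only when answer and reasoning both hold up.

    The penalty value is an illustrative assumption, not taken from the paper.
    """
    if not rollout.answer_correct:
        return 0.0
    # Correct answer: subtract a penalty when the checker flags the reasoning.
    return 1.0 if rollout.reasoning_sound else 1.0 - penalty
```

Under these assumptions, a rollout that reaches the right answer through flagged reasoning earns less than one that is sound throughout, which is the qualitative behavior the abstract describes.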
Problem

Research questions and friction points this paper is trying to address.

vision-language forensics
open-world verification
forgery detection
tool-augmented agent
fine-grained manipulation analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

tool-augmented agent
open-world forensics
self-evolving tool trajectory
checker-guided reinforcement learning
vision-language forgery detection