GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of controllable, deep cross-modal reasoning in vision-language models (VLMs) for content-safety moderation, this paper proposes a reasoning-based guard model trained via online reinforcement learning (RL). The method first cold-starts the model's reasoning ability with supervised fine-tuning (SFT) on a purpose-built reasoning corpus, then strengthens moderation reasoning with PPO-based online RL. The RL stage combines rejection sampling with safety-aware data concatenation to raise sample diversity and difficulty, uses a dynamic clipping parameter to favor exploration early and exploitation later in training, and optimizes a length-aware safety reward that integrates accuracy, format, and token cost. The approach improves the robustness and interpretability of moderation decisions, surpassing the runner-up baseline by 19.27% F1 score on average across standard benchmarks. To foster reproducibility in trustworthy multimodal safety auditing, the authors publicly release the guard models (3B and 7B variants), a multimodal reasoning corpus of 123K samples, and the complete training code.

📝 Abstract
To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model's reasoning ability via SFT. In addition, we further enhance reasoning regarding moderation through online RL. Concretely, to enhance diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/
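The abstract's dynamic clipping idea — a wider PPO clip range early in training (exploration) that narrows over time (exploitation) — can be sketched as follows. This is an illustrative sketch only: the function names, the linear annealing schedule, and the endpoint values `eps_hi`/`eps_lo` are assumptions, not the paper's exact formulation.

```python
def dynamic_clip_epsilon(step, total_steps, eps_hi=0.3, eps_lo=0.1):
    """Linearly anneal the PPO clip range from eps_hi (early, exploratory)
    down to eps_lo (late, exploitative). Schedule and endpoints are
    illustrative assumptions."""
    frac = min(step / max(total_steps, 1), 1.0)
    return eps_hi + frac * (eps_lo - eps_hi)

def ppo_clipped_objective(ratio, advantage, eps):
    """Standard PPO clipped surrogate for a single sample:
    min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a large ratio (e.g. 2.0) and positive advantage, an early step (eps = 0.3) permits a larger update than a late step (eps = 0.1), matching the intended exploration-to-exploitation shift.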
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLM safety via reinforced reasoning
Developing a reasoning corpus for moderation decisions
Balancing performance and token efficiency in moderation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforced reasoning via online RL for moderation
Safety-aware data concatenation for sample diversity
Length-aware safety reward balancing performance and token efficiency
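The length-aware safety reward described above integrates accuracy, format, and token cost. A minimal sketch of such a reward, assuming additive terms with a linear length penalty (the weights, penalty shape, and function signature are illustrative, not the paper's exact design):

```python
def length_aware_reward(correct, well_formatted, num_tokens,
                        max_tokens=512, w_acc=1.0, w_fmt=0.2, w_len=0.1):
    """Combine an accuracy term, a format-compliance term, and a
    token-cost penalty into a single scalar reward. All weights and
    the linear penalty are illustrative assumptions."""
    r_acc = w_acc if correct else 0.0
    r_fmt = w_fmt if well_formatted else 0.0
    # Penalize longer outputs proportionally (capped at max_tokens),
    # so the policy is pushed toward concise but still correct reasoning.
    penalty = w_len * min(num_tokens / max_tokens, 1.0)
    return r_acc + r_fmt - penalty
```

Under this sketch, a correct, well-formatted short answer scores higher than an equally correct but verbose one, which is the performance/token-efficiency trade-off the reward is meant to balance.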