Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) are vulnerable to contextual or visual jailbreak attacks during multimodal reasoning and struggle to detect or rectify their own harmful outputs. To address this, the authors propose "Think-Reflect-Revise" (TRR), a closed-loop safety reasoning framework. The method introduces a backward feedback mechanism triggered by harmful signals detected *during* generation, combined with policy-guided self-reflection and reinforcement learning (RL)-driven self-correction. The authors construct the Reflective Safety Reasoning (ReSafe) dataset of 5,000 high-quality samples, use it for supervised fine-tuning to initialize the model's reflection capability, and then optimize the reflection and revision behavior with RL. Evaluated on Qwen2.5-VL-7B, the approach raises the overall safe response rate from 42.8% to 87.7%, substantially outperforming baselines while maintaining stable performance on general multimodal benchmarks such as MMMU and MMStar.
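The closed-loop behavior described above can be sketched as a simple control loop: generate a first-pass answer, check it for harmful signals, and if any are found, reflect on the draft and revise it before responding. The sketch below is a minimal illustration of that loop, not the authors' implementation; the model interface (`generate`, `detect_harm`, `reflect`, `revise`) and the `MockLVLM` stand-in are hypothetical placeholders.

```python
class MockLVLM:
    """Toy stand-in for an LVLM; flags any draft containing 'UNSAFE'."""

    def generate(self, prompt, image):
        # Think: first-pass reasoning produces a (possibly harmful) draft.
        return "UNSAFE: first-pass draft"

    def detect_harm(self, draft):
        # Harmful-signal check on the model's own output.
        return "UNSAFE" in draft

    def reflect(self, prompt, image, draft):
        # Reflect: critique the draft against a safety policy.
        return "The draft violates the safety policy."

    def revise(self, prompt, image, draft, critique):
        # Revise: regenerate conditioned on the critique.
        return "I can't help with that request."


def think_reflect_revise(model, prompt, image, max_revisions=2):
    """Closed-loop think-reflect-revise: revise while harm is detected."""
    draft = model.generate(prompt, image)
    for _ in range(max_revisions):
        if not model.detect_harm(draft):
            break  # no harmful signal: accept the draft as the final answer
        critique = model.reflect(prompt, image, draft)
        draft = model.revise(prompt, image, draft, critique)
    return draft


if __name__ == "__main__":
    print(think_reflect_revise(MockLVLM(), "example prompt", None))
```

In the paper's actual pipeline, the reflection and revision behaviors are first initialized by supervised fine-tuning on ReSafe and then reinforced with RL, rather than implemented as hand-written rules as in this mock.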

📝 Abstract
As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at https://think-reflect-revise.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Weak safety alignment in Large Vision Language Models (LVLMs)
Vulnerability of LVLMs to contextual and visual jailbreak attacks
Inability of single-pass reasoning to detect and correct harmful content in the model's own output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-stage training framework built around a think-reflect-revise process
Policy-guided self-reflection reinforced via reinforcement learning
Reflective Safety Reasoning (ReSafe) dataset of 5,000 examples for fine-tuning