SaFeR-ToolKit: Structured Reasoning via Virtual Tool Calling for Multimodal Safety

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the vulnerability of vision-language models to jailbreak attacks and their tendency toward over-refusal in multimodal settings. Both failures stem from safety judgments that require joint reasoning over visual evidence and user intent, a process that existing alignment methods supervise only through final outputs. To remedy this, the authors model safety decisions as a verifiable, structured protocol: a planner defines roles, toolsets, and state-transition graphs, while a responder generates typed tool-call trajectories. Protocol adherence is reinforced through a three-stage curriculum (SFT, DPO, GRPO), with GRPO providing, for the first time, direct supervision over the tool-use process rather than over answers alone. The team also constructs the first tool-calling-based safety reasoning dataset, demonstrating significant improvements in safety (e.g., the Qwen2.5-VL-3B safety score rises from 29.39 to 84.40), helpfulness, and reasoning rigor on 3B/7B models, while preserving general capabilities.
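The protocol described above can be made concrete with a minimal sketch: a constrained transition graph over tool stages, and a check that a typed key-value tool trace follows it. The stage names, trace format, and example payloads here are illustrative assumptions, not the authors' exact schema.

```python
# Allowed transitions between tool stages (a constrained transition graph).
# "START"/"END" are synthetic boundary states for this sketch.
TRANSITIONS = {
    "START": {"Perception"},
    "Perception": {"Perception", "Reasoning"},
    "Reasoning": {"Reasoning", "Decision"},
    "Decision": {"END"},
}

def trace_is_valid(trace):
    """Check that a list of (stage, key_value_args) tool calls follows the graph."""
    state = "START"
    for stage, args in trace:
        if stage not in TRANSITIONS.get(state, set()):
            return False  # transition not permitted by the graph
        if not isinstance(args, dict):
            return False  # payload must be a typed key-value record
        state = stage
    return "END" in TRANSITIONS.get(state, set())  # must end at Decision

# Hypothetical trace: perceive the image, reason about intent, then decide.
example = [
    ("Perception", {"image_content": "a locked door"}),
    ("Reasoning", {"user_intent": "benign DIY question"}),
    ("Decision", {"verdict": "answer", "rationale": "no harm identified"}),
]
print(trace_is_valid(example))  # → True
```

Because validity is a deterministic check on the emitted trace, it can serve as a verifiable reward signal during training, not just a post-hoc filter.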

📝 Abstract
Vision-language models remain susceptible to multimodal jailbreaks and over-refusal because safety hinges on both visual evidence and user intent, while many alignment pipelines supervise only the final response. To address this, we present SaFeR-ToolKit, which formalizes safety decision-making as a checkable protocol. Concretely, a planner specifies a persona, a Perception $\to$ Reasoning $\to$ Decision tool set, and a constrained transition graph, while a responder outputs a typed key-value tool trace before the final answer. To ensure the protocol is reliably followed in practice, we train a single policy with a three-stage curriculum (SFT $\to$ DPO $\to$ GRPO), where GRPO directly supervises tool usage beyond answer-level feedback. Our contributions are two-fold: I. Dataset. The first tool-based safety reasoning dataset, comprising 31,654 examples (SFT 6k, DPO 18.6k, GRPO 6k) plus 1k held-out evaluation examples. II. Experiments. On Qwen2.5-VL, SaFeR-ToolKit significantly improves Safety/Helpfulness/Reasoning Rigor on 3B (29.39/45.04/4.98 $\to$ 84.40/71.13/78.87) and 7B (53.21/52.92/19.26 $\to$ 86.34/80.79/85.34) models, while preserving general capabilities (3B: 58.67 $\to$ 59.21; 7B: 66.39 $\to$ 66.81). Code is available at https://github.com/Duebassx/SaFeR_ToolKit.
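The GRPO stage in the curriculum can be sketched as follows: rewards for a group of sampled trajectories are normalized group-relatively, and the reward here mixes answer-level feedback with a protocol-adherence term, reflecting the idea of supervising tool usage directly. The weights and reward terms are illustrative assumptions, not the paper's exact reward design.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """GRPO-style advantage: (r - mean) / std within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

def reward(answer_ok: bool, protocol_ok: bool, w_answer=0.7, w_protocol=0.3):
    """Mix answer-level feedback with direct supervision of the tool trace.
    The 0.7/0.3 weighting is a hypothetical choice for this sketch."""
    return w_answer * float(answer_ok) + w_protocol * float(protocol_ok)

# Four sampled trajectories for one prompt: (answer correct?, protocol followed?)
group = [(True, True), (True, False), (False, True), (False, False)]
advs = group_relative_advantages([reward(a, p) for a, p in group])
print([round(a, 2) for a in advs])  # → [1.31, 0.53, -0.53, -1.31]
```

Note that a correct answer reached without following the protocol (second trajectory) earns a smaller advantage than one that satisfies both terms, which is the extra signal answer-only supervision cannot provide.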
Problem

Research questions and friction points this paper is trying to address.

multimodal safety
vision-language models
jailbreaks
over-refusal
safety alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured reasoning
virtual tool calling
multimodal safety
GRPO
tool-based alignment
Zixuan Xu
Huazhong University of Science and Technology
Tiancheng He
Beijing University of Posts and Telecommunications
Huahui Yi
West China Hospital, Sichuan University
Kun Wang
Singapore University of Technology and Design
Deep Learning, Computer Vision
Xi Chen
West China Hospital, Sichuan University
Gongli Xi
Beijing University of Posts and Telecommunications
Qiankun Li
Research Fellow@NTU, Ph.D.@USTC
MLLM, AI4Health, Computer Vision, Pattern Recognition, Trustworthy AI
Kang Li
West China Hospital, Sichuan University
Yang Liu
Nanyang Technological University
Agent, Software Engineering, Cyber Security, Trustworthy AI, Software Security
Zhigang Zeng
Huazhong University of Science and Technology
Stability Analysis, Memristor, Computational Intelligence, Associative Memories, Neural Networks