SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics

📅 2026-03-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a vulnerability of multimodal large language models (MLLMs) deployed in multi-turn settings: attackers can escalate unsafe intent gradually through the accumulated visual-text history, and safety degrades as the context grows. Existing alignment methods rely on single-turn data and fixed-template dialogues, so they generalize poorly to this setting. The authors propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student model under adaptive, on-policy attacks. It is paired with TCSR, a trajectory-consistent reward that uses the minimum and average safety of a trajectory to propagate late-turn failures back to earlier turns. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer improves both safety and helpfulness on single-turn and multi-turn benchmarks; for example, multi-turn safety rises from 12.55 to 55.58 on the 3B model and from 24.66 to 64.89 on the 7B model, and failures shift to later turns. The study also contributes STEER, a multi-turn multimodal safety alignment dataset.

📝 Abstract

MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment. To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We also introduce TCSR, which uses trajectory minimum/average safety to propagate late-turn failures to earlier turns.

I. Dataset. We release STEER, a multi-turn multimodal safety dataset with STEER-SFT (12,934), STEER-RL (2,000), and STEER-Bench (3,227) dialogues spanning 2 to 10 turns.

II. Experiment. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer substantially improves Safety/Helpfulness on both single-turn (48.30/45.86 -> 81.84/70.77 for 3B; 56.21/60.32 -> 87.89/77.40 for 7B) and multi-turn benchmarks (12.55/27.13 -> 55.58/70.27 for 3B; 24.66/46.48 -> 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone.

Code is available at https://github.com/Ed-Bg/SaFeR-Steer
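The abstract says only that TCSR uses a trajectory's minimum and average safety to propagate late-turn failures to earlier turns; it does not give the formula. The Python sketch below illustrates that general idea under stated assumptions and is not the authors' implementation. The function name `trajectory_shaped_rewards`, the blending weight `alpha`, and the equal weighting of minimum and average are all hypothetical choices made for illustration.

```python
from statistics import mean


def trajectory_shaped_rewards(turn_safety, alpha=0.5):
    """Blend each turn's safety score with trajectory-level statistics.

    turn_safety: per-turn safety scores for one dialogue (higher is safer).
    alpha: hypothetical weight on the trajectory-level term (an assumption,
           not a value taken from the paper).
    """
    if not turn_safety:
        raise ValueError("turn_safety must contain at least one turn")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Trajectory-level signal: the worst turn and the average turn.
    traj_min = min(turn_safety)
    traj_avg = mean(turn_safety)
    # Assumed combination: equal weight on minimum and average.
    traj_term = 0.5 * (traj_min + traj_avg)

    # Every turn shares the trajectory term, so a failure at a late turn
    # lowers the reward assigned to the earlier turns as well.
    return [(1.0 - alpha) * s + alpha * traj_term for s in turn_safety]


all_safe = trajectory_shaped_rewards([1.0, 1.0, 1.0])
late_failure = trajectory_shaped_rewards([1.0, 1.0, 0.0])
```

In this sketch a dialogue that is safe at every turn keeps its per-turn scores, while a dialogue that fails only at the last turn sees its first two turns penalized too, which is the propagation effect the abstract describes.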

Problem

Research questions and friction points this paper is trying to address.

multi-turn safety

MLLMs

safety alignment

long-context decay

unsafe intent escalation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn safety alignment

synthetic bootstrapping

trajectory-consistent reward