AI Summary
This work addresses the vulnerability of multi-turn multimodal large language models (MLLMs) to attacks that escalate gradually through the visual-text dialogue history, a setting in which safety degrades markedly over long contexts. Existing alignment methods rely on single-turn data and fixed-template dialogues, so they generalize poorly to real deployment. To close this gap, the authors propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student model under adaptive, on-policy attacks. The framework also introduces TCSR, a reward that uses trajectory minimum/average safety to propagate late-turn failures to earlier turns. Experiments on Qwen2.5-VL-3B/7B show substantial gains in Safety and Helpfulness on both single- and multi-turn benchmarks; for example, multi-turn Safety rises from 12.55 to 55.58 for the 3B model and from 24.66 to 64.89 for the 7B model, and failures shift to later turns. The study also releases STEER, a multi-turn multimodal safety dataset with SFT, RL, and benchmark splits.
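As a rough illustration of the GRPO component mentioned above, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: several responses are sampled for the same prompt, and each reward is standardized against the group's mean and standard deviation. The function name `group_relative_advantages` and the `eps` constant are assumptions made for this example; the summary does not specify how SaFeR-Steer computes advantages, so this should not be read as the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (hypothetical helper).

    rewards: scalar rewards for several responses to the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Responses that score above the group mean get a positive advantage,
# responses below it get a negative one.
advantages = group_relative_advantages([0.9, 0.4, 0.1, 0.6])
```

Because the baseline is the group mean, no separate value network is needed in this sketch; a group with identical rewards yields all-zero advantages and therefore contributes no gradient signal.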
Abstract
MLLMs are increasingly deployed in multi-turn settings, where attackers can escalate unsafe intent through the evolving visual-text history and exploit long-context safety decay. Yet safety alignment is still dominated by single-turn data and fixed-template dialogues, leaving a mismatch between training and deployment. To bridge this gap, we propose SaFeR-Steer, a progressive multi-turn alignment framework that combines staged synthetic bootstrapping with tutor-in-the-loop GRPO to train a single student under adaptive, on-policy attacks. We also introduce TCSR, which uses trajectory minimum/average safety to propagate late-turn failures to earlier turns. I. Dataset. We release STEER, a multi-turn multimodal safety dataset with STEER-SFT (12,934), STEER-RL (2,000), and STEER-Bench (3,227) dialogues spanning 2 to 10 turns. II. Experiment. Starting from Qwen2.5-VL-3B/7B, SaFeR-Steer substantially improves Safety/Helpfulness on both single-turn benchmarks (48.30/45.86 -> 81.84/70.77 for 3B; 56.21/60.32 -> 87.89/77.40 for 7B) and multi-turn benchmarks (12.55/27.13 -> 55.58/70.27 for 3B; 24.66/46.48 -> 64.89/72.35 for 7B), shifting failures to later turns and yielding robustness beyond scaling alone. Code is available at https://github.com/Ed-Bg/SaFeR-Steer
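The idea behind TCSR, using trajectory minimum/average safety so that a late-turn failure also lowers the reward of earlier turns, can be sketched as follows. The abstract does not give the exact formula, so the convex blend below, the weights `w_turn`, `w_min`, `w_avg`, and the function name `trajectory_shaped_rewards` are all assumptions chosen for illustration; per-turn safety scores are assumed to lie in [0, 1].

```python
from statistics import mean


def trajectory_shaped_rewards(turn_safety, w_turn=0.5, w_min=0.3, w_avg=0.2):
    """Blend each turn's safety score with trajectory-level statistics.

    turn_safety: per-turn safety scores for one dialogue (assumed in [0, 1]).
    w_turn, w_min, w_avg: hypothetical mixing weights, not from the paper.
    """
    if not turn_safety:
        return []
    traj_min = min(turn_safety)   # worst turn anywhere in the dialogue
    traj_avg = mean(turn_safety)  # overall safety of the dialogue
    return [
        w_turn * s + w_min * traj_min + w_avg * traj_avg
        for s in turn_safety
    ]


# Two dialogues with identical early turns; only the last turn differs.
all_safe = trajectory_shaped_rewards([0.9, 0.9, 0.9])
late_failure = trajectory_shaped_rewards([0.9, 0.9, 0.1])
```

In this sketch the first turn of `late_failure` receives a lower reward than the first turn of `all_safe`, even though both turns have the same per-turn score, because the trajectory minimum and average carry the late failure back to earlier turns.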