ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Multimodal large language models may memorize sensitive cross-modal information during pretraining, and existing machine unlearning methods often neglect generation quality after unlearning, which leads to hallucinated or rigid responses. To address this, the authors propose ASRU, a controllable unlearning framework that treats generation quality as a core objective alongside unlearning effectiveness. ASRU first induces initial refusal behavior through activation redirection, then applies reinforcement learning fine-tuning with a customized reward function to calibrate the refusal boundary at a fine-grained level. Experiments on Qwen3-VL show that ASRU improves unlearning effectiveness by 24.6% on average and generation quality by 5.8× on average, while preserving the model's general capabilities using only a small amount of retained supervision data.

📝 Abstract

Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations while overlooking generation quality after unlearning. This can easily lead to hallucinated or rigid responses, which harms the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that, on average, ASRU improves unlearning effectiveness by 24.6% and generation quality by 5.8× while effectively preserving model utility, using only a small amount of retained supervision data.
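As a rough illustration of the two stages named in the abstract, the sketch below shows one common way to steer activations toward refusal (a difference-of-means direction added to a hidden state) and a toy reward that scores refusals on forget targets and answers on retained inputs. This is not the authors' implementation: the abstract does not specify how the redirection is computed or what the reward contains, so the difference-of-means choice, the function names `steering_direction`, `redirect`, and `toy_reward`, the `alpha` scale, and the refusal marker strings are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_direction(refusal_acts: Sequence[Sequence[float]],
                       comply_acts: Sequence[Sequence[float]]) -> Vector:
    """Unit vector pointing from 'comply' activations toward 'refusal' ones.

    Assumption: a difference-of-means direction, a common activation-steering
    heuristic; the paper's actual redirection may differ.
    """
    diff = [r - c for r, c in zip(mean_vector(refusal_acts),
                                  mean_vector(comply_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("refusal and comply activations have identical means")
    return [x / norm for x in diff]


def redirect(hidden: Sequence[float], direction: Sequence[float],
             alpha: float) -> Vector:
    """Shift one hidden state along the steering direction by scale alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical refusal markers; a real reward would use a stronger detector.
REFUSAL_MARKERS = ("cannot", "can't", "unable to")


def toy_reward(response: str, is_forget_target: bool) -> float:
    """Toy reward: +1 for refusing forget targets or answering retained
    inputs, -1 otherwise. It does not model generation quality, which the
    abstract says the actual reward accounts for.
    """
    refused = any(marker in response.lower() for marker in REFUSAL_MARKERS)
    return 1.0 if refused == is_forget_target else -1.0


if __name__ == "__main__":
    direction = steering_direction([[1.0, 0.0], [3.0, 0.0]],
                                   [[0.0, 0.0], [0.0, 0.0]])
    print(redirect([0.5, 0.5], direction, alpha=2.0))
    print(toy_reward("I cannot help with that.", is_forget_target=True))
```

In this sketch the steering step only supplies a coarse initial push toward refusal; a reward of this general shape is what a subsequent reinforcement learning stage would optimize, so that refusals land on the forget set rather than on retained inputs.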

Problem

Research questions and friction points this paper is trying to address.

machine unlearning

multimodal large language models

generation quality

sensitive information

model safety

Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation Steering

Reinforcement Unlearning

Multimodal Large Language Models