🤖 AI Summary
This work addresses two key limitations in concept forgetting for vision-language models (VLMs): heavy reliance on large-scale annotated forgetting data and coarse-grained interventions that cause over-forgetting and utility degradation. We propose a fine-grained, selective concept suppression method. Our core innovation is the first use of sparse autoencoders (SAEs) for semantic feature localization and targeted intervention, enabling precise masking of both concrete and abstract concepts. By integrating semantic importance scoring with multimodal representation disentanglement, our approach supports cross-model transferability and concurrent forgetting of multiple concepts. Evaluations across 60 concepts show an average 18.04% improvement in forgetting quality while maintaining comparable downstream task performance. Moreover, the method exhibits strong adversarial robustness and scalability.
📝 Abstract
Unlearning methods for vision-language models (VLMs) have primarily adapted techniques from large language models (LLMs), relying on weight updates that demand extensive annotated forget sets. Moreover, these methods perform unlearning at a coarse granularity, often leading to excessive forgetting and reduced model utility. To address these issues, we introduce SAUCE, a novel method that leverages sparse autoencoders (SAEs) for fine-grained and selective concept unlearning in VLMs. Briefly, SAUCE first trains SAEs to capture high-dimensional, semantically rich sparse features. It then identifies the features most relevant to the target concept for unlearning. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information. We evaluate SAUCE on two distinct VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across two types of tasks: concrete concept unlearning (objects and sports scenes) and abstract concept unlearning (emotions, colors, and materials), encompassing a total of 60 concepts. Extensive experiments demonstrate that SAUCE outperforms state-of-the-art methods by 18.04% in unlearning quality while maintaining comparable model utility. Furthermore, we investigate SAUCE's robustness against widely used adversarial attacks, its transferability across models, and its scalability in handling multiple simultaneous unlearning requests. Our findings establish SAUCE as an effective and scalable solution for selective concept unlearning in VLMs.
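The three-step pipeline in the abstract (train an SAE on model activations, score which sparse features are most relevant to the target concept, then mask those features at inference) can be sketched roughly as follows. This is a minimal toy illustration, not the paper's implementation: the SAE weights are random stand-ins for a trained model, and the mean-activation-difference scoring rule and hard zero-masking are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SAE: maps d-dim VLM activations to k sparse features and back.
# In SAUCE the SAE would be trained on real activations; here the
# weights are random placeholders so the sketch is self-contained.
d, k = 16, 64
W_enc = rng.normal(0.0, 0.1, (d, k))  # encoder weights (assumed pre-trained)
b_enc = np.zeros(k)
W_dec = W_enc.T.copy()                # tied decoder weights for simplicity

def sae_encode(x):
    # ReLU yields sparse, non-negative feature activations.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    return f @ W_dec

def concept_feature_scores(concept_acts, background_acts):
    # Assumed importance score: mean feature activation on inputs
    # containing the concept minus mean activation on background inputs.
    return sae_encode(concept_acts).mean(axis=0) - sae_encode(background_acts).mean(axis=0)

def suppress_concept(x, scores, top_n=5):
    # Zero out the top-n concept-relevant features at inference time,
    # leaving all other features (and thus unrelated information) intact.
    mask = np.ones(k)
    mask[np.argsort(scores)[-top_n:]] = 0.0
    return sae_decode(sae_encode(x) * mask)

# Usage: synthetic "concept" vs. "background" activation batches.
concept_acts = rng.normal(1.0, 1.0, (32, d))
background_acts = rng.normal(0.0, 1.0, (32, d))
scores = concept_feature_scores(concept_acts, background_acts)
edited_acts = suppress_concept(concept_acts, scores)  # fed back into the VLM
```

Because the intervention only edits activations at inference, no weight updates or annotated forget sets are needed, which is the contrast the abstract draws with LLM-style unlearning methods.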