🤖 AI Summary
Existing concept erasure methods primarily target diffusion models and adapt poorly to visual autoregressive (VAR) models, owing to instability arising from VARs' next-scale token prediction architecture. This work introduces VARE, the first dedicated concept erasure framework for VAR models, along with S-VARE, a refined method built upon it. To enhance training stability, VARE leverages auxiliary visual tokens to reduce fine-tuning intensity; S-VARE adds a filtered cross-entropy loss to suppress unsafe token predictions with minimal adjustment, coupled with a preservation loss to maintain semantic fidelity. Extensive experiments across multiple benchmarks demonstrate precise, minimally disruptive removal of unsafe visual concepts, significantly improving VAR model safety without compromising image fidelity or diversity. The approach closes a critical content-safety gap for autoregressive generative models.
📝 Abstract
The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but also heightened safety concerns. Existing concept erasure techniques, designed primarily for diffusion models, fail to generalize to VARs due to their next-scale token prediction paradigm. In this paper, we first propose VARE, a novel VAR Erasure framework that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method for VAR models that incorporates a filtered cross-entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing issues such as language drift and reduced diversity introduced by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap left by earlier methods in autoregressive text-to-image generation.
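The abstract describes a two-term objective: a cross-entropy term filtered to unsafe visual tokens, plus a preservation term that keeps the model close to its original behavior elsewhere. The paper's exact formulation is not given here, so the following is only a minimal numpy sketch under stated assumptions: the function name `svare_loss`, the use of a frozen reference model's logits, the KL-based preservation term, and the weight `lam` are all illustrative choices, not the authors' definitions.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def svare_loss(logits, ref_logits, target_tokens, unsafe_mask, lam=1.0):
    """Illustrative filtered cross-entropy + preservation objective.

    logits        (T, V): current model logits over visual tokens
    ref_logits    (T, V): logits of a frozen reference (pre-erasure) model
    target_tokens (T,):   safe replacement tokens to steer toward
    unsafe_mask   (T,):   bool, True where a token is flagged as unsafe
    lam:                  preservation weight (assumed hyperparameter)
    """
    probs = softmax(logits)
    # Filtered cross-entropy: only positions flagged as unsafe contribute,
    # so the erasure signal does not perturb safe positions.
    ce = -np.log(probs[np.arange(len(target_tokens)), target_tokens] + 1e-12)
    erase = (ce * unsafe_mask).sum() / max(unsafe_mask.sum(), 1)
    # Preservation: keep the predictive distribution close to the reference
    # model on unflagged positions, limiting language drift / diversity loss.
    ref = softmax(ref_logits)
    kl = (ref * (np.log(ref + 1e-12) - np.log(probs + 1e-12))).sum(axis=-1)
    preserve = (kl * ~unsafe_mask).sum() / max((~unsafe_mask).sum(), 1)
    return erase + lam * preserve
```

Masking the cross-entropy is what makes the adjustment "minimal" in spirit: gradient flows only through flagged tokens, while the KL term anchors everything else to the reference model.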