🤖 AI Summary
This work addresses two problems that limit existing visual autoregressive (VAR) models in continual personalized generation: catastrophic forgetting during sequential concept learning, and feature entanglement with attribute inconsistency in multi-concept synthesis. It presents the first systematic study of this setting and introduces a unified framework that requires no model expansion. For single concepts, Gradient-based Concept Neuron Selection (GCNS) identifies concept-relevant neurons and constrains only the parameters that conflict across tasks, which mitigates forgetting. For multiple concepts, context-aware multi-branch feature modeling and a spatially conditioned local cross-attention fusion mechanism support disentangled and controllable composition. Experiments show that the proposed method significantly outperforms current baselines in both long-sequence continual learning and multi-concept image synthesis, yielding more accurate generations with higher attribute consistency.
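To make the idea of spatially conditioned local cross-attention fusion more concrete, here is a minimal illustrative sketch in plain Python. It is not the authors' implementation. It assumes that each spatial position is assigned to at most one concept region, that each concept has its own branch of key/value features, and that background positions pass through unchanged. The names `local_fusion`, `region_of`, and `branches` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def local_fusion(queries, region_of, branches):
    """Fuse per-concept branches under a spatial condition (assumed layout).

    queries:   one feature vector per spatial position
    region_of: concept name per position, or None for background
    branches:  concept name -> (keys, values) from that concept's branch
    """
    fused = []
    for query, region in zip(queries, region_of):
        if region is None:
            # Background positions are left untouched in this sketch.
            fused.append(list(query))
        else:
            # A position only attends to the branch of its own region,
            # so features of different concepts never mix.
            keys, values = branches[region]
            fused.append(attend(query, keys, values))
    return fused


branches = {
    "dog": ([[1.0, 0.0]], [[10.0, 0.0]]),
    "cat": ([[0.0, 1.0]], [[0.0, 20.0]]),
}
queries = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
fused = local_fusion(queries, ["dog", "cat", None], branches)
```

In this toy layout, the first position receives only "dog" features and the second only "cat" features, even though all three queries are identical. That is the disentangling effect a region-restricted attention is meant to provide.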
📝 Abstract
Visual autoregressive (VAR) models have recently emerged as an efficient paradigm for text-to-image generation. Despite their strong generative capability, existing VAR-based personalization methods remain limited to static settings, failing to accommodate evolving user demands. In particular, sequential concept learning leads to severe catastrophic forgetting, while multi-concept synthesis often suffers from feature entanglement and attribute inconsistency. In this work, we present the first systematic study of continual personalized generation in VAR models. We identify two key challenges: (i) preserving previously learned concepts during sequential customization, and (ii) composing multiple personalized concepts in a controllable manner. To address these issues, we propose a unified framework with two core components. For continual single-concept learning, we introduce Gradient-based Concept Neuron Selection (GCNS), which identifies concept-relevant neurons and constrains only conflicting parameters across tasks, effectively mitigating forgetting without additional model expansion. For multi-concept synthesis, we propose a context-aware composition strategy that performs multi-branch feature modeling and localized cross-attention fusion guided by spatial conditions, enabling precise and disentangled concept composition. Extensive experiments demonstrate that our method significantly improves performance in long-sequence continual personalization while achieving superior results in multi-concept image synthesis compared to existing baselines. These findings highlight the potential of VAR models for scalable and controllable personalized generation.
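The abstract does not give the exact GCNS formulation, so the following is only a rough sketch of one way gradient-based neuron selection with conflict-only constraints could look. It assumes that a neuron corresponds to one row of a weight matrix, that importance is the summed absolute gradient of that row, and that a "conflict" is a neuron selected for the new concept that is already protected by an earlier one. All function names and the `top_ratio` and `damp` parameters are hypothetical.

```python
def select_concept_neurons(grad, top_ratio=0.1):
    """Indices of the neurons (rows of grad) with the largest gradient magnitude."""
    scores = [sum(abs(g) for g in row) for row in grad]
    k = max(1, round(top_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def constrained_update(weight, grad, protected, lr=0.1, top_ratio=0.1, damp=0.0):
    """One gradient step that damps only the conflicting neurons.

    protected: set of neuron indices selected by earlier concepts (assumption).
    damp:      factor applied to the update of conflicting neurons; 0.0 freezes them.
    """
    selected = select_concept_neurons(grad, top_ratio)
    conflict = selected & protected  # important for both the old and the new concept
    new_weight = [
        [w - lr * (damp if i in conflict else 1.0) * g for w, g in zip(w_row, g_row)]
        for i, (w_row, g_row) in enumerate(zip(weight, grad))
    ]
    # Neurons selected for this concept become protected for later concepts.
    return new_weight, protected | selected, conflict


weight = [[0.0, 0.0, 0.0] for _ in range(4)]
grad = [
    [1.0, 1.0, 1.0],
    [5.0, 5.0, 5.0],
    [0.1, 0.1, 0.1],
    [3.0, 3.0, 3.0],
]
new_weight, new_protected, conflict = constrained_update(
    weight, grad, protected={1}, lr=0.1, top_ratio=0.5
)
```

Here neurons 1 and 3 have the largest gradients for the new concept. Neuron 1 was already protected, so its update is frozen, while neuron 3 is updated and added to the protected set. No parameters are added, which matches the abstract's claim of mitigating forgetting without model expansion.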