🤖 AI Summary
To address the high computational cost, language drift, and limited diversity in few-shot subject-driven image generation with diffusion models, this paper pioneers the adoption of Vision Autoregressive (VAR) models for this task. We propose three key innovations: (1) selective layer tuning—identifying that early VAR layers predominantly encode subject identity and thus focusing optimization on coarse-grained semantic representations; (2) prior distillation—to mitigate language drift by transferring knowledge from a pretrained text-to-image prior; and (3) scale-weighted tuning—to enhance multi-scale token representation learning. Our method achieves state-of-the-art performance across FID, CLIP-Score, and human evaluation, consistently outperforming diffusion-based baselines. Moreover, it accelerates inference by multiple orders of magnitude, enabling end-to-end real-time deployment. This work establishes a new paradigm for efficient, controllable, and diverse personalized image generation.
📝 Abstract
Recent advances in text-to-image generative models have enabled numerous practical applications, including subject-driven generation, which fine-tunes pretrained models to capture subject semantics from only a few examples. While diffusion-based models produce high-quality images, their extensive denoising steps result in significant computational overhead, limiting real-world applicability. Visual autoregressive~(VAR) models, which predict next-scale tokens rather than spatially adjacent ones, offer significantly faster inference suitable for practical deployment. In this paper, we propose the first VAR-based approach for subject-driven generation. However, na""{i}ve fine-tuning VAR leads to computational overhead, language drift, and reduced diversity. To address these challenges, we introduce selective layer tuning to reduce complexity and prior distillation to mitigate language drift. Additionally, we found that the early stages have a greater influence on the generation of subject than the latter stages, which merely synthesize local details. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions for promoting the model to focus on the subject-relevant information instead of local details. Extensive experiments validate that our method significantly outperforms diffusion-based baselines across various metrics and demonstrates its practical usage.