🤖 AI Summary
Visual autoregressive (VAR) models suffer from diversity collapse—severely diminished output variability during generation. To address this without increasing training overhead, we propose DiverseVAR, a plug-and-play method that enhances diversity through input preprocessing and output postprocessing. By analyzing multi-scale feature map components, we identify early-layer channel activations as decisive for diversity. Leveraging this insight, DiverseVAR introduces an input suppression and output amplification strategy, integrated with a training-free feature-space decomposition and component modulation mechanism. Crucially, it requires no fine-tuning or retraining. Experiments across multiple benchmarks demonstrate that DiverseVAR significantly improves generative diversity—evidenced by stable or improved FID and CLIP-Score—while preserving image fidelity and maintaining inference efficiency. Thus, DiverseVAR establishes an efficient, general-purpose paradigm for diversity enhancement in VAR models, advancing their practical deployment.
📝 Abstract
Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm, offering notable advantages in both inference efficiency and image quality compared to traditional multi-step autoregressive (AR) and diffusion models. However, despite their efficiency, VAR models often suffer from diversity collapse, i.e., a reduction in output variability, analogous to that observed in few-step distilled diffusion models. In this paper, we introduce DiverseVAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training. Our analysis reveals the pivotal component of the feature map as a key factor governing diversity formation at early scales. By suppressing the pivotal component in the model input and amplifying it in the model output, DiverseVAR effectively unlocks the inherent generative potential of VAR models while preserving high-fidelity synthesis. Empirical results demonstrate that our approach substantially enhances generative diversity with only a negligible impact on performance. Our code will be publicly released at https://github.com/wangtong627/DiverseVAR.
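To make the suppress-then-amplify idea concrete, here is a minimal sketch of one plausible reading: extract a dominant ("pivotal") rank-1 component of a feature map via SVD, attenuate it before the model sees the input, and amplify it in the output. The function names, the SVD-based decomposition, and the scale factors are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch: modulate a "pivotal" rank-1 feature component.
# The decomposition choice (SVD) and scales (0.5 / 1.5) are assumptions.
import numpy as np

def pivotal_component(feat: np.ndarray) -> np.ndarray:
    """Rank-1 component along the leading singular direction of a
    2-D feature map (tokens x channels)."""
    u, s, vt = np.linalg.svd(feat, full_matrices=False)
    return s[0] * np.outer(u[:, 0], vt[0])

def modulate(feat: np.ndarray, scale: float) -> np.ndarray:
    """Replace the pivotal component with a rescaled copy
    (scale < 1 suppresses it, scale > 1 amplifies it)."""
    pivot = pivotal_component(feat)
    return feat - pivot + scale * pivot

# Toy usage: suppress on the input side, amplify on the output side.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))   # stand-in for an early-scale feature map
x_in = modulate(x, scale=0.5)      # suppressed before the model
x_out = modulate(x, scale=1.5)     # amplified after the model
```

Because the modulation is a training-free, per-feature-map operation, it can in principle be bolted onto a pretrained VAR model without touching its weights, which matches the plug-and-play claim above.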