🤖 AI Summary
This work addresses the susceptibility of visual autoregressive (VAR) models to cascading errors, in which small coarse-scale mispredictions are amplified at finer scales and distort the generated image. The authors propose AID-VAR, a framework that adds a lightweight guidance injector and a discriminator to a pretrained VAR model: the discriminator diagnoses errors adversarially at each scale transition, and the injector rectifies the feature manifold non-invasively, so errors are corrected actively across scales without modifying the backbone, which stays frozen during fine-tuning. The authors also introduce the Inter-Scale Consistency Score (ISCS), a metric that quantifies fidelity and structural alignment between consecutive scales. Experiments show that AID-VAR consistently improves performance across various VAR backbones; for instance, AID-VAR-d20 achieves a 16% FID improvement with only a 3% increase in parameters, yielding images with sharper details and more stable structures.
📝 Abstract
Visual Autoregressive (VAR) models have emerged as a powerful paradigm for image synthesis by performing hierarchical next-scale prediction. However, VAR models are inherently prone to cascading error propagation, where subtle coarse-scale mispredictions are amplified across the hierarchy, ultimately distorting the final synthesis. To mitigate this, we propose AID-VAR, a plug-and-play framework that enhances pre-trained VAR models through Adversarially Injected Diagnosis. Instead of standard passive generation, AID-VAR introduces a proactive error-correction mechanism inspired by the adversarial feedback in GANs. We deploy a discriminator to diagnose fidelity gaps at each scale transition, coupled with a lightweight guidance injector. The injector operates as a non-invasive adapter that refines the feature manifold of a frozen VAR backbone, steering generation toward the distribution of real images without destabilizing the pre-trained latent space. Furthermore, to rigorously evaluate this cross-scale progression, we introduce the Inter-Scale Consistency Score (ISCS), a novel metric that quantifies the fidelity and structural alignment between consecutive resolution scales. Experimental results across various backbones demonstrate that AID-VAR delivers sharper textural details and fewer structural distortions with negligible overhead. For instance, AID-VAR-d20 achieves a 16% improvement in FID with only a 3% increase in parameters. These results establish AID-VAR as a highly efficient and scalable pathway for upgrading large-scale VAR generators, enhancing global coherence and local detail without altering training data, base architectures, or sampling schedules. Code is available at https://github.com/bijiw515/AID-VAR.
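
To make the "non-invasive adapter on a frozen backbone" idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. Everything in it is an assumption made for illustration: the names `frozen_backbone`, `GuidanceInjector`, `generate_scale`, and `parameter_overhead` are hypothetical, the bottleneck design is a generic adapter pattern, and the zero-initialized output projection is a common adapter trick that the abstract does not confirm AID-VAR uses. The discriminator and the adversarial training loop are omitted.

```python
import random


def frozen_backbone(features):
    """Stand-in for one frozen VAR block (hypothetical): a fixed, untrained map."""
    return [2.0 * x + 1.0 for x in features]


class GuidanceInjector:
    """Hypothetical lightweight adapter: bottleneck projection plus residual add.

    Assumption: the up-projection starts at zero, so the adapter initially
    leaves the backbone features untouched and only departs from them as its
    weights are trained.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: bottleneck x dim, small random weights.
        self.down = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection: dim x bottleneck, zero-initialized.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def num_params(self):
        """Count the adapter's weights (no biases in this sketch)."""
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)

    def __call__(self, h):
        # ReLU bottleneck activation.
        z = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in self.down]
        # Correction term projected back to the feature dimension.
        delta = [sum(w * v for w, v in zip(row, z)) for row in self.up]
        # Residual connection: backbone features plus learned correction.
        return [x + d for x, d in zip(h, delta)]


def generate_scale(features, injector):
    """Run the frozen stand-in backbone, then apply the adapter's correction."""
    return injector(frozen_backbone(features))


def parameter_overhead(adapter_params, backbone_params):
    """Fraction of extra parameters the adapter adds relative to the backbone."""
    return adapter_params / backbone_params
```

In this sketch only `GuidanceInjector` holds trainable weights, which is one way a small parameter overhead (such as the 3% reported for AID-VAR-d20) could arise; in a full setup the adapter's weights would presumably be updated with a loss that includes the discriminator's feedback.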