AI Summary
This work addresses structural fragmentation in visual autoregressive (VAR) models for image super-resolution. The fragmentation has two sources: locality-biased attention, which breaks up spatial structures, and residual-only supervision, which lets errors accumulate across scales; both compromise global consistency. To mitigate these issues, the authors propose the AlignVAR framework, which combines Spatial Consistency Autoregression (SCA) with a Hierarchical Consistency Constraint (HCC). SCA applies an adaptive mask to reweight attention toward structurally correlated regions, reducing locality bias and strengthening long-range dependencies. HCC augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine generation process. The authors report improved structural coherence and perceptual quality over existing generative methods, with nearly 50% fewer parameters and over 10× faster inference than leading diffusion-based approaches.
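The mask-reweighted attention idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the function name `mask_reweighted_attention`, the `sharpness` parameter, and the choice of building the mask from the cosine similarity of per-token structure features are all hypothetical. The sketch only shows the general mechanism of multiplying attention weights by a mask and renormalizing, which is equivalent to adding the log of the mask to the attention logits.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def mask_reweighted_attention(queries, keys, values, struct_feats, sharpness=4.0):
    """Attention whose weights are rescaled by a structural-similarity mask.

    Assumption: mask[i][j] = exp(sharpness * cosine(struct_feats[i], struct_feats[j])).
    The mask used in the paper may be computed differently.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in keys]
        # Adding log(mask) to the logits multiplies the softmax weights by the
        # mask and renormalizes them.
        biased = [score + sharpness * cosine(struct_feats[i], feat)
                  for score, feat in zip(scores, struct_feats)]
        weights = softmax(biased)
        all_weights.append(weights)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs, all_weights


# Tokens 0 and 2 share a structure feature; token 1 does not.
zeros = [[0.0, 0.0] for _ in range(3)]
outputs, weights = mask_reweighted_attention(
    zeros, zeros, [[1.0], [2.0], [3.0]],
    struct_feats=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
)
```

With uniform content scores, token 0 puts most of its attention on itself and on token 2, the structurally similar token, even though token 1 is its nearer neighbour.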
Abstract
Visual autoregressive (VAR) models have recently emerged as a promising alternative for image generation, offering stable training, non-iterative inference, and high-fidelity synthesis through next-scale prediction. This encourages the exploration of VAR for image super-resolution (ISR), yet its application remains underexplored and faces two critical challenges: locality-biased attention, which fragments spatial structures, and residual-only supervision, which accumulates errors across scales, both severely compromising the global consistency of reconstructed images. To address these issues, we propose AlignVAR, a globally consistent visual autoregressive framework tailored for ISR, featuring two key components: (1) Spatial Consistency Autoregression (SCA), which applies an adaptive mask to reweight attention toward structurally correlated regions, thereby mitigating excessive locality and enhancing long-range dependencies; and (2) Hierarchical Consistency Constraint (HCC), which augments residual learning with full reconstruction supervision at each scale, exposing accumulated deviations early and stabilizing the coarse-to-fine refinement process. Extensive experiments demonstrate that AlignVAR consistently enhances structural coherence and perceptual fidelity over existing generative methods, while delivering over 10× faster inference with nearly 50% fewer parameters than leading diffusion-based approaches, establishing a new paradigm for efficient ISR.
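To make the idea of adding full reconstruction supervision to residual learning concrete, here is a minimal sketch on 1-D signals. It is an illustration under stated assumptions, not the paper's loss: the names `hierarchical_loss` and `recon_weight`, the mean-squared-error terms, and nearest-neighbour upsampling to the final resolution are all hypothetical choices. Each scale predicts a residual, and the running sum of upsampled residuals is compared against the running sum of target residuals at every scale.

```python
def upsample(signal, size):
    """Nearest-neighbour upsampling of a 1-D list to `size` entries."""
    n = len(signal)
    return [signal[i * n // size] for i in range(size)]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hierarchical_loss(pred_residuals, target_residuals, full_size, recon_weight=1.0):
    """Residual loss plus a per-scale loss on the accumulated reconstruction.

    Assumption: both terms are MSE and are summed over scales with a single
    weight `recon_weight`; the paper's formulation may differ.
    """
    pred_sum = [0.0] * full_size
    target_sum = [0.0] * full_size
    residual_loss = 0.0
    recon_loss = 0.0
    for pred, target in zip(pred_residuals, target_residuals):
        # Residual-only term: compares each scale in isolation.
        residual_loss += mse(pred, target)
        # Full-reconstruction term: compares everything accumulated so far.
        pred_sum = [s + u for s, u in zip(pred_sum, upsample(pred, full_size))]
        target_sum = [s + u for s, u in zip(target_sum, upsample(target, full_size))]
        recon_loss += mse(pred_sum, target_sum)
    total = residual_loss + recon_weight * recon_loss
    return total, residual_loss, recon_loss


# A single error at the coarsest scale; the finer scales are predicted exactly.
targets = [[1.0], [0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
preds = [[1.5], [0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
total, residual_only, reconstruction = hierarchical_loss(preds, targets, full_size=4)
```

In this toy case the residual-only term sees the coarse error once (0.25), while the reconstruction term sees it at all three scales (0.75), so a deviation introduced early keeps being penalized as it propagates to finer scales.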