🤖 AI Summary
Traditional visual autoregressive (VAR) models construct multiscale pyramids via uniform downsampling, which introduces aliasing artifacts that degrade fine-detail preservation and text readability. To address this, we propose FVAR, a novel VAR framework that abandons downsampling in favor of simulating the optical focusing process. FVAR builds an aliasing-free progressive refocusing pyramid using physically consistent defocus kernels, introduces the first "next-focus prediction" modeling paradigm, and incorporates high-frequency residual learning to jointly optimize detail recovery and inference efficiency. The method integrates optical low-pass filtering, multiscale progressive-deblurring autoregressive modeling, and residual teacher distillation for training. Evaluated on ImageNet, FVAR significantly suppresses jaggies and moiré patterns, enhances texture fidelity and text clarity, and outperforms existing VAR methods while maintaining full compatibility with mainstream deep learning frameworks.
📝 Abstract
Visual autoregressive (VAR) models achieve remarkable generation quality through next-scale prediction over multi-scale token pyramids. However, conventional methods build these pyramids by uniform downsampling, introducing aliasing artifacts that compromise fine details and produce unwanted jaggies and moiré patterns. To address this, we present **FVAR**, which reframes the paradigm from *next-scale prediction* to *next-focus prediction*, mimicking the natural process of a camera focusing from blur to clarity. Our approach introduces three key innovations: **1) Next-Focus Prediction Paradigm**, which transforms multi-scale autoregression by progressively reducing blur rather than simply downsampling; **2) Progressive Refocusing Pyramid Construction**, which uses physics-consistent defocus kernels to build clean, alias-free multi-scale representations; and **3) High-Frequency Residual Learning**, which employs a specialized residual teacher network to incorporate alias information during training while keeping deployment simple. Specifically, we construct optical low-pass views using defocus point spread function (PSF) kernels of decreasing radius, creating smooth blur-to-clarity transitions that eliminate aliasing at its source. To further enhance detail generation, we introduce a High-Frequency Residual Teacher that learns from both clean structure and alias residuals, distilling this knowledge into a vanilla VAR deployment network for seamless inference. Extensive experiments on ImageNet demonstrate that FVAR substantially reduces aliasing artifacts, improves fine-detail preservation, and enhances text readability, achieving superior performance with full compatibility with existing VAR frameworks.
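The two core constructions described above can be illustrated with a short sketch: optical low-pass views are produced by convolving the image with a uniform-disk defocus PSF whose radius shrinks level by level, and the high-frequency residual is simply the difference between the sharp image and a blurred view. This is a minimal, hedged sketch assuming a uniform disk PSF and an illustrative radius schedule; the actual kernel parameterization and scale schedule in FVAR are not specified here.

```python
# Hedged sketch of the progressive refocusing pyramid (blur -> clarity).
# The disk PSF and the radius schedule below are illustrative assumptions,
# not the paper's exact configuration.
import numpy as np
from scipy.ndimage import convolve

def disk_psf(radius):
    """Uniform disk point spread function, normalized to sum to 1."""
    r = int(np.ceil(radius))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    kernel = (x**2 + y**2 <= radius**2).astype(np.float64)
    return kernel / kernel.sum()

def refocusing_pyramid(image, radii=(8.0, 4.0, 2.0, 1.0)):
    """Blur-to-clarity views via defocus PSFs of decreasing radius.

    Low-pass filtering before any resampling removes the frequencies that
    would otherwise alias, so each level is a clean, band-limited view."""
    levels = [convolve(image, disk_psf(r), mode='reflect') for r in radii]
    levels.append(image)  # final, fully in-focus level
    return levels

def high_frequency_residual(image, blurred_view):
    """Detail signal of the kind a residual teacher would learn from."""
    return image - blurred_view
```

Because each level is band-limited before any further processing, the aliasing that uniform downsampling would introduce is suppressed at the source, matching the blur-to-clarity transition the paradigm is built on.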