MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address inefficient prior modeling in visual generation caused by scale and spatial redundancy, this paper proposes a Markovian visual autoregressive modeling framework. Methodologically, it introduces (1) a scale-wise Markov trajectory that enables cross-scale conditional independence, and (2) a local spatial-Markov attention mechanism that reduces computational complexity from O(N²) to O(Nk) and eliminates dependence on key-value caching. The framework supports fully parallel training and memory-efficient inference, enabling training on just eight NVIDIA RTX 4090 GPUs. Experiments on ImageNet demonstrate that the approach achieves image quality comparable to or exceeding state-of-the-art methods, while reducing GPU memory consumption by 3.0×. Moreover, it unifies two distinct training paradigms—end-to-end training of compact models from scratch and fine-tuning of large pre-trained models—within a single architectural and algorithmic framework.
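The spatial-Markov attention described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: each query token on the current scale attends only to a k×k neighborhood around the corresponding position on the adjacent preceding scale, so the cost per token is O(k²) rather than O(N). The shapes, nearest-neighbor position mapping, and window size are illustrative assumptions.

```python
import numpy as np

def spatial_markov_attention(q, kv, k=3):
    """Local-window attention sketch (assumed form, not the paper's code).

    q:  (H, W, d) query tokens of the current scale
    kv: (h, w, d) key/value tokens of the adjacent preceding scale
    Each query attends to at most k*k tokens, giving O(N*k^2) total cost
    instead of the O(N^2) of full attention.
    """
    H, W, d = q.shape
    h, w, _ = kv.shape
    r = k // 2
    out = np.zeros_like(q)
    for i in range(H):
        for j in range(W):
            # map the fine-scale position to the coarser scale
            ci, cj = i * h // H, j * w // W
            i0, i1 = max(0, ci - r), min(h, ci + r + 1)
            j0, j1 = max(0, cj - r), min(w, cj + r + 1)
            window = kv[i0:i1, j0:j1].reshape(-1, d)      # (<= k*k, d)
            scores = window @ q[i, j] / np.sqrt(d)        # scaled dot-product
            weights = np.exp(scores - scores.max())       # stable softmax
            weights /= weights.sum()
            out[i, j] = weights @ window                  # weighted sum of values
    return out
```

Because the attended window comes only from the already-generated adjacent scale, inference needs no growing key-value cache: only the previous scale's features are kept in memory.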

📝 Abstract
Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods redefine the process to learn the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that takes as input only the features of the adjacent preceding scale for next-scale prediction, enabling a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size k at corresponding positions on adjacent scales, rather than attending to every token across these scales, in pursuit of reduced modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from O(N^2) to O(Nk), enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for a KV cache during inference. Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both a small model trained from scratch and large fine-tuned models, while reducing the average GPU memory footprint by 3.0x.
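The scale-Markov trajectory is what makes parallel training possible: since scale s+1 depends only on scale s, every (previous scale → next scale) pair is an independent training example rather than a step in a sequential unroll. The sketch below illustrates the idea under assumed shapes and a nearest-neighbor upsampling; the scale schedule and feature dimension are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical multi-scale token maps for one image, coarse to fine.
scales = [rng.standard_normal((s, s, 16)) for s in (1, 2, 4, 8)]

def upsample(x, size):
    """Nearest-neighbor upsample of an (h, w, d) map to (size, size, d)."""
    h, w, _ = x.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return x[rows][:, cols]

# Under the scale-Markov assumption, each (input, target) pair below is
# conditionally independent of all earlier scales, so the pairs can be
# batched and trained fully in parallel instead of unrolled sequentially,
# and only one preceding scale is ever held in memory.
pairs = [
    (upsample(prev, nxt.shape[0]), nxt)   # (conditioning input, target)
    for prev, nxt in zip(scales[:-1], scales[1:])
]
```

In contrast, conditioning each scale on all previous scales would force the training graph to materialize the whole trajectory at once, which is where the reported 3.0× memory saving comes from.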
Problem

Research questions and friction points this paper is trying to address.

Reduces redundancy in multi-scale visual autoregressive modeling
Lowers GPU memory use via scale and spatial Markov assumptions
Improves efficiency by reducing attention complexity from O(N^2) to O(Nk)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scale-Markov trajectory reduces conditional complexity
Spatial-Markov attention localizes token interactions
Parallel training cuts GPU memory usage
🔎 Similar Papers
2024-03-04 · Computer Vision and Pattern Recognition · Citations: 3