CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas

📅 2025-10-15
🤖 AI Summary
Video masked autoregressive (MAR) models suffer from two key bottlenecks: a slow start, caused by the lack of a structured global prior early in sampling, and error accumulation across the spatial and temporal dimensions. To address these, we propose CanvasMAR, a video MAR framework built around a “canvas”: a blurred, global prediction of the next frame that serves as the starting point for masked generation. We further design a compositional classifier-free guidance scheme over spatial (canvas) and temporal conditioning, and use noise-based canvas augmentation to improve robustness. The method combines masked modeling, continuous tokenization, and canvas prediction. On BAIR and Kinetics-600, CanvasMAR produces high-quality videos with fewer autoregressive steps; on Kinetics-600, it performs strongly among autoregressive models and rivals diffusion-based methods.

📝 Abstract
Masked autoregressive (MAR) models have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizers. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across the autoregression in both the spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism: a blurred, global prediction of the next frame, used as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance that jointly strengthens spatial (canvas) and temporal conditioning, and employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves remarkable performance among autoregressive models on the Kinetics-600 dataset and rivals diffusion-based methods.
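
The abstract does not spell out how compositional classifier-free guidance combines the two conditions. One common way to compose guidance over two conditions is to add a separately scaled term for each, and the sketch below shows that generic form under stated assumptions. The function name, argument names, and the ordering of the terms (temporal context first, canvas second) are all hypothetical and may differ from the paper's formulation.

```python
import numpy as np

def compositional_cfg(pred_uncond, pred_temporal, pred_full, w_temporal, w_canvas):
    """Combine three model predictions using two guidance scales.

    Assumed inputs (hypothetical, not taken from the paper):
      pred_uncond   -- prediction with both conditions dropped
      pred_temporal -- prediction with temporal context only (canvas dropped)
      pred_full     -- prediction with temporal context and canvas
    """
    # Direction contributed by adding the temporal context.
    temporal_term = pred_temporal - pred_uncond
    # Direction contributed by adding the canvas on top of the temporal context.
    canvas_term = pred_full - pred_temporal
    return pred_uncond + w_temporal * temporal_term + w_canvas * canvas_term
```

With both scales set to 1 the result reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one. Scales above 1 push the output further along the corresponding conditioning direction.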

Problem

Research questions and friction points this paper is trying to address.

Addresses slow-start problem in video generation
Mitigates error accumulation in spatial-temporal autoregression
Improves frame synthesis coherence with global canvas prior

Innovation

Methods, ideas, or system contributions that make the work stand out.

Canvas mechanism provides global structure for video generation
Compositional guidance enhances spatial and temporal conditioning
Noise-based augmentation improves model robustness and performance
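
Noise-based canvas augmentation can be pictured as perturbing the canvas during training so that the model does not rely on it being exact. The sketch below is a minimal illustration of that idea, assuming additive Gaussian noise with a randomly drawn strength. The function name, the uniform noise-level schedule, and the `max_noise_std` parameter are hypothetical; the source does not describe the actual noise distribution or schedule.

```python
import numpy as np

def augment_canvas(canvas, max_noise_std, rng):
    """Return a noisy copy of the canvas and the noise level that was used.

    Assumption (not from the paper): the noise level is drawn uniformly
    from [0, max_noise_std] and the noise itself is Gaussian.
    """
    noise_std = rng.uniform(0.0, max_noise_std)
    noisy = canvas + noise_std * rng.standard_normal(canvas.shape)
    return noisy, noise_std

# Usage: perturb a dummy 4x4 canvas with a seeded generator.
rng = np.random.default_rng(0)
canvas = np.zeros((4, 4))
noisy_canvas, level = augment_canvas(canvas, max_noise_std=0.5, rng=rng)
```

Returning the sampled noise level alongside the noisy canvas leaves room for conditioning the model on it, which is a design choice of this sketch rather than something the source states.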