CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas

📅 2025-10-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Video masked autoregressive (MAR) models suffer from two key bottlenecks: a slow start, caused by the lack of a structured global prior at early sampling stages, and error accumulation across both the spatial and temporal dimensions of the autoregression. To address these, we propose CanvasMAR, a video MAR framework built around a global "canvas": a blurred prediction of the next frame that serves as the starting point for masked generation. We further pair a compositional classifier-free guidance scheme, which jointly strengthens the canvas (spatial) and temporal conditioning, with noise-based canvas augmentation to improve robustness and reduce the number of autoregressive steps. The method combines masked modeling, continuous tokenization, and canvas prediction. Evaluated on BAIR and Kinetics-600, CanvasMAR improves generation quality: on Kinetics-600, it performs strongly among autoregressive models and rivals diffusion-based methods while remaining an autoregressive framework.

📝 Abstract

Masked autoregressive (MAR) models have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizers. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across the autoregression in both the spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism: a blurred, global prediction of the next frame, used as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance that jointly strengthens spatial (canvas) and temporal conditioning, and employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves strong performance among autoregressive models on the Kinetics-600 dataset and rivals diffusion-based methods.
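The abstract does not spell out the compositional classifier-free guidance formula, so the following is only a minimal sketch of one common way to compose two guidance terms. It assumes the model can be evaluated three times: with no conditioning, with temporal context only, and with temporal context plus canvas. The function name `compositional_cfg`, its argument names, and the two scales `w_temporal` and `w_canvas` are hypothetical and are not taken from the paper.

```python
import numpy as np


def compositional_cfg(pred_uncond, pred_temporal, pred_full, w_temporal, w_canvas):
    """Combine three model outputs with two guidance scales.

    Hypothetical formulation (an assumption, not the paper's equation):
      pred_uncond   -- output with both conditions dropped
      pred_temporal -- output conditioned on previous frames only
      pred_full     -- output conditioned on previous frames and the canvas
    """
    pred_uncond = np.asarray(pred_uncond, dtype=float)
    pred_temporal = np.asarray(pred_temporal, dtype=float)
    pred_full = np.asarray(pred_full, dtype=float)

    # Direction contributed by the temporal context alone.
    temporal_term = pred_temporal - pred_uncond
    # Additional direction contributed by the canvas on top of the temporal context.
    canvas_term = pred_full - pred_temporal

    return pred_uncond + w_temporal * temporal_term + w_canvas * canvas_term
```

With both scales set to 1 the expression reduces to the fully conditioned output, and with both set to 0 it reduces to the unconditional output; scales above 1 push the result further along each conditioning direction independently.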

Problem

Research questions and friction points this paper is trying to address.

Addresses slow-start problem in video generation

Mitigates error accumulation in spatial-temporal autoregression

Improves frame synthesis coherence with global canvas prior

Innovation

Methods, ideas, or system contributions that make the work stand out.

Canvas mechanism provides global structure for video generation

Compositional guidance enhances spatial and temporal conditioning

Noise-based augmentation improves model robustness and performance
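As an illustration of what noise-based canvas augmentation could look like during training, here is a minimal sketch that perturbs a canvas with Gaussian noise of a randomly drawn strength. The source does not state the noise distribution or schedule, so the uniform sampling of the noise level, the `max_sigma` parameter, and the function name `augment_canvas` are all assumptions made for this example.

```python
import numpy as np


def augment_canvas(canvas, max_sigma=0.5, rng=None):
    """Return a noisy copy of the canvas and the noise level that was used.

    Assumed scheme (not confirmed by the source): draw a noise level
    uniformly from [0, max_sigma] and add zero-mean Gaussian noise of that
    standard deviation, so the model sees imperfect canvases in training.
    """
    rng = np.random.default_rng() if rng is None else rng
    canvas = np.asarray(canvas, dtype=float)

    # One noise level per call; a per-sample level would work the same way.
    sigma = rng.uniform(0.0, max_sigma)
    noisy = canvas + sigma * rng.standard_normal(canvas.shape)
    return noisy, sigma
```

Training on perturbed canvases is one plausible way to make the generator tolerant of errors in the predicted canvas at sampling time, which is the robustness benefit the summary describes.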

🔎 Similar Papers

MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion

2024-10-10 · arXiv.org · Citations: 0
