🤖 AI Summary
Visual autoregressive (AR) image generation faces a fundamental trade-off: random token ordering enables bidirectional context modeling but violates spatial priors such as center bias and locality, whereas fixed raster-scan ordering respects those priors yet hinders flexible, region-specific editing. This work proposes Spanning Tree Autoregressive (STAR) modeling, an AR framework that constructs a structured token sequence via breadth-first traversal of a uniformly sampled spanning tree over the image patch lattice. STAR builds spatial inductive biases into the sequence order itself while remaining compatible with standard language-style AR architectures. It supports efficient suffix completion via rejection sampling, so both initial seed regions and edit masks can be specified dynamically at inference. Experiments demonstrate that STAR significantly outperforms random-permutation baselines on image editing tasks, striking a superior balance among sampling efficiency, contextual expressivity, and sequence-order flexibility without compromising generation quality.
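To make the core construction concrete, here is a minimal sketch of sampling a uniform spanning tree on a patch lattice and reading off its breadth-first traversal as a token order. Wilson's algorithm (loop-erased random walks) is a standard way to draw exactly uniform spanning trees; the paper does not specify its sampler, so this choice and all function names are illustrative, not the authors' implementation.

```python
import random
from collections import deque

def neighbors(v, h, w):
    """4-neighbors of cell v = (r, c) on an h x w patch lattice."""
    r, c = v
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < h and 0 <= nc < w:
            yield (nr, nc)

def uniform_spanning_tree(h, w, rng=random):
    """Wilson's algorithm: loop-erased random walks produce a uniformly
    random spanning tree, stored as a child -> parent map."""
    cells = [(r, c) for r in range(h) for c in range(w)]
    root = rng.choice(cells)
    parent = {root: None}
    for start in cells:
        if start in parent:
            continue
        # Random-walk from `start` until hitting the tree; keeping only
        # the LAST exit direction of each cell erases loops implicitly.
        v, path = start, {}
        while v not in parent:
            nxt = rng.choice(list(neighbors(v, h, w)))
            path[v] = nxt
            v = nxt
        # Retrace the loop-erased path and attach it to the tree.
        v = start
        while v not in parent:
            parent[v] = path[v]
            v = path[v]
    return root, parent

def bfs_order(root, parent):
    """Breadth-first traversal of the tree: the AR token sequence."""
    children = {}
    for child, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(child)
    order, queue = [], deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        queue.extend(children.get(v, []))
    return order

# Usage: a 4x4 lattice of patches.
random.seed(0)
root, parent = uniform_spanning_tree(4, 4)
order = bfs_order(root, parent)
# `order` visits all 16 patches starting at `root`; every patch appears
# after its tree parent, so each token is conditioned on a connected,
# spatially local context.
```

Because BFS expands the tree ring by ring around the root, nearby patches tend to appear close together in the sequence, which is how the locality prior enters the ordering.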
📝 Abstract
We present Spanning Tree Autoregressive (STAR) modeling, which incorporates prior knowledge about images, such as center bias and locality, to maintain sampling performance while providing sequence orders flexible enough to accommodate image editing at inference. Approaches that expose randomly permuted sequence orders to conventional autoregressive (AR) models in visual generation to obtain bidirectional context either suffer a decline in performance or compromise flexibility in the choice of sequence order at inference. Instead, STAR utilizes traversal orders of uniform spanning trees sampled on the lattice defined by the positions of image patches. Because traversal orders are obtained through breadth-first search, we can efficiently construct, via rejection sampling, a spanning tree whose traversal order guarantees that a connected partial observation of the image appears as a prefix of the sequence. Through this structured randomization, rather than unrestricted random permutation, STAR preserves the capability of postfix completion while maintaining sampling performance, without significant changes to the model architecture widely adopted in language AR modeling.
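The rejection-sampling step in the abstract can be sketched as follows: repeatedly draw a spanning tree rooted inside the observed region and accept only when the BFS order visits that region first, so the known patches form the prefix and the model completes the postfix. This is a self-contained illustration under two stated simplifications: the tree sampler here is a randomized growing-tree heuristic standing in for a true uniform spanning tree sampler, and all helper names are hypothetical, not the paper's.

```python
import random
from collections import deque

def neighbors(v, h, w):
    """4-neighbors of cell v = (r, c) on an h x w patch lattice."""
    r, c = v
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < h and 0 <= nc < w:
            yield (nr, nc)

def random_spanning_tree(h, w, root, rng):
    """Growing-tree sampler: pop a random frontier edge and attach it.
    A simple stand-in; NOT exactly uniform over spanning trees."""
    parent = {root: None}
    frontier = [(root, n) for n in neighbors(root, h, w)]
    while frontier:
        u, v = frontier.pop(rng.randrange(len(frontier)))
        if v in parent:
            continue
        parent[v] = u
        frontier.extend((v, n) for n in neighbors(v, h, w) if n not in parent)
    return parent

def bfs_order(root, parent):
    """Breadth-first traversal of the tree, i.e. the token sequence."""
    children = {}
    for child, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(child)
    order, queue = [], deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        queue.extend(children.get(v, []))
    return order

def order_with_prefix(h, w, observed, rng, max_tries=10_000):
    """Rejection sampling: re-draw trees rooted inside the connected
    observed region until BFS visits that region first, so the known
    patches form a prefix and the rest is a postfix to complete."""
    observed = set(observed)
    for _ in range(max_tries):
        root = rng.choice(sorted(observed))
        parent = random_spanning_tree(h, w, root, rng)
        order = bfs_order(root, parent)
        if set(order[:len(observed)]) == observed:
            return order
    raise RuntimeError("no accepted order within max_tries")

# Usage: condition generation on the two top-left patches of a 3x3 image.
rng = random.Random(0)
order = order_with_prefix(3, 3, {(0, 0), (0, 1)}, rng)
# The first two tokens of `order` are exactly the observed patches; an
# AR model can then complete the remaining seven as a postfix.
```

A practical note on the design: because rejected samples only cost cheap tree draws rather than forward passes of the model, this filtering stays inexpensive relative to generation, which is presumably why structured rejection is viable here where constraining arbitrary random permutations would not be.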