🤖 AI Summary
High-resolution video generation suffers from optimization instability and prohibitive computational cost: long token sequences bias optimization toward local textures, which causes structural collapse, and they drive up both training cost and inference latency. This work proposes PixelWizard, a framework that hierarchically decouples global structure modeling from detail synthesis. It first builds a compact spatiotemporal anchor that encodes structural priors, which then guides high-resolution detail generation. On top of this, it introduces Noise-Span Aligned Shortcut Training, which models the step size explicitly and uses Exponential Index-Biased Sampling and Adaptive Noise-Span Calibration to enable stable few-step inference without distillation. On native 2K/4K video generation, PixelWizard is reported to achieve superior visual quality while accelerating sampling by over 10×.
📝 Abstract
High-resolution video generation faces a coupled bottleneck of optimization instability and prohibitive computational cost. The massive expansion of the token sequence not only biases optimization toward local textures at the expense of global coherence, leading to structural collapse, but also imposes heavy training costs and severe inference latency. To address this, we propose PixelWizard, a framework that hierarchically decouples global structure modeling from fine-grained detail synthesis. PixelWizard first establishes a compact spatiotemporal anchor to concentrate dense structural priors, which then guides fine-grained generation at high resolution. This mitigates the local optimization bias and preserves structural stability without compromising high-frequency details. Leveraging this structural stability, we introduce Noise-Span Aligned Shortcut Training to break the inference bottleneck. By explicitly modeling the step size, this mechanism allows the model to traverse the generation trajectory in large steps. Crucially, we incorporate Exponential Index-Biased Sampling and Adaptive Noise-Span Calibration to align optimization with the shifted noise schedules of high-resolution grids, ensuring robust few-step inference without incurring the heavy overhead of distillation. Extensive experiments demonstrate that PixelWizard achieves superior visual quality while accelerating the generative sampling of native 2K/4K videos by over 10×.
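To make the step-size idea in the abstract concrete, the following is a minimal sketch of few-step sampling with a step-size-conditioned model. It is not PixelWizard's implementation. The names `toy_shortcut_model` and `few_step_sample`, the time convention (noise at t = 0, data at t = 1), and the straight-line toy trajectory are all assumptions made for illustration. The sketch also does not cover Exponential Index-Biased Sampling or Adaptive Noise-Span Calibration, whose details the abstract does not give.

```python
def toy_shortcut_model(x, t, d, target):
    """Hypothetical stand-in for a trained network s(x, t, d).

    Assumes a straight-line path from noise (t=0) to `target` (t=1).
    On such a path the average direction over any span [t, t+d] is
    (target - x) / (1 - t), so this toy ignores d. A real model would
    use d to account for trajectory curvature.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def few_step_sample(model, x0, num_steps, target):
    """Traverse t in [0, 1] in `num_steps` equal jumps of size d."""
    d = 1.0 / num_steps
    x = list(x0)
    for i in range(num_steps):
        t = i * d
        # The model is told how far it is about to jump (d), so one
        # call can stand in for many small solver steps.
        direction = model(x, t, d, target)
        x = [xi + d * di for xi, di in zip(x, direction)]
    return x


noise = [0.3, -1.2, 0.8]
clean = [1.0, 2.0, -0.5]
one_step = few_step_sample(toy_shortcut_model, noise, 1, clean)
four_steps = few_step_sample(toy_shortcut_model, noise, 4, clean)
```

In this toy setting both the 1-step and 4-step runs land on `clean`, because the trajectory is exactly linear. In practice, conditioning on `d` is what would let a learned model stay accurate when the step count is cut.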