🤖 AI Summary
Existing visual generation models are inefficient and resource-hungry when modeling complex visual content and maintaining long-term temporal consistency. This paper introduces GRID, a grid-based visual generation paradigm that unifies diverse tasks, including video synthesis and 3D editing, as a two-dimensional grid layout problem in which temporal content is arranged like a film strip. Methodologically, it proposes a parallel flow-matching training strategy that combines layout matching and temporal losses under a coarse-to-fine schedule, so that layout consistency and motion coherence are optimized jointly. The authors report up to 35× faster inference while using about 0.1% (1/1000) of the computational resources required by task-specific models. Experiments cover tasks ranging from text-to-video to 3D editing and indicate that the model keeps its underlying image generation capabilities, positioning GRID as an efficient and versatile foundation for general-purpose visual generation.
📝 Abstract
In this paper, we introduce GRID, a novel paradigm that reframes a broad range of visual generation tasks as the problem of arranging grids, akin to film strips. At its core, GRID transforms temporal sequences into grid layouts, enabling image generation models to process visual sequences holistically. To achieve both layout consistency and motion coherence, we develop a parallel flow-matching training strategy that combines layout matching and temporal losses, guided by a coarse-to-fine schedule that evolves from basic layouts to precise motion control. Our approach is highly efficient, achieving up to 35× faster inference while using 1/1000 of the computational resources required by specialized models. Extensive experiments show that GRID is versatile across diverse visual generation tasks, from Text-to-Video to 3D Editing, while maintaining its foundational image generation capabilities. This combination of expanded applications and preserved core competencies establishes GRID as an efficient and versatile omni-solution for visual generation.
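To make the film-strip idea concrete, the sketch below tiles a frame sequence into a single grid image and recovers the frames from it. It is only an illustration under stated assumptions: the function names `frames_to_grid` and `grid_to_frames`, the row-major frame order, and the `(T, H, W, C)` array layout are hypothetical choices made here, not details taken from the paper.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile a (T, H, W, C) sequence into one (rows*H, cols*W, C) image.

    Frames are placed in row-major order (an assumption for this sketch):
    frame k lands in grid cell (k // cols, k % cols).
    """
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {t}")
    # (T, H, W, C) -> (rows, cols, H, W, C) -> (rows, H, cols, W, C)
    cells = frames.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return cells.reshape(rows * h, cols * w, c)


def grid_to_frames(grid: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Inverse of frames_to_grid: split a grid image back into frames."""
    gh, gw, c = grid.shape
    if gh % rows or gw % cols:
        raise ValueError("grid size is not divisible by the requested layout")
    h, w = gh // rows, gw // cols
    # (rows*H, cols*W, C) -> (rows, H, cols, W, C) -> (rows, cols, H, W, C)
    cells = grid.reshape(rows, h, cols, w, c).transpose(0, 2, 1, 3, 4)
    return cells.reshape(rows * cols, h, w, c)


# Example: four tiny 2x3 single-channel frames arranged as a 2x2 grid.
frames = np.arange(4 * 2 * 3 * 1).reshape(4, 2, 3, 1)
grid = frames_to_grid(frames, rows=2, cols=2)
restored = grid_to_frames(grid, rows=2, cols=2)
```

Because the grid is an ordinary image, an image generation model can consume or produce it in one pass, and the individual frames are recovered afterwards by the inverse reshaping.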