CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods struggle to natively generate high-resolution 360° videos due to computational constraints, typically supporting at most 1K resolution and relying on post-hoc super-resolution that compromises VR immersion. This work proposes CubeComposer, a spatio-temporal autoregressive diffusion model built on a cubemap representation: it decomposes a 4K 360° video into six cube faces and synthesizes them autoregressively in a carefully designed spatio-temporal order. By integrating sparse context attention, cube-aware positional encoding, and boundary blending, CubeComposer eliminates seam artifacts and improves temporal coherence. To the authors' knowledge, this is the first method capable of native 4K 360° video generation; it significantly outperforms existing approaches on benchmark datasets in resolution and visual quality, and supports practical virtual-reality applications.

📝 Abstract
Generating high-quality 360° panoramic videos from perspective input is a key application for virtual reality (VR), where high resolution is especially important for an immersive experience. Existing methods are constrained by the computational limitations of vanilla diffusion models: they support native generation only at $\leq$ 1K resolution and rely on suboptimal post-hoc super-resolution to upscale. We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned spatio-temporal order, reducing memory demands while enabling high-resolution output. Specifically, to address the challenges of multi-dimensional autoregression, we propose: (1) a spatio-temporal autoregressive strategy that orchestrates 360° video generation across cube faces and time windows for coherent synthesis; (2) a cube-face context management mechanism, equipped with a sparse context attention design to improve efficiency; and (3) continuity-aware techniques, including cube-aware positional encoding, padding, and blending, to eliminate boundary seams. Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality, supporting practical VR application scenarios. Project page: https://lg-li.github.io/project/cubecomposer
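The cubemap representation at the heart of the method maps each equirectangular frame onto six cube faces. As an illustration only, the sketch below shows a minimal nearest-neighbor equirectangular-to-cubemap resampling in NumPy; the face-orientation conventions, function names, and sampling scheme are assumptions for this example, not the paper's implementation (which operates on latent video, not single frames).

```python
import numpy as np

# The six cube faces, named by the axis each face looks along.
FACES = ["+x", "-x", "+y", "-y", "+z", "-z"]

def face_rays(face, size):
    """Unit ray directions for every pixel of one size x size cube face.

    Pixel centers are placed at coordinates in [-1, 1]^2 on the face,
    then turned into 3D rays under an assumed orientation convention.
    """
    c = (np.arange(size) + 0.5) / size * 2.0 - 1.0  # pixel centers in [-1, 1]
    v, u = np.meshgrid(c, c, indexing="ij")
    ones = np.ones_like(u)
    if face == "+x":
        d = np.stack([ones, -v, -u], axis=-1)
    elif face == "-x":
        d = np.stack([-ones, -v, u], axis=-1)
    elif face == "+y":
        d = np.stack([u, ones, v], axis=-1)
    elif face == "-y":
        d = np.stack([u, -ones, -v], axis=-1)
    elif face == "+z":
        d = np.stack([u, -v, ones], axis=-1)
    else:  # "-z"
        d = np.stack([-u, -v, -ones], axis=-1)
    return d / np.linalg.norm(d, axis=-1, keepdims=True)

def equirect_to_cubemap(frame, face_size):
    """Resample one equirectangular frame (H x W x C) into six cube faces."""
    h, w = frame.shape[:2]
    faces = {}
    for face in FACES:
        d = face_rays(face, face_size)
        lon = np.arctan2(d[..., 0], d[..., 2])           # longitude in [-pi, pi]
        lat = np.arcsin(np.clip(d[..., 1], -1.0, 1.0))   # latitude in [-pi/2, pi/2]
        x = ((lon / (2 * np.pi) + 0.5) * w).astype(int) % w
        y = ((0.5 - lat / np.pi) * h).astype(int).clip(0, h - 1)
        faces[face] = frame[y, x]                        # nearest-neighbor sample
    return faces
```

One motivation this makes concrete: each 4K face is only a fraction of the full panorama, so generating faces one at a time (in the paper's spatio-temporal order, with blending at face boundaries) keeps per-step memory far below that of synthesizing the full equirectangular frame at once.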
Problem

Research questions and friction points this paper is trying to address.

360-degree video generation
high-resolution video
virtual reality
panoramic video
video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatio-temporal autoregression
4K 360° video generation
cubemap representation
diffusion model
boundary seam elimination