Conditional Panoramic Image Generation via Masked Autoregressive Modeling

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing panoramic image generation methods face two key limitations: (1) diffusion-based approaches are ill-suited to equirectangular projection (ERP) panoramas, because the spherical mapping violates the i.i.d. Gaussian noise assumption; and (2) text-to-panorama synthesis and panorama outpainting are treated as separate tasks that rely on distinct architectures and task-specific data. To address these issues, the authors propose the Panoramic AutoRegressive model (PAR), a unified framework built on masked autoregressive modeling. PAR uses circular padding to improve spatial coherence, integrates text and image conditioning in a single architecture, and adds a consistency alignment strategy to improve generation quality. One model handles both text-to-panorama generation and panorama outpainting, and the reported experiments show competitive performance on both tasks along with promising scalability and generalization.

📝 Abstract

Recent progress in panoramic image generation has underscored two critical limitations in existing approaches. First, most methods are built upon diffusion models, which are inherently ill-suited for equirectangular projection (ERP) panoramas due to the violation of the identically and independently distributed (i.i.d.) Gaussian noise assumption caused by their spherical mapping. Second, these methods often treat text-conditioned generation (text-to-panorama) and image-conditioned generation (panorama outpainting) as separate tasks, relying on distinct architectures and task-specific data. In this work, we propose a unified framework, Panoramic AutoRegressive model (PAR), which leverages masked autoregressive modeling to address these challenges. PAR avoids the i.i.d. assumption constraint and integrates text and image conditioning into a cohesive architecture, enabling seamless generation across tasks. To address the inherent discontinuity in existing generative models, we introduce circular padding to enhance spatial coherence and propose a consistency alignment strategy to improve generation quality. Extensive experiments demonstrate competitive performance in text-to-image generation and panorama outpainting tasks while showcasing promising scalability and generalization capabilities.
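The distortion behind the first limitation can be illustrated with a short sketch. It is not taken from the paper; it only shows the standard ERP geometry, in which a pixel in a row at latitude φ covers a patch of the sphere proportional to cos φ. Noise that is identically distributed per pixel on the ERP grid is therefore not identically distributed per unit area on the sphere. The function name `erp_row_area_weights` is hypothetical.

```python
import math


def erp_row_area_weights(height):
    """Relative sphere area covered by one pixel in each ERP row.

    Illustrative helper (not from the paper): rows are evenly spaced in
    latitude, so the area per pixel scales with cos(latitude).
    """
    weights = []
    for row in range(height):
        # Latitude of the row centre, from near +pi/2 (top) to near -pi/2 (bottom).
        latitude = math.pi / 2 - (row + 0.5) * math.pi / height
        weights.append(math.cos(latitude))
    return weights


weights = erp_row_area_weights(8)
# Rows near the poles cover far less of the sphere than rows near the equator,
# although every row holds the same number of pixels.
for row, weight in enumerate(weights):
    print(f"row {row}: relative area {weight:.3f}")
```

Rows near the top and bottom of the image are heavily oversampled relative to the equator, which is one way to see why a per-pixel noise model sits awkwardly on ERP images.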

Problem

Research questions and friction points this paper is trying to address.

Overcome i.i.d. violation in diffusion models for ERP panoramas

Unify text-to-panorama and panorama outpainting in one framework

Address spatial discontinuity via circular padding and consistency alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked autoregressive modeling for panoramic generation

Unified text and image conditioning architecture

Circular padding for spatial coherence enhancement
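As a rough illustration of circular padding, the sketch below wraps a 2D grid along its width, so that the left and right edges of an ERP panorama (which meet at the same longitude) see each other as neighbours. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `circular_pad_width` is hypothetical, the grid is a plain list of rows, and padding only the width axis is an assumption made here.

```python
def circular_pad_width(feature_map, pad):
    """Wrap-pad a 2D grid (list of equal-length rows) along the width axis.

    Hypothetical helper for illustration: the last `pad` columns are copied
    to the left and the first `pad` columns to the right, so the seam of a
    360-degree panorama becomes contiguous.
    """
    if pad < 0:
        raise ValueError("pad must be non-negative")
    if feature_map and pad > len(feature_map[0]):
        raise ValueError("pad must not exceed the row width")
    if pad == 0:
        # row[-0:] would copy the whole row, so handle zero padding explicitly.
        return [list(row) for row in feature_map]
    return [list(row[-pad:]) + list(row) + list(row[:pad]) for row in feature_map]


grid = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
padded = circular_pad_width(grid, 1)
print(padded)  # [[4, 1, 2, 3, 4, 1], [8, 5, 6, 7, 8, 5]]
```

Array libraries offer the same behaviour directly, for example `numpy.pad` with `mode="wrap"` or `torch.nn.functional.pad` with `mode="circular"`. Where and how PAR applies the padding is not specified in this summary.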

🔎 Similar Papers

PanoDiffusion: 360-degree Panorama Outpainting via Diffusion