🤖 AI Summary
To address cross-view geometric inconsistency in single-image-and-text-driven multi-view image generation, this paper proposes the first end-to-end discrete diffusion framework for the task. It formulates multi-view synthesis as a joint text-conditioned visual-token sequence prediction problem, using MAGVIT-v2 for efficient visual tokenization and combining multimodal masked modeling with autoregressive Transformer decoding to enforce geometric consistency. Crucially, the method requires no 3D priors or explicit geometric constraints: structural coherence across views emerges solely from stochastic masking and self-attention. The architecture is lightweight and fully differentiable. Extensive experiments on GSO and 3D-FUTURE demonstrate state-of-the-art performance: the approach achieves superior PSNR, SSIM, and LPIPS scores compared to existing continuous-diffusion and 3D-aware methods, ranking first on multiple metrics.
📝 Abstract
Multi-view image generation from a single image and text description remains challenging due to the difficulty of maintaining geometric consistency across different viewpoints. Existing approaches typically rely on 3D-aware architectures or specialized diffusion models that require extensive multi-view training data and complex geometric priors. In this work, we introduce ViewMask-1-to-3, a pioneering approach that applies discrete diffusion models to multi-view image generation. Unlike continuous diffusion methods that operate in latent spaces, ViewMask-1-to-3 formulates multi-view synthesis as a discrete sequence modeling problem, where each viewpoint is represented as visual tokens obtained through MAGVIT-v2 tokenization. By unifying language and vision through masked token prediction, our approach progressively generates multiple viewpoints via text-conditioned iterative token unmasking. ViewMask-1-to-3 achieves cross-view consistency through simple random masking combined with self-attention, eliminating the need for complex 3D geometric constraints or specialized attention architectures. Our approach demonstrates that discrete diffusion is a viable and simple alternative to existing multi-view generation methods, ranking first on average across the GSO and 3D-FUTURE datasets in terms of PSNR, SSIM, and LPIPS, while maintaining architectural simplicity.
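To make the "iterative token unmasking" idea concrete, here is a minimal, hypothetical sketch of confidence-ordered unmasking over a discrete visual-token sequence, in the style the abstract describes. Everything here is illustrative, not taken from the paper: `logits_fn` stands in for the text- and image-conditioned Transformer, `MASK_ID` is a placeholder mask-token id, and the linear unmasking schedule is an assumption.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask-token id (stands in for the tokenizer's [MASK])

def iterative_unmask(logits_fn, seq_len, steps=4):
    """Progressively fill a fully masked token sequence.

    At each step the model predicts all positions, and the most
    confident still-masked positions are committed, so the sequence
    is fully unmasked after `steps` iterations.
    """
    tokens = np.full(seq_len, MASK_ID, dtype=np.int64)
    for step in range(steps):
        logits = logits_fn(tokens)                       # (seq_len, vocab)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred, conf = probs.argmax(-1), probs.max(-1)
        masked = np.flatnonzero(tokens == MASK_ID)
        k = int(np.ceil(len(masked) / (steps - step)))   # linear schedule (assumed)
        chosen = masked[np.argsort(-conf[masked])[:k]]   # highest-confidence first
        tokens[chosen] = pred[chosen]
    return tokens

def toy_logits(tokens):
    """Stand-in for the conditioned Transformer: favors token i % 8 at position i."""
    logits = np.zeros((len(tokens), 8))
    logits[np.arange(len(tokens)), np.arange(len(tokens)) % 8] = 5.0
    return logits
```

In the full method, the same loop would run per viewpoint over MAGVIT-v2 token grids, with self-attention across views supplying the cross-view consistency; this sketch only shows the unmasking schedule itself.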