🤖 AI Summary
Existing diffusion models treat color, depth, and surface normals as disjoint modalities, which leads to appearance-geometry misalignment and hinders unified 3D scene generation. This work introduces Orchid, the first text-driven generative model that jointly produces RGB, depth, and surface normals. It is built on a novel joint appearance-geometry implicit prior, realized as a VAE plus Latent Diffusion Model (LDM) that aligns the modalities in a shared latent space. Orchid enables zero-shot 3D scene synthesis, regularization of sparse-view reconstruction, and joint color-depth-normal inpainting. Experiments show that Orchid matches state-of-the-art single-task methods at joint depth and normal prediction, produces highly aligned images and geometry, and markedly improves the quality and cross-modal consistency of partial 3D generation.
📝 Abstract
Diffusion models are state-of-the-art for image generation. Trained on large datasets, they capture expressive image priors that have been used for tasks like inpainting, depth, and (surface) normal prediction. However, these models are typically trained for one specific task, e.g., separate models for each of color, depth, and normal prediction. Such models do not leverage the intrinsic correlation between appearance and geometry, often leading to inconsistent predictions. In this paper, we propose a novel image diffusion prior that jointly encodes appearance and geometry. We introduce Orchid, a diffusion model comprising a Variational Autoencoder (VAE) that encodes color, depth, and surface normals into a shared latent space, and a Latent Diffusion Model (LDM) that generates these joint latents. Orchid directly generates photo-realistic color images, relative depth, and surface normals from user-provided text, and can be used to seamlessly create image-aligned partial 3D scenes. It can also perform image-conditioned tasks like joint monocular depth and normal prediction, with accuracy competitive with state-of-the-art methods designed for those tasks alone. Lastly, our model learns a joint prior that can be used zero-shot as a regularizer for many inverse problems that entangle appearance and geometry. For example, we demonstrate its effectiveness in color-depth-normal inpainting, showcasing its applicability to problems in 3D generation from sparse views.
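The core idea, a single latent that couples appearance and geometry so one LDM denoises all three modalities together, can be illustrated with a minimal sketch. This is not Orchid's actual architecture or dimensions; the shapes, the channel-wise concatenation, and the DDPM-style forward step are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality VAE latents (shapes are illustrative,
# not Orchid's real configuration): 4 channels each at 32x32.
z_rgb = rng.standard_normal((4, 32, 32))
z_depth = rng.standard_normal((4, 32, 32))
z_normal = rng.standard_normal((4, 32, 32))

# One plausible way to form a joint appearance-geometry latent is
# channel-wise concatenation, so a single diffusion model sees all
# three modalities at once and can learn their correlations jointly.
z0 = np.concatenate([z_rgb, z_depth, z_normal], axis=0)  # (12, 32, 32)

def q_sample(z0, alpha_bar_t, rng):
    """Standard DDPM forward step applied to the joint latent:
    z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Noising the joint latent keeps the three modalities entangled: a
# denoiser trained on z_t must recover color and geometry together.
z_t = q_sample(z0, alpha_bar_t=0.5, rng=rng)
print(z_t.shape)  # (12, 32, 32)
```

Because the noise is applied to the concatenated latent rather than per modality, any denoiser trained on it is forced to model color, depth, and normals jointly, which is the consistency property the paper argues separate per-task models lack.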