SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D layout-guided image generation methods struggle to model occlusion relationships among objects, often producing inconsistent geometry and scale for occluded entities in synthesized scenes. To address this limitation, this work proposes SeeThrough3D, the first approach to explicitly model occlusion relations in 3D layout-conditioned text-to-image generation. It introduces an occlusion-aware 3D scene representation (OSCR) in which objects are rendered as semi-transparent 3D bounding boxes, enabling explicit reasoning about occluded regions. The method conditions a pretrained flow-based text-to-image model on visual tokens derived from this representation and applies masked self-attention to bind each bounding box to its textual description, achieving precise 3D layout control while preserving occlusion consistency. SeeThrough3D supports multi-object generation without attribute mixing, produces viewpoint-consistent occlusions, generalizes to unseen object categories, and improves both the realism and layout fidelity of complex scene generation.
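
To make the OSCR idea concrete, here is a minimal sketch (not the authors' code) that alpha-composites translucent boxes projected through a pinhole camera, so farther boxes remain partially visible behind closer ones. The intrinsics, box placement, colors, and the coarse silhouette fill (the 2D bounding rectangle of each box's projected corners) are all illustrative assumptions.

```python
# Minimal sketch of the OSCR rendering idea (illustrative, not the authors' code).
# Each object is a translucent box projected through an assumed pinhole camera
# and composited back-to-front, so occluded objects stay partially visible.
import numpy as np

def project(points, K):
    """Project Nx3 camera-space points to pixel coordinates."""
    uv = (K @ points.T).T          # homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]  # perspective divide

def make_box(center, size):
    """Eight corners of an axis-aligned box in camera space."""
    offsets = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)])
    return center + 0.5 * size * offsets

def render_oscr(boxes, colors, K, hw, alpha=0.5):
    """Alpha-composite translucent boxes, farthest first (coarse silhouettes)."""
    h, w = hw
    canvas = np.ones((h, w, 3), dtype=np.float32)         # white background
    order = np.argsort([-b[:, 2].mean() for b in boxes])  # back-to-front by depth
    for i in order:
        uv = project(boxes[i], K)
        # Fill the 2D bounding rectangle of the projected corners; a real
        # renderer would rasterize the box faces instead.
        u0, v0 = np.maximum(np.floor(uv.min(0)).astype(int), 0)
        u1, v1 = np.minimum(np.ceil(uv.max(0)).astype(int), [w, h])
        canvas[v0:v1, u0:u1] = (1 - alpha) * canvas[v0:v1, u0:u1] + alpha * colors[i]
    return canvas

K = np.array([[300.0, 0, 128], [0, 300.0, 128], [0, 0, 1]])  # assumed intrinsics
boxes = [make_box(np.array([0.0, 0.0, 4.0]), np.array([1.0, 1.0, 1.0])),
         make_box(np.array([0.3, 0.1, 6.0]), np.array([2.0, 2.0, 2.0]))]
colors = [np.array([1.0, 0.3, 0.3]), np.array([0.3, 0.3, 1.0])]  # red occludes blue
image = render_oscr(boxes, colors, K, (256, 256))
```

The back-to-front loop with a fixed alpha is what lets a pixel covered by two boxes retain a trace of the occluded object, which is the property the transparency in OSCR is meant to encode.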

📝 Abstract
We identify occlusion reasoning as a fundamental yet overlooked aspect of 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from the desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling generation of multiple objects without attribute mixing. To train the model, we construct a synthetic dataset of diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
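
The attribute-binding mechanism lends itself to a short sketch. The following is a minimal, assumed implementation of masked self-attention, not the paper's code: the token object IDs, the `binding_mask` helper, and the unprojected shared q/k/v are all illustrative simplifications. Each object's box and caption tokens may attend only to one another and to global tokens, so descriptions of different objects cannot mix.

```python
# Minimal sketch of masked self-attention for object-text binding (assumed
# implementation, not the paper's code). Tokens tagged with the same object id
# attend only to each other and to global tokens, preventing attribute mixing.
import torch
import torch.nn.functional as F

def binding_mask(obj_ids):
    """obj_ids: (L,) object index per token; -1 marks global tokens.

    Returns a boolean (L, L) mask where True means attention is allowed.
    """
    same = obj_ids[:, None] == obj_ids[None, :]
    is_global = (obj_ids[:, None] < 0) | (obj_ids[None, :] < 0)
    return same | is_global

def masked_self_attention(x, obj_ids, num_heads=4):
    """Multi-head self-attention restricted by the binding mask.

    For brevity the learned q/k/v projections are omitted (q = k = v = x).
    """
    L, d = x.shape
    dh = d // num_heads
    mask = binding_mask(obj_ids)                            # (L, L)
    q = k = v = x.view(L, num_heads, dh).transpose(0, 1)    # (H, L, dh)
    scores = (q @ k.transpose(-2, -1)) / dh ** 0.5          # (H, L, L)
    scores = scores.masked_fill(~mask, float("-inf"))       # block cross-object
    out = F.softmax(scores, dim=-1) @ v                     # (H, L, dh)
    return out.transpose(0, 1).reshape(L, d)

# Toy layout: 2 global tokens, caption tokens for objects 0 and 1, then one
# visual token per rendered box (hypothetical sequence layout).
obj_ids = torch.tensor([-1, -1, 0, 0, 1, 1, 0, 1])
x = torch.randn(8, 16)
y = masked_self_attention(x, obj_ids)   # (8, 16)
```
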
Problem

Research questions and friction points this paper is trying to address.

occlusion reasoning
3D layout-conditioned generation
inter-object occlusions
depth-consistent geometry
text-to-image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

occlusion-aware generation
3D layout conditioning
translucent 3D representation
masked self-attention
camera-consistent synthesis