SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

📅 2026-04-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing unified image generation models lack intrinsic 3D geometric understanding and receive no explicit spatial constraints during generation, which limits their performance on spatially-aware tasks. This work proposes an architecture that pairs a parallel spatial Transformer with a depth adapter; the authors present it as the first to build an intrinsic 3D geometric perception mechanism into a unified generative framework. The approach uses a Mixture-of-Transformers structure and a two-stage progressive training strategy to inject explicit geometric guidance while keeping inference overhead low. On spatial perception benchmarks, the method is reported to significantly outperform leading models such as GPT-4o, and it shows consistent gains on both text-to-image synthesis and image editing.

📝 Abstract

Recent unified image generation models have achieved remarkable success by employing MLLMs for semantic understanding and diffusion backbones for image generation. However, these models remain fundamentally limited in spatially-aware tasks due to a lack of intrinsic spatial understanding and the absence of explicit geometric guidance during generation. In this paper, we propose SpatialFusion, a novel framework that internalizes 3D geometric awareness into unified image generation models. Specifically, we first employ a Mixture-of-Transformers (MoT) architecture to augment the MLLM with a parallel spatial transformer to enhance 3D geometric modeling capability. By sharing self-attention with the MLLM, the spatial transformer learns to derive metric-depth maps of target images from rich semantic contexts. These explicit geometric scaffolds are then injected into the diffusion backbone through a specialized depth adapter, providing precise spatial constraints for spatially-coherent image generation. Through a progressive two-stage training strategy, SpatialFusion significantly enhances performance on spatially-aware benchmarks, notably outperforming leading models such as GPT-4o. Additionally, it achieves generalized performance gains across both text-to-image generation and image editing scenarios, all while maintaining negligible inference overhead.
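The shared self-attention idea in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation: it assumes a single attention head, no masking, no normalization or feed-forward layers, and random weights, and every name in it (`make_branch`, `shared_attention`, the token counts and dimension) is hypothetical. It only shows the general Mixture-of-Transformers pattern in which each branch keeps its own projections while attention is computed jointly over the concatenated semantic and spatial tokens.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def make_branch(dim, rng):
    # Hypothetical: each branch (semantic or spatial) owns its Q, K, V and
    # output projections, so parameters are NOT shared between branches.
    return {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v", "o")}


def shared_attention(semantic_tokens, spatial_tokens, semantic_branch, spatial_branch):
    """Joint self-attention over both token streams with per-branch weights."""
    dim = len(semantic_tokens[0])
    q, k, v = [], [], []
    for tokens, branch in ((semantic_tokens, semantic_branch),
                           (spatial_tokens, spatial_branch)):
        # Project each stream with its own weights, then concatenate.
        q += matmul(tokens, branch["q"])
        k += matmul(tokens, branch["k"])
        v += matmul(tokens, branch["v"])

    # Attention is shared: every token attends to tokens from both streams.
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    mixed = matmul(weights, v)

    # Split the result and apply each branch's own output projection.
    n_sem = len(semantic_tokens)
    semantic_out = matmul(mixed[:n_sem], semantic_branch["o"])
    spatial_out = matmul(mixed[n_sem:], spatial_branch["o"])
    return semantic_out, spatial_out, weights


rng = random.Random(0)
dim = 4
semantic = random_matrix(3, dim, rng)  # 3 toy semantic (MLLM) tokens
spatial = random_matrix(2, dim, rng)   # 2 toy spatial-branch tokens
sem_out, spa_out, attn = shared_attention(
    semantic, spatial, make_branch(dim, rng), make_branch(dim, rng)
)
```

In this sketch the attention matrix has one row per token across both streams, so the spatial tokens can read semantic context directly. How the resulting depth prediction is decoded and passed to the diffusion backbone through the depth adapter is not specified in the abstract and is left out here.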

Problem

Research questions and friction points this paper is trying to address.

spatial awareness

3D geometric understanding

unified image generation

geometric guidance

spatially-aware tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D geometric awareness

Mixture-of-Transformers

depth adapter