SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

📅 2026-04-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unified image generation models lack intrinsic 3D geometric understanding and receive no explicit geometric guidance during generation, which limits their performance on spatially-aware tasks. This work proposes SpatialFusion, an architecture that pairs the MLLM with a parallel spatial Transformer and adds a depth adapter to the diffusion backbone, building an intrinsic 3D geometric perception mechanism into a unified generative framework. The approach uses a Mixture-of-Transformers structure and a two-stage progressive training strategy to inject explicit geometric guidance while keeping inference overhead low. On spatial perception benchmarks, the method significantly outperforms leading models such as GPT-4o, and it shows consistent gains across text-to-image synthesis and image editing tasks.
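One way to picture how a depth adapter can add geometric guidance without disturbing a pretrained backbone is a zero-initialized residual branch. The sketch below is illustrative only: the summary does not describe the adapter's internals, so the zero-initialization trick (borrowed from ControlNet-style conditioning), the two-layer MLP, and all names (`depth_adapter`, `inject`, `w_in`, `w_out`) are assumptions rather than the authors' design.

```python
import numpy as np

def depth_adapter(depth, w_in, w_out):
    """Map an (h, w) depth map to (h*w, d) residual features.

    Hypothetical two-layer MLP applied per pixel; not the paper's adapter.
    """
    tokens = depth.reshape(-1, 1)            # one scalar token per pixel
    hidden = np.maximum(tokens @ w_in, 0.0)  # ReLU non-linearity
    return hidden @ w_out

def inject(backbone_feats, depth, w_in, w_out):
    """Add the adapter output to the backbone features as a residual."""
    return backbone_feats + depth_adapter(depth, w_in, w_out)

rng = np.random.default_rng(0)
h, w, d, d_hidden = 4, 4, 8, 16

feats = rng.normal(size=(h * w, d))          # stand-in for diffusion features
depth = rng.uniform(0.5, 5.0, size=(h, w))   # stand-in for a metric-depth map

w_in = rng.normal(size=(1, d_hidden))
w_out_zero = np.zeros((d_hidden, d))         # assumed zero init of last layer

# With a zero-initialized output layer the backbone is left unchanged.
unchanged = inject(feats, depth, w_in, w_out_zero)

# Once the output layer has non-zero weights, depth shifts the features.
w_out_trained = rng.normal(scale=0.1, size=(d_hidden, d))
guided = inject(feats, depth, w_in, w_out_trained)
```

Under this assumption the adapter starts as an identity mapping around the backbone, so training can introduce the depth signal gradually, and at inference it costs only one small extra branch.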
📝 Abstract
Recent unified image generation models have achieved remarkable success by employing MLLMs for semantic understanding and diffusion backbones for image generation. However, these models remain fundamentally limited in spatially-aware tasks due to a lack of intrinsic spatial understanding and the absence of explicit geometric guidance during generation. In this paper, we propose SpatialFusion, a novel framework that internalizes 3D geometric awareness into unified image generation models. Specifically, we first employ a Mixture-of-Transformers (MoT) architecture to augment the MLLM with a parallel spatial transformer to enhance 3D geometric modeling capability. By sharing self-attention with the MLLM, the spatial transformer learns to derive metric-depth maps of target images from rich semantic contexts. These explicit geometric scaffolds are then injected into the diffusion backbone through a specialized depth adapter, providing precise spatial constraints for spatially-coherent image generation. Through a progressive two-stage training strategy, SpatialFusion significantly enhances performance on spatially-aware benchmarks, notably outperforming leading models such as GPT-4o. Additionally, it achieves generalized performance gains across both text-to-image generation and image editing scenarios, all while maintaining negligible inference overhead.
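The phrase "sharing self-attention with the MLLM" can be illustrated with a minimal single-head sketch of a Mixture-of-Transformers layer: each stream keeps its own projection weights, but attention is computed jointly over the concatenated token sequence, so spatial tokens can read from semantic tokens. Everything here is an assumption made for illustration (single head, no masking, no normalization or feed-forward layers, the names `shared_attention` and `init_params`); the abstract does not specify these details.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(rng, d):
    """Hypothetical per-stream weights: query, key, value, output."""
    return {name: rng.normal(scale=d ** -0.5, size=(d, d))
            for name in ("wq", "wk", "wv", "wo")}

def shared_attention(sem, spa, p_sem, p_spa):
    """sem: (n_sem, d) semantic tokens, spa: (n_spa, d) spatial tokens."""
    # Each stream projects its own tokens with its own weights.
    q = np.concatenate([sem @ p_sem["wq"], spa @ p_spa["wq"]])
    k = np.concatenate([sem @ p_sem["wk"], spa @ p_spa["wk"]])
    v = np.concatenate([sem @ p_sem["wv"], spa @ p_spa["wv"]])

    # Attention itself is shared: every token attends over both streams.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mixed = attn @ v

    # Split back and apply stream-specific output projections.
    n_sem = sem.shape[0]
    out_sem = mixed[:n_sem] @ p_sem["wo"]
    out_spa = mixed[n_sem:] @ p_spa["wo"]
    return out_sem, out_spa, attn

rng = np.random.default_rng(0)
d = 16
sem = rng.normal(size=(5, d))   # stand-in for MLLM hidden states
spa = rng.normal(size=(3, d))   # stand-in for spatial-transformer states
p_sem, p_spa = init_params(rng, d), init_params(rng, d)

out_sem, out_spa, attn = shared_attention(sem, spa, p_sem, p_spa)
```

In this sketch the block `attn[5:, :5]` holds the weights with which the three spatial tokens attend to the five semantic tokens, which is the path through which semantic context could inform depth prediction.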
Problem

Research questions and friction points this paper is trying to address.

spatial awareness
3D geometric understanding
unified image generation
geometric guidance
spatially-aware tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D geometric awareness
Mixture-of-Transformers
depth adapter
spatially-coherent generation
unified image generation