3D Space as a Scratchpad for Editable Text-to-Image Generation

📅 2026-01-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limited spatial reasoning capabilities of current vision-language models, which struggle to generate images that faithfully reflect geometric relationships, object identities, and compositional intent. To overcome this, the authors propose a 3D spatial sketchpad framework that explicitly leverages 3D space as a reasoning medium. The method parses textual prompts to produce editable 3D meshes, which are then arranged by an agent that plans object placement, orientation, and camera viewpoint. Final image generation is achieved through identity-preserving rendering in conjunction with a vision-language model. By moving beyond conventional 2D layout paradigms, the approach significantly enhances spatial accuracy and controllability, achieving a 32% improvement in text-alignment metrics on the GenAI-Bench benchmark.

πŸ“ Abstract
Recent progress in large language models (LLMs) has shown that reasoning improves when intermediate thoughts are externalized into explicit workspaces, such as chain-of-thought traces or tool-augmented reasoning. Yet, visual language models (VLMs) lack an analogous mechanism for spatial reasoning, limiting their ability to generate images that accurately reflect geometric relations, object identities, and compositional intent. We introduce the concept of a spatial scratchpad -- a 3D reasoning substrate that bridges linguistic intent and image synthesis. Given a text prompt, our framework parses subjects and background elements, instantiates them as editable 3D meshes, and employs agentic scene planning for placement, orientation, and viewpoint selection. The resulting 3D arrangement is rendered back into the image domain with identity-preserving cues, enabling the VLM to generate spatially consistent and visually coherent outputs. Unlike prior 2D layout-based methods, our approach supports intuitive 3D edits that propagate reliably into final images. Empirically, it achieves a 32% improvement in text alignment on GenAI-Bench, demonstrating the benefit of explicit 3D reasoning for precise, controllable image generation. Our results highlight a new paradigm for vision-language models that deliberate not only in language, but also in space. Code and visualizations at https://oindrilasaha.github.io/3DScratchpad/
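The paper's own implementation is not reproduced here, but the core idea of the scratchpad — deciding object placement in 3D coordinates and checking spatial relations there, rather than in a 2D layout — can be illustrated with a minimal sketch. All names below (`SceneObject`, `place_left_of`, `satisfies_left_of`) are hypothetical stand-ins for the framework's mesh instantiation and agentic scene-planning steps, assuming a simple right-handed coordinate convention:

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    """Proxy for an editable 3D mesh: a name plus a position in scene
    coordinates (x to the right, y up, z toward the camera)."""
    name: str
    position: tuple  # (x, y, z)

def place_left_of(anchor: SceneObject, name: str, gap: float = 1.0) -> SceneObject:
    """Instantiate a new object `gap` units to the left of `anchor`
    along the x-axis, at the same height and depth."""
    x, y, z = anchor.position
    return SceneObject(name, (x - gap, y, z))

def satisfies_left_of(a: SceneObject, b: SceneObject) -> bool:
    """Verify the relation 'a is left of b' directly in 3D,
    before any rendering happens."""
    return a.position[0] < b.position[0]

# Toy plan for the prompt "a cat to the left of a dog":
dog = SceneObject("dog", (0.0, 0.0, 0.0))
cat = place_left_of(dog, "cat", gap=2.0)
assert satisfies_left_of(cat, dog)  # relation holds in the 3D scratchpad
```

Because the constraint is checked in scene coordinates, an edit such as moving `cat` propagates deterministically to the rendered image — the controllability the abstract contrasts with 2D layout-based methods.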
Problem

Research questions and friction points this paper is trying to address.

spatial reasoning
visual language models
3D representation
image generation
geometric relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

spatial scratchpad
3D reasoning
editable text-to-image generation
visual language models
3D scene planning