🤖 AI Summary
This work addresses spatial controllability in image generation, systematically comparing Transformer-based diffusion, flow, and autoregressive architectures under fine-grained spatial conditions such as edge maps and pose keypoints. The authors establish control-token prefilling as a simple, general, and performant baseline; identify extending classifier-free guidance to the control signal, together with softmax truncation, as critical sampling-time levers for control consistency; and re-examine adapter-based fine-tuning, showing that it mitigates task forgetting and preserves generation quality under limited downstream data, though it trails full training on generation-control consistency. Controlled experiments on ImageNet disentangle the effects of architecture, training methodology, and guidance strategy, yielding a reproducible benchmark and practical design principles for controllable image synthesis.
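The control-token prefilling idea described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function names and the assumption that the control signal is tokenized by the same tokenizer as the image are ours.

```python
import numpy as np

def build_prefilled_sequence(control_tokens, image_tokens):
    # Hypothetical control-token prefilling: the quantized control signal
    # (e.g. an edge map passed through the image tokenizer) is placed at the
    # start of the transformer's input sequence, so every image token can
    # attend to it without any architectural change.
    return np.concatenate([control_tokens, image_tokens])

def prefix_causal_mask(n_ctrl, n_img):
    # The control prefix is given (prefilled), never predicted, so it is
    # fully visible to all positions; image tokens remain causal among
    # themselves. True = attention allowed.
    n = n_ctrl + n_img
    mask = np.tril(np.ones((n, n), dtype=bool))  # standard causal base
    mask[:, :n_ctrl] = True                      # every position sees the prefix
    return mask
```

For diffusion/flow transformers the same principle applies with full (non-causal) attention over the concatenated sequence; only the AR case needs the prefix mask.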
📝 Abstract
Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via, e.g., edge maps or poses. Although this task has seen impressive improvements recently, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison: differing training data, model architectures, and generation paradigms make it difficult to disentangle the factors contributing to performance, while the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion/flow-based and autoregressive (AR) models. First, we establish control-token prefilling as a simple, general, and performant baseline approach for transformers. We then investigate previously underexplored sampling-time enhancements, showing that extending classifier-free guidance to the control signal, as well as softmax truncation, has a strong impact on generation-control consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate "forgetting" and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency. Code will be released upon publication.
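The two sampling-time enhancements named in the abstract can be sketched as follows. This is a minimal illustration under our own assumptions: guidance is composed over three predictions (unconditional, text/class-conditional, and text+control-conditional), shown here on AR logits, and softmax truncation is realized as top-k masking; the paper's exact formulation and scales may differ.

```python
import numpy as np

def guided_logits(logits_uncond, logits_cond, logits_full, s_cond=1.5, s_ctrl=1.5):
    # Classifier-free guidance extended to the control signal: a separate
    # guidance scale pushes the prediction toward the control-conditional
    # branch. The same linear composition applies to diffusion/flow noise
    # or velocity estimates in place of logits.
    return (logits_uncond
            + s_cond * (logits_cond - logits_uncond)
            + s_ctrl * (logits_full - logits_cond))

def truncated_softmax(logits, k=50):
    # Softmax truncation (top-k): mask all but the k largest logits before
    # normalizing, concentrating probability mass on high-confidence tokens,
    # which tightens generation-control consistency at some diversity cost.
    logits = np.asarray(logits, dtype=np.float64)
    kth = np.sort(logits)[-k] if k < logits.size else logits.min()
    masked = np.where(logits >= kth, logits, -np.inf)
    exp = np.exp(masked - masked.max())
    return exp / exp.sum()
```

Setting `s_ctrl=0` recovers plain text/class-conditional CFG, which makes the control-guidance term easy to ablate.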