Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers

πŸ“… 2026-01-09
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study investigates how Diffusion Transformers (DiTs) model spatial relationships between objects in text-to-image generation. Using mechanistic interpretability methods, the authors train DiTs of varying scales to systematically analyze their internal mechanisms when generating images containing two objects with specified attributes and spatial relations. The findings reveal that with random text embeddings, DiTs rely on a two-stage cross-attention circuit to process spatial relationships, whereas models using a pretrained T5 text encoder achieve this through a single-token fusion mechanism. Although both variants attain near-perfect accuracy on in-distribution tasks, they exhibit markedly different robustness under out-of-distribution perturbations, highlighting the critical role of text encoder choice in shaping the model’s generalization capabilities.

πŸ“ Abstract
Diffusion Transformers (DiTs) have greatly advanced text-to-image generation, but models still struggle to generate the correct spatial relations between objects as specified in the text prompt. In this study, we adopt a mechanistic interpretability approach to investigate how a DiT can generate correct spatial relations between objects. We train, from scratch, DiTs of different sizes with different text encoders to learn to generate images containing two objects whose attributes and spatial relations are specified in the text prompt. We find that, although all the models can learn this task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder. When using random text embeddings, we find that the spatial-relation information is passed to image tokens through a two-stage circuit, involving two cross-attention heads that separately read the spatial relation and single-object attributes in the text prompt. When using a pretrained text encoder (T5), we find that the DiT uses a different circuit that leverages information fusion in the text tokens, reading spatial-relation and single-object information together from a single text token. We further show that, although the in-domain performance is similar for the two settings, their robustness to out-of-domain perturbations differs, potentially suggesting the difficulty of generating correct relations in real-world scenarios.
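The circuits described in the abstract are built from cross-attention heads in which image tokens query the text tokens, so that relation and attribute information flows from the prompt into the image stream. The following is a minimal illustrative sketch of one such head (not the paper's code; all names and shapes are assumptions), useful for seeing how an attention map reveals which text token each image token reads from:

```python
import numpy as np

def cross_attention_head(img_tokens, txt_tokens, Wq, Wk, Wv):
    """One cross-attention head: image tokens (queries) attend over
    text tokens (keys/values). Returns the head output and the
    attention map, whose rows show which text tokens each image
    token reads from -- the quantity circuit analyses inspect."""
    Q = img_tokens @ Wq                      # (n_img, d_head)
    K = txt_tokens @ Wk                      # (n_txt, d_head)
    V = txt_tokens @ Wv                      # (n_txt, d_head)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over text tokens
    return attn @ V, attn
```

Under this framing, a "single-token fusion" mechanism would show up as attention mass concentrated on one text token that carries both relation and object information, while a two-stage circuit would split that mass across heads that attend to relation and attribute tokens separately.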
Problem

Research questions and friction points this paper is trying to address.

spatial relations
Diffusion Transformers
text-to-image generation
object placement
mechanistic interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformers
spatial relations
mechanistic interpretability
cross-attention circuits
text-to-image generation