Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image Generation

📅 2025-09-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models exhibit limited capability in understanding spatial relationships (e.g., “the dog is to the right of the teddy bear”), especially for atypical configurations (e.g., “the giraffe is above the airplane”), where error rates remain high. Existing fine-tuning or hand-crafted loss-based approaches yield marginal improvements. This paper proposes Learn-to-Steer, a novel inference-time framework that requires no model modification or retraining. It employs a lightweight classifier to learn spatial relations in a data-driven manner from cross-attention maps, enabling construction of a geometry-aware loss function. A dual inversion strategy is introduced to suppress linguistic cues, thereby compelling the model to rely on geometric structure. Evaluated on FLUX.1-dev and SD2.1, Learn-to-Steer achieves spatial accuracy of 0.61 and 0.54—improving over baselines by more than 40 percentage points—and generalizes robustly across diverse spatial relations.
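The core mechanism above — reading a spatial relation off per-token cross-attention maps and turning it into a loss — can be illustrated with a toy sketch. This is not the paper's implementation: it uses a hand-rolled centroid rule where Learn-to-Steer trains a classifier, and the 4x4 maps and function names are made up for illustration.

```python
# Toy sketch (not the paper's code): decoding a spatial relation from two
# per-object-token cross-attention maps via attention-weighted centroids.
# In Learn-to-Steer a trained classifier replaces the hand-rolled rule below.

def centroid(attn):
    """Attention-weighted (x, y) centroid of a 2D attention map."""
    total = sum(sum(row) for row in attn)
    x = sum(w * j for row in attn for j, w in enumerate(row)) / total
    y = sum(w * i for i, row in enumerate(attn) for w in row) / total
    return x, y

def relation_score(attn_a, attn_b, relation):
    """Higher when the maps satisfy 'object A <relation> object B'."""
    (xa, ya), (xb, yb) = centroid(attn_a), centroid(attn_b)
    if relation == "right_of":
        return xa - xb
    if relation == "above":
        return yb - ya  # smaller row index = higher in the image
    raise ValueError(relation)

# Two 4x4 maps: token A attends to the right half, token B to the left half.
A = [[0, 0, 1, 1]] * 4
B = [[1, 1, 0, 0]] * 4
loss = -relation_score(A, B, "right_of")  # negative score -> loss to minimize
```

A negative loss here means the relation already holds; during generation, minimizing such a loss steers the attention maps (and hence the layout) toward the requested configuration.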

📝 Abstract
Text-to-image diffusion models can generate stunning visuals, yet they often fail at tasks children find trivial, such as placing a dog to the right of a teddy bear rather than to the left. When combinations get more unusual (a giraffe above an airplane), these failures become even more pronounced. Existing methods attempt to fix these spatial reasoning failures through model fine-tuning or test-time optimization with handcrafted losses that are suboptimal. Rather than imposing our assumptions about spatial encoding, we propose learning these objectives directly from the model's internal representations. We introduce Learn-to-Steer, a novel framework that learns data-driven objectives for test-time optimization rather than handcrafting them. Our key insight is to train a lightweight classifier that decodes spatial relationships from the diffusion model's cross-attention maps, then deploy this classifier as a learned loss function during inference. Training such classifiers poses a surprising challenge: they can take shortcuts by detecting linguistic traces rather than learning true spatial patterns. We solve this with a dual-inversion strategy that enforces geometric understanding. Our method dramatically improves spatial accuracy: from 0.20 to 0.61 on FLUX.1-dev and from 0.07 to 0.54 on SD2.1 across standard benchmarks. Moreover, our approach generalizes to multiple relations and significantly improves accuracy.
Problem

Research questions and friction points this paper is trying to address.

Improving spatial reasoning in text-to-image generation models
Replacing handcrafted losses with data-driven inference-time objectives
Preventing linguistic shortcuts in spatial relationship classifiers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns data-driven objectives from internal representations
Uses lightweight classifier on cross-attention maps
Employs dual-inversion strategy for geometric understanding
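The inference-time loop implied by these contributions — a frozen generator whose latents are nudged by gradient steps on a learned loss — can be sketched minimally. Everything below is illustrative: the 1-D "latent", the stand-in `learned_loss`, and the finite-difference gradient (the real method backpropagates through cross-attention maps of a diffusion model) are assumptions, not the paper's code.

```python
# Minimal sketch of test-time optimization with a learned loss.
# The model itself is never modified; only the latent is updated.

def learned_loss(z):
    """Stand-in for the trained relation classifier's loss; this toy
    'classifier' prefers latents near z = 3.0."""
    return (z - 3.0) ** 2

def steer(z, steps=50, lr=0.1, eps=1e-4):
    """Gradient descent on the latent via central finite differences;
    in practice autograd supplies the gradient."""
    for _ in range(steps):
        grad = (learned_loss(z + eps) - learned_loss(z - eps)) / (2 * eps)
        z -= lr * grad
    return z

z_star = steer(0.0)  # converges toward the loss minimum at 3.0
```

The design point this illustrates: because the objective is a trained network rather than a handcrafted rule, the same loop applies unchanged to any relation the classifier has learned.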