Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image generative models exhibit fundamental deficiencies in compositional logical reasoning, specifically failing to correctly synthesize negation, counting, and spatial relations—despite handling each individually. Method: We conduct a systematic empirical evaluation and architectural analysis to diagnose the root causes of this compositional failure. Contribution/Results: We identify three primary sources: (1) scarcity of compositional patterns in training data; (2) inherent limitations of continuous attention mechanisms in modeling discrete logical operations; and (3) evaluation metrics biased toward visual plausibility rather than logical fidelity. Crucially, we demonstrate that standard interventions—such as data augmentation or fine-tuning—fail to bridge this “compositional gap.” Genuine progress requires rethinking representational formalisms and reasoning mechanisms, not incremental architectural refinements. Our findings provide theoretical insights and methodological guidance for developing multimodal generative models with robust compositional generalization.

📝 Abstract
The architectural blueprint of today's leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives: negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.
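The abstract's third factor — metrics that reward visual plausibility over constraint satisfaction — can be made concrete with a small sketch. The following Python snippet is a hypothetical illustration (not the paper's actual evaluation code): it assumes an object detector has produced `(label, x_center)` pairs for an image, and scores the image all-or-nothing against the three primitives the survey studies (negation, counting, spatial relations). A similarity-based metric could rate an image highly even when one of these hard constraints is violated.

```python
# Hypothetical constraint-based scoring for compositional prompts.
# Detections are assumed to come from some object detector: (label, x_center) pairs.

def check_negation(detections, absent_label):
    """Negation: the named object must not appear at all."""
    return all(label != absent_label for label, _ in detections)

def check_count(detections, label, expected):
    """Counting: exactly `expected` instances of `label`."""
    return sum(1 for l, _ in detections if l == label) == expected

def check_left_of(detections, left_label, right_label):
    """Spatial relation: every `left_label` lies left of every `right_label`."""
    lefts = [x for l, x in detections if l == left_label]
    rights = [x for l, x in detections if l == right_label]
    return bool(lefts) and bool(rights) and max(lefts) < min(rights)

def compositional_score(detections, constraints):
    """All-or-nothing: a single violated constraint fails the image."""
    return all(check(detections) for check in constraints)

# Prompt: "two cats to the left of a dog, and no bird"
detections = [("cat", 0.1), ("cat", 0.3), ("dog", 0.8)]
constraints = [
    lambda d: check_negation(d, "bird"),
    lambda d: check_count(d, "cat", 2),
    lambda d: check_left_of(d, "cat", "dog"),
]
print(compositional_score(detections, constraints))  # True: all constraints hold
```

Note the design choice: the score is conjunctive, so an image satisfying two of three constraints still fails — exactly the "logical fidelity" that plausibility-biased metrics miss.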
Problem

Research questions and friction points this paper is trying to address.

Models fail to handle logical composition in text-to-image generation
Performance collapses when combining negation, counting, and spatial relations
Current architectures cannot achieve genuine compositionality through scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes the near-total absence of explicit negations in training data
Identifies why continuous attention architectures struggle with discrete logical operations
Argues for fundamental advances in representation and reasoning rather than scaling