Discovering Divergent Representations between Text-to-Image Models

📅 2025-09-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates how visual representations diverge across text-to-image (T2I) models: which visual attributes appear frequently in images generated by one model but rarely by another, and which textual prompts trigger these disparities. To this end, the authors propose CompCon, an automated discovery framework combining evolutionary search, LLM-guided semantic exploration, and VLM-based visual evaluation, together with an automated data generation pipeline that produces ID2, a benchmark of 60 input-dependent differences for reproducible evaluation. Compared to several LLM- and VLM-powered baselines, CompCon more systematically uncovers cross-model representational discrepancies, for instance PixArt's association of loneliness-related prompts with wet urban street scenes, and Stable Diffusion 3.5's tendency to depict African American people under media-profession prompts. The authors present this as the first interpretable, scalable, and fully automated approach to attributing such input-dependent visual differences in T2I models.

📝 Abstract
In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, "flames" might appear in one model's outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon's ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: https://github.com/adobe-research/CompCon
Problem

Research questions and friction points this paper is trying to address.

Discovering visual attribute differences between text-to-image models
Identifying prompt concepts triggering divergent visual representations
Comparing generative model outputs to quantify differences in attribute prevalence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolutionary search algorithm comparing model outputs
Automated pipeline generating dataset of differences
Identifying prompt concepts causing visual divergences
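The evolutionary loop behind these bullets can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: CompCon uses an LLM to propose and mutate candidate attribute descriptions and a VLM to detect them in generated images, whereas here `mutate` and `detect` are toy stand-ins so the loop runs self-contained.

```python
import random

def prevalence_gap(attribute, images_a, images_b, detect):
    """Score an attribute by how much more often it is detected
    in model A's images than in model B's."""
    rate_a = sum(detect(attribute, im) for im in images_a) / len(images_a)
    rate_b = sum(detect(attribute, im) for im in images_b) / len(images_b)
    return rate_a - rate_b

def evolve(candidates, images_a, images_b, detect, mutate,
           generations=10, seed=0):
    """Keep the best-scoring half of the candidate attributes each
    generation and refill the pool with mutated survivors."""
    rng = random.Random(seed)
    pool = list(candidates)
    for _ in range(generations):
        pool.sort(key=lambda c: prevalence_gap(c, images_a, images_b, detect),
                  reverse=True)
        survivors = pool[: max(1, len(pool) // 2)]
        pool = survivors + [mutate(rng.choice(survivors), rng)
                            for _ in survivors]
    return max(pool, key=lambda c: prevalence_gap(c, images_a, images_b, detect))

# Toy stand-ins: images are tag sets, detection is set membership, and
# mutation samples another attribute from a fixed vocabulary (roles the
# paper assigns to a VLM and an LLM, respectively).
VOCAB = ["wet streets", "night", "rain", "sun"]
images_a = [{"wet streets", "night"}, {"wet streets"}, {"rain"}]
images_b = [{"night"}, {"sun"}, {"rain"}]
detect = lambda attr, im: attr in im
mutate = lambda attr, rng: rng.choice(VOCAB)

best = evolve(VOCAB, images_a, images_b, detect, mutate)
```

On this toy data "wet streets" has the largest prevalence gap (2/3 in model A's images vs. 0 in model B's), so the search settles on it; swapping in real model outputs and LLM/VLM calls is where the actual discovery work lies.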