Constantly Improving Image Models Need Constantly Improving Benchmarks

📅 2025-10-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current image generation evaluation benchmarks lag behind rapid technical advances and inadequately cover complex, creative tasks arising in real-world applications. To address this, we propose ECHO: a novel, application-oriented evaluation framework grounded in 31,000 user prompts and associated feedback crawled from social media, enabling dynamic, usage-informed benchmark construction. ECHO identifies previously unaddressed scenarios, including cross-lingual image editing and receipt generation with specified monetary amounts. It further introduces new quality metrics targeting color fidelity, identity consistency, and structural controllability. Extensive experiments demonstrate that ECHO effectively discriminates performance differences among state-of-the-art models, uncovers behavioral biases under practical usage conditions, and advances the evaluation paradigm from static, task-agnostic benchmarks toward user-driven, scenario-aware assessment.

📝 Abstract
Recent advances in image generation, often driven by proprietary systems like GPT-4o Image Gen, regularly introduce new capabilities that reshape how users interact with these models. Existing benchmarks often lag behind and fail to capture these emerging use cases, leaving a gap between community perceptions of progress and formal evaluation. To address this, we present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use: social media posts that showcase novel prompts and qualitative user judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from such posts. Our analysis shows that ECHO (1) discovers creative and complex tasks absent from existing benchmarks, such as re-rendering product labels across languages or generating receipts with specified totals, (2) more clearly distinguishes state-of-the-art models from alternatives, and (3) surfaces community feedback that we use to inform the design of metrics for model quality (e.g., measuring observed shifts in color, identity, and structure). Our website is at https://echo-bench.github.io.
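
The abstract mentions measuring observed shifts in color between a source image and a model's output. The paper's exact metric definitions are not reproduced here; as a minimal sketch, one plausible formulation is the mean per-pixel CIE76 Delta-E in CIELAB space, where a lower score on an editing task indicates that the original palette was better preserved. The helper name `mean_color_shift` and its interface are hypothetical, not ECHO's published API.

```python
import numpy as np
from skimage.color import rgb2lab  # pip install scikit-image

def mean_color_shift(src_rgb: np.ndarray, out_rgb: np.ndarray) -> float:
    """Mean per-pixel CIE76 Delta-E between a source image and a model
    output of the same shape; larger values indicate a bigger color shift.

    Both inputs are float RGB arrays in [0, 1] with shape (H, W, 3).
    """
    src_lab = rgb2lab(src_rgb)
    out_lab = rgb2lab(out_rgb)
    delta_e = np.linalg.norm(src_lab - out_lab, axis=-1)  # Euclidean distance in Lab
    return float(delta_e.mean())
```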
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lag behind rapidly advancing image generation capabilities
Current evaluations fail to capture emerging real-world use cases
There is a gap between community perceptions of progress and formal evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework that constructs benchmarks from social-media evidence of real model use
Dataset of over 31,000 real-world user prompts curated from such posts
Quality metrics informed by community feedback, e.g. observed shifts in color, identity, and structure (see the sketch below)
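
For the structural-controllability side of those metrics, one crude proxy (again a sketch under assumed definitions, not the paper's metric) is the fraction of source edge pixels that survive in the output image. The function name `edge_overlap` and the Canny `sigma` default are illustrative choices.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import canny  # pip install scikit-image

def edge_overlap(src_rgb: np.ndarray, out_rgb: np.ndarray,
                 sigma: float = 2.0) -> float:
    """Fraction of source edge pixels also present in the output edge map;
    a crude proxy for how well an edit preserved the original structure."""
    src_edges = canny(rgb2gray(src_rgb), sigma=sigma)
    out_edges = canny(rgb2gray(out_rgb), sigma=sigma)
    if not src_edges.any():
        return 1.0  # no structure to preserve
    return float((src_edges & out_edges).sum() / src_edges.sum())
```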