ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
The evaluation of image generation models lacks a unified, comprehensive benchmark. Method: This paper introduces ICE-Bench, the first holistic, unified benchmark covering the full spectrum of image creation and editing. It groups tasks into four coarse-grained categories, based on whether they depend on a source image and/or a reference image, and further decomposes them into 31 fine-grained capability subtasks. On top of this hierarchical "coarse-to-fine" task architecture, the authors propose an evaluation framework spanning six dimensions measured by eleven metrics, including VLLM-QA, a novel vision-language-model-based method for quantifying editing success. Data are collected via hybrid sampling of real-world and synthetic instances to improve diversity and reduce evaluation bias. Contribution/Results: A systematic evaluation of leading models reveals critical weaknesses in controllability, cross-modal consistency, and other dimensions. ICE-Bench is fully open-sourced, including its datasets, code, and evaluation models, to advance standardization in generative-AI assessment.

📝 Abstract
Image generation has witnessed significant advancements in the past few years. However, evaluating the performance of image generation models remains a formidable challenge. In this paper, we propose ICE-Bench, a unified and comprehensive benchmark designed to rigorously assess image generation models. Its comprehensiveness can be summarized in the following key features: (1) Coarse-to-Fine Tasks: We systematically deconstruct image generation into four task categories, No-ref/Ref Image Creating/Editing, based on the presence or absence of source images and reference images, and further decompose them into 31 fine-grained tasks covering a broad spectrum of image generation requirements, culminating in a comprehensive benchmark. (2) Multi-dimensional Metrics: The evaluation framework assesses image generation capabilities across 6 dimensions: aesthetic quality, imaging quality, prompt following, source consistency, reference consistency, and controllability. 11 metrics are introduced to support the multi-dimensional evaluation. Notably, we introduce VLLM-QA, an innovative metric designed to assess the success of image editing by leveraging large models. (3) Hybrid Data: The data come from both real scenes and virtual generation, which improves data diversity and alleviates bias in model evaluation. Through ICE-Bench, we conduct a thorough analysis of existing generation models, revealing both the challenging nature of our benchmark and the gap between current model capabilities and real-world generation requirements. To foster further advancements in the field, we will open-source ICE-Bench, including its dataset, evaluation code, and models, thereby providing a valuable resource for the research community.
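The VLLM-QA idea described above (asking a large vision-language model whether an edit succeeded) can be sketched as a judge-and-parse loop. This is a minimal, illustrative sketch, not the paper's actual implementation: the prompt format, the binary yes/no reduction, and the injected `judge` callable (which would wrap a real VLM API call) are all assumptions.

```python
# Hypothetical sketch of a VLLM-QA-style editing-success metric: a
# vision-language model is asked whether an edit was applied, and its
# free-form answer is reduced to a binary score, then averaged over a batch.
# The judge callable and prompt format are illustrative assumptions.
from typing import Callable


def build_judge_prompt(instruction: str) -> str:
    """Frame the editing instruction as a yes/no question for the judge model."""
    return (
        f"The image was edited with the instruction: '{instruction}'. "
        "Answer strictly 'yes' or 'no': was the edit applied correctly?"
    )


def parse_verdict(answer: str) -> float:
    """Map the judge's free-form reply to a binary success score."""
    first_word = answer.strip().lower().split()[0].strip(".,!")
    return 1.0 if first_word == "yes" else 0.0


def vllm_qa_score(instructions: list[str], judge: Callable[[str], str]) -> float:
    """Average per-sample success; `judge` wraps the actual VLM call."""
    scores = [parse_verdict(judge(build_judge_prompt(i))) for i in instructions]
    return sum(scores) / len(scores)
```

Decoupling the metric from the model via the `judge` callable keeps the scoring logic testable offline and lets the underlying VLM be swapped without changing the benchmark code.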
Problem

Research questions and friction points this paper is trying to address.

Evaluating image generation models remains a formidable challenge: no unified, comprehensive benchmark exists.
Existing evaluations cover only narrow slices of the creation and editing task spectrum.
Limited metrics and single-source data introduce bias into model assessment.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine task taxonomy: four coarse categories decomposed into 31 fine-grained tasks
Multi-dimensional evaluation: 6 dimensions and 11 metrics, including the VLLM-QA editing-success metric
Hybrid data from real scenes and virtual generation to improve diversity and reduce bias
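The coarse-grained taxonomy above partitions tasks by two binary dependencies: on a source image (editing vs. creation) and on a reference image (ref vs. no-ref). A tiny sketch of that mapping, with category names paraphrased from the abstract's "No-ref/Ref Image Creating/Editing" naming:

```python
# Illustrative mapping of a task's image dependencies to one of the four
# coarse-grained ICE-Bench categories; the exact naming is paraphrased
# from the abstract, not taken from the benchmark's code.
def coarse_category(has_source: bool, has_reference: bool) -> str:
    """Classify a task by whether it edits a source image and uses a reference."""
    mode = "Editing" if has_source else "Creating"  # source image => editing
    ref = "Ref" if has_reference else "No-ref"      # reference image => ref-guided
    return f"{ref} Image {mode}"
```

For example, plain text-to-image generation (no source, no reference) falls under "No-ref Image Creating", while subject-driven editing of an existing photo falls under "Ref Image Editing".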