GEBench: Benchmarking Image Generation Models as GUI Environments

📅 2026-02-09

🤖 AI Summary

This work addresses the lack of dedicated benchmarks for evaluating dynamic GUI interaction and temporal consistency in image generation models. To this end, we propose GEBench, a benchmark for GUI generation comprising 700 curated samples across five task categories, covering single-step interactions and multi-step trajectories in both real-world and fictional scenarios, as well as grounding point localization. We introduce GE-Score, a five-dimensional metric assessing goal achievement, interaction logic, content consistency, UI plausibility, and visual quality. Experimental results show that existing models perform well on single-step transitions but degrade significantly over longer interaction sequences, particularly in temporal coherence and spatial grounding. Key bottlenecks include icon interpretation, text rendering, and localization precision.

📝 Abstract

Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.
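To make the five GE-Score dimensions concrete, the following is a minimal Python sketch of how per-sample scores might be stored and averaged. The abstract does not state how the dimensions are scored or combined, so the unweighted mean, the class name `GEScore`, and the helper `mean_by_dimension` are all assumptions made for illustration; they are not taken from the GEBench code.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class GEScore:
    """Hypothetical container for the five GE-Score dimensions of one sample.

    The score scale is not specified in the abstract; any consistent
    numeric range works for this sketch.
    """
    goal_achievement: float
    interaction_logic: float
    content_consistency: float
    ui_plausibility: float
    visual_quality: float

    def overall(self) -> float:
        # ASSUMPTION: unweighted mean of the five dimensions.
        return mean(getattr(self, f.name) for f in fields(self))


def mean_by_dimension(scores: list[GEScore]) -> dict[str, float]:
    """Average each dimension separately across a list of samples."""
    if not scores:
        raise ValueError("at least one score is required")
    return {
        f.name: mean(getattr(s, f.name) for s in scores)
        for f in fields(GEScore)
    }


if __name__ == "__main__":
    # Made-up numbers, purely to show the call pattern.
    samples = [
        GEScore(4.0, 3.0, 5.0, 4.0, 4.0),
        GEScore(2.0, 3.0, 3.0, 4.0, 2.0),
    ]
    print(samples[0].overall())
    print(mean_by_dimension(samples))
```

Reporting each dimension separately alongside any overall number keeps it visible which aspect (for example, interaction logic versus visual quality) accounts for a drop between single-step and multi-step tasks.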

Problem

Research questions and friction points this paper is trying to address.

GUI generation

temporal coherence

state transition

image generation benchmark

dynamic interaction

Innovation

Methods, ideas, or system contributions that make the work stand out.

GUI generation

temporal coherence

benchmark