PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

📅 2026-02-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses planning-oriented image generation and editing in computer-use scenarios, evaluating the spatial reasoning and procedural understanding of unified multimodal models on tasks such as route planning, workflow diagram generation, and web/UI layout rendering. To this end, it introduces PlanViz, a new benchmark comprising three categories of everyday planning sub-tasks with human-annotated questions and reference images, and proposes PlanScore, a task-adaptive evaluation metric that assesses the correctness, visual quality, and efficiency of generated outputs. The experiments highlight key limitations of current models on such planning-intensive visual generation tasks, and the work gives the community a benchmark and an evaluation framework to guide future research.

📝 Abstract

Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential for supporting computer-use planning tasks, which are closely tied to everyday life, remains underexplored. Image generation and editing in computer-use tasks require capabilities such as spatial reasoning and procedural understanding, and it is still unknown whether UMMs possess these capabilities well enough to complete such tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To serve this evaluation goal, we focus on sub-tasks that frequently arise in daily life and require planning steps. Specifically, we design three new sub-tasks: route planning, work diagramming, and web&UI displaying. We address the challenge of ensuring data quality by curating human-annotated questions and reference images and by applying a quality control process. To address the challenge of comprehensive and exact evaluation, we propose a task-adaptive score, PlanScore. The score helps to understand the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.

Problem

Research questions and friction points this paper is trying to address.

planning-oriented image generation

computer-use tasks

spatial reasoning

procedural understanding

multimodal models

Innovation

Methods, ideas, or system contributions that make the work stand out.

PlanViz

planning-oriented image generation

multimodal evaluation benchmark