ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation paradigms for mobile agents suffer from critical limitations: offline benchmarks validate only a single “gold path” and lack coverage of complex, long-horizon tasks, while online testing is hindered by the uncontrollability and irreproducibility of real devices. Method: We propose ColorBench, a graph-structured benchmarking framework that models task workflows as state-transition graphs, enabling multi-path annotation, subtask completion rate tracking, and atomic capability disentanglement, thereby supporting static, reproducible simulation of dynamic interactions. It integrates erroneous-path injection and realistic device-state modeling to bridge the offline–online evaluation gap. Contribution/Results: The released ColorBench comprises 175 long-horizon tasks (mean length > 13 steps), each with at least two valid solution paths. Systematic baseline experiments expose key bottlenecks in path generalization, state persistence, and subtask coordination, providing quantifiable insights for targeted model improvement.

📝 Abstract
The rapid advancement of multimodal large language models has enabled agents to operate mobile devices by directly interacting with graphical user interfaces, opening new possibilities for mobile automation. However, real-world mobile tasks are often complex and allow for multiple valid solutions. This contradicts current mobile agent evaluation standards: offline static benchmarks can only validate a single predefined "golden path", while online dynamic testing is constrained by the complexity and non-reproducibility of real devices, making both approaches inadequate for comprehensively assessing agent capabilities. To bridge the gap between offline and online evaluation and enhance testing stability, this paper introduces a novel graph-structured benchmarking framework. By modeling the finite states observed during real-device interactions, it achieves static simulation of dynamic behaviors. Building on this, we develop ColorBench, a benchmark focused on complex long-horizon tasks. It supports evaluation of multiple valid solutions, subtask completion rate statistics, and atomic-level capability analysis. ColorBench contains 175 tasks (74 single-app, 101 cross-app) with an average length of over 13 steps. Each task includes at least two correct paths and several typical error paths, enabling quasi-dynamic interaction. By evaluating ColorBench across various baselines, we discover limitations of existing models and propose improvement directions and feasible technical pathways to enhance agents' performance on complex, long-horizon problems based on experimental results. Code and data are available at: https://github.com/MadeAgents/ColorBench.
Problem

Research questions and friction points this paper is trying to address.

Bridging offline and online mobile agent evaluation gaps
Enabling static simulation of dynamic mobile interactions
Assessing multiple valid solutions for complex mobile tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-structured framework simulates dynamic mobile interactions
Static simulation models finite states from real-device behaviors
Benchmark supports multiple valid paths and atomic capability analysis
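The core evaluation idea above can be illustrated with a minimal sketch. This is a hypothetical toy model, not the paper's actual implementation: UI states are graph nodes, agent actions are directed edges, any walk from the start state to a goal state is a valid solution, and the number of matched steps supports a subtask-completion-style metric. All state and action names here are invented for illustration.

```python
class TaskGraph:
    """Toy state-transition graph for replaying an agent's action trace."""

    def __init__(self, edges, start, goals):
        # edges: {state: {action: next_state}} adjacency with labeled edges
        self.edges = edges
        self.start = start
        self.goals = set(goals)

    def replay(self, actions):
        """Replay an action sequence; return (success, steps_matched).

        Stops early if the agent issues an action that is invalid in the
        current state (analogous to wandering onto an error path).
        """
        state, matched = self.start, 0
        for action in actions:
            nxt = self.edges.get(state, {}).get(action)
            if nxt is None:  # action not defined in this state
                break
            state, matched = nxt, matched + 1
            if state in self.goals:
                return True, matched
        return False, matched


# Two distinct valid paths reach the same goal, mirroring the
# multi-path annotation idea (state/action names are illustrative).
graph = TaskGraph(
    edges={
        "home":    {"open_app": "app", "search": "results"},
        "app":     {"tap_item": "detail"},
        "results": {"tap_first": "detail"},
        "detail":  {"confirm": "done"},
    },
    start="home",
    goals=["done"],
)

print(graph.replay(["open_app", "tap_item", "confirm"]))  # → (True, 3)
print(graph.replay(["search", "tap_first", "confirm"]))   # → (True, 3)
```

Because the graph is static, such a replay is fully reproducible, while still accepting multiple correct trajectories rather than a single golden path.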