What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

📅 2025-06-10
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing virtual agent benchmarks suffer from uncontrolled task complexity, narrow scenario coverage, single-dimensional evaluation, and heavy reliance on manual annotation. To address these limitations, we propose OmniBench—the first self-generating, cross-platform, graph-structured, multidimensional benchmark. It leverages a subtask knowledge graph and an automated synthesis pipeline to generate controllable-complexity tasks across 20 diverse scenarios (36K tasks total). Our approach introduces a novel graph-structured task synthesis paradigm and the OmniEval framework, enabling joint assessment of subtask-level accuracy, graph-topological validity, and ten core agent capabilities. Synthesized data achieves a 91% human acceptance rate and yields higher training efficiency than manually annotated data. Comprehensive evaluation of over 20 state-of-the-art multimodal large language model (MLLM)-based agents reveals precise capability bottlenecks, facilitating quantifiable, systematic advancement of virtual agent intelligence.
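The summary above describes the synthesis pipeline only at a high level; below is a minimal sketch of the graph-structured idea it names, with subtasks as graph nodes and tasks composed by walking dependency edges. The `SubtaskGraph` class, its methods, and the example subtask names are illustrative assumptions, not OmniBench's actual implementation.

```python
import random
from collections import defaultdict

class SubtaskGraph:
    """Toy subtask knowledge graph: nodes are subtasks, directed edges
    record which subtask can follow which (an assumption for illustration)."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def sample_task(self, start, complexity, rng=random):
        """Compose one task as a walk of `complexity` subtasks; walk length
        is the knob that makes task complexity controllable."""
        task, node = [start], start
        while len(task) < complexity and self.edges[node]:
            node = rng.choice(self.edges[node])
            task.append(node)
        return task

# Hypothetical subtasks for a shopping scenario.
g = SubtaskGraph()
g.add_edge("open_app", "search_item")
g.add_edge("search_item", "add_to_cart")
g.add_edge("add_to_cart", "checkout")
print(g.sample_task("open_app", complexity=3))
# ['open_app', 'search_item', 'add_to_cart']
```

Tying complexity to walk length is one simple way to realize the paper's "controllable complexity" claim: longer walks compose more subtasks into a single task.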

📝 Abstract
As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation with limited scenario coverage, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios and achieves a 91% human acceptance rate. Training on our graph-structured data guides agents more efficiently than training on manually annotated data. We conduct multidimensional evaluations of various open-source and closed-source models, revealing their performance across capabilities and paving the way for future advancements. Our project is available at https://omni-bench.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Existing virtual agent benchmarks offer no control over task complexity
Heavy reliance on manual annotation limits scenario coverage
No framework for multidimensional capability evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-generating, cross-platform, graph-based benchmark
Automated pipeline for synthesizing tasks of controllable complexity
Multidimensional evaluation framework, OmniEval (see the sketch after this list)
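To make the evaluation dimensions concrete, here is a toy sketch of two axes the paper names, subtask-level accuracy and graph-topological validity, assuming an agent trajectory is a list of executed subtasks and dependencies are (before, after) edges. Function names and data shapes are hypothetical, not OmniEval's actual API.

```python
def subtask_accuracy(executed, gold):
    """Subtask-level score: fraction of required subtasks the agent completed."""
    return len(set(executed) & set(gold)) / len(gold)

def topologically_valid(executed, deps):
    """Graph-based check: each dependency (a, b) requires a to run before b."""
    pos = {s: i for i, s in enumerate(executed)}
    return all(a in pos and b in pos and pos[a] < pos[b] for a, b in deps)

gold = ["open_app", "search_item", "add_to_cart"]
deps = [("open_app", "search_item"), ("search_item", "add_to_cart")]
run = ["open_app", "search_item", "add_to_cart"]
print(subtask_accuracy(run, gold), topologically_valid(run, deps))  # 1.0 True
```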

Authors

Wendong Bu
Zhejiang University, Hangzhou, China
Yang Wu
Ant Group, Hangzhou, China
Qifan Yu
Zhejiang University
MLLM, multimodal learning, image generation & editing
Minghe Gao
Zhejiang University
Machine Learning
Bingchen Miao
Zhejiang University, Hangzhou, China
Zhenkui Zhang
Zhejiang University, Hangzhou, China
Kaihang Pan
Zhejiang University
NLP, vision-and-language
Yunfei Li
ByteDance Seed
Reinforcement Learning, Robotics
Mengze Li
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Wei Ji
Nanjing University, Nanjing, China
Juncheng Li
East China Normal University
Super Resolution, Image Restoration, Computer Vision, Medical Image Analysis
Siliang Tang
Professor of Computer Science, Zhejiang University
Natural Language Processing, Cross-media Analysis, Graph Neural Network
Yueting Zhuang
Zhejiang University, Hangzhou, China