Benchmarking Agentic Workflow Generation

📅 2024-10-10
🏛️ arXiv.org
🤖 AI Summary
Existing LLM-based workflow generation evaluation frameworks suffer from narrow scenario coverage, oversimplified structural modeling, and lenient evaluation standards. To address these limitations, we propose WorFBench—the first unified benchmark for agent-oriented workflow generation—featuring multi-scenario coverage and native support for complex directed-graph-structured workflows. We further introduce WorFEval, a systematic evaluation protocol that jointly leverages subsequence matching and subgraph matching to enable fine-grained, structure-aware quantification of both sequential and graph-structural planning capabilities. Our evaluation reveals a substantial capability gap (e.g., up to 15% for GPT-4) between LLMs’ performance on sequential versus graph-structural planning tasks—a phenomenon previously unreported. Fine-tuning two open-source models on WorFBench yields strong generalization to unseen tasks. Moreover, workflows generated under this framework demonstrably enhance downstream reasoning efficiency and task performance.
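The summary states that WorFEval scores workflows with subsequence and subgraph matching. As an illustration of the sequential half, the sketch below computes a subsequence score as the longest common subsequence between predicted and gold node sequences, normalized by the gold length. The scoring formula and node labels here are assumptions for illustration, not the paper's exact protocol.

```python
def lcs_length(pred, gold):
    """Longest common subsequence length via dynamic programming."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == gold[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def subsequence_score(pred, gold):
    """Fraction of the gold node sequence recovered, in order, by the prediction.
    Hypothetical normalization; WorFEval's actual metric may differ."""
    if not gold:
        return 1.0
    return lcs_length(pred, gold) / len(gold)

# Example: a predicted chain that skips one gold step
gold = ["search", "filter", "summarize", "answer"]
pred = ["search", "summarize", "answer"]
print(subsequence_score(pred, gold))  # 0.75
```

A run that preserves order but drops a step is partially credited, whereas an exact-match criterion would score it zero; this is the kind of fine-grained credit assignment the summary attributes to WorFEval.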

📝 Abstract
Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. Code and dataset are available at https://github.com/zjunlp/WorFBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the workflow generation capabilities of LLM agents.
Introducing WorFBench to cover multi-faceted scenarios and complex graph workflow structures.
Assessing the gap between sequence planning and graph planning capabilities.
Innovation

Methods, ideas, or system contributions that make the work stand out.

WorFBench benchmark
WorFEval evaluation protocol
Subsequence and subgraph matching algorithms
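For the graph-structural half of the evaluation, a minimal sketch of edge-level subgraph matching is shown below: it scores a predicted workflow DAG by the fraction of gold edges it reproduces, assuming nodes are matched by label. This is a simplified stand-in; WorFEval's actual subgraph matching algorithm may be more elaborate.

```python
def subgraph_score(pred_edges, gold_edges):
    """Fraction of gold workflow edges reproduced by the predicted graph.
    Nodes are matched by exact label (an assumption for this sketch)."""
    gold = set(gold_edges)
    if not gold:
        return 1.0
    matched = gold & set(pred_edges)
    return len(matched) / len(gold)

# Gold workflow: two parallel branches from "start" joining at "merge"
gold_edges = [("start", "a"), ("start", "b"), ("a", "merge"), ("b", "merge")]
# Prediction serialized the branches, losing the parallel structure
pred_edges = [("start", "a"), ("a", "b"), ("b", "merge")]
print(subgraph_score(pred_edges, gold_edges))  # 0.5
```

The example shows why sequence and graph scores can diverge: a model that serializes parallel branches keeps most of the node ordering but loses the parallel edges, which is consistent with the ~15% sequence-versus-graph gap the paper reports.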