🤖 AI Summary
Existing large language models (LLMs) lack systematic evaluation on complex, real-world business standard operating procedures (SOPs), particularly under multi-branch logic and deep-reasoning scenarios. Method: We introduce SOP-Maze, the first fine-grained benchmark built from authentic enterprise workflows, comprising 397 tasks across 23 scenario categories. We propose a novel task decomposition into the Lateral Root System (LRS) and the Heart Root System (HRS), and establish a three-dimensional analytical framework—route blindness, conversational fragility, and calculation errors—to assess path adherence, conversational robustness, and computational reasoning. Contribution/Results: Experiments reveal significant performance limitations across mainstream LLMs, exposing consistent deficiencies in structured procedural understanding. SOP-Maze provides a reproducible, attribution-aware evaluation paradigm for developing enterprise-grade decision-support models. The benchmark and evaluation code are publicly released.
📝 Abstract
As large language models (LLMs) are widely deployed as domain-specific agents, many benchmarks have been proposed to evaluate their ability to follow instructions and make decisions in real-world scenarios. However, business scenarios often involve complex standard operating procedures (SOPs), and the evaluation of LLM capabilities in such contexts has not been fully explored. To bridge this gap, we propose SOP-Maze, a benchmark constructed from real-world business data and adapted into a collection of 397 tasks drawn from 23 complex SOP scenarios. We further categorize SOP tasks into two broad classes: the Lateral Root System (LRS), representing wide-option tasks that demand precise selection; and the Heart Root System (HRS), which emphasizes deep logical reasoning over complex branches. Extensive experiments reveal that nearly all state-of-the-art models struggle with SOP-Maze. We conduct a comprehensive analysis and identify three key error categories: (i) route blindness: difficulty following procedures; (ii) conversational fragility: inability to handle real dialogue nuances; and (iii) calculation errors: mistakes in time or arithmetic reasoning under complex contexts. This systematic study explores LLM performance on SOP tasks that challenge both breadth and depth, offering new insights for improving model capabilities. We have open-sourced our work at https://github.com/ADoublLEN/SOP-Maze.