🤖 AI Summary
Current language models struggle to simultaneously achieve terminological precision, sequential step coherence, and constraint-aware reasoning in standard operating procedure (SOP) comprehension, and exhibit limited generalization across domains. To address these challenges, this work proposes a progressive task-mixing training framework that incrementally enhances capabilities in term disambiguation, action sequence modeling, and scene-aware graph-based reasoning. Furthermore, a multi-agent automated evaluation mechanism is introduced to dynamically generate scoring rubrics and test sets. Evaluated on SOPBench across seven domains, the proposed 32B model achieves a pass rate of 48.3%, while the open-sourced 7B variant reaches 34.3%—matching the performance of Qwen-2.5-72B-Instruct with approximately one-tenth the parameters—demonstrating substantial improvements in both SOP understanding and cross-domain adaptability.
📝 Abstract
Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fall short because joint training cannot separate the distinct reasoning capabilities that SOPs require: terminological precision, sequential ordering, and constraint reasoning. We propose FM SO.P, which addresses these challenges through two contributions. First, we introduce progressive task mixtures that build capabilities in stages across three task types with cumulative data: concept disambiguation for terminological precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system in which three agents adaptively generate scoring rubrics and stratified test sets and perform rubric-based scoring, adapting to each domain (e.g., temporal constraints for the DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves a 48.3% pass rate with our 32B model and 34.3% with our open-source 7B model, matching the Qwen-2.5-72B-Instruct baseline (34.4%) with roughly 10x fewer parameters.
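The cumulative task mixing described in the abstract — each stage training on its new task type plus all earlier stages' data — can be sketched as below. This is a minimal illustration, not the paper's implementation; the stage order follows the abstract, but the pool contents and function names are assumptions.

```python
# Illustrative sketch of progressive task mixtures with cumulative data.
# Stage i trains on the data for task i plus the data for all earlier tasks.

# Stage order as described in the abstract: terminology first, then
# sequential ordering, then conditional (graph-based) reasoning.
STAGE_ORDER = [
    "concept_disambiguation",
    "action_sequence",
    "graph_reasoning",
]

def build_stage_mixture(stage: int, pools: dict) -> list:
    """Return the training mixture for a stage: the new task's examples
    concatenated with every earlier task's examples (cumulative mixing)."""
    mixture = []
    for task in STAGE_ORDER[: stage + 1]:
        mixture.extend(pools[task])
    return mixture

# Toy pools standing in for real SOP training examples (hypothetical data).
pools = {
    "concept_disambiguation": ["term_ex_1", "term_ex_2"],
    "action_sequence": ["seq_ex_1"],
    "graph_reasoning": ["graph_ex_1", "graph_ex_2"],
}

# The final stage sees all three pools, so earlier capabilities are retained.
final_mixture = build_stage_mixture(2, pools)
```

The cumulative design is what distinguishes this from a plain curriculum: later stages never drop earlier data, which is intended to preserve terminology and sequencing skills while the conditional-reasoning capability is added.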