🤖 AI Summary
Current language models struggle to simultaneously achieve terminological precision, sequential step coherence, and constraint-aware reasoning in standard operating procedure (SOP) comprehension, and exhibit limited generalization across domains. To address these challenges, this work proposes a progressive task-mixing training framework that incrementally enhances capabilities in term disambiguation, action sequence modeling, and scene-aware graph-based reasoning. Furthermore, a multi-agent automated evaluation mechanism is introduced to dynamically generate scoring rubrics and test sets. Evaluated on SOPBench across seven domains, the proposed 32B model achieves a pass rate of 48.3%, while the open-sourced 7B variant reaches 34.3%—matching the performance of Qwen-2.5-72B-Instruct with approximately one-tenth the parameters—demonstrating substantial improvements in both SOP understanding and cross-domain adaptability.
📝 Abstract
Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fall short because joint training cannot separate the distinct reasoning capabilities that SOPs require: terminological precision, sequential ordering, and constraint reasoning. We propose FM SO.P, which addresses these challenges through two contributions. First, we introduce progressive task mixtures that build capabilities in stages across three task types with cumulative data: concept disambiguation for terminological precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system in which three agents adaptively generate scoring rubrics and stratified test sets and perform rubric-based scoring, adapting to each domain (e.g., temporal constraints for the DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves a 48.3% pass rate with our 32B model and 34.3% with our open-source 7B model, matching the Qwen-2.5-72B-Instruct baseline (34.4%) with roughly 10x fewer parameters.
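The cumulative task mixing described in the abstract — each stage training on its new task type plus all earlier stages' data — can be sketched as below. This is a minimal illustration, not the paper's implementation; the stage order follows the abstract, but the pool contents and function names are assumptions.

```python
# Illustrative sketch of progressive task mixtures with cumulative data.
# Stage i trains on the data for task i plus the data for all earlier tasks.

# Stage order as described in the abstract: terminology first, then
# sequential ordering, then conditional (graph-based) reasoning.
STAGE_ORDER = [
    "concept_disambiguation",
    "action_sequence",
    "graph_reasoning",
]

def build_stage_mixture(stage: int, pools: dict) -> list:
    """Return the training mixture for a stage: the new task's examples
    concatenated with every earlier task's examples (cumulative mixing)."""
    mixture = []
    for task in STAGE_ORDER[: stage + 1]:
        mixture.extend(pools[task])
    return mixture

# Toy pools standing in for real SOP training examples (hypothetical data).
pools = {
    "concept_disambiguation": ["term_ex_1", "term_ex_2"],
    "action_sequence": ["seq_ex_1"],
    "graph_reasoning": ["graph_ex_1", "graph_ex_2"],
}

# The final stage sees all three pools, so earlier capabilities are retained.
final_mixture = build_stage_mixture(2, pools)
```

The cumulative design is what distinguishes this from a plain curriculum: later stages never drop earlier data, which is intended to preserve terminology and sequencing skills while the conditional-reasoning capability is added.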