MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks inadequately evaluate large language models' (LLMs) capabilities in collaborative, multi-stage reasoning and autonomous optimization under complex real-world scenarios. Method: We introduce MSCoRe, the first comprehensive benchmark for multi-stage collaborative reasoning, spanning the automotive, pharmaceutical, electronics, and energy domains and comprising 126,696 high-quality domain-specific question-answer pairs. MSCoRe employs a three-stage progressive data construction pipeline: dynamic sampling, iterative question-answer generation, and multi-level quality assessment, with tasks systematically stratified by difficulty. Contribution/Results: Evaluating leading LLM-based agent systems using ROUGE and other metrics reveals that while commercial models outperform open-source counterparts overall, their performance degrades significantly on higher-order collaborative tasks and they exhibit heightened sensitivity to input noise. MSCoRe fills a critical gap in multi-stage reasoning evaluation, establishing a rigorous, domain-diverse benchmark to advance LLM agents' capabilities in realistic, complex environments.

📝 Abstract
Large Language Models (LLMs) have excelled in question-answering (QA) tasks within single domains. However, their reasoning and coordination capabilities in complex, multi-stage scenarios remain underexplored. Existing benchmarks typically focus on isolated tasks or narrow domains, overlooking models' abilities for multi-stage collaboration and optimization without explicit external guidance. To bridge this gap, we propose MSCoRe, a novel benchmark comprising 126,696 domain-specific QA instances spanning scenarios in the automotive, pharmaceutical, electronics, and energy sectors. The dataset is created using a structured three-phase pipeline: dynamic sampling, iterative question-answer generation, and multi-level quality assessment to ensure data quality. Tasks are further categorized into three difficulty levels according to stage coverage and complexity. With MSCoRe, we have conducted a comprehensive evaluation of various state-of-the-art LLM agents. Commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. We also tested the models' robustness and found that their performance is negatively affected by noisy data. MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents. The code and data are available at https://github.com/D3E0-source/MSCoRE.
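The abstract reports ROUGE-score gaps between simple and complex tasks. As a rough illustration of the kind of metric involved (the paper almost certainly uses a standard ROUGE implementation; this is only a minimal unigram-overlap sketch, not the authors' code):

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Minimal ROUGE-1 F1: unigram-count overlap between a
    reference answer and a model-generated answer."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Multiset intersection: per-word min of the two counts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A full ROUGE suite also includes bigram (ROUGE-2) and longest-common-subsequence (ROUGE-L) variants, plus stemming and tokenization details that this sketch omits.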
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM reasoning in complex multi-stage collaborative scenarios
Addressing the gap in benchmarks for multi-domain coordination without guidance
Assessing model robustness and optimization across varying difficulty levels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic sampling method for domain-specific QA generation
Iterative question-answer generation pipeline for data creation
Multi-level quality assessment to ensure dataset robustness
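The three contributions above form a single data-construction pipeline. A skeleton of how such a pipeline might be wired together is sketched below; every helper here is a hypothetical stub for illustration only, not the authors' implementation:

```python
import random

def dynamic_sample(corpus, k=2):
    # Phase 1: dynamically sample seed passages from a domain corpus (stub).
    return random.sample(corpus, min(k, len(corpus)))

def generate_qa(seed):
    # Phase 2a: draft a QA pair from a seed passage (stub: template-based).
    return {"q": f"What does the passage '{seed}' describe?", "a": seed}

def refine_qa(pair):
    # Phase 2b: one iterative refinement round (stub: identity).
    return pair

def assess_quality(pair):
    # Phase 3: multi-level quality score in [0, 1] (stub: non-empty check).
    return 1.0 if pair["a"] else 0.0

def build_dataset(corpus, n_rounds=2, min_score=0.8):
    pairs = []
    for seed in dynamic_sample(corpus):
        pair = generate_qa(seed)
        for _ in range(n_rounds - 1):
            pair = refine_qa(pair)
        pairs.append(pair)
    # Keep only pairs that pass the quality threshold.
    return [p for p in pairs if assess_quality(p) >= min_score]
```

In a real system each stub would be backed by an LLM call or a learned filter; the skeleton only shows how the three phases compose.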
Authors

Yuzhen Lei, Jilin University
Hongbin Xie, Southern University of Science and Technology
Jiaxing Zhao, Jilin University
Shuangxue Liu, Jilin University
Xuan Song, Jilin University

Topics: LLMs, Multi-Agent