One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multi-turn instruction-following benchmarks are constrained to fixed turn counts, leading to evaluation saturation and poor alignment with real-world interactive scenarios. To address this, we propose EvolIF, a dynamic evaluation benchmark and framework that decouples surface linguistic form from underlying user intent. Our framework introduces a three-layer tracking mechanism (state evolution, topic transition, and instruction backtracking) alongside a dynamic termination strategy that simulates the exhaustion of a user's patience. EvolIF supports cross-topic, multi-turn dialogue modeling under nine distinct constraint types, effectively mitigating the saturation seen in static benchmarks. Experimental results show that GPT-5 sustains an average of 18.54 turns on EvolIF with 70.31% instruction-following robustness, outperforming Gemini-2.5-Pro by 11.41 percentage points and significantly surpassing other state-of-the-art models.
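The dynamic termination strategy is the most mechanical part of the description above. Below is a minimal sketch of how a patience-exhaustion loop could work, assuming a simple rule where a violated instruction drains a fixed patience budget and a followed one restores it; the names `SimulatedUser`, `next_instruction`, `respond`, and `satisfies` are hypothetical stand-ins, since the paper's code is not yet public.

```python
# Hypothetical sketch of patience-based dialogue termination; the rule below
# (violations drain a budget, a followed instruction restores it) is an
# assumption, not EvolIF's published implementation.
from dataclasses import dataclass


@dataclass
class SimulatedUser:
    budget: int = 3       # assumed: consecutive violations tolerated
    patience: int = 3

    def react(self, followed: bool) -> None:
        # A followed instruction restores patience; a violation drains it.
        self.patience = self.budget if followed else self.patience - 1

    @property
    def exhausted(self) -> bool:
        return self.patience <= 0


def run_dialogue(model, simulator, checker, max_turns: int = 50):
    """Drive turns until the simulated user's patience runs out.

    `model.respond`, `simulator.next_instruction`, and `checker.satisfies`
    are hypothetical interfaces standing in for the framework's components.
    """
    user, history = SimulatedUser(), []
    for turn in range(max_turns):
        instruction = simulator.next_instruction(history)
        reply = model.respond(history + [instruction])
        user.react(checker.satisfies(instruction, reply))
        history += [instruction, reply]
        if user.exhausted:
            return turn + 1, history  # turns survived before termination
    return max_turns, history
```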

📝 Abstract
Understanding how well large language models can follow users' instructions throughout a dialogue spanning multiple topics is of great importance for data-intensive conversational applications. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience. In this work, we propose an extensible framework for assessing multi-turn instruction-following ability. At its core, our framework decouples linguistic surface forms from user intent simulation through a three-layer mechanism that tracks constraints, instructions, and topics. This framework mimics User-LLM interaction by enabling the dynamic construction of benchmarks with state changes and tracebacks, terminating a conversation only when the model exhausts a simulated user's patience. We define a suite of metrics capturing the quality of the interaction process. Using this framework, we construct EvolIF, an evolving instruction-following benchmark incorporating nine distinct constraint types. Our results indicate that GPT-5 exhibits superior instruction-following performance. It sustains an average of 18.54 conversational turns and demonstrates 70.31% robustness, outperforming Gemini-2.5-Pro by a significant margin of 11.41%, while other models lag far behind. All of the data and code will be made publicly available online.
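To make the three-layer mechanism concrete, here is a minimal sketch of a dialogue state that tracks constraints, instructions, and topics, with one state-change and one backtracking operation. All field and method names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the three-layer dialogue state the abstract describes
# (constraints, instructions, topics); names and the traceback operation
# are assumptions, not the paper's code.
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    constraints: list[str] = field(default_factory=list)   # layer 1: active constraints
    instructions: list[str] = field(default_factory=list)  # layer 2: issued instructions
    topics: list[str] = field(default_factory=list)        # layer 3: topic trajectory

    def evolve(self, constraint: str, instruction: str, topic: str) -> None:
        # State change: a new turn may add a constraint, an instruction,
        # and possibly transition to a new topic.
        self.constraints.append(constraint)
        self.instructions.append(instruction)
        if not self.topics or self.topics[-1] != topic:
            self.topics.append(topic)  # topic transition

    def traceback(self, n: int = 1) -> list[str]:
        # Traceback: revisit the n most recent instructions so the simulated
        # user can reference or revise them in a later turn.
        return self.instructions[-n:]
```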
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to follow multi-turn conversational instructions across topics
Addressing limitations of fixed-turn benchmarks through dynamic user simulation
Assessing instruction-following robustness with evolving constraints and state changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples linguistic surface forms from user intent simulation
Enables dynamic benchmark construction with state changes and tracebacks
Defines metrics capturing the quality of the interaction process (see the sketch after this list)
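The reported results suggest at least two such metrics: average sustained turns and per-turn instruction-following robustness. A hedged sketch follows, assuming each dialogue reduces to a per-turn list of pass/fail judgments; the paper's exact metric definitions may differ.

```python
# Hedged sketch of two interaction-quality metrics implied by the reported
# numbers (average sustained turns, robustness); definitions are assumptions.
def avg_sustained_turns(dialogues: list[list[bool]]) -> float:
    """Mean number of turns before each dialogue terminated."""
    return sum(len(d) for d in dialogues) / len(dialogues)


def robustness(dialogues: list[list[bool]]) -> float:
    """Fraction of turns whose instructions were followed, pooled over dialogues."""
    outcomes = [ok for d in dialogues for ok in d]
    return sum(outcomes) / len(outcomes)


# Usage: each dialogue is a per-turn list of pass/fail judgments.
runs = [[True, True, False, True], [True, False]]
print(avg_sustained_turns(runs))  # 3.0
print(robustness(runs))           # 0.666...
```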
Authors
Qi Jia
Shanghai Artificial Intelligence Laboratory
Kaiwei Zhang
Shanghai Artificial Intelligence Laboratory
Xiujie Song
Shanghai Jiao Tong University
Ye Shen
Baylor College of Medicine
Xiangyang Zhu
Shanghai Artificial Intelligence Laboratory
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing · Visual Quality Assessment · QoE · AI Evaluation · Displays