StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing instruction-following benchmarks overlook structural dependencies across turns in multi-turn dialogues—such as continuation, correction, and expansion—leading to inadequate evaluation of models’ ability to model intent evolution and structural flow. Method: We introduce the first multi-turn instruction-following benchmark explicitly focused on structural dependencies, formally defining six fundamental inter-turn structural relations and integrating structural-flow modeling into the evaluation framework; it further supports customizable dialogue process generation. Our LLM-based automated assessment framework combines structured prompt engineering with a relation-annotation pipeline to enable scalable, reproducible, multi-dimensional evaluation. Contribution/Results: Systematic evaluation across 13 state-of-the-art LLMs reveals an average structural-flow understanding accuracy of only 41.7%, substantially lower than constraint-satisfaction performance—highlighting a critical, previously unmeasured capability gap in current models.

📝 Abstract
Multi-turn instruction following capability constitutes a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks predominantly focus on fine-grained constraint satisfaction and domain-specific capability assessment, yet overlook the crucial structural dependency between dialogue turns that distinguishes multi-turn from single-turn interactions. This structural dependency not only reflects user intent but also establishes a second dimension for instruction following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling. The benchmark innovatively defines a structural flow framework comprising six fundamental inter-turn relationships, which not only introduces novel structural constraints for model evaluation but also serves as generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we conduct systematic evaluations of 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures. The code is available at https://github.com/MLGroupJLU/StructFlowBench.
Problem

Research questions and friction points this paper is trying to address.

Multi-turn instruction following capability
Structural dependency in dialogues
Evaluation of large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structural flow modeling
Six inter-turn relationships
LLM-based automatic evaluation
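
The idea of treating inter-turn relationships as both evaluation labels and dialogue-generation parameters can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the relation names below use the three relations cited in the summary (continuation, correction, expansion); the benchmark itself defines six, and the class and function names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

# Hypothetical relation labels; the paper defines six inter-turn relations,
# of which three (continuation, correction, expansion) are named above.
class InterTurnRelation(Enum):
    CONTINUATION = "continuation"
    CORRECTION = "correction"
    EXPANSION = "expansion"

@dataclass
class Turn:
    instruction: str
    relation: Optional[InterTurnRelation]  # None for the opening turn

def flow_signature(turns: List[Turn]) -> List[str]:
    """Extract the structural-flow template (the relation sequence) of a
    dialogue; such a template could serve as a generation parameter for
    synthesizing new dialogues with the same structure."""
    return [t.relation.value for t in turns if t.relation is not None]

dialogue = [
    Turn("Write a short story about a lighthouse.", None),
    Turn("Continue the story with a storm scene.", InterTurnRelation.CONTINUATION),
    Turn("Actually, set it on a spaceship instead.", InterTurnRelation.CORRECTION),
    Turn("Add a subplot about the crew.", InterTurnRelation.EXPANSION),
]
print(flow_signature(dialogue))  # → ['continuation', 'correction', 'expansion']
```

Under this view, two dialogues with different surface content but the same relation sequence share a structural flow, which is what the benchmark evaluates models on.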
Jinnan Li
School of Artificial Intelligence, Jilin University; International Center of Future Science, Jilin University
Jinzhe Li
Fudan University & Shanghai AI Lab
Yue Wang
School of Information and Library Science, University of North Carolina at Chapel Hill
Yi Chang
School of Artificial Intelligence, Jilin University; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China; International Center of Future Science, Jilin University
Yuan Wu
School of Artificial Intelligence, Jilin University