SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following

📅 2026-05-07

🤖 AI Summary

Current instruction-following benchmarks focus primarily on single-turn or short conversations and do not adequately assess whether a model keeps adhering to changing constraints over extended multi-turn dialogues. This work proposes the first evaluation framework tailored for long-horizon, dynamic, multi-constraint instruction following. It extracts constraints from real-world dialogues, uses persona-guided generation to construct multi-turn interactions, and supports constraint modification (addition, deletion, or alteration) with automated evaluation. Experiments show significant performance degradation in state-of-the-art large language models: instruction-following accuracy drops by over 11% as dialogue length increases, by more than 40% when multiple constraints must be followed at once, and by more than 9% when constraints are added or replaced mid-dialogue. These results expose limitations in sustained, adaptive instruction adherence.
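To make the idea of constraint modification concrete, the following is a minimal sketch of how the set of constraints in force at a given turn could be tracked as a user adds, removes, or replaces them. It is not SEQUOR's implementation: the `ConstraintEvent` type, the `active_constraints` function, the `key` identifiers, and the example constraints are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConstraintEvent:
    turn: int       # turn index at which the user issues the change
    op: str         # "add", "remove", or "replace"
    key: str        # hypothetical identifier for the constraint slot
    text: str = ""  # constraint wording (unused for "remove")


def active_constraints(events, turn):
    """Return the constraints in force at `turn` (illustrative only)."""
    active = {}
    for ev in sorted(events, key=lambda e: e.turn):
        if ev.turn > turn:
            break
        if ev.op in ("add", "replace"):
            # A replacement overwrites whatever was stored under the same key.
            active[ev.key] = ev.text
        elif ev.op == "remove":
            active.pop(ev.key, None)
        else:
            raise ValueError(f"unknown operation: {ev.op}")
    return active


# Hypothetical conversation in which constraints change over time.
events = [
    ConstraintEvent(0, "add", "length", "Answer in at most 50 words."),
    ConstraintEvent(3, "add", "format", "Use bullet points."),
    ConstraintEvent(6, "replace", "length", "Answer in at most 20 words."),
    ConstraintEvent(9, "remove", "format"),
]
```

Under this sketch, a response at turn 7 would be checked against both the 20-word limit and the bullet-point format, while a response at turn 9 would be checked against the 20-word limit only.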

📝 Abstract

In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks. To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations. SEQUOR consists of simulated persona-driven interactions built with constraints extracted from real-world conversations. Our results show that even when following a single constraint, instruction-following accuracy consistently decreases as the conversation grows longer, with drops exceeding 11%. This decline becomes larger when models have to follow multiple constraints simultaneously, reducing their accuracy by over 40%. In scenarios where constraints are added or replaced at arbitrary points of the conversation, model accuracy decreases by more than 9%. Taken together, our results reveal that current models still struggle to follow user instructions in multi-turn conversations, and provide a better way to measure instruction-following capabilities in assistants.
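The claim that accuracy decreases as the conversation grows longer can be illustrated by measuring adherence separately at each turn position. The sketch below assumes a simple rule-based checker (a word limit); the abstract does not specify how SEQUOR's automatic evaluation works, so `max_words`, `accuracy_by_turn`, and the toy data are hypothetical stand-ins.

```python
def max_words(limit):
    """Build a hypothetical checker: passes if the response has at most `limit` words."""
    return lambda response: len(response.split()) <= limit


def accuracy_by_turn(conversations, checker):
    """Fraction of responses satisfying the constraint at each turn index.

    `conversations` is a list of conversations, each a list of assistant
    responses ordered by turn.
    """
    passed, total = {}, {}
    for responses in conversations:
        for turn, response in enumerate(responses):
            total[turn] = total.get(turn, 0) + 1
            passed[turn] = passed.get(turn, 0) + int(checker(response))
    return {turn: passed[turn] / total[turn] for turn in sorted(total)}


# Toy data: two conversations of two turns each, with a 3-word limit.
conversations = [
    ["short answer", "a b c d e f"],
    ["ok", "fine"],
]
acc = accuracy_by_turn(conversations, max_words(3))
```

Comparing the values for early and late turn indices gives the kind of length-dependent drop the abstract reports. Whether the reported percentages are absolute or relative differences is not stated in the abstract.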

Problem

Research questions and friction points this paper is trying to address.

instruction following

multi-turn conversation

constraint adherence

long-horizon dialogue

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-turn conversation

constraint following

instruction-following benchmark