AI Summary
Current benchmarks for instruction-following primarily focus on single-turn or short conversations, failing to adequately assess a model's ability to consistently adhere to dynamic constraints over extended multi-turn dialogues. This work proposes the first evaluation framework tailored for long-horizon, dynamic, and multi-constraint instruction following. It extracts real-world constraints from authentic dialogues, leverages persona-guided generation to construct multi-turn interactions, and supports flexible constraint modification (addition, deletion, or alteration) alongside automated evaluation. Experimental results reveal significant performance degradation in state-of-the-art large language models: instruction-following accuracy drops by over 11% as dialogue length increases, declines by more than 40% under concurrent multi-constraint settings, and suffers a further decrease exceeding 9% when constraints are altered mid-dialogue, exposing critical limitations in sustained, adaptive instruction adherence.
Abstract
In a conversation, a helpful assistant must reliably follow user directives, even as users refine, modify, or contradict their earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks. To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations. SEQUOR consists of simulated persona-driven interactions built with constraints extracted from real-world conversations. Our results show that even when following a single constraint, instruction-following accuracy consistently decreases as the conversation grows longer, with drops exceeding 11%. This decline becomes larger when models have to follow multiple constraints simultaneously, reducing their accuracy by over 40%. In scenarios where constraints are added or replaced at arbitrary points of the conversation, model accuracy decreases by more than 9%. Taken together, our results reveal that current models still struggle to follow user instructions in multi-turn conversations, and provide a way to better measure instruction-following capabilities in assistants.
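To make the evaluation setting concrete, the following is a minimal sketch of how per-turn constraint adherence could be scored when constraints are added, replaced, or removed during a conversation. It is not SEQUOR's implementation: the `Turn` class, the edit-operation tuples, the `per_turn_accuracy` function, and the toy checkers are all hypothetical names chosen for illustration, and simple rule-based checkers stand in for whatever automated judge a real benchmark would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical: a checker returns True when a response satisfies one constraint.
Checker = Callable[[str], bool]
# Hypothetical edit operation: (op, constraint_name, checker or None),
# where op is "add", "replace", or "remove".
Edit = Tuple[str, str, Optional[Checker]]


@dataclass
class Turn:
    response: str                                    # assistant reply for this turn
    edits: List[Edit] = field(default_factory=list)  # constraint changes issued this turn


def per_turn_accuracy(turns: List[Turn]) -> List[float]:
    """Fraction of currently active constraints satisfied at each turn."""
    active: Dict[str, Checker] = {}
    scores: List[float] = []
    for turn in turns:
        # Apply the user's constraint changes before scoring the reply.
        for op, name, checker in turn.edits:
            if op in ("add", "replace"):
                if checker is None:
                    raise ValueError(f"'{op}' needs a checker for {name!r}")
                active[name] = checker
            elif op == "remove":
                active.pop(name, None)
            else:
                raise ValueError(f"unknown edit operation: {op!r}")
        if not active:
            scores.append(1.0)  # assumption: no active constraints counts as satisfied
            continue
        satisfied = sum(check(turn.response) for check in active.values())
        scores.append(satisfied / len(active))
    return scores


# Toy conversation: one constraint, then two at once, then one is replaced.
turns = [
    Turn("HELLO THERE", edits=[("add", "case", str.isupper)]),
    Turn("SHORT", edits=[("add", "max_20_chars", lambda r: len(r) <= 20)]),
    Turn("OK, WILL DO", edits=[("replace", "case", str.islower)]),
]
scores = per_turn_accuracy(turns)
print(scores)
```

In the toy conversation, the last reply still follows the old uppercase rule after it was replaced by a lowercase rule, so it satisfies only the length constraint. Tracking the active constraint set per turn is what lets a score of this kind separate the three effects reported above: conversation length, number of simultaneous constraints, and mid-conversation changes.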