AI Summary
This work addresses the lack of systematic evaluation of large language models' (LLMs') reliance on instruction sequence structure in complex multi-step tasks, a gap exacerbated by mainstream benchmarks that conflate content complexity with structural order. To disentangle these factors, we introduce RIFT, a benchmark constructed from rephrased Jeopardy! question-answer pairs that yield semantically identical but structurally distinct linear and non-linear (jumping) multi-step prompts. Evaluating over 10,000 trials across six open-source LLMs, our study is the first to decouple prompt content from structural sequencing, revealing a fundamental limitation: LLMs treat instruction following as sequential pattern matching rather than genuine reasoning. Accuracy drops by up to 72% under jumping structures, with approximately 50% of errors attributable to violations of expected instruction order and semantic drift. These findings underscore models' strong dependence on positional continuity and offer a new evaluation dimension for applications requiring non-sequential control flow.
Abstract
Large Language Models (LLMs) are increasingly relied upon for complex workflows, yet their ability to follow non-trivial instruction flow remains underexplored. Existing benchmarks conflate task complexity with structural ordering, making it difficult to isolate the impact of prompt topology on performance. We introduce RIFT (Reordered Instruction Following Testbed) to assess instruction following by disentangling structure from content. Using rephrased Jeopardy! question-answer pairs, we test LLMs across two prompt structures: linear prompts, which progress sequentially, and jumping prompts, which preserve identical content but require non-sequential traversal. Across 10,000 evaluations spanning six state-of-the-art open-source LLMs, accuracy dropped by up to 72% under jumping conditions relative to the linear baseline, revealing a strong dependence on positional continuity. Error analysis shows that approximately 50% of failures stem from instruction-order violations and semantic drift, indicating that current architectures internalize instruction following as a sequential pattern rather than a reasoning skill. These results establish structural sensitivity as a fundamental limitation of current architectures, with direct implications for applications requiring non-sequential control flow, such as workflow automation and multi-agent systems.
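To make the linear-versus-jumping distinction concrete, here is a minimal sketch of how two prompts with identical step content but different traversal requirements might be constructed. The step texts, prompt template, and goto-style cues below are illustrative assumptions, not the exact RIFT format.

```python
# Illustrative sketch: the same instruction content rendered as a "linear"
# prompt (execute steps in document order) and as a "jumping" prompt
# (identical step texts, but goto-style cues force non-sequential traversal).
# Step wording and template are assumed, not taken from RIFT itself.

STEPS = [
    "Read the clue and identify the category.",
    "Recall the relevant fact.",
    "Phrase the answer as a question.",
]

def linear_prompt(steps):
    """Steps appear and are meant to be executed in document order."""
    return "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))

def jumping_prompt(steps, exec_order=(0, 2, 1)):
    """Identical content in the same document order, but goto cues
    impose the execution order given by exec_order (here 1 -> 3 -> 2)."""
    # Map each step index to the index of the step to execute next.
    nxt = {exec_order[p]: exec_order[p + 1] for p in range(len(exec_order) - 1)}
    lines = []
    for i, s in enumerate(steps):
        cue = f" Then go to Step {nxt[i] + 1}." if i in nxt else " Stop."
        lines.append(f"Step {i + 1}: {s}{cue}")
    return "\n".join(lines)
```

Because both variants contain exactly the same step sentences, any accuracy gap between them can be attributed to the traversal structure rather than to the content of the instructions.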