When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models score well on reasoning benchmarks, yet they often fail to faithfully execute multi-step procedures specified in a prompt. This work introduces a controlled diagnostic benchmark in which a model receives a step-wise arithmetic algorithm and two numeric inputs and must return the final computed value; difficulty is scaled through algorithm length and look-back dependencies over intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy falls from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis finds that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps, suggesting that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution.

📝 Abstract

Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value. The benchmark uses simple arithmetic operations but increases complexity through algorithm length and look-back dependencies over intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy drops from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution.
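To make the task format concrete, the following is a minimal sketch of how a procedure of this kind could be generated and scored. It is not the authors' implementation: the operation set, the `v0`/`v1` variable naming, the `max_lookback` parameter, the prompt wording, and the `modulus` used to keep values bounded are all assumptions introduced for illustration. Only the overall shape (a step-wise arithmetic algorithm, two numeric inputs, references back to earlier intermediate values, and a single final value) comes from the abstract.

```python
import random

# Hypothetical operation set; the abstract only says "simple arithmetic operations".
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def make_procedure(n_steps, max_lookback, seed=0):
    """Build a list of steps, each a tuple (op, left_index, right_index).

    Indices point into a value list where slots 0 and 1 hold the two inputs
    and slot k + 2 holds the result of step k. `max_lookback` (an assumed
    knob) limits how far back a step may reach for its operands.
    """
    rng = random.Random(seed)
    steps = []
    for k in range(n_steps):
        available = k + 2  # the two inputs plus all earlier step results
        lo = max(0, available - max_lookback)
        left = rng.randrange(lo, available)
        right = rng.randrange(lo, available)
        steps.append((rng.choice(sorted(OPS)), left, right))
    return steps


def execute(steps, x, y, modulus=1000):
    """Reference executor: run every step in order and return the last value.

    The modulus is an assumption that keeps intermediate values small.
    """
    values = [x, y]
    for op, left, right in steps:
        values.append(OPS[op](values[left], values[right]) % modulus)
    return values[-1]


def render(steps, x, y):
    """Render the procedure as plain text (assumed wording, for illustration)."""
    lines = [f"v0 = {x}", f"v1 = {y}"]
    for k, (op, left, right) in enumerate(steps):
        lines.append(f"Step {k + 1}: v{k + 2} = v{left} {op} v{right}")
    return "\n".join(lines)


def first_answer_correct(model_answer, steps, x, y):
    """Score a model's first numeric answer against the reference value."""
    return model_answer == execute(steps, x, y)
```

Under these assumptions, `make_procedure(95, 5)` would produce a 95-step procedure whose operands reach at most five slots back, and `first_answer_correct` compares a model's first reported value against the reference executor, mirroring the first-answer accuracy metric the abstract reports.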

Problem

Research questions and friction points this paper is trying to address.

procedural execution

large language models

reasoning fidelity

instruction following

diagnostic benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

procedural execution

diagnostic benchmark

faithful reasoning