🤖 AI Summary
This work addresses the lack of systematic evaluation of existing code agents' ability to follow heterogeneous scaffolding instructions that persist across interactions. To this end, we propose OctoBench, the first benchmark specifically designed to assess scaffolding instruction following in repository-level programming, encompassing 34 environments, 217 tasks, three scaffolding types, and 7,098 objective verification items. By decoupling task completion from instruction adherence, we introduce a fine-grained, automated trajectory observation and scoring mechanism, and develop an evaluation framework comprising a trajectory-capturing toolchain, structured checklists, and multi-type scaffolding instantiations. Experiments across eight mainstream models reveal a significant gap between task-solving capability and scaffolding compliance, validating the effectiveness and necessity of the benchmark.
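The decoupled scoring idea above can be sketched in a few lines: task success and checklist compliance are computed as separate scores over a captured trajectory. The checklist schema, field names, and scoring rule below are illustrative assumptions for exposition, not the released toolkit's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical checklist item: a human-readable rule plus a predicate
# over the captured trajectory (assumed schema, not the real toolkit's).
@dataclass
class CheckItem:
    description: str
    check: Callable[[dict], bool]

def score_trajectory(trajectory: dict, task_passed: bool,
                     checklist: list[CheckItem]) -> dict:
    """Report task success and scaffold compliance as separate scores."""
    hits = [item.check(trajectory) for item in checklist]
    compliance = sum(hits) / len(hits) if checklist else 1.0
    return {"task_success": task_passed, "compliance": compliance}

# Toy example: two scaffold rules checked against a recorded trajectory.
trajectory = {"commands": ["git status", "pytest"], "edited": ["src/app.py"]}
checklist = [
    CheckItem("ran the test suite before finishing",
              lambda t: "pytest" in t["commands"]),
    CheckItem("did not edit files outside src/",
              lambda t: all(p.startswith("src/") for p in t["edited"])),
]
result = score_trajectory(trajectory, task_passed=True, checklist=checklist)
# result: {"task_success": True, "compliance": 1.0}
```

Keeping the two numbers separate is what lets the benchmark surface agents that solve the task while violating scaffold constraints, or vice versa.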
Abstract
Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.