OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding

📅 2026-01-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the lack of systematic evaluation of how existing code agents follow heterogeneous, cross-interaction scaffolding instructions. To this end, we propose OctoBench, the first benchmark specifically designed for repository-level programming that assesses scaffolding instruction following, encompassing 34 environments, 217 tasks, three scaffolding types, and 7,098 objective verification items. To decouple task completion from instruction adherence, we introduce a fine-grained automated trajectory observation and scoring mechanism, and we develop an evaluation framework comprising a trajectory-capturing toolchain, structured checklists, and multi-type scaffolding instantiations. Experiments across eight mainstream models reveal a significant gap between task-solving capability and scaffolding compliance, validating the effectiveness and necessity of the benchmark.

Technology Category

Application Category

📝 Abstract
Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.
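The core idea of the abstract — scoring task success and scaffold compliance independently over a captured trajectory of checklist items — can be sketched as follows. This is a minimal illustrative sketch, not the benchmark's actual toolkit: all names, the trajectory shape, and the example rule are assumptions made up for illustration.

```python
# Hypothetical sketch of decoupled scoring: task success and scaffold
# compliance are computed as two independent scores over one trajectory.
# Every name and data shape here is an illustrative assumption, not the
# OctoBench API.
from dataclasses import dataclass
from typing import Callable

Trajectory = list[dict]  # e.g. [{"action": "edit", "file": "src/a.py"}, ...]

@dataclass
class ChecklistItem:
    description: str
    check: Callable[[Trajectory], bool]  # objective trajectory-level predicate

def score(trajectory: Trajectory, tests_passed: bool,
          checklist: list[ChecklistItem]) -> dict:
    """Return task success and instruction compliance as separate scores."""
    hits = [item.check(trajectory) for item in checklist]
    return {
        "task_solved": tests_passed,          # did the change work?
        "compliance": sum(hits) / len(hits),  # were the scaffold rules followed?
    }

# Example: a scaffold rule forbidding edits outside src/
checklist = [ChecklistItem(
    "only modifies files under src/",
    lambda t: all(e["file"].startswith("src/")
                  for e in t if e["action"] == "edit"),
)]
traj = [{"action": "edit", "file": "tests/test_a.py"},
        {"action": "edit", "file": "src/a.py"}]
result = score(traj, tests_passed=True, checklist=checklist)
# Task solved, yet compliance is 0.0: the kind of gap the benchmark measures.
```

The point of the sketch is that an agent can pass all tests while violating every persistent scaffold constraint, so a single pass/fail signal would hide exactly the behavior the benchmark targets.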
Problem

Research questions and friction points this paper is trying to address.

scaffold-aware instruction following
repository-grounded agentic coding
heterogeneous constraints
instruction compliance
coding agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

scaffold-aware instruction following
repository-grounded agentic coding
automated evaluation toolkit
heterogeneous constraints
OctoBench
Authors

- Deming Ding (Fudan University, MiniMax)
- Shichun Liu (Fudan University)
- Enhui Yang (MiniMax, Peking University)
- Jiahang Lin (Fudan University)
- Ziying Chen (School of Informatics, University of Edinburgh)
- Shihan Dou (Fudan University)
- Honglin Guo (Fudan University)
- Weiyu Cheng (MiniMax)
- Pengyu Zhao (Peking University)
- Chengjun Xiao (MiniMax)
- Qunhong Zeng (MiniMax)
- Qi Zhang (Fudan University)
- Xuanjing Huang (Fudan University)
- Qidi Xu (MiniMax)
- Tao Gui (Fudan University)