LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for evaluating command-line agents on long-horizon, complex software engineering tasks suffer from short task horizons, data contamination, and coarse-grained evaluation. This work proposes the first fine-grained evaluation framework specifically designed for long-horizon command-line programming, introducing a benchmark comprising four realistic task categories: zero-shot development, feature addition, bug fixing, and code refactoring. The framework employs a dual-test protocol—fail-to-pass and pass-to-pass—and a step-level scoring mechanism, complemented by two key metrics: requirement satisfaction and regression avoidance. Experimental results reveal that even state-of-the-art agents achieve an overall pass rate below 20%, with most failures occurring before reaching 30% task completion. Notably, human-in-the-loop collaborative strategies significantly outperform purely self-correcting approaches, underscoring the critical role of collaborative workflows in enhancing task success.

📝 Abstract
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and therefore fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and regression avoidance (pass-to-pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents' planning and execution capabilities to overcome key challenges in long-horizon task performance.
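The dual-set protocol and step-level scoring described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the names (`TaskResult`, `fail_to_pass`, `pass_to_pass`, `step_progress`) and the all-or-nothing pass rule are assumptions inferred from the abstract.

```python
# Hypothetical sketch of LongCLI-Bench's dual-set testing protocol.
# All names and the pass criterion are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class TaskResult:
    fail_to_pass: list[bool]   # required tests: must flip from failing to passing
    pass_to_pass: list[bool]   # pre-existing tests: must remain passing
    steps_completed: int       # agent steps judged successful
    steps_total: int           # total steps in the reference solution


def requirement_satisfaction(r: TaskResult) -> float:
    """Fraction of fail-to-pass tests the agent made pass."""
    return sum(r.fail_to_pass) / len(r.fail_to_pass)


def regression_avoidance(r: TaskResult) -> float:
    """Fraction of pass-to-pass tests still passing (no regressions)."""
    return sum(r.pass_to_pass) / len(r.pass_to_pass)


def task_passed(r: TaskResult) -> bool:
    """Assumed pass rule: both test sets must be fully satisfied."""
    return all(r.fail_to_pass) and all(r.pass_to_pass)


def step_progress(r: TaskResult) -> float:
    """Step-level score: how far the agent got before stalling."""
    return r.steps_completed / r.steps_total
```

Under this reading, an agent that satisfies most requirements but breaks one existing test still fails the task, which is consistent with the low (<20%) overall pass rates the paper reports alongside partial step-level progress.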
Problem

Research questions and friction points this paper is trying to address.

long-horizon agentic programming
command-line interfaces
AI-assisted programming
benchmark evaluation
software engineering tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Long-horizon agentic programming
command-line interface benchmark
step-level evaluation
human-agent collaboration
software engineering automation