Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a failure mode of long-horizon large language model (LLM) agents: when asked to complete a specified number of items, they often stop early or repeat work because they do not persistently track the quantitative goal. The paper introduces Quantitative Goal Persistence (QGP), defined as whether an agent keeps working until an external verifier confirms enough distinct valid items, and presents PushBench, a verifier-backed benchmark that measures repeated work, duplicate submissions, false completion, and progress drift directly. In matched controller comparisons, a state-tracking retrieval controller reaches 69–78% success while eliminating duplicate submissions, and a backlog-tracking work-unit controller reaches 25–50% success in settings where standard and completion-gated controllers complete no task instances. Black-box frontier agents (Claude Code with Sonnet 4.6 and Codex CLI with gpt-5.4) solve many 50-artifact tasks but drop to 3 out of 9 successes per condition at 100 artifacts, indicating that current agents struggle to maintain quantitative goal persistence over extended horizons.

📝 Abstract

Long-horizon language agents can make many plausible local tool calls yet fail to persist until a requested count is actually complete. We study this gap as Quantitative Goal Persistence (QGP): whether an agent keeps working until an external verifier confirms enough distinct valid items. PushBench turns this into a benchmark for repository-artifact collection and verifier-backed work units, so repeated work, duplicate submissions, false completion, and progress drift are measured directly rather than hidden behind a final success flag. In matched controller comparisons, a state-tracking retrieval controller reaches 69-78% success while eliminating duplicate submissions, and a backlog-tracking work-unit controller reaches 25-50% success in settings where standard and completion-gated controllers complete no task instances. Black-box frontier-agent evaluations with Claude Code (Sonnet 4.6) and Codex CLI (gpt-5.4) solve many 50-artifact tasks but drop to 3 out of 9 successes per condition at 100 artifacts. The results show that quantitative goals stress a different reliability requirement from local task competence: agents must maintain verified progress and stop only when the requested work is complete.
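The verifier-backed counting that the abstract describes can be illustrated with a minimal sketch. It is not PushBench's implementation or either of the paper's controllers: the names `CountVerifier` and `run_gated`, the string item identifiers, and the boolean validity flag are all hypothetical, chosen only to show how duplicate submissions and false completion can be counted separately from verified progress.

```python
from dataclasses import dataclass, field


@dataclass
class CountVerifier:
    """Hypothetical external verifier: counts distinct valid items toward a target."""
    target: int
    accepted: set = field(default_factory=set)
    duplicates: int = 0
    invalid: int = 0
    false_completions: int = 0

    def submit(self, item_id: str, is_valid: bool) -> bool:
        """Record one submission; return True only if it adds verified progress."""
        if not is_valid:
            self.invalid += 1
            return False
        if item_id in self.accepted:
            self.duplicates += 1  # repeated work does not advance the count
            return False
        self.accepted.add(item_id)
        return True

    @property
    def remaining(self) -> int:
        """Number of distinct valid items still needed."""
        return max(self.target - len(self.accepted), 0)

    def claim_done(self) -> bool:
        """Handle a completion claim; an unmet target counts as a false completion."""
        if self.remaining > 0:
            self.false_completions += 1
            return False
        return True


def run_gated(verifier: CountVerifier, candidates) -> bool:
    """Submit candidates until the verifier (not the agent) reports the count is met."""
    for item_id, is_valid in candidates:
        if verifier.remaining == 0:
            break
        verifier.submit(item_id, is_valid)
    return verifier.claim_done()


if __name__ == "__main__":
    v = CountVerifier(target=3)
    stream = [("a", True), ("a", True), ("b", False), ("b", True), ("c", True), ("d", True)]
    print(run_gated(v, stream), v.duplicates, v.invalid, v.false_completions)
```

In this sketch the stopping decision depends only on the verifier's count of distinct accepted items, so a repeated or invalid submission is logged but never mistaken for progress.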

Problem

Research questions and friction points this paper is trying to address.

Quantitative Goal Persistence

Long-Horizon Agents

Task Completion

Progress Verification

Duplicate Submissions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantitative Goal Persistence

PushBench

Long-Horizon Agents