Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

📅 2026-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Static benchmarks do not adequately measure how large language model (LLM) agents perform under production conditions, particularly long-horizon execution, multi-tool coordination, and dependency management. This work proposes RAMP, a runtime evaluation framework for production settings built on the YatCC platform. RAMP evaluates agents on realistic compiler-construction workloads through a unified runtime architecture, standardized orchestration and execution interfaces, and serially dependent tasks, and it uses a staged recovery mechanism to analyze behavior under partial workflow failure. Its multi-dimensional utility metrics jointly assess outcome quality and process efficiency. Across 15 mainstream models, task completion rates fall from 100% in the initial stage to 20% in the final stage, and no model completes the entire pipeline. The runtime analysis also shows systematic failure propagation and computational costs that differ by up to three orders of magnitude among comparable models, which suggests that conventional isolated benchmarks overstate practical agent capability.
📝 Abstract
LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.
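The abstract describes serially dependent stages, a staged recovery mechanism, and per-stage completion rates, but gives no implementation detail. The following is a minimal illustrative sketch of how such a measurement could be organized, not RAMP's actual code. It assumes that "staged recovery" means a failed stage hands a reference artifact to the next stage so later stages can still be assessed; the names `run_serial_workflow`, `completion_rate_per_stage`, `cost_spread_in_orders_of_magnitude`, the agent signature, and the cost unit are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StageOutcome:
    stage: int
    passed: bool
    cost: float      # hypothetical cost unit, e.g. tokens consumed
    recovered: bool  # True if this stage started from a reference artifact


# Hypothetical agent signature:
# (stage index, input artifact) -> (output artifact or None on failure, cost)
Agent = Callable[[int, str], Tuple[Optional[str], float]]


def run_serial_workflow(agent: Agent, references: List[str],
                        staged_recovery: bool = True) -> List[StageOutcome]:
    """Run stages in order; each stage consumes the previous stage's output.

    Assumption: with staged recovery, a failed stage is replaced by its
    reference artifact so the next stage can still be attempted. Without
    it, the first failure ends the run.
    """
    outcomes = []
    artifact, recovered = "source", False
    for stage, reference in enumerate(references):
        output, cost = agent(stage, artifact)
        passed = output is not None
        outcomes.append(StageOutcome(stage, passed, cost, recovered))
        if passed:
            artifact, recovered = output, False
        elif staged_recovery:
            artifact, recovered = reference, True
        else:
            break
    return outcomes


def completion_rate_per_stage(runs: List[List[StageOutcome]],
                              num_stages: int) -> List[float]:
    """Fraction of runs that passed each stage; unreached stages count as failed."""
    rates = []
    for stage in range(num_stages):
        passed = sum(1 for run in runs
                     if stage < len(run) and run[stage].passed)
        rates.append(passed / len(runs))
    return rates


def cost_spread_in_orders_of_magnitude(runs: List[List[StageOutcome]]) -> float:
    """log10 of the ratio between the most and least expensive run."""
    totals = [sum(o.cost for o in run) for run in runs]
    return math.log10(max(totals) / min(totals))


def make_toy_agent(fail_from: int, cost: float) -> Agent:
    """Toy stand-in for a model: fails every stage at or after `fail_from`."""
    def agent(stage: int, artifact: str) -> Tuple[Optional[str], float]:
        if stage >= fail_from:
            return None, cost
        return f"{artifact}->s{stage}", cost
    return agent


references = ["ref0", "ref1", "ref2"]
toy_agents = [make_toy_agent(3, 10.0), make_toy_agent(1, 10000.0)]
runs = [run_serial_workflow(a, references) for a in toy_agents]
rates = completion_rate_per_stage(runs, len(references))
spread = cost_spread_in_orders_of_magnitude(runs)
```

In this toy setup the second agent fails from the second stage onward, so its third stage starts from a reference artifact instead of its own broken output. Separating "passed" from "recovered" in this way is one possible route to distinguishing a stage that fails on its own from one that fails only because an upstream error propagated.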
Problem

Research questions and friction points this paper is trying to address.

agentic models
runtime assessment
production systems
long-horizon evaluation
benchmark limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

runtime assessment
agentic models
production-grounded evaluation
long-horizon workflows
failure propagation