NetPress: Dynamically Generated LLM Benchmarks for Network Applications

📅 2025-06-03
🤖 AI Summary
Existing LLM agent evaluations rely heavily on static, small-scale benchmarks and fail to reflect real-world requirements, particularly in high-reliability domains such as network operations. Method: The paper proposes the first dynamic, closed-loop evaluation framework tailored to network operations. It employs a unified state-action abstraction to support on-demand generation of million-scale, configurable queries together with their ground-truth labels, and integrates containerized network emulators (e.g., EVE-NG, Colosseum) to jointly assess correctness, security compliance, and latency. Methodologically, it combines DSL-driven benchmark synthesis, automated ground-truth derivation, and multi-dimensional metrics. Contribution/Results: Evaluated across three canonical network tasks, the framework uncovers previously unobserved performance disparities among mainstream LLM agents, especially regarding security constraints, operational fault tolerance, and temporal sensitivity. These findings empirically support the critical role of dynamic, closed-loop evaluation in assessing deployment readiness for production-grade network automation.

📝 Abstract
Despite growing interest in domain-specific benchmarking of large language models (LLMs) and agents, current evaluations remain limited to static, small-scale datasets, especially in high-stakes tasks like network operations that demand reliability for deployments. We present NetPress, an automated benchmark generation framework for evaluating LLM agents in network applications. NetPress introduces a unified abstraction with state and action, enabling dynamic generation of diverse query sets along with corresponding ground truths. At runtime, users can specify benchmark configurations to generate millions of queries on the fly. In addition to dynamic benchmark construction, NetPress integrates with network emulators to provide realistic environment feedback, supporting comprehensive evaluation across correctness, safety, and latency. We instantiate NetPress on three representative applications, revealing interesting fine-grained differences in agent behavior that static, correctness-only benchmarks often miss. NetPress moves LLM evaluation toward realistic, scalable testing in infrastructure-centric domains, helping close the gap between benchmark performance and real-world deployment readiness. Code is available at https://github.com/Froot-NetSys/NetPress.
Problem

Research questions and friction points this paper is trying to address.

Dynamic benchmark generation for LLM evaluation in network applications
Addressing limitations of static datasets in high-stakes network operations
Integrating network emulators for realistic correctness, safety, and latency testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated dynamic benchmark generation framework
Unified abstraction with state and action
Integration with network emulators for feedback
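The "unified abstraction with state and action" can be illustrated with a minimal sketch. This is a hypothetical illustration, not the NetPress API: the class and function names (`NetworkState`, `apply`, `generate_query`) are invented here. The key idea from the paper is that sampling an action, applying it to a state, and reading the label off the resulting state lets queries and ground truths be generated on demand rather than hand-labeled.

```python
import random

# Hypothetical sketch of state-action benchmark generation.
# All names below are illustrative, not part of NetPress itself.

class NetworkState:
    """Minimal network state: host -> set of directly linked hosts."""
    def __init__(self, links):
        self.links = {h: set(peers) for h, peers in links.items()}

    def apply(self, action):
        """Apply an (add/remove, src, dst) link action; return the new state."""
        kind, a, b = action
        links = {h: set(p) for h, p in self.links.items()}
        if kind == "add":
            links.setdefault(a, set()).add(b)
        elif kind == "remove":
            links.get(a, set()).discard(b)
        return NetworkState(links)

    def reachable(self, a, b):
        return b in self.links.get(a, set())

def generate_query(state, rng):
    """Sample a random action; derive the ground truth from the next state."""
    hosts = list(state.links)
    a, b = rng.sample(hosts, 2)
    action = (rng.choice(["add", "remove"]), a, b)
    next_state = state.apply(action)
    query = f"After '{action[0]} link {a}->{b}', is {a} directly linked to {b}?"
    ground_truth = next_state.reachable(a, b)
    return query, ground_truth

rng = random.Random(0)
s0 = NetworkState({"h1": {"h2"}, "h2": set(), "h3": {"h1"}})
query, truth = generate_query(s0, rng)
print(query, truth)
```

Because the label is computed from the post-action state rather than authored by hand, the same loop scales to arbitrarily many configurations; NetPress additionally closes the loop by executing agent actions in a network emulator instead of a toy state object.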