AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Conditions

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current tool-augmented large language model (LLM) agents perform well on idealized benchmarks but show insufficient robustness in real-world noisy environments, and systematic methodologies for evaluating this robustness are lacking. This work proposes AgentNoiseBench, the first framework to formally define and categorize the environmental noise affecting agents into user-side and tool-side noise. It introduces a controllable and scalable noise injection mechanism and constructs an automated evaluation pipeline that preserves task solvability. Extensive experiments across diverse LLM architectures and scales reveal significant performance degradation under various noise conditions, highlighting the sensitivity of existing agents to realistic perturbations. AgentNoiseBench thus establishes the first systematic benchmark for robust agent research and offers clear directions for future improvements.

📝 Abstract
Recent advances in large language models have enabled LLM-based agents to achieve strong performance on a variety of benchmarks. However, their performance in real-world deployments often falls short of that observed in benchmark settings, especially in complex and imperfect environments. This discrepancy largely arises because prevailing training and evaluation paradigms are typically built on idealized assumptions, overlooking the inherent stochasticity and noise present in real-world interactions. To bridge this gap, we introduce AgentNoiseBench, a framework for systematically evaluating the robustness of agentic models under noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios and categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks while preserving task solvability. Leveraging this pipeline, we perform extensive evaluations across a wide range of models with diverse architectures and parameter scales. Our results reveal consistent performance variations under different noise conditions, highlighting the sensitivity of current agentic models to realistic environmental perturbations.
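The abstract describes a pipeline that injects controllable user-side and tool-side noise while preserving task solvability, but no implementation is given here. The sketch below is therefore purely illustrative: the function names (`inject_tool_noise`, `inject_user_noise`), the `required_keys` convention, and the specific perturbation rules are assumptions, not the authors' code.

```python
import random

# Hypothetical sketch of "tool-side" noise: perturb a tool's JSON-like output
# while leaving task-critical fields untouched, mirroring the paper's
# solvability-preserving constraint.
def inject_tool_noise(tool_output: dict, required_keys: set,
                      drop_prob: float = 0.2, seed: int = None) -> dict:
    rng = random.Random(seed)
    noisy = {}
    for key, value in tool_output.items():
        if key in required_keys:
            noisy[key] = value                      # never touch task-critical fields
        elif rng.random() < drop_prob:
            continue                                # simulate a missing field
        elif isinstance(value, str) and rng.random() < drop_prob:
            noisy[key] = value[: max(1, len(value) // 2)]  # simulate a truncated string
        else:
            noisy[key] = value
    return noisy


# Hypothetical sketch of "user-side" noise: character-level typos injected into
# the user query at a controllable rate.
def inject_user_noise(query: str, typo_prob: float = 0.05, seed: int = None) -> str:
    rng = random.Random(seed)
    chars = list(query)
    for i, ch in enumerate(chars):
        if ch.isalpha() and rng.random() < typo_prob:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)
```

Under this reading, the "controllable" aspect of the mechanism corresponds to parameters such as `drop_prob` and `typo_prob`, which set the noise intensity, while a fixed `seed` keeps injected noise reproducible across evaluated models.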
Problem

Research questions and friction points this paper is trying to address.

LLM agents
robustness
environmental noise
benchmarking
real-world deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

AgentNoiseBench
robustness evaluation
environmental noise
tool-using LLM agents
noise injection