Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B

📅 2025-09-21

🤖 AI Summary

Model-level safety evaluations do not fully reflect the deployment risks of agentic AI systems. To examine this gap, the paper red-teams the open-source model GPT-OSS-20B at two levels, the standalone model and the model running inside an agentic loop, using the AgentSeer observability framework, which decomposes agentic systems into granular actions and components, together with harmful objectives drawn from HarmBench. The results show that agentic execution contexts introduce attack vectors of their own: some objectives that fail completely at the model level are compromised at the agentic level, and tool-calling contexts show 24% higher vulnerability than non-tool contexts. Conversely, some exploits succeed only against the standalone model, so model-level findings cannot be reliably extrapolated to agentic systems. The core contributions are the identification of agentic-only vulnerabilities and an observability-driven methodology for assessing agent security.

📝 Abstract

As the industry increasingly adopts agentic AI systems, understanding their unique vulnerabilities becomes critical. Prior research suggests that security flaws at the model level do not fully capture the risks present in agentic deployments, where models interact with tools and external environments. This paper investigates this gap by conducting a comparative red teaming analysis of GPT-OSS-20B, a 20-billion-parameter open-source model. Using our observability framework AgentSeer to deconstruct agentic systems into granular actions and components, we apply iterative red teaming attacks with harmful objectives from HarmBench at two distinct levels: the standalone model and the model operating within an agentic loop. Our evaluation reveals fundamental differences between model-level and agentic-level vulnerability profiles. Critically, we discover the existence of agentic-only vulnerabilities, attack vectors that emerge exclusively within agentic execution contexts while remaining inert against standalone models. Agentic-level iterative attacks successfully compromise objectives that completely failed at the model level, with tool-calling contexts showing 24% higher vulnerability than non-tool contexts. Conversely, certain model-specific exploits work exclusively at the model level and fail when transferred to agentic contexts, demonstrating that standalone model vulnerabilities do not always generalize to deployed systems.
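The comparison described in the abstract can be illustrated with a small sketch. It assumes each attack attempt is recorded as an objective, a level, and a success flag, and it partitions objectives into agentic-only, model-only, and shared compromises using set differences. The `AttackResult` record, the `compare_levels` helper, and the objective names are hypothetical and invented for illustration; they are not taken from AgentSeer or HarmBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackResult:
    """One red-teaming outcome (hypothetical record layout)."""
    objective: str   # placeholder objective identifier
    level: str       # "model" or "agentic"
    success: bool    # True if the attack achieved the objective


def succeeded(results, level):
    """Return the set of objectives compromised at the given level."""
    return {r.objective for r in results if r.level == level and r.success}


def compare_levels(results):
    """Partition compromised objectives by the level(s) at which they fell."""
    model_ok = succeeded(results, "model")
    agentic_ok = succeeded(results, "agentic")
    return {
        "agentic_only": agentic_ok - model_ok,  # inert against the bare model
        "model_only": model_ok - agentic_ok,    # exploit does not transfer
        "both": model_ok & agentic_ok,
    }


# Made-up outcomes, for illustration only.
results = [
    AttackResult("obj_a", "model", False),
    AttackResult("obj_a", "agentic", True),
    AttackResult("obj_b", "model", True),
    AttackResult("obj_b", "agentic", False),
    AttackResult("obj_c", "model", True),
    AttackResult("obj_c", "agentic", True),
]
profile = compare_levels(results)
```

In this toy data, `obj_a` plays the role of an agentic-only vulnerability and `obj_b` that of a model-specific exploit that fails to transfer, mirroring the two directions of the gap that the abstract reports.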

Problem

Research questions and friction points this paper is trying to address.

Comparing security vulnerabilities between standalone AI models and agentic systems

Investigating unique attack vectors that emerge exclusively in agentic AI deployments

Analyzing how model-level vulnerabilities differ from agentic-level security risks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Comparative red teaming analysis of model vs agentic levels

AgentSeer framework deconstructs agentic systems into granular actions

Reveals agentic-only vulnerabilities, with tool-calling contexts showing 24% higher vulnerability than non-tool contexts
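As a rough illustration of action-level observability, the sketch below models an agent run as a graph of actions and compares attack success rates between tool-calling and non-tool actions. Everything here is an assumption made for illustration: the `Action` and `ActionGraph` classes, their fields, and the numbers are hypothetical and do not reflect AgentSeer's actual data model or the paper's measurements. The source also does not state whether the reported 24% figure is an absolute or a relative difference, so the sketch computes both.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """A single step in an agent run (hypothetical schema)."""
    action_id: str
    uses_tool: bool      # True if this step issues a tool call
    attempts: int = 0    # attacks injected at this action
    successes: int = 0   # attacks that achieved their objective


@dataclass
class ActionGraph:
    """Actions plus directed edges giving their execution order."""
    actions: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_action(self, action):
        self.actions[action.action_id] = action

    def add_edge(self, src, dst):
        if src not in self.actions or dst not in self.actions:
            raise KeyError("both endpoints must be registered actions")
        self.edges.append((src, dst))

    def success_rate(self, uses_tool):
        """Pooled attack success rate over tool or non-tool actions."""
        selected = [a for a in self.actions.values() if a.uses_tool == uses_tool]
        attempts = sum(a.attempts for a in selected)
        if attempts == 0:
            return 0.0
        return sum(a.successes for a in selected) / attempts


# Made-up counts, for illustration only.
graph = ActionGraph()
graph.add_action(Action("plan", uses_tool=False, attempts=20, successes=5))
graph.add_action(Action("search", uses_tool=True, attempts=10, successes=6))
graph.add_action(Action("write_file", uses_tool=True, attempts=10, successes=4))
graph.add_edge("plan", "search")
graph.add_edge("search", "write_file")

tool_rate = graph.success_rate(uses_tool=True)
non_tool_rate = graph.success_rate(uses_tool=False)
absolute_gap = tool_rate - non_tool_rate                     # percentage-point style
relative_gap = (tool_rate - non_tool_rate) / non_tool_rate   # ratio style
```

Pooling attempts per context, rather than averaging per-action rates, keeps actions with few attempts from dominating the comparison; whether the paper aggregates this way is not stated in the source.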
