Security Steerability is All You Need

📅 2025-04-28

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

This work addresses application-layer security vulnerabilities in generative AI, specifically examining large language models’ (LLMs) resilience against customized adversarial attacks under system-level safety guardrails—e.g., “no discussion of politics”—explicitly defined in system prompts. To this end, we introduce *Security Steerability*, a novel evaluation dimension that anchors application-level defense on system prompts, moving beyond conventional general-purpose safety mitigation. Methodologically, we propose two benchmarking protocols—VeganRibs and ReverseText—that jointly integrate adversarial user simulation, semantic input decoupling, and strict guardrail enforcement. Experiments span major closed- and open-source LLMs, demonstrating their capacity to consistently adhere to application-specific safety policies under strong perturbations, while quantifying cross-model differences in security steerability. Our framework enables fine-grained, prompt-grounded assessment of LLMs’ operational security compliance.

Technology Category

Application Category

📝 Abstract

The adoption of Generative AI (GenAI) in various applications inevitably comes with expanding the attack surface, combining new security threats along with the traditional ones. Consequently, numerous research and industrial initiatives aim to mitigate these security threats in GenAI by developing metrics and designing defenses. However, while most of the GenAI security work focuses on universal threats (e.g. manipulating the LLM to generate forbidden content), there is significantly less discussion on application-level security and how to mitigate it. Thus, in this work we adopt an application-centric approach to GenAI security, and show that while LLMs cannot protect against ad-hoc application specific threats, they can provide the framework for applications to protect themselves against such threats. Our first contribution is defining Security Steerability - a novel security measure for LLMs, assessing the model's capability to adhere to strict guardrails that are defined in the system prompt ('Refrain from discussing about politics'). These guardrails, in case effective, can stop threats in the presence of malicious users who attempt to circumvent the application and cause harm to its providers. Our second contribution is a methodology to measure the security steerability of LLMs, utilizing two newly-developed datasets: VeganRibs assesses the LLM behavior in forcing specific guardrails that are not security per se in the presence of malicious user that uses attack boosters (jailbreaks and perturbations), and ReverseText takes this approach further and measures the LLM ability to force specific treatment of the user input as plain text while do user try to give it additional meanings...

Problem

Research questions and friction points this paper is trying to address.

Addressing application-level security threats in Generative AI

Defining Security Steerability to assess LLM adherence to guardrails

Measuring LLM capability to resist malicious user circumvention attempts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Application-centric approach to GenAI security

Security Steerability as novel LLM security measure

Methodology to measure LLM security steerability

🔎 Similar Papers

No similar papers found.

Uber

For New York, NY-based roles: The base salary range for this role is USD$202,000 per year - USD$224,000 per year. For San Francisco, CA-based roles: The base salary range for this role is USD$202,000 per year - USD$224,000 per year. For Seattle, WA-based roles: The base salary range for this role is USD$202,000 per year - USD$224,000 per year. For Sunnyvale, CA-based roles: The base salary range for this role is USD$202,000 per year - USD$224,000 per year.

New York, NY, USA / San Francisco, CA, USA / Seattle, WA, USA

Authors to Follow