Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

📅 2026-01-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies a novel security vulnerability in autonomous code agents: the hidden system prompts that guide them can be leaked through ordinary agentic interaction with the underlying large language model. To address this, the authors propose JustAsk, a framework that formulates system prompt extraction as an online exploration problem and recovers prompts without supervision through agent-driven strategy evolution. JustAsk combines an Upper Confidence Bound (UCB)-based strategy selection mechanism with a hierarchical skill space, pairing atomic probing actions with high-level orchestration capabilities. Extensive experiments across 41 black-box commercial large language models demonstrate that JustAsk fully or nearly fully reconstructs system prompts, exposing a pervasive architecture-level security flaw in current agent systems.
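The UCB-based strategy selection the summary mentions can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the function name, reward bookkeeping, and the exploration constant `c` are assumptions; in JustAsk the "arms" would be extraction skills and the reward some measure of prompt-recovery progress.

```python
import math

def ucb_select(counts, rewards, c=1.4):
    """Pick the index of the strategy maximizing the UCB1 score.

    counts[i]  - how many times strategy i has been tried
    rewards[i] - cumulative reward observed for strategy i
    c          - exploration constant (assumed value)
    """
    total = sum(counts)
    best_idx, best_score = 0, float("-inf")
    for i, (n, r) in enumerate(zip(counts, rewards)):
        if n == 0:
            return i  # try every strategy at least once
        # exploitation term (mean reward) + exploration bonus
        score = r / n + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```

The exploration bonus shrinks as a strategy is tried more often, so the selector gradually concentrates on skills that have yielded the most prompt leakage while still occasionally revisiting under-explored ones.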

📝 Abstract
Autonomous code agents built on large language models are reshaping software and AI development through tool use, long-horizon reasoning, and self-directed interaction. However, this autonomy introduces a previously unrecognized security risk: agentic interaction fundamentally expands the LLM attack surface, enabling systematic probing and recovery of hidden system prompts that guide model behavior. We identify system prompt extraction as an emergent vulnerability intrinsic to code agents and present JustAsk, a self-evolving framework that autonomously discovers effective extraction strategies through interaction alone. Unlike prior prompt-engineering or dataset-based attacks, JustAsk requires no handcrafted prompts, labeled supervision, or privileged access beyond standard user interaction. It formulates extraction as an online exploration problem, using Upper Confidence Bound-based strategy selection and a hierarchical skill space spanning atomic probes and high-level orchestration. These skills exploit imperfect system-instruction generalization and inherent tensions between helpfulness and safety. Evaluated on 41 black-box commercial models across multiple providers, JustAsk consistently achieves full or near-complete system prompt recovery, revealing recurring design- and architecture-level vulnerabilities. Our results expose system prompts as a critical yet largely unprotected attack surface in modern agent systems.
Problem

Research questions and friction points this paper is trying to address.

system prompt extraction
code agents
LLM security
prompt leakage
autonomous agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

system prompt extraction
autonomous code agents
LLM security
online exploration
hierarchical skill space