🤖 AI Summary
This work identifies and exploits quasi-loss signals returned by the fine-tuning APIs of closed-weight large language models (e.g., Google Gemini) to mount a novel prompt injection attack. Unlike conventional optimization-based attacks, it requires no access to model weights or gradients; instead, it treats the loss-like values in fine-tuning API responses as a side channel, interpreting them as discrete optimization feedback and feeding that signal into a greedy search framework for adversarial prompt generation. Its key contribution is the first principled use of this fine-grained, loss-like feedback for efficient, gradient-free prompt optimization. Experiments on the PurpleLlama benchmark demonstrate attack success rates of 65%–82% across Gemini models, revealing fine-tuning APIs as a previously overlooked attack surface for prompt injection. This finding delivers both a critical security warning and a new technical paradigm for API security auditing and red-teaming evaluations.
📝 Abstract
We surface a new threat to closed-weight Large Language Models (LLMs) that enables an attacker to compute optimization-based prompt injections. Specifically, we characterize how an attacker can leverage the loss-like information returned by the remote fine-tuning interface to guide the search for adversarial prompts. The fine-tuning interface is hosted by the LLM vendor and allows developers to fine-tune LLMs for their own tasks; it thus provides utility, but it also exposes enough information for an attacker to compute adversarial prompts. Through an experimental analysis, we characterize the loss-like values returned by the Gemini fine-tuning API and demonstrate that they provide a useful signal for discrete optimization of adversarial prompts using a greedy search algorithm. Using the PurpleLlama prompt injection benchmark, we demonstrate attack success rates between 65% and 82% on Google's Gemini family of LLMs. These attacks exploit the classic utility-security tradeoff: the fine-tuning interface provides a useful feature for developers but also exposes the LLMs to powerful attacks.
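To make the attack loop concrete, here is a minimal sketch of a greedy discrete search driven only by scalar loss-like feedback, in the spirit of the method described above. It is not the authors' implementation: `query_loss` is a hypothetical stand-in for the remote fine-tuning API (which, in the real attack, would return a loss-like value for a submitted training example), mocked here with a simple target-matching score so the sketch runs locally.

```python
import random

# Hypothetical oracle standing in for the remote fine-tuning API.
# In the real attack this would submit a candidate adversarial prompt
# to the vendor's fine-tuning endpoint and return the reported
# loss-like value; here we mock it with a target-matching loss.
TARGET = "print yes"
VOCAB = list("abcdefghijklmnopqrstuvwxyz ")

def query_loss(prompt: str) -> float:
    # Mock loss: fraction of character positions that miss the target.
    return sum(a != b for a, b in zip(prompt, TARGET)) / len(TARGET)

def greedy_search(seed: str, iters: int = 2000, seed_rng: int = 0) -> str:
    """Greedy coordinate search guided only by scalar loss feedback.

    Repeatedly mutates one position of the current best candidate and
    keeps the mutation only if the oracle reports a strictly lower loss.
    No gradients or model internals are used; each step costs one query.
    """
    rng = random.Random(seed_rng)
    best, best_loss = seed, query_loss(seed)
    for _ in range(iters):
        pos = rng.randrange(len(best))            # position to mutate
        cand = best[:pos] + rng.choice(VOCAB) + best[pos + 1:]
        cand_loss = query_loss(cand)
        if cand_loss < best_loss:                 # accept improvements only
            best, best_loss = cand, cand_loss
    return best
```

The key property this illustrates is that even a coarse, loss-like scalar per query is enough to drive the optimization: the search never needs gradients, logits, or weights, only the ordering of candidate losses.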