🤖 AI Summary
This work addresses adversarial attacks against black-box large language models (LLMs) that expose only textual outputs, with no access to logits, confidence scores, or model parameters. We propose a novel attack paradigm that leverages the model's own natural-language confidence expressions (e.g., "I am highly confident," "this may be inaccurate") as surrogate signals for gradient estimation and optimization. By treating such linguistic confidence feedback as a differentiable proxy and combining prompt engineering, reinforcement learning, and adaptive search strategies, our method enables efficient adversarial search purely through text-based interfaces. The same framework applies across vision-language model evasion, jailbreaking, and prompt injection. Experiments on diverse mainstream black-box LLMs demonstrate substantially improved attack success rates. Notably, we uncover a security paradox: stronger models, whose confidence articulation is richer and more nuanced, are empirically more vulnerable to this attack. The implementation is publicly available.
📝 Abstract
We present a novel approach for attacking black-box large language models (LLMs) by exploiting their ability to express confidence in natural language. Existing black-box attacks either require access to continuous model outputs such as logits or confidence scores (which are rarely available in practice) or rely on proxy signals from other models. Instead, we demonstrate how to prompt LLMs to express their internal confidence in a way that is sufficiently calibrated to enable effective adversarial optimization. We apply our general method to three attack scenarios: adversarial examples for vision-LLMs, jailbreaks, and prompt injections. Our attacks successfully generate malicious inputs against systems that expose only textual outputs, thereby dramatically expanding the attack surface for deployed LLMs. We further find that better and larger models exhibit superior calibration when expressing confidence, creating a concerning security paradox in which model capability improvements directly enhance vulnerability. Our code is available at this [link](https://github.com/zj-jayzhang/black_box_llm_optimization).
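To make the core idea concrete, the following is a minimal sketch of how verbalized confidence could serve as an optimization signal: parse the model's natural-language confidence expression into a numeric proxy score, then run a greedy black-box search over input perturbations guided by that score. The phrase-to-score mapping, the `query_model` callable, and the random-suffix search are illustrative assumptions for exposition, not the paper's actual implementation (which combines prompt engineering, reinforcement learning, and adaptive search).

```python
import random

# Hypothetical mapping from verbalized confidence phrases to numeric proxy
# scores; the phrase set and values are illustrative assumptions.
CONFIDENCE_SCALE = {
    "highly confident": 0.9,
    "fairly confident": 0.7,
    "somewhat unsure": 0.4,
    "may be inaccurate": 0.2,
}


def parse_confidence(reply: str) -> float:
    """Map a model's natural-language confidence expression to a score."""
    reply = reply.lower()
    for phrase, score in CONFIDENCE_SCALE.items():
        if phrase in reply:
            return score
    return 0.5  # neutral default when no known phrase appears


def optimize_suffix(query_model, base_prompt: str, steps: int = 50, seed: int = 0):
    """Greedy random search over an appended suffix, keeping any change that
    lowers the model's verbalized confidence (the attacker's proxy objective).

    `query_model` is any callable that takes a prompt string and returns the
    model's textual reply -- the only interface the attack assumes.
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    best_suffix = ""
    best_score = parse_confidence(query_model(base_prompt))
    for _ in range(steps):
        candidate = best_suffix + rng.choice(alphabet)
        score = parse_confidence(query_model(base_prompt + candidate))
        if score < best_score:  # lower verbalized confidence = progress
            best_score, best_suffix = score, candidate
    return best_suffix, best_score
```

Because the search only reads text replies, it works against any deployment that exposes a chat interface, which is exactly the threat model the paper targets; the calibration of the verbalized scores determines how well this proxy objective tracks the true attack objective.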