On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

📅 2025-07-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper presents the first systematic study of the robustness of large language models' (LLMs) verbal confidence expressions under adversarial attacks. Motivated by the need for reliable confidence calibration in high-stakes human-AI interaction, the authors propose an evaluation framework that jointly incorporates semantic-preserving perturbations and jailbreak-style attacks, and empirically assess confidence generation across diverse models, prompts, and scenarios. The results reveal that current LLMs are highly vulnerable: minimal, meaning-preserving input modifications consistently induce severe confidence miscalibration and answer instability, e.g., frequent answer switching despite unchanged semantics. Moreover, mainstream defense mechanisms fail to mitigate this fragility and often exacerbate output inconsistency. The work underscores the urgent need for robust confidence expression mechanisms and establishes a benchmark for evaluating and improving confidence reliability in trustworthy LLM deployment.

📝 Abstract
Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to ensure transparency, trust, and safety in human-AI interactions across many high-stakes applications. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods, and show that these attacks can significantly jeopardize verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current confidence elicitation methods are vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the urgent need to design more robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.
Problem

Research questions and friction points this paper is trying to address.

Assessing robustness of LLM verbal confidence under adversarial attacks
Evaluating vulnerability of confidence scores to perturbation and jailbreak methods
Identifying weaknesses in current confidence elicitation and defense techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for attacking verbal confidence via perturbation and jailbreak-based methods (illustrative sketch below)
Examining vulnerabilities across prompting strategies, model sizes, and application domains
Highlighting the need for more robust confidence expression mechanisms
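
To make the attack setup concrete, the sketch below shows a minimal confidence-stability probe in Python: a verbal confidence score is elicited before and after a semantic-preserving rewrite of the question, and the answer change and confidence shift are recorded. The prompt template, the ask_model callable, and the perturbation are hypothetical placeholders for illustration, not the framework proposed in the paper.

```python
# Illustrative sketch only: a generic probe of verbal-confidence stability under a
# semantic-preserving perturbation. `ask_model` is a hypothetical stand-in for any
# chat-completion call; the prompt template and parsing are assumptions, not the
# paper's actual attack framework.
import re
from typing import Callable, Tuple

CONFIDENCE_PROMPT = (
    "Answer the question, then state how confident you are (0-100%).\n"
    "Question: {question}\n"
    "Format: Answer: <answer> | Confidence: <percent>%"
)

def elicit(ask_model: Callable[[str], str], question: str) -> Tuple[str, float]:
    """Query the model and parse the verbalized answer and confidence."""
    reply = ask_model(CONFIDENCE_PROMPT.format(question=question))
    answer = re.search(r"Answer:\s*(.*?)\s*\|", reply)
    conf = re.search(r"Confidence:\s*([\d.]+)", reply)
    return (
        answer.group(1) if answer else reply.strip(),
        float(conf.group(1)) if conf else float("nan"),
    )

def perturb(question: str) -> str:
    """A trivially meaning-preserving rewrite (leading filler phrase)."""
    return "Could you please tell me: " + question

def confidence_shift(ask_model: Callable[[str], str], question: str) -> dict:
    """Compare answer and confidence before and after the perturbation."""
    base_ans, base_conf = elicit(ask_model, question)
    pert_ans, pert_conf = elicit(ask_model, perturb(question))
    return {
        "answer_changed": base_ans.lower() != pert_ans.lower(),
        "confidence_delta": pert_conf - base_conf,
    }
```

Under such a setup, a large confidence delta or a flipped answer on a meaning-preserving rewrite would indicate the kind of fragility the paper reports.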
Stephen Obadinma
Department of Electrical and Computer Engineering, Ingenuity Labs Research Institute, Queen’s University
Xiaodan Zhu
ECE & Ingenuity Labs Research Institute, Queen's University, Canada
Natural language processing, machine learning, artificial intelligence