When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals

📅 2025-11-30
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Safety-aligned language models frequently reject harmless prompts, yet existing evaluations rely solely on global false-rejection rates, overlooking local inconsistency: contradictory responses to semantically equivalent but lexically distinct prompts. Method: We introduce the concept of *semantic confusion* to formalize such local decision inconsistency and propose ParaGuard, a corpus of controlled paraphrase clusters, together with a model-agnostic evaluation framework. The framework jointly leverages paraphrase clustering, token embeddings, next-token probabilities, and perplexity to define three fine-grained, confusion-aware metrics: Confusion Index, Confusion Rate, and Confusion Depth. Contribution/Results: Experiments reveal that global metrics obscure critical structural disparities in rejection behavior. Our metrics precisely localize unstable decision boundaries and regions of local inconsistency, decoupling rejection frequency from justification validity. This enables principled trade-off analysis and joint optimization of safety assurance and user utility.

๐Ÿ“ Abstract
Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt alone and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce "semantic confusion," a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic metrics at the token level: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors and use token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model families and deployment guards show that global false-rejection rate hides critical structure. Our metrics reveal globally unstable boundaries in some settings, localized pockets of inconsistency in others, and cases where stricter refusal does not increase inconsistency. We also show how confusion-aware auditing separates how often a system refuses from how sensibly it refuses. This gives developers a practical signal to reduce false refusals while preserving safety.
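The paper's exact metric definitions are not reproduced on this page, but the core idea of paraphrase-cluster auditing can be illustrated with a minimal sketch. Assuming each cluster holds one intent probed through several paraphrases and each prompt yields a binary accept/refuse decision, a hypothetical cluster-level confusion rate (the helper names below are illustrative, not the paper's implementation) might look like:

```python
# Illustrative sketch: a cluster is "confused" when semantically equivalent
# paraphrases of one intent receive mixed accept/refuse decisions.
# This is an assumed simplification of the idea, not the paper's metric.

def cluster_confused(decisions):
    """decisions: list of booleans, True = refused, one per paraphrase."""
    refusals = sum(decisions)
    return 0 < refusals < len(decisions)  # mixed accept/refuse = inconsistent

def confusion_rate(clusters):
    """Fraction of paraphrase clusters with inconsistent decisions."""
    confused = sum(cluster_confused(d) for d in clusters.values())
    return confused / len(clusters)

# Toy example: three intents, each probed with three paraphrases.
clusters = {
    "sharpen a kitchen knife":  [False, True, False],   # inconsistent
    "write a phishing email":   [True, True, True],     # consistent refusal
    "explain a TCP handshake":  [False, False, False],  # consistent accept
}
print(confusion_rate(clusters))  # -> 0.3333...
```

A global false-rejection rate would count only how many prompts were refused; the cluster view above additionally flags *where* decisions contradict each other across equivalent phrasings, which is the structural signal the abstract describes.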
Problem

Research questions and friction points this paper is trying to address.

Measures local inconsistency in LLM refusals across paraphrases
Introduces semantic confusion to diagnose safety alignment gaps
Provides metrics to reduce false refusals while maintaining safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Paraphrase clusters for intent consistency testing
Token-level metrics using embeddings and probabilities
Confusion-aware auditing to reduce false refusals
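The abstract also mentions comparing each refusal to its nearest accepted neighbors in embedding space. A minimal sketch of that idea, assuming prompt embeddings are already available (the toy vectors and the 0.99 threshold below are assumptions for illustration, not values from the paper):

```python
import math

# Sketch of the nearest-accepted-neighbor signal: a refusal is more suspect
# when an accepted paraphrase sits very close in embedding space.
# Toy vectors and threshold; not the paper's data or implementation.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_accepted_similarity(refused_emb, accepted_embs):
    """Max cosine similarity between a refused prompt and accepted peers."""
    return max(cosine(refused_emb, e) for e in accepted_embs)

refused = [0.9, 0.1, 0.0]
accepted = [[0.88, 0.12, 0.01],  # near-identical paraphrase, was accepted
            [0.10, 0.90, 0.20]]  # unrelated accepted prompt
sim = nearest_accepted_similarity(refused, accepted)
print(sim > 0.99)  # a near-duplicate was accepted -> likely local confusion
```

Per the abstract, the actual metrics combine this kind of embedding proximity with next-token probabilities and perplexity signals at the token level; the sketch shows only the neighbor-comparison component.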
Riad Ahmed Anonto
Bangladesh University of Engineering and Technology (BUET)
LLMs · Computer Security · Machine Learning
Md Labid Al Nahiyan
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Md Tanvir Hassan
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Ch. Md. Rakin Haider
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)