SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations

📅 2025-10-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing black-box hallucination attacks against large language models (LLMs) suffer from semantic distortion and low realism, limiting their validity and interpretability. Method: SECA is a semantically equivalent and contextually coherent black-box adversarial attack framework that formulates the attack as a zeroth-order optimization problem constrained by both semantic similarity and contextual coherence. Its gradient-free algorithm preserves these constraints via semantic embedding regularization, dynamic context-coherence modeling, and a robust zeroth-order search strategy, enabling efficient generation of natural, credible adversarial prompts without access to gradients. Contribution/Results: Evaluated on open-ended multiple-choice QA tasks, the method achieves significantly higher attack success rates while reducing semantic deviation by 92% and incurring under 0.8% coherence loss (measured by BERTScore). It is the first approach to systematically uncover LLM hallucination triggers under high-fidelity constraints, establishing a novel paradigm for evaluating model reliability.

📝 Abstract
Large Language Models (LLMs) are increasingly deployed in high-risk domains. However, state-of-the-art LLMs often produce hallucinations, raising serious concerns about their reliability. Prior work has explored adversarial attacks for hallucination elicitation in LLMs, but it often produces unrealistic prompts, either by inserting gibberish tokens or by altering the original meaning. As a result, these approaches offer limited insight into how hallucinations may occur in practice. While adversarial attacks in computer vision often involve realistic modifications to input images, the problem of finding realistic adversarial prompts for eliciting LLM hallucinations has remained largely underexplored. To address this gap, we propose Semantically Equivalent and Coherent Attacks (SECA) to elicit hallucinations via realistic modifications to the prompt that preserve its meaning while maintaining semantic coherence. Our contributions are threefold: (i) we formulate finding realistic attacks for hallucination elicitation as a constrained optimization problem over the input prompt space under semantic equivalence and coherence constraints; (ii) we introduce a constraint-preserving zeroth-order method to effectively search for adversarial yet feasible prompts; and (iii) we demonstrate through experiments on open-ended multiple-choice question answering tasks that SECA achieves higher attack success rates while incurring almost no constraint violations compared to existing methods. SECA highlights the sensitivity of both open-source and commercial gradient-inaccessible LLMs to realistic and plausible prompt variations. Code is available at https://github.com/Buyun-Liang/SECA.
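Contribution (i) in the abstract can be sketched as a constrained maximization over the prompt space. The notation below is an illustration, not the paper's own: x₀ is the original prompt, f the target LLM, and the similarity and coherence functions and thresholds are assumed placeholders.

```latex
\max_{x' \in \mathcal{X}} \; \mathcal{L}_{\text{hal}}\bigl(f(x')\bigr)
\quad \text{s.t.} \quad
\mathrm{sim}(x', x_0) \ge \tau_{\text{sem}},
\qquad
\mathrm{coh}(x') \ge \tau_{\text{coh}}
```

Here the objective rewards prompts that elicit hallucinated answers, while the two constraints keep the adversarial prompt semantically equivalent to the original and contextually coherent, which is what distinguishes SECA from gibberish-token attacks.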
Problem

Research questions and friction points this paper addresses.

Developing realistic adversarial prompts to elicit LLM hallucinations
Preserving semantic meaning and coherence in attack modifications
Addressing limited insights from prior unrealistic adversarial approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

SECA uses realistic prompt modifications to elicit hallucinations
It formulates hallucination elicitation as a constrained optimization problem
SECA employs a constraint-preserving zeroth-order optimization method
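The constraint-preserving zeroth-order idea above can be sketched as a gradient-free loop that only ever scores candidates already satisfying the feasibility constraints. This is a minimal toy illustration, not the paper's implementation: `semantic_sim` is a Jaccard word-overlap stand-in for an embedding-based similarity, and the `propose`, `attack_score`, and `coherence` callables are hypothetical stand-ins for a paraphrase proposer, the black-box hallucination objective, and a coherence scorer.

```python
import random

def semantic_sim(a: str, b: str) -> float:
    """Toy Jaccard word overlap, standing in for an embedding-based similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def seca_search(prompt, propose, attack_score, coherence,
                sim_tau=0.8, coh_tau=0.9, iters=100, seed=0):
    """Constraint-preserving zeroth-order (gradient-free) search sketch.

    Infeasible candidates are rejected before scoring, so every accepted
    prompt stays semantically equivalent and coherent by construction.
    """
    rng = random.Random(seed)
    best, best_score = prompt, attack_score(prompt)
    for _ in range(iters):
        cand = propose(best, rng)  # e.g. a single synonym substitution
        # Feasibility check: reject semantic drift or incoherent candidates.
        if semantic_sim(cand, prompt) < sim_tau or coherence(cand) < coh_tau:
            continue
        s = attack_score(cand)     # black-box query to the target model
        if s > best_score:
            best, best_score = cand, s
    return best, best_score
```

In use, `propose` would draw meaning-preserving edits (synonym swaps, paraphrases) and `attack_score` would query the gradient-inaccessible LLM, so the loop needs only forward evaluations, matching the black-box setting the paper targets.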