AI Summary
Large language models are prone to hallucinations, yet existing adversarial attacks struggle to induce them while keeping the prompt coherent and semantically equivalent to the original. This work formulates hallucination induction as a constrained optimization problem: it constructs an input-dependent dictionary of semantically equivalent rewrite directions and optimizes continuous combinations of them in latent space, producing adversarial prompts that remain valid, coherent rephrasings of the benign prompt. By combining the flexibility of continuous latent-space optimization with the semantic fidelity of discrete rewrites, the proposed method, REALISTA, matches or surpasses state-of-the-art realistic attacks on open-source models and succeeds in attacking large reasoning models in free-form generation settings, where prior realistic attacks fail.
Abstract
Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, motivating the need for realistic adversarial prompts that elicit such failures. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun-Liang/REALISTA.
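To make the idea of optimizing continuous combinations of editing directions concrete, the following is a minimal toy sketch and not the REALISTA implementation. The abstract does not specify the encoder, the attack objective, or how the coefficients are constrained, so everything here is an assumption: latent vectors are plain lists of floats, each direction is the difference between a hypothetical rephrasing embedding and the benign prompt embedding, the "hallucination score" is a made-up linear function, and the coefficients are kept on the probability simplex through a softmax so the result stays inside the convex hull of the rephrasings. Decoding the latent back to text and checking that it is a valid rephrasing are omitted.

```python
import math

def softmax(theta):
    """Map unconstrained parameters to non-negative weights that sum to 1."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def combine(z0, directions, alpha):
    """Return z0 + sum_k alpha[k] * directions[k]."""
    z = list(z0)
    for a, d in zip(alpha, directions):
        for i, di in enumerate(d):
            z[i] += a * di
    return z

def optimize_coefficients(z0, directions, w, steps=200, lr=0.5):
    """Gradient ascent on a toy linear score w . z over softmax coefficients.

    ASSUMPTION: the linear score stands in for a real attack objective.
    """
    theta = [0.0] * len(directions)
    # For a linear score, d(score)/d(alpha_k) is simply w . d_k.
    gains = [dot(w, d) for d in directions]
    for _ in range(steps):
        alpha = softmax(theta)
        mean_gain = dot(alpha, gains)
        # Chain rule through the softmax: alpha_j * (gain_j - mean_gain).
        theta = [t + lr * a * (g - mean_gain)
                 for t, a, g in zip(theta, alpha, gains)]
    alpha = softmax(theta)
    return alpha, combine(z0, directions, alpha)

# Hypothetical toy data: a benign prompt latent and three rephrasing latents.
z0 = [0.0, 0.0]
rephrasings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
directions = [[r - b for r, b in zip(rep, z0)] for rep in rephrasings]
w = [0.2, 1.0]  # made-up weights of the toy score

alpha, z_adv = optimize_coefficients(z0, directions, w)
print("coefficients:", [round(a, 3) for a in alpha])
print("toy score:", round(dot(w, z_adv), 3))
```

In this toy setting the coefficients concentrate on the direction with the largest score gain. A real attack would replace the linear score with a loss computed on the target model's responses and would need a decoder that maps the optimized latent back to a prompt.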