LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning

📅 2026-05-01

🤖 AI Summary
This work addresses the critical issue of large language models (LLMs) hallucinating non-existent software packages during code generation, which exposes developers to slopsquatting supply-chain attacks. Existing mitigation strategies either incur high computational costs or compromise the model's general capabilities. To overcome these limitations, the authors propose an Adaptive Unlearning (AU) framework, the first post-training intervention that operates without human annotations or predefined forget sets. AU employs an adaptive discovery mechanism to continuously identify emerging hallucination scenarios and integrates a hybrid token-level optimization objective to precisely suppress package-related hallucinations using only model-generated data, while preserving valid outputs. The method reduces package hallucination rates by 81%, substantially narrowing the attack surface, and maintains competitive performance on standard code generation benchmarks.
📝 Abstract
Hallucinations, outputs that sound plausible but are factually incorrect, remain an open challenge for deployed LLMs. In code generation, models frequently hallucinate non-existent software packages, recommending imports and installation commands for fictional libraries. This creates a critical supply-chain vulnerability: an attacker can proactively register such packages on public registries with malicious payloads that are subsequently installed and executed by developers or autonomous agents, a class of package confusion attack known as slopsquatting. Once a model is deployed, mitigating this failure mode is difficult: full retraining is costly, and existing approaches either cause severe degradation of model utility or rely on a pre-specified forget-set, an assumption that does not apply to the unbounded space of hallucinations. To address this problem, we present Adaptive Unlearning (AU), a post-deployment framework that surgically suppresses hallucinations while preserving general model utility. AU introduces a hybrid token-level objective that simultaneously reinforces valid outputs and suppresses hallucinated ones. Combined with an adaptive discovery loop that continuously surfaces new hallucination-inducing contexts without human supervision, AU enables generalization to unseen prompts and hallucinations. We demonstrate that AU reduces package hallucination rates by 81%, corresponding to a substantial reduction in slopsquatting attack surface, while maintaining performance on standard coding benchmarks. Our analysis shows that distributional changes are concentrated on package-related generations, leaving general coding behavior largely unaffected and confirming that AU's effect is isolated to the targeted distribution. AU operates entirely on model-generated data, requires no human annotation, and generalizes across domains.
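
The abstract describes a hybrid token-level objective that reinforces valid tokens and suppresses hallucinated ones, but does not give its form. The sketch below is one plausible reading, not the authors' objective: it assumes an ordinary negative log-likelihood term on valid tokens and an unlikelihood-style term, -log(1 - p), on tokens flagged as part of a hallucinated package name. The function name `hybrid_token_loss`, the weight `alpha`, and the mask layout are all hypothetical.

```python
import math


def hybrid_token_loss(token_probs, suppress_mask, alpha=1.0, eps=1e-8):
    """Illustrative hybrid loss over one generated sequence (an assumption,
    not the paper's published formula).

    token_probs[i]   -- model probability assigned to the i-th generated token
    suppress_mask[i] -- True if token i belongs to a hallucinated package name
    alpha            -- assumed weight on the suppression term
    """
    if len(token_probs) != len(suppress_mask):
        raise ValueError("token_probs and suppress_mask must have equal length")
    if not token_probs:
        return 0.0

    total = 0.0
    for p, suppress in zip(token_probs, suppress_mask):
        if suppress:
            # Suppression term: penalizes high probability on a flagged token.
            total += -alpha * math.log(max(1.0 - p, eps))
        else:
            # Reinforcement term: standard negative log-likelihood.
            total += -math.log(max(p, eps))
    return total / len(token_probs)
```

Under this assumed form, a flagged token that the model still assigns high probability contributes a large loss, while valid tokens are trained as usual, which is one way to keep the update focused on package-related positions.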
Problem

Research questions and friction points this paper is trying to address.

hallucination
slopsquatting
code generation
supply-chain vulnerability
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive Unlearning
Hallucination Suppression
Slopsquatting
Token-level Objective
Self-supervised Discovery
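
One round of the self-supervised discovery idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: `generate` stands in for any model call, `known_packages` is a hypothetical snapshot of registry names, and package names are pulled out with simple regular expressions over `pip install` commands and import statements.

```python
import re

# "pip install <name>"; the name must start with a letter or digit, so flags
# such as "-r" are skipped.
_PIP_RE = re.compile(r"pip install\s+([A-Za-z0-9][A-Za-z0-9_.\-]*)")
# Top-level module in "import <name>" or "from <name> import ...".
_IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_][A-Za-z0-9_]*)", re.MULTILINE)


def extract_packages(code):
    """Collect candidate package names mentioned in generated code.

    Caveat: import names and distribution names can differ (for example
    `import yaml` versus `pip install pyyaml`); this sketch ignores that.
    """
    return set(_PIP_RE.findall(code)) | set(_IMPORT_RE.findall(code))


def discover(prompts, generate, known_packages):
    """Return ({prompt: unknown package names}, fraction of prompts flagged)."""
    flagged = {}
    for prompt in prompts:
        unknown = extract_packages(generate(prompt)) - known_packages
        if unknown:
            flagged[prompt] = unknown
    rate = len(flagged) / len(prompts) if prompts else 0.0
    return flagged, rate
```

The flagged prompts would then serve as model-generated training contexts for the next unlearning round, and the returned rate gives a simple per-prompt hallucination measure; how the paper actually defines either is not stated in this summary.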