🤖 AI Summary
In generative planning, large language models often produce hallucinated goals: semantically plausible yet physically unreachable states that lead to delusional decision-making and safety risks. To address this, we propose a planning framework augmented with a learnable goal evaluator. Its core is a differentiable goal reachability discriminator that integrates rule-guided neural architecture design with two novel delusion-aware hindsight relabeling strategies. This is the first approach that robustly identifies and actively rejects hallucinated goals without requiring real-world reward signals. The method effectively suppresses planning delusions, improving task success rates by 12.7–34.2% across multiple simulated domains while reducing the frequency of unsafe actions by more than 50%. By enabling reliable goal validation within the planning loop, our framework establishes a new paradigm for safe and trustworthy generative agent planning.
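The "goal validation within the planning loop" described above can be sketched as a filter between the generative proposer and the planner. This is a minimal illustration only: the names `propose_goals`, `reachability_score`, and `filter_goals`, the toy unit-ball feasibility criterion, and the 0.5 threshold are all assumptions for demonstration, not the paper's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_goals(n=8, dim=2):
    """Toy stand-in for the generative model: proposes candidate goal
    states, some of which may be infeasible."""
    return rng.normal(scale=2.0, size=(n, dim))

def reachability_score(goal, radius=1.0):
    """Toy stand-in for the learned goal evaluator: a sigmoid that assigns
    low reachability to states far outside a unit ball around the origin
    (a hypothetical feasible region)."""
    return 1.0 / (1.0 + np.exp(np.linalg.norm(goal) - radius))

def filter_goals(goals, threshold=0.5):
    """Reject candidate goals the evaluator deems unreachable, so only
    feasible targets are passed on to the planner."""
    return [g for g in goals if reachability_score(g) >= threshold]

goals = propose_goals()
feasible = filter_goals(goals)
print(f"kept {len(feasible)} of {len(goals)} proposed goals")
```

In the proposed framework, the hand-written `reachability_score` would be replaced by the trained differentiable discriminator; the surrounding reject-or-accept loop stays the same.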
📝 Abstract
Generative models can be used in planning to propose targets corresponding to states or observations that agents deem either likely or advantageous to experience. However, agents can struggle with hallucinated, infeasible targets proposed by these models, leading to delusional planning behaviors and raising safety concerns. Drawing inspiration from the human brain, we propose to reject these hallucinated targets with an add-on target evaluator. Without proper training, however, the evaluator can itself produce delusional estimates, rendering it futile. We address this via a combination of a learning rule, a network architecture, and two novel hindsight relabeling strategies, which together lead to correct evaluations of infeasible targets. Our experiments confirm that our approach significantly reduces delusional behaviors and enhances the performance of planning agents.
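Since the abstract does not spell out the two relabeling strategies, the sketch below shows one plausible reading of delusion-aware hindsight relabeling: states the agent actually visited become positive (reachable) training examples for the evaluator, while proposed targets that were never reached are relabeled as negative (hallucinated) examples. The `Episode` container and `relabel` function are hypothetical, and this is only an interpretation of the idea, not the paper's algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    visited: list          # states the agent actually reached this episode
    proposed_goal: tuple   # target suggested by the generative model

def relabel(episodes):
    """Build evaluator training data without real-world reward signals:
    achieved states are labeled reachable, and proposed goals that were
    never achieved are labeled as delusions (negatives)."""
    positives, negatives = [], []
    for ep in episodes:
        positives.extend(ep.visited)              # hindsight: these were reachable
        if ep.proposed_goal not in ep.visited:    # never reached: treat as hallucinated
            negatives.append(ep.proposed_goal)
    return positives, negatives

eps = [Episode(visited=[(0, 0), (1, 0)], proposed_goal=(5, 5)),
       Episode(visited=[(0, 0), (0, 1)], proposed_goal=(0, 1))]
pos, neg = relabel(eps)
print(len(pos), len(neg))  # → 4 1
```

Labels produced this way supervise the reachability discriminator directly from the agent's own experience, which is consistent with the abstract's claim that no external reward signal is required.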