🤖 AI Summary
Large language model agents often treat internally generated reasoning as reliable, even when it violates logical or evidential constraints, leading to systematic behavioral drift over time. To address this, the authors propose the SAVeR framework, which performs structured self-auditing and verification of internal beliefs before action execution: it generates diverse candidate beliefs through persona-driven prompting, localizes errors via adversarial auditing, and applies minimal, constraint-guided interventions to repair them. SAVeR is the first approach to validate belief formation before action commitment, shifting faithfulness evaluation from outcome consensus to process verifiability and thereby halting the accumulation and propagation of erroneous beliefs. Experiments show that SAVeR significantly improves reasoning faithfulness across six benchmarks while maintaining strong task performance.
📝 Abstract
In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory. However, coherent reasoning can still violate logical or evidential constraints, allowing unsupported beliefs to be repeatedly stored and propagated across decision steps and leading to systematic behavioral drift in long-horizon agentic systems. Most existing strategies rely on consensus mechanisms, conflating agreement with faithfulness. In this paper, motivated by the vulnerability posed by unfaithful intermediate reasoning trajectories, we propose \textbf{S}elf-\textbf{A}udited \textbf{Ve}rified \textbf{R}easoning (\textsc{SAVeR}), a novel framework that enforces verification over an agent's internal belief states before action commitment, thereby achieving faithful reasoning. Concretely, we structurally generate diverse persona-based candidate beliefs for selection within a faithfulness-relevant structure space. To enforce reasoning faithfulness, we perform adversarial auditing to localize violations and repair them through constraint-guided minimal interventions under verifiable acceptance criteria. Extensive experiments on six benchmark datasets demonstrate that our approach consistently improves reasoning faithfulness while preserving competitive end-task performance.
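As a rough illustration of the generate–audit–repair loop the abstract describes, the following Python sketch replaces persona-driven prompting and LLM-based adversarial auditing with toy stand-ins; all names here (`Constraint`, `saver_step`, the sample personas and constraints) are invented for illustration and are not part of the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: beliefs are plain dicts, constraints are predicates
# paired with a minimal repair, and "personas" are simple functions standing
# in for persona-driven prompting of an LLM.

@dataclass
class Constraint:
    name: str
    check: Callable[[dict], bool]    # True if the belief satisfies it
    repair: Callable[[dict], dict]   # constraint-guided minimal intervention

def generate_candidates(personas, observation):
    # Stand-in for structured persona-based belief generation.
    return [persona(observation) for persona in personas]

def audit(belief, constraints):
    # Stand-in for adversarial auditing: localize every violated constraint.
    return [c for c in constraints if not c.check(belief)]

def saver_step(personas, observation, constraints):
    # Verify belief formation before action commitment.
    for belief in generate_candidates(personas, observation):
        for c in audit(belief, constraints):
            belief = c.repair(belief)        # minimal repair per violation
        if not audit(belief, constraints):   # verifiable acceptance criterion
            return belief
    return None  # no candidate belief could be verified

# Toy example: beliefs about a count must be non-negative and cite evidence.
constraints = [
    Constraint("non_negative",
               lambda b: b["count"] >= 0,
               lambda b: {**b, "count": max(0, b["count"])}),
    Constraint("has_evidence",
               lambda b: bool(b.get("evidence")),
               lambda b: {**b, "evidence": ["observation"]}),
]
personas = [
    lambda obs: {"count": -1, "evidence": []},        # sloppy persona
    lambda obs: {"count": obs, "evidence": ["obs"]},  # careful persona
]

belief = saver_step(personas, 3, constraints)
# The sloppy persona's belief violates both constraints; after the two
# minimal repairs it passes re-audit and is accepted.
```

The key design point mirrored here is that acceptance is decided by re-running the audit after repair, not by agreement among candidates, which is the shift from consensus to process verifiability the abstract emphasizes.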