🤖 AI Summary
Large language models (LLMs) optimize implicit, opaque objectives, making alignment hard to verify and audit; existing inverse reinforcement learning (IRL) methods yield single, overconfident reward estimates and do not address the non-identifiability of the underlying objective. This paper proposes a Bayesian IRL framework for LLM alignment auditing that recasts objective inference as a statistically verifiable problem. The framework quantifies and reduces non-identifiability by demonstrating posterior contraction over sequential rounds of evidence, and uses epistemic uncertainty estimates to expose spurious shortcuts and flag out-of-distribution prompts where the inferred objective cannot be trusted. The refined, low-uncertainty reward also supports policy-level validation: used directly in RLHF, it reproduces training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, the end-to-end auditing pipeline recovers a well-calibrated, interpretable reward function from a detoxified LLM.
📝 Abstract
The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that reframes reward inference from a simple estimation task into a comprehensive process of verification. Our framework leverages Bayesian IRL not only to recover a distribution over objectives but also to enable three critical audit capabilities: (i) quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.
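The posterior-contraction idea at the heart of capability (i) can be illustrated with a toy conjugate-Gaussian model (a hypothetical sketch for intuition only, not the paper's actual reward model): as an auditor accumulates rounds of noisy behavioural evidence about a reward parameter, the posterior variance shrinks, quantifying how much non-identifiability remains.

```python
import numpy as np

# Hypothetical illustration (not the paper's model): infer a scalar reward
# weight theta from batches of noisy behavioural evidence via conjugate
# Gaussian updates, and watch the posterior contract round by round.
rng = np.random.default_rng(0)
theta_true = 1.5   # ground-truth reward weight, unknown to the auditor
obs_noise = 1.0    # known observation-noise standard deviation

# Wide prior over theta, reflecting initial non-identifiability.
mu, var = 0.0, 10.0

posterior_vars = []
for round_idx in range(5):
    # Each audit round yields a fresh batch of noisy evidence about theta.
    batch = theta_true + obs_noise * rng.standard_normal(20)
    # Conjugate update: precisions add; means combine by precision weight.
    prec = 1.0 / var + len(batch) / obs_noise**2
    mu = (mu / var + batch.sum() / obs_noise**2) / prec
    var = 1.0 / prec
    posterior_vars.append(var)

# Posterior variance shrinks monotonically -- the contraction an auditor
# would track as evidence of reduced objective ambiguity.
assert all(a > b for a, b in zip(posterior_vars, posterior_vars[1:]))
print(f"posterior mean={mu:.3f}, posterior variance={var:.4f}")
```

In the full framework the inferred object is a distribution over reward functions rather than a scalar, but the same diagnostic applies: sequential evidence accumulation should drive posterior uncertainty down, and regions where it does not are exactly where the inferred objective should not be trusted.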