🤖 AI Summary
Large language models (LLMs) optimize implicit, opaque objectives, making alignment hard to verify and audit; existing inverse reinforcement learning (IRL) methods yield single, overconfident reward estimates and do not address the non-identifiability of the underlying objective. This paper proposes a Bayesian IRL framework for LLM alignment auditing that recasts objective inference as a statistically verifiable problem. The framework quantifies and reduces non-identifiability by demonstrating posterior contraction over sequential rounds of evidence, and uses epistemic uncertainty estimates to expose spurious shortcuts and flag out-of-distribution prompts where the inferred objective cannot be trusted. The refined, low-uncertainty reward also supports policy-level validation: used directly in RLHF, it reproduces training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, the end-to-end auditing pipeline recovers a well-calibrated, interpretable reward function from a detoxified LLM.
📝 Abstract
The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that reframes reward inference from a simple estimation task into a comprehensive process of verification. Our framework leverages Bayesian IRL not only to recover a distribution over objectives but also to enable three critical audit capabilities: (i) quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.
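The posterior-contraction idea at the heart of capability (i) can be illustrated with a toy conjugate-Gaussian model (a hypothetical sketch for intuition only, not the paper's actual reward model): as an auditor accumulates rounds of noisy behavioural evidence about a reward parameter, the posterior variance shrinks, quantifying how much non-identifiability remains.

```python
import numpy as np

# Hypothetical illustration (not the paper's model): infer a scalar reward
# weight theta from batches of noisy behavioural evidence via conjugate
# Gaussian updates, and watch the posterior contract round by round.
rng = np.random.default_rng(0)
theta_true = 1.5   # ground-truth reward weight, unknown to the auditor
obs_noise = 1.0    # known observation-noise standard deviation

# Wide prior over theta, reflecting initial non-identifiability.
mu, var = 0.0, 10.0

posterior_vars = []
for round_idx in range(5):
    # Each audit round yields a fresh batch of noisy evidence about theta.
    batch = theta_true + obs_noise * rng.standard_normal(20)
    # Conjugate update: precisions add; means combine by precision weight.
    prec = 1.0 / var + len(batch) / obs_noise**2
    mu = (mu / var + batch.sum() / obs_noise**2) / prec
    var = 1.0 / prec
    posterior_vars.append(var)

# Posterior variance shrinks monotonically -- the contraction an auditor
# would track as evidence of reduced objective ambiguity.
assert all(a > b for a, b in zip(posterior_vars, posterior_vars[1:]))
print(f"posterior mean={mu:.3f}, posterior variance={var:.4f}")
```

In the full framework the inferred object is a distribution over reward functions rather than a scalar, but the same diagnostic applies: sequential evidence accumulation should drive posterior uncertainty down, and regions where it does not are exactly where the inferred objective should not be trusted.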