Explainable reinforcement learning from human feedback to improve alignment

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Despite RLHF fine-tuning, language models still generate suboptimal responses. Method: This paper proposes a causally interpretable alignment optimization paradigm. First, it enables interpretable error attribution via convex decomposition constraints in feature space to precisely identify training samples responsible for undesirable outputs. Second, it introduces an optimality-preserving selective forgetting mechanism that iteratively selects and “unlearns” these harmful samples through constrained combinatorial optimization. Contribution/Results: This work is the first to formally integrate the human cognitive strategy of “attribution–correction” into the RLHF framework. Experiments demonstrate improvements in response quality across multiple alignment benchmarks, without compromising generalization to unseen prompts, achieving a favorable trade-off among interpretability, effectiveness, and safety.

📝 Abstract
A common and effective strategy for humans to improve an unsatisfactory outcome in daily life is to find a cause of this outcome and correct the cause. In this paper, we investigate whether this human improvement strategy can be applied to improving reinforcement learning from human feedback (RLHF) for alignment of language models (LMs). In particular, it is observed in the literature that LMs tuned by RLHF can still output unsatisfactory responses. This paper proposes a method to improve the unsatisfactory responses by correcting their causes. Our method has two parts. The first part proposes a post-hoc explanation method to explain why an unsatisfactory response is generated to a prompt by identifying the training data that lead to this response. We formulate this problem as a constrained combinatorial optimization problem where the objective is to find a set of training data closest to this prompt-response pair in a feature representation space, and the constraint is that the prompt-response pair can be decomposed as a convex combination of this set of training data in the feature space. We propose an efficient iterative data selection algorithm to solve this problem. The second part proposes an unlearning method that improves unsatisfactory responses to some prompts by unlearning the training data that lead to these unsatisfactory responses and, meanwhile, does not significantly degrade satisfactory responses to other prompts. Experimental results demonstrate that our algorithm can improve RLHF.
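The explanation step in the abstract, finding a set of training data whose convex combination reconstructs the prompt-response pair in feature space, can be illustrated with a toy projected-gradient sketch: optimize weights on the probability simplex to approximate the query feature, then read off the largest weights as the identified samples. This is a minimal illustration under assumed details (feature matrix layout, step size, top-k selection), not the paper's actual iterative data selection algorithm.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {w : w >= 0, sum(w) = 1} (sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

def select_explaining_data(X, z, steps=500, lr=None, top_k=3):
    """Toy sketch: find convex-combination weights w minimizing
    0.5 * ||X.T @ w - z||^2 over the simplex via projected gradient,
    then return the indices of the largest weights as the
    'explaining' training examples. X is (n_examples, dim)."""
    if lr is None:
        # step size 1/L with L the largest eigenvalue of X X^T
        lr = 1.0 / np.linalg.norm(X, 2) ** 2
    n = X.shape[0]
    w = np.full(n, 1.0 / n)  # start from the uniform mixture
    for _ in range(steps):
        grad = X @ (X.T @ w - z)       # gradient w.r.t. w
        w = project_simplex(w - lr * grad)
    return np.argsort(w)[::-1][:top_k], w
```

On a small example where the query is exactly a convex combination of a few training features (e.g. `z = 0.5*X[0] + 0.3*X[1] + 0.2*X[2]`), the recovered top weights pick out those examples.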
Problem

Research questions and friction points this paper is trying to address.

Improves RLHF by correcting causes of unsatisfactory LM responses
Proposes post-hoc explanation to identify problematic training data
Uses unlearning method to enhance responses without degrading others
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-hoc explanation via constrained optimization for response causes
Unlearning training data to improve unsatisfactory model responses
Iterative data selection algorithm for efficient cause identification
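The unlearning idea above, removing the influence of identified harmful samples without degrading other responses, can be sketched as a toy objective: take gradient steps that increase loss on the forget set while decreasing loss on a retain set. The sketch below uses logistic regression and assumed hyperparameters (`forget_weight`, `lr`); the paper's method operates on RLHF-tuned language models, not this toy model.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logloss_grad(w, X, y):
    """Gradient of the mean logistic loss for labels y in {0, 1}."""
    p = sigmoid(X @ w)
    return X.T @ (p - y) / len(y)

def unlearn(w, X_forget, y_forget, X_retain, y_retain,
            steps=200, lr=0.5, forget_weight=1.0):
    """Toy selective forgetting: ascend the loss on the identified
    harmful (forget) examples while descending on the retain set,
    so behaviour on retained data is approximately preserved."""
    for _ in range(steps):
        g = (logloss_grad(w, X_retain, y_retain)
             - forget_weight * logloss_grad(w, X_forget, y_forget))
        w = w - lr * g
    return w
```

With a retain set labeled by the sign of the first feature and a single mislabeled forget example, the unlearned model keeps perfect retain accuracy while its prediction on the forget example moves away from the harmful label.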