AI Summary
This work addresses two fundamental challenges in RLHF training of large language models (LLMs): the unobservability of their implicit reward functions and the opacity of their decision-making processes. To this end, it presents the first systematic application of inverse reinforcement learning (IRL) to recover the implicit rewards of LLMs. The methodology integrates preference modeling, toxicity alignment analysis, reward model distillation, and cross-model transfer fine-tuning. The contributions are threefold: (1) an empirical demonstration that implicit rewards are non-unique and become less interpretable as model scale increases; (2) the identification of previously unrecognized preference biases embedded in the RLHF pipeline; and (3) the reconstruction of reward models achieving 85% accuracy on human preference prediction, with transferred models matching or surpassing baseline performance in toxicity control, thereby significantly enhancing alignment transparency and controllability.
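The preference-modeling step above can be illustrated with a minimal sketch. This is my own toy example, not the paper's code: a linear reward r(x) = w·x is fit to pairwise preferences with the Bradley-Terry loss, P(a ≻ b) = sigmoid(r(a) − r(b)). The feature vectors, the "ground-truth" reward weights, and all hyperparameters are hypothetical stand-ins for real LLM responses and human labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "responses" as 2-D feature vectors; a hidden reward plays the role of
# the LLM's implicit reward that IRL tries to recover.
true_w = np.array([2.0, -3.0])            # hypothetical ground-truth reward weights
X = rng.normal(size=(500, 2))             # candidate responses in feature space

# Preference pairs labeled by the hidden reward (stand-in for human labels).
i, j = rng.integers(0, 500, size=(2, 2000))
prefs = (X[i] @ true_w > X[j] @ true_w).astype(float)  # 1 if response i preferred

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit reward weights by gradient descent on the Bradley-Terry log-loss.
w = np.zeros(2)
lr = 0.1
for _ in range(300):
    margin = (X[i] - X[j]) @ w
    grad = ((sigmoid(margin) - prefs)[:, None] * (X[i] - X[j])).mean(axis=0)
    w -= lr * grad

# Rewards recovered from preferences are identifiable only up to scale and
# shift (the non-uniqueness the summary highlights), so compare directions.
cosine = float(w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w)))
print(round(cosine, 3))
```

Note that the recovered weights align with the hidden reward in direction but not magnitude, which is exactly the non-identifiability issue the paper reports for real LLM reward functions.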
Abstract
Large language models (LLMs) trained with Reinforcement Learning from Human Feedback (RLHF) have demonstrated remarkable capabilities, but their underlying reward functions and decision-making processes remain opaque. This paper introduces a novel approach to interpreting LLMs by applying inverse reinforcement learning (IRL) to recover their implicit reward functions. We conduct experiments on toxicity-aligned LLMs of varying sizes, extracting reward models that achieve up to 85% accuracy in predicting human preferences. Our analysis reveals key insights into the non-identifiability of reward functions, the relationship between model size and interpretability, and potential pitfalls in the RLHF process. We demonstrate that IRL-derived reward models can be used to fine-tune new LLMs, resulting in comparable or improved performance on toxicity benchmarks. This work provides a new lens for understanding and improving LLM alignment, with implications for the responsible development and deployment of these powerful systems.