🤖 AI Summary
This work addresses the interpretability bottleneck of reward models by enabling unbiased, quantitative estimation of the causal effects of high-level semantic attributes, such as sentiment, helpfulness, and complexity, on reward scores. Existing counterfactual attribution methods suffer from estimation bias because the counterfactual samples they rely on are imperfect. To overcome this, the authors propose RATE: a framework that uses two rounds of LLM-based rewriting to generate controllable counterfactuals and draws on doubly robust estimation theory to construct a correction mechanism. RATE achieves unbiased attribute-level causal sensitivity estimation without requiring perfect counterfactuals, and the authors prove that it eliminates the systematic bias induced by rewriting distortion. Experiments across multiple reward models and datasets demonstrate that RATE significantly outperforms single-rewrite baselines and state-of-the-art attribution methods, reducing causal-effect estimation error by over 40% while maintaining robustness and interpretability.
📝 Abstract
Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs. However, reward models are black boxes, and it is often unclear what, exactly, they are actually rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE), an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the causal effect of an attribute on the reward, not merely its correlation with it. RATE uses LLMs to rewrite responses, producing imperfect counterfactual examples from which causal effects can be estimated. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting twice. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
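The double-rewrite idea can be sketched with a toy example. Everything below is hypothetical (a stand-in `reward` function and a stand-in `rewrite` function play the roles of a real reward model and an LLM rewriter): the naive estimator compares an original response against a single rewrite, so any off-target distortion introduced by the rewriter leaks into the estimate; RATE instead compares a rewrite-of-a-rewrite against a single rewrite, so both texts carry the rewriter's distortion and it cancels in the difference.

```python
# Toy sketch of the double-rewrite estimator. The target attribute here is
# "ends with an exclamation mark"; capitalization is an off-target attribute
# that the rewriter distorts. All functions are illustrative stand-ins.

def reward(text: str) -> float:
    # Toy reward model: rewards the target attribute ("!") but is also
    # sensitive to an off-target attribute (capitalization).
    return 2.0 * text.count("!") + 1.0 * sum(c.isupper() for c in text)

def rewrite(text: str, attribute_on: bool) -> str:
    # Toy rewriter: toggles the attribute, but also lowercases the text
    # (an off-target distortion). The distortion is idempotent: rewriting
    # an already-rewritten text changes nothing else.
    base = text.lower().rstrip("!.")
    return base + ("!" if attribute_on else ".")

def naive_effect(samples_with_attribute):
    # Naive estimator: original vs. single rewrite. Biased, because the
    # rewrite also strips capitalization, and that difference is wrongly
    # attributed to the target attribute.
    return sum(
        reward(x) - reward(rewrite(x, attribute_on=False))
        for x in samples_with_attribute
    ) / len(samples_with_attribute)

def rate_effect(samples_with_attribute):
    # RATE-style estimator: rewrite-of-rewrite vs. single rewrite. Both
    # texts have passed through the rewriter, so the off-target distortion
    # cancels in the difference.
    total = 0.0
    for x in samples_with_attribute:
        x_off = rewrite(x, attribute_on=False)    # remove the attribute
        x_on = rewrite(x_off, attribute_on=True)  # rewrite the rewrite
        total += reward(x_on) - reward(x_off)
    return total / len(samples_with_attribute)

samples = ["Great Job!", "Nice Work!"]
print(naive_effect(samples))  # inflated by the capitalization artifact
print(rate_effect(samples))   # isolates the effect of "!" alone
```

In this toy setup the true effect of the exclamation mark on the reward is 2.0; the naive estimator reports a larger number because the lowercasing artifact is folded in, while the double-rewrite comparison recovers 2.0.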