Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

📅 2026-02-10
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge of scalable supervision for open-ended tasks, such as hallucination reduction, by introducing RLFR (Reinforcement Learning from Feature Rewards), a framework that uses interpretable internal features of a language model directly as reward signals in reinforcement learning. By combining probe-based detection of candidate hallucinated claims, feature-driven rewards, and feature-guided test-time compute, RLFR yields a policy on Gemma-3-12B-IT that is 58% less likely to hallucinate than the original model (when run together with the probing harness), while preserving performance on standard benchmarks. This approach establishes a paradigm for feature-based supervision, supporting efficient training and inference for open-ended behaviors without relying on external oracles or human annotations.
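
The summary does not say how the reward features guide test-time compute. One common pattern is best-of-N selection: sample several completions and keep the one the reward scores highest. The sketch below illustrates that pattern only; `best_of_n`, the candidate strings, and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (hypothetical helper)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)


# Toy stand-in for a feature-based reward: higher means fewer flagged claims.
# In a real system this score would come from a probe on model activations.
toy_scores = {"draft A": -0.9, "draft B": -0.2, "draft C": -0.5}

chosen = best_of_n(list(toy_scores), lambda c: toy_scores[c])
```

Sampling more candidates spends more compute at inference time in exchange for a better chance that one of them scores well under the reward.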


📝 Abstract
Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider the case of hallucination-reduction as a desirable, yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process operationalized on Gemma-3-12B-IT results in a policy that is 58% less likely to hallucinate compared to the original model (when run in tandem with our probing harness), while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm in the use of interpretability for learning open-ended tasks.
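
To make "features as reward functions" concrete, here is a minimal sketch assuming the reward feature is a linear probe applied to per-claim activations. The abstract does not give the probe's form or the reward formula, so the sigmoid probe, the mean aggregation, and all names here (`probe_score`, `feature_reward`, `w`, `b`) are illustrative assumptions.

```python
import numpy as np


def probe_score(activation: np.ndarray, w: np.ndarray, b: float) -> float:
    """Assumed linear probe: sigmoid(w . activation + b), read as a hallucination score."""
    return float(1.0 / (1.0 + np.exp(-(activation @ w + b))))


def feature_reward(claim_activations, w: np.ndarray, b: float) -> float:
    """Assumed reward: negative mean probe score over the claims in a completion."""
    if len(claim_activations) == 0:
        return 0.0  # no claims, nothing to penalize
    scores = [probe_score(a, w, b) for a in claim_activations]
    return -float(np.mean(scores))


# Synthetic data only: activations aligned with or against the probe direction.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
b = 0.0
low_risk = [-2.0 * w, -1.5 * w]   # probe score below 0.5
high_risk = [2.0 * w, 1.5 * w]    # probe score above 0.5

r_low = feature_reward(low_risk, w, b)
r_high = feature_reward(high_risk, w, b)
```

Under these assumptions, a completion whose claims the probe flags less strongly receives the higher reward, which is the signal an RL algorithm would then optimize.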
Problem

Research questions and friction points this paper is trying to address.

hallucination reduction
open-ended tasks
scalable supervision
feature-based rewards
language model interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

feature-based rewards
reinforcement learning
hallucination reduction
interpretability
scalable supervision