Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work tackles scalable supervision for open-ended tasks, such as hallucination reduction, by introducing RLFR, a framework that uses interpretable internal features of language models directly as reward signals for reinforcement learning, enabling end-to-end training. By combining probe-based hallucination detection, feature-driven rewards, and feature-guided test-time compute, RLFR reduces hallucination rates by 58% on Gemma-3-12B-IT while preserving performance on standard benchmarks. The approach establishes a feature-based supervision paradigm for training and inference on open-ended behaviors without relying on external oracles or human annotations.

📝 Abstract
Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider the case of hallucination-reduction as a desirable, yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process operationalized on Gemma-3-12B-IT results in a policy that is 58% less likely to hallucinate compared to the original model (when run in tandem with our probing harness), while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm in the use of interpretability for learning open-ended tasks.
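The abstract describes turning a factuality probe over internal features into an RL reward. A minimal sketch of that idea, assuming a pre-trained linear hallucination probe applied to per-claim hidden activations; the names, dimensions, and random probe weights below are illustrative stand-ins, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear probe trained to flag hallucinated claims from
# hidden activations (random weights here, purely for illustration).
HIDDEN_DIM = 16
probe_w = rng.normal(size=HIDDEN_DIM)
probe_b = 0.0

def hallucination_prob(activation: np.ndarray) -> float:
    """Probe score in (0, 1): estimated probability a claim is hallucinated."""
    logit = float(activation @ probe_w + probe_b)
    return 1.0 / (1.0 + np.exp(-logit))

def feature_reward(claim_activations: list[np.ndarray]) -> float:
    """Scalar RL reward for one completion.

    Each extracted claim contributes (1 - p_hallucinated); the mean over
    claims is the reward. An empty completion gets zero reward.
    """
    if not claim_activations:
        return 0.0
    return float(np.mean([1.0 - hallucination_prob(a)
                          for a in claim_activations]))

# Toy usage: a completion represented as per-claim activation vectors.
completion = [rng.normal(size=HIDDEN_DIM) for _ in range(3)]
print(feature_reward(completion))
```

The same probe score could also drive test-time compute, e.g. resampling or revising a completion whenever its reward falls below a confidence threshold, which matches the abstract's "intervene and correct" behavior.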
Problem

Research questions and friction points this paper is trying to address.

hallucination reduction
open-ended tasks
scalable supervision
feature-based rewards
language model interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

feature-based rewards
reinforcement learning
hallucination reduction
interpretability
scalable supervision