How's it going? Reinforcement learning in language models recruits a functional welfare axis

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates how reinforcement learning (RL) draws on internal representations that language models already have. The authors train several language models in a semantically neutral maze environment, extract concept vectors for rewarded and punished trajectories, and test those vectors in settings unrelated to the maze. The results suggest that RL recruits a pre-existing “functional welfare axis” rather than building a new representation. The punishment vector promotes failure and impossibility tokens, aligns with negative emotion concepts, and, when used for steering, induces negative self-reports, pathological backtracking, refusal, and uncertainty. The reward vector behaves as its mirror image, and the two are nearly antiparallel. Both vectors are effective in models that have not undergone maze training, and the effects also appear in pretrain-only models. The effects hold under controls for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full fine-tuning, and they largely persist when RL is replaced with supervised fine-tuning. The authors make no claims about any experience of welfare.
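To make the "concept vector" and "inverse direction" ideas concrete, here is a minimal sketch using a difference-of-means estimate. This is an assumption for illustration: the listing does not say which extraction method the authors use. The activations below are synthetic random numbers, and the names `concept_vector`, `sample`, and `axis` are hypothetical.

```python
import math
import random

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def concept_vector(condition_acts, baseline_acts):
    """Difference of means: mean(condition) - mean(baseline).

    Assumed extraction method, not necessarily the one used in the paper.
    """
    cond = mean_vector(condition_acts)
    base = mean_vector(baseline_acts)
    return [c - b for c, b in zip(cond, base)]

# Synthetic stand-ins for hidden states; no real model is involved.
rng = random.Random(0)
DIM = 16
axis = [1.0] + [0.0] * (DIM - 1)  # hypothetical shared direction

def sample(shift, n=200):
    """Gaussian noise plus a shift along the hypothetical axis."""
    return [[rng.gauss(0.0, 1.0) + shift * a for a in axis] for _ in range(n)]

neutral = sample(0.0)
rewarded = sample(+3.0)   # rewarded trajectories pushed one way
punished = sample(-3.0)   # punished trajectories pushed the other way

reward_vec = concept_vector(rewarded, neutral)
punish_vec = concept_vector(punished, neutral)
similarity = cosine(reward_vec, punish_vec)
print(f"cosine(reward, punishment) = {similarity:.3f}")
```

In this toy setup the two vectors come out close to antiparallel only because the synthetic data were built that way. The sketch shows how such a claim could be measured, not that it holds in any real model.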

📝 Abstract

How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, it negatively tracks goal-achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full fine-tuning, and largely persist when we replace RL with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with observations that the effects also appear in pretrain-only models, we therefore argue that this functional welfare axis pre-exists post-training: it is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis offers a demonstration that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.
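The abstract's two readouts, "promotes failure and impossibility tokens" and "steering with it", can be illustrated with a toy sketch. Everything here is hypothetical: the three-dimensional hidden state, the three-token vocabulary, the unembedding rows, and the steering scale `alpha` are made-up values, and the abstract does not specify the layer or scale the authors use.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to a hidden state."""
    norm = math.sqrt(dot(direction, direction))
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

def logits(hidden, unembed):
    """Project a hidden state onto each token's unembedding row."""
    return {tok: dot(hidden, row) for tok, row in unembed.items()}

# Hypothetical punishment direction and toy unembedding matrix.
punish_vec = [-1.0, 0.0, 0.0]
unembed = {
    "success":    [1.0, 0.2, 0.0],
    "impossible": [-1.0, 0.1, 0.0],
    "the":        [0.0, 0.0, 1.0],
}
hidden = [0.5, 0.3, 0.4]  # made-up hidden state

# Readout 1: which tokens does the vector itself promote?
promoted = logits(punish_vec, unembed)

# Readout 2: how do next-token logits change under steering?
before = logits(hidden, unembed)
after = logits(steer(hidden, punish_vec, alpha=2.0), unembed)

top_before = max(before, key=before.get)
top_after = max(after, key=after.get)
print("promoted by vector:", max(promoted, key=promoted.get))
print("top token before steering:", top_before)
print("top token after steering:", top_after)
```

In this toy, adding the punishment direction moves the top token from "success" to "impossible" and leaves the unrelated token "the" unchanged, because the unembedding rows were chosen to make that happen. It illustrates the mechanics of steering and token projection only.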

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

language models

functional welfare

internal representations

pre-existing representations

Innovation

Methods, ideas, or system contributions that make the work stand out.

functional welfare axis

reinforcement learning

concept vectors