🤖 AI Summary
This study investigates the probabilistic foundations of inference-based decision-making in reinforcement learning, aiming to clarify the relationship between reward signals and Bayesian posterior updates. Focusing on KL-regularized soft updates, the work shows that under specific conditions such updates can be rigorously interpreted as Bayesian posteriors within a single fixed probabilistic model, in which behavioral change is driven entirely by observed evidence. Combining Bayesian inference, information-theoretic channel modeling, and probabilistic graphical models, the analysis establishes that posterior updates uniquely determine only the relative, context-dependent incentive signal that shifts behavior, not absolute rewards, while imposing a consistency constraint on continuation values across update directions. Key contributions include an identifiability result for reward representations, a characterization of the limits on how posterior updates can influence behavior, and coherence conditions that reward descriptions must satisfy when updates are applied in different conditioning orders.
📝 Abstract
Many ideas in modern control and reinforcement learning treat decision-making as inference: start from a baseline distribution and update it when a signal arrives. We ask when this can be made literal rather than metaphorical. We study the special case where a KL-regularized soft update is exactly a Bayesian posterior inside a single fixed probabilistic model, so the update variable is a genuine channel through which information is transmitted. In this regime, behavioral change is driven only by evidence carried by that channel: the update must be explainable as an evidence reweighting of the baseline. This yields a sharp identification result: posterior updates determine the relative, context-dependent incentive signal that shifts behavior, but they do not uniquely determine absolute rewards, which are fixed only up to context-specific baselines. Requiring a single reusable continuation value across different update directions adds a further coherence constraint linking the reward descriptions associated with different conditioning orders.
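A minimal sketch of the soft-update-as-posterior correspondence, in our own notation (the action variable a, baseline π₀, reward r, temperature β, and evidence variable O are illustrative choices, not taken from the paper):

```latex
% The KL-regularized soft update: maximize expected reward minus a KL penalty
% to the baseline pi_0; the optimizer is an exponentially tilted baseline.
\[
\pi^\ast(a)
  \;=\; \arg\max_{\pi}\Big( \mathbb{E}_{\pi}[r(a)] - \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_0\big) \Big)
  \;=\; \frac{\pi_0(a)\,\exp\!\big(r(a)/\beta\big)}{\sum_{a'} \pi_0(a')\,\exp\!\big(r(a')/\beta\big)}.
\]
% Reading exp(r(a)/beta) as proportional to a likelihood p(O = 1 | a) for a
% binary evidence variable O makes the same update a Bayesian posterior:
\[
\pi^\ast(a) \;=\; p(a \mid O = 1) \;\propto\; \pi_0(a)\, p(O = 1 \mid a).
\]
% Shift invariance: adding a context-specific baseline c to the reward only
% multiplies the likelihood by e^{c/\beta}, which cancels in normalization.
\[
r(a) \;\mapsto\; r(a) + c
\quad\Longrightarrow\quad
\pi^\ast \text{ unchanged.}
\]
```

The last display is the shift invariance behind the identification claim: a constant added to every reward in a given context is absorbed by normalization, so the posterior alone can recover only relative, context-dependent incentives, not absolute reward levels.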