🤖 AI Summary
This study investigates the probabilistic foundations of inference-based decision-making in reinforcement learning, aiming to clarify the relationship between reward signals and Bayesian posterior updates. Focusing on KL-regularized soft updates, the work shows that under specific conditions such updates can be rigorously interpreted as Bayesian posteriors within a single fixed probabilistic model, in which behavioral change is driven entirely by observed evidence. Combining Bayesian inference, information-theoretic channel modeling, and probabilistic graphical models, the analysis establishes that posterior updates uniquely determine only the relative, context-dependent incentive signal that shifts behavior, not absolute rewards, while imposing a consistency constraint on continuation values across update directions. Key contributions include an identifiability result for reward representations, a characterization of the limits on how posterior updates can influence behavior, and coherence conditions that reward descriptions must satisfy when updates are applied in different conditioning orders.
📝 Abstract
Many ideas in modern control and reinforcement learning treat decision-making as inference: start from a baseline distribution and update it when a signal arrives. We ask when this can be made literal rather than metaphorical. We study the special case where a KL-regularized soft update is exactly a Bayesian posterior inside a single fixed probabilistic model, so the update variable is a genuine channel through which information is transmitted. In this regime, behavioral change is driven only by evidence carried by that channel: the update must be explainable as an evidence reweighting of the baseline. This yields a sharp identification result: posterior updates determine the relative, context-dependent incentive signal that shifts behavior, but they do not uniquely determine absolute rewards, which are fixed only up to context-specific baselines. Requiring a single reusable continuation value across different update directions adds a further coherence constraint linking the reward descriptions associated with different conditioning orders.
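A minimal sketch of the soft-update-as-posterior correspondence, in our own notation (the action variable a, baseline π₀, reward r, temperature β, and evidence variable O are illustrative choices, not taken from the paper):

```latex
% The KL-regularized soft update: maximize expected reward minus a KL penalty
% to the baseline pi_0; the optimizer is an exponentially tilted baseline.
\[
\pi^\ast(a)
  \;=\; \arg\max_{\pi}\Big( \mathbb{E}_{\pi}[r(a)] - \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_0\big) \Big)
  \;=\; \frac{\pi_0(a)\,\exp\!\big(r(a)/\beta\big)}{\sum_{a'} \pi_0(a')\,\exp\!\big(r(a')/\beta\big)}.
\]
% Reading exp(r(a)/beta) as proportional to a likelihood p(O = 1 | a) for a
% binary evidence variable O makes the same update a Bayesian posterior:
\[
\pi^\ast(a) \;=\; p(a \mid O = 1) \;\propto\; \pi_0(a)\, p(O = 1 \mid a).
\]
% Shift invariance: adding a context-specific baseline c to the reward only
% multiplies the likelihood by e^{c/\beta}, which cancels in normalization.
\[
r(a) \;\mapsto\; r(a) + c
\quad\Longrightarrow\quad
\pi^\ast \text{ unchanged.}
\]
```

The last display is the shift invariance behind the identification claim: a constant added to every reward in a given context is absorbed by normalization, so the posterior alone can recover only relative, context-dependent incentives, not absolute reward levels.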