🤖 AI Summary
This work addresses the challenge of achieving stable behavior in infinite-horizon Markov decision processes when agents operate under misspecified internal model classes. To this end, the authors propose a novel approach that integrates entropy regularization with bilevel optimization. The method establishes a unique soft Bellman fixed point, guaranteeing well-defined and smooth policy updates, and characterizes the Berk–Nash equilibrium as a pair of coupled linear programs. An exploration mechanism based on the EXP3 algorithm, combined with adaptive zooming of the conjecture set, jointly optimizes model selection and policy learning. Both theoretical analysis and numerical experiments demonstrate that the proposed framework effectively balances exploration and exploitation, converges to the KL-divergence-minimizing model, and achieves a sublinear regret bound.
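For concreteness, here is a minimal sketch of the entropy-regularized (soft) Bellman equation that underlies the unique fixed point. The notation is standard but assumed, not taken from the paper: states s, actions a, reward r(s,a), discount gamma in (0,1), temperature tau > 0, and a subjective transition model Q_theta.

```latex
% Soft Bellman equation under a subjective model Q_theta (notation assumed).
% The log-sum-exp smooths the max over actions, and the operator is a
% gamma-contraction, which gives a unique fixed point V_theta.
V_\theta(s) = \tau \log \sum_{a} \exp\!\Big( \tfrac{1}{\tau} \Big[ r(s,a)
    + \gamma \, \mathbb{E}_{s' \sim Q_\theta(\cdot \mid s,a)} V_\theta(s') \Big] \Big)
```

The induced softmax policy, proportional to the exponentiated bracketed term, then varies smoothly in the model parameter, which is what replaces the non-smooth best-response correspondence.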
📝 Abstract
We study sequential decision-making when the agent's internal model class is misspecified. Within the infinite-horizon Berk–Nash framework, stable behavior arises as a fixed point: the agent acts optimally relative to a subjective model, while that model is statistically consistent with the long-run data endogenously generated by the policy itself. We provide a rigorous characterization of this equilibrium via coupled linear programs and a bilevel optimization formulation. To address the intrinsic non-smoothness of standard best-response correspondences, we introduce entropy regularization, establishing the existence of a unique soft Bellman fixed point and a smooth objective. Exploiting this regularity, we develop an online learning scheme that casts model selection as an adversarial bandit problem using an EXP3-type update, augmented by a novel conjecture-set zooming mechanism that adaptively refines the parameter space. Numerical results demonstrate effective exploration-exploitation trade-offs, convergence to the KL-minimizing model, and sublinear regret.
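As a concrete illustration of the online scheme, below is a minimal Python sketch of an EXP3-type bandit over a finite conjecture set with periodic zooming. Everything specific here (the one-dimensional parameter grid, losses assumed in [0, 1], the refinement schedule, and the restart of weights after each zoom) is an illustrative assumption, not the paper's exact construction.

```python
import numpy as np

def exp3_with_zooming(theta_grid, loss_fn, T=2000, eta=0.1, explore=0.05,
                      zoom_every=500, zoom_factor=0.5, seed=0):
    """EXP3-type adversarial-bandit selection over a finite grid of candidate
    models (the conjecture set), with periodic zooming that refines the grid
    around the currently best-weighted model. Illustrative sketch only:
    loss_fn(theta) must return a bandit loss in [0, 1].
    """
    rng = np.random.default_rng(seed)
    theta_grid = np.asarray(theta_grid, dtype=float)
    K = len(theta_grid)
    log_w = np.zeros(K)                      # log-weights, for stability
    width = theta_grid[-1] - theta_grid[0]   # current extent of the grid

    for t in range(1, T + 1):
        w = np.exp(log_w - log_w.max())
        p = (1 - explore) * w / w.sum() + explore / K  # EXP3 exploration mix
        k = rng.choice(K, p=p)                         # sample a candidate model
        loss = loss_fn(theta_grid[k])                  # bandit feedback in [0, 1]
        log_w[k] -= eta * loss / p[k]                  # importance-weighted update

        if t % zoom_every == 0:                        # conjecture-set zooming:
            center = theta_grid[np.argmax(log_w)]      # refine around the leader
            width *= zoom_factor
            theta_grid = np.linspace(center - width / 2, center + width / 2, K)
            log_w[:] = 0.0                             # restart on the new grid

    return theta_grid[np.argmax(log_w)]
```

With a loss such as a per-round estimate of the KL divergence between observed transitions and a candidate model's predictions (rescaled to [0, 1]), the leader of the final grid would serve as the KL-minimizing candidate; the paper's sublinear-regret guarantee concerns the full scheme, not this simplification.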