A Mathematical Programming Approach to Computing and Learning Berk-Nash Equilibria in Infinite-Horizon MDPs

Date: 2026-03-13
AI Summary
This work addresses the challenge of achieving stable behavior in infinite-horizon Markov decision processes when agents operate under misspecified internal model classes. To this end, the authors propose a novel approach that integrates entropy regularization with bilevel optimization. The method establishes a soft Bellman fixed point to guarantee uniqueness and smoothness of policy updates and characterizes the Berk–Nash equilibrium as a coupled linear program. An exploration mechanism based on the EXP3 algorithm, combined with adaptive scaling of the conjecture set, is designed to jointly optimize model selection and policy learning. Both theoretical analysis and numerical experiments demonstrate that the proposed framework effectively balances exploration and exploitation, converges to the KL-divergence-minimizing model, and achieves a sublinear regret bound.
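The soft Bellman fixed point mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a small tabular MDP, and the names `soft_backup`, `soft_bellman_fixed_point`, `P`, `r`, `gamma`, and `tau` are hypothetical. The backup replaces the hard maximum over actions with a temperature-scaled log-sum-exp, which is a contraction with modulus `gamma` in the sup norm, so plain iteration reaches the unique fixed point and the induced policy is a smooth softmax.

```python
import math

def soft_backup(P, r, V, gamma, tau):
    """One entropy-regularized Bellman backup; returns (Q, V_new).

    P[s][a] is a list of next-state probabilities under an assumed
    subjective model, r[s][a] is the reward (both hypothetical inputs).
    """
    Q = [[r[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
          for a in range(len(r[s]))]
         for s in range(len(r))]
    V_new = []
    for q_row in Q:
        m = max(q_row)  # shift by the max for a stable log-sum-exp
        V_new.append(m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_row)))
    return Q, V_new

def soft_bellman_fixed_point(P, r, gamma=0.9, tau=0.5, tol=1e-10, max_iter=10_000):
    """Iterate the soft backup until successive values differ by less than tol."""
    V = [0.0] * len(r)
    for _ in range(max_iter):
        _, V_new = soft_backup(P, r, V, gamma, tau)
        gap = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if gap < tol:
            break
    # The regularized policy is a softmax of Q / tau at the fixed point.
    Q, _ = soft_backup(P, r, V, gamma, tau)
    policy = []
    for q_row in Q:
        m = max(q_row)
        e = [math.exp((q - m) / tau) for q in q_row]
        z = sum(e)
        policy.append([x / z for x in e])
    return V, policy
```

Lowering `tau` pushes the softmax policy toward the greedy best response, at the cost of the smoothness that regularization provides.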

πŸ“ Abstract
We study sequential decision-making when the agent's internal model class is misspecified. Within the infinite-horizon Berk-Nash framework, stable behavior arises as a fixed point: the agent acts optimally relative to a subjective model, while that model is statistically consistent with the long-run data endogenously generated by the policy itself. We provide a rigorous characterization of this equilibrium via coupled linear programs and a bilevel optimization formulation. To address the intrinsic non-smoothness of standard best-response correspondences, we introduce entropy regularization, establishing the existence of a unique soft Bellman fixed point and a smooth objective. Exploiting this regularity, we develop an online learning scheme that casts model selection as an adversarial bandit problem using an EXP3-type update, augmented by a novel conjecture-set zooming mechanism that adaptively refines the parameter space. Numerical results demonstrate effective exploration-exploitation trade-offs, convergence to the KL-minimizing model, and sublinear regret.
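To make the adversarial-bandit view of model selection concrete, here is a minimal sketch of the textbook EXP3 algorithm, in which each arm stands for one candidate model. It is not the paper's EXP3-type update and omits the conjecture-set zooming mechanism. The names `exp3` and `reward_fn` are hypothetical, and `reward_fn` is an assumed stand-in that returns a score in [0, 1] for how well the chosen model fits the observed data.

```python
import math
import random

def exp3(reward_fn, n_arms, horizon, gamma=0.1, seed=0):
    """Textbook EXP3 over n_arms candidates; returns (last probs, pull counts).

    reward_fn(arm, rng) must return a reward in [0, 1] (assumed interface).
    """
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    counts = [0] * n_arms
    probs = [1.0 / n_arms] * n_arms
    for _ in range(horizon):
        total = sum(weights)
        # Mix the exponential weights with uniform exploration.
        probs = [(1.0 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm, rng)
        # Importance-weighted estimate: only the pulled arm is updated.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescale so the weights stay bounded; probabilities are unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
        counts[arm] += 1
    return probs, counts
```

For example, `exp3(lambda arm, rng: 1.0 if arm == 2 else 0.0, n_arms=3, horizon=2000)` concentrates its pulls on arm 2 while still sampling every arm with probability at least `gamma / n_arms`.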
Problem

Research questions and friction points this paper is trying to address.

misspecified models
Berk-Nash equilibrium
infinite-horizon MDPs
sequential decision-making
model consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Berk-Nash equilibrium
entropy regularization
bilevel optimization
adversarial bandits
conjecture-set zooming
Quanyan Zhu
Department of Electrical and Computer Engineering, New York University
AI, Game and Control Theory, Security and Resilience, Autonomy, Cyber-Physical Systems
Zhengye Han
Department of Electrical and Computer Engineering, New York University, Brooklyn, NY 11201, USA