🤖 AI Summary
This paper identifies miscalibration in the loss functions used by mainstream value-aware model-based agents (e.g., the MuZero loss): as normally used, these are uncalibrated surrogate losses that introduce systematic bias into the joint optimization of the dynamics model and value function. Method: The authors first establish the theoretical root cause of this miscalibration, then propose a provably calibrated loss-correction framework. They further analyze the interplay between loss calibration, latent model architecture, and the auxiliary losses commonly used when training MuZero-style agents, evaluating MuZero architecture variants empirically. Contribution/Results: Empirical results show that calibrated models significantly improve value-estimation accuracy and policy performance; notably, while deterministic models can suffice for accurate value prediction, learning calibrated stochastic models remains advantageous. The work unifies environment modeling and value-learning objectives both theoretically and experimentally, offering a provably calibrated framework for value-aware model-based reinforcement learning.
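For orientation, a common form of the value-aware model-learning objective from the prior VAML literature can be sketched as follows. The notation here is assumed for illustration and not taken from this paper: $\hat{P}$ is the learned model, $P$ the true dynamics, $V$ a value function, and $\mu$ a state-action distribution; MuZero-style losses replace the true expectation with sampled value targets.

```latex
% Value-aware model loss (VAML-style sketch, assumed notation):
% the model is penalized for mismatch in predicted expected values,
% not for mismatch in the transition distribution itself.
\mathcal{L}_{\mathrm{VAML}}(\hat{P}) =
  \mathbb{E}_{(s,a)\sim\mu}\!\left[
    \Big( \mathbb{E}_{s'\sim \hat{P}(\cdot\mid s,a)}\big[V(s')\big]
        - \mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V(s')\big] \Big)^{2}
  \right]
```

Because the penalty is applied only through $V$, many different models can achieve the same loss, which is one intuition for why such objectives can fail to recover the correct model without further calibration.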
📝 Abstract
The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes the discrepancy between a model's value prediction and the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
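The abstract's final point, that a deterministic model can be sufficient for accurate value prediction even when it does not match the true stochastic dynamics, can be illustrated with a toy example. Everything below (the two-outcome transition, the quadratic value function) is a hypothetical construction for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stochastic environment: from a fixed state, the next state is
# s' = +1 or s' = -1 with equal probability (mean next state is 0).
next_states = rng.choice([-1.0, 1.0], size=100_000)

# A nonlinear value function over states (hypothetical choice).
def value(s):
    return s ** 2

# True expected next-state value under the stochastic dynamics:
# value(+1) = value(-1) = 1, so the expectation is exactly 1.
true_expected_value = value(next_states).mean()

# A deterministic model trained to match the dynamics in state space
# would predict the MEAN next state (s' = 0), getting the value wrong:
mean_model_value = value(0.0)  # 0.0, not 1.0

# But a deterministic model IS sufficient for value accuracy: predicting
# s' = +1 (or -1) yields value(1) = 1 = E[value(s')], even though that
# single point is not the true (stochastic) transition distribution.
det_model_value = value(1.0)

print(true_expected_value, mean_model_value, det_model_value)  # → 1.0 0.0 1.0
```

This separates two failure modes: matching states (where the deterministic mean is wrong for values) versus matching values (where a well-placed deterministic prediction suffices), which is the distinction the calibration analysis in the paper is concerned with.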