🤖 AI Summary
This work investigates how the discount factor shapes the planning horizon and the bias–variance trade-off in partially observable Markov decision processes (POMDPs). Whereas conventional reinforcement learning practice favors long-horizon planning via discount factors close to 1, the analysis shows that smaller discount factors, which induce shorter planning horizons, can reduce the policy evaluation bias introduced by partial observability while also lowering estimation variance. Methodologically, the work combines structural analysis of the underlying Markov Decision Process, a bias–variance decomposition, and an entropy-based characterization of observation informativeness to relate the discount factor to the degree of observability. Experiments on standard POMDP benchmarks show improved policy robustness and sample efficiency, and the analysis offers a theoretical justification for the advantages of short-horizon planning in partially observable environments.
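For concreteness, the discounted objective and the effective planning horizon it induces can be written as below. This is the standard formulation the summary refers to; the notation $J(\pi)$, $\gamma$, and $H_\gamma$ is ours, not taken from the paper.

```latex
% Discounted cumulative reward objective; \gamma \in [0, 1) is the
% discount factor and r_t the reward at step t. A smaller \gamma
% yields a shorter effective planning horizon H_\gamma.
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_t\right],
\qquad
H_{\gamma} \approx \frac{1}{1-\gamma}.
```

For example, $\gamma = 0.99$ corresponds to an effective horizon of roughly 100 steps, while $\gamma = 0.9$ corresponds to roughly 10.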
📝 Abstract
Formulating a real-world problem under the Reinforcement Learning framework involves non-trivial design choices, such as selecting the discount factor for the learning objective (discounted cumulative reward), which determines the planning horizon of the agent. This work investigates the impact of the discount factor on the bias–variance trade-off given structural parameters of the underlying Markov Decision Process. Our results support the idea that a shorter planning horizon might be beneficial, especially under partial observability.
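The variance side of this trade-off can be seen in a toy simulation. The sketch below is illustrative only, assuming a chain with i.i.d. noisy rewards rather than any benchmark from the paper; `mc_return`, the rollout horizon, and the chosen discount factors are hypothetical.

```python
import numpy as np

# Toy illustration (not the paper's experiments): with i.i.d. unit-mean
# rewards, the spread of Monte Carlo estimates of the discounted return
# grows as gamma approaches 1, i.e. as the effective horizon 1/(1 - gamma)
# lengthens.

rng = np.random.default_rng(0)

def mc_return(gamma: float, horizon: int = 200) -> float:
    """One Monte Carlo rollout: discounted sum of noisy rewards, truncated at `horizon`."""
    rewards = 1.0 + rng.normal(0.0, 1.0, size=horizon)
    discounts = gamma ** np.arange(horizon)
    return float(discounts @ rewards)

for gamma in (0.5, 0.9, 0.99):
    returns = [mc_return(gamma) for _ in range(5000)]
    print(f"gamma={gamma:4.2f}  effective horizon~{1 / (1 - gamma):6.1f}  "
          f"mean={np.mean(returns):7.2f}  std={np.std(returns):6.2f}")
```

Running this prints a standard deviation that rises with gamma, matching the intuition that longer horizons accumulate more reward noise into each return estimate.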