From Relative Entropy to Minimax: A Unified Framework for Coverage in MDPs

📅 2026-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses targeted exploration in reward-free Markov decision processes, where exploration must be deliberately steered toward state-action pairs of varying importance or difficulty. The authors propose a parametric family of concave coverage objectives, denoted $U_\rho$, defined directly on the state-action occupancy measure, which unifies several existing objectives (relative entropy, weighted average, and minimax coverage) as special cases. Tuning the parameter $\rho$ continuously adjusts the exploration preference, recovering worst-case coverage behavior in the limiting regime. Leveraging concave optimization over occupancy measures and gradient-based policy updates, the proposed algorithm steers exploration toward desired coverage patterns, emphasizing under-explored regions more strongly as $\rho$ grows, while offering both theoretical guarantees and practical controllability.
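
The summary does not reproduce the definition of $U_\rho$. One weighted power-utility family consistent with the listed special cases (an illustrative assumption, not necessarily the paper's exact definition) is

$$U_\rho(d) \;=\; \sum_{s,a} w_{s,a}\,\frac{d(s,a)^{1-\rho}-1}{1-\rho}, \qquad \rho \ge 0,\ \rho \neq 1,$$

which is concave in the occupancy $d$, reduces to weighted average coverage at $\rho = 0$, tends to $\sum_{s,a} w_{s,a}\log d(s,a)$ (equivalently, minimizing the relative entropy $\mathrm{KL}(w\,\|\,d)$) as $\rho \to 1$, and has the closed-form gradient $w_{s,a}\, d(s,a)^{-\rho}$, which concentrates on the least-visited pairs and recovers minimax coverage as $\rho \to \infty$.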

📝 Abstract
Targeted and deliberate exploration of state-action pairs is essential in reward-free Markov Decision Processes (MDPs). More precisely, different state-action pairs exhibit different degrees of importance or difficulty, which must be actively and explicitly built into a controlled exploration strategy. To this end, we propose a weighted and parameterized family of concave coverage objectives, denoted by $U_\rho$, defined directly over state-action occupancy measures. This family unifies several widely studied objectives within a single framework, including divergence-based marginal matching, weighted average coverage, and worst-case (minimax) coverage. While the concavity of $U_\rho$ captures the diminishing returns associated with over-exploration, the simple closed form of the gradient of $U_\rho$ enables explicit control to prioritize under-explored state-action pairs. Leveraging this structure, we develop a gradient-based algorithm that actively steers the induced occupancy toward a desired coverage pattern. Moreover, we show that as $\rho$ increases, the resulting exploration strategy increasingly emphasizes the least-explored state-action pairs, recovering worst-case coverage behavior in the limit.
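
The abstract describes a gradient-based algorithm over occupancy measures but gives no pseudocode. Below is a minimal conditional-gradient (Frank-Wolfe-style) sketch for a tabular MDP, using the illustrative objective above with gradient $w_{s,a}\,d(s,a)^{-\rho}$; the helper names (`occupancy`, `greedy_policy`, `frank_wolfe_coverage`) and the specific form of $U_\rho$ are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def occupancy(P, pi, mu0, gamma=0.95):
    """Discounted state-action occupancy d(s,a) of policy pi.

    P: (S, A, S) transitions, pi: (S, A) policy, mu0: (S,) start dist.
    Solves d = (1 - gamma) * mu0 + gamma * P_pi^T d for the state
    occupancy, then spreads it over actions with pi.
    """
    S, A, _ = P.shape
    P_pi = np.einsum("sap,sa->sp", P, pi)          # state-to-state kernel
    d_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu0)
    return d_s[:, None] * pi                       # d(s,a) = d(s) * pi(a|s)

def grad_U_rho(d, w, rho, eps=1e-8):
    """Assumed closed-form gradient of U_rho: w(s,a) * d(s,a)^(-rho).

    Larger rho puts more weight on the least-visited pairs, approaching
    minimax coverage; rho -> 0 recovers weighted-average coverage.
    (Illustrative form, not taken from the paper.)
    """
    return w * np.maximum(d, eps) ** (-rho)

def greedy_policy(P, r, gamma=0.95, iters=200):
    """Deterministic policy maximizing the fixed reward r(s,a),
    found by value iteration."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = r + gamma * np.einsum("sap,p->sa", P, V)
    pi = np.zeros((S, A))
    pi[np.arange(S), Q.argmax(axis=1)] = 1.0
    return pi

def frank_wolfe_coverage(P, mu0, w, rho, gamma=0.95, T=100):
    """Conditional-gradient loop: linearize U_rho at the current
    mixture occupancy, solve the induced MDP, and mix in the new
    occupancy with the standard 2/(t+2) step size."""
    S, A, _ = P.shape
    pi0 = np.full((S, A), 1.0 / A)
    d = occupancy(P, pi0, mu0, gamma)
    for t in range(1, T + 1):
        r = grad_U_rho(d, w, rho)                  # linearized objective
        d_new = occupancy(P, greedy_policy(P, r, gamma), mu0, gamma)
        eta = 2.0 / (t + 2)
        d = (1 - eta) * d + eta * d_new            # stays in the polytope
    return d
```

On a small random MDP, increasing $\rho$ should flatten the returned occupancy, raising its minimum entry toward the worst-case-coverage solution:

```python
rng = np.random.default_rng(0)
S, A = 6, 2
P = rng.dirichlet(np.ones(S), size=(S, A))        # random transition tensor
mu0 = np.full(S, 1.0 / S)
w = np.full((S, A), 1.0 / (S * A))                # uniform importance weights
for rho in (0.5, 2.0, 8.0):
    d = frank_wolfe_coverage(P, mu0, w, rho)
    print(rho, d.min(), d.max())
```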
Problem

Research questions and friction points this paper is trying to address.

coverage
Markov Decision Processes
exploration
state-action pairs
reward-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

coverage in MDPs
concave coverage objectives
gradient-based exploration
minimax coverage
occupancy measures