AI Summary
This work addresses the challenge of efficiently allocating exploration resources in fully observable tabular Markov decision processes (MDPs) to achieve accurate model estimation, guided by the intrinsic complexity of the state-action transition distributions. The authors propose $\kappa$-Explorer, an active exploration algorithm based on Frank-Wolfe optimization, built around a decomposable concave objective function $U_\kappa$ parameterized by a curvature $\kappa$. This formulation unifies the average-case and worst-case model estimation error objectives and exploits diminishing returns to automatically prioritize high-variance or undersampled regions. By combining a coverage-driven mechanism, a closed-form gradient of $U_\kappa$ over state-action occupancy measures, and an efficient online surrogate optimization procedure, $\kappa$-Explorer outperforms existing strategies on standard MDP benchmarks and comes with tight theoretical regret guarantees.
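To see how a single curvature parameter can interpolate between average-case and worst-case criteria, consider the power-mean family over per-pair estimation errors $e_{sa} > 0$ (shown purely as a standard illustration; the paper's exact $U_\kappa$ may differ):

$$
M_\kappa(e) = \Big( \tfrac{1}{|S||A|} \sum_{s,a} e_{sa}^{\kappa} \Big)^{1/\kappa},
\qquad
M_1(e) = \tfrac{1}{|S||A|} \sum_{s,a} e_{sa},
\qquad
\lim_{\kappa \to \infty} M_\kappa(e) = \max_{s,a} e_{sa}.
$$

Setting $\kappa = 1$ recovers the average error, while $\kappa \to \infty$ recovers the worst-case error, so sweeping the curvature traces a continuum of global objectives.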
Abstract
In tabular Markov decision processes (MDPs) with perfect state observability, each trajectory provides active samples from the transition distributions conditioned on state-action pairs. Consequently, accurate model estimation depends on how the exploration policy allocates visitation frequencies in accordance with the intrinsic complexity of each transition distribution. Building on recent work on coverage-based exploration, we introduce a parameterized family of decomposable and concave objective functions $U_\kappa$ that explicitly incorporate both intrinsic estimation complexity and extrinsic visitation frequency. Moreover, the curvature $\kappa$ provides a unified treatment of various global objectives, such as the average-case and worst-case estimation error objectives. Using the closed-form characterization of the gradient of $U_\kappa$, we propose $\kappa$-Explorer, an active exploration algorithm that performs Frank-Wolfe-style optimization over state-action occupancy measures. The diminishing-returns structure of $U_\kappa$ naturally prioritizes underexplored and high-variance transitions, while preserving smoothness properties that enable efficient optimization. We establish tight regret guarantees for $\kappa$-Explorer and further introduce a fully online and computationally efficient surrogate algorithm for practical use. Experiments on benchmark MDPs demonstrate that $\kappa$-Explorer outperforms existing exploration strategies.
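The Frank-Wolfe-style optimization described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the objective form $U_\kappa(d) = \sum_i c_i\, d_i^{\kappa}$ (with $c_i$ standing in for per-pair complexity) is a hypothetical decomposable concave choice, and the probability simplex stands in for the occupancy-measure polytope, whose MDP flow constraints are omitted to keep the sketch self-contained.

```python
import numpy as np

def grad_u_kappa(d, c, kappa, eps=1e-12):
    """Gradient of the illustrative objective U_kappa(d) = sum_i c[i] * d[i]**kappa,
    a decomposable concave function for kappa in (0, 1); eps guards the
    boundary, where the gradient of a power with exponent < 1 blows up."""
    return kappa * c * np.maximum(d, eps) ** (kappa - 1.0)

def frank_wolfe_simplex(c, kappa=0.5, n_iters=400):
    """Frank-Wolfe over the probability simplex, used here as a stand-in for
    the state-action occupancy polytope."""
    n = len(c)
    d = np.full(n, 1.0 / n)                      # start at the uniform distribution
    for t in range(n_iters):
        g = grad_u_kappa(d, c, kappa)
        vertex = np.zeros(n)
        vertex[np.argmax(g)] = 1.0               # linear maximization oracle over the simplex
        gamma = 2.0 / (t + 2.0)                  # classic Frank-Wolfe step-size schedule
        d = (1.0 - gamma) * d + gamma * vertex   # convex combination stays feasible
    return d
```

For this objective with $\kappa = 1/2$, the maximizer satisfies $d_i \propto c_i^{2}$, so pairs with larger complexity weight $c_i$ receive disproportionately more visitation while already well-covered pairs see diminishing returns, mirroring the prioritization behavior the abstract describes.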