🤖 AI Summary
Existing hardness metrics for reinforcement learning—such as MDP diameter and suboptimality gaps—were developed for tabular settings and ignore "representation difficulty": the fundamental impact of observation-space structure (e.g., raw pixels vs. compact state vectors) on how hard an environment is for a deep RL (DRL) agent to learn. Method: The authors argue that representation difficulty is a dominant factor governing DRL environment hardness and propose a representation-aware framework for measuring environment difficulty. They design pharos, an open-source tool enabling decoupled control and joint analysis of MDP structure and agent input representations. Contribution/Results: Empirical evaluation demonstrates that classical tabular-RL metrics fail to predict DRL performance, whereas pharos provides an effective platform for developing interpretable, reproducible, and representation-sensitive environment difficulty benchmarks.
📝 Abstract
Principled evaluation is critical for progress in deep reinforcement learning (RL), yet it lags behind the theory-driven benchmarks of tabular RL. While tabular settings benefit from well-understood hardness measures like MDP diameter and suboptimality gaps, deep RL benchmarks are often chosen based on intuition and popularity. This raises a critical question: can tabular hardness metrics be adapted to guide non-tabular benchmarking? We investigate this question and reveal a fundamental gap. Our primary contribution is demonstrating that the difficulty of non-tabular environments is dominated by a factor that tabular metrics ignore: representation hardness. The same underlying MDP can pose vastly different challenges depending on whether the agent receives state vectors or pixel-based observations. To enable this analysis, we introduce pharos, a new open-source library for principled RL benchmarking that allows for systematic control over both environment structure and agent representations. Our extensive case study using pharos shows that while tabular metrics offer some insight, they are poor predictors of deep RL agent performance on their own. This work highlights the urgent need for new, representation-aware hardness measures and positions pharos as a key tool for developing them.
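The core idea that "the same underlying MDP can pose vastly different challenges depending on the observation representation" can be made concrete with a small sketch. The snippet below is purely illustrative and does not use pharos's actual API (all names here are hypothetical): a tiny grid-world MDP has fixed dynamics, but two different observation functions expose it either as a compact 2-dimensional state vector or as a flattened 25-dimensional one-hot "image."

```python
# Illustrative sketch (not pharos's API): decoupling an MDP's dynamics
# from the observation representation the agent receives.

class GridMDP:
    """A tiny deterministic grid-world; the hidden state is an (x, y) pair."""
    def __init__(self, size=5):
        self.size = size
        self.pos = (0, 0)

    def step(self, action):
        # action 0 = move right, anything else = move down; walls clip motion.
        x, y = self.pos
        if action == 0:
            x = min(x + 1, self.size - 1)
        else:
            y = min(y + 1, self.size - 1)
        self.pos = (x, y)
        # Reward 1.0 only on reaching the bottom-right goal cell.
        return 1.0 if self.pos == (self.size - 1, self.size - 1) else 0.0

def state_vector_obs(env):
    """Compact representation: normalized (x, y) coordinates."""
    x, y = env.pos
    return [x / (env.size - 1), y / (env.size - 1)]

def pixel_obs(env):
    """'Pixel' representation: a flattened one-hot grid image of the same state."""
    grid = [[0.0] * env.size for _ in range(env.size)]
    x, y = env.pos
    grid[y][x] = 1.0
    return [v for row in grid for v in row]

env = GridMDP()
env.step(0)  # move right once; the underlying MDP is identical either way
print(state_vector_obs(env))  # 2-dim observation
print(len(pixel_obs(env)))    # 25-dim observation of the same hidden state
```

The MDP-level hardness metrics (diameter, gaps) are identical for both observation functions, yet an agent learning from the one-hot "pixels" faces a much harder representation-learning problem. This is the kind of decoupled control the paper attributes to pharos.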