On the Limits of Tabular Hardness Metrics for Deep RL: A Study with the Pharos Benchmark

📅 2025-09-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing hardness metrics for reinforcement learning—such as MDP diameter and suboptimality gaps—do not transfer to deep RL (DRL) because they ignore "representation difficulty": the fundamental impact of observation-space structure (e.g., raw pixels vs. compact state vectors) on agent learnability. Method: The authors systematically establish representation difficulty as the dominant factor governing DRL environment hardness and propose a representation-aware framework for measuring environment difficulty. They design Pharos, an open-source tool enabling decoupled control and joint analysis of MDP structure and agent input representations. Contribution/Results: Empirical evaluation demonstrates that classical tabular-RL metrics fail to predict DRL performance, whereas Pharos provides an effective platform for developing interpretable, reproducible, and representation-sensitive environment difficulty benchmarks.
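The core idea of decoupling MDP structure from observation representation can be illustrated with a minimal sketch. This is not the pharos API (which the summary does not specify); it is a hypothetical gridworld whose dynamics are fixed while the agent's observation can be either a compact state vector or a pixel-like image, which is exactly the axis of "representation difficulty" the paper isolates:

```python
import numpy as np

class GridMDP:
    """Tiny 4x4 gridworld: the transition and reward structure is identical
    no matter which observation representation the agent receives."""

    def __init__(self, size=4):
        self.size = size
        self.pos = (0, 0)  # agent starts in the top-left corner

    def step(self, action):
        # actions: 0=up, 1=down, 2=left, 3=right
        r, c = self.pos
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][action]
        self.pos = (min(max(r + dr, 0), self.size - 1),
                    min(max(c + dc, 0), self.size - 1))
        # reward only on reaching the bottom-right goal cell
        return 1.0 if self.pos == (self.size - 1, self.size - 1) else 0.0

    def state_vector(self):
        # Compact representation: normalized (row, col) coordinates, 2 dims.
        return np.array(self.pos, dtype=np.float32) / (self.size - 1)

    def pixel_obs(self):
        # "Pixel" representation: a one-hot image marking the agent's cell.
        img = np.zeros((self.size, self.size), dtype=np.float32)
        img[self.pos] = 1.0
        return img

env = GridMDP()
env.step(3)                      # move right: same MDP transition either way
vec = env.state_vector()         # shape (2,)  -> easy for a small MLP
img = env.pixel_obs()            # shape (4, 4) -> requires spatial feature extraction
```

A DRL agent trained on `pixel_obs` faces a harder learning problem than one trained on `state_vector`, even though the diameter and suboptimality gaps of the underlying MDP are unchanged, which is why tabular metrics alone cannot predict performance here.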

📝 Abstract
Principled evaluation is critical for progress in deep reinforcement learning (RL), yet it lags behind the theory-driven benchmarks of tabular RL. While tabular settings benefit from well-understood hardness measures like MDP diameter and suboptimality gaps, deep RL benchmarks are often chosen based on intuition and popularity. This raises a critical question: can tabular hardness metrics be adapted to guide non-tabular benchmarking? We investigate this question and reveal a fundamental gap. Our primary contribution is demonstrating that the difficulty of non-tabular environments is dominated by a factor that tabular metrics ignore: representation hardness. The same underlying MDP can pose vastly different challenges depending on whether the agent receives state vectors or pixel-based observations. To enable this analysis, we introduce `pharos`, a new open-source library for principled RL benchmarking that allows for systematic control over both environment structure and agent representations. Our extensive case study using `pharos` shows that while tabular metrics offer some insight, they are poor predictors of deep RL agent performance on their own. This work highlights the urgent need for new, representation-aware hardness measures and positions `pharos` as a key tool for developing them.
Problem

Research questions and friction points this paper is trying to address.

Adapting tabular hardness metrics to guide non-tabular RL benchmarking
Understanding how representation hardness dominates non-tabular environment difficulty
Developing new representation-aware hardness measures for deep RL evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces pharos library for RL benchmarking
Shows representation hardness dominates non-tabular difficulty
Reveals tabular metrics poorly predict deep RL performance
Michelangelo Conserva
School of Electronic Engineering and Computer Science, Queen Mary University of London, United Kingdom
Remo Sasso
PhD student, Queen Mary University of London
Artificial Intelligence · Machine Learning · Reinforcement Learning
Paulo Rauber
School of Electronic Engineering and Computer Science, Queen Mary University of London, United Kingdom