🤖 AI Summary
Existing hardness metrics for reinforcement learning—such as MDP diameter and suboptimality gaps—were developed for tabular settings and ignore "representation difficulty": the fundamental impact of observation-space structure (e.g., raw pixels vs. compact state vectors) on how hard an environment is for a deep RL (DRL) agent to learn. Method: The authors argue that representation difficulty is a dominant factor governing DRL environment hardness and propose a representation-aware framework for measuring environment difficulty. They design pharos, an open-source tool enabling decoupled control and joint analysis of MDP structure and agent input representations. Contribution/Results: Empirical evaluation demonstrates that classical tabular-RL metrics fail to predict DRL performance, whereas pharos provides an effective platform for developing interpretable, reproducible, and representation-sensitive environment difficulty benchmarks.
📝 Abstract
Principled evaluation is critical for progress in deep reinforcement learning (RL), yet it lags behind the theory-driven benchmarks of tabular RL. While tabular settings benefit from well-understood hardness measures like MDP diameter and suboptimality gaps, deep RL benchmarks are often chosen based on intuition and popularity. This raises a critical question: can tabular hardness metrics be adapted to guide non-tabular benchmarking? We investigate this question and reveal a fundamental gap. Our primary contribution is demonstrating that the difficulty of non-tabular environments is dominated by a factor that tabular metrics ignore: representation hardness. The same underlying MDP can pose vastly different challenges depending on whether the agent receives state vectors or pixel-based observations. To enable this analysis, we introduce pharos, a new open-source library for principled RL benchmarking that allows for systematic control over both environment structure and agent representations. Our extensive case study using pharos shows that while tabular metrics offer some insight, they are poor predictors of deep RL agent performance on their own. This work highlights the urgent need for new, representation-aware hardness measures and positions pharos as a key tool for developing them.
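The core idea that "the same underlying MDP can pose vastly different challenges depending on the observation representation" can be made concrete with a small sketch. The snippet below is purely illustrative and does not use pharos's actual API (all names here are hypothetical): a tiny grid-world MDP has fixed dynamics, but two different observation functions expose it either as a compact 2-dimensional state vector or as a flattened 25-dimensional one-hot "image."

```python
# Illustrative sketch (not pharos's API): decoupling an MDP's dynamics
# from the observation representation the agent receives.

class GridMDP:
    """A tiny deterministic grid-world; the hidden state is an (x, y) pair."""
    def __init__(self, size=5):
        self.size = size
        self.pos = (0, 0)

    def step(self, action):
        # action 0 = move right, anything else = move down; walls clip motion.
        x, y = self.pos
        if action == 0:
            x = min(x + 1, self.size - 1)
        else:
            y = min(y + 1, self.size - 1)
        self.pos = (x, y)
        # Reward 1.0 only on reaching the bottom-right goal cell.
        return 1.0 if self.pos == (self.size - 1, self.size - 1) else 0.0

def state_vector_obs(env):
    """Compact representation: normalized (x, y) coordinates."""
    x, y = env.pos
    return [x / (env.size - 1), y / (env.size - 1)]

def pixel_obs(env):
    """'Pixel' representation: a flattened one-hot grid image of the same state."""
    grid = [[0.0] * env.size for _ in range(env.size)]
    x, y = env.pos
    grid[y][x] = 1.0
    return [v for row in grid for v in row]

env = GridMDP()
env.step(0)  # move right once; the underlying MDP is identical either way
print(state_vector_obs(env))  # 2-dim observation
print(len(pixel_obs(env)))    # 25-dim observation of the same hidden state
```

The MDP-level hardness metrics (diameter, gaps) are identical for both observation functions, yet an agent learning from the one-hot "pixels" faces a much harder representation-learning problem. This is the kind of decoupled control the paper attributes to pharos.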