When (and How) to Trust the Expert: Diagnosing Query-Time Expert-Guided Reinforcement Learning

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates how to make effective use of suboptimal but readily available expert controllers in reinforcement learning, and diagnoses the trust boundaries and failure modes of query-time expert-guided methods. On a shared SAC backbone, it builds a benchmark for this setting, systematically evaluates multiple methods through large-scale experiments, and identifies three failure modes that single-paper evaluations miss. From three pre-training observables (expert quality, task termination, and perturbation type), it derives a testable trust decision rule, and it describes EDGE, a softmax-over-ensemble-LCB design point, as a concrete instantiation. No single method dominates, and on experts near the no-expert-RL ceiling no query-time method surpasses the expert within one million environment steps; whether this reflects a fundamental limit or a budget effect remains open.

📝 Abstract

Many continuous-control problems ship with a competent but suboptimal controller (a tuned PID, a hand-designed gait). A growing family of methods uses such controllers as queryable experts during RL, but each method has been proposed in isolation, on a different benchmark, without imperfect-expert testing. We harmonize the comparison on a shared SAC backbone, common HPO and evaluation protocols, 100/50 seeds per (env, method), and a degradation sweep over expert undertuning, action bias, and observation noise. The comparison surfaces three failure modes single-paper evaluations miss: (F1) a critic blind spot under argmax-plus-bootstrap that drags IBRL below no-expert SAC on experts close to the no-expert-RL ceiling (RL-near-ceiling, distinct from the absolute physical ceiling); (F2) residual saturation on far-from-optimal experts; and (F3) warm-start buffer poisoning that collapses training-time-handoff methods under deployment-time expert undertuning. No single method dominates: each wins on one task-structure regime and fails predictably elsewhere; on RL-near-ceiling experts (FourTank, GlassFurnace) no query-time method clears the expert within our 1M-step budget, leaving open whether this is a fundamental wall or a budget effect. We convert the spread into a testable decision rule keyed on three pre-training observables (expert quality, task termination, perturbation type). The benchmark, taxonomy, and decision rule are the primary contribution; we additionally describe EDGE, a softmax-over-ensemble-LCB design point used to demonstrate that both axes the taxonomy points to (gate form, scoring rule) are individually exploitable.
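The two axes the abstract names, gate form (argmax versus softmax) and scoring rule (a point estimate versus an ensemble lower confidence bound), can be illustrated with a minimal NumPy sketch. This is not the paper's EDGE implementation: the function names, the `mean - kappa * std` form of the LCB, the `kappa` and `temperature` parameters, and the Q-values below are all assumptions made for illustration.

```python
import numpy as np

def ensemble_lcb(q_values, kappa=1.0):
    """Lower confidence bound per candidate action.

    q_values has shape (n_critics, n_candidates). The bound used here,
    mean - kappa * std across critics, is an assumed form.
    """
    q = np.asarray(q_values, dtype=float)
    return q.mean(axis=0) - kappa * q.std(axis=0)

def softmax_gate(scores, temperature=1.0):
    """Soft gate: probability of executing each candidate action."""
    s = np.asarray(scores, dtype=float)
    z = (s - s.max()) / temperature  # shift by the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def argmax_gate(scores):
    """Hard gate: all probability mass on the highest-scoring candidate."""
    p = np.zeros(len(scores))
    p[int(np.argmax(scores))] = 1.0
    return p

# Hypothetical Q-estimates from three critics for two candidate actions:
# column 0 is the expert's action, column 1 is the learned policy's action.
q = np.array([[10.0, 13.0],
              [10.0, 7.0],
              [10.0, 13.0]])

mean_scores = q.mean(axis=0)          # point estimate per candidate
lcb_scores = ensemble_lcb(q)          # penalizes critic disagreement

hard_choice = argmax_gate(mean_scores)  # argmax over the mean picks the policy action
soft_probs = softmax_gate(lcb_scores)   # softmax over the LCB favors the expert action
```

In this toy case the critics agree on the expert action but disagree on the policy action, so an argmax over the mean commits fully to the policy action, while the softmax over the LCB puts most of its probability on the expert and still leaves some on the policy action.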

Problem

Research questions and friction points this paper is trying to address.

expert-guided reinforcement learning

imperfect expert

query-time guidance

failure modes

trust decision

Innovation

Methods, ideas, or system contributions that make the work stand out.

expert-guided reinforcement learning

query-time expert

failure mode analysis