🤖 AI Summary
Existing imitation learning benchmarks lack sufficient distributional shift between training and evaluation, hindering meaningful assessment of generalization. This paper introduces Labyrinth: a discrete, fully observable maze environment with controllable structure, precisely adjustable start and goal positions, and tunable task complexity, supporting optimal-action labeling and deterministic environment generation. Its key innovation is systematic, orthogonal control over generalization dimensions (such as partial observability, key-and-door tasks, and slippery ice floors) posed as out-of-distribution challenges, with strict separation of training, validation, and test sets. This design improves experimental reproducibility and the interpretability of results. Empirical evaluation across multiple baselines shows that the benchmark effectively discriminates between algorithms' generalization capabilities. Labyrinth thus establishes a standardized, verifiable benchmark for assessing robustness in imitation learning.
📝 Abstract
Imitation learning benchmarks often lack sufficient variation between training and evaluation, limiting meaningful generalisation assessment. We introduce Labyrinth, a benchmarking environment designed to test generalisation with precise control over structure, start and goal positions, and task complexity. It enables verifiably distinct training, evaluation, and test settings. Labyrinth provides a discrete, fully observable state space and known optimal actions, supporting interpretability and fine-grained evaluation. Its flexible setup allows targeted testing of generalisation factors and includes variants like partial observability, key-and-door tasks, and ice-floor hazards. By enabling controlled, reproducible experiments, Labyrinth advances the evaluation of generalisation in imitation learning and provides a valuable tool for developing more robust agents.
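To make the described design concrete, here is a minimal sketch of what a deterministic, fully observable grid environment with known optimal actions might look like. This is a hypothetical illustration, not the paper's actual API: the class name `LabyrinthLikeEnv`, its constructor arguments (including the `seed` parameter used to make generation reproducible), and its methods are all assumptions, and walls/mazes are omitted so the "optimal action" reduces to greedy movement toward the goal.

```python
import random

class LabyrinthLikeEnv:
    """Illustrative sketch of a seeded grid environment.

    Hypothetical API, not the paper's implementation: class name,
    arguments, and methods are assumptions for illustration only.
    """

    # action id -> (row delta, col delta): up, down, left, right
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, size=7, seed=0):
        self.size = size
        rng = random.Random(seed)  # deterministic generation from the seed
        self.start = (0, 0)
        # place the goal pseudo-randomly but reproducibly (never at the start)
        self.goal = (rng.randrange(1, size), rng.randrange(1, size))
        self.pos = self.start

    def step(self, action):
        """Apply an action; return (new position, episode-done flag)."""
        dr, dc = self.ACTIONS[action]
        r, c = self.pos
        nr, nc = r + dr, c + dc
        if 0 <= nr < self.size and 0 <= nc < self.size:  # stay on the grid
            self.pos = (nr, nc)
        return self.pos, self.pos == self.goal

    def optimal_action(self):
        """Known-optimal action label: in an open grid, step toward the goal.

        A real maze with walls would instead label actions via BFS
        shortest paths over the generated layout.
        """
        (r, c), (gr, gc) = self.pos, self.goal
        if r != gr:
            return 1 if r < gr else 0
        return 3 if c < gc else 2
```

Two instances built with the same seed produce identical layouts, which is what allows verifiably distinct (non-overlapping) train/validation/test splits: split the seed space, and no layout can leak across splits. Rolling out `optimal_action` from the start reaches the goal, giving exact expert labels for imitation learning.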