🤖 AI Summary
Meta-World, a widely adopted benchmark for multi-task and meta-reinforcement learning, has suffered from inconsistent versioning and outdated documentation, undermining the reproducibility and cross-algorithm comparability of published results. To address this, we systematically reconstruct and standardize the benchmark: (1) we achieve full reproducibility of historical results through deterministic environment initialization, modular task definitions, and CI/CD-based validation; (2) we unify the API and task-configuration paradigm, enabling fine-grained, customizable composition of task suites; and (3) we release an open-source, Gym-compatible Python implementation with explicit random-seed control and a modular architecture. The new version is publicly available as Farama-Foundation/Metaworld. This work moves benchmark design toward greater scientific rigor, substantially improving experimental reproducibility, cross-study comparability, and research efficiency, and establishing foundational infrastructure for fair and reliable reinforcement learning evaluation.
📝 Abstract
Meta-World is widely used for evaluating multi-task and meta-reinforcement learning agents, which are challenged to master diverse skills simultaneously. Since its introduction, however, there have been numerous undocumented changes that inhibit fair comparison of algorithms. This work strives to disambiguate these results from the literature, while also leveraging past versions of Meta-World to provide insights into multi-task and meta-reinforcement learning benchmark design. Through this process we release a new open-source version of Meta-World (https://github.com/Farama-Foundation/Metaworld/) that offers full reproducibility of past results, is more technically ergonomic, and gives users more control over the tasks included in a task set.