📝 Abstract
Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence in which an agent uses a neural network to learn which actions to take in a given environment. DRL has recently gained traction by solving complex environments such as driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous implementations of the state-of-the-art algorithms responsible for training these agents, such as the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist. However, studies make the mistake of assuming that implementations of the same algorithm are consistent and thus interchangeable. In this paper, through a differential testing lens, we present the results of studying the extent of implementation inconsistencies, their effect on the implementations' performance, and their impact on the conclusions of prior studies that assumed interchangeable implementations. The outcomes of our differential tests showed significant discrepancies between the tested algorithm implementations, indicating that they are not interchangeable. In particular, of the five PPO implementations tested on 56 games, three achieved superhuman performance in 50% of their total trials, while the other two achieved superhuman performance in less than 15% of their total trials. Through a meticulous manual analysis of the implementations' source code, we determined that code-level inconsistencies were the primary cause of these performance discrepancies. Lastly, we replicated a prior study and showed that the assumption of implementation interchangeability alone was sufficient to flip its experimental outcomes. These findings call for a shift in how DRL algorithm implementations are used.
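The superhuman-performance comparison above can be sketched with the human-normalized score convention commonly used in Atari benchmarking: a trial counts as superhuman when `(agent - random) / (human - random) > 1.0`. The sketch below is illustrative only; the function names and all numbers are hypothetical, not values from the paper.

```python
# Minimal sketch of the evaluation harness assumed by the abstract:
# per-trial returns from two implementations of the "same" algorithm are
# normalized against random and human baselines, then compared.

def human_normalized_score(agent: float, random_baseline: float, human: float) -> float:
    """Standard human-normalized score; values above 1.0 are superhuman."""
    return (agent - random_baseline) / (human - random_baseline)

def superhuman_fraction(trial_returns: list[float],
                        random_baseline: float, human: float) -> float:
    """Fraction of trials whose normalized score exceeds 1.0."""
    normalized = [human_normalized_score(r, random_baseline, human)
                  for r in trial_returns]
    return sum(s > 1.0 for s in normalized) / len(normalized)

# Hypothetical per-trial episode returns for two implementations on one game.
impl_a = [420.0, 515.0, 610.0]
impl_b = [120.0, 150.0, 135.0]
random_baseline, human = 100.0, 400.0

print(superhuman_fraction(impl_a, random_baseline, human))  # 1.0
print(superhuman_fraction(impl_b, random_baseline, human))  # 0.0
```

Under this framing, a differential test flags the two implementations as inconsistent whenever their superhuman fractions (or score distributions) diverge significantly on the same game, despite nominally implementing the same algorithm.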