🤖 AI Summary
This work addresses a limitation of existing reinforcement learning approaches to legged locomotion control: they typically rely on realistic robot morphologies and handcrafted, multi-component reward functions, which makes them ill-suited to stylized, non-realistic creature designs such as game NPCs. To bridge this gap, the authors introduce ARC-RL, a suite of four MuJoCo-based continuous-control environments inspired by *ARC Raiders* with standardized observation, action, and reward structures, thereby establishing the first benchmark that incorporates stylized, non-realistic morphologies into reinforcement learning. The framework uses a single closed-form reward function shared across all morphologies and provides demonstration data generated by hand-crafted Central Pattern Generators (CPGs), enabling systematic evaluation of online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE). Experiments demonstrate that incorporating prior demonstration data substantially improves policy learning efficiency and stability, validating the framework’s effectiveness under diverse morphological and stylistic constraints.
📝 Abstract
Reinforcement learning for legged locomotion has matured into a stack of multi-component reward functions and physics-engine benchmarks whose morphologies are uniformly derived from real commercial hardware. Game NPCs, however, are bound by stylistic constraints absent from sim-to-real robotics and routinely take the form of creatures with no real-robot counterpart. We introduce ARC-RL, a suite of four MuJoCo continuous-control environments featuring robotic morphologies inspired by the bestiary of ARC Raiders: the 18-DoF tall hexapod Queen, the 12-DoF armoured hexapod Bastion, the 18-DoF compact hexapod Tick, and the 12-DoF quadruped Leaper. All four robots share a unified observation template, action convention, simulation cadence, and a single closed-form multi-component reward function whose only per-morphology variation lives in a small set of weights and parameters. The reward fuses a velocity-tracking tent, a healthy survive bonus, a phase-locked gait-compliance bonus/cost pair, action regularisers, three safety penalties, and a posture anchor; no motion-capture data enters the reward at any point. We additionally provide hand-crafted Central Pattern Generator demonstrators per morphology, which serve both as fixed expert references and as sources of prior data for offline-to-online training. On this playground, we conduct a controlled empirical study comparing standard online algorithms (SAC, SPEQ, SOPE-EO) and methods augmented with prior data (SACfD, SPEQ-O2O, SOPE), and characterise how each paradigm copes with the playground's morphological diversity and animation-style stylistic constraints.
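To make the reward structure described in the abstract more concrete, the following is a minimal sketch of how a weighted, closed-form multi-component reward of this kind might be assembled. It is not the authors' implementation: every name (`RewardWeights`, `tent`, `gait_compliance`, `reward`), every weight value, the triangular form of the tent, and the duty-cycle gait schedule are assumptions chosen for illustration. The safety penalties and posture anchor are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RewardWeights:
    """Hypothetical per-morphology weights and parameters (illustrative values)."""
    w_vel: float = 1.0       # weight of the velocity-tracking tent
    w_alive: float = 0.5     # weight of the healthy survive bonus
    w_gait: float = 0.3      # weight of the gait-compliance bonus/cost
    w_action: float = 0.01   # weight of the action regulariser
    v_target: float = 0.5    # commanded forward speed (m/s)
    tent_width: float = 0.5  # half-width of the tent (m/s)
    duty: float = 0.5        # fraction of the gait cycle spent in stance


def tent(v: float, v_target: float, width: float) -> float:
    """Triangular tent: 1 at v_target, falling linearly to 0 at v_target +/- width."""
    return max(0.0, 1.0 - abs(v - v_target) / width)


def gait_compliance(phase: float, contacts: Sequence[bool],
                    offsets: Sequence[float], duty: float) -> float:
    """Mean over legs of +1 (contact matches the phase schedule) or -1 (it does not).

    `phase` is the global gait phase in [0, 1); `offsets` holds each leg's
    phase offset. A leg is scheduled for stance while its local phase < duty.
    """
    score = 0.0
    for contact, offset in zip(contacts, offsets):
        expected_stance = ((phase + offset) % 1.0) < duty
        score += 1.0 if contact == expected_stance else -1.0
    return score / len(contacts)


def reward(v: float, healthy: bool, phase: float, contacts: Sequence[bool],
           offsets: Sequence[float], action: Sequence[float],
           w: RewardWeights) -> float:
    """Weighted sum of the illustrative components (safety/posture terms omitted)."""
    r = w.w_vel * tent(v, w.v_target, w.tent_width)
    r += w.w_alive * (1.0 if healthy else 0.0)
    r += w.w_gait * gait_compliance(phase, contacts, offsets, w.duty)
    r -= w.w_action * sum(a * a for a in action)  # quadratic action cost
    return r
```

Under this sketch, switching morphology would only mean constructing a different `RewardWeights` instance and a different list of per-leg phase offsets, while the reward code itself stays unchanged, which mirrors the abstract's claim that per-morphology variation lives in a small set of weights and parameters.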