🤖 AI Summary
Designing reward functions for reinforcement learning (RL) agents in games traditionally relies heavily on domain expertise and adapts poorly to changes in game content. Method: the paper proposes an automated, LLM-based method for iterative reward-weight optimization. Given a user-specified behavioral objective, it feeds agent training statistics (such as success rate and episode length) back into multi-round, closed-loop LLM reasoning that self-calibrates the reward weights without manual intervention. Contribution/Results: To our knowledge, this is the first work to integrate LLMs into online adaptive optimization of RL reward functions, substantially reducing dependence on human experts. Evaluated on a racing task, the approach improves the agent's success rate from 9% to 80% and reduces average lap length to 855 time steps, approaching the performance achieved by expert manual tuning.
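As a concrete illustration of what is being tuned, the reward can be read as a weighted sum of fixed, hand-designed terms, with the LLM adjusting only the weights. Below is a minimal sketch under that assumption; the component names (progress, speed, collision, off-track) are illustrative stand-ins, since neither the summary nor the abstract lists the paper's actual reward terms.

```python
# Minimal sketch of a weighted-sum reward for a racing task.
# The component names and weight values are illustrative assumptions,
# NOT the paper's actual reward terms.

def shaped_reward(obs: dict, weights: dict) -> float:
    """Reward = weighted sum of fixed, hand-designed terms.

    The LLM-based tuner adjusts only `weights`; the terms stay fixed.
    """
    terms = {
        "progress": obs["track_progress_delta"],         # forward progress this step
        "speed": obs["speed"] / obs["max_speed"],         # normalized speed
        "collision": -1.0 if obs["collided"] else 0.0,    # crash penalty
        "off_track": -1.0 if obs["off_track"] else 0.0,   # leaving-the-track penalty
    }
    return sum(weights[name] * value for name, value in terms.items())

# Example: one candidate weight vector the tuner might propose.
example_obs = {"track_progress_delta": 0.8, "speed": 12.0, "max_speed": 20.0,
               "collided": False, "off_track": False}
example_weights = {"progress": 1.0, "speed": 0.3, "collision": 5.0, "off_track": 2.0}
print(shaped_reward(example_obs, example_weights))  # 0.8*1.0 + 0.6*0.3 = 0.98
```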
📝 Abstract
Reinforcement Learning (RL) in games has gained significant momentum in recent years, enabling the creation of different agent behaviors that can transform a player's gaming experience. However, deploying RL agents in production environments presents two key challenges: (1) designing an effective reward function typically requires an RL expert, and (2) when a game's content or mechanics are modified, previously tuned reward weights may no longer be optimal. To address the latter challenge, we propose an automated approach for iteratively fine-tuning an RL agent's reward function weights based on a user-defined, natural-language behavioral goal. A Language Model (LM) proposes updated weights at each iteration, guided by this target behavior and a summary of performance statistics from prior training rounds. This closed-loop process allows the LM to self-correct and refine its output over time, producing increasingly aligned behavior without the need for manual reward engineering. We evaluate our approach in a racing task and show that it consistently improves agent performance across iterations. The LM-guided agents show a significant increase in performance, from a $9\%$ to a $74\%$ success rate in just one iteration. We also compare our LM-guided tuning against a human expert's manual weight design in the racing task: by the final iteration, the LM-tuned agent achieved an $80\%$ success rate and completed laps in an average of $855$ time steps, competitive with the expert-tuned agent's peak of $94\%$ success and $850$ time steps.
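To make the closed loop concrete, here is a minimal sketch of the iterate-train-feedback cycle the abstract describes. The function names (`train_and_evaluate`, `query_llm`), the prompt wording, and the JSON weight format are hypothetical placeholders, not the paper's implementation.

```python
import json

def train_and_evaluate(weights: dict) -> dict:
    """Placeholder: run one RL training round with the given reward
    weights and return summary statistics (the feedback signal)."""
    raise NotImplementedError  # plug in your RL training pipeline here

def query_llm(prompt: str) -> str:
    """Placeholder: call an LLM and return its raw text reply."""
    raise NotImplementedError  # plug in your LLM client here

def tune_reward_weights(goal: str, init_weights: dict, n_rounds: int = 5):
    """Closed-loop reward-weight tuning: each round, the LM sees the
    behavioral goal plus a history of (weights -> stats) pairs and
    proposes new weights, letting it self-correct across iterations."""
    weights, history = init_weights, []
    for _ in range(n_rounds):
        stats = train_and_evaluate(weights)  # e.g. {"success_rate": 0.09, "avg_steps": 1200}
        history.append({"weights": weights, "stats": stats})
        prompt = (
            f"Target behavior: {goal}\n"
            f"Training history (weights -> stats): {json.dumps(history)}\n"
            "Reply with updated reward weights as a JSON object."
        )
        weights = json.loads(query_llm(prompt))  # parse the LM's proposal
    return weights, history
```

Constraining the LM's reply to a flat weight vector, rather than free-form reward code, keeps each round easy to parse and validate automatically; the exact prompt and parsing details here are assumptions, not taken from the paper.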