🤖 AI Summary
Traditional next-token prediction objectives suffer from diminishing returns and signal saturation, limiting continual learning and generalization. Method: This work introduces goal-oriented, rule-driven "dialogue games" as a novel, scalable learning signal. We construct a dialogue game environment orchestrated by a large language model (LLM) and design a unified data-generation framework supporting both offline and online interaction. Using this setup, we systematically conduct supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). Results: SFT and DPO improve performance only on in-domain dialogue games; GRPO is the only approach that also generalizes to out-of-domain games while maintaining competitive performance on canonical benchmark tasks. Crucially, this study formally defines dialogue games as structured, interpretable learning signals; we open-source our training framework and baseline configurations to advance interactive, goal-directed LLM learning.
📝 Abstract
Are we running out of learning signal? Predicting the next word in an existing text has turned out to be a powerful signal, at least at scale. But there are signs that we are running out of this resource. In recent months, interaction between learner and feedback-giver has come into focus, both for "alignment" (with a reward model judging the quality of instruction-following attempts) and for improving "reasoning" (with process- and outcome-based verifiers judging reasoning steps). In this paper, we explore to what extent synthetic interaction in what we call Dialogue Games -- goal-directed and rule-governed activities driven predominantly by verbal actions -- can provide a learning signal, and how this signal can be used. We introduce an environment for producing such interaction data (with the help of a Large Language Model as counterpart to the learner model), both offline and online. We investigate the effects of supervised fine-tuning on this data, as well as reinforcement learning setups such as DPO and GRPO, showing that all of these approaches achieve some improvement in in-domain games, but only GRPO demonstrates the ability to generalise to out-of-domain games as well as to retain competitive performance in reference-based tasks. We release the framework and the baseline training setups in the hope that this can foster research in this promising new direction.
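To make the idea of a Dialogue Game concrete, here is a minimal, self-contained sketch of one goal-directed, rule-governed episode (a taboo-style guessing game). All names, the rule, and the reward scheme are illustrative assumptions for exposition, not the released framework's API; the two callables stand in for the LLM counterpart and the learner model.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One completed dialogue game: the turn-by-turn transcript and a scalar reward."""
    transcript: list = field(default_factory=list)
    reward: float = 0.0

def play_dialogue_game(target, describer, guesser, max_turns=5):
    """Run one rule-governed, goal-directed dialogue game episode.

    `describer` (the LLM counterpart) gives clues about `target`;
    `guesser` (the learner model) tries to name it. Both are plain
    callables here so the sketch stays runnable without any model.
    """
    ep = Episode()
    for turn in range(max_turns):
        clue = describer(target, ep.transcript)
        # Rule check: the describer may not utter the target word itself.
        if target.lower() in clue.lower():
            ep.reward = -1.0  # rule violation ends the game immediately
            return ep
        ep.transcript.append(("describer", clue))
        guess = guesser(ep.transcript)
        ep.transcript.append(("guesser", guess))
        if guess.lower() == target.lower():
            # Goal reached: later successes earn a smaller reward.
            ep.reward = 1.0 - turn / max_turns
            return ep
    return ep  # goal missed within max_turns: reward stays 0.0
```

Episodes of this form can serve both regimes described above: logged offline, the transcripts become SFT or preference (DPO) data; run online, the scalar reward can drive policy optimization such as GRPO.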