🤖 AI Summary
This study investigates whether reinforcement learning (RL) can drive agents to develop human-like recursive numeral systems (e.g., English numerals). We propose a slightly modified version of Hurford's meta-grammar and show that pairs of RL agents, learning to communicate about numerical quantities, can modify their lexicon towards Pareto-optimal configurations whose efficiency is comparable to that of human numeral systems. The approach combines a meta-grammar that is gradually modified over the course of the interactions with lexicon changes driven by the pressure for efficient communication. The core contributions are: (1) a modification of Hurford's meta-grammar whose optimization, unlike that of the original, yields systems consistent with the conventions of human numeral systems; and (2) a demonstration that a simple learning mechanism such as RL, under efficiency pressure, can arrive at recursive numeral systems of human-like efficiency.
📝 Abstract
It has previously been shown that by using reinforcement learning (RL), agents can derive simple approximate and exact-restricted numeral systems that are similar to human ones (Carlsson, 2021). However, it is a major challenge to show how more complex recursive numeral systems, similar to that of English, for example, could arise via a simple learning mechanism such as RL. Here, we introduce an approach towards a mechanistic explanation of the emergence of efficient recursive number systems. We consider pairs of agents learning how to communicate about numerical quantities through a meta-grammar that can be gradually modified throughout the interactions. We find that the seminal meta-grammar of Hurford (Hurford, 1975) is not suitable for this application, as its optimization results in systems that deviate from standard conventions observed within human numeral systems. We propose a simple modification that addresses this issue. Utilising this slightly modified version of Hurford's meta-grammar, we demonstrate that our RL agents, shaped by the pressures for efficient communication, can effectively modify their lexicon towards Pareto-optimal configurations that are comparable in efficiency to those observed within human numeral systems.
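To make the notion of a Pareto-optimal configuration concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each candidate numeral system is scored on two costs to be minimised, a lexicon size and an average expression length. Both measures, the `System` type, and the toy values are hypothetical and chosen only for illustration, since the abstract does not specify the objectives.

```python
from typing import NamedTuple


class System(NamedTuple):
    """A candidate numeral system with two hypothetical costs (lower is better)."""
    name: str
    lexicon_size: int   # assumed measure: number of stored lexical items
    avg_length: float   # assumed measure: mean number of morphemes per numeral


def dominates(a: System, b: System) -> bool:
    """True if a is no worse than b on both costs and strictly better on at least one."""
    no_worse = a.lexicon_size <= b.lexicon_size and a.avg_length <= b.avg_length
    strictly_better = a.lexicon_size < b.lexicon_size or a.avg_length < b.avg_length
    return no_worse and strictly_better


def pareto_front(systems: list[System]) -> list[System]:
    """Keep every system that no other system dominates."""
    return [s for s in systems if not any(dominates(o, s) for o in systems)]


# Toy values, invented for illustration only.
candidates = [
    System("A", 10, 3.0),
    System("B", 20, 2.0),
    System("C", 20, 3.5),  # dominated by A
    System("D", 30, 1.5),
    System("E", 30, 2.5),  # dominated by B
]

front = pareto_front(candidates)
print([s.name for s in front])  # ['A', 'B', 'D']
```

Under these assumptions, a system is Pareto-optimal when no alternative improves one cost without worsening the other, which is the sense in which the agents' lexicons are compared with human numeral systems.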