Breaking the Impasse: Dual-Scale Evolutionary Policy Training for Social Language Agents

📅 2026-05-09
📈 Citations: 0
Influential: 0

🤖 AI Summary
In open-ended social language games, self-play reinforcement learning often stagnates: strategies homogenize, match outcomes become predictable, and the gradient signal needed for further learning disappears. To address this, the work proposes the Dual-scale Evolutionary Policy Training (DEPT) framework, which detects strategic deadlock by monitoring the divergence between dual-scale value baselines together with match entropy, and then applies asymmetric advantage reshaping to restore effective gradients and sustain exploration. The authors report that DEPT outperforms strong baselines on multiple social language games while avoiding policy degeneration.
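A minimal sketch of why homogenized outcomes remove the learning signal, assuming a simple batch-mean baseline (the summary does not state which advantage estimator the paper uses, and the function name `baseline_advantages` is hypothetical). When every match in a batch ends the same way, each reward equals the baseline and every advantage is zero, so a policy-gradient update has nothing to push on.

```python
from statistics import mean

def baseline_advantages(rewards):
    """Advantage of each match: its reward minus the batch-mean baseline.

    Assumption for illustration only; not necessarily the estimator used in DEPT.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Diverse outcomes: wins and losses produce non-zero advantages.
mixed = baseline_advantages([1.0, 0.0, 1.0, 0.0])

# Homogenized play: the same side always wins, so every advantage is zero.
deterministic = baseline_advantages([1.0, 1.0, 1.0, 1.0])
```

In the second case the advantages carry no information about which behaviors to reinforce, which is the situation the summary calls evolutionary stagnation.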
📝 Abstract
While Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for closed-ended tasks, extending it to open-ended social language games via self-play reveals a critical issue: evolution impasse. Due to the vast strategy space, language agents frequently converge to homogenized behaviors, leading to deterministic match outcomes that eliminate the gradient signals necessary for policy evolution. To tackle this issue, we propose Dual-scale Evolutionary Policy Training (DEPT) for social language games. DEPT introduces a time-scaled evolutionary perception mechanism that detects impasse by quantifying dual-scale value baseline divergence alongside match entropy. Upon perceiving the collapse, it then activates asymmetric advantage reshaping to dynamically modulate the optimization landscape for intervention. Thus, our method effectively restores gradient signals and enforces sustained strategic exploration. Extensive experiments on multiple social language games demonstrate that DEPT outperforms strong baselines, avoiding policy degeneration and driving the continuous evolution of social language agents.
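The detect-then-intervene loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the abstract gives no formulas, so the two exponential moving averages standing in for the dual-scale baselines, the rule that flags an impasse when those baselines have converged and match entropy is low, the thresholds, and the scaling factors are all hypothetical, as are the names `match_entropy`, `ImpasseMonitor`, and `reshape_advantages`.

```python
import math
from collections import Counter, deque

def match_entropy(outcomes):
    """Shannon entropy (bits) of the empirical distribution of match outcomes."""
    n = len(outcomes)
    return sum((c / n) * math.log2(n / c) for c in Counter(outcomes).values())

class ImpasseMonitor:
    """Tracks a fast and a slow reward baseline plus recent match outcomes.

    All hyperparameters are illustrative assumptions, not values from the paper.
    """

    def __init__(self, fast=0.5, slow=0.05, window=20, div_tol=0.05, ent_tol=0.3):
        self.fast, self.slow = fast, slow
        self.div_tol, self.ent_tol = div_tol, ent_tol
        self.fast_b = None
        self.slow_b = None
        self.recent = deque(maxlen=window)

    def update(self, reward, outcome):
        """Record one match; return True if an impasse is flagged."""
        if self.fast_b is None:
            self.fast_b = self.slow_b = reward
        else:
            self.fast_b += self.fast * (reward - self.fast_b)
            self.slow_b += self.slow * (reward - self.slow_b)
        self.recent.append(outcome)
        window_full = len(self.recent) == self.recent.maxlen
        divergence = abs(self.fast_b - self.slow_b)
        # Assumed criterion: baselines agree AND outcomes are near-deterministic.
        return (window_full
                and divergence < self.div_tol
                and match_entropy(self.recent) < self.ent_tol)

def reshape_advantages(advantages, impasse, pos_scale=1.5, neg_scale=0.5):
    """Scale positive and non-positive advantages differently during an impasse."""
    if not impasse:
        return list(advantages)
    return [a * (pos_scale if a > 0 else neg_scale) for a in advantages]

# Usage: five identical matches fill the window and trigger the flag.
monitor = ImpasseMonitor(window=5)
flags = [monitor.update(1.0, "A wins") for _ in range(5)]
reshaped = reshape_advantages([1.0, -1.0], impasse=flags[-1])
```

Note that purely multiplicative scaling leaves exactly-zero advantages at zero; the abstract does not specify how DEPT's reshaping handles that case, nor which direction of baseline divergence it treats as a sign of collapse.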
Problem

Research questions and friction points this paper is trying to address.

evolution impasse
social language games
policy homogenization
gradient signal loss
self-play
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-scale Evolutionary Policy Training
evolution impasse
asymmetric advantage reshaping
social language agents
self-play reinforcement learning