PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

📅 2026-05-15
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a failure mode of single-agent self-play: because the agent calibrates problems to its own ability, it drifts toward easy problems it can reliably solve, which limits reasoning gains. The authors propose PopuLoRA, a population-based asymmetric self-play framework in which teacher and student LoRA adapters on a shared frozen base co-evolve. Teachers propose problems, matched students solve them, and a programmatic verifier checks the solutions; cross-evaluation between the teacher and student sub-populations replaces self-calibration and keeps task complexity rising. Mutation and crossover operators applied directly in LoRA weight space produce new same-rank population members in seconds and serve as the replacement step of the population-based training loop. Against a per-adapter compute-matched single-agent baseline, the population mean is higher on three code and seven math benchmarks, and even the weakest population member beats the baseline on aggregate.
📝 Abstract
We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a co-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem-space coverage keeps expanding throughout training. Despite lower training-time reward, the population mean outperforms the baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), and even the weakest member of the population beats the baseline on aggregate.
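To make the idea of same-rank weight-space operators concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the operators, so the Gaussian mutation, the per-rank-component crossover, and all names (`LoRAAdapter`, `random_adapter`, `mutate`, `crossover`, `sigma`) are illustrative assumptions. The sketch only shows that both operator types can return a child with the same rank and shapes as its parents, which is what lets a child replace a population member directly.

```python
import random
from dataclasses import dataclass

Matrix = list[list[float]]


@dataclass
class LoRAAdapter:
    """Toy adapter (hypothetical): the weight update is B @ A."""
    A: Matrix  # shape (r, d_in)
    B: Matrix  # shape (d_out, r)

    @property
    def rank(self) -> int:
        return len(self.A)


def random_adapter(rank: int, d_in: int, d_out: int, rng: random.Random) -> LoRAAdapter:
    """Build a small random adapter for demonstration purposes only."""
    A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    B = [[rng.gauss(0.0, 0.02) for _ in range(rank)] for _ in range(d_out)]
    return LoRAAdapter(A, B)


def mutate(parent: LoRAAdapter, sigma: float, rng: random.Random) -> LoRAAdapter:
    """Assumed mutation: add Gaussian noise to every entry of A and B.

    Shapes are untouched, so the child has the same rank as the parent.
    """
    A = [[a + rng.gauss(0.0, sigma) for a in row] for row in parent.A]
    B = [[b + rng.gauss(0.0, sigma) for b in row] for row in parent.B]
    return LoRAAdapter(A, B)


def crossover(p1: LoRAAdapter, p2: LoRAAdapter, rng: random.Random) -> LoRAAdapter:
    """Assumed crossover: for each rank index k, inherit row k of A and
    column k of B together from one randomly chosen parent.

    Keeping each (row, column) pair intact preserves that parent's
    rank-one component, and the child keeps rank r.
    """
    assert p1.rank == p2.rank, "parents must share the same rank"
    r, d_out = p1.rank, len(p1.B)
    picks = [rng.choice((p1, p2)) for _ in range(r)]
    A = [list(picks[k].A[k]) for k in range(r)]
    B = [[picks[k].B[i][k] for k in range(r)] for i in range(d_out)]
    return LoRAAdapter(A, B)


if __name__ == "__main__":
    rng = random.Random(0)
    parent_a = random_adapter(rank=4, d_in=6, d_out=5, rng=rng)
    parent_b = random_adapter(rank=4, d_in=6, d_out=5, rng=rng)
    child = mutate(crossover(parent_a, parent_b, rng), sigma=0.01, rng=rng)
    print("child rank:", child.rank)
```

In a real system the same operations would be applied per adapted layer to tensors rather than nested lists; because they involve only element-wise arithmetic and copying, with no gradient steps, it is plausible that they run in seconds as the abstract states.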
Problem

Research questions and friction points this paper is trying to address.

self-play
reasoning
large language models
co-evolution
problem complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

population-based self-play
LoRA evolution
asymmetric teacher-student
co-evolutionary training
verifiable reward RL