PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of single-agent self-play: because the agent calibrates problem difficulty to its own ability, it drifts toward easy problems it can reliably solve, which caps reasoning gains. The authors propose a population-based asymmetric self-play framework in which teacher and student LoRA adapters on a shared frozen base co-evolve. A programmatic verifier checks problem correctness, and cross-evaluation between the teacher and student sub-populations replaces self-calibration and drives a steady increase in problem complexity. Evolutionary operators (mutation and crossover) act directly in LoRA weight space, producing new same-rank population members in seconds and serving as the replacement step of a population-based training loop. Built on Absolute Zero Reasoner at 7B scale, the population mean outperforms a per-adapter compute-matched single-agent baseline on three code and seven math benchmarks, and even the weakest member of the evolved population beats the baseline on aggregate.

📝 Abstract

We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a co-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem-space coverage keeps expanding throughout training. Despite lower training-time reward, the population mean outperforms the baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), and even the weakest member of the population beats the baseline on aggregate.
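
The abstract does not spell out the evolution operators, so the following is only a minimal sketch of what same-rank mutation and crossover in LoRA weight space could look like, under stated assumptions. It assumes Gaussian-noise mutation, layer-wise uniform crossover, and a replacement step that overwrites the weakest member with a mutated child of the two fittest. All names (`random_adapter`, `mutate`, `crossover`, `replace_weakest`), the toy nested-list representation, and the fitness values are hypothetical and are not taken from the paper.

```python
import copy
import random


def random_adapter(layer_names, d_in, d_out, rank, rng):
    """Build a toy adapter: per layer, factor A is (rank x d_in) and B is (d_out x rank)."""
    return {
        name: {
            "A": [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)],
            "B": [[0.0 for _ in range(rank)] for _ in range(d_out)],
        }
        for name in layer_names
    }


def mutate(adapter, sigma, rng):
    """Return a copy with Gaussian noise added to every entry (assumed operator).

    Shapes are untouched, so the child has the same rank as its parent.
    """
    child = copy.deepcopy(adapter)
    for factors in child.values():
        for matrix in factors.values():
            for row in matrix:
                for j in range(len(row)):
                    row[j] += rng.gauss(0.0, sigma)
    return child


def crossover(parent_a, parent_b, rng):
    """Layer-wise uniform crossover (assumed operator).

    Each layer's (A, B) pair is copied whole from one parent, which keeps
    the factors of a layer consistent and the rank unchanged.
    """
    return {
        name: copy.deepcopy(rng.choice((parent_a, parent_b))[name])
        for name in parent_a
    }


def replace_weakest(population, fitness, sigma, rng):
    """Replacement step: overwrite the lowest-fitness member in place.

    The new member is a mutated crossover of the two fittest members.
    Returns the index that was replaced.
    """
    order = sorted(range(len(population)), key=lambda i: fitness[i])
    worst, second, best = order[0], order[-2], order[-1]
    child = crossover(population[best], population[second], rng)
    population[worst] = mutate(child, sigma, rng)
    return worst


if __name__ == "__main__":
    rng = random.Random(0)
    layers = ["q_proj", "v_proj"]  # hypothetical target modules
    population = [random_adapter(layers, 4, 4, 2, rng) for _ in range(3)]
    fitness = [0.3, 0.1, 0.5]  # made-up scores for illustration only
    replaced = replace_weakest(population, fitness, sigma=0.01, rng=rng)
    print("replaced member", replaced)
```

Because both operators only touch the small low-rank factors and never the frozen base, producing a new member costs a copy plus a pass over the adapter parameters, which is consistent with the abstract's claim that new members are produced in seconds.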

Problem

Research questions and friction points this paper is trying to address.

self-play

reasoning

large language models

co-evolution

problem complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

population-based self-play

LoRA evolution

asymmetric teacher-student