Bootstrapping Task Spaces for Self-Improvement

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of training reinforcement learning agents to self-improve over unbounded iteration depths at inference time. The proposed Exploratory Iteration (ExIt) family of methods exploits the recurrent structure of self-improvement tasks: it selectively samples the most informative intermediate histories encountered during episodes and treats them as new task instances, growing a self-evolving curriculum while training only on single-step iterations. ExIt can additionally pair with explicit exploration mechanisms to sustain greater task diversity. Empirical results across diverse domains—competition mathematics, multi-turn tool use, and machine learning engineering—show that ExIt-trained policies self-improve at inference time on held-out task instances and continue to gain performance over step budgets that exceed the average iteration depth observed during training.

📝 Abstract
Progress in many task domains emerges from repeated revisions to previous solution attempts. Training agents that can reliably self-improve over such sequences at inference-time is a natural target for reinforcement learning (RL), yet the naive approach assumes a fixed maximum iteration depth, which can be both costly and arbitrary. We present Exploratory Iteration (ExIt), a family of autocurriculum RL methods that directly exploits the recurrent structure of self-improvement tasks to train LLMs to perform multi-step self-improvement at inference-time while only training on the most informative single-step iterations. ExIt grows a task space by selectively sampling the most informative intermediate, partial histories encountered during an episode for continued iteration, treating these starting points as new self-iteration task instances to train a self-improvement policy. ExIt can further pair with explicit exploration mechanisms to sustain greater task diversity. Across several domains, encompassing competition math, multi-turn tool-use, and machine learning engineering, we demonstrate that ExIt strategies, starting from either a single or many task instances, can produce policies exhibiting strong inference-time self-improvement on held-out task instances, and the ability to iterate towards higher performance over a step budget extending beyond the average iteration depth encountered during training.
Problem

Research questions and friction points this paper is trying to address.

Training agents for reliable self-improvement during inference
Overcoming fixed maximum iteration depth limitations in RL
Developing autocurriculum methods for multi-step self-improvement training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Exploratory Iteration autocurriculum RL methods
Selectively samples informative intermediate histories
Trains self-improvement policy on single-step iterations
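The curriculum mechanism described above—sample a task instance or partial history, take one self-improvement step, and fold informative intermediate results back into the task space—can be sketched minimally. This is an illustrative assumption, not the paper's implementation; the names `exit_curriculum_step`, `policy_step`, and `score`, and the score-change learnability proxy, are all hypothetical:

```python
import random

def exit_curriculum_step(task_buffer, policy_step, score, rng=random):
    """One ExIt-style iteration (illustrative sketch): sample a task
    instance (an initial problem or a partial improvement history),
    perform a single self-improvement step, and re-insert the result
    as a new task instance when it appears informative."""
    task = rng.choice(task_buffer)
    revised = policy_step(task)                        # one self-improvement step
    learnability = abs(score(revised) - score(task))   # crude informativeness proxy
    if learnability > 0:
        task_buffer.append(revised)                    # grow the task space
    return revised, learnability

# Toy usage: the "policy" increments a counter; score is the identity.
buffer = [0]
revised, gain = exit_curriculum_step(buffer, lambda t: t + 1, lambda t: t)
```

Training only on such single steps, while letting the buffer accumulate ever-deeper partial histories, is what lets inference-time iteration depth extend beyond anything seen as a single training episode.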