MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) applied to black-box optimization tasks struggle to balance exploration of new solutions with exploitation of high-reward ones, and existing test-time approaches depend on manually curated training data. Method: We propose MiGrATe, an online test-time training framework that requires no external labeled data. It uses GRPO as a search algorithm at inference, constructing mixed-policy groups that combine on-policy sampling with two off-policy sources, greedy sampling of top past completions and structured neighborhood sampling, whose completions drive online policy updates. Results: MiGrATe consistently outperforms inference-only baselines and existing test-time training methods on diverse black-box optimization tasks, including word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus (ARC), demonstrating its effectiveness and cross-task generalizability for unsupervised search over high-dimensional solution spaces.
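The group construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `sample_on_policy`, `archive`, `mutate`, and `score`, as well as the 4/2/2 split of an 8-completion group, are assumptions made for the example.

```python
def build_mixed_group(sample_on_policy, archive, mutate, score,
                      group_size=8, n_greedy=2, n_neighborhood=2):
    """Assemble one GRPO training group from three sources (sketch).

    sample_on_policy(n): draws n (completion, reward) pairs from the
        current policy; archive: list of past (completion, reward) pairs;
    mutate(c): returns a completion structurally similar to c;
    score(c): the black-box objective.
    """
    # Greedy sampling: reuse the top-reward completions found so far.
    greedy = sorted(archive, key=lambda cr: cr[1], reverse=True)[:n_greedy]

    # Neighborhood sampling: perturb high-reward completions into
    # structurally similar candidates, scored by the objective.
    neighbors = []
    for completion, _ in greedy[:n_neighborhood]:
        mutant = mutate(completion)
        neighbors.append((mutant, score(mutant)))

    # On-policy sampling fills the remainder, preserving exploration.
    n_on_policy = group_size - len(greedy) - len(neighbors)
    return sample_on_policy(n_on_policy) + greedy + neighbors
```

The off-policy members bias the gradient toward promising regions, while the on-policy remainder keeps the group exploring.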

📝 Abstract
Large language models (LLMs) are increasingly being applied to black-box optimization tasks, from program synthesis to molecule design. Prior work typically leverages in-context learning to iteratively guide the model towards better solutions. Such methods, however, often struggle to balance exploration of new solution spaces with exploitation of high-reward ones. Recently, test-time training (TTT) with synthetic data has shown promise in improving solution quality. However, the need for hand-crafted training data tailored to each task limits feasibility and scalability across domains. To address this problem, we introduce MiGrATe, a method for online TTT that uses GRPO as a search algorithm to adapt LLMs at inference without requiring external training data. MiGrATe operates via a mixed-policy group construction procedure that combines on-policy sampling with two off-policy data selection techniques: greedy sampling, which selects top-performing past completions, and neighborhood sampling (NS), which generates completions structurally similar to high-reward ones. Together, these components bias the policy gradient towards exploitation of promising regions in solution space, while preserving exploration through on-policy sampling. We evaluate MiGrATe on three challenging domains: word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus (ARC). We find that it consistently outperforms both inference-only and TTT baselines, demonstrating the potential of online TTT as a solution for complex search tasks without external supervision.
Problem

Research questions and friction points this paper is trying to address.

Balancing exploration and exploitation in LLM optimization tasks
Eliminating need for hand-crafted training data in test-time adaptation
Improving solution quality in black-box optimization without external supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online TTT with GRPO as a search algorithm for LLM adaptation
Mixed-policy group construction balancing exploration and exploitation
Greedy and neighborhood sampling to exploit high-reward completions
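Once a group is assembled, GRPO needs no learned value function: each completion is scored relative to its own group. A sketch of the standard group-relative advantage computation (the `eps` stabilizer is an assumption for the example):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each reward by the mean and
    standard deviation of its group, so no critic is needed."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

With a mixed-policy group, high-reward greedy and neighborhood members receive positive advantages, pushing the policy toward the promising regions they occupy.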