MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time

📅 2025-08-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) applied to black-box optimization tasks struggle to balance exploration of new solutions with exploitation of high-reward ones, and existing test-time approaches depend on manually curated training data. Method: We propose MiGrATe, an online test-time training framework that requires no external labeled data. It uses GRPO as a search algorithm at inference, constructing mixed-policy groups that combine on-policy sampling with two off-policy sources, greedy sampling of top past completions and structured neighborhood sampling, whose completions drive online policy updates. Results: MiGrATe consistently outperforms inference-only baselines and existing test-time training methods on diverse black-box optimization tasks, including word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus (ARC), demonstrating its effectiveness and cross-task generalizability for unsupervised search over high-dimensional solution spaces.
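The group construction described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `sample_on_policy`, `archive`, `mutate`, and `score`, as well as the 4/2/2 split of an 8-completion group, are assumptions made for the example.

```python
def build_mixed_group(sample_on_policy, archive, mutate, score,
                      group_size=8, n_greedy=2, n_neighborhood=2):
    """Assemble one GRPO training group from three sources (sketch).

    sample_on_policy(n): draws n (completion, reward) pairs from the
        current policy; archive: list of past (completion, reward) pairs;
    mutate(c): returns a completion structurally similar to c;
    score(c): the black-box objective.
    """
    # Greedy sampling: reuse the top-reward completions found so far.
    greedy = sorted(archive, key=lambda cr: cr[1], reverse=True)[:n_greedy]

    # Neighborhood sampling: perturb high-reward completions into
    # structurally similar candidates, scored by the objective.
    neighbors = []
    for completion, _ in greedy[:n_neighborhood]:
        mutant = mutate(completion)
        neighbors.append((mutant, score(mutant)))

    # On-policy sampling fills the remainder, preserving exploration.
    n_on_policy = group_size - len(greedy) - len(neighbors)
    return sample_on_policy(n_on_policy) + greedy + neighbors
```

The off-policy members bias the gradient toward promising regions, while the on-policy remainder keeps the group exploring.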

📝 Abstract
Large language models (LLMs) are increasingly being applied to black-box optimization tasks, from program synthesis to molecule design. Prior work typically leverages in-context learning to iteratively guide the model towards better solutions. Such methods, however, often struggle to balance exploration of new solution spaces with exploitation of high-reward ones. Recently, test-time training (TTT) with synthetic data has shown promise in improving solution quality. However, the need for hand-crafted training data tailored to each task limits feasibility and scalability across domains. To address this problem, we introduce MiGrATe, a method for online TTT that uses GRPO as a search algorithm to adapt LLMs at inference without requiring external training data. MiGrATe operates via a mixed-policy group construction procedure that combines on-policy sampling with two off-policy data selection techniques: greedy sampling, which selects top-performing past completions, and neighborhood sampling (NS), which generates completions structurally similar to high-reward ones. Together, these components bias the policy gradient towards exploitation of promising regions in solution space, while preserving exploration through on-policy sampling. We evaluate MiGrATe on three challenging domains: word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus (ARC). We find that it consistently outperforms both inference-only and TTT baselines, demonstrating the potential of online TTT as a solution for complex search tasks without external supervision.
Problem

Research questions and friction points this paper is trying to address.

Balancing exploration and exploitation in LLM optimization tasks
Eliminating need for hand-crafted training data in test-time adaptation
Improving solution quality in black-box optimization without external supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online TTT with GRPO as a search algorithm for LLM adaptation
Mixed-policy group construction balancing exploration and exploitation
Greedy and neighborhood sampling to exploit high-reward completions
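Once a group is assembled, GRPO needs no learned value function: each completion is scored relative to its own group. A sketch of the standard group-relative advantage computation (the `eps` stabilizer is an assumption for the example):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each reward by the mean and
    standard deviation of its group, so no critic is needed."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

With a mixed-policy group, high-reward greedy and neighborhood members receive positive advantages, pushing the policy toward the promising regions they occupy.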