🤖 AI Summary
Large language models (LLMs) applied to black-box optimization often fail to balance exploration and exploitation during inference and rely heavily on manually curated training data. Method: We propose MiGrATe, an online test-time adaptation framework that requires no external labeled data. It integrates the GRPO reinforcement learning algorithm into test-time training, combining on-policy sampling, greedy sampling, and structured neighborhood sampling into a mixed-policy procedure that generates high-quality synthetic data to drive online policy updates. Results: MiGrATe achieves significant improvements over inference-only baselines and existing test-time training methods on diverse black-box optimization tasks, including word search, molecule optimization, and hypothesis+program induction on ARC, demonstrating its effectiveness, robustness, and cross-task generalizability in unsupervised search over high-dimensional solution spaces.
📝 Abstract
Large language models (LLMs) are increasingly being applied to black-box optimization tasks, from program synthesis to molecule design. Prior work typically leverages in-context learning to iteratively guide the model towards better solutions. Such methods, however, often struggle to balance exploration of new solution spaces with exploitation of high-reward ones. Recently, test-time training (TTT) with synthetic data has shown promise in improving solution quality. However, the need for hand-crafted training data tailored to each task limits feasibility and scalability across domains. To address this problem, we introduce MiGrATe, a method for online TTT that uses GRPO as a search algorithm to adapt LLMs at inference without requiring external training data. MiGrATe operates via a mixed-policy group construction procedure that combines on-policy sampling with two off-policy data selection techniques: greedy sampling, which selects top-performing past completions, and neighborhood sampling (NS), which generates completions structurally similar to high-reward ones. Together, these components bias the policy gradient towards exploitation of promising regions in solution space, while preserving exploration through on-policy sampling. We evaluate MiGrATe on three challenging domains: word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus (ARC). We find that it consistently outperforms both inference-only and TTT baselines, demonstrating the potential of online TTT as a solution for complex search tasks without external supervision.