🤖 AI Summary
Large language models (LLMs) tend to handle complex reasoning by matching patterns seen in training rather than dynamically selecting the most suitable cognitive strategy, which limits their cross-task adaptability. To address this, we propose a test-time scalable meta-reasoning framework whose key novelty is the integration of Upper Confidence Bound (UCB) multi-armed bandit selection with genetic algorithm–driven evolution, enabling adaptive generation, evaluation, and refinement of task-specific reasoning strategies during inference. The framework combines a reward model, test-time scaling, and meta-thought representations so that the strategy pool can grow online and generalize across tasks. Evaluated on the Arena-Hard benchmark, our method improves GPT-4o's win rate by 11% and surpasses o1-mini by 0.9% under style control, while also scaling more effectively with larger sampling budgets and producing more structured, expert-level responses.
📝 Abstract
One critical challenge for large language models (LLMs) in complex reasoning is their reliance on matching reasoning patterns from training data, instead of proactively selecting the most appropriate cognitive strategy to solve a given task. Existing approaches impose fixed cognitive structures that enhance performance on specific tasks but lack adaptability across diverse scenarios. To address this limitation, we introduce METASCALE, a test-time scaling framework based on meta-thoughts: adaptive thinking strategies tailored to each task. METASCALE initializes a pool of candidate meta-thoughts, then iteratively selects and evaluates them using a multi-armed bandit algorithm with upper confidence bound selection, guided by a reward model. To further enhance adaptability, a genetic algorithm evolves high-reward meta-thoughts, refining and extending the strategy pool over time. By dynamically proposing and optimizing meta-thoughts at inference time, METASCALE improves both accuracy and generalization across a wide range of tasks. Experimental results demonstrate that METASCALE consistently outperforms standard inference approaches, achieving an 11% performance gain in win rate on Arena-Hard for GPT-4o and surpassing o1-mini by 0.9% under style control. Notably, METASCALE scales more effectively with increasing sampling budgets and produces more structured, expert-level responses.
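The select, evaluate, and evolve loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the scoring rule is the standard UCB1 formula with a hypothetical exploration constant `c`, `reward_fn` is a stand-in for generating a response under a meta-thought and scoring it with a reward model, and `evolve` is a placeholder that simply splices the two best strategies (the abstract does not specify how the genetic step produces new meta-thoughts).

```python
import math
from dataclasses import dataclass


@dataclass
class Arm:
    """One candidate meta-thought with its running reward statistics."""
    strategy: str
    pulls: int = 0
    total_reward: float = 0.0

    @property
    def mean(self) -> float:
        # Unpulled arms report 0.0 so they never rank as "best" by accident.
        return self.total_reward / self.pulls if self.pulls else 0.0


def ucb_select(arms, c=1.4):
    """Pick an arm by UCB1; any arm that has never been tried goes first."""
    for arm in arms:
        if arm.pulls == 0:
            return arm
    total = sum(a.pulls for a in arms)
    return max(
        arms,
        key=lambda a: a.mean + c * math.sqrt(math.log(total) / a.pulls),
    )


def evolve(arms):
    """Placeholder genetic step (assumption): splice the two top strategies."""
    parents = sorted(arms, key=lambda a: a.mean, reverse=True)[:2]
    return Arm(strategy=parents[0].strategy + " + " + parents[-1].strategy)


def run(initial, reward_fn, budget=30, evolve_every=10):
    """Spend `budget` evaluations, growing the pool every `evolve_every` steps."""
    arms = [Arm(s) for s in initial]
    for step in range(1, budget + 1):
        arm = ucb_select(arms)
        # Hypothetical stand-in for: generate with this meta-thought, then
        # score the response with a reward model.
        arm.total_reward += reward_fn(arm.strategy)
        arm.pulls += 1
        if step % evolve_every == 0:
            arms.append(evolve(arms))
    best = max(arms, key=lambda a: a.mean)
    return best, arms
```

Calling `run(["a", "b", "c"], reward_fn)` with any function that maps a strategy string to a numeric score returns the strategy with the highest mean reward together with the final pool. The exploration term shrinks as an arm accumulates pulls, so the sampling budget concentrates on high-reward strategies while newly evolved ones still receive at least one evaluation.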