🤖 AI Summary
This work addresses the suboptimal reasoning patterns that large reasoning models often exhibit on mathematical and scientific tasks due to training biases, which constrain their performance. The authors propose a reinforcement learning framework that enables dynamic selection of the optimal reasoning pattern on a per-problem basis. Their approach generates multiple candidate reasoning paths through multi-pattern rollouts, employs a verifier to guide the selection of the best-performing pattern, and incorporates attention masking to prevent pattern-suffix leakage, thereby internalizing pattern selection as part of the model's policy. Trained with GRPO, the method yields significant and consistent performance gains across multiple models and benchmarks, effectively mitigating suboptimal reasoning and enhancing both robustness and adaptability.
📝 Abstract
Large reasoning models (LRMs) exhibit diverse high-level reasoning patterns (e.g., direct solution, reflection-and-verification, and exploring multiple solutions), yet prevailing training recipes implicitly bias models toward a limited set of dominant patterns. Through a systematic analysis, we identify substantial accuracy variance across these patterns on mathematics and science benchmarks, revealing that a model's default reasoning pattern is often sub-optimal for a given problem. To address this, we introduce Group Pattern Selection Optimization (GPSO), a reinforcement learning framework that extends GRPO by incorporating multi-pattern rollouts, verifier-guided optimal pattern selection per problem, and attention masking during optimization to prevent the leakage of explicit pattern suffixes into the learned policy. By exploring a portfolio of diverse reasoning strategies and optimizing the policy on the most effective ones, GPSO enables the model to internalize the mapping from problem characteristics to optimal reasoning patterns. Extensive experiments demonstrate that GPSO delivers consistent and substantial performance gains across various model backbones and benchmarks, effectively mitigating pattern sub-optimality and fostering more robust, adaptable reasoning. All data and code are available at https://github.com/wanghanbinpanda/GPSO.
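To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical sketch of GPSO's per-problem pattern selection and suffix masking. The pattern names, the `Rollout` structure, and the binary `verifier_score` are illustrative assumptions, not the authors' implementation; the point is only to show the flow from multi-pattern rollouts, through verifier-guided selection, to a loss mask that hides the explicit pattern suffix from the learned policy.

```python
# Hypothetical sketch of GPSO's selection step (not the authors' code).
from dataclasses import dataclass

# Assumed pattern labels, mirroring the examples named in the abstract.
PATTERNS = ["direct", "reflect_verify", "multi_solution"]

@dataclass
class Rollout:
    pattern: str
    tokens: list      # token ids: explicit pattern suffix followed by the solution
    suffix_len: int   # how many leading tokens belong to the pattern suffix

def verifier_score(rollout: Rollout, answer) -> float:
    # Stand-in verifier: reward 1.0 if the rollout's final token matches
    # the reference answer, else 0.0 (a real verifier checks the full solution).
    return 1.0 if rollout.tokens[-1] == answer else 0.0

def select_best_pattern(rollouts: list, answer) -> Rollout:
    # Verifier-guided selection: keep the rollout whose pattern scores highest
    # for this particular problem.
    return max(rollouts, key=lambda r: verifier_score(r, answer))

def loss_mask(rollout: Rollout) -> list:
    # Mask out the explicit pattern suffix (zeros) so it cannot leak into the
    # optimized policy; only the solution tokens (ones) contribute to the loss.
    return [0] * rollout.suffix_len + [1] * (len(rollout.tokens) - rollout.suffix_len)

if __name__ == "__main__":
    rollouts = [
        Rollout("direct", [5, 9, 42], suffix_len=1),
        Rollout("reflect_verify", [7, 7, 3, 41], suffix_len=2),
    ]
    best = select_best_pattern(rollouts, answer=42)
    print(best.pattern)          # the pattern whose rollout the verifier accepted
    print(loss_mask(best))       # suffix masked, solution tokens kept
```

In the actual framework these selected rollouts would then be fed into the GRPO objective; the sketch stops at selection and masking, which are the two additions GPSO makes over plain GRPO.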