🤖 AI Summary
This work addresses the limited accuracy and weak self-correction of large language models on complex reasoning tasks by proposing ThinkTwice, a framework that, for the first time, jointly optimizes reasoning and self-refinement. Built on Group Relative Policy Optimization (GRPO), the approach alternates between two reinforcement learning phases: the model first produces an initial answer and then performs self-correction, with both phases supervised by the same binary correctness reward and no external annotations or human feedback. Training exhibits an implicit curriculum effect—"correct errors first, then consolidate knowledge." Evaluated on Qwen3-4B and Olmo3-7B, ThinkTwice significantly outperforms existing online policy optimization methods, gaining up to 11.5 percentage points in pass@4 accuracy on mathematical benchmarks such as AIME with just a single round of self-refinement.
📝 Abstract
We introduce ThinkTwice, a simple two-phase framework built on Group Relative Policy Optimization (GRPO) that jointly optimizes LLMs to solve reasoning problems and to refine their own answers. In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then on refining its own solutions to the same problems, using the same binary correctness reward in both phases and requiring no external feedback or critique annotations. Across five mathematical reasoning benchmarks and two model families, Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of ThinkTwice's training dynamics reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding an increasingly informative reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for reinforcement learning with verifiable rewards (RLVR).
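The two-phase loop described above can be sketched as follows. This is a minimal, illustrative simulation, not the paper's implementation: `model_solve` and `model_refine` are hypothetical policy callables, and only the GRPO-style group-relative advantage computation and the shared binary reward are shown; a real trainer would feed these advantages into a clipped policy-gradient update in each phase.

```python
def binary_reward(answer, truth):
    # Shared binary correctness reward, used in both phases
    return 1.0 if answer == truth else 0.0

def group_relative_advantages(rewards):
    # GRPO core: normalize each reward against its sampled group's mean/std
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0:
        return [0.0] * len(rewards)  # all-same rewards carry no signal
    return [(r - mean) / std for r in rewards]

def thinktwice_step(model_solve, model_refine, problem, truth, group_size=4):
    # Phase 1: sample a group of solutions and score them (solve phase)
    solutions = [model_solve(problem) for _ in range(group_size)]
    solve_adv = group_relative_advantages(
        [binary_reward(s, truth) for s in solutions])
    # Phase 2: refine each solution; same reward, no extra annotations
    refinements = [model_refine(problem, s) for s in solutions]
    refine_adv = group_relative_advantages(
        [binary_reward(r, truth) for r in refinements])
    return solve_adv, refine_adv
```

For example, with a mock solver that answers correctly half the time and a mock refiner that always fixes the answer, the solve phase yields nonzero advantages while the refine phase's uniformly correct group yields zeros, mirroring how the refinement reward saturates as the model improves.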