ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a weakness of large language models (LLMs) in scientific reasoning: after user criticism, they often revise an initially correct answer into an incorrect one, which degrades performance. The authors frame critic interaction as an inter-turn correctness-transition problem and decompose Initial-to-Critic behavior into four quadrants (Correction, Sycophancy, Robustness, and Boundary) to separate beneficial revisions from harmful over-compliance. Their framework, ReCrit, uses transition-aware reinforcement learning that rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To keep interaction training practical, it adds dynamic asynchronous rollout with tail-adaptive completion, which reduces rollout waiting. On the ChemBench, TRQA, and EarthSE benchmarks, ReCrit raises average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B; ablations indicate that rewards based only on the final answer yield little interaction-level gain.

📝 Abstract

Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at https://github.com/black-yt/ReCrit .
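The four-quadrant decomposition in the abstract can be illustrated with a short Python sketch. It is not taken from the ReCrit repository. The function names and the numeric weights are hypothetical, chosen only to match the signs the abstract describes (reward Correction and Robustness, penalize Sycophancy, weak signal for Boundary). The abstract does not state the sign or size of the Boundary term, so the small negative value here is an assumption.

```python
from collections import Counter
from enum import Enum


class Quadrant(Enum):
    CORRECTION = "correction"  # initial wrong   -> critic correct
    SYCOPHANCY = "sycophancy"  # initial correct -> critic wrong
    ROBUSTNESS = "robustness"  # initial correct -> critic correct
    BOUNDARY = "boundary"      # initial wrong   -> critic wrong


# Hypothetical weights: only the signs follow the abstract; the magnitudes
# (and the sign of the weak Boundary term) are assumptions for illustration.
ILLUSTRATIVE_WEIGHTS = {
    Quadrant.CORRECTION: 1.0,
    Quadrant.ROBUSTNESS: 1.0,
    Quadrant.SYCOPHANCY: -1.0,
    Quadrant.BOUNDARY: -0.1,
}


def classify(initial_correct: bool, critic_correct: bool) -> Quadrant:
    """Map one Initial-to-Critic correctness transition to its quadrant."""
    if initial_correct:
        return Quadrant.ROBUSTNESS if critic_correct else Quadrant.SYCOPHANCY
    return Quadrant.CORRECTION if critic_correct else Quadrant.BOUNDARY


def transition_reward(initial_correct, critic_correct, weights=ILLUSTRATIVE_WEIGHTS):
    """Reward that depends on the transition, not only on the final answer."""
    return weights[classify(initial_correct, critic_correct)]


def summarize(transitions):
    """Compute stage accuracies and the net Critic-stage change from quadrant counts."""
    if not transitions:
        raise ValueError("transitions must be non-empty")
    counts = Counter(classify(i, c) for i, c in transitions)
    n = len(transitions)
    return {
        # Initially correct answers end up in Robustness or Sycophancy.
        "initial_acc": (counts[Quadrant.ROBUSTNESS] + counts[Quadrant.SYCOPHANCY]) / n,
        # Correct Critic-stage answers come from Robustness or Correction.
        "critic_acc": (counts[Quadrant.ROBUSTNESS] + counts[Quadrant.CORRECTION]) / n,
        # The difference of the two accuracies: corrections minus sycophantic flips.
        "net_gain": (counts[Quadrant.CORRECTION] - counts[Quadrant.SYCOPHANCY]) / n,
    }


if __name__ == "__main__":
    # (initial_correct, critic_correct) pairs for four made-up examples.
    demo = [(True, True), (False, True), (False, True), (True, False)]
    print(summarize(demo))
```

The summary shows why a reward on the final answer alone is less informative: Robustness and Correction both end in a correct Critic-stage answer, and Sycophancy and Boundary both end in a wrong one, so only a transition-level view separates a useful revision from a harmful flip.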

Problem

Research questions and friction points this paper is trying to address.

critic interaction

scientific reasoning

correctness transition

sycophancy

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

transition-aware reinforcement learning

critic reasoning

sycophancy mitigation