ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

📅 2026-05-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a weakness of large language models (LLMs) in scientific reasoning: after user criticism, they often abandon an initially correct answer, which degrades performance. The authors frame critic interaction as an inter-turn correctness-transition problem and decompose Initial-to-Critic behavior into four quadrants (Correction, Sycophancy, Robustness, and Boundary) to separate beneficial revisions from harmful over-compliance. Their framework, ReCrit, applies transition-aware reinforcement learning that rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To keep interaction training practical, it uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. Across ChemBench, TRQA, and EarthSE, ReCrit raises average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B, and ablations indicate that final-answer rewards alone provide little interaction-level gain.
📝 Abstract
Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at https://github.com/black-yt/ReCrit .
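The four-quadrant decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names and the numeric weights in `ILLUSTRATIVE_REWARDS` are hypothetical, chosen only to match the stated signs (reward Correction and Robustness, penalize Sycophancy, weak signal for Boundary). The sign and size of the Boundary weight in particular are assumptions. A final-answer-only reward is included for contrast.

```python
from enum import Enum


class Quadrant(Enum):
    CORRECTION = "correction"  # initial wrong  -> critic-turn right
    SYCOPHANCY = "sycophancy"  # initial right  -> critic-turn wrong
    ROBUSTNESS = "robustness"  # initial right  -> critic-turn right
    BOUNDARY = "boundary"      # initial wrong  -> critic-turn wrong


def classify_transition(initial_correct: bool, critic_correct: bool) -> Quadrant:
    """Map a pair of per-turn correctness flags to one of the four quadrants."""
    if initial_correct:
        return Quadrant.ROBUSTNESS if critic_correct else Quadrant.SYCOPHANCY
    return Quadrant.CORRECTION if critic_correct else Quadrant.BOUNDARY


# Hypothetical weights (assumption): only the signs follow the abstract.
ILLUSTRATIVE_REWARDS = {
    Quadrant.CORRECTION: 1.0,
    Quadrant.ROBUSTNESS: 1.0,
    Quadrant.SYCOPHANCY: -1.0,
    Quadrant.BOUNDARY: -0.1,  # "weak boundary signal"; magnitude is a guess
}


def transition_reward(initial_correct: bool, critic_correct: bool) -> float:
    """Reward that depends on the correctness transition across the two turns."""
    return ILLUSTRATIVE_REWARDS[classify_transition(initial_correct, critic_correct)]


def final_answer_reward(critic_correct: bool) -> float:
    """Baseline for contrast: looks only at the final (critic-turn) answer."""
    return 1.0 if critic_correct else 0.0
```

Under these assumed weights, the final-answer baseline assigns the same reward to Sycophancy and Boundary (both end wrong), whereas the transition-aware reward separates them. This is one way to read the abstract's claim that transition-aware rewards produce more distinguishable training signals.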
Problem

Research questions and friction points this paper is trying to address.

critic interaction
scientific reasoning
correctness transition
sycophancy
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

transition-aware reinforcement learning
critic reasoning
sycophancy mitigation
dynamic asynchronous rollout
quadrant-based reward
Authors

Wanghan Xu
Shanghai Jiao Tong University

Yuhao Zhou
Shanghai Artificial Intelligence Laboratory

Hengyuan Zhao
National University of Singapore

Shuo Li
Shanghai Artificial Intelligence Laboratory

Dianzhi Yu
Chinese University of Hong Kong

Zhenfei Yin
University of Oxford
Deep Learning, Multimodal, AI Agent, Robotics

Yaowen Hu
Tsinghua University

Fengli Xu
Tsinghua University
LLM Agent, Data Science, Social Computing, Science of Science, Urban Science

Wanli Ouyang
Shanghai Jiao Tong University

Wenlong Zhang
Shanghai Artificial Intelligence Laboratory
Machine Learning, AI4Science, Autonomous Discovery

Lei Bai
Shanghai AI Laboratory
Foundation Model, Science Intelligence, Multi-Agent System, Autonomous Discovery