TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning

📅 2026-03-31
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the absence of ground-truth reward signals during inference in in-context reinforcement learning by proposing the TR-ICRL framework, which introduces a retrieval-based pseudo-reward mechanism at test time. Specifically, the method retrieves relevant instances from an unlabeled evaluation set, leverages a large language model to generate candidate answers, and applies majority voting to produce pseudo-labels that serve as reward signals and formative feedback. This enables iterative model refinement without access to true labels and facilitates online reinforcement learning. The final output is generated by integrating these refined responses with the original query. Evaluated on MedQA and AIME2024, TR-ICRL achieves performance gains of 21.23% and 137.59%, respectively, substantially enhancing model robustness and efficacy on reasoning- and knowledge-intensive tasks.
πŸ“ Abstract
In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground-truth labels during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL first retrieves the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance, and a pseudo-label is derived from this set through majority voting. This label then serves as a proxy that supplies reward signals and formative feedback, guiding the LLM through iterative refinement. Finally, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the answer determined through a final round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and by 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at https://github.com/pangpang-xuan/TR_ICRL.
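The retrieve–vote–refine loop described in the abstract can be sketched as follows. This is a minimal, hypothetical Python sketch, not the authors' implementation: `generate` stands in for any LLM sampling call, and all function and variable names are illustrative assumptions.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among candidates (ties broken by first seen)."""
    return Counter(answers).most_common(1)[0][0]


def tr_icrl_step(query, retrieved, generate, n_samples=5):
    """One TR-ICRL-style iteration (hypothetical sketch).

    For each retrieved unlabeled instance, sample candidate answers,
    derive a pseudo-label by majority vote, and record per-candidate
    agreement with that pseudo-label as a proxy reward signal.
    """
    context = []
    for inst in retrieved:
        candidates = [generate(inst) for _ in range(n_samples)]
        pseudo = majority_vote(candidates)
        # The pseudo-label acts as a proxy reward: candidates that agree
        # with it are treated as "correct" for formative feedback.
        feedback = [(c, c == pseudo) for c in candidates]
        context.append((inst, pseudo, feedback))
    # Integrate the synthesized context with the original query, then
    # decide the final answer by a last round of majority voting.
    final_candidates = [generate((query, context)) for _ in range(n_samples)]
    return majority_vote(final_candidates)
```

In a real pipeline `generate` would condition on the accumulated context window, so the refinement across iterations happens inside the model's in-context learning rather than via weight updates.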
Problem

Research questions and friction points this paper is trying to address.

In-Context Reinforcement Learning
reward estimation
ground-truth rewards
test-time reasoning
pseudo-labeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Context Reinforcement Learning
Test-Time Rethinking
Pseudo-Labeling
Majority Voting
Reward Estimation
Wenxuan Jiang
The Hong Kong Polytechnic University

Yuxin Zuo
Institute of Computing Technology, Chinese Academy of Sciences

Zijian Zhang
Tencent
Generative Models, Diffusion Models, Representation Learning

Xuecheng Wu
Xi'an Jiaotong University

Zining Fan
East China Normal University

Wenxuan Liu
Institute of Computing Technology, Chinese Academy of Sciences

Li Chen
Northeastern University

Xiaoyu Li
Meituan-M17

Xuezhi Cao
Meituan
Data Mining, Knowledge Graph, LLMs

Xiaolong Jin
Purdue University
AI Safety

Ninghao Liu
Assistant Professor, University of Georgia
Explainable AI, Fairness in Machine Learning, Graph Mining, Anomaly Detection