AI Summary
This work addresses the absence of ground-truth reward signals during inference in in-context reinforcement learning by proposing the TR-ICRL framework, which introduces a retrieval-based pseudo-reward mechanism at test time. Specifically, the method retrieves relevant instances from an unlabeled evaluation set, uses a large language model to generate candidate answers, and applies majority voting to produce pseudo-labels that serve as reward signals and formative feedback. This enables iterative model refinement without access to true labels and facilitates online reinforcement learning. The final output is generated by integrating these refined responses with the original query. Evaluated with Qwen2.5-7B, TR-ICRL achieves performance gains of 21.23% on MedQA and 137.59% on AIME2024, substantially enhancing model robustness and efficacy on reasoning- and knowledge-intensive tasks.
Abstract
In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground-truth labels during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning- and knowledge-intensive tasks. TR-ICRL first retrieves the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance, and a pseudo-label is derived from this set through majority voting. This label then serves as a proxy to provide reward signals and generate formative feedback, guiding the LLM through iterative refinement. Finally, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, and the answer is determined through a final round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and by 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at https://github.com/pangpang-xuan/TR_ICRL.
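The core pseudo-reward step described in the abstract — deriving a pseudo-label by majority voting over sampled candidate answers, then scoring candidates against it — can be sketched as follows. This is an illustrative sketch, not the authors' implementation; `majority_vote` and `pseudo_reward` are hypothetical names, and the binary-reward scheme is an assumption about how the pseudo-label stands in for the missing ground truth.

```python
from collections import Counter

def majority_vote(candidates):
    """Return the most frequent answer among sampled candidates.

    Ties are broken by first occurrence, since Counter.most_common
    preserves insertion order for equal counts (Python 3.7+).
    """
    if not candidates:
        raise ValueError("need at least one candidate answer")
    return Counter(candidates).most_common(1)[0][0]

def pseudo_reward(candidate, pseudo_label):
    """Binary reward: 1 if the candidate agrees with the majority-vote
    pseudo-label, else 0 (a proxy for the unavailable ground truth)."""
    return 1 if candidate == pseudo_label else 0

# Example: five sampled answers for one retrieved instance
samples = ["B", "A", "B", "C", "B"]
label = majority_vote(samples)                      # "B"
rewards = [pseudo_reward(s, label) for s in samples]  # [1, 0, 1, 0, 1]
```

In the full framework, these rewards and the pseudo-label would feed back into the prompt as contextual feedback for the next ICRL iteration, with a final majority vote producing the answer to the original query.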