CG-TTRL: Context-Guided Test-Time Reinforcement Learning for On-Device Large Language Models

📅 2025-11-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address insufficient context utilization in Test-Time Reinforcement Learning (TTRL) for on-device large language model adaptation, this paper proposes Context-Guided TTRL (CG-TTRL). The method is the first to dynamically integrate in-context learning into TTRL's two-phase sampling process, improving both pseudo-label accuracy and exploration control via efficient context selection and dynamic fusion. It combines multi-sampling and downsampling, reward-driven fine-tuning, majority voting, and lightweight context integration, designed specifically for resource-constrained edge deployment. Evaluated on multiple mathematical and scientific QA benchmarks, the approach achieves a 7% relative accuracy improvement over standard TTRL. Notably, it attains an 8% relative gain within only three training steps, significantly accelerating convergence and improving practical deployability.

📝 Abstract
Test-time Reinforcement Learning (TTRL) has shown promise in adapting foundation models to complex tasks at test time, yielding large performance improvements. TTRL leverages an elegant two-phase sampling strategy: first, multi-sampling derives a pseudo-label via majority voting; then, downsampling and reward-based fine-tuning encourage the model to explore and learn diverse valid solutions, with the pseudo-label modulating the reward signal. Meanwhile, in-context learning has been widely explored at inference time and has demonstrated the ability to enhance model performance without weight updates. However, TTRL's two-phase sampling strategy under-utilizes contextual guidance, which could improve pseudo-label accuracy in the initial exploitation phase while regulating exploration in the second. To address this, we propose context-guided TTRL (CG-TTRL), which integrates context dynamically into both sampling phases, and we introduce a method for efficient context selection for on-device applications. Our evaluations on mathematical and scientific QA benchmarks show CG-TTRL outperforms TTRL (e.g. an additional 7% relative accuracy improvement), while boosting efficiency by reaching strong performance after only a few steps of test-time training (e.g. an 8% relative improvement, versus 1% for TTRL, after 3 steps).
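The first phase of the two-phase strategy described in the abstract (majority-vote pseudo-labeling over multiple samples, then scoring a downsampled subset against that pseudo-label) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the simple binary-match reward are assumptions for exposition.

```python
from collections import Counter

def majority_vote_pseudo_label(answers):
    """Phase 1: derive a pseudo-label from many sampled answers via majority voting."""
    counts = Counter(answers)
    label, _ = counts.most_common(1)[0]
    return label

def pseudo_label_rewards(sampled_answers, pseudo_label):
    """Phase 2 (sketch): binary reward for each downsampled rollout,
    1.0 if it matches the pseudo-label, else 0.0. These rewards would
    drive the fine-tuning update in TTRL."""
    return [1.0 if a == pseudo_label else 0.0 for a in sampled_answers]

# Example: 8 sampled final answers; the majority answer becomes the pseudo-label,
# and a downsampled subset of 4 rollouts is scored against it.
samples = ["42", "42", "41", "42", "40", "42", "41", "42"]
label = majority_vote_pseudo_label(samples)
rewards = pseudo_label_rewards(samples[:4], label)
```

CG-TTRL's contribution, per the abstract, is to inject selected context into both phases: into the multi-sampling prompts (to sharpen the pseudo-label) and into the downsampled rollouts (to regulate exploration), which this sketch does not model.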
Problem

Research questions and friction points this paper is trying to address.

Enhancing pseudo-label accuracy through contextual guidance in test-time reinforcement learning
Regulating exploration phase with dynamic context integration for on-device LLMs
Improving efficiency of test-time adaptation with optimized context selection methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates dynamic context into two-phase sampling strategy
Proposes efficient context selection for on-device applications
Enhances pseudo-label accuracy and regulates exploration phases