🤖 AI Summary
This work addresses the performance degradation of large language models under distribution shift and the limitations of existing test-time adaptation methods, which typically rely on gradient updates or external supervision and are therefore costly to deploy. The authors propose TF-TTCL, a fully training-free, black-box-compatible framework for test-time contrastive learning. TF-TTCL runs an "Explore-Reflect-Steer" loop that combines multi-agent role-playing with contrastive experience distillation to generate textual rules from the model's own reasoning trajectories and apply them during inference, achieving self-supervised online adaptation. Experiments show that TF-TTCL consistently outperforms strong zero-shot baselines and representative test-time adaptation methods on both closed-ended reasoning tasks and open-ended evaluation tasks.
📝 Abstract
Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and incur substantial overhead, while training-free alternatives are either static or depend on external guidance. In this paper, we propose Training-Free Test-Time Contrastive Learning (TF-TTCL), a training-free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF-TTCL implements a dynamic "Explore-Reflect-Steer" loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi-agent role-playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories and distills it into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed-ended reasoning tasks and open-ended evaluation tasks demonstrate that TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF-TTCL.
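The three modules above can be pictured as a single online loop. The following is a minimal sketch under stated assumptions, not the authors' implementation (see the linked repository for that). Every name in it (`RuleMemory`, `explore`, `reflect`, `adapt_online`, `ROLES`, `stub_llm`) is hypothetical. The abstract does not say how superior and inferior trajectories are told apart, so the sketch takes a caller-supplied `score` function as a placeholder, and it uses word-overlap (Jaccard) similarity as a stand-in for whatever retriever the paper uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # frozen black-box model: prompt in, text out

# Hypothetical role prompts used to diversify views of the same query.
ROLES = ["careful mathematician", "skeptical reviewer", "step-by-step tutor"]


def tokens(text: str) -> set:
    return set(text.lower().split())


@dataclass
class RuleMemory:
    """Stores (source query, rule text) pairs distilled from past queries."""
    rules: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, query: str, rule: str) -> None:
        self.rules.append((query, rule))

    def retrieve(self, query: str, k: int = 2) -> List[str]:
        # Assumption: rank rules by Jaccard overlap between the new query
        # and the query each rule was distilled from.
        q = tokens(query)

        def sim(item: Tuple[str, str]) -> float:
            s = tokens(item[0])
            return len(q & s) / len(q | s) if q | s else 0.0

        ranked = sorted(self.rules, key=sim, reverse=True)
        return [rule for _, rule in ranked[:k]]


def explore(llm: LLM, query: str, rules: List[str]) -> List[str]:
    # One trajectory per role; retrieved rules are prepended as guidance.
    guidance = "\n".join(f"- {r}" for r in rules)
    return [
        llm(f"You are a {role}.\nRules:\n{guidance}\nQuestion: {query}")
        for role in ROLES
    ]


def reflect(llm: LLM, trajectories: List[str],
            score: Callable[[str], float]) -> Tuple[str, str]:
    # Contrast the best and worst trajectory and ask the same frozen
    # model to phrase the difference as one textual rule.
    ranked = sorted(trajectories, key=score)
    worst, best = ranked[0], ranked[-1]
    rule = llm(f"Compare the answers and state one rule.\n"
               f"Better: {best}\nWorse: {worst}")
    return best, rule


def adapt_online(llm: LLM, queries: List[str],
                 score: Callable[[str], float]) -> Tuple[List[str], RuleMemory]:
    memory = RuleMemory()
    answers = []
    for q in queries:
        rules = memory.retrieve(q)              # steer with stored rules
        trajectories = explore(llm, q, rules)   # explore diverse views
        best, rule = reflect(llm, trajectories, score)  # reflect
        memory.add(q, rule)                     # no weights are updated
        answers.append(best)
    return answers, memory


def stub_llm(prompt: str) -> str:
    # Deterministic stand-in so the sketch runs without a real model.
    if prompt.startswith("Compare"):
        return "Rule: verify each arithmetic step."
    return f"[draft of {len(prompt)} chars]"


if __name__ == "__main__":
    answers, memory = adapt_online(stub_llm, ["what is 2 + 2", "what is 3 + 3"], score=len)
    print(answers)
    print(memory.rules)
```

The only state that changes between queries is the rule memory, which is what makes the loop compatible with a frozen, black-box model: adaptation happens entirely through the prompt.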