🤖 AI Summary
Existing test-time alignment methods either rely on coarse trajectory-level signals or suffer from low sampling efficiency, making it difficult to balance performance and generation diversity. This work proposes LLMdoctor, a novel framework that introduces token-level reward acquisition and Token-level Flow-guided Preference Optimization (TFPO). Leveraging a patient-doctor architecture, LLMdoctor uses fine-grained token-level preference signals at test time to let a small "doctor" model efficiently align a frozen large "patient" language model. By enforcing flow consistency across sub-trajectories, the method achieves precise token-by-token alignment while preserving output diversity. Experiments demonstrate that LLMdoctor significantly outperforms current test-time alignment approaches across multiple benchmarks and even surpasses full fine-tuning methods such as DPO.
📝 Abstract
Aligning Large Language Models (LLMs) with human preferences is critical, yet traditional fine-tuning methods are computationally expensive and inflexible. While test-time alignment offers a promising alternative, existing approaches often rely on distorted trajectory-level signals or inefficient sampling, fundamentally capping performance and failing to preserve the generative diversity of the base model. This paper introduces LLMdoctor, a novel framework for efficient test-time alignment that operates via a patient-doctor paradigm. It integrates token-level reward acquisition with Token-level Flow-guided Preference Optimization (TFPO) to steer a large, frozen patient LLM with a smaller, specialized doctor model. Unlike conventional methods that rely on trajectory-level rewards, LLMdoctor first extracts fine-grained, token-level preference signals from the patient model's behavioral variations. These signals then guide the training of the doctor model via TFPO, which establishes flow consistency across all sub-trajectories, enabling precise token-by-token alignment while inherently preserving generation diversity. Extensive experiments demonstrate that LLMdoctor significantly outperforms existing test-time alignment methods and even surpasses the performance of full fine-tuning approaches such as DPO.
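The core idea of steering a frozen patient model with a small doctor model can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration (not the paper's actual implementation): the doctor contributes per-token reward scores that are added to the patient's next-token logits before sampling, so the patient's weights stay frozen while its output distribution shifts toward preferred tokens. The function name, the `beta` guidance weight, and the toy vocabulary are all assumptions introduced for illustration.

```python
import numpy as np

def guided_next_token_distribution(patient_logits, doctor_scores, beta=1.0):
    """Combine a frozen patient model's logits with a doctor's
    token-level reward scores to form a guided sampling distribution.

    beta controls how strongly the doctor's preference signal
    steers the patient; beta=0 recovers the base model exactly.
    """
    combined = patient_logits + beta * doctor_scores
    # Numerically stable softmax over the combined logits
    exp = np.exp(combined - combined.max())
    return exp / exp.sum()

# Toy vocabulary of 4 tokens: the patient prefers token 0,
# while the doctor's token-level signal rewards token 2.
patient_logits = np.array([2.0, 0.5, 0.0, -1.0])
doctor_scores = np.array([0.0, 0.0, 3.0, 0.0])

base = guided_next_token_distribution(patient_logits, doctor_scores, beta=0.0)
guided = guided_next_token_distribution(patient_logits, doctor_scores, beta=1.0)
print(int(base.argmax()), int(guided.argmax()))  # → 0 2
```

With `beta=0` the guided distribution is just the patient's own softmax (argmax token 0); with `beta=1` the doctor's score lifts token 2 to the top. Because guidance only reweights rather than truncates the distribution, non-preferred tokens keep nonzero probability, which is one intuition for how token-level steering can preserve generation diversity.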