🤖 AI Summary
Aligning frozen large language models (LLMs) with human preferences at inference time—without access to model weights or fine-tuning—is challenging, especially when deploying on private user data. Method: We propose Iterative Reweight-then-Optimize (IRO), a test-time reinforcement learning framework that performs alignment entirely through inference-time operations: iterative reward-weighted sampling, lightweight value-function training, and search-based decoding—without updating any model parameters. Contribution/Results: IRO enables RL-style alignment of frozen LLMs purely at test time. By iterating, it reduces reliance on a single, possibly imperfect reward or value function, and it supports personalized alignment on private user data. Experiments show that IRO improves the preference consistency of generated outputs without requiring model weights or any parameter updates.
📝 Abstract
Aligning large language models (LLMs) with human preferences usually requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used at test time to improve model performance, nor are they applicable when the model weights are not accessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present a method named Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of the (frozen) base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples using the current value functions, and (iii) trains a new lightweight value function that guides the next decoding pass. At test time, the value functions are used to guide the base model's generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI's reinforcement fine-tuning (RFT), but without requiring access to the model weights.
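The loop the abstract describes (sample from the frozen base model, score with a reward, fit a lightweight value function, then decode under its guidance) can be sketched on a toy problem. Everything below is an illustrative stand-in, not the paper's implementation: the "base model" is a uniform token sampler, the reward is a simple token-counting score, and the value function is a lookup table over hand-picked prefix features.

```python
import random
from collections import defaultdict

# Toy setup: these stand in for the frozen LLM, the reward model,
# and the lightweight value function of the paper (all illustrative).
VOCAB = [0, 1, 2]
SEQ_LEN = 8

def reward(seq):
    # Toy reward: fraction of tokens equal to 1 (a stand-in reward model).
    return sum(t == 1 for t in seq) / len(seq)

def base_next_token(rng):
    # Stand-in for the frozen base model: uniform over the vocabulary.
    return rng.choice(VOCAB)

def features(prefix):
    # Hand-crafted prefix features so the tabular value function generalizes.
    return (len(prefix), sum(t == 1 for t in prefix))

def train_value_function(trajectories):
    # Fit V(prefix) ~= E[final reward | prefix] by averaging over samples.
    sums, counts = defaultdict(float), defaultdict(int)
    for seq, r in trajectories:
        for i in range(len(seq) + 1):
            key = features(seq[:i])
            sums[key] += r
            counts[key] += 1
    mean_r = sum(r for _, r in trajectories) / len(trajectories)
    table = {k: sums[k] / counts[k] for k in sums}
    # Back off to the mean reward for unseen prefixes.
    return lambda prefix: table.get(features(prefix), mean_r)

def guided_decode(value_fn, rng, k=4):
    # Value-guided decoding: draw k candidate next tokens from the base
    # model and keep the one the current value function scores highest.
    seq = []
    for _ in range(SEQ_LEN):
        cands = [base_next_token(rng) for _ in range(k)]
        seq.append(max(cands, key=lambda t: value_fn(seq + [t])))
    return seq

def iro(num_iters=2, n_samples=200, seed=0):
    # Iterative reweight-then-optimize: each round samples with the
    # current value function (pure base sampling in round 0), then
    # refits a fresh value function on the reward-scored trajectories.
    rng = random.Random(seed)
    value_fn = None
    for _ in range(num_iters):
        trajs = []
        for _ in range(n_samples):
            if value_fn is None:
                seq = [base_next_token(rng) for _ in range(SEQ_LEN)]
            else:
                seq = guided_decode(value_fn, rng)
            trajs.append((seq, reward(seq)))
        value_fn = train_value_function(trajs)
    return value_fn
```

At deployment only `guided_decode` runs: the base sampler is never modified, mirroring how IRO leaves the frozen model's parameters untouched and steers generation solely through the learned value functions.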