🤖 AI Summary
Aligning frozen large language models (LLMs) with human preferences at inference time—without access to model weights or fine-tuning—is challenging, especially when deploying on private user data. Method: We propose Iterative Reweight-then-Optimize (IRO), a test-time reinforcement learning framework that performs alignment entirely through inference-time operations: iterative reward-weighted sampling, lightweight value-function training, and search-based decoding—without updating any model parameters. Contribution/Results: IRO enables RL-style alignment of frozen LLMs purely at test time. By iterating, it reduces reliance on a single, possibly imperfect reward or value function, and it supports personalized alignment on private user data. Experiments show that IRO improves the preference consistency of generated outputs without requiring model weights or any parameter updates.
📝 Abstract
Aligning large language models (LLMs) with human preferences usually requires fine-tuning methods such as RLHF and DPO. These methods directly optimize the model parameters, so they cannot be used at test time to improve model performance, nor are they applicable when the model weights are not accessible. In contrast, test-time methods sidestep weight updates by leveraging reward functions to guide and improve output quality. However, they incur high inference costs, and their one-shot guidance is often based on imperfect reward or value functions, leading to suboptimal outputs. In this work, we present a method named Iterative Reweight-then-Optimize (IRO), a reinforcement learning (RL) framework that performs RL-style alignment of the (frozen) base model without touching its parameters. During training, each iteration (i) samples candidates from the base model, (ii) resamples using the current value functions, and (iii) trains a new lightweight value function that guides the next decoding pass. At test time, the value functions are used to guide the base model's generation via a search-based optimization process. Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI's reinforcement fine-tuning (RFT), but without requiring access to the model weights.
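The loop the abstract describes (sample from the frozen base model, score with a reward, fit a lightweight value function, then decode under its guidance) can be sketched on a toy problem. Everything below is an illustrative stand-in, not the paper's implementation: the "base model" is a uniform token sampler, the reward is a simple token-counting score, and the value function is a lookup table over hand-picked prefix features.

```python
import random
from collections import defaultdict

# Toy setup: these stand in for the frozen LLM, the reward model,
# and the lightweight value function of the paper (all illustrative).
VOCAB = [0, 1, 2]
SEQ_LEN = 8

def reward(seq):
    # Toy reward: fraction of tokens equal to 1 (a stand-in reward model).
    return sum(t == 1 for t in seq) / len(seq)

def base_next_token(rng):
    # Stand-in for the frozen base model: uniform over the vocabulary.
    return rng.choice(VOCAB)

def features(prefix):
    # Hand-crafted prefix features so the tabular value function generalizes.
    return (len(prefix), sum(t == 1 for t in prefix))

def train_value_function(trajectories):
    # Fit V(prefix) ~= E[final reward | prefix] by averaging over samples.
    sums, counts = defaultdict(float), defaultdict(int)
    for seq, r in trajectories:
        for i in range(len(seq) + 1):
            key = features(seq[:i])
            sums[key] += r
            counts[key] += 1
    mean_r = sum(r for _, r in trajectories) / len(trajectories)
    table = {k: sums[k] / counts[k] for k in sums}
    # Back off to the mean reward for unseen prefixes.
    return lambda prefix: table.get(features(prefix), mean_r)

def guided_decode(value_fn, rng, k=4):
    # Value-guided decoding: draw k candidate next tokens from the base
    # model and keep the one the current value function scores highest.
    seq = []
    for _ in range(SEQ_LEN):
        cands = [base_next_token(rng) for _ in range(k)]
        seq.append(max(cands, key=lambda t: value_fn(seq + [t])))
    return seq

def iro(num_iters=2, n_samples=200, seed=0):
    # Iterative reweight-then-optimize: each round samples with the
    # current value function (pure base sampling in round 0), then
    # refits a fresh value function on the reward-scored trajectories.
    rng = random.Random(seed)
    value_fn = None
    for _ in range(num_iters):
        trajs = []
        for _ in range(n_samples):
            if value_fn is None:
                seq = [base_next_token(rng) for _ in range(SEQ_LEN)]
            else:
                seq = guided_decode(value_fn, rng)
            trajs.append((seq, reward(seq)))
        value_fn = train_value_function(trajs)
    return value_fn
```

At deployment only `guided_decode` runs: the base sampler is never modified, mirroring how IRO leaves the frozen model's parameters untouched and steers generation solely through the learned value functions.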