🤖 AI Summary
To align large language model (LLM) outputs with user preferences under strict inference-time budget constraints, this paper proposes HIA, a fine-tuning-free, low-overhead alignment method for black-box LLMs. HIA pairs a lightweight prompt optimizer with a heuristic reward model and a two-stage response-filtering mechanism, achieving multi-objective personalized alignment with only 1–2 API queries. Its core innovation is the first combination of learnable prompt optimization with gradient-free heuristic evaluation, jointly optimizing alignment quality and computational efficiency while remaining fully compatible with black-box LLM APIs. On the HelpSteer and ComPRed benchmarks, HIA substantially outperforms baselines such as best-of-N sampling and beam search, achieving stronger multi-dimensional alignment under extremely limited inference budgets.
📝 Abstract
Aligning LLMs with user preferences is crucial for real-world use but often requires costly fine-tuning or expensive inference, forcing trade-offs between alignment quality and computational cost. Existing inference-time methods typically ignore this balance, focusing solely on the optimized policy's performance. We propose HIA (Heuristic-Guided Inference-time Alignment), a tuning-free, black-box-compatible approach that uses a lightweight prompt optimizer, heuristic reward models, and two-stage filtering to reduce inference calls while preserving alignment quality. On the real-world prompt datasets HelpSteer and ComPRed, HIA outperforms best-of-N sampling, beam-search, and greedy-search baselines on multi-objective, goal-conditioned tasks under the same inference budget. We also find that HIA remains effective under low inference budgets, with as few as one or two response queries, offering a practical solution for scalable, personalized LLM deployment.
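To make the budget-saving idea concrete, the following is a minimal sketch of two-stage filtering guided by a heuristic reward. It is not the paper's implementation: both the coarse first-stage filter and the `heuristic_reward` scorer here are toy stand-ins (the real HIA uses a learned prompt optimizer and a heuristic reward model), and all function names and scoring rules are illustrative assumptions. The point it shows is structural: cheap checks prune candidates first, a gradient-free score ranks the survivors, and only the top 1–2 candidates ever consume the black-box API budget.

```python
def heuristic_reward(response: str, preferences: dict) -> float:
    """Toy stand-in for a heuristic reward model: scores a response
    against per-objective preference weights (e.g. helpfulness, brevity)."""
    score = 0.0
    # Crude proxy: more sentences ~ more complete answer.
    score += preferences.get("helpfulness", 0.0) * response.count(".")
    # Crude proxy: fewer words ~ more concise answer.
    score += preferences.get("brevity", 0.0) / (1 + len(response.split()))
    return score


def two_stage_filter(candidates, preferences, stage1_keep=4, budget=2):
    """Select at most `budget` candidates via two-stage filtering."""
    # Stage 1: cheap coarse filter -- drop degenerate (too-short) candidates
    # and keep a handful of the longest ones as a proxy for informativeness.
    viable = [c for c in candidates if len(c.split()) >= 3]
    stage1 = sorted(viable, key=len, reverse=True)[:stage1_keep]
    # Stage 2: rank survivors with the gradient-free heuristic reward and
    # keep only as many as the inference budget allows (1-2 in HIA's setting).
    ranked = sorted(stage1,
                    key=lambda c: heuristic_reward(c, preferences),
                    reverse=True)
    return ranked[:budget]


candidates = [
    "ok",
    "A short answer.",
    "A detailed answer. With several sentences. Covering the question.",
    "Another candidate response here.",
]
selected = two_stage_filter(candidates, {"helpfulness": 1.0}, budget=2)
```

Only the `selected` candidates would then be sent to (or refined against) the black-box LLM, which is how the overall query count stays at one or two.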