Asymptotic Universal Alignment: A New Alignment Framework via Test-Time Scaling

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of aligning large language models with heterogeneous user preferences at test time by proposing a general alignment framework based on test-time scaling: for each prompt, the model generates multiple candidate responses and the user selects the one they prefer. The paper formally introduces the notion of "asymptotic universal alignment" and establishes its optimal convergence rate \(k/(k+1)\), where \(k\) is the number of candidate responses. Building on this, the authors formulate a symmetric multi-player alignment game and combine Nash-equilibrium strategy learning with self-play dynamics to preserve response diversity, circumventing the scaling failure inherent in conventional single-output methods. Theoretically, the proposed method achieves \((k, k/(k+1))\)-robust alignment, whereas post-training approaches such as NLHF can collapse to a single response and fail to benefit from additional samples in multi-response settings.
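The scaling failure described above can be illustrated with a minimal Monte Carlo sketch. This is a toy model of our own (scalar quality scores with a user who prefers the higher score), not the paper's formal preference game: best-of-\(k\) selection from a diverse policy beats a single-output opponent at roughly the \(k/(k+1)\) rate, while a collapsed deterministic policy gains nothing from extra samples.

```python
import random

def win_rate(policy, k, trials=200_000, seed=0):
    """Estimate how often a k-sample policy beats one uniform draw.

    Toy model (an assumption for illustration): each response carries a
    latent quality score in [0, 1] and the user prefers the higher score.
    `policy(rng)` returns one candidate's score; the k-output side shows
    k candidates and the user keeps the best. Ties count as half a win.
    """
    rng = random.Random(seed)
    wins = 0.0
    for _ in range(trials):
        best = max(policy(rng) for _ in range(k))
        opp = rng.random()  # single-output opponent
        if best > opp:
            wins += 1.0
        elif best == opp:
            wins += 0.5
    return wins / trials

diverse = lambda rng: rng.random()  # policy that preserves output diversity
collapsed = lambda rng: 0.5         # deterministic "majority" response

for k in (1, 2, 4, 8):
    print(k, round(win_rate(diverse, k), 3), round(win_rate(collapsed, k), 3))
```

The diverse policy's estimated win rate tracks \(k/(k+1)\) as \(k\) grows, while the collapsed policy stays pinned near \(1/2\) for every \(k\) — extra samples of an identical response are redundant, which is exactly the diversity-collapse failure mode the paper attributes to single-output post-training.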

📝 Abstract
Aligning large language models (LLMs) to serve users with heterogeneous and potentially conflicting preferences is a central challenge for personalized and trustworthy AI. We formalize an ideal notion of universal alignment through test-time scaling: for each prompt, the model produces $k\ge 1$ candidate responses and a user selects their preferred one. We introduce $(k,f(k))$-robust alignment, which requires the $k$-output model to have win rate $f(k)$ against any other single-output model, and asymptotic universal alignment (U-alignment), which requires $f(k)\to 1$ as $k\to\infty$. Our main result characterizes the optimal convergence rate: there exists a family of single-output policies whose $k$-sample product policies achieve U-alignment at rate $f(k)=\frac{k}{k+1}$, and no method can achieve a faster rate in general. We show that popular post-training methods, including Nash learning from human feedback (NLHF), can fundamentally underutilize the benefits of test-time scaling. Even though NLHF is optimal for $k=1$, sampling from the resulting (often deterministic) policy cannot guarantee win rates above $\tfrac{1}{2}$ except for an arbitrarily small slack. This stems from a lack of output diversity: existing alignment methods can collapse to a single majority-preferred response, making additional samples redundant. In contrast, our approach preserves output diversity and achieves the optimal test-time scaling rate. In particular, we propose a family of symmetric multi-player alignment games and prove that any symmetric Nash equilibrium policy of the $(k+1)$-player alignment game achieves the optimal $(k,\frac{k}{k+1})$-robust alignment. Finally, we provide theoretical convergence guarantees for self-play learning dynamics in these games and extend the framework to opponents that also generate multiple responses.
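Under a toy continuous model (i.i.d. quality scores, an illustrative assumption rather than the paper's general preference setting), the \(k/(k+1)\) rate falls out of a one-line calculation. With candidate scores \(X_1,\dots,X_k\) and an opponent draw \(Y\), all i.i.d. Uniform\((0,1)\):

```latex
\Pr\Big[\max_{1\le i\le k} X_i > Y\Big]
  = \int_0^1 \Pr\Big[\max_{1\le i\le k} X_i > y\Big]\,dy
  = \int_0^1 \big(1 - y^k\big)\,dy
  = 1 - \frac{1}{k+1}
  = \frac{k}{k+1}.
```

Ties occur with probability zero, and the rate tends to \(1\) as \(k\to\infty\), matching the optimal rate \(f(k)=\frac{k}{k+1}\) stated in the abstract.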
Problem

Research questions and friction points this paper is trying to address.

universal alignment
test-time scaling
large language models
preference heterogeneity
asymptotic alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

asymptotic universal alignment
test-time scaling
robust alignment
multi-player alignment games
Nash equilibrium