Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning

2026-05-04

AI Summary
This work addresses the Jensen's-inequality bias, and the resulting training instability, that arise in Answer-Level Fine-Tuning (ALFT) when logarithmic rewards are estimated from small mini-batches. The authors generalize the Distributional Alignment Game to arbitrary Bregman divergences and show that, for geometries inducing polynomial rewards, exact unbiased estimators can be constructed from U-statistics. For the canonical KL divergence case, where an exact solution is impossible, they derive a globally robust minimax polynomial estimator that attains the fundamental statistical error limit of $\Theta(1/K^2)$. Combining the two approaches, they propose the Variance-Optimal Augmented Polynomial Optimization Program (AQP) Estimator, which they prove achieves optimal bias and accelerated game convergence with zero online computational overhead, making training more stable and efficient.
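To make the Jensen's-inequality bias concrete, the sketch below computes the exact expectation of the log of a mini-batch mean for a toy random variable. Everything in it is an illustrative assumption rather than the paper's setup: the two-point distribution (values 0.5 and 1.5 with equal probability) and the function names `expected_log_batch_mean` and `jensen_bias` are hypothetical.

```python
import math


def expected_log_batch_mean(k, low=0.5, high=1.5):
    """Exact E[log(mean of k i.i.d. draws)] for a toy two-point variable.

    Assumption: each draw equals `low` or `high` with probability 1/2,
    so the true mean is (low + high) / 2.
    """
    total = 0.0
    for j in range(k + 1):  # j = number of draws equal to `high`
        prob = math.comb(k, j) / 2**k
        batch_mean = (j * high + (k - j) * low) / k
        total += prob * math.log(batch_mean)
    return total


def jensen_bias(k, low=0.5, high=1.5):
    """Bias of log(batch mean) as an estimate of log(true mean)."""
    true_mean = (low + high) / 2
    return expected_log_batch_mean(k, low, high) - math.log(true_mean)


if __name__ == "__main__":
    for k in (1, 2, 4, 16):
        print(k, jensen_bias(k))
```

In this toy case the bias is negative for every batch size and shrinks as the batch grows, which is the systematic underestimation that a plug-in log estimate suffers from on small batches.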
Abstract
The Distributional Alignment Game framework provides a powerful variational perspective on Answer-Level Fine-Tuning (ALFT). However, standard algorithms for these games rely on estimating logarithmic rewards from small batches, introducing a systematic bias due to Jensen's inequality that can destabilize training. In this paper, we systematically resolve this structural estimation bias. First, we generalize the alignment game to arbitrary Bregman divergences, showing that for a family of geometries inducing polynomial rewards, we can construct provably exact and unbiased estimators using U-statistics. Second, for the canonical KL divergence game where an exact solution is impossible, we derive a globally robust minimax polynomial estimator that is provably optimal, achieving the fundamental statistical error limit of $\Theta(1/K^2)$, which we establish via the Ditzian-Totik theorem. Finally, we synthesize these two approaches to propose a novel Variance-Optimal Augmented Polynomial Optimization Program (AQP) Estimator, proving that by systematically reducing variance, our method achieves not only optimal bias but also provably accelerated game convergence, leading to more efficient and stable training with zero online computational overhead.
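As a rough illustration of why polynomial rewards admit exact unbiased estimation with U-statistics, the sketch below uses a standard textbook fact: if S of K i.i.d. Bernoulli(p) draws are successes, then C(S, m) / C(K, m) is an unbiased estimate of p^m, whereas the plug-in (S/K)^m is biased. The Bernoulli model and the names `u_stat_power`, `plug_in_power`, and `expectation` are assumptions made for this example; this is not the paper's estimator.

```python
import math


def u_stat_power(successes, k, m):
    """U-statistic estimate of p**m: fraction of size-m subsets that are all successes."""
    return math.comb(successes, m) / math.comb(k, m)


def plug_in_power(successes, k, m):
    """Naive plug-in estimate of p**m (biased for m >= 2)."""
    return (successes / k) ** m


def expectation(estimator, k, p):
    """Exact expectation of estimator(S) where S ~ Binomial(k, p)."""
    return sum(
        math.comb(k, s) * p**s * (1 - p) ** (k - s) * estimator(s)
        for s in range(k + 1)
    )


if __name__ == "__main__":
    k, p, m = 8, 0.3, 2
    print("target p**m      :", p**m)
    print("E[U-statistic]   :", expectation(lambda s: u_stat_power(s, k, m), k, p))
    print("E[plug-in]       :", expectation(lambda s: plug_in_power(s, k, m), k, p))
```

The expectation is computed by exact enumeration over the binomial distribution rather than by sampling, so the comparison does not depend on a random seed. For m = 2 the plug-in estimate overshoots by p(1 - p)/K, while the U-statistic matches p^2 up to floating-point error.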
Problem

Research questions and friction points this paper is trying to address.

Answer-Level Fine-Tuning
Distributional Alignment
Estimation Bias
Jensen's Inequality
Unbiased Estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributional Alignment Games
Bregman Divergence
U-statistics
Minimax Polynomial Estimator
Variance-Optimal Estimation