🤖 AI Summary
Current Video-LLMs suffer from a pervasive “likelihood shift” problem in DPO-based preference alignment: the log-likelihoods of both winning and losing responses often decrease synchronously, inadvertently increasing the probability of non-target responses—exacerbated by video redundancy. To address this, we propose LeanPO, a reference-free preference optimization framework grounded in likelihood alignment. First, we redefine the implicit reward as the policy model’s average token-level log-likelihood over the response. Second, we introduce a reward-confidence-coupled self-reflection mechanism for generating high-quality preference data. Third, we design a dynamic label-smoothing strategy tailored to video-specific noise. Evaluated across multiple benchmarks, LeanPO significantly improves Video-LLMs’ response faithfulness, human preference alignment, and overall performance—while incurring minimal training overhead and demonstrating strong generalization.
📝 Abstract
Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO~\citep{rafailov2024dpo}, to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose \emph{Lean Preference Optimization} (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness-correlated, self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead. Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward reliable and efficient Video-LLMs.
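The two central ideas in the abstract — an implicit reward defined as the policy's average token-level log-likelihood (no reference model), and a label-smoothed preference objective — can be sketched in plain Python. The function names, the margin scale `beta`, and the smoothing weight `eps` below are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
import math

def avg_log_likelihood(token_logprobs):
    """Implicit reward: mean per-token log-likelihood of a response
    under the policy model alone (reference-free)."""
    return sum(token_logprobs) / len(token_logprobs)

def leanpo_style_loss(logps_w, logps_l, beta=2.0, eps=0.1):
    """Sketch of a reference-free preference loss with label smoothing.

    logps_w / logps_l: per-token log-probs of the winning / losing
    response under the policy. beta scales the reward margin; eps
    softens the hard win/lose label (both values are assumptions;
    the paper's dynamic smoothing would adapt eps per sample).
    """
    r_w = avg_log_likelihood(logps_w)
    r_l = avg_log_likelihood(logps_l)
    margin = beta * (r_w - r_l)
    # sigmoid of the scaled margin: probability that y_w beats y_l
    p = 1.0 / (1.0 + math.exp(-margin))
    # label-smoothed binary cross-entropy on the preference label
    return -(1.0 - eps) * math.log(p) - eps * math.log(1.0 - p)

# Toy usage: the winning response has a higher average likelihood,
# so the loss is small; swapping the pair makes it larger.
loss = leanpo_style_loss([-0.5, -0.4, -0.6], [-1.2, -1.5, -1.0])
```

Because the reward is a length-normalized likelihood, increasing it directly raises $\log \pi_\theta(y_w \mid x)$ itself, which is how this formulation avoids the synchronized likelihood drop described above.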