🤖 AI Summary
Current Video-LLMs suffer from a pervasive “likelihood shift” problem in DPO-based preference alignment: the log-likelihoods of both winning and losing responses often decrease synchronously, inadvertently increasing the probability of non-target responses—exacerbated by video redundancy. To address this, we propose LeanPO, a reference-free preference optimization framework grounded in likelihood alignment. First, we redefine the implicit reward as the policy model’s average token-level log-likelihood over the response. Second, we introduce a reward-confidence-coupled self-reflection mechanism for generating high-quality preference data. Third, we design a dynamic label-smoothing strategy tailored to video-specific noise. Evaluated across multiple benchmarks, LeanPO significantly improves Video-LLMs’ response faithfulness, human preference alignment, and overall performance—while incurring minimal training overhead and demonstrating strong generalization.
📝 Abstract
Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO~\citep{rafailov2024dpo}, to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta(y_w \mid x)$ and $\log \pi_\theta(y_l \mid x)$ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose \emph{Lean Preference Optimization} (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness-correlated, self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead. Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward reliable and efficient Video-LLMs.
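The two central ideas in the abstract — an implicit reward defined as the policy's average token-level log-likelihood (no reference model), and a label-smoothed preference objective — can be sketched in plain Python. The function names, the margin scale `beta`, and the smoothing weight `eps` below are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
import math

def avg_log_likelihood(token_logprobs):
    """Implicit reward: mean per-token log-likelihood of a response
    under the policy model alone (reference-free)."""
    return sum(token_logprobs) / len(token_logprobs)

def leanpo_style_loss(logps_w, logps_l, beta=2.0, eps=0.1):
    """Sketch of a reference-free preference loss with label smoothing.

    logps_w / logps_l: per-token log-probs of the winning / losing
    response under the policy. beta scales the reward margin; eps
    softens the hard win/lose label (both values are assumptions;
    the paper's dynamic smoothing would adapt eps per sample).
    """
    r_w = avg_log_likelihood(logps_w)
    r_l = avg_log_likelihood(logps_l)
    margin = beta * (r_w - r_l)
    # sigmoid of the scaled margin: probability that y_w beats y_l
    p = 1.0 / (1.0 + math.exp(-margin))
    # label-smoothed binary cross-entropy on the preference label
    return -(1.0 - eps) * math.log(p) - eps * math.log(1.0 - p)

# Toy usage: the winning response has a higher average likelihood,
# so the loss is small; swapping the pair makes it larger.
loss = leanpo_style_loss([-0.5, -0.4, -0.6], [-1.2, -1.5, -1.0])
```

Because the reward is a length-normalized likelihood, increasing it directly raises $\log \pi_\theta(y_w \mid x)$ itself, which is how this formulation avoids the synchronized likelihood drop described above.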