ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses Visual Anchor Collapse in multimodal alignment, a failure caused by Likelihood Displacement in Direct Preference Optimization (DPO), and proposes Asymmetric Constrained Preference Optimization (ACPO). ACPO introduces, for the first time, an asymmetric gradient constraint that applies a complexity-aware dynamic scaling factor exclusively to rejected responses within the DPO framework, suppressing their gradient flow while preserving the likelihood stability of chosen responses. This mitigates the suppression of visual information by language priors. Experiments on InternVL models demonstrate that ACPO substantially reverses the chosen-reward degradation observed with standard DPO and consistently outperforms baselines on hallucination benchmarks such as HallusionBench and MM-IFEval, as well as comprehensive multimodal evaluation suites including MMBench, MMStar, and OCRBenchV2, achieving reliable multimodal alignment alongside improved general capabilities.

📝 Abstract
While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods -- a failure we term Visual Anchor Collapse -- causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
Problem

Research questions and friction points this paper addresses.

Likelihood Displacement
Visual Anchor Collapse
Vision-Language Alignment
Multimodal Hallucination
Direct Preference Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Constrained Preference Optimization
Likelihood Displacement
Visual Anchor Collapse
Multimodal Alignment
Gradient Asymmetry