S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

📅 2026-04-20

🤖 AI Summary

This work addresses a limitation of existing vision-language models in multi-image reasoning: they lack global visual search and autonomous cross-image comparison, and they rely too heavily on predefined image indices. To overcome this, the authors propose a Simple-to-Hard (S2H) learning framework that constructs, for the first time, a multi-image preference dataset that generalizes across models. The dataset is organized by prompt complexity into three levels: single-image local reasoning, multi-image local comparison, and global visual search. The framework applies a hardness-aware Direct Preference Optimization (DPO) approach, combining hierarchical prompt design with multi-granularity vision-language alignment so that preference learning progresses from simple to complex tasks. Experiments on LLaVA and Qwen-VL show that S2H improves multi-image reasoning performance over current baselines while preserving strong single-image understanding.
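For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-pair DPO loss. The `weight` argument is a hypothetical hook for scaling a pair by its hardness level; the summary does not say how hardness enters the objective, so this is an illustrative assumption and not the authors' formulation. The function name and default `beta` are likewise assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, weight=1.0):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under the frozen reference model.
    `weight` is a hypothetical per-pair hardness weight (assumption).
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    x = -beta * margin
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return weight * softplus
```

When the policy and reference agree exactly, the margin is zero and the loss equals `log(2)`; it falls below that as the policy shifts probability toward the chosen response.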

📝 Abstract

Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices ("Look at Image 3 and..."), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels, each demanding more capability than the last: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention heuristics, to generate preference pairs, our approach leverages prompt-driven complexity to create chosen/rejected pairs that are applicable across different models. Through extensive evaluations on LLaVA and Qwen-VL models, we show that our diverse multi-image reasoning data yields significant improvements in multi-image reasoning over baseline methods across benchmarks. Importantly, our approach maintains strong single-image reasoning performance while strengthening multi-image understanding, thus advancing the state of the art for holistic visual preference alignment.
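The three reasoning levels can be pictured as a tag on each preference record. The sketch below is illustrative only: the `Level` and `PreferencePair` names, the field layout, and the idea of sorting records from simple to hard before training are assumptions made for this example, not details taken from the paper.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """The three reasoning levels named in the abstract."""
    SINGLE_IMAGE_LOCALIZED = 1
    MULTI_IMAGE_LOCALIZED_COMPARISON = 2
    GLOBAL_VISUAL_SEARCH = 3


@dataclass(frozen=True)
class PreferencePair:
    """One chosen/rejected record (hypothetical schema)."""
    images: tuple   # image paths or IDs
    prompt: str
    chosen: str
    rejected: str
    level: Level


def simple_to_hard(pairs):
    """Order records by level; sorted() is stable, so ties keep input order."""
    return sorted(pairs, key=lambda p: p.level)


# Hypothetical records; prompts and answers are invented for illustration.
pairs = [
    PreferencePair(("a.jpg", "b.jpg", "c.jpg"), "Which image shows a red car?",
                   "The second image.", "The first image.",
                   Level.GLOBAL_VISUAL_SEARCH),
    PreferencePair(("a.jpg",), "What color is the car in Image 1?",
                   "Blue.", "Green.", Level.SINGLE_IMAGE_LOCALIZED),
    PreferencePair(("a.jpg", "b.jpg"), "Is the car in Image 1 larger than in Image 2?",
                   "Yes.", "No.", Level.MULTI_IMAGE_LOCALIZED_COMPARISON),
]
ordered = simple_to_hard(pairs)
```

Only the level-3 prompt omits an image index, which is the distinction the abstract draws between localized reasoning and global visual search.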

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

multi-image reasoning

global visual search

cross-image comparison

preference optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hardness-Aware Preference Optimization

Multi-image Reasoning

Simple-to-Hard Learning