POPEN: Preference-Based Optimization and Ensemble for LVLM-Based Reasoning Segmentation

📅 2025-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the co-occurring problems of imprecise segmentation and text hallucination in large vision-language models (LVLMs) used for reasoning segmentation, this paper proposes POPEN, a preference-driven optimization and ensemble framework. Methodologically, it (1) introduces a curriculum-learning-based strategy for constructing segmentation preference data; (2) designs a preference optimization loss that explicitly targets segmentation quality; and (3) adds an inference-time ensemble that fuses multiple LVLM outputs through an attention mechanism driven by preference scores. The framework unifies preference learning, curriculum learning, multimodal alignment optimization, and LVLM fine-tuning. Evaluated on reasoning segmentation benchmarks, it achieves state-of-the-art performance: segmentation accuracy is significantly improved, and text hallucination rates are reduced by over 35% compared to baselines such as LISA and PixelLM, while both localization precision and linguistic reliability remain high.

📝 Abstract
Existing LVLM-based reasoning segmentation methods often suffer from imprecise segmentation results and hallucinations in their text responses. This paper introduces POPEN, a novel framework designed to address these issues and achieve improved results. POPEN includes a preference-based optimization method to fine-tune the LVLM, aligning it more closely with human preferences and thereby generating better text responses and segmentation results. Additionally, POPEN introduces a preference-based ensemble method for inference, which integrates multiple outputs from the LVLM using a preference-score-based attention mechanism for refinement. To better adapt to the segmentation task, we incorporate several task-specific designs in our POPEN framework, including a new approach for collecting segmentation preference data with a curriculum learning mechanism, and a novel preference optimization loss to refine the segmentation capability of the LVLM. Experiments demonstrate that our method achieves state-of-the-art performance in reasoning segmentation, exhibiting minimal hallucination in text responses and the highest segmentation accuracy compared to previous advanced methods like LISA and PixelLM. Project page: https://lanyunzhu.site/POPEN/
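
The abstract does not spell out how the preference-score-based attention combines the LVLM's outputs, so the following is only a minimal sketch of one plausible reading: candidate masks are averaged with softmax weights derived from their preference scores. The function names, the `temperature` and `threshold` parameters, and the pixel-wise weighted average are all assumptions for illustration, not the paper's actual mechanism.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw preference scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_masks(mask_probs, preference_scores, threshold=0.5):
    """Fuse K candidate masks into one (hypothetical fusion rule).

    mask_probs: list of K masks, each a list of rows of per-pixel
        foreground probabilities in [0, 1].
    preference_scores: list of K scalars, higher meaning more preferred.
    Returns the fused probability map and its thresholded binary mask.
    """
    weights = softmax(preference_scores)
    height, width = len(mask_probs[0]), len(mask_probs[0][0])
    fused = [
        [
            sum(w * mask[r][c] for w, mask in zip(weights, mask_probs))
            for c in range(width)
        ]
        for r in range(height)
    ]
    binary = [[1 if p >= threshold else 0 for p in row] for row in fused]
    return fused, binary


# Two disagreeing 1x2 candidates; the first has the higher preference score,
# so it dominates the fused result.
candidates = [[[0.9, 0.1]], [[0.1, 0.9]]]
fused, binary = ensemble_masks(candidates, preference_scores=[2.0, 0.0])
print(binary)
```

In this reading, a candidate with a much higher preference score effectively overrides the others, while candidates with similar scores are blended evenly.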
Problem

Research questions and friction points this paper is trying to address.

Improves imprecise segmentation in LVLM-based methods
Reduces hallucinations in LVLM text responses
Enhances segmentation accuracy via preference-based optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preference-based optimization for LVLM fine-tuning
Preference-score-based ensemble for output refinement
Task-specific designs for improved segmentation accuracy
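
The summary above does not give the form of POPEN's preference optimization loss or its data collection procedure. As a rough illustration of the general idea only, the sketch below builds a preference pair by ranking two candidate masks by IoU against the ground truth, orders pairs from easy to hard as a stand-in for the curriculum, and scores a pair with the standard DPO objective. The standard DPO loss is swapped in here in place of the paper's own loss, and every name (`build_preference_pair`, `curriculum_order`, `dpo_loss`, `beta`) is hypothetical.

```python
import math


def iou(mask_a, mask_b):
    """Intersection over union of two binary masks (lists of rows of 0/1)."""
    pairs = [(a, b) for ra, rb in zip(mask_a, mask_b) for a, b in zip(ra, rb)]
    inter = sum(a & b for a, b in pairs)
    union = sum(a | b for a, b in pairs)
    return inter / union if union else 1.0


def build_preference_pair(candidate_a, candidate_b, ground_truth):
    """Label the candidate with higher IoU as chosen, the other as rejected."""
    iou_a, iou_b = iou(candidate_a, ground_truth), iou(candidate_b, ground_truth)
    if iou_a >= iou_b:
        return {"chosen": candidate_a, "rejected": candidate_b, "gap": iou_a - iou_b}
    return {"chosen": candidate_b, "rejected": candidate_a, "gap": iou_b - iou_a}


def curriculum_order(pairs):
    """Assumed curriculum: large IoU gaps (easy pairs) first, small gaps last."""
    return sorted(pairs, key=lambda p: p["gap"], reverse=True)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


ground_truth = [[1, 1, 0]]
pair = build_preference_pair([[1, 0, 0]], [[1, 1, 0]], ground_truth)
print(pair["gap"])
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls below log 2 once the policy assigns relatively more probability to the chosen output than the reference model does, which is the behavior any preference-based fine-tuning objective of this family rewards.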