MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Alt-text generation for blind and low-vision users faces several key bottlenecks: high noise in user-provided annotations, weak contextual modeling, and supervised fine-tuning's (SFT) heavy reliance on high-quality labels. To address these, this paper proposes MCM-DPO, a Multifaceted Cross-modal Direct Preference Optimization framework. MCM-DPO is the first to jointly model textual, visual, and cross-modal consistency at the single, pairwise, and multi-preference levels, enabling preference-based learning without precise ground-truth annotations. Built on large vision-language models, it is trained and validated on two newly constructed high-quality datasets, TAlt and PAlt. Extensive experiments show that MCM-DPO consistently outperforms both SFT and standard DPO across automatic and human evaluation metrics, establishing a new state of the art in alt-text generation. The code and datasets are publicly released.

📝 Abstract
The alt-text generation task produces concise, context-relevant descriptions of images, enabling blind and low-vision users to access online images. Despite the capabilities of large vision-language models, alt-text generation performance remains limited due to noisy user annotations, inconsistent standards, and MLLMs' insensitivity to contextual information. Previous efforts to fine-tune MLLMs using supervised fine-tuning (SFT) have struggled, as SFT relies on accurate target annotations, which are often flawed in user-generated alt-text. To address this, we propose Multi-faceted Cross-modal Direct Preference Optimization (MCM-DPO), which improves alt-text generation by learning to identify better options in preference pairs without requiring precise annotations. MCM-DPO optimizes preferences across single, paired, and multi-preference dimensions, covering textual, visual, and cross-modal factors. In light of the scarcity of high-quality annotated and preference-labeled datasets for alt-text, we constructed two large-scale, high-quality datasets named TAlt and PAlt, sourced from Twitter and Pinterest. These datasets include 202k annotated alt-text samples and 18k preference pairs that cover diverse preference dimensions, aiming to support further research in this domain. Experimental results show that our proposed MCM-DPO method consistently outperforms both DPO and SFT, establishing a new state of the art in alt-text generation. We release the code and data here: https://github.com/LVUGAI/MCM-DPO
Problem

Research questions and friction points this paper is trying to address.

Improving alt-text generation for blind users via preference optimization
Addressing noisy annotations and contextual insensitivity in vision-language models
Creating datasets and methods for multifaceted cross-modal preference learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multifaceted Cross-Modal Direct Preference Optimization for alt-text
Optimizes preferences across textual, visual, and cross-modal dimensions
Constructs large-scale datasets TAlt and PAlt for training
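MCM-DPO builds on standard Direct Preference Optimization, which learns from a chosen/rejected pair rather than from a gold annotation. As a point of reference for the baseline the paper extends, here is a minimal sketch of the vanilla DPO loss for one preference pair (not the paper's multifaceted variant); the function name and scalar log-probability inputs are illustrative assumptions:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Vanilla DPO loss for a single preference pair.

    Inputs are the summed token log-probabilities of the chosen and
    rejected alt-text under the policy being trained and under a
    frozen reference model (typically the SFT checkpoint).
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen alt-text than the reference model does, scaled by beta.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): small when the policy ranks the pair correctly.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; as the policy learns to rank the chosen alt-text higher, the loss decreases. MCM-DPO applies this preference signal across textual, visual, and cross-modal dimensions rather than a single pairwise objective.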