Multi-Objective Task-Aware Predictor for Image-Text Alignment

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key bottlenecks in image-text alignment evaluation, namely the difficulty of modeling multi-dimensional human preferences, weak long-sequence handling, low inference efficiency, and the lack of unified multi-objective scoring, this paper proposes MULTI-TAP: a lightweight, plug-and-play multi-objective task-aware predictor. Built on the hidden states of a frozen LVLM, it combines a reward head for holistic alignment scoring with a ridge regression layer for fine-grained, per-objective scoring, while accommodating diverse contextual descriptions. MULTI-TAP unifies consistency with human judgments, long-sequence robustness, high inference efficiency, and decoupled multi-objective scoring, and transfers across LVLM architectures. The authors also release EYE4ALL, an alignment preference dataset that captures the needs of users including blind and low-vision individuals. Experiments show that MULTI-TAP outperforms VisionREWARD on multi-objective benchmarks and EYE4ALL in both performance and efficiency, and matches the GPT-4o-based G-VEval while using only 7-8B parameters.

📝 Abstract
Evaluating image-text alignment while reflecting human preferences across multiple aspects is a significant challenge for the development of reliable vision-language applications. It becomes especially crucial in real-world scenarios where multiple valid descriptions exist depending on contexts or user needs. However, research progress is hindered by the lack of comprehensive benchmarks and by existing evaluation predictors that lack at least one of these key properties: (1) alignment with human judgments, (2) long-sequence processing, (3) inference efficiency, and (4) applicability to multi-objective scoring. To address these challenges, we propose a plug-and-play architecture to build a robust predictor, MULTI-TAP (Multi-Objective Task-Aware Predictor), capable of both multi- and single-objective scoring. MULTI-TAP can produce a single overall score using a reward head built on top of a large vision-language model (LVLM). We show that MULTI-TAP is robust across different LVLM architectures, achieving significantly higher performance than existing metrics and even performing on par with the GPT-4o-based predictor, G-VEval, at a smaller size (7-8B). By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, MULTI-TAP can produce fine-grained scores for multiple human-interpretable objectives. MULTI-TAP surpasses VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and on our newly released text-image-to-text dataset, EYE4ALL. Our new dataset, consisting of chosen/rejected human preferences (EYE4ALLPref) and human-annotated fine-grained scores across seven dimensions (EYE4ALLMulti), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals.
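The single-overall-score path described in the abstract, a reward head on top of a frozen LVLM, amounts to a small learned map from a pooled hidden vector to a scalar. A minimal sketch with synthetic stand-ins (the hidden dimension and the random weights here are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen LVLM's last-token hidden state for one
# image-text pair (a real system would extract this from the model).
d = 4096
h = rng.standard_normal(d)

# Reward head: a single linear layer mapping the hidden state to one
# overall alignment score. In practice its weights would be learned
# from chosen/rejected preference pairs; here they are random.
w = rng.standard_normal(d) / np.sqrt(d)
b = 0.0

overall_score = float(w @ h + b)
print(overall_score)
```

The point of the sketch is the shape of the computation: because the LVLM stays frozen, only the small head is trained, which is what keeps the predictor lightweight and plug-and-play.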
Problem

Research questions and friction points this paper is trying to address.

Evaluating image-text alignment in a way that reflects human preferences across multiple aspects
Addressing the lack of comprehensive benchmarks for vision-language evaluation predictors
Developing efficient multi-objective scoring for reliable vision-language applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play architecture for multi-objective scoring
Lightweight ridge regression layer on frozen LVLM
Produces fine-grained scores for human-interpretable objectives
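The ridge regression component above can be sketched in closed form: fit a regularized linear map from frozen hidden states to per-dimension scores. This is a minimal illustration on synthetic data; the dimensions, score range, and regularization strength are assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen LVLM hidden states:
# n image-text pairs, hidden dimension d, k score dimensions
# (k = 7 mirrors EYE4ALLMulti's seven annotated dimensions).
n, d, k = 200, 64, 7
X = rng.standard_normal((n, d))

# Hypothetical human-annotated fine-grained scores per dimension.
Y = rng.uniform(1.0, 5.0, size=(n, k))

# Closed-form ridge regression: W = (X^T X + lam * I)^(-1) X^T Y.
# Only this small layer is trained; the LVLM itself stays frozen.
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

scores = X @ W  # one score per human-interpretable objective
print(scores.shape)
```

Because the solution is closed-form and the backbone is never updated, training reduces to a single linear solve, which is where the efficiency advantage over a fully trained multi-objective reward model comes from.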