Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model?

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a novel approach to improving image captioning in vision-language models (VLMs) without directly endowing a language model with multimodal capabilities. Instead, a unimodal large language model (LLM), operating solely on text, provides preference feedback that guides fine-tuning of the VLM. The study demonstrates, for the first time, that a purely text-based LLM can effectively drive cross-modal preference alignment. Experiments show that the method significantly improves caption quality, yielding up to a 13% gain in absolute accuracy over a baseline multimodal approach, while the LLM's preference judgments agree with human preferences 64.6% of the time. These findings establish a new paradigm for leveraging language models to drive multimodal alignment, highlighting text-only LLMs as scalable and effective supervisors for multimodal learning.

📝 Abstract
To explore a more scalable path for adding multimodal capabilities to existing LLMs, this paper addresses a fundamental question: can a unimodal LLM, relying solely on text, reason about its own informational needs and provide effective feedback to optimize a multimodal model? To answer it, we propose a method that lets a language agent give feedback to a vision-language model (VLM) so that the VLM adapts its text generation to the agent's preferences. Results across several experiments answer the question affirmatively, showing that LLM preference feedback significantly improves VLM descriptions. With the proposed method, the VLM generates multimodal scene descriptions that help the LLM better understand multimodal context, yielding improvements of up to 13% in absolute accuracy over the baseline multimodal approach. A human study further validates the AI-driven feedback, showing a 64.6% preference alignment rate between the LLM's choices and human judgments. Extensive experiments provide insight into how and why the method works, and into its limitations.
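The feedback loop the abstract describes can be sketched at a high level: the VLM samples candidate captions for an image, a text-only LLM judge picks the caption it finds more useful, and the resulting chosen/rejected pairs feed a preference-based fine-tuning step (e.g. DPO-style). The sketch below is a minimal illustration under that assumption; all function names, the toy VLM, and the length-based judge heuristic are hypothetical stand-ins, not the paper's actual implementation.

```python
# Illustrative sketch: collecting preference pairs from a text-only LLM judge.
# toy_vlm and toy_judge are hypothetical stand-ins for real model calls.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PreferencePair:
    image_id: str
    chosen: str    # caption the text-only judge preferred
    rejected: str  # caption the judge rejected

def collect_preferences(
    images: List[str],
    vlm_generate: Callable[[str], Tuple[str, str]],  # image -> two candidate captions
    llm_prefers: Callable[[str, str], int],          # text-only judge: 0 or 1
) -> List[PreferencePair]:
    """For each image, sample two captions from the VLM and let a text-only
    LLM pick the one it finds more informative. The resulting pairs could
    then feed a DPO-style fine-tuning step for the VLM."""
    pairs = []
    for img in images:
        cap_a, cap_b = vlm_generate(img)
        if llm_prefers(cap_a, cap_b) == 0:
            pairs.append(PreferencePair(img, chosen=cap_a, rejected=cap_b))
        else:
            pairs.append(PreferencePair(img, chosen=cap_b, rejected=cap_a))
    return pairs

# Toy stand-ins: the "VLM" emits a terse and a detailed caption, and the
# "judge" prefers the more informative (here, simply longer) one.
def toy_vlm(img: str) -> Tuple[str, str]:
    return (f"a photo ({img})",
            f"a photo of {img} showing its colour, setting, and action")

def toy_judge(a: str, b: str) -> int:
    return 0 if len(a) >= len(b) else 1

pairs = collect_preferences(["dog", "kitchen"], toy_vlm, toy_judge)
```

In a real pipeline the judge would be a prompted LLM comparing which caption better supports a downstream text-only reasoning task, and the pairs would train the VLM with a preference-optimization objective rather than being inspected directly.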
Problem

Research questions and friction points this paper is trying to address.

unimodal language agent
multimodal vision-language model
preference feedback
LLM
VLM
Innovation

Methods, ideas, or system contributions that make the work stand out.

language agent
preference feedback
vision-language model
multimodal adaptation
LLM reasoning