Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model?

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a novel approach to improving image captioning in vision-language models (VLMs) without directly endowing a language model with multimodal capabilities. Instead, a unimodal large language model (LLM), operating solely on text, provides preference feedback that guides fine-tuning of the VLM. The study demonstrates, for the first time, that a purely text-based LLM can effectively drive cross-modal preference alignment. Experiments show that the method significantly improves caption quality, yielding up to a 13% gain in absolute accuracy over a baseline multimodal approach, while the LLM's preference judgments agree with human preferences 64.6% of the time. These findings establish a new paradigm for leveraging language models to drive multimodal alignment, highlighting text-only LLMs as scalable and effective supervisors for multimodal learning.

📝 Abstract
To explore a more scalable path for adding multimodal capabilities to existing LLMs, this paper addresses a fundamental question: can a unimodal LLM, relying solely on text, reason about its own informational needs and provide effective feedback to optimize a multimodal model? To answer it, we propose a method that lets a language agent give feedback to a vision-language model (VLM) so that the VLM adapts its text generation to the agent's preferences. Results across several experiments answer the question affirmatively, showing that LLM preference feedback significantly improves VLM descriptions. With the proposed method, the VLM generates multimodal scene descriptions that help the LLM better understand multimodal context, yielding improvements of up to 13% in absolute accuracy over the baseline multimodal approach. A human study further validates the AI-driven feedback, showing a 64.6% preference alignment rate between the LLM's choices and human judgments. Extensive experiments provide insight into how and why the method works, and into its limitations.
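The feedback loop the abstract describes can be sketched at a high level: the VLM samples candidate captions for an image, a text-only LLM judge picks the caption it finds more useful, and the resulting chosen/rejected pairs feed a preference-based fine-tuning step (e.g. DPO-style). The sketch below is a minimal illustration under that assumption; all function names, the toy VLM, and the length-based judge heuristic are hypothetical stand-ins, not the paper's actual implementation.

```python
# Illustrative sketch: collecting preference pairs from a text-only LLM judge.
# toy_vlm and toy_judge are hypothetical stand-ins for real model calls.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PreferencePair:
    image_id: str
    chosen: str    # caption the text-only judge preferred
    rejected: str  # caption the judge rejected

def collect_preferences(
    images: List[str],
    vlm_generate: Callable[[str], Tuple[str, str]],  # image -> two candidate captions
    llm_prefers: Callable[[str, str], int],          # text-only judge: 0 or 1
) -> List[PreferencePair]:
    """For each image, sample two captions from the VLM and let a text-only
    LLM pick the one it finds more informative. The resulting pairs could
    then feed a DPO-style fine-tuning step for the VLM."""
    pairs = []
    for img in images:
        cap_a, cap_b = vlm_generate(img)
        if llm_prefers(cap_a, cap_b) == 0:
            pairs.append(PreferencePair(img, chosen=cap_a, rejected=cap_b))
        else:
            pairs.append(PreferencePair(img, chosen=cap_b, rejected=cap_a))
    return pairs

# Toy stand-ins: the "VLM" emits a terse and a detailed caption, and the
# "judge" prefers the more informative (here, simply longer) one.
def toy_vlm(img: str) -> Tuple[str, str]:
    return (f"a photo ({img})",
            f"a photo of {img} showing its colour, setting, and action")

def toy_judge(a: str, b: str) -> int:
    return 0 if len(a) >= len(b) else 1

pairs = collect_preferences(["dog", "kitchen"], toy_vlm, toy_judge)
```

In a real pipeline the judge would be a prompted LLM comparing which caption better supports a downstream text-only reasoning task, and the pairs would train the VLM with a preference-optimization objective rather than being inspected directly.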
Problem

Research questions and friction points this paper is trying to address.

unimodal language agent
multimodal vision-language model
preference feedback
LLM
VLM
Innovation

Methods, ideas, or system contributions that make the work stand out.

language agent
preference feedback
vision-language model
multimodal adaptation
LLM reasoning