TLDR: Token-Level Detective Reward Model for Large Vision Language Models

📅 2024-10-07

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Existing reward models for multimodal large language models (MLLMs) provide only coarse-grained binary feedback, which limits fine-grained evaluation of long-form text and can worsen image-text misalignment because the reward model leans too heavily on text. To address this, we propose TLDR, a token-level, fine-grained reward model that judges each text token in a "detective" style. TLDR is trained on hard negative samples generated by targeted perturbations, with token-level labels produced automatically, so its feedback is interpretable at the token level. Our method combines perturbation-based data generation, token-level supervised training, multimodal alignment modeling, and likelihood-based optimization guidance. Experiments show that TLDR significantly improves base model performance, helps models self-correct their generations, serves as a hallucination evaluation tool, and speeds up human annotation by 3 times, which accelerates curation of high-quality VLM data.

📝 Abstract

Although reward models have been successful in improving multimodal large language models, the reward models themselves remain crude and contain minimal information. Notably, existing reward models mimic human annotations by assigning a single binary feedback to any text, no matter how long the text is. In the realm of multimodal language models, where models are required to process both images and texts, a naive reward model may learn implicit biases toward texts and become less grounded in images. In this paper, we propose a $\textbf{T}$oken-$\textbf{L}$evel $\textbf{D}$etective $\textbf{R}$eward Model ($\textbf{TLDR}$) to provide fine-grained annotations to each text token. We first introduce a perturbation-based method to generate synthetic hard negatives and their token-level labels to train TLDR models. Then we show the usefulness of TLDR models both in assisting off-the-shelf models to self-correct their generations and in serving as a hallucination evaluation tool. We show that TLDR automatically trains a token-level likelihood optimization and can improve the base model's performance significantly. Finally, we show that TLDR models can speed up human annotation by 3 times, helping to acquire a broader range of high-quality vision language data.
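
To make the perturbation idea concrete, here is a minimal sketch of how token-level labels could be derived once a caption has been perturbed. It is an illustration under stated assumptions, not the paper's pipeline: the function name `token_labels`, whitespace tokenization, and the diff-based labeling rule are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def token_labels(original: str, perturbed: str) -> list[tuple[str, int]]:
    """Label each token of the perturbed caption.

    1 = token also appears in the aligned original (assumed correct),
    0 = token was introduced by the perturbation (assumed wrong).
    Whitespace tokenization is an assumption made for this sketch.
    """
    orig_tokens = original.split()
    pert_tokens = perturbed.split()

    # Start by flagging every token, then clear the ones that align
    # with the original caption.
    labels = [0] * len(pert_tokens)
    matcher = SequenceMatcher(a=orig_tokens, b=pert_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for j in range(block.b, block.b + block.size):
            labels[j] = 1

    return list(zip(pert_tokens, labels))


if __name__ == "__main__":
    pairs = token_labels(
        "A brown dog sits on the grass",
        "A brown cat sits on the grass",
    )
    for token, label in pairs:
        print(f"{token}\t{label}")
```

In this toy pair only the swapped word is flagged, which is the property that makes such a negative "hard": the rest of the sentence is still consistent with the image.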

Problem

Research questions and friction points this paper is trying to address.

Improves token-level feedback in vision language models

Reduces biases in multimodal reward models

Accelerates human annotation for vision language data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-Level Detective Reward Model

Perturbation-based synthetic negatives

Hallucination evaluation tool enhancement
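
As a rough illustration of how token-level predictions could serve as a hallucination evaluation tool, the sketch below rolls per-token labels up into rates at three granularities. The function name `hallucination_rates`, the nested-list input format, and the choice of granularities are assumptions made for this example; the summary above does not specify them.

```python
def hallucination_rates(responses: list[list[list[int]]]) -> dict[str, float]:
    """Aggregate token labels into hallucination rates.

    `responses` is a list of responses; each response is a list of
    sentences; each sentence is a list of token labels where
    1 = token judged correct and 0 = token flagged as hallucinated.
    This input layout is hypothetical.
    """
    total_tokens = flagged_tokens = 0
    total_sentences = flagged_sentences = 0
    flagged_responses = 0

    for response in responses:
        response_flagged = False
        for sentence in response:
            bad = sum(1 for label in sentence if label == 0)
            total_tokens += len(sentence)
            flagged_tokens += bad
            total_sentences += 1
            if bad > 0:
                # One flagged token is enough to flag the sentence.
                flagged_sentences += 1
                response_flagged = True
        if response_flagged:
            flagged_responses += 1

    def ratio(numerator: int, denominator: int) -> float:
        # Guard against empty input instead of dividing by zero.
        return numerator / denominator if denominator else 0.0

    return {
        "token": ratio(flagged_tokens, total_tokens),
        "sentence": ratio(flagged_sentences, total_sentences),
        "response": ratio(flagged_responses, len(responses)),
    }
```

A single flagged token marks its whole sentence and response as hallucinated here, so the coarser rates are always at least as high as the token-level one.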

🔎 Similar Papers

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling