Token-Level Inference-Time Alignment for Vision-Language Models

📅 2025-10-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer from hallucinations stemming from text-image misalignment, and existing alignment methods rely either on costly preference annotations for fine-tuning or on coarse-grained, delayed sequence-level feedback. This paper proposes TITA, a test-time alignment framework that freezes the base VLM and trains only a lightweight reward model. For the first time, it brings the principle of Direct Preference Optimization (DPO) into inference, extracting implicit preference signals from token-level log-probability ratios between the reward model and the frozen VLM to enable fine-grained, real-time calibration without tuning the backbone. TITA is compatible with mainstream VLMs including LLaVA, Qwen2.5-VL, and DeepSeek-VL2. It achieves average improvements of 8.6% on MMVet and 6.7% on POPE with negligible inference overhead, suppressing hallucinations while generalizing across diverse VLM architectures and benchmarks.
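Concretely, the "implicit preference signal" admits a compact form: under the standard DPO view, the per-token reward is a scaled log-probability ratio between the reward model and the frozen VLM. The formula below is a sketch under that assumption; the symbol names (β scale, π_r for the reward model, π_base for the frozen VLM) are ours, not the paper's.

```latex
% Implicit per-token reward as a scaled log-probability ratio.
% Notation (ours): x text prompt, v image, y_{<t} tokens generated so far.
r_t = \beta \, \log \frac{\pi_r\left(y_t \mid x, v, y_{<t}\right)}
                         {\pi_{\mathrm{base}}\left(y_t \mid x, v, y_{<t}\right)}
```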

๐Ÿ“ Abstract
Vision-Language Models (VLMs) have become essential backbones of modern multimodal intelligence, yet their outputs remain prone to hallucination: plausible text misaligned with visual inputs. Existing alignment approaches often rely on expensive fine-tuning with annotated preference data or sequence-level inference strategies that provide only coarse, delayed feedback. To overcome these limitations, we present TITA (Token-level Inference-Time Alignment), a lightweight framework that freezes the base VLM and instead trains a reward model to approximate its distribution. During inference, implicit preference signals are extracted as log-probability ratios between the reward model and the target VLM, yielding dense autoregressive feedback. This formulation can be viewed as an inference-time variant of Direct Preference Optimization (DPO), providing token-level corrective signals without retraining the backbone. Extensive evaluations on LLaVA-1.5-7B and 13B show consistent gains across 12 benchmarks, with improvements of 8.6% on MMVet and 6.7% on POPE, indicating stronger general understanding and reduced hallucinations. Additional experiments on Qwen2.5-VL-7B and DeepSeek-VL2-27.5B show comparable gains, especially in hallucination reduction and VQA accuracy, while incurring negligible inference overhead.
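To make the mechanism concrete, here is a minimal decoding-loop sketch of token-level log-ratio guidance with a frozen backbone. It illustrates the general technique under our own assumptions, not TITA's published procedure: the HF-style `model(ids).logits` interface, the `beta` weight, and the `top_k` pruning are all hypothetical, and a real VLM forward pass would also take image inputs.

```python
import torch

@torch.no_grad()
def guided_decode(base_model, reward_model, input_ids, max_new_tokens=64,
                  beta=1.0, top_k=50, eos_token_id=None):
    """Greedy decoding with token-level log-prob-ratio guidance.

    Both models stay frozen; at each step the base model's log-probs are
    shifted by beta * (log p_reward - log p_base), i.e. the implicit
    DPO-style per-token reward, restricted to the base model's top-k
    candidates for efficiency. (Image inputs omitted for brevity; a real
    VLM call would also take pixel features. Assumes batch size 1.)
    """
    ids = input_ids
    for _ in range(max_new_tokens):
        base_logp = torch.log_softmax(base_model(ids).logits[:, -1], dim=-1)
        rew_logp = torch.log_softmax(reward_model(ids).logits[:, -1], dim=-1)

        # Only rescore the base model's top-k candidate tokens.
        cand_logp, cand_ids = base_logp.topk(top_k, dim=-1)
        ratio = rew_logp.gather(-1, cand_ids) - cand_logp  # log-prob ratio
        scores = cand_logp + beta * ratio                  # guided scores

        next_id = cand_ids.gather(-1, scores.argmax(-1, keepdim=True))
        ids = torch.cat([ids, next_id], dim=-1)
        if eos_token_id is not None and next_id.item() == eos_token_id:
            break
    return ids
```

Setting `beta = 0` recovers plain greedy decoding from the frozen backbone, which makes the guidance strength easy to ablate.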
Problem

Research questions and friction points this paper is trying to address.

How to reduce hallucinations in VLM outputs without costly, preference-annotated fine-tuning
How to deliver token-level corrective signals without retraining the backbone
How to keep alignment lightweight, with minimal inference overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level inference-time alignment for VLMs
Freezes base model, trains lightweight reward model
Provides dense autoregressive feedback without retraining
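On the "trains lightweight reward model" point, the phrase "inference-time variant of DPO" suggests the reward model could be trained with a standard DPO objective against the frozen VLM as the reference policy. The snippet below is a hedged sketch of that objective under this assumption; the function name, arguments, and `beta` default are ours, not the paper's.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: the trainable reward model plays the role of
    the policy, the frozen base VLM the role of the reference.

    Each argument is the summed token log-probability of a full response
    under the respective model, shape [batch].
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Implicit rewards are beta-scaled log-ratios; the loss pushes the
    # chosen response's reward above the rejected one's.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```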