Token-Level Inference-Time Alignment for Vision-Language Models

📅 2025-10-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer from hallucinations stemming from text-image misalignment, and existing alignment methods rely either on costly preference annotations for fine-tuning or on coarse-grained, delayed sequence-level feedback. This paper proposes TITA, a test-time alignment framework that freezes the base VLM and trains only a lightweight reward model. For the first time, it brings the principle of Direct Preference Optimization (DPO) into inference, extracting implicit preference signals from token-level log-probability ratios between the reward model and the frozen VLM to enable fine-grained, real-time calibration without tuning the backbone. TITA is compatible with mainstream VLMs including LLaVA, Qwen2.5-VL, and DeepSeek-VL2. It achieves average improvements of 8.6% on MMVet and 6.7% on POPE with negligible inference overhead, suppressing hallucinations while generalizing across diverse VLM architectures and benchmarks.
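Concretely, the "implicit preference signal" admits a compact form: under the standard DPO view, the per-token reward is a scaled log-probability ratio between the reward model and the frozen VLM. The formula below is a sketch under that assumption; the symbol names (β scale, π_r for the reward model, π_base for the frozen VLM) are ours, not the paper's.

```latex
% Implicit per-token reward as a scaled log-probability ratio.
% Notation (ours): x text prompt, v image, y_{<t} tokens generated so far.
r_t = \beta \, \log \frac{\pi_r\left(y_t \mid x, v, y_{<t}\right)}
                         {\pi_{\mathrm{base}}\left(y_t \mid x, v, y_{<t}\right)}
```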

๐Ÿ“ Abstract
Vision-Language Models (VLMs) have become essential backbones of modern multimodal intelligence, yet their outputs remain prone to hallucination: plausible text misaligned with visual inputs. Existing alignment approaches often rely on expensive fine-tuning with annotated preference data or sequence-level inference strategies that provide only coarse, delayed feedback. To overcome these limitations, we present TITA (Token-level Inference-Time Alignment), a lightweight framework that freezes the base VLM and instead trains a reward model to approximate its distribution. During inference, implicit preference signals are extracted as log-probability ratios between the reward model and the target VLM, yielding dense autoregressive feedback. This formulation can be viewed as an inference-time variant of Direct Preference Optimization (DPO), providing token-level corrective signals without retraining the backbone. Extensive evaluations on LLaVA-1.5-7B and 13B show consistent gains across 12 benchmarks, with improvements of 8.6% on MMVet and 6.7% on POPE, indicating stronger general understanding and reduced hallucinations. Additional experiments on Qwen2.5-VL-7B and DeepSeek-VL2-27.5B show comparable gains, especially in hallucination reduction and VQA accuracy, while incurring negligible inference overhead.
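To make the mechanism concrete, here is a minimal decoding-loop sketch of token-level log-ratio guidance with a frozen backbone. It illustrates the general technique under our own assumptions, not TITA's published procedure: the HF-style `model(ids).logits` interface, the `beta` weight, and the `top_k` pruning are all hypothetical, and a real VLM forward pass would also take image inputs.

```python
import torch

@torch.no_grad()
def guided_decode(base_model, reward_model, input_ids, max_new_tokens=64,
                  beta=1.0, top_k=50, eos_token_id=None):
    """Greedy decoding with token-level log-prob-ratio guidance.

    Both models stay frozen; at each step the base model's log-probs are
    shifted by beta * (log p_reward - log p_base), i.e. the implicit
    DPO-style per-token reward, restricted to the base model's top-k
    candidates for efficiency. (Image inputs omitted for brevity; a real
    VLM call would also take pixel features. Assumes batch size 1.)
    """
    ids = input_ids
    for _ in range(max_new_tokens):
        base_logp = torch.log_softmax(base_model(ids).logits[:, -1], dim=-1)
        rew_logp = torch.log_softmax(reward_model(ids).logits[:, -1], dim=-1)

        # Only rescore the base model's top-k candidate tokens.
        cand_logp, cand_ids = base_logp.topk(top_k, dim=-1)
        ratio = rew_logp.gather(-1, cand_ids) - cand_logp  # log-prob ratio
        scores = cand_logp + beta * ratio                  # guided scores

        next_id = cand_ids.gather(-1, scores.argmax(-1, keepdim=True))
        ids = torch.cat([ids, next_id], dim=-1)
        if eos_token_id is not None and next_id.item() == eos_token_id:
            break
    return ids
```

Setting `beta = 0` recovers plain greedy decoding from the frozen backbone, which makes the guidance strength easy to ablate.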
Problem

Research questions and friction points this paper is trying to address.

How to reduce hallucinations in VLM outputs without costly, preference-annotated fine-tuning
How to deliver token-level corrective signals without retraining the backbone
How to keep alignment lightweight, with minimal inference overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-level inference-time alignment for VLMs
Freezes base model, trains lightweight reward model
Provides dense autoregressive feedback without retraining
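On the "trains lightweight reward model" point, the phrase "inference-time variant of DPO" suggests the reward model could be trained with a standard DPO objective against the frozen VLM as the reference policy. The snippet below is a hedged sketch of that objective under this assumption; the function name, arguments, and `beta` default are ours, not the paper's.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: the trainable reward model plays the role of
    the policy, the frozen base VLM the role of the reference.

    Each argument is the summed token log-probability of a full response
    under the respective model, shape [batch].
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Implicit rewards are beta-scaled log-ratios; the loss pushes the
    # chosen response's reward above the rejected one's.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```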