AI Summary
Vision-language models (VLMs) suffer from hallucinations stemming from text-image misalignment; existing alignment methods rely on costly preference annotations for fine-tuning, or on coarse-grained, delayed feedback at inference. This paper proposes TITA, a test-time alignment framework that freezes the base VLM and trains only a lightweight reward model. For the first time, it brings the principle of Direct Preference Optimization (DPO) into inference, extracting implicit preference signals from token-level log-probability ratios to enable fine-grained, real-time, tuning-free dynamic calibration. TITA is compatible with mainstream VLMs including LLaVA, Qwen2.5-VL, and DeepSeek-VL2. It achieves average improvements of 8.6% on MMVet and 6.7% on POPE with negligible inference overhead, effectively suppressing hallucinations while generalizing across diverse VLM architectures and benchmarks.
Abstract
Vision-Language Models (VLMs) have become essential backbones of modern multimodal intelligence, yet their outputs remain prone to hallucination: plausible text that is misaligned with the visual input. Existing alignment approaches often rely on expensive fine-tuning with annotated preference data, or on sequence-level inference strategies that provide only coarse, delayed feedback. To overcome these limitations, we present TITA (Token-level Inference-Time Alignment), a lightweight framework that freezes the base VLM and instead trains a reward model to approximate its distribution. During inference, implicit preference signals are extracted as log-probability ratios between the reward model and the target VLM, yielding dense autoregressive feedback. This formulation can be viewed as an inference-time variant of Direct Preference Optimization (DPO), providing token-level corrective signals without retraining the backbone. Extensive evaluations on LLaVA-1.5-7B and 13B show consistent gains across 12 benchmarks, with improvements of 8.6% on MMVet and 6.7% on POPE, indicating stronger general understanding and reduced hallucinations. Additional experiments on Qwen2.5-VL-7B and DeepSeek-VL2-27.5B show comparable gains, especially in hallucination reduction and VQA accuracy, while incurring negligible inference overhead.
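To make the mechanism concrete, the sketch below illustrates how a DPO-style implicit per-token reward (the log-probability ratio between a reward model and the frozen base VLM) could be folded into next-token scoring at inference. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names and the `beta` weighting are hypothetical, and both models are stood in for by raw logit vectors over the vocabulary.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def tita_adjusted_scores(base_logits, reward_logits, beta=1.0):
    """Combine base-VLM scores with a token-level implicit reward.

    The implicit reward for each candidate token y_t is the
    DPO-style log-probability ratio
        log pi_r(y_t | x, y_<t) - log pi_base(y_t | x, y_<t),
    added to the base model's log-probabilities at inference time.
    `beta` (an assumed hyperparameter here) scales the correction;
    beta=0 recovers the unmodified base model.
    """
    base_logp = log_softmax(np.asarray(base_logits, dtype=float))
    reward_logp = log_softmax(np.asarray(reward_logits, dtype=float))
    implicit_reward = reward_logp - base_logp
    return base_logp + beta * implicit_reward

# Toy vocabulary of 3 tokens: the base model prefers token 0,
# while the reward model's signal favors token 2.
base = np.array([2.0, 1.0, 0.0])
reward = np.array([0.0, 0.0, 5.0])
print(int(np.argmax(tita_adjusted_scores(base, reward, beta=0.0))))  # base choice
print(int(np.argmax(tita_adjusted_scores(base, reward, beta=1.0))))  # corrected choice
```

The corrected scores would then feed greedy or sampled decoding as usual, which is what makes the feedback dense and autoregressive rather than sequence-level.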