Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from hallucinations caused by modality-dependent biases and over-reliance on memorized training data, compromising factual consistency. To address this, we propose a training-free, tri-layer contrastive decoding framework. First, a watermark-based probing question automatically evaluates the visual grounding capability of each decoding layer to identify the optimal *pivot layer*, the one with the strongest vision-language alignment. Outputs from the mature layer, the amateur layer, and the pivot layer are then jointly leveraged in a discrepancy-driven contrastive decoding process that suppresses hallucinations and strengthens visual grounding. Key innovations include: (i) the first use of watermarking for layer selection, and (ii) a lightweight, plug-and-play tri-layer contrastive inference architecture. Extensive evaluation on the POPE, MME, and AMBER benchmarks demonstrates significant hallucination reduction, improved factual accuracy, and enhanced multimodal consistency, achieved without fine-tuning or additional parameters.

📝 Abstract
Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations -- they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding method with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME, and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.
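The third step above combines per-layer next-token distributions contrastively. The paper does not publish its exact formula here, so the following is only a minimal sketch of the general recipe: subtract the amateur layer's logits to suppress tokens it over-prefers, and add the watermark-selected pivot layer's logits to reward visually grounded tokens. The weights `alpha` and `beta`, the toy logits, and the function names are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def tri_layer_contrastive_logits(mature, amateur, pivot, alpha=0.5, beta=0.5):
    # mature - alpha * amateur suppresses tokens the shallow (amateur) layer
    # over-prefers; adding beta * pivot boosts the visually grounded layer.
    # alpha and beta are illustrative weights, not values from the paper.
    return [m - alpha * a + beta * p for m, a, p in zip(mature, amateur, pivot)]

# Toy 4-token vocabulary: next-token logits taken from three decoding layers.
mature  = [2.0, 1.0, 0.5, 0.1]
amateur = [1.8, 0.2, 0.4, 0.1]   # hallucination-prone early layer
pivot   = [0.5, 1.5, 0.3, 0.2]   # watermark-selected, visually grounded layer

probs = softmax(tri_layer_contrastive_logits(mature, amateur, pivot))
print(probs.index(max(probs)))   # -> 1
```

In this toy run the token the amateur layer over-scores (token 0) is penalized, and the pivot layer tips the decision toward token 1, illustrating how the discrepancy-driven combination can steer decoding toward visually grounded tokens.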
Problem

Research questions and friction points this paper is trying to address.

Reducing hallucinations in large vision-language models
Improving visual grounding of multimodal model outputs
Enhancing factuality through contrastive decoding techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free tri-layer contrastive decoding method
Watermark-guided pivot layer identification
Discrepancy-driven tri-layer contrastive decoding that suppresses hallucinations without fine-tuning
Kyungryul Back
CSE, Korea University
Seongbeom Park
CSE, Korea University
Milim Kim
CSE, Korea University
Mincheol Kwon
CSE, Korea University
SangHyeok Lee
CSE, Korea University
Hyunyoung Lee
KT Corporation
Junhee Cho
KT Corporation
Seunghyun Park
Soongsil University
Jinkyu Kim
CSE, Korea University