A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model

📅 2026-01-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing watermarking methods for vision-language models often compromise visual-semantic consistency or incur high inference latency due to their sampling mechanisms, making it challenging to simultaneously achieve high fidelity, relevance, and detection efficiency. This work proposes VISA-Mark, a novel framework that integrates, for the first time, a visual-evidence-aware adaptive watermarking mechanism with lightweight prefix tuning. Specifically, a prefix tuner extracts visual evidence weights to dynamically partition the vocabulary and perturb logits, thereby concentrating watermark strength on tokens supported by visual content. This approach maintains efficient inference while significantly improving visual consistency (7.8% gain in Chair-I), watermark detectability (AUC of 96.88%), and robustness against attacks (99.3% success rate under adversarial conditions).
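The summary above only sketches the mechanism at a high level. Below is a minimal, hedged illustration of one plausible reading of it: a green/red vocabulary split (as in standard logit-bias watermarks) whose per-token bias is scaled by a visual-evidence weight, so the watermark push is concentrated on visually supported tokens. All names and parameters (`visual_evidence_weights`, `gamma`, `delta`, the seeding scheme) are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def visa_style_logit_bias(logits, visual_evidence_weights, seed, gamma=0.5, delta=2.0):
    """Sketch of a visual-evidence-weighted logit perturbation (assumed, not from the paper).

    logits:                  (vocab_size,) next-token logits from the LVLM
    visual_evidence_weights: (vocab_size,) scores in [0, 1]; in the paper these would
                             come from the prefix-tuner, here they are simply an input
    seed:                    pseudo-random key (e.g. a hash of the previous token)
    gamma:                   fraction of the vocabulary placed in the green list
    delta:                   maximum watermark bias added to green-list logits
    """
    vocab_size = logits.shape[0]
    gen = torch.Generator().manual_seed(seed)

    # Pseudo-random vocabulary partition (green list), as in conventional logit-bias watermarks.
    perm = torch.randperm(vocab_size, generator=gen)
    green_mask = torch.zeros(vocab_size, dtype=torch.bool)
    green_mask[perm[: int(gamma * vocab_size)]] = True

    # Adaptive perturbation: the bias is scaled by visual support, so tokens without
    # visual evidence receive little or no watermark push.
    bias = delta * visual_evidence_weights * green_mask.float()
    return logits + bias
```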

📝 Abstract
Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases, while some semantic-aware methods incur prohibitive inference latency due to rejection sampling. In this paper, we propose the VIsual Semantic Adaptive Watermark (VISA-Mark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. Our approach employs a lightweight, efficiently trained prefix-tuner to extract dynamic Visual-Evidence Weights, which quantify the evidentiary support for candidate tokens based on the visual input. These weights guide an adaptive vocabulary partitioning and logits perturbation mechanism, concentrating watermark strength specifically on visually-supported tokens. By actively aligning the watermark with visual evidence, VISA-Mark effectively maintains visual fidelity. Empirical results confirm that VISA-Mark outperforms conventional methods with a 7.8% improvement in visual consistency (Chair-I) and superior semantic fidelity. The framework maintains highly competitive detection accuracy (96.88% AUC) and robust attack resilience (99.3%) without sacrificing inference efficiency, establishing a new standard for reliability-preserving multimodal watermarking.
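The abstract reports detection accuracy (96.88% AUC) but does not spell out the detector. A conventional green-list detector, which the reported AUC is presumably computed from, could look like the minimal sketch below; `green_mask_fn` is an assumed helper that re-derives the same pseudo-random partition used at generation time, and `gamma` is the assumed green-list fraction.

```python
import math

def detect_watermark(token_ids, green_mask_fn, gamma=0.5):
    """Sketch of standard green-list detection (assumed, not the paper's exact detector):
    count how many generated tokens fall in the re-derivable green list and compute a
    one-proportion z-statistic."""
    green_hits = 0
    total = 0
    for prev, cur in zip(token_ids[:-1], token_ids[1:]):
        green_mask = green_mask_fn(prev)   # boolean mask over the vocabulary, keyed on context
        green_hits += int(green_mask[cur])
        total += 1

    # Under the null hypothesis (no watermark), green hits ~ Binomial(total, gamma).
    z = (green_hits - gamma * total) / math.sqrt(total * gamma * (1 - gamma))
    return z  # a large positive z indicates the text is likely watermarked
```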
Problem

Research questions and friction points this paper is trying to address.

watermarking
Large Vision-Language Models
visual fidelity
semantic consistency
inference latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Semantic Watermarking
Prefix-Tuning
Visual-Evidence Weight
Adaptive Vocabulary Partitioning
Multimodal Watermarking