VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

📅 2026-05-27
🤖 AI Summary
Vision-language models often omit or hallucinate content in captions because existing reward signals are too coarse for reliable factual verification. This work proposes VCap, a Witness-Adjudicator reward framework that treats the reference caption as a “witness” and the visual signal as an “adjudicator.” By verifying factual consistency between the reference and the generated caption, grounded in the visual signal, it delivers a reinforcement learning reward with what the authors call hypergeometric-distribution-level precision. Because the reward remains effective with imperfect references, it supports weak-to-strong generalization in RL training. An 8B-parameter multimodal language model trained with VCap outperforms open- and closed-source state-of-the-art models on multiple image and video captioning benchmarks, and human evaluation confirms its alignment with factual correctness. The authors also report improved perceptual capability, cross-task generalization, and results that surpass best-of-N distillation.
📝 Abstract
Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data. Recently, RL has emerged as a key route to driving MLLMs toward higher precision and broader coverage. However, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.
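The abstract does not spell out how the reward is computed, so the following is only a toy sketch of the witness/adjudicator idea, not the paper's formulation. It assumes captions have already been split into atomic fact strings and that a hypothetical `adjudicate` callable can check a single fact against the visual input; the F1-style scoring and every name here are illustrative assumptions.

```python
from typing import Callable, Set


def witness_adjudicator_reward(
    policy_facts: Set[str],
    reference_facts: Set[str],
    adjudicate: Callable[[str], bool],
) -> float:
    """Toy reward (assumption, not VCap itself): F1 over adjudicator-verified facts."""
    # Keep only the facts the adjudicator confirms against the visual input.
    verified_policy = {f for f in policy_facts if adjudicate(f)}
    verified_reference = {f for f in reference_facts if adjudicate(f)}

    # Precision drops when the policy caption states unverified (hallucinated) facts.
    precision = len(verified_policy) / len(policy_facts) if policy_facts else 0.0

    # Recall drops when the policy caption omits verified facts. The pool merges
    # verified facts from both captions, so a wrong reference fact is never required.
    verified_pool = verified_policy | verified_reference
    recall = len(verified_policy) / len(verified_pool) if verified_pool else 0.0

    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the reference is imperfect (it claims "a cat" and misses the ball).
visible = {"a dog", "a red ball", "grass"}
reference = {"a dog", "a cat"}
policy = {"a dog", "a red ball", "a frisbee"}

reward_policy = witness_adjudicator_reward(policy, reference, lambda f: f in visible)
reward_copy = witness_adjudicator_reward(reference, reference, lambda f: f in visible)
```

In this toy setup, a caption that simply copies the imperfect reference scores lower than one that adds a correct detail the reference missed. That is one way to picture how a reward grounded in the visual signal could let a policy improve beyond its references.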
Problem

Research questions and friction points this paper is trying to address.

visual captioning
reward design
factual consistency
reinforcement learning
hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

VCap
Witness-Adjudicator reward
hypergeometric reward
visual captioning
reinforcement learning
Xingyu Lu
Tsinghua University
Large Language Model, Multimodal Language Model, Reinforcement Learning, Recommendation System
Jinpeng Wang
Harbin Institute of Technology, Shenzhen
Yi-Fan Zhang
Institute of Automation, Chinese Academy of Sciences
Computer Vision, Multimodality, Alignment, Machine Learning
Yankai Yang
Kuaishou Technology
Yancheng Long
Kuaishou Technology
Yiyang Fan
Kuaishou Technology
Xuanyu Zheng
Kuaishou Technology
Haonan Fan
Kuaishou Technology
Kaiyu Jiang
Kuaishou
MLLM
Tianke Zhang
Tsinghua University; Kuaishou Technology
Computer Vision, Neuro-Linguistic Programming
Changyi Liu
Kuaishou Technology
Bin Wen
Kuaishou
MLLM
Fan Yang
Kuaishou Technology
Tingting Gao
Kuaishou Technology
Han Li
Kuaishou Technology
Chun Yuan
Graduate School at Shenzhen, Tsinghua University
Computer vision, multimedia access control