ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of coarse-grained, sequence-level rewards in reinforcement learning for long-form image captioning. Such rewards fail to distinguish factual errors from missing information and obscure the tradeoff between factuality and coverage. The authors propose ClaimDiff-RL, a framework that uses verifiable, type-annotated atomic claim differences as fine-grained reward units. A multimodal judge compares the actor caption against a reference caption, verifies each difference against the image, assigns an error type, and assesses severity. This separates hallucination from information omission, so each can be measured and tuned independently. Experiments on a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks show that ClaimDiff-RL improves the balance between hallucinations and missing facts while preserving general capability, and that it surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition.

📝 Abstract

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination by increasing missing facts, while ClaimDiff-RL exposes this faithfulness-coverage tradeoff and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinations and missing facts, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
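
The abstract does not give the reward formula, so the following is only a minimal sketch of how per-difference statistics might be composed into a scalar reward with separately tunable penalties. Everything in it is an assumption made for illustration: the `ClaimDiff` record, the two `kind` values, the integer severity scale, the weights `w_hallucination` and `w_missing`, and the linear penalty itself are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ClaimDiff:
    """One judged difference between actor and reference caption (hypothetical schema)."""
    kind: str        # assumed values: "hallucination" or "missing"
    verified: bool   # True if the judge confirmed the difference against the image
    error_type: str  # open-vocabulary label, e.g. "object counting"
    severity: int    # assumed scale: 1 (minor) to 3 (severe)


def compose_reward(
    diffs: Iterable[ClaimDiff],
    w_hallucination: float = 1.0,
    w_missing: float = 1.0,
) -> float:
    """Assumed linear penalty: weighted severity of verified differences, negated."""
    diffs = list(diffs)  # allow generators; we iterate twice
    hallucination_cost = sum(
        d.severity for d in diffs if d.verified and d.kind == "hallucination"
    )
    missing_cost = sum(
        d.severity for d in diffs if d.verified and d.kind == "missing"
    )
    return -(w_hallucination * hallucination_cost + w_missing * missing_cost)


# Illustrative, made-up judge output for one caption.
example_diffs = [
    ClaimDiff("hallucination", True, "object counting", 3),
    ClaimDiff("missing", True, "spatial relation", 1),
    ClaimDiff("hallucination", False, "scene recognition", 2),  # not verified, ignored
]

balanced = compose_reward(example_diffs)
coverage_lenient = compose_reward(example_diffs, w_missing=0.5)
```

Keeping the two costs as separate terms is what lets the weights move the operating point: lowering `w_missing` tolerates omissions, while raising `w_hallucination` penalizes unfaithful claims more heavily. A single holistic score offers no such control.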

Problem

Research questions and friction points this paper is trying to address.

reward granularity

visual claims

hallucination

coverage

long-form image captioning

Innovation

Methods, ideas, or system contributions that make the work stand out.

ClaimDiff-RL

fine-grained reinforcement learning

visual claim comparison