🤖 AI Summary
Existing captioning methods for scientific figures are trained on figure-caption pairs extracted from documents, many of which are misaligned with reader preferences along dimensions such as helpfulness, explainability, and visual descriptiveness. This paper introduces FigCaps-HF, a framework for figure-caption generation that incorporates domain-expert feedback, using reinforcement learning with human feedback (RLHF) to optimize captions for reader preferences. Key contributions: (1) an RLHF framework for scientific figure captioning; (2) an automatic method for evaluating the quality of figure-caption pairs; and (3) a large-scale benchmark dataset of figure-caption pairs annotated with human feedback. With BLIP as the base model, the RLHF framework achieves mean gains of 35.7%, 16.9%, and 9.0% in ROUGE, BLEU, and METEOR over standard fine-tuning. The benchmark dataset is publicly released.
📝 Abstract
Captions are crucial for understanding scientific visualizations and documents. Existing captioning methods for scientific figures rely on figure-caption pairs extracted from documents for training, many of which fall short on metrics such as helpfulness, explainability, and visual descriptiveness [15], leading to generated captions that are misaligned with reader preferences. To enable the generation of high-quality figure captions, we introduce FigCaps-HF, a new framework for figure-caption generation that incorporates domain-expert feedback to produce captions optimized for reader preferences. Our framework comprises 1) an automatic method for evaluating the quality of figure-caption pairs and 2) a novel reinforcement learning with human feedback (RLHF) method to optimize a generative figure-to-caption model for reader preferences. We demonstrate the effectiveness of our simple learning framework by improving performance over standard fine-tuning across different types of models. In particular, with BLIP as the base model, our RLHF framework achieves mean gains of 35.7%, 16.9%, and 9.0% in ROUGE, BLEU, and METEOR, respectively. Finally, we release a large-scale benchmark dataset with human feedback on figure-caption pairs to enable further evaluation and development of RLHF techniques for this problem.
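To make the two-part framework concrete, here is a minimal sketch of the simplest RLHF variant consistent with the abstract: a learned quality evaluator scores each figure-caption pair, and those scores re-weight the captioning model's per-example loss so that preferred captions contribute more to the gradient. All names here (`quality_score`, `reward_weighted_loss`) are illustrative assumptions; the abstract does not specify the evaluator's architecture or the exact RLHF objective, and the toy scorer below merely stands in for the real learned model.

```python
def quality_score(caption: str) -> float:
    """Hypothetical stand-in for the learned figure-caption quality
    evaluator. This toy rule favors longer, more descriptive captions;
    the real evaluator would be a model trained on human feedback."""
    return min(len(caption.split()) / 20.0, 1.0)


def reward_weighted_loss(captions, base_losses):
    """Re-weight each caption's base (e.g. cross-entropy) loss by its
    predicted quality, so training emphasizes reader-preferred captions."""
    weights = [quality_score(c) for c in captions]
    total = sum(weights) or 1.0  # guard against all-zero scores
    return sum(w * l for w, l in zip(weights, base_losses)) / total


# Toy usage: a terse caption vs. a visually descriptive one.
captions = [
    "A plot.",
    "Line chart of accuracy versus training epochs for three models.",
]
print(reward_weighted_loss(captions, [2.0, 1.0]))
```

In practice the weighting would be applied inside the fine-tuning loop of the base captioning model (e.g. BLIP), but the core idea, scoring pairs and letting the score modulate the learning signal, is captured by this small function.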