Multi-LLM Collaborative Caption Generation in Scientific Documents

📅 2025-01-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Scientific figure caption generation faces two key challenges: (1) prevailing methods rely on unimodal modeling, either vision-only or text-only, and so fail to capture fine-grained cross-modal semantic alignment; and (2) noisy, inconsistent caption annotations from sources such as arXiv hinder large language model (LLM) training. This paper introduces MLBCAP, a three-stage multi-LLM collaborative framework: (1) a multimodal LLM-based quality assessment module filters low-fidelity training data; (2) multiple expert LLMs, fine-tuned or prompted on the captioning task, generate diverse candidate captions in parallel; and (3) a high-capability LLM (e.g., GPT-4) selects the best candidate and post-edits any remaining inaccuracies. By decoupling data cleansing, diverse generation, and adjudication, the framework avoids the limitations of single-model captioning. Evaluated on scientific documents drawn from arXiv, it performs strongly; human evaluation finds that the informative captions it produces rank above human-written ones.
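The three-stage flow summarized above can be pictured as a small pipeline. The sketch below is illustrative only: the helper names (`score_caption`, `generate_candidates`, `judge`) and the word-count scoring heuristic are placeholder assumptions standing in for the paper's actual multimodal, expert, and judge LLMs.

```python
# Hedged sketch of the MLBCAP three-stage flow. All scoring/selection logic
# here is a placeholder heuristic, not the paper's actual models or prompts.

def score_caption(caption: str) -> float:
    """Stage 1 stand-in: a multimodal LLM (e.g., LLaVA) would rate how well
    a caption matches its figure; here, a trivial word-count heuristic."""
    return min(len(caption.split()) / 20.0, 1.0)

def filter_training_data(pairs, threshold=0.5):
    """Drop (figure, caption) training pairs whose caption scores too low."""
    return [(fig, cap) for fig, cap in pairs if score_caption(cap) >= threshold]

def generate_candidates(figure, context, experts):
    """Stage 2: each expert LLM proposes one candidate caption."""
    return [expert(figure, context) for expert in experts]

def judge(candidates):
    """Stage 3 stand-in: a high-capability LLM would select and post-edit the
    best candidate; here we simply take the highest-scoring one."""
    return max(candidates, key=score_caption)
```

In use, the experts would be separate fine-tuned or prompted LLMs; swapping the placeholder heuristic for real model calls preserves the same pipeline shape.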

📝 Abstract
Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of low-quality captions. (Diverse Caption Generation) We then employ a strategy of fine-tuning/prompting multiple LLMs on the captioning task to generate candidate captions. (Judgment) Lastly, we prompt a prominent LLM to select the highest quality caption from the candidates, followed by refining any remaining inaccuracies. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions, highlighting its effectiveness. Our code is available at https://github.com/teamreboott/MLBCAP
Problem

Research questions and friction points this paper is trying to address.

Generating high-quality captions for scientific figures
Addressing low-quality training data from arXiv papers
Overcoming incomplete information in existing captioning methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLMs assess training data quality
Multiple fine-tuned LLMs generate diverse candidate captions
Prominent LLM selects and refines highest quality caption
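The judgment step in the last bullet can be pictured as assembling a selection-and-refinement prompt for the judge LLM. The instruction wording and input fields below are assumptions for illustration, not the paper's verbatim prompt.

```python
# Illustrative sketch of a judgment prompt for the selecting LLM; the
# wording and fields are assumed, not taken from the paper.

def build_judgment_prompt(figure_mentions: str, candidates: list) -> str:
    """Number the candidate captions and ask the judge LLM to pick the best
    one and refine it, given paper text that mentions the figure."""
    numbered = "\n".join(f"({i}) {c}" for i, c in enumerate(candidates, 1))
    return (
        "You are judging candidate captions for a scientific figure.\n"
        f"Paragraphs mentioning the figure:\n{figure_mentions}\n\n"
        f"Candidate captions:\n{numbered}\n\n"
        "Pick the most accurate and informative candidate by its number, "
        "then rewrite it to correct any remaining inaccuracies."
    )
```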