🤖 AI Summary
This work addresses the challenge of simultaneously preserving author-specific writing style and ensuring high-quality caption generation for scientific figures. We propose a multimodal large language model–based style transfer method that integrates authors’ historical texts, fine-grained stylistic features (e.g., terminology preferences and syntactic structures), and paper metadata (domain, task, figure type) into an end-to-end personalized captioning framework. To our knowledge, this is the first study to identify and formalize the intrinsic trade-off between stylistic fidelity and caption informativeness and accuracy. We mitigate this tension via an author profile enhancement mechanism. Evaluated on the 3rd SciCap Challenge, our approach achieves a +23.6% improvement in style similarity without compromising caption quality, demonstrating strong practical utility for automated scientific image understanding and AI-assisted scholarly writing systems.
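As a rough illustration of the idea described above, the sketch below assembles a style-conditioned prompt for a multimodal LLM from an author profile (historical captions and terminology preferences) plus paper metadata. All names, data structures, and the prompt wording here are hypothetical; the summary does not specify the actual prompt format or profile schema used in the paper.

```python
from dataclasses import dataclass

# Hypothetical schemas; the actual SciCap profile format may differ.
@dataclass
class AuthorProfile:
    past_captions: list[str]          # author's historical caption texts
    preferred_terms: dict[str, str]   # generic term -> author's preferred wording

@dataclass
class FigureMetadata:
    domain: str       # e.g. "NLP"
    task: str         # e.g. "machine translation"
    figure_type: str  # e.g. "line chart"

def build_style_prompt(profile: AuthorProfile,
                       meta: FigureMetadata,
                       mention_text: str,
                       n_examples: int = 3) -> str:
    """Assemble a style-conditioned captioning prompt for an MLLM.

    Combines a few of the author's past captions (few-shot style
    exemplars), their terminology preferences, and figure metadata.
    """
    examples = "\n".join(f"- {c}" for c in profile.past_captions[:n_examples])
    terms = "; ".join(f'prefer "{v}" over "{k}"'
                      for k, v in profile.preferred_terms.items())
    return (
        f"You are captioning a {meta.figure_type} from a {meta.domain} "
        f"paper on {meta.task}.\n"
        f"Match the author's style, shown in these past captions:\n"
        f"{examples}\n"
        f"Terminology preferences: {terms}\n"
        f"The figure is referenced in the text as: {mention_text}\n"
        f"Write one caption in the author's voice."
    )
```

In a real system, the returned prompt would be sent to the MLLM together with the figure image; richer profiles (e.g., syntactic-structure features) could be serialized into the same prompt in additional sections.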
📝 Abstract
We study personalized figure caption generation using author profile data from scientific papers. Our experiments demonstrate that rich author profile data, combined with relevant metadata, can significantly improve the personalization performance of multimodal large language models. However, we also reveal a fundamental trade-off between matching author style and maintaining caption quality. Our findings offer valuable insights and future directions for developing practical caption automation systems that balance both objectives. This work was conducted as part of the 3rd SciCap Challenge.