PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures

📅 2025-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of domain-specific models and datasets for automatic patent figure captioning, this paper introduces PatentDesc-355K, a large-scale, publicly available multimodal dataset of patent figures paired with textual descriptions, and proposes PatentLMM, a multimodal large language model dedicated to this task. Methodologically, PatentLMM comprises two key components: (1) PatentMME, a domain-adapted vision encoder designed to capture the structured technical features of patent drawings; and (2) PatentLLaMA, a LLaMA-based language model domain-adapted by fine-tuning on a large collection of patent documents. Experimental results demonstrate that PatentLMM significantly outperforms similar-sized general-purpose multimodal models in caption accuracy, technical consistency, and readability. The dataset and code are publicly released, establishing foundational resources for intelligent patent processing.
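The two-component design described above follows the now-common pattern of projecting vision-encoder features into the language model's embedding space before autoregressive decoding. A minimal NumPy sketch of that pattern is below; all names, dimensions, and the single linear projection are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not from the paper).
num_patches, vision_dim = 196, 1024   # PatentMME-style patch features
text_len, llm_dim = 32, 4096          # PatentLLaMA-style token embeddings

# 1) Vision encoder output: one feature vector per image patch.
patch_feats = rng.standard_normal((num_patches, vision_dim))

# 2) Alignment: a learned linear projection maps vision features
#    into the language model's embedding space.
W_proj = rng.standard_normal((vision_dim, llm_dim)) * 0.01
visual_tokens = patch_feats @ W_proj           # shape (196, 4096)

# 3) Embeddings of the text prompt, e.g. "Describe this patent figure."
text_tokens = rng.standard_normal((text_len, llm_dim))

# 4) The LLM consumes the visual tokens prepended to the prompt and
#    autoregressively decodes the figure description.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (228, 4096)
```

The sketch only shows the interface between the two components: once vision features live in the LLM's embedding space, caption generation reduces to standard conditioned decoding.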

📝 Abstract
Writing comprehensive and accurate descriptions of technical drawings in patent documents is crucial to effective knowledge sharing and enabling the replication and protection of intellectual property. However, automation of this task has been largely overlooked by the research community. To this end, we introduce PatentDesc-355K, a novel large-scale dataset containing ~355K patent figures along with their brief and detailed textual descriptions extracted from more than 60K US patent documents. In addition, we propose PatentLMM, a novel multimodal large language model specifically tailored to generate high-quality descriptions of patent figures. Our proposed PatentLMM comprises two key components: (i) PatentMME, a specialized multimodal vision encoder that captures the unique structural elements of patent figures, and (ii) PatentLLaMA, a domain-adapted version of LLaMA fine-tuned on a large collection of patents. Extensive experiments demonstrate that training a vision encoder specifically designed for patent figures significantly boosts performance, generating more coherent descriptions than fine-tuning similar-sized off-the-shelf multimodal models. PatentDesc-355K and PatentLMM pave the way for automating the understanding of patent figures, enabling efficient knowledge sharing and faster drafting of patent documents. We make the code and data publicly available.
Problem

Research questions and friction points this paper is trying to address.

patent diagrams
automatic description
machine learning models
Innovation

Methods, ideas, or system contributions that make the work stand out.

PatentLMM
PatentMME
PatentLLaMA