PatentVision: A multimodal method for drafting patent applications

📅 2025-10-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Patent drafting faces three major challenges: technical complexity, stringent legal compliance requirements, and difficulty in cross-modal alignment between textual descriptions and figures. This paper introduces large vision-language models (LVLMs) to automated patent specification generation for the first time, proposing a domain-specific multimodal generation framework. It employs domain-adaptive fine-tuning to jointly model claims text and accompanying drawings by integrating fine-grained visual understanding with patent-domain knowledge. Experiments demonstrate that our approach significantly outperforms text-only baselines in claim accuracy, completeness of design feature representation, and fidelity of function–structure correspondence—yielding outputs approaching human-drafted quality. Our core contribution is the establishment of the first LVLM paradigm for patent generation that explicitly supports text–figure co-generation, empirically validated for effectiveness and practical utility in real-world patent drafting scenarios.

📝 Abstract
Patent drafting is complex due to its need for detailed technical descriptions, legal compliance, and visual elements. Although Large Vision-Language Models (LVLMs) show promise across various tasks, their application in automating patent writing remains underexplored. In this paper, we present PatentVision, a multimodal framework that integrates textual and visual inputs such as patent claims and drawings to generate complete patent specifications. Built on advanced LVLMs, PatentVision enhances accuracy by combining fine-tuned vision-language models with domain-specific training tailored to patents. Experiments reveal it surpasses text-only methods, producing outputs with greater fidelity and alignment with human-written standards. Its incorporation of visual data allows it to better represent intricate design features and functional connections, leading to richer and more precise results. This study underscores the value of multimodal techniques in patent automation, providing a scalable tool to reduce manual workloads and improve consistency. PatentVision not only advances patent drafting but also lays the groundwork for broader use of LVLMs in specialized areas, potentially transforming intellectual property management and innovation processes.
Problem

Research questions and friction points this paper is trying to address.

Automating patent drafting with multimodal text and visual inputs
Enhancing accuracy through domain-specific vision-language model training
Generating patent specifications with improved fidelity and legal compliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal framework integrates textual and visual inputs
Fine-tuned vision-language models enhance patent accuracy
Generates complete specifications surpassing text-only methods
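PatentVision's internals are not detailed on this page, but the core idea (jointly conditioning an LVLM on claim text and drawings) can be illustrated. The sketch below shows one plausible way to interleave claims and figure references into a chat-style multimodal message list, the common input shape for LVLM APIs; every name, field, and the message format here are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch: assemble a multimodal drafting request pairing
# patent claims with figure references. The message schema mirrors the
# {"type": "text"|"image"} content lists used by common LVLM chat APIs;
# nothing here is taken from the PatentVision implementation.
from dataclasses import dataclass


@dataclass
class DraftingRequest:
    claims: list[str]         # claim texts, in numbered order
    drawing_paths: list[str]  # file paths to the patent figures
    instructions: str = (
        "Draft a full specification consistent with the claims and figures."
    )


def build_multimodal_prompt(req: DraftingRequest) -> list[dict]:
    """Interleave instruction text, claims, and figure references into a
    single user message suitable for a chat-style LVLM endpoint."""
    content = [{"type": "text", "text": req.instructions}]
    for i, claim in enumerate(req.claims, start=1):
        content.append({"type": "text", "text": f"Claim {i}: {claim}"})
    for j, path in enumerate(req.drawing_paths, start=1):
        # Label each image the way patent drawings are cited in the spec.
        content.append({"type": "image", "path": path, "label": f"FIG. {j}"})
    return [{"role": "user", "content": content}]


req = DraftingRequest(
    claims=["A device comprising a housing and a sensor coupled to the housing."],
    drawing_paths=["fig1.png"],
)
messages = build_multimodal_prompt(req)
```

The resulting `messages` list would then be passed to whatever LVLM backend performs the actual specification generation; the paper's contribution lies in the domain-adaptive fine-tuning behind that call, not in the request plumbing shown here.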
Ruo Yang
Samsung Semiconductor, Inc.
Sai Krishna Reddy Mudhiganti
Samsung Semiconductor, Inc.
Manali Sharma
University of California, Los Angeles
Wireless and Wired Communication
Artificial Intelligence