PEDANTIC: A Dataset for the Automatic Examination of Definiteness in Patent Claims

📅 2025-05-27
🤖 AI Summary
Ambiguity in patent claims, governed by 35 U.S.C. § 112(b), is among the most frequent causes of U.S. patent application rejection, yet the absence of annotated datasets has hindered research on automated examination. Method: The authors introduce PEDANTIC, the first large-scale, fine-grained dataset for patent definiteness assessment, comprising 14k claims from NLP-related U.S. patent applications, annotated with reasons for indefiniteness extracted from USPTO office actions. They propose a fully automatic, LLM-driven pipeline for rationale extraction and an LLM-as-Judge evaluation that moves beyond binary classification to fine-grained comparison of reasoning. Contribution/Results: A human validation study confirms the annotations are of high quality. Experiments yield a counterintuitive finding: LLM agents based on Qwen 2.5 32B and 72B struggle to outperform simple logistic regression baselines on definiteness prediction, even though they often identify the correct underlying reasons. The work establishes a foundational resource and methodology for interpretable patent text analysis and AI-augmented patent examination.

📝 Abstract
Patent claims define the scope of protection for an invention. If there are ambiguities in a claim, it is rejected by the patent office. In the US, this is referred to as indefiniteness (35 U.S.C. § 112(b)) and is among the most frequent reasons for patent application rejection. The development of automatic methods for patent definiteness examination has the potential to make patent drafting and examination more efficient, but no annotated dataset has been published to date. We introduce PEDANTIC (Patent Definiteness Examination Corpus), a novel dataset of 14k US patent claims from patent applications relating to Natural Language Processing (NLP), annotated with reasons for indefiniteness. We construct PEDANTIC using a fully automatic pipeline that retrieves office action documents from the USPTO and uses Large Language Models (LLMs) to extract the reasons for indefiniteness. A human validation study confirms the pipeline's accuracy in generating high-quality annotations. To gain insight beyond binary classification metrics, we implement an LLM-as-Judge evaluation that compares the free-form reasoning of every model-cited reason with every examiner-cited reason. We show that LLM agents based on Qwen 2.5 32B and 72B struggle to outperform logistic regression baselines on definiteness prediction, even though they often correctly identify the underlying reasons. PEDANTIC provides a valuable resource for patent AI researchers, enabling the development of advanced examination models. We will publicly release the dataset and code.
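To make the baseline comparison concrete: a bag-of-words logistic regression classifier of the kind the abstract refers to can be sketched in pure Python as below. This is an illustrative sketch only, not the paper's implementation; the toy claims, labels, and indefiniteness cue words ("substantially", "about") are invented for demonstration and are not drawn from PEDANTIC.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def featurize(claims):
    """Build a shared vocabulary and bag-of-words count vectors."""
    vocab = sorted({tok for claim in claims for tok in tokenize(claim)})
    X = []
    for claim in claims:
        counts = Counter(tokenize(claim))
        X.append([counts.get(tok, 0) for tok in vocab])
    return X, vocab

def train_logreg(X, y, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the logistic loss w.r.t. z
            b -= lr * g
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
    return w, b

def predict(w, b, x):
    """1 = predicted indefinite, 0 = predicted definite."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1 if z >= 0 else 0

# Invented toy claims; 1 = indefinite (relative terms), 0 = definite.
claims = [
    "a device comprising a substantially flat surface",
    "a method wherein the module is configured to parse text",
    "the system of claim 1 wherein the threshold is about optimal",
    "a processor configured to tokenize the input sequence",
]
labels = [1, 0, 1, 0]

X, vocab = featurize(claims)
w, b = train_logreg(X, labels)
# On this small, linearly separable toy set the model fits the training labels.
preds = [predict(w, b, x) for x in X]
```

Even such a shallow model can latch onto surface cues like relative terms, which is one plausible explanation for why it is a hard baseline to beat on binary definiteness prediction.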
Problem

Research questions and friction points this paper is trying to address.

Detecting ambiguities in patent claims automatically
Providing annotated dataset for patent definiteness examination
Evaluating LLM performance on patent claim analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully automatic dataset-construction pipeline retrieving USPTO office actions
LLMs extract indefiniteness reasons from office action documents
LLM-as-Judge evaluation compares model and examiner reasons
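The LLM-as-Judge evaluation described above compares every model-cited reason against every examiner-cited reason. The pairwise matching logic can be sketched as follows; this is an assumed reconstruction, and the `stub_judge` function stands in for an actual LLM call, using token overlap purely so the sketch runs end to end. The example reasons are invented for illustration.

```python
def token_set(text):
    return set(text.lower().split())

def stub_judge(model_reason, examiner_reason, threshold=0.4):
    """Stand-in for an LLM judgment: Jaccard token overlap.
    A real implementation would prompt an LLM to decide whether
    the two free-form reasons describe the same indefiniteness issue."""
    a, b = token_set(model_reason), token_set(examiner_reason)
    return len(a & b) / len(a | b) >= threshold

def pairwise_match(model_reasons, examiner_reasons, judge=stub_judge):
    """Match reasons pairwise; return (precision, recall) over reasons."""
    matched_model, matched_exam = set(), set()
    for i, m in enumerate(model_reasons):
        for j, e in enumerate(examiner_reasons):
            if judge(m, e):
                matched_model.add(i)
                matched_exam.add(j)
    precision = len(matched_model) / len(model_reasons) if model_reasons else 0.0
    recall = len(matched_exam) / len(examiner_reasons) if examiner_reasons else 0.0
    return precision, recall

# Invented example: the model cites two reasons, the examiner one.
model_reasons = [
    "the term substantially flat is a relative term without a standard",
    "antecedent basis missing for the module",
]
examiner_reasons = [
    "the relative term substantially flat lacks a standard for measuring the degree",
]
precision, recall = pairwise_match(model_reasons, examiner_reasons)
```

Scoring each model-examiner reason pair, rather than a single indefinite/definite label, is what lets the evaluation credit models that identify the right underlying issue even when their binary prediction is wrong.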
Valentin Knappich
Bosch Center for AI
Annemarie Friedrich
University of Augsburg, Germany
Computational Linguistics · Natural Language Processing
Anna Hätty
Bosch Center for AI
S. Razniewski
ScaDS.AI & TU Dresden