PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks

πŸ“… 2026-01-08
πŸ›οΈ Proceedings of the Natural Legal Language Processing Workshop 2025
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ€– AI Summary
This study addresses the absence of a systematic evaluation benchmark for large language models (LLMs) in patent law reasoning. The authors construct the first benchmark centered on decisions from the U.S. Patent Trial and Appeal Board (PTAB), aligning PTAB rulings with USPTO patent data to formulate three structured classification tasks grounded in the IRAC legal analysis framework: issue type, cited authority, and sub-decision. The benchmark enables multidimensional evaluation across input variations, model families, and error analyses, offering a comprehensive assessment of both open- and closed-source LLMs. Experimental results reveal a substantial performance gap: closed-source models consistently exceed a Micro-F1 score of 0.75 on the issue-type task, whereas the strongest open-source model, Qwen-8B, attains only about 0.56, highlighting significant limitations in current models' capacity for patent-related legal reasoning.

πŸ“ Abstract
The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1 score, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot-bench.
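The Micro-F1 metric reported above pools true positives, false positives, and false negatives across all classes before computing precision and recall. A minimal sketch of that computation for a single-label classification task is below; the label names are hypothetical illustrations, not labels from the benchmark itself.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for single-label classification.

    TP/FP/FN are pooled over all classes; in the single-label case
    every wrong prediction counts as one FP (for the predicted class)
    and one FN (for the gold class), so micro-F1 equals accuracy.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == p)
    fp = len(pred) - tp
    fn = len(gold) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical Issue Type labels (illustrative only)
gold = ["obviousness", "novelty", "obviousness", "eligibility"]
pred = ["obviousness", "obviousness", "obviousness", "eligibility"]
print(round(micro_f1(gold, pred), 2))  # 0.75
```

In practice the same number can be obtained with `sklearn.metrics.f1_score(gold, pred, average="micro")`; the hand-rolled version just makes the pooling explicit.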
Problem

Research questions and friction points this paper is trying to address.

legal reasoning
patent domain
benchmark
large language models
PTAB
Innovation

Methods, ideas, or system contributions that make the work stand out.

legal reasoning
patent domain
IRAC-aligned tasks
LLM benchmark
PTAB
Yehoon Jang
Major in Industrial Data Science & Engineering, Department of Industrial and Data Engineering, Pukyong National University
Chaewon Lee
Major in Industrial Data Science & Engineering, Department of Industrial and Data Engineering, Pukyong National University
Hyun-seok Min
Tomocube
Machine Learning, Deep Learning, Medical Image Analysis, Image Technology
Sungchul Choi
Pukyong National University
Machine Learning, Deep Learning, Technology Analysis, Patent Analysis