🤖 AI Summary
This study addresses the absence of a systematic evaluation benchmark for large language models (LLMs) in the domain of patent law reasoning. The authors construct the first benchmark centered on decisions from the U.S. Patent Trial and Appeal Board (PTAB), aligning PTAB rulings with USPTO patent data to formulate three structured classification tasks grounded in the IRAC legal analysis framework: issue type, board authorities, and subdecision. The benchmark enables multidimensional evaluation across input variations, model families, and error analyses, offering a comprehensive assessment of both open- and closed-source LLMs. Experimental results reveal a substantial performance gap: closed-source models consistently exceed a Micro-F1 score of 0.75 on the issue-type task, whereas the strongest open-source model, Qwen-8B, attains only about 0.56, highlighting significant limitations in current models' capacity for patent-related legal reasoning.
📄 Abstract
The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1 score, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot-bench.
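Both the summary and the abstract report results as Micro-F1, which pools true positives, false positives, and false negatives across all classes before computing precision and recall. As a minimal sketch (not the paper's evaluation code; the label strings below are hypothetical stand-ins for issue-type classes), the metric can be computed as follows:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: aggregate TP/FP/FN over all classes,
    then compute precision, recall, and F1 from the pooled counts."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        tp += sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp += sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn += sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical single-label issue-type predictions for four cases.
truth = ["obviousness", "anticipation", "indefiniteness", "anticipation"]
preds = ["obviousness", "anticipation", "obviousness", "anticipation"]
print(micro_f1(truth, preds))  # → 0.75
```

Note that for single-label multiclass classification (one predicted label per case, as in these tasks), Micro-F1 reduces to plain accuracy, since every misclassification contributes exactly one false positive and one false negative.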