IPBench: Benchmarking the Knowledge of Large Language Models in Intellectual Property

📅 2025-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing IP-domain evaluation benchmarks suffer from narrow coverage and unrealistic task scenarios, which hinders comprehensive assessment of large language models (LLMs) on tasks that combine legal and technical knowledge. Method: We introduce IPBench, the first comprehensive bilingual IP evaluation benchmark, covering eight IP mechanisms and twenty realistic tasks and underpinned by a systematic IP task taxonomy. The dataset is constructed through expert-curated annotation and domain-expert validation, and it supports zero-shot and few-shot evaluation. Contribution/Results: Experiments across 16 state-of-the-art LLMs show that the best-performing model reaches only 75.8% accuracy, and that open-source IP-specialized models substantially underperform general-purpose closed-source models. To foster trustworthy AI research in IP, we fully open-source the dataset, evaluation code, and protocols, with ongoing community-driven updates.

📝 Abstract

Intellectual Property (IP) is a unique domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. As large language models (LLMs) continue to advance, they show great potential for processing IP tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks either focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce the first comprehensive IP task taxonomy and a large, diverse bilingual benchmark, IPBench, covering 8 IP mechanisms and 20 tasks. This benchmark is designed to evaluate LLMs in real-world intellectual property applications, encompassing both understanding and generation. We benchmark 16 LLMs, ranging from general-purpose to domain-specific models, and find that even the best-performing model achieves only 75.8% accuracy, revealing substantial room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. We publicly release all data and code of IPBench and will continue to update it with additional IP-related tasks to better reflect real-world challenges in the intellectual property domain.
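As a rough illustration of how an accuracy figure over a multi-task benchmark can be computed, the sketch below scores predictions per task and overall. It is a minimal sketch, not IPBench's evaluation code: the record fields (`task`, `prediction`, `answer`), the task names, and the case-insensitive exact-match rule are all assumptions made for this example, and the reported 75.8% may be aggregated differently.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return (per-task accuracy, overall accuracy) for exact-match scoring.

    `records` is an iterable of dicts with the hypothetical keys
    'task', 'prediction', and 'answer' (assumed for this sketch).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        # Assumed matching rule: ignore surrounding whitespace and letter case.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[rec["task"]] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    # Micro-average over all items; an empty input yields 0.0.
    n_items = sum(total.values())
    overall = sum(correct.values()) / n_items if n_items else 0.0
    return per_task, overall


# Toy data with made-up task names, for illustration only.
records = [
    {"task": "patent_classification", "prediction": "A", "answer": "A"},
    {"task": "patent_classification", "prediction": "b", "answer": "C"},
    {"task": "trademark_qa", "prediction": " d ", "answer": "D"},
]
per_task, overall = accuracy_by_task(records)
print(per_task, round(overall, 3))
```

The overall score here is a micro-average, so tasks with more items weigh more. Averaging the per-task values instead (a macro-average) would weight every task equally, which can matter when task sizes differ widely.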

Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive benchmarks for LLMs in IP domain

Existing datasets misalign with real-world IP scenarios

Need to evaluate LLMs' performance on diverse IP tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces comprehensive IP task taxonomy

Develops bilingual benchmark IPBench

Evaluates 16 LLMs on IP tasks
