🤖 AI Summary
This work addresses the high cost, subjectivity, and limited scalability of human evaluation of research idea novelty, as well as the absence of a unified benchmark for automated assessment methods. To this end, we introduce RINoBench, the first large-scale benchmark for novelty judgment, comprising 1,381 expert-annotated research ideas together with nine automatic evaluation metrics. We systematically evaluate large language models (LLMs) on both scoring accuracy and reasoning plausibility. Our experiments reveal that while LLM-generated rationales closely resemble human reasoning, their novelty judgments deviate significantly from human gold-standard assessments, exposing a critical disconnect between reasoning processes and actual judgment capabilities in current models. RINoBench provides a standardized, scalable framework for evaluating research idea novelty.
📄 Abstract
Judging the novelty of research ideas is crucial for advancing science, enabling the identification of unexplored directions, and ensuring contributions meaningfully extend existing knowledge rather than reiterate minor variations. However, given the exponential growth of scientific literature, manually judging the novelty of research ideas through literature reviews is labor-intensive, subjective, and infeasible at scale. Therefore, recent efforts have proposed automated approaches for research idea novelty judgment. Yet, evaluation of these approaches remains largely inconsistent and is typically based on non-standardized human evaluations, hindering large-scale, comparable assessments. To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments. It comprises 1,381 research ideas, each derived from and judged by human experts, as well as nine automated evaluation metrics designed to assess both rubric-based novelty scores and textual justifications of novelty judgments. Using this benchmark, we evaluate several state-of-the-art large language models (LLMs) on their ability to judge the novelty of research ideas. Our findings reveal that while LLM-generated reasoning closely mirrors human rationales, this alignment does not reliably translate into accurate novelty judgments, which diverge significantly from human gold-standard judgments, even among leading reasoning-capable models. Data and code are available at: https://github.com/TimSchopf/RINoBench.