🤖 AI Summary
Automated novelty detection in scientific literature faces challenges due to the exponential growth of publications and the absence of benchmark datasets annotated at the *idea-unit* level. Method: We introduce the first idea-unit–level novelty detection benchmark dataset for marketing and NLP. We propose a relation-closure–based method for extracting coherent idea sets and design a lightweight, LLM-knowledge–distilled idea-level retriever to bridge the semantic gap between surface-level text similarity and conceptual novelty. Our pipeline comprises LLM-generated idea summarization, relation-graph–driven closure construction, idea-level knowledge distillation, and retriever training. Contribution/Results: Evaluated on our newly constructed dual-domain benchmark, our approach significantly outperforms state-of-the-art methods in both idea retrieval and novelty classification. The code and dataset are publicly released.
📝 Abstract
In an era of exponential scientific growth, identifying novel research ideas is crucial and challenging in academia. Despite potential, the lack of an appropriate benchmark dataset hinders the research of novelty detection. More importantly, simply adopting existing NLP technologies, e.g., retrieving and then cross-checking, is not a one-size-fits-all solution due to the gap between textual similarity and idea conception. In this paper, we propose to harness large language models (LLMs) for scientific novelty detection (ND), associated with two new datasets in marketing and NLP domains. To construct the considerate datasets for ND, we propose to extract closure sets of papers based on their relationship, and then summarize their main ideas based on LLMs. To capture idea conception, we propose to train a lightweight retriever by distilling the idea-level knowledge from LLMs to align ideas with similar conception, enabling efficient and accurate idea retrieval for LLM novelty detection. Experiments show our method consistently outperforms others on the proposed benchmark datasets for idea retrieval and ND tasks. Codes and data are available at https://anonymous.4open.science/r/NoveltyDetection-10FB/.