🤖 AI Summary
This study addresses the challenge of auditing citations in academic manuscripts efficiently and at scale for relevance, accuracy, timeliness, and ethical compliance. To this end, the authors propose CitePrism, a transparent hybrid decision-support framework that integrates contextual reasoning from large language models, semantic similarity computation, metadata validation, and human review. Notably, the framework incorporates a human-in-the-loop feedback mechanism into the citation auditing pipeline for the first time and introduces a configurable, multi-signal three-tier review process with tunable thresholds, balancing conservative screening with editorial control. In a preliminary validation on a single case-study manuscript with 104 references, the system reached a Cohen's kappa of 0.429 against human binary relevance labels and, at a threshold of τ = 17, flagged all human-labeled irrelevant citations while also producing false positives that require analyst review. The results suggest potential as a tool for conservative citation-quality screening, but they do not establish general editorial performance.
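The three-tier, threshold-based review described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the tier names (`flag`, `review`, `pass`), the weights, the 0-100 score scale, the upper threshold of 50, and the rule that a low score means "flag" are all hypothetical. Only the value 17 is borrowed from the reported operating threshold, and its use as the lower cut-off here is a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageConfig:
    # All defaults are hypothetical; the paper's actual weights are not given here.
    w_llm: float = 0.6     # weight of the LLM relevance score (assumed 0..100)
    w_embed: float = 0.4   # weight of the embedding similarity (assumed 0..1)
    low: float = 17.0      # below this: flag (borrowed from the reported tau = 17)
    high: float = 50.0     # below this: send to analyst review (assumed value)


def fused_score(llm_score: float, embed_sim: float, cfg: TriageConfig) -> float:
    """Weighted mean of the two signals, rescaled to 0..100 (assumed fusion rule)."""
    total = cfg.w_llm + cfg.w_embed
    return (cfg.w_llm * llm_score + cfg.w_embed * 100.0 * embed_sim) / total


def triage(score: float, metadata_ok: bool, cfg: TriageConfig) -> str:
    """Map a fused score and a metadata check onto one of three review tiers."""
    if score < cfg.low:
        return "flag"
    if score < cfg.high or not metadata_ok:
        return "review"
    return "pass"


cfg = TriageConfig()
for llm, sim, meta in [(10, 0.2, True), (40, 0.5, True), (80, 0.9, False)]:
    score = fused_score(llm, sim, cfg)
    print(f"{score:.1f} -> {triage(score, meta, cfg)}")
```

Keeping the thresholds in a single configuration object reflects the summary's emphasis on tunability: an editor could lower `low` to reduce false positives or raise it for more conservative screening without touching the scoring code.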
📝 Abstract
Editors and reviewers are expected to ensure that manuscripts cite relevant, accurate, current, and ethically appropriate literature, yet manuscript-level citation auditing remains largely manual, fragmented, and difficult to scale. Citation context, metadata quality, self-citation patterns, and bibliographic integrity all affect whether a reference appropriately supports a local claim. We present CitePrism, a transparent hybrid decision-support framework for editorial citation auditing that combines LLM-assisted contextual reasoning, embedding-based semantic similarity, metadata verification, integrity-oriented flags, and human-in-the-loop analyst review. CitePrism extracts citation neighborhoods, enriches reference metadata, computes fused relevance scores, surfaces metadata and self-citation review prompts, and supports configurable threshold-based triage. In a preliminary validation on a single case-study manuscript with 104 references from pavement engineering, agreement with human binary relevance labels reached Cohen's kappa = 0.429. At operating threshold tau = 17, CitePrism flagged all human-labeled irrelevant citations, while also producing false positives requiring analyst review. These results suggest that CitePrism may support conservative editorial screening and citation-quality triage, but they do not establish general editorial performance. CitePrism is intended as pilot-stage decision support, not as an autonomous misconduct detector or automated editorial decision system. Broader validation across manuscripts, domains, annotators, baselines, and deployment settings is required before operational use.
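For readers who want to see how the two reported quantities relate, the following minimal sketch computes Cohen's kappa between system flags and human labels, and counts caught, missed, and false-positive citations at a given threshold. The data are toy values, not the paper's, and the rule that a citation is flagged when its fused score falls below tau is an assumption; the abstract does not state the direction or scale of the score.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal frequency per label.
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def flag_below(scores, tau):
    """Assumed rule: flag a citation when its fused score is below tau."""
    return [s < tau for s in scores]


def screening_counts(flags, human_irrelevant):
    """Count irrelevant citations caught, missed, and relevant ones flagged."""
    pairs = list(zip(flags, human_irrelevant))
    return {
        "caught": sum(f and h for f, h in pairs),
        "missed": sum((not f) and h for f, h in pairs),
        "false_positives": sum(f and (not h) for f, h in pairs),
    }


# Toy data only: six fused scores and human "irrelevant" labels.
scores = [5, 12, 16, 40, 75, 90]
human_irrelevant = [True, True, False, False, False, False]

flags = flag_below(scores, tau=17)
print(screening_counts(flags, human_irrelevant))
print(round(cohen_kappa(flags, human_irrelevant), 3))
```

In this toy case every human-labeled irrelevant citation is flagged, yet kappa stays well below 1 because one relevant citation is flagged too. That is the same pattern the abstract reports: full coverage of irrelevant citations at the operating threshold alongside only moderate chance-corrected agreement.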