AI Summary
Existing approaches struggle to capture fine-grained interactions among scientific papers centered on specific claims. This work proposes ClaimFlow, the first claim-centric NLP framework for scholarly document analysis. By manually annotating 1,084 claims and 832 cross-paper relationships across 304 papers from the ACL Anthology, we introduce a novel claim relation classification task aimed at inferring the scientific stance of citing papers toward cited claims. Evaluating neural models and large language models on this task, the best approach achieves a macro-F1 score of 0.78. Extending the analysis to roughly 13,000 papers reveals that 63.5% of claims are never reused, only 11.1% are ever challenged, and widely disseminated claims are typically reframed through qualification or extension rather than directly confirmed or refuted.
Abstract
Scientific papers do more than report results: they advance *claims* that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce `ClaimFlow`, a claim-centric view of the NLP literature, built from 304 ACL Anthology papers (1979–2025) that are manually annotated with 1,084 claims and 832 cross-paper claim relations, indicating whether a citing paper *supports*, *extends*, *qualifies*, *refutes*, or references a claim as *background*. Using `ClaimFlow`, we define a new task, *Claim Relation Classification*, which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating strong neural models and large language models on this task, we report baseline performance of 0.78 macro-F1, showing that claim relation classification is feasible but challenging. We further apply our model to ~13k NLP papers to analyze how claims evolve across decades of NLP research. Our analysis reveals that 63.5% of claims are never reused and only 11.1% are ever challenged; meanwhile, widely propagated claims are more often *reshaped* through qualification and extension than directly confirmed or refuted. Overall, `ClaimFlow` offers a lens for examining how ideas shift and mature within NLP, and a foundation for assessing whether models can interpret scientific argumentation.
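The input/output structure of the Claim Relation Classification task described above can be sketched as follows. The five labels come from the abstract; the dataclass fields and the example instance are illustrative assumptions, not actual dataset text.

```python
from dataclasses import dataclass

# The five relation labels defined in ClaimFlow (per the abstract).
LABELS = ["supports", "extends", "qualifies", "refutes", "background"]

@dataclass
class ClaimRelationExample:
    """One instance of the Claim Relation Classification task.

    A model receives the cited claim and the citing paper's citation
    context, and must predict the relation label.
    """
    cited_claim: str       # claim text from the cited paper
    citation_context: str  # citing paper's text surrounding the citation
    label: str             # gold annotation, one of LABELS

# Hypothetical instance (invented text for illustration only).
ex = ClaimRelationExample(
    cited_claim="Pretraining on large corpora improves downstream accuracy.",
    citation_context="We confirm this finding on three additional benchmarks.",
    label="supports",
)
assert ex.label in LABELS
```

A classifier for this task would map `(cited_claim, citation_context)` pairs to one of the five labels, with macro-F1 averaged over the labels as the evaluation metric.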