🤖 AI Summary
Research on abductive reasoning in large language models has long suffered from the absence of a unified framework, leading to conceptual ambiguity and fragmented task definitions. This work proposes the first two-stage formalization—hypothesis generation and hypothesis selection—and establishes a systematic taxonomy encompassing tasks, datasets, methods, and evaluation protocols. Through a comprehensive literature review, benchmarking, and cross-model comparative analysis, the study empirically reveals performance disparities among existing models on abductive reasoning tasks. It further identifies critical limitations, including reliance on static evaluation setups and insufficient domain coverage, thereby laying theoretical and practical groundwork for future research in this area.
📝 Abstract
Despite its foundational role in human discovery and sense-making, abductive reasoning--the inference of the most plausible explanation for an observation--has been relatively underexplored in Large Language Models (LLMs). Even as LLMs advance rapidly, the study of abductive reasoning and its diverse facets has remained disjointed rather than cohesive. This paper presents the first survey of abductive reasoning in LLMs, tracing its trajectory from philosophical foundations to contemporary AI implementations. To address the widespread conceptual confusion and fragmented task definitions prevalent in the field, we establish a unified two-stage definition that formally categorizes prior work. This definition disentangles abduction into *Hypothesis Generation*, where models bridge epistemic gaps to produce candidate explanations, and *Hypothesis Selection*, where the generated candidates are evaluated and the most plausible explanation is chosen. Building on this foundation, we present a comprehensive taxonomy of the literature, categorizing prior work by abductive task, dataset, underlying methodology, and evaluation strategy. To ground our framework empirically, we conduct a compact benchmark study of current LLMs on abductive tasks, together with targeted comparative analyses across model sizes, model families, evaluation styles, and the distinct generation-versus-selection task typologies. Moreover, by synthesizing recent empirical results, we examine how LLM performance on abductive reasoning relates to deductive and inductive tasks, providing insight into their broader reasoning capabilities. Our analysis reveals critical gaps in current approaches--from static benchmark design and narrow domain coverage to constrained training frameworks and limited mechanistic understanding of abductive processes...
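The two-stage definition above can be made concrete with a minimal sketch. Note this is purely illustrative: the candidate generator and plausibility scorer below are toy word-overlap heuristics invented for this example, not any model, dataset, or method from the survey, and the function names (`generate_hypotheses`, `select_hypothesis`) are hypothetical.

```python
def generate_hypotheses(observation: str, knowledge_base: list[str]) -> list[str]:
    """Stage 1 -- Hypothesis Generation: propose candidate explanations that
    could bridge the epistemic gap left by the observation. Here, a toy
    heuristic keeps any knowledge-base entry sharing a word with it."""
    obs_words = set(observation.split())
    return [h for h in knowledge_base if set(h.split()) & obs_words]


def select_hypothesis(observation: str, candidates: list[str]) -> str:
    """Stage 2 -- Hypothesis Selection: score each candidate and return the
    most plausible explanation (here, plausibility = word overlap)."""
    obs_words = set(observation.split())
    return max(candidates, key=lambda h: len(set(h.split()) & obs_words))


# Toy knowledge base of possible explanations.
kb = [
    "the sprinkler ran overnight",
    "it rained on the grass overnight",
    "the grass was painted green",
]
obs = "the grass is wet this morning and it rained nearby"

candidates = generate_hypotheses(obs, kb)   # stage 1: candidate explanations
best = select_hypothesis(obs, candidates)   # stage 2: most plausible one
print(best)  # -> "it rained on the grass overnight"
```

In an LLM setting, stage 1 would correspond to sampling candidate explanations from a model and stage 2 to ranking them (by the same model or a separate evaluator); the interface between the two stages is what the paper's taxonomy uses to classify prior work.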