🤖 AI Summary
This work addresses a limitation of large language models (LLMs) in inferential question answering, where answers must be derived rather than directly retrieved and where retrieved contexts often lack sufficient inferential support. The authors use convergence, a measure of how effectively a sentence (hint) helps eliminate incorrect answers, as the criterion for constructing and ordering reasoning-oriented passages. Passages built from high-convergence sentences yield substantially better answer accuracy than passages selected by cosine similarity. Experiments on subsets of the TriviaHG dataset show this improvement across six LLMs of different sizes and architectures, indicating that convergence captures relevance for inferential reasoning in context selection.
📝 Abstract
While Large Language Models (LLMs) are widely used in open-domain Question Answering (QA), their ability to handle inferential questions, where answers must be derived rather than directly retrieved, remains underexplored. This study investigates how the structure and quality of passages influence LLM performance on such questions. We focus on convergence, a measure of how effectively sentences (hints) eliminate incorrect answers, as a criterion for constructing passages. Using subsets of the TriviaHG dataset, we form passages by combining sentences with varying convergence levels and evaluate six LLMs of different sizes and architectures. Our results show that passages built from higher-convergence sentences lead to substantially better answer accuracy than those selected by cosine similarity, indicating that convergence captures meaningful relevance for inferential reasoning. Additionally, ordering sentences by descending convergence slightly improves performance, suggesting that LLMs tend to prioritize earlier, information-rich cues. These findings highlight convergence as a practical signal for guiding passage construction and analyzing inferential reasoning behavior in LLMs.
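The passage-construction idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formula for convergence, so the definition used here (the fraction of incorrect candidate answers that a hint rules out) is an assumption for illustration only, as are the function names `convergence` and `build_passage` and the toy hints and candidates.

```python
from typing import Dict, List, Set


def convergence(eliminated: Set[str], incorrect: Set[str]) -> float:
    """Assumed definition: share of incorrect candidates a hint rules out."""
    if not incorrect:
        return 0.0
    return len(eliminated & incorrect) / len(incorrect)


def build_passage(hints: Dict[str, Set[str]], incorrect: Set[str], k: int) -> List[str]:
    """Pick the k hints with the highest convergence, in descending order."""
    scored = [(convergence(elim, incorrect), text) for text, elim in hints.items()]
    # list.sort is stable, so hints with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical toy data: the correct answer is "Paris"; each hint maps to the
# incorrect candidates it eliminates.
incorrect_candidates = {"Lyon", "Berlin", "Tokyo", "Cairo"}
toy_hints = {
    "It is a city.": set(),
    "It is in Europe.": {"Tokyo", "Cairo"},
    "It is in France.": {"Berlin", "Tokyo", "Cairo"},
}

passage = build_passage(toy_hints, incorrect_candidates, k=2)
print(" ".join(passage))
```

Sorting in descending order mirrors the abstract's observation that placing higher-convergence sentences first slightly improves performance; a similarity-based baseline would instead rank the same hints by cosine similarity to the question.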