Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of fine-grained, step-level evaluation of intermediate reasoning in multi-hop question answering, which makes it difficult to diagnose precisely where large language models fail. To this end, the authors introduce Omanic, an open-domain multi-hop question answering resource comprising two datasets—OmanicBench, with expert-reviewed human-annotated step-level labels for evaluation, and OmanicSynth, with large-scale synthetically generated step-level labels for training. These annotations enable both explainable evaluation and supervised training of reasoning processes. Experimental results show that leading models achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its difficulty. Furthermore, supervised fine-tuning on OmanicSynth combined with chain-of-thought (CoT) prompting yields an average improvement of 7.41 points across six reasoning and mathematical benchmarks, demonstrating effective transfer of multi-hop reasoning capability.

📝 Abstract
Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT's performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine-tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset's quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.
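The abstract describes annotating each multi-hop question with decomposed sub-questions and intermediate answers. A minimal sketch of what such a step-annotated record and a step-wise scorer might look like is below; the field names and scoring rule are illustrative assumptions, not the actual Omanic schema or evaluation protocol.

```python
# Hypothetical sketch of a step-annotated multi-hop QA record.
# Field names ("hops", "sub_question", etc.) are assumptions for
# illustration, NOT the released Omanic format.

example = {
    "question": "Which country is the author of 'Hamlet' from?",
    "hops": [
        {"sub_question": "Who wrote 'Hamlet'?",
         "answer": "William Shakespeare"},
        {"sub_question": "Which country is William Shakespeare from?",
         "answer": "England"},
    ],
    "final_answer": "England",
}

def stepwise_accuracy(gold_hops, predicted_answers):
    """Fraction of hops whose predicted intermediate answer
    matches the gold annotation (case-insensitive exact match)."""
    correct = sum(
        gold["answer"].strip().lower() == pred.strip().lower()
        for gold, pred in zip(gold_hops, predicted_answers)
    )
    return correct / len(gold_hops)

# A model that resolves the first hop but errs on the second
# scores 0.5 step-wise even though its final answer is wrong,
# localizing the failure to hop 2.
preds = ["William Shakespeare", "France"]
print(stepwise_accuracy(example["hops"], preds))  # → 0.5
```

This kind of per-hop comparison is what distinguishes step-wise diagnosis from final-answer accuracy: an error introduced at one hop can be pinpointed rather than only observed as a wrong final answer.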
Problem

Research questions and friction points this paper is trying to address.

multi-hop reasoning
large language models
reasoning evaluation
step-wise analysis
intermediate reasoning steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

step-wise evaluation
multi-hop reasoning
intermediate reasoning annotation
reasoning diagnosis
supervised fine-tuning