🤖 AI Summary
Existing LLM reasoning methods—such as Chain-of-Thought (CoT), Tree-of-Thought (ToT), and ReACT—are hindered by insufficient context utilization, hallucinated intermediate steps, and inefficient iteration, compromising accuracy and robustness. To address these limitations, we propose E2G, a novel “evidence-first” single-agent two-step prompting framework. In the first step, E2G precisely extracts explicit, structured contextual reasoning sequences directly from the input as verifiable evidence; in the second step, it generates answers strictly grounded in this evidence, eliminating unvalidated intermediate reasoning. By integrating retrieval augmentation and reconstructing the CoT paradigm around evidence fidelity, E2G significantly enhances reasoning reliability and context awareness. Experiments demonstrate that E2G achieves 53.8% accuracy on LogiQA—outperforming standard CoT by 18 percentage points—and attains an F1 score of 83.3 on the DROP subset when paired with PaLM2, surpassing Gemini Ultra by 0.9 points.
📝 Abstract
While chain-of-thought (CoT) prompting has revolutionized how LLMs perform reasoning tasks, its current methods and variations (e.g., Self-consistency, ReACT, Reflexion, Tree-of-Thoughts (ToT), Cumulative Reasoning (CR)) suffer from limitations such as slowness, limited context grounding, hallucination, and inconsistent outputs. To overcome these challenges, we introduce Evidence to Generate (E2G), a novel single-agent, two-step prompting framework. Instead of relying on unverified reasoning claims, this approach leverages "evidence for decision making": it first focuses exclusively on the thought sequences (the series of intermediate steps) explicitly mentioned in the context, which then serve as extracted evidence that guides the LLM's output generation with greater precision and efficiency. This simple yet powerful approach unlocks the potential of CoT-like prompting, paving the way for faster, more reliable, and more contextually aware reasoning in LLMs. E2G achieves remarkable results robustly across a wide range of knowledge-intensive reasoning and generation tasks, surpassing baseline approaches with state-of-the-art LLMs. For example, (i) on the LogiQA benchmark with GPT-4 as the backbone model, E2G achieves a new state-of-the-art accuracy of 53.8%, exceeding CoT by 18%, ToT by 11%, and CR by 9%; (ii) a variant of E2G with PaLM2 outperforms the variable-shot performance of Gemini Ultra by 0.9 F1 points, reaching an F1 score of 83.3 on a subset of DROP.
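The two-step flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `e2g_answer` function, and the `llm` callable are all hypothetical, and the sketch assumes that the second step sees only the extracted evidence and the question, which the abstract does not specify.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt templates; the paper's actual prompt wording is not given here.
EVIDENCE_PROMPT = (
    "Context:\n{context}\n\nQuestion: {question}\n\n"
    "List, as numbered steps, only the statements explicitly present in the "
    "context that bear on the question."
)
ANSWER_PROMPT = (
    "Evidence:\n{evidence}\n\nQuestion: {question}\n\n"
    "Answer using only the evidence above."
)


def e2g_answer(llm: Callable[[str], str], context: str, question: str) -> Dict[str, str]:
    """Run an evidence-first, two-step prompt with a single model (assumed interface)."""
    # Step 1: extract the intermediate steps explicitly stated in the context.
    evidence = llm(EVIDENCE_PROMPT.format(context=context, question=question))
    # Step 2: generate the answer from the extracted evidence.
    # Assumption: the original context is not passed again in this step.
    answer = llm(ANSWER_PROMPT.format(evidence=evidence, question=question))
    return {"evidence": evidence, "answer": answer}


def make_stub_llm() -> Tuple[Callable[[str], str], List[str]]:
    """Return a canned stand-in for a real model, plus a log of the prompts it received."""
    calls: List[str] = []

    def stub(prompt: str) -> str:
        calls.append(prompt)
        if prompt.startswith("Context:"):
            return "1. The meeting was moved to Friday."
        return "Friday"

    return stub, calls


if __name__ == "__main__":
    stub, _ = make_stub_llm()
    result = e2g_answer(stub, "The meeting was moved to Friday.", "When is the meeting?")
    print(result["evidence"])
    print(result["answer"])
```

In practice the stub would be replaced by a call to whichever model client is in use. Taking the model as a plain callable keeps the sketch independent of any particular API.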