🤖 AI Summary
Large language models (LLMs) exhibit weak multi-hop reasoning capabilities and struggle to locate and integrate critical information in ultra-long contexts (>100K tokens).
Method: This paper proposes a purely prompt-driven, end-to-end reasoning framework that combines structured prompt engineering with chain-of-thought (CoT) prompting to perform key-passage localization, stepwise evidence integration, and lightweight inference, all within a single forward pass. It emulates the retrieval-and-reasoning behavior of RAG without any external retriever.
Contribution/Results: The work identifies the decisive impact of prompt structure, such as the ordering of the question, labels, and instructions, on long-range comprehension. Evaluated on the BABILong benchmark, the method significantly outperforms both retrieval-free baselines and naive RAG on multi-fact question answering tasks, including object location tracking, counting, and indefinite-knowledge reasoning, demonstrating strong robustness. The results empirically validate that optimized prompting can substantively replace conventional retrieval pipelines.
📝 Abstract
This paper addresses the challenge of comprehending very long contexts in Large Language Models (LLMs) by proposing a method that emulates Retrieval-Augmented Generation (RAG) through specialized prompt engineering and chain-of-thought (CoT) reasoning. While recent LLMs support over 100,000 tokens in a single prompt, simply enlarging context windows has not guaranteed robust multi-hop reasoning when key details are scattered across a massive input. Our approach treats the model as both the retriever and the reasoner: it first tags relevant segments within a long passage, then employs a stepwise CoT workflow to integrate these pieces of evidence. This single-pass method thereby removes the reliance on an external retriever while maintaining focus on crucial segments. We evaluate our approach on selected tasks from BABILong, which interleaves standard bAbI QA problems with large amounts of distractor text. Compared to baseline (no retrieval) and naive RAG pipelines, our approach more accurately handles multi-fact questions such as object location tracking, counting, and indefinite knowledge. Furthermore, we analyze how prompt structure, including the order of the question, relevant-text tags, and overall instructions, significantly affects performance. These findings underscore that optimized prompt engineering, combined with guided reasoning, can enhance LLMs' long-context comprehension and serve as a lightweight alternative to traditional retrieval pipelines.
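The single-pass "retrieve-and-reason" prompt described above can be sketched roughly as follows. This is a minimal illustration, not the paper's exact template: the function name, the `<evidence>` tag, the section wording, and the question-before-context ordering are all assumptions based on the abstract's description.

```python
def build_prompt(question: str, context: str) -> str:
    """Assemble a single-pass prompt: the question and instructions come
    before the long context, and the model is asked to first tag relevant
    segments, then reason step by step over only the tagged evidence.

    This ordering is one of the prompt-structure choices the paper reports
    as affecting long-context performance; the details here are illustrative.
    """
    return (
        "You will read a very long passage and answer a question about it.\n"
        f"Question: {question}\n\n"
        "Instructions:\n"
        "1. Copy every sentence relevant to the question into "
        "<evidence>...</evidence> tags.\n"
        "2. Reason step by step using only the tagged evidence.\n"
        "3. Finish with a line of the form 'Answer: <your answer>'.\n\n"
        f"Passage:\n{context}\n"
    )

# Example: a bAbI-style object-location question with distractor text.
prompt = build_prompt(
    "Where is the apple?",
    "John took the apple. [many distractor sentences] John went to the kitchen.",
)
```

The key idea is that no external retriever runs: the same forward pass that answers the question also performs the "retrieval" by quoting evidence inline before reasoning over it.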