AI Summary
This work addresses the limitations of conventional RAG systems in high-precision citation scenarios, where flat chunking leads to structural information loss, lexical mismatches, and unstable reasoning, thereby failing to meet the accuracy and traceability demands of technical question answering. To overcome these challenges, the authors propose a hierarchical dense retrieval framework that constructs a four-level tree-structured document index (document → section → paragraph → sentence) and employs bottom-up embedding aggregation to preserve hierarchical context. The approach further integrates LLM-driven query planning and cross-query reranking to enhance retrieval coverage, alongside an ensemble reasoning mechanism with an abstention protocol to stabilize outputs. Remarkably, using dense retrieval alone, this method matches the performance of sparse-dense hybrid strategies, achieving a score of 0.861 to rank first on both public and private leaderboards in the WattBot 2025 Challenge, the only team to top both evaluation tracks. Ablation studies confirm the contribution of each component, and the code is publicly released.
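The bottom-up aggregation over the four-level tree can be illustrated with a minimal sketch. The summary does not say how child embeddings are combined, so this example assumes plain mean pooling followed by L2 normalization. The `Node` class and the `aggregate` and `l2_normalize` helpers are hypothetical names chosen for illustration, not part of the KohakuRAG API.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the document -> section -> paragraph -> sentence tree (hypothetical type)."""
    level: str
    text: str = ""
    embedding: list[float] | None = None
    children: list["Node"] = field(default_factory=list)


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def aggregate(node: Node) -> list[float]:
    """Fill in embeddings bottom-up and return this node's embedding.

    Assumption: a parent embedding is the normalized mean of its children.
    The actual weighting used by the authors may differ.
    """
    if not node.children:
        # Leaves (sentences) must already carry an embedding from the encoder.
        if node.embedding is None:
            raise ValueError(f"leaf at level {node.level!r} has no embedding")
        node.embedding = l2_normalize(node.embedding)
        return node.embedding

    child_vecs = [aggregate(child) for child in node.children]
    dim = len(child_vecs[0])
    mean = [sum(vec[i] for vec in child_vecs) / len(child_vecs) for i in range(dim)]
    node.embedding = l2_normalize(mean)
    return node.embedding
```

With this layout, every level of the tree ends up with a vector in the same embedding space, so a query can be matched against sentences, paragraphs, sections, or whole documents with the same similarity function.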
Abstract
Retrieval-augmented generation (RAG) systems that answer questions from document collections face compounding difficulties when high-precision citations are required: flat chunking strategies sacrifice document structure, single-query formulations miss relevant passages through vocabulary mismatch, and single-pass inference produces stochastic answers that vary in both content and citation selection. We present KohakuRAG, a hierarchical RAG framework that preserves document structure through a four-level tree representation (document $\rightarrow$ section $\rightarrow$ paragraph $\rightarrow$ sentence) with bottom-up embedding aggregation, improves retrieval coverage through an LLM-powered query planner with cross-query reranking, and stabilizes answers through ensemble inference with abstention-aware voting. We evaluate on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents with $\pm$0.1% numeric tolerance and exact source attribution. KohakuRAG achieves first place on both public and private leaderboards (final score 0.861), as the only team to maintain the top position across both evaluation partitions. Ablation studies reveal that prompt ordering (+80% relative), retry mechanisms (+69%), and ensemble voting with blank filtering (+1.2pp) each contribute substantially, while hierarchical dense retrieval alone matches hybrid sparse-dense approaches (BM25 adds only +3.1pp). We release KohakuRAG as open-source software at https://github.com/KohakuBlueleaf/KohakuRAG.
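As a rough illustration of ensemble inference with abstention-aware voting, the sketch below takes the answers from several independent runs, drops blank (abstained) answers whenever at least one run produced a real answer, and returns the most frequent remaining answer. The abstract does not spell out the exact protocol, so the representation of an abstention as `None` or an empty string, the tie-breaking rule, and the function name `vote` are all assumptions made for this example.

```python
from collections import Counter
from typing import Optional, Sequence


def vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Majority vote over ensemble answers with blank filtering (illustrative only).

    Assumptions: an abstention is represented as None or a whitespace-only
    string, and ties go to the answer that appeared first among the runs.
    """
    # Blank filtering: abstentions do not compete with concrete answers.
    non_blank = [a.strip() for a in answers if a is not None and a.strip()]

    if not non_blank:
        # Every run abstained, so the ensemble abstains as well.
        return None

    # Counter.most_common keeps first-seen order among equally common items.
    return Counter(non_blank).most_common(1)[0][0]
```

For example, `vote(["1.2", None, "1.2", "3.4", None])` returns `"1.2"`, and `vote([None, ""])` returns `None`. Filtering blanks before counting means a single confident run is not outvoted by several abstentions, which is one plausible reading of "blank filtering" in the ablation results.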