🤖 AI Summary
To address challenges in enterprise document question answering (heterogeneous format support, insufficient answer accuracy, poor traceability, and production deployment difficulties in high-compliance domains such as legal and finance), this paper proposes a secure, scalable retrieval-augmented generation (RAG) framework. Methodologically, it integrates multi-source parsing (PDF, Office, web), 1,000-token overlapping chunking, hybrid HNSW-BM25 indexing, Cohere-based re-ranking, and GPT-4o query optimization. Key innovations include a LangGraph-driven citation-coverage validator and CO-STAR prompt engineering, deployed via end-to-end encrypted containerization. Experiments on four LegalBench subsets demonstrate an improvement of roughly 1 percentage point in Recall@50 and roughly 7 percentage points in Precision@10, TRACe Utilization ≥ 0.50, and an unsupported-answer rate below 3%, significantly enhancing answer verifiability and regulatory compliance.
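The overlapping-chunking step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the summary specifies only the 1,000-token chunk size, so the 200-token overlap and the string-list "token" stand-in are assumptions.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=200):
    """Split a token sequence into fixed-size windows with overlap.

    Consecutive chunks share `overlap` tokens so that facts spanning
    a chunk boundary remain retrievable from at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        chunks.append(tokens[start:start + chunk_size])
    return chunks

# Toy document of 2,500 placeholder tokens:
tokens = [f"tok{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)
print(len(chunks))       # 3
print(len(chunks[0]))    # 1000
```

Each chunk would then be embedded for the HNSW index and tokenized for BM25, so the same window serves both retrieval paths.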
📝 Abstract
We present the DEREK (Deep Extraction & Reasoning Engine for Knowledge) Module, a secure and scalable Retrieval-Augmented Generation pipeline designed specifically for enterprise document question answering. Built by eSapiens, the system ingests heterogeneous content (PDF, Office, web), splits it into 1,000-token overlapping chunks, and indexes them in a hybrid HNSW+BM25 store. User queries are refined by GPT-4o, retrieved via combined vector+BM25 search, reranked with Cohere, and answered by an LLM using CO-STAR prompt engineering. A LangGraph verifier enforces citation overlap, regenerating answers until every claim is grounded. On four LegalBench subsets, 1,000-token chunks improve Recall@50 by approximately 1 pp and hybrid+rerank boosts Precision@10 by approximately 7 pp; the verifier raises TRACe Utilization above 0.50 and limits unsupported statements to less than 3%. All components run in containers and enforce end-to-end TLS 1.3 and AES-256 encryption. These results demonstrate that the DEREK module delivers accurate, traceable, and production-ready document QA with minimal operational overhead. The module is designed to meet enterprise demands for secure, auditable, and context-faithful retrieval, providing a reliable baseline for high-stakes domains such as legal and finance.
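The verify-then-regenerate loop described above can be sketched as follows. Everything here is an illustrative assumption rather than the DEREK implementation: the lexical-overlap grounding metric, the 0.5 threshold, the bounded retry count, and the `generate` callable (standing in for the LLM call) are all hypothetical.

```python
import re


def token_overlap(claim: str, evidence: str) -> float:
    """Fraction of a claim's word tokens that appear in its cited evidence."""
    claim_toks = set(re.findall(r"\w+", claim.lower()))
    ev_toks = set(re.findall(r"\w+", evidence.lower()))
    if not claim_toks:
        return 1.0
    return len(claim_toks & ev_toks) / len(claim_toks)


def verify_answer(sentences, citations, threshold=0.5):
    """Return indices of sentences whose cited chunk fails the overlap
    check; an empty list means every claim is grounded."""
    return [i for i, (s, c) in enumerate(zip(sentences, citations))
            if token_overlap(s, c) < threshold]


def answer_with_verification(generate, query, max_rounds=3):
    """Regenerate until every claim passes the check or retries run out.

    `generate(query, feedback)` returns (sentences, citations); `feedback`
    carries the indices of ungrounded claims from the previous round.
    """
    feedback = None
    sentences = []
    for _ in range(max_rounds):
        sentences, citations = generate(query, feedback)
        failing = verify_answer(sentences, citations)
        if not failing:
            return sentences
        feedback = failing  # prompt the model to reground these claims
    return sentences  # best effort after max_rounds
```

In this reading, "TRACe Utilization above 0.50" and "unsupported statements below 3%" would be measured over the claims that survive this loop, with the regeneration step driving the unsupported fraction down.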