eSapiens's DEREK Module: Deep Extraction & Reasoning Engine for Knowledge with LLMs

📅 2025-07-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address challenges in enterprise document question answering—particularly heterogeneous format support, insufficient answer accuracy, poor traceability, and production deployment difficulties in high-compliance domains (e.g., legal and financial)—this paper proposes a secure, scalable retrieval-augmented generation (RAG) framework. Methodologically, it integrates multi-source parsing (PDF, Office, web), 1,000-token overlapping chunking, hybrid HNSW+BM25 indexing, Cohere-based re-ranking, and GPT-4o query optimization. Key innovations include a LangGraph-driven citation coverage validator and CO-STAR prompt engineering, deployed via end-to-end encrypted containerization. Experiments on LegalBench subsets demonstrate a roughly 1 percentage-point improvement in Recall@50, a roughly 7-point gain in Precision@10, TRACe Utilization ≥ 0.50, and an unsupported-answer rate below 3%, significantly enhancing answer verifiability and regulatory compliance.
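The 1,000-token overlapping chunking step can be sketched as below. The paper specifies the chunk size but not the overlap width, so the 200-token overlap used here is an illustrative assumption, not a value from the paper:

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=200):
    """Split a token sequence into fixed-size overlapping chunks.

    chunk_size follows the paper (1,000 tokens); the overlap value
    is an assumption for illustration.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # stride between consecutive chunk starts
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk already covers the tail of the sequence
    return chunks
```

Overlap ensures that a sentence falling on a chunk boundary still appears intact in at least one chunk, which matters for downstream citation grounding.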

📝 Abstract
We present the DEREK (Deep Extraction & Reasoning Engine for Knowledge) Module, a secure and scalable Retrieval-Augmented Generation pipeline designed specifically for enterprise document question answering. Designed and implemented by eSapiens, the system ingests heterogeneous content (PDF, Office, web), splits it into 1,000-token overlapping chunks, and indexes them in a hybrid HNSW+BM25 store. User queries are refined by GPT-4o, retrieved via combined vector+BM25 search, reranked with Cohere, and answered by an LLM using CO-STAR prompt engineering. A LangGraph verifier enforces citation overlap, regenerating answers until every claim is grounded. On four LegalBench subsets, 1,000-token chunks improve Recall@50 by approximately 1 pp and hybrid+rerank boosts Precision@10 by approximately 7 pp; the verifier raises TRACe Utilization above 0.50 and limits unsupported statements to less than 3%. All components run in containers and enforce end-to-end TLS 1.3 and AES-256 encryption. These results demonstrate that the DEREK module delivers accurate, traceable, and production-ready document QA with minimal operational overhead. The module is designed to meet enterprise demands for secure, auditable, and context-faithful retrieval, providing a reliable baseline for high-stakes domains such as legal and finance.
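The hybrid vector+BM25 retrieval described above fuses two ranked result lists into one. The abstract does not state the fusion rule, so the sketch below uses reciprocal rank fusion (RRF), a common choice for combining dense (HNSW) and sparse (BM25) rankings:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank + 1) per list it appears in;
    summing rewards documents ranked highly by both retrievers.
    RRF is an assumed fusion rule here, not one stated in the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately well by both HNSW and BM25 can thereby outrank one that only a single retriever scored highly, which is the usual motivation for hybrid search.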
Problem

Research questions and friction points this paper is trying to address.

Enterprise document question answering with secure retrieval
Handling heterogeneous content for accurate knowledge extraction
Ensuring traceable and auditable responses in high-stakes domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid HNSW+BM25 indexing for document chunks
GPT-4o refined queries with Cohere reranking
LangGraph verifier ensures citation overlap
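The citation-overlap check enforced by the LangGraph verifier can be approximated as a per-claim token-overlap test. This is a simplified stand-in for the paper's validator: the function name, tokenization, and the 0.5 threshold (chosen to mirror the TRACe Utilization ≥ 0.50 target) are all assumptions for illustration:

```python
def citation_coverage(claims, cited_chunks, min_overlap=0.5):
    """Check, claim by claim, whether enough of a claim's tokens
    appear in the chunk it cites. Claims failing the check would
    trigger answer regeneration in the verifier loop.
    """
    results = []
    for claim, chunk in zip(claims, cited_chunks):
        claim_tokens = set(claim.lower().split())
        chunk_tokens = set(chunk.lower().split())
        overlap = len(claim_tokens & chunk_tokens) / max(len(claim_tokens), 1)
        results.append(overlap >= min_overlap)
    return results
```

In the full pipeline, any `False` result would send the answer back to the LLM for regeneration until every claim is grounded in a cited chunk.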
Authors

Isaac Shi (eSapiens Team)
Zeyuan Li (eSapiens Team)
Fan Liu (eSapiens Team)
Wenli Wang (eSapiens Team)
Lewei He (South China Normal University)
Yang Yang (eSapiens Team)
Tianyu Shi (University of Toronto)