🤖 AI Summary
To address challenges in enterprise document question answering (heterogeneous format support, insufficient answer accuracy, poor traceability, and production deployment difficulties in high-compliance domains such as legal and finance), this paper proposes a secure, scalable retrieval-augmented generation (RAG) framework. Methodologically, it integrates multi-source parsing (PDF, Office, web), 1,000-token overlapping chunking, hybrid HNSW-BM25 indexing, Cohere-based re-ranking, and GPT-4o query optimization. Key innovations include a LangGraph-driven citation-coverage validator and CO-STAR prompt engineering, deployed via end-to-end encrypted containerization. Experiments on four LegalBench subsets demonstrate an improvement of roughly 1 percentage point in Recall@50 and roughly 7 percentage points in Precision@10, TRACe Utilization ≥ 0.50, and an unsupported-answer rate below 3%, significantly enhancing answer verifiability and regulatory compliance.
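The overlapping-chunking step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the summary specifies only the 1,000-token chunk size, so the 200-token overlap and the string-list "token" stand-in are assumptions.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=200):
    """Split a token sequence into fixed-size windows with overlap.

    Consecutive chunks share `overlap` tokens so that facts spanning
    a chunk boundary remain retrievable from at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), stride):
        chunks.append(tokens[start:start + chunk_size])
    return chunks

# Toy document of 2,500 placeholder tokens:
tokens = [f"tok{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)
print(len(chunks))       # 3
print(len(chunks[0]))    # 1000
```

Each chunk would then be embedded for the HNSW index and tokenized for BM25, so the same window serves both retrieval paths.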
📝 Abstract
We present the DEREK (Deep Extraction & Reasoning Engine for Knowledge) Module, a secure and scalable Retrieval-Augmented Generation pipeline designed specifically for enterprise document question answering. Built by eSapiens, the system ingests heterogeneous content (PDF, Office, web), splits it into 1,000-token overlapping chunks, and indexes them in a hybrid HNSW+BM25 store. User queries are refined by GPT-4o, retrieved via combined vector+BM25 search, reranked with Cohere, and answered by an LLM using CO-STAR prompt engineering. A LangGraph verifier enforces citation overlap, regenerating answers until every claim is grounded. On four LegalBench subsets, 1,000-token chunks improve Recall@50 by approximately 1 pp and hybrid+rerank boosts Precision@10 by approximately 7 pp; the verifier raises TRACe Utilization above 0.50 and limits unsupported statements to less than 3%. All components run in containers and enforce end-to-end TLS 1.3 and AES-256 encryption. These results demonstrate that the DEREK module delivers accurate, traceable, and production-ready document QA with minimal operational overhead. The module is designed to meet enterprise demands for secure, auditable, and context-faithful retrieval, providing a reliable baseline for high-stakes domains such as legal and finance.
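The verify-then-regenerate loop described above can be sketched as follows. Everything here is an illustrative assumption rather than the DEREK implementation: the lexical-overlap grounding metric, the 0.5 threshold, the bounded retry count, and the `generate` callable (standing in for the LLM call) are all hypothetical.

```python
import re


def token_overlap(claim: str, evidence: str) -> float:
    """Fraction of a claim's word tokens that appear in its cited evidence."""
    claim_toks = set(re.findall(r"\w+", claim.lower()))
    ev_toks = set(re.findall(r"\w+", evidence.lower()))
    if not claim_toks:
        return 1.0
    return len(claim_toks & ev_toks) / len(claim_toks)


def verify_answer(sentences, citations, threshold=0.5):
    """Return indices of sentences whose cited chunk fails the overlap
    check; an empty list means every claim is grounded."""
    return [i for i, (s, c) in enumerate(zip(sentences, citations))
            if token_overlap(s, c) < threshold]


def answer_with_verification(generate, query, max_rounds=3):
    """Regenerate until every claim passes the check or retries run out.

    `generate(query, feedback)` returns (sentences, citations); `feedback`
    carries the indices of ungrounded claims from the previous round.
    """
    feedback = None
    sentences = []
    for _ in range(max_rounds):
        sentences, citations = generate(query, feedback)
        failing = verify_answer(sentences, citations)
        if not failing:
            return sentences
        feedback = failing  # prompt the model to reground these claims
    return sentences  # best effort after max_rounds
```

In this reading, "TRACe Utilization above 0.50" and "unsupported statements below 3%" would be measured over the claims that survive this loop, with the regeneration step driving the unsupported fraction down.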