🤖 AI Summary
To address critical judicial evidence challenges in legal question answering—including severe hallucination, knowledge obsolescence, and non-auditable answers—this paper proposes a retrieval-first, multi-model fusion framework. The framework integrates retrieval-augmented generation (RAG), collaborative generation across multiple large language models (LLMs), a domain-specialized selector for answer ranking, and a human feedback–driven closed-loop mechanism to ensure answer veracity, traceable provenance, and dynamic knowledge base evolution. Its key innovations include (1) feeding human verification outcomes back into knowledge base updates, and (2) explicitly modeling legal domain expertise via the selector to suppress hallucination. Evaluated on the Law_QA benchmark, the framework achieves significant improvements over single-LLM baselines and conventional RAG in F1 score, ROUGE-L, and LLM-as-a-Judge metrics—demonstrating comprehensive gains in legal compliance, factual accuracy, and system trustworthiness.
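The human feedback–driven closed loop described above can be sketched in a few lines. This is an illustrative sketch only, not the paper's code: the function name `write_back`, the dict-based answer record, and the list-based knowledge base are all assumptions made for clarity. The key design point it captures is that only reviewer-approved answers are written back, so the repository grows while remaining auditable.

```python
# Hypothetical sketch of the human-in-the-loop write-back mechanism.
# Names and data shapes are illustrative, not the paper's API.

def write_back(answer, reviewer_approves, knowledge_base):
    """Append a human-verified answer to the trusted repository.

    answer            -- dict with at least a "text" field and optional "provenance"
    reviewer_approves -- callable implementing the human review step
    knowledge_base    -- mutable store (here: a plain list) of verified entries
    """
    if reviewer_approves(answer):
        # Only verified outputs enter the repository, keeping it auditable.
        knowledge_base.append({
            "text": answer["text"],
            "provenance": answer.get("provenance", []),
            "verified": True,
        })
        return True
    # Rejected answers are discarded; the knowledge base is unchanged.
    return False
```

In this reading, the knowledge base evolves dynamically (new verified Q&A pairs become retrievable evidence for future queries) without ever ingesting unreviewed model output.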
📝 Abstract
As artificial intelligence permeates judicial forensics, ensuring the veracity and traceability of legal question answering (QA) has become critical. Conventional large language models (LLMs) are prone to hallucination, risking misleading guidance in legal consultation, while static knowledge bases struggle to keep pace with frequently updated statutes and case law. We present a hybrid legal QA agent tailored for judicial settings that integrates retrieval-augmented generation (RAG) with multi-model ensembling to deliver reliable, auditable, and continuously updatable counsel. The system prioritizes retrieval over generation: when a trusted legal repository yields relevant evidence, answers are produced via RAG; otherwise, multiple LLMs generate candidates that are scored by a specialized selector, with the top-ranked answer returned. High-quality outputs then undergo human review before being written back to the repository, enabling dynamic knowledge evolution and provenance tracking. Experiments on the Law_QA dataset show that our hybrid approach significantly outperforms both a single-model baseline and a vanilla RAG pipeline on F1, ROUGE-L, and an LLM-as-a-Judge metric. Ablations confirm the complementary contributions of retrieval prioritization, model ensembling, and the human-in-the-loop update mechanism. The proposed system demonstrably reduces hallucination while improving answer quality and legal compliance, advancing the real-world deployment of media forensics technologies in judicial scenarios.
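The retrieval-first routing the abstract describes can be made concrete with a short sketch. This is a minimal illustration under assumed interfaces, not the paper's implementation: `retrieve`, `generate_with_context`, the candidate generators, `selector_score`, and the 0.7 threshold are all hypothetical stand-ins for the components named in the text.

```python
# Hypothetical sketch of the retrieval-first dispatch: prefer a grounded
# RAG answer when the trusted repository yields evidence; otherwise fall
# back to multi-LLM candidates ranked by a domain-specialized selector.

def dispatch(query, retrieve, generate_with_context,
             candidate_generators, selector_score, threshold=0.7):
    """Route a query through RAG or the multi-model ensemble.

    retrieve              -- query -> [(doc_id, relevance_score, text), ...]
    generate_with_context -- (query, docs) -> grounded answer string
    candidate_generators  -- list of per-LLM query -> answer callables
    selector_score        -- (query, answer) -> float ranking score
    """
    docs = retrieve(query)
    if docs and max(score for _, score, _ in docs) >= threshold:
        # Trusted evidence found: answer via RAG, with traceable provenance.
        return {"answer": generate_with_context(query, docs),
                "route": "rag",
                "provenance": [doc_id for doc_id, _, _ in docs]}
    # No sufficiently relevant evidence: ensemble multiple LLM candidates
    # and return the one the selector ranks highest.
    candidates = [generate(query) for generate in candidate_generators]
    best = max(candidates, key=lambda answer: selector_score(query, answer))
    return {"answer": best, "route": "ensemble", "provenance": []}
```

The returned `route` and `provenance` fields reflect the auditability goal: a RAG answer carries the document IDs that ground it, while an ensemble answer is explicitly marked as selector-ranked rather than evidence-backed.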