🤖 AI Summary
To address factual hallucinations in large language model (LLM)–generated text, this paper proposes a three-stage post-editing framework: (1) retrieving counterevidence via external tools (e.g., search engines or APIs); (2) prompting the LLM to generate a "factual error explanation"—a novel reasoning step that identifies and articulates the root cause of the error; and (3) performing a precise correction grounded in this explanation. The method integrates lightweight chain-of-explanation prompting, LLM self-reflective rewriting, and prompt compression to jointly ensure high correction fidelity and substantially reduce computational overhead. Experiments across multiple benchmarks demonstrate that the proposed approach outperforms FacTool, CoVE, and RARR in both error detection and correction accuracy, while reducing inference latency by up to 42% and token consumption by up to 38%.
📝 Abstract
Mitigating hallucination is a key challenge that must be overcome to reliably deploy large language models (LLMs) in real-world scenarios. Recently, various methods have been proposed to detect and revise factual errors in LLM-generated text in order to reduce hallucination. In this paper, we propose Re-Ex, a method for post-editing LLM-generated responses. Re-Ex introduces a novel reasoning step dubbed the factual error explanation step. Re-Ex revises the initial response of an LLM in three steps: first, external tools are used to retrieve evidence of the factual errors in the initial LLM response; next, the LLM is instructed to explain the problematic parts of the response based on the gathered evidence; finally, the LLM revises the initial response using the explanations provided in the previous step. In addition to the explanation step, Re-Ex also incorporates new prompting techniques to reduce the token count and inference time required for the response revision process. Compared with existing methods, including FacTool, CoVE, and RARR, Re-Ex provides better detection and revision performance with less inference time and fewer tokens on multiple benchmarks.
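The three revision steps described above can be sketched as a simple pipeline. This is a minimal illustration, not the paper's implementation: the helper names (`re_ex_revise`, `retrieve_evidence`, `llm`) and the prompt wording are hypothetical, and the retrieval tool and LLM are injected as callables so a real search engine or LLM API could be swapped in.

```python
def re_ex_revise(response, retrieve_evidence, llm):
    """Revise an LLM response via the three Re-Ex-style steps (sketch)."""
    # Step 1: use an external tool to gather evidence about possible factual errors.
    evidence = retrieve_evidence(response)
    # Step 2: instruct the LLM to explain the problematic parts, grounded in the evidence.
    explanation = llm(
        f"Response: {response}\nEvidence: {evidence}\n"
        "Explain any factual errors in the response."
    )
    # Step 3: instruct the LLM to revise the response using that explanation.
    revised = llm(
        f"Response: {response}\nError explanation: {explanation}\n"
        "Revise the response to fix the explained errors."
    )
    return revised

# Toy stubs showing the control flow; a real system would call a search
# engine in step 1 and an LLM API in steps 2 and 3.
def fake_retrieve(resp):
    return ["Mount Everest is 8,849 m tall."]

def fake_llm(prompt):
    if "Explain any factual errors" in prompt:
        return "The stated height 9,000 m contradicts the evidence (8,849 m)."
    return "Mount Everest is 8,849 m tall."

print(re_ex_revise("Mount Everest is 9,000 m tall.", fake_retrieve, fake_llm))
```

The key design point is that the explanation from step 2 is passed into the revision prompt of step 3, so the final rewrite is conditioned on an articulated diagnosis of the error rather than on the raw evidence alone.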