🤖 AI Summary
This paper addresses hallucination detection and fine-grained localization in multilingual large language model (LLM) outputs. Methodologically, it proposes a knowledge-augmented, multi-granularity detection framework: (1) it reconstructs reference documents into declarative fact-checking units to improve the traceability of factual grounding; (2) it extends SelfCheckGPT with dynamic external knowledge integration; and (3) it introduces multilingual prompt engineering together with a word-level IoU and Coverage-over-Reference (COR) evaluation framework for precise hallucination localization, from the sentence level down to individual words. Evaluated on the SemEval-2025 Task 3 multilingual benchmark, the framework achieves an average IoU of 0.5310 and COR of 0.5669, ranking first overall. To our knowledge, this is the first work to systematically enable word-level, interpretable hallucination detection in multilingual LLM generations.
📝 Abstract
SemEval-2025 Task 3 (Mu-SHROOM) focuses on detecting hallucinations in content generated by various large language models (LLMs) across multiple languages. The task requires not only identifying whether hallucinations are present but also pinpointing where they occur. To tackle this challenge, this study introduces two methods: a modified RefChecker and a modified SelfCheckGPT. The modified RefChecker integrates prompt-based factual verification into the references, restructuring them into claim-based test units rather than treating them as a single external knowledge source. The modified SelfCheckGPT incorporates external knowledge to overcome its reliance on the model's internal knowledge. In addition, the original prompt designs of both methods are enhanced to identify hallucinated words within LLM-generated texts. Experimental results demonstrate the effectiveness of the approach: on the test dataset it detects hallucinations across various languages with an average IoU of 0.5310 and an average COR of 0.5669, achieving a high ranking.
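The IoU and COR scores above can be illustrated with a minimal sketch. Assuming hallucination spans are represented as sets of word indices (the exact representation and edge-case conventions used by the official Mu-SHROOM scorer may differ), the two metrics reduce to simple set ratios:

```python
def word_iou(pred, gold):
    """Intersection-over-Union between predicted and gold hallucinated word-index sets."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # both empty: treated as perfect agreement (assumed convention)
    return len(pred & gold) / len(pred | gold)

def coverage_over_reference(pred, gold):
    """COR: fraction of gold hallucinated words recovered by the prediction (assumed definition)."""
    pred, gold = set(pred), set(gold)
    if not gold:
        return 1.0  # no gold hallucinations to cover (assumed convention)
    return len(pred & gold) / len(gold)

# Example: words 2-3 correctly flagged, word 1 is a false positive, word 4 is missed.
print(word_iou({1, 2, 3}, {2, 3, 4}))              # 2 shared / 4 total indices
print(coverage_over_reference({1, 2}, {1, 2, 3, 4}))  # 2 of 4 gold words covered
```

Per-language scores of this form are then averaged to obtain the reported 0.5310 IoU and 0.5669 COR.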