🤖 AI Summary
This work investigates the capability of large language models (LLMs) to detect intrinsic hallucinations—i.e., nonsensical, logically inconsistent, or factually incorrect outputs—in conditional generation tasks, specifically machine translation and paraphrasing. We systematically evaluate open-source LLMs of varying scales, instruction-tuned variants, and natural language inference (NLI) models across multilingual and multi-task settings, assessing consistency in hallucination detection. Using the HalluciGen benchmark and controlled ablation studies, we analyze the impact of model scale, instruction fine-tuning, and prompt design. Results show that detection performance depends more critically on model architecture and specialization than on prompting strategy; notably, several lightweight NLI models match or exceed larger LLMs in accuracy, confirming their viability as efficient, task-specialized hallucination detectors. This study presents the first comprehensive, cross-model, cross-task, and cross-lingual comparative evaluation of intrinsic hallucination detection, and introduces an NLI-driven lightweight detection paradigm.
📝 Abstract
A frequently observed problem with LLMs is their tendency to generate output that is nonsensical, illogical, or factually incorrect, often referred to broadly as hallucination. Building on the recently proposed HalluciGen task for hallucination detection and generation, we evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks: translation and paraphrasing. We study how model performance varies across tasks and languages, and we investigate the impact of model size, instruction tuning, and prompt choice. We find that performance varies across models but is consistent across prompts. Finally, we find that NLI models perform comparably well, suggesting that LLM-based detectors are not the only viable option for this specific task.
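The NLI-based detection idea can be sketched as follows. This is a minimal illustration under assumed conventions, not the authors' implementation: an NLI model scores entailment between the source and the generated output, and weak entailment in either direction flags an intrinsic hallucination. The function below assumes the entailment probabilities have already been computed (e.g., by an off-the-shelf NLI model run source→output and output→source); the threshold value is a placeholder.

```python
def is_intrinsic_hallucination(p_entail_fwd: float,
                               p_entail_bwd: float,
                               threshold: float = 0.5) -> bool:
    """Flag a generated output as an intrinsic hallucination.

    p_entail_fwd: NLI entailment probability for source -> output.
    p_entail_bwd: NLI entailment probability for output -> source.
    If either direction falls below the (assumed) threshold, the
    output is not faithful to the source and is flagged.
    """
    return min(p_entail_fwd, p_entail_bwd) < threshold


# Example: a faithful paraphrase entails and is entailed by its source,
# while a hallucinated output loses entailment in at least one direction.
faithful = is_intrinsic_hallucination(0.92, 0.88)      # False
hallucinated = is_intrinsic_hallucination(0.91, 0.12)  # True
```

In practice the two probabilities would come from a pretrained NLI classifier; the point of the sketch is only the bidirectional-entailment decision rule, which is what makes lightweight NLI models usable as task-specialized detectors.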