🤖 AI Summary
Large language models (LLMs) produce context-inconsistency hallucinations in zero-shot scientific text summarisation, generating outputs that are misaligned with the user prompt. Method: We propose and evaluate two lightweight prompt engineering strategies, key-sentence context repetition and random sentence addition, on six instruction-tuned LLMs across eight scientific abstracts, yielding 336 generated summaries. Evaluation employs ROUGE, BERTScore, METEOR, and cosine similarity to assess lexical and semantic consistency; statistical significance is validated via BCa bootstrap confidence intervals and Wilcoxon signed-rank tests. Contribution/Results: Context repetition significantly improves the lexical alignment (p < 0.01) and semantic fidelity of summaries with their source texts, mitigating hallucination without model retraining. It offers a training-free, interpretable, and deployment-ready prompting paradigm for trustworthy scientific summarisation.
📝 Abstract
Large language models (LLMs) produce context inconsistency hallucinations: LLM-generated outputs that are misaligned with the user prompt. This work investigates whether prompt engineering (PE) methods can mitigate context inconsistency hallucinations in zero-shot LLM summarisation of scientific texts, where zero-shot indicates that the LLM receives no task-specific examples and relies solely on its pre-training. Across eight yeast biotechnology research paper abstracts, six instruction-tuned LLMs were prompted with seven methods: a baseline prompt, two levels of increasing instruction complexity (PE-1 and PE-2), two levels of context repetition (CR-K1 and CR-K2), and two levels of random addition (RA-K1 and RA-K2). Context repetition involved the identification and repetition of K key sentences from the abstract, whereas random addition involved the repetition of K randomly selected sentences from the abstract, where K is 1 or 2. A total of 336 LLM-generated summaries were evaluated using six metrics, ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, METEOR, and cosine similarity, which together quantify the lexical and semantic alignment between the summaries and the abstracts. Four hypotheses on the effects of the prompt methods on summary alignment with the reference text were tested. Statistical analysis of 3744 collected data points was performed using bias-corrected and accelerated (BCa) bootstrap confidence intervals and Wilcoxon signed-rank tests with Bonferroni-Holm correction. The results demonstrated that CR and RA significantly improved the lexical alignment of LLM-generated summaries with the abstracts. These findings indicate that prompt engineering has the potential to mitigate context inconsistency hallucinations in zero-shot scientific summarisation tasks.
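To make the prompting conditions concrete, the sketch below shows one plausible way to build CR-K and RA-K prompts. The abstract does not specify how key sentences are identified, so this sketch assumes selection by cosine similarity to the abstract's mean sentence embedding; the function names (`key_sentences`, `build_prompt`), the embedding model, and the prompt wording are all hypothetical.

```python
# Sketch of context-repetition (CR-K) and random-addition (RA-K) prompt
# construction. Key-sentence selection via centroid similarity is an
# assumption, not the paper's documented procedure.
import random

import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model is an assumption


def key_sentences(sentences: list[str], k: int) -> list[str]:
    """Return the K sentences closest to the centroid of the abstract."""
    emb = _model.encode(sentences, normalize_embeddings=True)
    centroid = emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    scores = emb @ centroid                      # cosine similarity to centroid
    top = np.argsort(scores)[::-1][:k]
    return [sentences[i] for i in sorted(top)]   # keep original sentence order


def build_prompt(abstract: str, sentences: list[str], method: str, k: int) -> str:
    """Assemble a zero-shot summarisation prompt for one abstract."""
    if method == "CR":                           # context repetition
        extra = key_sentences(sentences, k)
    elif method == "RA":                         # random addition
        extra = random.sample(sentences, k)
    else:                                        # baseline prompt
        extra = []
    repeated = " ".join(extra)
    return (
        "Summarise the following scientific abstract.\n\n"
        f"Abstract: {abstract}\n\n"
        + (f"Key context (repeated): {repeated}\n" if repeated else "")
    )
```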
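The six evaluation metrics are all available in standard open-source packages. The sketch below scores one (summary, abstract) pair; the specific library and model choices (rouge-score, bert-score, NLTK METEOR, a MiniLM sentence encoder) are assumptions, since the abstract names the metrics but not the implementations.

```python
# Sketch of the six-metric evaluation for one (summary, abstract) pair.
# Library choices are assumptions; METEOR also needs NLTK's 'wordnet'
# corpus (nltk.download("wordnet")).
from bert_score import score as bert_score
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

_rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
_embed = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model is an assumption


def evaluate(summary: str, abstract: str) -> dict[str, float]:
    """Compute lexical and semantic alignment scores against the abstract."""
    rouge = _rouge.score(abstract, summary)       # signature: score(target, prediction)
    _, _, f1 = bert_score([summary], [abstract], lang="en", verbose=False)
    meteor = meteor_score([abstract.split()], summary.split())
    emb = _embed.encode([summary, abstract], normalize_embeddings=True)
    cosine = float(util.cos_sim(emb[0], emb[1]))
    return {
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "bertscore_f1": float(f1[0]),
        "meteor": meteor,
        "cosine": cosine,
    }
```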
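Finally, the statistical analysis described above (BCa bootstrap confidence intervals plus Wilcoxon signed-rank tests with Bonferroni-Holm correction) maps directly onto SciPy and statsmodels. The sketch below compares one prompt method against the baseline on one metric, using 48 paired scores per condition (8 abstracts × 6 LLMs, per the design); the arrays are synthetic placeholders, not the paper's data.

```python
# Sketch of the statistical analysis: a BCa bootstrap CI on the mean paired
# difference, then a Wilcoxon signed-rank test with Holm correction.
# Requires SciPy >= 1.7 for method="BCa". Data below is synthetic.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
baseline = rng.normal(0.40, 0.05, size=48)           # placeholder ROUGE-1, baseline prompt
cr_k1 = baseline + rng.normal(0.03, 0.02, size=48)   # same summaries under CR-K1

# BCa bootstrap confidence interval for the mean paired difference.
diff = cr_k1 - baseline
ci = stats.bootstrap((diff,), np.mean, confidence_level=0.95,
                     method="BCa", random_state=0)
print("95% BCa CI for mean difference:", ci.confidence_interval)

# Wilcoxon signed-rank test per comparison, with Bonferroni-Holm correction
# applied across the family of tests (only one comparison shown here).
p_values = [stats.wilcoxon(cr_k1, baseline).pvalue]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print("Holm-adjusted p-values:", p_adjusted, "reject H0:", reject)
```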