🤖 AI Summary
Large language models (LLMs) often hallucinate in question answering, and a key underexplored factor is the temporality of questions: models fail to distinguish "evergreen" questions (whose answers remain stable over time) from "mutable" questions (whose answers change). This work introduces EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using it, the authors benchmark 12 modern LLMs to test whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). They also train EG-E5, a lightweight multilingual classifier that achieves state-of-the-art performance on evergreen classification. Finally, they show the practical utility of evergreen classification in three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o's retrieval behavior.
📝 Abstract
Large Language Models (LLMs) often hallucinate in question answering (QA) tasks. A key yet underexplored factor contributing to this is the temporality of questions -- whether they are evergreen (answers remain stable over time) or mutable (answers change). In this work, we introduce EverGreenQA, the first multilingual QA dataset with evergreen labels, supporting both evaluation and training. Using EverGreenQA, we benchmark 12 modern LLMs to assess whether they encode question temporality explicitly (via verbalized judgments) or implicitly (via uncertainty signals). We also train EG-E5, a lightweight multilingual classifier that achieves SoTA performance on this task. Finally, we demonstrate the practical utility of evergreen classification across three applications: improving self-knowledge estimation, filtering QA datasets, and explaining GPT-4o retrieval behavior.
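The abstract describes EG-E5 only at a high level, as a "lightweight multilingual classifier." Purely as an illustration (not the authors' implementation), such a classifier can be sketched as a linear probe over frozen multilingual sentence embeddings; here random vectors stand in for real E5 embeddings, and the dimensionality, class geometry, and training loop are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen multilingual sentence embeddings (assumption: in
# practice these would come from an E5-style encoder applied to questions).
DIM = 32
n_per_class = 100
evergreen = rng.normal(loc=+0.5, scale=1.0, size=(n_per_class, DIM))
mutable_q = rng.normal(loc=-0.5, scale=1.0, size=(n_per_class, DIM))

X = np.vstack([evergreen, mutable_q])
y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])

# Lightweight linear probe: logistic regression trained with plain
# gradient descent; only w and b are learned, the embeddings stay frozen.
w = np.zeros(DIM)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(evergreen)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * float(np.mean(p - y))

# Label 1 = evergreen, 0 = mutable; threshold the probe's probability.
pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) >= 0.5).astype(float)
accuracy = float(np.mean(pred == y))
```

On this synthetic, well-separated data the probe fits easily; the point is only that a frozen-embedding-plus-linear-head design is what "lightweight" typically means in this setting, which is why such a classifier can be cheap enough to use as a filter over large QA datasets.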