VISTA Score: Verification In Sequential Turn-based Assessment

📅 2025-10-30
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Conversational AI systems are prone to hallucination in fact-sensitive applications, yet existing evaluation methods handle multi-turn interaction poorly: they either assess single-turn responses in isolation or erroneously flag unverifiable statements as factual errors. To address this, we propose VISTA, a framework that models factual consistency as a dynamic property of dialogue. VISTA enables fine-grained, interpretable factuality assessment via three core components: (1) atomic factual claim extraction, (2) turn-level verifiability classification, and (3) cross-turn consistency tracking. It performs claim-level verification against trusted knowledge sources and the dialogue history, explicitly distinguishing unverifiable claims from false ones. Experiments across eight large language models and four benchmarks show that VISTA substantially outperforms FACTSCORE and LLM-as-Judge baselines, with higher inter-annotator agreement and more reliable assessment.
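
The listing contains no code, but the three-component design reads naturally as a small turn-by-turn pipeline. Below is a minimal Python sketch under that reading; every name here (Claim, VistaScorer, the sentence-splitting decomposition, the marker-based verifiability check) is a hypothetical stand-in for the paper's model-backed components, not its actual implementation.

```python
import re
from dataclasses import dataclass

@dataclass
class Claim:
    text: str         # atomic factual assertion extracted from a turn
    turn_index: int   # which assistant turn it came from
    verdict: str = "pending"

class VistaScorer:
    """Toy sketch of a three-stage, turn-by-turn factuality scorer."""

    SUBJECTIVE_MARKERS = ("i think", "in my opinion", "probably", "might")

    def __init__(self, trusted_facts):
        # Stand-in knowledge source: a set of known-true statements.
        self.trusted_facts = {f.lower() for f in trusted_facts}
        self.history = []  # claims accumulated across earlier turns

    def extract_claims(self, turn_text, turn_index):
        # (1) Placeholder decomposition: one claim per sentence; the real
        # system would use a model to produce genuinely atomic claims.
        sentences = [s.strip() for s in re.split(r"[.!?]", turn_text) if s.strip()]
        return [Claim(s, turn_index) for s in sentences]

    def verify(self, claim):
        text = claim.text.lower()
        # (2) Verifiability first, so subjective content is never
        # counted as a factual error.
        if any(m in text for m in self.SUBJECTIVE_MARKERS):
            return "unverifiable"
        # (3) Check trusted sources and the dialogue history itself.
        if text in self.trusted_facts:
            return "supported"
        if any(text == prior.text.lower() and prior.verdict == "supported"
               for prior in self.history):
            return "supported"
        return "unsupported"  # lacking evidence, distinct from "false"

    def score_turn(self, turn_text, turn_index):
        claims = self.extract_claims(turn_text, turn_index)
        for c in claims:
            c.verdict = self.verify(c)
        self.history.extend(claims)
        verifiable = [c for c in claims if c.verdict != "unverifiable"]
        if not verifiable:
            return 1.0  # nothing checkable in this turn
        return sum(c.verdict == "supported" for c in verifiable) / len(verifiable)
```

Scoring turns in sequence is what makes the history-aware check possible; that is precisely the step single-turn metrics such as FACTSCORE skip.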

📝 Abstract
Hallucination, defined here as generating statements unsupported or contradicted by available evidence or conversational context, remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA's decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.
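
The abstract names four categories of unverifiable statements. One illustrative way to encode them (these labels are a reading of the abstract, not the paper's actual label set):

```python
from enum import Enum

class UnverifiableType(Enum):
    """Illustrative encoding of the four unverifiable categories above."""
    SUBJECTIVE = "subjective"              # opinion or preference, not checkable
    CONTRADICTED = "contradicted"          # clashes with context, cannot stand as fact
    LACKING_EVIDENCE = "lacking_evidence"  # no trusted source covers the claim
    ABSTAINING = "abstaining"              # the model declined to assert
```

Keeping these verdicts distinct from a plain "false" label is what lets the framework avoid penalizing honest abstention or opinion as hallucination.
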
Problem

Research questions and friction points this paper is trying to address.

Evaluating factual reliability in multi-turn conversational AI systems
Detecting hallucinated claims against evidence and dialogue context
Tracking sequential consistency across turns when evaluating dialogue factuality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes dialogue turns into atomic factual claims
Verifies claims against trusted sources and history
Tracks sequential consistency for multi-turn evaluation (a toy sketch follows below)
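
As a rough illustration of the last item, the sketch below keeps a running store of claims and flags a new claim whose (naively computed) negation appeared in an earlier turn. The string-based negation heuristic is a toy stand-in for whatever entailment or verification model the paper actually uses; all names here are hypothetical.

```python
class ConsistencyTracker:
    """Toy cross-turn consistency check: flags claims that clash with
    claims asserted in earlier turns (a real system would use NLI)."""

    def __init__(self):
        self.seen = {}  # normalized claim text -> turn it first appeared in

    @staticmethod
    def _normalize(text):
        return " ".join(text.lower().split())

    @staticmethod
    def _negate(text):
        # Naive stand-in for entailment: toggle "not" after "is"/"was".
        for verb in (" is ", " was "):
            if verb + "not " in text:
                return text.replace(verb + "not ", verb)
            if verb in text:
                return text.replace(verb, verb + "not ")
        return text

    def check(self, claim_text, turn_index):
        claim = self._normalize(claim_text)
        negated = self._negate(claim)
        if negated in self.seen:
            return f"inconsistent with turn {self.seen[negated]}"
        self.seen[claim] = turn_index
        return "consistent so far"

tracker = ConsistencyTracker()
tracker.check("The capital of Australia is Canberra", turn_index=1)
print(tracker.check("The capital of Australia is not Canberra", turn_index=3))
# -> inconsistent with turn 1
```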