The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers

📅 2025-10-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) exhibit reliable factual accuracy on simple question-answering tasks but frequently give inconsistent short and long answers to the same factual question under complex queries, undermining trustworthiness. To address this, the authors propose SLAQ, a systematic evaluation framework grounded in controlled comparative experiments across 600 diverse questions and 16 state-of-the-art LLMs. Their empirical analysis reveals a pronounced decline in factual consistency as query complexity increases. They identify two previously unrecognized effects: *position-dependent accuracy loss*, where answer reliability degrades with position in the output, and an *answer momentum effect*, where consecutive correct or incorrect answers create self-reinforcing patterns. Leveraging similarity in internal activation patterns, they design a consistency predictor that discriminates alignment states between short and long answers, achieving up to 78% classification accuracy. This work establishes an interpretable paradigm for diagnosing structural limitations in LLM factuality and provides a reproducible toolkit for consistency-aware evaluation.

📝 Abstract
Large language models (LLMs) can correctly answer "When was Einstein born?" yet fail to provide the same date when writing about Einstein's life, revealing a fundamental inconsistency in how models access factual knowledge across task complexities. While models display impressive accuracy on factual question-answering benchmarks, the reliability gap between simple and complex queries remains poorly understood, eroding their trustworthiness. In this work, we introduce Short-Long Form Alignment for Factual Question Answering (SLAQ), a controlled evaluation framework that compares LLMs' answers to the same factual questions asked (a) in isolation (short) vs. (b) integrated into complex queries (long). Looking at 16 LLMs across 600 queries, we find a systematic misalignment of answers to the corresponding short and long queries. We further uncover position-dependent accuracy loss and momentum effects where consecutive correct or incorrect answers create self-reinforcing patterns. Through mechanistic analysis, we find that aligned facts activate overlapping model internals, and that metrics based on mechanistic similarity can predict short-long answer alignment with up to 78% accuracy. Our work establishes factual consistency over query complexity as an important aspect of LLMs' trustworthiness and challenges current evaluation practices, which implicitly assume that good performance on simple factual queries also implies reliability in more complex knowledge-seeking tasks.
Problem

Research questions and friction points this paper is trying to address.

LLMs show inconsistency between short and long factual answers
Factual reliability gap exists across query complexities in LLMs
Current evaluations assume simple query performance implies complex reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

SLAQ framework evaluates short-long answer alignment
Mechanistic analysis identifies overlapping model internals
Mechanistic-similarity metrics predict short-long answer alignment with up to 78% accuracy
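The consistency predictor compares internal activation patterns between the short and long answer. A minimal sketch of one plausible similarity metric, assuming mean-pooled hidden states and a hypothetical decision threshold (the paper's actual features, layers, and classifier are not specified here):

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray) -> np.ndarray:
    # hidden_states: (num_tokens, hidden_dim) activations for one answer span
    return hidden_states.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_alignment(short_acts: np.ndarray, long_acts: np.ndarray,
                      threshold: float = 0.8):
    """Predict whether short- and long-form answers agree, based on
    overlap of their internal activations (threshold is illustrative)."""
    sim = cosine_similarity(mean_pool(short_acts), mean_pool(long_acts))
    return sim >= threshold, sim

# Toy demo with synthetic activations standing in for real hidden states
rng = np.random.default_rng(0)
base = rng.normal(size=(12, 64))
aligned_long = base + 0.05 * rng.normal(size=(12, 64))   # nearly identical internals
divergent_long = rng.normal(size=(12, 64))               # unrelated internals

print(predict_alignment(base, aligned_long))
print(predict_alignment(base, divergent_long))
```

In practice such a similarity score would feed a trained classifier rather than a fixed cutoff; the sketch only illustrates the idea that aligned facts activate overlapping internals.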