Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations

📅 2026-05-13

🤖 AI Summary

This work addresses the vulnerability of large language models to context manipulation in long dialogues, where a response may rest on premises the conversation has since retracted. The authors propose a runtime dialogue verifier with linear per-turn cost and a formal conflict-free guarantee. An LLM classifies each turn into one of eight update operations, and a symbolic engine builds a dependency graph at runtime that records which claims depend on which evidence. Checking whether a response is supported becomes a graph walk, and a retraction propagates through the same graph to flag exactly the conclusions that lose support. The approach separates soundness from faithfulness: the structural check is sound by construction, while the faithfulness of the LLM's extraction is measured empirically across four LLM families. On the LongMemEval-KU oracle benchmark (n=78), the verifier reaches 89.7% accuracy, against 88.5% for an LLM-only baseline and 87.2% for a budget-matched transcript-RAG baseline; on a 15-item stale-premise subset it reaches 100% against 93.3%. Retraction checks stay at microsecond scale, while history replay grows linearly with conversation length.

📝 Abstract

In long conversations, an LLM can produce a next utterance that sounds plausible but rests on premises the conversation has already abandoned. Context-manipulation attacks against deployed agents now actively exploit this gap. We close it with a runtime verifier that maintains an explicit dependency graph: an LLM classifies each turn into one of 8 update operations drawn from four formalisms (dynamic epistemic logic, abductive reasoning, awareness logic, argumentation), and a symbolic engine records which claims depend on which evidence. Checking whether a continuation is supported reduces to a graph walk; retraction propagates through the same graph to flag exactly the conclusions that lose support, with linear per-turn cost and a formal conflict-free guarantee. On LongMemEval-KU oracle (n=78), the verifier reaches 89.7% accuracy vs. 88.5% for the LLM-only baseline (+1.3pp) and 87.2% for a transcript-RAG baseline matched on retrieval budget (+2.6pp); wins among disagreements are correct abstentions where the baseline confabulates. On LoCoMo's 60 official QA items the verifier is competitive with retrieval-augmented baselines. Beyond external benchmarks, we construct two multi-agent scenarios and a 50-item grounding test: on the 15-item stale-premise subset, the verifier reaches 100% accuracy vs. 93.3% (+6.7pp). These instantiate a soundness-faithfulness decomposition: the structural check is sound by construction, and per-deployment LLM extraction faithfulness is the empirical question we measure across four LLM families. The retraction check plateaus at microseconds while history-replay grows linearly with conversation length.
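The dependency-graph idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `DependencyGraph`, its methods, and the example claims are all hypothetical, and the sketch makes a simplifying assumption that every claim has a single justification whose premises are all required. It omits the eight update operations and the LLM classification step entirely, and shows only how retraction can propagate in time proportional to the edges it touches, so that the later support check only has to look at a continuation's direct premises.

```python
from collections import defaultdict, deque


class DependencyGraph:
    """Toy claim/evidence graph. All names are hypothetical, not from the paper."""

    def __init__(self):
        self.supports = {}                   # claim -> premises it rests on
        self.dependents = defaultdict(set)   # premise -> claims resting on it
        self.retracted = set()

    def assert_claim(self, claim, depends_on=()):
        """Record a claim together with the premises it depends on."""
        premises = set(depends_on)
        self.supports[claim] = premises
        for premise in premises:
            self.dependents[premise].add(claim)
        # Assumption: a claim built on an already-retracted premise is stale.
        if premises & self.retracted:
            self.retracted.add(claim)

    def retract(self, claim):
        """Retract a claim; return every claim that loses support (BFS order)."""
        flagged = []
        queue = deque([claim])
        while queue:
            node = queue.popleft()
            if node in self.retracted:
                continue  # already handled, so each node is visited once
            self.retracted.add(node)
            flagged.append(node)
            queue.extend(sorted(self.dependents[node]))
        return flagged

    def continuation_supported(self, premises):
        """A continuation is supported if all its premises are known and live."""
        return all(p in self.supports and p not in self.retracted
                   for p in premises)


g = DependencyGraph()
g.assert_claim("user lives in Paris")
g.assert_claim("user works remotely")
g.assert_claim("recommend Paris cafes",
               ["user lives in Paris", "user works remotely"])
g.assert_claim("book table at Cafe X", ["recommend Paris cafes"])

flagged = g.retract("user lives in Paris")
print(flagged)
print(g.continuation_supported(["user works remotely"]))    # True
print(g.continuation_supported(["recommend Paris cafes"]))  # False
```

Because retraction is propagated eagerly in this sketch, the check on a candidate continuation does not depend on conversation length, which is one plausible reading of why a graph-based check can stay flat while replaying the full history grows with it.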

Problem

Research questions and friction points this paper is trying to address.

groundedness

context-manipulation attacks

conversational consistency

premise abandonment

LLM hallucination

Innovation

Methods, ideas, or system contributions that make the work stand out.

runtime verification

dependency graph

linear-time algorithm