Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a flaw in current cited RAG evaluation: a topically relevant citation is treated as sufficient evidence, so mismatches between the strength of a claim and what the cited source actually warrants go unnoticed. The authors call this failure “citation laundering.” To diagnose it, they introduce evidence-force calibration and present FORCEBENCH, a contrastive stress test that holds the cited passage fixed and pairs an evidence-calibrated claim with a localized, force-raised variant along five axes: relation, modality, scope, temporal validity, and numeric specificity. Using a fixed, locality-filtered set of 198 pairs and the Monotonicity Violation Rate (MVR), which tracks how often an evaluator fails to score the evidence-calibrated claim higher, the study finds systematic calibration failures across four model judges: aggregate MVR is 47.2% under standard generic support prompting and falls to 24.5% with explicit warrant-strength prompting, which remains imperfect. Token and entity overlap metrics violate monotonicity on 32.8–36.4% of pairs. The authors release the benchmark, prompts, outputs, and a plug-in pipeline so that citation evaluators can report MVR and force sensitivity alongside conventional support metrics.

📝 Abstract

Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant the attached wording. We study this diagnostic failure as citation laundering: a related source is presented as warrant for an over-strong claim. We introduce FORCEBENCH, a contrastive stress test for evidence-force calibration. Each item holds a cited passage fixed and pairs an evidence-calibrated claim with a localized force-raised variant across five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A calibrated evaluator should score the evidence-calibrated claim higher. Headline experiments use a fixed, locality-filtered 198-pair evaluation set. A citation-presence sanity check is uninformative by design; token and entity overlap still violate monotonicity on 32.8–36.4% of pairs. Across four reported model judges, standard generic support prompting is insufficient for this force-calibration stress test (aggregate MVR 47.2%), while explicit warrant-strength prompting lowers MVR to 24.5% but remains imperfect. We release the benchmark, prompts, outputs, and plug-in pipeline so citation evaluators can report monotonicity violation rate and force sensitivity alongside conventional support metrics.
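
As a rough illustration of the monotonicity check described in the abstract, the following Python sketch computes a violation rate over claim pairs. It is not the released FORCEBENCH pipeline: the names `ClaimPair`, `monotonicity_violation_rate`, and `token_overlap`, the whitespace tokenizer, the choice to count ties as violations, and the two example pairs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ClaimPair:
    """One contrastive item (hypothetical structure, not the released schema)."""
    passage: str       # cited passage, held fixed for both claims
    calibrated: str    # evidence-calibrated claim
    force_raised: str  # localized, over-strong variant


def monotonicity_violation_rate(
    pairs: Iterable[ClaimPair],
    score: Callable[[str, str], float],
    ties_violate: bool = True,
) -> float:
    """Fraction of pairs where the calibrated claim is not scored higher.

    Assumption: whether a tie counts as a violation is not stated in the
    source, so it is exposed here as a parameter.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    violations = 0
    for p in pairs:
        s_cal = score(p.passage, p.calibrated)
        s_raised = score(p.passage, p.force_raised)
        if s_cal < s_raised or (ties_violate and s_cal == s_raised):
            violations += 1
    return violations / len(pairs)


def token_overlap(passage: str, claim: str) -> float:
    """Toy baseline: share of claim tokens that also occur in the passage."""
    passage_tokens = set(passage.lower().split())
    claim_tokens = claim.lower().split()
    if not claim_tokens:
        return 0.0
    return sum(t in passage_tokens for t in claim_tokens) / len(claim_tokens)


# Invented example pairs (not taken from the benchmark).
demo_pairs = [
    ClaimPair(  # modality axis: "may reduce" raised to "cures"
        passage="the drug may reduce symptoms in some adults",
        calibrated="the drug may reduce symptoms",
        force_raised="the drug cures symptoms",
    ),
    ClaimPair(  # scope axis: regional qualifier dropped
        passage="sales rose in 2020 in some regions",
        calibrated="sales rose in certain regions in 2020",
        force_raised="sales rose in 2020",
    ),
]

print(monotonicity_violation_rate(demo_pairs, token_overlap))
```

On these two invented pairs the overlap scorer handles the modality change but prefers the force-raised claim on the scope pair, because dropping the qualifier leaves only tokens that appear in the passage. A scorer that returns the same value for both claims, as a citation-presence check would when the citation is held fixed, produces only ties, so its rate is determined entirely by the tie convention rather than by the claims.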

Problem

Research questions and friction points this paper is trying to address.

citation laundering

evidence-force calibration

RAG evaluation

monotonicity violation

claim warrant

Innovation

Methods, ideas, or system contributions that make the work stand out.

evidence-force calibration

citation laundering

FORCEBENCH