🤖 AI Summary
Existing static benchmarks inadequately evaluate models’ ability to detect misinformation in real-world, dynamic, multilingual online environments. This work introduces CommunityFact, a refreshable, multilingual, multidomain benchmark for misinformation detection that covers five languages and two domains with 15,992 fine-grained annotated claims and, for the first time, uses Community Notes as a training and evaluation signal. Ten LLMs are evaluated under varying inference-time capabilities, including thinking and web search. Experiments show that closed-input verification remains challenging and that web access yields the largest gains, reveal significant performance disparities across language–domain combinations, and find that web-enabled models’ source selection is systematically misaligned with the sources that human Community Notes raters converge on, a gap that closes through model-specific retrieval expansion or pruning.
📝 Abstract
Misinformation verification increasingly occurs in public, fast-moving, and multilingual online settings, where static benchmarks provide an incomplete measure of model reliability. We introduce CommunityFact, a refreshable benchmark for misinformation detection in the wild, with three major goals: coverage, granularity, and redistributability. This release contains 15,992 standalone claims across five languages and two domains. We evaluate ten LLMs under varying inference-time capabilities, including thinking and web search. Our results show that closed-input verification remains challenging, that web access yields the largest gains, and that web-enabled LLMs' source-selection policies are systematically misaligned with the sources that human Community Notes raters converge on; this gap closes through model-specific mechanisms of retrieval expansion or pruning. We further find substantial variation across language-domain slices and across the evidence ecosystems used by web-enabled systems. Beyond evaluation, CommunityFact positions Community Notes as a training signal for claim-conditioned source suggesters that could improve factual verification on novel claims.