🤖 AI Summary
This study addresses a critical gap in the evaluation of large language models (LLMs): while existing benchmarks predominantly assess factual recall under direct queries, they largely overlook performance when entities are referenced indirectly through contextual mentions—a common pattern in natural language. The authors present the first systematic evaluation of multilingual LLMs’ factual recall under such context-mediated conditions. They construct controlled prompts that preserve factual content while introducing referential context, and evaluate models across five languages, comparing the effects of real versus synthetic names. Results show that contextual mediation consistently degrades factual recall accuracy, with substantial variation across relations. Larger models prove more robust, narrowing the gap relative to direct queries, while neither name type nor name origin yields significant or systematic performance differences. This reveals a fundamental discrepancy between isolated fact retrieval and context-dependent comprehension.
📝 Abstract
Large language models (LLMs) can recall a wide range of factual knowledge across languages. However, existing factual recall evaluations primarily assess fact retrieval in isolation, where the queried entity is explicitly named and the fact is requested directly. In natural language use, facts are often accessed through context, where the relevant entity is introduced only indirectly. In this work, we study contextually mediated factual recall across languages, asking whether LLMs can reliably retrieve factual knowledge when the target entity is embedded in a naturalistic context rather than queried explicitly. We construct controlled prompts that preserve the underlying fact while introducing referential mediation through contextual sentences. To disentangle contextual effects from name-specific associations, we further compare performance using synthetic names and real names across languages. Evaluating multiple model families in five languages, we find that contextual mediation consistently degrades factual recall, with substantial variation across relations. Larger models are more robust to contextual mediation, exhibiting a reduced performance gap relative to direct queries, while the effect of real names and name origin is mixed and unsystematic. These findings highlight a gap between isolated factual recall and context-dependent language understanding in multilingual LLMs.
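To make the direct vs. context-mediated contrast concrete, here is a minimal sketch of the kind of prompt-pair construction the abstract describes. The templates, entity, and referent phrase below are illustrative assumptions, not taken from the paper; the point is only that both prompts probe the same underlying fact, while the mediated version names the entity only in a separate context sentence.

```python
# Hypothetical sketch of direct vs. context-mediated prompt construction.
# All templates and names here are illustrative, not from the paper.

def direct_prompt(entity: str, relation_template: str) -> str:
    """Direct query: the entity is named explicitly in the question."""
    return relation_template.format(entity=entity)

def mediated_prompt(entity: str, context_template: str,
                    relation_template: str,
                    referent: str = "this person") -> str:
    """Context-mediated query: the entity appears only in a context
    sentence; the question refers to it indirectly via `referent`."""
    context = context_template.format(entity=entity)
    question = relation_template.format(entity=referent)
    return f"{context} {question}"

relation = "In which city was {entity} born?"

direct = direct_prompt("Marie Curie", relation)
mediated = mediated_prompt(
    "Marie Curie",
    "I just finished reading a biography of {entity}.",
    relation,
)

print(direct)
# In which city was Marie Curie born?
print(mediated)
# I just finished reading a biography of Marie Curie. In which city was this person born?
```

Comparing model accuracy on such matched pairs (and swapping in synthetic names for the real entity) isolates the effect of referential mediation from the fact itself, which is the kind of controlled comparison the study performs.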