🤖 AI Summary
This study identifies a structural deficiency in large language models (LLMs) with respect to historical Olympic medal knowledge: while LLMs accurately retrieve national medal counts (Task 1, >90% accuracy), they exhibit severe limitations in ranking reasoning (Task 2, <35% accuracy). We construct a systematic, fine-grained Olympic medal dataset and evaluate multiple state-of-the-art LLMs via zero-shot prompting. Our empirical analysis reveals, for the first time, that LLMs’ knowledge representation is biased toward factual recall rather than relational reasoning: they reliably encode “how many” but fail to consistently infer “which rank.” This finding points to a fundamental divergence between LLMs’ internal knowledge organization and human-like structured reasoning, challenging the implicit assumption that LLMs serve as general-purpose reasoning engines. To support reproducible evaluation of structured reasoning capabilities, we publicly release all code, data, and model outputs, establishing a benchmark for assessing relational inference in foundation models.
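The two task formats contrasted above can be sketched as zero-shot prompt templates. This is a minimal illustration of the count-vs-rank distinction, not the paper's actual prompt wording; the function names and phrasing are hypothetical:

```python
def count_prompt(team: str, year: int) -> str:
    # Task 1 (factual recall): query the medal count of a single team.
    return (f"How many gold, silver, and bronze medals did {team} "
            f"win at the {year} Summer Olympics?")

def rank_prompt(rank: int, year: int) -> str:
    # Task 2 (relational reasoning): query which team held a given
    # position, which requires comparing counts across all teams.
    return (f"Which team finished in position {rank} of the medal "
            f"table at the {year} Summer Olympics?")
```

The asymmetry in difficulty arises because Task 1 can be answered from a single memorized fact, whereas Task 2 implicitly requires aggregating and ordering many such facts.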
📝 Abstract
Large language models (LLMs) have become a dominant approach in natural language processing, yet their internal knowledge structures remain largely unexplored. In this paper, we analyze the internal knowledge structures of LLMs using historical medal tallies from the Olympic Games. We task the models with providing the medal counts for each team and identifying which teams achieved specific rankings. Our results reveal that while state-of-the-art LLMs perform remarkably well in reporting medal counts for individual teams, they struggle significantly with questions about specific rankings. This suggests that the internal knowledge structures of LLMs are fundamentally different from those of humans, who can easily infer rankings from known medal counts. To support further research, we publicly release our code, dataset, and model outputs.
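The inference that humans find easy, deriving a ranking from known medal counts, amounts to a lexicographic sort. A minimal sketch, using a hypothetical tally rather than the released dataset, and assuming the standard Olympic convention of ranking by golds, then silvers, then bronzes:

```python
# Hypothetical medal tally: team -> (gold, silver, bronze).
tally = {
    "USA":   (39, 41, 33),
    "China": (38, 32, 19),
    "Japan": (27, 14, 17),
}

# Tuples compare lexicographically, so sorting by the (gold, silver,
# bronze) triple in descending order yields the conventional ranking.
ordered = sorted(tally, key=lambda team: tally[team], reverse=True)
ranking = {team: pos + 1 for pos, team in enumerate(ordered)}
```

Given the counts, the rank of any team follows mechanically; the study's finding is that LLMs do not reliably perform this step even when they can report the counts themselves.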