🤖 AI Summary
This study systematically evaluates the breadth and accuracy of large language models' (LLMs) knowledge of historical financial data for U.S. publicly traded companies. Method: Using a benchmark of more than 197,000 fact-based questions grounded in authoritative financial statements, and combining truth verification, multi-dimensional regression analysis, prompt engineering, empirical response comparison, and consistency validation, the study quantifies LLM performance across temporal, structural, and linguistic dimensions. Contribution/Results: We identify three pervasive limitations: (1) temporal decay, i.e., reduced accuracy for earlier fiscal periods; (2) scale bias, i.e., stronger recall for larger firms but higher hallucination rates for them; and (3) selective coverage correlated with institutional attention and financial-statement readability. Crucially, we provide the first quantitative evidence that firm size, analyst coverage, and disclosure clarity significantly modulate LLM knowledge accuracy. The study releases an open-source financial LLM evaluation benchmark, establishing a methodological and empirical foundation for assessing LLM reliability in finance.
📝 Abstract
Large Language Models (LLMs) are frequently used as sources of knowledge for question answering. While it is known that LLMs lack access to real-time data or data produced after a model's training cutoff, it is less clear how well their knowledge spans historical information. In this study, we assess the breadth of LLMs' knowledge using financial data of U.S. publicly traded companies, evaluating more than 197,000 questions and comparing model responses against factual data. We further explore how company characteristics, such as size, retail investment, institutional attention, and readability of financial filings, affect the accuracy of the knowledge represented in LLMs. Our results reveal that LLMs are less informed about past financial performance but display stronger awareness of larger companies and more recent information. Interestingly, our analysis also reveals that LLMs are more likely to hallucinate for larger companies, especially for data from more recent years. We will make the code, prompts, and model outputs public upon publication of this work.