🤖 AI Summary
This paper addresses the reliability threat posed by the unpredictable errors of large language models (LLMs) in empirical economic research and introduces the first econometric framework tailored to LLM-based empirical applications. Methodologically, it distinguishes between predictive tasks (e.g., hypothesis generation) and estimation tasks (e.g., quantifying concepts expressed in text), and proposes two foundational principles: "no training-data leakage" and "mandatory validation data." It adopts a dual-track approach: using open-source models with transparent training data to ensure predictive validity, and using small, domain-specific validation samples to calibrate estimation bias. The key contribution is the first systematic delineation of the boundary conditions under which LLMs can be reliably deployed in economics. Empirical results demonstrate that violating either principle induces statistically significant estimation bias, whereas strict adherence enables high-fidelity prediction and bias-calibrated estimation, even with minimal textual data.
📝 Abstract
How can we use the novel capacities of large language models (LLMs) in empirical research? And how can we do so while accounting for their limitations, which are themselves only poorly understood? We develop an econometric framework to answer these questions that distinguishes between two types of empirical tasks. Using LLMs for prediction problems (including hypothesis generation) is valid under one condition: no "leakage" between the LLM's training dataset and the researcher's sample. No leakage can be ensured by using open-source LLMs with documented training data and published weights. Using LLM outputs for estimation problems to automate the measurement of some economic concept (expressed either in text or by human subjects) requires the researcher to collect at least some validation data: without such data, the errors of the LLM's automation cannot be assessed and accounted for. As long as these steps are taken, LLM outputs can be used in empirical research with the familiar econometric guarantees we desire. Using two illustrative applications to finance and political economy, we find that these requirements are stringent; when they are violated, the limitations of LLMs result in unreliable empirical estimates. Our results suggest the excitement around the empirical uses of LLMs is warranted -- they allow researchers to effectively use even small amounts of language data for both prediction and estimation -- but only with these safeguards in place.
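To make the "mandatory validation data" requirement concrete, here is a minimal sketch (not the paper's actual estimator) of how a small human-labeled validation sample can be used to correct an LLM-based estimate of a binary concept's prevalence. It applies the standard misclassification adjustment, using the validation sample to estimate the LLM's false-positive and false-negative rates; all function and variable names are illustrative.

```python
import numpy as np

def calibrate_proportion(llm_labels, val_true, val_llm):
    """Correct a naive LLM-based proportion estimate for labeling error.

    llm_labels: binary LLM labels on the full (unvalidated) sample
    val_true:   human "ground truth" labels on a small validation subsample
    val_llm:    LLM labels on that same validation subsample

    Uses the classical misclassification correction:
        p = (p_hat - fpr) / (1 - fpr - fnr)
    """
    llm_labels = np.asarray(llm_labels)
    val_true = np.asarray(val_true)
    val_llm = np.asarray(val_llm)

    p_hat = llm_labels.mean()                # naive estimate from LLM labels
    fpr = val_llm[val_true == 0].mean()      # false-positive rate on validation data
    fnr = 1 - val_llm[val_true == 1].mean()  # false-negative rate on validation data

    denom = 1 - fpr - fnr
    if denom <= 0:
        # LLM labels carry no signal relative to chance; calibration is impossible
        raise ValueError("validation data show the LLM labels are uninformative")
    return (p_hat - fpr) / denom

# Example: the LLM labels 50% of texts positive, but validation data reveal a
# 20% false-positive rate and a 20% false-negative rate, so the calibrated
# prevalence is (0.5 - 0.2) / 0.6 = 0.5.
val_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
val_llm = [1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
print(calibrate_proportion([1, 0] * 5, val_true, val_llm))  # → 0.5
```

The point of the sketch is the abstract's claim in miniature: without `val_true` and `val_llm`, the error rates are unidentified and the naive estimate `p_hat` cannot be de-biased; even a small validation sample restores a consistent estimate (at the cost of added sampling variance from the estimated error rates).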