🤖 AI Summary
This study asks whether frozen large language model (LLM) checkpoints can extract signals from historical public text that predict stock fundamentals and returns beyond what standard contemporaneous valuation measures capture. The authors treat twelve frozen OpenAI snapshots spanning 2021–2025 as compressed representations of the public textual record at each knowledge cutoff, and construct a sector-neutral LLM outlook score for roughly 7,000 U.S. equities per cross-section. The score is positively associated with analyst revisions, target-price changes, and one-month cross-sectional returns in both Fama-MacBeth regressions and pooled panels with model fixed effects (t = 6.02), after controlling for market-implied valuation and standard factors. Predictability broadly increases with the return horizon, apart from a non-monotonic intermediate dip, and in the pooled panel it is stronger for firms with high analyst coverage. The authors read this as evidence that the bottleneck is the cost of aggregating dispersed qualitative information across many documents rather than investor inattention.
📝 Abstract
Frozen large language model (LLM) checkpoints extract information from pre-cutoff public text that is associated with future fundamentals and equity returns beyond standard contemporaneous valuation measures. Because each frozen checkpoint has a fixed knowledge cutoff, it can be interpreted as a compressed representation of publicly available textual information at a given point in time. We treat twelve OpenAI snapshots spanning 2021–2025 as time-stamped summaries of the public textual record and extract a sector-neutral LLM outlook score for roughly 7,000 U.S. equities per cross-section. The outlook score is positively associated with analyst revisions, target-price changes, and one-month cross-sectional returns in both Fama-MacBeth regressions and pooled panels with model fixed effects (t = 6.02), after direct controls for market-implied valuation and standard factors. Predictability broadly increases with the return horizon, despite a non-monotonic intermediate dip. In the pooled panel, it is also stronger for firms with high analyst coverage, consistent with the view that the bottleneck is not investor inattention but the cost of aggregating dispersed qualitative information across many documents.
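The two generic building blocks named in the abstract, a sector-neutral score and a Fama-MacBeth estimate, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes sector neutralization means subtracting the within-sector mean, it runs a univariate cross-sectional regression without the valuation and factor controls the abstract mentions, and all function names and inputs are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def sector_neutralize(scores, sectors):
    """Demean raw scores within each sector for one cross-section.

    scores:  dict ticker -> raw outlook score (hypothetical input)
    sectors: dict ticker -> sector label (hypothetical input)
    Assumption: "sector-neutral" is taken to mean within-sector demeaning.
    """
    by_sector = defaultdict(list)
    for ticker, score in scores.items():
        by_sector[sectors[ticker]].append(score)
    sector_mean = {name: mean(vals) for name, vals in by_sector.items()}
    return {t: s - sector_mean[sectors[t]] for t, s in scores.items()}


def cross_sectional_slope(x, y):
    """OLS slope of y on x (with intercept) for a single cross-section."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def fama_macbeth(cross_sections):
    """Average the per-period slopes and compute a simple t-statistic.

    cross_sections: list of (scores, next_period_returns) pairs, one per
    period. The standard error is the plain time-series standard error of
    the slopes, with no autocorrelation adjustment (a simplification).
    """
    slopes = [cross_sectional_slope(x, y) for x, y in cross_sections]
    avg = mean(slopes)
    se = stdev(slopes) / sqrt(len(slopes))
    return avg, avg / se
```

In this setup each cross-section would correspond to one checkpoint's scores paired with subsequent returns; a fuller version would add control variables to each cross-sectional regression.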