Is All the Information in the Price? LLM Embeddings versus the EMH in Stock Clustering

📅 2025-09-01
🤖 AI Summary
This study investigates whether AI-based methods can outperform conventional approaches in stock clustering under the semi-strong form of the Efficient Market Hypothesis (EMH). It compares three clustering strategies within a single framework: (i) price-based clustering using historical return correlations, (ii) Global Industry Classification Standard (GICS) sector labels, and (iii) large language model (LLM)-derived embeddings of news headlines. Each clustering is converted into a dynamic synthetic factor model grounded in the Arbitrage Pricing Theory (APT) and evaluated on its out-of-sample predictive performance. Price-based clustering consistently achieves the best predictive accuracy, reducing root mean squared error (RMSE) by 15.9% relative to GICS and by 14.7% relative to LLM embeddings, which the authors read as empirical support for the EMH. The methodology applies to any equity grouping, giving practitioners a diagnostic for monitoring how sector structures evolve and giving researchers a framework for testing how quickly markets absorb information.

📝 Abstract
This paper investigates whether artificial intelligence can enhance stock clustering compared to traditional methods. We consider this in the context of the semi-strong Efficient Markets Hypothesis (EMH), which posits that prices fully reflect all public information and, accordingly, that clusters based on price information cannot be improved upon. We benchmark three clustering approaches: (i) price-based clusters derived from historical return correlations, (ii) human-informed clusters defined by the Global Industry Classification Standard (GICS), and (iii) AI-driven clusters constructed from large language model (LLM) embeddings of stock-related news headlines. At each date, each method assigns every stock to exactly one cluster. To evaluate a clustering, we transform it into a synthetic factor model following the Arbitrage Pricing Theory (APT) framework. This enables consistent evaluation of predictive performance in a roll-forward, out-of-sample test. Using S&P 500 constituents from 2022 through 2024, we find that price-based clustering consistently outperforms both rule-based and AI-based methods, reducing root mean squared error (RMSE) by 15.9% relative to GICS and 14.7% relative to LLM embeddings. Our contributions are threefold: (i) a generalizable methodology that converts any equity grouping (manual, machine, or market-driven) into a real-time factor model for evaluation; (ii) the first direct comparison of price-based, human rule-based, and AI-based clustering under identical conditions; and (iii) empirical evidence reinforcing that short-horizon return information is largely contained in prices. These results support the EMH while offering practitioners a practical diagnostic for monitoring evolving sector structures and providing academics with a framework for testing alternative hypotheses about how quickly markets absorb information.
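The evaluation idea in the abstract (turn a clustering into a factor model, then score it in a roll-forward test) can be illustrated with a minimal sketch. The abstract does not specify the construction, so everything here is an assumption: each cluster's factor is taken to be the equal-weighted mean return of its members, each stock's loading is a no-intercept least-squares slope on a trailing window, and the factor's realized value on the test date is treated as known. The function names, the window length, and the toy returns are all hypothetical.

```python
from math import sqrt
from statistics import mean

def cluster_factors(returns, clusters):
    """Equal-weighted mean return of each cluster on each date (assumed factor)."""
    n_dates = len(next(iter(returns.values())))
    members = {}
    for ticker, label in clusters.items():
        members.setdefault(label, []).append(ticker)
    return {
        label: [mean(returns[t][d] for t in tickers) for d in range(n_dates)]
        for label, tickers in members.items()
    }

def ols_beta(y, x):
    """Slope of a no-intercept least-squares regression of y on x."""
    sxx = sum(v * v for v in x)
    return sum(a * b for a, b in zip(y, x)) / sxx if sxx else 0.0

def rolling_rmse(returns, clusters, window):
    """Fit each loading on the trailing window, then score the next date."""
    factors = cluster_factors(returns, clusters)
    n_dates = len(next(iter(returns.values())))
    sq_errors = []
    for t in range(window, n_dates):
        for ticker, series in returns.items():
            factor = factors[clusters[ticker]]
            beta = ols_beta(series[t - window:t], factor[t - window:t])
            sq_errors.append((series[t] - beta * factor[t]) ** 2)
    return sqrt(mean(sq_errors))

# Toy daily returns (made up): A and B co-move, C and D co-move.
returns = {
    "A": [0.010, -0.020, 0.030, 0.000, 0.020, -0.010],
    "B": [0.012, -0.018, 0.028, 0.002, 0.019, -0.012],
    "C": [-0.010, 0.020, -0.030, 0.010, -0.020, 0.015],
    "D": [-0.012, 0.022, -0.027, 0.008, -0.018, 0.013],
}
matched = {"A": "x", "B": "x", "C": "y", "D": "y"}     # groups co-moving stocks
mismatched = {"A": "x", "C": "x", "B": "y", "D": "y"}  # mixes the two groups

print("matched RMSE:   ", rolling_rmse(returns, matched, window=3))
print("mismatched RMSE:", rolling_rmse(returns, mismatched, window=3))
```

Because `rolling_rmse` only needs a ticker-to-label mapping, the same scoring could be applied to GICS labels, price-based clusters, or embedding-based clusters, which is the property that makes the comparison in the abstract like-for-like.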
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI's ability to improve stock clustering methods
Testing if LLM embeddings outperform traditional price-based clustering
Assessing whether market prices fully reflect all public information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses LLM embeddings from news headlines for clustering
Converts clustering into APT factor model for evaluation
Compares price-based, rule-based and AI methods identically
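As a rough illustration of how one grouping routine can serve both the price-based and the embedding-based inputs listed above, the sketch below clusters items by pairwise similarity: Pearson correlation for return series and cosine similarity for embedding vectors. The summary does not say which clustering algorithm the paper uses; average-linkage agglomerative clustering is swapped in here purely as an assumption, and the 2-D "embeddings" are invented numbers rather than real LLM output.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def cosine(x, y):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def agglomerate(vectors, similarity, n_clusters):
    """Average-linkage agglomerative clustering on distance 1 - similarity."""
    names = list(vectors)
    dist = {
        (a, b): 1.0 - similarity(vectors[a], vectors[b])
        for a in names for b in names if a != b
    }
    groups = [[name] for name in names]
    while len(groups) > n_clusters:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                total = sum(dist[a, b] for a in groups[i] for b in groups[j])
                avg = total / (len(groups[i]) * len(groups[j]))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        merged = groups.pop(j)  # j > i, so index i is unaffected
        groups[i].extend(merged)
    return {name: label for label, group in enumerate(groups) for name in group}

# Toy inputs (made up): return series and hypothetical headline embeddings.
toy_returns = {
    "A": [0.010, -0.020, 0.030, 0.000],
    "B": [0.012, -0.018, 0.028, 0.002],
    "C": [-0.010, 0.020, -0.030, 0.010],
    "D": [-0.012, 0.022, -0.027, 0.008],
}
toy_embeddings = {"A": [1.0, 0.1], "B": [0.9, 0.2], "C": [0.1, 1.0], "D": [0.2, 0.9]}

price_clusters = agglomerate(toy_returns, pearson, n_clusters=2)
text_clusters = agglomerate(toy_embeddings, cosine, n_clusters=2)
print(price_clusters)
print(text_clusters)
```

Either output is a plain ticker-to-label mapping, so both can be passed to the same downstream evaluation without special handling.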