🤖 AI Summary
Prior work lacks a systematic evaluation of large language models' (LLMs) practical capabilities on predictive analytics tasks. Method: We introduce PredictiQ, a comprehensive benchmark of 1,130 real-world, data-driven prediction questions spanning eight domains, and propose a three-dimensional evaluation framework integrating textual understanding, executable code generation, and logical consistency verification. Empirical assessment covers 12 state-of-the-art LLMs, leveraging program synthesis, natural language inference, and structured output validation. Results: Current LLMs exhibit weak generalization, high code error rates, and deficient causal reasoning on prediction tasks, with overall accuracy below 40%. These findings reveal fundamental limitations in their reliability for rigorous predictive analysis. This work establishes the first domain-specific benchmark and methodological foundation for evaluating LLMs in predictive analytics.
📝 Abstract
Predictive analysis is a cornerstone of modern decision-making, with applications across diverse domains. Large Language Models (LLMs) have emerged as powerful tools for enabling nuanced, knowledge-intensive conversations, thereby aiding complex decision-making tasks. With growing expectations to harness LLMs for predictive analysis, there is an urgent need to systematically assess their capability in this domain; however, existing studies lack such evaluations. To bridge this gap, we introduce the **PredictiQ** benchmark, which integrates 1,130 sophisticated predictive analysis queries drawn from 44 real-world datasets spanning 8 diverse fields. We design an evaluation protocol that considers text analysis, code generation, and their alignment. Twelve renowned LLMs are evaluated, offering insights into their practical use in predictive analysis. Overall, we find that existing LLMs still face considerable challenges in conducting predictive analysis. Code is available at https://github.com/Cqkkkkkk/PredictiQ.