AI Summary
This work addresses the persistent challenges of insufficient reliability and weak deployment capabilities of general-purpose AI agents in high-stakes vertical domains such as finance, retail, public health, and natural disasters. To bridge this gap, we propose FutureX-Pro, a novel framework that systematically extends agent-based future prediction capabilities across multiple critical verticals. The framework introduces a contamination-free, real-time evaluation pipeline comprising five domain-specialized subsystems and employs domain-tailored forecasting tasks to rigorously benchmark state-of-the-art large language model agents. Our evaluation reveals a significant disparity between the agents' general reasoning abilities and the precision required in specialized real-world scenarios. This study establishes the first real-time benchmark dedicated to high-value domains and provides a foundational direction for the development of domain-specific intelligent agents.
Abstract
Building upon FutureX, which established a live benchmark for general-purpose future prediction, this report introduces FutureX-Pro, including FutureX-Finance, FutureX-Retail, FutureX-PublicHealth, FutureX-NaturalDisaster, and FutureX-Search. These together form a specialized framework extending agentic future prediction to high-value vertical domains. While generalist agents demonstrate proficiency in open-domain search, their reliability in capital-intensive and safety-critical sectors remains under-explored. FutureX-Pro targets four economically and socially pivotal verticals: Finance, Retail, Public Health, and Natural Disaster. We benchmark agentic Large Language Models (LLMs) on entry-level yet foundational prediction tasks, ranging from forecasting market indicators and supply chain demands to tracking epidemic trends and natural disasters. By adapting the contamination-free, live-evaluation pipeline of FutureX, we assess whether current State-of-the-Art (SOTA) agentic LLMs possess the domain grounding necessary for industrial deployment. Our findings reveal a performance gap between generalist reasoning and the precision required for high-value vertical applications.