🤖 AI Summary
Large language model (LLM) agents face significant challenges in the cryptocurrency domain, including high time sensitivity, an adversarial information environment, and heterogeneous data sources, yet they lack rigorous, dynamic evaluation benchmarks. Method: We introduce the first expert-driven, dynamic benchmark platform, updated monthly with 50 real-world questions that integrate on-chain data, real-time DeFi dashboards, and domain-expert knowledge. We propose a novel four-quadrant task taxonomy to precisely characterize the imbalance between retrieval and predictive capabilities. Contribution/Results: Evaluation across 10 state-of-the-art LLMs reveals a consistent capability gap: agents excel at information retrieval but underperform significantly in deep reasoning and multi-source synthesis, exposing a fundamental bottleneck in domain-specific analytical competence. This work establishes a new evaluation paradigm and a scalable framework for assessing LLM agents in high-stakes, time-critical, adversarial vertical domains.
📝 Abstract
This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: *extreme time-sensitivity*, *a highly adversarial information environment*, and the critical need to synthesize data from *diverse, specialized sources*, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills.
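The four-quadrant system lends itself to a simple programmatic representation. Below is a minimal sketch of how such a taxonomy might be encoded; the class names, fields, and example question are illustrative assumptions, not artifacts from the paper itself:

```python
from dataclasses import dataclass
from enum import Enum

class Skill(Enum):
    RETRIEVAL = "retrieval"    # locate facts that already exist (e.g., on-chain data)
    PREDICTION = "prediction"  # forecast outcomes not yet observable

class Complexity(Enum):
    SIMPLE = "simple"    # single source, direct lookup
    COMPLEX = "complex"  # multi-source synthesis and reasoning

@dataclass(frozen=True)
class Task:
    question: str
    skill: Skill
    complexity: Complexity

    @property
    def quadrant(self) -> str:
        # One of the four quadrants, e.g. "Complex Prediction".
        return f"{self.complexity.value.title()} {self.skill.value.title()}"

# Hypothetical example question in the Complex Prediction quadrant.
task = Task("Will protocol X's TVL exceed $1B by month-end?",
            Skill.PREDICTION, Complexity.COMPLEX)
print(task.quadrant)  # -> "Complex Prediction"
```

Crossing the two axes (skill x complexity) is what lets the benchmark separate foundational data-gathering from advanced forecasting, rather than reporting a single aggregate score.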
Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a clear performance hierarchy and uncovers a critical failure mode. We observe a *retrieval-prediction imbalance*: many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness on tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.
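A retrieval-prediction imbalance of this kind can be quantified by comparing per-quadrant accuracies. The following is a minimal sketch assuming per-task pass/fail grades keyed by quadrant name; the gap measure and the sample numbers are illustrative, not the paper's scoring scheme:

```python
from collections import defaultdict

def quadrant_accuracies(results):
    """results: iterable of (quadrant_name, is_correct) pairs."""
    totals, correct = defaultdict(int), defaultdict(int)
    for quadrant, ok in results:
        totals[quadrant] += 1
        correct[quadrant] += int(ok)
    return {q: correct[q] / totals[q] for q in totals}

def retrieval_prediction_gap(acc):
    # Illustrative imbalance measure: mean retrieval accuracy minus mean
    # prediction accuracy. A large positive gap mirrors the failure mode
    # described above: strong retrieval, weak predictive analysis.
    retrieval = [v for q, v in acc.items() if "Retrieval" in q]
    prediction = [v for q, v in acc.items() if "Prediction" in q]
    return sum(retrieval) / len(retrieval) - sum(prediction) / len(prediction)

# Hypothetical graded results for one model on one monthly question set.
acc = quadrant_accuracies([
    ("Simple Retrieval", True), ("Complex Retrieval", True),
    ("Simple Prediction", False), ("Complex Prediction", False),
])
print(acc, retrieval_prediction_gap(acc))
```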