🤖 AI Summary
How can we rigorously evaluate the multidimensional capabilities of large language model (LLM) agents on open-ended, data-driven scientific research tasks? This paper introduces BLADE, the first benchmark specifically designed for this scenario, comprising 12 real-world scientific datasets and associated research questions. BLADE enables automated, multidimensional evaluation of planning, memory, and code-execution capabilities in LLM agents. To address the core challenges of open-ended analysis, including solution multiplicity, partial correctness, and representation heterogeneity, it integrates expert-annotated ground truth with computationally grounded semantic matching. Technically, BLADE unifies programmatic code execution, statistical semantic parsing, multi-representation alignment, and expert-in-the-loop validation. Experimental results show that current LLM agents predominantly perform only basic analyses; agents that support interactive data exploration exhibit significantly greater analytical diversity, yet still fall substantially short of domain-expert performance.
📝 Abstract
Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However, evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To automatically evaluate agent responses, we developed corresponding computational methods to match different representations of analyses to this ground truth. Though language models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents capable of interacting with the underlying data demonstrate improved, but still non-optimal, diversity in their analytical decision making. Our work enables the evaluation of agents for data-driven science and provides researchers deeper insights into agents' analysis approaches.
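To make the evaluation idea concrete, here is a minimal, hypothetical sketch of matching an agent's analysis to expert ground truth. All names and the decision facets (`variables`, `transforms`, `model`) are illustrative assumptions, not BLADE's actual API: each analysis is reduced to discrete decisions, and the agent is scored against the closest of several valid expert analyses, reflecting solution multiplicity and partial correctness.

```python
def decision_match(agent, expert):
    """Fraction of the expert's decisions that the agent's analysis covers.

    Both arguments are dicts mapping a decision facet to a list of choices.
    Facet names here are hypothetical, not BLADE's actual schema.
    """
    facets = ("variables", "transforms", "model")
    hits = total = 0
    for facet in facets:
        truth = set(expert.get(facet, []))
        guess = set(agent.get(facet, []))
        hits += len(truth & guess)   # partially correct analyses earn partial credit
        total += len(truth)
    return hits / total if total else 0.0


def best_match(agent, experts):
    """Score the agent against the closest expert analysis, since
    multiple independent expert analyses can all be valid."""
    return max(decision_match(agent, e) for e in experts)


# Two (fabricated) expert ground-truth analyses of the same question:
experts = [
    {"variables": ["age", "income"], "transforms": ["log_income"], "model": ["ols"]},
    {"variables": ["age", "income", "region"], "transforms": [], "model": ["mixed_effects"]},
]
agent = {"variables": ["age", "income"], "transforms": ["log_income"], "model": ["ols"]}

print(best_match(agent, experts))  # → 1.0 (exact match with the first expert)
```

A real evaluator would additionally need the semantic-matching step the paper describes, since agents express the same decision in heterogeneous code and prose; this sketch assumes the decisions have already been normalized into a shared vocabulary.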