AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery

2026-05-22
AI Summary
Current evaluations of enterprise data agents focus mainly on task workflows or report generation and do not test whether agents understand the analytical structure underlying the data, such as customer segments, causal drivers, temporal events, and relational patterns. This work proposes AvalancheBench, a benchmark built on a latent-world recovery paradigm: observational data is synthesized from known generative mechanisms, and agents are tested on whether they reconstruct the original analytical structures. Because the ground truth is known, the framework supports fine-grained scoring with partial credit and analysis of how early errors propagate, which real-data benchmarks cannot offer with the same diagnostic control. In a first e-commerce use case, the strongest configuration of a leading coding agent recovered only 26% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.
Abstract
We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through latent world recovery. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than pipeline completion: systems are scored on whether they recover the segments, drivers, temporal events, and relationships that explain the data, not merely on whether they execute a workflow or produce a plausible report. Second, it provides ground truth for goal-driven analytics by generating observations from a known latent world, enabling partial credit for incomplete but valid recoveries. Third, it exposes how early analytical mistakes propagate into later conclusions: missed segments, merged events, or wrong attributions can lead to systematically wrong recommendations. In this sense, AvalancheBench complements real-data benchmarks by providing a controlled setting for diagnosing whether agents recover the analytical structure behind enterprise data. On a first e-commerce use case, the strongest configuration of a leading coding agent recovers only 26% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.
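To make the idea of latent world recovery with partial credit concrete, here is a minimal toy sketch. It is not the AvalancheBench implementation: the `LatentWorld` class, the `generate_orders` and `score_recovery` functions, the segment names, the tolerances, and the equal-weight rubric are all hypothetical and chosen only for illustration. Observations are generated from a known world whose segment labels stay hidden, and an agent's claims are then scored against that ground truth.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentWorld:
    """Hypothetical ground truth: hidden customer segments plus one dated event."""
    segment_means: dict  # segment name -> mean order value (never shown to the agent)
    event_day: int       # day on which a promotion starts
    event_lift: float    # multiplicative lift on order values from event_day onward


def generate_orders(world, n_days=60, orders_per_day=20, seed=0):
    """Synthesize observations; the segment label is deliberately left out of each row."""
    rng = random.Random(seed)
    means = list(world.segment_means.values())
    rows = []
    for day in range(n_days):
        lift = world.event_lift if day >= world.event_day else 1.0
        for _ in range(orders_per_day):
            value = rng.gauss(rng.choice(means) * lift, 5.0)
            rows.append({"day": day, "value": value})
    return rows


def score_recovery(world, claimed_means, claimed_event_days,
                   mean_tol=5.0, day_tol=2):
    """Fraction of rubric items recovered (assumed rubric: one item per
    true segment, plus one item for the event)."""
    items = [
        any(abs(m - true_mean) <= mean_tol for m in claimed_means)
        for true_mean in world.segment_means.values()
    ]
    items.append(any(abs(d - world.event_day) <= day_tol
                     for d in claimed_event_days))
    return sum(items) / len(items)


world = LatentWorld({"bargain": 20.0, "loyal": 50.0, "premium": 120.0},
                    event_day=30, event_lift=1.3)
orders = generate_orders(world)

# An agent that finds two of the three segments and the event gets partial credit.
partial = score_recovery(world, claimed_means=[52.0, 118.0],
                         claimed_event_days=[31])
print(partial)  # 0.75
```

Because the generating world is known, a missed segment costs exactly one rubric item here instead of invalidating the whole answer, which is the kind of partial credit the abstract describes for incomplete but valid recoveries.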
Problem

Research questions and friction points this paper is trying to address.

enterprise data agents
latent world recovery
analytical understanding
benchmark evaluation
goal-driven analytics
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent world recovery
enterprise data agents
analytical understanding
benchmarking
goal-driven analytics