AI Summary
Current evaluations of enterprise data agents predominantly focus on task workflows or report generation and do not adequately assess whether the agents understand the underlying analytical structures, such as customer segments, causal drivers, temporal events, and relational patterns. This work proposes AvalancheBench, a benchmark built on a latent-world recovery evaluation paradigm: it synthesizes observational data from known generative mechanisms and tests whether agents can reconstruct the original analytical structures. The framework enables fine-grained scoring with partial credit and analysis of how early errors propagate, addressing the limited diagnostic control offered by real-data benchmarks. In an e-commerce scenario, the strongest configuration of a leading coding agent recovered only 26% of the rubric, with failures concentrated in overly generic customer segmentations and merged temporal events.
Abstract
We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through latent world recovery. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than pipeline completion: systems are scored on whether they recover the segments, drivers, temporal events, and relationships that explain the data, not merely on whether they execute a workflow or produce a plausible report. Second, it provides ground truth for goal-driven analytics by generating observations from a known latent world, enabling partial credit for incomplete but valid recoveries. Third, it exposes how early analytical mistakes propagate into later conclusions: missed segments, merged events, or wrong attributions can lead to systematically wrong recommendations. In this sense, AvalancheBench complements real-data benchmarks by providing a controlled setting for diagnosing whether agents recover the analytical structure behind enterprise data. On a first e-commerce use case, the strongest configuration of a leading coding agent recovers only 26% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.
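The idea of latent world recovery with partial credit can be illustrated with a toy sketch. This is not AvalancheBench's actual generator or rubric: the world definition, the `generate_orders` and `recovery_score` helpers, the segment and event names, and the exact-name matching rule are all hypothetical, chosen only to show how a known latent world yields both observations and a ground-truth rubric.

```python
import random

# Hypothetical latent world: two customer segments and one temporal event.
WORLD = {
    "segments": {"bargain_hunters": 20.0, "loyal_premium": 80.0},  # mean order value
    "events": {"promo_week": (10, 17)},  # day range [start, end)
}

def generate_orders(world, n_days=30, per_day=5, seed=0):
    """Synthesize observations; the segment label is deliberately hidden."""
    rng = random.Random(seed)
    segments = sorted(world["segments"].items())
    rows = []
    for day in range(n_days):
        in_event = any(start <= day < end for start, end in world["events"].values())
        for _ in range(per_day):
            _, mean = rng.choice(segments)
            # Assumed effect: orders are discounted by 20% during an event.
            value = rng.gauss(mean, 5.0) * (0.8 if in_event else 1.0)
            rows.append({"day": day, "value": round(value, 2)})
    return rows

def rubric(world):
    """Ground-truth items an agent is expected to recover."""
    items = {("segment", name) for name in world["segments"]}
    items |= {("event", name) for name in world["events"]}
    return items

def recovery_score(world, claims):
    """Fraction of rubric items present in the agent's claims (partial credit)."""
    truth = rubric(world)
    return len(truth & set(claims)) / len(truth)

orders = generate_orders(WORLD)
agent_claims = {("segment", "loyal_premium"), ("event", "holiday")}
score = recovery_score(WORLD, agent_claims)
```

In this toy setting the agent recovers one of the three rubric items, so it earns a score of one third rather than zero. A real rubric would need a more forgiving matching rule than exact names, since an agent may describe a valid segment or event in different words.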