CMT-Bench: Cricket Multi-Table Generation Benchmark for Probing Robustness in Large Language Models

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient robustness of large language models (LLMs) in dynamic text-to-table generation, particularly their vulnerability in extracting critical information from temporally evolving narratives. To this end, we introduce the first benchmark for dynamic table generation grounded in live cricket commentary, and propose a three-dimensional semantic preservation framework for robustness evaluation—explicitly decoupling reliance on extraction shortcuts from state-tracking capability. Leveraging controlled perturbations—including cue ablation, temporal prefix extension, and entity surface-form perturbation—we systematically expose core weaknesses: sharp performance degradation when summary cues are absent, monotonic deterioration under long contexts, and high sensitivity to entity morphological variations. Experiments reveal substantial reasoning drift across mainstream LLMs, empirically validating the framework’s efficacy in diagnosing genuine reasoning capabilities rather than superficial pattern matching.

📝 Abstract
LLM-driven text-to-table (T2T) systems often rely on extensive prompt engineering or iterative event extraction in code-parsable formats, which boost scores but are computationally expensive and obscure how models actually reason over temporally evolving narratives to summarise key information. We present CMT-Bench, a diagnostic benchmark built from live cricket commentary that requires dynamic table generation across two evolving schemas under a dense, rule-governed policy. CMT-Bench is designed to probe robustness via three semantics-preserving dimensions: (i) extractive-cue ablation to separate extractive shortcuts from state tracking, (ii) temporal prefixing to test long-context stability, and (iii) entity-form perturbations (anonymization, out-of-distribution substitutions, role-entangling paraphrases) to assess sensitivity to surface variation. Across diverse long-context state-of-the-art LLMs, we find large drops without extractive summaries, monotonic degradation with input length, and consistent accuracy drops under entity-form changes. Complementary distributional tests confirm significant shifts in numeric error patterns, indicating drift in reasoning rather than mere noise. Our results show that current LLMs are brittle in dynamic text-to-table generation, motivating robustness-first evaluation as a prerequisite for developing efficient and scalable approaches for this task.
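The three perturbation dimensions can be illustrated with a minimal sketch. This is a hypothetical rendering, not the benchmark's actual implementation: the function names, the cue-detection regex, and the entity mapping are all assumptions made for illustration.

```python
import re

def ablate_extractive_cues(commentary: list[str]) -> list[str]:
    """Dimension (i): drop lines that read like ready-made score
    summaries, forcing the model to track state from ball-by-ball
    events rather than copy an extractive shortcut."""
    summary_pattern = re.compile(r"\b(score|total|after \d+ overs)\b",
                                 re.IGNORECASE)
    return [line for line in commentary if not summary_pattern.search(line)]

def extend_temporal_prefix(commentary: list[str],
                           earlier: list[str]) -> list[str]:
    """Dimension (ii): prepend earlier match commentary to lengthen
    the context while leaving the target-table content unchanged."""
    return earlier + commentary

def perturb_entity_forms(commentary: list[str],
                         mapping: dict[str, str]) -> list[str]:
    """Dimension (iii): swap player names for anonymized or
    out-of-distribution surface forms, preserving the events."""
    out = []
    for line in commentary:
        for name, alias in mapping.items():
            line = line.replace(name, alias)
        out.append(line)
    return out
```

Because all three transformations preserve the information needed to fill the tables, any accuracy drop under them can be attributed to brittleness in the model's reasoning rather than to missing evidence.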
Problem

Research questions and friction points this paper is trying to address.

Probing robustness in dynamic multi-table generation from evolving narratives
Testing LLM sensitivity to extractive cues and entity-form perturbations
Evaluating long-context stability in text-to-table systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic table generation across evolving schemas
Probing robustness via semantics-preserving dimensional tests
Diagnosing reasoning brittleness through distributional error analysis
Ritam Upadhyay
School of Computing and Augmented Intelligence, Arizona State University
Naman Ahuja
School of Computing and Augmented Intelligence, Arizona State University
Rishabh Baral
School of Computing and Augmented Intelligence, Arizona State University
Aparna Garimella
Adobe Inc
Natural Language Processing · Computational Social Science
Vivek Gupta
Assistant Professor of Computer Science, Arizona State University
Artificial Intelligence · Natural Language Processing · Large Language Models · Information Retrieval