🤖 AI Summary
This study investigates how input structure affects the factual accuracy of live sports commentary summarization generated by large language models (LLMs), with emphasis on hallucination and factual error suppression in high-precision settings.
Method: Leveraging structured NBA game-by-game data, we systematically compare row-based, JSON, and unstructured text input formats on Llama-3.1-70B and Qwen2.5-72B. Factuality is evaluated via human annotation and repeated-measures ANOVA with Tukey HSD post-hoc tests to quantify error rates.
Contribution/Results: JSON formatting significantly improves factual consistency, reducing error rates by 69% and 65% on the two models, respectively. Input structure accounts for over 80% of the variance in factual errors. This work provides the first quantitative evidence that input structure is the dominant factor governing LLM factual accuracy—establishing a critical engineering principle for high-reliability generative applications.
📝 Abstract
A major concern when deploying LLMs in accuracy-critical domains such as sports reporting is that the generated text may not faithfully reflect the input data. We quantify how input structure affects hallucinations and other factual errors in LLM-generated summaries of NBA play-by-play data, across three formats: row-structured, JSON and unstructured. We manually annotated 3,312 factual errors across 180 game summaries produced by two models, Llama-3.1-70B and Qwen2.5-72B. Input structure has a strong effect: JSON input reduces error rates by 69% for Llama and 65% for Qwen compared to unstructured input, while row-structured input reduces errors by 54% for Llama and 51% for Qwen. A two-way repeated measures ANOVA shows that input structure accounts for over 80% of the variance in error rates, with Tukey HSD post hoc tests confirming statistically significant differences between all input formats.