Time to Revisit Exact Match

📅 2025-09-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Exact match (EM), the predominant evaluation metric in temporal question answering (TQA), does not quantify error magnitude for date- and duration-based answers, rendering it insensitive to temporally meaningful discrepancies. Method: The paper reframes TQA as a numerical estimation task and introduces TempAnswerQA, a unified numerical benchmark distilled from Test of Time and TempTabQA in which every question requires a numerical, temporal answer. EM is complemented with the scale-invariant forecasting metrics sMAPE and MASE. Contribution/Results: Error size and EM turn out to be decoupled: some models show large sMAPE errors despite high EM, and the models' most frequent mistake is deviating from the ground truth by only ±1. Scaling errors by the deviation of the ground truth with MASE reshuffles model rankings relative to EM, exposing gaps in models' temporal domain knowledge, especially when trained on synthetic data. The work demonstrates EM's masking effect on temporal reasoning flaws and argues for more granular, numerically grounded evaluation metrics for TQA.

📝 Abstract
Temporal question answering is an established method for evaluating temporal reasoning in large language models. Expected answers are often numeric (e.g., dates or durations), yet model responses are evaluated like regular text with exact match (EM), unable to distinguish small from large errors. In this investigative work, we frame temporal question answering as a numerical estimation task to assess the shortcomings of EM. We introduce TempAnswerQA, a benchmark distilled from Test of Time and TempTabQA, where all questions require a numerical, temporal answer, allowing us to evaluate models beyond EM. We use the forecasting metrics symmetric mean absolute percentage error (sMAPE) and mean absolute scaled error (MASE). With sMAPE, we find that error size and EM are decoupled. Models with low EM still have low sMAPE (both ~20%), and some models have high sMAPE despite high EM. Scaling errors by the deviation of the ground truth data with MASE reshuffles model rankings compared to EM, revealing gaps in models' understanding of temporal domain knowledge, especially when trained with synthetic data. Lastly, the models' most frequent error is to deviate by only ±1 from the ground truth. sMAPE and MASE, unlike EM, adequately weight these errors. Our findings underscore the need for specialised metrics for temporal QA tasks. Code and data are available on https://github.com/aauss/temporal-answer-qa.
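A minimal sketch of the two metrics the abstract names, applied to numeric answers rather than time series. The sMAPE formula below is the standard symmetric form; for MASE, whose usual scaling term is a naïve time-series forecast, this sketch scales by the mean absolute deviation of the ground truth, following the abstract's description of "scaling errors by the deviation of the ground truth data" (the paper's exact formulation may differ — see the linked repository):

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent.

    Assumes strictly positive answers (dates/durations), so the
    denominator never vanishes.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(
        2.0 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred))
    )

def mase(y_true, y_pred):
    """MAE scaled by the mean absolute deviation of the ground truth."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_pred - y_true))
    scale = np.mean(np.abs(y_true - np.mean(y_true)))
    return mae / scale
```

An off-by-one answer such as predicting 2000 for a gold answer of 1999 scores 0 under EM, but both metrics register it as a small error, which is exactly the distinction the paper is after.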
Problem

Research questions and friction points this paper is trying to address.

Evaluating temporal reasoning in LLMs using exact match metrics
Assessing shortcomings of exact match for numerical temporal answers
Developing specialized metrics for temporal question answering tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing TempAnswerQA benchmark for temporal QA
Using sMAPE and MASE forecasting metrics
Evaluating numerical temporal answers beyond exact match
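To make the re-ranking claim concrete, a toy comparison with invented numbers (not the paper's results): a model that is exactly right most of the time but occasionally wildly wrong can beat an always-off-by-one model under EM while losing under MASE.

```python
import numpy as np

# Hypothetical gold answers (years) and two hypothetical models.
gold    = np.array([1990, 1995, 2000, 2005, 2010], dtype=float)
model_a = np.array([1990, 1995, 2000, 2005, 1960], dtype=float)  # 4 exact hits, one 50-year miss
model_b = np.array([1991, 1994, 2001, 2004, 2011], dtype=float)  # always off by one

def em(y_true, y_pred):
    """Exact-match accuracy."""
    return float(np.mean(y_true == y_pred))

def mase(y_true, y_pred):
    """MAE scaled by the mean absolute deviation of the ground truth."""
    mae = np.mean(np.abs(y_pred - y_true))
    scale = np.mean(np.abs(y_true - np.mean(y_true)))
    return mae / scale

# EM ranks A above B (0.8 vs 0.0); MASE ranks B above A,
# because A's single 50-year miss dominates its scaled error.
```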