🤖 AI Summary
This work addresses overconfidence in large language models, specifically RoBERTa, when applied to entity matching. It empirically compares baseline RoBERTa confidences against confidences calibrated with temperature scaling, Monte Carlo Dropout, and model ensembling on four standard entity matching benchmarks: Abt-Buy, DBLP-ACM, iTunes-Amazon, and Company. The modified RoBERTa baseline is slightly overconfident, with Expected Calibration Error (ECE) between 0.0043 and 0.0552 depending on the dataset. Temperature scaling mitigates this overconfidence, reducing ECE by up to 23.83%. The study provides an empirical reference point for calibrating LLM-based entity matchers before they are deployed in downstream applications.
📝 Abstract
This research explores the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study comparing baseline RoBERTa confidences on an Entity Matching task against confidences calibrated using Temperature Scaling, Monte Carlo Dropout, and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon, and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, which reduces Expected Calibration Error scores by up to 23.83%.
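
For readers unfamiliar with the two quantities named in the abstract, the following is a minimal NumPy sketch of how Expected Calibration Error and Temperature Scaling are commonly computed. It is not the authors' implementation: the function names, the choice of 10 equal-width confidence bins, and the grid search over temperatures are all assumptions made for illustration (temperature is often fit by gradient-based optimization instead).

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a scalar temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - confidence| over confidence bins.

    n_bins=10 is an illustrative assumption, not a value taken from the paper.
    """
    confidences = probs.max(axis=1)      # confidence of the predicted class
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by the fraction of samples in the bin
    return ece


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature that minimizes negative log-likelihood on held-out data.

    A plain grid search is used here for simplicity (an assumption of this sketch).
    """
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)

    def nll(t):
        p = softmax(logits, t)
        return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

    return float(min(grid, key=nll))
```

In use, the temperature would be fit on validation logits and then applied to test logits via `softmax(test_logits, T)` before computing ECE. A fitted temperature above 1 softens the probabilities, which is the direction needed to correct an overconfident model.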