🤖 AI Summary
This work addresses overconfidence in large language models applied to telecommunications, which makes their outputs hard to trust. The authors propose Twin-Pass Chain-of-Thought (CoT) Ensembling, which calibrates model self-assessment by running multiple independent reasoning evaluations and aggregating their assessments into a single confidence score. Evaluated on the Gemma-3 model family (4B, 12B, and 27B) across the TeleQnA, ORANBench, and srsRANBench benchmarks, the approach reduces Expected Calibration Error (ECE) by up to 88%. The resulting confidence scores track actual accuracy more closely, which supports more trustworthy use of these models in telecommunications scenarios.
📝 Abstract
Large Language Models (LLMs) are increasingly applied to complex telecommunications tasks, including 3GPP specification analysis and O-RAN network troubleshooting. However, a critical limitation remains: LLM-generated confidence scores are often biased and unreliable, frequently exhibiting systematic overconfidence. This lack of trustworthy self-assessment makes it difficult to verify model outputs and safely rely on them in practice. In this paper, we study confidence calibration in telecom-domain LLMs using the representative Gemma-3 model family (4B, 12B, and 27B parameters), evaluated on TeleQnA, ORANBench, and srsRANBench. We show that standard single-pass, verbalized confidence estimates fail to reflect true correctness, often assigning high confidence to incorrect predictions. To address this, we propose a novel Twin-Pass Chain of Thought (CoT)-Ensembling methodology for improving confidence estimation by leveraging multiple independent reasoning evaluations and aggregating their assessments into a calibrated confidence score. Our approach reduces Expected Calibration Error (ECE) by up to 88% across benchmarks, significantly improving the reliability of model self-assessment. These results highlight the limitations of current confidence estimation practices and demonstrate a practical path toward more trustworthy evaluation of LLM outputs in telecommunications.
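For readers unfamiliar with the metric, the following is a minimal sketch of how Expected Calibration Error is commonly computed: predictions are grouped into equal-width confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The abstract does not state the paper's binning scheme or bin count, so the function name `expected_calibration_error`, the default of 10 bins, and the sample values are assumptions for illustration only, not the authors' implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (a common formulation; the paper's exact setup is not given).

    confidences: floats in [0, 1], one per prediction.
    correct:     truthy/falsy flags, one per prediction.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin [k/n_bins, (k+1)/n_bins);
    # a confidence of exactly 1.0 goes into the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(is_correct)))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight the confidence/accuracy gap by the share of samples in the bin.
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Made-up illustration: high stated confidence, only half the answers right.
    overconfident = expected_calibration_error([0.95] * 4, [1, 1, 0, 0])
    # Made-up illustration: stated confidence matches the observed accuracy.
    calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
    print(f"overconfident ECE: {overconfident:.2f}")
    print(f"calibrated ECE:    {calibrated:.2f}")
```

A lower value means stated confidence is closer to observed accuracy. The overconfident case in the demo corresponds to the failure mode the abstract describes, where high confidence is assigned to incorrect predictions.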