EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs

📅 2026-01-11
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tendency of current large language models to over-reinforce successful reasoning paths during self-training, which leads to overconfidence and degraded calibration, particularly an impaired ability to express uncertainty. To mitigate this, the authors propose epistemically-calibrated reasoning (EpiCaR), a framework that reframes reasoning training as an epistemic learning problem. EpiCaR jointly optimizes reasoning accuracy and calibration through explicit self-evaluation signals within an iterative supervised fine-tuning process. Evaluated on the Llama-3 and Qwen-3 model families, EpiCaR achieves Pareto-superior accuracy-calibration trade-offs over standard baselines, generalizes to out-of-distribution mathematical reasoning (GSM8K) and code generation (MBPP), and cuts inference compute threefold, matching STaR's K=30 performance with only K=10 samples.
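The paper does not publish its exact loss, so the following is only a minimal sketch of the general idea behind jointly optimizing accuracy and calibration: alongside the usual task loss, a self-reported confidence is penalized (Brier-style) for disagreeing with whether the reasoning trace was actually correct. The function name, the penalty form, and the weight `lam` are illustrative assumptions, not EpiCaR's actual objective.

```python
def epicar_style_loss(task_nll, confidence, is_correct, lam=0.5):
    """Toy joint objective (illustrative, not the paper's formula):
    task negative log-likelihood plus a Brier-style calibration penalty
    that pulls the model's self-reported confidence toward the empirical
    correctness (1.0 or 0.0) of its reasoning trace."""
    calibration_penalty = (confidence - (1.0 if is_correct else 0.0)) ** 2
    return task_nll + lam * calibration_penalty

# An overconfident wrong answer is penalized more than a hedged one,
# which is the behavior such a term would encourage during self-training.
overconfident = epicar_style_loss(task_nll=1.2, confidence=0.95, is_correct=False)
hedged = epicar_style_loss(task_nll=1.2, confidence=0.40, is_correct=False)
```

Under this toy penalty, a model that says "95% sure" and is wrong pays more than one that hedges at 40%, so expressing uncertainty is no longer trained away.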

📝 Abstract
Improving the reasoning abilities of large language models (LLMs) has largely relied on iterative self-training with model-generated data. While effective at boosting accuracy, existing approaches primarily reinforce successful reasoning paths, incurring a substantial calibration cost: models become overconfident and lose the ability to represent uncertainty. This failure has been characterized as a form of model collapse in alignment, where predictive distributions degenerate toward low-variance point estimates. We address this issue by reframing reasoning training as an epistemic learning problem, in which models must learn not only how to reason, but also when their reasoning should be trusted. We propose epistemically-calibrated reasoning (EpiCaR) as a training objective that jointly optimizes reasoning performance and calibration, and instantiate it within an iterative supervised fine-tuning framework using explicit self-evaluation signals. Experiments on Llama-3 and Qwen-3 families demonstrate that our approach achieves Pareto-superiority over standard baselines in both accuracy and calibration, particularly in models with sufficient reasoning capacity (e.g., 3B+). This framework generalizes effectively to OOD mathematical reasoning (GSM8K) and code generation (MBPP). Ultimately, our approach enables a 3X reduction in inference compute, matching the K=30 performance of STaR with only K=10 samples in capable models.
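The K=10 vs. K=30 claim refers to sampling K reasoning traces per question and aggregating their final answers; the abstract's 3X compute reduction comes from a calibrated model needing fewer samples to reach the same accuracy. A minimal sketch of self-consistency-style majority-vote aggregation (the aggregation rule here is an assumption; the paper may aggregate differently):

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate K sampled final answers by taking the most frequent one
    (self-consistency style). `answers` is a list of answer strings, one
    per sampled reasoning trace."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical traces: a well-calibrated model concentrates on the right
# answer quickly, so K=10 can match a K=30 budget, a 3x inference saving.
k10_answers = ["42", "42", "41", "42", "42", "40", "42", "42", "42", "41"]
print(majority_vote(k10_answers))  # "42"
```

The smaller K is viable precisely because calibration is preserved: when the model's samples reflect its true uncertainty, a small vote is already informative.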
Problem

Research questions and friction points this paper is trying to address.

reasoning
calibration
uncertainty
model collapse
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

epistemic calibration
reasoning uncertainty
self-evaluation
iterative fine-tuning
model collapse
J. Yeom
Graduate School of Data Science, Seoul National University
Jaewon Sok
Department of Rural Systems Engineering, Seoul National University
Seonghyeon Park
Department of Aerospace Engineering, Seoul National University
Jeongjae Park
Graduate School of Data Science, Seoul National University
Taesup Kim
Assistant Professor, Seoul National University
Representation Learning
Transfer Learning
AI
Machine Learning
Deep Learning