🤖 AI Summary
Large language models (LLMs) rely on chain-of-thought (CoT) prompting for mathematical reasoning but frequently produce high-confidence incorrect answers, posing significant risks in safety-critical domains such as education. To address this, we are the first to apply Signal Temporal Logic (STL) to CoT confidence modeling: we treat stepwise reasoning confidence as a temporal signal and formally encode structural constraints, including smoothness, monotonicity, and causal consistency, to enable structured, interpretable uncertainty quantification. We propose an STL-based robustness scoring mechanism and an uncertainty recalibration strategy that substantially improve confidence calibration. Evaluated on multiple mathematical reasoning benchmarks, our approach outperforms conventional ensemble and post-hoc calibration methods, delivering more reliable and verifiable uncertainty estimates. This work establishes a new paradigm for trustworthy AI reasoning through logically grounded, temporally aware confidence modeling.
📝 Abstract
Large Language Models (LLMs) have shown impressive performance in mathematical reasoning tasks when guided by Chain-of-Thought (CoT) prompting. However, they tend to produce highly confident yet incorrect outputs, which poses significant risks in domains like education, where users may lack the expertise to assess reasoning steps. To address this, we propose a structured framework that models stepwise confidence as a temporal signal and evaluates it using Signal Temporal Logic (STL). In particular, we define formal STL-based constraints to capture desirable temporal properties and compute robustness scores that serve as structured, interpretable confidence estimates. Our approach also introduces a set of uncertainty reshaping strategies to enforce smoothness, monotonicity, and causal consistency across the reasoning trajectory. Experiments show that our approach consistently improves calibration metrics and provides more reliable uncertainty estimates than conventional confidence aggregation and post-hoc calibration.
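To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of what an STL robustness score over a stepwise confidence trace could look like. It checks a smoothness property of the form G(|c[t+1] − c[t]| ≤ δ): the score is the smallest slack across the trace, positive when every adjacent confidence jump stays within the bound and negative otherwise. The function name and the threshold `delta` are illustrative assumptions, not values from the paper.

```python
def smoothness_robustness(confidences, delta=0.2):
    """Robustness of the STL formula G(|c[t+1] - c[t]| <= delta).

    `confidences` is the per-step confidence signal of a CoT trace.
    Returns the minimum slack delta - |c[t+1] - c[t]| over all steps:
    > 0 means the whole trace satisfies the smoothness constraint,
    < 0 means at least one step violates it, and the magnitude gives
    an interpretable margin. `delta` is a hypothetical bound.
    """
    jumps = (abs(b - a) for a, b in zip(confidences, confidences[1:]))
    return min(delta - j for j in jumps)


# A gradually decaying trace satisfies the constraint (positive score),
# while a trace with an abrupt confidence drop violates it (negative score).
smooth = smoothness_robustness([0.92, 0.88, 0.85, 0.83])
abrupt = smoothness_robustness([0.92, 0.30, 0.85, 0.83])
```

In the same spirit, other properties such as monotonicity or causal consistency can be expressed as STL formulas over the signal, and their robustness values combined (e.g., by a minimum) into a single structured confidence estimate.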