🤖 AI Summary
Large language models (LLMs) frequently generate reasoning chains (Chain-of-Thought, CoT) containing latent errors and biases, undermining faithfulness and auditability.
Method: We propose Semi-Structured Reasoning Models (SSRMs), which internalize a semi-structured Chain-of-Thought (CoT) format within the model: it emits parsable, Python-inspired reasoning traces whose restricted, task-specific vocabulary names each reasoning step and marks its inputs and outputs. On top of this format, we design an automated auditing framework that combines rule-driven structured audits with learned typicality models over reasoning patterns.
Contribution: SSRMs outperform comparably sized baselines by nearly ten percentage points on in-domain tasks in a ten-benchmark evaluation, remain competitive with domain-specialized models on out-of-domain medical reasoning benchmarks, and their audits flag and localize probable reasoning flaws, substantially improving output reliability, transparency, and verifiability.
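The exact trace vocabulary is not reproduced in this summary; as a rough, hypothetical illustration of the format described above, a semi-structured trace for a simple arithmetic word problem might look like the following (the step names `recall_fact`, `subtract`, `add`, and `final_answer` are assumptions made for illustration, not the authors' actual vocabulary):

```python
# Hypothetical SSRM-style trace (illustrative only; not the paper's actual syntax).
# Each line names a reasoning step from a restricted vocabulary and marks its
# inputs (the call arguments) and its output (the assigned variable).
apples_start  = recall_fact(question, entity="apples", attribute="initial count")   # 23
apples_used   = recall_fact(question, entity="apples", attribute="used for lunch")  # 20
apples_left   = subtract(apples_start, apples_used)                                 # 3
apples_bought = recall_fact(question, entity="apples", attribute="newly bought")    # 6
answer        = add(apples_left, apples_bought)                                     # 9
final_answer(answer)
```

The trace is never executed; the Pythonic form simply makes each step's name, inputs, and output machine-parsable.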
📝 Abstract
As Large Language Models (LLMs) become increasingly capable at reasoning, the problem of "faithfulness" persists: LLM "reasoning traces" can contain errors and omissions that are difficult to detect, and may obscure biases in model outputs. To address these limitations, we introduce Semi-Structured Reasoning Models (SSRMs), which internalize a semi-structured Chain-of-Thought (CoT) reasoning format within the model. Our SSRMs generate reasoning traces in a Pythonic syntax. While SSRM traces are not executable, they adopt a restricted, task-specific vocabulary to name distinct reasoning steps and to mark each step's inputs and outputs. Through extensive evaluation on ten benchmarks, SSRMs demonstrate strong performance and generality: they outperform comparably sized baselines by nearly ten percentage points on in-domain tasks while remaining competitive with specialized models on out-of-domain medical benchmarks. Furthermore, we show that semi-structured reasoning is more amenable to analysis: in particular, traces can be automatically audited to identify reasoning flaws. We explore both hand-crafted structured audits, which detect task-specific problematic reasoning patterns, and learned typicality audits, which apply probabilistic models over reasoning patterns, and show that both audits can be used to effectively flag probable reasoning errors.
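To make the auditing idea concrete, the sketch below shows, assuming a toy trace format like the one illustrated earlier, how a rule-driven structured audit and a simple learned typicality score could be implemented; the parsing rules, checks, and scoring here are our own illustrative choices, not the paper's actual audit procedures.

```python
import ast
from collections import Counter

def parse_steps(trace: str):
    """Parse a Pythonic trace into (output_var, step_name, input_vars) triples."""
    steps = []
    for node in ast.parse(trace).body:
        call = getattr(node, "value", None)
        if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
            continue  # ignore anything that is not a simple step call
        out = node.targets[0].id if isinstance(node, ast.Assign) else None
        ins = [a.id for a in call.args if isinstance(a, ast.Name)]
        steps.append((out, call.func.id, ins))
    return steps

def structured_audit(steps, allowed_names, given=()):
    """Rule-driven audit: flag unknown step names and inputs no earlier step produced."""
    flags, defined = [], set(given)
    for out, name, ins in steps:
        if name not in allowed_names:
            flags.append(f"unknown step '{name}'")
        for v in ins:
            if v not in defined:
                flags.append(f"step '{name}' uses undefined input '{v}'")
        if out:
            defined.add(out)
    return flags

def typicality_score(steps, bigram_counts: Counter, smoothing=1e-3):
    """Learned audit: mean frequency of consecutive step-name pairs in known-good traces."""
    names = [name for _, name, _ in steps]
    total = sum(bigram_counts.values()) or 1
    probs = [(bigram_counts[(a, b)] + smoothing) / total for a, b in zip(names, names[1:])]
    return sum(probs) / max(len(probs), 1)  # low score => atypical reasoning pattern
```

On a trace whose step references a value no earlier step produced, `structured_audit` would return a flag pointing at that exact step, the kind of localized error signal the paper's structured audits are designed to provide; a low `typicality_score` would instead indicate that the overall sequence of step names is unusual relative to known-good traces.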