🤖 AI Summary
This work addresses the limitations of existing post-training methods for large language models, which treat entire reasoning trajectories as monolithic optimization units, conflating generalizable strategies with task-specific execution and thereby compromising generalization and efficiency. To overcome this, the authors propose a two-stage, cognitively aligned post-training framework. First, abstract reasoning strategies are distilled via supervised learning using Chain-of-Meta-Thought (CoMT); then, task execution is refined through confidence-calibrated reinforcement learning (CCRL). This approach explicitly models human-like problem-solving cognition by decoupling meta-strategy learning from instance-level execution. Evaluated across four models and eight benchmarks, the method achieves average in-distribution and out-of-distribution performance gains of 2.19% and 4.63%, respectively, while reducing training time by 65–70% and token consumption by 50%.
📝 Abstract
Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively inspired framework that explicitly mirrors this two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling the acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and eight benchmarks show improvements of 2.19% in-distribution and 4.63% out-of-distribution over standard methods, while reducing training time by 65–70% and token consumption by 50%. These results demonstrate that aligning post-training with human cognitive principles yields not only superior generalization but also enhanced training efficiency.
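To make the idea of "confidence-aware rewards on intermediate steps" concrete, here is a minimal, illustrative sketch. The abstract does not specify CCRL's exact reward function, so the functional form below (reward scaled by confidence, with an extra penalty for confident-but-wrong steps) and all names (`step_reward`, `trajectory_reward`, the `penalty` weight) are assumptions, not the authors' implementation.

```python
def step_reward(confidence: float, correct: bool, penalty: float = 2.0) -> float:
    """Hypothetical confidence-calibrated reward for one intermediate step.

    A correct step earns its stated confidence; an incorrect step is
    penalized in proportion to that confidence, so overconfident errors
    cost the most -- the kind of error CCRL aims to stop from cascading.
    """
    if correct:
        return confidence
    return -penalty * confidence


def trajectory_reward(steps: list[tuple[float, bool]]) -> float:
    """Sum the per-step calibrated rewards over a reasoning trajectory."""
    return sum(step_reward(conf, ok) for conf, ok in steps)
```

Under this toy scheme, a trajectory with a confident correct step (0.9, True) and a confident wrong step (0.8, False) nets 0.9 − 1.6 = −0.7, whereas the same wrong step made with low confidence (0.1) nets 0.9 − 0.2 = 0.7: the policy is steered toward hedging on steps it cannot verify rather than committing to overconfident errors.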