Thinking About Thinking: Evaluating Reasoning in Post-Trained Language Models

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates three critical properties of post-trained language models: (1) self-awareness of their own decision-making strategies, (2) cross-domain generalization of learned policies, and (3) alignment between internal reasoning trajectories and final outputs. We propose a three-dimensional evaluation framework and conduct systematic comparisons of supervised fine-tuning (SFT), direct preference optimization (DPO), and group-relative policy optimization (GRPO) models on diverse multi-task benchmarks. Results show that the preference- and reinforcement-learning-based methods, DPO and GRPO, significantly outperform SFT in strategy awareness and cross-task transfer. However, these models exhibit weak alignment between reasoning paths and final outputs, with GRPO showing the most severe inconsistency. To our knowledge, this is the first study to systematically expose the "strong behavior, weak reasoning" tension inherent in current RL-based post-training paradigms. Our findings provide empirical grounding for interpretable AI and trustworthy reasoning modeling, and highlight concrete directions for improving reasoning fidelity in policy-optimized language models.

📝 Abstract
Recent advances in post-training techniques have endowed Large Language Models (LLMs) with enhanced capabilities for tackling complex, logic-intensive tasks through the generation of supplementary planning tokens. This development raises a fundamental question: Are these models aware of what they "learn" and "think"? To address this, we define three core competencies: (1) awareness of learned latent policies, (2) generalization of these policies across domains, and (3) alignment between internal reasoning traces and final outputs. We empirically evaluate these abilities on several tasks, each designed to require learning a distinct policy. Furthermore, we contrast the profiles of models post-trained via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Our findings indicate that RL-trained models demonstrate greater awareness of their learned behaviors and stronger generalization to novel, structurally similar tasks than SFT models, yet often exhibit weak alignment between their reasoning traces and final outputs, an effect most pronounced in GRPO-trained models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning awareness in post-trained language models
Assessing generalization of learned policies across domains
Measuring alignment between internal reasoning and final outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

A three-dimensional framework covering strategy awareness, cross-domain policy generalization, and reasoning-output alignment
Systematic comparison of SFT-, DPO-, and GRPO-post-trained models on multi-task benchmarks
First systematic evidence of the "strong behavior, weak reasoning" tension in RL-based post-training