In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor generalization of supervised fine-tuning and the high computational cost and credit-assignment difficulty of reinforcement learning in chain-of-thought (CoT) reasoning, this paper proposes InTRO (In-Token Rationality Optimization), a framework that performs token-level exploration and self-feedback within a single forward pass. InTRO introduces token-level importance weighting via correction factors that quantify the information discrepancy between the generation policy and its answer-conditioned counterpart, steering next-token selection without reinforcement learning. Evaluated on six mathematical reasoning benchmarks, InTRO raises solution accuracy by up to 20% relative to the base model, yields more concise and faithful reasoning paths, and transfers to out-of-domain reasoning tasks, demonstrating robust generalization.

📝 Abstract
Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single "golden" rationale hurts generalization because it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors, token-wise importance weights estimated from the information discrepancy between the generative policy and its answer-conditioned counterpart, for informative next-token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks beyond mathematics and demonstrating robust generalization.
Problem

Research questions and friction points this paper is trying to address.

Optimizing chain-of-thought reasoning in LLMs without penalizing valid alternatives
Addressing credit assignment and computational cost issues in reinforcement learning approaches
Enabling accurate and concise reasoning through token-level exploration and self-feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

InTRO framework enables token-level exploration and self-feedback
Uses correction factors for informative next token selection
Performs exploration and self-feedback in single forward pass
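The correction-factor idea described above can be sketched as a token-wise likelihood ratio between the answer-conditioned policy and the generation policy. This is an illustrative simplification, not the paper's implementation; the function name, weighting form, and toy numbers are assumptions for exposition:

```python
import math

def correction_factors(gen_logprobs, cond_logprobs):
    """Token-wise importance weights w_t = exp(log q_t - log p_t).

    gen_logprobs:  log-probs of each sampled token under the generation policy p.
    cond_logprobs: log-probs of the same tokens under a hypothetical
                   answer-conditioned policy q (i.e., conditioned on the
                   known final answer).
    Tokens the answer-conditioned policy favors get weights > 1; tokens it
    disfavors get weights < 1, providing self-generated token-level feedback.
    """
    return [math.exp(c - g) for g, c in zip(gen_logprobs, cond_logprobs)]

# Toy example with made-up probabilities for three reasoning tokens.
gen = [math.log(0.50), math.log(0.20), math.log(0.10)]
cond = [math.log(0.60), math.log(0.10), math.log(0.30)]
weights = correction_factors(gen, cond)
# → [1.2, 0.5, 3.0]: the third token is strongly up-weighted because the
# answer-conditioned policy finds it far more informative than p did.
```

Because both log-prob vectors come from the same model (once unconditioned, once conditioned on the answer), the weights can be computed alongside generation without a separate reward model or rollout.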
Mingye Zhu
University of Science and Technology of China, Hefei, China
Yi Liu
State Key Laboratory of Communication Content Cognition, People’s Daily Online, Beijing, China
Zheren Fu
University of Science and Technology of China
Multi-modal Learning, Vision-Language Model, AI Security
Quan Wang
Beijing University of Posts and Telecommunications, Beijing, China
Yongdong Zhang
University of Science and Technology of China, Hefei, China