CoRT: Code-integrated Reasoning within Thinking

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (LRMs) suffer from low accuracy and poor computational efficiency in complex mathematical reasoning, primarily because their natural-language chain-of-thought cannot efficiently invoke external computational tools. To address this, the paper proposes CoRT, a post-training framework tailored for LRM–code interpreter (CI) collaboration. The method comprises: (1) Hint-Engineering, a data synthesis technique that strategically inserts hints at appropriate positions in the reasoning trace to optimize LRM–CI interaction; and (2) a multi-stage post-training pipeline combining supervised fine-tuning, rejection fine-tuning, and reinforcement learning, applied to models from 1.5B to 32B parameters. Evaluated on five challenging mathematical reasoning benchmarks, the approach yields absolute accuracy gains of 4% for the 32B model and 8% for the 1.5B model, while reducing generated tokens by about 30% and 50%, respectively. These results demonstrate substantial gains in both reasoning accuracy and inference efficiency.

📝 Abstract
Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at https://github.com/ChengpengLi1003/CoRT.
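The LRM–CI interaction the abstract describes can be pictured as a loop: the model generates reasoning until it emits a code block, the interpreter executes that block, and the output is appended to the context so subsequent reasoning can use it. The sketch below is a minimal illustration of this generic code-integrated reasoning pattern, not the paper's actual implementation; the fence-delimited code convention and the `generate` callback (a stand-in for the model's decode call) are assumptions for the example.

```python
import re
import io
import contextlib


def run_code(snippet: str) -> str:
    """Execute a Python snippet and capture its stdout (toy interpreter)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(snippet, {})
    return buf.getvalue().strip()


def code_integrated_reasoning(generate, prompt: str, max_rounds: int = 5) -> str:
    """Alternate between model generation and code execution.

    `generate` is a hypothetical callback for the LRM's decode step; it is
    expected to stop either after a closing ``` fence or at the end of the
    final answer.
    """
    transcript = prompt
    for _ in range(max_rounds):
        chunk = generate(transcript)
        transcript += chunk
        match = re.search(r"```python\n(.*?)```", chunk, re.DOTALL)
        if not match:
            break  # no code block emitted: the reasoning is finished
        output = run_code(match.group(1))
        # Feed the interpreter output back so later reasoning can use it
        transcript += f"\n```output\n{output}\n```\n"
    return transcript
```

In this sketch the interpreter output is wrapped in an ```output``` fence, mirroring the common convention in tool-integrated reasoning pipelines; a hint-engineering step would additionally insert guidance (e.g. "verify this with code") at chosen positions in the transcript before resuming generation.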
Problem

Research questions and friction points this paper is trying to address.

Improving the accuracy and efficiency of Large Reasoning Models on complex mathematical operations
Integrating Code Interpreters effectively and efficiently with LRMs
Overcoming data scarcity for training code-integrated reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-training framework for LRM-CI integration
Hint-Engineering synthesizes code-integrated reasoning data by inserting hints at strategic positions
Combines supervised fine-tuning, rejection fine-tuning, and reinforcement learning